CFPB Open Tech (https://agency-x.github.io/)
Learn about the design and development work the Consumer Financial Protection Bureau undertakes to support its mission.

Clojure developers: Work with us on the Public Data Platform
Marc Esher, 19 Jul 2014 (/articles/seeking-clojure-developers-to-work-on-qu)

The CFPB is now recruiting talented, motivated people for positions in design, cybersecurity, data, and development. I’m Marc, a back-end developer in the first round of Fellows, and our dev team is looking for a few good Clojure developers in the second round of the Fellowship to help build technology infrastructure to support the Bureau’s work.

With the support of the tech fellows, this year the Bureau launched an easy-to-use online tool that enables consumers to explore public information about the mortgage market. This set of web-based JavaScript applications runs on top of an API we built using our own Clojure-powered tool.

The platform launched in January 2014 and serves Home Mortgage Disclosure Act (HMDA) data from 2007–2012, representing roughly 113 million rows of de-identified mortgage loan application records. Mortgage lenders have collected HMDA data since 1975, but the data was only available online in raw form, limiting its usefulness. The new platform allows users to filter and aggregate the data in ways previously unavailable because of its size and complexity, making it easier for the public to access, navigate, and analyze.
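To give a flavor of what that filtering and aggregation looks like, here is a minimal sketch of a query against a Qu-backed API. The endpoint path, query parameters, and field names below are illustrative assumptions in the spirit of Qu’s Socrata-style query syntax, not a transcript of the production API.

# Hypothetical sketch: requesting aggregated HMDA records from a Qu-backed API.
# The URL, parameters, and field names here are illustrative assumptions.
import requests

BASE = "https://api.consumerfinance.gov/data/hmda"  # assumed base URL

params = {
    "$select": "state_abbr, COUNT()",  # count applications per state
    "$where": "action_taken = 1",      # e.g., originated loans only
    "$group": "state_abbr",
    "$limit": 10,
}

resp = requests.get(BASE + "/slice/hmda_lar.json", params=params, timeout=30)
resp.raise_for_status()
print(resp.json())  # response shape depends on the API; printed unprocessed here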

We continue to make improvements to the platform, and we’re excited about its potential to make information easier for the public to access and use. That’s where you come in.

Qu (https://github.com/cfpb/qu) is our open source software, written in Clojure, and it is the heart of the Public Data Platform. To continue our important work, we need talented Clojure developers.

Some of our high-priority tasks are:

  • Pluggable backends. Right now, the platform runs on MongoDB, but we think it should run on multiple databases, including RDBMS.
  • Improved, simplified data loading.
  • Admin dashboard for easier insight into platform behavior.
  • Improved automation, such as Docker containerization, for a much better “getting started” experience.
  • Anything that makes it easier to put open, publicly-available data in the hands of the public to promote our consumer financial protection mission.

As a Technology and Innovation Fellow, you can work from anywhere in the U.S., including on-site at our headquarters in Washington, D.C. We put a premium on communication and collaboration, and we’ve learned a lot about successful remote work. You’ll join a team of creative, dedicated developers and designers scattered around the country, and we’ll help you stay connected.

If this mission appeals to you, and if these software challenges excite you, please apply at http://www.consumerfinance.gov/jobs/technology-innovation-fellows/. If you’re not a Clojure developer, but all of this sounds fantastic, take heart: we’re also hiring talented Python and JavaScript developers, graphic designers, UX specialists, and more, and we encourage you to apply.

Now recruiting: Technology & Innovation Fellows for 2015
Ashwin Vasan, 23 Jun 2014 (/articles/now-recruiting-2015-fellows)

We’re excited to announce that applications are now being accepted for the next round of CFPB Technology & Innovation Fellows. The fellowship is a two-year program for software developers, graphic and user experience (UX) designers, data specialists, and cybersecurity professionals interested in leveraging technology to help further our mission of making financial products and services work for consumers.

We’re looking for talented individuals with diverse backgrounds who embrace our mission and are excited about building technology and helping to build our organization. We expect the next group of fellows to begin work in January 2015.

Since the program launched two years ago, fellows have been hard at work applying their talents to build amazing things to help financial products and services work for consumers. Today, I’m proud to share with you some of their work.

Fellows have been instrumental in creating and building:

Looking ahead, the next round of fellows will continue to build on these accomplishments as well as tackle new projects in areas such as building software for our website, developing consumer-friendly tools and materials, and supporting agency cybersecurity functions.

Technology and innovation are fundamental to our ability to achieve our consumer protection mission. If you’re ready to serve the public and help us build amazing things, apply now or sign up here.

Want to learn more? Check us out on GitHub or peruse the rest of this site to learn more about the web applications our current fellows have developed, and check out our Design Reel to see how current fellows have improved the ways consumers interact with the federal government.

Automating, enhancing, and improving eRegulations
Shashank Khandelwal, 12 May 2014 (/articles/automating-enhancing-improving-eregulations)

Regulation Z (Truth in Lending) is now hosted on eRegulations, the CFPB’s platform for making regulations easier to find, read, and understand. This release comes approximately six months after the previous regulation, Regulation E (Electronic Fund Transfers), was first hosted. That’s admittedly a long time for what seems like a simple content update. However, Regulation Z differs significantly from Regulation E, requiring us to improve and update the eRegulations platform. eRegulations now handles longer regulations gracefully, and these updates also make it significantly easier to host additional regulations going forward. Here, we’ll share some of the more interesting improvements and, in the process, reveal more about how our platform works.

Hosting larger regulations

The first and most significant difference between the two regulations is that Regulation Z is larger than Regulation E in almost every respect. In Table 1, you can see the difference between Regulation E and Regulation Z for each type of content. For example, Regulation E has 26 sections while Regulation Z has 54.

             Regulation E    Regulation Z
Subparts            2               7
Sections           26              54
Appendices          2              15

Table 1: The number of each type of content per regulation.

Regulation E is 1.5 MB on disk, while Regulation Z is more than seven times larger at 11 MB, when both texts are represented as pretty-printed JSON trees (not including images). That’s a lot of text; in comparison, War and Peace by Leo Tolstoy is 3.1 MB. The fact that Regulation Z is a significantly longer regulation than Regulation E drove how we approached the updates and improvements to the tool – from the need to automatically retrieve content changes, to allowing additional types of appendices, to separating the supplement into more manageable chunks.

Compiling regulations

A primary feature of eRegulations is the ability to view past, current, and future versions of a regulation. Previously, the source content that was fed to the parser to generate each version was created manually. The most significant change we made over the past six months was to automate this process.

Each version of a regulation consists of a series of Federal Register (FR) final rule notices applied to the previous version of the regulation. Each notice describes changes to individual paragraphs of the regulation (think of it like a diff). A change can add, revise, delete, or move a paragraph and looks something like this:

  1. Section 1026.32 is amended by:
  2. Revising paragraph (a)(2)(iii)
  3. The revisions read as follows:
  4. (a) ***
  5. (2) ***
  6. (iii) A transaction originated by a Housing Finance Agency, where the Housing Finance Agency is the creditor for the transaction; or

This example is from https://www.federalregister.gov/articles/2013/10/01/2013-22752/amendments-to-the-2013-mortgage-rules-under-the-equal-credit-opportunity-act-regulation-b-real#p-amd-32

Lines 1 and 2 describe which paragraph has changed and how it has changed (known as the amendatory instructions). Line 6 shows how paragraph 1026.32(a)(2)(iii) reads after the revision. A notice can contain many of these changes.

Each version of a regulation on our platform is represented on the back end as a data structure (more specifically, an ordered n-ary tree) that represents the entire regulation at that point in time. For each version of Regulation E, we manually read each FR notice and meticulously compiled plaintext versions that were fed to our parser to generate the tree structure. This was feasible because Regulation E has three versions built from eight FR notices. Regulation Z, on the other hand, has 12 versions and 23 notices. Manually compiling versions at that scale would be inefficient, more error-prone, and unsustainable going forward. We wanted to be able to simply start the parser when the next Regulation E or Z notice was published, without having to manually apply the changes from the new notice.

We now automatically compile regulation versions. Each FR notice is processed by parsing the amendatory instructions (what has changed) and the actual changes (how it has changed), matching those up, and compiling the changes into a new version. Each FR notice has a corresponding XML representation – this also drove the conversion of our parser from being text-based to XML-based. This resulted in a far more sustainable application requiring less manual intervention to add an additional regulation.
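To make the compile step more concrete, here is a rough sketch of applying a notice’s parsed changes to an in-memory regulation tree. The node layout, labels, and function names are simplified illustrations, not the actual eRegulations parser internals.

# Hypothetical sketch: apply a notice's parsed changes to a regulation tree.
# Node layout, labels, and function names are simplified illustrations.
import copy

def make_node(label, text="", children=None):
    # A regulation is an ordered n-ary tree of labeled paragraphs,
    # e.g. label ["1026", "32", "a", "2", "iii"] for 1026.32(a)(2)(iii).
    return {"label": label, "text": text, "children": children or []}

def find_parent(root, label):
    # Walk down from the root to the node that should contain `label`.
    node = root
    for part in label[1:-1]:
        node = next(c for c in node["children"] if c["label"][-1] == part)
    return node

def apply_change(root, change):
    # A change adds, revises, or deletes one labeled paragraph.
    parent = find_parent(root, change["label"])
    kids = parent["children"]
    if change["action"] == "add":
        kids.append(make_node(change["label"], change["text"]))
    elif change["action"] == "revise":
        for i, child in enumerate(kids):
            if child["label"] == change["label"]:
                kids[i] = make_node(change["label"], change["text"],
                                    child["children"])
    elif change["action"] == "delete":
        parent["children"] = [c for c in kids
                              if c["label"] != change["label"]]

def compile_version(previous_version, parsed_changes):
    # Each new version is the previous version plus one notice's changes.
    new_version = copy.deepcopy(previous_version)
    for change in parsed_changes:
        apply_change(new_version, change)
    return new_version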

Splitting and cleaning Federal Register notices

An individual regulation paragraph can change in a limited number of ways: a paragraph can be added, revised, moved, or deleted. Usually, these changes are written with reasonably consistent phrasing, which makes parsing them tractable. Sometimes, however, a change is not expressed as clearly as it could be. Adding rules to the code for each of these exceptions would have diminishing returns: the effort of getting the code correct, testing it, and ensuring that it doesn’t break any other parsing would far outweigh the benefit of the unique rule. To handle those special cases, we built a mechanism that lets us keep local copies of the XML notices taken from the Federal Register and edit those copies to make them easier to parse. The parser looks first in our local repository of notices to see if a copy of a required notice exists before downloading it from the Federal Register. This enabled us to gracefully handle phrases that aren’t used frequently enough to warrant their own custom rule.
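A minimal sketch of that lookup order, assuming a hypothetical directory of hand-patched notices and a caller-supplied download function (the real parser’s layout and names differ):

# Hypothetical sketch: prefer a locally patched copy of a Federal Register
# notice and fall back to downloading it. Paths and names are illustrative.
import os

LOCAL_NOTICE_DIR = "local_notices"   # hand-edited copies for special cases

def notice_xml(document_number, download):
    # `download` is a callable that fetches the published XML from the
    # Federal Register when no local override exists.
    local_path = os.path.join(LOCAL_NOTICE_DIR, document_number + ".xml")
    if os.path.exists(local_path):
        with open(local_path, encoding="utf-8") as f:
            return f.read()
    return download(document_number)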

The same mechanism came in handy when we discovered that several notices for Regulation Z had more than one effective date. Notices with the same effective date are what comprise a version of a regulation. The following example illustrates how complicated this can get:

This final rule is effective January 10, 2014, except for the amendments to §§ 1026.35(b)(2)(iii), 1026.36(a), (b), and (j), and commentary to §§ 1026.25(c)(2), 1026.35, and 1026.36(a), (b), (d), and (f) in Supp. I to part 1026, which are effective January 1, 2014, and the amendments to commentary to § 1002.14(b)(3) in Supplement I to part 1002, which are effective January 18, 2014.

From: https://www.federalregister.gov/articles/2013/10/01/2013-22752/amendments-to-the-2013-mortgage-rules-under-the-equal-credit-opportunity-act-regulation-b-real#p-40

In these cases, we manually split up the notices, creating a new XML source document for each effective date. This was another situation in which a manual override made the most sense given time and effort constraints.

Improving appendices

The types of information the appendices for Regulation Z contain are far more varied than those for Regulation E. First, the structure of the text in the appendices for Regulation Z differs from that of Regulation E. This required a complete re-write of the appendix parsing code to allow for the new format. Secondly, the appendices for Regulation Z contain equations, tables, SAS code, and many images. Each of those presented unique challenges. To handle tables we had to parse the XML that exhaustively represented the tables into something meaningful and concise, and then display that in visually pleasing HTML tables. The SAS code was handled by the same mechanism.
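As a sketch of the table work, the snippet below flattens an illustrative table-markup fragment into a plain HTML table. The tag names are placeholders; the real Federal Register markup is considerably more verbose.

# Hypothetical sketch: converting verbose table XML into a simple HTML table.
# The TABLE/ROW/CELL tag names are placeholders, not the real markup.
from xml.etree import ElementTree as ET

SAMPLE = """
<TABLE>
  <ROW><CELL>Loan type</CELL><CELL>Rate</CELL></ROW>
  <ROW><CELL>Fixed</CELL><CELL>4.5%</CELL></ROW>
</TABLE>
"""

def to_html_table(xml_text):
    root = ET.fromstring(xml_text)
    rows = []
    for row in root.findall("ROW"):
        cells = "".join("<td>{}</td>".format(cell.text or "")
                        for cell in row.findall("CELL"))
        rows.append("<tr>{}</tr>".format(cells))
    return "<table>{}</table>".format("".join(rows))

print(to_html_table(SAMPLE))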

Some of the appendices in Regulation Z contain many images. To speed up page loads for those sections, we re-saved all of the images using formats that compress the content with minimal quality degradation, and we introduced thumbnails. Clicking a thumbnail brings the user to the larger image, but the thumbnails ensure that pages load faster. We also lazy-load the images on scroll to speed up the initial page load. Regulation Z, in its original form, also contains a number of appendices where the images contain text. We pulled the text out of those images so that it is now searchable and linkable, providing a better user experience. With the exception of compiling regulations, most of the changes we made for Regulation Z were a direct result of the fact that Regulation Z is longer.

Breaking up the supplement

Supplement I is the part of the regulation that contains the official interpretations of the regulation. Loading Supplement I as a single page worked well for Regulation E (where the content is relatively short), but with Regulation Z this led to a degraded experience because the supplement is significantly longer. We therefore split Supplement I so it could be displayed one subpart at a time; our product owner considered this a more cohesive experience than breaking Supplement I up to be read a section at a time. Our code, however, was previously written with the intent of displaying a section at a time (with the entirety of Supplement I treated as a single section), which worked nicely because it also reflected how the underlying data is represented. With the supplement displayed a subpart at a time, there is no corresponding data structure that tells us which sections of Supplement I should be collected and displayed together, so this required a rewrite of some of our display logic. Supplement I is now easier to read as a result.

Conclusion

We made many other changes to the eRegulations tool along the way: introducing a landing page for all the regulations, extending the logic that identifies defined terms within the regulation, and, based on user feedback, introducing help text to the application. Each one of those represents a significant effort, but here we wanted to explain some of the larger ones. All our code is open source, so you can see what we’ve been up to in excruciating detail (and suggest changes).

Through this set of changes, we’ve hopefully made it easier to navigate, understand, and comply with Regulation Z. Going forward, it will also be easier to add future regulations and handle longer ones.

A peek inside Qu
Matthew Burton, 25 Apr 2014 (/articles/peek-inside-qu)

This past January, we launched a data exploration tool that lets the public dissect the 113 million mortgage applications that Americans filed between 2007 and 2012. This data, from the Home Mortgage Disclosure Act (HMDA), has been publicly available for years, but never before has it been in such an accessible format. This web-based tool is powered by Qu, a data platform we built to help us quickly deliver data like this to researchers and developers. To learn about it and where it’s going, I spoke to Clinton Dreisbach, Qu’s lead developer.

Question:

To start, what is Qu? Can you give some of the backstory on why we built it?

Answer:

Qu is an open-source platform to deliver large sets of data. It allows you to query that data, combine it with other data, and summarize that data. We built it because we wanted to serve millions of mortgage application records, and there was nothing out there that could do the same thing on the scale we were looking for. There are some smaller things—Socrata, CKAN’s data tables—and some really large enterprise-y things like Apache Drill, but nothing really in the middle, for serving 10–100 million rows of data easily.

It’s important to note that Qu isn’t just “the CFPB data platform”; it’s a platform for building your own data APIs.

Question:

Right; other people can use it for their own data sets that have nothing to do with us.

Answer:

The work we’re doing right now is to make that as easy as possible.

Question:

What’s the difference between Qu and tools like Socrata and CKAN? Is it an alternative to them, or a complement?

Answer:

Yes and yes? I think it makes a nice complement to CKAN, as CKAN is more focused on being a data catalog, whereas Qu is a data provider. That is, CKAN is great for showing the world your data sets, including sets in non-machine-readable formats, like PDF or Word documents. Qu is good for taking the machine-readable data and putting a simple API on top of it.

Question:

The features found on our HMDA tool—those are applications built using the API, not Qu itself, right?

Answer:

Correct. Those are JavaScript applications based on our mortgage application API, which itself was built using Qu. We’ve built a template that lets you use Qu to build APIs for your own datasets. This is the first step in turning Qu into something like Django or Ruby on Rails—a library you use in your own app, instead of an app by itself.

Socrata and CKAN are applications. You download them and install them. They are like WordPress in this way: a web application you put on your server. You configure the application, but in the end, you have that application.

Qu was like this until recently. The big change we are making is that Qu is becoming a toolkit for building your API. It doesn’t take much, and you might only have one simple file. For example, here’s the file that runs api.consumerfinance.gov. This was generated by the Leiningen template (linked above). But you can add whatever you want.

This is how Qu has become more like Django or Rails. It makes it infinitely extensible without mucking around in the source code of Qu itself. Right now, we’ve just begun exploring what that can give us.

Question:

Elementary question: what’s the benefit to building your own API instead of just using the one that comes out of the box with one of those other products?

Answer:

To be honest, right now, not a lot, besides that you can benefit from upgrades in Qu’s core software easily. But the end goal will make it matter a lot, because you will be able to pick and choose Qu components—the database, the formats—easily. Qu has always tried to follow the principle that APIs should be discoverable by a human. So, the API has an HTML interface that should let you use the whole thing.

Question:

Going back a few minutes to what you were saying about making Qu infinitely extensible, you said you’re just beginning to explore the benefits of this, but you must have had some reason for doing it to begin with.

Answer:

This fell out of me wanting to make the database interchangeable. Doing that led me to think about the best way to make it switchable through configuration. And I ended up with an application template/builder rather than an application.

Question:

So what this means is, if I have a data set that I want to provide an API for, but don’t know how to build APIs, a future version of Qu will let me build a powerful one with relative ease, without forcing me to run it from a database I don’t have.

Answer:

Right. Exactly.

Question:

Cool. Why did you choose Clojure?

Answer:

Two reasons. First, for dealing with this much data, we need to use all the capabilities of our machines. There are not a lot of languages out there that make using multiple threads easy, and Clojure’s one of the few. (By the way, here’s a curriculum I wrote that explains why Clojure is good at this.)

#2: Clojure is fundamentally about data. It’s not an object-oriented language. Everything in Clojure is a data structure, which fits well when you’re writing programs to transform data.

#3 (I said two, but not true): Clojure is nothing more than a library for the Java Virtual Machine. This lets us use next-level technology while still being able to use all the Java libraries that exist today. In addition, most government and corporate environments know how to deploy Java applications. It’s a nice mix of looking forward without overwhelming our existing infrastructure.

And #4: I like using Clojure. Qu started as a prototype, so I used what I know and love. The prototype grew—like they do—and became the real application.

Question:

What have been some of the bigger engineering challenges?

Answer:

Figuring out how to deliver an arbitrary amount of data was a big deal. If you’re working with our mortgage application data set, you can request any amount of data for download and we will serve it. This is hard, because we have a finite amount of memory and a very large amount of data. You can ask for 4 GB of data and we will serve it, yet we never keep that much data in memory.

Clojure made this fun and easy: it has “lazy sequences”, which not only allow us to defer processing until we need it, but also allow us to garbage-collect data after we’ve used it.

Question:

“Garbage-collect data”?

Answer:

We release the memory the data was using. The only data in memory is the data currently being delivered. Once you’ve got the data, we throw it away. This allows us to service multiple requests for large data sets without exploding. Imagine a window that you can look at a bunch of data through. That window moves over the data, showing only what’s necessary at any given time.

We serve that data using HTTP streaming, so you don’t have to wait for it all to be ready before you start receiving it.
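That windowed, streaming behavior can be sketched outside Clojure too. The toy Python generator below is a loose analogy to Qu’s lazy sequences, not its actual implementation: only the chunk currently being written is ever held in memory.

# Loose analogy to Qu's lazy, streaming delivery (not its actual Clojure code):
# fetch rows in fixed-size chunks and yield them one at a time, so the full
# result set is never held in memory at once.
DATA = range(1_000_000)          # stand-in for a very large query result

def fetch_chunk(offset, limit):
    # Placeholder for a database query returning up to `limit` rows.
    return list(DATA[offset:offset + limit])

def stream_results(total, chunk_size=10_000):
    offset = 0
    while offset < total:
        chunk = fetch_chunk(offset, chunk_size)
        if not chunk:
            break
        for row in chunk:
            yield row            # the HTTP layer can write each row as it arrives
        offset += chunk_size

for row in stream_results(total=1_000_000):
    pass                         # e.g., write the row to the HTTP response stream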

Question:

So if you want to download a big file, we don’t have to tell you, “Okay, sit tight while we generate the data set for you—then you can come back and download your huge file.” Instead, it starts immediately.

Answer:

Yes. Although, we have to do that right now for queries that are hard to calculate. We’re working on that.

Question:

What are some of the things at the top of your to-do list for Qu?

Answer:

Our roadmap is public. I want to make Qu even easier to customize. Individual organizations should be able to take Qu and add new data backends, new endpoints, and new data formats very easily.

I want to overhaul the way you import data. Right now, it’s complex. You have to know a special format for the data definition. This should be easy to write, or even better, partially inferred.

And I want to continue to make the whole thing pluggable. For example, adding an admin dashboard adds a bunch of complexity for something you might not need. But, having a plugin for an admin dashboard lets you have that or leave it out as you wish.

My biggest goal is to get others using Qu so we can see what they need and they can contribute back. I’d love to see a CKAN/Qu integration. But I’d love to see someone else write it.

Question:

If someone else wanted to make some code contributions, where should they focus?

Answer:

Definitely on data loading. It’s the first part of the app I wrote. It’s both pretty easy to understand and crufty: Here’s a sample data set ready for loading, and here is the definition file. It is huge and gross. I would love proposals on a better format for describing data coming in.

Soon, once things are a little more settled on this modularization, I’d love to see people write database adapters for DBs other than Mongo. Here are the docs on that.

Matthew Burton is the former Acting CIO of the Consumer Financial Protection Bureau. Though he has moved back to Brooklyn, he still works with the Bureau’s technology team on a part-time basis.

Clinton Dreisbach is a Clojure and Python hacker for the Consumer Financial Protection Bureau. He is the lead developer on Qu, the CFPB’s public data platform, and a contributor to Clojure, Hy, and other open source projects.

Rules about rules: Tools of the trade
CM Lubinski, 31 Jan 2014 (/articles/rules-rules)

Congress conveys its general vision by enacting laws; regulatory bodies, like the CFPB, implement that vision by publishing detailed regulations. These regulations are usually long, monolithic, difficult-to-read documents buried deep within Federal agency websites. With the recently released eRegulations project (source code), we aimed to make regulations more approachable by presenting them with structure and layers of relevant material. One of the core contributions of the project is a plain-text parser for these regulations, a.k.a. “rules.” This parser pulls structure from the documents such that each paragraph can be properly indexed; it discovers citations between the paragraphs and to external works; it determines definitions; it even calculates differences between versions of the regulation. Due to the sensitive (and authoritative) nature of regulations, we cannot leave room for probabilistic methods employed via machine learning. Instead we retrieve all of the information through parsing, a rule-based approach to natural language processing.

In this article, we’ll touch on a few of the tools we use when parsing regulations.

XML: So much structure, so little meaning

The Government Printing Office publishes regulations in the Code of Federal Regulations (CFR) as XML bulk downloads. Surely, with a structured language such as XML defining a regulation, we don’t have much to do, right? Unfortunately, as Cornell discovered, not all XML documents are created equal, and the CFR’s data isn’t exactly clean. The Cornell analysis cites major issues with both inconsistent markup and, more insidiously, structure-without-meaning. Referring to the documents as a “bag of tags” conveys the problem well; just because a document has formatting does not mean it follows a logical structure. The XML provided in these bulk downloads was designed for conveying format, rather than structure, meaning header tags might be used to center text and italic paragraphs might imply headings.

In our efforts towards a minimum-viable-product, we chose to skip both the potential hints and pitfalls of XML parsing in favor of plain-text versions of the regulations. Our current development relies more heavily on XML, yet we continue to use plain text in many of our features, as it’s easier to reason about. For the sake of simplicity, this writeup will proceed with the assumption that the regulation is provided as a plain-text document.

Regular expressions: Regexi?

Regular expressions are one of the building blocks of almost any text parser. While we won’t discuss them in great detail (there are many better resources available), I will note that learning how to write simple regexes doesn’t take much time at all. As you progress and want to match more and more, Google around: due to their widespread use, it’s basically guaranteed that someone has had the same problem.

Regular expressions allow you to describe the “shape” of text you would like to match. For example, if a sentence has the phrase “the term”, followed by some text, followed by “means”, we might assume that the sentence is defining a word or phrase. Regexes give us many tools to narrow down the shape of acceptable text, including special characters to indicate whitespace, the beginning and end of a line, and “word boundaries” like commas, spaces, etc.

"the term .* means"    # likely indicates a defined term
"\ba\b"                # only matches the word "a"; doesn't match "a" inside another word such as "bad"

Regexes also let us retrieve matching text. In our example above, we could determine not only that a defined term was likely present but also what that term or phrase would be. Expressions may include multiple segments of retrieved text (known as “capture groups”), and advanced tools will provide deeper inspection such as segmenting out repeated expressions.

"Appendix ([A-Z]\d*) to Part (\d+)"
# Allows us to retrieve 'A6' and '2345' from "Appendix A6 to Part 2345"

Regular expressions serve as both a low-ish level tool for parsing and as a building block on which almost all parsing libraries are constructed. Understanding them will help you debug problems with higher-level tools as well as know their fundamental limitations.

When is an (i) not an (i)?

Regulations generally follow a relatively strict hierarchy, where sections are broken into many levels of paragraphs and sub-paragraphs. The levels begin with the lower-case alphabet, then arabic numerals, followed by roman numerals, the upper-case alphabet, and then italic versions of many of these. Paragraphs each have a “marker”, indicating where the paragraph begins and giving it a reference, but these markers may not always be at the beginning of a line. This means that, to find paragraphs, we’ll need to search for markers throughout every line of text.

It’s not a simple matter of starting a new paragraph whenever a marker is found, however. Paragraph markers are also sprinkled throughout the regulation inside citations to other paragraphs (e.g. See paragraph (b)(4)). To solve this issue, we can run a citation parser (touched on shortly) to find the citations within a text and ignore paragraph markers found within them.

There’s also a pesky matter of ambiguity. Many of the roman numerals are identical (in appearance) to members of the lower-case alphabet. Further, when using plain text as a source, all italics are lost, so the deepest layers of the paragraph tree are indistinguishable from their parents. Luckily, we can both keep track of what we have seen before (i.e., what the next marker could be) and peek forward to see which marker follows. If an (i) marker is followed by a (ii) or a (j), we can deduce exactly which level in the tree the (i) corresponds to.
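A stripped-down sketch of that marker hunt and disambiguation, with illustrative helper names (the real parser also filters out markers that appear inside citations first):

# Hypothetical sketch: find paragraph markers anywhere in a line, then decide
# whether an "(i)" is a Roman-numeral level or a letter level by peeking at
# the marker that follows it. Helper names are illustrative.
import re

MARKER = re.compile(r"\(([a-z0-9]+)\)")   # e.g. (a), (2), (iv); italics are lost in plain text

def markers_in(line):
    # All candidate paragraph markers in a line; in the real parser, markers
    # found inside citations (e.g. "See paragraph (b)(4)") are ignored.
    return MARKER.findall(line)

def level_of_i(next_marker):
    # Disambiguate "(i)" by looking at whichever marker comes next.
    if next_marker == "ii":
        return "roman numeral"
    if next_marker == "j":
        return "lower-case letter"
    return "unknown"   # fall back to tracking which levels we have already seen

line = "(i) A transaction originated by a creditor; or (ii) A transaction..."
found = markers_in(line)        # ['i', 'ii']
print(level_of_i(found[1]))     # -> 'roman numeral'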

Parser combinators: Not as scary as they sound

Regular expressions certainly require additional mental overhead for future developers, who will generally “run” expressions in their mind to see what they do. Well-named expressions help a bit, but the syntax for naming capture groups is generally quite ugly. Further, combining expressions is error-prone and leads to even more indecipherable code. So-called “parser combinators” (i.e., parsers that can be combined) resolve, or at least alleviate, both of these issues. Combinators allow expressions to be named and easily combined to build larger expressions. Below, examples demonstrate these features using pyparsing, a parser combinator library for Python.

from string import digits
from pyparsing import Word

# A "part.section" citation such as "1234.56", with each piece named.
part = Word(digits).setResultsName("part")
section = Word(digits).setResultsName("section")
part_section = part + "." + section

parsed = part_section.parseString("1234.56")
assert(parsed.part == "1234")
assert(parsed.section == "56")

Parser combinators allow us to match relatively sophisticated citations, such as phrases which include multiple references separated by conjunction text. The parameter listAllMatches tells pyparsing to “collect” all the phrases which match our request. In this case, that means we can handle each citation by walking through the list.

# `citation` and `conj_phrase` are parser elements defined elsewhere;
# ZeroOrMore comes from pyparsing.
citations = (
    citation.copy().setResultsName("head")
    + ZeroOrMore(conj_phrase
                 + citation.copy().setResultsName("tail",
                                                  listAllMatches=True)))

cits = citations.parseString("See paragraphs (a)(2), (3), and (b)")
for cit in [cits.head] + list(cits.tail):
    handleCitation(cit)

What about meaning?

Thus far, we have matched text, searched for markers, and retrieved sophisticated values out of the regulation. I can understand why this might feel like a bit of a letdown — the parser isn’t doing any magic. It doesn’t know what sentences mean; it simply knows how to find and retrieve specific kinds of substrings. While we could argue that this is a foundation of understanding, let’s do something fun instead.

The problem we face is that we must determine what has changed when a regulation is modified. Modifications don’t result in new versions of the regulation from the Government Printing Office (which only publishes entire regulations once a year). Instead, we must look at the “notice” that modifies the regulation (effectively a diff). Unfortunately, the pin-point accuracy that we need appears only in English phrases like:

4. Section 1005.32 is amended by revising paragraphs (b)(2)(ii) and (c)(3), adding paragraph (b)(3), revising paragraph (c)(4) and removing paragraph (c)(5) to read as follows

We can certainly parse out some of the citations, but we won’t understand what’s happening to the text with these citations alone. To aid our efforts, let’s focus on the parts of this sentence that we care about. Notably, we only really care about citations and verbs (“revising”, “adding”, “removing”). Citations will play both the roles of context and nouns (i.e. what’s being modified). We can reduce the sentence into a sequence of “tokens”, in this case becoming:

[Citation, Verb, Citation, Citation, Verb, Citation, Verb, Citation, Verb, Citation]

Each Citation token will know its (partial) citation (e.g. paragraph (b)(3) with no section), while each Verb will know what action is being performed as well as the active/passive voice (“revising” vs. “revised”).

We next convert all passive verbs into their corresponding active form by changing the order of the tokens. For example, “paragraph (b) is revised” gets converted into “revising paragraph (b)” in token form. Next, we can carry citation information from left to right. In this sentence, “Section 1005.32” carries context to each of the other paragraphs, filling in their partial citation information.

Finally, we can step through our list of tokens, keeping track of which modification mode we are in. We’d see “Section 1005.32” first, but since we start with no verb/mode set, we will ignore it. We then see “revising” and set our modification mode correspondingly. We can therefore mark each of the next two citations as “modified”. We then hit an “adding” verb, so we switch modes and mark the following citation as “added”. We continue this way, switching modes and marking citations until the whole sentence is parsed.

[Citation[No Verb], Verb == revise, Citation[Revise], Citation[Revise], Verb == add, Citation[Add], Verb == revise ...
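A condensed sketch of that final pass, with simplified token classes standing in for the real parser’s tokens:

# Hypothetical sketch of the final pass: walk the token list, switching the
# current modification mode at each Verb and tagging each Citation with it.
# The token classes are simplified stand-ins for the real parser's tokens.
from collections import namedtuple

Verb = namedtuple("Verb", ["action"])          # "revise", "add", "remove", ...
Citation = namedtuple("Citation", ["label"])   # e.g. "1005.32(b)(2)(ii)"

tokens = [
    Citation("1005.32"), Verb("revise"),
    Citation("1005.32(b)(2)(ii)"), Citation("1005.32(c)(3)"),
    Verb("add"), Citation("1005.32(b)(3)"),
    Verb("revise"), Citation("1005.32(c)(4)"),
    Verb("remove"), Citation("1005.32(c)(5)"),
]

changes = []
mode = None                        # no verb seen yet: context-only citations are skipped
for token in tokens:
    if isinstance(token, Verb):
        mode = token.action        # switch modes: revise -> add -> revise -> remove
    elif mode is not None:
        changes.append((mode, token.label))

print(changes)
# [('revise', '1005.32(b)(2)(ii)'), ('revise', '1005.32(c)(3)'),
#  ('add', '1005.32(b)(3)'), ('revise', '1005.32(c)(4)'),
#  ('remove', '1005.32(c)(5)')]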

Rules and anarchy

With combinations of just these tools, we can parse a great deal of content out of plain-text regulations, including their structure, citations, definitions, diffs, and much more. What we’ve created has a great many limitations, however. The rule-based approach requires our developers to think up “laws” for the English language, an approach which has proven itself ineffective in larger projects. Natural language is, in many ways, chaos, and that is where machine learning and statistical techniques shine. In that realm, there is an expectation of inaccuracy simply because the problem is so big.

Fortunately, our task was not so large. The rule-based tools described above are effective with our limited set of examples (a subset of our own regulations). While the probabilistic techniques have, on average, higher accuracy for the general use case, they would not be as accurate as our tailored rules for our use cases. Striking the balance between rules and anarchy is difficult, but in this particular project, I believe we have chosen well.

Introducing CFPB Open Tech
31 Oct 2013 (/articles/introducing-cfpb-open-tech)

Hello, world! Welcome to CFPB Open Tech, the new home on the web for the Consumer Financial Protection Bureau to share and discuss the work we do with technology to improve the financial lives of consumers throughout the country.

At the CFPB, we value openness and transparency. Additionally, one of our core values is innovation. Our organization embraces new ideas and technology. We are focused on continuously improving, learning, and pushing ourselves to be great. A natural result is that we are strong proponents of open source software, both using it in our organization and releasing software we build. You can see more details about our philosophy on using and releasing open source software by reading our Source Code Policy.

Along with software development, this website will also feature the CFPB’s design work, which is led by a great team of graphic and user experience designers. We’ll talk about our design process and the value of design in helping consumers to understand the risks and benefits of their financial choices. Too often, financial products, services, contracts, and terms are unfamiliar and confusing. Good design can help consumers make the right choices for themselves and their families.

The CFPB has its own internal design and development teams. The mission of the Bureau is to ensure that markets for consumer financial products and services are fair, transparent, and competitive, and we rely on intense collaboration between the technology team and our expert policy staff to accomplish that mission. Financial products are increasingly technology-based, and having an in-house team of designers and developers at the CFPB enables us to keep pace with today’s market and develop new tools that help consumers succeed in their financial lives.

Our hope is that by releasing as much of our code as possible here on GitHub, we can learn from and share with other agencies and individuals across the country (or even the world), in the spirit of open source software development and open government. We’ve already accepted a pull request from a member of the public.

From time to time, we’ll also discuss this work on consumerfinance.gov, but here you’ll find more detail about choices, processes, and techniques.

At the Consumer Financial Protection Bureau, we are committed to building world-class technology tools for the public and our colleagues. We look forward to sharing our work with you.
