It's kind of a big data: IRS tax-exempt data and Detroit Ledger

What, you haven't heard the news? Forms 990, back to 2011, in machine readable formats, no FOIA required! OK, maybe if you aren't a programmer working with philanthropoy data, this flew under your radar. But let me explain how exciting this is for me & the team at Detroit Ledger.

We started work on a parser a while ago, but thanks to the Universtiy of Michigan School of Information, we got to work with a volunteer programmer this week to build a more universally useful tool.

Forms 990 and Detroit Ledger

Our project started with a few people volunteering their time to hand scrape data from foundation websites and news stories about philanthropic funding in Detroit, and through the miracles of technology over the past few years, we've progressed to... basically doing the same thing, with a few exceptions. If the IRS stays consistent with publishing the Form 990 xmls, it's going to substantially change the way we can present data on our site, and in addition we can build analytic tools that are useful for a scope beoynd our geographic focus on greater Detroit.

About the data

The IRS worked with Amazon to serve all of the data in an S3 bucket. It holds about 60 GB and 1.8 million objects. The vast majority of objects in that bucket are small (30kb-ish) XML documents that use one of seventeen possible schemas. The other files included in the bucket are the schemas themselves (along with somewhat-useful HTML diffs) and JSON documents that serve as yearly indicies of filings.

Archive.org hosts a mirror of all the files and has split off the schema bits, for convenience. I mirrored the files to my desktop & helped them fill in the 2016 data.

While each Form 990 record has a bunch of different information, at the Ledger we're mostly interested in Schedule I, which contains notable grants.

Challenges

Normalizing this data is hard -- there are a ton of schemas, and since we're sorta operating in a bubble here, we are mostly thinking about how we want to use the data at the Ledger. I'd like to offer a configuration syntax that works for as many different use cases as possible.

I'm not really sure what syntax makes the most sense for defining this sort of crosswalk from diverse XML data to an arbitrary dict.

I made an effort early in the week to reach out to other people who seem to be working in this space, and I'm going to work with them to set up a mailinglist or slack channel where we can hang out to share knowledge.

Where it's at

This week, I mostly focused on writing a class that deals with retrieving & reading the index files, while our ASB volunteer worked on everything else. Our code will load a CSV containing our EINs, retrieve the Forms 990 for those EINs for a given year, and pull out certain bits of the data. It has a basic, hard-coded normalization routine.

The index data reads the JSON files as a stream, so that part is pretty efficient.

I found that using the boto3 s3 client runs about 2x as fast as doing HTTP requests.

Next steps

  • Refactor more of the code into smaller functions & classes
  • Use more streams
  • Store the normalization configuration in a YAML file
  • Offer a single easy-to-use CLI utility that you provide a normalization configuration CSV with EIN and internal ID columns, which outputs a CSV
  • Offer a single easy-to-use class that you provide a normalization configuration CSV with EIN and internal ID columns, which outputs a stream of dicts (?)

I'll be working more on this over the next week; check back here for progress.