detriot.org

Benjamin Chodoroff's blog

Detroit Ledger is rising from a brief few months of hibernation: we received some funding to improve our database (more details soon), and all of a sudden, Jessica, Matt, and I are buzzing with ideas.

One of the ideas: relate all of our organization entities with IRS 990 and entity records found in Citizen Audit’s dataset. With this JOIN, we can pull in EINs (a very nice unique key, right?) as well as Citizen Audit’s yearly budget/salary figures, among other goodies.

Citizen Audit has an API that allows one-at-a-time row requests, as well as some data dumps (see link above) in the form of PostgreSQL tables. I want to look around this data and eventually import it into MySQL tables to live inside Detroit Ledger.

Most of this methodology can apply to any situation where you want to investigate postgres data and then pull it into mysql.

This is what I do:

Have a Debian Wheezy server

I’m on a 512M digitalocean instance.

Set up postgres

cat>>sources.list.d/pdgd.list<<EOF
deb http://apt.postgresql.org/pub/repos/apt/ wheezy-pgdg main
EOF
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
sudo apt-get update
sudo apt-get install postgresql-9.3 pgadmin3

For our datamunging tasks, add a postgres user with no password who can create databases.

sudo -u postgres createuser -d -w citizenaudit

Get the psql tables

wget http://s3.citizenaudit.org/irs/bulk/masterfile.sql.gz
wget http://s3.citizenaudit.org/irs/bulk/extracts.sql.gz
wget http://s3.citizenaudit.org/irs/bulk/manifest.csv.gz

Look around data…

Looks useful. We can link on names and stuff. Now I need to import this biz into MySQL.

Install PHP and MySQL

# add repos from dotdeb.org first, and then...
sudo apt-get install php5-cli mysql-client mysql-server

Convert and load

git clone git@github.com:swrobel/pg2mysql.git
cd pg2mysql
# increase memory limit
git apply<<EOF
diff --git a/pg2mysql.inc.php b/pg2mysql.inc.php
index a60bdda..09fca3f 100644
--- a/pg2mysql.inc.php
+++ b/pg2mysql.inc.php
@@ -21,7 +21,7 @@ the Free Software Foundation, Inc., 59 Temple Place - Suite 330,
 Boston, MA 02111-1307, USA.
 */
 
-ini_set("memory_limit","512M");
+ini_set("memory_limit","2048M");
 error_reporting(E_ALL & ~E_DEPRECATED);
 define ('PRODUCT',"pg2mysql");
 define ('VERSION',"1.9");
EOF
php pg2mysql_cli.php ../extracts.sql ../extracts-my.sql
php pg2mysql_cli.php ../masterfile.sql ../masterfile-my.sql
# as root in mysql:
CREATE DATABASE ca;
CREATE USER 'bc'@'localhost' IDENTIFIED BY 'smelly';
GRANT ALL PRIVILEGES ON ca . * TO 'bc'@'localhost';

and load the data…

mysql -ubc -psmelly ca < extracts-my.sql
mysql -ubc -psmelly ca < masterfile-my.sql

Great. Look at those tables.

Next steps take place in Drupal: create Data schema definitions for our tables and join everything up!


Thanks for reading. Email me at ben at falafelcopter dot com. Read my posts here.