Many Powerful, Free PDF Scraping Tools Available for Data Journos

March 5, 2014

We all know it. Some agencies and organizations publish data in PDF format to keep journalists and the public from using the raw data. Take heart. Help is on the way.

There are two kinds of files in the PDF format, a page-description language developed by Adobe for use with its Acrobat reader. One kind arranges text and graphics on a page, and the other is simply a scanned bitmap. Only the first kind is easy to convert to raw data.

One easy software tool, Tabula, was demonstrated at the annual conference of the National Institute for Computer-Assisted Reporting (NICAR) in Baltimore on February 27, 2014. You can download it here.

We gave it a test run. It worked fine, though you will want to follow the instructions closely unless you have a tutor guiding your mouse hand. We started with this PDF, the biggest available database of coal-ash sites, from Earthjustice and the Environmental Integrity Project. We ran it through an intermediate stage as a comma-separated values (CSV) text file, then imported it into Excel. It came out as a perfectly good spreadsheet, which you can find here.
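If you want to sanity-check the intermediate CSV before opening it in Excel, a few lines of Python can flag rows that came out of the PDF with the wrong number of columns. This is only a rough sketch, not part of Tabula: the function name summarize_csv and the sample rows are made up for illustration, and it assumes the first row of the file is the header.

```python
import csv
import io


def summarize_csv(text):
    """Return (header, data_row_count, ragged_rows) for CSV text.

    ragged_rows lists the 1-based positions (header = row 1) of data rows
    whose field count differs from the header's.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return [], 0, []
    header, data = rows[0], rows[1:]
    # A row with more or fewer fields than the header usually means a cell
    # was split or merged when the table was pulled out of the PDF.
    ragged = [pos for pos, row in enumerate(data, start=2)
              if len(row) != len(header)]
    return header, len(data), ragged


# Made-up sample standing in for an exported file (not real coal-ash data).
sample = (
    "Site,State,Units\n"
    "Plant A,OH,2\n"
    "Plant B,PA\n"
    '"Plant C, North",WV,1\n'
)

header, count, ragged = summarize_csv(sample)
print("Columns:", header)
print("Data rows:", count)
print("Rows to check by hand:", ragged)
```

To run it on a real export, read the file's contents into a string and pass that to summarize_csv. Any rows it reports are worth comparing against the original PDF before you trust the spreadsheet.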