Toolbox: Scraping App for the Digital Gumshoe

August 28, 2013

Here's a nifty web-scraping tool we think investigative journalists might want to know about or use. What's web-scraping, you ask?

In the old days, gumshoes used shoe leather, telephones, and index cards. Today, they gather data from the web to drive investigative projects. As many journalists know, data on the web is not always in a convenient or usable form. "Scraping" means a methodical effort to capture data disclosed on the web and put it into usable form, typically a database.

First, let's note that neither journalists nor the public should have to do this, at least for federal data. The Electronic Freedom of Information Act Amendments of 1996 decreed that government records that exist in electronic form have to be made available in electronic form. But data published on the web in HTML "table" format may be hard to wrangle into a structured database, where it can be queried for investigative purposes.

Software exists to help with this. But it is often expensive, bewildering, bloated, and untrustworthy.

That's why the WatchDog was pleased to discover a free, open-source add-on to Mozilla's Firefox web browser called ExportToCSV, by Souvik Chatterjee. Even though it is nominally in "beta" (test) release, it works fine for us. You simply right-click on any HTML table, and it exports the table's data to a "comma-separated values" (CSV) text file. Such CSV files can be easily imported into databases like Microsoft Access or spreadsheets like Excel. The add-on has cleared preliminary vetting by Mozilla, the organization that produces Firefox.
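
For readers curious about what "exporting a table to CSV" involves under the hood, here is a rough sketch in Python using only the standard library. It is not how ExportToCSV itself works (we have not looked at its internals); it simply illustrates the general idea of walking an HTML table and writing each row as a line of comma-separated values. The class name `TableToRows`, the function `table_to_csv`, and the sample table with its made-up companies and fines are all our own inventions for illustration. The sketch reads only the first table it finds and does not try to handle tables nested inside other tables.

```python
import csv
import io
from html.parser import HTMLParser


class TableToRows(HTMLParser):
    """Collect the cell text of the first <table> in a page, row by row.

    Hypothetical helper for illustration; nested tables are not handled.
    """

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a list of cell strings
        self._row = None        # row being built, or None outside a <tr>
        self._cell = None       # text fragments of the current cell, or None
        self._in_table = False  # True once the first <table> has opened
        self._done = False      # True once the first table has closed

    def handle_starttag(self, tag, attrs):
        if self._done:
            return
        if tag == "table":
            self._in_table = True
        elif self._in_table and tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if self._done:
            return
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag == "table" and self._in_table:
            self._done = True

    def handle_data(self, data):
        # Keep text only while we are inside a table cell.
        if self._cell is not None:
            self._cell.append(data)


def table_to_csv(html):
    """Return the first HTML table in `html` as CSV text."""
    parser = TableToRows()
    parser.feed(html)
    parser.close()
    out = io.StringIO(newline="")
    csv.writer(out).writerows(parser.rows)
    return out.getvalue()


# Made-up sample data, for illustration only.
SAMPLE_HTML = """
<table>
  <tr><th>Company</th><th>Fine</th></tr>
  <tr><td>Acme, Inc.</td><td>1200</td></tr>
  <tr><td>Widget Co.</td><td>350</td></tr>
</table>
"""

if __name__ == "__main__":
    print(table_to_csv(SAMPLE_HTML), end="")
```

Note that the company name containing a comma is wrapped in quotation marks in the output, which is how CSV keeps a comma inside a cell from being mistaken for a column break. The resulting text can be saved to a file and opened in a spreadsheet or imported into a database.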

Find it here.