Web Tool Brings Documents to Life

April 15, 2012
Amanda Hickman

Reporter's Toolbox

By AMANDA HICKMAN

DocumentCloud lets journalists engage with the public and knowledgeable sources

Looking for ways to invite readers into the story and make your online reporting more engaging? Buried in documents and hoping there’s something out there that works better than a highlighter and a gross of sticky notes? If you aren’t using DocumentCloud, you should be.

DocumentCloud is a web-based tool that reporters can use to analyze, annotate, and publish the documents behind their reporting. Those documents can be EPA reports, court filings, toxicology profiles, medical records and much, much more.

Here’s how it works:

First, acquire and scan your documents. If you need help getting your hands on the documents you know you need, check out MuckRock; they’re the most dependable FOIA and FOIL butlers you’ll ever meet. Once the documents are digital, log in and upload them. They don’t have to be PDFs: DocumentCloud can also handle Word and LibreOffice documents. Publish your documents immediately, or keep them to yourself while you work on your reporting.

A word about spreadsheets:

You can upload spreadsheets to DocumentCloud, though the software doesn’t have any understanding of rows and columns. If you’re looking for a good way to manage raw data and share it with your readers, keep an eye on the PANDA Project and look into TableSetter and TableStacker.

Using DocumentCloud to organize research

DocumentCloud takes a few minutes to process your documents. The software breaks out an image of each page and stores those images. If the document doesn’t already contain text, DocumentCloud uses a free and open source tool called Tesseract to extract text from the page images. Then it runs your text through Thomson Reuters’ OpenCalais, an entity extraction engine that pulls out and organizes the names, places and key terms in each document. DocumentCloud also extracts email addresses, phone numbers and dates, so you can view the dates mentioned in a document on a timeline.
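To give a feel for what pulling dates, email addresses and phone numbers out of a document’s text involves, here is a minimal Python sketch. It is not DocumentCloud’s implementation: the regular expressions, the function name extract_mentions and the sample sentence are all assumptions made up for illustration, and real extraction has to cope with many more formats.

```python
import re

# Illustrative patterns only (assumptions, not DocumentCloud's own rules).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# US-style phone numbers such as 304-555-0142 or 304.555.0142.
PHONE_RE = re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")
# Dates written out as "April 5, 2010".
DATE_RE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{1,2}, \d{4}\b"
)


def extract_mentions(text):
    """Return the email addresses, phone numbers and dates found in text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
        "dates": DATE_RE.findall(text),
    }


# Hypothetical sample text standing in for one page of a scanned document.
sample = (
    "Inspectors visited the site on April 5, 2010. "
    "Call 304-555-0142 or write to press@example.org."
)
mentions = extract_mentions(sample)
print(mentions)
```

Keeping each kind of mention in its own list makes it easy to feed the dates into a timeline while treating contact details separately.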

DocumentCloud does all this processing and turns the documents back over to you in a clean, fast-loading web interface where you can begin annotating them. Private annotations are visible only to you, while public annotations are as public as the document. Making an annotation is as simple as clicking “public annotation” or “private annotation” and drawing a box around the text you want to highlight. Add your note and click save.

Every annotation has a unique URL, so you can use DocumentCloud to manage your research and organize facts in a project of almost any size. Investigative reporter Tracie McMillan, whose book The American Way of Eating looks at the life and labor of the food industry, created a great tool for exactly that. (A onetime student of investigative reporter Wayne Barrett, she calls it “the barrettizer.”) It’s a spreadsheet of facts from the book, each one linked to its source material, much of which McMillan has put on DocumentCloud.
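A fact sheet along these lines can be kept in any spreadsheet program, but a short Python sketch shows the idea. Everything here is hypothetical: the column names, the placeholder claims, the example.org links and the helper functions are assumptions for illustration, not McMillan’s actual spreadsheet.

```python
import csv
import io

# Placeholder rows: each fact points at the URL of the annotation backing it.
facts = [
    {"page": 12, "claim": "Placeholder claim A",
     "source_url": "https://example.org/annotations/1"},
    {"page": 47, "claim": "Placeholder claim B",
     "source_url": "https://example.org/annotations/2"},
    {"page": 90, "claim": "Placeholder claim C", "source_url": ""},
]

FIELDS = ["page", "claim", "source_url"]


def write_fact_sheet(rows, out):
    """Write the facts as CSV so they open in any spreadsheet program."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


def unsourced(rows):
    """Return the facts that still lack a link to source material."""
    return [row for row in rows if not row["source_url"]]


buffer = io.StringIO()
write_fact_sheet(facts, buffer)
sheet = buffer.getvalue()
print(sheet)
print("Still to source:", [row["claim"] for row in unsourced(facts)])
```

Listing the unsourced rows separately gives a quick to-do list before a fact-check.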

When you’re ready to publish, you can embed a single annotation in your story, publish a searchable set of documents or publish whole documents one at a time. DocumentCloud also has tools that help smooth the editorial process. The collaboration tools let you, for example, show a lawyer an annotated copy of a document or invite a geologist to review and annotate a report that you’re struggling to understand.

If you have some programming chops, DocumentCloud’s API will let you automate almost every step of the way. Search GitHub for “DocumentCloud” to find a great list of tools your colleagues have already written to incorporate DocumentCloud into their own sites.
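As a rough illustration of what automating a step might look like, the Python sketch below builds a URL for a document search without sending any request. The endpoint path and the q, page and per_page parameter names are assumptions modeled on how DocumentCloud’s search API was documented around the time of writing; check the current API documentation before relying on them. The function name build_search_url is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: the search endpoint and its parameter names may have changed.
SEARCH_ENDPOINT = "https://www.documentcloud.org/api/search.json"


def build_search_url(query, page=1, per_page=10):
    """Build (but do not fetch) a search URL for the given query."""
    params = {"q": query, "page": page, "per_page": per_page}
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


url = build_search_url("fracking fact sheet")
print(url)
```

Building the URL in one small function keeps the query encoding in a single place, so a script that pages through results only has to change the page argument.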

DocumentCloud was founded in 2009 with a Knight News Challenge grant. Investigative Reporters and Editors took over fundraising, support and development in 2011.

So how do you get an account? Write to info@documentcloud.org and tell IRE’s Lauren Grandestaff who you are and what you’re reporting on.

Amanda Hickman helped launch The New York World and was program director at DocumentCloud, a Knight News Challenge-funded project that reporters around the world are using to analyze, annotate, and publish primary source documents. She currently serves as an adjunct faculty member in interactive journalism at the City University of New York.

SIDEBAR

Examples of DocumentCloud at work

  • What emails reveal about the rescue effort at the massive mining disaster in West Virginia: Post. Document.
  • How a payday lending empire finances a famous auto racer: Story. Document (compares signatures).
  • An annotated fact sheet on fracking: Post. Document.
  • One of the many revelatory documents dug up in the aftermath of the BP oil blowout: Post. Document.
  • Another post-BP blowout revelation by ProPublica, this one on worker health: Post. Document.

More examples can be found here.


* From the quarterly newsletter SEJournal, Spring 2012. Each new issue of SEJournal is available to members and subscribers only; find subscription information here or learn how to join SEJ. Past issues are archived for the public here.