PDF Liberation Hackathon Resource Page

In preparation for the PDF Liberation Hackathon, we’re putting together a short how-to of different PDF Liberation tools starting with Tabula – a tool to extract tables of data from PDFs.

Tabula

Tabula is an open source tool built by Manuel Aristarán with the help of ProPublica, La Nación DATA and Knight-Mozilla OpenNews.

When you first open Tabula, you’re given the option to load PDFs into the system. For this example, we’ve taken the monthly veterans report from the Illinois Department of Employment Security (currently only available in PDF) and loaded it into Tabula.

TabulaDemo1

Once you upload it, Tabula will process the file. This can take a little bit of time depending on the size of the file.

Once it’s loaded, you simply draw rectangles over the tables in the PDF.

TabulaDemo2

From there, Tabula will show you the data that it’s captured. Now you can copy the data to the clipboard or download it to your local machine as a file. It’s that simple.

TabulaDemo3
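Once the table is downloaded as a CSV file, it can be read with a few lines of code. The following is a minimal sketch in Python, not part of Tabula itself: the `read_table` helper, the column names, and the sample values are all made up for illustration and do not come from the Illinois Department of Employment Security report.

```python
import csv
import io


def read_table(csv_text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    reader = csv.reader(io.StringIO(csv_text))
    # The first row is treated as the header; trim stray whitespace.
    header = [name.strip() for name in next(reader)]
    rows = []
    for record in reader:
        # Skip blank lines, which sometimes appear in extracted tables.
        if not any(cell.strip() for cell in record):
            continue
        cells = [cell.strip() for cell in record]
        rows.append(dict(zip(header, cells)))
    return rows


# Hypothetical sample standing in for a file downloaded from Tabula.
sample = "County, Veterans Served\nCook, 120\n\nLake, 45\n"

rows = read_table(sample)
total = sum(int(row["Veterans Served"]) for row in rows)
print(rows)
print(total)
```

To use it on a real export, read the downloaded file's contents into a string and pass that to `read_table` in place of `sample`.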

You can find more information on how Tabula works on Source. You can also find a list of other PDF extraction tools on the national PDF Liberation Hackathon homepage.

OpenRefine

Once you get the data into a CSV, you may have to clean it up. A common tool for this is OpenRefine (formerly Google Refine). You can load a CSV file into OpenRefine and dig into the data to find possible data entry errors (somebody writing in Chicgo instead of Chicago), transform the data (change the format of a cell to show currency instead of text), and easily spot inconsistencies in the data (one set of entries classified as ‘phone’ and another as ‘phone number’).

OpenRefine also has comprehensive documentation on how to use it, including video tutorials. Here’s the video that introduces OpenRefine.

You can find out more information on OpenRefine on their website.
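OpenRefine does this kind of cleanup interactively, but it can help to see the three ideas (fixing entry errors, transforming formats, and reconciling inconsistent labels) written out. The sketch below is plain Python using only the standard library. It is not OpenRefine's API, and the reference city list, the label mapping, and the function names are all assumptions made for illustration.

```python
import difflib
from decimal import Decimal

# Hypothetical reference values; a real project would supply its own.
KNOWN_CITIES = ["Chicago", "Springfield", "Peoria"]
LABEL_ALIASES = {"phone number": "phone"}


def fix_city(name, known=KNOWN_CITIES, cutoff=0.8):
    """Return the closest known city name, or the trimmed input if none is close."""
    candidate = name.strip()
    matches = difflib.get_close_matches(candidate.title(), known, n=1, cutoff=cutoff)
    return matches[0] if matches else candidate


def normalize_label(label, aliases=LABEL_ALIASES):
    """Lowercase a category label and collapse known variants into one form."""
    cleaned = label.strip().lower()
    return aliases.get(cleaned, cleaned)


def parse_currency(text):
    """Convert a string such as '$1,250.50' into a Decimal amount."""
    return Decimal(text.replace("$", "").replace(",", "").strip())


print(fix_city("Chicgo"))
print(normalize_label("Phone Number"))
print(parse_currency("$1,250.50"))
```

The `cutoff` value controls how aggressive the typo matching is. A lower value catches more misspellings but risks merging names that really are different, so any automatic corrections are worth reviewing by hand.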

Google Fusion Tables

Once you have the data you’re interested in, you can load it into Google Fusion Tables in order to build apps that use the data.

Google Fusion Tables operates much the same way an Excel spreadsheet does. The difference is that you can use the Google Fusion Tables API to load data into your civic app. A good example of this is Derek Eder’s Searchable Map Template.

Do you have PDFs that need to be liberated? Interested in freeing the data?

If you have PDFs that you’d like to see data extracted from, you can fill out the Wufoo form here. If you’re interested in taking part in the PDF Liberation Hackathon, you can RSVP for the event here.

Hack Night Live: CivicLab

Today is the first Chicago OpenGov Hack Night of 2014!

Benjamin Sugar will introduce us to CivicLab, a new space that has opened up in Chicago to encourage ‘civic making.’ He’ll also touch on the intersection of civic media and its relationship to activism and community organizing.

As always, we’ll be live streaming the event right here on this blog. The live stream will start at about 6:15 CST.

Excerpt: Beyond Transparency, Building a Smarter Chicago: An Era of Incidental Transparency

For the next few Tuesdays, we will be excerpting sections from “Beyond Transparency: Open Data and the Future of Civic Innovation,” an anthology edited by Brett Goldstein with Lauren Dyson and published by Code for America.

I wrote a chapter titled “Building a Smarter Chicago,” which I call “an illustrative, incomplete, and idiosyncratic look at the ecosystem in Chicago. It is meant to provide a thumbnail take on how the ecosystem developed here, while sparking fires elsewhere.” Here’s the introduction and the first section, which gives a short history of the ecosystem:

Introduction

As the open data and open government movement continues, there is a lot of talk about building local ecosystems for the work. The general idea is that there has to be a mildly magic combination of data, policy, developers, capital, and products to enable the kind of growth that is necessary to take the movement to the next level—where there is a mature market for open government products that serve real community needs and lead to sustainable revenue.

The thing about building an ecosystem is that when it is done deliberately, it can be a slog. Building a developer community from scratch, convincing local government to publish data, getting venture capitalists to take a look at open government projects—all of this is tough work that takes time.

By looking at the Chicago example, however, we can see that there’s often more built than it first seems. The components can be found, in varying degrees, in any unit of government. The trick is to find, cobble, and congeal these pieces together.

What follows is an illustrative, incomplete, and idiosyncratic look at the ecosystem in Chicago. It is meant to provide a thumbnail take on how the ecosystem developed here, while sparking fires elsewhere.

Data: An Era of Incidental Transparency

The story starts with Citizen ICAM (Information Collection for Automated Mapping), the granddaddy of all crime mapping applications, created by the Chicago Police Department in May 1995. I wrote about this system back in 2006 because I wanted to understand the archaeology of this distinctly unique (and relatively difficult to use) interface (O’Neil, 2006). You can learn a lot about software by its backstory. Here’s the first sentence of a July 1996 National Institute of Justice report on Citizen ICAM:

To better understand the nature and extent of criminal and social problems in the community and improve allocation of resources, a growing number of crime control and prevention organizations are turning to computerized mapping. (Rich, 1996)

The impetus behind the project (“Citizen” is the first word in its name) was the Chicago Alternative Policing Strategy (CAPS) program. Here’s another snip from the 1996 report:

ICAM was developed as part of CPD’s far-reaching and ambitious community policing strategy. Unlike many other community-policing programs that are limited to a single unit in the department, the Chicago Alternative Policing Strategy (CAPS) is department-wide. The strategic plan for reinventing CPD describes CAPS as a “wholesale transformation of the department, from a largely centralized, incident-driven, crime suppression agency to a more decentralized, customer-driven organization dedicated to solving problems, preventing crime, and improving the quality of life in each of Chicago’s neighborhoods.”

In fact, CAPS is really a city program with strong support from the Mayor’s office and close involvement of city agencies, which have been directed to give top priority to “CAPS service requests” that affect crime and neighborhood safety. (Rich, 1996)

This twenty-year-old project is a model for where we need to be now—and where the movement seems to be heading. It starts with deep input from residents to form a “customer-driven organization.”

In the technology world, we call these people “users.”

Adrian Holovaty’s ChicagoCrime.org—widely considered a major impetus in the open data movement—simply would not have existed without Citizen ICAM (Holovaty, 2008). At the same time, ChicagoCrime.org was certainly not well-formed public data. For instance, all data was retrieved by scraping with obscure URL calls that ignored the user interface, which limited searches to a quarter-mile radius.

Another example is transit data “published” by the Chicago Transit Authority in the context of their proprietary Bus Tracker system. I covered this extensively in a January 2009 blog post (O’Neil, 2009). The upshot is that Harper Reed scraped all data driving the app, cached it, and served it to developers. This led to a blossoming of transit-focused apps.

The culmination of this work is the publication of the CTA’s own API, a document wherein Harper and I are explicitly called out for helping them develop it:

Special thanks go to Harper Reed and Dan O’Neil for their support and encouragement, and to the independent development community, for showing such great interest in developing applications with CTA data, leading to the creation of this official API. Thank you. (Chicago Transit Authority, 2011)

This is the kind of inside/outside game that is also essential to the ecosystem. You have to work with government institutions to make their data fluency and data policy better.

A last example of early data in Chicago (and perhaps the first explicitly conscious publication of data in the city) is the wealth of Geographic Information Systems (GIS) data published by the City of Chicago. This was another early reason why ChicagoCrime (and, by extension, EveryBlock) could exist. Their policy was formalized in July 2007, but the data had been available long before that (City of Chicago, 2007).

The first section of their documentation, “Data Sharing Principles,” has the idea that public information should be public: “Wherever possible, direct requestors to publicly available internet sources of map information.”

This is the moment when the governmental provision of data goes from incidental to essential. Before that magic moment, it’s important for developers and citizens to look harder for data published in plain sight.

On the Launch of Expunge.io

We host a lot of apps at Smart Chicago through our Developer Resources program. In many instances, we get a request, fire up an EC2 instance, and the site is off and running. In other instances, we provide hours of behind-the-scenes technical assistance, product advice, and general jibber-jabber. Other times, we conceive of and execute on the thing on our own, or develop a site for clients.

Today’s launch of Expunge.io by Cathy Deng and the Mikva Challenge Juvenile Justice Council (JJC) is in a category all by itself. The site, which helps start the process of erasing juvenile arrests and/or court records, is pretty much why we exist. For us, it grew naturally out of work we did over our CivicSummer, interacting with youth on the JJC about what interested them, where their research took them, and what issues mattered most to them.

Then, as summer became autumn, I talked on a regular basis with the intrepid and indefatigable Chris Rudd about the need for an app that helped sort out the essential but obtuse process of expungement.

He never gave up, he never stopped talking about it, and he never stopped learning. We talked about how the core of the app was not the technology: it was the info that he, the JJC youth, and their partners like the people at the Juvenile Expungement Help Desk had and had to share.

Then comes Cathy Deng, a super-smart developer looking to do civic apps that make a difference. We had invited her to join us at our final #CivicSummer session at Roosevelt University, where she got a feel for what everyone was up to and connected to people on Twitter. Then came this:

We set them up with space, and Smart Chicago tech consultant & general civic tech godfather Scott Robbin provided some tech guidance. Mikva Challenge Executive Director Brian Brady supported all of this with the vision of a creative organization not bogged down by process. The legal partners and justice system personnel have been highly supportive.

So here we are today. If you know anyone with a juvenile record and 4 minutes of extra time on their hands, send them to Expunge.io now.

White House Safety Datapalooza Resources

On January 14th, the White House will host the Safety Datapalooza. The event will focus on safety data resources in the areas of transportation, food, occupational and consumer product safety, and disaster preparation and response.

Chicago’s civic innovation community has worked in these areas in some detail, and with more data being released, there’s potential for much more work to be done. Below, we’ve put together a collection of data sources from local and federal resources and given some examples of safety apps being used around the country.


Jim McGowan of the Chicago Red Cross demonstrates the volunteer connection app at their headquarters.


Knight Prototype Fund Grant Awarded to Joe Olson to be Administered by Smart Chicago

Today the Knight Foundation awarded a Prototype Fund award to Joe Olson, one of the developers of Foodborne Chicago, to “develop strategies to engage with targeted communities currently being missed through Chicago’s Twitter-based food poisoning incident detection system.”

Smart Chicago, through our fiscal agent, The Chicago Community Trust, is going to administer this grant. We are also helping out by using our CUTGroup program.