Last chance saloon: Manually converting a PDF to .csv format

I’ve been working at the Guardian this week and was faced with an important dataset that happened to be in PDF format. While PDFs look pretty, they’re all but useless for data journalists.

You can’t copy and paste PDF tables into a spreadsheet. Try it and you’ll end up with something that looks like this:

Pasting a PDF into a spreadsheet will result in random cell merging, odd spacing and a lot of unnecessary work.


Why organisations insist on the more labour-intensive task of publishing their data this way rather than as a spreadsheet, I don’t know. But it is what it is, and there are ways around the obstacle.

The obvious and easiest way to convert a PDF to a .csv or .xls file is to use one of the many free online conversion services. Simon Rogers at the Guardian suggested PDFtoExcelOnline.com, which is a free version of the more powerful conversion software made by Nitro.

The data I was looking at was released by the United Nations High Commissioner for Refugees. It was a series of yearly reports outlining the number of refugees from every country in the world, from 2003 to 2011. That’s nine lists each of well over 200 rows of data.

PDFs: Pretty but generally useless.


The online service is great, and the data generally comes back in a coherent form that needs some, but not much, tinkering to clean it up. Unfortunately, two of the reports could not be converted. With the free service this sometimes just happens: for whatever reason, the converter doesn’t like the file being uploaded.

I was left with only two options. The first was to type the data I wanted into a spreadsheet by hand. Given the size of the dataset, that just wasn’t realistic. The second was to make a .csv out of the PDF myself. It turns out that, using the method I’m about to explain, a table of more than 200 rows can be converted into a .csv, ready to use in Excel or Google Spreadsheets, in around five minutes.

Click on one of the thumbnails below to view a slideshow of exactly how I did this.
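If you would rather script the job, the same idea can be roughed out in a few lines of Python. To be clear, this is not the method from the slideshow, just a minimal sketch under some assumptions of my own: that you have copied the table out of the PDF as plain text, and that each line is a text label (such as a country name) followed by numeric columns separated by spaces, with "-" standing in for a missing value. The function names `split_row` and `text_to_csv` are made up for this example, and real PDFs will often need extra clean-up.

```python
import csv
import io
import re

# Assumed layout (not taken from any real report): a text label followed by
# numeric columns, e.g. "Country A 1,234 56 -". A lone "-" is treated as a
# placeholder for a missing value.
NUMERIC = re.compile(r"^(?:-|\d[\d,]*(?:\.\d+)?)$")


def split_row(line):
    """Split one pasted line into [label, value, value, ...]."""
    tokens = line.split()
    values = []
    # Peel numeric-looking tokens off the right-hand end of the line;
    # whatever is left over is the (possibly multi-word) label.
    while tokens and NUMERIC.match(tokens[-1]):
        values.append(tokens.pop().replace(",", ""))  # drop thousands commas
    values.reverse()
    return [" ".join(tokens)] + values


def text_to_csv(text):
    """Convert pasted table text into a CSV string, skipping blank lines."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        if line.strip():
            writer.writerow(split_row(line))
    return out.getvalue()


if __name__ == "__main__":
    # Placeholder rows, invented purely for illustration.
    sample = "Country A 1,234 56\nCountry B and C 7 -\n"
    print(text_to_csv(sample))
```

Working from the right-hand end of each line means multi-word labels survive intact, which is where a simple split on spaces tends to go wrong. A label that itself ends in a bare number would be misread as data, so check the output before relying on it.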
