Turning pages into data

There is a lot of data on the web that is meant to be looked at by people, but how do you turn it into a spreadsheet that people could actually analyze statistically?

The technique of turning web pages intended for people into structured data sets intended for computers is called “screen scraping.” It has just been made easier by ScraperWiki (http://scraperwiki.com/), a wiki and community built around scrapers.
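To make the idea concrete, here is a minimal sketch of a screen scraper, independent of any particular service: fetch a page, pull the rows out of an HTML table, and write them to a CSV you can open in a spreadsheet. The URL and the table layout are made up for illustration; the HTML parsing uses the third-party BeautifulSoup library.

```python
# Minimal screen-scraping sketch: page -> table rows -> CSV.
# The URL and table structure are hypothetical.
import csv
import urllib.request

from bs4 import BeautifulSoup  # third-party HTML parser

URL = "http://example.com/accidents.html"  # hypothetical page with a data table

html = urllib.request.urlopen(URL).read()
soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.find("table").find_all("tr"):
    # Collect the text of every header/data cell in this row.
    cells = [td.get_text(strip=True) for td in tr.find_all(["th", "td"])]
    if cells:
        rows.append(cells)

with open("accidents.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```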

They provide libraries to extract information from PDF and Excel files, to fill in forms automatically, and so on. Moreover, the community aspect should let researchers who are scraping similar things find each other. It’s very good. Here are examples: scraping road accident data and Port of London ship arrivals.
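For comparison, here is roughly what the same scraper looks like when written on ScraperWiki itself, where results go into the site’s SQLite datastore instead of a local CSV. The `scraperwiki.scrape` and `scraperwiki.sqlite.save` calls follow the classic ScraperWiki Python library as I recall it, and the URL and column names are hypothetical, so treat this as a sketch rather than a working example.

```python
# Sketch of a ScraperWiki-style scraper: scrape a page and save each record
# into the site's SQLite datastore. URL and columns are hypothetical.
import scraperwiki
from bs4 import BeautifulSoup

html = scraperwiki.scrape("http://example.com/ship-arrivals.html")
soup = BeautifulSoup(html, "html.parser")

for tr in soup.find("table").find_all("tr")[1:]:  # skip the header row
    ship, arrived, berth = [td.get_text(strip=True) for td in tr.find_all("td")]
    # unique_keys tells the datastore which columns identify a record,
    # so re-running the scraper updates rows instead of duplicating them.
    scraperwiki.sqlite.save(
        unique_keys=["ship", "arrived"],
        data={"ship": ship, "arrived": arrived, "berth": berth},
    )
```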

You can already find collections of structured data online; examples include Infochimps (“find the world’s data”) and Freebase (“An entity graph of people, places and things, built by a community that loves open data.”). There’s also a repository system for data, TheData (“An open-source application for publishing, citing and discovering research data”).

The challenge is how to keep these efforts alive and active. One early company helping people screen-scrape was Dapper, which now helps retailers advertise by scraping their own websites. Perhaps library funding should go towards tools like these rather than towards piling up physical copies of expensive journals that everyone reads online anyway.

Some earlier posts on this topic: [1], [2].