Google Refine

Posted on September 15, 2011 11:39 AM by Malecki

Tools worth knowing about:

Google Refine is a power tool for working with messy data, cleaning it up, transforming it from one format into another, extending it with web services, and linking it to databases like Freebase.

A recent discussion on the Polmeth list about the ANES Cumulative File is a setting where I think Refine might help (admittedly, 49760×951 is bigger than I’d really like to deal with in the browser with JavaScript, but on a subset, yes). [I might write this example up later.]

Go watch the screencast videos for Refine. Data-entry problems are rampant in stuff we all use — leading or trailing spaces; mixed decimal-indicators; different units or transformations used in the same column; mixed lettercase leading to false duplicates; that’s only the beginning. Refine certainly would help find duplicates, and it counts things for you too. Just counting rows is too much for researchers sometimes (see yesterday’s post)!
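
To make those data-entry problems concrete, here is a minimal sketch in plain Python of the kind of cleanup involved. This is not how Refine does it internally; the helper names `normalize` and `to_number` and the sample values are made up for illustration, and the decimal handling assumes there are no thousands separators in the column.

```python
from collections import Counter


def normalize(value: str) -> str:
    """Trim leading/trailing spaces, collapse inner whitespace runs, fold case."""
    return " ".join(value.split()).casefold()


def to_number(value: str) -> float:
    """Parse a number written with either ',' or '.' as the decimal mark.

    Assumption: the input has no thousands separators.
    """
    return float(value.strip().replace(",", "."))


# Hypothetical messy column: stray spaces and mixed lettercase.
raw = ["Ohio", " ohio", "OHIO ", "New  York", "new york", "Texas"]

# Count values after normalization; anything seen more than once was a
# false distinction caused only by spacing or case.
counts = Counter(normalize(v) for v in raw)
duplicates = {key: n for key, n in counts.items() if n > 1}

print(duplicates)
print(to_number("3,14"), to_number(" 2.5 "))
```

Six raw strings collapse to three distinct values here, which is exactly the sort of count that is easy to get wrong by eye.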

Refine 2.0 adds some data-collection tools for scraping and parsing web data. I have not had a chance to play with any of this kind of advanced scripting yet. I also have not had occasion to use Freebase, which seems somewhat similar to infochimps in that it is mostly open data with web APIs (for more on this, see the infochimps R package by Drew Conway).

2 thoughts on “Google Refine”

Tom Morris on September 15, 2011 at 12:51 pm said:

Only the UI for Google Refine runs in the browser. All the heavy lifting is done using Java in a private web server running on your machine. As long as you give it enough memory, 50K rows shouldn’t be an issue (although 50M cells is getting up there). I’d recommend turning off “automatically guess data types” on the import, though, for the first trials.

paul gronke on September 21, 2011 at 1:31 pm said:

How does this compare to the tools promoted by Yau such as beautiful soup? I am finding the Python hard to decode. Or encode. Or perhaps just code.
