
“Infochimps: Find any Dataset in the World”

Hal Daume pointed me to this. Could be useful, no?


  1. wcw says:

    It's missing the first thing I punched in, admittedly a bear of a data set but promulgated by the Fed, so hardly unavailable.

    In all fairness, is missing that one, too.

  2. Alex Cook says:

    The website only has three data sets listed under 'biology', and one of those isn't even biological.

    It seems the slogan is more aspiration than fact.

  3. Anne says:

    I'm a bit puzzled by the intended scope of this. What do they mean to include as data sets? Hundred-terabyte raw pulsar survey data sets? The ATNF pulsar catalogue? SIMBAD? The Fermi photon list (which is continually updating)? The astronomy (and related sciences) literature database? There is a staggering variety of data sets in astronomy alone, even if you restrict yourself to ones that are of manageable size. Unless they make it a little clearer what they mean to include, it's going to be pretty difficult to know when to bother searching there.

  4. zbicyclist says:

    Could be useful. We make available (and I manage) a 70 gig database of US Consumer Packaged Goods information used by marketing and economics academics, and something like this might work for sets like this. We have to charge a small fee to defray production expenses (we ship a USB drive).

    Unfortunately, due to the nature of the way we acquire this data we require a signed NDA, so it wouldn't work for us as an archive.

  5. Although the Infochimp folks are interested in every type of data, the way to think about whether a data set might be appropriate for inclusion is to answer 2 questions.

    Who would want this data? Are people likely to search for it, or is it something more specialized that is already easy to find and access if you are an expert in the field? Astronomy, genomics, and PubMed data are in this latter category. These data are already curated and searchable. People who want them know where to find them. In contrast, the consumer finance data from the Fed is more likely to be of general interest. A political scientist or marketer might want this type of data but not know where to look for it.

    Does at least 1 column of the data hook up with something already being served from Infochimps? One of their long-term goals is to mark up all of the data so that you can link it together. Their basic example is historical baseball scores and weather readings: both carry a date and a city, so it should be easy to merge them by day and city. They do a lot of work to make sure that each column of data is labeled with appropriate units.

    This interview about their release of a data set that maps IP addresses, ZIP codes, and census data exemplifies the kind of data they are after.
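
As a rough illustration of the "merge by day and city" idea in the comment above, here is a minimal Python sketch. It is not Infochimps code: the rows are made-up sample values, and the field names (`runs_scored`, `temp_f`) and the function `merge_by_day_and_city` are hypothetical. The column name `temp_f` carries its unit (degrees Fahrenheit), in the spirit of labeling each column with appropriate units.

```python
from datetime import date

# Made-up sample rows for illustration only; not real scores or readings.
games = [
    {"date": date(2001, 6, 1), "city": "Chicago", "runs_scored": 7},
    {"date": date(2001, 6, 1), "city": "Boston", "runs_scored": 3},
    {"date": date(2001, 6, 2), "city": "Chicago", "runs_scored": 4},
]
weather = [
    {"date": date(2001, 6, 1), "city": "Chicago", "temp_f": 71.0},
    {"date": date(2001, 6, 1), "city": "Boston", "temp_f": 64.5},
    {"date": date(2001, 6, 3), "city": "Chicago", "temp_f": 80.0},
]


def merge_by_day_and_city(games, weather):
    """Inner-join game rows with weather rows on the (date, city) pair."""
    # Index weather readings by the shared key so each lookup is O(1).
    weather_by_key = {(w["date"], w["city"]): w for w in weather}
    merged = []
    for game in games:
        reading = weather_by_key.get((game["date"], game["city"]))
        if reading is not None:  # keep only games with a matching reading
            merged.append({**game, "temp_f": reading["temp_f"]})
    return merged


merged = merge_by_day_and_city(games, weather)
for row in merged:
    print(row["date"].isoformat(), row["city"], row["runs_scored"], row["temp_f"])
```

The join only works because both tables share a column with the same meaning and format, which is why a shared, well-labeled column is the second test for inclusion.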

  6. says:

    Andrew, thanks for the kind mention. We're avid readers of your 538 columns, and hope that infochimps helps make that kind of cross-disciplinary data mashing widespread.

    Our slogan is indeed aspirational: we started the site with the belief that only a crowdsourced kind of 'wikipedia for data'(*) can possibly succeed in making the web of structured data discoverable. @Alex: our biology coverage is slightly better than the three you found, but still far short of acceptable. We've just made it far easier to add a dataset, so please try it out and tell us what you think. (@wcw: I tested tonight's deploy using the fed survey, thanks).

    If any reader sends us a good long list of data sources, or has a particular tag used to tag datasets, we can bulk-load those directly.

    @Anne: the answer is yes, we want to catalog all of those, and to host the datasets themselves where it simplifies acquisition. You can catalog a dataset by simply adding a link; we can host the data too, but for the datasets you pointed out that would only complicate things. Once the data is catalogued, though, economists and biologists and journalists can discover it, and, as @mja highlighted, the process of knitting these together into a connected web of data can begin.

    If you have any more feedback, please get in touch with me. We need astronomers to bootstrap in the core astronomy datasets, and biologists the biology ones, and so forth. What features can we add to make it easier for you to carry this forward to your communities?

    (*) to clarify: on infochimps, the 'wikipedia' aspect extends to the metadata for a package. You can't walk up and edit the numbers inside NOAA's weather data, but you can help by adding a link, usage notes, or a script; and you can help by cleaning up the data and adding it as a dataset under your own name.

    — Philip (flip) Kromer,