“Frontiers in Massive Data Analysis”

Mike Jordan sends along this National Academies report on “big data.” This is not a research report, but it could be interesting in that it conveys what are believed to be important technical challenges.

20 thoughts on ““Frontiers in Massive Data Analysis””

  1. I could only access the first page of the executive summary. Maybe not a technical challenge (though I would argue it is that and more), but there is no mention of ethics/privacy.

    • from page 69: “Privacy is a very large topic in its own right, and this report does not attempt to address the privacy issues associated with massive data and its analysis.”

      If you just search through the PDF for the word “privacy” (assuming you are able to download it, @Snowden), you will find several mentions of “privacy,” but all have this same tone, i.e., it’s a big issue we aren’t going to discuss.

      • As I am stuck in the Moscow airport, I can only access the website through my smartphone. The option to read the PDF online is therefore a non-starter. To be honest, it seems that the Academy is running ahead of itself; it cannot even get basic usability right…

        And besides, why would the Academy charge for its reports? Are they not taxpayer-funded? Maybe I will file a Freedom of Information request to access the report….

        About 85 percent of funding comes from the federal government through contracts and grants from agencies, and 15 percent from state governments, private foundations, industrial organizations, and funds provided by the Academies’ member organizations.

        From Russia with love,

        S

        • PS Putin lent me his iPad. I now see that, after registering, the PDF is available for download.

          In passing, that registry might provide some useful information re people interested in big (sorry, massive) data, and that may be a threat to the country. The NSA should look into it.

  2. Ok, who can expand the series the farthest, or refine the gradations:

    No data
    Tiny data
    Small data
    Medium data
    Large data
    Big data
    Massive data
    Enormous data
    Gigantic data
    Ginormous data
    Infinite data,
    Infiginormous data,
    Massinfiginormous data, and
    Gigomassinfiginormous data.

    • PS note that by appending the word “scientist” to each entry above we get the proper designation for people who work with these types of data, as in “Gigomassinfiginormous data scientist”.

      Some scientists work with a variety of contiguous, monotonically increasing data sizes. For example, “bigger data scientist” refers to an individual who works with data sizes above big data, and so on.

      Notice that the bigger your data, the narrower your field.

    • Anon:

      We can get some practice by going in the opposite direction, extending your list from the other end:

      Tiny data
      No data
      Bad data
      Negative data
      Bad, negative data
      Lots of bad, negative data
      Lots and lots of bad, negative data
      Really really misleading data
      . . .

      • Soon: IBM PASW/SPSS/EVENMOAR

        “In this entirely new version of the software, SPSS adds noise to your data to make it EVEN MOAR massive.”

      • Sounds like something Douglas Adams would invent: Zaphod Beeblebrox’s rose-coloured statistics package. If the data doesn’t support what you wanted to find, it crashes.
        The ‘Psych Science’ version does your causal interpretation for you. “Interprets a causality even an SEM modeller would call correlational.”

    • Computer designers have long had well-defined terms for data sizes, i.e., megabytes, gigabytes, etc.
      In the 1990s, a well-known statistician proposed a taxonomy that started with Huber’s:

      Tiny (10^2 bytes)
      Small (10^4 bytes)
      Medium (10^6 bytes, i.e., 1 megabyte)
      Large (10^8 bytes, i.e., 100 MB)
      Huge (10^10 bytes, i.e., 10 GB)

      and added Ridiculous (10^12 bytes, i.e., 1 terabyte), which was stressful in 1994, but not now.

      All this nomenclature is just plain silly, for two reasons:
      a) people who actually do this for a living have precise terms for storage sizes.
      b) Trying to overload English words just confuses people. Quick: is massive bigger than huge? How about very large?

      The alternative approach is:
      a) Use well-established, specific terms for storage (and computational rates), as in the sketch below.
      b) Just say “Big Data” for data that stresses your resources (disk size, memory size, bandwidth, etc.), knowing perfectly well that what might have been “ridiculous” in 1994 is commonplace on personal computers 20 years later (although disk bandwidth has improved nowhere near as fast as density, and access times barely at all in a given price class).

      Likewise, “supercomputer” has been a moving target.
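
      For point (a), here is a minimal Python sketch of the kind of precise labeling the commenter has in mind, using standard SI prefixes; the function name and layout are illustrative assumptions, not anything from the comment or the report:

      SI_PREFIXES = ["B", "KB", "MB", "GB", "TB", "PB", "EB"]

      def si_size(n_bytes):
          """Render a byte count with a standard SI prefix, not a vague adjective."""
          value, unit = float(n_bytes), SI_PREFIXES[0]
          for prefix in SI_PREFIXES[1:]:
              if value < 1000:
                  break
              value, unit = value / 1000, prefix
          return f"{value:g} {unit}"

      print(si_size(1e8))   # 100 MB (Huber's "large")
      print(si_size(1e12))  # 1 TB (the 1994 "ridiculous")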

      • Why not make the classification time-dependent, as in:

        Tiny < 10^f(b * current year)

        where b captures Moore's law or some such.
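
        A minimal Python sketch of such a time-dependent cutoff; the 1994 anchors come from the Huber taxonomy above, while the two-year doubling period and all names here are illustrative assumptions:

        from datetime import date

        # Huber-style labels with their 1994 byte cutoffs (from the taxonomy above).
        CUTOFFS_1994 = [
            ("tiny", 1e2),
            ("small", 1e4),
            ("medium", 1e6),
            ("large", 1e8),
            ("huge", 1e10),
            ("ridiculous", 1e12),
        ]

        DOUBLING_YEARS = 2.0  # assumed Moore's-law-style doubling period

        def classify(n_bytes, year=None):
            """Label a size, inflating the 1994 cutoffs by 2^((year - 1994) / DOUBLING_YEARS)."""
            if year is None:
                year = date.today().year
            scale = 2 ** ((year - 1994) / DOUBLING_YEARS)
            for label, cutoff in CUTOFFS_1994:
                if n_bytes <= cutoff * scale:
                    return label
            return "big data"  # beyond even the scaled "ridiculous" cutoff

        print(classify(1e12, year=1994))  # ridiculous
        print(classify(1e12, year=2014))  # huge: a terabyte has slid down the scale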

        • In any case, it’s way more complex than Moore’s Law (which is relevant to DRAM densities, but not to disks or bandwidths or processing complexities).

          When there are perfectly good technical terms, assigning multiple vague English words is just plain silly, in my opinion. As it happens, I used to be a designer of supercomputer architectures, including the family to which this one belonged. See also “The Origins of ‘Big Data’: An Etymological Detective Story,” NYTimes.com.

        • John:

          I’m amazed at the people I get to meet on this blog!

          I see you are also a trustee of the Computer History Museum in Mountain View. I’ve been meaning to visit it for quite some time. (In the meantime, here is one computer you will appreciate: http://t.co/GPCQd5eXVb )

          PS In a fast-moving environment you probably want to focus on the relative, not the absolute, properties of the scale. But you are right that the cutoff function needs to be more complex.

  3. Size does not matter ;-)

    It’s how you process/interpret the data that makes it more or less misleading.

    Automatically interpreted data
    Superstitiously interpreted data
    Naively interpreted data
    Perceptively/purposefully interpreted data
    Omnipotently interpreted data.
