Mike Jordan sends along this National Academies report on “big data.” This is not a research report but it could be interesting in that it conveys what are believed to be important technical challenges.
I could only access the first page of the executive summary. Maybe ethics/privacy is not a technical challenge (though I would argue it is that and more), but there is no mention of it.
from page 69: “Privacy is a very large topic in its own right, and this report does not attempt to address the privacy issues associated with massive data and its analysis.”
If you just search through the PDF for the word “privacy” (assuming you are able to download it, @Snowden) you will find several mentions of “privacy,” but all have this same tone, i.e. it’s a big issue we aren’t going to discuss.
As I am stuck in the Moscow airport I can only access the website through my smartphone. The option to read the PDF online is therefore a non-starter. To be honest, it seems that the Academy is getting ahead of itself; they cannot even get basic usability right…
And besides, why would the Academy charge for its reports? Are they not taxpayer-funded? Maybe I will file a Freedom of Information request to access the report….
From Russia with love,
S
PS Putin lent me his iPad. I now see that after registering the PDF is available for download.
In passing, that registry might provide some useful information re people interested in big (sorry massive) data, and that may be a threat to the country. NSA should look into it.
Ok, who can extend the series the furthest, or add finer gradations:
No data
Tiny data
Small data
Medium data
Large data
Big data
Massive data
Enormous data
Gigantic data
Ginormous data
Infinite data
Infiginormous data
Massinfiginormous data
Gigomassinfiginormous data
PS note that by appending the word “scientist” to each entry above we get the proper designation for people who work with these types of data, as in “Gigomassinfiginormous data scientist”.
Some scientists work with a variety of contiguous, monotonically increasing data sizes. For example “bigger data scientists” refers to individuals who work with data sizes above big data, and so on.
Notice that the bigger your data, the narrower your field.
:D
You also get to release/sell a different software package for every level in that hierarchy?
e.g. SAS-Gigantic data Edition
or
Infiginormous non-statistical data analysis toolkit.
And books, courses etc.
Would ethics/privacy be different depending on the data size?
Anon:
We can make some practice by going in the opposite direction, extending your list from the other end:
Tiny data
No data
Bad data
Negative data
Bad, negative data
Lots of bad, negative data
Lots and lots of bad, negative data
Really really misleading data
. . .
Soon: IBM PASW/SPSS/EVENMOAR
“In this entirely new version of the software, SPSS adds noise to your data to make it EVEN MOAR massive.”
Sounds like something Douglas Adams would invent: Zaphod Beeblebrox’s rose-coloured statistics package. If the data doesn’t support what you wanted to find, it crashes.
The ‘Psych Science’ version does your causal interpretation for you. “Interprets a causality even an SEM modeller would call correlational.”
Computer designers have long had well-defined terms for data sizes, e.g., megabytes, gigabytes, etc.
In the 1990s, a well-known statistician proposed a taxonomy that started with Huber’s
Tiny (10^2 bytes),
Small (10^4),
Medium (10^6, i.e. 1 megabyte),
Large (10^8, i.e. 100 MB),
Huge (10^10, actually 10 GB),
and added Ridiculous (10^12). Actually, that’s 1 terabyte, which was stressful in 1994, not now.
All this nomenclature is just plain silly, for two reasons:
a) people who actually do this for a living have precise terms for storage sizes.
b) Trying to overload English words just confuses people. Quick: is massive bigger than huge? How about very large?
The alternative approach is:
a) Use well-established, specific terms for storage (and computational rates)
b) Just say “Big Data” for data that stresses current systems (disk size, memory size, bandwidths, etc.), knowing perfectly well that what might be “ridiculous” in 1994 is commonplace on personal computers 20 years later (although of course disk bandwidths have improved nowhere near as fast as density, and access times barely at all in a given price class).
Likewise, “supercomputer” has been a moving target.
Why not make the classification time dependent as in:
Tiny < 10^f(b * current year)
where b captures Moore’s law or some such.
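A minimal sketch of that time-dependent classification in Python, taking Huber’s 1994 cutoffs as the base and assuming an 18-month doubling period for the thresholds (both the doubling rate and the base year are illustrative assumptions, not anything from the report):

```python
BASE_YEAR = 1994      # year Huber's original cutoffs were stated
DOUBLING_YEARS = 1.5  # assumed doubling period (Moore's-law-ish)

# Huber-style labels with their 1994 upper bounds in bytes.
LABELS_1994 = [
    ("tiny", 10**2),
    ("small", 10**4),
    ("medium", 10**6),
    ("large", 10**8),
    ("huge", 10**10),
    ("ridiculous", 10**12),
]

def classify(n_bytes: float, year: int) -> str:
    """Return the size label for n_bytes, with every cutoff
    inflated by 2 ** ((year - BASE_YEAR) / DOUBLING_YEARS)."""
    growth = 2 ** ((year - BASE_YEAR) / DOUBLING_YEARS)
    for label, cutoff_1994 in LABELS_1994:
        if n_bytes < cutoff_1994 * growth:
            return label
    return "ridiculous"

# A 1 TB data set was "ridiculous" in 1994...
print(classify(10**12, 1994))   # prints "ridiculous"
# ...but under these assumptions only "large" twenty years later.
print(classify(10**12, 2014))   # prints "large"
```

The point of the sketch is just that the labels are relative: the same byte count slides down the scale as the growth factor inflates the cutoffs over time.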
In any case, it’s way more complex than Moore’s Law (which is relevant to DRAM densities, but not disks or bandwidths or processing complexities).
When there are perfectly good technical terms, assigning multiple vague English words is just plain silly, in my opinion. As it happens, I used to be a designer of supercomputer architectures, including the family to which this one belonged. See also The Origins of ‘Big Data’: An Etymological Detective Story, NYTimes.com.
John:
I’m amazed at the people I get to meet in this blog!
I see you are also a trustee of the Computer History Museum in Mountain View. I’ve been meaning to visit it for quite some time. (In the meantime, here is one computer you will appreciate: http://t.co/GPCQd5eXVb )
PS In a fast-moving environment you probably want to focus on the relative, not the absolute, properties of the scale. But you are right that the cutoff function needs to be more complex.
Size does not matter ;-)
It’s how you process/interpret the data that makes it more or less misleading.
Automatically interpreted data
Superstitiously interpreted data
Naively interpreted data
Perceptively/purposefully interpreted data
Omnipotently interpreted data.
K?:
You may say what you want but my data is bigger than yours.
“Click here to enlarge your data”
Clearly spam. It cannot be true. I am the boundary condition.