Mike Jordan sends along this National Academies report on “big data.” This is not a research report but it could be interesting in that it conveys what are believed to be important technical challenges.
I could only access the first page of the executive summary. Maybe ethics/privacy is not a technical challenge (though I would argue it is that and more), but there is no mention of it.
from page 69: “Privacy is a very large topic in its own right, and this report does not attempt to address the privacy issues associated with massive data and its analysis.”
If you just search through the PDF for the word “privacy” (assuming you are able to download it, @Snowden) you will find several mentions of “privacy,” but all have this same tone, i.e. it’s a big issue we aren’t going to discuss.
As I am stuck in the Moscow airport I can only access the website through my smartphone. The option to read the PDF online is therefore a non-starter. To be honest, it seems that the Academy is getting ahead of itself; they cannot even get basic usability right…
And besides, why would the Academy charge for its reports? Are they not taxpayer-funded? Maybe I will file a Freedom of Information request to access the report….
From Russia with love,
S
PS Putin lent me his iPad. I now see that after registering the PDF is available for download.
In passing, that registry might provide some useful information re people interested in big (sorry massive) data, and that may be a threat to the country. NSA should look into it.
Ok, who can extend the series the furthest, or add finer gradations:
No data
Tiny data
Small data
Medium data
Large data
Big data
Massive data
Enormous data
Gigantic data
Ginormous data
Infinite data
Infiginormous data
Massinfiginormous data
Gigomassinfiginormous data
PS note that by appending the word “scientist” to each entry above we get the proper designation for people who work with these types of data, as in “Gigomassinfiginormous data scientist”.
Some scientists work with a variety of contiguous, monotonically increasing data sizes. For example “bigger data scientists” refers to individuals who work with data sizes above big data, and so on.
Notice that the bigger your data, the narrower your field.
:D
You also get to release/sell a different software package for every level in that hierarchy?
e.g. SAS-Gigantic data Edition
or
Infiginormous non-statistical data analysis toolkit.
And books, courses etc.
Would ethics/privacy be different depending on the data size?
Anon:
We can make some practice by going in the opposite direction, extending your list from the other end:
Tiny data
No data
Bad data
Negative data
Bad, negative data
Lots of bad, negative data
Lots and lots of bad, negative data
Really really misleading data
. . .
Soon: IBM PASW/SPSS/EVENMOAR
“In this entirely new version of the software, SPSS adds noise to your data to make it EVEN MOAR massive.”
Sounds like something Douglas Adams would invent: Zaphod Beeblebrox’s rose-coloured statistics package. If the data doesn’t support what you wanted to find, it crashes.
The ‘Psych Science’ version does your causal interpretation for you. “Interprets a causality even an SEM modeller would call correlational.”
Computer designers have long had well-defined terms for data sizes, e.g., megabytes, gigabytes, etc.
In the 1990s, a well-known statistician proposed a taxonomy that started with Huber’s
Tiny (10^2 bytes),
Small (10^4),
Medium (10^6, i.e. 1 megabyte),
Large (10^8, i.e. 100 MB),
Huge (10^10, actually 10 GB),
and added Ridiculous (10^12). Actually, that’s 1 terabyte, which was stressful in 1994, not now.
All this nomenclature is just plain silly, for two reasons:
a) people who actually do this for a living have precise terms for storage sizes.
b) Trying to overload English words just confuses people. Quick: is massive bigger than huge? How about very large?
The alternative approach is:
a) Use well-established, specific terms for storage (and computational rates)
b) Just say “Big Data” for data that stresses current systems (disk size, memory size, bandwidths, etc.), knowing perfectly well that what might be “ridiculous” in 1994 is commonplace on personal computers 20 years later (although of course disk bandwidths have improved nowhere near as fast as density, and access times barely at all in a given price class).
Likewise, “supercomputer” has been a moving target.
Why not make the classification time dependent as in:
Tiny < 10^f(b * current year)
where b captures Moore’s law or some such.
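A minimal sketch of that time-dependent classification in Python, taking Huber’s 1994 cutoffs as the base and assuming an 18-month doubling period for the thresholds (both the doubling rate and the base year are illustrative assumptions, not anything from the report):

```python
BASE_YEAR = 1994      # year Huber's original cutoffs were stated
DOUBLING_YEARS = 1.5  # assumed doubling period (Moore's-law-ish)

# Huber-style labels with their 1994 upper bounds in bytes.
LABELS_1994 = [
    ("tiny", 10**2),
    ("small", 10**4),
    ("medium", 10**6),
    ("large", 10**8),
    ("huge", 10**10),
    ("ridiculous", 10**12),
]

def classify(n_bytes: float, year: int) -> str:
    """Return the size label for n_bytes, with every cutoff
    inflated by 2 ** ((year - BASE_YEAR) / DOUBLING_YEARS)."""
    growth = 2 ** ((year - BASE_YEAR) / DOUBLING_YEARS)
    for label, cutoff_1994 in LABELS_1994:
        if n_bytes < cutoff_1994 * growth:
            return label
    return "ridiculous"

# A 1 TB data set was "ridiculous" in 1994...
print(classify(10**12, 1994))   # prints "ridiculous"
# ...but under these assumptions only "large" twenty years later.
print(classify(10**12, 2014))   # prints "large"
```

The point of the sketch is just that the labels are relative: the same byte count slides down the scale as the growth factor inflates the cutoffs over time.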
In any case, it’s way more complex than Moore’s Law (which is relevant to DRAM densities, but not disks or bandwidths or processing complexities).
When there are perfectly good technical terms, assigning multiple vague English words is just plain silly, in my opinion. As it happens, I used to be a designer of supercomputer architectures, including the family to which this one belonged. See also The Origins of ‘Big Data’: An Etymological Detective Story, NYTimes.com.
John:
I’m amazed at the people I get to meet in this blog!
I see you are also a trustee of the Computer History Museum in Mountain View. I’ve been meaning to visit it for quite some time. (In the meantime, here is one computer you will appreciate: http://t.co/GPCQd5eXVb )
PS In a fast-moving environment you probably want to focus on the relative, not the absolute, properties of the scale. But you are right that the cutoff function needs to be more complex.
Size does not matter ;-)
It’s how you process/interpret the data that makes it more or less misleading.
Automatically interpreted data
Superstitiously interpreted data
Naively interpreted data
Perceptively/purposefully interpreted data
Omnipotently interpreted data.
K?:
You may say what you want but my data is bigger than yours.
“Click here to enlarge your data”
Clearly spam. It cannot be true. I am the boundary condition.