Ulrich Atz writes:

I regard myself as fairly familiar with modern “big data” tools and models such as random forests, SVMs, etc. However, HyperCube is something I haven’t come across yet (I met their marketing guy last week), and they advertise it as “disruptive”, “unique”, “the best-performing data analysis tool available”.

Have you seen it in action? Perhaps performing in any data science style competition?

On a side note, they claim it is “non-statistical” which I find absurd. A marketing ploy, but sounds like physics without math.

Hence, my question:

Do you think there is such a thing as a (1) non-statistical data analysis and (2) non-statistical data set?

Here’s what’s on the webpage:

The technology is non-statistical, meaning it does not take a sample and use algorithms in order to validate a hypothesis. Instead, it takes input from a large volume of data and outputs the results from the data alone. This means that all the available data is taken into account.

I’m not quite sure what the difference is between “take a sample” and “take input from a large volume of data.” All their examples involve generalizing from their ~~sample~~ data to a population. This sounds statistical to me.

The webpage continues:

The lack of a hypothesis is another advantage of HyperCube over statistics. HyperCube exposes the rules and dependencies that are indicated by the data, and is not tied to any previously held view. Statistics, on the other hand, test data to see whether it proves a specified scenario.

Again, to me this misses the point. The available data are not the point; they are a means to the larger goal of making predictions about future cases.

That said, even if the authors of this press material are confused about statistical inference and sampling, the software package could be good. I have no idea.

The key paragraph is:

How Does HyperCube Work?

HyperCube is a mathematical algorithm and rule generation technology that offers an explanation of the driving factors behind a complex mathematical issue or a scientific phenomenon, by identifying the sets of simultaneous conditions that yield a higher frequency of a specific occurrence. The output is a set of rules that are expressed in terms of the variables or dimensions in the dataset, and are easily understandable by end users

This is a standard-issue pattern-recognition tool that makes no use of the context or meaning of the variables, or of any information you may have about them; it searches for correlates among the large number of columns, or possibly for clusters in these high-dimensional spaces. That’s why they can say things like “HyperCube definitely fits the role of a reusable asset”: it makes no use of domain-specific understanding.

This sort of thing doesn’t really work well in practice, and the few times it does work, it’s usually because someone exploited their understanding of the problem and variables to make the algorithm search in the right direction (the very thing they’re bragging about avoiding). There are bound to be arbitrary choices and parameter assignments somewhere in the algorithm, presumably set to more or less arbitrary default values, which are probably nonsense for the specific data set you’re trying to feed it.

The correlations discovered are usually impossible to interpret and often just outright meaningless if not highly misleading. For example, I saw a tool exactly like this one used to find geographic correlates to a certain type of military engagement. The data set they fed into it had all engagements in a very small area. The algorithm returned all geographic features in that area and predicted that any other area with those same features would also be the site of similar engagements. It was complete and total nonsense since almost all the geographic features in that engagement area had nothing to do with the phenomenon. A human analyst could have picked out which of those hundreds of geographic features if any were relevant faster, cheaper, and far more accurately.

I bet there are hundreds of other vendors offering the same service. There are probably even several dozen that are using the same algorithm under the hood. I wouldn’t be surprised if they were using regular old k-means clustering, or something very standard and well known. It’s shocking how often vendors will put a pretty front end onto trivial mathematics and try to sell it. I’m sure they’ll convince many middle managers though especially if they market it to the government. You can sucker them into almost anything. Just use the magic phrase “proprietary mathematical algorithm” and middle managers won’t ask too many awkward questions.
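For what it’s worth, “regular old k-means clustering” really is only a few lines; here is a minimal sketch (NumPy, synthetic two-blob data, empty-cluster handling omitted):

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Plain Lloyd's-algorithm k-means: the sort of standard, well-known
    procedure that could easily sit behind a "proprietary" front end."""
    centers = X[[0, -1]].copy()  # crude deterministic init (assumes k == 2)
    for _ in range(iters):
        # assign each point to its nearest center
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        # move each center to the mean of its assigned points
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

# two obvious blobs; k-means "discovers" them
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
labels, centers = kmeans(X, 2)
```

A demo that “discovers” rules in data like this is doing nothing a first-year stats student couldn’t replicate.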

‘I’m sure they’ll convince many middle managers though especially if they market it to the government. You can sucker them into almost anything. Just use the magic phrase “proprietary mathematical algorithm” and middle managers won’t ask too many awkward questions.’

That rarely happens in an actual competitive procurement cycle. The slogan, “Nobody ever got fired for buying IBM” exists for a reason. Most of the time, small vendors who try to pull that kind of horseshit are squashed by the 800-pound gorillas.

I once sat on a purchasing committee for a software package where the vendor claimed they had or had applied for a patent on a correlation algorithm. We called their bluff and they backed down. They then proceeded to go a few ranks over our heads to a VP, offering a steep discount and telling him that going with their competitor was a huge mistake. We ended up with the competitor.

Yes and no.

I’ve seen this go the other way at some very big companies. It usually starts when word comes down that a VP has heard about some hot new thing like a proprietary method for “optimizing net promoter scores” or “gathering actionable intelligence from big data.”

I’ve also seen it work the way Ed describes, but I’m still amazed at the amount of crap that MBAs will buy.

It’s good to hear at least some people aren’t fooled. Finding a VP who doesn’t know any math but has purchasing authority seems to be the business model of these vendors. Although in this case HyperCube may have been a break-even addition used to sweeten consulting deals, which is their real business, and so may not be doing much harm.

The worst example I’ve seen was a tool that consisted of a slightly modified Gaussian kernel density estimator, which was dressed up as “predictive analytics”. Not only was it bought, but a long-term contract was put in place for the vendor to provide analysts to run the tool. At no point was there a shred of evidence, or even a reason to believe, that it was an improvement on regular Gaussian kernel density estimation, which was already in common use.
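For comparison, plain Gaussian kernel density estimation, the freely available baseline any such tool ought to be measured against, is essentially one call in scipy (synthetic data):

```python
import numpy as np
from scipy.stats import gaussian_kde

# Ordinary Gaussian KDE: the standard baseline that a "slightly
# modified" commercial version should have to demonstrably beat.
rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=1000)   # synthetic sample

kde = gaussian_kde(data)         # bandwidth via Scott's rule by default
grid = np.linspace(0.0, 6.0, 61)
density = kde(grid)              # estimated density on a grid

peak = grid[np.argmax(density)]  # should land near the true mean of 3
```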

I have seen this before in other products; the “non-statistics” is a marketing ploy. What I think they really mean is “model agnostic,” which is not the same thing. But for people who can’t handle statistics, “non-statistical” means “I can use this without really knowing what I am doing, and I don’t need to hire someone who understands statistics to run it.” Model-agnostic approaches use statistics to determine which model fits best, then claim how well it fits; that sounds like cheating to me, but it has its (limited) purposes.

For example, in reliability engineering, you might want to test failure profiles to see which PDF fits the data best [typically with chi-square tests against various candidate distributions], but I wouldn’t under any circumstances call that “non-statistical”.

The people who call “model agnostic” approaches “non-statistical” are just people who, like you said, are confused about the idea.

The only way something could be called a non-statistical analysis is when the data IS the population and you aren’t trying to make any inferences. Which, at least in my opinion, isn’t very high on the list of value-added analyses in most cases.
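The reliability example above is a few lines of perfectly ordinary statistics; here is a sketch with synthetic failure times (scipy assumed, candidate distributions chosen for illustration):

```python
import numpy as np
from scipy import stats

# Synthetic failure times; in practice these come from field data.
rng = np.random.default_rng(42)
times = rng.exponential(scale=100.0, size=500)

# Bin the observations into 10 equal-count bins, then compare chi-square
# statistics for two candidate distributions. Smaller statistic = better fit.
edges = np.quantile(times, np.linspace(0.0, 1.0, 11))
observed, _ = np.histogram(times, bins=edges)

def chi2_stat(dist, params):
    expected = len(times) * np.diff(dist.cdf(edges, *params))
    return ((observed - expected) ** 2 / expected).sum()

expo_fit = chi2_stat(stats.expon, stats.expon.fit(times))
norm_fit = chi2_stat(stats.norm, stats.norm.fit(times))
# The exponential model should fit far better here; either way, the
# whole procedure is statistics, not "non-statistics".
```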

Yep. Agreed. Also, the PR/marketing folk need something to differentiate themselves from the existing products (“Why shouldn’t we keep using SAS?”). The decision makers are often Suits who don’t appreciate subtleties. That’s the “non-statistics” part, i.e., “we are different.”

Looking at it again on BearingPoint’s website, I totally agree with Entsophy: this is a repackaged “Multivariate Correlation Identifier”. Any high school student taking statistics could build the same thing, given enough cloud space. It will identify correlations (big whoop, so do scatterplots), but without contextualization it doesn’t mean a whole lot.

But it might result in a lot of “ice cream vendors being arrested for increasing the crime rates in inner cities during the summer months”.

I’m not sure I agree that “brute force approaches without contextualization don’t work”. Sometimes they do have utility.

e.g., Wasn’t it shown in the Netflix contest that the best codes were ones that sort of relied on pure statistics and ignored most metadata about movies, genres, etc.?

About Netflix: sort of. The best models were actually blends of lots of models, each one presumably capturing some of the truth, then combined into a single model. The same was found to be true for the Heritage Health Prize, which followed the Netflix competition.

Two things, though: the individual models weren’t that great, and a great deal depended on the statistics/variables created out of the original data to run the models on. A great deal of domain-specific knowledge was used to create, refine, and pick those variables.

Model of models is fine, but that can be done in a non-contextual way too. An ensemble over methods can be done without domain knowledge.
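A context-free ensemble over methods can be sketched in a few lines (NumPy only; the toy problem, learners, and equal weights are all illustrative choices):

```python
import numpy as np

# Two generic, context-free learners on a toy problem (noisy quadratic),
# blended by simple averaging. No domain knowledge enters anywhere.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 300)
y = x ** 2 + rng.normal(0, 0.3, 300)
x_tr, x_te = x[:200], x[200:]
y_tr, y_te = y[:200], y[200:]

# Learner 1: straight-line fit (badly misspecified here).
a, b = np.polyfit(x_tr, y_tr, 1)
pred_lin = a * x_te + b

# Learner 2: binned means (a crude nonparametric regressor).
edges = np.linspace(-3, 3, 13)
idx_tr = np.clip(np.digitize(x_tr, edges) - 1, 0, 11)
bin_mean = np.array([y_tr[idx_tr == j].mean() for j in range(12)])
idx_te = np.clip(np.digitize(x_te, edges) - 1, 0, 11)
pred_bin = bin_mean[idx_te]

# "Ensemble over methods": equal-weight blend of the two predictions.
blend = (pred_lin + pred_bin) / 2

mse = lambda p: ((p - y_te) ** 2).mean()
```

The mechanism itself needs no context; the debate below is about how much context goes into choosing and tuning the pieces.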

Rahul, they were about as non-contextual as linear regression is. The winning models took many thousands of hours of man-in-the-loop fiddling to create, even though they were often built out of standard canned procedures. At the end of the Netflix contest, for example, they were wringing out extra accuracy by looking at a small number of very hard-to-predict cases like the movie “Napoleon Dynamite”, which people either loved or hated, and whose ratings didn’t seem to be obviously correlated with other factors. So even if they built many models out of standard components like Random Forests, it isn’t non-contextual in the way you’re thinking. The resulting blends were highly specialized to the specific question and weren’t even close to being usable in other contexts. When the Heritage Health Prize started, people had to start all over again building models.

In the Heritage Health Prize example, the raw data didn’t lend itself to being fed into models directly. You pretty much had to create all the variables from the raw data. Typically people looked for functions of the raw data that intuitively had some explanatory power and had nice properties. So even if two people were both using Random Forest from the same R package, they were still getting different results. Moreover, many of these canned procedures require arbitrary choices, which were usually made by considering the context.

What is true is that domain expertise wasn’t critical for either contest. This was especially noticeable in the Heritage Health Prize contest, where some people thought that having a doctor on the team would give them an edge. It turned out that all you needed was fairly general knowledge, like “a woman is likely to spend a few days in the hospital 7-9 months after first going to an OBGYN”, in order to predict hospital stays. Having specialized medical training, or prior experience in medical data mining, seemed to be worthless. I don’t think any of the top 10 teams had any kind of medical expertise.

Note also, that the models which won the Netflix prize weren’t usable in practice. Since they were a blend of a bunch of canned procedures, Netflix found that it wouldn’t have been economical to implement the winning solution on their full data sets. So it’s quite possible that a single non-canned algorithm, built from scratch to answer the Netflix question, might have been more practicable.

I take the part about sampling to mean this: in legacy products (and with slower servers), if you had a really huge dataset, you were sometimes constrained to run your analysis on only part of it because of limitations of the analysis software or of processing time.

What they probably mean is:

“Look, our code is fast enough that you can crunch however many records you throw at it (given a cluster or something to parallelize it on?).”

In their logic, if I do mean(x) where x is all my data, I’m not doing statistics; but if I do mean(sample(x, 1000)) because my x is so huge I can’t compute the exact mean, then I am doing statistics and have to understand the properties of sampling.

They are trying to tell people that the data you’ve got is the population. Whereas in most cases it’s already a sample…

Also, all that “HyperCube” “postmodern” “beyond” stuff is more like “TimeCube” than “HyperCube”…
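The mean(x) vs. mean(sample(x, 1000)) distinction above can be sketched in Python (synthetic data):

```python
import numpy as np

# "All the data" vs. a sample of it. The full mean is just arithmetic;
# a sample mean is an estimate, and then you have to think about
# sampling error: the part the marketing copy is calling "statistics".
rng = np.random.default_rng(0)
x = rng.normal(50, 10, size=1_000_000)     # pretend this is "all the data"

full_mean = x.mean()                        # exact, for this dataset

sample = rng.choice(x, size=1000, replace=False)
sample_mean = sample.mean()                 # an estimate of full_mean
std_error = sample.std(ddof=1) / np.sqrt(len(sample))  # its uncertainty
```

Of course, if the million records are themselves a sample from some larger population, as noted above they usually are, then the “exact” mean is an estimate too.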

“I’m not quite sure what the difference is between “take a sample” and “take input from a large volume of data.” All their examples involve generalizing from their sample data to a population. This sounds statistical to me.”

They’re trying to bamboozle people, obviously. The kind of people who seem to think that “statistics” is when you gather a bunch of data, randomly sample your own data, then wave a statistics wand over it to magically produce conclusions.

“The lack of a hypothesis is another advantage of HyperCube over statistics. HyperCube exposes the rules and dependencies that are indicated by the data, and is not tied to any previously held view. Statistics, on the other hand, test data to see whether it proves a specified scenario.”

Awesome! Exactly the tool I needed to prove that the Dow going up is VERY HIGHLY CORRELATED with the S&P 500 also going up.

“The technology is non-statistical, meaning it does not take a sample and use algorithms in order to validate a hypothesis. Instead, it takes input from a large volume of data and outputs the results from the data alone. This means that all the available data is taken into account.”

This is clearly all BS, but I have a different take on it. It sounds like what they’re getting at is hypothesis testing: that “statistics” only interprets the data with respect to a null hypothesis, and that their approach is superior because the analysis is not about “validating a hypothesis”.

Obviously, their understanding of statistics ended at the high school level, but I think this is what they’re attempting to get at.

That was my impression as well: their description of statistics seemed like a bad caricature of frequentist inference.

Having said that, I still don’t get what the connection to the n-dimensional square-faced polytope is!

There is a link here (http://www.bearingpoint.com/en-other/7-5605/) to Les Echos, which names a certain Augustin Huret as one of the inventors (and he is indeed a partner at BearingPoint). Searching for ‘Augustin Huret publications’ does not bring up much (especially considering the BearingPoint web site mentions 20 years of research at Ecole Polytechnique), but it does find this (), which does contain an explanation of the algorithm near the end. I have not yet read it in detail, but it seems to be a reasonably sophisticated combination of a clustering method and other stuff.

Here is the crucial link: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3170284/pdf/pone.0024085.pdf.

“Statistics, on the other hand, test data to see whether it proves a specified scenario.”

Their notion of statistics is peculiar. From a frequentist point of view, the point of a hypothesis test is to falsify, not verify (or prove, if you will). From a Bayesian point of view — which I am not very familiar with — the above makes slightly more sense, but I doubt any Bayesian would claim to have proved, rather than corroborated, a hypothesis.

Andreas:

From the perspective of applied classical statistics (for example, what’s done in most research papers in psychology, medicine, etc), statistics is indeed used to prove hypotheses. The idea is that p<.05 so the hypothesis is treated as if it is true. (Or, you could say, the null hypothesis is taken to be false, but nobody cares about the null hypothesis. The research hypothesis of interest is that the effect is real.)

Andrew, I do see your point, but I disagree with the notion of having proved a hypothesis (in the epistemological sense).

To clarify, suppose we are interested in testing whether a given treatment has an effect on rats. Say, eating fatty food makes a rat fat. We investigate this hypothesis by conducting an experiment on a group of rats while keeping a control group (both groups chosen randomly). After the experimental data are in, we conduct a hypothesis test and find that we reject the null hypothesis of “no effect” at a p-value of 0.01, thus logically “proving” that there is an effect. However, as Clark points out in the comment below, “proof” is a misnomer in this context, because there is still the possibility that we have observed the result purely by chance. Therefore we can never claim to have proved the research hypothesis in the epistemological sense. I think Clark hit the nail on the head.

Returning to my original statement, I realize that I went beyond the above in saying that the point of frequentist hypothesis testing is solely to falsify, which is perhaps still controversial.
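The rat example above can be sketched with scipy and made-up numbers (every value here is hypothetical, for illustration only):

```python
import numpy as np
from scipy import stats

# Hypothetical weight gains (grams): a fatty-food group vs. a control
# group. The numbers are invented purely to illustrate the test.
rng = np.random.default_rng(1)
fatty = rng.normal(30, 5, 100)     # treatment group
control = rng.normal(25, 5, 100)   # control group

t_stat, p_value = stats.ttest_ind(fatty, control)

# Rejecting H0 ("no effect") at p < 0.01 licenses a decision, not a
# proof: the observed difference could still, in principle, be chance.
reject_null = p_value < 0.01
```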

I frequently argue that the concept of “proof” has no role beyond mathematics. We might say “the evidence supports” or something similar, ideally with some level of statistical confidence, but it is impossible to “prove” anything — since proof is an absolute yes-or-no kind of thing.

The technology is non-statistical, meaning it does not take a sample and use algorithms in order to validate a hypothesis. Instead, it takes input from a large volume of data and outputs the results from the data alone. This means that all the available data is taken into account.

So pure descriptive statistics is non-statistical while statistical inference is statistical?

The lack of a hypothesis is another advantage of HyperCube over statistics. HyperCube exposes the rules and dependencies that are indicated by the data, and is not tied to any previously held view. Statistics, on the other hand, test data to see whether it proves a specified scenario.

So pure correlation analysis is non-statistical while causation analysis is statistical?

Welcome to the big (fail) data era.