Brian Mulford writes:

I [Mulford] ran across this blog post and found myself questioning the relevance of the test used.

I’d think chi-square would be inappropriate for measuring the significance of a choice in the manner presented here, the cute hamster notwithstanding. Since this is a common test for marketers and website developers, I’d be interested in which techniques you might suggest.

For tests of this nature, I typically measure a variety of variables (image placement, size, type, page speed, “page feel” as expressed in a factor, etc.) and use logit, cluster analysis, and possibly a simple Bayesian model to determine which variables mattered most. Pearson chi-squared may be used to express relationships between variables and outcomes, but I’ve typically not used it simply to judge a 0/1 choice as statistically significant or not.

My reply:

I like the decision-theoretic way that the blogger (Jason Cohen, according to the webpage) starts:

If you wait too long between tests, you’re wasting time. If you don’t wait long enough for statistically conclusive results, you might think a variant is better and use that false assumption to create a new variant, and so forth, all on a wild goose chase! That’s not just a waste of time, it also prevents you from doing the correct thing, which is to come up with completely new text to test against.

But I agree with Mulford that chi-square is not the way to go. I’d prefer a direct inference on the difference in proportions. Take that inference, that is, the point estimate and its uncertainty, estimated using the usual (y+1)/(n+2) formulas, and then carry that uncertainty into your decision making. Balance costs and benefits, and all that.
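Here’s a minimal sketch of that direct inference, with made-up click counts for two variants (the counts and sample sizes are purely illustrative, not from Cohen’s post):

```python
import math

def rate_and_se(y, n):
    """Point estimate (y+1)/(n+2) and its approximate standard error."""
    p = (y + 1) / (n + 2)
    se = math.sqrt(p * (1 - p) / (n + 2))
    return p, se

# Hypothetical A/B counts: y clicks out of n impressions per variant.
p_a, se_a = rate_and_se(y=30, n=1000)
p_b, se_b = rate_and_se(y=45, n=1000)

# Difference in proportions and its uncertainty.
diff = p_b - p_a
se_diff = math.sqrt(se_a**2 + se_b**2)

print(f"estimated difference: {diff:.4f} +/- {se_diff:.4f}")
print(f"approx 95% interval: [{diff - 2*se_diff:.4f}, {diff + 2*se_diff:.4f}]")
```

The point is that the output is an estimate with an uncertainty, which you can feed directly into a cost-benefit calculation, rather than a binary significant/not-significant verdict.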

Moving forward, you’re probably making lots and lots of this sort of comparison, so put it into a hierarchical model and you’ll get inferences that are more reasonable and more precise.
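A full hierarchical model is beyond a blog comment, but a toy empirical-Bayes version conveys the idea: fit a common prior to the rates from many past tests, then shrink each test’s raw rate toward it. All counts below are invented for illustration:

```python
import statistics

# Hypothetical click counts from many past comparisons.
ys = [12, 30, 7, 45, 22, 18]
ns = [400, 1000, 300, 1000, 800, 600]

raw = [y / n for y, n in zip(ys, ns)]
m = statistics.mean(raw)
v = statistics.variance(raw)

# Method-of-moments fit of a Beta(a, b) prior to the observed rates.
common = m * (1 - m) / v - 1
a, b = m * common, (1 - m) * common

# Partially pooled estimate for each test: a compromise between
# the raw rate y/n and the overall mean, weighted by sample size.
pooled = [(y + a) / (n + a + b) for y, n in zip(ys, ns)]
```

Small, noisy tests get pulled strongly toward the overall mean; large tests barely move. That’s the “more reasonable and more precise” part.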

But . . . who knows? Maybe Cohen’s advice is a net plus. Ignoring the chi-square stuff, the key message I take away from the above-linked blog is that, with small samples, randomness can be huge. And that’s an important lesson–really, one of *the* key concepts in statistics. Don’t overreact to small samples. If the silly old chi-square test is your way of coming to this conclusion, that’s not so bad.

I don't know about Cohen's advice being a net plus.

As my colleague at two companies, Mark Moody, used to put it: the only thing that really matters is the economic analysis.

Suppose there is no cost difference in "Code Review Tools" or "Tools for Code Review". Then why not take the version with the plurality, regardless of significance?

Suppose there is a cost difference. Then you determine the likelihood of a profit increase or decrease depending on the shape of the modeled curve; statistical significance in itself isn’t important. (Confidence interval sizes are important, because they determine your odds of having made the right decision, but confidence intervals aren’t the same thing as significance tests.)
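That economic framing can be sketched in a few lines: simulate the uncertain rate difference and compare the expected profit of switching against a switching cost. Every number here (the estimated difference and its standard error, traffic, value per conversion, cost) is an assumption for illustration only:

```python
import random

random.seed(1)

# Assumed inference on the rate difference (point estimate and SE).
diff_mean, diff_se = 0.015, 0.009

# Assumed economics: future visits, value per conversion, cost to switch.
visits, value = 100_000, 2.0
switch_cost = 500.0

# Simulate the uncertainty and translate it into profit.
draws = [random.gauss(diff_mean, diff_se) for _ in range(10_000)]
gains = [d * visits * value - switch_cost for d in draws]

expected_gain = sum(gains) / len(gains)
prob_gain_positive = sum(g > 0 for g in gains) / len(gains)
```

The decision rule is then about expected gain and the probability of coming out ahead, not about crossing a significance threshold.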

zbicyclist is right about the specific example (why not take the version with the plurality?), but I think Cohen’s post is useful for more than just that piece of advice. Educating people about small-sample variation is a good thing; people really do have terrible intuition about it, and this leads to all kinds of problems.