I just posted this a few hours ago, but it’s such an important message that I’d like to post it again.

Actually, maybe we should just post nothing but the above graph every day, over and over again, for the next 20 years.

This is hugely important, one of the most important things we need to understand about statistics.

The top graph is what got published, the bottom graph is the preregistered replication from Joe Simmons and Leif Nelson.

Remember: **Just cos you have statistical significance and a graph that shows a clear and consistent pattern, it doesn’t mean this pattern is real, in the sense of generalizing beyond your sample.**

But the top graph looked like such strong evidence!

Sorry.

It’s so easy to get fooled. Think of all the studies with results that look like that top graph.

One of my grad school colleagues, now quite comfortably employed, would have said “but it still looks to me like if you add up all the blue and all the green, the blue is still ahead, so I can put that even though it’s smaller than the error there’s still a tendency…”

“I just posted this a few hours ago, but it’s such an important message that I’d like to post it again.” — I think that counts as a replication!

I beg to differ: it’s a repetition, not a replication! ;~)

I’m totally stuck on the heated butter knife thing. I never knew such a thing existed.

Does it only work on heated butter????

That’s exactly it. Don’t let others tell you otherwise!!

It’s not the *butter* that’s heated, it’s the …. Oh I see what you did there. It’s a joke. Good one, you got me.

It’s atomic.

Just wanna make sure I understand the takeaway here. Is there something about the top graph that should make me suspicious (aside from the implied hypothesis)? Or is this about exercising continued skepticism towards ideas?

Just read the previous post on this topic. Seems like this is about skepticism, not a problem with the first graph.

I must add, however that I’m a bit confused about why the hypothesis was credible in the first place. From the paper:

> “consumers’ views of scenery from a high physical elevation induce an illusory source of control, which in turn intensifies risk taking.”

Seriously? This seems almost like magical thinking. Am I being a jerk here? Are there real psychological effects that sound this unbelievable when you first hear them?

Something as simple as the Müller-Lyer illusion?

I don’t think the finding is inherently incredible. Filmmakers have ‘known’ for a while that certain shot compositions can induce particular psychological effects in viewers. In addition, the basis of modern advertising is the juxtaposition of a sales pitch with particular, not necessarily related, imagery that is assumed to subtly influence a consumer’s buying decisions. Combine the two ideas, and boom. Perhaps the size of the effect in the original study was higher than I’d expect, but it’s not obviously false.

The hypothetical explanation provided is more tenuous, but that’s not in question here.

One detail, which does not tell the whole story, is the “p = 0.016”. That is not strong significance. For example, physics settled many years ago on requiring roughly p < 1e-7. I would not continue using hypothesis testing, but I don’t understand why other fields have not at least lowered the bar on p; it seems a practical way to quickly cut off a lot of bullshit.

You can’t compare the significance threshold in physics to social science; measurements are extremely precise in physics, and the phenomena being studied are also very stable.

I understand that something like 1e-7 may be too low. But, say, 1e-3? Something that you need a bit more effort to extort with forking paths.

Bear in mind also that the lower the significance threshold, the more highly exaggerated the effect sizes are in studies that manage to get published (for a fixed sample size anyway, which is admittedly unrealistic, but there is a limit to how much people would increase their sample sizes I think).
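Here is a minimal simulation sketch of that exaggeration effect, assuming a normally distributed estimate with a fixed standard error. The true effect (0.1), the standard error (0.05), and the function name `exaggeration_ratio` are hypothetical choices for illustration, not numbers from any study discussed here.

```python
import random
from statistics import NormalDist, mean


def exaggeration_ratio(true_effect, se, alpha, n_sims=100_000, seed=0):
    """Average |estimate| / true effect among simulated studies with p < alpha."""
    rng = random.Random(seed)
    # Two-sided critical value for the chosen significance threshold.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    significant = []
    for _ in range(n_sims):
        estimate = rng.gauss(true_effect, se)  # one hypothetical study
        if abs(estimate) > z_crit * se:        # "significant", so it gets published
            significant.append(abs(estimate))
    return mean(significant) / true_effect


# Hypothetical numbers: true effect 0.1, standard error 0.05 (fixed sample size).
ratio_05 = exaggeration_ratio(0.1, 0.05, alpha=0.05)
ratio_001 = exaggeration_ratio(0.1, 0.05, alpha=0.001)
print(f"alpha=0.05:  published effects are about {ratio_05:.2f}x the truth")
print(f"alpha=0.001: published effects are about {ratio_001:.2f}x the truth")
```

Under these assumptions the stricter threshold lets fewer studies through, and the ones that pass overstate the effect by more.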

The fixation on ultra-low p-values in physics is to some extent cargo cult science. No one really claims the peak on the LHC is explained by random background variation. But a single experiment repeated by the same people on the same machine has an inherent pseudoreplication issue and no one is going to build a second LHC to replicate the results.

At some point if your p-value threshold is too low some things just can’t be investigated any more. Suppose that we are interested in an effect that increases people’s likelihood to buy heated butter knives by 5%, and the average person has a 0.5 probability to buy one with a sd of 0.25. Suddenly to get a power of 0.8 you need 7000 individuals! But a 5% increase in their sales would be a huge effect for a butter knife company. Just… not worth funding a study with 7000 individuals if they don’t know if the study would even get a positive result.
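A rough sketch of how a number like 7000 can arise, under assumptions the comment does not spell out: a two-group comparison, a two-sided test at the 1e-3 threshold suggested earlier in the thread, a normal approximation, and "5%" read as a relative increase (0.5 to 0.525). The helper `n_per_group` is a hypothetical name.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, sd, alpha, power):
    """Normal-approximation sample size per group for a two-sided, two-sample test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the threshold
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_power) * sd / delta) ** 2)


# Assumed reading: a 5% relative increase on a 0.5 baseline, i.e. 0.5 -> 0.525.
delta = 0.05 * 0.5
for alpha in (0.05, 0.001):
    total = 2 * n_per_group(delta, sd=0.25, alpha=alpha, power=0.8)
    print(f"alpha={alpha}: about {total} individuals in total")
```

With these inputs the total comes out a little under 7000 at alpha = 0.001 and roughly half that at alpha = 0.05, so the stricter threshold about doubles the required sample.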

In practice forking paths do not just “randomise” the p-value. A maliciously poor choice of analysis can produce astoundingly low p-values out of nothing. A too low threshold could in fact dissuade researchers from producing honest but unimpressive analyses, and encourage them to instead go for fancy bullshit. From personal experience, too-low p-values are not in general indicators of a real effect, but rather some kind of model misspecification.

An example of what I mean in terms of dodgy use of p-values in physics:

https://images.ctfassets.net/cnu0m8re1exe/64v3YUbcvYSBW5CbV89wvp/68ca5256775fafcdb04089731f4af85d/lhccms_higgsdiscovery.gif?w=650

The measurements are the dots here, while the curve is a 4th order polynomial fit meant to capture the background variation. The little bump at 125 GeV is the signal that everyone was excited about. Now look at that curve, and ask yourself, does the +/- 2 sigma green area *really* represent the ‘random’ variation in this data? Do 95% of the points, apart from the little Higgs bump, really lie in that area?

If someone gave me just the points on this dataset and asked me to compute the p-value of the 125 GeV bump, I sure wouldn’t call it a 5 sigma effect. Clearly there is a source of variation in the event counts that isn’t captured in the stated variation.

I wouldn’t say that the dots there are the measurements. That chart is the final result of a not-so-simple model: http://cdsweb.cern.ch/record/1429931/files/HIG-12-001-pas.pdf

Carlos:

However complicated the model, it is clear that the model doesn’t accurately represent the variation in event counts outside of the Higgs bump. They do not align to the curve as closely as the CIs imply they should.

The bands represent the uncertainty in the background fit, not the expected variability in the measurements. I’m not sure what you are complaining about.

From a high level, we have a baseline function f(m), so the weighted event counts (however they are calculated) are

y_m = f(m) + Z

where Z under the null assumption varies randomly. The authors have a theoretical estimate of the variability of Z (sigma), and fitted f(m) as a low order polynomial, and for m=~125 they detect that y – f(m) is ~5 times sigma. Therefore, they argue that the blip at m=125 is a true signal with some very small p value.

However, the distribution of y-f(m) under their polynomial fit does not have a sd of sigma. It is substantially greater. Thus we come to one of two conclusions:

1. the physics-theory sigma significantly underestimates the real variability in Z, in which case it’s not an appropriate value to use to compare y-f(125) against, or

2. the low order polynomial background fit is underfit. This can be the case because their experiments to choose the background model are based on an F-test, and do not assume a particular value for the variability of Z.

We may be talking about different things.

When they say “a 5-sigma event” they are just expressing as a number of standard deviations (using the one-sided Gaussian tail convention) a p-value that has been calculated using a quite complex procedure http://cds.cern.ch/record/1379837/files/NOTE2011_005.pdf

That statistical methodology may be good or it may be bad (I don’t know) but I don’t think that you can tell from that chart and the equation y_m = f(m) + Z.
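For reference, the one-sided Gaussian tail convention mentioned above can be sketched in a few lines. This only converts between a p-value and a number of sigmas; it says nothing about how the p-value itself was computed. The function names are hypothetical.

```python
from math import erfc, sqrt
from statistics import NormalDist


def sigma_to_p(n_sigma):
    """One-sided Gaussian tail probability beyond n_sigma standard deviations."""
    return 0.5 * erfc(n_sigma / sqrt(2))


def p_to_sigma(p):
    """Number of standard deviations whose one-sided tail probability equals p."""
    return -NormalDist().inv_cdf(p)


print(f"5 sigma   -> p = {sigma_to_p(5):.2e}")
print(f"p = 0.016 -> {p_to_sigma(0.016):.2f} sigma")
```

Under this convention 5 sigma corresponds to a p-value of about 2.9e-7, while the p = 0.016 discussed in this thread is a bit over 2 sigma.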

Well, neither of us is a theoretical physicist. I just have a sense from reading this that what they’ve done in their analysis is correctly account for the inaccuracy in their measurement device, but they did not account for the distribution of random, unknown particles, and hence whether the presence of a particular bump with a particular mass really shows them the particular particle they are looking for, or just that there’s all sorts of mysterious unaccounted-for particles being detected and so we don’t really know if this was the Higgs or not.

Zhou Fang:

Accounting for the bump coming from other particles is a matter of model choice. There’s the standard model, it says that you’ll find a Higgs boson around there, and nothing else. Then you also construct a modified standard model where there is no Higgs boson, and other particles behave as close as possible to usual.

That hypothesis test is just confronting these two possibilities. There are strong prior reasons for searching for the Higgs boson and not something else. However, other analyses also looked at more detailed properties of what’s in the bump to get new evidence on it being the Higgs boson.

I don’t know the details of the original Higgs boson fit, but some years ago I tried to get an idea. The thing is statistically a mess, in the sense that I would have used Bayesian methods while they spent years on inventing all sorts of frequentist hacks to make a complicated fit as a hypothesis test. So I would not trust the analysis just by looking at the methodology because I think I am not smart enough to notice problems if there were any.

But I trust it because they spent many years just on getting it right, with many people working on it independently (the two detectors, ATLAS and CMS, are built differently, are managed separately, publish separately, they don’t ever share most of their analysis code). I also know that at some point when the preliminary results were out they had a ~2-sigma difference on the mass of the Higgs from the two detectors, and so both experiments redid all the analysis in a different way. Also, I’ve seen that there’s a consistent group of “paranoid” physicists at CERN that doesn’t feel they can trust statistics because it’s too opinionated, so you can’t just say that the methodology is good; they require the analysis to be done in many ways at various levels and see how the results change.

So: I think that calling the Higgs fit an example of cargo cult science is not appropriate, although it surely is messy.

I point specifically to the “fixation on ultra-low p-values” as cargo cult science. The uncertainties in the analysis and all of that do not hinge on whether the fit is 2 or 3 or 6 sigmas. They hinge on the appropriateness of the assumptions.

Also, it has green and blue bars, with blue bars on the right. Green represents plants, thus life, while blue is water. Since water is a prerequisite for life, a straight subconscious would put the blue bars on the left and the green bars on the right. But, if the person that made the graph was in a state of conflict, due to consciously making an unsupported claim, it would have subconsciously switched the colors to actualize its internal mind state in the causal reality.

Moreover, they are using pictures. This does not work because you would need to actually put someone in front of the real scenery. Otherwise the representation scission will induce an anticorrelation in the behavioral patterns, such that the desire for heated butter knives cancels out with remote entryway lock systems (because they are at 2-odd positions in the graph).

Others’ assessments may differ from mine, but one thing I always note is that the error bars in the figure are undoubtedly standard errors. (I didn’t check, but I would bet a large amount of money on that.) Therefore, given that N is about 200, the actual spread in the measured quantity is sqrt(N) * the displayed bars, which is about 14*0.2 = 3-ish. So really, you should replace each bar in your mind with a fuzzy blob smeared out over a range that’s larger than the actual graph. Can we distinguish two such fuzzy blobs? Sure, that’s what standard error is for, but we’d believe this only if the phenomenon is very robust, if we have a solid backdrop of theory+experiment that serves as its foundation, if we understand measurement error, if we have a good noise model, if negative results are weighted as equal to discoveries, etc. I doubt any of these are true for this sort of study.
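The arithmetic in that comment, as a tiny sketch. It assumes, as the commenter does, that the bars are standard errors of the mean, and it uses the approximate values quoted there (N of about 200, bars of about 0.2); `sd_from_se` is a hypothetical helper name.

```python
from math import sqrt


def sd_from_se(se, n):
    """Recover the sample standard deviation from a standard error of the mean."""
    return se * sqrt(n)


# Approximate values quoted in the comment: N of about 200, error bars of about 0.2.
print(f"implied SD: {sd_from_se(0.2, 200):.1f}")
```

That gives a spread of roughly 2.8 on a response scale that is only a few points wide, which is the "fuzzy blob" being described.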

This is an excellent point. I enjoy reading your blog, by the way.

Death to dynamite plots!

Death by dynamite might be fitting?

Adam said, “Or is this about exercising continued skepticism towards ideas?”

I’d say it’s about exercising continued skepticism towards claims based on “statistical significance”.

But the graph is depicting evidence, right? Forget about significance, the data is showing a tendency, right? I’m not saying I buy the claims of the authors (not at all), but I don’t see the lesson from this graph – that it’s misleading or something. This graph looks like evidence to me. (I’m probably missing something here.)

“This graph looks like evidence to me.”

I think the problem is that you are forgetting about the ubiquity of uncertainty. You need to be more skeptical of your initial responses to a graph. Remember that a graph just displays data from a *sample* of the “population” of interest. The variation in a sample depends in part on the particular sample (that is, it varies from sample to sample), so should not be taken as representing the variation in the population of interest. (See also my response to your comment below.)

I would actually like to hear the answer to this question, because I still don’t get it. Is there something about this picture that is suspicious?

Matthijs:

No, there’s nothing suspicious-looking about that graph. That’s my point: it all looks so clean, but it didn’t replicate. It’s my impression that people are much too easily swayed by weak evidence that looks strong. I’ve seen this a lot: a pattern of averages that is statistically significant, but it doesn’t mean anything at all.

It’s fun and instructive to talk about examples that have obvious flaws, gaping holes in their arguments, botched data (as with Brian “pizzagate” Wansink and Richard “gremlins” Tol), ridiculous graphs (as in that China-air-pollution study), graphs that flat-out contradict the claims in the article (as in that ages-ending-in-9 paper), claims in the abstract that are not even addressed in the article (as in the power pose study), and claims that come pretty close to violating the laws of physics (that ESP paper). But it’s also good to remember that even a graph that looks clean and has no obvious flaws can be meaningless.

I also don’t think there’s anything really wrong with the graph, but I do think it looks more convincing than it actually is. To your eyes, there seem to be 5 independent pieces of evidence, when in fact the data are very correlated. When you look at the graph, you cannot mentally take that correlation into account to weigh the evidence.

The authors could have added 5 more gadgets, and the evidence would have looked even more compelling!

The p-value comes to the rescue, though. It does take the correlation into account and comes down to a modest 0.016.

Thanks for clarifying. But how do I use this insight? Should I now just dismiss convincing-looking pictures? Should I dismiss convincing-looking data? I’m genuinely asking and I see there is a problem here, but I just don’t know what to do with a blanket statement of “don’t trust convincing-looking pictures”.

Another way of asking the question. You say this is weak evidence that looks convincing. How could we tell, before the failed replication, that the evidence was weak, and how does it relate to the picture (is the picture presenting too rosy a view)?

I would suggest taking the example given as cautionary: keep it in mind to remember the next time you see a graph that looks convincing — and remind yourself that the graph might be deceiving. Hold onto uncertainty, and if possible, actively look for possible reasons the graph might be deceiving — but wait for a replication before even tentatively concluding anything.

Mathijs:

What Martha said. Also, remember all the limitations of the study, which examines a small number of people in a narrow setting but was given the general title, “The Influence of Elevated Viewpoints on Risk Taking.” I think we should be skeptical of claims of broad import from narrow studies. The trouble with the top graph above is that the pattern looks so clean, it gives an inappropriate implication of generality.

Am I alone in thinking that the evidence would look far less compelling if the y-axis started at 0?

P=0.016 (0.02) is never very strong evidence against the null, and that would be more clear if the data were properly shown.

If you think the axis is important it should not start at 0 but at 1 (and go up to 7).

…and if you want a really strong effect size it should go up to 11.

+1 :)

Why 1 and not 0? Likelihoods have arbitrary scaling, but 0 is the lower bound in general.

1 seems to be the lower bound in the scale that goes from 1 = “Extremely Unlikely” to 7 = “Extremely Likely”. But I guess you could say that there are two implicit further extremes 0 = “No way” and 8 = “Take my money”.

Note that each pair of bars involves the same test subjects who are always in the same (high or low) condition. So we are looking at very correlated data. The modest p-value is correctly based on a repeated measures analysis, but the graph is a form of pseudo-replication. I think that’s why it looks more convincing than it actually is.

The first graph would look a *whole lot* less convincing if they started the y-axis scale at 0. The effect size of the most dramatic pair in the version they published (the “self-stirring mug”) appears visually to be about a 125% increase based on the relative sizes of the bars they present, but in fact it is only ~16% difference.
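To see how a truncated axis can turn a ~16% difference into bars that look 125% apart, here is a small sketch. The two means and the axis start are hypothetical values picked only to reproduce the percentages in the comment above; they are not read off the published figure.

```python
def visual_increase(low, high, axis_start):
    """Relative difference in drawn bar heights when the y-axis starts at axis_start."""
    return (high - axis_start) / (low - axis_start) - 1


# Hypothetical values chosen to mimic the percentages in the comment above.
low, high = 3.44, 3.99
print(f"axis from 3: bars differ by {visual_increase(low, high, 3.0):.0%}")
print(f"axis from 0: bars differ by {visual_increase(low, high, 0.0):.0%}")
```

The same pair of numbers looks dramatic or modest depending only on where the axis begins.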

Agreed. Being able to ‘see’ the evidence is particularly important given the awkward sort-of-logarithmic scaling of P-values as evidence.