Comments on: But the top graph looked like such strong evidence!

By: Adam Wheeler

Adam Wheeler — Wed, 17 Jun 2020 16:40:09 +0000

This is a good point. I think there’s still a distinction to be made between something that affects perception (Mueller-Lyer), and a purported effect that influences behavior in such a “push-button” way.

It’s believable to me that pictures taken from a high elevation give the viewer a subtle feeling of control, but we know from everyday experience that such an effect is pretty small. I’d expect it to be a drop in the bucket of purchasing propensity. Besides, what if having a greater sense of control gives me the self-control not to make unnecessary purchases? The causal chain is much longer and weaker.

I’m quite out of my depth, so it’s possible that there’s lots of sound and established psych saying I’m wrong here…

By: Carlos Ungil

Carlos Ungil — Sat, 23 May 2020 22:39:09 +0000

In reply to Michael J Lew. 1 seems to be the lower bound in the scale that goes from 1 = “Extremely Unlikely” to 7 = “Extremely Likely”. But I guess you could say that there are two implicit further extremes 0 = "No way" and 8 = "Take my money".

By: Michael J Lew

Michael J Lew — Sat, 23 May 2020 22:27:04 +0000

In reply to Divalent. Agreed. Being able to 'see' the evidence is particularly important given the awkward sort-of-logarithmic scaling of P-values as evidence.

By: Michael J Lew

Michael J Lew — Sat, 23 May 2020 22:25:23 +0000

In reply to Carlos Ungil. Why 1 and not 0? Likelihoods have arbitrary scaling, but 0 is the lower bound in general.

By: Divalent

Divalent — Sat, 23 May 2020 17:52:35 +0000

The first graph would look a *whole lot* less convincing if they started the y-axis scale at 0. The effect size of the most dramatic pair in the version they published (the “self-stirring mug”) appears visually to be about a 125% increase based on the relative sizes of the bars they present, but in fact it is only ~16% difference.
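The gap between the drawn and the actual difference can be checked by comparing bar heights measured from the axis start with the underlying values. The sketch below uses made-up numbers (means of 4.3 and 5.0, and an axis starting at 3.74). None of them are taken from the paper; they are chosen only so that they reproduce the roughly 125% versus 16% contrast described above.

```python
def relative_increase(low, high, axis_start=0.0):
    """Relative increase of `high` over `low` when bars are drawn from `axis_start`."""
    return (high - axis_start) / (low - axis_start) - 1.0

# Hypothetical means, not values from the paper.
low_mean, high_mean = 4.3, 5.0

actual = relative_increase(low_mean, high_mean)           # bars drawn from 0
apparent = relative_increase(low_mean, high_mean, 3.74)   # truncated axis (assumed start)

print(f"actual increase:   {actual:.0%}")
print(f"apparent increase: {apparent:.0%}")
```

The same function applies with axis_start=1.0 if one takes the lower end of the 1-to-7 response scale as the natural baseline.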

By: Dzhaughn

Dzhaughn — Sat, 23 May 2020 16:22:43 +0000

In reply to dhogaza. It's atomic.

By: Andrew

Andrew — Sat, 23 May 2020 00:21:22 +0000

In reply to Mathijs Janssen. Mathijs: What Martha said. Also, remember all the limitations of the study, which examines a small number of people in a narrow setting but was given the general title, "The Influence of Elevated Viewpoints on Risk Taking." I think we should be skeptical of claims of broad import from narrow studies. The trouble with the top graph above is that the pattern looks so clean, it gives an inappropriate implication of generality.

By: Martha (Smith)

Martha (Smith) — Sat, 23 May 2020 00:13:40 +0000

In reply to MGN. Death by dynamite might be fitting?

By: Martha (Smith)

Martha (Smith) — Sat, 23 May 2020 00:11:56 +0000

In reply to Mathijs Janssen.

“This graph looks like evidence to me.”

I think the problem is that you are forgetting about the ubiquity of uncertainty. You need to be more skeptical of your initial responses to a graph. Remember that a graph just displays data from a *sample* of the “population” of interest. The variation in a sample depends in part on the particular sample (that is, it varies from sample to sample), so should not be taken as representing the variation in the population of interest. (See also my response to your comment below.)

By: Martha (Smith)

Martha (Smith) — Sat, 23 May 2020 00:03:45 +0000

In reply to Mathijs Janssen.

I would suggest taking the example given as cautionary: keep it in mind to remember the next time you see a graph that looks convincing — and remind yourself that the graph might be deceiving. Hold onto uncertainty, and if possible, actively look for possible reasons the graph might be deceiving — but wait for a replication before even tentatively concluding anything.

By: MGN

MGN — Fri, 22 May 2020 23:46:49 +0000

In reply to Raghuveer Parthasarathy. Death to dynamite plots!

By: Mathijs Janssen

Mathijs Janssen — Fri, 22 May 2020 23:33:22 +0000

In reply to Andrew.

Thanks for clarifying. But how do I use this insight? Should I now just dismiss convincing-looking pictures? Should I dismiss convincing-looking data? I’m genuinely asking and I see there is a problem here, but I just don’t know what to do with a blanket statement of “don’t trust convincing-looking pictures”.

Another way of asking the question. You say this is weak evidence that looks convincing. How could we tell, before the failed replication, that the evidence was weak, and how does it relate to the picture (is the picture presenting too rosy a view)?

By: Zhou Fang

Zhou Fang — Fri, 22 May 2020 21:42:11 +0000

In reply to Giacomo Petrillo. I point specifically to the "fixation on ultra-low p-values" as cargo cult science. The uncertainties in the analysis and all of that do not hinge on whether the fit is 2 or 3 or 6 sigmas. It hinges on the appropriateness of the assumptions.

By: Erik

Erik — Fri, 22 May 2020 18:35:04 +0000

In reply to Andrew.

I also don’t think there’s anything really wrong with the graph, but I do think it looks more convincing than it actually is. To your eyes, there seem to be 5 independent pieces of evidence, when in fact the data are very correlated. When you look at the graph, you cannot mentally take that correlation into account to weigh the evidence.

The authors could have added 5 more gadgets, and the evidence would have looked even more compelling!

The p-value comes to the rescue, though. It does take the correlation into account and comes down to a modest 0.016.

By: Andrew

Andrew — Fri, 22 May 2020 18:08:11 +0000

In reply to Mathijs Janssen.

Mathijs:

No, there’s nothing suspicious-looking about that graph. That’s my point: it all looks so clean, but it didn’t replicate. It’s my impression that people are much too easily swayed by weak evidence that looks strong. I’ve seen this a lot: a pattern of averages that is statistically significant, but it doesn’t mean anything at all.

It’s fun and instructive to talk about examples that have obvious flaws, gaping holes in their arguments, botched data (as with Brian “pizzagate” Wansink and Richard “gremlins” Tol), ridiculous graphs (as in that China-air-pollution study), graphs that flat-out contradict the claims in the article (as in that ages-ending-in-9 paper), claims in the abstract that are not even addressed in the article (as in the power pose study), and claims that come pretty close to violating the laws of physics (that ESP paper). But it’s also good to remember that even a graph that looks clean and has no obvious flaws can be meaningless.

By: Mathijs Janssen

Mathijs Janssen — Fri, 22 May 2020 17:57:57 +0000

In reply to Martha (Smith).

But the graph is depicting evidence, right? Forget about significance, the data is showing a tendency, right? I’m not saying I buy the claims of the authors (not at all), but I don’t see the lesson from this graph – that it’s misleading or something. This graph looks like evidence to me. (I’m probably missing something here.)

By: Mathijs Janssen

Mathijs Janssen — Fri, 22 May 2020 17:55:26 +0000

In reply to Adam Wheeler. I would actually like to hear the answer to this question, because I still don't get it. Is there something about this picture that is suspicious?

By: Giacomo Petrillo

Giacomo Petrillo — Fri, 22 May 2020 15:49:16 +0000

In reply to Zhou Fang.

I don’t know the details of the original Higgs boson fit, but some years ago I tried to get an idea. The thing is statistically a mess, in the sense that I would have used Bayesian methods while they spent years inventing all sorts of frequentist hacks to frame a complicated fit as a hypothesis test. So I would not trust the analysis just by looking at the methodology, because I think I am not smart enough to notice problems if there were any.

But I trust it because they spent many years just on getting it right, with many people working on it independently (the two detectors, ATLAS and CMS, are built differently, are managed separately, publish separately, they don’t ever share most of their analysis code). I also know that at some point when the preliminary results were out they had a ~2-sigma difference on the mass of the Higgs from the two detectors, and so both experiments redid all the analysis in a different way. Also, I’ve seen that there’s a consistent group of “paranoic” physicists at CERN that doesn’t feel they can trust statistics because it’s too opinionated, so you can’t just say that the methodology is good, they require the analysis to be done in many ways at various levels and see how the results change.

So: I think that calling the Higgs fit an example of cargo cult science is not appropriate, although it surely is messy.

By: Giacomo Petrillo

Giacomo Petrillo — Fri, 22 May 2020 15:34:12 +0000

In reply to Zhou Fang.

Zhou Fang:

Accounting for the bump coming from other particles is a matter of model choice. There’s the standard model, it says that you’ll find a Higgs boson around there, and nothing else. Then you also construct a modified standard model where there is no Higgs boson, and other particles behave as close as possible to usual.

That hypothesis test is just confronting these two possibilities. There are strong prior reasons for searching the Higgs boson and not something else. However, other analyses also looked at more detailed properties of what’s in the bump to get new evidence on it being the Higgs boson.

By: Zhou Fang

Zhou Fang — Fri, 22 May 2020 15:26:48 +0000

In reply to Zhou Fang. Well, neither of us is a theoretical physicist. I just have a sense from reading this that what they've done in their analysis is correctly account for the inaccuracy in their measurement device, but they did not account for the distribution of random, unknown particles. So it is unclear whether the presence of a particular bump with a particular mass really shows them the particular particle they are looking for, or just that there are all sorts of mysterious unaccounted-for particles being detected, and so we don't really know if this was the Higgs or not.

By: Carlos Ungil

Carlos Ungil — Fri, 22 May 2020 15:10:40 +0000

In reply to Zhou Fang.

We may be talking about different things.

When they say “a 5-sigma event” they are just expressing as a number of standard deviations (using the one-sided Gaussian tail convention) a p-value that has been calculated using a quite complex procedure http://cds.cern.ch/record/1379837/files/NOTE2011_005.pdf

That statistical methodology may be good or it may be bad (I don’t know) but I don’t think that you can tell from that chart and the equation y_m = f(m) + Z.
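For reference, the one-sided Gaussian tail convention mentioned above can be written out directly. This is only a sketch of the conversion between "number of sigmas" and a p-value; it says nothing about how the p-value itself was obtained in the linked note. Treating the gadget study's 0.016 as a one-sided tail is an assumption made purely for comparison.

```python
import math
from statistics import NormalDist

def sigma_to_p(z):
    """One-sided Gaussian tail probability P(Z > z)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def p_to_sigma(p):
    """Number of standard deviations whose one-sided tail probability is p."""
    return NormalDist().inv_cdf(1.0 - p)

p_five_sigma = sigma_to_p(5.0)   # the "5-sigma" convention
z_gadgets = p_to_sigma(0.016)    # 0.016 read as a one-sided tail (assumption)

print(f"5 sigma   -> p = {p_five_sigma:.2e}")
print(f"p = 0.016 -> {z_gadgets:.2f} sigma")
```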

By: Zhou Fang

Zhou Fang — Fri, 22 May 2020 14:33:09 +0000

In reply to Zhou Fang.

From a high level, we have a baseline function f(m), so the weighted event counts (however they are calculated) are

y_m = f(m) + Z

where Z under the null assumption varies randomly. The authors have a theoretical estimate of the variability of Z (sigma), and fitted f(m) as a low order polynomial, and for m=~125 they detect that y – f(m) is ~5 times sigma. Therefore, they argue that the blip at m=125 is a true signal with some very small p value.

However, the distribution of y-f(m) under their polynomial fit does not have a sd of sigma. It is substantially greater. Thus we come to a number of conclusions –

1. the physics-theory sigma significantly underestimates the real variability in Z, in which case it’s not an appropriate value to use to compare y-f(125) against, or

2. the low order polynomial background fit is underfit. This can be the case because their experiments to choose the background model are based on an F-test, and do not assume a particular value for the variability of Z.
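The comparison between the stated sigma and the observed scatter of y - f(m) can be sketched as a reduced chi-square check: if sigma is right and the background is adequate, the mean squared pull should be close to 1. Everything below is simulated and hypothetical (the mass bins, the linear background, the value of sigma, and the doubled true scatter); it is not the CMS data or procedure, and it only illustrates the diagnostic.

```python
import random

def reduced_chi_square(y, background, sigma, n_params):
    """Sum of squared pulls over degrees of freedom; near 1 if sigma matches the scatter."""
    pulls = [(yi - bi) / sigma for yi, bi in zip(y, background)]
    dof = len(y) - n_params
    return sum(p * p for p in pulls) / dof

random.seed(0)
masses = [100.0 + 0.5 * i for i in range(200)]      # hypothetical mass bins
f = [1000.0 - 5.0 * (m - 100.0) for m in masses]    # hypothetical smooth background f(m)
claimed_sigma = 10.0                                # the sigma quoted for Z (assumed)
true_sd = 20.0                                      # actual scatter, twice the claim (assumed)
y = [fi + random.gauss(0.0, true_sd) for fi in f]   # y_m = f(m) + Z

# The background is known here rather than fitted, so no parameters are subtracted.
chi2_red = reduced_chi_square(y, f, claimed_sigma, n_params=0)
print(f"reduced chi-square: {chi2_red:.2f}")
```

A value well above 1 is consistent with either reading in the list above (sigma too small, or background underfit); the check alone does not say which.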

By: Zhou Fang

Zhou Fang — Fri, 22 May 2020 14:09:18 +0000

In reply to Adam Wheeler.

I don’t think the finding is inherently incredible. Filmmakers have ‘known’ for a while that certain shot compositions can induce particular psychological effects in viewers. In addition, the basis of modern advertising is the juxtaposition of a sales pitch with particular, not necessarily related imagery that is assumed to subtly influence a consumer’s buying decisions. Combine the two ideas, and boom. Perhaps the size of the effect in the original study was higher than I’d expect, but it’s not obviously false.

The hypothetical explanation provided is more tenuous, but that’s not at question here.

By: Carlos Ungil

Carlos Ungil — Fri, 22 May 2020 14:06:52 +0000

In reply to Zhou Fang. The bands represent the uncertainty in the background fit, not the expected variability in the measurements. I'm not sure what you are complaining about.

By: Zhou Fang

Zhou Fang — Fri, 22 May 2020 13:55:26 +0000

In reply to Zhou Fang. Carlos: However complicated the model, it is clear that the model doesn't accurately represent the variation in event counts outside of the Higgs bump. They do not align to the curve as closely as the CIs imply they should.

By: Carlos Ungil

Carlos Ungil — Fri, 22 May 2020 13:42:35 +0000

In reply to Zhou Fang.

I wouldn’t say that the dots there are the measurements. That chart is the final result of a not-so-simple model: http://cdsweb.cern.ch/record/1429931/files/HIG-12-001-pas.pdf

By: Zhou Fang

Zhou Fang — Fri, 22 May 2020 13:10:32 +0000

In reply to Zhou Fang.

An example of what I mean in terms of dodgy use of p-values in physics:

https://images.ctfassets.net/cnu0m8re1exe/64v3YUbcvYSBW5CbV89wvp/68ca5256775fafcdb04089731f4af85d/lhccms_higgsdiscovery.gif?w=650

The measurements are the dots here, while the curve is a 4th order polynomial fit meant to capture the background variation. The little bump at 125 GeV is the signal that everyone was excited about. Now look at that curve, and ask yourself, does the +/- 2 sigma green area *really* represent the ‘random’ variation in this data? Do 95% of the points, apart from the little Higgs bump, really lie in that area?

If someone gave me just the points on this dataset and asked me to compute the p-value of the 125 GeV bump, I sure wouldn’t call it a 5 sigma effect. Clearly there is a source of variation in the event counts that isn’t captured in the stated variation.

By: Terry

Terry — Fri, 22 May 2020 13:01:12 +0000

In reply to dhogaza.

I’m totally stuck on the heated butter knife thing. I never knew such a thing existed. Does it only work on heated butter????

It's not the *butter* that's heated, it's the .... Oh I see what you did there. It's a joke. Good one, you got me.

By: Zhou Fang

Zhou Fang — Fri, 22 May 2020 12:45:27 +0000

In reply to Giacomo Petrillo.

The fixation on ultra-low p-values in physics is to some extent cargo cult science. No one really claims the peak on the LHC is explained by random background variation. But a single experiment repeated by the same people on the same machine has an inherent pseudoreplication issue and no one is going to build a second LHC to replicate the results.

At some point if your p-value threshold is too low some things just can’t be investigated any more. Suppose that we are interested in an effect that increases people’s likelihood to buy heated butter knives by 5%, and the average person has a 0.5 probability to buy one with a sd of 0.25. Suddenly to get a power of 0.8 you need 7000 individuals! But a 5% increase in their sales would be a huge effect for a butter knife company. Just… not worth funding a study with 7000 individuals if they don’t know if the study would even get a positive result.
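One reading of the numbers that lands near 7000 is sketched below. The design is an assumption, since it is not spelled out above: two groups compared on their means, a relative 5% increase on the 0.5 baseline (so an absolute difference of 0.025), a one-sided 5-sigma threshold, and the usual normal approximation for sample size.

```python
import math
from statistics import NormalDist

def n_per_group(effect, sd, z_alpha, power):
    """Normal-approximation sample size per group for a two-group comparison of means."""
    z_beta = NormalDist().inv_cdf(power)
    d = effect / sd                      # standardized effect size
    return math.ceil(2.0 * (z_alpha + z_beta) ** 2 / d ** 2)

effect = 0.05 * 0.5   # assumed: relative 5% increase on a baseline of 0.5
sd = 0.25

n_five_sigma = n_per_group(effect, sd, z_alpha=5.0, power=0.8)
n_conventional = n_per_group(effect, sd, z_alpha=NormalDist().inv_cdf(0.975), power=0.8)

print("per group at 5 sigma:           ", n_five_sigma)
print("per group at two-sided p < 0.05:", n_conventional)
```

Under these assumptions the 5-sigma requirement costs several times the sample size of the conventional threshold, which is the trade-off being described.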

In practice forking paths do not just “randomise” the p-value. A maliciously poor choice of analysis can produce astoundingly low p-values out of nothing. A too low threshold could in fact dissuade researchers from producing honest but unimpressive analyses, and encourage them to instead go for fancy bullshit. From personal experience, too-low p-values are not in general indicators of a real effect, but rather some kind of model misspecification.

By: jrkrideau

jrkrideau — Fri, 22 May 2020 11:14:30 +0000

In reply to Adam Wheeler. Something as simple as the Müller-Lyer illusion?

By: Erik

Erik — Fri, 22 May 2020 10:22:31 +0000

Note that each pair of bars involves the same test subjects who are always in the same (high or low) condition. So we are looking at very correlated data. The modest p-value is correctly based on a repeated measures analysis, but the graph is a form of pseudo-replication. I think that’s why it looks more convincing than it actually is.
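A small simulation can illustrate why correlated bars look like independent confirmations. The sketch below assumes a null world with no effect of condition, five products, and a per-subject offset shared across all of that subject's ratings; the group sizes and variances are hypothetical and not taken from the study. It counts how often all five product differences point the same way by chance.

```python
import random

def all_same_sign_rate(subject_sd, n_sims=500, n_per_group=50, n_products=5, seed=1):
    """Fraction of null simulations in which every product favors the same group."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        group_means = []
        for _group in range(2):
            totals = [0.0] * n_products
            for _subject in range(n_per_group):
                shared = rng.gauss(0.0, subject_sd)   # offset common to all of a subject's ratings
                for k in range(n_products):
                    totals[k] += shared + rng.gauss(0.0, 1.0)
            group_means.append([t / n_per_group for t in totals])
        diffs = [a - b for a, b in zip(group_means[0], group_means[1])]
        if all(d > 0 for d in diffs) or all(d < 0 for d in diffs):
            hits += 1
    return hits / n_sims

independent = all_same_sign_rate(subject_sd=0.0)   # ratings unrelated within a subject
correlated = all_same_sign_rate(subject_sd=1.0)    # same subjects rate every product

print(f"all five bars agree, independent ratings: {independent:.2f}")
print(f"all five bars agree, correlated ratings:  {correlated:.2f}")
```

With independent ratings, five agreeing pairs would be expected about 1 time in 16 under the null; with a shared subject offset it happens far more often, so the clean pattern carries less information than it seems to.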

By: jim

jim — Fri, 22 May 2020 02:30:53 +0000

In reply to n-g. +1 :)

By: n-g

n-g — Fri, 22 May 2020 00:04:40 +0000

In reply to Carlos Ungil. ...and if you want a really strong effect size it should go up to 11.

By: Martha (Smith)

Martha (Smith) — Thu, 21 May 2020 23:52:32 +0000

In reply to Adam Wheeler.

Adam said, “Or is this about exercising continued skepticism towards ideas?”

I’d say it’s about exercising continued skepticism towards claims based on “statistical significance”.

By: Martha (Smith)

Martha (Smith) — Thu, 21 May 2020 23:49:42 +0000

In reply to Raghuveer Parthasarathy. I beg to differ: it's a repetition, not a replication! ;~)

By: Adam Wheeler

Adam Wheeler — Thu, 21 May 2020 22:20:53 +0000

In reply to Raghuveer Parthasarathy. This is an excellent point. I enjoy reading your blog, by the way.

By: Carlos Ungil

Carlos Ungil — Thu, 21 May 2020 22:17:47 +0000

In reply to Michael J Lew. If you think the axis is important it should not start at 0 but at 1 (and go up to 7).

By: Michael J Lew

Michael J Lew — Thu, 21 May 2020 22:01:18 +0000

Am I alone in thinking that the evidence would look far less compelling if the y-axis started at 0?

P=0.016 (0.02) is never very strong evidence against the null, and that would be more clear if the data were properly shown.

By: Austin Fournier

Austin Fournier — Thu, 21 May 2020 21:57:42 +0000

In reply to Giacomo Petrillo. Bear in mind also that the lower the significance threshold, the more highly exaggerated the effect sizes are in studies that manage to get published (for a fixed sample size anyway, which is admittedly unrealistic, but there is a limit to how much people would increase their sample sizes I think).

By: Raghuveer Parthasarathy

Raghuveer Parthasarathy — Thu, 21 May 2020 21:23:44 +0000

In reply to Adam Wheeler.

Others’ assessments may differ from mine, but one thing I always note is that the error bars in the figure are undoubtedly standard errors. (I didn’t check, but I would bet a large amount of money on that.) Therefore, given that N is about 200, the actual spread in the measured quantity is sqrt(N) * the displayed bars, which is about 14*0.2 = 3-ish. So really, you should replace each bar in your mind with a fuzzy blob smeared out over a range that’s larger than the actual graph. Can we distinguish two such fuzzy blobs? Sure, that’s what standard error is for, but we’d believe this only if the phenomenon is very robust, if we have a solid backdrop of theory+experiment that serves as its foundation, if we understand measurement error, if we have a good noise model, if negative results are weighted as equal to discoveries, etc. I doubt any of these are true for this sort of study.
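The back-of-the-envelope conversion above is just the standard error formula run in reverse. A minimal sketch, using the 0.2 bar size and N of about 200 quoted in the comment (both approximate readings, not checked against the paper):

```python
import math

def sd_from_se(standard_error, n):
    """Recover the sample standard deviation from a standard error of the mean."""
    return standard_error * math.sqrt(n)

spread = sd_from_se(0.2, 200)
print(f"implied spread of individual responses: {spread:.1f}")
```

On a 1-to-7 response scale, a spread of that size covers a large part of the available range.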

By: Giacomo Petrillo

Giacomo Petrillo — Thu, 21 May 2020 21:15:03 +0000

In reply to matt. I understand that something like 1e-7 may be too low. But, say, 1e-3? Something that you need a bit more effort to extort with forking paths.

By: Giacomo Petrillo

Giacomo Petrillo — Thu, 21 May 2020 21:12:50 +0000

In reply to Adam Wheeler. Also, it has green and blue bars, with blue bars on the right. Green represents plants, thus life, while blue is water. Since water is a prerequisite for life, a straight subconscious would put the blue bars on the left and the green bars on the right. But, if the person that made the graph was in a state of conflict, due to consciously making an unsupported claim, it would have subconsciously switched the colors to actualize its internal mind state in the causal reality. Moreover, they are using pictures. This does not work because you would need to actually put someone in front of the real scenery. Otherwise the representation scission will induce an anticorrelation in the behavioral patterns, such that the desire for heated butter knives cancels out with remote entryway lock systems (because they are at 2-odd positions in the graph).

By: matt

matt — Thu, 21 May 2020 20:56:04 +0000

In reply to Giacomo Petrillo. you can't compare the significance threshold in physics to social science; measurements are extremely precise in physics, phenomena being studied are also very stable.

By: Giacomo Petrillo

Giacomo Petrillo — Thu, 21 May 2020 20:54:27 +0000

In reply to Adam Wheeler. A single detail, that does not tell the whole story, is the "p = 0.016". It's not a strong significance. Example: in physics after a while they settled on requiring roughly p < 1e-7, many years ago. I would not continue using hypothesis testing, but I don't understand why they have not at least lowered the bar on p in other fields, it seems a practical way to cut off quickly a lot of bullshit.

By: Adam Wheeler

Adam Wheeler — Thu, 21 May 2020 19:08:17 +0000

In reply to Adam Wheeler.

Just read the previous post on this topic. Seems like this is about skepticism, not a problem with the first graph.

I must add, however that I’m a bit confused about why the hypothesis was credible in the first place. From the paper:
> “consumers’ views of scenery from a high physical elevation induce an illusory source of control, which in turn intensifies risk taking.”

Seriously? This seems almost like magical thinking. Am I being a jerk here? Are there real psychological effects that sound this unbelievable when you first hear them?

By: Adam Wheeler

Adam Wheeler — Thu, 21 May 2020 18:42:33 +0000

Just wanna make sure I understand the takeaway here. Is there something about the top graph that should make me suspicious (aside from the implied hypothesis)? Or is this about exercising continued skepticism towards ideas?

By: Zad

Zad — Thu, 21 May 2020 17:46:08 +0000

In reply to dhogaza. That’s exactly it. Don’t let others tell you otherwise!!

By: dhogaza

dhogaza — Thu, 21 May 2020 17:40:10 +0000

I’m totally stuck on the heated butter knife thing. I never knew such a thing existed.

Does it only work on heated butter????

By: Raghuveer Parthasarathy

Raghuveer Parthasarathy — Thu, 21 May 2020 16:49:00 +0000

“I just posted this a few hours ago, but it’s such an important message that I’d like to post it again.” — I think that counts as a replication!

By: jim

jim — Thu, 21 May 2020 16:39:10 +0000

One of my grad school colleagues, now quite comfortably employed, would have said “but it still looks to me like if you add up all the blue and all the green, the blue is still ahead, so I can put that even though it’s smaller than the error there’s still a tendency…”