Scientists Not Behaving Badly

Andrea Panizza writes:

I just read about psychologist Uri Simonsohn debunking a study by colleagues Raphael Silberzahn & Eric Uhlmann on the positive effects of noble-sounding German surnames on people’s careers (!!!). The story is mentioned here.

I think that the interesting part (apart, of course, from the general weirdness of Silberzahn & Uhlmann’s research hypothesis) is that Silberzahn & Uhlmann gave Simonsohn full access to their data, and apparently he debunked their results thanks to a better analytical approach.

My reply: Yes, this is an admirable reaction. I had seen that paper when it came out, and what struck me was that, if there is such a correlation, there could be lots of reasons for it not involving a causal effect of the name. In any case, it’s good to see people willing to recognize their errors: “Despite our public statements in the media weeks earlier, we had to acknowledge that Simonsohn’s technique showing no effect was more accurate.”

More generally, this sort of joint work is great, even if it isn’t always possible. Stand-alone criticism is useful, and collaborative criticism such as this is good too.

In a way it’s a sad state of affairs that we have to congratulate a researcher for acting constructively in response to criticism, but that’s where we’re at. Forward motion, I hope.

39 thoughts on “Scientists Not Behaving Badly”

  1. Why do people think that the hypothesis is in itself weird?

    Aren’t there previous studies showing the adverse effects of black / Jewish sounding names on job applications? Or female names on orchestra auditions?

    I remember reading about these somewhere, though I’m not sure where. Is all this body of work non-robust?

    • Rahul:

      All things are possible, but the usual story in this sort of study is that there are so many possible such indirect effects that there’s no way all or even many of them can be large, and there’s no real reason, prior to the data, for us to believe that this particular effect will be large. So (a) any such effect will be hard to find, (b) anything that is found could well be noise, and (c) all of this puts a large burden on the data analysis, so that business-as-usual statistical errors that might not be consequential when studying larger effects can doom such a study.

      • Andrew:

        My point was, if one does believe that having a black- or Jewish-sounding name gets you strongly discriminated against (as, apparently, quite a lot of people believe), then isn’t it reasonable *prior to the data* to believe in a similarly large effect here too?

        I’m asking about priors, not the particular analysis.

    • Shouldn’t a hypothesis explain something or aggregate various lines of evidence into a single equation? Is this even a useful hypothesis: “X will have a positive effect on Y”? That is just what is predicted to be measured, not really a productive hypothesis because it has no content outside itself. Do chemists measure the density of water to test hypotheses like “the density of water is 1 g/cm^3”?

      I think there is a distinction to be made between hypothesis and prediction being missed here.

    • Those studies are usually experiments where people’s resumes are sent out with different names, and the names used are actually associated with specific groups. This study was an observational study (http://web.natur.cuni.cz/~houdek3/papers/Silberzahn%20et%20al%202013.pdf), and there doesn’t seem to be any assumption that the names actually represent being noble; that’s not what the theory is.
      I don’t know much about German society and the importance of nobility, so I don’t know if the predictions make sense. I mean, “Because of basic properties of associative cognition, the status linked to a name may spill over to its bearer and influence his or her occupational outcomes” is a totally different kind of argument than saying that if people think you are female you are less likely to get hired. The attribution to some deep cognitive association… well, it seems a bit far-fetched to me. On the other hand, for the US name-job studies there are good reasons, based on history (social, legal, political) and observed differences in careers, to think that discrimination based on race, ethnicity, religion, and gender exists in the US, and those studies are just looking at how that may or may not play out at the micro level.

  2. Two people come to mind: von Neumann, who arbitrarily added “von” to a surname with which it makes no grammatical sense (and which was not German either) – I always wondered why – and Reiner Protsch, a fraudulent archeologist who used a fake noble surname and turned out to be the son of a Nazi politician instead, like in a bad Hollywood movie.

    • Did John von Neumann add the “von” himself, or was that the doing of his father? Likewise, I believe the hyphen in Murray Gell-Mann was due to his father. A deeper mystery: why is the cover of the book “Evilicious,” by Marc Hauser, making yet another appearance on the blog?

      • He was the one who introduced the “von,” but I looked it up and his father did acquire Austro-Hungarian nobility that came with an additional surname containing the Hungarian grammatical equivalent of “von,” which, if Germanified, could result in Johann Neumann von Margitta. “von Neumann” still makes no sense (von means “from,” but Neumann means “new man,” by no means a place), but it could be viewed as some sort of attempt to compress the original.

  3. Not directly related, but I was reminded of this intriguing study by Stefano Allesina, which did a statistical analysis of the names of Italian university professors to claim strong evidence of nepotism in academia:

    http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0021160

    Abstract

    Nepotistic practices are detrimental for academia. Here I show how disciplines with a high likelihood of nepotism can be detected using standard statistical techniques based on shared last names among professors. As an example, I analyze the set of all 61,340 Italian academics. I find that nepotism is prominent in Italy, with particular disciplinary sectors being detected as especially problematic.

    • While the claim of the paper may be (sadly) reasonable, the paper doesn’t seem to present the evidence in the right way. For example, the author explains that “in Italy each academic has to declare a macro and micro disciplinary sector.” Start with the macro-sectors, of which there are 28. The author uses Monte Carlo to compute the probability of obtaining as many different names as actually found in each macro-sector, or fewer, if the names were drawn at random from the set of distinct names of Italian academics. He gets 9 significant results out of 28, at a 5% level. That’s not a small fraction, but 28 tests are not few either.

      The situation is considerably worse for the micro-sectors, which are much more numerous: with 370 micro-sectors, you would expect about 18 p-values below 0.05 by chance alone (he finds 45). The author considers this briefly in a paragraph, where he says that he should have used the Bonferroni correction, but he didn’t because that would have reduced the power of the tests too much. So, basically, powerless tests are bad, but a rate of Type I errors much higher than the one he states is ok? That doesn’t seem quite right. Using the Bonferroni correction, the number of statistically significant results becomes much smaller (3 and 7, respectively, instead of 9 and 45). There’s still a finding, but why report in the abstract the higher, more impressive numbers if the author knows that the real effect may actually be much smaller? To be fair, he does cite a reason: “in Italy women maintain their maiden names, and children take their father’s last name.” This should lead to an underestimate of the real effect. However, this is not quantified in any way, and reading a paper where 398 (!!!) tests were performed still leaves me uneasy.
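
      To make the arithmetic concrete, here is a minimal sketch (Python; the 370 micro-sectors and the 5% level are taken from the comment above, everything else is invented for illustration) of how many “significant” sectors one would expect under the global null, and what the Bonferroni threshold works out to:

      ```python
      import numpy as np

      rng = np.random.default_rng(0)

      n_tests = 370   # number of micro-sectors tested
      alpha = 0.05    # per-test significance level

      # Expected count of p-values below 0.05 if every null hypothesis were true:
      print("expected false positives:", n_tests * alpha)   # 18.5

      # Bonferroni-corrected per-test threshold for a familywise level of 0.05:
      print("Bonferroni threshold:", alpha / n_tests)        # ~1.35e-4

      # Monte Carlo check: under the global null, p-values are uniform on (0, 1),
      # so the count below 0.05 fluctuates around 18-19 across simulated studies.
      p = rng.uniform(size=(10_000, n_tests))
      counts = (p < alpha).sum(axis=1)
      print("simulated count, mean and 95% range:",
            counts.mean(), np.percentile(counts, [2.5, 97.5]))
      ```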

  4. “The author considers this [the large number of hypothesis tests performed] briefly in a paragraph, where he says that he should have used the Bonferroni correction, but he didn’t because that would have reduced the power of the tests too much.”

    This is an all too common cop-out. It’s doing the multiple testing that reduces the power.

      • Hi, Martha, good point! You mean that if one uses the correct significance level (which for N independent tests is (1 − alpha)^N, with alpha being the significance level for a single test), then the power is much lower, right?

        • Well, something like that, but not necessarily exactly.

          The general idea: in calculating the power of any single test within a family of tests, one does need to account for the multiple testing.

          What you seem to suggest is using a simple Bonferroni method: If one wants overall (“familywise”) significance level alpha, then use significance level alpha/(number of tests) for each test — both in calculating power and in significance testing. (e.g., if you want familywise .05 and have 5 tests, use .01 for each.)

          But the basic Bonferroni method can allow other ways to “distribute” the overall significance level (e.g., you might decide — in advance — that one test will be at level .03 and four at level .005).

          But there are other methods as well — see, e.g. B. Efron (2010), Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction, Cambridge (or his Stats 329 Notes, at http://www-stat.stanford.edu/~omkar/329/)
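
          As a rough illustration of that tradeoff (my own toy numbers, not from anyone’s study: one-sided z-tests, a standardized effect of 0.3, and 100 observations per test), running everything at 0.05 makes a familywise error essentially certain, while the Bonferroni-adjusted level costs a lot of per-test power:

          ```python
          import numpy as np
          from scipy.stats import norm

          def power_one_sided_z(delta, n, alpha):
              """Power of a one-sided z-test (sigma = 1) against a true effect `delta`."""
              return norm.sf(norm.isf(alpha) - delta * np.sqrt(n))

          n_tests = 370
          alpha = 0.05
          delta, n = 0.3, 100   # hypothetical effect size and sample size

          # Familywise error rate if each of 370 independent tests is run at 0.05:
          print("FWER at alpha = 0.05:", 1 - (1 - alpha) ** n_tests)                  # ~1.0

          # Per-test power, uncorrected vs. Bonferroni-corrected:
          print("power at 0.05:    ", power_one_sided_z(delta, n, alpha))             # ~0.91
          print("power at 0.05/370:", power_one_sided_z(delta, n, alpha / n_tests))   # ~0.26
          ```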

        • Nononononononono, this discussion makes me want to scream! None of this familywise error crap ever ever ever ever. Please read my paper with Hill and Yajima!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

        • Yes, not doing testing is the best option; multilevel modeling can usually do a better job; Type S and M errors are more meaningful than Type I errors — but if you must do testing, then at least try to do it in a way that takes multiple testing into account, so that you cut down the chances of getting lots of unreproducible results.

          (Very rough analogy: “Just Don’t do Testing!” is like abstinence only sex ed; accounting for multiple testing is like distributing condoms.)

        • I wish more posts on multilevel modelling would actually address the part where one translates a multilevel model into a Yes/No decision.

          Even if it’s not a Yes/No decision, some sort of discrete decision may be needed for a real-world policy-making application.

        • As Rahul also points out, it makes no sense to say just don’t do (hypothesis) testing.

          In practice, one has to make a decision based on the data. Andrew always responds: use decision theory for that. But no loss function gets suggested, because it’s hard to come up with one. For specific medical decision-making processes it is possible; but for psych-type studies, where one says “we have evidence for hypothesis X and against Y,” all we have is the posterior distribution. We can present it to the reader and he or she can decide what to make of it. In theory.

          But that’s not how it works in practice. You have to discuss what you, the researcher, conclude from the data. Just presenting the posterior and being circumspect about the data, no matter how strongly the results support your theory X, is a reasonable way to go, but then you have to forego publishing in “top” journals, which also means (for untenured researchers) foregoing jobs and funding opportunities (because funding bodies look for articles published in “high quality” venues like Science, Nature, Psych Science, and so on). It’s a lot to ask of a young researcher, and that’s why Andrew’s advice is not being widely adopted. It would be, if publishing articles were just about the science and nothing else. Even I can’t adopt it, because I publish with my students and they need to publish in journals with brand recognition. I accept this and put up with it, even though editors at such journals routinely say scream-inducing things like “lower p-values are more convincing” and “replications waste space, remove them” (actual comments).

        • Shravan:

          I’ve published hundreds of papers with never a multiple comparisons correction, at least, none that I can remember. When there are multiple comparisons I fit a hierarchical model. I don’t see why psychology researchers can’t do that. I’ve never published in Science, Nature, or Psych Science but maybe that’s not so important.

          That said, I will fully accept any criticism you have of my exposition; I’m sure it would help if my books had clearer worked examples on how to do this, maybe a set of “this is bad; here’s a better way” examples. My above screaming is not meant as a substitute for future constructive advice.
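
          For what it’s worth, here is a minimal sketch of the partial-pooling idea (not Andrew’s code; simulated data, a crude moment-based estimate of the between-group variance, and 370 groups chosen only to echo the example above — a real analysis would fit the full hierarchical model, e.g. in Stan):

          ```python
          import numpy as np

          rng = np.random.default_rng(1)

          # Hypothetical setting: 370 groups (say, sectors), true effects mostly
          # near zero, each estimated with noise. All numbers are made up.
          J = 370
          theta = rng.normal(0.0, 0.1, size=J)   # true group effects
          sigma = np.full(J, 0.5)                # standard error of each estimate
          y = rng.normal(theta, sigma)           # observed estimates

          # Crude empirical-Bayes step: estimate the grand mean and the between-group
          # variance (a full hierarchical model would put priors on these and integrate).
          mu_hat = y.mean()
          tau2_hat = max(y.var(ddof=1) - np.mean(sigma**2), 0.0)

          # Partial pooling: each raw estimate is shrunk toward the grand mean, more
          # strongly when the between-group variance is small relative to the noise.
          weight = sigma**2 / (sigma**2 + tau2_hat)
          theta_post = weight * mu_hat + (1 - weight) * y

          extreme = np.argmax(np.abs(y))
          print("most extreme raw estimate:", y[extreme])
          print("same group after pooling: ", theta_post[extreme])
          ```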

        • To restate my point:

          There’s a lot of activity on how to set up a hierarchical model.

          But there’s not so much that shows how to translate an HM into actual decisions.

        • “I don’t see why psychology researchers can’t do that.”

          Because nothing would come out statistically significant. The desire for fame and jobs and funding is the reason to not do what you suggest. This reason might be morally indefensible, but it is very much here to stay and will guide the process of data analysis now and forever.

          My feeling is that many scientists are aware of the issue, but dare not do what you suggest because their publication records (or their students’) will suffer. So they will continue with the status quo.

          I don’t have any constructive suggestion for how to solve this. All I can personally do about it is not play the game and face rejection in “major” journals. And maybe try to teach the next generation about Type S and M errors, Bayesian methods, and other topics that keep coming up on this blog.

        • I’ve a slightly different view: Models are best when made with a clear objective in mind.

          In the absence of a clear downstream decision which will be driven by a model there’s not much incentive to generate a “good” model. In fact, it may not even be clear what “good” means.

          Do most Psych papers have a goal? Outside of getting published.

        • Rahul, of course they have a goal outside of getting published.

          There are many theories out there about cognitive processes (for example) that people try to find evidence for or against. That’s what the decision is about in such studies.

        • I don’t know why I can’t reply to Shravan’s comments: there’s no “Reply” button on his comments, or on many others (maybe we’ve reached the max depth for nested comments!). Anyway, I find his post of January 18, 2016 at 12:25 am very interesting, and I would like to add that in a lot of contexts you have to take a Yes/No approach. If you write a scientific paper, directed at other scientists, it may be perfectly reasonable to show the posterior distribution and let the readers decide for themselves. In my company, I cannot show the posterior to my managers and tell them to decide for themselves: I am required to make a decision, such as “do we use a radial or tangential inlet? Do we use intercooling or not?” etc. My situation is further complicated by the fact that I’m the only design engineer on the team who stands by the importance of using statistics in the design process. Most of the designers (and all of the senior designers) “don’t believe in statistics, anyway” (!!!).

        • Andrea:

          Let me emphasize that I never say we should just show people the posterior distribution. As we discuss many times in BDA, the posterior distribution (as typically represented by simulations) is an intermediate step, to be used to get inferences for quantities of interest.
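
          A hedged sketch of that last step — turning posterior draws into a discrete choice via expected loss. The draws, the two actions, and the loss function below are all invented for illustration (loosely echoing Andrea’s inlet example), not taken from BDA:

          ```python
          import numpy as np

          rng = np.random.default_rng(2)

          # Pretend these are posterior draws for the quantity of interest (e.g., the
          # gain from a tangential inlet); in practice they would come from the fitted
          # model rather than from a normal random number generator.
          effect_draws = rng.normal(0.4, 0.6, size=4000)

          # Hypothetical loss: switching costs 1 unit up front and pays off in
          # proportion to the (uncertain) effect; staying costs nothing.
          def loss(action, effect):
              return 1.0 - 5.0 * effect if action == "switch" else 0.0

          expected_loss = {a: np.mean([loss(a, e) for e in effect_draws])
                           for a in ("switch", "stay")}
          decision = min(expected_loss, key=expected_loss.get)

          print(expected_loss)       # average loss of each action over the posterior
          print("decision:", decision)
          ```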

        • Hi, Andrew,

          OK, I was quoting Shravan, but now I understand what you would suggest instead. I’m very curious to get to the point in BDA where multilevel modeling is used to make decisions. Surely that would be very useful for my applications.

        • On January 18:

          Andrew said (2:30 am): “I’m sure it would help if my books had clearer worked examples on how to do this, maybe a set of “this is bad; here’s a better way” examples.”

          Rahul said (2:52 am): “There’s a lot of activity on how to set up a hierarchical model. But there’s not so much that shows how to translate an HM into actual decisions.”

          Andrea said (8:57 am): “I’m very curious to get to the point in BDA where multilevel modeling is used to take decisions. Surely that would be very useful for my applications.”

          At about that time I thought of suggesting discussing how hierarchical models could be used in the type of gene expression studies discussed in Efron’s book, but for whatever reason didn’t follow up.

          However, yesterday I attended a talk by Mike Love (http://mikelove.github.io/) that brought the thought to mind again. The motivating situation in Love’s talk was using RNA sequencing to try to identify gene expression involved in individual cancer cases, then using that information to decide which treatment to use. (Part of the rationale for using RNA sequencing is that the “traditional” microchip methods for testing gene expression are limited to comparison with specific pre-selected proteins, whereas the RNA sequencing method allows “discovery” of expressed proteins that have not been pre-selected for comparison.) He discussed using shrinkage estimators for analyzing data from RNA sequencing, but later focused on false positives and FDR. Someone asked him why he used p-values and FDR rather than Bayesian methods. He replied, “Because the biologists like it.”

          So this situation seems a good one (since it is of practical interest) for thinking about whether Bayesian methods might be better for the purpose at hand. Some vague thoughts: Conceivably effect size (specifically, strength of RNA expression) might be more relevant than a yes-no “expressed or not” decision. This might be an argument for Bayesian methods being better (although the drive to have a yes-no decision might then lead to considering “thresholds” for expression – which leaves open the question of how to choose them). Or possibly the pattern of expression might be relevant for treatment decisions (e.g., protein x alone might not be relevant, but the combination of x and y might be; or the relative levels of x and y might be). This also could be an argument for Bayesian methods being better. Or it might be the case that current knowledge of treatments and their relation to gene expression is so crude that the current FDR approach is good enough to make relatively substantial progress in treatment.

          Clearly, considering the question of whether Bayesian methods would be better needs to involve serious discussion between biologists and statisticians.

        • Ok, sorry :) Can I just 1) correct the formula (it’s 1 − (1 − alpha)^n – silly mistake) and 2) say that we agree at least that, even if FWER control is crap, Allesina’s paper, with its 398 uncorrected tests, wasn’t that great either? Sooner or later I’ll get to the multilevel modeling part of BDA… but I need to go through quite a number of chapters first. BTW, I noticed that there’s a part on LOO-CV. I’ve always seen k-fold CV applied in a frequentist context, so I’m very curious to see what it becomes in the Bayesian paradigm.

        • Hmmm, this is weird, I’m sure I replied to Andrew’s post in the discussion between me and Martha, but the post appeared somewhere else. Anyway, my post was a reply to Andrew’s post on January 17, 2016 at 11:02 pm.

        • Yes, that link fails for me too. Try statweb.stanford.edu/~ckirby/brad/other/2010LSIexcerpt.pdf — it seems to give most of what was at the earlier link.

    • From the paper:

      Other limitations of the study are statistical. Given that I performed several tests, there is the risk of introducing false positives due to multiple comparisons. Typically, one would take recourse to Bonferroni’s or similar corrections to account for multiple hypotheses testing. However, these methods entail considerable loss of power, as they are rooted in the number of tested hypotheses: if one is testing 370 micro-sectors, a significance level of less than 1.4 × 10^-4 should be used to guarantee an overall significance level of 0.05 for the tests (using Bonferroni’s correction). Using these restrictive techniques, only the macro and micro-sectors for which I did never observe a lower number of names out of a million drawings could be considered significant (3 macro, 7 micro).

      Similar problems are found in the literature on genetic screenings, where the effects of hundreds or thousands of genes are routinely tested. A useful concept taken from this literature is that of a q-value [12]. This value specifies the expected proportion of “false discoveries” when all the tests resulting in a p-value lower than x are called significant. I set the q-value to 0.05 (i.e. I wanted to keep the expected proportion of false positives under 5%), and I found that all the disciplines with p-value <0.05 fell in this region. Thus, less than 5% of the nine significant macro-disciplines are likely to be a false positive, confirming the results obtained above. A different outcome is obtained for the micro-sectors, because of their small size and the large number of sub-disciplines: in order to keep a q-value of 0.05, I would have to call significant only the top 15 micro-sectors (instead of 45). Calling all the tests with p<0.05 significant, would yield a value of 0.37: of these 45 sub-disciplines, 16.65 are likely to be false positives.
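
      The q-value machinery in the quoted passage follows the FDR literature; a bare-bones Benjamini–Hochberg sketch (not the paper’s code — the p-values here are simulated, with 360 nulls and 10 genuinely tiny ones) shows the mechanics of keeping the expected share of false discoveries near 5%:

      ```python
      import numpy as np

      def benjamini_hochberg(pvals, q=0.05):
          """Boolean mask of discoveries at FDR level q (Benjamini-Hochberg step-up)."""
          p = np.asarray(pvals)
          m = len(p)
          order = np.argsort(p)
          below = p[order] <= q * np.arange(1, m + 1) / m
          k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
          reject = np.zeros(m, dtype=bool)
          reject[order[:k]] = True
          return reject

      rng = np.random.default_rng(3)
      pvals = np.concatenate([rng.uniform(size=360),           # null sectors
                              rng.uniform(0, 1e-4, size=10)])  # genuinely small p-values

      reject = benjamini_hochberg(pvals, q=0.05)
      print("BH discoveries:", reject.sum())          # close to the 10 real effects
      print("naive p < 0.05:", (pvals < 0.05).sum())  # ~28: the 10 real ones plus ~18 noise
      ```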
