I have zero problem with people reporting results they found with p=0.1. Or p=0.2. Whatever. The problem is with the attitude that publication should imply some sort of certainty.

Posted on May 26, 2019 5:19 PM by Andrew

56 thoughts on “I have zero problem with people reporting results they found with p=0.1. Or p=0.2. Whatever. The problem is with the attitude that publication should imply some sort of certainty.”

Deborah G. Mayo on May 26, 2019 8:40 PM at 8:40 pm said:

Is this attitude (that publication implies some sort of certainty) prevalent? I don’t see it. The statistical qualification conveys that there’s some probability the method erroneously outputs such an indication. Computing the P-value over various values of the parameter enables inferring discrepancies (from a reference value) that have and have not been well tested. Some claims may be genuinely falsified via statistics, but publication has little to do with it.
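
To make "computing the P-value over various values of the parameter" concrete, here is a minimal sketch. It assumes a one-sided test of a normal mean with known standard deviation; the function name `p_value_curve` and all the numbers are hypothetical and are not taken from the comment.

```python
from math import sqrt
from statistics import NormalDist


def p_value_curve(xbar, sigma, n, mu_values):
    """One-sided p-values P(Xbar >= xbar; mu0) for each hypothesized mean mu0.

    Assumes a normal model with known sigma (an illustrative assumption).
    """
    se = sigma / sqrt(n)
    std_normal = NormalDist()
    return {mu0: 1.0 - std_normal.cdf((xbar - mu0) / se) for mu0 in mu_values}


# Hypothetical data: observed mean 0.4, sigma 1, n = 25, so the standard error is 0.2.
curve = p_value_curve(xbar=0.4, sigma=1.0, n=25, mu_values=[0.0, 0.2, 0.4, 0.6])
for mu0, p in curve.items():
    print(f"mu0 = {mu0:.1f}  p = {p:.3f}")
```

Reference values with a small p-value (here 0.0) are ones the data discredit fairly strongly; values with a large p-value (0.4, 0.6) have not been ruled out by this test.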

- ojm on May 26, 2019 9:00 PM at 9:00 pm said:
  
  > Is this attitude (that publication implies some sort of certainty) prevalent?
  
  It’s at least a pretty familiar one…
  
- Thanatos Savehn on May 26, 2019 9:45 PM at 9:45 pm said:
  
  Come to a Daubert hearing at the courthouse with me. You’ll see.
  
- Jake on May 26, 2019 10:56 PM at 10:56 pm said:
  
  For many journals, publishing a paper is dependent on the paper’s findings including low p-values, because the journal believes this assures some level of certainty. They don’t want to publish findings that they think have more than a small probability of being random noise, and they think using p-values to filter papers for publication helps filter out random-noise results. I think it is likely that, to the journal, publishing the paper does mean that the findings are unlikely to be just random noise, and that it does provide some level of certainty, especially compared to papers they reject because of larger p-values. That holds even if filtering papers by such criteria is not actually a reliable method (which I am sure people will argue about).
  
- Andrew on May 26, 2019 11:40 PM at 11:40 pm said:
  
  Deborah:
  
  Yes, the attitude that publication implies some sort of certainty is prevalent. We’ve discussed a few zillion examples of this on this blog. For lots more examples, just check out Gladwell, Freakonomics, NPR, or the press releases of the Association for Psychological Science.
  
  - Sameera Daniels on May 27, 2019 10:28 AM at 10:28 am said:
    
    I was able to read Paul Stolley’s article on R. A. Fisher and the Tobacco/Lung Cancer Link from the link posted below. But the comments are closed. Too bad, we’ve been discussing the subject on Twitter.
    
    https://statmodeling.stat.columbia.edu/2012/09/02/cigarettes/
    
Peter Dorman on May 27, 2019 1:27 AM at 1:27 am said:

I can add that it appears to be routine in undergraduate education (maybe also in HS but I don’t know) to divide claims into two categories, those supported by peer-reviewed sources and those not. Listen to the folk epistemology in classrooms instructing students in how to do “research”. If you see a claim, should you believe it? Yes, if it went through peer review, no otherwise. And that’s what my students believe when they show up for my courses. Do others have the same experience?

- Mikhail Shubin on May 27, 2019 2:25 AM at 2:25 am said:
  
  +1
  yes, I have seen that.
  
  I feel the problem starts when undergraduates are told to learn from publications instead of textbooks. But they are never told to treat these publications critically.
  
  - Steve on May 27, 2019 8:51 AM at 8:51 am said:
    
    My guess is it is the other way around. When they start to learn from textbooks, they learn that knowledge is just a set of established truths that they need to memorize. Secular education is no different than religious education until upper level courses in college and grad school.
    
    - Martha (Smith) on May 27, 2019 11:49 AM at 11:49 am said:
      
      “Secular education is no different than religious education until upper level courses in college and grad school.”
      
      It depends on the school, the teacher, supervisors, the institution, etc. Good secular education in high school and lower level college courses can be very different from religious education; poor secular education can be just like religious education even in upper level courses in college and grad school.
    - Steve on May 27, 2019 12:01 PM at 12:01 pm said:
      
      Of course, you are right. It just seems to me the teachers trying to get students to challenge received wisdom are exception rather than the model.
    - Martha (Smith) on May 27, 2019 12:31 PM at 12:31 pm said:
      
      Steve said, “It just seems to me the teachers trying to get students to challenge received wisdom are exception rather than the model.”
      
      You’re right — my view is admittedly biased, since not only have I tried to get students to challenge the “that’s the way we’ve always done it” mindset, I’ve also spent time working with high school math teachers who in fact challenged TTWWADI. (They even had “buttons” displaying TTWWADI with a slash through it.)
    - Ben Prytherch on May 29, 2019 2:34 AM at 2:34 am said:
      
      As an educator, my impression is that teachers feel like it’s hard enough to get students to understand the “established truths” in the first place, and we dare not try getting them to competently critique the established truths.
      
      I struggle with this a lot in teaching intro stats. We can’t not teach “statistical significance”; it’s ubiquitous. Teaching it is painful and involves student frustration, because it is unintuitive. Teaching it critically requires getting students to understand it deeply enough to understand the critiques. The result is often more frustration (“why are you teaching us this and then criticizing it?”), or cynicism (“my stats teacher told me not to believe statistics”).
      
      I don’t have a solution. I’d like to be able to de-emphasize significance while still getting students to understand what it is. That’s a big ask in a mandatory intro stat class where the students aren’t all that enthusiastic about the material in the first place.
Anonymous on May 27, 2019 4:32 AM at 4:32 am said:

“The problem is with the attitude that publication should imply some sort of certainty.”

Next to the p-value, I believe there was also a brief period of emphasis on how “replicable” (“certain”?) the “findings” are supposed to be in the form of “p-rep”: https://en.wikipedia.org/wiki/P-rep

Also, keep your eye out in the next few years for (probably) upcoming algorithms that will show everyone how much we can “trust” something by assigning some sort of replication-score, or trust-score, or certainty-score, to a particular “finding”.

You wouldn’t even need to actually perform a replication anymore, and/or you can let the algorithm decide for you what “findings” you should spend all your resources on if you still decide to want to actually replicate something!

Science!

- Andrew on May 27, 2019 9:32 AM at 9:32 am said:
  
  Anon:
  
  Regarding your paragraph above on the trust-score, here’s something I wrote to the director of a U.S. government program that is funding work in this area:
  
  I’m skeptical about the idea of assigning a numerical “confidence level” to published claims. The trouble is that most published claims are too vague to be tested. Also I’m worried because most of the literature on replication seems to equate successful replication with a “statistically significant” p-value, an attitude I think is misguided. (For one thing, then you can always get a minimum 50% replication rate by simply running a replication study with a huge N. This is not the only problem but it demonstrates the general concern.)
  
  I’m not saying the whole idea is a bad one—you have to start somewhere!—I’m just wary here in part because it’s so easy for people to come up with measures such as replication index or predictive confidence level or intrinsic Bayes factor or expected power or whatever, numbers that don’t make sense but satisfy the desire for a single-number summary of evidence—something that I just don’t think makes sense.
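
  One way to read the “minimum 50%” remark, as a rough sketch under assumptions that are not spelled out in the letter: suppose the true effect θ is nonzero, the original estimate is normally distributed around θ with standard error σ_orig, and a replication “succeeds” when it is statistically significant in the same direction as the original. With a huge N the replication is significant with probability approaching 1 and its sign is the sign of θ, so success depends only on whether the original estimate had the right sign:

  ```latex
  \Pr(\text{replication ``succeeds''})
    \;\longrightarrow\;
    \Pr\!\left(\operatorname{sign}\hat\theta_{\text{orig}} = \operatorname{sign}\theta\right)
    = \Phi\!\left(\frac{|\theta|}{\sigma_{\text{orig}}}\right)
    \;\ge\; \tfrac{1}{2}
    \qquad (N \to \infty)
  ```

  Under these assumptions, even an original study that was essentially noise (|θ| tiny relative to σ_orig) would “replicate” about half the time, which says little about the quality of the original work.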
  
  - Anonymous on May 27, 2019 10:07 AM at 10:07 am said:
    
    Quote from above: “Regarding your paragraph above on the trust-score, here’s something I wrote to the director of a U.S. government program that is funding work in this area:”
    
    The entire “assigning confidence levels” thing seems fundamentally unscientific to me.
    
    I reason it sets a bad example, because I think it will only lead to the same problematic stuff that has probably been going on concerning science in the last decades (e.g. no actual replications are performed, “short cuts” in thinking about “evidence” will be used again, “trusting” something because of “score” or “source”, etc.).
    
    I also doubt the resulting algorithm makes much sense concerning forming a solid, and wise, basis for future work’s “trustworthiness” or “replicability”.
    
    It would seem to me that:
    
    1) just because you can get an algorithm to “predict” something based on certain data doesn’t mean that it can predict it for other data, and
    
    2) as soon as the algorithm has been made known, future “findings” can be manipulated to somehow “fit” the algorithm to get a high “confidence-score”.
    
    3) it seems to me that you can pick and choose certain data to get the algorithm you want, which in turn could lead to getting other things you want on the basis of the algorithm.
    
    Also see this: https://www.technologyreview.com/s/608248/biased-algorithms-are-everywhere-and-no-one-seems-to-care/
    
  - Sameera Daniels on May 27, 2019 10:30 AM at 10:30 am said:
    
    Completely agree with you, Andrew.
    
  - Thanatos Savehn on May 28, 2019 5:21 PM at 5:21 pm said:
    
    Have you had a chance to look at the brand new “Reproducibility and Replicability in Science”, a report to Congress from the NASEM? Relevant to your discussion here, perhaps, is this: “We propose a set of criteria to help determine when testing replicability may be warranted.” They are:
    
    1) The scientific results are important for individual decision-making or for policy decisions.
    2) The results have the potential to make a large contribution to basic scientific knowledge.
    3) The original result is particularly surprising, that is, it is unexpected in light of previous evidence and knowledge.
    4) There is controversy about the topic.
    5) There was potential bias in the original investigation, due, for example, to the source of funding.
    6) There was a weakness or flaw in the design, methods, or analysis of the original study.
    7) The cost of a replication is offset by the potential value in reaffirming the original results.
    8) Future expensive and important studies will build on the original scientific results.
    
    Anyway, I’ve been reading the report and from my lawyerly, non-stats mind it’s really good. Here’s the link: https://sites.nationalacademies.org/sites/reproducibility-in-science/index.htm
    
    - Anonymous on May 28, 2019 6:15 PM at 6:15 pm said:
      
      “Have you had a chance to look at the brand new “Reproducibility and Replicability in Science”, a report to Congress from the NASEM?”
      
      I took a quick look at the “Consensus study report highlights”, and the recommendations.
      
      I find them to be in line with your quoted set of criteria for determining when testing replicability may be warranted, which in my case is not a “good” thing. I find both the highlights and criteria:
      
      1) toothless and too general, and
      
      2) not tackling the disease, but merely the symptoms, thereby possibly wasting tons of resources, not solving anything, and doing things “backwards”.
      
      My alarm bells definitely went off when reading a recommendation that involved the so-called “TOP-guidelines” on page 11 of the “uncorrected proofs”. I think the “TOP guidelines” are written, and designed very poorly, are possibly doing the exact opposite of what they say they do, and can possibly make things even worse. E.g. see here https://blogs.plos.org/absolutely-maybe/2017/08/29/bias-in-open-science-advocacy-the-case-of-article-badges-for-data-sharing/#comment-33101
      
      I then stopped reading the “uncorrected proofs”.
      
      To summarize my opinion and reasoning:
      
      1) Almost none of the 8 criteria you mention for deciding when to perform a replication make much sense to me. It is doing things “backwards” in my reasoning, and may actually reinforce sloppy work by “rewarding” it by spending resources on it to replicate.
      
      2) Almost none of the recommendations in the summary make much sense to me. They are all very weak, toothless, general remarks that have been mentioned dozens of times before (e.g. see the APA guidelines).
      
      3) I dislike all these “committees” of “stakeholders” coming up with their “recommendations”. I reason they can come up with a lot of stuff that will only serve their own interests, or those of their friends.
    - Andrew on May 28, 2019 6:23 PM at 6:23 pm said:
      
      Thanatos:
      
      I don’t have the patience to read that report, but my quick comment based on the above excerpts is that there’s a lot of work that’s just too crappy to be worth replicating.
      
      I guess the big questions in the above list are #7 and #8. So, for example, ESP, ovulation and voting, himmicanes, power pose, etc., would all be potentially important—if these effects were as large and persistent as claimed by the researchers who are selling these ideas. But the studies in question are so bad that they add essentially nothing beyond the original theories themselves. There’d be little to no scientific value in direct replications of these studies, as any such replications would amount to little more than collection of random numbers. There could be a “sociological” value in such replications as they could possibly convince some people not to believe the original claims—but that’s hardly a good scientific reason to do a study, just to try to confirm what is already clear. Conversely, if for substantive reasons people are interested in these hypotheses (ESP, etc.), the right way to go is to design better studies, not to replicate junk science that happens to have been published.
      
      To put it another way: in the above list, item #6 (a weakness or flaw in the study) is taken as a rationale for testing replicability. I’d think about it the other way: If a study is strong, it makes sense to try to replicate it. If a study is weak, why bother?
      
      Here’s the point. I think that replication is often taken as a sort of attack, something to try when a study has problems. I think that replication is an honor, something to try when you think a study has found something.
      
      I guess I should post something on this.
    - Thanatos Savehn on May 28, 2019 6:59 PM at 6:59 pm said:
      
      It has some quotes you might like:
      
      “…a scientific study or experiment may be seen to entail hundreds or thousands of choices …” “An investigator may not realize when a possible variation could be consequential …”
      
      Some are hard hitting:
      
      “Researchers who knowingly use questionable research practices with the intent to deceive are committing misconduct or fraud.”
      
      And for the academics:
      
      “In some disciplines and research groups, data are seen as resources that must be closely held, and it is widely believed that researchers best advance their careers by generating as many publications as possible using data before the data are shared.”
    - Ben Prytherch on May 29, 2019 2:43 AM at 2:43 am said:
      
      “There could be a “sociological” value in such replications as they could possibly convince some people not to believe the original claims—but that’s hardly a good scientific reason to do a study, just to try to confirm what is already clear.”
      
      If only it was already clear. If a flawed study is getting attention and most people believe its results, that sociological value is enormous. Go back and look at the studies in RP:P and Many Labs that failed to replicate… in retrospect they look silly. But just saying “this is silly” is not effective. Showing failure to replicate has had a big impact on popular attitudes toward statistical significance.
    - Thanatos Savehn on May 29, 2019 3:00 AM at 3:00 am said:
      
      But the point, I think, is that you good people have gotten the attention of the most powerful deliberative body in the world. That’s something, isn’t it?
    - Andrew on May 29, 2019 6:20 AM at 6:20 am said:
      
      Ben:
      
      You write, “in retrospect they look silly. But just saying “this is silly” is not effective. Showing failure to replicate has had a big impact on popular attitudes toward statistical significance.”
      
      Yes, this is what I’m saying. Failure to replicate bad studies can be valuable in convincing popular opinion, and in convincing various scientists who have not fully thought out the problems in these published studies. But these failed replications don’t have much value as scientific evidence, given that we already saw fatal flaws in the earlier published papers.
    - Anonymous on May 29, 2019 7:26 AM at 7:26 am said:
      
      “Failure to replicate bad studies can be valuable in convincing popular opinion (…)”
      
      You can basically convince different people of different things depending on which (type of) studies you are replicating.
      
      Replicating studies may also get you lots of media attention, which provides a further way to convince different people of different things depending on what you want them to be convinced of.
    - Sameera Daniels on November 13, 2019 7:41 PM at 7:41 pm said:
      
      I concur Andrew
Mikhail Shubin on May 27, 2019 5:40 AM at 5:40 am said:

Speaking of p-values, it would be nice to know how they are used in industry. You know, the place where statistics is constantly tested by practice, and where there is an actual price for making wrong decisions.

Here is one article I found:
https://hbr.org/2016/02/a-refresher-on-statistical-significance

There is a lot to unpack from this text, but here is one quote:

> Remember that the new marketing campaign above produced a $1.76 boost (more than 20%) in average sales? It’s surely of practical significance. If the p-value comes in at 0.03 the result is also statistically significant, and you should adopt the new campaign. If the p-value comes in at 0.2 the result is not statistically significant, but since the boost is so large you’ll likely still proceed, though perhaps with a bit more caution.

> But what if the difference were only a few cents? If the p-value comes in at 0.2, you’ll stick with your current campaign or explore other options. But even if it had a significance level of 0.03, the result is likely real, though quite small. In this case, your decision probably will be based on other factors, such as the cost of implementing the new campaign.

I guess these real-life people don’t care about arbitrary thresholds. Or, if you interpret it the other way around: if advice is written down, it means it had to be given, which means it contradicts the practice… OK, I’m confused about how to interpret this.

- Martha (Smith) on May 27, 2019 12:22 PM at 12:22 pm said:
  
  Mikhail said, “Speaking of p-values, it would be nice to know how they are used in industry.”
  
  He gave an example involving a marketing campaign, and Dale gave one on an advertising campaign.
  
  My thoughts when I read Mikhail’s quoted comments went to “hard industry” such as manufacturing processes, and in particular to quality control and quality assurance. So I looked up the American Society for Quality, but only found stuff behind a paywall. Does anyone have any evidence that ASQ has responded to the recent discussions of problems with p-values?
  
  - Raj on May 27, 2019 10:51 PM at 10:51 pm said:
    
    NHST and P values are not used in SPC and in many other industrial statistics areas. W. E. Deming, even though credited with the term P value, became critical of it. He remarked: “If statisticians understood a system, and they understood some theory of knowledge and something about psychology, they could no longer teach tests of significance, tests of hypothesis and chi-square.” See Deming, W. Edwards. “[Applications in Business and Economic Statistics: Some Personal Views]: Comment.” Statistical Science, vol. 5, no. 4, 1990, pp. 391–392. JSTOR, http://www.jstor.org/stable/2245361.
    
    H. F. Dodge, early in the last century, modified the error probability notion to risk control, and much of the acceptance sampling methodology places emphasis on what is practically important in terms of severity in testing and risk control. The Acceptance Sampling symposium presentations, edited by John Tukey, contain many arguments for why the standard hypothesis testing methodology is inadequate for quality control; see https://babel.hathitrust.org/cgi/ssd?id=wu.89048111546;seq=7
    
    Why control charting is not a test of hypothesis, and other issues, are dealt with in Woodall, W. H. 2000. Controversies and contradictions in statistical process control (with discussion). Journal of Quality Technology 32 (4):341–378.
    
    Industrial experimentation still employs P values but most industrial experiments are well controlled; employ the power of orthogonality; confirmatory experiments are common. Much of the industrial experimentation culture was shaped by George Box (in my opinion) who placed emphasis on the sequential experimentation; see Box, George. “[Applications in Business and Economic Statistics: Some Personal Views]: Comment.” Statistical Science, vol. 5, no. 4, 1990, pp. 390–391. JSTOR, http://www.jstor.org/stable/2245360.
    
    - Martha (Smith) on May 28, 2019 1:00 AM at 1:00 am said:
      
      Thanks.
  - Mikhail Shubin on May 28, 2019 1:39 AM at 1:39 am said:
    
    > My thoughts when I read Mikhail’s quoted comments went to “hard industry” such as manufacturing processes, and in particular to quality control and quality assurance.
    
    Astronomers tend to call everything heavier than helium “metal”. Students tend to call every occupation beyond bare minimum salary as “industry”.
    
Dale Lehman on May 27, 2019 8:46 AM at 8:46 am said:

Here is one of my favorite examples – from a leading textbook on Managerial Economics. The example is a hypothetical data set (economists love these – a lot easier than messy real world data) on the demand for apartments. I’ll excerpt the results here:

“Since FCI can’t relocate its apartments closer to campus, and advertising does not have a statistically significant impact on units rented, it would appear that all FCI can do to reduce its cash flow problems is to lower rents at those apartment buildings where demand is elastic.”

It turns out that the point estimate for the advertising effect on apartment demand, while not statistically significant (p=.43), implies that each $100 spent on advertising is associated with around a $2500 increase in annual revenues for the apartment owner. Also, the 95% confidence interval extends well into negative territory (meaningless, if you believe in a weak form of rationality where advertising with a negative effect on demand would clearly be recognized – which I’m not sure I do!).

I use this example of how to badly use a statistical result. For a Managerial Economics textbook, I believe the relevance of this example is to recognize the large uncertainty and ask questions about what the best course of action would be. Absent collecting more data, it would be crazy not to increase advertising in this case. However, the gospel of statistical significance leads the authors to conclude that it is not effective. When we teach people this way, then statistical significance indeed becomes a certainty filter.
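
A back-of-the-envelope sketch of what “recognize the large uncertainty” could look like with the two numbers quoted above (the point estimate of about $2500 per $100 of advertising, and p=.43). It assumes a normal sampling distribution and reads the estimate with a flat prior; neither assumption comes from the textbook, and the helper `prob_effect_exceeds` is hypothetical.

```python
from statistics import NormalDist


def prob_effect_exceeds(estimate, two_sided_p, threshold):
    """Approximate P(true effect > threshold) from an estimate and its two-sided p-value.

    Assumes a normal sampling distribution and a flat prior (illustrative
    assumptions, not part of the textbook example).
    """
    std_normal = NormalDist()
    # Recover the z-statistic implied by the two-sided p-value, then the standard error.
    z = std_normal.inv_cdf(1.0 - two_sided_p / 2.0)
    se = abs(estimate) / z
    return 1.0 - NormalDist(mu=estimate, sigma=se).cdf(threshold)


# Numbers quoted in the comment: about $2500 in revenue per $100 of advertising, p = .43.
p_positive = prob_effect_exceeds(2500.0, 0.43, threshold=0.0)
p_covers_cost = prob_effect_exceeds(2500.0, 0.43, threshold=100.0)
print(f"P(effect > $0)   = {p_positive:.3f}")
print(f"P(effect > $100) = {p_covers_cost:.3f}")
```

Under these assumptions the chance that advertising has a positive effect is 1 - p/2, or about 0.79, which is a long way from “no effect.” Note that this compares revenue with advertising cost; a real decision would also need margins and the downside if the effect is negative.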

- Bob on May 27, 2019 12:27 PM at 12:27 pm said:
  
  I am at a disadvantage because I don’t know the textbook you are referring to. But there was a faction in managerial economics, exemplified by Raiffa and his colleagues, based on Bayesian analysis, explicit objective functions, and maximization of the expected objective function that I believe would get the “right” answer for this advertising purchase problem.
  
  It seems to me that decision problems like this strongly illustrate the superiority of Bayesian decision analysis over naive and careless application of frequentist models.
  
  Sigh. It’s upsetting to know that such an example exists.
  
  Bob
  
  - Dale Lehman on May 27, 2019 2:04 PM at 2:04 pm said:
    
    I believe Bayesian analysis would make this poor reasoning much less likely – but it isn’t necessary. I really don’t think it is that hard to use frequentist analysis here to reach reasonable conclusions. This bad example is indicative not only of poor statistical reasoning, but poor textbook and teaching approaches. The book is Michael Baye’s (how’s that for irony?) which used to be the best selling (and may still be) managerial economics text (full disclosure: I wrote a managerial economics text several years ago – which was much better, of course, though it sold far fewer copies).
    
Ed Hagen on May 27, 2019 12:56 PM at 12:56 pm said:

“The problem is with the attitude that publication should imply some sort of certainty.”

Looking backwards over the existing literature, yes, that attitude is a problem. Looking forward to a hopefully better future, though? Hmm. Isn’t the whole point of this blog to help develop practices — statistical, scientific, institutional — that increase confidence that reported results are not noise, perhaps so much so that, yes, some day publication (in some, perhaps not all, venues) *would* imply some sort of certainty?

- Daniel Lakeland on May 27, 2019 1:01 PM at 1:01 pm said:
  
  I hope not, the whole thing on this blog is replacing certainty with realistically quantified uncertainty.
  
  - Ed Hagen on May 27, 2019 2:25 PM at 2:25 pm said:
    
    For publications? Publication has to imply something beyond “realistically quantified uncertainty.” What, exactly, though, will depend on the scientific context.
    
    I guess the debate here about what a “publication” implies, and the endless debate about p-values, are both, in large part, a debate about filtering research results — if, when, and how to do it.
    
    - Daniel Lakeland on May 27, 2019 2:35 PM at 2:35 pm said:
      
      why exactly should we suppress knowledge?
    - Ed Hagen on May 28, 2019 11:09 AM at 11:09 am said:
      
      Daniel, Martha,
      
      Come on guys. Publication sucks up a huge amount of time and resources, including the reviewers’ time, the editor’s time, the researchers’ time, a piece of the library’s publication budget, and reader’s limited time to scan the literature. I’m sure you both rely on “filters” all the time in your daily lives. Do you want the lead story in your favorite news outlet to be about the new pile of patio pavers in my mother-in-law’s front yard? No? Why not? I’m 100% sure there is one (I saw it myself)! Instead, I bet you want reporters to only report stories that are somehow “important” (or at least entertaining).
      
      Daniel, if I recall correctly, seems to be a fan of Bayesian decision theory. Reporting study results is a decision. Yes, this decision should be based on “realistically quantified uncertainty,” but also on the costs and benefits of reporting in venue X vs. venue Y vs. not reporting at all. There would seem to be a large number of situations where the scientific benefits of reporting would outweigh the not insubstantial costs only if the study “sufficiently” reduced the uncertainty for some parameter into an “interesting” region of parameter space.
    - Daniel Lakeland on May 28, 2019 1:09 PM at 1:09 pm said:
      
      Sure I want to rely on filters, I just don’t want to rely on publication editors to *be* the filters.
      
      But my ideal system would look like this:
      
      1) Write up some stuff into say PDF form
      
      2) Hand it to the “publicatorator” which cryptographically signs it with your author key and uploads it to the publi-verse where it’s stored in mirrored form across several locations, including your own computer, a few of your friends, and maybe a university archive or two. A 4TB hard drive costs around $150 and a typical article should be about 10MB or less, so with 4 mirrored copies the total storage cost for a typical article is about $0.0015 currently.
      
      3) Several times a day publicatorator receives broadcasted updates about publications available, and a machine-learning algorithm based on your own interests classifies them into those of potential interest to you, and others… and downloads those of interest to you, flagging them in a sorted list.
      
      FIN
      
      That it doesn’t work like that is no reason to think that we should continue to have the current system and continue to have “editors” in all their infinite wisdom telling us what to read.
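
      For reference, the storage estimate in step 2 works out as follows, assuming decimal units (1 TB = 10^6 MB) and reading “x4” as four mirrored copies; both readings are assumptions about the comment rather than something it states outright:

      ```latex
      \frac{\$150}{4\ \text{TB}}
        = \frac{\$150}{4 \times 10^{6}\ \text{MB}}
        = \$3.75 \times 10^{-5}\ \text{per MB},
      \qquad
      10\ \text{MB} \times \$3.75 \times 10^{-5}/\text{MB} \times 4\ \text{copies}
        = \$0.0015
      ```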
    - Ed Hagen on May 28, 2019 1:48 PM at 1:48 pm said:
      
      I agree that it’s worth rethinking the publication model as the filter of choice. Not sure if the ML idea would work, though. Even Apple and Google keep humans in the loop.
      
      I’ve been mulling a system more like GitHub and open source: Hackable Science. Throw your study up on GitHub (or equivalent), with all code and data, and advertise it on social media. Filtering would be by stars/forks and, e.g., keywords. Folks could also filter on researchers they trust/respect. I think there could also be some filtering value-added from an ad hoc “review” step involving an editor chosen by the authors, and non-anonymous reviewers chosen by the editor.
    - Daniel Lakeland on May 28, 2019 2:05 PM at 2:05 pm said:
      
      Of course, a human *would* be in the loop, *you*. You’d be telling your filter “hey this one really was cool” and “gee I didn’t like that one” constantly.
      
      and of course such a system could easily also support your custom filters on stars/requests/recommends by your peers, etc. But I think scientific publication is sufficiently specialized to be worth having a system tuned to it specifically. I’d like to see versioning and forking and etc, but I’d want *distributed* archival storage and easy citations, and etc things that code projects on github don’t have.
    - Daniel Lakeland on May 28, 2019 2:07 PM at 2:07 pm said:
      
      Note, I’m imagining the ML algorithm running on your computer, not some service provided by others. Maybe that wasn’t clear. Kind of like the “spam filter” for your email, but much more sophisticated and topical. It shouldn’t just filter “pass / nopass” but actually estimate an “interest factor” and hopefully automatically classify the *type* of interests it thinks you’ll have in the topic. Like if you’re interested in “healthcare economics” it should show up in that bin when it’s about that topic.
    - Mikhail Shubin on May 29, 2019 2:55 AM at 2:55 am said:
      
      Somehow all ML algorithms fail to recommend me next movie to watch or next song to play.
      
      Why should I rely on them to recommend me next scientific paper to read?
    - Mikhail Shubin on May 29, 2019 2:59 AM at 2:59 am said:
      
      Ed Hagen, continuing with the metaphor.
      
      There are too many movies to watch, even if you dedicate your entire life to it. So we need filtering, yes. But deciding to only watch movies with an IMDB rating > 8 and lasting an odd number of seconds… this is not a good filter, is it?
    - Ed Hagen on May 29, 2019 9:49 AM at 9:49 am said:
      
      “but I’d want *distributed* archival storage and easy citations, and etc things that code projects on github don’t have.”
      
      GitHub has an easy integration with Zenodo, a CERN data archiving service that also mints DOIs:
      
      https://guides.github.com/activities/citable-code/
      
      It’s also easy to automatically mirror git repos on multiple git-based cloud services.
      
      No need to reinvent the wheel.
    - Daniel Lakeland on May 31, 2019 8:53 AM at 8:53 am said:
      
      Mikhail Shubin: are these ML algorithms that are open source, designed by you, and trained by you to recommend things you are interested in, or are they algorithms that are secret and proprietary, designed by large companies with the primary goal of making them money? Also remember the movie industry has a lot of behind-the-scenes shenanigans on licensing rights. If the recommender knows what you want to watch but the rights aren’t available, do you think you will see that recommendation?
      
      Ed, GitHub is a good service, but it is all about collaborative software development and it is owned by Microsoft. It’s a single point of failure, and it’s an awkward place to put a publication, as opposed to source code. It’s very organized around the idea that code is what is being hosted.
    - Martha (Smith) on May 28, 2019 4:44 PM at 4:44 pm said:
      
      Ed said,
      “There would seem to a large number of situations where the scientific benefits of reporting would outweigh the not insubstantial costs only if the study “sufficiently” reduced the uncertainty for some parameter into an “interesting” region of parameter space.”
      
      There may very well be. But there also might be other situations very much worth reporting — one example that comes immediately to mind is evidence that a previous study or a new method has flaws that result in a serious underestimate of uncertainty.
    - Martha (Smith) on May 28, 2019 4:51 PM at 4:51 pm said:
      
      Ed said,
      
      “I think there could also be some filtering value-added from an ad hoc “review” step involving an editor chosen by the authors”
      
      There might be — but an editor chosen by the authors might be less likely to give a critical review than someone not chosen by the authors.
    - Ben Prytherch on May 29, 2019 3:01 AM at 3:01 am said:
      
      What if the “filter” is applied to the potential implications of the study’s results, rather than to the results themselves? And to the quality of the data and the study design?
      
      This is the idea behind registered reports: decide upon the value of the study before seeing the results. If the question is worth asking, the answer (which includes uncertainty quantification) should be worth knowing.
      
      Of course, those who advocate filtering based on results will say that they also take into account data quality and study design. And they’re surely right to some extent (after all, papers showing statistical significance still get rejected), but early evidence from registered reports shows what we’d expect: “non-significant” results are far more commonly published when the decision of whether to publish is made before the results are known: https://www.nature.com/articles/d41586-018-07118-1#ref-CR2
    - Anonymous on May 29, 2019 3:57 AM at 3:57 am said:
      
      Quote from above: “This is the idea behind registered reports: decide upon the value of the study before seeing the results. If the question is worth asking, the answer (which includes uncertainty quantification) should be worth knowing.”
      
      What I dislike about “Registered Reports” (besides the fact that they seem to have been designed and/or implemented very poorly https://www.nature.com/articles/s41562-018-0444-y) is that they seem to me to be increasingly giving power to the system and people that (helped) mess things up.
      
      1) I note that the “Registered Reports” started, if I remember correctly, by emphasizing that it was important to look at the quality and design of the study and not so much the results. I feel this is already shifting from quality of the design to “how important is the question”. Possibly a very bad shift, in my opinion and reasoning.
      
      2) I also worry about all the “special” editors that are being connected to “Registered Reports” at various journals, if I understood things correctly. Also possibly very bad, in my opinion and reasoning.
      
      If you combine 1) and 2) you can see where this could all lead. To me, it’s all possibly part of giving increasingly more power and influence to a small group of people, and replicating the problematic processes that have possibly gone on at “top” journals concerning editorial power, etc.
    - Ben Prytherch on May 31, 2019 12:04 AM at 12:04 am said:
      
      Anonymous: I’m not clear on how RR lead to editors having more power than they currently do. They would have more influence over study design, yes. They would also have less power to demand changes to the paper after the results are in.
      
      We’ve seen the consequences of “filtering” based on results. I can see how filtering based on “how important is the question” can also have consequences. But at least they’re a different set of consequences. We’re in little danger of Registered Reports becoming dominant.
    - Anonymous on May 31, 2019 3:25 AM at 3:25 am said:
      
      “Anonymous: I’m not clear on how RR lead to editors having more power than they currently do. They would have more influence over study design, yes. They would also have less power to demand changes to the paper after the results are in.”
      
      First of all, I dislike the idea behind RRs that the editor and reviewers are given way too much influence, in my opinion and reasoning. To me this is then not a truly independent evaluation anymore.
      
      Sure, editors and reviewers can “block” or “filter” certain results, but those results would then probably be sent to a different journal. They will probably find their way out there somehow.
      
      I reason that this may not happen this way when editors and reviewers can influence, or are influencing, the step(s) before the results (like the design)!?
      
      Also possibly think about how some folks view, and/or promote, RRs as the epitome of doing scientific research. This view, and/or promotion, can easily lead to emphasizing the possibly problematic things I allude to above in my reasoning.
      
      To me this is all absurd, and non-scientific.
    - Martha (Smith) on May 27, 2019 11:00 PM at 11:00 pm said:
      
      Ed said,
      “Publication has to imply something beyond “realistically quantified uncertainty.” ”
      
      Realistically quantified uncertainty is, in many if not most circumstances, the best we can do, unless we go off into a fairy-tale world, which would be neither realistic nor scientific. So what we need to do is try to do our best to give realistically quantified uncertainty.
      So my best answer to your question is,
      
      “Publication should imply that the authors have done a good/careful/cautious/well-explained job of giving realistically quantified uncertainty”
    - Mikhail on May 28, 2019 6:41 AM at 6:41 am said:
      
      > Publication should imply that the authors have done a good/careful/cautious/well-explained job of giving realistically quantified uncertainty
      
      i.e. uncertainty estimation should pass a severity test
    - Mikhail Shubin on May 28, 2019 6:42 AM at 6:42 am said:
      
      > “Publication should imply that the authors have done a good/careful/cautious/well-explained job of giving realistically quantified uncertainty”
      
      i.e. uncertainty estimates should pass a severity test

Statistical Modeling, Causal Inference, and Social Science
