Simulation-based statistical testing in journalism

Jonathan Stray writes:

In my recent Algorithms in Journalism course we looked at a post which makes a cute little significance-type argument that five Trump campaign payments were actually the $130,000 Daniels payoff. They summed to within a dollar of $130,000, so the simulation recreates sets of payments using bootstrapping and asks how often there’s a subset that gets that close to $130,000. It concludes “very rarely” and therefore that this set of payments was a coverup.

(This is part of my broader collection of simulation-based significance testing in journalism.)

I recreated this payments simulation in a notebook to explore this. The original simulation checks sets of ten payments, which the authors justify because “we’re trying to estimate the odds of the original discovery, which was found in a series of eight or so payments.” You get about p=0.001 that a set of ten payments contains a subset that gets within $1 of $130,000. But the authors also calculated p=0.1 or so if we choose from 15, and my notebook shows that this goes up rapidly to p=0.8 if you choose 20 payments.

So the inference you make depends crucially on the universe of events you use. I think of this as the denominator in the frequentist calculation. It seems like a free parameter robustness problem, and for me it casts serious doubt on the entire exercise.
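The sensitivity to the number of payments per set can be illustrated with a minimal sketch. This is not the authors' code or the notebook mentioned above: the function names `closest_subset_gap` and `simulated_p_value` are hypothetical, and the payment pool is made-up log-normal data rather than the real campaign filings. It only shows the general shape of the procedure: bootstrap a set of payments, brute-force every subset, and count how often some subset lands within a tolerance of the target.

```python
import numpy as np

def closest_subset_gap(payments_cents, target_cents):
    """Smallest |subset sum - target| over all non-empty subsets (brute force)."""
    sums = np.zeros(1, dtype=np.int64)  # start with the empty subset
    for p in payments_cents:
        # each payment doubles the list: subsets without it, then subsets with it
        sums = np.concatenate([sums, sums + p])
    return int(np.abs(sums[1:] - target_cents).min())  # sums[0] is the empty subset

def simulated_p_value(pool_cents, n_payments, target_cents, tol_cents, n_sims, seed=0):
    """Share of bootstrapped payment sets containing a subset within tol of target."""
    rng = np.random.default_rng(seed)
    pool = np.asarray(pool_cents, dtype=np.int64)
    hits = 0
    for _ in range(n_sims):
        draw = rng.choice(pool, size=n_payments, replace=True)  # bootstrap resample
        if closest_subset_gap(draw, target_cents) <= tol_cents:
            hits += 1
    return hits / n_sims

# Made-up payment pool (log-normal dollar amounts in cents), NOT the real filings.
demo_rng = np.random.default_rng(1)
demo_pool = np.round(demo_rng.lognormal(mean=9.0, sigma=1.0, size=200) * 100).astype(np.int64)

# The number of payments per set is the free parameter under discussion.
for n in (5, 10, 15):
    p = simulated_p_value(demo_pool, n, target_cents=13_000_000, tol_cents=100, n_sims=200)
    print(f"{n} payments per set: estimated p = {p:.3f}")
```

The number of subsets doubles with every extra payment, so the chance that one of them lands near the target grows quickly with set size. With only 200 simulations the estimates are rough; the point is the direction of the dependence, not the particular values.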

My question is: Is there a principled way to set the denominator in a test like this? I don’t really see one.

I’d be much more comfortable with a fully Bayesian attempt, modeling the generation process for the entire observed payment stream with and without a Daniels payoff. Then the result would be expressed as a Bayes factor, which I would find a lot easier to interpret. This would also use all available data and require making a bunch of domain assumptions explicit, which strikes me as a good thing.

But I do still wonder if frequentist logic can answer the denominator question here. It feels like I’m bumping up against a deep issue here, but I just can’t quite frame it right.

Most fundamentally, I worry that there is no domain knowledge in this significance test. How does this data relate to reality? What are the FEC rules and typical campaign practice for what is reported and when? When politicians have pulled shady stuff in the past, how did it look in the data? We desperately need domain knowledge here. For an example of what application of domain knowledge to significance testing looks like, see Carl Bialik’s critique of statistical tests for tennis fixing.

My reply:

As Daniel Lakeland said:

A p-value is the probability of seeing data as extreme or more extreme than the result, under the assumption that the result was produced by a specific random number generator (called the null hypothesis).

So . . . when a hypothesis test rejects, it’s no big deal; you’re just rejecting the hypothesis that the data were produced by a specific random number generator, which we already knew didn’t happen. But when a hypothesis test doesn’t reject, that’s more interesting: it tells us that we know so little about the data that we can’t reject the hypothesis that the data were produced by a specific random number generator.

It’s funny. People are typically trained to think of rejection (low p-values) as the newsworthy event, but that’s backward.
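Lakeland's definition can be made concrete with a small sketch. The coin-flip null model and the names `monte_carlo_p_value` and `heads_in_20_fair_flips` are illustrative assumptions, not anything from the payments analysis. The p-value is estimated as the share of draws from the specified random number generator that are at least as extreme (here, at least as large) as the observed statistic.

```python
import random

def monte_carlo_p_value(observed, simulate_statistic, n_sims=10_000, seed=0):
    """Fraction of null simulations whose statistic is >= the observed one."""
    rng = random.Random(seed)
    as_extreme = sum(simulate_statistic(rng) >= observed for _ in range(n_sims))
    return as_extreme / n_sims

def heads_in_20_fair_flips(rng):
    """The 'specific random number generator': 20 flips of a fair coin."""
    return sum(rng.random() < 0.5 for _ in range(20))

# Hypothetical observation: 16 heads out of 20 flips.
p = monte_carlo_p_value(observed=16, simulate_statistic=heads_in_20_fair_flips)
print(f"estimated one-sided p-value: {p:.4f}")
```

A small value here says only that the fair-coin generator rarely produces 16 or more heads. It says nothing directly about the probability of any alternative explanation.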

Regarding your more general point: yes, there’s no substitute for subject-matter knowledge. And the post you linked to above is in error, when it says that a p-value of 0.001 implies that “the probability that the Trump campaign payments were related to the Daniels payoff is very high.” To make this statement is just a mathematical error.

But I do think there are some other ways of going about this, beyond full Bayesian modeling. For example, you could take the entire procedure used in this analysis, and apply it to other accounts, and see what p-values you get.


  1. > My question is: Is there a principled way to set the denominator in a test like this? I don’t really see one.
    > I’d be much more comfortable with a fully Bayesian attempt

    I think Sander Greenland put this best in a recent email – “In my classes I taught that there is only one principle I could see manifest in all applications, the NFLP (no-free-lunch principle): In messy practice (not math) if you save some conceptual effort somewhere you either have to put that effort in somewhere else or suffer an increase in unmeasured risk of error.”

    > It feels like I’m bumping up against a deep issue here, but I just can’t quite frame it right.
    My sense is that it is a reference class problem: in frequentist terms, what conditions the data sets would conform to if repeated (fake) payments were made (exactly 10 payments), and in Bayes, according to Andrew, what would be a sensible prior distribution on the number of payments.

    Thanks for making your course material available. Enabling folks to understand statistics using simulation, I think, remains an open topic.

    Something I drafted this morning – Understanding statistics may no longer require advanced math but the ability to simulate AND think abstractly! [the think abstractly being noticing conceptual problems like you did here and thinking about it profitably].

    • Keith

      Re: Something I drafted this morning – Understanding statistics may no longer require advanced math but the ability to simulate AND think abstractly! [the think abstractly being noticing conceptual problems as you did here and thinking about it profitably].

      Simulation can be one of a multi-prong approach to understanding statistics. On my Twitter today, I questioned whether we needed to have a math background to understand statistics for it seems that the more prominent statisticians have a math background. Yet they too had been using the statistical tools that are in question today.

      It leads me to wonder whether there has been an evolution of statistical theory/concepts within the last 15 years that can improve the field of psychology, for example. It seems to me that the quality of the insights has to be addressed more assiduously.

      • I meant to add that the ‘quality of the insights’ has to be addressed before we can conceive of an appropriate measurement. In other words, to understand statistics you have to know what you want from the more general query at hand. I gather that you are implying this in your draft. Maybe I’m misunderstanding.

    • > My sense is that it is a reference class problem,

      Yes exactly. But is there a principled way to pick the reference class for this problem?

  2. Aftab says:

    I haven’t looked at the notebook, but wouldn’t you want your universe to be of the form N or fewer payments for some N, rather than a fixed N? Surely it would have been at least as suspicious if 5 payments added up to within a dollar of $130k as if 10 did. There are two dimensions for “data as extreme or more extreme than the result” – number of payments in subset and closeness to $130k.

    • Terry says:

      Bingo. I was just about to say this.

      They assume 10 payments, but obviously, there could have been 1 through N payments. That alone should increase their p-value about 20-fold.

      Seems like a pretty obvious mistake, which suggests the authors knew what they were doing was not right and makes you wonder what other “mistakes” there are in the analysis.

      • Aftab says:

        Took a look at the notebook, which cleared things up. I thought they were looking at subsets of size 10 from 10k random payments. Turns out they were looking at 10k sims of 10 payments and seeing if subsets of any size (<= 10) get close to $130k. Sorry for the confusion!

        Also, while I do think this analysis is somewhat motivated/silly, I don't think anything here warrants assuming bad faith.

  3. Andrew,

    Re: it’s funny. People are typically trained to think of rejection (low p-values) as the newsworthy event, but that’s backward.

    How is it that an untrained person can entertain the hypothesis that it’s not all about ‘rejection’ and a trained expert can’t. It seems that a basic logic course may have been helpful prior to taking a statistics course, a sequence that was helpful in, at least, mulling over some questions as to the bases for such a hypothesis. Then again, some teachers of statistics do have a math background too. Even some of them bought into dichotomization.

    I am really suggesting that it was always a puzzle to me why they were presented as dichotomies when it was obvious that one would have to examine many many other hypotheses and assumptions not stated. Then again I am not the best representative of binariness.

  4. Jonathan says:

    I’m trying to get this straight in my head: they were being taught to take a simple algorithm – add 5 bits and see if they fit – and they’re applying that not against the actual data field, whose composition is unknown, but a field generated by some process, which seems to be related to 5 bits that almost fit, and they want to calculate odds from this generated field and then apply them to the unknown, actual data field so they can say these 5 bits mean not just fit but that the fit happens to also fit at a higher, other level in which the fit means ‘payoff’. They’d have more luck if they expanded the notion of fit. Then they’d more clearly be approaching the Sholom Aleichem limit where you fix the shortage of sour cream by declaring that water is now sour cream. Or maybe it’s like having a pet: you don’t specify whether that means dog or cat or bird, so you can say yes you have a pet and then be told your dog bit my child except you have a parakeet.

  5. Rejection is informative when you have a strong reason to believe that the RNG hypothesis *really is true*, like when you’re trying to detect whether someone fudged their randomization protocol in a clinical study, or when you have a large historical dataset that you’ve fit an RNG model to and you are concerned something in the world might have changed, as in a manufacturing process control scenario.

    In those scenarios you either really did use an RNG or you specifically fit an RNG to match your historic data well… Rejecting these tells you something about how your current data differs from something you really expect.

    In a scenario where you are testing a “default” un-tuned hypothesis, rejecting it isn’t too surprising, but not rejecting it tells you that your data isn’t particularly informative, it can’t be distinguished from a braindead model of how it arose.

    • Daniel,

      Thanks. But here is the thing. According to John Ioannidis’ recent research, roughly 96% of the research that he evaluated contained statistically significant P values, notwithstanding all the other misinterpretations of P-values, data dredging, P-hacking, etc. that may also be in tow. In light of that figure, do you suppose that researchers are that confident of their analysis?

      • Statistics using Null Hypothesis Testing, outside of *narrow* applications like the ones I describe above, is, I think, a complete failure. It is cargo cult science of the form “if I do this stuff then I can publish”, which is why 96% of research contains statistically significant p values: they are there as an inappropriate publication filter.

        I’ve been to training seminars on the use of bioinformatics software that is site-licensed at my wife’s university for *big big bucks* and the training seminar basically consisted of “first open this window, then click these buttons then press go and here are your p values so you can publish”

        I mean, just about exactly those words came out of a person’s mouth who gets a FULL TIME salary teaching people how to use this probably millions of dollars a year site-license to generate spurious information and pollute the biology field with noise. It’s an actual job you can make an upper-middle-class salary doing.

        • Daniel,

          Thanks for the response. Without your keen insights, the rest of us would be really confused.

          What exactly then is the state of play? Are biomedical enterprises still using NHST P value churning software? If so that is truly astounding.

          Those of us who followed the Evidence-Based Medicine movement have been in many science fora in DC since 2002 or so. I was lucky to come across John Holdren, former Science Advisor under the Obama admin, and Richard Meserve of Carnegie Science. I don’t recall their raising the measurement issues back then. But I was always struck by how easily we just accepted a particular explanation or analytic tool. So I am grateful that blogs and Twitter are highlighting some of the obstacles to improving the epistemic environment.

          I wonder what impact the Open Science movement will have on settling some of these controversial uses of statistical tools.

          My goal is to become an informed consumer of statistics. In that process, I’m maturing in my own views. And it sure has been fascinating here on Andrew’s blog.

          I also am grateful to Sander Greenland whose articles made a lot of sense to me.

          • > Are biomedical enterprises still using NHST P value churning software

            Not only “still” but INCREASINGLY due to the kind of data they now have available: RNA-seq, single-cell RNA-seq, and things like that are simultaneously measuring *tens of thousands* of things and people are looking for correlations between them, and declaring findings based on the output of this kind of automated churn. Some of it is probably reasonable, but lots of it is going to be noise too.

            • Anoneuoid says:

              Just saying it pollutes the literature with “noise” is too generous, since how are people supposed to extract a “signal” from it? The goal of the people producing this “noise” is to make it appear as much like a “signal” as possible. I.e., things are set up so these authors are adversaries of the audience, they do their best to make their “noise” actively misleading.

              And I would say the worst thing is that it trains people to think in a completely backwards way: look for “differences” instead of “laws”.

              One of the endgames I see is that useful categories will progressively be split up into more and more subcategories until it is meaningless. E.g., first it was only “cancer is many diseases”, but now I have seen the same thing said about Alzheimer’s and depression to explain away their failure as well.

              The argument is: “It’s more complex than we thought, that is why we failed. We need more money”. Meanwhile the totally inappropriate methods being used to study the problem have been debunked since the 1960s. We already know every instance of a disease is unique; the point is to make things easier to understand by finding “laws” that describe the phenomenon in general.

              But yea, I have given up hope on this situation being corrected from within. They are going to keep demanding more and more funding to push exponentially more BS on the public until something really bad happens and there is a big “crisis of faith” in academia (something like this:

  6. Terry says:

    Very interesting post. A lot to think about.

    First impression: a lot of red flags.

    1. Highly motivated reasoning. Clearly wants a specific result.

    2. Lawyers involved and trying to reach a specific result is what lawyers are trained to do (one of the major lessons I learned in law school).

    3. Very specific and contorted hypothesis which suggests a lot of forking paths. Lawyers seem to be particularly drawn to this type of analysis.

    Reminds me of Richard Feynman’s snotty comment to someone who was amazed by a particular coincidence: Feynman said something like “As I drove here today, I saw the license plate DL 43965. The odds of me seeing that license plate are astronomical, so you can’t tell me it was just chance.”

  7. Terry says:

    The linked post describing the analysis says

    In order to investigate these suspicions, we developed 10,000 sets of simulated Trump campaign payments. Each set contained 10 randomly generated payments. We then searched each of those sets for the combination of payments with the total closest to $130,000.

    The simulation confirmed that it is extremely unlikely that, by random chance alone, a set of payments near a specific date would almost equal $130,000.

    This doesn’t make any sense to me.

    The analysis takes the denominator to be 10,000 because they generate 10,000 sets of 10 random payments and the numerator is the number of times 10 random payments adds to about $130,000. But all this proves is that if you make 10 payments of random amounts, it rarely adds up to about $130,000. That is NOT surprising. AT ALL. Indeed, it would be astonishing if this were NOT true.

    The allegation is not that Cohen made 10 random payments. The allegation is that Cohen made payments specifically designed to add up to $130,000. To test this, you would randomly generate 10,000 sets that model ALL OF TRUMP’S CAMPAIGN CONTRIBUTIONS, and then, for each set of modeled contributions, look at all possible combinations of 1 through N payments to see if there is AT LEAST ONE combination that comes within $x of $130,000. The numerator would then be the number of times there was at least one such combination and the denominator would be 10,000.

    The model they actually ran looks like lawyer garbage.

    • Terry says:

      Nevermind. I misunderstood the analysis. I’m an idiot. Sorry if I wasted anyone’s time.

      • Andrew says:

        No need to apologize for wasting people’s time. This is a blog, after all. Its whole purpose is to waste time.

        • Increasingly I think much of what people do for “real work” is where the serious time is wasted wholesale.

          • Re: Increasingly I think much of what people do for “real work” is where the serious time is wasted wholesale.

            That has been my observation of much of the scholarship of my father’s generation at least, whereby current generations haven’t had the experiences that validate or invalidate it. Plus the shelf life of books is shorter.

            Many years ago, I read a book, The Temporary Society by Warren Bennis and Philip Slater. It left a deep impression on me because it related to the changes that we witnessed in university purposes and structure, and more generally to how fads come and go.


            • The thing I think about these days is how disconnected the real value of things is from the dollar value. So much of what makes serious money these days is government sponsored monopoly or oligopoly or subsidy or money laundering or getting around regulations through loopholes or selling people things that they think are something else because asymmetry of info, or whatever. But if you didn’t have all this threat of violence or bullshit going on, how many of us would pay what we are forced to pay for the things that are making serious bucks these days? I consider it a huge waste of resources if what we’re doing isn’t priced to within a few tens of percent of the resource cost of doing business and what people would be willing to pay if they knew exactly what they were getting and weren’t being coerced by regulation or offered some kind of subsidy etc.


              So, of the top ten listed there, for sure by my measure, finance, government, education and healthcare, information, construction, and arts and entertainment are all way, way overinflated compared to the non-coercive, information-symmetric pricing criteria. Adding up contributions, that’s something like 21 + 13 + 8 + 5 + 4 + 4 = 55% of GDP that comes from coercive or information-asymmetric industries (at a minimum). If you assume a premium of say 30% in cost due to coercion or information asymmetry (which seems pretty reasonable), then 30% of 55% is 16.5% of GDP that is wholly wasted… at a minimum.

              In reality as I see it very likely more like 45% of GDP is things we wouldn’t do if we weren’t forced to or knew what we were really going to be getting.

              Think of how much of that 20.9% going to finance, insurance, real-estate, rental and leasing is coercive? Even just NIMBY anti-build attitudes or rent control is a major component, to say nothing of Equifax data breaches and 3 Trillion of “Quantitative Easing”

              • To say nothing of all the “spying on everything you do and selling it to people” that currently goes into “Professional and business services (modern advertising)” (12.1%) and “Manufacturing things because the government gave some subsidy to buy votes” (Manufacturing = 11.6%)

                Even Retail: do you know that almost every pair of eyeglasses and sunglasses is made by *one* company, Luxottica, which has consolidated that industry into a virtual monopoly?

  8. Terry says:

    There are other things that suggest the post is lawyer garbage as well:

    They found no set of payments that add up exactly to $130,000. This is evidence against their hypothesis. If Cohen thought he wasn’t going to be caught, you would expect the payments to add exactly to $130,000. On the other hand, if he was trying to avoid detection it doesn’t make sense that he made them add up so closely to $130,000. It would make more sense to be off by say $100. To put it another way, if Cohen had wanted to avoid detection, he could have easily done so at almost no cost but chose not to. The story is self-contradictory.

    The total payments fall short of $130,000. I would expect Cohen to overpay, not underpay because it rankles people to feel they were shortchanged, even by $0.24. Why risk ticking Stormy off for only $0.24? Doesn’t make sense.

    Neither of these definitively disprove the allegation, but they do weaken the probability the allegation is true and could have been included in an honest model.

  9. Dale Lehman says:

    But, to follow the reasoning of a recent post, perhaps Cohen thought he would get caught if he was off by $100 since people would look for that, so he decided that a fraction of a dollar was safe. When a result is surprising do we dismiss it out of hand? Like Presidential Party and Global Warming?

    • Terry says:

      The 28 cents is interesting and kind of weird.

      The allegation is that they were trying to get very close to $130,000, but got only within $0.28. Why? Why not just make it an even $130,000?

      This underlines the original post’s assertion that domain knowledge is important here. What constraints were the conspirators under that might have led to this?

      Or, why not just add $13,000 to 10 payments?

      • Seriously, if you’re ever in the business of needing to pay someone X dollars by surreptitious means, use a crypto RNG to generate a number c between 0 and 1, calculate Y = X * (1 + c/50), then use a crypto RNG to generate 3 numbers x1, x2, x3 between 0 and Y/4, round to the nearest 10 dollars, and make the final number Y - x1 - x2 - x3.

        No one is ever going to connect these 4 payments to your supposed deal by virtue of them adding up to something specific, and consider the up to 2% extra you might have to pay as the cost of avoiding detection.

        You can thank me later. In a series of 4 payments…

  10. Let’s be a little explicit about the implicit Bayesian idea going on. There are two models of history, one in which various people gave money to the campaign without any pattern of interest, and one in which there was a $130,000 “payoff” with this dollar amount chosen by us according to external sources.

    In the payoff model p(A small number of largish transactions add up to 130,000 to within a few dollars) is very near 1. In the other model of the world, this probability can be approximated using the distribution of observed historical transaction sizes and simulations… and it is small.

    The posterior probability that the payoff model is true given the data is

    p(model1 | Data) p(Data) = p(Data | model1) p(model1)

    p(Data) = p(Data | model1) p(model1) + p(Data | model2) p(model2)

    so p(model1 | Data) = p(Data | model1) p(model1) / (p(Data | model1) p(model1) + p(Data | model2) p(model2))

    = 1 / (1 + [p(Data | model2) / p(Data | model1)] [p(model2) / p(model1)])

    ~ 1 / (1 + epsilon)

    where given the data has high probability under model 1 and low probability under model 2, and maybe we are indifferent between the two models before data, then maybe epsilon ~ 1/10 so we’re talking 1/1.1 ~ 91%
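The two-model posterior calculation in this comment can be checked numerically with a short sketch. The function name `posterior_model1` and the input values are illustrative assumptions; the likelihoods are placeholders chosen so that epsilon is 1/10 under equal priors, not estimates from any data.

```python
def posterior_model1(lik1, lik2, prior1=0.5):
    """Posterior probability of model 1 when exactly two models are considered.

    lik1, lik2: p(Data | model1) and p(Data | model2)
    prior1:     p(model1); p(model2) is taken to be 1 - prior1
    """
    if not 0.0 < prior1 < 1.0:
        raise ValueError("prior1 must be strictly between 0 and 1")
    if lik1 <= 0.0 or lik2 < 0.0:
        raise ValueError("lik1 must be positive and lik2 non-negative")
    prior2 = 1.0 - prior1
    # epsilon = likelihood ratio (model 2 over model 1) times prior odds (2 over 1)
    epsilon = (lik2 / lik1) * (prior2 / prior1)
    return 1.0 / (1.0 + epsilon)

# Placeholder inputs: data ten times more probable under the payoff model.
print(posterior_model1(lik1=1.0, lik2=0.1))  # equal priors, i.e. 1 / 1.1

# The same likelihoods with more skeptical priors on the payoff model.
for prior in (0.5, 0.1, 0.01):
    print(prior, round(posterior_model1(1.0, 0.1, prior1=prior), 3))
```

The loop shows how strongly the answer depends on the prior: the roughly 91% figure holds only under indifference between the two models.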

  11. Terry says:

    It’s been a year since the underlying post was written.

    Do we know if these 5 payments were actually part of the payoff? I couldn’t find anybody crowing that their clever analysis had uncovered a conspiracy, but maybe it is too early still.

    If these 5 payments were legitimate, it would be embarrassing: you claim there is only a 1 in 10,000 chance this could happen by chance, and then it does.
