Understanding p-values: Different interpretations can be thought of not as different “philosophies” but as different forms of averaging.

Posted on December 2, 2024 9:41 AM by Andrew

The usual way we think about p-values is that they’re part of null-hypothesis testing inference, and if that’s not your bag, you don’t have use for p-values.

That summary is pretty much the case. I once did write a paper for the journal Epidemiology called “P-values and statistical practice,” and in that paper I gave an example of a p-value that worked (further background is here), but at this point my main interest in p-values is that other people use them, so it behooves me to understand what they’re doing.

Theoretical statistics is the theory of applied statistics, and part of applied statistics is what other people do.

What this means is that, just as non-Bayesians should understand enough about Bayesian methods to be able to assess the frequency properties of said methods, so should I, as a Bayesian, understand the properties of p-values. Bayesians are frequentists.

The point is, a p-value is a data summary, and it should be interpretable under various assumptions. As we like to say, it’s all about the averaging.

Below are two different ways of understanding p-values. You could think of these as the classical interpretation or the Bayesian interpretation, but I prefer to think of them as conditioning-on-the-null-hypothesis or averaging-over-an-assumed-population-distribution.

So here goes:

1. One interpretation of the p-value is as the probability of seeing a test statistic as extreme as, or more extreme than, the data, conditional on a null hypothesis of zero effects. This is the classical interpretation.

2. Another interpretation of the p-value is conditional on some empirically estimated distribution of effect sizes. This is what we did in our recent article by Zwet et al., “A new look at p-values for randomized clinical trials,” using the Cochrane database of medical trials.

Both interpretations 1 and 2 are valid! No need to think of interpretation 2 as a threat to interpretation 1, or vice versa. It’s the same p-value, we’re just understanding it by averaging over different predictive distributions.
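To make the "different averages" idea concrete, here is a minimal sketch, not taken from the paper. It asks how often a two-sided p-value falls below 0.05 under two predictive distributions: the null (signal-to-noise ratio exactly zero in every study) and a made-up population of studies in which the signal-to-noise ratio varies as normal(0, tau). The normal form and the value tau = 1.56 are illustrative assumptions, not the distribution estimated from the Cochrane data, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def prob_significant(tau, z_crit=1.96):
    """P(|z| > z_crit) when z = SNR + noise, noise ~ N(0, 1), SNR ~ N(0, tau^2).

    Marginally z ~ N(0, 1 + tau^2), so the tail area has a closed form.
    tau = 0 corresponds to the null hypothesis of zero effect everywhere.
    """
    marginal_sd = sqrt(1.0 + tau ** 2)
    return 2.0 * STD_NORMAL.cdf(-z_crit / marginal_sd)


# Averaging over datasets generated under the null hypothesis.
under_null = prob_significant(tau=0.0)

# Averaging over a hypothetical population of studies (tau is an assumption).
under_population = prob_significant(tau=1.56)

print(f"P(p < 0.05), null:       {under_null:.3f}")
print(f"P(p < 0.05), population: {under_population:.3f}")
```

The event is the same in both lines; only the distribution being averaged over changes.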

What to do with all this theory and empiricism is another question, and there is a legitimate case to be made that following procedures derived from interpretation 2 could lead to worse scientific outcomes, just as of course there is a strong case to be made that procedures derived from interpretation 1 have already led to bad scientific outcomes.

Following that logic, one could argue that interpretation 1, or interpretation 2, or both, are themselves pernicious in leading, inexorably or with high probability, toward these bad outcomes. One can continue with the statement that interpretation 1, or interpretation 2, or both, have intellectual or institutional support that props them up and allows the related bad procedures to continue; various people benefit from these theories, procedures, and outcomes.

To the extent there are, or should be, disputes about p-values, I think such disputes should focus on the bad outcomes for which there is concern, not on the p-values themselves or on interpretations 1 and 2, both of which are mathematically valid and empirically supported within their zones of interpretation.

78 thoughts on “Understanding p-values: Different interpretations can be thought of not as different “philosophies” but as different forms of averaging.”

Peter Dorman on December 2, 2024 1:02 PM at 1:02 pm said:

In practice, I think the most critical problem is embedded in the wording of #1: “the probability of seeing a test statistic as extreme as, or more extreme than, the data, conditional on….” The issue is that users of p-values want to attribute meaning to that test statistic, that it in fact tests for the likelihood that some measured effect is or is not a true population effect. In doing this, they have to have some sense of the relationship between the outcome of the test and the characterization of true associations in the population. What we’ve seen is far too much naivete on this score, with all sorts of potential errors, biases, and selection effects just disappearing from sight. And what makes a reconstruction of the use of p-values so difficult is that its procedure gives you a number, while the other construct, model etc. questions are not as susceptible to quantification.

So yes, it’s about the role of the null, which can be modified, but also so much more.

Carlos Ungil on December 2, 2024 1:17 PM at 1:17 pm said:

> Both interpretations 1 and 2 are valid! No need to think of interpretation 2 as a threat to interpretation 1, or vice versa. It’s the same p-value, we’re just understanding it by averaging over different predictive distributions.

In what sense is interpretation 1 averaging over anything?

- Andrew on December 2, 2024 2:52 PM at 2:52 pm said:
  
  Carlos:
  
  Interpretation 1 averages over the distribution of possible datasets that could arise, if the null hypothesis were true.
  
  - Michael Lew on December 2, 2024 3:19 PM at 3:19 pm said:
    
    I’m not sure that ‘averaging’ is a helpful way to think of that. The p-value tells you about the _ranking_ of the observed dataset (test statistic) among all of the possible datasets that could arise if the null were true, according to the statistical model. In my opinion, talking of ranking makes it easy to think about the evidential meaning of the data.
    
  - Carlos Ungil on December 2, 2024 3:24 PM at 3:24 pm said:
    
    If one calculates a percentile (for an observed value) over the distribution of potential values (conditional on a single null hypothesis) I really don’t quite see what’s being “averaged” there.
    
    [Average: to find a single value that summarizes or represents the general significance of a set of unequal values]
    
  - Andrew on December 2, 2024 4:33 PM at 4:33 pm said:
    
    Michael, Carlos:
    
    “Averaging” is just a word. Mathematically, a rank or percentile is an average of a binary variable. I find it helpful to speak of averaging because it keeps my focus on the model or distribution being averaged over. But if you’d prefer to avoid the term “averaging” and instead speak of “integrating over a probability distribution,” that’s fine too.
    
    - Carlos Ungil on December 2, 2024 6:30 PM at 6:30 pm said:
      
      > Mathematically, a rank or percentile is an average of a binary variable.
      
      Ah, thanks.
      
      It seems that in the paper you cite regarding “interpretation 2” the calculation of the p-value is done in the same way (so it also involves this “averaging” over a null distribution to calculate a percentile) and then the p-value is considered in the context of a distribution of p-values.
      
      If that’s equivalent to calculating a p-value from the same statistic “averaging over different predictive distributions” in case 1 and case 2 it’s not obvious. At first sight at least it seems more like using p-value-1 as a statistic to calculate some kind of p-value-2 or meta-p-value.
David Marcus on December 2, 2024 2:44 PM at 2:44 pm said:

> both of which are mathematically valid

I don’t know what “mathematically valid” means.

- Andrew on December 2, 2024 2:53 PM at 2:53 pm said:
  
  David:
  
  What I mean is that they are mathematically defined; they are just estimating different things. So it’s not that one interpretation is “correct” and the other is “wrong”; the interpretation depends on the model being averaged over.
  
  - David Marcus on December 3, 2024 7:01 AM at 7:01 am said:
    
    Just because you can define something in math, doesn’t make it the correct way to mathematically model something in the real world or answer a question about the real world. Even in pure math, some definitions are not correct because they are not the appropriate way to capture some concept. Being well-defined as math is a pretty low bar.
    
    - Andrew on December 3, 2024 7:57 AM at 7:57 am said:
      
      David:
      
      Agreed. Here are two classic examples:
      
      • 1 + 1 = 2, but 1 raindrop + 1 raindrop can equal 1 raindrop if they merge with each other.
      
      • We learn in school that the three angles of a triangle add to 180 degrees, but if you put a big enough triangle on the Earth, with one vertex at a pole and two vertices at opposite ends of the Equator, then the angles add to 270 degrees.
      
      Integer arithmetic and plane geometry are mathematically well defined and super useful, but there are lots of real-world situations where they don’t apply!
Anonymous on December 2, 2024 4:42 PM at 4:42 pm said:

I am posting this comment in the true spirit of trying to be helpful.

I am a practicing medical research scientist. I had limited formal statistical training, but have spent thousands of hours over the last several decades learning statistics through books, papers, videos, and practice. I am far from an expert – but am probably in the top 10% of understanding among “people like me” (i.e., non statisticians doing medical research). I base this estimate on countless conversations I have had with and papers I have read from other “people like me”.

And, I can say without hesitation, that I only ever understand 05% of what is being said in these p-value discussions.

I understand that is a “me” problem. My point is that if there is a sincere goal of improving the understanding (and practice) of “people like me” – these discussions need to be phrased differently.

I also understand that these discussions aren’t necessarily being had for “people like me”. But we sure would like some help from somewhere. If folks are serious about science reform then I think it would be good for them to consider the vast audience of scientists in need of reform.

- Anoneuoid on December 2, 2024 7:12 PM at 7:12 pm said:
  
  Don’t sell yourself short:
  
  The Panel on Statistics distributed at the meeting of the American Statistical Association in Montreal in August 1972, in the pamphlet INTRODUCTORY STATISTICS WITHOUT CALCULUS, the following statement (page 20).
  
  A basic difficulty for most students is the proper formulation of the alternatives H0 and H1 for any given problem and the consequent determination of the proper critical region (upper tail, lower tail, two-sided). (Here H0 is the hypothesis that mu1 = mu0; H1 the hypothesis that mu1 != mu0.)
  
  Comment. Small wonder that students have trouble. They may be trying to think.
  
  https://deming.org/wp-content/uploads/2020/06/On-Probability-As-a-Basis-For-Action-1975.pdf
  
  - Andrew on December 2, 2024 7:18 PM at 7:18 pm said:
    
    Ha! Good to know that the authorities were all screwed up about this in 1972 as well. In the half-century since then, statistics has made lots of progress (see here), mostly by just ignoring all this p-value crap and getting on with solving real problems. It would be good now to go back and communicate things more clearly to non-mathematically-oriented users.
    
- Andrew on December 2, 2024 7:19 PM at 7:19 pm said:
  
  Anon:
  
  If you want something fun to read that might help your insight, I recommend the 52 stories in our book, Active Statistics.
  
  - Anonymous on December 3, 2024 8:22 AM at 8:22 am said:
    
    I own, and have read Active Stats along with several of your other excellent books, like Regression and Other Stories and Bayesian Data Analysis.
    
    I do find these helpful.
    
- Florian Wickelmaier on December 3, 2024 4:54 AM at 4:54 am said:
  
  Maybe a visualization helps:
  
  https://apps.mathpsy.uni-tuebingen.de/fw/pvalbinom/
  
  The app displays the original data and 50 replication experiments assuming the null hypothesis is true. If you code all replications that are at least as extreme as the data as 1 (black dots) and the others as zero (white dots), and average over this binary variable, you get an estimate of the p-value.
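The procedure described here can be sketched in a few lines. This is not the app's code: the data (15 successes in 20 trials), the null probability of 0.5, and the one-sided definition of "at least as extreme" are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import random
from math import comb


def simulated_p_value(n_trials, observed, n_reps, null_prob=0.5, seed=1):
    """Average of a 0/1 indicator over replications drawn under the null."""
    rng = random.Random(seed)
    indicators = []
    for _ in range(n_reps):
        # One replication: number of successes in n_trials null coin flips.
        successes = sum(rng.random() < null_prob for _ in range(n_trials))
        # Code the replication 1 if at least as extreme as the data, else 0.
        indicators.append(1 if successes >= observed else 0)
    return sum(indicators) / n_reps


def exact_p_value(n_trials, observed, null_prob=0.5):
    """Exact upper-tail binomial probability, for comparison."""
    return sum(
        comb(n_trials, k) * null_prob ** k * (1 - null_prob) ** (n_trials - k)
        for k in range(observed, n_trials + 1)
    )


# Hypothetical data: 15 successes in 20 trials.
print(simulated_p_value(20, 15, n_reps=50))      # noisy, like 50 dots
print(simulated_p_value(20, 15, n_reps=50_000))  # settles near the exact value
print(exact_p_value(20, 15))
```

With only 50 replications the average is a rough estimate; increasing the number of replications brings it close to the exact tail probability.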
  
- Michael Lew on December 3, 2024 3:16 PM at 3:16 pm said:
  
  Hey Anonymous, have a look at this open-access chapter that is my best (and last) attempt to explain p-values and the role for statistical inference in scientific inference generation to both scientists like you (and me) and statisticians. Yes, to statisticians. If statisticians understood those things as well as you might think they do then they would certainly be able to explain them to you and me in a cogent and understandable way!
  
  A reckless guide to p-values: local evidence and global errors https://link.springer.com/chapter/10.1007/164_2019_286
  
psyoskeptic on December 2, 2024 8:01 PM at 8:01 pm said:

And I have already had someone tell me this post allows them to justify a nonsensical p-value interpretation. I understand that you’re showing two *valid* interpretations but nevertheless, the word will get skipped by those wanting to argue their interpretation of it as an effect size or some such because Gelman argued that another interpretation is no threat to the canonical one (ignoring ‘valid’).

BTW, interpretation 2 not being a threat to interpretation 1 isn’t really an argument for anything at all.

- Andrew on December 2, 2024 8:07 PM at 8:07 pm said:
  
  Psyoskeptic:
  
  I can’t really comment on your first paragraph because I don’t know what was the nonsensical interpretation, but, yeah, I can well believe that just about anything can confuse people when it comes to p-values.
  
  The point of interpretation 2 not being a threat to interpretation 1 is relevant because when we presented our paper with interpretation 2, some people objected by saying, No, the correct interpretation of the p-value is interpretation 1. So I think it’s important to explain that both interpretations are valid; they just correspond to averages over different distributions.
  
Anonymous on December 2, 2024 11:11 PM at 11:11 pm said:

whoa. this is weird. So this dude andrew is saying you kind of gotta know something to do science. That’s bizarre.

Erik on December 3, 2024 8:38 AM at 8:38 am said:

In our paper “A new look at p-values for randomized clinical trials” (interpretation 2), we analyzed a particular dataset, namely the Cochrane Database of Systematic Reviews (CDSR). Our goal was to find out how the p-value behaves in the context of clinical trials. For example, we found that 6832 of the 23551 trials (29%) have p<0.05. In other words, the average power is 29%. That's the kind of average Andrew talks about.

With a little bit of modeling, we can go further. For example, we can conclude that only 12% of the trials in the CDSR have 80% power or more.

Next, we can zoom in on the trials that have a p-value between 0.01 and 0.05. We conclude that if we would replicate these trials exactly (an idealization, of course) then only 37% would again have p<0.05. That’s not much more than the 29% we had before the trial was done! So, the significant p-value did not add much to the replication probability.

On the other hand, 95% of the trials with a p-value between 0.01 and 0.05 have an observed effect in the right direction. In other words, the type S (or type III) error probability is quite low. I think it's a source of a lot of confusion that there is such a wide gap between the replication probability and the probability of the correct direction.

If you are not interested in clinical trials, then these results will not be relevant to you. In that case, I’d be curious how the p-values behave in your field!

- Matt Skaggs on December 3, 2024 11:25 AM at 11:25 am said:
  
  “On the other hand, 95% of the trials with a p-value between 0.01 and 0.05 have an observed effect in the right direction. In other words, the type S (or type III) error probability is quite low. I think it’s a source of a lot of confusion that there is such a wide gap between the replication probability and the probability of the correct direction.”
  
  I read this to mean that p-values actually work just fine in most psychology contexts, since the researchers get the full “gee-whiz” effect not from himmicane effect sizes but from merely showing that himmicanes are taken more seriously.
  
  Bloggers and commenters here are fond of using the term “random number generator” for calculating a p-value and the reason for that finally sank in a while ago. I take it to mean that the difference between the null effect and the calculated effect is an artifact of the model construction. But here we see that the number is not random since it goes the right direction 95% of the time!
  
  Now I have to go back and read the Mayo-Gelman debate again, because I seem to recall that these results fit well with what Deborah Mayo was arguing.
  
  - Erik on December 3, 2024 12:37 PM at 12:37 pm said:
    
    Matt: In the context of the CDSR, we can say that if the p-value is between 0.01 and 0.05, then there’s 37% probability that an exact (!) replication will have p<0.05, and 95% probability that the direction of the observed effect agrees with the "true" effect. That's in an ideal world. The real world is not ideal. In particular, the "true" effect is not a constant attribute of the treatment. We know from meta-analyses that there's almost always considerable heterogeneity. So in one study the true effect can be positive, while in another it's negative.
    
- Michael Lew on December 3, 2024 3:27 PM at 3:27 pm said:
  
  “For example, we found that 6832 of the 23551 trials (29%) have p<0.05. In other words, the average power is 29%." So now we have a strange form of the statistical technical term 'power' as well…
  
  - Erik on December 3, 2024 4:00 PM at 4:00 pm said:
    
    Michael: I don’t think it’s a strange form. Suppose we have an estimate b which has the normal distribution with mean beta and standard deviation s (aka the standard error). Define the z-statistic as z=b/s and the signal-to-noise ratio as SNR=beta/s. The power for (two-sided) testing if beta=0 depends on beta and s through the SNR. It is
    
    power(SNR) = P(|z| > 1.96 | SNR) = pnorm(-1.96,SNR,1) + 1 - pnorm(1.96,SNR,1).
    
    For example, if SNR=2.8 then the power is 80%. This is all standard stuff.
    
    Now, the average power of the trials in the CDSR (that 29% I mentioned) is simply the average of power(SNR) with respect to the distribution of the SNRs in the CDSR.
    
    - Andrew on December 3, 2024 5:23 PM at 5:23 pm said:
      
      +1
      
      It’s just a posterior probability. In other settings too I’ve found that people get confused with posterior probabilities when the word “p-value” is in the vicinity. The difficulty is that people feel that anything related to a p-value should have a uniform distribution. Years ago I tried to clarify this by inventing the term “u-value” for a statistic that has a uniform distribution under some model, but that never caught on. A couple years ago I tried again, but it’s just a tough topic, I guess because of the inherent indirectness of tail-area probabilities.
  - Carlos Ungil on December 3, 2024 5:00 PM at 5:00 pm said:
    
    They use “power against the true effect” – not the power given a predefined effect size or even the “ex-post” power calculated using the observed effect size. One can define the power that depends on the unknown true effect size – at least mathematically.
    
    In that sense it’s trivial that if we have, say, a 50/50 mixture of cases with large effect (power 90%) and small effect (power 10%) we expect to get a significant p-value in 90% of the former and 10% of the latter. In aggregate 45% which is equal to the average value of the 90% and 10% powers.
    
    There are also mentions of “predictive power” in the paper but I’m not sure if it’s related to the other references to power.
    
    - Carlos Ungil on December 3, 2024 5:02 PM at 5:02 pm said:
      
      > In aggregate 45%
      
      Of course I meant 50%.
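The power expression Erik gives above, and the mixture arithmetic in this sub-thread, can be checked with a short sketch. It translates the R-style `pnorm` calls into Python's standard library; the two SNR values in the mixture (3.24 and 0.65) are hypothetical, picked only because they give powers close to 90% and 10%.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def power(snr, z_crit=1.96):
    """P(|z| > z_crit) when z ~ N(snr, 1); mirrors the pnorm expression above."""
    return STD_NORMAL.cdf(-z_crit - snr) + 1.0 - STD_NORMAL.cdf(z_crit - snr)


def average_power(snrs, weights):
    """Weighted average of power(SNR) over a discrete distribution of SNRs."""
    return sum(w * power(s) for s, w in zip(snrs, weights)) / sum(weights)


print(round(power(2.8), 3))  # about 0.80, as stated in the thread

# Hypothetical 50/50 mixture of a strong and a weak signal-to-noise ratio.
mix = average_power([3.24, 0.65], [0.5, 0.5])
print(round(power(3.24), 2), round(power(0.65), 2), round(mix, 2))
```

The average power of a collection of trials is then just `average_power` applied to whatever distribution of SNRs one is willing to assume or estimate.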
Daniel Lakeland on December 3, 2024 11:16 AM at 11:16 am said:

I think what confuses me here is calling (1) and (2) different “interpretations”.

To me, an interpretation is where you calculate some specific number and then you assign a real world meaning to it. The same number could have different meanings depending on your “interpretation”.

A trivial example: you measure height and weight and calculate BMI. Then you either interpret this by saying the patient is overweight, or, perhaps knowing some additional information, the patient is healthy but has a lot of muscle mass from working out at the gym heavily.

Same number but two different interpretations.

But when you’re talking about (1) and (2) aren’t you talking about different computations?

For example I could take (1) a point null hypothesis mu=0 and calculate a p value for the observed sample mean. This is a classic point null hypothesis test.

On the other hand, I could (2) take a prior over mu as being centered around 0 with some width and then calculate the prior weighted average of the p value for each mu in the range of the prior support.

This will NOT be the same number as (1) and the difference isn’t about “interpretation” it’s also about the method of calculation! That is, it’s a different procedure!

Am I off base in my understanding? Or do you agree?

Basically, when I see a p value, I need to understand what was done in order to interpret it. And once I have knowledge of what was done, I’m not free to switch back and forth in interpretation. The particular calculation refers to one or the other meaning.

- Carlos Ungil on December 3, 2024 7:23 PM at 7:23 pm said:
  
  > But when you’re talking about (1) and (2) aren’t you talking about different computations?
  
  I don’t think so. According to the article referenced “We are interested in understanding the resulting two-sided p-value without changing its calculation.”
  
  > For example I could take (1) a point null hypothesis mu=0 and calculate a p value for the observed sample mean. This is a classic point null hypothesis test.
  
  That’s essentially what they do. They calculate a pretty standard p-value using a null hypothesis. In some sense it’s an average of something over some distribution. The distribution is that of the possible values of the statistic (say the absolute value of the sample mean); the something being averaged is the function which is 1 if the value of the statistic is higher than the observed one and 0 otherwise.
  
  “Interpretation 1” stops there. The “meaning” of the p-value is derived from its very definition.
  
  > On the other hand, I could (2) take a prior over mu as being centered around 0 with some width and then calculate the prior weighted average of the p value for each mu in the range of the prior support.
  
  One could do that, but that’s not what they do.
  
  “Interpretation 2” involves the p-value for the experiment of interest as before: the average of an indicator function which depends on the observed data and the null distribution for the statistic. Also a myriad of p-values for other experiments to be used as reference: thousands of averages of functions which depend on observed data over null distributions – from thousands of real or synthetic experiments.
  
  If I understand it correctly in the article they use the result of 20’000 experiments to create a model used to produce 1’000’000 p-values and other things. By now we have the same “average over a distribution” as before, plus a million other “averages over a distribution” similar to that one.
  
  “Interpretation 2” is about considering the p-value obtained in our experiment in the context of that distribution of p-values for other experiments. It can be expressed as the averaging of things:
  
  “By selecting and averaging, we can also compute the following quantities, conditional on 𝑝 falling in some interval […] 1. The three quartiles of the exaggeration factor, […] 2. The coverage, the probability that the 95% confidence interval covers the true effect. […] 3. The probability of the estimate having the correct sign […] 4. The probability that an exact replication study will obtain a two-sided 𝑝-value less than 0.05 with the estimate in the same direction as the original study […]”
  
  I wouldn’t say that “they just correspond to averages over different distributions” though. The p-value is calculated as the same average over the same distribution in both cases. Then other things are averaged in the second case involving a multivariate distribution of p-values and other quantities that have no parallel in the first case.
  
  - Daniel Lakeland on December 3, 2024 7:44 PM at 7:44 pm said:
    
    Gotcha, thanks for summarizing what’s up. So basically you could compare a test statistic to other possible test statistics from a particular single data generating process… Or you could compare your p values to p values that might have been obtained in different but somehow scientifically related experiments involving different data generating processes… Sure I guess I see that.
    
    So then it is a little like my BMI example, since it’s the context that matters. (Context of knowing the patient is a body builder, or context of knowing that some other data generating processes are also relevant to your science question)
    
    - Erik on December 4, 2024 4:17 AM at 4:17 am said:
      
      Carlos: Thanks for the summary.
      Daniel: I replied to your message earlier but used the wrong button, so it’s below.
      
      I’ll just add some more detail. I’m assuming the simple case where we have an estimate b which has the normal distribution with mean beta and standard deviation s (aka the standard error). There are 20000 triples (beta,b,s) in the CDSR. The true effects beta are unobserved, so we observe 20000 pairs (b,s).
      
      Define the z-statistic as z=b/s and the signal-to-noise ratio as SNR=beta/s. Despite the fact that the beta’s (and hence the SNRs) are unobserved, it is possible to estimate the *joint* distribution of z and SNR across the CDSR. The two-sided p-value is a 1-1 transformation of the absolute z-statistic, namely p=2*pnorm(-abs(z)).
      
      Now suppose we have observed the p-value from some clinical trial and we’re interested if the observed effect b has the same sign as the true effect beta.
      
      P(b*beta > 0 | p) = P(z*SNR > 0 | abs(z))
      
      Since we have the joint distribution of z and SNR, we can calculate this probability. In the NEJM paper we use Monte Carlo because we thought it’s easier to understand than the math.
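The "selecting and averaging" idea described in this thread can be sketched with a small Monte Carlo. This is only an illustration under loud assumptions: the SNR is drawn from normal(0, tau) with tau = 1.56, which is a stand-in and not the distribution fitted to the CDSR, so the numbers it prints should not be read as reproducing the paper's results. Function and variable names are hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def two_sided_p(z):
    """Same transformation as in the comment above: p = 2 * pnorm(-abs(z))."""
    return 2.0 * STD_NORMAL.cdf(-abs(z))


def select_and_average(n_sims=200_000, tau=1.56, p_low=0.01, p_high=0.05, seed=1):
    """Simulate (SNR, z) pairs, keep those with p in (p_low, p_high), then average."""
    rng = random.Random(seed)
    kept = same_sign = replicated = 0
    for _ in range(n_sims):
        snr = rng.gauss(0.0, tau)   # assumed effect distribution (hypothetical)
        z = rng.gauss(snr, 1.0)     # z-statistic of the original study
        if not (p_low < two_sided_p(z) < p_high):
            continue                # select: condition on the p-value interval
        kept += 1
        same_sign += (z * snr > 0)  # observed and true effect share a sign
        z_rep = rng.gauss(snr, 1.0)  # exact replication with the same SNR
        # Replication counts if p < 0.05 with the estimate in the same direction.
        replicated += (two_sided_p(z_rep) < 0.05 and z_rep * z > 0)
    return {
        "kept": kept,
        "p_correct_sign": same_sign / kept,
        "p_replicate": replicated / kept,
    }


result = select_and_average()
print(result)
```

Swapping in a different assumed distribution for `snr` changes the averages but not the logic: simulate the joint distribution, select on the p-value, and average the quantity of interest.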
Erik on December 3, 2024 12:47 PM at 12:47 pm said:

Daniel: It really is the same p-value to which we attach different meanings. However, unlike your two interpretations of an elevated BMI, they don’t clash. So, it’s fine to say that if there is no effect (and all other assumptions hold) then the probability that p<0.05 is 5%. At the same time, it's also true that if you choose a trial at random from the CDSR, there's a 29% probability that p<0.05.

Anoneuoid on December 3, 2024 4:50 PM at 4:50 pm said:

Say you want to make a heatmap to summarize some data, but the sample size and variance within each grid-cell differs.

P-values can be used as a single value to represent a (standardized) average + sample size pair.

Ie, you are normalizing the observed value to the noise and sample size. Using the relative p-values may, or may not, be helpful compared to looking at three heatmaps (one each for mean, var, n). Or panels for each element of the grid. It is like choosing to use a dotted vs dashed line in a plot, or mean vs median.

But it *never* makes sense to compare a p-value to an arbitrary number like 0.05.

Also see Michael Lew’s paper here (p-value + sample size index a likelihood curve): https://arxiv.org/abs/1311.0081
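As a rough sketch of the heatmap idea, each grid cell's mean, standard deviation, and sample size can be collapsed into one number. The comparison against zero, the normal approximation, and the example cells are all assumptions made for illustration; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def cell_p_value(mean, sd, n):
    """Two-sided p-value for a cell mean against zero, normal approximation."""
    z = mean / (sd / sqrt(n))
    return 2.0 * STD_NORMAL.cdf(-abs(z))


# Hypothetical grid cells: (mean, standard deviation, sample size).
cells = {
    "A1": (0.5, 1.0, 10),
    "A2": (0.5, 1.0, 100),
    "B1": (0.5, 3.0, 100),
}
for name, (mean, sd, n) in cells.items():
    print(name, round(cell_p_value(mean, sd, n), 4))
```

All three cells have the same mean; the single summary differs only because the noise and the sample size differ, which is the normalization being described.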

Carlos Ungil on December 3, 2024 5:26 PM at 5:26 pm said:

From the abstract of the “new look at p-values”:

> We estimate that the great majority of trials have much lower statistical power for actual effects than the 80% or 90% for the effect sizes stated in proposals. Consequently […] “nonsignificant” results often correspond to important effects

I didn’t find where in the article this is discussed but – regarding “important effects” – it’s explained that “RCTs are expensive, and investigators typically limit their planned size to what is needed to detect plausible or important anticipated effects.”

If the 80% or 90% power stated in proposals is valid for the “important anticipated effects” and the “power for actual effects” is much lower doesn’t that make those actual effects unimportant? Some “nonsignificant” results will correspond to important effects, but not those from “low power” trials.

- Erik on December 3, 2024 6:25 PM at 6:25 pm said:
  
  Carlos: Sample size calculations are supposed to ensure at least 80% (or even 90%) power against the minimally clinically important difference (MCID). We find that only 12% of the trials in the CDSR actually reach at least 80% power. I think there are two reasons for that. First, just like statistics, medical research is hard and it’s not easy to come up with new treatments that deliver the MCID. Second, due to limitations of time, money and subjects, the MCID is often inflated to lower the required sample size. The power against the real MCID will then be (much) lower than 80%. In that way, effective treatments may be missed.
  
  - Carlos Ungil on December 3, 2024 6:37 PM at 6:37 pm said:
    
    Thanks. I agree that if the power is not high enough for an “important” effect to be detected with high probability it’s quite likely that such an important effect will go undetected.
    
  - Anoneuoid on December 3, 2024 6:43 PM at 6:43 pm said:
    
    The NNT (number-needed-to-treat) is typically ~100 for blockbuster drugs. Ie, a new patient has about 1% chance of benefiting: https://thennt.com/
    
    That is already an extremely low bar to meet. Compare to interventions developed before NHST like vitamin C for scurvy, anesthetics for acute pain, or insulin for Type I diabetes. Those have NNT ~ 1, definitely less than 10.
    
    - Pavlos Msaouel on December 3, 2024 7:08 PM at 7:08 pm said:
      
      As a heads up, NNTs are controversial as discussed, e.g., here: https://discourse.datamethods.org/t/problems-with-nnt/195
      
      While I have found both posterior probabilities and p-values (both interpretations 1 and 2, see here where we used the latter: https://pubmed.ncbi.nlm.nih.gov/39003109/) to be useful in medical research and practice, I have yet to find a scenario where using NNTs would be preferable for risk communication versus other options :)
    - Anoneuoid on December 3, 2024 8:06 PM at 8:06 pm said:
      
      The vast majority of NNTs are computed using group averages. For example, a physician may read a clinical trial report that provides the overall unadjusted cumulative incidence of stroke at 5y to be 0.25 and 0.20 for treatments A, B and compute NNT=1/(0.25 - 0.2) = 20. What is the interpretation of 20? It really has no interpretation. That is because RCTs with even the most restrictive inclusion criteria have a variety of subjects with much differing risks of the outcome.
      
      Yep, the “average person” does not exist. So, unless they happen to be in a (non-p-hacked) subgroup with NNT ~1, you really have no idea whether an individual patient will benefit.
      
      NNT Has Great Uncertainty: How often have you seen a confidence interval for NNT? Probably not very often. And when you compute them they are often so wide as to render the NNT point estimate meaningless, even if the problems listed above didn’t bother you. For example, an ARR that comes from the two proportions 1/100 and 2/100 is 0.01 with 0.95 confidence interval of about [0, 0.044]. Confidence limits for NNT are the reciprocals of these two numbers or [23, ∞].
      
      Yep, it gives a lower bound on the effectiveness. These calculations also ignore systematic error (the study became unblinded, etc). The problems you have with it relate to the upper bound.
      
      Those “controversies” are about how the status quo is *even worse* than indicated by NNT.
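The arithmetic in the two quoted passages is easy to check. Below is a minimal sketch, assuming a plain Wald interval for the risk difference; the quoted post does not say which interval method it used, and the function names are only illustrative.

```python
import math

def nnt(risk_control, risk_treated):
    """Number needed to treat: reciprocal of the absolute risk reduction."""
    return 1.0 / (risk_control - risk_treated)

def wald_ci_risk_difference(events_a, n_a, events_b, n_b, z=1.96):
    """Wald interval for p_b - p_a (an assumed method, chosen for simplicity)."""
    p_a, p_b = events_a / n_a, events_b / n_b
    diff = p_b - p_a
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return diff - z * se, diff + z * se

# Stroke example: cumulative incidence 0.25 vs 0.20
print(round(nnt(0.25, 0.20)))

# ARR example: 1/100 vs 2/100
low, high = wald_ci_risk_difference(1, 100, 2, 100)
print(low, high)   # the lower limit is below zero
print(1 / high)    # reciprocal of the upper limit bounds the NNT from below
```

Under this assumption the lower limit of the risk difference is negative, so the interval includes no benefit at all, which is why the NNT interval has no finite upper end.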
    - Dale Lehman on December 4, 2024 8:36 AM at 8:36 am said:
      
      The Harrell post on reasons not to use NNTs is excellent, but raises a question for me. While there is no such thing as the “average” person and we might all want individualized risk and effectiveness data, these rarely exist. Most clinical trials are small for a number of reasons. As a result, when I look at studies relevant to my own situation, they never seem to address people with my specific characteristics. Does this mean that clinicians should resort to their gut feelings or personal experience and ignore the “average” effects that were analyzed? I’m not seeing how to bridge the gap between the admittedly inadequate information that was created and the desired, but unavailable, data that we desire.
      
      On a somewhat related point, Harrell in that discussion said “I can’t think of many population decisions in medicine.” I don’t really understand this. As a patient I am interested in individual decisions. But for any provider, they treat many individual patients and population decisions may be superior to making a number of individual decisions. I think there is a tendency to believe that individual decisions will always be made correctly. Unfortunately, there are many reasons why they are not, including lack of competence by providers and incomplete information about subgroups. Isn’t the same tension present in all statistical analyses? Whatever individual decision is being made, the data is likely to fall short of matching the exact individual circumstances. At what point is it better to use the “average” effects rather than ignoring these in favor of speculating about unique individual circumstances?
      
      I am not advocating that NNTs are good, nor that it is ok to just use whatever average effects have been estimated. But I would really like to understand the practical choice between following guidelines based on average effects and deviations based on subjective individualized circumstances. I see this tension every time I see a medical provider and it is one reason why I place so much importance on the freedom to choose who my provider is (although there is an immense information gap when making those choices, but that is a different matter).
    - Pavlos Msaouel on December 4, 2024 10:56 AM at 10:56 am said:
      
      Dale, you asked the trillion dollar questions. Here is one of my favorite dissections of this topic: https://www.annualreviews.org/content/journals/10.1146/annurev-statistics-010814-020310 (preprint: https://arxiv.org/abs/1510.08539).
      
      Circling now back to the original post, notice in the article how empirical Bayes (readers of this blog may prefer to think of it as a subset of hierarchical Bayes) can often have optimal properties in addressing your questions. Interpretation 2 of p-values aligns with this inferential approach. In fact, see here a cool evolution of this paradigm: https://www.nature.com/articles/s41559-024-02530-5 which additionally takes advantage of the robust CI methodology recently published here: https://www.tandfonline.com/doi/full/10.1080/01621459.2021.2008403 (see also comments and rejoinder).
    - Dale Lehman on December 4, 2024 12:46 PM at 12:46 pm said:
      
      Pavlos
      Thank you for the references – they look quite interesting and pertinent. It will take me awhile to digest them, but I see one possible way to characterize the difficulty of adjusting studies estimating average effects with the desire for individualized estimates. Ex post subgroup analyses have the problems of being subject to selection issues, too small sample sizes, and inflated estimates of statistical significance (or, more generally, type M and S errors). So, if I want to estimate my own need for a prostate biopsy (to make a concrete example), I’d like to modify the existing RCT studies to focus on people like me (with above average health and other possibly relevant factors that are not well represented in the study population). I see there are ways (all problematic, to varying extents) to attempt to build appropriate models based on the existing RCT data in an effort to get a balance between relevant and robust analyses.
      
      One option that occurs to me, and that I often would like to employ, is to supplement the RCT data with much larger observational data. If I could use large scale data on PSA levels, socioeconomic variables, and other health status to have a fairly robust group “like me,” then it might provide a nice complement to the RCT study that was more “legitimate” but provided average effects for people that were mostly “not like me.”
      
      I’m not yet sure how it would be best to combine such data and there are certainly problems using observational data to infer causal effects. But I am always thwarted by the fact that the observational data is rarely available (due to the many restrictions that exist on sharing data, some policy driven and others due to proprietary practices). If anybody has references on combining RCT and observational data along these lines, I’d be most interested.
    - Pavlos Msaouel on December 4, 2024 2:08 PM at 2:08 pm said:
      
      Now you are taking us away from the original post (although one can argue that hierarchical modeling is well-suited for combining data across datasets) and towards transportability considerations needed to integrate datasets, including those coming from experiments (such as RCTs) and observational studies. We strategically wrote a very long but open access MDPI paper on this topic with practical examples to catalyze this conversation across clinicians, biostatisticians, theoretical statisticians, epidemiologists, computer scientists etc and provide related references for further reading: https://www.mdpi.com/2072-6694/14/16/3923 Painful to write, and certainly far from perfect, but actually did end up helping introduce new collaborators to these ideas :)
      
      See also another related datamethods thread here: https://discourse.datamethods.org/t/individual-response/5191
Woodpecker on December 3, 2024 5:51 PM at 5:51 pm said:

For me, p-values do more harm than good. I’m still using them due to journals, my supervisor, and the norms of my field. I plan to place less emphasis on them in my next papers. Point 2 resonates with me more.

Anoneuoid on December 4, 2024 1:25 PM at 1:25 pm said:

Dale, re: observational prostate cancer data, you probably want to check out SEER:

There are two Prostate Cancer databases available to request, one with and one without Census Tract Attributes. The need for Census Tract Attributes should be described in the Purpose and Significance section of the application and the planned use of the attributes should be explained in the Analytic Plan section of the application.

1) Prostate Cancer with Additional Treatment Modalities and Risk Stratification Fields Database
This one is linked to county-level attributes, which include county-level SES, rurality, and demographics.
It includes all tumor records from 2000-2020, but the Prostate fields are available just for 2010+.
This one is also included in SEER*Stat Prevalence sessions.

2) Prostate Cancer with Additional Treatment Modalities and Risk Stratification Fields with Census Tract Attributes Database
There are no geographic identifiers included in this database due to confidentiality concerns.
It does not include Alaska Native Tumor Registry data.
It includes all tumor records from 2006-2020, but the Prostate fields are available just for 2010+.
For detailed information about census tract SES and rurality variables, refer to Census Tract-level SES and Rurality Database.

https://seer.cancer.gov/data/specialized/available-databases/prostate-request/

Then you’d probably want to train some model (I’d start with something like xgboost) on eg 2006-2018 data, validate on 2019 data, then assess predictive skill on the 2020 data to get an idea of the accuracy.

Finally, plug your personal features into the model. As long as the model can interpolate (rather than extrapolate) to generate your personal prediction it should be more helpful than an RCT.

As we can see from the NNT discussion, these average outcomes from large RCTs contain essentially no actionable info for the individual. Ie, the typical NNT ~100 means there is somewhere between 1/20 and 1/inf chance you will benefit.
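The train/validate/test scheme described above is a split by calendar year. A minimal sketch of just the split follows; the model fitting itself (xgboost or anything else) is omitted, and the `year` key and toy records are hypothetical stand-ins, not actual SEER field names.

```python
def temporal_split(records, train_end=2018, valid_year=2019, test_year=2020):
    """Split records by year so that later years are never used for fitting."""
    train = [r for r in records if r["year"] <= train_end]
    valid = [r for r in records if r["year"] == valid_year]
    test = [r for r in records if r["year"] == test_year]
    return train, valid, test

# Hypothetical toy data: one record per year, 2006 through 2020
records = [{"id": i, "year": year} for i, year in enumerate(range(2006, 2021))]
train, valid, test = temporal_split(records)
print(len(train), len(valid), len(test))
```

Splitting by time rather than at random keeps the accuracy estimate on the final year closer to how the model would be used: predicting for patients diagnosed after the training data ends.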

- Dale Lehman on December 4, 2024 2:22 PM at 2:22 pm said:
  
  Anoneuoid
  Thanks for the reminder about SEER data – it’s been a few years and I will take another look to see what is available. I can’t comment on the data yet, as I am still going through the registration process (I can’t recall the registration login and password from years ago) but it does bring up another complaint I have about data access. The registration procedure for the SEER data is among the worst I’ve encountered. Anybody that wonders about this should try it for themselves – it truly gives you the feeling (at least gives me the feeling) that this is data they really don’t want anybody to access!
  
  - Dale Lehman on December 4, 2024 2:54 PM at 2:54 pm said:
    
    I was finally successful (for anybody that wants an example of how NOT to design user-friendly data access, you should spend the 0.5-1 hour to access SEER data). It does actually provide a fair amount of case-specific data although much of it is detailed data about the disease itself (at least for prostate cancer). What is missing is most diagnostic data (PSA levels and history, MRI results, etc.) – things that would be in the electronic health record but not in the SEER database, at least not in the research data. There is more extensive data available with a specific data request, which I did not try. But at least there is some individual data provided, though not enough to effectively use for the purposes I was outlining above.
    
ES on December 4, 2024 7:28 PM at 7:28 pm said:

Dale,

“If anybody has references on combining RCT and observational data along these lines, I’d be most interested.”

In medicine, we usually use the concept of “risk magnification” in order to apply RCT results to individual patients. See this post for more details:

https://www.fharrell.com/post/hteview/

Post #51 in the Datamethods thread cited by Pavlos above describes an example of risk magnification involving statin therapy: https://discourse.datamethods.org/t/individual-response/5191/52

Re one of your other questions: “At what point is it better to use the “average” effects rather than ignoring these in favor of speculating about unique individual circumstances?”

The answer is that it’s *very often* necessary to make decisions based on “average” effects in medicine, since this is often the best we can do. See post #2 in this thread: https://discourse.datamethods.org/t/information-evidence-and-statistics-for-critical-research-appraisal/6226/2
and also this thread: https://discourse.datamethods.org/t/when-is-it-clinically-reasonable-to-assume-transportability-of-rct-effects/9375/7

Don’t know if any of this is of interest.

- Dale Lehman on December 4, 2024 9:47 PM at 9:47 pm said:
  
  very helpful, thanks
  
- Daniel Lakeland on December 4, 2024 11:38 PM at 11:38 pm said:
  
  > The answer is that it’s *very often* necessary to make decisions based on “average” effects in medicine, since this is often the best we can do.
  
  especially since everyone designs medical research around the average effect, hardly anyone in medical research has the slightest idea how to analyze data, and the dominant paradigm is testing fixed doses against null hypotheses.
  
  I’ll give the example again because it’s a good example. The dosage for the COVID vax was determined by testing a certain number of fixed doses in micrograms (I don’t remember maybe it was 50, 100, and 200 ug for the Moderna, maybe it was 20, 50, 100 ug for Pfizer? something along those lines). But, we know that micrograms can’t be the determining factor for the immune response, because it’s not dimensionless, and the immune response certainly doesn’t depend on the choice of some Napoleonic era person to define the gram in terms of 1cc of water.
  
  So, the determining factor must be a dimensionless ratio of grams injected to some measure of the body it was injected into. A candidate measure is something like the mass of the bone marrow, where immune cells are created. Such a mass could be estimated by something like age, sex, weight, height, density of bone marrow measured from extracted marrow (which is likely a well studied number already) and perhaps some measurements taken off x-rays, CT scans, and MRIs already sitting in databases at hospitals.
  
  Let’s say that given age, sex, weight, height and rho (known density of marrow), we can estimate mass of marrow by some formula marrowMass(age,sex,weight,height,rho)
  
  Then we take the thousands of people who were injected with the vaccine, and we take say the concentration of antibodies in the blood at 2 weeks post second injection.
  
  Basically we then build a nonlinear regression formula where the optimal quantity to inject is determined by a dimensionless nonlinear relationship (I won’t go into details, just suffice to say you can formulate it as a polynomial in injected_mass/marrowMass).
  
  Then a simple calculator app is created where a nurse enters age, sex, weight, height and gets a volume to draw into the needle.
  
  Show me ONE real world trial analyzing dosing using dimensionless ratios. Until that is standard, medical research will continue to be pre-scientific. My sister does dosing of psych meds for difficult cases all day. Not a single dimensionless formula or chart or nomograph is used for any of the drugs she Rxes.
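The calculator idea above can be sketched in a few lines. This is only an illustration: the marrow estimate below uses body weight alone with a made-up fraction (the comment's marrowMass would also take age, sex, height and rho), and the target ratio is a hypothetical input that would have to come from the fitted dose-response relationship.

```python
def marrow_mass_kg(weight_kg, marrow_fraction=0.04):
    """Placeholder estimate: marrow as a fixed fraction of body weight.
    The 0.04 default is an illustrative assumption, not a clinical value."""
    return weight_kg * marrow_fraction

def dose_volume_ml(weight_kg, target_ratio, concentration_ug_per_ml):
    """Volume to draw so that injected mass / marrow mass equals target_ratio.
    target_ratio is dimensionless (micrograms of drug per microgram of marrow)."""
    marrow_ug = marrow_mass_kg(weight_kg) * 1e9  # kilograms to micrograms
    dose_ug = target_ratio * marrow_ug
    return dose_ug / concentration_ug_per_ml

# Hypothetical calibration: 100 ug is the right dose for a 70 kg person
reference_ratio = 100 / (marrow_mass_kg(70) * 1e9)

print(dose_volume_ml(70, reference_ratio, concentration_ug_per_ml=200))
print(dose_volume_ml(105, reference_ratio, concentration_ug_per_ml=200))
```

The point of the design is that the quantity held fixed across patients is the dimensionless ratio, so the volume drawn scales with the estimated marrow mass instead of being the same number of micrograms for everyone.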
  
  - Anonymous on December 5, 2024 1:27 AM at 1:27 am said:
    
    +1, mostly, though I will say pharmacokinetics (the “science” of how medicinal concentrations vary over time) at least has a much better grasp of mathematics and agreement with data than the average medical subfield.
    See e.g. Wagner’s Pharmacokinetics for the Pharmaceutical Scientist. It basically relies on one differential equation, but that’s one more than the average medical researcher.
    See also this link for some analysis of that one equation:
    https://math.stackexchange.com/questions/1710840/pharmacokinetics-differential-equations-with-equal-absorption-and-elimination-c
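For readers who have not seen it, here is a minimal sketch of the kind of equation involved, assuming the standard one-compartment model with first-order absorption (rate `ka`) and elimination (rate `ke`). Whether this is exactly the equation the book relies on is an assumption; the equal-constants branch is the special case the linked question is about. Parameter names are illustrative.

```python
import math

def concentration(t, dose, volume, ka, ke, bioavailability=1.0):
    """Plasma concentration at time t after a single oral dose.

    Solves dA/dt = -ka*A (drug at the absorption site) and
    dC/dt = ka*A/V - ke*C (plasma), with A(0) = F*dose and C(0) = 0.
    """
    scale = bioavailability * dose / volume
    if math.isclose(ka, ke, rel_tol=1e-9):
        # Limit of the general formula as ka approaches ke
        return scale * ka * t * math.exp(-ka * t)
    return scale * ka / (ka - ke) * (math.exp(-ke * t) - math.exp(-ka * t))

# Equal rate constants: concentration rises, peaks at t = 1/k, then decays
for t in (0.0, 0.5, 1.0, 2.0):
    print(t, concentration(t, dose=100, volume=10, ka=1.0, ke=1.0))
```

The general formula divides by `ka - ke`, so the equal-constants case needs its own branch; the two agree in the limit.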
    
  - ES on December 5, 2024 8:10 AM at 8:10 am said:
    
    Drug dosing is certainly an area where our therapies could sometimes probably be better targeted to the individual. But if both you and Dale come to the ER with an acute occlusion MI, you’re both going to be treated with primary PCI, since you both have an occluded coronary artery. If both of you present to the ER with anaphylaxis, you’re both going to be treated with epinephrine. If both of you present with community-acquired pneumonia, we’re going to look at some of your unique personal characteristics (e.g., age, whether you’re immunosuppressed, other comorbidities, whether you’ve recently received antibiotics) before we choose your respective antibiotics – but you’re both going to get antibiotics.
    
    There are a few medical fields where individualization of therapy is likely to make an important difference for patients’ outcomes (especially oncology), but MANY fields where the underlying causal mechanism underlying a disease is likely to be the same from patient to patient, such that a “one size fits all” approach is going to lead to better outcomes, on average, than if we attempted, using woefully inadequate sample sizes, to derive more personalized approaches to therapy. Often, individual characteristics don’t determine *whether* a treatment is going to have an effect in a given patient (since underlying causal mechanisms are the same between patients for *many* clinical conditions), but rather how easy it will be for us to *detect* that effect.
    
    - somebody on December 5, 2024 9:41 AM at 9:41 am said:
      
      > using woefully inadequate sample sizes
      
      I would consider this to be somewhat of a misconception. More complex, individualized models actually lead to greater sample efficiency. Of course, this isn’t magic, there are more free parameters, but the more complex pharmacokinetically motivated models let you make use of much more data: basic measurements like height, weight, etc., which researchers are generally ALREADY COLLECTING but which do not make their way into the main analysis of effect sizes.
    - Dale Lehman on December 5, 2024 10:54 AM at 10:54 am said:
      
      somebody
      Can you elaborate on this? Using more variables with the same sample size strikes me as just making the estimates noisier – I’d call that “inadequate sample sizes.” It would make sense to me for my provider to include all my physical dimensions, socioeconomic and demographic characteristics, and my personal medical history in designing my treatment plan. They do that to varying extents, depending on the specific provider. But the algorithms don’t use any of that information, and if they tried to include it I suspect the sample sizes in the RCT would be woefully inadequate to provide meaningful guidance. It seems to me that we are stuck between woefully inadequate models and woefully inadequate sample sizes – where the best we can do is fall somewhere between these extremes.
    - somebody on December 5, 2024 11:11 AM at 11:11 am said:
      
      Imagine you have a potato gun and you want to figure out how powerful it is, in the sense that you want to be able to predict how far it’ll shoot future potatoes.
      
      Two methods:
      1. You put in a bunch of potatoes, measure the distance they fire, and the average distance is your prediction for the future potatoes.
      2. You measure the mass and volume of each potato, assume a spherical air resistance + a fudge factor, regress the impulse the potato gun is delivering. Then, for future potatoes, you measure them and plug the values in.
      
      Which one is more sample efficient? (2) can work well with like 3 potatoes, while (1) needs a lot because potatoes are very different in size.
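A toy simulation makes the comparison concrete. This is a sketch under simplifying assumptions that are not in the comment: no air resistance, a 45-degree launch, a fixed impulse per shot, and 5% multiplicative noise on distance. All numbers are hypothetical.

```python
import math
import random

G = 9.81             # gravitational acceleration, m/s^2
TRUE_IMPULSE = 5.0   # N*s delivered per shot (hypothetical)
rng = random.Random(0)

def simulate_distance(mass, impulse, noise=0.05):
    """Range of a 45-degree shot without drag: d = v^2 / g, v = impulse / mass."""
    v = impulse / mass
    return (v * v / G) * (1 + rng.gauss(0, noise))

def draw_potatoes(n):
    """Fire n potatoes with masses between 0.1 and 0.4 kg."""
    masses = [rng.uniform(0.1, 0.4) for _ in range(n)]
    return masses, [simulate_distance(m, TRUE_IMPULSE) for m in masses]

def rmse(predicted, actual):
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

# Method (1): average distance over 30 shots, ignoring mass
_, distances_1 = draw_potatoes(30)
mean_distance = sum(distances_1) / len(distances_1)

# Method (2): 3 shots, invert the physics to estimate the impulse
masses_2, distances_2 = draw_potatoes(3)
impulse_hat = sum(m * math.sqrt(d * G) for m, d in zip(masses_2, distances_2)) / 3

# Compare both predictions on 200 future potatoes
new_masses, new_distances = draw_potatoes(200)
err1 = rmse([mean_distance] * 200, new_distances)
err2 = rmse([(impulse_hat / m) ** 2 / G for m in new_masses], new_distances)
print(err1, err2)
```

In this setup method (2) wins with a tenth of the data because mass explains most of the variation in distance. If the assumed mechanism were wrong, the advantage could shrink or reverse.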
    - somebody on December 5, 2024 11:26 AM at 11:26 am said:
      
      I guess in an econometrics class one might teach that, as a rule of thumb, more variables + more coefficients increases the variance of your estimators for a given sample size, but that’s not really true in general. It really depends on how much the additional variables + model modifications decrease the residual variance relative to measurement error and the degree to which covariance with existing variables worsens identifiability in the likelihood. Your classic OLS situation is a bad scene because all the variables “do the same thing” in the model and a lot of them are correlated. But biophysics is an advanced field which is already used to design drugs and is dropped for no reason when analyzing their trials.
    - Chris on December 5, 2024 11:26 AM at 11:26 am said:
      
      The paper by Sander Greenland highlighting a serious misuse of a statistical argument (by a statistician) that Andrew Gelman’s paper (link in top post) refers to, maybe gives another example. In the Greenland paper an analysis of a clinical trial assessing safety of gabapentin with respect to suicidal ideation is described. Basically, in the trial 2 out of the 5194 gabapentin-treated patients were positive for suicidality and in the control 1 out of 2682 gabapentin-free control individuals displayed suicidality. (Greenland is making the point that the number of data points is so small that using a statistical analysis allows the possibility that gabapentin multiplies the risk of suicidal ideation by 30-fold or divides the risk by 12-fold – there simply isn’t enough data to make a meaningful statistical analysis).
      On the other hand, if one were to further inspect the positive cases and discovered that the two gabapentin positives were individuals with a history of depression, then one might conclude that gabapentin in patients with depression is problematic.
      
      One could include that in subsequent models rather in the same way as “somebody”’s potato example.
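To see how little 2 versus 1 events can say, here is a minimal sketch of a risk ratio with a large-sample Wald interval on the log scale. That method is an assumption made for illustration; it is crude with counts this small and may not be the one used in the paper, so it is not expected to reproduce the 30-fold and 12-fold figures mentioned above.

```python
import math

def risk_ratio_ci(events_treated, n_treated, events_control, n_control, z=1.96):
    """Risk ratio with a Wald interval computed on the log scale."""
    rr = (events_treated / n_treated) / (events_control / n_control)
    se_log = math.sqrt(
        1 / events_treated - 1 / n_treated + 1 / events_control - 1 / n_control
    )
    return rr, rr * math.exp(-z * se_log), rr * math.exp(z * se_log)

# 2 of 5194 gabapentin-treated vs 1 of 2682 controls
rr, low, high = risk_ratio_ci(2, 5194, 1, 2682)
print(rr, low, high)
```

The point estimate sits close to 1, but the interval spans roughly two orders of magnitude, so the data are compatible with a large increase or a large decrease in risk.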
      
    - Dale Lehman on December 5, 2024 11:42 AM at 11:42 am said:
      
      Your potato example is interesting. Somehow, using 3 potatoes for method 2 strikes me as requiring assumptions both on the mechanism (probably reasonable for some problems) and assumptions regarding the irrelevance of any missing variables. You have a single potato with known physical characteristics, but you are missing so many other things about that potato that it isn’t clear to me that your predicted behavior will be better than using the average of a larger number of potatoes.
      
      I don’t know anything about drug dosing, so I can’t speak intelligently about how this might apply in that context. But if I think of other medical contexts – e.g., prostate cancer diagnosis/treatment – I think my questions apply. The current algorithms are based on highly aggregated (averaged) data from various RCTs. These fail to take into account a person’s particular physical fitness, their specific medical history (for some variables, not all are missing), and unique factors such as number of prior biopsies and characteristics of where/who conducted the biopsies. To be more specific, prior biopsies are certainly taken into account, but there is no adjustment for the experience of who conducted and examined the biopsy. I don’t think we have enough information to make good assumptions about these variables – for example, I might assume that the quality of the biopsy is a function of the experience level of the providers, but I don’t think we have a reliable method for quantifying this variable’s relevance. We could certainly build a model including this variable, but that would put enormous stress on the assumptions we make regarding its relevance. At some point, it seems better to ignore that variable and rely instead on broader averages (more potatoes) that don’t include it.
      
      Perhaps this is consistent with your second comment. Some situations default to using average effects when they should not – enough may be known about the physical mechanisms involved that we are ignoring useful information. In other situations, our knowledge may be inadequate for models to include a number of factors unique to the individuals in the RCT. Does that make any sense?
    - Dale Lehman on December 5, 2024 11:56 AM at 11:56 am said:
      
      Chris
      I couldn’t tell what paper you are referring to, but from your description I have a reaction. Certainly finding 2 cases out of 5194 subjects vs 1 case out of 2682 controls seems insufficient data to tell us much of anything. The additional information you suggest that the 2 cases had histories of depression would certainly be relevant to a decision faced by a clinician and their patient. But I don’t see it as much of a statistical feature – more of an exploratory idea for further research. The decision not to prescribe a drug under such circumstances (for a patient with a history of depression) just seems like common sense. I’d be suspicious of any attempt to couch it as a statistical argument, however. That would require strong assumptions regarding the irrelevance of any number of other features of those 2 cases, assumptions that would seem to me to be speculative.
    - Chris on December 5, 2024 12:40 PM at 12:40 pm said:
      
      Dale, it’s this paper referenced in a paper linked in the top post:
      
      Sander Greenland: Null misinterpretation in statistical testing and its impact on health risk assessment Preventive Medicine 53, 225-228 (2011)
      
      doi.org/10.1016/j.ypmed.2011.08.010
      
      My example of an additional factor that might be taken into account in clinical trials subsequent to the one where a possible link between gabapentin-suicidality and preexisting depression was found is possibly an extreme one (and incidentally doesn’t come from the paper). You’re probably right though that this wouldn’t be incorporated into a statistical model for subsequent trials – more likely in a subsequent trial individuals with preexisting depression would be excluded from the trial.
      
      Of course, that might be problematic for individuals with depression that could benefit from gabapentin treatment and that wouldn’t develop suicidal ideation – but then we come back to the problem of the effect of a drug treatment on each individual case, in contrast to some average effect. I guess in cases where there might be some problematic issues with an approved treatment a potentially susceptible individual (to suicidal ideation in this case) could explore the treatment with close monitoring by their doctor. I’m not disagreeing with you in fact!
    - somebody on December 5, 2024 1:46 PM at 1:46 pm said:
      
      Dale
      
      > Your potato example is interesting. Somehow, using 3 potatoes for method 2 strikes me as requiring assumptions both on the mechanism (probably reasonable for some problems) and assumptions regarding the irrelevance of any missing variables. You have a single potato with known physical characteristics, but you are missing so many other things about that potato that it isn’t clear to me that your predicted behavior will be better than using the average of a larger number of potatoes.
      
      It can and it can’t. The fact is that arms manufacturers, military, do this kind of thing all the time, and for the most part, mass and a reduced form air resistance constant is good enough to characterize the impulse. Potatoes are more complicated than bullets, so maybe a yukon gold would be fine but a lumpy rough russet wouldn’t. And if there’s significant wind, yeah that’s a problem. But if there’s wind, a simple average also won’t work. The point is that just doing the simplest single variable thing doesn’t actually make anything better, it just helps you not think about all the things that can go wrong.
      
      Similarly, people are already doing this in pharma. Drugs seep their way into the blood stream, either through the alveoli in the lungs or through the digestive system, and settle into a certain concentration for a time before dissipating. People’s body mass matters. These are already known things! They’re not even really new. We just decide to forget about them in certain parts of drug development, testing, and administration. I emphatically don’t think attending physicians should be using DifferentialEquations.jl before delivering shots, but I think it’s weird that nobody has done it to produce a simple chart or rule of thumb for them; they just give the same shot and then make a judgement call if the guy’s really massive and doesn’t respond.
    - Anoneuoid on December 5, 2024 4:39 PM at 4:39 pm said:
      
      > But if both you and Dale come to the ER with an acute occlusion MI, you’re both going to be treated with primary PCI, since you both have an occluded coronary artery. If both of you present to the ER with anaphylaxis, you’re both going to be treated with epinephrine. If both of you present with community-acquired pneumonia,
      
      These all have NNT~1. They were developed using small samples and observing that otherwise common (eg, 20-100%) outcomes dropped to single digits.
      
      The sample size is actually inversely correlated with the usefulness of the RCT.
    - Daniel Lakeland on December 5, 2024 6:55 PM at 6:55 pm said:
      
      ES: yes and no. A lot of stuff has a kind of “saturation” response. So you can just give a little more than needed and the outcome is basically the same in anyone. Antibiotics for example, or maybe epinephrine for anaphylaxis… Once you’ve given enough to start killing the bacteria strongly, giving 20% more doesn’t really change anything, or once you’ve shrunk the airway enough to allow air flow, adding some epinephrine doesn’t do too much to change the outcome (too much drug maybe gives a heart attack or something, but not in the 30% extra kind of range maybe).
      
      But that’s not so true for something like an ADHD med right? Too little, people get nearly zero benefit, too much they can’t sleep, and the current concentration in the blood matters, so the drug can be effective early in the day and not later in the day. And, people can metabolize the drug at different rates, possibly as much as a factor of 2 variation from a “slow” to a “fast” metabolizer. So some kind of “one size fits all” scheme is gonna suck, and what’s actually done is to start with a low dose, increase the dose, go above the recommended maximum because that’s what actually works, and then the DEA shuts down production of the drug and the patient has to switch, and then 6 months later the patient maybe has to switch back, and then they get a different Psychiatrist and this person is much more “by the book” and cuts the drug dosing to below the effective dose so that it’s not really doing anything, but they avoid getting their DEA license yanked… bla bla bla. Don’t get me started.
      
      Similarly for anti-psychotics, anti-anxiety, blood pressure management, blablabla. Lots of drugs where the dose gets adjusted for weeks or months before the “right” dose is found, and maybe the effective dose changes with diet, or weight loss/gain, or liver health, or age, or physical activity.
      
      As “somebody” says, information about biophysical aspects of the drug are already well known in drug *development* but are not properly accounted for in most drug *dosing* or *regulatory evaluation*.
    - Chris on December 6, 2024 5:39 AM at 5:39 am said:
      
      We’re really stuck with that sort of scenario for ADHD meds. Assessment of ADHD treatment efficacy is largely around patient perception (or perception of adults with respect to ADHD in their children). It’s not easy to identify a practical, objective biomarker of the sort that’s straightforward for blood glucose- or cholesterol-lowering therapies, or airway flow for treatments for asthma or cystic fibrosis, or treatments for blood pressure lowering, and many others (anticonvulsants; anti-infectives etc). The sort of “titration” you describe especially for adults seems unavoidable to me – the same applies to several conditions under the mental health umbrella, esp depression.
      
      There is the question of why ADHD diagnosis has first of all appeared from a very minor consideration up to the latter part of the 20th century (my perception – I haven’t found much data on prevalence timeline but that’s partly I think because the definition of ADD/ADHD has evolved quite a bit), and has then rather exploded in the last several years which has caused the recent supply shortage (speaking from a UK perspective). My cynical perception is that this is part of the “medicalisation” of what was previously not considered abnormal behaviour – Ivan Illich was already writing about this in the 1970’s (“the expropriation of health”). The pharma industry love treatments that don’t resolve issues but rather tie the patient into lifelong reliance on meds. Some of the things we induce young people to do (e.g. going to University and spending long hours studying) are particularly demanding for a large number of youngsters who in the past would leave school and take an apprenticeship, or work on their parents’ farm. It’s also a sad reality of the modern world that one of the few seemingly significant risk factors for ADHD is maternal acetaminophen exposure during pregnancy, an example of meds making people sick and necessitating another level of meds to address this.
      
      Anyway, in the UK there has been a very large increase in ADHD diagnoses and requests for meds especially since the COVID years (which is quite understandable), and that’s the cause of the shortage over here. Since the ADHD meds are stimulants they’re controlled substances and are manufactured under licence with government-assigned quotas. I haven’t found any evidence that “the DEA shuts down production of the drug” and in fact the DEA has increased quotas for stimulants in response to demand. Part of the problem may be that quotas for drug manufacturers have been assigned on a yearly basis so that manufacturers can’t easily respond to upsurges in demand. I believe that the DEA is converting to quarterly allocations. However I’m in the UK and you may well have a better informed view of the nature of ADHD medication shortages in the US.
    - Daniel Lakeland on December 6, 2024 10:02 AM at 10:02 am said:
      
      Chris,
      
      Yes, you need to get feedback from the user, but the process could also converge substantially quicker if the biological reality of the drug were used to guide it. I don’t know enough about the specific metabolism of the drugs, but something as simple as using height, weight, and age in estimating initial dose, and collecting urine samples for the first week, could likely lead to getting to effective doses in a week or two, not months or even never for some patients (because the effective doses are outside the Rx guidelines, which are poorly thought out).
      
      I’m not sure what the precise nature of the DEA’s restrictions is; all I know is that there are widespread problems where people can’t get their meds and it’s not because we physically can’t produce them fast enough, it’s because the DEA doesn’t allow it. This results in people switching between meds or going without, and leads to long periods of marginal or inadequate dosing as the “feedback titration” isn’t going to work if you can’t even stay on the same med for more than a month or two.
      
      I guess there are many many problems here.
      
      As an anarchist, I deny the right of the DEA to even exist, much less cause the kinds of harm it’s been responsible for over the last 50 years (most of which was drug war stuff)
    - Dale Lehman on December 6, 2024 10:13 AM at 10:13 am said:
      
      Regarding dosing – which I admit I know almost nothing about – the suggestions of more granular dosing guidelines and additional testing to adjust doses – sounds like a great idea but lacking appropriate cost-benefit analysis. Every additional step in initial dosing and subsequent adjustment adds costs to the process. There are certainly benefits – although quantifying these will be challenging and complex (it’s easy to think of cases where wrong doses can kill, but we need to know how likely these and lesser damages are). In addition, each stage of this process is subject to its own errors (lack of adherence (follow-up), inaccurate measurements, etc.). At some point, using an average dose may be superior from a cost-benefit standpoint.
      
      I’m not suggesting that this is always, or even often, the case. But the practical realities of designing and implementing more granular dosing should not be overlooked. I think all such “personalized” medicine involves similar issues, and likely means that only some of it makes sense. Current practices are easy to criticize, but may not be as bad as they are being portrayed.
    - Daniel Lakeland on December 6, 2024 10:26 AM at 10:26 am said:
      
      Dale, of course sometimes dose just doesn’t matter that much. But all the cases I’ve mentioned are ones where low quality dosing guidelines have had HUGE costs.
      
      COVID vaxxes were withheld from children for about a year or more because of dosing. The omicron wave swept through schools in 2022 in part because the Pfizer vax was only available in late 2021 and moderna not until even later.
      
      ADHD dosage issues result in major harm to those who are dependent on their meds to function at their jobs or care properly for their children or responsibly manage their money etc.
      
      I know an adult who bought 4 plane tickets where 2 would have sufficed if she had been able to focus on the task of deciding which tickets were most appropriate for her vacation. Just imagine that level of dysfunction every day of your life. The same person hasn’t filed tax returns in a decade. It’s just too hard.
    - Chris on December 6, 2024 1:02 PM at 1:02 pm said:
      
      The specific ADHD issue referred to isn’t a dosage issue, it’s a supply issue. With monitoring, an individual’s ADHD meds can be adjusted to individualise to that patient and so dosage isn’t so much of an issue. In any case I can’t see how dosage can be done without attention to the individual patient, most likely after starting them off with a dose from a population-level assessment in a clinical trial, and taking into account information in post-trial feedback from patients that might provide additional insight into specific considerations for a particular patient.
      
      My experience is somewhat second hand but I’ve had interactions with a number of university students with ADHD. Seven or 8 years ago I’m not sure I saw any students with ADHD, and then there were a few (especially since the covid period), who have seemed to manage quite well with their meds since these have been titrated to the requirements of each individual student. From especially the turn of the year things obviously became problematic because of supply issues. There’s been a recent upsurge in the numbers of ADHD diagnoses and (we’re told) issues around manufacture. In the US I believe this has been exacerbated by the fact that DEA-licensed quotas are issued on a yearly basis so that manufacturers can’t be easily responsive to changes in demand. Apparently that issue is being addressed by changing to quarterly quota licensing.
      
      We don’t live in a perfect world even if such a thing can be imagined, especially with the benefit of hindsight. Our students with problems over ADHD meds supply have managed this period with help from their friends and medical practitioners, and the uni has allowed relaxed coursework deadlines, deferring exams, etc. and the students have done OK – IMO there is a positive outcome for everyone relating to the appreciation of communality under difficult circumstances. And sure, we could imagine (in the US) abolishing the DEA and allowing an extension of the free-for-all that was responsible for the opioid epidemic, but not sure we want to go there – there are already enough horrors on the horizon!
      
      P.S. this recent Open Access paper is a possible starting point for drug dosing:
      
      T. M. Polasek, R. W. Peck (2024) Beyond population-level targets for drug concentrations: Precision dosing needs individual-level targets that include superior biomarkers of drug responses
      
      https://doi.org/10.1002/cpt.3197
    - Daniel Lakeland on December 6, 2024 1:48 PM at 1:48 pm said:
      
      Chris, the issue isn’t that current ADHD prescribing practices can’t eventually get to something useful for most patients… it’s the *rate of convergence*. Even leaving aside artificial supply issues.
      
      A population average starting dose from RCTs is already crude. Why a population average and not an estimate from a regression using age, weight, height, and sex? Mainly because no-one bothers. It’s not because those details don’t matter or that it wouldn’t increase convergence speed.
      
      And then, typical Rxing methodology is something like Rxing 30 days supply, and then after those 30 days asking if the meds were strong enough. I certainly don’t know anyone who did any kind of labs at all. For example, urine or blood draw. It’d be nice to avoid blood draw, but a quick look online shows that it’s eliminated through the urine, so collecting urine samples a few times a day for 3 days in a row plus some modeling would allow you to estimate a personalized blood concentration curve which could be utilized to give precise suggestions for updated dosing and for timing of multiple doses throughout the day (for example, students who need help concentrating for homework or daily activities of living in the evenings after a long day at university classes). Optimal dosing could be achieved within say 2 weeks, compared to people I know who have been trying for a year to get doctors to Rx effective doses. Admittedly this is because the overworked and under-educated doctors are simply afraid to Rx these medications because of all the political bullshit surrounding them.
      
      I’ll just add that the entire multi-decade long opioid crisis happened under the full watchful eye of the DEA and FDA, and that the US has the highest level of incarceration of all countries in the entire world and it’s not even close, about 5x what the UK has, there are more people incarcerated in the US than the entire population of Northern Ireland, and essentially all of it is the war on drugs / DEA related. https://ourworldindata.org/grapher/prison-population-rate
    - Joshua on December 6, 2024 4:49 PM at 4:49 pm said:
      
      Daniel –
      
      You say: Why a population average and not an estimate from a regression using age, weight, height, and sex?
      
      Do you have evidence to show that those variables are meaningful predictors for the outcomes of ADHD medication? If not, collecting those data would be a waste of considerable time and money.
      
      There are so many predictors for an ADHD diagnosis, and I would assume predictors for the efficacy of medication would be multi-factorial and extremely variable by individual. If I’m right, it seems it would make more sense to base dosing on averages and then adjust as you go along, not least because I’d imagine a whole host of external factors, as opposed to physiological metrics, would increase or decrease the magnitude of the condition you’re treating.
    - Daniel Lakeland on December 6, 2024 7:23 PM at 7:23 pm said:
      
      Joshua, I have a longer message on my laptop which I had to leave behind at home… So maybe I’ll post that later.
      
      But first, know that during drug development and trials age, height, weight, and sex are all already recorded so there’s literally zero cost to collecting them.
      
      Second, yes of course they matter. What’s well understood is that drug blood concentration matters. And it varies through time… As seen here for example page 2
      
      https://www.accessdata.fda.gov/drugsatfda_docs/label/2005/021802lbl.pdf
      
      Things that matter for these dynamics are the drug concentration in the stomach, the blood mass to body mass ratio (a quick Google suggests women are about 34% smaller), blood mass to kidney mass ratio, heart rate, and blood pressure to atmospheric ratio.
      
      All of these things vary by sex, age, height, and weight in rational ways.
      
      The data is already there. Pharmacokinetics is already studied. But it doesn’t wind up informing dosing in a personalized way.
      
      It’s possible the idiosyncrasies of an individual’s neurological response to the drug are vastly more important than the blood concentration, but it’s also possible that post marketing data collection together with pharmacokinetic models could inform the iterative empirical process so that patients arrive at effective dosing in a more rational and quick way.
      
      Even if ADHD isn’t the poster child for this, antidepressants, opioid treatments, blood pressure meds, etc. all provide possible opportunities for better dosing and better estimates of synergistic (multi-drug) effects.
Pavlos Msaouel on December 5, 2024 12:13 AM at 12:13 am said:

Ah, reminds me of the discussions we used to have with Borje Andersson during my stem cell transplant rotation a lifetime ago. Fun times, see example paper, which we were actually using in these patients: https://pmc.ncbi.nlm.nih.gov/articles/PMC6714050/

Chris on December 5, 2024 11:38 AM at 11:38 am said:

That summary is pretty much the case. I once did write a paper for the journal Epidemiology called “P-values and statistical practice,” and in that paper I gave an example of a p-value that worked (further background is here), but at this point my main interest in p-value is that other people use p-values so it behooves me to understand what they’re doing.

couple of things: the links in the italicised words/phrases in the quote copied from the top article go to the same place.

secondly, excellent use of “behooves”! I’m picturing Andrew in his doublet and ruff as he writes that :)

- Joshua on December 5, 2024 7:16 PM at 7:16 pm said:
  
  Interesting. I pictured an ascot and one of those tweed jackets with elbow patches and a pipe.
  
Anoneuoid on December 7, 2024 1:17 PM at 1:17 pm said:

I would look closer at this averaging:

Second, yes of course they matter. What’s well understood is that drug blood concentration matters. And it varies through time… As seen here for example page 2
https://www.accessdata.fda.gov/drugsatfda_docs/label/2005/021802lbl.pdf

There are no error bars, or methods (I’d assume it’s done in uber-healthy adult humans, but is that true…).

Did anyone ever really check the individual curves to make sure it’s not that half the people are quick absorbers and half are slow absorbers? I get that their theory is that half is released in the stomach (fast release) and half in the small intestine (slow release), but there should be individual curves somewhere (which I couldn’t find).
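
To make the worry concrete, here is a minimal sketch with made-up numbers, not anything taken from the label or from real pharmacokinetic data. It assumes a toy single-peak concentration curve and two hypothetical kinds of people, one absorbing immediately and one after a 5-hour lag. The function names, the curve shape, and the lag values are all assumptions chosen only for illustration.

```python
import math

def concentration(t, lag, tau=1.0):
    """Toy single-peak curve: zero before `lag`, peaking at lag + tau (assumed shape)."""
    s = t - lag
    return s * math.exp(-s / tau) if s > 0 else 0.0

def count_peaks(ys):
    """Count interior local maxima of a sampled curve."""
    return sum(
        1
        for i in range(1, len(ys) - 1)
        if ys[i] > ys[i - 1] and ys[i] >= ys[i + 1]
    )

# Hours 0 to 12 in steps of 0.05 (hypothetical sampling grid).
times = [i * 0.05 for i in range(241)]

fast = [concentration(t, lag=0.0) for t in times]  # hypothetical quick absorber
slow = [concentration(t, lag=5.0) for t in times]  # hypothetical slow absorber

# Population mean when half the people are fast and half are slow.
mean_curve = [(f + s) / 2 for f, s in zip(fast, slow)]

print("peaks in fast curve:", count_peaks(fast))
print("peaks in slow curve:", count_peaks(slow))
print("peaks in mean curve:", count_peaks(mean_curve))
```

In this toy setup each individual curve has a single peak, while the averaged curve has two, so a two-peaked mean curve by itself cannot distinguish "everyone has two peaks" from "two kinds of people with one peak each". Only the individual curves can.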

Then the trials results are all reported in this form (with no effect size):

There was a statistically significant treatment effect in favor of Focalin XR.

If that is really supposed to be the basis for treating people with this pill, we are dealing with some very confused or cynical people. So I don’t trust them to have reported on the pharmacokinetics adequately either.

- Anoneuoid on December 8, 2024 9:41 AM at 9:41 am said:
  
  @Daniel:
  
  Group-learning curves have often been published. Fig. 1 is an example. In such a curve, the behavioral measures have been averaged across subjects; and often, across blocks of trials, or even whole sessions, as in Fig. 1. It is often assumed, either explicitly or implicitly, that the properties of the group curve are those of the individual curves. It has, however, long been recognized that averaging across subjects might give a misleading picture of what occurs in individual subjects (1–8). If the progress of conditioning in each individual subject is step-like, but the step occurs early in some subjects and later in others, averaging across subjects will suggest a gradual increase. Averaging across trials will also make rapid transitions appear to be more gradual.
  
  https://pmc.ncbi.nlm.nih.gov/articles/PMC516535/
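  
  The quoted point can be written out in a couple of lines. As a sketch under idealized assumptions (each subject's curve is a pure 0/1 step, and the symbols $y_i$, $s_i$, and $n$ are my own notation, not the paper's), suppose subject $i$ switches from 0 to 1 at trial $s_i$:
  
  ```latex
  y_i(t) = \mathbf{1}[\, t \ge s_i \,], \qquad i = 1, \dots, n
  
  \bar{y}(t) = \frac{1}{n} \sum_{i=1}^{n} y_i(t)
             = \frac{1}{n} \, \#\{\, i : s_i \le t \,\}
             = \hat{F}_n(t)
  ```
  
  So the group curve is just the empirical distribution function $\hat{F}_n$ of the step times. If the $s_i$ are spread out, $\bar{y}(t)$ rises gradually, in increments of $1/n$, even though every individual curve jumps all at once. The shape of the average then describes the distribution of step times across subjects, not the learning dynamics of any one subject.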
  
  These kinds of averaging artifacts are known to persist for a century after getting pointed out. In fact, “explaining” artifactual learning curves remains the standard.
  
  When it comes to expensive and proprietary data like this, there are even fewer eyes with a chance to inspect the individual dynamics and outliers that get obscured by averaging.
  
  If you can find it, would really like to see what the individual curves look like.
  

Statistical Modeling, Causal Inference, and Social Science


78 thoughts on “Understanding p-values: Different interpretations can be thought of not as different “philosophies” but as different forms of averaging.”
