Holes in Bayesian Philosophy: My talk for the philosophy of statistics conference this Wed.

4pm Wed 7 Aug 2019 at Virginia Tech (via videolink):

Holes in Bayesian Philosophy

Andrew Gelman, Department of Statistics and Department of Political Science, Columbia University

Every philosophy has holes, and it is the responsibility of proponents of a philosophy to point out these problems. Here are a few holes in Bayesian data analysis: (1) flat priors immediately lead to terrible inferences about things we care about, (2) subjective priors are incoherent, (3) Bayes factors don’t work, (4) for the usual Cantorian reasons we need to check our models, but this destroys the coherence of Bayesian inference. Some of the problems of Bayesian statistics arise from people trying to do things they shouldn’t be trying to do, but other holes are not so easily patched.

1. Anoneuoid says:

I really don’t see a problem with flat priors, they are just a first approximation. We don’t expect the likelihood to be perfect either…

Also, I’d argue that:

1) There has been near-universal misinterpretation of a confidence interval as an HDI (credible interval)
2) There is often numerical similarity between a confidence interval with x% coverage and an HDI containing x% of the posterior when using a flat prior

Taken together I conclude that almost everyone has been effectively drawing conclusions from models using flat priors for the last ~80 years.
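The numerical similarity in point 2 can be checked in the simplest textbook case. The sketch below is only an illustration under assumptions that are not in the comment: a normal mean with known standard deviation, simulated data, and a grid approximation to the flat-prior posterior (the function names are made up for this example). In this symmetric case the central posterior interval is also the HDI.

```python
import math
import random
from statistics import NormalDist, mean


def confidence_interval(data, sigma, level=0.95):
    """Classical z-interval for a normal mean when sigma is known."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sigma / math.sqrt(len(data))
    centre = mean(data)
    return centre - half_width, centre + half_width


def flat_prior_interval(data, sigma, level=0.95, grid_size=20001):
    """Central posterior interval for the mean under a flat prior (grid approximation)."""
    centre = mean(data)
    se = sigma / math.sqrt(len(data))
    lo, hi = centre - 8 * se, centre + 8 * se
    step = (hi - lo) / (grid_size - 1)
    grid = [lo + i * step for i in range(grid_size)]
    # Flat prior: the log-posterior equals the log-likelihood up to a constant.
    log_post = [sum(-(x - mu) ** 2 for x in data) / (2 * sigma ** 2) for mu in grid]
    peak = max(log_post)
    weights = [math.exp(lp - peak) for lp in log_post]
    total = sum(weights)
    tail = (1 - level) / 2
    lower = upper = None
    running = 0.0
    for mu, w in zip(grid, weights):
        running += w / total
        if lower is None and running >= tail:
            lower = mu
        if upper is None and running >= 1 - tail:
            upper = mu
    return lower, upper


if __name__ == "__main__":
    random.seed(1)
    sample = [random.gauss(10.0, 2.0) for _ in range(30)]  # simulated, not real data
    print("confidence interval:", confidence_interval(sample, sigma=2.0))
    print("flat-prior interval:", flat_prior_interval(sample, sigma=2.0))
```

The two intervals agree up to the grid resolution here. That agreement is a property of this particular model and is not guaranteed for other likelihoods or parameterisations.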

• Once you start moving away from using priors that actually represent your prior knowledge (which they should do) then you are going to run into trouble.

A prior that is very spiky for one parameterisation of the model can be a flat prior by parameterising the model in a different way.

Therefore, we naturally will try to take model parameterisation out of the picture.

However, the question of what is the right parameterisation of the model cannot be answered sensibly.

Hence, some may begin to think about model dependent priors (e.g. Jeffreys prior) and then find that they are probably opening up an even bigger can of worms (e.g. they realise that they are doing something that is not truly Bayesian).
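The reparameterisation point in this comment can be made concrete with a standard change of variables. The sketch below assumes, purely for illustration, a probability p with a flat Uniform(0, 1) prior and looks at the density this implies for theta = logit(p); the function names are hypothetical.

```python
import math
import random


def density_on_logit_scale(theta):
    """Density of theta = logit(p) implied by p ~ Uniform(0, 1).

    Change of variables: f(theta) = 1 * |dp/dtheta| = p * (1 - p).
    """
    p = 1.0 / (1.0 + math.exp(-theta))
    return p * (1.0 - p)


def share_near_zero(n_draws, half_width, rng):
    """Monte Carlo share of flat-on-p draws with |logit(p)| < half_width."""
    hits = 0
    for _ in range(n_draws):
        p = rng.uniform(1e-12, 1.0 - 1e-12)  # stay off 0 and 1 so the log is finite
        if abs(math.log(p / (1.0 - p))) < half_width:
            hits += 1
    return hits / n_draws


if __name__ == "__main__":
    for theta in (0.0, 2.0, 5.0):
        print(theta, density_on_logit_scale(theta))
    print(share_near_zero(100_000, 2.0, random.Random(0)))
```

A prior that is flat in p is peaked around zero on the logit scale, so "flat" is a statement about a particular parameterisation and not about the state of knowledge itself.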

• Anoneuoid says:

There will always be edge-cases where an approximation doesn’t work, that doesn’t mean it is problematic in general.

I think there is probably some difference in focus between mathematics and science here. In science you always have other approximations, etc anyway so you don’t expect anything to be perfect. In mathematics you expect no flaws at all.

• Bimal Jain MD says:

From a practicing physician’s viewpoint, the Bayesian method does not work well for statistical inference during diagnosis in practice. The problem arises because practically every given disease occurs with varying prior probabilities, ranging from the very low to the very high in different patients with our goal in diagnosis being an accurate inference in every patient regardless of prior probability.
In the Bayesian method, a very low prior probability represents very strong prior evidence for, or a very strong belief in, absence of disease, which is likely to lead to this disease being ruled out without testing. And even if a test is performed, the high likelihood ratio of a highly ‘diagnostic’ test is combined with this very low prior probability to generate a posterior probability close to 50 percent, which leads to an erroneous inference of the disease being indeterminate.
In practice, a suspected disease is formulated as a diagnostic hypothesis without any prior evidence for or against it, regardless of its prior probability. The disease is then inferred if a test leads to a result known to have a high frequency of leading to an accurate inference.
The method employed is closely similar or identical to the frequentist method of statistical inference.
The surprising thing is the Bayesian method has been prescribed as the normatively correct method in diagnosis, but when we look at published diagnostic exercises in real patients, we find it has not been employed in any one of them.

• Carlos Ungil says:

Do you mean that when a test is performed the result should be interpreted, and a diagnosis produced, completely disregarding all the other information available (symptoms, gender, age, medical history, risk factors, familial antecedents, results from other tests)? Hopefully not! Unless the result of the test is so strong that it makes all the other information irrelevant, of course.

• A patient comes into a doctor’s office and complains of abdominal pain, the doctor palpates and suspects liver disease, the doctor should:

A) Order every test available that might be relevant to people with any sort of abdominal pain and diagnose whatever disease comes back with a positive test, provided that it offers a high frequency of correct diagnosis, regardless and independent of the results obtained from other tests (so that most patients will be diagnosed with anywhere from 0 to, say, 10 or 20 different simultaneous ailments after taking 1575 tests).

B) Order a liver test and 2 other tests for less likely issues but still within the realm of plausible because the prior information the doctor has on the basis of history and palpation alone suggests that one of these 3 conditions is almost certainly the one of interest.

If Bimal’s assertion were correct, then A would be what doctors do, and should do. If on the other hand doctors go based on their state of information about the patient and order only things that test for conditions having some nontrivial prior probability of occurrence, then they should be found doing B.

I wonder which one doctors actually do?

They should do B but there is a lot of A going on (especially in the US).
Indiscriminate multiple testing is a particular problem when the tests have low specificity. Abnormal liver function tests for instance are very common and have a list of causes as long as your arm (“the liver is a stupid organ – it can only grunt”). It’s not such a problem with highly specific tests – a pregnancy test is hard to mis-interpret.

• jim says:

“I wonder which one doctors actually do?”

Like any worker anywhere in any job they *typically* do what is the least work for them with the lowest potential for adverse consequences: A. Lowest chance of incorrect diagnosis, least effort, most defensible to insurance.

• There is definitely a certain amount of over-testing, but if you truly used a uniform prior across all possible diseases involving chest pain you would literally order hundreds or thousands of tests for every patient, and that isn’t what they do. Yes, they might order 8 tests when probably 2 or 3 would do, but it’s still rather strong prior knowledge in use.

• jim says:

I dunno about that. My friend got a cat from the humane society. She loved the cat but it pooped everywhere. 4-5 years with several vet visits and tests with no solution. Finally a new vet physically examined it by putting his finger up the cat’s butt. It had a broken pelvis.

Ok, this is a cat. But I’ve noticed similar behavior in human docs. Modern health care professionals don’t want to do gross things or use older methods. They want to rely on simple easy things.

• Andrew says:

Jim:

+2 for commenting on two of our favorite topics here: cats and poop.

• Jim, actually I think your anecdote proves my point. The vets had strong priors and therefore weren’t looking for the right causes because the real cause was truncated to zero probability until the new vet realized that all the usual suspects were exhausted and had to formulate a new set of prior causes to examine that excluded the things that had been tried already and therefore had to include some less likely things.

I assume that an undiagnosed nonunion fracture (one that doesn’t heal naturally and remains fractured for long periods of time) in a cat is probably rare.

“It is an old maxim of mine that when you have excluded the impossible, whatever remains, however improbable, must be the truth.” — Sherlock Bayes

• jim says:

Andrew: wow, yeah, the cat theme! I didn’t know there was a poop theme but now that I do I feel more at home here!

• Martha (Smith) says:

Jim said,
“But I’ve noticed similar behavior in human docs. Modern health care professionals don’t want to do gross things or use older methods. They want to rely on simple easy things.”

I’d say they may also be biased in favor of things that they are familiar with, and especially toward things that are easily “fixed”. Case in point: When I had severe abdominal pain after something seemed to snap in an exercise class, my GP gave me ulcer medication, because the pain sounded like ulcer pain to him. That didn’t help and made matters worse, so he sent me for an upper GI to check for a hernia, saying “because we know how to fix that”. It took a long time for me to drink enough of the “milk shake” for the upper GI (the main symptom of the injury was that I wasn’t able to get much down to my stomach), which irritated the radiologist, and the upper GI showed nothing wrong. So I suggested to the GP that a physical therapist might be able to help. He said, “That’s a good idea!” and gave me a referral. The physical therapist did a thorough exam, having me get in different positions and move different ways. She then said, “It appears that you’ve pulled your rectus abdominis”, explained that a pulled muscle contracts severely, and that a pulled rectus presses on the stomach and would cause my symptoms. She then put me on a machine that allowed me to give gentle exercise to loosen up my rectus a little. That night, I was able to eat about half a normal dinner, which was more than I had been able to get down in one sitting since the injury.

• jim says:

“I’d say they may also be biased in favor of things that they are familiar with”

Yes! Much agree!

• Andrew says:

Bimal:

It sounds to me that “the Bayesian method” you describe is not actual Bayesian inference, but rather some wrong calculations that someone has mistakenly labeled as Bayesian inference.

I’m sure there are a lot of bad analyses being done that are labeled as Bayes but aren’t. This is a completely different problem than what I discuss in my talk, which is analyses that are actually Bayesian, but which use bad models and as a result give bad results.

• I agree with your assessment. There is nothing Bayesian about dogmatically plugging the frequency in a particular unrelated, or loosely related population in as a prior, and then complaining that the posterior doesn’t correspond to a reasonable inference.

For example, take Bimal’s original example from a previous post of an ER patient with a heart attack. Suppose that, among all patients on whom doctors actually order EKGs, 95% of those with a positive finding of EKG abnormality have actually had a heart attack.

Now suppose Bimal plugs in his 7% of people coming into the ER with chest pain have heart attack, and together with the data on the calibration of the EKG results, he calculates 50% posterior probability of heart attack. Then in fact he has good reason to dislike this method, because 95% of the people who had the EKG ordered and it came back positive actually had heart attack, but the calculation suggests 50%… what went wrong?

What went wrong was that doctors have additional information and they don’t order EKG tests on people who obviously have broken ribs, were punched in the chest, have obvious gallbladder problems, have obvious pneumonia or severe asthma, a pleural infection, etc etc. So the information that should be used in terms of prior is the *prior probability that the doctor assigns to the patient given everything the doctor knows*. Just knowing that the doctor thinks an EKG is a good idea after exam should increase our notion of prior probability of heart attack above the basal rate for all people showing up in the ER with just “chest pain”. If that basal rate is 7%, then the rate for “people who we order an EKG on” is likely 15% or 30% or 40%, something similar.

I would find it hard to believe that 93% of the time a doctor suspects heart attack and orders an EKG for a patient in the ER with chest pain the patient will be having something other than a heart attack.

The relevant question to ask is “after taking a history and a brief physical exam, given that a trained ER doctor thinks an EKG is needed, and given the type and severity and timing of reported symptoms, what is the probability the patient has a heart attack?” not “what fraction of all people showing up in the ER who have chest pain included on their chart have heart attack”.
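The arithmetic behind this example can be sketched with the odds form of Bayes’ rule. One assumption is made loudly here: the likelihood ratio of a positive EKG is not given in the thread, so it is backed out from the quoted 7% prior and 50% posterior. The priors of 15%, 30% and 40% are the ones floated above, and the function names are hypothetical.

```python
def posterior_probability(prior, likelihood_ratio):
    """Posterior probability of disease after a positive test (odds form of Bayes' rule)."""
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


def implied_likelihood_ratio(prior, posterior):
    """Likelihood ratio that would move `prior` to `posterior`."""
    return (posterior / (1.0 - posterior)) / (prior / (1.0 - prior))


if __name__ == "__main__":
    # Assumption: the test's likelihood ratio is whatever turns 7% into 50%.
    lr = implied_likelihood_ratio(prior=0.07, posterior=0.50)
    print("implied likelihood ratio:", lr)
    for prior in (0.07, 0.15, 0.30, 0.40):
        print(prior, posterior_probability(prior, lr))
```

With that implied ratio of roughly 13, the same positive test gives a posterior of roughly 70%, 85% and 90% for priors of 15%, 30% and 40%, which shows how strongly the answer depends on conditioning the prior on what the doctor already knows.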

• Carlos Ungil says:

> I would find it hard to believe that 93% of the time a doctor suspects heart attack

That 7% that apparently you cannot come to terms with is for young women not at risk and not presenting typical angina pain [*]. So there would be no particular reason to suspect a heart attack more than anything else, and it is to let the doctor know that fact that those statistics are compiled: to be of use as a baseline in those cases where the origin of the pain is unknown. Doctors can then build on that according to additional information and will just ignore it when patients come into the examination room with an arrow traversing their chest, or suffering an asthma attack, or with any other obvious indication of what’s wrong. Because the doctors writing and applying the guidelines are not complete morons.

[*] And even the typical symptoms are not a very good indicator: “Investigators at a single center in New York City conducted a retrospective study involving 2525 patients with no previous history of myocardial infarction or coronary revascularization who were evaluated for ACS in an emergency department–based chest pain unit. Typical angina was defined as “the presence of substernal chest pain or discomfort that was provoked by exertion or emotional stress and was relieved by rest and/or nitroglycerin.” All patients underwent provocative stress testing after serial biomarkers were obtained.

Presenting symptoms did not vary significantly by sex, age, or history of diabetes. Ischemia was induced by stress testing in 14% of 231 patients with typical angina, 11% of 2140 patients with atypical chest pain, and 16% of 153 patients with no chest pain at presentation. Thus, patients with typical angina were not significantly more likely than those with no or atypical chest pain to have inducible myocardial ischemia.”

https://www.jwatch.org/jc201007070000002/2010/07/07/typical-angina-vs-atypical-chest-pain

• I’m fine with “there is a population of people not presenting typical signs of heart attack and 7% of them actually are having a heart attack”. In that case it is correct that after the positive test only about 50% will actually have a heart attack, and Bimal’s claim that it’s better to use some frequentist likelihood ratio test will result in overdiagnosis (which might be desirable from a utility theory perspective).

In any case, the point about doctors not being stupid is another way of saying they will use prior information, like the arrow through the chest or the signs of pneumonia.

To claim that what doctors do is to always use frequentist methods with no prior info is to completely misunderstand what priors are.

• Bimal Jain MD says:

Andrew
It is a fact that physicians are diagnosing and treating diseases fairly well on a daily basis. My interest is in knowing the method of statistical inference they employ during diagnosis in practice. I have looked at published diagnostic exercises in real patients and my assessment is that the method employed in them is not Bayesian.
It would be immensely helpful if you, Andrew or someone else were to look at some of these diagnostic exercises in real patients, such as clinical problem solving exercises or clinicopathologic conferences ( CPCs ) published regularly in the New England Journal of Medicine and tell us what the method employed in them is.
My perspective about this issue is different from practically everyone else in this blog which I find is mainly theoretical.
My perspective is that of a user of a statistical method on the ground. If someone can show me that the method in published exercises is Bayesian, I shall agree. And if he or she finds the method is not Bayesian, but a Bayesian method would be better, I would like to see how.

• Anoneuoid says:

Can you link to an example of the reasoning you refer to?

• Andrew says:

Bimal:

You missed the point of my comment. I was not addressing, either positively or negatively, your description of what doctors actually do. My point was that the method you describe as “Bayesian” is, from the evidence provided in your description, not actually Bayesian. This happens all the time, that people have a crappy statistical procedure that they label as being from some popular method. The general problem of people taking crappy methods and giving them inaccurate labels is important; it’s just not the subject of the talk I’ll be giving tomorrow.

• Huw Llewelyn says:

Bimal:

You mention CPCs published in the NEJM. In the references of Chapter 1 that follows from the link in my response below at 5.05am, I cite a paper analysing in detail the reasoning used in the CPC: Eddy DM, Clanton CH (1982) The art of diagnosis: solving the clinico-pathological conference. N Engl J Med 306, 1263-8.

They use the term ‘pivot’ whereas I have always used the term ‘diagnostic lead’, and they use ‘pruner’ whereas I have used ‘eliminator’ (or ‘probabilistic eliminator’).

This reference and the link to Chapter 2 below at 5.05am may interest you too, Anoneuoid.

• Anoneuoid says:

> From a practicing physician’s viewpoint, the Bayesian method does not work well for statistical inference during diagnosis in practice.

> […]

> In practice, a suspected disease is formulated as a diagnostic hypothesis without any prior evidence for or against it, regardless of its prior probability. The disease is then inferred if a test leads to a result known to have a high frequency of leading to an accurate inference.

> The method employed is closely similar or identical to the frequentist method of statistical inference.

As I said, the frequentist method often gives the same result as the bayesian method with a uniform prior. Ie, it is an approximation of a special case of the bayesian method. In cases where this is not true, it seems it is always the frequentist method that returns nonsense.

Also, I highly doubt you devise a diagnostic hypothesis without coming up with some type of idea like “the probability the person has disease x”. If that is what you do, you are using Bayesian reasoning.

> In the Bayesian method, a very low prior probability represents very strong prior evidence for, or a very strong belief in, absence of disease, which is likely to lead to this disease being ruled out without testing. And even if a test is performed, the high likelihood ratio of a highly ‘diagnostic’ test is combined with this very low prior probability to generate a posterior probability close to 50 percent, which leads to an erroneous inference of the disease being indeterminate.

I have no idea what you are describing but it sounds like a nonsense method. Also, I have had many first/second-hand experiences where second and third independent medical opinions differ greatly from the first opinion. I would even say that is expected.

So it wouldn’t surprise me if indeterminate diagnoses should be far, far more common than we currently see. Likely if you don’t know what is going on the best course of action is nothing or placebo, harder to make money that way though…

• Martha (Smith) says:

My guess is that when Bimal, Carlos, and Daniel say “test”, they are thinking of a lab test. It would be interesting to compare how physical therapists and podiatrists diagnose with how physicians diagnose. They don’t use lab tests, but (in my experience) they do perform various other types of “tests” (Does it hurt when I do this? this? this?) and use their detailed knowledge of physiology to make their decisions. My experience is that their methods of diagnosis (and treatment, which may involve things like teaching the patient to use their body in a way different from what they are accustomed to) are more effective than physicians’ judgments based on much more cursory examinations, when the ailment seems to involve a muscle, nerve, cartilage, etc.

• AllanC says:

It seems to me that you are taking disease base rates in the general population to be synonymous with one’s prior for a particular patient. They could be the same but are likely very different.

Also, it’s a bit humorous to suggest physicians conjure up hypotheses about what diseases their patients might have without any prior knowledge. If they have no prior knowledge, where do the hypotheses come from, I wonder?

• somebody says:

The problem here (I’m inferring based on your previous comments in other threads) is not that the bayesian method is “incorrect.” If you’re even discussing the “probability that a patient has a disease”, the only method is the Bayesian method which is tautologically correct.

The problem here is the use of a data based prior like “prior probability that a person has a disease = probability that anyone in the population has a disease,” without updating that prior conditionally before the test based on info like “person walked into the clinic with symptoms.”

In practice, physicians are ALL using an informal bayesian style of reasoning (important to not conflate that with formal bayesian inference) that takes into account the information gleaned with their human eyes from the patient in concert with test results.

> a suspected disease is formulated as a diagnostic hypothesis without any prior evidence for or against it, regardless of its prior probability. The disease is then inferred if a test leads to a result known to have a high frequency of leading to an accurate inference.

This is just wrong. The test, under a frequentist framework, cannot provide a “frequency of leading to an accurate inference.” It has frequency of a true positive and a false negative, assuming the disease is there, and the frequency of a true negative and a false positive, assuming that it is not. The frequency of an “accurate inference” depends on the prior probabilities. If you maintained the exact same thresholds on the tests, but sampled people truly randomly out in the world instead of people who walked into a hospital with symptoms, you would see the “frequency of accurate inference” change. The general “accuracy” is not stable under a change in priors; rather, hospital procedure has been designed with the priors of a hospital setting in mind, and those priors are relatively stable.
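The dependence of the “frequency of accurate inference” on who is being tested can be sketched directly. The sensitivity, specificity and prevalence values below are made-up illustrations, not figures from any study, and the function name is hypothetical.

```python
def predictive_values(prevalence, sensitivity, specificity):
    """Positive and negative predictive values for a test applied at a given prevalence."""
    true_pos = prevalence * sensitivity
    false_neg = prevalence * (1.0 - sensitivity)
    true_neg = (1.0 - prevalence) * specificity
    false_pos = (1.0 - prevalence) * (1.0 - specificity)
    ppv = true_pos / (true_pos + false_pos)  # P(disease | positive test)
    npv = true_neg / (true_neg + false_neg)  # P(no disease | negative test)
    return ppv, npv


if __name__ == "__main__":
    # Same test thresholds throughout; only the tested population changes.
    for prevalence in (0.001, 0.01, 0.10, 0.30):
        ppv, npv = predictive_values(prevalence, sensitivity=0.90, specificity=0.90)
        print(f"prevalence={prevalence:.3f}  PPV={ppv:.3f}  NPV={npv:.3f}")
```

The test’s operating characteristics are held fixed here, yet the share of positive results that are correct moves with prevalence, which is the sense in which accuracy is not stable under a change in priors.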

> And even if a test is performed, the high likelihood ratio of a highly ‘diagnostic’ test is combined with this very low prior probability to generate a posterior probability close to 50 percent which leads to an erroneous inference of the disease being indeterminate.

As Professor Gelman would tell you, an inference that fails obvious posterior predictive checks means that your prior is not incorporating some prior information that you do have.

• Carlos Ungil says:

Not that you’re suggesting that, but I think it’s important to stress that a posterior probability below 100% for something that happens to be true (or above 0% for something that happens to be false) is not necessarily “an inference that fails obvious posterior predictive checks”.

• Perhaps one direct way of seeing that would be to ask what is the meaning of a particular point in the parameter space that the prior defines a given probability to?

The pragmatic meaning would simply be the data that would be repeatedly generated given the parameter values at that point.

To answer this requires that a particular data generating model be chosen.

Choose a different data generating model and the particular point in the parameter space means something different.

• This is a great way of putting things Keith, the entire purpose of a parameter *is to describe a data generating process* so it shouldn’t be a surprise that if you change your data generating process you should possibly change your state of information about the parameter space.

• Sorry but not in agreement here.

In standard inference, the purpose of a parameter is to describe a population.

What you currently know about the population (as encapsulated by the prior over the parameters) should not be affected by how you may (or may not) intend to go about collecting more information about the population, i.e. affected by the choice of the sampling model.

This is inconvenient I know.

• Martha (Smith) says:

Daniel said, “the entire purpose of a parameter *is to describe a data generating process*”

TNS replied: “Sorry but not in agreement here.

In standard inference, the purpose of a parameter is to describe a population.

What you currently know about the population (as encapsulated by the prior over the parameters) should not be affected by how you may (or may not) intend to go about collecting more information about the population, i.e. affected by the choice of the sampling model.”

Methinks there is a need for some clarification, so I’ll try — and whoever wants to agree, disagree, elaborate, or improve can chime in:

As I understand Daniel’s statement (based on previous things he has said), he is describing a random variable as a random generating process. Since a random variable refers to a population (not to a sample), a parameter for the random variable does refer to a population, not to a sample. (However, I can see how Daniel’s phrase “data generating process” might lead one to believe that he is talking about a sample, so that might be the source of the confusion.)

• > In standard inference, the purpose of a parameter is to describe a population.

Perhaps in Frequentist inference.

I drop ball bearings from selected and precisely measured heights h in an evacuated tube, and I measure the time it takes to hit the bottom of the tube. The predicted fall time is sqrt(2*g*h)+epsilon with epsilon determined by measurement error, and inference on g is my goal.

what population is g associated with?

• sorry I was too quick, sqrt(2*h/g) is the right formula.
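A minimal sketch of that ball-bearing experiment, using the corrected formula sqrt(2*h/g). Everything numerical is an assumption for illustration: the drop heights, the timing error, the simulated data, and a flat prior for g on a grid from 5 to 15 m/s^2.

```python
import math
import random


def fall_time(height, g):
    """Noise-free fall time from rest through `height` metres: sqrt(2*h/g)."""
    return math.sqrt(2.0 * height / g)


def simulate_drops(heights, g_true, sigma, rng):
    """Measured fall times: the physical prediction plus Gaussian timing error."""
    return [fall_time(h, g_true) + rng.gauss(0.0, sigma) for h in heights]


def grid_posterior_for_g(heights, times, sigma, g_grid):
    """Normalised posterior weights over g_grid (flat prior on the grid)."""
    log_post = [
        sum(-(t - fall_time(h, g)) ** 2 for h, t in zip(heights, times)) / (2.0 * sigma ** 2)
        for g in g_grid
    ]
    peak = max(log_post)
    weights = [math.exp(lp - peak) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    rng = random.Random(0)
    heights = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0] * 5           # metres, hypothetical design
    times = simulate_drops(heights, g_true=9.81, sigma=0.005, rng=rng)
    g_grid = [5.0 + 0.002 * i for i in range(5001)]          # 5 to 15 m/s^2
    post = grid_posterior_for_g(heights, times, 0.005, g_grid)
    print("posterior mean of g:", sum(g * w for g, w in zip(g_grid, post)))
```

Nothing in this setup refers to a population being sampled: g enters only through the model for how each measured time is generated.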

• Carlos Ungil says:

The original statement has the qualifier “often” so it covers the cases where it is obviously false (if one of the parameters in the model is the age of the subject; the prior reflects knowledge external to the model) and where it is obviously true (if there is a parameter kappa fine-tuning something deep inside the model).

And of course if “understanding the prior” means understanding how the posterior depends on the prior that will always require looking at the interaction of prior, model and data.

(One could also adopt the “nothing exists outside the model” view, where any knowledge can only be expressed and understood within a model, making the statement even less interesting.)

• somebody says:

I agree that the prior is meant to “represent” your prior knowledge. However, the word “represent” is doing a lot of work there. Your prior knowledge cannot be perfectly captured by some parsimonious probability distribution you pulled from a table in a textbook, and conversely, what knowledge is actually represented by a given probability distribution cannot be perfectly understood just by looking at its one or two dimensional pdf. The translation of your prior knowledge into some prior distribution in your model is an approximate, lossy, and iterative process, and applying the prior to some likelihood is a part of that process.

Consider a model of the effect of income, x, on height, y. It’s a linear model and residuals follow some likelihood function. I apply what I think is my prior knowledge on the slope, b, as a normal distribution with some positive mean. I find myself predicting that some individuals in my dataset should be -0.5 feet tall.

I obviously have to rework my model, prior distribution included. Is this inconsistent with choosing a prior to represent my prior knowledge? But obviously, I knew beforehand that people cannot be a negative number of feet tall.

The prior distribution is part of the model which captures some regularizing behaviors on the parameters that I would expect based on my prior knowledge, but neither it, nor the model itself, represents the total set of my knowledge.
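The income-and-height example can be sketched as a prior predictive check: simulate heights from the prior alone and count how many are impossible. All numbers are hypothetical (income in units of $10,000, heights in feet, normal priors and noise), chosen only to show the mechanism.

```python
import random


def prior_predictive_heights(incomes, n_draws, intercept, slope, noise_sd, rng):
    """Simulate heights from the prior alone; `intercept` and `slope` are (mean, sd) pairs."""
    sims = []
    for _ in range(n_draws):
        a = rng.gauss(*intercept)
        b = rng.gauss(*slope)
        sims.append([a + b * x + rng.gauss(0.0, noise_sd) for x in incomes])
    return sims


def fraction_impossible(sims, low=0.0, high=9.0):
    """Share of simulated heights outside a generous range of possible human heights."""
    values = [y for sim in sims for y in sim]
    return sum(1 for y in values if y < low or y > high) / len(values)


if __name__ == "__main__":
    incomes = [2, 5, 8, 12, 20]  # hypothetical incomes, in $10,000s
    rng = random.Random(0)
    vague = prior_predictive_heights(incomes, 2000, (5.5, 0.3), (0.05, 0.5), 0.2, rng)
    tight = prior_predictive_heights(incomes, 2000, (5.5, 0.3), (0.01, 0.01), 0.2, rng)
    print("vague slope prior:", fraction_impossible(vague))
    print("tight slope prior:", fraction_impossible(tight))
```

A slope prior that looks harmless on its own can imply negative or absurdly large heights once it is pushed through the model, which is one way to see that the prior is only understood in combination with the likelihood.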

• Was on vacation.

I think that “standard inference” which presupposes a finite population – is a conceptual mistake.

As Rob Kass put it, “Fisher introduced the idea of a random sample drawn from a hypothetical infinite population, and Neyman and Pearson’s work encouraged subsequent mathematical statisticians to drop the word “hypothetical” and instead describe statistical inference as analogous to simple random sampling from a finite population…. My complaint is that it is not a good general description of statistical inference” http://www.stat.cmu.edu/~kass/papers/bigpic.pdf

• Bimal Jain MD says:

Andrew
I have looked at a number of your papers and find your description of Bayesianism, which consists of model creation with emphasis on model checking by severe testing, to be very different from what I thought it was, which is a subjective inference from an updated posterior probability. It may well be that Bayesian reasoning as you describe is employed by physicians for inference during diagnosis in practice.

2. Awesome. I love watching statistics videos just before I nod off. So alluring. lol

3. Huw Llewelyn says:

Andrew and Bimal

Bimal wrote: “It would be immensely helpful if you, Andrew or someone else were to look at some of these diagnostic exercises in real patients, such as clinical problem solving exercises or clinicopathologic conferences ( CPCs ) published regularly in the New England Journal of Medicine and tell us what the method employed in them is.”

I have been doing just that for many years. The outcome is summarised in the final chapter of the Oxford Handbook of Clinical Diagnosis. It can be accessed via the following link: http://oxfordmedicine.com/view/10.1093/med/9780199679867.001.0001/med-9780199679867-chapter-13

In summary, diagnostic reasoning can be modelled with a theorem derived from Bayes’s EXTENDED rule. It is applicable with dependence assumptions (as well as limited independence assumptions). The theorem, proof and corollaries are provided at the end of the chapter.

The ‘priors’ used are conditional priors that incorporate an unconditional diagnostic prior and likelihood of the initial finding conditional on that diagnosis. These provide the conditional priors of the differential diagnosis conditional on that initial finding (eg chest pain).

This differential diagnosis is investigated by seeking other findings (eg an EKG result) that are likely to occur in one diagnostic possibility but unlikely in another, making the latter less probable and the former more probable.

The tactics of this reasoning process are explained in Chapter 1 http://oxfordmedicine.com/view/10.1093/med/9780199679867.001.0001/med-9780199679867-chapter-1 and in more detail with an example in Chapter 2 http://oxfordmedicine.com/view/10.1093/med/9780199679867.001.0001/med-9780199679867-chapter-2

So Andrew, here was another hole in Bayesian Philosophy, one which I have been filling in!

• Bimal Jain MD says:

Huw, thanks for your comments and reference to Eddy’s paper that I know about. I have a couple of papers on method of diagnosis in CPCs in Diagnosis, June 2016 and Sept. 2016, that may interest you.

4. Maximilian says:

Is this talk going to be on YouTube or something? Because it seems quite interesting to me, but I don’t have time tomorrow.

Andrew, another question. Lately I came across a thing called “Lord’s Paradox”. It is essentially this: suppose you are doing an experiment with 2 groups. You could arrive at different conclusions if you do a repeated measures ANOVA vs. an ANCOVA controlling for initial values of the outcome. A professor of mine recently had to redo an analysis by another research group because they did the ANCOVA and arrived at wrong conclusions, and I personally often read papers where they analyse those research designs by controlling for the baseline value.
So apparently this is still a thing, Pearl published a paper about it in 2016:
Pearl, J. (2016). Lord’s paradox revisited–(oh lord! kumbaya!). Journal of Causal Inference, 4(2).
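The disagreement between the two analyses can be reproduced in a small simulation. This is a sketch with made-up numbers: two groups that start at different average weights, neither of which changes on average, with a within-group correlation of 0.5 between initial and final weight. It shows that the analyses differ and does not settle which one answers a given causal question.

```python
import random
from statistics import mean


def simulate_group(n, mean_weight, sd, rho, rng):
    """Initial and final weights for one group whose average weight does not change."""
    noise_sd = sd * (1.0 - rho ** 2) ** 0.5  # keeps the final spread equal to the initial spread
    initial = [rng.gauss(mean_weight, sd) for _ in range(n)]
    final = [mean_weight + rho * (x - mean_weight) + rng.gauss(0.0, noise_sd) for x in initial]
    return initial, final


def gain_score_difference(group_a, group_b):
    """Difference in mean change (B minus A): the change-score comparison."""
    gain_a = mean(f - i for i, f in zip(*group_a))
    gain_b = mean(f - i for i, f in zip(*group_b))
    return gain_b - gain_a


def ancova_group_effect(group_a, group_b):
    """Group coefficient (B minus A) from regressing final weight on group and initial weight.

    With a common slope, the least-squares slope is the pooled within-group slope and
    the group coefficient is the difference in final means adjusted by that slope.
    """
    sxy = sxx = 0.0
    means = []
    for initial, final in (group_a, group_b):
        mx, my = mean(initial), mean(final)
        means.append((mx, my))
        sxy += sum((x - mx) * (y - my) for x, y in zip(initial, final))
        sxx += sum((x - mx) ** 2 for x in initial)
    slope = sxy / sxx
    (mx_a, my_a), (mx_b, my_b) = means
    return (my_b - my_a) - slope * (mx_b - mx_a)


if __name__ == "__main__":
    rng = random.Random(0)
    group_a = simulate_group(2000, 60.0, 5.0, 0.5, rng)  # hypothetical lighter group
    group_b = simulate_group(2000, 70.0, 5.0, 0.5, rng)  # hypothetical heavier group
    print("gain-score difference:", gain_score_difference(group_a, group_b))
    print("ANCOVA group effect:  ", ancova_group_effect(group_a, group_b))
```

In this setup the change-score comparison is close to zero while the baseline-adjusted group effect is clearly positive, even though both are computed from the same simulated data.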

• Anoneuoid says:

I doubt your professor’s analysis was correct either, because neither of them is in the case of Lord’s Paradox, as I see here: https://www.r-bloggers.com/lords-paradox-in-r/

For the coefficients of your model to have meaning it needs to be properly specified, or at least be thought to be some approximation of the correct model.

Maybe dining hall A gives you the option to walk up a flight of stairs, but B does not. So you could include a variable reflecting that which changes your conclusion about diet. Maybe there was a flu/cold going around hall A during one of the weighing sessions so students tended to be more dehydrated, so you should control for that. Etc, etc…

My point is that trying to interpret the coefficients of these arbitrary statistical models is a fool’s errand.

There was just a post on here about a paper where they looked at over 600 million plausible linear model specifications for the same data and the coefficient of interest varied from positive to negative. Then they still had to admit the correct specification is probably nonlinear and there could easily be some other variable not collected that would change the estimate.

• Mark Webster says:

You might also be interested in this 2018 article by Stephen Senn: https://errorstatistics.com/2018/11/11/stephen-senn-rothamsted-statistics-meets-lords-paradox-guest-post/

5. Huw Llewelyn says:

Andrew

The reasoning I described in my 5.05am comment can also be applied to working through a checklist of possible sources of bias and other methodological errors in a publication before concluding that the work was probably sound and worthy of statistical analysis. I think that this is one aspect of Mayo’s severe testing.

It also can be used as a guide for hypothetico-deductive scientific reasoning with a number of possible scientific explanations.

6. Anonymous says:

Sooooo is this only accessible remotely through the VT Conferencing Tool ‘Zoom’ or will the final video be available anywhere else?