If this study is valuable at all, it’s for the data. If the data aren’t available, it’s pretty much useless.

(Vegetarians and covid): After adjusting for important confounders, researchers who are willing to make strong unsupported conclusions have 73% (OR 0.27, 95% CI 0.10 to 0.81) and 59% (OR 0.41, 95% CI 0.17 to 0.99) lower odds of getting their papers rejected.

Baland Rabayah writes:

Recently, a vegan friend sent me this study, which states that vegans have 73% to 59% lower odds of moderate-to-severe COVID-19. It is based on front-line workers (568 of them are used for the study). As a vegan myself, I believe that diet quality can play a role in one’s ability to defend against COVID-19; however, the effects reported here are incredibly large, and I definitely do not believe the odds would differ this much.

Reading the paper, I noticed several issues:

– The researchers grouped COVID severity into very mild-mild and moderate-severe rather than simply leaving the categories as very mild, mild, moderate, and severe.
– More than 70% of the sample were men.
– While all the vegan/vegetarian groups showed statistically significant odds ratios, the “non-vegan” comparison group (followers of a low-carbohydrate, high-protein diet) did not, and not by a small margin.
– The study has no socioeconomic adjustments.
– The study also does not adjust for things such as physical activity; it adjusts only for smoking, diet, and BMI.
– No R^2 figures are reported, which makes it hard to assess goodness of fit (for all we know, the data could be noisy).
– There is a significant difference in “sweet and dessert” consumption between the groups.

I took a look at the paper. Here’s how the abstract concludes:

In six countries, plant-based diets or pescatarian diets were associated with lower odds of moderate-to-severe COVID-19. These dietary patterns may be considered for protection against severe COVID-19.

I get the descriptive part but I don’t follow how they think the dietary patterns are protective. I agree with most of Rabayah’s concerns above, especially (a) small sample so results are noisy, implying that realistic effect sizes are undetectable, (b) possible differences in background variables between the groups, and (c) maybe the vegetarians are more health conscious and more likely to wear their masks well, etc. In addition, I couldn’t find the regression results, and I’m concerned that the linear age adjustment might not be enuf.
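
As a quick check on the arithmetic: the headline percentages are just one minus the reported odds ratios, and remember these are lower odds, not lower probabilities. A minimal sketch in R:

    # Percent reductions in odds implied by the reported odds ratios
    or <- c(plant_based = 0.27, pescatarian = 0.41)
    round(100 * (1 - or))   # 73 and 59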

Overall I’d say there’s nothing wrong with doing this sort of comparison, but because of both bias and variance they should really moderate their claims. Too bad that BMJ published this. Isn’t this sort of result supposed to be published in a bottom-feeder journal, so that it’s there if anyone wants it, but the world will know not to take it too seriously?

Also, this “Data availability statement”:

Data are not publicly available.

If this study is valuable at all, it’s for the data. If the data aren’t available, it’s pretty much useless.

  1. I’m tempted to say your concluding statement should apply to virtually all research. I know that is going too far, but the exceptions should be far rarer than they currently are. Too often, the data are not available for reasons that are largely bogus; my suspicion is that the real reason is that careers are furthered by not releasing the data, and nobody forces you to do so.

      • Confidentiality is an important concern, particularly for studies with small and/or vulnerable populations. Sometimes it is reasonable to remove demographic info that could reveal participants’ identities, but that would be hard to do here: this is an observational study, and the utility of the data lies in finding correlations among all the demographic variables they have. So what do you leave out to try to preserve confidentiality? It seems like the more a study depends on noise-mining, the harder it is to make the data public!

        Also, a depressing personal anecdote about a barrier to making all data public: For confidentiality, the demographic questionnaires for our participants are stored in locked cabinets in the lab. But because of COVID precautions in place when we were submitting the paper, we were not permitted to even enter the building, making it impossible to include demographic info when we first posted the data to OSF. Fortunately, these were experimental studies so demographics were not important explanatory variables anyway. But we put our depressing COVID story in the paper just the same.

        • If I read this correctly, these are “front line workers” in “six countries.” They probably don’t even know each other! How much anonymizing would really be needed, even for people who know the particular facilities involved, much less for the rest of us?

          And it’s not as if “got Covid” is even stigmatizing, if it ever was.

    • I am utterly bewildered that there are fields where you can publish a study based on unpublished data. The only example in the historical sciences which I can think of is dendrochronology carried out by commercial labs with their own proprietary databases of tree rings.

      • I am currently examining data (on a different computer that is not on the net) that I could access only after signing several forms, getting folks at my uni to sign them too, and moving a computer off the net into a room with a special lock. This is fairly common in my area with education data. When I submit, I will include the analysis code (including cleaning) and say how readers can get the data (i.e., by going through the same hurdles). I wish I could share the data, but I assume the gov’t won’t give it to anyone who asks. I agree that having evidence for any claims is important (e.g., https://www.degruyter.com/document/doi/10.1515/edu-2020-0106/html), but beyond not using these data I am not sure what more I can do. Any advice would be appreciated!

  2. > they should really moderate their claims

    Without digging deeper into their claims about the data per se, their primary claim could be interpreted as quite moderate.

    These dietary patterns may be considered for protection against severe COVID-19.

    Even “…may be considered as protection…” would be pretty moderate, given the “may.” But “…may be considered for…”??

    I’m not even sure what that means.

  3. “If this study is valuable at all, it’s for the data.”

    Are the data any good for anything? Fraud aside, lots of the studies discussed here have so many ridiculous assumptions built into their data collection that the data have no value. They’re the equivalent of a survey of nine-year-olds on the best methods for treating diaper rash:

    Bad data collection method (survey) + population subset w/ no relevant knowledge = useless data

    I wonder if concern about data availability would decline if the quality of data collection and reliability of the analyses were higher.

    • Data quality and data availability concerns are complements, not substitutes. I would like to see far more credit given for the collection and cultivation of high-quality data, relative to the analysis of that data (which we know is plagued by forking paths, poor incentives, poor analysis, etc.). If there were quality controls on the data, then I’d place even more emphasis on wanting access to that data.

      • “Data quality and data availability concerns are complements, not substitutes.”

        I agree, but I still think there would be much less demand for data access if the general reliability of the work were much higher. My experience is that data generated for one purpose don’t work very well for other purposes, which I think is one reason lots of studies fail. It’s not that the data are bad; they’re just not the right data to do the job and – surprise! – they don’t do the job.

    • Chipmunk, Dale:

      Sure, when I say “the data,” I’m implicitly including the metadata telling us how the data were collected, where they came from, etc. As with a document, “the data” is not just a string of numbers; it also has a provenance.

  4. The study also does not adjust for things such as physical activity; it adjusts only for smoking, diet, and BMI.

    Why adjust for BMI?

    Are they interested in the effect of diet except for reasons associated with BMI?

    • Obesity is a risk factor for Covid complications, so you’d want to control for that to isolate the effect of diet. I’m not saying they are succeeding, but that’s why you’d want to adjust for BMI.

      • If the main reason vegetarians are healthier is that their diet helps them maintain a lower BMI, then controlling for BMI and finding no effect could be misleading.
        It is the opposite of smoking and COVID, where once you control for chronic respiratory disease (which was caused by the smoking in the first place), it looks like smoking doesn’t lead to a worse outcome from COVID.
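
        A quick hypothetical simulation of this point (all numbers invented): suppose diet lowers BMI and BMI drives severity. Then adjusting for BMI wipes out a diet effect that is causally real:

          # Invented numbers: diet acts on severity only through BMI
          set.seed(123)
          n <- 10000
          vegetarian <- rbinom(n, 1, 0.5)
          bmi <- 27 - 2 * vegetarian + rnorm(n, 0, 3)      # diet lowers BMI
          severe <- rbinom(n, 1, plogis(-6 + 0.15 * bmi))  # BMI raises risk

          # Unadjusted: vegetarian coefficient is clearly negative (protective)
          coef(glm(severe ~ vegetarian, family = binomial))
          # Adjusted for the mediator: the diet effect shrinks toward zero
          coef(glm(severe ~ vegetarian + bmi, family = binomial))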

        • Fred –

          > If the main reason vegetarians are healthier is that their diet helps them maintain a lower BMI, then controlling for BMI and finding no effect could be misleading.

          I think there’s a lot of controversy with respect to how much BMI explains health outcomes.

          But regardless, I’d guess you’d agree that even if BMI is important, it doesn’t fully explain all health outcomes. As such, if you control for BMI you can begin to explore whether it mediates or moderates the effects of other factors on health outcomes.

      • I came up with a less politically loaded dataset to use as an example. What we want to do is figure out a model for the number of moons orbiting a planet using its mass, diameter, and distance from the sun. I got the data from here:

        https://nssdc.gsfc.nasa.gov/planetary/factsheet/

        Here are the results I get (put on pastebin because I don’t know how to format lm output for this blog):
        https://pastebin.com/2TYnTiTH

        1) We see the effect of mass on number of moons is positive and significant.

        2) But when we “control for” diameter of the planet, now the effect of mass is negative and not significant.

        3) When we “control for” diameter *and* distance, the effect of mass remains negative and not significant. We also see the effect of distance is highly non-significant.

        Now the “correct” model is roughly the Hill radius: Distance*[Mass_planet/(3*Mass_sun)]^(1/3)

        I put “correct” in quotes because really you should be taking the volume of the Hill sphere and subtracting the volume of the Roche-limit sphere, but even then we would still expect large effects of initial conditions (presence of debris, encounters with other large objects, etc.).

        There are also issues with the data. E.g., Jupiter/Saturn have been observed much more closely than Uranus/Neptune, so the data may be biased in favor of more moons being found around planets closer to Earth. Finally, there is no clear definition of a moon (e.g., no lower bound on size).

        So the equation is not expected to fit our solar system perfectly anyway.

        However, our method of investigation should at least be able to figure out that the number of moons is positively correlated with the distance from the sun and the mass of the planet. This method of “adjusting/controlling” for variables clearly cannot be relied upon to do that. AFAICT, it amounts to comparing arbitrary numbers to other arbitrary numbers.
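
        For anyone who wants to rerun this, here is roughly what the setup looks like in R. The values are my transcription from the fact sheet (moon counts change as new moons are discovered, so exact coefficients may differ from the pastebin output):

          # Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune
          planets <- data.frame(
            mass     = c(0.330, 4.87, 5.97, 0.642, 1898, 568, 86.8, 102),          # 10^24 kg
            diameter = c(4879, 12104, 12756, 6792, 142984, 120536, 51118, 49528),  # km
            distance = c(57.9, 108.2, 149.6, 227.9, 778.6, 1433.5, 2872.5, 4495.1),# 10^6 km
            moons    = c(0, 0, 1, 2, 79, 82, 27, 14)
          )
          summary(lm(moons ~ mass, data = planets))                        # (1)
          summary(lm(moons ~ mass + diameter, data = planets))             # (2)
          summary(lm(moons ~ mass + diameter + distance, data = planets))  # (3)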

      • Yes, but this all gets tricky fast. Suppose an analysis shows that “controlling for” BMI apparently soaks up the explanatory power of diet. But now further suppose there’s an unmeasured confounder common to BMI and risk of severe COVID (who knows, maybe some genetic polymorphism or other). In that case, you could find spurious “mediation.” So they’d need a way to plausibly close off those kinds of causal pathways, which is highly unlikely to be possible in this context…
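
        A hypothetical variant of the simulation idea above (invented numbers again): here diet has no effect on severity at all, but BMI shares an unmeasured cause with severity, so adjusting for BMI manufactures a diet “effect” out of nothing:

          # U is unmeasured (say, some genetic factor); diet is truly null
          set.seed(456)
          n <- 10000
          u <- rnorm(n)
          vegetarian <- rbinom(n, 1, 0.5)
          bmi <- 25 - 2 * vegetarian + 2 * u + rnorm(n, 0, 2)
          severe <- rbinom(n, 1, plogis(-2 + u))   # only U drives severity

          # Conditioning on BMI opens the path diet -> BMI <- U -> severity,
          # so the vegetarian coefficient comes out spuriously nonzero
          coef(glm(severe ~ vegetarian + bmi, family = binomial))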

        • This reminds me of linking aerobic exercise to reduced cardiovascular disease risk. The association with VO2 max is massively strong, stronger than the aggregate risk score from all traditional markers. So it’s tempting to say that improvements in VO2 max mediate the impact of exercise on cardio risk, and they surely do. But the common confounder is that the genetics that confer higher VO2 max and responsiveness also seem to be protective, at least in multigenerational mouse studies. So you wouldn’t want to naively “control for” VO2 max in studying the impact of exercise.

        • There is probably some very specific scenario, with near-perfect data and all the assumptions of the statistical model met, in which this approach would work.

          But, in practice, when that scenario comes up we would know enough to be coming up with a real model (one derived from basic assumptions about the phenomenon) anyway.

          So I don’t really see any use for this method, but it is everywhere.

        • Chris –

          Sure. There are always going to be problems and uncontrolled-for confounders. Seems to me you do the best you can. I guess in theory, the more you attempt to identify and control for confounders and test for moderator, mediator, or interaction effects, the closer you might get to really understanding causality.

          Seems to me the key is then to take the findings and work on a theory of the mechanism of causality, and then design tests of that theory, which is where I think it really gets interesting.

          I’m not a big believer in binary thinking, where you conclude because you can’t design the perfect study the investigation is worthless.

        • Joshua, yes, agreed. My hope is for researchers to take causal analysis and theorizing seriously, not give up. It’s just a lot harder than cookie-cutter fake-out empiricism :)

        • Like the Hill radius example above: take a bunch of boxes (from tiny Amazon deliveries to huge shipping containers) of different color, material, weight, length, height, width, etc.

          Can the method of trying out different statistical models get you to Volume = L*H*W? I would like to see the workflow if so (one attempt is sketched below). We need some positive examples of this approach working.

          The method is essentially manual machine learning. Are there examples of ML/AI models figuring something like that out?
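
          Here is one hypothetical version of that workflow in R, with simulated boxes and a little measurement noise so the fit isn’t exact. The log-log coefficients come out near 1, which is Volume = L*H*W in disguise, but note the analyst has to supply the log transform by hand; whether that counts as the method “figuring it out” is exactly the question:

            # Simulate boxes of varying dimensions plus an irrelevant attribute
            set.seed(1)
            n <- 200
            boxes <- data.frame(
              L = runif(n, 1, 100),   # length, cm
              H = runif(n, 1, 100),   # height, cm
              W = runif(n, 1, 100),   # width, cm
              color = sample(c("brown", "white"), n, replace = TRUE)
            )
            boxes$V <- boxes$L * boxes$H * boxes$W * exp(rnorm(n, 0, 0.05))  # ~5% noise

            # Coefficients near 1 on each log dimension and ~0 for color
            # recover V = L*H*W, but only given the assumed functional form
            summary(lm(log(V) ~ log(L) + log(H) + log(W) + color, data = boxes))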

        • Joshua said:

          “…in theory, the more you attempt to identify and control for confounders and test for moderator, mediator, or interaction effects, the closer you might get to really understanding causality.”

          In theory, yes. In practice, judging by the data we have on lots of things, it looks like that’s not going to be the case for many, many things. I guess when you can create a mathematical model of a functioning cell that can replicate *all* the activity of an actual cell, you’ll be getting somewhere.

          Until then, you’re just toying with statistics. I’m getting closer to believing that most statistical analysis is analogous to playing a video game: it’s fun to play and cute to see what you can come up with, but in reality it tells you absolutely nothing important about anything.

        • Chipmunk –

          > Until then, you’re just toying with statistics…but in reality it tells you absolutely nothing important about anything.

          Problems exist. Unintended consequences exist. The world is sub-optimal, but your conclusion seems too extreme to me. I don’t think it’s true that we get no benefit from this type of research, or that its net value on balance is negative. Again, such a view seems too binary to me.

  5. I guess when you can create a mathematical model of a functioning cell that can replicate *all* the activity of an actual cell, you’ll be getting somewhere

    You can definitely say something useful about a coin flip without modelling all the forces involved.

    Also, in general a model distills a phenomenon down to the important essentials. Just like a map doesn’t need to reproduce every blade of grass.
