Derived quantities and generative models

Sandro Ambuehl, who sketched the above non-cat picture, writes:

I [Ambuehl] was wondering why we’re not seeing reports of measures of Covid19 mortality other than the Case Fatality Rate.

In particular, what would seem far more instructive to me than CFR is a comparison of the distributions of age at death, depending on whether the deceased was a carrier of Covid19 or not. That is, I’d really like to see a graph such as the one above. Does this exist anywhere? Or what would be the issues with it?

It seems such a statistic would address three of the major issues with CFR: 1. You can only determine CFR if you know the number of infected people, which, given the large number of asymptomatic cases and limited testing is nearly impossible. 2. Covid19-deaths frequently coincide with comorbidities. It is close to impossible to determine whether a death is attributable to Covid19 or to the comorbidity. 3. CFR tells you nothing about the number of lost years of life.

My reply:

I continue to think that it’s a mistake to think of “the CFR”: at the very least you’d like to poststratify by age.

Regarding the question, what would be the issues with the above graph? My response is that it’s a fine graph to make, but ultimately I’d think of it as a product of some underlying model. There’s some generative model of disease progression and death, and when you integrate that over the population you’ll get a graph like the one drawn above. So I think of the above graph as a derived quantity.

33 thoughts on “Derived quantities and generative models”

  1. > My response is that it’s a fine graph to make, but ultimately I’d think of it as a product of some underlying model. There’s some generative model of disease progression and death, and when you integrate that over the population you’ll get a graph like the one drawn above. So I think of the above graph as a derived quantity.

    You may be misunderstanding what the proposed graph represents (or maybe I do). There is no model of disease progression involved (in the construction of the chart, even though one would help with its interpretation). It’s derived straight from the data: proportion of all people who died (y axis) at different ages (x axis) separated according to infection status (blue/black lines). (There may be issues with the data, like misidentified infection status, but I don’t think that was what you meant.)
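As a minimal sketch of what "derived straight from the data" could mean here, the snippet below computes the cumulative proportion of deaths at or below each age, split by infection status. The `records` list is a hypothetical toy dataset, not real mortality data, and the helper name `empirical_cdf` is an assumption introduced only for illustration.

```python
from bisect import bisect_right

def empirical_cdf(ages):
    """Return F(x) = fraction of recorded deaths at age <= x."""
    ordered = sorted(ages)
    n = len(ordered)

    def cdf(x):
        # bisect_right counts how many ages are <= x
        return bisect_right(ordered, x) / n

    return cdf

# Hypothetical toy records: (age_at_death, had_covid). Not real data.
records = [(45, False), (62, True), (70, False), (71, True), (78, False),
           (80, True), (83, True), (85, False), (88, True), (91, False)]

covid_ages = [age for age, had_covid in records if had_covid]
other_ages = [age for age, had_covid in records if not had_covid]

cdf_covid = empirical_cdf(covid_ages)  # one line of the chart
cdf_other = empirical_cdf(other_ages)  # the other line of the chart

for age in (50, 70, 80, 90):
    print(age, cdf_covid(age), cdf_other(age))
```

Nothing in this calculation refers to disease progression; it only sorts and counts. Whether the result can be read as anything more than a summary of the dataset is the question the rest of the thread takes up.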

    • Ruh roh, this one looks destined to devolve into the timeless question of whether some sort of model must inherently underlie all data representations. Because the word “model” means exactly what you believe it means, nothing more and nothing less.

      In this case, the “everything-is-a-model” crowd would say that in order to generate the graph above, you first needed a model to decide that you could meaningfully do so while ignoring comorbidity, you performed objective tests to determine the data was representational based upon your model, etc.

      Meanwhile, a member of the “not-everything-is-a-model” club might say “gee, I just wanted to see what column ‘A’ would look like on a graph with column ‘B.’”

      • “whether some sort of model must inherently underlie all data representations.”

        Or is it “some sort of mechanism underlies all processes” – i.e., the progression of covid 19 is a process that we recognize; we measure outcomes of the process and that’s the data; and we seek to use the data to uncover the rational mechanism by which COVID19 operates.

        I’m sure I could put together millions of “data representations” that make no sense whatsoever.

  2. “So I think of the above graph as a derived quantity.” But there should be data that can be graphed like this directly, shouldn’t it? If you know of all deaths whether they were Covid deaths or not (which of course you don’t know always with 100% certainty, but in the vast majority of cases I’d think it’s pretty clear) – what is the “generative model” needed for?

  3. I think the “model vs no model” issue comes up when you talk about what the y axis means. Is it just proportions of people who died who were less than the given age, or is it a probability that a random person in the future who dies will be less than a given age? Not all proportions should be taken as probabilities, particularly if there is more information available.

      • You’d need to be more specific to make any kind of model, but not to illustrate the conceptual difference between “the fraction of the things that definitely happened in the past” and “the weight we should give to different options for what might happen in the future”

        Even to consider the first case as a probability, we need to say something like: “If I were to randomly select a person in my data set who died in the last 6 months, what is the probability that they would have been less than 50 years old if I know that they had COVID?”

        This becomes a probability because it’s about a future unknown event (ie. what happens when in the future the RNG spits out the chosen data point).

        The ratio of two data values is just a fraction, it’s a fixed definite quantity, there’s nothing unknown about it. To talk about “credences” or “plausibility measures” (which is what probability is) we need to have something unknown. That can be supplied by the RNG selecting from the past data, or by the fact that the event we’re discussing hasn’t happened yet.

        • > Even to consider the first case as a probability, we need to say something like: “If I were to randomly select a person in my data set who died in the last 6 months, what is the probability that they would have been less than 50 years old if I know that they had COVID?”
          > This becomes a probability because it’s about a future unknown event (ie. what happens when in the future the RNG spits out the chosen data point).

          But I thought you were an adept of the idea of probabilities as measures of uncertainty, which is not necessarily about the future. Say I’ve already selected randomly a person in my data set among those who died in the last 6 months and had COVID. The medical history of this person is in a folder right in front of me, but I’ve not looked at it yet. I know already the aggregate statistics, though. What is the probability that this person was younger than 50?

          That seems a probability to me, no need for future unknown events (what I will find in the folder when I look is no more unknown than what’s in the folder right now).

        • “That seems a probability to me, no need for future unknown events (what I will find in the folder when I look is no more unknown than what’s in the folder right now).”

          Ouch! ;)

          When words have different meanings for everyone, they cease being tools of communication. If you go back and read all the threads on this blog with epistemological arguments about the concept of probability, folks are all over the place. The word “model” is not used in any consistent way either. And while I am on this rant, the word “theory” used to have a formal definition – graduated from “hypothesis” – and now it just means “explanation.”

          I don’t know what a “probability model” is. I suppose I could go figure it out, I could just read some Probability Model Theory.

        • What you will find in the folder when you look *is* what is in the folder now. So they are precisely the same thing.

          The key thing that lets you discuss probabilities is that you haven’t peeked.

          As soon as you peek, the probability that the person is under 50 collapses to either 1 or 0. In other words, it’s a function of your knowledge and it’s the fact that your knowledge is incomplete that creates the probability.

          Whether it’s incomplete because the thing has yet to happen, or it’s incomplete because you’ve yet to observe the outcome that already happened, is not particularly important.

          But if you’ve observed all the deaths in the last 6 months and you know that 23 percent of the covid related deaths are people under 50, this doesn’t admit a probability interpretation until you add something that is unknown. Like for example which person will be picked by your RNG, or which person you picked and put in the folder.

          The “credence” or “credibility” measure of a statement where the truth is a known fact is always either 0 or 1.

        • I misunderstood what you wrote about “becoming a probability because it’s about a future unknown event” as implying the impossibility of attaching a probability to a claim about the past.

          My original reply was simply pointing out that when you presented the dilemma

          “Is it just proportions of people who died who were less than the given age, or is it a probability that a random person in the future who dies will be less than a given age?”

          the former is (more or less) well defined while the latter isn’t.

          By the way, your contraposition of “proportions” to “probability” in that sentence is somewhat artificial. You could also have phrased it like this:

          “Is it a probability that a random person who died was less than the given age, or is it just proportions of people who will die in the future who will be less than a given age?”

        • > But if you’ve observed all the deaths in the last 6 months and you know that 23 percent of the covid related deaths are people under 50…

          From this for example you can say “the probability that 23 percent of the covid related deaths in the last 6 months were people under 50 is 1”

          We often use a shorthand: “the probability of a person with covid who died in the last 6 months being under 50 is 23 percent” but this is kind of shorthand for

          “if I choose a person from the list uniformly at random the probability that the person chosen will be under 50 is 23%” it’s the fact that we don’t know which one will be chosen which induces a probability.

          If you choose person 834 and you don’t look at the info about that person you could say “the probability I assign to person 834 being a person under 50 is 23%” but as soon as you look, it will become either 0 or 1

          A thing which is certain to be true, has probability 1 by definition of probability 1.

    • Daniel: Technically you are right, but many people confuse probabilities and relative frequencies. I don’t know Sandro Ambuehl, but maybe he was just interested in displaying relative frequencies there, and when you change the P to F and say it means relative frequency, the issue goes away?

      • I don’t mean to complain about Sandro’s choice really. What I’m trying to do is elucidate the difference in position between Andrew’s statement about the graph being “a derived quantity” based on a model and Carlos’ statement about the graph being calculated straight from data.

        If you label the y axis as “proportion of people in the dataset who died while less than given age” then the graph can be calculated straight from the data, and is just processed data. There isn’t any uncertainty involved, it’s like 13/75 = .1733

        You can turn it into a probability if you add some further uncertain component to it… and then it becomes a derived model of the type I think Andrew was thinking about. So for example either: “if I chose uniformly at random a deceased person from the dataset who had covid what is the probability they will be less than 50 yo” is a probability statement. It requires assumptions only about the goodness of the RNG (that it be sufficiently uniform).

        Or you could do something like “if the next 6 months are similar to the previous 6 months, what is the probability that a person who dies in the next six months with covid will be less than 50yo?” and you can derive a probability statement because there is uncertainty involved since you haven’t observed this outcome yet.

        These last two statements are probability statements about uncertain quantities, and involve a model (either a random selection model from a fixed dataset, or an unknown future outcome assumed to be similar to past outcomes) whereas the first one is just a statement about a proportion of a given fixed data set.

        The probability statements require a probability model (ie. a way to map possibilities to plausibilities). The fraction of a given dataset does not require a probability model, it’s just a mathematical fact about a dataset.
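The distinction can be sketched in a few lines, under stated assumptions. The dataset below is hypothetical and simply reuses the 13/75 arithmetic from the comment above; the seed and number of draws are arbitrary choices. The first quantity is a fixed fact about the data. The second only becomes a probability once an unknown is added, here which record a random number generator will pick.

```python
import random

# Hypothetical dataset: 13 of 75 deceased people were under 50 (illustrative only).
under_50 = [True] * 13 + [False] * 62

# 1) A fixed fact about the dataset: plain arithmetic, nothing unknown.
proportion = sum(under_50) / len(under_50)

# 2) A probability statement: the unknown is which record the RNG selects.
#    Simulating many uniform draws approximates that probability.
rng = random.Random(0)  # fixed seed so the sketch is reproducible
draws = 100_000
hits = sum(rng.choice(under_50) for _ in range(draws))
estimate = hits / draws

print(f"proportion in dataset: {proportion:.4f}")
print(f"simulated selection frequency: {estimate:.4f}")
```

The two numbers agree closely, which is why the shorthand is tempting, but only the second involves a (trivial) probability model: uniform selection from a fixed list.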

        • > So for example either: “if I chose uniformly at random a deceased person from the dataset who had covid what is the probability they will be less than 50 yo” is a probability statement. It requires assumptions only about the goodness of the RNG (that it be sufficiently uniform).

          It could also be “given this list of the people in the dataset who had covid in alphabetical order, what is the probability that the first one will be less than 50 yo”. We can have probability from ignorance/symmetry/exchangeability considerations, no need to make assumptions about uniformity of RNGs.

        • I’d say it’s the other way around. The natural state is ignorance and a source of information/knowledge is required to reduce the uncertainty.

          In any case, the probability model required to go from a distribution of ages in a population to a probability of ages within the population is trivial. It doesn’t require a generative model of disease progression and death or complicated assumptions about sampling schemes and random number generators.

          But I agree, of course, that given a question like “If half of the class is younger than 30 what is the probability that a student is younger than 30?” a myriad of answers are possible, including:

          a) 50% assuming that no additional assumptions are made and we base the answer just on the information contained in the question

          and

          b) 100% if the student is younger than 30 and 0% otherwise

        • Once we have a dataset, we are not ignorant of the contents of the data set (well, I mean, we can query a computer to calculate whatever we want about the dataset, we personally may not have internalized it, but that’s irrelevant).

          So if you want to know answers to questions about the dataset, there is no probability model involved in getting the answers, just some necessary algebraic calculations.

          So probability doesn’t enter unless you do something like you did with the “imagine we sort into alphabetic order and give me your guess for the status of the first person without actually telling the computer to do the calculation and give you the answer” type scenario.

          Where probability enters is where some information is missing… and that can be from whatever mechanism: a random number generator, a future event that’s related to the current event, a past event that you’re told is similar to the ones in your data set… Or even asking you to pretend you don’t know some facts about your dataset. In each case, this is where a probability model is required. The probability model maps between things you 100% KNOW (such as the contents of the data set) and things you aren’t sure about (whatever they may be). That’s why probability models are written in conditional form p(A | B), where B is stuff you KNOW and A is stuff you have a model for that’s informed by B.

          The distinction was intended to show that Carlos can be right in saying that the requested graph is 100% a deterministic function of the dataset and doesn’t require any assumptions (if the graph was just supposed to be a graph of the proportion of the dataset falling below a given age)… And that if given a different interpretation of what the graph was supposed to mean, Andrew could be right that the graph is a quantity derived from a generative probability model (if the graph is something like the probability that the next covid death to be recorded will be a person less than a given age)

      • In fact it’s Daniel who introduced probabilities in the discussion (at least explicitly, they were probably implicit in Andrew’s response). The text accompanying the chart says it is “a comparison of the distributions of age at death, depending on whether the deceased was a carrier of Covid19 or not” and calls it a “statistic” (i.e. a quantity computed from values in a sample). P could stand for probability, in a suitably-defined sense which doesn’t seem *so* problematic, but it could also stand for percentile.

  4. The Years of Life Lost idea can be got at in another way: look at different time units of measurement for death rates, different “frequencies”. Suppose Covid19 kills mostly people who are within a month of death, but after one day we develop a vaccine. On January 1, everyone who would have died in January dies immediately: a huge one-day death rate. On January 2, if we look at the one-day death rate, it’s zero, and the two-day death rate, which is also the new 2020 death rate, is half of the January 1 rate. January 3, the 2020 death rate falls still further. By January 31, we’re back to the 2019 death rate; it’s just that all the January deaths were January 1 rather than spread through the month.

    Is this happening in reality? Are all the people who would have died by December 31 dying early? If so, we’d see negative serial correlation in death rates. This is a testable theory, in theory– in theory, because my impression is that government mortality data is so disgracefully bad that we don’t know, for example, the New York City death rate for July 2020 yet. In fact, it’s not really an accept/reject theory, it’s a “how-big-is-the-effect” one: if feeble people have a higher probability of dying, it’s arithmetically necessary that there be this kind of negative-serial-correlation effect and the only question is how big it is and whether it’s big enough to see in noisy data that has other things (like geographic spread) in it.
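A toy version of this thought experiment can be sketched as follows. Everything here is an assumption made for illustration: a "frail pool" with constant daily inflow, a constant baseline daily hazard, and a single day on which the hazard jumps. The parameter values are hypothetical and the model is deterministic, so it shows only the arithmetic of the effect, not its size in real data.

```python
def simulate_daily_deaths(days=120, inflow=100.0, hazard=0.05,
                          shock_day=30, shock_hazard=0.5):
    """Deterministic frail-pool model (an illustrative assumption).

    Each day `inflow` people become frail and a fraction `hazard`
    of the frail pool dies; on `shock_day` the fraction is `shock_hazard`.
    """
    frail = inflow / hazard  # steady-state pool size, so baseline deaths == inflow
    deaths = []
    for day in range(days):
        h = shock_hazard if day == shock_day else hazard
        died = h * frail
        deaths.append(died)
        frail = frail - died + inflow  # pool shrinks by deaths, grows by inflow
    return deaths

deaths = simulate_daily_deaths()
baseline = 100.0  # daily deaths before the shock in this toy setup

print("day before shock:", round(deaths[29], 1))
print("shock day:       ", round(deaths[30], 1))
print("day after shock: ", round(deaths[31], 1))
print("cumulative excess by day 119:", round(sum(deaths) - baseline * len(deaths), 1))
```

In this sketch the spike on the shock day is followed by a run of below-baseline days, and the cumulative excess shrinks toward zero as the pool refills. That is the pattern described above; how visible it would be in noisy real mortality series is a separate question.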

    • “my impression is that government mortality data is so disgracefully bad that we don’t know, for example, the New York City death rate for July 2020 yet. ”

      Can everything be solved with better data? And what does it mean to “solve” that problem? If it costs several billion to get that data, we might solve the technical problem but create an economic problem.

      So while there are all these calls for better data, I really doubt there is a realistic way to implement that at a reasonable cost.

  5. I have seen measurements of ‘excess death’. In the worst days of COVID this spring, over 800 people died per day in NYC, compared to 150 or so on the same date in a typical year.

  6. The suggested type of graph does nothing to address the three issues listed.

    Hypothetically, if we had a sudden illness that doubled everyone’s overall risk of death this year (Covid-19 isn’t that, mostly because of fewer deaths among young people, but it may come close), then the blue and the black line coincide because the graph is normalized to 100%. What this does do is chip away at the age distribution of the population, and maybe, depending on how widespread the epidemic becomes, you can see a change in subsequent years when the “age at death” probabilities shift due to the change in the underlying demographic.

    In detail:
    > 1. You can only determine CFR if you know the number of infected people, which, given the large number of asymptomatic cases and limited testing is nearly impossible.

    “CFR” means Case Fatality Rate. Cases are known patients, so all you need to have is the number of cases to find the CFR. That’s easy. The lethality of the infection is expressed by the IFR, Infection Fatality Rate, and the various antibody surveys done aim at establishing a ballpark figure for this. For certain delimited populations, e.g. on ships, the IFR can be determined more readily. It always depends on the age demographics of the underlying population.
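The difference between the two rates is only the denominator, as the following sketch shows. All three counts are hypothetical, and `estimated_infections` stands in for whatever an antibody survey might suggest; none of these figures come from real data.

```python
def fatality_rate(deaths, denominator):
    """Deaths divided by the chosen denominator (cases or infections)."""
    return deaths / denominator

# Hypothetical counts, for illustration only.
deaths = 50
confirmed_cases = 1_000        # known patients: the CFR denominator
estimated_infections = 10_000  # assumed survey-based estimate: the IFR denominator

cfr = fatality_rate(deaths, confirmed_cases)
ifr = fatality_rate(deaths, estimated_infections)

print(f"CFR = {cfr:.1%}, IFR = {ifr:.1%}")
```

With the same number of deaths, the CFR is as easy to compute as the case count is to obtain, while the IFR is only as good as the estimate of total infections.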

    > 2. Covid19-deaths frequently coincide with comorbidities. It is close to impossible to determine whether a death is attributable to Covid19 or to the comorbidity.

    A death without signs of Covid-19 (test is negative, no symptoms of lung infection) is most likely not attributable to Covid-19, a death with the typical bilateral lung infection has Covid-19 as a definite factor, the non-lung-related deaths where the virus is present require more work from the pathologist, but medically, they don’t usually have difficulty determining this for the death certificate.

    It has been pointed out that there is something this year that causes an excess of all-causes-mortality compared to other years, and that it’s not the flu (because the CDC and similar agencies in other countries monitor the incidence of the flu). The CDC offers mortality data on a page titled “Excess Deaths Associated with COVID-19”, and EuroMOMO offers this data for many European countries (not all of which show a spike of excess deaths in April, but many do, including the UK and Sweden).

    > 3. CFR tells you nothing about the number of lost years of life.

    And neither does this graph.
    You can’t derive lost years of life from it in any way that I can see.
    And that’s even if we disregard that Covid-19 can do some lasting damage to the lungs or other organs that may itself become a “co-morbidity” for some other disease later: more “lost years of life” attributable to Covid-19 may manifest themselves in later years!

    • I think the “lost years of life” aspect is that death at an older age = fewer lost years of life.

      Yes, if lasting damage from COVID becomes a significant issue, that might have its own effects on lost life-years (or disability-adjusted life years) separate from COVID deaths themselves. I don’t think we know this is true yet… few people have been recovered from COVID for more than say 6 months and no one for more than ~9 months, so I don’t think truly long-term effects (years/decades) are knowable. However certainly there are expected to be some lasting effects in e.g. ICU patients (from experience with other illnesses).

      • Pneumonia is not a new illness; whatever “unknowable” means, some effects are certainly predictable, though some aspects of Covid-19 are unexpected.

        Life expectancy does depend on the medical conditions you have, and Covid-19 fatality risk does as well, so to analyse it by age only is a simplification. The point is that even if you use that simplification, you can’t use this graph to figure out that number: if the risk to die at age X doubles for 12 months, then obviously there will be a lot of lost years, but the doubled graph and the normal graph will look the same if you normalize them to 100% of the deaths. If you graph absolute numbers, then you can figure out how many people died who wouldn’t have; and we do have these “excess death” numbers.
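The normalization point can be checked with a small sketch. The age bands and counts below are hypothetical; the "doubled" scenario simply multiplies every band by two, which is the simplification described above rather than a claim about Covid-19.

```python
# Hypothetical baseline deaths per age band (illustrative numbers only).
baseline = {"0-49": 50, "50-69": 200, "70-84": 450, "85+": 300}

# Scenario from the comment: risk of death doubles at every age.
doubled = {band: 2 * n for band, n in baseline.items()}

def normalized_cumulative(deaths_by_band):
    """Cumulative share of deaths by age band, normalized to 100% of deaths."""
    total = sum(deaths_by_band.values())
    running = 0
    shares = {}
    for band, n in deaths_by_band.items():  # dicts keep insertion (age) order
        running += n
        shares[band] = running / total
    return shares

same_shape = normalized_cumulative(baseline) == normalized_cumulative(doubled)
excess_deaths = sum(doubled.values()) - sum(baseline.values())

print("normalized curves identical:", same_shape)
print("excess deaths in absolute numbers:", excess_deaths)
```

The normalized curves coincide even though twice as many people died; only the absolute counts reveal the excess.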

        The problem is, what do you do with a “lost years of life” figure? What does it tell you that you don’t already know?

        • >>Pneumonia is not a new illness; whatever “unknowable” means, some effects are certainly predictable, though some aspects of Covid-19 are unexpected.

          Yeah, I was more referring to certain claims of COVID long-term effects being more common/more severe than would be “expected” from experience with other respiratory illnesses.

          >>The problem is, what do you do with a “lost years of life” figure? What does it tell you that you don’t already know?

          Well, it’s another way to look at the severity of COVID risk vs. other risks.
