If Yogi Berra could see this one, he’d spin in his grave: Regression modeling using a convenience sample

Kelvin Leshabari writes:

We are currently planning to publish a few manuscripts on the outcome of treatment of some selected cancers occurring in children. The current dataset was derived from the natural admission process of children with cancer seen at a selected tertiary cancer centre. To the best of our understanding, our data are based on convenience sampling, as we collected what was available. We are also quite sure that those patients are far from representative of all other children with cancer in our area. We do know that convenience sampling is a non-probability sampling method. We plan to run several analyses that may include fitting proportional hazard models.

At present I am in conflict with my co-investigators, since I believe that our dataset violates some assumptions (e.g. the proportionality assumption, randomness, independence, etc.) and hence that it may be statistically inappropriate to apply certain statistical techniques (e.g. proportional hazard models, Kaplan-Meier curves). However, they have provided me with a handful of publications that analysed data based on convenience sampling, even with obvious signs of violation of key assumptions of those techniques.

Questions:

1. Is it statistically valid to analyse data using methods that need randomness and independence assumptions in cases of non-probability sampling? (e.g. Is it logical to fit a regression model to a dataset that was derived from convenience sampling?)

2. Can sample size (when infinitely large!) be a rescue for application of proportional hazard models (including fitting Kaplan-Meier curves!) in cases of non-probability sampling?

3. Can I fit proportional hazard models to a dataset derived from convenience sampling?

My reply: If you do regression modeling using a convenience sample, you’re implicitly assuming that the probability of a person being included in your sample depends only on the variables included in your regression model. To put it another way, you’re assuming equal-probability sampling within the poststratification cells determined by the predictors in your regression. So you can typically say that the more predictors you have, the more reasonable it is to fit a regression model to a convenience sample. If you have some sense of the biases in the convenience sample (which sorts of people are more or less likely to be in the sample), you should try to include predictors in your model that will capture this. And if necessary, add a term to your log-likelihood to model any remaining selection process, though it’s rare that researchers ever get to this step.
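
To make that concrete, here is a minimal numerical sketch of the idea: fit a regression to a convenience sample whose inclusion probability depends only on the predictors, then poststratify the cell-level predictions using known population cell shares. The toy data-generating process, the inclusion probabilities, and the variable names are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy population with one binary predictor (think of it as an age group).
N = 100_000
x = rng.binomial(1, 0.3, size=N)             # 30% of the population has x = 1
y = 2.0 + 1.5 * x + rng.normal(0.0, 1.0, N)  # outcome depends only on x

# Convenience sample: inclusion probability depends only on x
# (people with x = 1 are five times as likely to end up in the sample).
p_include = np.where(x == 1, 0.05, 0.01)
in_sample = rng.random(N) < p_include
xs, ys = x[in_sample], y[in_sample]

# "Regression" fit in the sample: here just the saturated model, i.e. cell means.
cell_mean = {v: ys[xs == v].mean() for v in (0, 1)}

# Naive estimate of the population mean ignores the selection entirely.
naive = ys.mean()

# Poststratified estimate: weight each cell's prediction by its *population*
# share, assumed known from census-type information.
pop_share = {v: np.mean(x == v) for v in (0, 1)}
poststrat = sum(cell_mean[v] * pop_share[v] for v in (0, 1))

print(f"true population mean : {y.mean():.3f}")
print(f"naive sample mean    : {naive:.3f}")      # pulled toward the x = 1 cell
print(f"poststratified       : {poststrat:.3f}")  # close to the truth
```

The naive mean is pulled toward the over-sampled cell; the poststratified estimate recovers the population average because selection operates only through the predictor the model conditions on. When inclusion also depends on things the model does not capture, no amount of reweighting within these cells fixes it, which is the point of the advice to include selection-relevant predictors.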

46 thoughts on “If Yogi Berra could see this one, he’d spin in his grave: Regression modeling using a convenience sample”

  1. I think the person who wrote to you might also like to know that the results of a regression applied to a convenience sample are valid for whatever population the convenience sample represents (which may not be the population of interest). It seemed like he was maybe confused on that point as well.

  2. The great majority of clinical trials are conducted on non-probability samples, and the data from many of them are analysed with regression, including sometimes proportional hazard models.

  3. 1) “Is it statistically valid to analyse data using methods that **need** randomness and independence assumptions in cases of non-probability sampling?”

    If it truly *needs* randomness and independence, then no. But what does *need* this? I’d say p values for example really need this. The concept of a p value is that you’re assuming your data collection is basically from a random number generator.

    But, Bayesian models just don’t make this assumption (in general). So, instead you should ask yourself, given what I have, what assumptions do I need to make in order to make the inferences I need, and are those assumptions valid?

    For example: you need an inference about how well a cancer treatment works on average over *all* patients with the associated type of cancer for a given outcome measurement (not just the kind of patient that shows up at your hospital).

    The assumption you’ll need to make is that conditional on whatever your predictors are, the outcomes and their variabilities are on average the same for your convenience sample as for the groups not captured by the convenience sample.

    That’s not necessarily a great assumption. So you might prefer to assume that the difference between the outcomes is within plus or minus some margin of error that you specify. Then you can, for example, impute the differences given this assumption and get a realistic uncertainty interval. Or, you might identify a group that you think is “more typical” of the under-sampled group, and then use a model of the population and the results from this sub-group to impute the overall effect across the whole population…

    Etc. Bayesian models can handle any straightforward assumptions you want to put into them. They’re only as good as the assumptions you make, so think about the assumptions you need.

    2) “Can sample size (when infinitely large!) be a rescue for application of proportional hazard models (including fitting Kaplan-Meier curves!) in cases of non-probability sampling?”

    Clearly not. Suppose you have a very large population of people, like all the people in the US. Now you conveniently sample all the health outcomes in 20 counties over 10 years; let’s say that’s 2 million people. These counties are predominantly poor and black. You extrapolate from the poor black counties to the whole US. Your standard errors are tiny because 2 million people are included, but they are a terrible estimate of what’s going on in rich white New England counties or other areas very different from your counties. (A small simulation of this point is sketched at the end of this comment.)

    3) “Can I fit proportional hazard models to a dataset derived from convenience sampling?”

    You can fit it, but how can you interpret it? Given the above (1) and (2) clearly you can’t interpret it as telling you about any population other than the one covered well by the convenience sample.

    To get extrapolations, you’ll need to make real valid assumptions about what is going on in the un-sampled or under-sampled population.
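
Here is the simulation sketch promised above, with everything (the population size, the outcome scale, the county structure) made up purely for illustration: a huge convenience sample drawn almost entirely from one kind of place gives a confidence interval that is both microscopically narrow and nowhere near the population value.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "US": a health outcome that is worse in poor counties.
N = 5_000_000
poor = rng.binomial(1, 0.2, size=N)        # 20% of people live in poor counties
outcome = rng.normal(70.0 - 8.0 * poor, 10.0)

# Convenience sample: about a million people, essentially all from poor counties.
idx = np.flatnonzero(poor == 1)
sample = rng.choice(idx, size=min(1_000_000, idx.size), replace=False)

est = outcome[sample].mean()
se = outcome[sample].std(ddof=1) / np.sqrt(sample.size)

print(f"population mean      : {outcome.mean():.2f}")
print(f"convenience estimate : {est:.2f} +/- {1.96 * se:.3f}")
# The interval is microscopically narrow and nowhere near the population mean:
# more data from the same biased source just makes you more confidently wrong.
```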

    • Daniel – re: “you need an inference about how well a cancer treatment works on average over *all* patients with the associated type of cancer for a given outcome measurement (not just the kind of patient that shows up at your hospital).”

      I’m not saying you are wrong here, but I think it is important to realize that the “average [effect] over *all* patients with the associated type of [whatever]” is not always the thing a researcher *should* want. For instance, suppose that there is a strong selection into seeking cancer treatment, such that those who seek it are likely to respond differently than those who do not (so, say, white kids from rich countries respond better to drugs designed for white people in rich countries). Then do we really care about the “average” treatment effect over *all* potential patients, or just care about effectiveness over the people who are likely to seek treatment? Or maybe we are interested in effectiveness on the *next* (marginal) patient – the one who could be brought into treatment if the treatment proved effective.

      I guess my point is that Z’s comment above is actually more broadly relevant than you give it credit for: a convenience sample is a perfectly good sample of the population it draws from. If we want to know the effectiveness of cancer drugs for kids who seek them (so that excludes kids in like 70% of the world, as well as kids from sufficiently marginalized groups that they don’t seek treatment because they can’t or don’t know about cancer), then this “convenience” sample seems a lot more like a sample from the population of interest.

      Now, this might not be a good sample of the population of treatment-seeking children, but that is a different question and should be posed as such.

      • that’s why I put “for example” before my statement “you need an inference…”

        The person needs an inference for something… If they just want to know what happened in the treated population of their study, the data gives them that. So for whatever their population of interest is, they need an inference for that.

        Getting that inference requires them to build some model. If they got their sample by random number generator from among all people who are in the population they care about, then the model “builds itself”. Any other case requires explicit assumptions that are different from the RNG assumptions.

        I totally agree with you that convenience samples are useful information, but I think it’s correct of this person to question the application of models that implicitly assume random number generator based sampling, which is basically all methods derived from Frequentist assumptions.

        • And I totally agree with you that this person should question the kinds of inferences they can make from this data, and that to make inferences that are relevant beyond the sample requires assumptions.

          But this is always true. It isn’t a frequentist problem, it is an empiricist problem.

        • Users specify log density functions in Stan’s probabilistic programming language and get:

          *full Bayesian statistical inference with MCMC sampling (NUTS, HMC)

          *approximate Bayesian inference with variational inference (ADVI)

          *penalized maximum likelihood estimation with optimization (L-BFGS)

          http://mc-stan.org

        • Obviously, because SAS models are created by people with a PhD in statistics and many years of experience, they can give those explicit coverage guarantees, whereas all the average user of Stan gets is NUTS.

        • OK, so to be less glib. Inside the typical Frequentist procedures coded up in packages like SAS are a bunch of assumptions about distributions and independence and so forth, which are the focus of the question here. They are:

          1) unchangeable, so you either get whatever they output, or basically nothing;
          2) rarely understood in detail by users;
          3) violated to some degree in any given problem, with generally no good “rule of thumb” for how much violation is tolerable;
          4) explicitly based on an attempt to calculate how often things will occur under repeated application of a procedure.

          In a Stan program you have assumptions which:

          1) are infinitely changeable by the user;
          2) are chosen explicitly by the user;
          3) can have their violations investigated by the user, by choosing alternative models or including a mixture of multiple models;
          4) are explicitly based on an attempt to calculate how sensitive the inferences are to whatever is left unknown, regardless of whether there’s any sense in which repetitions or some other factor are responsible for this sensitivity.

          In a situation where the differences between what is observed in a given sample and what would have been observed in some other population are not due to the sampling distribution of some statistic under some known sampling rule… I find it hard to see how you can logically apply Frequentist procedures to come to the desired final results.

        • No worries mate? Are you an Aussie or a Kiwi? Struth, I’d never have guessed.

          PS Aunty Helen for the next UN Sec-Gen (and not Kevin 07 please!)

        • Daniel,

          Vaguely related: I am always amazed/amused when someone insists that “subjective”==”bad” ==> “priors”==”bad” ==> “Bayesian”==”bad”, but then looks at a QQ plot and judges that normality assumptions are met.

  4. “Fitting” a model generally doesn’t require the model to be true. You can actually do all these things but you’ve got to be careful when it comes to interpretation. The problem with convenience samples is not with fitting models but rather with generalisation. Fitting models is in this case something rather exploratory, some kind of summary of what’s in the data when considered in the light of the model. Many techniques that are derived from models (including Kaplan-Meier curves or proportional hazards) can be given a direct interpretation that doesn’t require the model to be true, or the sample to be representative (although one needs to be careful to interpret in ways that indeed don’t implicitly assume the model to be true); a bare-bones version of the Kaplan-Meier calculation, read purely as a data summary, is sketched at the end of this comment.

    You can’t generalise to a population that your data don’t represent, and generally with convenience samples it’s not clear what population they represent, or whether there is one at all. In some cases people argue that no systematic difference between the convenience sample and the population of interest is to be expected; such arguments can be very weak but may occasionally carry a certain degree of sense with them.

    In many respects this is pretty similar to what Daniel Lakeland wrote but from my point of view there’s nothing specifically Bayesian or frequentist about this. (Meaning that you can go down this route without the need to make up a prior.)
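
As an illustration of that “direct data summary” reading of Kaplan-Meier, here is a bare-bones product-limit estimator written from scratch on invented follow-up times (no claim that these numbers represent anything): it simply describes the observed times, whatever population they came from.

```python
import numpy as np

def kaplan_meier(times, events):
    """Product-limit survival estimate.
    times  : follow-up times
    events : 1 if the event was observed, 0 if the observation was censored
    Returns the distinct event times and S(t) just after each of them."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    order = np.argsort(times)
    times, events = times[order], events[order]

    surv = 1.0
    out_t, out_s = [], []
    n_at_risk = len(times)
    for t in np.unique(times):
        here = times == t
        d = events[here].sum()           # events at this time
        if d > 0:
            surv *= 1.0 - d / n_at_risk  # the product-limit step
            out_t.append(t)
            out_s.append(surv)
        n_at_risk -= here.sum()          # drop events and censorings from the risk set
    return np.array(out_t), np.array(out_s)

# Invented follow-up times in months; 0 in the second list means censored.
t = [2, 3, 3, 5, 8, 8, 12, 15]
e = [1, 1, 0, 1, 1, 0, 1, 0]
for ti, si in zip(*kaplan_meier(t, e)):
    print(f"S({ti:g}) = {si:.3f}")
```

Nothing in that arithmetic requires a representative sample; it is the step from this summary to claims about other children that needs the extra assumptions discussed in this thread.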

    • The essence of Bayesian vs Frequentist isn’t “prior” vs “no prior”, it’s whether probability is an extension of two-valued logic to real-valued logic, or probability is related to how often something occurs under sampling.

      Since the sampling model inherent in any push-button regression technique is by assumption violated in this case, there is no interpretation of the output of such a computation.

      Frequentist analysis has the general flavor:

      If D is a sample from P with sampling property S, then f(D) differs from f(P) by X amount only Q% of the time that the procedure is repeated with new samples D.

      We then say “D is not a sample from P with property S”

      and are left with… “So go home and make a sandwich”

        • The big advantage to Bayesian statistics as far as I can see is not the “informed prior” but the “informed likelihood”. That is, a likelihood based on science not based on notions of repeated sampling from a random number generator that stands in for the science.

          See the ABC approach for an extreme example: choose a random set of coefficients from the prior… simulate your system using a black box PDE solver or whatever, calculate some summary statistics of the output… and assert that the summary statistics will have some probabilistic relationship to the data summary statistics when the parameters are correct (such as that they will differ by less than some amount q, or that they will be jointly normal with a given standard deviation… or whatever best fits your knowledge).

          It’s really the fact that you don’t have to substitute a RNG for science that is what makes Bayes great. The prior is really just ancillary.
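
A toy version of that ABC recipe, with a trivial random-walk simulator standing in for the black-box PDE solver (the prior, the summary statistic, and the tolerance are all invented for illustration): draw parameters from the prior, simulate, and keep the draws whose summary statistic lands close to the observed one.

```python
import numpy as np

rng = np.random.default_rng(2)

# "Observed" data: a drifting random walk whose drift mu is unknown (truth: 0.7).
true_mu, n_steps = 0.7, 200
observed = np.cumsum(rng.normal(true_mu, 1.0, size=n_steps))
s_obs = observed[-1] / n_steps              # summary statistic: average increment

def simulate(mu):
    """Stand-in for the expensive black-box simulator."""
    return np.cumsum(rng.normal(mu, 1.0, size=n_steps))

# ABC rejection: draw from the prior, simulate, keep draws whose summary
# statistic lands within a tolerance eps of the observed summary statistic.
eps, kept = 0.05, []
for _ in range(20_000):
    mu = rng.uniform(-2.0, 2.0)             # prior on the unknown coefficient
    s_sim = simulate(mu)[-1] / n_steps
    if abs(s_sim - s_obs) < eps:
        kept.append(mu)

kept = np.array(kept)
print(f"accepted draws : {kept.size}")
print(f"posterior mean : {kept.mean():.2f}   (true mu = {true_mu})")
```

The accepted draws approximate the posterior implied by the prior, the simulator, and the chosen tolerance on the summary statistic.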

        • “The big advantage to Bayesian statistics as far as I can see is not the “informed prior” but the “informed likelihood”. That is, a likelihood based on science not based on notions of repeated sampling from a random number generator that stands in for the science…

          …It’s really the fact that you don’t have to substitute a RNG for science that is what makes Bayes great.”

          I don’t really understand this comment. It doesn’t seem to reflect an accurate conception of non-Bayesian statistics at all – where does this view come from?

          Also –

          “choose a random set of coefficients…simulate your system using a black box PDE solver or whatever, calculate some summary statistics of the output…assert that the summary statistics will have some probabilistic relationship to the data summary statistics”

          Seems to be very much “repeated sampling from a random number generator”, just a random number generator with some structure – i.e. you’ve constructed a model with both signal and noise. Almost all classical probability/statistics models applied in science have both ‘randomness’ and ‘structure’, ‘noise’ *and* ‘signal’.

          I’m not saying Bayesian stats doesn’t facilitate a model-based approach (I like to use it myself), but it seems like a caricature of other approaches to say they don’t also use science-based models.

          So, I think there’s lots of “non-Bayesian” statistics that are really Bayesian statistics in disguise. In particular, any likelihood based inference is a type of Bayesian model. People who steadfastly “refuse to” include a prior are just people who steadfastly put constant priors on the whole real line. If you object that such a thing doesn’t exist, then I say there’s a perfectly reasonable mathematical interpretation for that in nonstandard analysis (equivalent to a prior 1/(2N) on the interval [-N,N] where N is a nonstandard integer).

          So, if you’re putting some default distribution for a likelihood and then doing ML estimation, you’re doing Bayesian analysis, because that “default” distribution isn’t based on the frequency of anything, it hasn’t been validated as a frequency distribution in any way in the vast majority of cases where it’s done. Even if it has been validated as a frequency distribution, it’s still a perfectly good special case of a Bayesian analysis, it’s fine to use frequency distributions as Bayesian distributions. The Bayesian case just includes cases where we don’t know the frequency distribution or we use some other method to get a distribution.

          “True” frequentist statistics is statistics where the frequency properties are relied on to be true. That is the case when you have RNG based sampling from a finite population and you extrapolate to the population by virtue of what you know about the sampling distribution of a sample statistic. If your inference is based on the sampling distribution of a sample statistic, you’re doing “real” Frequentist analysis.

          There’s actually LOTS of “classical” statistics that DOESN’T look like that, it looks like Bayesian stats with a flat prior, and so people say “what’s the big deal?” Well the big deal is you’re already doing Bayesian statistics, and if you accept it you can widen the class of models you could use even more.

          As for the ABC method. You have to distinguish between what the *model* is built on, and *how the computation is done*. The goal of sampling from a posterior is to connect the frequencies output by the sampler to the probabilities implied by the distribution you specified. But that all happens *in parameter space*, and is a mathematical convenience for calculation.

          In Frequentist statistics, the assumption is that the frequencies happen *in data space* and that a sample statistic has a well defined sampling distribution that you can rely on. That requires a “RNG” like behavior *of the world*. A Bayesian analysis says that the distribution put on the data is *in your head* whereas a Frequentist analysis makes strong claims *about the world* to derive probabilistic bounds *for sample statistics*.

        • The TLDR version, we can distinguish 3 classes of things that are done:

          1) Frequentist statistics in which we derive sampling distributions of sample statistics and rely on them to figure out something about unobserved stuff.

          2) Likelihoodist statistics, where we do Bayesian statistics limited to the subset where there is a flat prior on the parameter and usually also where the likelihood is required to be of an IID type.

          3) Full Bayesian statistics, where we write down a joint distribution and we don’t require flat priors, and we don’t require IID type likelihoods.

          You might also distinguish 2a and 2b, where

          2a) Frequentist Likelihoodist where we try to only use likelihoods based on validated frequency properties

          2b) Default-Only Bayesian Likelihoodists, where we plug in things like least squares as a default distribution without validating the normality assumption it implies.

          Daniel, your TL;DR summary is great! It reminds me of one time I presented the results of a (semi)-mechanistic hierarchical Bayes model for predicting plant fecundity in the field. We’d done a ton of modeling to combine two relevant datasets – choice of likelihoods, reparameterization, functional forms – but discussion at the end got derailed by the question, “is Bayes with flat priors really Bayesian?”. At the time, I explained it as a default choice (not ideal, but a starting point) that allowed us to use MCMC machinery to estimate parameters, propagate uncertainty, etc. Now I’m tending to agree with you that the REAL Bayes was in writing down a joint distribution, and then factoring it out in a certain way that was scientifically interesting/plausible. The priors were just a distraction! Thankfully, reviewers saw it that way too.

          Chris

        • All stochastic models are ‘random number generators’ with some structure and some noise, usually not fully unambiguously distinguishable. Ultimately they make predictions in data space which are used to ‘update’ information on parameters, whether by Bayes or otherwise. It’s the analysis and interpretation of such models that usually differs.

          For a given stochastic model a frequentist analysis typically makes *weaker* assumptions on the model being true than likelihood/Bayes-based analysis. It is more of a ‘meta’ assessment of a given model.

          On the other hand this higher-level analysis can often be reinterpreted as a Bayesian analysis of a *different* likelihood (or joint) model, and thus the cycle continues with no real ‘winner’ just different perspectives.

          Consider the interpretation of ABC – which almost always uses summary statistics – it can either be interpreted as a rather frequentist-style analysis of a stochastic model based on summary statistics or as a Bayesian model with a modified likelihood.

        • OJM: I take it as fundamental to Frequentist statistics that probability is entirely defined in terms of repeated sampling.

          From that perspective, it’s easy to see how to analyze something like “there are 100000 people, I use an RNG to get a sample of them and calculate their average height as h, the average height of all of them is therefore h’ which is within certain probabilistic bounds of h”

          This is “automatic” extrapolation based on the properties of random sampling. If we have less than uniform random sampling, then we can derive a different sampling distribution and use it to get a different probabilistic distribution for h’ provided we know the rule for sampling and sufficient information about the population.

          The Bayesian approach simply DOES NOT do this in general. If I collect a series of radar measurements of position x(t_i) there is no sense in which there is a bag of radar measurements *in the world* from which my particular x(t_i) were drawn via a defined random sampling procedure. That would be the case for example if I had say 30,000 radar machines and chose one of them to get the data from, and wanted to find out what the other 30,000 machines might have said.

          I can, if I want, use my likelihood function to generate a bunch of random “pseudo-radar-measurements” x_j(t_i) but there is no sense in which any of these alternative realizations need to exist in the world. My calculation doesn’t rely on them. It only relies on the measurements I actually have.

          For a Frequentist, I’d say they really need those 100,000 people to be out there in the world, and for the distribution of the values to be at least reasonably approximated by their chosen frequency distribution, and for the sampling procedure to be at least reasonably approximated by the one used to derive the calculation; otherwise what they’re doing is just pure drivel.

          For a Bayesian, what they need is for there to exist values of the parameters such that the model for the measurements fx(t_i, params) is predictive of the actually observed x(t_i) to within the tolerance expressed by the chosen likelihood function, so that some of the param values are downweighted from their prior weights.

          There is no sense in which the Bayesian requires a random number sequence to “generate” the data values, although you can ALWAYS *CONSTRUCT* an artificial random number generator which generates according to ANY well defined probability distribution.

        • I’m not too keen to go on a big defending frequentism tangent, but I’ll just state for the record that I personally think this is a pretty inaccurate characterisation of what frequentist analysis is about and what it assumes.

        • ojm: fair enough. I may well misunderstand Frequentist philosophy, but this is more or less the distillation I’ve come to after starting out life as a guy who read standard Stats textbooks and just did the kinds of things they taught, and then slowly became dissatisfied with those explanations and tried to figure out when different methods had any logical basis and when they were pure “cargo cult”.

          I think lots of people just go ahead doing what they were taught without any real commitment to a “philosophy” of any kind, and so I throw them out of the Frequentist camp, they’re just “defaultists” or whatever, at the very least you can say they don’t contribute to a frequentist philosophy since they aren’t actually committed to any philosophy.

          I also acknowledge that Frequentist philosophy is extended to the idea of repeated sampling through time. So, for example, that there are stable frequencies with which, say, fertilizer treatments perform, and you could do crop studies over 3 or 4 years and then try to extrapolate to the following 5 years. But I disagree with this idea in general, as I think it imposes a RNG structure on the world that the world doesn’t have, or frequently the approximation is fine for some definition of “short duration”, but changes dramatically on longer timescales, and the only way to know whether and how it will change is basically Physics and Chemistry and so forth.

        • For a directly practical example, Stan has some people from various pharma companies involved in its development. They want to do pharmacokinetics. Their problems involve fitting the results of multi-compartment ODE models for the distribution of the drug throughout the body.

          I imagine that in those kinds of models you specify an ODE with unknown coefficients, or coefficients that are unknown functions of time, or of concentration, or concentration differences, or the like. You then need to specify a probability distribution for the differences between the ODE predictions and the measured concentrations in various tissues at various times. Stan then moves around through parameter space and spends its sampling time mostly where the errors between the ODE predictions and the measurements match the claimed error distributions (where the likelihood is high).

          I suppose you might be able to torture yourself into believing this has some Frequentist interpretation… (in an infinite set of worlds where copies of this particular single person are each having their blood drawn at times t1,t2,t3… there is a distribution of measurement errors for the concentrations… or whatever)

          But at least the Bayesian interpretation is clear: we don’t know everything, but the information we specify about the process and the data lets us find out more than we started with because we spend time sampling from the region of parameter space where the model fits the data pretty well.
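
A stripped-down illustration of that kind of model, in plain Python rather than Stan, with a one-compartment model, fake blood-draw data, and a crude random-walk Metropolis sampler all invented for the sketch: an ODE prediction, a normal distribution for the prediction errors, and sampling that concentrates where the model fits the data well.

```python
import numpy as np
from scipy.integrate import odeint

rng = np.random.default_rng(3)

# One-compartment model with first-order absorption (volume folded into units):
# the state is (amount in gut, plasma concentration).
def pk_rhs(y, t, ka, ke):
    gut, conc = y
    return [-ka * gut, ka * gut - ke * conc]

def predict(ka, ke, times, dose=100.0):
    grid = np.concatenate(([0.0], times))
    sol = odeint(pk_rhs, [dose, 0.0], grid, args=(ka, ke))
    return sol[1:, 1]                        # predicted concentration at the draw times

# Fake "measured" concentrations at a handful of blood-draw times.
t_obs = np.array([0.5, 1.0, 2.0, 4.0, 8.0, 12.0, 24.0])
true_ka, true_ke, sigma = 1.2, 0.15, 2.0
y_obs = predict(true_ka, true_ke, t_obs) + rng.normal(0.0, sigma, t_obs.size)

def log_post(ka, ke):
    """Log posterior: normal distribution for the ODE prediction errors,
    flat priors on ka and ke restricted to positive values."""
    if ka <= 0.0 or ke <= 0.0:
        return -np.inf
    resid = y_obs - predict(ka, ke, t_obs)
    return -0.5 * np.sum((resid / sigma) ** 2)

# Crude random-walk Metropolis over (ka, ke); Stan's NUTS does this far better.
theta = np.array([0.5, 0.5])
lp = log_post(*theta)
draws = []
for i in range(5_000):
    proposal = theta + rng.normal(0.0, 0.05, size=2)
    lp_prop = log_post(*proposal)
    if np.log(rng.random()) < lp_prop - lp:  # Metropolis accept/reject step
        theta, lp = proposal, lp_prop
    if i >= 1_000:                           # discard warmup draws
        draws.append(theta.copy())

draws = np.array(draws)
print("posterior mean (ka, ke):", draws.mean(axis=0).round(3))
print("true values    (ka, ke):", (true_ka, true_ke))
```

In Stan the same structure is written declaratively (the ODE right-hand side in the functions block, the error distribution in the model block), with NUTS doing a far better job of the exploration.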

        • I think I’m fine with this example. Generally, though, frequentist models and likelihoods can be “informed”, too. I don’t think “based on science” is in contradiction to “based on repeated sampling”. Science helps with frequentist modelling (e.g., specifying a functional form of a regression model), and observations from repeated sampling inform the science.

        • >>>You then need to specify a probability distribution for the differences<<<

          So, that's akin to a subjective prior, right? It's adding some fair bit of value to this analysis?

          Otherwise, why not just solve the ODEs & minimize the mean squared errors between the predicted vs measured concentrations as a function of the ODE parameters?

          I believe that's what a lot of Chemical Kinetics codes do. e.g. Athena or Chemkin

          https://en.wikipedia.org/wiki/CHEMKIN

        • Minimizing mean squared error is just the assumption that the errors are normally distributed and that you are interested in maximum likelihood.

          If you have alternative information, such as for example that errors are correlated in time, or that the measurement errors are more precise at different concentrations or at different times or in different tissues, you could write a likelihood that was informed by those correlations, like a non-stationary gaussian process for example. That’s what I did in fitting stuff to ODEs for my dissertation.

          And, of course, even if you use IID normal errors, you may still want to see the uncertainty intervals or relationships between your unknown coefficients etc.

          So, yes, it adds value to include additional information beyond some kind of pure default assumption. Including it in the likelihood is where the real action is because the likelihood is how you get info out of your data.
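
A quick numerical check of the first point above, on made-up regression data: minimizing squared error and maximizing a normal likelihood (with the errors treated as iid) land on the same coefficients.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)

# Made-up regression data.
n = 200
x = rng.uniform(0.0, 10.0, n)
y = 3.0 + 0.8 * x + rng.normal(0.0, 1.5, n)
X = np.column_stack([np.ones(n), x])

# 1) Ordinary least squares.
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# 2) Maximum likelihood under iid normal errors: the negative log likelihood is,
#    up to constants, just the sum of squared residuals.
def neg_log_lik(beta, sigma=1.5):
    resid = y - X @ beta
    return 0.5 * np.sum((resid / sigma) ** 2)

beta_ml = minimize(neg_log_lik, x0=np.zeros(2)).x

print("least squares          :", beta_ls.round(4))
print("normal max. likelihood :", beta_ml.round(4))  # same up to optimizer tolerance
```

Swapping in a different error model (heavier tails, time-correlated errors, non-constant precision) changes the estimator, which is the sense in which the likelihood is where the extra information about the errors enters.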

        • >>>If you have alternative information<<<

          Indeed. That's kinda my point. "If you have alternative info" Bayesian makes a lot of sense.

        • “why not just solve the ODEs & minimize the mean squared errors”

          Why not minimize sum(abs(error) + error^2/2+error^4/4) ???

          I prefer to think that Cox/Jaynes probability theory explains why (and when) least squares works rather than Bayesian analysis being something extra on top of some “default” least squares method.

          Least squares IS one Bayesian method. It’s the method you use when errors can go either way equally likely to similar degrees, each error gives very little information about other errors, and very large errors are rare, the way that very large values like 7 or 10 or more are rare to come out of a normal(0,1) RNG.

          There are plenty of situations where that’s not the case, but there are plenty of situations where that IS the case. And it’s not “non bayesian” to do Least Squares, just like it’s not “non logic” to use “A, and if A then B” to deduce B, just because it’s kind of the “easy case” that doesn’t need a college course in symbolic logic to figure out.

          It’s so rare that anyone verifies the frequency distribution of errors that in fact virtually everyone doing least squares is doing Bayesian analysis without knowing it. Their state of information isn’t “an infinite string of errors has frequency distribution close to the normal curve” it’s “these errors are rarely all THAT big compared to the standard deviation”

  5. “Since the sampling model inherent in any push-button regression technique is by assumption violated in this case, there is no interpretation of the output of such a computation.”

    Not true. The least squares regression line is a mathematically well-defined data summary. Even the p-values still tell us to what extent the data are compatible with what would be expected under the null model (although of course if they are not, this doesn’t mean at all that an alternative the researcher may be interested in is true, because we already know another way in which the null model is violated).

    I’m not interested in whether indeed D is a sample from P. I know it’s not. I may still be interested in exploring whether the data alone are incompatible with P, and if so in what way.

    See, I know that models are not “true”, so I’m not going to use any frequentist output on that basis. It’s rather about comparing the data with what I know are thought constructs.

    • It’s always possible to fit a model and ask how well it fits the data you have. But, for a finite population and an undefined sampling procedure, HOW can you determine how well it fits the data you don’t have? You can’t do it using Frequentist statistics, because the implicit claim in Frequentist statistics is that the sample will *rarely* be too different from the population, and we can put probabilistic bounds on how rare based on the sampling model, the population size, the sample size, the observed deviations, etc. But we have no sampling model; we just know that there’s a systematic difference, but not in what way.

      In a Bayesian analysis, we can *specify* the relationship we believe exists between the sample and the population. This relationship may be based on scientific information OTHER than sampling frequencies, and then we can find out what that assumption implies probabilistically about the population, it gives us an automatic sort of sensitivity analysis. Of course, frequentists can do all kinds of sensitivity analysis as well: “if there’s a relationship between sample and population that looks like Q then the results in the population are going to look like P etc” Bayesian analysis essentially automates that process using weighting of the plausibility of different possible relationships.

      We can specify a GOOD model, or a BAD model, and we may have no way of determining whether our model really is good or bad without additional data, but the Bayesian machinery guarantees at least that for whatever model we specify there is a particular well defined answer, it maps assigned probabilistic weights over assumptions to well-posed mathematical probabilistic predictions about the world.

      Frequentism is a subset of Bayesian analysis. This problem just seems to fall outside of the subset though. Without a sampling model that could tell us how our sample differs from the population of interest, we MUST use something OTHER THAN sampling theory to make the connection.

      • Daniel: I have no problem in general with Bayesian analysis and I’m fine with it in many situations. The results can, as you know, often be given a frequentist interpretation, but even if this isn’t possible or desirable I may still be fine with them.
        There’s no need to convince me that this makes sense.

        I think though that in many situations it makes sense to a) interpret probabilities in a frequentist manner (albeit admittedly idealised and yes, I do use something other than sampling theory when setting up models) and b) to proceed without specifying a prior. I’m a pluralist. I’m often happy with Bayesian ideas but ultimately annoyed by quite a bit of Bayesian propaganda that tends to portray anyone as an idiot who does anything else.

        • So, I should say that some of the reason I keep replying here is that I hope people other than just the person I’m replying to are listening. In particular I hope the person doing this actual cancer data analysis is listening, because I’d like to think that people start taking these issues seriously in areas where it matters, and I think posting this question to Andrew is a sign that someone is taking this issue seriously.

  6. There’s a sense in which, as soon as you choose a model for some data, the only certain thing you know is that you’ve picked the wrong model. I would not ask the question “is it statistically appropriate?” but rather “will my particular violation of some notion of statistical appropriateness materially influence my results?”

    I would simulate some data and try to get a handle on the size of the problem. E.g., suppose a proposed model is outcome = f(coef1, …, coefn) + error, where f is your model and the coefs are the model parameters. Say you know that people present to your center only if they are very sick. Guess some reasonable values for coef1 … coefn, guess the coefs for very sick people (who present at your center), simulate some data, and try your proposed model on both the simulated whole population and the very-sick subsample. See what shifts your model around, in simulation, under what you think are realistic assumptions for the parameters. This should tell you whether models based on the subsample and on the whole population give similar guidance and would result in similar interpretations.
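
A compact version of the simulation being suggested here, with all numbers invented: generate a population in which the treatment effect depends on (latent) severity, fit the same simple model to the full population and to the "very sick people who show up at the centre", and see how far the coefficients move.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy population: the treatment helps more for sicker patients,
# and severity itself hurts the outcome.
N = 50_000
severity = rng.normal(0.0, 1.0, N)
treated = rng.binomial(1, 0.5, N)
outcome = treated * (1.0 + 0.5 * severity) - 2.0 * severity + rng.normal(0.0, 1.0, N)

# Only the sickest ~20% present at the centre (the convenience sample).
presented = severity > np.quantile(severity, 0.8)

def fit(mask):
    """OLS of outcome on an intercept, treatment, and severity within `mask`."""
    X = np.column_stack([np.ones(mask.sum()), treated[mask], severity[mask]])
    coef, *_ = np.linalg.lstsq(X, outcome[mask], rcond=None)
    return coef.round(3)

print("whole population (const, treat, severity):", fit(np.ones(N, dtype=bool)))
print("centre sample    (const, treat, severity):", fit(presented))
# The treatment coefficient shifts noticeably between the two fits, because the
# simple model leaves out the treatment-by-severity interaction that the
# convenience sample is selected on. That shift is the size of the problem.
```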

    • You mean, consider some kind of “range” of possible values for the coefficients that you might have “prior” to actually collecting data, and then figure out how “likely” it is to see the data you do have under some kind of “data model” for how the data arises, given a variety of different values of the coefficients based on your prior knowledge, and keep the values of the coefficients that aren’t too inconsistent with the actual data, and then use them and your prior knowledge of how things might behave in the unobserved population to extrapolate how the treatment might actually perform across both the sampled and the un-sampled portion of the population?

      Maybe we could name it after a Reverend from the 1700’s.

      :-)

    • Might it be that he knows that what they have used is not a random sample, but does not know in what systematic ways the parameters would differ from those of a random sample?

  7. Is there no set of assumptions that does not involve “implicitly assuming the probability of a person being included in your sample depends only on the variables included in your regression model”?

    It seems to me that if you assumed that the true underlying model was a linear function using only variables captured by the regression model, plus normally distributed noise, then you could expect the distribution of the coefficients given by the fit to be multivariate normal around the true value, according to the size of the noise and the leverage of the independent variables, just as expected by your favorite “turn the crank” statistical package, regardless of the source of the independent variables. This is because the analysis would have been of the dependent variables given the independent variables; in other words, after conditioning on the independent variables.

    I am prepared to believe that in practice this set of assumptions is implausible in this case. But if you are going to state that the original analysis implicitly relies on implausible assumptions, and there is more than one set of implausible assumptions that would justify the analysis, how do you know which one to criticize them for?
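
A quick simulation of this conditioning argument, with invented numbers: when inclusion in the sample depends only on the regressors, the usual slope estimate behaves as advertised; when inclusion depends on the outcome (and hence on the unmodeled noise), it does not.

```python
import numpy as np

rng = np.random.default_rng(6)

def mean_slope(rule, reps=200, n=5_000):
    """Average OLS slope across simulations when the sample is kept by `rule`.
    The true model is y = 1 + 2x + noise, so the target slope is 2."""
    slopes = []
    for _ in range(reps):
        x = rng.normal(0.0, 1.0, n)
        y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, n)
        keep = rule(x, y)
        X = np.column_stack([np.ones(keep.sum()), x[keep]])
        slopes.append(np.linalg.lstsq(X, y[keep], rcond=None)[0][1])
    return np.mean(slopes)

# Inclusion depends only on the regressor: the slope estimate is fine.
print("select on x (x > 0)      :", round(mean_slope(lambda x, y: x > 0), 3))

# Inclusion depends on the outcome, i.e. also on the unmodeled noise: biased.
print("select on outcome (y > 1):", round(mean_slope(lambda x, y: y > 1), 3))
```

So the real question is whether inclusion plausibly depends only on the covariates in the model, not whether the data came from convenience sampling per se.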

    • Yes. You can assume the convenience sample represents a population which can be adequately described by available data and compared to larger populations of which this population is a part.

  8. So, it seems to me that ultimately, you can get what you need from a study like this if you assume the following:

    1) Conditioning on the covariates you measure within the sample gives you a good estimate of what happens under those covariate conditions for people both within sample and outside of the sample (there are no black swans, things that are VERY different from your sample out there in the world). Or, you have at least probabilistic bounds on how different things might be (so you could put informative priors on the differences or the ratios etc).

    2) You can get independent information about the larger population which would let you find out how many people are in each region of the covariate space in the larger population, possibly with uncertainty (ie. you can put an informative prior on the covariate density in the larger population).

    You can then fit your model in-sample (in the “model” block of a Stan program, for example), and use it to predict results (in the generated quantities block) for the larger population based on your knowledge from (2).

    If (2) has uncertainty involved it will interact with the uncertainty from your model fit (1) to give you even more uncertainty about the population. This is a feature not a bug. You really DO have that kind of uncertainty.

    My assertion is that any other technique is either equivalent to this in some way, or wrong. You need *both things*, an assumption about the way in which the sample can be connected to the unsampled population… and an estimate of what’s going on in that unsampled population at least with respect to predictors to apply your predictions to.
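
A rough numerical sketch of (1) and (2) together, in plain Python rather than Stan, with the cell estimates, the prior on the in/out-of-sample difference, and the assumed population covariate shares all invented for illustration: fit in-sample, allow for the unsampled people to differ, and propagate both sources of uncertainty into a population-level prediction.

```python
import numpy as np

rng = np.random.default_rng(7)

# (1) In-sample fit, summarized as outcome (say 2-year survival) by covariate
#     cell, with standard errors. These numbers are invented.
cell_mean = {"low_risk": 0.80, "high_risk": 0.55}
cell_se   = {"low_risk": 0.03, "high_risk": 0.04}

# Assumed prior on how different *unsampled* children in the same cell might be:
# centred at "no difference", but allowing shifts of several percentage points.
diff_sd = 0.05

# (2) Covariate distribution in the target population, known only imprecisely:
#     external data suggest roughly 60% of all such children are "high_risk".
share_a, share_b = 60, 40                  # Beta(60, 40): mean 0.6, some spread

# Propagate all three sources of uncertainty by simulation.
draws = []
for _ in range(20_000):
    mu = {c: rng.normal(cell_mean[c], cell_se[c]) for c in cell_mean}
    shift = {c: rng.normal(0.0, diff_sd) for c in cell_mean}   # out-of-sample shift
    p_high = rng.beta(share_a, share_b)
    draws.append((1.0 - p_high) * (mu["low_risk"] + shift["low_risk"])
                 + p_high * (mu["high_risk"] + shift["high_risk"]))

lo, mid, hi = np.percentile(draws, [2.5, 50, 97.5])
print(f"population-level estimate: {mid:.2f} (95% interval {lo:.2f} to {hi:.2f})")
# The interval is wider than the in-sample one -- as the comment says, a feature,
# not a bug: it reflects real uncertainty about the unsampled children.
```

This is the same logic as the Stan model / generated-quantities split described above, just written out by hand.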
