What exactly is a “representative sample”?

Emilio Laca writes:

Could you refer me to the definition of “representative sample” as you use it in your books. I am interested in developing my understanding of the theoretical and philosophical basis to make statements about unobservables using measurements on subsets of populations. I also want to learn more about how a Bayesian approach changes (or not) how one deals with sampling design. Any readings that you can recommend will be appreciated.

My reply: I don’t think there’s any formal definition. We could say that a sample is representative if we could usefully use it to represent the population. As this definition makes clear, “representativeness” depends on the use to which the sample would be put.

Do readers have other thoughts?

42 thoughts on “What exactly is a “representative sample”?”

  1. I do believe there is at least one “official” definition. A statistically representative sample is a subset generated from a sampling frame by (some version of) a random sampling mechanism. The weight of the argument lies on the process of sampling, not on the outcome sample.

    Now, we do not necessarily need to think in the statistical-inference frame: I find it more useful to define a “representative sample” as any sample which correctly (in the same proportions) reflects all relevant aspects of the population of interest, in the context of our goal.
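The process-based definition above can be sketched in a few lines; the sampling frame, seed, and sample size here are all hypothetical choices for illustration. The point is that the claim to “representativeness” rests on the mechanism, not on any property of the realized sample:

```python
import random

# Hypothetical sampling frame: unit IDs 0..9999.
frame = list(range(10_000))

random.seed(42)  # fixed seed, for reproducibility of the sketch

# Simple random sampling without replacement: every subset of size n
# has the same probability of being drawn.
sample = random.sample(frame, k=100)

assert len(sample) == 100
assert len(set(sample)) == 100          # no duplicates
assert all(0 <= u < 10_000 for u in sample)
```

Any other probability design (stratified, clustered, etc.) would count too; what matters under this definition is that selection probabilities are known and controlled.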

    • Your second paragraph sounds a lot like Andrew’s definition, and it invites at least a couple challenging questions:

      How “same” do the proportions need to be?

      How does one enumerate •all• relevant aspects?

    • The issue with a process-based definition is how to justify “rerandomization” and enforced covariate balance in experimental designs with small-to-medium n.

  2. Why can I not define it like the following? Suppose the sample was drawn from a superpopulation with distribution F(x). From a sample of size n, an empirical distribution f(x) can be obtained. A sample is representative if the Kolmogorov–Smirnov distance between f(x) and F(x) is small.

    I think that, as n tends to infinity, one can define a sampling plan as “representative” if the expected KS distance between f(x) and F(x) tends to 0. However, a sampling plan could still be really bad at finite n. What would be an appropriate risk measure of representativeness here? Is there an optimal “representative” sampling plan in some sense?

    Had the sample been drawn from a finite population, one could always check the KS distance between the sample empirical distribution and the finite-population distribution.

    Why can I not consider matching sample moments against population moments instead?
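The KS-based check above can be sketched with the standard library alone; the Uniform(0, 1) superpopulation, seed, and cutoff here are hypothetical choices for illustration:

```python
import random

def ks_distance(sample, cdf):
    """One-sample Kolmogorov-Smirnov statistic: sup_x |F_n(x) - F(x)|,
    evaluated at the sorted sample points, where F_n is the empirical CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        fx = cdf(x)
        # The empirical CDF jumps at each data point; check both sides of the jump.
        d = max(d, abs((i + 1) / n - fx), abs(i / n - fx))
    return d

random.seed(0)
sample = [random.random() for _ in range(1000)]  # hypothetical Uniform(0,1) draws
d = ks_distance(sample, lambda x: x)             # Uniform(0,1) CDF: F(x) = x

assert 0.0 <= d <= 1.0
assert d < 0.1  # for n = 1000, a genuine Uniform(0,1) sample sits well under this
```

Note this only works when F(x) is known, which is exactly the “central paradox of sampling” raised further down the thread.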

    • I don’t see how any distance measure can adequately be used as a definition. Distances can only be measured for the dimensions of the sample – so issues about “all relevant aspects” still arise, and these aspects will always be related to the intended use of the sample. I don’t believe any mathematical definition can address this, although it may help address the “how close” question.

      • This reminds me of Draper et al.’s (1993) discussion of exchangeability, where they decompose assessments of this into the “logical” parts and those based on “pure faith”. I think some dose of the latter is often required for operational usage of a lot of concepts in statistics. (Also reminds me of Mallows’s Zeroth Problem.)

        Draper et al. 1993: https://www.jstor.org/stable/2982858

    • Josh:

      No, you’re giving a definition of a sampling procedure. I’m asking about the sample itself. There’s just the sample; there’s no sampling distribution or expectation.

      • That’s precisely the type of problem that I have with this term: in sampling, we never talk about a single realization of the sample, we only talk about the properties of the sampling procedure.
        I guess we are just coming from different sides of the coin: you are seeing this more from a modelling/data analysis perspective, whereas I’m coming from a sample design/selection perspective.

  3. This is my favorite pet peeve and something I’ve been arguing for over a decade (and many other sampling statisticians before me for even longer). From a sampling perspective, there is no such thing as a “representative sample” and we should strongly avoid using it.
    There is no formal definition in the sampling literature, and the term is not only ambiguous but can also be very misleading for the layman. William Kruskal and Frederick Mosteller discussed the ambiguity of the term in a series of papers* in the International Statistical Review back in the late ’70s/early ’80s, showing how the term means different things to different groups of people.

    It can be misleading because, for the layman, it may sound like it attributes to a sample (realization) a property that it is impossible to give from a sampling-theory perspective, almost like a seal-of-approval stamp. In sampling theory, we work with the notion of repeated realizations of the sample and derive all the properties from the sampling distribution of such a process. Therefore, we only talk about properties of the sampling mechanism, not about a single sample realization of that mechanism, which is what the term “representative sample” sounds like it is doing.

    Also, for what some people define as a “representative sample”, we already have well-defined technical terms in sampling. For example, some say that a “representative sample” is one whose distribution with respect to some auxiliary variables matches the population distribution. Well, we have a name for that in probability sampling: if you control the selection of the sample over those variables, it is a stratified sample with proportionate allocation, or some sort of balanced sample (as in the work by Jean-Claude Deville, Yves Tillé and others). Even if that is achieved with some sort of calibration adjustment, we also have specific terms for the types of estimators used to get there.

    Moreover, say that, in order to meet a study objective, you are not using a proportionate allocation but instead an equal, Neyman, or some other disproportionate allocation. The sample distribution will not match the population distribution with respect to the stratifying variables (with the exception of some cases). Would it be fair to call it an “unrepresentative sample”, even if it is producing estimates with an even smaller standard error than a proportionate allocation would?

    And ultimately, what we would really want is for our sample distribution to match the population distribution with respect to the study’s outcomes, not with respect to auxiliary variables whose distribution in the population we already know. For that, I really like the notion of the “paradox of sampling” that Alan Stuart introduces in his “Ideas of Sampling” book:

    “The central paradox of sampling is that it is impossible to know, from the examination of a sample, whether or not it is a “good” sample in the sense of being free from selection bias (just as one cannot deduce a witness’s credentials from how he looks). Of course, if we know some details of the population, we can compare the sample to the population in respect to these details; but we can never know everything about the population (and in any case, if we could we should have no reason for sampling it).”

    Finally, even for a very carefully designed sample, and assuming we don’t have any non-sampling errors (which never happens in practice), there is always a chance that the sample realization we drew is an outlying one. Of course, through (probability) sampling techniques like stratification, we aim at reducing the spread of the sampling distribution as much as possible, so that most of the estimates from the sample realizations will be as close as possible to the population parameter. But with multi-purpose surveys with hundreds, if not thousands, of estimates, that is an almost impossible task to accomplish for all the study outcomes. In my opinion, using a term like “representative sample” is not only an over-simplification of the sampling-theory framework, but it also becomes a shortcut that many survey data producers use to avoid having a more transparent and serious discussion of how the sample was selected (accompanied by a discussion of the impact of other non-sampling errors on the estimates). And a lot of the time, this gives the general population a false idea that the sample has a certain property that we cannot ascertain even with the highest-quality probability sample.

    As I said and as you can see on my long “rant” here (sorry for that!), this is my favorite pet peeve and I have a lot of other thoughts about it, but I’ll stop here for now. I won’t promise I’ll stop putting other comments on this post in the future though ;)

    * Kruskal, W., & Mosteller, F. (1979). Representative sampling, I: Non-scientific literature. International Statistical Review, 13–24.
    Kruskal, W., & Mosteller, F. (1979). Representative sampling, II: Scientific literature, excluding statistics. International Statistical Review, 111–127.
    Kruskal, W., & Mosteller, F. (1979). Representative sampling, III: The current statistical literature. International Statistical Review, 245–265.
    Kruskal, W., & Mosteller, F. (1980). Representative sampling, IV: The history of the concept in statistics, 1895–1939. International Statistical Review, 169–195.
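The “stratified sample with proportionate allocation” mentioned above can be sketched as follows; the frame, strata, and sizes are all hypothetical, and real designs would handle rounding of the allocations more carefully:

```python
import random
from collections import Counter

def proportionate_stratified_sample(frame, strata, n, seed=0):
    """Draw a stratified sample with proportionate allocation: each stratum
    contributes in proportion to its share of the frame, so the sample
    matches the population on the stratifying variable by construction --
    one precise version of what people often mean by "representative"."""
    rng = random.Random(seed)
    sizes = Counter(strata)
    total = len(frame)
    sample = []
    for stratum, size in sizes.items():
        n_h = round(n * size / total)  # proportionate allocation for this stratum
        units = [u for u, s in zip(frame, strata) if s == stratum]
        sample.extend(rng.sample(units, n_h))
    return sample

# Hypothetical frame: 6000 urban units (IDs 0..5999), 4000 rural (6000..9999).
frame = list(range(10_000))
strata = ["urban"] * 6000 + ["rural"] * 4000
s = proportionate_stratified_sample(frame, strata, n=100)
# Allocation mirrors the 60/40 population split: 60 urban, 40 rural.
```

An equal or Neyman allocation would change only the `n_h` line, which is exactly why the comment argues that “matching the population distribution” is a design choice, not a seal of quality.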

    • I appreciate the information! Wasn’t aware there was this much research on sampling. Is Ideas of Sampling the best source for learning more?

      • Ideas of Sampling is a well-written little book, but it is very introductory. It’s a good start if you don’t know anything about sampling, but there is much more research in the area. You will find a lot of it in journals like the Journal of Survey Statistics and Methodology, Journal of Official Statistics, Survey Methodology, Survey Practice, and Survey Methods: Insights from the Field, among others. There are also plenty of great textbooks in survey sampling, from classics like Sample Survey Methods and Theory by Hansen, Hurwitz and Madow (1953), Survey Sampling by Kish (1965) and Sampling Techniques by Cochran (1977), to more modern ones such as Sampling Theory and Practice by Wu and Thompson (2020), Sampling: Design and Analysis by Lohr (2021), and Sampling and Estimation from Finite Populations by Tillé (2020).

    • Raphael:

      I agree that a sample can be nonrepresentative and still be useful (the point of your fourth paragraph above). I was not intending to imply that “representative” is the same as “useful.”

      Also, except in some extreme special cases, no sample will be completely representative. Representativeness is an ideal property that will never be satisfied. I still think the concept of a representative sample is useful, as that is the goal of much of survey sampling.

      • Thank you for your reply!

        I’m sorry to keep insisting, but I don’t see how useful a concept can be that does not even have a formal definition, especially in a technical field like statistics. For me, “representative sample” is just as vague and meaningless as when someone asks me how to select a “statistically significant sample” or a “robust sample”. It just sounds like a technical term (to the layman), but it lacks any real meaning.

        Another point that I have just remembered is that the notion of “representative sampling” dates back to Kiaer (1895), but the concept has long been abandoned in the sampling field since Neyman’s (1934) seminal paper, where he not only lays down the framework for probability sampling, but also introduces stratified sampling and an optimal allocation (which we now call the Neyman allocation) and, if that wasn’t already enough, establishes the concept of the confidence interval (!)

        Anyhow, I really enjoy these more conceptual kinds of discussions. And I think it is really important because, as Jaynes once put it, there are so many terms in statistics that are heavily loaded (“unbiasedness”, “admissible” or “uniformly most powerful”) that we need to be careful when using concepts that might sound like they carry a technical meaning when, in fact, we don’t even have a formal definition for them, which, in my opinion, is the case for “representative sample”.

  4. For a taste of what it can be like to be a judge or lawyer, read these comments and then imagine you had to implement a statute requiring a decision based on a “representative sample.” Don’t be lawyers, kids.

  5. My take:

    From a measure theoretic perspective, a probability distribution is something that gets integrated against a function

    If all integrals (expectations) are faithful, you recover the original distribution

    There are infinitely many functions to integrate against (expectations to take), so any finite sample will not satisfy all of them to any finite epsilon (proof)

    You need to choose functions to care about

    Even so, we don’t know what the value of those expectations in, or we wouldn’t need to samples, so you can’t directly verify in practice

    In practice, to assess representativeness for a given purpose we need to make assumptions. Generic examples are assuming something about higher moments to place bounds on lower moments, like assuming finite variance, kurtosis to place bounds on the mean. Or else, assess representativeness along parameters we do know, then make assumptions about the relationship between those and unknown expectations of interest, like in the covariate balance rerandomization example
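The “choose functions to care about” step above might be sketched like this: a hypothetical standard-normal sample checked against a small, hand-picked set of expectations, each matched only up to Monte Carlo error (the sample size, seed, and tolerance are arbitrary choices for illustration):

```python
import random

# Hypothetical sample from a standard normal superpopulation.
random.seed(1)
sample = [random.gauss(0.0, 1.0) for _ in range(50_000)]

# The chosen test functions, with their true expectations under N(0, 1).
test_functions = {
    "mean,  E[X]   = 0": (lambda x: x,      0.0),
    "2nd,   E[X^2] = 1": (lambda x: x * x,  1.0),
    "3rd,   E[X^3] = 0": (lambda x: x ** 3, 0.0),
}

n = len(sample)
for name, (f, truth) in test_functions.items():
    estimate = sum(f(x) for x in sample) / n
    # Each expectation is matched only up to Monte Carlo error; no finite
    # sample matches *every* possible expectation to a given epsilon.
    assert abs(estimate - truth) < 0.1, name
```

Swapping in a heavier-tailed test function (say, x**8) with the same sample would need a much larger n for the same tolerance, which is the comment’s point: representativeness is relative to the expectations you choose to care about.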

    • Woah, typos

      There are infinitely many functions to integrate against (expectations to take), so almost every finite sample will not recover *every* expectations to any finite epsilon (*needs proof, I’m purely guessing here, but it gives me good vibes so it must be true*)

      You need to choose functions to care about

      Even so, we don’t know what the value of those expectations are, or we wouldn’t need to take samples. We would just use the known values! So you can’t directly verify against the expectations of interest in practice

      • Also, we have to be careful about what we’re sampling and whether samples are sets (unordered) or sequences (ordered). An extremely non-random sequence can satisfy the definition above if the space under consideration is not a space of sequences, since by definition our functions being integrated cannot take ordinal information into account. So a hand-crafted or “periodic” sample can be representative by this definition — like, for example,

        UniformInt(0, 3)

        can be represented by sufficiently long periodic samples 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, …

        while the same is not true if we’re sampling random sequences. I don’t think this is necessarily a deficiency of the definition. Nearly everything generated by computers is only pseudorandom, but useful.
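A quick check of the periodic-sample point above: symmetric (order-blind) statistics cannot tell the periodic sequence from a genuine UniformInt(0, 3) sample, but an order-aware statistic exposes it immediately (the length 1000 is an arbitrary choice):

```python
from collections import Counter

# A hand-crafted periodic "sample" that hits the UniformInt(0, 3)
# frequencies exactly.
period = [0, 1, 2, 3]
sample = period * 250   # length 1000, perfectly periodic

# Any order-blind statistic sees exact uniform frequencies...
freqs = Counter(sample)
assert all(freqs[k] == 250 for k in range(4))

# ...but a statistic that looks at order exposes the structure at once:
# every successive difference is +1 or -3, which a genuinely random
# sequence of draws would essentially never produce.
diffs = Counter(b - a for a, b in zip(sample, sample[1:]))
assert set(diffs) == {1, -3}
```

This is the set-versus-sequence distinction in miniature: widen the sample space to sequences (so that functions of order, like these successive differences, count as test functions) and the periodic sample stops being representative.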

        • Algorithmic randomness gives us a definition of a random sequence as any sequence that passes a universal (in effect, uniformly most powerful) computable test of randomness. Per Martin-Löf proved the existence of such a test non-constructively. So, similarly to Kolmogorov complexity, we tend to work with practical batteries of tests, like the so-called “Dieharder” suite or the various NIST test suites.

          A random sequence can be constructed as (f(x_i)) for (x_i) a sequence passing such a test.

          A representative sample is roughly any sample that allows us to compute expectations reasonably accurately. You can prove things like central limit theorems on the error of expectations. Basically it turns out that most random sequences result in samples that give accurate expectations.

        • can be represented by sufficiently long periodic samples 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, …

          I think we would all benefit if reporting the date each data point was collected were standard practice (it gives the order, plus more info). This is another thing Deming pointed out:

          It is important to remember that the mean, the variance, the standard error, likelihood, and many other functions of a set of numbers, are symmetric. Interchange of any two observations x_i and x_j leaves unchanged the mean, the variance, and even the distribution itself. Obviously, then, use of variance and elaborate methods of estimation buries the information contained in the order of appearance in the original data, and must therefore be presumed inefficient until cleared.

          https://deming.org/wp-content/uploads/2020/06/On-Probability-As-a-Basis-For-Action-1975.pdf

      • This would also need proof, but I suspect that for any given epsilon there exists a finite sample length N such that all specified expectations in some finite set E are computed accurately to within epsilon.

  6. Representativeness of a sample is an absolute requirement for accurate frequentist inference… you have an estimator whose frequency properties depend on the frequency properties of the procedure that generates the data.

    A Bayesian procedure doesn’t need any of this; it just has to be the case that the assumptions made about what is plausible to occur have some degree of validity. We can place high probability on things that are virtually unheard of and still get valid inference, so long as the things that DO occur also have high probability. For example, suppose you model the weight of passenger cars as exponentially distributed with mean mu, with the mu value itself distributed near 1000 kg… If you think of this as a frequency distribution, discretized to a resolution of 1 kg, then the most frequent weight of a car will be 1 kg…. which is, quite honestly, insane. There are zero 1 kg cars.

    Nevertheless, after measuring the weight of 100 cars you will have a posterior distribution for mu which is somehow accurate to a few hundred kg, almost regardless of what you do in terms of your sample.

    Sure, you will do better if you make an effort to sample cars in a way that doesn’t focus only on minivans or mini coopers, but the concentration of the probability around a reasonable value of the mu parameter does not depend on the frequency distribution of actual cars in the same way that it does for say a permutation test.

    At the same time, the Bayesian model will occasionally make ridiculous predictions, even with the right mu value, if you ask it to generate a predicted individual car mass it’ll sometimes say 37kg for example.
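A minimal sketch of the car-weight example above, using a brute-force grid posterior for mu rather than anything fancier; the true mean, grid, seed, and tolerance are all hypothetical choices for illustration:

```python
import random
import math

# Hypothetical "true" car-weight process: Exponential with mean 1500 kg.
random.seed(3)
true_mean = 1500.0
data = [random.expovariate(1.0 / true_mean) for _ in range(100)]  # 100 cars

# Grid of candidate mu values with a flat prior over the grid.
grid = [200.0 + 20.0 * i for i in range(500)]   # 200 kg .. 10180 kg

def log_lik(mu):
    # Exponential(mean = mu) log-likelihood: -n*log(mu) - sum(x)/mu
    return -len(data) * math.log(mu) - sum(data) / mu

logs = [log_lik(mu) for mu in grid]
m = max(logs)                                    # subtract max for stability
weights = [math.exp(v - m) for v in logs]        # unnormalized posterior
z = sum(weights)
post_mean = sum(mu * w for mu, w in zip(grid, weights)) / z

# Despite the model's density being highest near 0 kg (no 1 kg cars exist!),
# 100 observations concentrate the posterior for mu to within a few hundred
# kg of the truth.
assert abs(post_mean - true_mean) < 500
```

As the comment notes, the posterior predictive for an individual car would still occasionally produce absurd values like 37 kg; it is the inference about mu, not the frequency realism of the model, that is accurate.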

    • “I know it when I see it!”

      Wins thread, IMO. Language is full of terms like “representative sample” that cannot be quantified but function perfectly well at a conceptual level. Like the words “skinny” or “pornography.” These words make the language richer, not poorer.

      Raphael has articulated numerous reasons why he is uncomfortable with the term, and obviously has spent much more time thinking about it than I have. But in the work I did, we used the term all the time to describe what we were trying to achieve in picking a sample, and there was never any ambiguity about what it meant, which may mean nothing for sampling theory but is quite pertinent when assessing a particular usage of language.

      Perhaps it would be less confusing to write it as we actually mean it, “adequately representative sample.”

      • In general terms, I completely agree with you: everyday language is full of terms like that and I have absolutely no problem with that.

        In this case, however, a lot of the time I see “representative sample” being used as a technical term, as if it actually had a formal definition and meaning in sampling. In most situations, in my experience, I can easily remove the term without any loss of meaning (“national representative sample” can easily be changed to “national sample”), or there are better terms for it, such as “stratified sample with proportionate allocation” or just “probability sample”. IMHO, “representative sample” is just as meaningless as saying “robust sample” or “statistically valid sample”. You are only adding a loaded adjective that sounds technical and carries a positive connotation but that, from a sampling perspective, does not mean anything.

        • Raphael wrote:

          “You are only adding a loaded adjective to the sample that sounds technical and with a positive-connotation, but that, from a sampling perspective, it does not mean anything.”

          Some of the work I did was in the field of reliability. Imagine a case where a sample of ten units is pulled from a production flow and submitted to a life simulation stress test until failure. How do I pick my sample? While there is no doubt some heterogeneity in the population, I have no idea how much or even whether it matters at all. Since I don’t know which components are going to fail, I don’t really know what I need to “represent,” so no stratification by anything meaningful is possible. I’m not doing any kind of reliability statistics either – that is not what stress durability testing is looking for – I’m just trying to figure out which parts wear out first and how long they last until that happens.

          I have intentionally built an example where sampling theory is of little value. Beyond some simple scheme based upon serial number, what else can I do? The best word I have for the most important attribute of this sample is “representative.” Do you have a better word for my conceptual GOAL in choosing which units to test than “representative?”

  7. A sample is fully representative if there exists a single real number,which when multiplied with the sample leads to the full population.

  8. Just because a word is around doesn’t mean it has an agreed, correct definition. As with some other terms, if somebody wants to use it in a certain context, they should define it as precisely as they can and say clearly that this definition is not a general one but one given by that particular writer for that particular work, as there is no agreed general definition.

    I think a major distinction here is whether we want to have a definition that can be checked from observables (including knowing how the sample was drawn) or whether this is not the case. We should also acknowledge that (for reasons well explained in other postings) if we make the former choice, having a “representative sample” does not imply that the sample will work well for, say, estimating the desired population characteristics, whereas if we make the latter choice we can’t know whether any given sample is representative or not. This is pretty much what was called above “central paradox of sampling”. Obviously we can choose a definition that does not refer to what we can observe, and then speculate (potentially based on an assumed model) to what extent observables can at least give us some hints on whether our sample is properly representative or not, and how good they might be at indicating this.

    That’s about what I have to say, the main issue being that we should be clear about whether the definition we use has the first or the second set of issues. That, I think, contributes to the clarity of the concept.

    • This is a nice summary of the issue, and relates to the Meng (2018) definition, where one needs to know the true distribution of the variable of interest in the population to know the representativeness (a.k.a. selection bias in some literatures), versus the practical approach of assessing representativeness using proxies (auxiliary variables capturing response propensity), metrics of which are often known as R-indicators in survey sampling.

    • Christian:

      Yes, part of the point of my post was to emphasize that I do think that “representative sample” is a useful concept. Random sampling is fine but it’s a means to an end, which is a representative sample (or, more generally, a sample that can be analyzed usefully).

  9. A sample that can be analyzed usefully is any sample about whose generation you know enough to write down an approximate model of the sampling process. A sample is useless if you don’t have background knowledge that can make sense of it.

    A representative sample is one where naive expectations lead to answers within the naive sampling distribution of correct values. You can only know whether you have a representative sample if you know how it was generated, or if you know more about the population than you can gain from the sample.

    • Daniel
      You are starting to sound like ChatGPT. So, I thought I would check to see if you were – and you’re not. But, in any case, here is the real ChatGPT:

      Dale: How can I tell if a sample I have is representative of the population it was drawn from?

      ChatGPT:

      To determine if a sample is representative of the population it was drawn from, you can consider the following factors:

      Sampling Method: Understand the sampling method used to select the sample. Random sampling techniques, such as simple random sampling or stratified random sampling, increase the likelihood of obtaining a representative sample. If a random sampling method was employed, it provides a higher chance that the sample is representative.

      Sample Size: Consider the size of the sample in relation to the population. Generally, larger sample sizes tend to provide more representative results. A larger sample size reduces the margin of error and increases the likelihood of capturing the diversity and characteristics of the population.

      Population Characteristics: Examine the key characteristics of the population and compare them to the sample. Look for similarities in demographics, geographic distribution, age, gender, socioeconomic status, or any other relevant variables. If the sample closely resembles the population in terms of these characteristics, it indicates a higher likelihood of representativeness.

      Sampling Bias: Assess if there were any biases in the sampling process that may have influenced the selection of individuals or elements in the sample. Biases, such as self-selection bias or non-response bias, can affect representativeness. If efforts were made to minimize biases and ensure fairness in the sampling process, it increases the chances of obtaining a representative sample.

      External Validation: Validate the findings or characteristics observed in the sample against existing knowledge or external sources of information about the population. If the sample’s characteristics align with what is known about the population, it supports the claim of representativeness.

      Expert Opinion: Seek the opinion of experts or domain specialists who can evaluate the sample and provide their judgment on its representativeness. They can assess the appropriateness of the sampling method, sample size, and the overall design of the study to determine if the sample is representative.

      It is important to note that achieving perfect representativeness is challenging in practice. However, considering these factors can help assess the degree of representativeness and the extent to which the sample reflects the population of interest.
