Mister P when you don’t have the full poststratification table, you only have margins

Torleif Halkjelsvik from the Norwegian Institute of Public Health writes:

Norway has very good register data (education/income/health/drugs/welfare/etc.) but it is difficult to obtain complete tables at the population level. It is however easy to get independent tables from different registries (e.g., age by gender by education as one data source and gender by age by welfare benefits as another). What if I first run a multilevel model to regularize predictions for a vast set of variables, but in the second step, instead of a full table, use a raking approach based on several independent post-stratification tables? Would that be a valid approach? And have you seen examples of this?

My reply:

I think the right way to frame this is as a poststratification problem where you don’t have the full poststratification table, you only have some margins. The raking idea you propose could work, but to me it seems awkward in that it’s mixing different parts of the problem together. Instead I’d recommend first imputing a full poststrat table and then using this to do your poststratification. But then the question is how to do this. One approach is iterative proportional fitting (Deming and Stephan, 1940). I don’t know any clean examples of this sort of thing in the recent literature, but there might be something out there.
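
To give a sense of what I mean, here is a minimal base-R sketch of IPF for a single two-way table. The function name, margins, and dimensions are all made up for illustration; real register data would involve more dimensions and more than two margin tables.

    # Minimal sketch of iterative proportional fitting (Deming and Stephan, 1940)
    # for one two-way table. 'seed' is a starting table (a table of ones, or a
    # sample cross-tabulation); 'row_margin' and 'col_margin' are known population
    # margins that must sum to the same total. Assumes no all-zero rows or columns
    # in the seed.
    ipf_2way <- function(seed, row_margin, col_margin, tol = 1e-8, max_iter = 1000) {
      tab <- seed
      for (iter in 1:max_iter) {
        tab <- sweep(tab, 1, row_margin / rowSums(tab), `*`)  # rescale rows
        tab <- sweep(tab, 2, col_margin / colSums(tab), `*`)  # rescale columns
        if (max(abs(rowSums(tab) - row_margin)) < tol &&
            max(abs(colSums(tab) - col_margin)) < tol) break
      }
      tab
    }

    # Example: with a flat seed, IPF returns the independence table.
    row_margin <- c(200, 300, 500)   # e.g., three age groups
    col_margin <- c(600, 400)        # e.g., two education levels
    round(ipf_2way(matrix(1, 3, 2), row_margin, col_margin), 1)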

Halkjelsvik responded:

It is an interesting idea to impute a full poststrat table, but I wonder whether it is actually better than directly calculating weights using the proportions in the data itself. Cells that should be empty in the population (e.g., women, 80-90 years old, high education, sativa spray prescription) may not be empty in the imputed table when using iterative proportional fitting (IPF), and these “extreme” cells may have quite high or low predicted values. If I instead use the data itself, such cells will be empty, and they will not “steal” any of the marginal proportions when using IPF. This is of course a problem in itself if the data is limited (if there are empty cells in the data that are not empty in the population).

To which I replied:

If you have information that certain cells are empty or nearly so, that’s information that you should include in the poststrat table. I think the IPF approach will be similar to the weighting; it is just more model-based. So if you think the IPF will give some wrong answers, that suggests you have additional information. I recommend you try to write down all the additional information you have and use all of it in constructing the poststratification table. This should allow you to do better than with any procedure that does not use this info.
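
To make the structural-zero point concrete, here is one way to encode that information, continuing the hypothetical ipf_2way sketch above: set the impossible cells to zero in the seed table. IPF only rescales cells, so zero seed cells stay exactly zero and the margin totals are spread over the remaining cells, provided the margins remain consistent with the zero pattern.

    # Structural zeros: zero out impossible cells in the seed before fitting.
    seed <- matrix(1, 3, 2)
    seed[3, 2] <- 0                  # a combination assumed impossible in the population
    row_margin <- c(200, 300, 500)
    col_margin <- c(600, 400)
    fitted <- ipf_2way(seed, row_margin, col_margin)
    round(fitted, 1)                 # cell [3, 2] stays 0; both margins still match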

11 thoughts on “Mister P when you don’t have the full poststratification table, you only have margins”

  1. This is a question I have had for a long time! Thanks for posting about it. It is not clear to me, though, how to use the data on those structural zeros (logically implausible combinations) when doing IPF. Can you maybe point me to some reference? Or briefly explain how that would work?

  2. I always think about Andrew’s advice for missing data, which he attributed to Rubin: think about what you’d do if you had all the data. In this case, you’d just poststrat as usual. Then the path ahead with Bayes is straightforward. We introduce parameters for the unknowns and use those in posterior inference just like known variables, only now they give us plug-in uncertainty estimates conditioned on the data.

    The trick in this case is “solving” the inverse problem of finding cell counts that have the appropriate marginals. We can introduce some wiggle here in the form of uncertainty so we don’t have to match the marginals exactly. As Andrew once put it, this “greases the wheels of commerce.” Technically, we can do that by assigning the sum of the parameters making up each marginal a normal distribution centered at that marginal with a little wiggle room.

    Writing this all out in Stan to do the imputation for the two-dimensional case looks as follows. Higher dimensions work the same way; the marginals just involve more terms.

    data {
      int<lower=0> M;                  // number of rows
      int<lower=0> N;                  // number of columns
      vector<lower=0>[M] row_margin;   // known row margins
      vector<lower=0>[N] col_margin;   // known column margins
    }
    transformed data {
      if (sum(row_margin) != sum(col_margin))
        reject("sum of row and column margins must match");
    
      // wiggle room around the margins: 1% of the total count
      real<lower=0> epsilon = sum(row_margin) / 100;
    }
    parameters {
      matrix<lower=0>[M, N] alpha;     // imputed cell counts
    }
    transformed parameters {
      vector<lower=0>[M] row_sum;
      vector<lower=0>[N] col_sum;
      for (m in 1:M) row_sum[m] = sum(alpha[m, ]);
      for (n in 1:N) col_sum[n] = sum(alpha[, n]);
    }
    model {
      // soft constraints: cell sums should match the known margins
      row_sum ~ normal(row_margin, epsilon);
      col_sum ~ normal(col_margin, epsilon);
    }
    

    And then we can simulate data and fit. I’m using cmdstanr to do that.

    library("cmdstanr")
    
    set.seed(1234)
    
    # simulate a complete table of cell counts
    M <- 6
    N <- 4
    y <- matrix(sample(0:100, M * N, replace = TRUE), nrow = M, ncol = N)
    
    # only the margins get passed to Stan (epsilon is set inside the Stan program)
    rowm <- rowSums(y)
    colm <- colSums(y)
    
    # start every cell at the grand mean so the chains begin in a sensible region
    init_fun <- function(chain_id)
            list(alpha = matrix(sum(y) / (M * N),
                                nrow = M, ncol = N))
    
    data <- list(M = M, N = N, row_margin = rowm, col_margin = colm)
    model <- cmdstan_model('marginal-mrp.stan')
    fit <- model$sample(data = data, chains = 1, init = init_fun)
    

    Here are the simulated cells.

    > y
         [,1] [,2] [,3] [,4]
    [1,]   27   37   69    3
    [2,]   79   15   78    3
    [3,]   21    3   77   20
    [4,]  100   97   13   39
    [5,]    8   85   55   83
    [6,]    4   89   61   55
    

    and here's what Stan reconstructs.

       variable mean q5 q95
     alpha[1,1]   30     2  74
     alpha[2,1]   39     3  95
     alpha[3,1]   29     2  76
     alpha[4,1]   50     5 116
     alpha[5,1]   49     3 114
     alpha[6,1]   44     3 107
     alpha[1,2]   38     3  90
     alpha[2,2]   51     4 117
     alpha[3,2]   34     2  79
     alpha[4,2]   74     5 161
     alpha[5,2]   68     5 149
     alpha[6,2]   62     6 135
     alpha[1,3]   40     4  94
     alpha[2,3]   53     5 117
     alpha[3,3]   35     3  81
     alpha[4,3]   84    11 169
     alpha[5,3]   76    10 153
     alpha[6,3]   66     6 147
     alpha[1,4]   28     2  71
     alpha[2,4]   33     2  85
     alpha[3,4]   26     1  68
     alpha[4,4]   40     3 100
     alpha[5,4]   39     4  97
     alpha[6,4]   38     3  96 
    

    Without additional assumptions, there is a lot of uncertainty. At this point, various kinds of priors could be added. For example, each value could be given a prior around the grand mean value (sum divided by number of cells). Or the values on each row and/or column could be constrained to be close to each other.

    The fit gets tighter as epsilon shrinks. As is, there's a lot of slack in the row sums. For instance, the first row sum has a 90% interval of roughly (120, 150). Taking epsilon to be 1/10 of the first run (dividing by 1000 rather than 100 in the transformed data block), the 90% interval is reduced to (134, 137). It otherwise doesn't change the answers much, but it causes the sampler to max out its tree depth.
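
    (A small practical note on that last point: in cmdstanr the tree-depth limit can be raised in the sampling call, at the cost of a slower run; the value 12 below is just an example. The more principled fix is to add the kind of prior information discussed above.)

    # same fit as before, but allowing deeper NUTS trajectories
    fit <- model$sample(data = data, chains = 1, init = init_fun,
                        max_treedepth = 12)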

  3. I’m not sure where to go for informative priors here other than meta-analysis. One simple thing to do would be to build a factor model to generate a mean toward which to concentrate. In the two-factor case I outlined before, that’s basically just a simplex of row probabilities and a simplex of column probabilities, and you complete the matrix by assuming independence (and then scale by the total count). This essentially assumes that the probability of an entry in a cell is the probability of the row times the probability of the column (i.e., no interaction effects on population sizes).

    If I do that (code below), how big a spread I get for each imputation depends on the prior scale. I take a lognormal prior centered at the factor-model solution; with a prior scale of 1, I get a 90% interval of (6, 75) for the first count, which is close to the original model but without the very low values. With a prior scale of 0.2, I get a 90% interval of (22, 38). Because the result is otherwise unconstrained, it’s very sensitive to the prior.

    data {
      int<lower=0> M;
      int<lower=0> N;
      vector<lower=0>[M] row_margin;
      vector<lower=0>[N] col_margin;
    }
    transformed data {
      real total = sum(row_margin);
      real epsilon = total / 500;
      // independence (factor-model) baseline for the log cell counts
      simplex[M] p_row = row_margin / total;
      simplex[N] p_col = col_margin / total;
      matrix[M, N] mu;
      for (m in 1:M) mu[m] = log(total * p_row[m] * p_col)';
    }
    parameters {
      matrix<lower=0>[M, N] alpha;  // lognormal prior needs positive cell counts
    }
    transformed parameters {
      vector[M] row_sum;
      vector[N] col_sum;
      for (m in 1:M) row_sum[m] = sum(alpha[m, ]);
      for (n in 1:N) col_sum[n] = sum(alpha[, n]);
    }
    model {
      row_sum ~ normal(row_margin, epsilon);
      col_sum ~ normal(col_margin, epsilon);
      for (m in 1:M)
        alpha[m] ~ lognormal(mu[m], 0.2);  // SENSITIVE!!!
    }
    

    P.S. One more thing on formatting: it’d be nice if the “pre” font were Lucida Console or something else more code friendly. I see that “code” inline gets a good code font that’s the right size.

    • Bob:

      Thanks for the comments that are better than the original post. That happens sometimes!

      P.S. Do you think the font of the code in your comment is not code friendly? It looks good to me, but then again I don’t do as much coding as you do, so I guess I’m missing some nuance here?

    • Shiro:

      Thanks for the link. It looks great. I have one quick comment. On that page, it says, “Almost all survey weighting . . .” I think you should change that to “Almost all survey adjustment . . .” “Adjustment” is the general term for methods for accounting for known or estimated differences between sample and population; “weighting” is just one particular family of methods for adjustment.

    • I have gone through that paper a couple of times, and think there is a typo in formula 7. I believe it should be
      \vec{N}_{..} \sim \operatorname{Poisson}(L \vec{\hat{N}}), (or the other way around with vec and hat)
      and not
      \vec{N}_{..} \sim \operatorname{Poisson}(L \vec{N}).

  4. Suppose you have two non-overlapping margins (for simplicity). You don’t have any population data on the correlation structure (the copula); it’s all prior and sample data.

    If you fit a loglinear model by IPF, you’re going to be specifying the interaction structure (in the simplest case, specifying independence), but not relying on the sample data at all for the population table.

    Raking, which can also be done with IPF, gives you the joint distribution with the correct margins that is closest to the sample data, in terms of minimising the weight adjustments needed. In that sense, it’s the opposite of fitting a loglinear model by IPF: it doesn’t have a view about the interaction structure but trusts the data (see the sketch at the end of this comment).

    It would presumably be better to do something in-between: to express prior uncertainty about the population interaction structure, and also about how informative or misleading the sample interaction structure is. In settings where response rates are plausibly high or ignorable you’d end up with something towards raking, but where the sample is likely to be misleading, you’d trust your assumptions about population interaction structure more. And with these all in one model it would be easier to get the uncertainty estimates right, but it’s going to be a complicated model.
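
    To make that contrast concrete, here is a rough base-R sketch (all numbers invented): the same IPF updates are used for both fits, and only the seed table differs, a flat seed for the loglinear/independence fit versus the sample cross-tab for raking.

    # Same IPF updates, two different seeds; numbers are made up for illustration.
    rake <- function(seed, row_m, col_m, iters = 200) {
      tab <- seed
      for (i in 1:iters) {
        tab <- sweep(tab, 1, row_m / rowSums(tab), `*`)  # match row margins
        tab <- sweep(tab, 2, col_m / colSums(tab), `*`)  # match column margins
      }
      tab
    }

    sample_tab <- matrix(c(40, 10,    # sample cross-tab with a strong interaction
                           10, 40), nrow = 2, byrow = TRUE)
    row_m <- c(60, 40)                # known population margins
    col_m <- c(50, 50)

    round(rake(matrix(1, 2, 2), row_m, col_m), 1)  # independence fit: ignores the sample
    round(rake(sample_tab, row_m, col_m), 1)       # raking: preserves the sample's odds ratio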
