Alex Stenlake and Ranjit Lall write about a program they wrote for imputing missing data:

Strategies for analyzing missing data have become increasingly sophisticated in recent years, most notably with the growing popularity of the best-practice technique of multiple imputation. However, existing algorithms for implementing multiple imputation suffer from limited computational efficiency, scalability, and capacity to exploit complex interactions among large numbers of variables. These shortcomings render them poorly suited to the emerging era of “Big Data” in the social and natural sciences.

Drawing on new advances in machine learning, we have developed an easy-to-use Python program – MIDAS (Multiple Imputation with Denoising Autoencoders) – that leverages principles of Bayesian nonparametrics to deliver a fast, scalable, and high-performance implementation of multiple imputation. MIDAS employs a class of unsupervised neural networks known as denoising autoencoders, which are capable of producing complex, fine-grained reconstructions of partially corrupted inputs. To enhance their accuracy and speed while preserving their complexity, these networks leverage the recently developed technique of Monte Carlo dropout, changing their output from a frequentist point estimate into the approximate posterior of a Gaussian process. Preliminary tests indicate that, in addition to successfully handling large datasets that cause existing multiple imputation algorithms to fail, MIDAS generates substantially more accurate and precise imputed values than such algorithms in ordinary statistical settings.

They continue:

Please keep in mind we’re writing for a political science/IPE audience, where listwise deletion is still common practice. The “best-practice” part should be fairly evident among your readership…in fact, it’s probably just considered “how to build a model”, rather than a separate step.

And here are some details:

Our method is “black box”-y, using neural networks and denoising autoencoders to approximate a Gaussian process. We call it MIDAS -Multiple Imputation with Denoising AutoencoderS – because you have to have a snappy name. Using a neural network has drawbacks – it’s a non-interpretable system which can’t give additional insight into the data generation process. On the upside, it means you can point it at truly enormous datasets and yield accurate imputations in relatively short (by MCMC standards) time. For example, we ran the entire CCES (66k x ~ 2k categories) as one of our test cases in about an hour. We’ve also built in an overimputation method for checking model complexity and the ability of the model to reconstruct known values. It’s not a perfect “return of the full distribution of missing values”, but point estimates of error for deliberately removed values, giving a good estimate and allowing the avoidance of any potential overtraining through early stopping. A rough sanity check, if you will. This is an attempt to map onto the same terrain covered in your blog post.

Due to aggressive regularisation and application of deep learning techniques, it’s also resistant to overfitting in overspecified models. A second test loosely follows the method outlined in Honaker and King 2012, taking the World Development Indicators, subsetting out a 31 year period for 6 African nations, and then lagging all complete indicators a year either side. We then remove a single value of GDP, run a complete imputation model and compare S=200 draws of the approximate posterior to the true value. In other words, the data feed is a 186 x ~3.6k matrix, suffering hugely from both collinearity and irrelevant input, and it still yields quite accurate results. It should be overfitting hell, but it’s not. We’d make a joke about the Bayesian LASSO, but I’m actually not sure. Right now we think its a combination of data augmentation and sparsity driven by input noise. Gal’s PhD thesis is the basis for this algorithm, and this conception of a sparsity-inducing prior could just be a logical extension of his idea. Bayesian deep learning is all pretty experimental right now.

We’ve got a github alpha release of MIDAS up, but there’s still a long way to go before it gets close to Stan’s level of flexibility. Right now, it’s a fire-and-forget algorithm designed to be simple and fast. Let’s be frank – it’s not a replacement for conventional model-based imputation strategies. We’d still trust in a fully specified generative model which can incorporate explicit information about the missingness-generating mechanism over my own for bespoke models. Our aim is to build a better solution than listwise deletion for the average scholar/data scientist, that can reliably handle the sorts of nonlinear patterns found in real datasets. Looking at the internet, most users – particularly data scientists – aren’t statisticians who have the time or energy to go the full-Bayes route.

Missing data imputation is tough. You want to include lots of predictors and a flexible model, so regularization is necessary, but it’s only recently that we—statisticians and users of statistics—have become comfortable with regularization. Our mi package, for example, uses only very weak regularization, and I think we’ll need to start from scratch when thinking about imputation. As background, here are 3 papers we wrote on generic missing data imputation: http://www.stat.columbia.edu/~gelman/research/published/MI_manuscript_RR.pdf and http://www.stat.columbia.edu/~gelman/research/published/mipaper.pdf and http://www.stat.columbia.edu/~gelman/research/published/paper77notblind.pdf .

I encourage interested readers to check out Stenlake and Lall’s program and see how it works. Or, if people have any thoughts on the method, feel free to share these thoughts in comments. I’ve not looked at the program or the documentation myself; I’m just sharing it here because the description above sounds reasonable.

Baysianism in a nutshell:

“Our method is “black box”-y, …it’s a non-interpretable system which can’t give additional insight into the data generation process. On the upside, it means you can point it at truly enormous datasets and yield accurate imputations in relatively short (by MCMC standards) time.”

Non-interpretable – but accurate – assessments of missing data! Almost seems like magic. All you need to do now is point it at new, the real-world datasets and hope it cooperates – that might make non-statisticians comfortable with regularization too… That might require some interpretation of the “data-generation process, though, to figure out where to point this magical device. (Assuming we’re interested in real-world prediction).

Dangerous waters here – missing data is exactly where black box methods can develop pernicious biases. And black box also means effectively non-auditable, so these biases can’t be easily diagnosed.

And as statistical methods become more widespread, this gets more and more dangerous. I’d recommend seeing Osonde Osoba’s report here; https://www.rand.org/pubs/research_reports/RR1744.html and Cathy O’Neil’s book here; https://www.amazon.com/Weapons-Math-Destruction-Increases-Inequality/dp/0553418815

@ David:

Granted, conditionally. This is a problem both myself and Ranjit are keenly aware of. We don’t like the black box aspect, but it’s how a lot of scalable machine learning is done for complex problems. Neural networks are just horrendously effective. At least now, with Gal’s work on interpreting dropout-trained networks as approximate GPs, we can get an idea of the uncertainty due to moving away from known regions of the modelling space. Many social science scholars will treat interpretable MI systems like Amelia, MICE, Hmisc, etc. as black box optimisers anyway, knowing just enough about how they work as to avoid throwing errors.

In general, the current consensus seems to be that MI is – at worst – no more prone to inducing bias than listwise deletion. Both myself and Ranjit are International Relations people, where missingness is highly nonrandom. For instance, Ranjit ran a reanalysis of a large number of studies from the last few years. He found that, when listwise deletion was replaced with actual imputation methods, many findings completely disappeared, if not reversed (type-M/S, if you will). If it puts your mind at ease, we have done some testing with strictly MNAR data, where missingness depends on an excluded (target) variable. We then use the imputed datasets in a Naive Bayes (AKA. Idiot Bayes) classifier to predict that excluded variable. What we observe is that the accuracy of that classification improves. It’s not proof that it works for all cases, but it does seem to be a significant improvement over existing approaches.

I don’t disagree that this is dangerous territory, and there will always be room for tailor-made missing data solutions based on extensive subject matter knowledge. Our solution is designed to work for those who lack the time or knowledge to implement such a system effectively. Thanks for the links though.

@Lydia:

Non-interpretable in this sense means we can’t access meaningful representations of the millions of parameters in your average neural network. We can represent the output quite easily, and we withhold an overimputed test set to ensure that the model is reconstructing accurately. I assure you, it’s not magic. We’ve had to modify the standard train/test split to account for missingness, based on some obscure work done 6 years ago, but other than that it’s based on practical solutions to well-known issues in machine learning.

On the issue of the “data generation process”, you can think of data as generated by a nonlinear manifold in feature space. Standard denoising autoencoders attempt to learn this manifold. To do so, you apply a noise mask to an input, and then force the network to learn how to reconstruct the original input from the corrupted input. For a corrupted input, the network reconstructs its value from the closest point on the manifold indexed by the other inputs. If we were dealing with a simple linear manifold, this is analogous to conditional linear interpolation. (You might be seeing exactly how this works right about now, but I’ll continue.) By utilising dropout, we’re approximating the Gaussian process (a prior over the domain of functions) which describes this manifold. Then, by repeated sampling, we get an uncertainty-weighted distribution of said value.

Allow me to put words in your mouth, if I may. “But Alex, we’re concerned with the real world, not some abstract mathematical construct.” You’re right, and we’re doing approximations on approximations here. However, I’d point you towards some of the recent work involving Generative Adversarial Networks. GANs seem to be manifold learners too, although one with a unique adaptive loss function. They map random inputs onto points in feature space (often faces), so you can then sweep through the random noise and get original, unseen images out.

The real-world data generation process is whatever process generated the limited dataset in front of you, with the normal warnings about GIGO. The manifold is just a mathematical representation of this restricted set of observations. Sure, if you want to know the missing value of someone’s favourite flavour of ice cream, and “chocolate” fails to appear in your dataset, you can’t correctly predict that this observation should be chocolate. Excluded variable bias is a big deal. The advantage of our system is you can plausibly include hundreds, if not thousands, of variables in your imputation model, thus reducing the odds of EVB.

Hopefully that (more technical) explanation reassures you that we’re not simply waving our hands and saying “abracadabra!”. If you have more specific issues, I’d be happy to discuss them.

Alex,

First, props for your lucid writing and tight arguments. Second, do you have a link or two for “the recent work involving Generative Adversarial Networks”?

bloggy intro: https://deeplearning4j.org/generative-adversarial-network

NIPS tutorial paper: https://arxiv.org/abs/1701.00160

Exactly what I needed. Thank you.

“We don’t like the black box aspect, but it’s how a lot of scalable machine learning is done for complex problems.” The second part of this sentence seems like a non sequitur. Saying that its often done this way doesn’t legitimize it, does it? Maybe the reasons you don’t like it are the reasons why it shouldn’t be used.

“…there will always be room for tailor-made missing data solutions based on extensive subject matter knowledge. Our solution is designed to work for those who lack the time or knowledge to implement such a system effectively.”

The solution to what problem, exactly? How to take shots in the dark when you don’t have the appropriate data or the time to become familiar with your subject matter? Why do it?

I don’t have the expertise to engage with you on the technical issues, but the problems here seem to me much more fundamental. No amount of technical manipulation of inadequate data can reduce uncertainty to acceptable levels. Huge levels of uncertainty papered over with uncertain assumptions should have no more place in science than they do in engineering. There doesn’t seem to be a rational way to choose which techniques to prefer.

I don’t know if this example is relevant, but E=mc2 is of the same form as A=pi r2, but that tells us nothing about the underlying “data generation process.” However, I agree that the discussion would be more focussed if we were talking about a specific problem situation.

1) Well, yes…it kind of does. I know neural networks annoy a lot of my statistician friends, but the reality is their performance is unparalleled on large datasets and complex problem domains. Would you argue against neural networks for object recognition because we don’t understand fully how they work? Would you argue AlphaGo isn’t legitimate because we don’t exactly how it makes its decisions? These systems deliver superhuman levels of performance. The performance at least partially stems from the ability of neural networks to identify statistical regularities and relationships either too small for a human to notice, or that our biases and heuristics blind us to. Go experts are studying AlphaGo’s moves to learn more. The basic logic/process is well understood, but then you stack millions of these basic units together into a complex system and everything becomes too complicated to track.

I guess you could think of it like the weather…we know broadly what’s happening, and can predict outcomes with a degree of certainty, but the exact drivers of any given phenomenon remain inaccessible due to the complexity of and interactions within the system. We know it’s driven by heat gradients and fluid dynamics, but that’s not enough to completely mathematically describe what’s going on at any given moment. There’s an ontological point to be made about the mysteries of human cognition, but let’s not even open that can of worms.

2) Because people are taking shots in the dark already. Worse, policy is made on the basis of those shots in the dark. To torture the metaphor, our system is an attempt to ‘restrict the arcs’ of those firing into the night, so even if they miss they’re unlikely to commit fratricide, and should at least be firing in a reasonable direction. We ought to do it because the world doesn’t wait for ‘perfect’, and it might have changed by the time a ‘perfect’ model is built. I forget who said it, but the logic is “an exact solution to an approximate model is more useful than an approximate solution to an exact model”. What we’re after is better models. When they break, it gives us clues to what we’re missing.

The social domain is hugely complex, with thousands of competing theories that are all supported by data in varying ways. Please don’t imply that we’re not taking the time to try to grapple with it. Our data is limited, and we’re trying to learn what we can. You’re right that technical solutions will not reduce uncertainty to acceptable levels. What we’re trying to do is avoid too little uncertainty (point estimates, ie. random forest imputation, median imputation) or too much (naive Gaussian approximation, failure to try and analyse data). I don’t understand the hostility to the idea. “Papering over” is failing to acknowledge that the results depended on an imputed dataset, not attempting to deal with missingness in a principled manner.

3) Well, no, they’re not of the same form. E = variable (constant squared), whereas A = constant (variable squared). They describe a very different process. But let’s take your assumption at face value, and talk purely about the case E = mc2. If I gave you a large matrix, consisting of three columns (E, m, c), could you learn the relationship between them? Even if you knew nothing about relativity, but knew a little about maths, you could probably figure it out…particularly if I gave you a calculator to speed up the arithmetic. You could learn, or at least approximate, E=mc2 provided I gave you enough examples. Even if I mixed in a bunch of partial examples (with one of the values missing), you could solve for a single column, use those assumed values to approximate another column’s missing values, rinse and repeat until convergence. This is the basis of chained equations, and is how a lot of MI methods work. The advantage of our method is that you can throw a column in signifying “A or E”, and rather than treating the ‘c’ column as a squared constant (with “A or E” becoming a dummy variable of some magnitude), the model can learn to switch between the two states.

While you can say “well take the time to learn the relationships!”, well…you’re welcome to switch disciplines and come join us in fighting the good fight. A lot of this stuff is unknown, the variables we measure keep changing, latent factors abound, politically-useful theories flourish at the expense of attempts to grasp some notion of truth. (And the goalposts of truth keep shifting all the time because the poststructualists have a point, even if I wish they’d do something more useful than keep making the same point.) In absence of known relationships, we do the next best thing – estimate from the data we have, try not to impart too much of our own biases, and transparently report how we reach our conclusions.

Alex, do you have any simulations that show how your method preserves the uncertainty caused by the missing values?

In general, I would be looking for 1) the point estimate should have the right value, 2) the confidence interval should have the right coverage, and 3) the confidence should be as short as possible. Any idea how your method performs in a basic case as in

https://stefvanbuuren.name/fimd/sec-evaluation.html ?

Alex, could you possibly direct me to some more detailed documentation (or a paper) about the model used in the MIDAS code? Although I got the gist of the general composition of the denoising autoencoder (it’s basically a VAE with dropout at the input layer to induce missingness?) there are some details I couldn’t really make sense of just by reading the python files in your repository.

For example, I’m not too sure whether the loss function you’re trying to optimize is the traditional ELBO defined for the original VAE (Kingma & Welling, 2014) – I cant seem to locate the reconstruction error term (i.e. expectation of log p(x/z) w.r.t z).