We came across an interesting paper on missing data by Nicholas J. Horton and Ken P. Kleinman. It compares statistical methods and related software for fitting incomplete-data regression models.

Here is the abstract:

Missing data are a recurring problem that can cause bias or lead to inefficient analyses. Statistical methods to address missingness have been actively pursued in recent years, including imputation, likelihood, and weighting approaches. Each approach is more complicated when there are many patterns of missing values, or when both categorical and continuous random variables are involved. Implementations of routines to incorporate observations with incomplete variables in regression models are now widely available. We review these routines in the context of a motivating example from a large health services research dataset. While there are still limitations to the current implementations, and additional efforts are required of the analyst, it is feasible to incorporate partially observed values, and these methods should be used in practice.

This is quite a thorough review, and the authors refer to the various packages already available. One thing we noticed is that there is nothing on diagnostics (see here for more on diagnostics of imputation). This paper should help us improve the “mi” package.

Also the appendix of the paper can be found here.

To any knowledgeable person out there:

Are there methods based on MaxEnt that do this?

Igor.

Igor, good question! Entropy is only defined for probability distributions, not for data sets. For that reason, you can use a MaxEnt distribution to come up with a probabilistic model, and you can then draw missing values from this model.
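A minimal sketch of that idea, under simplifying assumptions that are mine and not taken from the paper: if the only constraints are a mean vector and a covariance matrix, the maximum-entropy distribution is a multivariate Gaussian, so missing entries can be drawn from the Gaussian conditional on the observed entries. The function names (`fit_maxent_gaussian`, `draw_missing`) are hypothetical, and estimating the moments from complete rows only is a shortcut for illustration.

```python
import numpy as np

def fit_maxent_gaussian(complete_rows):
    """Estimate mean and covariance from fully observed rows.

    The Gaussian with these moments is the maximum-entropy distribution
    under mean and covariance constraints.
    """
    mu = complete_rows.mean(axis=0)
    sigma = np.cov(complete_rows, rowvar=False)
    return mu, sigma

def draw_missing(row, mu, sigma, rng):
    """Fill the NaN entries of one row with a draw from the conditional Gaussian."""
    miss = np.isnan(row)
    filled = row.copy()
    if not miss.any():
        return filled
    obs = ~miss
    if not obs.any():
        # Nothing observed: draw the whole row from the marginal model.
        filled[:] = rng.multivariate_normal(mu, sigma)
        return filled
    s_oo = sigma[np.ix_(obs, obs)]
    s_mo = sigma[np.ix_(miss, obs)]
    s_mm = sigma[np.ix_(miss, miss)]
    gain = np.linalg.solve(s_oo, s_mo.T).T           # s_mo @ inv(s_oo)
    cond_mu = mu[miss] + gain @ (row[obs] - mu[obs])
    cond_sigma = s_mm - gain @ s_mo.T
    cond_sigma = (cond_sigma + cond_sigma.T) / 2.0   # remove round-off asymmetry
    filled[miss] = rng.multivariate_normal(cond_mu, cond_sigma)
    return filled

# Synthetic example (made-up data, for illustration only).
rng = np.random.default_rng(0)
data = rng.multivariate_normal([0.0, 1.0], [[1.0, 0.8], [0.8, 1.0]], size=200)
data[:20, 1] = np.nan                                # knock out some values
complete = data[~np.isnan(data).any(axis=1)]
mu, sigma = fit_maxent_gaussian(complete)
imputed = np.array([draw_missing(r, mu, sigma, rng) for r in data])
```

Repeating the draw several times would give multiple imputed data sets; other constraints would lead to a different MaxEnt family than the Gaussian.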

Thanks Aleks,

Well, the reason I am asking is that I have seen this type of undertaking only once and was wondering whether the literature had other examples.

The example I am thinking about is the Experimental Probabilistic Hypersurface (EPH) of Beauzamy at SCM in Paris within the Robust Mathematical Modeling program ( http://perso.orange.fr/scmsa/RMM/RMM_EPH.htm )

The idea is, as you said, to construct a probability distribution on top of experimental data and then find the "missing points". The nice thing about the construction is that the model can be updated whenever you find or compute additional points.

Two examples of application are here (the first one is mostly in French, but it is trying to find the level of rivers in France over a period of 30 years).

http://perso.orange.fr/scmsa/RMM/rapport_SCM_reco…

or some other environmental issue (in English):

http://perso.orange.fr/scmsa/RMM/Applications_EPH…

Besides the missing data aspect, another clear use of this is for engineers who have to deal with either very expensive experiments or the results of very computationally intensive codes (like we have in nuclear engineering). An example is highlighted in English on page 39 of this document:

http://perso.orange.fr/scmsa/RMM/IRSN_SCMSA_EPH4….

Are there similar constructions in the literature?

Igor.

Igor, I have taken a quick look at the material. EPH bears similarity to restricted mixture models (each observation would correspond to a distribution in the mixture with a fixed distribution, but the weights can be fitted) and generalized kernel density estimates (where the observations do not all have the same weight). Am I correct?
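To make the second comparison concrete, here is a rough sketch of a kernel density estimate in which observations carry unequal weights. It is not the EPH construction: the Gaussian kernel, the fixed bandwidth, the example points, and the weights are all assumptions chosen for illustration, and `weighted_kde` is a hypothetical name.

```python
import numpy as np

def weighted_kde(x, points, weights, bandwidth):
    """Evaluate a 1-D Gaussian kernel density estimate with per-point weights."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    points = np.asarray(points, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()                # normalize to sum to 1
    z = (x[:, None] - points[None, :]) / bandwidth   # shape (n_eval, n_points)
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return kernels @ weights                         # weighted mixture of kernels

# Three observations with made-up weights; each contributes one fixed-shape kernel.
points = [-1.0, 0.0, 2.0]
weights = [0.2, 0.5, 0.3]
grid = np.linspace(-10.0, 10.0, 4001)
density = weighted_kde(grid, points, weights, bandwidth=0.5)
total_mass = density.sum() * (grid[1] - grid[0])     # Riemann-sum check, close to 1
```

With equal weights this reduces to the ordinary kernel density estimate; the restricted-mixture view corresponds to keeping the kernels fixed and choosing only the weights.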

Aleks,

Thanks for looking it up.

Yep, I think these would be good descriptions, except that in EPH the weights are really not fitted but rather computed following Maxent constraints. They can be recalculated when new data come in.

Are there any good reasons why this type of method does not show up in the paper discussed in this entry? Is it because of the assumption of a fixed distribution, or something else (a limitation to low dimensions, perhaps)?

More generally, I find the technique very helpful for engineers who have not taken probability courses. It may not be highly accurate, but it gives the engineer a good toolkit when faced with these types of problems (problems with parameters in a high-dimensional space and fewer data points than the dimension of the parameter space).

Igor.