I just added the link to the recorded video

]]>I’m wondering whether the reference model approach is similar to Hinton’s “dark knowledge” idea, i.e. distilling a large model into a small one (https://arxiv.org/abs/1503.02531). To quote a bit from the paper, which is about a classification task:

“An obvious way to transfer the generalization ability of the cumbersome model to a small model is to use the class probabilities produced by the cumbersome model as “soft targets” for training the small model. For this transfer stage, we could use the same training set or a separate “transfer” set. When the cumbersome model is a large ensemble of simpler models, we can use an arithmetic or geometric mean of their individual predictive distributions as the soft targets. When the soft targets have high entropy, they provide much more information per training case than hard targets and much less variance in the gradient between training cases, so the small model can often be trained on much less data than the original cumbersome model and using a much higher learning rate.

For tasks like MNIST in which the cumbersome model almost always produces the correct answer with very high confidence, much of the information about the learned function resides in the ratios of very small probabilities in the soft targets. …”
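
To make the “soft targets” idea in that passage concrete, here is a rough sketch of my own (not code from the paper). The ensemble probabilities and the `student` prediction are made-up numbers, and the function names are hypothetical. It builds soft targets as the arithmetic or geometric mean of an ensemble’s predictive distributions and compares them with a hard (one-hot) target:

```python
import math

def arithmetic_mean(dists):
    """Average an ensemble's predictive distributions class by class."""
    n = len(dists)
    return [sum(p[k] for p in dists) / n for k in range(len(dists[0]))]

def geometric_mean(dists):
    """Per-class geometric mean, renormalized to sum to 1.
    Assumes every probability is strictly positive."""
    n = len(dists)
    raw = [math.exp(sum(math.log(p[k]) for p in dists) / n)
           for k in range(len(dists[0]))]
    total = sum(raw)
    return [r / total for r in raw]

def entropy(p):
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(q * math.log(q) for q in p if q > 0)

def cross_entropy(target, predicted):
    """Loss of the small model's prediction against a soft or hard target."""
    return -sum(t * math.log(q) for t, q in zip(target, predicted) if t > 0)

# Made-up predictive distributions from three ensemble members (3 classes).
ensemble = [
    [0.70, 0.20, 0.10],
    [0.60, 0.30, 0.10],
    [0.80, 0.15, 0.05],
]
soft_arith = arithmetic_mean(ensemble)
soft_geom = geometric_mean(ensemble)
hard = [1.0, 0.0, 0.0]  # one-hot label for the same training case

student = [0.5, 0.3, 0.2]  # made-up prediction from the small model

print("arithmetic soft target:", [round(x, 3) for x in soft_arith])
print("geometric soft target: ", [round(x, 3) for x in soft_geom])
print("entropy (soft, hard):  ", entropy(soft_arith), entropy(hard))
print("loss vs soft target:   ", cross_entropy(soft_arith, student))
print("loss vs hard target:   ", cross_entropy(hard, student))
```

The hard target has zero entropy, so it only tells the small model which class is correct. The soft target also carries the relative probabilities of the wrong classes, which is the extra information per training case that the quote refers to.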

As I come to understand reference models better, I’d also like to see the trade-offs and how applicable the approach is to non-Bayesian techniques.

]]>Your post appeared 2 minutes (!) after I asked about model comparison here:

https://statmodeling.stat.columbia.edu/2020/06/18/estimating-the-effects-of-non-pharmaceutical-interventions-on-covid-19-in-europe/#comment-1362950

… so this is really helpful for me. Thanks! ]]>

Radford also had a comment in favor of using reference models and then fitting a smaller model to approximate the reference model, but I think he never did that in his papers.

]]>And oops just noticed this is likely going in the wrong place – meant to respond to Andrew’s comment.

]]>“Why a bigger model helps inference for smaller models” reminds me of this wonderful Radford Neal quote:

]]>Sometimes a simple model will outperform a more complex model . . . Nevertheless, I believe that deliberately limiting the complexity of the model is not fruitful when the problem is evidently complex. Instead, if a simple model is found that outperforms some particular complex model, the appropriate response is to define a different complex model that captures whatever aspect of the problem led to the simple model performing well.