Below are notes from a talk on Expectation Maximisation I gave at our ML-learning group. Gaussian mixture models are considered as an example application.
Another useful reference is likely the 1977 paper by Dempster et al. that made the technique famous (this is something I would have liked to have read, but didn’t).
- I still don’t understand how EM manages to (reportedly) work so well, given that the maximisation chooses for the next parameter vector precisely the one that reinforces the “fantasy” completions of the data made by the previous parameter vector. I would not have considered it a good learning strategy. It contrasts greatly with, for example, the learning strategy of a restricted Boltzmann machine, in which, at each iteration, the parameters are adjusted so as to correct the model’s fantasy towards producing the observed data.
- Can we offer a better argument for why maximisation of the likelihood for latent variable models is difficult?
- Is the likelihood of an exponential family distribution convex in the parameters? This is certainly the case for e.g. the mean of a Gaussian. Does this explain why the maximisation of the constructed lower bound for the likelihood is easy?