High Reproducibility and High-Accuracy Method for Automated Topic Extraction

Lancichinetti et al 2015


(forwarded by Schakel)

LDA is a generative probabilistic topic model. The authors generate toy models using synthetic documents of words from distinct natural languages. This is in accordance with the generative model posited by LDA, where the topics here are the languages. They then calculate the likelihood of the desired (i.e. generating) solution and the likelihood of various deformed solutions, and show that in quite normal cases the generating solution can have a lower likelihood than a deformed solution.

They further show that the generating solution is often not obtained in practice by LDA and its standard methods of optimisation, even in the normal case where the generating solution is the unique global maximum of the likelihood function.

They describe an pragmatic (non-probabilisitic) approach to topic modelling, which involves first clustering words by detecting communities in the (denoised) word co-occurrence graph, and using these clusters to (somehow) choose initial values for PLSA or LDA to obtain a better solution.

They demonstrate how much better their method performs on their synthetic data.

I find the results of the authors vindicating. I have found the esteem of the machine learning community for LDA so at odds with my own experience of its performance that I wondered if I had misunderstood something. In the realm of information retrieval, we found LDA to be consistently out-performed by the non-probabilistic decompositions of PCA and NMF.

It is not too hard to find support for what I sense might be considered an unpopular opinion:

“Performance of LDA has never significantly surpassed PLSI (in fact we often found inferior results) which is the reason we left them out”


The authors of this paper undertook to investigate the short-comings of LDA by constructing some toy models. As they suggest themselves, it is not a new idea to create a toy model, but we don’t seem to do enough of it in machine learning.