## Estimation Theory and the Bias-Variance Trade-Off

Below is a (rough!) draft of a talk I gave at the Berlin Machine Learning learning group on the bias-variance trade-off, including the basics of estimation theory, some examples, and maximum likelihood estimation.

## Eigendecomposition of real, symmetric matrices

We show that a real, symmetric matrix has a basis of real-valued orthonormal eigenvectors and that the corresponding eigenvalues are real. Moreover, we show that these eigenvalues are all non-negative if and only if the matrix is positive semi-definite.
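The 2×2 case can be checked by hand. The sketch below (my own illustration, not part of the talk) computes the real eigenvalues and an orthonormal eigenbasis of a symmetric matrix `[[a, b], [b, c]]`:

```python
import math

# For a real symmetric matrix [[a, b], [b, c]], the characteristic
# polynomial is l^2 - (a + c) l + (ac - b^2). Its discriminant,
# (a - c)^2 + 4 b^2, is non-negative, so both eigenvalues are real.
def eig_sym_2x2(a, b, c):
    disc = math.sqrt((a - c) ** 2 + 4 * b * b)
    l1 = ((a + c) + disc) / 2
    l2 = ((a + c) - disc) / 2

    # An eigenvector for eigenvalue l is proportional to (b, l - a),
    # assuming b != 0; normalise it to unit length.
    def unit(l):
        x, y = b, l - a
        n = math.hypot(x, y)
        return (x / n, y / n)

    return (l1, unit(l1)), (l2, unit(l2))

(l1, v1), (l2, v2) = eig_sym_2x2(2.0, 1.0, 2.0)  # the matrix [[2, 1], [1, 2]]
# The eigenvalues are 3 and 1; v1 and v2 are orthonormal, and since both
# eigenvalues are non-negative, the matrix is positive semi-definite.
```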

## Literary treasure hunting with the Lateral API

A good friend, Sam Matthews, gave a talk in December 2014 at a conference of the Australian Modernist Studies Network on “Transnational Modernisms”. Sam spoke about his discovery of a reference to a print-shop from Balzac’s “Two Poets” in Christina Stead’s novel Seven Poor Men of Sydney. Sam later suggested that I check whether we couldn’t use Lateral’s text-matching service (the “Recommender (BYO documents!)” API) to confirm this reference to Balzac and potentially uncover others. This is hardly a conclusive experiment, but as you’ll see below, the preliminary results are very encouraging.

In case you would like to search for references to Balzac’s works yourself, you can do so by reusing the API key I created: `b4de9b9183df4cbf8d70cde15609800a` .

This is how I proceeded:

1. I downloaded the complete works of Balzac from Project Gutenberg. This gives one HTML file for each of Balzac’s works.
2. I split each work into paragraphs, labelling the paragraphs by their work and position within the work. Balzac wrote many paragraphs, it turns out!
3. I subscribed to the API at Lateral, obtaining an API key.
4. I installed Francis Tzeng’s Python package for accessing the Lateral API.
5. Using the Python package, I added the paragraphs of Balzac to the Lateral recommender. Short paragraphs without enough meaningful words were rejected; in total, over 21,000 meaningful paragraphs of Balzac were indexed.
6. Again using the python package, I searched for the closest paragraphs of Balzac to the passage of Stead that Sam had indicated to me (see below).
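Step 2 can be sketched as follows (the function name and word-count threshold are my own; the actual rejection of short paragraphs was done by the Lateral API, so the local filter here is just a stand-in). Each paragraph is labelled by its work and position, matching ids like “TWO POETS-00019”:

```python
# Split a work's plain text into paragraphs on blank lines, drop very
# short ones, and label each paragraph with the work's title and its
# position within the work (zero-padded to five digits).
def label_paragraphs(title, text, min_words=10):
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [
        (f"{title}-{i:05d}", p)
        for i, p in enumerate(paragraphs)
        if len(p.split()) >= min_words
    ]

sample = (
    "The first paragraph, with easily enough words to pass the filter here."
    "\n\nToo short."
)
labelled = label_paragraphs("TWO POETS", sample)
# Only the first paragraph survives, labelled "TWO POETS-00000".
```

Note that paragraphs are numbered before filtering, so a paragraph's label always reflects its true position in the work.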

The passage of Stead’s novel that evokes the print-shop appears below (from Chapter 3):

devil’s kitchen where the word is made bread … triangular park … A wide old doorway opened beside the tobacconist’s shop, and over it was a name, white on blue, “Tank Steam Press, Ground Floor.” The tobacconist owned the old single-storey building and rented out to several establishments the mouldy apartments of the ground and first floor. In the attic was the man who did heliogravure. The building had once been a private house. Its court was now a cart-dock and opened into the other street. Its first-floor bathroom at the head of the stairs contained the old water-closet, used by all the workers in the house, a gas-ring to make tea, and the usual broken chairs and out-of-date telephone directories. The distinctive smell of the building came from this closet and from the printing-ink. Joseph walked through the old doorway, went by a staircase and entered the large airy double room occupied by the Press. He opened the glass back-door and moved about among the presses, curiously inspecting the jobs in their various stages, picking up a paper, looking through the bills on a bill-hook, putting his finger in the dust in the little glassed-in office of Chamberlain, the owner, and shutting off the stove, lighted by the cleaner, because the day was warm enough.

Below are the paragraphs of Balzac that are semantically closest to the text above, according to Lateral. As you can see, the 1st and the 9th closest paragraphs (of over 21,000!) come from “Two Poets”, and inspection reveals that they indeed concern the print-shop! You can click the links to fetch the corresponding paragraphs using the API. The intermediately ranked results seem to be architectural descriptions.

```
[
  { "distance": 0.034905, "document_id": "TWO POETS-00019" },
  { "distance": 0.035945, "document_id": "THE COLLECTION OF ANTIQUITIES-00557" },
  { "distance": 0.037409, "document_id": "SONS OF THE SOIL-01139" },
  { "distance": 0.038067, "document_id": "A MAN OF BUSINESS-00034" },
  { "distance": 0.038168, "document_id": "URSULA-01020" },
  { "distance": 0.038216, "document_id": "COUSIN PONS-01938" },
  { "distance": 0.03837, "document_id": "COLONEL CHABERT-00023" },
  { "distance": 0.038545, "document_id": "COUSIN BETTY-01508" },
  { "distance": 0.038823, "document_id": "TWO POETS-00018" },
  { "distance": 0.038891, "document_id": "RISE AND FALL OF CESAR BIROTTEAU-01382" },
  { "distance": 0.039151, "document_id": "THE RED INN and others-00045" },
  { "distance": 0.039195, "document_id": "THE LESSER BOURGEOISIE(The Middle Classes)-00635" },
  { "distance": 0.039369, "document_id": "SCENES FROM A COURTESAN'S LIFE-00761" },
  { "distance": 0.039377, "document_id": "THE TWO BROTHERS-00663" },
  { "distance": 0.039471, "document_id": "HONORINE-00036" },
  { "distance": 0.039808, "document_id": "Z. MARCAS-00043" },
  { "distance": 0.039896, "document_id": "RISE AND FALL OF CESAR BIROTTEAU-00623" },
  { "distance": 0.040041, "document_id": "THE VILLAGE RECTOR-00313" },
  { "distance": 0.040253, "document_id": "A WOMAN OF THIRTY-00700" },
  { "distance": 0.04031, "document_id": "CATHERINE DE' MEDICI-01059" }
]
```

## Lancichinetti et al 2015

http://arxiv.org/abs/1402.0422

(forwarded by Schakel)

LDA is a generative probabilistic topic model. The authors build toy corpora of synthetic documents whose words are drawn from distinct natural languages; this follows exactly the generative model posited by LDA, with the languages playing the role of topics. They then calculate the likelihood of the desired (i.e. generating) solution and the likelihoods of various deformed solutions, and show that in quite normal cases the generating solution can have a lower likelihood than a deformed one.
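A minimal version of this generative setting can be sketched as follows (the vocabularies and helper function are my own, not the authors’): each “topic” is a language with its own vocabulary, and a document is produced by drawing each word’s language from the document’s topic proportions, exactly as LDA posits:

```python
import random

# Toy "topics": two languages with disjoint vocabularies.
VOCAB = {
    "english": ["house", "bread", "river", "stone"],
    "german": ["haus", "brot", "fluss", "stein"],
}

def make_document(weights, length, rng):
    # weights: {language: probability}. For each word position, first
    # draw a language according to the topic proportions, then draw a
    # word uniformly from that language's vocabulary.
    langs = list(weights)
    probs = [weights[l] for l in langs]
    doc = []
    for _ in range(length):
        lang = rng.choices(langs, probs)[0]
        doc.append(rng.choice(VOCAB[lang]))
    return doc

rng = random.Random(0)
doc = make_document({"english": 0.7, "german": 0.3}, 20, rng)
# A 20-word document, roughly 70% English and 30% German words.
```

The generating solution here is known by construction (the two vocabularies and the 70/30 mixture), which is what lets the authors compare its likelihood against deformed solutions.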

They further show that the generating solution is often not obtained in practice by LDA and its standard methods of optimisation, even in the normal case where the generating solution is the unique global maximum of the likelihood function.

They describe a pragmatic (non-probabilistic) approach to topic modelling: first cluster words by detecting communities in the (denoised) word co-occurrence graph, then use these clusters to (somehow) choose initial values for PLSA or LDA, obtaining a better solution.

They demonstrate how much better their method performs on their synthetic data.

I find the authors’ results vindicating. The esteem of the machine learning community for LDA is so at odds with my own experience of its performance that I had wondered whether I had misunderstood something. In information retrieval, we found LDA to be consistently out-performed by the non-probabilistic decompositions PCA and NMF.

It is not too hard to find support for what I sense might be considered an unpopular opinion:

“Performance of LDA has never significantly surpassed PLSI (in fact we often found inferior results) which is the reason we left them out”

The authors of this paper undertook to investigate the shortcomings of LDA by constructing toy models. As they themselves point out, constructing toy models is not a new idea, but we don’t seem to do enough of it in machine learning.

## An overview of word2vec

Below are the slides from my talk at the Berlin Machine Learning Meetup group on July 8, 2014, giving an overview of word2vec, covering the CBOW learning task, hierarchical softmax and negative sampling.