## Support Vector Machine Active Learning with Applications to Text Classification

This is an old but interesting paper from 2001 by Simon Tong (now at Google) and Daphne Koller (who launched Coursera with Andrew Ng), both then at Stanford. (PDF here)

As the title suggests, the authors apply SVMs to text classification using active learning techniques.  This is the first active learning paper I’ve read, and it has provided me with some useful notions.  I’ll be reading further in this field, since applications such as a Tinder-like tool for finding research articles of interest, or “e-Discovery” in litigation, interest me.  Both are examples of relevance feedback for transduction, since the classifier will be applied to a collection of (unlabelled) samples that we know in advance (in contrast, applying it to unseen samples, e.g. for e-mail filtering, is induction).  In machine learning, recognition of this distinction is due to Vapnik.

The authors introduce the notion of a version space for an SVM trained on linearly separable data: it is the set of all hyperplanes that separate the two classes in the training data, with each hyperplane thought of as a point on the unit sphere via its unit normal.  The area of the version space is thus a measure of our uncertainty about the true decision boundary.  The authors propose an active learning approach that successively “queries” (i.e. requests the label for) the sample that would maximally reduce the area of the version space.

Three heuristic methods are proposed for choosing such a sample: Simple Margin, MaxMin Margin and Ratio Margin.  The latter two outperform the first but are significantly more expensive computationally, since they require training additional SVMs, and the cost of training an SVM is polynomial in the size of the training set.  All three active learning methods perform much better than random querying.
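The cheapest heuristic, Simple Margin, just queries the unlabelled point closest to the current hyperplane.  It can be sketched in a few lines with scikit-learn; the toy data, seed set and query budget below are my own illustrative choices, not the paper’s setup:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy linearly separable data: two well-separated Gaussian blobs.
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

labelled = [0, 1, 100, 101]                      # small seed set, both classes
pool = [i for i in range(200) if i not in labelled]

clf = SVC(kernel="linear")
for _ in range(10):                              # 10 active-learning queries
    clf.fit(X[labelled], y[labelled])
    # Simple Margin: pick the pool point with smallest |decision function|,
    # i.e. the one closest to the current separating hyperplane.
    margins = np.abs(clf.decision_function(X[pool]))
    query = pool.pop(int(np.argmin(margins)))
    labelled.append(query)                       # an oracle supplies y[query]

print("accuracy on remaining pool:", clf.score(X[pool], y[pool]))
```

The MaxMin and Ratio heuristics differ only in the scoring step: instead of the distance to the hyperplane, they score each candidate by retraining an SVM under each hypothetical label, which is where the extra cost comes from.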

Linear separability of the training examples is not a problem for the authors, since they operate on word-count vectors (the vocabulary size, and hence the dimension, is about 10k).  If a dense, lower-dimensional vectorisation were used instead, linear separability would likely still hold, particularly if kernels were used.

They perform experiments using the Reuters corpus and a newsgroup collection.

Questions

• Active learning by successively reducing the area of the version space is an interesting approach. I wonder if, more generally, we might think of choosing a sample that maximally reduces the entropy of the posterior distribution?

## Entropy of the Normal Distribution

Here we calculate the entropy of the normal distribution and show that the normal distribution has maximal entropy amongst all distributions with a given finite variance.

The video below concerns just the calculation of the entropy, not the maximality property.
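For reference, the result of the calculation: if $X \sim \mathcal{N}(\mu, \sigma^2)$ with density $f$, then

```
h(X) = -\int_{-\infty}^{\infty} f(x) \ln f(x)\, dx,
\qquad
f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right).

Since
-\ln f(x) = \tfrac{1}{2} \ln (2\pi\sigma^2) + \frac{(x-\mu)^2}{2\sigma^2},
taking the expectation with respect to $f$ (and using $\mathbb{E}[(X-\mu)^2] = \sigma^2$) gives

h(X) = \tfrac{1}{2} \ln (2\pi\sigma^2) + \tfrac{1}{2}
     = \tfrac{1}{2} \ln (2\pi e \sigma^2).
```

Note that the entropy depends only on the variance, not on the mean, as one would expect from translation invariance of differential entropy.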

## Shannon: averaging operations can only increase the entropy

Shannon makes this claim in his 1948 paper and lists this property as one of his justifications for the definition of entropy. We give a short proof of his claim.
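The proof is essentially one application of Jensen’s inequality.  Writing the averaging operation as $p' = Ap$ with $A = (a_{ij})$ doubly stochastic (rows and columns sum to 1), and using the concavity of $\eta(t) = -t \ln t$:

```
H(p') = \sum_i \eta\Big( \sum_j a_{ij} p_j \Big)
      \geq \sum_i \sum_j a_{ij} \, \eta(p_j)
      = \sum_j \eta(p_j) \sum_i a_{ij}
      = \sum_j \eta(p_j)
      = H(p),
```

where the inequality is Jensen’s applied row-wise (each row of $A$ is a probability vector), and the second-to-last equality uses that the columns of $A$ sum to 1.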

## Matrix Factorisation and the Eigendecomposition of the Gram Matrix

Here we consider the problem of approximately factorising a matrix without constraints and show that solutions can be generated from the orthonormal eigenvectors of the Gram matrix (i.e. of the sample covariance matrix).

For this we need the eigendecomposition of real symmetric matrices.
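A quick numerical illustration of the construction with numpy (the matrix sizes and target rank are arbitrary choices for the example): the top-$k$ eigenvectors of the Gram matrix $X^T X$ yield the same rank-$k$ approximation as the truncated SVD.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
k = 3  # target rank of the factorisation

# Eigendecomposition of the Gram matrix X^T X (symmetric PSD).
gram = X.T @ X
eigvals, eigvecs = np.linalg.eigh(gram)   # eigenvalues in ascending order
V = eigvecs[:, ::-1][:, :k]               # top-k orthonormal eigenvectors

# Factorisation X ~ W V^T with W = X V (projection onto top eigenvectors).
W = X @ V
X_hat = W @ V.T

# This matches the truncated SVD, the Eckart-Young optimum.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_svd = U[:, :k] * s[:k] @ Vt[:k]

print(np.allclose(X_hat, X_svd))
```

This works because the top eigenvectors of $X^T X$ are exactly the top right singular vectors of $X$, so $X V V^T$ is the projection of the rows of $X$ onto the top-$k$ right singular subspace.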

Questions, all related to one another:

• What other solutions are there?
• (Speculative) can we characterise the solutions as orbits of the orthogonal group on the solutions above, and on those solutions obtained from the above by adding rows of zeros to ?
• Under what constraints, if any, are the optimal solutions to matrix factorisation matrices with orthonormal rows/columns? To what extent does orthogonality come for free?

## Estimation Theory and the Bias-Variance Trade-Off

Below is a (rough!) draft of a talk I gave at the Berlin Machine-Learning Learning-Group on the bias-variance trade-off, including the basics of estimator theory, some examples and maximum likelihood estimation.
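A toy simulation of the trade-off for variance estimators (my own illustration, not taken from the talk): the maximum likelihood estimator of the variance divides by $n$ and is biased, while dividing by $n-1$ gives the unbiased estimator, at the cost of a larger variance.

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0                 # variance of N(0, 2^2)
n, trials = 5, 200_000

samples = rng.normal(0, 2, size=(trials, n))
mle = samples.var(axis=1, ddof=0)        # MLE: divide by n (biased)
unbiased = samples.var(axis=1, ddof=1)   # divide by n - 1 (unbiased)

# The MLE's bias is -sigma^2 / n = -0.8 here, but its variance is smaller.
print("MLE:      bias ~ %.3f, variance ~ %.3f" % (mle.mean() - true_var, mle.var()))
print("unbiased: bias ~ %.3f, variance ~ %.3f" % (unbiased.mean() - true_var, unbiased.var()))
```

Neither estimator dominates the other in mean squared error for all distributions and sample sizes, which is the trade-off in miniature.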

## Eigendecomposition of real, symmetric matrices

We show that a real, symmetric matrix has basis of real-valued orthonormal eigenvectors and that the corresponding eigenvalues are real.  We show moreover that these eigenvalues are all non-negative if and only if the matrix is positive semi-definite.
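A quick numerical check of these statements with numpy (`eigh` is the routine for symmetric matrices; the PSD example matrix $B^T B$ is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
A = B.T @ B                      # B^T B is symmetric positive semi-definite

eigvals, Q = np.linalg.eigh(A)   # eigh is specialised to symmetric matrices

# Eigenvalues are real and (here) non-negative; eigenvectors are orthonormal.
print(np.all(eigvals >= -1e-10))                   # PSD => non-negative spectrum
print(np.allclose(Q.T @ Q, np.eye(4)))             # orthonormal basis
print(np.allclose(A, Q @ np.diag(eigvals) @ Q.T))  # A = Q Lambda Q^T
```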

## Literary treasure hunting with the Lateral API

A good friend, Sam Matthews, gave a talk in December 2014 at a conference of the Australian Modernist Studies Network on “Transnational Modernisms”. Sam spoke about his discovery of a reference to a print-shop from Balzac’s “Two Poets” in Christina Stead’s novel Seven Poor Men of Sydney. Sam later suggested that I check whether we couldn’t use Lateral’s text-matching service (the “Recommender (BYO documents!)” API) to confirm this reference to Balzac and potentially uncover others. As you’ll see below, the preliminary results are very encouraging, though this is hardly a conclusive experiment.

In case you would like to search for references to Balzac’s works yourself, you can do so by reusing the API key I created: `b4de9b9183df4cbf8d70cde15609800a` .

This is how I proceeded:

1. I downloaded the complete works of Balzac from Project Gutenberg. This gives one HTML file for each of Balzac’s works.
2. I split each work into paragraphs, labelling each paragraph by its work and position within the work. Balzac wrote many paragraphs, it turns out!
3. I subscribed to the API at Lateral, obtaining an API key.
4. I installed Francis Tzeng’s Python package for accessing the Lateral API.
5. Using the Python package, I added the paragraphs of Balzac to the Lateral recommender. Short paragraphs containing too few meaningful words were rejected; in total, over 21,000 meaningful paragraphs of Balzac were indexed.
6. Again using the Python package, I searched for the paragraphs of Balzac closest to the passage of Stead that Sam had indicated to me (see below).
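Step 2 (splitting each work into labelled paragraphs) can be sketched in a few lines of Python; the splitting rule and minimum length here are illustrative choices, not necessarily what I actually used:

```python
import re

def paragraphs(text, work_name, min_words=20):
    """Split plain text into paragraphs, labelled by work and position."""
    chunks = re.split(r"\n\s*\n", text)      # blank lines delimit paragraphs
    labelled = {}
    for i, chunk in enumerate(chunks):
        chunk = " ".join(chunk.split())      # collapse internal whitespace
        if len(chunk.split()) >= min_words:  # drop paragraphs that are too short
            labelled["%s-%05d" % (work_name, i)] = chunk
    return labelled

sample = "First paragraph, too short.\n\n" + ("word " * 25) + "\n\nAnother short one."
docs = paragraphs(sample, "TWO POETS")
print(list(docs))  # only the long middle paragraph survives
```

The `WORK-00019`-style labels visible in the results below encode exactly this: the work’s title and the paragraph’s position within it.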

The passage of Stead’s novel that evokes the print-shop appears below (from Chapter 3):

> devil’s kitchen where the word is made bread … triangular park … A wide old doorway opened beside the tobacconist’s shop, and over it was a name, white on blue, “Tank Steam Press, Ground Floor.” The tobacconist owned the old single-storey building and rented out to several establishments the mouldy apartments of the ground and first floor. In the attic was the man who did heliogravure. The building had once been a private house. Its court was now a cart-dock and opened into the other street. Its first-floor bathroom at the head of the stairs contained the old water-closet, used by all the workers in the house, a gas-ring to make tea, and the usual broken chairs and out of-date telephone directories. The distinctive smell of the building came from this closet and from the printing-ink. Joseph walked through the old doorway, went by a staircase and entered the large airy double room occupied by the Press. He opened the glass back-door and moved about among the presses, curiously inspecting the jobs in their various stages, picking up a paper, looking through the bills on a bill-hook, putting his finger in the dust in the little glassed-in office of Chamberlain, the owner, and shutting off the stove, lighted by the cleaner, because the day was warm enough.

Below are the paragraphs of Balzac that are semantically closest to the text above, according to Lateral. As you can see, the 1st and the 9th closest paragraphs (of over 21,000!) indeed come from “Two Poets”, and inspection reveals that they indeed concern the printshop! You can click the links to fetch the corresponding paragraphs using the API. The intermediately ranked results seem to be architectural descriptions.

```
[
  { "distance": 0.034905, "document_id": "TWO POETS-00019" },
  { "distance": 0.035945, "document_id": "THE COLLECTION OF ANTIQUITIES-00557" },
  { "distance": 0.037409, "document_id": "SONS OF THE SOIL-01139" },
  { "distance": 0.038067, "document_id": "A MAN OF BUSINESS-00034" },
  { "distance": 0.038168, "document_id": "URSULA-01020" },
  { "distance": 0.038216, "document_id": "COUSIN PONS-01938" },
  { "distance": 0.03837,  "document_id": "COLONEL CHABERT-00023" },
  { "distance": 0.038545, "document_id": "COUSIN BETTY-01508" },
  { "distance": 0.038823, "document_id": "TWO POETS-00018" },
  { "distance": 0.038891, "document_id": "RISE AND FALL OF CESAR BIROTTEAU-01382" },
  { "distance": 0.039151, "document_id": "THE RED INN and others-00045" },
  { "distance": 0.039195, "document_id": "THE LESSER BOURGEOISIE(The Middle Classes)-00635" },
  { "distance": 0.039369, "document_id": "SCENES FROM A COURTESAN'S LIFE-00761" },
  { "distance": 0.039377, "document_id": "THE TWO BROTHERS-00663" },
  { "distance": 0.039471, "document_id": "HONORINE-00036" },
  { "distance": 0.039808, "document_id": "Z. MARCAS-00043" },
  { "distance": 0.039896, "document_id": "RISE AND FALL OF CESAR BIROTTEAU-00623" },
  { "distance": 0.040041, "document_id": "THE VILLAGE RECTOR-00313" },
  { "distance": 0.040253, "document_id": "A WOMAN OF THIRTY-00700" },
  { "distance": 0.04031,  "document_id": "CATHERINE DE' MEDICI-01059" }
]
```

## Lancichinetti et al 2015

http://arxiv.org/abs/1402.0422

(forwarded by Schakel)

LDA is a generative probabilistic topic model. The authors generate toy models using synthetic documents of words from distinct natural languages. This is in accordance with the generative model posited by LDA, where the topics here are the languages. They then calculate the likelihood of the desired (i.e. generating) solution and the likelihood of various deformed solutions, and show that in quite normal cases the generating solution can have a lower likelihood than a deformed solution.

They further show that the generating solution is often not obtained in practice by LDA and its standard methods of optimisation, even in the normal case where the generating solution is the unique global maximum of the likelihood function.

They describe a pragmatic (non-probabilistic) approach to topic modelling, which involves first clustering words by detecting communities in the (denoised) word co-occurrence graph, and then using these clusters to (somehow) choose initial values for PLSA or LDA in order to obtain a better solution.

They demonstrate how much better their method performs on their synthetic data.
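The synthetic setup can be sketched with numpy (toy sizes, my own choices rather than the paper’s): each “topic” is a natural language with its own disjoint vocabulary, each document is written entirely in one language, and the likelihood of the generating solution can then be computed and compared against deformed solutions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "languages" with disjoint vocabularies of 50 words each: the topics.
V, T, docs_per_topic, doc_len = 100, 2, 50, 40
topic_word = np.zeros((T, V))
topic_word[0, :50] = rng.dirichlet(np.ones(50))   # language 0 uses words 0..49
topic_word[1, 50:] = rng.dirichlet(np.ones(50))   # language 1 uses words 50..99

# Each synthetic document is drawn from exactly one language.
corpus, labels = [], []
for t in range(T):
    for _ in range(docs_per_topic):
        corpus.append(rng.choice(V, size=doc_len, p=topic_word[t]))
        labels.append(t)

def log_likelihood(corpus, labels, topic_word):
    """Log-likelihood of the corpus under a hard document-topic assignment."""
    return sum(np.log(topic_word[t, doc]).sum()
               for doc, t in zip(corpus, labels))

# Likelihood of the generating ("desired") solution; deformed solutions are
# obtained by perturbing topic_word or labels and compared the same way.
ll = log_likelihood(corpus, labels, topic_word)
print("log-likelihood of generating solution:", ll)
```

The authors’ observation is that, even in this idealised setting, a deformed solution can score a higher likelihood than the generating one, which is what makes the comparison machinery above interesting.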

I find the results of the authors vindicating. I have found the esteem of the machine learning community for LDA so at odds with my own experience of its performance that I wondered if I had misunderstood something. In the realm of information retrieval, we found LDA to be consistently out-performed by the non-probabilistic decompositions of PCA and NMF.

It is not too hard to find support for what I sense might be considered an unpopular opinion:

> “Performance of LDA has never significantly surpassed PLSI (in fact we often found inferior results) which is the reason we left them out”

The authors of this paper undertook to investigate the shortcomings of LDA by constructing some toy models. As they note themselves, constructing toy models is not a new idea, but we don’t seem to do enough of it in machine learning.