This is an old but interesting paper from 2001 by Simon Tong (now at Google) and Daphne Koller (who launched Coursera with Andrew Ng), both then at Stanford. (PDF here)

As the title suggests, the authors apply SVMs to text classification using active learning techniques. This is the first active learning paper I’ve read, and it has provided me with some useful notions. I’ll be reading further in this field, as building e.g. a Tinder-like application for finding research articles of interest or for “e-Discovery” applications in litigation interest me. These are both examples of relevance feedback for *transduction* since the classifier will be applied to a collection of (unlabelled) samples that we know of in advance (in contrast, applying to unseen samples, e.g. for e-mail filtering, is *induction*). In machine learning, recognition of this distinction is due to Vapnik.

The authors introduce the notion of a *version space* for a SVM trained on linearly separable data: it is the set of all hyperplanes that separate the two classes in the training data, thought of as points on the unit sphere by taking the unit normal to the hyperplane. The area of the version space is thus a measure of our uncertainty of the true decision boundary. The authors propose an active learning approach that successively “queries” (i.e. requests the label for) a sample that would maximally reduce the area of the version space.

Three heuristic methods are proposed for choosing a such a sample: SimpleMargin, MaxMin Margin and Ratio Margin. The latter two outperform the former, but are significantly more expensive, computationally, since they require the training of SVMs and the cost of training a SVM is polynomial in the size of the training set. All three active learning methods are much better than random.

Linear separability of the training examples is not a problem for the authors, since they operate of word count vectors (the vocabulary size = rank is about 10k). In the case where a dense, lower dimensional vectorisation was used, linear separability would likely still hold, particularly if kernels were used.

The perform experiments using the Reuters corpus and a newsgroup collection.

**Questions**

- Active learning by successively reducing the area of the version space is an interesting approach. I wonder if, more generally, we might think of choosing a sample that maximally reduces the entropy of the posterior distribution?