Don’t interpret linear hidden units, they don’t exist.

Having trained a model, it is natural to want to understand how it works. An intuitively appealing approach is to consider data samples that maximise the activation of a hidden unit, and to take the common input features of these samples as an indication of what that unit has learned to recognise. However, as we’ll see below, it is a misconception to speak of hidden units if:

  • there is no non-linearity on the hidden layer;
  • the weights connecting the layers are unconstrained; and
  • the model is trained using (stochastic) gradient descent or similar.

In such a scenario, the hidden feature space must instead be considered as a whole.


Consider the task of factorising a matrix X as a product of matrices X \cong A^T B with some fixed inner dimension k. The model parameters are pairs of matrices (A, B) with the appropriate dimensions, and the image of an input vector x on the hidden layer is given by Ax. To consider this vector Ax in terms of hidden unit activations is to fix a co-ordinate system in the hidden feature space, and to measure the displacement of the vector along each co-ordinate axis. If E_1, \dots , E_k denote the unit vectors corresponding to the chosen co-ordinate system, then the displacements are given by the inner products

    \begin{equation*} \langle Ax, E_i \rangle, \quad i = 1, \dots, k. \end{equation*}

We show below that if P is any rotation of the hidden feature space, then the model parameters (PA, PB) are just as likely as (A, B) to result in the factorisation of a fixed matrix X and that which of these occurs depends only on the random initialisation of gradient descent. Thus the hidden unit activations might just as likely have been given by

    \begin{equation*} \langle (PA)x, E_i \rangle, \quad i = 1, \dots, k. \end{equation*}

The hidden unit activations given by these two expressions can be very different indeed. In fact, since P is an orthogonal transformation, we have

    \[ \langle (PA)x, E_i \rangle = \langle Ax, P^T E_i \rangle, \quad i = 1, \dots, k \]

(see e.g. here). Thus the indeterminacy of the model parameters, i.e. (A, B) vs. (PA, PB), might equivalently be thought of as an indeterminacy in the orientation of the co-ordinate system, i.e. the E_i vs. the P^T E_i. The choice of orientation of co-ordinate basis is completely arbitrary, so speaking of hidden unit activations makes no sense at all.

The above holds more generally for P an orthogonal transformation of the hidden feature space, i.e. for P a composition of rotations and reflections.

Szegedy et al.

None of the above is new. For example, it was stated by Szegedy et al. in an empirical study of the interpretability of hidden units. We are demonstrating, step-by-step, a statement of theirs (which was about word2vec):

… word representations, where the various directions in the vector space representing the words are shown to give rise to a surprisingly rich semantic encoding of relations and analogies. At the same time, the vector representations are stable up to a rotation of the space, so the individual units of the vector representations are unlikely to contain semantic information.

Matrix factorisation and unit activation

Given a matrix X and an inner dimension k, the task of matrix factorisation is to learn two matrices A and B, each with k rows, whose product approximates X:

    \[ X \cong A^T B. \]

The parameter space consists of the entries of the matrices A and B. The hidden feature space, on the other hand, is the k-dimensional space containing the columns of A and B.

Error function

To train a matrix factorisation model using gradient descent, the model parameters are repeatedly updated using the gradient vector of the error function. An example error function E could be

    \[ E_X (A, B) = \sum_{i, j} {(X_{i, j} - (A^T B)_{i,j})^2}. \]

Notice that this choice of error function doesn’t depend directly on the pair of matrices (A, B), but rather only on their product A^T B, i.e. only on the approximation A^T B of X. This is true of any error function E_X, because error functions depend only on inputs and outputs.

Orthogonal transformations of the hidden feature space

Recall that orthogonal transformations of a space are just compositions of rotations and reflections about hyperplanes passing through the origin. Considered as matrices, orthogonal transformations are defined by the property that their product with their transpose gives the identity matrix. Using this property, it can be seen that an orthogonal transformation of the hidden feature space defines an orthogonal transformation of the parameter space by acting simultaneously on the column vectors of the matrices. If O_k and O_{(m+n)k} denote the groups of orthogonal transformations on the hidden feature space and the parameter space, respectively, then:

    \[ O_k \hookrightarrow O_{(m+n)k}, \qquad P \mapsto \mathrm{diag}(P, \dots, P), \]

with one diagonal block P for each of the m + n columns of A and B.

Contour lines of the gradient

The effect of this block-diagonal orthogonal transformation on the parameter space corresponds to multiplying the matrices A and B on the left by the orthogonal transformation P of the feature space, i.e. it effects (A, B) \mapsto (PA, PB). Notice that (A, B) and (PA, PB) yield the same approximation to the original matrix X, since:

    \[ (PA)^T (PB) = (A^T P^T) (PB) = A^T (P^T P) B = A^T B. \]

Thus E_X (A, B) = E_X (PA, PB), so the orthogonal transformations P of the hidden feature space trace out contour lines of E_X in the parameter space. Now the gradient vector is always perpendicular to the contour line, so the sequence of points in the parameter space visited during gradient descent preserves the orientation of the hidden feature space set at initialisation (see here, for example). More precisely, differentiating the identity E_X (PA, PB) = E_X (A, B) shows that the gradient at (PA, PB) is the gradient at (A, B) with P applied to each of its columns, so every gradient update commutes with P. So if gradient descent of E_X starting at the initial parameters (A^{(0)}, B^{(0)}) converges to the parameters (A, B), and you’d prefer that it instead converged to (PA, PB), then all you need to do is start the gradient descent over again, but this time with the initial parameters (PA^{(0)}, PB^{(0)}). We thus see that the matrices (A, B) that our matrix factorisation model has learned are only determined up to an orthogonal transformation of the hidden feature space, i.e. up to a simultaneous transformation of their columns.
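
The claim can be checked numerically. What follows is a minimal NumPy sketch rather than code from the original analysis: the matrix sizes, learning rate and number of steps are arbitrary choices, and `gradients` and `gradient_descent` are hypothetical helper names. It runs plain gradient descent on the squared error twice, once from (A^{(0)}, B^{(0)}) and once from (PA^{(0)}, PB^{(0)}), and compares the results.

```python
import numpy as np

def gradients(A, B, X):
    """Gradients of E_X(A, B) = sum_ij (X - A^T B)_ij^2 with respect to A and B."""
    R = X - A.T @ B            # residual, shape (m, n)
    grad_A = -2.0 * B @ R.T    # shape (k, m)
    grad_B = -2.0 * A @ R      # shape (k, n)
    return grad_A, grad_B

def gradient_descent(A, B, X, lr=0.01, steps=200):
    """Plain (full-batch) gradient descent with a fixed learning rate."""
    for _ in range(steps):
        grad_A, grad_B = gradients(A, B, X)
        A, B = A - lr * grad_A, B - lr * grad_B
    return A, B

rng = np.random.default_rng(0)
m, n, k = 6, 5, 3                      # arbitrary sizes, chosen for illustration
X = rng.normal(size=(m, n))
A0 = 0.1 * rng.normal(size=(k, m))
B0 = 0.1 * rng.normal(size=(k, n))

# A random orthogonal matrix P (a rotation or a reflection), via a QR decomposition.
P, _ = np.linalg.qr(rng.normal(size=(k, k)))

A, B = gradient_descent(A0, B0, X)
A_rot, B_rot = gradient_descent(P @ A0, P @ B0, X)

# Both runs give the same approximation of X ...
same_product = np.allclose(A.T @ B, A_rot.T @ B_rot)
# ... and the second run ends at (PA, PB), up to floating-point error.
same_up_to_P = np.allclose(P @ A, A_rot) and np.allclose(P @ B, B_rot)
print(same_product, same_up_to_P)
```

The individual rows of A (the “hidden units”) differ between the two runs, even though the learned approximation of X is the same.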

Gradient descent methods

The above statements continue to hold in the case of stochastic gradient descent, where the error function E_X is not fixed but rather defined by varying mini-tasks (word2vec being one instance). Such error functions still don’t depend upon hidden layer values, so as above their gradient vectors are perpendicular to the contour lines traced out by the orthogonal transformations of the hidden layer. Thus the updates performed in stochastic gradient descent also preserve the original orientation of the feature space.


How likely is it that initial parameters, transformed via an orthogonal transformation as above, ever occur themselves as initial parameters? In order to conclude that the orientation of the co-ordinate system on the hidden layer is completely arbitrary, we need it to be precisely as likely. Thus if \pi denotes the probability distribution on the parameter space from which the initial parameters are drawn, we require

    \[ \pi ( (P A^{(0)}, P B^{(0)}) ) =  \pi ( (A^{(0)}, B^{(0)}) ), \]

for any initial parameters ( A^{(0)}, B^{(0)} ) and any orthogonal transformation P of the hidden feature space.

This is not the case with word2vec, where each parameter is drawn independently from a uniform distribution. However, it remains true that for any choice of initial parameters, there will still be any number of possible orientations of the co-ordinate system, but for some choices of initial parameters there is less freedom than for others.
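
To make the invariance condition concrete, here is a small sketch under assumptions that are not part of the analysis above: it compares an isotropic Gaussian initialisation (chosen here purely as an example of a distribution that does satisfy the condition) with independent uniform draws on [-1, 1] (the interval is arbitrary), for a single two-dimensional hidden vector.

```python
import numpy as np

def gaussian_log_density(theta, sigma=1.0):
    """Log density of i.i.d. N(0, sigma^2) entries; it depends on theta only through its norm."""
    return (-0.5 * np.sum(theta ** 2) / sigma ** 2
            - theta.size * np.log(sigma * np.sqrt(2 * np.pi)))

def uniform_density(theta, half_width=1.0):
    """Density of i.i.d. Uniform(-half_width, half_width) entries."""
    inside = np.all(np.abs(theta) <= half_width)
    return (1.0 / (2 * half_width)) ** theta.size if inside else 0.0

# One hidden vector (a column of A^(0)) close to a corner of the square [-1, 1]^2.
a0 = np.array([0.9, 0.9])
angle = np.pi / 4
P = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle),  np.cos(angle)]])   # rotation by 45 degrees

# The Gaussian density is unchanged by the rotation.
gaussian_invariant = np.isclose(gaussian_log_density(P @ a0), gaussian_log_density(a0))

# The uniform density is not: the rotated vector leaves the square altogether.
uniform_before = uniform_density(a0)
uniform_after = uniform_density(P @ a0)
print(gaussian_invariant, uniform_before, uniform_after)
```

Vectors near the corners of the cube have few admissible rotations, while vectors near the origin have many, which is the sense in which some initial parameters leave less freedom than others.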

Appendix: What about GloVe?

GloVe performs weighted matrix factorisation with bias terms, so the above should apply. The weighting is just a modified error function, and the bias terms are not hidden features and so are left unmodified by its orthogonal transformations. Like word2vec, GloVe initialises each parameter with independent samples from a uniform distribution, so there are no new problems there. The real problem with applying the above analysis to GloVe is that the implementation of Adagrad used makes the learning regime dependent on the choice of basis of the hidden feature space (see e.g. here). This doesn’t mean that the hidden unit activations of GloVe make sense; it just means that GloVe is less amenable to theoretical arguments like those above and needs to be considered empirically, e.g. in the manner of Szegedy et al.

Adagrad depends on the choice of co-ordinate system

Adagrad is a learning regime that maintains separate learning rates for each individual model parameter. It is used, for instance, in GloVe (perhaps incorrectly), in LightFM, and in many other places besides. Below is an example showing that Adagrad models evolve in a manner that depends upon the choice of co-ordinate system (i.e. orthonormal basis) for the parameter space. This dependency is no problem when the parameter space consists of many one-dimensional, conceptually unrelated features lined up beside one another, because such parameter spaces have only one natural orientation of the co-ordinate axes. It is a problem, however, for the use of Adagrad in models like GloVe and LightFM. In these cases the parameter space consists of many feature vectors (of dimension, say, 100) concatenated together. A learning regime should not depend upon the arbitrary choice of orthonormal basis in this 100-dimensional feature space.

For feature spaces like these, I would propose instead maintaining a separate learning rate for each feature vector (for example, in the case of GloVe, there would be one learning rate per word vector per layer). The learning rate of a feature vector would dampen the initial learning rate by the accumulation of the squared norms of the previous gradient updates of the feature vector. The evolution of Adagrad would then be independent of the choice of basis in the feature space (as distinct from the entire parameter space). In the case of GloVe this means that a simultaneous rotation of all the word vectors in both layers during training does not alter the resulting model. This proposal would have the further advantage of greatly reducing the number of learning rates that have to be stored in memory. I don’t know if this proposal would have regret minimisation properties analogous to Adagrad. I haven’t read the original paper of Duchi et al. (2011), and what I am proposing might be subsumed there by the full-rank case (thanks to Alexey Rodriguez for pointing this out). Perhaps a block diagonal matrix could be used instead of a diagonal one.
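
Both the dependence on the basis and the effect of the proposed change can be illustrated on a toy problem. The sketch below is only an illustration under stated assumptions: the loss is an arbitrary least-squares problem on a single feature vector, `adagrad` is a hypothetical helper (not the GloVe or LightFM implementation), and the learning rate and step count are arbitrary. The same loss is minimised in two co-ordinate systems related by an orthogonal matrix P, and the results are compared after mapping one into the other.

```python
import numpy as np

def adagrad(grad_fn, w0, lr=0.5, steps=5, per_vector=False, eps=1e-8):
    """Adagrad on a single feature vector w.

    per_vector=False: the usual diagonal Adagrad, one accumulator per co-ordinate.
    per_vector=True:  one accumulator for the whole vector, fed by squared gradient norms.
    """
    w = w0.copy()
    acc = 0.0 if per_vector else np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        acc = acc + (g @ g if per_vector else g ** 2)
        w = w - lr * g / (np.sqrt(acc) + eps)
    return w

rng = np.random.default_rng(0)
k = 4
M = rng.normal(size=(k, k))
y = rng.normal(size=k)
P, _ = np.linalg.qr(rng.normal(size=(k, k)))   # random orthogonal change of basis

def grad(w):
    """Gradient of the toy loss 0.5 * ||M w - y||^2."""
    return M.T @ (M @ w - y)

def grad_rotated(v):
    """Gradient of the same loss, expressed in the rotated co-ordinates v = P w."""
    return P @ grad(P.T @ v)

w0 = rng.normal(size=k)

# Diagonal Adagrad: rotating the problem changes the trajectory.
diag_equivariant = np.allclose(P @ adagrad(grad, w0),
                               adagrad(grad_rotated, P @ w0))
# One learning rate per feature vector: the trajectory simply rotates with the problem.
vec_equivariant = np.allclose(P @ adagrad(grad, w0, per_vector=True),
                              adagrad(grad_rotated, P @ w0, per_vector=True))
print(diag_equivariant, vec_equivariant)
```

The per-vector variant is equivariant because the squared norm of a gradient is unchanged by an orthogonal transformation, whereas the individual squared co-ordinates are not. The sketch says nothing about regret bounds.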

Update: Mnih and Kavukcuoglu seem to have adopted the same point of view in Learning word embeddings efficiently with noise-contrastive estimation (2013). Thanks to Matthias Leimeister for this.

The mathematics of the discrete Fourier transform

We aim to identify the assumptions that are implicit in the sampling of a continuous-time signal and in the subsequent application of the discrete Fourier transform (DFT). In particular, we consider the following questions:

  • When does the sampling of a periodic continuous-time signal result in a periodic discrete-time signal?
  • When the resulting discrete-time signal is periodic, what is its frequency in samples/second?
  • Which continuous-time frequencies coincide in discrete time, and what does the “frequency spectrum” in discrete-time look like?
  • To which periodic discrete-time signals can the discrete Fourier transform be applied without losing information?

We furthermore show that the DFT interchanges point-wise and convolution products in the time- and frequency- domains, and thereby relate the DFT to Pontryagin duality for finite cyclic groups.
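
The interchange of point-wise and convolution products is easy to check numerically. The sketch below assumes NumPy’s convention for the DFT (no normalisation on the forward transform), which is why a factor of 1/N appears in the second identity; `circular_convolution` is a hypothetical helper written out directly from the definition.

```python
import numpy as np

def circular_convolution(x, y):
    """Convolution on the cyclic group Z/N: (x * y)[n] = sum_m x[m] y[(n - m) mod N]."""
    N = len(x)
    return np.array([sum(x[m] * y[(n - m) % N] for m in range(N))
                     for n in range(N)])

rng = np.random.default_rng(0)
N = 8
x = rng.normal(size=N)
y = rng.normal(size=N)

# Convolution in the time domain becomes a point-wise product in the frequency domain ...
conv_to_product = np.allclose(np.fft.fft(circular_convolution(x, y)),
                              np.fft.fft(x) * np.fft.fft(y))

# ... and a point-wise product in time becomes a convolution in frequency (scaled by 1/N).
product_to_conv = np.allclose(np.fft.fft(x * y),
                              circular_convolution(np.fft.fft(x), np.fft.fft(y)) / N)
print(conv_to_product, product_to_conv)
```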

I talked about the above (skipping many details!) at a recent talk.

Feature scaling and non-negative matrix factorisation

Non-negative matrix factorisation (NMF) is a dimension reduction technique that is commonly applied in a number of different fields, for example:

  • in topic modelling, applied to the document x word matrix;
  • in speech processing, applied to the matrix of magnitude spectrograms of framed audio;
  • in recommendation systems, applied to the user x item interaction matrix.

Due to its non-negativity constraint, it has the wonderful property of decomposing objects as an additive combination of (often very meaningful) parts. However, as with all unsupervised learning tasks, it is sensitive to the relative scale of different features.

The fundamental problem is that the informativeness of a feature need not be related to its scale. For example, when processing speech, the highest-energy components of a magnitude spectrogram are those of the least perceptual importance! So when NMF decides which information to discard in order to achieve a low-rank factorisation that minimises the error function, it can be the signal, not the noise, that is sacrificed. This problem is not unique to NMF, of course: PCA retains those dimensions of the sample cloud that have the greatest variance.

It is in general better to learn a feature representation jointly with the downstream task, so that the model learns to scale features according to their informativeness for the task. If NMF is for some reason still desirable, however, it is possible to better control the information loss by choosing an appropriate measure of the matrix factorisation error.

There are three common error functions used in NMF (all of which are Bregman divergences): squared Euclidean, Kullback-Leibler (KL) and Itakura-Saito (IS). These are respectively quadratic, linear and invariant with respect to the feature scale: multiplying both a feature and its reconstruction by a factor \lambda multiplies the error by \lambda^2, \lambda and 1, respectively. Thus, for example, NMF with the Euclidean error function gives strong preference to high-energy features, while NMF with the IS error function is agnostic to feature scale.
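
The scaling behaviour of the three error functions can be verified directly. The following sketch uses the standard element-wise definitions (with the generalised form of the KL divergence, as is usual for NMF); the test vectors and the scale factor of 10 are arbitrary.

```python
import numpy as np

def euclidean(x, y):
    """Squared Euclidean distance."""
    return np.sum((x - y) ** 2)

def kullback_leibler(x, y):
    """Generalised Kullback-Leibler divergence between non-negative vectors."""
    return np.sum(x * np.log(x / y) - x + y)

def itakura_saito(x, y):
    """Itakura-Saito divergence, which depends only on the ratios x / y."""
    return np.sum(x / y - np.log(x / y) - 1)

rng = np.random.default_rng(0)
x = rng.uniform(0.5, 2.0, size=10)   # a feature ...
y = rng.uniform(0.5, 2.0, size=10)   # ... and its reconstruction
scale = 10.0

# Ratio of the error after rescaling to the error before rescaling.
ratios = {
    "euclidean": euclidean(scale * x, scale * y) / euclidean(x, y),
    "kl": kullback_leibler(scale * x, scale * y) / kullback_leibler(x, y),
    "is": itakura_saito(scale * x, scale * y) / itakura_saito(x, y),
}
print(ratios)   # scale**2, scale and 1, up to rounding
```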

Convergence rate of gradient descent

These are notes from a talk I presented at the seminar on June 22nd. All this material is drawn from Chapter 7 of Bishop’s Neural Networks for Pattern Recognition, 1995.

In these notes we study the rate of convergence of gradient descent in the neighbourhood of a local minimum. The eigenvalues of the Hessian at the local minimum determine the maximum learning rate and the rate of convergence along the axes corresponding to the orthonormal eigenvectors.
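
As a small illustration, consider the quadratic approximation E(w) = 0.5 w^T H w of the error around a minimum at the origin. The sketch below is an assumption-laden toy (the eigenvalues, learning rates and step counts are arbitrary choices): it checks that each component along an eigenvector is multiplied by (1 - learning rate × eigenvalue) per step, and that the iteration diverges once the learning rate exceeds 2 divided by the largest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)

# A Hessian with known eigenvalues and random orthonormal eigenvectors (the columns of U).
eigenvalues = np.array([0.5, 2.0, 10.0])
U, _ = np.linalg.qr(rng.normal(size=(3, 3)))
H = U @ np.diag(eigenvalues) @ U.T

def descend(w, lr, steps):
    """Gradient descent on E(w) = 0.5 * w^T H w, whose gradient is H w."""
    for _ in range(steps):
        w = w - lr * (H @ w)
    return w

w0 = rng.normal(size=3)
lr, steps = 0.15, 20                 # lr is below 2 / max(eigenvalues) = 0.2
w = descend(w0, lr, steps)

# In the eigenbasis, each component shrinks by the factor (1 - lr * eigenvalue) per step.
predicted = (1 - lr * eigenvalues) ** steps * (U.T @ w0)
matches_prediction = np.allclose(U.T @ w, predicted)

# Just above the critical learning rate, the component along the largest eigenvalue grows.
diverged = np.linalg.norm(descend(w0, 0.21, 200)) > np.linalg.norm(w0)
print(matches_prediction, diverged)
```

With the learning rate capped by the largest eigenvalue, the slowest direction is the one with the smallest eigenvalue, so the ratio of the two governs the overall rate of convergence.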

See the eigendecomposition of real, symmetric matrices for the linear algebra preliminaries.

Skipgram isn't Matrix Factorisation

The paper Neural Word Embeddings as Implicit Matrix Factorization of Levy and Goldberg was published in the proceedings of NIPS 2014 (pdf).  It claims to demonstrate that Mikolov’s Skipgram model with negative sampling is implicitly factorising the matrix of pointwise mutual information (PMI) of the word/context pairs, shifted by a global constant.  Although the paper is interesting and worth reading, it greatly overstates what is actually established, which can be summarised as follows:

Suppose that the dimension of the Skipgram word embedding is at least as large as the vocabulary.  Then if the matrices of parameters (W, C) minimise the Skipgram objective, and the rows of W or the columns of C are linearly independent, then the matrix product WC is the PMI matrix shifted by a global constant.

This is a really nice result, but it certainly doesn’t show that Skipgram is performing (even implicitly) matrix factorisation.  Rather it shows that the two learning tasks have the same global optimum  – and even this is only shown when the dimension is larger than the vocabulary, which is precisely the case where Skipgram is uninteresting.

The linear independence assumption

The authors (perhaps unknowingly) implicitly assume that the word vectors on one of the two layers of the Skipgram model are linearly independent.  This is a stronger assumption than what the authors explicitly assume, which is that the dimension of the hidden layer is at least as large as the vocabulary.  It is also not a very natural assumption, since Skipgram is interesting to us precisely because it captures word analogies in word vector arithmetic, which are linear dependencies between the word vectors!  This is not a deal breaker, however, since these linear dependencies are only ever approximate.

In order to see where the assumption arises, first recall some notation of the paper:


The authors consider the case where the negative samples for Skipgram are drawn from the empirical unigram distribution P_D over the contexts V_C, and write


for the log likelihood.  The log likelihood is then rewritten as another double summation, in which each summand (as a function of the model parameters) depends only upon the dot product of one word vector with one context vector:

[Equation image: the log likelihood l rewritten as a double summation over word/context pairs.]

The authors then suppose that the values of the parameters W, C are such that Skipgram is at equilibrium, i.e. that the partial derivatives of l with respect to each word- and context-vector component vanish. They then assume that this implies that the partial derivatives of l with respect to the dot products vanish also. To see that this doesn’t necessarily follow, apply the chain rule to the partial derivatives:

[Equation image: the chain rule applied to the partial derivatives of l.]

This yields systems of linear equations relating the partial derivatives with respect to the word- and context-vector components (which are zero by supposition) to the partial derivatives with respect to the dot products, which we want to show are zero. But this only follows if one of the two systems of linear equations has a unique solution, which is precisely when its matrix of coefficients (which are just word- or context-vector components) has linearly independent rows or columns. So either the family of word vectors or the family of context vectors must be linearly independent in order for the authors to proceed to their conclusion.
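
The step can be written out explicitly. The notation here is an assumption made for illustration, and is not taken from the paper: write w_i and c_j for the word and context vectors, s_{ij} = w_i \cdot c_j for their dot products, and G for the matrix with entries G_{ij} = \partial l / \partial s_{ij}. The chain rule gives

    \[ \frac{\partial l}{\partial w_i} = \sum_j G_{ij} \, c_j, \qquad \frac{\partial l}{\partial c_j} = \sum_i G_{ij} \, w_i. \]

If W has the word vectors as its rows and C has the context vectors as its columns, then at equilibrium these become the matrix equations

    \[ G C^T = 0, \qquad W^T G = 0. \]

The first equation forces G = 0 only if the columns of C are linearly independent, and the second only if the rows of W are linearly independent. If neither holds, pick non-zero vectors a and b with W^T a = 0 and C b = 0; then G = a b^T is a non-zero solution of both equations, so G = 0 does not follow from the equations alone.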

Word vectors that are of dimension the size of the vocabulary and linearly independent sound to me more akin to one-hot or bag-of-words representations than to Skipgram word vectors.

Skipgram isn’t Matrix Factorisation (yet)

If Skipgram is matrix factorisation, then it isn’t shown in this paper. What has been shown is that the optima of the two methods coincide when the dimension is larger than the size of the vocabulary. Unfortunately, this tells us nothing about the lower dimensional case where Skipgram is actually interesting. In the lower dimensional case, the argument of the authors can’t be applied, since it is then impossible for the word- or context-vectors to be linearly independent. It is only in the lower dimensional case that Skipgram and matrix factorisation are forced to compress the word co-occurrence information and thereby learn anything at all. This compression is necessarily lossy (since there are insufficient parameters) and there is nothing in the paper to suggest that the two methods will retain the same information (which is what it means to say that the two methods are the same).

Appendix: Comparing the objectives

To compare Skipgram with negative sampling to MF, we might compare the two objective functions.  Skipgram maximises the log likelihood l (above). MF, on the other hand, typically minimises the squared error between the matrix and its reconstruction:

[Equation image: the squared-error objective E of matrix factorisation.]

The partial derivatives of E, needed for a gradient update, are easy to compute:

[Equation image: the partial derivatives of E.]

Compare these with the partial derivatives of the Skipgram log likelihood l, which can be computed as follows:

[Equation image: the partial derivatives of the Skipgram log likelihood l.]

Softmax parameterisation and optimisation

The softmax function provides a convenient parameterisation of the probability distributions over a fixed number of outcomes. Using the softmax, such probability distributions can be learned parametrically using gradient methods to minimise the cross-entropy (or equivalently, the Kullback-Leibler divergence) to observed distributions.  This is equivalent to maximum likelihood learning when the distributions to be learned are one-hot (i.e. we are learning for a classification task). In the notes below, the softmax parameterisation and the gradient updates with respect to the cross entropy are derived explicitly.
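
For concreteness, here is a short sketch of the parameterisation and of the well-known closed form of the gradient, softmax(z) - p, for a target distribution p. It is an illustration rather than a transcription of the notes, and the function names are chosen here. The closed form is checked against finite differences.

```python
import numpy as np

def softmax(z):
    """Map unconstrained parameters z to a probability distribution."""
    e = np.exp(z - np.max(z))   # subtracting the max avoids overflow and leaves the result unchanged
    return e / e.sum()

def cross_entropy(p, z):
    """Cross-entropy of the model distribution softmax(z) relative to the target p."""
    return -np.sum(p * np.log(softmax(z)))

def cross_entropy_gradient(p, z):
    """Gradient of the cross-entropy with respect to z (valid whenever p sums to one)."""
    return softmax(z) - p

rng = np.random.default_rng(0)
z = rng.normal(size=5)
p = np.array([0.0, 0.0, 1.0, 0.0, 0.0])   # a one-hot target, as in classification

# Check the closed-form gradient against central finite differences.
h = 1e-6
numerical = np.array([
    (cross_entropy(p, z + h * e_i) - cross_entropy(p, z - h * e_i)) / (2 * h)
    for e_i in np.eye(5)
])
gradient_ok = np.allclose(numerical, cross_entropy_gradient(p, z), atol=1e-6)
print(gradient_ok)
```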

This material spells out section 4 of the paper of Bridle referenced below, where the softmax was first proposed as an activation function for a neural network. It was also in this paper that the softmax was given its name. The name contrasts the outputs of the function with those of the “winner-takes-all” function, whose outputs are one-hot distributions.


Bridle, J.S. (1990a). Probabilistic Interpretation of Feedforward Classification Network Outputs, with Relationships to Statistical Pattern Recognition. In: F. Fogelman Soulié and J. Hérault (eds.), Neurocomputing: Algorithms, Architectures and Applications, Berlin: Springer-Verlag, pp. 227-236.

Improving Pairwise Learning for Item Recommendation from Implicit Feedback 2014

Steffen Rendle and Christoph Freudenthaler (University of Konstanz), WSDM 2014.
The authors present a modification of Bayesian Personalized Ranking (BPR) for implicit feedback (i.e. one-class) recommendation datasets, in which the negative samples are drawn according to both the model’s current belief and the user/context in question (“adaptive oversampling”). They show that the prediction accuracy of BPR models trained with adaptive oversampling matches that of BPR models trained with uniform sampling, but that convergence is 10x-20x faster.
Consider the problem of recommending items to users (or more generally: contexts, e.g. user on a particular page). The observed data consists then of context-item pairs (c, i), where item i was the choice made in context c. The authors work in the context of pairwise learning, which amounts to a binary classification task where context-item-item triples (c, i, j) are labelled as true iff item i was chosen in context c but item j was not:
[Equation image: the pairwise classification model for triples (c, i, j).]
where \hat{y} is a scoring function (e.g. the dot product of the context- and the item- latent vectors, for matrix factorisation).  It is infeasible to consider all the negative examples j, so how should we choose which to consider?

Negative sampling from static distributions

We could draw negative examples from the uniform distribution over all items, or instead from the observed distribution over all items (i.e. by popularity).  Both are inexpensive to perform and easy to implement.  However:
  • Uniform sampling tends to yield uninformative samples, i.e. those for which the probability of being incorrectly labelled is very likely already low: popular items are precisely those that appear often as positive examples (and hence tend to be highly ranked by the model), while a uniformly-drawn item is likely to be from the tail of the popularity distribution (so likely lowly ranked by the model).
  • Sampling according to popularity is demonstrated by the authors to converge to inferior solutions.
The authors point out that these sampling schemes depend neither upon the current context (user) nor the current belief of the model.  This contrasts with their own method, adaptive over-sampling.

Adaptive over-sampling

The authors propose a scheme in which the negative samples chosen are those that the model would be likely to recommend to the user in question, according to its current state.  In this sense it is reminiscent of the Gibbs sampling used by restricted Boltzmann machines.
Choosing negative samples dependent on the current model and user is computationally expensive if performed in the naive manner. The authors speed this up by working with the latent factors individually, and only periodically re-computing the ranking of the items according to each latent factor.  Specifically, in the case of matrix factorisation, when looking for negative samples for a context c, a negative sample is chosen by:
  1. sampling a latent factor l according to the absolute values of the latent vector associated to c (normalised, so it looks like a probability distribution);
  2. sampling an item j that ranks highly for l.  More precisely, sample a rank r from a geometric distribution over possible ranks, then find the item that has rank r when the lth coordinates of the item latent vectors are compared.

(We have ignored the sign of the latent factor. If the sign is negative, one chooses the rth-to-last ranked item.) The ranking of the items according to each latent factor is precomputed periodically with a period such that the extra overhead is independent of the number of items (i.e. is a constant).
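
The two sampling steps can be sketched as follows. This is a hedged illustration and not the authors’ implementation: the function names are hypothetical, the parameter of the geometric distribution (0.05) is an arbitrary choice, and clipping the sampled rank to the number of items is an assumption made here to keep the example simple.

```python
import numpy as np

def precompute_rankings(item_vectors):
    """For each latent factor l, the item indices sorted by decreasing l-th co-ordinate."""
    return np.argsort(-item_vectors, axis=0)          # shape (n_items, n_factors)

def sample_negative(context_vector, rankings, rng, p_geometric=0.05):
    """Draw one negative item for a context, given precomputed per-factor rankings."""
    n_items = rankings.shape[0]
    # 1. Sample a latent factor with probability proportional to |context_vector[l]|.
    weights = np.abs(context_vector)
    l = rng.choice(len(context_vector), p=weights / weights.sum())
    # 2. Sample a rank from a geometric distribution (clipped to the valid ranks; an assumption).
    r = min(rng.geometric(p_geometric) - 1, n_items - 1)
    # A negative co-ordinate reverses the ranking: take the r-th-to-last item instead.
    if context_vector[l] < 0:
        r = n_items - 1 - r
    return rankings[r, l]

rng = np.random.default_rng(0)
item_vectors = rng.normal(size=(1000, 8))
context_vector = rng.normal(size=8)

rankings = precompute_rankings(item_vectors)   # in practice, recomputed only periodically
negatives = [sample_negative(context_vector, rankings, rng) for _ in range(2000)]

scores = item_vectors @ context_vector
print(scores[negatives].mean(), scores.mean())  # sampled negatives versus all items
```

Each draw costs a lookup in a precomputed ranking rather than a pass over all item scores, which is where the speed-up over the naive approach comes from.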

Problems with the approach

The samples yielded by the adaptive oversampling approach depend heavily upon the choice of basis for the latent space.  It is easy to construct examples of where things go wrong:

Non-negativity constraints would solve this. Regularisation would also deal with this particular example. However, regularisation would complicate the expression of the scoring function \hat y as a mixture (since you need to divide through by Z_c).

Despite these problems, the authors demonstrate impressive performance, as we’ll see below.



The authors demonstrate that their method does converge to solutions slightly better than those given by uniform sampling, but twenty times faster.  It is also interesting to note that uniform sampling is vastly superior to popularity based sampling, as shown in the diagrams below.
[Figure: diagrams comparing adaptive oversampling, uniform sampling and popularity-based sampling.]
Note that a single epoch of the adaptive oversampling takes approximately 33% longer than a single epoch of uniform sampling BPR.


According to the paper, the method is implemented in libFM, a C++ software package that Rendle has published.  However, while I haven’t looked exhaustively, I can’t see anything in that package about the adaptive oversampling (nor in Rendle’s other package, MyMediaLite).


What about adaptive oversampling in word2vec?

Word2vec with negative sampling learns a word embedding via a binary classification task, specifically, “does the word appear in the same window as the context, or not?”. As in the case of implicit feedback recommendation, the observed data for word2vec is one-class (just a sequence of words). Interestingly, word2vec samples from a modification of the observed word frequency distribution (i.e. of the “distribution according to popularity”) in which the frequencies are raised to the 0.75th power and renormalised. The exponent was chosen empirically. This raises two questions:

  1. Would word2vec perform better with adaptive oversampling?
  2. How does BPR perform when sampling from a similarly-modified item popularity distribution (i.e. raising to some exponent)?
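
For reference, the modified popularity distribution is simple to write down. The sketch below uses made-up counts and a hypothetical function name; it shows how the exponent interpolates between sampling by popularity (exponent 1) and uniform sampling (exponent 0).

```python
import numpy as np

def smoothed_distribution(counts, exponent=0.75):
    """Raise the observed counts to a power and renormalise."""
    weights = np.asarray(counts, dtype=float) ** exponent
    return weights / weights.sum()

counts = np.array([1000, 100, 10, 1])                    # made-up item or word counts
by_popularity = smoothed_distribution(counts, exponent=1.0)
smoothed = smoothed_distribution(counts)                 # the exponent used by word2vec
uniform = smoothed_distribution(counts, exponent=0.0)

print(by_popularity)
print(smoothed)
print(uniform)
```

Relative to sampling by popularity, the exponent of 0.75 moves probability mass from the head of the distribution to the tail.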



Corrections and comments are most welcome. Please get in touch either via the comments below or via email.


WARP loss for implicit-feedback recommendation

We consider the “Weighted Approximate-Rank Pairwise” (WARP) loss, as introduced in the WSABIE paper of Weston et al. (2011, see references), in the context of making recommendations using implicit feedback data, where it has been shown several times to perform excellently. For the sake of discussion, consider the problem of recommending items i to users u, where a scoring function f_u(i) gives the score of item i for user u, and the item with the highest score is recommended.

WARP considers each observed user-item interaction (u, i) in turn, chooses another “negative” item i' that the model believed was more appropriate to the user, and performs gradient updates to the model parameters associated to u, i and i' such that the model’s beliefs are corrected. WARP weights the gradient updates using (a function of) the estimated rank of item i for user u. Thus the updates are amplified if the model did not believe that the interaction (u, i) could ever occur, and are dampened if, on the other hand, the interaction is not surprising to the model. Conveniently, the rank of i for u can be estimated by counting the number of sample items i' that had to be considered before one was found that the model (erroneously) believed more appropriate for user u.

Minimising the rank?

Ideally we would like to make updates to the model parameters that minimised the rank of item i for user u.  Previous work of Usunier (one of the authors) showed that the precision at k was best optimised when the logarithm of the rank was minimised.  (to read!)

The problem with the rank is that, while it does depend on the model parameters, this dependence is not continuous (the rank being integer valued!).  So it is not possible to speak of gradients.  So what is to be done instead?  The approach of the authors is to derive a differentiable approximation to the logarithm of the rank, and to minimise this instead.

Derivation: approximating the (log of the) rank

WARP has been shown several times to perform very well for implicit feedback recommendation.  However, the derivation of the approximation of the log of the rank used in WARP, as outlined in the WSABIE paper, is nonsense.  I can only think that the authors arrived at WARP in another way.  Let’s look at it more closely.  In the following:

  • f_u (i) is the score assigned by the model to item i for user u.
  • L is some function that defines the error as a function of the rank.  In the WSABIE paper, L(k) = \sum_{j=1}^k \frac{1}{j} is approximately the natural logarithm (for the derivation below, however, it doesn’t matter what L is)

[Figure: warp derivation]

The most obvious problem with the derivation is the approximation marked with an asterisk (*). At this step, the authors approximate the indicator function I[x > y] by I[x > y] \cdot (x - y + 1). While the latter is familiar as the hinge loss from SVMs, it is (being unbounded!) a dreadful approximation for the indicator I[x > y]. It seems to me that the sigmoid of the difference of the scores would be a much better differentiable approximation to the indicator function.

To appreciate why the derivation is nonsense, however, you have to notice that it has nothing to do with L. That is, the derivation above would yield an approximation for L, whatever L happened to be, even a constant function.


WARP considers each observed interaction (u, i) in turn, repeatedly sampling items i' from the uniform distribution over all items until it finds one in V_{u, i}^1, i.e. until it finds an item i' whose score for the user u is at worst 1 less than the score of the observed interaction.  For this triple (u, i, i'), it performs gradient updates to minimise:

\displaystyle L( \text{rank}_u^1 (i) ) \cdot (f_u (i') + 1 - f_u (i))

The naive approach to computing \text{rank}_u^1 (i) is to calculate all the scores for the given user in order to determine the rank of the item i.  WARP performs a nice trick to do much better: it estimates \text{rank}_u^1 (i) by counting how many candidate negative items i' it had to consider before finding one in V_{u, i}^1.  This yields

\displaystyle \text{rank}_u^1 (i) \approx \frac{\text{total number of items} - 1}{\text{number of i' we had to draw}}

However it is still the case that L(\text{rank}_u^1 (i)) is not differentiable.  So when we compute the gradients, this quantity has to be treated as a constant. Thus it simply becomes a weighting applied to the gradient of the difference of the scores (hence the name WARP, I guess).
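
Putting the pieces together, a single WARP step might be sketched as below. This is an illustration under assumptions, not the WSABIE or LightFM implementation: the scoring function is taken to be a plain dot product f_u(i) = user . items[i], the function names and learning rate are hypothetical, and the rank estimate is rounded down to an integer.

```python
import numpy as np

def L(rank):
    """Rank weighting from the WSABIE paper: L(k) = sum_{j=1}^k 1/j."""
    return np.sum(1.0 / np.arange(1, rank + 1))

def warp_sample(scores, i, rng):
    """Draw items uniformly until one lies within a margin of 1 of the observed item i.

    Returns (negative item, estimated rank), or None if no such item was found.
    """
    n_items = len(scores)
    for n_drawn in range(1, n_items):
        j = rng.integers(n_items)
        if j != i and scores[j] + 1 >= scores[i]:
            return j, (n_items - 1) // n_drawn
    return None

def warp_update(user, items, i, rng, lr=0.05):
    """One WARP step; modifies user and items in place and returns the negative item used."""
    result = warp_sample(items @ user, i, rng)
    if result is None:
        return None
    j, rank = result
    weight = L(rank)                      # treated as a constant when differentiating
    # Gradients of weight * (f_u(j) + 1 - f_u(i)), all evaluated before any update.
    grad_user = weight * (items[j] - items[i])
    grad_i, grad_j = -weight * user, weight * user
    user -= lr * grad_user
    items[i] -= lr * grad_i
    items[j] -= lr * grad_j
    return j

rng = np.random.default_rng(0)
items = rng.normal(size=(50, 4))
user = rng.normal(size=4)
i = 7                                     # the observed item for this user

rank_before = int(np.sum(items @ user > items[i] @ user))
for _ in range(200):
    warp_update(user, items, i, rng)
rank_after = int(np.sum(items @ user > items[i] @ user))
print(rank_before, rank_after)
```

The later a violating item is found, the smaller the estimated rank, and so the smaller the weight applied to the update.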

WARP optimises for item to user recommendations

With its negative sampling technique, WARP optimises for recommending items to a user.  For instance, the problem of recommending users to items (so, transposing the interaction matrix) is not trained for.  I wonder if some extra uplift could be obtained by training for both problems simultaneously.

Normalising for the total number of items

With the optimisation stated as above, the learning rate will need to be re-tuned for datasets that have different numbers of items, since the gradient weighting L( \text{rank}_u^1 (i) ) ranges from L(0) to L(\text{total number of items}). It would make more sense to weigh the gradient updates by:

\displaystyle \frac{L( \text{rank}_u^1 (i) )}{L(\text{total number of items})}

which ranges between 0 and 1.


There are two implementations of WARP for recommendation that I know of, both in Python:

  • LightFM, written by Maciej Kula.  Works well.  Also implements BPR with uniform sampling and WARP k-OS (which I’ve not investigated yet).
  • MREC, written by Levy and Jack at Mendeley, has a matrix factorisation recommender trained using WARP.  I’ve not tried this one out yet.


Jason Weston, Samy Bengio and Nicolas Usunier, Wsabie: Scaling Up To Large Vocabulary Image Annotation, 2011, (PDF).

Metadata Embeddings for User and Item Cold-start Recommendations

Maciej Kula (Lyst)
CBRecSys 2015 (arxiv)

Kula presents a model for cold start recommendation, which he calls “LightFM”.  Users and items are considered as sets of binary features. For example:

\text{alice} = \{ \text{domain}:\text{gmail} \}

\text{itemXYZ} = \{\text{description}:\text{pleated}, \text{description}:\text{skirt}, \text{tag}:\text{chanel} \}.

Each of these features (e.g. each tag, each word and each email domain) corresponds to a parameter vector and a bias term.  A user vector (or item vector) is then the sum of the vectors associated to its constituent features.  Similarly, a user (item) bias term is just the sum of the bias terms associated to its features.

The probability \hat{r}(u, i) of an interaction between a user u and an item i is modelled as the sigmoid of the dot product of the user vector and the item vector, along with the bias terms associated with the user and the item:

\hat{r}(u, i) := \sigma (vec(u) \cdot vec(i) + bias(u) + bias(i))

The model is trained on a set S_{+} of user-item pairs observed as having interacted, and on a set S_{-} of user-item pairs that were not observed to have interacted (in the case of implicit feedback recommendation) or to have interacted negatively (in the case of explicit feedback recommendation).  Specifically, these interactions and non-interactions are assumed independent and the likelihood

\displaystyle L = \prod_{(u, i) \in S_{+}} \hat{r}(u,i) \cdot \prod_{(u, i) \in S_{-}} (1 - \hat{r}(u,i))

is then maximised using stochastic gradient descent and with adaptive per-parameter learning rates determined by Adagrad.
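
As a rough illustration of the model and its likelihood (not LightFM's actual code), the following Python sketch sums feature vectors and biases, applies the sigmoid, and evaluates the log of L. The parameter values are invented for the example, and user and item features share one dictionary purely to keep it short.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def embed(features, vectors, biases):
    """Sum the parameter vectors and the bias terms of a set of features."""
    dim = len(next(iter(vectors.values())))
    vec = [0.0] * dim
    for f in features:
        vec = [a + b for a, b in zip(vec, vectors[f])]
    return vec, sum(biases[f] for f in features)


def predict(user, item, vectors, biases):
    """Sigmoid of the dot product plus the user and item biases."""
    u_vec, u_bias = embed(user, vectors, biases)
    i_vec, i_bias = embed(item, vectors, biases)
    dot = sum(a * b for a, b in zip(u_vec, i_vec))
    return sigmoid(dot + u_bias + i_bias)


def log_likelihood(positives, negatives, vectors, biases):
    """Log of the likelihood L over observed and non-observed pairs."""
    ll = sum(math.log(predict(u, i, vectors, biases)) for u, i in positives)
    ll += sum(math.log(1.0 - predict(u, i, vectors, biases)) for u, i in negatives)
    return ll


# Made-up parameters for the features of the example above.
vectors = {
    "domain:gmail": [1.0, 0.0],
    "description:pleated": [0.5, 0.0],
    "description:skirt": [0.5, 1.0],
    "tag:chanel": [0.0, 2.0],
}
biases = {name: 0.0 for name in vectors}
alice = {"domain:gmail"}
item_xyz = {"description:pleated", "description:skirt", "tag:chanel"}
p = predict(alice, item_xyz, vectors, biases)
```

In practice one would maximise the log-likelihood rather than L itself, since the product of many probabilities underflows quickly.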

Trivial featurisation gives matrix factorisation

Note that users (or items) can be featurised trivially using their ids.   We create one user feature for each user id, so that the user-feature matrix is the identity matrix.  In this case, we have a separate parameter vector for each user.  If we do this for both users and items, then the model is just a (sigmoid-) factorisation of the user-item interaction matrix. This is then the case of Johnson’s logistic matrix factorization.


Performance is evaluated on MovieLens for explicit feedback recommendation and on CrossValidated (one of the StackExchange websites) for implicit feedback recommendation.  In both cases, warm- and cold-start scenarios are tested.  Warm start is tested by holding out interactions in such a way that every item and every user is still represented in the training interaction data.  Cold start is tested by holding out all interactions for some items.  Model accuracy is measured by considering each user in the set of test interactions, considering the binary classification task of labelling each item as having been interacted with or not, and then measuring the area under the associated ROC curve.  The mean is then taken over all users in the test set.
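
The per-user AUC can be sketched as follows. This is my own minimal version, not the evaluation code of the paper: it uses the fact that the area under the ROC curve equals the probability that a randomly chosen positive item is scored above a randomly chosen negative one (ties counted as one half), and it skips users for whom the AUC is undefined, which is an assumption on my part.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparisons (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None  # undefined when only one class is present
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_user_auc(per_user):
    """per_user maps a user id to a pair (item scores, 0/1 interaction labels)."""
    values = [auc(scores, labels) for scores, labels in per_user.values()]
    values = [v for v in values if v is not None]
    return sum(values) / len(values)


example = {
    "user_a": ([0.9, 0.8, 0.1], [1, 0, 1]),
    "user_b": ([0.9, 0.1], [1, 0]),
}
result = mean_user_auc(example)
```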

LightFM seems to perform well in both cold and warm start scenarios.

Engineering Notes

Kula included some interesting notes on the production use of LightFM at Lyst.  Training is incremental with model state stored in the database.

Implementation and Examples

Available on GitHub and extensively documented.  Written in Cython.  In addition to the logistic loss used above, Bayesian Personalised Ranking and WARP are supported.


Parameterising the Mahalanobis distances for metric learning

Below are the notes I made to prepare for a short talk given at our seminar on learning distance metrics, and the Mahalanobis distances in particular. We show that the Mahalanobis distances can be parameterised by the positive semidefinite (PSD) matrices or alternatively (in a highly redundant way) by all matrices. The set of PSD matrices is convex, but in order to perform gradient descent to optimise the objective function, we need to perform a costly projection after each update involving the singular value decomposition.

We note along the way that a Mahalanobis distance is nothing more than the Euclidean distance after applying a linear transform to the data.
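
The following Python sketch checks this numerically on a small example. The names are my own, and the matrix called L here is just an arbitrary linear transform, with M = L^T L the PSD matrix it induces.

```python
import math


def mat_vec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def gram(L):
    """Return M = L^T L, which is always positive semidefinite."""
    n = len(L[0])
    return [[sum(row[i] * row[j] for row in L) for j in range(n)] for i in range(n)]


def mahalanobis(x, y, M):
    """sqrt((x - y)^T M (x - y))."""
    d = [a - b for a, b in zip(x, y)]
    return math.sqrt(sum(di * mi for di, mi in zip(d, mat_vec(M, d))))


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


L = [[1.0, 2.0], [0.0, 1.0]]
M = gram(L)
x, y = [1.0, 1.0], [0.0, 0.0]
d_mahalanobis = mahalanobis(x, y, M)
d_euclidean = euclidean(mat_vec(L, x), mat_vec(L, y))
```

Both distances come out as the square root of 10. The redundancy of the parameterisation by L is visible too: replacing L by QL for any orthogonal Q leaves M unchanged.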

The example of the 2×2 PSD matrices is worked out in detail here.

What’s here documents my first steps. What I really discovered is that metric learning is a research domain in its own right, and that a great deal of work has been done. There is an excellent survey by Bellet et al. (2013) that covers everything I have said in the first two of its sixty pages.

Visualising the set of 2×2 positive semidefinite matrices

Recall that a symmetric matrix M \in \mathbb{R}^{n \times n} is called positive semidefinite (“PSD”) if, for any x \in \mathbb{R}^n, we have x^{T} M x \geqslant 0. Positive semidefinite matrices occur, for instance, in the study of bilinear forms and as the Gram (or covariance) matrices in probability theory. In the case where n = 2, the space of symmetric matrices is 3-dimensional, and we can actually draw the subset of all positive semidefinite matrices – it looks like the bow of a ship.

It is clear that in the case illustrated below, the PSD matrices form a convex subset. It is easy to show this in general, by observing that the set of all PSD matrices is closed under addition and multiplication by non-negative scalars. The convexity of this set is crucial for the fitting of Mahalanobis distances in metric learning, which is how I got interested in PSD matrices in the first place.
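
For the 2×2 case this is easy to play with in code. A symmetric matrix with diagonal entries a, c and off-diagonal entry b is PSD exactly when a ≥ 0, c ≥ 0 and ac - b^2 ≥ 0. The small Python sketch below (function names are my own) uses this test and checks one convex combination.

```python
def is_psd_2x2(a, b, c):
    """PSD test for the symmetric matrix [[a, b], [b, c]]."""
    return a >= 0 and c >= 0 and a * c - b * b >= 0


def convex_combination(m1, m2, t):
    """t * m1 + (1 - t) * m2 for matrices given as (a, b, c) triples."""
    return tuple(t * p + (1 - t) * q for p, q in zip(m1, m2))


m1 = (1.0, 1.0, 1.0)   # on the boundary: determinant zero
m2 = (4.0, 0.0, 0.0)   # also on the boundary
midpoint = convex_combination(m1, m2, 0.5)
```

Here both matrices lie on the boundary of the set, while their midpoint (2.5, 0.5, 0.5) lies strictly inside it.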

Does vector direction encode word frequency?

In a paper with Adriaan Schakel, we presented controlled experiments for word embeddings using pseudo-words. Performing these experiments in the case of word2vec CBOW showed that, in particular, the vector direction of any particular word changed only moderately when the frequency of the word was varied. Shortly before we released the paper, Schnabel et al. presented an interesting paper at EMNLP, where (amongst other things) they showed that it was possible to distinguish rare from frequent words using logistic regression on the normalised word vectors, i.e. they showed that vector direction does approximately encode coarse (i.e. binary, rare vs. frequent) frequency information.  Here, I wanted to quickly report that the result of Schnabel et al. holds for the vectors obtained from our experiments, as it should. Below, I’ll walk through exactly what I checked.

I took the word vectors that we trained during our experiments. You can check our paper for a detailed account. In brief, we trained a word2vec CBOW model on popular Wikipedia pages with a hidden layer of size 100, negative sampling with 5 negative samples, a window size of 10, a minimum frequency of 128, and 10 passes through the corpus. Sub-sampling was not used so that the influence of word frequency could be more clearly discerned. There were 81k unigrams in the vocabulary. Then:

  1. the word vectors were normalised so that their (Euclidean-) length was 1.
  2. the frequency threshold of 5000 was chosen (somewhat arbitrarily) to define the boundary between rare and frequent words. This gave 8428 “frequent” words. A random sample of the same size of the remaining “rare” words was then chosen, so that the two classes, “rare” and “frequent”, were balanced. This yielded approximately 17k data points, where a data point is a normalised word vector labelled with either “frequent” (1) or “rare” (0).
  3. the data points were split into training- and test- sets, with 70% of the data points in the training set.
  4. a logistic regression model was fit on the training set. An intercept was fit, but this boosted the performance only slightly. No regularisation was used since the number of training examples was high compared to the number of parameters.
  5. the performance was assessed by calculating the ROC curve on the training and test sets and the accuracy on the test set.

Model performance
Consider the ROC curve below. We see from the fact that the test curve approximately tracks the training curve that the model generalises reasonably to unseen data. We see also from the closeness of the curves to the axes at the beginning and the end that the model is very accurate in detecting frequent words when it gives a high probability (bottom left of the curve) and at detecting infrequent words when it gives a low probability (top right).

ROC curve

(ROC curve made using a helpful code snippet from sklearn)

The accuracy of the model on the test set was 82%, which agrees very nicely with what was reported in Schnabel et al., summarised in the following image:
Schnabel et al image
The training corpus and parameters of Schnabel, though not reported in full detail (they had a lot of other things to report), seem similar. We know that their CBOW model was 50 dimensional, had a vocabulary of 103k words, and was trained on the 2008 Wikipedia.

Limiting Distributions of Markov Chains

Below are the notes I prepared for a talk that I gave at our seminar on limiting distributions for (finite-state, time-homogeneous) Markov chains, drawing on PageRank as an example. We see in particular how the “random teleport” possibility in the PageRank random walk algorithm can be motivated by the theoretical guarantees that result from it: the transition matrix is both irreducible and aperiodic, and iterated application of the transition matrix converges more rapidly to the unique stationary distribution.
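
The effect of the teleport is easy to see on a toy example. The Python sketch below is my own and not taken from the talk notes; the damping value of 0.85 is the conventional choice, assumed here. It runs the power iteration on a directed 3-cycle, whose random walk is periodic: without teleport the distribution just rotates forever, whereas with teleport it converges to the uniform stationary distribution.

```python
def pagerank(P, damping=0.85, iterations=200):
    """Power iteration for the walk that follows the row-stochastic matrix P
    with probability `damping` and teleports uniformly otherwise."""
    n = len(P)
    pi = [1.0] + [0.0] * (n - 1)  # start with all mass on the first state
    for _ in range(iterations):
        pi = [
            damping * sum(pi[i] * P[i][j] for i in range(n)) + (1.0 - damping) / n
            for j in range(n)
        ]
    return pi


# Directed cycle 0 -> 1 -> 2 -> 0: irreducible but periodic (period 3).
cycle = [
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
]
with_teleport = pagerank(cycle, damping=0.85)
without_teleport = pagerank(cycle, damping=1.0)
```

With damping 1.0 the mass is still concentrated on a single state after 200 steps; with damping 0.85 the result is uniform up to an error of order 0.85^200.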

More thorough reading material:

Update: that the limit of the average time spent is the reciprocal of the expected return time can be proved using the renewal reward theorem (thank you Prof. Sigman!)

A Bayesian Personalised Ranking Example: Factor Models for Recommending Given Names

Immanuel Bayer and Steffen Rendle

ECML PKDD Discovery Challenge 2013 (offline track).

This paper provides an interesting example of using factorisation machines for implicit feedback via the Bayesian personalised ranking (BPR) optimisation criterion.

The challenge was to recommend first names (e.g. to soon-to-be parents). Participants were provided with the browsing history of users on the name selection website Nameling. So the items to be recommended are names.

Users are considered together with their browsing history, consisting of the name they looked at last and the list of all the names they looked at up to the time t in question. The combination of a user (in this complete sense) at time t and a candidate name is vectorised as follows:

[Image: feature vector for the combination of user, browsing history and candidate name]

The (order 2) factorisation machine assigns a score to the combination as follows:

[Image: scoring formula of the order-2 factorisation machine]

where p is the rank of this vectorisation, w_0 \in \mathbb{R}, w \in \mathbb{R}^p and V \in \mathbb{R}^{p \times k} are model parameters to be learned. These parameters are learned via stochastic gradient ascent of the following pair-wise learning objective:
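
For reference, here is a minimal Python sketch of the score of a standard order-2 factorisation machine as defined by Rendle, which I assume is the form used in the paper: a global bias, a linear term, and pairwise interactions weighted by the inner products of the rows of V. The function names and the toy parameters are my own. The second function uses the well-known rewriting of the pairwise term that costs O(pk) rather than O(p^2 k).

```python
def fm_score_naive(x, w0, w, V):
    """w0 + sum_j w_j x_j + sum_{j < j'} <V_j, V_j'> x_j x_j'."""
    p = len(x)
    score = w0 + sum(w[j] * x[j] for j in range(p))
    for j in range(p):
        for j2 in range(j + 1, p):
            inner = sum(a * b for a, b in zip(V[j], V[j2]))
            score += inner * x[j] * x[j2]
    return score


def fm_score(x, w0, w, V):
    """Same score, with the pairwise term computed factor by factor."""
    p, k = len(V), len(V[0])
    pairwise = 0.0
    for f in range(k):
        s = sum(V[j][f] * x[j] for j in range(p))
        s_sq = sum((V[j][f] * x[j]) ** 2 for j in range(p))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + sum(w[j] * x[j] for j in range(p)) + pairwise


x = [1.0, 0.0, 1.0]          # a sparse, mostly binary feature vector
w0, w = 0.5, [1.0, 2.0, 3.0]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]]
score = fm_score(x, w0, w, V)
```

Since the vectorisation is sparse, only the non-zero entries of x contribute, which is what makes scoring many candidate names cheap.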

[Image: pairwise (BPR) learning objective]

where D is the set of (user u, time t, name n) tuples where u has browsed n before time t, N is the set of all possible names and \sigma is the sigmoid function. Only a single name is chosen from N \setminus \{ n \} for each update. These negative samples are drawn according to their estimated rank (this part is quite difficult to do efficiently).

In the above, we have purposefully omitted the “prefix smoothing” step, since we are mainly interested in a simple factorisation machine example. The details are in the paper.

This recommender did very well in the challenge. However, I don’t find the examples given in the paper very impressive (though I have not seen the examples given by others). My suspicion is that the FM approach is very strong, but that there is no good way to make first name recommendations using the provided dataset. A more effective dataset would be e.g. first name x product interaction on a large e-commerce site. This would do a much better job of capturing the social meaning of names, but could go out of date very quickly.

Musings on "adjectives as matrices"

The advantage of considering (e.g.) adjectives as transformations rather than points in space is that these transformations can be applied in unseen combinations. This counters one of Chomsky’s objections to statistical modelling of language, that is, that language is effectively infinite, whereas language models are trained on only a finite amount of data (so are humans, but humans are supposed to be born with a universal grammar). The case, considered by Baroni et al., of adjective as linear transform has a couple of disadvantages, however. The first is that there are a large number of parameters to be learnt for each adjective; the second is that it doesn’t capture the near commutativity of adjectives, i.e. in most cases adjectives can be applied to a noun in different orders without significantly changing the meaning.

I can think of several approaches for enforcing the commutativity of adjective matrices:

  1. simply using diagonal matrices (this reduces to one of the approaches already considered), or
  2. penalising the off-diagonal elements via regularisation, or
  3. interleaving existing parameter updates with updates that penalise (co-occurring?) adjective matrices for not commuting with one another, e.g. using the gradient of the matrix commutator AB - BA
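
To make the third option a little more concrete, here is a Python sketch of one possible penalty, the squared Frobenius norm of the commutator AB - BA. This is my own choice of penalty for illustration; it vanishes exactly when the two matrices commute, as diagonal matrices always do.

```python
def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def commutator_penalty(A, B):
    """Squared Frobenius norm of AB - BA."""
    AB, BA = matmul(A, B), matmul(B, A)
    n = len(A)
    return sum((AB[i][j] - BA[i][j]) ** 2 for i in range(n) for j in range(n))


diag_a = [[2.0, 0.0], [0.0, 3.0]]
diag_b = [[5.0, 0.0], [0.0, 7.0]]
shear_a = [[1.0, 1.0], [0.0, 1.0]]
shear_b = [[1.0, 0.0], [1.0, 1.0]]
```

The gradient of this penalty with respect to the entries of A and B could then be interleaved with the usual parameter updates.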

(Linear) Maps of the Impossible: Capturing semantic anomalies in distributional space

Eva Maria Vecchi, Marco Baroni and Roberto Zamparelli.

Presented at the workshop “Distributional Semantics and Compositionality” (2011) PDF

The authors attempt to use distributional models to distinguish between acceptable and “semantically deviant” adjective-noun combinations (an example of this distinction is given by “blue rose” vs “residential steak”). They hypothesise in particular that the length of the vector representation of the adjective-noun combination is an indication of its acceptability. Their reasoning for this hypothesis assumes that directions and in particular axes are interpretable in distributional models (this does not apply in the case of word2vec, at least). They further hypothesise that the combination will be spatially isolated with respect to the cosine similarity.

The distributional representation is derived from a POS-tagged and lemmatised corpus by considering sentence-internal co-occurrence between the vocabulary as a whole and the 10k most frequent nouns, verbs and adjectives, transformed via the “local mutual information” measure and reduced to rank 300 using PCA.

Different methods of transforming the noun representation using the adjective to obtain the adjective-noun combination are studied and the results are evaluated against human judgements of semantic deviance.

Collaborative Filtering for Implicit Feedback Datasets

Hu, Koren and Volinsky (AT&T, Yahoo!), 2008.

A well-written paper.


The authors give a good description of the distinctions between explicit and implicit feedback datasets, pointing out in particular that:

  1. implicit feedback data is inherently noisy, since a user might decide that they do not like an item after viewing it — interaction does not necessarily indicate interest.
  2. the numerical value in explicit feedback indicates preference whereas in the implicit case it indicates confidence.

The authors describe their model as being based on SVD, but this is not accurate, since they weight squared difference summands in the cost function according to a confidence value (which is proportional to the number of interactions for that user-item pair).
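
A minimal Python sketch of such a confidence-weighted cost is given below. It assumes the form used in the paper as I recall it: a binary preference p_ui (1 if the user interacted with the item at all), a confidence c_ui = 1 + alpha * r_ui, and an L2 penalty on the factors. The function name and the toy numbers are my own.

```python
def weighted_loss(R, X, Y, alpha=40.0, lam=0.0):
    """Sum over all user-item pairs of c_ui * (p_ui - x_u . y_i)^2, plus L2 penalty.

    R: interaction counts (users x items); X: user factors; Y: item factors.
    """
    loss = 0.0
    for u, row in enumerate(R):
        for i, r in enumerate(row):
            preference = 1.0 if r > 0 else 0.0
            confidence = 1.0 + alpha * r
            prediction = sum(a * b for a, b in zip(X[u], Y[i]))
            loss += confidence * (preference - prediction) ** 2
    penalty = sum(v * v for vec in X for v in vec) + sum(v * v for vec in Y for v in vec)
    return loss + lam * penalty


R = [[2, 0]]            # one user, two items
X = [[1.0]]             # rank-1 user factor
Y = [[1.0], [0.5]]      # rank-1 item factors
```

Note that the sum runs over every user-item pair, including those with no interaction; this is what distinguishes it from an SVD-style fit on observed entries only.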

The input matrix is the user-item matrix.

Optimisation is via alternating least squares.

Their evaluation metric is percentile rank based.

Their model, which we’ll call “weighted SVD” (they speak of “confidence intervals”) compares favourably with the baseline popularity method and also with an old-school item-based neighbourhood method, in terms of the expected percentile rank (Figure 1). Interestingly, the differences are less marked when the probability that a desired item is in the top (say) 1% is considered (Figure 2).

The unweighted SVD on the user-item matrix is shown to perform terribly, with a significant but insufficient improvement obtained with regularisation.

Logistic Matrix Factorization for Implicit Feedback Data

Christopher Johnson, Spotify, 2014


A new matrix factorisation model for behavioural recommendation in the case of implicit feedback is presented.

User-item interactions are encoded in a non-negative interaction matrix. The question as to whether a user-item interaction occurred is then treated as a problem of binary classification. User-item pairs for which an interaction has occurred are regarded as positive outcomes with confidence in constant proportion to the value in the interaction matrix, while the absence of an interaction is regarded as a negative outcome. This binary classification task is then leveraged to train user- and item-vectors. These vectors reside in dually paired spaces. The dot-product of the vectors, combined with user- and item- bias terms, is then fed into the sigmoid function. What we are really doing is looking for a low-rank approximation to a bilinear form via the sigmoid function.

Confidence values (proportional to the values in the interaction matrix) are used as powers of their corresponding factor in the maximum likelihood function. The constant proportion that defines the confidence values from the interaction matrix is a tuning parameter, but is typically chosen so that the positive outcomes balance the negative outcomes in total confidence. Thus the likelihood function is weighed according to the entries of the interaction matrix. The weighted likelihood function is then maximised using alternating gradient ascent. This optimisation is batch. Negative sampling can be used to speed up convergence, and the confidence parameter is decreased proportionally.

They use a fractional rank type metric to evaluate performance at each iteration. For each user, the interaction probability is computed for each item, and the rank of the target item in this list is determined. This is then averaged over a set-aside collection of user-item pairs. Given that batch gradient descent is used, this is not prohibitively expensive.
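
A rough Python sketch of such a metric follows. The exact normalisation is an assumption on my part (the paper may define it differently): here the fractional rank of a held-out item is the proportion of the other items that the model scores strictly higher, so 0 is best and 1 is worst.

```python
def fractional_rank(scores, target):
    """Fraction of the other items scored strictly higher than the target item."""
    higher = sum(1 for s in scores if s > scores[target])
    return higher / (len(scores) - 1)


def mean_fractional_rank(score_rows, held_out):
    """Average over held-out (user, item) pairs; score_rows[u] lists item scores."""
    ranks = [fractional_rank(score_rows[u], i) for u, i in held_out]
    return sum(ranks) / len(ranks)


score_rows = {0: [0.9, 0.5, 0.1]}
average = mean_fractional_rank(score_rows, [(0, 1), (0, 0)])
```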

The author reports that this logistic matrix factorisation model performs better in low rank than the implicit MF model of Koren et al 2008, though both give a similar fractional rank in high ranks.

A very basic implementation in Python is available. The implementation uses AdaGrad to dynamically choose a step size at each iteration, just as is described in the paper. The paper mentions a Spark or Hadoop based implementation, but I couldn’t find this published anywhere.

Polyglot: Distributed Word Representations for Multilingual NLP

Rami Al-Rfou, Bryan Perozzi, Steven Skiena (all at Stony Brook University)

Published in the proceedings of CoNLL 2013 (PDF).

The authors train word embeddings for 117 different languages using Wikipedia. The embeddings are trained using an architecture similar to that of SENNA of Collobert et al. This architecture computes a score representing the likelihood that the words given as input occurred together in order. A short window is scanned over a stream of text, and the score of the phrase in the window is compared to the score of a corrupted version of the same phrase where the middle word is substituted randomly. The model is penalised (using hinge loss, i.e. one-way error) according to whether the uncorrupted or corrupted phrase was more highly scored.

The score of a phrase is computed as per the following:

[Image: computation of the phrase score]

  1. Each of the words is transformed from a one-hot to a distributed representation via the application of a shared matrix C, and these representations are concatenated;
  2. The hyperbolic tan of an affine transformation of this concatenation is calculated component-wise, yielding a “hidden” vector.
  3. The components of this vector are combined via an affine transformation to yield the score.

So this neural network has three layers and the parameters are the shared matrix C together with the two affine transformations.

The word embedding is given by the rows of the shared matrix C.
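
The three steps and the ranking loss can be sketched in plain Python as follows. This is not the authors' Theano code: the parameter names W1, b1, w2, b2 are my own, the tiny dimensions are only for the example, and the margin of 1 in the hinge loss is an assumption.

```python
import math


def phrase_score(window, C, W1, b1, w2, b2):
    """Score a window of word ids with the three-layer network."""
    # 1. Look up each word's row of the shared matrix C and concatenate.
    x = [value for word in window for value in C[word]]
    # 2. Affine transformation followed by a component-wise tanh.
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W1, b1)
    ]
    # 3. Affine transformation down to a single score.
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def hinge_loss(score_true, score_corrupted, margin=1.0):
    """Zero once the true phrase outscores the corrupted one by the margin."""
    return max(0.0, margin - score_true + score_corrupted)


C = [[1.0], [-1.0]]             # two words, embedding rank 1
W1, b1 = [[1.0, 1.0, 1.0]], [0.0]
w2, b2 = [2.0], 0.5
true_score = phrase_score([0, 1, 0], C, W1, b1, w2, b2)
corrupted_score = phrase_score([0, 0, 0], C, W1, b1, w2, b2)
loss = hinge_loss(true_score, corrupted_score)
```

In this toy set-up the corrupted window happens to score higher than the true one, so the loss is positive and a gradient step would adjust C, W1, b1, w2 and b2.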

The models are trained using Theano for extensive periods of time (the authors mention “weeks”). The window size is taken to be radial length 2, the word embedding rank is 64 and the hidden layer size is 32.

To demonstrate the utility of the word representations, the authors use the representations as initialisation for a model performing parts of speech tagging.

The paper was published at about the same time as word2vec (it does not refer to word2vec at all). The approach, the notation and the terminology, however, demonstrate that certain things that I had thought particular to word2vec were in fact already accepted practice, including:

  • the use of discriminative tasks for training word embeddings
  • sampling contexts by scanning a short window over text
  • the use of the middle word in a context for the discriminative task
  • dividing through by the “fan out” for initialisation (page 187, TBC)
  • the symbols <S> and </S> for delimiting sentences

A Unified Model for Word Sense Representation and Disambiguation

Chen, Liu, Sun, published in the conference proceedings of EMNLP 2014 (PDF).

The authors leverage the word2vec skipgram model and WordNet glosses (i.e. word sense definitions) for word sense disambiguation. This is achieved as follows:

  1. A skipgram model is trained.
  2. For each sense of a word according to WordNet, a vector is derived by taking the average of the content words in the WordNet definition (“gloss”) of that sense (“gloss vectors”)
  3. The gloss vectors are used to identify the sense of a word occurrence by considering its dot product with the context of that occurrence. The sense whose gloss vector has the highest dot product with the context vector is chosen, as long as it wins by a sufficient margin.
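
Steps 2 and 3 can be sketched as follows. The word vectors, sense labels and margin below are invented for the example, and taking the context vector to be the average of the context word vectors is an assumption on my part.

```python
def average(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def choose_sense(context_vector, gloss_vectors, margin):
    """Return the best-scoring sense, or None if it does not win by `margin`."""
    ranked = sorted(
        gloss_vectors,
        key=lambda sense: dot(gloss_vectors[sense], context_vector),
        reverse=True,
    )
    if len(ranked) == 1:
        return ranked[0]
    best = dot(gloss_vectors[ranked[0]], context_vector)
    runner_up = dot(gloss_vectors[ranked[1]], context_vector)
    return ranked[0] if best - runner_up >= margin else None


# Toy skip-gram vectors (hypothetical values).
word_vectors = {
    "money": [1.0, 0.0], "loan": [1.0, 0.0],
    "water": [0.0, 1.0], "shore": [0.0, 1.0],
}
# Gloss vectors: average of the content words of each (hypothetical) gloss.
gloss_vectors = {
    "bank/finance": average([word_vectors["money"], word_vectors["loan"]]),
    "bank/river": average([word_vectors["water"], word_vectors["shore"]]),
}
context = average([word_vectors[w] for w in ("money", "water", "loan")])
confident = choose_sense(context, gloss_vectors, margin=0.1)
undecided = choose_sense(context, gloss_vectors, margin=0.5)
```

With the smaller margin the financial sense is selected; with the larger margin the occurrence is left undisambiguated.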

The authors are then able to train word sense vectors (distinct from the gloss vectors) by modifying the skip-gram objective. These word sense vectors are then used for similarity tasks and not for word sense disambiguation. It seems to me that it would have been simpler to annotate word occurrences in the corpus with the senses than to modify the objective.

Evaluation is performed for coarse-grained WSD (i.e. disambiguating homographs).

Independence assumptions in iterative word sense disambiguation
The authors disambiguate the senses of the words one word at a time, based upon the disambiguation that has already taken place. Two different strategies are considered for choosing the order in which to disambiguate the words in a context. These strategic approaches make a problematic independence assumption – that the sense of the word to be disambiguated is independent of the senses of the words not yet disambiguated. I haven’t read many WSD papers – I suspect these independence assumptions aren’t particular to the approach of the authors.

GloVe: Global Vectors for Word Representations

Pennington, Socher, Manning, 2014.

GloVe trains word embeddings by performing a weighted factorisation of the log of the word co-occurrence matrix. The model scales to very large corpora (Common Crawl 840B tokens) and performs well on word analogy tasks.

The cost function is given by:

\displaystyle \sum_{i, j = 1}^V f(X_{i,j}) (u_i^T v_j + b_i + c_j - \text{log} X_{i,j})^2


  • V is the size of the vocabulary,
  • X denotes the word co-occurrence matrix (so X_{i,j} is the number of times that word j occurs in the context of word i)
  • the weighting f is given by f(x) = (x / x_{\text{max}})^\alpha if x < x_{\text{max}} and 1 otherwise,
  • x_{\text{max}} = 100 and \alpha = 0.75 (determined empirically),
  • u_i, v_j are the two layers of word vectors,
  • b_i, c_j are bias terms.

Note that the sum is only over pairs i, j for which X_{i,j} is non-zero. This means that GloVe (in contrast to word2vec with negative sampling) trains only “positive samples” and also that we don’t have to worry about the logarithm of zero.
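
The cost function is short enough to write out directly. The Python sketch below is my own illustration, not the authors' C implementation; the co-occurrence matrix is stored as a dictionary holding only its non-zero entries, so the restriction of the sum comes for free.

```python
import math


def weight(x, x_max=100.0, alpha=0.75):
    """The weighting f: (x / x_max)^alpha below x_max, and 1 otherwise."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def glove_cost(X, U, V, b, c):
    """X maps (i, j) to a positive co-occurrence value; zero entries are absent."""
    total = 0.0
    for (i, j), x in X.items():
        prediction = sum(p * q for p, q in zip(U[i], V[j])) + b[i] + c[j]
        total += weight(x) * (prediction - math.log(x)) ** 2
    return total


X = {(0, 1): math.e}                 # a single non-zero entry, log X = 1
U = [[1.0], [0.0]]
V = [[0.0], [1.0]]
perfect_fit = glove_cost(X, U, V, b=[0.0, 0.0], c=[0.0, 0.0])
off_by_one = glove_cost(X, U, V, b=[0.0, 0.0], c=[0.0, 1.0])
```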

This is essentially just weighted matrix factorisation with bias terms:



Note that in the implementation (see below), the X_{i,j} are not raw co-occurrence counts, but rather the accumulated inverse distance between the two words, i.e.

\displaystyle X_{w, w'} := \sum_{\text{windows containing\ } w, w'} (\text{distance between\ } w, w')^{-1}.
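
In Python, this accumulation might look as follows. This is a sketch of the idea rather than a port of the authors' code, and updating both (w, w') and (w', w) for each pair, i.e. using a symmetric window, is an assumption.

```python
from collections import defaultdict


def cooccurrence(tokens, window=10):
    """Accumulate 1 / distance for every pair of tokens within `window`."""
    X = defaultdict(float)
    for position, word in enumerate(tokens):
        for distance in range(1, window + 1):
            if position + distance >= len(tokens):
                break
            other = tokens[position + distance]
            X[(word, other)] += 1.0 / distance
            X[(other, word)] += 1.0 / distance
    return X


wide = cooccurrence(["a", "b", "c"], window=2)
narrow = cooccurrence(["a", "b", "c"], window=1)
```

Adjacent words contribute 1, words two apart contribute 1/2, and so on, so the entries of X are in general not integers.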

I am fairly sure that the implementation of Adagrad is incorrect. See my post to the forum.

The factor weighting f

The authors go to some trouble to motivate the definition of this cost function (section 3).  The authors note that many different functions could be used in place of their particular choice of f, and further that their \alpha coincides with that used by word2vec for negative sampling. I can’t see the relevance of the latter, however (in word2vec, the 0.75th power is used to define the noise distribution; moreover powering a value in the range [0, 1] has a very different effect to powering a value in the range [0, 100]).


Graphing the function (see above) hints that it might have been specified more simply, since the non-linear region is in fact almost linear.

A radial window size of 10 is used. Adagrad is used for optimisation.

Word vectors
The resulting word embeddings (u_i and v_j) are unified via a direct sum of their vector spaces.

The cosine similarity is used to find the missing word in word similarity tasks. It is not stated if the word vectors were normalised before forming the arithmetic combination of word vectors.

Source code
The authors take the exemplary step of making the source code available.

Evaluation and comparison with word2vec
The authors do a good job of demonstrating their approach, but do a scandalously bad job of comparing their approach to word2vec. This seems to reflect a profound misunderstanding on the part of the authors as to how word2vec works. While it has to be admitted that the word2vec papers were not well written, it is apparent that the authors made very little effort at all.

The greatest injustice is the comparison of the performance of GloVe with an increasing number of iterations to word2vec with an increasing number of negative samples:

“The most important remaining variable to control for is training time. For GloVe, the relevant parameter is the number of training iterations. For word2vec, the obvious choice would be the number of training epochs. Unfortunately, the code is currently designed for only a single epoch: it specifies a learning schedule specific to a single pass through the data, making a modification for multiple passes a non-trivial task. Another choice is to vary the number of negative samples. Adding negative samples effectively increases the number of training words seen by the model, so in some ways it is analogous to extra epochs.”

Firstly, it is simply impossible that it didn’t occur to the authors to simulate extra iterations through the training corpus for word2vec by simply concatenating the training corpus with itself multiple times. Moreover, the authors themselves are capable programmers (as demonstrated by their own implementation). The modification to word2vec that they avoided is the work of ten minutes.

Secondly, the notion that increasing the exposure of word2vec to noise is comparable to increasing the exposure of GloVe to training data is ridiculous. The authors clearly didn’t take the time to understand the model they were at pains to criticise.

While some objections were raised about the evaluation performed in this article and subsequent revisions have been made, the GloVe iterations vs word2vec negative sample counts evaluation persists in the current version of the paper.

Another problem with the evaluation is that the GloVe word vectors are formed as the direct sum of the word vectors resulting from each matrix factor. The authors do not do word2vec the favour of also direct summing the word vectors from the first and second layers.


A Graph-Based Method for Combining Collaborative and Content-Based Filtering

Phuong, Thang and Phuong (all from the Posts and Telecommunications Institute of Technology, Vietnam), 2008.

PDF obtainable here.

The authors present a recommendation system that incorporates user ratings of items (they work in the explicit rating context) and item-feature relations. The approach is graphical. Effectively, two weighted graphs with non-negative weights are constructed, network propagation is performed on both independently and the two resulting scorings are combined in a weighted sum. The first graph is directed and represents user-item preference via the item features (where the user-feature preferences are computed heuristically from the given ratings and item-feature associations); thus in this graph all paths from user to item have length 2. The second graph is undirected and represents the purely positive user ratings of items and excludes the item features. The two graphs capture the content- and collaborative- aspects of the recommendation problem, respectively.

I find the approach lacks unity and is too heuristic. The unity suggested by the user/item/feature graph of Figure 1 is merely pictorial, since network propagation is actually performed on the two graphs described above (which are derived from this unified graph) separately. The two separate graphs are constructed heuristically, and this removes any claim the approach might have had to necessity.

The more obvious, unified, approach (network propagation on the user/item/feature graph) is unavailable here since the user-item associations may be negative. This would not be the case if the feedback were implicit (e.g. purchases). For this reason I will be interested to read the paper of Huang et al. cited by the authors – perhaps they use just such an approach. Furthermore, Huang et al. experiment with different network propagation algorithms (in this paper, a modification of an algorithm by Weston et al. is used).

The MovieLens dataset is used in the evaluations, which demonstrate the superiority of the authors’ approach over a purely content-based approach, a purely collaborative approach and a simple hybrid approach that merges the result sets of collaborative filtering and content recommendation computed separately.

Paper is clearly written.

HybridRank: A Hybrid Content-Based Approach to Mobile Game Recommendations

Chow, Foo and Manai (all at Group Digital Life, Singtel, Singapore) 2014.


The authors consider a personalised PageRank on a graph whose vertices represent items for recommendation and whose edge weights are determined by feature overlap (where features are e.g. category, tag). In order to incorporate collaborative information into the computation, they define the “teleport vector” (or “reset vector”) for a user to be the sum over all items they’ve interacted with of the corresponding rows of the behavioural item x item matrix (they call these “user correlation matrices”, somewhat confusingly).

This is a nice advance from “ItemRank” for incorporating item meta information into the recommendation process. In contrast to ItemRank, the authors work in the context of implicit feedback data. However, I think that the approach could be made more elegant by considering the users and tags (for example) as additional vertices in the graph – the reset vector would then just be the one-hot vector singling out the vertex corresponding to the user receiving the recommendations.

The authors carry out an enormous user survey and an impressive live production test to demonstrate the performance of their approach.

ItemRank: A Random-Walk Based Scoring Algorithm for Recommender Engines

Marco Gori and Augusto Pucci, 2007 (from the IJCAI conference proceedings).


The authors consider an application of PageRank to recommendation in the case where explicit ratings are available. The vertices of the graph represent items to be recommended, and the weight of the edge between any two vertices is proportional to the number of users that have interacted with both the corresponding items (thus the explicit ratings are not incorporated into the graph itself). To obtain recommendations for a particular user, the “reset” (or “teleport”) vector is set to be the explicit ratings given by the user (0 is used for the absence of a rating), PageRank is run, and the resulting importances are used to rank the items.

It seems to me that this set-up would be more sensible in contexts where the behavioural data was implicit (e.g. user looked at particular item) rather than explicit (user gave a particular rating to a particular item) – in the explicit context the use of the value 0 for the absence of rating cannot be motivated.
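A rough sketch of this graph construction, using a made-up ratings matrix. The damping factor, the iteration count and the name `item_rank` are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

# Toy ratings (made up): rows are users, columns are items, 0 = no rating.
ratings = np.array([[5, 3, 0, 0],
                    [4, 0, 2, 0],
                    [0, 1, 0, 5],
                    [0, 0, 4, 4]], dtype=float)

# Edge weight between two items: number of users who interacted with both.
interacted = (ratings > 0).astype(float)
co_counts = interacted.T @ interacted
np.fill_diagonal(co_counts, 0.0)

# Column-normalise to obtain transition probabilities
# (every item in this toy example has at least one neighbour).
P = co_counts / co_counts.sum(axis=0)

def item_rank(user_ratings, damping=0.85, n_iter=50):
    """PageRank with the user's (normalised) ratings as the reset vector."""
    d = user_ratings / user_ratings.sum()
    p = np.full(len(d), 1.0 / len(d))
    for _ in range(n_iter):
        p = damping * (P @ p) + (1 - damping) * d
    return p

scores = item_rank(ratings[0])
ranking = np.argsort(-scores)   # item indices, best first
```

Note that the ratings enter only through the reset vector; the graph itself sees only whether an interaction took place.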

The authors test their approach on the MovieLens dataset.

As the authors themselves note, PageRank had been used for personalised (more generally, deliberately biased) recommendation before their work (e.g. Haveliwala “Topic sensitive Pagerank”, 2002). The novelty here lies in the construction of the graph from the user-item interaction matrix.

Language Understanding for Text-based Games using Deep Reinforcement Learning

Appeared on the arXiv, June 2015.

The joint work of Karthik Narasimhan, Tejas Kulkarni and Regina Barzilay.

The aim of the paper is to create an autonomous agent that solves quests in text-based adventure games. The agent has no knowledge of the underlying game state, and must decide upon what action to take based only upon the representation of the game state that is afforded by the game. In this sense it seeks to solve a similar problem to that of the now famous Atari deep learning paper. This is also an interesting model for how humans communicate with one another.

There are similarities in approach, moreover, in that both employ reinforcement learning. In contrast to the Atari paper, however, this paper employs a Long Short-Term Memory network.

They use Evennia, a Python framework for building multiplayer online text games (used here in a single player context).

Adriaan S.: Q-learning does not scale well. (This could account for the small vocabulary used.)

Perpendicularity and dimension

We show below that vectors drawn uniformly at random from the unit sphere are more likely to be orthogonal in higher dimensions.

In information retrieval and other areas besides, it is common to use the dot product of normalised vectors as a measure of their similarity. It can be problematic that the similarity measure depends upon the rank of the representation, as it does here — it means, for example, that similarity thresholds (for relevance in a particular situation) need to be re-calibrated if the underlying vectorisation model is retrained e.g. in a higher dimension.
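The dependence on dimension is easy to see empirically. The sketch below samples pairs of uniformly random unit vectors (by normalising standard Gaussian samples) and compares the sample variance of their dot products with 1/n, where n is the dimension. The sample sizes and dimensions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dot_products(dim, n_pairs=5000):
    """Dot products of pairs of vectors drawn uniformly from the unit sphere."""
    # Normalising standard Gaussian samples gives uniform points on the sphere.
    x = rng.standard_normal((n_pairs, dim))
    y = rng.standard_normal((n_pairs, dim))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    y /= np.linalg.norm(y, axis=1, keepdims=True)
    return np.sum(x * y, axis=1)

variances = {dim: sample_dot_products(dim).var() for dim in (10, 100, 400)}
for dim, var in variances.items():
    # The sample variance should be close to 1 / dim.
    print(dim, var, 1.0 / dim)
```

The mean is zero in every dimension, so it is the shrinking variance that concentrates the dot products around zero as the dimension grows.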

Does anyone have a solution to this? Is there a (of course, dimension dependent) transformation that can be applied to the dot product values to arrive at some dimension independent measure of similarity?

I asked this question on “Cross Validated”. Unfortunately, it has attracted so little interest that it has earned me the “tumbleweed” badge!

For those interested in the distribution of the dot products, and not just the expectation and the variance derived below, see this answer by whuber on Cross Validated.

Marginal and Conditional Distributions of the Multivariate Gaussian

This is the standard, elementary arithmetic proof that the marginal and conditional distributions of the multivariate Gaussian are again Gaussian with parameters expressible in terms of the covariance matrix of the original Gaussian. We use the block multiplication of matrices.

I was surprised by how much work is required to show this, and feel moreover that the proof (while correct) fails to offer any intuitive understanding. Is there not a higher-level, co-ordinate free proof of this important result, perhaps one that uses characteristic properties of multivariate Gaussians?

Update: I received an answer to this question on Cross Validated from whuber. He uses a generative definition: the multivariate Gaussians are precisely the affine transformations of tuples of standard (mean zero, unit-variance) one-dimensional Gaussians. Using this definition, it follows quickly that the conditional and marginal distributions must be multivariate Gaussian.
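For reference, the parameters that the block-matrix proof arrives at can be computed directly. The sketch below uses the standard formulas for the conditional distribution of the first k co-ordinates given the rest; the function name and the toy numbers are made up for the example.

```python
import numpy as np

def condition_gaussian(mu, sigma, k, x2):
    """Mean and covariance of x1 | x2, where x1 is the first k co-ordinates."""
    mu1, mu2 = mu[:k], mu[k:]
    s11, s12 = sigma[:k, :k], sigma[:k, k:]
    s21, s22 = sigma[k:, :k], sigma[k:, k:]
    gain = s12 @ np.linalg.inv(s22)
    cond_mean = mu1 + gain @ (x2 - mu2)
    cond_cov = s11 - gain @ s21      # Schur complement of s22 in sigma
    return cond_mean, cond_cov

mu = np.array([0.0, 1.0, -1.0])
L = np.array([[1.0, 0.0, 0.0],
              [0.5, 1.0, 0.0],
              [0.2, -0.3, 1.0]])
sigma = L @ L.T                      # a valid (positive definite) covariance

cond_mean, cond_cov = condition_gaussian(mu, sigma, k=1, x2=np.array([2.0, 0.0]))
# The marginal of x1 needs no computation: mean mu[:1], covariance sigma[:1, :1].
```

As a sanity check, the conditional covariance should equal the inverse of the corresponding block of the precision matrix.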

Word2vec weight initialisation

The initialisation of the weights in word2vec is not what I expected.

  • syn1 The weights connecting the hidden- to the output-layer are initialised to zero (in both the hierarchical softmax and the negative sampling cases)
  • syn0 The initial values for the weights connecting the input- to the hidden-layer are drawn uniformly and independently from the interval [\frac{-1}{2n}, \frac{1}{2n} ], where n is the rank of the hidden layer (i.e. number of hidden units.)
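In NumPy terms, the initialisation described above amounts to the following. This is only a sketch: the original C code uses its own random number generator, and the function name and sizes here are assumptions.

```python
import numpy as np

def init_word2vec_weights(vocab_size, n, seed=0):
    """Initial weights as described above, for a hidden layer of rank n."""
    rng = np.random.default_rng(seed)
    # syn0: input-to-hidden weights, uniform on [-1/(2n), 1/(2n)].
    syn0 = rng.uniform(-0.5 / n, 0.5 / n, size=(vocab_size, n))
    # syn1: hidden-to-output weights, all zero.
    syn1 = np.zeros((vocab_size, n))
    return syn0, syn1

n = 100
syn0, syn1 = init_word2vec_weights(vocab_size=1000, n=n)
mean_sq_norm = np.mean(np.sum(syn0 ** 2, axis=1))
print(mean_sq_norm, 1.0 / (12 * n))
```

Each co-ordinate is uniform on an interval of width 1/n and so has variance 1/(12 n^2); summing over the n co-ordinates gives an expected squared L2-norm of 1/(12 n), which the printed values can be compared against.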

The width of the interval from which the syn0 weights are sampled depends on the rank. I had presumed that this was to account for the dependency of the distribution of the dot product (and in particular the L2-norm) on the rank. However, estimating these distributions empirically, this doesn’t seem to be the case:

[Figure: empirical estimates of these distributions]

According to Mikolov (in a helpful response in the word2vec google group), the initialisation of the weights was chosen empirically, since it seemed to work well.


  1. I was unable to derive an expression for the distribution of L2-norms mathematically. Can someone help with that?

PageRank meets vectorial representations – "Ranking on Data Manifolds"

I came across this paper when following up on ideas I had when reading about TextRank for summarising documents. It is short, well written and very interesting, and was authored by Zhou, Weston, Gretton, Bousquet and Schölkopf (all then at the Max Planck Institute for Biological Cybernetics, Tübingen) in 2004. (PDF).

The authors consider the problem of ranking objects by relevancy to one or more query objects in the case where the objects have a vectorial representation. This is done using the PageRank algorithm on a graph in which the vertices represent the objects and the edge weights are computed using e.g. an RBF kernel (or normalised dot-product, if the vectors are non-negative).

One advantage of this approach is that it generalises naturally to multiple query vectors. The query vectors are simply treated as a complete list of possible (re)starting points for the PageRank random walk. This contrasts with the typical PageRank case, where all pages are possible starting points. Note that this list of starting points need not be binary — it can rather be a probability distribution over the objects representing user preference.

The PageRank approach is shown to perform much better than Euclidean nearest neighbours search on real world data sets (MNIST digits and newsgroup posts are considered). In both cases, the datasets have labels. The query data points are chosen from a single class, and the ranking problem is treated as one of binary classification, i.e. distinguishing the objects of the same class from those of all other classes. The ranking is used to calculate a ROC curve, and the area under this curve is used as a performance measure.

The evaluation considers the case of multiple query vectors, as well. The Euclidean nearest neighbours, in the multiple query vector case, are aggregated by taking the minimal distance to a query vector (i.e. “disjunctively”).

The PageRank method is visually contrasted with the Euclidean nearest neighbours case using the MNIST data set. Below are the 99 best ranked results in the PageRank case (left) and the Euclidean case (right) (99 = 10 x 10 – 1 query). Not only does the left panel contain no threes, but the twos are more homogeneous.

[Figure: the 99 best ranked results for PageRank (left) and for Euclidean nearest neighbours (right)]

The graph that the authors construct is not complete. Rather, edges are added, beginning with those most heavily weighted (but excluding self-loops) until the graph is connected. For example, in the case of the two sickle moons:

[Figures: the two sickle moons example]

An RBF kernel function is used in some cases to define the edge weighting between nodes (see step 2 of the algorithm). Note that the variance \sigma of the RBF kernel needs to be fitted using cross-validation. This is no problem for labelled datasets like MNIST, but would be problematic for e.g. the sickle moons data.

It’s not a random walk
Note that, due to the “symmetric normalisation” that is applied to the affinity matrix W (see step 3 of the algorithm), this is not a random walk — the columns of the normalisation S do not sum to 1, so the iterative procedure of step 4 will not map probability distributions to probability distributions. Given that we only want to rank the nodes, not derive a probability distribution over them, this is not necessarily a problem.
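The effect of the symmetric normalisation can be checked numerically. The sketch below is a simplification and not the authors' code: it uses a complete graph with RBF weights instead of the sparser connected graph described above, assumes an iteration of the form f ← αSf + y (solved in closed form), and the toy points and parameter values are made up.

```python
import numpy as np

def manifold_rank(X, query_idx, sigma=1.0, alpha=0.99):
    """Ranking scores from a symmetrically normalised RBF affinity matrix."""
    X = np.asarray(X, dtype=float)
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)               # no self-loops
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))        # D^{-1/2} W D^{-1/2}
    y = np.zeros(len(X))
    y[query_idx] = 1.0
    # Fixed point of f <- alpha * S f + y, i.e. f = (I - alpha S)^{-1} y.
    f = np.linalg.solve(np.eye(len(X)) - alpha * S, y)
    return f, S

# Two small clusters; the query is the first point.
points = [[0, 0], [1, 0], [0, 1], [3, 3], [3, 4]]
f, S = manifold_rank(points, query_idx=0)
print(S.sum(axis=0))   # the columns of S do not sum to 1
```

S is symmetric, but its columns sum to 1 only when all vertices have the same degree, so the iteration does not preserve probability distributions. The resulting scores can still be used for ranking.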

Why symmetric normalisation? Dengyong was kind enough to explain the motivation for this (via correspondence). This normalisation was used in order to avoid over-emphasising points in high density regions, and because it is the normalisation that appears in the normalised graph Laplacian (I haven’t looked into this). Dengyong added that these reasons were not strong, and that other alternatives should be investigated.

Corrections to the paper
There are some mistakes in the paper. In particular, there is a problem with Theorem 2, pointed out by begab on reddit when I shared this post. Begab points out that DU \neq DU. The problem is larger than this, in fact, since the claim made about the steady state of a PageRank random walk on a connected, undirected graph does not hold. There is the following counterexample, for instance.


  1. Who has taken this research further?
  2. In both of the cases considered, the vector representations of the objects are rather poor (pixel on/off and tf-idf). How much better is this approach if dimension reduction is first applied to the vectors?

Block Multiplication of Matrices

(We needed this to derive the conditional distribution of a multivariate Gaussian).

Consider a matrix product AB. Partition the two outer dimensions (i.e. the rows of A and the columns of B) and the one inner dimension (i.e. the columns of A and the rows of B) arbitrarily. This defines a “block decomposition” of the product AB and of the factors A, B such that the blocks of AB are related to the blocks of A and B via the familiar formula for components of the product, i.e.

(AB)_{m,n} = \sum_s A_{m,s} B_{s,n}.

Pictorially, we have the following:


Arithmetically, this is easy to prove by considering the formula above for the components of the product. The partitioning of the outer dimensions comes for free, while the partitioning of the inner dimension just corresponds to partitioning the summation:

(AB)_{m,n} = \sum_s A_{m,s} B_{s,n} = \sum_i \sum_{s_i \leq s < s_{i+1}} A_{m,s} B_{s,n}.
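A quick numerical check of the block formula, with arbitrarily chosen partitions of the two outer dimensions and of the inner dimension (the matrix sizes and cut points are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 7))
B = rng.standard_normal((7, 4))

# Arbitrary partitions: rows of A, the shared inner dimension, columns of B.
row_cuts, inner_cuts, col_cuts = [0, 2, 5], [0, 3, 4, 7], [0, 1, 4]

def blocks(cuts):
    """Turn a list of cut points into half-open index ranges."""
    return [slice(a, b) for a, b in zip(cuts[:-1], cuts[1:])]

product = np.zeros((5, 4))
for r in blocks(row_cuts):
    for c in blocks(col_cuts):
        # Block (r, c) of AB is the sum over inner blocks s of A[r, s] B[s, c].
        product[r, c] = sum(A[r, s] @ B[s, c] for s in blocks(inner_cuts))
```

The assembled `product` agrees with `A @ B`, whatever partitions are chosen.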

Zooming out to a categorical level, we can see that there is nothing peculiar about this situation. If, in an additive category, we have three objects X, Y, Z with biproduct decompositions, and a chain of morphisms:

X \xrightarrow{\varphi_B} Y \xrightarrow{\varphi_A} Z

then this “block decomposition of matrices” finds expression as a formula in \text{Hom}(X, Z) using the injection and projection morphisms associated with each biproduct factor.

TextRank: Bringing Order into Texts

Published in 2004 (PDF) by Rada Mihalcea and Paul Tarau.

I picked this paper up after seeing that it had been integrated into GenSim (see also this article by the contributor to gensim and others). The authors (of the original paper) apply the PageRank algorithm to graphs constructed from text for the purposes of keyword extraction and summarisation. These two approaches they name (somewhat unnecessarily, I feel) TextRank.

You can test how it works yourself, e.g. using the Python implementation here.

In the case of keyword extraction, the graph has words as vertices (in the best case, only nouns and adjectives) and the (undirected) edges represent co-occurrence within a fixed length window.

In the case of summarisation, the graph has sentences as vertices and the graph is complete. The weight of each edge can be determined by any sentence similarity function. The authors consider the case where sentence similarity is measured by word overlap, normalised by sentence length. If a vectorial representation of the sentences is available, then e.g. the cosine similarity could be used instead. The authors extend the definition of PageRank to deal with weighted graphs.
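A small sketch of the summarisation variant in plain Python. The similarity below divides the word overlap by the sum of the logarithms of the sentence lengths, which is my recollection of the normalisation used in the paper; the whitespace tokenisation, the damping factor and the toy sentences are assumptions.

```python
import math

def overlap_similarity(s1, s2):
    """Word overlap of two sentences, normalised by their (log) lengths."""
    w1, w2 = s1.lower().split(), s2.lower().split()
    norm = math.log(len(w1)) + math.log(len(w2))
    if norm == 0:
        return 0.0
    return len(set(w1) & set(w2)) / norm

def weighted_pagerank(weights, damping=0.85, n_iter=50):
    """PageRank on a weighted graph given as a matrix of edge weights."""
    n = len(weights)
    out_weight = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(n_iter):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_weight[j] * scores[j]
                for j in range(n) if out_weight[j] > 0)
            for i in range(n)
        ]
    return scores

sentences = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "dogs chase the cat on the mat",
    "stock markets fell sharply today",
]
weights = [[0.0 if i == j else overlap_similarity(a, b)
            for j, b in enumerate(sentences)]
           for i, a in enumerate(sentences)]
scores = weighted_pagerank(weights)
best = max(range(len(sentences)), key=scores.__getitem__)
```

The last sentence shares no words with the others, so it receives only the baseline score of 1 - damping. A summary would be formed from the highest scoring sentences.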

The advantage of the TextRank approaches is that nothing needs to be learnt — there is no machine learning involved at all. The keyword extraction and summarisation make relatively loose assumptions about the language of the text and apply equally well to documents from unseen domains. TextRank is, however, entirely heuristic. The theory leaves off where the authors begin (that is, with PageRank). The authors do present an interesting application of PageRank, however.


  • Has anyone tried the summarisation out using a vectorial representation of sentences and the cosine similarity? Other than bag-of-words?
  • If we use e.g. the RBF kernel e^{- \| x_1 - x_2\|} for the edge weight between two vectors x_1, x_2, what points does PageRank tend to choose from a (multimodal) data sample? Related is perhaps this article. (See also my summary).

Document Classification by Inversion of Distributed Language Representations

This is a note on the arxiv by Matt Taddy from April 2015. It reads very clearly and has a simple point to make: language modelling techniques can be used in classification tasks by training a separate language model for each class; documents are assigned to the class of the model where the document has the highest likelihood (hence “inversion”). In our discussion, we assume a uniform prior over the classes.
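The inversion idea can be illustrated with any language model. The sketch below swaps in smoothed unigram models, which is not what Taddy uses, purely to keep the example self-contained; the training snippets are made up.

```python
import math
from collections import Counter

def train_unigram(docs, vocab):
    """Add-one smoothed unigram model: a stand-in for a real language model."""
    counts = Counter(w for d in docs for w in d.split())
    total = sum(counts[w] for w in vocab) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def log_likelihood(model, doc):
    """Log-likelihood of a document under one class model (unknown words skipped)."""
    return sum(math.log(model[w]) for w in doc.split() if w in model)

def classify(models, doc):
    # With a uniform prior over classes, Bayes' rule reduces to choosing
    # the class whose model gives the document the highest likelihood.
    return max(models, key=lambda label: log_likelihood(models[label], doc))

training = {
    "positive": ["great food great service", "lovely friendly staff"],
    "negative": ["terrible food rude staff", "awful slow service"],
}
vocab = {w for docs in training.values() for d in docs for w in d.split()}
models = {label: train_unigram(docs, vocab) for label, docs in training.items()}

prediction = classify(models, "great friendly service")
```

One model is trained per class, and classification is just a comparison of likelihoods.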

Taddy considers the particular case of predicting the sentiment of Yelp reviews at different levels of granularity. Different approaches are considered:

  • word2vec inversion is inversion in the sense described above where document vectors are taken as the average of the word vectors of the constituent words;
  • phrase regression, where separate logistic regression models are trained for each output class, taking as input phrase count vectors;
  • doc2vec regression is as per phrase regression, but taking as input one of:
    • doc2vec DBOW
    • doc2vec DM
    • doc2vec DBOW and DM combined, i.e. in direct sum
  • MNIR, the author’s own Multinomial Inverse Regression

Three separate classification tasks are considered, labelled “a”, “b” and “c” in the diagram below, representing two-, three- and five-class sentiment classification.

[Figure: the classification tasks “a”, “b” and “c”]

As illustrated in the following figure, only the word2vec inversion technique would do a decent job when the gravity of a misclassification is considered (so penalising less if, e.g. predicted star rating is off by only one star):

[Figure: performance of each technique when the gravity of a misclassification is taken into account]

Missing from Taddy’s comparison is inversion using the document vectors, though this is certainly the sort of thing his paper suggests might work well. Also missing is regression using the document vectors obtained as aggregates of word vectors.

Notes on Document Embedding with Paragraph Vectors

Presented at NIPS 2014 (PDF) by Dai, Olah, Le and Corrado.


The authors consider a modified version of the PV-DBOW paragraph vector model. In previous work, PV-DBOW had distinguished words appearing in the context window from non-appearing words given only the paragraph vector as input. In this modified version, the word vectors and the paragraph vectors take turns playing the role of the input, and word vectors and paragraph vectors are trained together. That is, a gradient update is performed for the paragraph vector in the manner of regular PV-DBOW, then a gradient update is made to the word vectors in the manner of Skipgram, and so on. This is unfortunately less than clear from the paper. The authors were good enough to confirm this via correspondence, however (thanks to Adriaan Schakel for communicating this). For the purposes of the paper, this is the paragraph vector model.

The representations obtained from paragraph vector (using cosine similarity) are compared to those obtained using:

  • an average of word embeddings
  • LDA, using Hellinger distance (which is proportional to the L2 distance between the component-wise square roots)
  • paragraph vector with static, pre-trained word vectors

In the case of the average of word embeddings, the word vectors were not normalised prior to taking the average (confirmed by correspondence).
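For concreteness, here is the Hellinger distance between two discrete distributions, written so that the relation to the L2 distance is visible. The factor 1/\sqrt{2} is the usual convention that makes the distance lie in [0, 1]; the example distributions are made up.

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance: L2 distance between square roots, scaled by 1/sqrt(2)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2)

topics_a = [0.7, 0.2, 0.1]
topics_b = [0.1, 0.2, 0.7]
print(hellinger(topics_a, topics_b))
```

Identical distributions are at distance 0, and distributions with disjoint support are at distance 1.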


Two corpora are considered, the arXiv and Wikipedia:

  • 4.5M articles from Wikipedia, with a vocabulary of size 915k
  • 886k articles from the arXiv, full texts extracted from the PDFs, with a vocabulary of 970k words.

Only unigrams are used. The authors observed that bigrams did not improve the quality of the paragraph vectors. (p3)

Quantitative Evaluation

Performance was measured against collections of triples, where each triple consisted of a test article, an article relevant to the test article, and an article less relevant to the test article. While not explicitly stated, it is reasonable to assume that the accuracy is then taken to be the rate at which similarity according to the model coincides with relevance, i.e. the rate at which the model says that the relevant article is more similar than the less relevant article to the test article. Different sets of triples were considered; the graph below shows the performance of the different methods relative to a set of 172 Wikipedia triples that the authors built by hand (these remain unreleased at the time of writing).

[Figure: accuracy of the different methods on the 172 hand-built Wikipedia triples, by dimension]
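Under the assumption made above about how accuracy is defined, the evaluation against triples could be computed as follows. The document vectors and triples here are toy values, and `triple_accuracy` is a hypothetical helper, not code from the paper.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def triple_accuracy(vectors, triples):
    """Fraction of triples in which the relevant article is the more similar one."""
    hits = sum(
        cosine(vectors[test], vectors[relevant])
        > cosine(vectors[test], vectors[less_relevant])
        for test, relevant, less_relevant in triples
    )
    return hits / len(triples)

vectors = {
    "a": np.array([1.0, 0.0]),
    "b": np.array([0.9, 0.1]),
    "c": np.array([0.0, 1.0]),
    "d": np.array([0.1, 0.9]),
}
# Each triple: (test article, relevant article, less relevant article).
triples = [("a", "b", "c"), ("c", "d", "a"), ("a", "c", "b")]
accuracy = triple_accuracy(vectors, triples)
```

With only 172 triples, a single triple changes the accuracy by more than half a percentage point, which is worth keeping in mind when reading the curves.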

It is curious that, with the exception of the averaged word embeddings, the accuracy does not seem to saturate as the dimension increases for any of the methods. However, as each data point is the accuracy of a single training (confirmed by correspondence), this is likely nothing more than the variability inherent to each method. It might suggest, for example, that the paragraph vectors method has a tendency to get stuck in local minima. This instability in paragraph vector is not apparent, however, when tested on the triples that are automatically generated from Wikipedia (Figure 5). In this latter case, there are many more triples.

Performance on the arXiv is even more curious: accuracy decreases markedly as the dimension increases!

[Figure: accuracy of the different methods on the arXiv triples, by dimension]


I am not sure there are any publicly available implementations of this modified paragraph vectors method. According to Dai, the implementation of the authors uses Google proprietary code and is unlikely to be released. However it should be simple to modify the word2vec code to train the paragraph vectors, though some extra code will need to be written to infer paragraph vectors after training has finished.

I believe that the gensim implementation provides only the unmodified version of PV-DBOW, not the one considered in this paper.


It is interesting that the paragraph vector is chosen so as to best predict the constituent words, i.e. it is inferred. This is a much better approach from the point of view of word sense disambiguation than obtaining the paragraph vector as a linear image of an average of the word vectors (NMF vs PCA, in their dimension reductions on bag of words, is another example of this difference).

Thanks to Andrew Dai and Adriaan Schakel for answering questions!


  1. Is there an implementation available in GenSim? (see e.g. this tutorial).
  2. (Tangent) What is the motivation (probabilistic meaning) for the Hellinger distance?

Expectation-Maximisation and Gaussian Mixture Models

Below are notes from a talk on Expectation Maximisation I gave at our ML-learning group. Gaussian mixture models are considered as an example application.

The exposition follows Bishop section 2.6 and Andrew Ng’s CS229 lecture notes. If you weren’t at the seminar, then it is probably better to read one of these instead.

Another useful reference is likely the 1977 paper by Dempster et al. that made the technique famous (this is something I would have liked to have read, but didn’t).
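As a concrete companion to the notes, here is a minimal sketch of EM for a one-dimensional Gaussian mixture. The initialisation, the fixed number of iterations and the synthetic data are arbitrary choices, and no safeguards (e.g. against collapsing variances) are included.

```python
import numpy as np

def em_gmm_1d(x, n_components=2, n_iter=100):
    """EM for a 1-D Gaussian mixture; returns weights, means and variances."""
    x = np.asarray(x, dtype=float)
    # Crude deterministic initialisation (an arbitrary choice for this sketch).
    means = np.quantile(x, np.linspace(0.1, 0.9, n_components))
    variances = np.full(n_components, x.var())
    weights = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point.
        dens = (np.exp(-(x[:, None] - means) ** 2 / (2 * variances))
                / np.sqrt(2 * np.pi * variances))
        resp = weights * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the weighted data.
        nk = resp.sum(axis=0)
        weights = nk / len(x)
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
    return weights, means, variances

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3.0, 1.0, 500), rng.normal(3.0, 1.0, 500)])
weights, means, variances = em_gmm_1d(x)
```

The responsibilities computed in the E-step are the “completions” of the data under the current parameters, and the M-step is an ordinary weighted maximum likelihood fit given those completions.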


  1. I still don’t understand how EM manages to (reportedly) work so well, given that the maximisation chooses for the next parameter vector precisely the one that reinforces the “fantasy” completions of the data made by the previous parameter vector. I would not have considered it a good learning strategy. It contrasts greatly with, for example, the learning strategy of a restricted Boltzmann machine, in which, at each iteration, the parameters are adjusted so as to correct the model’s fantasy towards producing the observed data.
  2. Can we offer a better argument for why maximisation of the likelihood for latent variable models is difficult?
  3. Is the likelihood of an exponential family distribution convex in the parameters? This is certainly the case for e.g. the mean of a Gaussian. Does this explain why the maximisation of the constructed lower bound for the likelihood is easy?

Support Vector Machine Active Learning with Applications to Text Classification

This is an old but interesting paper from 2001 by Simon Tong (now at Google) and Daphne Koller (who launched Coursera with Andrew Ng), both then at Stanford. (PDF here)

As the title suggests, the authors apply SVMs to text classification using active learning techniques. This is the first active learning paper I’ve read, and it has provided me with some useful notions. I’ll be reading further in this field, as building e.g. a Tinder-like application for finding research articles of interest or for “e-Discovery” applications in litigation interests me. These are both examples of relevance feedback for transduction since the classifier will be applied to a collection of (unlabelled) samples that we know of in advance (in contrast, applying to unseen samples, e.g. for e-mail filtering, is induction). In machine learning, recognition of this distinction is due to Vapnik.

The authors introduce the notion of a version space for an SVM trained on linearly separable data: it is the set of all hyperplanes that separate the two classes in the training data, thought of as points on the unit sphere by taking the unit normal to the hyperplane. The area of the version space is thus a measure of our uncertainty of the true decision boundary. The authors propose an active learning approach that successively “queries” (i.e. requests the label for) a sample that would maximally reduce the area of the version space.

Three heuristic methods are proposed for choosing such a sample: SimpleMargin, MaxMin Margin and Ratio Margin. The latter two outperform the former, but are significantly more expensive computationally, since they require the training of SVMs, and the cost of training an SVM is polynomial in the size of the training set. All three active learning methods are much better than random.

Linear separability of the training examples is not a problem for the authors, since they operate on word count vectors (the vocabulary size = rank is about 10k). In the case where a dense, lower dimensional vectorisation was used, linear separability would likely still hold, particularly if kernels were used.

They perform experiments using the Reuters corpus and a newsgroup collection.
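The cheapest of the three, SimpleMargin, queries the unlabelled sample that lies closest to the current decision boundary. A sketch for a linear SVM follows; the hyperplane (w, b) and the pool of unlabelled samples are hypothetical values, and the training of the SVM itself is not shown.

```python
import numpy as np

def simple_margin_query(w, b, pool):
    """Index of the unlabelled sample closest to the hyperplane w.x + b = 0."""
    distances = np.abs(pool @ w + b) / np.linalg.norm(w)
    return int(np.argmin(distances))

w, b = np.array([1.0, -1.0]), 0.0      # hypothetical current SVM solution
pool = np.array([[3.0, 0.0],
                 [0.5, 0.4],
                 [-2.0, 1.0],
                 [1.0, 2.5]])
query = simple_margin_query(w, b, pool)
```

After the label of the queried sample is obtained, the SVM would be retrained and the procedure repeated.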


  • Active learning by successively reducing the area of the version space is an interesting approach. I wonder if, more generally, we might think of choosing a sample that maximally reduces the entropy of the posterior distribution?

Entropy of the Normal Distribution

Here we calculate the entropy of the normal distribution and show that the normal distribution has maximal entropy amongst all distributions with a given finite variance.

The video below concerns just the calculation of the entropy, not the maximality property.
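In outline, the calculation runs as follows for a normal density p with mean \mu and variance \sigma^2 (using the natural logarithm, so that the entropy is measured in nats):

    \begin{equation*} h(p) = -\int p(x) \log p(x) \, dx = \frac{1}{2} \log(2 \pi \sigma^2) + \frac{1}{2 \sigma^2} \int p(x) (x - \mu)^2 \, dx = \frac{1}{2} \log(2 \pi e \sigma^2). \end{equation*}

The second equality substitutes -\log p(x) = \frac{1}{2} \log(2 \pi \sigma^2) + \frac{(x - \mu)^2}{2 \sigma^2}, and the third uses the fact that the remaining integral is the variance \sigma^2. Note that the entropy depends on \sigma^2 but not on \mu.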


Matrix Factorisation and the Eigendecomposition of the Gram Matrix

Here we consider the problem of approximately factorising a matrix X without constraints and show that solutions can be generated from the orthonormal eigenvectors of the Gram matrix X^T X (i.e. of the sample covariance matrix).

For this we need the eigendecomposition of real symmetric matrices.
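A numerical illustration, following the convention X \cong A^T B with inner dimension k: the eigenvectors of the k largest eigenvalues of X^T X give a factorisation that agrees with the truncated SVD, which is the best rank-k approximation in the least squares sense. The matrix sizes and random data are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 8))
k = 3

# Orthonormal eigenvectors of the Gram matrix X^T X.
# eigh returns eigenvalues in ascending order, so the last k are the largest.
eigenvalues, eigenvectors = np.linalg.eigh(X.T @ X)
V_k = eigenvectors[:, -k:]

# Factors for X ~ A^T B: A^T = X V_k and B = V_k^T.
A = (X @ V_k).T            # shape (k, 20)
B = V_k.T                  # shape (k, 8)
approx = A.T @ B

# Compare with the rank-k truncated SVD of X.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
svd_approx = (U[:, :k] * s[:k]) @ Vt[:k]
```

Replacing (A, B) by (PA, PB) for any orthogonal k x k matrix P leaves the product A^T B unchanged, so this solution is far from unique.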



Questions, all related to one another:

  • What other solutions are there?
  • (Speculative) can we characterise the solutions as orbits of the orthogonal group on the solutions above, and on those solutions obtained from the above by adding rows of zeros to B?
  • Under what constraints, if any, are the optimal solutions to matrix factorisation matrices with orthonormal rows/columns? To what extent does orthogonality come for free?

Eigendecomposition of real, symmetric matrices

We show that a real, symmetric matrix has a basis of real-valued orthonormal eigenvectors and that the corresponding eigenvalues are real. We show moreover that these eigenvalues are all non-negative if and only if the matrix is positive semi-definite.

Literary treasure hunting with the Lateral API

A good friend, Sam Matthews, gave a talk in December 2014 at a conference of the Australian Modernist Studies Network on “Transnational Modernisms”. Sam spoke about his discovery of a reference to a print-shop from Balzac’s “Two Poets” in Christina Stead’s novel Seven Poor Men of Sydney. Sam later suggested that I check if we couldn’t use Lateral’s text matching service (the “Recommender (BYO documents!)” API) to confirm this reference to Balzac and potentially uncover other ones. As you’ll see below, the preliminary results are very encouraging. This is hardly a conclusive experiment, but

In case you would like to search for references to Balzac’s works yourself, you can do so by reusing the API key I created: b4de9b9183df4cbf8d70cde15609800a .

This is how I proceeded:

  1. I downloaded the Complete works of Balzac from Project Gutenberg. This gives one HTML file for each of Balzac’s works.
  2. I split each work into paragraphs, labelling the paragraphs by their work and position within the work. Balzac wrote many paragraphs, it turns out!
  3. I subscribed to the API at Lateral, obtaining an API key.
  4. I installed Francis Tzeng’s Python package for accessing the Lateral API.
  5. Using the Python package, I added the paragraphs of Balzac to the Lateral recommender. Short paragraphs containing too few meaningful words were rejected; in total, the number of meaningful paragraphs of Balzac indexed was over 21,000.
  6. Again using the python package, I searched for the closest paragraphs of Balzac to the passage of Stead that Sam had indicated to me (see below).
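Step 2 can be sketched in a few lines of plain Python. The identifier format (title, hyphen, five-digit position) mirrors the document ids in the results below; splitting on blank lines, numbering from zero and the name `label_paragraphs` are assumptions. The calls to the Lateral API itself are not shown.

```python
import re

def label_paragraphs(title, text):
    """Split a work into paragraphs, labelled by work and position."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Collapse internal whitespace so that each paragraph is a single line.
    return {f"{title}-{i:05d}": " ".join(p.split())
            for i, p in enumerate(paragraphs)}

sample = "First paragraph.\n\nSecond\nparagraph.\n\n\nThird paragraph."
docs = label_paragraphs("TWO POETS", sample)
```

Each entry of `docs` would then be added to the recommender under its identifier.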

The passage of Stead’s novel that evokes the print-shop appears below (from Chapter 3):

devil’s kitchen where the word is made bread … triangular park … A wide old doorway opened beside the tobacconist’s shop, and over it was a name, white on blue, “Tank Steam Press, Ground Floor.” The tobacconist owned the old single-storey building and rented out to several establishments the mouldy apartments of the ground and first floor. In the attic was the man who did heliogravure. The building had once been a private house. Its court was now a cart-dock and opened into the other street. Its first-floor bathroom at the head of the stairs contained the old water-closet, used by all the workers in the house, a gas-ring to make tea, and the usual broken chairs and out of-date telephone directories. The distinctive smell of the building came from this closet and from the printing-ink. Joseph walked through the old doorway, went by a staircase and entered the large airy double room occupied by the Press. He opened the glass back-door and moved about among the presses, curiously inspecting the jobs in their various stages, picking up a paper, looking through the bills on a bill-hook, putting his finger in the dust in the little glassed-in office of Chamberlain, the owner, and shutting off the stove, lighted by the cleaner, because the day was warm enough.

Below are the paragraphs of Balzac that are semantically closest to the text above, according to Lateral. As you can see, the 1st and the 9th closest paragraphs (of over 21,000!) indeed come from “Two Poets”, and inspection reveals that they indeed concern the printshop! You can click the links to fetch the corresponding paragraphs using the API. The intermediately ranked results seem to be architectural descriptions.

  1. "TWO POETS-00019" (distance 0.034905)
  2. "THE COLLECTION OF ANTIQUITIES-00557" (distance 0.035945)
  3. "SONS OF THE SOIL-01139" (distance 0.037409)
  4. "A MAN OF BUSINESS-00034" (distance 0.038067)
  5. "URSULA-01020" (distance 0.038168)
  6. "COUSIN PONS-01938" (distance 0.038216)
  7. "COLONEL CHABERT-00023" (distance 0.03837)
  8. "COUSIN BETTY-01508" (distance 0.038545)
  9. "TWO POETS-00018" (distance 0.038823)
  10. "RISE AND FALL OF CESAR BIROTTEAU-01382" (distance 0.038891)
  11. "THE RED INN and others-00045" (distance 0.039151)
  12. "THE LESSER BOURGEOISIE(The Middle Classes)-00635" (distance 0.039195)
  13. "SCENES FROM A COURTESAN'S LIFE-00761" (distance 0.039369)
  14. "THE TWO BROTHERS-00663" (distance 0.039377)
  15. "HONORINE-00036" (distance 0.039471)
  16. "Z. MARCAS-00043" (distance 0.039808)
  17. "RISE AND FALL OF CESAR BIROTTEAU-00623" (distance 0.039896)
  18. "THE VILLAGE RECTOR-00313" (distance 0.040041)
  19. "A WOMAN OF THIRTY-00700" (distance 0.040253)
  20. "CATHERINE DE' MEDICI-01059" (distance 0.04031)

High Reproducibility and High-Accuracy Method for Automated Topic Extraction

Lancichinetti et al 2015

(forwarded by Schakel)

LDA is a generative probabilistic topic model. The authors generate toy models using synthetic documents of words from distinct natural languages. This is in accordance with the generative model posited by LDA, where the topics here are the languages. They then calculate the likelihood of the desired (i.e. generating) solution and the likelihood of various deformed solutions, and show that in quite normal cases the generating solution can have a lower likelihood than a deformed solution.

They further show that the generating solution is often not obtained in practice by LDA and its standard methods of optimisation, even in the normal case where the generating solution is the unique global maximum of the likelihood function.

They describe a pragmatic (non-probabilistic) approach to topic modelling, which involves first clustering words by detecting communities in the (denoised) word co-occurrence graph, and using these clusters to (somehow) choose initial values for PLSA or LDA to obtain a better solution.

They demonstrate how much better their method performs on their synthetic data.

I find the results of the authors vindicating. I have found the esteem of the machine learning community for LDA so at odds with my own experience of its performance that I wondered if I had misunderstood something. In the realm of information retrieval, we found LDA to be consistently out-performed by the non-probabilistic decompositions of PCA and NMF.

It is not too hard to find support for what I sense might be considered an unpopular opinion:

“Performance of LDA has never significantly surpassed PLSI (in fact we often found inferior results) which is the reason we left them out”

The authors of this paper undertook to investigate the short-comings of LDA by constructing some toy models. As they suggest themselves, it is not a new idea to create a toy model, but we don’t seem to do enough of it in machine learning.