## Don’t interpret linear hidden units, they don’t exist.

Having trained a model, it is natural to want to understand how it works. An intuitively appealing approach is to consider data samples that maximise the activation of a hidden unit, and to take the common input features of these samples as an indication of what that unit has learned to recognise. However, as we’ll see below, it is a misconception to speak of hidden units if:

• there is no non-linearity on the hidden layer;
• the weights connecting the layers are unconstrained; and
• the model is trained using (stochastic) gradient descent or similar.

In such a scenario, the hidden feature space must instead be considered as a whole.

### Summary

Consider the task of factorising a matrix as a product of matrices with some fixed inner dimension . The model parameters are pairs of matrices with the appropriate dimensions, and the image of an input vector on the hidden layer is given by . To consider this vector in terms of hidden unit activations is to fix a co-ordinate system in the hidden feature space, and to measure the displacement of the vector along each co-ordinate axis. If denote the unit vectors corresponding to the chosen co-ordinate system, then the displacements are given by the inner products

We show below that if is any rotation of the hidden feature space, then the model parameters are just as likely as to result in the factorisation of a fixed matrix and that which of these occurs depends only on the random initialisation of gradient descent. Thus the hidden unit activations might just as likely have been given by

The hidden unit activations given by 1 and 2 can be very different indeed. In fact, since is an orthogonal transformation, we have

(see e.g. here). Thus the indeterminacy of the model parameters, i.e. vs. , might equivalently be thought of as an indeterminacy in the orientation of the co-ordinate system, i.e. the vs. the . The choice of orientation of co-ordinate basis is completely arbitrary, so speaking of hidden unit activations makes no sense at all.

The above holds more generally for an orthogonal transformation of the hidden feature space, i.e. for a composition of rotations and reflections.

### Szegedy et al.

None of the above is new. For example, it was stated by Szegedy et al. in an empirical study of the interpretability of hidden units. We are demonstrating, step-by-step, a statement of theirs (which was about word2vec):

… word representations, where the various directions in the vector space representing the words are shown to give rise to a surprisingly rich semantic encoding of relations and analogies. At the same time, the vector representations are stable up to a rotation of the space, so the individual units of the vector representations are unlikely to contain semantic information.

### Matrix factorisation and unit activation

Given a matrix and an inner dimension , the task of matrix factorisation is to learn two matrices and whose product approximates :

The parameter space consists of the entries of the matrices and . The hidden feature space, on the other hand, is the k-dimensional space containing the columns of and .

### Error function

To train a matrix factorisation model using gradient descent, the model parameters are repeatedly updated using the gradient vector of the error function. An example error function could be

Notice that this choice of error function doesn’t depend directly on the pair of matrices , but rather only on their product , i.e. only on the approximation of . This is true of any error function , because error functions depend only on inputs and outputs.

### Orthogonal transformations of the hidden feature space

Recall that orthogonal transformations of a space are just compositions of rotations and reflections about hyperplanes passing through the origin. Considered as matrices, orthogonal transformations are defined by the property that their product with their transpose gives the identity matrix. Using this property, it can be seen that an orthogonal transformation of the hidden feature space defines an orthogonal transformation of the parameter space by acting simultaneously on the column vectors of the matrices. If and denote the groups of orthogonal transformations on the hidden feature space and the parameter space, respectively, then:

### Contour lines of the gradient

The effect of this block-diagonal orthogonal transformation on the parameter space corresponds to multiplying the matrices and on the left by the orthogonal transformation of the feature space, i.e. it effects . Notice that and yield the same approximation to the original matrix , since:

Thus , so the orthogonal transformations of the hidden feature space trace out contour lines of in the parameter space. Now the gradient vector is always perpendicular to the contour line, so the sequence of points in the parameter space visited during gradient descent preserve the orientation of the hidden feature space set at initialisation (see here, for example). So if gradient descent of starting at the initial parameters converges to the parameters , and you’d prefer that it instead converged to , then all you need to do is start the gradient descent over again, but this time with the initial parameters . We thus see that the matrices that our matrix factorisation model has learned are only determined up to an orthogonal transformation of the hidden feature space, i.e. up to a simultaneous transformation of their columns.

### Gradient descent methods

The above statements continue to hold in the case of stochastic gradient descent, where the error function is not fixed but rather defined by varying mini-tasks (an instance being e.g. word2vec). Such error functions still don’t depend upon hidden layer values, so as above their gradient vectors are perpendicular to the contour lines traced out by the orthogonal transformations of the hidden layer. Thus the updates performed in stochastic gradient descent also preserve the original orientation of the feature space.

### Initialisation

How likely is it that initial parameters, transformed via an orthogonal transformation as above, ever occur themselves as initial parameters? In order to conclude that the orientation of the co-ordinate system on the hidden layer is completely arbitrary, we need it to be precisely as likely. Thus if denotes the probability distribution on the parameter space from which the initial parameters are drawn, we require

for any initial parameters and any orthogonal transformation of the hidden feature space.

This is not the case with word2vec, where each parameter is drawn independently from a uniform distribution. However, it remains true that for any choice of initial parameters, there will still be any number of possible orientations of the co-ordinate system, but for some choices of initial parameters there is less freedom than for others.

### Appendix: What about GloVe?

GloVe performs weighted matrix factorisation with bias terms, so the above should apply. The weighting is just a modified error function, and the bias terms are not hidden features and so are left unmodified by its orthogonal transformations. Like word2vec, GloVe initialises each parameter with independent samples from uniform distribution, so there are no new problems there. The real problem with applying the above analysis to GloVe is that the implementation of Adagrad used makes the learning regime dependent on the choice of basis of the hidden feature space (see e.g. here). This doesn’t mean that the hidden unit activations of GloVe make sense, it just means that GloVe is less amenable to theoretical arguments like those above and needs to be considered empirically e.g. in the manner of Szegedy et al.

## Adagrad depends on the choice of co-ordinate system

Adagrad is a learning regime that maintains separate learning rates for each individual model parameter. It is used, for instance, in GloVe (perhaps incorrectly), in LightFM, and in many other places besides. Below is an example showing that Adagrad models evolve in a manner that depends upon the choice of co-ordinate system (i.e. orthonormal basis) for the parameter space. This dependency is no problem when the parameter space consists of many one-dimensional, conceptually unrelated features lined up beside one another, because such parameter spaces have only one natural orientation of the co-ordinate axes. It is a problem, however, for the use of Adagrad in models like GloVe and LightFM. In these cases the parameter space consists of many feature vectors (of dimension, say, 100) concatenated together. A learning regime should not depend upon the arbitrary choice of orthonormal basis in this 100-dimensional feature space.

For feature spaces like these, I would propose instead maintaining a separate learning rate for each feature vector (for example, in the case of GloVe, there would be one learning rate per word vector per layer). The learning rate of a feature vector would dampen the initial learning rate by the accumulation of the squared norms of the previous gradient updates of the feature vector. The evolution of Adagrad would then be independent of the choice of basis in the feature space (as distinct from the entire parameter space). In the case of GloVe this means that a simultaneous rotation of all the word vectors in both layers during training does not alter the resulting model. This proposal would have the further advantage of greatly reducing the number of learning rates that have to be stored in memory. I don’t know if this proposal would have regret minimisation properties analogous to Adagrad. I haven’t read the original paper of Duchi et al. (2011), and what I am proposing might be subsumed there by the full-rank case (thanks to Alexey Rodriguez for pointing this out). Perhaps a block diagonal matrix could be used instead of a diagonal one.

Update: Minh + Kavukcuoglu seem to have adopted the same point of view in Learning word embeddings efficiently with noise-contrastive estimation (2013). Thanks to Matthias Leimeister for this.

## Short-time Fourier transform cheatsheet

I prepared the following one-page overview of the short-time Fourier transform for a recent talk, perhaps it’ll be useful to others. For justification of e.g. the conjugate symmetry of the Fourier coefficients or a discussion of aliasing, see here.

## The mathematics of the discrete Fourier transform

We aim to identify the assumptions that are implicit in the sampling of a continuous-time signal and in the subsequent application of the discrete Fourier transform (DFT). In particular, we consider the following questions:

• When does the sampling of periodic continuous-time signal result in a periodic discrete-time signal?
• When the resulting discrete-time signal is periodic, what is its frequency in samples/second?
• Which continuous-time frequencies coincide in discrete time, and what does the “frequency spectrum” in discrete-time look like?
• To which periodic discrete-time signals can the discrete Fourier transform be applied to without losing information?

We furthermore show that the DFT interchanges point-wise and convolution products in the time- and frequency- domains, and thereby express the DFT to Pontryagin duality for finite cycle groups.

I talked about the above (skipping many details!) at a recent talk.

## Feature scaling and non-negative matrix factorisation

Non-negative matrix factorisation (NMF) is a dimension reduction technique that is commonly applied in a number of different fields, for example:

• in topic modelling, applied to the document x word matrix;
• in speech processing, applied to the matrix of magnitude spectrograms of framed audio;
• in recommendation systems, applied to the user x item interaction matrix.

Due to its non-negativity constraint, it has the wonderful property of decomposing a objects as an additive combination of (often very meaningful) parts. However, as with all unsupervised learning tasks, it is sensitive to the relative scale of different features.

The fundamental problem is that the informativeness of a feature need not be related to its scale. For example, when processing speech, the highest-energy components of a magnitude spectrogram are those of the least perceptual importance! So when NMF decides which information to discard into order to achieve a low-rank factorisation that minimises the error function, it can be the signal, not the noise, that is sacrificed. This problem is not unique to NMF, of course: PCA retains those dimensions of the sample cloud that have the greatest variance.

It is in general better to learn a feature representation jointly with the downstream task, so that the model learns to scale features according to their informativeness for the task. If NMF is for some reason still desirable, however, it is possible to better control the information loss by choosing an appropriate measure of the matrix factorisation error.

There are three common error functions used in NMF (all of which Bregman divergences): squared Euclidean, Kullback-Leibler (KL) and Itakura-Saito (IS). These are respectively quadratic, linear and invariant with respect to the feature scale. Thus, for example, NMF with the Euclidean error function gives strong preference to high-energy features, while NMF with the IS error function is agnostic to feature scale.

## Convergence rate of gradient descent

These are notes from a talk I presented at the seminar on June 22nd. All this material is drawn from Chapter 7 of Bishop’s Neural Networks for Pattern Recognition, 1995.

In these notes we study the rate of convergence of gradient descent in the neighbourhood of a local minimum. The eigenvalues of the Hessian at the local minimum determine the maximum learning rate and the rate of convergence along the axes corresponding to the orthonormal eigenvectors.

See the eigendecomposition of real, symmetric matrices for the linear algebra preliminaries.

## Skipgram isn't Matrix Factorisation

The paper Neural Word Embeddings as Implicit Matrix Factorization of Levy and Goldberg was published in the proceedings of NIPS 2014 (pdf).  It claims to demonstrate that Mikolov’s Skipgram model with negative sampling is implicitly factorising the matrix of pointwise mutual information (PMI) of the word/context pairs, shifted by a global constant.  Although the paper is interesting and worth reading, it greatly overstates what is actually established, which can be summarised as follows:

Suppose that the dimension of the Skipgram word embedding is at least as large as the vocabulary.  Then if the matrices of parameters minimise the Skipgram objective, and the rows of or the columns of are linearly independent, then the matrix product is the PMI matrix shifted by a global constant.

This is a really nice result, but it certainly doesn’t show that Skipgram is performing (even implicitly) matrix factorisation.  Rather it shows that the two learning tasks have the same global optimum  – and even this is only shown when the dimension is larger than the vocabulary, which is precisely the case where Skipgram is uninteresting.

### The linear independence assumption

The authors (perhaps unknowingly) implicitly assume that the word vectors on one of the two layers of the Skipgram model are linearly independent.  This is a stronger assumption than what the authors explicitly assume, which is that the dimension of the hidden layer is at least as large as the vocabulary.  It is also not a very natural assumption, since Skipgram is interesting to us precisely because it captures word analogies in word vector arithmetic, which are linear dependencies between the word vectors!  This is not a deal breaker, however, since these linear dependencies are only ever approximate.

In order to see where the assumption arises, first recall some notation of the paper:

The authors consider the case where the negative samples for Skipgram are drawn from the uniform distribution over the contexts , and write

for the log likelihood.  The log likelihood is then rewritten as another double summation, in which each summand (as a function of the model parameters) depends only upon the dot product of one word vector with one context vector:

The authors then suppose that the values of the parameters are such that Skipgram is at equilibrium, i.e. that the partial derivatives of with respect to each word- and content-vector component vanish.  They then assume that this implies that the partial derivatives of with respect to the dot products vanish also.  To see that this doesn’t necessarily follow, apply the chain rule to the partial derivatives:

This yields systems of linear equations relating the partial derivatives with respect to the word- and content- vector components (which are zero by supposition) to the partial derivatives with respect to the dot products, which we want to show are zero.  But this only follows if one of the two systems of linear equations has a unique solution, which is precisely when its matrix of coefficients (which are just word- or context- vector components) has linearly independent rows or columns.  So either the family of word vectors or the family of context vectors must be linearly independent in order for the authors to proceed to their conclusion.

Word vectors that are of dimension the size of the vocabulary and linearly independent sound to me more akin to a one-hot or bag of words representations than to Skipgram word vectors.

### Skipgram isn’t Matrix Factorisation (yet)

If Skipgram is matrix factorisation, then it isn’t shown in this paper.  What has been shown is that the optima of the two methods coincide when the dimension is larger that the size of the vocabulary. Unfortunately, this tells us nothing about the lower dimensional case where Skipgram is actually interesting.  In the lower dimensional case, the argument of the authors can’t be applied, since it is then impossible for the word- or context- vectors to be linearly independent.  It is only in the lower dimensional case that the Skipgram and Matrix Factorisation are forced to compress the word co-occurrence information and thereby learn anything at all.  This compression is necessarily lossy (since there are insufficient parameters) and there is nothing in the paper to suggest that the two methods will retain the same information (which is what it means to say that the two methods are the same).

### Appendix: Comparing the objectives

To compare Skipgram with negative sampling to MF, we might compare the two objective functions.  Skipgram maximises the log likelihood (above). MF, on the other hand, typically minimises the squared error between the matrix and its reconstruction:

The partial derivatives of , needed for a gradient update, are easy to compute:

Compare these with the partial derivatives of the Skipgram log-likehood , which can be computed as follows:

## Softmax parameterisation and optimisation

The softmax function provides a convenient parameterisation of the probability distributions over a fixed number of outcomes. Using the softmax, such probability distributions can be learned parametrically using gradient methods to minimise the cross-entropy (or equivalently, the Kullback-Leibler divergence) to observed distributions.  This is equivalent to maximum likelihood learning when the distributions to be learned are one-hot (i.e. we are learning for a classification task). In the notes below, the softmax parameterisation and the gradient updates with respect to the cross entropy are derived explicitly.

This material spells out section 4 of the paper of Bridle referenced below, where the softmax was first proposed as an activation function for a neural network. It was in this paper that softmax was named, moreover. The name contrasts the outputs of the function with those of the “winner-takes-all” function, whose outputs are one-hot distributions.

### References

Bridle, J.S. (1990a). Probabilistic Interpretation of Feedforward Classification Network Outputs, with Relationships to Statistical Pattern Recognition. In: F.Fogleman Soulie and J.Herault (eds.), Neurocomputing: Algorithms, Architectures and Applications, Berlin: Springer-Verlag, pp. 227-236.

## Improving Pairwise Learning for Item Recommendation from Implicit Feedback 2014

Steffen Rendle and Christoph Freudenthaler (University of Konstanz), WSDM 2014.
The authors present a modification of the Bayesian Pairwise Ranking (BPR) for implicit feedback (i.e. one class) recommendation datasets in which the negative samples are drawn according both to the models current belief and the user/context in question (“adaptive oversampling“). They show that the prediction accuracy of BPR models trained with adaptive oversampling matches that of BPR models trained with uniform sampling but that convergence is 10x-20x faster.
Consider the problem of recommending items to users (or more generally: contexts, e.g. user on a particular page).  The observed data consists then of context-item pairs , where item was the choice made in context .  The authors work in the context of pairwise learning, which amounts to a binary classification task where context-item-item triples are labelled as true i.f.f. item was chosen in context but item is not:
where is a scoring function (e.g. the dot product of the context- and the item- latent vectors, for matrix factorisation).  It is infeasible to consider all the negative examples , so how should we choose which to consider?

### Negative sampling from static distributions

We could draw negative examples from the uniform distribution over all items, or instead from the observed distribution over all items (i.e. by popularity).  Both are inexpensive to perform and easy to implement.  However:
• Uniform sampling tends to yield uninformative samples, i.e. those for which the probability of being incorrectly labelled is very likely already low: popular items are precisely those that appear often as positive examples (and hence tend to be highly ranked by the model), while a uniformly-drawn item is likely to be from the tail of the popularity distribution (so likely lowly ranked by the model).
• Sampling according to popularity is demonstrated by the authors to converge to inferior solutions.
The authors point out that these sampling schemes depend neither upon the current context (user) nor the current belief of the model.  This contrasts with their own method, adaptive over-sampling.

### Adaptive over-sampling

The authors propose a scheme in which the negative samples chosen are those that the model would be likely to recommend to the user in question, according to its current state.  In this sense it is reminiscent of the Gibbs sampling used by restricted Boltzmann machines.
Choosing negative samples dependent on the current model and user is computationally expensive if performed in the naive manner. The authors speed this up by working with the latent factors individually, and only periodically re-computing the ranking of the items according to each latent factor.  Specifically, in the case of matrix factorisation, when looking for negative samples for a context , a negative sample is chosen by:
1. sampling a latent factor according to the absolute values of the latent vector associated to (normalised, so it looks like a probability distribution);
2. sampling an item that ranks highly for .  More precisely, sample a rank from a geometric distribution over possible ranks, then find the item that has rank when the th coordinates of the item latent vectors are compared.

(We have ignored the sign of the latent factor.  If the sign is negative, one choses the rth-to-last ranked item).  The ranking of the items according to each latent factor is precomputed periodically with a period such that the extra overhead is independent of the number of items (i.e. is a constant).

### Problems with the approach

The samples yielded by the adaptive oversampling approach depend heavily upon the choice of basis for the latent space.  It is easy to construct examples of where things go wrong:

Non-negativity constraints would solve this.  Regularisation would also deal with this particular example – however regularisation would complicate the expression of the scoring function as a mixture (since you need to divide though by .

Despite these problems, the authors demonstrate impressive performance, as we’ll see below.

### Performance

The authors demonstrate that their method does converge to solutions slightly better than those given by uniform sampling, but twenty times faster.  It is also interesting to note that uniform sampling is vastly superior to popularity based sampling, as shown in the diagrams below.
Note that a single epoch of the adaptive oversampling takes approximately 33% longer than a single epoch of uniform sampling BPR.

### Implementation

According to the paper, the method is implemented in libFM, a C++ software package that Rendle has published.  However, while I haven’t looked exhaustively, I can’t see anything in that package about the adaptive oversampling (nor in Rendle’s other package, MyMediaLite).

### What about adaptive oversampling in word2vec?

Word2vec with negative sampling learns a word embedding via binary classification task, specifically, “does the word appear in the same window as the context, or not?”.  As in the case of implicit feedback recommendation, the observed data for word2vec is one-class (just a sequence of words).  Interestingly, word2vec samples from a modification of the observed word frequency distribution (i.e. of the “distribution according to popularity”) in which the frequencies are raised to the th power and renormalised.  The exponent was chosen empirically.  This raises two questions:

1. Would word2vec perform better with adaptive oversampling?
2. How does BPR perform when sampling from a similarly-modified item popularity distribution (i.e. raising to some exponent)?

### Corrections

Corrections and comments are most welcome. Please get in touch either via the comments below or via email.

## WARP loss for implicit-feedback recommendation

We consider the “Weighted Approximate-Rank Pairwise-” (WARP-) loss, as introduced in the WSABIE paper of Weston et. al (2011, see references), in the context of making recommendations using implicit feedback data, where it has been shown several times to perform excellently.  For the sake of discussion, consider the problem of recommending items to users , where a scoring function gives the score of item for user , and the item with the highest score is recommended.

WARP considers each observed user-item interaction in turn, choses another “negative” item that the model believed was more appropriate to the user, and performs gradient updates to the model parameters associated to , and such that the models beliefs are corrected.  WARP weights the gradient updates using (a function of) the estimated rank of item for user .  Thus the updates are amplified if the model did not believe that the interaction could ever occur, and are dampened if, on the other hand, if the interaction is not surprising to the model. Conveniently, the rank of for can be estimated by counting the number of sample items that had to be considered before one was found that the model (erroneously) believed more appropriate for user .

### Minimising the rank?

Ideally we would like to make updates to the model parameters that minimised the rank of item for user .  Previous work of Usunier (one of the authors) showed that the precision at k was best optimised when the logarithm of the rank was minimised.  (to read!)

The problem with the rank is that, while it does depend on the model parameters, this dependence is not continuous (the rank being integer valued!).  So it is not possible to speak of gradients.  So what is to be done instead?  The approach of the authors is to derive a differentiable approximation to the logarithm of the rank, and to minimise this instead.

### Derivation: approximating the (log of the) rank

WARP has been shown several times to perform very well for implicit feedback recommendation.  However, the derivation of the approximation of the log of the rank used in WARP, as outlined in the WSABIE paper, is nonsense.  I can only think that the authors arrived at WARP in another way.  Let’s look at it more closely.  In the following:

• is the score assigned by the model to item for user .
• is some function that defines the error as a function of the rank.  In the WSABIE paper, is approximately the natural logarithm (for the derivation below, however, it doesn’t matter what is)

The most obvious problem with the derivation is the approximation marked with an asterix (*).  At this step, the authors approximate the indication function by .  While the latter is familiar as the hinge loss from SVMs, it is (begin unbounded!) a dreadful approximation for the indicator .  It seems to me that the sigmoid of the difference of the scores would be a much better differentiable approximation to the indicator function.

To appreciate why the derivation is nonsense, however, you have to notice that the it has nothing to do with .  That is, the derivation above would yield an approximation for , whatever happened to be, even a constant function.

### Optimisation

WARP considers each observed interaction in turn, repeatedly sampling items from the uniform distribution over all items until it finds one in , i.e. until it finds an item whose score for the user is at worst 1 less than the score of the observed interaction.  For this triple , it performs gradient updates to minimise:

The naive approach to computing is to calculate all the scores for the given user in order to determine the rank of the item .  WARP performs a nice trick to do much better: it estimates by counting how many candidate negative items it had to consider before finding one in .  This yields

However it is still the case that is not differentiable.  So when we compute the gradients, this quantity has to be treated as a constant. Thus it simply becomes a weighting applied to the gradient of the difference of the scores (hence the name WARP, I guess).

### WARP optimises for item to user recommendations

With its negative sampling technique, WARP optimises for recommending items to a user.  For instance, the problem of recommending users to items (so, transposing the interaction matrix) is not trained for.  I wonder if some extra uplift could be obtained by training for both problems simultaneously.

### Normalising for the total number of items

With the optimisation stated as above, the learning rate will need to be re-tuned for datasets that have different numbers of items, since the gradient weighting is ranges from to .  It would make more sense to weigh the gradient updates by:

which ranges between 0 and 1.

### Implementations

There are two implementations of WARP for recommendation that I know of, both in Python:

• LightFM, written by Maciej Kula.  Works well.  Also implements BPR with uniform sampling and WARP k-OS (which I’ve not investigated yet).
• MREC, written by Levy and Jack at Mendeley, has a matrix factorisation recommender trained using WARP.  I’ve not tried this one out yet.

### References

Jason Weston, Samy Bengio and Nicolas Usunier, Wsabie: Scaling Up To Large Vocabulary Image Annotation, 2011, (PDF).