Minsky & Papert’s “Perceptrons”

In their book “Perceptrons” (1969), Minsky and Papert demonstrate that a simplified version of Rosenblatt’s perceptron cannot perform certain natural binary classification tasks unless it uses an unmanageably large number of input predicates. It is easy to show that with sufficiently many input predicates, a perceptron (even of this type) can perform any classification with perfect accuracy (see page 3 of the notes below). The contribution of Minsky and Papert is to show that meaningful restrictions on the type of input predicates hamper the expressive ability of the perceptron to such a degree that it is unable to e.g. distinguish connected from disconnected figures, or classify according to whether the number of active pixels is odd or even. The former has a simple picture proof, whereas the crucial ingredient for the latter is the action of a permutation group on the retina (i.e. the input array) of the perceptron.

Talk

Here are my notes from a recent talk I gave on the book at the Berlin machine learning seminar:

Style

The book is an engaging and instructive read – not only is it peppered with the authors’ opinions and ideas, but it also includes enlightening comments on how the presented ideas originated, and why other ideas that occurred to the authors didn’t work out. The book still bears the marks of its making, so to speak.

Controversy

The publication of the first edition in 1969 is popularly credited with bringing research on perceptrons, and connectionism in general, to a grinding halt. Moreover, the book is widely held to be unjust: the “perceptrons” that Minsky and Papert prove to be so limited in expressive power were in fact only a very simplified version of what practitioners then regarded as a perceptron. A typical perceptron (unlike those of Minsky and Papert) might include more layers, feedback loops, or even be coupled with another perceptron. All these variations are described in Rosenblatt’s book “Principles of Neurodynamics” (1962). This is put very well (and colourfully!) by Block in his review of the book (1970):

Thus, although the authors state (p. 4, lines 12-14) “we have agreed to use the name ‘perceptron’ in recognition of the pioneer work of Frank Rosenblatt.”, they study a severely limited class of machines from a viewpoint quite alien to Rosenblatt’s. As a result, the title of the book, although generous in intent, is seriously misleading to the naive reader who wants to find out something about the general class of Perceptrons.

In summary then, Minsky and Papert use the word perceptron to denote a restricted subset of the general class of Perceptrons. They show that these simple machines are limited in their capabilities. This approach is reminiscent of the möhel who throws the baby into the furnace, hands the father the foreskin and says, “Here it is; but it will never amount to much.”

Despite these serious criticisms, it should be noted that Block (himself a trained mathematician) was full of praise for the “mathematical virtuosity” exhibited by Minsky and Papert in their book.

Whether the book alone stopped research into perceptrons is hard to judge, particularly since its impact is confounded by the tragic death of Rosenblatt (at age 41) only two years later.

Re-parameterising for non-negativity yields multiplicative updates

Suppose you have a model that depends on real-valued parameters, and that you would like to constrain these parameters to be non-negative. For simplicity, suppose the model has a single parameter a \in \mathbb R. Let E denote the error function. To constrain a to be non-negative, parameterise a as the square of a real-valued parameter \alpha \in \mathbb R:

    \[a = \alpha^2, \quad \alpha \in \mathbb R.\]

We can now minimise E by choosing \alpha without constraints, e.g. by using gradient descent. Let \lambda > 0 be the learning rate. We have

    \begin{eqnarray*} \alpha^{\text{new}} &=& \alpha - \lambda \frac{\partial E}{\partial \alpha} \\ &=& \alpha - \lambda \textstyle{\frac{\partial E}{\partial a} \frac{\partial a}{\partial \alpha}} \\ &=& \alpha - \lambda 2 \alpha \textstyle \frac{\partial E}{\partial a} \\ &=& \alpha \cdot (1 - 2 \lambda \textstyle \frac{\partial E}{\partial a}) \end{eqnarray*}

by the chain rule. Thus

    \begin{eqnarray*} a^{\text{new}} &=& (\alpha^{\text{new}})^2 \\ &=& \alpha^2 (1 - 2 \lambda \textstyle\frac{\partial E}{\partial a})^2 \\ &=& a \cdot (1 - 2 \lambda \textstyle\frac{\partial E}{\partial a})^2. \end{eqnarray*}

Thus we’ve obtained a multiplicative update rule for a that is expressed in terms of a alone. In particular, we don’t need \alpha anymore!
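
As a quick numerical check, here is a minimal sketch in Python (my own toy example, assuming the error function E(a) = (a - 2)^2): it runs gradient descent on \alpha and the multiplicative update on a side by side, and the two trajectories coincide while a remains non-negative throughout.

    # Toy error function E(a) = (a - 2)^2, so dE/da = 2 * (a - 2).
    def dE_da(a):
        return 2.0 * (a - 2.0)

    lr = 0.01        # learning rate (lambda)
    alpha = 0.5      # unconstrained parameter
    a = alpha ** 2   # non-negative model parameter

    for _ in range(100):
        # gradient descent on alpha (the re-parameterised problem) ...
        alpha = alpha - lr * 2 * alpha * dE_da(alpha ** 2)
        # ... and the equivalent multiplicative update directly on a
        a = a * (1 - 2 * lr * dE_da(a)) ** 2

    print(alpha ** 2, a)  # the two agree, and a has stayed non-negative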

Factorisation of stochastic matrices

Here we derive update rules for the approximation of a row stochastic matrix by the product of two lower-rank row stochastic matrices using gradient descent. Such a factorisation corresponds to a decomposition

    \[p(n|m) = \sum_k p(n|k) \cdot p(k|m)\]

Both the sum of squares and row-wise cross-entropy error functions are considered.
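
The notes themselves aren’t reproduced here, but the following Python sketch illustrates the task (not necessarily the update rules derived in the notes): it keeps both factors row stochastic by parametrising each row with a softmax, and minimises the sum of squares error by plain gradient descent.

    import numpy as np

    rng = np.random.default_rng(0)

    def softmax_rows(Z):
        Z = Z - Z.max(axis=1, keepdims=True)
        E = np.exp(Z)
        return E / E.sum(axis=1, keepdims=True)

    def softmax_backward(S, dS):
        # back-propagate through a row-wise softmax: du = s * (ds - <s, ds>)
        return S * (dS - (S * dS).sum(axis=1, keepdims=True))

    M, N, K = 8, 6, 3
    P = softmax_rows(rng.normal(size=(M, N)))   # target row stochastic matrix

    U = rng.normal(size=(M, K))   # unconstrained parameters for Q = p(k|m)
    V = rng.normal(size=(K, N))   # unconstrained parameters for R = p(n|k)
    lr = 0.5

    for step in range(2000):
        Q, R = softmax_rows(U), softmax_rows(V)
        residual = Q @ R - P                    # QR - P
        dQ = 2 * residual @ R.T                 # dE/dQ for E = ||P - QR||^2
        dR = 2 * Q.T @ residual                 # dE/dR
        U -= lr * softmax_backward(Q, dQ)
        V -= lr * softmax_backward(R, dR)

    print(np.abs(softmax_rows(U) @ softmax_rows(V) - P).max())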

Heider and Simmel misinterpreted

I learnt of the 1944 experiment of Heider and Simmel at the Machine Intelligence workshop at NIPS 2016. The experiment involved showing subjects the video below, and asking them to describe what they saw. If you’ve watched the video, you’ll not be surprised to learn that most of the subjects anthropomorphised the geometric objects (i.e. they described them as humans, or described their actions and intentions in human terms). In the talk at the workshop, this was offered as evidence that humans have a tendency to anthropomorphise, the implication then being that it is not hard to trick users into believing that e.g. a chat bot is a real person. While the implication might indeed be true, I don’t think the experiment of Heider and Simmel shows this at all. In my view, the reason that viewers anthropomorphise the shapes in the video is that they know the video was made by human beings, and as such was created with intention. When you watch this video, you are participating in an act of communication, and the natural question to ask is: “what are the creators trying to communicate to me?”. For contrast, imagine that you are sitting in the bath and you have a few triangular shapes that float. If you float the shapes on the water and watch them for a while as they randomly (without expression of intent!) bob about, are you likely to anthropomorphise them? I’d say you are much less likely to do so.

Wine dataset demonstrates importance of feature scaling

The UCI Machine Learning Repository hosts a dataset of wine measurements that is fantastic for demonstrating the importance of feature scaling in unsupervised learning. There are a bunch of real-valued measurements (of e.g. chemical composition) for three varieties of wine: Barolo, Grignolino and Barbera. I am using this dataset to introduce feature scaling in my course on DataCamp.

The wine measurements have very different scales, and performing a k-means clustering (with 3 clusters) on the unscaled measurements yields clusters that don’t correspond to the varieties:

varieties  Barbera  Barolo  Grignolino
labels                                
0               29      13          20
1                0      46           1
2               19       0          50

However, if the measurements are first standardised, the clusters correspond almost perfectly to the three wine varieties:

varieties  Barbera  Barolo  Grignolino
labels                                
0                0      59           3
1               48       0           3
2                0       0          65

Of course, k-means clustering is not meant to be a classifier! But when the clusters correspond this well to the classes, it is clear that the scaling has done its job.
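
Here is a minimal sketch of this comparison in Python, using the copy of the UCI wine dataset that ships with scikit-learn (there the three classes are labelled 0, 1 and 2 rather than by variety name), with pandas crosstabs of the cluster labels against the classes:

    import pandas as pd
    from sklearn.datasets import load_wine
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    wine = load_wine()
    X, classes = wine.data, wine.target

    # unscaled measurements: the clusters cut across the classes
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    print(pd.crosstab(labels, classes))

    # standardised measurements: the clusters line up with the classes
    X_scaled = StandardScaler().fit_transform(X)
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
    print(pd.crosstab(labels, classes))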

Which wine varieties?

I had to search to find the names of the wine varieties. According to page 9 of “Chemometrics with R” (Ron Wehrens), the three varieties are: Barolo (58 samples), Grignolino (71 samples) and Barbera (48 samples). I was unable to follow Wehrens’ citation (it’s his [6]) on Google books.

Original source

According to the UCI page, this dataset originated from the paper “V-PARVUS. An Extendible Pachage of programs for esplorative data analysis, classification and regression analysis” by Forina, M., Lanteri, S., Armanino, C., Casolino, C., Casale, M. and Oliveri, P.

Olive oil dataset

There is a dataset of olive oil measurements associated with the same paper. I haven’t tried using it, but am sure I’ll use it in an example one day.

Eurovision song contest dataset

I’m teaching hierarchical clustering in my course on DataCamp, and I needed an interesting dataset. Importantly, I wanted the dataset to have labelled instances, so that the dendrogram would be easily interpretable, but not too many instances, so that they would all fit on the dendrogram. Fortunately for me, the Eurovision song contest has been publishing its voting results (which is great!) and these are perfect. Both the voting results from the judges and those from the public give great results. The only thing you need to adjust for is that countries are not allowed to vote for themselves in Eurovision, which gives you some missing values in the data. I filled these with the maximum score of 12, since it is reasonable to assume that countries would vote selfishly if they were allowed to. Below is the dendrogram of a hierarchical (agglomerative) clustering using complete linkage.
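
Here is a minimal sketch of how such a dendrogram can be produced in Python. The file name and layout of the voting data are assumptions: a DataFrame with one row per country and one column per voting country, holding the scores awarded.

    import pandas as pd
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    # hypothetical file: rows indexed by country, columns = voting countries
    votes = pd.read_csv('eurovision_votes.csv', index_col=0)

    # countries cannot vote for themselves; fill the missing values with the
    # maximum score of 12, assuming countries would vote selfishly
    votes = votes.fillna(12)

    Z = linkage(votes.values, method='complete')   # complete linkage
    dendrogram(Z, labels=votes.index.to_list(), leaf_rotation=90)
    plt.tight_layout()
    plt.show()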

A better version

It occurs to me now that I should have normalised the rows after filling in the missing values. This does indeed improve the hierarchical clustering further.

An LCD digit dataset for illustrating the “parts-based” representation of NMF

Non-negative matrix factorisation (NMF) learns to reconstruct samples as a superposition of their constituent parts. In the paper of Lee and Seung (1999) that popularised NMF, this is called a “parts-based” representation. It is illustrated in that paper by applying NMF to encodings of images of faces, where NMF seems to decompose the faces into a collage of eigen-eyebrows and eigen-noses. Visual demonstrations are fantastic for conveying ideas, but in this particular instance the clarity is compromised by the inherent noisiness of real-world facial images. Moreover, the images are drawn from the CBCL dataset, which has a non-commercial licence. To get around these problems, and to have an even clearer visual demonstration of the “parts-based” decomposition provided by NMF for my course at DataCamp, I created a synthetic image dataset, where each image is of a single digit of an LCD display, as on a clock radio. The parts learned by NMF are then the individual “cells” of the LCD display.

You can construct this dataset yourself, using the code below. The collection of images is encoded as a 2d array of non-negative values. Each row corresponds to an image, and each column corresponds to a pixel. The non-negative entries represent the whiteness of the pixel, encoded here as a value between 0 and 1.
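
The original code isn’t reproduced here, but the following sketch shows one way such a dataset can be built: each digit 0-9 is rendered from the seven segments of an LCD cell on a small pixel grid (the 13 x 8 grid size is an arbitrary choice), and then flattened into one row of the 2d array.

    import numpy as np

    HEIGHT, WIDTH = 13, 8   # grid size (an arbitrary choice)

    # which of the seven segments (a-g) are lit for each digit
    SEGMENTS = {
        0: 'abcdef', 1: 'bc', 2: 'abdeg', 3: 'abcdg', 4: 'bcfg',
        5: 'acdfg', 6: 'acdefg', 7: 'abc', 8: 'abcdefg', 9: 'abcdfg',
    }

    def render(digit):
        img = np.zeros((HEIGHT, WIDTH))
        on = SEGMENTS[digit]
        if 'a' in on: img[0, 1:-1] = 1       # top bar
        if 'g' in on: img[6, 1:-1] = 1       # middle bar
        if 'd' in on: img[12, 1:-1] = 1      # bottom bar
        if 'f' in on: img[1:6, 0] = 1        # top-left cell
        if 'b' in on: img[1:6, -1] = 1       # top-right cell
        if 'e' in on: img[7:12, 0] = 1       # bottom-left cell
        if 'c' in on: img[7:12, -1] = 1      # bottom-right cell
        return img.flatten()

    samples = np.array([render(d) for d in range(10)])   # shape (10, 104)

Fitting NMF with 7 components to this array should then recover the individual cells of the display as the learned parts.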

Alternatives

  • The standard bars problem provides a similar (but more obviously synthetic) image dataset for learning the parts of images. See, for example, the references given in Spratling (1996).
  • Another great visual dataset could be built from black-and-white images of the 52 playing cards in a deck. NMF would then learn the ranks (i.e. ace, 2, 3, …) and the suits (i.e. spades, hearts, …) as parts, and reconstruct playing cards from these. I haven’t done this.
  • Yet another great example dataset could be constructed using images of a piano keyboard, or perhaps just an octave range, colouring the keys according to how often they are pressed during a song. NMF should then be able to learn the chords as parts. The midi files to construct this dataset could be obtained from the Mutopia project, for example. I haven’t done this either.

Don’t interpret linear hidden units, they don’t exist.

Having trained a model, it is natural to want to understand how it works. An intuitively appealing approach is to consider data samples that maximise the activation of a hidden unit, and to take the common input features of these samples as an indication of what that unit has learned to recognise. However, as we’ll see below, it is a misconception to speak of hidden units if:

  • there is no non-linearity on the hidden layer;
  • the weights connecting the layers are unconstrained; and
  • the model is trained using (stochastic) gradient descent or similar.

In such a scenario, the hidden feature space must instead be considered as a whole.

Summary

Consider the task of factorising a matrix X as a product of matrices X \cong A^T B with some fixed inner dimension k. The model parameters are pairs of matrices (A, B) with the appropriate dimensions, and the image of an input vector x on the hidden layer is given by Ax. To consider this vector Ax in terms of hidden unit activations is to fix a co-ordinate system in the hidden feature space, and to measure the displacement of the vector along each co-ordinate axis. If E_1, \dots , E_k denote the unit vectors corresponding to the chosen co-ordinate system, then the displacements are given by the inner products

    \begin{equation*} \langle Ax, E_i \rangle, \quad i = 1, \dots, k. \end{equation*}

We show below that if P is any rotation of the hidden feature space, then the model parameters (PA, PB) are just as likely as (A, B) to result in the factorisation of a fixed matrix X, and that which of these occurs depends only on the random initialisation of gradient descent. Thus the hidden unit activations might just as likely have been given by

    \begin{equation*} \langle (PA)x, E_i \rangle, \quad i = 1, \dots, k. \end{equation*}

The hidden unit activations given by these two expressions can be very different indeed. In fact, since P is an orthogonal transformation, we have

    \[ \langle (PA)x, E_i \rangle = \langle Ax, P^T E_i \rangle, \quad i = 1, \dots, k \]

(see e.g. here). Thus the indeterminacy of the model parameters, i.e. (A, B) vs. (PA, PB), might equivalently be thought of as an indeterminacy in the orientation of the co-ordinate system, i.e. the E_i vs. the P^T E_i. The choice of orientation of co-ordinate basis is completely arbitrary, so speaking of hidden unit activations makes no sense at all.

The above holds more generally for P an orthogonal transformation of the hidden feature space, i.e. for P a composition of rotations and reflections.

Szegedy et al.

None of the above is new. For example, it was stated by Szegedy et al. in an empirical study of the interpretability of hidden units. We are demonstrating, step-by-step, a statement of theirs (which was about word2vec):

… word representations, where the various directions in the vector space representing the words are shown to give rise to a surprisingly rich semantic encoding of relations and analogies. At the same time, the vector representations are stable up to a rotation of the space, so the individual units of the vector representations are unlikely to contain semantic information.

Matrix factorisation and unit activation

Given a matrix X and an inner dimension k, the task of matrix factorisation is to learn two matrices A and B whose product approximates X:

    \[ X \cong A^T B. \]

The parameter space consists of the entries of the matrices A and B. The hidden feature space, on the other hand, is the k-dimensional space containing the columns of A and B.

Error function

To train a matrix factorisation model using gradient descent, the model parameters are repeatedly updated using the gradient vector of the error function. An example error function E could be

    \[ E_X (A, B) = \sum_{i, j} {(X_{i, j} - (A^T B)_{i,j})^2}. \]

Notice that this choice of error function doesn’t depend directly on the pair of matrices (A, B), but rather only on their product A^T B, i.e. only on the approximation A^T B of X. This is true of any error function E_X, because error functions depend only on inputs and outputs.

Orthogonal transformations of the hidden feature space

Recall that orthogonal transformations of a space are just compositions of rotations and reflections about hyperplanes passing through the origin. Considered as matrices, orthogonal transformations are defined by the property that their product with their transpose gives the identity matrix. Using this property, it can be seen that an orthogonal transformation of the hidden feature space defines an orthogonal transformation of the parameter space by acting simultaneously on the column vectors of the matrices. If O_k and O_{(m+n)k} denote the groups of orthogonal transformations on the hidden feature space and the parameter space, respectively, then this defines an inclusion

    \[ O_k \hookrightarrow O_{(m+n)k}, \qquad P \mapsto \mathrm{diag}(P, \dots, P), \]

where the block-diagonal matrix on the right consists of m + n copies of P, one for each of the column vectors of A and B.

Contour lines of the gradient

The effect of this block-diagonal orthogonal transformation on the parameter space corresponds to multiplying the matrices A and B on the left by the orthogonal transformation P of the feature space, i.e. it effects (A, B) \mapsto (PA, PB). Notice that (A, B) and (PA, PB) yield the same approximation to the original matrix X, since:

    \[ (PA)^T (PB) = (A^T P^T) (PB) = A^T (P^T P) B = A^T B. \]

Thus E_X (A, B) = E_X (PA, PB), so the orthogonal transformations P of the hidden feature space trace out contour lines of E_X in the parameter space. Now the gradient vector is always perpendicular to the contour line, so the sequence of points in the parameter space visited during gradient descent preserves the orientation of the hidden feature space set at initialisation (see here, for example). So if gradient descent of E_X starting at the initial parameters (A^{(0)}, B^{(0)}) converges to the parameters (A, B), and you’d prefer that it instead converged to (PA, PB), then all you need to do is start the gradient descent over again, but this time with the initial parameters (PA^{(0)}, PB^{(0)}). We thus see that the matrices (A, B) that our matrix factorisation model has learned are only determined up to an orthogonal transformation of the hidden feature space, i.e. up to a simultaneous transformation of their columns.
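
Here is a quick numerical check of this invariance (a toy sketch using the sum of squares error above): a random orthogonal P, obtained from the QR decomposition of a random matrix, leaves both the approximation A^T B and the error unchanged.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n, k = 5, 7, 3

    X = rng.normal(size=(m, n))
    A = rng.normal(size=(k, m))   # columns of A live in the hidden feature space
    B = rng.normal(size=(k, n))

    # random orthogonal transformation of the hidden feature space
    P, _ = np.linalg.qr(rng.normal(size=(k, k)))

    def error(A, B):
        return np.sum((X - A.T @ B) ** 2)

    print(np.allclose((P @ A).T @ (P @ B), A.T @ B))      # same approximation
    print(np.isclose(error(A, B), error(P @ A, P @ B)))   # same error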

Gradient descent methods

The above statements continue to hold in the case of stochastic gradient descent, where the error function E_X is not fixed but rather defined by varying mini-tasks (an instance being e.g. word2vec). Such error functions still depend only on inputs and outputs, not on the co-ordinates chosen for the hidden layer, so as above their gradient vectors are perpendicular to the contour lines traced out by the orthogonal transformations of the hidden feature space. Thus the updates performed in stochastic gradient descent also preserve the original orientation of the feature space.

Initialisation

How likely is it that the transformed initial parameters would themselves have been drawn as initial parameters? In order to conclude that the orientation of the co-ordinate system on the hidden layer is completely arbitrary, we need this to be precisely as likely as drawing the original initial parameters. Thus if \pi denotes the probability distribution on the parameter space from which the initial parameters are drawn, we require

    \[ \pi ( (P A^{(0)}, P B^{(0)}) ) =  \pi ( (A^{(0)}, B^{(0)}) ), \]

for any initial parameters ( A^{(0)}, B^{(0)} ) and any orthogonal transformation P of the hidden feature space.

This is not the case with word2vec, where each parameter is drawn independently from a uniform distribution; such a product of uniform distributions is not invariant under rotations of the feature space. However, it remains true that for any choice of initial parameters there are many possible orientations of the co-ordinate system that are equally likely; it is just that some choices of initial parameters leave less freedom than others.

Appendix: What about GloVe?

GloVe performs weighted matrix factorisation with bias terms, so the above should apply. The weighting is just a modified error function, and the bias terms are not hidden features and so are left unmodified by the orthogonal transformations of the hidden feature space. Like word2vec, GloVe initialises each parameter with independent samples from a uniform distribution, so there are no new problems there. The real problem with applying the above analysis to GloVe is that the implementation of Adagrad used makes the learning regime dependent on the choice of basis of the hidden feature space (see e.g. here). This doesn’t mean that the hidden unit activations of GloVe make sense; it just means that GloVe is less amenable to theoretical arguments like those above and needs to be considered empirically, e.g. in the manner of Szegedy et al.

Descartes’ headache: rotation of space is transpose to rotation of co-ordinate system

[I often speak of such-and-such depending upon the choice of co-ordinate system, and proceed to show this by rotating the space. The following explains why this is equivalent (via the transpose)].

You awaken in a two-dimensional landscape. In front of you are co-ordinate axes, fixed to the ground on a pivot. The landscape is featureless except for a tree and a well. You get bored, and fall asleep again. When you awaken, you notice that the relative arrangement of the co-ordinate axes to the tree and the well is different. What happened?

You realise that you’ll never truly know what happened while you were asleep. Either (a) someone rotated the co-ordinate axes 90 degrees clockwise, or (b) someone rotated the world 90 degrees anti-clockwise about the pivot point of the axes.

Rotation of the space is equivalent (via the transpose) to rotation of the co-ordinate system. In fact, this is true more generally for orthogonal transformations, and it has a familiar mathematical expression. Let V be an n-dimensional vector space with inner product \langle \cdot , \cdot \rangle and let e_i, i = 1, \dots, n be an orthonormal basis. If X is an orthogonal transformation of V, then for any point v \in V, we have

    \[ \langle Xv, e_i \rangle = \langle v, X^T e_i \rangle \]

for all i = 1, \dots, n. That is, the co-ordinates of the transformed point with respect to the original basis are the co-ordinates of the original point with respect to the transpose-transformed basis.
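
A quick numerical check of this identity in two dimensions, taking the orthogonal transformation X to be a 90 degree rotation:

    import numpy as np

    theta = np.pi / 2
    X = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])

    v = np.array([2.0, 1.0])   # e.g. the position of the tree
    e = np.eye(2)              # the standard orthonormal basis

    for i in range(2):
        # co-ordinates of the transformed point w.r.t. the original basis,
        # and of the original point w.r.t. the transpose-transformed basis
        print(np.dot(X @ v, e[i]), np.dot(v, X.T @ e[i]))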

Adagrad depends on the choice of co-ordinate system

Adagrad is a learning regime that maintains separate learning rates for each individual model parameter. It is used, for instance, in GloVe (perhaps incorrectly), in LightFM, and in many other places besides. Below is an example showing that Adagrad models evolve in a manner that depends upon the choice of co-ordinate system (i.e. orthonormal basis) for the parameter space. This dependency is no problem when the parameter space consists of many one-dimensional, conceptually unrelated features lined up beside one another, because such parameter spaces have only one natural orientation of the co-ordinate axes. It is a problem, however, for the use of Adagrad in models like GloVe and LightFM. In these cases the parameter space consists of many feature vectors (of dimension, say, 100) concatenated together. A learning regime should not depend upon the arbitrary choice of orthonormal basis in this 100-dimensional feature space.
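
Here is a toy sketch along these lines (my own construction): minimise a badly scaled quadratic with per-coordinate Adagrad, once in the original co-ordinates and once in rotated co-ordinates, and compare the results after rotating back. Plain gradient descent would produce exactly the same iterates in both cases; Adagrad does not.

    import numpy as np

    rng = np.random.default_rng(0)
    D = np.diag([10.0, 0.1])                      # badly scaled quadratic f(w) = 0.5 w^T D w
    R, _ = np.linalg.qr(rng.normal(size=(2, 2)))  # random orthogonal change of co-ordinates

    def adagrad(grad, w0, lr=0.5, eps=1e-8, steps=50):
        w, G = w0.copy(), np.zeros_like(w0)
        for _ in range(steps):
            g = grad(w)
            G += g ** 2                           # per-coordinate accumulator
            w -= lr * g / (np.sqrt(G) + eps)
        return w

    w0 = np.array([1.0, 1.0])

    # original co-ordinates: gradient of f is D w
    w_plain = adagrad(lambda w: D @ w, w0)

    # rotated co-ordinates v = R w, where f becomes 0.5 v^T (R D R^T) v
    v_rot = adagrad(lambda v: R @ D @ R.T @ v, R @ w0)

    print(w_plain, R.T @ v_rot)   # these differ: Adagrad is basis dependent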

For feature spaces like these, I would propose instead maintaining a separate learning rate for each feature vector (for example, in the case of GloVe, there would be one learning rate per word vector per layer). The learning rate of a feature vector would dampen the initial learning rate by the accumulation of the squared norms of the previous gradient updates of the feature vector. The evolution of Adagrad would then be independent of the choice of basis in the feature space (as distinct from the entire parameter space). In the case of GloVe this means that a simultaneous rotation of all the word vectors in both layers during training does not alter the resulting model. This proposal would have the further advantage of greatly reducing the number of learning rates that have to be stored in memory. I don’t know if this proposal would have regret minimisation properties analogous to Adagrad. I haven’t read the original paper of Duchi et al. (2011), and what I am proposing might be subsumed there by the full-rank case (thanks to Alexey Rodriguez for pointing this out). Perhaps a block diagonal matrix could be used instead of a diagonal one.

Update: Mnih and Kavukcuoglu seem to have adopted the same point of view in “Learning word embeddings efficiently with noise-contrastive estimation” (2013). Thanks to Matthias Leimeister for this.