Graph embeddings in Hyperbolic Space

I gave a talk last night at the Berlin machine learning meetup on learning graph embeddings in hyperbolic space, featuring the recent NIPS 2017 paper of Nickel & Kiela. Covered are:

  • An illustration of why the Euclidean plane is not a good place to embed trees (since circle circumference grows only linearly in the radius);
  • Extending this same argument to higher dimensional Euclidean space;
  • An introduction to the hyperbolic plane and the Poincaré disc model;
  • A discussion of Rik Sarkar’s result that trees embed with arbitrarily small error in the hyperbolic plane;
  • A demonstration that, in the hyperbolic plane, circle circumference is exponential in the radius (better written here);
  • A review of the results of Nickel & Kiela on the (transitive closure of the) WordNet hypernymy graph;
  • Some thoughts on the gradient optimisation (perhaps better written here).

And here are the slides!

Gradient optimisation on the Poincaré disc

Nickel & Kiela had a great paper on embedding graphs in hyperbolic space at NIPS 2017. They work with the Poincaré ball model of hyperbolic space. This is just the interior of the unit ball, equipped with an appropriate Riemannian metric. This metric is conformal, meaning that the inner product on the tangent spaces of the Poincaré ball differs from that of the (Euclidean) ambient space by only a scalar factor. This means that the hyperbolic gradient \nabla_u at a point u can be obtained from the Euclidean gradient \nabla^E_u at that same point just by rescaling. That is, you pretend for a moment that your objective function is defined in Euclidean space, calculate the gradient as usual, and just rescale. The scaling factor depends only on the Euclidean distance r of u from the origin, as depicted below:
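In symbols, a minimal numpy sketch of that rescaling (not the authors’ code): the conformal factor of the Poincaré ball is \lambda_u = 2 / (1 - \|u\|^2), so the Euclidean gradient gets scaled by (1 - \|u\|^2)^2 / 4, as in Nickel & Kiela. The point and gradient values below are invented.

    import numpy as np

    def riemannian_gradient(u, euclidean_grad):
        """Rescale the Euclidean gradient at u (a point inside the unit ball) to the hyperbolic one."""
        scale = (1.0 - np.dot(u, u)) ** 2 / 4.0   # inverse square of the conformal factor
        return scale * euclidean_grad

    u = np.array([0.3, 0.4])                      # a point in the Poincaré disc (norm 0.5)
    grad_E = np.array([1.0, -2.0])                # pretend Euclidean gradient at u
    print(riemannian_gradient(u, grad_E))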

So far, so good. What the authors then do is simply add the (rescaled) gradient to obtain the new value of the parameter vector, which is fine if you only take a small step, and so long as you don’t accidentally step over the boundary of the Poincaré disc! A friend described this as the Buzz Lightyear update (“to infinity, and beyond!”). While adding the gradient vector seems to work fine in practice, it does seem rather brutal. The root of the “problem” (if we agree to call it one) is that we aren’t following the geodesics of the manifold – to perform an update, we should really be applying the exponential map at the current point to the gradient vector. Geodesics on the Poincaré disc look like this:

that is, they are sections of circles that intersect the boundary of the Poincaré disc at right angles, or diameters (the latter being a limiting case of the former). With that in mind, here’s a picture showing how the Buzz Lightyear update on the Poincaré disc could be sub-optimal:

The blue vector \nabla_u is the hyperbolic gradient vector that is added to u, taking us out of the Poincaré disc. The resulting vector is then pulled back (along the ray with the faintly-marked origin) until it is within the disc by some small margin, resulting in the new value of the parameter vector u^\textrm{new}. On the other hand, if you followed the geodesic from u to which the gradient vector is tangent, you’d end up at the end of the red curve. Which is quite some distance away.
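For completeness, here is a rough numpy sketch of the “add and pull back” update described above – not the authors’ code; the learning rate, the small boundary margin eps and the descent sign of the step are arbitrary choices for illustration.

    import numpy as np

    def buzz_lightyear_update(u, grad_E, lr=0.1, eps=1e-5):
        grad = (1.0 - np.dot(u, u)) ** 2 / 4.0 * grad_E   # rescaled (hyperbolic) gradient
        step = u - lr * grad                              # take the step in the ambient Euclidean space
        norm = np.linalg.norm(step)
        if norm >= 1.0:                                   # stepped over the boundary of the disc:
            step = step / norm * (1.0 - eps)              # pull back along the ray through the origin
        return step

    u = np.array([0.3, 0.4])
    print(buzz_lightyear_update(u, np.array([30.0, -60.0])))   # a big step that gets pulled back inside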

Circle circumference in the hyperbolic plane is exponential in the radius: proof by computer game

I recently needed to demonstrate this fact to an audience that I could not assume would be familiar with Riemannian geometry, and it took some time to find a way to do it! You can use the HyperRogue game, which takes place on a tiling of the Poincaré disc. The avatar moves across the Poincaré disc (slaying monsters and collecting treasure) by stepping from tile to tile. It is a pretty impressive piece of software, and you can play it on various devices for small change. Below are four scenes from the game. This image was taken from the interesting paper on HyperRogue by Kopczyński et al., cited at bottom.

As a proxy for proving that the growth rate is exponential, just count the number of tiles that are accessible (starting from the centre tile) in N steps and no fewer. This gives the sequence 1, 14, 28, 49, 84, 147, 252, 434, 749, … . Plotting these values gives evidence enough.

The authors of the paper do a more thorough job, obtaining a recurrence relation for the sequence and using this to derive the base of the exponential growth (which comes out at about \sqrt{3}).
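You can already get a feel for that base directly from the tile counts above: the ratios of successive terms settle down near \sqrt{3} \approx 1.73. A two-line check:

    tiles = [1, 14, 28, 49, 84, 147, 252, 434, 749]
    ratios = [b / a for a, b in zip(tiles[1:], tiles[2:])]
    print([round(r, 3) for r in ratios])   # [2.0, 1.75, 1.714, 1.75, 1.714, 1.722, 1.726]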

[1] Kopczyński, Eryk; Celińska, Dorota; Čtrnáct, Marek. “HyperRogue: playing with hyperbolic geometry”. Proceedings of Bridges 2017: Mathematics, Music, Art, Architecture, Culture (2017).

Hierarchical Softmax

[These are the notes from a talk I gave at the seminar]

Hierarchical softmax is an alternative to the softmax in which the probability of any one outcome depends on a number of model parameters that is only logarithmic in the total number of outcomes. In “vanilla” softmax, on the other hand, the number of such parameters is linear in the total number of outcomes. In a case where there are many outcomes (e.g. in language modelling) this can be a huge difference. The consequence is that models using hierarchical softmax are significantly faster to train with stochastic gradient descent, since only the parameters upon which the current training example depends need to be updated, and fewer updates means we can move on to the next training example sooner. At evaluation time, hierarchical softmax models allow faster calculation of individual outcomes, again because they depend on fewer parameters (and because the calculation using the parameters is just as straightforward as in the softmax case). So hierarchical softmax is very interesting from a computational point of view. By explaining it here, I hope to convince you that it is also interesting conceptually. To keep things concrete, I’ll illustrate using the CBOW learning task from word2vec (and fasttext, and others).

The CBOW learning task

The CBOW learning task is to predict a word w_0 from the words on either side of it (its “context” C).

We are interested then in the conditional distribution P(w|C), where w ranges over some fixed vocabulary W.

This is very similar to language modelling, where the task is to predict the next word from the words that precede it.

CBOW with softmax

One approach is to model the conditional distribution P(w|C) with the softmax. In this setup, we have:

    \[\hat{y}_w := P(w|C) = \text{exp}({\beta_w}^T \alpha_C) / Z_C,\]

where Z_C = \sum_{w' \in W} {\text{exp}({\beta_{w'}}^T \alpha_C)} is a normalisation constant, \alpha_C is the hidden layer representation of the context C, and \beta_w is the second-layer word vector for the word w. Pictorially:

The parameters of this model are the entries of the matrices \alpha and \beta.
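To make this concrete, here is a tiny numpy sketch of the forward computation (not word2vec’s actual code): the vocabulary, dimensions and parameter values are invented, and \alpha_C is taken to be the average of the context word vectors.

    import numpy as np

    vocab = ["the", "time", "has", "come", "walrus"]
    dim = 4
    rng = np.random.default_rng(0)
    alpha = rng.normal(size=(len(vocab), dim))   # first-layer (context) word vectors
    beta = rng.normal(size=(len(vocab), dim))    # second-layer word vectors

    context = ["the", "has"]                     # the words around the target word
    alpha_C = alpha[[vocab.index(w) for w in context]].mean(axis=0)   # hidden layer: average of context vectors

    scores = beta @ alpha_C                          # one score per vocabulary word
    y_hat = np.exp(scores) / np.exp(scores).sum()    # softmax: P(w|C) for every w in the vocabulary
    print(dict(zip(vocab, y_hat.round(3))))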

Cross-entropy

For a single training example (w_0, C), the model parameters are updated to reduce the cross-entropy H between the distribution \hat{y} produced by the model and the distribution y representing the ground truth:

Because y is one-hot at w_0 (in this case, the word “time”), the cross-entropy reduces to a single log probability:

    \[H(y, \hat{y}) := - \sum_{w \in W} y_w \log \hat{y}_w = - \log P(w_0|C).\]

Note that this expression doesn’t depend on whether \hat{y} is modelled using the softmax or not.
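As a quick sanity check that the one-hot reduction holds numerically (toy numbers, nothing more):

    import numpy as np

    y_hat = np.array([0.1, 0.6, 0.1, 0.1, 0.1])   # model distribution over a toy vocabulary
    y = np.array([0.0, 1.0, 0.0, 0.0, 0.0])       # one-hot ground truth at w_0 (index 1)

    print(-(y * np.log(y_hat)).sum(), -np.log(y_hat[1]))   # identical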

Optimisation of softmax

The above expression for the cross entropy is very simple. However, in the case of the softmax, it depends on a huge number of model parameters. It does not depend on many entries of the matrix \alpha (only on those that correspond to the few words in the context C), but via the normalisation Z_C it depends on every entry of the matrix \beta. The number of these parameters is proportional to |W|, the number of vocabulary words, which can be huge. If we optimise using the softmax, all of these parameters need to be updated at every step.

Hierarchical softmax

Hierarchical softmax provides an alternative model for the conditional distributions P(\cdot|C) such that the number of parameters upon which a single outcome P(w|C) depends is only proportional to the logarithm of |W|. To see how it works, let’s keep working with our example. We begin by choosing a binary tree whose leaf nodes can be made to correspond to the words in the vocabulary:

Now view this tree as a decision process, or a random walk, that begins at the root of the tree and descends one step at a time towards the leaf nodes. It turns out that the probability of each outcome in the original distribution uniquely determines the transition probabilities of this random walk. At every internal node of the tree, the transition probabilities to the children are given by the proportions of total probability mass in the subtrees of its left and right children:

This decision tree now allows us to view each outcome (i.e. word in the vocabulary) as the result of a sequence of binary decisions. For example:

    \[P(\texttt{"time"}|C) = P_{n_0}(\text{right}|C) P_{n_1}(\text{left}|C) P_{n_2}(\text{right}|C),\]

where P_{n}(\text{right}|C) is the probability of choosing the right child when transitioning from node n. There are only two outcomes, of course, so:

    \[P_{n}(\text{right}|C) = 1 - P_{n}(\text{left}|C).\]

These distributions are then modelled using the logistic sigmoid \sigma:

    \[P_{n}(\text{left}|C) = \sigma({\gamma_n}^T \alpha_C),\]

where for each internal node n of the tree, \gamma_n is a coefficient vector – these are new model parameters that replace the \beta_w of the softmax. The wonderful thing about this new parameterisation is that the probability of a single outcome P(w|C) only depends upon the \gamma_n of the internal nodes n that lie on the path from the root to the leaf labelling w. Thus, in the case of a balanced tree, the number of parameters is only logarithmic in the size |W| of the vocabulary!
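Here is a minimal sketch of that computation (not word2vec’s implementation): the tree is encoded simply as the list of (internal node, go-left?) decisions along the path to the leaf, and the vectors \gamma_n and \alpha_C are random stand-ins.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def path_probability(path, gamma, alpha_C):
        """path: list of (node_id, go_left) decisions from the root to the leaf for w."""
        p = 1.0
        for node_id, go_left in path:
            p_left = sigmoid(gamma[node_id] @ alpha_C)
            p *= p_left if go_left else (1.0 - p_left)
        return p

    dim, n_internal = 5, 3
    rng = np.random.default_rng(0)
    gamma = rng.normal(size=(n_internal, dim))     # one coefficient vector per internal node
    alpha_C = rng.normal(size=dim)                 # hidden-layer representation of the context

    # e.g. the path for "time" in the example above: right at n_0, left at n_1, right at n_2
    path_to_time = [(0, False), (1, True), (2, False)]
    print(path_probability(path_to_time, gamma, alpha_C))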

Which tree?

J. Goodman (2001)

Goodman (2001) uses 2- and 3-level trees to speed up the training of a conditional maximum entropy model, which seems to resemble a softmax model without a hidden layer (I don’t understand the optimisation method, however, which is called generalised iterative scaling). In any case, the internal nodes of the tree represent “word classes”, which are derived in a data-driven way (apparently elaborated in reference [9] of the same author, which is behind a paywall).

F. Morin & Y. Bengio (2005)

Morin and Bengio (2005) build a tree by beginning with the “is-a” relationships from WordNet. They make it a graph of words (instead of word senses) by employing a heuristic, and make it acyclic by hand. Finally, to make the tree binary, the authors repeatedly cluster the child nodes using columns of a tf-idf matrix.

A. Mnih & G. Hinton (2009)

Mnih & Hinton (2009) use a bootstrapping method to construct binary trees. Firstly they train their language model using a random tree, and afterwards calculate the average context vector for every word in the vocabulary. They then recursively partition these context vectors, each time fitting a Gaussian mixture model with 2 spherical components. After fitting the GMM, the words are associated to the components, and this defines to which subtree (left or right) a word belongs. This is done in a few different ways. The simplest is to associate each word to the component that gives its vector the highest probability (“ADAPTIVE”); another is to split the words between the two components so that the resulting tree is balanced (“BALANCED”). They also consider a version of “ADAPTIVE” in which words that fall in a middle band between the two components are placed in both subtrees (“ADAPTIVE(e)”), which results not in a tree but in a directed acyclic graph. All these alternatives they compare to trees with random associations between leaves and words, measuring the performance of the resulting language models using the perplexity. As might be expected, their semantically constructed trees outperform the random trees. Remarkably, some of the DAG models perform better than the softmax!
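Here is a rough sketch of the recursive partitioning idea (the “ADAPTIVE” variant) – not the authors’ code: the words, their “average context vectors” and all sizes are invented, and scikit-learn’s GaussianMixture stands in for whatever fitting procedure they used.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def build_tree(words, vectors):
        """Recursively partition words into a binary tree using 2-component spherical GMMs."""
        if len(words) == 1:
            return words[0]
        gmm = GaussianMixture(n_components=2, covariance_type="spherical", random_state=0)
        component = gmm.fit_predict(vectors)          # "ADAPTIVE": assign each word to its most likely component
        left = component == 0
        if left.all() or (~left).all():               # the GMM failed to split the words; stop here
            return list(words)
        return (build_tree(words[left], vectors[left]),
                build_tree(words[~left], vectors[~left]))

    rng = np.random.default_rng(0)
    words = np.array(["cat", "dog", "fish", "run", "walk", "swim"])
    context_vectors = rng.normal(size=(len(words), 10))   # stand-in for the averaged context vectors
    print(build_tree(words, context_vectors))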

Mikolov et al. (2013)

The approaches above all use trees that are semantically informed. Mikolov et al., in their 2013 word2vec papers, choose instead to use a Huffman tree. This minimises the expected path length from root to leaf, thereby minimising the expected number of parameter updates per training example. Here is an example of the Huffman tree constructed from a word frequency distribution:

What is interesting about this approach is that the tree is random from a semantic point of view.
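For what it’s worth, building such a tree takes only a few lines with heapq; the word frequencies below are made up. Frequent words end up close to the root, so their paths – and hence their updates – are short.

    import heapq
    import itertools

    def huffman_tree(freqs):
        counter = itertools.count()          # tie-breaker so heapq never has to compare trees
        heap = [(f, next(counter), w) for w, f in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)    # repeatedly merge the two least frequent subtrees
            f2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, next(counter), (left, right)))
        return heap[0][2]

    freqs = {"the": 50, "of": 30, "time": 10, "walrus": 5, "carpenter": 5}
    print(huffman_tree(freqs))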

Minsky & Papert’s “Perceptrons”

In their book “Perceptrons” (1969), Minsky and Papert demonstrate that a simplified version of Rosenblatt’s perceptron cannot perform certain natural binary classification tasks, unless it uses an unmanageably large number of input predicates. It is easy to show that with sufficiently many input predicates, a perceptron (even of this simplified type) can perform any classification with perfect accuracy (see page 3 of the notes below). The contribution of Minsky and Papert is to show that meaningful restrictions on the type of input predicates hamper the expressive ability of the perceptron to such a degree that it is unable to e.g. distinguish connected from disconnected figures, or classify according to whether the number of active pixels is odd or even. The former has a simple picture proof, whereas the crucial ingredient for the latter is the action of a permutation group on the retina (i.e. the input array) of the perceptron.

Talk

Here are my notes from a recent talk I gave on the book at the Berlin machine learning seminar:

Style

The book is an engaging and instructive read – not only is it peppered with the authors’ opinions and ideas, but it also includes enlightening comments on how the presented ideas originated, and why other ideas that occurred to the authors didn’t work out. The book still bears the marks of its making, so to speak.

Controversy

The publication of the first edition in 1969 is popularly credited with bringing research on perceptrons and connectionism in general to a grinding halt. The book is held to be unjust, moreover. The “perceptrons”, which Minsky and Papert prove to be so limited in expressive power, were in fact only a very simplified version of what practitioners then regarded as a perceptron. A typical perceptron (unlike those of Minsky and Papert) might include more layers, feedback loops, or even be coupled with another perceptron. All these variations are described in Rosenblatt’s book “Principles of Neurodynamics” (1962). This is put very well (and colourfully!) by Block in his review of the book (1970):

Thus, although the authors state (p. 4, lines 12-14) “we have agreed to use the name ‘perceptron’ in recognition of the pioneer work of Frank Rosenblatt.”, they study a severely limited class of machines from a viewpoint quite alien to Rosenblatt’s. As a result, the title of the book, although generous in intent, is seriously misleading to the naive reader who wants to find out something about the general class of Perceptrons.

In summary then, Minsky and Papert use the word perceptron to denote a restricted subset of the general class of Perceptrons. They show that these simple machines are limited in their capabilities. This approach is reminiscent of the möhel who throws the baby into the furnace, hands the father the foreskin and says, “Here it is; but it will never amount to much.”

Despite these serious criticisms, it should be noted that Block (himself a trained mathematician) was full of praise for the “mathematical virtuosity” exhibited by Minsky and Papert in their book.

Whether the book alone stopped research into perceptrons is hard to judge, particularly given that its impact is confounded by the tragic death of Rosenblatt (at age 41) only two years later.

Re-parameterising for non-negativity yields multiplicative updates

Suppose you have a model that depends on real-valued parameters, and that you would like to constrain these parameters to be non-negative. For simplicity, suppose the model has a single parameter a \in \mathbb R. Let E denote the error function. To constrain a to be non-negative, parameterise a as the square of a real-valued parameter \alpha \in \mathbb R:

    \[a = \alpha^2, \quad \alpha \in \mathbb R.\]

We can now minimise E by choosing \alpha without constraints, e.g. by using gradient descent. Let \lambda > 0 be the learning rate. We have

    \begin{eqnarray*} \alpha^{\text{new}} &=& \alpha - \lambda \frac{\partial E}{\partial \alpha} \\ &=& \alpha - \lambda \textstyle{\frac{\partial E}{\partial a} \frac{\partial a}{\partial \alpha}} \\ &=& \alpha - \lambda 2 \alpha \textstyle \frac{\partial E}{\partial a} \\ &=& \alpha \cdot (1 - 2 \lambda \textstyle \frac{\partial E}{\partial a}) \end{eqnarray*}

by the chain rule. Thus

    \begin{eqnarray*} a^{\text{new}} &=& (\alpha^{\text{new}})^2 \\ &=& \alpha^2 (1 - 2 \lambda \textstyle\frac{\partial E}{\partial a})^2 \\ &=& a \cdot (1 - 2 \lambda \textstyle\frac{\partial E}{\partial a})^2. \end{eqnarray*}

Thus we’ve obtained a multiplicative update rule for a that is in terms of a only. In particular, we don’t need \alpha anymore!
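As a quick numerical check of the rule, here it is applied to the toy error E(a) = (a - 2)^2, so that \partial E / \partial a = 2(a - 2); the starting point and learning rate are arbitrary.

    a, lr = 0.5, 0.05

    def dE_da(a):
        return 2.0 * (a - 2.0)

    for _ in range(50):
        a = a * (1.0 - 2.0 * lr * dE_da(a)) ** 2   # the multiplicative update derived above

    print(a)   # approaches the minimiser a = 2, without ever leaving the non-negative reals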

Factorisation of stochastic matrices

Here we derive update rules for the approximation of a row-stochastic matrix by the product of two lower-rank row-stochastic matrices using gradient descent. Such a factorisation corresponds to a decomposition

    \[p(n|m) = \sum_k p(n|k) \cdot p(k|m)\]

Both the sum of squares and row-wise cross-entropy functions are considered.
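As a rough illustration only – not the derived update rules – here is a numpy sketch for the sum-of-squares case: plain gradient steps on both factors, followed by a heuristic projection (clip, then renormalise rows) to keep them row-stochastic. All sizes and data are invented.

    import numpy as np

    rng = np.random.default_rng(0)

    def row_normalise(M):
        M = np.clip(M, 1e-12, None)               # heuristic projection: keep entries positive,
        return M / M.sum(axis=1, keepdims=True)   # then rescale each row to sum to one

    m, k, n = 8, 3, 10
    P = row_normalise(rng.random((m, n)))         # the row-stochastic matrix p(n|m) to approximate
    A = row_normalise(rng.random((m, k)))         # plays the role of p(k|m)
    B = row_normalise(rng.random((k, n)))         # plays the role of p(n|k)

    lr = 0.1
    before = np.abs(A @ B - P).mean()
    for _ in range(5000):
        R = A @ B - P                             # residual; grad of (1/2)||AB - P||^2 is R B^T (w.r.t. A) and A^T R (w.r.t. B)
        grad_A, grad_B = R @ B.T, A.T @ R
        A = row_normalise(A - lr * grad_A)
        B = row_normalise(B - lr * grad_B)
    print(before, np.abs(A @ B - P).mean())       # the approximation error should shrink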

Heider and Simmel misinterpreted

I learnt of the 1944 experiment of Heider and Simmel at the Machine Intelligence workshop at NIPS 2016. The experiment involved showing subjects the video below, and asking them to describe what they saw. If you’ve watched the video, you’ll not be surprised to learn that most of the subjects anthropomorphised the geometric objects (i.e. they described them as humans, or described their actions and intentions in human terms). In the talk at the workshop, this was offered as evidence that humans have a tendency to anthropomorphise, the implication then being that it is not hard to trick users into believing that e.g. a chat bot is a real person. While the implication might indeed be true, I don’t think the experiment of Heider and Simmel shows this at all. In my view, the reason that viewers anthropomorphise the shapes in the video is that they know the video was made by human beings, and as such was created with intention. When you watch this video, you are participating in an act of communication, and the natural question to ask is: “what are the creators trying to communicate to me?”. For contrast, imagine that you are sitting in the bath and you have a few triangular shapes that float. If you float the shapes on the water and watch them for a while as they bob about randomly (without any expression of intent!), are you likely to anthropomorphise them? I’d say you are much less likely to do so.

Wine dataset demonstrates importance of feature scaling

The UCI hosts a dataset of wine measurements that is fantastic for demonstrating the importance of feature scaling in unsupervised learning. There are a bunch of real-valued measurements (of e.g. chemical composition) for three varieties of wine: Barolo, Grignolino and Barbera. I am using this dataset to introduce feature scaling in my course on DataCamp.

The wine measurements have very different scales, and performing a k-means clustering (with 3 clusters) on the unscaled measurements yields clusters that don’t correspond to the varieties:

varieties  Barbera  Barolo  Grignolino
labels                                
0               29      13          20
1                0      46           1
2               19       0          50

However, if the measurements are first standardised, the clusters correspond almost perfectly to the three wine varieties:

varieties  Barbera  Barolo  Grignolino
labels                                
0                0      59           3
1               48       0           3
2                0       0          65

Of course, k-means clustering is not meant to be a classifier! But when the clusters correspond this closely to the classes, it is apparent that the feature scaling has done its job.
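For reference, here is roughly the pipeline behind the two tables – a sketch, not the DataCamp exercise itself; the file and column names (a local wine.csv with a varieties column) are assumptions about how the UCI data has been loaded.

    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    wine = pd.read_csv("wine.csv")                   # hypothetical local copy of the UCI wine data
    varieties = wine.pop("varieties")                # the wine variety labels
    X = wine.values                                  # the real-valued measurements

    for scale in (False, True):
        features = StandardScaler().fit_transform(X) if scale else X
        labels = KMeans(n_clusters=3, random_state=0).fit_predict(features)
        print("scaled" if scale else "unscaled")
        print(pd.crosstab(labels, varieties, rownames=["labels"], colnames=["varieties"]))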

Which wine varieties?

I had to search to find the names of the wine varieties. According to page 9 of “Chemometrics with R” (Ron Wehrens), the three varieties are: Barolo (58 samples), Grignolino (71 samples) and Barbera (48 samples). I was unable to follow Wehrens’ citation (it’s his [6]) on Google Books.

Original source

According to the UCI page, this dataset originated from the paper “V-PARVUS. An Extendible Pachage of programs for esplorative data analysis, classification and regression analysis” by Forina, M., Lanteri, S., Armanino, C., Casolino, C., Casale, M. and Oliveri, P.

Olive oil dataset

There is a dataset of olive oil measurements associated to the same paper. I haven’t tried using it, but am sure I’ll use it in an example one day.

Eurovision song contest dataset

I’m teaching hierarchical clustering in my course on DataCamp, and I needed an interesting dataset. Importantly, I wanted the dataset to have labelled instances, so that the dendrogram would be easily interpretable, but not too many instances, so that they all fit on the dendrogram. Fortunately for me, the Eurovision song contest has been publishing its voting results (which is great!), and these are perfect. Both the voting results from the judges and those from the public give great results. The only thing you need to adjust for is that countries are not allowed to vote for themselves in Eurovision, which gives you some missing values in the data. I filled these with the maximum score of 12, since it is reasonable to assume that countries would vote selfishly if they were allowed to. Below is the dendrogram of a hierarchical (agglomerative) clustering using complete linkage.
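Here is a sketch of that clustering (not the exact course code): it assumes the votes have been loaded into a DataFrame with one row per voting country, one column per receiving country and the points awarded as entries; the file name is hypothetical.

    import pandas as pd
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    votes = pd.read_csv("eurovision_votes.csv", index_col=0)   # hypothetical file of voting results
    votes = votes.fillna(12)            # countries can't vote for themselves: assume they'd give themselves 12
    Z = linkage(votes.values, method="complete")               # complete-linkage agglomerative clustering
    dendrogram(Z, labels=votes.index.to_list(), leaf_rotation=90)
    plt.tight_layout()
    plt.show()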

A better version

It occurs to me now that I should have normalised the rows after filling in the missing values. This does indeed improve the hierarchical clustering further.