Metadata Embeddings for User and Item Cold-start Recommendations

Maciej Kula (Lyst)
CBRecSys 2015 (arxiv)

Kula presents a model for cold start recommendation, which he calls “LightFM”.  Users and items are considered as sets of binary features. For example:

\text{alice} = \{ \text{domain}:\text{gmail} \}

\text{itemXYZ} = \{\text{description}:\text{pleated}, \text{description}:\text{skirt}, \text{tag}:\text{chanel} \}.

Each of these features (e.g. each tag, each word and each email domain) corresponds to a parameter vector and a bias term.  A user vector (or item vector) is then the sum of the vectors associated to its constituent features.  Similarly, a user (item) bias term is just the sum of the bias terms associated to its features.

The probability \hat{r}(u, i) of an interaction between a user u and an item i is modelled as the sigmoid of the dot product of the user vector and the item vector, along with the bias terms associated with the user and the item:

\hat{r}(u, i) := \sigma (vec(u) \cdot vec(i) + bias(u) + bias(i))
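
The scoring rule can be sketched in a few lines of plain Python. This is only an illustration of the formula above, not the LightFM implementation: the function names and all parameter values are made up, and the embeddings are two-dimensional for readability.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def embed(features, vectors, biases):
    """Sum the parameter vectors and bias terms of a set of features."""
    dim = len(next(iter(vectors.values())))
    vec = [0.0] * dim
    bias = 0.0
    for f in features:
        vec = [a + b for a, b in zip(vec, vectors[f])]
        bias += biases[f]
    return vec, bias

def predict(user_features, item_features, user_params, item_params):
    """Sigmoid of the dot product of the summed vectors plus both biases."""
    u_vec, u_bias = embed(user_features, *user_params)
    i_vec, i_bias = embed(item_features, *item_params)
    dot = sum(a * b for a, b in zip(u_vec, i_vec))
    return sigmoid(dot + u_bias + i_bias)

# Hypothetical parameters: (feature -> vector, feature -> bias).
user_params = ({"domain:gmail": [0.1, -0.2]}, {"domain:gmail": 0.05})
item_params = (
    {"description:pleated": [0.3, 0.1],
     "description:skirt": [0.2, -0.4],
     "tag:chanel": [-0.1, 0.5]},
    {"description:pleated": 0.0, "description:skirt": 0.1, "tag:chanel": -0.05},
)

alice = {"domain:gmail"}
item_xyz = {"description:pleated", "description:skirt", "tag:chanel"}
p = predict(alice, item_xyz, user_params, item_params)
```

A new item with no interactions still gets a sensible vector here, provided its features have been seen on other items, which is the point of the model in the cold-start setting.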

The model is trained on a set S_{+} of user-item pairs observed as having interacted, and on a set S_{-} of user-item pairs that were not observed to have interacted (in the case of implicit feedback recommendation) or to have interacted negatively (in the case of explicit feedback recommendation).  Specifically, these interactions and non-interactions are assumed independent and the likelihood

\displaystyle L = \prod_{(u, i) \in S_{+}} \hat{r}(u,i) \cdot \prod_{(u, i) \in S_{-}} (1 - \hat{r}(u,i))

is then maximised using stochastic gradient descent, with adaptive per-parameter learning rates determined by Adagrad.
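
To make the optimisation step concrete, here is a minimal sketch of one stochastic update with Adagrad. It assumes the simplest possible setting (one vector per user and per item, no bias terms), and the learning rate, epsilon and starting values are arbitrary choices of mine rather than anything taken from the paper or from the LightFM code.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def adagrad_step(param, grad, accum, lr=0.05, eps=1e-8):
    """In-place Adagrad update; accum holds the running sum of squared gradients."""
    for k in range(len(param)):
        accum[k] += grad[k] ** 2
        param[k] -= lr * grad[k] / (math.sqrt(accum[k]) + eps)

def update(u, v, y, acc_u, acc_v):
    """One SGD step on the negative log-likelihood of a single (u, i) pair.

    y is 1 for a pair in S_+ and 0 for a pair in S_-.
    Returns the prediction made before the update.
    """
    r_hat = sigmoid(sum(a * b for a, b in zip(u, v)))
    err = r_hat - y  # derivative of the negative log-likelihood w.r.t. the score
    grad_u = [err * b for b in v]
    grad_v = [err * a for a in u]
    adagrad_step(u, grad_u, acc_u)
    adagrad_step(v, grad_v, acc_v)
    return r_hat

# Repeatedly present one positive pair: the predicted probability should rise.
u, v = [0.1, 0.2], [0.3, -0.1]
acc_u, acc_v = [0.0, 0.0], [0.0, 0.0]
history = [update(u, v, 1, acc_u, acc_v) for _ in range(50)]
```

Because the accumulated squared gradients only grow, each coordinate's effective learning rate shrinks over time, and coordinates that are updated rarely keep larger steps.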

Trivial featurisation gives matrix factorisation

Note that users (or items) can be featurised trivially using their ids.   We create one user feature for each user id, so that the user-feature matrix is the identity matrix.  In this case, we have a separate parameter vector for each user.  If we do this for both users and items, then the model is just a (sigmoid-) factorisation of the user-item interaction matrix. This is then the case of Johnson’s logistic matrix factorization.
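
A quick NumPy check of this observation, as a sketch with made-up sizes and random parameters (bias terms are left out for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, dim = 4, 5, 3

# One parameter vector per feature; with id features there is one per user/item.
user_feature_emb = rng.normal(size=(n_users, dim))
item_feature_emb = rng.normal(size=(n_items, dim))

# Trivial featurisation: the user-feature and item-feature matrices are identities.
user_features = np.eye(n_users)
item_features = np.eye(n_items)

# Summing the vectors of each entity's features is a matrix product.
user_vecs = user_features @ user_feature_emb
item_vecs = item_features @ item_feature_emb

# Predicted interaction probabilities: a sigmoid applied to a low-rank product.
scores = 1.0 / (1.0 + np.exp(-(user_vecs @ item_vecs.T)))
```

Replacing the identity matrices with genuine metadata indicator matrices gives the general model, and concatenating the two (ids plus metadata) is a natural hybrid.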


Performance is evaluated on MovieLens for explicit feedback recommendation and on CrossValidated (one of the StackExchange websites) for implicit feedback recommendation.  In both cases, warm- and cold-start scenarios are tested.  Warm start is tested by holding out interactions in such a way that every item and every user is still represented in the training interaction data.  Cold start is tested by holding out all interactions for some items.  Model accuracy is measured per user: for each user in the set of test interactions, labelling each item as interacted with or not is treated as a binary classification task, and the area under the associated ROC curve (AUC) is computed.  The mean AUC is then taken over all users in the test set.
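
The per-user AUC averaging can be sketched as follows. This is my own minimal version, using the pairwise definition of AUC (the probability that a positive item is scored above a negative one, with ties counted as one half); the data is invented, and skipping users who have only one class is an assumption on my part.

```python
def auc(scores, labels):
    """AUC for one user; returns None if the user lacks positives or negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_user_auc(per_user):
    """per_user maps a user to (scores, labels) over that user's candidate items."""
    values = [auc(scores, labels) for scores, labels in per_user.values()]
    values = [v for v in values if v is not None]
    return sum(values) / len(values)

per_user = {
    "alice": ([0.9, 0.2, 0.7, 0.1], [1, 0, 1, 0]),  # both positives ranked first
    "bob": ([0.3, 0.8, 0.5], [1, 0, 0]),            # the positive ranked last
}
result = mean_user_auc(per_user)
```

Averaging per user, rather than pooling all pairs, stops very active users from dominating the metric.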

LightFM seems to perform well in both cold and warm start scenarios.

Engineering Notes

Kula included some interesting notes on the production use of LightFM at Lyst.  Training is incremental with model state stored in the database.

Implementation and Examples

Available on GitHub and extensively documented.  Written in Cython.  In addition to the logistic loss used above, Bayesian Personalised Ranking and WARP are supported.


Collaborative Filtering for Implicit Feedback Datasets

Hu, Koren and Volinsky (AT&T, Yahoo!), 2008.

A well-written paper.


The authors give a good description of the distinctions between explicit and implicit feedback datasets, pointing out in particular that:

  1. implicit feedback data is inherently noisy, since a user might decide that they do not like an item after viewing it — interaction does not necessarily indicate interest.
  2. the numerical value in explicit feedback indicates preference, whereas in the implicit case it indicates confidence.

The authors describe their model as being based on SVD, but this is not accurate, since they weight squared difference summands in the cost function according to a confidence value (which increases linearly with the number of interactions for that user-item pair).
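
For reference, the weighted cost function can be written as follows, as I recall it from the paper. Here r_{ui} is the observed number of interactions, p_{ui} the binarised preference, c_{ui} the confidence, x_u and y_i the user and item factors, and \alpha and \lambda are hyperparameters.

```latex
\min_{x_*, y_*} \sum_{u, i} c_{ui} \left( p_{ui} - x_u^{T} y_i \right)^2
  + \lambda \left( \sum_u \| x_u \|^2 + \sum_i \| y_i \|^2 \right),
\qquad c_{ui} = 1 + \alpha r_{ui},
\qquad p_{ui} = \begin{cases} 1 & r_{ui} > 0 \\ 0 & r_{ui} = 0 \end{cases}
```

Note that the sum runs over all user-item pairs, including the unobserved ones, which receive the minimal confidence of 1.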

The input matrix is the user-item matrix.

Optimisation is via alternating least squares.
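
A minimal dense sketch of the alternating updates, assuming the objective sums c_{ui} (p_{ui} - x_u . y_i)^2 over all pairs plus an L2 penalty, with confidence c_{ui} = 1 + alpha * r_{ui} and binary preference p_{ui}. The function names, the toy matrix and the hyperparameter values are mine and purely illustrative. A practical implementation would exploit sparsity rather than build a dense diagonal matrix per user.

```python
import numpy as np

def als_half_step(R, Y, alpha=40.0, lam=0.1):
    """Solve exactly for all row factors X with the column factors Y held fixed."""
    n_rows, dim = R.shape[0], Y.shape[1]
    P = (R > 0).astype(float)   # binary preferences
    C = 1.0 + alpha * R         # confidences
    X = np.zeros((n_rows, dim))
    for u in range(n_rows):
        Cu = np.diag(C[u])
        A = Y.T @ Cu @ Y + lam * np.eye(dim)
        b = Y.T @ Cu @ P[u]
        X[u] = np.linalg.solve(A, b)  # regularised weighted least squares
    return X

def weighted_loss(R, X, Y, alpha=40.0, lam=0.1):
    P = (R > 0).astype(float)
    C = 1.0 + alpha * R
    err = P - X @ Y.T
    return float((C * err ** 2).sum() + lam * ((X ** 2).sum() + (Y ** 2).sum()))

def fit(R, dim=2, n_iters=10, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(scale=0.1, size=(R.shape[0], dim))
    Y = rng.normal(scale=0.1, size=(R.shape[1], dim))
    losses = []
    for _ in range(n_iters):
        X = als_half_step(R, Y)    # update user factors
        Y = als_half_step(R.T, X)  # update item factors (same problem, transposed)
        losses.append(weighted_loss(R, X, Y))
    return X, Y, losses

# Toy interaction counts: 3 users, 4 items.
R = np.array([[3.0, 0.0, 1.0, 0.0],
              [0.0, 2.0, 0.0, 0.0],
              [1.0, 0.0, 0.0, 4.0]])
X, Y, losses = fit(R)
```

Each half step is a convex quadratic problem with a closed-form solution, so the loss cannot increase from one sweep to the next.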

Their evaluation metric is percentile rank based.
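
As I understand the metric, each item's position in a user's ranked list is converted to a percentile (0% for the top recommendation, 100% for the bottom), and these percentiles are averaged with weights given by the interaction counts in the test period. Lower is better, and a random ranking would score about 50%. A small sketch, with hypothetical names and data, returning a fraction in [0, 1] rather than a percentage:

```python
def expected_percentile_rank(scores, test_counts):
    """scores: {user: {item: predicted score}} over all candidate items.
    test_counts: {user: {item: number of test-period interactions}}.
    """
    num = den = 0.0
    for user, counts in test_counts.items():
        user_scores = scores[user]
        ranked = sorted(user_scores, key=user_scores.get, reverse=True)
        n = len(ranked)
        # Percentile of each item in this user's list: 0.0 is best, 1.0 is worst.
        pct = {item: (pos / (n - 1) if n > 1 else 0.0)
               for pos, item in enumerate(ranked)}
        for item, r in counts.items():
            num += r * pct[item]
            den += r
    return num / den

scores = {"u": {"a": 0.9, "b": 0.5, "c": 0.1}}
test_counts = {"u": {"a": 2, "c": 1}}
epr = expected_percentile_rank(scores, test_counts)
```

In the toy example, item "a" sits at percentile 0 with weight 2 and item "c" at percentile 1 with weight 1, so the weighted mean is one third.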

Their model, which we’ll call “weighted SVD” (they speak of “confidence intervals”), compares favourably with the baseline popularity method and also with an old-school item-based neighbourhood method, in terms of the expected percentile rank (Figure 1). Interestingly, the differences are less marked when the probability that a desired item is in the top (say) 1% is considered (Figure 2).

The unweighted SVD on the user-item matrix is shown to perform terribly, with a significant but insufficient improvement obtained with regularisation.

ItemRank: A Random-Walk Based Scoring Algorithm for Recommender Engines

Marco Gori and Augusto Pucci, 2007 (from the IJCAI conference proceedings).


The authors consider an application of PageRank to recommendation in the case where explicit ratings are available. The vertices of the graph represent items to be recommended, and the weight of the edge between any two vertices is proportional to the number of users that have interacted with both the corresponding items (thus the explicit ratings are not incorporated into the graph itself). To obtain recommendations for a particular user, the “reset-” (or “teleport-”) vector is set to be the explicit ratings given by the user (0 is used for the absence of a rating), PageRank is run and the resulting importances are used to rank the items.
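
The procedure can be sketched with NumPy as below. This is my reading of the set-up rather than the authors' code: the damping factor of 0.85, the fixed number of iterations, the handling of isolated items and the toy data are all assumptions.

```python
import numpy as np

def itemrank(interactions, user_ratings, alpha=0.85, n_iters=50):
    """Personalised PageRank over the item co-occurrence graph.

    interactions: (n_users, n_items) array, non-zero where a user rated an item.
    user_ratings: length n_items ratings of the target user, 0 meaning unrated.
    """
    B = (np.asarray(interactions) > 0).astype(float)
    W = B.T @ B                    # W[i, j] = number of users who rated both i and j
    np.fill_diagonal(W, 0.0)       # no self-loops
    col_sums = W.sum(axis=0)
    col_sums[col_sums == 0] = 1.0  # isolated items: avoid division by zero
    M = W / col_sums               # column-normalised transition matrix
    d = np.asarray(user_ratings, dtype=float)
    d = d / d.sum()                # teleport vector from the user's ratings
    x = np.full(len(d), 1.0 / len(d))
    for _ in range(n_iters):
        x = alpha * (M @ x) + (1.0 - alpha) * d
    return x

# Toy data: items form a chain 0 - 1 - 2 - 3 through co-rating users.
interactions = np.array([[1, 1, 0, 0],
                         [0, 1, 1, 0],
                         [0, 0, 1, 1]])
ratings = [5, 0, 0, 0]             # the target user has rated only item 0
x = itemrank(interactions, ratings)
ranking = np.argsort(-x)
```

In the toy chain, items closer to the one the user rated receive more mass than items further away, which is the behaviour one would hope for.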

It seems to me that this set-up would be more sensible in contexts where the behavioural data was implicit (e.g. user looked at particular item) rather than explicit (user gave a particular rating to a particular item) – in the explicit context the use of the value 0 for the absence of rating cannot be motivated.

The authors test their approach on the MovieLens dataset.

As the authors themselves note, PageRank had been used for personalised (more generally, deliberately biased) recommendation before their work (e.g. Haveliwala “Topic sensitive Pagerank”, 2002). The novelty here lies in the construction of the graph from the user-item interaction matrix.