Maciej Kula (Lyst)

CBRecSys 2015 (arxiv)

Kula presents a model for cold-start recommendation, which he calls “LightFM”. Users and items are each represented as sets of binary features. For example:

$\text{alice} = \{ \text{domain}:\text{gmail} \}$

$\text{itemXYZ} = \{\text{description}:\text{pleated}, \text{description}:\text{skirt}, \text{tag}:\text{chanel} \}.$

Each of these features (e.g. each tag, each word and each email domain) corresponds to a parameter vector and a bias term. A user vector (or item vector) is then the sum of the vectors associated to its constituent features. Similarly, a user (item) bias term is just the sum of the bias terms associated to its features.
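A minimal numpy sketch of this feature-sum representation (the feature count, dimensionality, and index assignments below are invented for illustration, not taken from the paper):

```python
import numpy as np

# Hypothetical setup: 4 binary features in total, each with a
# 3-dimensional embedding vector and a scalar bias.
rng = np.random.default_rng(0)
n_features, dim = 4, 3
feature_vectors = rng.normal(size=(n_features, dim))  # one vector per feature
feature_biases = rng.normal(size=n_features)          # one bias per feature

def represent(feature_indices):
    """A user's (or item's) vector and bias: sums over its feature set."""
    vec = feature_vectors[feature_indices].sum(axis=0)
    bias = feature_biases[feature_indices].sum()
    return vec, bias

alice_vec, alice_bias = represent([0])        # e.g. {domain:gmail}
item_vec, item_bias = represent([1, 2, 3])    # e.g. {pleated, skirt, chanel}
```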

The probability $\hat{r}(u, i)$ of an interaction between a user $u$ and an item $i$ is modelled as the sigmoid of the dot product of the user vector and the item vector, plus the bias terms associated with the user and the item:

$\hat{r}(u, i) := \sigma (vec(u) \cdot vec(i) + bias(u) + bias(i))$
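The prediction rule in code, using made-up vectors and biases in place of learned values:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical user/item representations, just to show the prediction rule.
vec_u, bias_u = np.array([0.1, -0.2, 0.3]), 0.05
vec_i, bias_i = np.array([0.4, 0.0, -0.1]), -0.02

r_hat = sigmoid(vec_u @ vec_i + bias_u + bias_i)  # a probability in (0, 1)
```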

The model is trained on a set $S_{+}$ of user-item pairs observed to have interacted, and on a set $S_{-}$ of user-item pairs that were either not observed to have interacted (in the case of implicit feedback recommendation) or observed to have interacted negatively (in the case of explicit feedback recommendation). Specifically, these interactions and non-interactions are assumed independent, and the likelihood

$\displaystyle L = \prod_{(u, i) \in S_{+}} \hat{r}(u,i) \cdot \prod_{(u, i) \in S_{-}} (1 - \hat{r}(u,i))$

is then maximised using stochastic gradient descent with adaptive per-parameter learning rates determined by Adagrad.
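One Adagrad-scaled ascent step on the log-likelihood for a single pair can be sketched as follows; this ignores the bias terms, and all names, sizes, and values are illustrative rather than the paper's actual code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative user/item vectors and Adagrad state.
rng = np.random.default_rng(0)
dim, lr, eps = 3, 0.05, 1e-8
u = rng.normal(scale=0.5, size=dim)        # user vector
i = rng.normal(scale=0.5, size=dim)        # item vector
g2_u, g2_i = np.zeros(dim), np.zeros(dim)  # squared-gradient accumulators

before = sigmoid(u @ i)                    # prediction before the update
y = 1                                      # this pair is in S+ (y = 0 for S-)
err = y - sigmoid(u @ i)                   # d log L / d(score), score = u . i
grad_u, grad_i = err * i, err * u          # chain rule through the dot product
g2_u += grad_u ** 2                        # accumulate squared gradients
g2_i += grad_i ** 2
u = u + lr * grad_u / np.sqrt(g2_u + eps)  # per-parameter Adagrad step
i = i + lr * grad_i / np.sqrt(g2_i + eps)  # (ascent: maximise log L)
```

For a positive pair the error term is positive, so the step pushes the predicted probability towards 1; for a pair in $S_{-}$ it pushes towards 0.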

### Trivial featurisation gives matrix factorisation

Note that users (or items) can be featurised trivially using their ids. We create one user feature for each user id, so that the user-feature matrix is the identity matrix. In this case, we have a separate parameter vector for each user. If we do this for both users and items, then the model is just a (sigmoid-) factorisation of the user-item interaction matrix. This is then the case of Johnson’s logistic matrix factorization.
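The reduction is immediate once written down: multiplying an identity user-feature matrix by the feature-embedding matrix just returns the embedding matrix, so each user gets its own independent vector (sizes below are made up):

```python
import numpy as np

# One indicator feature per user id: the user-feature matrix is the identity,
# so summing a user's feature embeddings selects exactly its own row, and the
# model reduces to ordinary (sigmoid) matrix factorisation.
n_users, dim = 5, 3
user_features = np.eye(n_users)            # identity featurisation
feature_vectors = np.random.default_rng(1).normal(size=(n_users, dim))

user_vectors = user_features @ feature_vectors  # sum of feature embeddings
```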

### Evaluation

Performance is evaluated on MovieLens for explicit feedback recommendation and on CrossValidated (one of the StackExchange websites) for implicit feedback recommendation. In both cases, warm- and cold-start scenarios are tested. Warm start is tested by holding out interactions in such a way that every item and every user remains represented in the training data; cold start is tested by holding out all interactions for some items. Accuracy is measured as follows: for each user in the test set, labelling each item as interacted with or not is treated as a binary classification task, and the area under the associated ROC curve (AUC) is computed. The reported score is the mean AUC over all users in the test set.
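A toy illustration of this mean-AUC evaluation, using the pairwise-ranking form of AUC (the probability that a random positive item is scored above a random negative one, which equals the area under the ROC curve in the absence of ties); the per-user scores and labels are invented:

```python
import numpy as np

def auc(labels, scores):
    """Fraction of (positive, negative) item pairs ranked correctly."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    return np.mean(pos[:, None] > neg[None, :])

# Hypothetical model scores over four items, and binary test-set labels.
scores = {"u1": np.array([0.9, 0.2, 0.7, 0.1]),
          "u2": np.array([0.3, 0.8, 0.4, 0.6])}
labels = {"u1": np.array([1, 0, 1, 0]),
          "u2": np.array([0, 1, 0, 1])}

mean_auc = np.mean([auc(labels[u], scores[u]) for u in scores])
```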

LightFM seems to perform well in both cold and warm start scenarios.

### Engineering Notes

Kula includes some interesting notes on the production use of LightFM at Lyst: training is incremental, with model state stored in the database.

### Implementation and Examples

The implementation is available on GitHub and extensively documented; it is written in Cython. In addition to the logistic loss used above, Bayesian Personalised Ranking (BPR) and WARP (Weighted Approximate-Rank Pairwise) losses are supported.