*BPR*) for implicit feedback (i.e. one class) recommendation datasets in which the negative samples are drawn according both to the models current belief and the user/context in question (“

*adaptive oversampling*“). They show that the prediction accuracy of BPR models trained with adaptive oversampling matches that of BPR models trained with uniform sampling but that convergence is 10x-20x faster.

### Negative sampling from static distributions

- Uniform sampling tends to yield uninformative samples, i.e. those for which the probability of being incorrectly labelled is very likely already low: popular items are precisely those that appear often as positive examples (and hence tend to be highly ranked by the model), while a uniformly-drawn item is likely to be from the tail of the popularity distribution (so likely lowly ranked by the model).
- Sampling according to popularity is demonstrated by the authors to converge to inferior solutions.

### Adaptive over-sampling

- sampling a latent factor $l$ according to the absolute values of the latent vector associated to $c$ (normalised, so it looks like a probability distribution);
- sampling an item $j$ that ranks highly for $l$. More precisely, sample a rank $r$ from a geometric distribution over possible ranks, then find the item that has rank $r$ when the $l$th coordinates of the item latent vectors are compared.

(We have ignored the sign of the latent factor. If the sign is negative, one choses the rth-to-last ranked item). The ranking of the items according to each latent factor is precomputed periodically with a period such that the extra overhead is independent of the number of items (i.e. is a constant).

### Problems with the approach

The samples yielded by the adaptive oversampling approach depend heavily upon the choice of basis for the latent space. It is easy to construct examples of where things go wrong:

Non-negativity constraints would solve this. Regularisation would also deal with this particular example – however regularisation would complicate the expression of the scoring function $\hat y$ as a mixture (since you need to divide though by $Z_c$.

Despite these problems, the authors demonstrate impressive performance, as we’ll see below.

### Performance

### Implementation

### What about adaptive oversampling in word2vec?

Word2vec with negative sampling learns a word embedding via binary classification task, specifically, *“does the word appear in the same window as the context, or not?”*. As in the case of implicit feedback recommendation, the observed data for word2vec is one-class (just a sequence of words). Interestingly, word2vec samples from a modification of the observed word frequency distribution (i.e. of the “distribution according to popularity”) in which the frequencies are raised to the $0.75$th power and renormalised. The exponent was chosen empirically. This raises two questions:

- Would word2vec perform better with adaptive oversampling?
- How does BPR perform when sampling from a similarly-modified item popularity distribution (i.e. raising to some exponent)?

### Corrections

Corrections and comments are most welcome. Please get in touch either via the comments below or via email.