
Negative sampling from static distributions
- Uniform sampling tends to yield uninformative samples, i.e. samples that the model already labels correctly with high probability: popular items are precisely those that appear often as positive examples (and hence tend to be highly ranked by the model), while a uniformly-drawn item is likely to come from the tail of the popularity distribution (and hence to be lowly ranked by the model).
- The authors demonstrate that sampling according to popularity converges to inferior solutions.
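To make the first point concrete, here is a toy simulation (my own, assuming a Zipf-like popularity distribution; not from the paper) showing that a uniform draw almost always lands deep in the popularity tail, while a popularity-proportional draw concentrates on the head items:

```python
import numpy as np

rng = np.random.default_rng(0)
n_items = 10_000

# Toy popularity distribution: item frequencies follow a Zipf-like power
# law, so a handful of head items account for most positive interactions.
popularity = 1.0 / np.arange(1, n_items + 1)
popularity /= popularity.sum()

# Uniform negative sampling: every item equally likely.
uniform_draws = rng.integers(0, n_items, size=100_000)

# Popularity negative sampling: items drawn in proportion to frequency.
popularity_draws = rng.choice(n_items, size=100_000, p=popularity)

# Items are indexed by popularity rank, so the draw itself is the rank:
# uniform draws sit around rank n_items / 2 (deep in the tail), while
# popularity draws concentrate on the first few hundred items.
print("median popularity rank, uniform sampling:   ", int(np.median(uniform_draws)))
print("median popularity rank, popularity sampling:", int(np.median(popularity_draws)))
```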
Adaptive over-sampling
For a given user $u$ with latent vector $v_u$, a negative item is sampled in two steps:
- sampling a latent factor $f$ according to the absolute values $|v_{u,f}|$ of the latent vector associated to $u$ (normalised, so it looks like a probability distribution);
- sampling an item that ranks highly for $f$. More precisely, sample a rank $r$ from a geometric distribution over possible ranks, then find the item that has rank $r$ when the $f$th coordinates of the item latent vectors are compared.

(We have ignored the sign of the latent factor $v_{u,f}$: if the sign is negative, one chooses the $r$th-to-last ranked item.) The ranking of the items according to each latent factor is precomputed periodically, with a period chosen such that the extra overhead is independent of the number of items (i.e. is a constant).
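Here is a minimal sketch of this two-step procedure (my own construction; the function and argument names are assumptions, and the periodic re-sorting is only outlined in the usage example):

```python
import numpy as np

def sample_adaptive_negative(user_factor, item_ranks, rng, q=0.05):
    """Adaptively sample a negative item for one user.

    user_factor: (k,) latent vector of the user.
    item_ranks:  (k, n_items) array; item_ranks[f] lists item indices
                 sorted by the f-th coordinate, largest first
                 (precomputed periodically, as described above).
    q:           success probability of the geometric rank distribution.
    """
    k, n_items = item_ranks.shape

    # Step 1: sample a latent factor f with probability proportional
    # to |user_factor[f]| (normalised to a probability distribution).
    probs = np.abs(user_factor)
    probs /= probs.sum()
    f = rng.choice(k, p=probs)

    # Step 2: sample a rank r from a geometric distribution, truncated
    # to the number of items, so highly-ranked items are favoured.
    r = min(rng.geometric(q), n_items) - 1  # zero-based rank

    # If the user's f-th factor is negative, high scores come from the
    # bottom of the ranking, so take the r-th-to-last item instead.
    if user_factor[f] < 0:
        r = n_items - 1 - r
    return item_ranks[f, r]

# Usage sketch: the per-factor item rankings are precomputed.
rng = np.random.default_rng(0)
item_factors = rng.normal(size=(5000, 16))        # (n_items, k), made up
item_ranks = np.argsort(-item_factors, axis=0).T  # (k, n_items)
user_factor = rng.normal(size=16)
negative_item = sample_adaptive_negative(user_factor, item_ranks, rng)
```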
Problems with the approach
The samples yielded by the adaptive oversampling approach depend heavily upon the choice of basis for the latent space, and it is easy to construct examples where things go wrong. Non-negativity constraints would solve this. Regularisation would also deal with such examples; however, regularisation would complicate the expression of the scoring function.
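To see the basis dependence concretely, here is a toy construction (my own, not from the paper): a rescaling of the latent factors that leaves every score unchanged but radically alters the factor-sampling probabilities. Regularisation penalises the unbalanced solution below, which is one way to see why it helps with this sort of pathology.

```python
import numpy as np

# A change of basis that preserves every score <v_u, v_i> but changes
# the factor-sampling distribution |v_{u,f}| / sum_g |v_{u,g}|.
# Scale factor 0 up by c in the user vector and down by c in the item
# vectors (and vice versa for factor 1).
c = 10.0
S = np.diag([c, 1.0 / c])
S_inv = np.diag([1.0 / c, c])

rng = np.random.default_rng(0)
v_u = np.array([0.5, 0.5])            # user vector: factors sampled 50/50
items = rng.normal(size=(1000, 2))    # item latent vectors

# Rescaled model: identical scores, different sampling distribution.
v_u_scaled = S @ v_u                  # (5.0, 0.05)
items_scaled = items @ S_inv

assert np.allclose(items @ v_u, items_scaled @ v_u_scaled)  # scores unchanged

for vec, label in [(v_u, "original"), (v_u_scaled, "rescaled")]:
    probs = np.abs(vec) / np.abs(vec).sum()
    print(f"{label}: factor-sampling probabilities = {probs}")
# original: [0.5, 0.5]; rescaled: roughly [0.99, 0.01],
# even though the model's scores are identical in both bases.
```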
Despite these problems, the authors demonstrate impressive performance, as we’ll see below.
Performance

Implementation
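For concreteness, here is a rough sketch (my own, not the authors' code) of how the adaptive sampler from above might sit inside a BPR SGD loop; the data, hyperparameters and refresh period are all assumptions.

```python
import numpy as np

def bpr_sgd_step(U, V, u, i, j, lr=0.05, reg=0.01):
    """One BPR stochastic gradient ascent step on the triple
    (user u, positive item i, negative item j)."""
    x_uij = U[u] @ (V[i] - V[j])       # score difference
    g = 1.0 / (1.0 + np.exp(x_uij))    # sigmoid(-x_uij), the loss gradient weight
    du = g * (V[i] - V[j]) - reg * U[u]
    di = g * U[u] - reg * V[i]
    dj = -g * U[u] - reg * V[j]
    U[u] += lr * du
    V[i] += lr * di
    V[j] += lr * dj

rng = np.random.default_rng(0)
n_users, n_items, k = 1000, 5000, 16
U = rng.normal(scale=0.1, size=(n_users, k))
V = rng.normal(scale=0.1, size=(n_items, k))
positives = [(u, rng.integers(n_items)) for u in range(n_users)]  # made-up data

refresh_every = 10_000  # period of the per-factor ranking precomputation
for step in range(100_000):
    if step % refresh_every == 0:
        item_ranks = np.argsort(-V, axis=0).T  # (k, n_items)
    u, i = positives[rng.integers(len(positives))]
    # Reuses sample_adaptive_negative from the sketch above; a real
    # implementation would also skip sampled items known to be positive.
    j = sample_adaptive_negative(U[u], item_ranks, rng)
    bpr_sgd_step(U, V, u, i, j)
```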
What about adaptive oversampling in word2vec?
Word2vec with negative sampling learns a word embedding via a binary classification task, specifically: “does the word appear in the same window as the context word, or not?”. As in the case of implicit feedback recommendation, the observed data for word2vec is one-class (just a sequence of words). Interestingly, word2vec samples from a modification of the observed word frequency distribution (i.e. of the “distribution according to popularity”) in which the frequencies are raised to the power of $3/4$.
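The effect of the exponent is easy to see numerically (made-up counts, for illustration):

```python
import numpy as np

# Raw word frequencies (made-up counts for illustration).
counts = np.array([1000.0, 100.0, 10.0, 1.0])

unigram = counts / counts.sum()
smoothed = counts ** 0.75
smoothed /= smoothed.sum()

# Raising to the 3/4 power flattens the distribution: rare words are
# sampled more often than their raw frequency would suggest, and very
# frequent words less often.
print("unigram: ", np.round(unigram, 4))
print("smoothed:", np.round(smoothed, 4))
```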
- Would word2vec perform better with adaptive oversampling?
- How does BPR perform when sampling from a similarly-modified item popularity distribution (i.e. raising to some exponent)?
Corrections
Corrections and comments are most welcome. Please get in touch either via the comments below or via email.