We show below that vectors drawn uniformly at random from the unit sphere are increasingly likely to be nearly orthogonal as the dimension grows.
In information retrieval, and in other areas besides, it is common to use the dot product of normalised vectors as a measure of their similarity. It can be problematic that this similarity measure depends upon the rank (i.e. the dimension) of the representation, as it does here: it means, for example, that similarity thresholds (for relevance in a particular situation) need to be re-calibrated if the underlying vectorisation model is retrained, e.g. in a higher dimension.
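As an illustration of both points, here is a minimal simulation sketch (Python/NumPy, not part of the derivation below): it samples pairs of uniformly random unit vectors in a few dimensions and reports the spread of their dot products together with the fraction exceeding a fixed threshold. The dimensions and the 0.1 threshold are arbitrary choices made for the illustration.

```python
# Simulation sketch: dot products of uniformly random unit vectors concentrate
# around zero as the dimension grows, so a similarity threshold fixed in one
# dimension admits a very different fraction of random pairs in another.
import numpy as np

rng = np.random.default_rng(0)
n_pairs = 100_000
threshold = 0.1  # arbitrary, for illustration only

for d in (2, 10, 100, 1000):
    # Normalising i.i.d. Gaussian vectors gives points uniform on the unit sphere.
    x = rng.standard_normal((n_pairs, d))
    y = rng.standard_normal((n_pairs, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    y /= np.linalg.norm(y, axis=1, keepdims=True)
    dots = np.sum(x * y, axis=1)
    print(f"d={d:4d}  mean={dots.mean():+.4f}  std={dots.std():.4f}  "
          f"frac(dot > {threshold}) = {(dots > threshold).mean():.3f}")
```

The standard deviation printed here shrinks roughly like 1/sqrt(d), which is the dimension dependence discussed in this post.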
Question
Does anyone have a solution to this? Is there a (necessarily dimension-dependent) transformation that can be applied to the dot product values to arrive at some dimension-independent measure of similarity?
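Purely as a sketch of the kind of transformation being asked about (and not a claim that it answers the question), one candidate follows from the moments derived below: the dot product of two independent uniform unit vectors in dimension d has mean 0 and variance 1/d, so multiplying it by sqrt(d) gives a value whose distribution under that random model is approximately standard normal for large d, and hence roughly comparable across dimensions. The function name below is invented for the illustration.

```python
# Sketch of a dimension-dependent rescaling of cosine similarity. Under the
# model of two independent uniformly random unit vectors in R^d, the dot
# product has mean 0 and variance 1/d, so sqrt(d) * dot is approximately
# N(0, 1) for large d, i.e. roughly dimension-independent for random pairs.
import numpy as np

def dimension_adjusted_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity rescaled by sqrt(d): unrelated (random) vectors then
    score around 0 with unit spread, whatever the dimension d."""
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    d = u.shape[0]
    return float(np.sqrt(d) * np.dot(u, v))
```

A threshold on this quantity is, in effect, expressed in standard deviations above chance, which means the same thing at every dimension for random pairs; whether that calibration carries over to real, highly non-random embeddings is exactly what the question leaves open.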
I asked this question on “Cross Validated”. Unfortunately, it has attracted so little interest that it has earned me the “tumbleweed” badge!
For those interested in the distribution of the dot products, and not just the expectation and the variance derived below, see this answer by whuber on Cross Validated.
While it is true that in high dimensions two randomly picked vectors are more likely to be orthogonal, this doesn’t happen in practice because text is not random(!). Even in several million dimensions the cosine measure is still an effective way to measure similarity between texts. This kind of statement is actually very interesting, popular and misleading. It is in some way related to the curse of dimensionality in kNN: many people say that in high-dimensional spaces all points tend to be close to each other, but that seldom happens with real datasets because data is not random; that’s why data is data.
Hi Luis, it all depends on the vectorisation of the text, or rather on the distribution of the vectors. You are correct that for sparse, non-negative-valued vectors (e.g. tf-idf), the dot product has different properties. However, here I am considering the case where the vectors are uniformly distributed on the sphere. This is the case for the vectorisation technique I use, after the vectors are normalised.
I am not arguing that the dot product is ineffective in a high-dimensional space; it ranks perfectly well. If the dimension were fixed, however high, it wouldn’t be a problem. The problem I have is that the actual values of the dot product (in distribution) vary so much with dimension, and the actual values are used in subsequent processing, e.g. deciding when kNN results are “similar enough” to be included in the results set. So changing the rank changes the distribution of dot products, which breaks the subsequent processing.