The initialisation of the weights in word2vec is not what I expected.
syn1
The weights connecting the hidden- to the output-layer are initialised to zero (in both the hierarchical softmax and the negative sampling cases).
syn0
The initial values for the weights connecting the input- to the hidden-layer are drawn uniformly and independently from the interval $\left[-\frac{1}{2n}, \frac{1}{2n}\right]$, where $n$ is the rank of the hidden layer (i.e. the number of hidden units).
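As a rough illustration, the two initialisations described above can be sketched in NumPy. This is not the original C code, which uses its own random number generator; the names `init_weights`, `vocab_size` and `n`, and the choice to give both matrices the shape `(vocab_size, n)`, are assumptions made for the example.

```python
import numpy as np


def init_weights(vocab_size, n, seed=0):
    """Return (syn0, syn1) for a vocabulary of vocab_size words and rank n.

    Hypothetical helper: the names and array shapes are illustrative only.
    """
    rng = np.random.default_rng(seed)
    half_width = 0.5 / n  # the interval is [-1/(2n), 1/(2n))
    # Input-to-hidden weights: independent uniform draws.
    syn0 = rng.uniform(-half_width, half_width, size=(vocab_size, n))
    # Hidden-to-output weights: all zeros.
    syn1 = np.zeros((vocab_size, n))
    return syn0, syn1


syn0, syn1 = init_weights(vocab_size=1000, n=100)
```

Every entry of `syn0` then lies within 0.005 of zero for `n = 100`, and the interval narrows as the rank grows.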
The width of the interval from which the syn0 weights are sampled depends on the rank. I had presumed that this was to account for the dependency of the distribution of the dot product (and in particular the L2-norm) on the rank. However, when I estimated these distributions empirically, this did not seem to be the case.
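One way to run such an empirical check is sketched below, assuming the syn0 interval is $\left[-\frac{1}{2n}, \frac{1}{2n}\right]$ for rank $n$. The function name `sample_stats` and the sample size are arbitrary choices for the example, not taken from word2vec.

```python
import numpy as np


def sample_stats(n, num_pairs=2000, seed=0):
    """Sample pairs of rank-n vectors; return their dot products and L2-norms."""
    rng = np.random.default_rng(seed)
    half_width = 0.5 / n
    u = rng.uniform(-half_width, half_width, size=(num_pairs, n))
    v = rng.uniform(-half_width, half_width, size=(num_pairs, n))
    dots = np.einsum("ij,ij->i", u, v)  # row-wise dot products u_i . v_i
    norms = np.linalg.norm(u, axis=1)   # L2-norm of each row of u
    return dots, norms


for n in (10, 100, 1000):
    dots, norms = sample_stats(n)
    # Each coordinate has variance 1/(12 n^2), so E[||u||^2] = 1/(12 n).
    print(f"n={n:5d}  mean norm={norms.mean():.5f}  "
          f"sqrt(1/(12n))={np.sqrt(1 / (12 * n)):.5f}  std dot={dots.std():.2e}")
```

If the scaling were chosen to cancel the effect of the rank, the printed statistics would be roughly constant in `n`. Under this scheme the expected squared norm is $\frac{1}{12n}$, so the norms shrink as the rank grows.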
According to Mikolov (in a helpful response in the word2vec Google group), the initialisation of the weights was chosen empirically, since it seemed to work well.
Questions:
- I was unable to derive an expression for the distribution of L2-norms mathematically. Can someone help with that?