The initialisation of the weights in word2vec is not what I expected.

`syn1`

The weights connecting the hidden- to the output-layer are initialised to zero (in both the hierarchical softmax and the negative sampling cases).

`syn0`

The initial values for the weights connecting the input- to the hidden-layer are drawn uniformly and independently from the interval $\left[-\frac{1}{2n}, \frac{1}{2n}\right]$, where $n$ is the rank of the hidden layer (i.e. the number of hidden units).
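To make the two initialisations concrete, here is a minimal NumPy sketch. It is not the original C code: the function name `init_word2vec_weights`, the array shapes and the use of NumPy's random generator are assumptions for illustration (the C implementation uses its own random number generator and single-precision floats).

```python
import numpy as np

def init_word2vec_weights(vocab_size, n, seed=0):
    """Sketch of the word2vec-style initialisation described above.

    vocab_size and n (number of hidden units) are illustrative parameters.
    """
    rng = np.random.default_rng(seed)
    # syn0: input-to-hidden weights, uniform on [-1/(2n), 1/(2n))
    syn0 = rng.uniform(-0.5 / n, 0.5 / n, size=(vocab_size, n))
    # syn1: hidden-to-output weights, all zero
    syn1 = np.zeros((vocab_size, n))
    return syn0, syn1

syn0, syn1 = init_word2vec_weights(vocab_size=1000, n=100)
print(syn0.min(), syn0.max())  # both lie within +/- 1/(2n) = 0.005
```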

The width of the interval from which the `syn0` weights are sampled depends on the rank. I had presumed that this was to account for the dependency of the distribution of the dot product (and in particular the L2-norm) on the rank. However, when I estimate these distributions empirically, this doesn't seem to be the case.
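An empirical estimate of this kind could be reproduced roughly as follows. This is a sketch under assumptions, not the code used for the original experiment: the helper `norm_stats`, the sample size and the chosen ranks are all illustrative.

```python
import numpy as np

def norm_stats(n, num_vectors=10_000, seed=0):
    """Sample vectors initialised like syn0 rows and summarise their L2-norms."""
    rng = np.random.default_rng(seed)
    # Each row is one weight vector with entries uniform on [-1/(2n), 1/(2n))
    w = rng.uniform(-0.5 / n, 0.5 / n, size=(num_vectors, n))
    norms = np.linalg.norm(w, axis=1)
    return norms.mean(), norms.std()

# Compare the L2-norm distribution across several ranks
for n in (50, 100, 200, 400):
    mean, std = norm_stats(n)
    print(f"n={n:4d}  mean L2-norm={mean:.5f}  std={std:.5f}")
```

If the scaling by $n$ made the L2-norm independent of the rank, the printed means would be roughly constant across rows. Since the expected squared norm under this initialisation is $\frac{1}{12n}$, they should instead shrink as $n$ grows.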

According to Mikolov (in a helpful response in the word2vec Google group), the initialisation of the weights was chosen empirically, since it seemed to work well.

Questions:

- I was unable to derive an expression for the distribution of L2-norms mathematically. Can someone help with that?
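
A partial step, offered as a sketch rather than a full answer: the exact distribution of the L2-norm is awkward, but the first two moments of the squared norm follow directly from the moments of a uniform variable. Writing $a = \frac{1}{2n}$ and taking $w_i \sim U[-a, a]$ independently (the symbols $a$ and $w_i$ are introduced here for illustration):

$$
E[w_i^2] = \frac{a^2}{3}, \qquad
\operatorname{Var}(w_i^2) = E[w_i^4] - E[w_i^2]^2 = \frac{a^4}{5} - \frac{a^4}{9} = \frac{4a^4}{45}
$$

$$
E\left[\lVert w \rVert_2^2\right] = \frac{n a^2}{3} = \frac{1}{12n}, \qquad
\operatorname{Var}\left(\lVert w \rVert_2^2\right) = \frac{4 n a^4}{45} = \frac{1}{180\,n^3}
$$

For large $n$, the central limit theorem together with a first-order (delta method) expansion of the square root suggests the approximation

$$
\lVert w \rVert_2 \;\approx\; \mathcal{N}\!\left(\frac{1}{\sqrt{12n}},\; \frac{1}{60\,n^2}\right),
$$

so under these assumptions the typical norm decays like $n^{-1/2}$ rather than staying constant in the rank.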