I’ve really enjoyed your article! So is hierarchical softmax like computing softmax([sum(left_subtree_children), sum(right_subtree_children)]) at every node?
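Not quite: in hierarchical softmax each internal node makes a binary sigmoid decision using its own learned vector (no sum over subtree children), and a word’s probability is the product of the decisions along its path. A minimal sketch, with a hypothetical two-level tree and random vectors just for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical tiny tree: a root plus two internal nodes, giving 4 leaf words.
# Each internal node n has its own vector v_n; h is the input word's vector.
rng = np.random.default_rng(0)
h = rng.normal(size=5)
node_vecs = {name: rng.normal(size=5) for name in ("root", "left", "right")}

def leaf_prob(path):
    # path: list of (node_name, go_left); P(word) = product over the path of
    # sigmoid(v_n . h) when going left, 1 - sigmoid(v_n . h) when going right.
    p = 1.0
    for node, go_left in path:
        s = sigmoid(node_vecs[node] @ h)
        p *= s if go_left else (1.0 - s)
    return p

probs = [
    leaf_prob([("root", True),  ("left", True)]),    # word 0
    leaf_prob([("root", True),  ("left", False)]),   # word 1
    leaf_prob([("root", False), ("right", True)]),   # word 2
    leaf_prob([("root", False), ("right", False)]),  # word 3
]
print(sum(probs))  # leaf probabilities sum to 1 (up to floating point)
```

Because sigmoid(x) = softmax([x, 0])[0], each node is a two-way softmax over its node vector’s score, not over subtree sums.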

It really helps in understanding the idea.

Thanks!

We show, among other things, that Skipgram is computing a sufficient dimensionality reduction factorization (à la Globerson and Tishby) of the word-word co-occurrence matrix in an online fashion. This assumes you fit the Skipgram model exactly, not with negative sampling. It’s not clear to me how negative sampling can be recast as a regularized matrix factorization problem.
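As a toy illustration of the claim’s shape (not the paper’s actual SDR objective, and offline rather than online), here is a hypothetical low-rank factorization X ≈ W Cᵀ of a small co-occurrence matrix via truncated SVD:

```python
import numpy as np

# Hypothetical 3x3 word-word co-occurrence counts, for illustration only.
X = np.array([[4., 1., 0.],
              [1., 3., 1.],
              [0., 1., 2.]])

U, s, Vt = np.linalg.svd(X)
k = 2                                # embedding dimension
W = U[:, :k] * np.sqrt(s[:k])        # "word" vectors, one row per word
C = Vt[:k].T * np.sqrt(s[:k])        # "context" vectors

approx = W @ C.T                     # best rank-k approximation of X
print(np.linalg.norm(X - approx))    # small residual: rank 2 captures most of X
```

Skipgram’s SGD updates reach a comparable factorization incrementally, one (word, context) pair at a time, instead of from the full matrix.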

Under batch GD, the gradient of the loss function L with respect to a word vector w can be expressed as a sum over all context word vectors c_i. Setting it to zero has two types of non-trivial solution:

– where all sub-gradients dl/dx_i = 0 (one for each inner product x_i = w^T c_i), which is the solution we want; or

– where not all sub-gradients are zero, but the sum S of these sub-gradients, each multiplied by its corresponding vector c_i, gives the zero vector – which is to say the (non-zero) vector of sub-gradients lies in the null space of C^T. This is a solution to the minimisation, but not one we want.

However, (mini-batched) stochastic gradient descent breaks the sum S into random “sub-sums” of sub-gradients. In the extreme case of a mini-batch size of 1, each update involves a single term (dl/dx_i) c_i; at a stationary point this must be the zero vector, and since c_i is non-zero the sub-gradient dl/dx_i must itself be zero. Thus stochasticity breaks (in expectation) the linear-combination effect that allows the unwanted solutions; the sub-gradients therefore tend to zero, and the matrix is factorized as required (or as closely as achievable, subject to dimensionality vs rank…)
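The batch-GD failure mode above can be shown with a deliberately contrived numerical example (hypothetical vectors, chosen so that c_2 = -c_1):

```python
import numpy as np

# Two linearly dependent context vectors, so C^T has a non-trivial null space.
c = np.array([[1.0, 2.0],
              [-1.0, -2.0]])        # c_2 = -c_1
g = np.array([1.0, 1.0])            # per-example sub-gradients dl/dx_i, both non-zero

batch_grad = c.T @ g                # full-batch gradient dL/dw = sum_i (dl/dx_i) c_i
print(batch_grad)                   # zero vector: batch GD is stationary here
                                    # even though no sub-gradient is zero

# With mini-batch size 1, each step uses a single term (dl/dx_i) c_i:
sgd_steps = [gi * ci for gi, ci in zip(g, c)]
print(sgd_steps)                    # each step is non-zero, so SGD keeps moving
```

The summed gradient vanishes because the sub-gradient vector (1, 1) lies in the null space of C^T, while each individual SGD step is non-zero, matching the argument above.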

Great tutorial. Can you please explain how you calculated the transition probabilities of 0.65 and 0.35 at the root node?