The softmax function provides a convenient parameterisation of the probability distributions over a fixed number of outcomes. Using the softmax, such probability distributions can be learned parametrically using gradient methods to minimise the cross-entropy (or equivalently, the Kullback-Leibler divergence) to observed distributions. This is equivalent to maximum likelihood learning when the distributions to be learned are one-hot (i.e. we are learning for a classification task). In the notes below, the softmax parameterisation and the gradient updates with respect to the cross entropy are derived explicitly.
This material spells out section 4 of the paper of Bridle referenced below, where the softmax was first proposed as an activation function for a neural network. It was in this paper that softmax was named, moreover. The name contrasts the outputs of the function with those of the “winner-takes-all” function, whose outputs are one-hot distributions.
Bridle, J.S. (1990a). Probabilistic Interpretation of Feedforward Classification Network Outputs, with Relationships to Statistical Pattern Recognition. In: F.Fogleman Soulie and J.Herault (eds.), Neurocomputing: Algorithms, Architectures and Applications, Berlin: Springer-Verlag, pp. 227-236.