Re-parameterising for non-negativity yields multiplicative updates

Suppose you have a model that depends on real-valued parameters, and that you would like to constrain these parameters to be non-negative. For simplicity, suppose the model has a single parameter a \in \mathbb R. Let E denote the error function. To constrain a to be non-negative, parameterise a as the square of a real-valued parameter \alpha \in \mathbb R:

    \[a = \alpha^2, \quad \alpha \in \mathbb R.\]

We can now minimise E by choosing \alpha without constraints, e.g. by using gradient descent. Let \lambda > 0 be the learning rate. We have

    \begin{eqnarray*} \alpha^{\text{new}} &=& \alpha - \lambda \frac{\partial E}{\partial \alpha} \\ &=& \alpha - \lambda \textstyle{\frac{\partial E}{\partial a} \frac{\partial a}{\partial \alpha}} \\ &=& \alpha - \lambda 2 \alpha \textstyle \frac{\partial E}{\partial a} \\ &=& \alpha \cdot (1 - 2 \lambda \textstyle \frac{\partial E}{\partial a}) \end{eqnarray*}

by the chain rule. Thus

    \begin{eqnarray*} a^{\text{new}} &=& (\alpha^{\text{new}})^2 \\ &=& \alpha^2 (1 - 2 \lambda \textstyle\frac{\partial E}{\partial a})^2 \\ &=& a \cdot (1 - 2 \lambda \textstyle\frac{\partial E}{\partial a})^2. \end{eqnarray*}

We have thus obtained a multiplicative update rule for a that is expressed in terms of a alone. Since the multiplier is a square, a^{\text{new}} is non-negative whenever a is. In particular, we no longer need \alpha!
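As a quick numerical check of the derivation, the following minimal sketch runs gradient descent on \alpha and the multiplicative update on a side by side. The error function E(a) = (a - 3)^2, the starting value, the learning rate, and all function names are assumptions chosen purely for illustration; they do not come from the text above.

```python
def grad_E(a):
    # Assumed error function E(a) = (a - 3)^2, so dE/da = 2 * (a - 3).
    return 2.0 * (a - 3.0)


def step_alpha(alpha, lam):
    # Plain gradient descent on alpha, with dE/dalpha = 2 * alpha * dE/da
    # by the chain rule (a = alpha^2).
    return alpha - lam * 2.0 * alpha * grad_E(alpha ** 2)


def step_a(a, lam):
    # Multiplicative update applied to a directly; alpha is not needed.
    return a * (1.0 - 2.0 * lam * grad_E(a)) ** 2


def run(alpha0=0.5, lam=0.01, steps=200):
    # Start both iterations from the same point, a = alpha0^2.
    alpha, a = alpha0, alpha0 ** 2
    for _ in range(steps):
        alpha = step_alpha(alpha, lam)
        a = step_a(a, lam)
    # Return a as recovered from alpha, and a as updated directly.
    return alpha ** 2, a


a_from_alpha, a_direct = run()
print(a_from_alpha, a_direct)
```

Under these assumptions the two printed values should agree up to floating-point rounding, and the directly updated a never becomes negative because each step multiplies it by a square.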

Orthogonal transformations and gradient updates

We show that if the contour lines of a function are symmetric with respect to some rotation or reflection, then so is the evolution of gradient descent when minimising that function. Rotating the space on which the function is evaluated rotates each of the points visited by gradient descent in the same way (and similarly for reflections).

This ultimately comes down to showing the following: if f: \mathbb{R}^N \to \mathbb{R} is the differentiable function being minimised and g is a rotation or reflection that preserves the contours of f, then

    \begin{equation*} \nabla |_{g(u)} f = g ( \nabla |_u f) \tag{1} \end{equation*}

for all points u \in \mathbb{R}^N.
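Property (1) can be checked numerically at a single point. The sketch below is an illustration under stated assumptions, not part of the argument: the test function f(x, y) = (x^2 + y^2)^2, the point u, the rotation angle, and all function names are hypothetical choices. It measures the distance between the two sides of (1) for a rotation, a reflection, and, for contrast, a scaling by 2, which is not orthogonal.

```python
import math


def grad_f(u):
    # Gradient of the assumed test function f(x, y) = (x^2 + y^2)^2,
    # whose contours are circles centred on the origin.
    x, y = u
    r2 = x * x + y * y
    return (4.0 * r2 * x, 4.0 * r2 * y)


def rotate(theta):
    # Rotation of the plane about the origin by angle theta.
    c, s = math.cos(theta), math.sin(theta)
    return lambda u: (c * u[0] - s * u[1], s * u[0] + c * u[1])


def reflect_x(u):
    # Reflection across the x-axis.
    return (u[0], -u[1])


def scale2(u):
    # Scaling by 2: linear, but not orthogonal.
    return (2.0 * u[0], 2.0 * u[1])


def property_gap(g, u):
    # Euclidean distance between the gradient of f at g(u)
    # and g applied to the gradient of f at u.
    lhs = grad_f(g(u))
    rhs = g(grad_f(u))
    return math.hypot(lhs[0] - rhs[0], lhs[1] - rhs[1])


u = (0.7, -1.2)
gap_rotation = property_gap(rotate(0.9), u)
gap_reflection = property_gap(reflect_x, u)
gap_scaling = property_gap(scale2, u)
print(gap_rotation, gap_reflection, gap_scaling)
```

For the rotation and the reflection the gap should vanish up to rounding error, whereas for the scaling it is clearly non-zero at this point.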


The three one-dimensional examples below demonstrate that, even if the function f is symmetric with respect to all orthogonal transformations, the transformation g must itself be orthogonal for property (1) above to hold.