Re-parameterising for non-negativity yields multiplicative updates

Suppose you have a model that depends on real-valued parameters, and that you would like to constrain these parameters to be non-negative. For simplicity, suppose the model has a single parameter $a \in \mathbb R$. Let $E$ denote the error function. To constrain $a$ to be non-negative, parameterise $a$ as the square of a real-valued parameter $\alpha \in \mathbb R$:

$$a = \alpha^2, \quad \alpha \in \mathbb R.$$

We can now minimise $E$ by choosing $\alpha$ without constraints, e.g. by using gradient descent. Let $\lambda > 0$ be the learning rate. We have

$$\begin{aligned}
\alpha^{\text{new}} &= \alpha - \lambda \frac{\partial E}{\partial \alpha} \\
&= \alpha - \lambda \frac{\partial E}{\partial a} \frac{\partial a}{\partial \alpha} \\
&= \alpha - 2 \lambda \alpha \frac{\partial E}{\partial a} \\
&= \alpha \cdot \left(1 - 2 \lambda \frac{\partial E}{\partial a}\right)
\end{aligned}$$

by the chain rule. Thus

$$\begin{aligned}
a^{\text{new}} &= (\alpha^{\text{new}})^2 \\
&= \alpha^2 \left(1 - 2 \lambda \frac{\partial E}{\partial a}\right)^2 \\
&= a \cdot \left(1 - 2 \lambda \frac{\partial E}{\partial a}\right)^2.
\end{aligned}$$

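Thus we’ve obtained a multiplicative update rule for $a$ that is expressed in terms of $a$ and the gradient $\partial E / \partial a$ alone. Since the multiplying factor is a square, $a$ stays non-negative whenever it starts non-negative. In particular, we don’t need $\alpha$ anymore!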
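As a quick sanity check, here is a minimal Python sketch of the update rule. The error function $E(a) = (a - 3)^2$, the starting value, the learning rate, and all function names are assumptions chosen purely for illustration; none of them come from the derivation itself. The sketch runs the multiplicative update on $a$ alongside plain gradient descent on $\alpha$ and returns both results so they can be compared.

```python
def grad_E(a, target=3.0):
    """dE/da for the hypothetical error E(a) = (a - target)^2."""
    return 2.0 * (a - target)


def multiplicative_step(a, lr):
    """One update a <- a * (1 - 2*lr*dE/da)^2, written in terms of a only."""
    return a * (1.0 - 2.0 * lr * grad_E(a)) ** 2


def alpha_step(alpha, lr):
    """One gradient descent step on alpha, where a = alpha^2 (chain rule)."""
    return alpha - lr * 2.0 * alpha * grad_E(alpha ** 2)


def run(a0=1.0, lr=0.01, steps=200):
    """Run both schemes from the same start; return (a, alpha^2)."""
    a = a0
    alpha = a0 ** 0.5
    for _ in range(steps):
        a = multiplicative_step(a, lr)
        alpha = alpha_step(alpha, lr)
    return a, alpha ** 2


if __name__ == "__main__":
    a, a_from_alpha = run()
    print(a, a_from_alpha)
```

With these assumed settings both quantities should agree up to floating-point error and approach the minimiser $a = 3$. One thing the sketch makes easy to see: $a = 0$ is a fixed point of the multiplicative update, so an iterate that starts at exactly zero never moves.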
