The advantage of treating adjectives (for example) as transformations rather than as points in space is that the transformations can be applied in unseen combinations. This counters one of Chomsky’s objections to statistical modelling of language, namely that language is effectively infinite whereas language models are trained on only a finite amount of data (so are humans, but humans are supposed to be born with a universal grammar). The case considered by Baroni et al., of adjectives as linear transforms, has a couple of disadvantages, however. The first is that there are a large number of parameters to be learnt for each adjective: a full matrix, so the count grows with the square of the embedding dimension. The second is that it doesn’t capture the near-commutativity of adjectives: in most cases adjectives can be applied to a noun in different orders without significantly changing the meaning.
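To make the two disadvantages concrete, here is a minimal sketch using random stand-ins rather than learnt parameters. The dimension, the variable names and the words attached to them are all assumptions for illustration; nothing here is taken from Baroni et al.'s actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 5  # hypothetical embedding size; real models use far more

# Random stand-ins, NOT learnt representations.
noun = rng.normal(size=dim)            # pretend this is "car"
adj_red = rng.normal(size=(dim, dim))  # pretend this is "red"
adj_big = rng.normal(size=(dim, dim))  # pretend this is "big"

# Adjectives as linear transforms: apply them to the noun in either order.
big_red_car = adj_big @ (adj_red @ noun)
red_big_car = adj_red @ (adj_big @ noun)

# Disadvantage 1: one full matrix per adjective, i.e. dim ** 2 parameters.
params_per_adjective = adj_red.size

# Disadvantage 2: generic matrices do not commute, so the two orders
# give different vectors even though the phrases mean nearly the same.
order_gap = np.linalg.norm(big_red_car - red_big_car)

print(params_per_adjective, order_gap)
```

With unconstrained matrices the gap between the two orders is generally nowhere near zero, which is the behaviour the approaches discussed here try to suppress.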
I can think of several approaches for enforcing the commutativity of adjective matrices:
- simply using diagonal matrices (this reduces to one of the approaches already considered), or
- penalising the off-diagonal elements via regularisation, or
- interleaving existing parameter updates with updates that penalise (co-occurring?) adjective matrices for not commuting with one another, e.g. using the gradient of a norm of the matrix commutator AB - BA.
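The last option can be sketched as follows. This is only one possible reading of it: I assume the penalty is the squared Frobenius norm of the commutator AB - BA, and that the extra update is a plain gradient step on that penalty. The function names, the learning rate and the random matrices are hypothetical; in practice the step would be interleaved with whatever updates the model already uses.

```python
import numpy as np

def commutator(A, B):
    """Matrix commutator AB - BA; zero exactly when A and B commute."""
    return A @ B - B @ A

def commutator_penalty(A, B):
    """Assumed penalty: squared Frobenius norm of the commutator."""
    C = commutator(A, B)
    return float(np.sum(C * C))

def commutator_penalty_grads(A, B):
    """Analytic gradients of the penalty with respect to A and B.

    With C = AB - BA:
        d/dA = 2 (C B^T - B^T C)
        d/dB = 2 (A^T C - C A^T)
    """
    C = commutator(A, B)
    grad_A = 2.0 * (C @ B.T - B.T @ C)
    grad_B = 2.0 * (A.T @ C - C @ A.T)
    return grad_A, grad_B

def commutativity_step(A, B, lr=1e-3):
    """One gradient step nudging a pair of adjective matrices to commute."""
    grad_A, grad_B = commutator_penalty_grads(A, B)
    return A - lr * grad_A, B - lr * grad_B

# Toy run on random stand-ins for two adjective matrices.
rng = np.random.default_rng(0)
dim = 4
A = rng.normal(size=(dim, dim))
B = rng.normal(size=(dim, dim))

before = commutator_penalty(A, B)
for _ in range(200):
    A, B = commutativity_step(A, B)
after = commutator_penalty(A, B)

print(before, after)
```

Note that the penalty is already zero for any pair of diagonal matrices, so the first option in the list is the special case where commutativity holds by construction rather than by training.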