The advantage of considering (e.g.) adjectives as transformations rather than points in space is that these transformations can be applied in unseen combinations. This counters one of Chomsky’s objections to statistical modelling of language, that is, that language is effectively infinite, whereas language models are trained on only a finite amount of data (so are humans, but humans are supposed to be born with a universal grammar). The case, considered by Baroni et al., of adjective as linear transform has a couple of disadvantages, however. The first that there are a large number of parameters to be learnt for each adjective, the second being that it doesn’t capture the near commutativity of adjectives, i.e. in most cases adjectives can be applied to a noun in different orders without significantly changing the meaning.
I can think of several approaches for enforcing the commutativity of adjective matrices:
- simply using diagonal matrices (this reduces to one of the approaches already considered), or
- penalising the off-diagonal elements via regularisation, or
- interleaving existing parameter updates with updates that penalise (co-occurring?) adjective matrices for not commuting with one another, e.g. using the gradient of the matrix commutator $AB – BA$
Eva Maria Vecchi, Marco Baroni and Roberto Zamparelli.
Presented at the workshop “Distributional Semantics and Compositionality” (2011) PDF
The authors attempt to use distributional models to distinguish between acceptable and “semantically deviant” adjective-noun combinations (an example of this distinction is given by “blue rose” vs “residential steak”). They hypothesise in particular that the length of the vector representation of the adjective-noun combination is an indication of its acceptability. Their reasoning for this hypothesis assumes that directions and in particular axes are interpretable in distributional models (this does not apply in the case of word2vec, at least). They further hypothesise that the combination will be spatially isolated with respect to the cosine similarity.
The distributional representation is derived from a POS-tagged and lemmatised corpus by considering sentence-internal co-occurrence between the vocabulary as a whole and the 10k most frequent nouns, verbs and adjectives, transformed via the “local mutual information” measure and reduced to rank 300 using PCA.
Different methods of transforming the noun representation using the adjective to obtain the adjective-noun combination are studied and the results are evaluated against human judgements of semantic deviance.