I picked this paper up after seeing that it had been integrated into GenSim (see also this article by the contributor to gensim and others). The authors (of the original paper) apply the PageRank algorithm to graphs constructed from text for the purposes of keyword extraction and summarisation. These two approaches they name (somewhat unnecessarily, I feel) TextRank.
You can test how it works yourself e.g. the Python implementation here.
In the case of keyword extraction, the graph has words as vertices (in the best case, only nouns and adjectives) and the (undirected) edges represent co-occurrence within a fixed length window.
In the case of summarisation, the graph has sentences as vertices and the graph is complete. The weight of each edge can be determined by any sentence similarity function. The authors consider the case where sentence similarity is measured by word overlap, normalised by sentence length. If a vectorial representation of the sentences is available, then e.g. the cosine similarity could be used instead. The authors extend the definition of PageRank to deal with weighted graphs.
The advantage of the TextRank approaches is that nothing needs to be learnt — there is no machine learning involved at all. The keyword extraction and summarisation make relatively loose assumptions about the language of the text and apply equally well to documents from unseen domains. TextRank is, however, entirely heuristic. The theory leaves off where the authors begin (that is, with PageRank). The authors do present an interesting application of PageRank, however.
- Has anyone tried the summarisation out using a vectorial representation of sentences and the cosine similarity? Other than bag-of-words?
- If we use e.g. the RBF kernel for the edge weight between two vectors , what points does PageRank tend to choose from a (multimodal) data sample? Related is perhaps this article. (See also my summary).