Let us say we have the following system and reference summaries:
System Summary (what the machine produced):
the cat was found under the bed
Reference Summary (gold standard, usually by humans):
the cat was under the bed
Precision and Recall in the Context of ROUGE
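In this example, the Recall would thus be:
Recall = number of overlapping words / total words in reference summary = 6 / 6 = 1.0
This means that all the words in the reference summary were captured by the system summary.
In this example, the Precision would thus be:
Precision = number of overlapping words / total words in system summary = 6 / 7 ≈ 0.86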
This simply means that 6 out of the 7 words in the system summary were in fact relevant or needed. If we had the following system summary, as opposed to the example above:
System Summary 2:
the tiny little cat was found under the big funny bed
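The Precision now becomes:
Precision = 6 / 11 ≈ 0.55
The overlap is still 6 words, but the system summary is now 11 words long, so nearly half of it is unnecessary.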
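The unigram recall and precision for both system summaries can be reproduced with a few lines of Python. This is a minimal sketch, assuming whitespace tokenization and clipped counts (a word is matched at most as many times as it occurs in the other summary); `rouge_1` is a hypothetical helper written for illustration, not a function from any ROUGE package.

```python
from collections import Counter


def rouge_1(system, reference):
    """Return (recall, precision) over unigrams (hypothetical helper)."""
    sys_counts = Counter(system.split())
    ref_counts = Counter(reference.split())
    # Counter intersection keeps the minimum count per word (clipped overlap).
    overlap = sum((sys_counts & ref_counts).values())
    ref_total = sum(ref_counts.values())
    sys_total = sum(sys_counts.values())
    recall = overlap / ref_total if ref_total else 0.0
    precision = overlap / sys_total if sys_total else 0.0
    return recall, precision


reference = "the cat was under the bed"
system_1 = "the cat was found under the bed"
system_2 = "the tiny little cat was found under the big funny bed"

for name, summary in [("System Summary", system_1), ("System Summary 2", system_2)]:
    recall, precision = rouge_1(summary, reference)
    print(f"{name}: recall={recall:.2f} precision={precision:.2f}")
# System Summary: recall=1.00 precision=0.86
# System Summary 2: recall=1.00 precision=0.55
```

Recall stays at 1.0 for the longer summary because every reference word is still present; only the precision drops, which is why the two are usually reported together.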
So What Are ROUGE-N, ROUGE-S & ROUGE-L?
System Summary:
the cat was found under the bed
Reference Summary:
the cat was under the bed
System Summary Bigrams:
the cat, cat was, was found, found under, under the, the bed
Reference Summary Bigrams:
the cat, cat was, was under, under the, the bed
Four of the six system summary bigrams (the cat, cat was, under the, the bed) also appear in the reference summary, so the ROUGE-2 precision is 4 / 6 ≈ 0.67. In other words, out of all the system summary bigrams, there is a 67% overlap with the reference summary. The ROUGE-2 recall is 4 / 5 = 0.8, since four of the five reference bigrams are recovered. Note that as the summaries (both system and reference) get longer, there tend to be fewer overlapping bigrams, especially in abstractive summarization, where sentences are not reused directly.
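The bigram comparison generalizes to any n-gram size. The following is a minimal sketch, assuming whitespace tokenization and clipped n-gram counts; `ngrams` and `rouge_n` are hypothetical helper names used only for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of consecutive n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n(system, reference, n=2):
    """Return (recall, precision) over n-grams (hypothetical helper)."""
    sys_counts = Counter(ngrams(system.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    # Clipped overlap: each n-gram counts at most as often as it occurs in both.
    overlap = sum((sys_counts & ref_counts).values())
    ref_total = sum(ref_counts.values())
    sys_total = sum(sys_counts.values())
    recall = overlap / ref_total if ref_total else 0.0
    precision = overlap / sys_total if sys_total else 0.0
    return recall, precision


system = "the cat was found under the bed"
reference = "the cat was under the bed"

print(ngrams(system.split(), 2))
recall, precision = rouge_n(system, reference, n=2)
print(f"ROUGE-2 recall={recall:.2f} precision={precision:.2f}")
# ROUGE-2 recall=0.80 precision=0.67
```

Calling `rouge_n(system, reference, n=1)` gives the unigram scores for the same pair, which makes it easy to report ROUGE-1 and ROUGE-2 side by side.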
The reason one would use ROUGE-2 (or other finer-granularity ROUGE measures) over or in conjunction with ROUGE-1 is to also show the fluency of the summaries or translation. The intuition is that if you more closely follow the word ordering of the reference summary, then your summary is actually more fluent.
Short Explanation of a few Different ROUGE measures
- ROUGE-N - measures unigram, bigram, trigram and higher order n-gram overlap
- ROUGE-L - measures the longest matching sequence of words using the longest common subsequence (LCS). An advantage of using LCS is that it does not require consecutive matches, only in-sequence matches that reflect sentence-level word order. Since it automatically includes the longest in-sequence common n-grams, you don’t need a predefined n-gram length.
- ROUGE-S - measures the overlap of any pair of words in a sentence, in order, allowing for arbitrary gaps. This is also called skip-gram co-occurrence. For example, a skip-bigram measure with a maximum skip distance of two counts the overlap of word pairs that have at most two words between them. As an example, for the phrase “cat in the hat” the skip-bigrams would be “cat in, cat the, cat hat, in the, in hat, the hat”.
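The ideas behind ROUGE-L and ROUGE-S can be sketched as follows. This is an illustrative simplification, assuming whitespace tokenization and a single sentence per summary; `lcs_length` and `skip_bigrams` are hypothetical helpers, and the official ROUGE package applies further details (such as an F-measure weighting and summary-level LCS) that are not reproduced here.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def skip_bigrams(tokens, max_gap=None):
    """In-order word pairs with at most max_gap words between them.

    max_gap=None allows arbitrary gaps.
    """
    pairs = []
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            if max_gap is None or j - i - 1 <= max_gap:
                pairs.append((tokens[i], tokens[j]))
    return pairs


system = "the cat was found under the bed".split()
reference = "the cat was under the bed".split()

lcs = lcs_length(system, reference)
print("LCS length:", lcs)                          # LCS length: 6
print("LCS-based recall:", lcs / len(reference))   # 6 / 6
print("LCS-based precision:", lcs / len(system))   # 6 / 7

print(skip_bigrams("cat in the hat".split(), max_gap=2))
```

For “cat in the hat” with `max_gap=2`, the last line prints the same six pairs listed above; lowering `max_gap` to 1 would drop the pair (cat, hat).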
ROUGE Evaluation Packages
- Perl implementation of ROUGE - this is the original implementation of ROUGE
- Java-based ROUGE - an implementation in Java that supports evaluation of Unicode texts
- JavaScript implementation of ROUGE
