
Jan 26, 2017

What is ROUGE and how it works for evaluation of summaries?

ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It is essentially a set of metrics for evaluating automatic summarization of texts as well as machine translation. It works by comparing an automatically produced summary or translation against a set of reference summaries (typically human-produced).

Let us say, we have the following system and reference summaries:

System Summary (what the machine produced):

    the cat was found under the bed
    
Reference Summary (gold standard, usually produced by humans):

    the cat was under the bed
If we consider just the individual words, the number of overlapping words between the system summary and the reference summary is 6. This, however, does not tell you much as a metric. To get a good quantitative value, we can actually compute precision and recall using the overlap.

Precision and Recall in the Context of ROUGE

Simplistically put, Recall in the context of ROUGE simply means: how much of the reference summary is the system summary recovering or capturing? If we are just considering the individual words, it can be computed as:

    Recall = (number of overlapping words) / (total words in reference summary)

In this example, the Recall would thus be:

    Recall = 6 / 6 = 1.0
This means that all the words in the reference summary have been captured by the system summary, which indeed is the case for this example. Voila! This looks really good for a text summarization system. However, it does not tell you the other side of the story. A machine generated summary (system summary) can be extremely long, capturing all words in the reference summary. But many of the words in the system summary may be useless, making the summary unnecessarily verbose. This is where precision comes into play. In terms of precision, what you are essentially measuring is: how much of the system summary was in fact relevant or needed? Precision is measured as:

    Precision = (number of overlapping words) / (total words in system summary)

In this example, the Precision would thus be:

    Precision = 6 / 7 = 0.86
This simply means that 6 out of the 7 words in the system summary were in fact relevant or needed. If we had the following system summary, as opposed to the example above:

System Summary 2:


    the tiny little cat was found under the big funny bed
    
The Precision now becomes:

    Precision = 6 / 11 = 0.55

Now, this doesn't look so good, does it? That is because we have quite a few unnecessary words in the summary. The precision aspect becomes really crucial when you are trying to generate summaries that are concise in nature. Therefore, it is always best to compute both the Precision and Recall and then report the F-Measure. If your summaries are in some way forced to be concise through some constraints, then you could consider using just the Recall since precision is of less concern in this scenario.
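To make the computations above concrete, here is a minimal Java sketch (the class and method names are mine, purely for illustration) that computes word-level recall, precision and F-measure for the example summaries:

```java
import java.util.*;

public class Rouge1 {
    // Count how many reference tokens also appear in the system tokens,
    // clipping repeated words so each system token matches at most once.
    static int overlap(String[] sys, String[] ref) {
        Map<String, Integer> sysCounts = new HashMap<>();
        for (String t : sys) sysCounts.merge(t, 1, Integer::sum);
        int n = 0;
        for (String t : ref) {
            Integer c = sysCounts.get(t);
            if (c != null && c > 0) { n++; sysCounts.put(t, c - 1); }
        }
        return n;
    }

    public static void main(String[] args) {
        String[] sys = "the cat was found under the bed".split("\\s+");
        String[] ref = "the cat was under the bed".split("\\s+");
        int ov = overlap(sys, ref);                  // 6 overlapping words
        double recall = (double) ov / ref.length;    // 6/6 = 1.0
        double precision = (double) ov / sys.length; // 6/7 ~ 0.86
        double f1 = 2 * precision * recall / (precision + recall);
        System.out.printf("recall=%.2f precision=%.2f f1=%.2f%n", recall, precision, f1);
    }
}
```

Running it reproduces the recall of 1.0 and precision of roughly 0.86 computed above, plus the F-measure that combines them.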


So What is ROUGE-N, ROUGE-S & ROUGE-L ?

ROUGE-N, ROUGE-S and ROUGE-L can be thought of as the granularity of texts being compared between the system summaries and reference summaries. For example, ROUGE-1 refers to overlap of unigrams between the system summary and reference summary. ROUGE-2 refers to the overlap of bigrams between the system and reference summaries. Let's take the example from above. Let us say we want to compute the ROUGE-2 precision and recall scores. 

System Summary :
    the cat was found under the bed
Reference Summary :
    the cat was under the bed

System Summary Bigrams:
the cat, 
cat was, 
was found, 
found under, 
under the, 
the bed
Reference Summary Bigrams:
the cat, 
cat was, 
was under, 
under the, 
the bed
Based on the bigrams above, the ROUGE-2 recall is as follows:

    ROUGE-2 Recall = 4 / 5 = 0.8
Essentially, the system summary has recovered 4 of the 5 bigrams in the reference summary, which is pretty good! Now, the ROUGE-2 precision is as follows:

    ROUGE-2 Precision = 4 / 6 = 0.67
The precision here tells us that out of all the system summary bigrams, there is a 67% overlap with the reference summary.  This is not too bad either. Note that as the summaries (both system and reference summaries) get longer and longer, there will be fewer overlapping bigrams especially in the case of abstractive summarization where you are not directly re-using sentences for summarization.

The reason one would use ROUGE-1 over or in conjunction with ROUGE-2 (or other finer granularity ROUGE measures), is to also show the fluency of the summaries or translation. The intuition is that if you more closely follow the word orderings of the reference summary, then your summary is actually more fluent.
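The same word-overlap idea generalizes to any n-gram size. Here is a small sketch (the class name `RougeN` is mine, not from Lin's ROUGE toolkit) that reproduces the ROUGE-2 numbers above:

```java
import java.util.*;

public class RougeN {
    // Build the list (multiset) of n-grams for a whitespace-tokenized sentence.
    static List<String> ngrams(String text, int n) {
        String[] t = text.split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= t.length; i++)
            out.add(String.join(" ", Arrays.copyOfRange(t, i, i + n)));
        return out;
    }

    // Clipped n-gram overlap between system and reference n-grams.
    static int overlap(List<String> sys, List<String> ref) {
        Map<String, Integer> counts = new HashMap<>();
        for (String g : sys) counts.merge(g, 1, Integer::sum);
        int n = 0;
        for (String g : ref) {
            Integer c = counts.get(g);
            if (c != null && c > 0) { n++; counts.put(g, c - 1); }
        }
        return n;
    }

    public static void main(String[] args) {
        List<String> sys = ngrams("the cat was found under the bed", 2); // 6 bigrams
        List<String> ref = ngrams("the cat was under the bed", 2);       // 5 bigrams
        int ov = overlap(sys, ref);                                      // 4 shared bigrams
        System.out.printf("ROUGE-2 recall=%.2f precision=%.2f%n",
                (double) ov / ref.size(), (double) ov / sys.size());
    }
}
```

The 4 shared bigrams out of 5 reference bigrams and 6 system bigrams give the 0.8 recall and 0.67 precision computed above.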

Short Explanation of a few Different ROUGE measures

  • ROUGE-N - measures unigram, bigram, trigram and higher order n-gram overlap
  • ROUGE-L -  measures longest matching sequence of words using LCS. An advantage of using LCS is that it does not require consecutive matches but in-sequence matches that reflect sentence level word order. Since it automatically includes longest in-sequence common n-grams, you don't need a predefined n-gram length.
  • ROUGE-S - measures skip-bigram co-occurrence: any pair of words in a sentence, in sentence order, allowing for arbitrary (or bounded) gaps. For example, a skip-bigram measure with a maximum gap of two words considers the overlap of word pairs that have at most two words between them. As an example, for the phrase "cat in the hat" the skip-bigrams would be "cat in, cat the, cat hat, in the, in hat, the hat". 
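The skip-bigram example above can be generated mechanically. A short sketch (the `maxGap` parameter name is mine) that enumerates in-order word pairs with a bounded gap:

```java
import java.util.*;

public class SkipBigrams {
    // All in-order word pairs with at most maxGap words skipped between them.
    static List<String> skipBigrams(String text, int maxGap) {
        String[] t = text.split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 0; i < t.length; i++)
            for (int j = i + 1; j < t.length && j - i - 1 <= maxGap; j++)
                out.add(t[i] + " " + t[j]);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(skipBigrams("cat in the hat", 2));
        // [cat in, cat the, cat hat, in the, in hat, the hat]
    }
}
```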
For more in-depth information about these evaluation metrics you can refer to Lin's paper. Which measure to use depends on the specific task that you are trying to evaluate. If you are working on extractive summarization with fairly verbose system and reference summaries, then it may make sense to use ROUGE-1 and ROUGE-L. For very concise summaries, ROUGE-1 alone may suffice especially if you are also applying stemming and stop word removal. 

ROUGE Evaluation Packages


Papers to Read

Nov 8, 2015

What is Text Similarity?


When talking about text similarity, different people have a slightly different notion of what text similarity means. In essence, the goal is to compute how 'close' two pieces of text are in (1) meaning or (2) surface form. The first is referred to as semantic similarity and the latter as lexical similarity. Although the methods for lexical similarity are often used to achieve semantic similarity (to a certain extent), achieving true semantic similarity is often much more involved. In this article, I mainly focus on lexical similarity, as it has the most use from a practical standpoint, and then briefly introduce semantic similarity.


Lexical or Word Level Similarity

For the most part, when referring to text similarity, people actually refer to how similar two pieces of text are at the surface level. For example, how similar are the phrases "the cat ate the mouse" with "the mouse ate the cat food" by just looking at the words?  On the surface, if you consider only word level similarity, these two phrases (with determiners disregarded) appear very similar as 3 of the 4 unique words are an exact overlap.


This notion of similarity is often referred to as lexical similarity. It typically does not take into account the actual meaning behind the words or the entire phrase in context. While the actual meaning of the phrases is often disregarded, this does not mean that computing similarity in this way is ineffective. You can actually come up with creative ways of expanding the scope of each word or a set of related words (lexical chains) to improve the similarity between the two pieces of text being compared. For instance, if you are comparing similarity between phrases from newspaper articles, you could potentially use the N non-determiner words to the left and to the right of the current word in the phrase for a simple scope expansion. Instead of doing a word-for-word comparison, you are essentially providing more context. This is analogous to expanding a search query. Imagine if you were to compare the similarity between your search query and all the documents on the Web so that you can get the best results matching your query. How would you expand your query? The same thought process can be applied to improving lexical level similarity measures.

Granularity

Another point to note is that lexical similarity can be computed at various granularities. You can compute lexical similarity at the character level, word level (as shown earlier) or phrase level (or lexical chain level), where you break a piece of text into a group of related words prior to computing similarity. Character level similarity is also known as string similarity/matching and is commonly used to determine how close two strings are. For example, how close are the names 'Kavita Ganesan' and 'Kavita A Ganesan'? Pretty close! You can use the common metrics outlined below for string similarity, or you can use edit distance type measures to quantify how dissimilar two strings are. In essence, you are trying to compute the minimum number of operations required to transform one string into the other.
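As one concrete edit distance measure, here is the classic Levenshtein distance in a minimal dynamic-programming sketch (a textbook formulation, not tied to any particular library):

```java
public class EditDistance {
    // Classic Levenshtein distance: the minimum number of insertions,
    // deletions and substitutions needed to turn one string into the other.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,   // deletion
                                            d[i][j - 1] + 1),  // insertion
                                   d[i - 1][j - 1] + cost);    // substitution
            }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("Kavita Ganesan", "Kavita A Ganesan"));
    }
}
```

For 'Kavita Ganesan' vs 'Kavita A Ganesan' the distance is 2: the extra 'A' and its space must be inserted.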


Common Metrics

Some of the most common metrics for computing similarity between two pieces of text are the Jaccard coefficient, Dice coefficient and Cosine similarity, all of which have been around for a very long time. Jaccard and Dice are actually really simple as you are just dealing with sets. Here is how you can compute Jaccard:

    Jaccard(A, B) = |A ∩ B| / |A ∪ B|
Simply put, this is the intersection of the sets divided by the union of the sets. The resulting value will be in [0,1], so you can set a threshold as needed. I have found that a threshold of 0.6 and above is pretty effective in detecting phrases that are similar (with a maximum phrase length of 10 words). For longer texts, or if you only care about marginal overlap, this threshold could be much smaller. The more you normalize, pre-process and filter your text (e.g. stem, remove noise, remove stop words), the better the outcome of your text similarity measure using simple measures such as Jaccard.
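Here is a minimal Jaccard sketch over word sets, assuming only whitespace tokenization and lowercasing (in practice you would add the stemming and stop word removal mentioned above):

```java
import java.util.*;

public class Jaccard {
    // Jaccard similarity over word sets: |A intersect B| / |A union B|.
    static double jaccard(String a, String b) {
        Set<String> sa = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\s+")));
        Set<String> sb = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\s+")));
        Set<String> inter = new HashSet<>(sa);
        inter.retainAll(sb);
        Set<String> union = new HashSet<>(sa);
        union.addAll(sb);
        return (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        // 4 shared unique words out of 5 unique words total -> 0.8
        System.out.println(jaccard("the cat ate the mouse", "the mouse ate the cat food"));
    }
}
```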

Where is lexical similarity used?

Clustering - if you want to group similar texts together how can you tell if two groups of text are even similar?

Redundancy removal - if two pieces of text are very similar, why do you need both? You can always eliminate the redundant one. Think of duplicate product listings, the same person in your database with slight variations in the name, or HTML pages that are near duplicates.

Information Retrieval - you could use the more established information retrieval measures like BM25, PL2, etc. But you could also use a measure like Cosine (for longer texts) or Jaccard and Dice (for shorter texts).

Semantic Similarity

So far, we have talked about lexical similarity. Another notion of similarity mostly explored by the NLP research community is how similar in meaning are any two phrases?  If we look at the phrases, "the cat ate the mouse" and "the mouse ate the cat food", we know that while the words significantly overlap, these two phrases actually have different meaning. Getting the meaning out of the phrases is often a more difficult task as it requires deeper level of analysis. In this example, we can actually look at simple aspects like order of words: "cat==>ate==>mouse" and "mouse==>ate==>cat food". Although the words overlap in this case, the order of occurrence is different and from that we can tell that these two phrases actually have different meaning. This is just one simple example. Most people use syntactic parsing to help with semantic similarity. Let's look at the parse trees for these two phrases. What can you get from it?


You can get phrases out of the parse (e.g. "cat food"), dependency structure (e.g. mouse is the object of ate in the first case and food is the object of ate in the second case) as well as parts of speech (nouns, verbs, adjectives and etc.)  - all of which can be used in different ways to estimate semantic similarity. Semantic similarity is often used to address NLP tasks such as paraphrase identification and automatic question answering. To get a better understanding of semantic similarity and paraphrasing you can refer to some of the articles below.


Text Similarity Tools and APIs


Related Articles:

Feb 13, 2015

What is the Notion of NLP in Computer Science vs. Biomedical Informatics vs. Information Systems (MIS)?

Computer Science: In computer science research, when we talk about NLP methods we are referring to specific methods that use linguistic or structural properties of text, often coupled with statistics, to solve targeted language and text mining problems. The NLP parts are those "customized methods" that we use to solve specific problems. Thus, the notion of NLP here is rather broad, as any novel approach that does not just re-use existing tools could be considered NLP or Computational Linguistics.

Biomedical Informatics: In the pure biomedical informatics world however, I often notice that the idea of what NLP is, is quite narrow. People tend to refer to usage of tools such as NegEx Detection, POS Taggers, Parsers and Stemming as NLP. While technically, the underlying technology in these tools can involve lightweight to heavy NLP, the applications of these tools in specific problems are not necessarily "NLP". It is more of applications of NLP tools or text-processing using NLP tools. To be technically correct, only when these NLP tools are used in combination with additional learning methods or customized text mining algorithms, this would probably qualify as actual NLP, solving a specific problem.

Information Systems (MIS): The concept of what qualifies as NLP (and text mining) in the information systems (MIS) world tends to be a lot shallower and more ill-defined than in biomedical informatics or computer science: even counting word frequencies seems to be considered a form of NLP. In fact, even getting a binary classifier to work on text is considered fairly involved NLP. This may seem slightly exaggerated, but when you start reviewing papers from some of the MIS departments through TOIS or TKDE you will start to understand how ill-defined NLP can really be. 

The point here is that there are different flavors of NLP both in research and in practice, and the notion of NLP varies from department to department. So when you publish obvious methods to conferences where the audience primarily consists of core computer scientists, you will get dinged really hard for the lack of originality in your proposed methods. At the same time, if you submit a really effective algorithm that does not use existing NLP tools to a biomedical informatics journal, again you may get dinged, but this time for not using existing tools, or as a friend of mine once told me, "the reviewer had philosophical issues with my submission". Also, be warned that if a physician comes and tells you, "hey, I am also doing NLP" (I have heard quite a few of those), don't be surprised if what he or she means is running some NLP tools on some patient data. And if someone from the IS department of a business school talks to you about offering a text mining course, it usually means they are teaching you how to use R or SAS to do some text processing and querying :D, not to mention they will also refer to this as "Big Data". 

Nov 23, 2014

What are N-Grams?

N-grams of texts are extensively used in text mining and natural language processing tasks. They are basically a set of co-occurring words within a given window, and when computing the n-grams you typically move one word forward (although you can move X words forward in more advanced scenarios). For example, take the sentence "The cow jumps over the moon". If N=2 (known as bigrams), then the n-grams would be:
  • the cow
  • cow jumps
  • jumps over
  • over the
  • the moon
So you have 5 n-grams in this case. Notice that we moved from the->cow to cow->jumps to jumps->over, etc, essentially moving one word forward to generate the next bigram.

If N=3, the n-grams would be:
  • the cow jumps
  • cow jumps over
  • jumps over the
  • over the moon
So you have 4 n-grams in this case. When N=1, this is referred to as unigrams and this is essentially the individual words in a sentence. When N=2, this is called bigrams and when N=3 this is called trigrams. When N>3 this is usually referred to as four grams or five grams and so on.

How many N-grams in a sentence?

If X = the number of words in a given sentence K, the number of n-grams for sentence K would be:

    Num of N-grams = X - (N - 1)

For example, the 6-word sentence above yields 6 - (2 - 1) = 5 bigrams and 6 - (3 - 1) = 4 trigrams.
What are N-grams used for?

N-grams are used for a variety of different tasks. For example, when developing a language model, n-grams are used to develop not just unigram models but also bigram and trigram models. Google and Microsoft have developed web-scale n-gram models that can be used in a variety of tasks such as spelling correction, word breaking and text summarization. Here is a publicly available web-scale n-gram model by Microsoft: http://research.microsoft.com/en-us/collaboration/focus/cs/web-ngram.aspx. Here is a paper that uses Web N-gram models for text summarization: Micropinion Generation: An Unsupervised Approach to Generating Ultra-Concise Summaries of Opinions

Another use of n-grams is for developing features for supervised Machine Learning models such as SVMs, MaxEnt models, Naive Bayes, etc. The idea is to use tokens such as bigrams in the feature space instead of just unigrams. But please be warned that from my personal experience and various research papers that I have reviewed, the use of bigrams and trigrams in your feature space may not necessarily yield any significant improvement. The only way to know this is to try it!

Java Code Block for N-gram Generation

This code block generates n-grams at a sentence level. The input consists of N (the size of the n-gram), sent (the sentence) and ngramList (a place to store the generated n-grams).
  
private static void generateNgrams(int N, String sent, List<String> ngramList) {
  String[] tokens = sent.split("\\s+"); // split the sentence into tokens
  // generate the n-grams
  for (int k = 0; k < (tokens.length - N + 1); k++) {
    StringBuilder s = new StringBuilder();
    for (int j = k; j < k + N; j++) {
      if (j > k) s.append(" "); // separate the words within an n-gram
      s.append(tokens[j]);
    }
    // add the n-gram to the list
    ngramList.add(s.toString());
  }
} // end of method
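For reference, here is a self-contained variant that joins the tokens of each n-gram with spaces and runs the bigram example from earlier (the class and method names are just for this sketch):

```java
import java.util.*;

public class NgramDemo {
    // Self-contained n-gram generator; tokens within an n-gram are
    // joined with single spaces.
    static List<String> ngrams(String sent, int n) {
        String[] tokens = sent.split("\\s+");
        List<String> out = new ArrayList<>();
        for (int k = 0; k + n <= tokens.length; k++)
            out.add(String.join(" ", Arrays.copyOfRange(tokens, k, k + n)));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("The cow jumps over the moon", 2));
        // [The cow, cow jumps, jumps over, over the, the moon]
    }
}
```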


Online API for N-gram Generation

Here is a Web API for on demand word count and N-Gram Generation

Oct 22, 2014

Demystifying Nouns in Wordnet

WordNet is an excellent resource for all sorts of NLP tasks. Having said that, understanding some of the relationships between terms can be quite difficult. For example, when you are searching for a noun, you are faced with all sorts of relationships such as holonyms, meronyms, hyponyms and hypernyms, when all that you are looking for is just plain synonyms. In this article, I am going to try to explain some of these relationships for nouns. Please note that this is not WordNet specific; this is more about understanding some basic linguistics.

Relationships in WordNet


  • For nouns: Antonym, Hypernym, Instance Hypernym, Hyponym, Instance Hyponym, Member holonym, Substance holonym, Part holonym, Member meronym, Substance meronym, Part meronym, Attribute, Derivationally related form , Domain of synset (topic, region, usage), and Member of this domain (topic, region, usage). 
  • For verbs: Antonym, Hypernym, Hyponym, Entailment, Cause, Also see, Verb Group, Derivationally related form, and Domain of synset (topic, region, usage). 
  • For adjectives: Antonym, Similar to, Participle of verb, Pertainym (pertains to noun), Attribute, Also see and Domain of synset (topic, region, usage).
  • For adverbs: Antonym, Derived from adjective, and Domain of synset (topic, region,usage)

Hyponyms and Hypernyms

Hyponym: a word or phrase that is more specific than the given word.
Hypernym: a word or phrase that is more general than the given word. 
Hyponyms have a direct relationship with hypernyms: the hyponym is the specific term and the hypernym is the more general term. Let us take the word 'limb' as an example. What is more specific than limb?
  • arm is a kind of limb
  • leg is a kind of limb

So in this case 'arm' and 'leg' are the hyponyms and 'limb' is the hypernym because arm and leg are the more specific terms in this relationship. Here is a list of hyponyms from WordNet for 'limb' in the human limb sense:
hind limb -- (a posterior leg or homologous structure in other animals)
forelimb -- (the front limb (or homologous structure in other animals such as a flipper or wing))
flipper -- (the flat broad limb of aquatic animals specialized for swimming)
leg -- (a human limb; commonly used to refer to a whole limb but technically only the part of the limb between the knee and ankle)
crus -- (the leg from the knee to foot)
leg -- (a structure in animals that is similar to a human leg and used for locomotion)
thigh -- (the part of the leg between the hip and the knee)
arm -- (a human limb; technically the part of the superior limb between the shoulder and the elbow but commonly used to refer to the whole superior limb)
cubitus -- (the arm from the elbow to the fingertips)
forearm -- (the part of the superior limb between the elbow and the wrist)
Wikipedia article: http://en.wikipedia.org/wiki/Hyponymy_and_hypernymy

Meronyms

Meronym: a word that names a part of a larger whole. Let us take arm as an example. What is part of an arm?
  • bicep is part of an arm
  • wrist is part of an arm
Here is a list of meronyms from WordNet for 'arm' in the human limb sense:
HAS PART: brachial artery, arteria brachialis 
HAS PART: cephalic vein, vena cephalica 
HAS PART: forearm
HAS PART: hand, manus, mitt, paw
HAS PART: ulnar nerve, cubital nerve, nervus ulnaris
HAS PART: biceps brachii, musculus biceps brachii, biceps humeri
HAS PART: triceps brachii, musculus triceps brachii 
HAS PART: elbow, elbow joint, human elbow, cubitus, cubital joint, articulatio cubiti 
HAS PART: wrist, carpus, wrist joint, radiocarpal joint, articulatio radiocarpea 
HAS PART: arm bone
HAS PART: humerus
Wikipedia article: http://en.wikipedia.org/wiki/Meronymy


Holonyms

Holonym: a word that represents the whole to which a given word belongs; basically, the opposite of a meronym. Let us take the same word 'arm' as an example. What is an arm a part of?
  • an arm is part of a body
  • an arm is part of a human being 
Here is a list of holonyms from WordNet for 'arm' in the human limb sense:
PART OF: body, organic structure, physical structure 
PART OF: homo, man, human being, human 
Here is another example, for the word 'toe'. What is a toe a part of? 
  • a toe is part of a human foot

Oct 20, 2014

Computing Precision and Recall for Multi-Class Classification Problems

In evaluating multi-class classification problems, we often think that the only way to evaluate performance is by computing the accuracy, which is the proportion or percentage of correctly predicted labels over all predictions. However, we can always compute precision and recall for each class label and analyze the individual performance on class labels, or average the values to get the overall precision and recall. Accuracy alone is sometimes quite misleading: you may have a model with relatively 'high' accuracy that predicts the 'not so important' class labels fairly accurately (e.g. an "unknown" bucket) while making all sorts of mistakes on the classes that are actually critical to the application. 


What Do Precision and Recall Tell Us?

Precision: Given all the predicted labels (for a given class X), how many instances were correctly predicted?  
Recall: For all instances that should have a label X, how many of these were correctly captured?

Computing Precision and Recall for the Multi-Class Problem

While it is fairly straightforward to compute precision and recall for a binary classification problem, it can be quite confusing as to how to compute these values for a multi-class classification problem. Now let's look at how to compute precision and recall for a multi-class problem.
  • First, let us assume that we have a 3-class classification problem, with labels A, B and C.
  • The first thing to do is to generate a confusion matrix as below. Many existing machine learning packages already generate the confusion matrix for you, but if you don't have that luxury, it is actually very easy to implement it yourself by keeping counters for the true positives, false positives and total number of instances for each label.
This is an example confusion matrix for 3 labels: A, B and C.
  • Once you have the confusion matrix, you have all the values you need to compute precision and recall for each class. Note that the values in the diagonal would always be the true positives  (TP).
  • Now, let us compute recall for Label A:
  • = TP_A/(TP_A+FN_A)
    = TP_A/(Total Gold for A)
    = TP_A/TotalGoldLabel_A
    = 30/100
    = 0.3
    
  • Now, let us compute precision for Label A:
  • = TP_A/(TP_A+FP_A)
    = TP_A/(Total predicted as A)
    = TP_A/TotalPredicted_A
    = 30/60
    = 0.5
  • So precision = 0.5 and recall = 0.3 for label A. This means that for precision, out of the times label A was predicted, the system was in fact correct 50% of the time. And for recall, it means that out of all the times label A should have been predicted, only 30% were correctly predicted.
     
  • Now, let us compute recall for Label B:
  • = TP_B/(TP_B+FN_B)
    = TP_B/(Total Gold for B)
    = TP_B/TotalGoldLabel_B
    = 60/100
    = 0.6
    
  • Now, let us compute precision for Label B:
  • = TP_B/(TP_B+FP_B)
    = TP_B/(Total predicted as B)
    = TP_B/TotalPredicted_B
    = 60/120
    = 0.5
  • So precision=0.5 and recall=0.6 for label B. So you just have to repeat this for each label in your multi-class classification problem.
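The per-label computations above can be sketched directly from a confusion matrix (rows = gold labels, columns = predicted labels). The cell values below are hypothetical, chosen only to be consistent with the margins used in the text (TP_A = 30 with 100 gold and 60 predicted as A; TP_B = 60 with 100 gold and 120 predicted as B):

```java
public class MultiClassPR {
    // Precision for a label: TP / (column sum = everything predicted as that label).
    static double precision(int[][] cm, int label) {
        int predicted = 0;
        for (int[] row : cm) predicted += row[label];
        return (double) cm[label][label] / predicted;
    }

    // Recall for a label: TP / (row sum = everything that is truly that label).
    static double recall(int[][] cm, int label) {
        int gold = 0;
        for (int p : cm[label]) gold += p;
        return (double) cm[label][label] / gold;
    }

    public static void main(String[] args) {
        // Hypothetical 3-class confusion matrix; rows are gold A, B, C.
        int[][] cm = {
            {30, 60, 10},  // gold A: 100 instances, TP_A = 30
            {25, 60, 15},  // gold B: 100 instances, TP_B = 60
            { 5,  0, 95},  // gold C
        };
        System.out.printf("A: precision=%.1f recall=%.1f%n", precision(cm, 0), recall(cm, 0));
        System.out.printf("B: precision=%.1f recall=%.1f%n", precision(cm, 1), recall(cm, 1));
    }
}
```

With these cells, label A gets precision 30/60 = 0.5 and recall 30/100 = 0.3, and label B gets precision 60/120 = 0.5 and recall 60/100 = 0.6, matching the walkthrough above.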

The Need for a Confusion Matrix

Apart from helping with computing precision and recall, it is always important to look at the confusion matrix to analyze your results, as it also gives you very strong clues as to where your classifier is going wrong. For example, for Label A you can see that the classifier incorrectly assigned Label B for the majority of the mislabelled cases, which means the classifier is somehow confused between labels A and B. You could then add biasing features to improve classification of label A. In essence, the smaller the numbers in all cells off the diagonal, the better your classifier is doing. So tweak your features and analyze your confusion matrix!


Related Articles:

Oct 16, 2014

How do we evaluate machine generated textual summaries?

Text summarization is a hard problem! Evaluating textual summaries and summarization systems is just as hard. It is not as straightforward as generating precision and recall like in supervised learning problems. I have worked through summarization papers and the number one comment I get is - we don't really know if your summarization method is really effective even though the ROUGE scores say that the summaries are actually not bad. It sometimes boils down to the question of what exactly makes a summary beneficial to readers? Is it supposed to inform readers about the most essential 'event' or 'topic' or the most important opinions on a topic?

So in general, there is the question of utility and accuracy. When I say accuracy in this context, I mean how readable, well formed and on topic the generated summaries are in comparison to a gold standard. Being on topic can usually be measured based on agreement with human-composed summaries. Well-formedness, cohesion and readability usually require a human assessor to judge the quality of the generated summaries with regard to these aspects.

Utility of summaries is mainly from the standpoint of the user. Did the summary fulfill its goal? This would require a few users to read the unsummarized documents and then the summaries to decide how well the summarization system performed. You will have to come up with a creative evaluation metric for this manual assessment. For example, in one of my abstractive summarization papers, we wanted to measure the readability aspect. So, we mixed the human-composed summary phrases with the system-generated summary phrases for each summarization task. We then asked a human assessor to pick out the summaries that seemed to be generated by the system (based on grammatical errors, incompleteness, etc.). We call this the readability score, and this is how it is computed:

# of correctly picked system generated phrases / total number of system generated phrases

Surprisingly, the assessor could not tell the difference in many of the cases, which meant the system summaries were fairly well formed, making it hard for the assessor to distinguish between human-composed and machine-generated phrases. More details on this readability test are available in the following paper:

Ganesan, Kavita, ChengXiang Zhai, and Jiawei Han. "Opinosis: a graph-based approach to abstractive summarization of highly redundant opinions."Proceedings of the 23rd International Conference on Computational Linguistics. Association for Computational Linguistics, 2010.

The most common automatic evaluation for summarization systems is ROUGE (Recall-Oriented Understudy for Gisting Evaluation), which is an adaptation of IBM's BLEU. ROUGE works by measuring n-gram overlap between gold standard summaries and system summaries. With this it can measure topic relevance and, in some cases, fluency, especially when you use higher order ROUGE-N. I don't really 'trust' this part because there are many ways to express the same information. So use ROUGE-N > 1 with caution. 

Oct 15, 2014

Text Mining, IR and NLP References

These are some Text Mining, IR and NLP related reference materials that would be useful to anyone who is doing research and development in the area of Text Data Mining, Retrieval and Analysis. I have found many of these resources particularly useful in getting me started. Please note that this page is periodically updated.

Opinion Analysis

Survey: Opinion Mining and Sentiment Analysis [ pdf ] 
This is a fairly complete survey that covers some of the core techniques and approaches used in Opinion Mining (prior to 2008). Note that the techniques covered are the earlier ones that do not necessarily involve summarization of Tweets or short texts. In particular, it does not cover the newer body of work focusing on textual summarization of opinions.
(By Bo Pang and Lillian Lee - 2008)

Opinion Mining Tutorial [ pdf ] 
This is a nice and easy-to-follow set of slides on Opinion Mining. The main focus in these slides is the use of heuristics / data mining based approaches to  opinion mining. It does not really cover some of the more recent probabilistic / learning based approaches, but it gives a fairly good introduction to Opinion Mining. (By Bing Liu)

Survey: Opinion Mining and Summarization  [pdf] 
This survey zooms into recent research in the area of opinion summarization, which is related to generating effective summaries of opinions so that users can get a quick understanding of the underlying sentiments. Since there are various formats of summaries, the survey breaks down the approaches into the commonly studied aspect-based summarization and non-aspect-based ones (which include visualization, contrastive summarization and text summarization of opinions). (By Kim et al - 2011)

Interesting tasks within Opinion Mining and Sentiment Analysis [ link ] 
A one page summary of the various tasks within opinion mining and related areas. (By Kavita Ganesan)

Automatic Text Summarization

A Survey on Automatic Text Summarization [ pdf ] 
This is a well written survey about text summarization. The focus of this survey is mainly on techniques in extractive summarization. The authors talk about techniques used in single document summarization, multi-document summarization and also include a nice section on evaluation methods. There is one section that talks about sentence compression which can be considered a form of abstractive summarization.
(Dipanjan Das and André F. T. Martins. 2007)

Text Summarization: an Overview [ pdf ]
This article contains nice descriptions of the various text summarization methods used. The explanation is fairly intuitive. It also attempts to classify the summarization methods in several ways (e.g. abstractive vs extractive) - which is very useful. (Elena Lloret, 2008)

Query Based Text Summarization Tutorial [ view ]
This is a nice deck of slides summarizing the area of text summarization. Easy to follow and has a lot of useful information.
(Mariana Damova)

Abstractive Summarization

More info here
 

Information Retrieval

Stanford IR/NLP Book  [ read online ] [ pdf ] 
A very good reference point for IR/NLP tasks. I would recommend this to anyone who is getting in to the IR field. The concepts are well explained and easy to understand. (By Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze)

Information Retrieval - a Survey  [ pdf ] 
A good survey on classic IR approaches. Topics include VSM, Bayesian methods, term weighting, etc. (By Ed Greengrass 2000)

Statistical Language Models  [ pdf ] 
This book systematically reviews the large body of literature on applying statistical language models to information retrieval with an emphasis on the underlying principles, empirically effective language models, and language models developed for non-traditional retrieval tasks. Recommended to those who want to get in-depth knowledge on language model based retrieval approaches. (By ChengXiang Zhai 2008)

Search User Interfaces  [ read online ] [ book ] 
This book talks about the design of search user interfaces, how to evaluate search interfaces, effective methods of presentation, and other useful tips relating to the users of a search system. This book is ideal for those who are interested in designing, studying or improving upon search systems from the user's perspective. (By Marti Hearst 2009)

Recommended Reading for IR Research Students  [ pdf ]
This paper highlights some of the very core papers that should be read if you are in the IR field. This is a good place to start if you need a refresher or are a new student.

Faceted Search  [ pdf ]
(Daniel Tunkelang)

Oct 6, 2014

Constructing a Domain Specific Stop Word List

While it is fairly easy to use a published set of stop words, in many cases such stop words are completely insufficient for certain applications. For example, in clinical texts, terms like “mcg”, “dr.” and “patient” occur in almost every document that you come across, so these terms may be regarded as potential stop words for clinical text mining and retrieval. Similarly, for tweets, terms like “#”, “RT” and “@username” can potentially be regarded as stop words. The common language-specific stop word lists generally DO NOT cover such domain specific terms.

The good news is that it is actually fairly easy to construct your own domain specific stop word list. Assuming you have a large corpus of text from the domain of interest, you can do one or more of the following to figure out your stop words:

1. Most frequent terms as stop words

Sum the term frequencies of each unique word, w, across all documents in your collection. Sort the terms in descending order of raw term frequency. You can take the top N terms to be your stop words. You can also eliminate common English words (using a published stop list) prior to sorting so that you are sure you target the domain specific stop words. The benefit of this approach is that it is very easy to implement. The downside, however, is that with particularly long documents, the raw term frequency from just a few documents can dominate and push a term to the top. One way to resolve this is to normalize the raw term frequency using a normalizer such as the document length (i.e. the number of words in a given document).
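The approach above can be sketched in a few lines of Python. The clinical-style documents here are made-up toy data for illustration; in practice you would stream in your own corpus. Note the length normalization so a single long document cannot dominate the totals:

```python
from collections import Counter

# Hypothetical toy corpus standing in for a real domain collection.
docs = [
    "patient reports chest pain , dr. smith prescribed 40 mcg daily",
    "patient denies pain , follow up with dr. jones in two weeks",
    "patient stable , continue 20 mcg dose per dr. smith",
]

# Sum length-normalized term frequencies across all documents.
scores = Counter()
for doc in docs:
    tokens = doc.split()
    for term, count in Counter(tokens).items():
        scores[term] += count / len(tokens)

# The top N terms by aggregate normalized frequency are stop word candidates.
top_n = [term for term, _ in scores.most_common(5)]
print(top_n)
```

On this toy corpus, domain terms like “patient” and “dr.” float to the top, exactly the kind of candidates you would review before adding them to a domain stop list.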


2. Least frequent terms as stop words

Just as terms that are extremely frequent can be distracting rather than discriminating, terms that are extremely infrequent may also not be useful for text mining and retrieval. For example, a username like “@username” that occurs only once in a collection of tweets may not be very useful. Other terms like “yoMateZ!”, which could just be made-up terms, again may not be useful for text mining and retrieval. Note that certain terms like “yaaaaayy!!” can often be normalized to standard forms such as “yay”. However, if terms still have a term frequency count of one despite all the normalization, you could remove them. This could significantly reduce your overall feature space.
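A minimal sketch of this idea, on hypothetical tweets that have already been normalized: collect collection-wide term counts and flag the terms that occur exactly once (the hapax legomena) as removal candidates.

```python
from collections import Counter

# Hypothetical, already-normalized tweets.
tweets = [
    "yay the team won the game",
    "the game was great yay",
    "@some_user the ref ruined the game yoMateZ",
]

# Collection-wide term frequencies.
tf = Counter(token for tweet in tweets for token in tweet.split())

# Terms seen exactly once are candidates for removal.
rare_terms = {term for term, count in tf.items() if count == 1}
print(sorted(rare_terms))
```

Made-up tokens like “yoMateZ” and one-off usernames end up in `rare_terms`, while terms that repeat across tweets (“yay”, “game”) are kept.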


3. Terms that occur frequently across documents - low idf terms As Stop Words

Inverse document frequency (IDF) basically refers to the inverse fraction of documents in your collection that contain a specific term ti. Let us say you have N documents, and term ti occurred in M of the N documents. The IDF of ti is thus computed as:


IDF(ti) = log(N / M)


So the more documents ti appears in, the lower the IDF score; terms that appear in each and every document will have an IDF score of 0. If you rank each ti in your collection by its IDF score in descending order, you can treat the bottom K terms with the lowest IDF scores as your stop words. Again, you can also eliminate common English words (using a published stop list) prior to sorting so that you are sure you target the domain specific low-IDF words. This is not really necessary if your K is large enough that it prunes both general stop words and domain specific stop words. You will find more information about IDFs here.
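The IDF computation above can be sketched directly from the formula. The documents below are hypothetical clinical snippets; terms appearing in every document (here “patient” and “dr”) get IDF = log(N/M) = log(1) = 0 and surface at the bottom of the ranking:

```python
import math
from collections import Counter

# Hypothetical clinical snippets standing in for a real collection.
docs = [
    "patient reports chest pain dr smith prescribed 40 mcg daily",
    "patient denies pain follow up with dr jones in two weeks",
    "patient stable continue 20 mcg dose per dr smith",
]
N = len(docs)

# Document frequency M: in how many documents does each term appear?
df = Counter()
for doc in docs:
    for term in set(doc.split()):
        df[term] += 1

# IDF(ti) = log(N / M); terms appearing in every document score 0.
idf = {term: math.log(N / m) for term, m in df.items()}

# The bottom K terms (lowest IDF) are stop word candidates.
K = 3
bottom_k = sorted(idf, key=idf.get)[:K]
print(bottom_k)
```

With N = 3 documents, a term like “mcg” that occurs in 2 of them gets IDF = log(3/2) ≈ 0.405, while “patient” (in all 3) gets 0, so the low-IDF end of the ranking is exactly the domain stop word candidates.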


So, would stop words help my task?

So how would you know if removing domain specific stop words would be helpful in your case? Easy: test it on a subset of your data. See whether your measure of accuracy or performance improves, stays constant, or degrades. If it degrades, needless to say, don’t do it, unless the degradation is negligible and you see gains in other forms, such as a decrease in model size or the ability to process things in memory. If you see a significant drop in performance, you may also want to try a minimal stop word removal approach.

All About Stop Words for Text Mining and Information Retrieval

What are Stop Words?

When working with text mining applications, we often hear the terms “stop words”, “stop word list” or even “stop list”. Stop words are basically a set of commonly used words in any language, not just English. The reason stop words are critical to many applications is that, if we remove the words that are very commonly used in a given language, we can focus on the important words instead. For example, in the context of a search engine, if your search query is “how to develop information retrieval applications”, and the search engine tries to find web pages containing the terms “how”, “to”, “develop”, “information”, “retrieval”, “applications”, it is going to find many more pages that contain the terms “how” and “to” than pages that contain information about developing information retrieval applications, because the terms “how” and “to” are so commonly used in the English language. If we disregard these two terms, the search engine can instead focus on retrieving pages that contain the keywords “develop”, “information”, “retrieval”, “applications”, which would more closely bring up pages that are really of interest. This is just the basic intuition for using stop words. Stop words can be used in a whole range of tasks; these are just a few:
  1. Supervised machine learning - removing stop words from the feature space
  2. Clustering - removing stop words prior to generating clusters
  3. Information retrieval - preventing stop words from being indexed 
  4. Text summarization - excluding stop words from contributing to summarization scores & removing stop words when computing ROUGE scores
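As a minimal illustration of the first few tasks above, stop word removal during preprocessing is just a set-membership filter over tokens. The tiny stop list here is hand-picked for the example, not a recommended list:

```python
# Tiny illustrative stop list; real applications use a larger published list.
stop_words = {"how", "to", "a", "an", "the", "of"}

def remove_stop_words(text: str) -> list[str]:
    """Tokenize naively on whitespace and drop stop words."""
    return [t for t in text.lower().split() if t not in stop_words]

print(remove_stop_words("how to develop information retrieval applications"))
```

Applied to the search query from the running example, this keeps only the content-bearing keywords “develop”, “information”, “retrieval”, “applications”.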

Types of Stop Words

Stop words are generally thought of as a single, fixed set of words, but the right set can mean different things to different applications. For some applications, removing everything from determiners (e.g. the, a, an) to prepositions (e.g. above, across, before) to some adjectives (e.g. good, nice) is an appropriate stop word list. For other applications, however, this can be detrimental. For instance, in sentiment analysis, removing adjectives such as ‘good’ and ‘nice’, as well as negations such as ‘not’, can throw algorithms off track. In such cases, one can choose a minimal stop list consisting of just determiners, or determiners with prepositions, or just coordinating conjunctions, depending on the needs of the application.

Examples of minimal stop word lists that you can use:

  • Determiners - Determiners mark nouns; a determiner is usually followed by a noun
    examples: the, a, an, another
  • Coordinating conjunctions - Coordinating conjunctions connect words, phrases, and clauses
    examples: for, and, nor, but, or, yet, so
  • Prepositions - Prepositions express temporal or spatial relations
    examples: in, under, towards, before

In some domain specific cases, such as clinical texts, we may want a whole different set of stop words. For example, terms like “mcg” “dr” and “patient” may have less discriminating power in building intelligent applications compared to terms such as ‘heart’ ‘failure’ and ‘diabetes’. In such cases, we can also construct domain specific stop words as opposed to using a published stop word list.

What About Stop Phrases?

Stop phrases are just like stop words, except that instead of removing individual words, you exclude entire phrases. For example, if the phrase "good item" appears very frequently in your text but has very low discriminating power, or produces unwanted behavior in your results, you may choose to treat it as a stop phrase. It is certainly possible to construct "stop phrases" the same way you construct stop words. For example, you can treat phrases with very low occurrence in your corpora as stop phrases. Similarly, you can treat phrases that occur in almost every document in your corpora as stop phrases. 
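The second construction method (phrases occurring in almost every document) can be sketched with bigram document frequencies. The reviews and the 70% threshold below are made-up choices for illustration:

```python
from collections import Counter

# Hypothetical product reviews where "good item" is near-ubiquitous.
reviews = [
    "good item fast shipping",
    "good item would buy again",
    "good item works as described",
    "terrible item broke quickly",
]

# Count in how many documents each bigram appears.
bigram_df = Counter()
for review in reviews:
    tokens = review.split()
    for bigram in {(a, b) for a, b in zip(tokens, tokens[1:])}:
        bigram_df[bigram] += 1

# Bigrams appearing in at least 70% of documents become stop phrase candidates.
threshold = 0.7 * len(reviews)
stop_phrases = {bg for bg, count in bigram_df.items() if count >= threshold}
print(stop_phrases)
```

Here "good item" appears in 3 of the 4 reviews and crosses the threshold, while every other bigram occurs just once; the mirror-image rule (drop bigrams with a document frequency of one) would implement the low-occurrence construction.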


Published Stop Word Lists

If you want to use stop words lists that have been published here are a few that you could use:
  • Snowball stop word list - this stop word list is published with the Snowball Stemmer 
  • Terrier stop word list - this is a pretty comprehensive stop word list published with the Terrier package.
  • Minimal stop word list - this is a stop word list that I compiled consisting of determiners, coordinating conjunctions and prepositions
  • Construct your own stop word list - this article basically outlines an automatic method for constructing a stop word list for your specific data set (e.g. tweets, clinical texts, etc)

Constructing Domain Specific Stop Word Lists

While it is fairly easy to use a published set of stop words, in many cases such stop words are completely insufficient for certain applications. For example, in clinical texts, terms like “mcg”, “dr.” and “patient” occur in almost every document that you come across, so these terms may be regarded as potential stop words for clinical text mining and retrieval. Similarly, for tweets, terms like “#”, “RT” and “@username” can potentially be regarded as stop words. The common language-specific stop word lists generally DO NOT cover such domain specific terms. Here is an article that I wrote that talks about how to construct domain specific stop word lists.