Dec 19, 2014

What is an Opinion-Driven Decision Support System?

Opinion-Driven Decision Support System (ODSS) refers to the use of large amounts of online opinions to facilitate business and consumer decision making. The idea is to combine the strengths of search technologies with opinion mining and analysis tools to provide a synergistic decision-making platform.

The research and engineering problems related to developing such a system include:
(1) opinion acquisition
(2) opinion based search
(3) opinion summarization
(4) presentation of results

Opinions in this case can be aggregations of user reviews, blog comments, Facebook status updates, Tweets and so on. Essentially, any opinion-containing text on specific topics or entities qualifies as a candidate for building an ODSS platform. Here's a description of some of the research and engineering problems in developing an ODSS platform:

1. Search Capabilities Based on Opinions

The goal of opinion-based search is to help users find entities of interest based on their key requirements. Since a user is often interested in choosing an entity based on opinions about that entity, a system that ranks entities based on a user's personal preferences would provide more direct support for the user's decision-making task. For example, when looking for hotels at a destination, a user may only want to consider hotels that other people thought were clean. Finding and ranking hotels by how well they satisfy such a requirement would significantly reduce the number of entities under consideration, facilitating decision making. Unlike traditional search, the query in this case is a set of preferences and the result is a set of entities that match those preferences. The challenge is to accurately match the user's preferences with existing opinions in order to recommend the best entities. This special ranking problem is referred to as Opinion-Based Entity Ranking. Many existing opinion mining techniques can potentially be used for this new ranking task. I have explored information retrieval based techniques to solve this ranking problem, and there have been a few follow-up works (from other groups) trying other approaches.
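To make the ranking idea concrete, here is a deliberately naive sketch (the class and scoring scheme are my own illustration, not the IR method from the thesis): score each entity by how often the user's preference terms occur in its aggregated opinion text, then rank entities by that score.

```java
import java.util.*;

public class OpinionEntityRanker {

  // Score an entity by counting how many of its opinion tokens
  // match the user's preference terms.
  static int score(String opinions, Set<String> prefs) {
    int score = 0;
    for (String token : opinions.toLowerCase().split("\\W+")) {
      if (prefs.contains(token)) score++;
    }
    return score;
  }

  // Rank entity names by descending preference-match score.
  // entities maps an entity name to its aggregated opinion text.
  static List<String> rank(Map<String, String> entities, List<String> preferences) {
    Set<String> prefs = new HashSet<>();
    for (String p : preferences) prefs.add(p.toLowerCase());
    List<String> names = new ArrayList<>(entities.keySet());
    names.sort((a, b) -> score(entities.get(b), prefs) - score(entities.get(a), prefs));
    return names;
  }
}
```

A real system would of course weight matches by sentiment polarity and opinion frequency rather than raw term counts, but the query-as-a-set-of-preferences structure is the same.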


2. Opinion Summarization (i.e. Sentiment Analysis + Text Summarization)

Opinion summaries play a critical role in helping users better analyze the entities under consideration (e.g. products, physicians, cars, politicians). Users are often looking for the major concerns or advantages in selecting a specific entity. Thus, a summary that can quickly highlight the key opinions about an entity would significantly help exploration of entities and aid decision making. The field of opinion summarization has long been explored, with most techniques focused on generating structured summaries over a fixed set of topics; these are referred to as structured summaries. In the last few years, textual summaries of opinions have been gaining more and more popularity. Bing Liu's Opinion Mining Tutorial covers some of these recent works, or you can refer to point (5) of this article.


3. Opinion Acquisition (i.e. Opinion or Sentiment Crawling)

To support accurate search and analysis based on opinions, opinionated content is imperative. Relying on opinions from just one source makes the information not only unreliable but also incomplete, due to variations in opinions as well as potential bias present in a specific source. Although many applications rely on large amounts of opinions, there has been very limited work on collecting and integrating a complete set of opinions. I recently explored a very simple method for collecting large amounts of opinions on arbitrary entities.


The idea of an Opinion-Driven Decision Support System (ODSS) was developed as part of my thesis. For more information on this, please see Kavita's thesis.

Nov 24, 2014

API for Word Counts and N-Gram Counts

N-grams are essentially a set of co-occurring words within a given window. They have many uses in NLP and text mining, from summarization to feature extraction in supervised machine learning tasks. Here is a basic tutorial on what n-grams are. In this article, we will focus on a Web API for generating N-Grams which is available on mashape: https://www.mashape.com/rxnlp/text-mining-and-nlp/.

These are the input parameters:

  • the text for n-gram generation
  • case-sensitive: true/false
  • the n-gram size

Below is a sample output from the API for the text "I love rainy days. How I wish it was raining ! How I wish it was snowing !". As you can see, you get the n-grams and their counts in descending order. Note that the API also does basic sentence splitting so that n-grams can be computed within sentence boundaries. Also note that this tool is language-neutral, so you can generate n-grams for multiple languages. To ensure your sentences are split correctly, you should make sure punctuation is present at least between consecutive sentences.

Nov 23, 2014

What are N-Grams?

N-grams of texts are extensively used in text mining and natural language processing tasks. They are basically a set of co-occurring words within a given window, and when computing the n-grams you typically move one word forward (although you can move X words forward in more advanced scenarios). For example, take the sentence "The cow jumps over the moon". If N=2 (known as bigrams), then the n-grams would be:
  • the cow
  • cow jumps
  • jumps over
  • over the
  • the moon
So you have 5 n-grams in this case. Notice that we moved from the->cow to cow->jumps to jumps->over, etc, essentially moving one word forward to generate the next bigram.

If N=3, the n-grams would be:
  • the cow jumps
  • cow jumps over
  • jumps over the
  • over the moon
So you have 4 n-grams in this case. When N=1, the n-grams are referred to as unigrams and are essentially the individual words in the sentence. When N=2 they are called bigrams, and when N=3 trigrams. When N>3, they are usually referred to as four-grams, five-grams and so on.

How many N-grams in a sentence?

If X = the number of words in a given sentence K, the number of n-grams for sentence K would be:

N-grams(K) = X - (N - 1)

For example, "The cow jumps over the moon" has X=6 words, so it yields 6 - (2 - 1) = 5 bigrams and 6 - (3 - 1) = 4 trigrams, matching the counts above.
What are N-grams used for?

N-grams are used for a variety of different tasks. For example, when developing a language model, n-grams are used to develop not just unigram models but also bigram and trigram models. Google and Microsoft have developed web-scale n-gram models that can be used in a variety of tasks such as spelling correction, word breaking and text summarization. Here is a publicly available web-scale n-gram model by Microsoft: http://research.microsoft.com/en-us/collaboration/focus/cs/web-ngram.aspx. Here is a paper that uses Web N-gram models for text summarization: Micropinion Generation: An Unsupervised Approach to Generating Ultra-Concise Summaries of Opinions

Another use of n-grams is for developing features for supervised machine learning models such as SVMs, MaxEnt models, Naive Bayes, etc. The idea is to use tokens such as bigrams in the feature space instead of just unigrams. But be warned: from my personal experience and various research papers that I have reviewed, the use of bigrams and trigrams in your feature space may not necessarily yield any significant improvement. The only way to know is to try it!

Java Code Block for N-gram Generation

This code block generates n-grams at the sentence level. The input consists of N (the n-gram size), sent (the sentence) and ngramList (a place to store the generated n-grams).
  
private static void generateNgrams(int N, String sent, List<String> ngramList) {
  String[] tokens = sent.split("\\s+"); //split sentence into tokens
  //GENERATE THE N-GRAMS
  for (int k = 0; k < (tokens.length - N + 1); k++) {
    StringBuilder s = new StringBuilder();
    int start = k;
    int end = k + N;
    for (int j = start; j < end; j++) {
      if (j > start) s.append(" "); //separate the tokens with a space
      s.append(tokens[j]);
    }
    //Add n-gram to the list
    ngramList.add(s.toString());
  }
}//End of method


Online API for N-gram Generation

Here is a Web API for on demand word count and N-Gram Generation

Oct 22, 2014

Demystifying Nouns in Wordnet

WordNet is an excellent resource for all sorts of NLP tasks. Having said that, understanding some of the relationships between terms can be quite difficult. For example, when you are searching for a noun, you are faced with all sorts of relationships such as holonyms, meronyms, hyponyms and hypernyms, when all you are looking for is just plain synonyms. In this article, I am going to try to explain some of these relationships for nouns. Please note that this is not WordNet specific; this is more about understanding some basic linguistics.

Relationships in WordNet


  • For nouns: Antonym, Hypernym, Instance Hypernym, Hyponym, Instance Hyponym, Member holonym, Substance holonym, Part holonym, Member meronym, Substance meronym, Part meronym, Attribute, Derivationally related form, Domain of synset (topic, region, usage), and Member of this domain (topic, region, usage). 
  • For verbs: Antonym, Hypernym, Hyponym, Entailment, Cause, Also see, Verb Group, Derivationally related form, and Domain of synset (topic, region, usage). 
  • For adjectives: Antonym, Similar to, Participle of verb, Pertainym (pertains to noun), Attribute, Also see, and Domain of synset (topic, region, usage).
  • For adverbs: Antonym, Derived from adjective, and Domain of synset (topic, region, usage)

Hyponyms and Hypernyms

Hyponym: a word or phrase that is more specific than the given word.
Hypernym: a word or phrase that is more general than the given word. 
Hyponyms have a direct relationship with hypernyms: the hyponym is the specific term and the hypernym is the more general term. Let us take the word 'limb' as an example. What is more specific than a limb?
  • arm is a kind of limb
  • leg is a kind of limb

So in this case 'arm' and 'leg' are the hyponyms and 'limb' is the hypernym because arm and leg are the more specific terms in this relationship. Here is a list of hyponyms from WordNet for 'limb' in the human limb sense:
hind limb -- (a posterior leg or homologous structure in other animals)
forelimb -- (the front limb (or homologous structure in other animals such as a flipper or wing))
flipper -- (the flat broad limb of aquatic animals specialized for swimming)
leg -- (a human limb; commonly used to refer to a whole limb but technically only the part of the limb between the knee and ankle)
crus -- (the leg from the knee to foot)
leg -- (a structure in animals that is similar to a human leg and used for locomotion)
thigh -- (the part of the leg between the hip and the knee)
arm -- (a human limb; technically the part of the superior limb between the shoulder and the elbow but commonly used to refer to the whole superior limb)
cubitus -- (the arm from the elbow to the fingertips)
forearm -- (the part of the superior limb between the elbow and the wrist)
Wikipedia article: http://en.wikipedia.org/wiki/Hyponymy_and_hypernymy

Meronyms

Meronym: something that is a part of a larger whole. Let us take 'arm' as an example. What is part of an arm?
  • a bicep is part of an arm
  • a wrist is part of an arm
Here is a list of meronyms from WordNet for 'arm' in the human limb sense:
HAS PART: brachial artery, arteria brachialis 
HAS PART: cephalic vein, vena cephalica 
HAS PART: forearm
HAS PART: hand, manus, mitt, paw
HAS PART: ulnar nerve, cubital nerve, nervus ulnaris
HAS PART: biceps brachii, musculus biceps brachii, biceps humeri
HAS PART: triceps brachii, musculus triceps brachii 
HAS PART: elbow, elbow joint, human elbow, cubitus, cubital joint, articulatio cubiti 
HAS PART: wrist, carpus, wrist joint, radiocarpal joint, articulatio radiocarpea 
HAS PART: arm bone
HAS PART: humerus
Wikipedia article: http://en.wikipedia.org/wiki/Meronymy


Holonyms

Holonym: a word that denotes the whole of which a given word is a part; basically, the opposite of a meronym. Let us again take 'arm' as an example. What is an arm a part of?
  • an arm is part of a body
  • an arm is part of a human being 
Here is a list of holonyms from WordNet for 'arm' in the human limb sense:
PART OF: body, organic structure, physical structure 
PART OF: homo, man, human being, human 
Here is another example, for the word 'toe'. What is a toe a part of? 
  • a toe is part of a human foot

Oct 20, 2014

Computing Precision and Recall for Multi-Class Classification Problems

In evaluating multi-class classification problems, we often think that the only way to evaluate performance is by computing the accuracy, which is the proportion or percentage of correctly predicted labels over all predictions. However, we can always compute precision and recall for each class label and analyze the individual performance on each class, or average the values to get overall precision and recall. Accuracy alone can be quite misleading: you may have a model with relatively 'high' accuracy that predicts the 'not so important' class labels fairly accurately (e.g. an "unknown" bucket), while making all sorts of mistakes on the classes that are actually critical to the application.


What Do Precision and Recall Tell Us?

Precision: Of all the instances predicted as a given class X, how many actually had label X?  
Recall: Of all the instances that should have label X, how many were correctly captured?

Computing Precision and Recall for the Multi-Class Problem

While it is fairly straightforward to compute precision and recall for a binary classification problem, it can be quite confusing as to how to compute these values for a multi-class classification problem. Now let's look at how to compute precision and recall for a multi-class problem.
  • First, let us assume that we have a 3-class classification problem, with labels A, B and C.
  • The first thing to do is to generate a confusion matrix as below. Many existing machine learning packages already generate the confusion matrix for you, but if you don't have that luxury, it is actually very easy to implement it yourself by keeping counters for the true positives, false positives and total number of instances for each label.
This is an example confusion matrix for the 3 labels A, B and C (the counts here are illustrative, chosen to be consistent with the computations below; rows are the gold labels, columns are the predicted labels):

              Predicted A   Predicted B   Predicted C   Total (Gold)
  Gold A           30            50            20           100
  Gold B           20            60            20           100
  Gold C           10            10            30            50
  Total            60           120            70
  • Once you have the confusion matrix, you have all the values you need to compute precision and recall for each class. Note that the values on the diagonal are always the true positives (TP).
  • Now, let us compute recall for Label A:
  • = TP_A/(TP_A+FN_A)
    = TP_A/(Total Gold for A)
    = TP_A/TotalGoldLabel_A
    = 30/100
    = 0.3
    
  • Now, let us compute precision for Label A:
  • = TP_A/(TP_A+FP_A)
    = TP_A/(Total predicted as A)
    = TP_A/TotalPredicted_A
    = 30/60
    = 0.5
  • So precision=0.5 and recall=0.3 for label A. For precision this means that, of the times label A was predicted, the system was in fact correct 50% of the time. For recall, it means that of all the instances that should have been labeled A, only 30% were correctly predicted.
     
  • Now, let us compute recall for Label B:
  • = TP_B/(TP_B+FN_B)
    = TP_B/(Total Gold for B)
    = TP_B/TotalGoldLabel_B
    = 60/100
    = 0.6
    
  • Now, let us compute precision for Label B:
  • = TP_B/(TP_B+FP_B)
    = TP_B/(Total predicted as B)
    = TP_B/TotalPredicted_B
    = 60/120
    = 0.5
  • So precision=0.5 and recall=0.6 for label B. You then just repeat this for each label in your multi-class classification problem.
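The per-label computations above can be sketched as a small Java helper (the class and method names are mine) that works off any confusion matrix, where matrix[i][j] is the number of instances with gold label i predicted as label j, so the diagonal holds the true positives:

```java
public class MultiClassPR {

  // Precision for class c: TP_c / (column sum for c), i.e. TP over all predicted as c.
  static double precision(int[][] matrix, int c) {
    int predicted = 0;
    for (int[] row : matrix) predicted += row[c];
    return predicted == 0 ? 0.0 : (double) matrix[c][c] / predicted;
  }

  // Recall for class c: TP_c / (row sum for c), i.e. TP over all gold instances of c.
  static double recall(int[][] matrix, int c) {
    int gold = 0;
    for (int count : matrix[c]) gold += count;
    return gold == 0 ? 0.0 : (double) matrix[c][c] / gold;
  }
}
```

With TP_A=30, 100 gold instances of A and 60 predictions of A, this reproduces precision 0.5 and recall 0.3 for label A, as computed above.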

The Need for a Confusion Matrix

Apart from helping with computing precision and recall, it is always important to look at the confusion matrix to analyze your results, as it also gives you very strong clues as to where your classifier is going wrong. For example, for label A you can see that the classifier incorrectly predicted label B for the majority of the mislabeled cases, which means the classifier is somehow confusing labels A and B. You could then add biasing features to improve classification of label A. In essence, the more zeroes (or the smaller the numbers) in all cells but the diagonal, the better your classifier is doing. So tweak your features and analyze your confusion matrix!



Oct 16, 2014

How do we evaluate machine generated textual summaries?

Text summarization is a hard problem! Evaluating textual summaries and summarization systems is just as hard. It is not as straightforward as computing precision and recall as in supervised learning problems. I have worked on summarization papers, and the number one comment I get is: we don't really know if your summarization method is truly effective, even though the ROUGE scores say the summaries are actually not bad. It sometimes boils down to the question of what exactly makes a summary beneficial to readers. Is it supposed to inform readers about the most essential 'event' or 'topic', or the most important opinions on a topic?

So in general, there is the question of utility and accuracy. By accuracy in this context I mean how readable, well formed and on topic the generated summaries are in comparison to a gold standard. Being on topic can usually be measured based on agreement with human-composed summaries. Well-formedness, cohesion and readability usually require a human assessor to judge the quality of the generated summaries with regard to these aspects.

The utility of summaries is mainly from the standpoint of the user: did the summary fulfill its goal? This would require a few users to read the unsummarized documents and then the summaries to decide how well the summarization system performed. You will have to come up with a creative evaluation metric for this manual assessment. For example, in one of my abstractive summarization papers, we wanted to measure the readability aspect. So we mixed the human-composed summary phrases with the system-generated summary phrases for each summarization task. We then asked a human assessor to pick out the summaries that seemed to be generated by the system (based on grammatical errors, incompleteness, etc.). We call this the readability score, and this is how it is computed:

# of correctly picked system generated phrases / total number of system generated phrases

Surprisingly, the assessor could not tell the difference in many of the cases, which meant the system summaries were fairly well formed, making it hard for the assessor to distinguish human-composed phrases from machine-generated phrases. More details on this readability test are available in the following paper:

Ganesan, Kavita, ChengXiang Zhai, and Jiawei Han. "Opinosis: a graph-based approach to abstractive summarization of highly redundant opinions."Proceedings of the 23rd International Conference on Computational Linguistics. Association for Computational Linguistics, 2010.

The most common automatic evaluation for summarization systems is ROUGE (Recall-Oriented Understudy for Gisting Evaluation), which is an adaptation of IBM's BLEU. ROUGE works by measuring the n-gram overlap between gold standard summaries and system summaries. With this it can measure topic relevance and, in some cases, fluency, especially when you use higher-order ROUGE-N. I don't really 'trust' this part because there are many ways to express the same information. So use ROUGE-N > 1 with caution.
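To illustrate the overlap computation, here is a stripped-down ROUGE-N recall sketch in Java (my own simplification: a single reference summary, whitespace tokenization, no stemming or synonym handling, with each overlap count clipped to the reference count):

```java
import java.util.*;

public class SimpleRouge {

  // Count each n-gram in the text's token sequence.
  static Map<String, Integer> ngramCounts(String text, int n) {
    String[] tokens = text.toLowerCase().split("\\s+");
    Map<String, Integer> counts = new HashMap<>();
    for (int i = 0; i <= tokens.length - n; i++) {
      String gram = String.join(" ", Arrays.copyOfRange(tokens, i, i + n));
      counts.merge(gram, 1, Integer::sum);
    }
    return counts;
  }

  // ROUGE-N recall: overlapping n-grams / total n-grams in the reference summary.
  static double rougeN(String reference, String system, int n) {
    Map<String, Integer> ref = ngramCounts(reference, n);
    Map<String, Integer> sys = ngramCounts(system, n);
    int total = 0, overlap = 0;
    for (Map.Entry<String, Integer> e : ref.entrySet()) {
      total += e.getValue();
      overlap += Math.min(e.getValue(), sys.getOrDefault(e.getKey(), 0));
    }
    return total == 0 ? 0.0 : (double) overlap / total;
  }
}
```

Note how a paraphrase of the reference can score 0 here even when it conveys the same information, which is exactly why higher-order ROUGE-N should be used with caution.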

Oct 15, 2014

Text Mining, IR and NLP References

These are some Text Mining, IR and NLP related reference materials that would be useful to anyone who is doing research and development in the area of Text Data Mining, Retrieval and Analysis. I have found many of these resources particularly useful in getting me started. Please note that this page is periodically updated.

Opinion Analysis

Survey: Opinion Mining and Sentiment Analysis [ pdf ] 
This is a fairly complete survey that covers some of the core techniques and approaches used in Opinion Mining (prior to 2008). Note that the techniques covered are the earlier ones that do not necessarily involve summarization of Tweets or short texts. In particular, it does not cover the newer body of work focusing on textual summarization of opinions.
(By Bo Pang and Lillian Lee - 2008)

Opinion Mining Tutorial [ pdf ] 
This is a nice and easy-to-follow set of slides on Opinion Mining. The main focus in these slides is the use of heuristics / data mining based approaches to  opinion mining. It does not really cover some of the more recent probabilistic / learning based approaches, but it gives a fairly good introduction to Opinion Mining. (By Bing Liu)

Survey: Opinion Mining and Summarization  [pdf] 
This survey zooms into recent research in the area of opinion summarization, which is related to generating effective summaries of opinions so that users can get a quick understanding of the underlying sentiments. Since there are various formats of summaries, the survey breaks down the approaches into the commonly studied aspect-based summarization and non-aspect-based ones (which include visualization, contrastive summarization and text summarization of opinions).  (By Kim et al - 2011)

Interesting tasks within Opinion Mining and Sentiment Analysis  [ link ] 
A one page summary of the various tasks within opinion mining and related areas. (By Kavita Ganesan)

Automatic Text Summarization

A Survey on Automatic Text Summarization  [ pdf ] 
This is a well written survey about text summarization. The focus of this survey is mainly on techniques in extractive summarization. The authors talk about techniques used in single document summarization, multi-document summarization and also include a nice section on evaluation methods. There is one section that talks about sentence compression which can be considered a form of abstractive summarization.
(Dipanjan Das and André F. T. Martins. 2007)

Text Summarization: an Overview  [ pdf ] 
This article contains nice descriptions of the various text summarization methods used. The explanation is fairly intuitive. It also attempts to classify the summarization methods in several ways (e.g. abstractive vs extractive) - which is very useful. (Elena Lloret, 2008)

Query Based Text Summarization Tutorial  [ view ] 
This is a nice deck of slides summarizing the area of text summarization. Easy to follow and has a lot of useful information.
(Mariana Damova)

Abstractive Summarization

More info here
 

Information Retrieval

Stanford IR/NLP Book  [ read online ] [ pdf ] 
A very good reference point for IR/NLP tasks. I would recommend this to anyone who is getting in to the IR field. The concepts are well explained and easy to understand. (By Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze)

Information Retrieval - a Survey  [ pdf ] 
A good survey on classic IR approaches. Topics include VSM, Bayesian, Term Weighting..etc  (By Ed Greengrass 2000)

Statistical Language Models  [ pdf ] 
This book systematically reviews the large body of literature on applying statistical language models to information retrieval with an emphasis on the underlying principles, empirically effective language models, and language models developed for non-traditional retrieval tasks. Recommended to those who want to get in-depth knowledge on language model based retrieval approaches. (By ChengXiang Zhai 2008)

Search User Interfaces  [ read online ] [ book ] 
This book talks about the design of search user interfaces, how to evaluate search interfaces, effective methods of presentation and other useful tips that relates to users of a search system. This book is ideal for those who are interested in designing, studying or improving upon search systems from the user's perspective. (By Marti Hearst 2009)

Recommended Reading for IR Research Students  [ pdf ]
This paper highlights some of the very core papers that should be read if you are in the IR field. This is a good place to start if you need a refresher or are a new student.

Faceted Search  [ pdf ] 
(By Daniel Tunkelang) 

Oct 10, 2014

Shell script to count number of words in a file


Count the number of words in all files in the current directory: 
$ wc -w *
Count the number of words in a given file: 
$ wc -w <filename>

Oct 7, 2014

Shell script to generate counts of words in descending order of term frequency

To get word counts from a text file, you don't really need to write a Java or Python program. You can actually do it using shell scripting, which is available by default if you are using Linux. If you are using Windows, you can still use shell scripting if you install Cygwin - which is basically a Unix command line for Windows. Here is an easy way of doing it:

Assuming you start with a file called my_text_file, we first transform all of its contents to lowercase (my_text_file.lowercase), then split the text so that we have one word per line (my_text_file.onewordperline). Then we sort the words, count each word's term frequency, and sort again by descending term frequency (my_text_file.countsorted). Here is the step-by-step guide:

1. First convert all capital letters to lower cases.
$ tr '[A-Z]' '[a-z]' < my_text_file > my_text_file.lowercase 
2. Split the words on a given line so that each line has only one word.
$ awk '{for (i=1;i<=NF;i++) print $i;}' my_text_file.lowercase > my_text_file.onewordperline
3. Sort all the words and then count the number of occurrences of each word.
$ sort my_text_file.onewordperline | uniq -c > my_text_file.count 
4. Sort the words in descending order of counts so you see the high frequency words.  
$ sort -rn -k1 my_text_file.count > my_text_file.countsorted
All steps above in a combined way:
$ tr '[A-Z]' '[a-z]' < my_text_file | awk '{for (i=1;i<=NF;i++) print $i;}' | sort | uniq -c |sort -rn -k1 > my_text_file.countsorted
The "|" character is called a pipe, which basically says: send the output of the previous command to the next command. The ">" symbol says: push the output into the file named on the right. You don't have to create this file beforehand; it is created automatically.

Oct 6, 2014

Constructing a Domain Specific Stop Word List

While it is fairly easy to use a published set of stop words, in many cases such stop words are completely insufficient for certain applications. For example, in clinical texts, terms like "mcg", "dr." and "patient" occur in almost every document you come across. So these terms may be regarded as potential stop words for clinical text mining and retrieval. Similarly, for tweets, terms like "#", "RT" and "@username" can potentially be regarded as stop words. Common language-specific stop word lists generally DO NOT cover such domain-specific terms.

The good news is that it is actually fairly easy to construct your own domain-specific stop word list. Assuming you have a large corpus of text from the domain of interest, you can do one or more of the following to figure out your stop words:

1. Most frequent terms as stop words

Sum the term frequencies of each unique word w across all documents in your collection. Sort the terms in descending order of raw term frequency. You can take the top N terms to be your stop words. You can also eliminate common English words (using a published stop list) prior to sorting so that you are sure you target the domain-specific stop words. The benefit of this approach is that it is very easy to implement; the downside, however, is that if you have a particularly long document, the raw term frequency from just a few documents can dominate and push a term to the top. One way to resolve this is to normalize the raw term frequency, for example by document length (i.e. the number of words in a given document).
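A minimal sketch of this approach in Java (the class and method names are mine; a real implementation would also apply the published stop list and the length normalization mentioned above):

```java
import java.util.*;

public class FrequencyStopWords {

  // Sum raw term frequencies over all documents and return the top N terms
  // as candidate domain-specific stop words.
  static List<String> topNTerms(List<String> documents, int n) {
    Map<String, Integer> tf = new HashMap<>();
    for (String doc : documents) {
      for (String token : doc.toLowerCase().split("\\W+")) {
        if (!token.isEmpty()) tf.merge(token, 1, Integer::sum);
      }
    }
    List<String> terms = new ArrayList<>(tf.keySet());
    terms.sort((a, b) -> tf.get(b) - tf.get(a)); // descending term frequency
    return terms.subList(0, Math.min(n, terms.size()));
  }
}
```

On a clinical corpus, terms like "patient" would bubble to the top of this list, exactly the domain-specific stop words a published list misses.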


2. Least frequent terms as stop words

Just as terms that are extremely frequent can be distracting rather than discriminating, terms that are extremely infrequent may also not be useful for text mining and retrieval. For example, a username "@username" that occurs only once in a collection of tweets may not be very useful. Other terms like "yoMateZ!", which could just be terms made up by people, again may not be useful for text mining and retrieval. Note that certain terms like "yaaaaayy!!" can often be normalized to standard forms such as "yay". However, if after all the normalization a term still has a frequency count of one, you could remove it. This can significantly reduce your overall feature space.


3. Terms that occur frequently across documents - low-IDF terms as stop words

Inverse document frequency (IDF) basically refers to the inverse of the fraction of documents in your collection that contain a specific term ti. Let us say you have N documents, and term ti occurs in M of the N documents. The IDF of ti is then computed as:

IDF(ti) = log(N / M)
So the more documents ti appears in, the lower its IDF score. This means terms that appear in each and every document will have an IDF score of 0 (since log 1 = 0). If you rank each ti in your collection by its IDF score in descending order, you can treat the bottom K terms with the lowest IDF scores as your stop words. Again, you can also eliminate common English words (using a published stop list) prior to sorting so that you are sure you target the domain-specific low-IDF words. This is not really necessary if your K is large enough that it will prune both general stop words as well as domain-specific stop words. You will find more information about IDFs here.
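A sketch of the IDF computation in Java (again, the names are my own): compute log(N/M) for every term and return the K terms with the lowest scores as stop word candidates.

```java
import java.util.*;

public class IdfStopWords {

  // Document frequency: the number of documents containing each unique term (M).
  static Map<String, Integer> docFreq(List<String> documents) {
    Map<String, Integer> df = new HashMap<>();
    for (String doc : documents) {
      Set<String> seen = new HashSet<>(Arrays.asList(doc.toLowerCase().split("\\W+")));
      for (String term : seen) {
        if (!term.isEmpty()) df.merge(term, 1, Integer::sum);
      }
    }
    return df;
  }

  // Return the K terms with the lowest IDF = log(N / M) scores.
  static List<String> lowestIdfTerms(List<String> documents, int k) {
    int n = documents.size();
    Map<String, Integer> df = docFreq(documents);
    Map<String, Double> idf = new HashMap<>();
    for (Map.Entry<String, Integer> e : df.entrySet()) {
      idf.put(e.getKey(), Math.log((double) n / e.getValue()));
    }
    List<String> terms = new ArrayList<>(idf.keySet());
    terms.sort(Comparator.comparingDouble(idf::get)); // ascending IDF, lowest first
    return terms.subList(0, Math.min(k, terms.size()));
  }
}
```

A term that appears in every document gets IDF 0 and therefore lands at the head of this list, making it the first stop word candidate.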


So, would stop words help my task?

So how would you know if removing domain specific stop words would be helpful in your case? Easy, test it on a subset of your data. See if whatever measure of accuracy and performance improves, stays constant or degrades. If it degrades, needless to say, don’t do it unless the degradation is negligible and you see gains in other forms such as decrease in size of model, ability to process things in memory, etc. If you see a significant drop in performance you may also want to try a minimal stop word removal approach.