
Mar 7, 2016

Giving an IDF Effect to Probability Count of Words

The probability of an n-gram or word tells you how important that word is to a document. Unfortunately, since a document almost always has important words intermingled with conjunctions, determiners and other types of noise, using the raw probabilities may not be a good idea. One thing you can do is normalize these probabilities using a larger collection. So let us say you have 3 documents and you want to find the most important words in each. Assume that your documents look like this:

Doc1 { the cat and dog and mouse ate cheese }
Doc2 { the mouse in the house was lovely }
Doc3 { the mouse in the jungle was dead }

Now, let us look at the word counts.

Entire Collection (i.e. Doc1 + Doc2 + Doc3):

{
  "results": [
    " the:5",
    " mouse:3",
    " in:2",
    " and:2",
    " was:2",
    " house:1",
    " jungle:1",
    " cat:1",
    " lovely:1",
    " ate:1",
    " cheese:1",
    " dead:1",
    " dog:1"
  ]
}

Counts for Doc1:

Doc 1 {
  "results": [
    " and:2",
    " mouse:1",
    " cat:1",
    " ate:1",
    " the:1",
    " dog:1",
    " cheese:1"
  ]
}

Counts for Doc2: 

Doc 2 {
  "results": [
    " the:2",
    " house:1",
    " in:1",
    " mouse:1",
    " lovely:1",
    " was:1"
  ]
}

Counts for Doc3: 

Doc 3 {
  "results": [
    " the:2",
    " in:1",
    " jungle:1",
    " mouse:1",
    " dead:1",
    " was:1"
  ]
}

We will refer to the counts of the entire collection as CollectionCount and the counts for an individual document as DocumentCount. Now, if you look at the probability of the word 'the' in Doc1, it is 1/8 = 0.125, which was obtained as follows:

P(the|Doc1) = count of 'the' in Doc1 / total words in Doc1 = 1/8 = 0.125

The word 'the' appeared once in Doc1 and Doc1 has an overall word count of 8. Now, the probability of 'mouse' in Doc1 is also 1/8 = 0.125, obtained as follows:

P(mouse|Doc1) = count of 'mouse' in Doc1 / total words in Doc1 = 1/8 = 0.125

By just looking at Doc1, even though we know that the word 'mouse' is more important to Doc1 than the word 'the' ('the', as we know, is used very frequently in English), the probabilities do not accurately reflect this. A system would think that these two words are equally important because 'mouse' and 'the' have identical counts and probabilities. One way to address this is to give an IDF effect to the probabilities. Meaning, words that appear very frequently across the entire collection should be penalized more than words that appear in fewer documents on the whole. You can do this by normalizing the probability of each word by its probability in the entire collection, using the CollectionCount. So the updated scores of 'mouse' and 'the' will be computed as follows with this normalization strategy:

score(the, Doc1) = P(the|Doc1) / P(the|Collection) = (1/8) / (5/22) = 0.55
score(mouse, Doc1) = P(mouse|Doc1) / P(mouse|Collection) = (1/8) / (3/22) ≈ 0.92

There you go! Now the word 'the' has a lower score than 'mouse' even though the within-document probabilities are the same. Note that this example is based on a very small sample, so the same theory may not hold for some of the other words in the example documents. However, when you actually take a large sample, the numbers will accurately reflect the typical distribution of words in a document and in a collection of documents. This type of normalization is powerful when you rank the words/n-grams in descending order of scores. Also, take note that when your sample is really large, the probabilities will get really small, so it is always better to use the log of the probabilities as it makes analysis easier. For example, you can do the following:

score(w, Doc) = log P(w|Doc) - log P(w|Collection)

Don't be intimidated by the negative sign; when you rank the words in descending order of scores, it will all make sense.  
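To make this concrete, here is a minimal Java sketch (class and method names are my own, not from any particular library) that computes these collection-normalized scores for the three example documents:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NormalizedWordScore {

    // Count raw word frequencies in a list of tokens
    static Map<String, Integer> countWords(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
        }
        return counts;
    }

    // score(w) = P(w|document) / P(w|collection)
    static double score(String word, Map<String, Integer> docCounts, int docTotal,
                        Map<String, Integer> collCounts, int collTotal) {
        double pDoc = docCounts.getOrDefault(word, 0) / (double) docTotal;
        double pColl = collCounts.getOrDefault(word, 0) / (double) collTotal;
        return pDoc / pColl;
    }

    public static void main(String[] args) {
        List<String> doc1 = Arrays.asList("the","cat","and","dog","and","mouse","ate","cheese");
        List<String> doc2 = Arrays.asList("the","mouse","in","the","house","was","lovely");
        List<String> doc3 = Arrays.asList("the","mouse","in","the","jungle","was","dead");

        // CollectionCount: counts over Doc1 + Doc2 + Doc3 (22 words in total)
        Map<String, Integer> coll = new HashMap<>();
        int collTotal = 0;
        for (List<String> d : Arrays.asList(doc1, doc2, doc3)) {
            for (Map.Entry<String, Integer> e : countWords(d).entrySet()) {
                coll.merge(e.getKey(), e.getValue(), Integer::sum);
            }
            collTotal += d.size();
        }

        // 'mouse' now outranks 'the' even though both occur once in Doc1
        Map<String, Integer> d1 = countWords(doc1);
        System.out.printf("the=%.2f mouse=%.2f%n",
            score("the", d1, doc1.size(), coll, collTotal),
            score("mouse", d1, doc1.size(), coll, collTotal));
        // prints: the=0.55 mouse=0.92
    }
}
```

The same two helpers work unchanged for the log-score variant: just return `Math.log(pDoc) - Math.log(pColl)` instead of the ratio.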

Mar 3, 2016

Untold Truths About Text Mining and Web Mining in Practice

Truth #1: There is no such thing as "THE TOOL" for text mining


People often ask me, "So what tool do you use for your text analysis projects?" This is how it typically goes... I stare at them, not sure whether I should laugh or cry; they stare at me thinking I am crazy, and there are a few seconds of uncomfortable silence. I usually end up with a single tear in one eye and a partial, half-hearted smile... then the confused other person clarifies by saying, "actually, what I meant was, do you use NLTK or GATE?" Now the tears start to pour. Well, the truth is, it all depends on the problem! There is no magic tool or tools. If I am scraping the Web, I might use something like Nutch or write a custom crawler (no, not in Python!) that does crawling and scraping at the same time. If I am building a text classifier, I may just use Mallet with simple stemming where needed. If I am doing entity linking, I might just use similarity measures like Jaccard or Cosine. It also depends on the programming language I am working in. If I am using Python, I may choose scikit-learn for text classification instead of Mallet. So, there really is no such thing as THE TOOL for text analysis - it always depends on what you are trying to solve and in which development environment. The idea of "THE TOOL" probably comes from irresponsible advertising that promises you can solve the world's text mining problems with one tool.

Truth #2: Text Mining is 20% engineering, 40% algorithms and 40% science/statistics

If you think text mining or web mining is just pure statistics or science where you can, for example, apply black-box Machine Learning and solve the entire problem, you are in for a big surprise. Many text mining projects start with intelligent algorithms (from focused crawling, to compact text representation, to graph search), with much of the code written from scratch, coupled with some statistical modeling to solve very specific parts of the problem (e.g. CRF-based web page segmentation). Developing the custom algorithms and tying the pieces together requires some level of engineering. This is why text mining and analytics is so appealing to some: there are always going to be new challenges in every new problem you work on - whether it's engineering problems or the types of algorithms to use - it's always a fun and challenging journey. Even a change in domain can significantly change the scope of the problem.


Truth # 3: Not all data scientists can work on text mining / web mining / NLP projects

Data science is a very broad term and it encapsulates BI analysts, NLP experts and ML experts. The way data scientists are being trained today (outside academic programs) is that most of them get hands-on experience working on structured data using R or Python and are mostly trained in descriptive statistics and supervised machine learning. Text and language is a whole other ball game. The language-expert type of data scientist must have significant NLP knowledge as well as some knowledge of information retrieval, data mining and the breadth of AI, along with creative algorithm development capabilities.

Now you might be thinking: so what is the whole point of GATE and NLTK and that Alchemy API, for God's sake! GATE, NLTK and Alchemy-type NLP APIs are just some tools to help you get started with text mining projects. These tools provide some core functionality such as stemming, sentence splitting, chunking, etc. You are not required to use these tools. If you find other off-the-shelf tools that are easy to integrate into your code, go ahead and use them! If you are already working with Python, then NLTK is always an option. The bottom line is: pick, choose and use *only* what you REALLY need, not something that looks or sounds cool, as this just adds unneeded overhead and makes debugging a living nightmare.

Nov 23, 2014

What are N-Grams?

N-grams of texts are extensively used in text mining and natural language processing tasks. They are basically a set of co-occurring words within a given window, and when computing the n-grams you typically move one word forward (although you can move X words forward in more advanced scenarios). For example, consider the sentence "The cow jumps over the moon". If N=2 (known as bigrams), then the n-grams would be:
  • the cow
  • cow jumps
  • jumps over
  • over the
  • the moon
So you have 5 n-grams in this case. Notice that we moved from the->cow to cow->jumps to jumps->over, etc, essentially moving one word forward to generate the next bigram.

If N=3, the n-grams would be:
  • the cow jumps
  • cow jumps over
  • jumps over the
  • over the moon
So you have 4 n-grams in this case. When N=1, the n-grams are referred to as unigrams and are essentially the individual words in the sentence. When N=2, they are called bigrams, and when N=3, trigrams. When N>3, they are usually just referred to as four-grams, five-grams and so on.

How many N-grams in a sentence?

If X = the number of words in a given sentence K, the number of n-grams for sentence K would be:

Ngrams(K) = X - (N - 1)

For the example above, X=6 and N=2 gives 6 - 1 = 5 bigrams.

What are N-grams used for?

N-grams are used for a variety of tasks. For example, when developing a language model, n-grams are used to develop not just unigram models but also bigram and trigram models. Google and Microsoft have developed web-scale n-gram models that can be used in a variety of tasks such as spelling correction, word breaking and text summarization. Here is a publicly available web-scale n-gram model by Microsoft: http://research.microsoft.com/en-us/collaboration/focus/cs/web-ngram.aspx. Here is a paper that uses web n-gram models for text summarization: Micropinion Generation: An Unsupervised Approach to Generating Ultra-Concise Summaries of Opinions

Another use of n-grams is for developing features for supervised Machine Learning models such as SVMs, MaxEnt models, Naive Bayes, etc. The idea is to use tokens such as bigrams in the feature space instead of just unigrams. But please be warned that from my personal experience and various research papers that I have reviewed, the use of bigrams and trigrams in your feature space may not necessarily yield any significant improvement. The only way to know this is to try it!

Java Code Block for N-gram Generation

This code block generates n-grams at a sentence level. The input consists of N (the size of the n-gram), sent (the sentence) and ngramList (a list in which to store the generated n-grams).
  
private static void generateNgrams(int N, String sent, List<String> ngramList) {
  String[] tokens = sent.split("\\s+"); //split sentence into tokens
  //GENERATE THE N-GRAMS: slide a window of size N over the tokens
  for (int k = 0; k < (tokens.length - N + 1); k++) {
    StringBuilder s = new StringBuilder();
    for (int j = k; j < k + N; j++) {
      if (j > k) s.append(' '); //separate the words within an n-gram
      s.append(tokens[j]);
    }
    //Add n-gram to the list
    ngramList.add(s.toString());
  }
}//End of method
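For a quick sanity check, here is a self-contained variant of the same logic (class and method names are my own) run on the example sentence from above:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NgramDemo {

    // Generate word-level n-grams, sliding one word forward at a time
    static List<String> generateNgrams(int n, String sent) {
        String[] tokens = sent.split("\\s+");
        List<String> ngrams = new ArrayList<>();
        for (int k = 0; k <= tokens.length - n; k++) {
            ngrams.add(String.join(" ", Arrays.copyOfRange(tokens, k, k + n)));
        }
        return ngrams;
    }

    public static void main(String[] args) {
        // 6-word sentence, N=2 -> 6 - 2 + 1 = 5 bigrams
        System.out.println(generateNgrams(2, "the cow jumps over the moon"));
        // prints: [the cow, cow jumps, jumps over, over the, the moon]
    }
}
```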


Online API for N-gram Generation

Here is a Web API for on demand word count and N-Gram Generation

Oct 22, 2014

Demystifying Nouns in Wordnet

WordNet is an excellent resource for all sorts of NLP tasks. Having said that, understanding some of the relationships between terms can be quite difficult. For example, when you are searching for a noun, you are faced with all sorts of relationships such as holonyms, meronyms, hypernyms and hyponyms, when all that you are looking for is just plain synonyms. In this article, I am going to try to explain some of the relationships for nouns. Please note that this is not WordNet specific; this is more about understanding some basic linguistics.

Relationships in WordNet


  • For nouns: Antonym, Hypernym, Instance Hypernym, Hyponym, Instance Hyponym, Member holonym, Substance holonym, Part holonym, Member meronym, Substance meronym, Part meronym, Attribute, Derivationally related form, Domain of synset (topic, region, usage), and Member of this domain (topic, region, usage).
  • For verbs: Antonym, Hypernym, Hyponym, Entailment, Cause, Also see, Verb Group, Derivationally related form, and Domain of synset (topic, region, usage).
  • For adjectives: Antonym, Similar to, Participle of verb, Pertainym (pertains to noun), Attribute, Also see, and Domain of synset (topic, region, usage).
  • For adverbs: Antonym, Derived from adjective, and Domain of synset (topic, region, usage)

Hyponyms and Hypernyms

Hyponym: a word or phrase that is more specific than the given word.
Hypernym: a word or phrase that is more general than the given word.
Hyponyms have a direct relationship with hypernyms: the hyponym is the more specific term and the hypernym is the more general term. Let us take the word 'limb' as an example. What is more specific than limb?
  • arm is a kind of limb
  • leg is a kind of limb

So in this case 'arm' and 'leg' are the hyponyms and 'limb' is the hypernym because arm and leg are the more specific terms in this relationship. Here is a list of hyponyms from WordNet for 'limb' in the human limb sense:
hind limb -- (a posterior leg or homologous structure in other animals)
forelimb -- (the front limb (or homologous structure in other animals such as a flipper or wing))
flipper -- (the flat broad limb of aquatic animals specialized for swimming)
leg -- (a human limb; commonly used to refer to a whole limb but technically only the part of the limb between the knee and ankle)
crus -- (the leg from the knee to foot)
leg -- (a structure in animals that is similar to a human leg and used for locomotion)
thigh -- (the part of the leg between the hip and the knee)
arm -- (a human limb; technically the part of the superior limb between the shoulder and the elbow but commonly used to refer to the whole superior limb)
cubitus -- (the arm from the elbow to the fingertips)
forearm -- (the part of the superior limb between the elbow and the wrist)
Wikipedia article: http://en.wikipedia.org/wiki/Hyponymy_and_hypernymy

Meronyms

Meronym: a word that names a part of a larger whole. Let us take 'arm' as an example. What is part of an arm?
  • bicep is part of an arm
  • wrist is part of an arm
Here is a list of meronyms from WordNet for 'arm' in the human limb sense:
HAS PART: brachial artery, arteria brachialis 
HAS PART: cephalic vein, vena cephalica 
HAS PART: forearm
HAS PART: hand, manus, mitt, paw
HAS PART: ulnar nerve, cubital nerve, nervus ulnaris
HAS PART: biceps brachii, musculus biceps brachii, biceps humeri
HAS PART: triceps brachii, musculus triceps brachii 
HAS PART: elbow, elbow joint, human elbow, cubitus, cubital joint, articulatio cubiti 
HAS PART: wrist, carpus, wrist joint, radiocarpal joint, articulatio radiocarpea 
HAS PART: arm bone
HAS PART: humerus
Wikipedia article: http://en.wikipedia.org/wiki/Meronymy


Holonyms

Holonym: a word that represents the whole of which the given word is a part; basically, the opposite of a meronym. Let us take the same word 'arm' as an example. What is an arm a part of?
  • an arm is part of a body
  • an arm is part of a human being 
Here is a list of holonyms from WordNet for 'arm' in the human limb sense:
PART OF: body, organic structure, physical structure 
PART OF: homo, man, human being, human 
Here is another example, for the word 'toe'. What is a toe a part of? 
  • a toe is part of a human foot

Oct 15, 2014

Text Mining, IR and NLP References

These are some Text Mining, IR and NLP related reference materials that would be useful to anyone who is doing research and development in the area of Text Data Mining, Retrieval and Analysis. I have found many of these resources particularly useful in getting me started. Please note that this page is periodically updated.

Opinion Analysis

Survey: Opinion Mining and Sentiment Analysis [ pdf ] 
This is a fairly complete survey that covers some of the core techniques and approaches used in Opinion Mining (prior to 2008). Note that the techniques covered are the earlier ones that do not necessarily involve summarization of Tweets or short texts. In particular, it does not cover the newer body of work focusing on textual summarization of opinions.
(By Bo Pang and Lillian Lee - 2008)

Opinion Mining Tutorial [ pdf ] 
This is a nice and easy-to-follow set of slides on Opinion Mining. The main focus in these slides is the use of heuristics / data mining based approaches to  opinion mining. It does not really cover some of the more recent probabilistic / learning based approaches, but it gives a fairly good introduction to Opinion Mining. (By Bing Liu)

Survey: Opinion Mining and Summarization  [pdf] 
This survey zooms into recent research in the area of opinion summarization, which is related to generating effective summaries of opinions so that users can get a quick understanding of the underlying sentiments. Since there are various formats of summaries, the survey breaks down the approaches into the commonly studied aspect-based summarization and non-aspect-based ones (which include visualization, contrastive summarization and text summarization of opinions).  (By Kim et al - 2011)

Interesting tasks within Opinion Mining and Sentiment Analysis  [ link ]
A one page summary of the various tasks within opinion mining and related areas. (By Kavita Ganesan)

Automatic Text Summarization

A Survey on Automatic Text Summarization  [ pdf ]
This is a well written survey about text summarization. The focus of this survey is mainly on techniques in extractive summarization. The authors talk about techniques used in single document summarization, multi-document summarization and also include a nice section on evaluation methods. There is one section that talks about sentence compression which can be considered a form of abstractive summarization.
(Dipanjan Das and André F. T. Martins. 2007)

Text Summarization: an Overview  [ pdf ]
This article contains nice descriptions of the various text summarization methods used. The explanation is fairly intuitive. It also attempts to classify the summarization methods in several ways (e.g. abstractive vs extractive) - which is very useful. (Elena Lloret, 2008)

Query Based Text Summarization Tutorial  [ view ]
This is a nice deck of slides summarizing the area of text summarization. Easy to follow and has a lot of useful information.
(Mariana Damova)

Abstractive Summarization

More info here
 

Information Retrieval

Stanford IR/NLP Book  [ read online ] [ pdf ] 
A very good reference point for IR/NLP tasks. I would recommend this to anyone who is getting in to the IR field. The concepts are well explained and easy to understand. (By Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze)

Information Retrieval - a Survey  [ pdf ] 
A good survey on classic IR approaches. Topics include VSM, Bayesian, Term Weighting..etc  (By Ed Greengrass 2000)

Statistical Language Models  [ pdf ] 
This book systematically reviews the large body of literature on applying statistical language models to information retrieval with an emphasis on the underlying principles, empirically effective language models, and language models developed for non-traditional retrieval tasks. Recommended to those who want to get in-depth knowledge on language model based retrieval approaches. (By ChengXiang Zhai 2008)

Search User Interfaces  [ read online ] [ book ] 
This book talks about the design of search user interfaces, how to evaluate search interfaces, effective methods of presentation and other useful tips that relate to the users of a search system. This book is ideal for those who are interested in designing, studying or improving search systems from the user's perspective. (By Marti Hearst 2009)

Recommended Reading for IR Research Students  [ pdf ]
This paper highlights some of the very core papers that should be read if you are in the IR field. This is a good place to start if you need a refresher or are a new student.

Faceted Search  [ pdf ]
(By Daniel Tunkelang)

Oct 6, 2014

Constructing a Domain Specific Stop Word List

While it is fairly easy to use a published set of stop words, in many cases such stop words are completely insufficient for certain applications. For example, in clinical texts, terms like “mcg”, “dr.” and “patient” occur in almost every document that you come across. So, these terms may be regarded as potential stop words for clinical text mining and retrieval. Similarly, for tweets, terms like “#”, “RT” and “@username” can potentially be regarded as stop words. The common language-specific stop word lists generally DO NOT cover such domain specific terms.

The good news is that it is actually fairly easy to construct your own domain specific stop word list. Assuming you have a large corpus of text from the domain of interest, you can do one or more of the following to figure out your stop words:

1. Most frequent terms as stop words

Sum the term frequencies of each unique word w across all documents in your collection. Sort the terms in descending order of raw term frequency. You can take the top N terms to be your stop words. You can also eliminate common English words (using a published stop list) prior to sorting so that you are sure you target the domain specific stop words. The benefit of this approach is that it is very easy to implement; the downside, however, is that if you have a particularly long document, the raw term frequency from just a few documents can dominate and push a term to the top. One way to resolve this is to normalize the raw term frequency using a normalizer such as the document length (i.e. the number of words in a given document).
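As a sketch of this first approach (the class name and the tiny example corpus are my own, for illustration only), summing term frequencies and taking the top N might look like this:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class FrequentStopWords {

    // Sum term frequencies across all documents and return the top N terms
    static List<String> topNTerms(List<String[]> docs, int n) {
        Map<String, Integer> tf = new HashMap<>();
        for (String[] doc : docs) {
            for (String term : doc) {
                tf.merge(term.toLowerCase(), 1, Integer::sum);
            }
        }
        return tf.entrySet().stream()
                 .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                 .limit(n)
                 .map(Map.Entry::getKey)
                 .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String[]> docs = Arrays.asList(
            "the patient was given 50 mcg by dr smith".split(" "),
            "dr jones saw the patient before surgery".split(" "),
            "the patient reported no pain to dr lee".split(" "));
        // "the", "patient" and "dr" surface as stop word candidates
        System.out.println(topNTerms(docs, 3));
    }
}
```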


2. Least frequent terms as stop words

Just as terms that are extremely frequent can be distracting rather than discriminating, terms that are extremely infrequent may also not be useful for text mining and retrieval. For example, a username like “@username” that occurs only once in a collection of tweets may not be very useful. Other terms like “yoMateZ!”, which could just be terms made up by people, again may not be useful for text mining and retrieval. Note that certain terms like “yaaaaayy!!” can often be normalized to standard forms such as “yay”. However, if despite all the normalization a term still has a term frequency count of one, you could remove it. This can significantly reduce your overall feature space.


3. Terms that occur frequently across documents - low IDF terms as stop words

Inverse document frequency (IDF) basically refers to the inverse fraction of documents in your collection that contain a specific term ti. Let us say you have N documents, and term ti occurred in M of the N documents. The IDF of ti is thus computed as:

IDF(ti) = log(N / M)


So the more documents ti appears in, the lower its IDF score. This means terms that appear in each and every document will have an IDF score of 0. If you rank each ti in your collection by its IDF score in descending order, you can treat the bottom K terms with the lowest IDF scores as your stop words. Again, you can also eliminate common English words (using a published stop list) prior to sorting so that you are sure you target the domain specific low-IDF words. This is not really necessary if your K is large enough that it will prune both general stop words as well as domain specific stop words. You will find more information about IDF here.
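A minimal sketch of this IDF computation (names and the toy clinical corpus are my own; each document is represented as its set of unique terms):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class LowIdfStopWords {

    // IDF(t) = log(N / M), where N = total docs and M = docs containing t
    static Map<String, Double> idf(List<Set<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (Set<String> doc : docs) {
            for (String term : doc) {
                df.merge(term, 1, Integer::sum);
            }
        }
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            scores.put(e.getKey(), Math.log((double) docs.size() / e.getValue()));
        }
        return scores;
    }

    public static void main(String[] args) {
        List<Set<String>> docs = Arrays.asList(
            new HashSet<>(Arrays.asList("the", "patient", "was", "stable")),
            new HashSet<>(Arrays.asList("the", "patient", "has", "diabetes")),
            new HashSet<>(Arrays.asList("the", "scan", "was", "clear")));
        // "the" appears in all 3 docs -> IDF = log(3/3) = 0: a stop word candidate
        System.out.println(idf(docs).get("the"));      // prints: 0.0
        System.out.println(idf(docs).get("diabetes")); // log(3/1), about 1.1
    }
}
```

Ranking the resulting map by value and taking the bottom K entries gives the stop word candidates.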


So, would stop words help my task?

So how would you know if removing domain specific stop words would be helpful in your case? Easy: test it on a subset of your data. See whether your measure of accuracy or performance improves, stays constant or degrades. If it degrades, needless to say, don't do it, unless the degradation is negligible and you see gains in other forms such as a decrease in model size or the ability to process things in memory. If you see a significant drop in performance, you may also want to try a minimal stop word removal approach.

All About Stop Words for Text Mining and Information Retrieval

What are Stop Words?

When working with text mining applications, we often hear the term “stop words”, "stop word list" or even "stop list". Stop words are basically a set of commonly used words in a language, not just English. The reason stop words are critical to many applications is that, if we remove the words that are very commonly used in a given language, we can focus on the important words instead. For example, in the context of a search engine, suppose your search query is “how to develop information retrieval applications”. If the search engine tries to find web pages that contain the terms “how”, “to”, “develop”, “information”, “retrieval” and “applications”, it is going to find many more pages that contain the terms “how” and “to” than pages that contain information about developing information retrieval applications, because “how” and “to” are so commonly used in the English language. If we disregard these two terms, the search engine can focus on retrieving pages that contain the keywords “develop”, “information”, “retrieval” and “applications”, which would more closely bring up pages that are really of interest. This is just the basic intuition for using stop words. Stop words can be used in a whole range of tasks and these are just a few:
  1. Supervised machine learning - removing stop words from the feature space
  2. Clustering - removing stop words prior to generating clusters
  3. Information retrieval - preventing stop words from being indexed 
  4. Text summarization- excluding stop words from contributing to summarization scores & removing stop words when computing ROUGE scores
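The search-query example above boils down to a simple filter. Here is a tiny Java sketch (the stop list and class name are my own; a real application would load a published list):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class StopWordFilter {

    // A tiny illustrative stop list; real applications would use a published one
    static final Set<String> STOP = new HashSet<>(Arrays.asList("how", "to", "the", "a", "an"));

    // Drop stop words from a query or document, keeping the content words
    static List<String> removeStopWords(String text) {
        return Arrays.stream(text.toLowerCase().split("\\s+"))
                     .filter(t -> !STOP.contains(t))
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(removeStopWords("how to develop information retrieval applications"));
        // prints: [develop, information, retrieval, applications]
    }
}
```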

Types of Stop Words

Stop words are generally thought of as a “single set of words”. The term really can mean different things to different applications. For some applications, removing all stop words, from determiners (e.g. the, a, an) to prepositions (e.g. above, across, before) to some adjectives (e.g. good, nice), can be appropriate. To other applications, however, this can be detrimental. For instance, in sentiment analysis, removing adjectives such as ‘good’ and ‘nice’ as well as negations such as ‘not’ can throw algorithms off track. In such cases, one can choose to use a minimal stop list consisting of just determiners, or determiners with prepositions, or just coordinating conjunctions, depending on the needs of the application.

Examples of minimal stop word lists that you can use:

  • Determiners - Determiners tend to mark nouns; a determiner will usually be followed by a noun
    examples: the, a, an, another
  • Coordinating conjunctions - Coordinating conjunctions connect words, phrases, and clauses
    examples: for, and, nor, but, or, yet, so
  • Prepositions - Prepositions express temporal or spatial relations
    examples: in, under, towards, before

In some domain specific cases, such as clinical texts, we may want a whole different set of stop words. For example, terms like “mcg” “dr” and “patient” may have less discriminating power in building intelligent applications compared to terms such as ‘heart’ ‘failure’ and ‘diabetes’. In such cases, we can also construct domain specific stop words as opposed to using a published stop word list.

What About Stop Phrases?

Stop phrases are just like stop words, except that instead of removing individual words, you exclude entire phrases. For example, if the phrase "good item" appears very frequently in your text but has very low discriminating power or results in unwanted behavior in your results, you may choose to add it as a stop phrase. It is certainly possible to construct stop phrases the same way you construct stop words. For example, you can treat phrases with very low occurrence in your corpora as stop phrases. Similarly, you can consider phrases that occur in almost every document in your corpora to be stop phrases. 


Published Stop Word Lists

If you want to use stop words lists that have been published here are a few that you could use:
  • Snowball stop word list - this stop word list is published with the Snowball Stemmer 
  • Terrier stop word list - this is a pretty comprehensive stop word list published with the Terrier package.
  • Minimal stop word list - this is a stop word list that I compiled consisting of determiners, coordinating conjunctions and prepositions
  • Construct your own stop word list - this article basically outlines an automatic method for constructing a stop word list for your specific data set (e.g. tweets, clinical texts, etc)

Constructing Domain Specific Stop Word Lists

While it is fairly easy to use a published set of stop words, in many cases such lists are completely insufficient, as they generally do not cover domain specific terms: in clinical texts, terms like “mcg”, “dr.” and “patient” occur in almost every document, and in tweets, terms like “#”, “RT” and “@username” can potentially be regarded as stop words. Here is an article that I wrote that talks about how to construct domain specific stop word lists.