When working with text mining applications, we often hear the term “stop words”, “stop word list” or even “stop list”. Stop words are simply a set of commonly used words in a language. They are eliminated from many text processing applications because they can be distracting, are non-informative (or non-discriminative), and add memory overhead.

For example, in the context of a search engine, let us assume that your search query is “how to implement the BM25 retrieval formula”. If the search engine tries to find web pages that contain the terms “how”, “to”, “implement”, “the”, “BM25”, “retrieval” and “formula”, it is going to find far more pages that contain the terms “how”, “to” and “the” than pages that contain information about implementing the BM25 formula, because “how”, “to” and “the” are so commonly used in the English language. However, if we disregard these three terms (as they are considered stop words), the search engine can focus on retrieving pages that contain the keywords “implement”, “BM25”, “retrieval” and “formula”, which is more likely to bring up pages of interest rather than noise.
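The query example above can be sketched in a few lines of Python. This is only an illustration: the `STOP_WORDS` set and the `remove_stop_words` helper are hypothetical names chosen for this sketch, the stop list is deliberately tiny, and tokenization is a plain whitespace split rather than what a real search engine would use.

```python
# Hypothetical, deliberately tiny stop list; real lists are much larger.
STOP_WORDS = {"how", "to", "the", "a", "is", "this", "of", "in", "and"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split on whitespace, and drop stop words."""
    tokens = text.lower().split()
    return [token for token in tokens if token not in stop_words]


query = "how to implement the BM25 retrieval formula"
keywords = remove_stop_words(query)
print(keywords)  # ['implement', 'bm25', 'retrieval', 'formula']
```

Only the four informative keywords remain, so matching can concentrate on the terms that actually distinguish relevant pages.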

Another example is sentiment classification. If a classifier is trying to decide whether the sentence “this is a lousy car” carries a positive or negative connotation, the positive and negative classes would both contain words like “this”, “is” and “a”, which are all common English words. By chance, one class could simply contain many more of these common terms than the other. This can draw the classifier towards the class with the higher occurrence of common English words. Thus, the stop words in this case can be distracting and can prevent a model from deciding on the correct class membership. Eliminating these common terms can do two things for the classifier: (1) learning can become much faster, since you are reducing the total number of features in use, and (2) prediction can become more accurate, since you are eliminating “noise” or distracting features.
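To make the feature-reduction point concrete, here is a minimal sketch that builds bag-of-words features for a made-up, three-sentence training set, with and without stop word removal. The sentences, the `STOP_WORDS` set and the helper names are all assumptions for illustration. The sketch only shows the vocabulary shrinking; it does not measure accuracy.

```python
from collections import Counter

# Hypothetical stop list and toy training data, for illustration only.
STOP_WORDS = {"this", "is", "a", "the", "it"}

docs = [
    ("this is a lousy car", "negative"),
    ("this is a great car", "positive"),
    ("the engine is a joke", "negative"),
]


def bag_of_words(text, stop_words=frozenset()):
    """Count tokens in the text, skipping any token in stop_words."""
    return Counter(t for t in text.lower().split() if t not in stop_words)


def vocabulary(docs, stop_words=frozenset()):
    """Collect the set of distinct features across all documents."""
    vocab = set()
    for text, _label in docs:
        vocab.update(bag_of_words(text, stop_words))
    return vocab


vocab_all = vocabulary(docs)
vocab_filtered = vocabulary(docs, STOP_WORDS)

print(len(vocab_all), len(vocab_filtered))  # 9 5
print(sorted(bag_of_words("this is a lousy car", STOP_WORDS)))  # ['car', 'lousy']
```

After filtering, the features left for “this is a lousy car” are “lousy” and “car”, and the classifier has fewer features to learn weights for.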

To summarize, stop word elimination is a simple but important aspect of many text mining applications as it has the following advantages:

  • Reduces memory overhead (since we eliminate words from consideration)
  • Reduces noise and false positives (since we are focusing on the more important terms)
  • Can potentially improve predictive power (this is dependent on the application)

A point to note is that, while many applications benefit from removing stop words, some see no added advantage other than the fact that it makes analysis or lookup much faster and reduces overall memory requirements.
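The memory and lookup savings can be illustrated with a toy inverted index, a common structure in search applications that maps each term to the pages containing it. This is a rough sketch under stated assumptions: the three page texts, the `STOP_WORDS` set and the `build_index` and `count_postings` helpers are invented for this example.

```python
from collections import defaultdict

# Hypothetical stop list and page contents, for illustration only.
STOP_WORDS = {"how", "to", "the", "a", "is", "this", "of"}

pages = {
    1: "how to implement the BM25 retrieval formula",
    2: "how to cook the perfect pasta",
    3: "the history of the bicycle",
}


def build_index(pages, stop_words=frozenset()):
    """Map each term to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for token in text.lower().split():
            if token not in stop_words:
                index[token].add(page_id)
    return index


def count_postings(index):
    """Total number of (term, page) entries stored in the index."""
    return sum(len(page_ids) for page_ids in index.values())


full_index = build_index(pages)
pruned_index = build_index(pages, STOP_WORDS)

print(len(full_index), count_postings(full_index))      # 13 17
print(len(pruned_index), count_postings(pruned_index))  # 9 9
```

Here the stop words account for 4 of the 13 terms but 8 of the 17 stored entries, because a word like “the” appears on every page. On a real collection the most common words produce by far the longest entry lists, which is where the memory and lookup savings come from.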
