What is Sentence Clustering?


The Sentence Clustering API groups sentence-level texts (e.g. from news articles, customer support emails, support tickets, blog comments, customer reviews) or short texts (e.g. Tweets, Foursquare tips, SMS text messages, Facebook status updates) into logical groups. The API not only produces meaningful clusters; it also provides topic cues that make analyzing the clusters much easier than working with unlabeled groups. For example, suppose you have a set of simple sentences that you would like to cluster:

The output from the ClusterSentences endpoint would be as follows:


As you can see above, the sentences are indeed grouped into logical buckets. The first bucket is about the terrorist attack, and the second is about CNN reporting on the crime. Each cluster has a score, which is a function of its size and topic informativeness. Each cluster also comes with topic cues under “clusterTopics” which describe the cluster. Sentences that are not found to be part of any cluster are placed under “sentences_with_no_cluster_membership”; these unclustered sentences can be thought of as clusters of size 1. In this specific example, however, all sentences were successfully clustered. The sentence ids and topic cues can be used for further analysis and can help with organizing the cluster results.

How is this different from Document Clustering?

Document clustering is about grouping similar documents (e.g. web pages) into logical groups (e.g. pages about sports, pages about entertainment, pages about politics). The textual units being clustered are much larger than sentences or short texts. Conceptually, however, the two tasks are similar. Until recently, document clustering was the focus of most research in this area. Now, with micro-format texts all over the Web, there is a need for algorithms that work well at the sentence level. The ClusterSentences endpoint uses a novel algorithm that is the result of recent research in this area. It differs from algorithms like K-means in that the goal is not only to cluster sentences (and short texts) into logical buckets but also to simultaneously generate meaningful topics for each cluster.

What type of texts can I cluster?

Essentially, any document that contains sentence-level texts. Here are a few examples:

  • News articles
  • Customer Support Emails
  • Support Tickets
  • Incident Reports
  • Tweets about a brand, product, company or person
  • Search Results or Product listings (e.g. eBay, Etsy)
  • Text messages (SMS)
  • User Reviews (e.g. Yelp, YellowPages, Urban Spoon)
  • Micro-reviews (e.g. Foursquare tips)
  • Clinical texts

Start Clustering Sentences

Before we start…

Before you start, please ensure that you have a valid API key for accessing the API.

Clustering Algorithm Key Facts

  • The clustering algorithm performs soft-clustering, meaning the same sentence or short text may appear in different clusters
  • You get meaningful labels for each cluster
  • You can use the sentence ids and topic cues to further merge clusters
  • The current algorithm has only been fine-tuned for the English language

Sentence Clustering API Request

The Sentence Clustering endpoint accepts a JSON request via POST. It takes in two parameters:

| Parameter name | Type | Required? | Values |
| --- | --- | --- | --- |
| type | text | Yes | “chunk” or “pre-sentenced” |
| text | text | Yes | a chunk of text, or pre-sentenced text (an array of sentences) |

Clustering a “chunk” or “blob” of text

Clustering a “chunk” of text means that you are leaving it to the clustering endpoint to determine sentence boundaries. We use our default sentencer to extract sentences from the text. This typically works well for news articles and other well-written texts. For short texts such as blog comments or Tweets, the “pre-sentenced” option is more appropriate. If you are sending in a concatenation of texts as a chunk (e.g. concatenated news articles or concatenated Tweets), please ensure that there is punctuation between consecutive textual units (a “.” is sufficient). Here is an example JSON request using the “chunk” option for clustering.


Example of Plain JSON request with “chunk”

This request uses the “chunk” option where a chunk of text is sent in with no sentence segmentation.
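A minimal request body using the two documented parameters might look like the following (the sentences are illustrative, not taken from actual API documentation):

```json
{
  "type": "chunk",
  "text": "A terrorist attack shook the city center. Several people were injured in the attack. CNN reported on the crime within the hour. CNN interviewed witnesses at the scene."
}
```

Note that the entire text is sent as a single string; the endpoint will segment it into sentences for you.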


Example JSON request using Unirest Java Library (with “chunk” option)

If you want to send the JSON request in Java, this is how it would look if using the third party Unirest library:
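A sketch of such a request is shown below. The endpoint URL and the API-key header name are placeholders (substitute the real values from your API account):

```java
import com.mashape.unirest.http.HttpResponse;
import com.mashape.unirest.http.JsonNode;
import com.mashape.unirest.http.Unirest;

// Hypothetical endpoint URL and header names; replace with the
// values provided with your API subscription.
HttpResponse<JsonNode> response = Unirest.post("https://api.example.com/ClusterSentences")
        .header("Content-Type", "application/json")
        .header("apikey", "YOUR_API_KEY")
        .body("{\"type\": \"chunk\", \"text\": \"A terrorist attack shook the city center. CNN reported on the crime.\"}")
        .asJson();

System.out.println(response.getBody());
```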


Example of Plain JSON request with “pre-sentenced” option

This request uses the “pre-sentenced” option, where the text is already segmented into sentence-level units, or where you have a set of short texts to cluster. Each short text (e.g. a Tweet) can be sent in as a separate sentence using this option.
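With “pre-sentenced”, the text parameter becomes an array of strings, one per sentence or short text. An illustrative request body (the sentences are our own, not from the actual documentation):

```json
{
  "type": "pre-sentenced",
  "text": [
    "A terrorist attack shook the city center.",
    "Several people were injured in the attack.",
    "CNN reported on the crime within the hour."
  ]
}
```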

Encoding Issues with JSON

Please note that failing to escape special characters appropriately, or failing to use the proper encoding, can cause “400 Bad Request” errors. It’s recommended that you use a JSON wrapper that does the encoding/decoding for you.
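As an illustration of why a JSON wrapper helps, here is a sketch using Python's standard json module (any equivalent library in your language will do the same job):

```python
import json

# Text containing characters that would break hand-built JSON:
raw_text = 'She said "hello" to the CEO.\nThen she left.'

# A JSON library escapes quotes, newlines, etc. for you:
body = json.dumps({"type": "chunk", "text": raw_text})
print(body)

# The escaped string round-trips back to the original text:
assert json.loads(body)["text"] == raw_text
```

Concatenating strings by hand, by contrast, would produce invalid JSON as soon as the text contains a quote or a line break.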

ClusterSentences API Response

The ClusterSentences API endpoint returns several values as the output:

| Parameter name | Type | Short Description |
| --- | --- | --- |
| clusterScore | double | score of the cluster; a function of its topics and size |
| clusterSize | integer | the number of sentences or short texts in the cluster |
| clusterTopics | text | meaningful labels describing the cluster |
| clusteredSentences | array of texts | list of sentences, with corresponding ids, that are part of the cluster |

Cluster Score (clusterScore)

The cluster score is a function of the topic meaningfulness and the size of the cluster. It can be used to rank clusters or to prune unwanted ones. For example, if you have 500 clusters, you can choose to keep only the top 100.
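Ranking and pruning by score can be sketched as follows; the list-of-dicts shape and the topic strings are hypothetical, but the clusterScore field is the one documented above:

```python
# Hypothetical parsed response: one dict per cluster.
clusters = [
    {"clusterScore": 0.91, "clusterTopics": "customer service"},
    {"clusterScore": 0.12, "clusterTopics": "misc"},
    {"clusterScore": 0.55, "clusterTopics": "fees"},
]

# Rank clusters by score, highest first, and keep only the top N (here N = 2).
top = sorted(clusters, key=lambda c: c["clusterScore"], reverse=True)[:2]
print([c["clusterTopics"] for c in top])  # → ['customer service', 'fees']
```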

Cluster Size (clusterSize)

This reflects the number of sentences within the cluster. A larger cluster does not necessarily mean better quality. In fact, if all of your sentences relate to a single topic, you may get back one large cluster rather than several meaningful ones. This defeats the purpose of clustering, and you may have to re-analyze your input or ignore clusters whose size deviates significantly from that of the others.

Cluster Topics (clusterTopics)

This is the **label** or **topic** for the cluster. The topics try to describe the contents of the cluster. In the example response below, you can see that the first topic relates to the customer service of Citibank, which is thought to be lousy. The different topics are separated by commas, and each topic has a corresponding score. You can choose to use the single best topic to represent the cluster, or the top N most diverse topics.
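Picking the best topic can be sketched as below. The clusterTopics string and its format (comma-separated topics, each ending with a score) are assumptions for illustration; check an actual response before relying on this parsing:

```python
# Hypothetical clusterTopics value; format assumed, not confirmed by the docs.
cluster_topics = "lousy customer service 0.9, citibank 0.7"

# Split on commas, then peel the trailing score off each topic.
topics = []
for part in cluster_topics.split(","):
    label, score = part.strip().rsplit(" ", 1)
    topics.append((label, float(score)))

print(topics)  # → [('lousy customer service', 0.9), ('citibank', 0.7)]

# The single best topic is the one with the highest score.
best = max(topics, key=lambda t: t[1])
print(best[0])  # → lousy customer service
```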

Clustered Sentences (clusteredSentences)

This is the list of sentences, with corresponding ids, that are part of a cluster. Sentences are numbered sequentially from 0 up to the number of sentences provided. If you use the “chunk” option, the sentences are numbered after the text has been segmented into sentences; if you use the “pre-sentenced” option, they are numbered in the order sent in the request. The sentence ids, along with the topic cues, can be used to merge clusters in map-reduce tasks. You can use measures such as Jaccard, Cosine and Dice to measure how similar two sets of sentence ids (or topics) are. Note that you can omit the actual sentence text when measuring similarity; the ids alone are sufficient. For example, suppose you have two clusters with the following ids:

cluster1: “0001 0002 0003”
cluster2: “0001 0002 0004”

The Jaccard, Cosine and Dice scores from the TextSimilarity API are as follows:

These scores show that the clusters do overlap and if the overlap is greater than a specific threshold, the two clusters may be merged (reduced).
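The overlap measures for the two id sets above can be reproduced with the standard set formulas (a sketch; the TextSimilarity API may differ in normalization details):

```python
# Sentence-id sets for the two example clusters.
cluster1 = {"0001", "0002", "0003"}
cluster2 = {"0001", "0002", "0004"}

inter = len(cluster1 & cluster2)  # 2 shared ids

jaccard = inter / len(cluster1 | cluster2)                        # |A ∩ B| / |A ∪ B|
dice    = 2 * inter / (len(cluster1) + len(cluster2))             # 2|A ∩ B| / (|A| + |B|)
cosine  = inter / (len(cluster1) ** 0.5 * len(cluster2) ** 0.5)   # |A ∩ B| / sqrt(|A| |B|)

print(jaccard, dice, cosine)  # Jaccard = 0.5; Dice and Cosine ≈ 0.667
```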

Example JSON Response

Please note that this is not a complete response, just a snapshot.
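To illustrate the shape of a response using the fields documented above, a single cluster might look like the following. The wrapper key names, topic string, and sentences are assumptions for illustration, not verbatim API output:

```json
{
  "clusters": [
    {
      "clusterScore": 0.87,
      "clusterSize": 3,
      "clusterTopics": "lousy customer service 0.9, citibank 0.7",
      "clusteredSentences": [
        { "id": "0001", "sentence": "Citibank customer service is lousy." },
        { "id": "0002", "sentence": "I was on hold with Citibank for 40 minutes." },
        { "id": "0003", "sentence": "The support rep could not help me." }
      ]
    }
  ],
  "sentences_with_no_cluster_membership": []
}
```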

Language Support

This API is currently only tuned and optimized for English.