What is the Text Similarity API?

The Text Similarity API computes surface similarity between two pieces of text (long or short) using three well-known measures: Jaccard, Dice and Cosine. Determining similarity between texts is crucial to many applications, such as clustering, duplicate removal, merging similar topics or themes, and text retrieval. Let’s say we have the following two product listings on eBay:

How can you tell that these two listings are almost the same? You can use text similarity measures for this. The results from the Text Similarity API show how close these two texts are according to different measures:

In text mining applications, you can heuristically set a similarity threshold: if the similarity score between two pieces of text is greater than some value, say 0.5, you can consider the two units similar. Threshold levels depend on the needs of the application. Here are some recommendations:

  • For strict similarity, use a threshold of 0.5 or above
  • For more liberal similarity, use a threshold below 0.5
  • In some cases, you can avoid thresholds by ranking texts by similarity scores and using only the top N most similar texts.
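The two strategies above (a fixed threshold versus keeping only the top N) can be sketched in a few lines of Python. This is only an illustration: the function names and the example scores are made up, and the scores are assumed to have already been obtained for each candidate text.

```python
from typing import Dict, List, Tuple

def similar_above(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return the ids of candidates whose score is greater than the threshold."""
    return [doc_id for doc_id, score in scores.items() if score > threshold]

def top_n(scores: Dict[str, float], n: int = 3) -> List[Tuple[str, float]]:
    """Return the n highest-scoring candidates, best first (no threshold needed)."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:n]

# Made-up similarity scores of four candidate texts against one query text.
example_scores = {"a": 0.82, "b": 0.47, "c": 0.55, "d": 0.10}

print(similar_above(example_scores))   # ['a', 'c']
print(top_n(example_scores, n=2))      # [('a', 0.82), ('c', 0.55)]
```

Ranking avoids having to pick a cut-off, but it always returns N items, even when none of them is particularly similar.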

Integrate Text Similarity with Code

To use this API, you essentially have to set 3 parameters:

  • text1: your first unit of text or text tokens
  • text2: your second unit of text or text tokens
  • clean: whether to clean your text before the similarity computation

You can have fairly lengthy units of text (e.g. two plain text documents), but the maximum payload size is 1MB per request. The text that you provide can be plain words, words with Part of Speech (POS) annotations (e.g. the/dt cow/nn jumps/vb) or combined tokens such as n-grams (e.g. this_cat cat_is is_cute).
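As an illustration of the combined-token form, here is a small Python sketch that turns plain text into underscore-joined n-grams like the ones in the example above. The helper name to_ngrams is hypothetical, and simple whitespace tokenization is an assumption; any tokenizer could be used instead.

```python
from typing import List

def to_ngrams(text: str, n: int = 2) -> str:
    """Join every run of n consecutive whitespace-separated words with '_'."""
    words: List[str] = text.split()
    grams = ["_".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return " ".join(grams)

print(to_ngrams("this cat is cute"))        # this_cat cat_is is_cute
print(to_ngrams("this cat is cute", n=3))   # this_cat_is cat_is_cute
```

The resulting string can then be passed as text1 or text2 in place of the raw text.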


First Steps: Get your API Key

Before you start, please ensure that you have a valid API key.


Request

The TextSimilarity endpoint accepts a JSON request via POST. It takes in 3 parameters:

Parameter name | Type | Required?         | Description
text1          | text | Yes               | first text
text2          | text | Yes               | second text
clean          | text | No (Default=true) | lowercase, remove punctuation and numbers?

Points to note:

  • There is no maximum length for the text, but there is a 1MB maximum payload per request.
  • The text can be in any language. The text that you provide can be:
    • plain text (e.g. the cow jumps over the moon)
    • text with POS annotations (e.g. the/dt cow/nn jumps/vb)
    • manipulated texts such as n-grams (e.g. this_cat cat_is is_cute).
  • Since this is a JSON request, your text has to be properly escaped and encoded in UTF-8
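The escaping, encoding and size points above can be handled by a standard JSON library rather than by hand. The following Python sketch shows one way to do it. The helper name build_payload is hypothetical, the 1MB limit is assumed to mean 1,048,576 bytes, and clean is sent as a string only because the parameter table lists its type as text.

```python
import json

# Assumption: "1MB" is taken to mean 1024 * 1024 bytes; the exact limit is not specified.
MAX_PAYLOAD_BYTES = 1024 * 1024

def build_payload(text1: str, text2: str, clean: bool = True) -> bytes:
    """Serialize the three parameters to UTF-8 encoded JSON."""
    body = json.dumps(
        {
            "text1": text1,
            "text2": text2,
            # Sent as "true"/"false" because the table lists clean as text (an assumption).
            "clean": "true" if clean else "false",
        },
        ensure_ascii=False,  # keep non-ASCII characters as UTF-8 instead of \u escapes
    ).encode("utf-8")
    if len(body) > MAX_PAYLOAD_BYTES:
        raise ValueError(f"payload is {len(body)} bytes; the limit is {MAX_PAYLOAD_BYTES}")
    return body

# Quotes and newlines are escaped by json.dumps; accented characters stay as UTF-8.
payload = build_payload('She said "hello"', "Crème brûlée\nrecipe")
print(payload.decode("utf-8"))
```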

Requests can be sent from any programming language, as long as they follow the expected JSON format. The Unirest library handles HTTP requests and responses in several languages, including Java, Python, Ruby, Node.js, PHP and more. Here is an example using the Java Unirest library:

  • ‘text1’ and ‘text2’ are the two texts that you want to compute similarity over; both are mandatory.
  • ‘clean’ indicates whether you want your text to be cleaned up prior to computing text similarity; it is optional.
  • A Content-Type of application/json is mandatory to indicate the type of request being sent.
  • X-Mashape-Key is mandatory; it is the key that allows you access to the API.

Here is a simple wrapper for the Text Similarity API in Java using HttpURLConnection:
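For readers working outside Java, the same request can be sketched in Python with only the standard library. This is an illustration of the parameters and headers listed above, not the official client: the endpoint URL is a placeholder rather than the real API address, build_request is a hypothetical helper name, and the key must be replaced with your own.

```python
import json
import urllib.request

# Placeholder values: substitute the real endpoint and your own key.
ENDPOINT_URL = "https://example.invalid/text-similarity"  # hypothetical, not the real address
API_KEY = "YOUR_MASHAPE_KEY"

def build_request(text1: str, text2: str, clean: str = "true") -> urllib.request.Request:
    """Assemble a POST request carrying the JSON body and the two mandatory headers."""
    body = json.dumps({"text1": text1, "text2": text2, "clean": clean}).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT_URL,
        data=body,
        headers={"Content-Type": "application/json", "X-Mashape-Key": API_KEY},
        method="POST",
    )

request = build_request("the cow jumps over the moon", "the cow jumps over the fence")

# Sending is left commented out because the URL above is only a placeholder:
# with urllib.request.urlopen(request) as response:
#     result = json.loads(response.read().decode("utf-8"))
```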

Response

Text Similarity returns a JSON response containing the Cosine, Jaccard and Dice similarity scores, along with the average of these 3 scores. Here is an example request and response output:

Request:

Response:

Request:

Response:

Since you have access to different similarity measures, you can consistently use one of them or use all of them at once. You can also use the average score.


Which Similarity Measure to Use?

If you have very short texts and want a strict measure that gives high scores only to phrases that are very similar, then Jaccard would be ideal. However, if your text is more than 5 words long, Cosine or Dice may be more appropriate, since these measures tend not to over-penalize non-overlapping terms. You can also average all three scores. In any case, please do some experimentation before you decide which measure(s) to use.
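To get a feel for how the three measures differ, it can help to compute them locally on a small example. The Python sketch below uses the common textbook set-based definitions over lowercase whitespace tokens. This is an assumption for illustration only; the API's own tokenization, cleaning and weighting (for example, term frequencies for Cosine) may differ, so its scores will not necessarily match.

```python
import math
from typing import Dict, Set

def token_set(text: str) -> Set[str]:
    """Lowercase the text and split on whitespace into a set of unique tokens."""
    return set(text.lower().split())

def surface_similarity(text1: str, text2: str) -> Dict[str, float]:
    """Set-based Jaccard, Dice and Cosine scores, plus their average."""
    a, b = token_set(text1), token_set(text2)
    if not a or not b:
        return {"jaccard": 0.0, "dice": 0.0, "cosine": 0.0, "average": 0.0}
    overlap = len(a & b)
    scores = {
        "jaccard": overlap / len(a | b),                 # shared tokens / all distinct tokens
        "dice": 2 * overlap / (len(a) + len(b)),         # shared tokens counted against both sizes
        "cosine": overlap / math.sqrt(len(a) * len(b)),  # cosine of binary term vectors
    }
    scores["average"] = sum(scores.values()) / 3
    return scores

result = surface_similarity("the cow jumps over the moon", "the cow jumps over the fence")
for name, value in result.items():
    print(f"{name}: {value:.3f}")
# jaccard: 0.667, dice: 0.800, cosine: 0.800, average: 0.756
```

With one differing word out of five unique tokens on each side, Jaccard drops to about 0.67 while Dice and Cosine stay at 0.8, which matches the observation that Jaccard is the strictest of the three.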


Improving Similarity Measures

There are several ways to improve similarity (that is, to find more overlaps). Here are some ideas for improving the reliability of the similarity measures:


Languages Supported

Text Similarity is language-neutral and should therefore work with text in any language.