Mashape

The Topics Extraction API allows you to find key topics within large amounts of text. The extracted topics can be used in a variety of applications, such as creating navigable word clouds, characterizing documents, generating features for machine learning, summarizing documents, and visualizing large amounts of text. Topics can be extracted from single documents or aggregated texts. Our algorithm uses data mining and NLP techniques over graphs.


Key Features

For each topic in the output, you get a main topic with a corresponding score, and related topics if present. Topics can be single words or phrases. Each topic also comes with supporting texts that discuss it. Our API is currently fine-tuned for the English language.


Types of Supported Texts

This API will work for most texts in their plain form. Basic pre-processing may be needed for highly noisy texts. If you have questions about how much pre-processing your data may need, you can ask a question at any time. Here are some of the types of text that can be used with the API:

  • Unstructured survey responses
  • Aggregated news articles
  • User reviews
  • Blog posts
  • Web pages
  • Customer complaints

Getting Started with Topics Extraction

API Key

Before you start, make sure you have a valid API key; you need one to access the API.

Request & Response

The Topics Extraction API accepts a JSON request and returns a JSON response.

Topics Extraction API Request

The topics extraction endpoint accepts two parameters, type and text.

Parameter Name   Type   Required?   Values
type             text   yes         “chunk” or “pre-sentenced”
text             text   yes         a chunk of text or pre-sentenced text (array of sentences)

“chunk” or “pre-sentenced”, which to use?

When you use “chunk” as the type, you are leaving it to the topics endpoint to determine sentence boundaries for the text that you send in via the text parameter. We use our default sentence segmentation tool to extract sentences from the text. This typically works well for well-written texts such as news articles and legal documents. If you are sending in a concatenation of texts as a chunk (e.g. concatenated news articles, concatenated Tweets, etc.), please ensure that there is punctuation between each pair of adjacent textual units (a “.” should be sufficient).

For inherently short texts such as Tweets, the “pre-sentenced” option would be more appropriate. For the “pre-sentenced” type, you will have to send in an array of texts under the “text” parameter. An example can be found here.
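
As a minimal sketch of the two request shapes described above, the Python below builds a JSON body for each type. The parameter names type and text and the values “chunk” and “pre-sentenced” come from the table above. The helper names build_chunk_request and build_presentenced_request are hypothetical, and the endpoint URL and authentication headers are left out because this page does not specify them.

```python
import json


def build_chunk_request(texts):
    """Join text units into one chunk, making sure each ends with punctuation."""
    units = []
    for unit in texts:
        unit = unit.strip()
        if not unit:
            continue
        # Add a "." when a unit has no closing punctuation, so the
        # service's sentence segmenter can tell adjacent units apart.
        if unit[-1] not in ".!?":
            unit += "."
        units.append(unit)
    return json.dumps({"type": "chunk", "text": " ".join(units)})


def build_presentenced_request(sentences):
    """Send short texts (e.g. Tweets) as an array, one sentence per item."""
    cleaned = [s.strip() for s in sentences if s.strip()]
    return json.dumps({"type": "pre-sentenced", "text": cleaned})


if __name__ == "__main__":
    # Illustrative inputs only; these are not taken from the API's own examples.
    print(build_chunk_request(["First article text", "Second article text."]))
    print(build_presentenced_request(["great phone", "battery dies fast"]))
```

The chunk helper only guards against missing punctuation between concatenated units; sentence segmentation itself is still left to the endpoint.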

Topics Extraction API Response

The topics extraction API returns several values as the output. The topics are ranked by score. The score is essentially computed as a function of the topic's length and its importance to the text fed into the API.

Parameter Name    Type     Description
mainTopic         text     The primary topic extracted
relatedTopics     text     A list of secondary related topics, if present
score             double   The topic score, a function of topic length and the importance of the topic
supportingTexts   text     A list of snippets from the original text that discuss the topic
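
One way a client might read these fields is sketched below in Python. It assumes the response body is a JSON array of topic objects carrying the four fields in the table; the exact nesting is an assumption, since it is not shown here. The Topic class and parse_topics function are hypothetical names, and the sample data in the usage section is made up for illustration.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class Topic:
    main_topic: str
    score: float
    related_topics: List[str] = field(default_factory=list)
    supporting_texts: List[str] = field(default_factory=list)


def parse_topics(raw_json: str) -> List[Topic]:
    """Parse a response body and return topics, highest score first."""
    # Assumption: the body is a JSON array of topic objects.
    items = json.loads(raw_json)
    topics = [
        Topic(
            main_topic=item["mainTopic"],
            score=float(item["score"]),
            # relatedTopics is only present for some topics.
            related_topics=item.get("relatedTopics") or [],
            supporting_texts=item.get("supportingTexts") or [],
        )
        for item in items
    ]
    return sorted(topics, key=lambda t: t.score, reverse=True)


if __name__ == "__main__":
    # Made-up sample, not real API output.
    sample = json.dumps([
        {"mainTopic": "battery", "score": 1.5, "supportingTexts": ["battery dies fast"]},
        {"mainTopic": "screen", "score": 3.0, "relatedTopics": ["display"],
         "supportingTexts": ["the screen is bright", "love the screen"]},
    ])
    for topic in parse_topics(sample):
        print(topic.main_topic, topic.score, len(topic.supporting_texts))
```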

Example JSON Request

The example below is a news article sent in as a “chunk” of text.

Example JSON Response

The response example below shows a list of topics formed for the news article from the request above. Notice that the top topics are related to “charlie hebdo” and the “je suis charlie” sign.

Post-processing

Suppressing Unwanted Topics

In the JSON response above, notice that some of the topics may not be interesting to an application. For example, the topic “saying” may not be useful. While we have eliminated some non-interesting words as topics, we have left it to the consuming application to suppress further non-interesting topics, so that we avoid over-eliminating. This can be done in several ways. First, you can maintain a stop word list customized to your application and remove topics that contain those stop words. Second, you can look at the number of supporting texts and the score. If a topic has few supporting texts (e.g. fewer than 4) or a score below a particular threshold (e.g. score < 2.5), it can be eliminated. You will have to experiment to see what works for your application.
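
The two suppression strategies can be sketched in Python as follows. This is an illustration, not part of the API: suppress_topics is a hypothetical helper, the topics are assumed to be dictionaries with the mainTopic, score, and supportingTexts fields from the response table, and the default thresholds are simply the example values mentioned above. The scores in the usage section are made up.

```python
def suppress_topics(topics, stop_words, min_texts=4, min_score=2.5):
    """Drop topics that contain a stop word, have few supporting texts, or score low."""
    stop = {w.lower() for w in stop_words}
    kept = []
    for topic in topics:
        words = topic["mainTopic"].lower().split()
        # Strategy 1: remove topics containing an application-specific stop word.
        if any(w in stop for w in words):
            continue
        # Strategy 2: remove topics with too few supporting texts or a low score.
        if len(topic.get("supportingTexts", [])) < min_texts:
            continue
        if topic["score"] < min_score:
            continue
        kept.append(topic)
    return kept


if __name__ == "__main__":
    # Made-up scores and snippets, for illustration only.
    topics = [
        {"mainTopic": "charlie hebdo", "score": 5.0,
         "supportingTexts": ["s1", "s2", "s3", "s4"]},
        {"mainTopic": "saying", "score": 4.0,
         "supportingTexts": ["s1", "s2", "s3", "s4"]},
    ]
    for topic in suppress_topics(topics, stop_words=["saying"]):
        print(topic["mainTopic"])
```

Both thresholds are parameters so they can be tuned per application.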

Pre-processing

How much pre-processing is needed?

It depends on your text. If you are dealing with fairly noisy texts such as Tweets, you will need to remove retweets, links, hash symbols, unwanted characters, and so on. Our API handles basic pre-processing of text, including lowercasing, normalization, and simple stop word handling.
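
For Tweets, the clean-up described above might look like the Python sketch below. The rules are assumptions chosen for illustration: retweets are taken to be texts starting with “RT”, links are anything starting with http:// or https://, and “unwanted characters” are anything outside ASCII letters, digits, whitespace, and a few punctuation marks. The clean_tweets name is hypothetical.

```python
import re

URL_RE = re.compile(r"https?://\S+")
# Assumption: a retweet is a text that starts with the token "RT".
RETWEET_RE = re.compile(r"^RT\b")
# Assumption: keep ASCII letters, digits, whitespace, and basic punctuation only.
UNWANTED_RE = re.compile(r"[^A-Za-z0-9\s.,!?'@-]")


def clean_tweets(tweets):
    """Return cleaned tweets, dropping retweets and texts left empty."""
    cleaned = []
    for tweet in tweets:
        text = tweet.strip()
        if RETWEET_RE.match(text):
            continue  # drop retweets entirely
        text = URL_RE.sub("", text)       # remove links
        text = text.replace("#", "")      # remove hash symbols, keep the word
        text = UNWANTED_RE.sub("", text)  # remove other unwanted characters
        text = " ".join(text.split())     # collapse leftover whitespace
        if text:
            cleaned.append(text)
    return cleaned


if __name__ == "__main__":
    # Illustrative inputs only.
    print(clean_tweets(["RT @bob: hello", "Loving the new #phone! http://t.co/abc"]))
```

The resulting list has one short text per item, which matches the array form expected by the “pre-sentenced” type.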

Language Support

This API is currently only tuned and optimized for English.