Mining insights from unstructured data using NLP

What is NLP?

Massive amounts of data are stored in text in natural language, however not all of it directly is machine understandable. The ability to process and analyse this data has a huge scope in our current business and day to day lives. However processing this unstructured kind of data is a complex task. Natural Language Processing (NLP)techniques provide the basis for harnessing this massive amount of data and making it useful for further processing.

A very simple example of the usage of NLP in our day to day lives is Identifying meaning from textual conversations via email.

For example, if you have received your flight ticket via email it will be added automatically to your Calendar. Another live example that we are currently working on is gathering sales signals from real time data that people are generating.

Let’s take a look at how we can build this.

NLP tasks

NLP tasks are closely braided, some of their categories in the order of processing include

  1. Tokenization : This task splits the text into roughly words depending on the language being used. It can identify when single quotes are part of a words, or period do or do not imply sentence boundaries.
Image for post
Image for post
  1. Lemmatization : For grammatical reasons documents may have multiple words that have similar meaning eg: “read”, “reads”, “reading”. It would be useful for a search for one of those terms to match documents containing the other. The goal of Lemmatization is to break these words down to common base word. Eg all the words will correspond to “read”

There are a number of open source libraries that support NLP tasks like NLTK(Python) , Stanford Core NLP(Java, Scala), Spacy(Python) etc. However, for the rest of this article we will focus our attention on one of the above NLP tasks i.e. Named Entity Recognition. In our projects that we are currently deploying we have utilized the techniques further with the Stanford Core NLP toolkit in Java.

Named Entity Recognition (NER)

  • Stanford NER Annotator

For the English language, by default, this annotator recognizes named (PERSON, LOCATION, ORGANIZATION, MISC), numerical (MONEY, NUMBER, ORDINAL, PERCENT), and temporal (DATE, TIME, DURATION, SET) entities (12 classes). Included with the stanford NER module are 3 models trained on different datasets

3 class: Annotating [Location, Person, Organization]
4 class: Annotating [Location, Person, Organization, Misc]
7 class: Annotating [Location, Person, Organization, Money, Percent, Date, Time]

You can train the models on custom training data as well to suit your custom needs. The default models that are trained on over a million tokens. The more training data you have, the more accurate your model should be. If you have to tag names of people on Indian origin(with the assumption a corpus does not already exist) for example you probably want to train your custom dataset.)

  • Stanford REGEX NER Annotator

The intention of building this annotator was to have the ability to annotate those entities that could not be annotated by the Stanford NER, this is done using regular expressions over a sequence of tokens. Regex NER has a simple rule based interface where you may specify rules as labelled Entities. These rules files may have tab delimited regular expressions with the corresponding NERs. For example

Bachelor of (Arts|Commerce|Science) DEGREE
Bachelor of Laws DEGREE

These rules files will be utilised in the annotation phase to annotate matched tokens.

Adding the regexner annotator adds support for the fine-grained and additional entity classes EMAIL, URL, CITY, STATE_OR_PROVINCE, COUNTRY, NATIONALITY, RELIGION, (job) TITLE, IDEOLOGY, CRIMINAL_CHARGE, CAUSE_OF_DEATH (11 classes) for a total of 23 classes, including those of Stanford NER.

However writing explicit rules to cover all case may not always be feasible, in such cases you probably want to have an ML model integrate various sources to learn the classification of entities. Tools like RegexNER is more useful at a higher level to augment such ML models

Example Code in JAVA

//Organizing Imports
import edu.stanford.nlp.pipeline.*;
// set up pipeline properties
Properties props = new Properties();
// set the list of annotators to run in the pipeline
props.setProperty(“annotators”, “tokenize, ssplit, pos, lemma, ner, regexner, tokensregex”);
// build pipeline
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
// Prepare the document for annotation
CoreDocument document = new CoreDocument(your_document_text);
// annotate the document
// text of the first sentence
CoreSentence sentence = document.sentences().get(1);
// list of the ner tags for the first sentence
List<String> nerTags = sentence.nerTags();

  • Stanford TOKEN REGEX NER Annotator

Typically in NLP systems text is first tokenized and then annotated with information eg: (Part-of-speech tagging or NER tagging). Since regular expressions are more string based, Token regex may be expressed with tokens, or parts of speech tags, or even ner tags, thereby making it more convenient and comprehensible to specify regular expressions. This gives us more abstraction in specifying patterns.

For example, one of the interesting challenges that we came across recently was to develop a market research product, in which organizational data should be extracted by crawling and processing millions of online pages, picture this –

([ner: PERSON]+) /was|is/ /the?/ /CEO/

This expression will match statements with names of people who are the CEO.

The SUTime library was built using Token Regex. It is a library for finding and normalizing time expressions, ie it will convert “next wednesday at 3pm” to something like 2018–06–06T15:00 (depending on the assumed current reference time).

  • Semgrex

Semgrex is a query language to query from a dependency graph that represents the grammatical relationships between the words of the text.

In such a graph the nodes in the graph are the words of the text and the edges are grammatical relationships between those nodes. A grammatical relationship holds between a governor and a dependent. The most important part of Semgrex is the ability to query through specifying relations between nodes or group of nodes in a regular expression like format.

Following is the graphical representation for the Stanford dependencies for the sentence “John took the bat.”

Image for post
Image for post
Image for post
Image for post

You can find the list of other POS tags here.

Following is a code snippet demonstrating the usage of SemgrexPattern in JAVA

String semgrex_pattern = “{} >/nsubj/ {}=subject > /dobj/ {}=object”;
SemgrexPattern semgrex = SemgrexPattern.compile(semgrex_pattern);
SemgrexMatcher matcher = semgrex.matcher(semantic_graph);
System.out.println(“subject = “ + matcher.getNode(“subject”) + “, object = “+ matcher.getNode(“object”));

OUTPUT: subject = John/NNP, object = bat/NN

The semgrex pattern in the code above finds all the pairs of nodes connected by a directed ‘nsubj’ relation in which the first node is the governor of the relation with two other nodes ie: an nsubj relation and a dobj relation. We have captured the nsub relation and named it ”subject” and also captured the dobj relation and named it “object”.

You can find a detailed explanation of the usage of Semgrex in a powerpoint presentation available here. Another Example demonstrating the usage of SemgrexPattern from the Core NLP Source can be found here

Using the NER Annotations

Now, with the extracted entities within the document and information from the other annotators(NLP modules eg POS) we can build a data structure that represents the meaning of our text expressed through the relationships between nodes and edges. This concept of a knowledge graph is a sort of a network of real world entities and their interrelations, organised with a graph.

According to Wikipedia A Semantic graph is a directed or undirected graph consisting of vertices, which represent concepts, and edges, which represent semantic relations between concepts. You will often hear semantic graphs referred to as “knowledge graphs”.

Building the Semantic Graph

Taking an example with the statement below:

John Lennon was born in Liverpool to Julia and Alfred Lennon

Post the Annotation Phase we get the NER tags

Image for post
Image for post

Followed by which we can construct a semantic graph. Below is the Graph representing the information extracted from the NLP phase. We understand that the text “Alfred” and “Julia” are persons, that “Liverpool” is a Location, that “Alfred” and “Julia” are the parents of “John Lennon”.

Image for post
Image for post

Such a graph would effortlessly help us to run queries like, What are the names of Johns Lennon’s parents? Or where was John Lennon born? Through one statement we are able to get insights, Imagine the kind of knowledge such a graph can hold while it processes and understand news articles, websites, blogs, documents etc.


NLP can be used in combination with other big data technologies like Spark to process and mine insights from large scale data and from multiple sources. The results may also be pipelined to multiple areas, eg: Relational Databases, Graph Databases for performing complex relationship analysis, Monitoring and alerting tools, analytics engine like spark for further big data processing or other custom applications.

In this article we have briefly walked you through

  • NLP and some of its tasks.

Currently we are working on live projects and consulting for wide variety of domains, such as Fintech, Healthcare, Market Research etc. This approach can also be further extended to highly sensitive areas of Security, Military Intelligence, etc., and can be also be extended to other commonly used NLP frameworks, such as NLTK, Spacy etc.

Have you ever worked on unstructured datasets? What framework did you use? Did you find the article useful? Did this article solve any of your existing dilemma? Feel free to leave your comments below.

Note: A major source of inspiration of this article is the The Stanford NLP Group website.

Originally Posted on Cuelogic Blog.

Global organizations partner with us to leverage our engineering excellence and product thinking to build Cloud Native & Data-Driven applications.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store