Quick Introduction to word2vec

In a previous post I gave links to some pretrained models for a few implementations of word vectors. In this post we’ll take a look at word vectors and their applications.

If you have been anywhere around NLP in the past couple of years you have undoubtedly heard of word2vec. As John Rupert Firth said, “You shall know a word by the company it keeps.” That is the premise behind word2vec. Words that have similar contexts will be placed closer to each other by the algorithm. For example, Paris and France will be closer together than Paris and Germany.
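
Closeness between word vectors is commonly measured with cosine similarity. The sketch below shows only the arithmetic. The three-dimensional vectors are made-up values for illustration and do not come from a trained model, which would typically produce vectors with hundreds of dimensions.

```java
public class CosineSimilarityDemo {

    // Cosine similarity: dot(a, b) / (|a| * |b|). Values near 1 mean "pointing the same way".
    public static double cosine(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Hypothetical toy vectors, NOT taken from any real word2vec model.
        double[] paris = {0.9, 0.8, 0.1};
        double[] france = {0.8, 0.9, 0.2};
        double[] germany = {0.1, 0.7, 0.9};

        System.out.println("paris/france:  " + cosine(paris, france));
        System.out.println("paris/germany: " + cosine(paris, germany));
    }
}
```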

If you are interested in the details of the algorithms behind word2vec, see the paper Efficient Estimation of Word Representations in Vector Space by Tomas Mikolov et al. and the code that goes along with it.

There are two model architectures of word2vec. One of these is called the Skip-Gram model. This model uses the current word to predict the surrounding words (the context). The other model architecture is called continuous bag-of-words (CBOW). This model predicts the current word from the surrounding words (context). For both models, the limit on the number of surrounding words is controlled by the window size parameter.
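
To make the window size parameter concrete, here is a minimal sketch of how (current word, context word) training pairs could be enumerated for Skip-Gram. The class and method names are made up for this post, and the sketch ignores refinements found in real implementations such as subsampling. For CBOW the same window applies, except that all the context words of a position are used together to predict the current word.

```java
import java.util.ArrayList;
import java.util.List;

public class WindowPairs {

    // Returns {current, context} pairs for every context word within `window` positions.
    public static List<String[]> skipGramPairs(String[] tokens, int window) {
        List<String[]> pairs = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            // Clamp the window to the sentence boundaries.
            int start = Math.max(0, i - window);
            int end = Math.min(tokens.length - 1, i + window);
            for (int j = start; j <= end; j++) {
                if (j != i) {
                    pairs.add(new String[]{tokens[i], tokens[j]});
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        String[] tokens = {"the", "quick", "brown", "fox"};
        for (String[] pair : skipGramPairs(tokens, 1)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

With a window of 1, each word is paired only with its immediate neighbors. A larger window produces more pairs per word and captures broader context.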

The practical applications of word vectors include, but are not limited to, named-entity recognition, machine translation, sentiment analysis, recommendation engines, and document retrieval. Word vectors have also been applied to other domains such as biological sequences of proteins and genes.

Nearly all deep learning and NLP toolkits available today offer at least some support for word vectors. TensorFlow, GluonNLP (based on MXNet), and cloud-based tools such as Amazon SageMaker BlazingText support word vectors.

PyData Washington DC 2018

Last month, in November 2018, I had the privilege of attending and presenting at PyData Washington DC 2018 at Capital One. It was my first PyData event, and I learned so much from it that I hope to attend many more in the future. I encourage you to do so, too, if you’re interested in all things data science and the supporting Python ecosystem. Plus, I got some awesome stickers for my laptop.

My presentation was an introduction to machine translation and demonstrated how machine translation can be used in a streaming pipeline. In it, I gave a (brief) overview of machine translation and how it has evolved from early methods, to statistical machine translation (SMT), to today’s neural machine translation (NMT). The demonstrated application used Apache Flink to consume German tweets and, via Sockeye, translate the German tweets to English.

A video and slides will be available soon and will be posted here. The code for the project is available here. Credits to Suneel Marthi for the original implementation from which mine is forked and to Kellen Sunderland for the Sockeye-Thrift code that enabled communication between Flink and Sockeye.

Lucidworks Activate Search and AI Conference

Back in October 2018 I had the privilege of attending and presenting at Lucidworks Activate Search and AI Conference in Montreal, Canada. It was a first-class event with lots of great, informative sessions set in the middle of a remarkable city.

I was a co-presenter of Embracing Diversity: Searching over multiple languages with Suneel Marthi in which we presented a method of performing cross-language information retrieval (CLIR) using Apache Solr, Apache NiFi, Apache OpenNLP, and Sockeye. Our approach implemented an English-in/English-out system for facilitating searches over a multilingual document index.

We used Apache NiFi to drive the process. The data flow is summarized as follows:

  1. The English search term is read from a file on disk. (This is just to demonstrate the system. We could easily receive the search term from somewhere else such as via a REST listener or by some other means.)
  2. The search term is translated via Sockeye to the other languages contained in the corpus.
  3. The translated search terms are sent to a local instance of Solr.
  4. The resulting documents are translated to English, summarized, and returned.

While this is an abbreviated description of the process, it captures the steps at a high level.

Check out the video below for the full presentation. The code for the custom Apache NiFi processors described in the presentation is available on GitHub. All of the software used is open source so you can build this system in your own environments. If you have any questions please get in touch and I will be glad to help.

New AWS NLP Service

The AWS re:Invent conference in Las Vegas always results in announcements of new AWS services. This year AWS announced a new addition to its cloud-based NLP services.

Amazon Comprehend Medical – Natural Language Processing for Healthcare Customers is a service for understanding unstructured natural language medical text. From the announcement, it supports extracting entities from a vocabulary of medical terms and extracting Protected Health Information (PHI) such as addresses and medical record numbers. For a full description and code samples see the AWS blog post. Pricing is based on the usage of the service.

Apache cTAKES

While this is an interesting and exciting new product, I would be remiss to not mention that this functionality is largely available in the open source application called cTAKES. cTAKES, or “clinical Text Analysis and Knowledge Extraction System”, is an Apache project for extracting information from the natural language clinical text found in medical records. cTAKES is used by many large hospitals and referenced in many publications.

Being open source, cTAKES is free to use and modify. You can deploy cTAKES on-premises or in your cloud without paying any fees for usage. You only have to pay for the hardware that it is running on. Depending on your usage, the cost difference between a service like Amazon Comprehend Medical and cTAKES could be significant, so I recommend evaluating both if you need to process medical records.

Natural Language Processing and Information Extraction for Biomedicine

Quick List of Pretrained Word Vectors

In the past few years word vectors have become all the rage in NLP and rightly so. It’s hard today to find some application of NLP that doesn’t involve the use of word vectors. The fact that word vectors are generated using unsupervised learning makes them even more appealing.

In a future post we’ll take a look at what exactly word vectors are, but in this post I just want to give a quick list of pretrained word vectors that you can use now. There are several different algorithms and implementations for generating word vectors, with the most famous likely being word2vec.

word2vec – These pretrained vectors were created from part of the Google News dataset containing about 100 billion words.

GloVe – The GloVe pretrained vectors were created from Wikipedia, a combination of Wikipedia and Common Crawl, and Twitter. 

fastText – The fastText pretrained vectors were created from Wikipedia. They are available for 294 languages. 

Please note the license that each set of pretrained vectors is released under before using it in your applications.

Using the NLP Building Blocks with Apache NiFi to Perform Named-Entity Extraction on Logical Entity Exchange Specifications (LEXS) Documents

In this post we are going to show how our NLP Building Blocks can be used with Apache NiFi to create an NLP pipeline to perform named-entity extraction on Logical Entity Exchange Specifications (LEXS) documents. The pipeline will extract a natural language field from each document, identify the named-entities in the text through a process of sentence extraction, tokenization, and named-entity recognition, and persist the entities to a MongoDB database.  While the pipeline we are going to create uses data files in a specific format, the pipeline could be easily modified to read documents in a different format.

LEXS is an XML, NIEM-based framework for information exchange developed for the US Department of Justice. While the details of LEXS are out of scope for this post, the key points are that it is XML-based, contains a mix of structured and unstructured text, and is used to describe various law enforcement events. We have taken the LEXS specification and created test documents for this pipeline. Example documents are also available on the public internet.

And just in case you are not familiar with Apache NiFi, it is a free (Apache-licensed), cross-platform application that allows the creation and execution of data flow processes. With Apache NiFi you can move data through pipelines while applying transformations and executing actions.

The completed Apache NiFi data flow is shown below.

NLP Building Blocks

This post requires that our NLP Building Blocks are running and accessible. The NLP Building Blocks are microservices to perform NLP tasks. They are:

Renku Language Detection Engine
Prose Sentence Extraction Engine
Sonnet Tokenization Engine
Idyl E3 Entity Extraction Engine

Each is available as a Docker container and on the AWS and Azure marketplaces. You can quickly start the building blocks together using docker compose or start each one individually:

Start Prose Sentence Extraction Engine:

docker run -p 8060:8060 -it mtnfog/prose:1.1.0

Start Sonnet Tokenization Engine:

docker run -p 9040:9040 -it mtnfog/sonnet:1.1.0

Start Idyl E3 Entity Extraction Engine:

docker run -p 9000:9000 -it mtnfog/idyl-e3:3.0.0

With the containers running we will next set up Apache NiFi.

Setting Up

To begin, download Apache NiFi and unzip it. Now we can start Apache NiFi:

apache-nifi-1.5.0/bin/nifi.sh start

We can now begin creating our data flow.

Creating the Ingest Data Flow

The Process

Our data flow in Apache NiFi will follow these steps. Each step is described in detail below.

  1. Ingest LEXS XML files from the file system. Apache NiFi offers the ability to read files from many sources (such as HDFS and S3) but we will simply use the local file system as our source.
  2. Execute an XPath query against each LEXS XML file to extract the narrative from each record. The narrative is a free text, natural language description of the event described by the LEXS XML file.
  3. Use Prose Sentence Extraction Engine to identify the individual sentences in the narrative.
  4. Use Sonnet Tokenization Engine to break each sentence into its individual tokens (typically words).
  5. Use Idyl E3 Entity Extraction Engine to identify the named-person entities in the tokens.
  6. Persist the extracted entities into a MongoDB database.

Configuring the Apache NiFi Processors

Ingesting the XML Files

To read the documents from the file system we will use the GetFile processor. The only configuration property for this processor that we will set is the input directory. Our documents are stored in /docs so that will be our source directory. Note that, by default, the GetFile processor removes the files from the directory as they are processed.

Extracting the Narrative from Each Record

The GetFile processor will send the file’s XML content to an EvaluateXPath processor. This processor will execute an XPath query against each XML document to extract the document’s narrative. The extracted narrative will be stored in the content of the flowfile. The XPath is:


Identifying Individual Sentences in the Narrative

The flowfile will now be sent to an InvokeHTTP processor that will send the sentence extraction request to Prose Sentence Extraction Engine. We set the following properties on the processor:

Remote URL: http://localhost:8060/api/sentences
Content Type: text/plain

The response from Prose Sentence Extraction Engine will be a JSON array containing the individual sentences in the narrative.
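
If you want to try the sentence extraction request outside of Apache NiFi, the two properties above map onto a plain HTTP request. The following is a minimal sketch using the HTTP client that ships with Java 11 and later. It assumes the request is an HTTP POST with the narrative as the body, which this post does not state explicitly, so check the Prose documentation before relying on it. The class and method names are made up for illustration.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class SentenceRequest {

    // Builds (but does not send) a request carrying the narrative as plain text.
    // Assumption: the endpoint accepts POST; the URL and content type come from the post.
    public static HttpRequest build(String narrative) {
        return HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8060/api/sentences"))
                .header("Content-Type", "text/plain")
                .POST(HttpRequest.BodyPublishers.ofString(narrative))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = build("George Washington was president. He lived in Virginia.");
        System.out.println(request.method() + " " + request.uri());
        // To send it with the container running:
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

The requests to Sonnet Tokenization Engine and Idyl E3 Entity Extraction Engine would follow the same shape with their own URLs and content types.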

Splitting the Sentences Array into Separate FlowFiles

The array of sentences will be sent to a SplitJson processor. This processor splits the flowfile, creating a new flowfile for each sentence in the array. For the remainder of the data flow, the sentences will be operated on individually.

Identifying the Tokens in Each Sentence

Each sentence is next sent to an InvokeHTTP processor that will call Sonnet Tokenization Engine. The properties set for this processor are:

Remote URL: http://localhost:9040/api/tokenize
Content Type: text/plain

The response from Sonnet Tokenization Engine will be an array of tokens (typically words) in the sentence.

Extracting Named-Entities from the Tokens

The array of tokens is next sent to an InvokeHTTP processor that sends the tokens to Idyl E3 Entity Extraction Engine for named-entity extraction. The properties to set for this processor are:

Remote URL: http://localhost:9000/api/extract
Content Type: application/json

Idyl E3 analyzes the tokens and identifies which tokens are named-person entities (like John Doe, Jerry Smith, etc.). The response is a list of the entities found along with metadata about each entity. This metadata includes the entity’s confidence value, a value from 0 to 1 that indicates Idyl E3’s confidence that the extracted text is actually an entity.

Storing Entities in MongoDB

The entities having a confidence value greater than or equal to 0.6 will be persisted to a MongoDB database by a PutMongo processor. Each entity will be written to the database for storage and further analysis by other systems. The properties to configure for the PutMongo processor are:

Mongo URI: mongodb://localhost:27017
Mongo Database Name: <Any database>
Mongo Collection Name: <Any collection>

You could just as easily insert the entities into a relational database, Elasticsearch, or another repository.

Pipeline Summary

That is our pipeline! We went from XML documents, did some natural language processing via the NLP Building Blocks, and ended up with named-entities stored in MongoDB.

Production Deployment

There are a few things you may want to change for a production deployment.

Multiple Instances of Apache NiFi

First, you will likely want (and need) more than one instance of Apache NiFi to handle large volumes of files.

High Availability of NLP Building Blocks

Second, in this post we ran the NLP Building Blocks as local Docker containers. This is great for a demonstration or proof-of-concept, but in production you will want these services to be highly available, for example by running them on Kubernetes or AWS ECS.

You can also launch the NLP Building Blocks as EC2 instances via the AWS Marketplace. You could then plug the AMI of each building block into an EC2 autoscaling group behind an Elastic Load Balancer. This provides instance health checks and the ability to scale up and down in response to demand. They are also available on the Azure Marketplace.

Incorporate Language Detection in the Data Flow

Third, you may have noticed that we did not use Renku Language Detection Engine. This is because we knew beforehand that all of our documents are English. If you are unsure, you can insert a Renku Language Detection Engine processor in the data flow immediately after the EvaluateXPath processor to determine the text’s language and use the result as a query parameter to the other NLP Building Blocks.

Improve Performance through Custom Models

Lastly, we did not use any custom sentence, tokenization, or entity models. Each NLP Building Block includes basic functionality to perform these actions without custom models, but using custom models will almost certainly provide a much higher level of performance. This is because custom models will more closely match your data than the default models do. The tools to create and evaluate custom models are included with each application. Refer to each application’s documentation for the necessary steps.

Filtering Entities with Low Confidence

You may want to filter out entities having a low confidence value in order to control noise. The optimal threshold depends on a combination of your data, the entity model being used, and how much noise your system can tolerate. In some use cases it may be better to use a lower threshold out of caution. Each entity has an associated confidence value that can be used to filter.
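
As a rough sketch of what such a filter does, the following keeps only the entities at or above a threshold. The Entity class here is a hypothetical stand-in with just the two fields needed for the example. It is not the actual Idyl E3 response type, and the confidence values are invented.

```java
import java.util.List;
import java.util.stream.Collectors;

public class ConfidenceFilter {

    // Hypothetical stand-in for an extracted entity (not the Idyl E3 API).
    public static final class Entity {
        public final String text;
        public final double confidence;

        public Entity(String text, double confidence) {
            this.text = text;
            this.confidence = confidence;
        }
    }

    // Keeps the entities whose confidence is greater than or equal to the threshold.
    public static List<Entity> filter(List<Entity> entities, double threshold) {
        return entities.stream()
                .filter(e -> e.confidence >= threshold)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Invented example values.
        List<Entity> entities = List.of(
                new Entity("John Doe", 0.92),
                new Entity("Jerry Smith", 0.60),
                new Entity("Main Street", 0.31));

        for (Entity e : filter(entities, 0.6)) {
            System.out.println(e.text + " (" + e.confidence + ")");
        }
    }
}
```

Keeping the threshold as a parameter makes it easy to tune per data source instead of hard-coding one value for every flow.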

Need Help?

Get in touch. We’ll be glad to help out. Drop us a line at support at mtnfog.com.

English-language “Places” model in Idyl E3 2.4.0 Analyst Edition

Idyl E3 2.4.0 now includes an English-language “Places” model as well as an English-language “Persons” model. Prior to version 2.4.0, only the persons model was included. Idyl E3 2.4.0 Analyst Edition will be available from the AWS Marketplace soon.

The model will be loaded automatically when Idyl E3 2.4.0 Analyst Edition starts. An entity extraction request such as “George Washington was president of the United States.” will return two entities:

  • George Washington (person)
  • United States (place)

AWS Marketplace

Idyl E3 2.4.0 comes with a free 30-day trial period in which you can use a single instance of Idyl E3 in AWS by only paying the cost of the underlying instance!

Idyl Talk – New Open Source Project

We have pushed a new open source project to our GitHub called Idyl Talk. The goal of Idyl Talk is to replace traditional interface-defined software communication with natural language text.

When software communicates with other software, either internally or with external software, the communication is defined by interfaces. These interfaces tell each side how to communicate. Interfaces are an essential piece of good design. But what happens when two components have to communicate, and for whatever reasons, it is difficult (or impossible) to define the interface? Idyl Talk addresses this problem by letting software components communicate using natural language English text.

Imagine your refrigerator talking to your smartphone app to update your shopping list. The communication might look a bit like this:

    {
        "inventory": {
            "milk": "low",
            "eggs": 12
        }
    }

Your smartphone receives the message and an app notifies you that you need milk. For this to be possible the developers of the refrigerator and the smartphone app have to agree on some interface that dictates the communication between the devices. This requires collaboration, and of course, time and money.

Now, imagine that when you are running low on milk your refrigerator sends the following message to your smartphone app:

You are low on milk.

The agreed-to interface here is the English language. With Idyl Talk you can now create devices that are able to communicate with devices that do not even exist yet! The app processes the received message and alerts you that you are low on milk.
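
To give a feel for the receiving side, here is a deliberately tiny sketch that pulls the item out of a message like the one above with a regular expression. This is not Idyl Talk’s API or how Idyl Talk works internally. The class name, method name, and pattern are all made up for illustration, and a real system would need actual natural language processing to cope with the many ways the same thing can be said.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LowStockMessage {

    // Toy pattern: the word following "low on". Hypothetical, for illustration only.
    private static final Pattern LOW_ON =
            Pattern.compile("\\blow on (\\w+)", Pattern.CASE_INSENSITIVE);

    // Returns the item the sender is low on, if the message says so.
    public static Optional<String> lowItem(String message) {
        Matcher matcher = LOW_ON.matcher(message);
        return matcher.find()
                ? Optional.of(matcher.group(1).toLowerCase())
                : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(lowItem("You are low on milk."));
        System.out.println(lowItem("Everything is stocked."));
    }
}
```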

Sound interesting? We think so! We welcome your contributions to the project as it matures and grows. Check out Idyl Talk on GitHub.

See a listing of all our open source projects.

OpenNLP’s RegexNameFinder and Tokenizing

OpenNLP’s RegexNameFinder takes one or more regular expressions and uses those expressions to extract entities from the input text. This is very useful for instances in which you want to extract things that follow a set format, like phone numbers and email addresses. However, when tokenizing the input to the RegexNameFinder be careful because it can affect the RegexNameFinder’s accuracy.

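The RegexNameFinder is very simple to use. Here’s an example borrowed from an OpenNLP test case.

import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

import opennlp.tools.namefind.RegexNameFinder;
import opennlp.tools.util.Span;

// A pattern that matches the literal token "test".
Pattern testPattern = Pattern.compile("test");
// The input is an array of tokens, not a raw string.
String[] sentence = new String[]{"a", "test", "b", "c"};

// Map each entity type to the patterns that identify it.
Pattern[] patterns = new Pattern[]{testPattern};
Map<String, Pattern[]> regexMap = new HashMap<>();
String type = "testtype";

regexMap.put(type, patterns);

RegexNameFinder finder = new RegexNameFinder(regexMap);

// Each returned Span covers the matching tokens and carries the entity type.
Span[] result = finder.find(sentence);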

The sentence variable is a list of tokens. In the above example the tokens are set manually. In a more likely scenario the string would be received as “a test b c” and it would be up to the application to tokenize the string into {“a”, “test”, “b”, “c”}.

There are three types of tokenizers available in OpenNLP: the WhitespaceTokenizer, the SimpleTokenizer, and a tokenizer (TokenizerME) that uses a token model you have trained. The WhitespaceTokenizer works on, you guessed it, white space. The locations of white space in the string are used to tokenize the string. The SimpleTokenizer looks at character classes, such as letters and numbers.

Let’s take the example string “My email address is me@me.com and I like Gmail.” Using the WhitespaceTokenizer the tokens are {“My”, “email”, “address”, “is”, “me@me.com”, “and”, “I”, “like”, “Gmail.”}. If we use the RegexNameFinder with a regular expression that matches an email address, OpenNLP will return to us the span covering “me@me.com”. Works great!

However, let’s consider the sentence “My email address is me@me.com.” Using the WhitespaceTokenizer again the tokens are {“My”, “email”, “address”, “is”, “me@me.com.”}. Notice the last token includes the sentence’s period. Our regular expression for an email address will not match the token “me@me.com.” because, with the trailing period, the token is not a valid email address. Using the SimpleTokenizer doesn’t give any better results.

How to work around this is up to you. You could make a custom tokenizer by implementing the Tokenizer interface, try using a token model, or massage your text before it is passed to the tokenizer.
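
As one possible take on the “massage your text” option, the sketch below splits on white space and then strips trailing sentence punctuation from each token before matching. It uses only java.util.regex so it runs without OpenNLP on the classpath, and it matches each token on its own, which is a simplification of what RegexNameFinder does. The email pattern is intentionally simple and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class EmailTokens {

    // Deliberately simple email pattern, for illustration only.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Splits on white space, then removes trailing sentence punctuation from each token.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            String cleaned = raw.replaceAll("[.,;:!?]+$", "");
            if (!cleaned.isEmpty()) {
                tokens.add(cleaned);
            }
        }
        return tokens;
    }

    // Returns the tokens that are, in their entirety, an email address.
    public static List<String> emails(List<String> tokens) {
        List<String> found = new ArrayList<>();
        for (String token : tokens) {
            if (EMAIL.matcher(token).matches()) {
                found.add(token);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String sentence = "My email address is me@me.com.";
        System.out.println(emails(tokenize(sentence)));
    }
}
```

Be aware that stripping trailing punctuation this way also changes tokens such as abbreviations that legitimately end in a period, so whether it is acceptable depends on your text.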