Open Source NLP Microservices

We have open sourced our NLP building block applications on GitHub under the Apache license.

These microservices are stateless applications designed for deployment into scalable environments. They can be launched as Docker containers or through cloud marketplaces.

Each application is built using Idyl NLP, an open source Apache-licensed NLP framework for Java.

Extracting Patient Names, Hospitals, and Drugs with Idyl NLP

In this post we will demonstrate how Idyl NLP can be used to find patient names, hospitals, and names of drugs in natural language text.

Use of natural language processing (NLP) is growing quickly in the healthcare industry. Recent advancements in technology have made it possible to extract useful and very valuable information from unstructured medical records. This information can be used to correlate patient information and look for treatment patterns. From a security perspective, it may be necessary to quickly identify any protected health information (PHI) in a document for auditing or compliance purposes.

Idyl NLP is our open-source NLP library for Java, licensed under the business-friendly Apache license, version 2.0. The library provides various NLP capabilities through abstracted interfaces to lower-level NLP libraries. The goal of Idyl NLP is to provide a powerful yet easy-to-use NLP framework. The people, places, and things we are concerned with here are patient names, hospitals, and drug names.

Extracting Drug Names via a Dictionary

Drug names in the text do not require a trained model since they can be identified via a dictionary. To identify drug names we will use the FDA’s Orange Book. From the Orange Book CSV download we extracted the “Trade Name” column to a text file. Because some drugs appear more than once, we sort the file and remove the duplicate entries, leaving a text file with one drug name per line.

cat drugs.txt | sort | uniq > drugs-sorted.txt

We will use Idyl NLP’s dictionary entity recognizer to find the drug names in our input text. The dictionary entity recognizer reads the contents of a file into a bloom filter and accepts tokenized text as input. Because some drug names consist of more than one word, we cannot do a simple contains check against the dictionary. Instead, we produce n-grams of the tokenized text with lengths from one up to the number of input tokens and check whether the bloom filter “might contain” each n-gram. If the bloom filter returns true, we then do a definite check to rule out false positives. Using a bloom filter makes the dictionary checks much more efficient.
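To make the n-gram and bloom filter approach concrete, here is a minimal sketch of the idea using Guava's BloomFilter. This is not Idyl NLP's implementation; the class, the sample dictionary entries, and the input text are illustrative only:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class DictionaryLookupSketch {

    public static void main(String[] args) {

        // The dictionary of drug names (one entry per line in drugs-sorted.txt).
        Set<String> dictionary = new HashSet<>(Arrays.asList("aspirin", "tylenol extra strength"));

        // Load the dictionary into a bloom filter with a 1% false positive probability.
        BloomFilter<String> filter = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), dictionary.size(), 0.01);
        dictionary.forEach(filter::put);

        // Tokenized input text (lower-cased for a case-insensitive lookup).
        String[] tokens = "the patient was given tylenol extra strength".toLowerCase().split(" ");

        // Generate n-grams from length one up to the number of tokens and check each one.
        for (int n = 1; n <= tokens.length; n++) {
            for (int i = 0; i + n <= tokens.length; i++) {
                String ngram = String.join(" ", Arrays.copyOfRange(tokens, i, i + n));
                // The "might contain" check is cheap; the definite check rules out false positives.
                if (filter.mightContain(ngram) && dictionary.contains(ngram)) {
                    System.out.println("Found dictionary entity: " + ngram);
                }
            }
        }
    }
}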

Extracting Patient Names and Hospitals

Patient names and hospitals will be extracted from the input text through the use of trained models. Each model was created from the same training data. The only difference is that in the patient model the patient names were annotated, and in the hospital model the names of hospitals were annotated. This training process gives us two model files, one for patients and one for hospitals, and their associated model manifest files. To use these models we will instantiate a model-based entity recognizer. The recognizer will load the two trained entity models from disk.

Creating the Pipeline

To use these two entity recognizers we will create a NerPipeline. This pipeline accepts a list of entity recognizers when built along with other configurable settings, such as a sentence detector and tokenizer. When the pipeline is executed, each entity recognizer will be applied to the input text. The output will be a list of Entity objects that contain information about each extracted entity.

The Code

Below is the code described above. Refer to the idylnlp-samples project for up-to-date examples since this code could change between the time it was written and the time you read it. This code uses Idyl NLP 1.1.0-SNAPSHOT.

Creating the dictionary entity recognizer. The first argument specifies that the extracted entities will be identified as English, the second argument is the full path to the file created from the Orange Book, the third argument is the type of entity, the fourth argument is the false positive probability for the bloom filter, and the last argument indicates that the dictionary lookup is not case-sensitive.

DictionaryEntityRecognizer dictionaryRecognizer = new DictionaryEntityRecognizer(LanguageCode.en, "/path/to/drugs-sorted.txt", "drug", 0.1, false);

Creating the model entity recognizer requires us to read the model manifests from disk. Maps correlate the models with entity types and languages.

String modelPath = "/path/to/trained-models/";

LocalModelLoader<TokenNameFinderModel> entityModelLoader = new LocalModelLoader<>(new TrueValidator(), modelPath);

StandardModelManifest patientModelManifest = (StandardModelManifest) ModelManifestUtils.readManifest("/full/path/to/patient.manifest");
StandardModelManifest hospitalModelManifest = (StandardModelManifest) ModelManifestUtils.readManifest("/full/path/to/hospital.manifest");

Set<StandardModelManifest> patientModelManifests = new HashSet<StandardModelManifest>();
patientModelManifests.add(patientModelManifest);

Set<StandardModelManifest> hospitalModelManifests = new HashSet<StandardModelManifest>();
hospitalModelManifests.add(hospitalModelManifest);

Map<LanguageCode, Set<StandardModelManifest>> persons = new HashMap<>();
persons.put(LanguageCode.en, patientModelManifests);

Map<LanguageCode, Set<StandardModelManifest>> hospitals = new HashMap<>();
hospitals.put(LanguageCode.en, hospitalModelManifests);

Map<String, Map<LanguageCode, Set<StandardModelManifest>>> models = new HashMap<>();
models.put("person", persons);
models.put("hospital", hospitals);

OpenNLPEntityRecognizerConfiguration config = new Builder()
 .withEntityModelLoader(entityModelLoader)
 .withEntityModels(models)
 .build();

OpenNLPEntityRecognizer modelRecognizer = new OpenNLPEntityRecognizer(config);

Now we can create the pipeline providing the entity recognizers:

List<EntityRecognizer> entityRecognizers = new ArrayList<>();
entityRecognizers.add(dictionaryRecognizer);
entityRecognizers.add(modelRecognizer);

NerPipeline pipeline = new NerPipeline.NerPipelineBuilder().withEntityRecognizers(entityRecognizers).build();

And, finally, we can execute the pipeline:

String input = FileUtils.readFileToString(new File("/tmp/input-file.txt"));
EntityExtractionResponse response = pipeline.run(input);

The response will contain a set of entities (persons, hospitals, and drugs) that were extracted from the input text.

Notes

Because we created the pipeline using mostly defaults, it will use an internal English sentence detector and tokenizer. For other languages you can create the pipeline with other options. As with any trained model used for named-entity recognition, the model's performance matters: how well the training data represents the actual data is crucial to achieving good results.

Simplified Named-Entity Extraction Pipeline in Idyl NLP

Idyl NLP 1.1.0 introduces a simplified named-entity extraction pipeline that can be created in just a few lines of code. The following code block shows how to make a pipeline to extract named-person entities from natural language English text in Idyl NLP.

NerPipelineBuilder builder = new NerPipeline.NerPipelineBuilder();
NerPipeline pipeline = builder.build(LanguageCode.en);

EntityExtractionResponse response = pipeline.run("George Washington was president.");
		
for(Entity entity : response.getEntities()) {
  System.out.println(entity.toString());
}

When you run this code a single line will be printed to the screen:

Text: George Washington; Confidence: 0.96; Type: person; Language Code: eng; Span: [0..2);

Internally, the pipeline creates a sentence detector, tokenizer, and named-entity recognizer for the given language. Currently only person entities for English are supported, but we will be adding support for more languages and more entity types in the future. The goal of this functionality is to reduce the amount of code needed to perform a complex operation like named-entity extraction. The NerPipeline class is new in Idyl NLP 1.1.0-SNAPSHOT.

Idyl NLP is our open-source, Apache-licensed NLP framework for Java. Its releases are available in Maven Central and daily snapshots are also available. See Idyl NLP on GitHub at https://github.com/idylnlp/idylnlp for the code, examples, and documentation. Idyl NLP powers our NLP Building Blocks.

Idyl NLP

We have open-sourced our NLP library and its associated projects on GitHub. The library, Idyl NLP, is a Java natural language processing library. It is licensed under the Apache License, version 2.0.

Idyl NLP stands on the shoulders of giants to provide a capable and flexible NLP library. Utilizing components such as OpenNLP and DeepLearning4j under the hood, Idyl NLP offers various implementations for NLP tasks such as language detection, sentence extraction, tokenization, named-entity extraction, and document classification.

Idyl NLP has its own webpage at http://idylnlp.ai and is available in Maven Central under the group ai.idylnlp.

Idyl NLP powers our NLP building block microservices, which are also open source on GitHub.

NLP Models and Model Zoo

Idyl NLP has the ability to automatically download NLP models when needed. The Idyl NLP Models repository contains model manifests for various NLP models. Through these manifest files, Idyl NLP can automatically download the model file referenced by a manifest and use it. The service powering this capability is the Idyl NLP Model Zoo, which will soon be hosted at zoo.idylnlp.ai. It is a Spring Boot application that provides a REST interface for querying and downloading models, so you can also run your own model zoo for internal use. See these two repositories on GitHub for more information about the available models and the model zoo. Models will become available through the repository in the coming days.

Sample Projects

There are some sample projects available for Idyl NLP. The samples illustrate how to use some of Idyl NLP’s core capabilities and hopefully provide starting points for using Idyl NLP in your projects.

Future

We are committed to further developing Idyl NLP and its ecosystem, and we welcome the community’s contributions to help it grow. We hope that the business-friendly Apache license aids Idyl NLP’s adoption. Like most software engineers, we are a bit behind on documentation; in the near term we will be focusing on the wiki, the javadocs, and the sample projects. Our NLP Building Blocks will continue to be powered by Idyl NLP.

For questions or more information please contact help@idylnlp.ai.

Using the NLP Building Blocks with Apache NiFi to Perform Named-Entity Extraction on Logical Entity Exchange Specifications (LEXS) Documents

In this post we are going to show how our NLP Building Blocks can be used with Apache NiFi to create an NLP pipeline to perform named-entity extraction on Logical Entity Exchange Specifications (LEXS) documents. The pipeline will extract a natural language field from each document, identify the named-entities in the text through a process of sentence extraction, tokenization, and named-entity recognition, and persist the entities to a MongoDB database.  While the pipeline we are going to create uses data files in a specific format, the pipeline could be easily modified to read documents in a different format.

LEXS is an XML, NIEM-based framework for information exchange developed for the US Department of Justice. While the details of LEXS are out of scope for this post, the key points are that it is XML-based, a mix of structured and unstructured text, and used to describe various law enforcement events. We have taken the LEXS specification and created test documents for this pipeline. Example documents are also available on the public internet.

And just in case you are not familiar with Apache NiFi, it is a free (Apache-licensed), cross-platform application that allows the creation and execution of data flow processes. With Apache NiFi you can move data through pipelines while applying transformations and executing actions.

The completed Apache NiFi data flow is shown below.

NLP Building Blocks

This post requires that our NLP Building Blocks are running and accessible. The NLP Building Blocks are microservices to perform NLP tasks. They are:

Renku Language Detection Engine
Prose Sentence Extraction Engine
Sonnet Tokenization Engine
Idyl E3 Entity Extraction Engine

Each is available as a Docker container and on the AWS and Azure marketplaces. You can quickly start the building blocks with Docker Compose or run each container individually:

Start Prose Sentence Extraction Engine:

docker run -p 8060:8060 -it mtnfog/prose:1.0.0

Start Sonnet Tokenization Engine:

docker run -p 9040:9040 -it mtnfog/sonnet:1.1.0

Start Idyl E3 Entity Extraction Engine:

docker run -p 9000:9000 -it mtnfog/idyl-e3:3.0.0
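If you prefer to start everything at once rather than run each container individually, a minimal docker-compose.yml along these lines would do it. This is a sketch based on the images and ports above; the canonical compose file lives in the nlp-building-blocks repository on GitHub:

version: '2'
services:
  prose:
    image: mtnfog/prose:1.0.0
    ports:
      - "8060:8060"
  sonnet:
    image: mtnfog/sonnet:1.1.0
    ports:
      - "9040:9040"
  idyl-e3:
    image: mtnfog/idyl-e3:3.0.0
    ports:
      - "9000:9000"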

With the containers running we will next set up Apache NiFi.

Setting Up

To begin, download Apache NiFi and unzip it. Now we can start Apache NiFi:

apache-nifi-1.5.0/bin/nifi.sh start

We can now begin creating our data flow.

Creating the Ingest Data Flow

The Process

Our data flow in Apache NiFi will follow these steps. Each step is described in detail below.

  1. Ingest LEXS XML files from the file system. Apache NiFi offers the ability to read files from many sources (such as HDFS and S3) but we will simply use the local file system as our source.
  2. Execute an XPath query against each LEXS XML file to extract the narrative from each record. The narrative is a free text, natural language description of the event described by the LEXS XML file.
  3. Use Prose Sentence Extraction Engine to identify the individual sentences in the narrative.
  4. Use Sonnet Tokenization Engine to break each sentence into its individual tokens (typically words).
  5. Use Idyl E3 Entity Extraction Engine to identify the named-person entities in the tokens.
  6. Persist the extracted entities into a MongoDB database.

Configuring the Apache NiFi Processors

Ingesting the XML Files

To read the documents from the file system we will use the GetFile processor. The only configuration property for this processor that we will set is the input directory. Our documents are stored in /docs so that will be our source directory. Note that, by default, the GetFile processor removes the files from the directory as they are processed.

Extracting the Narrative from Each Record

The GetFile processor will send the file’s XML content to an EvaluateXPath processor. This processor will execute an XPath query against each XML document to extract the document’s narrative. The extracted narrative will be stored in the content of the flowfile. The XPath is:

/*[local-name()='doPublish']/*[local-name()='PublishMessageContainer']/*[local-name()='PublishMessage']/*[local-name()='DataItemPackage']/*[local-name()='Narrative']
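If you would like to sanity-check this XPath outside of NiFi before configuring the processor, a small test with the JDK's built-in XPath support is enough. The file path below is hypothetical; only the XPath expression itself comes from the flow above:

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class NarrativeXPathCheck {

    public static void main(String[] args) throws Exception {

        // The same namespace-agnostic XPath used by the EvaluateXPath processor.
        String xpath = "/*[local-name()='doPublish']/*[local-name()='PublishMessageContainer']"
                + "/*[local-name()='PublishMessage']/*[local-name()='DataItemPackage']"
                + "/*[local-name()='Narrative']";

        // Parse a sample LEXS document (adjust the path to one of your files).
        Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("/docs/sample-lexs.xml"));

        // Evaluate the XPath and print the extracted narrative text.
        String narrative = (String) XPathFactory.newInstance()
                .newXPath()
                .evaluate(xpath, document, XPathConstants.STRING);

        System.out.println(narrative);
    }
}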

Identifying Individual Sentences in the Narrative

The flowfile will now be sent to an InvokeHTTP processor that will send the sentence extraction request to Prose Sentence Extraction Engine. We set the following properties on the processor:

HTTP Method: POST
Remote URL: http://localhost:8060/api/sentences
Content Type: text/plain

The response from Prose Sentence Extraction Engine will be a JSON array containing the individual sentences in the narrative.

Splitting the Sentences Array into Separate FlowFiles

The array of sentences will be sent to a SplitJSON processor. This processor splits the flowfile creating a new flowfile for each sentence in the array. For the remainder of the data flow, the sentences will be operated on individually.

Identifying the Tokens in Each Sentence

Each sentence is next sent to an InvokeHTTP processor that will call Sonnet Tokenization Engine. The properties set for this processor are:

HTTP Method: POST
Remote URL: http://localhost:9040/api/tokenize
Content Type: text/plain

The response from Sonnet Tokenization Engine will be an array of tokens (typically words) in the sentence.

Extracting Named-Entities from the Tokens

The array of tokens is next sent to an InvokeHTTP processor that sends the tokens to Idyl E3 Entity Extraction Engine for named-entity extraction. The properties to set for this processor are:

HTTP Method: POST
Remote URL: http://localhost:9000/api/extract
Content Type: application/json

Idyl E3 analyzes the tokens and identifies which tokens are named-person entities (like John Doe, Jerry Smith, etc.). The response is a list of the entities found along with metadata about each entity. This metadata includes the entity’s confidence value, a value from 0 to 1 that indicates Idyl E3’s confidence that the extracted text is actually an entity.

Storing Entities in MongoDB

The entities having a confidence value greater than or equal to 0.6 will be persisted to a MongoDB database. Each entity will be written to the database for storage and further analysis by other systems. The properties to configure for the PutMongo processor are:

Mongo URI: mongodb://localhost:27017
Mongo Database Name: <Any database>
Mongo Collection Name: <Any collection>

You could just as easily insert the entities into a relational database, Elasticsearch, or another repository.

Pipeline Summary

That is our pipeline! We went from XML documents, did some natural language processing via the NLP Building Blocks, and ended up with named-entities stored in MongoDB.

Production Deployment

There are a few things you may want to change for a production deployment.

Multiple Instances of Apache NiFi

First, you will likely want (and need) more than one instance of Apache NiFi to handle large volumes of files.

High Availability of NLP Building Blocks

Second, in this post we ran the NLP Building Blocks as local Docker containers. This is great for a demonstration or proof of concept, but in production you will want high availability for these services through an orchestration platform such as Kubernetes or AWS ECS.

You can also launch the NLP Building Blocks as EC2 instances via the AWS Marketplace. You could then plug the AMI of each building block into an EC2 Auto Scaling group behind an Elastic Load Balancer. This provides instance health checks and the ability to scale up and down in response to demand. The building blocks are also available on the Azure Marketplace.

Incorporate Language Detection in the Data Flow

Third, you may have noticed that we did not use Renku Language Detection Engine. This is because we knew beforehand that all of our documents are English. If you are unsure, you can insert a Renku Language Detection Engine processor in the data flow immediately after the EvaluateXPath processor to determine the text’s language and use the result as a query parameter to the other NLP Building Blocks.

Improve Performance through Custom Models

Lastly, we did not use any custom sentence, tokenization, or entity models. Each NLP Building Block includes basic functionality to perform these actions without custom models, but using custom models will almost certainly provide a much higher level of performance because they will more closely match your data than the default models do. The tools to create and evaluate custom models are included with each application – refer to each application’s documentation for the necessary steps.

Filtering Entities with Low Confidence

You may want to filter out entities having a low confidence value in order to control noise. The optimal threshold depends on a combination of your data, the entity model being used, and how much noise your system can tolerate; in some use cases it may be better to use a lower threshold out of caution. Each entity has an associated confidence value that can be used to filter, as illustrated in the sketch below.
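As a simple illustration of threshold filtering outside of NiFi, the sketch below keeps only entities at or above a chosen confidence. The ExtractedEntity class is a hypothetical stand-in for the parsed response; the real Idyl E3 response carries the same kind of confidence field:

import java.util.ArrayList;
import java.util.List;

public class ConfidenceFilterSketch {

    // A minimal stand-in for an extracted entity; the real response contains
    // at least the entity text and a confidence value between 0 and 1.
    static class ExtractedEntity {
        final String text;
        final double confidence;
        ExtractedEntity(String text, double confidence) {
            this.text = text;
            this.confidence = confidence;
        }
    }

    // Keep only entities at or above the chosen confidence threshold.
    static List<ExtractedEntity> filter(List<ExtractedEntity> entities, double threshold) {
        List<ExtractedEntity> kept = new ArrayList<>();
        for (ExtractedEntity entity : entities) {
            if (entity.confidence >= threshold) {
                kept.add(entity);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<ExtractedEntity> entities = new ArrayList<>();
        entities.add(new ExtractedEntity("George Washington", 0.96));
        entities.add(new ExtractedEntity("Jerry Smith", 0.42));

        // With a 0.6 threshold only the first entity would be persisted.
        System.out.println(filter(entities, 0.6).size()); // prints 1
    }
}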

Need Help?

Get in touch. We’ll be glad to help out. Send us a line at support@mtnfog.com.

Creating Custom Tokenization Models with Sonnet Tokenization Engine

Sonnet Tokenization Engine 1.1.0 includes the ability to train custom token models from your text. Using your own token model provides improved performance because the model will more closely match your text to be tokenized. This post describes how to launch an instance of Sonnet Tokenization Engine on AWS, connect to it, train a custom token model, and then use it.

To get started, let’s launch an instance of Sonnet Tokenization Engine from the AWS Marketplace. On the product page, click the orange “Continue to Subscribe” button.


On the next page, we highly recommend selecting a VPC from the VPC Settings options. This is to allow you to launch Sonnet Tokenization Engine on a newer instance type. Select your VPC and a public subnet.

Now, select an instance type. We recommend a t2.micro for this demonstration. In production you will likely want a larger instance type.

Now click the “Launch with 1-Click” button!

An instance of Sonnet Tokenization Engine will now be starting in your AWS account. Head over to your EC2 console to check it out. By default, for security purposes port 22 for SSH is not open to the instance. Let’s open port 22 so we can SSH to the instance. Click on the instance’s security group, click Inbound Rules, and add port 22. Now let’s SSH into the instance.

ssh -i keypair.pem ec2-user@ec2-34-201-136-186.compute-1.amazonaws.com

Sonnet Tokenization Engine is installed under /opt/sonnet.

cd /opt/sonnet

Training a custom token model requires training data. The format for this data is a single sentence per line with tokens separated by whitespace or <SPLIT>. You can download sample training data for this exercise.
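For reference, lines in a token training file look similar to the following. These lines are illustrative and not taken from the sample file; the <SPLIT> tags mark token boundaries that are not already indicated by whitespace:

The patient was discharged on Friday<SPLIT>.
The report<SPLIT>, however<SPLIT>, was filed late<SPLIT>.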

wget https://s3.amazonaws.com/mtnfog-public/token.train -O /tmp/token.train

We also need a training definition file. Again, we can download one for this exercise:

wget https://s3.amazonaws.com/mtnfog-public/token-training-definition.xml -O /tmp/token-training-definition.xml
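If you are curious about the contents of the training definition, it follows the same structure as the other training definition files used by these tools. An illustrative version (not necessarily identical to the downloaded file) might look like this:

<?xml version="1.0" encoding="UTF-8"?>
<trainingdefinition xmlns="https://www.mtnfog.com">
  <algorithm/>
  <trainingdata file="/tmp/token.train" format="opennlp"/>
  <model name="token" file="/tmp/token.bin" encryptionkey="random" language="eng" type="token"/>
</trainingdefinition>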

Using these two files we are now ready to train our model.

sudo su sonnet
./bin/train-model.sh /tmp/token-training-definition.xml

The output will look similar to:

Sonnet Token Model Generator
Version: 1.1.0
Beginning training using definition file: /tmp/token-training-definition.xml
2018-03-17 12:47:46,135 DEBUG [main] models.ModelOperationsUtils (ModelOperationsUtils.java:40) - Using OpenNLP data format.
2018-03-17 12:47:46,260 INFO  [main] training.TokenModelOperations (TokenModelOperations.java:282) - Beginning tokenizer model training. Output model will be: /tmp/token.bin
Indexing events with TwoPass using cutoff of 0

	Computing event counts...  done. 6002 events
	Indexing...  done.
Collecting events... Done indexing in 0.54 s.
Incorporating indexed data for training...  
done.
	Number of Event Tokens: 6002
	    Number of Outcomes: 2
	  Number of Predicates: 6290
Computing model parameters...
Performing 100 iterations.
  1:  . (5991/6002) 0.9981672775741419
  2:  . (5995/6002) 0.9988337220926358
  3:  . (5996/6002) 0.9990003332222592
  4:  . (5997/6002) 0.9991669443518827
  5:  . (5996/6002) 0.9990003332222592
  6:  . (5998/6002) 0.9993335554815062
  7:  . (5998/6002) 0.9993335554815062
  8:  . (6000/6002) 0.9996667777407531
  9:  . (6000/6002) 0.9996667777407531
 10:  . (6000/6002) 0.9996667777407531
Stopping: change in training set accuracy less than 1.0E-5
Stats: (6002/6002) 1.0
...done.
Compressed 6290 parameters to 159
1 outcome patterns
Entity model generated complete. Summary:
Model file   : /tmp/token.bin
Manifest file : token.bin.manifest
Time Taken    : 2690 ms

The model file and its associated manifest file have now been created. Copy the manifest file to Sonnet’s models directory.

cp /tmp/token.bin.manifest /opt/sonnet/models/

Now start/restart Sonnet.

sudo service sonnet restart

The model will be loaded and ready for use. All API requests for tokenization that are received for the model’s language will be processed by the model. To try it:

curl "http://ec2-34-201-136-186.compute-1.amazonaws.com:9040/api/tokenize?language=eng" -d "Tokenize this text please." -H "Content-Type: text/plain"

Renku Language Detection Engine 1.1.0

Renku Language Detection Engine 1.1.0 has been released. It is available now on DockerHub and will be available on the AWS Marketplace and Azure Marketplace in a few days. This version adds a new API endpoint that returns a list of the languages (as ISO 639-3 codes) supported by Renku. The AWS Marketplace image is built using the newest version of the Amazon Linux AMI, and the Azure Marketplace image is now built on CentOS 7.4 (previously 7.3).

Get Renku Language Detection Engine.

Intel “Meltdown” and “Spectre” Vulnerabilities

With the recent announcement of the vulnerabilities known as “Spectre” and “Meltdown” in Intel processors, we have made this post to inform our users how to patch the virtual machines running our products launched via cloud marketplaces.

Products Launched via Docker Containers

Docker uses the host’s system kernel. Refer to your host OS’s documentation on applying the necessary kernel patch.

Products Launched via the AWS Marketplace

The following product versions use kernel 4.9.62-21.56.amzn1.x86_64, which needs to be updated.

  • Renku Language Detection Engine 1.0.0
  • Prose Sentence Extraction Engine 1.0.0
  • Sonnet Tokenization Engine 1.0.0
  • Idyl E3 Entity Extraction Engine 3.0.0

Run the following commands on each instance:

sudo yum update
sudo reboot
uname -r

The output of the last command will show an updated kernel version of 4.9.76-3.78.amzn1.x86_64 (or newer). Details are available on the AWS Amazon Linux Security Center.

Products Launched via the Azure Marketplace

The following product versions run CentOS 7.3 on kernel 3.10.0-514.26.2.el7.x86_64, which needs to be updated.

  • Renku Language Detection Engine 1.0.0
  • Prose Sentence Extraction Engine 1.0.0
  • Sonnet Tokenization Engine 1.0.0
  • Idyl E3 Entity Extraction Engine 3.0.0

Run the following commands on each virtual machine:

sudo yum update
sudo reboot
uname -r

The output of the last command will show an updated kernel version of 3.10.0-693.11.6.el7.x86_64 (or newer). For more information see the Red Hat Security Advisory and the announcement email.


Apache OpenNLP Language Detection in Apache NiFi

When building an NLP pipeline in Apache NiFi, it can be a requirement to route text through the pipeline based on the language of the text. But how do we determine the language of the text inside our pipeline? This blog post introduces a processor for Apache NiFi that utilizes Apache OpenNLP’s language detection capabilities. The processor receives natural language text and returns an ordered list of detected languages along with each language’s probability. Your pipeline can take the first language in the list (it has the highest probability) and use it to route the text through the pipeline.

In case you are not familiar with OpenNLP’s language detection, it can detect over 100 languages. It works best with text containing more than one sentence (the more text the better) and was introduced in OpenNLP 1.8.3.
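If you are curious what the processor is doing under the hood, OpenNLP's language detection API itself is only a few lines. Here is a minimal sketch that is independent of the NiFi processor; the path to the pretrained langdetect model is a placeholder:

import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.langdetect.Language;
import opennlp.tools.langdetect.LanguageDetectorME;
import opennlp.tools.langdetect.LanguageDetectorModel;

public class LanguageDetectionSketch {

    public static void main(String[] args) throws Exception {

        // Load OpenNLP's pretrained language detection model from disk.
        try (InputStream modelIn = new FileInputStream("/path/to/langdetect-183.bin")) {

            LanguageDetectorModel model = new LanguageDetectorModel(modelIn);
            LanguageDetectorME detector = new LanguageDetectorME(model);

            // The languages are returned ordered from most to least probable.
            Language[] languages = detector.predictLanguages(
                    "George Washington was the first president of the United States.");

            for (Language language : languages) {
                System.out.println(language.getLang() + " : " + language.getConfidence());
            }
        }
    }
}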

To use the processor, first clone it from GitHub. Then build it and copy the nar file to your NiFi’s lib directory (and restart NiFi if it was running). We are using NiFi 1.4.0.

git clone https://github.com/mtnfog/nlp-nifi-processors.git
cd nlp-nifi-processors
mvn clean install
cp langdetect-nifi-processor/langdetect-processor-nar/target/*.nar /path/to/nifi/lib/

The processor does not have any settings to configure. It’s ready to work right “out of the box.” You can add the processor to your NiFi canvas:

You will likely want to connect the processor to an EvaluateJsonPath processor to extract the language from the JSON response and then to a RouteOnAttribute processor to route the text through the pipeline based on the language. This processor will also work with Apache NiFi MiNiFi to determine the language of text on edge devices. MiNiFi is a subproject of Apache NiFi that allows data to be captured into NiFi flows from edge locations.

Backing up a bit, why would we need to route text through the pipeline depending on its language? The actions taken further down the pipeline are likely to be language-dependent. For instance, the next step might be to tokenize the text, but knowing how to tokenize it requires knowing what language it is. Or, if the next step is to send the text to an entity extraction process, we need to know which entity model to use based on the language. So, language detection in an NLP pipeline can be a crucial initial step. A previous blog post showed how to use NiFi for an NLP pipeline, and extending that pipeline with language detection would be a great addition!

This processor performs the language detection inside the NiFi process; everything remains inside your NiFi installation. This should be adequate for a lot of use cases, but if you need more throughput, check out Renku Language Detection Engine. It works very similarly to this processor in that it receives text and returns a list of identified languages. However, Renku is implemented as a stateless, scalable microservice, meaning you can deploy as many instances as you need to meet your requirements. And maybe the best part is that Renku is free for everyone to use without any limits.

Let us know how the processor works out for you!

Jupyter Notebook for NLP Building Blocks

This post presents a Jupyter notebook interactively showing how the NLP Building Blocks can be used. The notebook defines functions for sentence extraction, tokenization, and named-entity extraction. (We recently made a blog post showing how to accomplish the same thing but through Apache NiFi.)

To run the notebook, first start the NLP Building Block Docker containers. Then fire up Jupyter and replace the 192.168.1.134 IP in the notebook with the IP address of the computer running the containers. You can then step through the notebook.
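The notebook itself is written in Python, but the calls are plain HTTP, so the same functions can be written in any language. Below is a minimal Java sketch of the sentence extraction call; the host address is a placeholder and the port assumes Prose was started on 8060 as in the earlier posts:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class ProseClientSketch {

    // Sends text to Prose Sentence Extraction Engine and returns the raw JSON response.
    static String extractSentences(String host, String text) throws IOException {
        URL url = new URL("http://" + host + ":8060/api/sentences");
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setRequestMethod("POST");
        connection.setRequestProperty("Content-Type", "text/plain");
        connection.setDoOutput(true);

        try (OutputStream out = connection.getOutputStream()) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }

        try (InputStream in = connection.getInputStream();
             Scanner scanner = new Scanner(in, StandardCharsets.UTF_8.name()).useDelimiter("\\A")) {
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    public static void main(String[] args) throws IOException {
        // Replace with the IP address of the machine running the containers.
        System.out.println(extractSentences("192.168.1.134",
                "George Washington was president. Martha Washington was first lady."));
    }
}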

Sentence Extraction with Custom Trained NLP Models

Introducing Sentence Extraction

A common task in natural language processing (NLP) is to extract sentences from natural language text. This can be a task on its own or part of a larger NLP system. There are several ways to go about sentence extraction. There is the naive way of splitting on periods, which works great until you remember that periods don’t always indicate a sentence break. There are also tools that break text into sentences based on rules; there is actually a standard for communicating these rules called Segmentation Rules eXchange, or SRX. These rules often work well, but they are language-dependent, and implementing code for them can be difficult because not all programming languages have the necessary constructs.

Model-Based Sentence Extraction

This brings us to model-based sentence extraction. In this approach we use trained models to identify sentence boundaries in natural language text. In summary, we take training text, run it through a training process, and get a model that can be used to extract sentences. A significant benefit of model-based sentence extraction is that you can adapt the model to the actual text you will be processing, which can lead to very good performance. Our NLP Building Block product Prose Sentence Extraction Engine uses this model-based approach.
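Prose exposes this approach behind a REST API, but the general idea can be illustrated with any model-based sentence detector. The sketch below uses OpenNLP's SentenceDetectorME; it is not Prose's internal code, and the model path is a placeholder:

import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public class SentenceDetectionSketch {

    public static void main(String[] args) throws Exception {

        // Load a trained sentence model from disk.
        try (InputStream modelIn = new FileInputStream("/path/to/sentence.bin")) {

            SentenceModel model = new SentenceModel(modelIn);
            SentenceDetectorME detector = new SentenceDetectorME(model);

            // The model decides where the sentence boundaries are.
            String[] sentences = detector.sentDetect(
                    "This is a sentence. This is another sentence.");

            for (String sentence : sentences) {
                System.out.println(sentence);
            }
        }
    }
}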

Training a Custom Sentence Model with Prose Sentence Extraction Engine

Prose Sentence Extraction Engine 1.1.0 introduced the ability to create custom models for extracting sentences from natural language text. Using a custom model typically provides a much greater level of accuracy than relying on the internal Prose logic to extract sentences. Creating a custom model is fairly simple and this blog post demonstrates how to do it.

To get started we are going to launch Prose Sentence Extraction Engine via the AWS Marketplace. The benefit of doing this is that in just a few seconds (okay, maybe 30 seconds) we will have an instance of Prose fully configured and ready to go. Once the instance is up and running in EC2 we can SSH into it. (Note that the SSH username is ec2-user.) All commands presented in this post are executed through SSH on the Prose instance.

SSH to the Prose instance on EC2:

ssh -i key.pem ec2-user@ec2-54-174-13-245.compute-1.amazonaws.com

Once connected, change to the Prose directory:

cd /opt/prose

Training a sentence extraction model requires training text. This text needs to be formatted in a certain way – one sentence per line. This is how Prose learns how to recognize a sentence for any given language. We have some training text for you to use for this example. When creating a model for your production use you should use text representative of the real text that you will be processing. This gives the best performance.

Download the example training text to the instance:

wget https://s3.amazonaws.com/mtnfog-public/a-christmas-carol-sentences.txt -O /tmp/a-christmas-carol-sentences.txt

Take a look at the first few lines of the file you just downloaded. You will see that it is a sentence per line. This file is also attached to this blog post and can be downloaded at the bottom of this post.

Now, edit the example training definition file:

sudo nano example-training-definition-template.xml

You want to modify the trainingdata file to be “/tmp/a-christmas-carol-sentences.txt” and set the output model file as shown below:

<?xml version="1.0" encoding="UTF-8"?>
<trainingdefinition xmlns="https://www.mtnfog.com">
  <algorithm/>
  <trainingdata file="/tmp/a-christmas-carol-sentences.txt" format="opennlp"/>
  <model name="sentence" file="/tmp/sentence.bin" encryptionkey="random" language="eng" type="sentence"/>
</trainingdefinition>

This training definition says we are creating a sentence model for English (eng) text. The trained model file will be written to /tmp/sentence.bin. Now we are ready to train the model:

./bin/train-model.sh example-training-definition-template.xml

You will see some output quickly scroll by. Since the input text is rather small, the training only takes at most a few seconds. Your output should look similar to:

$ ./bin/train-model.sh example-training-definition-template.xml

Prose Sentence Model Generator
Version: 1.1.0
Beginning training using definition file: /opt/prose/example-training-definition-template.xml
2017-12-31 19:21:03,451 DEBUG [main] models.ModelOperationsUtils (ModelOperationsUtils.java:40) - Using OpenNLP data format.
2017-12-31 19:21:03,567 INFO  [main] training.SentenceModelOperations (SentenceModelOperations.java:281) - Beginning sentence model training. Output model will be: /tmp/sentence.bin
Indexing events with TwoPass using cutoff of 0

	Computing event counts...  done. 1990 events
	Indexing...  done.
Collecting events... Done indexing in 0.41 s.
Incorporating indexed data for training...  
done.
	Number of Event Tokens: 1990
	    Number of Outcomes: 2
	  Number of Predicates: 2274
Computing model parameters...
Performing 100 iterations.
  1:  . (1827/1990) 0.9180904522613065
  2:  . (1882/1990) 0.9457286432160804
  3:  . (1910/1990) 0.9597989949748744
  4:  . (1915/1990) 0.9623115577889447
  5:  . (1940/1990) 0.9748743718592965
  6:  . (1950/1990) 0.9798994974874372
  7:  . (1953/1990) 0.9814070351758793
  8:  . (1948/1990) 0.978894472361809
  9:  . (1962/1990) 0.985929648241206
 10:  . (1954/1990) 0.9819095477386934
 20:  . (1979/1990) 0.9944723618090452
 30:  . (1986/1990) 0.9979899497487437
 40:  . (1990/1990) 1.0
Stopping: change in training set accuracy less than 1.0E-5
Stats: (1990/1990) 1.0
...done.
Compressed 2274 parameters to 707
1 outcome patterns
2017-12-31 19:21:04,491 INFO  [main] manifest.ModelManifestUtils (ModelManifestUtils.java:108) - Removing existing manifest file /tmp/sentence.bin.manifest.
Sentence model generated complete. Summary:
Model file   : /tmp/sentence.bin
Manifest file : sentence.bin.manifest
Time Taken    : 1056 ms

Our model has been created and we can now use it. First, let’s stop Prose in case it is running:

sudo service prose stop

Next, copy the model file and its manifest file to /opt/prose/models:

sudo cp /tmp/sentence.* /opt/prose/models/

Since we moved the model file, let’s also update the model’s file name in the manifest file:

sudo nano models/sentence.bin.manifest

Change the model.filename property to be sentence.bin (remove the /tmp/). The manifest should now look like:

model.id=e54091c9-89de-4edb-828b-4edf58006c73
model.name=sentence
model.type=sentence
model.subtype=none
model.filename=sentence.bin
language.code=eng
license.key=
encryption.key=random
creator.version=prose-1.1.0
model.source=
generation=1

With our model in place, we can now start Prose. If we tail Prose’s log while it starts we can see that it finds and loads our custom model:

sudo service prose start && tail -f /var/log/prose.log

In case you are curious, the lines in the log that show the model was loaded will look similar to these:

[INFO ] 2017-12-31 19:25:57.933 [main] ModelManifestUtils - Found model manifest ./models//sentence.bin.manifest.
[INFO ] 2017-12-31 19:25:57.939 [main] ModelManifestUtils - Validating model manifest ./models//sentence.bin.manifest.
[WARN ] 2017-12-31 19:25:57.942 [main] ModelManifestUtils - The license.key in ./models//sentence.bin.manifest is missing.
[INFO ] 2017-12-31 19:25:58.130 [main] ModelManifestUtils - Entity Class: sentence, Model File Name: sentence.bin, Language Code: en, License Key: 
[INFO ] 2017-12-31 19:25:58.135 [main] DefaultSentenceDetectionService - Found 1 models to load.
[INFO ] 2017-12-31 19:25:58.138 [main] LocalModelLoader - Using local model loader directory ./models/
[INFO ] 2017-12-31 19:25:58.560 [main] ModelLoader - Model validation successful.
[INFO ] 2017-12-31 19:25:58.569 [main] DefaultSentenceDetectionService - Found sentence model for language eng

Yay! This means that Prose has started and loaded our model. Requests to Prose to extract sentences for English text will now use our model. Let’s try it:

curl http://ec2-54-174-13-245.compute-1.amazonaws.com:8060/api/sentences -d "This is a sentence. This is another sentence. This is also a sentence." -H "Content-type: text/plain"

The response we receive from Prose is:

["This is a sentence.","This is another sentence.","This is also a sentence."]

Our sentence model worked! Prose successfully took in the natural language English text and sent us back three sentences that made up the text.

Prose Sentence Extraction Engine is available on the AWS Marketplace, the Azure Marketplace, and DockerHub. You can launch Prose Sentence Extraction Engine on any of those platforms in just a few seconds.

At the time of publishing, Prose 1.1.0 was in the process of being published to the Azure and AWS Marketplaces. If 1.1.0 is not yet available on those marketplaces, it will be in just a few days once the update has been published.

Orchestrating NLP Building Blocks with Apache NiFi for Named-Entity Extraction

This blog post shows how we can create an NLP pipeline to perform named-entity extraction on natural language text using our NLP Building Blocks and Apache NiFi. Our NLP Building Blocks provide the ability to perform sentence extraction, string tokenization, and named-entity extraction. They are implemented as microservices and can be deployed almost anywhere, such as AWS, Azure, and as Docker containers.

At the completion of this blog post we will have a system that reads natural language text stored in files on the file system, pulls out the sentences of each file, finds the tokens in each sentence, and finds the named-entities in the tokens.

Apache NiFi is an open-source application that provides data flow capabilities. Using NiFi you can visually define how data should flow through your system. Using what NiFi calls “processors”, you can ingest data from many data sources, perform operations on the data such as transformations and aggregations, and then output the data to an external system. We will be using NiFi to facilitate the flow of text through our NLP pipeline. The text will be read from plain text files on the file system. We will then:

  • Identify the sentences in input text.
  • For each sentence, extract the tokens in the sentence.
  • Process the tokens for named-entities.

To get started we will stand up the NLP Building Blocks: Renku Language Detection Engine, Prose Sentence Extraction Engine, Sonnet Tokenization Engine, and Idyl E3 Entity Extraction Engine.

We will launch these applications using a docker-compose script.

git clone https://github.com/mtnfog/nlp-building-blocks
cd nlp-building-blocks
docker-compose up

This will pull the docker images from DockerHub and run the containers. We now have each NLP building block up and running. Let’s get Apache NiFi up and running, too.

To get started with Apache NiFi we will download it. It is a big download at just over 1 GB. You can download it from the Apache NiFi Downloads page or directly from a mirror at this link for NiFi 1.4.0. Once the download is done we will unzip the download and start NiFi:

unzip nifi-1.4.0-bin.zip
cd nifi-1.4.0/bin
./nifi.sh start

NiFi will start and after a few minutes it will be available at http://localhost:8080/nifi. (If you are curious you can see the NiFi log under logs/nifi-app.log.) Open your browser to that page and you will see the NiFi canvas as shown below. We can now design our data flow around the NLP Building Blocks!

If you want to skip to the meat and potatoes you can get the NiFi template described below in the nlp-building-blocks repository.

Our source data is going to be read from text files on our computer stored under /tmp/in/. We will use NiFi’s GetFile processor to read the file. Add a GetFile processor to the canvas:


Right-click the GetFile processor and click Configure to bring up the processor’s properties. The only property we are going to set is the Input Directory property. Set it to /tmp/in/ and click Apply:

We will use the InvokeHTTP processor to send API requests to the NLP Building Blocks, so, add a new InvokeHTTP processor to the canvas:

This first InvokeHTTP processor will be used to send the data to Prose Sentence Extraction Engine to extract the sentences in the text. Open the InvokeHTTP processor’s properties and set the following values:

  • HTTP Method – POST
  • Remote URL – http://localhost:7070/api/sentences
  • Content Type – text/plain

Set the processor to autoterminate for everything except Response. We also set the processor’s name to ProseSentenceExtractionEngine. Since we will be using multiple InvokeHTTP processors this lets us easily differentiate between them. We can now create a connection between the GetFile and InvokeHTTP processors by clicking and drawing a line between them. Our flow right now reads files from the filesystem and sends the contents to Prose:

The sentences returned from Prose will be in a JSON array. We can split this array into individual FlowFiles with the SplitJson processor. Add a SplitJson processor to the canvas and set its JsonPath Expression property to $.* as shown below:

Connect the SplitJson processor to the ProseSentenceExtractionEngine processor for the Response relationship. The canvas should now look like this:

Now that we have the individual sentences in the text we can send those sentences to Sonnet Tokenization Engine to tokenize the sentences. Similar to before, add an InvokeHTTP processor and name it SonnetTokenizationEngine. Set its method to POST, the Remote URL to http://localhost:9040/api/tokenize, and the Content-Type to text/plain. Automatically terminate every relationship except Response. Connect it to the SplitJson processor using the Split relationship. The result of this processor will be an array of tokens from the input sentence.

While we are at it, let’s go ahead and add an InvokeHTTP processor for Idyl E3 Entity Extraction Engine. Add the processor to the canvas and set its name to IdylE3EntityExtractionEngine. Set its properties:

  • HTTP Method – POST
  • Remote URL – http://localhost:9000/api/extract
  • Content-Type – application/json

Connect the IdylE3EntityExtractionEngine processor to the SonnetTokenizationEngine processor via the Response relationship. All other relationships can be set to auto-terminate. To make things easier to see, we are going to add an UpdateAttribute processor that sets the filename of each FlowFile to a random UUID. Add an UpdateAttribute processor and add a new property called filename with the value ${uuid}.txt. We will also add a PutFile processor to write the FlowFiles to disk so we can see what happened during the flow’s execution; set its Directory property to /tmp/out/.

Our finished flow looks like this:

To test our flow we are going to use a super simple text file. The full contents of the text file are:

George Washington was president. This is another sentence. Martha Washington was first lady.

Save this file as /tmp/in/test.txt.

Now, start up the NLP Building Blocks:

git clone https://github.com/mtnfog/nlp-building-blocks
cd nlp-building-blocks
docker-compose up

Now you can start the processors in the flow! The file /tmp/in/test.txt will disappear and three files will appear in /tmp/out/. The three files will have random UUIDs for filenames thanks to the UpdateAttribute processor. If we look at the contents of each of these files we see:

First file:

{"entities":[{"text":"George Washington","confidence":0.96,"span":{"tokenStart":0,"tokenEnd":2},"type":"person","languageCode":"eng","extractionDate":1514488188929,"metadata":{"x-model-filename":"mtnfog-en-person.bin"}}],"extractionTime":84}

Second file:

{"entities":[],"extractionTime":7}

Third file:

{"entities":[{"text":"Martha Washington","confidence":0.89,"span":{"tokenStart":0,"tokenEnd":2},"type":"person","languageCode":"eng","extractionDate":1514488189026,"metadata":{"x-model-filename":"mtnfog-en-person.bin"}}],"extractionTime":2}

The input text was broken into three sentences so we have three output files. In the first file we see that George Washington was extracted as a person entity. The second file did not have any entities. The third file had Martha Washington as a person entity. Our NLP pipeline orchestrated by Apache NiFi read the input, broke it into sentences, broke each sentence into tokens, and then identified named-entities from the tokens.

This flow assumed the language would always be English, but if you are unsure you can add another InvokeHTTP processor to utilize Renku Language Detection Engine. This enables language detection inside your flow so you can route the FlowFiles based on the detected language, giving you a very powerful NLP pipeline.

There’s a lot of cool stuff here but arguably one of the coolest is that by using the NLP Building Blocks you don’t have to pay per-request pricing that many of the NLP services charge. You can run this pipeline as much as you need to. And if you are in an environment where your text can’t leave your network, this pipeline can be run completely behind a firewall (just like we did in this post).


String Tokenization with OpenNLP

OpenNLP is an open-source library for performing various NLP functions. One of those functions is string tokenization. With OpenNLP’s tokenizers you can break text into its individual tokens. For example, given the text “George Washington was president” the tokens are [“George”, “Washington”, “was”, “president”].

If you don’t want the trouble of making your own project, look at Sonnet Tokenization Engine. Sonnet, for short, performs text tokenization via a REST API. It is available on the AWS and Azure marketplaces.

A lot of NLP functions operate on tokenized text, so tokenization is an important part of an NLP pipeline. In this post we will use OpenNLP to tokenize some text. At the time of writing, the current version of OpenNLP is 1.8.3.

The tokenizers in OpenNLP are located under the opennlp.tools.tokenize package. This package contains three important classes and they are:

  • WhitespaceTokenizer
  • SimpleTokenizer
  • TokenizerME

The WhitespaceTokenizer does exactly what its name implies – it breaks text into tokens based on the presence of whitespace. The SimpleTokenizer is a little smarter: it tokenizes text based on the character classes in the text. Lastly, the TokenizerME performs tokenization using a trained token model. As long as you have data to train your own model, this is the class you should use, as it will give the best performance. All three classes implement the Tokenizer interface.

You can include the OpenNLP dependency in your project:

<dependency>
 <groupId>org.apache.opennlp</groupId>
 <artifactId>opennlp-tools</artifactId>
 <version>1.8.3</version>
</dependency>

The WhitespaceTokenizer and SimpleTokenizer can be used in a very similar manner. First, the SimpleTokenizer:

SimpleTokenizer tokenizer = SimpleTokenizer.INSTANCE;
String tokens[] = tokenizer.tokenize("George Washington was president.");

And the WhitespaceTokenizer:

WhitespaceTokenizer tokenizer = WhitespaceTokenizer.INSTANCE;
String tokens[] = tokenizer.tokenize("George Washington was president.");

The tokenize() function takes a string and returns a string array containing the tokens of the string.

As mentioned earlier, the TokenizerME class uses a trained model to tokenize text. This is much more fun than the previous examples. To use this class we first load a token model from disk. We are going to use the en-token.bin model file available here. Note that these models are really only good for testing since the text they were trained on is likely different from the text you will be processing.

To start we load the model into an input stream from the disk:

InputStream inputStream = new FileInputStream("/path/to/en-token.bin"); 
TokenizerModel model = new TokenizerModel(inputStream);

Now we can instantiate a Tokenizer from the model:

TokenizerME tokenizer = new TokenizerME(model);

Since TokenizerME implements the Tokenizer interface it works just like the SimpleTokenizer and WhitespaceTokenizer:

String tokens[] = tokenizer.tokenize("George Washington was president.");

The tokenizer will tokenize the text using the trained model and return the tokens in the string array. Pretty cool, right?

Deploy Sonnet Tokenization Engine on AWS and Azure.

Sonnet, Prose, and Idyl E3 now on Azure Marketplace

We are happy to announce that Sonnet Tokenization Engine, Prose Sentence Extraction Engine, and Idyl E3 Entity Extraction Engine have joined Renku Language Detection Engine on the Microsoft Azure Marketplace!


Idyl E3 3.0 to be a Microservice

Idyl E3 Entity Extraction Engine is an all-in-one solution for performing entity extraction from natural language text. It takes in unmodified natural language text and, through a pipeline, identifies the language of the text, finds the sentences in the text, tokenizes those sentences, and extracts entities from those tokens. It is not exactly what you would call a microservice; the archives for version 2.6.0 are nearly 1 GB in size.

With the introduction of the NLP Building Blocks earlier this year, we began breaking up Idyl E3 into a set of smaller services that perform its individual functions. Renku identifies languages, Prose extracts sentences, and Sonnet performs tokenization. Joining the mix soon with its first release will be Lacuna, which classifies documents; Lacuna can be used to route documents through your NLP pipelines based on their content. Each of these applications is small (less than 30 MB), stateless, and horizontally scalable. Using these building blocks for an NLP pipeline instead of the all-in-one Idyl E3 provides much improved flexibility: you can now create loosely coupled microservices in your custom NLP pipeline.

With that said, Idyl E3 3.0 will become a microservice whose only function is entity extraction. This will dramatically cut Idyl E3’s deployment size, making it easier to deploy and manage. Like the other building blocks, Idyl E3 3.0 will be available as a Docker container. Because Idyl E3’s functionality will be trimmed down, its pricing will also be reduced. Stay tuned for the updated pricing.

To help bring the NLP building blocks together in a pipeline, we have made the nlp-building-blocks-java-sdk available on GitHub. It includes clients for each product’s API, and the Apache-licensed project also includes the ability to tie the clients together in a pipeline. This is a Java project, but we hope to eventually have similar projects available for other languages.

We are very excited to take this path of making NLP building block microservices. We believe it provides awesome flexibility and control over your NLP pipelines.

Renku Language Detection Engine

Renku Language Detection Engine is now available. Renku, for short, is an NLP building block application that performs language detection on natural language text. Renku’s API allows you to submit text for analysis and receive back a list of language codes and associated probabilities. Renku is free for personal, non-commercial, and commercial use.

You can get started with Renku in a docker container quickly:

docker run -p 7070:7070 -it mtnfog/renku:1.1.0

Once running, you can submit requests to Renku. For example:

curl http://localhost:7070/api/language -d "George Washington was the first president of the United States."

The response from Renku will be a list of three-letter language codes and each code’s associated probability. The languages will be ordered from highest probability to lowest. In this example, the highest probability language will be “eng” for English.

NLP Building Blocks

With the introduction of a new product called Lacuna Document Classification Engine, we are continuing toward our goal of providing the building blocks for larger NLP systems. Lacuna, for short, is an application that uses deep learning algorithms to classify documents into predefined categories.

Document classification has many uses in NLP systems though it is probably most famous for applications such as sentiment analysis and spam detection. Using Lacuna with Idyl E3 allows you to construct NLP pipelines capable of automatically performing entity extraction based on a document’s category. For instance, if Lacuna categorizes a document as a movie review, the document can be sent to an Idyl E3 containing an entity model for actors. Or, if Lacuna categorizes a document as a scientific paper, the document can be sent to an Idyl E3 containing an entity model for chemical compounds. Lacuna allows NLP pipelines to be more fluid and less rigid.

Lacuna will be available for download, on cloud marketplaces, and as a Docker container.

Idyl E3 2.5.1

We are in the process of publishing Idyl E3 2.5.1 to the AWS Marketplace and also to our website for download. The only change in 2.5.1 from 2.5.0 is a fix to address OpenNLP CVE-2017-12620. We have updated the Release Notes to reflect this as well.

The details of the issue are explained in OpenNLP CVE-2017-12620. It is important that only models from trusted sources are used in Idyl E3. Please be aware of a model’s origin whether it be a model that was downloaded from our website, created by you, or created by someone else in your organization.

Yahoo! Vespa and Entity Annotations

Some interesting news this week is that Yahoo! has open-sourced their software that drives many of their content recommendation systems. The software, called Vespa, is available at vespa.ai.

Annotations on words and phrases in the text can be provided as text is ingested into Vespa. This process is described in the Vespa Annotations API documentation. But in order to make these annotations you need something that can identify persons, places, and things in the text! Idyl E3 Entity Extraction Engine is perfect for this and here’s how:

You probably have a pipeline in which text is gathered from some source and eventually pushed to your search application, in this case Vespa. All that is needed is to modify your pipeline to first send the text to Idyl E3 to get the entities. Once a response is received from Idyl E3, the text along with its annotations can be sent on to Vespa. It really is that easy. You can customize the types of entities to extract through the entity models installed in Idyl E3, so you could annotate persons, places, and things like buildings, schools, and airports.

To recap, in case you have not yet read about Vespa, it is worth a few minutes of your time. Its ability to ingest text with annotations makes it a natural fit for Idyl E3. You can certainly use Idyl E3 to annotate text for Vespa now, and we are going to make some improvements to make working with Vespa even easier.

Idyl E3 2.6.0 Updates

As we work toward Idyl E3 2.6.0 we keep the Release Notes page updated with what’s new, tweaked, and fixed in 2.6.0. Probably the most significant new feature is support for GPUs.

Blacklisted Models

Less exciting but still useful is how models that fail to load are handled in 2.6.0. Previously, when a model failed to load it would be retried the next time the model was needed. If nothing had changed that could help the model load, this resulted in needlessly retrying and failing. In 2.6.0, if a model fails to load it is added to a blacklist, and Idyl E3 will not attempt to reload any blacklisted model until Idyl E3 is restarted. A message is written to Idyl E3’s log when a model is blacklisted.

A model can fail to load for a few reasons. The most common reasons are:

  • The model file defined in the manifest does not exist or cannot be read due to insufficient permissions.
  • The model’s encryption key is invalid.
  • The model’s license key is invalid.

IDYL_E3_HOME Environment Variable

Also noteworthy is the IDYL_E3_HOME environment variable that must be set. If you launch Idyl E3 through the AWS Marketplace it is taken care of for you. If not, you just need to set IDYL_E3_HOME to the location where you extracted Idyl E3 (we recommend /opt/idyl-e3):

export IDYL_E3_HOME=/opt/idyl-e3

Most of Idyl E3’s scripts reference the IDYL_E3_HOME environment variable to know where to find Idyl E3’s files.

Model Downloader

The last new thing we’ll mention here is a new tool included with Idyl E3 called the Model Downloader. When run, this command line tool lists the models we have available and lets you download and install them directly into your Idyl E3. No more downloading via your web browser and then copying to Idyl E3 – you can now download models straight from Idyl E3. The tool will prompt you for a login (it is your Mountain Fog account username and password – register for free if you need a login) and then present you with a simple menu. The tool also supports a non-interactive mode so you can script the download of models!

We’ll give a more detailed look at the Model Downloader tool once 2.6.0 is released so stay tuned.

Model Download Tool

In Idyl E3 2.6.0 we will be introducing a command line tool to download entity, sentence, and token models directly from us. The tool will make getting and using Idyl E3 models much easier. You will no longer have to manually download a model, unzip it, and copy it to Idyl E3’s models directory. The tool will perform these steps for you. It will have both interactive and non-interactive modes so it can be integrated into provisioning scripts to automatically obtain models when deployed.

On our side, the tool will help us to more rapidly create models and make them available to you.

The tool will be bundled with Idyl E3 2.6.0 and will support all platforms.

Idyl E3 2.5.0

Idyl E3 2.5.0 will soon be available on the AWS Marketplace. We will post an update when the new versions are added. The marketplace links remain the same.

(Compare the available editions of Idyl E3.)

The Idyl E3 2.5.0 Release Notes describe what’s new, changed, and fixed in this version. The big new feature is the ability to create and use deep learning neural network entity models. It’s all covered in the Idyl E3 2.5.0 User’s Guide.

Streaming Text in Idyl E3 2.5.0

The Idyl E3 API has an /extract endpoint that receives text and returns the extracted entities in response. This means you have to make a full HTTP connection for each extraction request. Idyl E3 2.5.0 introduces the ability to accept streaming text through a TCP socket. When Idyl E3 starts it will open a TCP port and listen for incoming text. As text is received, Idyl E3 extracts entities from it and returns an entity extraction response over the same socket.

Now you can extract entities from the command line using a tool like netcat:

cat some-file.txt | netcat [idyl-e3-ip-address] [port]

Compare that command with using cURL:

curl -X POST http://idyl-e3-ip-address:port/api/v2/extract -H "Content-Type: text/plain; charset=UTF-8" -d "George Washington was president."

It’s easy to see which command is simpler. Streaming should make processing text files and other continuous sources of text much simpler.

The response to streaming input is identical to the response received from the /extract endpoint. (Both commands above will produce the same output.)

{
   "entities":[
      {
         "text":"George Washington",
         "confidence":0.96,
         "span":{
            "tokenStart":0,
            "tokenEnd":2,
            "characterStart":0,
            "characterEnd":17
         },
         "type":"person",
         "languageCode":"eng",
         "context":"not-set",
         "documentId":"not-set",
         "extractionDate":1502970191843,
         "metadata":
         }
      }
   ],
   "extractionTime":72
}

Streaming is disabled by default. To enable it set the streaming.enabled property to true in Idyl E3’s properties file. Streaming does not currently support authentication. See the Idyl E3 Documentation for more streaming configuration options.
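For example, enabling streaming is a one-line change in the properties file; other streaming-related settings, such as the listening port, are covered in the documentation:

streaming.enabled=true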

What’s New in Idyl E3 2.5.0

Here’s a quick summary of what’s new in Idyl E3 2.5.0. It’s not available yet but will be soon. For a full list check out the Idyl E3 Release Notes.

What’s New

  • English-language person entity models can now be trained using the CoNLL-2003 format.
  • You can now create and use deep learning neural network entity models. Check out the previous blog posts for more information!
  • There’s a new setting that allows you to control how duplicate entities per extraction request are handled. You can choose to retain all duplicates or return only the duplicate entity with the highest probability.
  • A new TCP endpoint accepts streaming text. This endpoint deprecates the /ingest API endpoint.

What’s Changed

  • Idyl E3 2.5.0 changes all language codes to three-letter ISO 639 codes. While two-letter codes are still supported, we recommend using the three-letter codes instead.

What’s Fixed

  • Entities extracted by a non-model method (regular expression, dictionary) used to return a value of 100.0 for the entity’s probability. Extracted entity probabilities should exist within the range 0 to 1 so these entities are now extracted with a probability of 1.0 instead of 100.0.

Deep Learning Entity Models in Idyl E3 2.5.0

As we mentioned in an earlier post, Idyl E3 2.5.0 will include the ability to create and use deep learning neural network entity models. As we get closer to the release of Idyl E3 2.5.0 we wanted to introduce the new capability and compare it with the current entity models.

In Idyl E3 2.4.0 you can create entity models through a perceptron algorithm. This algorithm requires as input annotated training text and a list of features. Feature selection can be a difficult task. Too many features can result in over-fitting the model such that it performs well on the input text but does not generalize well to other text. Feature selection is a crucial part of producing a useful, quality model.

Idyl E3 2.5.0’s ability to create and use deep learning models still requires annotated input text but does not require a list of features. The features are discovered automatically during the execution of the neural network algorithm through the use of word vectors, produced by applications like Word2vec or GloVe. Using one of these tools, you generate a file of word vectors from your training text and provide it to Idyl E3 during model training. In summary, manual feature selection is not required for deep learning models.
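As a rough sketch of that step, the original word2vec tool can produce a text file of vectors from a training corpus with a command along these lines. The file names and hyperparameter values here are placeholders, not recommendations, and GloVe can be used in a similar way.

./word2vec -train training-text.txt -output vectors.txt -size 200 -window 5 -min-count 1 -binary 0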

While word vectors really help with deep learning model training, training a deep learning model can still be a challenging task. A neural network has many hyperparameters that tune the underlying algorithms. Small changes to these hyperparameters can have a dramatic effect on the generated model. Hyperparameter optimization is an active area of academic and industry research. Tools and best practices exist to help with hyperparameter selection and we will provide some useful resources in the near future.

Idyl E3 2.5.0 and newer versions will continue to support using and creating maximum entropy based models so you can choose which type of model you want to create and use.

 

Updated English-person Entities Base Model

In the upcoming week we will be posting an updated English-person entities base model on the Models page. The model, like the version it replaces, will be free to use and included in the upcoming Idyl E3 2.5.0 release. To give an idea of the model’s performance, we evaluated it against the CoNLL-2003 training set; the results are as follows:

  • Precision: 0.916720
  • Recall: 0.776873
  • F-Measure: 0.841023

Please keep in mind that these models are trained on general text and may not provide adequate performance for all text. In these cases it is recommended that you use Idyl E3 Analyst Edition to create a custom entity model from your text. Launch an instance on AWS today.

 

Deep Learning Entity Extraction in Idyl E3

Idyl E3 Entity Extraction Engine 2.5.0 will introduce entity extraction powered by deep learning neural networks. Neural networks are powerful machine learning algorithms that excel at tasks like natural language processing. Idyl E3 will also support entity model training and usage on GPUs as well as CPUs. Using GPUs provides significant performance improvements. Idyl E3 2.5.0 will add support for AWS’ P2 instance type.

Entity models created by a deep learning neural network will be referred to as “second generation models.” Entity models created by Idyl E3 2.4.0 and earlier will be referred to as “first generation models.”

So how are the current entity models going to be different than the deep learning entity models?

Good question. Training entity models with Idyl E3 2.4.0 and earlier requires you to identify “features” of your text in order to train the model. Some examples of features include where an entity appears in a sentence, what words surround it, whether the word is capitalized, and so on. While you can create very powerful models using this method, identifying the features can be a laborious task that requires intimate knowledge of the text. It can also result in over-fitting, causing the model to generalize poorly to non-training text.

When training a deep learning entity model there is no need to identify the features as the algorithm is able to learn the features on its own during the training. It is able to do this through word vectors. Idyl E3 2.5.0 will be able to use word vectors generated by word vector applications such as word2vec and GloVe. To create a deep learning entity model simply provide your input training text and word vectors and Idyl E3 will generate the model.

Can I customize the neural network used to train a model?

There will be many options available to customize the neural network used for model training, along with a standard set of defaults that works out of the box. We will describe all of the available options in the Idyl E3 2.5.0 User’s Guide.

Will there be any other impacts of the new type of model training?

No. You can continue to use your existing first generation models. You can also continue to train new first generation models. In fact, you can use first and second generation models simultaneously in an Idyl E3 pipeline.

Any other questions that we did not cover? Let us know!

English-language “Places” model in Idyl E3 2.4.0 Analyst Edition

Idyl E3 2.4.0 now includes an English-language “Places” model as well as an English-language “Persons” model. Prior to version 2.4.0, only the persons model was included. Idyl E3 2.4.0 Analyst Edition will be available from the AWS Marketplace soon.

The model will be loaded automatically when Idyl E3 2.4.0 Analyst Edition starts. An entity extraction request such as “George Washington was president of the United States.” will return two entities:

  • George Washington (person)
  • United States (place)

AWS Marketplace

Idyl E3 2.4.0 comes with a free 30-day trial period in which you can use a single instance of Idyl E3 in AWS while paying only the cost of the underlying instance!

Idyl E3 2.4.0

Idyl E3 2.4.0 is now available for download. It will be available on the AWS Marketplace and DockerHub soon.

The two new features in 2.4.0 are:

  • The Idyl NLP annotation format that lets you store your annotations outside your training text. See previous post.
  • You can now configure how duplicate entities are handled. See previous post.

As always, we’re excited to release a new version and welcome your feedback.

Handling Duplicate Entities

When performing entity extraction it is common for an entity extraction request to return duplicate entities. For example, given the input:

George Washington was president. George Washington was married to Martha.

Idyl E3 may return the following entities:

  • George Washington – person – 86% confidence
  • George Washington – person – 89% confidence

The entity “George Washington” is a duplicate because the entity text and entity type match at least one other entity in the same entity extraction response. New in Idyl E3 2.4.0 is the ability to choose how duplicate entities are handled. The default behavior (the same as in past versions) is to return all entities regardless of whether they are duplicates. A new option is to return only the duplicate entity having the highest confidence. For example, given the entities above, Idyl E3 would return only the entity with 89% confidence; the duplicate having lower confidence is ignored.

The “Duplicate Entity Handling Strategy” is controlled via the duplicate.entity.handling.strategy property in Idyl E3’s configuration file. The valid values are:

  • retain – All entities are returned. This is the default behavior.
  • highest – When duplicate entities are present in a single entity extraction request, only the entity having the highest confidence value will be returned.

In summary, the new duplicate.entity.handling.strategy property controls how duplicate entities are handled on a per-entity extraction request basis. This property will be available in Idyl E3 2.4.0 and is documented in Idyl E3 2.4.0’s configuration documentation.
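For example, to return only the highest-confidence duplicate you would set the property in Idyl E3’s configuration file as a standard Java-style properties entry:

duplicate.entity.handling.strategy=highest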

Training Definition File

In the next release of Idyl E3 Entity Extraction Engine (which will be version 2.4.0) we will introduce the Training Definition File to help alleviate a few problems.

The problems:

  1. When training an entity model there are quite a few command line arguments that you have to provide. The sheer number of arguments doesn’t help with usability.
  2. After training a model, unless you keep excellent documentation it’s easy to lose track of the training parameters. What entity type? Language? Iterations? Features? And so on.
  3. How do you manage the command line arguments and the feature generators XML file?

The Training Definition File offers a solution to these problems. It is an XML file that contains all of the training parameters. Everything. Now you have a record of the parameters used to create the model while also simplifying the command line arguments. Note that you can still use the command line arguments as they will remain available.

Below is an example of a training definition file. Note that the final format may change between now and the release.

<?xml version="1.0" encoding="UTF-8"?>
<trainingdefinition xmlns="https://www.mtnfog.com">
	<algorithm cutoff="1" iterations="1" threads="2" />
	<trainingdata file="person-train.txt" />
	<model file="person.bin" encryptionkey="enckey" language="en" type="person" />	
	<features>
		<generators>
			<cache>
				<generators>
					<window prevLength="2" nextLength="2">
						<tokenclass />
					</window>
					<window prevLength="2" nextLength="2">
						<token />
					</window>
					<definition />
					<prevmap />
					<bigram />
					<sentence begin="true" end="true" />
				</generators>
			</cache>
		</generators>
	</features>
</trainingdefinition>

You can see in the above XML that all of the entity model training parameters are included. The training definition file defines four things:

  1. The training algorithm.
  2. The training data.
  3. The output model.
  4. The feature generators.

This removes the need for a separate feature generators file since it is now included in the training definition file. Now when training an entity model you can use the simpler command:

java -jar idyl-e3-entity-model-generator.jar -td training-definition.xml

Look for the training definition file functionality to be included with Idyl E3 2.4.0. The details may change so check back for updates.

Idyl E3 2.3.0

We are announcing the availability of Idyl E3 2.3.0! This version has a long list of new features. You can see the full list in the Release Notes; we summarize the changes below, and they are covered in the documentation.

You can download the new version from our website and look for it to be available on cloud marketplaces in the upcoming week. We love adding new features and supporting our users’ needs. Feel free to let us know how we’re doing!

API Changes

  • There is a new option for API authentication. You can now use HMACSHA512 instead of plain authorization.
  • There is a new /sanitize endpoint that takes in text, identifies the entities in the text, and returns the text without the entities. This endpoint is useful for cases where you want to remove PII or PHI from text.
  • A new sort parameter was added to the /extract endpoint to control how the extracted entities are sorted in the response.
  • Each API endpoint now responds with HTTP 405 Method Not Allowed when given a HEAD request. This change is to support smoother integration between Idyl E3 and Apache NiFi.
  • Version 1 of the API has been deprecated and will be removed in Idyl E3 2.4.0.

Entity Extraction

  • You can now create and use part-of-speech and lemmatization models to improve entity extraction performance.
  • Added new feature generators to help improve entity extraction performance.
  • We added a new plugin to complement Idyl E3’s entity extraction capabilities with Google Cloud Natural Language API.

In addition, there were some minor bug fixes and performance improvements. There is also the new Idyl E3 SDK for Go.

Idyl E3 Entity Extraction Engine AWS Reference Architectures

With the popularity of running Idyl E3 Entity Extraction Engine on AWS, we wanted to provide some AWS reference architectures to help you get started deploying Idyl E3 to AWS. Don’t forget Idyl E3 is available on the AWS Marketplace for easy launching, and we have some Idyl E3 CloudFormation templates available on GitHub. We also offer managed Idyl E3 services if you prefer a hands-off approach to Idyl E3 deployment and operation.

A Few Notes Before Starting

Using a Pre-Configured AMI

No matter which architecture you choose, we recommend creating a pre-configured Idyl E3 AMI and using it to launch new Idyl E3 instances. This method is recommended instead of relying on user-data scripts to perform the configuration because the time required to spin up a pre-configured AMI can be significantly less than configuring instances at boot with user-data scripts. If you want to keep the AMI configuration under source control, we highly recommend using HashiCorp’s Packer to build the AMI.

Stateless API

Before we describe the architectures it is helpful to note that the Idyl E3 API is stateless. There is no session data that needs to be shared between instances, and as long as all Idyl E3 instances are configured identically (as they should be when behind a load balancer), it does not matter which instance an entity extraction request is routed to. We can take advantage of this statelessness to scale Idyl E3 up (and down) as much as needed to meet the demands of the load.

Load-balanced Architecture

The first architecture is a very simple one, yet it is probably adequate to meet the needs of most users. It has a single VPC containing two subnets. One subnet is public and contains an Elastic Load Balancer (ELB); the other is private and contains the Idyl E3 instances. In the diagram shown below, the ELB is a public ELB, allowing Idyl E3 requests to be received from the internet. However, if your application will also run in the VPC you can change the ELB to an internal ELB. Note that this architecture uses a fixed number of Idyl E3 instances behind the ELB; any scaling up or down has to be performed manually. Idyl E3’s API has a /health endpoint that returns HTTP 200 OK when everything is okay, which is perfect for ELB instance health checks.
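You can also hit the health endpoint yourself as a quick sanity check; the host and port below are placeholders for your own instance:

curl -i http://idyl-e3-ip-address:port/health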

Simple Idyl E3 AWS Architecture with VPC and ELB

Load-balanced and Auto-scaling Architecture

Launch the Idyl E3 CloudFormation stack!

The previous architecture is simple but very functional, and it minimizes cost. The first thing that will be noticed in this architecture is the static nature of the Idyl E3 instances. To provide some flexibility we can modify this architecture a bit and put the Idyl E3 instances into an autoscaling group. We can use the group’s Desired Capacity to manually control the number of Idyl E3 instances, or we can configure the autoscaling group to automatically scale up and down based on some chosen metrics. Average CPU usage is a good metric for scaling Idyl E3 because entity extraction can cause CPU usage to rise. With that change, here is what our architecture looks like now:

Idyl E3 AWS architecture with VPC, ELB, and autoscaling.

With autoscaling we don’t have to worry about unexpected surges or drops in entity extraction requests. The number of Idyl E3 instances will automatically scale up and down based on the average CPU usage of all Idyl E3 instances. Scaling down is important in order to keep costs to a minimum; nobody wants to pay for more than they need. A sketch of such a scaling policy is shown below.
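As an illustration, a target-tracking scaling policy in a CloudFormation template could look roughly like the following. The resource name IdylE3AutoScalingGroup and the 60% CPU target are assumptions made for this example, not values taken from our published templates.

ScalingPolicy:
  Type: AWS::AutoScaling::ScalingPolicy
  Properties:
    AutoScalingGroupName: !Ref IdylE3AutoScalingGroup
    PolicyType: TargetTrackingScaling
    TargetTrackingConfiguration:
      PredefinedMetricSpecification:
        PredefinedMetricType: ASGAverageCPUUtilization
      TargetValue: 60.0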

This architecture is available in our GitHub repository of Idyl E3 CloudFormation Templates. The template also contains an optional bastion instance to facilitate SSH access into the Idyl E3 instances from outside the VPC.

Need more?

Got more complicated requirements? Let us know. We have AWS certified engineers on staff and we’ll be glad to help.

Apache NiFi EQL Processor

We have published a new open source project on GitHub: an Apache NiFi processor that filters entities through an Entity Query Language (EQL) query. When used along with the Idyl E3 NiFi Processor, you can perform entity filtering in a NiFi dataflow pipeline.

To add the EQL processor to your NiFi pipeline, clone the project and build it or download the jar file from our website. Then copy the jar to NiFi’s lib directory and restart NiFi. The processor will now be available in the list of processors:

The EQL processor has a single property that holds the EQL query:

For this example our query will look for entities whose text is “George Washington”:

select * from entities where text = "George Washington"

Entities matching the EQL query will be output from the processor as JSON. Entities not matching the EQL query will be dropped.

With this capability we can create Apache NiFi dataflows that produce alerts when an entity matches a given set of conditions. Entities matching the EQL query can be published to an SQS queue, a Kafka stream, or any other NiFi processor.

The Entity Query Language previously existed as a component of the EntityDB project. It is now its own project on GitHub and is licensed under the Apache Software License, 2.0. The project’s README.md contains more examples of how to construct EQL queries.

Open Source Project Updates

A few open source project updates to share:

  • The Entity Query Language (EQL) has been moved out of the EntityDB project and moved into its own entity-query-language project on GitHub. It is also now licensed under the Apache Software License, version 2.0.
  • A new project, eql-nifi-processor, is an Apache NiFi processor for performing EQL queries on entities extracted using the Idyl E3 NiFi processor.

Apache NiFi and Idyl E3 Entity Extraction Engine

We are happy to let you know how Idyl E3 Entity Extraction Engine can be used with Apache NiFi. First, what is Apache NiFi? From the NiFi homepage: “Apache NiFi supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic.” Idyl E3 extracts entities (persons, places, things) from natural language text.

That’s a very short description of NiFi but it is very accurate. Apache NiFi allows you to configure simple or complex dataflows for processing data. For example, you can configure a pipeline to consume files from a file system and upload them to S3. (See Example Dataflow Templates.) There are many operations you can perform, and they are carried out by components called Processors. There are many excellent guides about NiFi available online.

There are many processors available for NiFi out of the box. One in particular is the InvokeHTTP processor that lets your pipeline send an HTTP request. You can use this processor to send text to Idyl E3 for entity extraction from within your pipeline. However, to make things a bit simpler and more flexible, we have created a custom NiFi processor just for Idyl E3. This processor is available on GitHub and its binaries will be included with all editions of Idyl E3 starting with version 2.3.0.

Idyl E3 NiFi Processor

Instructions for how to use the Idyl E3 processor will be added to the Idyl E3 documentation, but they are simple. Here’s a rundown. Copy the idyl-e3-nifi-processor.jar from Idyl E3’s home directory to NiFi’s lib directory and restart NiFi. Once NiFi is available you will see the Idyl E3 processor in the list of processors when adding a processor:

Idyl E3 NiFi Processor

There are a few properties you can set but the only required property is the Idyl E3 endpoint. By default, the processor extracts entities from the input text but this can be changed using the action property. The available actions are:

  • extract (the default) to get a JSON response containing the entities.
  • annotate to return the input text with the entities annotated.
  • sanitize to return the input text with the entities removed.
  • ingest to extract entities from the input text but provide no response. (This is useful if you are letting Idyl E3 plugins handle the publishing of entities to a database or other service outside of the NiFi data flow.)

The available properties are shown in the screen capture below:

And that is it. The processor will send text to Idyl E3 for entity extraction via Idyl E3’s /api/v2/extract endpoint. The response from Idyl E3 containing the entities will be placed in a new idyl-e3-response attribute.

The Idyl E3 NiFi processor is licensed under the Apache Software License, version 2.0. Under the hood, the processor uses the Idyl E3 client SDK for Java which is also licensed under the Apache license.

Idyl NLP Annotation Format

Idyl E3’s entity model training tool expects entities in training text to be annotated in the format used by OpenNLP. This format uses START and END tags to denote entities:

<START:person> George Washington <END> was president.

This works well but it has a drawback: the annotations and text have to be combined in a single file. Once the text is annotated it becomes difficult to use the training text for any other purpose.

New Annotation Format

Idyl E3 2.4.0 is going to introduce an additional method of annotating text that allows the annotations to be stored separately from the training text. In 2.4.0 the annotations can be stored in a separate file (and we plan to eventually support storing the annotations in a database). Even though Idyl E3 2.4.0 is not yet ready for prime time, we wanted to introduce this feature early in case you are in the middle of any annotation efforts and want to use the new format.

It is still required that the input text contain a single sentence per line. Use blank lines to indicate document boundaries. Here’s an example of a simple input training file:

George Washington was president .
He was president of the United States .
George Washington was married to Martha Washington .
In 1755 , Washington became the senior American aide to British General Edward Braddock on the ill-fated Braddock expedition .

And here’s the annotations stored in a separate file:

1 0 2 person
2 5 6 place
3 0 2 person
3 5 7 person
4 11 12 person

Here’s what this means. Each line in the annotations file represents an annotation in the training text. So there are 5 annotations in this example.

  • The first column is the line number that contains the entity. In this example there is at least one annotation on each of the four lines, with two annotations on line 3.
  • The second column is the token index of the start of the entity. Indexes are zero-based so the first token is zero!
  • The third column is the token index of the end of the entity.
  • The last column is the type of the entity.

Note that there are two entities in the third line, and each is put on its own line in the annotations file. Alternatively, a three-column format that specifies the entity text itself removes the need to specify the entity’s token start and end positions. That format will only annotate the first occurrence of the entity text. (If Edward Braddock had occurred more than once in the input text on line 4, only the first occurrence would be annotated.)

Summary

Now your annotations can be kept separate from your training text allowing you to use your training text for other purposes. Additionally, we hope that this new annotation method helps decrease the time required for annotating and helps with automating the process. As mentioned earlier in the post, currently the only supported means of storing the annotations is in a separate file but we plan to extend this to support databases in a future release of Idyl E3.

The Entity Model Generator tool included in Idyl E3 has been updated to allow for using this new annotation format. You can, however, continue to use the OpenNLP-style annotations when creating entity models. This new annotation format is only available for entity models. Sentence, token, parts-of-speech, and lemma model annotations will remain unchanged in 2.4.0.

Idyl E3 SDK for Go

The Idyl E3 SDK for Go is now available on GitHub. This SDK allows you to integrate Idyl E3’s entity extraction capabilities into your Go projects.

Like the other Idyl E3 SDKs, the project is licensed under the Apache Software License, version 2.0.

It’s easy to use:

endpoint := "http://localhost:9000"
s := "George Washington was president."
confidence := 0
context =: "context"
documentID := "documentID"
language := "en"
key := "your-api-key"

response := Extract(endpoint, s, confidence, context, documentID, language, key)

 

Amazon EBS Elastic Volumes

On February 13, 2017, Amazon Web Services announced elastic EBS volumes! If you have used EC2 much you have undoubtedly been frustrated by the rigidity of EBS volumes. Once created they could not be modified or resized. If your EC2 instance required more disk space, your only option was to manually create a new volume of the desired size and attach it to your instance. Now that EBS volumes are more “elastic,” you can simply resize an EBS volume. I put “elastic” in quotes because the volume size can only be increased, not decreased. That’s more elastic than before but still not completely elastic. In addition to adjusting size, you can now adjust performance and change the volume type even while the volume is in use. These functions are available for your existing EBS volumes.

You can use the AWS CLI to modify a volume:

aws ec2 modify-volume --region us-east-1 --volume-id vol-11111111111111111 --size 200 --volume-type io1 --iops 10000

After enlarging a volume don’t forget to tell your OS to use the newly allocated storage.
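For example, on a Linux instance with an ext4 filesystem you might grow the partition and then the filesystem with commands along these lines. The device names and filesystem type are assumptions for illustration; use xfs_growfs instead of resize2fs for XFS.

sudo growpart /dev/xvda 1
sudo resize2fs /dev/xvda1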

This can make life a lot easier in many situations. As described in the AWS blog post, you can use this functionality in combination with CloudWatch and Lambda to automatically enlarge volumes when running low on disk space. You can also use it to save money by starting with a smaller EBS volume than you might ultimately need, knowing you have the flexibility to increase its capacity when needed.

Why do we find this interesting? Our Idyl E3 managed services run in AWS, and we encourage all potential customers to launch Idyl E3 from the AWS Marketplace due to its ease of use and turn-key capabilities. So we like to pass interesting and relevant information about related services on to our users and readers when it becomes available. Learn more about Idyl E3’s entity extraction capabilities.