Introducing NLP Flow

Today we are introducing NLP Flow, a collection of processors for the popular Apache NiFi data platform to support NLP pipeline data flows.

Apache NiFi is a cross-platform tool for creating and managing data flows. With Apache NiFi you can create flows to ingest data from a multitude of sources, perform transformations and logic on the data, and interface with external systems. Apache NiFi is a stable and proven platform used by companies worldwide.

Extending Apache NiFi to support NLP pipelines is a perfect fit. NLP Flow is, in Apache NiFi terminology, a set of processors that facilitate NLP tasks via our NLP Building Blocks. With NLP Flow, you can create powerful NLP pipelines inside of Apache NiFi to perform language identification, sentence extraction, text tokenization, and named-entity extraction. For example, an NLP pipeline to ingest text from HDFS, extract all named-person entities for English and Spanish text, and persist the entities to a MongoDB database can be managed and executed within Apache NiFi.

NLP Flow is free for everyone to use. An existing Apache NiFi (a free download) installation is required.


Using the NLP Building Blocks with Apache NiFi to Perform Named-Entity Extraction on Logical Entity Exchange Specifications (LEXS) Documents

In this post we are going to show how our NLP Building Blocks can be used with Apache NiFi to create an NLP pipeline to perform named-entity extraction on Logical Entity Exchange Specifications (LEXS) documents. The pipeline will extract a natural language field from each document, identify the named-entities in the text through a process of sentence extraction, tokenization, and named-entity recognition, and persist the entities to a MongoDB database. While the pipeline we are going to create uses data files in a specific format, it could easily be modified to read documents in a different format.

LEXS is an XML, NIEM-based framework for information exchange developed for the US Department of Justice. While the details of LEXS are out of scope for this post, the key points are that it is XML-based, mixes structured and unstructured text, and is used to describe various law enforcement events. We have taken the LEXS specification and created test documents for this pipeline. Example documents are also available on the public internet.

And just in case you are not familiar with Apache NiFi, it is a free (Apache-licensed), cross-platform application that allows the creation and execution of data flow processes. With Apache NiFi you can move data through pipelines while applying transformations and executing actions.

The completed Apache NiFi data flow is shown below.

NLP Building Blocks

This post requires that our NLP Building Blocks are running and accessible. The NLP Building Blocks are microservices to perform NLP tasks. They are:

Renku Language Detection Engine
Prose Sentence Extraction Engine
Sonnet Tokenization Engine
Idyl E3 Entity Extraction Engine

Each is available as a Docker container and on the AWS and Azure marketplaces. You can quickly start each building block as a Docker container using docker compose or individually:

Start Prose Sentence Extraction Engine:

docker run -p 8060:8060 -it mtnfog/prose:1.1.0

Start Sonnet Tokenization Engine:

docker run -p 9040:9040 -it mtnfog/sonnet:1.1.0

Start Idyl E3 Entity Extraction Engine:

docker run -p 9000:9000 -it mtnfog/idyl-e3:3.0.0

With the containers running we will next set up Apache NiFi.

Setting Up

To begin, download Apache NiFi and unzip it. Now we can start Apache NiFi:

apache-nifi-1.5.0/bin/nifi.sh start

We can now begin creating our data flow.

Creating the Ingest Data Flow

The Process

Our data flow in Apache NiFi will follow the steps below. Each step is described in detail in the sections that follow.

  1. Ingest LEXS XML files from the file system. Apache NiFi offers the ability to read files from many sources (such as HDFS and S3) but we will simply use the local file system as our source.
  2. Execute an XPath query against each LEXS XML file to extract the narrative from each record. The narrative is a free text, natural language description of the event described by the LEXS XML file.
  3. Use Prose Sentence Extraction Engine to identify the individual sentences in the narrative.
  4. Use Sonnet Tokenization Engine to break each sentence into its individual tokens (typically words).
  5. Use Idyl E3 Entity Extraction Engine to identify the named-person entities in the tokens.
  6. Persist the extracted entities into a MongoDB database.
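
Steps 3 through 5 can also be sketched outside of NiFi as plain HTTP calls, which can help when checking that the building blocks respond as expected. The sketch below is only an illustration under assumptions: it uses the local URLs and content types configured later in this post, it assumes each service returns JSON, and the function names (http_post, extract_entities) are ours and not part of any product API.

```python
import json
import urllib.request

# Endpoints as configured later in this post (local Docker containers).
PROSE_URL = "http://localhost:8060/api/sentences"
SONNET_URL = "http://localhost:9040/api/tokenize"
IDYL_E3_URL = "http://localhost:9000/api/extract"


def http_post(url, body, content_type):
    """POST a string body and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_entities(narrative, post=http_post):
    """Run steps 3 to 5 for one narrative; returns one response per sentence."""
    results = []
    # Step 3: Prose receives the narrative and returns a JSON array of sentences.
    for sentence in post(PROSE_URL, narrative, "text/plain"):
        # Step 4: Sonnet receives one sentence and returns an array of tokens.
        tokens = post(SONNET_URL, sentence, "text/plain")
        # Step 5: Idyl E3 receives the token array as JSON (assumed request shape).
        results.append(post(IDYL_E3_URL, json.dumps(tokens), "application/json"))
    return results
```

The post argument exists so the chain can be exercised with a stand-in function when the containers are not running.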

Configuring the Apache NiFi Processors

Ingesting the XML Files

To read the documents from the file system we will use the GetFile processor. The only configuration property for this processor that we will set is the input directory. Our documents are stored in /docs so that will be our source directory. Note that, by default, the GetFile processor removes the files from the directory as they are processed.
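
The remove-on-read behavior is easy to overlook, so here is a rough Python sketch of the same idea. It is not how GetFile is implemented; the function name get_files and the keep_source_file flag are ours, loosely modeled on the processor's Keep Source File property.

```python
from pathlib import Path


def get_files(input_directory, keep_source_file=False):
    """Yield (name, content) for each file, deleting it unless told to keep it."""
    # sorted() materializes the listing first, so deleting while looping is safe.
    for path in sorted(Path(input_directory).iterdir()):
        if not path.is_file():
            continue
        content = path.read_bytes()
        if not keep_source_file:
            path.unlink()  # mirrors the default of removing picked-up files
        yield path.name, content
```

If you are experimenting with your only copy of the test documents, keep a backup elsewhere or point the processor at a scratch directory.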

Extracting the Narrative from Each Record

The GetFile processor will send the file’s XML content to an EvaluateXPath processor. This processor will execute an XPath query against each XML document to extract the document’s narrative. The extracted narrative will be stored in the content of the flowfile. The XPath is:

/*[local-name()='doPublish']/*[local-name()='PublishMessageContainer']/*[local-name()='PublishMessage']/*[local-name()='DataItemPackage']/*[local-name()='Narrative']
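
To see what this query selects, the same path can be walked in Python while ignoring namespaces, which is what the local-name() tests do. The sample document below is a made-up, heavily trimmed stand-in (including its namespace URI), not a real LEXS record, and the helper names are ours.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a LEXS document; real documents are far larger.
SAMPLE = """<doPublish xmlns="urn:example:lexs">
  <PublishMessageContainer>
    <PublishMessage>
      <DataItemPackage>
        <Narrative>John Doe was seen leaving the building.</Narrative>
      </DataItemPackage>
    </PublishMessage>
  </PublishMessageContainer>
</doPublish>"""

# Element names from the XPath query, in order from the root down.
PATH = ["doPublish", "PublishMessageContainer", "PublishMessage",
        "DataItemPackage", "Narrative"]


def local_name(element):
    """Return the tag without its '{namespace}' prefix, like XPath local-name()."""
    return element.tag.rsplit("}", 1)[-1]


def extract_narratives(xml_text):
    """Walk the same element path as the XPath query, ignoring namespaces."""
    root = ET.fromstring(xml_text)
    current = [root] if local_name(root) == PATH[0] else []
    for name in PATH[1:]:
        current = [child for parent in current for child in parent
                   if local_name(child) == name]
    return [(element.text or "").strip() for element in current]


print(extract_narratives(SAMPLE))
```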

Identifying Individual Sentences in the Narrative

The flowfile will now be sent to an InvokeHTTP processor that will send the sentence extraction request to Prose Sentence Extraction Engine. We set the following properties on the processor:

HTTP Method: POST
Remote URL: http://localhost:8060/api/sentences
Content Type: text/plain

The response from Prose Sentence Extraction Engine will be a JSON array containing the individual sentences in the narrative.

Splitting the Sentences Array into Separate FlowFiles

The array of sentences will be sent to a SplitJson processor. This processor splits the flowfile, creating a new flowfile for each sentence in the array. For the remainder of the data flow, the sentences will be operated on individually.
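
In plain Python terms, the split step amounts to something like the sketch below. This is an illustration rather than NiFi's implementation, and it assumes that string elements come out as plain text (which is what the next step needs, since it posts each sentence as text/plain).

```python
import json


def split_json_array(flowfile_content):
    """Return one piece of content per element of a top-level JSON array."""
    elements = json.loads(flowfile_content)
    if not isinstance(elements, list):
        raise ValueError("expected a top-level JSON array")
    # Strings are passed on as plain text; anything else is re-serialized as JSON.
    return [e if isinstance(e, str) else json.dumps(e) for e in elements]


# Example response body in the shape described above (an array of sentences).
response = '["George Washington was president.", "This is another sentence."]'
for part in split_json_array(response):
    print(part)
```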

Identifying the Tokens in Each Sentence

Each sentence is next sent to an InvokeHTTP processor that will call Sonnet Tokenization Engine. The properties set for this processor are:

HTTP Method: POST
Remote URL: http://localhost:9040/api/tokenize
Content Type: text/plain

The response from Sonnet Tokenization Engine will be an array of tokens (typically words) in the sentence.

Extracting Named-Entities from the Tokens

The array of tokens is next sent to an InvokeHTTP processor that sends the tokens to Idyl E3 Entity Extraction Engine for named-entity extraction. The properties to set for this processor are:

HTTP Method: POST
Remote URL: http://localhost:9000/api/extract
Content Type: application/json

Idyl E3 analyzes the tokens and identifies which tokens are named-person entities (like John Doe, Jerry Smith, etc.). The response is a list of the entities found along with metadata about each entity. This metadata includes the entity’s confidence value, a number from 0 to 1 that indicates Idyl E3’s confidence that the extracted text is actually an entity.

Storing Entities in MongoDB

Entities having a confidence value greater than or equal to 0.6 will be persisted to a MongoDB database. Each entity is written to the database by a PutMongo processor for storage and further analysis by other systems. The properties to configure for the PutMongo processor are:

Mongo URI: mongodb://localhost:27017
Mongo Database Name: <Any database>
Mongo Collection Name: <Any collection>
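
This post does not spell out how the confidence threshold is applied inside the flow, so here is the selection logic as a small Python sketch. It assumes the response is a JSON object with an entities list whose items carry a confidence field; the sample values are invented for illustration and the function name is ours.

```python
import json

MIN_CONFIDENCE = 0.6  # threshold used in this post


def entities_to_persist(response_body, min_confidence=MIN_CONFIDENCE):
    """Return the entities whose confidence meets the threshold."""
    response = json.loads(response_body)
    return [
        entity
        for entity in response.get("entities", [])
        if entity.get("confidence", 0.0) >= min_confidence
    ]


# Hypothetical response with one confident and one doubtful entity.
sample = json.dumps({
    "entities": [
        {"text": "John Doe", "confidence": 0.91, "type": "person"},
        {"text": "Main Street", "confidence": 0.42, "type": "person"},
    ]
})
for entity in entities_to_persist(sample):
    print(entity["text"], entity["confidence"])
```

Only the entities returned by a check like this would be handed to the database step.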

You could just as easily insert the entities into a relational database, Elasticsearch, or another repository.

Pipeline Summary

That is our pipeline! We went from XML documents, did some natural language processing via the NLP Building Blocks, and ended up with named-entities stored in MongoDB.

Production Deployment

There are a few things you may want to change for a production deployment.

Multiple Instances of Apache NiFi

First, you will likely want (and need) more than one instance of Apache NiFi to handle large volumes of files.

High Availability of NLP Building Blocks

Second, in this post we ran the NLP Building Blocks as local Docker containers. This is great for a demonstration or proof of concept, but in production you will want these services to be highly available, for example by running them on Kubernetes or AWS ECS.

You can also launch the NLP Building Blocks as EC2 instances via the AWS Marketplace. You could then plug the AMI of each building block into an EC2 autoscaling group behind an Elastic Load Balancer. This provides instance health checks and the ability to scale up and down in response to demand. They are also available on the Azure Marketplace.

Incorporate Language Detection in the Data Flow

Third, you may have noticed that we did not use Renku Language Detection Engine. This is because we knew beforehand that all of our documents are English. If you are unsure, you can insert a Renku Language Detection Engine processor in the data flow immediately after the EvaluateXPath processor to determine the text’s language and use the result as a query parameter to the other NLP Building Blocks.

Improve Performance through Custom Models

Lastly, we did not use any custom sentence, tokenization, or entity models. Each NLP Building Block includes basic functionality to perform these actions without custom models, but using custom models will almost certainly provide much better performance. This is because custom models more closely match your data than the default models do. The tools to create and evaluate custom models are included with each application; refer to each application’s documentation for the necessary steps.

Filtering Entities with Low Confidence

You may want to filter out entities having a low confidence value in order to control noise. The optimal threshold depends on a combination of your data, the entity model being used, and how much noise your system can tolerate. In some use cases it may be better to use a lower threshold out of caution. Each entity has an associated confidence value that can be used to filter.

Need Help?

Get in touch. We’ll be glad to help out. Send us a line at support at mtnfog.com.

Apache OpenNLP Language Detection in Apache NiFi

When making an NLP pipeline in Apache NiFi it can be a requirement to route the text through the pipeline based on the language of the text. But how do we get the language of the text inside our pipeline? This blog post introduces a processor for Apache NiFi that utilizes Apache OpenNLP’s language detection capabilities. This processor receives natural language text and returns an ordered list of detected languages along with each language’s probability. Your pipeline can get the first language in the list (it has the highest probability) and use it to route your text through your pipeline.

In case you are not familiar with OpenNLP’s language detection, it provides the ability to detect over 100 languages. It works best with text containing more than one sentence (the more text the better). It was introduced in OpenNLP 1.8.3.

To use the processor, first clone it from GitHub. Then build it and copy the nar file to your NiFi’s lib directory (and restart NiFi if it was running). We are using NiFi 1.4.0.

git clone https://github.com/mtnfog/nlp-nifi-processors.git
cd nlp-nifi-processors
mvn clean install
cp langdetect-nifi-processor/langdetect-processor-nar/target/*.nar /path/to/nifi/lib/

The processor does not have any settings to configure. It’s ready to work right “out of the box.” You can add the processor to your NiFi canvas:

You will likely want to connect the processor to an EvaluateJsonPath processor to extract the language from the JSON response and then to a RouteOnAttribute processor to route the text through the pipeline based on the language. Also, this processor will work with Apache NiFi MiNiFi to determine the language of text on edge devices. MiNiFi is a subproject of Apache NiFi that allows for capturing data into NiFi flows from edge locations.
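
The routing decision itself is simple, and a small Python sketch may make it concrete. The exact JSON the processor emits is not shown in this post, so the shape used here (a list of objects with language and probability fields, most probable first) is an assumption, as are the route names.

```python
import json

# Hypothetical route names keyed by ISO 639-3 language code.
ROUTES = {"eng": "english-pipeline", "spa": "spanish-pipeline"}


def route_for(detection_json, routes=ROUTES, default="unsupported"):
    """Pick a route from the first (most probable) detected language."""
    detections = json.loads(detection_json)
    if not detections:
        return default
    top = detections[0]  # the list is ordered, highest probability first
    return routes.get(top["language"], default)


# Assumed response shape, for illustration only.
detected = '[{"language": "spa", "probability": 0.83}, {"language": "eng", "probability": 0.09}]'
print(route_for(detected))
```

In NiFi the same decision would be made by pulling the first language into an attribute and matching on that attribute.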

Backing up a bit, why would we need to route text through the pipeline depending on its language? The actions taken further down in the pipeline are likely to be language dependent. For instance, the next step might be to tokenize the text, but knowing how to tokenize it requires knowing what language it is. Or, if the next step is to send the text to an entity extraction process, we need to know which entity model to use based on the language. So, language detection in an NLP pipeline can be a crucial initial step. A previous blog post showed how to use NiFi for an NLP pipeline, and extending it with language detection would be a great addition!

This processor performs the language detection inside the NiFi process. Everything remains inside your NiFi installation. This should be adequate for a lot of use cases, but if you need more throughput, check out Renku Language Detection Engine. It works very similarly to this processor in that it receives text and returns a list of identified languages. However, Renku is implemented as a stateless, scalable microservice, meaning you can deploy as many instances as you need to meet your use case’s requirements. And maybe the best part is that Renku is free for everyone to use without any limits.

Let us know how the processor works out for you!

Orchestrating NLP Building Blocks with Apache NiFi for Named-Entity Extraction

This blog post shows how we can create an NLP pipeline to perform named-entity extraction on natural language text using our NLP Building Blocks and Apache NiFi. Our NLP Building Blocks provide the ability to perform sentence extraction, string tokenization, and named-entity extraction. They are implemented as microservices and can be deployed almost anywhere, such as AWS, Azure, and as Docker containers.

At the completion of this blog post we will have a system that reads natural language text stored in files on the file system, pulls out the sentences of each, finds the tokens in each sentence, and finds the named-entities in the tokens.

Apache NiFi is an open-source application that provides data flow capabilities. Using NiFi you can visually define how data should flow through your system. Using what NiFi calls “processors”, you can ingest data from many data sources, perform operations on the data such as transformations and aggregations, and then output the data to an external system. We will be using NiFi to facilitate the flow of text through our NLP pipeline. The text will be read from plain text files on the file system. We will then:

  • Identify the sentences in the input text.
  • For each sentence, extract the tokens in the sentence.
  • Process the tokens for named-entities.

To get started we will stand up the NLP Building Blocks. For this flow that means Prose Sentence Extraction Engine, Sonnet Tokenization Engine, and Idyl E3 Entity Extraction Engine.

We will launch these applications using a docker-compose script.

git clone https://github.com/mtnfog/nlp-building-blocks
cd nlp-building-blocks
docker-compose up

This will pull the docker images from DockerHub and run the containers. We now have each NLP building block up and running. Let’s get Apache NiFi up and running, too.

To get started with Apache NiFi we will download it. It is a big download at just over 1 GB. You can download it from the Apache NiFi Downloads page or directly from a mirror at this link for NiFi 1.4.0. Once the download is done we will unzip the download and start NiFi:

unzip nifi-1.4.0-bin.zip
cd nifi-1.4.0/bin
./nifi.sh start

NiFi will start and after a few minutes it will be available at http://localhost:8080/nifi. (If you are curious you can see the NiFi log under logs/nifi-app.log.) Open your browser to that page and you will see the NiFi canvas as shown below. We can now design our data flow around the NLP Building Blocks!

If you want to skip to the meat and potatoes you can get the NiFi template described below in the nlp-building-blocks repository.

Our source data is going to be read from text files on our computer stored under /tmp/in/. We will use NiFi’s GetFile processor to read the file. Add a GetFile processor to the canvas:


Right-click the GetFile processor and click Configure to bring up the processor’s properties. The only property we are going to set is the Input Directory property. Set it to /tmp/in/ and click Apply:

We will use the InvokeHTTP processor to send API requests to the NLP Building Blocks, so, add a new InvokeHTTP processor to the canvas:

This first InvokeHTTP processor will be used to send the data to Prose Sentence Extraction Engine to extract the sentences in the text. Open the InvokeHTTP processor’s properties and set the following values:

  • HTTP Method – POST
  • Remote URL – http://localhost:7070/api/sentences
  • Content Type – text/plain

Set the processor to autoterminate for everything except Response. We also set the processor’s name to ProseSentenceExtractionEngine. Since we will be using multiple InvokeHTTP processors this lets us easily differentiate between them. We can now create a connection between the GetFile and InvokeHTTP processors by clicking and drawing a line between them. Our flow right now reads files from the filesystem and sends the contents to Prose:

The sentences returned from Prose will be in a JSON array. We can split this array into individual FlowFiles with the SplitJson processor. Add a SplitJson processor to the canvas and set its JsonPath Expression property to $.* as shown below:

Connect the SplitJson processor to the ProseSentenceExtractionEngine processor for the Response relationship. The canvas should now look like this:

Now that we have the individual sentences in the text we can send those sentences to Sonnet Tokenization Engine to tokenize the sentences. Similar to before, add an InvokeHTTP processor and name it SonnetTokenizationEngine. Set its method to POST, the Remote URL to http://localhost:9040/api/tokenize, and the Content-Type to text/plain. Automatically terminate every relationship except Response. Connect it to the SplitJson processor using the Split relationship. The result of this processor will be an array of tokens from the input sentence.

While we are at it, let’s go ahead and add an InvokeHTTP processor for Idyl E3 Entity Extraction Engine. Add the processor to the canvas and set its name to IdylE3EntityExtractionEngine. Set its properties:

  • HTTP Method – POST
  • Remote URL – http://localhost:9000/api/extract
  • Content-Type – application/json

Connect the IdylE3EntityExtractionEngine processor to the SonnetTokenizationEngine processor via the Response relationship. All other relationships can be set to autoterminate. To make things easier to see, we are going to add an UpdateAttribute processor that sets the filename for each FlowFile to a random UUID. Add an UpdateAttribute processor and add a new property called filename with the value ${uuid}.txt. We will also add a processor to write the FlowFiles to disk so we can see what happened during the flow’s execution. We will add a PutFile processor and set its Directory property to /tmp/out/.

Our finished flow looks like this:

To test our flow we are going to use a super simple text file. The full contents of the text file are:

George Washington was president. This is another sentence. Martha Washington was first lady.

Save this file as /tmp/in/test.txt.

If the NLP Building Blocks are not already running from earlier, start them now:

git clone https://github.com/mtnfog/nlp-building-blocks
cd nlp-building-blocks
docker-compose up

Now you can start the processors in the flow! The file /tmp/in/test.txt will disappear and three files will appear in /tmp/out/. The three files will have random UUIDs for filenames thanks to the UpdateAttribute processor. If we look at the contents of each of these files we see:

First file:

{"entities":[{"text":"George Washington","confidence":0.96,"span":{"tokenStart":0,"tokenEnd":2},"type":"person","languageCode":"eng","extractionDate":1514488188929,"metadata":{"x-model-filename":"mtnfog-en-person.bin"}}],"extractionTime":84}

Second file:

{"entities":[],"extractionTime":7}

Third file:

{"entities":[{"text":"Martha Washington","confidence":0.89,"span":{"tokenStart":0,"tokenEnd":2},"type":"person","languageCode":"eng","extractionDate":1514488189026,"metadata":{"x-model-filename":"mtnfog-en-person.bin"}}],"extractionTime":2}

The input text was broken into three sentences so we have three output files. In the first file we see that George Washington was extracted as a person entity. The second file did not have any entities. The third file had Martha Washington as a person entity. Our NLP pipeline orchestrated by Apache NiFi read the input, broke it into sentences, broke each sentence into tokens, and then identified named-entities from the tokens.
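
Each entity also carries a span that points back into the token array. As a rough illustration, the Python below maps the span from the first output file back to tokens. The token list is our guess at how the first sentence would be tokenized, and treating tokenEnd as exclusive is an assumption based on the two-token entity having a span of 0 to 2.

```python
import json

# Entity response from the first output file above, trimmed to the fields used here.
response = json.loads(
    '{"entities":[{"text":"George Washington","confidence":0.96,'
    '"span":{"tokenStart":0,"tokenEnd":2},"type":"person"}],"extractionTime":84}'
)

# Assumed tokenization of the first sentence; the real Sonnet output may differ.
tokens = ["George", "Washington", "was", "president", "."]


def entity_tokens(entity, tokens):
    """Return the tokens covered by an entity span (tokenEnd treated as exclusive)."""
    span = entity["span"]
    return tokens[span["tokenStart"]:span["tokenEnd"]]


for entity in response["entities"]:
    print(entity["type"], " ".join(entity_tokens(entity, tokens)))
```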

This flow assumed the language would always be English but if you are unsure you can add another InvokeHTTP processor to utilize Renku Language Detection Engine. This will enable language detection inside your flow and you can route the FlowFiles through the flow based on the detected language giving you a very powerful NLP pipeline.

There’s a lot of cool stuff here but arguably one of the coolest is that by using the NLP Building Blocks you don’t have to pay per-request pricing that many of the NLP services charge. You can run this pipeline as much as you need to. And if you are in an environment where your text can’t leave your network, this pipeline can be run completely behind a firewall (just like we did in this post).

 

 

Apache NiFi EQL Processor

We have published a new open source project on GitHub that is an Apache NiFi processor that filters entities through an Entity Query Language (EQL) query. When used along with the Idyl E3 NiFi Processor you can perform entity filtering in a NiFi dataflow pipeline.

To add the EQL processor to your NiFi pipeline, clone the project and build it or download the jar file from our website. Then copy the jar to NiFi’s lib directory and restart NiFi. The processor will now be available in the list of processors:

The EQL processor has a single property that holds the EQL query:

For this example our query will look for entities whose text is “George Washington”:

select * from entities where text = "George Washington"

Entities matching the EQL query will be outputted from the processor as JSON. Entities not matching the EQL query will be dropped.

With this capability we can create Apache NiFi dataflows that produce alerts when an entity matches a given set of conditions. Entities matching the EQL query can be published to an SQS queue, a Kafka stream, or any other NiFi processor.

The Entity Query Language previously existed as a component of the EntityDB project. It is now its own project on GitHub and is licensed under the Apache Software License, 2.0. The project’s README.md contains more examples of how to construct EQL queries.

Apache NiFi and Idyl E3 Entity Extraction Engine

We are happy to let you know how Idyl E3 Entity Extraction Engine can be used with Apache NiFi. First, what is Apache NiFi? From the NiFi homepage: “Apache NiFi supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic.” Idyl E3 extracts entities (persons, places, things) from natural language text.

That’s a very short description of NiFi but it is very accurate. Apache NiFi allows you to configure simple or complex data processing flows. For example, you can configure a pipeline to consume files from a file system and upload them to S3. (See Example Dataflow Templates.) There are many operations you can do and they are performed by components called Processors. There are many excellent guides available about NiFi.

There are many processors available for NiFi out of the box. One in particular is the InvokeHTTP processor that lets your pipeline send an HTTP request. You can use this processor to send text to Idyl E3 for entity extraction from within your pipeline. However, to make things a bit simpler and more flexible we have created a custom NiFi processor just for Idyl E3. This processor is available on GitHub and its binaries will be included with all editions of Idyl E3 starting with version 2.3.0.

Idyl E3 NiFi Processor

Instructions for how to use the Idyl E3 processor will be added to the Idyl E3 documentation but they are simple. Here’s a rundown. Copy the idyl-e3-nifi-processor.jar from Idyl E3’s home directory to NiFi’s lib directory. Restart NiFi. Once NiFi is available you will see the Idyl E3 processor in the list of processors when adding a processor:


There are a few properties you can set but the only required property is the Idyl E3 endpoint. By default, the processor extracts entities from the input text but this can be changed using the action property. The available actions are:

  • extract (the default) to get a JSON response containing the entities.
  • annotate to return the input text with the entities annotated.
  • sanitize to return the input text with the entities removed.
  • ingest to extract entities from the input text but provide no response. (This is useful if you are letting Idyl E3 plugins handle the publishing of entities to a database or other service outside of the NiFi data flow.)

The available properties are shown in the screen capture below:

And that is it. The processor will send text to Idyl E3 for entity extraction via Idyl E3’s /api/v2/extract endpoint. The response from Idyl E3 containing the entities will be placed in a new idyl-e3-response attribute.

The Idyl E3 NiFi processor is licensed under the Apache Software License, version 2.0. Under the hood, the processor uses the Idyl E3 client SDK for Java which is also licensed under the Apache license.