In this post we are going to show how our NLP Building Blocks can be used with Apache NiFi to create an NLP pipeline to perform named-entity extraction on Logical Entity Exchange Specifications (LEXS) documents. The pipeline will extract a natural language field from each document, identify the named-entities in the text through a process of sentence extraction, tokenization, and named-entity recognition, and persist the entities to a MongoDB database. While the pipeline we are going to create uses data files in a specific format, the pipeline could be easily modified to read documents in a different format.
LEXS is an XML, NIEM-based framework for information exchange developed for the US Department of Justice. While the details of LEXS are out of scope for this post, the key points are that it is XML-based, mixes structured and unstructured text, and is used to describe various law enforcement events. We have taken the LEXS specification and created test documents for this pipeline. Example documents are also available on the public internet.
And just in case you are not familiar with Apache NiFi, it is a free (Apache-licensed), cross-platform application that allows the creation and execution of data flow processes. With Apache NiFi you can move data through pipelines while applying transformations and executing actions.
The completed Apache NiFi data flow is shown below.
NLP Building Blocks
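This post requires that our NLP Building Blocks are running and accessible. The NLP Building Blocks are microservices that perform NLP tasks. The building blocks referenced in this post are:
- Prose Sentence Extraction Engine
- Sonnet Tokenization Engine
- Idyl E3 Entity Extraction Engine
- Renku Language Detection Engine
Each is available as a Docker container and on the AWS and Azure marketplaces. You can quickly start the building blocks as Docker containers, either together using docker compose or individually: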
Start Prose Sentence Extraction Engine:
docker run -p 8060:8060 -it mtnfog/prose:1.1.0
Start Sonnet Tokenization Engine:
docker run -p 9040:9040 -it mtnfog/sonnet:1.1.0
Start Idyl E3 Entity Extraction Engine:
docker run -p 9000:9000 -it mtnfog/idyl-e3:3.0.0
With the containers running we will next set up Apache NiFi.
To begin, download Apache NiFi and unzip it. Now we can start Apache NiFi:
We can now begin creating our data flow.
Creating the Ingest Data Flow
Our data flow in Apache NiFi will follow the steps below. Each step is described in detail later in the post.
- Ingest LEXS XML files from the file system. Apache NiFi offers the ability to read files from many sources (such as HDFS and S3) but we will simply use the local file system as our source.
- Execute an XPath query against each LEXS XML file to extract the narrative from each record. The narrative is a free text, natural language description of the event described by the LEXS XML file.
- Use Prose Sentence Extraction Engine to identify the individual sentences in the narrative.
- Use Sonnet Tokenization Engine to break each sentence into its individual tokens (typically words).
- Use Idyl E3 Entity Extraction Engine to identify the named-person entities in the tokens.
- Persist the extracted entities into a MongoDB database.
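To make the order of the steps concrete, here is a minimal Python sketch of the same flow. It is only an illustration under stated assumptions: each step is passed in as a plain function, and all of the function and parameter names are hypothetical. In the actual data flow every step is an Apache NiFi processor, not Python code.

```python
from typing import Callable, Iterable, List


def run_pipeline(
    xml_documents: Iterable[str],
    extract_narrative: Callable[[str], str],
    split_sentences: Callable[[str], List[str]],
    tokenize: Callable[[str], List[str]],
    extract_entities: Callable[[List[str]], List[dict]],
    persist: Callable[[List[dict]], None],
) -> None:
    """Apply the steps listed above to each document, in order.

    All names here are hypothetical stand-ins for NiFi processors.
    """
    for xml in xml_documents:
        narrative = extract_narrative(xml)           # XPath step
        for sentence in split_sentences(narrative):  # sentence extraction
            tokens = tokenize(sentence)              # tokenization
            entities = extract_entities(tokens)      # named-entity extraction
            if entities:
                persist(entities)                    # storage
```

Because each step is injected, the ordering can be exercised with simple stand-in functions before any service is running.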
Configuring the Apache NiFi Processors
Ingesting the XML Files
To read the documents from the file system we will use the GetFile processor. The only configuration property for this processor that we will set is the input directory. Our documents are stored in /docs so that will be our source directory. Note that, by default, the GetFile processor removes the files from the directory as they are processed.
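If the remove-on-read behavior is surprising, the following Python sketch mimics it outside of NiFi. This is not how GetFile is implemented; the function name and the keep_source flag are our own illustrative choices.

```python
from pathlib import Path
from typing import Iterator, Tuple


def get_files(input_dir: str, keep_source: bool = False) -> Iterator[Tuple[str, bytes]]:
    """Yield (filename, content) for each regular file in input_dir.

    Mirrors the default behavior described above: the source file is
    deleted once its content has been read, unless keep_source is True.
    """
    for path in sorted(Path(input_dir).iterdir()):
        if not path.is_file():
            continue
        content = path.read_bytes()
        if not keep_source:
            path.unlink()  # the file is gone from the directory after this
        yield path.name, content
```

The practical consequence is the same as in the data flow: point the input directory at a copy of your documents if you want to keep the originals.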
Extracting the Narrative from Each Record
The GetFile processor will send the file’s XML content to an EvaluateXPath processor. This processor will execute an XPath query against each XML document to extract the document’s narrative. The extracted narrative will be stored in the content of the flowfile. The XPath is:
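As a rough illustration of what this step does (and not the query used in this data flow), here is a Python sketch that pulls a free-text field out of an XML document. The sample document, its element names, and the path expression are all hypothetical; real LEXS documents use NIEM namespaces and a much deeper structure. Note also that Python's ElementTree supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in document, NOT a real LEXS record.
SAMPLE_XML = """<Report>
  <Id>42</Id>
  <Narrative>John Doe was seen leaving the store. He drove north.</Narrative>
</Report>"""

# Hypothetical path expression, NOT the XPath used in this post.
NARRATIVE_PATH = ".//Narrative"


def extract_narrative(xml_text: str, path: str = NARRATIVE_PATH) -> str:
    """Return the text of the first element matching path, or '' if none."""
    root = ET.fromstring(xml_text)
    node = root.find(path)
    if node is None:
        return ""
    return (node.text or "").strip()
```

Calling extract_narrative(SAMPLE_XML) returns only the narrative text, which is what the rest of the pipeline operates on.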
Identifying Individual Sentences in the Narrative
The flowfile will now be sent to an InvokeHTTP processor that will send the sentence extraction request to Prose Sentence Extraction Engine. We set the following properties on the processor:
The response from Prose Sentence Extraction Engine will be a JSON array containing the individual sentences in the narrative.
Splitting the Sentences Array into Separate FlowFiles
The array of sentences will be sent to a SplitJson processor. This processor splits the flowfile, creating a new flowfile for each sentence in the array. For the remainder of the data flow, the sentences will be operated on individually.
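The effect of the split can be sketched in a few lines of Python. This is only an analogy for what the processor does to the flowfile; the function name is hypothetical.

```python
import json
from typing import List


def split_sentences_array(flowfile_content: str) -> List[str]:
    """Turn one JSON array of sentences into one payload per sentence."""
    sentences = json.loads(flowfile_content)
    if not isinstance(sentences, list):
        raise ValueError("expected a JSON array of sentences")
    return [str(s) for s in sentences]
```

For example, the input '["John Doe ran.", "He fell."]' becomes two separate items, and each one then travels through the remaining steps on its own.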
Identifying the Tokens in Each Sentence
Each sentence is next sent to an InvokeHTTP processor that will call Sonnet Tokenization Engine. The properties set for this processor are:
HTTP Method: POST
Remote URL: http://localhost:9040/api/tokenize
Content Type: text/plain
The response from Sonnet Tokenization Engine will be an array of tokens (typically words) in the sentence.
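If you want to try the same call outside of NiFi, the request can be sketched in Python using only the standard library. The URL, method, and content type come from the properties above. The function names are our own, and the assumption that the response body parses as a JSON array of strings is ours as well.

```python
import json
import urllib.request
from typing import List

SONNET_URL = "http://localhost:9040/api/tokenize"  # Remote URL from the properties above


def build_tokenize_request(sentence: str, url: str = SONNET_URL) -> urllib.request.Request:
    """Build (but do not send) the POST request for a single sentence."""
    return urllib.request.Request(
        url,
        data=sentence.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def tokenize(sentence: str, url: str = SONNET_URL) -> List[str]:
    """Send the sentence to the service and parse the response.

    Assumption: the response body is a JSON array of token strings.
    """
    with urllib.request.urlopen(build_tokenize_request(sentence, url)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Separating the request construction from the network call makes it easy to inspect exactly what would be sent before the container is running.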
Extracting Named-Entities from the Tokens
The array of tokens is next sent to an InvokeHTTP processor that sends the tokens to Idyl E3 Entity Extraction Engine for named-entity extraction. The properties to set for this processor are:
HTTP Method: POST
Remote URL: http://localhost:9000/api/extract
Content Type: application/json
Idyl E3 analyzes the tokens and identifies which tokens are named-person entities (like John Doe, Jerry Smith, etc.). The response is a list of the entities found along with metadata about each entity. This metadata includes the entity’s confidence value, a number from 0 to 1 that indicates how confident Idyl E3 is that the text really is a named entity.
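The equivalent request outside of NiFi might look like the Python sketch below. The URL, method, and content type are taken from the properties above. Sending the tokens as a JSON array in the request body is our reading of the flow, and the function name is hypothetical.

```python
import json
import urllib.request
from typing import List

IDYL_E3_URL = "http://localhost:9000/api/extract"  # Remote URL from the properties above


def build_extract_request(tokens: List[str], url: str = IDYL_E3_URL) -> urllib.request.Request:
    """Build (but do not send) the POST request carrying one sentence's tokens.

    Assumption: the request body is the token array serialized as JSON.
    """
    return urllib.request.Request(
        url,
        data=json.dumps(tokens).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send it to a running Idyl E3 container.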
Storing Entities in MongoDB
Entities with a confidence value greater than or equal to 0.6 will be persisted to a MongoDB database by a PutMongo processor. Each entity is written to the database for storage and further analysis by other systems. The properties to configure for the PutMongo processor are:
Mongo URI: mongodb://localhost:27017
Mongo Database Name: <Any database>
Mongo Collection Name: <Any collection>
You could just as easily insert the entities into a relational database, Elasticsearch, or another repository.
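The filter-then-store logic can be sketched in Python as follows. This is an illustration, not the processor's implementation: the "confidence" key is a hypothetical field name for the confidence value, and the collection argument is anything that offers an insert_many method, such as a pymongo collection.

```python
from typing import Iterable, List

MIN_CONFIDENCE = 0.6  # threshold from the text above


def entities_to_persist(
    entities: Iterable[dict], min_confidence: float = MIN_CONFIDENCE
) -> List[dict]:
    """Keep entities whose confidence is >= min_confidence.

    "confidence" is a hypothetical field name used for illustration.
    """
    return [e for e in entities if e.get("confidence", 0.0) >= min_confidence]


def persist(entities: Iterable[dict], collection) -> int:
    """Insert the kept entities and return how many were stored.

    collection is any object with an insert_many(list) method.
    """
    docs = entities_to_persist(entities)
    if docs:  # skip the call entirely when nothing passed the threshold
        collection.insert_many(docs)
    return len(docs)
```

Because persist only depends on insert_many, swapping MongoDB for another repository means supplying a different collection-like object.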
That is our pipeline! We went from XML documents, did some natural language processing via the NLP Building Blocks, and ended up with named-entities stored in MongoDB.
There are a few things you may want to change for a production deployment.
Multiple Instances of Apache NiFi
First, you will likely want (and need) more than one instance of Apache NiFi to handle large volumes of files.
High Availability of NLP Building Blocks
Second, in this post we ran the NLP Building Blocks as local Docker containers. This is great for a demonstration or proof-of-concept, but in production you will want high availability for these services, provided by a platform such as Kubernetes or AWS ECS.
You can also launch the NLP Building Blocks as EC2 instances via the AWS Marketplace. You could then plug the AMI of each building block into an EC2 autoscaling group behind an Elastic Load Balancer. This provides instance health checks and the ability to scale up and down in response to demand. They are also available on the Azure Marketplace.
Incorporate Language Detection in the Data Flow
Third, you may have noticed that we did not use Renku Language Detection Engine. This is because we knew beforehand that all of our documents are in English. If you are unsure, you can insert a Renku Language Detection Engine processor in the data flow immediately after the EvaluateXPath processor to determine the text’s language and use the result as a query parameter to the other NLP Building Blocks.
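Passing the detected language along amounts to appending a query parameter to each service URL. The Python sketch below shows the idea. The parameter name "language" is a placeholder we chose for illustration; check each building block's documentation for the name it actually expects.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_language(url: str, language_code: str, param_name: str = "language") -> str:
    """Return url with the detected language appended as a query parameter.

    param_name is a hypothetical placeholder, not a documented parameter.
    """
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((param_name, language_code))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Any query parameters already present on the URL are preserved, and the language is added after them.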
Improve Performance through Custom Models
Lastly, we did not use any custom sentence, tokenization, or entity models. Each NLP Building Block includes basic functionality to perform these actions without custom models, but custom models will almost certainly perform much better because they match your data more closely than the default models do. The tools to create and evaluate custom models are included with the application. Refer to each application’s documentation for the necessary steps.
Filtering Entities with Low Confidence
You may want to filter out entities with a low confidence value in order to control noise. Each entity has an associated confidence value that can be used for this. The optimal threshold depends on a combination of your data, the entity model being used, and how much noise your system can tolerate. In some use cases it may be better to use a lower threshold out of caution.
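One simple way to pick a threshold is to count how many entities survive at several candidate values and then inspect what falls on either side. Here is a small Python sketch of that idea. The function name is hypothetical, and the confidence values you feed it would come from your own extraction results.

```python
from typing import Dict, Iterable


def kept_per_threshold(
    confidences: Iterable[float], thresholds: Iterable[float]
) -> Dict[float, int]:
    """Count how many confidence values are >= each candidate threshold."""
    values = list(confidences)
    return {t: sum(1 for c in values if c >= t) for t in thresholds}
```

A sharp drop between two neighboring thresholds is a hint that the entities in between deserve a manual look before you settle on a value.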
Get in touch. We’ll be glad to help out. Drop us a line at support at mtnfog.com.