Philter and Elasticsearch

Philter, our application for finding and removing PII and PHI from natural language text, can optionally store the identified text in an external data store. With this feature, you have access to a complete log of Philter’s actions as well as the ability to reconstruct the original text in the future should you ever need to.

In Philter 1.0, we chose MongoDB as the external data store. With just a few configuration properties, Philter would connect to MongoDB and persist all identified “spans” (the identified text, its location in the document, and some other attributes) to a MongoDB database. This worked well, but looking ahead we realized it might not be the best choice.

In Philter 1.1 we are replacing MongoDB with Elasticsearch. The functionality and the Philter APIs remain the same. The only difference is that the spans are now stored in an Elasticsearch index instead of a MongoDB database. So what, exactly, are the benefits? Great question.

The first benefit comes with Elasticsearch and Kibana’s ability to quickly and easily make dashboards to view the indexed data. With the spans in Elasticsearch, you can make a dashboard to summarize the spans by type, text, etc., to show insights into the PII and PHI that Philter is finding and manipulating in your text.

It also quickly became apparent that a primary use case for the store would be querying the spans it contains. For example, a query to find all documents containing “John Doe” or all documents containing a certain date or phone number. A search engine is better suited to handle those queries.
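
As a rough sketch of such a query using Python and the Elasticsearch search API (the index name and field name below are assumptions for illustration; check your Philter store configuration for the actual values):

import requests

# Hypothetical query: find spans whose identified text matches "John Doe".
# "philter" and "text" are placeholder index and field names.
query = {
    "query": {
        "match": {
            "text": "John Doe"
        }
    }
}

response = requests.post("http://localhost:9200/philter/_search", json=query)

for hit in response.json()["hits"]["hits"]:
    print(hit["_source"])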

Another consideration is licensing. Elasticsearch is available under the Apache License (or a compatible license) while MongoDB is available under the Server Side Public License.

In summary, Philter 1.1 will offer support for using Elasticsearch as the store for identified PII and PHI. Remember, using the store is an optional feature of Philter. If you do not require any history of the text that Philter identifies then it is not needed. (By default, Philter’s store feature is disabled and has to be explicitly enabled.) Support for using MongoDB as a store will not be available in Philter 1.1.

We are really excited about this change and excited about the possibilities that come with it!

Filter Profile JSON Schema

Philter can find and remove many types of PII and PHI. You can select the types of PII and PHI to look for, and control how the identified values are removed or manipulated, through what we call a “filter profile.” A filter profile is a file that essentially lets you tell Philter what to do!

To help make creating and editing filter profiles a little bit easier, we have published the JSON schema.

https://www.mtnfog.com/filter-profile-schema.json

This JSON schema can be imported into some development tools to provide features such as validation and autocomplete. The screenshot below shows an example of adding the schema to IntelliJ. More details into the capability and features are available from the IntelliJ documentation.

Visual Studio Code and Atom (via a package) also include support for validating JSON documents per JSON schemas.
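
You can also validate a filter profile programmatically. A minimal sketch using the Python jsonschema package (any JSON Schema validator would work; the profile file name is a placeholder):

import json
import requests
from jsonschema import validate

# Download the published filter profile schema.
schema = requests.get("https://www.mtnfog.com/filter-profile-schema.json").json()

# Load a local filter profile to check against the schema.
with open("profile.json") as f:
    profile = json.load(f)

# Raises jsonschema.ValidationError if the profile does not conform.
validate(instance=profile, schema=schema)
print("Filter profile is valid.")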

The Filter Profile Registry provides a way to centrally manage filter profiles across one or more instances of Philter.

Using AWS Kinesis Firehose Transformations to Filter PII and PHI from Streaming Text

AWS Kinesis Firehose is a managed streaming service designed to take large amounts of data from one place to another. For example, you can take data from places such as CloudWatch, AWS IoT, and custom applications using the AWS SDK to places such as Amazon S3, Amazon Redshift, Amazon Elasticsearch, and others. In this post we will use S3 as the firehose’s destination.

Sometimes you want to manipulate the data as it goes through the firehose. In this blog post we will show how AWS Kinesis Firehose and AWS Lambda can be used to remove PII and PHI from the text as it travels through the firehose.

Prerequisites

You must have a running instance of Philter. If you don’t already have one, you can launch one through the AWS Marketplace.

It’s not required that the instance of Philter be running in AWS but it is required that the instance of Philter be accessible from your AWS Lambda function. Running Philter and your AWS Lambda function in your own VPC allows you to communicate locally with Philter from the function.

There is no need to duplicate the excellent blog post on creating a Firehose Data Transformation with AWS Lambda. Instead, refer to the linked post and substitute the Python 3 code below for the code shown there.

Configuring the Firehose and the Lambda Function

To start, create an AWS Firehose and configure an AWS Lambda transformation. When creating the AWS Lambda function, select Python 3.7 and use the following code:

import base64

# The requests module bundled with botocore is used here to avoid packaging
# dependencies with the Lambda function. Newer Lambda runtimes may require
# packaging the requests library instead.
from botocore.vendored import requests

def handler(event, context):

    output = []

    for record in event['records']:

        # Decode the incoming record and send its text to Philter for filtering.
        payload = base64.b64decode(record['data'])
        headers = {'Content-type': 'text/plain'}

        # Replace PHILTER_IP with the address of your Philter instance.
        # verify=False ignores Philter's self-signed certificate.
        r = requests.post("https://PHILTER_IP:8080/api/filter", verify=False, data=payload, headers=headers, timeout=20)
        filtered = r.text

        # Return the filtered text to the firehose, Base64 encoded.
        output_record = {
            'recordId': record['recordId'],
            'result': 'Ok',
            'data': base64.b64encode(filtered.encode('utf-8') + b'\n').decode('utf-8')
        }
        output.append(output_record)

    return output

The following Kinesis Firehose test event can be used to test the function:

{
  "invocationId": "invocationIdExample",
  "deliveryStreamArn": "arn:aws:kinesis:EXAMPLE",
  "region": "us-east-1",
  "records": [
    {
      "recordId": "49546986683135544286507457936321625675700192471156785154",
      "approximateArrivalTimestamp": 1495072949453,
      "data": "R2VvcmdlIFdhc2hpbmd0b24gd2FzIHByZXNpZGVudCBhbmQgaGlzIHNzbiB3YXMgMTIzLTQ1LTY3ODkgYW5kIGhlIGxpdmVkIGF0IDkwMjEwLiBQYXRpZW50IGlkIDAwMDc2YSBhbmQgOTM4MjFhLiBIZSBpcyBvbiBiaW90aW4uIERpYWdub3NlZCB3aXRoIEEwMTAwLg=="
    },
    {
      "recordId": "49546986683135544286507457936321625675700192471156785154",
      "approximateArrivalTimestamp": 1495072949453,
      "data": "R2VvcmdlIFdhc2hpbmd0b24gd2FzIHByZXNpZGVudCBhbmQgaGlzIHNzbiB3YXMgMTIzLTQ1LTY3ODkgYW5kIGhlIGxpdmVkIGF0IDkwMjEwLiBQYXRpZW50IGlkIDAwMDc2YSBhbmQgOTM4MjFhLiBIZSBpcyBvbiBiaW90aW4uIERpYWdub3NlZCB3aXRoIEEwMTAwLg=="
    }    
  ]
}

This test event contains two records. The data in each record is Base64 encoded text containing PII and PHI. When the test is executed, the function returns the filtered text as a JSON list similar to:

[
  "He lived in {{{REDACTED-zip-code}}} and his SSN was {{{REDACTED-ssn}}}.",
  "He lived in {{{REDACTED-zip-code}}} and his SSN was {{{REDACTED-ssn}}}."
]

When executing the test, the AWS Lambda function extracts the data from each record in the firehose and submits it to Philter for filtering. The filtered responses are returned from the function as a JSON list. Note that in our Python function we are ignoring Philter’s self-signed certificate. It is recommended that you use a valid signed certificate for Philter.

When data is now published to the Kinesis Firehose stream, it will be processed by the AWS Lambda function and Philter prior to exiting the firehose at its configured destination.

Processing Data

We can use the AWS CLI to publish data to our Kinesis Firehose stream called sensitive-text:

aws firehose put-record --delivery-stream-name sensitive-text --record 'Data=He lived in 90210 and his SSN was 123-45-6789.'

Check the destination S3 bucket and you will have a single object with the following line:

He lived in {{{REDACTED-zip-code}}} and his SSN was {{{REDACTED-ssn}}}.
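
If you prefer to publish records programmatically rather than through the AWS CLI, a minimal boto3 sketch (assuming your AWS credentials and region are already configured) looks like this:

import boto3

firehose = boto3.client("firehose")

# Publish a single record to the sensitive-text delivery stream.
firehose.put_record(
    DeliveryStreamName="sensitive-text",
    Record={"Data": b"He lived in 90210 and his SSN was 123-45-6789.\n"}
)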

Conclusion

In this blog post we have created an AWS Firehose pipeline that uses an AWS Lambda function to remove PII and PHI from the text in the streaming pipeline.

Apache NiFi for Processing PHI Data

With the recent release of Apache NiFi 1.10.0, it seems like a good time to discuss using Apache NiFi with data containing protected health information (PHI). When PHI is present in data it can present significant concerns and impose many requirements you may not face otherwise due to regulations such as HIPAA.

Apache NiFi probably needs little introduction but in case you are new to it, Apache NiFi is a big-data ETL application that uses directed graphs called data flows to move and transform data. You can think of it as taking data from one place to another while, optionally, doing some transformation to the data. The data goes through the flow in a construct known as a flow file. In this post we’ll consider a simple data flow that reads files from a remote SFTP server and uploads them to S3. We don’t need to look at a complex data flow to understand how PHI can impact our setup.

Encryption of Data at Rest and In-motion

Two core things to address when PHI data is present are encryption of the data at rest and encryption of the data in motion. The first step is to identify those places where sensitive data will be at rest and in motion.

For encryption of data at rest, the first location is the remote SFTP server. In this example, let’s assume the remote SFTP server is not managed by us, has the appropriate safeguards, and is someone else’s responsibility. As the data goes through the NiFi flow, the next place the data is at rest is inside NiFi’s repositories. (The provenance repository, for example, stores the history of all flow files that pass through the data flow.) NiFi then uploads the files to S3. AWS gives us the capability to encrypt S3 bucket contents by default, so we will enable default encryption on the destination bucket.
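
As a rough sketch of enabling default encryption on the destination bucket with boto3 (the bucket name is a placeholder, and your organization may manage this through infrastructure-as-code instead):

import boto3

s3 = boto3.client("s3")

# Enable default server-side encryption (SSE-S3) for all new objects in the bucket.
s3.put_bucket_encryption(
    Bucket="my-nifi-destination-bucket",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
        ]
    }
)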

For encryption of data in motion, we have the connection between the SFTP server and NiFi and between NiFi and S3. Since we are using an SFTP server, our communication to the SFTP server will be encrypted. Similarly, we will access S3 over HTTPS providing encryption there as well.

If we are using a multi-node NiFi cluster, we may also have the communication between the NiFi nodes in the cluster. If the flows only execute on a single node you may argue that encryption between the nodes is not necessary. However, what happens in the future when the flow’s behavior is changed and now PHI data is being transmitted in plain text across a network? For that reason, it’s best to set up encryption between NiFi nodes from the start. This is covered in the NiFi System Administrator’s Guide.

Encrypting Apache NiFi’s Data at Rest

The best way to ensure encryption of data at rest is to use full disk encryption for the NiFi instances. (If you are on AWS and running NiFi on EC2 instances, use an encrypted EBS volume.) This ensures that all data persisted on the system will be encrypted no matter where the data appears. If a NiFi processor decides to have a bad day and dump error data to the log there is a risk of PHI data being included in the log. With full disk encryption we can be sure that even that data is encrypted as well.

Looking at Other Methods

Let’s recap the NiFi repositories:

  • The flowfile repository, which stores the attributes and state of each flowfile.
  • The content repository, which stores the content of each flowfile.
  • The provenance repository, which stores the history of every flowfile that passes through the data flow.

PHI could exist in any of these repositories when PHI data is passing through a NiFi flow. NiFi does have an encrypted provenance repository implementation, and NiFi 1.10.0 introduces an experimental encrypted content repository, but there are some caveats. (Currently, NiFi does not have an implementation of an encrypted flowfile repository.)

When using these encryption implementations, spillage of PHI onto the file system through a log file or some other means is a risk. There will be a bit of overhead due to the additional CPU instructions to perform the encryption. Comparing usage of the encrypted repositories with using an encrypted EBS volume, we don’t have to worry about spilling unencrypted PHI to the disk, and per the AWS EBS encryption documentation, “You can expect the same IOPS performance on encrypted volumes as on unencrypted volumes, with a minimal effect on latency.”

There is also the NiFi EncryptContent processor that can encrypt (and decrypt, despite the name!) the content of flow files. This processor has its uses, but only in very specific cases. Trying to encrypt data at the level of the data flow for compliance reasons is not recommended because the data may still exist elsewhere in the NiFi repositories.

Removing PHI from Text in a NiFi Flow

What if you want to remove PHI (and PII) from the content of flow files as they go through a NiFi data flow? Check out our product Philter. It provides the ability to find and remove many types of PHI and PII from natural language, unstructured text from within a NiFi flow. Text containing PHI is sent to Philter and Philter responds with the same text but with the PHI and PII removed.

Conclusion

Full disk encryption and encrypting all connections in the NiFi flow and between NiFi nodes provides encryption of data at rest and in motion. It’s also recommended that you check with your organization’s compliance officer to determine if there are any other requirements imposed by your organization or other relevant regulation prior to deployment. It’s best to gather that information up front to avoid rework in the future!

Need more help?

We provide consulting services around AWS and big-data tools like Apache NiFi. Get in touch by sending us a message. We look forward to hearing from you!

Our Approach to Continuous Delivery for Cloud Marketplaces

In this blog post I wanted to take a moment to share our challenges with continuous integration and delivery and how we approached them. Our Philter software for finding and removing PII and PHI from text is deployed on (at the moment) three cloud marketplaces as well as being available for on-premises deployment. Each of the marketplaces, the AWS Marketplace, the Microsoft Azure Marketplace, and the Google Cloud Platform (GCP) Marketplace, has its own requirements and constraints. We needed a pipeline that can build and test our code and deliver the binaries to each of the cloud marketplaces as a deployable image.

What tools you use to implement your process does not really matter. Some tools are more feature-rich than others, and some are better or worse only as a matter of opinion. It’s up to you to pick the tools that you or your organization want to use. We will mention the tools we use, but don’t take that to mean only these tools will work. (We like being tool-agnostic so we are not afraid to try new tools.) Our build infrastructure runs in AWS.

Our builds are managed by Jenkins through Jenkinsfiles. Each project has a Jenkinsfile that defines the build stages for the project. These stages vary by project but are usually similar to “build”, “test”, and “deploy.” The build and test stages are pretty self-explanatory. The deploy stage is where things get interesting (i.e. challenging).

We are using HashiCorp’s Packer tool to create our images for the cloud marketplaces. A single Packer JSON file contains a “builder” (in Packer terminology) for each cloud marketplace. A builder defines the necessary parameters for constructing the image on that specific cloud platform. For instance, when building on AWS EC2, the builder contains information about the VPC and subnet used for the build, the base AMI the image will be created from, and the AWS region for the image. Likewise, for Microsoft Azure, the builder defines things such as the storage account name, operating system name and version, and the Azure subscription ID. GCP has its own set of required parameters.

The rest of the Packer JSON file contains the steps that will be performed to prepare the image. This includes steps such as executing commands over SSH to install prerequisite packages, uploading the build artifacts made by the Jenkins build, and lastly, preparing the system to be turned into an image.

After the Jenkinsfile’s “deploy” stage executes, the end result is a new image in each of the cloud platforms suitable for final testing prior to being made available on that cloud’s marketplace. This testing is initiated by the build publishing a message to an AWS SNS topic each time an image completes creation. The message triggers an AWS Lambda-powered process that creates and starts a virtual machine from the image. Required credentials are stored in AWS SSM Parameter Store.
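
As an illustrative sketch of that notification step (the topic ARN and message fields are invented for the example and are not our actual values):

import boto3

sns = boto3.client("sns")

# Notify the downstream testing process that a new image is ready.
sns.publish(
    TopicArn="arn:aws:sns:us-east-1:123456789012:image-build-complete",
    Message='{"cloud": "aws", "image_id": "ami-0123456789abcdef0", "build": "42"}'
)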

Automated testing is then performed against the virtual machine. Individual testing of each image is required due to the nuances and different requirements of each cloud platform and marketplace and the different base images. For instance, on AWS the base image is Amazon Linux 2; on Microsoft Azure it is CentOS 7. The scripts that install the prerequisites and configure the application can differ based on the base image.

The automated testing involves exercising the application’s API and establishing an SSH connection to the virtual machine to verify that files are in the correct location and have been configured properly. A message is published to a separate AWS SNS topic indicating success or failure of the tests, and the virtual machine is terminated, leaving only the newly built image. The test results are persisted to a database along with the build number for reference. If testing was successful, we can proceed to the manual steps of publishing the image to the marketplaces when we are ready to do so. (All marketplaces require manual clicking to submit images, so that part cannot be automated.)

Continuous integration and delivery is important for all software projects. Having a consistent, repeatable process for building, testing, and packaging software for delivery is critical. A well-defined and implemented process can help teams find problems earlier, get configurations into code, and ultimately, get higher-quality products to the market faster.


Jeff Zemerick is the founder of Mountain Fog. You can contact him at jeff.zemerick@mtnfog.com or on Twitter at @jzonthemtn.  

Sneak Peek at Philter Big-Data and ETL Integrations

As we are nearing the general availability of Philter we would like to take a minute to offer a quick look at Philter’s integrations with other applications. Philter offers integration capabilities with Apache NiFi, Apache Kafka, and Apache Pulsar to provide PHI/PII filtering capabilities across your big-data and ETL ecosystems. We are very excited to offer these integrations for such awesome and popular open source applications.

To recap, Philter is an application to identify and, optionally, remove or replace protected health information (PHI) and personally identifiable information (PII) from natural language text.

Apache NiFi

Philter provides a custom Apache NiFi processor NAR that you can plug into your existing NiFi installations by copying the NAR file to NiFi’s lib directory. The processor allows your NiFi flow to identify and replace PHI and PII directly in your flow without any required external services. The processor’s configuration is similar to Philter’s standard configuration. The processor accepts a filter profile, an optional MongoDB URI to use to store replaced values, and a cache to maintain state when anonymizing values consistently. For the cache, the processor utilizes NiFi’s built-in DistributedMapCacheServer.

The processor operates on the content of the incoming flowfile by performing filtering on the content and replacing the content with the filtered text. An outbound transition provides the downstream processors with the filtered text.

Apache Kafka

Philter is able to integrate with Apache Kafka by consuming text from Kafka, performing the filtering, and publishing the filtered text to a different Kafka topic. Philter does this in a performant and fault-tolerant manner by leveraging the Apache Flink streaming framework. This integration is suitable for existing pipelines where text is being consumed from Kafka for processing because it requires minimal changes to the pipeline. Simply provide the appropriate configuration values to the Philter job and update your topic names.

Apache Pulsar

Philter integrates with Apache Pulsar via Pulsar Functions. A Pulsar Function enables Pulsar to execute functions on the streaming data as it passes through Pulsar. Pulsar is similar to Kafka in its functionality as a massive pub/sub application but unlike Kafka it provides the ability to directly transform the data inside of the application. This is an ideal integration point for Philter and your streaming architectures using Apache Pulsar.


Filter Profiles in Philter

Today we are excited to announce a new feature in Philter that is a result of Philter’s open beta testing. We are excited to offer this functionality just prior to Philter going live. Thanks to those who provided their feedback to make this possible!

Previously in Philter, the configuration of each “filter” was static and set when Philter started. The limitation imposed by this implementation is that if you wanted to filter documents differently based on some criteria, you had to run two instances of Philter and add logic when using Philter’s API to send your document to the appropriate instance. It was also restrictive because each enabled filter was configured with some of the same values, such as the replacement format. You could not replace a zip code differently than a credit card number, for example.

We have changed how the filters are configured and are introducing the new feature as “filter profiles.” A filter profile is a set of filters, and each filter’s respective configuration, defined in a JSON file. Now a single instance of Philter can simultaneously apply multiple filter profiles and selectively choose which to use on a per-request basis. We have also added more options for handling each individual PII/PHI identifier, such as being able to independently configure how to redact or replace each one. For instance, it is now possible to truncate zip codes to a chosen length instead of simply replacing the whole zip code.

Here’s an example filter profile that enables filters and defines how corresponding PII/PHI should be replaced. Note how each identifier has its own strategy for handling items – individual types are no longer constrained to sharing the same strategy. With filter profiles, you can also selectively enable identifier filters. Filters not defined in the profile will not be enabled for that profile. Non-deterministic filters such as NLP-based ones can now have their own sensitivity setting, too.

{  
   "name":"default",
   "identifiers":{  
      "creditCard":{  
         "creditCardFilterStrategy":{  
            "strategy":"REDACT",
            "redactionFormat":"{{{REDACTED-%t}}}"
         }
      },
      "ipAddress":{  
         "ipAddressFilterStrategy":{  
            "strategy":"REDACT",
            "redactionFormat":"{{{REDACTED-%t}}}"
         }
      },
      "zipCode":{  
         "zipCodeFilterStrategy":{  
            "truncateDigits":2,
            "strategy":"TRUNCATE"
         }
      }
   }
}

You can initialize Philter with as many filter profiles as you need. Using the REST API you can select the filter profile to use when making your request by providing a p parameter along with the name of a filter profile as shown in this sample request:

curl -k -X POST "https://localhost:8080/api/filter?c=context&p=profile" \
  -d @file.txt \
  -H "Content-Type: text/plain"
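
The same request can be made from Python; a small sketch using the requests library (as with the curl example, certificate verification is disabled here for a self-signed certificate and should be enabled in production):

import requests

with open("file.txt") as f:
    text = f.read()

# The c (context) and p (filter profile) parameters select how Philter
# filters this request.
response = requests.post(
    "https://localhost:8080/api/filter",
    params={"c": "context", "p": "profile"},
    data=text,
    headers={"Content-Type": "text/plain"},
    verify=False
)

print(response.text)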

We have lots of ways to expand filter profiles on our to-do list, such as centralized filter profile management and an API for managing profiles remotely, so look for updates on those features soon.


Introducing Phinder

Go to Phinder’s home page.

As data lakes grow and become more commonplace the need for data awareness and governance also grows. In areas subject to data regulation such as healthcare, the management of protected health information (PHI) becomes a key concern. Today we are introducing Phinder (“find-er”) as a means of discovering, labeling, and reporting on PHI data in your Amazon S3 data lake and supported applications.

Phinder analyzes your Amazon S3 data lake to locate potential PHI in your data. When used regularly as your data lake grows, Phinder can help your organization maintain awareness over PHI and remain compliant with industry regulations. Phinder runs in your cloud to provide data safety and security. Its generated reports provide a view of PHI in your data lake.

Phinder is currently in a private beta and we welcome your interest in joining the beta. Please contact us to get started. As we near a first release date we will make more details available. Until then we welcome your feedback and your requirements for Phinder.

Phinder and its complementary sibling Philter were created to help organizations take control of PHI data and make that data valuable to them without introducing any additional burdens or difficulties. Managing an organization’s PHI certainly comes with enough difficulties!

Interested in the Phinder beta?

If you are interested in giving Phinder a test drive before it’s ready for prime time please let us know. We would love to discuss with you your expectations for a tool like Phinder as we finalize development of its features.


PHI in a DevOps Environment

We have all had a doctor’s visit where we are asked to fill out a HIPAA form regarding who our medical data can be shared with. The form is typically titled HIPAA Privacy Notice or something similar. Because of this, most of us are probably familiar with HIPAA and its general purpose. But for those of us in the tech industry, even those outside the healthcare sector, it’s beneficial to have a slightly deeper understanding of HIPAA and protected health information (PHI).

PHI is any information that can be used to identify a patient. Formally, HIPAA defines 18 categories of PHI under the HIPAA Privacy Rule. These categories include names, social security numbers, addresses, biometric data, and patient record numbers. That’s most of the obvious ones. Things like email address, vehicle identification numbers (VINs), and fax numbers are also part of those 18 categories.

The advent of DevOps cultures has made knowledge of HIPAA and PHI a team requirement. All team members need to understand what PHI is and the implications of having PHI in a system. Prior to DevOps, it was likely that only a select few team members had access to data containing PHI. The democratization of team responsibilities introduced by DevOps now means all team members may potentially have access to PHI data.

PHI and DevOps

Team Training

The very first thing your organization should do is develop a training program to educate current and future team members on PHI. (The HIPAA Privacy and Security Rules mandate appropriate training.) The content of the training is out of scope here, but after completing the training each team member should have a solid working knowledge of HIPAA and PHI and an awareness of the possible penalties for failing to protect PHI at all times.

Well-Defined Scope and Approved Services

Next, the scope of PHI in your system should be very well defined and documented. The boundaries where PHI exists should be well-known to all team members. If you are operating in a cloud environment with a signed Business Associate Agreement, all team members should be aware of the cloud provider’s services that are approved for PHI data. It is a common pitfall to utilize a cloud provider’s service that is not on the list of approved services.

During the design phase, it is imperative that the team fully research any third-party services to determine if they are approved for storing PHI data. After delivery, we recommend regularly checking the cloud provider’s list of approved services, as new additions are made over time. For example, a service that is not approved at launch may appear on the BAA list of approved services six months later.

Least Privilege Permissions

Your method of assigning permissions to team members should be based on the principle of least privilege. Team members should only have access to the data they need in order to perform their role. For example, in an AWS environment it is a best practice to have a separate AWS account that contains the application logs. This account is only accessible by the project’s security team members. The application can write logs to the account via CloudWatch or some other log aggregation utility, but only the security team members can view the logs.

Encryption of Data

Each team member should also be aware of the basic security precautions required for PHI. Encryption of data at rest and in motion is paramount, even in a virtual private cloud or other isolated environment. Keeping traffic inside an isolated environment is not a substitute for transport encryption. Full disk encryption, transport layer encryption, and the encryption provided by blob stores and other persistent data stores will help protect the data. A deep knowledge of the cloud provider being utilized is essential. Resources are available for AWS, Azure, and Google Cloud.

Disaster and Recovery Plan

Your team should have a disaster and recovery plan that clearly defines the required steps and roles. This is required by the HIPAA Security Rule. The plan should define how PHI data will be managed and restored and how the project will operate in the event of some natural or man-made disaster. The plan should clearly outline how the PHI data will be protected at all times throughout execution of the plan.

While a disaster and recovery plan is not specific to DevOps teams, a DevOps culture will certainly affect how the plan is written. Roles and responsibilities may transcend the traditional lines of development and operations. It may utilize an on-call schedule to contact the applicable team members. A DevOps culture makes it crucial that all team members be aware of the plan since the responsibility of executing the plan likely requires at least one team member from each role.

It is also crucial that all disaster and recovery plans be tested in order to identify areas of improvement and to ensure team members are aware of and can execute their responsibilities when needed. In AWS terminology these planned events are referred to as Game days.

Going Forward

The task of managing PHI can be a daunting one but it should not be an impediment to your project’s success. With the appropriate care, knowledge, and awareness, your team can create and execute a strategy to successfully manage PHI in your environments. If you need to remove PHI from data take a look at our Philter application. Philter can help you make that data usable for other purposes.


Pre-trained PubMed Vectors

We have added a new download to our Datasets page: pre-trained word vectors for the PubMed Open Access Subset.

PubMed comprises more than 29 million citations for biomedical literature from MEDLINE, life science journals, and online books. Citations may include links to full-text content from PubMed Central and publisher web sites.

These pre-trained word vectors were created from the commercial PubMed Open Access Subset. There is a lot of great information inside the collection of biomedical text and we hope these word vectors are useful to you in your NLP and text mining experiments.
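
As a sketch of loading the vectors with gensim, assuming they are in the standard word2vec text format (the file name is a placeholder; see the Datasets page for the actual format and name):

from gensim.models import KeyedVectors

# "pubmed-vectors.txt" is a placeholder; set binary=True if the download
# is in the binary word2vec format instead.
vectors = KeyedVectors.load_word2vec_format("pubmed-vectors.txt", binary=False)

# Find terms similar to a biomedical word of interest.
print(vectors.most_similar("insulin", topn=5))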

Go to our Datasets page to access the downloads.


A Tool for Every Data Engineer’s Toolbox

Collecting data from edge devices in manufacturing, processing medical records from electronic health systems, and analyzing text all sound like very different problems, each requiring a unique solution. While that is certainly true, there are some commonalities between these tasks. Each requires a scalable method of data ingestion, predictable performance, and capabilities for management and monitoring. Projects like these also typically require the ability to track data lineage as it moves through the pipeline and the ability to replay data. Now we can start to abstract out the commonalities and observe that the projects are actually not all that different. In each case, data is being consumed and ingested to be analyzed or processed.

A tool that satisfies those common requirements would be invaluable to a data engineer. One such tool is Apache NiFi, an application that allows data engineers to create directed graphs of data flows using an intuitive web interface. Through NiFi’s construct called a processor, data can be ingested, manipulated, and persisted. Data and software engineers no longer have to write custom code to implement data pipelines. With Apache NiFi, creating a pipeline is as simple as dragging and dropping processors onto its canvas and applying appropriate configuration.

To help illustrate the capabilities of Apache NiFi, a recent project required translating documents of varying languages, existing in an Apache Kafka topic, into a single language. The pipeline required consuming the documents from the topic, determining the language of each document, and selecting the appropriate translation service. Apache NiFi’s ConsumeKafka processor handled the ingestion of documents, an InvokeHTTP processor powered the web service request to determine each document’s source language, and a RouteOnAttribute processor directed the flow, based on the document’s language, to the appropriate InvokeHTTP processor that sent the text to a translation service. The resulting translated documents were then persisted to S3.

A few years back, making a pipeline to do this would have likely required writing custom code, whether for consuming from a queue, communicating with the language translation services, or persisting the results to a remote store. Not writing custom code also usually translates to saving time and money. Apache NiFi is one tool that should definitely exist in each data engineer’s toolbox. As with any tool, it is important to understand NiFi’s capabilities and limitations, and the Apache NiFi User Guide is a good place to start.

This article was originally posted to Medium.

Need more help?

We provide consulting services around AWS and big-data tools like Apache NiFi. Get in touch by sending us a message. We look forward to hearing from you!


Some First Steps for a New NiFi Cluster

After installing Apache NiFi there are a few steps you might want to take before making your cluster available for prime time. None of these steps are required so make sure they are appropriate for your use-case before implementing them.

Lowering NiFi’s Log File Retention Properties

By default, Apache NiFi’s nifi-app.log files are capped at 100 MB per log file and NiFi retains 30 log files. If the maximum is reached, that comes out to 3 GB of disk space for nifi-app.log files. That’s not a whole lot, but in some cases you may need the extra disk space. Or, if an external service is already managing your NiFi log files, you don’t need them hanging around any longer than necessary. To lower the thresholds, open NiFi’s conf/logback.xml file. Under the appender configuration for nifi-app.log you will see maxFileSize and maxHistory values. Lower these values, save the file, and restart NiFi to save disk space on log files. Conversely, if you want to keep more log files, just increase those limits.

You can also make changes to the handling of the nifi-user.log and nifi-bootstrap.log files here, too. But those files typically don’t grow as fast as the nifi-app.log so they can often be left as-is. Note that in a cluster you will need to make these changes on each node.

Install NiFi as a Service

Having NiFi run as a service allows it to automatically start when the system starts and provides easier access for starting and stopping NiFi. Note that in a cluster you will need to make these changes on each node. To install NiFi as a system service, go to NiFi’s bin/ directory and run the following commands (on Ubuntu):

sudo ./nifi.sh install
sudo update-rc.d nifi defaults

You can now control the NiFi service with the commands:

sudo systemctl status nifi
sudo systemctl start nifi
sudo systemctl stop nifi
sudo systemctl restart nifi

If running NiFi in a container add the install commands to your Dockerfile.

Install (and use!) the NiFi Registry

The NiFi Registry provides the ability to put your flows under source control. It has quickly become an invaluable tool for NiFi. The NiFi Registry should be installed outside of your cluster but accessible to your cluster. The NiFi Registry Documentation contains instructions on how to install it, create buckets, and connect it to your NiFi cluster.

By default the NiFi Registry listens on port 18080, so be sure your firewall rules allow for the communication. Remember, you only need a single installation of the NiFi Registry per NiFi cluster. If you are using infrastructure-as-code to deploy your NiFi cluster, make sure the scripts to deploy the NiFi Registry are outside the cluster scripts. You don’t want the NiFi Registry’s lifecycle to be tied to the NiFi cluster’s lifecycle. This allows you to create and tear down NiFi clusters without affecting your NiFi Registry. It also allows you to share your NiFi Registry between multiple clusters if you need to.

Although using the NiFi Registry is not required to make a data flow in NiFi, your life will be much, much easier if you do use it, especially in an environment where multiple users will be manipulating the data flow.

Need more help?

We provide consulting services around AWS and big-data tools like Apache NiFi. Get in touch by sending us a message. We look forward to hearing from you!

Monitoring Apache NiFi’s Logs with AWS CloudWatch

It’s inevitable that at some point while running Apache NiFi on a single node or as a cluster you will want to see what’s in NiFi’s log and maybe even be alerted when certain logged events are found. Maybe you are debugging your own processor or just looking for more insight into your data flow. With the AWS CloudWatch Logs agent we can send NiFi’s log files to CloudWatch for aggregation, storage, and alerting.

Creating an IAM Role and Policy

The first thing we will do is give our NiFi instances permission to write logs to CloudWatch. (We’ll mostly be following this Quick Start.) Because permissions are required to save the logs, we will create a new IAM role for our NiFi instances in EC2. (If your NiFi instances already have an existing role attached, you can just edit that role.) After creating the new role, add a new JSON policy to it. For copy/paste ease, the policy is:

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Effect":"Allow",
         "Action":[
            "logs:CreateLogGroup",
            "logs:CreateLogStream",
            "logs:PutLogEvents",
            "logs:DescribeLogStreams"
         ],
         "Resource":[
            "arn:aws:logs:*:*:*"
         ]
      }
   ]
}

Click Review Policy, give the policy a name, like cloud-watch-logs, and click Create Policy. This policy can now be attached to an IAM role. Click through, give your role a name, such as nifi-instance-role, and click Create Role. Now we can attach this role to our NiFi instances.

Install CloudWatch Logs Agent

Now that our NiFi EC2 instances have access to store the logs in CloudWatch Logs we can install the CloudWatch Logs agent on the instance. Because we are running Ubuntu and not Amazon Linux we’ll install the agent manually.

curl https://s3.amazonaws.com/aws-cloudwatch/downloads/latest/awslogs-agent-setup.py -O

sudo python ./awslogs-agent-setup.py --region us-east-1

If it gives you an error that the command python cannot be found, you probably don’t have Python 2 installed. You can quickly install it:

sudo apt-get install python

When prompted for an AWS Access Key ID and AWS Secret Access Key, press enter to skip both. If your instances are running in a region other than us-east-1, enter it now. Press enter to skip the default output format. The next prompt asks for the location of the syslog; you can press enter to accept the default of /var/log/syslog for both prompts. For the log stream name, I recommend using the EC2 instance ID, which is the default option. Next, select the log event timestamp format; the first option is recommended, so press enter to accept it or make a different selection. Next, the agent asks where to start uploading. The first option will upload the whole log file while the second option will start at the end of the file. For completeness, I recommend the first option, so press enter.

When asked if there are more log files to configure, press enter for yes. Now we will specify NiFi’s application log. Our NiFi is installed at /opt/nifi/, so replace /opt/nifi/ with your NiFi directory in the responses below.

Path of log file to upload: /opt/nifi/logs/nifi-app.log
Destination Log Group Name: /opt/nifi/logs/nifi-app.log
Choose Log Stream name: 1. Use EC2 instance id
Choose Log Event timestamp format: 1. %b %d %H:%M:%S (Dec 31 23:59:59)
Choose initial position of upload: 1. From start of file.

Repeat these steps to add any other log files such as nifi-bootstrap.log and nifi-user.log. For convenience, the relevant contents of my /var/awslogs/etc/awslogs.conf file are below:

[/var/log/syslog]
datetime_format = %b %d %H:%M:%S
file = /var/log/syslog
buffer_duration = 5000
log_stream_name = {instance_id}
initial_position = start_of_file
log_group_name = /var/log/syslog

[/opt/nifi/logs/nifi-app.log]
datetime_format = %b %d %H:%M:%S
file = /opt/nifi/logs/nifi-app.log
buffer_duration = 5000
log_stream_name = {instance_id}
initial_position = start_of_file
log_group_name = /opt/nifi/logs/nifi-app.log

[/opt/nifi/logs/nifi-bootstrap.log]
datetime_format = %b %d %H:%M:%S
file = /opt/nifi/logs/nifi-bootstrap.log
buffer_duration = 5000
log_stream_name = {instance_id}
initial_position = start_of_file
log_group_name = /opt/nifi/logs/nifi-bootstrap.log

[/opt/nifi/logs/nifi-user.log]
datetime_format = %b %d %H:%M:%S
file = /opt/nifi/logs/nifi-user.log
buffer_duration = 5000
log_stream_name = {instance_id}
initial_position = start_of_file
log_group_name = /opt/nifi/logs/nifi-user.log

After making manual changes to this file be sure to restart the CloudWatch Logs Agent service.

sudo service awslogs restart

With the service configured and restarted it will now be sending logs to CloudWatch Logs.

Check out the Logs!

Navigating back to the AWS Console and going to CloudWatch we can now see our NiFi logs under the Logs section.

Because we selected the EC2 instance ID as the log_stream_name the logs will be grouped by instance ID. It may be more convenient for you to use a hostname instead of the instance ID.

By having all of our NiFi logs aggregated in a single place we no longer have to SSH into each host to look at the log files!

Create Custom Log Filter

We can also now create custom filters on the logs. For example, to quickly see any error messages, we can create a new Logs Metric Filter with the Filter Pattern ERROR. This will create a metric for lines that contain the filter pattern. If you want the filter to look for something more specific you can adjust the Filter Pattern as needed.

Click the Assign Metric button to continue.

Now we can name our filter and assign it a value. Click Create Filter. Now we have our metric filter!
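
The same metric filter can also be created programmatically; a boto3 sketch (the filter, metric, and namespace names are placeholders, while the log group name matches the one configured earlier):

import boto3

logs = boto3.client("logs")

# Count log lines containing ERROR in the nifi-app.log log group.
logs.put_metric_filter(
    logGroupName="/opt/nifi/logs/nifi-app.log",
    filterName="nifi-app-errors",
    filterPattern="ERROR",
    metricTransformations=[
        {
            "metricName": "NiFiAppLogErrors",
            "metricNamespace": "NiFi",
            "metricValue": "1"
        }
    ]
)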

With this filter we can create alarms to watch for static thresholds or anomalies. For example, if more than two ERROR messages are found in the log in a period of 5 minutes, generate an alarm. We can also utilize CloudWatch’s anomaly detection instead of static values; in this case, CloudWatch will monitor the standard deviation and generate an alarm when the condition threshold is met.

Monitoring for ERROR messages in the log is a useful, even if trivial, example, but I think it shows the value of using CloudWatch Logs to capture NiFi’s logs and building custom metrics and alarms on them.

Need more help?

We provide consulting services around AWS and big-data tools like Apache NiFi. Get in touch by sending us a message. We look forward to hearing from you!


Monitoring Apache NiFi with Datadog

One of the most common requirements when using Apache NiFi is a means to adequately monitor the NiFi cluster. Insights into a NiFi cluster’s use of memory, disk space, CPU, and NiFi-level metrics are crucial to operating and optimizing data flows. NiFi’s Reporting Tasks provide the capability to publish metrics to external services.

Datadog is a hosted service for collecting, visualizing, and alerting on metrics. With Apache NiFi’s built-in DataDogReportingTask, we can leverage Datadog to monitor our NiFi instances. In this blog post we are running a 3 node NiFi cluster in Amazon EC2. Each node is on its own EC2 instance.

Note that the intention of this blog post is not to promote Datadog but instead to demonstrate one potential platform for monitoring Apache NiFi.

If you don’t already have a Datadog account, the first step is to create one. Once that’s done, install the Datadog Agent on your NiFi hosts. The command will look similar to the following except it will have your API key. In the command below we are installing the Datadog agent on an Ubuntu instance. If you just want to monitor NiFi-level metrics you can skip this step; however, we find the host-level metrics to be valuable as well.

DD_API_KEY=xxxxxxxxxxxxxxx bash -c "$(curl -L https://raw.githubusercontent.com/DataDog/datadog-agent/master/cmd/agent/install_script.sh)"

This command will download and install the Datadog agent on the system. The Datadog service will be started automatically and will run on system start.

Creating a Reporting Task

The next step is to create a DataDogReportingTask in NiFi. In NiFi’s Controller Settings under Reporting Tasks, click to add a new Reporting Task and select Datadog. In the Reporting Tasks’ settings, enter your Datadog API key and change the other values as desired.

By default, NiFi reporting tasks run every 5 minutes. You can change this period under the Settings tab via the “Run Schedule” value if needed. Click Apply to save the reporting task.

The reporting task will now be listed. Click the Play icon to start the reporting task. Apache NiFi will now send metrics to Datadog every 5 minutes (unless you changed the Run Schedule value to a different interval).

Exploring NiFi Metrics in Datadog

We can now go to Datadog and explore the metrics from NiFi. Open the Metrics Explorer and enter “nifi” (without the quotes) and the available NiFi metrics will be displayed. These metrics can be included in graphs and other visuals in Datadog dashboards. (If you’re interested, the names of the metrics originate in MetricNames.java.)

Creating a Datadog Dashboard for Apache NiFi

These metrics can be added to Datadog dashboards. By creating a new dashboard, we can add NiFi metrics to it. For example, in the dashboard shown below we added line graphs to show the CPU usage and JVM heap usage for each of the NiFi nodes.

The DataDogReportingTask provides a convenient but powerful method of publishing Apache NiFi metrics. The Datadog dashboards can be configured to provide a comprehensive look into the performance of your Apache NiFi cluster.

What we have shown here is really the tip of the iceberg for making a comprehensive monitoring dashboard. With NiFi’s metrics and Datadog’s flexibility, how the dashboard is created is completely up to you and your needs.

Need more help?

We provide consulting services around AWS and big-data tools like Apache NiFi. Get in touch by sending us a message. We look forward to hearing from you!


Apache NiFi’s MergeContent Processor

The MergeContent processor in Apache NiFi is one of the most useful processors but can also be one of the biggest sources of confusion. The processor (you guessed it!) merges flowfiles together based on a merge strategy. The processor’s purpose is straightforward but its properties can be tricky. In this post we describe how it can be used to merge previously split flowfiles together.

In a recent NiFi flow, the flow was split into two separate pipelines. Both pipelines executed independently, and when both were complete their flowfiles were merged back into a single flowfile. The MergeContent processor uses Defragment as the Merge Strategy for this. In MergeContent-speak, the split flowfiles become fragments. There are two fragments, and we can refer to them as 0 and 1. Which is which doesn’t really matter. What does matter is that each index is unique and less than the fragment count (2).

Merging by Defragment

Using the Defragment merge strategy requires some attributes to be placed on the flowfile. Those attributes are:

  • fragment.identifier – All flowfiles with the same fragment.identifier will be grouped together.
  • fragment.count – The count of flowfile fragments, i.e. how many splits do we have? (All flowfiles having the same fragment.identifier must have the same value for fragment.count.)
  • fragment.index – Identifies the index of the flowfile in the group. All flowfiles in the group must have a unique fragment.index value that is between 0 and the count of fragments.

You can set these attributes using an UpdateAttribute processor. In the screenshot below, our flowfile was previously split into 5 fragments. The common attribute value across each of the fragments is some.id. This UpdateAttribute processor is setting this flowfile as index 0.

UpdateAttribute

With these attributes set, when flowfiles reach the MergeContent processor it will know how to combine them. As flowfiles come into the MergeContent processor, the value of the fragment.identifier attribute is read. MergeContent bins the flowfiles based on this attribute. When the count of binned flowfiles equals fragment.count, the flowfiles are merged together. This means that MergeContent’s Maximum Number of Bins property should be equal to or greater than the number of fragment.identifiers processed concurrently (source). So, if your flow has 100 flowfiles with unique fragment.identifier attribute values being processed at any given time, you will want at least 100 bins.

The other properties of the MergeContent processor are mostly self-explanatory. For example, if you are merging text you can set the Demarcator property to separate the text. The Header and Footer properties allow you to sandwich the combined text with some values.

MergeContent on a Multi-Node NiFi Cluster

It’s important to remember that in a distributed NiFi cluster the MergeContent processor requires all fragments to be on the same node. An easy way to catch this is when flowfiles get “stuck” in the transition to the MergeContent processor and their positions are the same. (The same flowfile fragments are both at position N.) This means that one part of the flowfile is on one node and the other part is on the other node, hence the stuck flowfiles. You need to ensure that the upstream flow is putting the flowfiles onto the same NiFi node. One way to do this is to set the load balance strategy to partition and use ${some.id} as the attribute. This will ensure that flowfiles with the same value for the some.id attribute will be routed to the same node. (For more on load balancing check out this blog post from the Apache NiFi project.)

Partition by Attribute

Need more help?

We provide consulting services around AWS and big-data tools like Apache NiFi. Get in touch by sending us a message. We look forward to hearing from you!


Dataworks Summit 2019

Recently (in May 2019) I had the honor of attending and speaking at the Dataworks Summit in Washington, D.C. The conference had many interesting topics and keynote speakers focused on big-data technologies and business applications. I also always enjoy exploring downtown Washington, D.C. Whether it is doing the “hike” across the National Mall taking in the sights or visiting all of the nearby shopping, there’s always something new to see.

One thing that caught my attention early on was the number of talks that either focused largely on Apache NiFi or at least mentioned Apache NiFi as a component of a larger data ingest platform. Apache NiFi has definitely cemented itself squarely as a core piece of data flow orchestration.

My talk was one of those. In my talk, which described a process to ingest natural language text, process it, and persist extracted entities to a database, Apache NiFi was the workhorse that drove the process. Without NiFi, I would have had to write a lot more code and probably would have ended up with a much less elegant and performant solution.

In conclusion, if you have not yet looked at Apache NiFi for your data ingest and transformation (think ETL) pipeline needs, do yourself a favor and spend a few minutes with NiFi. I think you will like what you find. And if you need help along the way, drop me a note. Just say, “Hey Jeff, I need a pipeline to do X. Show me how it can be done with NiFi.” I’ll be glad to help.

Need more help?

We provide consulting services around AWS and big-data tools like Apache NiFi. Get in touch by sending us a message. We look forward to hearing from you!

Creating an N-gram Language Model

A statistical language model is a probability distribution over sequences of words. (source)

We can build a language model using n-grams and query it to determine the probability of an arbitrary sentence (a sequence of words) belonging to that language.

Language modeling has uses in various NLP applications such as statistical machine translation and speech recognition. It’s easy to see how being able to determine the probability a sentence belongs to a corpus can be useful in areas such as machine translation.

To build it, we need a corpus and a language modeling tool. We will use kenlm as our tool. Other language modeling tools exist and some are listed at the bottom of the Language Model Wikipedia article.

To start, we will clone the kenlm repository from GitHub:

git clone https://github.com/kpu/kenlm.git

Once cloned, we will follow the instructions in the repository’s README for how to compile. Those instructions are:

mkdir -p build
cd build
cmake ..
make -j 4

Once done, we have a bin directory that contains the kenlm binaries. We can now create our language model. For text to experiment with, I used the raw text of Pride and Prejudice. You will most certainly need a much, much larger corpus to get meaningful results, but this should be sufficient for testing and learning.

To create the model:

./bin/lmplz -o 5 < book.txt > book.lm.arpa

This creates an ARPA file whose format can be found documented here. The -o option specifies the order (length of the n-grams) of the model. With this language model we can calculate the probability of an arbitrary sentence being found in Pride and Prejudice.

echo "This is my sentence ." | ./bin/query book.lm.arpa

The output shows us a few things.

Loading the LM will be faster if you build a binary file.
Reading book.lm.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
This=14 2 -2.8062737 is=16 2 -1.1830423 my=186 3 -1.7089757 sentence=6455 1 -4.2776613 .=0 1 -4.980392 </s>=2 1 -1.2587173 Total: -16.215061 OOV: 1
Perplexity including OOVs: 504.0924558936663
Perplexity excluding OOVs: 176.57688116229482
OOVs: 1
Tokens: 6
Name:query VmPeak:33044 kB VmRSS:4836 kB RSSMax:14040 kB user:0.273361 sys:0.00804 CPU:0.281469 real:0.279475

The value -16.215061 is the log (base 10) probability of the sentence belonging to the language. Ten to the power of -16.215061 gives us 6.0945129×10^-17.
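
A quick Python check of that arithmetic also recovers the reported perplexity:

# Log (base 10) probability reported by kenlm's query tool.
log10_prob = -16.215061

# Probability of the sentence.
print(10 ** log10_prob)         # ~6.09e-17

# Perplexity including OOVs is 10 raised to the negative average log
# probability over the 6 tokens (including </s>).
print(10 ** (-log10_prob / 6))  # ~504, matching the reported value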

Compare with word2vec

So how does an n-gram language model compare with a word2vec model? Do they do the same thing? No, they don’t. In an n-gram language model the order of the words is important. word2vec does not consider the ordering of words; instead, it only looks at the words within a given window size. This allows word2vec to predict the neighboring words given some context without consideration of word order.
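
A toy gensim sketch (the sentences and parameters are invented, and a corpus this small produces meaningless vectors) shows the window-based view word2vec takes:

from gensim.models import Word2Vec

# Tiny example corpus; word2vec only sees neighboring words within `window`,
# without regard to their order.
sentences = [
    ["he", "lived", "in", "paris"],
    ["she", "lived", "in", "london"],
]

model = Word2Vec(sentences, window=2, min_count=1)
print(model.wv.most_similar("paris"))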

A little bit more…

This post did not go into the inner workings of kenlm. For those details refer to the kenlm repository or to this paper. Of particular note is Kneser-Ney smoothing, the algorithm used by kenlm to improve results for instances such as when a word is found that was not present in the corpus. A corpus will never contain every possible n-gram so it is possible the sentence we are estimating has an n-gram not included in the model.

Note that the input text to kenlm should be preprocessed and tokenized, a step which we skipped here. You could use Sonnet Tokenization Engine.

To see an example of kenlm used in support of statistical machine translation see Apache Joshua.

Posted February 26, 2019

Entity Extraction from Natural Language Text in a Data Flow Pipeline


This brief slide show illustrates using Idyl E3 Entity Extraction Engine with Apache NiFi to extract named-entities from text in a data pipeline.

Need more help?

We provide consulting services around AWS and big-data tools like Apache NiFi. Get in touch by sending us a message. We look forward to hearing from you!

NLP Flow Now Supports AllegroGraph

NLP Flow (in version 1.3.0) now supports AllegroGraph as an entity store. Supporting AllegroGraph makes NLP Flow more suitable for use in semantic web applications. Extracted entities can be sent to AllegroGraph where they can be queried via SPARQL or visualized by Gruff. And to make it even better, AllegroGraph’s free edition can easily be launched in Amazon Web Services’ EC2. A pipeline to extract entities from natural language text becomes even more capable and powerful when combined with a comprehensive backend like AllegroGraph.

Now for the technical details…

The AllegroGraph support in NLP Flow is made possible by two new processors available in NLP Flow 1.3.0. The first processor, ConvertEntityExtractionToNQuads, takes an entity extraction response from Idyl E3 Entity Extraction Engine and converts it to a list of N-quads (composed of subject, predicate, object, graph). These N-quads can then be sent to AllegroGraph via the AllegroGraphEntityStore processor.
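As a purely illustrative example (the URIs below are hypothetical and not the exact vocabulary the processor emits), a single extracted entity expressed as an N-quad might look like this:

<http://example.com/entities/george-washington> <http://example.com/ontology/hasType> "person" <http://example.com/graphs/extractions> .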


Applications of NLP – Using NLP to Screen Stocks

A publicly traded index focused on bitcoin investments is using NLP to select the holdings. From the press release:

The index underlying KOIN was constructed utilizing a natural language processing algorithm that screens for global stocks that are believed to have a current or future economic interest in blockchain technology. By harnessing the power of textual analysis and artificial intelligence, companies are uncovered that might otherwise be overlooked by traditional analytical research.

You can read the full press release here.

It’s true there is a lot of information in unstructured text but to make that information useful it needs to be extracted and understood on a large-scale basis. This new fund is a great example of a practical use of NLP. If we take a minute to think about the requirements for a system like this we can identify these items:

  • Scalable – The system has to support an enormous amount of text quickly. News doesn’t stop or take a break to let us catch up. The system must scale horizontally to meet demand.
  • Multi-lingual – Blockchain news isn’t just written in English or any other single language. The system must be able to support text documents written in many different languages. We’re interested in global stocks.
  • Customizable – Press releases and news reports represent two specific categories of text. They aren’t like other categories such as legal documents, encyclopedia articles, or general human conversation text. The system needs to be customizable in that it can support text from various formats. A general, all-purpose document processor won’t give us the results we need.
  • NLP – The system likely needs to be able to process natural language text and identify key topics, generate summaries, identify entities (companies and persons), and detect sentiment.

There are, of course, always other requirements but these represent arguably the largest areas.

How can we meet these requirements? To help provide scalability we can use an established cloud provider like AWS or Azure. These platforms give us the tools we need to make an application scale to meet demand, so that’s a good starting point. For the other requirements we can select from available tools based on whether we are building our own implementation from the ground up or using publicly available components. Both approaches have their own advantages and disadvantages. To save time (and money) we’ll assume you would rather use existing tools instead of building them yourself. If not, then you’d better stop reading and get to coding!


Quick Introduction to word2vec

In a previous post I gave links to some pretrained models for a few implementations of word vectors. In this post we’ll take a look at word vectors and their applications.

If you have been anywhere around NLP in the past couple of years you have undoubtedly heard of word2vec. As John Rupert Firth said, “You shall know a word by the company it keeps.” That is the premise behind word2vec. Words that have similar contexts will be placed closer to each other by the algorithm. For example, Paris and France will be closer together than Paris and Germany.

If you are interested in the details of the algorithms behind word2vec you will want to see the paper Efficient Estimation of Word Representations in Vector Space by Tomas Mikolov et al. and the code that goes along with it.

There are two word2vec model architectures. One is the skip-gram model, which uses the current word to predict the surrounding words (the context). The other is continuous bag-of-words (CBOW), which predicts the current word from the surrounding words (the context). For both models, the number of surrounding words considered is controlled by the window size parameter.
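As a rough illustration of these parameters, here is a sketch using the gensim library (which is not mentioned above; parameter names follow recent gensim releases, and the tiny toy corpus is only there to show the API):

from gensim.models import Word2Vec

# A toy corpus of pre-tokenized sentences; a real corpus would be far larger.
sentences = [
    ["paris", "is", "the", "capital", "of", "france"],
    ["berlin", "is", "the", "capital", "of", "germany"],
    ["the", "louvre", "is", "a", "museum", "in", "paris"],
]

# sg=1 selects the skip-gram architecture; sg=0 (the default) selects CBOW.
# window controls how many surrounding words are considered.
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, sg=1, epochs=50)

# With a meaningful corpus, words that share contexts end up close together.
print(model.wv.most_similar("paris", topn=3))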

The practical applications of word vectors include, but are not limited to, NLP tasks like named-entity recognition, machine translation, sentiment analysis, recommendation engines, and document retrieval. Word vectors have also been applied to other domains such as biological sequences of proteins and genes.

Nearly all deep learning and NLP toolkits available today offer at least some support for word vectors. TensorFlow, GluonNLP (based on MXNet), and cloud-based tools such as Amazon SageMaker BlazingText all support word vectors.

Introducing ngramdb

ngramdb provides a distributed means of storing and querying N-grams (or bags of words) organized under contexts. A REST interface provides the ability to insert n-grams, execute “starts with” and “top” queries, and calculate similarity metrics of contexts. Apache Ignite provides the distributed and highly available persistence and powers the querying abilities.

ngramdb is experimental and significant changes are likely. We welcome your feedback and input into its future capabilities.

ngramdb is open source under the Apache License, version 2.0.

 https://github.com/mtnfog/ngramdb

PyData Washington DC 2018

Last month, in November 2018, I had the privilege of attending and presenting at PyData Washington DC 2018 at Capital One. It was my first PyData event, and I learned so much from it that I hope to attend many more in the future. I encourage you to do so, too, if you’re interested in all things data science and the supporting Python ecosystem. Plus, I got some awesome stickers for my laptop.

My presentation was an introduction to machine translation and demonstrated how machine translation can be used in a streaming pipeline. In it, I gave a (brief) overview of machine translation and how it has evolved from early methods, to statistical machine translation (SMT), to today’s neural machine translation (NMT). The demonstration application used Apache Flink to consume German tweets and, via Sockeye, translate them to English.

A video and slides will be available soon and will be posted here. The code for the project is available here. Credits to Suneel Marthi for the original implementation from which mine is forked and to Kellen Sunderland for the Sockeye-Thrift code that enabled communication between Flink and Sockeye.

Lucidworks Activate Search and AI Conference

Back in October 2018 I had the privilege of attending and presenting at Lucidworks Activate Search and AI Conference in Montreal, Canada. It was a first-class event with lots of great, informative sessions set in the middle of a remarkable city.

I was a co-presenter of Embracing Diversity: Searching over multiple languages with Suneel Marthi in which we presented a method of performing cross-language information retrieval (CLIR) using Apache Solr, Apache NiFi, Apache OpenNLP, and Sockeye. Our approach implemented an English-in/English-out system for facilitating searches over a multilingual document index.

We used Apache NiFi to drive the process. The data flow is summarized as follows:

  1. The English search term is read from a file on disk. (This is just to demonstrate the system; we could easily receive the search term from somewhere else, such as via a REST listener.)
  2. The search term is translated via Sockeye to the other languages contained in the corpus.
  3. The translated search terms are sent to a local instance of Solr.
  4. The resulting documents are translated to English, summarized, and returned.

While this is an abbreviated description of the process, it captures the steps at a high level.

Check out the video below for the full presentation. The code for the custom Apache NiFi processors described in the presentation are available on GitHub. All of the software used is open source so you can build this system in your own environments. If you have any questions please get in touch and I will be glad to help.

New AWS NLP Service

The AWS re:Invent conference in Las Vegas always results in announcements of new AWS services. This year AWS announced a new addition to their cloud-based NLP services.

Amazon Comprehend Medical – Natural Language Processing for Healthcare Customers is a service for understanding unstructured natural language medical text. From the announcement, it supports extracting entities from a vocabulary of medical terms and extracting Protected Health Information (PHI) such as addresses and medical record numbers. For a full description and code samples see the AWS blog post. Pricing is based on the usage of the service.
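As a rough sketch of what calling the service looks like (this uses boto3 and a made-up snippet of text; see the AWS blog post for the official code samples):

import boto3

client = boto3.client("comprehendmedical")

text = "Patient John Doe was admitted to Mercy Hospital and prescribed aspirin."

# Extract medical entities such as medications and conditions.
for entity in client.detect_entities(Text=text)["Entities"]:
    print(entity["Category"], entity["Type"], entity["Text"], entity["Score"])

# Extract Protected Health Information (PHI) such as names and addresses.
for entity in client.detect_phi(Text=text)["Entities"]:
    print(entity["Type"], entity["Text"], entity["Score"])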

Apache cTAKES

While this is an interesting and exciting new product, I would be remiss not to mention that this functionality is largely available in the open source application called cTAKES. cTAKES, or “clinical Text Analysis and Knowledge Extraction System,” is an Apache project for extracting information from clinical text in medical records. cTAKES is used by many large hospitals and referenced in many publications.

Being open source, cTAKES is free to use and modify. You can deploy cTAKES on-premises or in your cloud without paying any usage fees; you only pay for the hardware it runs on. Depending on your usage, the cost difference between a service like Amazon Comprehend Medical and cTAKES could be significant, so I recommend evaluating both if you need to process medical records.

Natural Language Processing and Information Extraction for Biomedicine

AWS T3 Instance

Here’s how to quickly make a T3 instance from a T2 instance.

  1. Take a snapshot of the EBS volume.
  2. Run the following AWS CLI command providing the snapshot ID and customizing the name and description properties:
aws ec2 register-image \
    --architecture "x86_64" \
    --ena-support \
    --name "AMI Name" \
    --description "AMI Description" \
    --root-device-name "/dev/sda1" \
    --block-device-mappings "[{\"DeviceName\": \"/dev/sda1\",\"Ebs\": {\"SnapshotId\": \"snap-000111222333444\"}}]" \
    --virtualization-type "hvm"

Now you can launch a T3 instance from the newly created AMI.

Update:

You can stop the instance, enable ENA on it, change the type to a t3, and start the instance. This does require the instance OS to support ENA.

aws ec2 modify-instance-attribute --ena-support --instance-id i-xxxxxxxx
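For completeness, here is a sketch of the same stop/modify/start sequence using boto3 (the instance ID and target instance type are placeholders):

import boto3

ec2 = boto3.client("ec2")
instance_id = "i-xxxxxxxx"  # placeholder

# Stop the instance and wait for it to reach the stopped state.
ec2.stop_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

# Enable ENA support and change the instance type to a t3.
ec2.modify_instance_attribute(InstanceId=instance_id, EnaSupport={"Value": True})
ec2.modify_instance_attribute(InstanceId=instance_id, InstanceType={"Value": "t3.micro"})

# Start the instance again.
ec2.start_instances(InstanceIds=[instance_id])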


Quick List of Pretrained Word Vectors

In the past few years word vectors have become all the rage in NLP and rightly so. It’s hard today to find some application of NLP that doesn’t involve the use of word vectors. The fact that word vectors are generated using unsupervised learning makes them even more appealing.

In a future post we’ll take a look at what exactly word vectors are, but in this post I just want to give a quick list of pretrained word vectors that you can use now. There are several different algorithms and implementations for generating word vectors, with the most famous likely being word2vec.

word2vec – These pretrained vectors were created from a Google News dataset containing about 100 billion words.

GloVe – The GloVe pretrained vectors were created from Wikipedia, a combination of Wikipedia and Common Crawl, and Twitter. 

fastText – The fastText pretrained vectors were created from Wikipedia. They are available for 294 languages. 

Please note the license each pretrained vector is released under prior to using them in your applications.
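As an example of putting these files to use, here is a sketch using the gensim library (not mentioned above); the file name is the one the Google News download is commonly distributed under, so adjust the path to match what you downloaded:

from gensim.models import KeyedVectors

# Google News word2vec vectors, distributed in the binary word2vec format.
vectors = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)

print(vectors.most_similar("Paris", topn=5))
print(vectors.similarity("Paris", "France"))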

NLP Building Blocks with Apache NiFi 1.7.0

Update: Launch NLP Flow in Amazon Web Services!

Apache NiFi 1.7.0 was recently announced. With that announcement we want to provide a guide on using NLP Flow with Apache NiFi 1.7.0.

Apache NiFi

Download Apache NiFi 1.7.0

Use the link above to download Apache NiFi 1.7.0. Once downloaded, extract the archive somewhere on your disk. Going forward, we’ll assume you extracted it to /opt/nifi-1.7.0.

NLP Flow

Download NLP Flow

Use the link above to download NLP Flow. Once downloaded, extract the files to Apache NiFi’s lib directory at /opt/nifi-1.7.0/lib. We can now start Apache NiFi:

/opt/nifi-1.7.0/bin/nifi.sh start

Apache NiFi will now start running in the background.

NLP Building Blocks

We will use the Docker containers for the NLP Building Blocks. To do so, you must have git, docker, and docker-compose installed.

git clone https://github.com/mtnfog/nlp-building-blocks.git
cd nlp-building-blocks
docker-compose up

This will start the NLP Building Blocks.

Building a Flow

We can now open a browser to http://localhost:8080/nifi and see Apache NiFi’s canvas. NLP Flow’s processors for interacting with the NLP Building Blocks are available alongside the standard Apache NiFi processors. We can see the NLP Flow processors by filtering on the keyword “nlp” in the search:

The flow shown below performs named-entity extraction using the NLP Building Blocks containers that we started earlier. Files are read from the file system and separated into sentences, the sentences are tokenized, entities are extracted from the tokens, and the entities are stored in a MongoDB database. I did not use Renku Language Detection Engine in this flow because it was known beforehand that all input files would be in English. Otherwise, a Renku Language Detection Engine processor along with a RouteOnAttribute processor would have been used to route the text through the flow based on its language.

Need more help?

We provide consulting services around AWS and big-data tools like Apache NiFi. Get in touch by sending us a message. We look forward to hearing from you!

Introducing NLP Flow

Today we are introducing NLP Flow, a collection of processors for the popular Apache NiFi data platform to support NLP pipeline data flows.

Apache NiFi is a cross-platform tool for creating and managing data flows. With Apache NiFi you can create flows to ingest data from a multitude of sources, perform transformations and logic on the data, and interface with external systems. Apache NiFi is a stable and proven platform used by companies worldwide.

Extending Apache NiFi to support NLP pipelines is a perfect fit. NLP Flow is, in Apache NiFi terminology, a set of processors that facilitate NLP tasks via our NLP Building Blocks. With NLP Flow, you can create powerful NLP pipelines inside of Apache NiFi to perform language identification, sentence extraction, text tokenization, and named-entity extraction. For example, an NLP pipeline to ingest text from HDFS, extract all named-person entities for English and Spanish text, and persist the entities to a MongoDB database can be managed and executed within Apache NiFi.

NLP Flow is free for everyone to use. An existing Apache NiFi (a free download) installation is required.

 NLP Flow


Open Source NLP Microservices

We have open sourced our NLP building block applications on GitHub under the Apache license.

These microservices are stateless applications designed for deployment into scalable environments. They can be launched as docker containers or through cloud marketplaces.

Each application is built using Idyl NLP, an open source Apache-licensed NLP framework for Java.

Extracting Patient Names, Hospitals, and Drugs with Idyl NLP

In this post we will demonstrate how Idyl NLP can be used to find patient names, hospitals, and names of drugs in natural language text.

Use of natural language processing (NLP) is growing quickly in the healthcare industry. Recent advancements in technology have made it possible to extract useful and very valuable information from unstructured medical records, which can be used to correlate patient information and look for treatment patterns. From a security perspective, it may be necessary to quickly identify any protected health information (PHI) that exists in a document for auditing or compliance requirements.

Idyl NLP is our open-source NLP library for Java, licensed under the business-friendly Apache license, version 2.0. The library provides various NLP capabilities through abstracted interfaces to lower level NLP libraries. The people, places, and things we are concerned with here are patient names, hospitals, and drug names. The goal of Idyl NLP is to provide a powerful, yet easy to use NLP framework.

Extracting Drug Names via a Dictionary

Drug names in the text do not require a trained model since they can be identified via a dictionary. To identify drug names we will use the FDA’s Orange Book. From the Orange Book CSV download we extracted the “Trade Name” column to a text file. Because some drugs are listed more than once, we sort the file and remove the duplicate entries, leaving a text file with one drug name per line.

sort drugs.txt | uniq > drugs-sorted.txt

We will use Idyl NLP’s dictionary entity recognizer to find the drug names in our input text. The dictionary entity recognizer takes a file and reads its contents into a bloom filter. It accepts tokenized text as input. Because some drug names may consist of more than one word, we cannot do a simple contains check against the dictionary. Instead, we produce a list of n-grams of the tokenized text, from length one up to the number of input tokens. We then check whether the bloom filter “might contain” each n-gram. If the bloom filter returns true, we do a definite check to rule out false positives. Using a bloom filter makes these dictionary checks much more efficient.
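Here is a language-agnostic sketch of that idea in Python (this is not the Idyl NLP implementation; it uses the pybloom-live package for the bloom filter and a tiny stand-in drug list):

from pybloom_live import BloomFilter

drug_names = ["aspirin", "tylenol", "extra strength tylenol"]  # stand-in for the Orange Book list

bloom = BloomFilter(capacity=10000, error_rate=0.1)
exact = set(drug_names)  # used for the definite check that rules out false positives
for name in drug_names:
    bloom.add(name)

tokens = ["the", "patient", "was", "given", "extra", "strength", "tylenol"]

# Generate every n-gram from length 1 up to the number of input tokens.
found = []
for n in range(1, len(tokens) + 1):
    for i in range(len(tokens) - n + 1):
        ngram = " ".join(tokens[i:i + n])
        # "Might contain" check against the bloom filter, then the definite check.
        if ngram in bloom and ngram in exact:
            found.append(ngram)

print(found)  # ['tylenol', 'extra strength tylenol']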

Extracting Patient Names and Hospitals

Patient names and hospitals will be extracted from the input text through the use of trained models. Each model was created from the same training data. The only difference is that in the patient model the patient names were annotated, and in the hospital model the names of hospitals were annotated. This training process gives us two model files, one for patients and one for hospitals, and their associated model manifest files. To use these models we will instantiate a model-based entity recognizer. The recognizer will load the two trained entity models from disk.

Creating the Pipeline

To use these two entity recognizers we will create a NerPipeline. This pipeline accepts a list of entity recognizers when built along with other configurable settings, such as a sentence detector and tokenizer. When the pipeline is executed, each entity recognizer will be applied to the input text. The output will be a list of Entity objects that contain information about each extracted entity.

The Code

Below is the code described above. Refer to the idylnlp-samples project for up-to-date examples since this code could change between the time it was written and the time you read it. This code uses Idyl NLP 1.1.0-SNAPSHOT.

Creating the dictionary entity recognizer. The first argument specifies that the extracted entities will be identified as English, the second argument is the full path to the file created from the Orange Book, the third argument is the type of entity, the fourth argument is the false-positive probability for the bloom filter, and the last argument indicates that the dictionary lookup is not case-sensitive.

DictionaryEntityRecognizer dictionaryRecognizer = new DictionaryEntityRecognizer(LanguageCode.en, "/path/to/drugs-sorted.txt", "drug", 0.1, false);

Creating the model-based entity recognizer requires us to read the model manifests from disk. Maps associate the models with entity types and languages.

String modelPath = "/path/to/trained-models/";

LocalModelLoader<TokenNameFinderModel> entityModelLoader = new LocalModelLoader<>(new TrueValidator(), modelPath);

ModelManifest patientModelManifest = ModelManifestUtils.readManifest("/full/path/to/patient.manifest");
ModelManifest hospitalModelManifest = ModelManifestUtils.readManifest("/full/path/to/hospital.manifest");

Set<StandardModelManifest> patientModelManifests = new HashSet<StandardModelManifest>();
patientModelManifests.add(patientModelManifest);

Set<StandardModelManifest> hospitalModelManifests = new HashSet<StandardModelManifest>();
hospitalModelManifests.add(hospitalModelManifest);

Map<LanguageCode, Set<StandardModelManifest>> persons = new HashMap<>();
persons.put(LanguageCode.en, patientModelManifests);

Map<LanguageCode, Set<StandardModelManifest>> hospitals = new HashMap<>();
hospitals.put(LanguageCode.en, hospitalModelManifests);

Map<String, Map<LanguageCode, Set<StandardModelManifest>>> models = new HashMap<>();
models.put("person", persons);
models.put("hospital", hospitals);

OpenNLPEntityRecognizerConfiguration config = new Builder()
 .withEntityModelLoader(entityModelLoader)
 .withEntityModels(models)
 .build();

OpenNLPEntityRecognizer modelRecognizer = new OpenNLPEntityRecognizer(config);

Now we can create the pipeline providing the entity recognizers:

List<EntityRecognizer> entityRecognizers = new ArrayList<>();
entityRecognizers.add(dictionaryRecognizer);
entityRecognizers.add(modelRecognizer);

NerPipeline pipeline = new NerPipeline.NerPipelineBuilder().withEntityRecognizers(entityRecognizers).build();

And, finally, we can execute the pipeline:

String input = FileUtils.readFileToString(new File("/tmp/input-file.txt"));
EntityExtractionResponse response = pipeline.run(input);

The response will contain a set of entities (persons, hospitals, and drugs) that were extracted from the input text.

Notes

Because we created the pipeline using mostly defaults, it will use an internal English sentence detector and tokenizer. For other languages you can create the pipeline with other options. As with any trained model used for named-entity recognition, model performance matters: how well the training data represents the actual data is crucial to achieving good results.

Simplified Named-Entity Extraction Pipeline in Idyl NLP

Idyl NLP 1.1.0 introduces a simplified named-entity extraction pipeline that can be created in just a few lines of code. The following code block shows how to make a pipeline to extract named-person entities from natural language English text in Idyl NLP.

NerPipelineBuilder builder = new NerPipeline.NerPipelineBuilder();
NerPipeline pipeline = builder.build(LanguageCode.en);

EntityExtractionResponse response = pipeline.run("George Washington was president.");
		
for(Entity entity : response.getEntities()) {
  System.out.println(entity.toString());
}

When you run this code a single line will be printed to the screen:

Text: George Washington; Confidence: 0.96; Type: person; Language Code: eng; Span: [0..2);

Internally, the pipeline creates a sentence detector, tokenizer, and named-entity recognizer for the given language. Currently only person entities for English are supported, but we will be adding support for more languages and entity types in the future. The goal of this functionality is to simplify the amount of code needed to perform a complex operation like named-entity extraction. The NerPipeline class is new in Idyl NLP 1.1.0-SNAPSHOT.

Idyl NLP is our open-source, Apache-licensed NLP framework for Java. Its releases are available in Maven Central and daily snapshots are also available. See Idyl NLP on GitHub at https://github.com/idylnlp/idylnlp for the code, examples, and documentation. Idyl NLP powers our NLP Building Blocks.

Idyl NLP

We have open-sourced our NLP library and its associated projects on GitHub. The library, Idyl NLP, is a Java natural language processing library. It is licensed under the Apache License, version 2.0.

Idyl NLP stands on the shoulders of giants to provide a capable and flexible NLP library. Utilizing components such as OpenNLP and DeepLearning4j under the hood, Idyl NLP offers various implementations for NLP tasks such as language detection, sentence extraction, tokenization, named-entity extraction, and document classification.

Idyl NLP has its own webpage at http://idylnlp.ai and is available in Maven Central under the group ai.idylnlp.

Here are the GitHub project links:

Idyl NLP powers our NLP building block microservices and they are also open source on GitHub:

NLP Models and Model Zoo

Idyl NLP has the ability to automatically download NLP models when needed. The Idyl NLP Models repository contains model manifests for various NLP models. Through the manifest files, Idyl NLP can automatically download the model file referenced by the manifest and use it. The service powering this capability is the Idyl NLP Model Zoo, which will soon be hosted at zoo.idylnlp.ai. It is a Spring Boot application that provides a REST interface for querying and downloading models, so you can also run your own model zoo for internal use. See these two repositories on GitHub for more information about the available models and the model zoo. Models will become available through the repository in the coming days.

Sample Projects

There are some sample projects available for Idyl NLP. The samples illustrate how to use some of Idyl NLP’s core capabilities and hopefully provide starting points for using Idyl NLP in your projects.

Future

We are committed to further developing Idyl NLP and its ecosystem, and we welcome the community’s contributions to help it grow. We hope that the business-friendly Apache license helps Idyl NLP’s adoption. Like most software engineers we are a bit behind on documentation; in the near term we will be focusing on the wiki, javadocs, and the sample projects. Our NLP Building Blocks will continue to be powered by Idyl NLP.

For questions or more information please contact help@idylnlp.ai.

Using the NLP Building Blocks with Apache NiFi to Perform Named-Entity Extraction on Logical Entity Exchange Specifications (LEXS) Documents

In this post we are going to show how our NLP Building Blocks can be used with Apache NiFi to create an NLP pipeline to perform named-entity extraction on Logical Entity Exchange Specifications (LEXS) documents. The pipeline will extract a natural language field from each document, identify the named-entities in the text through a process of sentence extraction, tokenization, and named-entity recognition, and persist the entities to a MongoDB database.  While the pipeline we are going to create uses data files in a specific format, the pipeline could be easily modified to read documents in a different format.

LEXS is an XML, NIEM-based framework for information exchange developed for the US Department of Justice. While the details of LEXS are out of scope for this post, the key points are that it is XML-based, a mix of structured and unstructured text, and used to describe various law enforcement events. We have taken the LEXS specification and created test documents for this pipeline. Example documents are also available on the public internet.

And just in case you are not familiar with Apache NiFi, it is a free (Apache-licensed), cross-platform application that allows the creation and execution of data flow processes. With Apache NiFi you can move data through pipelines while applying transformations and executing actions.

The completed Apache NiFi data flow is shown below.

NLP Building Blocks

This post requires that our NLP Building Blocks are running and accessible. The NLP Building Blocks are microservices to perform NLP tasks. They are:

Renku Language Detection Engine
Prose Sentence Extraction Engine
Sonnet Tokenization Engine
Idyl E3 Entity Extraction Engine

Each is available as a Docker container and on the AWS and Azure marketplaces. You can quickly start the building blocks as Docker containers using docker-compose or individually:

Start Prose Sentence Extraction Engine:

docker run -p 8060:8060 -it mtnfog/prose:1.1.0

Start Sonnet Tokenization Engine:

docker run -p 9040:9040 -it mtnfog/sonnet:1.1.0

Start Idyl E3 Entity Extraction Engine:

docker run -p 9000:9000 -it mtnfog/idyl-e3:3.0.0

With the containers running we will next set up Apache NiFi.

Setting Up

To begin, download Apache NiFi and unzip it. Now we can start Apache NiFi:

apache-nifi-1.5.0/bin/nifi.sh start

We can now begin creating our data flow.

Creating the Ingest Data Flow

The Process

Our data flow in Apache NiFi will follow these steps. Each step is described in detail below.

  1. Ingest LEXS XML files from the file system. Apache NiFi offers the ability to read files from many sources (such as HDFS and S3) but we will simply use the local file system as our source.
  2. Execute an XPath query against each LEXS XML file to extract the narrative from each record. The narrative is a free text, natural language description of the event described by the LEXS XML file.
  3. Use Prose Sentence Extraction Engine to identify the individual sentences in the narrative.
  4. Use Sonnet Tokenization Engine to break each sentence into its individual tokens (typically words).
  5. Use Idyl E3 Entity Extraction Engine to identify the named-person entities in the tokens.
  6. Persist the extracted entities into a MongoDB database.

Configuring the Apache NiFi Processors

Ingesting the XML Files

To read the documents from the file system we will use the GetFile processor. The only configuration property for this processor that we will set is the input directory. Our documents are stored in /docs so that will be our source directory. Note that, by default, the GetFile processor removes the files from the directory as they are processed.

Extracting the Narrative from Each Record

The GetFile processor will send the file’s XML content to an EvaluateXPath processor. This processor will execute an XPath query against each XML document to extract the document’s narrative. The extracted narrative will be stored in the content of the flowfile. The XPath is:

/*[local-name()='doPublish']/*[local-name()='PublishMessageContainer']/*[local-name()='PublishMessage']/*[local-name()='DataItemPackage']/*[local-name()='Narrative']
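If you want to sanity-check this XPath outside of NiFi, here is a quick sketch using Python’s lxml library (the file name is a placeholder for one of the LEXS test documents):

from lxml import etree

xpath = (
    "/*[local-name()='doPublish']"
    "/*[local-name()='PublishMessageContainer']"
    "/*[local-name()='PublishMessage']"
    "/*[local-name()='DataItemPackage']"
    "/*[local-name()='Narrative']"
)

doc = etree.parse("lexs-document.xml")
for narrative in doc.xpath(xpath):
    # Print the free-text narrative of each matching element.
    print(narrative.text)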

Identifying Individual Sentences in the Narrative

The flowfile will now be sent to an InvokeHTTP processor that will send the sentence extraction request to Prose Sentence Extraction Engine. We set the following properties on the processor:

HTTP Method: POST
Remote URL: http://localhost:8060/api/sentences
Content Type: text/plain

The response from Prose Sentence Extraction Engine will be a JSON array containing the individual sentences in the narrative.

Splitting the Sentences Array into Separate FlowFiles

The array of sentences will be sent to a SplitJSON processor. This processor splits the flowfile creating a new flowfile for each sentence in the array. For the remainder of the data flow, the sentences will be operated on individually.

Identifying the Tokens in Each Sentence

Each sentence is next sent to an InvokeHTTP processor that will call Sonnet Tokenization Engine. The properties set for this processor are:

HTTP Method: POST
Remote URL: http://localhost:9040/api/tokenize
Content Type: text/plain

The response from Sonnet Tokenization Engine will be an array of tokens (typically words) in the sentence.

Extracting Named-Entities from the Tokens

The array of tokens is next sent to an InvokeHTTP processor that sends the tokens to Idyl E3 Entity Extraction Engine for named-entity extraction. The properties to set for this processor are:

HTTP Method: POST
Remote URL: http://localhost:9000/api/extract
Content Type: application/json

Idyl E3 analyzes the tokens and identifies which tokens are named-person entities (like John Doe, Jerry Smith, etc.). The response is a list of the entities found along with metadata about each entity. This metadata includes the entity’s confidence value, a value from 0 to 1 that indicates Idyl E3’s confidence that the identified text is actually an entity.

Storing Entities in MongoDB

The entities having a confidence value greater than or equal to 0.6 will be persisted to a MongoDB database. In this processor, each entity will be written to the database for storage and further analysis by other systems. The properties to configure for the PutMongo processor are:

Mongo URI: mongodb://localhost:27017
Mongo Database Name: <Any database>
Mongo Collection Name: <Any collection>

You could just as easily insert the entities into a relational database, Elasticsearch, or another repository.

Pipeline Summary

That is our pipeline! We went from XML documents, did some natural language processing via the NLP Building Blocks, and ended up with named-entities stored in MongoDB.

Production Deployment

There are a few things you may want to change for a production deployment.

Multiple Instances of Apache NiFi

First, you will likely want (and need) more than one instance of Apache NiFi to handle large volumes of files.

High Availability of NLP Building Blocks

Second, in this post we ran the NLP Building Blocks as local Docker containers. This is great for a demonstration or proof-of-concept, but you will want high availability for these services through an orchestration platform like Kubernetes or AWS ECS.

You can also launch the NLP Building Blocks as EC2 instances via the AWS Marketplace. You could then plug the AMI of each building block into an EC2 autoscaling group behind an Elastic Load Balancer. This provides instance health checks and the ability to scale up and down in response to demand. They are also available on the Azure Marketplace.

Incorporate Language Detection in the Data Flow

Third, you may have noticed that we did not use Renku Language Detection Engine. This is because we knew beforehand that all of our documents are English. If you are unsure, you can insert a Renku Language Detection Engine processor in the data flow immediately after the EvaluateXPath processor to determine the text’s language and use the result as a query parameter to the other NLP Building Blocks.

Improve Performance through Custom Models

Lastly, we did not use any custom sentence, tokenization, or entity models. Each NLP Building Block includes basic functionality to perform these actions without custom models, but using custom models will almost certainly provide a much higher level of performance because they will more closely match your data than the default models do. The tools to create and evaluate custom models are included with each application – refer to each application’s documentation for the necessary steps.

Filtering Entities with Low Confidence

You may want to filter out entities having a low confidence value in order to control noise. The optimal threshold depends on a combination of your data, the entity model being used, and how much noise your system can tolerate. In some use-cases it may be better to use a lower threshold out of caution. Each entity has an associated confidence value that can be used for filtering.

Need Help?

Get in touch. We’ll be glad to help out. Send us a line at support at mtnfog.com.

Creating Custom Tokenization Models with Sonnet Tokenization Engine

Sonnet Tokenization Engine 1.1.0 includes the ability to train custom token models from your text. Using your own token model provides improved performance because the model will more closely match your text to be tokenized. This post describes how to launch an instance of Sonnet Tokenization Engine on AWS, connect to it, train a custom token model, and then use it.

To get started, let’s launch an instance of Sonnet Tokenization Engine from the AWS Marketplace. On the product page, click the orange “Continue to Subscribe” button.


On the next page, we highly recommend selecting a VPC from the VPC Settings options. This is to allow you to launch Sonnet Tokenization Engine on a newer instance type. Select your VPC and a public subnet.

Now, select an instance type. We recommend a t2.micro for this demonstration. In production you will likely want a larger instance type.

Now click the “Launch with 1-Click” button!

An instance of Sonnet Tokenization Engine will now be starting in your AWS account. Head over to your EC2 console to check it out. By default, for security purposes port 22 for SSH is not open to the instance. Let’s open port 22 so we can SSH to the instance. Click on the instance’s security group, click Inbound Rules, and add port 22. Now let’s SSH into the instance.

ssh -i keypair.pem ec2-user@ec2-34-201-136-186.compute-1.amazonaws.com

Sonnet Tokenization Engine is installed under /opt/sonnet.

cd /opt/sonnet

Training a custom token model requires training data. The format for this data is a single sentence per line with tokens separated by whitespace or <SPLIT>. You can download sample training data for this exercise.

wget https://s3.amazonaws.com/mtnfog-public/token.train -O /tmp/token.train

We also need a training definition file. Again, we can download one for this exercise:

wget https://s3.amazonaws.com/mtnfog-public/token-training-definition.xml -O /tmp/token-training-definition.xml

Using these two files we are now ready to train our model.

sudo su sonnet
./bin/train-model.sh /tmp/token-training-definition.xml

The output will look similar to:

Sonnet Token Model Generator
Version: 1.1.0
Beginning training using definition file: /tmp/token-training-definition.xml
2018-03-17 12:47:46,135 DEBUG [main] models.ModelOperationsUtils (ModelOperationsUtils.java:40) - Using OpenNLP data format.
2018-03-17 12:47:46,260 INFO  [main] training.TokenModelOperations (TokenModelOperations.java:282) - Beginning tokenizer model training. Output model will be: /tmp/token.bin
Indexing events with TwoPass using cutoff of 0

	Computing event counts...  done. 6002 events
	Indexing...  done.
Collecting events... Done indexing in 0.54 s.
Incorporating indexed data for training...  
done.
	Number of Event Tokens: 6002
	    Number of Outcomes: 2
	  Number of Predicates: 6290
Computing model parameters...
Performing 100 iterations.
  1:  . (5991/6002) 0.9981672775741419
  2:  . (5995/6002) 0.9988337220926358
  3:  . (5996/6002) 0.9990003332222592
  4:  . (5997/6002) 0.9991669443518827
  5:  . (5996/6002) 0.9990003332222592
  6:  . (5998/6002) 0.9993335554815062
  7:  . (5998/6002) 0.9993335554815062
  8:  . (6000/6002) 0.9996667777407531
  9:  . (6000/6002) 0.9996667777407531
 10:  . (6000/6002) 0.9996667777407531
Stopping: change in training set accuracy less than 1.0E-5
Stats: (6002/6002) 1.0
...done.
Compressed 6290 parameters to 159
1 outcome patterns
Entity model generated complete. Summary:
Model file   : /tmp/token.bin
Manifest file : token.bin.manifest
Time Taken    : 2690 ms

The model file and its associated manifest file have now been created. Copy the manifest file to Sonnet’s models directory.

cp /tmp/token.bin.manifest /opt/sonnet/models/

Now start/restart Sonnet.

sudo service sonnet restart

The model will be loaded and ready for use. All API requests for tokenization that are received for the model’s language will be processed by the model. To try it:

curl "http://ec2-34-201-136-186.compute-1.amazonaws.com:9040/api/tokenize?language=eng" -d "Tokenize this text please." -H "Content-Type: text/plain"

Renku Language Detection Engine 1.1.0

Renku Language Detection Engine 1.1.0 has been released. It is available now on DockerHub and will be available on the AWS Marketplace and Azure Marketplace in a few days. This version adds a new API endpoint that returns a list of the languages (as ISO 639-3 codes) supported by Renku. The AWS Marketplace image is built using the newest version of the Amazon Linux AMI, and the Azure Marketplace image is now built on CentOS 7.4 (previously 7.3).

Get Renku Language Detection Engine.

Intel “Meltdown” and “Spectre” Vulnerabilities

With the recent announcement of the vulnerabilities known as “Spectre” and “Meltdown” in Intel processors, we have made this post to inform our users how to patch the virtual machines running our products launched via the cloud marketplaces.

Products Launched via Docker Containers

Docker uses the host’s system kernel. Refer to your host OS’s documentation on applying the necessary kernel patch.

Products Launched via the AWS Marketplace

The following product versions are using kernel 4.9.62-21.56.amzn1.x86_64, which needs to be updated.

  • Renku Language Detection Engine 1.0.0
  • Prose Sentence Extraction Engine 1.0.0
  • Sonnet Tokenization Engine 1.0.0
  • Idyl E3 Entity Extraction Engine 3.0.0

Run the following commands on each instance:

sudo yum update
sudo reboot
uname -r

The output of the last command will show an updated kernel version of 4.9.76-3.78.amzn1.x86_64 (or newer). Details are available on the AWS Amazon Linux Security Center.

Products Launched via the Azure Marketplace

The following product versions are running on CentOS 7.3 with kernel 3.10.0-514.26.2.el7.x86_64, which needs to be updated.

  • Renku Language Detection Engine 1.0.0
  • Prose Sentence Extraction Engine 1.0.0
  • Sonnet Tokenization Engine 1.0.0
  • Idyl E3 Entity Extraction Engine 3.0.0

Run the following commands on each virtual machine:

sudo yum update
sudo reboot
uname -r

The output of the last command will show an updated kernel version of 3.10.0-693.11.6.el7.x86_64 (or newer). For more information see the Red Hat Security Advisory and the announcement email.


Apache OpenNLP Language Detection in Apache NiFi

When making an NLP pipeline in Apache NiFi it can be a requirement to route the text through the pipeline based on the language of the text. But how do we get the language of the text inside our pipeline? This blog post introduces a processor for Apache NiFi that utilizes Apache OpenNLP’s language detection capabilities. This processor receives natural language text and returns an ordered list of detected languages along with each language’s probability. Your pipeline can get the first language in the list (it has the highest probability) and use it to route your text through your pipeline.

In case you are not familiar with OpenNLP’s language detection, it provides the ability to detect over 100 languages. It works best with text containing more than one sentence (the more text the better). It was introduced in OpenNLP 1.8.3.

To use the processor, first clone it from GitHub. Then build it and copy the nar file to your NiFi’s lib directory (and restart NiFi if it was running). We are using NiFi 1.4.0.

git clone https://github.com/mtnfog/nlp-nifi-processors.git
cd nlp-nifi-processors
mvn clean install
cp langdetect-nifi-processor/langdetect-processor-nar/target/*.nar /path/to/nifi/lib/

The processor does not have any settings to configure. It’s ready to work right “out of the box.” You can add the processor to your NiFi canvas:

You will likely want to connect the processor to an EvaluateJsonPath processor to extract the language from the JSON response and then to a RouteOnAttribute processor to route the text through the pipeline based on the language. This processor will also work with Apache NiFi MiNiFi to determine the language of text on edge devices. MiNiFi is a subproject of Apache NiFi that allows for capturing data into NiFi flows from edge locations.

Backing up a bit, why would we need to route text through the pipeline depending on its language? The actions taken further down in the pipeline are likely to be language dependent. For instance, the next step might be to tokenize the text, but knowing how to tokenize it requires knowing what language it is. Or, if the next step is to send the text to an entity extraction process, we need to know which entity model to use based on the language. So, language detection in an NLP pipeline can be a crucial initial step. A previous blog post showed how to use NiFi for an NLP pipeline, and extending that pipeline with language detection would be a great addition!

This processor performs the language detection inside the NiFi process; everything remains inside your NiFi installation. This should be adequate for a lot of use-cases, but if you need more throughput check out Renku Language Detection Engine. It works very similarly to this processor in that it receives text and returns a list of identified languages. However, Renku is implemented as a stateless, scalable microservice, meaning you can deploy as many instances as you need to meet your use-case’s requirements. And maybe the best part is that Renku is free for everyone to use without any limits.

Let us know how the processor works out for you!

Jupyter Notebook for NLP Building Blocks

This post presents a Jupyter notebook interactively showing how the NLP Building Blocks can be used. The notebook defines functions for sentence extraction, tokenization, and named-entity extraction. (We recently made a blog post showing how to accomplish the same thing but through Apache NiFi.)

To run the notebook first start the NLP Building Block docker containers. Then fire up Jupyter and replace the 192.168.1.134 IP in the notebook with the IP address of your computer running the containers. You can then step through the notebook.

https://gist.github.com/jzonthemtn/ea6923c1e4595eb61e45f7a8ceb6f83d
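For reference, below is a condensed sketch of the kind of helper functions the notebook defines (it assumes the building blocks are reachable on the default ports used elsewhere in these posts, and the exact request shapes are simplified; replace the host with the IP of the machine running the containers):

import requests

host = "192.168.1.134"  # replace with your Docker host's IP

def extract_sentences(text):
    r = requests.post(f"http://{host}:8060/api/sentences", data=text,
                      headers={"Content-Type": "text/plain"})
    return r.json()

def tokenize(sentence):
    r = requests.post(f"http://{host}:9040/api/tokenize", data=sentence,
                      headers={"Content-Type": "text/plain"})
    return r.json()

def extract_entities(tokens):
    # Idyl E3 expects the tokens as JSON.
    r = requests.post(f"http://{host}:9000/api/extract", json=tokens)
    return r.json()

for sentence in extract_sentences("George Washington was president. He lived in Virginia."):
    print(extract_entities(tokenize(sentence)))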

Sentence Extraction with Custom Trained NLP Models

Introducing Sentence Extraction

A common task in natural language processing (NLP) is to extract sentences from natural language text. This can be a task on its own or part of a larger NLP system. There are several ways to go about sentence extraction. There is the naive way of splitting on the presence of periods, which works great until you remember that periods don’t always indicate a sentence break. There are also tools that break text into sentences based on rules, and there is actually a standard for communicating these rules called Segmentation Rules eXchange (SRX). These rules often work quite well; however, they are language dependent, and implementing them can be difficult because not all programming languages have the necessary constructs.
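To see why the naive period-splitting approach falls apart, here is a quick sketch (just an illustration, not how any of the tools mentioned here work):

import re

text = "Dr. Smith paid $4.50 for coffee. He left at 10 a.m."

# Split on every period and discard empty pieces.
naive = [piece.strip() for piece in re.split(r"\.", text) if piece.strip()]

print(naive)
# ['Dr', 'Smith paid $4', '50 for coffee', 'He left at 10 a', 'm']
# Abbreviations and decimals are treated as sentence breaks, so the two real
# sentences are lost.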

Model-Based Sentence Extraction

This brings us to model-based sentence extraction. In this approach we use trained models to identify sentence boundaries in natural language text. In summary, we take training text, run it through a training process, and we get a model that can be used to extract sentences. A significant benefit of model-based sentence extraction is that you can adapt your model to represent the actual text you will be processing. This leads to potentially great performance. Our NLP Building Block product called Prose Sentence Extraction Engine uses this model-based approach.

Training a Custom Sentence Model with Prose Sentence Extraction Engine

Prose Sentence Extraction Engine 1.1.0 introduced the ability to create custom models for extracting sentences from natural language text. Using a custom model typically provides a much greater level of accuracy than relying on the internal Prose logic to extract sentences. Creating a custom model is fairly simple and this blog post demonstrates how to do it.

To get started we are going to launch Prose Sentence Extraction Engine via the AWS Marketplace. The benefit of doing this is that in just a few seconds (okay, maybe 30 seconds) we will have an instance of Prose fully configured and ready to go. Once the instance is up and running in EC2 we can SSH into it. (Note that the SSH username is ec2-user.) All commands presented in this post are executed through SSH on the Prose instance.

SSH to the Prose instance on EC2:

ssh -i key.pem ec2-user@ec2-54-174-13-245.compute-1.amazonaws.com

Once connected, change to the Prose directory:

cd /opt/prose

Training a sentence extraction model requires training text. This text needs to be formatted in a certain way – one sentence per line. This is how Prose learns how to recognize a sentence for any given language. We have some training text for you to use for this example. When creating a model for your production use you should use text representative of the real text that you will be processing. This gives the best performance.

Download the example training text to the instance:

wget https://s3.amazonaws.com/mtnfog-public/a-christmas-carol-sentences.txt -O /tmp/a-christmas-carol-sentences.txt

Take a look at the first few lines of the file you just downloaded; you will see that it contains one sentence per line. This file is also attached at the bottom of this blog post.

Now, edit the example training definition file:

sudo nano example-training-definition-template.xml

You want to modify the trainingdata file to be “/tmp/a-christmas-carol-sentences.txt” and set the output model file as shown below:

<?xml version="1.0" encoding="UTF-8"?>
<trainingdefinition xmlns="http://www.mtnfog.com">
  <algorithm/>
  <trainingdata file="/tmp/a-christmas-carol-sentences.txt" format="opennlp"/>
  <model name="sentence" file="/tmp/sentence.bin" encryptionkey="random" language="eng" type="sentence"/>
</trainingdefinition>

This training definition says we are creating a sentence model for English (eng) text. The trained model file will be written to /tmp/sentence.bin. Now, we are ready to train the model:

./bin/train-model.sh example-training-definition-template.xml

You will see some output quickly scroll by. Since the input text is rather small, the training only takes at most a few seconds. Your output should look similar to:

$ ./bin/train-model.sh example-training-definition-template.xml

Prose Sentence Model Generator
Version: 1.1.0
Beginning training using definition file: /opt/prose/example-training-definition-template.xml
2017-12-31 19:21:03,451 DEBUG [main] models.ModelOperationsUtils (ModelOperationsUtils.java:40) - Using OpenNLP data format.
2017-12-31 19:21:03,567 INFO  [main] training.SentenceModelOperations (SentenceModelOperations.java:281) - Beginning sentence model training. Output model will be: /tmp/sentence.bin
Indexing events with TwoPass using cutoff of 0

	Computing event counts...  done. 1990 events
	Indexing...  done.
Collecting events... Done indexing in 0.41 s.
Incorporating indexed data for training...  
done.
	Number of Event Tokens: 1990
	    Number of Outcomes: 2
	  Number of Predicates: 2274
Computing model parameters...
Performing 100 iterations.
  1:  . (1827/1990) 0.9180904522613065
  2:  . (1882/1990) 0.9457286432160804
  3:  . (1910/1990) 0.9597989949748744
  4:  . (1915/1990) 0.9623115577889447
  5:  . (1940/1990) 0.9748743718592965
  6:  . (1950/1990) 0.9798994974874372
  7:  . (1953/1990) 0.9814070351758793
  8:  . (1948/1990) 0.978894472361809
  9:  . (1962/1990) 0.985929648241206
 10:  . (1954/1990) 0.9819095477386934
 20:  . (1979/1990) 0.9944723618090452
 30:  . (1986/1990) 0.9979899497487437
 40:  . (1990/1990) 1.0
Stopping: change in training set accuracy less than 1.0E-5
Stats: (1990/1990) 1.0
...done.
Compressed 2274 parameters to 707
1 outcome patterns
2017-12-31 19:21:04,491 INFO  [main] manifest.ModelManifestUtils (ModelManifestUtils.java:108) - Removing existing manifest file /tmp/sentence.bin.manifest.
Sentence model generated complete. Summary:
Model file   : /tmp/sentence.bin
Manifest file : sentence.bin.manifest
Time Taken    : 1056 ms

Our model has been created and we can now use it. First, let’s stop Prose in case it is running:

sudo service prose stop

Next, copy the model file and its manifest file to /opt/prose/models:

sudo cp /tmp/sentence.* /opt/prose/models/

Since we moved the model file, let’s also update the model’s file name in the manifest file:

sudo nano models/sentence.bin.manifest

Change the model.filename property to be sentence.bin (remove the /tmp/). The manifest should now look like:

model.id=e54091c9-89de-4edb-828b-4edf58006c73
model.name=sentence
model.type=sentence
model.subtype=none
model.filename=sentence.bin
language.code=eng
license.key=
encryption.key=random
creator.version=prose-1.1.0
model.source=
generation=1

With our model in place, we can now start Prose. If we tail Prose’s log while it loads, we can see that it finds and loads our custom model:

sudo service prose start && tail -f /var/log/prose.log

In case you are curious, the lines in the log that show the model was loaded will look similar to these:

[INFO ] 2017-12-31 19:25:57.933 [main] ModelManifestUtils - Found model manifest ./models//sentence.bin.manifest.
[INFO ] 2017-12-31 19:25:57.939 [main] ModelManifestUtils - Validating model manifest ./models//sentence.bin.manifest.
[WARN ] 2017-12-31 19:25:57.942 [main] ModelManifestUtils - The license.key in ./models//sentence.bin.manifest is missing.
[INFO ] 2017-12-31 19:25:58.130 [main] ModelManifestUtils - Entity Class: sentence, Model File Name: sentence.bin, Language Code: en, License Key: 
[INFO ] 2017-12-31 19:25:58.135 [main] DefaultSentenceDetectionService - Found 1 models to load.
[INFO ] 2017-12-31 19:25:58.138 [main] LocalModelLoader - Using local model loader directory ./models/
[INFO ] 2017-12-31 19:25:58.560 [main] ModelLoader - Model validation successful.
[INFO ] 2017-12-31 19:25:58.569 [main] DefaultSentenceDetectionService - Found sentence model for language eng

Yay! This means that Prose has started and loaded our model. Requests to Prose to extract sentences for English text will now use our model. Let’s try it:

curl http://ec2-54-174-13-245.compute-1.amazonaws.com:8060/api/sentences -d "This is a sentence. This is another sentence. This is also a sentence." -H "Content-type: text/plain"

The response we receive from Prose is:

["This is a sentence.","This is another sentence.","This is also a sentence."]

Our sentence model worked! Prose successfully took in the natural language English text and sent us back three sentences that made up the text.

Prose Sentence Extraction Engine is available on the AWS Marketplace, the Azure Marketplace, and DockerHub. You can launch Prose Sentence Extraction Engine on any of those platforms in just a few seconds.

At the time of publishing, Prose 1.1.0 was in the process of being published to the Azure and AWS Marketplaces. If 1.1.0 is not yet available on those marketplaces, it will be in just a few days once the update has been published.

Orchestrating NLP Building Blocks with Apache NiFi for Named-Entity Extraction

This blog post shows how we can create an NLP pipeline to perform named-entity extraction on natural language text using our NLP Building Blocks and Apache NiFi. Our NLP Building Blocks provide the ability to perform sentence extraction, string tokenization, and named-entity extraction. They are implemented as microservices and can be deployed almost anywhere, such as AWS, Azure, and as Docker containers.

At the completion of this blog post we will have a system that reads natural language text stored in files on the file system, pulls out the sentences of each file, finds the tokens in each sentence, and finds the named-entities in the tokens.

Apache NiFi is an open-source application that provides data flow capabilities. Using NiFi you can visually define how data should flow through your system. Using what NiFi calls “processors”, you can ingest data from many data sources, perform operations on the data such as transformations and aggregations, and then output the data to an external system. We will be using NiFi to facilitate the flow of text through our NLP pipeline. The text will be read from plain text files on the file system. We will then:

  • Identify the sentences in input text.
  • For each sentence, extract the tokens in the sentence.
  • Process the tokens for named-entities.

To get started we will stand up the NLP Building Blocks. This consists of the following applications:

  • Prose Sentence Extraction Engine for sentence extraction
  • Sonnet Tokenization Engine for string tokenization
  • Idyl E3 Entity Extraction Engine for named-entity extraction

We will launch these applications using a docker-compose script.

git clone https://github.com/mtnfog/nlp-building-blocks
cd nlp-building-blocks
docker-compose up

This will pull the docker images from DockerHub and run the containers. We now have each NLP building block up and running. Let’s get Apache NiFi up and running, too.

To get started with Apache NiFi we will download it. It is a big download at just over 1 GB. You can download it from the Apache NiFi Downloads page or directly from a mirror at this link for NiFi 1.4.0. Once the download completes we will unzip it and start NiFi:

unzip nifi-1.4.0-bin.zip
cd nifi-1.4.0/bin
./nifi.sh start

NiFi will start and after a few minutes it will be available at http://localhost:8080/nifi. (If you are curious you can see the NiFi log under logs/nifi-app.log.) Open your browser to that page and you will see the NiFi canvas as shown below. We can now design our data flow around the NLP Building Blocks!

If you want to skip to the meat and potatoes you can get the NiFi template described below in the nlp-building-blocks repository.

Our source data is going to be read from text files on our computer stored under /tmp/in/. We will use NiFi’s GetFile processor to read these files. Add a GetFile processor to the canvas:


Right-click the GetFile processor and click Configure to bring up the processor’s properties. The only property we are going to set is the Input Directory property. Set it to /tmp/in/ and click Apply:

We will use the InvokeHTTP processor to send API requests to the NLP Building Blocks, so add a new InvokeHTTP processor to the canvas:

This first InvokeHTTP processor will be used to send the data to Prose Sentence Extraction Engine to extract the sentences in the text. Open the InvokeHTTP processor’s properties and set the following values:

  • HTTP Method – POST
  • Remote URL – http://localhost:7070/api/sentences
  • Content-Type – text/plain

Set the processor to autoterminate for everything except Response. We also set the processor’s name to ProseSentenceExtractionEngine. Since we will be using multiple InvokeHTTP processors this lets us easily differentiate between them. We can now create a connection between the GetFile and InvokeHTTP processors by clicking and drawing a line between them. Our flow right now reads files from the filesystem and sends the contents to Prose:

The sentences returned from Prose will be in a JSON array. We can split this array into individual FlowFiles with the SplitJson processor. Add a SplitJson processor to the canvas and set its JsonPath Expression property to $.* as shown below:

Connect the ProseSentenceExtractionEngine processor to the SplitJson processor via the Response relationship. The canvas should now look like this:

Now that we have the individual sentences in the text we can send those sentences to Sonnet Tokenization Engine to tokenize them. Similar to before, add an InvokeHTTP processor and name it SonnetTokenizationEngine. Set its method to POST, the Remote URL to http://localhost:9040/api/tokenize, and the Content-Type to text/plain. Automatically terminate every relationship except Response. Connect the SplitJson processor to it using the Split relationship. The result of this processor will be an array of tokens from the input sentence.
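For example, a FlowFile containing the sentence “George Washington was president.” would come back from Sonnet as a JSON array of tokens along the lines of [“George”, “Washington”, “was”, “president”, “.”]. (The exact tokens depend on the tokenization model Sonnet is using, so treat this as an illustration rather than guaranteed output.)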

While we are at it, let’s go ahead and add an InvokeHTTP processor for Idyl E3 Entity Extraction Engine. Add the processor to the canvas and set its name to IdylE3EntityExtractionEngine. Set its properties:

  • HTTP Method – POST
  • Remote URL – http://localhost:9000/api/extract
  • Content-Type – application/json

Connect the SonnetTokenizationEngine processor to the IdylE3EntityExtractionEngine processor via the Response relationship. All other relationships can be set to autoterminate. To make things easier to see, we are going to add an UpdateAttribute processor that sets the filename for each FlowFile to a random UUID. Add an UpdateAttribute processor and add a new property called filename with the value ${uuid}.txt. We will also add a processor to write the FlowFiles to disk so we can see what happened during the flow’s execution. We will add a PutFile processor and set its Directory property to /tmp/out/.
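(If you are wondering about the ${uuid}.txt value, that is NiFi Expression Language: ${uuid} resolves to the FlowFile’s unique identifier, so every output file gets a unique name.)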

Our finished flow looks like this:

To test our flow we are going to use a super simple text file. The full contents of the text file are:

George Washington was president. This is another sentence. Martha Washington was first lady.

Save this file as /tmp/in/test.txt.

If the NLP Building Blocks we started earlier are no longer running, start them back up:

git clone https://github.com/mtnfog/nlp-building-blocks
cd nlp-building-blocks
docker-compose up

Now you can start the processors in the flow! The file /tmp/in/test.txt will disappear and three files will appear in /tmp/out/. The three files will have random UUIDs for filenames thanks to the UpdateAttribute processor. If we look at the contents of each of these files we see:

First file:

{"entities":[{"text":"George Washington","confidence":0.96,"span":{"tokenStart":0,"tokenEnd":2},"type":"person","languageCode":"eng","extractionDate":1514488188929,"metadata":{"x-model-filename":"mtnfog-en-person.bin"}}],"extractionTime":84}

Second file:

{"entities":[],"extractionTime":7}

Third file:

{"entities":[{"text":"Martha Washington","confidence":0.89,"span":{"tokenStart":0,"tokenEnd":2},"type":"person","languageCode":"eng","extractionDate":1514488189026,"metadata":{"x-model-filename":"mtnfog-en-person.bin"}}],"extractionTime":2}

The input text was broken into three sentences so we have three output files. In the first file we see that George Washington was extracted as a person entity. The second file did not have any entities. The third file had Martha Washington as a person entity. Our NLP pipeline orchestrated by Apache NiFi read the input, broke it into sentences, broke each sentence into tokens, and then identified named-entities from the tokens.

This flow assumed the language would always be English, but if you are unsure you can add another InvokeHTTP processor to utilize Renku Language Detection Engine. This enables language detection inside your flow, and you can then route the FlowFiles based on the detected language, giving you a very powerful NLP pipeline.
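As a rough sketch of how that routing could work, an EvaluateJsonPath processor could pull the top language code out of Renku’s response into a FlowFile attribute, and a RouteOnAttribute processor could then branch the flow on that attribute. The exact JsonPath expression will depend on the shape of Renku’s response.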

There’s a lot of cool stuff here, but arguably one of the coolest is that by using the NLP Building Blocks you don’t have to pay the per-request pricing that many NLP services charge. You can run this pipeline as much as you need to. And if you are in an environment where your text can’t leave your network, this pipeline can be run completely behind a firewall (just like we did in this post).

 

 

OpenNLP 1.8.4 is available

OpenNLP 1.8.4 is available. This minor revision brings a few changes. Check out the details on the Apache OpenNLP News page.

The updated Maven dependency:

<dependency>
  <groupId>org.apache.opennlp</groupId>
  <artifactId>opennlp-tools</artifactId>
  <version>1.8.4</version>
</dependency>

String Tokenization with OpenNLP

OpenNLP is an open-source library for performing various NLP functions. One of those functions is string tokenization. With OpenNLP’s tokenizers you can break text into its individual tokens. For example, given the text “George Washington was president” the tokens are [“George”, “Washington”, “was”, “president”].

If you don’t want the trouble of making your own project, take a look at Sonnet Tokenization Engine. Sonnet, for short, performs text tokenization via a REST API. It is available on the AWS and Azure marketplaces.

A lot of NLP functions operate on tokenized text, so tokenization is an important part of an NLP pipeline. In this post we will use OpenNLP to tokenize some text. At the time of writing, the current version of OpenNLP is 1.8.3.

The tokenizers in OpenNLP are located under the opennlp.tools.tokenize package. This package contains three important classes:

  • WhitespaceTokenizer
  • SimpleTokenizer
  • TokenizerME

The WhitespaceTokenizer does exactly what its name implies – it breaks text into tokens based on the whitespace in the text. The SimpleTokenizer is a little bit smarter. It tokenizes text based on the character classes in the text. Lastly, the TokenizerME performs tokenization using a trained token model. As long as you have data to train your own model, this is the class you should use as it will give the best performance. All three classes implement the Tokenizer interface.

You can include the OpenNLP dependency in your project:

<dependency>
  <groupId>org.apache.opennlp</groupId>
  <artifactId>opennlp-tools</artifactId>
  <version>1.8.3</version>
</dependency>

The WhitespaceTokenizer and SimpleTokenizer can be used in a very similar manner:

SimpleTokenizer tokenizer = SimpleTokenizer.INSTANCE;
String tokens[] = tokenizer.tokenize("George Washington was president.");

And the WhitespaceTokenizer:

WhitespaceTokenizer tokenizer = WhitespaceTokenizer.INSTANCE;
String tokens[] = tokenizer.tokenize("George Washington was president.");

The tokenize() function takes a string and returns a string array containing the tokens of the string.
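To see the difference between the two, consider the sentence “George Washington was president.” The WhitespaceTokenizer splits only on whitespace, so the last token is “president.” with the period still attached, while the SimpleTokenizer splits on character classes and returns the period as its own token.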

As mentioned earlier, the TokenizerME class uses a trained model to tokenize text. This is much more fun than the previous examples. To use this class we first load a token model from the disk. We are going to use the en-token.bin model file available here. Note that these models are really only good for testing since the text they were trained from is likely different from the text you will be using.

To start we load the model into an input stream from the disk:

InputStream inputStream = new FileInputStream("/path/to/en-token.bin"); 
TokenizerModel model = new TokenizerModel(inputStream);

Now we can instantiate a Tokenizer from the model:

TokenizerME tokenizer = new TokenizerME(model);

Since TokenizerME implements the Tokenizer interface it works just like the SimpleTokenizer and WhitespaceTokenizer:

String tokens[] = tokenizer.tokenize("George Washington was president.");

The tokenizer will tokenize the text using the trained model and return the tokens in the string array. Pretty cool, right?
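
To put all of the pieces together, here is a small, self-contained example. The model path is just a placeholder – point it at wherever you saved en-token.bin.

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public class TokenizerExample {

    public static void main(String[] args) throws Exception {

        // Load the pre-trained token model from disk. The path is a placeholder.
        try (InputStream inputStream = new FileInputStream("/path/to/en-token.bin")) {

            TokenizerModel model = new TokenizerModel(inputStream);
            TokenizerME tokenizer = new TokenizerME(model);

            // Tokenize the text and print each token on its own line.
            String tokens[] = tokenizer.tokenize("George Washington was president.");

            for (String token : tokens) {
                System.out.println(token);
            }
        }
    }
}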

Deploy Sonnet Tokenization Engine on AWS and Azure.

Posted by / December 19, 2017

Sonnet, Prose, and Idyl E3 now on Azure Marketplace

We are happy to announce that Sonnet Tokenization Engine, Prose Sentence Extraction Engine, and Idyl E3 Entity Extraction Engine have joined Renku Language Detection Engine on the Microsoft Azure Marketplace!

 

Posted by / December 18, 2017

Idyl E3 3.0 to be a Microservice

Idyl E3 Entity Extraction Engine is an all-in-one solution for performing entity extraction from natural language text. It takes in unmodified natural language text and, through a pipeline, identifies the language of the text, extracts the sentences, tokenizes those sentences, and extracts entities from the tokens. It’s not exactly what you would call a microservice. The archives for version 2.6.0 are nearly 1 GB in size.

With the introduction of the NLP Building Blocks earlier this year, we began breaking up Idyl E3 into a set of smaller services to perform its individual functions. Renku identifies languages, Prose extracts sentences, and Sonnet performs tokenization. Joining the mix soon with its first release will be Lacuna, which classifies documents. Lacuna can be used to route documents through your NLP pipelines based on their content. Each of these applications is small (less than 30 MB), stateless, and horizontally scalable. Using these building blocks instead of the all-in-one Idyl E3 provides much improved flexibility: you can now compose your NLP pipeline from loosely connected microservices.

With that said, Idyl E3 3.0 will become a microservice whose only function is to perform entity extraction. This will dramatically cut Idyl E3’s deployment size, making it easier to deploy and manage. Like the other building blocks, Idyl E3 3.0 will be available as a Docker container. Because Idyl E3’s functionality will be trimmed down, its pricing will also be reduced. Stay tuned for the updated pricing.

To help bring the NLP building blocks together in a pipeline, we have made the nlp-building-blocks-java-sdk available on GitHub. It includes clients for each product’s API, and the Apache2-licensed project also provides the ability to tie the clients together in a pipeline. It is a Java project, but we hope to eventually have similar projects available for other languages.
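
To make the pipeline idea concrete, below is a minimal sketch that chains the building blocks together using plain HTTP calls rather than the SDK’s clients. The class name, hostnames, ports, and endpoint paths are assumptions based on the examples elsewhere in these posts and may differ in your deployment, and a real pipeline would parse the JSON responses instead of passing raw strings around.

import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

// Rough sketch only: hosts, ports, and endpoint paths are assumptions
// and may differ in your deployment.
public class NlpPipelineSketch {

    public static void main(String[] args) throws IOException {

        // 1. Extract sentences with Prose (text/plain in, JSON array of sentences out).
        String sentences = post("http://localhost:7070/api/sentences",
                "George Washington was president. Martha Washington was first lady.", "text/plain");
        System.out.println(sentences);

        // 2. Tokenize a single sentence with Sonnet (text/plain in, JSON array of tokens out).
        //    A real pipeline would parse the sentence array above and loop over each sentence.
        String tokens = post("http://localhost:9040/api/tokenize",
                "George Washington was president.", "text/plain");

        // 3. Extract entities with Idyl E3, sending the token array as application/json.
        String entities = post("http://localhost:9000/api/extract", tokens, "application/json");
        System.out.println(entities);
    }

    // Minimal HTTP POST helper that returns the response body as a string.
    private static String post(String url, String body, String contentType) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) new URL(url).openConnection();
        connection.setRequestMethod("POST");
        connection.setRequestProperty("Content-Type", contentType);
        connection.setDoOutput(true);
        try (OutputStream out = connection.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        try (Scanner scanner = new Scanner(connection.getInputStream(), StandardCharsets.UTF_8.name())) {
            scanner.useDelimiter("\\A");
            return scanner.hasNext() ? scanner.next() : "";
        }
    }
}

The SDK’s clients cover these same APIs, so the same chaining can be done without hand-rolling the HTTP calls.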

We are very excited to take this path of making NLP building block microservices. We believe it provides awesome flexibility and control over your NLP pipelines.

Posted by / December 5, 2017

Renku Language Detection Engine

Renku Language Detection Engine is now available. Renku, for short, is an NLP building block application that performs language detection on natural language text. Renku’s API allows you to submit text for analysis and receive back a list of language codes and associated probabilities. Renku is free for personal, non-commercial, and commercial use.

You can get started with Renku in a docker container quickly:

docker run -p 7070:7070 -it mtnfog/renku:1.2.0

Once running, you can submit requests to Renku. For example:

curl http://localhost:7070/api/language -d "George Washington was the first president of the United States."

The response from Renku will be a list of three-letter language codes, each with an associated probability. The languages are ordered from highest probability to lowest. In this example the highest-probability language will be “eng” for English.

Posted by / November 24, 2017

NLP Building Blocks

With the introduction of a new product called Lacuna Document Classification Engine, we are continuing toward our goal of providing the building blocks for larger NLP systems. Lacuna, for short, is an application that uses deep learning algorithms to classify documents into predefined categories.

Document classification has many uses in NLP systems though it is probably most famous for applications such as sentiment analysis and spam detection. Using Lacuna with Idyl E3 allows you to construct NLP pipelines capable of automatically performing entity extraction based on a document’s category. For instance, if Lacuna categorizes a document as a movie review, the document can be sent to an Idyl E3 containing an entity model for actors. Or, if Lacuna categorizes a document as a scientific paper, the document can be sent to an Idyl E3 containing an entity model for chemical compounds. Lacuna allows NLP pipelines to be more fluid and less rigid.

Lacuna will be available for download, on cloud marketplaces, and as a Docker container.

Posted by / November 5, 2017