How We Train Philter's NLP Models

In this post I want to give some insight into how we create and train the NLP (natural language processing) models that Philter uses to identify entities like person's names in text.

Read this first :)

As a user of Philter you don't need to understand or even be aware of how we train Philter's NLP models. But it is helpful to know that Philter's NLP model can be changed based on your domain. For example, we offer some models trained specifically for the healthcare domain. These models were trained to give better performance when using Philter in a healthcare environment. See the bottom of this post for a list of the currently available NLP models for Philter.

What is NLP?

Philter can identify some sensitive information, such as SSNs, based on patterns or dictionaries. A person's name, however, doesn't follow a pattern, and while it may be found in a dictionary, there is no guarantee that your dictionary contains every possible name. To identify person's names we rely on a set of techniques collectively known as natural language processing, or NLP.

NLP is a broad term used to describe many types of methods and technologies used to extract information from unstructured, or natural language, text. Some common foundational NLP tasks are identifying the language of a given text and labeling the words in a sentence with their parts of speech. More advanced tasks include named-entity recognition, summarizing text passages in a few sentences, translating text from one language to another, and determining the sentiment of a given text. It's a very exciting time in NLP due to recent advancements in neural networks and GPU hardware, and an explosion in the number of researchers and practitioners in the NLP community.

How does NLP work?

NLP tasks often require a trained model to operate. For instance, language translation requires a model that can take words and phrases in one language and produce the equivalent text in another. The model is trained on parallel text, meaning the same content written in both languages. How the words and phrases are used helps the model determine how the text should be translated. Identifying person's names in text also requires a trained model. Training this type of model requires text that has been annotated, meaning that the entities have been labeled. The training algorithm uses these labels to teach the model to identify names in text it has not seen before. An example of an annotated sentence:

{person}George Washington{/person} was president.

There are different annotation formats created for different purposes but I'm sure you get the idea. With annotated text we can train our model to know what a person's name looks like when the model is applied to unlabeled text. That's essentially all there is to it.
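To make the annotation idea concrete, here is a minimal Java sketch (Java 16 or later) that reads the {person}...{/person} markup from the example sentence and produces the plain text plus the labeled character spans a training tool would learn from. The class and record names are hypothetical and are not part of Philter or any NLP framework; real annotation formats and their parsers differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Philter: turns text annotated as
// {label}entity{/label} into plain text plus labeled character spans.
final class AnnotationParser {

    // One labeled entity: [start, end) offsets into the plain text.
    record Span(String label, String text, int start, int end) {}

    // The plain text with markup removed, and the spans found in it.
    record Parsed(String plainText, List<Span> spans) {}

    // Matches {label}...{/label}; the backreference \1 requires matching tags.
    private static final Pattern TAGGED =
            Pattern.compile("\\{(\\w+)\\}(.*?)\\{/\\1\\}");

    static Parsed parse(String annotated) {
        StringBuilder plain = new StringBuilder();
        List<Span> spans = new ArrayList<>();
        Matcher m = TAGGED.matcher(annotated);
        int cursor = 0;
        while (m.find()) {
            plain.append(annotated, cursor, m.start()); // copy unlabeled text
            int start = plain.length();
            plain.append(m.group(2));                    // copy entity text without tags
            spans.add(new Span(m.group(1), m.group(2), start, plain.length()));
            cursor = m.end();
        }
        plain.append(annotated.substring(cursor));       // copy the trailing text
        return new Parsed(plain.toString(), spans);
    }

    public static void main(String[] args) {
        Parsed p = parse("{person}George Washington{/person} was president.");
        System.out.println(p.plainText());
        System.out.println(p.spans());
    }
}
```

For the example sentence this yields the plain text "George Washington was president." and a single person span covering characters 0 to 17 (end exclusive).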

There are lots of fantastic open-source tools with active user communities for natural language processing. If you are interested in learning the nuts and bolts of NLP, choose a framework in your preferred programming language to lower the learning curve and dive in! The techniques and terminology learned from using one framework largely carry over to other frameworks, even those in a different programming language, so there is little risk of lock-in.

How We Train Philter's NLP Models

As described above, training our model requires annotated text. We have annotated text for various domains. We use this annotated text, along with a set of word embeddings, a few GPUs, and some time, to train the models for Philter. The output of the training is a file which contains the model. The model can then be used by Philter to identify person's names in text.

Evaluating a Model's Performance

To get an idea of how our model will perform we use two common metrics called precision and recall. These metrics tell us how well the model performs on our test data. We don't need to get into the details of precision and recall here. However, one important thing to know is that we often try to maximize recall when training the model. Maximizing recall means it is better to label some text as a person's name even if it is not one than to risk leaving a real name unlabeled. When dealing with sensitive information in text, it can be advantageous to err on the side of caution rather than risk a person's name going unfiltered. Restated, maximizing recall means false positives are more acceptable than false negatives.
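For readers who do want the definitions, the following sketch computes both metrics from counts of true positives, false positives, and false negatives. The class name and the counts in main are made up for illustration and are not Philter evaluation results.

```java
// Illustrative only: the counts below are made up, not Philter evaluation results.
final class NerMetrics {

    // Of everything the model labeled as a name, the fraction that really was one.
    static double precision(int truePositives, int falsePositives) {
        int predicted = truePositives + falsePositives;
        return predicted == 0 ? 0.0 : (double) truePositives / predicted;
    }

    // Of all the real names in the text, the fraction the model found.
    static double recall(int truePositives, int falseNegatives) {
        int actual = truePositives + falseNegatives;
        return actual == 0 ? 0.0 : (double) truePositives / actual;
    }

    public static void main(String[] args) {
        // Model A is cautious: few false positives, but it misses 20 of 100 names.
        System.out.printf("A: precision=%.2f recall=%.2f%n",
                precision(80, 5), recall(80, 20));
        // Model B over-labels: more false positives, but it misses only 2 of 100 names.
        System.out.printf("B: precision=%.2f recall=%.2f%n",
                precision(98, 30), recall(98, 2));
    }
}
```

In this hypothetical comparison, model B has lower precision but higher recall, so fewer names would slip through unfiltered. That is the trade-off described above.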

Currently Available Models for Philter

Once we are satisfied with the model's performance we publish it and make it available on our website. Here are the models we have so far:

Name                  Domain        Description
covid-19-1.0          Healthcare    Optimized for text relating to COVID-19.
general-2.0           General Use   A general model for use across various domains.
general-2.0-lite      General Use   A general model for use across various domains optimized for speed and size.
healthcare-2.0        Healthcare    A model for text in the healthcare domain.
healthcare-2.0-lite   Healthcare    A model for text in the healthcare domain optimized for speed and size.

We have models for general usage and models more specialized for specific domains such as healthcare. We are continuously training and updating our models to keep them current and improve their performance. The model included with Philter is a general usage model.

To stay up to date on model updates please follow us on Twitter or subscribe to our very low volume newsletter.



Jeff Zemerick is the founder of Mountain Fog. He is a 10x certified AWS engineer, current chair of the Apache OpenNLP project, and experienced software engineer.

You can contact Jeff at jeff.zemerick@mtnfog.com or on LinkedIn.

Use your own NLP models with Philter

New in Philter 1.5.0 is the ability to use your own custom NLP models with Philter. This capability is available in both the Standard and Enterprise editions.

Philter is able to identify named-entities in text through the use of a trained model. The model is able to identify things, like person's names, that neither follow a well-defined pattern nor can be easily looked up in a dictionary. Philter's NLP model is interchangeable and we offer multiple models that you can choose from to better tailor Philter to your use-case and your domain.

However, there are times when using our models may not be sufficient, such as when your use-case does not exactly match our available models or you want to try to get better performance by training a model on text very similar to your input text. In those cases you can train a custom NLP model for use with Philter.

Getting Started

Before diving into the details, here are some examples that illustrate how to make a custom NLP model usable for Philter. Feel free to use these examples as a starting point for your models. Additionally, this capability has been documented in Philter's User's Guide.

The first example in that repository is an implementation of a named-entity recognizer using Apache OpenNLP. The entity recognizer is exposed via two endpoints using a Spring Boot REST controller.

  • The first endpoint /process simply passes the received text to the entity recognizer. The entity recognizer receives the text, extracts the entities, and returns them in a list of PhilterSpan objects.
  • The second endpoint /status simply returns HTTP 200 and the text healthy. In cases where your model may take a while to load, it would be better to actually check the status of the model instead of just returning that it is healthy. But for this example, the model loading is quick and is done before the HTTP service is even available.

Those are the only required endpoints. Philter will use those endpoints to interact with your model.

// Receives plain text and returns the entities found by the model as JSON.
@RequestMapping(value = "/process", method = RequestMethod.POST, consumes = MediaType.TEXT_PLAIN_VALUE, produces = MediaType.APPLICATION_JSON_VALUE)
public @ResponseBody List<PhilterSpan> process(@RequestBody String text) {
    return openNlpNer.extract(text);
}

// Health check; always reports healthy because the model loads before the service starts.
@RequestMapping(value = "/status", method = RequestMethod.GET)
public ResponseEntity<String> status() {
    return new ResponseEntity<>("healthy", HttpStatus.OK);
}

Click here to see the full source file.
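If your model does take a while to load, the /status endpoint could report the model's real state rather than always answering healthy. The sketch below shows one way to track that state in plain Java. The ModelReadiness class is hypothetical, and returning HTTP 503 with the text "loading" before the model is ready is an assumption about what a reasonable unhealthy response looks like, not documented Philter behavior.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: tracks whether a slow-loading model is ready so that a
// /status handler can report the real state instead of always "healthy".
final class ModelReadiness {

    private final AtomicBoolean loaded = new AtomicBoolean(false);

    // Call this once the model has finished loading (for example, from a loader thread).
    void markLoaded() {
        loaded.set(true);
    }

    // HTTP status code for /status: 200 when ready, 503 (Service Unavailable) otherwise.
    // Using 503 for "not ready" is an assumption, not a documented Philter requirement.
    int statusCode() {
        return loaded.get() ? 200 : 503;
    }

    // Response body to accompany the status code.
    String statusText() {
        return loaded.get() ? "healthy" : "loading";
    }

    public static void main(String[] args) {
        ModelReadiness readiness = new ModelReadiness();
        System.out.println(readiness.statusCode() + " " + readiness.statusText());
        readiness.markLoaded();
        System.out.println(readiness.statusCode() + " " + readiness.statusText());
    }
}
```

In a Spring controller like the one above, the status() method could build its ResponseEntity from statusText() and statusCode() instead of hard-coding the healthy response.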

The PhilterSpan object contains the details of a single extracted entity. An example response from the /process endpoint for a single entity is shown below.

[
  {
    "text": "George Washington",
    "tag": "person",
    "score": "0.97",
    "start": "0",
    "end": "17"
  }
]
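To show how the start and end offsets in such a response relate to the input text, here is a small sketch that replaces each span with a placeholder. The Span record only mirrors the fields in the example response and is not Philter's actual PhilterSpan class; the "[tag]" placeholder format is arbitrary; and the code assumes spans do not overlap and that end is exclusive, as the example (start 0, end 17 for a 17-character name) suggests.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; this Span type only mirrors the fields in the
// example response and is not Philter's actual PhilterSpan class.
final class SpanRedactor {

    record Span(String text, String tag, double score, int start, int end) {}

    // Replaces each span's characters [start, end) with a "[tag]" placeholder.
    // Assumes the spans do not overlap.
    static String redact(String input, List<Span> spans) {
        List<Span> ordered = new ArrayList<>(spans);
        // Work from the end of the text so earlier offsets stay valid.
        ordered.sort((a, b) -> Integer.compare(b.start(), a.start()));
        StringBuilder out = new StringBuilder(input);
        for (Span s : ordered) {
            out.replace(s.start(), s.end(), "[" + s.tag() + "]");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Span span = new Span("George Washington", "person", 0.97, 0, 17);
        // Prints: [person] was president.
        System.out.println(redact("George Washington was president.", List.of(span)));
    }
}
```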

Custom NLP Models

Training Your Own Model

Philter is indifferent to the technologies and methods you choose to train your custom model. You can use any framework you like, such as Apache OpenNLP, spaCy, Stanford CoreNLP, or your own custom framework. Follow the framework's documentation for training a model.

Using a Model You Already Have

If you already have a trained NLP model you can skip the training part and proceed on to making the model accessible to Philter.

Using Your Own Model with Philter

Once your model has been trained and you are satisfied with its performance, you must expose it through a simple HTTP service interface in order to use it with Philter. This service facilitates communication between Philter and your model. The service is illustrated below.

Once your model is available behind the HTTP interface described above, you are ready to use the model with Philter. On the Philter virtual machine, set the PHILTER_NER_ENDPOINT environment variable to the location of the running HTTP service. We recommend setting this environment variable in /etc/environment. If your HTTP service is running on the same host as Philter on port 8888, the environment variable would be set as:

export PHILTER_NER_ENDPOINT=http://localhost:8888/

Now restart the Philter service and stop and disable the philter-ner service.

sudo systemctl restart philter.service
sudo systemctl stop philter-ner.service
sudo systemctl disable philter-ner.service

When a filter profile containing an NER filter is applied, Philter sends requests to your HTTP service, which runs your model's inference and returns the identified named-entities.

Recommendations and Best Practices

You have complete freedom to train your custom NLP model using whatever tools and processes you choose. However, from our experience there are a few things that can help you be successful.

The first recommendation is to package your service in a Docker container. Doing so gives you a self-contained image that can be deployed and run virtually anywhere. It simplifies dependency management and protects you from dependency version changes.

The second recommendation is to make your HTTP service as lightweight as possible. Avoid any unnecessary code or features that could negatively impact the speed of your model inference.

Lastly, thoroughly evaluate your model prior to deploying the model to Philter to have a better expectation of performance.

Conclusion

Using a custom NLP model with Philter is a fairly straightforward process. Train your model, make it accessible by HTTP, and then deploy the HTTP service and the model such that the service is accessible from Philter.

Do you have to train a custom model to use Philter? Absolutely not. Philter's "out-of-the-box" capabilities are sufficient for most use-cases. Will using your own model give you better performance? Possibly. It depends on how well the model is trained and on the parameters used. Training your own model can be a difficult and time-consuming activity, so it's best to have some familiarity with the process before starting.

We are excited to offer this feature and look forward to getting your feedback!

                                 Philter Version
Launch Philter on AWS            1.6.0
Launch Philter on Azure          1.5.0
Launch Philter on Google Cloud   1.6.0

PyData Washington DC 2018

Last month, in November 2018, I had the privilege of attending and presenting at PyData Washington DC 2018 at Capital One. It was my first PyData event, and I learned so much from it that I hope to attend many more in the future. I encourage you to do so, too, if you're interested in all things data science and the supporting Python ecosystem. Plus, I got some awesome stickers for my laptop.

My presentation was an introduction to machine translation and demonstrated how machine translation can be used in a streaming pipeline. In it, I gave a (brief) overview of machine translation and how it has evolved from early methods, to statistical machine translation (SMT), to today's neural machine translation (NMT). The demonstrated application used Apache Flink to consume German tweets and, via Sockeye, translate the German tweets to English.

A video and slides will be available soon and will be posted here. The code for the project is available here. Credits to Suneel Marthi for the original implementation from which mine is forked and to Kellen Sunderland for the Sockeye-Thrift code that enabled communication between Flink and Sockeye.


Lucidworks Activate Search and AI Conference

Back in October 2018 I had the privilege of attending and presenting at Lucidworks Activate Search and AI Conference in Montreal, Canada. It was a first-class event with lots of great, informative sessions set in the middle of a remarkable city.

I was a co-presenter of Embracing Diversity: Searching over multiple languages with Suneel Marthi in which we presented a method of performing cross-language information retrieval (CLIR) using Apache Solr, Apache NiFi, Apache OpenNLP, and Sockeye. Our approach implemented an English-in/English-out system for facilitating searches over a multilingual document index.

We used Apache NiFi to drive the process. The data flow is summarized as follows:

  • The English search term is read from a file on disk. (This is just to demonstrate the system. We could easily receive the search term from somewhere else such as via a REST listener or by some other means.)
  • The search term is translated via Sockeye to the other languages contained in the corpus.
  • The translated search terms are sent to a local instance of Solr.
  • The resulting documents are translated to English, summarized, and returned.

While this is an abbreviated description of the process, it captures the steps at a high level.

Check out the video below for the full presentation. The code for the custom Apache NiFi processors described in the presentation is available on GitHub. All of the software used is open source so you can build this system in your own environment. If you have any questions please get in touch and I will be glad to help.

https://www.youtube.com/watch?v=ek-crQwMfnQ&t=838s