How We Train Philter's NLP Models

PhilterIn this post I want to give some insight into how we create and train the NLP (natural language processing) models that Philter uses to identify entities like person's names in text.

Read this first :)

As a user of Philter you don't need to understand or even be aware of how we train Philter's NLP models. But it is helpful to know that Philter's NLP model can be changed based on your domain. For example, we offer some models trained specifically for the healthcare domain. These models were trained to give better performance when using Philter in a healthcare environment. See the bottom of this post for a list of the currently available NLP models for Philter.

What is NLP?

Some sensitive information can be identified by Philter based on patterns (SSNs) or dictionaries. Things like a person's name don't follow a pattern and while it may be found in a dictionary there isn't any guarantee your dictionary will contain all possible names. To identify person's names we rely on a set of techniques collectively known as natural language processing, or NLP.

NLP is a broad term used to describe many types of methods and technologies used to extract information from unstructured, or natural language, text. Some foundational common NLP tasks are to identify the language of some given text and to label the words in a sentence with their parts-of-speech types. More advanced tasks include named-entity recognition, summarizing text passages in a few sentences, translating text from one language to another, and determining the sentiment of a given text. It's a very exciting time in NLP due to lots of recent advancements in neural networks, GPU hardware, and just an explosion in the number of researchers and practitioners in the NLP community.

How does NLP work?

NLP tasks often require a trained model to operate. For instance, language translation requires a model that is able to take words and phrases in one language and produce another language. The model is trained in identical sets of text in both languages. How the words and phrases are used help the model determine how the text should be translated. Identifying person's names in text also requires a trained model. Training this type of model requires text that has been annotated, meaning that the entities have been labeled. The algorithms will use these labels to train the model to identify names in the future. An example of an annotated sentence:

{person}George Washington{/person} was president.

There are different annotation formats created for different purposes but I'm sure you get the idea. With annotated text we can train our model to know what a person's name looks like when the model is applied to unlabeled text. That's essentially all there is to it.

There are lots of fantastic open-source tools with active user communities for natural language processing. If you are interested in learning the nuts and bolts of NLP, choose a framework in your preferred programming language to lower the learning curve and dive in! The techniques and terminology learned from using one framework will always apply to a different framework even if it is in a different programming language so you aren't at any risk of lock-in.

How We Train Philter's NLP Models

As described above, training our model requires annotated text. We have annotated text for various domains. We use this annotated text, along with a set of word embeddings, a few GPUs, and some time, to train the models for Philter. The output of the training is a file which contains the model. The model can then be used by Philter to identify person's names in text.

Evaluating a Model's Performance

To have an idea of how our model will perform we use some common metrics called precision and recall. These metrics give us an idea of how well the model is performing on our test data. We don't need to get into the details of precision and recall here. However, one important thing we want you to know is often we will try to maximize the recall value when training the model. Maximizing the recall means it is better to label some text as a person's name even if it is not than it is to risk not labeling a person's name. When dealing with sensitive information in text it can be advantageous to err on the side of caution instead of risk missing a person's name not being filtered. Restated, maximizing recall means false positives are more acceptable than false negatives.

Currently Available Models for Philter

Once we are satisfied with the model's performance we publish it and make it available on our website. Here's the models we have so far:

NameDomainDescription
covid-19-1.0HealthcareOptimized for text relating to COVID-19.
general-2.0General UseA general model for use across various domains.
general-2.0-liteGeneral UseA general model for use across various domains optimized for speed and size.
healthcare-2.0HealthcareA model for text in the healthcare domain.
healthcare-2.0-liteHealthcareA model for text in the healthcare domain optimized for speed and size.

We have models for general usage and models more specialized for specific domains such as healthcare. We are continuously training and updating our models to keep them current and improve their performance. The model included with Philter is a general usage model.

To stay up to date on model updates please follow us on Twitter or subscribe to our very low volume newsletter.



Jeff Zemerick is the founder of Mountain Fog. He is a 10x certified AWS engineer, current chair of the Apache OpenNLP project, and experienced software engineer.

You can contact Jeff at jeff.zemerick@mtnfog.com or on LinkedIn.