Philter 1.7.0

We are happy to announce that Philter 1.7.0 has been released and is currently being published to DockerHub and the AWS, Azure, and Google Cloud marketplaces. Look for it to be available for deployment into your cloud in the next couple of days.

Click here to deploy Philter in your cloud of choice!

Philter finds and removes sensitive information, such as PII and PHI, in text. Philter can be integrated with virtually any platform, such as Apache Kafka, Apache Flink, Apache NiFi, Apache Pulsar, and Amazon Kinesis. Philter can redact, replace, encrypt, and hash sensitive information.

Philter can currently identify:  Ages, Bitcoin Addresses, Cities, Counties, Credit Cards, Custom Dictionaries, Custom Identifiers (medical record numbers, financial transaction numbers), Dates, Drivers License Numbers, Email Addresses, IBAN Codes, IP Addresses, MAC Addresses, Passport Numbers, Persons' Names, Phone/Fax Numbers, SSNs and TINs, Shipping Tracking Numbers, States, URLs, VINs, Zip Codes

Learn more about Philter.

Platform: Philter Version
Launch Philter on AWS: 1.7.0
Launch Philter on Azure: 1.7.0
Launch Philter on Google Cloud: 1.7.0

What's New in Philter 1.7.0?

Philter 1.7.0 brings a new experimental feature that breaks large text into smaller pieces for more efficient processing. The feature is described below. We welcome and encourage your feedback on it, but caution you that it may undergo major changes in future versions.

Some of the changes and new features in Philter 1.7.0 are described below. Refer to the Release History for a full list of changes.

Automatically Splitting Input Text

Philter 1.7.0 brings a new experimental feature that breaks long input text up into pieces and processes each piece individually. After processing, Philter combines the individual results into a single response back to the client. The purpose of this feature is to allow Philter to better handle long input text.

What is a "long" input text can depend on several factors, such as the hardware running Philter, the network, and the density of sensitive information in the text. Because of this, you have some control over how Philter breaks long text into separate pieces. You can choose between two methods of splitting. The first method splits the text based on the locations of new line characters in the text. The second method splits the text into individual lines of nearly equal length.
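To make the two splitting methods concrete, here is a minimal sketch in Python. It is not Philter's implementation; the function names are made up for illustration, and the equal-length split is a naive character-based version that ignores word boundaries.

```python
def split_on_newlines(text: str) -> list[str]:
    """Split at new line characters, dropping pieces that are empty or whitespace only."""
    return [piece for piece in text.split("\n") if piece.strip()]


def split_into_equal_lengths(text: str, pieces: int) -> list[str]:
    """Split text into `pieces` chunks whose lengths differ by at most one character."""
    if pieces < 1:
        raise ValueError("pieces must be at least 1")
    base, extra = divmod(len(text), pieces)
    chunks = []
    start = 0
    for index in range(pieces):
        # The first `extra` chunks each take one additional character.
        end = start + base + (1 if index < extra else 0)
        chunks.append(text[start:end])
        start = end
    return chunks
```

A naive equal-length split like this one can cut a name or an SSN in half, so a real splitter would likely move each cut to the nearest whitespace. That is one reason the choice of splitting method matters.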

The alternative to allowing Philter to split the text is to split the text yourself client side prior to sending the text to Philter. When doing the split client side you have full control over how the text is split. On the flip side, you also have to handle the individual response for each split, something Philter handles for you when you delegate the splitting to Philter.

Input text splitting is enabled and configured in filter profiles. Because splitting is set per filter profile, some text can be split and other text left whole, depending on the filter profile chosen for the text.

See Philter's User's Guide for how to configure splitting in a filter profile.

If you use this feature please send us feedback. We are looking to improve it for future versions and value your feedback. Please see the User's Guide for more details.

Reporting Metrics via Prometheus

Philter already supported metrics reporting via JMX, Amazon CloudWatch, and Datadog. In Philter 1.7.0 we added support for monitoring Philter's metrics via Prometheus. When enabled, Philter will expose an HTTP endpoint suitable for scraping by Prometheus. See Philter's Settings for details on how to enable the Prometheus metrics. Look for a separate blog post soon that dives into monitoring Philter's metrics with Prometheus.

Smaller AWS EBS Volume

The EBS volume size for Philter 1.7.0 has been reduced from 20 GB to 8 GB. Because Philter now requires only a smaller SSD volume, the monthly cost of each instance drops by $1.20. This saving may seem trivial, but it adds up when multiple Philter instances are deployed.
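The arithmetic behind that figure can be sketched as follows. The rate of $0.10 per GB per month is an assumption inferred from the $1.20 saving on 12 GB; actual EBS pricing varies by region and volume type.

```python
from decimal import Decimal

# Assumed rate, implied by the $1.20 figure above (12 GB x $0.10). Not an official price.
PRICE_PER_GB_MONTH = Decimal("0.10")


def monthly_savings(old_gb: int, new_gb: int, instances: int = 1) -> Decimal:
    """Monthly storage savings in dollars across a number of instances."""
    return (old_gb - new_gb) * PRICE_PER_GB_MONTH * instances
```

For a single instance, `monthly_savings(20, 8)` gives the $1.20 quoted above, and the total scales linearly with the number of instances.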

Other Changes

Other new features in Philter 1.7.0 include:

  • Terms can now be ignored based on regular expression patterns. Previously Philter had the ability to ignore specified terms but the terms had to match exactly. Now you can specify terms to ignore via regular expression patterns. An example use of this new feature is to ignore non-sensitive information that can change such as timestamps in log messages.
  • Added ability to read ignored terms from files outside of the filter profile.
  • Custom dictionary terms can now be phrases or multi-term keywords.
  • Added “classification” condition to Identifier filter to allow for writing conditionals against the classification value.
  • Added configurable timeout values to allow for modifying timeouts of internal Philter communication. This can help when processing larger amounts of text. See the Settings for more information.
  • Added option to IBAN Code filter to allow spaces in the IBAN codes.
  • Ignore lists for individual filters are no longer case-sensitive. (“John” will be ignored for “JOHN.”)
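As a rough illustration of the first and last items in this list, the sketch below ignores a term either because it matches an ignored term regardless of case or because it matches an ignore pattern. The pattern and function name are hypothetical, and this is not how Philter implements the feature internally.

```python
import re

# Hypothetical ignore pattern: a log timestamp such as 08:43:10.066.
IGNORED_PATTERNS = [re.compile(r"\d{2}:\d{2}:\d{2}\.\d{3}")]


def is_ignored(term: str, exact_terms: set[str], patterns: list[re.Pattern]) -> bool:
    """Return True if the term should be ignored rather than treated as sensitive."""
    # Ignore lists are not case-sensitive, so compare lowercased values.
    if term.lower() in {ignored.lower() for ignored in exact_terms}:
        return True
    # Otherwise the whole term must match one of the regular expressions.
    return any(pattern.fullmatch(term) for pattern in patterns)
```

With this sketch, a timestamp that changes on every log line is ignored by the pattern, while "JOHN" is ignored because "John" is in the exact list.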

Protecting Sensitive Information in Streaming Platforms

Streaming platforms like Apache Kafka and Apache Pulsar provide wonderful capabilities for ingesting data. With these platforms we can build all types of solutions across many industries, from healthcare to IoT and everything in between. Inevitably, the problem arises of how to deal with sensitive information that resides in the streaming data. How do we make sure that data never crosses a boundary? How do we keep that data safe? How can we remove the sensitive information from the incoming data so we can continue processing it? These are all very good questions to ask, and in this post we present a couple of architectures that address them. Used along with Philter, these architectures can help protect the sensitive information in your streaming data.

Whether you are using Apache Kafka, Apache Pulsar, or some other streaming platform is largely irrelevant. Each of these platforms is built on top of the same concepts, and they even share quite a bit of terminology, such as brokers and topics. (A broker is a single instance of Kafka or Pulsar and a topic is how the streaming data is organized when it reaches the broker.)

Streaming Healthcare Data

Let's assume you have a three-broker installation of Apache Kafka that accepts streaming data from a hospital. This data contains patient information, which includes PII and PHI. An external system publishes data to your Apache Kafka brokers. The brokers receive the data and store it in topics, and a downstream system consumes from Apache Kafka, computes statistics by analyzing the text, and persists the results of the analysis into a database. Even though this is a hypothetical scenario, it is an extremely common deployment architecture for distributed and streaming technologies.

Now you ask yourself the questions we mentioned previously. How do you keep the PII and PHI in your streaming data secure? Your downstream processor does not care about the PII and PHI since it is only aggregating statistics. Having downstream systems process data containing PII and PHI puts your system at risk of inadvertent HIPAA violations by enlarging the perimeter of the system that contains PII and PHI. Removing the PII and PHI from the streaming data before it is consumed by the downstream processor would help keep the data safe and your system in compliance.

Philter finds and removes sensitive information from text. Learn more about Philter.

Filtering the Sensitive Information from the Streaming Data

There are a couple of things we can do to remove the PII and PHI from the streaming data before it gets to the downstream processor.

The first option is to use Apache Kafka Streams or an Apache Pulsar Function (depending on which you are running) to consume the data, filter out the PII and PHI, and publish the filtered text back to Kafka or Pulsar on a different topic. Then update the name of the topic the downstream processor consumes from. The raw data from the hospital containing PII and PHI will stay in its own topic. You can use Apache Kafka ACLs on the topics to help prevent someone from inadvertently consuming from the raw topic and only permit them to consume from the filtered topic. If, however, the idea of the raw data containing PII and PHI existing on the brokers is a concern, then continue on to option two below.
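The shape of this first option can be sketched without a real broker. In the Python below, a plain dictionary of lists stands in for the topics and `stand_in_redact` stands in for the call to Philter. Both are assumptions made purely for illustration and are not Kafka, Pulsar, or Philter APIs.

```python
from typing import Callable


def filter_topic(
    topics: dict[str, list[str]],
    source: str,
    destination: str,
    redact: Callable[[str], str],
) -> None:
    """Consume every message from `source`, redact it, and publish it to `destination`."""
    if source == destination:
        raise ValueError("the filtered topic must differ from the raw topic")
    for message in topics[source]:
        topics.setdefault(destination, []).append(redact(message))


def stand_in_redact(text: str) -> str:
    """Hypothetical placeholder for sending the text to Philter."""
    return text.replace("123-45-6789", "{{{REDACTED-ssn}}}")
```

The raw topic is left untouched and only the filtered topic receives redacted text, which is why the downstream processor has to be pointed at the new topic name.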

The second option is to utilize a second Apache Kafka or Apache Pulsar cluster. Place this cluster in between the existing cluster and the downstream processor. Create an application to consume from the topic on the first brokers, remove the PII and PHI, and then publish the filtered data to a topic on the new brokers. (You can use something like Apache Flink to process the data. At the time of writing, Kafka Streams cannot be used because the source brokers and the destination brokers must be the same.) In this option, the sensitive data is physically separated from the rest of the data by residing on its own brokers.

Which option is best for you depends on your requirements around processing and security. In some cases, separate brokers may be overkill. But in other cases it may be the best option due to the physical boundary it creates between the raw data and the filtered data.

Refreshed cloud images

While we continue development on Philter 1.7.0 we have released minor updates to the AWS, Azure, and Google Cloud marketplaces. The only change in these minor updates is a refreshed base image that includes all available operating system updates. There are no changes to the Philter software. In the future we plan to consolidate our refreshed image updates for each cloud to minimize the number of separate versions.

There is no need to update to the minor release version if you already keep the operating system on your existing Philter instances up to date.

You can find the details of all releases in Philter's Release Notes.


New "My Mountain Fog"

We have launched the new "My Mountain Fog" section of our website. The purpose of the relaunch was to make it easier to use and navigate. We simplified the layout and separated sections like license keys and subscriptions into their own pages. You can still log in with Google or create a new login.

Check it out at my.mtnfog.com.


The Performance of Philter

In all of our design and development of Philter, performance is always one of our top priorities. We ask ourselves questions like how can we implement this awesome new feature in Philter without negatively impacting performance? What can we do to improve performance?

However, the word "performance" can have a few different meanings when used in relation to Philter. In this post I want to dive down into the "performance" of Philter and how it impacts Philter's development.

Performance: Efficient processing of text

The first meaning of performance relates to how efficient Philter is when processing text. Philter takes input text, filters the sensitive information, and returns the output text. (That middle step is simplified a lot but hopefully you get the idea.) If any of these steps are not efficient, or performant, Philter won't be usable. Your client applications will time out and you will not want to use Philter.

Any new features or modifications to Philter's filtering capabilities have to be carefully designed and implemented. Even a small, seemingly innocent change can have large negative effects on performance. Because of that we as the developers must be careful and test accordingly. We use efficient data structures and make careful choices to select the operations that will provide the best performance.

This type of performance is typically measured in compute time and that's how we measure it. We have thousands of test cases that we execute with each new build of Philter. Over time we can see a history of the processing time and its downward trend as Philter gets more efficient.

Performance: Labeling the appropriate information as sensitive

The second meaning of performance may sometimes be referred to as accuracy. This meaning relates to how well Philter identified the sensitive information in the input text. Was all the information that Philter identified actually sensitive? Were there any false positives? False negatives? This type of performance is typically measured by a percentage, or by terms from information retrieval such as precision and recall.

In some cases, Philter's identification of sensitive information is non-deterministic, meaning statistics and machine learning algorithms are applied to locate sensitive information. Contrast this with a deterministic process such as looking for terms from a dictionary. How some of Philter's filters identify sensitive information can be controlled through a sensitivity level. Setting the sensitivity to high will likely identify more sensitive information but also produce more false positives. Conversely, setting the sensitivity to low will likely find less sensitive information and produce more false negatives. The sensitivity level of medium aims to bridge this gap. In some cases, a false positive may be more acceptable than a false negative, so a high sensitivity level is used. For the information retrieval folks out there, this is known as maximizing the recall.
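For readers less familiar with those terms, here is a small sketch of how precision and recall are computed from counts of true positives, false positives, and false negatives. The counts in the usage note are made-up numbers, not measurements of Philter.

```python
def precision_recall(
    true_positives: int, false_positives: int, false_negatives: int
) -> tuple[float, float]:
    """Return (precision, recall) for a set of labeling results."""
    labeled = true_positives + false_positives  # everything labeled as sensitive
    actual = true_positives + false_negatives   # everything that really was sensitive
    precision = true_positives / labeled if labeled else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall
```

Suppose a text contains 10 sensitive items. A hypothetical high sensitivity run that finds 9 of them along with 3 false positives scores `precision_recall(9, 3, 1)`, which is a precision of 0.75 and a recall of 0.9. A hypothetical low sensitivity run that finds 6 with no false positives scores `precision_recall(6, 0, 4)`, a precision of 1.0 but a recall of only 0.6.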

For sensitive information like persons' names we offer various models trained for specific domains. The purpose of this is to provide a higher level of accuracy when Philter is used in those domains.

Putting them together

Philter's "performance" is both of these. Philter must perform well in terms of time and processing efficiency as well as finding the appropriate sensitive information. We believe that both types are equally important. A system that takes hours to complete but with more accuracy may be just as unusable as a system that completes in milliseconds but finds no sensitive information.

If you are not yet using Philter to find and remove sensitive information from your text it's easy to get started. Just click on your platform of choice below. And if you need help please don't hesitate to reach out. We enjoy helping.

AWS Marketplace

Jeff Zemerick is the founder of Mountain Fog. He is a 10x certified AWS engineer, current chair of the Apache OpenNLP project, and experienced software engineer.

You can contact Jeff at jeff.zemerick@mtnfog.com or on LinkedIn.

Philter's Custom Dictionary Filter and "Fuzziness"

Philter finds sensitive information in text based on a set of filters that you configure in a filter profile. Some of these filters are for predefined information like SSNs, phone numbers, and names. But sometimes you have a list of terms specific to your use-case that you want to identify, too. Philter's custom dictionary filter lets you specify a list of terms to label as sensitive information when found in your text.

You can learn more about the custom dictionary filter and all of its properties in the Philter User's Guide.

Philter 1.6.0 adds a new property called "fuzzy" to the custom dictionary filter. The "fuzzy" property accepts a value of true or false. When set to false, text being processed must match an item in the dictionary exactly for that text to be labeled as sensitive information. When set to true, the text does not have to match exactly. The "fuzzy" property allows for misspellings and typos to be present and still label the text as being sensitive information. In this blog post we want to dive a little bit more into this to better explain how the "fuzziness" works and is applied and the trade-offs when using it.

Also new in Philter 1.6.0 is the ability to provide the custom dictionary filter a path to a file that contains the terms. This way you don't have to include your terms directly in the filter profile.

Sample Filter Profile

To start, here's a simple filter profile that includes a custom dictionary filter. The dictionary contains three terms (john, jane, doe) and fuzziness is enabled with medium sensitivity. When any of those terms are found, they will be redacted with the pattern {{{REDACTED-%t}}}, where %t is replaced by the type which in this case is custom-dictionary.

{
   "name": "dictionary-example",
   "identifiers": {
      "dictionaries": [
         {
            "customDictionary": {
               "terms": ["john", "jane", "doe"],
               "fuzzy": true,
               "sensitivity": "medium",
               "customDictionaryFilterStrategies": [
                  {
                     "strategy": "REDACT",
                     "redactionFormat": "{{{REDACTED-%t}}}"
                  }
               ]
            }
         }
      ]
   }
}

No fuzziness

We will start by describing what happens when the "fuzzy" property is set to false. This is the default behavior and is consistent with how Philter behaved prior to version 1.6.0. Items in the custom dictionary have to be found in the text exactly as they are in the dictionary. This means "John" is not the same as "Jon."

Disabling fuzziness is more efficient and will provide better performance. That's really all you need to know. But if you like getting into the details of things, read on! Internally, Philter uses an algorithm based on what's known as a bloom filter to efficiently scan a dictionary for matches. A bloom filter "is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set." In this case, the set is your list of terms in the dictionary and an element is each word from the input text. The bloom filter provides an efficient means of determining whether or not a given word is a term in your dictionary that you want to be identified as sensitive information.

A digression into bloom filters

Just to clarify, when we talk about Philter we talk a lot about "filters", such as a filter for SSNs, a filter for phone numbers, and so on. A bloom filter is not a filter like that. A bloom filter is an algorithm that provides an efficient means of asking the question "Does this item potentially exist in this dictionary?" A bloom filter will answer "yes, it might" or "no, it does not." Notice the response of "yes, it might." The bloom filter is not saying "Yes." Instead, it is saying "yes, it might." It's then up to the programmer to find out definitively if that item exists in the dictionary. That's essentially how a bloom filter works.
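If you want to see that behavior in code, below is a minimal, textbook-style bloom filter in Python. It is only a sketch of the general technique and not the algorithm Philter actually uses; the class name, sizes, and hashing scheme are all choices made for this illustration.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits: int = 1024, hash_count: int = 3) -> None:
        self.size_bits = size_bits
        self.hash_count = hash_count
        self.bits = bytearray(size_bits)  # one byte per bit, for clarity over compactness

    def _positions(self, item: str):
        # Derive several positions per item by salting the hash with a seed.
        for seed in range(self.hash_count):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str) -> None:
        for position in self._positions(item):
            self.bits[position] = 1

    def might_contain(self, item: str) -> bool:
        # False means "no, it does not". True only means "yes, it might".
        return all(self.bits[position] for position in self._positions(item))


def is_dictionary_term(word: str, bloom: BloomFilter, terms: set[str]) -> bool:
    """Use the bloom filter as a cheap first check, then confirm definitively."""
    return bloom.might_contain(word) and word in terms
```

Most words in the input text are not dictionary terms, so the cheap "no, it does not" answer lets them be skipped quickly. Only the "yes, it might" answers need the definitive lookup.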

Yes, fuzziness!

Enabling fuzziness on a custom dictionary filter works differently. As Philter scans the input text, it not only considers the words or phrases themselves, but also considers variations of the words and phrases. When fuzziness is enabled, "John" may be the same as "Jon." Enabling fuzziness by setting the "fuzzy" property to true can be useful when you are concerned about misspellings or different spellings of terms in your text.

You can control the level of acceptable fuzziness by setting the "sensitivityLevel" property. Valid values are "low", "medium", and "high." The difference between "Jon" and "John" is considered low while the difference between "Jon" and "Johnny" is considered high. You can use the sensitivityLevel to find an acceptable level of fuzziness appropriate for your custom dictionary and your text. The default sensitivityLevel when not specified is "high."
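One common way to measure the difference between two spellings is edit distance, the number of single-character insertions, deletions, and substitutions needed to turn one word into the other. The sketch below uses it to illustrate the idea of low versus high fuzziness. This post does not say which measure Philter uses, and the thresholds assigned to each level here are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


# Hypothetical mapping from sensitivity level to the largest accepted distance.
MAX_DISTANCE = {"low": 1, "medium": 2, "high": 3}


def fuzzy_match(word: str, term: str, level: str = "high") -> bool:
    return edit_distance(word.lower(), term.lower()) <= MAX_DISTANCE[level]
```

Under these assumed thresholds, "Jon" and "John" are one edit apart and match at every level, while "Jon" and "Johnny" are three edits apart and match only at the high level.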

An important distinction to make is that currently when fuzziness is disabled the custom dictionary can only contain single words. Phrases are not permitted as dictionary terms in Philter 1.6.0 but are allowed in the upcoming version 1.7.0. The internals of that change are interesting enough for their own blog post!

Summary

To summarize:

  • Setting fuzzy to false (the default setting) for the custom dictionary filter will provide better performance but terms in the custom dictionary must match exactly and only words (not phrases) are allowed in the dictionary.
  • Setting fuzzy to true allows the custom dictionary filter to be able to identify misspellings and different spellings of terms in the custom dictionary filter at the cost of performance. Use the sensitivityLevel values of low, medium, and high to control the allowed level of fuzziness.

Not yet using Philter?

Join our users across the healthcare, financial, legal, and other industries in using Philter to find and remove sensitive information from your text. Click on your platform below to get started.

AWS Marketplace

Preventing PII and PHI from Leaking into Application Logs

Introduction

This blog post demonstrates how to use Philter to find and remove sensitive information from application logs. In this post we use log4j and Apache Kafka but the concepts can be applied to virtually any logging framework and streaming system, such as Amazon Kinesis. (See this post for a similar solution using Kinesis Firehose Transformations to filter sensitive information from text.)

There is a sample project for this blog post available here on GitHub that you can copy and adapt. The code demonstrates how to publish log4j log messages to Apache Kafka which is required for the logs to then be consumed and filtered by Philter as described in this post.

PII and PHI in application logs

Development on a system that will contain personally identifiable information (PII) or protected health information (PHI) can be a challenge. Things like proper encryption of data in motion and data at rest and maintaining audit logs have to be considered, just to name a few.

[Figure: Using Philter to find and remove sensitive information in log files using log4j and Apache Kafka.]

One part of the application's development that's easy to disregard is application logging. Logs make our lives easier. They help developers find and fix problems and they give insights into what our applications are doing at any given time. But sometimes even the seemingly most innocuous log messages can be found to contain PII and PHI at runtime. For example, if an application error occurs and an error is logged, some piece of the logged message could inadvertently contain PII and PHI.

Having a well-defined set of rules for logging when developing these applications is an important thing to consider. Developers should always log user IDs and not user names or other unique identifiers. Reviews of pull requests should consider those rules as well to catch any that might have been missed. But even then can we be sure that there will never be any PII or PHI in our applications' logs?

Philter and application logs

To give more confidence you can process your applications' logs with Philter. In this post we are using Java and the log4j framework but nearly all modern programming languages have similar capabilities or have a similar logging framework. If you are developing on .NET you can use log4net and one of the third-party Apache Kafka appenders available through NuGet.

We are going to modify our application's log4j logging configuration to publish the logs to Apache Kafka. Philter will consume the logs from Apache Kafka and the filtered logs will be persisted to disk. So, to implement the code in this post you will need at least one running Apache Kafka broker and an instance of Philter. Some good music never hurts either.

Spin up Kafka in a Docker container:

curl -O https://raw.githubusercontent.com/confluentinc/cp-all-in-one/5.5.1-post/cp-all-in-one/docker-compose.yml
docker-compose up

Spin up Philter in a Docker container (request a free license key):

curl -O https://raw.githubusercontent.com/mtnfog/philter/master/docker-compose.yml
docker-compose up

Create a philter topic:

docker-compose exec broker kafka-topics --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic philter

Now we're good to go.

In a later post we will accomplish the same thing of filtering PII and PHI from application logs but with Apache Pulsar instead of Apache Kafka. Apache Pulsar has the concept of Functions that make manipulating the streaming data with Philter very easy. Stay tuned for that post!

Logging to Apache Kafka

Using log4j to output logs to Apache Kafka is very easy. Through the KafkaAppender we can configure log4j to publish the logs to an Apache Kafka topic. Here's an example of how to configure the appender:

<?xml version="1.0" encoding="UTF-8"?>
...
<Appenders>
  <Kafka name="Kafka" topic="philter">
    <PatternLayout pattern="%date %message"/>
    <Property name="bootstrap.servers">localhost:9092</Property>
  </Kafka>
</Appenders>

With this appender in our log4j2.xml file, our application's logs will be published to the Kafka topic called philter. The Kafka broker is running at localhost:9092. Great! Now we're ready to have Philter consume from the topic to remove PII and PHI from the logs.

Using Philter to remove PII and PHI from the logs

To consume from the Kafka topic we are going to use the kafkacat utility. This is a nifty little tool that makes some Kafka operations really easy from the command line. (You could also use the standard kafka-console-consumer.sh script that comes with Kafka.) The following command consumes from the philter topic and writes the messages to standard out.

kafkacat -b localhost:9092 -t philter

Instead of just writing the messages to standard out, we need to send them to Philter so that any PII or PHI in them can be removed, so we will pipe the text to the Philter CLI. The Philter CLI sends the text to Philter and writes the filtered log messages to standard out.

kafkacat -C -b localhost:9092 -t philter -q -f '%s\n' -c 1 -e | ./philter-cli-linux-amd64 -i -h https://localhost:8080

This command uses kafkacat to consume from our philter topic. In this command we are telling kafkacat to be quiet (-q) and not produce any extraneous output, to format the output by displaying only the message (-f), to consume only a single message (-c), and to exit (-e) after doing so. The result of this command is that a single message is consumed from Kafka and sent to Philter using the philter-cli. The filtered text is then written to the console.

The flow of the messages

Our example project logs 100 messages to the Kafka topic. Each message looks like the following, where 99 is just an incrementing integer:

08:43:10.066 [main] INFO com.mtnfog.App - This is a sample log message 99. Beginning processing for patient 123-45-6789.

The output from Philter after processing the message is:

08:43:10.066 [main] INFO com.mtnfog.App - This is a sample log message 99. Beginning processing for patient {{{REDACTED-ssn}}}.

We can see that Philter identified 123-45-6789 as a social security number and redacted it from the text. This is a simple example, but based on how Philter is configured (specifically its filter profile) we could have been looking for many other types of sensitive information such as persons' names, unique identifiers, ages, dates, and so on.
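To get a feel for what pattern-based redaction does to a log line, here is a small Python sketch that reproduces the transformation above with a regular expression. The pattern and function are assumptions for illustration only; they are not Philter's SSN filter, which is configured through a filter profile.

```python
import re

# Simplified SSN pattern chosen for this sketch: three digits, two digits, four digits.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact_ssns(text: str, redaction_format: str = "{{{REDACTED-%t}}}") -> str:
    """Replace anything that looks like an SSN, filling %t with the filter type."""
    replacement = redaction_format.replace("%t", "ssn")
    # A function is used as the replacement so the text is inserted literally.
    return SSN_PATTERN.sub(lambda _match: replacement, text)
```

Passing the sample log message through `redact_ssns` produces the filtered message shown above, with the timestamp and the rest of the line left unchanged.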

We also just wrote the filtered log to system out. We could have easily redirected it to a file, to another Kafka topic, or to somewhere else. We also used the philter-cli to send the text to Philter. You could have also used curl or one of the Philter SDKs.

Summary

In this blog post we showed how Philter can find and remove sensitive information from application log files using log4j's Kafka appender, Apache Kafka, and kafkacat. We hope this example is useful. A sample project to produce log messages to a Kafka topic is available on GitHub.



Philter 1.6.0 Docker Containers

Philter 1.6.0 Docker containers are now available. To run the containers:

curl -O https://raw.githubusercontent.com/mtnfog/philter/master/docker-compose.yml
docker-compose up

This will spin up the Philter containers. Once running you can send Philter some text for filtering:

curl -k https://localhost:8080/api/explain --data "George Washington was president and his ssn was 123-45-6789." -H "Content-type: text/plain"
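If you would rather make the same request from code than from curl, the following Python sketch builds the equivalent request with only the standard library. The URL, endpoint, and content type come from the curl command above; the function names are made up for this example, and the format of the response is not assumed here.

```python
import ssl
import urllib.request


def build_explain_request(
    text: str, base_url: str = "https://localhost:8080"
) -> urllib.request.Request:
    """Build a POST request equivalent to the curl command above."""
    return urllib.request.Request(
        url=f"{base_url}/api/explain",
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the raw response body as text."""
    # Equivalent of curl's -k flag: skip certificate verification.
    # Only appropriate for a local test container with a self-signed certificate.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, context=context) as response:
        return response.read().decode("utf-8")
```

With the containers running, `send(build_explain_request("George Washington was president and his ssn was 123-45-6789."))` should return the same response that curl prints.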

If you're familiar with running Philter containers then you'll notice these instructions have not changed! But there is something new (and exciting!). Philter can now be configured using environment variables instead of having to manage the application.properties file inside of the container. We can modify the docker-compose file to include properties as environment variables. Here's an abbreviated example:

version: '3'
services:
  philter:
    depends_on:
      - "redis"
      - "philter-ner"
    environment:
      - PHILTER_CACHE_REDIS_ENABLED=true
      - PHILTER_CACHE_REDIS_HOST=redis
      - PHILTER_CACHE_REDIS_PORT=6379
      - PHILTER_CACHE_REDIS_AUTH_TOKEN=randompassword
      - PHILTER_LICENSE_KEY=
      - PHILTER_NER_ENDPOINT=http://philter-ner:18080/
...

In this snippet we have added a few environment variables to configure a Redis cache for Philter. That's it! We're excited about this new capability and hope that it makes Philter much easier to configure in your container environments.

Any of the Philter settings can be set as environment variables. Just prepend PHILTER_ to the property name, change periods to underscores, and use uppercase letters. For example, the property that enables span disambiguation is span.disambiguation.enabled; to set it as an environment variable, use PHILTER_SPAN_DISAMBIGUATION_ENABLED.
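The naming rule is simple enough to express as a one-line helper. The function below is a hypothetical convenience for generating variable names, not something shipped with Philter, and the uppercasing is inferred from the examples in this post.

```python
def to_env_var(property_name: str) -> str:
    """Convert a Philter property name to its environment variable name."""
    return "PHILTER_" + property_name.replace(".", "_").upper()
```

For example, `to_env_var("span.disambiguation.enabled")` returns `PHILTER_SPAN_DISAMBIGUATION_ENABLED`.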


How We Train Philter's NLP Models

In this post I want to give some insight into how we create and train the NLP (natural language processing) models that Philter uses to identify entities like persons' names in text.

Read this first :)

As a user of Philter you don't need to understand or even be aware of how we train Philter's NLP models. But it is helpful to know that Philter's NLP model can be changed based on your domain. For example, we offer some models trained specifically for the healthcare domain. These models were trained to give better performance when using Philter in a healthcare environment. See the bottom of this post for a list of the currently available NLP models for Philter.

What is NLP?

Some sensitive information can be identified by Philter based on patterns (SSNs) or dictionaries. A person's name doesn't follow a pattern, and while it may be found in a dictionary, there isn't any guarantee your dictionary will contain all possible names. To identify persons' names we rely on a set of techniques collectively known as natural language processing, or NLP.

NLP is a broad term used to describe many types of methods and technologies used to extract information from unstructured, or natural language, text. Some foundational common NLP tasks are to identify the language of some given text and to label the words in a sentence with their parts-of-speech types. More advanced tasks include named-entity recognition, summarizing text passages in a few sentences, translating text from one language to another, and determining the sentiment of a given text. It's a very exciting time in NLP due to lots of recent advancements in neural networks, GPU hardware, and just an explosion in the number of researchers and practitioners in the NLP community.

How does NLP work?

NLP tasks often require a trained model to operate. For instance, language translation requires a model that is able to take words and phrases in one language and produce them in another language. The model is trained on sets of text with identical content in both languages. How the words and phrases are used helps the model determine how the text should be translated. Identifying persons' names in text also requires a trained model. Training this type of model requires text that has been annotated, meaning that the entities have been labeled. The algorithms will use these labels to train the model to identify names in the future. An example of an annotated sentence:

{person}George Washington{/person} was president.

There are different annotation formats created for different purposes but I'm sure you get the idea. With annotated text we can train our model to know what a person's name looks like when the model is applied to unlabeled text. That's essentially all there is to it.
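To make the annotation idea concrete, here is a minimal sketch in Python that turns the sentence above into plain text plus labeled character offsets, which is roughly the shape of data a training tool consumes. The {label}...{/label} markup is only the illustrative format used in this post, and the function name parse_annotated is our own invention, not part of any framework.

```python
import re

# Matches the illustrative {label}...{/label} markup used in the sentence above.
TAG = re.compile(r"\{(\w+)\}(.*?)\{/\1\}")

def parse_annotated(sentence):
    """Return (plain_text, spans); spans are (start, end, label) offsets into plain_text."""
    plain = []
    spans = []
    pos = 0      # read position in the annotated sentence
    out_len = 0  # length of the plain text built so far
    for m in TAG.finditer(sentence):
        before = sentence[pos:m.start()]
        plain.append(before)
        out_len += len(before)
        label, entity = m.group(1), m.group(2)
        spans.append((out_len, out_len + len(entity), label))
        plain.append(entity)
        out_len += len(entity)
        pos = m.end()
    plain.append(sentence[pos:])
    return "".join(plain), spans

text, spans = parse_annotated("{person}George Washington{/person} was president.")
print(text)   # George Washington was president.
print(spans)  # [(0, 17, 'person')]
```

Real annotation formats differ from tool to tool, so treat this as a picture of the idea rather than a converter you can feed to a specific framework.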

There are lots of fantastic open-source tools with active user communities for natural language processing. If you are interested in learning the nuts and bolts of NLP, choose a framework in your preferred programming language to lower the learning curve and dive in! The techniques and terminology learned from using one framework will generally apply to a different framework, even one in a different programming language, so you aren't at much risk of lock-in.

How We Train Philter's NLP Models

As described above, training our model requires annotated text. We have annotated text for various domains. We use this annotated text, along with a set of word embeddings, a few GPUs, and some time, to train the models for Philter. The output of the training is a file which contains the model. The model can then be used by Philter to identify person's names in text.

Evaluating a Model's Performance

To get an idea of how our model will perform we use two common metrics called precision and recall. These metrics tell us how well the model is performing on our test data. We don't need to get into the details of precision and recall here. However, one important thing we want you to know is that we will often try to maximize the recall value when training the model. Maximizing recall means it is better to label some text as a person's name even if it is not than to risk not labeling a person's name. When dealing with sensitive information in text it can be advantageous to err on the side of caution rather than risk leaving a person's name unfiltered. Restated, maximizing recall means false positives are more acceptable than false negatives.
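For readers who do want the definitions, here is a short sketch of the standard formulas. Precision is the share of labeled spans that really are names, and recall is the share of real names that were labeled. The counts in the example are made up for illustration and are not results from any Philter model.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Compute precision and recall from raw counts on a test set."""
    predicted = true_positives + false_positives   # everything the model labeled
    actual = true_positives + false_negatives      # everything it should have labeled
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall

# Hypothetical counts: 100 spans labeled as names, 90 of them correct, 5 real names missed.
precision, recall = precision_recall(true_positives=90, false_positives=10, false_negatives=5)
print(round(precision, 3), round(recall, 3))  # 0.9 0.947
```

A model tuned for recall accepts a lower precision (more false positives) in exchange for fewer missed names.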

Currently Available Models for Philter

Once we are satisfied with the model's performance we publish it and make it available on our website, where you can find the list of models we have so far.

We have models for general usage and models more specialized for specific domains such as healthcare. We are continuously training and updating our models to keep them current and improve their performance. The model included with Philter is a general usage model.

To stay up to date on model updates please follow us on Twitter or subscribe to our very low volume newsletter.



Jeff Zemerick is the founder of Mountain Fog. He is a 10x certified AWS engineer, current chair of the Apache OpenNLP project, and experienced software engineer.

You can contact Jeff at jeff.zemerick@mtnfog.com or on LinkedIn.

Philter 1.6.0

Philter 1.6.0 will be available soon through the cloud marketplaces and DockerHub. This is probably the most significant release of Philter since the first release, 1.0.0.

Version 1.6.0 has many new features and a few fixes. Instead of writing a single blog post for the entire release we are going to write a few separate blog posts on the significant new features. We highlight the new features below in this post and will follow up over the next few days with posts that go more in-depth on each of them. Check out Philter's Release Notes.

Over the next few days we will be making updates to the Philter SDKs to accommodate the new features in Philter 1.6.0.

New Features in Philter 1.6.0

The following are summaries of the new features added in Philter 1.6.0.

Alerts

The new alerts feature in Philter 1.6.0 lets you have Philter generate an alert when a given filter condition is satisfied. For example, if you have a filter condition that only matches a person's name of "John Smith", Philter will generate an alert each time this condition is satisfied. The alert is stored in Philter and can be retrieved and deleted using Philter's new Alerts API. Details of the alerts are in Philter's User's Guide.

Span Disambiguation

Sometimes a piece of sensitive information could be one of a few filter types, such as an SSN, a phone number, or a driver's license number. The span disambiguation feature works to determine which of the potential filter types is most appropriate by analyzing the context of the sensitive information. Philter uses various natural language processing (NLP) techniques to determine which filter type the sensitive information most closely resembles. Because of the techniques used, the more text Philter sees the more accurate the span disambiguation will become.
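To give a feel for what "analyzing the context" can mean, here is a deliberately simple sketch that scores each candidate filter type by counting cue words near the span. This is not how Philter implements span disambiguation. The cue words, the type labels, and the function name disambiguate are all assumptions made up for this illustration.

```python
import re

# Hypothetical cue words for each filter type; a real system would learn these from data.
CONTEXT_CUES = {
    "ssn": {"ssn", "social", "security"},
    "phone-number": {"phone", "call", "tel", "fax"},
    "drivers-license": {"license", "driver", "dmv"},
}

def disambiguate(text, start, end, window=40):
    """Pick the type whose cue words appear most often near text[start:end]."""
    context = text[max(0, start - window):start] + " " + text[end:end + window]
    words = set(re.findall(r"[a-z]+", context.lower()))
    scores = {label: len(words & cues) for label, cues in CONTEXT_CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None  # None: the context gave no hint

text = "Please call me at 123456789 after lunch."
start = text.index("123456789")
print(disambiguate(text, start, start + 9))  # phone-number
```

A fixed keyword list like this one cannot improve with more text, which is one reason a statistical approach is a better fit for the real feature.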

Span disambiguation is documented in Philter's User's Guide.

New Filters: Bitcoin Address, IBAN Codes, US Passport Numbers, US Driver's License Numbers

Philter 1.6.0 contains several new filter types:

  • Bitcoin Address - Identify bitcoin addresses.
  • IBAN Codes - Identify International Bank Account Numbers.
  • US Passport Numbers - Identify US passport numbers issued since 1981.
  • US Driver's License Numbers - Identify US driver's license numbers for all 50 states.

Each of these new filters is available through filter profiles.

New Replacement Strategy: SHA-256 with random salt values

We previously added the ability to encrypt sensitive information in text. In Philter 1.6.0 we have added the ability to hash sensitive information using SHA-256. When the hash replacement strategy is selected, each piece of sensitive text will be replaced by the SHA-256 value of the sensitive text. Additionally, the hash replacement strategy has a "salt" property that when enabled will cause Philter to append a random salt value to each piece of sensitive text prior to hashing. The random salt value will be included in the filter response.
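The idea can be sketched with Python's standard library. This is an illustration of salted SHA-256 hashing in general, not Philter's code. The function name hash_replacement, the use_salt flag, and the 16-byte hex salt are assumptions we picked for the example.

```python
import hashlib
import secrets

def hash_replacement(sensitive_text, use_salt=False):
    """Return (hex_digest, salt); salt is an empty string when salting is off."""
    # Assumption: a 16-byte random salt, hex-encoded, appended to the text before hashing.
    salt = secrets.token_hex(16) if use_salt else ""
    digest = hashlib.sha256((sensitive_text + salt).encode("utf-8")).hexdigest()
    return digest, salt

unsalted, _ = hash_replacement("123-45-6789")
salted, salt = hash_replacement("123-45-6789", use_salt=True)
print(len(unsalted), unsalted != salted)  # 64 True
```

Without a salt, the same SSN always hashes to the same value, which makes values comparable across documents but also guessable from a list of candidates. With a random salt, the digest can only be reproduced by someone who also has the salt, which is why returning the salt in the response matters.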

Custom Dictionary Filters Can Now Use an External Dictionary File

Philter's custom dictionary filter lets you specify a list of terms to identify as being sensitive. Prior to Philter 1.6.0, this list of terms had to be provided in the filter profile. With a long list it did not take long for the filter profile to become hard to read and even harder to manage. Now, instead of providing a list of terms in the filter profile you can simply provide the full path to a file that contains a list of terms. This keeps the filter profile compact and easier to manage. You can specify as many dictionary files as you need to and Philter will combine the terms when the filter profile is loaded.

Custom Dictionary Filters Now Have a "fuzzy" Property

Philter's custom dictionary filter previously always used fuzzy detection. (Fuzzy detection is like a spell checker - a misspelled name such as "Davd" can be identified as "David.") New in Philter 1.6.0 is a property on the custom dictionary filter called "fuzzy." This property controls whether or not fuzzy detection is enabled. This property was added because when fuzzy detection is not needed you can get a significant performance increase. When it is not enabled, Philter uses an optimized data structure to identify the terms. If fuzzy detection is not needed we do recommend disabling it to take advantage of the performance gain.
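The trade-off can be illustrated in a few lines of Python. Exact lookups can use a hash set, while fuzzy lookups have to compare the word against candidate terms. Philter's actual data structure and matching algorithm are not described here, so the set, the difflib-based matching, and the 0.8 cutoff below are stand-ins chosen for the example.

```python
import difflib

TERMS = ["David", "Sarah", "Jennifer"]   # hypothetical dictionary terms
EXACT_TERMS = set(TERMS)                 # hash set: fast exact lookups

def find_term(word, fuzzy=False):
    """Return the matching dictionary term, or None if there is no match."""
    if word in EXACT_TERMS:
        return word
    if fuzzy:
        # Slower path: similarity comparison against every term in the dictionary.
        matches = difflib.get_close_matches(word, TERMS, n=1, cutoff=0.8)
        return matches[0] if matches else None
    return None

print(find_term("Davd"))              # None
print(find_term("Davd", fuzzy=True))  # David
```

The exact path does one set lookup per word no matter how long the dictionary is, while the fuzzy path grows with the dictionary size.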

Changed "Type" to "Classification"

A few filter types had additional information that provided further description of the sensitive information. For instance, the entity filter had a type that identified the "type" of the entity such as "PER" for person. We have changed the property "type" to "classification" for clarity and uniformity. Be sure to update your filter profiles if you have any filter conditions that use "type" to use "classification" instead. It is a drop-in replacement and you can simply change "type" to "classification."

Add Filter Condition for "Classification"

Philter 1.6.0 adds the ability to have a filter condition on "classification."

Redis Cache Can Now Use a Self-Signed SSL Certificate

Philter 1.6.0 can now connect to a Redis cache that is using a self-signed certificate. New configuration settings for the truststore and keystore allow for trusting the self-signed certificate.

Fixes and Improvements in Philter 1.6.0

The following is a list of fixes and improvements made in Philter 1.6.0.

Fixed Potential MAC Address Issue

We found and fixed a potential issue where a MAC Address might not be identified correctly.

Fixed Potential Ignore Issue with Custom Dictionary Filters

We found and fixed a potential issue where a term in a custom dictionary that is also a term in an ignore list might not be ignored correctly.

Fixed Potential Issue with Credit Card Number Validation

We found and fixed a potential issue where a credit card number might not be validated correctly. This only applies when credit card validation is enabled.



Philter - A Real-World Use-Case

Philter finds, identifies, and removes sensitive information from text. That's a very good and short description of Philter, but, as they say, a picture is worth a thousand words. In this post we will detail an actual, real-world use-case of Philter as we paint a picture with words!

"Super Helpdesk"

The Philter customer, we'll call them Super Helpdesk, is a provider of a software-as-a-service helpdesk solution. Their customers sign up to be able to offer a helpdesk to their own customers. (Following? :) Super Helpdesk's users need the ability to optionally prevent sensitive information from being passed through in tickets. If a customer enters something sensitive they want to remove it from the ticket before the ticket enters the workflow.

In this case, the sensitive information Super Helpdesk is most worried about is credit card numbers. Due to security best practices and regulations like PCI-DSS, credit card numbers cannot exist in helpdesk tickets where they may be stored or transmitted unencrypted. Super Helpdesk needed a way to analyze the tickets entering their system in order to filter out the credit card numbers from the tickets.

The Solution

At a high-level, Super Helpdesk deployed Philter (in this case running on EC2 in AWS) to perform the filtering of the content of the helpdesk tickets. As new helpdesk tickets are submitted, the content of the ticket is sent to Philter and Philter immediately returns the content of the ticket with the credit card numbers redacted to just the last four digits. (Super Helpdesk also added an option for their users to control how Philter redacts the credit card numbers, with the available options being redact all or redact all but the last four digits.)
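To picture the two redaction options, here is a small standalone sketch in Python. It is not what runs inside Philter. The pattern is a simplified stand-in for real credit card detection, and the function name redact_cards and the keep_last_four flag are hypothetical.

```python
import re

# Simplified pattern: 13 to 16 digits, optionally separated by single spaces or dashes.
CARD = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")

def redact_cards(text, keep_last_four=True):
    """Mask card numbers, optionally leaving the last four digits visible."""
    def mask(match):
        digits = re.sub(r"\D", "", match.group())  # drop spaces and dashes
        if keep_last_four:
            return "*" * (len(digits) - 4) + digits[-4:]
        return "*" * len(digits)
    return CARD.sub(mask, text)

print(redact_cards("My card is 4111 1111 1111 1111, please fix my order."))
# My card is ************1111, please fix my order.
```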

Now for the low-level implementation details! When new helpdesk tickets come in they are published to an Apache Kafka topic. A process consumes from the topic, does processing on the ticket, and ultimately inserts the ticket into a backend database. This process, written in Java, was modified to make use of the Philter Java SDK to enable the communication between the process and Philter.

We have found this to actually be a very common Extract-Transform-Load (ETL) design scenario across industries. Data in the form of text flows from an external system through a pipeline facilitated by Apache Kafka or Amazon Kinesis Firehose into an internal database. Along the way the data needs to be manipulated in some manner. In our case the data manipulation is to remove sensitive information from the text. Philter's API allows it to slide nearly seamlessly into the existing pipeline. Like Super Helpdesk did, just insert a step to send the text to Philter for filtering.
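Super Helpdesk used the Philter Java SDK, but the inserted step is just an HTTP call, so it can be sketched in any language. The Python below posts text to the /api/filter endpoint with a text/plain content type, the same request shape as the curl examples elsewhere on this blog. The host, port, ticket structure, and function names are assumptions for illustration, and we assume the response body is the filtered text.

```python
import urllib.request

def build_filter_request(base_url, text):
    """Build a POST to /api/filter carrying the raw text as text/plain."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/filter",
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )

def filter_text(base_url, text, timeout=10):
    """Send text to Philter and return the response body as a string."""
    request = build_filter_request(base_url, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")

def process_ticket(ticket, base_url="http://localhost:8080"):
    """Pipeline step: replace the ticket body with its filtered version before storage."""
    # Hypothetical ticket shape: a dict with a "body" field holding the ticket text.
    return {**ticket, "body": filter_text(base_url, ticket["body"])}
```

In a Kafka consumer, process_ticket would be called on each message between reading it from the topic and inserting it into the database.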

We made a previous blog post about using Philter inside of an AWS Kinesis Firehose using a Firehose Transformation. It describes how to make a Lambda function to invoke Philter on the text going through the pipeline to filter the text. Check it out at the link below.

Using AWS Kinesis Firehose Transformations to Filter Sensitive Information from Streaming Text

But, wait, why Philter?

You are probably saying, well, that seems like overkill for a simple problem like redacting credit card numbers! Credit card numbers follow a well-defined pattern so why not just use a regular expression to find them? If all you want to do is find credit card numbers then a regular expression may well work.

So what does using Philter give us? A good bit actually. Through the use of filter profiles, Philter can have a pre-set list of types of sensitive information. Each type of sensitive information can have its own redaction logic. For example, you could redact VISA card numbers while truncating AMEX card numbers. Or, you could only leave the last four digits of card numbers matching a condition. Additionally, each customer of the helpdesk platform may have different requirements around sensitive information. That logic can also be encapsulated in filter profiles. The regular expression logic just got more complicated.

Philter provides other features as well, such as the ability to capture metrics on the data, ability to encrypt the credit card numbers instead of removing them, and the ability to disambiguate between different types of sensitive information.

Lastly, a regular expression will never be able to find non-deterministic types of sensitive information like person's names. Philter's natural language processing (NLP) capabilities are able to find entities like person's names that do not follow any set pattern.

Try Philter

Deploying Philter to AWS, Azure, or GCP is easy because Philter is available through each of the cloud's marketplaces. Simply follow the marketplace steps to launch an instance of Philter in your private cloud.

Share your experience!

We would love to hear how you are using Philter. Share your experience with us!



Challenges in Finding Sensitive Information and Content in Text

Finding sensitive information and content in text has been a problem for as long as text has existed. But in the past few years due to the availability of cheaper data storage and streaming systems, finding sensitive information in text has become nearly a universal need across all industries. Today, systems that process streaming text often need to filter out any information considered sensitive directly in the pipeline to ensure the downstream applications have immediate access to the sanitized text. Streaming platforms are commonly used in industries such as healthcare and banking where the data can contain large amounts of sensitive information.

What is "sensitive information"?

Taking a step back, what is sensitive information? Sensitive information is simply any information that you or your organization deems as being sensitive. There are some global types of sensitive information such as personally identifiable information (PII) and protected health information (PHI). These types of sensitive information, among others, are typically regulated in how the information must be stored, transmitted, or used. But it is common for other types of information to be sensitive for your organization. This could be a list of terms, phrases, locations, or other information important to your organization. Simply put, if you consider it sensitive then it is sensitive.

Structured vs. Unstructured

It's important to note we are talking about unstructured, natural language text. Text in structured formats like XML or JSON is typically simpler to manipulate due to the inherent structure of the text. But in unstructured text we don't have the convenience of being told what is a "person's name" like an XML tag <personName> would do for us. There are generally three ways to find sensitive information in unstructured text.

Three Methods of Finding Sensitive Information in Text

The first method is to look for sensitive information that follows well-defined patterns. This is information like US social security numbers and phone numbers. Even though regular expressions are not a lot of fun, we can easily enough write regular expressions to match social security numbers and phone numbers. Once we have the regular expressions it's straightforward to apply the regular expression to the input text to find pieces of the text matching the patterns.
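As a quick illustration of the pattern-based method, here is a sketch in Python. The two expressions are simplified examples that cover only the most common US formats, and the function name find_patterns is our own.

```python
import re

# Simplified, illustrative patterns; real-world formats vary more than this.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PHONE = re.compile(r"\(?\b\d{3}\)?[ .-]\d{3}[ .-]\d{4}\b")

def find_patterns(text):
    """Return (type, start, end, matched_text) tuples ordered by position."""
    found = [("ssn", m.start(), m.end(), m.group()) for m in SSN.finditer(text)]
    found += [("phone", m.start(), m.end(), m.group()) for m in PHONE.finditer(text)]
    return sorted(found, key=lambda item: item[1])

for item in find_patterns("His SSN was 123-45-6789 and his phone is 304-555-1212."):
    print(item[0], item[3])
# ssn 123-45-6789
# phone 304-555-1212
```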

The second method is to look for sensitive information that can be found in a dictionary or database. This method works well for geographic locations and for information that you might have stored in your database or spreadsheet, such as a column of person's names. Once the list is accessible, again, it is fairly straightforward to look for those items in the text.

The third, and last, method is to employ the techniques of natural language processing (NLP). The technology and tools provided by the NLP ecosystem give us powerful ways to analyze unstructured text. We can use NLP to find sensitive information that does not follow well-defined patterns or is not referenced in a database column or spreadsheet, such as person's or organization's names. The past few years have seen remarkable advancements in NLP allowing these techniques to be able to analyze the text with great success.

Deterministic and Non-deterministic

The first two methods are deterministic. Finding text that matches a pattern and finding text contained in a dictionary is a pass/fail scenario - you either find the text you are looking for or you do not. The third method, NLP, is not deterministic. NLP uses trained models to be able to analyze the text. When an NLP method finds information in text it will have an associated confidence value that tells us just how sure the algorithms are that the associated information is what we are looking for.
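One practical consequence is that results from the NLP method can be filtered by confidence, while pattern and dictionary matches are simply present or absent. The sketch below uses made-up spans and a made-up threshold to show the idea. The field names are assumptions, not Philter's response format.

```python
# Hypothetical spans: deterministic matches carry confidence 1.0,
# while NLP spans carry the score reported by the model.
spans = [
    {"text": "123-45-6789", "type": "ssn", "confidence": 1.0},
    {"text": "George Washington", "type": "person", "confidence": 0.97},
    {"text": "May", "type": "person", "confidence": 0.35},
]

def select_spans(spans, threshold=0.5):
    """Return the spans whose confidence meets the threshold."""
    return [s for s in spans if s["confidence"] >= threshold]

print([s["text"] for s in select_spans(spans)])
# ['123-45-6789', 'George Washington']
```

Lowering the threshold keeps more uncertain spans, which means fewer missed names at the cost of more false positives.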

Introducing Philter

Philter is our software product that implements these three methods of identifying sensitive information in text. Philter supports finding, identifying, and removing sensitive information in text. You set the types of information you consider sensitive and then send text to Philter. The filtered text without the sensitive information is returned to you. With Philter you have full control over how the sensitive information is manipulated - you can redact it, replace it with random values, encrypt it, and more.

Often, different types of sensitive information can follow the same pattern. For example, a US social security number is a 9 digit number. Many driver's license numbers can also be 9 digit numbers. Philter can disambiguate between a social security number and a driver's license number based on how the number is used in the text. When using a dictionary we can't forget about misspellings. If we simply look for words in the dictionary we may not find a name that has been misspelled. Philter supports fuzzy searching by looking for misspellings when applying a dictionary-based filter.

This isn't nearly all Philter can do, but these are some of its more exciting features to date. Take Philter for a test drive on the cloud of your choice. We'd be happy to walk you through it if you would like!


Philter Docker Containers

We are excited to announce that Philter can now be launched as Docker containers. Previously, Philter was only available through the AWS, Azure, and Google Cloud marketplaces. By making Philter available as containers, Philter can now easily be used outside those cloud platforms, in container orchestration tools such as Kubernetes, and on-premises. Philter finds, identifies, and removes sensitive information such as PHI and PII from natural language text.

Launching the Philter containers is easy:

curl -O https://raw.githubusercontent.com/mtnfog/philter/master/docker-compose.yml
docker-compose up

This will download and run the containers. Once the containers are running you are ready to send filter requests.

curl http://localhost:8080/api/filter --data "George Washington was president and his ssn was 123-45-6789." -H "Content-type: text/plain"

A license key is required to be set for the containers to start and can be requested using the link below.


Philter

Use your own NLP models with Philter

New in Philter 1.5.0 is the ability to use your own custom NLP models with Philter. This feature is available in both the Standard and Enterprise editions.

Philter is able to identify named-entities in text through the use of a trained model. The model is able to identify things in the text, like person's names, that do not follow a well-defined pattern and are not easily referenced in a dictionary. Philter's NLP model is interchangeable and we offer multiple models that you can choose from to better tailor Philter to your use-case and your domain.

However, there are times when using our models may not be sufficient, such as when your use-case does not exactly match our available models or you want to try to get better performance by training a model on text very similar to your input text. In those cases you can train a custom NLP model for use with Philter.

Getting Started

Before diving into the details, here are some examples that illustrate how to make a custom NLP model usable for Philter. Feel free to use these examples as a starting point for your models. Additionally, this capability has been documented in Philter's User's Guide.

The first example in that repository is an implementation of a named-entity recognizer using Apache OpenNLP. The entity recognizer is exposed via two endpoints using a Spring Boot REST controller.

  • The first endpoint /process simply passes the received text to the entity recognizer. The entity recognizer receives the text, extracts the entities, and returns them in a list of PhilterSpan objects.
  • The second endpoint /status simply returns HTTP 200 and the text healthy. In cases where your model may take a bit to load, it would be better to actually check the status of the model instead of just returning that it is healthy. But for this example, the model loading is quick and done before the HTTP service is even available.

Those are the only required endpoints. Philter will use those endpoints to interact with your model.

// Receives plain text from Philter and returns the entities found by the model.
@RequestMapping(value = "/process", method = RequestMethod.POST, consumes = MediaType.TEXT_PLAIN_VALUE, produces = MediaType.APPLICATION_JSON_VALUE)
public @ResponseBody List<PhilterSpan> process(@RequestBody String text) {
    return openNlpNer.extract(text);
}

// Health check: returns HTTP 200 with the body "healthy".
@RequestMapping(value = "/status", method = RequestMethod.GET)
public ResponseEntity<String> status() {
    return new ResponseEntity<>("healthy", HttpStatus.OK);
}

Click here to see the full source file.

The PhilterSpan object contains the details of a single extracted entity. An example response from the /process endpoint for a single entity is shown below.

[
  {
    "text": "George Washington",
    "tag": "person",
    "score": "0.97",
    "start": "0",
    "end": "17"
  }
]

Custom NLP Models

Training Your Own Model

Philter is indifferent to the technologies and methods you choose to train your custom model. You can use any framework you like, such as Apache OpenNLP, spaCy, Stanford CoreNLP, or your own custom framework. Follow the framework's documentation for training a model.

Using a Model You Already Have

If you already have a trained NLP model you can skip the training part and proceed on to making the model accessible to Philter.

Using Your Own Model with Philter

Once your model has been trained and you are satisfied with its performance, to use the model with Philter you must expose the model by implementing a simple HTTP service interface around it. This service facilitates communication between Philter and your model. The service is illustrated below.

Once your model is available behind the HTTP interface described above, you are ready to use the model with Philter. On the Philter virtual machine, simply export the PHILTER_NER_ENDPOINT environment variable to be the location of the running HTTP service. It is recommended you set this environment variable in /etc/environment. If your HTTP service is running on the same host as Philter on port 8888, the environment variable would be set as:

export PHILTER_NER_ENDPOINT=http://localhost:8888/

Now restart the Philter service and stop and disable the philter-ner service.

sudo systemctl restart philter.service
sudo systemctl stop philter-ner.service
sudo systemctl disable philter-ner.service

When a filter profile containing an NER filter is applied, Philter will send requests to your HTTP service, which runs your model's inference and returns the identified named-entities.

Recommendations and Best Practices

You have complete freedom to train your custom NLP model using whatever tools and processes you choose. However, from our experience there are a few things that can help you be successful.

The first recommendation is to contain your service in a Docker container. Doing so gives you a self-contained image that can be deployed and run virtually anywhere. It simplifies dependency management and protects you from dependency version changes.

The second recommendation is to make your HTTP service as lightweight as possible. Avoid any unnecessary code or features that could negatively impact the speed of your model inference.

Lastly, thoroughly evaluate your model prior to deploying the model to Philter to have a better expectation of performance.

Conclusion

Using a custom NLP model with Philter is a fairly straightforward process. Train your model, make it accessible by HTTP, and then deploy the HTTP service and the model such that the service is accessible from Philter.

Do you have to train a custom model to use Philter? Absolutely not. Philter's "out-of-the-box" capabilities are sufficient for most use-cases. Will using your own model give you better performance? Possibly. It depends on how well the model is trained and all of the parameters used. Training your own model can be a difficult and time consuming activity so it's best to have some familiarity with the process before starting.

We are excited to offer this feature and look forward to getting your feedback!

Philter 1.5.0

Happy Friday! We are in the process of publishing Philter 1.5.0. Philter identifies and removes sensitive information in text. Look for Philter 1.5.0 to be available on the cloud marketplaces soon.

This version has a few new features in addition to minor improvements and fixes. The new features are described below.

New "Section" Filter

Philter 1.5.0 includes a new filter type called a "Section." This filter type lets you specify patterns that indicate the start and end of a section of text. For example, if your text has sentences or even paragraphs denoted with some marker, you can use the Section filter to redact those sentences or paragraphs. You just give the filter the regular expression patterns for the start and end markings. We have added the Section filter to the filter profiles documentation.
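The behavior can be pictured with a short Python sketch. It is an illustration of the concept only and is not the Section filter's implementation. The marker strings, the [REDACTED] replacement, and the function name redact_sections are all invented for the example.

```python
import re

def redact_sections(text, start_pattern, end_pattern, replacement="[REDACTED]"):
    """Replace everything from a start marker through the next end marker."""
    section = re.compile(
        "(?:" + start_pattern + ")" + r".*?" + "(?:" + end_pattern + ")",
        flags=re.DOTALL,  # let a section span multiple lines
    )
    return section.sub(replacement, text)

note = "Visit summary. BEGIN-PRIVATE Patient reports chest pain. END-PRIVATE Follow up in two weeks."
print(redact_sections(note, r"BEGIN-PRIVATE", r"END-PRIVATE"))
# Visit summary. [REDACTED] Follow up in two weeks.
```

The non-greedy match stops at the first end marker, so two separate sections in the same text are redacted independently instead of being merged into one.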

Amazon S3 to Store Filter Profiles

We have added the ability to store the filter profiles in an Amazon S3 bucket. The benefit of this is that filter profiles can now be shared across multiple instances of Philter. Previously, if you were running two instances of Philter you would have to update the filter profiles on each instance. By storing the filter profiles in S3 you can just update the filter profiles once via Philter's API. This does require a cache. The cache stores the filter profiles to lower the latency and reduce the number of calls to S3. (More on the cache below.)

We have published some CloudFormation and Terraform scripts on GitHub to help with creating this architecture.

Consolidated Caches

Philter previously used caches for the random anonymization values. With the introduction of a cache for the filter profiles stored in S3, we have consolidated those caches into a single cache. As a result, some configuration settings have been renamed. We have updated Philter's documentation with the renamed properties. Having a single cache means there is less to configure and fewer required resources.

If you are upgrading from a previous version you will need to change to the new cache property names.

Changeable Model File

The model file used by Philter can now be set in Philter's application.properties. Check out Philter's documentation for the details. By being able to set the model being used you can now select which model is most applicable to your use-case and domain.


Load-balanced and highly-available Philter CloudFormation template

We now have an AWS CloudFormation template to deploy an auto-scaled, highly-available Philter environment to identify and remove sensitive information from text. This template creates a VPC, load balancer, Philter instances, a Redis cache, and all required networking and security group configuration. Click the Launch Stack button to begin launching the stack.

In a deployment of Philter that is a single EC2 instance, the EC2 instance is a single point of failure with no ability to respond to fluctuations in demand. By deploying more than one EC2 instance we can protect our application against failure and be able to scale up and down as needed.

The benefit of using this CloudFormation template is that it provides a pre-configured Philter architecture and deployment that is highly-available, scalable, and encrypts all data in-transit and all data at rest. Your API requests to Philter to filter sensitive information from text will have higher throughput since the load balancer will distribute those requests across the Philter instances. And as described below, the stack uses end-to-end encryption of data at-rest and in-transit.

The stack requires an active subscription to Philter via the AWS Marketplace. The template supports us-east-1, us-east-2, us-west-1, and us-west-2 regions.

The CloudFormation template is available in the philter-infrastructure-as-code repository on GitHub.

The Philter Stack Architecture

The deployment creates an elastic load balancer that is attached to an auto-scaled group of Philter EC2 instances. The load balancer spans two public subnets and the Philter EC2 instances are spread across two private subnets. Also in the private subnets is an Amazon ElastiCache for Redis replication group. A NAT Gateway located in one of the public subnets provides outgoing internet access by routing the traffic to the VPC's Internet Gateway.

The load balancer will monitor the status of each Philter EC2 instance by periodically checking the /api/status endpoint. If an instance is found to be unhealthy after failing several consecutive health checks the failing instance will be replaced.

The Philter auto-scaling group is set to scale up and down based on the average CPU utilization of the Philter EC2 instances. When the CPU usage hits the high threshold another Philter EC2 instance will be added. When the CPU usage hits the low threshold, the auto-scaling group will begin removing (and terminating) instances from the group. The scaling policy is set to scale up at a faster rate than it scales down to avoid scaling down too quickly.

End-to-end Encryption

Incoming traffic to the load balancer is received by a TCP protocol handler on port 8080. These requests are distributed across the available Philter EC2 instances. The encrypted incoming traffic is terminated at the Philter EC2 instances. Network traffic between the ElastiCache for Redis nodes is encrypted, and the data at rest in the cache is also encrypted. The Philter EC2 instances use encrypted EBS volumes.

Launch the Stack

Click the Launch Stack button to launch the stack in your AWS account, or get the template here, or launch the stack using the AWS CLI with the command below.

aws cloudformation create-stack --stack-name philter --template-url s3://mtnfog-public/philter-resources/philter-vpc-load-balanced-with-redis.json

Once the stack completes Philter will be ready to accept requests. There will be an Output value called PhilterEndpoint. This value is the Philter API URL.

For example, if the value of PhilterEndpoint is https://philter2-philterlo-5lc0jo7if8g1-586151735.us-east-1.elb.amazonaws.com:8080/, then you can check Philter's status using the command:

curl -k https://philter2-philterlo-5lc0jo7if8g1-586151735.us-east-1.elb.amazonaws.com:8080/api/status

You can try a quick sample filter request with:

curl -k "https://philter2-philterlo-5lc0jo7if8g1-586151735.us-east-1.elb.amazonaws.com:8080/api/filter" \
  --data "George Washington lives in 90210 and his SSN was 123-45-6789." \
  -H "Content-type: text/plain"
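
The same filter request can be made from Python. The following is a minimal sketch using only the standard library. The endpoint in the usage note is a placeholder for your stack's PhilterEndpoint value, the helper names are made up for this example, and skipping certificate verification mirrors the -k flag in the curl commands above (it is only appropriate while Philter is using a self-signed certificate).

```python
import ssl
import urllib.request


def insecure_context():
    """Build an SSL context that skips certificate checks, like `curl -k`."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False  # must be disabled before setting CERT_NONE
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


def build_filter_request(endpoint, text):
    """Create (but do not send) a POST request to the /api/filter endpoint."""
    url = endpoint.rstrip("/") + "/api/filter"
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def filter_text(endpoint, text, timeout=20):
    """Send text to Philter and return the filtered text from the response."""
    request = build_filter_request(endpoint, text)
    with urllib.request.urlopen(
        request, timeout=timeout, context=insecure_context()
    ) as response:
        return response.read().decode("utf-8")
```

Calling filter_text("https://YOUR-PHILTER-ENDPOINT:8080/", "His SSN was 123-45-6789.") would send the request; replace the placeholder host with the value of PhilterEndpoint.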

Jeff Zemerick is the founder of Mountain Fog. He is a 10x certified AWS engineer, current chair of the Apache OpenNLP project, and experienced software engineer.

You can contact Jeff at jeff.zemerick@mtnfog.com or on LinkedIn.

Philter Studio 1.0.0

Philter Studio 1.0.0 is now available for download in your account. Philter Studio is an application for Windows 7/10 that provides convenient access to removing sensitive information from files and documents using Philter.

With Philter Studio’s intuitive interface you can quickly and easily utilize Philter to find and remove sensitive information from your files. Process files one at a time or queue up entire directories and process all files with a single click. Philter Studio supports finding and removing sensitive information in Microsoft Word files (.doc and .docx). Philter Studio can enable track changes so the redactions can be viewed while editing the document.

Philter Studio lets you take a deep look at how the sensitive information in your text was identified and removed. The Compare and Explain feature visually highlights the information, describes why it was identified, and shows the redacted version.


Philter and COVID-19

Philter NLP Models

The natural language processing (NLP) capabilities of Philter are partly model-driven, meaning that we have trained models to identify information in text. These models are used to identify pieces of sensitive information that do not follow well-defined patterns or exist in referenced dictionaries, such as persons' names. The model training process is a complex and compute-intensive procedure, often taking days or even weeks to complete. Once a model is created it can be applied to text to identify specific parts of the text based on the text used to train the model and the parameters of the training.

NLP Models for Many Use-Cases and Industries

The model currently deployed in Philter is generic yet provides good performance across many use-cases covering many different types of text. It has been our plan for some time to offer models trained for specific use-cases and industries, including non-healthcare industries, for those instances when Philter is used only on a certain type of text. A tailored model will give those specific use-cases an increase in performance.

Philter's pluggable model implementation is not quite ready yet. However, we are jumping ahead a bit today by announcing a model tailored for personally identifiable information in text related to COVID-19. We hope that this model will give you improved performance when identifying sensitive information in COVID-19 related text.

Model Availability

Because we are jumping ahead of ourselves in order to make this model immediately available, we don't yet have any automation or tooling support around being able to download and install the model yourself. (We will in the future.) Until we do have the self-service tooling available, we will distribute the model and installation instructions to users of Philter via email upon request. There is no additional charge to request and use the model.

To request the Philter model trained using COVID-19 data please use our contact form and include your cloud marketplace (AWS, Azure, or GCP) subscription ID.

Philter is available as a 30-day trial. If you are working with data related to COVID-19 and your free trial expires, you can request access to Philter's virtual machine images for continued use at no charge (except for the underlying cloud resources that you pay for through your cloud provider).


Using Philter with Microsoft Power Automate (Flow)


Philter SDKs

We have some updates on the Philter SDKs!

The Philter SDKs provide API clients for interacting with Philter to identify and remove sensitive information from text. Each project contains examples showing how to use the SDK.

Philter SDK for Java

The Java SDK is now available in Maven Central.

Philter SDK for .NET

The .NET SDK is now available from NuGet.

Philter SDK for Golang

The Golang SDK is now available on GitHub.


Philter

Filtering Sensitive Information from Text using Apache NiFi and Philter

A while back we made a post describing how Philter can be used alongside Apache NiFi for identifying and removing sensitive information from text. Since that post, there have been changes to Philter and Apache NiFi, so we thought it would be worthwhile to revisit that architecture and its configuration.

  • Apache NiFi is an application for creating and managing data flows that process data.
  • Philter identifies and removes sensitive information, such as PHI and PII, from natural language text. Philter is available on cloud marketplaces.

The Data Flow Architecture

In the architecture of our data flow, we are going to be ingesting natural language (unstructured) text from somewhere - it doesn't really matter where. In your use-case it may be from a file system, from an S3 bucket, or from an Apache Kafka topic. Once we have the text in the content of the NiFi flowfile, we will send the text to Philter where the sensitive information will be removed from the text. The filtered text will then be the content of the flowfile. In our example here we are going to read the files from a directory on the file system.

To interact with Philter we can use NiFi's InvokeHTTP processor since Philter's API is HTTP REST-based.

Finally, we will write the filtered text to some destination. Like the ingest source, where we write the text does not matter. We could write it back to the source or some other location - whatever is required by your use-case.

The NiFi Flow

The flow will use the GetFile processor to read /tmp/input/*.txt files. The contents of each file will be sent to Philter. The resulting filtered text will be written back to the file system at /tmp/output.

Apache NiFi flow for Philter

If you want to quickly prototype it with minimal configuration, use a GenerateFlowFile processor and set the content manually to something like "His SSN was 123-45-6789."

Using GenerateFlowFile to test Philter.

InvokeHTTP Processor Configuration

The configuration of the InvokeHTTP processor is fairly simple. We just need to configure the HTTP Method, Remote URL, and Content Type. Set each as follows:

  • HTTP Method = POST
  • Remote URL = http://philter-ip:8080/api/filter
  • Content-Type = text/plain

Since we are not providing any values for the context, document ID, or filter profile name in the URL, Philter will use default values for each. When not provided, the default value for context is default, Philter will generate a document ID per request, and the default filter profile name is default.

These default values are detailed in Philter's API documentation. A context lets you group similar documents together, perhaps by business unit or purpose. A document ID should uniquely identify a document (such as a file name) and can be used to split up large documents for processing.

If you do want to set values for one or all of those instead of using the default values, just append them to the Remote URL: http://philter-ip:8080/api/filter?c=ctx&p=justssn In this request, the context is set to ctx and it tells Philter to use the filter profile named justssn. As a tip, you can use NiFi's expression language to parameterize the values in the URL.

InvokeHTTP processor configuration for Philter.
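
The way the optional query parameters are appended to the Remote URL can be sketched in a few lines of Python. This is only an illustration of the URL format described above. The function name is made up for this example, and only the c (context) and p (filter profile) parameters mentioned in this post are handled.

```python
from urllib.parse import urlencode


def build_filter_url(base_url, context=None, profile=None):
    """Return the /api/filter URL with optional c and p query parameters."""
    params = {}
    if context:
        params["c"] = context  # context used to group similar documents
    if profile:
        params["p"] = profile  # name of the filter profile to apply
    url = base_url.rstrip("/") + "/api/filter"
    if params:
        url = url + "?" + urlencode(params)
    return url


print(build_filter_url("http://philter-ip:8080", context="ctx", profile="justssn"))
# prints http://philter-ip:8080/api/filter?c=ctx&p=justssn
```

When neither value is given, the function returns the plain /api/filter URL and Philter falls back to its defaults.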

A Closer Look

If we use a LogAttribute processor we can get some insight into what's happening. In the log output below, we can see the HTTP POST request that was made.

At the top of the log we see the filtered text from Philter. The input text from the file was "His SSN was 123-45-6789." Philter applied the default filter profile which looks for SSNs and responded with "His SSN was {{{REDACTED-ssn}}}."

(Filter profiles are very powerful and flexible configurations that let you have full control over the types of sensitive information that Philter identifies and how Philter manipulates that information when found.)

We can also see that since we did not provide a value for the document ID in the request, Philter assigned a document ID and returned it in the response in the x-document-id header.

His SSN was {{{REDACTED-ssn}}}.

--------------------------------------------------
Standard FlowFile Attributes
Key: 'entryDate'
	Value: 'Thu Feb 27 13:35:19 UTC 2020'
Key: 'lineageStartDate'
	Value: 'Thu Feb 27 13:35:11 UTC 2020'
Key: 'fileSize'
	Value: '31'
FlowFile Attribute Map Content
Key: 'Connection'
	Value: 'keep-alive'
Key: 'Content-Length'
	Value: '31'
Key: 'Content-Type'
	Value: 'text/plain;charset=UTF-8'
Key: 'Date'
	Value: 'Thu, 27 Feb 2020 13:35:19 GMT'
Key: 'Keep-Alive'
	Value: 'timeout=60'
Key: 'filename'
	Value: 'd206fc81-2c42-40ba-afbf-b5f9998b56c0'
Key: 'invokehttp.request.url'
	Value: 'http://10.1.1.221:8080/api/filter'
Key: 'invokehttp.status.code'
	Value: '200'
Key: 'invokehttp.status.message'
	Value: ''
Key: 'invokehttp.tx.id'
	Value: 'fbf2f6c0-1073-4fac-bc23-6d6a67b70423'
Key: 'mime.type'
	Value: 'text/plain;charset=UTF-8'
Key: 'path'
	Value: './'
Key: 'uuid'
	Value: '486ff4c2-6530-4e1c-aea2-e9965b86b10c'
Key: 'x-document-id'
	Value: 'fb75a2a4c164192542f89881aa8baf21'
--------------------------------------------------

Summary

Philter's API makes it easy to integrate Philter with applications like Apache NiFi. The InvokeHTTP processor native to NiFi is an ideal means of communicating with Philter.

To keep things simple, this example only considered SSNs in text. Philter supports many other types of sensitive information.

If performance is very important, there are a couple of things that can be done to help. First, Philter is stateless so you can run multiple instances of Philter behind a load balancer. Second, Philter Enterprise Edition can run natively inside an Apache NiFi flow without the need to make HTTP calls to Philter. Contact us if you would like to learn more about Philter Enterprise Edition's processor for Apache NiFi.

Philter's integration with applications like Apache NiFi is very important to us so look for more improvements and features in versions to come.


Philter 1.1.0

We are happy to announce Philter 1.1.0! This version brings some features we think you will find very useful because most were implemented directly from interactions with users. We look forward to future interactions to keep driving improvements!

We are very excited about this release, but we also have lots of exciting things to add in the next release and we will soon be making available Philter Studio, a free Windows application to use Philter. If you don't like managing filter profiles in JSON you will love Philter Studio!

We have begun the process of publishing Philter 1.1.0 to the cloud marketplaces and it should be available on the AWS, Azure, and GCP marketplaces in the next few days once publishing is complete. The Philter Deployment Guide walks through how to deploy Philter on each platform. You can also see the full Philter release notes.

To be notified when Philter 1.1.0 is available for deployment into your cloud, subscribe to our rarely-used mailing list below.

What's New in Philter 1.1.0

Ignore Lists

In some cases, there may be text that you never want to identify and remove as PII or PHI. An example may be an email address or telephone number of a business that is not relevant to the sensitive information in the text and removing this text may cause the document to lose meaning. Ignore lists allow you to specify a list of terms that are never removed (always ignored if found) from the documents. You can create as many ignore lists as you need and each one can contain as many terms as desired. The ignore lists are defined in the filter profile.

Here's how an ignore list is defined in a filter profile that only finds SSNs. The SSNs 123-45-6789 and 000-00-0000 will always be ignored and will remain in the documents unchanged.

{
  "name": "default",
  "identifiers": {
    "ssn": {
      "ssnFilterStrategies": [
        {
          "strategy": "REDACT",
          "redactionFormat": "{{{REDACTED-%t}}}",
          "replacementScope": "DOCUMENT"
        }
      ]
    }
  },
  "ignored": [
    {
      "name": "ignored-terms",
      "terms": [
        "123-45-6789",
        "000-00-0000"
      ]
    }
  ]
}
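
To make the effect of an ignore list concrete, here is a small Python sketch of the idea. It is an illustration only and not Philter's implementation: it assumes exact string matching against the terms, and the function names are hypothetical.

```python
import json

# The "ignored" portion of the filter profile shown above.
PROFILE = json.loads("""
{
  "name": "default",
  "ignored": [
    {"name": "ignored-terms", "terms": ["123-45-6789", "000-00-0000"]}
  ]
}
""")


def ignored_terms(profile):
    """Collect the terms from every ignore list in a filter profile."""
    terms = set()
    for ignore_list in profile.get("ignored", []):
        terms.update(ignore_list.get("terms", []))
    return terms


def texts_to_redact(profile, candidates):
    """Drop candidate matches whose text appears in an ignore list."""
    skip = ignored_terms(profile)
    return [text for text in candidates if text not in skip]


print(texts_to_redact(PROFILE, ["123-45-6789", "987-65-4321"]))
# prints ['987-65-4321']
```

The SSN 123-45-6789 is left alone because it is in the ignore list, while the other SSN is still a candidate for redaction.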

Custom Dictionaries

You can now have custom dictionaries of terms that are to be identified as sensitive information. With a custom dictionary you can specify a list of terms, such as names, addresses, or other information, that should always be treated as personal information. You can create as many custom dictionaries as you need and each one can contain as many terms as desired. The custom dictionaries are defined in the filter profile.

Here's how a custom dictionary can be added to a filter profile. In this example, a custom dictionary of type names-with-j is created and it contains the terms james, jim, and john. When any of these terms are found in a document they will be redacted. The dictionaries item is an array so you can have as many dictionaries as required. (The "auto" setting for the sensitivity is discussed a little further down below.)

{
  "name": "default",
  "identifiers": {
    "dictionaries": [
      {
        "type": "names-with-j",
        "terms": [
          "james",
          "jim",
          "john"
        ],
        "sensitivity": "auto",
        "customFilterStrategies": [
          {
            "strategy": "REDACT",
            "redactionFormat": "{{{REDACTED-%t}}}",
            "replacementScope": "DOCUMENT"
          }
        ]
      }
    ],
    "ssn": {
      "ssnFilterStrategies": [
        {
          "strategy": "REDACT",
          "redactionFormat": "{{{REDACTED-%t}}}",
          "replacementScope": "DOCUMENT",
          "staticReplacement": "",
          "condition": ""
        }
      ]
    }
  }
}

"Fuzziness" Calculation

We added a new fuzziness option when using dictionary filters. The previous options of LOW, MEDIUM, and HIGH were found to be either not restrictive enough or too restrictive. We have added an AUTO option that automatically determines the appropriate fuzziness based on the length of the term in question. For instance, the AUTO option sets the fuzziness for a short term to be on the low side, while a longer term allows a higher fuzziness. We recommend using AUTO over the other options and expect it to perform better for you. The other options of LOW, MEDIUM, and HIGH are still available.
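
The idea behind a length-based fuzziness can be sketched in Python. This is not Philter's algorithm. It assumes fuzziness is measured as edit (Levenshtein) distance, and the length cut-offs in auto_max_edits are hypothetical values chosen only to show short terms matching strictly and longer terms matching more loosely.

```python
def levenshtein(a, b):
    """Compute the edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def auto_max_edits(term):
    """Hypothetical AUTO rule: longer terms tolerate more edits."""
    if len(term) <= 4:
        return 0
    if len(term) <= 8:
        return 1
    return 2


def fuzzy_match(term, token):
    """Return True when token is within the allowed edit distance of term."""
    return levenshtein(term.lower(), token.lower()) <= auto_max_edits(term)
```

Under these assumed cut-offs, fuzzy_match("jim", "tim") is False because a three-letter term allows no edits, while fuzzy_match("james", "jame") is True because a five-letter term allows one.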

Explain API Endpoint

Philter operates as a black box. Text goes in and manipulated text comes out. What happened inside? To help provide insight into the black box, we have added a new API endpoint called explain. This endpoint performs text filtering but returns more information on the filtering process. The list of identified spans (pieces of text found to be sensitive) and applied spans are both returned as objects along with attributes about each span.

Here's an example output of calling the explain API endpoint given some sample text. The original API call:

curl -k -s "https://localhost:8080/api/explain?c=C1" --data "George Washington was president and his ssn was 123-45-6789 and he lived at 90210." -H "Content-type: text/plain" 

The response from the API call:

{
  "filteredText": "{{{REDACTED-entity}}} was president and his ssn was {{{REDACTED-ssn}}} and he lived at {{{REDACTED-zip-code}}}.",
  "context": "C1",
  "documentId": "117d088b-354a-48b0-b2d7-bdf0335650d5",
  "explanation": {
    "appliedSpans": [
      {
        "id": "b7c5b777-460e-4033-8d91-0f2d3a2d6424",
        "characterStart": 0,
        "characterEnd": 17,
        "filterType": "NER_ENTITY",
        "context": "C1",
        "documentId": "117d088b-354a-48b0-b2d7-bdf0335650d5",
        "confidence": 0.9189682900905609,
        "text": "George Washington",
        "replacement": "{{{REDACTED-entity}}}"
      },
      {
        "id": "b4a2d019-b7cb-4fc7-8598-bec1904124b4",
        "characterStart": 48,
        "characterEnd": 59,
        "filterType": "SSN",
        "context": "C1",
        "documentId": "117d088b-354a-48b0-b2d7-bdf0335650d5",
        "confidence": 1,
        "text": "123-45-6789",
        "replacement": "{{{REDACTED-ssn}}}"
      },
      {
        "id": "48b10b67-6ad2-4b5a-934f-a3b4fd190618",
        "characterStart": 76,
        "characterEnd": 81,
        "filterType": "ZIP_CODE",
        "context": "C1",
        "documentId": "117d088b-354a-48b0-b2d7-bdf0335650d5",
        "confidence": 1,
        "text": "90210",
        "replacement": "{{{REDACTED-zip-code}}}"
      }
    ],
    "identifiedSpans": [
      {
        "id": "b7c5b777-460e-4033-8d91-0f2d3a2d6424",
        "characterStart": 0,
        "characterEnd": 17,
        "filterType": "NER_ENTITY",
        "context": "C1",
        "documentId": "117d088b-354a-48b0-b2d7-bdf0335650d5",
        "confidence": 0.9189682900905609,
        "text": "George Washington",
        "replacement": "{{{REDACTED-entity}}}"
      },
      {
        "id": "b4a2d019-b7cb-4fc7-8598-bec1904124b4",
        "characterStart": 48,
        "characterEnd": 59,
        "filterType": "SSN",
        "context": "C1",
        "documentId": "117d088b-354a-48b0-b2d7-bdf0335650d5",
        "confidence": 1,
        "text": "123-45-6789",
        "replacement": "{{{REDACTED-ssn}}}"
      },
      {
        "id": "48b10b67-6ad2-4b5a-934f-a3b4fd190618",
        "characterStart": 76,
        "characterEnd": 81,
        "filterType": "ZIP_CODE",
        "context": "C1",
        "documentId": "117d088b-354a-48b0-b2d7-bdf0335650d5",
        "confidence": 1,
        "text": "90210",
        "replacement": "{{{REDACTED-zip-code}}}"
      }
    ]
  }
}

In the response, each identified span is listed with some attributes.

  • id - A random UUID identifying the span.
  • characterStart - The character-based index of the start of the span.
  • characterEnd - The character-based index of the end of the span.
  • filterType - The filter that identified this span.
  • context - The given context under which this span was identified.
  • documentId - The given documentId or a randomly generated documentId if none was provided.
  • confidence - Philter's confidence that this span does in fact represent sensitive information.
  • text - The text contained within the span.
  • replacement - The value that Philter used to replace the text in the document.

The User's Guide has been updated to include the explain API endpoint.
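
As a quick check of how the span attributes fit together, the following Python sketch rebuilds the filtered text from the original text and the applied spans in the response above. It assumes characterEnd is exclusive, which is consistent with the sample (the 17-character name spans 0 to 17). The function name is made up for this example.

```python
def apply_spans(text, spans):
    """Replace each span's characters with its replacement value."""
    # Work from the end of the text so earlier offsets stay valid.
    for span in sorted(spans, key=lambda s: s["characterStart"], reverse=True):
        start, end = span["characterStart"], span["characterEnd"]
        text = text[:start] + span["replacement"] + text[end:]
    return text


original = ("George Washington was president and his ssn was "
            "123-45-6789 and he lived at 90210.")

# Only the attributes needed for replacement are copied from the response.
applied_spans = [
    {"characterStart": 0, "characterEnd": 17,
     "replacement": "{{{REDACTED-entity}}}"},
    {"characterStart": 48, "characterEnd": 59,
     "replacement": "{{{REDACTED-ssn}}}"},
    {"characterStart": 76, "characterEnd": 81,
     "replacement": "{{{REDACTED-zip-code}}}"},
]

print(apply_spans(original, applied_spans))
```

The printed text matches the filteredText value in the response.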

Elasticsearch

As mentioned in a previous post, Philter 1.1.0 now uses Elasticsearch to store the identified spans instead of MongoDB. Please check that post for the details but we do want to mention again here that this change does not affect Philter's API and the change will be transparent to any of your existing Philter scripts or applications.

Datadog Metrics

Philter 1.1.0 adds support for sending metrics directly to Datadog.

New Metrics

Philter 1.1.0 adds new metrics for each type of filter. Now you will be able to see metrics for each type of filter in CloudWatch, JMX, and Datadog to give more insight into the types of sensitive information being found in your documents.


Philter and Elasticsearch

Philter, our application for finding and removing PII and PHI from natural language text, has the ability to optionally store the identified text in an external data store. With this feature, you have access to a complete log of Philter's actions as well as the ability to reconstruct the original text in the future if you ever need to.

In Philter 1.0,  we chose MongoDB as the external data store. With just a few configuration properties, Philter would connect to MongoDB and persist all identified "spans" (the identified text, its location in the document, and some other attributes) to a MongoDB database. This worked well but we realized that looking forward it might not have been the best choice.

In Philter 1.1 we are replacing MongoDB with Elasticsearch. The functionality and the Philter APIs will remain the same. The only difference is that the spans will now be stored in an Elasticsearch index instead of a MongoDB database. So what, exactly, are the benefits? Great question.

The first benefit comes with Elasticsearch and Kibana's ability to quickly and easily make dashboards to view the indexed data. With the spans in Elasticsearch, you can make a dashboard to summarize the spans by type, text, etc., to show insights into the PII and PHI that Philter is finding and manipulating in your text.

It also quickly became apparent that a primary use-case for the store would be querying the spans it contains. For example, a query to find all documents containing "John Doe" or all documents containing a certain date or phone number. A search engine is better prepared to handle those queries.
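
As a rough sketch, a query like the "John Doe" example could be expressed as an Elasticsearch query body. The field names used here (text, documentId, context) are an assumption that the stored spans are indexed with the same attribute names Philter reports for spans; check the actual index mapping before relying on them.

```python
import json


def spans_matching(text):
    """Build an Elasticsearch query body for spans whose text matches a phrase."""
    return {
        # match_phrase requires the words to appear together and in order.
        "query": {"match_phrase": {"text": text}},
        # Return only the fields that identify the source document.
        "_source": ["documentId", "context"],
    }


print(json.dumps(spans_matching("John Doe"), indent=2))
```

The body can be sent to the _search endpoint of whichever index holds the spans. The distinct documentId values in the hits identify the documents containing the phrase.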

Another consideration is licensing. Elasticsearch is available under the Apache Software License or a compatible license while MongoDB is available under a Server Side Public License.

In summary, Philter 1.1 will offer support for using Elasticsearch as the store for identified PII and PHI. Remember, using the store is an optional feature of Philter. If you do not require any history of the text that Philter identifies then it is not needed. (By default, Philter's store feature is disabled and has to be explicitly enabled.) Support for using MongoDB as a store will not be available in Philter 1.1.

We are really excited about this change and excited about the possibilities that come with it!


Filter Profile JSON Schema

Philter can find and remove many types of PII and PHI. You can select the types of PII and PHI and how the identified values are removed or manipulated through what we call a "filter profile." A filter profile is a file that essentially lets you tell Philter what to do!

To help make creating and editing filter profiles a little bit easier, we have published the JSON schema.

https://www.mtnfog.com/filter-profile-schema.json

This JSON schema can be imported into some development tools to provide features such as validation and autocomplete. The screenshot below shows an example of adding the schema to IntelliJ. More details on these capabilities and features are available from the IntelliJ documentation.

Visual Studio Code and Atom (via a package) also include support for validating JSON documents per JSON schemas.

The Filter Profile Registry provides a way to centrally manage filter profiles across one or more instances of Philter.


Using AWS Kinesis Firehose Transformations to Filter Sensitive Information from Streaming Text

  • Updated 07/12/2020 to include a link to a similar solution using log4j and Apache Kafka.
  • Updated 05/20/2020 to include a link to running Philter as a container and a link to the solution example.
  • Updated 04/28/2020 to include a link to CloudFormation and Terraform scripts and link to using a signed certificate with Philter.

AWS Kinesis Firehose is a managed streaming service designed to take large amounts of data from one place to another. For example, you can take data from places such as CloudWatch, AWS IoT, and custom applications using the AWS SDK to places such as Amazon S3, Amazon Redshift, Amazon Elasticsearch, and others. In this post we will use S3 as the firehose's destination.

In some cases you may need to manipulate the data as it goes through the firehose to remove sensitive information. In this blog post we will show how AWS Kinesis Firehose and AWS Lambda can be used in conjunction with Philter to remove sensitive information (PII and PHI) from the text as it travels through the firehose.

Click here for a similar solution using log4j and Apache Kafka to remove sensitive information from application logs.

Prerequisites

You must have a running instance of Philter. If you don't already have a running instance of Philter you can launch one through the AWS Marketplace or as a container. There are CloudFormation and Terraform scripts for launching a single instance of Philter or a load-balanced auto-scaled set of Philter instances.

It's not required that the instance of Philter be running in AWS but it is required that the instance of Philter be accessible from your AWS Lambda function. Running Philter and your AWS Lambda function in your own VPC allows you to communicate locally with Philter from the function.

Setting up the AWS Kinesis Firehose Transformation

There is no need to duplicate an excellent blog post on creating a Firehose Data Transformation with AWS Lambda. Instead, refer to the linked page and substitute the Python 3 code below for the code in that blog post.

Configuring the Firehose and the Lambda Function

To start, create an AWS Firehose and configure an AWS Lambda transformation. When creating the AWS Lambda function, select Python 3.7 and use the following code:

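# "requests" here is the copy vendored in botocore, as bundled with the Python 3.7
# Lambda runtime. On runtimes that do not include it, package the requests
# library with the function and use "import requests" instead.
from botocore.vendored import requests
import base64

def handler(event, context):

    output = []

    for record in event['records']:
        # Firehose delivers each record's payload base64-encoded.
        payload = base64.b64decode(record["data"])
        headers = {'Content-type': 'text/plain'}
        # verify=False skips certificate validation (Philter's self-signed certificate).
        r = requests.post("https://PHILTER_IP:8080/api/filter", verify=False, data=payload, headers=headers, timeout=20)
        filtered = r.text
        output_record = {
            'recordId': record['recordId'],
            'result': 'Ok',
            'data': base64.b64encode(filtered.encode('utf-8') + b'\n').decode('utf-8')
        }
        output.append(output_record)

    # Firehose expects the transformed records under the "records" key.
    return {'records': output}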

The following Kinesis Firehose test event can be used to test the function:

{
  "invocationId": "invocationIdExample",
  "deliveryStreamArn": "arn:aws:kinesis:EXAMPLE",
  "region": "us-east-1",
  "records": [
    {
      "recordId": "49546986683135544286507457936321625675700192471156785154",
      "approximateArrivalTimestamp": 1495072949453,
      "data": "R2VvcmdlIFdhc2hpbmd0b24gd2FzIHByZXNpZGVudCBhbmQgaGlzIHNzbiB3YXMgMTIzLTQ1LTY3ODkgYW5kIGhlIGxpdmVkIGF0IDkwMjEwLiBQYXRpZW50IGlkIDAwMDc2YSBhbmQgOTM4MjFhLiBIZSBpcyBvbiBiaW90aW4uIERpYWdub3NlZCB3aXRoIEEwMTAwLg=="
    },
    {
      "recordId": "49546986683135544286507457936321625675700192471156785154",
      "approximateArrivalTimestamp": 1495072949453,
      "data": "R2VvcmdlIFdhc2hpbmd0b24gd2FzIHByZXNpZGVudCBhbmQgaGlzIHNzbiB3YXMgMTIzLTQ1LTY3ODkgYW5kIGhlIGxpdmVkIGF0IDkwMjEwLiBQYXRpZW50IGlkIDAwMDc2YSBhbmQgOTM4MjFhLiBIZSBpcyBvbiBiaW90aW4uIERpYWdub3NlZCB3aXRoIEEwMTAwLg=="
    }    
  ]
}

This test event contains 2 messages and the data for each is base 64 encoded, which is the value "He lived in 90210 and his SSN was 123-45-6789." When the test is executed the response will be:

[
  "He lived in {{{REDACTED-zip-code}}} and his SSN was {{{REDACTED-ssn}}}.",
  "He lived in {{{REDACTED-zip-code}}} and his SSN was {{{REDACTED-ssn}}}."
]

When executing the test, the AWS Lambda function will extract the data from each record in the firehose and submit it to Philter for filtering. The filtered text from each request is returned from the function. Note that in our Python function we are ignoring Philter's self-signed certificate. It is recommended that you use a valid signed certificate for Philter.
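
To test with your own sample text, the event can be generated rather than encoded by hand. The sketch below builds an event with the same shape as the test event above. The helper names are made up for this example, and the record IDs are simple placeholders rather than real Firehose record IDs.

```python
import base64
import json


def make_test_event(messages):
    """Build a Kinesis Firehose transformation test event from plain text."""
    records = []
    for index, message in enumerate(messages):
        records.append({
            "recordId": str(index),  # placeholder ID; any unique string works
            "approximateArrivalTimestamp": 1495072949453,
            "data": base64.b64encode(message.encode("utf-8")).decode("ascii"),
        })
    return {
        "invocationId": "invocationIdExample",
        "deliveryStreamArn": "arn:aws:kinesis:EXAMPLE",
        "region": "us-east-1",
        "records": records,
    }


def decode_record(record):
    """Decode the base64 data of a record back into text."""
    return base64.b64decode(record["data"]).decode("utf-8")


event = make_test_event(["He lived in 90210 and his SSN was 123-45-6789."])
print(json.dumps(event, indent=2))
```

The printed JSON can be pasted into the Lambda console as a test event, and decode_record is handy for checking what an existing event actually contains.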

When data is now published to the Kinesis Firehose stream, the data will be processed by the AWS Lambda function and Philter prior to exiting the firehose at its configured destination.

Processing Data

We can use the AWS CLI to publish data to our Kinesis Firehose stream called sensitive-text:

aws firehose put-record --delivery-stream-name sensitive-text --record '{"Data":"He lived in 90210 and his SSN was 123-45-6789."}'

Check the destination S3 bucket and you will have a single object with the following line:

He lived in {{{REDACTED-zip-code}}} and his SSN was {{{REDACTED-ssn}}}.

Conclusion

In this blog post we have created an AWS Firehose pipeline that uses an AWS Lambda function to remove PII and PHI from the text in the streaming pipeline.

Resources


Jeff Zemerick is the founder of Mountain Fog. He is a 10x certified AWS engineer, current chair of the Apache OpenNLP project, and experienced software engineer.

You can contact Jeff at jeff.zemerick@mtnfog.com or on LinkedIn.

Our Approach to Continuous Delivery for Cloud Marketplaces

In this blog post I want to take a moment to share our challenges with continuous integration and delivery and how we approached them. Our Philter software, which finds and removes PII and PHI from text, is deployed on (at the moment) three cloud marketplaces and is also available for on-premises deployment. Each of the marketplaces (AWS Marketplace, Microsoft Azure Marketplace, and Google Cloud Platform (GCP) Marketplace) has its own requirements and constraints. We needed a pipeline that can build and test our code and deliver the binaries to each of the cloud marketplaces as a deployable image.

Which tools you use to implement your process does not really matter. Some tools are more feature-rich than others, and whether one is better or worse than another is often a matter of opinion. It's up to you to pick the tools that you or your organization want to use. We will mention the tools we use but don't take that as meaning only these tools will work. (We like being tool-agnostic so that we are not afraid to try new tools.) Our build infrastructure runs in AWS.

Our builds are managed by Jenkins through Jenkinsfiles. Each project has a Jenkinsfile that defines the build stages for the project. These stages vary by project but are usually similar to "build", "test", and "deploy." The build and test stages are pretty self-explanatory. The deploy stage is where things get interesting (i.e. challenging).

We are using HashiCorp's Packer tool to create our images for the cloud marketplaces. A single Packer JSON file contains a "builder" (in Packer terminology) for each cloud marketplace. A builder defines the necessary parameters for constructing the image on that specific cloud platform. For instance, when building on AWS EC2, the builder contains information about the VPC and subnet used for the build, the base AMI the image will be created from, and the AWS region for the image. Likewise, for Microsoft Azure, the builder defines things such as the storage account name, operating system name and version, and the Azure subscription ID. GCP has its own set of required parameters.

The rest of the Packer JSON file contains the steps that prepare the image. These include executing commands over SSH to install prerequisite packages, uploading the build artifacts made by the Jenkins build, and, lastly, preparing the system to be turned into an image.

After the Jenkinsfile's "deploy" stage executes, the end result is a new image in each of the cloud platforms, suitable for final testing prior to being made available on the cloud's marketplace. The build initiates this testing by publishing a message to an AWS SNS topic each time an image completes creation. The message triggers a process, powered by AWS Lambda, that creates and starts a virtual machine from the image. Required credentials are stored in AWS SSM Parameter Store.
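The hand-off from the build to the test process can be sketched as a Lambda handler that reads the SNS notification. The event shape (Records, then Sns, then Message) is the standard one Lambda receives from SNS. The message fields (cloud, image_id, build_number) and the launch_test_vm stub are hypothetical and stand in for whatever the real process uses.

```python
import json

def build_image_message(cloud, image_id, build_number):
    """Message the build could publish when an image finishes (fields are hypothetical)."""
    return json.dumps({"cloud": cloud, "image_id": image_id, "build_number": build_number})

def launch_test_vm(cloud, image_id):
    """Stub: a real version would call the cloud's API to start a VM from the image."""
    return f"{cloud}-test-vm-for-{image_id}"

def handler(event, context):
    """Lambda entry point: start one test VM per SNS record in the event."""
    launched = []
    for record in event["Records"]:
        message = json.loads(record["Sns"]["Message"])
        vm = launch_test_vm(message["cloud"], message["image_id"])
        launched.append({"vm": vm, "build_number": message["build_number"]})
    return launched

# Simulate the event Lambda receives for one SNS notification.
sample_event = {"Records": [{"Sns": {"Message": build_image_message("aws", "ami-EXAMPLE", 42)}}]}
result = handler(sample_event, None)
```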

Automated testing is then performed against the virtual machine. Each image must be tested individually because of the nuances and differing requirements of each cloud platform and marketplace, and because the base images differ. For instance, on AWS the base image is Amazon Linux 2. On Microsoft Azure it is CentOS 7. The scripts that install the prerequisites and configure the application can differ based on the base image.

The automated testing exercises the application's API and establishes an SSH connection to the virtual machine to verify that files are in the correct location and have been configured properly. A message is published to a separate AWS SNS topic indicating success or failure of the tests, and the virtual machine is deleted/terminated, leaving only the newly built image. The test results are persisted to a database along with the build number for reference. If testing was successful, we can proceed to the manual steps of publishing the image to the marketplaces when we are ready to do so. (All marketplaces require manual clicking to submit images, so that part cannot be automated.)
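As a rough illustration of how the checks could be rolled up into a single result, the sketch below runs a set of named checks and produces a record keyed by build number. The check names, the record fields, and the stand-in lambdas are all hypothetical. Real checks would call the API and inspect files over SSH.

```python
def run_image_checks(build_number, cloud, checks):
    """Run named checks and return a result record suitable for persisting."""
    failures = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception as exc:  # a check that raises counts as a failure
            ok = False
            name = f"{name} ({exc})"
        if not ok:
            failures.append(name)
    return {
        "build_number": build_number,
        "cloud": cloud,
        "passed": not failures,
        "failures": failures,
    }

# Stand-in checks for illustration only.
checks = {
    "api_responds": lambda: True,
    "config_file_present": lambda: False,
}
record = run_image_checks(42, "azure", checks)
```

The `passed` flag is what a success or failure notification would carry, and the full record is what would be written to the database.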

Continuous integration and delivery is important for all software projects. Having a consistent, repeatable process for building, testing, and packaging software for delivery is critical. A well-defined and implemented process can help teams find problems earlier, get configurations into code, and ultimately, get higher-quality products to the market faster.


Jeff Zemerick is the founder of Mountain Fog. He is a 10x certified AWS engineer, current chair of the Apache OpenNLP project, and experienced software engineer.

You can contact Jeff at jeff.zemerick@mtnfog.com or on LinkedIn.

Philter

Sneak Peek at Philter Big-Data and ETL Integrations

As we are nearing the general availability of Philter we would like to take a minute to offer a quick look at Philter's integrations with other applications. Philter offers integration capabilities with Apache NiFi, Apache Kafka, and Apache Pulsar to provide PHI/PII filtering capabilities across your big-data and ETL ecosystems. We are very excited to offer these integrations for such awesome and popular open source applications.

To recap, Philter is an application to identify and, optionally, remove or replace protected health information (PHI) and personally identifiable information (PII) from natural language text.

Apache NiFi

Philter provides a custom Apache NiFi processor NAR that you can plug into your existing NiFi installations by copying the NAR file to NiFi's lib directory. The processor allows your NiFi flow to identify and replace PHI and PII directly in your flow without any required external services. The processor's configuration is similar to Philter's standard configuration. The processor accepts a filter profile, an optional MongoDB URI to use to store replaced values, and a cache to maintain state when anonymizing values consistently. For the cache, the processor utilizes NiFi's built-in DistributedMapCacheServer.

The processor operates on the content of the incoming FlowFile by filtering the content and replacing it with the filtered text. An outbound relationship provides the downstream processors with the filtered text.
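The role the cache plays in consistent anonymization can be illustrated with a small sketch: the same input value always maps to the same replacement because the mapping is remembered. Here a plain dict stands in for a shared cache such as NiFi's DistributedMapCacheServer, and the replacement scheme (PERSON-1, PERSON-2, ...) is made up for the example. It is not how Philter generates replacement values.

```python
class ConsistentAnonymizer:
    """Map each sensitive value to a stable replacement using a cache."""

    def __init__(self, cache=None):
        # A dict stands in for a shared cache service (assumption for this sketch).
        self.cache = {} if cache is None else cache

    def replace(self, value, kind="PERSON"):
        key = (kind, value)
        if key not in self.cache:
            # First time this value is seen: mint a new replacement and remember it.
            self.cache[key] = f"{kind}-{len(self.cache) + 1}"
        return self.cache[key]

anonymizer = ConsistentAnonymizer()
first = anonymizer.replace("George Washington")
second = anonymizer.replace("Martha Washington")
again = anonymizer.replace("George Washington")  # same value, same replacement
```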

Apache Kafka

Philter integrates with Apache Kafka by consuming text from Kafka, performing the filtering, and publishing the filtered text to a different Kafka topic. Philter does this in a performant and fault-tolerant manner by leveraging the Apache Flink streaming framework. This approach suits existing pipelines where text is already being consumed from Kafka for processing because it requires minimal changes to the pipeline. Simply provide the appropriate configuration values to the Philter job and update your topic names.

Apache Pulsar

Philter integrates with Apache Pulsar via Pulsar Functions. A Pulsar Function enables Pulsar to execute functions on the streaming data as it passes through Pulsar. Pulsar is similar to Kafka in its functionality as a massive pub/sub application but unlike Kafka it provides the ability to directly transform the data inside of the application. This is an ideal integration point for Philter and your streaming architectures using Apache Pulsar.
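To show what a function running inside Pulsar looks like, the sketch below uses the plain-function style that Pulsar's Python runtime accepts: a module-level function named process that receives each message and returns the value to publish. The regular expression is only a stand-in for real filtering and is not how Philter identifies PII. The replacement token is also made up.

```python
import re

# Stand-in pattern for US SSNs; real PII detection is far more involved.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def process(input):
    """Called once per message; the return value is published to the output topic."""
    return SSN_PATTERN.sub("[REDACTED-SSN]", input)

# Local call for illustration; inside Pulsar the runtime invokes process() itself.
filtered = process("Patient SSN is 123-45-6789.")
```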


Introducing ngramdb

ngramdb provides a distributed means of storing and querying n-grams (or bags of words) organized under contexts. A REST interface provides the ability to insert n-grams, execute “starts with” and “top” queries, and calculate similarity metrics of contexts. Apache Ignite provides the distributed and highly available persistence and powers the querying abilities.
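The three kinds of operations can be illustrated with a small in-memory model. This is only a toy to show what the queries mean: it is not ngramdb's REST API or its Ignite-backed implementation, the class and method names are hypothetical, and Jaccard similarity is chosen here as one plausible metric rather than the one ngramdb necessarily uses.

```python
from collections import Counter, defaultdict

class NgramStore:
    """In-memory toy: n-gram counts grouped by context."""

    def __init__(self):
        self.contexts = defaultdict(Counter)

    def insert(self, context, ngram, count=1):
        self.contexts[context][ngram] += count

    def starts_with(self, context, prefix):
        """N-grams in the context that begin with the prefix, sorted."""
        return sorted(n for n in self.contexts[context] if n.startswith(prefix))

    def top(self, context, k):
        """The k most frequent n-grams in the context."""
        return [n for n, _ in self.contexts[context].most_common(k)]

    def similarity(self, a, b):
        """Jaccard similarity of the two contexts' n-gram sets."""
        set_a, set_b = set(self.contexts[a]), set(self.contexts[b])
        union = set_a | set_b
        return len(set_a & set_b) / len(union) if union else 0.0

store = NgramStore()
for ngram in ["george washington", "george washington", "george bush", "first lady"]:
    store.insert("doc1", ngram)
for ngram in ["george washington", "martha washington"]:
    store.insert("doc2", ngram)
```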

ngramdb is experimental and significant changes are likely. We welcome your feedback and input into its future capabilities.

ngramdb is open source under the Apache License, version 2.0.

 https://github.com/mtnfog/ngramdb


Apache OpenNLP Language Detection in Apache NiFi

When making an NLP pipeline in Apache NiFi it can be a requirement to route the text through the pipeline based on the language of the text. But how do we get the language of the text inside our pipeline? This blog post introduces a processor for Apache NiFi that utilizes Apache OpenNLP's language detection capabilities. This processor receives natural language text and returns an ordered list of detected languages along with each language's probability. Your pipeline can get the first language in the list (it has the highest probability) and use it to route your text through your pipeline.

In case you are not familiar with OpenNLP's language detection, it provides the ability to detect over 100 languages. It works best with text containing more than one sentence (the more text the better). It was introduced in OpenNLP 1.8.3.

To use the processor, first clone it from GitHub. Then build it and copy the NAR file to your NiFi's lib directory (and restart NiFi if it was running). We are using NiFi 1.4.0.

git clone https://github.com/mtnfog/nlp-nifi-processors.git
cd nlp-nifi-processors
mvn clean install
cp langdetect-nifi-processor/langdetect-processor-nar/target/*.nar /path/to/nifi/lib/

The processor does not have any settings to configure. It's ready to work right "out of the box." You can add the processor to your NiFi canvas:

You will likely want to connect the processor to an EvaluateJsonPath processor to extract the language from the JSON response and then to a RouteOnAttribute processor to route the text through the pipeline based on the language. Also, this processor will work with Apache NiFi MiNiFi to determine the language of text on edge devices. MiNiFi, for short, is a subproject of Apache NiFi that allows for capturing data into NiFi flows from edge locations.
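The extract-then-route logic can be sketched outside NiFi in a few lines of Python. The response shape used here, an array of objects with language and probability fields ordered from most to least probable, is an assumption made for illustration, so check the processor's actual output before relying on it. The route names are hypothetical as well.

```python
import json

# Hypothetical response shape; the processor's real JSON may differ.
sample_response = '[{"language": "eng", "probability": 0.93}, {"language": "deu", "probability": 0.04}]'

# Hypothetical mapping from language code to a downstream route.
ROUTES = {"eng": "english-pipeline", "deu": "german-pipeline"}

def route_for(response_json, default="unsupported"):
    """Pick the first (most probable) language and map it to a route name."""
    languages = json.loads(response_json)
    if not languages:
        return default
    best = languages[0]["language"]
    return ROUTES.get(best, default)

route = route_for(sample_response)
```

In NiFi terms, reading `languages[0]["language"]` corresponds to the EvaluateJsonPath step, and the `ROUTES` lookup corresponds to RouteOnAttribute.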

Backing up a bit, why would we need to route text through the pipeline depending on its language? The actions taken further down in the pipeline are likely to be language dependent. For instance, the next step might be to tokenize the text, but knowing how to tokenize it requires knowing what language it is. Or, if the next step is to send the text to an entity extraction process, we need to know which entity model to use based on the language. So, language detection in an NLP pipeline can be a crucial initial step. A previous blog post showed how to use NiFi for an NLP pipeline, and extending that pipeline with language detection would be a great addition!

This processor performs the language detection inside the NiFi process. Everything remains inside your NiFi installation. This should be adequate for a lot of use cases, but if you need more throughput, check out Renku Language Detection Engine. It works very similarly to this processor in that it receives text and returns a list of identified languages. However, Renku is implemented as a stateless, scalable microservice, meaning you can deploy as many instances as you need to meet your use case's requirements. And maybe the best part is that Renku is free for everyone to use without any limits.

Let us know how the processor works out for you!


Orchestrating NLP Building Blocks with Apache NiFi for Named-Entity Extraction

This blog post shows how we can create an NLP pipeline to perform named-entity extraction on natural language text using our NLP Building Blocks and Apache NiFi. Our NLP Building Blocks provide the ability to perform sentence extraction, string tokenization, and named-entity extraction. They are implemented as microservices and can be deployed almost anywhere, such as on AWS, on Azure, and as Docker containers.

At the completion of this blog post we will have a system that reads natural language text stored in files on the file system, pulls out the sentences of each, finds the tokens in each sentence, and finds the named-entities in the tokens.

Apache NiFi is an open-source application that provides data flow capabilities. Using NiFi you can visually define how data should flow through your system. Using what NiFi calls "processors", you can ingest data from many data sources, perform operations on the data such as transformations and aggregations, and then output the data to an external system. We will be using NiFi to facilitate the flow of text through our NLP pipeline. The text will be read from plain text files on the file system. We will then:

  • Identify the sentences in input text.
  • For each sentence, extract the tokens in the sentence.
  • Process the tokens for named-entities.

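To get started we will stand up the NLP Building Blocks. This consists of the following applications, each of which is used later in this post:

  • Prose Sentence Detection Engine
  • Sonnet Tokenization Engine
  • Idyl E3 Entity Extraction Engine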

We will launch these applications using a docker-compose script.

git clone https://github.com/mtnfog/nlp-building-blocks
cd nlp-building-blocks
docker-compose up

This will pull the docker images from DockerHub and run the containers. We now have each NLP building block up and running. Let's get Apache NiFi up and running, too.

To get started with Apache NiFi we will download it. It is a big download at just over 1 GB. You can download it from the Apache NiFi Downloads page or directly from a mirror at this link for NiFi 1.4.0. Once the download is done we will unzip the download and start NiFi:

unzip nifi-1.4.0-bin.zip
cd nifi-1.4.0/bin
./nifi.sh start

NiFi will start and after a few minutes it will be available at http://localhost:8080/nifi. (If you are curious you can see the NiFi log under logs/nifi-app.log.) Open your browser to that page and you will see the NiFi canvas as shown below. We can now design our data flow around the NLP Building Blocks!

If you want to skip to the meat and potatoes you can get the NiFi template described below in the nlp-building-blocks repository.

Our source data is going to be read from text files on our computer stored under /tmp/in/. We will use NiFi's GetFile processor to read the file. Add a GetFile processor to the canvas:


Right-click the GetFile processor and click Configure to bring up the processor's properties. The only property we are going to set is the Input Directory property. Set it to /tmp/in/ and click Apply:

We will use the InvokeHTTP processor to send API requests to the NLP Building Blocks, so, add a new InvokeHTTP processor to the canvas:

This first InvokeHTTP processor will be used to send the data to Prose Sentence Detection Engine to extract the sentences in the text. Open the InvokeHTTP processor's properties and set the following values:

  • HTTP Method - POST
  • Remote URL - http://localhost:7070/api/sentences
  • Content-Type - text/plain

Set the processor to autoterminate for everything except Response. We also set the processor's name to ProseSentenceExtractionEngine. Since we will be using multiple InvokeHTTP processors this lets us easily differentiate between them. We can now create a connection between the GetFile and InvokeHTTP processors by clicking and drawing a line between them. Our flow right now reads files from the filesystem and sends the contents to Prose:

The sentences returned from Prose will be in a JSON array. We can split this array into individual FlowFiles with the SplitJson processor. Add a SplitJson processor to the canvas and set its JsonPath Expression property to $.* as shown below:
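For a top-level JSON array, the expression $.* selects every element, so each sentence becomes its own FlowFile. A rough Python equivalent of that split is sketched below. It assumes the Prose response is an array of sentence strings, uses a made-up sample response, and does not model how NiFi serializes each fragment.

```python
import json

def split_json_array(content):
    """Return one item per element of a top-level JSON array, like splitting on $.*."""
    parsed = json.loads(content)
    if not isinstance(parsed, list):
        raise ValueError("expected a JSON array")
    return parsed

# Sample response for illustration only (assumed shape: array of strings).
prose_response = '["George Washington was president.", "This is another sentence."]'
sentences = split_json_array(prose_response)
```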

Connect the SplitJson processor to the ProseSentenceExtractionEngine processor for the Response relationship. The canvas should now look like this:

Now that we have the individual sentences in the text we can send those sentences to Sonnet Tokenization Engine to tokenize the sentences. Similar to before, add an InvokeHTTP processor and name it SonnetTokenizationEngine. Set its method to POST, the Remote URL to http://localhost:9040/api/tokenize, and the Content-Type to text/plain. Automatically terminate every relationship except Response. Connect it to the SplitJson processor using the Split relationship. The result of this processor will be an array of tokens from the input sentence.

While we are at it, let's go ahead and add an InvokeHTTP processor for Idyl E3 Entity Extraction Engine. Add the processor to the canvas and set its name to IdylE3EntityExtractionEngine. Set its properties:

  • HTTP Method - POST
  • Remote URL - http://localhost:9000/api/extract
  • Content-Type - application/json

Connect the IdylE3EntityExtractionEngine processor to the SonnetTokenizationEngine processor via the Response relationship. All other relationships can be set to autoterminate. To make things easier to see, we are going to add an UpdateAttribute processor that sets the filename for each FlowFile to a random UUID. Add an UpdateAttribute processor and add a new property called filename with the value ${uuid}.txt. We will also add a processor to write the FlowFiles to disk so we can see what happened during the flow's execution. We will add a PutFile processor and set its Directory property to /tmp/out/.
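Leaving aside the UpdateAttribute and PutFile steps, the three HTTP calls in this flow can be approximated outside NiFi with a short Python sketch. The URLs and content types are the ones configured on the InvokeHTTP processors. The sketch assumes that Prose returns a JSON array of sentence strings and that Sonnet's response body can be passed unchanged to Idyl E3, which mirrors how the flow forwards each response. The function names are hypothetical.

```python
import json
import urllib.request

PROSE_URL = "http://localhost:7070/api/sentences"
SONNET_URL = "http://localhost:9040/api/tokenize"
IDYL_URL = "http://localhost:9000/api/extract"

def http_post(url, body, content_type):
    """POST a string body and return the response body as text."""
    request = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")

def extract_entities(text, post=http_post):
    """Sentences -> tokens -> entities; returns one parsed response per sentence."""
    sentences = json.loads(post(PROSE_URL, text, "text/plain"))
    results = []
    for sentence in sentences:
        # Tokenize the sentence, then hand the token JSON straight to Idyl E3.
        tokens_json = post(SONNET_URL, sentence, "text/plain")
        results.append(json.loads(post(IDYL_URL, tokens_json, "application/json")))
    return results
```

With the NLP Building Blocks running locally, calling `extract_entities` on the contents of a text file would play the role of the flow's middle three processors. The `post` parameter exists so the HTTP layer can be swapped out when testing.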

Our finished flow looks like this:

To test our flow we are going to use a super simple text file. The full contents of the text file are:

George Washington was president. This is another sentence. Martha Washington was first lady.

Save this file as /tmp/in/test.txt.

Now, make sure the NLP Building Blocks are running. If you have not already started them, run:

git clone https://github.com/mtnfog/nlp-building-blocks
cd nlp-building-blocks
docker-compose up

Now you can start the processors in the flow! The file /tmp/in/test.txt will disappear and three files will appear in /tmp/out/. The three files will have random UUIDs for filenames thanks to the UpdateAttribute processor. If we look at the contents of each of these files we see:

First file:

{"entities":[{"text":"George Washington","confidence":0.96,"span":{"tokenStart":0,"tokenEnd":2},"type":"person","languageCode":"eng","extractionDate":1514488188929,"metadata":{"x-model-filename":"mtnfog-en-person.bin"}}],"extractionTime":84}

Second file:

{"entities":[],"extractionTime":7}

Third file:

{"entities":[{"text":"Martha Washington","confidence":0.89,"span":{"tokenStart":0,"tokenEnd":2},"type":"person","languageCode":"eng","extractionDate":1514488189026,"metadata":{"x-model-filename":"mtnfog-en-person.bin"}}],"extractionTime":2}

The input text was broken into three sentences so we have three output files. In the first file we see that George Washington was extracted as a person entity. The second file did not have any entities. The third file had Martha Washington as a person entity. Our NLP pipeline orchestrated by Apache NiFi read the input, broke it into sentences, broke each sentence into tokens, and then identified named-entities from the tokens.
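If you want to work with the output files programmatically, the responses are straightforward to parse. The sketch below reads the first output file's JSON exactly as shown above and lists each entity's text, type, and confidence. The summarize function and its min_confidence threshold are additions for illustration and are not part of the flow.

```python
import json

# Contents of the first output file, copied from the results above.
first_file = '{"entities":[{"text":"George Washington","confidence":0.96,"span":{"tokenStart":0,"tokenEnd":2},"type":"person","languageCode":"eng","extractionDate":1514488188929,"metadata":{"x-model-filename":"mtnfog-en-person.bin"}}],"extractionTime":84}'

def summarize(output_json, min_confidence=0.0):
    """Return (text, type, confidence) for each entity at or above the threshold."""
    response = json.loads(output_json)
    return [
        (e["text"], e["type"], e["confidence"])
        for e in response["entities"]
        if e["confidence"] >= min_confidence
    ]

entities = summarize(first_file)
```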

This flow assumed the language would always be English but if you are unsure you can add another InvokeHTTP processor to utilize Renku Language Detection Engine. This will enable language detection inside your flow and you can route the FlowFiles through the flow based on the detected language giving you a very powerful NLP pipeline.

There's a lot of cool stuff here but arguably one of the coolest is that by using the NLP Building Blocks you don't have to pay per-request pricing that many of the NLP services charge. You can run this pipeline as much as you need to. And if you are in an environment where your text can't leave your network, this pipeline can be run completely behind a firewall (just like we did in this post).