Philter is now Certified for Cloudera Dataflow

We are excited to announce that Philter is now certified for Cloudera Dataflow (CDF). By leveraging Philter in your Apache NiFi data flows, you can redact protected health information (PHI), personally identifiable information (PII), and other types of sensitive information from your data.

Using Philter in Cloudera Dataflow is as simple as making the Philter processors available to your Apache NiFi instance and adding the processors to your canvas. You can configure the types of information to redact right inside the processor's properties. You can choose whether to use a centralized Philter instance or perform the redaction directly within the Apache NiFi flow. The first option allows for a centralized configuration while the latter provides significant performance improvements.

Philter on Cloudera Dataflow is compatible with all public clouds supported by Cloudera Dataflow.

Get Started

To get started with Philter on Cloudera Dataflow, please contact us and we will guide you through the process. You can also visit our partner information on the Cloudera partner portal.

About Cloudera Dataflow

Cloudera DataFlow (CDF) is a CDP Public Cloud service that enables self-serve deployments of Apache NiFi data flows from a central catalog to auto-scaling Kubernetes clusters managed by CDP. Flow deployments can be monitored from a central dashboard with the ability to define KPIs to keep track of critical data flow metrics. CDF eliminates the operational overhead that is typically associated with running Apache NiFi clusters and allows users to fully focus on developing data flows and ensuring they meet business SLAs. Learn more about Cloudera Dataflow.

 


Philter featured in the AWS Marketplace's Healthcare Compliance Category

We are excited to share that Philter is now a featured product in the AWS Marketplace's Healthcare Compliance category.

The products selected for this feature “help ensure that IT infrastructure is compliant with changing policies and regulations, allowing teams to focus on driving patient-centric innovation.”

 Philter on AWS for Healthcare Compliance Data Sheet

Philter redacts PHI, PII, and other sensitive information from documents and text. With Philter, users can select the types of sensitive information to redact, anonymize, encrypt, or tokenize.

Philter can be launched in your AWS cloud via the AWS Marketplace in just a few minutes. Philter runs entirely within your private VPC so your sensitive data never has to leave your VPC.


New! Philter Add-Ins for Microsoft Office

We are very excited to announce the new Philter Add-Ins for Microsoft Office! These add-ins bring Philter's redaction capabilities directly into your Microsoft Word documents and Microsoft Excel spreadsheets. With these add-ins you can redact or highlight PHI, PII, and other sensitive information in your documents and spreadsheets with a single click. The Philter Add-Ins for Microsoft Office provide a great leap forward in streamlining document redaction.

Download or learn more about the add-ins at https://www.mtnfog.com/products/philter/office-add-ins/.

How do the add-ins work?

Both add-ins add a new content pane that makes Philter available in your document or spreadsheet. In the screenshot below you can see the Philter content pane in Microsoft Word. The user has clicked the Highlight button to highlight the sensitive information in the document. Clicking the Redact button would have redacted the sensitive information instead.

The Philter Add-In for Microsoft Word enables document redaction from directly inside your documents.

 

With the add-ins you can redact PHI and PII directly in your documents, eliminating the separate step of sending your documents to Philter for redaction. This can streamline your document redaction process and give you more control.

What are the licensing details?

The add-ins are free to download and use. An instance of Philter is required but can be shared among users. If you aren't yet enjoying Philter's redaction capabilities we can help you get started (or you can get started on your own in your preferred cloud in about 5 minutes).


Redacting PHI and PII from documents using Java

When you need to redact sensitive information like Protected Health Information (PHI) and Personally Identifiable Information (PII) from documents, the Philter SDK for Java has you covered!

The Philter SDK for Java is an open source project that provides a client SDK for Philter. With this library it is easy to redact PHI and PII from documents using Philter. Here's an example:

In our Maven project we will add the dependency:

<dependency>
  <groupId>com.mtnfog</groupId>
  <artifactId>philter-sdk-java</artifactId>
  <version>1.3.0</version>
</dependency>

Now, we can instantiate a client:

PhilterClient client = new PhilterClient.PhilterClientBuilder()
    .withEndpoint("https://127.0.0.1:8080")
    .build();

Be sure to change the endpoint to the endpoint of your running Philter instance. Now you are ready to redact!

FilterResponse response = client.filter("George Washington was president.");

The text parameter will be sent to Philter for redaction. The returned object will contain a value that is the redacted text. With the default settings, that return value will be {{{REDACTED-ner}}} was president.

If you want to get more details of what happened you can use the explain function instead.

ExplainResponse response = client.explain("George Washington was president.");

The explain function provides insight into what Philter redacted and why it was redacted. You can use explain to help tune your redaction or for troubleshooting.

That's it! Now you are ready to integrate Philter's powerful redaction capabilities into your Java libraries and applications. For full samples check out the project's test class.

The Philter SDK for Java is licensed under the Apache License, version 2. It is available on GitHub.

 


Philter 1.10.1

Philter 1.10.1 adds new features for document and text redaction. Take control of the sensitive information in your text through powerful redaction and encryption capabilities.

Philter 1.10.1 will be available for deployment on the AWS, Azure, and Google Cloud marketplaces in the next few days. Contact us for Docker or on-premises deployments.

New User Interface

This version of Philter introduces a user interface for testing Philter's configuration and managing filter profiles. The user interface can be accessed at https://philter:9090. By default, the user interface communicates with both the Philter service and your web browser over SSL. We are excited to introduce the user interface and look forward to developing it further in future versions.

Two-Way SSL Enabled by Default

Philter cloud marketplace images now have two-way SSL enabled by default. This should reduce manual configuration steps often required to use Philter. See Philter's User's Guide for more information.

Post-Filters Can be Disabled

Philter's post-filters can now be disabled. The post-filters are primarily intended for clean up by doing operations such as removing blank spaces and punctuation. In cases where you might want to leave those characters in identified sensitive information spans you can now individually disable the post-filters.

Phone Number Confidence Values Dynamic based on the Format

Each time a piece of sensitive information is identified in text Philter assigns it a confidence value. This value indicates Philter's "confidence" that the identified text actually is sensitive information. For phone numbers, this confidence value is now dynamic based on the format of the phone number. For example, the phone number (123) 456-7890 would be given a higher confidence than 1234567890. While both are valid phone numbers, the first number is formatted as a phone number. This provides higher assurance that it is a phone number in the text and its confidence value will be higher than the second number's confidence value.
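To make the idea concrete, here is a minimal sketch of format-dependent confidence. It is not Philter's implementation: the class name, the two patterns, and the confidence values (0.95 and 0.60) are all assumptions chosen only to illustrate that a formatted phone number can score higher than a bare run of digits.

```java
import java.util.regex.Pattern;

class PhoneConfidenceSketch {

    // Formatted like "(123) 456-7890" or "123-456-7890".
    private static final Pattern FORMATTED =
            Pattern.compile("\\(\\d{3}\\) \\d{3}-\\d{4}|\\d{3}-\\d{3}-\\d{4}");

    // Ten bare digits, for example "1234567890".
    private static final Pattern BARE = Pattern.compile("\\d{10}");

    // Hypothetical confidence values, chosen only for illustration.
    static final double HIGH = 0.95;
    static final double LOW = 0.60;

    /** Returns a confidence for the candidate, or 0.0 if it matches neither format. */
    static double confidence(String candidate) {
        if (FORMATTED.matcher(candidate).matches()) {
            return HIGH;
        }
        if (BARE.matcher(candidate).matches()) {
            return LOW;
        }
        return 0.0;
    }

    public static void main(String[] args) {
        System.out.println(confidence("(123) 456-7890"));
        System.out.println(confidence("1234567890"));
    }
}
```

A caller could compare the returned value against a threshold to decide whether a candidate should be treated as a phone number at all.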

Prometheus Metrics Enabled by Default

Prometheus metrics are now enabled by default. Enabling the metrics after deployment was a common task among users, and this change removes that manual step. See Philter's User's Guide for more information on the Prometheus metrics.

Standardized Base Images on Ubuntu 20.04 LTS

Previously, Philter deployment images across cloud platforms used a different base operating system image on each cloud. AWS was Amazon Linux and Azure was CentOS. Philter 1.10.0 standardizes on a single base operating system image of Ubuntu 20.04 LTS. This provides a consistent experience using Philter across multiple cloud platforms. Now, all Philter configuration files are in the same locations so regardless of where you deploy Philter the setup and use will be the same.

 


Speeding up Philter document redaction with a GPU

In many cases using Philter on a CPU will provide sufficient performance. However, in deployments where performance has higher importance, using Philter with a GPU can provide a 10x or more improvement in performance.

In this post we will show how Philter running on AWS on an m5.large EC2 instance averaged about 1 second to filter a document. When the same set of documents was filtered using Philter on a p3.2xlarge EC2 instance, the average time per document fell to around 0.1 seconds. That's a significant difference!

Things to Know

As of Philter 1.10.0 the only filter that can use the GPU is the named person's filter. If you aren't using this filter then you will not see any performance benefit from running with a GPU. This will likely change in the future.

Philter with a GPU on AWS EC2

The AWS EC2 p3.2xlarge instance type has a single Tesla V100 GPU. You do need to install the NVIDIA CUDA drivers onto the EC2 instance. Contact us or refer to Philter's User's Guide for the installation scripts. No Philter configuration changes are needed to use the GPU because when a GPU is present Philter will automatically use it. In summary, the steps are:

  1. Launch an instance of Philter on any EC2 instance type.
  2. Stop the Philter instance.
  3. Change the instance type to p3.2xlarge and start the instance.
  4. Install the NVIDIA CUDA drivers. (Contact us or see Philter's User's Guide for the installation scripts.)
  5. Reboot the instance.

Philter is ready to serve API requests!

Monitoring the Performance

You can monitor the performance of Philter using the Prometheus monitor. Once enabled, Philter's metrics will be available for scraping at http://philter:9100/metrics. You will want to look at the philter_ner_entity_time_ms_seconds_sum and philter_ner_entity_time_ms_seconds_count metrics.

The first metric is the total number of milliseconds spent applying the NER (named person's) filter. The second metric is the total number of times the filter was applied. Dividing the first by the second gives the average time spent per application of the filter. The screenshot below shows an example Grafana configuration for those metrics.
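If you want to check the arithmetic outside of Grafana, the division can be sketched in a few lines of Java. This is only an illustration: the class and method names are hypothetical, the sample values in main are made up, and the parser assumes the simple "name value" lines of the Prometheus text format (with an optional label set) rather than covering the whole format.

```java
import java.util.List;

class NerAverageSketch {

    static final String SUM = "philter_ner_entity_time_ms_seconds_sum";
    static final String COUNT = "philter_ner_entity_time_ms_seconds_count";

    /** Finds the first sample for the named metric in the scraped lines. */
    static double sample(List<String> lines, String metric) {
        for (String line : lines) {
            if (line.startsWith("#") || !line.startsWith(metric)) {
                continue; // comment line or a different metric
            }
            String rest = line.substring(metric.length());
            if (rest.startsWith("{")) {
                // Skip a label set such as {instance="a"}.
                rest = rest.substring(rest.indexOf('}') + 1);
            } else if (!rest.startsWith(" ")) {
                continue; // a longer metric name that only shares this prefix
            }
            return Double.parseDouble(rest.trim().split("\\s+")[0]);
        }
        throw new IllegalArgumentException("Metric not found: " + metric);
    }

    /** Average milliseconds per application of the NER filter. */
    static double averageMillis(List<String> lines) {
        double count = sample(lines, COUNT);
        return count == 0 ? 0.0 : sample(lines, SUM) / count;
    }

    public static void main(String[] args) {
        // Made-up sample values, not real measurements.
        List<String> scraped = List.of(
                SUM + " 1250.0",
                COUNT + " 10");
        System.out.println(averageMillis(scraped) + " ms per application");
    }
}
```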


Redacting information from documents doesn't have to be hard

The title says it all. Redacting information from documents doesn't have to be hard, yet it can still go wrong without the appropriate level of care. Whether you are manually redacting information from a document or using a tool like Philter, care is required to make sure the redaction is permanent and the redacted text is unavailable when complete.

The American Bar Association details some notable embarrassing redaction failures that have happened in the legal system. There, the American Bar Association describes how information was redacted from PDF documents by having black boxes drawn over the text. At first glance it appears the information under the black boxes has been redacted. However, by simply selecting all of the text in the PDF and pasting it into another application, such as Notepad or Microsoft Word, the redacted text seemingly magically becomes available! This is not only an embarrassing failure for the law firm but it can also be a very serious breach of sensitive information. That information was not supposed to be available for very specific reasons.

Philter can redact information from PDF documents. For security and to prevent instances such as those described on that page, Philter returns image files instead of modified PDF files. The images are the PDF files but with the sensitive information blacked out. The text under the black rectangles in the image cannot be recovered through copy and paste since there is no text under the black rectangles.

Philter's approach to redacting information from PDFs is that once a document is processed, the redacted information is permanently inaccessible in the output. You still have your original documents that were provided to Philter and now you have the image files containing the permanently redacted text. PDF filtering is available in Philter as of version 1.9.0. We are very excited to offer this capability in Philter and look forward to expanding it through your comments and feedback.

To filter a PDF document, just set the Content-Type header to application/pdf in your request:

curl -k -X POST https://localhost:8080/api/filter --data-binary @file.pdf -H "Content-Type: application/pdf" -o redacted.zip

The response will be saved to the file redacted.zip and it will contain the redacted PDF pages as images.
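The same request can be made from Java. The sketch below assumes Java 11 or later (for java.net.http) and reuses the endpoint and path from the curl command above; the class and method names are hypothetical. Unlike curl with -k, the default HttpClient will not accept a self-signed certificate, so a real deployment would need a trusted certificate or a custom SSLContext.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

class PdfFilterRequestSketch {

    /** Builds a POST to the filter endpoint with the PDF bytes as the body. */
    static HttpRequest buildRequest(String endpoint, byte[] pdfBytes) {
        return HttpRequest.newBuilder(URI.create(endpoint + "/api/filter"))
                .header("Content-Type", "application/pdf")
                .POST(HttpRequest.BodyPublishers.ofByteArray(pdfBytes))
                .build();
    }

    /** Sends the PDF and writes the response body (the zip of page images) to zipOut. */
    static Path redact(String endpoint, Path pdf, Path zipOut) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = buildRequest(endpoint, Files.readAllBytes(pdf));
        HttpResponse<Path> response =
                client.send(request, HttpResponse.BodyHandlers.ofFile(zipOut));
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 2) {
            // Usage: java PdfFilterRequestSketch.java file.pdf redacted.zip
            System.out.println(redact("https://localhost:8080", Path.of(args[0]), Path.of(args[1])));
        } else {
            // Without arguments, only build the request so nothing is sent.
            System.out.println(buildRequest("https://localhost:8080", new byte[0]).uri());
        }
    }
}
```

The sketch does not check the HTTP status code; production code should do so before trusting the contents of the output file.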


Philter Price Reduction

We are thrilled to announce a price reduction for Philter on the cloud marketplaces. Philter pricing now starts at $0.49/hr, down from $0.79/hr. Where applicable, the annual pricing has been reduced accordingly. The pricing is tiered depending on the size of the compute instance running Philter.

The price change will take effect across the AWS, Azure, and Google Cloud marketplaces in the coming week. This pricing change will not affect support or managed services.

We would like to take a moment to thank our users for helping to make this price reduction possible. It is because of our supportive users that we are able to make this change.

AWS Marketplace

Philter 1.8.0

Philter 1.8.0 has been released.

This version brings:

  • The ability to capture timing metrics for each of the filter types. Capturing these metrics will provide insights into the performance of the filters.
  • The ability to specify terms to ignore in files for each filter type. Previously, lists of ignored terms had to be specified in the filter profile. Being able to specify the terms to ignore in files outside the filter profile allows for cleaner and easier to manage filter profiles.

Philter 1.8.0 is now available for deployment from the cloud marketplaces.

Launch Philter in your cloud. See Philter's full Release Notes.


Protecting Sensitive Information in Streaming Platforms

Streaming platforms like Apache Kafka and Apache Pulsar provide wonderful capabilities around ingesting data. With these platforms we can build all types of solutions across many industries from healthcare to IoT and everything in between. Inevitably, the problem arises of how to deal with sensitive information that resides in the streaming data. How do we make sure that data never crosses a boundary? How do we keep that data safe? How can we remove the sensitive information from the incoming data so we can continue processing the data? These are all very good questions to ask, and in this post we present a couple of architectures to address them. These architectures, along with Philter, can help protect the sensitive information in your streaming data.

Whether you are using Apache Kafka, Apache Pulsar, or some other streaming platform is largely irrelevant. Each of these platforms is largely built on top of the same concepts and they even share quite a bit of terminology, such as brokers and topics. (A broker is a single instance of Kafka or Pulsar and a topic is how the streaming data is organized when it reaches the broker.)

Streaming Healthcare Data

Let's assume you have an architecture with a three-broker installation of Apache Kafka that is accepting streaming data from a hospital. This data contains patient information which has PII and PHI. An external system is publishing data to your Apache Kafka brokers. The brokers receive the data and store it in topics, and a downstream system consumes from Apache Kafka, computes statistics by analyzing the text, and persists the results of the analysis into a database. Even though this is a hypothetical scenario, it is an extremely common deployment architecture around distributed and streaming technologies.

Now ask yourself the questions we mentioned previously. How do we keep the PII and PHI secure in our streaming data? The downstream processor does not care about the PII and PHI since it is only aggregating statistics. Having those downstream systems process the data containing PII and PHI puts our system at risk of inadvertent HIPAA violations by enlarging the perimeter of the system containing PII and PHI. Removing the PII and PHI from the streaming data before it gets consumed by the downstream processor would help keep the data safe and our system in compliance.

Philter finds and removes sensitive information from text. Learn more about Philter.

Filtering the Sensitive Information from the Streaming Data

There are a couple of things we can do to remove the PII and PHI from the streaming data before it gets to the downstream processor.

The first option is to use Apache Kafka Streams or an Apache Pulsar Function (depending on which you are running) to consume the data, filter out the PII and PHI, and publish the filtered text back to Kafka or Pulsar on a different topic. Then update the name of the topic the downstream processor consumes from. The raw data from the hospital containing PII and PHI will stay in its own topic. You can use Apache Kafka ACLs on the topics to help prevent someone from inadvertently consuming from the raw topic and only permit them to consume from the filtered topic. If, however, the idea of the raw data containing PII and PHI existing on the brokers is a concern, then continue on to option two below.

The second option is to use a second Apache Kafka or Apache Pulsar cluster. Place this cluster between the existing cluster and the downstream processor. Create an application to consume from the topic on the first cluster's brokers, remove the PII and PHI, and then publish the filtered data to a topic on the new brokers. (You can use something like Apache Flink to process the data. At the time of writing, Kafka Streams cannot be used because it requires the source brokers and the destination brokers to be the same.) In this option, the raw data containing PII and PHI is physically separated from the filtered data by residing on its own brokers.

Which option is best for you depends on your requirements around processing and security. In some cases, separate brokers may be overkill. But in other cases it may be the best option due to the physical boundary it creates between the raw data and the filtered data.
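Both options come down to the same loop: consume from the raw topic, filter, and publish to the filtered topic. The sketch below shows only that shape, using in-memory queues as stand-ins for the topics so it runs without a broker. It is not Kafka, Pulsar, or Philter code: the class name, the relay method, and the literal-replacement redactor are all assumptions for illustration, and a real stage would call Philter where the redactor is applied.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.UnaryOperator;

class FilterStageSketch {

    /**
     * Drains every message currently on the raw "topic", applies the redactor,
     * and publishes the result to the filtered "topic". Returns the number relayed.
     */
    static int relay(BlockingQueue<String> rawTopic,
                     BlockingQueue<String> filteredTopic,
                     UnaryOperator<String> redactor) {
        int relayed = 0;
        String message;
        while ((message = rawTopic.poll()) != null) {
            filteredTopic.add(redactor.apply(message));
            relayed++;
        }
        return relayed;
    }

    public static void main(String[] args) {
        BlockingQueue<String> raw = new LinkedBlockingQueue<>();
        BlockingQueue<String> filtered = new LinkedBlockingQueue<>();
        raw.add("Patient 123-45-6789 admitted.");

        // Stand-in redactor: a real deployment would send the text to Philter here.
        UnaryOperator<String> redactor =
                text -> text.replace("123-45-6789", "{{{REDACTED-ssn}}}");

        relay(raw, filtered, redactor);
        System.out.println(filtered.peek());
    }
}
```

The downstream processor only ever reads from the filtered queue, which mirrors pointing it at the filtered topic (option one) or at the second cluster (option two).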

Refreshed cloud images

While we continue development on Philter 1.7.0 we have released minor updates to the AWS, Azure, and Google Cloud marketplaces. The only change in these minor updates is a refreshed base image that includes all available operating system updates. There are no changes to the Philter software. In the future we plan to consolidate our refreshed image updates for each cloud to minimize the number of separate versions.

There is no need to update to the minor release version if you are already applying operating system updates to your existing Philter instances.

You can find the details of all releases in Philter's Release Notes.


The Performance of Philter

In all of our design and development of Philter, performance is always one of our top priorities. We ask ourselves questions like how can we implement this awesome new feature in Philter without negatively impacting performance? What can we do to improve performance?

However, the word "performance" can have a few different meanings when used in relation to Philter. In this post I want to dive down into the "performance" of Philter and how it impacts Philter's development.

Performance: Efficient processing of text

The first meaning of performance relates to how efficient Philter is when processing text. Philter takes input text, filters the sensitive information, and returns the output text. (That middle step is simplified a lot but hopefully you get the idea.) If any of these steps are not efficient, or performant, Philter won't be usable. Your client applications will time out and you will not want to use Philter.

Any new features or modifications to Philter's filtering capabilities have to be carefully designed and implemented. Even a small, seemingly innocent change can have large negative effects on performance. Because of that we as the developers must be careful and test accordingly. We use efficient data structures and make careful choices to select the operations that will provide the best performance.

This type of performance is typically measured in compute time and that's how we measure it. We have thousands of test cases that we execute with each new build of Philter. Over time we can see a history of the processing time and its downward trend as Philter gets more efficient.

Performance: Labeling the appropriate information as sensitive

The second meaning of performance may sometimes be referred to as accuracy. This meaning relates to how well Philter identified the sensitive information in the input text. Was all the sensitive information that Philter identified actually sensitive? Were there any false positives? False negatives? This type of performance is typically measured by a percentage, or by terms from information retrieval such as precision and recall.

In some cases, Philter's identification of sensitive information is non-deterministic, meaning statistics and machine learning algorithms are applied to locate sensitive information. Contrast this with a deterministic process such as looking for terms from a dictionary. How some of Philter's filters identify sensitive information can be controlled through a sensitivity level. Setting the sensitivity to high will likely identify more sensitive information but also produce more false positives. Conversely, setting the sensitivity to low will likely result in finding less sensitive information and more false negatives. The sensitivity level of medium aims to bridge this gap. In some cases, a false positive may be more acceptable than a false negative, so a high sensitivity level is used. For the information retrieval folks out there, this is known as maximizing the recall.
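For readers less familiar with those terms, precision is the share of identified items that really were sensitive, and recall is the share of truly sensitive items that were identified. A minimal sketch of the arithmetic follows; the class name is hypothetical and the counts in main are made up, not measurements of Philter.

```java
class PrecisionRecallSketch {

    /** Of everything flagged as sensitive, the fraction that really was sensitive. */
    static double precision(int truePositives, int falsePositives) {
        int flagged = truePositives + falsePositives;
        return flagged == 0 ? 0.0 : (double) truePositives / flagged;
    }

    /** Of everything that really was sensitive, the fraction that was flagged. */
    static double recall(int truePositives, int falseNegatives) {
        int actual = truePositives + falseNegatives;
        return actual == 0 ? 0.0 : (double) truePositives / actual;
    }

    public static void main(String[] args) {
        // Made-up counts: 90 correct identifications, 30 false positives, 10 missed items.
        System.out.println("precision = " + precision(90, 30));
        System.out.println("recall    = " + recall(90, 10));
    }
}
```

Raising the sensitivity level trades precision for recall: fewer items are missed, at the cost of more false positives.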

For sensitive information like person's names we offer various models trained for specific domains. The purpose of this is to provide a higher level of accuracy when Philter is used in those domains.

Putting them together

Philter's "performance" is both of these. Philter must perform well in terms of time and processing efficiency as well as finding the appropriate sensitive information. We believe that both types are equally important. A system that takes hours to complete but with more accuracy may be just as unusable as a system that completes in milliseconds but finds no sensitive information.

If you are not yet using Philter to find and remove sensitive information from your text it's easy to get started. Just click on your platform of choice below. And if you need help please don't hesitate to reach out. We enjoy helping.

AWS Marketplace

Preventing PII and PHI from Leaking into Application Logs

Introduction

This blog post demonstrates how to use Philter to find and remove sensitive information from application logs. In this post we use log4j and Apache Kafka but the concepts can be applied to virtually any logging framework and streaming system, such as Amazon Kinesis. (See this post for a similar solution using Kinesis Firehose Transformations to filter sensitive information from text.)

There is a sample project for this blog post available here on GitHub that you can copy and adapt. The code demonstrates how to publish log4j log messages to Apache Kafka, which is required for the logs to then be consumed and filtered by Philter as described in this post.

PII and PHI in application logs

Developing a system that will contain personally identifiable information (PII) or protected health information (PHI) can be a challenge. Encryption of data in motion and at rest and the maintenance of audit logs have to be considered, just to name a few concerns.

One part of the application's development that's easy to disregard is application logging. Logs make our lives easier. They help developers find and fix problems and they give insights into what our applications are doing at any given time. But sometimes even the seemingly most innocuous log messages can be found to contain PII and PHI at runtime. For example, if an application error occurs and an error is logged, some piece of the logged message could inadvertently contain PII and PHI.

Having a well-defined set of rules for logging when developing these applications is an important thing to consider. Developers should always log user IDs and not user names or other personally identifying values. Reviews of pull requests should consider those rules as well to catch any violations that might have been missed. But even then, can we be sure that there will never be any PII or PHI in our applications' logs?

Philter and application logs

To gain more confidence, you can process your applications' logs with Philter. In this post we are using Java and the log4j framework but nearly all modern programming languages have similar capabilities or have a similar logging framework. If you are developing on .NET you can use log4net and one of the third-party Apache Kafka appenders available through NuGet.

We are going to modify our application's log4j logging configuration to publish the logs to Apache Kafka. Philter will consume the logs from Apache Kafka and the filtered logs will be persisted to disk. So, to implement the code in this post you will need at least one running Apache Kafka broker and an instance of Philter. Some good music never hurts either.

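Spin up Kafka in a Docker container. Run this in its own directory, because the Philter download further down saves a file with the same docker-compose.yml name:

curl -O https://raw.githubusercontent.com/confluentinc/cp-all-in-one/5.5.1-post/cp-all-in-one/docker-compose.yml
docker-compose up

Spin up Philter in a Docker container, from a separate directory and terminal:

curl -O https://raw.githubusercontent.com/mtnfog/philter/master/docker-compose.yml
docker-compose up

Create a philter topic (run this from the Kafka directory):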

docker-compose exec broker kafka-topics --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic philter

Now we're good to go.

In a later post we will accomplish the same thing of filtering PII and PHI from application logs but with Apache Pulsar instead of Apache Kafka. Apache Pulsar has the concept of Functions that make manipulating the streaming data with Philter very easy. Stay tuned for that post!

Logging to Apache Kafka

Using log4j to output logs to Apache Kafka is very easy. Through the KafkaAppender we can configure log4j to publish the logs to an Apache Kafka topic. Here's an example of how to configure the appender:

<?xml version="1.0" encoding="UTF-8"?>
...
<Appenders>
  <Kafka name="Kafka" topic="philter">
    <PatternLayout pattern="%date %message"/>
    <Property name="bootstrap.servers">localhost:9092</Property>
  </Kafka>
</Appenders>

With this appender in our log4j2.xml file, our application's logs will be published to the Kafka topic called philter. The Kafka broker is running at localhost:9092. Great! Now we're ready to have Philter consume from the topic to remove PII and PHI from the logs.

Using Philter to remove PII and PHI from the logs

To consume from the Kafka topic we are going to use the kafkacat utility. This is a nifty little tool that makes some Kafka operations really easy from the command line. (You could also use the standard kafka-console-consumer.sh script that comes with Kafka.) The following command consumes from the philter topic and writes the messages to standard out.

kafkacat -b localhost:9092 -t philter

Instead of writing the messages to standard out, we need to send them to Philter where any PII or PHI in them can be removed, so we will pipe the text to the Philter CLI. The Philter CLI will send the text to Philter and write the filtered text to standard out.

kafkacat -C -b localhost:9092 -t philter -q -f '%s\n' -c 1 -e | ./philter-cli-linux-amd64 -i -h https://localhost:8080

This command uses kafkacat to consume from our philter topic. In this command we are telling kafkacat to be quiet (-q) and not produce any extraneous output, to format the output by displaying only the message (-f), to consume only a single message (-c), and to exit (-e) after doing so. The result of this command is that a single message is consumed from Kafka and sent to Philter using the philter-cli. The filtered text is then written to the console.

The flow of the messages

Our example project logs 100 messages to the Kafka topic. Each message looks like the following where 99 is just an incrementing integer:

08:43:10.066 [main] INFO com.mtnfog.App - This is a sample log message 99. Beginning processing for patient 123-45-6789.

The output from Philter after processing the message is:

08:43:10.066 [main] INFO com.mtnfog.App - This is a sample log message 99. Beginning processing for patient {{{REDACTED-ssn}}}.

We can see that Philter identified 123-45-6789 as a social security number and redacted it from the text. This is a simple example but based on how Philter is configured (specifically its filter profile) we could have been looking for many other types of sensitive information such as person's names, unique identifiers, ages, dates, and so on.
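To show what a pattern-based replacement like this looks like in code, here is a minimal Java sketch that reproduces the transformation above with a regular expression. It is not Philter's implementation and does not use a filter profile; the class name and the pattern are assumptions, and a real SSN filter would need to handle more formats and validation than this.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SsnRedactionSketch {

    // Matches the common 3-2-4 digit layout, for example 123-45-6789.
    private static final Pattern SSN = Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b");

    // The replacement token format shown in the output above.
    private static final String TOKEN = "{{{REDACTED-ssn}}}";

    /** Replaces every SSN-shaped substring in the line with the redaction token. */
    static String redact(String line) {
        return SSN.matcher(line).replaceAll(Matcher.quoteReplacement(TOKEN));
    }

    public static void main(String[] args) {
        String message = "08:43:10.066 [main] INFO com.mtnfog.App - This is a sample log message 99. "
                + "Beginning processing for patient 123-45-6789.";
        System.out.println(redact(message));
    }
}
```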

We also just wrote the filtered log to standard out. We could have easily redirected it to a file, to another Kafka topic, or to somewhere else. We also used the philter-cli to send the text to Philter. You could have also used curl or one of the Philter SDKs.

Summary

In this blog post we showed how Philter can find and remove sensitive information from application log files using log4j's Kafka appender, Apache Kafka, and kafkacat. We hope this example is useful. A sample project to produce log messages to a Kafka topic is available on GitHub.


How We Train Philter's NLP Models

In this post I want to give some insight into how we create and train the NLP (natural language processing) models that Philter uses to identify entities like person's names in text.

Read this first :)

As a user of Philter you don't need to understand or even be aware of how we train Philter's NLP models. But it is helpful to know that Philter's NLP model can be changed based on your domain. For example, we offer some models trained specifically for the healthcare domain. These models were trained to give better performance when using Philter in a healthcare environment. See the bottom of this post for a list of the currently available NLP models for Philter.

What is NLP?

Some sensitive information can be identified by Philter based on patterns (SSNs) or dictionaries. A person's name doesn't follow a pattern, and while it may be found in a dictionary there isn't any guarantee your dictionary will contain all possible names. To identify person's names we rely on a set of techniques collectively known as natural language processing, or NLP.

NLP is a broad term used to describe many types of methods and technologies used to extract information from unstructured, or natural language, text. Some foundational common NLP tasks are to identify the language of some given text and to label the words in a sentence with their parts-of-speech types. More advanced tasks include named-entity recognition, summarizing text passages in a few sentences, translating text from one language to another, and determining the sentiment of a given text. It's a very exciting time in NLP due to lots of recent advancements in neural networks, GPU hardware, and just an explosion in the number of researchers and practitioners in the NLP community.

How does NLP work?

NLP tasks often require a trained model to operate. For instance, language translation requires a model that is able to take words and phrases in one language and produce the equivalent text in another language. The model is trained on matching sets of text in both languages. How the words and phrases are used helps the model determine how the text should be translated. Identifying person's names in text also requires a trained model. Training this type of model requires text that has been annotated, meaning that the entities have been labeled. The training algorithms use these labels to teach the model to identify names in text it has not seen before. An example of an annotated sentence:

{person}George Washington{/person} was president.

There are different annotation formats created for different purposes but I'm sure you get the idea. With annotated text we can train our model to know what a person's name looks like when the model is applied to unlabeled text. That's essentially all there is to it.
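
To make the annotation idea concrete, here is a minimal sketch in Python that turns the annotated sentence above into plain text plus labeled character spans, which is roughly the form of input a training tool expects. The function name and the span layout are our own choices for illustration and are not part of Philter or any particular NLP framework.

```python
import re

# Matches the {label}text{/label} annotation style shown above.
ANNOTATION = re.compile(r"\{(\w+)\}(.*?)\{/\1\}")

def parse_annotations(annotated):
    """Return (plain_text, [(start, end, label), ...]) for an annotated string."""
    plain_parts = []
    spans = []
    cursor = 0  # current position in the annotated input
    offset = 0  # current position in the plain-text output
    for match in ANNOTATION.finditer(annotated):
        # Copy the unannotated text that precedes this entity.
        before = annotated[cursor:match.start()]
        plain_parts.append(before)
        offset += len(before)
        # Record where the entity sits in the plain text.
        label, text = match.group(1), match.group(2)
        spans.append((offset, offset + len(text), label))
        plain_parts.append(text)
        offset += len(text)
        cursor = match.end()
    plain_parts.append(annotated[cursor:])
    return "".join(plain_parts), spans

text, spans = parse_annotations("{person}George Washington{/person} was president.")
print(text)   # George Washington was president.
print(spans)  # [(0, 17, 'person')]
```

The spans tell the training process exactly which characters are a person's name, so the model can learn from the words around them.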

There are lots of fantastic open-source tools with active user communities for natural language processing. If you are interested in learning the nuts and bolts of NLP, choose a framework in your preferred programming language to lower the learning curve and dive in! The techniques and terminology learned from using one framework will always apply to a different framework even if it is in a different programming language so you aren't at any risk of lock-in.

How We Train Philter's NLP Models

As described above, training our model requires annotated text. We have annotated text for various domains. We use this annotated text, along with a set of word embeddings, a few GPUs, and some time, to train the models for Philter. The output of the training is a file which contains the model. The model can then be used by Philter to identify person's names in text.

Evaluating a Model's Performance

To get an idea of how our model will perform we use two common metrics called precision and recall. These metrics tell us how well the model performs on our test data. We don't need to get into the details of precision and recall here. However, one important thing to know is that we often try to maximize recall when training the model. Maximizing recall means it is better to label some text as a person's name even if it is not one than to risk missing an actual person's name. When dealing with sensitive information in text it can be advantageous to err on the side of caution rather than risk leaving a person's name unfiltered. Restated, maximizing recall means false positives are more acceptable than false negatives.
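
For readers who do want the arithmetic, here is a small Python sketch of the two metrics. Precision is the share of labeled names that really were names, and recall is the share of real names that were labeled. The counts below are made up purely for illustration and are not results from any Philter model.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Compute (precision, recall) from entity counts on a test set."""
    predicted = true_positives + false_positives  # everything the model labeled
    actual = true_positives + false_negatives     # everything it should have labeled
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall

# Hypothetical counts for two models evaluated on the same 100 names.
cautious = precision_recall(true_positives=95, false_positives=20, false_negatives=5)
strict = precision_recall(true_positives=80, false_positives=2, false_negatives=20)

print("cautious: precision=%.2f recall=%.2f" % cautious)
print("strict:   precision=%.2f recall=%.2f" % strict)
```

The cautious model mislabels more ordinary words as names, but it misses only 5 real names instead of 20. For redaction, that is usually the better trade.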

Currently Available Models for Philter

Once we are satisfied with the model's performance we publish it and make it available on our website. We have models for general usage and models more specialized for specific domains such as healthcare. We are continuously training and updating our models to keep them current and improve their performance. The model included with Philter is a general usage model.

To stay up to date on model updates please follow us on Twitter.



Philter - A Real-World Use-Case

Philter finds, identifies, and removes sensitive information from text. That's a very good and short description of Philter, but, as they say, a picture is worth a thousand words. In this post we will detail an actual, real-world use-case of Philter as we paint a picture with words!

"Super Helpdesk"

The Philter customer, we'll call them Super Helpdesk, is a provider of a software-as-a-service helpdesk solution. Their customers sign up to be able to offer a helpdesk to their customers. (Following? :) Super Helpdesk's users need the ability to optionally prevent sensitive information from being passed through in tickets. If a customer enters something sensitive they want to remove it from the ticket before the ticket enters the workflow.

In this case, the sensitive information Super Helpdesk is most worried about is credit card numbers. Due to security best practices and regulations like PCI-DSS, credit card numbers cannot exist in helpdesk tickets where they may be stored or transmitted unencrypted. Super Helpdesk needed a way to analyze the tickets entering their system in order to filter out the credit card numbers from the tickets.

The Solution

At a high-level, Super Helpdesk deployed Philter (in this case running on EC2 in AWS) to perform the filtering of the content of the helpdesk tickets. As new helpdesk tickets are submitted, the content of the ticket is sent to Philter and Philter immediately returns the content of the ticket with the credit card numbers redacted to just the last four digits. (Super Helpdesk also added an option for their users to control how Philter redacts the credit card numbers, with the available options being redact all or redact all but the last four digits.)

Now for the low-level implementation details! When new helpdesk tickets come in they are published to an Apache Kafka topic. A process consumes from the topic, does processing on the ticket, and ultimately inserts the ticket into a backend database. This process, written in Java, was modified to make use of the Philter Java SDK to enable the communication between the process and Philter.

We have found this to actually be a very common Extract-Transform-Load (ETL) design scenario across industries. Data in the form of text flows from an external system through a pipeline facilitated by Apache Kafka or Amazon Kinesis Firehose into an internal database. Along the way the data needs to be manipulated in some manner. In our case the data manipulation is to remove sensitive information from the text. Philter's API allows it to slide nearly seamlessly into the existing pipeline. Like Super Helpdesk did, just insert a step to send the text to Philter for filtering.

We made a previous blog post about using Philter inside of an AWS Kinesis Firehose using a Firehose Transformation. It describes how to make a Lambda function to invoke Philter on the text going through the pipeline to filter the text. Check it out at the link below.

Using AWS Kinesis Firehose Transformations to Filter Sensitive Information from Streaming Text

But, wait, why Philter?

You are probably saying, well, that seems like overkill for a simple problem to redact credit card numbers! Credit card numbers follow a well-defined pattern so why not just use a regular expression to find them? If all you want to do is find credit card numbers then a regular expression definitely may work.
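
For reference, a regex-only approach might look like the following Python sketch. The pattern (13 to 16 digits, optionally separated by spaces or hyphens) and the asterisk masking are simplifications chosen for illustration. This is not how Philter identifies or redacts card numbers.

```python
import re

# 13 to 16 digits, optionally separated by single spaces or hyphens.
CARD = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")

def mask_cards(text, keep_last_four=True):
    """Mask anything that looks like a card number, optionally keeping the last four digits."""
    def mask(match):
        digits = re.sub(r"\D", "", match.group())
        if keep_last_four:
            return "*" * (len(digits) - 4) + digits[-4:]
        return "*" * len(digits)
    return CARD.sub(mask, text)

print(mask_cards("My card is 4111 1111 1111 1111, please refund it."))
# My card is ************1111, please refund it.
```

This handles the simple case, but every additional rule (per-brand handling, per-customer settings, conditions) has to be written and maintained by hand.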

So what does using Philter give us? A good bit actually. Through the use of filter profiles, Philter can have a pre-set list of types of sensitive information. Each type of sensitive information can have its own redaction logic. For example, you could redact VISA card numbers while truncating AMEX card numbers. Or, you could only leave the last four digits of card numbers matching a condition. Additionally, each customer of the helpdesk platform may have different requirements around sensitive information. That logic can also be encapsulated in filter profiles. The regular expression logic just got more complicated.

Philter provides other features as well, such as the ability to capture metrics on the data, ability to encrypt the credit card numbers instead of removing them, and the ability to disambiguate between different types of sensitive information.

Lastly, a regular expression will never be able to find non-deterministic types of sensitive information like person's names. Philter's natural language processing (NLP) capabilities are able to find entities like person's names that do not follow any set pattern.

Try Philter

Deploying Philter to AWS, Azure, or GCP is easy because Philter is available through each of the cloud's marketplaces. Simply follow the marketplace steps to launch an instance of Philter in your private cloud.

Launch Philter on AWS (Philter version 2.1.0)
Launch Philter on Azure (Philter version 2.1.0)
Launch Philter on Google Cloud (Philter version 2.1.0)

Share your experience!

We would love to hear how you are using Philter. Share your experience with us!


Philter 1.5.0

Happy Friday! We are in the process of publishing Philter 1.5.0. Philter identifies and removes sensitive information in text. Look for Philter 1.5.0 to be available on the cloud marketplaces soon.

This version has a few new features in addition to minor improvements and fixes. The new features are described below.

New "Section" Filter

Philter 1.5.0 includes a new filter type called a "Section." This filter type lets you specify patterns that indicate the start and end of a section of text. For example, if your text has sentences or even paragraphs denoted with some marker, you can use the Section filter to redact those sentences or paragraphs. You just give the filter the regular expression patterns for the start and end markings. We have added the Section filter to the filter profiles documentation.
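
The general idea behind a start/end pattern filter can be sketched in a few lines of Python. The marker strings, the function name, and the replacement text here are hypothetical and only illustrate the concept. See the filter profiles documentation for how the Section filter is actually configured.

```python
import re

def redact_sections(text, start_pattern, end_pattern, replacement="{{{REDACTED-section}}}"):
    """Replace everything from a start marker through the next end marker."""
    section = re.compile(
        "(?:" + start_pattern + ").*?(?:" + end_pattern + ")",
        re.DOTALL,  # let a section span multiple lines
    )
    # A function is used so the replacement is inserted literally.
    return section.sub(lambda match: replacement, text)

note = "Visit notes. BEGIN-PRIVATE Patient is John Doe. END-PRIVATE Follow up in two weeks."
print(redact_sections(note, r"BEGIN-PRIVATE", r"END-PRIVATE"))
# Visit notes. {{{REDACTED-section}}} Follow up in two weeks.
```

The non-greedy match stops at the first end marker, so two separate sections in the same document are redacted independently.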

Amazon S3 to Store Filter Profiles

We have added the ability to store the filter profiles in an Amazon S3 bucket. The benefit of this is that filter profiles can now be shared across multiple instances of Philter. Previously, if you were running two instances of Philter you would have to update the filter profiles on each instance. By storing the filter profiles in S3 you can just update the filter profiles once via Philter's API. This does require a cache. The cache stores the filter profiles to lower the latency and reduce the number of calls to S3. (More on the cache below.)

We have published some CloudFormation and Terraform scripts to help with creating this architecture on GitHub.

Consolidated Caches

Philter previously used caches for the random anonymization values. With the introduction of a cache for the filter profiles stored in S3, we have consolidated those caches into a single cache. As a result, the cache configuration properties have been renamed slightly. We have updated Philter's documentation with the renamed properties. Having a single cache means there is less to configure and fewer required resources.

If you are upgrading from a previous version you will need to change to the new cache property names.

Changeable Model File

The model file used by Philter can now be set in Philter's application.properties. Check out Philter's documentation for the details. By being able to set the model being used you can now select which model is most applicable to your use-case and domain.


CloudFormation template for a highly-available Philter

We now have an AWS CloudFormation template to deploy an auto-scaled, highly-available Philter environment to identify and remove sensitive information from text. This template creates a VPC, load balancer, Philter instances, a Redis cache, and all required networking and security group configuration. Click the Launch Stack button to begin launching the stack.

In a deployment of Philter that is a single EC2 instance, the EC2 instance is a single point of failure with no ability to respond to fluctuations in demand. By deploying more than one EC2 instance we can protect our application against failure and be able to scale up and down as needed.

The benefit of using this CloudFormation template is that it provides a pre-configured Philter architecture and deployment that is highly-available, scalable, and encrypts all data in-transit and all data at rest. Your API requests to Philter to filter sensitive information from text will have higher throughput since the load balancer will distribute those requests across the Philter instances. And as described below, the stack uses end-to-end encryption of data at-rest and in-transit.

The stack requires an active subscription to Philter via the AWS Marketplace. The template supports us-east-1, us-east-2, us-west-1, and us-west-2 regions.

The CloudFormation template is available in the philter-infrastructure-as-code repository on GitHub.

The Philter Stack Architecture

The deployment creates an elastic load balancer that is attached to an auto-scaled group of Philter EC2 instances. The load balancer spans two public subnets and the Philter EC2 instances are spread across two private subnets. Also in the private subnets is an Amazon ElastiCache for Redis replication group. A NAT Gateway located in one of the public subnets provides outgoing internet access by routing the traffic to the VPC's Internet Gateway.

The load balancer will monitor the status of each Philter EC2 instance by periodically checking the /api/status endpoint. If an instance is found to be unhealthy after failing several consecutive health checks the failing instance will be replaced.

The Philter auto-scaling group is set to scale up and down based on the average CPU utilization of the Philter EC2 instances. When the CPU usage hits the high threshold another Philter EC2 instance will be added. When the CPU usage hits the low threshold, the auto-scaling group will begin removing (and terminating) instances from the group. The scaling policy is set to scale up at a faster rate than it scales down to avoid removing capacity too quickly.

End-to-end Encryption

Incoming traffic to the load balancer is received by a TCP protocol handler on port 8080. These requests are distributed across the available Philter EC2 instances. The encrypted incoming traffic is terminated at the Philter EC2 instances. Network traffic between the ElastiCache for Redis nodes is encrypted, and the data at-rest in the cache is also encrypted. The Philter EC2 instances use encrypted EBS volumes.

Launch the Stack

Click the Launch Stack button to launch the stack in your AWS account, or get the template here, or launch the stack using the AWS CLI with the command below.

aws cloudformation create-stack --stack-name philter --template-url s3://mtnfog-public/philter-resources/philter-vpc-load-balanced-with-redis.json

Once the stack completes Philter will be ready to accept requests. There will be an Output value called PhilterEndpoint. This value is the Philter API URL.

For example, if the value of PhilterEndpoint is https://philter2-philterlo-5lc0jo7if8g1-586151735.us-east-1.elb.amazonaws.com:8080/, then you can check Philter's status using the command:

curl -k https://philter2-philterlo-5lc0jo7if8g1-586151735.us-east-1.elb.amazonaws.com:8080/api/status

You can try a quick sample filter request with:

curl -k "https://philter2-philterlo-5lc0jo7if8g1-586151735.us-east-1.elb.amazonaws.com:8080/api/filter" \
  --data "George Washington lives in 90210 and his SSN was 123-45-6789." \
  -H "Content-type: text/plain"
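
If you would rather call the API from code than from curl, the same request can be sketched with Python's standard library. The /api/filter path, the POST method, and the text/plain content type come from the curl command above. The helper names and the example hostname are placeholders of our own. The verify_tls switch mirrors curl's -k flag and should only be turned off for testing.

```python
import ssl
import urllib.request

def build_filter_request(endpoint, text):
    """Build a POST request for Philter's /api/filter endpoint."""
    url = endpoint.rstrip("/") + "/api/filter"
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )

def send(request, verify_tls=True):
    """Send the request and return the filtered text."""
    context = ssl.create_default_context()
    if not verify_tls:
        # Equivalent to curl -k: skip certificate checks (testing only).
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, context=context) as response:
        return response.read().decode("utf-8")

# Placeholder endpoint: substitute the PhilterEndpoint output value from your stack.
request = build_filter_request(
    "https://philter.example.com:8080/",
    "George Washington lives in 90210 and his SSN was 123-45-6789.",
)
print(request.get_method(), request.full_url)
# filtered = send(request, verify_tls=False)
```

The last line is commented out because it needs a running Philter instance to answer the request.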

Philter Studio 1.0.0

Philter Studio 1.0.0 is now available. Philter Studio is an application for Windows 7/10 that provides convenient access to removing sensitive information from files and documents using Philter.

With Philter Studio’s intuitive interface you can quickly and easily utilize Philter to find and remove sensitive information from your files. Process files one at a time or queue up entire directories and process all files with a single click. Philter Studio supports finding and removing sensitive information in Microsoft Word files (.doc and .docx). Philter Studio can enable track changes so the redactions can be viewed while editing the document.

Philter Studio lets you take a deep look at how the sensitive information in your text was identified and removed. The Compare and Explain feature visually highlights the information, describes why it was identified, and shows the redacted version.


Philter and COVID-19

Philter NLP Models

The natural language processing (NLP) capabilities of Philter are partly model-driven, meaning that we have trained models to identify information in text. These models are used to identify pieces of sensitive information that do not follow well-defined patterns or exist in referenced dictionaries, such as person's names. The model training process is a complex and compute-intensive procedure often taking days or even weeks to complete. Once a model is created it can be applied to text to identify specific parts of the text based on the text used to train the model and the parameters of the training.

NLP Models for Many Use-Cases and Industries

The model currently deployed in Philter is generic yet provides good performance across many use-cases covering many different types of text. It has been our plan for some time to offer models trained for specific use-cases and industries, including non-healthcare industries, for those instances when Philter is used only on a certain type of text. A tailored model will give those specific use-cases an increase in performance.

Philter's pluggable model implementation is not quite ready yet. However, we are jumping ahead a bit today by announcing a model tailored for personally identifiable information in text related to COVID-19. We hope that this model will give you improved performance when identifying sensitive information in COVID-19 related text.

Model Availability

Because we are jumping ahead of ourselves in order to make this model immediately available, we don't yet have any automation or tooling support around being able to download and install the model yourself. (We will in the future.) Until we do have the self-service tooling available, we will distribute the model and installation instructions to users of Philter via email upon request. There is no additional charge to request and use the model.

To request the Philter model trained using COVID-19 data please use our contact form and include your cloud marketplace (AWS, Azure, or GCP) subscription ID.


Using Philter with Microsoft Power Automate (Flow)


Philter SDKs

We have some updates on the Philter SDKs!

The Philter SDKs provide API clients for interacting with Philter to identify and remove sensitive information from text. Each project contains examples showing how to use the SDK.

Philter SDK for Java

The Java SDK is now available in Maven Central.

Philter SDK for .NET

The .NET SDK is now available from NuGet.

Philter SDK for Golang

The Golang SDK is now available on GitHub.


Filtering Sensitive Information using Apache NiFi with Philter

A while back we made a post describing how Philter can be used alongside Apache NiFi for identifying and removing sensitive information from text. Since that post, there have been changes to Philter and Apache NiFi so we thought it would be worthwhile to revisit that architecture and its configuration.

  • Apache NiFi is an application for creating and managing data flows that process data.
  • Philter identifies and removes sensitive information, such as PHI and PII, from natural language text. Philter is available on cloud marketplaces.

The Data Flow Architecture

In the architecture of our data flow, we are going to be ingesting natural language (unstructured) text from somewhere - it doesn't really matter where. In your use-case it may be from a file system, from an S3 bucket, or from an Apache Kafka topic. Once we have the text in the content of the NiFi flowfile, we will send the text to Philter where the sensitive information will be removed from the text. The filtered text will then be the content of the flowfile. In our example here we are going to read the files from a directory on the file system.

To interact with Philter we can use NiFi's InvokeHTTP processor since Philter's API is HTTP REST-based.

Finally, we will write the filtered text to some destination. Like the ingest source, where we write the text does not matter. We could write it back to the source or some other location - whatever is required by your use-case.

The NiFi Flow

The flow will use the GetFile processor to read /tmp/input/*.txt files. The contents of each file will be sent to Philter. The resulting filtered text will be written back to the file system at /tmp/output.

Apache NiFi flow for Philter

If you want to quickly prototype it with minimal configuration, use a GenerateFlowFile processor and set the content manually to something like "His SSN was 123-45-6789."

Using GenerateFlowFile to test Philter.

InvokeHTTP Processor Configuration

The configuration of the InvokeHTTP processor is fairly simple. We just need to configure the HTTP Method, Remote URL, and Content Type. Set each as follows:

  • HTTP Method = POST
  • Remote URL = http://philter-ip:8080/api/filter
  • Content-Type = text/plain

Since we are not providing any values for the context, document ID, or filter profile name in the URL, Philter will use default values for each. When not provided, the default value for context is default, Philter will generate a document ID per request, and the default filter profile name is default.

These default values are detailed in Philter's API documentation. A context lets you group similar documents together, perhaps by business unit or purpose. A document ID should uniquely identify a document (such as a file name) and can be used to split up large documents for processing.

If you do want to set values for one or all of those instead of using the default values, just append them to the Remote URL: http://philter-ip:8080/api/filter?c=ctx&p=justssn In this request, the context is set to ctx and it tells Philter to use the filter profile named justssn. As a tip, you can use NiFi's expression language to parameterize the values in the URL.
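
If you build that URL in code rather than typing it into the processor, a small helper keeps the query string properly encoded. This Python sketch uses only the c (context) and p (filter profile) parameters shown above. The function name is our own.

```python
from urllib.parse import urlencode

def filter_url(base_url, context=None, profile=None):
    """Build the /api/filter URL, adding only the parameters that were given."""
    params = {}
    if context is not None:
        params["c"] = context  # context
    if profile is not None:
        params["p"] = profile  # filter profile name
    url = base_url.rstrip("/") + "/api/filter"
    if params:
        url += "?" + urlencode(params)
    return url

print(filter_url("http://philter-ip:8080", context="ctx", profile="justssn"))
# http://philter-ip:8080/api/filter?c=ctx&p=justssn
```

Leaving out both arguments produces the plain /api/filter URL, in which case Philter falls back to its defaults.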

InvokeHTTP processor configuration for Philter.

A Closer Look

If we use a LogAttribute processor we can get some insight into what's happening. In the log output below, we can see the HTTP POST request that was made.

At the top of the log we see the filtered text from Philter. The input text from the file was "His SSN was 123-45-6789." Philter applied the default filter profile which looks for SSNs and responded with "His SSN was {{{REDACTED-ssn}}}."

(Filter profiles are very powerful and flexible configurations that let you have full control over the types of sensitive information that Philter identifies and how Philter manipulates that information when found.)

We can also see that since we did not provide a value for the document ID in the request, Philter assigned a document ID and returned it in the response in the x-document-id header.

His SSN was {{{REDACTED-ssn}}}.

--------------------------------------------------
Standard FlowFile Attributes
Key: 'entryDate'
	Value: 'Thu Feb 27 13:35:19 UTC 2020'
Key: 'lineageStartDate'
	Value: 'Thu Feb 27 13:35:11 UTC 2020'
Key: 'fileSize'
	Value: '31'
FlowFile Attribute Map Content
Key: 'Connection'
	Value: 'keep-alive'
Key: 'Content-Length'
	Value: '31'
Key: 'Content-Type'
	Value: 'text/plain;charset=UTF-8'
Key: 'Date'
	Value: 'Thu, 27 Feb 2020 13:35:19 GMT'
Key: 'Keep-Alive'
	Value: 'timeout=60'
Key: 'filename'
	Value: 'd206fc81-2c42-40ba-afbf-b5f9998b56c0'
Key: 'invokehttp.request.url'
	Value: 'http://10.1.1.221:8080/api/filter'
Key: 'invokehttp.status.code'
	Value: '200'
Key: 'invokehttp.status.message'
	Value: ''
Key: 'invokehttp.tx.id'
	Value: 'fbf2f6c0-1073-4fac-bc23-6d6a67b70423'
Key: 'mime.type'
	Value: 'text/plain;charset=UTF-8'
Key: 'path'
	Value: './'
Key: 'uuid'
	Value: '486ff4c2-6530-4e1c-aea2-e9965b86b10c'
Key: 'x-document-id'
	Value: 'fb75a2a4c164192542f89881aa8baf21'
--------------------------------------------------

Summary

Philter's API makes it easy to integrate Philter with applications like Apache NiFi. The InvokeHTTP processor native to NiFi is an ideal means of communicating with Philter.

To keep things simple, this example only considered SSNs in text. Philter supports many other types of sensitive information.

If performance is very important, there are a couple of things that can be done to help. First, Philter is stateless so you can run multiple instances of Philter behind a load balancer. Second, Philter Enterprise Edition can run natively inside an Apache NiFi flow without the need to make HTTP calls to Philter. Contact us if you would like to learn more about Philter Enterprise Edition's processor for Apache NiFi.

Philter's integration with applications like Apache NiFi is very important to us so look for more improvements and features in versions to come.


Philter 1.3.1

We are happy to announce the release of Philter 1.3.1!

Philter 1.3.1 release notes
Philter 1.3.1 documentation

This version of Philter makes some minor changes to filtering and adds support for MAC addresses and tax-payer identification numbers (TINs). Also new is the ability to encrypt sensitive information in the text with AES by using the CRYPTO_REPLACE filter strategy. Additionally, Azure and GCP images are now built on CentOS 8.

Other changes include the ability to use the context in a filter condition, the ability to provide a user-set document ID to Philter's API, and the requirement of Java 11.

Philter Enterprise Edition 1.3.1 has been certified for Red Hat Enterprise Linux 8. For enterprise customers, Philter's containers are now built on Red Hat's Universal Base Image to give best performance on Red Hat Enterprise Linux deployments.

Launch Philter in your cloud.


Philter 1.3.0

Today I am happy to announce the availability of Philter 1.3.0! This version includes various tweaks to improve performance and we definitely encourage you to upgrade to 1.3.0. This version greatly lowers the required time to process text while improving the accuracy of identified information.

The only new user-facing feature is a modification to the URL filter to add an option to require the URLs to start with http, https, or www. This change adds a new property to the URL filter profile. All other improvements are related to the internal workings of Philter.

Look for Philter 1.3.0 to be available on the cloud marketplaces in a few days.

Philter 1.3.0 Release Notes


Philter 1.2.0

We are happy to announce the release of Philter 1.2.0. This version brings new features to filter profiles along with some minor changes. Philter 1.2.0 will be available on the cloud marketplaces in a day or two. Let's get to it and see what's new!

A Recap

Philter is an application to analyze text for personally identifiable information (PII) and protected health information (PHI) and remove or manipulate those items when found. The types of information that Philter looks for and how it acts upon the information are defined in a filter profile. A filter profile is just a file that lists the types of PII/PHI that you are interested in, e.g. credit card numbers, person's names, etc. Philter is available on the AWS Marketplace, Azure Marketplace, and GCP Marketplace.

Contact us for a live demo or feel free to take Philter for a spin on one of the cloud marketplaces taking advantage of its free trial period. Check out the full Release History.

What's New in Philter 1.2.0

Filter Specific Ignore Lists

A filter profile can now have lists of ignored terms specific to each filter type. For example, let's say there is a number "123-45-6789" in your text and it keeps getting identified as an SSN because it fits the SSN format. However, you know this number is not an SSN and do not want it removed. You can now add "123-45-6789" to a list of ignored terms for the SSN filter to prevent it from being removed from the text. Each type of filter has its own ignore list.

Global Ignore Lists

A filter profile can now have zero or more ignore lists that apply to all filter types. Items added to this list are ignored for all filter types. All items present in the global ignore lists will never be removed from the input text.

Disabling Filters

Previously, to disable a filter type in a filter profile you had to delete it from the filter profile. This can be problematic because you might have configuration in there you don't want to just delete and lose. New in Philter 1.2.0, each filter type has an enabled property that controls whether or not the filter is applied. When set to false the filter is not applied. The default value is always true to enable each filter type.

Invalid Credit Card Numbers

Philter identifies credit card numbers based on the patterns and algorithms of the numbers. In Philter 1.2.0, a new option was added to the credit card filter type that allows invalid credit card numbers to be filtered as well. An invalid credit card number is a number that matches the pattern of a credit card number but fails the credit card number's generation algorithm. (The algorithm is the Luhn algorithm.) This option is disabled by default.
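
For the curious, the Luhn check itself is short. The Python sketch below is a generic implementation of the algorithm, not Philter's code. The first number used is a widely published test card number.

```python
def passes_luhn(number):
    """Return True if the digits in the string satisfy the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return total % 10 == 0

print(passes_luhn("4111 1111 1111 1111"))  # True
print(passes_luhn("4111 1111 1111 1112"))  # False
```

The second number has the right shape but fails the checksum, so it is the kind of "invalid" card number that the new option lets you filter anyway.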

Valid Dates

Philter identifies dates based on date patterns. Sometimes, a date may match a valid pattern but not be a valid date, such as February 30 or even March 45. Philter 1.2.0 adds a new option to the date filter to require that identified dates be valid dates. When enabled, dates found to not be valid dates are not removed from the text. This option is disabled by default.
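
The difference between matching a date pattern and being a real calendar date is easy to see in code. This Python sketch leans on the standard library's date type to do the validation. It illustrates the concept only and says nothing about how Philter implements the check.

```python
from datetime import date

def is_valid_date(year, month, day):
    """Return True only if the values form a real calendar date."""
    try:
        date(year, month, day)
    except ValueError:
        return False
    return True

print(is_valid_date(2020, 2, 29))  # True (2020 was a leap year)
print(is_valid_date(2020, 2, 30))  # False
print(is_valid_date(2020, 3, 45))  # False
```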

Option to Remove Punctuation

Philter 1.2.0 adds a new option to the filter profile for named-entity recognition to remove punctuation from the input text prior to processing the text. By default this option is disabled and punctuation is not removed. Removing punctuation can be beneficial in cases where punctuation is being included in entities. This can happen in cases where the last word of the sentence is a name and the period is included in the filtered text. (This doesn't always happen and we're working on removing those occurrences even more through improvements to the named-entity recognition capability.)

Encrypting Connections to Redis

Philter's consistent anonymization feature stores the identified text in a Redis cache. This allows a clustered Philter installation to be able to replace identified text consistently across all instances of Philter. (When Redis is not used, the identified text values are stored in memory on each Philter instance.) Philter 1.2.0 requires all connections to a Redis cache be encrypted and requires the use of a Redis auth token.
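
To illustrate why a shared cache matters for consistency, here is a conceptual Python sketch. A plain dictionary stands in for the Redis cache, and the class, its method, and the replacement names are all hypothetical. Philter's actual anonymization logic is not shown here.

```python
import random

class ConsistentAnonymizer:
    """Return the same replacement every time the same text is identified."""

    def __init__(self, candidates, cache=None, seed=None):
        self._candidates = list(candidates)
        # The cache maps identified text to its replacement. Passing the same
        # cache to several anonymizers plays the role of a shared Redis cache.
        self._cache = {} if cache is None else cache
        self._random = random.Random(seed)

    def replacement_for(self, identified_text):
        if identified_text not in self._cache:
            self._cache[identified_text] = self._random.choice(self._candidates)
        return self._cache[identified_text]

names = ["Alex Smith", "Jordan Lee", "Casey Brown"]
shared_cache = {}
instance_a = ConsistentAnonymizer(names, cache=shared_cache)
instance_b = ConsistentAnonymizer(names, cache=shared_cache)

first = instance_a.replacement_for("George Washington")
second = instance_b.replacement_for("George Washington")
print(first == second)  # True
```

With separate in-memory caches, each instance could pick a different replacement for the same name, which is exactly what the shared cache avoids.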


Philter 1.1.0

We are happy to announce Philter 1.1.0! This version brings some features we think you will find very useful because most were implemented directly from interactions with users. We look forward to future interactions to keep driving improvements!

We are very excited about this release, but we also have lots of exciting things to add in the next release and we will soon be making available Philter Studio, a free Windows application to use Philter. If you don't like managing filter profiles in JSON you will love Philter Studio!

We have begun the process of publishing Philter 1.1.0 to the cloud marketplaces and it should be available on the AWS, Azure, and GCP marketplaces in the next few days once publishing is complete. The Philter Quick Start walks through how to deploy Philter on each platform. You can also see the full Philter release notes.

What's New in Philter 1.1.0

Ignore Lists

In some cases, there may be text that you never want to identify and remove as PII or PHI. An example may be an email address or telephone number of a business that is not relevant to the sensitive information in the text and removing this text may cause the document to lose meaning. Ignore lists allow you to specify a list of terms that are never removed (always ignored if found) from the documents. You can create as many ignore lists as you need and each one can contain as many terms as desired. The ignore lists are defined in the filter profile.

Here's how an ignore list is defined in a filter profile that only finds SSNs. The SSNs 123-45-6789 and 000-00-0000 will always be ignored and will remain in the documents unchanged.

{
  "name": "default",
  "identifiers": {
    "ssn": {
      "ssnFilterStrategies": [
        {
          "strategy": "REDACT",
          "redactionFormat": "{{{REDACTED-%t}}}",
          "replacementScope": "DOCUMENT"
        }
      ]
    }
  },
  "ignored": [
    {
      "name": "ignored-terms",
      "terms": [
        "123-45-6789",
        "000-00-0000"
      ]
    }
  ]
}
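The effect of this profile can be approximated with a short sketch. It is not Philter's implementation: the regular expression is a simplified SSN pattern, the function name redact_ssns is hypothetical, and the replacement token is assumed to follow the redactionFormat shown above with %t replaced by the filter type.

```python
import re

# Simplified SSN pattern for illustration; real SSN detection is more involved.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact_ssns(text: str, ignored_terms: set) -> str:
    """Redact SSN-like values unless they appear in the ignore list."""

    def replace(match: re.Match) -> str:
        value = match.group(0)
        # Ignored terms are left in the document unchanged.
        return value if value in ignored_terms else "{{{REDACTED-ssn}}}"

    return SSN_PATTERN.sub(replace, text)


ignored = {"123-45-6789", "000-00-0000"}
result = redact_ssns("Test SSN 123-45-6789, real SSN 987-65-4321.", ignored)
print(result)
```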

Custom Dictionaries

You can now have custom dictionaries of terms that are to be identified as sensitive information. With a custom dictionary you can specify a list of terms, such as names, addresses, or other information, that should always be treated as personal information. You can create as many custom dictionaries as you need and each one can contain as many terms as desired. The custom dictionaries are defined in the filter profile.

Here's how a custom dictionary can be added to a filter profile. In this example, a custom dictionary of type names-with-j is created and it contains the terms james, jim, and john. When any of these terms are found in a document they will be redacted. The dictionaries item is an array so you can have as many dictionaries as required. (The "auto" setting for the sensitivity is discussed further below.)

{
  "name": "default",
  "identifiers": {
    "dictionaries": [
      {
        "type": "names-with-j",
        "terms": [
          "james",
          "jim",
          "john"
        ],
        "sensitivity": "auto",
        "customFilterStrategies": [
          {
            "strategy": "REDACT",
            "redactionFormat": "{{{REDACTED-%t}}}",
            "replacementScope": "DOCUMENT"
          }
        ]
      }
    ],
    "ssn": {
      "ssnFilterStrategies": [
        {
          "strategy": "REDACT",
          "redactionFormat": "{{{REDACTED-%t}}}",
          "replacementScope": "DOCUMENT",
          "staticReplacement": "",
          "condition": ""
        }
      ]
    }
  }
}
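A minimal sketch of what dictionary matching amounts to is shown below. It uses exact, case-insensitive, whole-word matching and ignores the sensitivity setting entirely. The function names are hypothetical, and the assumption that %t expands to the dictionary's type is ours rather than something stated in the profile.

```python
import re


def build_dictionary_pattern(terms: list) -> re.Pattern:
    """Compile a case-insensitive, whole-word pattern for the given terms."""
    # Longer terms first so that overlapping terms prefer the longest match.
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in ordered)
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)


def redact_terms(text: str, dictionary_type: str, terms: list) -> str:
    """Replace each dictionary term with a redaction token naming the type."""
    token = "{{{REDACTED-" + dictionary_type + "}}}"
    # A function is used as the replacement so the token is inserted literally.
    return build_dictionary_pattern(terms).sub(lambda _match: token, text)


redacted = redact_terms("James met Jim and Johnny.", "names-with-j", ["james", "jim", "john"])
print(redacted)
```

Because matching is on whole words, "Johnny" is left alone even though it begins with the dictionary term "john".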

"Fuzziness" Calculation

We added a new fuzziness option when using dictionary filters. The previous options of LOW, MEDIUM, and HIGH were found to be either not restrictive enough or too restrictive. We have added an AUTO option that automatically determines the appropriate fuzziness based on the length of the term in question. For instance, the AUTO option sets the fuzziness for a short term to be on the low side, while a longer term allows a higher fuzziness. We recommend using AUTO over the other options and expect it to perform better for you. The other options of LOW, MEDIUM, and HIGH are still available.
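The idea behind a length-based fuzziness can be sketched as follows. The thresholds here are an assumption borrowed from the AUTO fuzziness rule in Elasticsearch (no edits for terms of one or two characters, one edit for three to five, two edits for longer terms); Philter's actual thresholds are not stated in this post, and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def auto_max_edits(term: str) -> int:
    """Assumed thresholds: longer terms tolerate more edits."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


def fuzzy_match(candidate: str, term: str) -> bool:
    """True if the candidate is within the allowed edit distance of the term."""
    return edit_distance(candidate.lower(), term.lower()) <= auto_max_edits(term)


print(fuzzy_match("Jon", "john"), fuzzy_match("jo", "jim"))
```

A short term such as "jim" tolerates only a single edit, so unrelated short words are unlikely to match, while a longer term such as "washington" can absorb a couple of typos.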

Explain API Endpoint

Philter operates as a black box. Text goes in and manipulated text comes out. What happened inside? To help provide insight into the black box, we have added a new API endpoint called explain. This endpoint performs text filtering but returns more information on the filtering process. The list of identified spans (pieces of text found to be sensitive) and applied spans are both returned as objects along with attributes about each span.

Here's an example output of calling the explain API endpoint given some sample text. The original API call:

curl -k -s "https://localhost:8080/api/explain?c=C1" --data "George Washington was president and his ssn was 123-45-6789 and he lived at 90210." -H "Content-type: text/plain" 
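The same request can be made from a script. The sketch below uses only the Python standard library and mirrors the curl flags above, including -k, which disables certificate verification and is only appropriate for a local test instance with a self-signed certificate. The function names are hypothetical; the URL, query parameter, and content type are taken from the curl command.

```python
import ssl
import urllib.parse
import urllib.request


def build_explain_request(base_url: str, text: str, context: str) -> urllib.request.Request:
    """Build (but do not send) a POST request to the explain endpoint."""
    query = urllib.parse.urlencode({"c": context})
    return urllib.request.Request(
        url=f"{base_url}/api/explain?{query}",
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def explain(base_url: str, text: str, context: str) -> str:
    """Send the request and return the raw JSON response body."""
    # Equivalent of curl's -k flag: skip certificate and hostname verification.
    insecure = ssl.create_default_context()
    insecure.check_hostname = False
    insecure.verify_mode = ssl.CERT_NONE
    request = build_explain_request(base_url, text, context)
    with urllib.request.urlopen(request, context=insecure) as response:
        return response.read().decode("utf-8")


request = build_explain_request(
    "https://localhost:8080", "George Washington was president.", "C1"
)
print(request.get_method(), request.full_url)
```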

The response from the API call:

{
  "filteredText": "{{{REDACTED-entity}}} was president and his ssn was {{{REDACTED-ssn}}} and he lived at {{{REDACTED-zip-code}}}.",
  "context": "C1",
  "documentId": "117d088b-354a-48b0-b2d7-bdf0335650d5",
  "explanation": {
    "appliedSpans": [
      {
        "id": "b7c5b777-460e-4033-8d91-0f2d3a2d6424",
        "characterStart": 0,
        "characterEnd": 17,
        "filterType": "NER_ENTITY",
        "context": "C1",
        "documentId": "117d088b-354a-48b0-b2d7-bdf0335650d5",
        "confidence": 0.9189682900905609,
        "text": "George Washington",
        "replacement": "{{{REDACTED-entity}}}"
      },
      {
        "id": "b4a2d019-b7cb-4fc7-8598-bec1904124b4",
        "characterStart": 48,
        "characterEnd": 59,
        "filterType": "SSN",
        "context": "C1",
        "documentId": "117d088b-354a-48b0-b2d7-bdf0335650d5",
        "confidence": 1,
        "text": "123-45-6789",
        "replacement": "{{{REDACTED-ssn}}}"
      },
      {
        "id": "48b10b67-6ad2-4b5a-934f-a3b4fd190618",
        "characterStart": 76,
        "characterEnd": 81,
        "filterType": "ZIP_CODE",
        "context": "C1",
        "documentId": "117d088b-354a-48b0-b2d7-bdf0335650d5",
        "confidence": 1,
        "text": "90210",
        "replacement": "{{{REDACTED-zip-code}}}"
      }
    ],
    "identifiedSpans": [
      {
        "id": "b7c5b777-460e-4033-8d91-0f2d3a2d6424",
        "characterStart": 0,
        "characterEnd": 17,
        "filterType": "NER_ENTITY",
        "context": "C1",
        "documentId": "117d088b-354a-48b0-b2d7-bdf0335650d5",
        "confidence": 0.9189682900905609,
        "text": "George Washington",
        "replacement": "{{{REDACTED-entity}}}"
      },
      {
        "id": "b4a2d019-b7cb-4fc7-8598-bec1904124b4",
        "characterStart": 48,
        "characterEnd": 59,
        "filterType": "SSN",
        "context": "C1",
        "documentId": "117d088b-354a-48b0-b2d7-bdf0335650d5",
        "confidence": 1,
        "text": "123-45-6789",
        "replacement": "{{{REDACTED-ssn}}}"
      },
      {
        "id": "48b10b67-6ad2-4b5a-934f-a3b4fd190618",
        "characterStart": 76,
        "characterEnd": 81,
        "filterType": "ZIP_CODE",
        "context": "C1",
        "documentId": "117d088b-354a-48b0-b2d7-bdf0335650d5",
        "confidence": 1,
        "text": "90210",
        "replacement": "{{{REDACTED-zip-code}}}"
      }
    ]
  }
}

In the response, each identified span is listed with some attributes.

  • id - A random UUID identifying the span.
  • characterStart - The character-based index of the start of the span.
  • characterEnd - The character-based index of the end of the span.
  • filterType - The filter that identified this span.
  • context - The given context under which this span was identified.
  • documentId - The given documentId or a randomly generated documentId if none was provided.
  • confidence - Philter's confidence that this span does in fact represent sensitive information.
  • text - The text contained within the span.
  • replacement - The value which Philter used to replace the text in the document.
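One way to read these attributes is to apply them by hand. The sketch below takes the applied spans from the sample response, checks that each span's offsets select the reported text, and rebuilds the filtered text. The function name apply_spans is hypothetical, and treating characterEnd as exclusive is an inference from the sample offsets rather than a documented guarantee.

```python
def apply_spans(text: str, spans: list) -> str:
    """Rebuild the filtered text by applying each span's replacement."""
    # Work from the end of the text so that earlier offsets stay valid.
    for span in sorted(spans, key=lambda s: s["characterStart"], reverse=True):
        start, end = span["characterStart"], span["characterEnd"]
        # characterEnd appears to be exclusive in the sample response.
        if text[start:end] != span["text"]:
            raise ValueError(f"offsets do not match text for span at {start}")
        text = text[:start] + span["replacement"] + text[end:]
    return text


original = ("George Washington was president and his ssn was 123-45-6789 "
            "and he lived at 90210.")
applied_spans = [
    {"characterStart": 0, "characterEnd": 17, "text": "George Washington",
     "replacement": "{{{REDACTED-entity}}}"},
    {"characterStart": 48, "characterEnd": 59, "text": "123-45-6789",
     "replacement": "{{{REDACTED-ssn}}}"},
    {"characterStart": 76, "characterEnd": 81, "text": "90210",
     "replacement": "{{{REDACTED-zip-code}}}"},
]
filtered = apply_spans(original, applied_spans)
print(filtered)
```

The result matches the filteredText value in the sample response.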

The User's Guide has been updated to include the explain API endpoint.

Elasticsearch

As mentioned in a previous post, Philter 1.1.0 now uses Elasticsearch to store the identified spans instead of MongoDB. Please check that post for the details but we do want to mention again here that this change does not affect Philter's API and the change will be transparent to any of your existing Philter scripts or applications.

Datadog Metrics

Philter 1.1.0 adds support for sending metrics directly to Datadog.

New Metrics

Philter 1.1.0 adds new metrics for each type of filter. Now you will be able to see metrics for each type of filter in CloudWatch, JMX, and Datadog to give more insight into the types of sensitive information being found in your documents.


Philter and Elasticsearch

Philter, our application for finding and removing PII and PHI from natural language text, has the ability to optionally store the identified text in an external data store. With this feature, you have access to a complete log of Philter's actions as well as the ability to reconstruct the original text in the future if you ever need to.

In Philter 1.0, we chose MongoDB as the external data store. With just a few configuration properties, Philter would connect to MongoDB and persist all identified "spans" (the identified text, its location in the document, and some other attributes) to a MongoDB database. This worked well but we realized that looking forward it might not have been the best choice.

In Philter 1.1 we are replacing MongoDB with Elasticsearch. The functionality and the Philter APIs will remain the same. The only difference is that the spans will be stored in an Elasticsearch index instead of a MongoDB database. So what, exactly, are the benefits? Great question.

The first benefit comes with Elasticsearch and Kibana's ability to quickly and easily make dashboards to view the indexed data. With the spans in Elasticsearch, you can make a dashboard to summarize the spans by type, text, etc., to show insights into the PII and PHI that Philter is finding and manipulating in your text.

It also quickly became apparent that a primary use case for users of the store would be to query the spans it contains. For example, a query to find all documents containing "John Doe" or all documents containing a certain date or phone number. A search engine is better prepared to handle those queries.
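As a sketch of what such queries could look like, the helpers below build two Elasticsearch request bodies: a phrase query for spans containing "John Doe", and a terms aggregation that counts spans by type (the kind of summary a Kibana dashboard would show). The field names text and filterType are assumptions that mirror the span attributes Philter reports; the actual index name and mapping are not described here, and the aggregation assumes filterType is indexed as a keyword field.

```python
import json


def phrase_query(field: str, phrase: str) -> dict:
    """Request body matching spans whose field contains the exact phrase."""
    return {"query": {"match_phrase": {field: phrase}}}


def count_by_field(field: str) -> dict:
    """Request body counting spans per distinct value of a keyword field."""
    return {"size": 0, "aggs": {"by_value": {"terms": {"field": field}}}}


john_doe_spans = phrase_query("text", "John Doe")
spans_by_type = count_by_field("filterType")

print(json.dumps(john_doe_spans, indent=2))
print(json.dumps(spans_by_type, indent=2))
```

Each matching span carries a documentId, so the list of documents containing the phrase can be derived from the hits.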

Another consideration is licensing. Elasticsearch is available under the Apache Software License or a compatible license while MongoDB is available under the Server Side Public License.

In summary, Philter 1.1 will offer support for using Elasticsearch as the store for identified PII and PHI. Remember, using the store is an optional feature of Philter. If you do not require any history of the text that Philter identifies then it is not needed. (By default, Philter's store feature is disabled and has to be explicitly enabled.) Support for using MongoDB as a store will not be available in Philter 1.1.

We are really excited about this change and about the possibilities that come with it!


Apache NiFi for Processing PHI Data

With the recent release of Apache NiFi 1.10.0, it seems like a good time to discuss using Apache NiFi with data containing protected health information (PHI). When PHI is present in data it can present significant concerns and impose many requirements you may not face otherwise due to regulations such as HIPAA.

Apache NiFi probably needs little introduction but in case you are new to it, Apache NiFi is a big-data ETL application that uses directed graphs called data flows to move and transform data. You can think of it as taking data from one place to another while, optionally, doing some transformation to the data. The data goes through the flow in a construct known as a flow file. In this post we'll consider a simple data flow that reads files from a remote SFTP server and uploads the files to S3. We don't need to look at a complex data flow to understand how PHI can impact our setup.

Encryption of Data at Rest and In-motion

Two core things to address when PHI data is present are encryption of the data at rest and encryption of the data in motion. The first step is to identify those places where sensitive data will be at rest and in motion.

For encryption of data at rest, the first location is the remote SFTP server. In this example, let's assume the remote SFTP server is not managed by us, has the appropriate safeguards, and is someone else's responsibility. As the data goes through the NiFi flow, the next place the data is at rest is inside NiFi's provenance repository. (The provenance repository stores the history of all flow files that pass through the data flow.) NiFi then uploads the files to S3. AWS gives us the capability to encrypt S3 bucket contents by default so we will use that through an S3 bucket policy.
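As one possible shape for such a bucket policy, the sketch below builds a policy document that denies any upload that does not request server-side encryption. The bucket name nifi-phi-landing and the function name are hypothetical, and this is only one common pattern; check the policy against your own compliance requirements before using it.

```python
import json


def require_sse_policy(bucket_name: str) -> dict:
    """Bucket policy that rejects uploads lacking a server-side encryption header."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUnencryptedObjectUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
                # The Null condition matches requests where the header is absent.
                "Condition": {"Null": {"s3:x-amz-server-side-encryption": "true"}},
            }
        ],
    }


# Hypothetical bucket name used only for illustration.
policy = require_sse_policy("nifi-phi-landing")
print(json.dumps(policy, indent=2))
```

The printed JSON can be attached to the bucket through the S3 console or any tool that accepts a bucket policy document.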

For encryption of data in motion, we have the connection between the SFTP server and NiFi and between NiFi and S3. Since we are using an SFTP server, our communication to the SFTP server will be encrypted. Similarly, we will access S3 over HTTPS providing encryption there as well.

If we are using a multi-node NiFi cluster, we may also have the communication between the NiFi nodes in the cluster. If the flows only execute on a single node you may argue that encryption between the nodes is not necessary. However, what happens in the future when the flow's behavior is changed and now PHI data is being transmitted in plain text across a network? For that reason, it's best to set up encryption between NiFi nodes from the start. This is covered in the NiFi System Administrator's Guide.

Encrypting Apache NiFi's Data at Rest

The best way to ensure encryption of data at rest is to use full disk encryption for the NiFi instances. (If you are on AWS and running NiFi on EC2 instances, use an encrypted EBS volume.) This ensures that all data persisted on the system will be encrypted no matter where the data appears. If a NiFi processor decides to have a bad day and dump error data to the log there is a risk of PHI data being included in the log. With full disk encryption we can be sure that even that data is encrypted as well.

Looking at Other Methods

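Let's recap the NiFi repositories:

  • FlowFile repository - Stores the attributes and current state of each flow file in the flow.
  • Content repository - Stores the content of the flow files.
  • Provenance repository - Stores the history of each flow file.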

PHI could exist in any of these repositories when PHI data is passing through a NiFi flow. NiFi does have an encrypted provenance repository implementation and NiFi 1.10.0 introduces an experimental encrypted content repository but there are some caveats. (Currently, NiFi does not have an implementation of an encrypted flowfile repository.)

When using these encryption implementations, spillage of PHI onto the file system through a log file or some other means is still a risk. There will also be a bit of overhead due to the additional CPU instructions to perform the encryption. With an encrypted EBS volume, by contrast, we don't have to worry about spilling unencrypted PHI to the disk, and per the AWS EBS encryption documentation, "You can expect the same IOPS performance on encrypted volumes as on unencrypted volumes, with a minimal effect on latency."

There is also the NiFi EncryptContent processor that can encrypt (and decrypt despite the name!) the content of flow files. This processor has its uses, but only in very specific cases. Trying to encrypt data at the level of the data flow for compliance reasons is not recommended because the data may still exist unencrypted elsewhere in the NiFi repositories.

Removing PHI from Text in a NiFi Flow

What if you want to remove PHI (and PII) from the content of flow files as they go through a NiFi data flow? Check out our product Philter. It provides the ability to find and remove many types of PHI and PII from natural language, unstructured text from within a NiFi flow. Text containing PHI is sent to Philter and Philter responds with the same text but with the PHI and PII removed.

Conclusion

Full disk encryption and encrypting all connections in the NiFi flow and between NiFi nodes provide encryption of data at rest and in motion. It's also recommended that you check with your organization's compliance officer to determine if there are any other requirements imposed by your organization or other relevant regulations prior to deployment. It's best to gather that information up front to avoid rework in the future!