Philter is now Certified for Cloudera Dataflow

We are excited to announce that Philter is now certified for Cloudera Dataflow (CDF). By leveraging Philter in your Apache NiFi data flows, you can redact protected health information (PHI), personally identifiable information (PII), and other types of sensitive information from your data.

Using Philter in Cloudera Dataflow is as simple as making the Philter processors available to your Apache NiFi instance and adding the processors to your canvas. You can configure the types of information to redact right inside the processor's properties. You can choose whether to use a centralized Philter instance or perform the redaction directly within the Apache NiFi flow. The former allows for centralized configuration while the latter provides significant performance improvements.

Philter on Cloudera Dataflow is compatible with all public clouds supported by Cloudera Dataflow.

Get Started

To get started with Philter on Cloudera Dataflow, please contact us and we will guide you through the process. You can also visit our partner information on the Cloudera partner portal.

About Cloudera Dataflow

Cloudera DataFlow (CDF) is a CDP Public Cloud service that enables self-serve deployments of Apache NiFi data flows from a central catalog to auto-scaling Kubernetes clusters managed by CDP. Flow deployments can be monitored from a central dashboard with the ability to define KPIs to keep track of critical data flow metrics. CDF eliminates the operational overhead that is typically associated with running Apache NiFi clusters and allows users to fully focus on developing data flows and ensuring they meet business SLAs. Learn more about Cloudera Dataflow.


Philter is a featured product in the AWS Marketplace's Healthcare Compliance category

We are excited to share that Philter is now a featured product in the AWS Marketplace's Healthcare Compliance category.

The products selected for this feature “help ensure that IT infrastructure is compliant with changing policies and regulations, allowing teams to focus on driving patient-centric innovation.”

 Philter on AWS for Healthcare Compliance Data Sheet

Philter redacts PHI, PII, and other sensitive information from documents and text. With Philter, users can select the types of sensitive information to redact, anonymize, encrypt, or tokenize.

Philter can be launched in your AWS cloud via the AWS Marketplace in just a few minutes. Philter runs entirely within your private VPC so your sensitive data never has to leave your VPC.

Using Philter in the AWS Reference Architecture for HIPAA


AWS has provided a HIPAA Reference Architecture for applications that contain protected health information (PHI). This reference architecture provides a starting point for a highly available architecture that spans multiple Availability Zones, with three VPCs for management, production, and development resources, centralized logging, a VPN for customer connectivity, and a set of AWS Config rules. The source code for the reference architecture is available.

Philter is our software that redacts PHI, PII, and other sensitive information from text and documents. Philter runs in your cloud so your data never leaves your network to be redacted and its API allows for integration into virtually any application or system. In this blog post we will look at how Philter can be deployed via the AWS Marketplace inside an architecture developed using the AWS HIPAA Reference Architecture.

AWS Reference Architecture for HIPAA with Philter

The image below shows the AWS Reference Architecture for HIPAA with Philter deployments.

HIPAA Reference Architecture with Philter

Redacting PHI in your application

Suppose your application must redact PHI prior to processing and you want to deploy Philter in this reference architecture. How does Philter fit in? Seamlessly. Philter can be deployed from the AWS Marketplace into one of the private subnets in the production VPC. From there, Philter's API is available to the rest of your application, which can send text to Philter for redaction and receive the redacted text in response. The VPC flow logs configured as part of the reference architecture will capture the network traffic, and Philter's application and system logs can be sent to CloudWatch Logs.
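As a sketch of that integration, here is how an application might POST text to a Philter instance in the private subnet. The hostname, port, and endpoint path below are assumptions, so check Philter's User's Guide for your deployment's actual API details; the HTTP opener is injectable so the call can be stubbed out in tests.

```python
from urllib import request

# Hypothetical internal hostname and endpoint; confirm against your deployment.
PHILTER_URL = "https://philter.internal:8080/api/filter"

def redact(text, opener=request.urlopen, url=PHILTER_URL):
    """Send text to Philter and return the redacted text.

    `opener` defaults to urllib's urlopen but can be replaced with a stub.
    Note: a self-signed certificate (Philter's default) would require an
    SSL context or a replaced certificate in a real deployment.
    """
    req = request.Request(
        url,
        data=text.encode("utf-8"),           # providing data makes this a POST
        headers={"Content-Type": "text/plain"},
    )
    with opener(req) as resp:
        return resp.read().decode("utf-8")
```

Because the opener is injectable, the surrounding application code can be unit tested without a running Philter instance.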

If you want to customize the configuration of Philter you can create an AMI from the Philter AMI on the AWS Marketplace. This may be useful if you want to "bake in" configuration for sending logs to CloudWatch Logs or an organizational SSL certificate.

Highly-available Philter deployment

For a highly available Philter deployment, create an autoscaling group for the Philter AMI in the two private subnets of the production VPC. Create a load balancer in the production VPC and register the autoscaling group with it. Now, you have a single endpoint for Philter’s API with a load balanced, highly-available set of instances behind the load balancer. You can configure auto scaling policies if you would like the Philter instances to scale up or down based on network traffic (or some other metric).

Data encryption

Network traffic to Philter will be encrypted. By default, Philter uses a self-signed certificate, but you can replace it with your organization's certificate. Also, when deploying the Philter instances, be sure to use encrypted EBS volumes. Together, these two items give you encryption of data in motion and at rest for your Philter instances.


You will also want to deploy an instance of Philter in the private subnet of the development VPC. This will give you an instance of Philter to use while developing and testing your application. This Philter instance can be a smaller instance type, such as a t3.large, to save cost.

Get started

To get started, deploy Philter from the AWS Marketplace.

New! Philter Add-Ins for Microsoft Office

We are very excited to announce the new Philter Add-Ins for Microsoft Office! These add-ins bring Philter's redaction capabilities directly into your Microsoft Word documents and Microsoft Excel spreadsheets. With these add-ins you can redact or highlight PHI, PII, and other sensitive information in your documents and spreadsheets with a single click. The Philter Add-Ins for Microsoft Office provide a great leap forward in streamlining document redaction.

Download or learn more about the add-ins at

How do the add-ins work?

Both add-ins add a new content pane that makes Philter available in your document or spreadsheet. In the screenshot below you can see the Philter content pane in Microsoft Word. The user has clicked the Highlight button to highlight the sensitive information in the document. Clicking the Redact button would have redacted the sensitive information instead.

The Philter Add-In for Microsoft Word enables document redaction from directly inside your documents.


With the add-ins you can redact PHI and PII directly in your documents, eliminating the second step of sending your documents to Philter for redaction. This will help streamline your document redaction processes and give you more control.

What are the licensing details?

The add-ins are free to download and use. An instance of Philter is required but can be shared among users. If you aren't yet enjoying Philter's redaction capabilities we can help you get started (or you can get started on your own in your preferred cloud in about 5 minutes).

Philter 1.10.1

Philter 1.10.1 adds new features for document and text redaction. Take control of the sensitive information in your text through powerful redaction and encryption capabilities.

Philter 1.10.1 will be available for deployment on the AWS, Azure, and Google Cloud marketplaces in the next few days. Contact us for Docker or on-premises deployments.

New User Interface

This version of Philter introduces a user interface for testing Philter's configuration and managing filter profiles. The user interface can be accessed at https://philter:9090. By default, the user interface communicates with both the Philter service and your web browser over SSL. We are excited to introduce the user interface and to continue developing it in future versions.

Two-Way SSL Enabled by Default

Philter cloud marketplace images now have two-way SSL enabled by default. This should reduce manual configuration steps often required to use Philter. See Philter's User's Guide for more information.

Post-Filters Can be Disabled

Philter's post-filters can now be disabled. The post-filters are primarily intended for clean up by doing operations such as removing blank spaces and punctuation. In cases where you might want to leave those characters in identified sensitive information spans you can now individually disable the post-filters.

Phone Number Confidence Values Are Now Dynamic Based on Format

Each time a piece of sensitive information is identified in text, Philter assigns it a confidence value. This value indicates Philter's "confidence" that the identified text actually is sensitive information. For phone numbers, this confidence value is now dynamic based on the format of the phone number. For example, the phone number (123) 456-7890 would be given a higher confidence than 1234567890. While both are valid phone numbers, the first is formatted as a phone number, which provides higher assurance that it actually is one, so its confidence value will be higher than the second number's.
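To make the idea concrete, here is an illustrative scoring rule. This is not Philter's actual confidence model, only a sketch of how formatting can drive a confidence value; the specific scores are made up.

```python
import re

def phone_confidence(candidate: str) -> float:
    """Return a higher confidence for conventionally formatted phone numbers.

    Illustrative values only; Philter's real model is not shown here.
    """
    if re.fullmatch(r"\(\d{3}\) \d{3}-\d{4}", candidate):
        return 0.95   # formatted like (123) 456-7890
    if re.fullmatch(r"\d{3}-\d{3}-\d{4}", candidate):
        return 0.90   # formatted like 123-456-7890
    if re.fullmatch(r"\d{10}", candidate):
        return 0.60   # bare ten digits could be any identifier
    return 0.0        # not a phone number candidate at all
```

The point of the sketch is the ordering, not the numbers: a string that looks like a phone number earns more confidence than a bare run of digits.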

Prometheus Metrics Enabled by Default

Prometheus metrics are now enabled by default. A common task among users was enabling the metrics after deployment. This change is to remove the need for that manual change. See Philter's User Guide for more information on the Prometheus metrics.

Standardized Base Images on Ubuntu 20.04 LTS

Previously, Philter deployment images used a different base operating system on each cloud: Amazon Linux on AWS and CentOS on Azure. Philter 1.10.0 standardizes on a single base operating system image of Ubuntu 20.04 LTS. This provides a consistent experience across cloud platforms: all Philter configuration files are now in the same locations, so regardless of where you deploy Philter the setup and use will be the same.


Speeding up Philter document redaction with a GPU

In many cases, using Philter on a CPU will provide sufficient performance. However, in deployments where performance is more important, using Philter with a GPU can provide a 10x or greater improvement.

In this post we will show how Philter running on an m5.large EC2 instance on AWS averaged about 1 second to filter a document. When the same set of documents was filtered using Philter on a p3.2xlarge EC2 instance, the average time per document fell to around 0.1 seconds. That's a significant difference!

Things to Know

As of Philter 1.10.0 the only filter that can use the GPU is the named person's filter. If you aren't using this filter then you will not see any performance benefit from running with a GPU. This will likely change in the future.

Philter with a GPU on AWS EC2

The AWS EC2 p3.2xlarge instance type has a single Tesla V100 GPU. You do need to install the NVIDIA CUDA drivers onto the EC2 instance. Contact us or refer to Philter's User's Guide for the installation scripts. No Philter configuration changes are needed to use the GPU because when a GPU is present Philter will automatically use it. In summary, the steps are:

  1. Launch an instance of Philter on any EC2 instance type.
  2. Stop the Philter instance.
  3. Change the instance type to p3.2xlarge and start the instance.
  4. Install the NVIDIA CUDA drivers. (Contact us or see Philter's User's Guide for the installation scripts.)
  5. Reboot the instance.

Philter is ready to serve API requests!

Monitoring the Performance

You can monitor the performance of Philter using the Prometheus monitor. Once enabled, Philter's metrics will be available for scraping at http://philter:9100/metrics. You will want to look at the philter_ner_entity_time_ms_seconds_sum and philter_ner_entity_time_ms_seconds_count metrics.

The first metric is the total number of milliseconds spent applying the NER (named person's) filter. The second metric is the total number of times the filter was applied. Dividing the first by the second gives the average time spent applying the filter. The screenshot below shows an example Grafana configuration for those metrics.
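That same division can be done on a scraped Prometheus payload directly. The metric names come from the post above; the sample values here are made up for illustration.

```python
def average_ner_time_ms(metrics_text: str) -> float:
    """Parse Prometheus text-format metrics and compute the average
    NER filter time in milliseconds (sum divided by count)."""
    values = {}
    for line in metrics_text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blanks in the exposition format
        name, _, value = line.partition(" ")
        values[name] = float(value)
    total_ms = values["philter_ner_entity_time_ms_seconds_sum"]
    count = values["philter_ner_entity_time_ms_seconds_count"]
    return total_ms / count

# Made-up sample scrape: 5000 ms over 50 applications -> 100 ms average.
sample = """\
philter_ner_entity_time_ms_seconds_sum 5000.0
philter_ner_entity_time_ms_seconds_count 50
"""
```

In practice you would point this at the text returned from http://philter:9100/metrics, or simply graph the ratio in Grafana as the post describes.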

Philter Managed Deployment

Philter Managed Deployment on the AWS Marketplace

Today we are excited to announce the availability of the Philter Managed Deployment on the AWS Marketplace, which lets you quickly deploy a pre-configured instance of Philter into a HIPAA-compliant AWS VPC.

Often, a challenge of using a product like Philter in the cloud is ensuring compliance with the requirements of HIPAA. With the Philter Managed Deployment, our team of AWS-certified engineers will construct a HIPAA-compliant cloud architecture to support Philter and your document workload.

To get started visit the Philter Managed Deployment on the AWS Marketplace and click the Continue button. Complete the form and click Send Request to Seller. This does not obligate you to anything. We will receive your request and reach out to begin the conversation.

Philter Price Reduction

We are thrilled to announce a price reduction for Philter on the cloud marketplaces. Philter pricing now starts at $0.49/hr, down from $0.79/hr. Additionally, where applicable, the annual pricing has been reduced accordingly as well. The pricing is tiered depending on the size of the compute instance running Philter.

The price change will take effect across the AWS, Azure, and Google Cloud marketplaces in the coming week. This pricing change will not affect support or managed services.

We would like to take a moment to thank our users for helping to make this price reduction possible. It is because of our supportive users that we are able to make this change.

AWS Marketplace

Philter 1.8.0

Philter 1.8.0 has been released.

This version brings:

  • The ability to capture timing metrics for each of the filter types. Capturing these metrics will provide insights into the performance of the filters.
  • The ability to specify terms to ignore in files for each filter type. Previously, lists of ignored terms had to be specified in the filter profile. Being able to specify the terms to ignore in files outside the filter profile allows for cleaner, easier-to-manage filter profiles.
  • A single Docker container. Prior to 1.8.0, Philter was composed of two Docker containers that had to be deployed together. Philter 1.8.0 now consists of only a single Docker container. This should help remove some of the complexity around deploying Philter in a containerized environment. We have also migrated Philter's Docker images from DockerHub to our own Docker registry. Learn more about the Philter containers.

Philter 1.8.0 is now available for deployment from the cloud marketplaces and as a Docker container.

Launch Philter in your cloud. See Philter's full Release Notes.

Philter 1.7.0

We are happy to announce that Philter 1.7.0 has been released and is currently being published to DockerHub and the AWS, Azure, and Google Cloud marketplaces. Look for it to be available for deployment into your cloud in the next couple of days.

Click here to deploy Philter in your cloud of choice!

Philter finds and removes sensitive information, such as PII and PHI, in text. Philter can be integrated with virtually any platform, such as Apache Kafka, Apache Flink, Apache NiFi, Apache Pulsar, and Amazon Kinesis. Philter can redact, replace, encrypt, and hash sensitive information.

Philter is capable of redacting:  Ages, Bitcoin Addresses, Cities, Counties, Credit Cards, Custom Dictionaries, Custom Identifiers (medical record numbers, financial transaction numbers), Dates, Drivers License Numbers, Email Addresses, IBAN Codes, IP Addresses, MAC Addresses, Passport Numbers, Persons' Names, Phone/Fax Numbers, SSNs and TINs, Shipping Tracking Numbers, States, URLs, VINs, Zip Codes

Learn more about Philter.

Philter Version
Launch Philter on AWS: 1.11.0
Launch Philter on Azure: 1.11.0
Launch Philter on Google Cloud: 1.11.0

What's New in Philter 1.7.0?

Philter 1.7.0 brings a new experimental feature that breaks large text into smaller pieces for more efficient processing. The feature is described below; we welcome and encourage your feedback, but caution that it may undergo major changes in future versions.

Some of the changes and new features in Philter 1.7.0 are described below. Refer to the Release History for a full list of changes.

Automatically Splitting Input Text

Philter 1.7.0 brings a new experimental feature that breaks long input text into pieces and processes each piece individually. After processing, Philter combines the individual results into a single response back to the client. The purpose of this feature is to allow Philter to better handle long input text.

What counts as "long" input text can depend on several factors, such as the hardware running Philter, the network, and the density of sensitive information in the text. Because of this, you have some control over how Philter breaks long text into separate pieces. You can choose between two methods of splitting: the first splits the text based on the locations of new line characters; the second splits the text into individual lines of nearly equal length.

The alternative to allowing Philter to split the text is to split the text yourself client side prior to sending the text to Philter. When doing the split client side you have full control over how the text is split. On the flip side, you also have to handle the individual response for each split, something Philter handles for you when you delegate the splitting to Philter.
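As a sketch of client-side splitting, here is one way to break text at newline boundaries while keeping each piece under a target size. Philter's internal splitting logic may differ; the `max_chars` threshold here is an arbitrary illustration.

```python
def split_on_newlines(text: str, max_chars: int = 1000):
    """Split text into chunks at newline boundaries, keeping each chunk
    at or under max_chars where the line lengths allow it."""
    chunks, current, size = [], [], 0
    for line in text.split("\n"):
        # Start a new chunk when adding this line would exceed the target.
        if current and size + len(line) + 1 > max_chars:
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line) + 1  # +1 accounts for the joining newline
    if current:
        chunks.append("\n".join(current))
    return chunks
```

Joining the chunks back with newlines reproduces the original text, so each chunk can be sent to Philter independently and the redacted results recombined in order.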

Input text splitting is enabled and configured in filter profiles. Because splitting is configured per filter profile, some text can be split and other text left intact depending on the profile chosen for the text.

See Philter's User's Guide for how to configure splitting in a filter profile.

If you use this feature please send us feedback. We are looking to improve it for future versions and value your feedback. Please see the User's Guide for more details.

Reporting Metrics via Prometheus

Philter already supported metrics reporting via JMX, Amazon CloudWatch, and Datadog. In Philter 1.7.0 we added support for monitoring Philter's metrics via Prometheus. When enabled, Philter exposes an HTTP endpoint suitable for scraping by Prometheus. See Philter's Settings for details on how to enable the Prometheus metrics. Look for a separate blog post soon that dives into monitoring Philter's metrics with Prometheus.

Smaller AWS EBS Volume

The EBS volume size for Philter 1.7.0 has been reduced from 20 GB to 8 GB, which reduces Philter's monthly cost by about $1.20 thanks to the smaller SSD volume. That saving may seem trivial, but when multiple Philter instances are deployed it adds up.

Other Changes

Other new features in Philter 1.7.0 include:

  • Terms can now be ignored based on regular expression patterns. Previously Philter had the ability to ignore specified terms but the terms had to match exactly. Now you can specify terms to ignore via regular expression patterns. An example use of this new feature is to ignore non-sensitive information that can change such as timestamps in log messages.
  • Added ability to read ignored terms from files outside of the filter profile.
  • Custom dictionary terms can now be phrases or multi-term keywords.
  • Added “classification” condition to Identifier filter to allow for writing conditionals against the classification value.
  • Added configurable timeout values to allow for modifying timeouts of internal Philter communication. This can help when processing larger amounts of text. See the Settings for more information.
  • Added option to IBAN Code filter to allow spaces in the IBAN codes.
  • Ignore lists for individual filters are no longer case-sensitive. (“John” will also be ignored when it appears as “JOHN.”)
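The regex-based ignore feature from the first bullet can be illustrated with a short sketch. This mirrors the concept, not Philter's implementation: spans whose text fully matches an ignore pattern are dropped before redaction, so non-sensitive but changing values like log timestamps pass through untouched.

```python
import re

# Example ignore pattern: ISO-8601 timestamps as found in log messages.
IGNORE_PATTERNS = [
    re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}"),
]

def drop_ignored(spans):
    """Remove identified spans whose text matches any ignore pattern."""
    return [s for s in spans
            if not any(p.fullmatch(s) for p in IGNORE_PATTERNS)]
```

With this in place, a timestamp flagged by an overzealous filter is ignored while genuinely sensitive spans remain for redaction.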

Protecting Sensitive Information in Streaming Platforms

Streaming platforms like Apache Kafka and Apache Pulsar provide wonderful capabilities for ingesting data. With these platforms we can build all types of solutions across many industries, from healthcare to IoT and everything in between. Inevitably, the problem arises of how to deal with sensitive information in the streaming data. How do we make sure that data never crosses a boundary? How do we keep it safe? How can we remove the sensitive information from the incoming data so we can continue processing it? These are all very good questions, and in this post we present a couple of architectures that, together with Philter, can help protect the sensitive information in your streaming data.

Whether you are using Apache Kafka, Apache Pulsar, or some other streaming platform is largely irrelevant. These platforms are built on top of the same core concepts and even share quite a bit of terminology, such as brokers and topics. (A broker is a single instance of Kafka or Pulsar, and a topic is how the streaming data is organized when it reaches the broker.)

Streaming Healthcare Data

Let's assume you have a three-broker installation of Apache Kafka accepting streaming data from a hospital. This data contains patient information that includes PII and PHI. An external system publishes data to your Apache Kafka brokers. The brokers receive the data and store it in topics, and a downstream system consumes from Apache Kafka, analyzes the text, and persists the resulting statistics to a database. Even though this is a hypothetical scenario, it is an extremely common deployment architecture for distributed and streaming technologies.

Now you ask yourself those questions we mentioned previously. How do we keep the PII and PHI in our streaming data secure? Your downstream processor does not care about the PII and PHI since it is only aggregating statistics. Having downstream systems process data containing PII and PHI puts the system at risk of inadvertent HIPAA violations by enlarging the perimeter that contains PII and PHI. Removing the PII and PHI from the streaming data before it is consumed by the downstream processor keeps the data safe and the system in compliance.

Philter finds and removes sensitive information from text. Learn more about Philter.

Filtering the Sensitive Information from the Streaming Data

There are a couple of things we can do to remove the PII and PHI from the streaming data before it gets to the downstream processor.

The first option is to use Apache Kafka Streams or an Apache Pulsar Function (depending on which platform you are running) to consume the data, filter out the PII and PHI, and publish the filtered text back to Kafka or Pulsar on a different topic. Then update the name of the topic the downstream processor consumes from. The raw data from the hospital containing PII and PHI stays in its own topic. You can use Apache Kafka ACLs on the topics to prevent someone from inadvertently consuming from the raw topic, permitting them to consume only from the filtered topic. If, however, the very existence of raw data containing PII and PHI on the brokers is a concern, continue on to option two below.
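The consume, redact, and republish pattern of this first option can be sketched as follows, with in-memory lists standing in for topics and a stub in place of a real call to Philter. A production deployment would use a Kafka Streams topology or a Pulsar Function instead; the names and example records here are illustrative.

```python
def redact(text: str) -> str:
    # Stand-in for a call to Philter's API; a real deployment would send
    # the record text to a Philter instance for redaction.
    return text.replace("John Smith", "{{{REDACTED-person}}}")

def filter_topic(raw_topic, filtered_topic, redact_fn):
    """Consume every record from the raw topic, redact it, and publish the
    result to the filtered topic that downstream processors read from."""
    for record in raw_topic:
        filtered_topic.append(redact_fn(record))

# Hypothetical records on the raw topic.
raw = ["Patient John Smith was admitted.", "Vitals stable overnight."]
filtered = []
filter_topic(raw, filtered, redact)
```

The raw records containing PHI stay on their own topic while downstream consumers see only the filtered topic, which is the boundary the ACLs in the text above enforce.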

The second option is to utilize a second Apache Kafka or Apache Pulsar cluster. Place this cluster in between the existing cluster and the downstream processor. Create an application to consume from the topic on the first brokers, remove the PII and PHI, and then publish the filtered data to a topic on the new brokers. (You can use something like Apache Flink to process the data. At the time of writing, Kafka Streams cannot be used because the source brokers and the destination brokers must be the same.) In this option, the sensitive data is physically separated from the rest of the data by residing on its own brokers.

Which option is best for you depends on your requirements around processing and security. In some cases, separate brokers may be overkill. But in other cases it may be the best option due to the physical boundary it creates between the raw data and the filtered data.


Refreshed cloud images

While we continue development on Philter 1.7.0, we have released minor updates to the AWS, Azure, and Google Cloud marketplaces. The only change in these minor updates is a refresh of the base image to include all available operating system updates. There are no changes to the Philter software. In the future we plan to consolidate our refreshed image updates for each cloud to minimize the number of separate versions.

There is no need to update to the minor release version if you are keeping the operating systems of your existing Philter instances up to date.

You can find the details of all releases in Philter's Release Notes.

The Performance of Philter

In all of our design and development of Philter, performance is always one of our top priorities. We ask ourselves questions like: How can we implement this awesome new feature without negatively impacting performance? What can we do to improve performance?

However, the word "performance" can have a few different meanings when used in relation to Philter. In this post I want to dive down into the "performance" of Philter and how it impacts Philter's development.

Performance: Efficient processing of text

The first meaning of performance relates to how efficient Philter is when processing text. Philter takes input text, filters the sensitive information, and returns the output text. (That middle step is simplified a lot but hopefully you get the idea.) If any of these steps are not efficient, or performant, Philter won't be usable. Your client applications will time out and you will not want to use Philter.

Any new features or modifications to Philter's filtering capabilities have to be carefully designed and implemented. Even a small, seemingly innocent change can have large negative effects on performance. Because of that, we as the developers must be careful and test accordingly. We use efficient data structures and make careful choices to select the operations that will provide the best performance.

This type of performance is typically measured in compute time and that's how we measure it. We have thousands of test cases that we execute with each new build of Philter. Over time we can see a history of the processing time and its downward trend as Philter gets more efficient.

Performance: Labeling the appropriate information as sensitive

The second meaning of performance may sometimes be referred to as accuracy. This meaning relates to how well Philter identifies the sensitive information in the input text. Was all the sensitive information that Philter identified actually sensitive? Were there any false positives? False negatives? This type of performance is typically measured by a percentage, or by terms from information retrieval such as precision and recall.

In some cases, Philter's identification of sensitive information is non-deterministic, meaning statistics and machine learning algorithms are applied to locate sensitive information. Contrast this with a deterministic process such as looking for terms from a dictionary. How some of Philter's filters identify sensitive information can be controlled through a sensitivity level. Setting the sensitivity to high will likely identify more sensitive information but also produce more false positives. Conversely, setting the sensitivity to low will likely find less sensitive information and produce more false negatives. The medium sensitivity level aims to bridge this gap. In some cases a false positive may be more acceptable than a false negative, so a high sensitivity level is used; for the information retrieval folks out there, this is known as maximizing recall.
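For reference, precision and recall are computed from the counts of true positives (TP), false positives (FP), and false negatives (FN):

```python
def precision(tp: int, fp: int) -> float:
    # Of everything flagged as sensitive, what fraction truly was?
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    # Of everything truly sensitive, what fraction was flagged?
    return tp / (tp + fn)
```

A high sensitivity level trades precision for recall: more spans are flagged, so fewer sensitive items are missed (fewer false negatives) at the cost of more false positives.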

For sensitive information like person's names we offer various models trained for specific domains. The purpose of this is to provide a higher level of accuracy when Philter is used in those domains.

Putting them together

Philter's "performance" is both of these. Philter must perform well in terms of time and processing efficiency as well as finding the appropriate sensitive information. We believe that both types are equally important. A system that takes hours to complete but with more accuracy may be just as unusable as a system that completes in milliseconds but finds no sensitive information.

If you are not yet using Philter to find and remove sensitive information from your text it's easy to get started. Just click on your platform of choice below. And if you need help please don't hesitate to reach out. We enjoy helping.

AWS Marketplace

Jeff Zemerick is the founder of Mountain Fog. He is a certified AWS and Google Cloud engineer many times over, current chair of the Apache OpenNLP project, and experienced software engineer. You can visit his website at, or contact Jeff at or on LinkedIn.  

Philter's Custom Dictionary Filter and "Fuzziness"

Philter finds sensitive information in text based on a set of filters that you configure in a filter profile. Some of these filters are for predefined information like SSNs, phone numbers, and names. But sometimes you also have a list of terms specific to your use case that you want to identify. Philter's custom dictionary filter lets you specify a list of terms to label as sensitive information when found in your text.

You can learn more about the custom dictionary filter and all of its properties in the Philter User's Guide.

Philter 1.6.0 adds a new property called "fuzzy" to the custom dictionary filter. The "fuzzy" property accepts a value of true or false. When set to false, text being processed must match an item in the dictionary exactly to be labeled as sensitive information. When set to true, the text does not have to match exactly: misspellings and typos can be present and the text will still be labeled as sensitive information. In this blog post we dive a little deeper to explain how the "fuzziness" works, how it is applied, and the trade-offs when using it.

Also new in Philter 1.6.0 is the ability to provide the custom dictionary filter a path to a file that contains the terms. This way you don't have to include your terms directly in the filter profile.

Sample Filter Profile

To start, here's a simple filter profile that includes a custom dictionary filter. The dictionary contains three terms (john, jane, doe) and fuzziness is enabled with medium sensitivity. When any of those terms are found, they will be redacted with the pattern {{{REDACTED-%t}}}, where %t is replaced by the type, which in this case is custom-dictionary.

{
   "name": "dictionary-example",
   "identifiers": {
      "dictionaries": [
         {
            "customDictionary": {
               "terms": ["john", "jane", "doe"],
               "fuzzy": true,
               "sensitivity": "medium",
               "customDictionaryFilterStrategies": [
                  {
                     "strategy": "REDACT",
                     "redactionFormat": "{{{REDACTED-%t}}}"
                  }
               ]
            }
         }
      ]
   }
}

No fuzziness

We will start by describing what happens when the "fuzzy" property is set to false. This is the default behavior and is consistent with how Philter behaved prior to version 1.6.0. Items in the custom dictionary have to be found in the text exactly as they are in the dictionary. This means "John" is not the same as "Jon."

Disabling fuzziness is more efficient and will provide better performance. That's really all you need to know. But if you like getting into the details of things, read on! Internally, Philter uses an algorithm based on what's known as a bloom filter to efficiently scan a dictionary for matches. A bloom filter "is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set." In this case, the set is your list of terms in the dictionary and an element is each word from the input text. The bloom filter provides an efficient means of determining whether or not a given word is a term in your dictionary that you want to be identified as sensitive information.

A digression into bloom filters

Just to clarify, when we talk about Philter we talk a lot about "filters," such as a filter for SSNs, a filter for phone numbers, and so on. A bloom filter is not a filter like that. A bloom filter is a data structure that provides an efficient means of asking the question "Does this item potentially exist in this dictionary?" A bloom filter will answer "yes, it might" or "no, it does not." Notice that the answer is never a definitive "yes." It's then up to the programmer to check definitively whether that item exists in the dictionary. That's essentially how a bloom filter works.
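To make the idea concrete, here is a minimal, illustrative bloom filter sketch in Java. This is not Philter's implementation; the bit-array size and the double-hashing scheme are arbitrary choices for demonstration.

```java
import java.util.BitSet;

// Minimal bloom filter sketch: terms are hashed into a bit array, and a
// membership check can yield false positives ("yes, it might") but never
// false negatives ("no, it does not").
public class BloomSketch {

    private static final int SIZE = 1024;   // bits in the filter
    private static final int HASHES = 3;    // hash positions per item
    private static final BitSet bits = new BitSet(SIZE);

    // Derive the i-th hash position from the item's hashCode (simple double hashing).
    private static int position(String item, int i) {
        int h1 = item.hashCode();
        int h2 = h1 >>> 16;
        return Math.floorMod(h1 + i * h2, SIZE);
    }

    // Set the bits for a dictionary term.
    public static void add(String term) {
        for (int i = 0; i < HASHES; i++) {
            bits.set(position(term, i));
        }
    }

    // "Yes, it might" only when all of the item's bits are set.
    public static boolean mightContain(String word) {
        for (int i = 0; i < HASHES; i++) {
            if (!bits.get(position(word, i))) {
                return false;
            }
        }
        return true;
    }
}
```

When `mightContain` returns true, a real implementation would then check the actual dictionary to confirm the match.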

Yes, fuzziness!

Enabling fuzziness on a custom dictionary filter works differently. As Philter scans the input text, it not only considers the words or phrases themselves, but Philter also considers derivations of the words and phrases. When fuzziness is enabled, "John" may be the same as "Jon." Enabling fuzziness by setting the "fuzzy" property to true can be useful when you are concerned about misspellings or different spellings of terms in your text.

You can control the level of acceptable fuzziness by setting the "sensitivityLevel" property. Valid values are "low", "medium", and "high." The difference between "Jon" and "John" is considered low while the difference between "Jon" and "Johnny" is considered high. You can use the sensitivityLevel to find an acceptable level of fuzziness appropriate for your custom dictionary and your text. The default sensitivityLevel when not specified is "high."
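Philter does not document the exact string metric behind the sensitivity levels, but edit distance is the classic way to quantify this kind of fuzziness. The sketch below uses Levenshtein distance purely to illustrate why "Jon" versus "John" is a smaller difference than "Jon" versus "Johnny":

```java
// Classic dynamic-programming Levenshtein edit distance: the number of
// single-character insertions, deletions, or substitutions needed to turn
// one string into the other. Illustrative only; not Philter's internals.
public class EditDistance {

    public static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), // delete or insert
                        d[i - 1][j - 1] + cost);                    // substitute (or match)
            }
        }
        return d[a.length()][b.length()];
    }
}
```

Under this metric, "Jon" to "John" is distance 1 while "Jon" to "Johnny" is distance 3, which matches the low-versus-high intuition above.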

An important distinction to make is that currently when fuzziness is disabled the custom dictionary can only contain single words. Phrases are not permitted as dictionary terms in Philter 1.6.0 but are allowed in the upcoming version 1.7.0. The internals of that change are interesting enough for their own blog post!


To summarize:

  • Setting fuzzy to false (the default setting) for the custom dictionary filter will provide better performance, but terms in the custom dictionary must match exactly and only words (not phrases) are allowed in the dictionary.
  • Setting fuzzy to true allows the custom dictionary filter to identify misspellings and different spellings of terms in the custom dictionary at the cost of performance. Use the sensitivityLevel values of low, medium, and high to control the allowed level of fuzziness.

Not yet using Philter?

Join our users across the healthcare, financial, legal, and other industries in using Philter to find and remove sensitive information from your text. Click on your platform below to get started.

AWS Marketplace

Preventing PII and PHI from Leaking into Application Logs


This blog post demonstrates how to use Philter to find and remove sensitive information from application logs. In this post we use log4j and Apache Kafka but the concepts can be applied to virtually any logging framework and streaming system, such as Amazon Kinesis. (See this post for a similar solution using Kinesis Firehose Transformations to filter sensitive information from text.)

There is a sample project for this blog post available here on GitHub that you can copy and adapt. The code demonstrates how to publish log4j log messages to Apache Kafka which is required for the logs to then be consumed and filtered by Philter as described in this post.

PII and PHI in application logs

Development on a system that will contain personally identifiable information (PII) or protected health information (PHI) can be a challenge. Things like proper encryption of data in motion and data at rest and maintaining audit logs have to be considered, just to name a few.

One part of the application's development that's easy to disregard is application logging. Logs make our lives easier. They help developers find and fix problems and they give insights into what our applications are doing at any given time. But sometimes even the seemingly most innocuous log messages can be found to contain PII and PHI at runtime. For example, if an application error occurs and an error is logged, some piece of the logged message could inadvertently contain PII and PHI.

Having a well-defined set of rules for logging when developing these applications is an important thing to consider. Developers should always log user IDs and not user names or other unique identifiers. Reviews of pull requests should consider those rules as well to catch any that might have been missed. But even then can we be sure that there will never be any PII or PHI in our applications' logs?

Philter and application logs

To give more confidence you can process your applications' logs with Philter. In this post we are using Java and the log4j framework but nearly all modern programming languages have similar capabilities or have a similar logging framework. If you are developing on .NET you can use log4net and one of the third-party Apache Kafka appenders available through NuGet.

We are going to modify our application's log4j logging configuration to publish the logs to Apache Kafka. Philter will consume the logs from Apache Kafka and the filtered logs will be persisted to disk. So, to implement the code in this post you will need at least one running Apache Kafka broker and an instance of Philter. Some good music never hurts either.

Spin up Kafka in a docker container:

curl -O
docker-compose up

Spin up Philter in a docker container (request a free license key):

curl -O
docker-compose up

Create a philter topic:

docker-compose exec broker kafka-topics --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic philter

Now we're good to go.

In a later post we will accomplish the same thing of filtering PII and PHI from application logs but with Apache Pulsar instead of Apache Kafka. Apache Pulsar has the concept of Functions that make manipulating the streaming data with Philter very easy. Stay tuned for that post!

Logging to Apache Kafka

Using log4j to output logs to Apache Kafka is very easy. Through the KafkaAppender we can configure log4j to publish the logs to an Apache Kafka topic. Here's an example of how to configure the appender:

<?xml version="1.0" encoding="UTF-8"?>
  <Kafka name="Kafka" topic="philter">
    <PatternLayout pattern="%date %message"/>
    <Property name="bootstrap.servers">localhost:9092</Property>

With this appender in our log4j2.xml file, our application's logs will be published to the Kafka topic called philter. The Kafka broker is running at localhost:9092. Great! Now we're ready to have Philter consume from the topic to remove PII and PHI from the logs.

Using Philter to remove PII and PHI from the logs

To consume from the Kafka topic we are going to use the kafkacat utility. This is a nifty little tool that makes some Kafka operations really easy from the command line. (You could also use the standard script that comes with Kafka.) The following command consumes from the philter topic and writes the messages to standard out.

kafkacat -b localhost:9092 -t philter

Instead of writing the messages to standard out, we need to send them to Philter so that any PII or PHI in them can be removed. To do that we pipe the text to the Philter CLI, which sends the text to Philter and writes the filtered text to standard out. The filtered output could just as easily be redirected to a file such as filtered.log.

kafkacat -C -b localhost:9092 -t philter -q -f '%s\n' -c 1 -e | ./philter-cli-linux-amd64 -i -h https://localhost:8080

This command uses kafkacat to consume (-C) from our philter topic. We tell kafkacat to be quiet (-q) and not produce any extraneous output, to display only the message itself (-f '%s\n'), to consume a single message (-c 1), and to exit (-e) after doing so. The result is that a single message is consumed from Kafka and sent to Philter using the philter-cli. The filtered text is then written to the console.

The flow of the messages

Our example project logs 100 messages to the Kafka topic. Each message looks like the following, where 99 is just an incrementing integer:

08:43:10.066 [main] INFO com.mtnfog.App - This is a sample log message 99. Beginning processing for patient 123-45-6789.

The output from Philter after processing the message is:

08:43:10.066 [main] INFO com.mtnfog.App - This is a sample log message 99. Beginning processing for patient {{{REDACTED-ssn}}}.

We can see that Philter identified 123-45-6789 as a social security number and redacted it from the text. This is a simple example but based on how Philter is configured (specifically its filter profile) we could have been looking for many other types of sensitive information such as person's names, unique identifiers, ages, dates, and so on.

We also just wrote the filtered log to system out. We could have easily redirected it to a file, to another Kafka topic, or to somewhere else. We also used the philter-cli to send the text to Philter. You could have also used curl or one of the Philter SDKs.


In this blog post we showed how Philter can find and remove sensitive information from application log files using log4j's Kafka appender, Apache Kafka, and kafkacat. We hope this example is useful. A sample project to produce log messages to a Kafka topic is available on GitHub.

Jeff Zemerick is the founder of Mountain Fog. He is a certified AWS and Google Cloud engineer many times over, current chair of the Apache OpenNLP project, and experienced software engineer. You can visit his website at, or contact Jeff at or on LinkedIn.  

How We Train Philter's NLP Models

In this post I want to give some insight into how we create and train the NLP (natural language processing) models that Philter uses to identify entities like person's names in text.

Read this first :)

As a user of Philter you don't need to understand or even be aware of how we train Philter's NLP models. But it is helpful to know that Philter's NLP model can be changed based on your domain. For example, we offer some models trained specifically for the healthcare domain. These models were trained to give better performance when using Philter in a healthcare environment. See the bottom of this post for a list of the currently available NLP models for Philter.

What is NLP?

Some sensitive information can be identified by Philter based on patterns (SSNs) or dictionaries. Things like a person's name don't follow a pattern and while it may be found in a dictionary there isn't any guarantee your dictionary will contain all possible names. To identify person's names we rely on a set of techniques collectively known as natural language processing, or NLP.

NLP is a broad term used to describe many types of methods and technologies used to extract information from unstructured, or natural language, text. Some foundational common NLP tasks are to identify the language of some given text and to label the words in a sentence with their parts-of-speech types. More advanced tasks include named-entity recognition, summarizing text passages in a few sentences, translating text from one language to another, and determining the sentiment of a given text. It's a very exciting time in NLP due to lots of recent advancements in neural networks, GPU hardware, and just an explosion in the number of researchers and practitioners in the NLP community.

How does NLP work?

NLP tasks often require a trained model to operate. For instance, language translation requires a model that is able to take words and phrases in one language and produce another language. The model is trained on parallel sets of text in both languages. How the words and phrases are used helps the model determine how the text should be translated. Identifying person's names in text also requires a trained model. Training this type of model requires text that has been annotated, meaning that the entities have been labeled. The algorithms use these labels to train the model to identify names in the future. An example of an annotated sentence:

{person}George Washington{/person} was president.

There are different annotation formats created for different purposes but I'm sure you get the idea. With annotated text we can train our model to know what a person's name looks like when the model is applied to unlabeled text. That's essentially all there is to it.

There are lots of fantastic open-source tools with active user communities for natural language processing. If you are interested in learning the nuts and bolts of NLP, choose a framework in your preferred programming language to lower the learning curve and dive in! The techniques and terminology learned from using one framework will always apply to a different framework even if it is in a different programming language so you aren't at any risk of lock-in.

How We Train Philter's NLP Models

As described above, training our model requires annotated text. We have annotated text for various domains. We use this annotated text, along with a set of word embeddings, a few GPUs, and some time, to train the models for Philter. The output of the training is a file which contains the model. The model can then be used by Philter to identify person's names in text.

Evaluating a Model's Performance

To have an idea of how our model will perform we use two common metrics called precision and recall. These metrics give us an idea of how well the model is performing on our test data. We don't need to get into the details of precision and recall here. However, one important thing we want you to know is that we often try to maximize the recall value when training the model. Maximizing recall means it is better to label some text as a person's name even if it is not than to risk leaving a person's name unlabeled. When dealing with sensitive information in text it can be advantageous to err on the side of caution instead of risking a person's name not being filtered. Restated, maximizing recall means false positives are more acceptable than false negatives.
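As an illustration of the metrics themselves (not Philter code), precision and recall can be computed from counts of true positives, false positives, and false negatives:

```java
// Precision penalizes false positives; recall penalizes false negatives.
// Maximizing recall therefore trades some extra false positives for fewer
// missed names, which is the bias described above.
public class Metrics {

    // Of everything we labeled as a name, how much really was a name?
    public static double precision(int truePositives, int falsePositives) {
        return (double) truePositives / (truePositives + falsePositives);
    }

    // Of all real names in the text, how many did we find?
    public static double recall(int truePositives, int falseNegatives) {
        return (double) truePositives / (truePositives + falseNegatives);
    }
}
```

A model that labels everything as a name has perfect recall but terrible precision, which is why both metrics are tracked together.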

Currently Available Models for Philter

Once we are satisfied with the model's performance we publish it and make it available on our website. We have models for general usage and models more specialized for specific domains such as healthcare. We are continuously training and updating our models to keep them current and improve their performance. The model included with Philter is a general usage model.

To stay up to date on model updates please follow us on Twitter or subscribe to our very low volume newsletter.


Philter 1.6.0

Philter 1.6.0 will be available soon through the cloud marketplaces and DockerHub. This is probably the most significant release of Philter since the first release, 1.0.0.

Version 1.6.0 has many new features and a few fixes. Instead of writing a single blog post for the entire release, we are going to write a few separate posts on the significant new features. We will highlight the new features below and then follow up over the next few days with posts that go more in-depth on each. Check out Philter's Release Notes.

Over the next few days we will be making updates to the Philter SDKs to accommodate the new features in Philter 1.6.0.

Deploy Philter

  • Launch Philter on AWS (Philter version 1.11.0)
  • Launch Philter on Azure (Philter version 1.11.0)
  • Launch Philter on Google Cloud (Philter version 1.11.0)

New Features in Philter 1.6.0

The following are summaries of the new features added in Philter 1.6.0.


Alerts

The new alerts feature in Philter 1.6.0 allows Philter to generate an alert when a given filter condition is satisfied. For example, if you have a filter condition to only match a person's name of "John Smith", when this condition is satisfied Philter will generate an alert. The alert will be stored in Philter and can be retrieved and deleted using Philter's new Alerts API. Details of the alerts are in Philter's User's Guide.

Span Disambiguation

Sometimes a piece of sensitive information could be one of a few filter types, such as an SSN, a phone number, or a driver's license number. The span disambiguation feature works to determine which of the potential filter types is most appropriate by analyzing the context of the sensitive information. Philter uses various natural language processing (NLP) techniques to determine which filter type the sensitive information most closely resembles. Because of the techniques used, the more text Philter sees the more accurate the span disambiguation will become.

Span disambiguation is documented in Philter's User's Guide.

New Filters: Bitcoin Address, IBAN Codes, US Passport Numbers, US Driver's License Numbers

Philter 1.6.0 contains several new filter types:

  • Bitcoin Address - Identify bitcoin addresses.
  • IBAN Codes - Identify International Bank Account Numbers.
  • US Passport Numbers - Identify US passport numbers issued since 1981.
  • US Driver's License Numbers - Identify US driver's license numbers for all 50 states.

Each of these new filters is available through filter profiles.

New Replacement Strategy: SHA-256 with random salt values

We previously added the ability to encrypt sensitive information in text. In Philter 1.6.0 we have added the ability to hash sensitive information using SHA-256. When the hash replacement strategy is selected, each piece of sensitive text will be replaced by the SHA-256 value of the sensitive text. Additionally, the hash replacement strategy has a "salt" property that, when enabled, will cause Philter to append a random salt value to each piece of sensitive text prior to hashing. The random salt value will be included in the filter response.
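A rough sketch of what hashing with a random salt looks like (illustrative only; this is not Philter's source, and the salt format here is an arbitrary choice):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;

// Sketch of a SHA-256 replacement strategy: the sensitive text is replaced
// by the hex digest of text + salt. A random salt means two occurrences of
// the same value produce different hashes, so the salt must be reported
// back alongside the response.
public class HashReplacement {

    public static String hash(String text, String salt) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] bytes = digest.digest((text + salt).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : bytes) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
    }

    // 16 random bytes rendered as hex; the format is our choice for illustration.
    public static String randomSalt() {
        byte[] salt = new byte[16];
        new SecureRandom().nextBytes(salt);
        StringBuilder hex = new StringBuilder();
        for (byte b : salt) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```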

Custom Dictionary Filters Can Now Use an External Dictionary File

Philter's custom dictionary filter lets you specify a list of terms to identify as being sensitive. Prior to Philter 1.6.0, this list of terms had to be provided in the filter profile. With a long list it did not take long for the filter profile to become hard to read and even harder to manage. Now, instead of providing a list of terms in the filter profile you can simply provide the full path to a file that contains a list of terms. This keeps the filter profile compact and easier to manage. You can specify as many dictionary files as you need to and Philter will combine the terms when the filter profile is loaded.
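A filter profile using an external dictionary file might look like the following. Note that the "files" property name here is our assumption for illustration; consult Philter's User's Guide for the exact field name.

```json
{
   "name": "dictionary-file-example",
   "identifiers": {
      "dictionaries": [
         {
            "customDictionary": {
               "files": ["/path/to/terms.txt"],
               "fuzzy": false,
               "customDictionaryFilterStrategies": [
                  {
                     "strategy": "REDACT",
                     "redactionFormat": "{{{REDACTED-%t}}}"
                  }
               ]
            }
         }
      ]
   }
}
```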

Custom Dictionary Filters Now Have a "fuzzy" Property

Philter's custom dictionary filter previously always used fuzzy detection. (Fuzzy detection is like a spell checker - a misspelled name such as "Davd" can be identified as "David.") New in Philter 1.6.0 is a property on the custom dictionary filter called "fuzzy." This property controls whether or not fuzzy detection is enabled. It was added because disabling fuzzy detection when it is not needed provides a significant performance increase: with fuzzy detection disabled, Philter uses an optimized data structure to identify the terms. If you do not need fuzzy detection, we recommend disabling it to take advantage of the performance gain.

Changed "Type" to "Classification"

A few filter types had additional information that provided further description of the sensitive information. For instance, the entity filter had a type that identified the "type" of the entity such as "PER" for person. We have changed the property "type" to "classification" for clarity and uniformity. Be sure to update your filter profiles if you have any filter conditions that use "type" to use "classification" instead. It is a drop-in replacement and you can simply change "type" to "classification."

Add Filter Condition for "Classification"

Philter 1.6.0 adds the ability to have a filter condition on "classification."

Redis Cache Can Now Use a Self-Signed SSL Certificate

Philter 1.6.0 can now connect to a Redis cache that is using a self-signed certificate. New configuration settings for the truststore and keystore allow for trusting the self-signed certificate.

Fixes and Improvements in Philter 1.6.0

The following is a list of fixes and improvements made in Philter 1.6.0.

Fixed Potential MAC Address Issue

We found and fixed a potential issue where a MAC Address might not be identified correctly.

Fixed Potential Ignore Issue with Custom Dictionary Filters

We found and fixed a potential issue where a term in a custom dictionary that is also a term in an ignore list might not be ignored correctly.

Fixed Potential Issue with Credit Card Number Validation

We found and fixed a potential issue where a credit card number might not be validated correctly. This only applies when credit card validation is enabled.


Challenges in Finding Sensitive Information and Content in Text

Finding sensitive information and content in text has been a problem for as long as text has existed. But in the past few years due to the availability of cheaper data storage and streaming systems, finding sensitive information in text has become nearly a universal need across all industries. Today, systems that process streaming text often need to filter out any information considered sensitive directly in the pipeline to ensure the downstream applications have immediate access to the sanitized text. Streaming platforms are commonly used in industries such as healthcare and banking where the data can contain large amounts of sensitive information.

What is "sensitive information"?

Taking a step back, what is sensitive information? Sensitive information is simply any information that you or your organization deems as being sensitive. There are some global types of sensitive information such as personally identifiable information (PII) and protected health information (PHI). These types of sensitive information, among others, are typically regulated in how the information must be stored, transmitted, or used. But it is common for other types of information to be sensitive for your organization. This could be a list of terms, phrases, locations, or other information important to your organization. Simply put, if you consider it sensitive then it is sensitive.

Structured vs. Unstructured

It's important to note we are talking about unstructured, natural language text. Text in a structured format like XML or JSON is typically simpler to manipulate due to the inherent structure of the text. But in unstructured text we don't have the convenience of being told what is a "person's name" like an XML tag <personName> would do for us. There are generally three ways to find sensitive information in unstructured text.

Three Methods of Finding Sensitive Information in Text

The first method is to look for sensitive information that follows well-defined patterns. This is information like US social security numbers and phone numbers. Even though regular expressions are not a lot of fun, we can easily enough write regular expressions to match social security numbers and phone numbers. Once we have the regular expressions it's straightforward to apply the regular expression to the input text to find pieces of the text matching the patterns.
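For example, a sketch of this first method in Java (the pattern below is deliberately simple; production patterns, including Philter's, have to handle more formats and perform validation):

```java
import java.util.regex.Pattern;

// A simple regular-expression check for US SSNs in the form 123-45-6789.
// Illustrative only: real-world patterns also cover unformatted and
// space-separated variants and validate the number ranges.
public class PatternExample {

    private static final Pattern SSN = Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b");

    public static boolean containsSsn(String text) {
        return SSN.matcher(text).find();
    }
}
```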

The second method is to look for sensitive information that can be found in a dictionary or database. This method works well for geographic locations and for information that you might have stored in your database or spreadsheet, such as a column of person's names. Once the list is accessible, again, it is fairly straightforward to look for those items in the text.

The third, and last, method is to employ the techniques of natural language processing (NLP). The technology and tools provided by the NLP ecosystem give us powerful ways to analyze unstructured text. We can use NLP to find sensitive information that does not follow well-defined patterns or is not referenced in a database column or spreadsheet, such as person's or organization's names. The past few years have seen remarkable advancements in NLP allowing these techniques to be able to analyze the text with great success.

Deterministic and Non-deterministic

The first two methods are deterministic. Finding text that matches a pattern and finding text contained in a dictionary is a pass/fail scenario - you either find the text you are looking for or you do not. The third method, NLP, is not deterministic. NLP uses trained models to be able to analyze the text. When an NLP method finds information in text it will have an associated confidence value that tells us just how sure the algorithms are that the associated information is what we are looking for.

Introducing Philter

Philter is our software product that implements these three methods of identifying sensitive information in text. Philter supports finding, identifying, and removing sensitive information in text. You set the types of information you consider sensitive and then send text to Philter. The filtered text without the sensitive information is returned to you. With Philter you have full control over how the sensitive information is manipulated - you can redact it, replace it with random values, encrypt it, and more.

Often, sensitive information can follow the same pattern. For example, a US social security number is a 9 digit number. Many driver's license numbers can also be 9 digit numbers. Philter can disambiguate between a social security number and a driver's license number based on how the number is used in the text. When using a dictionary we can't forget about misspellings. If we simply look for words in the dictionary we may not find a name that has been misspelled. Philter supports fuzzy searching by looking for misspellings when applying a dictionary-based filter.

This isn't nearly all Philter can do but it is some of the more exciting features to date. Take Philter for a test drive on the cloud of your choice. We'd be happy to walk you through it if you would like!

  • Launch Philter on AWS (Philter version 1.11.0)
  • Launch Philter on Azure (Philter version 1.11.0)
  • Launch Philter on Google Cloud (Philter version 1.11.0)


Use your own NLP models with Philter

New in Philter 1.5.0 is the ability to use your own custom NLP models with Philter. Available in both Standard and Enterprise editions.

Philter is able to identify named-entities in text through the use of a trained model. The model is able to identify things, like person's names, in the text that do not follow a well-defined pattern or are easily referenced in a dictionary. Philter's NLP model is interchangeable and we offer multiple models that you can choose from to better tailor Philter to your use-case and your domain.

However, there are times when using our models may not be sufficient, such as when your use-case does not exactly match our available models or you want to try to get better performance by training a model on text very similar to your input text. In those cases you can train a custom NLP model for use with Philter.

Getting Started

Before diving into the details, here are some examples that illustrate how to make a custom NLP model usable for Philter. Feel free to use these examples as a starting point for your models. Additionally, this capability has been documented in Philter's User's Guide.

The first example in that repository is an implementation of a named-entity recognizer using Apache OpenNLP. The entity recognizer is exposed via two endpoints using a Spring Boot REST controller.

  • The first endpoint /process simply passes the received text to the entity recognizer. The entity recognizer receives the text, extracts the entities, and returns them in a list of PhilterSpan objects.
  • The second endpoint /status simply returns HTTP 200 and the text healthy. In cases where your model may take a while to load, it would be better to actually check the status of the model instead of always reporting healthy. But for this example, the model loading is quick and done before the HTTP service is even available.

Those are the only required endpoints. Philter will use them to interact with your model.

@RequestMapping(value = "/process", method = RequestMethod.POST, consumes = MediaType.TEXT_PLAIN_VALUE, produces = MediaType.APPLICATION_JSON_VALUE)
public @ResponseBody List<PhilterSpan> process(@RequestBody String text) {
  return openNlpNer.extract(text);
}

@RequestMapping(value = "/status", method = RequestMethod.GET)
public ResponseEntity<String> status() {
  return new ResponseEntity<String>("healthy", HttpStatus.OK);
}

Click here to see the full source file.

The PhilterSpan object contains the details of a single extracted entity. An example response from the /process endpoint for a single entity is shown below.

{
    "text": "George Washington",
    "tag": "person",
    "score": "0.97",
    "start": "0",
    "end": "17"
}
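For reference, a minimal PhilterSpan class matching those fields might look like the following. This is a sketch based on the example response; the actual class in the sample project may differ.

```java
// Plain data object for one extracted entity: the matched text, its type,
// the model's confidence, and the character offsets of the match.
public class PhilterSpan {

    private final String text;   // the extracted entity text
    private final String tag;    // entity type, e.g. "person"
    private final double score;  // model confidence in [0, 1]
    private final int start;     // character offset where the entity begins
    private final int end;       // character offset where the entity ends

    public PhilterSpan(String text, String tag, double score, int start, int end) {
        this.text = text;
        this.tag = tag;
        this.score = score;
        this.start = start;
        this.end = end;
    }

    public String getText() { return text; }
    public String getTag() { return tag; }
    public double getScore() { return score; }
    public int getStart() { return start; }
    public int getEnd() { return end; }
}
```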

Custom NLP Models

Training Your Own Model

Philter is indifferent of the technologies and methods you choose to train your custom model. You can use any framework you like, such as Apache OpenNLP, spaCy, Stanford CoreNLP, or your own custom framework. Follow the framework's documentation for training a model.

Using a Model You Already Have

If you already have a trained NLP model you can skip the training part and proceed on to making the model accessible to Philter.

Using Your Own Model with Philter

Once your model has been trained and you are satisfied with its performance, to use the model with Philter you must expose the model by implementing a simple HTTP service interface around it, like the /process and /status endpoints shown earlier. This service facilitates communication between Philter and your model.

Once your model is available behind the HTTP interface described above, you are ready to use the model with Philter. On the Philter virtual machine, simply export the PHILTER_NER_ENDPOINT environment variable to be the location of the running HTTP service. It is recommended you set this environment variable in /etc/environment. If your HTTP service is running on the same host as Philter on port 8888, the environment variable would be set as:

export PHILTER_NER_ENDPOINT=http://localhost:8888/

Now restart the Philter service and stop and disable the philter-ner service.

sudo systemctl restart philter.service
sudo systemctl stop philter-ner.service
sudo systemctl disable philter-ner.service

When a filter profile containing an NER filter is applied, Philter will make requests to your HTTP service, invoking your model's inference and returning the identified named entities.

Recommendations and Best Practices

You have complete freedom to train your custom NLP model using whatever tools and processes you choose. However, in our experience there are a few things that can help you be successful.

The first recommendation is to package your service in a Docker container. Doing so gives you a self-contained image that can be deployed and run virtually anywhere. It simplifies dependency management and protects you from dependency version changes.

The second recommendation is to make your HTTP service as lightweight as possible. Avoid any unnecessary code or features that could negatively impact the speed of your model inference.

Lastly, thoroughly evaluate your model before deploying it to Philter so you have a realistic expectation of its performance.


Using a custom NLP model with Philter is a fairly straightforward process. Train your model, make it accessible by HTTP, and then deploy the HTTP service and the model such that the service is accessible from Philter.

Do you have to train a custom model to use Philter? Absolutely not. Philter's "out-of-the-box" capabilities are sufficient for most use-cases. Will using your own model give you better performance? Possibly. It depends on how well the model is trained and the parameters used. Training your own model can be a difficult and time-consuming activity, so it's best to have some familiarity with the process before starting.

We are excited to offer this feature and look forward to getting your feedback!

                                 Philter Version
Launch Philter on AWS            1.11.0
Launch Philter on Azure          1.11.0
Launch Philter on Google Cloud   1.11.0

Filter Profile JSON Schema

Philter can find and remove many types of PII and PHI. You can select the types of PII and PHI to identify, and control how the identified values are removed or manipulated, through what we call a "filter profile." A filter profile is a file that essentially lets you tell Philter what to do!
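As a quick illustration, a filter profile is a JSON file. The fragment below sketches a profile that redacts email addresses; it follows the general shape of the schema, but consult the published JSON schema for the authoritative field names and structure.

```json
{
  "name": "redact-email",
  "identifiers": {
    "emailAddress": {
      "emailAddressFilterStrategies": [
        {
          "strategy": "REDACT",
          "redactionFormat": "{{{REDACTED-%t}}}"
        }
      ]
    }
  }
}
```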

To help make creating and editing filter profiles a little bit easier, we have published the JSON schema.

This JSON schema can be imported into some development tools to provide features such as validation and autocomplete. The screenshot below shows an example of adding the schema to IntelliJ. More details on this capability and its features are available in the IntelliJ documentation.

Visual Studio Code and Atom (via a package) also include support for validating JSON documents against JSON schemas.

The Filter Profile Registry provides a way to centrally manage filter profiles across one or more instances of Philter.

Apache NiFi for Processing PHI Data

With the recent release of Apache NiFi 1.10.0, it seems like a good time to discuss using Apache NiFi with data containing protected health information (PHI). When PHI is present in data it can present significant concerns and impose many requirements you may not face otherwise due to regulations such as HIPAA.

Apache NiFi probably needs little introduction, but in case you are new to it, Apache NiFi is a big-data ETL application that uses directed graphs called data flows to move and transform data. You can think of it as taking data from one place to another while, optionally, applying some transformation to the data. The data goes through the flow in a construct known as a flow file. In this post we'll consider a simple data flow that reads files from a remote SFTP server and uploads them to S3. We don't need to look at a complex data flow to understand how PHI can impact our setup.

Encryption of Data at Rest and In-motion

Two core things to address when PHI data is present are encryption of the data at rest and encryption of the data in motion. The first step is to identify the places where sensitive data will be at rest and in motion.

For encryption of data at rest, the first location is the remote SFTP server. In this example, let's assume the remote SFTP server is not managed by us, has the appropriate safeguards, and is someone else's responsibility. As the data goes through the NiFi flow, the next place the data is at rest is inside NiFi's provenance repository. (The provenance repository stores the history of all flow files that pass through the data flow.) NiFi then uploads the files to S3. AWS gives us the capability to encrypt S3 bucket contents by default so we will use that through an S3 bucket policy.
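As a sketch of what such a policy can look like, the bucket policy below denies any S3 PutObject request that does not include server-side encryption. The bucket name my-phi-bucket is a placeholder for your own bucket.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyUnencryptedUploads",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::my-phi-bucket/*",
      "Condition": {
        "Null": {
          "s3:x-amz-server-side-encryption": "true"
        }
      }
    }
  ]
}
```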

For encryption of data in motion, we have the connection between the SFTP server and NiFi and between NiFi and S3. Since we are using an SFTP server, our communication to the SFTP server will be encrypted. Similarly, we will access S3 over HTTPS providing encryption there as well.

If we are using a multi-node NiFi cluster, we may also have the communication between the NiFi nodes in the cluster. If the flows only execute on a single node you may argue that encryption between the nodes is not necessary. However, what happens in the future when the flow's behavior is changed and now PHI data is being transmitted in plain text across a network? For that reason, it's best to set up encryption between NiFi nodes from the start. This is covered in the NiFi System Administrator's Guide.

Encrypting Apache NiFi's Data at Rest

The best way to ensure encryption of data at rest is to use full disk encryption for the NiFi instances. (If you are on AWS and running NiFi on EC2 instances, use an encrypted EBS volume.) This ensures that all data persisted on the system will be encrypted no matter where the data appears. If a NiFi processor decides to have a bad day and dump error data to the log there is a risk of PHI data being included in the log. With full disk encryption we can be sure that even that data is encrypted as well.

Looking at Other Methods

Let's recap the NiFi repositories:

- The flowfile repository stores the state and attributes of the flow files currently in the flow.
- The content repository stores the content of the flow files.
- The provenance repository stores the history of all flow files that pass through the data flow.

PHI could exist in any of these repositories when PHI data is passing through a NiFi flow. NiFi does have an encrypted provenance repository implementation, and NiFi 1.10.0 introduces an experimental encrypted content repository, but there are some caveats. (Currently, NiFi does not have an encrypted flowfile repository implementation.)

When using these encrypted repository implementations, spillage of PHI onto the file system through a log file or some other means remains a risk, and there is a bit of overhead from the additional CPU instructions needed to perform the encryption. With an encrypted EBS volume, by comparison, we don't have to worry about spilling unencrypted PHI to the disk, and per the AWS EBS encryption documentation, "You can expect the same IOPS performance on encrypted volumes as on unencrypted volumes, with a minimal effect on latency."

There is also the NiFi EncryptContent processor, which can encrypt (and, despite the name, decrypt!) the content of flow files. This processor has its uses, but only in very specific cases. Encrypting data at the level of the data flow for compliance reasons is not recommended, because the data may still exist unencrypted elsewhere in the NiFi repositories.

Removing PHI from Text in a NiFi Flow

What if you want to remove PHI (and PII) from the content of flow files as they go through a NiFi data flow? Check out our product Philter. It can find and remove many types of PHI and PII from natural language, unstructured text within a NiFi flow. Text containing PHI is sent to Philter, and Philter responds with the same text but with the PHI and PII removed.


Full disk encryption and encrypting all connections in the NiFi flow and between NiFi nodes provide encryption of data at rest and in motion. It's also recommended that you check with your organization's compliance officer to determine whether there are any other requirements imposed by your organization or other relevant regulations prior to deployment. It's best to gather that information up front to avoid rework in the future!

Need more help?

We provide consulting services around AWS and big-data tools like Apache NiFi. Get in touch by sending us a message. We look forward to hearing from you!