Phileas - The Open Source PII and PHI redaction engine

I am delighted to announce that the project that provides the core PII and PHI redaction capabilities of Philter and Phirestream is now open source! Introducing Phileas, the PII and PHI redaction engine! Phileas is now available under the Apache license on GitHub.

Both Philter and Phirestream use Phileas to identify and redact sensitive information like PII and PHI. Phileas does all of the heavy lifting, while Philter and Phirestream make its functionality user-friendly and provide the NLP models.

Everyone is welcome to look at the code that powers Philter and Phirestream, use it, and contribute! In the next few weeks we will be adding better developer documentation to help you utilize Phileas in your applications. For the past 5 years, Phileas was only an internal project used by Philter and Phirestream, so please hang with us while we smooth out the edges and add user-facing documentation!


Philter and Phirestream will remain on the AWS, Azure, and Google Cloud marketplaces. We will continue to provide commercial support for those products. New versions of Philter and Phirestream will use the open source Phileas project.

We decided to open source Phileas because, firstly, we believe in open source. We also want to give our users the ability to look into how Philter and Phirestream work. Identifying and redacting sensitive information is a challenge with important implications! We want our users to have a better understanding of how these products work and to have a more open line of communication as to what features are implemented next. In that regard, we will be migrating our tasks over from our private Jira to GitHub issues in the next few days as well.

What is format-preserving encryption?

In cryptography, you have plain text and cipher text. An encryption algorithm transforms the plain text into the cipher text. The cipher text won't look anything like the plain text, in terms of characters and length. There are many different kinds of encryption algorithms, serving many different purposes. The cipher text for each of these algorithms will all be different.

Let's take the case of a credit card number, a common piece of sensitive information that is often encrypted. A credit card number is 16 digits long. Encrypting the credit card number with the industry standard AES-128-CBC algorithm will produce a cipher text much longer than the credit card number. If we are storing the credit card number in a database column configured for length 16, the cipher text will be too long to be stored in the database column.
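To make the length problem concrete, here is a minimal Java sketch (not taken from Philter) that encrypts a 16-digit number with AES-128-CBC using the JDK's built-in javax.crypto classes. The class name AesLengthDemo, the all-zero key and IV, and the sample card number are placeholders for illustration only. With PKCS#5 padding, a 16-byte input is padded out to two 16-byte blocks, so the cipher text is 32 bytes before any encoding for storage.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;

final class AesLengthDemo {

    // Encrypts the text with AES-128-CBC (PKCS#5 padding) and returns the
    // cipher text length in bytes. The all-zero key and IV are placeholders
    // for illustration only; never use them for real data.
    static int cipherTextLength(String plainText) {
        try {
            byte[] key = new byte[16]; // 128-bit demo key
            byte[] iv = new byte[16];  // demo initialization vector
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
            return cipher.doFinal(plainText.getBytes(StandardCharsets.US_ASCII)).length;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String cardNumber = "4111111111111111"; // 16-digit sample number
        // prints "16 characters in, 32 bytes out"
        System.out.println(cardNumber.length() + " characters in, "
                + cipherTextLength(cardNumber) + " bytes out");
    }
}
```

A column sized for 16 characters cannot hold those 32 bytes, which is the problem format-preserving encryption addresses.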

Format-preserving encryption is a method of encryption that causes the cipher text to retain the same format as the plain text. For example, encrypting a credit card number with a format-preserving encryption algorithm will result in a cipher text that is also 16 characters long but otherwise looks nothing like the original credit card number. Typically, only numeric, alphabetic, or alphanumeric characters can be used with format-preserving encryption.

The cipher text can be decrypted into the original plain text if the original credit card numbers are needed.
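The idea can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not Philter's implementation and not a standardized mode such as NIST's FF1: it runs a small Feistel network over the two 8-digit halves of a 16-digit number, using HMAC-SHA256 from the JDK as the round function. The class name ToyDigitFpe, the round count, and the demo key are all hypothetical. It shows the two properties described above: the output is still 16 digits, and it can be decrypted back to the original. It should not be used to protect real data.

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Locale;

final class ToyDigitFpe {
    private static final long HALF = 100_000_000L; // 10^8: each half holds 8 digits
    private static final int ROUNDS = 10;          // arbitrary choice for the sketch

    // Round function: HMAC-SHA256 over (round number, half), reduced to 8 digits.
    private static long round(byte[] key, int round, long half) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(key, "HmacSHA256"));
            byte[] digest = mac.doFinal(ByteBuffer.allocate(12).putInt(round).putLong(half).array());
            return Math.floorMod(ByteBuffer.wrap(digest).getLong(), HALF);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    private static void requireSixteenDigits(String digits) {
        if (!digits.matches("\\d{16}")) {
            throw new IllegalArgumentException("expected exactly 16 digits");
        }
    }

    static String encrypt(String digits, byte[] key) {
        requireSixteenDigits(digits);
        long left = Long.parseLong(digits.substring(0, 8));
        long right = Long.parseLong(digits.substring(8));
        for (int i = 0; i < ROUNDS; i++) {
            // Feistel step: (left, right) -> (right, left + F(right)) mod 10^8
            long next = (left + round(key, i, right)) % HALF;
            left = right;
            right = next;
        }
        return String.format(Locale.ROOT, "%08d%08d", left, right);
    }

    static String decrypt(String digits, byte[] key) {
        requireSixteenDigits(digits);
        long left = Long.parseLong(digits.substring(0, 8));
        long right = Long.parseLong(digits.substring(8));
        for (int i = ROUNDS - 1; i >= 0; i--) {
            // Undo one Feistel step, walking the rounds in reverse order.
            long previous = Math.floorMod(right - round(key, i, left), HALF);
            right = left;
            left = previous;
        }
        return String.format(Locale.ROOT, "%08d%08d", left, right);
    }

    public static void main(String[] args) {
        byte[] key = "demo-key".getBytes(StandardCharsets.UTF_8); // placeholder key
        String cipherText = encrypt("4111111111111111", key);
        System.out.println(cipherText);               // still 16 digits
        System.out.println(decrypt(cipherText, key)); // the original number
    }
}
```

Real deployments should rely on a vetted format-preserving encryption implementation rather than a hand-rolled construction like this one.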

Learn more about format-preserving encryption.

Format-Preserving Encryption in Philter

Philter 2.1.0 adds format-preserving encryption as a filter strategy for bank numbers, bitcoin addresses, credit cards, drivers license numbers, IBAN codes, passport numbers, SSNs/TINs, package tracking numbers, and VINs. By specifying FPE_ENCRYPT_REPLACE as the filter strategy for one of those items of PII, Philter will encrypt the PII using format-preserving encryption.

Philter will replace the original PII with its encrypted version, and since format-preserving encryption was used, the replacement (encrypted) value will appear in the same format. This is useful when it is important that PII be encrypted but its length not be modified.

If you are not concerned about encrypting the original value, you can use the RANDOM_REPLACE filter strategy to replace PII with random values also in the same format as the original PII. Just remember that random replacement is not encryption and is not reversible. Use random replacement when using documents for machine learning or other processes where the original values are not important.
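To show what "random values in the same format" can mean, here is a minimal Java sketch. It is an assumption about the general idea, not Philter's RANDOM_REPLACE implementation: the class and method names are hypothetical, and the rule it applies (digits become random digits, letters become random letters of the same case, everything else is kept) is just one plausible reading of "same format".

```java
import java.util.Random;

final class RandomReplaceSketch {

    // Replaces every digit with a random digit and every ASCII letter with a
    // random letter of the same case. Separators such as dashes and spaces
    // are kept, so the overall format of the value is unchanged.
    static String randomize(String value, Random random) {
        StringBuilder out = new StringBuilder(value.length());
        for (char c : value.toCharArray()) {
            if (c >= '0' && c <= '9') {
                out.append((char) ('0' + random.nextInt(10)));
            } else if (c >= 'a' && c <= 'z') {
                out.append((char) ('a' + random.nextInt(26)));
            } else if (c >= 'A' && c <= 'Z') {
                out.append((char) ('A' + random.nextInt(26)));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The output differs on every run but always looks like ddd-dd-dddd.
        System.out.println(randomize("123-45-6789", new Random()));
    }
}
```

Because no key is involved, there is no way to recover the original value from the output, which is the difference from format-preserving encryption noted above.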

To enable format-preserving encryption for a type of sensitive information, simply add it to the filter profile. The following is an example filter profile that uses format-preserving encryption for credit card numbers. Just replace the key and tweak values with your own values.

{
   "name": "credit-cards",
   "identifiers": {
      "creditCardNumbers": {
         "creditCardNumberFilterStrategies": [
            {
               "strategy": "FPE_ENCRYPT_REPLACE",
               "key": "...",
               "tweak": "..."
            }
         ]
      }
   }
}

Learn more about format-preserving encryption in Philter's User Guide. Also, Philter has several other filter strategies to give full control over how your data is redacted.

About Philter

Philter redacts PII and PHI from documents to help you maintain HIPAA compliance, meet industry regulations, and leverage your documents for valuable secondary purposes. Philter can be deployed on Amazon Web Services, Google Cloud, and Microsoft Azure.

Our Proven Customer Engagement Process for Document Deidentification

We have helped many users with their document redaction needs across many industries. One thing became very obvious very quickly. No two users (even those in the same industry!) have the same document redaction needs. The types and formats of documents vary widely, as do the types of information that need to be redacted.

Our proven customer engagement process for helping customers evaluate Philter and Phirestream is described below. This process is available at no cost. We have created and refined this process to help customers make informed decisions about Philter and Phirestream and to streamline the evaluation process.

We are always an email or a phone call away during the process! To get started get in touch.

Step One – Document Sharing

As the first step in our customer engagement process we ask you to share with us documents representative of those you would like redacted. We will process the documents in line with your redaction needs and provide you with the redacted documents for your review.

If you have sample documents that do not contain any PII, PHI, or other sensitive information, those documents are ideal. However, if you do not have any such documents, we are prepared to receive documents subject to industry regulations such as HIPAA and PCI-DSS. We will provide details on how to share your documents with us in a secure and encrypted manner. We store your documents with the highest level of security and only for the minimum time required to process the files. We will sign an NDA or BAA if required.

Step Two – Sample Document Redaction

Once we receive your documents and your redaction requirements, we will begin work to redact the files with Philter. This involves creating an instance of Philter in an isolated compute environment just for your documents. We do this to ensure maximum security of your documents. We will configure Philter and begin to redact your provided documents.

This step usually takes one to three working days based on the number of sample documents and the complexity of the redaction requirements.

Step Three – Review of Redacted Documents

We will share with you the redacted documents for your review. We will schedule a phone call or send an email with a detailed overview of how Philter operated, answer any questions you may have, and provide the Philter configuration we used for the redaction. At this point our goal is for you to have a sufficient understanding of Philter and its capabilities as they relate to your documents.

To get started get in touch!

Redacting Legal Documents with Philter

Philter has powerful filtering capabilities that can redact information such as names, dates, phone numbers, and other sensitive information from PDF documents.

Interested in using Philter to redact your legal documents? Get in touch for a complimentary exploration.

Supported Types of Sensitive Information

Philter is not limited to names and dates as redacted in the example document below. Philter supports many other types of sensitive information such as phone numbers, social security numbers, International Bank Account Numbers (IBANs), passport numbers, drivers license numbers, and many more types. You can also create your own custom types of sensitive information to identify information like case numbers or other identifiers.

Example Legal PDF Document Redaction

The example PDF document shown below was processed by Philter. The top image is the first page of the original PDF. The bottom image is the redacted version after being processed by Philter. Names and dates were the selected information to be redacted.



Getting Started with Philter on Google Cloud

Philter is available to users of Google Cloud through the Google Cloud Marketplace. Using the marketplace, you can provision a Philter virtual machine with just a few clicks. The provisioning process is easy, but we have a Getting Started with Philter on Google Cloud Platform guide to make getting started even easier.

Philter is now Certified for Cloudera Dataflow

We are excited to announce that Philter is now certified for Cloudera Dataflow (CDF). By leveraging Philter in your Apache NiFi data flows, you can redact protected health information (PHI), personally identifiable information (PII), and other types of sensitive information from your data.

Using Philter in Cloudera Dataflow is as simple as making the Philter processors available to your Apache NiFi instance and adding the processors to your canvas. You can configure the types of information to redact right inside the processor's properties. You can choose whether to use a centralized Philter instance or perform the redaction directly within the Apache NiFi flow. The first option allows for a centralized configuration while the latter provides significant performance improvements.

Philter on Cloudera Dataflow is compatible with all public clouds supported by Cloudera Dataflow.

Get Started

To get started with Philter on Cloudera Dataflow please contact us and we can guide you through the process of getting started. Visit our partner information on the Cloudera partner portal.

About Cloudera Dataflow

Cloudera DataFlow (CDF) is a CDP Public Cloud service that enables self-serve deployments of Apache NiFi data flows from a central catalog to auto-scaling Kubernetes clusters managed by CDP. Flow deployments can be monitored from a central dashboard with the ability to define KPIs to keep track of critical data flow metrics. CDF eliminates the operational overhead that is typically associated with running Apache NiFi clusters and allows users to fully focus on developing data flows and ensuring they meet business SLAs. Learn more about Cloudera Dataflow.


Philter featured in the AWS Marketplace's Healthcare Compliance Category

We are excited to share that Philter is now a featured product in the AWS Marketplace's Healthcare Compliance category.

The products selected for this feature “help ensure that IT infrastructure is compliant with changing policies and regulations, allowing teams to focus on driving patient-centric innovation.”

 Philter on AWS for Healthcare Compliance Data Sheet

Philter redacts PHI, PII, and other sensitive information from documents and text. With Philter, users can select the types of sensitive information to redact, anonymize, encrypt, or tokenize.

Philter can be launched in your AWS cloud via the AWS Marketplace in just a few minutes. Philter runs entirely within your private VPC so your sensitive data never has to leave your VPC.

Philter in an AWS Reference Architecture for HIPAA

Philter is a featured solution in the AWS Marketplace Healthcare Compliance category!

AWS has provided a HIPAA Reference Architecture for applications that contain protected health information (PHI). This reference architecture gives us a starting point for a highly available architecture that spans multiple availability zones and includes three VPCs (for management, production, and development resources), logging, a VPN for customer connectivity, and a set of AWS Config rules. The source code for the reference architecture is available.

Philter is our software that redacts PHI, PII, and other sensitive information from text and documents. Philter runs in your cloud so your data never leaves your network to be redacted and its API allows for integration into virtually any application or system. In this blog post we will look at how Philter can be deployed via the AWS Marketplace inside an architecture developed using the AWS HIPAA Reference Architecture.

AWS Reference Architecture for HIPAA with Philter

The image below shows the AWS Reference Architecture for HIPAA with Philter deployments.

HIPAA Reference Architecture with Philter

Redacting PHI in your application

Your application needs to redact PHI prior to processing and you want to deploy Philter in this reference architecture. So how does Philter fit into this architecture? The answer is seamlessly. Philter can be deployed from the AWS Marketplace into one of the private subnets in the production VPC. From there, Philter’s API will be available to the rest of your application. Your application can now send data to Philter for redaction and receive back the redacted text. The VPC flow logs configured as part of the reference architecture will capture the network traffic, and Philter’s application and system logs can be sent to CloudWatch Logs.

If you want to customize the configuration of Philter you can create an AMI from the Philter AMI on the AWS Marketplace. This may be useful if you want to "bake in" configuration for sending logs to CloudWatch Logs or an organizational SSL certificate.

Highly-available Philter deployment

For a highly available Philter deployment, create an autoscaling group for the Philter AMI in the two private subnets of the production VPC. Create a load balancer in the production VPC and register the autoscaling group with it. Now, you have a single endpoint for Philter’s API with a load balanced, highly-available set of instances behind the load balancer. You can configure auto scaling policies if you would like the Philter instances to scale up or down based on network traffic (or some other metric).

Data encryption

Network traffic to Philter will be encrypted. By default Philter uses a self-signed certificate but you can replace it with a certificate for your organization. Also, when deploying the Philter instances be sure to do so using encrypted EBS volumes. These two items will give you encryption of data at rest and in motion for your Philter instances.


You will also want to deploy an instance of Philter in the private subnet of the development VPC. This will give you an instance of Philter to use while developing and testing your application. This Philter instance can be a smaller instance type, such as a t3.large, to save cost.

Get started

To get started, deploy Philter from the AWS Marketplace.

New! Philter Add-Ins for Microsoft Office

We are very excited to announce the new Philter Add-Ins for Microsoft Office! These add-ins bring Philter's redaction capabilities directly into your Microsoft Word documents and Microsoft Excel spreadsheets. With these add-ins you can redact or highlight PHI, PII, and other sensitive information in your documents and spreadsheets with a single click. The Philter Add-Ins for Microsoft Office provide a great leap forward in streamlining document redaction.

Download or learn more about the add-ins at

How do the add-ins work?

Both add-ins add a new content pane that makes Philter available in your document or spreadsheet. In the screenshot below you can see the Philter content pane in Microsoft Word. The user has clicked the Highlight button to highlight the sensitive information in the document. Clicking the Redact button would have redacted the sensitive information instead.

The Philter Add-In for Microsoft Word enables document redaction from directly inside your documents.


With the add-ins you can redact PHI and PII directly in your documents, eliminating the extra step of sending your documents to Philter for redaction. This will help improve your document redaction processes and give you more control.

What are the licensing details?

The add-ins are free to download and use. An instance of Philter is required but can be shared among users. If you aren't yet enjoying Philter's redaction capabilities we can help you get started (or you can get started on your own in your preferred cloud in about 5 minutes).

Redacting PHI and PII from documents using Java

When you need to redact sensitive information like Protected Health Information (PHI) and Personally Identifiable Information (PII) from documents the Philter SDK for Java has you covered!

The Philter SDK for Java is an open source project that provides a client SDK for Philter. With this library it is easy to redact PHI and PII from documents using Philter. Here's an example:

In our Maven project we will add the dependency:


Now, we can instantiate a client:

PhilterClient client = new PhilterClient.PhilterClientBuilder().withEndpoint("").build();

Be sure to change the endpoint to the endpoint of your running Philter instance. Now you are ready to redact!

FilterResponse response = client.filter("George Washington was president.");

The text parameter will be sent to Philter for redaction. The returned object will contain a value that is the redacted text. With the default settings, that return value will be {{{REDACTED-ner}}} was president.

If you want to get more details of what happened you can use the explain function instead.

ExplainResponse response = client.explain("George Washington was president.");

The explain function provides insight into what Philter redacted and why it was redacted. You can use explain to help tune your redaction or for troubleshooting.

That's it! Now you are ready to integrate Philter's powerful redaction capabilities into your Java libraries and applications. For full samples check out the project's test class.

The Philter SDK for Java is licensed under the Apache License, version 2. It is available on GitHub.


Redacting PHI and PII in Apache Kafka Data streams

Apache Kafka can be found in virtually every industry today. Its ability to scale and efficiently process streaming data has made it a cornerstone of data streaming architectures. Some industries, like healthcare, may have restrictions on the data being processed. HIPAA has rules around how PHI (Protected Health Information) is handled. If you are ingesting streaming patient data you may need to redact sensitive information, like patient names and birthdates, in the text. Redacting sensitive information from the streaming data can help protect a pipeline against inadvertent exposure of PHI and PII, and it can help get the data into a form that can be used for secondary purposes, such as training a machine learning model.


Phirestream works alongside Apache Kafka to redact PHI and PII in data streams. Phirestream provides an implementation of Apache Kafka's REST interface that receives your data, redacts the sensitive information, and then sends the redacted data to its destination Apache Kafka topic. It doesn't require you to modify your data ingest pipelines. Instead, you can simply "inject" Phirestream into your pipeline with minimal configuration changes.

You can configure Phirestream to redact the types of PHI and PII you need, such as persons' names, ages, dates, zip codes, and more. You can choose to simply redact each instance of PHI and PII or you can encrypt, anonymize, or randomize the values. Phirestream gives you complete control over how your streaming data is to be redacted.

Using Phirestream

Here's how to use Phirestream. In the example curl request below we are sending a small piece of text to Phirestream for redaction.

curl -k -X POST \
  https://localhost:8080/topics/default \
  -H 'Content-Type: application/vnd.kafka.json.v2+json' \
  -d '{
    "records": [
        {
            "key": "key-1",
            "value": "George Washington was president.",
            "headers": [
              {"key": "profile", "value": "default"}
            ]
        }
    ]
  }'

Phirestream receives the text, redacts the person's name (George Washington), and then publishes the redacted text to an Apache Kafka topic named default. To see the redacted text we can consume from the topic:

   --topic default \
   --bootstrap-server localhost:9092 \

The output of this command will be the redacted text:

{{{REDACTED-entity}}} was president.

Filter Profiles

The types of sensitive information that are identified by Phirestream are defined in files called filter profiles. A filter profile specifies the types of sensitive information and how to redact those types. Phirestream selects which filter profile to apply based on the name of the Apache Kafka topic. In the example above, the topic name was default so the filter profile named default was applied.

Philter 1.10.1

Philter 1.10.1 adds new features for document and text redaction. Take control of the sensitive information in your text through powerful redaction and encryption capabilities.

Philter 1.10.1 will be available for deployment on the AWS, Azure, and Google Cloud marketplaces in the next few days. Contact us for Docker or on-premises deployments.

New User Interface

This version of Philter introduces a user interface for testing Philter's configuration and managing filter profiles. The user interface can be accessed at https://philter:9090. By default, the user interface communicates with the Philter service over SSL and with your web browser over SSL. We are excited to introduce the user interface and we are excited to continue to develop it in future versions.

Two-Way SSL Enabled by Default

Philter cloud marketplace images now have two-way SSL enabled by default. This should reduce manual configuration steps often required to use Philter. See Philter's User's Guide for more information.

Post-Filters Can be Disabled

Philter's post-filters can now be disabled. The post-filters are primarily intended for clean up by doing operations such as removing blank spaces and punctuation. In cases where you might want to leave those characters in identified sensitive information spans you can now individually disable the post-filters.

Phone Number Confidence Values Dynamic based on the Format

Each time a piece of sensitive information is identified in text Philter assigns it a confidence value. This value indicates Philter's "confidence" that the identified text actually is sensitive information. For phone numbers, this confidence value is now dynamic based on the format of the phone number. For example, the phone number (123) 456-7890 would be given a higher confidence than 1234567890. While both are valid phone numbers, the first number is formatted as a phone number. This provides higher assurance that it is a phone number in the text and its confidence value will be higher than the second number's confidence value.
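As a rough illustration of format-dependent confidence, consider the Java sketch below. It is hypothetical: the class name, the two regular expressions, and the confidence values 0.95 and 0.60 are made up for the example and are not the patterns or values Philter uses. It only shows the shape of the idea, where a candidate formatted like a phone number scores higher than ten bare digits.

```java
import java.util.regex.Pattern;

final class PhoneConfidenceSketch {
    // Formatted like a phone number, e.g. (123) 456-7890 or 123-456-7890.
    private static final Pattern FORMATTED =
            Pattern.compile("\\(\\d{3}\\) \\d{3}-\\d{4}|\\d{3}-\\d{3}-\\d{4}");

    // Ten bare digits, e.g. 1234567890, which could also be some other number.
    private static final Pattern BARE = Pattern.compile("\\d{10}");

    // Returns an illustrative confidence between 0 and 1.
    // The values are assumptions for this sketch, not Philter's.
    static double confidence(String candidate) {
        if (FORMATTED.matcher(candidate).matches()) {
            return 0.95;
        }
        if (BARE.matcher(candidate).matches()) {
            return 0.60;
        }
        return 0.0;
    }

    public static void main(String[] args) {
        System.out.println(confidence("(123) 456-7890")); // higher confidence
        System.out.println(confidence("1234567890"));     // lower confidence
    }
}
```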

Prometheus Metrics Enabled by Default

Prometheus metrics are now enabled by default. A common task among users was enabling the metrics after deployment. This change is to remove the need for that manual change. See Philter's User Guide for more information on the Prometheus metrics.

Standardized Base Images on Ubuntu 20.04 LTS

Previously, Philter deployment images across cloud platforms used a different base operating system image on each cloud. AWS was Amazon Linux and Azure was CentOS. Philter 1.10.0 standardizes on a single base operating system image of Ubuntu 20.04 LTS. This provides a consistent experience using Philter across multiple cloud platforms. Now, all Philter configuration files are in the same locations so regardless of where you deploy Philter the setup and use will be the same.


Speeding up Philter document redaction with a GPU

In a lot of cases using Philter on a CPU will provide sufficient performance. However, in deployments where performance is more important, using Philter with a GPU can provide a 10x or greater improvement in performance.

In this post we will show how Philter running on AWS on an m5.large EC2 instance averaged about 1 second to filter a document. When the same set of documents was filtered using Philter on a p3.2xlarge EC2 instance, the average time per document fell to around 0.1 seconds. That's a significant difference!

Things to Know

As of Philter 1.10.0 the only filter that can use the GPU is the named person's filter. If you aren't using this filter then you will not see any performance benefit from running with a GPU. This will likely change in the future.

Philter with a GPU on AWS EC2

The AWS EC2 p3.2xlarge instance type has a single Tesla V100 GPU. You do need to install the NVIDIA CUDA drivers onto the EC2 instance. Contact us or refer to Philter's User's Guide for the installation scripts. No Philter configuration changes are needed to use the GPU because when a GPU is present Philter will automatically use it. In summary, the steps are:

  1. Launch an instance of Philter on any EC2 instance type.
  2. Stop the Philter instance.
  3. Change the instance type to p3.2xlarge and start the instance.
  4. Install the NVIDIA CUDA drivers. (Contact us or see Philter's User's Guide for the installation scripts.)
  5. Reboot the instance.

Philter is ready to serve API requests!

Monitoring the Performance

You can monitor the performance of Philter using the Prometheus monitor. Once enabled, Philter's metrics will be available for scraping at http://philter:9100/metrics. You will want to look at the philter_ner_entity_time_ms_seconds_sum and philter_ner_entity_time_ms_seconds_count metrics.

The first metric is the total number of milliseconds spent applying the NER (named person's) filter. The second metric is the total number of times the filter was applied. Dividing the first number by the second gives us the average time spent each time the filter is applied. The screenshot below shows an example Grafana configuration for those metrics.
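The division can be sketched in a few lines of Java. The sketch below assumes the two metrics appear in the scraped text as plain "name value" lines without labels, which may not match your deployment exactly, and the sample numbers in main are made up. The class name AverageFilterTime is hypothetical; only the two metric names come from the text above.

```java
import java.util.HashMap;
import java.util.Map;

final class AverageFilterTime {

    // Parses "name value" lines from a Prometheus text response,
    // skipping blank lines and "#" comment lines. Assumes no labels.
    static Map<String, Double> parse(String body) {
        Map<String, Double> metrics = new HashMap<>();
        for (String line : body.split("\n")) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            String[] parts = line.split("\\s+");
            if (parts.length >= 2) {
                metrics.put(parts[0], Double.parseDouble(parts[1]));
            }
        }
        return metrics;
    }

    // Average time per application of the filter: sum divided by count.
    static double average(Map<String, Double> metrics) {
        double sum = metrics.getOrDefault("philter_ner_entity_time_ms_seconds_sum", 0.0);
        double count = metrics.getOrDefault("philter_ner_entity_time_ms_seconds_count", 0.0);
        return count == 0 ? 0 : sum / count;
    }

    public static void main(String[] args) {
        // Made-up sample values for illustration.
        String body = "philter_ner_entity_time_ms_seconds_sum 1250.0\n"
                + "philter_ner_entity_time_ms_seconds_count 10\n";
        System.out.println(average(parse(body))); // prints 125.0
    }
}
```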

Redacting information from documents doesn't have to be hard

The title says it all. Redacting information from documents doesn't have to be hard. Yet it can be when the appropriate level of care is not applied. Whether you are manually redacting information from a document or using a tool like Philter, care is required to make sure the redaction is permanent and the redacted text is unavailable when complete.

The American Bar Association details some notable embarrassing redaction failures that have happened in the legal system. There, the American Bar Association describes how information was redacted from PDF documents by having black boxes drawn over the text. At first glance it appears the information under the black boxes has been redacted. However, by simply selecting all of the text in the PDF and pasting it to another application, such as Notepad or Microsoft Word, the redacted text seemingly magically becomes available! This is not only an embarrassing failure for the law firm but it can also present a very serious breach of sensitive information. That information was not supposed to be available for very specific reasons.

Philter can redact information from PDF documents. For security and to prevent instances such as those described on that page, Philter returns image files instead of modified PDF files. The images are the PDF files but with the sensitive information blacked out. The text under the black rectangles in the image cannot be recovered through copy and paste since there is no text under the black rectangles.

Philter's approach to redacting information from PDFs is that, once processed, the information is permanently inaccessible. You still have your original documents that were provided to Philter and now you have the image files containing the permanently redacted text. PDF filtering is available in Philter as of version 1.9.0. We are very excited to offer this capability in Philter and look forward to expanding it through your comments and feedback.

To filter a PDF document, just set the Content-Type header to application/pdf in your request:

curl -k -X POST https://localhost:8080/api/filter --data-binary @file.pdf -H "Content-Type: application/pdf" -O

The response will be saved to a file and will contain the redacted PDF pages as images.

Philter Managed Deployment

Philter Managed Deployment on the AWS Marketplace

Today we are excited to announce the availability of Philter Managed Deployment on the AWS Marketplace to allow you to quickly get started on deploying a pre-configured instance of Philter into a HIPAA-compliant AWS VPC.

Often, a challenge of using a product like Philter in the cloud is ensuring compliance with the requirements of HIPAA. With the Philter Managed Deployment, our team of AWS certified engineers will construct a HIPAA-compliant cloud architecture to support Philter and your document workload.

To get started visit the Philter Managed Deployment on the AWS Marketplace and click the Continue button. Complete the form and click Send Request to Seller. This does not obligate you to anything. We will receive your request and reach out to begin the conversation.

Philter Price Reduction

We are thrilled to announce a price reduction for Philter on the cloud marketplaces. Philter pricing now starts at $0.49/hr, down from $0.79/hr. Additionally, where applicable, the annual pricing has been reduced accordingly as well. The pricing is tiered depending on the size of the compute instance running Philter.

The price change will take effect across the AWS, Azure, and Google Cloud marketplaces in the coming week. This pricing change will not affect support or managed services.

We would like to take a moment to thank our users for helping to make this price reduction possible. It is because of our supportive users that we are able to make this change.

AWS Marketplace

Philter 1.8.0

Philter 1.8.0 has been released.

This version brings:

  • The ability to capture timing metrics for each of the filter types. Capturing these metrics will provide insights into the performance of the filters.
  • The ability to specify terms to ignore in files for each filter type. Previously, lists of ignored terms had to be specified in the filter profile. Being able to specify the terms to ignore in files outside the filter profile allows for cleaner and easier to manage filter profiles.

Philter 1.8.0 is now available for deployment from the cloud marketplaces.

Launch Philter in your cloud. See Philter's full Release Notes.

Philter 1.7.0

We are happy to announce that Philter 1.7.0 has been released and is currently being published to the DockerHub and the AWS, Azure, and Google Cloud marketplaces. Look for it to be available for deployment into your cloud in the next couple of days.

Click here to deploy Philter in your cloud of choice!

Philter finds and removes sensitive information, such as PII and PHI, in text. Philter can be integrated with virtually any platform, such as Apache Kafka, Apache Flink, Apache NiFi, Apache Pulsar, and Amazon Kinesis. Philter can redact, replace, encrypt, and hash sensitive information.

Philter is capable of redacting:  Ages, Bitcoin Addresses, Cities, Counties, Credit Cards, Custom Dictionaries, Custom Identifiers (medical record numbers, financial transaction numbers), Dates, Drivers License Numbers, Email Addresses, IBAN Codes, IP Addresses, MAC Addresses, Passport Numbers, Persons' Names, Phone/Fax Numbers, SSNs and TINs, Shipping Tracking Numbers, States, URLs, VINs, Zip Codes

Current Philter versions on the cloud marketplaces:

  • Launch Philter on AWS (version 2.1.0)
  • Launch Philter on Azure (version 2.1.0)
  • Launch Philter on Google Cloud (version 2.1.0)

What's New in Philter 1.7.0?

Philter 1.7.0 introduces an experimental feature that breaks large text into smaller pieces for more efficient processing. The feature is described below. We welcome and encourage your feedback on it, but caution you that it may undergo major changes in future versions.

Some of the changes and new features in Philter 1.7.0 are described below. Refer to the Release History for a full list of changes.

Automatically Splitting Input Text

Philter 1.7.0 brings a new experimental feature that breaks long input text into pieces and processes each piece individually. After processing, Philter combines the individual results into a single response back to the client. The purpose of this feature is to allow Philter to better handle long input text.

What is a "long" input text can depend on several factors, such as the hardware running Philter, the network, and the density of sensitive information in the text. Because of this, you have some control over how Philter breaks long text into separate pieces. You can choose between two methods of splitting. The first method splits the text based on the locations of new line characters in the text. The second method splits the text into individual lines of nearly equal length.

The alternative to allowing Philter to split the text is to split the text yourself client side prior to sending the text to Philter. When doing the split client side you have full control over how the text is split. On the flip side, you also have to handle the individual response for each split, something Philter handles for you when you delegate the splitting to Philter.
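If you do choose to split client side, the two approaches described above can be sketched roughly as follows. This is only an illustration of the idea, not Philter's implementation: the function names and the max_chars parameter are hypothetical, and filter_piece stands in for whatever call you use to send a piece of text to Philter.

```python
from typing import Callable, List


def split_on_newlines(text: str) -> List[str]:
    """Split at newline characters, keeping each newline with its piece."""
    return text.splitlines(keepends=True)


def split_equal_length(text: str, max_chars: int) -> List[str]:
    """Split into the fewest pieces of at most max_chars, of nearly equal length."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not text:
        return []
    count = -(-len(text) // max_chars)  # ceiling division
    base, extra = divmod(len(text), count)
    pieces, start = [], 0
    for i in range(count):
        # The first `extra` pieces get one additional character.
        end = start + base + (1 if i < extra else 0)
        pieces.append(text[start:end])
        start = end
    return pieces


def filter_in_pieces(pieces: List[str], filter_piece: Callable[[str], str]) -> str:
    """Filter each piece independently and stitch the results back together."""
    return "".join(filter_piece(piece) for piece in pieces)
```

Keep in mind that a length-based split can cut through the middle of a sensitive value, which is one reason splitting on new lines may be the safer choice for line-oriented text such as logs.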

Input text splitting is enabled and configured in filter profiles. Because it is set per filter profile, some text can be split while other text is left unsplit, depending on the filter profile chosen for the text.

See Philter's User's Guide for how to configure splitting in a filter profile.

If you use this feature please send us feedback. We are looking to improve it for future versions and value your feedback. Please see the User's Guide for more details.

Reporting Metrics via Prometheus

Philter already supported metrics reporting via JMX, Amazon CloudWatch, and Datadog. In Philter 1.7.0 we added support for monitoring Philter's metrics via Prometheus. When enabled, Philter will expose an HTTP endpoint suitable for scraping by Prometheus. See Philter's Settings for details on how to enable the Prometheus metrics. Look for a separate blog post soon that dives into monitoring Philter's metrics with Prometheus.

Smaller AWS EBS Volume

The EBS volume size for Philter 1.7.0 has been reduced from 20 GB to 8 GB. Requiring a smaller SSD volume reduces the monthly cost of each Philter instance by $1.20. That may seem trivial, but when multiple Philter instances are deployed the savings add up.
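To see how the savings scale, here is a small worked calculation. The $1.20 figure works out to $0.10 per GB-month across the 12 GB that were removed, so that rate is used below. Treat it as an assumption, since actual EBS pricing depends on the volume type and region. The function name is hypothetical.

```python
from decimal import Decimal

# Assumed storage rate in dollars per GB-month, implied by the $1.20 figure.
PRICE_PER_GB_MONTH = Decimal("0.10")


def monthly_savings(old_gb: int, new_gb: int, instances: int = 1) -> Decimal:
    """Monthly savings in dollars from shrinking the volume on each instance."""
    return (old_gb - new_gb) * PRICE_PER_GB_MONTH * instances


print(monthly_savings(20, 8))       # one instance
print(monthly_savings(20, 8, 25))   # a fleet of 25 instances
```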

Other Changes

Other new features in Philter 1.7.0 include:

  • Terms can now be ignored based on regular expression patterns. Previously Philter had the ability to ignore specified terms but the terms had to match exactly. Now you can specify terms to ignore via regular expression patterns. An example use of this new feature is to ignore non-sensitive information that can change such as timestamps in log messages.
  • Added ability to read ignored terms from files outside of the filter profile.
  • Custom dictionary terms can now be phrases or multi-term keywords.
  • Added “classification” condition to Identifier filter to allow for writing conditionals against the classification value.
  • Added configurable timeout values to allow for modifying timeouts of internal Philter communication. This can help when processing larger amounts of text. See the Settings for more information.
  • Added option to IBAN Code filter to allow spaces in the IBAN codes.
  • Ignore lists for individual filters are no longer case-sensitive. (“John” will be ignored for “JOHN.”)
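As a rough illustration of the two ignore-related changes in the list above (regular expression patterns and case-insensitive ignore lists), the check might look something like the sketch below. This is not Philter's code. The function names are hypothetical, and the timestamp pattern is just one example of non-sensitive text that changes from message to message.

```python
import re
from typing import Iterable, List


def compile_ignore_patterns(patterns: Iterable[str]) -> List[re.Pattern]:
    """Compile the regular expressions once so they can be reused per candidate."""
    return [re.compile(pattern) for pattern in patterns]


def is_ignored(candidate: str,
               ignored_terms: Iterable[str],
               ignored_patterns: Iterable[re.Pattern]) -> bool:
    """Return True if the candidate text should not be treated as sensitive."""
    # Exact terms are compared without regard to case ("John" covers "JOHN").
    if candidate.casefold() in {term.casefold() for term in ignored_terms}:
        return True
    # Patterns must match the whole candidate, not just part of it.
    return any(pattern.fullmatch(candidate) for pattern in ignored_patterns)


timestamp = compile_ignore_patterns([r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"])
print(is_ignored("2020-05-01 08:43:10", [], timestamp))
print(is_ignored("JOHN", ["John"], timestamp))
```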

Protecting Sensitive Information in Streaming Platforms

Streaming platforms like Apache Kafka and Apache Pulsar provide wonderful capabilities around ingesting data. With these platforms we can build all types of solutions across many industries, from healthcare to IoT and everything in between. Inevitably, the problem arises of how to deal with sensitive information that resides in the streaming data. How do we make sure that data never crosses a boundary? How do we keep that data safe? How can we remove the sensitive information from the incoming data so we can continue processing it? These are all very good questions to ask, and in this post we present a couple of architectures that address them. These architectures, along with Philter, can help protect the sensitive information in your streaming data.

Whether you are using Apache Kafka, Apache Pulsar, or some other streaming platform is largely irrelevant. Each of these platforms is built on largely the same concepts, and they even share quite a bit of terminology, such as brokers and topics. (A broker is a single instance of Kafka or Pulsar and a topic is how the streaming data is organized when it reaches the broker.)

Streaming Healthcare Data

Let's assume you have an architecture with a three-broker installation of Apache Kafka that is accepting streaming data from a hospital. This data contains patient information which has PII and PHI. An external system is publishing data to your Apache Kafka brokers. The brokers receive the data and store it in topics, and a downstream system consumes from Apache Kafka, computes statistics by analyzing the text, and persists the results of the analysis into a database. Even though this is a hypothetical scenario, it is an extremely common deployment architecture for distributed and streaming technologies.

Now you ask yourself those questions we mentioned previously. How do we keep the PII and PHI secure in our streaming data? Your downstream processor does not care about the PII and PHI since it is only aggregating statistics. Having those downstream systems process the data containing PII and PHI puts our system at risk of inadvertent HIPAA violations by enlarging the perimeter of the system containing PII and PHI. Removing the PII and PHI from the streaming data before it gets consumed by the downstream processor would help keep the data safe and our system in compliance.

Philter finds and removes sensitive information from text. Learn more about Philter.

Filtering the Sensitive Information from the Streaming Data

There are a couple of things we can do to remove the PII and PHI from the streaming data before it gets to the downstream processor.

The first option is to use Apache Kafka Streams or an Apache Pulsar Function (depending on which you are running) to consume the data, filter out the PII and PHI, and publish the filtered text back to Kafka or Pulsar on a different topic. Then update the name of the topic the downstream processor consumes from. The raw data from the hospital containing PII and PHI will stay in its own topic. You can use Apache Kafka ACLs on the topics to help prevent someone from inadvertently consuming from the raw topic and only permit them to consume from the filtered topic. If, however, the idea of the raw data containing PII and PHI existing on the brokers is a concern, then continue on to option two below.

The second option is to utilize a second Apache Kafka or Apache Pulsar cluster. Place this cluster in between the existing cluster and the downstream processor. Create an application to consume from the topic on the first brokers, remove the PII and PHI, and then publish the filtered data to a topic on the new brokers. (You can use something like Apache Flink to process the data. At the time of writing, Kafka Streams cannot be used because the source brokers and the destination brokers must be the same.) In this option, the sensitive data is physically separated from the rest of the data by residing on its own brokers.

Which option is best for you depends on your requirements around processing and security. In some cases, separate brokers may be overkill. But in other cases it may be the best option due to the physical boundary it creates between the raw data and the filtered data.
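Whichever option you choose, the core of the filtering application is the same loop: consume a raw message, filter it, and publish the result somewhere else. The sketch below shows only that shape. It deliberately uses plain Python lists in place of Kafka or Pulsar topics and a trivial digit-masking function in place of a call to Philter, so every name in it is a stand-in rather than a real client API.

```python
import re
from typing import Callable, Iterable


def mask_digits(text: str) -> str:
    """Stand-in for a call to Philter: masks every digit in the text."""
    return re.sub(r"\d", "#", text)


def relay_filtered(raw_messages: Iterable[str],
                   publish: Callable[[str], None],
                   redact: Callable[[str], str]) -> int:
    """Consume raw messages, filter each one, and publish only the filtered text."""
    count = 0
    for message in raw_messages:
        publish(redact(message))
        count += 1
    return count


raw_topic = ["patient 123 admitted", "no numbers here"]  # stands in for the raw topic
filtered_topic = []                                       # stands in for the filtered topic
relay_filtered(raw_topic, filtered_topic.append, mask_digits)
print(filtered_topic)
```

In option one the publish callable would write to a second topic on the same brokers. In option two it would write to a topic on the separate cluster.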


Refreshed cloud images

While we continue development on Philter 1.7.0 we have released minor updates to the AWS, Azure, and Google Cloud marketplaces. The only change in these minor updates is a refreshed base image that includes all available operating system updates. There are no changes to the Philter software. In the future we plan to consolidate our refreshed image updates for each cloud to minimize the number of separate versions.

There is no need to update to the minor release version if you are keeping the operating systems of your existing Philter instances up to date.

You can find the details of all releases in Philter's Release Notes.

The Performance of Philter

In all of our design and development of Philter, performance is always one of our top priorities. We ask ourselves questions like how can we implement this awesome new feature in Philter without negatively impacting performance? What can we do to improve performance?

However, the word "performance" can have a few different meanings when used in relation to Philter. In this post I want to dive down into the "performance" of Philter and how it impacts Philter's development.

Performance: Efficient processing of text

The first meaning of performance relates to how efficient Philter is when processing text. Philter takes input text, filters the sensitive information, and returns the output text. (That middle step is simplified a lot but hopefully you get the idea.) If any of these steps are not efficient, or performant, Philter won't be usable. Your client applications will time out and you will not want to use Philter.

Any new features or modifications to Philter's filtering capabilities have to be carefully designed and implemented. Even a small, seemingly innocent change can have large negative effects on performance. Because of that we as the developers must be careful and test accordingly. We use efficient data structures and make careful choices to select the operations that will provide the best performance.

This type of performance is typically measured in compute time and that's how we measure it. We have thousands of test cases that we execute with each new build of Philter. Over time we can see a history of the processing time and its downward trend as Philter gets more efficient.

Performance: Labeling the appropriate information as sensitive

The second meaning of performance may sometimes be referred to as accuracy. This meaning relates to how well Philter identifies the sensitive information in the input text. Was all the sensitive information that Philter identified actually sensitive? Were there any false positives? False negatives? This type of performance is typically measured by a percentage, or by terms from information retrieval such as precision and recall.

In some cases, Philter's identification of sensitive information is non-deterministic, meaning statistics and machine learning algorithms are applied to locate sensitive information. Contrast this with a deterministic process such as looking for terms from a dictionary. How some of Philter's filters identify sensitive information can be controlled through a sensitivity level. Setting the sensitivity to high will likely identify more sensitive information but also produce more false positives. Conversely, setting the sensitivity to low will likely result in finding less sensitive information and more false negatives. The sensitivity level of medium aims to bridge this gap. In some cases, a false positive may be more acceptable than a false negative, so a high sensitivity level is used. For the information retrieval folks out there, this is known as maximizing the recall.
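For readers less familiar with the terms, precision is the fraction of the items labeled as sensitive that really were sensitive, and recall is the fraction of the truly sensitive items that were found. The short sketch below computes both from counts of true positives, false positives, and false negatives. The counts in the example are made up purely to show the trade-off and are not measurements of Philter.

```python
from typing import Tuple


def precision_recall(true_positives: int,
                     false_positives: int,
                     false_negatives: int) -> Tuple[float, float]:
    """Compute (precision, recall) from raw counts."""
    labeled = true_positives + false_positives   # everything flagged as sensitive
    actual = true_positives + false_negatives    # everything that really was sensitive
    precision = true_positives / labeled if labeled else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Hypothetical counts: a higher sensitivity finds more of the real items
# (recall goes up) at the cost of more false positives (precision goes down).
print(precision_recall(8, 2, 2))
print(precision_recall(9, 6, 1))
```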

For sensitive information like persons' names we offer various models trained for specific domains. The purpose of this is to provide a higher level of accuracy when Philter is used in those domains.

Putting them together

Philter's "performance" is both of these. Philter must perform well in terms of time and processing efficiency as well as finding the appropriate sensitive information. We believe that both types are equally important. A system that takes hours to complete but with more accuracy may be just as unusable as a system that completes in milliseconds but finds no sensitive information.

If you are not yet using Philter to find and remove sensitive information from your text it's easy to get started. Just click on your platform of choice below. And if you need help please don't hesitate to reach out. We enjoy helping.

AWS Marketplace

Philter's Custom Dictionary Filter and "Fuzziness"

Philter finds sensitive information in text based on a set of filters that you configure in a filter profile. Some of these filters are for predefined information like SSNs, phone numbers, and names. But sometimes you have a list of terms specific to your use-case that you want to identify, too. Philter's custom dictionary filter lets you specify a list of terms to label as sensitive information when found in your text.

You can learn more about the custom dictionary filter and all of its properties in the Philter User's Guide.

Philter 1.6.0 adds a new property called "fuzzy" to the custom dictionary filter. The "fuzzy" property accepts a value of true or false. When set to false, text being processed must match an item in the dictionary exactly for that text to be labeled as sensitive information. When set to true, the text does not have to match exactly. The "fuzzy" property allows for misspellings and typos to be present and still label the text as being sensitive information. In this blog post we want to dive a little bit more into this to better explain how the "fuzziness" works and is applied and the trade-offs when using it.

Also new in Philter 1.6.0 is the ability to provide the custom dictionary filter a path to a file that contains the terms. This way you don't have to include your terms directly in the filter profile.

Sample Filter Profile

To start, here's a simple filter profile that includes a custom dictionary filter. The dictionary contains three terms (john, jane, doe) and fuzziness is enabled with medium sensitivity. When any of those terms are found, they will be redacted with the pattern {{{REDACTED-%t}}}, where %t is replaced by the type which in this case is custom-dictionary.

{
   "name": "dictionary-example",
   "identifiers": {
      "dictionaries": [
         {
            "customDictionary": {
               "terms": ["john", "jane", "doe"],
               "fuzzy": true,
               "sensitivity": "medium",
               "customDictionaryFilterStrategies": [
                  {
                     "strategy": "REDACT",
                     "redactionFormat": "{{{REDACTED-%t}}}"
                  }
               ]
            }
         }
      ]
   }
}

No fuzziness

We will start by describing what happens when the "fuzzy" property is set to false. This is the default behavior and is consistent with how Philter behaved prior to version 1.6.0. Items in the custom dictionary have to be found in the text exactly as they are in the dictionary. This means "John" is not the same as "Jon."

Disabling fuzziness is more efficient and will provide better performance. That's really all you need to know. But if you like getting into the details of things, read on! Internally, Philter uses an algorithm based on what's known as a bloom filter to efficiently scan a dictionary for matches. A bloom filter "is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set." In this case, the set is your list of terms in the dictionary and an element is each word from the input text. The bloom filter provides an efficient means of determining whether or not a given word is a term in your dictionary that you want to be identified as sensitive information.

A digression into bloom filters

Just to clarify, when we talk about Philter we talk a lot about "filters", such as a filter for SSNs, a filter for phone numbers, and so on. A bloom filter is not a filter like that. A bloom filter is an algorithm that provides an efficient means of asking the question "Does this item potentially exist in this dictionary?" A bloom filter will answer "yes, it might" or "no, it does not." Notice the response of "yes, it might." The bloom filter is not saying "Yes." Instead, it is saying "yes, it might." It's then up to the programmer to find out definitively if that item exists in the dictionary. That's essentially how a bloom filter works.
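If you would like to see the "yes, it might" versus "no, it does not" behavior in code, here is a minimal textbook bloom filter followed by the definitive check. It is a teaching sketch and not the algorithm inside Philter. The class name, the bit-array size, the number of hashes, and the use of SHA-256 to derive bit positions are all choices made for this example.

```python
import hashlib
from typing import Iterable, Iterator, List


class BloomFilter:
    """A minimal bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive several bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        """False means definitely absent; True means only 'yes, it might'."""
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def find_terms(words: Iterable[str], terms: Iterable[str]) -> List[str]:
    """Use the bloom filter as a cheap first check, then confirm definitively."""
    exact = set(terms)
    bloom = BloomFilter()
    for term in exact:
        bloom.add(term)
    return [word for word in words if bloom.might_contain(word) and word in exact]


print(find_terms(["john", "smith", "doe"], ["john", "jane", "doe"]))
```

Most words in ordinary text are rejected by the cheap might_contain check, so the definitive lookup only runs for the few words that pass it.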

Yes, fuzziness!

Enabling fuzziness on a custom dictionary filter works differently. As Philter scans the input text, it not only considers the words or phrases themselves, but Philter also considers derivations of the words and phrases. When fuzziness is enabled, "John" may be the same as "Jon." Enabling fuzziness by setting the "fuzzy" property to true can be useful when you are concerned about misspellings or different spellings of terms in your text.

You can control the level of acceptable fuzziness by setting the "sensitivityLevel" property. Valid values are "low", "medium", and "high." The difference between "Jon" and "John" is considered low while the difference between "Jon" and "Johnny" is considered high. You can use the sensitivityLevel to find an acceptable level of fuzziness appropriate for your custom dictionary and your text. The default sensitivityLevel when not specified is "high."
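One common way to put a number on how different two spellings are is edit distance, the count of single-character insertions, deletions, and substitutions needed to turn one into the other. We are not claiming this is the exact measure Philter uses internally. The sketch below is only meant to make the "Jon" examples concrete, and the mapping from level names to maximum distances is hypothetical.

```python
from typing import Iterable


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Hypothetical thresholds chosen for this example only.
MAX_DISTANCE = {"low": 1, "medium": 2, "high": 3}


def fuzzy_match(word: str, terms: Iterable[str], level: str = "high") -> bool:
    """True if the word is within the allowed distance of any dictionary term."""
    limit = MAX_DISTANCE[level]
    return any(edit_distance(word.lower(), term.lower()) <= limit for term in terms)


print(edit_distance("jon", "john"))    # a small difference
print(edit_distance("jon", "johnny"))  # a larger difference
```

Comparing every word against every term this way is far more work than an exact lookup, which is one way to see why enabling fuzziness costs performance.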

An important distinction to make is that currently when fuzziness is disabled the custom dictionary can only contain single words. Phrases are not permitted as dictionary terms in Philter 1.6.0 but are allowed in the upcoming version 1.7.0. The internals of that change are interesting enough for their own blog post!


To summarize:

  • Setting fuzzy to false (the default setting) for the custom dictionary filter will provide better performance but terms in the custom dictionary must match exactly and only words (not phrases) are allowed in the dictionary.
  • Setting fuzzy to true allows the custom dictionary filter to be able to identify misspellings and different spellings of terms in the custom dictionary filter at the cost of performance. Use the sensitivityLevel values of low, medium, and high to control the allowed level of fuzziness.

Not yet using Philter?

Join our users across the healthcare, financial, legal, and other industries in using Philter to find and remove sensitive information from your text. Click on your platform below to get started.

AWS Marketplace

Preventing PII and PHI from Leaking into Application Logs


This blog post demonstrates how to use Philter to find and remove sensitive information from application logs. In this post we use log4j and Apache Kafka but the concepts can be applied to virtually any logging framework and streaming system, such as Amazon Kinesis. (See this post for a similar solution using Kinesis Firehose Transformations to filter sensitive information from text.)

There is a sample project for this blog post available here on GitHub that you can copy and adapt. The code demonstrates how to publish log4j log messages to Apache Kafka which is required for the logs to then be consumed and filtered by Philter as described in this post.

PII and PHI in application logs

Development on a system that will contain personally identifiable information (PII) or protected health information (PHI) can be a challenge. Things like proper encryption of data in motion and data at rest and maintaining audit logs have to be considered, just to name a few.

Figure: Using Philter to find and remove sensitive information in log files using log4j and Apache Kafka.

One part of the application's development that's easy to disregard is application logging. Logs make our lives easier. They help developers find and fix problems and they give insights into what our applications are doing at any given time. But sometimes even the seemingly most innocuous log messages can be found to contain PII and PHI at runtime. For example, if an application error occurs and an error is logged, some piece of the logged message could inadvertently contain PII and PHI.

Having a well-defined set of rules for logging when developing these applications is an important thing to consider. Developers should always log user IDs and not user names or other unique identifiers. Reviews of pull requests should consider those rules as well to catch any that might have been missed. But even then can we be sure that there will never be any PII or PHI in our applications' logs?

Philter and application logs

To give more confidence you can process your applications' logs with Philter. In this post we are using Java and the log4j framework but nearly all modern programming languages have similar capabilities or have a similar logging framework. If you are developing on .NET you can use log4net and one of the third-party Apache Kafka appenders available through NuGet.

We are going to modify our application's log4j logging configuration to publish the logs to Apache Kafka. Philter will consume the logs from Apache Kafka and the filtered logs will be persisted to disk. So, to implement the code in this post you will need at least one running Apache Kafka broker and an instance of Philter. Some good music never hurts either.

Spin up Kafka in a docker container:

curl -O
docker-compose up

Spin up Philter in a docker container:

curl -O
docker-compose up

Create a philter topic:

docker-compose exec broker kafka-topics --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic philter

Now we're good to go.

In a later post we will accomplish the same thing of filtering PII and PHI from application logs but with Apache Pulsar instead of Apache Kafka. Apache Pulsar has the concept of Functions that make manipulating the streaming data with Philter very easy. Stay tuned for that post!

Logging to Apache Kafka

Using log4j to output logs to Apache Kafka is very easy. Through the KafkaAppender we can configure log4j to publish the logs to an Apache Kafka topic. Here's an example of how to configure the appender:

<?xml version="1.0" encoding="UTF-8"?>
  <Kafka name="Kafka" topic="philter">
    <PatternLayout pattern="%date %message"/>
    <Property name="bootstrap.servers">localhost:9092</Property>
  </Kafka>

With this appender in our log4j2.xml file, our application's logs will be published to the Kafka topic called philter. The Kafka broker is running at localhost:9092. Great! Now we're ready to have Philter consume from the topic to remove PII and PHI from the logs.

Using Philter to remove PII and PHI from the logs

To consume from the Kafka topic we are going to use the kafkacat utility. This is a nifty little tool that makes some Kafka operations really easy from the command line. (You could also use the standard script that comes with Kafka.) The following command consumes from the philter topic and writes the messages to standard out.

kafkacat -b localhost:9092 -t philter

Instead of writing the messages to standard out, we need to send them to Philter, where any PII or PHI in them can be removed, so we pipe the text to the Philter CLI. The Philter CLI sends the text to Philter and writes the filtered text to the console. (Redirecting that output to a file such as filtered.log would capture the filtered log messages on disk instead.)

kafkacat -C -b localhost:9092 -t philter -q -f '%s\n' -c 1 -e | ./philter-cli-linux-amd64 -i -h https://localhost:8080

This command uses kafkacat to consume from our philter topic. In this command we are telling kafkacat to be quiet (-q) and not produce any extraneous output, to format the output by displaying only the message (-f), to consume only a single message (-c), and to exit (-e) after doing so. The result of this command is that a single message is consumed from Kafka and sent to Philter using the philter-cli. The filtered text is then written to the console.

The flow of the messages

Our example project logs 100 messages to the Kafka topic. Each message looks like the following where 99 is just an incrementing integer:

08:43:10.066 [main] INFO com.mtnfog.App - This is a sample log message 99. Beginning processing for patient 123-45-6789.

The output from Philter after processing the message is:

08:43:10.066 [main] INFO com.mtnfog.App - This is a sample log message 99. Beginning processing for patient {{{REDACTED-ssn}}}.

We can see that Philter identified 123-45-6789 as a social security number and redacted it from the text. This is a simple example but based on how Philter is configured (specifically its filter profile) we could have been looking for many other types of sensitive information such as persons' names, unique identifiers, ages, dates, and so on.
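To make the transformation above concrete, here is a small pattern-based sketch that produces the same output for this one message. It only mimics the effect for SSNs with a simple regular expression and is not how Philter is implemented. The function name and the pattern are assumptions for this example, while the {{{REDACTED-%t}}} format with %t replaced by the filter type follows the convention used in Philter's filter profiles.

```python
import re

# A simple pattern for SSNs written as 123-45-6789 (an assumption for this sketch).
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text: str, pattern: re.Pattern, filter_type: str,
           redaction_format: str = "{{{REDACTED-%t}}}") -> str:
    """Replace every match of the pattern with the formatted redaction marker."""
    replacement = redaction_format.replace("%t", filter_type)
    # A function is used as the replacement so the marker is inserted literally.
    return pattern.sub(lambda match: replacement, text)


message = ("08:43:10.066 [main] INFO com.mtnfog.App - This is a sample log "
           "message 99. Beginning processing for patient 123-45-6789.")
print(redact(message, SSN_PATTERN, "ssn"))
```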

We also just wrote the filtered log to system out. We could have easily redirected it to a file, to another Kafka topic, or to somewhere else. We also used the philter-cli to send the text to Philter. You could have also used curl or one of the Philter SDKs.


In this blog post we showed how Philter can find and remove sensitive information from application log files using log4j's Kafka appender, Apache Kafka, and kafkacat. We hope this example is useful. A sample project to produce log messages to a Kafka topic is available on GitHub.

How We Train Philter's NLP Models

In this post I want to give some insight into how we create and train the NLP (natural language processing) models that Philter uses to identify entities like persons' names in text.

Read this first :)

As a user of Philter you don't need to understand or even be aware of how we train Philter's NLP models. But it is helpful to know that Philter's NLP model can be changed based on your domain. For example, we offer some models trained specifically for the healthcare domain. These models were trained to give better performance when using Philter in a healthcare environment. See the bottom of this post for a list of the currently available NLP models for Philter.

What is NLP?

Some sensitive information can be identified by Philter based on patterns (SSNs) or dictionaries. A person's name doesn't follow a pattern, and while it may be found in a dictionary, there isn't any guarantee your dictionary will contain all possible names. To identify persons' names we rely on a set of techniques collectively known as natural language processing, or NLP.

NLP is a broad term used to describe many types of methods and technologies used to extract information from unstructured, or natural language, text. Some foundational common NLP tasks are to identify the language of some given text and to label the words in a sentence with their parts-of-speech types. More advanced tasks include named-entity recognition, summarizing text passages in a few sentences, translating text from one language to another, and determining the sentiment of a given text. It's a very exciting time in NLP due to lots of recent advancements in neural networks, GPU hardware, and just an explosion in the number of researchers and practitioners in the NLP community.

How does NLP work?

NLP tasks often require a trained model to operate. For instance, language translation requires a model that is able to take words and phrases in one language and produce another language. The model is trained on equivalent sets of text in both languages. How the words and phrases are used helps the model determine how the text should be translated. Identifying persons' names in text also requires a trained model. Training this type of model requires text that has been annotated, meaning that the entities have been labeled. The algorithms will use these labels to train the model to identify names in the future. An example of an annotated sentence:

{person}George Washington{/person} was president.

There are different annotation formats created for different purposes but I'm sure you get the idea. With annotated text we can train our model to know what a person's name looks like when the model is applied to unlabeled text. That's essentially all there is to it.
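To show what a training tool does with an annotated sentence like the one above, here is a small sketch that strips the labels and records where each entity sits in the plain text. The {label}...{/label} syntax is taken from the example sentence, and the function name and the (start, end, label) tuple layout are choices made for this illustration rather than a description of our training pipeline.

```python
import re
from typing import List, Tuple

# Matches {label}entity text{/label}, with the same label opening and closing.
TAG = re.compile(r"\{(\w+)\}(.*?)\{/\1\}")


def parse_annotations(annotated: str) -> Tuple[str, List[Tuple[int, int, str]]]:
    """Return the plain text and a list of (start, end, label) entity spans."""
    parts: List[str] = []
    spans: List[Tuple[int, int, str]] = []
    cursor = 0   # position in the annotated text
    length = 0   # length of the plain text built so far
    for match in TAG.finditer(annotated):
        before = annotated[cursor:match.start()]
        parts.append(before)
        length += len(before)
        label, entity = match.group(1), match.group(2)
        spans.append((length, length + len(entity), label))
        parts.append(entity)
        length += len(entity)
        cursor = match.end()
    parts.append(annotated[cursor:])
    return "".join(parts), spans


text, spans = parse_annotations("{person}George Washington{/person} was president.")
print(text)
print(spans)
```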

There are lots of fantastic open-source tools with active user communities for natural language processing. If you are interested in learning the nuts and bolts of NLP, choose a framework in your preferred programming language to lower the learning curve and dive in! The techniques and terminology learned from using one framework will always apply to a different framework even if it is in a different programming language so you aren't at any risk of lock-in.

How We Train Philter's NLP Models

As described above, training our model requires annotated text. We have annotated text for various domains. We use this annotated text, along with a set of word embeddings, a few GPUs, and some time, to train the models for Philter. The output of the training is a file which contains the model. The model can then be used by Philter to identify person's names in text.

Evaluating a Model's Performance

To have an idea of how our model will perform we use some common metrics called precision and recall. These metrics give us an idea of how well the model is performing on our test data. We don't need to get into the details of precision and recall here. However, one important thing we want you to know is that we will often try to maximize the recall value when training the model. Maximizing the recall means it is better to label some text as a person's name even if it is not than it is to risk not labeling a person's name. When dealing with sensitive information in text it can be advantageous to err on the side of caution rather than risk leaving a person's name unfiltered. Restated, maximizing recall means false positives are more acceptable than false negatives.

Currently Available Models for Philter

Once we are satisfied with the model's performance we publish it and make it available on our website. We have models for general usage and models more specialized for specific domains such as healthcare. We are continuously training and updating our models to keep them current and improve their performance. The model included with Philter is a general usage model.

To stay up to date on model updates please follow us on Twitter.

Philter 1.6.0

Philter 1.6.0 will be available soon through the cloud marketplaces. This is probably the most significant release of Philter other than the first release 1.0.0.

Version 1.6.0 has many new features and a few fixes. Instead of writing a single blog post for the entire release we are going to write a few separate blog posts on the significant new features. We will highlight the new features just down below in this post and then follow-up over the next few days with posts that go more in-depth on each of the new features. Check out Philter's Release Notes.

Over the next few days we will be making updates to the Philter SDKs to accommodate the new features in Philter 1.6.0.


New Features in Philter 1.6.0

The following are summaries of the new features added in Philter 1.6.0.


Alerts

The new alerts feature in Philter 1.6.0 lets you have Philter generate an alert when a given filter condition is satisfied. For example, if you have a filter condition to only match a person's name of "John Smith", when this condition is satisfied Philter will generate an alert. The alert will be stored in Philter and can be retrieved and deleted using Philter's new Alerts API. Details of the Alerts API are in Philter's User's Guide.

Span Disambiguation

Sometimes a piece of sensitive information could be one of a few filter types, such as an SSN, a phone number, or a driver's license number. The span disambiguation feature works to determine which of the potential filter types is most appropriate by analyzing the context of the sensitive information. Philter uses various natural language processing (NLP) techniques to determine which filter type the sensitive information most closely resembles. Because of the techniques used, the more text Philter sees the more accurate the span disambiguation will become.
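To make the idea of context-based disambiguation concrete, here is a deliberately simplified sketch. It is not Philter's implementation: it only counts hypothetical hint words near the ambiguous span, and the filter type names, hint words, and window size are all assumptions made for illustration.

```python
import re

# Hypothetical hint words per filter type (illustrative only).
CONTEXT_HINTS = {
    "ssn": {"ssn", "social", "security"},
    "phone-number": {"call", "phone", "tel", "mobile"},
    "drivers-license": {"license", "driver", "dmv"},
}


def disambiguate(text, start, end, window=40):
    """Pick the filter type whose hint words appear most often near the span.

    Returns None when no hint word is found in the surrounding text.
    """
    context = text[max(0, start - window):start] + " " + text[end:end + window]
    words = set(re.findall(r"[a-z]+", context.lower()))
    scores = {ftype: len(hints & words) for ftype, hints in CONTEXT_HINTS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


sample = "Please call 123456789 after lunch."
begin = sample.index("123456789")
print(disambiguate(sample, begin, begin + 9))
```

A real system would use richer signals than a fixed word list, but the shape is the same: the characters in the span are identical, and only the surrounding text decides the type.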

Span disambiguation is documented in Philter's User's Guide.

New Filters: Bitcoin Address, IBAN Codes, US Passport Numbers, US Driver's License Numbers

Philter 1.6.0 contains several new filter types:

  • Bitcoin Address - Identify bitcoin addresses.
  • IBAN Codes - Identify International Bank Account Numbers.
  • US Passport Numbers - Identify US passport numbers issued since 1981.
  • US Driver's License Numbers - Identify US driver's license numbers for all 50 states.

Each of these new filters is available through filter profiles.

New Replacement Strategy: SHA-256 with random salt values

We previously added the ability to encrypt sensitive information in text. In Philter 1.6.0 we have added the ability to hash sensitive information using SHA-256. When the hash replacement strategy is selected, each piece of sensitive text will be replaced by the SHA-256 value of the sensitive text. Additionally, the hash replacement strategy has a "salt" property that, when enabled, will cause Philter to append a random salt value to each piece of sensitive text prior to hashing. The random salt value will be included in the filter response.

Custom Dictionary Filters Can Now Use an External Dictionary File

Philter's custom dictionary filter lets you specify a list of terms to identify as being sensitive. Prior to Philter 1.6.0, this list of terms had to be provided in the filter profile. With a long list it did not take long for the filter profile to become hard to read and even harder to manage. Now, instead of providing a list of terms in the filter profile you can simply provide the full path to a file that contains a list of terms. This keeps the filter profile compact and easier to manage. You can specify as many dictionary files as you need to and Philter will combine the terms when the filter profile is loaded.
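As a rough sketch of what combining several dictionary files could look like, the snippet below reads terms from multiple files into one set. The one-term-per-line file format and the file names are assumptions for illustration; check Philter's documentation for the format it actually expects.

```python
import tempfile
from pathlib import Path


def load_terms(paths):
    """Combine the terms from several dictionary files into one set."""
    terms = set()
    for path in paths:
        for line in Path(path).read_text(encoding="utf-8").splitlines():
            term = line.strip()
            if term:                      # skip blank lines
                terms.add(term)
    return terms


# Two small hypothetical dictionary files with one overlapping term.
with tempfile.TemporaryDirectory() as tmp:
    names = Path(tmp, "names.txt")
    names.write_text("David\nSusan\n", encoding="utf-8")
    projects = Path(tmp, "projects.txt")
    projects.write_text("Susan\nProject Falcon\n\n", encoding="utf-8")
    combined = load_terms([names, projects])

print(sorted(combined))
```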

Custom Dictionary Filters Now Have a "fuzzy" Property

Philter's custom dictionary filter previously always used fuzzy detection. (Fuzzy detection is like a spell checker - a misspelled name such as "Davd" can be identified as "David.") New in Philter 1.6.0 is a property on the custom dictionary filter called "fuzzy." This property controls whether or not fuzzy detection is enabled. It was added because you can get a significant performance increase when fuzzy detection is not needed. When fuzzy detection is not enabled, Philter uses an optimized data structure to identify the terms. If you do not need fuzzy detection we recommend disabling it to take advantage of the performance gain.
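The difference between the two modes can be illustrated with Python's standard library. This is a sketch, not Philter's implementation: a plain set stands in for the "optimized data structure," difflib stands in for the fuzzy matching, and the 0.8 similarity cutoff is an arbitrary assumption.

```python
import difflib

TERMS = ["David", "Susan", "Washington"]
TERM_SET = set(TERMS)   # exact lookups against a set are constant time on average


def find_terms(words, fuzzy):
    """Return (word, matched_term) pairs for the words found in the dictionary."""
    matches = []
    for word in words:
        if not fuzzy:
            if word in TERM_SET:
                matches.append((word, word))
        else:
            # Compares the word against every term, which is much slower.
            close = difflib.get_close_matches(word, TERMS, n=1, cutoff=0.8)
            if close:
                matches.append((word, close[0]))
    return matches


words = ["Davd", "lives", "here"]
print(find_terms(words, fuzzy=False))
print(find_terms(words, fuzzy=True))
```

With fuzzy matching off, the misspelled "Davd" is missed. With it on, it is matched to "David" at the cost of comparing against every term in the dictionary.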

Changed "Type" to "Classification"

A few filter types had additional information that provided further description of the sensitive information. For instance, the entity filter had a type that identified the "type" of the entity such as "PER" for person. We have changed the property "type" to "classification" for clarity and uniformity. Be sure to update your filter profiles if you have any filter conditions that use "type" to use "classification" instead. It is a drop-in replacement and you can simply change "type" to "classification."

Add Filter Condition for "Classification"

Philter 1.6.0 adds the ability to have a filter condition on "classification."

Redis Cache Can Now Use a Self-Signed SSL Certificate

Philter 1.6.0 can now connect to a Redis cache that is using a self-signed certificate. New configuration settings for the truststore and keystore allow for trusting the self-signed certificate.

Fixes and Improvements in Philter 1.6.0

The following is a list of fixes and improvements made in Philter 1.6.0.

Fixed Potential MAC Address Issue

We found and fixed a potential issue where a MAC Address might not be identified correctly.

Fixed Potential Ignore Issue with Custom Dictionary Filters

We found and fixed a potential issue where a term in a custom dictionary that is also a term in an ignore list might not be ignored correctly.

Fixed Potential Issue with Credit Card Number Validation

We found and fixed a potential issue where a credit card number might not be validated correctly. This only applies when credit card validation is enabled.

Philter - A Real-World Use-Case

Philter finds, identifies, and removes sensitive information from text. That's a very good and short description of Philter, but, as they say, a picture is worth a thousand words. In this post we will detail an actual, real-world use-case of Philter as we paint a picture with words!

"Super Helpdesk"

The Philter customer, which we'll call Super Helpdesk, is a provider of a software-as-a-service helpdesk solution. Their customers sign up to be able to offer a helpdesk to their own customers. (Following? :) Super Helpdesk's users need the ability to optionally prevent sensitive information from being passed through in tickets. If a customer enters something sensitive, they want to remove it from the ticket before the ticket enters the workflow.

In this case, the sensitive information Super Helpdesk is most worried about is credit card numbers. Due to security best practices and regulations like PCI-DSS, credit card numbers cannot exist in helpdesk tickets where they may be stored or transmitted unencrypted. Super Helpdesk needed a way to analyze the tickets entering their system in order to filter out the credit card numbers from the tickets.

The Solution

At a high-level, Super Helpdesk deployed Philter (in this case running on EC2 in AWS) to perform the filtering of the content of the helpdesk tickets. As new helpdesk tickets are submitted, the content of the ticket is sent to Philter and Philter immediately returns the content of the ticket with the credit card numbers redacted to just the last four digits. (Super Helpdesk also added an option for their users to control how Philter redacts the credit card numbers, with the available options being redact all or redact all but the last four digits.)

Now for the low-level implementation details! When new helpdesk tickets come in they are published to an Apache Kafka topic. A process consumes from the topic, does processing on the ticket, and ultimately inserts the ticket into a backend database. This process, written in Java, was modified to make use of the Philter Java SDK to enable the communication between the process and Philter.

We have found this to actually be a very common Extract-Transform-Load (ETL) design scenario across industries. Data in the form of text flows from an external system through a pipeline facilitated by Apache Kafka or Amazon Kinesis Firehose into an internal database. Along the way the data needs to be manipulated in some manner. In our case the data manipulation is to remove sensitive information from the text. Philter's API allows it to slide nearly seamlessly into the existing pipeline. Like Super Helpdesk did, just insert a step to send the text to Philter for filtering.

We made a previous blog post about using Philter inside of an AWS Kinesis Firehose using a Firehose Transformation. It describes how to make a Lambda function to invoke Philter on the text going through the pipeline to filter the text. Check it out at the link below.

Using AWS Kinesis Firehose Transformations to Filter Sensitive Information from Streaming Text

But, wait, why Philter?

You are probably saying, well, that seems like overkill for a simple problem like redacting credit card numbers! Credit card numbers follow a well-defined pattern so why not just use a regular expression to find them? If all you want to do is find credit card numbers then a regular expression may well work.
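For reference, the regular-expression-only approach might look something like the sketch below. The pattern is intentionally simplistic (13 to 16 digits with optional spaces or dashes) and the mask character is an arbitrary choice. The Luhn checksum is the standard check-digit test for card numbers.

```python
import re

# Simplistic pattern: 13-16 digits, optionally separated by single spaces or dashes.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def luhn_valid(digits):
    """Return True when the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:        # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact_cards(text, keep_last_four=True):
    """Mask card numbers, optionally leaving the last four digits visible."""
    def replace(match):
        digits = re.sub(r"\D", "", match.group())
        if not luhn_valid(digits):
            return match.group()          # not a plausible card number
        if keep_last_four:
            return "*" * (len(digits) - 4) + digits[-4:]
        return "*" * len(digits)
    return CARD_PATTERN.sub(replace, text)


print(redact_cards("My card is 4111 1111 1111 1111, thanks."))
```

This handles the simple case, but every additional requirement (per-brand rules, per-customer rules, other kinds of sensitive information) has to be bolted on by hand.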

So what does using Philter give us? A good bit, actually. Through the use of filter profiles, Philter can have a pre-set list of types of sensitive information. Each type of sensitive information can have its own redaction logic. For example, you could redact VISA card numbers while truncating AMEX card numbers. Or, you could leave only the last four digits of card numbers matching a condition. Additionally, each customer of the helpdesk platform may have different requirements around sensitive information. That logic can also be encapsulated in filter profiles. The regular expression logic just got more complicated.

Philter provides other features as well, such as the ability to capture metrics on the data, ability to encrypt the credit card numbers instead of removing them, and the ability to disambiguate between different types of sensitive information.

Lastly, a regular expression will never be able to find non-deterministic types of sensitive information like people's names. Philter's natural language processing (NLP) capabilities are able to find entities like people's names that do not follow any set pattern.

Try Philter

Deploying Philter to AWS, Azure, or GCP is easy because Philter is available through each of the cloud's marketplaces. Simply follow the marketplace steps to launch an instance of Philter in your private cloud.

Share your experience!

We would love to hear how you are using Philter. Share your experience with us!

Challenges in Finding Sensitive Information in Text

Finding sensitive information and content in text has been a problem for as long as text has existed. But in the past few years due to the availability of cheaper data storage and streaming systems, finding sensitive information in text has become nearly a universal need across all industries. Today, systems that process streaming text often need to filter out any information considered sensitive directly in the pipeline to ensure the downstream applications have immediate access to the sanitized text. Streaming platforms are commonly used in industries such as healthcare and banking where the data can contain large amounts of sensitive information.

What is "sensitive information"?

Taking a step back, what is sensitive information? Sensitive information is simply any information that you or your organization deems as being sensitive. There are some global types of sensitive information such as personally identifiable information (PII) and protected health information (PHI). These types of sensitive information, among others, are typically regulated in how the information must be stored, transmitted, or used. But it is common for other types of information to be sensitive for your organization. This could be a list of terms, phrases, locations, or other information important to your organization. Simply put, if you consider it sensitive then it is sensitive.

Structured vs. Unstructured

It's important to note we are talking about unstructured, natural language text. Text in structured formats like XML or JSON is typically simpler to manipulate due to the inherent structure of the text. But in unstructured text we don't have the convenience of being told what is a "person's name" like an XML tag <personName> would do for us. There are generally three ways to find sensitive information in unstructured text.

Three Methods of Finding Sensitive Information in Text

The first method is to look for sensitive information that follows well-defined patterns. This is information like US social security numbers and phone numbers. Even though regular expressions are not a lot of fun, we can easily enough write regular expressions to match social security numbers and phone numbers. Once we have the regular expressions it's straightforward to apply the regular expression to the input text to find pieces of the text matching the patterns.
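As a small illustration of this first method, the sketch below applies two regular expressions to a piece of text. Both patterns are simplified assumptions (dash-separated US formats only) and would need to be broadened for real-world text.

```python
import re

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")   # simplified: dashes only, no country code


def find_patterns(text):
    """Return (label, start, end, matched_text) tuples ordered by position."""
    spans = []
    for label, pattern in (("ssn", SSN), ("phone-number", PHONE)):
        for match in pattern.finditer(text):
            spans.append((label, match.start(), match.end(), match.group()))
    return sorted(spans, key=lambda span: span[1])


for span in find_patterns("Call 304-555-1234 about SSN 123-45-6789."):
    print(span)
```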

The second method is to look for sensitive information that can be found in a dictionary or database. This method works well for geographic locations and for information that you might have stored in your database or spreadsheet, such as a column of people's names. Once the list is accessible, again, it is fairly straightforward to look for those items in the text.

The third, and last, method is to employ the techniques of natural language processing (NLP). The technology and tools provided by the NLP ecosystem give us powerful ways to analyze unstructured text. We can use NLP to find sensitive information that does not follow well-defined patterns and is not listed in a database column or spreadsheet, such as the names of people or organizations. The past few years have seen remarkable advancements in NLP, allowing these techniques to analyze text with great success.

Deterministic and Non-deterministic

The first two methods are deterministic. Finding text that matches a pattern and finding text contained in a dictionary is a pass/fail scenario - you either find the text you are looking for or you do not. The third method, NLP, is not deterministic. NLP uses trained models to be able to analyze the text. When an NLP method finds information in text it will have an associated confidence value that tells us just how sure the algorithms are that the associated information is what we are looking for.
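One practical consequence of a confidence value is that the caller can choose a cutoff. The sketch below uses made-up spans and an arbitrary threshold; the Span type and its field names are hypothetical and not part of any Philter API.

```python
from typing import NamedTuple


class Span(NamedTuple):
    text: str
    label: str
    confidence: float


def keep_confident(spans, threshold=0.5):
    """Keep only the spans whose confidence meets the threshold."""
    return [span for span in spans if span.confidence >= threshold]


# Hypothetical output of an NLP model for one sentence.
candidates = [
    Span("George Washington", "person", 0.97),
    Span("Mount Vernon", "person", 0.41),
]
print(keep_confident(candidates, threshold=0.5))
print(keep_confident(candidates, threshold=0.3))
```

Lowering the threshold catches more candidates at the cost of more false positives. A pattern or dictionary match has no such dial.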

Introducing Philter

Philter is our software product that implements these three methods of identifying sensitive information in text. Philter supports finding, identifying, and removing sensitive information in text. You set the types of information you consider sensitive and then send text to Philter. The filtered text without the sensitive information is returned to you. With Philter you have full control over how the sensitive information is manipulated - you can redact it, replace it with random values, encrypt it, and more.

Often, different types of sensitive information can follow the same pattern. For example, a US social security number is a 9 digit number. Many driver's license numbers can also be 9 digit numbers. Philter can disambiguate between a social security number and a driver's license number based on how the number is used in the text. When using a dictionary we can't forget about misspellings. If we simply look for words in the dictionary we may not find a name that has been misspelled. Philter supports fuzzy searching by looking for misspellings when applying a dictionary-based filter.

This isn't nearly all Philter can do, but these are some of its more exciting features to date. Take Philter for a test drive on the cloud of your choice. We'd be happy to walk you through it if you would like!

Philter 1.5.0

Happy Friday! We are in the process of publishing Philter 1.5.0. Philter identifies and removes sensitive information in text. Look for Philter 1.5.0 to be available on the cloud marketplaces soon.

This version has a few new features in addition to minor improvements and fixes. The new features are described below.

New "Section" Filter

Philter 1.5.0 includes a new filter type called a "Section." This filter type lets you specify patterns that indicate the start and end of a section of text. For example, if your text has sentences or even paragraphs denoted with some marker, you can use the Section filter to redact those sentences or paragraphs. You just give the filter the regular expression patterns for the start and end markings. We have added the Section filter to the filter profiles documentation.
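The idea behind a start/end pattern filter can be sketched with a single regular expression. The marker strings and the replacement text below are hypothetical, and this is not Philter's implementation.

```python
import re


def redact_sections(text, start_pattern, end_pattern,
                    replacement="{{{REDACTED-section}}}"):
    """Replace everything from a start marker through the next end marker."""
    # DOTALL lets a section span multiple lines; .*? stops at the first end marker.
    section = re.compile(f"(?:{start_pattern}).*?(?:{end_pattern})", re.DOTALL)
    return section.sub(lambda match: replacement, text)


note = ("Visit notes. BEGIN-PRIVATE Patient called twice.\n"
        "Follow up Monday. END-PRIVATE Billing is unchanged.")
print(redact_sections(note, r"BEGIN-PRIVATE", r"END-PRIVATE"))
```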

Amazon S3 to Store Filter Profiles

We have added the ability to store the filter profiles in an Amazon S3 bucket. The benefit of this is that filter profiles can now be shared across multiple instances of Philter. Previously, if you were running two instances of Philter you would have to update the filter profiles on each instance. By storing the filter profiles in S3 you can update the filter profiles just once via Philter's API. This does require a cache. The cache stores the filter profiles to lower the latency and reduce the number of calls to S3. (More on the cache below.)

We have published some CloudFormation and Terraform scripts to help with creating this architecture on GitHub.

Consolidated Caches

Philter previously used caches for the random anonymization values. With the introduction of a cache for the filter profiles stored in S3, we have consolidated those caches into a single cache. Because of this, some configuration settings have been renamed. We have updated Philter's documentation with the renamed properties. Having a single cache means there is less to configure and fewer required resources.

If you are upgrading from a previous version you will need to change to the new cache property names.

Changeable Model File

The model file used by Philter can now be changed. Check out Philter's documentation for the details. By being able to set the model being used you can now select which model is most applicable to your use-case and domain.

CloudFormation template for a highly-available Philter

We now have an AWS CloudFormation template to deploy an auto-scaled, highly-available Philter environment to identify and remove sensitive information from text. This template creates a VPC, load balancer, Philter instances, a Redis cache, and all required networking and security group configuration. Click the Launch Stack button to begin launching the stack.

In a deployment of Philter that is a single EC2 instance, the EC2 instance is a single point of failure with no ability to respond to fluctuations in demand. By deploying more than one EC2 instance we can protect our application against failure and scale up and down as needed.

The benefit of using this CloudFormation template is that it provides a pre-configured Philter architecture and deployment that is highly-available, scalable, and encrypts all data in-transit and all data at rest. Your API requests to Philter to filter sensitive information from text will have higher throughput since the load balancer will distribute those requests across the Philter instances. And as described below, the stack uses end-to-end encryption of data at-rest and in-transit.

The stack requires an active subscription to Philter via the AWS Marketplace. The template supports us-east-1, us-east-2, us-west-1, and us-west-2 regions.

The CloudFormation template is available in the philter-infrastructure-as-code repository on GitHub.

The Philter Stack Architecture

The deployment creates an elastic load balancer that is attached to an auto-scaled group of Philter EC2 instances. The load balancer spans two public subnets and the Philter EC2 instances are spread across two private subnets. Also in the private subnets is an Amazon ElastiCache for Redis replication group. A NAT Gateway located in one of the public subnets provides outgoing internet access by routing the traffic to the VPC's Internet Gateway.

The load balancer will monitor the status of each Philter EC2 instance by periodically checking the /api/status endpoint. If an instance is found to be unhealthy after failing several consecutive health checks the failing instance will be replaced.

The Philter auto-scaling group is set to scale up and down based on the average CPU utilization of the Philter EC2 instances. When the CPU usage hits the high threshold another Philter EC2 instance will be added. When the CPU usage hits the low threshold, the auto-scaling group will begin removing (and terminating) instances from the group. The scaling policy is set to scale up at a faster rate than it scales down to avoid scaling down too quickly.

End-to-end Encryption

Incoming traffic to the load balancer is received by a TCP protocol handler on port 8080. These requests are distributed across the available Philter EC2 instances. The encrypted incoming traffic is terminated at the Philter EC2 instances. Network traffic between the ElastiCache for Redis nodes is encrypted, and the data at-rest in the cache is also encrypted. The Philter EC2 instances use encrypted EBS volumes.

Launch the Stack

Click the Launch Stack button to launch the stack in your AWS account, or get the template here, or launch the stack using the AWS CLI with the command below.

aws cloudformation create-stack --stack-name philter --template-url s3://mtnfog-public/philter-resources/philter-vpc-load-balanced-with-redis.json

Once the stack completes Philter will be ready to accept requests. There will be an Output value called PhilterEndpoint. This value is the Philter API URL.

For example, if the value of PhilterEndpoint is, then you can check Philter's status using the command:

curl -k

You can try a quick sample filter request with:

curl -k "" \
  --data "George Washington lives in 90210 and his SSN was 123-45-6789." \
  -H "Content-type: text/plain"

Philter Studio 1.0.0

Philter Studio 1.0.0 is now available. Philter Studio is an application for Windows 7/10 that provides convenient access to removing sensitive information from files and documents using Philter.

With Philter Studio’s intuitive interface you can quickly and easily utilize Philter to find and remove sensitive information from your files. Process files one at a time or queue up entire directories and process all files with a single click. Philter Studio supports finding and removing sensitive information in Microsoft Word files (.doc and .docx). Philter Studio can enable track changes so the redactions can be viewed while editing the document.

Philter Studio lets you take a deep look at how the sensitive information in your text was identified and removed. The Compare and Explain feature visually highlights the information, describes why it was identified, and shows the redacted version.

Philter and COVID-19

Philter NLP Models

The natural language processing (NLP) capabilities of Philter are partly model-driven, meaning that we have trained models to identify information in text. These models are used to identify pieces of sensitive information that do not follow well-defined patterns or exist in referenced dictionaries, such as people's names. The model training process is a complex and compute-intensive procedure often taking days or even weeks to complete. Once a model is created it can be applied to text to identify specific parts of the text based on the text used to train the model and the parameters of the training.

NLP Models for Many Use-Cases and Industries

The model currently deployed in Philter is generic yet provides good performance across many use-cases covering many different types of text. It has been our plan for some time to offer models trained for specific use-cases and industries, including non-healthcare industries, for those instances when Philter is used only on a certain type of text. A tailored model will give those specific use-cases an increase in performance.

Philter's pluggable model implementation is not quite ready yet. However, we are jumping ahead a bit today by announcing a model tailored for personally identifiable information in text related to COVID-19. We hope that this model will give you improved performance when identifying sensitive information in COVID-19 related text.

Model Availability

Because we are jumping ahead of ourselves in order to make this model immediately available, we don't yet have any automation or tooling support around being able to download and install the model yourself. (We will in the future.) Until we do have the self-service tooling available, we will distribute the model and installation instructions to users of Philter via email upon request. There is no additional charge to request and use the model.

To request the Philter model trained using COVID-19 data please use our contact form and include your cloud marketplace (AWS, Azure, or GCP) subscription ID.

Using Philter with Microsoft Power Automate (Flow)

Philter SDKs

We have some updates on the Philter SDKs!

The Philter SDKs provide API clients for interacting with Philter to identify and remove sensitive information from text. Each project contains examples showing how to use the SDK.

Philter SDK for Java

The Java SDK is now available in Maven Central.

Philter SDK for .NET

The .NET SDK is now available from NuGet.

Philter SDK for Golang

The Golang SDK is now available on GitHub.

Filtering Sensitive Information using Apache NiFi with Philter

A while back we made a post describing how Philter can be used alongside Apache NiFi for identifying and removing sensitive information from text. Since that post, there have been changes to Philter and Apache NiFi so we thought it would be worthwhile to revisit that architecture and its configuration.

  • Apache NiFi is an application for creating and managing data flows that process data.
  • Philter identifies and removes sensitive information, such as PHI and PII, from natural language text. Philter is available on cloud marketplaces.

The Data Flow Architecture

In the architecture of our data flow, we are going to be ingesting natural language (unstructured) text from somewhere - it doesn't really matter where. In your use-case it may be from a file system, from an S3 bucket, or from an Apache Kafka topic. Once we have the text in the content of the NiFi flowfile, we will send the text to Philter where the sensitive information will be removed from the text. The filtered text will then be the content of the flowfile. In our example here we are going to read the files from a directory on the file system.

To interact with Philter we can use NiFi's InvokeHTTP processor since Philter's API is HTTP REST-based.

Finally, we will write the filtered text to some destination. Like the ingest source, where we write the text does not matter. We could write it back to the source or some other location - whatever is required by your use-case.

The NiFi Flow

The flow will use the GetFile processor to read /tmp/input/*.txt files. The contents of each file will be sent to Philter. The resulting filtered text will be written back to the file system at /tmp/output.

Apache NiFi flow for Philter

If you want to quickly prototype it with minimal configuration, use a GenerateFlowFile processor and set the content manually to something like "His SSN was 123-45-6789."

Using GenerateFlowFile to test Philter.

InvokeHTTP Processor Configuration

The configuration of the InvokeHTTP processor is fairly simple. We just need to configure the HTTP Method, Remote URL, and Content Type. Set each as follows:

  • HTTP Method = POST
  • Remote URL = http://philter-ip:8080/api/filter
  • Content-Type = text/plain

Since we are not providing any values for the context, document ID, or filter profile name in the URL, Philter will use default values for each. When not provided, the default value for context is default, Philter will generate a document ID per request, and the default filter profile name is default.

These default values are detailed in Philter's API documentation. A context lets you group similar documents together, perhaps by business unit or purpose. A document ID should uniquely identify a document (such as a file name) and can be used to split up large documents for processing.

If you do want to set values for one or all of those instead of using the default values, just append them to the Remote URL:

http://philter-ip:8080/api/filter?c=ctx&p=justssn

In this request, the context is set to ctx and it tells Philter to use the filter profile named justssn. As a tip, you can use NiFi's expression language to parameterize the values in the URL.

InvokeHTTP processor configuration for Philter.
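Outside of NiFi, the same request can be assembled in a few lines. The sketch below only builds the request with Python's standard library and does not send it. The /api/filter path and the c and p query parameters come from the example URL above; the helper function and its argument names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_filter_request(base_url, text, context=None, profile=None):
    """Build (but do not send) a POST request to Philter's filter endpoint."""
    params = {}
    if context:
        params["c"] = context     # context
    if profile:
        params["p"] = profile     # filter profile name
    url = f"{base_url}/api/filter"
    if params:
        url += "?" + urlencode(params)
    return Request(url, data=text.encode("utf-8"),
                   headers={"Content-Type": "text/plain"}, method="POST")


request = build_filter_request("http://philter-ip:8080", "His SSN was 123-45-6789.",
                               context="ctx", profile="justssn")
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request).read()
```

Leaving out context and profile produces the bare /api/filter URL, in which case Philter falls back to the default values described above.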

A Closer Look

If we use a LogAttribute processor we can get some insight into what's happening. In the log output below, we can see the HTTP POST request that was made.

At the top of the log we see the filtered text from Philter. The input text from the file was "His SSN was 123-45-6789." Philter applied the default filter profile which looks for SSNs and responded with "His SSN was {{{REDACTED-ssn}}}."

(Filter profiles are very powerful and flexible configurations that let you have full control over the types of sensitive information that Philter identifies and how Philter manipulates that information when found.)

We can also see that since we did not provide a value for the document ID in the request, Philter assigned a document ID and returned it in the response in the x-document-id header.

His SSN was {{{REDACTED-ssn}}}.

Standard FlowFile Attributes
Key: 'entryDate'
	Value: 'Thu Feb 27 13:35:19 UTC 2020'
Key: 'lineageStartDate'
	Value: 'Thu Feb 27 13:35:11 UTC 2020'
Key: 'fileSize'
	Value: '31'
FlowFile Attribute Map Content
Key: 'Connection'
	Value: 'keep-alive'
Key: 'Content-Length'
	Value: '31'
Key: 'Content-Type'
	Value: 'text/plain;charset=UTF-8'
Key: 'Date'
	Value: 'Thu, 27 Feb 2020 13:35:19 GMT'
Key: 'Keep-Alive'
	Value: 'timeout=60'
Key: 'filename'
	Value: 'd206fc81-2c42-40ba-afbf-b5f9998b56c0'
Key: 'invokehttp.request.url'
	Value: ''
Key: 'invokehttp.status.code'
	Value: '200'
Key: 'invokehttp.status.message'
	Value: ''
Key: ''
	Value: 'fbf2f6c0-1073-4fac-bc23-6d6a67b70423'
Key: 'mime.type'
	Value: 'text/plain;charset=UTF-8'
Key: 'path'
	Value: './'
Key: 'uuid'
	Value: '486ff4c2-6530-4e1c-aea2-e9965b86b10c'
Key: 'x-document-id'
	Value: 'fb75a2a4c164192542f89881aa8baf21'


Philter's API makes it easy to integrate Philter with applications like Apache NiFi. The InvokeHTTP processor native to NiFi is an ideal means of communicating with Philter.

To keep things simple, this example only considered SSNs in text. Philter supports many other types of sensitive information.

If performance is very important, there are a couple of things that can be done to help. First, Philter is stateless so you can run multiple instances of Philter behind a load balancer. Second, Philter Enterprise Edition can run natively inside an Apache NiFi flow without the need to make HTTP calls to Philter. Contact us if you would like to learn more about Philter Enterprise Edition's processor for Apache NiFi.

Philter's integration with applications like Apache NiFi is very important to us so look for more improvements and features in versions to come.


Philter 1.3.1

We are happy to announce the release of Philter 1.3.1!

Philter 1.3.1 release notes
Philter 1.3.1 documentation

This version of Philter makes some minor changes to filtering and adds support for MAC addresses and tax-payer identification numbers (TINs). Also new is the ability to encrypt sensitive information in the text with AES encryption via the CRYPTO_REPLACE filter strategy. Additionally, Azure and GCP images are now built on CentOS 8.

Other changes include the ability to use the context in a filter condition, the ability to provide a user-set document ID to Philter's API, and the requirement of Java 11.

Philter Enterprise Edition 1.3.1 has been certified for Red Hat Enterprise Linux 8. For enterprise customers, Philter's containers are now built on Red Hat's Universal Base Image to give best performance on Red Hat Enterprise Linux deployments.

Launch Philter in your cloud.


Philter 1.3.0

Today I am happy to announce the availability of Philter 1.3.0! This version includes various tweaks to improve performance and we definitely encourage you to upgrade to 1.3.0. This version greatly lowers the required time to process text while improving the accuracy of identified information.

The only new user-facing feature is a modification to the URL filter to add an option to require the URLs to start with http, https, or www. This change adds a new property to the URL filter profile. All other improvements are related to the internal workings of Philter.

Look for Philter 1.3.0 to be available on the cloud marketplaces in a few days.

Philter 1.3.0 Release Notes

Philter 1.2.0

We are happy to announce the release of Philter 1.2.0. This version brings new features to filter profiles along with some minor changes. Philter 1.2.0 will be available on the cloud marketplaces in a day or two. Let's get to it and see what's new!

A Recap

Philter is an application that analyzes text for personally identifiable information (PII) and protected health information (PHI) and removes or manipulates those items when found. The types of information that Philter looks for, and how it acts upon that information, are defined in a filter profile. A filter profile is just a file that lists the types of PII/PHI that you are interested in, e.g. credit card numbers, person names, etc. Philter is available on the AWS Marketplace, Azure Marketplace, and GCP Marketplace.

Contact us for a live demo or feel free to take Philter for a spin on one of the cloud marketplaces taking advantage of its free trial period. Check out the full Release History.

What's New in Philter 1.2.0

Filter Specific Ignore Lists

A filter profile can now have lists of ignored terms specific to each filter type. For example, let's say there is a number "123-45-6789" in your text and it keeps getting identified as an SSN because it fits the SSN format. However, you know this number is not an SSN and do not want it removed. You can now add "123-45-6789" to a list of ignored terms for the SSN filter to prevent it from being removed from the text. Each type of filter has its own ignore list.

Global Ignore Lists

A filter profile can now have zero or more ignore lists that apply to all filter types. Items added to this list are ignored for all filter types. All items present in the global ignore lists will never be removed from the input text.
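
How the two kinds of ignore lists interact can be sketched in a few lines of Python. This is an illustrative model only, not Philter's implementation: the function name should_filter and the data shapes are assumptions made for the example.

```python
def should_filter(term, filter_type, filter_ignored, global_ignored):
    """Return True if a matched term should still be removed from the text.

    filter_ignored maps a filter type (e.g. "ssn") to its own set of ignored
    terms; global_ignored is a set of terms ignored for every filter type.
    Both structures are hypothetical stand-ins for a filter profile.
    """
    if term in global_ignored:
        return False
    if term in filter_ignored.get(filter_type, set()):
        return False
    return True


filter_ignored = {"ssn": {"123-45-6789"}}
global_ignored = {"555-555-5555"}

# Ignored for the SSN filter only.
print(should_filter("123-45-6789", "ssn", filter_ignored, global_ignored))
# The same text matched by another filter type is still removed.
print(should_filter("123-45-6789", "phone-number", filter_ignored, global_ignored))
# Globally ignored terms are never removed, whichever filter matched them.
print(should_filter("555-555-5555", "phone-number", filter_ignored, global_ignored))
```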

Disabling Filters

Previously, to disable a filter type in a filter profile you had to delete it from the filter profile. This can be problematic because you might have configuration in there you don't want to just delete and lose. New in Philter 1.2.0, each filter type has an enabled property that controls whether or not the filter is applied. When set to false the filter is not applied. The default value is always true to enable each filter type.

Invalid Credit Card Numbers

Philter identifies credit card numbers based on the patterns and algorithms of the numbers. In Philter 1.2.0, a new option was added to the credit card filter type that allows invalid credit card numbers to be filtered as well. An invalid credit card number is a number that matches the pattern of a credit card number but fails the credit card number's generation algorithm. (The algorithm is the Luhn algorithm.) This option is disabled by default.
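
For readers unfamiliar with the Luhn algorithm, here is a minimal Python sketch of the checksum. It is a generic implementation of the algorithm, not Philter's code, and the function name luhn_valid is chosen for the example.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in number pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as adding the two digits of the product
        total += d
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))  # True: a well-known test number
print(luhn_valid("4111 1111 1111 1112"))  # False: right pattern, wrong checksum
```

The second number is the kind of value the new option targets: it looks like a credit card number but fails the checksum.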

Valid Dates

Philter identifies dates based on date patterns. Sometimes, a date may match a valid pattern but not be a valid date, such as February 30 or even March 45. Philter 1.2.0 adds a new option to the date filter to require that identified dates be valid dates. When enabled, dates found to not be valid dates are not removed from the text. This option is disabled by default.
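
The difference between matching a date pattern and being a valid date can be shown with Python's standard library. This is a sketch of the idea only; the helper name is_valid_date is an assumption and Philter's own validation may differ.

```python
from datetime import date


def is_valid_date(year: int, month: int, day: int) -> bool:
    """Return True if the given year, month, and day form a real calendar date."""
    try:
        date(year, month, day)
        return True
    except ValueError:
        return False


print(is_valid_date(2020, 2, 29))  # True: 2020 is a leap year
print(is_valid_date(2020, 2, 30))  # False: February 30 does not exist
print(is_valid_date(2020, 3, 45))  # False: March 45 does not exist
```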

Option to Remove Punctuation

Philter 1.2.0 adds a new option to the filter profile for named-entity recognition to remove punctuation from the input text prior to processing the text. By default this option is disabled and punctuation is not removed. Removing punctuation can be beneficial in cases where punctuation is being included in entities. This can happen in cases where the last word of the sentence is a name and the period is included in the filtered text. (This doesn't always happen and we're working on removing those occurrences even more through improvements to the named-entity recognition capability.)

Encrypting Connections to Redis

Philter's consistent anonymization feature stores the identified text in a Redis cache. This allows a clustered Philter installation to be able to replace identified text consistently across all instances of Philter. (When Redis is not used, the identified text values are stored in memory on each Philter instance.) Philter 1.2.0 requires all connections to a Redis cache be encrypted and requires the use of a Redis auth token.

Philter 1.1.0

We are happy to announce Philter 1.1.0! This version brings some features we think you will find very useful because most were implemented directly from interactions with users. We look forward to future interactions to keep driving improvements!

We are very excited about this release, but we also have lots of exciting things to add in the next release and we will soon be making available Philter Studio, a free Windows application to use Philter. If you don't like managing filter profiles in JSON you will love Philter Studio!

We have begun the process of publishing Philter 1.1.0 to the cloud marketplaces and it should be available on the AWS, Azure, and GCP marketplaces in the next few days once publishing is complete. The Philter Quick Start walks through how to deploy Philter on each platform. You can also see the full Philter release notes.

What's New in Philter 1.1.0

Ignore Lists

In some cases, there may be text that you never want to identify and remove as PII or PHI. An example may be an email address or telephone number of a business that is not relevant to the sensitive information in the text and removing this text may cause the document to lose meaning. Ignore lists allow you to specify a list of terms that are never removed (always ignored if found) from the documents. You can create as many ignore lists as you need and each one can contain as many terms as desired. The ignore lists are defined in the filter profile.

Here's how an ignore list is defined in a filter profile that only finds SSNs. The SSNs 123-45-6789 and 000-00-0000 will always be ignored and will remain in the documents unchanged.

{
  "name": "default",
  "identifiers": {
    "ssn": {
      "ssnFilterStrategies": [
        {
          "strategy": "REDACT",
          "redactionFormat": "{{{REDACTED-%t}}}",
          "replacementScope": "DOCUMENT"
        }
      ]
    }
  },
  "ignored": [
    {
      "name": "ignored-terms",
      "terms": ["123-45-6789", "000-00-0000"]
    }
  ]
}

Custom Dictionaries

You can now have custom dictionaries of terms that are to be identified as sensitive information. With a custom dictionary you can specify a list of terms, such as names, addresses, or other information, that should always be treated as personal information. You can create as many custom dictionaries as you need and each one can contain as many terms as desired. The custom dictionaries are defined in the filter profile.

Here's how a custom dictionary can be added to a filter profile. In this example, a custom dictionary of type names-with-j is created and it contains the terms james, jim, and john. When any of these terms are found in a document they will be redacted. The dictionaries item is an array so you can have as many dictionaries as required. (The "auto" setting for the sensitivity is discussed a little further down below.)

{
  "name": "default",
  "identifiers": {
    "dictionaries": [
      {
        "type": "names-with-j",
        "terms": ["james", "jim", "john"],
        "sensitivity": "auto",
        "customFilterStrategies": [
          {
            "strategy": "REDACT",
            "redactionFormat": "{{{REDACTED-%t}}}",
            "replacementScope": "DOCUMENT"
          }
        ]
      }
    ],
    "ssn": {
      "ssnFilterStrategies": [
        {
          "strategy": "REDACT",
          "redactionFormat": "{{{REDACTED-%t}}}",
          "replacementScope": "DOCUMENT",
          "staticReplacement": "",
          "condition": ""
        }
      ]
    }
  }
}

"Fuzziness" Calculation

We added a new fuzziness option when using dictionary filters. The previous options of LOW, MEDIUM, and HIGH were found to be either not restrictive enough or too restrictive. We have added an AUTO option that automatically determines the appropriate fuzziness based on the length of the term in question. For instance, the AUTO option sets the fuzziness for a short term to be on the low side, while a longer term allows a higher fuzziness. We recommend using AUTO over the other options and expect it to perform better for you. The other options of LOW, MEDIUM, and HIGH are still available.
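
To make the idea of length-based fuzziness concrete, here is a small Python sketch that allows more edits for longer terms. The thresholds (no edits up to 2 characters, one edit up to 5, two edits beyond that) are an assumption chosen for illustration and are not Philter's actual values; the function names are likewise hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def auto_max_edits(term: str) -> int:
    """Pick the allowed number of edits from the term length (assumed thresholds)."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


def fuzzy_match(candidate: str, term: str) -> bool:
    """Return True if candidate is within the automatic edit budget of term."""
    return levenshtein(candidate.lower(), term.lower()) <= auto_max_edits(term)


print(fuzzy_match("jon", "john"))               # one edit, allowed for a 4-letter term
print(fuzzy_match("jo", "john"))                # two edits, too many for a 4-letter term
print(fuzzy_match("washingtin", "washington"))  # one edit, within the budget of two
```

A fixed high fuzziness would let short dictionary terms match many unrelated words, while a fixed low fuzziness would miss misspellings of long terms. Scaling the budget with length avoids both.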

Explain API Endpoint

Philter operates as a black box. Text goes in and manipulated text comes out. What happened inside? To help provide insight into the black box, we have added a new API endpoint called explain. This endpoint performs text filtering but returns more information on the filtering process. The list of identified spans (pieces of text found to be sensitive) and applied spans are both returned as objects along with attributes about each span.

Here's an example output of calling the explain API endpoint given some sample text. The original API call:

curl -k -s "https://localhost:8080/api/explain?c=C1" --data "George Washington was president and his ssn was 123-45-6789 and he lived at 90210." -H "Content-type: text/plain" 

The response from the API call:

{
  "filteredText": "{{{REDACTED-entity}}} was president and his ssn was {{{REDACTED-ssn}}} and he lived at {{{REDACTED-zip-code}}}.",
  "context": "C1",
  "documentId": "117d088b-354a-48b0-b2d7-bdf0335650d5",
  "explanation": {
    "appliedSpans": [
      {
        "id": "b7c5b777-460e-4033-8d91-0f2d3a2d6424",
        "characterStart": 0,
        "characterEnd": 17,
        "filterType": "NER_ENTITY",
        "context": "C1",
        "documentId": "117d088b-354a-48b0-b2d7-bdf0335650d5",
        "confidence": 0.9189682900905609,
        "text": "George Washington",
        "replacement": "{{{REDACTED-entity}}}"
      },
      {
        "id": "b4a2d019-b7cb-4fc7-8598-bec1904124b4",
        "characterStart": 48,
        "characterEnd": 59,
        "filterType": "SSN",
        "context": "C1",
        "documentId": "117d088b-354a-48b0-b2d7-bdf0335650d5",
        "confidence": 1,
        "text": "123-45-6789",
        "replacement": "{{{REDACTED-ssn}}}"
      },
      {
        "id": "48b10b67-6ad2-4b5a-934f-a3b4fd190618",
        "characterStart": 76,
        "characterEnd": 81,
        "filterType": "ZIP_CODE",
        "context": "C1",
        "documentId": "117d088b-354a-48b0-b2d7-bdf0335650d5",
        "confidence": 1,
        "text": "90210",
        "replacement": "{{{REDACTED-zip-code}}}"
      }
    ],
    "identifiedSpans": [
      {
        "id": "b7c5b777-460e-4033-8d91-0f2d3a2d6424",
        "characterStart": 0,
        "characterEnd": 17,
        "filterType": "NER_ENTITY",
        "context": "C1",
        "documentId": "117d088b-354a-48b0-b2d7-bdf0335650d5",
        "confidence": 0.9189682900905609,
        "text": "George Washington",
        "replacement": "{{{REDACTED-entity}}}"
      },
      {
        "id": "b4a2d019-b7cb-4fc7-8598-bec1904124b4",
        "characterStart": 48,
        "characterEnd": 59,
        "filterType": "SSN",
        "context": "C1",
        "documentId": "117d088b-354a-48b0-b2d7-bdf0335650d5",
        "confidence": 1,
        "text": "123-45-6789",
        "replacement": "{{{REDACTED-ssn}}}"
      },
      {
        "id": "48b10b67-6ad2-4b5a-934f-a3b4fd190618",
        "characterStart": 76,
        "characterEnd": 81,
        "filterType": "ZIP_CODE",
        "context": "C1",
        "documentId": "117d088b-354a-48b0-b2d7-bdf0335650d5",
        "confidence": 1,
        "text": "90210",
        "replacement": "{{{REDACTED-zip-code}}}"
      }
    ]
  }
}

In the response, each identified span is listed with some attributes.

  • id - A random UUID identifying the span.
  • characterStart - The character-based index of the start of the span.
  • characterEnd - The character-based index of the end of the span.
  • filterType - The filter that identified this span.
  • context - The given context under which this span was identified.
  • documentId - The given documentId or a randomly generated documentId if none was provided.
  • confidence - Philter's confidence that this span does in fact represent sensitive information of the given type.
  • text - The text contained within the span.
  • replacement - The value which Philter used to replace the text in the document.
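
The span attributes are enough to reproduce the filtered text from the original text. The Python sketch below does this for the sample sentence, using only the characterStart, characterEnd, text, and replacement values from the response shown earlier. The helper name apply_spans is an assumption for the example, and it treats characterEnd as exclusive, which is what the sample offsets imply.

```python
def apply_spans(text, spans):
    """Replace each span's characters in text with the span's replacement value."""
    # Work from the end of the text so earlier offsets remain valid.
    for span in sorted(spans, key=lambda s: s["characterStart"], reverse=True):
        start, end = span["characterStart"], span["characterEnd"]
        if text[start:end] != span["text"]:
            raise ValueError("span offsets do not match the text")
        text = text[:start] + span["replacement"] + text[end:]
    return text


original = ("George Washington was president and his ssn was 123-45-6789 "
            "and he lived at 90210.")

applied_spans = [
    {"characterStart": 0, "characterEnd": 17,
     "text": "George Washington", "replacement": "{{{REDACTED-entity}}}"},
    {"characterStart": 48, "characterEnd": 59,
     "text": "123-45-6789", "replacement": "{{{REDACTED-ssn}}}"},
    {"characterStart": 76, "characterEnd": 81,
     "text": "90210", "replacement": "{{{REDACTED-zip-code}}}"},
]

print(apply_spans(original, applied_spans))
```

The printed result matches the filteredText value in the sample response, which is a quick way to sanity check span offsets in your own tooling.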

The User's Guide has been updated to include the explain API endpoint.


As mentioned in a previous post, Philter 1.1.0 now uses Elasticsearch to store the identified spans instead of MongoDB. Please check that post for the details but we do want to mention again here that this change does not affect Philter's API and the change will be transparent to any of your existing Philter scripts or applications.

Datadog Metrics

Philter 1.1.0 adds support for sending metrics directly to Datadog.

New Metrics

Philter 1.1.0 adds new metrics for each type of filter. Now you will be able to see metrics for each type of filter in CloudWatch, JMX, and Datadog to give more insight into the types of sensitive information being found in your documents.

Philter and Elasticsearch

Philter, our application for finding and removing PII and PHI from natural language text, has the ability to optionally store the identified text in an external data store. With this feature, you have access to a complete log of Philter's actions as well as the ability to reconstruct the original text in the future if you ever need to.

In Philter 1.0, we chose MongoDB as the external data store. With just a few configuration properties, Philter would connect to MongoDB and persist all identified "spans" (the identified text, its location in the document, and some other attributes) to a MongoDB database. This worked well but we realized that looking forward it might not have been the best choice.

In Philter 1.1 we are replacing MongoDB with Elasticsearch. The functionality and the Philter APIs will remain the same. The only difference is that the spans will be stored in an Elasticsearch index instead of a MongoDB database. So what, exactly, are the benefits? Great question.

The first benefit comes with Elasticsearch and Kibana's ability to quickly and easily make dashboards to view the indexed data. With the spans in Elasticsearch, you can make a dashboard to summarize the spans by type, text, etc., to show insights into the PII and PHI that Philter is finding and manipulating in your text.

It also quickly became apparent that a primary use case for the store would be querying the spans it contains, for example, finding all documents containing "John Doe" or all documents containing a certain date or phone number. A search engine is better prepared to handle those queries.
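
As a rough illustration of such a query, the Python sketch below builds an Elasticsearch query body that looks for spans whose text contains a phrase. The field names text and filterType are taken from the span attributes Philter reports, but how they are mapped in the index is an assumption here, as is the helper name spans_containing.

```python
import json


def spans_containing(phrase, filter_type=None):
    """Build an Elasticsearch query body for spans whose text contains phrase."""
    must = [{"match_phrase": {"text": phrase}}]
    if filter_type is not None:
        # Assumes filterType is indexed as an exact-match (keyword) field.
        must.append({"term": {"filterType": filter_type}})
    return {"query": {"bool": {"must": must}}}


query = spans_containing("John Doe", filter_type="NER_ENTITY")
print(json.dumps(query, indent=2))
```

The resulting body can be sent to the search endpoint of whichever index holds the spans. Collecting the documentId values of the hits gives the list of documents that contained the phrase.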

Another consideration is licensing. Elasticsearch is available under the Apache Software License or a compatible license while MongoDB is available under a Server Side Public License.

In summary, Philter 1.1 will offer support for using Elasticsearch as the store for identified PII and PHI. Remember, using the store is an optional feature of Philter. If you do not require any history of the text that Philter identifies then it is not needed. (By default, Philter's store feature is disabled and has to be explicitly enabled.) Support for using MongoDB as a store will not be available in Philter 1.1.

We are really excited about this change and excited about the possibilities that come with it!

Using AWS Kinesis Firehose Transformations to Filter Sensitive Information

AWS Kinesis Firehose is a managed streaming service designed to take large amounts of data from one place to another. For example, you can take data from places such as CloudWatch, AWS IoT, and custom applications using the AWS SDK to places such as Amazon S3, Amazon Redshift, Amazon Elasticsearch, and others. In this post we will use S3 as the firehose's destination.

In some cases you may need to manipulate the data as it goes through the firehose to remove sensitive information. In this blog post we will show how AWS Kinesis Firehose and AWS Lambda can be used in conjunction with Philter to remove sensitive information (PII and PHI) from the text as it travels through the firehose.

Click here for a similar solution using log4j and Apache Kafka to remove sensitive information from application logs.


You must have a running instance of Philter. If you don't already have a running instance of Philter you can launch one through the AWS Marketplace. There are CloudFormation and Terraform scripts for launching a single instance of Philter or a load-balanced auto-scaled set of Philter instances.

It's not required that the instance of Philter be running in AWS but it is required that the instance of Philter be accessible from your AWS Lambda function. Running Philter and your AWS Lambda function in your own VPC allows you to communicate locally with Philter from the function.

Setting up the AWS Kinesis Firehose Transformation

There is no need to duplicate an excellent blog post on creating a Firehose Data Transformation with AWS Lambda. Instead, refer to the linked page and substitute the Python 3 code below for the code in that blog post.

Configuring the Firehose and the Lambda Function

To start, create an AWS Firehose and configure an AWS Lambda transformation. When creating the AWS Lambda function, select Python 3.7 and use the following code:

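import base64

# Note: newer botocore releases no longer bundle requests. If this import
# fails, package the requests library with the function and use
# "import requests" instead.
from botocore.vendored import requests


def handler(event, context):

    output = []

    for record in event['records']:
        # Firehose delivers each record's data base64 encoded.
        payload = base64.b64decode(record['data'])
        headers = {'Content-type': 'text/plain'}
        # verify=False skips validation of Philter's self-signed certificate.
        r = requests.post("https://PHILTER_IP:8080/api/filter", verify=False, data=payload, headers=headers, timeout=20)
        filtered = r.text
        output_record = {
            'recordId': record['recordId'],
            'result': 'Ok',
            'data': base64.b64encode(filtered.encode('utf-8') + b'\n').decode('utf-8')
        }
        output.append(output_record)

    # Firehose expects the transformed records under the "records" key.
    return {'records': output}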

The following Kinesis Firehose test event can be used to test the function:

{
  "invocationId": "invocationIdExample",
  "deliveryStreamArn": "arn:aws:kinesis:EXAMPLE",
  "region": "us-east-1",
  "records": [
    {
      "recordId": "49546986683135544286507457936321625675700192471156785154",
      "approximateArrivalTimestamp": 1495072949453,
      "data": "SGUgbGl2ZWQgaW4gOTAyMTAgYW5kIGhpcyBTU04gd2FzIDEyMy00NS02Nzg5Lg=="
    },
    {
      "recordId": "49546986683135544286507457936321625675700192471156785154",
      "approximateArrivalTimestamp": 1495072949453,
      "data": "SGUgbGl2ZWQgaW4gOTAyMTAgYW5kIGhpcyBTU04gd2FzIDEyMy00NS02Nzg5Lg=="
    }
  ]
}

This test event contains 2 records. The data of each record is base64 encoded and represents the value "He lived in 90210 and his SSN was 123-45-6789." When the test is executed, the data of each returned record, once base64 decoded, will be:

  He lived in {{{REDACTED-zip-code}}} and his SSN was {{{REDACTED-ssn}}}.
  He lived in {{{REDACTED-zip-code}}} and his SSN was {{{REDACTED-ssn}}}.

When executing the test, the AWS Lambda function will extract the data from the records in the firehose and submit each to Philter for filtering. The filtered text from each request is base64 encoded and returned from the function along with its record ID. Note that in our Python function we are ignoring Philter's self-signed certificate. It is recommended that you use a valid signed certificate for Philter.

When data is now published to the Kinesis Firehose stream, the data will be processed by the AWS Lambda function and Philter prior to exiting the firehose at its configured destination.
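
Before wiring the function to a live stream, it can help to exercise the record handling locally. The Python sketch below mirrors the decode, filter, encode steps with the call to Philter replaced by a stand-in function, so it runs without a network connection. The names transform and fake_filter are hypothetical, and the stand-in only imitates the redaction of the sample sentence. The result is wrapped in a records key, which is the shape Firehose expects from a transformation function.

```python
import base64


def transform(event, filter_text):
    """Apply filter_text to each Firehose record and return the transformed records."""
    output = []
    for record in event["records"]:
        text = base64.b64decode(record["data"]).decode("utf-8")
        filtered = filter_text(text)
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode((filtered + "\n").encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}


def fake_filter(text):
    # Stand-in for the HTTP call to Philter; handles the sample sentence only.
    return (text.replace("90210", "{{{REDACTED-zip-code}}}")
                .replace("123-45-6789", "{{{REDACTED-ssn}}}"))


sentence = "He lived in 90210 and his SSN was 123-45-6789."
event = {"records": [{
    "recordId": "1",
    "data": base64.b64encode(sentence.encode("utf-8")).decode("utf-8"),
}]}

result = transform(event, fake_filter)
decoded = base64.b64decode(result["records"][0]["data"]).decode("utf-8")
print(decoded)
```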

Processing Data

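We can use the AWS CLI to publish data to our Kinesis Firehose stream called sensitive-text. (With version 2 of the AWS CLI, add --cli-binary-format raw-in-base64-out so that the text is not interpreted as base64.)

aws firehose put-record --delivery-stream-name sensitive-text --record '{"Data":"He lived in 90210 and his SSN was 123-45-6789."}'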

Check the destination S3 bucket and you will have a single object with the following line:

He lived in {{{REDACTED-zip-code}}} and his SSN was {{{REDACTED-ssn}}}.


In this blog post we have created an AWS Firehose pipeline that uses an AWS Lambda function to remove PII and PHI from the text in the streaming pipeline.


Apache NiFi for Processing PHI Data

With the recent release of Apache NiFi 1.10.0, it seems like a good time to discuss using Apache NiFi with data containing protected health information (PHI). When PHI is present in data it can present significant concerns and impose many requirements you may not face otherwise due to regulations such as HIPAA.

Apache NiFi probably needs little introduction but in case you are new to it, Apache NiFi is a big-data ETL application that uses directed graphs called data flows to move and transform data. You can think of it as taking data from one place to another while, optionally, doing some transformation to the data. The data goes through the flow in a construct known as a flow file. In this post we'll consider a simple data flow that reads files from a remote SFTP server and uploads the files to S3. We don't need to look at a complex data flow to understand how PHI can impact our setup.

Encryption of Data at Rest and In-motion

Two core things to address when PHI data is present are encryption of the data at rest and encryption of the data in motion. The first step is to identify those places where sensitive data will be at rest and in motion.

For encryption of data at rest, the first location is the remote SFTP server. In this example, let's assume the remote SFTP server is not managed by us, has the appropriate safeguards, and is someone else's responsibility. As the data goes through the NiFi flow, the next place the data is at rest is inside NiFi's provenance repository. (The provenance repository stores the history of all flow files that pass through the data flow.) NiFi then uploads the files to S3. AWS gives us the capability to encrypt S3 bucket contents by default so we will use that through an S3 bucket policy.

For encryption of data in motion, we have the connection between the SFTP server and NiFi and between NiFi and S3. Since we are using an SFTP server, our communication to the SFTP server will be encrypted. Similarly, we will access S3 over HTTPS providing encryption there as well.

If we are using a multi-node NiFi cluster, we may also have the communication between the NiFi nodes in the cluster. If the flows only execute on a single node you may argue that encryption between the nodes is not necessary. However, what happens in the future when the flow's behavior is changed and now PHI data is being transmitted in plain text across a network? For that reason, it's best to set up encryption between NiFi nodes from the start. This is covered in the NiFi System Administrator's Guide.

Encrypting Apache NiFi's Data at Rest

The best way to ensure encryption of data at rest is to use full disk encryption for the NiFi instances. (If you are on AWS and running NiFi on EC2 instances, use an encrypted EBS volume.) This ensures that all data persisted on the system will be encrypted no matter where the data appears. If a NiFi processor decides to have a bad day and dump error data to the log there is a risk of PHI data being included in the log. With full disk encryption we can be sure that even that data is encrypted as well.

Looking at Other Methods

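Let's recap the NiFi repositories:

  • FlowFile repository - Stores the attributes and current state of each flow file in the flow.
  • Content repository - Stores the content of the flow files.
  • Provenance repository - Stores the history of all flow files that pass through the data flow.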

PHI could exist in any of these repositories when PHI data is passing through a NiFi flow. NiFi does have an encrypted provenance repository implementation and NiFi 1.10.0 introduces an experimental encrypted content repository but there are some caveats. (Currently, NiFi does not have an implementation of an encrypted flowfile repository.)

When using these encryption implementations, spillage of PHI onto the file system through a log file or some other means is a risk. There will be a bit of overhead due to the additional CPU instructions to perform the encryption. Comparing usage of the encrypted repositories with using an encrypted EBS volume, we don't have to worry about spilling unencrypted PHI to the disk, and per the AWS EBS encryption documentation, "You can expect the same IOPS performance on encrypted volumes as on unencrypted volumes, with a minimal effect on latency."

There is also the NiFi EncryptContent processor that can encrypt (and decrypt despite the name!) the content of flow files. This processor has use but in very specific cases. Trying to encrypt data at the level of the data flow for compliance reasons is not recommended due to the data possibly existing elsewhere in the NiFi repositories.

Removing PHI from Text in a NiFi Flow

What if you want to remove PHI (and PII) from the content of flow files as they go through a NiFi data flow? Check out our product Philter. It provides the ability to find and remove many types of PHI and PII from natural language, unstructured text from within a NiFi flow. Text containing PHI is sent to Philter and Philter responds with the same text but with the PHI and PII removed.


Full disk encryption and encrypting all connections in the NiFi flow and between NiFi nodes provides encryption of data at rest and in motion. It's also recommended that you check with your organization's compliance officer to determine if there are any other requirements imposed by your organization or other relevant regulation prior to deployment. It's best to gather that information up front to avoid rework in the future!

Our Approach to Continuous Delivery for Cloud Marketplaces

In this blog post I want to take a moment to share our challenges with continuous integration and delivery and how we approached them. Our Philter software to find and remove PII and PHI from text is deployed on (at the moment) three cloud marketplaces as well as being available for on-premises deployment. Each of the marketplaces, AWS Marketplace, Microsoft Azure Marketplace, and Google Cloud Platform (GCP) Marketplace, has its own requirements and constraints. We needed a pipeline that can build and test our code and deliver the binaries to each of the cloud marketplaces as a deployable image.

What tools you use to implement your process does not really matter. Some tools are more feature-rich than others and some are only better or worse in terms of difference of opinion. It's up to you to pick the tools that you or your organization want to use. We will mention the tools we use but don't take that as meaning only these tools will work. (We like being tool-agnostic to not make us afraid to try new tools.) Our build infrastructure runs in AWS.

Our builds are managed by Jenkins through Jenkinsfiles. Each project has a Jenkinsfile that defines the build stages for the project. These stages vary by project but are usually similar to "build", "test", and "deploy." The build and test stages are pretty self-explanatory. The deploy stage is where things get interesting (i.e. challenging).

We are using HashiCorp's Packer tool to create our images for the cloud marketplaces. A single Packer JSON file contains a "builder" (in Packer terminology) for each cloud marketplace. A builder defines the necessary parameters for constructing the image on that specific cloud platform. For instance, when building on AWS EC2, the builder contains information about the VPC and subnet used for the build, the base AMI the image will be created from, and the AWS region for the image. Likewise, for Microsoft Azure, the builder defines things such as the storage account name, operating system name and version, and the Azure subscription ID. GCP has its own set of required parameters.

The rest of the Packer JSON file contains the steps that will be done to prepare the image. This includes steps such as executing commands over SSH to install prerequisite packages, upload build artifacts made by the Jenkins build, and lastly, prepare the system for being turned into an image.

After the Jenkinsfile's "deploy" stage executes, the end result will be a new image in each of the cloud platforms suitable for final testing prior to being made available on the cloud's marketplace. This testing is initiated by the build, which publishes a message to an AWS SNS topic each time an image completes creation. This triggers a process, powered by AWS Lambda, that creates and starts a virtual machine from the image. Required credentials are stored in AWS SSM Parameter Store.
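
To give a feel for the Lambda side of this step, here is a Python sketch of a handler that reads the image details out of an SNS notification. The outer event shape (Records, Sns, Message) is the standard one for SNS-triggered Lambda functions. The message fields cloud, imageId, and buildNumber are a hypothetical schema invented for this example, not the one our pipeline uses, and starting the virtual machine is left out.

```python
import json


def images_from_sns_event(event):
    """Extract (cloud, image ID, build number) tuples from an SNS-triggered event."""
    images = []
    for record in event.get("Records", []):
        # The SNS message body arrives as a JSON string.
        message = json.loads(record["Sns"]["Message"])
        images.append((message["cloud"], message["imageId"], message["buildNumber"]))
    return images


def handler(event, context):
    images = images_from_sns_event(event)
    # A real function would now start a virtual machine for each image.
    return {"imagesToTest": len(images)}


sample_event = {"Records": [{"Sns": {"Message": json.dumps(
    {"cloud": "aws", "imageId": "ami-EXAMPLE", "buildNumber": 42})}}]}

print(images_from_sns_event(sample_event))
print(handler(sample_event, None))
```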

Automated testing is then performed against the virtual machine. Each image must be tested individually because of the nuances and different requirements of each cloud platform and marketplace, and because of the different base images. For instance, on AWS the base image is Amazon Linux 2. On Microsoft Azure it is CentOS 7. The scripts that install the prerequisites and configure the application can differ based on the base image.

The automated testing exercises the application's API and establishes an SSH connection to the virtual machine to verify that files are in the correct location and have been configured properly. A message is published to a separate AWS SNS topic indicating success or failure of the tests, and the virtual machine is terminated, leaving only the newly built image. The test results are persisted to a database along with the build number for reference. If testing was successful, we can proceed to the manual steps of publishing the image to the marketplaces when we are ready to do so. (All marketplaces require manual clicking to submit images so none of it can be automated.)

Continuous integration and delivery is important for all software projects. Having a consistent, repeatable process for building, testing, and packaging software for delivery is critical. A well-defined and implemented process can help teams find problems earlier, get configurations into code, and ultimately, get higher-quality products to the market faster.

Mountain Fog, Inc. Announces Release of Philter Software to Find and Remove PII and PHI

Philter finds and removes sensitive information such as PII and PHI from text.

October 19, 2019 – Morgantown, WV – Mountain Fog, Inc. today announced the immediate availability of their Philter software to scan natural language text for personally identifiable information (PII) and protected health information (PHI). Philter can redact or replace the sensitive text through user-defined configurations.

"We are very excited to offer Philter to the health information technology industry," said Jeff Zemerick, president of Mountain Fog. "Text containing PII and PHI presents unique challenges to businesses and organizations and Philter provides a means of finding and manipulating that PII and PHI."

Philter is available through the Amazon Web Services (AWS) Marketplace for deployment into customers' private clouds. "By deploying Philter directly into customer's own cloud accounts, the sensitive text never has to leave the customer's cloud," according to Mr. Zemerick. "This provides increased security and performance."

For more information on Philter or to request a live demonstration contact the Mountain Fog team at



About Mountain Fog, Inc.

Mountain Fog develops innovative cloud, big-data, and natural language processing solutions. For more information on Philter or to request a live demonstration contact the Mountain Fog team.

888-789-3894 |

Sneak Peek at Philter Big-Data and ETL Integrations

As we are nearing the general availability of Philter we would like to take a minute to offer a quick look at Philter's integrations with other applications. Philter offers integration capabilities with Apache NiFi, Apache Kafka, and Apache Pulsar to provide PHI/PII filtering capabilities across your big-data and ETL ecosystems. We are very excited to offer these integrations for such awesome and popular open source applications.

To recap, Philter is an application to identify and, optionally, remove or replace protected health information (PHI) and personally identifiable information (PII) from natural language text.

Apache NiFi

Philter provides a custom Apache NiFi processor NAR that you can plug into your existing NiFi installations by copying the NAR file to NiFi's lib directory. The processor allows your NiFi flow to identify and replace PHI and PII directly in your flow without any required external services. The processor's configuration is similar to Philter's standard configuration. The processor accepts a filter profile, an optional MongoDB URI to use to store replaced values, and a cache to maintain state when anonymizing values consistently. For the cache, the processor utilizes NiFi's built-in DistributedMapCacheServer.

The processor filters the content of the incoming flowfile and replaces that content with the filtered text. An outbound relationship passes the filtered text to the downstream processors.

Apache Kafka

Philter integrates with Apache Kafka by consuming text from a Kafka topic, filtering it, and publishing the filtered text to a different Kafka topic. Philter does this in a performant and fault-tolerant manner by leveraging the Apache Flink streaming framework. This approach suits existing pipelines where text is already being consumed from Kafka for processing because it requires minimal changes to the pipeline. Simply provide the appropriate configuration values to the Philter job and update your topic names.

Apache Pulsar

Philter integrates with Apache Pulsar via Pulsar Functions. A Pulsar Function enables Pulsar to execute functions on the streaming data as it passes through Pulsar. Pulsar is similar to Kafka in its functionality as a massive pub/sub application, but unlike Kafka it provides the ability to transform the data directly inside the application. This makes Pulsar Functions an ideal integration point for Philter in streaming architectures built on Apache Pulsar.
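To give a feel for the shape of such a function, here is a minimal Python sketch written in the style of a language-native Pulsar Function, which is a plain function that receives each message and returns the transformed value. This is not Philter's actual Pulsar Function. The regular expressions, the placeholder strings, and the redact helper are hypothetical stand-ins for Philter's filtering, and the sketch assumes messages arrive as strings.

```python
import re

# Stand-in patterns for illustration only. Philter's real filters are far more
# thorough and are driven by its own configuration.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EMAIL_PATTERN = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def redact(text):
    """Replace matches of the stand-in patterns with fixed placeholders."""
    text = SSN_PATTERN.sub("[SSN]", text)
    return EMAIL_PATTERN.sub("[EMAIL]", text)


def process(input):
    """Entry point: called once per message, returns the filtered message."""
    return redact(input)
```

Because the transformation runs inside Pulsar, producers and consumers only need to agree on the input and output topic names.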

Filter Profiles in Philter

Today we are excited to announce a new feature in Philter that is a result of Philter's open beta testing. We are excited to offer this functionality just prior to Philter going live. Thanks to those who provided their feedback to make this possible!

Previously in Philter, the configuration of each "filter" was static and set when Philter started. The limitation of that implementation was that if you wanted to filter documents differently based on some criteria, you had to run two instances of Philter and add logic around Philter's API to send each document to the appropriate instance. It was also restrictive because every enabled filter shared some of the same values, such as the replacement format. You could not replace a zip code differently than a credit card number, for example.

We have changed how the filters are configured, and we are introducing the new feature as "filter profiles." A filter profile is a set of filters, each with its own configuration, defined in a JSON file. Now a single instance of Philter can hold multiple filter profiles and choose which one to apply on a per-request basis. We have also added more options for handling each individual PII/PHI identifier, such as being able to independently configure how to redact or replace each one. For instance, it is now possible to truncate zip codes to a chosen length instead of simply replacing the whole zip code.

Here's an example filter profile that enables filters and defines how corresponding PII/PHI should be replaced. Note how each identifier has its own strategy for handling items - individual types are no longer constrained to sharing the same strategy. With filter profiles, you can also selectively enable identifier filters. Filters not defined in the profile will not be enabled for that profile. Non-deterministic filters such as NLP-based ones can now have their own sensitivity setting, too.
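As an illustration of the idea only, the following Python sketch applies a different strategy to each identifier type. The profile layout, the field names (strategy, keep, replacement), the regular expressions, and the function names are all hypothetical and are not Philter's actual profile schema or implementation.

```python
import json
import re

# Hypothetical profile layout for illustration; not Philter's actual schema.
# No "ssn" entry is defined, so that filter is not enabled for this profile.
PROFILE = json.loads("""
{
  "name": "example",
  "identifiers": {
    "zip_code": {"strategy": "truncate", "keep": 3},
    "credit_card": {"strategy": "redact", "replacement": "[CREDIT-CARD]"}
  }
}
""")

# Simplistic stand-in patterns, one per identifier type the sketch knows about.
PATTERNS = {
    "zip_code": re.compile(r"\b\d{5}\b"),
    "credit_card": re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def apply_strategy(value, config):
    """Transform one matched value according to its identifier's strategy."""
    if config["strategy"] == "truncate":
        keep = config["keep"]
        return value[:keep] + "*" * (len(value) - keep)
    if config["strategy"] == "redact":
        return config["replacement"]
    raise ValueError("unknown strategy: " + config["strategy"])


def filter_text(text, profile):
    """Run only the filters named in the profile, each with its own strategy."""
    for name, config in profile["identifiers"].items():
        text = PATTERNS[name].sub(
            lambda match, c=config: apply_strategy(match.group(0), c), text
        )
    return text
```

Calling filter_text("Card 4111-1111-1111-1111, zip 90210, SSN 123-45-6789.", PROFILE) truncates the zip code, redacts the card number, and leaves the SSN alone because the profile does not enable that filter.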


You can initialize Philter with as many filter profiles as you need. Using the REST API you can select the filter profile to use when making your request by providing a p parameter along with the name of a filter profile as shown in this sample request:

curl -k -X POST "https://localhost:8080/api/filter?c=context&p=profile" \
  -d @file.txt \
  -H "Content-Type: text/plain"
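
The same request can be made from application code. Below is a minimal Python sketch using only the standard library. The endpoint path and the c and p parameters come from the sample request above. The function names are hypothetical, and treating the response body as the filtered text encoded as UTF-8 is an assumption.

```python
import ssl
import urllib.parse
import urllib.request


def build_filter_request(base_url, context, profile, text):
    """Build a POST request that selects a filter profile via the p parameter."""
    query = urllib.parse.urlencode({"c": context, "p": profile})
    return urllib.request.Request(
        f"{base_url}/api/filter?{query}",
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def send(request, verify_tls=True):
    """Send the request and return the response body (assumed to be UTF-8 text)."""
    tls_context = ssl.create_default_context()
    if not verify_tls:
        # Equivalent of curl's -k flag; only appropriate for local testing
        # against a self-signed certificate.
        tls_context.check_hostname = False
        tls_context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, context=tls_context) as response:
        return response.read().decode("utf-8")
```

For example, send(build_filter_request("https://localhost:8080", "context", "profile", text), verify_tls=False) mirrors the curl command, with the profile name chosen per request.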

We have lots of ways to expand filter profiles on our to-do list, such as centralized filter profile management and an API for managing filter profiles remotely, so look for updates on those features soon.

PHI in a DevOps Environment

We have all had a doctor's visit where we are asked to fill out a HIPAA form regarding who the office can share our medical data with. The form is typically titled HIPAA Privacy Notice or something similar. Because of this, most of us are probably familiar with HIPAA and its general purpose. But for those of us in the tech industry, even those outside of the healthcare sector, it's beneficial to have a slightly deeper understanding of HIPAA and protected health information (PHI).

PHI is any information that can be used to identify a patient. Formally, HIPAA defines 18 categories of PHI under the HIPAA Privacy Rule. These categories include names, social security numbers, addresses, biometric data, and patient record numbers. Those are most of the obvious ones. Less obvious items like email addresses, vehicle identification numbers (VINs), and fax numbers are also part of those 18 categories.

The advent of DevOps cultures has helped make knowledge of HIPAA and PHI a team requirement. All team members need to understand what PHI is and the implications of having PHI in a system. Prior to DevOps, it was likely that only a select few team members had access to data containing PHI. The democratization of team responsibilities introduced by DevOps now means all team members may potentially have access to PHI data.

PHI and DevOps

Team Training

The very first thing your organization should do is develop a training program to educate current and future team members on PHI. (The HIPAA Privacy and Security Rules mandate appropriate training.) The content of the training is out of scope here, but upon completing the training each team member should have a solid working knowledge of HIPAA and PHI and an awareness of the possible penalties for failing to protect PHI at all times.

Well-Defined Scope and Approved Services

Next, the scope of PHI in your system should be very well defined and documented. The boundaries where PHI exists should be well known to all team members. If you are operating in a cloud environment with a signed Business Associate Agreement (BAA), all team members should be aware of the cloud provider's services that are approved for PHI data. It is a common pitfall to utilize a cloud provider's service that is not on the list of approved services.

During the design phase, it is imperative that the team fully research any third-party services to determine if they are approved for storing PHI data. After delivery, we recommend regularly checking the cloud provider's list of approved services because new additions are made over time. For example, a newly launched service may not be approved today but may be added to the BAA list of approved services within the next six months.

Least Privilege Permissions

Your method of assigning permissions to team members should be based on the principle of least privilege. Team members should only have access to the data they need in order to perform their role. For example, in an AWS environment it is a best practice to have a separate AWS account to contain the application logs. This account is only accessible by the project's security team members. The application can write logs to the account via CloudWatch or some other log aggregation utility, but only the security team members can view the logs.

Encryption of Data

Each team member should also be aware of the basic security precautions required for PHI. Encryption of data at rest and in motion is paramount, even in a virtual private cloud or other isolated environment. The movement of traffic in an isolated environment is not a substitute for transport encryption. Full disk encryption, transport layer encryption, and encryption provided by blob stores and other persistent data stores will help protect the data. A deep knowledge of the cloud provider being utilized is essential. Resources are available for AWS, Azure, and Google Cloud.

Disaster and Recovery Plan

Your team should have a disaster and recovery plan that clearly defines the required steps and roles. This is required by the HIPAA Security Rule. The plan should define how PHI data will be managed and restored and how the project will operate in the event of some natural or man-made disaster. The plan should clearly outline how the PHI data will be protected at all times throughout execution of the plan.

While a disaster and recovery plan is not specific to DevOps teams, a DevOps culture will certainly affect how the plan is written. Roles and responsibilities may transcend the traditional lines of development and operations. It may utilize an on-call schedule to contact the applicable team members. A DevOps culture makes it crucial that all team members be aware of the plan since the responsibility of executing the plan likely requires at least one team member from each role.

It is also crucial that all disaster and recovery plans be tested in order to identify areas of improvement and to ensure team members are aware of and can execute their responsibilities when needed. In AWS terminology these planned events are referred to as game days.

Going Forward

The task of managing PHI can be a daunting one, but it should not be an impediment to your project's success. With the appropriate care, knowledge, and awareness, your team can create and execute a strategy to successfully manage PHI in your environments. If you need to remove PHI from data, take a look at our Philter application. Philter can help you make that data usable for other purposes.