Philter is now Certified for Cloudera Dataflow

We are excited to announce that Philter is now certified for Cloudera Dataflow (CDF). By leveraging Philter in your Apache NiFi data flows, you can redact protected health information (PHI), personally identifiable information (PII), and other types of sensitive information from your data.

Using Philter in Cloudera Dataflow is as simple as making the Philter processors available to your Apache NiFi instance and adding the processors to your canvas. You can configure the types of information to redact right inside the processor's properties. You can choose whether to use a centralized Philter instance or perform the redaction directly within the Apache NiFi flow. The first option keeps the configuration in one place, while the second provides significant performance improvements.

Philter on Cloudera Dataflow is compatible with all public clouds supported by Cloudera Dataflow.

Get Started

To get started with Philter on Cloudera Dataflow, please contact us and we will guide you through the process. Visit our partner information on the Cloudera partner portal.

About Cloudera Dataflow

Cloudera DataFlow (CDF) is a CDP Public Cloud service that enables self-serve deployments of Apache NiFi data flows from a central catalog to auto-scaling Kubernetes clusters managed by CDP. Flow deployments can be monitored from a central dashboard with the ability to define KPIs to keep track of critical data flow metrics. CDF eliminates the operational overhead that is typically associated with running Apache NiFi clusters and allows users to fully focus on developing data flows and ensuring they meet business SLAs. Learn more about Cloudera Dataflow.

 


Philter Price Reduction

We are thrilled to announce a price reduction for Philter on the cloud marketplaces. Philter pricing now starts at $0.49/hr, down from $0.79/hr. Where applicable, the annual pricing has been reduced accordingly. The pricing is tiered depending on the size of the compute instance running Philter.

The price change will take effect across the AWS, Azure, and Google Cloud marketplaces in the coming week. This pricing change will not affect support or managed services.

We would like to take a moment to thank our users for helping to make this price reduction possible. It is because of our supportive users that we are able to make this change.

AWS Marketplace

Philter 1.8.0

Philter 1.8.0 has been released.

This version brings:

  • The ability to capture timing metrics for each of the filter types. Capturing these metrics will provide insights into the performance of the filters.
  • The ability to specify terms to ignore in files for each filter type. Previously, lists of ignored terms had to be specified in the filter profile. Being able to specify the terms to ignore in files outside the filter profile allows for cleaner and easier-to-manage filter profiles.

Philter 1.8.0 is now available for deployment from the cloud marketplaces.

Launch Philter in your cloud. See Philter's full Release Notes.


Preventing PII and PHI from Leaking into Application Logs

Introduction

This blog post demonstrates how to use Philter to find and remove sensitive information from application logs. In this post we use log4j and Apache Kafka but the concepts can be applied to virtually any logging framework and streaming system, such as Amazon Kinesis. (See this post for a similar solution using Kinesis Firehose Transformations to filter sensitive information from text.)

There is a sample project for this blog post available here on GitHub that you can copy and adapt. The code demonstrates how to publish log4j log messages to Apache Kafka, which is what allows Philter to consume and filter the logs as described in this post.

PII and PHI in application logs

Developing a system that will contain personally identifiable information (PII) or protected health information (PHI) can be a challenge. Encrypting data in motion and at rest and maintaining audit logs are just a few of the things that have to be considered.

Figure: Using Philter to find and remove sensitive information in log files using log4j and Apache Kafka.

One part of the application's development that's easy to disregard is application logging. Logs make our lives easier. They help developers find and fix problems and they give insights into what our applications are doing at any given time. But sometimes even the seemingly most innocuous log messages can be found to contain PII and PHI at runtime. For example, if an application error occurs and an error is logged, some piece of the logged message could inadvertently contain PII and PHI.

Having a well-defined set of rules for logging when developing these applications is important. For example, developers should log opaque user IDs rather than user names or other identifying values. Reviews of pull requests should check for those rules as well, to catch any violations that might have been missed. But even then, can we be sure that there will never be any PII or PHI in our applications' logs?

Philter and application logs

For more confidence, you can process your applications' logs with Philter. In this post we are using Java and the log4j framework, but nearly all modern programming languages have a similar logging framework. If you are developing on .NET you can use log4net and one of the third-party Apache Kafka appenders available through NuGet.

We are going to modify our application's log4j logging configuration to publish the logs to Apache Kafka. Philter will consume the logs from Apache Kafka and the filtered logs will be persisted to disk. So, to implement the code in this post you will need at least one running Apache Kafka broker and an instance of Philter. Some good music never hurts either.

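Spin up Kafka in a docker container. Both compose files used here are named docker-compose.yml, so we save each one under its own name and give each its own project name to keep the second download from overwriting the first:

curl -o kafka-compose.yml https://raw.githubusercontent.com/confluentinc/cp-all-in-one/5.5.1-post/cp-all-in-one/docker-compose.yml
docker-compose -p kafka -f kafka-compose.yml up -d

Spin up Philter in a docker container:

curl -o philter-compose.yml https://raw.githubusercontent.com/mtnfog/philter/master/docker-compose.yml
docker-compose -p philter -f philter-compose.yml up -d

Create a philter topic:

docker-compose -p kafka -f kafka-compose.yml exec broker kafka-topics --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic philter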

Now we're good to go.

In a later post we will accomplish the same thing of filtering PII and PHI from application logs but with Apache Pulsar instead of Apache Kafka. Apache Pulsar has the concept of Functions that make manipulating the streaming data with Philter very easy. Stay tuned for that post!

Logging to Apache Kafka

Using log4j to output logs to Apache Kafka is very easy. Through the KafkaAppender we can configure log4j to publish the logs to an Apache Kafka topic. Here's an example of how to configure the appender:

<?xml version="1.0" encoding="UTF-8"?>
...
<Appenders>
  <Kafka name="Kafka" topic="philter">
    <PatternLayout pattern="%date %message"/>
    <Property name="bootstrap.servers">localhost:9092</Property>
  </Kafka>
</Appenders>

With this appender in our log4j2.xml file, our application's logs will be published to the Kafka topic called philter. The Kafka broker is running at localhost:9092. Great! Now we're ready to have Philter consume from the topic to remove PII and PHI from the logs.

Using Philter to remove PII and PHI from the logs

To consume from the Kafka topic we are going to use the kafkacat utility. This is a nifty little tool that makes some Kafka operations really easy from the command line. (You could also use the standard kafka-console-consumer.sh script that comes with Kafka.) The following command consumes from the philter topic and writes the messages to standard out.

kafkacat -b localhost:9092 -t philter

Instead of just writing the messages to standard out, we need to send them to Philter so that any PII or PHI in them can be removed. To do that we pipe the text to the Philter CLI, which sends it to Philter and writes the filtered text to the console.

kafkacat -C -b localhost:9092 -t philter -q -f '%s\n' -c 1 -e | ./philter-cli-linux-amd64 -i -h https://localhost:8080

This command uses kafkacat to consume from our philter topic. We tell kafkacat to be quiet (-q) and not produce any extraneous output, to print only the message itself (-f), to consume a single message (-c 1), and to exit (-e) after doing so. The result is that a single message is consumed from Kafka and sent to Philter using the philter-cli. The filtered text is then written to the console.

The flow of the messages

Our example project logs 100 messages to the Kafka topic. Each message looks like the following, where 99 is an incrementing integer:

08:43:10.066 [main] INFO com.mtnfog.App - This is a sample log message 99. Beginning processing for patient 123-45-6789.

The output from Philter after processing the message is:

08:43:10.066 [main] INFO com.mtnfog.App - This is a sample log message 99. Beginning processing for patient {{{REDACTED-ssn}}}.

We can see that Philter identified 123-45-6789 as a social security number and redacted it from the text. This is a simple example, but depending on how Philter is configured (specifically its filter profile) we could have been looking for many other types of sensitive information such as person names, unique identifiers, ages, dates, and so on.
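To make the before and after concrete, here is a minimal sketch of that kind of replacement in plain Java. It is not how Philter works internally: the regular expression is a simplified assumption that only covers SSNs written as ddd-dd-dddd, and the class and method names are hypothetical. Only the replacement token is taken from the output shown above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SsnRedactionSketch {

    // Simplified assumption: SSNs written as ddd-dd-dddd. A real filter handles more cases.
    private static final Pattern SSN = Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b");

    // Replacement token copied from the sample output in this post.
    private static final String TOKEN = "{{{REDACTED-ssn}}}";

    // Replaces every SSN-shaped match in the line with the token.
    static String redact(String line) {
        // quoteReplacement keeps the token literal in the replacement string.
        return SSN.matcher(line).replaceAll(Matcher.quoteReplacement(TOKEN));
    }

    public static void main(String[] args) {
        System.out.println(redact("Beginning processing for patient 123-45-6789."));
    }
}
```

A pattern-only approach like this is exactly why ignore lists and validation options matter: anything that merely looks like an SSN gets replaced.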

In this example we wrote the filtered log to standard out. We could just as easily have redirected it to a file, to another Kafka topic, or to somewhere else. We also used the philter-cli to send the text to Philter. You could also have used curl or one of the Philter SDKs.
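If you would rather call Philter over HTTP from Java than shell out to the CLI, a sketch along the following lines may be a starting point. Several things here are assumptions to check against the Philter API documentation: the /api/filter path, the text/plain content type, and the class and method names, which are ours. It uses java.net.http, so it needs Java 11 or later, and a Philter instance with a self-signed certificate will also need its certificate trusted by the JVM before the call in main succeeds.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

final class PhilterHttpSketch {

    // Builds a POST request that carries one log line as plain text.
    static HttpRequest buildRequest(String baseUrl, String logLine) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/filter")) // assumed path
                .header("Content-Type", "text/plain")                      // assumed content type
                .POST(HttpRequest.BodyPublishers.ofString(logLine))
                .build();
    }

    public static void main(String[] args) throws Exception {
        HttpRequest request = buildRequest("https://localhost:8080",
                "Beginning processing for patient 123-45-6789.");
        // Sends the line to Philter and prints whatever text comes back.
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```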

Summary

In this blog post we showed how Philter can find and remove sensitive information from application log files using log4j's Kafka appender, Apache Kafka, and kafkacat. We hope this example is useful. A sample project to produce log messages to a Kafka topic is available on GitHub.


Philter 1.2.0

We are happy to announce the release of Philter 1.2.0. This version brings new features to filter profiles along with some minor changes. Philter 1.2.0 will be available on the cloud marketplaces in a day or two. Let's get to it and see what's new!

A Recap

Philter is an application that analyzes text for personally identifiable information (PII) and protected health information (PHI) and removes or manipulates those items when found. The types of information that Philter looks for, and how it acts on each, are defined in a filter profile. A filter profile is just a file that lists the types of PII/PHI that you are interested in, e.g. credit card numbers, person names, etc. Philter is available on the AWS Marketplace, Azure Marketplace, and GCP Marketplace.

Contact us for a live demo or feel free to take Philter for a spin on one of the cloud marketplaces taking advantage of its free trial period. Check out the full Release History.

What's New in Philter 1.2.0

Filter Specific Ignore Lists

A filter profile can now have lists of ignored terms specific to each filter type. For example, let's say there is a number "123-45-6789" in your text and it keeps getting identified as an SSN because it fits the SSN format. However, you know this number is not an SSN and do not want it removed. You can now add "123-45-6789" to a list of ignored terms for the SSN filter to prevent it from being removed from the text. Each type of filter has its own ignore list.

Global Ignore Lists

A filter profile can now have zero or more ignore lists that apply to all filter types. Items added to this list are ignored for all filter types. All items present in the global ignore lists will never be removed from the input text.
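The way the two kinds of ignore lists combine can be sketched in a few lines of Java. This is an illustration of the behavior described above, not Philter's code: the class, the filter type names used as map keys, and the exact-match comparison are all assumptions.

```java
import java.util.Map;
import java.util.Set;

final class IgnoreListSketch {

    // Hypothetical in-memory stand-ins for the ignore lists in a filter profile.
    private final Set<String> globalIgnored;
    private final Map<String, Set<String>> ignoredByFilter;

    IgnoreListSketch(Set<String> globalIgnored, Map<String, Set<String>> ignoredByFilter) {
        this.globalIgnored = globalIgnored;
        this.ignoredByFilter = ignoredByFilter;
    }

    // Returns true when text matched by the given filter type should still be removed.
    boolean shouldRedact(String filterType, String matchedText) {
        if (globalIgnored.contains(matchedText)) {
            return false; // global lists apply to every filter type
        }
        // Filter-specific lists only apply to matches from that filter type.
        Set<String> ignored = ignoredByFilter.getOrDefault(filterType, Set.of());
        return !ignored.contains(matchedText);
    }

    public static void main(String[] args) {
        IgnoreListSketch lists = new IgnoreListSketch(
                Set.of("Acme Health"),
                Map.of("ssn", Set.of("123-45-6789")));
        System.out.println(lists.shouldRedact("ssn", "123-45-6789"));
        System.out.println(lists.shouldRedact("ssn", "987-65-4321"));
    }
}
```

In the usage above, the ignored number is left alone by the SSN filter while any other SSN-shaped value is still removed.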

Disabling Filters

Previously, to disable a filter type in a filter profile you had to delete it from the filter profile. This can be problematic because the filter may carry configuration you don't want to lose. New in Philter 1.2.0, each filter type has an enabled property that controls whether or not the filter is applied. When set to false the filter is not applied. The default value is true, so every filter type is enabled unless you explicitly disable it.

Invalid Credit Card Numbers

Philter identifies credit card numbers based on their patterns and checksums. In Philter 1.2.0, a new option was added to the credit card filter type that allows invalid credit card numbers to be filtered as well. An invalid credit card number is a number that matches the pattern of a credit card number but fails the Luhn algorithm, the checksum used to validate card numbers. This option is disabled by default.
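For readers who have not seen it, the Luhn check is short enough to show in full. The following is a generic Java sketch of the algorithm, not code taken from Philter, and the class and method names are ours. Starting from the rightmost digit, every second digit is doubled (subtracting 9 when the result exceeds 9), and the number passes when the total is divisible by 10.

```java
final class LuhnSketch {

    // Returns true when the digits pass the Luhn checksum; spaces and dashes are ignored.
    static boolean passesLuhn(String number) {
        String digits = number.replaceAll("[\\s-]", "");
        if (!digits.matches("\\d+")) {
            return false; // empty input or non-digit characters
        }
        int sum = 0;
        boolean doubleIt = false; // the rightmost digit is not doubled
        for (int i = digits.length() - 1; i >= 0; i--) {
            int d = digits.charAt(i) - '0';
            if (doubleIt) {
                d *= 2;
                if (d > 9) {
                    d -= 9; // same as adding the two digits of the product
                }
            }
            sum += d;
            doubleIt = !doubleIt;
        }
        return sum % 10 == 0;
    }

    public static void main(String[] args) {
        System.out.println(passesLuhn("4111 1111 1111 1111")); // a commonly used test number
        System.out.println(passesLuhn("4111 1111 1111 1112")); // last digit changed
    }
}
```

With the new option enabled, a number like the second one would be filtered even though it fails the check.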

Valid Dates

Philter identifies dates based on date patterns. Sometimes, a date may match a valid pattern but not be a valid date, such as February 30 or even March 45. Philter 1.2.0 adds a new option to the date filter to require that identified dates be valid dates. When enabled, dates found to not be valid dates are not removed from the text. This option is disabled by default.
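The difference between matching a date pattern and being a real date is easy to demonstrate with java.time. This is a sketch of the idea only, under the assumption of a single yyyy-MM-dd style format; Philter recognizes many more date formats, and the class and method names here are hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;

final class DateValiditySketch {

    // STRICT resolution rejects impossible dates instead of adjusting them.
    private static final DateTimeFormatter STRICT_ISO =
            DateTimeFormatter.ofPattern("uuuu-MM-dd").withResolverStyle(ResolverStyle.STRICT);

    // Returns true only when the text is a date that exists on the calendar.
    static boolean isRealDate(String text) {
        try {
            LocalDate.parse(text, STRICT_ISO);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isRealDate("2020-02-29")); // leap day
        System.out.println(isRealDate("2020-02-30")); // February 30
        System.out.println(isRealDate("2020-03-45")); // March 45
    }
}
```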

Option to Remove Punctuation

Philter 1.2.0 adds a new option to the filter profile for named-entity recognition to remove punctuation from the input text prior to processing the text. By default this option is disabled and punctuation is not removed. Removing punctuation can be beneficial in cases where punctuation is being included in entities. This can happen when the last word of a sentence is a name and the period is included in the filtered text. (This doesn't always happen, and we're working on further reducing those occurrences through improvements to the named-entity recognition capability.)

Encrypting Connections to Redis

Philter's consistent anonymization feature stores the identified text in a Redis cache. This allows a clustered Philter installation to be able to replace identified text consistently across all instances of Philter. (When Redis is not used, the identified text values are stored in memory on each Philter instance.) Philter 1.2.0 requires all connections to a Redis cache be encrypted and requires the use of a Redis auth token.
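The idea behind consistent replacement can be sketched with an in-memory map. This is an illustration under our own assumptions, not Philter's implementation: the class name, the "anon-" prefix, and the use of a random UUID as the replacement value are all hypothetical. The point is only that a lookup keyed on the identified text returns the same replacement every time, and that moving that lookup into a shared cache is what lets several instances agree.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

final class ConsistentReplacementSketch {

    // In-memory stand-in for the cache; a cluster would need this state to be shared.
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    // Returns the same replacement every time the same identified text is seen.
    String replacementFor(String identifiedText) {
        return cache.computeIfAbsent(identifiedText, key -> "anon-" + UUID.randomUUID());
    }

    public static void main(String[] args) {
        ConsistentReplacementSketch sketch = new ConsistentReplacementSketch();
        String first = sketch.replacementFor("John Smith");
        String second = sketch.replacementFor("John Smith");
        System.out.println(first.equals(second));
    }
}
```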