Preventing PII and PHI from Leaking into Application Logs

Introduction

This blog post demonstrates how to use Philter to find and remove sensitive information from application logs. In this post we use log4j and Apache Kafka but the concepts can be applied to virtually any logging framework and streaming system, such as Amazon Kinesis. (See this post for a similar solution using Kinesis Firehose Transformations to filter sensitive information from text.)

There is a sample project for this blog post available here on GitHub that you can copy and adapt. The code demonstrates how to publish log4j log messages to Apache Kafka, which is required before the logs can be consumed and filtered by Philter as described in this post.

PII and PHI in application logs

Developing a system that will contain personally identifiable information (PII) or protected health information (PHI) can be a challenge. Encryption of data in motion and data at rest and maintaining audit logs have to be considered, just to name a few.

One part of the application’s development that’s easy to overlook is application logging. Logs make our lives easier. They help developers find and fix problems, and they give insight into what our applications are doing at any given time. But even seemingly innocuous log messages can turn out to contain PII or PHI at runtime. For example, if an application error occurs and is logged, some piece of the logged message could inadvertently contain PII or PHI.

Having a well-defined set of logging rules is important when developing these applications. Developers should always log internal user IDs rather than user names or other identifying values. Pull request reviews should check for those rules as well, to catch any violations that might have been missed. But even then, can we be sure that there will never be any PII or PHI in our applications’ logs?
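One way to make such a rule harder to break by accident is to ensure that objects carrying sensitive fields never print those fields. The following is a minimal sketch of that idea, not code from the sample project: the PatientRef class and its fields are hypothetical names chosen for illustration.

```java
// Hypothetical domain object; the class and field names are illustrative assumptions.
final class PatientRef {

    private final long id;      // internal identifier, safe to log
    private final String name;  // PII, must not reach the logs
    private final String ssn;   // PII, must not reach the logs

    PatientRef(long id, String name, String ssn) {
        this.id = id;
        this.name = name;
        this.ssn = ssn;
    }

    long getId() { return id; }
    String getName() { return name; }
    String getSsn() { return ssn; }

    // Only the internal ID is exposed, so logging the object itself
    // cannot leak the name or the social security number.
    @Override
    public String toString() {
        return "Patient[id=" + id + "]";
    }

    public static void main(String[] args) {
        PatientRef patient = new PatientRef(42L, "Jane Doe", "123-45-6789");
        // A log statement that concatenates the object only ever sees the ID.
        System.out.println("Beginning processing for " + patient);
    }
}
```

This only guards against logging the object as a whole. A developer can still log getName() or getSsn() directly, which is why a second line of defense is still worth having.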

Philter and application logs

For more confidence, you can process your applications’ logs with Philter. In this post we are using Java and the log4j framework, but nearly all modern programming languages have similar capabilities or a similar logging framework. If you are developing on .NET, you can use log4net and one of the third-party Apache Kafka appenders available through NuGet.

We are going to modify our application’s log4j logging configuration to publish the logs to Apache Kafka. Philter will consume the logs from Apache Kafka and the filtered logs will be persisted to disk. So, to implement the code in this post you will need at least one running Apache Kafka broker and an instance of Philter. Some good music never hurts either.

Spin up Kafka in a Docker container:

curl -O https://raw.githubusercontent.com/confluentinc/cp-all-in-one/5.5.1-post/cp-all-in-one/docker-compose.yml
docker-compose up

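Spin up Philter in a Docker container (request a free license key). Run these commands from a separate directory, because both downloads are saved as docker-compose.yml and the second would otherwise overwrite the first: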

curl -O https://raw.githubusercontent.com/mtnfog/philter/master/docker-compose.yml
docker-compose up

Create a philter topic (run this from the directory that contains the Kafka docker-compose.yml):

docker-compose exec broker kafka-topics --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic philter

Now we’re good to go.

In a later post we will accomplish the same thing of filtering PII and PHI from application logs but with Apache Pulsar instead of Apache Kafka. Apache Pulsar has the concept of Functions that make manipulating the streaming data with Philter very easy. Stay tuned for that post!

Logging to Apache Kafka

Using log4j to output logs to Apache Kafka is very easy. Through the KafkaAppender we can configure log4j to publish the logs to an Apache Kafka topic. Here’s an example of how to configure the appender:

<?xml version="1.0" encoding="UTF-8"?>
...
<Appenders>
  <Kafka name="Kafka" topic="philter">
    <PatternLayout pattern="%date %message"/>
    <Property name="bootstrap.servers">localhost:9092</Property>
  </Kafka>
</Appenders>

With this appender in our log4j2.xml file, our application’s logs will be published to the Kafka topic called philter. The Kafka broker is running at localhost:9092. Great! Now we’re ready to have Philter consume from the topic to remove PII and PHI from the logs.

Using Philter to remove PII and PHI from the logs

To consume from the Kafka topic we are going to use the kafkacat utility. This is a nifty little tool that makes some Kafka operations really easy from the command line. (You could also use the standard kafka-console-consumer.sh script that comes with Kafka.) The following command consumes from the philter topic and writes the messages to standard out.

kafkacat -b localhost:9092 -t philter

Instead of writing the messages to standard out, we need to send them to Philter, where any PII or PHI in them can be removed, so we pipe the text to the Philter CLI, which sends it to Philter. In the command below the filtered text is written to the console. To save it to a file instead, append a shell redirect such as > filtered.log to the end of the command.

kafkacat -C -b localhost:9092 -t philter -q -f '%s\n' -c 1 -e | ./philter-cli-linux-amd64 -i -h https://localhost:8080

This command uses kafkacat in consumer mode (-C) to consume from our philter topic. We are telling kafkacat to be quiet (-q) and not produce any extraneous output, to format the output so that only the message payload is displayed (-f), to consume only a single message (-c 1), and to exit (-e) after doing so. The result is that a single message is consumed from Kafka and sent to Philter using the philter-cli. The filtered text is then written to the console.

The flow of the messages

Our example project logs 100 messages to the Kafka topic. Each message looks like the following, where 99 is just an incrementing integer:

08:43:10.066 [main] INFO com.mtnfog.App - This is a sample log message 99. Beginning processing for patient 123-45-6789.

The output from Philter after processing the message is:

08:43:10.066 [main] INFO com.mtnfog.App - This is a sample log message 99. Beginning processing for patient {{{REDACTED-ssn}}}.

We can see that Philter identified 123-45-6789 as a social security number and redacted it from the text. This is a simple example, but depending on how Philter is configured (specifically its filter profile) we could have been looking for many other types of sensitive information, such as people’s names, unique identifiers, ages, dates, and so on.
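To make the transformation above concrete, here is a small stand-in that reproduces the same input and output with a plain regular expression. It is only a sketch of the effect, not how Philter works internally: Philter's detection is driven by its filter profile, while the pattern below is a simplistic assumption that only matches the NNN-NN-NNNN form. The class name SsnRedactionSketch is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative stand-in only; Philter's real detection is configured by its filter profile.
final class SsnRedactionSketch {

    // Assumed pattern: three digits, two digits, four digits, separated by hyphens.
    private static final Pattern SSN = Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b");

    // Replacement token copied from the Philter output shown above.
    private static final String TOKEN = "{{{REDACTED-ssn}}}";

    static String redact(String line) {
        // quoteReplacement keeps the token literal even if it ever contains '$' or '\'.
        return SSN.matcher(line).replaceAll(Matcher.quoteReplacement(TOKEN));
    }

    public static void main(String[] args) {
        String line = "08:43:10.066 [main] INFO com.mtnfog.App - This is a sample log message 99. "
                + "Beginning processing for patient 123-45-6789.";
        System.out.println(redact(line));
    }
}
```

A single pattern like this misses everything that does not fit one rigid format, such as names, ages, and dates, which is why this post hands the text to Philter instead.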

We also just wrote the filtered log to standard out. We could easily have redirected it to a file, to another Kafka topic, or somewhere else. We also used the philter-cli to send the text to Philter. You could also have used curl or one of the Philter SDKs.
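If you would rather call Philter directly from Java, the request can be sketched with the JDK's built-in HTTP client (Java 11 or later). This is a hedged sketch, not the Philter SDK: it assumes Philter accepts plain text via POST at /api/filter on the same https://localhost:8080 address used above, so check the API documentation for your Philter version. The class name PhilterHttpSketch is hypothetical. If your Philter instance uses a self-signed certificate, the JVM will reject the connection until that certificate is trusted.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper; the /api/filter path is an assumption to verify against your Philter docs.
final class PhilterHttpSketch {

    // Builds a plain-text POST request carrying one log line.
    static HttpRequest buildFilterRequest(String baseUrl, String text) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/filter"))
                .timeout(Duration.ofSeconds(10))
                .header("Content-Type", "text/plain")
                .POST(HttpRequest.BodyPublishers.ofString(text))
                .build();
    }

    public static void main(String[] args) {
        String line = args.length > 0 ? args[0] : "Beginning processing for patient 123-45-6789.";
        HttpRequest request = buildFilterRequest("https://localhost:8080", line);
        try {
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            // The response body is expected to be the filtered text.
            System.out.println(response.body());
        } catch (IOException e) {
            // Typical causes: Philter is not running, or its certificate is not trusted by this JVM.
            System.err.println("Request to Philter failed: " + e.getMessage());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Keeping the request construction in its own method makes it easy to check the URL and headers without a running Philter instance.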

Summary

In this blog post we showed how Philter can find and remove sensitive information from application log files using log4j’s Kafka appender, Apache Kafka, and kafkacat. We hope this example is useful. A sample project to produce log messages to a Kafka topic is available on GitHub.


Jeff Zemerick is the founder of Mountain Fog. He is a 10x certified AWS engineer, current chair of the Apache OpenNLP project, and experienced software engineer.

You can contact Jeff at jeff.zemerick@mtnfog.com or on LinkedIn.