Protecting Sensitive Information in Streaming Platforms

Streaming platforms like Apache Kafka and Apache Pulsar provide powerful capabilities for ingesting data. With these platforms we can build all types of solutions across many industries, from healthcare to IoT and everything in between. Inevitably, the question arises of how to deal with sensitive information that resides in the streaming data. How do we make sure that data never crosses a boundary? How do we keep that data safe? How can we remove the sensitive information from the incoming data so we can continue processing it? These are all good questions to ask, and in this post we present a couple of architectures that, along with Philter, can help protect the sensitive information in your streaming data.

Whether you are using Apache Kafka, Apache Pulsar, or some other streaming platform is largely irrelevant. These platforms are built on the same core concepts and even share quite a bit of terminology, such as brokers and topics. (A broker is a single instance of Kafka or Pulsar, and a topic is how the streaming data is organized when it reaches the broker.)

Streaming Healthcare Data

Let's assume you have a three-broker installation of Apache Kafka that is accepting streaming data from a hospital. This data contains patient information, which includes PII and PHI. An external system publishes the data to your Apache Kafka brokers. The brokers receive the data and store it in topics, and a downstream system consumes from Apache Kafka, analyzes the text to compute statistics, and persists the results of the analysis to a database. Although this is a hypothetical scenario, it is an extremely common deployment architecture for distributed and streaming technologies.

Now ask yourself the questions we mentioned previously. How do we keep the PII and PHI in our streaming data secure? The downstream processor does not care about the PII and PHI since it is only aggregating statistics, yet letting it process data containing PII and PHI puts the system at risk of inadvertent HIPAA violations by enlarging the perimeter of the system that handles PII and PHI. Removing the PII and PHI from the streaming data before it is consumed by the downstream processor helps keep the data safe and the system in compliance.

Philter finds and removes sensitive information from text. Learn more about Philter.

Filtering the Sensitive Information from the Streaming Data

There are a couple of things we can do to remove the PII and PHI from the streaming data before it reaches the downstream processor.

The first option is to use Apache Kafka Streams or an Apache Pulsar Function (depending on which platform you are running) to consume the data, filter out the PII and PHI, and publish the filtered text back to Kafka or Pulsar on a different topic. Then update the downstream processor to consume from the new topic. The raw data from the hospital containing PII and PHI stays in its own topic. You can use Apache Kafka ACLs to prevent someone from inadvertently consuming from the raw topic and only permit consumption from the filtered topic. If, however, the mere presence of the raw data containing PII and PHI on the brokers is a concern, continue on to option two below.
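
To make the consume-filter-publish pattern concrete, here is a rough sketch using a plain Kafka consumer and producer in Python (the kafka-python and requests libraries) rather than Kafka Streams, which is a Java library. The topic names, broker addresses, and Philter endpoint are placeholders for illustration; for option one the source and destination brokers are the same cluster, while for option two (described next) the producer would point at the second cluster's brokers.

# A minimal sketch of consume -> filter with Philter -> publish to a filtered topic.
# Broker addresses, topic names, and the Philter endpoint are placeholders.
import requests
from kafka import KafkaConsumer, KafkaProducer

SOURCE_BOOTSTRAP = "raw-broker-1:9092"   # cluster holding the raw topic
DEST_BOOTSTRAP = "raw-broker-1:9092"     # same cluster (option one) or a second cluster (option two)
PHILTER_URL = "https://philter:8080/api/filter"

consumer = KafkaConsumer("raw-text",
                         bootstrap_servers=SOURCE_BOOTSTRAP,
                         group_id="philter-filter",
                         value_deserializer=lambda v: v.decode("utf-8"))
producer = KafkaProducer(bootstrap_servers=DEST_BOOTSTRAP,
                         value_serializer=lambda v: v.encode("utf-8"))

for message in consumer:
    # Send the raw text to Philter and receive the filtered (redacted) text back.
    response = requests.post(PHILTER_URL,
                             data=message.value,
                             headers={"Content-type": "text/plain"},
                             verify=False)  # use a signed certificate in production
    producer.send("filtered-text", response.text)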

The second option is to utilize a second Apache Kafka or Apache Pulsar cluster placed between the existing cluster and the downstream processor. Create an application that consumes from the topic on the first cluster's brokers, removes the PII and PHI, and publishes the filtered data to a topic on the new brokers. (You can use something like Apache Flink to process the data. At the time of writing, Kafka Streams cannot be used here because it requires the source and destination brokers to be the same cluster.) In this option, the sensitive data is physically separated from the rest of the data by residing on its own brokers.

Which option is best for you depends on your requirements around processing and security. In some cases separate brokers may be overkill; in others they may be the best option because of the physical boundary they create between the raw data and the filtered data.

Philter's Custom Dictionary Filter and "Fuzziness"

Philter finds sensitive information in text based on a set of filters that you configure in a filter profile. Some of these filters are for predefined types of information like SSNs, phone numbers, and names. But sometimes you also have a list of terms specific to your use case that you want to identify. Philter's custom dictionary filter lets you specify a list of terms to label as sensitive information when they are found in your text.

You can learn more about the custom dictionary filter and all of its properties in the Philter User's Guide.

Philter 1.6.0 adds a new property called "fuzzy" to the custom dictionary filter. The "fuzzy" property accepts a value of true or false. When set to false, the text being processed must match an item in the dictionary exactly for that text to be labeled as sensitive information. When set to true, the text does not have to match exactly: misspellings and typos can be present and the text will still be labeled as sensitive information. In this blog post we dive a little deeper to explain how the "fuzziness" works, how it is applied, and the trade-offs of using it.

Also new in Philter 1.6.0 is the ability to provide the custom dictionary filter with a path to a file that contains the terms. This way you don't have to include your terms directly in the filter profile.

Sample Filter Profile

To start, here's a simple filter profile that includes a custom dictionary filter. The dictionary contains three terms (john, jane, doe), and fuzziness is enabled with medium sensitivity. When any of those terms are found, they will be redacted with the pattern {{{REDACTED-%t}}}, where %t is replaced by the filter type, which in this case is custom-dictionary.

{
   "name": "dictionary-example",
   "identifiers": {
      "dictionaries": [
         "customDictionary": {
            "terms": ["john", "jane", "doe"],
            "fuzzy": true,
            "sensitivity": "medium",
            "customDictionaryFilterStrategies": [
               {
                  "strategy": "REDACT",
                  "redactionFormat": "{{{REDACTED-%t}}}"
               }
            ]
         }
      ]
   }   
}
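
As a quick usage sketch, here is how text could be sent to Philter against this profile using Python. It assumes a Philter instance listening on localhost:8080 and that the filter profile is selected with a "p" query parameter; check the Philter User's Guide for the exact parameter name and API details.

# A minimal sketch of filtering text against the dictionary-example profile.
# The host, port, and "p" query parameter are assumptions for illustration.
import requests

text = "Appointment notes for Jhon and Jane Doe."

response = requests.post("https://localhost:8080/api/filter",
                         params={"p": "dictionary-example"},  # filter profile name (assumed parameter)
                         data=text,
                         headers={"Content-type": "text/plain"},
                         verify=False)  # use a signed certificate in production

# The dictionary terms (and, depending on the sensitivity, close misspellings such
# as "Jhon") should come back replaced with {{{REDACTED-custom-dictionary}}}.
print(response.text)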

No fuzziness

We will start by describing what happens when the "fuzzy" property is set to false. This is the default behavior and is consistent with how Philter behaved prior to version 1.6.0. Items in the custom dictionary have to be found in the text exactly as they are in the dictionary. This means "John" is not the same as "Jon."

Disabling fuzziness is more efficient and will provide better performance. That's really all you need to know. But if you like getting into the details of things, read on! Internally, Philter uses an algorithm based on what's known as a bloom filter to efficiently scan a dictionary for matches. A bloom filter "is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set." In this case, the set is your list of terms in the dictionary and an element is each word from the input text. The bloom filter provides an efficient means of determining whether or not a given word is a term in your dictionary that you want identified as sensitive information.

A digression into bloom filters

Just to clarify, when we talk about Philter we talk a lot about "filters," such as a filter for SSNs, a filter for phone numbers, and so on. A bloom filter is not a filter like that. A bloom filter is a data structure that provides an efficient means of asking the question "Does this item potentially exist in this dictionary?" A bloom filter will answer "yes, it might" or "no, it does not." Notice the response of "yes, it might." The bloom filter is not saying "yes" definitively. It's then up to the programmer to find out definitively whether that item exists in the dictionary. That's essentially how a bloom filter works.
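
To make that behavior concrete, here is a toy, hand-rolled bloom filter in Python (an illustration only, not Philter's implementation) that shows the "yes, it might" / "no, it does not" responses followed by the definitive dictionary check:

# A toy bloom filter to illustrate the idea; not Philter's actual implementation.
import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, term):
        # Derive several bit positions from the term using salted hashes.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{term}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, term):
        for pos in self._positions(term):
            self.bits[pos] = True

    def might_contain(self, term):
        # False means "no, it does not"; True only means "yes, it might."
        return all(self.bits[pos] for pos in self._positions(term))

dictionary = {"john", "jane", "doe"}
bloom = BloomFilter()
for term in dictionary:
    bloom.add(term)

for word in ["jane", "jon", "patient"]:
    if bloom.might_contain(word):
        # The bloom filter only says "maybe," so confirm against the real dictionary.
        print(word, "->", "match" if word in dictionary else "false positive")
    else:
        print(word, "->", "definitely not in the dictionary")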

Yes, fuzziness!

Enabling fuzziness on a custom dictionary filter works differently. As Philter scans the input text, it not only considers the words or phrases themselves, but Philter also considers derivations of the words and phrases. When fuzziness is enabled, "John" may be the same as "Jon." Enabling fuzziness by setting the "fuzzy" property to true can be useful when you are concerned about misspellings or different spellings of terms in your text.

You can control the level of acceptable fuzziness by setting the "sensitivity" property, as shown in the filter profile above. Valid values are "low", "medium", and "high". The difference between "Jon" and "John" is considered low, while the difference between "Jon" and "Johnny" is considered high. You can use the sensitivity setting to find a level of fuzziness appropriate for your custom dictionary and your text. The default sensitivity when not specified is "high".
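
Philter's documentation does not spell out the exact string-distance metric behind these levels, but one way to build an intuition is Levenshtein edit distance: the number of single-character edits needed to turn one word into another. The sketch below is an illustration only, not Philter's internal logic.

# Illustrative only: a plain Levenshtein edit distance, not Philter's internal metric.
def edit_distance(a, b):
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]

print(edit_distance("jon", "john"))    # 1 edit: a small difference ("low")
print(edit_distance("jon", "johnny"))  # 3 edits: a larger difference ("high")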

An important distinction is that, currently, when fuzziness is disabled the custom dictionary can only contain single words. Phrases are not permitted as dictionary terms in Philter 1.6.0 but will be allowed in the upcoming version 1.7.0. The internals of that change are interesting enough for their own blog post!

Summary

To summarize:

  • Setting fuzzy to false (the default setting) for the custom dictionary filter provides better performance, but terms in the custom dictionary must match exactly and only single words (not phrases) are allowed in the dictionary.
  • Setting fuzzy to true allows the custom dictionary filter to identify misspellings and different spellings of terms in the custom dictionary at the cost of performance. Use the sensitivity values of low, medium, and high to control the allowed level of fuzziness.

Using AWS Kinesis Firehose Transformations to Filter Sensitive Information from Streaming Text

  • Updated 07/12/2020 to include a link to a similar solution using log4j and Apache Kafka.
  • Updated 05/20/2020 to include a link to running Philter as a container and a link to the solution example.
  • Updated 04/28/2020 to include a link to CloudFormation and Terraform scripts and link to using a signed certificate with Philter.

AWS Kinesis Firehose is a managed streaming service designed to take large amounts of data from one place to another. For example, you can take data from places such as CloudWatch, AWS IoT, and custom applications using the AWS SDK to places such as Amazon S3, Amazon Redshift, Amazon Elasticsearch, and others. In this post we will use S3 as the firehose's destination.

In some cases you may need to manipulate the data as it goes through the firehose to remove sensitive information. In this blog post we will show how AWS Kinesis Firehose and AWS Lambda can be used in conjunction with Philter to remove sensitive information (PII and PHI) from the text as it travels through the firehose.

Click here for a similar solution using log4j and Apache Kafka to remove sensitive information from application logs.

Prerequisites

You must have a running instance of Philter. If you don't already have one, you can launch Philter through the AWS Marketplace or as a container. There are CloudFormation and Terraform scripts for launching a single instance of Philter or a load-balanced, auto-scaled set of Philter instances.

Philter does not have to run in AWS, but it must be accessible from your AWS Lambda function. Running Philter and your AWS Lambda function in your own VPC allows the function to communicate with Philter locally.

Setting up the AWS Kinesis Firehose Transformation

There is no need to duplicate an excellent blog post on creating a Firehose Data Transformation with AWS Lambda. Instead, refer to the linked page and substitute the Python 3 code below for the code in that blog post.

Configuring the Firehose and the Lambda Function

To start, create an AWS Firehose and configure an AWS Lambda transformation. When creating the AWS Lambda function, select Python 3.7 and use the following code:

# The Python 3.7 Lambda runtime bundled a vendored copy of requests with botocore.
# On newer runtimes, package the requests library with your function instead.
from botocore.vendored import requests
import base64

def handler(event, context):

    output = []

    for record in event['records']:

        # Each Firehose record's data is base64 encoded.
        payload = base64.b64decode(record["data"])

        # Send the raw text to Philter and get the filtered text back.
        headers = {'Content-type': 'text/plain'}
        r = requests.post("https://PHILTER_IP:8080/api/filter", verify=False, data=payload, headers=headers, timeout=20)
        filtered = r.text

        # Return the filtered text to the firehose, base64 encoded.
        output_record = {
            'recordId': record['recordId'],
            'result': 'Ok',
            'data': base64.b64encode(filtered.encode('utf-8') + b'\n').decode('utf-8')
        }
        output.append(output_record)

    return output

The following Kinesis Firehose test event can be used to test the function:

{
  "invocationId": "invocationIdExample",
  "deliveryStreamArn": "arn:aws:kinesis:EXAMPLE",
  "region": "us-east-1",
  "records": [
    {
      "recordId": "49546986683135544286507457936321625675700192471156785154",
      "approximateArrivalTimestamp": 1495072949453,
      "data": "R2VvcmdlIFdhc2hpbmd0b24gd2FzIHByZXNpZGVudCBhbmQgaGlzIHNzbiB3YXMgMTIzLTQ1LTY3ODkgYW5kIGhlIGxpdmVkIGF0IDkwMjEwLiBQYXRpZW50IGlkIDAwMDc2YSBhbmQgOTM4MjFhLiBIZSBpcyBvbiBiaW90aW4uIERpYWdub3NlZCB3aXRoIEEwMTAwLg=="
    },
    {
      "recordId": "49546986683135544286507457936321625675700192471156785154",
      "approximateArrivalTimestamp": 1495072949453,
      "data": "R2VvcmdlIFdhc2hpbmd0b24gd2FzIHByZXNpZGVudCBhbmQgaGlzIHNzbiB3YXMgMTIzLTQ1LTY3ODkgYW5kIGhlIGxpdmVkIGF0IDkwMjEwLiBQYXRpZW50IGlkIDAwMDc2YSBhbmQgOTM4MjFhLiBIZSBpcyBvbiBiaW90aW4uIERpYWdub3NlZCB3aXRoIEEwMTAwLg=="
    }    
  ]
}

This test event contains two records, each with base64-encoded data that decodes to "He lived in 90210 and his SSN was 123-45-6789." When the test is executed the response will be:

[
  "He lived in {{{REDACTED-zip-code}}} and his SSN was {{{REDACTED-ssn}}}.",
  "He lived in {{{REDACTED-zip-code}}} and his SSN was {{{REDACTED-ssn}}}."
]

When executing the test, the AWS Lambda function extracts the data from each record in the firehose and submits it to Philter for filtering. The filtered responses are returned from the function as a JSON list. Note that in our Python function we are ignoring Philter's self-signed certificate. It is recommended that you use a valid signed certificate with Philter.

Now, when data is published to the Kinesis Firehose stream, it will be processed by the AWS Lambda function and Philter before it reaches the firehose's configured destination.

Processing Data

We can use the AWS CLI to publish data to our Kinesis Firehose stream called sensitive-text:

aws firehose put-record --delivery-stream-name sensitive-text --record '{"Data":"He lived in 90210 and his SSN was 123-45-6789."}'
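
If you prefer the AWS SDK, the equivalent call with boto3 looks like the sketch below (it assumes your credentials and region are already configured):

# Publish a record to the sensitive-text delivery stream using boto3.
import boto3

firehose = boto3.client("firehose")

firehose.put_record(
    DeliveryStreamName="sensitive-text",
    Record={"Data": b"He lived in 90210 and his SSN was 123-45-6789.\n"}
)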

Check the destination S3 bucket and you will have a single object with the following line:

He lived in {{{REDACTED-zip-code}}} and his SSN was {{{REDACTED-ssn}}}.

Conclusion

In this blog post we created an AWS Kinesis Firehose pipeline that uses an AWS Lambda function and Philter to remove PII and PHI from the streaming text.

Jeff Zemerick is the founder of Mountain Fog. He is a 10x certified AWS engineer, current chair of the Apache OpenNLP project, and experienced software engineer.

You can contact Jeff at jeff.zemerick@mtnfog.com or on LinkedIn.