Protecting Sensitive Information in Streaming Platforms

Streaming platforms like Apache Kafka and Apache Pulsar provide wonderful capabilities around ingesting data. With these platforms we can build all types of solutions across many industries from healthcare to IoT and everything in between. Inevitably, the problem arises of how to deal with sensitive information that resides in the streaming data. Questions such as how do we make sure that data never crosses a boundary, how do we keep that data safe, and how can we remove the sensitive information from the incoming data so we can continue processing the data? These are all very good questions to ask and in this post we present a couple architectures to address those questions and help maintain the security of your streaming data. These architectures along with Philter can help protect the sensitive information in your streaming data.

Whether you are using Apache Kafka, Apache Pulsar, or some other streaming platform is largely irrelevant. Each of these platforms are largely built on top of the same concepts and even share quite a bit of terminology, such as brokers and topics. (A broker is a single instance of Kafka or Pulsar and a topic is how the streaming data is organized when it reaches the broker.)

Streaming Healthcare Data

Let's assume you have an architecture where you have a 3 broker installation of Apache Kafka that is accepting streaming data from a hospital. This data contains patient information which has PII and PHI. An external system is publishing data to your Apache Kafka brokers. The brokers receive the data, store it in topics, and a downstream system consumes from Apache Kafka and processes the statistics of the data by analyzing the text and persisting the results of the analysis into a database. Even though this is a hypothetical scenario it is an extremely common deployment architecture around distributed and streaming technologies.

Now you ask yourself those questions we mentioned previously. How to keep the PII and PHI secure in our streaming data? Your downstream processor does not care about the PII and PHI since it is only aggregating statistics. Having those downstream systems process the data containing PII and PHI puts our system at risk of inadvertent HIPAA violations by enlarging the perimeter of the system containing PII and PHI. Removing the PII and PHI from the streaming data before it gets consumed by the downstream processor would help keep the data safe and our system in compliance.

Philter Philter finds and remove sensitive information from text. Learn more about Philter.

Filtering the Sensitive Information from the Streaming Data

There's a couple things we can do to remove the PII and PHI from the streaming data before it gets to the downstream processor.

The first option is to use Apache Kafka Streams or an Apache Pulsar Function (depending on which you are running) to consume the data, filter out the PII and PHI, and publish the filtered text back to Kafka or Pulsar on a different topic. Now update the name of the topic the downstream processor consumes from. The raw data  from the hospital containing PII and PHI will stay in its own topic. You can use Apache Kafka ACLs on the topics to help prevent someone from inadvertently consuming from the raw topic and only permit them to consume from the filtered topic. If, however, the idea of the raw data containing PII and PHI existing on the brokers is a concern then continue on to option two below.

The second option is to utilize a second Apache Kafka or Apache Pulsar cluster. Place this cluster in between the existing cluster and the downstream processor. Create an application to consume from the topic on the first brokers, remove the PII and PHI, and then publish the filtered data to a topic on the new brokers. (You can use something like Apache Flink to process the data. At the time of writing, Kafka Streams cannot be used because the source brokers and the destination brokers must be the same.) In this option, the sensitive data is physically separated from the rest of the data by residing on its own brokers.

Which option is best for you depends on your requirements around processing and security. In some cases, separate brokers may be overkill. But in other cases it may be the best option due to the physical boundary it creates between the raw data and the filtered data.

Philter Philter finds and remove sensitive information from text. Learn more about Philter.