Filter Sensitive Information from Streaming Text

Phirestream removes sensitive information from Apache Kafka streams.

Phirestream FAQ

Frequently asked questions about Phirestream. For questions not answered here, please contact us. To get started with Phirestream, get in touch.

What is Phirestream?

The goal of Phirestream is to keep sensitive information from entering your Apache Kafka topics and downstream pipelines and applications.

Phirestream is an application that filters sensitive information from streaming text prior to that text being published to Apache Kafka. Phirestream works by acting as a proxy to Apache Kafka. Phirestream processes the published text to redact, remove, or encrypt the types of sensitive information you have defined in Phirestream’s settings. Phirestream then publishes the filtered text to your Apache Kafka brokers.
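The filter-then-publish flow described above can be sketched as follows. This is a minimal illustration, not Phirestream's implementation: the email regular expression, the `{{REDACTED-email}}` replacement text, the function names, and the in-memory `publish` callback (standing in for a Kafka producer) are all assumptions made for the example.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical detector: a simple email pattern standing in for real detection.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def filter_text(text: str) -> str:
    """Redact email addresses; a stand-in for the real filtering step."""
    return EMAIL.sub("{{REDACTED-email}}", text)


def proxy_publish(topic: str, text: str, publish: Callable[[str, str], None]) -> None:
    """Filter the text first, then pass only the filtered text on to the broker client."""
    publish(topic, filter_text(text))


# In-memory stand-in for a Kafka producer so the sketch runs without a broker.
sent: List[Tuple[str, str]] = []
proxy_publish(
    "support-tickets",
    "Contact jane.doe@example.com today.",
    lambda topic, message: sent.append((topic, message)),
)
print(sent)
```

The point of the design is that the unfiltered text never reaches the `publish` call, so nothing downstream of the proxy sees it.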

Why is sensitive information in streaming text a problem?

Sensitive information in streaming text is a liability. If downstream applications do not need it, its presence is an unnecessary security risk. Using Phirestream to keep sensitive information from ever entering your Apache Kafka topics helps keep the cluster and the downstream applications secure.

What types of sensitive information can Phirestream identify?

Phirestream can identify the following types of sensitive information in addition to custom-defined types.

Ages, Bitcoin Addresses, Cities, Counties, Credit Cards, Custom Dictionaries, Custom Identifiers (medical record numbers, financial transaction numbers), Dates, Driver’s License Numbers, Email Addresses, IBAN Codes, IP Addresses, MAC Addresses, Passport Numbers, Persons’ Names, Phone/Fax Numbers, SSNs and TINs, Shipping Tracking Numbers, States, URLs, VINs, Zip Codes

How does Phirestream know what kinds of sensitive information to find?

A filter profile is a configuration file in which you list the types of sensitive information you want Phirestream to act upon. Each Apache Kafka topic can have its own filter profile to provide control over how different data is redacted.
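As a rough illustration of per-topic filter profiles, the sketch below maps topics to named profiles. It does not show Phirestream's actual configuration file format. The `FilterProfile` class, the type names, the action names, and the fallback to a `default` profile are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FilterProfile:
    """Hypothetical in-memory view of a filter profile."""
    name: str
    # Type of sensitive information -> action to take on it.
    actions: Dict[str, str] = field(default_factory=dict)


PROFILES = {
    "default": FilterProfile("default", {"email-address": "redact", "ssn": "redact"}),
    "billing": FilterProfile("billing", {"credit-card": "encrypt", "ssn": "redact"}),
}

# Topic -> profile name. Falling back to "default" is an assumption of this sketch.
TOPIC_PROFILES = {"payments": "billing"}


def profile_for_topic(topic: str) -> FilterProfile:
    """Look up the profile assigned to a topic, or the default profile."""
    return PROFILES[TOPIC_PROFILES.get(topic, "default")]


print(profile_for_topic("payments").actions)
print(profile_for_topic("chat-logs").name)
```

Here the `payments` topic encrypts credit card numbers, while any topic without an explicit assignment uses the default profile.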

How do I deploy Phirestream?

Phirestream can be deployed in AWS, Azure, and Google Cloud with just a few clicks. Click here to deploy Phirestream in your cloud.

For on-premises deployments of Phirestream, please contact us.

Is Phirestream guaranteed to find 100% of all sensitive information in my text?

No. Phirestream uses state-of-the-art natural language processing (NLP) technology to identify sensitive information in text. These NLP methods use models trained on a large corpus of text. Applying a model to text is a statistical process, not an exact one, so no result is guaranteed. Many factors can affect the identification of sensitive information in your text, such as how similar your text is to the corpus used to train the model, how the text is formatted, and the length of the text. For these reasons, it is important to assess Phirestream’s performance on your own text before using it in a production system.

The confidence value in the filter strategy condition can be used to tune the NLP engine’s detection. Each identified entity has an associated confidence score between 0 and 100 indicating the model’s estimate that the text is actually an entity, with 0 being the lowest confidence and 100 the highest. The condition lets you ignore entities based on that score. For example, the condition confidence > 75 means that entities with a confidence value of 75 or less will be ignored, and entities with a confidence value greater than 75 will be filtered from the text.
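The effect of a condition such as confidence > 75 can be sketched as follows. The `Entity` class and the `entities_to_filter` function are hypothetical names used only for this illustration. The comparison is strict, so an entity scoring exactly 75 is ignored.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entity:
    """Hypothetical detected entity with a confidence score from 0 to 100."""
    text: str
    confidence: float


def entities_to_filter(entities: List[Entity], threshold: float = 75) -> List[Entity]:
    """Keep only entities whose confidence is strictly greater than the threshold."""
    return [e for e in entities if e.confidence > threshold]


detected = [
    Entity("George Washington", 92.0),
    Entity("Paris", 75.0),   # exactly at the threshold: ignored
    Entity("May", 41.5),     # low confidence: ignored
]
selected = entities_to_filter(detected)
print([e.text for e in selected])
```

Raising the threshold reduces false positives at the cost of missing more real entities. Lowering it does the opposite.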

 

Apache, Apache Kafka, Kafka, and associated open source project names are trademarks of the Apache Software Foundation.