Philter Price Reduction

We are thrilled to announce a price reduction for Philter on the cloud marketplaces. Philter pricing now starts at $0.49/hr, down from $0.79/hr. Additionally, where applicable, the annual pricing has been reduced accordingly as well. The pricing is tiered depending on the size of the compute instance running Philter.

The price change will take effect across the AWS, Azure, and Google Cloud marketplaces in the coming week. This pricing change will not affect support or managed services.

We would like to take a moment to thank our users for helping to make this price reduction possible. It is because of our supportive users that we are able to make this change.


Philter 1.8.0

Philter 1.8.0 has been released.

This version brings:

  • The ability to capture timing metrics for each of the filter types. Capturing these metrics will provide insights into the performance of the filters.
  • The ability to specify terms to ignore in files for each filter type. Previously, lists of ignored terms had to be specified in the filter profile itself. Specifying the terms to ignore in files outside the filter profile allows for cleaner, easier-to-manage filter profiles. (A sketch follows this list.)
  • A single Docker container. Prior to 1.8.0, Philter was composed of two Docker containers that had to be deployed together. Philter 1.8.0 now consists of only a single Docker container. This should help remove some of the complexity around deploying Philter in a containerized environment. We have also migrated Philter's Docker images from DockerHub to our own Docker registry. Learn more about the Philter containers.
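
As a rough sketch of the ignore-files feature in the second item above, a filter profile might reference an external file of ignored terms like this. The property names are illustrative, not Philter's exact schema; see the User's Guide for the real format.

{
  "name": "example-profile",
  "identifiers": {
    "ssn": {
      "ignoredFiles": ["/opt/philter/ignored-terms.txt"]
    }
  }
}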

Philter 1.8.0 is now available for deployment from the cloud marketplaces and as a Docker container.

Launch Philter in your cloud. See Philter's full Release Notes.


Philter 1.7.0

We are happy to announce that Philter 1.7.0 has been released and is currently being published to DockerHub and the AWS, Azure, and Google Cloud marketplaces. Look for it to be available for deployment into your cloud in the next couple of days.

Click here to deploy Philter in your cloud of choice!

Philter finds and removes sensitive information, such as PII and PHI, in text. Philter can be integrated with virtually any platform, such as Apache Kafka, Apache Flink, Apache NiFi, Apache Pulsar, and Amazon Kinesis. Philter can redact, replace, encrypt, and hash sensitive information.

Philter can currently identify: Ages, Bitcoin Addresses, Cities, Counties, Credit Cards, Custom Dictionaries, Custom Identifiers (medical record numbers, financial transaction numbers), Dates, Driver's License Numbers, Email Addresses, IBAN Codes, IP Addresses, MAC Addresses, Passport Numbers, Persons' Names, Phone/Fax Numbers, SSNs and TINs, Shipping Tracking Numbers, States, URLs, VINs, and Zip Codes.

Learn more about Philter.

Philter versions currently available on the cloud marketplaces:

  • Launch Philter on Google Cloud (1.8.0)
  • Launch Philter on AWS (1.8.0)
  • Launch Philter on Azure (1.7.0)

What's New in Philter 1.7.0?

Philter 1.7.0 brings a new experimental feature that breaks large text into smaller pieces for more efficient processing. The feature is described below. Because it is experimental, we welcome and encourage your feedback but caution that it may undergo major changes in future versions.

Some of the changes and new features in Philter 1.7.0 are described below. Refer to the Release History for a full list of changes.

Automatically Splitting Input Text

Philter 1.7.0 brings a new experimental feature that breaks long input text into pieces and processes each piece individually. After processing, Philter combines the individual results into a single response back to the client. The purpose of this feature is to allow Philter to better handle long input text.

What is a "long" input text can depend on several factors, such as the hardware running Philter, the network, and the density of sensitive information in the text. Because of this, you have some control over how Philter breaks long text into separate pieces. You can choose between two methods of splitting. The first method splits the text based on the locations of new line characters in the text. The second method splits the text into individual lines of nearly equal length.

The alternative to letting Philter split the text is to split it yourself client-side before sending it to Philter. When splitting client-side you have full control over how the text is split. On the flip side, you also have to handle the individual response for each piece, something Philter handles for you when you delegate the splitting to it.

Input text splitting is enabled and configured per filter profile, so some text can be split and other text left intact depending on the filter profile chosen for the text.
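
As a minimal sketch, enabling splitting in a filter profile might look like the following. The property names and method value here are assumptions for illustration only.

{
  "name": "profile-with-splitting",
  "splitter": {
    "enabled": true,
    "method": "newline"
  }
}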

See Philter's User's Guide for how to configure splitting in a filter profile.

If you use this feature, please send us feedback; we are looking to improve it in future versions. The User's Guide has more details.

Reporting Metrics via Prometheus

Philter already supported metrics reporting via JMX, Amazon CloudWatch, and Datadog. Philter 1.7.0 adds support for monitoring Philter's metrics via Prometheus. When enabled, Philter exposes an HTTP endpoint suitable for scraping by Prometheus. See Philter's Settings for details on how to enable the Prometheus metrics. Look for a separate blog post soon that dives into monitoring Philter's metrics with Prometheus.
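
On the Prometheus side, scraping the endpoint is a standard scrape job. A minimal prometheus.yml sketch, assuming Philter exposes metrics at a /metrics path on its API port (both are assumptions here; verify them in Philter's Settings):

scrape_configs:
  - job_name: 'philter'
    metrics_path: '/metrics'
    scheme: 'https'
    tls_config:
      insecure_skip_verify: true
    static_configs:
      - targets: ['philter-host:8080']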

Smaller AWS EBS Volume

The EBS volume size for Philter 1.7.0 has been reduced from 20 GB to 8 GB, cutting the monthly cost by $1.20 per instance by requiring a smaller SSD volume. The savings may seem trivial, but when multiple Philter instances are deployed they add up.

Other Changes

Other new features in Philter 1.7.0 include:

  • Terms can now be ignored based on regular expression patterns. Previously, ignored terms had to match exactly; now you can specify terms to ignore via regular expression patterns. An example use of this feature is to ignore non-sensitive values that change, such as timestamps in log messages. (A sketch follows this list.)
  • Added ability to read ignored terms from files outside of the filter profile.
  • Custom dictionary terms can now be phrases or multi-term keywords.
  • Added “classification” condition to Identifier filter to allow for writing conditionals against the classification value.
  • Added configurable timeout values to allow for modifying timeouts of internal Philter communication. This can help when processing larger amounts of text. See the Settings for more information.
  • Added option to IBAN Code filter to allow spaces in the IBAN codes.
  • Ignore lists for individual filters are no longer case-sensitive. ("John" will be ignored if the list contains "JOHN.")
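
As a sketch of the regular-expression ignores from the first item above, a filter profile might carry a pattern that ignores timestamps such as 08:43:10.066. The property names are illustrative, not Philter's exact schema; see the User's Guide for the real format.

{
  "identifiers": {
    "ssn": {
      "ignored": ["123-45-6789"],
      "ignoredPatterns": ["\\d{2}:\\d{2}:\\d{2}\\.\\d{3}"]
    }
  }
}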

Preventing PII and PHI from Leaking into Application Logs

Introduction

This blog post demonstrates how to use Philter to find and remove sensitive information from application logs. In this post we use log4j and Apache Kafka but the concepts can be applied to virtually any logging framework and streaming system, such as Amazon Kinesis. (See this post for a similar solution using Kinesis Firehose Transformations to filter sensitive information from text.)

There is a sample project for this blog post available here on GitHub that you can copy and adapt. The code demonstrates how to publish log4j log messages to Apache Kafka which is required for the logs to then be consumed and filtered by Philter as described in this post.

PII and PHI in application logs

Development on a system that will contain personally identifiable information (PII) or protected health information (PHI) can be a challenge. Encrypting data in motion and data at rest and maintaining audit logs are just a few of the requirements that have to be considered.

One part of an application's development that's easy to disregard is application logging. Logs make our lives easier. They help developers find and fix problems and give insight into what our applications are doing at any given time. But even the most innocuous-seeming log messages can turn out to contain PII or PHI at runtime. For example, when an application error is logged, some piece of the logged message could inadvertently contain PII or PHI.

Having a well-defined set of logging rules when developing these applications is important. Developers should log opaque user IDs rather than user names or other identifying values. Pull request reviews should check those rules as well to catch anything that might have been missed. But even then, can we be sure that there will never be any PII or PHI in our applications' logs?

Philter and application logs

For more confidence, you can process your applications' logs with Philter. In this post we use Java and the log4j framework, but nearly all modern programming languages have similar logging frameworks. If you are developing on .NET you can use log4net and one of the third-party Apache Kafka appenders available through NuGet.

We are going to modify our application's log4j logging configuration to publish the logs to Apache Kafka. Philter will consume the logs from Apache Kafka and the filtered logs will be persisted to disk. So, to implement the code in this post you will need at least one running Apache Kafka broker and an instance of Philter. Some good music never hurts either.

Spin up Kafka in a Docker container:

curl -O https://raw.githubusercontent.com/confluentinc/cp-all-in-one/5.5.1-post/cp-all-in-one/docker-compose.yml
docker-compose up

Spin up Philter in a Docker container (request a free license key):

curl -O https://raw.githubusercontent.com/mtnfog/philter/master/docker-compose.yml
docker-compose up

Create a philter topic:

docker-compose exec broker kafka-topics --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic philter

Now we're good to go.

In a later post we will accomplish the same filtering of PII and PHI from application logs, but with Apache Pulsar instead of Apache Kafka. Apache Pulsar has the concept of Functions, which makes manipulating streaming data with Philter very easy. Stay tuned for that post!

Logging to Apache Kafka

Using log4j to output logs to Apache Kafka is very easy. Through the KafkaAppender we can configure log4j to publish the logs to an Apache Kafka topic. Here's an example of how to configure the appender:

<?xml version="1.0" encoding="UTF-8"?>
...
<Appenders>
  <Kafka name="Kafka" topic="philter">
    <PatternLayout pattern="%date %message"/>
    <Property name="bootstrap.servers">localhost:9092</Property>
  </Kafka>
</Appenders>

With this appender in our log4j2.xml file, our application's logs will be published to the Kafka topic called philter. The Kafka broker is running at localhost:9092. Great! Now we're ready to have Philter consume from the topic to remove PII and PHI from the logs.
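
For completeness, here is a minimal sketch of the application side, assuming log4j-core (with the log4j2.xml above) and the Kafka client are on the classpath. The class name and message mirror the sample project's output shown later in this post:

package com.mtnfog;

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class App {

    private static final Logger logger = LogManager.getLogger(App.class);

    public static void main(String[] args) {
        // With the KafkaAppender configured above, this message is published
        // to the philter topic rather than written to a local file.
        logger.info("This is a sample log message 99. Beginning processing for patient 123-45-6789.");
    }
}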

Using Philter to remove PII and PHI from the logs

To consume from the Kafka topic we are going to use the kafkacat utility. This is a nifty little tool that makes some Kafka operations really easy from the command line. (You could also use the standard kafka-console-consumer.sh script that comes with Kafka.) The following command consumes from the philter topic and writes the messages to standard out.

kafkacat -b localhost:9092 -t philter

Instead of writing the messages to standard out, we need to send them to Philter so that any PII or PHI in them can be removed. To do that we pipe the text to the Philter CLI, which sends the text to Philter and writes the filtered text to standard out, where it can be redirected to a file such as filtered.log.

kafkacat -C -b localhost:9092 -t philter -q -f '%s\n' -c 1 -e | ./philter-cli-linux-amd64 -i -h https://localhost:8080

This command uses kafkacat to consume from our philter topic. In this command we are telling kafkacat to be quiet (-q) and not produce any extraneous output, to display only the message itself (-f), to consume a single message (-c 1), and to exit (-e) after doing so. The result is that a single message is consumed from Kafka and sent to Philter using the philter-cli, and the filtered text is written to the console.

The flow of the messages

Our example project logs 100 messages to the Kafka topic. Each message looks like the following, where 99 is just an incrementing integer:

08:43:10.066 [main] INFO com.mtnfog.App - This is a sample log message 99. Beginning processing for patient 123-45-6789.

The output from Philter after processing the message is:

08:43:10.066 [main] INFO com.mtnfog.App - This is a sample log message 99. Beginning processing for patient {{{REDACTED-ssn}}}.

We can see that Philter identified 123-45-6789 as a social security number and redacted it from the text. This is a simple example, but depending on how Philter is configured (specifically its filter profile) we could have been looking for many other types of sensitive information, such as persons' names, unique identifiers, ages, dates, and so on.

We also just wrote the filtered log to standard out. We could have easily redirected it to a file, another Kafka topic, or somewhere else. And while we used the philter-cli to send the text to Philter, you could also use curl or one of the Philter SDKs.

Summary

In this blog post we showed how Philter can find and remove sensitive information from application log files using log4j's Kafka appender, Apache Kafka, and kafkacat. We hope this example is useful. A sample project to produce log messages to a Kafka topic is available on GitHub.


Jeff Zemerick is the founder of Mountain Fog. He is a 10x certified AWS engineer, current chair of the Apache OpenNLP project, and experienced software engineer.

You can contact Jeff at jeff.zemerick@mtnfog.com or on LinkedIn.

Philter 1.6.0

Philter 1.6.0 will be available soon through the cloud marketplaces and DockerHub. This is probably the most significant release of Philter since the first release, 1.0.0.

Version 1.6.0 has many new features and a few fixes. Instead of writing a single blog post for the entire release, we will highlight the new features below and then follow up over the next few days with posts that go more in-depth on each of them. Check out Philter's Release Notes.

Over the next few days we will be making updates to the Philter SDKs to accommodate the new features in Philter 1.6.0.

Deploy Philter

Philter versions currently available on the cloud marketplaces:

  • Launch Philter on Google Cloud (1.8.0)
  • Launch Philter on AWS (1.8.0)
  • Launch Philter on Azure (1.7.0)

New Features in Philter 1.6.0

The following are summaries of the new features added in Philter 1.6.0.

Alerts

The new alerts feature in Philter 1.6.0 causes Philter to generate an alert when a given filter condition is satisfied. For example, if you have a filter condition that matches only a person's name of "John Smith", Philter will generate an alert when that condition is satisfied. The alert is stored in Philter and can be retrieved and deleted using Philter's new Alerts API. Details of the alerts feature are in Philter's User's Guide.
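
As a rough sketch, interacting with the Alerts API might look like the following; the endpoint path and alert ID below are assumptions on our part, so check the User's Guide for the actual API.

# Retrieve the stored alerts (endpoint path is an assumption).
curl -k https://localhost:8080/api/alerts

# Delete an alert by its ID (hypothetical ID shown).
curl -k -X DELETE https://localhost:8080/api/alerts/ab12cd34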

Span Disambiguation

Sometimes a piece of sensitive information could be one of a few filter types, such as an SSN, a phone number, or a driver's license number. The span disambiguation feature works to determine which of the potential filter types is most appropriate by analyzing the context of the sensitive information. Philter uses various natural language processing (NLP) techniques to determine which filter type the sensitive information most closely resembles. Because of the techniques used, the more text Philter sees the more accurate the span disambiguation will become.

Span disambiguation is documented in Philter's User's Guide.

New Filters: Bitcoin Address, IBAN Codes, US Passport Numbers, US Driver's License Numbers

Philter 1.6.0 contains several new filter types:

  • Bitcoin Address - Identify bitcoin addresses.
  • IBAN Codes - Identify International Bank Account Numbers.
  • US Passport Numbers - Identify US passport numbers issued since 1981.
  • US Driver's License Numbers - Identify US driver's license numbers for all 50 states.

Each of these new filters is available through filter profiles.

New Replacement Strategy: SHA-256 with random salt values

We previously added the ability to encrypt sensitive information in text. Philter 1.6.0 adds the ability to hash sensitive information using SHA-256. When the hash replacement strategy is selected, each piece of sensitive text is replaced by the SHA-256 value of that text. Additionally, the hash replacement strategy has a "salt" property that, when enabled, causes Philter to append a random salt value to each piece of sensitive text prior to hashing. The random salt value will be included in the filter response.
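
To illustrate what salted SHA-256 hashing means, here is a small sketch of the concept in Java. This is only an illustration of the technique, not Philter's internal code.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.SecureRandom;

public class SaltedHashExample {

    public static void main(String[] args) throws Exception {
        String sensitive = "123-45-6789";

        // A random salt value, appended to the sensitive text prior to hashing.
        byte[] salt = new byte[16];
        new SecureRandom().nextBytes(salt);

        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] hash = digest.digest((sensitive + toHex(salt)).getBytes(StandardCharsets.UTF_8));

        // The hex-encoded hash is what would replace the sensitive text in the output.
        System.out.println(toHex(hash));
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}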

Custom Dictionary Filters Can Now Use an External Dictionary File

Philter's custom dictionary filter lets you specify a list of terms to identify as being sensitive. Prior to Philter 1.6.0, this list of terms had to be provided in the filter profile. With a long list it did not take long for the filter profile to become hard to read and even harder to manage. Now, instead of providing a list of terms in the filter profile you can simply provide the full path to a file that contains a list of terms. This keeps the filter profile compact and easier to manage. You can specify as many dictionary files as you need to and Philter will combine the terms when the filter profile is loaded.

Custom Dictionary Filters Now Have a "fuzzy" Property

Philter's custom dictionary filter previously always used fuzzy detection. (Fuzzy detection is like a spell checker: a misspelled name such as "Davd" can be identified as "David.") New in Philter 1.6.0 is a property on the custom dictionary filter called "fuzzy" that controls whether fuzzy detection is enabled. The property was added because disabling fuzzy detection gives a significant performance increase: when it is off, Philter uses an optimized data structure to identify the terms. If you do not need fuzzy detection, we recommend disabling it to take advantage of the performance gain.
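
A sketch of a custom dictionary filter combining the external dictionary file and the fuzzy property. The property names are illustrative, not Philter's exact schema; see the User's Guide for the real format.

{
  "identifiers": {
    "customDictionary": {
      "files": ["/opt/philter/terms.txt"],
      "fuzzy": false
    }
  }
}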

Changed "Type" to "Classification"

A few filter types had additional information that further described the sensitive information. For instance, the entity filter had a "type" property that identified the type of the entity, such as "PER" for person. We have renamed the property from "type" to "classification" for clarity and uniformity. Be sure to update your filter profiles if any of your filter conditions use "type"; it is a drop-in replacement, so you can simply change "type" to "classification."
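
Assuming a filter condition of the form used in filter profiles, the change is literally a one-word swap:

Before:  "condition": "type == \"PER\""
After:   "condition": "classification == \"PER\""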

Add Filter Condition for "Classification"

Philter 1.6.0 adds the ability to have a filter condition on "classification."

Redis Cache Can Now Use a Self-Signed SSL Certificate

Philter 1.6.0 can now connect to a Redis cache that is using a self-signed certificate. New configuration settings for the truststore and keystore allow for trusting the self-signed certificate.

Fixes and Improvements in Philter 1.6.0

The following is a list of fixes and improvements made in Philter 1.6.0.

Fixed Potential MAC Address Issue

We found and fixed a potential issue where a MAC Address might not be identified correctly.

Fixed Potential Ignore Issue with Custom Dictionary Filters

We found and fixed a potential issue where a term in a custom dictionary that is also a term in an ignore list might not be ignored correctly.

Fixed Potential Issue with Credit Card Number Validation

We found and fixed a potential issue where a credit card number might not be validated correctly. This only applies when credit card validation is enabled.



Philter Docker Containers

We are excited to announce that Philter can now be launched as Docker containers. Previously, Philter was only available through the AWS, Azure, and Google Cloud marketplaces. By making Philter available as containers, it can now easily be used outside those cloud platforms, in container orchestration tools such as Kubernetes, and on-premises. Philter finds, identifies, and removes sensitive information, such as PHI and PII, from natural language text.

Launching the Philter containers is easy:

curl -O https://raw.githubusercontent.com/mtnfog/philter/master/docker-compose.yml
docker-compose up

This will download and run the containers. Once the containers are running you are ready to send filter requests.

curl http://localhost:8080/api/filter --data "George Washington was president and his ssn was 123-45-6789." -H "Content-type: text/plain"
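
With a filter profile that redacts SSNs, the response should look something like this (the exact replacement depends on your filter profile):

George Washington was president and his ssn was {{{REDACTED-ssn}}}.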

A license key must be set for the containers to start. A free license key can be requested.


Philter Enterprise Edition certified for Red Hat Enterprise Linux

We are happy to announce that Philter Enterprise Edition has achieved Red Hat Enterprise Linux software certification! Philter Enterprise Edition is now listed in the Red Hat Ecosystem Catalog.

We are excited to be able to bring Philter to users of Red Hat Enterprise Linux, whether it be in the cloud or on-premises.

Learn more about Philter Enterprise Edition.


Philter 1.2.0

We are happy to announce the release of Philter 1.2.0. This version brings new features to filter profiles along with some minor changes. Philter 1.2.0 will be available on the cloud marketplaces in a day or two. Let's get to it and see what's new!

A Recap

Philter is an application that analyzes text for personally identifiable information (PII) and protected health information (PHI) and removes or manipulates those items when found. The types of information that Philter looks for, and how it acts upon that information, are defined in a filter profile. A filter profile is just a file that lists the types of PII/PHI you are interested in, e.g. credit card numbers, persons' names, etc. Philter is available on the AWS Marketplace, Azure Marketplace, and GCP Marketplace.

Contact us for a live demo or feel free to take Philter for a spin on one of the cloud marketplaces taking advantage of its free trial period. Check out the full Release History.

What's New in Philter 1.2.0

Filter Specific Ignore Lists

A filter profile can now have lists of ignored terms specific to each filter type. For example, let's say there is a number "123-45-6789" in your text and it keeps getting identified as an SSN because it fits the SSN format. However, you know this number is not an SSN and do not want it removed. You can now add "123-45-6789" to a list of ignored terms for the SSN filter to prevent it from being removed from the text. Each type of filter has its own ignore list.

Global Ignore Lists

A filter profile can now have zero or more ignore lists that apply to all filter types. Items added to a global ignore list are ignored by every filter and will never be removed from the input text.

Disabling Filters

Previously, to disable a filter type in a filter profile you had to delete it from the profile. This can be problematic because the profile might contain configuration you don't want to delete and lose. New in Philter 1.2.0, each filter type has an enabled property that controls whether the filter is applied; when set to false the filter is skipped. The default value is true, so every filter type is enabled unless you explicitly disable it. A sketch combining this and the ignore lists described above follows.
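
The sketch below shows a filter profile with a global ignore list, an SSN filter with its own ignore list, and a disabled credit card filter. The property names are illustrative, not Philter's exact schema; see the User's Guide for the real format.

{
  "name": "example-profile",
  "ignored": ["ACME-REF-001"],
  "identifiers": {
    "ssn": {
      "enabled": true,
      "ignored": ["123-45-6789"]
    },
    "creditCard": {
      "enabled": false
    }
  }
}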

Invalid Credit Card Numbers

Philter identifies credit card numbers based on the patterns and algorithms of the numbers. Philter 1.2.0 adds a new option to the credit card filter type that allows invalid credit card numbers to be filtered as well. An invalid credit card number is a number that matches the pattern of a credit card number but fails the checksum algorithm (the Luhn algorithm). This option is disabled by default.
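
Since the Luhn algorithm is well known, here is a small Java sketch of the checksum that such validation performs:

public class LuhnCheck {

    // Returns true if the number's digits pass the Luhn checksum.
    public static boolean isValid(String number) {
        int sum = 0;
        boolean doubleDigit = false;
        for (int i = number.length() - 1; i >= 0; i--) {
            int d = number.charAt(i) - '0';
            if (doubleDigit) {
                d *= 2;
                if (d > 9) {
                    d -= 9;
                }
            }
            sum += d;
            doubleDigit = !doubleDigit;
        }
        return sum % 10 == 0;
    }

    public static void main(String[] args) {
        System.out.println(isValid("4532015112830366")); // true: passes the checksum
        System.out.println(isValid("4532015112830367")); // false: fails the checksum
    }
}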

Valid Dates

Philter identifies dates based on date patterns. Sometimes a date may match a valid pattern but not be a valid date, such as February 30 or even March 45. Philter 1.2.0 adds a new option to the date filter to require that identified dates be valid. When enabled, dates found not to be valid are not removed from the text. This option is disabled by default.

Option to Remove Punctuation

Philter 1.2.0 adds a new option to the filter profile for named-entity recognition that removes punctuation from the input text prior to processing. By default this option is disabled and punctuation is not removed. Removing punctuation can be beneficial in cases where punctuation is being included in entities. This can happen when the last word of a sentence is a name and the period is included in the filtered text. (This doesn't always happen, and we're working on further reducing those occurrences through improvements to the named-entity recognition capability.)

Encrypting Connections to Redis

Philter's consistent anonymization feature stores the identified text in a Redis cache. This allows a clustered Philter installation to be able to replace identified text consistently across all instances of Philter. (When Redis is not used, the identified text values are stored in memory on each Philter instance.) Philter 1.2.0 requires all connections to a Redis cache be encrypted and requires the use of a Redis auth token.


Introducing ngramdb

ngramdb provides a distributed means of storing and querying N-grams (or bags of words) organized under contexts. A REST interface provides the ability to insert n-grams, execute “starts with” and “top” queries, and calculate similarity metrics of contexts. Apache Ignite provides the distributed and highly available persistence and powers the querying abilities.

ngramdb is experimental and significant changes are likely. We welcome your feedback and input into its future capabilities.

ngramdb is open source under the Apache License, version 2.0.

 https://github.com/mtnfog/ngramdb


Apache OpenNLP Language Detection in Apache NiFi

When building an NLP pipeline in Apache NiFi, it can be a requirement to route text through the pipeline based on its language. But how do we get the language of the text inside the pipeline? This blog post introduces a processor for Apache NiFi that utilizes Apache OpenNLP's language detection capabilities. The processor receives natural language text and returns an ordered list of detected languages along with each language's probability. Your flow can take the first language in the list (it has the highest probability) and use it to route the text.

In case you are not familiar with OpenNLP's language detection, it provides the ability to detect over 100 languages. It works best with text containing more than one sentence (the more text the better). It was introduced in OpenNLP 1.8.3.

To use the processor, first clone it from GitHub. Then build it and copy the nar file to your NiFi's lib directory (and restart NiFi if it was running). We are using NiFi 1.4.0.

git clone https://github.com/mtnfog/nlp-nifi-processors.git
cd nlp-nifi-processors
mvn clean install
cp langdetect-nifi-processor/langdetect-processor-nar/target/*.nar /path/to/nifi/lib/

The processor does not have any settings to configure; it's ready to work right "out of the box." You can add the processor to your NiFi canvas and connect it into your flow.

You will likely want to connect the processor to an EvaluateJsonPath processor to extract the language from the JSON response, and then to a RouteOnAttribute processor to route the text through the pipeline based on the language. This processor will also work with Apache NiFi MiNiFi to determine the language of text on edge devices. MiNiFi is a subproject of Apache NiFi that allows for capturing data into NiFi flows from edge locations.

Backing up a bit, why would we need to route text through the pipeline depending on its language? The actions taken further down the pipeline are likely to be language-dependent. For instance, the next step might be to tokenize the text, but knowing how to tokenize it requires knowing what language it is. Or, if the next step is to send the text to an entity extraction process, we need to know which entity model to use based on the language. So language detection can be a crucial initial step in an NLP pipeline. A previous blog post showed how to use NiFi for an NLP pipeline, and extending that pipeline with language detection would be a great addition!

This processor performs the language detection inside the NiFi process; everything remains inside your NiFi installation. This should be adequate for a lot of use cases, but if you need more throughput check out Renku Language Detection Engine. It works very similarly to this processor in that it receives text and returns a list of identified languages. However, Renku is implemented as a stateless, scalable microservice, meaning you can deploy as many instances as you need to meet your use case's requirements. And maybe the best part is that Renku is free for everyone to use without any limits.

Let us know how the processor works out for you!