Philter 1.6.0

Philter 1.6.0 will be available soon through the cloud marketplaces and DockerHub. This is probably the most significant release of Philter since the first release, 1.0.0.

Version 1.6.0 has many new features and a few fixes. Instead of writing a single blog post for the entire release, we are going to write a few separate blog posts on the significant new features. We will summarize the new features below and then follow up over the next few days with posts that go more in-depth on each one. Check out Philter's Release Notes.

Over the next few days we will be making updates to the Philter SDKs to accommodate the new features in Philter 1.6.0.

Deploy Philter

  • Launch Philter on AWS (all regions including GovCloud): Philter 1.5.0 on Amazon Linux 2
  • Launch Philter on Azure: Philter 1.5.0 on CentOS 7.7
  • Launch Philter on Google Cloud Compute: Philter 1.5.0 on CentOS 8

New Features in Philter 1.6.0

The following are summaries of the new features added in Philter 1.6.0.

Alerts

The new alerts feature in Philter 1.6.0 lets Philter generate an alert when a given filter condition is satisfied. For example, if you have a filter condition that only matches a person's name of "John Smith", Philter will generate an alert when that condition is satisfied. The alert is stored in Philter and can be retrieved and deleted using Philter's new Alerts API. Details of the alerts are in Philter's User's Guide.
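
As a rough sketch of what interacting with the Alerts API could look like from the command line (the endpoint paths here are assumptions for illustration; the actual paths are in the User's Guide):

# Hypothetical paths for illustration; see the User's Guide for the actual Alerts API.
curl -k https://philter-ip:8080/api/alerts
curl -k -X DELETE https://philter-ip:8080/api/alerts/{alertId}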

Span Disambiguation

Sometimes a piece of sensitive information could be one of a few filter types, such as an SSN, a phone number, or a driver's license number. The span disambiguation feature works to determine which of the potential filter types is most appropriate by analyzing the context of the sensitive information. Philter uses various natural language processing (NLP) techniques to determine which filter type the sensitive information most closely resembles. Because of the techniques used, the more text Philter sees the more accurate the span disambiguation will become.

Span disambiguation is documented in Philter's User's Guide.

New Filters: Bitcoin Address, IBAN Codes, US Passport Numbers, US Driver's License Numbers

Philter 1.6.0 contains several new filter types:

  • Bitcoin Address - Identify bitcoin addresses.
  • IBAN Codes - Identify International Bank Account Numbers.
  • US Passport Numbers - Identify US passport numbers issued since 1981.
  • US Driver's License Numbers - Identify US driver's license numbers for all 50 states.

Each of these new filters is available through filter profiles.
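
As a sketch, enabling one of the new filters in a filter profile might look like the following (the identifier and property names here are illustrative assumptions; see the filter profiles documentation for the exact names):

{
  "name": "default",
  "identifiers": {
    "bitcoinAddress": {
      "bitcoinAddressFilterStrategies": [
        {
          "strategy": "REDACT",
          "redactionFormat": "{{{REDACTED-%t}}}"
        }
      ]
    }
  }
}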

New Replacement Strategy: SHA-256 with random salt values

We previously added the ability to encrypt sensitive information in text. In Philter 1.6.0 we have added the ability to hash sensitive information using SHA-256. When the hash replacement strategy is selected, each piece of sensitive text is replaced by its SHA-256 value. Additionally, the hash replacement strategy has a "salt" property that, when enabled, causes Philter to append a random salt value to each piece of sensitive text prior to hashing. The random salt value is included in the filter response.
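
A sketch of a filter strategy using the new hash replacement, modeled on the existing strategy format (the strategy name and "salt" property shown here are illustrative assumptions; consult the User's Guide for the exact values):

"ssnFilterStrategies": [
  {
    "strategy": "HASH_SHA256",
    "salt": true
  }
]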

Custom Dictionary Filters Can Now Use an External Dictionary File

Philter's custom dictionary filter lets you specify a list of terms to identify as being sensitive. Prior to Philter 1.6.0, this list of terms had to be provided in the filter profile. With a long list, it did not take long for the filter profile to become hard to read and even harder to manage. Now, instead of providing a list of terms in the filter profile, you can simply provide the full path to a file that contains the terms. This keeps the filter profile compact and easier to manage. You can specify as many dictionary files as you need, and Philter will combine the terms when the filter profile is loaded.

Custom Dictionary Filters Now Have a "fuzzy" Property

Philter's custom dictionary filter previously always used fuzzy detection. (Fuzzy detection is like a spell checker - a misspelled name such as "Davd" can be identified as "David.") New in Philter 1.6.0 is a property on the custom dictionary filter called "fuzzy." This property controls whether or not fuzzy detection is enabled. It was added because disabling fuzzy detection when it is not needed yields a significant performance increase: when fuzzy detection is disabled, Philter uses an optimized data structure to identify the terms. If fuzzy detection is not needed, we recommend disabling it to take advantage of the performance gain.
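
A sketch of a custom dictionary filter using both new capabilities, an external dictionary file and the fuzzy property (the "file" and "fuzzy" property names are illustrative assumptions):

"dictionaries": [
  {
    "type": "custom-terms",
    "file": "/path/to/terms.txt",
    "fuzzy": false,
    "customFilterStrategies": [
      {
        "strategy": "REDACT",
        "redactionFormat": "{{{REDACTED-%t}}}"
      }
    ]
  }
]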

Changed "Type" to "Classification"

A few filter types had additional information that further described the sensitive information. For instance, the entity filter had a "type" that identified the type of the entity, such as "PER" for person. We have changed the property "type" to "classification" for clarity and uniformity. Be sure to update your filter profiles if you have any filter conditions that use "type" to use "classification" instead. It is a drop-in replacement; you can simply change "type" to "classification."

Add Filter Condition for "Classification"

Philter 1.6.0 adds the ability to have a filter condition on "classification."
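
For example, a strategy condition restricted to entities classified as persons might look like the following (the condition grammar shown is an illustrative assumption; see the User's Guide for the exact syntax):

"condition": "classification == \"PER\""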

Redis Cache Can Now Use a Self-Signed SSL Certificate

Philter 1.6.0 can now connect to a Redis cache that is using a self-signed certificate. New configuration settings for the truststore and keystore allow for trusting the self-signed certificate.
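
As an illustration only, the new settings might look like the following (these property names are hypothetical stand-ins; see Philter's settings documentation for the actual keys):

# Hypothetical property names; see Philter's settings documentation for the actual keys.
cache.redis.truststore=/path/to/truststore.jks
cache.redis.truststore.password=changeit
cache.redis.keystore=/path/to/keystore.jks
cache.redis.keystore.password=changeit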

Fixes and Improvements in Philter 1.6.0

The following is a list of fixes and improvements made in Philter 1.6.0.

Fixed Potential MAC Address Issue

We found and fixed a potential issue where a MAC Address might not be identified correctly.

Fixed Potential Ignore Issue with Custom Dictionary Filters

We found and fixed a potential issue where a term in a custom dictionary that is also a term in an ignore list might not be ignored correctly.

Fixed Potential Issue with Credit Card Number Validation

We found and fixed a potential issue where a credit card number might not be validated correctly. This only applies when credit card validation is enabled.


Jeff Zemerick is the founder of Mountain Fog. You can contact him at jeff.zemerick@mtnfog.com or on LinkedIn.  

Philter - A Real-World Use-Case

Philter finds, identifies, and removes sensitive information from text. That's a good, short description of Philter, but, as they say, a picture is worth a thousand words. In this post we will detail an actual, real-world use-case of Philter as we paint a picture with words!

"Super Helpdesk"

The Philter customer, we'll call them Super Helpdesk, is a provider of a software-as-a-service helpdesk solution. Their customers sign up to be able to offer a helpdesk to their own customers. (Following? :) Super Helpdesk's users need the ability to optionally prevent sensitive information from being passed through in tickets. If a customer enters something sensitive, they want to remove it from the ticket before the ticket enters the workflow.

In this case, the sensitive information Super Helpdesk is most worried about are credit card numbers. Due to security best practices and regulations like PCI-DSS, credit card numbers cannot exist in helpdesk tickets where they may be stored or transmitted unencrypted. Super Helpdesk needed a way to analyze the tickets entering their system in order to filter out the credit card numbers from the tickets.

The Solution

At a high-level, Super Helpdesk deployed Philter (in this case running on EC2 in AWS) to perform the filtering of the content of the helpdesk tickets. As new helpdesk tickets are submitted, the content of the ticket is sent to Philter and Philter immediately returns the content of the ticket with the credit card numbers redacted to just the last four digits. (Super Helpdesk also added an option for their users to control how Philter redacts the credit card numbers, with the available options being redact all or redact all but the last four digits.)

Now for the low-level implementation details! When new helpdesk tickets come in they are published to an Apache Kafka topic. A process consumes from the topic, does processing on the ticket, and ultimately inserts the ticket into a backend database. This process, written in Java, was modified to make use of the Philter Java SDK to enable the communication between the process and Philter.
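
As a minimal sketch of that integration point, here the Philter Java SDK call is stood in for by a plain HTTP request (the Philter host and the surrounding Kafka plumbing are assumptions):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class TicketFilter {

    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    // Sends the ticket text to Philter's filter endpoint and returns the redacted text.
    public static String filter(String ticketText) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://philter-ip:8080/api/filter"))  // assumed Philter host
                .header("Content-Type", "text/plain")
                .POST(HttpRequest.BodyPublishers.ofString(ticketText))
                .build();
        return CLIENT.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        // The consumer loop would call filter() on each ticket before the database insert.
        System.out.println(filter("My card number is 4111-1111-1111-1111."));
    }
}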

We have found this to actually be a very common Extract-Transform-Load (ETL) design scenario across industries. Data in the form of text flows from an external system through a pipeline facilitated by Apache Kafka or Amazon Kinesis Firehose into an internal database. Along the way the data needs to be manipulated in some manner. In our case the data manipulation is to remove sensitive information from the text. Philter's API allows it to slide nearly seamlessly into the existing pipeline. Like Super Helpdesk did, just insert a step to send the text to Philter for filtering.

We made a previous blog post about using Philter inside an AWS Kinesis Firehose using a Firehose Transformation. It describes how to create a Lambda function that invokes Philter on the text going through the pipeline. Check it out at the link below.

Using AWS Kinesis Firehose Transformations to Filter Sensitive Information from Streaming Text

But, wait, why Philter?

You are probably saying that this seems like overkill for a simple problem like redacting credit card numbers. Credit card numbers follow a well-defined pattern, so why not just use a regular expression to find them? If all you want to do is find credit card numbers, then a regular expression may well work.

So what does using Philter give us? A good bit, actually. Through the use of filter profiles, Philter can have a pre-set list of types of sensitive information. Each type of sensitive information can have its own redaction logic. For example, you could redact VISA card numbers while truncating AMEX card numbers. Or you could leave only the last four digits of card numbers matching a condition. Additionally, each customer of the helpdesk platform may have different requirements around sensitive information; that logic can also be encapsulated in filter profiles. The equivalent regular expression logic just got a lot more complicated.
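
As a sketch of that profile-driven logic (the creditCard identifier and property names are illustrative assumptions; the exact names are in the filter profiles documentation):

{
  "name": "helpdesk",
  "identifiers": {
    "creditCard": {
      "creditCardFilterStrategies": [
        {
          "strategy": "REDACT",
          "redactionFormat": "{{{REDACTED-%t}}}"
        }
      ]
    }
  }
}

Each customer could then have their own profile with different strategies, selected per request by the profile name parameter.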

Philter provides other features as well, such as the ability to capture metrics on the data, to encrypt the credit card numbers instead of removing them, and to disambiguate between different types of sensitive information.

Lastly, a regular expression will never be able to find non-deterministic types of sensitive information like people's names. Philter's natural language processing (NLP) capabilities can find entities like people's names that do not follow any set pattern.

Try Philter

Deploying Philter to AWS, Azure, or GCP is easy because Philter is available through each of the cloud's marketplaces. Simply follow the marketplace steps to launch an instance of Philter in your private cloud.

  • Launch Philter on AWS (all regions including GovCloud): Philter 1.5.0 on Amazon Linux 2
  • Launch Philter on Azure: Philter 1.5.0 on CentOS 7.7
  • Launch Philter on Google Cloud Compute: Philter 1.5.0 on CentOS 8

Share your experience!

We would love to hear how you are using Philter. Share your experience with us!



Challenges in Finding Sensitive Information and Content in Text

Finding sensitive information and content in text has been a problem for as long as text has existed. But in the past few years due to the availability of cheaper data storage and streaming systems, finding sensitive information in text has become nearly a universal need across all industries. Today, systems that process streaming text often need to filter out any information considered sensitive directly in the pipeline to ensure the downstream applications have immediate access to the sanitized text. Streaming platforms are commonly used in industries such as healthcare and banking where the data can contain large amounts of sensitive information.

What is "sensitive information"?

Taking a step back, what is sensitive information? Sensitive information is simply any information that you or your organization deems as being sensitive. There are some global types of sensitive information such as personally identifiable information (PII) and protected health information (PHI). These types of sensitive information, among others, are typically regulated in how the information must be stored, transmitted, or used. But it is common for other types of information to be sensitive for your organization. This could be a list of terms, phrases, locations, or other information important to your organization. Simply put, if you consider it sensitive then it is sensitive.

Structured vs. Unstructured

It's important to note we are talking about unstructured, natural language text. Text in a structured format like XML or JSON is typically simpler to manipulate due to its inherent structure. But in unstructured text we don't have the convenience of being told what a "person's name" is, as an XML tag like <personName> would tell us. There are generally three ways to find sensitive information in unstructured text.

Three Methods of Finding Sensitive Information in Text

The first method is to look for sensitive information that follows well-defined patterns, such as US social security numbers and phone numbers. Even though regular expressions are not a lot of fun, we can readily write regular expressions to match social security numbers and phone numbers. Once we have the regular expressions, it's straightforward to apply them to the input text to find pieces of the text matching the patterns.
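
For example, a short Java snippet matching US social security numbers in the NNN-NN-NNNN format:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SsnRegexExample {

    public static void main(String[] args) {
        // Three digits, two digits, four digits, separated by hyphens.
        Pattern ssn = Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b");
        Matcher m = ssn.matcher("His SSN was 123-45-6789.");
        while (m.find()) {
            System.out.printf("Found '%s' at [%d, %d)%n", m.group(), m.start(), m.end());
        }
    }
}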

The second method is to look for sensitive information that can be found in a dictionary or database. This method works well for geographic locations and for information that you might have stored in your database or spreadsheet, such as a column of people's names. Once the list is accessible, it is again fairly straightforward to look for those items in the text.

The third, and last, method is to employ the techniques of natural language processing (NLP). The technology and tools provided by the NLP ecosystem give us powerful ways to analyze unstructured text. We can use NLP to find sensitive information that does not follow well-defined patterns and is not referenced in a database column or spreadsheet, such as the names of people or organizations. The past few years have seen remarkable advancements in NLP, allowing these techniques to analyze text with great success.

Deterministic and Non-deterministic

The first two methods are deterministic. Finding text that matches a pattern and finding text contained in a dictionary are pass/fail scenarios: you either find the text you are looking for or you do not. The third method, NLP, is not deterministic. NLP uses trained models to analyze the text. When an NLP method finds information in text, it has an associated confidence value that tells us how sure the algorithms are that the information is what we are looking for.

Introducing Philter

Philter is our software product that implements these three methods of identifying sensitive information in text. Philter supports finding, identifying, and removing sensitive information in text. You set the types of information you consider sensitive and then send text to Philter. The filtered text without the sensitive information is returned to you. With Philter you have full control over how the sensitive information is manipulated - you can redact it, replace it with random values, encrypt it, and more.

Often, different types of sensitive information can follow the same pattern. For example, a US social security number is a 9 digit number. Many driver's license numbers can also be 9 digit numbers. Philter can disambiguate between a social security number and a driver's license number based on how the number is used in the text. When using a dictionary, we can't forget about misspellings. If we simply look for words in the dictionary, we may not find a name that has been misspelled. Philter supports fuzzy searching by looking for misspellings when applying a dictionary-based filter.

This isn't nearly all Philter can do, but these are some of the more exciting features to date. Take Philter for a test drive on the cloud of your choice. We'd be happy to walk you through it if you would like!

  • Launch Philter on AWS (all regions including GovCloud): Philter 1.5.0 on Amazon Linux 2
  • Launch Philter on Azure: Philter 1.5.0 on CentOS 7.7
  • Launch Philter on Google Cloud Compute: Philter 1.5.0 on CentOS 8


Philter Docker Containers

We are excited to announce that Philter can now be launched as Docker containers. Previously, Philter was only available through the AWS, Azure, and Google Compute Cloud marketplaces. By making Philter available as containers, Philter can now easily be used outside those cloud platforms, in container orchestration tools such as Kubernetes, and on-premises. Philter finds, identifies, and removes sensitive information such as PHI and PII from natural language text.

Launching the Philter containers is easy:

curl -O https://raw.githubusercontent.com/mtnfog/philter/master/docker-compose.yml
docker-compose up

This will download and run the containers. Once the containers are running you are ready to send filter requests.

curl http://localhost:8080/api/filter --data "George Washington was president and his ssn was 123-45-6789." -H "Content-type: text/plain"

A license key is required for the containers to start and can be requested using the link below.



Use your own NLP models with Philter

New in Philter 1.5.0 is the ability to use your own custom NLP models with Philter. Available in both Standard and Enterprise editions.

Philter is able to identify named-entities in text through the use of a trained model. The model can identify things, like people's names, in the text that do not follow a well-defined pattern and are not easily referenced in a dictionary. Philter's NLP model is interchangeable, and we offer multiple models that you can choose from to better tailor Philter to your use-case and your domain.

However, there are times when using our models may not be sufficient, such as when your use-case does not exactly match our available models or you want to try to get better performance by training a model on text very similar to your input text. In those cases you can train a custom NLP model for use with Philter.

Getting Started

Before diving into the details, here are some examples that illustrate how to make a custom NLP model usable by Philter. Feel free to use these examples as a starting point for your own models. Additionally, this capability is documented in Philter's User's Guide.

The first example in that repository is an implementation of a named-entity recognizer using Apache OpenNLP. The entity recognizer is exposed via two endpoints using a Spring Boot REST controller.

  • The first endpoint /process simply passes the received text to the entity recognizer. The entity recognizer receives the text, extracts the entities, and returns them in a list of PhilterSpan objects.
  • The second endpoint /status simply returns HTTP 200 and the text healthy. In cases where your model may take a while to load, it would be better to actually check the status of the model instead of just returning that it is healthy. But for this example, the model loading is quick and done before the HTTP service is even available.

Those are the only required endpoints. Philter will use those endpoints to interact with your model.

@RequestMapping(value = "/process", method = RequestMethod.POST, consumes = MediaType.TEXT_PLAIN_VALUE, produces = MediaType.APPLICATION_JSON_VALUE)
public @ResponseBody List<PhilterSpan> process(@RequestBody String text) {
    return openNlpNer.extract(text);
}

@RequestMapping(value = "/status", method = RequestMethod.GET)
public ResponseEntity<String> status() {
    return new ResponseEntity<String>("healthy", HttpStatus.OK);
}

Click here to see the full source file.

The PhilterSpan object contains the details of a single extracted entity. An example response from the /process endpoint for a single entity is shown below.

[
  {
    "text": "George Washington",
    "tag": "person",
    "score": 0.97,
    "start": 0,
    "end": 17
  }
]
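
The corresponding PhilterSpan class can be a plain Java object; here is a minimal sketch implied by the response above (the field types are assumptions based on the example values):

public class PhilterSpan {

    private String text;   // the extracted entity text
    private String tag;    // the entity type, such as "person"
    private double score;  // the recognizer's confidence
    private int start;     // character offset where the entity begins
    private int end;       // character offset where the entity ends

    // Getters and setters omitted for brevity.
}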

Custom NLP Models

Training Your Own Model

Philter is indifferent to the technologies and methods you choose to train your custom model. You can use any framework you like, such as Apache OpenNLP, spaCy, Stanford CoreNLP, or your own custom framework. Follow the framework's documentation for training a model.

Using a Model You Already Have

If you already have a trained NLP model, you can skip the training part and proceed to making the model accessible to Philter.

Using Your Own Model with Philter

Once your model has been trained and you are satisfied with its performance, to use the model with Philter you must expose the model through the simple HTTP service interface described above (the /process and /status endpoints). This service facilitates communication between Philter and your model.

Once your model is available behind the HTTP interface described above, you are ready to use the model with Philter. On the Philter virtual machine, simply set the PHILTER_NER_ENDPOINT environment variable to the location of the running HTTP service. It is recommended you set this environment variable in /etc/environment. If your HTTP service is running on the same host as Philter on port 8888, the environment variable would be set as:

export PHILTER_NER_ENDPOINT=http://localhost:8888/
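
Before restarting Philter, you can sanity check that the service is reachable (assuming it is running locally on port 8888); it should return healthy:

curl http://localhost:8888/status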

Now restart the Philter service and stop and disable the philter-ner service.

sudo systemctl restart philter.service
sudo systemctl stop philter-ner.service
sudo systemctl disable philter-ner.service

When a filter profile containing an NER filter is applied, Philter will make requests to your HTTP service, invoking your model's inference and returning the identified named-entities.

Recommendations and Best Practices

You have complete freedom to train your custom NLP model using whatever tools and processes you choose. However, from our experience there are a few things that can help you be successful.

The first recommendation is to package your service in a Docker container. Doing so gives you a self-contained image that can be deployed and run virtually anywhere. It simplifies dependency management and protects you from dependency version changes.

The second recommendation is to make your HTTP service as lightweight as possible. Avoid any unnecessary code or features that could negatively impact the speed of your model inference.

Lastly, thoroughly evaluate your model prior to deploying it to Philter so that you have a realistic expectation of its performance.

Conclusion

Using a custom NLP model with Philter is a fairly straightforward process. Train your model, make it accessible by HTTP, and then deploy the HTTP service and the model such that the service is accessible from Philter.

Do you have to train a custom model to use Philter? Absolutely not. Philter's out-of-the-box capabilities are sufficient for most use-cases. Will using your own model give you better performance? Possibly. It depends on how well the model is trained and the parameters used. Training your own model can be a difficult and time-consuming activity, so it's best to have some familiarity with the process before starting.

We are excited to offer this feature and look forward to getting your feedback!

  • Launch Philter on AWS (all regions including GovCloud): Philter 1.5.0 on Amazon Linux 2
  • Launch Philter on Azure: Philter 1.5.0 on CentOS 7.7
  • Launch Philter on Google Cloud Compute: Philter 1.5.0 on CentOS 8

Philter 1.5.0

Happy Friday! We are in the process of publishing Philter 1.5.0. Philter identifies and removes sensitive information in text. Look for Philter 1.5.0 to be available on the cloud marketplaces soon.

This version has a few new features in addition to minor improvements and fixes. The new features are described below.

New "Section" Filter

Philter 1.5.0 includes a new filter type called a "Section." This filter type lets you specify patterns that indicate the start and end of a section of text. For example, if your text has sentences or even paragraphs denoted with some marker, you can use the Section filter to redact those sentences or paragraphs. You just give the filter the regular expression patterns for the start and end markings. We have added the Section filter to the filter profiles documentation.
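
A sketch of what the Section filter could look like in a filter profile (the property names are illustrative assumptions; see the filter profiles documentation for the exact names):

"section": {
  "startPattern": "BEGIN-CONFIDENTIAL",
  "endPattern": "END-CONFIDENTIAL",
  "sectionFilterStrategies": [
    {
      "strategy": "REDACT",
      "redactionFormat": "{{{REDACTED-%t}}}"
    }
  ]
}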

Amazon S3 to Store Filter Profiles

We have added the ability to store filter profiles in an Amazon S3 bucket. The benefit of this is that filter profiles can now be shared across multiple instances of Philter. Previously, if you were running two instances of Philter, you had to update the filter profiles on each instance. By storing the filter profiles in S3 you can update them once via Philter's API. This does require a cache; the cache stores the filter profiles to lower latency and reduce the number of calls to S3. (More on the cache below.)

We have published some CloudFormation and Terraform scripts to help with creating this architecture on GitHub.

Consolidated Caches

Philter previously used caches for the random anonymization values. With the introduction of a cache for storing the profiles in S3, we have consolidated those caches into a single cache, and the configuration settings have been renamed slightly to reflect this. We have updated Philter's documentation with the renamed properties. Having a single cache means there is less to configure and fewer required resources.

If you are upgrading from a previous version you will need to change to the new cache property names.

Changeable Model File

The model file used by Philter can now be set in Philter's application.properties. Check out Philter's documentation for the details. By being able to set the model in use, you can select the model most applicable to your use-case and domain.


Load-balanced and highly-available Philter CloudFormation template

We now have an AWS CloudFormation template to deploy an auto-scaled, highly-available Philter environment to identify and remove sensitive information from text. This template creates a VPC, load balancer, Philter instances, a Redis cache, and all required networking and security group configuration. Click the Launch Stack button to begin launching the stack.

In a deployment of Philter that is a single EC2 instance, the EC2 instance is a single point of failure with no ability to respond to fluctuations in demand. By deploying more than one EC2 instance, we can protect the application against failure and scale up and down as needed.

The benefit of using this CloudFormation template is that it provides a pre-configured Philter architecture and deployment that is highly-available and scalable. Your API requests to Philter to filter sensitive information from text will have higher throughput since the load balancer distributes those requests across the Philter instances. And as described below, the stack uses end-to-end encryption of data at-rest and in-transit.

The stack requires an active subscription to Philter via the AWS Marketplace. The template supports us-east-1, us-east-2, us-west-1, and us-west-2 regions.

The CloudFormation template is available in the philter-infrastructure-as-code repository on GitHub.

The Philter Stack Architecture

The deployment creates an elastic load balancer that is attached to an auto-scaled group of Philter EC2 instances. The load balancer spans two public subnets and the Philter EC2 instances are spread across two private subnets. Also in the private subnets is an Amazon Elasticache for Redis replication group. A NAT Gateway located in one of the public subnets provides outgoing internet access by routing the traffic to the VPC's Internet Gateway.

The load balancer will monitor the status of each Philter EC2 instance by periodically checking the /api/status endpoint. If an instance is found to be unhealthy after failing several consecutive health checks, the failing instance will be replaced.

The Philter auto-scaling group is set to scale up and down based on the average CPU utilization of the Philter EC2 instances. When the CPU usage hits the high threshold, another Philter EC2 instance will be added. When the CPU usage hits the low threshold, the auto-scaling group will begin removing (and terminating) instances from the group. The scaling policy is set to scale up at a faster rate than it scales down to avoid scaling down too quickly.

End-to-end Encryption

Incoming traffic to the load balancer is received by a TCP protocol handler on port 8080. These requests are distributed across the available Philter EC2 instances. The encrypted incoming traffic is terminated at the Philter EC2 instances. Network traffic between the Elasticache for Redis nodes is encrypted, and the data at-rest in the cache is also encrypted. The Philter EC2 instances use encrypted EBS volumes.

Launch the Stack

Click the Launch Stack button to launch the stack in your AWS account, or get the template here, or launch the stack using the AWS CLI with the command below.

aws cloudformation create-stack --stack-name philter --template-url https://mtnfog-public.s3.amazonaws.com/philter-resources/philter-vpc-load-balanced-with-redis.json

Once the stack completes Philter will be ready to accept requests. There will be an Output value called PhilterEndpoint. This value is the Philter API URL.

For example, if the value of PhilterEndpoint is https://philter2-philterlo-5lc0jo7if8g1-586151735.us-east-1.elb.amazonaws.com:8080/, then you can check Philter's status using the command:

curl -k https://philter2-philterlo-5lc0jo7if8g1-586151735.us-east-1.elb.amazonaws.com:8080/api/status

You can try a quick sample filter request with:

curl -k "https://philter2-philterlo-5lc0jo7if8g1-586151735.us-east-1.elb.amazonaws.com:8080/api/filter" \
  --data "George Washington lives in 90210 and his SSN was 123-45-6789." \
  -H "Content-type: text/plain"

Philter Studio 1.0.0

Philter Studio 1.0.0 is now available for download in your account. Philter Studio is an application for Windows 7/10 that provides convenient access to removing sensitive information from files and documents using Philter.

With Philter Studio’s intuitive interface you can quickly and easily utilize Philter to find and remove sensitive information from your files. Process files one at a time or queue up entire directories and process all files with a single click. Philter Studio supports finding and removing sensitive information in Microsoft Word files (.doc and .docx). Philter Studio can enable track changes so the redactions can be viewed while editing the document.

Philter Studio lets you take a deep look at how the sensitive information in your text was identified and removed. The Compare and Explain feature visually highlights the information, describes why it was identified, and shows the redacted version.


Philter and COVID-19

Philter NLP Models

The natural language processing (NLP) capabilities of Philter are partly model-driven, meaning that we have trained models to identify information in text. These models are used to identify pieces of sensitive information that do not follow well-defined patterns or exist in referenced dictionaries, such as people's names. The model training process is complex and compute-intensive, often taking days or even weeks to complete. Once a model is created, it can be applied to text to identify specific parts of the text based on the text used to train the model and the parameters of the training.

NLP Models for Many Use-Cases and Industries

The model currently deployed in Philter is generic yet provides good performance across many use-cases covering many different types of text. It has been our plan for some time to offer models trained for specific use-cases and industries, including non-healthcare industries, for those instances when Philter is used only on a certain type of text. This will give those specific use-cases a performance increase from using a tailored model.

Philter's pluggable model implementation is not quite ready yet. However, we are jumping ahead today to announce a model tailored for personally identifiable information in text related to COVID-19. We hope that this model will give you improved performance when identifying sensitive information in COVID-19 related text.

Model Availability

Because we are jumping ahead in order to make this model immediately available, we don't yet have any automation or tooling support for downloading and installing the model yourself. (We will in the future.) Until the self-service tooling is available, we will distribute the model and installation instructions to users of Philter via email upon request. There is no additional charge to request and use the model.

To request the Philter model trained using COVID-19 data please use our contact form and include your cloud marketplace (AWS, Azure, or GCP) subscription ID.

Philter is available as a 30-day trial. If you are working with data related to COVID-19 and your free trial expires, you can request no cost access to Philter's virtual machine images for continued use at no charge (except for the underlying cloud resources that you pay to the cloud provider).


Using Philter with Microsoft Power Automate (Flow)


Philter SDKs

We have some updates on the Philter SDKs!

The Philter SDKs provide API clients for interacting with Philter to identify and remove sensitive information from text. Each project contains examples showing how to use the SDK.

Philter SDK for Java

The Java SDK is now available in Maven Central.

Philter SDK for .NET

The .NET SDK is now available from NuGet.

Philter SDK for Golang

The Golang SDK is now available on GitHub.



Filtering Sensitive Information from Text using Apache NiFi and Philter

A while back we made a post describing how Philter can be used alongside Apache NiFi to identify and remove sensitive information from text. Since that post there have been changes to Philter and Apache NiFi, so we thought it would be worthwhile to revisit that architecture and its configuration.

  • Apache NiFi is an application for creating and managing data flows that process data.
  • Philter identifies and removes sensitive information, such as PHI and PII, from natural language text. Philter is available on cloud marketplaces.

The Data Flow Architecture

In the architecture of our data flow, we are going to ingest natural language (unstructured) text from somewhere - it doesn't really matter where. In your use-case it may be from a file system, an S3 bucket, or an Apache Kafka topic. Once we have the text in the content of the NiFi flowfile, we will send the text to Philter, where the sensitive information will be removed. The filtered text will then become the content of the flowfile. In our example we are going to read files from a directory on the file system.

To interact with Philter we can use NiFi's InvokeHTTP processor since Philter's API is HTTP REST-based.

Finally, we will write the filtered text to some destination. Like the ingest source, where we write the text does not matter. We could write it back to the source or some other location - whatever is required by your use-case.

The NiFi Flow

The flow will use the GetFile processor to read /tmp/input/*.txt files. The contents of each file will be sent to Philter. The resulting filtered text will be written back to the file system at /tmp/output. (Click the image for a better view.)

Apache NiFi flow for Philter

If you want to quickly prototype it with minimal configuration, use a GenerateFlowFile processor and set the content manually to something like "His SSN was 123-45-6789."

Using GenerateFlowFile to test Philter.

InvokeHTTP Processor Configuration

The configuration of the InvokeHTTP processor is fairly simple. We just need to configure the HTTP Method, Remote URL, and Content Type. Set each as follows:

  • HTTP Method = POST
  • Remote URL = http://philter-ip:8080/api/filter
  • Content-Type = text/plain

Since we are not providing any values for the context, document ID, or filter profile name in the URL, Philter will use default values for each. When not provided, the default context is default, Philter generates a document ID per request, and the default filter profile name is default.

These default values are detailed in Philter's API documentation. A context lets you group similar documents together, perhaps by business unit or purpose. A document ID should uniquely identify a document (such as a file name) and can be used to split up large documents for processing.

If you do want to set values for one or all of those instead of using the defaults, just append them to the Remote URL: http://philter-ip:8080/api/filter?c=ctx&p=justssn. In this request, the context is set to ctx and Philter is told to use the filter profile named justssn. As a tip, you can use NiFi's expression language to parameterize the values in the URL.
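
For example, assuming the flowfile carries hypothetical attributes named context and profile, the Remote URL could be parameterized with expression language as:

http://philter-ip:8080/api/filter?c=${context}&p=${profile}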

InvokeHTTP processor configuration for Philter.

A Closer Look

If we use a LogAttribute processor, we can get some insight into what's happening. In the log output below, we can see the HTTP POST request that was made.

At the top of the log we see the filtered text from Philter. The input text from the file was "His SSN was 123-45-6789." Philter applied the default filter profile which looks for SSNs and responded with "His SSN was {{{REDACTED-ssn}}}."

(Filter profiles are very powerful and flexible configurations that let you have full control over the types of sensitive information that Philter identifies and how Philter manipulates that information when found.)

We can also see that since we did not provide a value for the document ID in the request, Philter assigned a document ID and returned it in the response in the x-document-id header.

His SSN was {{{REDACTED-ssn}}}.

--------------------------------------------------
Standard FlowFile Attributes
Key: 'entryDate'
	Value: 'Thu Feb 27 13:35:19 UTC 2020'
Key: 'lineageStartDate'
	Value: 'Thu Feb 27 13:35:11 UTC 2020'
Key: 'fileSize'
	Value: '31'
FlowFile Attribute Map Content
Key: 'Connection'
	Value: 'keep-alive'
Key: 'Content-Length'
	Value: '31'
Key: 'Content-Type'
	Value: 'text/plain;charset=UTF-8'
Key: 'Date'
	Value: 'Thu, 27 Feb 2020 13:35:19 GMT'
Key: 'Keep-Alive'
	Value: 'timeout=60'
Key: 'filename'
	Value: 'd206fc81-2c42-40ba-afbf-b5f9998b56c0'
Key: 'invokehttp.request.url'
	Value: 'http://10.1.1.221:8080/api/filter'
Key: 'invokehttp.status.code'
	Value: '200'
Key: 'invokehttp.status.message'
	Value: ''
Key: 'invokehttp.tx.id'
	Value: 'fbf2f6c0-1073-4fac-bc23-6d6a67b70423'
Key: 'mime.type'
	Value: 'text/plain;charset=UTF-8'
Key: 'path'
	Value: './'
Key: 'uuid'
	Value: '486ff4c2-6530-4e1c-aea2-e9965b86b10c'
Key: 'x-document-id'
	Value: 'fb75a2a4c164192542f89881aa8baf21'
--------------------------------------------------

Summary

Philter's API makes it easy to integrate Philter with applications like Apache NiFi. The InvokeHTTP processor native to NiFi is an ideal means of communicating with Philter.

To keep things simple, this example only considered SSNs in text. Philter supports many other types of sensitive information.

If performance is very important, there are a couple of things that can be done to help. First, Philter is stateless so you can run multiple instances of Philter behind a load balancer. Second, Philter Enterprise Edition can run natively inside an Apache NiFi flow without the need to make HTTP calls to Philter. Contact us if you would like to learn more about Philter Enterprise Edition's processor for Apache NiFi.

Philter's integration with applications like Apache NiFi is very important to us so look for more improvements and features in versions to come.



Philter 1.3.0

Today I am happy to announce the availability of Philter 1.3.0! This version includes various tweaks to improve performance, and we definitely encourage you to upgrade. It greatly lowers the time required to process text while improving the accuracy of identified information.

The only new user-facing feature is a modification to the URL filter: an option to require URLs to start with http, https, or www. This change adds a new property to the URL filter profile. All other improvements are related to the internal workings of Philter.

Look for Philter 1.3.0 to be available on the cloud marketplaces in a few days.

Philter 1.3.0 Release Notes


Philter 1.1.0

We are happy to announce Philter 1.1.0! This version brings some features we think you will find very useful; most were implemented directly from interactions with users. We look forward to future interactions to keep driving improvements!

We are very excited about this release, but we also have lots of exciting things to add in the next release and we will soon be making available Philter Studio, a free Windows application to use Philter. If you don't like managing filter profiles in JSON you will love Philter Studio!

We have begun the process of publishing Philter 1.1.0 to the cloud marketplaces and it should be available on the AWS, Azure, and GCP marketplaces in the next few days once publishing is complete. The Philter Deployment Guide walks through how to deploy Philter on each platform. You can also see the full Philter release notes.

To be notified when Philter 1.1.0 is available for deployment into your cloud, subscribe to our rarely-used mailing list below.


What's New in Philter 1.1.0

Ignore Lists

In some cases, there may be text that you never want to identify and remove as PII or PHI. An example may be an email address or telephone number of a business that is not relevant to the sensitive information in the text and removing this text may cause the document to lose meaning. Ignore lists allow you to specify a list of terms that are never removed (always ignored if found) from the documents. You can create as many ignore lists as you need and each one can contain as many terms as desired. The ignore lists are defined in the filter profile.

Here's how an ignore list is defined in a filter profile that only finds SSNs. The SSNs 123-45-6789 and 000-00-0000 will always be ignored and will remain in the documents unchanged.

{
  "name": "default",
  "identifiers": {
    "ssn": {
      "ssnFilterStrategies": [
        {
          "strategy": "REDACT",
          "redactionFormat": "{{{REDACTED-%t}}}",
          "replacementScope": "DOCUMENT"
        }
      ]
    }
  },
  "ignored": [
    {
      "name": "ignored-terms",
      "terms": [
        "123-45-6789",
        "000-00-0000"
      ]
    }
  ]
}

Custom Dictionaries

You can now have custom dictionaries of terms that are to be identified as sensitive information. With a custom dictionary you can specify a list of terms, such as names, addresses, or other information, that should always be treated as personal information. You can create as many custom dictionaries as you need and each one can contain as many terms as desired. The custom dictionaries are defined in the filter profile.

Here's how a custom dictionary can be added to a filter profile. In this example, a custom dictionary of type names-with-j is created and it contains the terms james, jim, and john. When any of these terms are found in a document they will be redacted. The dictionaries item is an array so you can have as many dictionaries as required. (The "auto" setting for the sensitivity is discussed a little further down below.)

{
  "name": "default",
  "identifiers": {
    "dictionaries": [
      {
        "type": "names-with-j",
        "terms": [
          "james",
          "jim",
          "john"
        ],
        "sensitivity": "auto",
        "customFilterStrategies": [
          {
            "strategy": "REDACT",
            "redactionFormat": "{{{REDACTED-%t}}}",
            "replacementScope": "DOCUMENT"
          }
        ]
      }
    ],
    "ssn": {
      "ssnFilterStrategies": [
        {
          "strategy": "REDACT",
          "redactionFormat": "{{{REDACTED-%t}}}",
          "replacementScope": "DOCUMENT",
          "staticReplacement": "",
          "condition": ""
        }
      ]
    }
  }
}

"Fuzziness" Calculation

We added a new fuzziness option for dictionary filters. The previous options of LOW, MEDIUM, and HIGH were found to be either not restrictive enough or too restrictive. We have added an AUTO option that automatically determines the appropriate fuzziness based on the length of the term in question. For instance, the AUTO option sets the fuzziness for a short term on the low side, while a longer term allows a higher fuzziness. We recommend using AUTO over the other options and expect it to perform better for you. The other options of LOW, MEDIUM, and HIGH are still available.

Explain API Endpoint

Philter operates as a black box: text goes in and manipulated text comes out. What happened inside? To provide insight into the black box, we have added a new API endpoint called explain. This endpoint performs text filtering but returns more information about the filtering process. The lists of identified spans (pieces of text found to be sensitive) and applied spans are both returned as objects along with attributes about each span.

Here's an example output of calling the explain API endpoint given some sample text. The original API call:

curl -k -s "https://localhost:8080/api/explain?c=C1" --data "George Washington was president and his ssn was 123-45-6789 and he lived at 90210." -H "Content-type: text/plain" 

The response from the API call:

{
  "filteredText": "{{{REDACTED-entity}}} was president and his ssn was {{{REDACTED-ssn}}} and he lived at {{{REDACTED-zip-code}}}.",
  "context": "C1",
  "documentId": "117d088b-354a-48b0-b2d7-bdf0335650d5",
  "explanation": {
    "appliedSpans": [
      {
        "id": "b7c5b777-460e-4033-8d91-0f2d3a2d6424",
        "characterStart": 0,
        "characterEnd": 17,
        "filterType": "NER_ENTITY",
        "context": "C1",
        "documentId": "117d088b-354a-48b0-b2d7-bdf0335650d5",
        "confidence": 0.9189682900905609,
        "text": "George Washington",
        "replacement": "{{{REDACTED-entity}}}"
      },
      {
        "id": "b4a2d019-b7cb-4fc7-8598-bec1904124b4",
        "characterStart": 48,
        "characterEnd": 59,
        "filterType": "SSN",
        "context": "C1",
        "documentId": "117d088b-354a-48b0-b2d7-bdf0335650d5",
        "confidence": 1,
        "text": "123-45-6789",
        "replacement": "{{{REDACTED-ssn}}}"
      },
      {
        "id": "48b10b67-6ad2-4b5a-934f-a3b4fd190618",
        "characterStart": 76,
        "characterEnd": 81,
        "filterType": "ZIP_CODE",
        "context": "C1",
        "documentId": "117d088b-354a-48b0-b2d7-bdf0335650d5",
        "confidence": 1,
        "text": "90210",
        "replacement": "{{{REDACTED-zip-code}}}"
      }
    ],
    "identifiedSpans": [
      {
        "id": "b7c5b777-460e-4033-8d91-0f2d3a2d6424",
        "characterStart": 0,
        "characterEnd": 17,
        "filterType": "NER_ENTITY",
        "context": "C1",
        "documentId": "117d088b-354a-48b0-b2d7-bdf0335650d5",
        "confidence": 0.9189682900905609,
        "text": "George Washington",
        "replacement": "{{{REDACTED-entity}}}"
      },
      {
        "id": "b4a2d019-b7cb-4fc7-8598-bec1904124b4",
        "characterStart": 48,
        "characterEnd": 59,
        "filterType": "SSN",
        "context": "C1",
        "documentId": "117d088b-354a-48b0-b2d7-bdf0335650d5",
        "confidence": 1,
        "text": "123-45-6789",
        "replacement": "{{{REDACTED-ssn}}}"
      },
      {
        "id": "48b10b67-6ad2-4b5a-934f-a3b4fd190618",
        "characterStart": 76,
        "characterEnd": 81,
        "filterType": "ZIP_CODE",
        "context": "C1",
        "documentId": "117d088b-354a-48b0-b2d7-bdf0335650d5",
        "confidence": 1,
        "text": "90210",
        "replacement": "{{{REDACTED-zip-code}}}"
      }
    ]
  }
}

In the response, each identified span is listed with some attributes.

  • id - A random UUID identifying the span.
  • characterStart - The character-based index of the start of the span.
  • characterEnd - The character-based index of the end of the span.
  • filterType - The filter that identified this span.
  • context - The given context under which this span was identified.
  • documentId - The given documentId or a randomly generated documentId if none was provided.
  • confidence - Philter's confidence that this span does in fact represent sensitive information.
  • text - The text contained within the span.
  • replacement - The value Philter used to replace the text in the document.

The User's Guide has been updated to include the explain API endpoint.

Elasticsearch

As mentioned in a previous post, Philter 1.1.0 now uses Elasticsearch to store the identified spans instead of MongoDB. Please check that post for the details, but we do want to mention again here that this change does not affect Philter's API and will be transparent to any of your existing Philter scripts or applications.

Datadog Metrics

Philter 1.1.0 adds support for sending metrics directly to Datadog.

New Metrics

Philter 1.1.0 adds new metrics for each type of filter. Now you will be able to see metrics for each type of filter in CloudWatch, JMX, and Datadog to give more insight into the types of sensitive information being found in your documents.


Philter and Elasticsearch

Philter, our application for finding and removing PII and PHI from natural language text, has the ability to optionally store the identified text in an external data store. With this feature, you have access to a complete log of Philter's actions as well as the ability to reconstruct the original text in the future if you ever need to.

In Philter 1.0, we chose MongoDB as the external data store. With just a few configuration properties, Philter would connect to MongoDB and persist all identified "spans" (the identified text, its location in the document, and some other attributes) to a MongoDB database. This worked well, but we realized that looking forward it might not be the best choice.

In Philter 1.1 we are replacing MongoDB with Elasticsearch. The functionality and the Philter APIs remain the same. The only difference is that instead of the spans being stored in a MongoDB database, they are now stored in an Elasticsearch index. So what, exactly, are the benefits? Great question.

The first benefit comes with Elasticsearch and Kibana's ability to quickly and easily make dashboards to view the indexed data. With the spans in Elasticsearch, you can make a dashboard to summarize the spans by type, text, etc., to show insights into the PII and PHI that Philter is finding and manipulating in your text.

It also became quickly apparent that a primary use-case for the store would be to query the spans it contains - for example, a query to find all documents containing "John Doe" or all documents containing a certain date or phone number. A search engine is better prepared to handle those queries.

Another consideration is licensing. Elasticsearch is available under the Apache Software License or a compatible license while MongoDB is available under a Server Side Public License.

In summary, Philter 1.1 will offer support for using Elasticsearch as the store for identified PII and PHI. Remember, the store is an optional feature of Philter; if you do not require any history of the text that Philter identifies then it is not needed. (By default, Philter's store feature is disabled and has to be explicitly enabled.) Support for using MongoDB as a store will not be available in Philter 1.1.

We are really excited about this change and the possibilities that come with it!


Filter Profile JSON Schema

Philter can find and remove many types of PII and PHI. You can select the types of PII and PHI and control how the identified values are removed or manipulated through what we call a "filter profile." A filter profile is a file that essentially lets you tell Philter what to do!

To help make creating and editing filter profiles a little bit easier, we have published the JSON schema.

https://www.mtnfog.com/filter-profile-schema.json

This JSON schema can be imported into some development tools to provide features such as validation and autocomplete. The screenshot below shows an example of adding the schema to IntelliJ. More details into the capability and features are available from the IntelliJ documentation.

Visual Studio Code and Atom (via a package) also include support for validating JSON documents per JSON schemas.

The Filter Profile Registry provides a way to centrally manage filter profiles across one or more instances of Philter.


Using AWS Kinesis Firehose Transformations to Filter Sensitive Information from Streaming Text

  • Updated 05/20/2020 to include a link to running Philter as a container and a link to the solution example.
  • Updated 04/28/2020 to include a link to CloudFormation and Terraform scripts and link to using a signed certificate with Philter.

AWS Kinesis Firehose is a managed streaming service designed to take large amounts of data from one place to another. For example, you can take data from places such as CloudWatch, AWS IoT, and custom applications using the AWS SDK to places such as Amazon S3, Amazon Redshift, Amazon Elasticsearch, and others. In this post we will use S3 as the firehose's destination.

Sometimes you want to manipulate the data as it goes through the firehose. In this blog post we will show how AWS Kinesis Firehose and AWS Lambda can be used to remove sensitive information (PII and PHI) from the text as it travels through the firehose.

Prerequisites

You must have a running instance of Philter. If you don't already have one, you can launch an instance through the AWS Marketplace or as a container. There are CloudFormation and Terraform scripts for launching a single instance of Philter or a load-balanced, auto-scaled set of Philter instances.

It's not required that the instance of Philter be running in AWS, but it is required that it be accessible from your AWS Lambda function. Running Philter and your AWS Lambda function in your own VPC allows the function to communicate locally with Philter.

There is no need to duplicate an excellent blog post on creating a Firehose Data Transformation with AWS Lambda. Instead, refer to the linked page and substitute the Python 3 code below for the code in that blog post.

Configuring the Firehose and the Lambda Function

To start, create an AWS Firehose and configure an AWS Lambda transformation. When creating the AWS Lambda function, select Python 3.7 and use the following code:

import base64

# The requests module vendored with botocore is used for convenience in the
# Lambda runtime; you could also package the requests library yourself.
from botocore.vendored import requests

def handler(event, context):

    output = []

    for record in event['records']:
        # Kinesis Firehose delivers each record's data Base64-encoded.
        payload = base64.b64decode(record["data"])
        headers = {'Content-type': 'text/plain'}
        # Send the text to Philter for filtering. verify=False ignores Philter's
        # self-signed certificate; use a signed certificate in production.
        r = requests.post("https://PHILTER_IP:8080/api/filter", verify=False, data=payload, headers=headers, timeout=20)
        filtered = r.text
        # Re-encode the filtered text and mark the record as successfully processed.
        output_record = {
            'recordId': record['recordId'],
            'result': 'Ok',
            'data': base64.b64encode(filtered.encode('utf-8') + b'\n').decode('utf-8')
        }
        output.append(output_record)

    return output

The following Kinesis Firehose test event can be used to test the function:

{
  "invocationId": "invocationIdExample",
  "deliveryStreamArn": "arn:aws:kinesis:EXAMPLE",
  "region": "us-east-1",
  "records": [
    {
      "recordId": "49546986683135544286507457936321625675700192471156785154",
      "approximateArrivalTimestamp": 1495072949453,
      "data": "R2VvcmdlIFdhc2hpbmd0b24gd2FzIHByZXNpZGVudCBhbmQgaGlzIHNzbiB3YXMgMTIzLTQ1LTY3ODkgYW5kIGhlIGxpdmVkIGF0IDkwMjEwLiBQYXRpZW50IGlkIDAwMDc2YSBhbmQgOTM4MjFhLiBIZSBpcyBvbiBiaW90aW4uIERpYWdub3NlZCB3aXRoIEEwMTAwLg=="
    },
    {
      "recordId": "49546986683135544286507457936321625675700192471156785154",
      "approximateArrivalTimestamp": 1495072949453,
      "data": "R2VvcmdlIFdhc2hpbmd0b24gd2FzIHByZXNpZGVudCBhbmQgaGlzIHNzbiB3YXMgMTIzLTQ1LTY3ODkgYW5kIGhlIGxpdmVkIGF0IDkwMjEwLiBQYXRpZW50IGlkIDAwMDc2YSBhbmQgOTM4MjFhLiBIZSBpcyBvbiBiaW90aW4uIERpYWdub3NlZCB3aXRoIEEwMTAwLg=="
    }    
  ]
}

This test event contains two records, each with Base64-encoded data containing sample text with an SSN, a ZIP code, and other sensitive values. When the test is executed, the response will be a JSON list of the filtered records similar to:

[
  "He lived in {{{REDACTED-zip-code}}} and his SSN was {{{REDACTED-ssn}}}.",
  "He lived in {{{REDACTED-zip-code}}} and his SSN was {{{REDACTED-ssn}}}."
]

When executing the test, the AWS Lambda function extracts the data from each record in the firehose and submits it to Philter for filtering. The responses from those requests are returned from the function as a JSON list. Note that in our Python function we are ignoring Philter's self-signed certificate (verify=False). It is recommended that you use a valid signed certificate for Philter.

When data is now published to the Kinesis Firehose stream, the data will be processed by the AWS Lambda function and Philter prior to exiting the firehose at its configured destination.

Processing Data

We can use the AWS CLI to publish data to our Kinesis Firehose stream called sensitive-text:

aws firehose put-record --delivery-stream-name sensitive-text --record 'Data=He lived in 90210 and his SSN was 123-45-6789.'

Check the destination S3 bucket and you will have a single object with the following line:

He lived in {{{REDACTED-zip-code}}} and his SSN was {{{REDACTED-ssn}}}.

Conclusion

In this blog post we have created an AWS Firehose pipeline that uses an AWS Lambda function to remove PII and PHI from the text in the streaming pipeline.

Resources