Challenges in Finding Sensitive Information and Content in Text

Finding sensitive information and content in text has been a problem for as long as text has existed. But in the past few years due to the availability of cheaper data storage and streaming systems, finding sensitive information in text has become nearly a universal need across all industries. Today, systems that process streaming text often need to filter out any information considered sensitive directly in the pipeline to ensure the downstream applications have immediate access to the sanitized text. Streaming platforms are commonly used in industries such as healthcare and banking where the data can contain large amounts of sensitive information.

What is “sensitive information”?

Taking a step back, what is sensitive information? Sensitive information is simply any information that you or your organization deems as being sensitive. There are some global types of sensitive information such as personally identifiable information (PII) and protected health information (PHI). These types of sensitive information, among others, are typically regulated in how the information must be stored, transmitted, or used. But it is common for other types of information to be sensitive for your organization. This could be a list of terms, phrases, locations, or other information important to your organization. Simply put, if you consider it sensitive then it is sensitive.

Structured vs. Unstructured

It’s important to note we are talking about unstructured, natural language text. Text in structured formats like XML or JSON are typically simpler to manipulate due to the inherent structure of the text. But in unstructured text we don’t have the convenience of being told what is a “person’s name” like an XML tag <personName> would do for us. There’s generally three ways to find sensitive information in unstructured text.

Three Methods of Finding Sensitive Information in Text

The first method is to look for sensitive information that follows well-defined patterns. This is information like US social security numbers and phone numbers. Even though regular expressions are not a lot of fun, we can easily enough write regular expressions to match social security numbers and phone numbers. Once we have the regular expressions it’s straightforward to apply the regular expression to the input text to find pieces of the text matching the patterns.

The second method is to look for sensitive information that can be found in a dictionary or database. This method works well for geographic locations and for information that you might have stored in your database or spreadsheet, such as a column of person’s names. Once the list is accessible, again, it is fairly straightforward to look for those items in the text.

The third, and last, method is to employ the techniques of natural language processing (NLP). The technology and tools provided by the NLP ecosystem give us powerful ways to analyze unstructured text. We can use NLP to find sensitive information that does not follow well-defined patterns or is not referenced in a database column or spreadsheet, such as person’s or organization’s names. The past few years have seen remarkable advancements in NLP allowing these techniques to be able to analyze the text with great success.

Deterministic and Non-deterministic

The first two methods are deterministic. Finding text that matches a pattern and finding text contained in a dictionary is a pass/fail scenario – you either find the text you are looking for or you do not. The third method, NLP, is not deterministic. NLP uses trained models to be able to analyze the text. When an NLP method finds information in text it will have an associated confidence value that tells us just how sure the algorithms are that the associated information is what we are looking for.

Introducing Philter

PhilterPhilter is our software product that implements these three methods of identifying sensitive information in text. Philter supports finding, identifying, and removing sensitive information in text. You set the types of information you consider sensitive and then send text to Philter. The filtered text without the sensitive information is returned to you. With Philter you have full control over how the sensitive information is manipulated – you can redact it, replace it with random values, encrypt it, and more.

Often, sensitive information can follow the same pattern. For example, a US social security number is a 9 digit number. Many driver’s license numbers can also be 9 digit numbers. Philter can disambiguate between a social security number and a driver’s license number based on the number is used in the text. When using a dictionary we can’t forget about misspellings. If we simply look for words in the dictionary we may not find a name that has been misspelled. Philter supports fuzzy searching by looking for misspellings when applying a dictionary-based filter.

This isn’t nearly all Philter can do but it is some of the more exciting features to date. Take Philter for a test drive on the cloud of your choice. We’d be happy to walk you through it if you would like!

 Philter VersionBase OS
Launch Philter on AWS (all regions including GovCloud)1.5.0Amazon Linux 2
Launch Philter on Azure1.5.0CentOS 7.7
Launch Philter on Google Cloud Compute1.5.0CentOS 8

Jeff Zemerick is the founder of Mountain Fog. You can contact him at jeff.zemerick@mtnfog.com or on LinkedIn.