Philter's Custom Dictionary Filter and "Fuzziness"

Philter finds sensitive information in text based on a set of filters that you configure in a filter profile. Some of these filters are for predefined information like SSNs, phone numbers, and names. But sometimes you also have a list of terms specific to your use case that you want to identify. Philter's custom dictionary filter lets you specify a list of terms to label as sensitive information when found in your text.

You can learn more about the custom dictionary filter and all of its properties in the Philter User's Guide.

Philter 1.6.0 adds a new property called "fuzzy" to the custom dictionary filter. The "fuzzy" property accepts a value of true or false. When set to false, text being processed must match an item in the dictionary exactly to be labeled as sensitive information. When set to true, the text does not have to match exactly, so text containing misspellings and typos can still be labeled as sensitive information. In this blog post we take a closer look at how the "fuzziness" works, how it is applied, and the trade-offs of using it.

Also new in Philter 1.6.0 is the ability to provide the custom dictionary filter a path to a file that contains the terms. This way you don't have to include your terms directly in the filter profile.

Sample Filter Profile

To start, here's a simple filter profile that includes a custom dictionary filter. The dictionary contains three terms (john, jane, doe) and fuzziness is enabled with medium sensitivity. When any of those terms is found, it will be redacted with the pattern {{{REDACTED-%t}}}, where %t is replaced by the type, which in this case is custom-dictionary.

{
   "name": "dictionary-example",
   "identifiers": {
      "dictionaries": [
         {
            "customDictionary": {
               "terms": ["john", "jane", "doe"],
               "fuzzy": true,
               "sensitivity": "medium",
               "customDictionaryFilterStrategies": [
                  {
                     "strategy": "REDACT",
                     "redactionFormat": "{{{REDACTED-%t}}}"
                  }
               ]
            }
         }
      ]
   }
}
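
To make the redaction format concrete, here is a minimal Python sketch of how a format like {{{REDACTED-%t}}} could be expanded and applied to exact matches. It is only an illustration and not Philter's implementation: the function name redact, the case-sensitive whole-word matching, and the default arguments are all assumptions made for this example.

```python
import re

def redact(text, terms, redaction_format="{{{REDACTED-%t}}}",
           filter_type="custom-dictionary"):
    """Replace each exact, whole-word dictionary term with the rendered format.

    Hypothetical helper for illustration; matching here is case-sensitive,
    which is an assumption and not documented Philter behavior.
    """
    # %t in the format is replaced by the filter type.
    replacement = redaction_format.replace("%t", filter_type)
    # One alternation of escaped terms, bounded by word boundaries.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(term) for term in terms) + r")\b"
    )
    # A function is used so the replacement string is inserted literally.
    return pattern.sub(lambda _match: replacement, text)

print(redact("call john or jane today", ["john", "jane", "doe"]))
# call {{{REDACTED-custom-dictionary}}} or {{{REDACTED-custom-dictionary}}} today
```

Because this sketch only matches exactly, a misspelling such as "jon" passes through unchanged.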

No fuzziness

We will start by describing what happens when the "fuzzy" property is set to false. This is the default behavior and is consistent with how Philter behaved prior to version 1.6.0. Items in the custom dictionary have to be found in the text exactly as they are in the dictionary. This means "John" is not the same as "Jon."

Disabling fuzziness is more efficient and will provide better performance. That's really all you need to know. But if you like getting into the details of things, read on! Internally, Philter uses an algorithm based on what's known as a bloom filter to efficiently scan a dictionary for matches. A bloom filter "is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set." In this case, the set is the list of terms in your dictionary and an element is each word from the input text. The bloom filter provides an efficient means of determining whether a given word is a term in your dictionary that you want identified as sensitive information.

A digression into bloom filters

Just to clarify, when we talk about Philter we talk a lot about "filters", such as a filter for SSNs, a filter for phone numbers, and so on. A bloom filter is not a filter like that. A bloom filter is an algorithm that provides an efficient means of asking the question "Does this item potentially exist in this dictionary?" A bloom filter will answer "yes, it might" or "no, it does not." Notice the response of "yes, it might." The bloom filter is not saying "yes." Instead, it is saying "yes, it might." It's then up to the programmer to find out definitively whether that item exists in the dictionary. That's essentially how a bloom filter works.
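
The "yes, it might" versus "no, it does not" behavior can be sketched in a few lines of Python. This is a toy bloom filter written for this post, not the one Philter uses: the class name, the bit-array size, the number of hashes, and the use of salted SHA-256 are all assumptions chosen for readability.

```python
import hashlib

class BloomFilter:
    """Toy bloom filter for illustration; sizes and hashing are assumptions."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive num_hashes positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means "no, it does not"; True means only "yes, it might".
        return all(self.bits[pos] for pos in self._positions(item))

terms = {"john", "jane", "doe"}
bloom = BloomFilter()
for term in terms:
    bloom.add(term)

def is_dictionary_term(word):
    # The cheap bloom check runs first; a "might" answer is then
    # confirmed against the real set of terms.
    return bloom.might_contain(word) and word in terms

print(is_dictionary_term("john"))  # True
print(is_dictionary_term("jon"))   # False
```

The design point is that most words in ordinary text are not dictionary terms, so the inexpensive "no, it does not" answer lets the scan skip the definitive lookup for the bulk of the input.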

Yes, fuzziness!

Enabling fuzziness on a custom dictionary filter works differently. As Philter scans the input text, it not only considers the words or phrases themselves, but Philter also considers derivations of the words and phrases. When fuzziness is enabled, "John" may be the same as "Jon." Enabling fuzziness by setting the "fuzzy" property to true can be useful when you are concerned about misspellings or different spellings of terms in your text.

You can control the level of acceptable fuzziness by setting the "sensitivityLevel" property. Valid values are "low", "medium", and "high." The difference between "Jon" and "John" is considered low, while the difference between "Jon" and "Johnny" is considered high. You can use the sensitivityLevel to find a level of fuzziness appropriate for your custom dictionary and your text. The default sensitivityLevel when not specified is "high."
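
One way to picture these levels is as a cap on the edit distance between a word in the text and a dictionary term. The Python sketch below uses Levenshtein distance for that purpose. This post does not say which measure or thresholds Philter uses internally, so the choice of Levenshtein distance, the MAX_DISTANCE mapping, and the function names are all assumptions made for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming over two rows."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]

# Hypothetical mapping from sensitivity level to the largest allowed distance.
MAX_DISTANCE = {"low": 1, "medium": 2, "high": 3}

def fuzzy_match(word, terms, sensitivity="high"):
    """Return True if word is within the allowed distance of any term."""
    limit = MAX_DISTANCE[sensitivity]
    # Lowercasing before comparing is also an assumption of this sketch.
    return any(
        edit_distance(word.lower(), term.lower()) <= limit for term in terms
    )

print(edit_distance("jon", "john"))    # 1
print(edit_distance("jon", "johnny"))  # 3
print(fuzzy_match("Jon", ["john"], sensitivity="low"))     # True
print(fuzzy_match("Jon", ["johnny"], sensitivity="low"))   # False
print(fuzzy_match("Jon", ["johnny"], sensitivity="high"))  # True
```

Under these assumed thresholds, "Jon" matches "john" at every level, while "Jon" only matches "johnny" at the high level, which mirrors the low and high examples above.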

One important limitation: when fuzziness is disabled, the custom dictionary can currently contain only single words. Phrases are not permitted as dictionary terms in Philter 1.6.0 but are allowed in the upcoming version 1.7.0. The internals of that change are interesting enough for their own blog post!

Summary

To summarize:

  • Setting fuzzy to false (the default setting) for the custom dictionary filter will provide better performance, but terms in the custom dictionary must match exactly and only words (not phrases) are allowed in the dictionary.
  • Setting fuzzy to true allows the custom dictionary filter to be able to identify misspellings and different spellings of terms in the custom dictionary filter at the cost of performance. Use the sensitivityLevel values of low, medium, and high to control the allowed level of fuzziness.

Not yet using Philter?

Join our users across the healthcare, financial, legal, and other industries in using Philter to find and remove sensitive information from your text. Click on your platform below to get started.

AWS Marketplace