Philter 1.1.0

We are happy to announce Philter 1.1.0! Most of the features in this version came directly from conversations with users, so we think you will find them very useful. We look forward to more of those conversations driving future improvements!

We are very excited about this release, and we have more to add in the next one. We will also soon be making available Philter Studio, a free Windows application for using Philter. If you don’t like managing filter profiles in JSON, you will love Philter Studio!

We have begun publishing Philter 1.1.0 to the cloud marketplaces, and it should be available on the AWS, Azure, and GCP marketplaces within the next few days. The Philter Deployment Guide walks through how to deploy Philter on each platform. You can also see the full Philter release notes.

To be notified when Philter 1.1.0 is available for deployment into your cloud, subscribe to our low-volume mailing list below.

 

What’s New in Philter 1.1.0

Ignore Lists

In some cases, there is text that you never want identified and removed as PII or PHI. An example is the email address or telephone number of a business: it is not relevant to the sensitive information in the text, and removing it may cause the document to lose meaning. Ignore lists let you specify terms that are never removed from documents, even when a filter finds them. You can create as many ignore lists as you need, and each one can contain as many terms as desired. The ignore lists are defined in the filter profile.

Here’s how an ignore list is defined in a filter profile that only finds SSNs. The SSNs 123-45-6789 and 000-00-0000 will always be ignored and will remain in the documents unchanged.

{
  "name": "default",
  "identifiers": {
    "ssn": {
      "ssnFilterStrategies": [
        {
          "strategy": "REDACT",
          "redactionFormat": "{{{REDACTED-%t}}}",
          "replacementScope": "DOCUMENT"
        }
      ]
    }
  },
  "ignored": [
    {
      "name": "ignored-terms",
      "terms": [
        "123-45-6789",
        "000-00-0000"
      ]
    }
  ]
}
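To make the behavior concrete, here is a minimal Python sketch of what an ignore list does. It is not Philter's implementation: the regular expression is a hypothetical stand-in for the SSN filter, the function name `redact_ssns` is made up for illustration, exact string matching of ignored terms is an assumption, and the substitution of `%t` with the filter type is inferred from the example output later in this post.

```python
import re

# Hypothetical stand-in for an SSN filter; Philter's real detection logic
# is not described in this post.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact_ssns(text, ignored_terms, redaction_format="{{{REDACTED-%t}}}"):
    """Redact SSN-like strings unless they appear in the ignore list."""
    ignored = set(ignored_terms)
    # Assumption: %t in the redaction format stands for the filter type.
    replacement = redaction_format.replace("%t", "ssn")

    def substitute(match):
        # An ignored term stays in the document unchanged.
        if match.group(0) in ignored:
            return match.group(0)
        return replacement

    return SSN_PATTERN.sub(substitute, text)


# Terms taken from the "ignored" section of the profile above.
ignored_terms = ["123-45-6789", "000-00-0000"]
print(redact_ssns("Test SSN 123-45-6789, real SSN 987-65-4321.", ignored_terms))
```

In this sketch the ignored SSN passes through untouched while the other one is replaced, which mirrors the behavior the profile above describes.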

Custom Dictionaries

You can now have custom dictionaries of terms that are to be identified as sensitive information. With a custom dictionary you can specify a list of terms, such as names, addresses, or other information, that should always be treated as personal information. You can create as many custom dictionaries as you need and each one can contain as many terms as desired. The custom dictionaries are defined in the filter profile.

Here’s how a custom dictionary can be added to a filter profile. In this example, a custom dictionary of type names-with-j is created, containing the terms james, jim, and john. When any of these terms is found in a document, it will be redacted. The dictionaries item is an array, so you can have as many dictionaries as required. (The “auto” setting for the sensitivity is discussed further below.)

{
  "name": "default",
  "identifiers": {
    "dictionaries": [
      {
        "type": "names-with-j",
        "terms": [
          "james",
          "jim",
          "john"
        ],
        "sensitivity": "auto",
        "customFilterStrategies": [
          {
            "strategy": "REDACT",
            "redactionFormat": "{{{REDACTED-%t}}}",
            "replacementScope": "DOCUMENT"
          }
        ]
      }
    ],
    "ssn": {
      "ssnFilterStrategies": [
        {
          "strategy": "REDACT",
          "redactionFormat": "{{{REDACTED-%t}}}",
          "replacementScope": "DOCUMENT",
          "staticReplacement": "",
          "condition": ""
        }
      ]
    }
  }
}
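If you would rather not edit the JSON by hand, a profile like the one above can be generated from a script, which also guarantees the brackets and braces balance. The following is only a sketch: the helper name `build_profile` is hypothetical, and the keys are copied from the example above rather than from a full schema.

```python
import json


def build_profile(name, dictionaries):
    """Build a filter profile with one custom dictionary per (type, terms) pair."""
    strategy = {
        "strategy": "REDACT",
        "redactionFormat": "{{{REDACTED-%t}}}",
        "replacementScope": "DOCUMENT",
    }
    return {
        "name": name,
        "identifiers": {
            "dictionaries": [
                {
                    "type": dict_type,
                    "terms": list(terms),
                    "sensitivity": "auto",
                    # Each dictionary gets its own copy of the strategy.
                    "customFilterStrategies": [dict(strategy)],
                }
                for dict_type, terms in dictionaries.items()
            ]
        },
    }


profile = build_profile("default", {"names-with-j": ["james", "jim", "john"]})
print(json.dumps(profile, indent=2))
```

Because `json.dumps` serializes a Python structure, the output is always well-formed JSON, whatever number of dictionaries you pass in.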

“Fuzziness” Calculation

We added a new fuzziness option for dictionary filters. The previous options of LOW, MEDIUM, and HIGH were found to be either not restrictive enough or too restrictive. The new AUTO option automatically determines the appropriate fuzziness based on the length of the term in question. For instance, AUTO keeps the fuzziness for a short term on the low side, while a longer term is allowed a higher fuzziness. We recommend using AUTO over the other options and expect it to perform better for you. The other options of LOW, MEDIUM, and HIGH are still available.
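The idea of length-based fuzziness can be sketched in a few lines of Python. This is an illustration only, not Philter's algorithm: it measures fuzziness as Levenshtein edit distance, and the length thresholds in `auto_max_edits` are assumptions borrowed from the AUTO fuzziness convention in Elasticsearch, not values documented for Philter.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed one row of the DP table at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def auto_max_edits(term):
    """Allowed edits grow with term length (assumed thresholds)."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


def fuzzy_match(token, term):
    """True if token is within the length-based edit budget of term."""
    return edit_distance(token.lower(), term.lower()) <= auto_max_edits(term)


print(fuzzy_match("jon", "john"))
print(fuzzy_match("jane", "john"))
```

With these assumed thresholds, a one-letter slip in a short name still matches, while an unrelated short name does not, because short terms get a small edit budget.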

Explain API Endpoint

Philter operates as a black box. Text goes in and manipulated text comes out. What happened inside? To help provide insight into the black box, we have added a new API endpoint called explain. This endpoint performs text filtering but returns more information on the filtering process. The list of identified spans (pieces of text found to be sensitive) and applied spans are both returned as objects along with attributes about each span.

Here’s an example output of calling the explain API endpoint given some sample text. The original API call:

curl -k -s "https://localhost:8080/api/explain?c=C1" --data "George Washington was president and his ssn was 123-45-6789 and he lived at 90210." -H "Content-type: text/plain" 

The response from the API call:

{
  "filteredText": "{{{REDACTED-entity}}} was president and his ssn was {{{REDACTED-ssn}}} and he lived at {{{REDACTED-zip-code}}}.",
  "context": "C1",
  "documentId": "117d088b-354a-48b0-b2d7-bdf0335650d5",
  "explanation": {
    "appliedSpans": [
      {
        "id": "b7c5b777-460e-4033-8d91-0f2d3a2d6424",
        "characterStart": 0,
        "characterEnd": 17,
        "filterType": "NER_ENTITY",
        "context": "C1",
        "documentId": "117d088b-354a-48b0-b2d7-bdf0335650d5",
        "confidence": 0.9189682900905609,
        "text": "George Washington",
        "replacement": "{{{REDACTED-entity}}}"
      },
      {
        "id": "b4a2d019-b7cb-4fc7-8598-bec1904124b4",
        "characterStart": 48,
        "characterEnd": 59,
        "filterType": "SSN",
        "context": "C1",
        "documentId": "117d088b-354a-48b0-b2d7-bdf0335650d5",
        "confidence": 1,
        "text": "123-45-6789",
        "replacement": "{{{REDACTED-ssn}}}"
      },
      {
        "id": "48b10b67-6ad2-4b5a-934f-a3b4fd190618",
        "characterStart": 76,
        "characterEnd": 81,
        "filterType": "ZIP_CODE",
        "context": "C1",
        "documentId": "117d088b-354a-48b0-b2d7-bdf0335650d5",
        "confidence": 1,
        "text": "90210",
        "replacement": "{{{REDACTED-zip-code}}}"
      }
    ],
    "identifiedSpans": [
      {
        "id": "b7c5b777-460e-4033-8d91-0f2d3a2d6424",
        "characterStart": 0,
        "characterEnd": 17,
        "filterType": "NER_ENTITY",
        "context": "C1",
        "documentId": "117d088b-354a-48b0-b2d7-bdf0335650d5",
        "confidence": 0.9189682900905609,
        "text": "George Washington",
        "replacement": "{{{REDACTED-entity}}}"
      },
      {
        "id": "b4a2d019-b7cb-4fc7-8598-bec1904124b4",
        "characterStart": 48,
        "characterEnd": 59,
        "filterType": "SSN",
        "context": "C1",
        "documentId": "117d088b-354a-48b0-b2d7-bdf0335650d5",
        "confidence": 1,
        "text": "123-45-6789",
        "replacement": "{{{REDACTED-ssn}}}"
      },
      {
        "id": "48b10b67-6ad2-4b5a-934f-a3b4fd190618",
        "characterStart": 76,
        "characterEnd": 81,
        "filterType": "ZIP_CODE",
        "context": "C1",
        "documentId": "117d088b-354a-48b0-b2d7-bdf0335650d5",
        "confidence": 1,
        "text": "90210",
        "replacement": "{{{REDACTED-zip-code}}}"
      }
    ]
  }
}

In the response, each span is listed with the following attributes.

  • id – A random UUID identifying the span.
  • characterStart – The character-based index of the start of the span.
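  • characterEnd – The character-based index of the end of the span. In the example above the end index is exclusive: “George Washington” is 17 characters long and spans 0 to 17.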
  • filterType – The filter that identified this span.
  • context – The given context under which this span was identified.
  • documentId – The given documentId or a randomly generated documentId if none was provided.
  • confidence – Philter’s confidence that the text does in fact represent sensitive information of the identified type.
  • text – The text contained within the span.
  • replacement – The value which Philter used to replace the text in the document.
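One way to check your understanding of these attributes is to rebuild the filtered text from the applied spans. The Python sketch below uses a trimmed copy of the appliedSpans from the example response (only the fields it needs) and the original sample text; the function name `apply_spans` is hypothetical and is not part of Philter.

```python
original = ("George Washington was president and his ssn was 123-45-6789 "
            "and he lived at 90210.")

# Trimmed copy of appliedSpans from the example response above.
applied_spans = [
    {"characterStart": 0, "characterEnd": 17,
     "text": "George Washington", "replacement": "{{{REDACTED-entity}}}"},
    {"characterStart": 48, "characterEnd": 59,
     "text": "123-45-6789", "replacement": "{{{REDACTED-ssn}}}"},
    {"characterStart": 76, "characterEnd": 81,
     "text": "90210", "replacement": "{{{REDACTED-zip-code}}}"},
]


def apply_spans(text, spans):
    """Replace each span's text with its replacement value."""
    # Work from the last span to the first so that earlier offsets
    # remain valid after each replacement changes the text length.
    for span in sorted(spans, key=lambda s: s["characterStart"], reverse=True):
        start, end = span["characterStart"], span["characterEnd"]
        # characterEnd is treated as exclusive, as in the example response.
        if text[start:end] != span["text"]:
            raise ValueError(f"span text mismatch at {start}-{end}")
        text = text[:start] + span["replacement"] + text[end:]
    return text


print(apply_spans(original, applied_spans))
```

The printed result matches the filteredText value in the example response, which confirms that the offsets refer to the original text and that the end index is exclusive.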

The User’s Guide has been updated to include the explain API endpoint.

Elasticsearch

As mentioned in a previous post, Philter 1.1.0 now uses Elasticsearch instead of MongoDB to store the identified spans. Please check that post for the details. We do want to mention again here that this change does not affect Philter’s API and will be transparent to any of your existing Philter scripts or applications.

Datadog Metrics

Philter 1.1.0 adds support for sending metrics directly to Datadog.

New Metrics

Philter 1.1.0 adds new metrics for each type of filter. You can view them in CloudWatch, JMX, and Datadog to get more insight into the types of sensitive information being found in your documents.