The Performance of Philter

In all of our design and development of Philter, performance is always one of our top priorities. We ask ourselves questions like: How can we implement this awesome new feature in Philter without negatively impacting performance? What can we do to improve performance?

However, the word “performance” can have a few different meanings when used in relation to Philter. In this post I want to dive into the “performance” of Philter and how it shapes Philter’s development.

Performance: Efficient processing of text

The first meaning of performance relates to how efficient Philter is when processing text. Philter takes input text, filters the sensitive information, and returns the output text. (That middle step is simplified a lot but hopefully you get the idea.) If any of these steps are not efficient, or performant, Philter won’t be usable. Your client applications will time out and you will not want to use Philter.

Any new features or modifications to Philter’s filtering capabilities have to be carefully designed and implemented. Even a small, seemingly innocent change can have large negative effects on performance. Because of that, we as developers must be careful and test accordingly. We use efficient data structures and deliberately select the operations that will provide the best performance.

This type of performance is typically measured in compute time and that’s how we measure it. We have thousands of test cases that we execute with each new build of Philter. Over time we can see a history of the processing time and its downward trend as Philter gets more efficient.
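As a rough illustration of measuring compute time per test case, here is a minimal sketch in Python. The `filter_text` function, the SSN pattern, the replacement token, and the test cases are all hypothetical stand-ins for this example. They are not Philter's implementation or its test suite.

```python
import re
import statistics
import time

# Hypothetical stand-in for a filtering step (not Philter's actual code).
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def filter_text(text: str) -> str:
    """Replace anything that looks like a US SSN with a placeholder."""
    return SSN_PATTERN.sub("{{{REDACTED-ssn}}}", text)


def time_cases(cases, repeats=5):
    """Return the median wall-clock seconds spent filtering each test case."""
    timings = {}
    for name, text in cases.items():
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            filter_text(text)
            samples.append(time.perf_counter() - start)
        timings[name] = statistics.median(samples)
    return timings


# Illustrative test cases only.
cases = {
    "short": "His SSN is 123-45-6789.",
    "long": "No identifiers here. " * 1000,
}
timings = time_cases(cases)
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1000:.3f} ms")
```

Recording these numbers for every build is one way to get the kind of history described above, where a regression shows up as a jump in the trend.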

Performance: Labeling the appropriate information as sensitive

The second meaning of performance may sometimes be referred to as accuracy. This meaning relates to how well Philter identifies the sensitive information in the input text. Was all the information that Philter identified actually sensitive? Were there any false positives? False negatives? This type of performance is typically measured by a percentage, or by terms from information retrieval such as precision and recall.
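For readers less familiar with precision and recall, the following sketch shows how they can be computed from a set of identified spans and a set of known-correct spans. The spans here are made-up character offsets for illustration, and exact-match comparison is a simplifying assumption.

```python
def precision_recall(predicted, actual):
    """Compute precision and recall from predicted and actual span sets.

    precision = true positives / everything predicted
    recall    = true positives / everything actually sensitive
    """
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Made-up (start, end) character offsets of sensitive spans.
actual = {(0, 10), (25, 36), (50, 62)}
predicted = {(0, 10), (25, 36), (70, 75), (80, 85)}

precision, recall = precision_recall(predicted, actual)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this example two of the four predicted spans are correct (two false positives), and one of the three actual spans was missed (one false negative).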

In some cases, Philter’s identification of sensitive information is non-deterministic, meaning statistics and machine learning algorithms are applied to locate sensitive information. Contrast this with a deterministic process such as looking for terms from a dictionary. How some of Philter’s filters identify sensitive information can be controlled through a sensitivity level. Setting the sensitivity to high will likely identify more sensitive information but also produce more false positives. Conversely, setting the sensitivity to low will likely identify fewer pieces of sensitive information and produce more false negatives. The medium sensitivity level aims to balance the two. In some cases, a false positive may be more acceptable than a false negative, so a high sensitivity level is used. For the information retrieval folks out there, this is known as maximizing recall.
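One common way to picture this trade-off is as a confidence cutoff applied to candidate matches. The sketch below assumes each candidate carries a score between 0 and 1 and that each sensitivity level maps to a cutoff. Both the mapping and the numbers are assumptions made for illustration. They are not a description of how Philter implements sensitivity levels.

```python
# Assumed cutoffs for illustration only: higher sensitivity, lower cutoff.
THRESHOLDS = {"low": 0.9, "medium": 0.7, "high": 0.5}


def select(candidates, sensitivity):
    """Keep the candidates whose score meets the cutoff for this level."""
    cutoff = THRESHOLDS[sensitivity]
    return [text for text, score in candidates if score >= cutoff]


# Hypothetical candidate names with made-up confidence scores.
candidates = [
    ("George Washington", 0.95),
    ("Virginia", 0.75),
    ("May", 0.55),
]

for level in ("low", "medium", "high"):
    print(level, select(candidates, level))
```

At the high level every candidate is kept, which favors recall at the cost of possible false positives such as "May" the month. At the low level only the most confident match survives, which favors precision at the cost of possible false negatives.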

For sensitive information like people’s names, we offer various models trained for specific domains. The purpose of this is to provide a higher level of accuracy when Philter is used in those domains.

Putting them together

Philter’s “performance” is both of these. Philter must perform well in terms of time and processing efficiency as well as finding the appropriate sensitive information. We believe that both types are equally important. A system that takes hours to complete but with more accuracy may be just as unusable as a system that completes in milliseconds but finds no sensitive information.

If you are not yet using Philter to find and remove sensitive information from your text it’s easy to get started. Just click on your platform of choice below. And if you need help please don’t hesitate to reach out. We enjoy helping.

AWS Marketplace