OpenNLP’s RegexNameFinder and Tokenizing

OpenNLP’s RegexNameFinder takes one or more regular expressions and uses them to extract entities from the input text. This is useful when the things you want to extract follow a set format, such as phone numbers and email addresses. Be careful how you tokenize the input, though, because the tokenization can affect the RegexNameFinder’s accuracy.

The RegexNameFinder is simple to use. Here’s an example borrowed from an OpenNLP test case.

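import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

import opennlp.tools.namefind.RegexNameFinder;
import opennlp.tools.util.Span;

// The pattern to look for, and the input sentence already split into tokens.
Pattern testPattern = Pattern.compile("test");
String[] sentence = new String[]{"a", "test", "b", "c"};

// Map each entity type to the patterns that identify it.
Pattern[] patterns = new Pattern[]{testPattern};
Map<String, Pattern[]> regexMap = new HashMap<>();
String type = "testtype";

regexMap.put(type, patterns);

RegexNameFinder finder = new RegexNameFinder(regexMap);

// Each Span holds token indices (start inclusive, end exclusive) and the entity type.
Span[] result = finder.find(sentence);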

The sentence variable is an array of tokens. In the example above the tokens are set manually. In a more likely scenario the text would arrive as the single string “a test b c”, and it would be up to the application to tokenize it into {“a”, “test”, “b”, “c”}.

There are three types of tokenizers available in OpenNLP: the WhitespaceTokenizer, the SimpleTokenizer, and a tokenizer (TokenizerME) that uses a token model you have trained. The WhitespaceTokenizer works on, you guessed it, white space: the locations of white space in the string are used to split it into tokens. The SimpleTokenizer looks at character classes, such as letters and numbers, and starts a new token where the class changes.
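To make the difference between the two rule-based strategies concrete, here is a rough sketch in plain Java. It is not OpenNLP’s code: the class and method names (TokenizerSketch, whitespaceTokens, charClassTokens) are made up for illustration, and the character-class rules are an assumption about how such a tokenizer behaves, so OpenNLP’s SimpleTokenizer may differ in details.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative stand-ins only; these are not OpenNLP classes.
final class TokenizerSketch {

    private enum CharClass { WHITESPACE, LETTER, DIGIT, OTHER }

    private static CharClass classify(char c) {
        if (Character.isWhitespace(c)) return CharClass.WHITESPACE;
        if (Character.isLetter(c)) return CharClass.LETTER;
        if (Character.isDigit(c)) return CharClass.DIGIT;
        return CharClass.OTHER;
    }

    // Splits on runs of white space and nothing else.
    static List<String> whitespaceTokens(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) return new ArrayList<>();
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // Starts a new token whenever the character class changes.
    // Assumption: every punctuation or symbol character becomes its own token.
    static List<String> charClassTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        CharClass previous = CharClass.WHITESPACE;
        for (char c : text.toCharArray()) {
            CharClass cls = classify(c);
            boolean boundary = cls != previous || cls == CharClass.OTHER;
            if (boundary && current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            if (cls != CharClass.WHITESPACE) current.append(c);
            previous = cls;
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }

    public static void main(String[] args) {
        String text = "My email address is me@me.com.";
        System.out.println(whitespaceTokens(text));
        System.out.println(charClassTokens(text));
    }
}
```

In this sketch the whitespace version keeps “me@me.com.” together as one token, while the character-class version breaks it apart at every symbol.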

Let’s take the example string “My email address is me@me.com and I like Gmail.” Using the WhitespaceTokenizer the tokens are {“My”, “email”, “address”, “is”, “me@me.com”, “and”, “I”, “like”, “Gmail.”}. If we use the RegexNameFinder with a regular expression that matches an email address, OpenNLP will return to us the span covering “me@me.com”. Works great!

However, let’s consider the sentence “My email address is me@me.com.” Using the WhitespaceTokenizer again, the tokens are {“My”, “email”, “address”, “is”, “me@me.com.”}. Notice that the last token includes the sentence’s period. Our regular expression for an email address will not match “me@me.com.” because, with the trailing period, it is not a valid email address. Using the SimpleTokenizer doesn’t give any better results, because it breaks the address itself apart at the “@” and “.” characters.
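The effect can be reproduced without OpenNLP. The following sketch is a simplified stand-in for the name finder: it splits the text on white space and reports the tokens that match an email pattern in full. The class name, the helper method, and the deliberately loose email pattern are all assumptions made for illustration; the RegexNameFinder’s own matching logic is more involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative stand-in only; this is not how OpenNLP is implemented.
final class WholeTokenMatchSketch {

    // Deliberately simple email pattern, for illustration only.
    static final Pattern EMAIL = Pattern.compile(
        "[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)*\\.[A-Za-z]{2,}");

    // Returns the whitespace-delimited tokens that match the pattern in full.
    static List<String> matchingTokens(String text, Pattern pattern) {
        List<String> hits = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (pattern.matcher(token).matches()) {
                hits.add(token);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // The address is its own token here, so it is found.
        System.out.println(matchingTokens(
            "My email address is me@me.com and I like Gmail.", EMAIL));
        // Here the period is glued to the address, so nothing is found.
        System.out.println(matchingTokens(
            "My email address is me@me.com.", EMAIL));
    }
}
```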

How to work around this is up to you. You could write a custom tokenizer by implementing the Tokenizer interface, try using a token model, or massage your text before it is passed to the tokenizer.
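As one possible take on the last option, here is a minimal sketch that inserts a space before punctuation sitting at the end of a word, so that a whitespace-based tokenizer sees the punctuation as a separate token. The class and method names are hypothetical, and the punctuation rule is an assumption that will mishandle abbreviations such as “e.g.” and runs of punctuation such as “...”, so treat it as a starting point rather than a complete solution.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical pre-processing helper; not part of OpenNLP.
final class PreTokenizeSketch {

    // Inserts a space before a punctuation mark that is followed by
    // white space or by the end of the text. Punctuation inside a word,
    // such as the dot in "me.com", is left alone.
    static String detachTrailingPunctuation(String text) {
        return text.replaceAll("([.,;:!?])(?=\\s|$)", " $1");
    }

    // Whitespace tokenization applied to the massaged text.
    static List<String> tokens(String text) {
        return Arrays.asList(detachTrailingPunctuation(text).trim().split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(tokens("My email address is me@me.com."));
    }
}
```

With the period detached, the address becomes a token of its own, which gives a regular expression for email addresses something it can match. The same massaged string could instead be handed to OpenNLP’s WhitespaceTokenizer.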
