String Tokenization with OpenNLP

OpenNLP is an open-source library for performing various NLP functions. One of those functions is string tokenization. With OpenNLP’s tokenizers you can break text into its individual tokens. For example, given the text “George Washington was president” the tokens are [“George”, “Washington”, “was”, “president”].

If you don’t want the trouble of making your own project, look at Sonnet Tokenization Engine. Sonnet, for short, performs text tokenization via a REST API. It is available on the AWS and Azure marketplaces.

A lot of NLP functions operate on tokenized text, so tokenization is an important part of an NLP pipeline. In this post we will use OpenNLP to tokenize some text. At the time of writing, the current version of OpenNLP is 1.8.3.

The tokenizers in OpenNLP are located under the opennlp.tools.tokenize package. This package contains three important classes:

  • WhitespaceTokenizer
  • SimpleTokenizer
  • TokenizerME

The WhitespaceTokenizer does exactly what its name implies – it breaks text into tokens based on the presence of whitespace in the text. The SimpleTokenizer is a little bit smarter. It tokenizes text based on the character classes in the text. Lastly, the TokenizerME performs tokenization using a trained token model. As long as you have data to train your own model, this is the class you should use as it will give the best performance. All three classes implement the Tokenizer interface.
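To make the difference between the first two strategies concrete, here is a plain-Java sketch (not OpenNLP’s actual implementation) of the two approaches. Splitting on whitespace leaves punctuation attached to words, while splitting on character-class changes separates it:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TokenizerSketch {

    // Whitespace-style: split on runs of whitespace; punctuation stays attached.
    static String[] whitespaceTokenize(String text) {
        return text.trim().split("\\s+");
    }

    // Character-class-style: start a new token whenever the character class
    // (letter, digit, other) changes, and drop the whitespace itself.
    static String[] classTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int lastClass = -1;
        for (char c : text.toCharArray()) {
            int cls = Character.isWhitespace(c) ? 0
                    : Character.isLetter(c) ? 1
                    : Character.isDigit(c) ? 2 : 3;
            if (cls != lastClass && current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            if (cls != 0) {
                current.append(c);
            }
            lastClass = cls;
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String text = "George Washington was president.";
        System.out.println(Arrays.toString(whitespaceTokenize(text)));
        // [George, Washington, was, president.]
        System.out.println(Arrays.toString(classTokenize(text)));
        // [George, Washington, was, president, .]
    }
}
```

Note how the whitespace version keeps the trailing period glued to “president.” while the character-class version splits it off as its own token.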

You can include the OpenNLP dependency in your project:

<dependency>
 <groupId>org.apache.opennlp</groupId>
 <artifactId>opennlp-tools</artifactId>
 <version>1.8.3</version>
</dependency>
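If your project uses Gradle instead of Maven, the equivalent dependency declaration would be (version assumed to match the Maven example above):

```
compile 'org.apache.opennlp:opennlp-tools:1.8.3'
```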

The WhitespaceTokenizer and SimpleTokenizer can be used in a very similar manner:

SimpleTokenizer tokenizer = SimpleTokenizer.INSTANCE;
String tokens[] = tokenizer.tokenize("George Washington was president.");

And the WhitespaceTokenizer:

WhitespaceTokenizer tokenizer = WhitespaceTokenizer.INSTANCE;
String tokens[] = tokenizer.tokenize("George Washington was president.");

The tokenize() function takes a string and returns a string array containing the tokens of the string.

As mentioned earlier, the TokenizerME class uses a trained model to tokenize text. This is much more fun than the previous examples. To use this class we first load a token model from the disk. We are going to use the en-token.bin model file available here. Note that these models are really only good for testing since the text they were trained from is likely different from the text you will be using.

To start we load the model into an input stream from the disk:

InputStream inputStream = new FileInputStream("/path/to/en-token.bin");
TokenizerModel model = new TokenizerModel(inputStream);

Now we can instantiate a Tokenizer from the model:

TokenizerME tokenizer = new TokenizerME(model);

Since TokenizerME implements the Tokenizer interface it works just like the SimpleTokenizer and WhitespaceTokenizer:

String tokens[] = tokenizer.tokenize("George Washington was president.");

The tokenizer will tokenize the text using the trained model and return the tokens in the string array. Pretty cool, right?
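Because TokenizerME is model-based, it can also tell you how confident it was in each token. After a call to tokenize(), the getTokenProbabilities() method returns a probability for each token from that call. A quick sketch (the actual values depend on the model you loaded):

```java
String tokens[] = tokenizer.tokenize("George Washington was president.");
double probs[] = tokenizer.getTokenProbabilities();

// Print each token alongside the model's confidence in it.
for (int i = 0; i < tokens.length; i++) {
    System.out.println(tokens[i] + "\t" + probs[i]);
}
```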

Deploy Sonnet Tokenization Engine on AWS and Azure.
