Creating Custom Tokenization Models with Sonnet Tokenization Engine

Sonnet Tokenization Engine 1.1.0 includes the ability to train custom token models from your text. Using your own token model provides improved performance because the model will more closely match the text you want to tokenize. This post describes how to launch an instance of Sonnet Tokenization Engine on AWS, connect to it, train a custom token model, and then use it.

To get started, let’s launch an instance of Sonnet Tokenization Engine from the AWS Marketplace. On the product page, click the orange “Continue to Subscribe” button.


On the next page, we highly recommend selecting a VPC from the VPC Settings options. This is to allow you to launch Sonnet Tokenization Engine on a newer instance type. Select your VPC and a public subnet.

Now, select an instance type. We recommend a t2.micro for this demonstration. In production you will likely want a larger instance type.

Now click the “Launch with 1-Click” button!

An instance of Sonnet Tokenization Engine will now be starting in your AWS account. Head over to your EC2 console to check it out. By default, for security purposes port 22 for SSH is not open to the instance. Let’s open port 22 so we can SSH to the instance. Click on the instance’s security group, click Inbound Rules, and add port 22. Now let’s SSH into the instance.

ssh -i keypair.pem

Sonnet Tokenization Engine is installed under /opt/sonnet.

cd /opt/sonnet

Training a custom token model requires training data. The format for this data is a single sentence per line with tokens separated by whitespace or <SPLIT>. You can download sample training data for this exercise.
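As a rough illustration of that format, the sketch below builds one training line from a sentence and its tokens. The function name to_training_line is hypothetical and not part of Sonnet. It assumes the tokens are given in order and that each one appears in the sentence, and it writes <SPLIT> wherever two tokens are not separated by whitespace.

```python
def to_training_line(sentence, tokens):
    """Build one training line: tokens separated by whitespace or <SPLIT>.

    Hypothetical helper for illustration only; not part of Sonnet.
    """
    parts = []
    pos = 0  # position in the sentence just after the previous token
    for i, token in enumerate(tokens):
        start = sentence.index(token, pos)
        if i > 0:
            gap = sentence[pos:start]
            # Whitespace in the source text already marks the boundary;
            # otherwise mark it explicitly with <SPLIT>.
            parts.append(" " if gap and gap.strip() == "" else "<SPLIT>")
        parts.append(token)
        pos = start + len(token)
    return "".join(parts)


line = to_training_line(
    "Tokenize this text please.",
    ["Tokenize", "this", "text", "please", "."],
)
print(line)
```

Each line produced this way holds a single sentence, so a training file is simply one such line per sentence.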

wget -O /tmp/token.train

We also need a training definition file. Again, we can download one for this exercise:

wget -O /tmp/token-training-definition.xml

Using these two files we are now ready to train our model.

sudo su sonnet
./bin/ /tmp/token-training-definition.xml

The output will look similar to:

Sonnet Token Model Generator
Version: 1.1.0
Beginning training using definition file: /tmp/token-training-definition.xml
2018-03-17 12:47:46,135 DEBUG [main] models.ModelOperationsUtils ( - Using OpenNLP data format.
2018-03-17 12:47:46,260 INFO  [main] training.TokenModelOperations ( - Beginning tokenizer model training. Output model will be: /tmp/token.bin
Indexing events with TwoPass using cutoff of 0

	Computing event counts...  done. 6002 events
	Indexing...  done.
Collecting events... Done indexing in 0.54 s.
Incorporating indexed data for training...  
	Number of Event Tokens: 6002
	    Number of Outcomes: 2
	  Number of Predicates: 6290
Computing model parameters...
Performing 100 iterations.
  1:  . (5991/6002) 0.9981672775741419
  2:  . (5995/6002) 0.9988337220926358
  3:  . (5996/6002) 0.9990003332222592
  4:  . (5997/6002) 0.9991669443518827
  5:  . (5996/6002) 0.9990003332222592
  6:  . (5998/6002) 0.9993335554815062
  7:  . (5998/6002) 0.9993335554815062
  8:  . (6000/6002) 0.9996667777407531
  9:  . (6000/6002) 0.9996667777407531
 10:  . (6000/6002) 0.9996667777407531
Stopping: change in training set accuracy less than 1.0E-5
Stats: (6002/6002) 1.0
Compressed 6290 parameters to 159
1 outcome patterns
Entity model generated complete. Summary:
Model file   : /tmp/token.bin
Manifest file : token.bin.manifest
Time Taken    : 2690 ms
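If you save this output to a file, a small script can pull out the per-iteration training accuracy to check how quickly the model converged. This is only a sketch: the function name iteration_accuracies is hypothetical, and the pattern assumes the iteration lines look exactly like the ones shown above.

```python
import re

# Matches lines such as "  1:  . (5991/6002) 0.9981672775741419".
ITERATION = re.compile(r"^\s*(\d+):\s+\.\s+\((\d+)/(\d+)\)\s+([0-9.]+)\s*$")


def iteration_accuracies(log_text):
    """Return (iteration, accuracy) pairs, with accuracy recomputed as correct/total."""
    results = []
    for line in log_text.splitlines():
        match = ITERATION.match(line)
        if match:
            iteration, correct, total = (int(match.group(i)) for i in (1, 2, 3))
            results.append((iteration, correct / total))
    return results


sample = """  1:  . (5991/6002) 0.9981672775741419
 10:  . (6000/6002) 0.9996667777407531
Stats: (6002/6002) 1.0
"""
for iteration, accuracy in iteration_accuracies(sample):
    print(iteration, accuracy)
```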

The model file and its associated manifest file have now been created. Copy the manifest file to Sonnet’s models directory.

cp /tmp/token.bin.manifest /opt/sonnet/models/

Now start/restart Sonnet.

sudo service sonnet restart

The model will be loaded and ready for use. All API requests for tokenization that are received for the model’s language will be processed by the model. To try it:

curl "" -d "Tokenize this text please." -H "Content-Type: text/plain"
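The same request can be made from a script. The sketch below uses only the Python standard library. The function names are hypothetical, the URL must be supplied by you (on a local Docker deployment the tokenize endpoint used later in this document is http://localhost:9040/api/tokenize), and it assumes the response body is a JSON array of tokens.

```python
import json
import urllib.request


def build_tokenize_request(text, url):
    """Build a POST request carrying plain text, like the curl command."""
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def tokenize(text, url):
    """Send the text and parse the response (assumed to be a JSON array)."""
    with urllib.request.urlopen(build_tokenize_request(text, url)) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not contact the server; calling tokenize() does.
request = build_tokenize_request(
    "Tokenize this text please.", "http://localhost:9040/api/tokenize"
)
print(request.get_method(), request.full_url)
```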

Intel “Meltdown” and “Spectre” Vulnerabilities

With the recent announcement of the vulnerabilities known as “Spectre” and “Meltdown” in Intel processors, we have written this post to tell our users how to protect virtual machines running our products that were launched via cloud marketplaces.

Products Launched via Docker Containers

Docker uses the host’s system kernel. Refer to your host OS’s documentation on applying the necessary kernel patch.

Products Launched via the AWS Marketplace

The following product versions use kernel 4.9.62-21.56.amzn1.x86_64, which needs to be updated.

  • Renku Language Detection Engine 1.0.0
  • Prose Sentence Extraction Engine 1.0.0
  • Sonnet Tokenization Engine 1.0.0
  • Idyl E3 Entity Extraction Engine 3.0.0

Run the following commands on each instance:

sudo yum update
sudo reboot
uname -r

The output of the last command will show an updated kernel version of 4.9.76-3.78.amzn1.x86_64 (or newer). Details are available on the AWS Amazon Linux Security Center.

Products Launched via the Azure Marketplace

The following product versions run on CentOS 7.3 with kernel 3.10.0-514.26.2.el7.x86_64, which needs to be updated.

  • Renku Language Detection Engine 1.0.0
  • Prose Sentence Extraction Engine 1.0.0
  • Sonnet Tokenization Engine 1.0.0
  • Idyl E3 Entity Extraction Engine 3.0.0

Run the following commands on each virtual machine:

sudo yum update
sudo reboot
uname -r

The output of the last command will show an updated kernel version of 3.10.0-693.11.6.el7.x86_64 (or newer). For more information see the Red Hat Security Advisory and the announcement email.
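If you manage many instances, the "or newer" check can be scripted. The sketch below compares the numeric part of a uname -r string against the minimum versions given above. The function names are hypothetical, and the comparison is a simple numeric one rather than the full package-manager version ordering, so treat it as a convenience check only.

```python
import re


def kernel_tuple(release):
    """Turn '4.9.76-3.78.amzn1.x86_64' into (4, 9, 76, 3, 78)."""
    # Keep only the leading run of numbers separated by '.' or '-'.
    numeric = re.match(r"\d+(?:[.-]\d+)*", release).group(0)
    return tuple(int(part) for part in re.split(r"[.-]", numeric))


def is_patched(release, minimum):
    """True if the release is at least the minimum (naive numeric comparison)."""
    return kernel_tuple(release) >= kernel_tuple(minimum)


AMAZON_LINUX_MINIMUM = "4.9.76-3.78.amzn1.x86_64"
CENTOS_MINIMUM = "3.10.0-693.11.6.el7.x86_64"

print(is_patched("4.9.62-21.56.amzn1.x86_64", AMAZON_LINUX_MINIMUM))
print(is_patched("3.10.0-693.11.6.el7.x86_64", CENTOS_MINIMUM))
```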



Orchestrating NLP Building Blocks with Apache NiFi for Named-Entity Extraction

This blog post shows how we can create an NLP pipeline to perform named-entity extraction on natural language text using our NLP Building Blocks and Apache NiFi. Our NLP Building Blocks provide the ability to perform sentence extraction, string tokenization, and named-entity extraction. They are implemented as microservices and can be deployed almost anywhere, such as AWS, Azure, and as Docker containers.

At the completion of this blog post we will have a system that reads natural language text stored in files on the file system, pulls out the sentences of each file, finds the tokens in each sentence, and finds the named-entities in the tokens.

Apache NiFi is an open-source application that provides data flow capabilities. Using NiFi you can visually define how data should flow through your system. Using what NiFi calls “processors”, you can ingest data from many data sources, perform operations on the data such as transformations and aggregations, and then output the data to an external system. We will be using NiFi to facilitate the flow of text through our NLP pipeline. The text will be read from plain text files on the file system. We will then:

  • Identify the sentences in input text.
  • For each sentence, extract the tokens in the sentence.
  • Process the tokens for named-entities.

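To get started we will stand up the NLP Building Blocks. The flow in this post uses the following applications:

  • Prose Sentence Extraction Engine
  • Sonnet Tokenization Engine
  • Idyl E3 Entity Extraction Engine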

We will launch these applications using a docker-compose script.

git clone
cd nlp-building-blocks
docker-compose up

This will pull the docker images from DockerHub and run the containers. We now have each NLP building block up and running. Let’s get Apache NiFi up and running, too.

To get started with Apache NiFi we will download it. It is a big download at just over 1 GB. You can download it from the Apache NiFi Downloads page or directly from a mirror at this link for NiFi 1.4.0. Once the download is done we will unzip the download and start NiFi:

cd nifi-1.4.0/bin
./ start

NiFi will start and after a few minutes it will be available at http://localhost:8080/nifi. (If you are curious you can see the NiFi log under logs/nifi-app.log.) Open your browser to that page and you will see the NiFi canvas as shown below. We can now design our data flow around the NLP Building Blocks!

If you want to skip to the meat and potatoes you can get the NiFi template described below in the nlp-building-blocks repository.

Our source data is going to be read from text files on our computer stored under /tmp/in/. We will use NiFi’s GetFile processor to read the file. Add a GetFile processor to the canvas:

Right-click the GetFile processor and click Configure to bring up the processor’s properties. The only property we are going to set is the Input Directory property. Set it to /tmp/in/ and click Apply:

We will use the InvokeHTTP processor to send API requests to the NLP Building Blocks, so add a new InvokeHTTP processor to the canvas:

This first InvokeHTTP processor will be used to send the data to Prose Sentence Extraction Engine to extract the sentences in the text. Open the InvokeHTTP processor’s properties and set the following values:

  • HTTP Method – POST
  • Remote URL – http://localhost:7070/api/sentences
  • Content-Type – text/plain

Set the processor to autoterminate for everything except Response. We also set the processor’s name to ProseSentenceExtractionEngine. Since we will be using multiple InvokeHTTP processors this lets us easily differentiate between them. We can now create a connection between the GetFile and InvokeHTTP processors by clicking and drawing a line between them. Our flow right now reads files from the filesystem and sends the contents to Prose:

The sentences returned from Prose will be in a JSON array. We can split this array into individual FlowFiles with the SplitJson processor. Add a SplitJson processor to the canvas and set its JsonPath Expression property to $.* as shown below:

Connect the SplitJson processor to the ProseSentenceExtractionEngine processor for the Response relationship. The canvas should now look like this:

Now that we have the individual sentences in the text we can send those sentences to Sonnet Tokenization Engine to tokenize the sentences. Similar to before, add an InvokeHTTP processor and name it SonnetTokenizationEngine. Set its method to POST, the Remote URL to http://localhost:9040/api/tokenize, and the Content-Type to text/plain. Automatically terminate every relationship except Response. Connect it to the SplitJson processor using the Split relationship. The result of this processor will be an array of tokens from the input sentence.

While we are at it, let’s go ahead and add an InvokeHTTP processor for Idyl E3 Entity Extraction Engine. Add the processor to the canvas and set its name to IdylE3EntityExtractionEngine. Set its properties:

  • HTTP Method – POST
  • Remote URL – http://localhost:9000/api/extract
  • Content-Type – application/json

Connect the IdylE3EntityExtractionEngine processor to the SonnetTokenizationEngine processor via the Response relationship. All other relationships can be set to autoterminate. To make things easier to see, we are going to add an UpdateAttribute processor that sets the filename for each FlowFile to a random UUID. Add an UpdateAttribute processor and add a new property called filename with the value ${uuid}.txt. We will also add a processor to write the FlowFiles to disk so we can see what happened during the flow’s execution. We will add a PutFile processor and set its Directory property to /tmp/out/.
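For readers who find code easier to follow than a canvas, the chain of three HTTP calls in this flow can be sketched in a few lines of Python. This is not how NiFi runs the flow, only an illustration of the same request sequence. The function names are hypothetical, and the sketch assumes that Prose returns a JSON array of sentence strings and that Sonnet's response body can be passed unchanged to Idyl E3, as the flow above does.

```python
import json
import urllib.request

PROSE_URL = "http://localhost:7070/api/sentences"
SONNET_URL = "http://localhost:9040/api/tokenize"
IDYL_URL = "http://localhost:9000/api/extract"


def http_post(url, body, content_type):
    """POST a text body and return the response body as text."""
    request = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def extract_entities(text, post=http_post):
    """Return one entity-extraction response per sentence in the text."""
    results = []
    # Step 1: sentence extraction (the SplitJson step is the loop below).
    sentences = json.loads(post(PROSE_URL, text, "text/plain"))
    for sentence in sentences:
        # Step 2: tokenize the sentence.
        tokens_json = post(SONNET_URL, sentence, "text/plain")
        # Step 3: send the token array to the entity extraction engine.
        results.append(post(IDYL_URL, tokens_json, "application/json"))
    return results
```

The post argument exists so the chain can be exercised without running services by passing in a stand-in function. With the containers running, extract_entities(text) uses real HTTP requests.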

Our finished flow looks like this:

To test our flow we are going to use a super simple text file. The full contents of the text file are:

George Washington was president. This is another sentence. Martha Washington was first lady.

Save this file as /tmp/in/test.txt.

Now, start up the NLP Building Blocks if they are not already running (skip the clone if you already have the repository):

git clone
cd nlp-building-blocks
docker-compose up

Now you can start the processors in the flow! The file /tmp/in/test.txt will disappear and three files will appear in /tmp/out/. The three files will have random UUIDs for filenames thanks to the UpdateAttribute processor. If we look at the contents of each of these files we see:

First file:

{"entities":[{"text":"George Washington","confidence":0.96,"span":{"tokenStart":0,"tokenEnd":2},"type":"person","languageCode":"eng","extractionDate":1514488188929,"metadata":{"x-model-filename":"mtnfog-en-person.bin"}}],"extractionTime":84}

Second file:


Third file:

{"entities":[{"text":"Martha Washington","confidence":0.89,"span":{"tokenStart":0,"tokenEnd":2},"type":"person","languageCode":"eng","extractionDate":1514488189026,"metadata":{"x-model-filename":"mtnfog-en-person.bin"}}],"extractionTime":2}

The input text was broken into three sentences so we have three output files. In the first file we see that George Washington was extracted as a person entity. The second file did not have any entities. The third file had Martha Washington as a person entity. Our NLP pipeline orchestrated by Apache NiFi read the input, broke it into sentences, broke each sentence into tokens, and then identified named-entities from the tokens.
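To work with these output files programmatically, the JSON can be reduced to the fields of interest. The sketch below reads only the field names visible in the output above (entities, text, type, confidence). The function name summarize_entities is hypothetical.

```python
import json


def summarize_entities(output_json):
    """Return (text, type, confidence) for each entity in one output file."""
    document = json.loads(output_json)
    return [
        (entity["text"], entity["type"], entity["confidence"])
        for entity in document.get("entities", [])
    ]


first_file = (
    '{"entities":[{"text":"George Washington","confidence":0.96,'
    '"span":{"tokenStart":0,"tokenEnd":2},"type":"person",'
    '"languageCode":"eng","extractionDate":1514488188929,'
    '"metadata":{"x-model-filename":"mtnfog-en-person.bin"}}],'
    '"extractionTime":84}'
)
for text, entity_type, confidence in summarize_entities(first_file):
    print(text, entity_type, confidence)
```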

This flow assumed the language would always be English, but if you are unsure, you can add another InvokeHTTP processor to use Renku Language Detection Engine. This enables language detection inside your flow, and you can route the FlowFiles through the flow based on the detected language, giving you a very powerful NLP pipeline.

There’s a lot of cool stuff here but arguably one of the coolest is that by using the NLP Building Blocks you don’t have to pay per-request pricing that many of the NLP services charge. You can run this pipeline as much as you need to. And if you are in an environment where your text can’t leave your network, this pipeline can be run completely behind a firewall (just like we did in this post).