Intel “Meltdown” and “Spectre” Vulnerabilities

With the recent announcement of the “Spectre” and “Meltdown” vulnerabilities in Intel processors, we have put together this post to show our users how to protect the virtual machines of our products launched via cloud marketplaces.

Products Launched via Docker Containers

Docker uses the host’s system kernel. Refer to your host OS’s documentation on applying the necessary kernel patch.
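For example, on a Linux host you can check which kernel the host (and therefore every container) is running with commands such as:

docker info | grep -i "kernel version"
uname -r

Compare the reported version against your distribution’s patched kernel and update the host if necessary.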

Products Launched via the AWS Marketplace

The following product versions are using kernel 4.9.62-21.56.amzn1.x86_64, which needs to be updated.

  • Renku Language Detection Engine 1.0.0
  • Prose Sentence Extraction Engine 1.0.0
  • Sonnet Tokenization Engine 1.0.0
  • Idyl E3 Entity Extraction Engine 3.0.0

Run the following commands on each instance:

sudo yum update
sudo reboot
uname -r

The output of the last command will show an updated kernel version of 4.9.76-3.78.amzn1.x86_64 (or newer). Details are available on the AWS Amazon Linux Security Center.

Products Launched via the Azure Marketplace

The following product versions are running on CentOS 7.3 on kernel 3.10.0-514.26.2.el7.x86_64, which needs to be updated.

  • Renku Language Detection Engine 1.0.0
  • Prose Sentence Extraction Engine 1.0.0
  • Sonnet Tokenization Engine 1.0.0
  • Idyl E3 Entity Extraction Engine 3.0.0

Run the following commands on each virtual machine:

sudo yum update
sudo reboot
uname -r

The output of the last command will show an updated kernel version of 3.10.0-693.11.6.el7.x86_64 (or newer). For more information see the Red Hat Security Advisory and the announcement email.


Sentence Extraction with Custom Trained NLP Models

Introducing Sentence Extraction

A common task in natural language processing (NLP) is extracting sentences from natural language text. This can be a task on its own or part of a larger NLP system. There are several ways to go about sentence extraction. There is the naive way of splitting based on the presence of periods, which works great until you remember that periods don’t always indicate a sentence break. There are also tools that break text into sentences based on rules, and there is even a standard for communicating these rules called Segmentation Rules eXchange, or SRX. These rules often work very well; however, they are language dependent, and implementing them can be difficult because not all programming languages have the necessary constructs.
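As a quick illustration, splitting on every period in the shell shows how naive splitting over-segments text containing abbreviations:

echo "Dr. Jones arrived at 5 p.m. on Dec. 24. He left this morning." | tr '.' '\n'

Every period becomes a line break, so “Dr.”, “p.m.”, and “Dec.” each produce bogus fragments instead of the two sentences a reader would expect.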

Model-Based Sentence Extraction

This brings us to model-based sentence extraction. In this approach we use trained models to identify sentence boundaries in natural language text. In summary, we take training text, run it through a training process, and get a model that can be used to extract sentences. A significant benefit of model-based sentence extraction is that you can adapt your model to the actual text you will be processing, which can lead to very good performance. Our NLP Building Block product Prose Sentence Extraction Engine uses this model-based approach.

Training a Custom Sentence Model with Prose Sentence Extraction Engine

Prose Sentence Extraction Engine 1.1.0 introduced the ability to create custom models for extracting sentences from natural language text. Using a custom model typically provides a much greater level of accuracy than relying on the internal Prose logic to extract sentences. Creating a custom model is fairly simple and this blog post demonstrates how to do it.

To get started we are going to launch Prose Sentence Extraction Engine via the AWS Marketplace. The benefit of doing this is that in just a few seconds (okay, maybe 30 seconds) we will have an instance of Prose fully configured and ready to go. Once the instance is up and running in EC2 we can SSH into it. (Note that the SSH username is ec2-user.) All commands presented in this post are executed through SSH on the Prose instance.

SSH to the Prose instance on EC2:

ssh -i key.pem ec2-user@ec2-54-174-13-245.compute-1.amazonaws.com

Once connected, change to the Prose directory:

cd /opt/prose

Training a sentence extraction model requires training text. This text needs to be formatted in a certain way – one sentence per line. This is how Prose learns how to recognize a sentence for any given language. We have some training text for you to use for this example. When creating a model for your production use you should use text representative of the real text that you will be processing. This gives the best performance.

Download the example training text to the instance:

wget https://s3.amazonaws.com/mtnfog-public/a-christmas-carol-sentences.txt -O /tmp/a-christmas-carol-sentences.txt

Take a look at the first few lines of the file you just downloaded. You will see that it is one sentence per line. This file is also attached at the bottom of this blog post.
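One quick way to do that is with head:

head -n 5 /tmp/a-christmas-carol-sentences.txt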

Now, edit the example training definition file:

sudo nano example-training-definition-template.xml

You want to modify the trainingdata file attribute to be “/tmp/a-christmas-carol-sentences.txt” and set the output model file as shown below:

<?xml version="1.0" encoding="UTF-8"?>
<trainingdefinition xmlns="https://www.mtnfog.com">
  <algorithm/>
  <trainingdata file="/tmp/a-christmas-carol-sentences.txt" format="opennlp"/>
  <model name="sentence" file="/tmp/sentence.bin" encryptionkey="random" language="eng" type="sentence"/>
</trainingdefinition>

This training definition says we are creating a sentence model for English (eng) text. The trained model file will be written to /tmp/sentence.bin. Now, we are ready to train the model:

./bin/train-model.sh example-training-definition-template.xml

You will see some output quickly scroll by. Since the input text is rather small, the training only takes at most a few seconds. Your output should look similar to:

$ ./bin/train-model.sh example-training-definition-template.xml

Prose Sentence Model Generator
Version: 1.1.0
Beginning training using definition file: /opt/prose/example-training-definition-template.xml
2017-12-31 19:21:03,451 DEBUG [main] models.ModelOperationsUtils (ModelOperationsUtils.java:40) - Using OpenNLP data format.
2017-12-31 19:21:03,567 INFO  [main] training.SentenceModelOperations (SentenceModelOperations.java:281) - Beginning sentence model training. Output model will be: /tmp/sentence.bin
Indexing events with TwoPass using cutoff of 0

	Computing event counts...  done. 1990 events
	Indexing...  done.
Collecting events... Done indexing in 0.41 s.
Incorporating indexed data for training...  
done.
	Number of Event Tokens: 1990
	    Number of Outcomes: 2
	  Number of Predicates: 2274
Computing model parameters...
Performing 100 iterations.
  1:  . (1827/1990) 0.9180904522613065
  2:  . (1882/1990) 0.9457286432160804
  3:  . (1910/1990) 0.9597989949748744
  4:  . (1915/1990) 0.9623115577889447
  5:  . (1940/1990) 0.9748743718592965
  6:  . (1950/1990) 0.9798994974874372
  7:  . (1953/1990) 0.9814070351758793
  8:  . (1948/1990) 0.978894472361809
  9:  . (1962/1990) 0.985929648241206
 10:  . (1954/1990) 0.9819095477386934
 20:  . (1979/1990) 0.9944723618090452
 30:  . (1986/1990) 0.9979899497487437
 40:  . (1990/1990) 1.0
Stopping: change in training set accuracy less than 1.0E-5
Stats: (1990/1990) 1.0
...done.
Compressed 2274 parameters to 707
1 outcome patterns
2017-12-31 19:21:04,491 INFO  [main] manifest.ModelManifestUtils (ModelManifestUtils.java:108) - Removing existing manifest file /tmp/sentence.bin.manifest.
Sentence model generated complete. Summary:
Model file   : /tmp/sentence.bin
Manifest file : sentence.bin.manifest
Time Taken    : 1056 ms

Our model has been created and we can now use it. First, let’s stop Prose in case it is running:

sudo service prose stop

Next, copy the model file and its manifest file to /opt/prose/models:

sudo cp /tmp/sentence.* /opt/prose/models/

Since we moved the model file, let’s also update the model’s file name in the manifest file:

sudo nano models/sentence.bin.manifest

Change the model.filename property to be sentence.bin (remove the /tmp/). The manifest should now look like:

model.id=e54091c9-89de-4edb-828b-4edf58006c73
model.name=sentence
model.type=sentence
model.subtype=none
model.filename=sentence.bin
language.code=eng
license.key=
encryption.key=random
creator.version=prose-1.1.0
model.source=
generation=1

Now, with our model in place, we can start Prose. If we tail Prose’s log while it loads we can see that it finds and loads our custom model:

sudo service prose start && tail -f /var/log/prose.log

In case you are curious, the lines in the log that show the model was loaded will look similar to these:

[INFO ] 2017-12-31 19:25:57.933 [main] ModelManifestUtils - Found model manifest ./models//sentence.bin.manifest.
[INFO ] 2017-12-31 19:25:57.939 [main] ModelManifestUtils - Validating model manifest ./models//sentence.bin.manifest.
[WARN ] 2017-12-31 19:25:57.942 [main] ModelManifestUtils - The license.key in ./models//sentence.bin.manifest is missing.
[INFO ] 2017-12-31 19:25:58.130 [main] ModelManifestUtils - Entity Class: sentence, Model File Name: sentence.bin, Language Code: en, License Key: 
[INFO ] 2017-12-31 19:25:58.135 [main] DefaultSentenceDetectionService - Found 1 models to load.
[INFO ] 2017-12-31 19:25:58.138 [main] LocalModelLoader - Using local model loader directory ./models/
[INFO ] 2017-12-31 19:25:58.560 [main] ModelLoader - Model validation successful.
[INFO ] 2017-12-31 19:25:58.569 [main] DefaultSentenceDetectionService - Found sentence model for language eng

Yay! This means that Prose has started and loaded our model. Requests to Prose to extract sentences for English text will now use our model. Let’s try it:

curl http://ec2-54-174-13-245.compute-1.amazonaws.com:8060/api/sentences -d "This is a sentence. This is another sentence. This is also a sentence." -H "Content-type: text/plain"

The response we receive from Prose is:

["This is a sentence.","This is another sentence.","This is also a sentence."]

Our sentence model worked! Prose successfully took in the natural language English text and sent us back three sentences that made up the text.

Prose Sentence Extraction Engine is available on the AWS Marketplace, Azure Marketplace, and Dockerhub. You can launch Prose Sentence Extraction Engine on any of those platforms in just a few seconds.

At the time of publishing, Prose 1.1.0 was in the process of being published to the Azure and AWS Marketplaces. If 1.1.0 is not yet available on those marketplaces, it will be within a few days once the update has been published.

Orchestrating NLP Building Blocks with Apache NiFi for Named-Entity Extraction

This blog post shows how we can create an NLP pipeline to perform named-entity extraction on natural language text using our NLP Building Blocks and Apache NiFi. Our NLP Building Blocks provide the ability to perform sentence extraction, string tokenization, and named-entity extraction. They are implemented as microservices and can be deployed almost anywhere, such as on AWS, on Azure, or as Docker containers.

By the end of this blog post we will have a system that reads natural language text stored in files on the file system, pulls out the sentences of each file, finds the tokens in each sentence, and finds the named-entities in the tokens.

Apache NiFi is an open-source application that provides data flow capabilities. Using NiFi you can visually define how data should flow through your system. Using what NiFi calls “processors”, you can ingest data from many data sources, perform operations on the data such as transformations and aggregations, and then output the data to an external system. We will be using NiFi to facilitate the flow of text through our NLP pipeline. The text will be read from plain text files on the file system. We will then:

  • Identify the sentences in input text.
  • For each sentence, extract the tokens in the sentence.
  • Process the tokens for named-entities.

To get started we will stand up the NLP Building Blocks. This consists of the following applications, each of which we will use in the flow:

  • Prose Sentence Extraction Engine for sentence extraction
  • Sonnet Tokenization Engine for string tokenization
  • Idyl E3 Entity Extraction Engine for named-entity extraction

We will launch these applications using a docker-compose script.

git clone https://github.com/mtnfog/nlp-building-blocks
cd nlp-building-blocks
docker-compose up

This will pull the docker images from DockerHub and run the containers. We now have each NLP building block up and running. Let’s get Apache NiFi up and running, too.
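If you want to confirm that all of the containers came up before moving on, you can list them:

docker-compose ps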

To get started with Apache NiFi we will download it. It is a big download at just over 1 GB. You can download it from the Apache NiFi Downloads page or directly from a mirror at this link for NiFi 1.4.0. Once the download is done we will unzip the download and start NiFi:

unzip nifi-1.4.0-bin.zip
cd nifi-1.4.0/bin
./nifi.sh start

NiFi will start and after a few minutes it will be available at http://localhost:8080/nifi. (If you are curious you can see the NiFi log under logs/nifi-app.log.) Open your browser to that page and you will see the NiFi canvas as shown below. We can now design our data flow around the NLP Building Blocks!

If you want to skip to the meat and potatoes you can get the NiFi template described below in the nlp-building-blocks repository.

Our source data is going to be read from text files on our computer stored under /tmp/in/. We will use NiFi’s GetFile processor to read the file. Add a GetFile processor to the canvas:


Right-click the GetFile processor and click Configure to bring up the processor’s properties. The only property we are going to set is the Input Directory property. Set it to /tmp/in/ and click Apply:

We will use the InvokeHTTP processor to send API requests to the NLP Building Blocks, so add a new InvokeHTTP processor to the canvas:

This first InvokeHTTP processor will be used to send the data to Prose Sentence Extraction Engine to extract the sentences in the text. Open the InvokeHTTP processor’s properties and set the following values:

  • HTTP Method – POST
  • Remote URL – http://localhost:7070/api/sentences
  • Content Type – text/plain

Set the processor to autoterminate for everything except Response. We also set the processor’s name to ProseSentenceExtractionEngine. Since we will be using multiple InvokeHTTP processors this lets us easily differentiate between them. We can now create a connection between the GetFile and InvokeHTTP processors by clicking and drawing a line between them. Our flow right now reads files from the filesystem and sends the contents to Prose:
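If you would like to check the Prose endpoint outside of NiFi first, you can issue the same request the processor will make using curl (this assumes the docker-compose setup exposes Prose on port 7070, matching the Remote URL above):

curl http://localhost:7070/api/sentences -d "George Washington was president. This is another sentence." -H "Content-type: text/plain"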

The sentences returned from Prose will be in a JSON array. We can split this array into individual FlowFiles with the SplitJson processor. Add a SplitJson processor to the canvas and set its JsonPath Expression property to $.* as shown below:

Connect the SplitJson processor to the ProseSentenceExtractionEngine processor for the Response relationship. The canvas should now look like this:

Now that we have the individual sentences in the text we can send those sentences to Sonnet Tokenization Engine to tokenize the sentences. Similar to before, add an InvokeHTTP processor and name it SonnetTokenizationEngine. Set its method to POST, the Remote URL to http://localhost:9040/api/tokenize, and the Content-Type to text/plain. Automatically terminate every relationship except Response. Connect it to the SplitJson processor using the Split relationship. The result of this processor will be an array of tokens from the input sentence.
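As before, you can exercise this endpoint directly with curl to see the tokenization on its own, using the same URL and content type as the processor:

curl http://localhost:9040/api/tokenize -d "George Washington was president." -H "Content-type: text/plain"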

While we are at it, let’s go ahead and add an InvokeHTTP processor for Idyl E3 Entity Extraction Engine. Add the processor to the canvas and set its name to IdylE3EntityExtractionEngine. Set its properties:

  • HTTP Method – POST
  • Remote URL – http://localhost:9000/api/extract
  • Content-Type – application/json

Connect the IdylE3EntityExtractionEngine processor to the SonnetTokenizationEngine processor via the Response relationship. All other relationships can be set to autoterminate. To make things easier to see, we are going to add an UpdateAttribute processor that sets the filename for each FlowFile to a random UUID. Add an UpdateAttribute processor and add a new property called filename with the value ${uuid}.txt. We will also add a processor to write the FlowFiles to disk so we can see what happened during the flow’s execution. We will add a PutFile processor and set its Directory property to /tmp/out/.
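If you want to try Idyl E3 by itself, a request like the one below approximates what the flow sends it. Note that the JSON token-array body is an assumption based on the Sonnet output that feeds this processor, so the exact payload shape may differ if you call the API by hand:

curl http://localhost:9000/api/extract -d '["George","Washington","was","president","."]' -H "Content-type: application/json"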

Our finished flow looks like this:

To test our flow we are going to use a super simple text file. The full contents of the text file are:

George Washington was president. This is another sentence. Martha Washington was first lady.

Save this file as /tmp/in/test.txt.

Make sure the NLP Building Blocks we started earlier with docker-compose are still up and running.

Now you can start the processors in the flow! The file /tmp/in/test.txt will disappear and three files will appear in /tmp/out/. The three files will have random UUIDs for filenames thanks to the UpdateAttribute processor. If we look at the contents of each of these files we see:

First file:

{"entities":[{"text":"George Washington","confidence":0.96,"span":{"tokenStart":0,"tokenEnd":2},"type":"person","languageCode":"eng","extractionDate":1514488188929,"metadata":{"x-model-filename":"mtnfog-en-person.bin"}}],"extractionTime":84}

Second file:

{"entities":[],"extractionTime":7}

Third file:

{"entities":[{"text":"Martha Washington","confidence":0.89,"span":{"tokenStart":0,"tokenEnd":2},"type":"person","languageCode":"eng","extractionDate":1514488189026,"metadata":{"x-model-filename":"mtnfog-en-person.bin"}}],"extractionTime":2}

The input text was broken into three sentences so we have three output files. In the first file we see that George Washington was extracted as a person entity. The second file did not have any entities. The third file had Martha Washington as a person entity. Our NLP pipeline orchestrated by Apache NiFi read the input, broke it into sentences, broke each sentence into tokens, and then identified named-entities from the tokens.

This flow assumed the language would always be English, but if you are unsure you can add another InvokeHTTP processor to utilize Renku Language Detection Engine. This enables language detection inside your flow, and you can route the FlowFiles through the flow based on the detected language, giving you a very powerful NLP pipeline.

There’s a lot of cool stuff here, but arguably one of the coolest is that by using the NLP Building Blocks you don’t have to pay the per-request pricing that many NLP services charge. You can run this pipeline as much as you need to. And if you are in an environment where your text can’t leave your network, this pipeline can be run completely behind a firewall (just like we did in this post).