Pre-trained PubMed Vectors

We have added a download to our Datasets page. This addition is pre-trained vectors for PubMed Open-Access Subset.

PubMed comprises more than 29 million citations for biomedical literature from MEDLINE, life science journals, and online books. Citations may include links to full-text content from PubMed Central and publisher web sites.

These pre-trained word vectors were created from the commercial PubMed Open Access Subset. There is a lot of great information inside the collection of biomedical text and we hope these word vectors are useful to you in your NLP and text mining experiments.

Go to our Datasets page to access the downloads.

Creating an N-gram Language Model

A statistical language model is a probability distribution over sequences of words. (source)

We can build a language model using n-grams and query it to determine the probability of an arbitrary sentence (a sequence of words) belonging to that language.

Language modeling has uses in various NLP applications such as statistical machine translation and speech recognition. It's easy to see how being able to determine the probability a sentence belongs to a corpus can be useful in areas such as machine translation.

To build it, we need a corpus and a language modeling tool. We will use kenlm as our tool. Other language modeling tools exist and some are listed at bottom of the Language Model Wikipedia article.

To start, we will clone the kenlm repository from GitHub:

git clone

Once cloned, we will follow the instructions in the repository's README for how to compile. Those instructions are:

mkdir -p build
cd build
cmake ..
make -j 4

Once done we have a bin directory that contains the kenlm binaries. We can now create our language model. For text to experiment with I used the raw text of Pride and Prejudice. You will most certainly need a much, much larger corpus to get more meaningful results. But this should be sufficient for testing and learning.

To create the model:

./bin/lmplz -o 5 < book.txt >

This creates an ARPA file whose format can be found documented here. The -o option specifies the order (length of the n-grams) of the model. With this language model we can calculate the probability of an arbitrary sentence being found in Pride and Prejudice.

echo "This is my sentence ." | ./bin/query

The output shows us a few things.

Loading the LM will be faster if you build a binary file.
This=14 2 -2.8062737 is=16 2 -1.1830423 my=186 3 -1.7089757 sentence=6455 1 -4.2776613 .=0 1 -4.980392 </s>=2 1 -1.2587173 Total: -16.215061 OOV: 1
Perplexity including OOVs: 504.0924558936663
Perplexity excluding OOVs: 176.57688116229482
OOVs: 1
Tokens: 6
Name:query VmPeak:33044 kB VmRSS:4836 kB RSSMax:14040 kB user:0.273361 sys:0.00804 CPU:0.281469 real:0.279475

The value -16.215061 is the log probability of the sentence belonging to the language. Ten to the power of -16.215061 gives us 6.0945129x10^-17.

Compare with word2vec

So how does an n-gram language model compare with word2vec models? Do they do the same thing? No, they don't. In an n-gram language model the order of the words is important. word2vec does not consider the ordering of words, and instead, only looks at the words in a given window size. This allows word2vec to predict the neighboring words given some context without consideration of word order.

A little bit more...

This post did not go into the inner workings of kenlm. For those details refer to the kenlm repository or to this paper. Of particular note is Kneser-Ney smoothing, the algorithm used by kenlm to improve results for instances such as when a word is found that was not present in the corpus. A corpus will never contain every possible n-gram so it is possible the sentence we are estimating has an n-gram not included in the model.

Note that the input text to kenlm should be preprocessed and tokenized, a step which we skipped here. You could use Sonnet Tokenization Engine.

To see an example of kenlm used in support of statistical machine translation see Apache Joshua.

Orchestrating NLP Building Blocks with Apache NiFi for Named-Entity Extraction

This blog post shows how we can create an NLP pipeline to perform named-entity extraction on natural language text using our NLP Building Blocks and Apache NiFi. Our NLP Building Blocks provide the ability to perform sentence extraction, string tokenization, and named-entity extraction. They are implemented as microservices and can be deployed almost anywhere, such as AWS, Azure, and as Docker containers.

At the completion of this blog post we will have a system that reads natural language text stored in files on the file system, pulls out the sentences of the each, finds the tokens in each sentence, and finds the named-entities in the tokens.

Apache NiFi is an open-source application that provides data flow capabilities. Using NiFi you can visually define how data should flow through your system. Using what NiFi calls "processors", you can ingest data from many data sources, perform operations on the data such as transformations and aggregations, and then output the data to an external system. We will be using NiFi to facilitate the flow of text through our NLP pipeline. The text will be read from plain text files on the file system. We will then:

  • Identify the sentences in input text.
  • For each sentence, extract the tokens in the sentence.
  • Process the tokens for named-entities.

To get started we will stand up the NLP Building Blocks. This consists of the following applications:

We will launch these applications using a docker-compose script.

git clone
cd nlp-building-blocks
docker-compose up

This will pull the docker images from DockerHub and run the containers. We now have each NLP building block up and running. Let's get Apache NiFi up and running, too.

To get started with Apache NiFi we will download it. It is a big download at just over 1 GB. You can download it from the Apache NiFi Downloads page or directly from a mirror at this link for NiFi 1.4.0. Once the download is done we will unzip the download and start NiFi:

cd nifi-1.4.0/bin
./ start

NiFi will start and after a few minutes it will be available at http://localhost:8080/nifi. (If you are curious you can see the NiFi log under logs/nifi-app.log.) Open your browser to that page and you will see the NiFi canvas as shown below. We can now design our data flow around the NLP Building Blocks!

If you want to skip to the meat and potatoes you can get the NiFi template described below in the nlp-building-blocks repository.

Our source data is going to be read from text files on our computer stored under /tmp/in/. We will use NiFi's GetFile processor to read the file. Add a GetFile processor to the canvas:

Right-click the GetFile processor and click Configure to bring up the processor's properties. The only property we are going to set is the Input Directory property. Set it to /tmp/in/ and click Apply:

We will use the InvokeHTTP processor to send API requests to the NLP Building Blocks, so, add a new InvokeHTTP processor to the canvas:

This first InvokeHTTP processor will be used to send to the data to Prose Sentence Detection Engine to extract the sentences in the text. Open the InvokeHTTP processor's properties and set the following values:

  • HTTP Method - POST
  • Remote URL - http://localhost:7070/api/sentences
  • Content Type - text/plain

Set the processor to autoterminate for everything except Response. We also set the processor's name to ProseSentenceExtractionEngine. Since we will be using multiple InvokeHTTP processors this lets us easily differentiate between them. We can now create a connection between the GetFile and InvokeHTTP processors by clicking and drawing a line between them. Our flow right now reads files from the filesystem and sends the contents to Prose:

The sentences returned from Prose will be in a JSON array. We can split this array into individual FlowFiles with the SplitJson processor. Add a SplitJson processor to the canvas and set its JsonPath Expression property to $.* as shown below:

Connect the SplitJson processor to the ProseSentenceExtractionEngine processor for the Response relationship. The canvas should now look like this:

Now that we have the individual sentences in the text we can send those sentences to Sonnet Tokenization Engine to tokenize the sentences. Similar to before, add an InvokeHTTP processor and name it SonnetTokenizationEngine. Set its method to POST, the Remote URL to http://localhost:9040/api/tokenize, and the Content-Type to text/plain. Automatically terminate every relationship except Response. Connect it to the SplitJson processor using the Split relationship. The result of this processor will be an array of tokens from the input sentence.

While we are at it, let's go ahead and add an InvokeHTTP processor for Idyl E3 Entity Extraction Engine. Add the processor to the canvas and set its name to IdylE3EntityExtractionEngine. Set its properties:

  • HTTP Method - POST
  • Remote URL - http://localhost:9000/api/extract
  • Content-Type - application/json

Connect the IdylE3EntityExtractionEngine processor to the SonnetTokenizationProcessor via the Response relationship. All other relationships can be set to autoterminate. To make things easier to see, we are going to add an UpdateAttribute processor that sets the filename for each FlowFile to a random UUID. Add an UpdateAttribute processor and add a new property called filename with the value ${uuid}.txt. We will also add a processor to write the FlowFiles to disk so we can see what happened during the flow's execution. We will add a PutFile processor and set its Directory property to /tmp/out/.

Our finished flow looks like this:

To test our flow we are going to use a super simple text file. The full contents of the text file are:

George Washington was president. This is another sentence. Martha Washington was first lady.

Save this file as /tmp/in/test.txt.

Now, start up the NLP Building Blocks:

git clone
cd nlp-building-blocks
docker-compose up

Now you can start the processors in the flow! The file /tmp/in/test.txt will disappear and three files will appear in /tmp/out/. The three files will have random UUIDs for filenames thanks to the UpdateAttribute processor. If we look at the contents of each of these files we see:

First file:

{"entities":[{"text":"George Washington","confidence":0.96,"span":{"tokenStart":0,"tokenEnd":2},"type":"person","languageCode":"eng","extractionDate":1514488188929,"metadata":{"x-model-filename":"mtnfog-en-person.bin"}}],"extractionTime":84}

Second file:


Third file:

{"entities":[{"text":"Martha Washington","confidence":0.89,"span":{"tokenStart":0,"tokenEnd":2},"type":"person","languageCode":"eng","extractionDate":1514488189026,"metadata":{"x-model-filename":"mtnfog-en-person.bin"}}],"extractionTime":2}

The input text was broken into three sentences so we have three output files. In the first file we see that George Washington was extracted as a person entity. The second file did not have any entities. The third file had Martha Washington as a person entity. Our NLP pipeline orchestrated by Apache NiFi read the input, broke it into sentences, broke each sentence into tokens, and then identified named-entities from the tokens.

This flow assumed the language would always be English but if you are unsure you can add another InvokeHTTP processor to utilize Renku Language Detection Engine. This will enable language detection inside your flow and you can route the FlowFiles through the flow based on the detected language giving you a very powerful NLP pipeline.

There's a lot of cool stuff here but arguably one of the coolest is that by using the NLP Building Blocks you don't have to pay per-request pricing that many of the NLP services charge. You can run this pipeline as much as you need to. And if you are in an environment where your text can't leave your network, this pipeline can be run completely behind a firewall (just like we did in this post).