Creating an N-gram Language Model

A statistical language model is a probability distribution over sequences of words. (source)

We can build a language model using n-grams and query it to determine the probability of an arbitrary sentence (a sequence of words) belonging to that language.
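To make the idea concrete before we reach for a real tool, here is a minimal sketch of a bigram (n = 2) model in Python. It is not how kenlm works internally; the function names and the three-sentence toy corpus are made up for illustration, and there is no smoothing, so any unseen bigram drives the whole sentence probability to zero.

```python
from collections import Counter

def train_bigram_counts(sentences):
    """Count bigrams and their histories, padding each sentence with <s> and </s>."""
    history_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return history_counts, bigram_counts

def sentence_probability(sentence, history_counts, bigram_counts):
    """Multiply P(word | previous word) over the sentence (chain rule, unsmoothed)."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if history_counts[prev] == 0:
            return 0.0  # history never seen, so the estimate is undefined; treat as 0
        prob *= bigram_counts[(prev, word)] / history_counts[prev]
    return prob

# Hypothetical toy corpus, far too small for anything but illustration.
toy_corpus = ["the cat sat", "the dog sat", "the cat ran"]
history_counts, bigram_counts = train_bigram_counts(toy_corpus)

p_seen = sentence_probability("the cat sat", history_counts, bigram_counts)
p_unseen = sentence_probability("sat the cat", history_counts, bigram_counts)
print(p_seen, p_unseen)
```

The same three words in a different order get a very different score, which is exactly the property we want from a language model.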

Language modeling has uses in various NLP applications such as statistical machine translation and speech recognition. In machine translation, for example, the model can score several candidate translations so that the system prefers the one that reads most like fluent text in the target language.

To build a model, we need a corpus and a language modeling tool. We will use kenlm as our tool. Other language modeling tools exist, and some are listed at the bottom of the Language Model Wikipedia article.

To start, we will clone the kenlm repository from GitHub:

git clone https://github.com/kpu/kenlm.git

Once cloned, we will follow the instructions in the repository’s README for how to compile. Run from inside the cloned kenlm directory, those instructions are:

mkdir -p build
cd build
cmake ..
make -j 4

Once done we have a bin directory inside build that contains the kenlm binaries. We can now create our language model. As text to experiment with I used the raw text of Pride and Prejudice, saved as book.txt. You will almost certainly need a much larger corpus to get meaningful results, but this is sufficient for testing and learning.

To create the model:

./bin/lmplz -o 5 < book.txt > book.lm.arpa

This creates a language model file in the ARPA format. The -o option specifies the order of the model, that is, the maximum length of the n-grams it stores. With this language model we can estimate how probable an arbitrary sentence is under a model trained on Pride and Prejudice:

echo "This is my sentence ." | ./bin/query book.lm.arpa

The output shows us a few things.

Loading the LM will be faster if you build a binary file.
Reading book.lm.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
This=14 2 -2.8062737 is=16 2 -1.1830423 my=186 3 -1.7089757 sentence=6455 1 -4.2776613 .=0 1 -4.980392 </s>=2 1 -1.2587173 Total: -16.215061 OOV: 1
Perplexity including OOVs: 504.0924558936663
Perplexity excluding OOVs: 176.57688116229482
OOVs: 1
Tokens: 6
Name:query VmPeak:33044 kB VmRSS:4836 kB RSSMax:14040 kB user:0.273361 sys:0.00804 CPU:0.281469 real:0.279475

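Each scored token is printed as the word and its vocabulary id, followed by the length of the n-gram that was matched and the base-10 log probability of the word given its context. The period has vocabulary id 0, which marks it as the one out-of-vocabulary (OOV) token: because we did not tokenize the corpus, a standalone "." never appears in it.

The value -16.215061 is the base-10 log probability of the whole sentence under the model. Ten to the power of -16.215061 gives us 6.0945129×10^-17.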
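The perplexity figures in the output can be reproduced from the per-word scores. The sketch below parses the scored line copied from the output above and recomputes the total, the sentence probability, and both perplexities. It assumes the three numbers after each word are the vocabulary id, the matched n-gram length, and the base-10 log probability, and that id 0 marks an OOV token; that is my reading of the output rather than something taken from the kenlm documentation.

```python
import re

# The scored line, copied from the query output above.
query_line = (
    "This=14 2 -2.8062737 is=16 2 -1.1830423 my=186 3 -1.7089757 "
    "sentence=6455 1 -4.2776613 .=0 1 -4.980392 </s>=2 1 -1.2587173"
)

def parse_scores(line):
    """Return (word, vocab_id, ngram_length, log10_prob) tuples from a scored line."""
    pattern = re.compile(r"(\S+)=(\d+) (\d+) (-?\d+(?:\.\d+)?)")
    return [(w, int(i), int(n), float(lp)) for w, i, n, lp in pattern.findall(line)]

def perplexity(log10_probs):
    """Perplexity is 10 raised to the negative mean log10 probability."""
    return 10 ** (-sum(log10_probs) / len(log10_probs))

scores = parse_scores(query_line)
all_logs = [lp for _, _, _, lp in scores]
# Assumption: vocabulary id 0 means the token is out of vocabulary.
in_vocab_logs = [lp for _, vocab_id, _, lp in scores if vocab_id != 0]

total_log10 = sum(all_logs)
sentence_prob = 10 ** total_log10
ppl_including_oovs = perplexity(all_logs)
ppl_excluding_oovs = perplexity(in_vocab_logs)

print(total_log10, sentence_prob, ppl_including_oovs, ppl_excluding_oovs)
```

Up to floating-point rounding, the results agree with the Total and the two Perplexity lines that the query tool printed.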

Compare with word2vec

So how does an n-gram language model compare with word2vec models? Do they do the same thing? No, they don’t. An n-gram language model assigns probabilities to word sequences, so the order of the words is important. word2vec learns vector representations of words and does not consider the ordering of words; it only looks at which words appear within a window of a given size. This allows word2vec to predict the neighboring words given some context without consideration of word order.
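A small sketch can illustrate the difference. It is a deliberate simplification and not the word2vec training algorithm: it only compares the ordered bigrams of two sentences with the unordered bag of words that falls inside a context window. The helper names are made up for this example.

```python
from collections import Counter

def bigrams(tokens):
    """Ordered pairs of adjacent words, as an n-gram model with n = 2 would see them."""
    return list(zip(tokens, tokens[1:]))

def context_bag(tokens, center_index, window):
    """Unordered multiset of words within `window` positions of the center word."""
    lo = max(0, center_index - window)
    hi = min(len(tokens), center_index + window + 1)
    return Counter(tokens[lo:center_index] + tokens[center_index + 1:hi])

a = "dog bites man".split()
b = "man bites dog".split()

ngrams_differ = bigrams(a) != bigrams(b)
contexts_match = context_bag(a, 1, 1) == context_bag(b, 1, 1)
print(ngrams_differ, contexts_match)
```

The two sentences produce different bigrams, but the context around "bites" is the same bag of words in both, so a purely window-based view cannot tell them apart.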

A little bit more…

This post did not go into the inner workings of kenlm. For those details refer to the kenlm repository or to this paper. Of particular note is modified Kneser-Ney smoothing, the algorithm kenlm uses to estimate probabilities for n-grams that were rare or absent in the corpus. A corpus will never contain every possible n-gram, so the sentence we are scoring may well contain an n-gram that the model has never seen.
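To see why smoothing matters, here is a sketch that uses add-k smoothing. This is a much simpler technique than Kneser-Ney and is not what kenlm does; it is swapped in only because it fits in a few lines and shows the basic effect of moving some probability mass to unseen n-grams. The function name and the toy sentence are hypothetical.

```python
from collections import Counter

def bigram_probability(prev, word, tokens, vocab_size, k=0.0):
    """P(word | prev) with add-k smoothing; k=0 gives the unsmoothed estimate.

    With k=0 the history `prev` must occur in `tokens`, otherwise this divides by zero.
    """
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    history_counts = Counter(tokens[:-1])
    return (bigram_counts[(prev, word)] + k) / (history_counts[prev] + k * vocab_size)

tokens = "the cat sat on the mat".split()
vocab_size = len(set(tokens))  # 5 distinct words

# "cat mat" never occurs in the toy text.
unsmoothed = bigram_probability("cat", "mat", tokens, vocab_size)
smoothed = bigram_probability("cat", "mat", tokens, vocab_size, k=1.0)
print(unsmoothed, smoothed)
```

Without smoothing the unseen bigram gets probability zero, which would zero out any sentence containing it. With k = 1 it gets a small non-zero probability instead. Kneser-Ney reaches the same goal in a more principled way.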

Note that the input text to kenlm should be preprocessed and tokenized, a step which we skipped here. You could use Sonnet Tokenization Engine.
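As a rough idea of what such a preprocessing step does, here is a minimal regex-based sketch. It is not the Sonnet Tokenization Engine and is far cruder than a real tokenizer; it only separates punctuation from words so that a token such as "." shows up on its own in the training text. The function name is made up for this example.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return " ".join(tokens)

tokenized = simple_tokenize("This is my sentence.")
print(tokenized)
```

Running the corpus through a step like this before training, and applying the same step to every sentence before querying, keeps the model's vocabulary and the query tokens consistent.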

To see an example of kenlm used in support of statistical machine translation see Apache Joshua.
