Quick Introduction to word2vec

In a previous post I gave links to some pretrained models for a few implementations of word vectors. In this post we’ll take a look at word vectors and their applications.

If you have been anywhere around NLP in the past couple of years you have undoubtedly heard of word2vec. As John Rupert Firth said, “You shall know a word by the company it keeps.” That is the premise behind word2vec. Words that have similar contexts will be placed closer to each other by the algorithm. For example, Paris and France will be closer together than Paris and Germany.
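"Closer together" here means closer in vector space, usually measured by cosine similarity. A minimal sketch, using invented 3-dimensional vectors purely for illustration (real word2vec embeddings typically have 100–300 learned dimensions):

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors; values near 1.0
    # mean the vectors point in nearly the same direction.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, hand-picked so that "paris" and
# "france" share more context dimensions than "paris" and "germany".
vectors = {
    "paris":   [0.9, 0.8, 0.1],
    "france":  [0.8, 0.9, 0.2],
    "germany": [0.1, 0.9, 0.8],
}

print(cosine_similarity(vectors["paris"], vectors["france"]))   # higher
print(cosine_similarity(vectors["paris"], vectors["germany"]))  # lower
```

With real trained vectors you would load them from a model rather than hard-coding them, but the comparison works the same way.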

If you are interested in the details of the algorithms behind word2vec, see the paper Efficient Estimation of Word Representations in Vector Space by Tomas Mikolov et al. and the code that accompanies it.

word2vec comes in two model architectures. The first is the Skip-Gram model, which uses the current (center) word to predict the surrounding words (the context). The second, continuous bag-of-words (CBOW), does the reverse: it predicts the current word from the surrounding context. In both models, the number of surrounding words considered is controlled by the window-size parameter.
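To make the two architectures concrete, here is a sketch of how training examples could be extracted from a sentence. The helper `training_examples` is a hypothetical illustration of the windowing step only, not the API of any word2vec library:

```python
def training_examples(tokens, window=2, skip_gram=True):
    """Generate toy training examples for the two word2vec architectures.

    skip_gram=True : (center, context_word) pairs -- center predicts context.
    skip_gram=False: (context_words, center) -- CBOW, context predicts center.
    """
    examples = []
    for i, center in enumerate(tokens):
        # Context = up to `window` words on each side of the center word.
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if skip_gram:
            examples.extend((center, c) for c in context)
        else:
            examples.append((context, center))
    return examples

sentence = ["the", "cat", "sat", "on", "the", "mat"]
print(training_examples(sentence, window=1, skip_gram=True))
print(training_examples(sentence, window=1, skip_gram=False))
```

Widening `window` pulls in more distant words as context, which tends to capture broader topical similarity, while a narrow window emphasizes more local, syntactic relationships.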

The practical applications of word vectors include, but are not limited to, NLP tasks such as named-entity recognition, machine translation, sentiment analysis, recommendation engines, and document retrieval. Word vectors have also been applied to other domains, such as biological sequences of proteins and genes.

Nearly all deep learning and NLP toolkits available today offer at least some support for word vectors. TensorFlow, GluonNLP (built on MXNet), and cloud-based tools such as Amazon SageMaker BlazingText all support them.