Datasets and Generated Artifacts

We have made the following datasets available for use in NLP and text mining tasks. Please refer to each item’s README file for important information on how the model was created. For additional information on any of these datasets, please contact us.

If you have problems downloading a file, please use your browser’s Save Link As function.

PubMed Open Access Subset (Commercial) Pretrained word2vec Vectors

Word vectors generated by word2vec from the commercial Open Access Subset of the PubMed collection of biomedical literature. The text was preprocessed by converting all text to lowercase and removing all punctuation. These vectors were created in August 2019.

File             Size  Window  Min Count  Algorithm
Vectors, Readme  200   5       10         cbow
Vectors, Readme  200   5       10         skipgram
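
Because the training text was lowercased and stripped of punctuation, query text generally needs the same treatment before its tokens will match the vocabulary. The following is a minimal sketch of such a normalization step. The function name `normalize` is hypothetical, and two details are assumptions not stated above: that "punctuation" means the ASCII characters in Python's `string.punctuation`, and that punctuation was deleted rather than replaced by a space. The README for each file is the authority on the actual preprocessing.

```python
import string

# Assumption: "punctuation" is the ASCII set in string.punctuation, and each
# punctuation character is deleted (not replaced by a space).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> list[str]:
    """Lowercase the text, delete ASCII punctuation, and split on whitespace."""
    return text.lower().translate(_PUNCT_TABLE).split()


tokens = normalize("BRCA1-mediated DNA repair, in vivo.")
```

Under these assumptions a hyphenated term such as "BRCA1-mediated" becomes the single token "brca1mediated". If lookups for such terms fail, the original preprocessing may have split on punctuation instead.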

PubMed Open Access Subset (Commercial) Pretrained fastText Vectors

Word vectors generated by fastText from the commercial Open Access Subset of the PubMed collection of biomedical literature. The text was preprocessed by converting all text to lowercase and removing all punctuation. These vectors were created in August 2019. Refer to the fastText documentation for information on the file formats.

File                    Size  Window  Min Count  Algorithm
Model, Vectors, Readme  200   5       10         cbow
Model, Vectors, Readme  200   5       10         skipgram
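
As a rough illustration of the file formats mentioned above: fastText's text vector format starts with a header line giving the word count and the dimension, followed by one line per word containing the word and its vector components separated by spaces. The sketch below reads that layout using only the standard library. It assumes the Vectors download is in this text format (the Model download is a binary file and is not handled here), and the helper name `read_vec` and the three-dimensional sample data are made up for the example.

```python
import io
from typing import Dict, List, TextIO, Tuple


def read_vec(handle: TextIO) -> Tuple[int, Dict[str, List[float]]]:
    """Parse text-format vectors: a 'count dim' header, then 'word v1 ... vdim' lines."""
    header = handle.readline().split()
    dim = int(header[1])  # header[0] is the number of words
    vectors: Dict[str, List[float]] = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors


# Made-up sample with 2 words and 3 dimensions, standing in for a real file.
sample = io.StringIO("2 3\nprotein 0.1 0.2 0.3\ngene 0.4 0.5 0.6\n")
dim, vectors = read_vec(sample)
```

For a real download, the handle would come from something like `open(path, encoding="utf-8")`, and the dimension in the header should match the Size column in the table. Holding every vector in a Python dictionary is memory-hungry for a large vocabulary, so a dedicated library may be a better fit in practice.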