Extracting Patient Names, Hospitals, and Drugs with Idyl NLP

In this post we will demonstrate how Idyl NLP can be used to find patient names, hospitals, and names of drugs in natural language text.

The healthcare industry is a quickly growing area for natural language processing (NLP). Recent advancements in technology have made it possible to extract useful and valuable information from unstructured medical records, which can be used to correlate patient information and look for treatment patterns. From a security perspective, it may be necessary to quickly identify any protected health information (PHI) in a document for auditing or compliance requirements.

Idyl NLP is our open-source NLP library for Java, licensed under the business-friendly Apache License, version 2.0. The library provides various NLP capabilities through abstracted interfaces to lower-level NLP libraries. The entities we are concerned with here are patient names, hospitals, and drug names. The goal of Idyl NLP is to provide a powerful yet easy-to-use NLP framework.

Extracting Drug Names via a Dictionary

Drug names in the text do not require a trained model since they can be identified via a dictionary. To identify drug names we will use the FDA’s Orange Book. From the Orange Book CSV download we extracted the “Trade Name” column to a text file. Because some drugs appear more than once, we sort the file and remove the duplicate entries. We now have a text file of drug names with one drug per line.

cat drugs.txt | sort | uniq > drugs-sorted.txt

We will use Idyl NLP’s dictionary entity recognizer to find the drug names in our input text. The dictionary entity recognizer takes a file and reads its contents into a bloom filter. The dictionary entity recognizer accepts tokenized text as input. Because some drug names may consist of more than one word, we cannot do a simple contains check against the dictionary. Instead, we produce a list of n-grams of the tokenized text, of lengths from one up to the number of input tokens. We can now see if the bloom filter “might contain” each n-gram. If the bloom filter returns true, we then do a definite check to rule out false positives. Using a bloom filter makes the dictionary checks much more efficient.
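To make the n-gram lookup concrete, here is a minimal, self-contained sketch of the approach. This is illustrative Java, not Idyl NLP’s actual DictionaryEntityRecognizer implementation; the toy bloom filter and the findDrugs helper are hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DrugDictionaryLookup {

  // A toy bloom filter over strings using two simple hash functions.
  // A real implementation would size the bit set from the desired
  // false-positive probability.
  static class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;

    SimpleBloomFilter(int size) {
      this.size = size;
      this.bits = new BitSet(size);
    }

    void put(String s) {
      bits.set(h1(s));
      bits.set(h2(s));
    }

    boolean mightContain(String s) {
      return bits.get(h1(s)) && bits.get(h2(s));
    }

    private int h1(String s) { return Math.floorMod(s.hashCode(), size); }
    private int h2(String s) { return Math.floorMod(31 * s.hashCode() + 7, size); }
  }

  // Generates every n-gram of the tokens, from length one up to the
  // number of tokens, and returns those present in the dictionary.
  public static List<String> findDrugs(String[] tokens, Collection<String> dictionary) {
    Set<String> exact = new HashSet<>();
    SimpleBloomFilter bloom = new SimpleBloomFilter(4096);
    for (String drug : dictionary) {
      String normalized = drug.toLowerCase();
      exact.add(normalized);
      bloom.put(normalized);
    }

    List<String> found = new ArrayList<>();
    for (int n = 1; n <= tokens.length; n++) {
      for (int i = 0; i + n <= tokens.length; i++) {
        String ngram = String.join(" ", Arrays.copyOfRange(tokens, i, i + n)).toLowerCase();
        // Cheap probabilistic check first; the definite set lookup
        // rules out the bloom filter's false positives.
        if (bloom.mightContain(ngram) && exact.contains(ngram)) {
          found.add(ngram);
        }
      }
    }
    return found;
  }

  public static void main(String[] args) {
    List<String> dictionary = Arrays.asList("aspirin", "tylenol pm");
    String[] tokens = {"Take", "Tylenol", "PM", "or", "aspirin"};
    System.out.println(findDrugs(tokens, dictionary)); // [aspirin, tylenol pm]
  }
}
```

Note that the multi-word trade name “tylenol pm” is only matched by the length-two n-gram; a token-by-token contains check would have missed it.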

Extracting Patient Names and Hospitals

Patient names and hospitals will be extracted from the input text through the use of trained models. Each model was created from the same training data. The only difference is that in the patient model the patient names were annotated, and in the hospital model the names of hospitals were annotated. This training process gives us two model files, one for patients and one for hospitals, and their associated model manifest files. To use these models we will instantiate a model-based entity recognizer. The recognizer will load the two trained entity models from disk.

Creating the Pipeline

To use these two entity recognizers we will create a NerPipeline. This pipeline accepts a list of entity recognizers when built along with other configurable settings, such as a sentence detector and tokenizer. When the pipeline is executed, each entity recognizer will be applied to the input text. The output will be a list of Entity objects that contain information about each extracted entity.

The Code

Below is the code described above. Refer to the idylnlp-samples project for up-to-date examples, since this code could change between the time it was written and the time you see it here. This code uses Idyl NLP 1.1.0-SNAPSHOT.

Creating the dictionary entity recognizer: the first argument specifies that the extracted entities will be identified as English, the second argument is the full path to the file created from the Orange Book, the third argument is the type of entity, the fourth argument is the false-positive probability for the bloom filter, and the last argument indicates that the dictionary lookup is not case-sensitive.

DictionaryEntityRecognizer dictionaryRecognizer = new DictionaryEntityRecognizer(LanguageCode.en, "/path/to/drugs-sorted.txt", "drug", 0.1, false);

Creating the model entity recognizer requires us to read the model manifests from disk. Nested maps then associate each entity type and language with its set of model manifests.

String modelPath = "/path/to/trained-models/";

LocalModelLoader<TokenNameFinderModel> entityModelLoader = new LocalModelLoader<>(new TrueValidator(), modelPath);

ModelManifest patientModelManifest = ModelManifestUtils.readManifest("/full/path/to/patient.manifest");
ModelManifest hospitalModelManifest = ModelManifestUtils.readManifest("/full/path/to/hospital.manifest");

Set<StandardModelManifest> patientModelManifests = new HashSet<StandardModelManifest>();
patientModelManifests.add((StandardModelManifest) patientModelManifest);

Set<StandardModelManifest> hospitalModelManifests = new HashSet<StandardModelManifest>();
hospitalModelManifests.add((StandardModelManifest) hospitalModelManifest);

Map<LanguageCode, Set<StandardModelManifest>> persons = new HashMap<>();
persons.put(LanguageCode.en, patientModelManifests);

Map<LanguageCode, Set<StandardModelManifest>> hospitals = new HashMap<>();
hospitals.put(LanguageCode.en, hospitalModelManifests);

Map<String, Map<LanguageCode, Set<StandardModelManifest>>> models = new HashMap<>();
models.put("person", persons);
models.put("hospital", hospitals);

The recognizer’s configuration is created via its builder, which is given the model loader and the models map created above (see the idylnlp-samples project for the complete builder calls):

OpenNLPEntityRecognizerConfiguration config = new Builder().build();

OpenNLPEntityRecognizer modelRecognizer = new OpenNLPEntityRecognizer(config);

Now we can create the pipeline providing the entity recognizers:

List<EntityRecognizer> entityRecognizers = new ArrayList<>();
entityRecognizers.add(dictionaryRecognizer);
entityRecognizers.add(modelRecognizer);

NerPipeline pipeline = new NerPipeline.NerPipelineBuilder().withEntityRecognizers(entityRecognizers).build();

And, finally, we can execute the pipeline:

String input = FileUtils.readFileToString(new File("/tmp/input-file.txt"));
EntityExtractionResponse response = pipeline.run(input);

The response will contain a set of entities (persons, hospitals, and drugs) that were extracted from the input text.


Because we created the pipeline with mostly default settings, it will use an internal English sentence detector and tokenizer. For other languages you can create the pipeline with different options. As with any trained model used for named-entity recognition, the performance of the model is important: how well the training data represents the actual data is crucial to achieving good results.

Simplified Named-Entity Extraction Pipeline in Idyl NLP

Idyl NLP 1.1.0 introduces a simplified named-entity extraction pipeline that can be created in just a few lines of code. The following code block shows how to make a pipeline to extract named-person entities from natural language English text in Idyl NLP.

NerPipelineBuilder builder = new NerPipeline.NerPipelineBuilder();
NerPipeline pipeline = builder.build(LanguageCode.en);

EntityExtractionResponse response = pipeline.run("George Washington was president.");
for(Entity entity : response.getEntities()) {
  System.out.println(entity);
}

When you run this code a single line will be printed to the screen:

Text: George Washington; Confidence: 0.96; Type: person; Language Code: eng; Span: [0..2);

Internally, the pipeline creates a sentence detector, tokenizer, and named-entity recognizer for the given language. Currently only person entities for English are supported, but we will be adding support for more languages and more entity types in the future. The goal of this functionality is to reduce the amount of code needed to perform a complex operation like named-entity extraction. The NerPipeline class is new in Idyl NLP 1.1.0-SNAPSHOT.

Idyl NLP is our open-source, Apache-licensed NLP framework for Java. Its releases are available in Maven Central and daily snapshots are also available. See Idyl NLP on GitHub at https://github.com/idylnlp/idylnlp for the code, examples, and documentation. Idyl NLP powers our NLP Building Blocks.

Idyl NLP

We have open-sourced our NLP library and its associated projects on GitHub. The library, Idyl NLP, is a Java natural language processing library. It is licensed under the Apache License, version 2.0.

Idyl NLP stands on the shoulders of giants to provide a capable and flexible NLP library. Utilizing components such as OpenNLP and DeepLearning4j under the hood, Idyl NLP offers various implementations for NLP tasks such as language detection, sentence extraction, tokenization, named-entity extraction, and document classification.

Idyl NLP has its own webpage at http://idylnlp.ai and is available in Maven Central under the group ai.idylnlp.

Here are the GitHub project links:

Idyl NLP powers our NLP building block microservices and they are also open source on GitHub:

NLP Models and Model Zoo

Idyl NLP has the ability to automatically download NLP models when needed. The Idyl NLP Models repository contains model manifests for various NLP models. Through the manifest files, Idyl NLP can automatically download the model file referenced by each manifest and use it. Powering this capability is the Idyl NLP Model Zoo, which will soon be hosted at zoo.idylnlp.ai. It is a Spring Boot application that provides a REST interface for querying and downloading models, so you can also run your own model zoo for internal use. See these two repositories on GitHub for more information about the available models and the model zoo. Models will become available through the repository in the coming days.
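To illustrate the idea, a model manifest is a small metadata file that tells Idyl NLP what a model does and where to get its model file. The keys below are purely hypothetical placeholders, not the actual manifest schema; see the Idyl NLP Models repository on GitHub for real manifest files.

```
# Hypothetical manifest sketch -- key names are illustrative only.
model.id=patient-model-en
model.name=English Patient Name Model
model.type=person
language.code=eng
model.filename=patient.bin
```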

Sample Projects

There are some sample projects available for Idyl NLP. The samples illustrate how to use some of Idyl NLP’s core capabilities and hopefully provide starting points for using Idyl NLP in your projects.


We are committed to further developing Idyl NLP and its ecosystem, and we welcome the community’s contributions to help it flourish and grow. We hope that the business-friendly Apache license helps Idyl NLP’s adoption. Like most software engineers, we are a bit behind on documentation; in the near term we will be focusing on the wiki, Javadocs, and the sample projects. Our NLP Building Blocks will continue to be powered by Idyl NLP.

For questions or more information please contact help@idylnlp.ai.