In this post we will demonstrate how Idyl NLP can be used to find patient names, hospitals, and names of drugs in natural language text.
The healthcare industry is a quickly growing area for natural language processing (NLP). Recent advancements in technology have made it possible to extract useful and very valuable information from unstructured medical records. This information can be used to correlate patient data and look for treatment patterns. From a security perspective, it may be necessary to quickly identify any protected health information (PHI) that exists in a document for auditing or compliance requirements.
Idyl NLP is our open-source NLP library for Java, licensed under the business-friendly Apache License, version 2.0. The library provides various NLP capabilities through abstracted interfaces to lower-level NLP libraries. The people, places, and things we are concerned with here are patient names, hospitals, and drug names. The goal of Idyl NLP is to provide a powerful, yet easy-to-use, NLP framework.
Extracting Drug Names via a Dictionary
Drug names in the text do not require a trained model since they can be identified via a dictionary. To identify drug names we will use the FDA’s Orange Book. From the Orange Book CSV download we extracted the “Trade Name” column to a text file. Because some drugs appear more than once, we sort the file and remove the duplicate entries. We now have a text file of drug names, one drug per line.
sort drugs.txt | uniq > drugs-sorted.txt
We will use Idyl NLP’s dictionary entity recognizer to find the drug names in our input text. The dictionary entity recognizer takes a file and reads its contents into a bloom filter. The recognizer accepts tokenized text as input. Because some drug names consist of more than one word, we cannot do a simple contains check against the dictionary. Instead, we produce a list of n-grams of the tokenized text, of length one up to the number of input tokens. We then ask the bloom filter whether it “might contain” each n-gram. If the bloom filter returns true, we do a definite check to rule out false positives. Using a bloom filter makes the dictionary checks much more efficient.
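To make the n-gram lookup concrete, here is a minimal, self-contained sketch of the approach. It is not Idyl NLP's actual implementation: a single-hash bit filter stands in for the real bloom filter, and a HashSet provides the definite check that rules out false positives.

```java
import java.util.*;

public class NGramDictionaryMatch {

    // Produce every n-gram of the tokens, from length 1 up to the full token count.
    static List<String> ngrams(String[] tokens) {
        List<String> grams = new ArrayList<>();
        for (int n = 1; n <= tokens.length; n++) {
            for (int i = 0; i + n <= tokens.length; i++) {
                grams.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
            }
        }
        return grams;
    }

    // Set one filter bit per dictionary entry (a real bloom filter uses several hashes).
    static BitSet buildFilter(Set<String> dictionary, int filterBits) {
        BitSet filter = new BitSet(filterBits);
        for (String entry : dictionary) {
            filter.set(Math.floorMod(entry.hashCode(), filterBits));
        }
        return filter;
    }

    // Check each n-gram: the bit filter answers "might contain", then the
    // set does the definite check to rule out false positives.
    static List<String> findMatches(String[] tokens, Set<String> dictionary,
                                    BitSet filter, int filterBits) {
        List<String> matches = new ArrayList<>();
        for (String gram : ngrams(tokens)) {
            String key = gram.toLowerCase();
            boolean mightContain = filter.get(Math.floorMod(key.hashCode(), filterBits));
            if (mightContain && dictionary.contains(key)) {
                matches.add(gram);
            }
        }
        return matches;
    }
}
```

Note that the case-insensitive lookup here mirrors the case-insensitivity option we pass to the dictionary entity recognizer below.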
Extracting Patient Names and Hospitals
Patient names and hospitals will be extracted from the input text through the use of trained models. Each model was created from the same training data. The only difference is that in the patient model the patient names were annotated, and in the hospital model the names of hospitals were annotated. This training process gives us two model files, one for patients and one for hospitals, and their associated model manifest files. To use these models we will instantiate a model-based entity recognizer. The recognizer will load the two trained entity models from disk.
Creating the Pipeline
To use these two entity recognizers we will create a NerPipeline. This pipeline accepts a list of entity recognizers when built along with other configurable settings, such as a sentence detector and tokenizer. When the pipeline is executed, each entity recognizer will be applied to the input text. The output will be a list of Entity objects that contain information about each extracted entity.
Below is the code implementing the steps described above. Refer to the idylnlp-samples project for up-to-date examples, since this code could change between the time it was written and the time you read it. This code uses Idyl NLP 1.1.0-SNAPSHOT.
First we create the dictionary entity recognizer. The first argument specifies that the entities extracted will be identified as English, the second argument is the full path to the file created from the Orange Book, the third argument is the type of entity, the fourth argument is the false-positive probability for the bloom filter, and the last argument indicates that the dictionary lookup is not case-sensitive.
DictionaryEntityRecognizer dictionaryRecognizer = new DictionaryEntityRecognizer(LanguageCode.en, "/path/to/drugs-sorted.txt", "drug", 0.1, false);
Creating the model entity recognizer requires us to read the model manifests from disk. Maps correlate models for entity types and languages.
String modelPath = "/path/to/trained-models/";

LocalModelLoader<TokenNameFinderModel> entityModelLoader =
    new LocalModelLoader<>(new TrueValidator(), modelPath);

// Read the manifests for the two trained models. readManifest returns the
// base ModelManifest type, so we cast to StandardModelManifest.
StandardModelManifest patientModelManifest =
    (StandardModelManifest) ModelManifestUtils.readManifest("/full/path/to/patient.manifest");
StandardModelManifest hospitalModelManifest =
    (StandardModelManifest) ModelManifestUtils.readManifest("/full/path/to/hospital.manifest");

Set<StandardModelManifest> patientModelManifests = new HashSet<>();
patientModelManifests.add(patientModelManifest);

Set<StandardModelManifest> hospitalModelManifests = new HashSet<>();
hospitalModelManifests.add(hospitalModelManifest);

// Map each entity type to its per-language model manifests.
Map<LanguageCode, Set<StandardModelManifest>> persons = new HashMap<>();
persons.put(LanguageCode.en, patientModelManifests);

Map<LanguageCode, Set<StandardModelManifest>> hospitals = new HashMap<>();
hospitals.put(LanguageCode.en, hospitalModelManifests);

Map<String, Map<LanguageCode, Set<StandardModelManifest>>> models = new HashMap<>();
models.put("person", persons);
models.put("hospital", hospitals);

OpenNLPEntityRecognizerConfiguration config = new Builder()
    .withEntityModelLoader(entityModelLoader)
    .withEntityModels(models)
    .build();

OpenNLPEntityRecognizer modelRecognizer = new OpenNLPEntityRecognizer(config);
Now we can create the pipeline providing the entity recognizers:
List<EntityRecognizer> entityRecognizers = new ArrayList<>();
entityRecognizers.add(dictionaryRecognizer);
entityRecognizers.add(modelRecognizer);

NerPipeline pipeline = new NerPipeline.NerPipelineBuilder()
    .withEntityRecognizers(entityRecognizers)
    .build();
And, finally, we can execute the pipeline:
String input = FileUtils.readFileToString(new File("/tmp/input-file.txt"));
EntityExtractionResponse response = pipeline.run(input);
The response will contain a set of entities (persons, hospitals, and drugs) that were extracted from the input text.
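Consuming the response might look like the following sketch, for example to group the extracted entities by type for a PHI audit report. The Entity stand-in class here is illustrative only; its field names are assumptions, not Idyl NLP's actual Entity definition, though the library's class carries at least the matched text, its type, and a confidence score.

```java
import java.util.*;

public class EntityReport {

    // Illustrative stand-in for an extracted entity; field names are assumed.
    static class Entity {
        final String text;
        final String type;
        final double confidence;
        Entity(String text, String type, double confidence) {
            this.text = text;
            this.type = type;
            this.confidence = confidence;
        }
    }

    // Group entity texts by type, keeping only those above a confidence floor.
    static Map<String, List<String>> groupByType(Collection<Entity> entities,
                                                 double minConfidence) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Entity e : entities) {
            if (e.confidence >= minConfidence) {
                grouped.computeIfAbsent(e.type, k -> new ArrayList<>()).add(e.text);
            }
        }
        return grouped;
    }
}
```

Filtering on confidence like this is a common step before acting on extracted PHI, since low-confidence matches are more likely to be noise.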
Because we created the pipeline using mostly defaults, it will use an internal English sentence detector and tokenizer. For other languages you can create the pipeline with different options. As with any trained model used for named-entity recognition, performance depends heavily on how well the training data represents the actual data the model will see.