When building an NLP pipeline in Apache NiFi, you may need to route text through the pipeline based on its language. But how do we determine the language of the text inside the pipeline? This blog post introduces a processor for Apache NiFi that uses Apache OpenNLP’s language detection capabilities. The processor receives natural language text and returns a list of detected languages, ordered by probability, along with each language’s probability. Your pipeline can take the first language in the list (the one with the highest probability) and use it to route the text.
In case you are not familiar with OpenNLP’s language detection, it provides the ability to detect over 100 languages. It works best with text containing more than one sentence (the more text the better). It was introduced in OpenNLP 1.8.3.
To use the processor, first clone it from GitHub. Then build it and copy the NAR file to your NiFi lib directory (restart NiFi if it was running). We are using NiFi 1.4.0.
git clone https://github.com/mtnfog/nlp-nifi-processors.git
cd nlp-nifi-processors
mvn clean install
cp langdetect-nifi-processor/langdetect-processor-nar/target/*.nar /path/to/nifi/lib/
The processor does not have any settings to configure. It’s ready to work right “out of the box.” After restarting NiFi, you can add the processor to your NiFi canvas.
You will likely want to connect the processor to an EvaluateJsonPath processor to extract the language from the JSON response, and then to a RouteOnAttribute processor to route the text through the pipeline based on that language. This processor will also work with Apache NiFi MiNiFi to determine the language of text on edge devices. MiNiFi is a subproject of Apache NiFi for collecting data into NiFi flows from edge locations.
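To make the selection step concrete, here is a minimal Java sketch of the logic applied to the processor's output: take the first entry of the ordered list, and fall back when the list is empty or the best guess is weak. The Detected class, the pick method, and the minimum-probability threshold are illustrative assumptions, not part of the processor or of NiFi. In a real flow this step is handled by EvaluateJsonPath and RouteOnAttribute rather than custom code.

```java
import java.util.List;
import java.util.Optional;

final class TopLanguage {

    // Hypothetical holder for one entry of the detected-language list.
    static final class Detected {
        final String code;        // language code, e.g. "eng"
        final double probability; // probability reported for that language

        Detected(String code, double probability) {
            this.code = code;
            this.probability = probability;
        }
    }

    // Returns the code of the first (most probable) entry. The list is
    // assumed to be ordered by descending probability already. Returns
    // empty if the list is empty or the best entry is below minProbability.
    static Optional<String> pick(List<Detected> ordered, double minProbability) {
        if (ordered.isEmpty()) {
            return Optional.empty();
        }
        Detected best = ordered.get(0);
        return best.probability >= minProbability
                ? Optional.of(best.code)
                : Optional.empty();
    }

    public static void main(String[] args) {
        List<Detected> detected = List.of(
                new Detected("eng", 0.92),
                new Detected("deu", 0.05));
        System.out.println(pick(detected, 0.5).orElse("unknown"));
    }
}
```

The threshold is a design choice worth considering for short inputs, where the top language may carry a low probability and routing on it could send text down the wrong branch.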
Backing up a bit, why would we need to route text through the pipeline depending on its language? The actions taken further down in the pipeline are likely to be language dependent. For instance, the next step might be to tokenize the text, but knowing how to tokenize it requires knowing what language it is in. Or, if the next step is to send the text to an entity extraction process, we need to know which entity model to use for that language. So, language detection can be a crucial initial step in an NLP pipeline. A previous blog post showed how to use NiFi for an NLP pipeline, and extending it with language detection would be a great addition!
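The idea of language-dependent branches can be sketched in a few lines of Java. The mapping below is hypothetical: the route names are made up for illustration, and the language codes assume ISO 639-3 identifiers such as eng and deu, which is the style of code OpenNLP's language detector reports. The "unmatched" fallback mirrors the relationship of the same name on RouteOnAttribute.

```java
import java.util.Locale;
import java.util.Map;

final class LanguageRouter {

    // Fallback route for languages with no dedicated branch.
    static final String UNMATCHED = "unmatched";

    // Illustrative mapping from language code to a pipeline branch.
    // Each branch would then load its own tokenizer or entity model.
    private static final Map<String, String> ROUTES = Map.of(
            "eng", "english",
            "deu", "german",
            "spa", "spanish");

    // Returns the branch name for a language code, or UNMATCHED when
    // the code is missing or has no configured branch.
    static String routeFor(String languageCode) {
        if (languageCode == null) {
            return UNMATCHED;
        }
        String normalized = languageCode.trim().toLowerCase(Locale.ROOT);
        return ROUTES.getOrDefault(normalized, UNMATCHED);
    }

    public static void main(String[] args) {
        System.out.println(routeFor("eng"));
        System.out.println(routeFor("jpn"));
    }
}
```

Keeping an explicit fallback branch means text in an unsupported language is still accounted for instead of being silently pushed through a tokenizer or entity model built for another language.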
This processor performs the language detection inside the NiFi process, so everything remains inside your NiFi installation. This should be adequate for a lot of use cases, but if you need more throughput, check out Renku Language Detection Engine. It works very similarly to this processor in that it receives text and returns a list of identified languages. However, Renku is implemented as a stateless, scalable microservice, meaning you can deploy as many instances as you need to meet your use case’s requirements. And maybe the best part is that Renku is free for everyone to use without any limits.
Let us know how the processor works out for you!