Recently (in May 2019) I had the honor of attending and speaking at the Dataworks Summit in Washington D.C. The conference had many interesting topics and keynote speakers focused on big-data technologies and business applications. I also always enjoy exploring downtown Washington DC. Whether it is doing the “hike” across the National Mall taking in the sights or visiting all of the nearby shopping, there’s always something new to see.
One thing that caught my attention early on was the number of talks that either focused largely on Apache NiFi or at least mentioned Apache NiFi as a component of a larger data ingest platform. Apache NiFi has definitely cemented itself squarely as a core piece of data flow orchestration.
My talk was one of those. In my talk that described a process to ingest natural language text, process it, and persist extracted entities to a database, Apache NiFi was the workhorse that drove the process. Without NiFi, I would have had to write a lot more code and probably ended up with a much less elegant and performant solution.
In conclusion, if you have not yet looked at Apache NiFi for your data ingest and transformation (think ETL) pipeline needs, do yourself a favor and spend a few minutes with NiFi. I think you will find what you like. And if you need help along the way drop me a note. Just say, “Hey Jeff, I need a pipeline to do X. Show me how it can be done with NiFi.” I’ll be glad to help.
We are happy to let you know how Idyl E3 Entity Extraction Engine can be used with Apache NiFi. First, what is Apache NiFi? From the NiFi homepage: “Apache NiFi supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic.” Idyl E3 extracts entities (persons, places, things) from natural language text.
That’s a very short description of NiFi but it is very accurate. Apache NiFi allows you to configure simple or complex processes of data processing. For example, you can configure a pipeline to consume files from a file system and upload them to S3. (See Example Dataflow Templates.) There are many operations you can do and they are performed by components called Processors. There are many excellent guides available about NiFi, such as:
There are many processors available for NiFi out of the box. One in particular is the InvokeHttp processor that lets your pipeline send an HTTP request. You can use this processor to send text to Idyl E3 for entity extraction from within your pipeline. However, to make things a bit simpler and more flexible we have created a custom NiFi processor just for Idyl E3. This processor is available on GitHub and its binaries will be included with all editions of Idyl E3 starting with version 2.3.0.
Instructions for how to use the Idyl E3 processor will be added to the Idyl E3 documentation bit they are simple. Here’s a rundown. Copy the idyl-e3-nifi-processor.jar from Idyl E3’s home directory to NiFi’s lib directory. Restart NiFi. Once NiFi is available you will see the Idyl E3 in the list of processors when adding a processor:
There are a few properties you can set but the only required property is the Idyl E3 endpoint. By default, the processor extracts entities from the input text but this can be changed using the action property. The available actions are:
- extract (the default) to get a JSON response containing the entities.
- annotate to return the input text with the entities annotated.
- sanitize to return the input text with the entities removed.
- ingest to extract entities from the input text but provide no response. (This is useful if you are letting Idyl E3 plugins handle the publishing of entities to a database or other service outside of the NiFi data flow.)
The available properties are shown in the screen capture below:
And that is it. The processor will send text to Idyl E3 for entity extraction via Idyl E3’s /api/v2/extract endpoint. The response from Idyl E3 containing the entities will be placed in a new idyl-e3-response attribute.
The Idyl E3 NiFi processor is licensed under the Apache Software License, version 2.0. Under the hood, the processor uses the Idyl E3 client SDK for Java which is also licensed under the Apache license.