DataWorks Summit 2019

In May 2019 I had the honor of attending and speaking at the DataWorks Summit in Washington, DC. The conference had many interesting talks and keynote speakers focused on big-data technologies and business applications. I also always enjoy exploring downtown Washington, DC. Whether it is doing the "hike" across the National Mall to take in the sights or visiting the nearby shops, there's always something new to see.

One thing that caught my attention early on was the number of talks that either focused largely on Apache NiFi or at least mentioned it as a component of a larger data ingest platform. Apache NiFi has cemented itself as a core piece of data flow orchestration.

My talk was one of those. It described a process to ingest natural language text, process it, and persist the extracted entities to a database, and Apache NiFi was the workhorse that drove the process. Without NiFi, I would have had to write a lot more code and would probably have ended up with a much less elegant and performant solution.

In conclusion, if you have not yet looked at Apache NiFi for your data ingest and transformation (think ETL) pipeline needs, do yourself a favor and spend a few minutes with NiFi. I think you will like what you find. And if you need help along the way, drop me a note. Just say, "Hey Jeff, I need a pipeline to do X. Show me how it can be done with NiFi." I'll be glad to help.

https://youtu.be/zi3A_SVT8AE

PyData Washington DC 2018

Last month (November 2018) I had the privilege of attending and presenting at PyData Washington DC 2018 at Capital One. It was my first PyData event, and I learned so much that I hope to attend many more. I encourage you to do the same if you're interested in data science and the supporting Python ecosystem. Plus, I got some awesome stickers for my laptop.

My presentation was an introduction to machine translation and demonstrated how it can be used in a streaming pipeline. In it, I gave a (brief) overview of machine translation and how it has evolved from early methods, to statistical machine translation (SMT), to today's neural machine translation (NMT). The demo application used Apache Flink to consume German tweets and, via Sockeye, translate them to English.
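
To make the shape of that pipeline concrete, here is a minimal sketch in plain Python. It is not the code from the talk: a generator stands in for the Flink job, the `translate` callable is a hypothetical stand-in for the call out to the translation model, and the `lang` and `text` fields on each tweet are assumed for illustration.

```python
from typing import Callable, Dict, Iterable, Iterator


def translate_stream(
    tweets: Iterable[Dict[str, str]],
    translate: Callable[[str], str],
    source_lang: str = "de",
) -> Iterator[Dict[str, str]]:
    """Yield each tweet in source_lang with an added English translation.

    `translate` is a hypothetical callable (German text in, English text out);
    in a real deployment it would wrap a request to the translation service.
    """
    for tweet in tweets:
        # Skip anything that is not in the language we want to translate.
        if tweet.get("lang") != source_lang:
            continue
        # Copy the tweet and attach the translation under an assumed key.
        yield {**tweet, "text_en": translate(tweet["text"])}
```

Because the function is a generator, tweets are translated one at a time as they arrive, which mirrors the record-at-a-time style of a streaming job.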

A video and slides will be available soon and will be posted here. The code for the project is available here. Credits to Suneel Marthi for the original implementation from which mine is forked and to Kellen Sunderland for the Sockeye-Thrift code that enabled communication between Flink and Sockeye.


Lucidworks Activate Search and AI Conference

Back in October 2018 I had the privilege of attending and presenting at Lucidworks Activate Search and AI Conference in Montreal, Canada. It was a first-class event with lots of great, informative sessions set in the middle of a remarkable city.

Together with Suneel Marthi, I co-presented "Embracing Diversity: Searching over multiple languages," in which we presented a method of performing cross-language information retrieval (CLIR) using Apache Solr, Apache NiFi, Apache OpenNLP, and Sockeye. Our approach implemented an English-in/English-out system for facilitating searches over a multilingual document index.

We used Apache NiFi to drive the process. The data flow is summarized as follows:

1. The English search term is read from a file on disk. (This is just to demonstrate the system. We could easily receive the search term from somewhere else such as via a REST listener or by some other means.)
2. The search term is translated via Sockeye to the other languages contained in the corpus.
3. The translated search terms are sent to a local instance of Solr.
4. The resulting documents are translated to English, summarized, and returned.

While this is an abbreviated description of the process, it captures the steps at a high level.
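
The flow above can be sketched in a few lines of Python. This is only an illustration of the steps, not the NiFi processors from the presentation: `translate`, `search`, and `summarize` are hypothetical callables that stand in for the translation model, the Solr query, and the summarization step, and their signatures are assumptions.

```python
from typing import Callable, Iterable, List

# Hypothetical signatures, assumed for illustration only.
Translate = Callable[[str, str, str], str]  # (text, source_lang, target_lang) -> text
Search = Callable[[str, str], List[str]]    # (query, lang) -> documents in that language
Summarize = Callable[[str], str]            # English text -> shorter English text


def cross_language_search(
    term: str,
    corpus_langs: Iterable[str],
    translate: Translate,
    search: Search,
    summarize: Summarize,
) -> List[str]:
    """English in, English out: search a multilingual index with one English term."""
    results = []
    for lang in corpus_langs:
        # Translate the English term into the language of this part of the corpus.
        query = term if lang == "en" else translate(term, "en", lang)
        for doc in search(query, lang):
            # Translate each hit back to English, then summarize it.
            english = doc if lang == "en" else translate(doc, lang, "en")
            results.append(summarize(english))
    return results
```

Passing the three steps in as callables keeps the sketch independent of any particular service, so each one can be swapped for a real client or a test stub.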

Check out the video below for the full presentation. The code for the custom Apache NiFi processors described in the presentation is available on GitHub. All of the software used is open source, so you can build this system in your own environment. If you have any questions, please get in touch and I will be glad to help.

https://www.youtube.com/watch?v=ek-crQwMfnQ&t=838s