PyData Washington DC 2018

In November 2018 I had the privilege of attending and presenting at PyData Washington DC 2018 at Capital One. It was my first PyData event, and I learned so much that I hope to attend many more. If you're interested in data science and the supporting Python ecosystem, I encourage you to attend one, too. Plus, I got some awesome stickers for my laptop.

My presentation was an introduction to machine translation and demonstrated how machine translation can be used in a streaming pipeline. In it, I gave a (brief) overview of machine translation and how it has evolved from early methods, to statistical machine translation (SMT), to today’s neural machine translation (NMT). The demonstrated application used Apache Flink to consume German tweets and, via Sockeye, translate the German tweets to English.
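To make the shape of that pipeline concrete, here is a minimal sketch in plain Python. It is not the Flink job from the talk and it does not call Sockeye: a generator stands in for the stream, and `toy_translate` is a hypothetical word-lookup placeholder for the NMT model. The tweet fields (`lang`, `text`) are also assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, Iterator

Tweet = Dict[str, str]


def translate_stream(
    tweets: Iterable[Tweet],
    translate: Callable[[str], str],
    source_lang: str = "de",
) -> Iterator[Tweet]:
    """Yield each tweet in the source language with a translation attached."""
    for tweet in tweets:
        # Skip anything that is not in the language we want to translate.
        if tweet.get("lang") != source_lang:
            continue
        yield {**tweet, "translation": translate(tweet["text"])}


def toy_translate(text: str) -> str:
    """Hypothetical stand-in for the NMT model: a tiny word lookup."""
    lexicon = {"guten": "good", "morgen": "morning"}
    return " ".join(lexicon.get(word.lower(), word) for word in text.split())


# Two sample records; only the German one should pass the filter.
tweets = [
    {"lang": "de", "text": "Guten Morgen"},
    {"lang": "en", "text": "Hello there"},
]
results = list(translate_stream(tweets, toy_translate))
```

Because the translator is passed in as a callable, the lookup could be swapped for a call to a real translation service without touching the streaming logic.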

A video and slides will be posted here once they are available. The code for the project is available here. Credit goes to Suneel Marthi for the original implementation from which mine is forked and to Kellen Sunderland for the Sockeye-Thrift code that enabled communication between Flink and Sockeye.

Lucidworks Activate Search and AI Conference

Back in October 2018 I had the privilege of attending and presenting at Lucidworks Activate Search and AI Conference in Montreal, Canada. It was a first-class event with lots of great, informative sessions set in the middle of a remarkable city.

Suneel Marthi and I co-presented "Embracing Diversity: Searching over multiple languages," in which we described a method of performing cross-language information retrieval (CLIR) using Apache Solr, Apache NiFi, Apache OpenNLP, and Sockeye. Our approach implemented an English-in/English-out system for facilitating searches over a multilingual document index.

We used Apache NiFi to drive the process. The data flow is summarized as follows:

1. The English search term is read from a file on disk. (This is just to demonstrate the system. We could easily receive the search term from somewhere else such as via a REST listener or by some other means.)
2. The search term is translated via Sockeye to the other languages contained in the corpus.
3. The translated search terms are sent to a local instance of Solr.
4. The resulting documents are translated to English, summarized, and returned.

While this is an abbreviated description of the process, it captures the steps at a high level.
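The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the NiFi flow from the presentation: an in-memory dictionary stands in for the Solr index, a substring match stands in for the search, and `lookup_translate` and `toy_summarize` are hypothetical placeholders for Sockeye and the summarization step.

```python
from typing import Callable, Dict, List

# (text, source_lang, target_lang) -> translated text
Translator = Callable[[str, str, str], str]


def cross_language_search(
    query: str,
    index: Dict[str, List[str]],
    translate: Translator,
    summarize: Callable[[str], str],
) -> List[Dict[str, str]]:
    """English-in/English-out search over documents grouped by language."""
    results = []
    for lang, docs in index.items():
        # Translate the English query into the language of this part of the corpus.
        local_query = query if lang == "en" else translate(query, "en", lang)
        for doc in docs:
            # Naive substring match standing in for a real search engine query.
            if local_query.lower() in doc.lower():
                # Translate the hit back to English, then summarize it.
                english = doc if lang == "en" else translate(doc, lang, "en")
                results.append({"lang": lang, "summary": summarize(english)})
    return results


# Hypothetical word-level lexicon standing in for a translation model.
LEXICON = {
    ("en", "de"): {"dog": "hund"},
    ("de", "en"): {"der": "the", "hund": "dog", "bellt": "barks"},
}


def lookup_translate(text: str, source: str, target: str) -> str:
    table = LEXICON[(source, target)]
    return " ".join(table.get(word.lower(), word) for word in text.split())


def toy_summarize(text: str, max_words: int = 5) -> str:
    # Placeholder summarizer: keep only the first few words.
    return " ".join(text.split()[:max_words])


index = {"en": ["The dog barks loudly"], "de": ["Der Hund bellt"]}
hits = cross_language_search("dog", index, lookup_translate, toy_summarize)
```

Here the English query "dog" matches the English document directly and the German document through the translated term, and both hits come back as English summaries.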

Check out the video below for the full presentation. The code for the custom Apache NiFi processors described in the presentation is available on GitHub. All of the software used is open source, so you can build this system in your own environment. If you have any questions, please get in touch and I will be glad to help.