Back in October 2018 I had the privilege of attending and presenting at Lucidworks Activate Search and AI Conference in Montreal, Canada. It was a first-class event with lots of great, informative sessions set in the middle of a remarkable city.
I was a co-presenter of Embracing Diversity: Searching over multiple languages with Suneel Marthi in which we presented a method of performing cross-language information retrieval (CLIR) using Apache Solr, Apache NiFi, Apache OpenNLP, and Sockeye. Our approach implemented an English-in/English-out system for facilitating searches over a multilingual document index.
We used Apache NiFi to drive the process. The data flow is summarized as follows:
The English search term is read from a file on disk. (This is just to demonstrate the system. We could easily receive the search term from somewhere else such as via a REST listener or by some other means.) The search term is translated via Sockeye to the other languages contained in the corpus. The translated search terms are sent to a local instance of Solr. The resulting documents are translated to English, summarized, and returned. While this is an abbreviated description of the process, it captures the steps at a high level.
Check out the video below for the full presentation. The code for the custom Apache NiFi processors described in the presentation are available on GitHub. All of the software used is open source so you can build this system in your own environments. If you have any questions please get in touch and I will be glad to help.