A publicly traded index focused on bitcoin investments is using NLP to select the holdings. From the press release:
The index underlying KOIN was constructed utilizing a natural language processing algorithm that screens for global stocks that are believed to have a current or future economic interest in blockchain technology. By harnessing the power of textual analysis and artificial intelligence, companies are uncovered that might otherwise be overlooked by traditional analytical research.
It’s true there is a lot of information in unstructured text, but to make that information useful it needs to be extracted and understood at scale. This new fund is a great example of a practical use of NLP. If we take a minute to think about the requirements for a system like this we can identify these items:
Scalable – The system has to support an enormous amount of text quickly. News doesn’t stop or take a break to let us catch up. The system must scale horizontally to meet demand.
Multi-lingual – Blockchain news isn’t just written in English or any other single language. The system must be able to support text documents written in many different languages. We’re interested in global stocks.
Customizable – Press releases and news reports represent two specific categories of text. They aren’t like other categories such as legal documents, encyclopedia articles, or general human conversation. The system needs to be customizable so that it can support text of various formats. A general, all-purpose document processor won’t give us the results we need.
NLP – The system likely needs to be able to process natural language text and identify key topics, generate summaries, identify entities (companies and persons), and detect sentiment.
There are, of course, always other requirements but these represent arguably the largest areas.
How can we meet these requirements? To help provide scalability we can use an established cloud provider like AWS or Azure. These platforms give us the tools we need to make an application scale to meet demand, so that’s a good starting point. For the other requirements we can select from available tools based on whether we are building our own implementation from the ground up or using publicly available components. Both approaches have their own advantages and disadvantages. To save time (and money) we’ll assume you would rather use existing tools instead of building them yourself. If not, then you’d better stop reading and get to coding!
ngramdb provides a distributed means of storing and querying N-grams (or bags of words) organized under contexts. A REST interface provides the ability to insert n-grams, execute “starts with” and “top” queries, and calculate similarity metrics of contexts. Apache Ignite provides the distributed and highly available persistence and powers the querying abilities.
ngramdb is experimental and significant changes are likely. We welcome your feedback and input into its future capabilities.
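As a quick illustration of the kind of interaction ngramdb's REST interface enables (the endpoint paths and payloads below are hypothetical placeholders for this sketch, not ngramdb's documented API):

# insert an n-gram under a "news" context (hypothetical endpoint)
curl -X POST http://localhost:8080/contexts/news/ngrams -d "blockchain technology"

# ask for the top 10 n-grams in the same context (also hypothetical)
curl "http://localhost:8080/contexts/news/top?n=10"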
In November 2018 I had the privilege of attending and presenting at PyData Washington DC 2018 at Capital One. It was my first PyData event and I learned so much from it that I hope to attend many more in the future, and I encourage you to do so, too, if you’re interested in all things data science and the supporting Python ecosystem. Plus, I got some awesome stickers for my laptop.
My presentation was an introduction to machine translation and demonstrated how machine translation can be used in a streaming pipeline. In it, I gave a (brief) overview of machine translation and how it has evolved from early methods, to statistical machine translation (SMT), to today’s neural machine translation (NMT). The demonstrated application used Apache Flink to consume German tweets and, via Sockeye, translate the German tweets to English.
A video and slides will be available soon and will be posted here. The code for the project is available here. Credits to Suneel Marthi for the original implementation from which mine is forked and to Kellen Sunderland for the Sockeye-Thrift code that enabled communication between Flink and Sockeye.
Back in October 2018 I had the privilege of attending and presenting at Lucidworks Activate Search and AI Conference in Montreal, Canada. It was a first-class event with lots of great, informative sessions set in the middle of a remarkable city.
I was a co-presenter of Embracing Diversity: Searching over multiple languages with Suneel Marthi in which we presented a method of performing cross-language information retrieval (CLIR) using Apache Solr, Apache NiFi, Apache OpenNLP, and Sockeye. Our approach implemented an English-in/English-out system for facilitating searches over a multilingual document index.
We used Apache NiFi to drive the process. The data flow is summarized as follows:
The English search term is read from a file on disk. (This is just to demonstrate the system; we could easily receive the search term from somewhere else, such as via a REST listener or by some other means.)
The search term is translated via Sockeye to the other languages contained in the corpus.
The translated search terms are sent to a local instance of Solr.
The resulting documents are translated to English, summarized, and returned.
While this is an abbreviated description of the process, it captures the steps at a high level.
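To make the Solr step concrete, here is a sketch of submitting a translated search term to Solr's select API (the collection name, field name, and port are illustrative assumptions):

# query the local Solr instance with a German translation of the English search term
curl "http://localhost:8983/solr/documents/select?q=text:Impfstoff&wt=json"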
Check out the video below for the full presentation. The code for the custom Apache NiFi processors described in the presentation are available on GitHub. All of the software used is open source so you can build this system in your own environments. If you have any questions please get in touch and I will be glad to help.
Today we are introducing NLP Flow, a collection of processors for the popular Apache NiFi data platform to support NLP pipeline data flows.
Apache NiFi is a cross-platform tool for creating and managing data flows. With Apache NiFi you can create flows to ingest data from a multitude of sources, perform transformations and logic on the data, and interface with external systems. Apache NiFi is a stable and proven platform used by companies worldwide.
Extending Apache NiFi to support NLP pipelines is a perfect fit. NLP Flow is, in Apache NiFi terminology, a set of processors that facilitate NLP tasks via our NLP Building Blocks. With NLP Flow, you can create powerful NLP pipelines inside of Apache NiFi to perform language identification, sentence extraction, text tokenization, and named-entity extraction. For example, an NLP pipeline to ingest text from HDFS, extract all named-person entities for English and Spanish text, and persist the entities to a MongoDB database can be managed and executed within Apache NiFi.
NLP Flow is free for everyone to use. An existing Apache NiFi (a free download) installation is required.
We have open-sourced our NLP library and its associated projects on GitHub. The library, Idyl NLP, is a Java natural language processing library. It is licensed under the Apache License, version 2.0.
Idyl NLP stands on the shoulders of giants to provide a capable and flexible NLP library. Utilizing components such as OpenNLP and DeepLearning4j under the hood, Idyl NLP offers various implementations for NLP tasks such as language detection, sentence extraction, tokenization, named-entity extraction, and document classification.
Idyl NLP has the ability to automatically download NLP models when needed. The Idyl NLP Models repository contains model manifests for various NLP models. Through the manifest files, Idyl NLP can automatically download the model file referenced by the manifest and use it. The service powering this capability is the Idyl NLP Model Zoo, which will soon be hosted at zoo.idylnlp.ai. It is a Spring Boot application that provides a REST interface for querying and downloading models, so you can run your own model zoo for internal use. See these two repositories on GitHub for more information about the available models and the model zoo. Models will become available through the repository in the coming days.
There are some sample projects available for Idyl NLP. The samples illustrate how to use some of Idyl NLP’s core capabilities and hopefully provide starting points for using Idyl NLP in your projects.
We are committed to further developing Idyl NLP and its ecosystem. We welcome the community’s contributions to help it flourish and grow. We hope that the business-friendly Apache license helps Idyl NLP’s adoption. Like most software engineers we are a bit behind on documentation; in the near term we will be focusing on the wiki, javadocs, and the sample projects. Our NLP Building Blocks will continue to be powered by Idyl NLP.
Renku Language Detection Engine 1.1.0 has been released. It is available now on DockerHub and will be available on the AWS Marketplace and Azure Marketplace in a few days. This version adds a new API endpoint that returns a list of the languages (as ISO-639-3 codes) supported by Renku. The AWS Marketplace image is built using the newest version of the Amazon Linux AMI, and the Azure Marketplace image is now built on CentOS 7.4 (previously 7.3).
With the recent announcement of the vulnerabilities known as “Spectre” and “Meltdown” in Intel processors, we have written this post to inform our users how to protect the virtual machines running our products launched via cloud marketplaces.
Products Launched via Docker Containers
Docker uses the host’s system kernel. Refer to your host OS’s documentation on applying the necessary kernel patch.
Products Launched via the AWS Marketplace
The following product versions use kernel 4.9.62-21.56.amzn1.x86_64, which needs to be updated.
Renku Language Detection Engine 1.0.0
Prose Sentence Extraction Engine 1.0.0
Sonnet Tokenization Engine 1.0.0
Idyl E3 Entity Extraction Engine 3.0.0
Run the following commands on each instance:
sudo yum update
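# a reboot is required for the patched kernel to take effect
sudo reboot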
The yum output will show an updated kernel version of 4.9.76-3.78.amzn1.x86_64 (or newer) being installed; after the reboot the instance will be running the patched kernel. Details are available on the AWS Amazon Linux Security Center.
Products Launched via the Azure Marketplace
The following product versions are running on CentOS 7.3 on kernel 3.10.0-514.26.2.el7.x86_64, which needs to be updated.
Renku Language Detection Engine 1.0.0
Prose Sentence Extraction Engine 1.0.0
Sonnet Tokenization Engine 1.0.0
Idyl E3 Entity Extraction Engine 3.0.0
Run the following commands on each virtual machine:
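# the original commands were omitted here; a standard CentOS kernel
# update follows the same update-and-reboot pattern as above
sudo yum update
sudo reboot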
When making an NLP pipeline in Apache NiFi it can be a requirement to route the text through the pipeline based on the language of the text. But how do we get the language of the text inside our pipeline? This blog post introduces a processor for Apache NiFi that utilizes Apache OpenNLP’s language detection capabilities. This processor receives natural language text and returns an ordered list of detected languages along with each language’s probability. Your pipeline can get the first language in the list (it has the highest probability) and use it to route your text through your pipeline.
In case you are not familiar with OpenNLP’s language detection, it provides the ability to detect over 100 languages. It works best with text containing more than one sentence (the more text the better). It was introduced in OpenNLP 1.8.3.
To use the processor, first clone it from GitHub. Then build it and copy the nar file to your NiFi’s lib directory (and restart NiFi if it was running). We are using NiFi 1.4.0.
The processor does not have any settings to configure. It’s ready to work right “out of the box.” You can add the processor to your NiFi canvas:
You will likely want to connect the processor to an EvaluateJsonPath processor to extract the language from the JSON response and then to a RouteOnAttribute processor to route the text through the pipeline based on the language. Also, this processor will work with Apache NiFi MiNiFi to determine the language of text on edge devices. MiNiFi is a subproject of Apache NiFi that allows for capturing data into NiFi flows from edge locations.
Backing up a bit, why would we need to route text through the pipeline depending on its language? The actions taken further down in the pipeline are likely to be language dependent. For instance, the next step might be to tokenize the text, but knowing how to tokenize it requires knowing what language it is. Or, if the next step is to send the text to an entity extraction process, we need to know which entity model to use based on the language. So, language detection in an NLP pipeline can be a crucial initial step. A previous blog post showed how to use NiFi for an NLP pipeline, and extending that pipeline with language detection would be a great addition!
This processor performs the language detection inside the NiFi process; everything remains inside your NiFi installation. This should be adequate for a lot of use-cases, but if you need more throughput check out Renku Language Detection Engine. It works very similarly to this processor in that it receives text and returns a list of identified languages. However, Renku is implemented as a stateless, scalable microservice, meaning you can deploy as many instances as you need to meet your use-case’s requirements. And maybe the best part is that Renku is free for everyone to use without any limits.
We are happy to announce that Sonnet Tokenization Engine, Prose Sentence Extraction Engine, and Idyl E3 Entity Extraction Engine have joined Renku Language Detection Engine on the Microsoft Azure Marketplace!
Renku Language Detection Engine is now available on Microsoft Azure and VMs can be launched via the Azure Marketplace. Renku Language Detection Engine is available at no cost – you only pay the standard Azure infrastructure costs.
Idyl E3 Entity Extraction Engine is an all-in-one solution for performing entity extraction from natural language text. It takes in unmodified natural language text and, through a pipeline, identifies the language of the text, identifies the sentences in the text, tokenizes those sentences, and extracts entities from those tokens. It’s not exactly what you would call a microservice; the archives for version 2.6.0 are nearly 1 GB in size.
With the introduction of the NLP Building Blocks earlier this year, we began breaking up Idyl E3 into a set of smaller services to perform its individual functions. Renku identifies languages, Prose extracts sentences, and Sonnet performs tokenization. Joining the mix soon with its first release will be Lacuna, which classifies documents. Lacuna can be used to route documents through your NLP pipelines based on their content. Each of these applications is small (less than 30 MB), stateless, and horizontally scalable. Using these building blocks instead of the all-in-one Idyl E3 provides much improved flexibility: you can now compose loosely coupled microservices into your custom NLP pipeline.
With that said, Idyl E3 3.0 will become a microservice whose only function is to perform entity extraction. This will dramatically cut Idyl E3’s deployment size, making it easier to deploy and manage. Like the other building blocks, Idyl E3 3.0 will be available as a Docker container. Because Idyl E3’s functionality will be trimmed down, its pricing will also be reduced. Stay tuned for the updated pricing.
To help bring the NLP building blocks together in a pipeline we have made the nlp-building-blocks-java-sdk available on GitHub. It includes clients for each product’s API. The Apache2-licensed project also includes the ability to tie the clients together in a pipeline. This is a Java project, but we hope to eventually have similar projects available for other languages.
We are very excited to take this path of making NLP building block microservices. We believe it provides awesome flexibility and control over your NLP pipelines.
Renku Language Detection Engine is now available. Renku, for short, is an NLP building block application that performs language detection on natural language text. Renku’s API allows you to submit text for analysis and receive back a list of language codes and associated probabilities. Renku is free for personal, non-commercial, and commercial use.
You can get started with Renku in a docker container quickly:
docker run -p 7070:7070 -it mtnfog/renku:1.2.0
Once running, you can submit requests to Renku. For example:
curl http://localhost:7070/api/language -d "George Washington was the first president of the United States."
The response from Renku will be a list of three-letter language codes and each code’s associated probability. The languages will be ordered from highest probability to lowest. In this example the highest probability language will be “eng” for English.
We are in the process of publishing Idyl E3 2.5.1 to the AWS Marketplace and also to our website for download. The only change in 2.5.1 from 2.5.0 is a fix to address OpenNLP CVE-2017-12620. We have updated the Release Notes to reflect this as well.
The details of the issue are explained in OpenNLP CVE-2017-12620. It is important that only models from trusted sources are used in Idyl E3. Please be aware of a model’s origin whether it be a model that was downloaded from our website, created by you, or created by someone else in your organization.
Some interesting news this week is that Yahoo! has open-sourced their software that drives many of their content recommendation systems. The software, called Vespa, is available at vespa.ai.
Annotations on words and phrases in the text can be provided as text is ingested into Vespa. This process is described in the Vespa Annotations API documentation. But in order to make these annotations you need something that can identify persons, places, and things in the text! Idyl E3 Entity Extraction Engine is perfect for this and here’s how:
You probably have a pipeline in which text is gathered from some source and eventually pushed to your search application, in this case Vespa. All that is needed is to modify your pipeline to first send the text to Idyl E3 to get the entities. Once a response is received from Idyl E3, the text along with its annotations can be sent on to Vespa. It really is that easy. You can customize the types of entities to extract through the entity models installed in Idyl E3, so you could annotate persons, places, and things like buildings, schools, and airports.
To recap, if you have not yet read about Vespa it is worth a few minutes of your time. Its ability to ingest text with annotations makes it a natural fit for Idyl E3. You can certainly use Idyl E3 to annotate text for Vespa now, and we’re going to make some improvements to make working with Vespa even easier.
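Here is a sketch of that pipeline step with curl (the Idyl E3 host and port, the Vespa document type, and the payload shape are assumptions for illustration):

# extract entities from the text via Idyl E3
curl http://idyl-e3-host:9000/api/v2/extract -d "George Washington was the first president of the United States."

# feed the text and its annotations to Vespa's document API
curl -X POST http://vespa-host:8080/document/v1/mynamespace/article/docid/1 -d '{"fields": {"body": "George Washington was the first president of the United States."}}'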
As we work toward Idyl E3 2.6.0 we keep the Release Notes page updated with what’s new, tweaked, and fixed in 2.6.0. Probably the most significant new feature is support for GPUs.
Less exciting but still useful is how models that fail to load are handled in 2.6.0. Previously, when a model failed to load it would be retried the next time the model was needed. If nothing has changed to help the model load, this results in repeated, needless load attempts. In 2.6.0, if a model fails to load it is added to a blacklist and Idyl E3 will not attempt to reload any model on the blacklist until Idyl E3 is restarted. A message will be included in Idyl E3’s log when a model is blacklisted.
A model can fail to load for a few reasons. The most common reasons are:
The model file defined in the manifest does not exist or cannot be read due to insufficient permissions.
The model’s encryption key is invalid.
The model’s license key is invalid.
IDYL_E3_HOME Environment Variable
Also noteworthy is the IDYL_E3_HOME environment variable that must be set. If you launch Idyl E3 through the AWS Marketplace it is taken care of for you. If not, you just need to set IDYL_E3_HOME to the location where you extracted Idyl E3 (we recommend /opt/idyl-e3):
Most of Idyl E3’s scripts reference the IDYL_E3_HOME environment variable to know where to find its files.
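export IDYL_E3_HOME=/opt/idyl-e3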
The last new thing we’ll mention here is a new tool included with Idyl E3 called the Model Downloader. When run, this command line tool lists the models available for download from us and can download and install them directly into your Idyl E3. No more downloading via your web browser and then having to copy to Idyl E3. The tool will prompt you for a login (your Mountain Fog account username and password; register for free if you need a login) and then present you with a simple menu. The tool also supports a non-interactive mode so you can script the download of models!
We’ll give a more detailed look at the Model Downloader tool once 2.6.0 is released so stay tuned.
In Idyl E3 2.6.0 we will be introducing a command line tool to download entity, sentence, and token models directly from us. The tool will make getting and using Idyl E3 models much easier. You will no longer have to manually download a model, unzip it, and copy it to Idyl E3’s models directory. The tool will perform these steps for you. It will have both interactive and non-interactive modes so it can be integrated into provisioning scripts to automatically obtain models when deployed.
On our side, the tool will help us to more rapidly create models and make them available to you.
The tool will be bundled with Idyl E3 2.6.0 and will support all platforms.
Idyl E3 Entity Extraction Engine is now available for launch via the AWS Marketplace into the GovCloud region. When launching Idyl E3 just select AWS GovCloud (US) from the Region dropdown (highlighted in red below). Both Idyl E3 Free Edition and Idyl E3 Analyst Edition can be launched into GovCloud.
Here’s a quick summary of what’s new in Idyl E3 2.5.0. It’s not available yet but will be soon. For a full list check out the Idyl E3 Release Notes.
English-language person entity models can now be trained using the CoNLL-2003 format.
You can now create and use deep learning neural network entity models. Check out the previous blog posts for more information!
There’s a new setting that allows you to control how duplicate entities per extraction request are handled. You can choose to retain all duplicates or return only the duplicate entity with the highest probability.
A new TCP endpoint accepts streaming text. This endpoint deprecates the /ingest API endpoint.
Idyl E3 2.5.0 changes all language codes to 3-letter ISO 639-3 codes. While 2-letter codes are still supported, we recommend using the 3-letter codes instead.
Entities extracted by a non-model method (regular expression, dictionary) used to return a value of 100.0 for the entity’s probability. Extracted entity probabilities should exist within the range 0 to 1 so these entities are now extracted with a probability of 1.0 instead of 100.0.
As we mentioned in an earlier post, Idyl E3 2.5.0 will include the ability to create and use deep learning neural network entity models. As we get closer to the release of Idyl E3 2.5.0 we wanted to introduce the new capability and compare it with the current entity models.
In Idyl E3 2.4.0 you can create entity models through a perceptron algorithm. This algorithm requires as input annotated training text and a list of features. Feature selection can be a difficult task. Too many features can result in over-fitting the model such that it performs well on the input text but does not generalize well to other text. Feature selection is a crucial part of producing a useful, quality model.
Idyl E3 2.5.0’s ability to create and use deep learning models still requires annotated input text but does not require a list of features. The features are discovered automatically during the execution of the neural network algorithm through the use of word vectors, produced by applications like word2vec or GloVe. Using such a tool, you generate a file of vectors from your training text and provide it to Idyl E3 during model training. In summary, manual feature selection is not required for deep learning models.
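For example, with the original word2vec command line tool (the file names below are placeholders), you can produce a text-format vectors file from your training text:

# train 100-dimensional word vectors on the training text and write them in text format
./word2vec -train training-text.txt -output vectors.txt -size 100 -window 5 -binary 0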
While word vectors really help with deep learning model training, training a deep learning model can still be a challenging task. A neural network has many hyperparameters that tune the underlying algorithms, and small changes to these hyperparameters can have a dramatic effect on the generated model. Hyperparameter optimization is an active area of academic and industry research. Tools and best practices exist to help with hyperparameter selection and we will share some useful resources in the near future.
Idyl E3 2.5.0 and newer versions will continue to support creating and using maximum entropy-based models, so you can choose which type of model you want to create and use.
In the upcoming week we will be posting an updated English-person entities base model on the Models page. The model, like the version it replaces, will be free to use and included in the upcoming Idyl E3 2.5.0 release. To give an idea of the performance of this model, we evaluated the model against the CoNLL-2003 training set and the results are as follows:
Idyl E3 Entity Extraction Engine 2.5.0 will introduce entity extraction powered by deep learning neural networks. Neural networks are powerful machine learning algorithms that excel at tasks like natural language processing. Idyl E3 will also support entity model training and usage on GPUs as well as CPUs. Using GPUs provides significant performance improvements. Idyl E3 2.5.0 will add support for AWS’ P2 instance type.
Entity models created by a deep learning neural network will be referred to as “second generation models.” Entity models created by Idyl E3 2.4.0 and earlier will be referred to as “first generation models.”
So how are the current entity models going to be different than the deep learning entity models?
Good question. Training entity models with Idyl E3 2.4.0 and earlier requires you to identify “features” of your text in order to train the model. Some examples of features include where an entity appears in a sentence, what words surround it, whether the word is capitalized, and so on. While you can create very powerful models using this method, identifying the features can be a laborious task that requires intimate knowledge of the text. It can also result in over-fitting, causing the model to generalize poorly to non-training text.
When training a deep learning entity model there is no need to identify the features because the algorithm is able to learn the features on its own during training. It does this through word vectors. Idyl E3 2.5.0 will be able to use word vectors generated by applications such as word2vec and GloVe. To create a deep learning entity model, simply provide your input training text and word vectors and Idyl E3 will generate the model.
Can I customize the neural network used to train a model?
There will be many options available to customize the neural network used for model training, with a standard set of options available out of the box. We will describe all of the available options in the Idyl E3 2.5.0 User’s Guide.
Will there be any other impacts of the new type of model training?
No. You can continue to use your existing first generation models. You can also continue to train new first generation models. In fact, you can use first and second generation models simultaneously in an Idyl E3 pipeline.
Any other questions that we did not cover? Let us know!
Idyl E3 2.4.0 now includes an English-language “Places” model as well as an English-language “Persons” model. Prior to version 2.4.0, only the persons model was included. Idyl E3 2.4.0 Analyst Edition will be available from the AWS Marketplace soon.
The model will be loaded automatically when Idyl E3 2.4.0 Analyst Edition starts. An entity extraction request such as “George Washington was president of the United States.” will return two entities:
George Washington (person)
United States (place)
Idyl E3 2.4.0 comes with a free 30-day trial period in which you can use a single instance of Idyl E3 in AWS by paying only the cost of the underlying instance!
In the next release of Idyl E3 Entity Extraction Engine (which will be version 2.4.0) we will introduce the Training Definition File to help alleviate a few problems.
When training an entity model there are quite a few command line arguments that you have to provide. The sheer number of arguments doesn’t help with usability.
After training a model, unless you keep excellent documentation it’s easy to lose track of the training parameters. What entity type? Language? Iterations? Features? And so on.
How do you manage the command line arguments and the feature generators XML file?
The Training Definition File offers a solution to these problems. It is an XML file that contains all of the training parameters. Everything. Now you have a record of the parameters used to create the model while also simplifying the command line arguments. Note that you can still use the command line arguments as they will remain available.
Below is an example of a training definition file. Note that the final format may change between now and the release.
We are announcing the availability of Idyl E3 2.3.0! This version has a long list of new features. You can see the full list in the Release Notes; we’ll summarize the changes below, and they are covered in the documentation.
You can download the new version from our website and look for it to be available on cloud marketplaces in the upcoming week. We love adding new features and supporting our users’ needs. Feel free to let us know how we’re doing!
There is a new option for API authentication. You can now use HMACSHA512 instead of plain authorization.
There is a new /sanitize endpoint that takes in text, identifies the entities in the text, and returns the text without the entities. This endpoint is useful for cases where you want to sanitize PII or PHI from text. (A quick example follows this list of changes.)
A new sort parameter was added to the /extract endpoint to control how the extracted entities are sorted in the response.
Each API endpoint now responds with HTTP 405 Method Not Allowed when given a HEAD request. This change supports smoother integration between Idyl E3 and Apache NiFi.
Version 1 of the API has been deprecated and will be removed in Idyl E3 2.4.0.
Can now create and use part-of-speech and lemmatization models to improve entity extraction performance.
Added new feature generators to help improve entity extraction performance.
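Here is a quick sketch of the new /sanitize endpoint using curl (the host, port, and path prefix are assumptions for illustration):

# returns the input text with the extracted entities removed
curl http://localhost:9000/api/v2/sanitize -d "George Washington lives at 3200 Mount Vernon Memorial Highway."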
No matter which architecture you choose, we recommend creating a pre-configured Idyl E3 AMI and using it to launch new Idyl E3 instances. We recommend this method over relying on user-data scripts to perform the configuration because a pre-configured AMI can spin up in significantly less time. If you want to have the AMI configuration under source control, we highly recommend using Hashicorp’s Packer to build the AMI.
Before we describe the architectures, it is helpful to note that the Idyl E3 API is stateless. There is no session data that needs to be shared by multiple instances, and as long as all Idyl E3 instances are configured identically (as they should be when behind a load balancer), it does not matter which instance receives the entity extraction request. We can take advantage of this stateless architecture to scale Idyl E3 up (and down) as much as we need in order to meet the demands of the load.
The first architecture is a very simple one, yet probably adequate to meet the needs of most users. This architecture has a single VPC that contains two subnets: a public subnet containing an Elastic Load Balancer (ELB) and a private subnet containing the Idyl E3 instances. In the diagram shown below, the ELB is set to be a public ELB, allowing Idyl E3 requests to be received from the internet. However, if your application will also run in the VPC you can change the ELB to an internal ELB. Note that this architecture uses a fixed number of Idyl E3 instances behind the ELB; any scaling up or down will have to be performed manually when needed. Idyl E3’s API has a /health endpoint that returns HTTP 200 OK when everything is OK, and that is perfect for ELB instance health checks.
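You can verify the health check manually before wiring it into the ELB (the host and port are placeholders):

# a healthy Idyl E3 instance responds with HTTP 200 OK
curl -i http://idyl-e3-host:9000/health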
Load-balanced and Auto-scaling Architecture
The previous architecture is simple but very functional, and it minimizes cost. The first thing you will notice about it is the static number of Idyl E3 instances. To provide some flexibility we can modify the architecture a bit to put the Idyl E3 instances into an autoscaling group. We can use the group’s Desired Capacity to manually control the number of Idyl E3 instances, or we can configure the autoscaling group to automatically scale up and down based on chosen metrics. Average CPU usage is a good metric for scaling Idyl E3 because entity extraction can cause CPU usage to rise. With that change, here is what our architecture looks like now:
With the autoscaling we don’t have to worry about unexpected surges or decreases in entity extraction requests. The number of Idyl E3 instances will automatically scale up and down based on the average CPU usage of all Idyl E3 instances. Scaling down is important in order to keep costs to a minimum. Nobody wants to pay for more than what they need.
This architecture is available in our GitHub repository of Idyl E3 CloudFormation Templates. The template also contains an optional bastion instance to facilitate SSH access into the Idyl E3 instances from outside the VPC.
Got more complicated requirements? Let us know. We have AWS certified engineers on staff and we’ll be glad to help.
We have published a new open source project on GitHub that is an Apache NiFi processor that filters entities through an Entity Query Language (EQL) query. When used along with the Idyl E3 NiFi Processor you can perform entity filtering in a NiFi dataflow pipeline.
To add the EQL processor to your NiFi pipeline, clone the project and build it, or download the jar file from our website. Then copy the jar to NiFi’s lib directory and restart NiFi. The processor will now be available in the list of processors:
The EQL processor has a single property that holds the EQL query:
For this example our query will look for entities whose text is “George Washington”:
select * from entities where text = "George Washington"
Entities matching the EQL query will be output from the processor as JSON. Entities not matching the EQL query will be dropped.
With this capability we can create Apache NiFi dataflows that produce alerts when an entity matches a given set of conditions. Entities matching the EQL query can be published to an SQS queue, a Kafka stream, or any other NiFi processor.
The Entity Query Language previously existed as a component of the EntityDB project. It is now its own project on GitHub and is licensed under the Apache Software License, 2.0. The project’s README.md contains more examples of how to construct EQL queries.
We are happy to let you know how Idyl E3 Entity Extraction Engine can be used with Apache NiFi. First, what is Apache NiFi? From the NiFi homepage: “Apache NiFi supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic.” Idyl E3 extracts entities (persons, places, things) from natural language text.
That’s a very short description of NiFi but it is very accurate. Apache NiFi allows you to configure simple or complex data processing flows. For example, you can configure a pipeline to consume files from a file system and upload them to S3. (See the Example Dataflow Templates.) There are many operations you can perform, and they are carried out by components called Processors. There are many excellent guides available about NiFi, such as:
There are many processors available for NiFi out of the box. One in particular is the InvokeHTTP processor that lets your pipeline send an HTTP request. You can use this processor to send text to Idyl E3 for entity extraction from within your pipeline. However, to make things a bit simpler and more flexible we have created a custom NiFi processor just for Idyl E3. This processor is available on GitHub and its binaries will be included with all editions of Idyl E3 starting with version 2.3.0.
Instructions for how to use the Idyl E3 processor will be added to the Idyl E3 documentation, but they are simple. Here’s a rundown: copy the idyl-e3-nifi-processor.jar from Idyl E3’s home directory to NiFi’s lib directory and restart NiFi. Once NiFi is available you will see the Idyl E3 processor in the list of processors when adding a processor:
There are a few properties you can set but the only required property is the Idyl E3 endpoint. By default, the processor extracts entities from the input text but this can be changed using the action property. The available actions are:
extract (the default) to get a JSON response containing the entities.
annotate to return the input text with the entities annotated.
sanitize to return the input text with the entities removed.
ingest to extract entities from the input text but provide no response. (This is useful if you are letting Idyl E3 plugins handle the publishing of entities to a database or other service outside of the NiFi data flow.)
The available properties are shown in the screen capture below:
And that is it. The processor will send text to Idyl E3 for entity extraction via Idyl E3’s /api/v2/extract endpoint. The response from Idyl E3 containing the entities will be placed in a new idyl-e3-response attribute.
Idyl E3’s entity model training tool expects entities in training text to be annotated in the format used by OpenNLP. This format uses START and END tags to denote entities:
<START:person> George Washington <END> was president.
This works great but it has a drawback: the annotations and text have to be combined in a single file. Once the text is annotated it becomes difficult to use the training text for any other purposes.
New Annotation Format
Idyl E3 2.4.0 is going to introduce an additional method of annotating text that allows the annotations to be stored separately from the training text. In 2.4.0 the annotations will be able to be stored in a separate file (and we plan to eventually support storing the annotations in a database). Even though Idyl E3 2.4.0 is not yet ready for prime time, we wanted to introduce this feature early in case you are in the middle of any annotation efforts and want to use the new format.
It is still required that the input text contain a single sentence per line. Use blank lines to indicate document boundaries. Here’s an example of a simple input training file:
George Washington was president .
He was president of the United States .
George Washington was married to Martha Washington .
In 1755 , Washington became the senior American aide to British General Edward Braddock on the ill-fated Braddock expedition .
And here’s the annotations stored in a separate file:
1 0 2 person
2 5 6 place
3 0 2 person
3 5 7 person
4 11 12 person
Here’s what this means. Each line in the annotations file represents an annotation in the training text. So there are 5 annotations in this example.
The first column is the one-based line number that contains the entity. In this example, each of the four lines of training text contains at least one annotation.
The second column is the token index of the start of the entity. Indexes are zero-based so the first token is zero!
The third column is the token index of the end of the entity.
The last column is the type of the entity.
Note that there are two entities in the third line and each is put on its own separate line in the annotations file. Alternatively, an entity can be specified by its text in a three-column format, which simplifies the annotation by removing the need to specify the entity’s token start and end positions. That form will only annotate the first occurrence of the entity text. (If Edward Braddock had occurred more than once on line 4, only the first occurrence would be annotated.)
Now your annotations can be kept separate from your training text allowing you to use your training text for other purposes. Additionally, we hope that this new annotation method helps decrease the time required for annotating and helps with automating the process. As mentioned earlier in the post, currently the only supported means of storing the annotations is in a separate file but we plan to extend this to support databases in a future release of Idyl E3.
The Entity Model Generator tool included in Idyl E3 has been updated to allow for using this new annotation format. You can, however, continue to use the OpenNLP-style annotations when creating entity models. This new annotation format is only available for entity models. Sentence, token, parts-of-speech, and lemma model annotations will remain unchanged in 2.4.0.
Idyl E3 2.2.0 added support for publishing metrics to a Graphite server. To help make it easier to deploy a Graphite server we have added a new project on our GitHub that contains a Packer script for creating a Graphite AMI. Usage instructions are available in the project’s readme file.
As you may know, Idyl E3’s entity extraction capabilities are provided by a customized version of OpenNLP. Since the release of OpenNLP 1.7.0, the OpenNLP team has been able to release more frequently than before. Because of the more frequent OpenNLP releases we may not incorporate every release into Idyl E3. We will analyze the changes in each new OpenNLP version to decide if the changes should be incorporated into Idyl E3.
Also, we do have on the (distant) roadmap the ability to make the underlying NLP engine pluggable to allow you to choose which NLP engine to use with Idyl E3.
Idyl E3 Analyst Edition 2.1.0 is now available on the AWS Marketplace. Idyl E3 Analyst Edition includes everything in the Free and Standard editions plus licenses for all plugins and licenses for unlimited custom models.
In Idyl E3 2.2.0 we are introducing a feature we call Heuristic Confidence Filtering. Here’s how it works.
As you may (or may not) already know, each entity extraction request can have an associated “confidence threshold value.” Any extracted entities that have a confidence lower than this value will not be returned in the entity extraction response. This is useful, but it is a bit of a sledgehammer approach and, depending on its value, can result in either too much noise or missed entities.
When enabled, heuristic confidence filtering tracks the confidence values of extracted entities per the entity model that extracted them. Once a large enough sample of confidence values has been collected, Idyl E3 filters entities by determining whether an entity’s confidence value is statistically significant relative to the mean of the collected values. This provides a way to filter out noise but still receive important entities.
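The exact statistical test is internal to Idyl E3, but as a rough mental model (our illustration, not the documented algorithm): if the confidence values collected for a model have mean μ and standard deviation σ, an entity with confidence c would be kept when c ≥ μ − kσ for some significance factor k, and filtered otherwise.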
It is important to note that the confidence threshold value still plays a part even when heuristic confidence filtering is enabled. Any entity whose confidence value is greater than or equal to the confidence threshold for that request will always be returned even when heuristic confidence filtering is enabled.
Because of the mathematical calculations involved and the memory required to store the confidence values, heuristic confidence filtering does require a bit more computation time, but not to the point where it should be noticeable.
We are excited to offer this feature and we hope that it helps with “entity noise.” We welcome your feedback on how it performs for you! For more information on this feature, refer to the Idyl E3 2.2.0 User Documentation or contact us. Look for Idyl E3 2.2.0 to be available in February 2017.
Idyl E3 2.1.0 has been released. This version introduces a new version of the API that includes changes to the extract and ingest endpoints. With version 2 of the API these two endpoints accept text in the body of the request instead of as a query string parameter. Version 1 of the API is still available so you do not need to update your clients unless you just want to or need to for other reasons. The Idyl E3 Java SDK and the Idyl E3 .NET SDK have been updated to use API v2.
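To illustrate the difference (the host, port, and v1 path are placeholder assumptions): under version 1 the text traveled as a query string parameter, while version 2 accepts it in the request body:

# API v1: text as a query string parameter
curl "http://idyl-e3-host:9000/api/v1/extract?text=George+Washington+was+president."

# API v2: text in the request body
curl http://idyl-e3-host:9000/api/v2/extract -d "George Washington was president."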
Idyl E3 2.1.0 is based on a customized OpenNLP 1.7.0, which was released in early January 2017. Previous versions of Idyl E3 were based on a customized OpenNLP 1.6.0.
Idyl E3 2.1.0 Analyst Edition will be available on the AWS Marketplace soon. The Analyst Edition includes all plugins and allows for the use of unlimited custom models without separate licensing. (See the Idyl E3 edition comparison.)
Today we are announcing Idyl E3 2.0. It has been over a year since version 1.0 was introduced, and the main goals of version 2.0 were to make Idyl E3 extensible and to increase performance. We would like to thank our users for helping us get to this milestone release. We could not have done it without your feedback and comments.
Idyl E3 is available for download from our website. Look for Idyl E3 2.0 to be available on the AWS Marketplace and other channels shortly thereafter.
Idyl E3 Free Edition
This edition of Idyl E3 is free. It includes an English-persons entity model and no plugins. This edition can be customized with plugins and models to meet your requirements.
Idyl E3 Standard Edition
The Standard Edition includes everything in the free edition plus model evaluation tools and priority email technical support.
Idyl E3 Analyst Edition
The Analyst Edition includes everything in the standard edition plus all plugins and supports unlimited custom models.
In Idyl E3 1.x, things like email addresses and phone numbers were extracted through built-in functionality called extraction modules. In version 2.0 we are introducing plugins. Among them are plugins that perform entity extraction and plugins that publish the extracted entities. Plugins can be downloaded from our website and installed in your Idyl E3. The following plugins are currently available or soon will be:
Text Consumption Plugins
Consume input text from Kafka topic
Consume input text from Kinesis stream
Entity Extraction Plugins
Phone numbers extraction plugin
Email addresses extraction plugin
Hashtags extraction plugin
User mentions extraction plugin
Document Processing Plugins
Parse text from PDF files
Entity Publisher Plugins
AWS Kinesis Firehose publisher plugin
EntityDB publisher plugin
Internal changes were made to improve Idyl E3’s performance and lower the time required to extract entities. One change was the removal of the web-based dashboard; configuration is now done directly through the properties file.
Custom Sentence and Token Models
Also new in version 2.0, to increase performance, is the ability to generate and use custom sentence and token models. In versions 1.x, internal models were used for sentence detection and tokenization. These models were not always representative of the input text, so their performance was degraded. In version 2.0 you have the option to generate sentence and token models from your own data or use the legacy internal models just as versions 1.x did. You can still create your own entity models.