Renku Language Detection Engine

Renku Language Detection Engine is now available. Renku, for short, is an NLP building-block application that performs language detection on natural language text. Renku's API allows you to submit text for analysis and receive back a list of language codes and associated probabilities. Renku is free for personal, non-commercial, and commercial use.

You can get started quickly with Renku in a Docker container:

docker run -p 7070:7070 -it mtnfog/renku:1.0.0

Once running, you can submit requests to Renku. For example:

curl http://localhost:7070/api/language -d "George Washington was the first president of the United States."

The response from Renku will be a list of three-letter language codes, each with its associated probability. The languages will be ordered from highest probability to lowest. In this example the highest-probability language will be "eng" for English.
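For illustration, a response might look like the following. This is only a sketch of the format described above, assuming a simple map of language codes to probabilities; consult the Renku documentation for the exact schema.

{
   "eng": 0.9871,
   "deu": 0.0089,
   "fra": 0.0040
}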

Idyl E3 API Simulator

The Idyl E3 API Simulator is now available to help you build and debug Idyl E3 client applications. You can access it at http://api.mtnfog.com:9000/api. This is not an actual Idyl E3 application endpoint but a service that simulates the Idyl E3 API. All extraction requests receive the same response regardless of the text received. The purpose of the Idyl E3 API Simulator is to provide a mock service to assist with the implementation of Idyl E3 clients. You are welcome to use the simulator as much as you need; there are no service limitations.
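For example, a request to the simulator might look like the one below. This assumes the simulator exposes the same /api/v2/extract path as a real Idyl E3 endpoint; check the simulator's documentation for the exact routes.

curl -X POST http://api.mtnfog.com:9000/api/v2/extract -H "Content-Type: text/plain; charset=UTF-8" -d "George Washington was president."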

See the full details of the Idyl E3 API Simulator.

Idyl E3 2.5.1

We are in the process of publishing Idyl E3 2.5.1 to the AWS Marketplace and to our website for download. The only change in 2.5.1 from 2.5.0 is a fix to address OpenNLP CVE-2017-12620. We have updated the Release Notes to reflect this as well.

The details of the issue are explained in OpenNLP CVE-2017-12620. It is important that only models from trusted sources are used in Idyl E3. Please be aware of a model's origin, whether it is a model that was downloaded from our website, created by you, or created by someone else in your organization.

Yahoo! Vespa and Entity Annotations

Some interesting news this week is that Yahoo! has open-sourced their software that drives many of their content recommendation systems. The software, called Vespa, is available at vespa.ai.

Annotations on words and phrases in the text can be provided as text is ingested into Vespa. This process is described in the Vespa Annotations API documentation. But in order to make these annotations you need something that can identify persons, places, and things in the text! Idyl E3 Entity Extraction Engine is perfect for this and here’s how:

You probably have a pipeline in which text is gathered from some source and eventually pushed to your search application, which in this case is Vespa. All that is needed is to modify your pipeline to first send the text to Idyl E3 to get the entities. Once a response is received from Idyl E3, the text along with its annotations can be sent on to Vespa. It really is that easy. You can customize the types of entities to extract through the entity models installed in Idyl E3, so you could annotate persons, places, and things like buildings, schools, and airports.
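Here is a minimal sketch of that pipeline step in Java. It posts text to Idyl E3's /api/v2/extract endpoint and leaves the Vespa feed call as a placeholder; the host, port, and Vespa feed details are assumptions for illustration, not a definitive implementation.

import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class VespaPipelineStep {

    public static void main(String[] args) throws IOException {
        String text = "George Washington was the first president of the United States.";

        // Send the text to Idyl E3 for entity extraction.
        // The host and port are placeholders; use your Idyl E3 endpoint.
        URL url = new URL("http://idyl-e3-host:9000/api/v2/extract");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "text/plain; charset=UTF-8");
        conn.setDoOutput(true);
        try (OutputStream os = conn.getOutputStream()) {
            os.write(text.getBytes(StandardCharsets.UTF_8));
        }

        // Read the JSON entity response from Idyl E3.
        String entityJson;
        try (Scanner scanner = new Scanner(conn.getInputStream(), "UTF-8")) {
            entityJson = scanner.useDelimiter("\\A").next();
        }

        // Convert the extracted entities to Vespa annotations and feed the
        // document to Vespa. See the Vespa Annotations API documentation for
        // the annotation format; this step is left as a placeholder.
        System.out.println(entityJson);
    }
}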

To recap, if you have not yet read about Vespa, it is worth a few minutes of your time. Its ability to ingest text with annotations makes it a natural fit for Idyl E3. You can certainly use Idyl E3 to annotate text for Vespa today, and we are going to make some improvements to make working with Vespa even easier.

Streaming Text in Idyl E3 2.5.0

The Idyl E3 API has an /extract endpoint that receives text and returns the extracted entities in response. This means you have to make a full HTTP connection for each extraction request. Idyl E3 2.5.0 introduces the ability to accept streaming text through a TCP socket. When Idyl E3 starts it will open a TCP port and listen for incoming text. As text is received, Idyl E3 will extract entities from it and return an entity extraction response over the same socket.

Now you can extract entities from the command line using a tool like netcat:

cat some-file.txt | netcat [idyl-e3-ip-address] [port]

Compare that command with using cURL:

curl -X POST http://idyl-e3-ip-address:port/api/v2/extract -H "Content-Type: text/plain; charset=UTF-8" -d "George Washington was president."

It's easy to see which command is simpler. Streaming should make processing text files and other continuous sources of text much simpler.

The response to streaming input is identical to the response received from the /extract endpoint. (Both commands above will produce the same output.)

{
   "entities":[
      {
         "text":"George Washington",
         "confidence":0.96,
         "span":{
            "tokenStart":0,
            "tokenEnd":2,
            "characterStart":0,
            "characterEnd":17
         },
         "type":"person",
         "languageCode":"eng",
         "context":"not-set",
         "documentId":"not-set",
         "extractionDate":1502970191843,
         "metadata":
         }
      }
   ],
   "extractionTime":72
}

Streaming is disabled by default. To enable it set the streaming.enabled property to true in Idyl E3’s properties file. Streaming does not currently support authentication. See the Idyl E3 Documentation for more streaming configuration options.
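For example, enabling streaming is a one-line change in the properties file. Only the streaming.enabled property is shown here; the listening port and other options are named in the Idyl E3 Documentation.

streaming.enabled=true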

What’s New in Idyl E3 2.5.0

Here's a quick summary of what's new in Idyl E3 2.5.0. It's not available yet but will be soon. For a full list, check out the Idyl E3 Release Notes.

What’s New

  • English-language person entity models can now be trained using the CoNLL-2003 format.
  • You can now create and use deep learning neural network entity models. Check out the previous blog posts for more information!
  • There’s a new setting that allows you to control how duplicate entities per extraction request are handled. You can choose to retain all duplicates or return only the duplicate entity with the highest probability.
  • A new TCP endpoint accepts streaming text. This endpoint deprecates the /ingest API endpoint.

What’s Changed

  • Idyl E3 2.5.0 changes all language codes to three-letter ISO 639 codes. While two-letter codes are still supported, we recommend using the three-letter codes instead.

What’s Fixed

  • Entities extracted by a non-model method (regular expression, dictionary) used to return a value of 100.0 for the entity’s probability. Extracted entity probabilities should exist within the range 0 to 1 so these entities are now extracted with a probability of 1.0 instead of 100.0.

Deep Learning Entity Models in Idyl E3 2.5.0

As we mentioned in an earlier post, Idyl E3 2.5.0 will include the ability to create and use deep learning neural network entity models. As we get closer to the release of Idyl E3 2.5.0, we wanted to introduce the new capability and compare it with the current entity models.

In Idyl E3 2.4.0 you can create entity models using a perceptron algorithm. This algorithm requires annotated training text and a list of features as input. Feature selection can be a difficult task. Too many features can result in over-fitting, where the model performs well on the training text but does not generalize well to other text. Feature selection is a crucial part of producing a useful, quality model.

Idyl E3 2.5.0's ability to create and use deep learning models still requires annotated input text, but it does not require a list of features. The features are discovered automatically during the execution of the neural network algorithm through the use of word vectors, produced by applications like word2vec or GloVe. Using such a tool, you generate a file of vectors from your training text and provide it to Idyl E3 during model training. In summary, manual feature selection is not required for deep learning models.
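For example, with Google's original word2vec tool you could generate a vectors file from your training text like this. The parameter values shown are illustrative, not recommendations for any particular corpus:

./word2vec -train training-text.txt -output vectors.txt -size 100 -window 5 -min-count 5 -binary 0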

While word vectors really help with deep learning model training, training a deep learning model can still be a challenging task. A neural network has many hyperparameters that tune the underlying algorithms. Small changes to these hyperparameters can have a dramatic effect on the generated model. Hyperparameter optimization is an active area of academic and industry research. Tools and best practices exist to help with hyperparameter selection, and we will provide some useful resources in the near future.

Idyl E3 2.5.0 and newer versions will continue to support creating and using maximum entropy-based models, so you can choose which type of model you want to create and use.


Updated English-person Entities Base Model

In the upcoming week we will post an updated English-person entities base model on the Models page. The model, like the version it replaces, will be free to use and will be included in the upcoming Idyl E3 2.5.0 release. To give an idea of the model's performance, we evaluated it against the CoNLL-2003 training set. The results are listed below, followed by the F-measure computation:

  • Precision: 0.916720
  • Recall: 0.776873
  • F-Measure: 0.841023
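For reference, the F-measure reported above is the harmonic mean of precision and recall, which you can verify from the numbers in the list:

F = (2 × Precision × Recall) / (Precision + Recall) = (2 × 0.916720 × 0.776873) / (0.916720 + 0.776873) ≈ 0.841023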

Please keep in mind that these models are trained on general text and may not provide adequate performance for all text. In these cases it is recommended that you use Idyl E3 Analyst Edition to create a custom entity model from your text. Launch an instance on AWS today.


Deep Learning Entity Extraction in Idyl E3

Idyl E3 Entity Extraction Engine 2.5.0 will introduce entity extraction powered by deep learning neural networks. Neural networks are powerful machine learning algorithms that excel at tasks like natural language processing. Idyl E3 will also support entity model training and usage on GPUs as well as CPUs. Using GPUs provides significant performance improvements. Idyl E3 2.5.0 will add support for AWS’ P2 instance type.

Entity models created by a deep learning neural network will be referred to as “second generation models.” Entity models created by Idyl E3 2.4.0 and earlier will be referred to as “first generation models.”

So how are the current entity models going to be different than the deep learning entity models?

Good question. Training entity models with Idyl E3 2.4.0 and earlier requires you to identify "features" of your text in order to train the model. Some examples of features include where an entity appears in a sentence, what words surround it, whether the word is capitalized, and so on. While you can create very powerful models using this method, identifying the features can be a laborious task that requires intimate knowledge of the text. It can also result in over-fitting, causing the model to not apply well to non-training text.

When training a deep learning entity model there is no need to identify the features; the algorithm is able to learn the features on its own during training. It is able to do this through word vectors. Idyl E3 2.5.0 will be able to use word vectors generated by word vector applications such as word2vec and GloVe. To create a deep learning entity model, simply provide your input training text and word vectors, and Idyl E3 will generate the model.

Can I customize the neural network used to train a model?

There will be many options available to customize the neural network used for model training with a standard set of options to be used out of the box. We will describe all of the available options in the Idyl E3 2.5.0 User’s Guide.

Will there be any other impacts of the new type of model training?

No. You can continue to use your existing first generation models. You can also continue to train new first generation models. In fact, you can use first and second generation models simultaneously in an Idyl E3 pipeline.

Any other questions that we did not cover? Let us know!

Handling Duplicate Entities

When performing entity extraction it is common for an entity extraction request to return duplicate entities. For example, given the input:

George Washington was president. George Washington was married to Martha.

Idyl E3 may return the following entities:

  • George Washington – person – 86% confidence
  • George Washington – person – 89% confidence

The entity "George Washington" is a duplicate entity because the entity text and entity type match at least one other entity in the same entity extraction response. New in Idyl E3 2.4.0, you can choose how to handle duplicate entities. The default behavior (and the behavior in past versions) is to return all entities regardless of whether they are duplicates. A new option is to return only the duplicate entity having the highest confidence. For example, given the above entities, Idyl E3 would return only the entity having 89% confidence; the lower-confidence duplicates are ignored.

The "Duplicate Entity Handling Strategy" is controlled via the duplicate.entity.handling.strategy property in Idyl E3's configuration file. The valid values are listed below, followed by an example setting:

  • retain – All entities are returned. This is the default behavior.
  • highest – When duplicate entities are present in a single entity extraction request, only the entity having the highest confidence value will be returned.
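For example, to return only the highest-confidence duplicate, the property in Idyl E3's configuration file would be set as follows:

duplicate.entity.handling.strategy=highest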

In summary, the new duplicate.entity.handling.strategy property controls how duplicate entities are handled on a per-entity extraction request basis. This property will be available in Idyl E3 2.4.0 and is documented in Idyl E3 2.4.0’s configuration documentation.

Training Definition File

In the next release of Idyl E3 Entity Extraction Engine (which will be version 2.4.0) we will introduce the Training Definition File to help alleviate a few problems.

The problems:

  1. When training an entity model there are quite a few command line arguments that you have to provide. The sheer number of arguments doesn’t help with usability.
  2. After training a model, unless you keep excellent documentation it’s easy to lose track of the training parameters. What entity type? Language? Iterations? Features? And so on.
  3. How do you manage the command line arguments and the feature generators XML file?

The Training Definition File offers a solution to these problems. It is an XML file that contains all of the training parameters. Everything. Now you have a record of the parameters used to create the model while also simplifying the command line arguments. Note that you can still use the command line arguments as they will remain available.

Below is an example of a training definition file. Note that the final format may change between now and the release.

<?xml version="1.0" encoding="UTF-8"?>
<trainingdefinition xmlns="http://www.mtnfog.com">
	<algorithm cutoff="1" iterations="1" threads="2" />
	<trainingdata file="person-train.txt" />
	<model file="person.bin" encryptionkey="enckey" language="en" type="person" />	
	<features>
		<generators>
			<cache>
				<generators>
					<window prevLength="2" nextLength="2">
						<tokenclass />
					</window>
					<window prevLength="2" nextLength="2">
						<token />
					</window>
					<definition />
					<prevmap />
					<bigram />
					<sentence begin="true" end="true" />
				</generators>
			</cache>
		</generators>
	</features>
</trainingdefinition>

You can see in the above XML that all of the entity model training parameters are included. The training definition file defines four things:

  1. The training algorithm.
  2. The training data.
  3. The output model.
  4. The feature generators.

This removes the need for a separate feature generators file since it is now included in the training definition file. Now when training an entity model you can use the simpler command:

java -jar idyl-e3-entity-model-generator.jar -td training-definition.xml

Look for the training definition file functionality to be included with Idyl E3 2.4.0. The details may change so check back for updates.

Idyl E3 Entity Extraction Engine AWS Reference Architectures

With the popularity of running Idyl E3 Entity Extraction Engine on AWS, we wanted to provide some AWS reference architectures to help you get started deploying Idyl E3 to AWS. Don't forget Idyl E3 is available on the AWS Marketplace for easy launching, and we have some Idyl E3 CloudFormation templates available on GitHub. We also offer managed Idyl E3 services if you prefer a hands-off approach to Idyl E3 deployment and operation.

A Few Notes Before Starting

Using a Pre-Configured AMI

No matter which architecture you choose, we recommend creating a pre-configured Idyl E3 AMI and using it to launch new Idyl E3 instances. We recommend this instead of relying on user-data scripts to perform the configuration because the time required to spin up a pre-configured AMI can be significantly less. If you want to keep the AMI configuration under source control, we highly recommend using HashiCorp's Packer to build the AMI.
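As a sketch, a minimal Packer template for building such an AMI might look like the following. The region, source AMI, instance type, and provisioning script are placeholders for illustration; substitute your own values.

{
  "builders": [{
    "type": "amazon-ebs",
    "region": "us-east-1",
    "source_ami": "ami-11111111",
    "instance_type": "t2.large",
    "ssh_username": "ec2-user",
    "ami_name": "idyl-e3-{{timestamp}}"
  }],
  "provisioners": [{
    "type": "shell",
    "script": "install-idyl-e3.sh"
  }]
}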

Stateless API

Before we describe the architectures, it is helpful to note that the Idyl E3 API is stateless. There is no session data that needs to be shared by multiple instances, and as long as all Idyl E3 instances are configured identically (as they should be when behind a load balancer), it does not matter which instance receives the entity extraction request. We can take advantage of this stateless architecture to scale Idyl E3 up (and down) as much as we need to in order to meet the demands of the load.

Load-balanced Architecture

The first architecture is a very simple one, yet it is probably adequate to meet the needs of most users. This architecture has a single VPC that contains two subnets. One subnet is public and contains an Elastic Load Balancer (ELB); the other subnet is private and contains the Idyl E3 instances. In the diagram shown below, the ELB is set to be a public ELB, allowing Idyl E3 requests to be received from the internet. However, if your application will also run in the VPC, you can change the ELB to an internal ELB. Note that this architecture uses a fixed number of Idyl E3 instances behind the ELB; any scaling up or down will have to be performed manually. Idyl E3's API has a /health endpoint that returns HTTP 200 OK when everything is OK, which is perfect for ELB instance health checks.
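As a sketch, the ELB's health check in a CloudFormation template could target that /health endpoint like this. The resource name, ports, and thresholds are illustrative assumptions, not values from our templates:

IdylE3LoadBalancer:
  Type: AWS::ElasticLoadBalancing::LoadBalancer
  Properties:
    Listeners:
      - LoadBalancerPort: "80"
        InstancePort: "9000"
        Protocol: HTTP
    HealthCheck:
      Target: HTTP:9000/health
      Interval: "30"
      Timeout: "5"
      HealthyThreshold: "3"
      UnhealthyThreshold: "5"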

Simple Idyl E3 AWS Architecture with VPC and ELB

Load-balanced and Auto-scaling Architecture

Launch the Idyl E3 CloudFormation stack!

The previous architecture is simple but very functional, and it minimizes cost. The first thing you will notice about it is the static nature of the Idyl E3 instances. To provide some flexibility, we can modify the architecture a bit to put the Idyl E3 instances into an auto-scaling group. We can use the group's Desired Capacity to manually control the number of Idyl E3 instances, or we can configure the auto-scaling group to scale up and down automatically based on a chosen metric. Average CPU usage is a good metric for scaling Idyl E3 because entity extraction can cause CPU usage to rise. With that change, here is what our architecture looks like now:

Idyl E3 AWS architecture with VPC, ELB, and autoscaling.

With the autoscaling we don’t have to worry about unexpected surges or decreases in entity extraction requests. The number of Idyl E3 instances will automatically scale up and down based on the average CPU usage of all Idyl E3 instances. Scaling down is important in order to keep costs to a minimum. Nobody wants to pay for more than what they need.

This architecture is available in our GitHub repository of Idyl E3 CloudFormation Templates. The template also contains an optional bastion instance to facilitate SSH access into the Idyl E3 instances from outside the VPC.

Need more?

Got more complicated requirements? Let us know. We have AWS certified engineers on staff and we’ll be glad to help.

Apache NiFi EQL Processor

We have published a new open source project on GitHub that is an Apache NiFi processor that filters entities through an Entity Query Language (EQL) query. When used along with the Idyl E3 NiFi Processor you can perform entity filtering in a NiFi dataflow pipeline.

To add the EQL processor to your NiFi pipeline, clone the project and build it, or download the jar file from our website. Then copy the jar to NiFi's lib directory and restart NiFi. The processor will now be available in the list of processors:

The EQL processor has a single property that holds the EQL query:

For this example our query will look for entities whose text is “George Washington”:

select * from entities where text = "George Washington"

Entities matching the EQL query will be output from the processor as JSON. Entities not matching the EQL query will be dropped.

With this capability we can create Apache NiFi dataflows that produce alerts when an entity matches a given set of conditions. Entities matching the EQL query can be published to an SQS queue, a Kafka stream, or any other NiFi processor.

The Entity Query Language previously existed as a component of the EntityDB project. It is now its own project on GitHub and is licensed under the Apache Software License, 2.0. The project’s README.md contains more examples of how to construct EQL queries.

Apache NiFi and Idyl E3 Entity Extraction Engine

We are happy to let you know how Idyl E3 Entity Extraction Engine can be used with Apache NiFi. First, what is Apache NiFi? From the NiFi homepage: "Apache NiFi supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic." Idyl E3 extracts entities (persons, places, things) from natural language text.

That's a very short description of NiFi, but it is accurate. Apache NiFi allows you to configure simple or complex data processing flows. For example, you can configure a pipeline to consume files from a file system and upload them to S3. (See Example Dataflow Templates.) There are many operations you can perform, and they are carried out by components called Processors. There are also many excellent guides available about NiFi.

There are many processors available for NiFi out of the box. One in particular is the InvokeHTTP processor, which lets your pipeline send an HTTP request. You can use this processor to send text to Idyl E3 for entity extraction from within your pipeline. However, to make things a bit simpler and more flexible, we have created a custom NiFi processor just for Idyl E3. This processor is available on GitHub, and its binaries will be included with all editions of Idyl E3 starting with version 2.3.0.

Idyl E3 NiFi Processor

Instructions for how to use the Idyl E3 processor will be added to the Idyl E3 documentation, but they are simple. Here's a rundown. Copy the idyl-e3-nifi-processor.jar from Idyl E3's home directory to NiFi's lib directory. Restart NiFi. Once NiFi is available you will see the Idyl E3 processor in the list of processors when adding a processor:

Idyl E3 NiFi Processor

There are a few properties you can set but the only required property is the Idyl E3 endpoint. By default, the processor extracts entities from the input text but this can be changed using the action property. The available actions are:

  • extract (the default) to get a JSON response containing the entities.
  • annotate to return the input text with the entities annotated.
  • sanitize to return the input text with the entities removed.
  • ingest to extract entities from the input text but provide no response. (This is useful if you are letting Idyl E3 plugins handle the publishing of entities to a database or other service outside of the NiFi data flow.)

The available properties are shown in the screen capture below:

And that is it. The processor will send text to Idyl E3 for entity extraction via Idyl E3’s /api/v2/extract endpoint. The response from Idyl E3 containing the entities will be placed in a new idyl-e3-response attribute.
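Downstream processors can then read the response using the NiFi Expression Language. For example, a downstream processor property can reference the attribute as shown below; the attribute name is the one described above, and where you use it depends on your flow:

${idyl-e3-response}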

The Idyl E3 NiFi processor is licensed under the Apache Software License, version 2.0. Under the hood, the processor uses the Idyl E3 client SDK for Java which is also licensed under the Apache license.

Idyl E3 SDK for Go

The Idyl E3 SDK for Go is now available on GitHub. This SDK allows you to integrate Idyl E3’s entity extraction capabilities into your Go projects.

Like the other Idyl E3 SDKs, the project is licensed under the Apache Software License, version 2.0.

It’s easy to use:

endpoint := "http://localhost:9000"
s := "George Washington was president."

// Optional extraction parameters.
confidence := 0
context := "context"
documentID := "documentID"
language := "en"
key := "your-api-key"

// Send the text to Idyl E3 and receive the extracted entities.
response := Extract(endpoint, s, confidence, context, documentID, language, key)


Amazon EBS Elastic Volumes

On February 13, 2017, Amazon Web Services announced elastic EBS volumes! If you have used EC2 much, you have undoubtedly been frustrated by the rigidity of EBS volumes. Once created, they could not be modified or resized. If your EC2 instance required more disk space, your only option was to manually create a new volume of the desired size and attach it to your instance. Now that EBS volumes are more "elastic," you can simply resize an EBS volume. I put "elastic" in quotes because the volume size can only be increased, not decreased. That's more elastic than before but still not completely elastic. In addition to adjusting size, you can now adjust performance and change the volume type, even while the volume is in use. These functions are available for your existing EBS volumes.

You can use the AWS CLI to modify a volume:

aws ec2 modify-volume --region us-east-1 --volume-id vol-11111111111111111 --size 200 --volume-type io1 --iops 10000

After enlarging a volume don’t forget to tell your OS to use the newly allocated storage.
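For example, on a Linux instance with an ext4 filesystem you might grow the partition and then the filesystem as shown below. The device names are assumptions; check yours with lsblk:

sudo growpart /dev/xvda 1
sudo resize2fs /dev/xvda1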

This can make life a lot easier in many situations. As described in the AWS blog post, you can use this functionality in combination with CloudWatch and Lambda to automatically enlarge volumes when running low on disk space. You can also use it to simply save money by starting with a smaller EBS volume than you might ultimately need, knowing you have the flexibility to increase the capacity of the volume when needed.

Why do we find this interesting? Our Idyl E3 managed services run in AWS, and we encourage all potential customers to launch Idyl E3 from the AWS Marketplace due to its ease of use and turn-key capabilities. So we like to pass interesting and relevant information about related services on to our users and readers when it becomes available. Learn more about Idyl E3's entity extraction capabilities.


New Feature Generators in Idyl E3 2.3.0

A feature generator is arguably the most important part of model-based entity extraction. The feature generators create “features” based on aspects of the input text that are used to determine what is and what is not an entity. Choosing the right (or wrong) features when training your entity models can have a significant impact on the performance of the models so we want you to have a good selection of feature generators available for use.

There are some new feature generators in Idyl E3 2.3.0 available to you that we’d like to take a minute to describe. All of the available feature generators and how to apply each one is described in the Idyl E3 2.3.0 Documentation.


Special Character Feature Generator

This feature generator generates features for tokens that contain special characters. For example, the token Hello would not generate a feature, but the token He*llo would. This feature generator is probably most useful in the domains of science and healthcare, particularly for chemical and drug names.

Token Part of Speech Feature Generator

This feature generator generates features based on each token's part of speech. To use this feature generator you must provide a trained part-of-speech model. (Idyl E3 2.3.0 includes a tool for creating part-of-speech models from your text.) This feature generator helps improve entity extraction performance by also considering each token's part of speech.

Word Normalization Feature Generator

This feature generator normalizes tokens by replacing all uppercase characters with A, all lowercase characters with a, and all digits with 0. For example, the token HelloWorld25 would be normalized to AaaaaAaaaa00. This feature generator can optionally lemmatize each token prior to normalization by applying a lemmatization model. (Idyl E3 2.3.0 includes a tool for creating lemmatization models from your text.) Like the special character feature generator, this feature generator is probably most useful in the domains of science and healthcare, particularly for chemical and drug names.



Idyl E3 and Google Cloud Natural Language API

In late 2016 Google announced a new service on their Google Cloud platform called Google Cloud Natural Language API. This service provides various natural language processing capabilities, including entity extraction. At first sight it seems as if Google Cloud Natural Language API is a direct competitor to Idyl E3, but on closer inspection the two products are very different. This blog post compares and contrasts the entity extraction capabilities of Idyl E3 and Google Cloud Natural Language API.

From the Google Cloud Natural Language API website:

Google Cloud Natural Language API reveals the structure and meaning of text by offering powerful machine learning models in an easy to use REST API. You can use it to extract information about people, places, events and much more, mentioned in text documents, news articles or blog posts.

This sounds a lot like Idyl E3. But let’s take a closer look at the similarities and differences between Idyl E3 and the Google Cloud Natural Language API.

Comparison of Idyl E3 and Google Cloud Natural Language API

Google Cloud Natural Language API and Idyl E3 are similar in that both expose entity extraction capabilities for natural language text through an API. Both accept text and return the extracted entities. Idyl E3 is an application that you manage and can be installed behind your organization's firewall. Google Cloud Natural Language API is a software-as-a-service (SaaS) offering, and Google manages the application and billing. In addition to entity extraction, Google Cloud Natural Language API also offers sentiment analysis.

Security

Text sent to Google Cloud Natural Language API is transmitted over the public internet. Even though the text is sent using SSL encryption, this may not be acceptable for text containing sensitive information. Some workloads are not allowed to be transmitted outside of the organization. Idyl E3 runs behind a firewall so your text never leaves your network. This makes Idyl E3 ideal for security sensitive workloads.

Entity Types

Google Cloud Natural Language API supports identifying the following entity types: Unknown, Person, Location, Organization, Event, Work of Art, Consumer Good, Other. Idyl E3 is not limited to any set of entities. With Idyl E3 you are in full control of the entity types because you are able to create entity models for any types of entities. For instance, you can train Idyl E3 to extract Hospitals, Buildings, Bridges, Schools, Stadiums, and more.

Types of Text used for Training

The types of text (news articles, blog posts, encyclopedia articles, etc.) that were used to train the engine powering Google Cloud Natural Language API do not seem to be documented. The type of text used for training is important to providing a high level of accuracy when extracting entities. With Idyl E3's ability to create custom models, you can create models specifically for your text, whether it be emails, legal documents, or other text.

For optimal performance, it is very important that the text being processed is similar to the text that was used to train the models.

Language Support

Google Cloud Natural Language API supports only English, Spanish, and Japanese for entity analysis (source). Idyl E3 is not limited by language; it can create and use entity models for any UTF-8 language.

Cost

Google Cloud Natural Language API’s pricing is per API request. This means that the more you use it the higher your bill. This is not the case with Idyl E3. Idyl E3 has flat licensing pricing. You do not pay per request.

20,000,000 Google Cloud Natural Language API requests: monthly price = $5,000 (20,000,000 / 1,000 × $0.25 per 1,000 requests)

In contrast, with Idyl E3 you could make 20 million or 100 million API requests per month at no additional cost. For example, you can launch Idyl E3 Analyst Edition from the AWS Marketplace for $1.50 per hour. If used for a full month (720 hours), the cost would be $1,080 (plus EC2 instance fees), no matter how many extraction requests you submit to Idyl E3. As you can see, Idyl E3 can cost substantially less than Google Cloud Natural Language API.

Control

With Idyl E3 you have full control over the entity extraction process. You can create custom sentence, token, and entity models for your text, giving higher accuracy and improved performance. Idyl E3's heuristic confidence filtering helps remove noise from the identified entities; Google Cloud Natural Language API does not have a concept of entity confidence values.

Additionally, you have full control over Idyl E3’s deployment architecture. You can also use Idyl E3 in an UIMA pipeline with the UIMA Annotator for Idyl E3.

Summary

To conclude, Idyl E3 and Google Cloud Natural Language API are very different products. Both expose an API for extracting entities from natural language text, but that is where the similarities stop. We will be offering an Idyl E3 plugin that supports using Google Cloud Natural Language API to complement Idyl E3's entity extraction capabilities. By providing this plugin, Idyl E3 will expose a common API for both services. Look for it to be available soon.


Idyl E3 2.2.0

Today we are announcing the release of Idyl E3 2.2.0. (See the full Release Notes.) This version brings some exciting new features such as heuristic confidence filtering, support for all UTF-8 languages, and statistics reporting.

Idyl E3 2.2.0 can be downloaded from our website today. Look for it to be available on the AWS Marketplace in the upcoming week.


Idyl E3 and OpenNLP

As you may know, Idyl E3's entity extraction capabilities are provided by a customized version of OpenNLP. Since the release of OpenNLP 1.7.0, the OpenNLP team has been able to release more often than before. Because of the more frequent OpenNLP releases we may not incorporate every release into Idyl E3; we will analyze the changes in each new OpenNLP version to decide whether they should be incorporated.

Also, we do have on the (distant) roadmap the ability to make the underlying NLP engine pluggable, allowing you to choose which NLP engine to use with Idyl E3.

Heuristic Confidence Filtering

In Idyl E3 2.2.0 we are introducing a feature we call Heuristic Confidence Filtering. Here’s how it works.

As you may (or may not) already know, each entity extraction request can have an associated "confidence threshold" value. Any extracted entities having a confidence lower than this value will not be returned in the entity extraction response. This is useful, but it is a bit of a sledgehammer approach and, depending on its value, can result in either too much noise or missed entities.

When enabled, heuristic confidence filtering tracks the confidence values of extracted entities per entity model. Once a large enough sample of confidence values has been collected, Idyl E3 filters entities by determining whether an entity's confidence value is significant relative to the mean of the collected values. This provides a way to filter out noise while still receiving important entities.
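The exact statistics Idyl E3 uses are not described here, but a simplified sketch of the idea in Java might look like the following. The sample size and the one-standard-deviation cutoff are assumptions for illustration only.

import java.util.ArrayList;
import java.util.List;

public class HeuristicConfidenceFilter {

    private static final int MINIMUM_SAMPLE_SIZE = 100;  // assumed sample size
    private final List<Double> confidences = new ArrayList<>();

    // Returns true if the entity's confidence should pass the filter.
    public boolean accept(double confidence, double requestThreshold) {
        confidences.add(confidence);

        // Entities meeting the request's confidence threshold always pass.
        if (confidence >= requestThreshold) {
            return true;
        }

        // Until enough samples are collected, fall back to the threshold alone.
        if (confidences.size() < MINIMUM_SAMPLE_SIZE) {
            return false;
        }

        // Compute the mean and standard deviation of the collected values.
        double mean = confidences.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        double variance = confidences.stream()
                .mapToDouble(c -> (c - mean) * (c - mean))
                .average().orElse(0.0);
        double stdDev = Math.sqrt(variance);

        // Accept values not significantly below the collected mean.
        // One standard deviation is an assumed cutoff.
        return confidence >= mean - stdDev;
    }
}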

It is important to note that the confidence threshold value still plays a part even when heuristic confidence filtering is enabled. Any entity whose confidence value is greater than or equal to the confidence threshold for that request will always be returned even when heuristic confidence filtering is enabled.

Because of the mathematical calculations involved and the memory required to store the confidence values, heuristic confidence filtering does require a bit more computation time, but not to the point where it should be noticeable.

We are excited to offer this feature and hope that it helps with "entity noise." We welcome your feedback on how it performs for you! For more information on this feature, refer to the Idyl E3 2.2.0 User Documentation or contact us. Look for Idyl E3 2.2.0 to be available in February 2017.

AWS CloudFormation Supports YAML

In an exciting update from AWS, it was announced that CloudFormation now supports YAML in addition to JSON. I think most of us will agree this is great. The JSON templates worked, but whew, were they hard to read, and the inability to add comments sometimes made my templates look more like sudokus or word searches than anything else.

They also announced support for cross-stack references. That means no more duplicating resources between templates! There's a small gotcha with cross-stack references in that the names of the exported values have to be unique within your account and region and have to be literal string values.
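Here is a small sketch of a cross-stack reference in the new YAML syntax. The resource and export names are made up for illustration; one stack exports a value and another imports it:

# In the exporting stack's template:
Outputs:
  IdylE3SecurityGroup:
    Value: !Ref IdylE3SecurityGroup
    Export:
      Name: idyl-e3-security-group

# In the importing stack's template:
Resources:
  IdylE3Instance:
    Type: AWS::EC2::Instance
    Properties:
      SecurityGroupIds:
        - !ImportValue idyl-e3-security-group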

These new features are significant enough that I felt they deserved a mention on this blog. They will definitely have an immediate impact on how we create CloudFormation for ourselves and our clients.

OpenNLP’s RegexNameFinder and Tokenizing

OpenNLP's RegexNameFinder takes one or more regular expressions and uses those expressions to extract entities from the input text. This is very useful for instances in which you want to extract things that follow a set format, like phone numbers and email addresses. However, be careful when tokenizing the input to the RegexNameFinder because tokenization can affect the RegexNameFinder's accuracy.

The RegexNameFinder is very simple to use. Here's an example borrowed from an OpenNLP test case.

import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

import opennlp.tools.namefind.RegexNameFinder;
import opennlp.tools.util.Span;

// A regular expression and the tokenized sentence to search.
Pattern testPattern = Pattern.compile("test");
String[] sentence = new String[]{"a", "test", "b", "c"};

// Map an entity type to the patterns that identify it.
Pattern[] patterns = new Pattern[]{testPattern};
Map<String, Pattern[]> regexMap = new HashMap<>();
String type = "testtype";

regexMap.put(type, patterns);

RegexNameFinder finder = new RegexNameFinder(regexMap);

Span[] result = finder.find(sentence);

The sentence variable is a list of tokens. In the above example the tokens are set manually. In a more likely scenario the string would be received as “a test b c” and it would be up to the application to tokenize the string into {“a”, “test”, “b”, “c”}.

There are three types of tokenizers available in OpenNLP: the WhitespaceTokenizer, the SimpleTokenizer, and a tokenizer (TokenizerME) that uses a token model you have trained. The WhitespaceTokenizer works on, you guessed it, white space; the locations of white space in the string are used to tokenize it. The SimpleTokenizer looks at character classes, such as letters and numbers.

Let’s take the example string “My email address is me@me.com and I like Gmail.” Using the WhitespaceTokenizer the tokens are {“My”, “email”, “address”, “is”, “me@me.com”, “and”, “I”, “like”, “Gmail.”}. If we use the RegexNameFinder with a regular expression that matches an email address, OpenNLP will return to us the span covering “me@me.com”. Works great!

However, let’s consider the sentence “My email address is me@me.com.” Using the WhitespaceTokenizer again the tokens are {“My”, “email”, “address”, “is”, “me@me.com.”}. Notice the last token includes the sentence’s period. Our regular expression for an email address will not match “me@me.com.” because it is not a valid email address. Using the SimpleTokenizer doesn’t give any better results.

How to work around this is up to you. You could make a custom tokenizer by implementing the Tokenizer interface, try using a token model, or massage your text before it is passed to the tokenizer. The sketch below illustrates the last approach.
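As an illustrative example (not a recommended general-purpose tokenizer), the following whitespace-tokenizes the text and then splits a trailing period off any token, so that "me@me.com." yields the tokens "me@me.com" and ".":

import java.util.ArrayList;
import java.util.List;

import opennlp.tools.tokenize.WhitespaceTokenizer;

public class TrailingPunctuationTokenizer {

    public static String[] tokenize(String text) {
        final List<String> tokens = new ArrayList<>();
        for (String token : WhitespaceTokenizer.INSTANCE.tokenize(text)) {
            // Split a trailing period into its own token so patterns like
            // email addresses can match the token that precedes it.
            if (token.length() > 1 && token.endsWith(".")) {
                tokens.add(token.substring(0, token.length() - 1));
                tokens.add(".");
            } else {
                tokens.add(token);
            }
        }
        return tokens.toArray(new String[0]);
    }
}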