Idyl NLP Annotation Format

Idyl E3’s entity model training tool expects entities in training text to be annotated in the format used by OpenNLP. This format uses START and END tags to denote entities:

<START:person> George Washington <END> was president.

This works well, but it has a drawback: the annotations and the text have to be combined in a single file. Once the text is annotated, it becomes difficult to use the training text for any other purpose.

New Annotation Format

Idyl E3 2.4.0 will introduce an additional method of annotating text that allows the annotations to be stored separately from the training text. In 2.4.0 the annotations can be stored in a separate file (and we plan to eventually support storing the annotations in a database). Even though Idyl E3 2.4.0 is not yet ready for prime time, we wanted to introduce this feature early in case you are in the middle of an annotation effort and want to use the new format.

The input text must still contain a single sentence per line. Use blank lines to indicate document boundaries. Here’s an example of a simple input training file:

George Washington was president.
He was president of the United States.
George Washington was married to Martha Washington.
In 1755, Washington became the senior American aide to British General Edward Braddock on the ill-fated Braddock expedition.

And here are the annotations, stored in a separate file:

1 0 2 person
2 "United States" place
3 0 2 person
3 5 7 person
4 "Edward Braddock" person

Here’s what this means. Each line in the annotations file represents an annotation in the training text. So there are 5 annotations in this example.

For the lines having 3 columns:

  • The first column is the line number that contains the entity.
  • The second column is the text of the entity in double quotes.
  • The third column is the type of the entity.

For the lines with 4 columns:

  • The first column is the line number that contains the entity. In this example every one of the 4 lines has at least one annotation.
  • The second column is the token index of the start of the entity. Indexes are zero-based so the first token is zero!
  • The third column is the token index of the end of the entity. In this example the end index is exclusive: 0 2 covers tokens 0 and 1 (George Washington).
  • The last column is the type of the entity.

Note that there are two entities in the third line and each is put on its own separate line in the annotations file. Specifying the entity text in the three column format simplifies the annotation by removing the need to specify the entity’s token start and end positions. This will only annotate the first occurrence of the entity text. (If Edward Braddock had occurred more than once in the input text on line 4 only the first occurrence would be annotated.)
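To make the two formats concrete, here is a minimal sketch that applies a single annotation line to its sentence and produces the OpenNLP-style tags. The class and method names are hypothetical and are not part of Idyl E3. It assumes tokens are separated by whitespace and that the end index is exclusive, which matches the example above but may not match how Idyl E3 tokenizes your text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Idyl E3.
final class AnnotationSketch {

    /** Returns the 1-based line number stored in the first column. */
    static int lineNumber(String annotationLine) {
        return Integer.parseInt(annotationLine.trim().split("\\s+")[0]);
    }

    /** Applies one 3- or 4-column annotation line to the sentence it refers to. */
    static String apply(String sentence, String annotationLine) {
        String line = annotationLine.trim();
        int firstQuote = line.indexOf('"');
        if (firstQuote >= 0) {
            // Three columns: <line> "<entity text>" <type>
            int lastQuote = line.lastIndexOf('"');
            String entity = line.substring(firstQuote + 1, lastQuote);
            String type = line.substring(lastQuote + 1).trim();
            int at = sentence.indexOf(entity); // first occurrence only
            if (at < 0) {
                return sentence; // entity text not found, leave the sentence unchanged
            }
            return sentence.substring(0, at) + "<START:" + type + "> " + entity
                    + " <END>" + sentence.substring(at + entity.length());
        }
        // Four columns: <line> <start token> <end token, exclusive> <type>
        String[] cols = line.split("\\s+");
        int start = Integer.parseInt(cols[1]);
        int end = Integer.parseInt(cols[2]);
        String type = cols[3];
        String[] tokens = sentence.trim().split("\\s+"); // assumption: whitespace tokens
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (i == start) out.add("<START:" + type + ">");
            out.add(tokens[i]);
            if (i == end - 1) out.add("<END>");
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        System.out.println(apply("George Washington was president.", "1 0 2 person"));
        System.out.println(apply("He was president of the United States.", "2 \"United States\" place"));
    }
}
```

With whitespace tokens, trailing punctuation stays attached to the last word of a sentence, so a four-column annotation that ends on the final token would include the period. A real tokenizer would likely avoid that.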

Summary

Now your annotations can be kept separate from your training text allowing you to use your training text for other purposes. Additionally, we hope that this new annotation method helps decrease the time required for annotating and helps with automating the process. As mentioned earlier in the post, currently the only supported means of storing the annotations is in a separate file but we plan to extend this to support databases in a future release of Idyl E3.

The Entity Model Generator tool included in Idyl E3 has been updated to allow for using this new annotation format. You can, however, continue to use the OpenNLP-style annotations when creating entity models. This new annotation format is only available for entity models. Sentence, token, parts-of-speech, and lemma model annotations will remain unchanged in 2.4.0.


Idyl E3 SDK for Go

The Idyl E3 SDK for Go is now available on GitHub. This SDK allows you to integrate Idyl E3’s entity extraction capabilities into your Go projects.

Like the other Idyl E3 SDKs, the project is licensed under the Apache Software License, version 2.0.

It’s easy to use:

endpoint := "http://localhost:9000"
s := "George Washington was president."
confidence := 0
context := "context"
documentID := "documentID"
language := "en"
key := "your-api-key"

response := Extract(endpoint, s, confidence, context, documentID, language, key)

Amazon EBS Elastic Volumes

On Feb 13, 2017, Amazon Web Services announced elastic EBS volumes! If you have used EC2 much you have undoubtedly been frustrated by the rigidity of EBS volumes. Once created they could not be modified or resized. If your EC2 instance required more disk space your only option was to manually create a new volume of the desired size and attach it to your instance. Now that EBS volumes are more “elastic” you can simply resize an EBS volume. I put “elastic” in quotes because the volume size can only be increased and not decreased. That’s more elastic than before but still not completely elastic. In addition to adjusting size, you can now adjust performance and change the volume type even while the volume is in use. These functions are available for your existing EBS volumes.

You can use the AWS CLI to modify a volume:

aws ec2 modify-volume --region us-east-1 --volume-id vol-11111111111111111 --size 200 --volume-type io1 --iops 10000

After enlarging a volume don’t forget to tell your OS to use the newly allocated storage.

This can make life a lot easier in many situations. As described in the AWS blog post, you can use this functionality in combination with CloudWatch and Lambda to automatically enlarge volumes when running low on disk space. You can also use it to save money by starting with a smaller EBS volume than you might need, knowing you have the flexibility to increase the capacity of the volume when needed.

Why do we find this interesting? Our Idyl E3 managed services run in AWS and we encourage all potential customers to launch Idyl E3 from the AWS Marketplace due to its ease of use and turn-key capabilities. So we like to pass interesting and relevant information about related services on to our users and readers when it becomes available. Learn more about Idyl E3’s entity extraction capabilities.


New Feature Generators in Idyl E3 2.3.0

A feature generator is arguably the most important part of model-based entity extraction. The feature generators create “features” based on aspects of the input text that are used to determine what is and what is not an entity. Choosing the right (or wrong) features when training your entity models can have a significant impact on the performance of the models so we want you to have a good selection of feature generators available for use.

Idyl E3 2.3.0 includes some new feature generators that we’d like to take a minute to describe. All of the available feature generators, and how to apply each one, are described in the Idyl E3 2.3.0 Documentation.

New Feature Generators in Idyl E3 2.3.0

Special Character Feature Generator

This feature generator generates features for tokens that contain special characters. For example, the token Hello would not generate a feature but the token He*llo would generate a feature. This feature generator is probably most useful in the domains of science and healthcare, particularly for chemical and drug names.
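As a rough illustration of the idea, the check below treats any character that is neither a letter nor a digit as “special.” That definition is an assumption made for this sketch, and the class name is hypothetical. The character set Idyl E3 actually uses may differ.

```java
// Hypothetical sketch, not the Idyl E3 implementation.
final class SpecialCharacterSketch {

    /** True if the token has at least one character that is neither a letter nor a digit. */
    static boolean hasSpecialCharacter(String token) {
        return token.chars().anyMatch(c -> !Character.isLetterOrDigit(c));
    }

    public static void main(String[] args) {
        System.out.println(hasSpecialCharacter("Hello"));
        System.out.println(hasSpecialCharacter("He*llo"));
    }
}
```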

Token Part of Speech Feature Generator

This feature generator generates features based on each token’s part of speech. To use this feature generator you must provide a trained part of speech model. (Idyl E3 2.3.0 includes a tool for creating parts-of-speech models from your text.) It can help improve entity extraction performance because the part of speech of each token is also taken into account.

Word Normalization Feature Generator

This feature generator normalizes tokens by replacing all uppercase characters with A, all lowercase characters with a, and all digits with 0. For example, the token HelloWorld25 would be normalized to AaaaaAaaaa00. This feature generator can optionally lemmatize each token prior to the normalization by applying a lemmatization model. (Idyl E3 2.3.0 includes a tool for creating lemmatization models from your text.) Like the special character feature generator, this feature generator is probably most useful in the domains of science and healthcare, particularly for chemical and drug names.
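The normalization rule itself is simple enough to sketch in a few lines. The class below is a hypothetical illustration of the character replacement only. It leaves other characters unchanged (an assumption) and does not include the optional lemmatization step.

```java
// Hypothetical sketch of the character replacement, not the Idyl E3 implementation.
final class WordNormalizationSketch {

    /** Replaces uppercase letters with A, lowercase letters with a, and digits with 0. */
    static String normalize(String token) {
        StringBuilder normalized = new StringBuilder(token.length());
        for (char c : token.toCharArray()) {
            if (Character.isUpperCase(c)) {
                normalized.append('A');
            } else if (Character.isLowerCase(c)) {
                normalized.append('a');
            } else if (Character.isDigit(c)) {
                normalized.append('0');
            } else {
                normalized.append(c); // assumption: other characters pass through unchanged
            }
        }
        return normalized.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("HelloWorld25")); // prints AaaaaAaaaa00
    }
}
```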


Packer Script for Graphite AMI

Idyl E3 2.2.0 added support for publishing metrics to a Graphite server. To help make it easier to deploy a Graphite server we have added a new project on our GitHub that contains a Packer script for creating a Graphite AMI. Usage instructions are available in the project’s readme file.

mtnfog/graphite-ami


Idyl E3 and Google Cloud Natural Language API

In late 2016 Google announced a new service on their Google Cloud platform called Google Cloud Natural Language API. This service provides various natural language processing capabilities including entity extraction. At first sight it seems as if Google Cloud Natural Language API is a direct competitor to Idyl E3, but on closer inspection the two products are very different. This blog post compares and contrasts the entity extraction capabilities of Idyl E3 and Google Cloud Natural Language API.

From the Google Cloud Natural Language API website:

Google Cloud Natural Language API reveals the structure and meaning of text by offering powerful machine learning models in an easy to use REST API. You can use it to extract information about people, places, events and much more, mentioned in text documents, news articles or blog posts.

This sounds a lot like Idyl E3. But let’s take a closer look at the similarities and differences between Idyl E3 and the Google Cloud Natural Language API.

Comparison of Idyl E3 and Google Cloud Natural Language API

Google Cloud Natural Language API and Idyl E3 are similar in that both expose entity extraction capabilities for natural language text over an API. Both accept text and return the extracted entities. Idyl E3 is an application that you manage and that can be installed behind your organization’s firewall. Google Cloud Natural Language API is a software-as-a-service (SaaS) offering, and Google manages the application and billing. In addition to entity extraction, Google Cloud Natural Language API also offers sentiment analysis.

Security

Text sent to Google Cloud Natural Language API is transmitted over the public internet. Even though the text is sent using SSL encryption, this may not be acceptable for text containing sensitive information. Some workloads are not allowed to be transmitted outside of the organization. Idyl E3 runs behind a firewall so your text never leaves your network. This makes Idyl E3 ideal for security sensitive workloads.

Entity Types

Google Cloud Natural Language API supports identifying the following entity types: Unknown, Person, Location, Organization, Event, Work of Art, Consumer Good, Other. Idyl E3 is not limited to any set of entities. With Idyl E3 you are in full control of the entity types because you are able to create entity models for any types of entities. For instance, you can train Idyl E3 to extract Hospitals, Buildings, Bridges, Schools, Stadiums, and more.

Types of Text used for Training

The types of text (news articles, blog posts, encyclopedia articles, etc.) used to train the engine powering Google Cloud Natural Language API do not seem to be documented. The type of training text matters for achieving a high level of accuracy when extracting entities. With Idyl E3’s ability to create custom models, you can create models specifically for your text, whether it be emails, legal documents, or other text.

For optimal performance, it is very important that the text being processed is similar to the text that was used to train the models.

Language Support

Google Cloud Natural Language API only supports English, Spanish, and Japanese for entity analysis (source). Idyl E3 is not limited by language. Idyl E3 can create and use entity models for any UTF-8 language.

Cost

Google Cloud Natural Language API’s pricing is per API request. This means that the more you use it the higher your bill. This is not the case with Idyl E3. Idyl E3 has flat licensing pricing. You do not pay per request.

20,000,000 Google Cloud Natural Language API requests at $0.25 per 1,000 requests: monthly price = $5,000 (20,000,000 / 1,000 * 0.25)

In contrast, with Idyl E3 you could make 20 million or 100 million API requests per month and there is no additional cost. For example, you can launch Idyl E3 Analyst Edition from the AWS Marketplace for $1.50 per hour. If used for a full month the cost would be $1,080 (plus EC2 instance fees) no matter how many extraction requests you submit to Idyl E3. As you can see, Idyl E3 can cost substantially less than Google Cloud Natural Language API.
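The arithmetic behind these figures can be checked with a short sketch. It uses only the prices quoted in this post ($0.25 per 1,000 requests and $1.50 per hour), assumes a 30-day month, and ignores EC2 instance fees. The class and method names are hypothetical.

```java
// Hypothetical cost comparison using only the prices quoted in this post.
final class CostSketch {

    static final double GOOGLE_PRICE_PER_1000_REQUESTS = 0.25;
    static final double IDYL_E3_PRICE_PER_HOUR = 1.50;

    /** Monthly per-request cost for the given number of requests. */
    static double googleMonthlyCost(long requests) {
        return requests / 1000.0 * GOOGLE_PRICE_PER_1000_REQUESTS;
    }

    /** Flat hourly cost for a month of the given length, excluding EC2 fees. */
    static double idylMonthlyCost(int daysInMonth) {
        return IDYL_E3_PRICE_PER_HOUR * 24 * daysInMonth;
    }

    /** Number of requests at which the per-request bill equals the flat monthly cost. */
    static long breakEvenRequests(int daysInMonth) {
        return Math.round(idylMonthlyCost(daysInMonth) / GOOGLE_PRICE_PER_1000_REQUESTS * 1000);
    }

    public static void main(String[] args) {
        System.out.println(googleMonthlyCost(20_000_000L));
        System.out.println(idylMonthlyCost(30));
        System.out.println(breakEvenRequests(30));
    }
}
```

Under these assumptions the two options cost the same at about 4.3 million requests per month. Above that volume the flat hourly price is cheaper, before EC2 fees.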

Control

With Idyl E3 you have full control over the entity extraction process. You can create custom sentence, token, and entity models for your text giving higher accuracy and improved performance. Idyl E3’s heuristic confidence filtering helps remove noise from the identified entities. Google Cloud Natural Language API does not have a concept of entity confidence values.

Additionally, you have full control over Idyl E3’s deployment architecture. You can also use Idyl E3 in a UIMA pipeline with the UIMA Annotator for Idyl E3.

Summary

To conclude, Idyl E3 and Google Cloud Natural Language API are very different products. They both expose an API for entity extraction from natural language text but that’s where the similarities stop. We will be offering an Idyl E3 plugin that supports using Google Cloud Natural Language API to complement Idyl E3’s entity extraction capabilities. By providing this plugin Idyl E3 will be exposing a common API for both services. Look for it to be available soon.


Idyl E3 2.2.0

Today we are announcing the release of Idyl E3 2.2.0. (See the full Release Notes.) This version brings some exciting new features such as heuristic confidence filtering, support for all UTF-8 languages, and statistics reporting.

Idyl E3 2.2.0 can be downloaded from our website today. Look for it to be available on the AWS Marketplace in the upcoming week.


Idyl E3 and OpenNLP

As you may know, Idyl E3’s entity extraction capabilities are provided by a customized version of OpenNLP. Since the release of OpenNLP 1.7.0, the OpenNLP team has been able to release more often than before. Because of the more frequent OpenNLP releases we may not incorporate each release into Idyl E3. We will analyze the changes in each new OpenNLP version to decide whether they should be incorporated into Idyl E3.

Also, we do have on the (distant) roadmap the ability to make the underlying NLP engine pluggable to allow you to choose which NLP engine to use with Idyl E3.

Heuristic Confidence Filtering

In Idyl E3 2.2.0 we are introducing a feature we call Heuristic Confidence Filtering. Here’s how it works.

As you may (or may not) already know, each entity extraction request can have an associated “confidence threshold value.” Any extracted entities that have a confidence lower than this value will not be returned in the entity extraction response. This is useful but it is a bit of a sledgehammer approach and, depending on its value, can result in either too much noise or missed entities.

When enabled, heuristic confidence filtering tracks the confidence values of extracted entities separately for each entity model that extracted them. Once a large enough sample of confidence values has been collected, Idyl E3 filters entities by determining whether an entity’s confidence value is significant relative to the mean of the collected values. This provides a way to filter out noise but still receive important entities.

It is important to note that the confidence threshold value still plays a part even when heuristic confidence filtering is enabled. Any entity whose confidence value is greater than or equal to the confidence threshold for that request will always be returned even when heuristic confidence filtering is enabled.
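To give a feel for how such a filter might behave, here is a minimal sketch. It is not Idyl E3’s implementation: the class name, the minimum sample size, and the specific rule (keep a below-threshold entity when its confidence is no more than a chosen number of standard deviations below the per-model mean) are all assumptions made for illustration. The only behavior taken from the description above is that statistics are tracked per entity model and that entities at or above the request’s threshold are always returned.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; the significance rule below is an assumption, not Idyl E3's algorithm.
final class HeuristicConfidenceSketch {

    /** Running statistics for one entity model (Welford's online algorithm). */
    private static final class Stats {
        long count;
        double mean;
        double sumOfSquaredDiffs;
    }

    private final Map<String, Stats> statsByModel = new HashMap<>();
    private final int minSamples;
    private final double maxStdDevsBelowMean;

    HeuristicConfidenceSketch(int minSamples, double maxStdDevsBelowMean) {
        this.minSamples = Math.max(2, minSamples); // at least 2 values are needed for a variance
        this.maxStdDevsBelowMean = maxStdDevsBelowMean;
    }

    /** Decides whether to return an entity, then records its confidence for the model. */
    boolean accept(String modelId, double confidence, double threshold) {
        Stats stats = statsByModel.computeIfAbsent(modelId, id -> new Stats());
        boolean keep;
        if (confidence >= threshold) {
            keep = true; // entities at or above the threshold are always returned
        } else if (stats.count < minSamples) {
            keep = false; // not enough samples yet, fall back to the plain threshold
        } else {
            double stdDev = Math.sqrt(stats.sumOfSquaredDiffs / (stats.count - 1));
            keep = confidence >= stats.mean - maxStdDevsBelowMean * stdDev;
        }
        // Update the running mean and variance with the new value.
        stats.count++;
        double delta = confidence - stats.mean;
        stats.mean += delta / stats.count;
        stats.sumOfSquaredDiffs += delta * (confidence - stats.mean);
        return keep;
    }
}
```

Keeping only a count, a mean, and a running sum per model keeps memory use constant, whereas the post describes storing the confidence values themselves.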

Because of the mathematical calculations involved and the memory required to store the confidence values the heuristic confidence filtering does require a bit more computation time but not to the point where it should be noticeable.

We are excited to offer this feature and we hope that it helps with “entity noise.” We welcome your feedback on how it performs for you! For more information on this feature you can refer to the Idyl E3 2.2.0 User Documentation or contact us. Look for Idyl E3 2.2.0 to be available in February 2017.

Idyl E3 2.1.0

Idyl E3 2.1.0 has been released. This version introduces a new version of the API that includes changes to the extract and ingest endpoints. With version 2 of the API these two endpoints accept text in the body of the request instead of as a query string parameter. Version 1 of the API is still available so you do not need to update your clients unless you just want to or need to for other reasons. The Idyl E3 Java SDK and the Idyl E3 .NET SDK have been updated to use API v2.

Idyl E3 2.1.0 is based on a customized OpenNLP 1.7.0, which was released in early January 2017. Previous versions of Idyl E3 were based on a customized OpenNLP 1.6.0.

Idyl E3 2.1.0 Analyst Edition will be available on the AWS Marketplace soon. The Analyst Edition includes all plugins and allows for the use of unlimited custom models without separate licensing. (See the Idyl E3 edition comparison.)

Privacy Policy Changes

We want to make you aware of a recent change to our Privacy Policy. We added a paragraph to the “Non-personal identification information” section about product update checks. The new paragraph describes the information that is transmitted when our products perform an updated version check. Remember that update checks can always be enabled or disabled — please check the product’s documentation for instructions or contact us.

Idyl E3 2.0

Update: Idyl E3 2.0 is now available on the AWS Marketplace: https://aws.amazon.com/marketplace/pp/B01BSQUR2K

Today we are announcing Idyl E3 2.0. It has been over a year since version 1.0 was introduced. The main goals of version 2.0 were to make Idyl E3 extensible and increase performance. We would like to thank our users for helping us get to this milestone release. We could not have done it without your feedback and comments.

Idyl E3 is available for download from our website. Look for Idyl E3 2.0 to be available on the AWS Marketplace and other channels shortly thereafter.

Three Editions

Idyl E3 2.0 will be available in three editions:

Idyl E3 Free Edition

This edition of Idyl E3 is free. It includes an English-persons entity model and no plugins. This edition can be customized with plugins and models to meet your requirements.

Idyl E3 Standard Edition

The Standard Edition includes everything in the free edition plus model evaluation tools and priority email technical support.

Idyl E3 Analyst Edition

The Analyst Edition includes everything in the standard edition plus all plugins and supports unlimited custom models.

Plugins

In Idyl E3 1.x, things like email addresses and phone numbers were extracted through built-in functionality called extraction modules. In version 2.0 we are introducing plugins. Plugins come in several types, including plugins that perform entity extraction and plugins that publish the extracted entities. Plugins can be downloaded from our website and installed in your Idyl E3. The following plugins are currently available or will be soon:

Text Consumption Plugins

  • Consume input text from Kafka topic
  • Consume input text from Kinesis stream

Entity Extraction Plugins

  • Phone numbers extraction plugin
  • Email addresses extraction plugin
  • Hashtags extraction plugin
  • User mentions extraction plugin

Document Processing Plugins

  • Parse text from PDF files

Entity Publisher Plugins

  • AWS Kinesis Firehose publisher plugin
  • EntityDB publisher plugin

Internal changes were made to improve Idyl E3’s performance to lower the time to extract entities. One change was the removal of the web-based dashboard. Configuration is now done directly through the properties file.

Custom Sentence and Token Models

Also new in version 2.0 to increase performance is the ability to generate and use custom sentence and token models. In versions 1.x, internal models were used for sentence detection and sentence tokenizing. These models were not always representative of the input text so their performance was degraded. In version 2.0 you have the option to generate sentence and token models from your data or use the legacy internal models just as versions 1.x did. You can still create your own entity models.

UIMA

Idyl E3 2.0 supports integration with UIMA through the Idyl E3 UIMA connector.

EntityDB and AWS CloudWatch Metrics

We added the ability for EntityDB to report metrics to AWS CloudWatch. The metrics reported include the numbers of entities stored and indexed. A screen capture of an AWS CloudWatch graph is shown below. The system that generated the metrics illustrated by the chart was composed of 5 EntityDB t2.micro instances in an auto-scaling group behind an elastic load balancer. An SQS queue was used for the entity queue and entities were persisted to a MongoDB database also running on a t2.micro instance. (This architecture was created using the CloudFormation templates in the GitHub repository.)

EntityDB CloudWatch metrics

As the metrics show, the entities are being stored at a rate much faster than the entities are being indexed. We will be working to make the index rate (orange line) more closely follow the stored rate (blue line).

Cloud NLP

We have made available some NLP services over a REST API. The services, collectively called Cloud NLP, currently include sentiment analysis and language detection. Additional services will be added over the next few weeks. Cloud NLP requires an API key that you can get for free by contacting us or by consuming Cloud NLP through the Mashape API Marketplace.

The Cloud NLP Java client SDK is now available on GitHub. It is licensed under the Apache Software License, version 2.0. The Maven dependency information is:

<dependency>
    <groupId>com.mtnfog.cloudnlp</groupId>
    <artifactId>cloud-nlp-java-sdk</artifactId>
    <version>1.0.0</version>
</dependency>

The Cloud NLP Java client SDK includes support for accessing the Cloud NLP services directly with a Mountain Fog API key or through Mashape. It’s easy to use:

CloudNlpClient cloudNlpClient = new CloudNlpStandardClient(API_KEY, CloudNlpStandardClient.MTNFOG_CLOUDNLP_ENDPOINT);

String language = cloudNlpClient.detectLanguage("This is english text.");
int sentiment = cloudNlpClient.analyzeSentiment("This widget is great!");

Similarly, to use Cloud NLP via Mashape just change to the CloudNlpMashapeClient:

CloudNlpClient cloudNlpClient = new CloudNlpMashapeClient(MASHAPE_API_KEY);

String language = cloudNlpClient.detectLanguage("This is english text.");
int sentiment = cloudNlpClient.analyzeSentiment("This widget is great!");

And that’s it. As mentioned earlier, look for more natural language processing services to be added to Cloud NLP in the near future!

Open Source Updates

In the past week we made the following updates to our open source projects.

Entity Model – Updated to include a new Span class on entity. The Span class identifies the location of the entity in the source text by token and by character indexes. This update was made to the entity-model and entity-model-net projects. Version 1.0.8 of entity-model was published to Maven Central and version 2.0.0 of entity-model-net was published to NuGet.

Idyl E3 UIMA Annotator – An update was made that annotates the entities based on the character index instead of the token index so the entities are properly annotated in UIMA. (The Idyl E3 UIMA Annotator requires Idyl E3 1.13.0 which is not quite ready but look for it soon.)

Idyl E3 Client SDKs – The Idyl E3 client SDK for Java was updated to use entity-model 1.0.8. The Idyl E3 client SDK for .NET was updated to use the new MountainFog.EntityModel 2.0.0 package from NuGet.

Anthology – Anthology was updated to include the ability to load balance Idyl E3 entity extraction requests. You can now specify multiple Idyl E3 endpoints per entity type when defining the routes.

EntityDB – EntityDB was updated to use entity-model 1.0.8.

Idyl E3 UIMA Annotator

We have published a new project to our GitHub. The new project is a UIMA annotator that uses Idyl E3 for named entity recognition. When added to a UIMA pipeline, the annotator will send the text that is the subject of analysis to Idyl E3. The project is licensed under the Apache Software License, version 2.0.

The Idyl E3 UIMA annotator requires Idyl E3 1.13.0 which will be available very soon.

S3 Uploads with Pre-signed URLs

In case you didn’t know we also do consulting services for Amazon Web Services users. (One team member has achieved all 5 AWS certifications!) One thing clients commonly want to do is allow users to upload images (and other files but typically images) directly to S3. By uploading directly to S3, you don’t have to make a custom webservice to handle the upload and then send the file to S3. Direct uploads by the client to S3 are much more efficient.

An easy way to accomplish direct client uploads to S3 is through the use of pre-signed URLs. Originally, pre-signed URLs were available just for downloading files from S3 but they can be used for uploads, too. (A pre-signed URL for a download is a time-expiring link that allows anyone with that URL to download the file. They are very useful.) For uploading, a pre-signed URL allows anyone with the URL to upload a file to your S3 bucket. You define the target location in S3 and set some options and the URL is generated. You can then use that URL in an HTML upload form and the user’s file will upload directly to S3.

Here’s some sample code from the AWS documentation that shows how to generate a pre-signed URL for uploading:

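import java.net.URL;

import com.amazonaws.HttpMethod;
import com.amazonaws.auth.profile.ProfileCredentialsProvider;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.GeneratePresignedUrlRequest;

// bucketName and objectKey identify the upload target and are defined elsewhere.
AmazonS3 s3Client = new AmazonS3Client(new ProfileCredentialsProvider());

java.util.Date expiration = new java.util.Date();
long msec = expiration.getTime();
msec += 1000 * 60 * 60; // Add 1 hour.
expiration.setTime(msec);

GeneratePresignedUrlRequest generatePresignedUrlRequest = new GeneratePresignedUrlRequest(bucketName, objectKey);
generatePresignedUrlRequest.setMethod(HttpMethod.PUT);
generatePresignedUrlRequest.setExpiration(expiration);

URL s = s3Client.generatePresignedUrl(generatePresignedUrlRequest);

// Use the pre-signed URL to upload an object.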

In the code we set the expiration for the URL and the HTTP method and then we generate the URL. (See the full code example.)
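The last step, using the pre-signed URL to upload an object, needs nothing beyond the Java standard library because the URL already carries the authorization. The following is a minimal sketch with hypothetical class and method names. It sends the bytes with an HTTP PUT and returns the response code, so a 403 (forbidden) is visible to the caller.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper showing an HTTP PUT to a pre-signed URL.
final class PresignedUploadSketch {

    /** Opens a connection configured for a PUT upload. No network traffic happens yet. */
    static HttpURLConnection prepare(URL presignedUrl) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) presignedUrl.openConnection();
        connection.setDoOutput(true);
        connection.setRequestMethod("PUT");
        return connection;
    }

    /** Uploads the content and returns the HTTP status code (200 on success, 403 if forbidden). */
    static int upload(URL presignedUrl, byte[] content) throws IOException {
        HttpURLConnection connection = prepare(presignedUrl);
        try (OutputStream out = connection.getOutputStream()) {
            out.write(content);
        }
        return connection.getResponseCode();
    }
}
```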

And that’s where the holdup comes. There’s a slight gotcha that is very often missed by our clients. They encounter the problem when a user tries to upload a file using the pre-signed URL and receives a forbidden error. Here’s the important point:

The user who generates the pre-signed URL must have permissions to upload a file to S3. 

This piece of information is in the AWS documentation but it is missed a lot. Here’s the text:

A pre-signed URL gives you access to the object identified in the URL, provided that the creator of the pre-signed URL has permissions to access that object. That is, if you receive a pre-signed URL to upload an object, you can upload the object only if the creator of the pre-signed URL has the necessary permissions to upload that object.

Hopefully this will save you a few minutes of head bashing in case you run into this the next time you’re implementing S3 uploads with pre-signed URLs.

Idyl Talk – New Open Source Project

We have pushed a new open source project to our GitHub called Idyl Talk. The goal of Idyl Talk is to replace traditional interface-defined software communication with natural language text.

When software communicates with other software, either internally or with external software, the communication is defined by interfaces. These interfaces tell each side how to communicate. Interfaces are an essential piece of good design. But what happens when two components have to communicate, and for whatever reasons, it is difficult (or impossible) to define the interface? Idyl Talk addresses this problem by letting software components communicate using natural language English text.

Imagine your refrigerator talking to your smartphone app to update your shopping list. The communication might look a bit like this:

{
    "inventory": {
        "milk": "low",
        "eggs": 12
    }
}

Your smartphone receives the message and an app notifies you that you need milk. For this to be possible the developers of the refrigerator and the smartphone app have to agree on some interface that dictates the communication between the devices. This requires collaboration, and of course, time and money.

Now, imagine that when you are running low on milk your refrigerator sends the following message to your smartphone app:

You are low on milk.

The agreed-to interface here is the English language. With Idyl Talk you can now create devices that are able to communicate with other devices even if those devices do not exist yet! The app processes the received message and alerts you that you are low on milk.
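As a toy illustration of the receiving side, the sketch below pulls the item out of a message shaped like the one above using a single regular expression. This is not how Idyl Talk works and the names are hypothetical. A real implementation would need actual natural language processing to cope with the many ways the same thing can be said.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy example only; not the Idyl Talk implementation.
final class LowStockMessageSketch {

    private static final Pattern LOW_ON =
            Pattern.compile("you are low on (\\w+)\\.?", Pattern.CASE_INSENSITIVE);

    /** Returns the item named in a "You are low on X." message, if the message has that shape. */
    static Optional<String> lowItem(String message) {
        Matcher matcher = LOW_ON.matcher(message.trim());
        return matcher.matches()
                ? Optional.of(matcher.group(1).toLowerCase())
                : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(lowItem("You are low on milk."));
    }
}
```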

Sound interesting? We think so! We welcome your contributions to the project as it matures and grows. Check out Idyl Talk on GitHub.

See a listing of all our open source projects.

AWS CloudFormation Supports YAML

In an exciting update from AWS, it was announced that CloudFormation now supports YAML in addition to JSON. I think most of us will agree this is great. The JSON templates worked, but whew, were they hard to read and the lack of the ability to add comments sometimes made my templates look more like sudokus or word searches than anything else.

They also announced the support for cross-stack references. That means no more duplicating resources between templates! There’s a small gotcha with cross-stack references in that the names of the exported values have to be unique in your account and have to be literal string values.

These new features are significant enough that I felt they deserved a mention on this blog. They will definitely have an immediate impact on how we create CloudFormation for ourselves and our clients.

EntityDB is Open Source

EntityDB is now open source on GitHub. It is licensed under the AGPLv3. The goal of EntityDB is to provide an integration solution for storing, managing, and querying entities (persons, places, and things). Everyone is welcome to contribute to its development and future as we work toward a first release.

EntityDB provides a choice of underlying databases. MySQL, MongoDB, Cassandra, and DynamoDB are currently supported. The Entity Query Language (EQL) is also included in the open sourced code. EQL provides an abstraction layer for querying the entities regardless of the underlying database.

Proprietary licenses are available for situations where the AGPLv3 is not suitable. Please contact us for more information.

OpenNLP’s RegexNameFinder and Tokenizing

OpenNLP’s RegexNameFinder takes one or more regular expressions and uses those expressions to extract entities from the input text. This is very useful for instances in which you want to extract things that follow a set format, like phone numbers and email addresses. However, when tokenizing the input to the RegexNameFinder be careful because it can affect the RegexNameFinder’s accuracy.

The RegexNameFinder is very simple to use and here’s an example borrowed from an OpenNLP testcase.

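import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

import opennlp.tools.namefind.RegexNameFinder;
import opennlp.tools.util.Span;

// The input sentence, already split into tokens.
Pattern testPattern = Pattern.compile("test");
String[] sentence = new String[]{"a", "test", "b", "c"};

// Map each entity type to the patterns that identify it.
Pattern[] patterns = new Pattern[]{testPattern};
Map<String, Pattern[]> regexMap = new HashMap<>();
String type = "testtype";

regexMap.put(type, patterns);

RegexNameFinder finder = new RegexNameFinder(regexMap);

// Each returned span holds the start and end token indexes of a match.
Span[] result = finder.find(sentence);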

The sentence variable is an array of tokens. In the above example the tokens are set manually. In a more likely scenario the string would be received as “a test b c” and it would be up to the application to tokenize the string into {“a”, “test”, “b”, “c”}.

There are three types of tokenizers available in OpenNLP: the WhitespaceTokenizer, the SimpleTokenizer, and a tokenizer (TokenizerME) that uses a token model you have trained. The WhitespaceTokenizer works on, you guessed it, white space. The locations of white space in the string are used to tokenize the string. The SimpleTokenizer looks at character classes, such as letters and numbers.

Let’s take the example string “My email address is me@me.com and I like Gmail.” Using the WhitespaceTokenizer the tokens are {“My”, “email”, “address”, “is”, “me@me.com”, “and”, “I”, “like”, “Gmail.”}. If we use the RegexNameFinder with a regular expression that matches an email address, OpenNLP will return to us the span covering “me@me.com”. Works great!

However, let’s consider the sentence “My email address is me@me.com.” Using the WhitespaceTokenizer again the tokens are {“My”, “email”, “address”, “is”, “me@me.com.”}. Notice the last token includes the sentence’s period. Our regular expression for an email address will not match “me@me.com.” because it is not a valid email address. Using the SimpleTokenizer doesn’t give any better results.

How to work around this is up to you. You could make a custom tokenizer by implementing the Tokenizer interface, try using a token model, or massage your text before it is passed to the tokenizer.
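To make the problem and the "massage your text" option concrete, here is a minimal sketch. It does not use OpenNLP at all: it assumes String.split on white space as a stand-in for the WhitespaceTokenizer, approximates the name finder by requiring the regular expression to match a whole token, and uses a deliberately simple email pattern. The class and method names (EmailTokenSketch, stripTrailingPunctuation, findEmails) are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class EmailTokenSketch {

    // Deliberately simple email pattern, for illustration only.
    static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)+");

    // Stand-in for whitespace tokenization: split on runs of white space.
    static String[] whitespaceTokens(String text) {
        return text.trim().split("\\s+");
    }

    // Hypothetical clean-up step: drop sentence punctuation from the end of a token.
    static String stripTrailingPunctuation(String token) {
        return token.replaceAll("[.,;:!?]+$", "");
    }

    // Collect the tokens that match the email pattern in full.
    static List<String> findEmails(String text, boolean cleanTokens) {
        List<String> emails = new ArrayList<>();
        for (String token : whitespaceTokens(text)) {
            String candidate = cleanTokens ? stripTrailingPunctuation(token) : token;
            if (EMAIL.matcher(candidate).matches()) {
                emails.add(candidate);
            }
        }
        return emails;
    }

    public static void main(String[] args) {
        String sentence = "My email address is me@me.com.";
        System.out.println(findEmails(sentence, false)); // prints []
        System.out.println(findEmails(sentence, true));  // prints [me@me.com]
    }
}
```

Stripping punctuation this way is a blunt instrument. It would also alter tokens such as abbreviations that legitimately end in a period, so whether it is acceptable depends on your text.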

Idyl E3 1.12.0

Look for Idyl E3 1.12.0 to be available on various cloud marketplaces this week. Version 1.12.0 starts the separation from the entity stores we announced in our last post. It also contains some minor fixes and improvements. (See the Idyl E3 Release Notes.)

There will be multiple versions of Idyl E3 1.12.0 available. The versions will differ based on which entity models are included. One version will not have any entity models, making it ideal for scenarios in which you want to use your own generated entity models. As a reminder, you can create entity models from your own data for use with Idyl E3. Using your own data to generate models will result in models that perform better than our models for your type of data.

Idyl E3’s entity store and EntityDB

Along with the ability to extract entities from text, Idyl E3’s entity store feature allows you to save the extracted entities to a database of your choice. Supported databases include a relational database like MySQL and the NoSQL databases MongoDB and DynamoDB. In addition to saving the entities to a database you can also query the entities using a special language called Entity Query Language (EQL). EQL has a SQL-like syntax that lets you select entities based on conditions in the query. Your EQL query is translated into a native query for your selected database. A single EQL query can be executed against MySQL, MongoDB, and DynamoDB.

The entity store feature of Idyl E3 is being separated from Idyl E3 into its own product called EntityDB. This separation will allow Idyl E3 to focus on entity extraction. Idyl E3 will integrate with EntityDB’s public API to still provide entity storage services.

EntityDB will continue to support the same databases as well as a new database – Apache Cassandra. Cassandra is ideally suited for storing entities and will allow for large-scale querying and analysis. The Cassandra-based entity store will support EQL queries but you will also have the ability to query it using other tools like SparkSQL.

Look for the first version of EntityDB to be available in the near future. We have a large roadmap for EntityDB and plan to add features incrementally over a series of releases.

Mountain Fog, Inc. Listed in AWS Marketplace for the U.S. Intelligence Community

Idyl E3 Entity Extraction Engine now available to 17 US Intelligence Agencies in a Cloud Marketplace

June 20 – Morgantown, WV – Mountain Fog, Inc., a leading provider of natural language processing software for commercial and law enforcement users, today announced it is among the first group of technology vendors to be listed in Amazon Web Services (AWS) Marketplace for the U.S. Intelligence Community (IC). AWS Marketplace for the U.S. IC is designed exclusively for the 17 intelligence agencies to evaluate, purchase, and deploy in minutes via 1-Click® a broad array of common software infrastructure, developer tools, and business software products, with the categories of products and vendors growing over time.

Mountain Fog’s product, Idyl E3 Entity Extraction Engine, analyzes multilingual natural language text and identifies persons, places, and things within the text. Its integrated rules engine and entity persistence capabilities provide a complete solution for processing unstructured text. “Idyl E3 can help government agencies manage unstructured text and turn it into usable information. We are pleased to offer Idyl E3 on the AWS Marketplace for the U.S. IC in order to give more agencies immediate access to its capabilities,” said Mountain Fog president Jeff Zemerick.

AWS Marketplace for the U.S. IC provides the same purchasing convenience, open and transparent license terms and conditions, and variety of pricing models, including hourly usage and annual subscription, as the commercial AWS Marketplace. It also supports Bring-Your-Own-License (BYOL) so that agencies can more easily migrate existing software licenses and applications. AWS Marketplace for the U.S. IC is part of the Commercial Cloud Services (C2S) program, under the Director of National Intelligence (DNI) Intelligence Community (IC) Information Technology Enterprise (IC ITE). For more information on AWS Marketplace for the U.S. IC, contact icmp@amazon.com.

###

About Mountain Fog, Inc.

Mountain Fog was founded in 2011 to develop innovative language processing solutions. Our team of developers and engineers specializes in big-data analysis, natural language processing, and cloud systems. Our philosophy is simple: provide the best products we can and back them up with unparalleled customer support. (412) 206-1079 | sales@mtnfog.com

User-created entity models

Idyl E3 can extract many types of entities, such as buildings, cities, and more. For instances where we do not offer the type of entity you need, we will soon be offering a tool to create your own entity models. You will be able to create entity models from your own text, giving you models customized for your own use cases. This ability is expected to be available in an upcoming release, so stay tuned for more details.

Idyl E3 is Now Free

Idyl E3 is now available for free. And to make it even better, Idyl E3 will soon also be available for Windows Azure, VMware ESXi, and as a standalone download. Idyl E3 will continue to be available through the AWS Marketplace. Deploy Idyl E3 to the platform of your choice.

In addition, the entity models used by Idyl E3 for entity extraction are now configurable and additional models are available for download. Idyl E3 comes with a fully functional base model for extracting person entities. Models capable of extracting more entities with higher confidence are available.

We will still provide support for Idyl E3. Priority support is available as well as development and integration support to help you get Idyl E3 integrated into your systems.

We are very excited about the expanded availability of Idyl E3. If you need help getting started with Idyl E3 or have any questions please contact us.

New Website

In case you haven’t noticed we have rolled out our website update. The goal of this update was to improve usability. We felt that our previous website was at times hard to navigate with too much text. In the new website we are going for simplicity, especially on the product pages.

The new website also features a new and improved My Account. Soon you will be able to access your downloads and purchase history from your account. Single sign-on has been improved as well.

We have also migrated our blog to the new website. No longer do you have to leave our main site to check out our newest blog posts.

So, please bear with us over the next few days as we iron out any issues and missed 404 errors.

Sample Verse Sentiment Definition

Verse analyzes the sentiment of text by using Sentiment Definition files. A sentiment definition file is simply a file that defines the sentiment you want to identify. Here’s a simple example of a sentiment definition for a violent sentiment:

sentiment=violent
fuzzy=true

wound   1
hurt    1
fight   1
murder  3
destroy 2

That’s it. There are a couple of settings followed by words and their corresponding weight values. Let’s walk through the sentiment definition.

The first line, sentiment=violent, sets the name of the sentiment. If the sentiment definition is for “happy” then you would set sentiment=happy.

The second line, fuzzy=true, enables fuzzy-matching for this sentiment definition. The words listed at the bottom of the sentiment definition are looked for when analyzing text. Fuzzy-matching allows matches in cases of misspellings. For instance, when fuzzy matching is enabled “detroy” will match to “destroy.”

Fuzzy matching is not a global Verse setting – instead it is enabled on a per-sentiment definition basis. In this example we have enabled fuzzy matching. To disable fuzzy matching either remove the line completely or change it to false.
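As a rough illustration of what fuzzy matching means here, the sketch below treats two words as a match when their Levenshtein edit distance is at most one. This is an assumption made for the example: the algorithm and threshold Verse actually uses are not described in this post, and the class and method names are hypothetical.

```java
class FuzzyMatchSketch {

    // Classic Levenshtein edit distance using two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Exact match always counts; a near match counts only when fuzzy is enabled.
    // The threshold of one edit is an assumption for this sketch.
    static boolean matches(String word, String target, boolean fuzzy) {
        if (word.equalsIgnoreCase(target)) {
            return true;
        }
        return fuzzy && editDistance(word.toLowerCase(), target.toLowerCase()) <= 1;
    }

    public static void main(String[] args) {
        System.out.println(matches("detroy", "destroy", true));  // prints true
        System.out.println(matches("detroy", "destroy", false)); // prints false
    }
}
```

A small threshold keeps false matches down, but even one edit can join unrelated words (for example "fight" and "light"), which is one reason it makes sense for fuzzy matching to be a per-definition choice.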

Next is the list of words associated with the sentiment. Each word is associated with an integer value that is that word’s weight. The word and the weight are separated by a tab. The weight is the importance of the word to the sentiment relative to the other words in the list. For example, the word murder is stronger than the word wound. Input text containing variations of the word murder will have a stronger sentiment than input text containing variations of the word wound. The weights that you give the words are completely up to you and allow for tailoring the sentiment definition exactly to your needs.

Negative weights are permitted. A word in a sentiment definition with a negative weight is essentially opposite to the sentiment. Use negative weights cautiously.
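The sketch below shows one way the file format described above could be read and how weights, including a negative one, might combine. It is an illustration under stated assumptions, not Verse’s implementation: the scoring rule (summing the weights of exact word matches) is hypothetical, fuzzy matching and word variations are ignored, the class name is made up, and the "peace" entry is added only to demonstrate a negative weight.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SentimentDefinitionSketch {

    String name = "";
    boolean fuzzy = false;
    final Map<String, Integer> weights = new LinkedHashMap<>();

    // Parse the settings lines and the word/weight lines of a definition.
    static SentimentDefinitionSketch parse(String definition) {
        SentimentDefinitionSketch def = new SentimentDefinitionSketch();
        for (String rawLine : definition.split("\\R")) {
            String line = rawLine.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith("sentiment=")) {
                def.name = line.substring("sentiment=".length());
            } else if (line.startsWith("fuzzy=")) {
                def.fuzzy = Boolean.parseBoolean(line.substring("fuzzy=".length()));
            } else {
                // Word and weight are separated by a tab (any white space is accepted here).
                String[] parts = line.split("\\s+");
                if (parts.length == 2) {
                    def.weights.put(parts[0].toLowerCase(), Integer.parseInt(parts[1]));
                }
            }
        }
        return def;
    }

    // Hypothetical scoring rule: sum the weights of the listed words found in the text.
    int score(String text) {
        int total = 0;
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            total += weights.getOrDefault(token, 0);
        }
        return total;
    }

    public static void main(String[] args) {
        String definition = "sentiment=violent\nfuzzy=true\n\n"
            + "wound\t1\nhurt\t1\nfight\t1\nmurder\t3\ndestroy\t2\npeace\t-2\n";
        SentimentDefinitionSketch def = parse(definition);
        System.out.println(def.score("They fight and murder."));     // prints 4
        System.out.println(def.score("We want peace, not a fight")); // prints -1
    }
}
```

The second call shows why negative weights call for caution: a single negatively weighted word can cancel out or outweigh genuine matches elsewhere in the text.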

And that’s sentiment definitions! They are very simple but powerful. You can upload sentiment definitions through the Verse dashboard.

If you have any questions or would like assistance please get in touch!

Idyl E3 1.6.0 Available on the AWS Marketplace

Idyl E3 1.6.0 is now available on the AWS Marketplace. This release brings some good fixes and exciting new features.

Here’s some of what’s new in 1.6.0:

  • Access to the API can now require authentication.
  • Restarts are no longer required when changing settings.
  • New dashboard UI styling is cleaner and easier to navigate.
  • Entity filters are now customizable through the dashboard settings.
  • Added SQS visibility timeout setting.
  • Added SNS message subject setting.
  • Added CloudWatch metric name setting.
  • Passwords and AWS keys are encrypted in the settings.

And some things that were fixed or improved:

  • Added cURL upload example.
  • Fixed document upload when upload parameters are missing.
  • The entity store setting is loaded and shown correctly after saving.
  • Documentation updates.

This is by far the most stable and feature-rich version yet. But with that said, we have already started on version 1.7.0 to offer even more. If you have any feedback or feature requests please let us know!

With the release of Idyl E3 1.6.0 we have also updated the client SDKs. You can find them on GitHub or through Maven Central and NuGet.

Announcing Verse Sentiment Analysis Engine

We would like to announce a new product! Verse Sentiment Analysis Engine 1.0.0 is now available on the AWS Marketplace.

Verse analyzes input text and determines the sentiment of the text. With Verse you can determine if the sentiment of text is positive, negative, violent, happy, or another emotion. Verse works by employing “sentiment definitions” that you create. Each sentiment definition allows Verse to analyze text for that sentiment. (Look for a walkthrough of a sample sentiment definition file in an upcoming post.)

Verse supports fuzzy-matching to work around misspelled words. You can integrate directly with Verse’s API with our Verse client SDKs on GitHub or you can create your own integration.

Verse is low-cost and pricing per EC2 instance is constant, meaning you pay the same for Verse no matter what size EC2 instance you choose to use. Try Verse out on a t2.small instance and move to a c4.xlarge instance without any increase in the Verse software fee.

Learn more about Verse on its product page or contact us to schedule a short demo of Verse’s capabilities or if you have any questions!

We are very interested in improving Verse through the next versions so if you can share with us how you use Verse and tell us about your experience using it that would be fantastic!

Idyl E3 1.5.3 now available

Idyl E3 1.5.3 is now available on the AWS Marketplace. Version 1.5.3 adds support for extracting the following entity types in addition to person and place entities:

  • Email addresses
  • Twitter usernames
  • Hashtags
  • US and international phone numbers

Version 1.5.3 also adds MongoDB as a supported entity store and adds an API endpoint for querying entities through the Entity Query Language (EQL).

See the Idyl E3 Release Notes page for the full history. We are very excited about this release! If you have any questions or comments please get in touch at idyl@mtnfog.com.

Idyl Cloud and Entity Extraction

A very large number of our users use Idyl E3 for entity extraction since it can be used in a local network instead of Idyl Cloud. (There are a few reasons for this, but two big ones that we hear often are the sensitive nature of the users’ text and performance.) Because of this we are removing entity extraction from Idyl Cloud so we can devote ourselves fully to its development in Idyl E3. One feature on the Idyl E3 roadmap is support for custom entity models, and this is not a feature that’s readily accommodated by Idyl Cloud.

The Idyl Cloud SDKs will be updated to reflect this change.

Idyl E3 1.4 Now Available

Idyl E3 1.4 is now available on the AWS Marketplace. You can see the full Release Notes but here’s a summary of what’s new in 1.4.

If you have any questions please get in touch. Helpdesk tickets can now be created directly from our website and you can always reach us directly for more product information or general questions.

Idyl Cloud Integration

Idyl E3 1.4 is integrated with Idyl Cloud for entity disambiguation and enrichment. If enabled, all entities extracted by Idyl E3 will be sent to Idyl Cloud for disambiguation and enrichment. To enable this feature provide your Idyl Cloud API key in Idyl E3’s settings. Requests made to Idyl Cloud via Idyl E3 will be billed at the rate defined by your Idyl Cloud subscription plan.

Entity Store

New in 1.4 is the Entity Store feature. The Entity Store is a database that stores extracted entities and enrichments. When an entity extraction request is received, Idyl E3 extracts the entities and then persists the entities to the Entity Store. The Entity Store can be any JDBC database, such as MySQL, Oracle, or SQL Server.

API Changes

There is a new query API for performing queries against the entity store. With the query API you can find entities by text, context, and confidence. Since the Entity Store is an RDBMS you can always write more complex queries against it directly.

Two additional optional parameters have been added for the extraction API. The documentId parameter lets you categorize your text by documents. (So now documents can be categorized by context and by document ID.) The value of documentId can be any value that identifies your text.

The second new parameter is refTag. This parameter, also optional, lets you associate a value with the extraction request. This value can be anything and is only for your reference.

SDKs

The Java SDK for Idyl E3 has been updated on GitHub to support Idyl E3 1.4. We will be updating the .NET SDK for Idyl E3 shortly.

Upgrading to 1.4

When running on AWS you can upgrade to 1.4 by replacing any existing Idyl E3 instances with instances running 1.4. Any existing clients for 1.3 will work for 1.4 but will not have the entity querying capabilities.

What’s coming?

We have some more exciting features lined up. Coming soon will be the ability to use DynamoDB as an Entity Store and improved settings management.

Idyl Cloud SDKs

Idyl Cloud 1.1.0 SDKs for Java and .NET are now available. The .NET SDK is available through NuGet and the Java SDK is available through Maven Central using the dependency:

<dependency>
   <groupId>com.mtnfog.idyl.cloud.sdk</groupId>
   <artifactId>idyl-cloud-java-sdk</artifactId>
   <version>1.1.0</version>
</dependency>

These versions add support for entity disambiguation and enrichment. Both SDKs support consuming Idyl Cloud through Mashape.

Both SDKs are available on GitHub and are licensed under the Apache 2.0 license.

Almost half of WV geotagged tweets are sent from Morgantown and Huntington

Mountain Fog is a West Virginia company, and as such we take an interest in the social media use of West Virginians. From June 9, 2015, to June 19, 2015, we sampled tweets and divided them into two categories – tweets that were sent from West Virginia and tweets that were sent from the other 49 states. Our goal was to survey the tweets between the two categories for similarities and differences.

We captured approximately 209,000 tweets. About 800 of those, or about 0.40%, originated in West Virginia. (It is interesting to note that WV’s population represents 0.58% of the United States’ population according to the 2014 census.)

Tweets by City

Almost half (45.7%) of all WV geotagged tweets were sent from Morgantown and Huntington. Charleston, WV’s largest city by population, came in fourth behind Parkersburg. Perhaps the younger, student populations of Morgantown and Huntington helped contribute to the rank of each city since the cities are not ordered by population, but that’s just a hypothesis. Other areas of WV represented to a lesser degree are Wheeling and Weirton in the northern panhandle and Martinsburg in the eastern panhandle. Fewer tweets were sent from the Fairmont/Clarksburg and Beckley areas. (The West Virginia tweets that were not geotagged with a city were not considered.)

[Figure: Tweets by West Virginia city]

[Figure: Heat map of tweets by West Virginia city]

Sentiment of Tweets

Next, we looked at the sentiment of WV tweets compared to non-WV tweets. We used Idyl’s sentiment analyzer. (In case you are not familiar, Idyl is our product for performing text analysis.) We found WV tweets to be more positive than tweets from the rest of the country. About 37% of WV tweets were found to have a positive sentiment compared to about 32% of the tweets from the rest of the country. WV tweets were also slightly less negative (20.8% compared to 21.07%). The sentiment analysis algorithm determines whether the sentiment of a tweet is positive, negative, or neutral based on the text of the tweet. For example, the tweet “This place is great” has a positive sentiment while “This place is terrible” has a negative sentiment.

Sentiment   Count of WV Tweets   Count of Non-WV Tweets
Negative    172 (20.80%)         46,438 (21.07%)
Neutral     347 (41.96%)         104,308 (47.34%)
Positive    308 (37.24%)         69,604 (31.59%)

Tweet Content

As for the content of the tweets, they were all over the board. There were tweets about the NBA finals, school being out, and random conversations. Perhaps a larger sample size would expose more specific topics.

Thanks for reading and stay tuned for further updates.

Idyl Extraction Engine Java SDK on Maven Central

The Idyl Extraction Engine Java SDK is now available in the Maven Central repository:

<dependency>
   <groupId>com.mtnfog.idyl.ami.sdk</groupId>
   <artifactId>idyl-ami-java-sdk</artifactId>
   <version>1.0.0</version>
</dependency>

The SDK is licensed under the Apache Software License, version 2.0 and the source code is available on Bitbucket.

The SDK provides an IdylAmiClient that has functions for submitting text for entity extraction and interacting with the optionally integrated services. An example invocation of entity extraction using the SDK is:

Idyl Extraction Engine .NET SDK Available through NuGet

http://www.nuget.org/packages/MountainFog.IdylAMI.SDK/1.0.0

The Idyl Extraction Engine .NET SDK is now available through NuGet. Similar to the Java SDK, the .NET SDK for the Idyl AMI provides the ability to submit text to the Idyl AMI entity extraction engine and parse the returned entities. The Idyl AMI .NET SDK is licensed under the Apache Software License, version 2.0.

Idyl AMI SDK for .NET on NuGet

Use the SDK for easy integration of Idyl’s entity extraction capabilities into your .NET applications. The source code of the SDK is available on Bitbucket. We welcome any feedback on the SDK.

Announcing the Idyl Extraction Engine on the AWS Marketplace

We are very excited to announce that Idyl Extraction Engine is now available through the AWS Marketplace. Now you can have person entity extraction capabilities inside your own cloud with no request limits, no contracts, zero initial investment, and the first 7 days are free.

The Idyl AMI for Person Entities is a turn-key person entity extraction solution. Through a simple webservice (REST) interface, Idyl AMI’s extraction capabilities can be integrated into your text processing systems and solutions.

Idyl AMI includes support for integrating with other AWS services:

  • DynamoDB integration allows for storing your extracted entities.
  • Automatically put your extracted entities onto an SQS queue for later processing.
  • Trigger SNS notifications when entities are extracted.
  • Submit extraction metrics to CloudWatch to monitor extraction times.

These integrations are all optional and can be used in combination with each other.

Launch the Idyl AMI for Person Entities in your cloud today from the AWS Marketplace.