Idyl E3’s entity model training tool expects entities in training text to be annotated in the format used by OpenNLP. This format uses START and END tags to denote entities:
<START:person> George Washington <END> was president.
This works well, but it has a drawback: the annotations and the text have to be combined in a single file. Once the text is annotated, it becomes difficult to reuse the training text for any other purpose.
New Annotation Format
Idyl E3 2.4.0 will introduce an additional method of annotating text that allows the annotations to be stored separately from the training text. In 2.4.0, annotations can be stored in a separate file (and we plan to eventually support storing them in a database). Even though Idyl E3 2.4.0 is not yet ready for prime time, we wanted to introduce this feature early in case you are in the middle of an annotation effort and want to use the new format.
The input text must still contain a single sentence per line. Use blank lines to indicate document boundaries. Here’s an example of a simple input training file:
George Washington was president.
He was president of the United States.
George Washington was married to Martha Washington.
In 1755, Washington became the senior American aide to British General Edward Braddock on the ill-fated Braddock expedition.
And here are the annotations, stored in a separate file:
1 0 2 person
2 "United States" place
3 0 2 person
3 5 7 person
4 "Edward Braddock" person
Here’s what this means. Each line in the annotations file represents an annotation in the training text. So there are 5 annotations in this example.
For the lines having 3 columns:
- The first column is the line number that contains the entity.
- The second column is the text of the entity in double quotes.
- The third column is the type of the entity.
For the lines with 4 columns:
- The first column is the line number that contains the entity. In this example there is at least one annotation in each of the 4 lines.
- The second column is the token index of the start of the entity. Indexes are zero-based so the first token is zero!
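- The third column is the token index of the end of the entity. As the example shows, the end index is exclusive: in the annotation 1 0 2 person, the entity covers tokens 0 and 1 (George Washington).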
- The last column is the type of the entity.
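To make the two layouts concrete, here is a minimal Python sketch that reads one row of the annotations file into a small record. It is not part of Idyl E3. The names Annotation and parse_annotation are hypothetical, and the sketch assumes that columns are separated by whitespace and that only the entity text is wrapped in double quotes, as in the example above.

```python
import shlex
from typing import NamedTuple, Optional


class Annotation(NamedTuple):
    """One row of the annotations file (hypothetical helper type)."""
    line: int                     # 1-based line number in the training text
    entity_type: str              # e.g. "person" or "place"
    start: Optional[int] = None   # token start index (4-column rows only)
    end: Optional[int] = None     # token end index (4-column rows only)
    text: Optional[str] = None    # entity text (3-column rows only)


def parse_annotation(row: str) -> Annotation:
    # Assumption: whitespace-separated columns. shlex.split keeps a
    # double-quoted phrase such as "United States" together as one field.
    fields = shlex.split(row)
    if len(fields) == 3:
        line, text, entity_type = fields
        return Annotation(int(line), entity_type, text=text)
    if len(fields) == 4:
        line, start, end, entity_type = fields
        return Annotation(int(line), entity_type, start=int(start), end=int(end))
    raise ValueError(f"expected 3 or 4 columns, got {len(fields)}: {row!r}")


if __name__ == "__main__":
    for row in ['1 0 2 person', '2 "United States" place']:
        print(parse_annotation(row))
```

Keeping both layouts in one record type means later steps only need to check whether text is set to know which form a row used.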
Note that there are two entities in the third line, and each gets its own line in the annotations file. Specifying the entity text in the three-column format simplifies annotation by removing the need to specify the entity’s start and end token positions. However, it annotates only the first occurrence of the entity text on that line. (If Edward Braddock had occurred more than once on line 4 of the input text, only the first occurrence would be annotated.)
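If you want to check your annotations by eye, one option is to merge them back into the OpenNLP-style tagged text. The Python sketch below does that under several assumptions that are ours, not Idyl E3's: the tokenizer is a simple regular expression that splits words from punctuation (Idyl E3's own tokenization may differ), the end index is treated as exclusive as the example suggests, and the function names are hypothetical. Annotations are passed as plain tuples in the same column order as the file.

```python
import re


def tokenize(sentence):
    # Assumption: words and punctuation marks are separate tokens.
    return re.findall(r"\w+|[^\w\s]", sentence)


def find_first(tokens, phrase_tokens):
    """Return (start, end) of the first occurrence of phrase_tokens, or None."""
    n = len(phrase_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase_tokens:
            return i, i + n
    return None


def to_opennlp(lines, annotations):
    """Merge annotations into OpenNLP-style tagged sentences.

    annotations holds (line, start, end, type) or (line, text, type) tuples.
    """
    spans = {}  # line number -> list of (start, end, type)
    for ann in annotations:
        line_no = ann[0]
        tokens = tokenize(lines[line_no - 1])
        if len(ann) == 3:
            # Three-column form: only the first occurrence is annotated.
            found = find_first(tokens, tokenize(ann[1]))
            if found is None:
                raise ValueError(f"{ann[1]!r} not found on line {line_no}")
            start, end = found
        else:
            start, end = ann[1], ann[2]
        spans.setdefault(line_no, []).append((start, end, ann[-1]))

    tagged = []
    for line_no, sentence in enumerate(lines, start=1):
        tokens = tokenize(sentence)
        # Insert tags from right to left so earlier indexes stay valid.
        for start, end, etype in sorted(spans.get(line_no, []), reverse=True):
            tokens[end:end] = ["<END>"]
            tokens[start:start] = [f"<START:{etype}>"]
        tagged.append(" ".join(tokens))
    return tagged


if __name__ == "__main__":
    text = [
        "George Washington was president.",
        "He was president of the United States.",
    ]
    rows = [(1, 0, 2, "person"), (2, "United States", "place")]
    for line in to_opennlp(text, rows):
        print(line)
```

With the two sentences above, the first output line is `<START:person> George Washington <END> was president .` because the sketch's tokenizer separates the final period from the preceding word.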
Now your annotations can be kept separate from your training text, allowing you to use the training text for other purposes. We also hope that this new annotation method reduces the time required for annotating and makes the process easier to automate. Currently, the only supported means of storing the annotations is a separate file, but we plan to extend this to support databases in a future release of Idyl E3.
The Entity Model Generator tool included in Idyl E3 has been updated to support this new annotation format. You can, however, continue to use the OpenNLP-style annotations when creating entity models. The new annotation format is only available for entity models. Sentence, token, part-of-speech, and lemma model annotations will remain unchanged in 2.4.0.