A feature generator is arguably the most important part of model-based entity extraction. Feature generators create “features” based on aspects of the input text, and these features are used to determine what is and what is not an entity. Choosing the right (or wrong) features when training your entity models can have a significant impact on the performance of the models, so we want you to have a good selection of feature generators available.
Idyl E3 2.3.0 adds some new feature generators that we’d like to take a minute to describe. All of the available feature generators, and how to apply each one, are described in the Idyl E3 2.3.0 Documentation.
New Feature Generators in Idyl E3 2.3.0
Special Character Feature Generator
This feature generator generates features for tokens that contain special characters. For example, the token Hello would not generate a feature, but the token He*llo would. This feature generator is probably most useful in the domains of science and healthcare, particularly for chemical and drug names.
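To make the idea concrete, here is a minimal Python sketch of a special character feature. It is an illustration only, not Idyl E3's implementation: the function name, the feature string "specchar", and the definition of "special" as any character that is not a letter or digit are all assumptions made for this example.

```python
def special_character_features(token):
    """Return a list of features for a token (empty if nothing special)."""
    # Assumption: a "special" character is anything that is not a letter or digit.
    if any(not ch.isalnum() for ch in token):
        # Hypothetical feature name, chosen for illustration.
        return ["specchar"]
    return []


print(special_character_features("Hello"))   # []
print(special_character_features("He*llo"))  # ['specchar']
```

A real generator might emit different features for different special characters, which could help separate, say, hyphenated drug names from tokens containing brackets.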
Token Part of Speech Feature Generator
This feature generator generates features based on each token’s part of speech. To use this feature generator you must provide a trained part-of-speech model. (Idyl E3 2.3.0 includes a tool for creating part-of-speech models from your text.) This feature generator can help improve entity extraction performance by letting the model also consider each token’s part of speech.
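The Python sketch below shows roughly what a part-of-speech feature looks like. Everything in it is hypothetical: a small lookup table stands in for the trained part-of-speech model, and the "pos=" feature format and the function name are assumptions, not Idyl E3's actual output.

```python
# Hypothetical stand-in for a trained part-of-speech model: a fixed lookup table.
TOY_POS_MODEL = {
    "george": "NNP",
    "washington": "NNP",
    "was": "VBD",
    "president": "NN",
}


def pos_features(tokens, index, model=TOY_POS_MODEL):
    """Return a part-of-speech feature for the token at the given index."""
    # Tokens the toy model does not know are tagged "UNK".
    tag = model.get(tokens[index].lower(), "UNK")
    return ["pos=" + tag]


tokens = ["George", "Washington", "was", "president"]
print([pos_features(tokens, i) for i in range(len(tokens))])
```

A trained model would tag each token using its context in the sentence rather than a fixed table, but the resulting feature has the same shape.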
Word Normalization Feature Generator
This feature generator normalizes tokens by replacing all uppercase characters with A, all lowercase characters with a, and all digits with 0. For example, the token HelloWorld25 would be normalized to AaaaaAaaaa00. This feature generator can optionally lemmatize each token prior to the normalization by applying a lemmatization model. (Idyl E3 2.3.0 includes a tool for creating lemmatization models from your text.) Like the special character feature generator, this feature generator is probably most useful in the domains of science and healthcare, particularly for chemical and drug names.
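The normalization rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Idyl E3's code: the function name is hypothetical, the optional lemmatization step is left out, and passing other characters (such as punctuation) through unchanged is an assumption the description above does not cover.

```python
def normalize_token(token):
    """Map uppercase letters to 'A', lowercase letters to 'a', digits to '0'."""
    out = []
    for ch in token:
        if ch.isupper():
            out.append("A")
        elif ch.islower():
            out.append("a")
        elif ch.isdigit():
            out.append("0")
        else:
            # Assumption: any other character is kept as-is.
            out.append(ch)
    return "".join(out)


print(normalize_token("HelloWorld25"))  # AaaaaAaaaa00
```

Because many different tokens collapse to the same shape, the model can learn patterns such as "capital letter followed by digits" without having seen the exact token before.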