Distribution of Entity Confidence Values in a Sample Data Set

In a previous post, Tuning the Confidence Threshold Parameter, we described how the confidence threshold parameter can be used to control the strictness of entity extraction. Here we would like to give a little more insight into that parameter.

We recently extracted entities from more than 500,000 documents with Idyl. These documents were mostly news and news-like articles. (We say "news-like" because some did not follow the traditional format of a news article.) During the extraction we tracked the confidence value of each entity. When the processing was complete, we randomly selected 10,000 of the entities and produced the histogram of the confidence values shown below. (The Y-axis is the number of entities having the confidence value on the X-axis.)
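If you would like to build a similar histogram from your own extraction results, the sampling and counting steps might look roughly like the sketch below. It is not the code we ran. The function names, the bin width of 10, and the confidence values are all made up for illustration, and it assumes confidence values on a 0 to 100 scale.

```python
import random
from collections import Counter

def sample_confidences(confidences, sample_size, seed=None):
    """Randomly select up to sample_size values without replacement."""
    rng = random.Random(seed)
    k = min(sample_size, len(confidences))
    return rng.sample(confidences, k)

def histogram(values, bin_width=10):
    """Count values per bin; each key is the lower edge of a bin."""
    counts = Counter((int(v) // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Made-up confidence values standing in for real extraction output.
confidences = [12, 48, 55, 61, 67, 72, 79, 81, 85, 88, 90, 93, 97, 99]

sample = sample_confidences(confidences, sample_size=10, seed=42)

# Print a simple text histogram, one row per bin.
for lower, count in histogram(sample).items():
    print(f"{lower:3d}-{lower + 9:3d} | {'#' * count}")
```

With a real data set you would pass in every tracked confidence value and a sample size of 10,000, then hand the counts to whatever plotting tool you prefer.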

As the histogram shows, nearly all of the entities extracted had a confidence value greater than 50. In our spot checks, none of the entities with a confidence value less than 50 were actual entities, so they could be discarded. (They included things like abbreviations.) Between 60 and 80 the entities were more reliable, with about 75% of them being actual entities. Nearly all entities that were extracted with a confidence value greater than 80 were actual entities. We only spot-checked the extracted entities in this investigation, but in a follow-up post we will provide numbers and percentages.
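A spot check like the one described above can be summarized by counting, for each confidence band, the share of checked entities that turned out to be real. The following is only a sketch of that bookkeeping. The helper name, the band boundaries, and the labeled pairs are invented for illustration and are not our measurements.

```python
def precision_by_band(checked, bands):
    """Return the share of real entities per confidence band.

    checked: list of (confidence, is_real_entity) pairs from a manual check.
    bands:   list of (low, high) pairs; a value c is in a band if low <= c < high.
    """
    result = {}
    for low, high in bands:
        labels = [is_real for conf, is_real in checked if low <= conf < high]
        # None marks a band with no checked entities.
        result[(low, high)] = sum(labels) / len(labels) if labels else None
    return result

# Invented spot-check labels, NOT real results.
checked = [
    (30, False), (45, False),
    (62, True), (65, True), (70, False), (78, True),
    (85, True), (95, True),
]

bands = [(0, 50), (60, 80), (80, 101)]
for band, share in precision_by_band(checked, bands).items():
    print(band, share)
```

Keeping the manual labels next to the confidence values makes it easy to try other band boundaries later without repeating the spot check.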

The takeaway from all this is that a confidence threshold of 80 is probably a safe choice. You can, of course, always tweak the value later if you find that you need to.
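To see what a threshold of 80 would do to results you already have, you could filter them yourself, as in the minimal sketch below. This is not Idyl's API. The dictionary layout, the function name, and the sample entities are assumptions made for the example, and the comparison is inclusive (a value of exactly 80 is kept), which may differ from how the parameter itself behaves.

```python
def filter_by_confidence(entities, threshold=80):
    """Keep only entities whose confidence is at or above the threshold."""
    return [e for e in entities if e["confidence"] >= threshold]

# Hypothetical extraction output; the field names are assumptions.
entities = [
    {"text": "George Washington", "confidence": 92},
    {"text": "Inc.", "confidence": 41},
    {"text": "Boston", "confidence": 80},
    {"text": "Acme", "confidence": 67},
]

kept = filter_by_confidence(entities, threshold=80)
print([e["text"] for e in kept])
```

Running the same filter with a few different thresholds on your own data is a quick way to see how many entities each setting would discard.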

Thanks for reading!