Introduction to Named Entity Recognition (NER)
Named Entity Recognition (NER) is the process of labeling named entities in text. Named entities are real-world objects, such as persons, locations, and organizations, that can be denoted by a proper name.
Here is the Stanford-NER result for the sentence: “Khan Academy is a Mountain View based non-profit educational organization created by educator Salman Khan.”
ORGANIZATION: Khan Academy, LOCATION: Mountain View, PERSON: Salman Khan
How do we do NER?
There are multiple approaches to named entity recognition. The simplest one, used in the early days, is to build a dataset of named entities and look up every input sentence against it. Here, the datasets can be simple lists. For example, one list might hold organization or company names:
organizations = [Apple, Google, Microsoft, Dell, Intel, ... ]
and another one might be location entities:
locations = [Mountain View, Bangalore, Amsterdam, ... ]
Entity recognition then amounts to a direct match of the input against these lists of possible named entities (organizations and locations, in our case).
But there's an obvious problem with this approach: it ignores the context of the sentence. For example, in the sentence “Some Apples are red, but not any red apple is green”, our entity recognition program might mistag Apple as an organization (because it is in our organization list), even though the sentence refers to the fruit.
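The lookup approach described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the lists reuse the example entries from the text (the elided entries are left out), the function name lookup_entities is hypothetical, and the match is a plain case-sensitive substring search, which is one of several ways a "direct match" could be done.

```python
import re

# Example gazetteers taken from the lists in the text; real lists would be far longer.
ORGANIZATIONS = ["Apple", "Google", "Microsoft", "Dell", "Intel"]
LOCATIONS = ["Mountain View", "Bangalore", "Amsterdam"]
GAZETTEERS = {"ORGANIZATION": ORGANIZATIONS, "LOCATION": LOCATIONS}


def lookup_entities(sentence):
    """Return (label, name, start_offset) for every gazetteer entry found in the sentence.

    Assumption: a direct match means a case-sensitive substring match.
    """
    found = []
    for label, names in GAZETTEERS.items():
        for name in names:
            # re.escape keeps any special characters in a name literal.
            for match in re.finditer(re.escape(name), sentence):
                found.append((label, name, match.start()))
    return found


# A sentence where the lookup works as intended.
print(lookup_entities("Khan Academy is a Mountain View based non-profit."))

# The problem case from the text: the fruit is tagged as an organization.
print(lookup_entities("Some Apples are red, but not any red apple is green"))
```

In the second call the substring "Apple" inside "Apples" matches the organization list, so the lookup reports an ORGANIZATION even though nothing in the sentence is a company. Nothing in the function looks at the surrounding words, which is exactly the weakness described above.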
Approaches
Practical approaches to NER can be based on linguistic grammar-based techniques, on machine learning, or on a combination of both.
Grammar-based techniques require many man-hours from experienced computational linguists to create hand-crafted solutions, which tend to have good precision but lower recall. These techniques have become less popular since the arrival of better machine-learning-based techniques.
Statistical machine-learning-based approaches usually require a large, manually annotated training dataset. Semi-supervised approaches have also been developed to reduce part of the annotation effort. Many different classifier types have been used to perform machine-learned NER, with conditional random fields being a typical choice.
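To make the precision and recall trade-off concrete, here is a small sketch that scores a set of predicted entities against a gold annotation. The gold set reuses the Khan Academy example from the introduction; the predicted set is invented for illustration and stands in for a cautious rule-based system that found only one entity. The function name precision_recall is hypothetical.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall over sets of (label, text) entities.

    precision = correct predictions / all predictions
    recall    = correct predictions / all gold entities
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Gold entities from the example sentence in the introduction.
gold = {
    ("ORGANIZATION", "Khan Academy"),
    ("LOCATION", "Mountain View"),
    ("PERSON", "Salman Khan"),
}

# Made-up output of a cautious system: one prediction, and it is correct.
predicted = {("ORGANIZATION", "Khan Academy")}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this made-up case every prediction is correct, so precision is perfect, but two of the three gold entities are missed, so recall is only one third.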
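As a rough illustration of what a statistical tagger such as a conditional random field works with, the sketch below turns each token of a sentence into a dictionary of features that includes its neighbours. The feature names and the choice of features are assumptions made for this example, not a standard set, and the training step itself is left out because it depends on the library used.

```python
def token_features(tokens, i):
    """Build an illustrative feature dictionary for the token at position i."""
    token = tokens[i]
    return {
        "lower": token.lower(),
        "is_capitalized": token[:1].isupper(),
        "is_all_caps": token.isupper(),
        "has_digit": any(ch.isdigit() for ch in token),
        "suffix3": token[-3:],
        # Neighbouring words give the model the context that a list lookup lacks.
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


def sentence_features(tokens):
    """Return one feature dictionary per token in the sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]


tokens = "Some Apples are red".split()
for token, features in zip(tokens, sentence_features(tokens)):
    print(token, features)
```

Unlike the list lookup, these features let a trained model take into account that "Apples" is preceded by "some" and followed by "are", which is the kind of context that can separate the fruit from the company.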
Reading
Here are some links for further reading on named entity recognition:
- A good read on various statistical methods for NER: A survey of named entity recognition and classification.
- Describes a state-of-the-art neural network based approach for NER: Neural architectures for named entity recognition.
Datasets
Here are some NER datasets that are free for non-commercial use.
- Annotated Corpus for Named Entity Recognition: Corpus (CoNLL 2002)
- N3 - A collection of datasets for NER
- Annotated Corpus for Named Entity Recognition by Anton Dmitriev
- The Groningen Meaning Bank (GMB)
- Dataset from the OKE Challenge 2016
Try it out!
See the following tools and programs for general-purpose named entity recognition.
- AllenNLP NER - Implements a state-of-the-art neural-network-based approach.
- Stanford NER - A popular, production-quality NER system.