Natural Language Processing 18: Named-entity Recognition

Source: Internet
Author: User

https://en.wikipedia.org/wiki/Named-entity_recognition

Named-entity recognition (NER) (also known as entity identification, entity chunking, and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, locations, expressions of time, quantities, monetary values, percentages, etc.

Most NER systems are structured as taking an unannotated block of text, such as this one:

Jim bought shares of Acme Corp. in 2006.

and producing an annotated block of text that highlights the names of entities:

[Jim]Person bought shares of [Acme Corp.]Organization in [2006]Time.

In this example, a person name consisting of one token, a two-token company name, and a temporal expression have been detected and classified.
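The annotation above can be mimicked with a toy rule-based tagger. This is only an illustrative sketch: the gazetteer entries and the year regex below are assumptions made for the example, not the resources of any real system.

```python
import re

# Toy gazetteers -- illustrative assumptions, not real NER resources.
PERSONS = {"Jim"}
ORGANIZATIONS = {"Acme Corp."}
# Matches four-digit years from 1800-2099.
YEAR_RE = re.compile(r"\b(1[89]\d{2}|20\d{2})\b")

def annotate(text):
    """Wrap recognized entities in [span]Type markers (naive string replace)."""
    for name in ORGANIZATIONS:
        text = text.replace(name, f"[{name}]Organization")
    for name in PERSONS:
        text = text.replace(name, f"[{name}]Person")
    text = YEAR_RE.sub(lambda m: f"[{m.group(0)}]Time", text)
    return text

print(annotate("Jim bought shares of Acme Corp. in 2006."))
# → [Jim]Person bought shares of [Acme Corp.]Organization in [2006]Time.
```

Real grammar-based systems are far more elaborate, but the shape of the task is the same: detect spans, then label them.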

State-of-the-art NER systems for English produce near-human performance. For example, the best system entering MUC-7 scored 93.39% F-measure, while human annotators scored 97.60% and 96.95%. [1] [2]

Problem Definition

In the expression named entity, the word named restricts the task to those entities for which one or many rigid designators, as defined by Kripke, stand for the referent. For instance, the automotive company created by Henry Ford in 1903 is referred to as Ford or Ford Motor Company. Rigid designators include proper names as well as terms for certain biological species and substances.[3]

Full named-entity recognition is often broken down, conceptually and possibly also in implementations,[4] into two distinct problems: detection of names, and classification of the names by the type of entity they refer to (e.g. person, organization, location, and other[5]). The first phase is typically simplified to a segmentation problem: names are defined to be contiguous spans of tokens, with no nesting, so that "Bank of America" is a single name, disregarding the fact that inside this name, the substring "America" is itself a name. This segmentation problem is formally similar to chunking.
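The flat-segmentation view described above is commonly encoded with BIO tags, as used in chunking. The sketch below (the tag names PER/ORG are an illustrative convention) converts such tags back into contiguous, non-nested spans; note that the nested name "America" simply cannot be represented in this scheme.

```python
# BIO segmentation: B- begins a span, I- continues it, O is outside.
# Flat, contiguous spans only -- "America" inside "Bank of America"
# cannot be represented as a separate entity.
tokens = ["Jim", "works", "at", "Bank", "of", "America", "."]
tags   = ["B-PER", "O", "O", "B-ORG", "I-ORG", "I-ORG", "O"]

def bio_to_spans(tokens, tags):
    """Convert BIO tags into (start, end, type) spans; end is exclusive."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):        # sentinel flushes last span
        if tag.startswith("B-") or tag == "O":
            if start is not None:
                spans.append((start, i, etype))
                start, etype = None, None
            if tag.startswith("B-"):
                start, etype = i, tag[2:]
    return spans

print(bio_to_spans(tokens, tags))
# → [(0, 1, 'PER'), (3, 6, 'ORG')]
```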

Temporal expressions and some numerical expressions (e.g., money, percentages, etc.) may also be considered named entities in the context of the NER task. While some instances of these types are good examples of rigid designators (e.g., the year 2001), there are also many invalid ones (e.g., I take my vacations in "June"). In the first case, the year 2001 refers to the 2001st year of the Gregorian calendar. In the second case, the month June may refer to the month of an undefined year (past June, next June, June 2020, etc.). It is arguable that the named-entity definition is loosened in such cases for practical reasons. The definition of the term named entity is therefore not strict and often has to be explained in the context in which it is used. [6]

Certain hierarchies of named-entity types have been proposed in the literature. BBN categories, proposed in 2002, were used for question answering and consist of a two-level hierarchy of types and subtypes. [7] Sekine's extended hierarchy, also proposed in 2002, is made up of subtypes. [8] More recently, Ritter used a hierarchy based on common Freebase entity types in ground-breaking experiments on NER over social media text. [9]

Formal Evaluation

To evaluate the quality of a NER system's output, several measures have been defined. While accuracy on the token level is one possibility, it suffers from two problems: the vast majority of tokens in real-world text are not part of entity names as usually defined, so the baseline accuracy (always predict "not an entity") is extravagantly high, typically >90%; and mispredicting the full span of an entity name is not properly penalized (finding only a person's first name when their last name follows is scored as ½ accuracy).
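The baseline problem can be made concrete with a toy corpus; the ratio below (2 entity tokens out of 20) is an illustrative assumption chosen to mirror the >90% figure quoted above.

```python
# Token-level accuracy on a toy corpus where only 2 of 20 tokens belong
# to an entity: the trivial "never predict an entity" baseline already
# scores 90%, which is why token accuracy is a weak metric for NER.
gold = ["O"] * 18 + ["B-PER", "I-PER"]
pred = ["O"] * 20                      # baseline: predict O everywhere

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(accuracy)   # → 0.9
```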

In academic conferences such as CoNLL, a variant of the F1 score has been defined as follows:[5]

    • Precision is the number of predicted entity name spans that line up exactly with spans in the gold standard evaluation data, i.e. when [Person Hans] [Person Blick] is predicted but [Person Hans Blick] was required, precision for the predicted name is zero. Precision is then averaged over all predicted entity names.
    • Recall is similarly the number of names in the gold standard that appear at exactly the same location in the predictions.
    • The F1 score is the harmonic mean of these two.

It follows from the above definition that any prediction that misses a single token, includes a spurious token, or has the wrong class is a hard error and does not contribute to either precision or recall.
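Under these definitions, exact-match scoring can be sketched as follows; the Hans Blick example above shows how a split prediction earns zero credit on both measures.

```python
def exact_match_prf(gold_spans, pred_spans):
    """CoNLL-style scoring sketch: a predicted (start, end, type) span counts
    only if its boundaries AND its type match a gold span exactly."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Gold: "Hans Blick" as a single two-token PER span over tokens 0-2.
gold = [(0, 2, "PER")]
# Prediction splits it into two one-token spans: both are hard errors.
pred = [(0, 1, "PER"), (1, 2, "PER")]
print(exact_match_prf(gold, pred))   # → (0.0, 0.0, 0.0)
```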

Evaluation models based on token-by-token matching have also been proposed. Such models can handle partially overlapping matches while still fully rewarding only exact matches. They allow a finer-grained evaluation and comparison of extraction systems, taking into account also the degree of mismatch in non-exact predictions.
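One simple token-overlap variant of such partial-credit scoring might look like the following sketch (an illustration of the idea, not any particular published evaluation model):

```python
def token_overlap_prf(gold_spans, pred_spans):
    """Partial-credit scoring: count overlapping tokens between gold and
    predicted spans of the same type, so near-misses earn partial reward."""
    def tokens(spans):
        # Expand each (start, end, type) span into per-token (index, type) pairs.
        return {(i, t) for s, e, t in spans for i in range(s, e)}
    g, p = tokens(gold_spans), tokens(pred_spans)
    tp = len(g & p)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Predicting only the first token of a two-token name now earns
# half recall instead of a hard zero.
print(token_overlap_prf([(0, 2, "PER")], [(0, 1, "PER")]))
# → precision 1.0, recall 0.5
```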

Approaches

NER systems have been created that use linguistic grammar-based techniques as well as statistical models, i.e. machine learning. Hand-crafted grammar-based systems typically obtain better precision, but at the cost of lower recall and months of work by experienced computational linguists. Statistical NER systems typically require a large amount of manually annotated training data. Semi-supervised approaches have been suggested to avoid part of the annotation effort.

Many different classifier types have been used to perform machine-learned NER, with conditional random fields being a typical choice.
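As a sketch of the kind of hand-designed features a CRF-based NER tagger typically consumes (the exact feature set below is an assumption for illustration; real systems add gazetteer lookups, part-of-speech tags, word clusters, etc.):

```python
def token_features(tokens, i):
    """Per-token features for a sequence labeler -- a simplified sketch of
    the feature templates commonly fed to a CRF for NER."""
    w = tokens[i]
    return {
        "word.lower": w.lower(),          # lexical identity
        "word.istitle": w.istitle(),      # capitalization is a strong name cue
        "word.isupper": w.isupper(),      # acronyms like IBM
        "word.isdigit": w.isdigit(),      # years, quantities
        "suffix3": w[-3:],                # morphological hint (e.g. "-ton")
        "prev.lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

feats = token_features(["Jim", "visited", "Acme"], 0)
print(feats["word.istitle"], feats["next.lower"])   # → True visited
```

The CRF then scores whole tag sequences over these per-token features, which lets it enforce constraints such as "I-ORG may only follow B-ORG or I-ORG".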

Problem Domains

Research indicates that even state-of-the-art NER systems are brittle, meaning that NER systems developed for one domain do not typically perform well on other domains. [14] Considerable effort is involved in tuning NER systems to perform well in a new domain; this is true for both rule-based and trainable statistical systems.

Early NER systems in the 1990s were aimed primarily at extraction from journalistic articles. Attention then turned to the processing of military dispatches and reports. Later stages of the Automatic Content Extraction (ACE) evaluation also included several types of informal text styles, such as weblogs and text transcripts from conversational telephone speech. Since about 1998, there has been a great deal of interest in entity identification in the molecular biology, bioinformatics, and medical natural language processing communities. The most common entities of interest in that domain have been names of genes and gene products. There has also been considerable interest in the recognition of chemical entities and drugs in the context of the CHEMDNER competition, with a number of teams participating in this task. [15]

Current challenges and research

Despite the high F1 numbers reported on the MUC-7 dataset, the problem of named-entity recognition is far from being solved. The main efforts are directed at reducing the annotation labor by employing semi-supervised learning,[11][16] robust performance across domains,[17] and scaling up to fine-grained entity types. [8][19] In recent years, many projects have turned to crowdsourcing, which is a promising solution to obtain high-quality aggregate human judgments for supervised and semi-supervised machine learning approaches to NER.[20] Another challenging task is devising models to deal with linguistically complex contexts such as Twitter and search queries. [21]

A recently emerging task of identifying "important expressions" in text and cross-linking them to Wikipedia [23] can be seen as an instance of extremely fine-grained named-entity recognition, where the types are the actual Wikipedia pages describing the (potentially ambiguous) concepts. Below is an example output of a wikification system:

<ENTITY url="http://en.wikipedia.org/wiki/Michael_I._Jordan">Michael Jordan</ENTITY> was a professor at <ENTITY url="http://en.wikipedia.org/wiki/University_of_California,_Berkeley">Berkeley</ENTITY>
Software
    • GATE supports NER across many languages and domains out of the box, usable via a graphical interface and a Java API
    • Apache OpenNLP includes rule-based and statistical named-entity recognition
    • Stanford University also offers the Stanford Named Entity Recognizer
