NLP--Natural Language Processing and Machine Learning Conference


http://blog.csdn.net/ice110956/article/details/17090061

Notes from the Natural Language Processing and Machine Learning Conference held in Chongqing in mid-November; this first installment covers natural language processing.

The basic framework is organized below, from fundamental theory to practical application.

1. Foundations of Natural Language Processing

Part-of-speech tagging (POS):

Assigning a part-of-speech tag to each word in a sentence can be seen as a key task of syntactic analysis, and as its lowest level. It is very useful for subsequent syntactic analysis, semantic disambiguation, and other tasks.

POS tag set, i.e., the inventory of part-of-speech categories:

The Penn Treebank tag set is commonly used; it defines 45 tags.

Basic methods:

Rule-based: construct tagging rules manually from vocabulary and other linguistic knowledge

Learning-based: trained on human-annotated corpora

Statistical models: HMM, Maximum Entropy Markov Model (MEMM), Conditional Random Field (CRF)

Rule learning: Transformation-Based Learning (TBL)

POS tagging as sequence labeling:

POS tagging can be seen as the most typical sequence labeling problem.

Classification-based sequence labeling:

Treat the context of each word, such as its adjacent words, as features, and predict the tag with a classification algorithm.

For example: John saw the fish and decided to take it to the table.

The tag of "saw" can be predicted from context features such as "John" + saw + "the fish", using a classification algorithm.
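
As a minimal illustration (the feature template below is an assumption for this sketch, not the exact one from the talk), the context features for each word might look like this in Python:

    # Sketch of context features for classification-based tagging
    # (illustrative feature template, not from the talk).
    def window_features(words, i):
        return {
            "word": words[i],
            "prev": words[i - 1] if i > 0 else "<s>",                # left neighbor
            "next": words[i + 1] if i + 1 < len(words) else "</s>",  # right neighbor
        }

    words = "John saw the fish".split()
    print(window_features(words, 1))  # features for "saw"

Any standard classifier can then be trained to map such features to a POS tag.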

Disadvantages:

1. It is not easy to integrate tag information from both the left and right contexts.

2. It is difficult to express and propagate the uncertainty of individual tag decisions, which makes it hard to determine the most likely joint tagging for all the words in the sequence.

Concrete algorithms include forward classification and backward classification.

Probability-based sequence labeling: a probabilistic sequence model integrates the uncertainty of multiple interdependent individual classifications over the whole sequence, and jointly determines the most likely global tag assignment.

Typical models: HMM, MEMM, CRF

Among these, HMMs can be trained with supervised, unsupervised, or semi-supervised learning; decoding uses the Viterbi dynamic programming algorithm, sketched below.
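
A rough sketch of Viterbi decoding for an HMM tagger (the data structures and the smoothing constant are assumptions for illustration, not from the talk):

    # Minimal Viterbi decoder for an HMM POS tagger (illustrative sketch).
    # tags: list of states; start_p[t] and trans_p[s][t]: start/transition
    # probabilities; emit_p[t]: dict mapping word -> emission probability.
    def viterbi(words, tags, start_p, trans_p, emit_p):
        # best[i][t] = probability of the best path for words[:i+1] ending in t
        best = [{t: start_p[t] * emit_p[t].get(words[0], 1e-8) for t in tags}]
        back = [{}]
        for i in range(1, len(words)):
            best.append({})
            back.append({})
            for t in tags:
                s, p = max(((s, best[i - 1][s] * trans_p[s][t]) for s in tags),
                           key=lambda x: x[1])
                best[i][t] = p * emit_p[t].get(words[i], 1e-8)
                back[i][t] = s
        # Trace the highest-probability tag path backwards
        path = [max(best[-1], key=best[-1].get)]
        for i in range(len(words) - 1, 0, -1):
            path.append(back[i][path[-1]])
        return list(reversed(path))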

Performance of Chinese lexical analysis:

The overall F-score is about 95%.

The main errors come from new words. Named entity recognition performs less well and varies with text type; its overall level is above 90%.

Syntactic analysis (sentence structure)

Types: constituency parsing vs. dependency parsing; full parsing vs. shallow parsing.

Related concepts: chunking, the Chomsky hierarchy, context-free grammars (CFG), syntax trees (parsing), etc.

Syntactic structure analysis (parsing):

1. Given a string of terminal symbols and a CFG, determine whether the string can be generated by the CFG, and return the syntax tree(s) for it.

2. Search for a derivation of the syntax tree:

Top-down parsing: start from the start symbol

Bottom-up parsing: start from the terminals in the string

3. Dynamic programming parsing methods:

CKY (Cocke-Kasami-Younger) algorithm: bottom-up; requires the grammar in Chomsky Normal Form (see the sketch after this list)

Earley parser: top-down; no grammar normalization required, but more complex

Chart parser: combines top-down and bottom-up search
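
A minimal CKY recognizer, assuming a toy grammar in Chomsky Normal Form (the grammar and lexicon below are made up for illustration):

    from itertools import product

    grammar = {("NP", "VP"): {"S"}, ("V", "NP"): {"VP"}}   # binary CNF rules
    lexicon = {"John": {"NP"}, "saw": {"V"}, "fish": {"NP"}}

    def cky_recognize(words):
        n = len(words)
        # table[i][j] = set of nonterminals that span words[i:j]
        table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
        for i, w in enumerate(words):
            table[i][i + 1] = set(lexicon.get(w, ()))
        for span in range(2, n + 1):
            for i in range(n - span + 1):
                j = i + span
                for k in range(i + 1, j):  # try every split point
                    for b, c in product(table[i][k], table[k][j]):
                        table[i][j] |= grammar.get((b, c), set())
        return "S" in table[0][n]

    print(cky_recognize(["John", "saw", "fish"]))  # True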

Statistical syntactic analysis

A syntactic probability model assigns a probability to each candidate syntax tree; the parsing model can be trained with supervised or unsupervised learning.

Probabilistic Context-Free Grammar (PCFG): the probabilistic form of a CFG, parsed with a probabilistic version of CKY.
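
As a worked example (the rule probabilities are invented for illustration, not from the talk): if a tree uses the rules S -> NP VP with p = 0.9, VP -> V NP with p = 0.5, and NP -> 'fish' with p = 0.3, the tree's probability is the product of its rule probabilities, 0.9 * 0.5 * 0.3 = 0.135. Probabilistic CKY then stores, for each span and nonterminal, the highest-probability subtree rather than a plain set of nonterminals.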

Treebanks for training:

See Wikipedia: Treebank

Performance of Chinese syntactic analysis:

The overall phrase-structure F-score is >= 80%; dependency parsing reaches about 90%.

2. Internet Semantic Computing and Information Summarization

Semantic analysis (sentence meaning):

Obtain the meaning of a language unit at different levels: lexical, sentence, and discourse.

Syntax-driven sentence-level semantic analysis: the meaning of a sentence is derived by composing the meanings of its constituents; a sentence-meaning representation is obtained from lexical and grammatical information.

1. Use the syntax tree to generate a first-order logic expression (see the sketch after this list).

2. Semantic role labeling: agent, patient, source, goal, instrument, etc.
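
A minimal sketch of syntax-driven composition (the lambda terms and predicate names are assumptions for illustration, not any particular system):

    # Compose a first-order-logic expression bottom-up from a tiny parse.
    john, the_fish = "John", "TheFish"
    saw = lambda obj: (lambda subj: f"saw({subj}, {obj})")  # transitive verb
    vp = saw(the_fish)    # VP = V + object NP
    print(vp(john))       # S = subject NP + VP  ->  saw(John, TheFish)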

Performance: deep semantic analysis is difficult, with no mature technology or systems yet; the overall F-score of semantic role labeling is about 70%.

Discourse analysis (discourse parsing)

A discourse is a group of coherent, structured sentences, e.g., a monologue or a dialogue.

Main tasks: discourse segmentation, inter-sentence relation recognition, and reference resolution.

Ideally, these tasks require deep text-understanding techniques, but so far only shallow analysis methods are in use.

1. Text segmentation:

Divide a document into a linear sequence of subtopics; e.g., a scientific article can be divided into abstract, introduction, methods, results, conclusions, and so on.

Applications: document summarization (summarize each segment separately); information retrieval and information extraction (operate on the relevant segment).

Related task: paragraph segmentation of speech-recognition transcripts.

Method: the cohesion-based approach.

Divide the text into subtopics such that paragraphs/sentences within a subtopic are highly cohesive, while cohesion is poor across subtopic boundaries.

e.g., the TextTiling algorithm, sketched below.
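
A simplified cohesion-scoring sketch in the spirit of TextTiling (not Hearst's full algorithm; the similarity measure and threshold are assumptions):

    # Place subtopic boundaries where lexical cohesion between adjacent
    # sentences drops below a threshold.
    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / (len(a | b) or 1)

    def boundaries(sentences, threshold=0.1):
        # sentences: list of token lists; a cut before index i means
        # sentence i starts a new subtopic
        return [i for i in range(1, len(sentences))
                if jaccard(sentences[i - 1], sentences[i]) < threshold]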

2. Discourse structure:

A hierarchical discourse structure built from coherence relations, similar to a syntax tree. Tree nodes represent coherence relations between sentences/discourse segments (not linear).

Applications: a summarization system can ignore or merge units connected by an elaboration relation; a question-answering system can use explanation relations to answer questions; an information extraction system need not extract across units that lack a coherence relation.

3. Reference resolution:

Determine which entity a given linguistic expression refers to.

Classification:

Coreference resolution: find referring expressions that point to the same entity, i.e., build a coreference chain, e.g., {Mr. Obama, the president, he}.

Pronominal anaphora resolution: e.g., in a following sentence, "he" refers to Mr. Obama.

Lexical semantic computation

Example: "I want to kick you" -> "I want to flatten you" (different wordings, same meaning).

Research question: how should the meaning of words be represented? Synonyms, antonyms, hypernyms, hyponyms, word similarity, etc.

Terminology: word sense (word senses):

The specific meaning of a word

A word can have multiple meanings

A sense can be described by a gloss, e.g., apple: a fruit, red, yellow, or green, sweet.

Lexical similarity (word similarity)

Synonymy, antonymy, and the like are binary (yes/no) relations.

A looser criterion: lexical similarity / word-sense distance (word similarity or semantic distance).

Two methods of calculation:

Semantic-dictionary-based (thesaurus-based) approach: judge relations within a resource such as WordNet.

Corpus-based (distributional/statistical) approach: compare the contexts in which words occur in a corpus.

Word-sense similarity based on WordNet:

WordNet is the well-known computational resource for English word-sense relations, a thesaurus.

The basic unit is the synset, a set of synonyms.

An entry may appear in multiple synsets, one per sense, each annotated with a gloss.

Synsets are connected by various semantic relations.
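
For instance, with NLTK's WordNet interface (assuming nltk is installed and the wordnet corpus has been downloaded), path-based similarity over the hypernym hierarchy looks like this:

    from nltk.corpus import wordnet as wn

    dog = wn.synsets("dog")[0]       # first sense of "dog"
    cat = wn.synsets("cat")[0]
    print(dog.path_similarity(cat))  # similarity from hypernym-path distance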

Disadvantages of the semantic dictionary approach:

Many languages have no usable semantic dictionary; many new words are not covered; coverage is best for nouns and weaker for adjectives and verbs.

Lexical similarity based on corpus statistics:

For example, we can infer the meaning of an unknown English word from the many contexts in which it appears; corpus-based statistics work the same way. A word's semantics can be estimated from Internet-scale corpora, or from Wikipedia-based semantic analysis, and so on.
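
A minimal distributional-similarity sketch (the tiny corpus and window size are made up for illustration):

    # Represent each word by its co-occurrence counts within a context
    # window, then compare words with cosine similarity.
    from collections import Counter
    from math import sqrt

    corpus = "the cat sat on the mat the dog sat on the rug".split()

    def context_vector(target, window=2):
        vec = Counter()
        for i, w in enumerate(corpus):
            if w == target:
                vec.update(c for c in corpus[max(0, i - window):i + window + 1]
                           if c != target)
        return vec

    def cosine(u, v):
        dot = sum(u[k] * v[k] for k in u)
        norm = (sqrt(sum(x * x for x in u.values()))
                * sqrt(sum(x * x for x in v.values())))
        return dot / norm if norm else 0.0

    print(cosine(context_vector("cat"), context_vector("dog")))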

Word sense disambiguation

Once word semantics have been computed, they can be used for disambiguation.
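
As one concrete technique (my choice of illustration, not necessarily the talk's method), NLTK ships a simplified Lesk algorithm that picks the sense whose gloss overlaps the context most, assuming nltk and its wordnet data are available:

    from nltk.wsd import lesk

    context = "I went to the bank to deposit money".split()
    print(lesk(context, "bank"))  # returns the best-matching WordNet synset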

Internet information summarization

Condense and summarize large volumes of content into a concise, intuitive digest of the main points users care about, e.g., Weibo topic graphs, news digests, and so on. This is a major application of natural language processing and discourse analysis.
