NLP | Natural Language Processing
What is Syntax Parsing?
Anyone who has studied a natural language has also studied its grammar. For example, a sentence can be described in terms of a subject, a predicate, and an object. Many natural language processing applications need to take the syntax of sentences into account, which is why syntax parsing is an important topic.
Syntax parsing raises two main problems: first, how to represent and store the syntax of a sentence in a computer, including the corpus datasets that record it; second, the parsing algorithm itself.
For the first problem, we can use a tree structure, as shown in the figure. S represents a sentence. NP, VP, and PP are noun, verb, and prepositional phrases (phrase level); N, V, and P are nouns, verbs, and prepositions (word level).
In actual storage, the preceding tree can be represented as (S (NP (N Boeing)) (VP (V is) (VP (V located) (PP (P in) (NP (N Seattle)))))). Mature, manually annotated corpus datasets are already available on the Internet, such as the Penn Treebank Project (Penn Treebank II Constituent Tags).
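As a minimal sketch of how the bracketed notation can be read back into a tree, the following Python reads such a string into nested tuples of the form (label, child, ...). The function names parse_bracketed and leaves are hypothetical helpers chosen for illustration, not part of any treebank toolkit.

```python
def parse_bracketed(text):
    """Parse a string like '(S (NP (N Boeing)) ...)' into nested tuples."""
    # Put spaces around brackets so a plain split() yields the tokens.
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1                      # consume "("
        label = tokens[pos]           # phrase or part-of-speech label
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # nested subtree
            else:
                children.append(tokens[pos])   # a word (leaf)
                pos += 1
        pos += 1                      # consume ")"
        return (label, *children)

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("unexpected tokens after the tree")
    return tree


def leaves(tree):
    """Return the words of the tree from left to right."""
    if isinstance(tree, str):
        return [tree]
    return [word for child in tree[1:] for word in leaves(child)]


boeing = parse_bracketed(
    "(S (NP (N Boeing)) (VP (V is) (VP (V located) (PP (P in) (NP (N Seattle))))))"
)
print(leaves(boeing))
```

Nested tuples are used here only because they are easy to inspect; a real system would more likely use a dedicated tree class.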
For the second problem, we need an appropriate algorithm to handle it. This is what we will discuss in this chapter.
Context-Free Grammar
To generate the syntax tree of a sentence, we define a context-free grammar (CFG) as follows:
1) N is a set of non-terminal symbols (labels of non-leaf nodes), such as {S, NP, VP, N, ...}
2) Σ is a set of terminal symbols (labels of leaf nodes), such as {Boeing, is, ...}
3) R is a set of rules. Each rule has the form X -> Y1 Y2 ... Yn, where X ∈ N and Yi ∈ (N ∪ Σ)
4) S is the start symbol, the label at the root of the syntax tree
For example, a subset of the grammar can be expressed as follows. Given a sentence, the grammar can be applied to it from left to right. For example, "the man sleeps" can be expressed as (S (NP (DT the) (NN man)) (VP sleeps)).
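The four components N, Σ, R, and S can be written down directly as data. The sketch below uses a hypothetical toy grammar that covers only "the man sleeps", with trees stored as nested tuples (label, child, ...), and checks that a tree uses only rules from R. The names RULES, rules_used, and is_licensed are illustrative assumptions.

```python
# A toy CFG; the rule set is an assumption covering a single sentence.
NONTERMINALS = {"S", "NP", "VP", "DT", "NN"}          # N
TERMINALS = {"the", "man", "sleeps"}                  # Σ
RULES = {                                             # R, as (X, (Y1, ..., Yn))
    ("S", ("NP", "VP")),
    ("NP", ("DT", "NN")),
    ("VP", ("sleeps",)),
    ("DT", ("the",)),
    ("NN", ("man",)),
}
START = "S"                                           # S


def rules_used(tree):
    """Yield the rule (X, (Y1, ..., Yn)) applied at every internal node."""
    label, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for child in children:
        if not isinstance(child, str):
            yield from rules_used(child)


def is_licensed(tree):
    """True if the tree starts at START and every rule it uses is in RULES."""
    return tree[0] == START and all(rule in RULES for rule in rules_used(tree))


man_sleeps = ("S", ("NP", ("DT", "the"), ("NN", "man")), ("VP", "sleeps"))
print(is_licensed(man_sleeps))
```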
A context-free grammar makes it easy to derive the syntactic structure of a sentence. Its disadvantage is that the structure may be ambiguous: the two syntax trees in the following figures, for example, can represent the same sentence. Common sources of ambiguity include: 1) a word with more than one part of speech. For example, "can" is usually a modal verb but sometimes a noun meaning a jar. 2) the scope of a prepositional phrase. In a structure such as VP PP PP, the second prepositional phrase may modify either the VP or the first PP. 3) consecutive nouns, such as NN NN NN.
Probabilistic Context-Free Grammar
Because syntax parsing is ambiguous, we need a way to find the most likely tree among multiple possible syntax trees. A common method is the PCFG (Probabilistic Context-Free Grammar). As shown in the figure, in addition to the regular grammar rules, each rule is given a probability. The probability of a generated syntax tree is the product of the probabilities of the rules used in it.
To sum up, when a sentence has two or more syntax trees, we calculate the probability p(t) of each tree t separately. The tree with the highest probability, arg max p(t), is the expected result.
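The product of rule probabilities and the arg max can be sketched as follows. The probabilities in RULE_PROB are made-up values for illustration only, and the two candidate trees are hypothetical competing analyses of "the man sleeps"; tree_prob and best_tree are likewise illustrative names.

```python
# Hypothetical rule probabilities; rules with the same left side sum to 1.
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("DT", "NN")): 1.0,
    ("DT", ("the",)): 1.0,
    ("NN", ("man",)): 1.0,
    ("VP", ("sleeps",)): 0.3,
    ("VP", ("VB",)): 0.7,
    ("VB", ("sleeps",)): 0.2,
    ("VB", ("runs",)): 0.8,
}


def tree_prob(tree):
    """p(t): the product of the probabilities of all rules used in the tree."""
    label, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    prob = RULE_PROB.get((label, rhs), 0.0)   # unknown rules get probability 0
    for child in children:
        if not isinstance(child, str):
            prob *= tree_prob(child)
    return prob


def best_tree(candidates):
    """arg max p(t) over an explicit list of candidate trees."""
    return max(candidates, key=tree_prob)


np_part = ("NP", ("DT", "the"), ("NN", "man"))
tree_a = ("S", np_part, ("VP", "sleeps"))
tree_b = ("S", np_part, ("VP", ("VB", "sleeps")))
print(tree_prob(tree_a), tree_prob(tree_b))
print(best_tree([tree_a, tree_b]) is tree_a)
```

Listing the candidate trees explicitly only works for tiny examples; in practice the candidates are never enumerated and the best tree is found by dynamic programming.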
Training algorithm
We have now defined the framework for syntax parsing, which relies on N, Σ, R, and S from the CFG and on the rule probabilities p(x) from the PCFG. As mentioned above, the Penn Treebank provides a very large, manually annotated corpus. Our task is to train the parameters required by the PCFG from that corpus:
1) Collect all N and Σ that occur in the corpus;
2) Use all the rules that occur in the corpus as R;
3) For each rule A -> B, estimate its probability from the corpus as p(A -> B) = Count(A -> B) / Count(A).
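The counting estimate in step 3 can be sketched in a few lines. The two-tree "treebank" below is a made-up stand-in for a real corpus, trees are assumed to be nested tuples (label, child, ...), and estimate_pcfg is a hypothetical name.

```python
from collections import Counter


def rules_used(tree):
    """Yield the rule (A, (B1, ..., Bn)) applied at every internal node."""
    label, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for child in children:
        if not isinstance(child, str):
            yield from rules_used(child)


def estimate_pcfg(treebank):
    """Relative-frequency estimate: p(A -> B) = Count(A -> B) / Count(A)."""
    rule_counts = Counter()
    lhs_counts = Counter()
    for tree in treebank:
        for lhs, rhs in rules_used(tree):
            rule_counts[(lhs, rhs)] += 1
            lhs_counts[lhs] += 1
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}


# A made-up corpus of two annotated sentences.
treebank = [
    ("S", ("NP", ("DT", "the"), ("NN", "man")), ("VP", "sleeps")),
    ("S", ("NP", ("DT", "the"), ("NN", "dog")), ("VP", "barks")),
]
for rule, prob in sorted(estimate_pcfg(treebank).items()):
    print(rule, prob)
```

With this estimate, the probabilities of all rules that share a left-hand side sum to 1 by construction.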
Building on the definition of a CFG, we introduce a restricted grammar format called Chomsky normal form (CNF). It requires every rule to have the form X -> Y1 Y2, where Y1 and Y2 are non-terminals, or X -> Y, where Y is a terminal. Chomsky normal form guarantees that every generated syntax tree is binary, and any context-free grammar can be converted to an equivalent grammar in this form (apart from the handling of the empty string).
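One step of the conversion, splitting rules with more than two symbols on the right-hand side into binary rules, might look like the sketch below. It is only part of a full conversion (it does not remove unary chains between non-terminals), and the "X|..." naming scheme for the freshly introduced symbols is an arbitrary assumption.

```python
def binarize(rules):
    """Split every rule X -> Y1 Y2 ... Yn with n > 2 into a chain of binary rules.

    Rules are (X, (Y1, ..., Yn)) pairs; rules with one or two symbols on the
    right-hand side are passed through unchanged.
    """
    result = []
    for lhs, rhs in rules:
        current, rest = lhs, list(rhs)
        while len(rest) > 2:
            # Fresh intermediate symbol standing for the remaining symbols.
            new_symbol = f"{lhs}|{'_'.join(rest[1:])}"
            result.append((current, (rest[0], new_symbol)))
            current, rest = new_symbol, rest[1:]
        result.append((current, tuple(rest)))
    return result


for rule in binarize([("VP", ("V", "NP", "PP")), ("S", ("NP", "VP"))]):
    print(rule)
```

In a PCFG, the usual choice is to give the first rule of each chain the probability of the original rule and every intermediate rule probability 1, so that tree probabilities are unchanged.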
Syntax tree prediction algorithm
Assume that we already have a PCFG model with parameters N, Σ, R, S, and p(x), and that the grammar is in Chomsky normal form. Given an input sentence x1, x2, ..., xn, how do we compute the corresponding syntax tree?
The first method is brute-force enumeration. Each word may take any of m = len(N) labels and the sentence has n words, so there are already on the order of m^n labelings before different tree shapes are even considered. Enumerating every possible syntax tree and keeping the best one therefore takes time exponential in n.
The second method is, of course, dynamic programming. We define w[i, j, X] as the maximum probability of any subtree rooted at X that covers the i-th through j-th words. Intuitively, for a span xi, xi+1, ..., xj with X = PP, the subtree may be built in several ways, such as (P NP) or another expansion of PP, but w[i, j, PP] keeps only the combination with the highest probability as the recursion proceeds. In the base case, w[i, i, X] = p(X -> xi). The dynamic programming equation is therefore w[i, j, X] = max over rules X -> Y Z and split points s (i <= s < j) of p(X -> Y Z) * w[i, s, Y] * w[s+1, j, Z]. LeetCode has many problems that illustrate the dynamic programming approach.
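The recurrence for w[i, j, X] can be sketched in Python as below. This is a minimal illustration rather than an optimized parser: the grammar is assumed to be in Chomsky normal form, the function name cky and the dictionary layout of lexical and binary are assumptions, indices are 0-based, and the probabilities are made up.

```python
def cky(words, lexical, binary, start="S"):
    """Return (probability, tree) of the best parse, or (0.0, None) if none exists.

    lexical: {(X, word): p} for rules X -> word
    binary:  {(X, Y, Z): p} for rules X -> Y Z
    """
    n = len(words)
    w = {}      # w[(i, j, X)]: best probability of X covering words i..j
    back = {}   # back-pointers used to rebuild the best tree

    # Base case: w[i, i, X] = p(X -> xi)
    for i, word in enumerate(words):
        for (X, terminal), p in lexical.items():
            if terminal == word:
                w[(i, i, X)] = p
                back[(i, i, X)] = word

    # Longer spans: maximize over rules X -> Y Z and split points s
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for (X, Y, Z), p in binary.items():
                for s in range(i, j):
                    candidate = p * w.get((i, s, Y), 0.0) * w.get((s + 1, j, Z), 0.0)
                    if candidate > w.get((i, j, X), 0.0):
                        w[(i, j, X)] = candidate
                        back[(i, j, X)] = (s, Y, Z)

    def build(i, j, X):
        if i == j:
            return (X, back[(i, j, X)])
        s, Y, Z = back[(i, j, X)]
        return (X, build(i, s, Y), build(s + 1, j, Z))

    root = (0, n - 1, start)
    if root not in w:
        return 0.0, None
    return w[root], build(*root)


# Made-up grammar for "the man sleeps"; VP is treated as a word-level label here.
lexical = {("DT", "the"): 1.0, ("NN", "man"): 1.0, ("VP", "sleeps"): 0.3}
binary = {("S", "NP", "VP"): 1.0, ("NP", "DT", "NN"): 1.0}
print(cky(["the", "man", "sleeps"], lexical, binary))
```

The three nested loops over span length, start position, and split point give a running time cubic in the sentence length, multiplied by the number of binary rules.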
Syntax parsing can be carried out with the algorithm above. PCFG does have some disadvantages, such as: 1) lack of lexical information; 2) poor handling of consecutive phrases (such as runs of nouns or prepositional phrases). In general, however, it provides a very effective way to implement syntax parsing.
What is NLP?
NLP was founded by Richard Bandler and John Grinder in 1976 at the University of California, Santa Cruz. Here NLP is the abbreviation of Neuro-Linguistic Programming. The three words have the following meanings:
Neuro: literally "neural", sometimes translated as "mind and body". It refers to the connection between mind and body through the nervous system; the brain governs our sense organs and so keeps us in contact with the world.
Linguistic: refers to the use of language to interact with others, including the silent language of posture, gesture, and habit, which reveals our thinking patterns, beliefs, and inner states. It is the pattern of language through which mind and body connect.
Programming: a term borrowed from computer science to suggest that our thoughts, feelings, and actions are simply habitual programs that can be improved by upgrading our "thinking" software. By repeatedly refining the programs behind our thinking and behavior, we can achieve more satisfactory results in what we do.
NLP studies subjective human experience and how our brains work, so it can be translated as "mind-body grammar programming". It is a detailed and practical model of human behavior and communication. It draws on traditional neurology, physiology, psychology, linguistics, and the study of human brain control.
What is NLP?
In short, NLP starts by decoding the language and thinking patterns of successful people and then applies those patterns. What does NLP mean? NLP also refers to natural language processing: NLP (Natural Language)