A syntactic parser describes a sentence's grammatical structure, to help other applications reason about it. Natural languages introduce many unexpected ambiguities, which our world-knowledge immediately filters out. A favourite example is prepositional-phrase attachment:
In the correct parse, "with" attaches to "pizza"; in the incorrect parse, "with" attaches to "eat".
Over the last few years, the Natural Language Processing (NLP) community has made great progress in syntactic parsing. It's now possible for a tiny Python implementation to perform better than the widely-used Stanford parser.
The rest of the article sets up the problem, and then takes you through a concise implementation prepared for this post. The first 200 lines of parser.py are the part-of-speech tagger and learner, described here. You should probably at least skim that post before reading this one, unless you're very familiar with NLP research.
The Cython system, Redshift, was written for my current research. I plan to improve it for general use in June, after my contract with Macquarie University expires. The current version is hosted on GitHub.
Problem Description
It would be very nice to be able to say something like this to your phone:
Set volume to zero when I'm in a meeting, unless John's school calls.
And then have the appropriate policy applied. On Android you can do this sort of thing with Tasker, but a natural-language interface would be much better. It would be especially nice to receive a meaning representation you could edit, so you could see what it thinks you said and correct it.
There are many problems to solve to make that work, but some kind of syntactic representation is absolutely necessary. We need to know that:
Unless John's school calls, when I'm in a meeting, set volume to zero
is another way of phrasing the first instruction, while:
Unless John's school, call when I'm in a meeting
means something completely different.
A dependency parser returns a graph of word-word relationships, intended to make such reasoning easier. Our graphs will be trees: edges are directed, and every node (word) has exactly one incoming arc (one head dependency), except the root.
Example usage:
>>> parser = parser.Parser()
>>> tokens = "Set the volume to zero when I 'm in a meeting unless John 's school calls".split()
>>> tags, heads = parser.parse(tokens)
>>> heads
[-1, 2, 0, 0, 3, 0, 7, 5, 7, 10, 8, 0, 13, 15, 15, 11]
>>> for i, h in enumerate(heads):
...     head = tokens[h] if h >= 0 else 'None'
...     print(tokens[i] + ' <-- ' + head)
Set <-- None
the <-- volume
volume <-- Set
to <-- Set
zero <-- to
when <-- Set
I <-- 'm
'm <-- when
in <-- 'm
a <-- meeting
meeting <-- in
unless <-- Set
John <-- 's
's <-- calls
school <-- calls
calls <-- unless
The idea is that it should be slightly easier to reason from the parse than from the string. The parse-to-meaning mapping is hopefully simpler than the string-to-meaning mapping.
The most confusing thing about this problem area is that "correctness" is defined by convention, by annotation guidelines. If you haven't read the guidelines and you're not a linguist, you can't tell whether a parse is "wrong" or "right", which makes the whole task feel weird and artificial.
For example, there's an error in the parse above: "John's school calls" is structured incorrectly according to the Stanford annotation guidelines. The structure of that part of the sentence is how the annotators were told to parse an example like "John's school clothes".
This point is worth dwelling on. We could, in theory, have written the guidelines so that the "correct" parses were the reverse. There is good reason to believe the parsing task would then become harder, because the decision would be less consistent with the rest of the grammar. [2] But we could test that empirically, and we would be happy to gain an advantage by reversing the policy.
We definitely do need the distinctions that come from convention, though; we don't want both readings to receive the same structure, or the output won't be very useful. The annotation guidelines strike a balance between which distinctions downstream applications will find useful and which ones parsers will be able to predict easily.
Projective Trees
There's a particularly useful simplification we can make when deciding what kind of graph to build: we can restrict the graph structures we'll be dealing with. This doesn't just give us a likely advantage in learnability; it can also deepen the algorithmic implications. Like most work on English parsing, we follow the constraint that the dependency graphs be projective trees:
Tree. Every word has exactly one head, except for the dummy root.
Projective. Writing each dependency with its endpoints in sentence order, for every pair of dependencies (a1, a2) and (b1, b2), the arcs must not cross. In other words, you can't have a pair of dependencies in the pattern a1 b1 a2 b2 or b1 a1 b2 a2.
There's a rich literature on parsing non-projective trees, and a smaller literature on parsing directed acyclic graphs (DAGs). But the parsing algorithm I'll be explaining deals with projective trees.
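To make the projectivity constraint concrete, here is a small standalone check (my own illustration, not part of parser.py), using the same heads-array convention as the example above, where heads[i] is the index of word i's head and the root is marked with -1:

def is_projective(heads):
    # heads[i] is the index of word i's head; the root is marked with -1.
    for child1, head1 in enumerate(heads):
        if head1 == -1:
            continue
        a1, a2 = sorted((child1, head1))
        for child2, head2 in enumerate(heads):
            if head2 == -1:
                continue
            b1, b2 = sorted((child2, head2))
            # Two arcs cross when their endpoints interleave: a1 b1 a2 b2.
            if a1 < b1 < a2 < b2:
                return False
    return True

The O(n^2) pairwise check is only for illustration; the transition system described below can only produce projective trees, so the parser never needs to test for this explicitly.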
Greedy Transition-Based Parsing
Our parser takes a list of string tokens as input, and outputs a list of head indices representing edges in the graph. If the ith member of heads is j, the dependency parse contains an edge (j, i). A transition-based parser is a finite-state transducer: it maps an array of N words onto an output array of N head indices.
The heads array denotes the head of each word. For instance, in a sentence beginning "MSNBC reported ...", the head of MSNBC is reported: MSNBC is word 1, reported is word 2, and heads[1] == 2. You can already see why the tree structure is so convenient: if we had to output a DAG, where a word may have multiple heads, this representation wouldn't work.
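As a tiny illustration (a toy sentence of my own, using the same convention as the usage example above: 0-based word indices, with -1 marking the root), the whole parse fits in one flat array precisely because every word has exactly one head:

words = ['She', 'eats', 'fish']
heads = [1, -1, 1]    # 'She' <-- 'eats', 'eats' <-- root, 'fish' <-- 'eats'

for child, head in enumerate(heads):
    print(words[child], '<--', words[head] if head >= 0 else 'None')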
While the heads can be represented as an array, we'd like to maintain some alternate ways to access the parse, so that features can be extracted easily and efficiently. This is what the Parse class looks like:
class Parse(object):
    def __init__(self, n):
        self.n = n
        self.heads = [None] * (n - 1)
        self.lefts = []
        self.rights = []
        for i in range(n + 1):
            self.lefts.append(DefaultList(0))
            self.rights.append(DefaultList(0))

    def add_arc(self, head, child):
        self.heads[child] = head
        if child < head:
            self.lefts[head].append(child)
        else:
            self.rights[head].append(child)
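The Parse class above relies on a small DefaultList helper: a list that returns a default value, instead of raising IndexError, when you read past its end, which keeps the feature-extraction code below simple. A minimal sketch (the version in parser.py may differ in detail):

class DefaultList(list):
    """A list that returns a default value for out-of-range reads."""
    def __init__(self, default=None):
        self.default = default
        list.__init__(self)

    def __getitem__(self, index):
        try:
            return list.__getitem__(self, index)
        except IndexError:
            return self.default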
As well as the parse, we also need to keep track of where we're up to in the sentence. We'll do this with an index into the words array, plus a stack mechanism: words are pushed onto the stack, and popped off once their head has been set. So our state data structure is fundamentally:
- An index, i, into the list of tokens;
- The dependencies added so far, in the Parse object;
- A stack containing words that occurred before i, for which we're yet to assign a head.
Each step of the parsing process applies one of three actions to the state:
SHIFT = 0; RIGHT = 1; LEFT = 2
MOVES = [SHIFT, RIGHT, LEFT]

def transition(move, i, stack, parse):
    global SHIFT, RIGHT, LEFT
    if move == SHIFT:
        stack.append(i)
        return i + 1
    elif move == RIGHT:
        parse.add_arc(stack[-2], stack.pop())
        return i
    elif move == LEFT:
        parse.add_arc(i, stack.pop())
        return i
    raise GrammarError("Unknown move: %d" % move)
The LEFT and RIGHT actions add dependencies and pop the stack, while SHIFT pushes the stack and advances i into the buffer.
So the parser starts with an empty stack, a buffer index at 0, and no dependencies recorded. It chooses one of the valid actions and applies it to the state, then keeps choosing and applying actions until the stack is empty and the buffer index has reached the end of the input array. (It's hard to follow this sort of algorithm without stepping through it. Try coming up with a sentence, drawing a projective parse tree over it, and then reaching that tree by choosing the right sequence of transitions.)
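Here is a worked toy trace (my own example; the '<ROOT>' padding token and the three helper functions are just for illustration, mirroring transition() above, with arcs recorded in a plain dict so the snippet is self-contained):

words = ['Eat', 'the', 'pizza', '<ROOT>']   # a dummy root token at the end
arcs = {}                                   # child index -> head index

def shift(i, stack):
    stack.append(i)
    return i + 1

def right(i, stack):
    child = stack.pop()
    arcs[child] = stack[-1]   # head is the word underneath it on the stack
    return i

def left(i, stack):
    child = stack.pop()
    arcs[child] = i           # head is the first word of the buffer
    return i

stack, i = [0], 1
i = shift(i, stack)   # stack: [Eat, the]     buffer: pizza <ROOT>
i = left(i, stack)    # 'the' <-- 'pizza'     stack: [Eat]
i = shift(i, stack)   # stack: [Eat, pizza]   buffer: <ROOT>
i = right(i, stack)   # 'pizza' <-- 'Eat'     stack: [Eat]
i = left(i, stack)    # 'Eat' <-- '<ROOT>'    stack: []
print(arcs)           # {1: 2, 2: 0, 0: 3}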
The following is the parsing loop in the code:
class Parser(object):
    ...
    def parse(self, words):
        tags = self.tagger(words)
        n = len(words)
        idx = 1
        stack = [0]
        deps = Parse(n)
        while stack or idx < n:
            features = extract_features(words, tags, idx, n, stack, deps)
            scores = self.model.score(features)
            valid_moves = get_valid_moves(idx, n, len(stack))
            next_move = max(valid_moves, key=lambda move: scores[move])
            idx = transition(next_move, idx, stack, deps)
        return tags, deps
def get_valid_moves(i, n, stack_depth):
    moves = []
    if i < n:
        moves.append(SHIFT)
    if stack_depth >= 2:
        moves.append(RIGHT)
    if stack_depth >= 1:
        moves.append(LEFT)
    return moves
We start with a tagged sentence and initialise the state. The state is then mapped to a feature set, which is scored by a linear model. We then find the valid action with the highest score and apply it to the state.
The scoring model works the same way it did in the POS tagger. If you're confused about the idea of extracting features and scoring them with a linear model, you should review that post. Here's a reminder of how the scoring model works:
class Perceptron(object):
    ...
    def score(self, features):
        all_weights = self.weights
        scores = dict((clas, 0) for clas in self.classes)
        for feat, value in features.items():
            if value == 0:
                continue
            if feat not in all_weights:
                continue
            weights = all_weights[feat]
            for clas, weight in weights.items():
                scores[clas] += value * weight
        return scores
This just sums each feature's class weights. It's often expressed as a dot product, but I found that didn't handle many classes as neatly.
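For instance, here's a toy, hand-made illustration (these are not weights from a trained model) of how the per-feature, per-class weight layout produces the same result a dot product would:

SHIFT, RIGHT, LEFT = 0, 1, 2            # same move ids as above

# Weights are stored per feature, then per class.
weights = {
    'w=volume': {SHIFT: 0.1, RIGHT: -0.3, LEFT: 0.2},
    't=NN':     {SHIFT: 0.4, LEFT: -0.1},
}
features = {'w=volume': 1, 't=NN': 1, 'w=foo': 1}   # 'w=foo' was never seen, so it is skipped

scores = {SHIFT: 0.0, RIGHT: 0.0, LEFT: 0.0}
for feat, value in features.items():
    for clas, weight in weights.get(feat, {}).items():
        scores[clas] += value * weight

print(scores)   # {0: 0.5, 1: -0.3, 2: 0.1} -- SHIFT scores highest on this toy state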
The beam parser (Redshift) tracks multiple candidate analyses and only commits to the best one at the end. We're going to trade accuracy for efficiency and simplicity: we'll maintain a single analysis. Our search strategy will be entirely greedy, just as it was with the POS tagger; we lock in our choice at every step.
If you read the POS-tagger post carefully, you might notice the underlying similarity. What we've done is map the parsing problem onto a sequence-labelling problem, which we address with a "flat", or unstructured, learning algorithm (by doing greedy search).
Feature Set
Feature extraction code is always pretty ugly. The features for the parser refer to a few tokens from the context:
- The first three words of the buffer (n0, n1, n2)
- The top three words of the stack (s0, s1, s2)
- The two leftmost children of s0 (s0b1, s0b2)
- The two rightmost children of s0 (s0f1, s0f2)
- The two leftmost children of n0 (n0b1, n0b2)
For these 12 tokens, we refer to the word form, the part-of-speech tag, and the number of left and right children attached to the token.
Because we're using a linear model, the features are conjunctions (pairs and triples) of these atomic properties.
def extract_features(words, tags, n0, n, stack, parse):
def get_stack_context(depth, stack, data):
        if depth >= 3:
return data[stack[-1]], data[stack[-2]], data[stack[-3]]
elif depth >= 2:
return data[stack[-1]], data[stack[-2]], ''
elif depth == 1:
return data[stack[-1]], '', ''
else:
return '', '', ''
def get_buffer_context(i, n, data):
if i + 1 >= n:
return data[i], '', ''
elif i + 2 >= n:
return data[i], data[i + 1], ''
else:
return data[i], data[i + 1], data[i + 2]
def get_parse_context(word, deps, data):
if word == -1:
return 0, '', ''
deps = deps[word]
valency = len(deps)
if not valency:
return 0, '', ''
elif valency == 1:
return 1, data[deps[-1]], ''
else:
return valency, data[deps[-1]], data[deps[-2]]
features = {}
# Set up the context pieces --- the word, W, and tag, T, of:
# S0-2: Top three words on the stack
# N0-2: First three words of the buffer
# n0b1, n0b2: Two leftmost children of the first word of the buffer
# s0b1, s0b2: Two leftmost children of the top word of the stack
# s0f1, s0f2: Two rightmost children of the top word of the stack
depth = len(stack)
s0 = stack[-1] if depth else -1
Ws0, Ws1, Ws2 = get_stack_context(depth, stack, words)
Ts0, Ts1, Ts2 = get_stack_context(depth, stack, tags)
Wn0, Wn1, Wn2 = get_buffer_context(n0, n, words)
Tn0, Tn1, Tn2 = get_buffer_context(n0, n, tags)
Vn0b, Wn0b1, Wn0b2 = get_parse_context(n0, parse.lefts, words)
Vn0b, Tn0b1, Tn0b2 = get_parse_context(n0, parse.lefts, tags)
Vn0f, Wn0f1, Wn0f2 = get_parse_context(n0, parse.rights, words)
_, Tn0f1, Tn0f2 = get_parse_context(n0, parse.rights, tags)
Vs0b, Ws0b1, Ws0b2 = get_parse_context(s0, parse.lefts, words)
_, Ts0b1, Ts0b2 = get_parse_context(s0, parse.lefts, tags)
Vs0f, Ws0f1, Ws0f2 = get_parse_context(s0, parse.rights, words)
_, Ts0f1, Ts0f2 = get_parse_context(s0, parse.rights, tags)
# Cap numeric features at 5?
# String-distance
Ds0n0 = min((n0 - s0, 5)) if s0 != 0 else 0
features['bias'] = 1
# Add word and tag unigrams
for w in (Wn0, Wn1, Wn2, Ws0, Ws1, Ws2, Wn0b1, Wn0b2, Ws0b1, Ws0b2, Ws0f1, Ws0f2):
if w:
features['w=%s' % w] = 1
for t in (Tn0, Tn1, Tn2, Ts0, Ts1, Ts2, Tn0b1, Tn0b2, Ts0b1, Ts0b2, Ts0f1, Ts0f2):
if t:
features['t=%s' % t] = 1
# Add word/tag pairs
for i, (w, t) in enumerate(((Wn0, Tn0), (Wn1, Tn1), (Wn2, Tn2), (Ws0, Ts0))):
if w or t:
features['%d w=%s, t=%s' % (i, w, t)] = 1
# Add some bigrams
features['s0w=%s, n0w=%s' % (Ws0, Wn0)] = 1
features['wn0tn0-ws0 %s/%s %s' % (Wn0, Tn0, Ws0)] = 1
features['wn0tn0-ts0 %s/%s %s' % (Wn0, Tn0, Ts0)] = 1
features['ws0ts0-wn0 %s/%s %s' % (Ws0, Ts0, Wn0)] = 1
features['ws0-ts0 tn0 %s/%s %s' % (Ws0, Ts0, Tn0)] = 1
features['wt-wt %s/%s %s/%s' % (Ws0, Ts0, Wn0, Tn0)] = 1
features['tt s0=%s n0=%s' % (Ts0, Tn0)] = 1
features['tt n0=%s n1=%s' % (Tn0, Tn1)] = 1
# Add some tag trigrams
trigrams = ((Tn0, Tn1, Tn2), (Ts0, Tn0, Tn1), (Ts0, Ts1, Tn0),
(Ts0, Ts0f1, Tn0), (Ts0, Ts0f1, Tn0), (Ts0, Tn0, Tn0b1),
(Ts0, Ts0b1, Ts0b2), (Ts0, Ts0f1, Ts0f2), (Tn0, Tn0b1, Tn0b2),
(Ts0, Ts1, Ts1))
for i, (t1, t2, t3) in enumerate(trigrams):
if t1 or t2 or t3:
features['ttt-%d %s %s %s' % (i, t1, t2, t3)] = 1
# Add some valency and distance features
vw = ((Ws0, Vs0f), (Ws0, Vs0b), (Wn0, Vn0b))
vt = ((Ts0, Vs0f), (Ts0, Vs0b), (Tn0, Vn0b))
d = ((Ws0, Ds0n0), (Wn0, Ds0n0), (Ts0, Ds0n0), (Tn0, Ds0n0),
('t' + Tn0+Ts0, Ds0n0), ('w' + Wn0+Ws0, Ds0n0))
for i, (w_t, v_d) in enumerate(vw + vt + d):
if w_t or v_d:
features['val/d-%d %s %d' % (i, w_t, v_d)] = 1
return features
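As a quick sanity check, here is a toy call on a hand-made state (the tags and padding tokens below are made up, not output from the real tagger), just to see what kinds of keys come out:

words = ['<start>', 'Set', 'the', 'volume', 'to', 'zero', 'ROOT']
tags  = ['<start>', 'VB', 'DT', 'NN', 'IN', 'CD', 'ROOT']
parse = Parse(len(words))
stack = [1]      # 'Set' is on the stack
n0 = 2           # 'the' is the first word of the buffer
feats = extract_features(words, tags, n0, len(words), stack, parse)
print(feats['bias'])                  # 1
print(feats['w=the'], feats['t=DT'])  # 1 1
print(feats['tt s0=VB n0=DT'])        # 1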
Training
Weights are learned using the same algorithm as the POS tagger: the averaged perceptron. Its key strength is that it's an online learning algorithm: examples stream in one by one, we make our prediction, check the actual answer, and adjust our beliefs (weights) if we were wrong.
The training loop looks like this:
class Parser(object):
    ...
    def train_one(self, itn, words, gold_tags, gold_heads):
        n = len(words)
        i = 2; stack = [1]; parse = Parse(n)
        tags = self.tagger.tag(words)
        while stack or (i + 1) < n:
            features = extract_features(words, tags, i, n, stack, parse)
            scores = self.model.score(features)
            valid_moves = get_valid_moves(i, n, len(stack))
            guess = max(valid_moves, key=lambda move: scores[move])
            gold_moves = get_gold_moves(i, n, stack, parse.heads, gold_heads)
            best = max(gold_moves, key=lambda move: scores[move])
            self.model.update(best, guess, features)
            i = transition(guess, i, stack, parse)
        # Return the number of correct head attachments
        return len([i for i in range(n - 1) if parse.heads[i] == gold_heads[i]])
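The self.model.update(best, guess, features) call is the perceptron update covered in the POS-tagger post. As a reminder, here is a minimal sketch of the non-averaged version, assuming the per-feature, per-class weights layout used by score() above (the real implementation also keeps running totals so the weights can be averaged at the end of training):

class Perceptron(object):
    ...
    def update(self, truth, guess, features):
        # Nothing to do if the best zero-cost move was also our guess.
        if truth == guess:
            return
        for feat, value in features.items():
            weights = self.weights.setdefault(feat, {})
            # Nudge the weights towards the move we should have preferred,
            # and away from the move we actually guessed.
            weights[truth] = weights.get(truth, 0.0) + value
            weights[guess] = weights.get(guess, 0.0) - value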
The most interesting part of the training process is get_gold_moves. The performance of our parser is made possible by an advance in training methods worked out by Goldberg and Nivre (2012), who showed that we had been doing it wrong for years.
In the POS-tagging post, I cautioned that during training you need to make sure you pass in the last two predicted tags as features for the current tag, not the last two gold tags. At test time you will only have the predicted tags, so if your features were based on the gold sequence during training, your training contexts won't resemble your test contexts, and you'll learn the wrong weights.
The problem we face in parsing is that we don't know how to pass in a "predicted sequence"! Training used to work by taking the gold-standard tree and finding a transition sequence that produces it: you got back a sequence of moves with the guarantee that, if you performed those moves, you'd end up with the gold-standard dependencies.
The problem is that if the parser is in any state that isn't along that gold-standard sequence, we don't know how to teach it the "correct" move. Once the parser has made a mistake, we don't know how to train from the example.
This is a big problem, because it means that once the parser starts making mistakes, it will end up in states unlike any in its training data, causing yet more mistakes.
The problem was specific to greedy parsers: once you use a beam, there's a natural way to do structured prediction.
Like all the best breakthroughs, the solution seems obvious once you know it. What we do is define a function that asks, "How many gold-standard dependencies can be recovered from this state?" If you can define that function, then you can apply each move in turn and ask the same question of the resulting state. If the move you applied means fewer gold-standard dependencies are reachable, then it is sub-optimal.
That's a lot to take in.
So we have this function Oracle(state):
Oracle(state) = | gold_arcs ∩ reachable_arcs(state) |
We also have a set of moves, each of which returns a new state. We want to know:
shift_cost = Oracle(state) - Oracle(shift(state))
right_cost = Oracle(state) - Oracle(right(state))
left_cost = Oracle(state) - Oracle(left(state))
Now, at least one of those costs will be zero. Oracle(state) asks, "What is the cost of the best path forward?", and the first step of that best path must be shift, right, or left.
It turns out that Oracle can be derived fairly simply for many transition systems. The derivation for the transition system we're using, arc-hybrid, was worked out by Goldberg and Nivre (2013).
We implement the oracle as a function that returns the zero-cost moves, rather than implementing a function Oracle(state). This saves us from doing a bunch of costly copy operations. Hopefully the reasoning in the code isn't too hard to follow; if you're confused and want to get to the bottom of it, you can consult Goldberg and Nivre's papers.
def get_gold_moves(n0, n, stack, heads, gold):
def deps_between(target, others, gold):
for word in others:
if gold[word] == target or gold[target] == word:
return True
return False
valid = get_valid_moves(n0, n, len(stack))
if not stack or (SHIFT in valid and gold[n0] == stack[-1]):
return [SHIFT]
if gold[stack[-1]] == n0:
return [LEFT]
costly = set([m for m in MOVES if m not in valid])
# If the word behind s0 is its gold head, Left is incorrect
if len(stack) >= 2 and gold[stack[-1]] == stack[-2]:
costly.add(LEFT)
# If there are any dependencies between n0 and the stack,
# pushing n0 will lose them.
if SHIFT not in costly and deps_between(n0, stack, gold):
costly.add(SHIFT)
# If there are any dependencies between s0 and the buffer, popping
# s0 will lose them.
if deps_between(stack[-1], range(n0+1, n-1), gold):
costly.add(LEFT)
costly.add(RIGHT)
return [m for m in MOVES if m not in costly]
Training with this "dynamic oracle" strategy makes a big difference to accuracy, typically 1-2%, with no difference to the way the parser runs at test time. The old "static oracle" greedy training strategy is fully obsolete; there's no reason to do it that way any more.
Summary
Language technologies, particularly those relating to grammar, used to seem especially mysterious to me. I couldn't imagine what sort of program could do this.
I think it's natural for people to assume that the best solutions must be enormously complicated. A 200,000-line Java package feels about right.
But code that implements a single algorithm is often quite short. When you implement only one algorithm, you know what you need to write before you write it, and you don't need any of the unnecessary abstractions that can have a big performance cost.
Notes
[1] I'm really not sure how to count the lines of code in the Stanford parser. Its jar file ships over 200k of content, including a large number of different models. It's not important, but it seems safe to say it's somewhere around 50k lines.
[2] For example, how would you parse "John's school of music calls"? You want to make sure the phrase "John's school" has the same structure in that sentence as it does in "John's school calls". Reasoning about the different "slots" a phrase can be put into is a key way we reason about syntactic analyses. You can think of each phrase as a connector with a particular shape, which needs to plug into different slots; each phrase also has a certain number of slots, each with its own shape. We're trying to figure out which connectors go where, so we can work out how the sentence fits together.
[3] There is a newer version of the Stanford parser that uses a "deep learning" technique, which is somewhat more accurate. However, the accuracy of the final model still lags behind the best shift-reduce (transition-based) parsers. It's a great paper, and it doesn't matter much that the idea was implemented on a parser that isn't state-of-the-art.
[4] A detail: the Stanford dependencies are actually generated automatically from gold-standard phrase-structure trees. See the Stanford Dependency Converter page here: http://nlp.stanford.edu/software/stanford-dependencies.shtml.
Idle Speculation
For a long time, incremental language processing algorithms were primarily of scientific interest. If you want to write a parser to test a theory about how the human sentence processor works, that parser needs to build partial interpretations. There is ample evidence here, including common-sense introspection, that we do not buffer the whole input and analyse it only once the speaker has finished.
But now, quite apart from their neat scientific properties, these algorithms are winning! As best as I can tell, the secret is that they are:
Incremental. They restrict the search using the text seen so far, working left to right.
Error-driven. Training operates on the assumption that errors will be made, and updates from those error states.
The connection to human sentence processing looks tantalising. I look forward to seeing whether these engineering breakthroughs lead to any advances in psycholinguistics.
References
The NLP literature is almost entirely open access. All of the relevant papers can be found here: http://aclweb.org/anthology/.
The parser I've described is an implementation of a dynamic-oracle arc-hybrid system, described here:
Goldberg, Yoav; Nivre, Joakim
Training Deterministic Parsers with Non-Deterministic Oracles
TACL 2013
However, I wrote my own feature set for it. The arc-hybrid system was originally described here:
Kuhlmann, Marco; Gómez-Rodríguez, Carlos; Satta, Giorgio
Dynamic Programming Algorithms for Transition-Based Dependency Parsers
ACL 2011
The dynamic Oracle training method was originally described here:
Goldberg, Yoav; Nivre, Joakim
A Dynamic Oracle for Arc-Eager Dependency Parsing
COLING 2012
This work depended on a major breakthrough in the accuracy of transition-based parsers, when beam search was properly explored by Zhang and Clark. They have published many papers, but the preferred citation is:
Zhang, Yue; Clark, Stephen
Syntactic Processing Using the Generalized Perceptron and Beam Search
Computational Linguistics 2011 (1)
Another important article is this little feature-engineering paper, which further improved the accuracy:
Zhang, Yue; Nivre, Joakim
Transition-based Dependency Parsing with Rich Non-local Features
ACL 2011
The generalized perceptron, which serves as the learning framework for these beam parsers, is from this paper:
Collins, Michael.
Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms
EMNLP 2002
Experiment Details
The results quoted at the start of the article are from Section 22 of the Wall Street Journal corpus. The Stanford parser was run as follows:
java -mx10000m -cp "$scriptdir/*:" edu.stanford.nlp.parser.lexparser.LexicalizedParser \
    -outputFormat "penn" edu/stanford/nlp/models/lexparser/englishFactored.ser.gz $*
A small amount of post-processing was applied to undo the retokenisation the Stanford parser applies to numbers, so that the tokens match the PTB:

"""Stanford parser retokenises numbers. Split them."""
import sys
import re

qp_re = re.compile('\xc2\xa0')
for line in sys.stdin:
    line = line.rstrip()
    if qp_re.search(line):
        line = line.replace('(CD', '(QP (CD', 1) + ')'
        line = line.replace('\xc2\xa0', ') (CD ')
    print(line)
The resulting PTB-format files were then converted into dependencies using the Stanford converter:
for f in $1/*.mrg; do
    echo $f
    grep -v CODE $f > "$f.2"
    out="$f.dep"
    java -mx800m -cp "$scriptdir/*:" edu.stanford.nlp.trees.EnglishGrammaticalStructure \
        -treeFile "$f.2" -basic -makeCopulaHead -conllx > $out
done
I can't easily read that any more, but it should just convert every .mrg file in a directory into a CoNLL-format Stanford basic dependencies file, using the settings common in the dependency literature.
I then converted the gold-standard trees from Wall Street Journal Section 22 for the evaluation. Accuracy scores refer to unlabelled attachment score (i.e. the head index) over all non-punctuation tokens.
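In code, that evaluation amounts to something like this sketch (the exact punctuation filter and the CoNLL file handling are simplified here):

def attachment_score(gold_heads, pred_heads, tags):
    """Unlabelled attachment score: the fraction of non-punctuation tokens
    whose predicted head index matches the gold-standard head index."""
    punct = set([',', '.', ':', '``', "''"])   # a crude punctuation filter
    correct = 0
    total = 0
    for gold, pred, tag in zip(gold_heads, pred_heads, tags):
        if tag in punct:
            continue
        total += 1
        correct += (gold == pred)
    return float(correct) / total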
To train parser.py, I fed the gold-standard PTB trees for Wall Street Journal sections 02-21 through the same conversion script.
In a word: the Stanford model and parser.py are trained on the same set of sentences, and each makes its predictions on a held-out test set for which we know the answers. Accuracy refers to how many of the words' heads we got correct.
Speeds were measured on a 2.4GHz Xeon. I ran the experiments on a server, to give the Stanford parser more memory; the parser.py system runs fine on my MacBook Air. For the parser.py experiments I used PyPy; CPython was about half as fast on an early benchmark.
One reason parser.py runs so fast is that it does unlabelled parsing. Based on previous experiments, a labelled parser would likely be about 40 times slower and about 1% more accurate. Adapting the program to labelled parsing would be a good exercise for the reader, if you can get access to the data.
The results for the Redshift parser were taken from version b6b624c9900f3bf, run as follows:
./scripts/train.py -x zhang+stack -k 8 -p ~/data/stanford/train.conll ~/data/parsers/tmp
./scripts/parse.py ~/data/parsers/tmp ~/data/stanford/devi.txt /tmp/parse/
./scripts/evaluate.py /tmp/parse/parses ~/data/stanford/dev.conll