This article shows how to implement an English parser in only 500 lines of Python. Natural language processing has recently become a hot topic in the industry, and the author is an NLP developer. A syntactic parser describes the grammatical structure of a sentence, which helps other applications reason about it. Natural language introduces many unexpected ambiguities, which our knowledge of the world resolves quickly. Here is an example that I like very much:
The correct parse attaches "with" to "pizza", while the incorrect parse attaches "with" to "eat".
Over the past few years, the natural language processing (NLP) community has made great progress in syntactic parsing. A small Python implementation may now perform better than the widely used Stanford parser.
The rest of the article first sets up the problem and then walks through a concise implementation prepared for it. The first 200 lines of parser.py, the part-of-speech tagger and learner, are described in a separate article. Unless you are very familiar with NLP research, you should at least skim that article before studying this one.
The Cython system, Redshift, was written for my current research. After my contract with Macquarie University expires, I plan to improve it in May for general use. The current version is hosted on GitHub.
Problem Description
It would be very convenient to type a command like this into your mobile phone:
Set volume to zero when I'm in a meeting, unless John's school calls.
and have it configure the appropriate policy. On Android you can do this with Tasker, but a natural-language interface would be better. It would be especially convenient to receive an editable semantic representation, so that you could see what the system thinks you meant and correct it.
Many problems must be solved to make this work, but some kind of syntactic representation is definitely necessary. We need to know that:
Unless John's school calls, when I'm in a meeting, set volume to zero
is another way of phrasing the same command, while
Unless John's school, call when I'm in a meeting
means something completely different.
A dependency parser returns the word-to-word relationships in a sentence, which makes such reasoning easier. The relation graph is a tree with directed edges: every node (word) has exactly one incoming arc (the dependency on its head), except the root.
Usage example:
    >>> parser = parser.Parser()
    >>> tokens = "Set the volume to zero when I 'm in a meeting unless John 's school calls".split()
    >>> tags, heads = parser.parse(tokens)
    >>> heads
    [-1, 2, 0, 0, 3, 0, 7, 5, 7, 10, 8, 0, 13, 15, 15, 11]
    >>> for i, h in enumerate(heads):
    ...     # heads[i] is the index of word i's head; -1 marks the root
    ...     head = tokens[h] if h >= 0 else 'None'
    ...     print(tokens[i] + ' <-- ' + head)
    Set <-- None
    the <-- volume
    volume <-- Set
    to <-- Set
    zero <-- to
    when <-- Set
    I <-- 'm
    'm <-- when
    in <-- 'm
    a <-- meeting
    meeting <-- in
    unless <-- Set
    John <-- 's
    's <-- calls
    school <-- calls
    calls <-- unless
The idea is that it should be slightly easier to reason from the parse than from the raw string. The mapping from parse to meaning is expected to be simpler than the mapping from string to meaning.
The most confusing aspect of this problem is that correctness is defined by convention, that is, by the annotation guidelines. If you have not read the guidelines and are not a linguist, you cannot judge whether a parse is correct. This makes the whole task feel strange and artificial.
For example, the parse above contains an error: according to the Stanford annotation guidelines, "John's school calls" is structured incorrectly. The structure given to that part of the sentence is the one annotators are instructed to use for an example like "John's school clothes".
This point is worth dwelling on. In theory, we could have written the guidelines so that the "correct" parses were reversed. There is good reason to believe that parsing would become harder if we reversed the convention, because it would be less consistent with how the rest of the language is annotated. [2] But we could test that empirically, and we would be glad to gain an advantage by reversing the policy.
We do need that distinction in the guidelines: we do not want both phrases to receive the same structure, or the output would be much less useful. The annotation guidelines strike a balance between the distinctions that downstream applications find useful and those that parsers can predict easily.
Projective trees
When deciding what the dependency graph should look like, we can make a particularly useful simplification: restrict the graph structures we have to handle. This helps not only learnability but can also have deep algorithmic implications. Following most work on English parsing, we constrain the dependency graphs to be projective trees:
Tree. Every word has exactly one head, except the root.
Projective. For each pair of dependencies (a1, a2) and (b1, b2), if a1 < b2, then a2 >= b2. In other words, dependencies cannot cross: there cannot be a pair of dependencies of the form a1 b1 a2 b2 or b1 a1 b2 a2.
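The no-crossing constraint can be checked directly on a heads array. The sketch below is an illustration only: the function name is_projective and the use of None to mark the root are assumptions, not part of parser.py, and the quadratic loop is chosen for clarity rather than speed.

```python
def is_projective(heads):
    """Return True if no two dependency arcs cross.

    heads[i] is the index of word i's head, or None for the root
    (an assumption made for this sketch).
    """
    # Represent each arc by its (left, right) word positions.
    arcs = [(min(h, c), max(h, c)) for c, h in enumerate(heads) if h is not None]
    for l1, r1 in arcs:
        for l2, r2 in arcs:
            # The arcs cross when the second starts strictly inside
            # the first and ends strictly outside it.
            if l1 < l2 < r1 < r2:
                return False
    return True


# "She ate pizza": both other words attach to "ate", nothing crosses.
print(is_projective([1, None, 1]))
# Arcs 2->0 and 3->1 interleave, so this tree is non-projective.
print(is_projective([2, 3, None, 2]))
```

A parser restricted to projective trees can never produce the second structure, which is the price of the simplification.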
There is an extensive literature on parsing non-projective trees, and a smaller one on parsing directed acyclic graphs (DAGs). The parsing algorithm I describe here works on projective trees.
Greedy transition-based parsing
The parser takes a list of string tokens as input and outputs a list of head indices representing the edges of the graph. If the i-th element of heads is j, the dependency parse contains an edge (j, i). A transition-based parser is a finite-state transducer that maps an array of N words onto an output array of N head indices.
The heads array records, for instance, that the head of MSNBC is reported: the word index of MSNBC is 1, the word index of reported is 2, and heads[1] == 2. You can already see why a tree structure is so convenient: if we had to output a DAG, in which a word may have several heads, this data structure would no longer work.
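To make the heads-array encoding concrete, the sketch below converts a heads list into explicit (head, child) edges and groups children under each head. The three-word sentence, the helper names, and the use of None for the root are assumptions made for illustration.

```python
def heads_to_edges(heads):
    """Turn a heads array into (head, child) edges, skipping the root."""
    return [(h, i) for i, h in enumerate(heads) if h is not None]


def children_of(heads):
    """Group child indices under the index of their head."""
    children = {}
    for head, child in heads_to_edges(heads):
        children.setdefault(head, []).append(child)
    return children


words = ["She", "ate", "pizza"]
heads = [1, None, 1]          # "She" and "pizza" both depend on "ate"
edges = heads_to_edges(heads)
for head, child in edges:
    print(words[child] + " <-- " + words[head])
```

Because every word has at most one head, a flat list indexed by child is enough; a DAG would need a list of heads per word.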
Although heads can be represented as an array, we would also like to maintain some alternative ways of accessing the parse, so that features can be extracted conveniently and efficiently. The Parse class looks like this:
    class Parse(object):
        def __init__(self, n):
            self.n = n
            self.heads = [None] * (n-1)
            # lefts[i] / rights[i]: children of word i on each side
            self.lefts = []
            self.rights = []
            for i in range(n+1):
                self.lefts.append(DefaultList(0))
                self.rights.append(DefaultList(0))

        def add_arc(self, head, child):
            self.heads[child] = head
            if child < head:
                self.lefts[head].append(child)
            else:
                self.rights[head].append(child)
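The Parse class relies on a DefaultList helper that is not shown in this article. A minimal sketch, assuming it is simply a list that returns a fixed default for out-of-range indices instead of raising IndexError:

```python
class DefaultList(list):
    """A list that returns a default value for out-of-range indices.

    This is an assumed implementation; the article does not show one.
    """

    def __init__(self, default=None):
        super().__init__()
        self.default = default

    def __getitem__(self, index):
        try:
            return super().__getitem__(index)
        except IndexError:
            return self.default


lefts = DefaultList(0)
lefts.append(3)
print(lefts[0], lefts[5])   # an in-range lookup, then an out-of-range one
```

With a helper like this, feature code can ask for "the second child" of a word that has only one child without guarding every lookup.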
As well as the parse, we need to keep track of our position in the sentence. We do this with an index into the words array and a stack: words are pushed onto the stack and popped once their head has been set. So our state data structure is fundamentally:
- An index, i, into the list of tokens
- The dependencies added to the parse so far
- A stack containing words that occurred before index i and have not yet been assigned a head
Each step of the parsing process applies one of three operations to the state:
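    SHIFT = 0; RIGHT = 1; LEFT = 2
    MOVES = [SHIFT, RIGHT, LEFT]

    def transition(move, i, stack, parse):
        global SHIFT, RIGHT, LEFT
        if move == SHIFT:
            # Push the first buffer word and advance the buffer index
            stack.append(i)
            return i + 1
        elif move == RIGHT:
            # The second word on the stack becomes the head of the top word
            parse.add_arc(stack[-2], stack.pop())
            return i
        elif move == LEFT:
            # The first buffer word becomes the head of the top word
            parse.add_arc(i, stack.pop())
            return i
        raise GrammarError("Unknown move: %d" % move)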
The LEFT and RIGHT operations add a dependency and pop the stack, while SHIFT pushes onto the stack and advances i into the buffer.
So the parser starts with an empty stack and the buffer index at 0, with no dependencies recorded. It chooses one of the valid operations and applies it to the current state, and it keeps choosing and applying operations until the stack is empty and the buffer index has reached the end of the input array. (This algorithm is hard to understand without stepping through it. Take a sentence, draw a projective parse tree over it, and then try to reach that tree by choosing the right sequence of transitions.)
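To make the stepping-through concrete, here is a small self-contained trace. It is only a sketch: the three-word sentence, the hand-picked move sequence, and the apply_move helper are assumptions made for illustration, and the loop stops with the root word still on the stack rather than running until the stack is empty.

```python
SHIFT, RIGHT, LEFT = 0, 1, 2
NAMES = {SHIFT: "SHIFT", RIGHT: "RIGHT", LEFT: "LEFT"}


def apply_move(move, i, stack, heads):
    """Apply one transition in place and return the new buffer index."""
    if move == SHIFT:
        stack.append(i)
        return i + 1
    child = stack.pop()
    if move == RIGHT:
        heads[child] = stack[-1]   # the new top of the stack is the head
    elif move == LEFT:
        heads[child] = i           # the first buffer word is the head
    else:
        raise ValueError("Unknown move: %r" % move)
    return i


words = ["She", "ate", "pizza"]
heads = [None] * len(words)
stack, i = [], 0
trace = []
for move in [SHIFT, LEFT, SHIFT, SHIFT, RIGHT]:
    i = apply_move(move, i, stack, heads)
    trace.append((NAMES[move], list(stack), i))
    print(NAMES[move], "stack =", stack, "buffer index =", i)
print("heads =", heads)
```

Changing the second move to SHIFT makes the arc from "ate" to "She" unreachable, which is a quick way to see why the order of moves matters.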
The parsing loop looks like this in code:
    class Parser(object):
        ...
        def parse(self, words):
            tags = self.tagger(words)
            n = len(words)
            idx = 1
            stack = [0]
            deps = Parse(n)
            while stack or idx < n:
                features = extract_features(words, tags, idx, n, stack, deps)
                scores = self.model.score(features)
                valid_moves = get_valid_moves(idx, n, len(stack))
                next_move = max(valid_moves, key=lambda move: scores[move])
                idx = transition(next_move, idx, stack, deps)
            # Return the heads list, as used in the usage example
            return tags, deps.heads


    def get_valid_moves(i, n, stack_depth):
        moves = []
        if i < n:
            moves.append(SHIFT)
        if stack_depth >= 2:
            moves.append(RIGHT)
        if stack_depth >= 1:
            moves.append(LEFT)
        return moves
We start by tagging the sentence and initializing the state. The state is then mapped to a set of features, which are scored with a linear model. Finally, we find the valid operation with the highest score and apply it to the state.
The scoring model works the same way as in the part-of-speech tagger. If you are unsure about extracting features and scoring them with a linear model, you should review that article. Here is a reminder of how the scoring model works:
    class Perceptron(object):
        ...
        def score(self, features):
            all_weights = self.weights
            scores = dict((clas, 0) for clas in self.classes)
            for feat, value in features.items():
                if value == 0:
                    continue
                # Features never seen in training contribute nothing
                if feat not in all_weights:
                    continue
                weights = all_weights[feat]
                for clas, weight in weights.items():
                    scores[clas] += value * weight
            return scores
Here we simply sum the class weights for each feature. This is often expressed as a dot product, but I find that formulation awkward when dealing with many classes.
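A standalone version of the same summation, followed by the greedy choice among valid moves, may make this easier to see. The weights and feature names below are invented for illustration and are not learned values; score is written as a plain function rather than a method so the block runs on its own.

```python
SHIFT, RIGHT, LEFT = 0, 1, 2


def score(features, weights, classes):
    """Sum, per class, the weights of every active feature."""
    scores = {clas: 0.0 for clas in classes}
    for feat, value in features.items():
        if value == 0 or feat not in weights:
            continue
        for clas, weight in weights[feat].items():
            scores[clas] += value * weight
    return scores


# Hypothetical weights: weights[feature][move] -> float
weights = {
    "bias": {SHIFT: 0.5, RIGHT: 0.125, LEFT: 0.125},
    "t=NN": {LEFT: 1.0, SHIFT: -0.25},
}
features = {"bias": 1, "t=NN": 1, "w=unseen": 1}

scores = score(features, weights, [SHIFT, RIGHT, LEFT])
# Only valid moves compete; here RIGHT is assumed to be invalid.
best = max([SHIFT, LEFT], key=lambda move: scores[move])
print(scores, best)
```

Storing the weights feature-first means an unseen feature costs a single dictionary miss, regardless of how many classes there are.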
The beam parser (RedShift) tracks multiple candidates and only picks the best one at the very end. Here we trade accuracy for efficiency and simplicity: we follow only a single analysis. Our search strategy is completely greedy, just as in part-of-speech tagging, so every choice is locked in at each step.
If you read the part-of-speech tagging article carefully, you may notice the underlying similarity. What we have done is map the parsing problem onto a sequence-labelling problem, which we solve with a "flat", or unstructured, learning algorithm (through greedy search).
Feature Set
Feature extraction code is always ugly. The parser's features refer to a few tokens from the context:
- The first three words in the buffer (n0, n1, n2)
- The top three words on the stack (s0, s1, s2)
- The two leftmost children of s0 (s0b1, s0b2)
- The two rightmost children of s0 (s0f1, s0f2)
- The two leftmost children of n0 (n0b1, n0b2)
For these 12 tokens, we refer to the word form, the part-of-speech tag, and the number of left and right children attached to the token.
Because a linear model is used, the features refer to pairs and triples of these atomic attributes.
    def extract_features(words, tags, n0, n, stack, parse):
        def get_stack_context(depth, stack, data):
            if depth >= 3:
                return data[stack[-1]], data[stack[-2]], data[stack[-3]]
            elif depth >= 2:
                return data[stack[-1]], data[stack[-2]], ''
            elif depth == 1:
                return data[stack[-1]], '', ''
            else:
                return '', '', ''

        def get_buffer_context(i, n, data):
            if i + 1 >= n:
                return data[i], '', ''
            elif i + 2 >= n:
                return data[i], data[i + 1], ''
            else:
                return data[i], data[i + 1], data[i + 2]

        def get_parse_context(word, deps, data):
            if word == -1:
                return 0, '', ''
            deps = deps[word]
            valency = len(deps)
            if not valency:
                return 0, '', ''
            elif valency == 1:
                return 1, data[deps[-1]], ''
            else:
                return valency, data[deps[-1]], data[deps[-2]]

        features = {}
        # Set up the context pieces --- the word, W, and tag, T, of:
        # S0-2: Top three words on the stack
        # N0-2: First three words of the buffer
        # n0b1, n0b2: Two leftmost children of the first word of the buffer
        # s0b1, s0b2: Two leftmost children of the top word of the stack
        # s0f1, s0f2: Two rightmost children of the top word of the stack

        depth = len(stack)
        s0 = stack[-1] if depth else -1

        Ws0, Ws1, Ws2 = get_stack_context(depth, stack, words)
        Ts0, Ts1, Ts2 = get_stack_context(depth, stack, tags)

        Wn0, Wn1, Wn2 = get_buffer_context(n0, n, words)
        Tn0, Tn1, Tn2 = get_buffer_context(n0, n, tags)

        Vn0b, Wn0b1, Wn0b2 = get_parse_context(n0, parse.lefts, words)
        Vn0b, Tn0b1, Tn0b2 = get_parse_context(n0, parse.lefts, tags)

        Vn0f, Wn0f1, Wn0f2 = get_parse_context(n0, parse.rights, words)
        _, Tn0f1, Tn0f2 = get_parse_context(n0, parse.rights, tags)

        Vs0b, Ws0b1, Ws0b2 = get_parse_context(s0, parse.lefts, words)
        _, Ts0b1, Ts0b2 = get_parse_context(s0, parse.lefts, tags)

        Vs0f, Ws0f1, Ws0f2 = get_parse_context(s0, parse.rights, words)
        _, Ts0f1, Ts0f2 = get_parse_context(s0, parse.rights, tags)

        # Cap numeric features at 5?
        # String-distance
        Ds0n0 = min((n0 - s0, 5)) if s0 != 0 else 0

        features['bias'] = 1
        # Add word and tag unigrams
        for w in (Wn0, Wn1, Wn2, Ws0, Ws1, Ws2, Wn0b1, Wn0b2, Ws0b1, Ws0b2, Ws0f1, Ws0f2):
            if w:
                features['w=%s' % w] = 1
        for t in (Tn0, Tn1, Tn2, Ts0, Ts1, Ts2, Tn0b1, Tn0b2, Ts0b1, Ts0b2, Ts0f1, Ts0f2):
            if t:
                features['t=%s' % t] = 1

        # Add word/tag pairs
        for i, (w, t) in enumerate(((Wn0, Tn0), (Wn1, Tn1), (Wn2, Tn2), (Ws0, Ts0))):
            if w or t:
                features['%d w=%s, t=%s' % (i, w, t)] = 1

        # Add some bigrams
        features['s0w=%s, n0w=%s' % (Ws0, Wn0)] = 1
        features['wn0tn0-ws0 %s/%s %s' % (Wn0, Tn0, Ws0)] = 1
        features['wn0tn0-ts0 %s/%s %s' % (Wn0, Tn0, Ts0)] = 1
        features['ws0ts0-wn0 %s/%s %s' % (Ws0, Ts0, Wn0)] = 1
        features['ws0-ts0 tn0 %s/%s %s' % (Ws0, Ts0, Tn0)] = 1
        features['wt-wt %s/%s %s/%s' % (Ws0, Ts0, Wn0, Tn0)] = 1
        features['tt s0=%s n0=%s' % (Ts0, Tn0)] = 1
        features['tt n0=%s n1=%s' % (Tn0, Tn1)] = 1

        # Add some tag trigrams
        trigrams = ((Tn0, Tn1, Tn2), (Ts0, Tn0, Tn1), (Ts0, Ts1, Tn0),
                    (Ts0, Ts0f1, Tn0), (Ts0, Ts0f1, Tn0), (Ts0, Tn0, Tn0b1),
                    (Ts0, Ts0b1, Ts0b2), (Ts0, Ts0f1, Ts0f2), (Tn0, Tn0b1, Tn0b2),
                    (Ts0, Ts1, Ts1))
        for i, (t1, t2, t3) in enumerate(trigrams):
            if t1 or t2 or t3:
                features['ttt-%d %s %s %s' % (i, t1, t2, t3)] = 1

        # Add some valency and distance features
        vw = ((Ws0, Vs0f), (Ws0, Vs0b), (Wn0, Vn0b))
        vt = ((Ts0, Vs0f), (Ts0, Vs0b), (Tn0, Vn0b))
        d = ((Ws0, Ds0n0), (Wn0, Ds0n0), (Ts0, Ds0n0), (Tn0, Ds0n0),
             ('t' + Tn0+Ts0, Ds0n0), ('w' + Wn0+Ws0, Ds0n0))
        for i, (w_t, v_d) in enumerate(vw + vt + d):
            if w_t or v_d:
                features['val/d-%d %s %d' % (i, w_t, v_d)] = 1
        return features
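The idea behind all those string templates can be seen in isolation. A linear model gives each feature its own weight and cannot learn that two properties matter only in combination, so each useful combination is written out as a single key. The conjoin helper and the example context below are hypothetical and are not part of parser.py.

```python
def conjoin(*atoms):
    """Build one feature key from (name, value) atomic properties."""
    return " ".join("%s=%s" % (name, value) for name, value in atoms)


# Hypothetical context: top of stack is ate/VBD, first buffer word is pizza/NN.
Ws0, Ts0, Wn0, Tn0 = "ate", "VBD", "pizza", "NN"

features = {
    conjoin(("s0w", Ws0)): 1,                              # single property
    conjoin(("s0t", Ts0), ("n0t", Tn0)): 1,                # pair
    conjoin(("s0w", Ws0), ("s0t", Ts0), ("n0t", Tn0)): 1,  # triple
}
for key in sorted(features):
    print(key)
```

Each distinct string gets its own weight vector, so "s0t=VBD n0t=NN" can favour one move even if neither tag does on its own.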
Training
The weights are learned with the same algorithm used for part-of-speech tagging, the averaged perceptron. Its main advantage is that it is an online learning algorithm: examples arrive one by one, we make a prediction, check the true answer, and adjust our beliefs (the weights) if the prediction was wrong.
The training loop looks like this:
    class Parser(object):
        ...
        def train_one(self, itn, words, gold_tags, gold_heads):
            n = len(words)
            i = 2; stack = [1]; parse = Parse(n)
            tags = self.tagger.tag(words)
            while stack or (i + 1) < n:
                features = extract_features(words, tags, i, n, stack, parse)
                scores = self.model.score(features)
                valid_moves = get_valid_moves(i, n, len(stack))
                guess = max(valid_moves, key=lambda move: scores[move])
                gold_moves = get_gold_moves(i, n, stack, parse.heads, gold_heads)
                best = max(gold_moves, key=lambda move: scores[move])
                self.model.update(best, guess, features)
                # Follow the model's own guess, not the gold move
                i = transition(guess, i, stack, parse)
            # Return number correct
            return len([i for i in range(n-1) if parse.heads[i] == gold_heads[i]])
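The training loop calls self.model.update(best, guess, features), which is not shown in this article. The sketch below illustrates the basic error-driven update under a loud simplification: it is a plain perceptron without the weight averaging the real model uses, and the class name TinyPerceptron is hypothetical.

```python
from collections import defaultdict


class TinyPerceptron(object):
    """Hypothetical stand-in for the model: no weight averaging."""

    def __init__(self):
        # weights[feature][move] -> float
        self.weights = defaultdict(lambda: defaultdict(float))

    def update(self, truth, guess, features):
        """Shift weight toward `truth` and away from `guess` if they differ."""
        if truth == guess:
            return
        for feat, value in features.items():
            self.weights[feat][truth] += value
            self.weights[feat][guess] -= value


model = TinyPerceptron()
# A wrong guess (0 instead of 2) changes the weights...
model.update(truth=2, guess=0, features={"bias": 1, "t=NN": 1})
# ...while a correct guess leaves them untouched.
model.update(truth=2, guess=2, features={"bias": 1})
print(dict(model.weights["bias"]))
```

Averaging keeps a running sum of every weight over all updates and uses the mean at test time, which makes the final model less sensitive to the last few examples seen.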
The most interesting part of the training process is get_gold_moves. The performance of our parser is made possible by an advance due to Goldberg and Nivre (2012), who showed that we had been doing this wrong for years.
In the part-of-speech tagging article, I cautioned that during training you must pass the last two predicted tags as features for the current tag, not the last two gold tags. At test time only the predicted tags are available, so if the features are based on the gold sequence during training, the training contexts will not match the test contexts and the wrong weights will be learned.
In parsing, the problem was that we did not know how to pass in the predicted sequence! Training worked by taking the gold-standard tree and finding a transition sequence that leads to it. That is, you got back a sequence of moves with the guarantee that following them would produce the gold-standard dependencies.
The problem is that we did not know how to define the "correct" move for a parser in any state that was not along the gold-standard sequence. Once the parser had made a mistake, we did not know how to train from that example.
This was a big problem, because it meant that once the parser started making mistakes, it would end up in states unlike any in its training data, leading to yet more mistakes.
The problem was specific to greedy parsers: once you use a beam, there is a natural way to do structured prediction.
As with all the best breakthroughs, the solution seems obvious once you understand it. What we do is define a function that asks, "How many gold-standard dependencies can be recovered from this state?" If you can define this function, you can apply each move in turn and ask the same question of the resulting state. If the move you applied allows fewer gold-standard dependencies to be reached, it is sub-optimal.
That is a lot to take in.
So we have the function Oracle(state):
Oracle(state) = | gold_arcs ∩ reachable_arcs(state) |
We also have a set of operations, each of which returns a new state. We want to know:
    shift_cost = Oracle(state) - Oracle(shift(state))
    right_cost = Oracle(state) - Oracle(right(state))
    left_cost = Oracle(state) - Oracle(left(state))
Now, at least one of these costs must be zero. Oracle(state) asks: "What is the cost of the best path forward?" The first step of that best path must be shift, right, or left.
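The cost arithmetic is simple once the Oracle values are known. The sketch below uses made-up counts purely to show the subtraction and the selection of zero-cost moves; computing the real Oracle values is the hard part and is not attempted here.

```python
def move_costs(oracle_now, oracle_after):
    """Cost of a move: gold arcs reachable now minus those reachable after it."""
    return {move: oracle_now - reachable
            for move, reachable in oracle_after.items()}


# Hypothetical counts: 5 gold arcs are still reachable from the current state.
oracle_now = 5
oracle_after = {"SHIFT": 5, "RIGHT": 4, "LEFT": 3}

costs = move_costs(oracle_now, oracle_after)
zero_cost = [move for move, cost in costs.items() if cost == 0]
print(costs, zero_cost)
```

During training, any move in zero_cost counts as correct, so the model is only penalised when its guess loses gold arcs that were still recoverable.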
It turns out that Oracle can be derived fairly simply for many transition systems. The derivation for the transition system we are using, Arc Hybrid, is given by Goldberg and Nivre (2013).
We implement the oracle as a function that returns the zero-cost moves, rather than implementing a function Oracle(state). This saves us from doing a bunch of expensive copy operations. I hope the reasoning in the code is not too hard to follow; if you are confused and want a full explanation, you can consult the papers by Goldberg and Nivre.
    def get_gold_moves(n0, n, stack, heads, gold):
        def deps_between(target, others, gold):
            for word in others:
                if gold[word] == target or gold[target] == word:
                    return True
            return False

        valid = get_valid_moves(n0, n, len(stack))
        if not stack or (SHIFT in valid and gold[n0] == stack[-1]):
            return [SHIFT]
        if gold[stack[-1]] == n0:
            return [LEFT]
        costly = set([m for m in MOVES if m not in valid])
        # If the word behind s0 is its gold head, Left is incorrect
        if len(stack) >= 2 and gold[stack[-1]] == stack[-2]:
            costly.add(LEFT)
        # If there are any dependencies between n0 and the stack,
        # pushing n0 will lose them.
        if SHIFT not in costly and deps_between(n0, stack, gold):
            costly.add(SHIFT)
        # If there are any dependencies between s0 and the buffer, popping
        # s0 will lose them.
        if deps_between(stack[-1], range(n0+1, n-1), gold):
            costly.add(LEFT)
            costly.add(RIGHT)
        return [m for m in MOVES if m not in costly]
The "dynamic oracle" training procedure makes a big difference to accuracy, typically 1-2%, with no difference to the way the run-time works. The old "static oracle" greedy training procedure is completely obsolete; there is no reason to use it any more.
Summary
I have the sense that language technologies, especially those related to grammar, seem particularly mysterious. I can imagine having no idea what such a program might even do.
It therefore seems natural to think that the best solution would be quite complicated. A 200,000-line Java package feels appropriate.
However, algorithmic code is usually very short when only a single algorithm is implemented. And when you implement only one algorithm and know exactly what you want to write before you start, you do not pay for any unnecessary abstractions, which can have a large performance impact.
Notes
[1] I am not sure how to count the number of lines in the Stanford parser. Its jar file is large and includes many different models. The exact figure is not important, but around 50k seems a safe estimate.
[2] For example, how would you parse "John's school of music calls"? You want to make sure that the phrase "John's school" has the same structure in both "John's school calls" and "John's school of music calls". Reasoning about the different "slots" that a phrase can be put into is a key way of reasoning about what syntactic analyses should look like. You can think of each phrase as a connector with a particular shape, which has to be plugged into a matching slot; each phrase also has a certain number of slots of its own, each with a different shape. We are trying to work out which connectors go where, so that we can work out how the sentence is put together.
[3] A newer version of the Stanford parser uses deep learning and is more accurate. However, the accuracy of the final model is still behind the best shift-reduce parsers. It is a great paper that implements the idea on a parser; whether that parser is state of the art does not matter.
[4] A point of detail: the Stanford dependencies are generated automatically from gold-standard phrase-structure trees. See the Stanford dependency converter page: http://nlp.stanford.edu/software/stanford-dependencies.shtml
Idle speculation
For a long time, incremental language processing algorithms were mainly of scientific interest. If you want to write a parser to test a theory of how the human sentence processor works, that parser needs to build partial interpretations. There is ample evidence, including common-sense introspection, that we do not buffer the input and analyse it only once the speaker has finished.
Now, however, algorithms with this neat scientific feature are winning! As far as I can tell, the secret of their success is to be:
Incremental. Earlier words constrain the search.
Error-driven. Training involves a working hypothesis, which is updated whenever it makes a mistake.
The connection with human sentence processing seems tantalising. I look forward to seeing whether these engineering breakthroughs lead to progress in psycholinguistics.
Bibliography
The NLP literature is almost entirely open access. All of the relevant papers can be found here: http://aclweb.org/anthology/
The parser I have described is an implementation of the dynamic-oracle Arc-Hybrid system:
Goldberg, Yoav; Nivre, Joakim
Training Deterministic Parsers with Non-Deterministic Oracles
TACL 2013
However, I wrote my own features for it. The arc-hybrid system was originally described here:
Kuhlmann, Marco; Gomez-Rodriguez, Carlos; Satta, Giorgio
Dynamic programming algorithms for transition-based dependency parsers
ACL 2011
The dynamic oracle training method was first described here:
Goldberg, Yoav; Nivre, Joakim
A Dynamic Oracle for Arc-Eager Dependency Parsing
COLING 2012
This work depended on a major breakthrough in accuracy for transition-based parsers, which came when Zhang and Clark properly explored beam search. They have published several papers, but the preferred citation is:
Zhang, Yue; Clark, Stephen
Syntactic Processing Using the Generalized Perceptron and Beam Search
Computational Linguistics 2011 (1)
Another important paper is this short feature-engineering paper, which further improved accuracy:
Zhang, Yue; Nivre, Joakim
Transition-based Dependency Parsing with Rich Non-local Features
ACL 2011
The generalised perceptron, which is the learning framework for these beam parsers, comes from this paper:
Collins, Michael
Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms
EMNLP 2002
Experimental details
The results at the beginning of this article refer to Section 22 of the Wall Street Journal corpus. The Stanford parser was run as follows:
    java -mx10000m -cp "$scriptdir/*:" edu.stanford.nlp.parser.lexparser.LexicalizedParser \
        -outputFormat "penn" edu/stanford/nlp/models/lexparser/englishFactored.ser.gz $*
A small post-processing step was applied to undo the special tokenisation that the Stanford parser applies to numbers, so that they match the PTB tokenisation:
    """Stanford parser retokenises numbers. Split them."""
    # Python 2 script: '\xc2\xa0' is the UTF-8 byte pair for a non-breaking space
    import sys
    import re

    qp_re = re.compile('\xc2\xa0')
    for line in sys.stdin:
        line = line.rstrip()
        if qp_re.search(line):
            line = line.replace('(CD', '(QP (CD', 1) + ')'
            line = line.replace('\xc2\xa0', ') (CD ')
        print line
The resulting PTB-format files were then converted to dependencies using the Stanford converter:
    for f in $1/*.mrg; do
      echo $f
      grep -v CODE $f > "$f.2"
      out="$f.dep"
      java -mx800m -cp "$scriptdir/*:" edu.stanford.nlp.trees.EnglishGrammaticalStructure \
        -treeFile "$f.2" -basic -makeCopulaHead -conllx > $out
    done
I can no longer read that easily, but it should just convert every .mrg file in a directory into a CoNLL-format Stanford basic dependencies file, using the settings that are common in the literature.
I then converted the gold-standard trees from Section 22 of the Wall Street Journal corpus for evaluation. Accuracy scores refer to the unlabelled attachment score (that is, the head index) over all non-punctuation tokens.
To train parser.py, I fed the gold-standard PTB trees for Sections 02-21 of the Wall Street Journal corpus into the same conversion script.
In short, the Stanford model and parser.py are trained on the same set of sentences and make their predictions on a held-out set for which we know the answers. Accuracy refers to the number of words whose heads we predicted correctly.
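The accuracy measure described above can be computed in a few lines. This is a sketch rather than the evaluation script actually used: the function name, the -1 root marker, and the explicit punctuation mask (assuming the common convention of excluding punctuation tokens) are all assumptions.

```python
def unlabelled_attachment_score(pred_heads, gold_heads, is_punct):
    """Fraction of non-punctuation tokens whose predicted head matches gold."""
    scored = [i for i in range(len(gold_heads)) if not is_punct[i]]
    if not scored:
        return 0.0
    correct = sum(1 for i in scored if pred_heads[i] == gold_heads[i])
    return correct / len(scored)


# Hypothetical four-token sentence whose last token is punctuation.
gold = [1, -1, 1, 1]
pred = [1, -1, 0, 1]
punct = [False, False, False, True]

uas = unlabelled_attachment_score(pred, gold, punct)
print("UAS: %.3f" % uas)
```

Here two of the three scored tokens have the right head; the punctuation token is ignored even though its head happens to be correct.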
Speeds were measured on a 2.4 GHz Xeon processor. I ran the experiments on a server to give the Stanford parser more memory. The parser.py system runs well on my MacBook Air. I used PyPy for the parser.py experiments; CPython was about half as fast on an early benchmark.
One reason parser.py is so fast is that it performs unlabelled parsing. Based on previous experiments, a labelled parser would likely be about 400 times slower and about 1% more accurate. Adapting the program to labelled parsing would be a good exercise for the reader, if you have access to the data.
The result for the RedShift parser was obtained from commit b6b624c9900f3bf, run as follows:
    ./scripts/train.py -x zhang+stack -k 8 -p ~/data/stanford/train.conll ~/data/parsers/tmp
    ./scripts/parse.py ~/data/parsers/tmp ~/data/stanford/devi.txt /tmp/parse/
    ./scripts/evaluate.py /tmp/parse/parses ~/data/stanford/dev.conll