A syntactic parser describes a sentence's grammatical structure, to help other applications reason about it. Natural language introduces many unexpected ambiguities, which our world-knowledge filters out immediately. Here's an example I like a lot:
They ate the pizza with anchovies
The correct parse attaches "with" to "pizza", while the incorrect parse attaches "with" to "eat":
The Natural Language Processing (NLP) community has made great progress in syntactic parsing over the last few years. It's now possible for a tiny Python implementation to perform better than the widely-used Stanford parser.
    Parser      Accuracy   Speed (words/sec)   Language   Lines of code
    Stanford    89.6%      19                  Java       > 50,000 [1]
    parser.py   89.8%      2,020               Python     ~ 500
    Redshift    93.6%      2,580               Cython     ~ 4,000
The rest of this post first sets up the problem, then walks through a concise implementation. The first 200 lines of parser.py handle the part-of-speech tagger and the learner (described here). Unless you're very familiar with NLP research, you should at least skim this post before digging into the code.
The Cython system, Redshift, was written for my current research. I plan to improve it for general use in May, after my contract with Macquarie University expires. The current version is hosted on GitHub.
Problem description
It would be very nice to be able to type a command like this into your phone:
Set volume to zero when I'm in a meeting, unless John's school calls.
And have it set up the appropriate policy. On Android you can do this sort of thing with Tasker, but an NL interface would be much better. It would be especially nice to receive a meaning representation you could edit, so you could see what the system thinks you said and correct it.
Lots of problems have to be solved to make this work, but some kind of syntactic representation is definitely necessary. We need to know that:
Unless John's school calls, when I'm in a meeting, set volume to zero
is another way of phrasing the same command, while
Unless John's school, call when I'm in a meeting
means something completely different.
A dependency parser returns the word-to-word relationships that make such inferences easier to compute. The relationships form a tree structure with directed edges: each node (word) has exactly one incoming arc (its head, i.e. the word it depends on).
Usage example:
    >>> parser = parser.Parser()
    >>> tokens = "Set the volume to zero when I 'm in a meeting unless John 's school calls".split()
    >>> tags, heads = parser.parse(tokens)
    >>> heads
    [-1, 2, 0, 0, 3, 0, 7, 5, 7, 10, 8, 0, 13, 15, 15, 11]
    >>> for i, h in enumerate(heads):
    ...     head = tokens[h] if h >= 0 else 'None'
    ...     print(tokens[i] + ' <-- ' + head)
    Set <-- None
    the <-- volume
    volume <-- Set
    to <-- Set
    zero <-- to
    when <-- Set
    I <-- 'm
    'm <-- when
    in <-- 'm
    a <-- meeting
    meeting <-- in
    unless <-- Set
    John <-- 's
    's <-- calls
    school <-- calls
    calls <-- unless
The idea is that reasoning from the parse should be slightly easier than reasoning from the string: the parse-to-meaning mapping is hopefully simpler than the string-to-meaning mapping.
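To make that concrete, here's a toy sketch of reasoning over the parse (my own illustration, not from parser.py): given the tokens and heads from the session above, we can recover the clause governed by "unless", i.e. the exception to the command.

    # Toy sketch: recover the clause that "unless" governs.
    tokens = "Set the volume to zero when I 'm in a meeting unless John 's school calls".split()
    heads = [-1, 2, 0, 0, 3, 0, 7, 5, 7, 10, 8, 0, 13, 15, 15, 11]

    def subtree(root, heads):
        # Collect the root plus every word that (transitively) attaches to it.
        nodes = {root}
        changed = True
        while changed:
            changed = False
            for child, head in enumerate(heads):
                if head in nodes and child not in nodes:
                    nodes.add(child)
                    changed = True
        return sorted(nodes)

    print(' '.join(tokens[i] for i in subtree(tokens.index('unless'), heads)))
    # -> unless John 's school calls

Doing the same from the raw string would require re-deriving that structure some other way.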
The most confusing thing about this problem area is that correctness is defined by convention, that is, by an annotation guide. If you haven't read the guide, and you're not a linguist, you can't judge whether a parse is correct, which makes the whole task feel strange and artificial.
For example, there's a mistake in the parse above: according to the Stanford annotation guidelines, "John's school calls" is structured wrongly. The structure of that part of the sentence is how the annotators were instructed to parse an example like "John's school clothes".
This is worth dwelling on. In theory, we could have settled on the opposite rules, making the reverse parse the "correct" one. If we depart from the convention, we have good reason to believe parsing will get harder, because consistency with the rest of the grammar decreases. [2] But we could test that empirically, and we'd be happy to exploit the reverse strategy if it worked.
We do need the conventions to draw distinctions: we don't want everything to receive the same structure, or the output wouldn't be very useful. The annotation guidelines balance what distinctions are useful to downstream applications against what parsers can predict easily.
Projective trees
There's a particularly effective simplification we can make when deciding what we want the graph to look like: restrict the graph structures we're willing to process. This doesn't just bring advantages in learnability; it also has deep algorithmic implications. In most English parsing work, we follow the constraint that the dependency graph must be a projective tree: a tree in which, roughly, arcs drawn above the sentence never cross.
There's a rich literature on parsing non-projective trees, and a smaller literature on parsing directed acyclic graphs. The parsing algorithm I'll describe works on projective trees.
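As a rough sketch of what the projectivity constraint means in code (my own illustration, under the head-array convention used above): when the arcs are drawn over the sentence, no two arcs may cross.

    def is_projective(heads):
        # Treat each (head, child) pair as an interval over word positions.
        arcs = [(min(h, c), max(h, c))
                for c, h in enumerate(heads) if h is not None and h >= 0]
        for l1, r1 in arcs:
            for l2, r2 in arcs:
                if l1 < l2 < r1 < r2:   # arc 2 starts inside arc 1 but ends outside
                    return False
        return True

    heads = [-1, 2, 0, 0, 3, 0, 7, 5, 7, 10, 8, 0, 13, 15, 15, 11]
    print(is_projective(heads))   # True: the example parse above is projective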
Greedy transition-based parsing
Our parser takes a list of string tokens as input, and outputs a list of head indices representing the edges in the graph. If the i-th element of the heads list is j, the dependency graph includes the edge (j, i). A transition-based parser is a finite-state transducer: it maps an array of N words onto an output array of N head indices.
    Start   MSNBC   reported   that   Facebook   bought   WhatsApp   for   $16bn   root
    0       2       9          2      4          2        4          4     7       0
The heads array says that the head of MSNBC is reported: MSNBC is word 1, reported is word 2, and heads[1] == 2. You can already see why the tree structure is so convenient: if we were outputting a DAG, words could have multiple heads, and this flat array representation would no longer work.
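To make the indexing concrete, the table reads directly as a pair of Python lists (a small sketch of my own):

    words = ['Start', 'MSNBC', 'reported', 'that', 'Facebook',
             'bought', 'WhatsApp', 'for', '$16bn', 'root']
    heads = [0, 2, 9, 2, 4, 2, 4, 4, 7, 0]
    print(words[heads[1]])   # 'reported': the head of word 1, 'MSNBC'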
While heads can be represented as an array, we'd really like to maintain some alternate views onto the parse as well, so that features can be extracted conveniently and efficiently. The Parse class looks like this:
    class Parse(object):
        def __init__(self, n):
            self.n = n
            self.heads = [None] * (n-1)
            self.lefts = []
            self.rights = []
            for i in range(n+1):
                self.lefts.append(DefaultList(0))
                self.rights.append(DefaultList(0))

        def add_arc(self, head, child):
            # Record the arc, and index the child under its head,
            # on the left or right depending on word order.
            self.heads[child] = head
            if child < head:
                self.lefts[head].append(child)
            else:
                self.rights[head].append(child)
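The Parse class relies on DefaultList, defined earlier in parser.py. The idea is a list that returns a default value for out-of-range indices instead of raising IndexError, which keeps the feature extraction code free of bounds checks. A minimal sketch of that idea:

    class DefaultList(list):
        """A list that returns a default value instead of raising IndexError."""
        def __init__(self, default=None):
            self.default = default
            list.__init__(self)

        def __getitem__(self, index):
            try:
                return list.__getitem__(self, index)
            except IndexError:
                return self.default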
As well as the parse itself, we also need to track our position in the sentence. We do this with an index into the words array, plus a stack, onto which words are pushed and from which they are popped once their head has been set. So our state data structure is fundamentally:
- An index, i, into the list of tokens
- The dependencies added so far
- A stack, containing words that occurred before i, for which we haven't yet assigned a head
Each step of the parsing process applies one of three operations:
    SHIFT = 0; RIGHT = 1; LEFT = 2
    MOVES = [SHIFT, RIGHT, LEFT]

    def transition(move, i, stack, parse):
        if move == SHIFT:
            # Push the current buffer word onto the stack; advance the buffer.
            stack.append(i)
            return i + 1
        elif move == RIGHT:
            # The second item on the stack becomes the head of the top item.
            parse.add_arc(stack[-2], stack.pop())
            return i
        elif move == LEFT:
            # The current buffer word becomes the head of the stack top.
            parse.add_arc(i, stack.pop())
            return i
        raise ValueError("Unknown move: %d" % move)
The LEFT and RIGHT actions add dependencies and pop the stack, while SHIFT pushes the stack and advances the buffer index i.
So the parser starts with an empty stack and a buffer index of 0, with no dependencies recorded. It chooses one of the valid actions and applies it to the state, then keeps choosing and applying actions until the stack is empty and the buffer index reaches the end of the input array. (It's hard to understand this algorithm without stepping through it. Prepare a sentence, draw a projective parse tree over it, then try to reach that tree by choosing the right sequence of transitions.)
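Here's a worked toy trace of that process (my own example, using a stand-in for the Parse class): we parse "the cat sat", where "the" attaches to "cat" and "cat" attaches to "sat".

    class TinyParse(object):
        """A stand-in for Parse with just enough API for transition()."""
        def __init__(self):
            self.heads = {}
        def add_arc(self, head, child):
            self.heads[child] = head

    # Tokens: 0 'the', 1 'cat', 2 'sat'.
    parse = TinyParse()
    stack, i = [], 0
    for move in (SHIFT, LEFT, SHIFT, LEFT, SHIFT):
        i = transition(move, i, stack, parse)

    print(parse.heads)   # {0: 1, 1: 2}: the <-- cat, cat <-- sat
    print(stack)         # [2]: 'sat' remains on the stack as the root of the sentence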
Here's the parsing loop in code:
    class Parser(object):
        ...
        def parse(self, words):
            tags = self.tagger(words)
            n = len(words)
            idx = 1
            stack = [0]
            deps = Parse(n)
            while stack or idx < n:
                features = extract_features(words, tags, idx, n, stack, deps)
                scores = self.model.score(features)
                valid_moves = get_valid_moves(idx, n, len(stack))
                next_move = max(valid_moves, key=lambda move: scores[move])
                idx = transition(next_move, idx, stack, deps)
            return tags, deps

    def get_valid_moves(i, n, stack_depth):
        moves = []
        if i < n:
            moves.append(SHIFT)
        if stack_depth >= 2:
            moves.append(RIGHT)
        if stack_depth >= 1:
            moves.append(LEFT)
        return moves
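As a quick illustration of how the move filter behaves at the edges (my own example): at the start of a 10-word sentence with an empty stack only SHIFT is valid, and once the buffer is exhausted only the stack-reducing moves remain.

    print(get_valid_moves(0, 10, 0))    # [0]: only SHIFT is valid
    print(get_valid_moves(10, 10, 2))   # [1, 2]: only RIGHT and LEFT remain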
We start with a tagged sentence and initialize the state. We then map the state to a set of features, which we score using a linear model. We then find the valid move with the highest score and apply it to the state.
The scoring model works the same way as it did in the part-of-speech tagger. If you're confused about the idea of extracting features and scoring them with a linear model, you should review that post. Here's a reminder of how the scoring model works:
    class Perceptron(object):
        ...
        def score(self, features):
            all_weights = self.weights
            scores = dict((clas, 0) for clas in self.classes)
            for feat, value in features.items():
                if value == 0:
                    continue
                if feat not in all_weights:
                    continue
                weights = all_weights[feat]
                for clas, weight in weights.items():
                    scores[clas] += value * weight
            return scores
Here we just sum the class weights for each feature. This is often expressed as a dot product, but I find that representation doesn't deal well with many classes.
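For instance, here's a made-up worked example (the feature names and weights are invented for illustration): two active features, scored over the three moves.

    weights = {
        'stack0-word=the': {SHIFT: 1, RIGHT: -1, LEFT: 2},
        'buffer0-word=cat': {SHIFT: 1, LEFT: 1},
    }
    features = {'stack0-word=the': 1, 'buffer0-word=cat': 1}

    scores = dict((clas, 0) for clas in MOVES)
    for feat, value in features.items():
        for clas, weight in weights.get(feat, {}).items():
            scores[clas] += value * weight

    print(scores)   # {0: 2, 1: -1, 2: 3}: LEFT gets the highest score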
The beam parser (RedShift) tracks multiple candidate analyses and only decides on the best one at the very end. We're going to trade accuracy for efficiency and simplicity, and follow only a single analysis: our search strategy will be entirely greedy, just as it was for the POS tagger, locking in a choice at each step.
If you read the POS-tagging post carefully, you may notice the underlying similarity: what we've done is map the parsing problem onto a sequence-labelling problem, which we solve with a "flat", or unstructured, learning algorithm (using greedy search).