Parsing English with 500 Lines of Python (1)

Source: Internet
Author: User

A syntactic parser describes the grammatical structure of a sentence, to help other applications reason about it. Natural language introduces many unexpected ambiguities, which our knowledge of the world resolves almost instantly. Here is an example that I like very much:

They ate the pizza with anchovies

The correct parse attaches "with" to "pizza", while the incorrect parse attaches "with" to "ate".

The natural language processing (NLP) community has made great progress in syntactic parsing over the past few years. It is now possible for a small Python implementation to outperform the widely used Stanford parser.

Parser      Accuracy   Speed (words/s)   Language   LOC
Stanford    89.6%      19                Java       > 50,000 [1]
parser.py   89.8%      2,020             Python     ~500
Redshift    93.6%      2,580             Cython     ~4,000

The rest of the article sets up the problem, and then takes you through a concise implementation prepared for this article. The first 200 lines of parser.py, the part-of-speech tagger and learner, are described in the earlier article on part-of-speech tagging. Unless you are very familiar with NLP research, you should at least skim that article before reading this one.

The Cython system, Redshift, was written for my current research. I plan to improve it for general use in May, after my contract with Macquarie University ends. The current version is hosted on GitHub.

Problem description

It would be very nice to be able to type an instruction like this into your phone:

Set volume to zero when I'm in a meeting, unless John's school calls.

and have it configure the appropriate policy. On Android you can do this sort of thing with Tasker, but a natural language interface would be much better. It would be especially nice to receive a semantic representation that you could edit, so that you could see what the system thinks you said and correct it.

There are many problems to solve before this works, but some kind of syntactic representation is definitely necessary. We need to know that:

Unless John's school calls, when I'm in a meeting, set volume to zero

is another way of phrasing the same command, while

Unless John's school, call when I'm in a meeting

means something completely different.

A dependency parser returns a graph of word-to-word relationships, which makes this kind of reasoning easier. The relationship graph is a tree with directed edges, in which each node (word) has exactly one incoming arc (its head dependency).

Usage example:

    >>> import parser  # the parser.py module described in this article
    >>> p = parser.Parser()
    >>> tokens = "Set the volume to zero when I 'm in a meeting unless John 's school calls".split()
    >>> tags, heads = p.parse(tokens)
    >>> heads
    [-1, 2, 0, 0, 3, 0, 7, 5, 7, 10, 8, 0, 13, 15, 15, 11]
    >>> for i, h in enumerate(heads):
    ...     # heads[i] is the index of token i's head; -1 marks the root
    ...     head = tokens[h] if h >= 0 else 'None'
    ...     print(tokens[i] + ' <-- ' + head)
    Set <-- None
    the <-- volume
    volume <-- Set
    to <-- Set
    zero <-- to
    when <-- Set
    I <-- 'm
    'm <-- when
    in <-- 'm
    a <-- meeting
    meeting <-- in
    unless <-- Set
    John <-- 's
    's <-- calls
    school <-- calls
    calls <-- unless

The idea is that it should be slightly easier to reason from the parse than from the raw string. The parse-to-meaning mapping is, hopefully, simpler than the string-to-meaning mapping.

The most confusing thing about this problem is that "correctness" is defined by convention, that is, by the annotation guidelines. If you have not read the guidelines and you are not a linguist, you cannot judge whether a parse is right or wrong. This makes the whole task feel strange and artificial.

For example, there is an error in the parse above: according to the Stanford annotation guidelines, "John's school calls" is structured incorrectly. The structure assigned to that part of the sentence is the one annotators are instructed to use for an example like "John's school clothes".

This point is worth dwelling on. In theory, we could have written the guidelines so that the "correct" parses were reversed. There is good reason to believe that parsing would become harder if we reversed the convention, because it would be less consistent with the rest of the grammar. [2] But we could test that empirically, and we would be happy to take advantage of the reversed policy if it helped.

We do need the guidelines to make this distinction: we do not want both examples to receive the same structure, or the output would be much less useful. The annotation guidelines strike a balance between the distinctions that downstream applications find useful and the distinctions that parsers can predict easily.

Projective trees

When deciding what the dependency graph should look like, we can make one particularly useful simplification: we can restrict the graph structures we have to deal with. This helps with learnability, and it also has deep algorithmic implications. In most English parsing, we follow the constraint that the dependency graph is a projective tree: every word has exactly one head, and no two dependency arcs cross.

There is a rich literature on parsing non-projective trees, and a smaller literature on parsing directed acyclic graphs. The parsing algorithm I will describe works on projective trees.
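To make the constraint concrete, here is a minimal sketch of a projectivity check. It is not part of parser.py: the function name `is_projective` is hypothetical, and it assumes the head-array convention of the usage example, where `heads[i]` is the index of word i's head and a negative value marks the root.

```python
def is_projective(heads):
    """Return True if no two dependency arcs cross.

    Assumption: heads[i] is the index of word i's head, and a negative
    value marks the root (which contributes no arc).
    """
    # Store each arc as a (left, right) span, ignoring its direction.
    arcs = [(min(h, c), max(h, c)) for c, h in enumerate(heads) if h >= 0]
    for a1, a2 in arcs:
        for b1, b2 in arcs:
            # Two arcs cross when the second starts strictly inside the
            # first and ends strictly outside it.
            if a1 < b1 < a2 < b2:
                return False
    return True


# "Set the volume": the <-- volume, volume <-- Set
print(is_projective([-1, 2, 0]))
# A made-up crossing case: arcs (0, 2) and (1, 3) interleave.
print(is_projective([-1, 3, 0, 0]))
```

The quadratic loop over arc pairs is chosen for readability; it is only meant to show what "crossing" means, not to be efficient.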

Greedy transition-based parsing

Our parser takes a list of string tokens as input, and outputs a list of head indices representing the edges in the graph. If the i-th element of heads is j, the dependency parse contains an edge (j, i). A transition-based parser is a finite-state transducer: it maps an array of N words onto an output array of N head indices.

start   MSNBC   reported   that   Facebook   bought   WhatsApp   for   $16bn   root
0       2       9          2      4          2        4          4     7       0

The heads array says that the head of MSNBC is reported: MSNBC is word 1, reported is word 2, and heads[1] == 2. You can already see why a tree structure is so convenient: if we had to output a DAG, a word could have multiple heads, and this data structure would no longer work.
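As a small illustration of how a head array encodes the graph, the sketch below turns one into an explicit list of (head, child) edges. The function name `heads_to_edges` is hypothetical, and the sketch assumes the convention of the usage example, where the root is marked with -1 rather than with a trailing root token.

```python
def heads_to_edges(heads):
    """Convert a head array into (head, child) edges.

    Assumption: heads[i] is the index of word i's head, and -1 marks
    the root word, which has no incoming edge.
    """
    return [(head, child) for child, head in enumerate(heads) if head >= 0]


tokens = "Set the volume to zero".split()
heads = [-1, 2, 0, 0, 3]  # first five entries of the usage example

for head, child in heads_to_edges(heads):
    print(tokens[head] + ' --> ' + tokens[child])
```

Because each word appears exactly once as a child, the array has one slot per word, which is the property a DAG would break.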

Although the heads can be represented as an array, we would also like to maintain a couple of alternative ways to access the parse, so that features can be extracted easily and efficiently. The Parse class looks like this:

 
 
    class Parse(object):
        def __init__(self, n):
            self.n = n
            # heads[i] is the index of word i's head (None until it is set).
            self.heads = [None] * n
            # lefts[i] and rights[i] hold the children of word i on each side.
            # DefaultList is a helper in parser.py: a list that returns a
            # default value instead of raising on an out-of-range index.
            self.lefts = []
            self.rights = []
            for i in range(n+1):
                self.lefts.append(DefaultList(0))
                self.rights.append(DefaultList(0))

        def add_arc(self, head, child):
            self.heads[child] = head
            if child < head:
                self.lefts[head].append(child)
            else:
                self.rights[head].append(child)
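The listing relies on a `DefaultList` helper that is not shown here. The following self-contained sketch fills that gap under stated assumptions: `DefaultList` is written as a list that returns a default value for out-of-range indices, and `heads` is given one slot per word. Treat both as illustrative choices rather than as the exact code in parser.py.

```python
class DefaultList(list):
    """A list that returns a default value for out-of-range indices
    (assumed behaviour of the helper used by Parse)."""

    def __init__(self, default=None):
        list.__init__(self)
        self.default = default

    def __getitem__(self, index):
        try:
            return list.__getitem__(self, index)
        except IndexError:
            return self.default


class Parse(object):
    def __init__(self, n):
        self.n = n
        self.heads = [None] * n  # assumption: one head slot per word
        self.lefts = [DefaultList(0) for _ in range(n + 1)]
        self.rights = [DefaultList(0) for _ in range(n + 1)]

    def add_arc(self, head, child):
        self.heads[child] = head
        if child < head:
            self.lefts[head].append(child)
        else:
            self.rights[head].append(child)


# "Set the volume": the <-- volume, volume <-- Set
parse = Parse(3)
parse.add_arc(2, 1)
parse.add_arc(0, 2)
print(parse.heads)         # [None, 2, 0]
print(parse.lefts[2])      # [1]
print(parse.rights[0])     # [2]
print(parse.lefts[0][-1])  # 0, because word 0 has no left children
```

The default value matters for feature extraction: asking for the last left child of a word that has none yields 0 instead of raising an exception.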

As well as the parse, we also have to keep track of where we are in the sentence. We do this with an index into the words array, and a stack onto which we push words, popping them once their head is set. So our state data structure is fundamentally:

  • An index, i, into the list of tokens
  • The dependencies added to the parse so far
  • A stack containing words that occurred before index i and that have not yet been assigned a head

Each step of the parsing process applies one of three actions to the state:

 
 
    SHIFT = 0; RIGHT = 1; LEFT = 2
    MOVES = [SHIFT, RIGHT, LEFT]

    def transition(move, i, stack, parse):
        if move == SHIFT:
            # Push the next buffer word onto the stack and advance the buffer.
            stack.append(i)
            return i + 1
        elif move == RIGHT:
            # The top of the stack becomes a child of the word beneath it.
            parse.add_arc(stack[-2], stack.pop())
            return i
        elif move == LEFT:
            # The top of the stack becomes a child of the next buffer word.
            parse.add_arc(i, stack.pop())
            return i
        raise ValueError("Unknown move: %d" % move)

The LEFT and RIGHT actions add a dependency and pop the stack, while SHIFT pushes onto the stack and advances i into the buffer.

So the parser starts with an empty stack and a buffer index of 0, with no dependencies recorded. It chooses one of the valid actions and applies it to the state. It keeps choosing and applying actions until the stack is empty and the buffer index has reached the end of the input. (It is difficult to understand this algorithm without stepping through it. Try coming up with a sentence, drawing a projective parse tree over it, and then reaching that tree by choosing the right sequence of transitions.)
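Here is one way to step through the algorithm by hand, as a minimal sketch. It starts from an empty stack and buffer index 0, as described above, and stores heads in a plain list instead of a Parse object. The helper `run` and the hand-picked move sequence are hypothetical, and the sketch assumes that index n (one past the last word) stands for a virtual root.

```python
SHIFT, RIGHT, LEFT = 0, 1, 2
NAMES = {SHIFT: "SHIFT", RIGHT: "RIGHT", LEFT: "LEFT"}


def transition(move, i, stack, heads):
    """Apply one move to the state and return the new buffer index."""
    if move == SHIFT:
        stack.append(i)
        return i + 1
    if move == RIGHT:
        child = stack.pop()
        heads[child] = stack[-1]  # head is the word that was beneath it
        return i
    if move == LEFT:
        heads[stack.pop()] = i    # head is the next word in the buffer
        return i
    raise ValueError("Unknown move: %r" % move)


def run(words, moves):
    """Apply a fixed move sequence, printing the state after each step."""
    i, stack, heads = 0, [], [None] * len(words)
    for move in moves:
        i = transition(move, i, stack, heads)
        print("%-5s stack=%s i=%d heads=%s" % (NAMES[move], stack, i, heads))
    return heads


words = ["Set", "the", "volume"]
# Target tree: the <-- volume, volume <-- Set, Set <-- virtual root (index 3)
heads = run(words, [SHIFT, SHIFT, LEFT, SHIFT, RIGHT, LEFT])
print(heads)  # [3, 2, 0]
```

The final LEFT fires when the buffer is exhausted, so the last word on the stack is attached to index 3, the assumed root position.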

Here is what the parsing loop looks like in code:

 
 
    class Parser(object):
        ...
        def parse(self, words):
            tags = self.tagger(words)
            n = len(words)
            idx = 1
            stack = [0]
            deps = Parse(n)
            while stack or idx < n:
                features = extract_features(words, tags, idx, n, stack, deps)
                scores = self.model.score(features)
                valid_moves = get_valid_moves(idx, n, len(stack))
                # Greedy choice: take the highest-scoring valid move.
                next_move = max(valid_moves, key=lambda move: scores[move])
                idx = transition(next_move, idx, stack, deps)
            return tags, deps.heads

    def get_valid_moves(i, n, stack_depth):
        moves = []
        if i < n:
            # SHIFT needs a word left in the buffer.
            moves.append(SHIFT)
        if stack_depth >= 2:
            # RIGHT needs two words on the stack.
            moves.append(RIGHT)
        if stack_depth >= 1:
            # LEFT needs one word on the stack.
            moves.append(LEFT)
        return moves

We start by tagging the sentence and initializing the state. We then map the state to a set of features, which we score using a linear model. Finally, we find the valid move with the highest score and apply it to the state.
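The selection step is worth isolating: the model scores every move, but only moves that are valid in the current state may be chosen. The sketch below shows this with made-up scores; the helper `best_valid_move` and the numbers are hypothetical, while `get_valid_moves` follows the validity rules used by the parsing loop.

```python
SHIFT, RIGHT, LEFT = 0, 1, 2


def get_valid_moves(i, n, stack_depth):
    moves = []
    if i < n:
        moves.append(SHIFT)   # a word is left in the buffer
    if stack_depth >= 2:
        moves.append(RIGHT)   # two words are on the stack
    if stack_depth >= 1:
        moves.append(LEFT)    # one word is on the stack
    return moves


def best_valid_move(scores, i, n, stack_depth):
    """Pick the highest-scoring move among those valid in this state."""
    valid_moves = get_valid_moves(i, n, stack_depth)
    return max(valid_moves, key=lambda move: scores[move])


# Made-up scores in which RIGHT is the model's favourite move.
scores = {SHIFT: 0.2, RIGHT: 1.5, LEFT: 0.7}

# Only one word on the stack: RIGHT is invalid, so LEFT wins.
print(best_valid_move(scores, i=1, n=4, stack_depth=1))  # 2 (LEFT)
# Two words on the stack, buffer exhausted: RIGHT is now allowed.
print(best_valid_move(scores, i=4, n=4, stack_depth=2))  # 1 (RIGHT)
```

Filtering before taking the maximum means the model never has to learn which moves are impossible; the state rules that out for it.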

The model scoring works the same way as it did in the part-of-speech tagger. If you are confused about the idea of extracting features and scoring them with a linear model, you should review that article. Here is a reminder of how the model scoring works:

 
 
    class Perceptron(object):
        ...
        def score(self, features):
            all_weights = self.weights
            scores = dict((clas, 0) for clas in self.classes)
            for feat, value in features.items():
                # Skip inactive features and features never seen in training.
                if value == 0:
                    continue
                if feat not in all_weights:
                    continue
                weights = all_weights[feat]
                # Add this feature's contribution to each class it votes for.
                for clas, weight in weights.items():
                    scores[clas] += value * weight
            return scores

It is just summing the class weights for each feature. This is usually expressed as a dot product, but I find that formulation awkward when dealing with many classes.
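A tiny worked example may help. The sketch below applies the same summation as a standalone function, so that it runs without the rest of the Perceptron class. The feature names and weights are invented for illustration and do not come from a trained model.

```python
SHIFT, RIGHT, LEFT = 0, 1, 2


def score(weights, classes, features):
    """Sum, for each class, the weights of all active features."""
    scores = dict((clas, 0.0) for clas in classes)
    for feat, value in features.items():
        if value == 0 or feat not in weights:
            continue
        for clas, weight in weights[feat].items():
            scores[clas] += value * weight
    return scores


# Hypothetical weights: feature -> {class: weight}
weights = {
    "s0_tag=NN": {SHIFT: -1.0, LEFT: 2.0},
    "n0_tag=VBZ": {SHIFT: 0.5, RIGHT: -0.5, LEFT: 1.0},
}
# "unseen_feature" has no weights, so it contributes nothing.
features = {"s0_tag=NN": 1, "n0_tag=VBZ": 1, "unseen_feature": 1}

scores = score(weights, [SHIFT, RIGHT, LEFT], features)
print(scores)  # {0: -0.5, 1: -0.5, 2: 3.0}
```

Storing weights per feature keeps the loop sparse: only features that actually fire are visited, and each one touches only the classes it has weights for.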

The beam-search parser (Redshift) tracks multiple candidates and only decides on the best one at the very end. Here we trade away some accuracy in favour of efficiency and simplicity, and follow only a single analysis. Our search strategy is entirely greedy, just as in the part-of-speech tagger: we lock in our choice at every step.

If you read the part-of-speech tagging article carefully, you may notice the underlying similarity. What we have done is map the parsing problem onto a sequence-labelling problem, which we solve with a "flat", or unstructured, learning algorithm (by doing greedy search).

