A syntactic parser describes a sentence's grammatical structure, to help other applications reason about it. Natural language introduces many unexpected ambiguities, which our world-knowledge filters out immediately. Here's an example I like a lot:
They ate the pizza with anchovies
The correct parse attaches "with" to "pizza", while the incorrect parse attaches "with" to "eat":
The Natural Language Processing (NLP) community has made great progress in syntactic parsing over the last few years. It's now possible for a tiny Python implementation to perform better than the widely-used Stanford parser.
    Parser      Accuracy   Speed (words/sec)   Language   Lines of code
    Stanford    89.6%      19                  Java       > 50,000 [1]
    parser.py   89.8%      2,020               Python     ~ 500
    Redshift    93.6%      2,580               Cython     ~ 4,000
The rest of this post first sets up the problem, then walks through a concise implementation. The first 200 lines of parser.py handle the part-of-speech tagger and the learner (described here). Unless you're very familiar with NLP research, you should at least skim this post before digging into the code.
The Cython system, Redshift, was written for my current research. I plan to improve it for general use in May, after my contract with Macquarie University expires. The current version is hosted on GitHub.
Problem description
It would be very nice to be able to type a command like this into your phone:
Set volume to zero when I'm in a meeting, unless John's school calls.
And have it set up the appropriate policy. On Android you can do this sort of thing with Tasker, but an NL interface would be much better. It would be especially nice to receive a meaning representation you could edit, so you could see what the system thinks you said and correct it.
Lots of problems have to be solved to make this work, but some kind of syntactic representation is definitely necessary. We need to know that:
Unless John's school calls, when I'm in a meeting, set volume to zero
is another way of phrasing the same command, while
Unless John's school, call when I'm in a meeting
means something completely different.
A dependency parser returns the word-to-word relationships that make such inferences easier to compute. The relationships form a tree structure with directed edges: each node (word) has exactly one incoming arc (its head, i.e. the word it depends on).
Usage example:
    >>> parser = parser.Parser()
    >>> tokens = "Set the volume to zero when I 'm in a meeting unless John 's school calls".split()
    >>> tags, heads = parser.parse(tokens)
    >>> heads
    [-1, 2, 0, 0, 3, 0, 7, 5, 7, 10, 8, 0, 13, 15, 15, 11]
    >>> for i, h in enumerate(heads):
    ...     head = tokens[h] if h >= 0 else 'None'
    ...     print(tokens[i] + ' <-- ' + head)
    Set <-- None
    the <-- volume
    volume <-- Set
    to <-- Set
    zero <-- to
    when <-- Set
    I <-- 'm
    'm <-- when
    in <-- 'm
    a <-- meeting
    meeting <-- in
    unless <-- Set
    John <-- 's
    's <-- calls
    school <-- calls
    calls <-- unless
The idea is that reasoning from the parse should be slightly easier than reasoning from the string: the parse-to-meaning mapping is hopefully simpler than the string-to-meaning mapping.
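To make that concrete, here's a toy sketch of reasoning over the parse (my own illustration, not from parser.py): given the tokens and heads from the session above, we can recover the clause governed by "unless", i.e. the exception to the command.

    # Toy sketch: recover the clause that "unless" governs.
    tokens = "Set the volume to zero when I 'm in a meeting unless John 's school calls".split()
    heads = [-1, 2, 0, 0, 3, 0, 7, 5, 7, 10, 8, 0, 13, 15, 15, 11]

    def subtree(root, heads):
        # Collect the root plus every word that (transitively) attaches to it.
        nodes = {root}
        changed = True
        while changed:
            changed = False
            for child, head in enumerate(heads):
                if head in nodes and child not in nodes:
                    nodes.add(child)
                    changed = True
        return sorted(nodes)

    print(' '.join(tokens[i] for i in subtree(tokens.index('unless'), heads)))
    # -> unless John 's school calls

Doing the same from the raw string would require re-deriving that structure some other way.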
The most confusing thing about this problem area is that correctness is defined by convention, that is, by an annotation guide. If you haven't read the guide, and you're not a linguist, you can't judge whether a parse is correct, which makes the whole task feel strange and artificial.
For example, there's a mistake in the parse above: according to the Stanford annotation guidelines, "John's school calls" is structured wrongly. The structure of that part of the sentence is how the annotators were instructed to parse an example like "John's school clothes".
This is worth dwelling on. In theory, we could have settled on the opposite rules, making the reverse parse the "correct" one. If we depart from the convention, we have good reason to believe parsing will get harder, because consistency with the rest of the grammar decreases. [2] But we could test that empirically, and we'd be happy to exploit the reverse strategy if it worked.
We do need the conventions to draw distinctions: we don't want everything to receive the same structure, or the output wouldn't be very useful. The annotation guidelines balance what distinctions are useful to downstream applications against what parsers can predict easily.
Projective trees
There's a particularly effective simplification we can make when deciding what we want the graph to look like: restrict the graph structures we're willing to process. This doesn't just bring advantages in learnability; it also has deep algorithmic implications. In most English parsing work, we follow the constraint that the dependency graph must be a projective tree: a tree in which, roughly, arcs drawn above the sentence never cross.
There's a rich literature on parsing non-projective trees, and a smaller literature on parsing directed acyclic graphs. The parsing algorithm I'll describe works on projective trees.
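As a rough sketch of what the projectivity constraint means in code (my own illustration, under the head-array convention used above): when the arcs are drawn over the sentence, no two arcs may cross.

    def is_projective(heads):
        # Treat each (head, child) pair as an interval over word positions.
        arcs = [(min(h, c), max(h, c))
                for c, h in enumerate(heads) if h is not None and h >= 0]
        for l1, r1 in arcs:
            for l2, r2 in arcs:
                if l1 < l2 < r1 < r2:   # arc 2 starts inside arc 1 but ends outside
                    return False
        return True

    heads = [-1, 2, 0, 0, 3, 0, 7, 5, 7, 10, 8, 0, 13, 15, 15, 11]
    print(is_projective(heads))   # True: the example parse above is projective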
Greedy transition-based parsing
Our parser takes a list of string tokens as input, and outputs a list of head indices representing the edges in the graph. If the i-th element of the heads list is j, the dependency graph includes the edge (j, i). A transition-based parser is a finite-state transducer: it maps an array of N words onto an output array of N head indices.
    Start   MSNBC   reported   that   Facebook   bought   WhatsApp   for   $16bn   root
    0       2       9          2      4          2        4          4     7       0
The heads array says that the head of MSNBC is reported: MSNBC is word 1, reported is word 2, and heads[1] == 2. You can already see why the tree structure is so convenient: if we were outputting a DAG, words could have multiple heads, and this flat array representation would no longer work.
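To make the indexing concrete, the table reads directly as a pair of Python lists (a small sketch of my own):

    words = ['Start', 'MSNBC', 'reported', 'that', 'Facebook',
             'bought', 'WhatsApp', 'for', '$16bn', 'root']
    heads = [0, 2, 9, 2, 4, 2, 4, 4, 7, 0]
    print(words[heads[1]])   # 'reported': the head of word 1, 'MSNBC'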
While heads can be represented as an array, we'd really like to maintain some alternate views onto the parse as well, so that features can be extracted conveniently and efficiently. The Parse class looks like this:
    class Parse(object):
        def __init__(self, n):
            self.n = n
            self.heads = [None] * (n-1)
            self.lefts = []
            self.rights = []
            for i in range(n+1):
                self.lefts.append(DefaultList(0))
                self.rights.append(DefaultList(0))

        def add_arc(self, head, child):
            # Record the arc, and index the child under its head,
            # on the left or right depending on word order.
            self.heads[child] = head
            if child < head:
                self.lefts[head].append(child)
            else:
                self.rights[head].append(child)
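The Parse class relies on DefaultList, defined earlier in parser.py. The idea is a list that returns a default value for out-of-range indices instead of raising IndexError, which keeps the feature extraction code free of bounds checks. A minimal sketch of that idea:

    class DefaultList(list):
        """A list that returns a default value instead of raising IndexError."""
        def __init__(self, default=None):
            self.default = default
            list.__init__(self)

        def __getitem__(self, index):
            try:
                return list.__getitem__(self, index)
            except IndexError:
                return self.default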
As well as the parse itself, we also need to track our position in the sentence. We do this with an index into the words array, plus a stack, onto which words are pushed and from which they are popped once their head has been set. So our state data structure is fundamentally:
- An index, i, into the list of tokens
- The dependencies added so far
- A stack, containing words that occurred before i, for which we haven't yet assigned a head
Each step of the parsing process applies one of three operations:
    SHIFT = 0; RIGHT = 1; LEFT = 2
    MOVES = [SHIFT, RIGHT, LEFT]

    def transition(move, i, stack, parse):
        if move == SHIFT:
            # Push the current buffer word onto the stack; advance the buffer.
            stack.append(i)
            return i + 1
        elif move == RIGHT:
            # The second item on the stack becomes the head of the top item.
            parse.add_arc(stack[-2], stack.pop())
            return i
        elif move == LEFT:
            # The current buffer word becomes the head of the stack top.
            parse.add_arc(i, stack.pop())
            return i
        raise ValueError("Unknown move: %d" % move)
The LEFT and RIGHT actions add dependencies and pop the stack, while SHIFT pushes the stack and advances the buffer index i.
So the parser starts with an empty stack and a buffer index of 0, with no dependencies recorded. It chooses one of the valid actions and applies it to the state, then keeps choosing and applying actions until the stack is empty and the buffer index reaches the end of the input array. (It's hard to understand this algorithm without stepping through it. Prepare a sentence, draw a projective parse tree over it, then try to reach that tree by choosing the right sequence of transitions.)
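Here's a worked toy trace of that process (my own example, using a stand-in for the Parse class): we parse "the cat sat", where "the" attaches to "cat" and "cat" attaches to "sat".

    class TinyParse(object):
        """A stand-in for Parse with just enough API for transition()."""
        def __init__(self):
            self.heads = {}
        def add_arc(self, head, child):
            self.heads[child] = head

    # Tokens: 0 'the', 1 'cat', 2 'sat'.
    parse = TinyParse()
    stack, i = [], 0
    for move in (SHIFT, LEFT, SHIFT, LEFT, SHIFT):
        i = transition(move, i, stack, parse)

    print(parse.heads)   # {0: 1, 1: 2}: the <-- cat, cat <-- sat
    print(stack)         # [2]: 'sat' remains on the stack as the root of the sentence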
Here's the parsing loop in code:
    class Parser(object):
        ...
        def parse(self, words):
            tags = self.tagger(words)
            n = len(words)
            idx = 1
            stack = [0]
            deps = Parse(n)
            while stack or idx < n:
                features = extract_features(words, tags, idx, n, stack, deps)
                scores = self.model.score(features)
                valid_moves = get_valid_moves(idx, n, len(stack))
                next_move = max(valid_moves, key=lambda move: scores[move])
                idx = transition(next_move, idx, stack, deps)
            return tags, deps

    def get_valid_moves(i, n, stack_depth):
        moves = []
        if i < n:
            moves.append(SHIFT)
        if stack_depth >= 2:
            moves.append(RIGHT)
        if stack_depth >= 1:
            moves.append(LEFT)
        return moves
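As a quick illustration of how the move filter behaves at the edges (my own example): at the start of a 10-word sentence with an empty stack only SHIFT is valid, and once the buffer is exhausted only the stack-reducing moves remain.

    print(get_valid_moves(0, 10, 0))    # [0]: only SHIFT is valid
    print(get_valid_moves(10, 10, 2))   # [1, 2]: only RIGHT and LEFT remain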
We start with a tagged sentence and initialize the state. We then map the state to a set of features, which we score using a linear model. We then find the valid move with the highest score and apply it to the state.
The scoring model works the same way as it did in the part-of-speech tagger. If you're confused about the idea of extracting features and scoring them with a linear model, you should review that post. Here's a reminder of how the scoring model works:
    class Perceptron(object):
        ...
        def score(self, features):
            all_weights = self.weights
            scores = dict((clas, 0) for clas in self.classes)
            for feat, value in features.items():
                if value == 0:
                    continue
                if feat not in all_weights:
                    continue
                weights = all_weights[feat]
                for clas, weight in weights.items():
                    scores[clas] += value * weight
            return scores
Here we just sum the class weights for each feature. This is often expressed as a dot product, but I find that representation doesn't deal well with many classes.
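For instance, here's a made-up worked example (the feature names and weights are invented for illustration): two active features, scored over the three moves.

    weights = {
        'stack0-word=the': {SHIFT: 1, RIGHT: -1, LEFT: 2},
        'buffer0-word=cat': {SHIFT: 1, LEFT: 1},
    }
    features = {'stack0-word=the': 1, 'buffer0-word=cat': 1}

    scores = dict((clas, 0) for clas in MOVES)
    for feat, value in features.items():
        for clas, weight in weights.get(feat, {}).items():
            scores[clas] += value * weight

    print(scores)   # {0: 2, 1: -1, 2: 3}: LEFT gets the highest score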
The beam parser (RedShift) tracks multiple candidate analyses and only decides on the best one at the very end. We're going to trade accuracy for efficiency and simplicity, and follow only a single analysis: our search strategy will be entirely greedy, just as it was for the POS tagger, locking in a choice at each step.
If you read the POS-tagging post carefully, you may notice the underlying similarity: what we've done is map the parsing problem onto a sequence-labelling problem, which we solve with a "flat", or unstructured, learning algorithm (using greedy search).