If you are puzzled by how Python's lexical analysis actually works, this article walks through it: the relevant code lives in tokenizer.h and tokenizer.c under the Parser directory.
Python's lexical analysis is implemented in tokenizer.h and tokenizer.c under the Parser directory. Other parts of Python call the functions declared in tokenizer.h directly, as follows:
```c
extern struct tok_state *PyTokenizer_FromString(const char *);
extern struct tok_state *PyTokenizer_FromFile(FILE *, char *, char *);
extern void PyTokenizer_Free(struct tok_state *);
extern int PyTokenizer_Get(struct tok_state *, char **, char **);
```
All these functions start with the prefix PyTokenizer. This is a naming convention in the Python source code: although Python is implemented in C, its implementation borrows many object-oriented ideas. For lexical analysis, these four functions can be regarded as member functions of a PyTokenizer "class". The first two, PyTokenizer_FromXXXX, act as constructors and return a PyTokenizer instance.
The internal state of the PyTokenizer object, i.e. its member variables, is stored in tok_state. PyTokenizer_Free acts as the destructor, releasing the memory occupied by the PyTokenizer, i.e. the tok_state. PyTokenizer_Get is a member function of PyTokenizer that fetches the next token from the token stream.
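Putting the four functions together, a minimal sketch of this object life cycle might look as follows. It drives CPython's internal tokenizer by hand and is not a public API; it assumes a CPython 2.x source tree, with ENDMARKER coming from token.h and E_OK from errcode.h:

```c
/* Minimal sketch, assuming CPython 2.x internals; not a public API. */
#include <stdio.h>
#include "tokenizer.h"   /* Parser/tokenizer.h: tok_state and the four functions */
#include "token.h"       /* token type constants, e.g. ENDMARKER */
#include "errcode.h"     /* E_OK, E_EOF */

int main(void)
{
    /* "Constructor": build a PyTokenizer instance over a string. */
    struct tok_state *tok = PyTokenizer_FromString("a = b + 1\n");
    if (tok == NULL)
        return 1;

    for (;;) {
        char *start, *end;
        /* "Member function": fetch the next token from the token stream. */
        int type = PyTokenizer_Get(tok, &start, &end);
        if (type == ENDMARKER || tok->done != E_OK)
            break;
        if (start != NULL && end != NULL)
            printf("token type=%d text=%.*s\n", type, (int)(end - start), start);
    }

    /* "Destructor": release the memory held by the tok_state. */
    PyTokenizer_Free(tok);
    return 0;
}
```

Note how PyTokenizer_Get returns the token type and sets start and end to delimit the token's text inside the buffer.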
In Python's lexical analysis, both of the latter functions take a tok_state pointer, which mirrors the implicit this pointer that C++ passes to member functions. We can see that the idea of OO is really independent of any particular language: even a procedural language like C can be used to write object-oriented programs.
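For readers unfamiliar with this pattern, here is a tiny, purely hypothetical example of the same idea, unrelated to the tokenizer itself:

```c
/* Hypothetical illustration of the OO-in-C pattern used by PyTokenizer:
   the object pointer is passed explicitly, playing the role of C++'s `this`. */
struct counter {
    int value;                                 /* "member variable" */
};

static void counter_inc(struct counter *self) /* "member function" */
{
    self->value++;                             /* self is the explicit `this` */
}
```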
- tok_state
tok_state is equivalent to the state of the PyTokenizer class, i.e. the set of its private members. Part of its definition is as follows:
```c
/* Tokenizer state */
struct tok_state {
    /* Input state; buf <= cur <= inp <= end */
    /* NB an entire line is held in the buffer */
    char *buf;      /* Input buffer, or NULL; malloc'ed if fp != NULL */
    char *cur;      /* Next character in buffer */
    char *inp;      /* End of data in buffer */
    char *end;      /* End of input buffer if buf != NULL */
    char *start;    /* Start of current token if not NULL */
    int done;       /* E_OK normally, E_EOF at EOF, otherwise error code */
    /* NB If done != E_OK, cur must be == inp!!! */
    FILE *fp;       /* Rest of input; NULL if tokenizing a string */
    int tabsize;    /* Tab spacing */
    int indent;     /* Current indentation index */
    int indstack[MAXINDENT];    /* Stack of indents */
    int atbol;      /* Nonzero if at begin of new line */
    int pendin;     /* Pending indents (if > 0) or dedents (if < 0) */
    char *prompt, *nextprompt;  /* For interactive prompting */
    int lineno;     /* Current line number */
    int level;      /* () [] {} Parentheses nesting level */
                    /* Used to allow free continuations inside them */
};
```
The most important fields are buf, cur, inp, end, and start; together they determine the contents of the buffer.
buf is the beginning of the buffer. If the PyTokenizer is in string mode, buf points to the string itself; otherwise, it points to the buffer used for reading from the file. cur points to the next character in the buffer, and inp points to the end of the valid data in the buffer. The PyTokenizer processes its input line by line: the content of each line is stored between buf and inp, including the trailing \n. Normally the PyTokenizer takes the next character directly from the buffer; once cur reaches the position pointed to by inp, it prepares to read in the next line.
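A greatly simplified sketch of this buffer discipline is shown below. The real logic lives in tok_nextc() in Parser/tokenizer.c; the line-refill step is elided here, and the function name next_char is ours:

```c
/* Simplified sketch of fetching one character, assuming the tok_state
   fields shown above and E_OK from errcode.h (refill logic elided). */
static int next_char(struct tok_state *tok)
{
    if (tok->cur != tok->inp)
        return (unsigned char)*tok->cur++;  /* fast path: next char of the current line */
    if (tok->done != E_OK)
        return EOF;                         /* EOF or an error was already reached */
    /* cur has caught up with inp: refill the buffer with the next line,
       from the string (string mode) or from tok->fp (file mode)... */
    return EOF;                             /* refill step elided in this sketch */
}
```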
The PyTokenizer's behavior differs slightly depending on which mode it is in. end marks the end of the buffer and is not used in string mode. start points to the start position of the current token; if no token has been recognized yet, start is NULL. This concludes our introduction to tokenizer.h and tokenizer.c in the Parser directory, which implement Python's lexical analysis. I hope you got something out of it.