Analysis of tokenizer.h in the Parser directory: Python lexical analysis


If you are puzzled by how Python's lexical analysis actually works, this article walks through it; we hope it helps you understand the implementation in tokenizer.h and tokenizer.c under the Parser directory.

Python's lexical analysis is implemented in tokenizer.h and tokenizer.c under the Parser directory. Other parts of Python directly call the functions declared in tokenizer.h, as follows:

 
 
  extern struct tok_state *PyTokenizer_FromString(const char *);
  extern struct tok_state *PyTokenizer_FromFile(FILE *, char *, char *);
  extern void PyTokenizer_Free(struct tok_state *);
  extern int PyTokenizer_Get(struct tok_state *, char **, char **);

All these functions start with PyTokenizer. This is a convention in the Python source code: although Python is implemented in C, its implementation borrows many object-oriented ideas. For lexical analysis, these four functions can be regarded as member functions of a PyTokenizer class. The first two, PyTokenizer_FromXXXX, act as constructors and return a PyTokenizer instance.

The internal state of the PyTokenizer object, that is, its member variables, is stored in tok_state. PyTokenizer_Free acts as the destructor and releases the memory occupied by the PyTokenizer, namely the tok_state. PyTokenizer_Get is a member function that obtains the next token from the token stream.

Both of the latter two functions take a tok_state pointer, which parallels the this pointer that is implicitly passed to member functions in C++. We can see that OO thinking is really independent of language: even a structured language like C can be used to write object-oriented programs, as the sketch below illustrates.
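To make the constructor/member/destructor analogy concrete, here is a minimal sketch (not taken from the article) of how the four functions fit together. It assumes compilation inside the CPython source tree, where tokenizer.h and the token type constants in token.h (such as ENDMARKER and ERRORTOKEN) are available; details vary between Python versions.

  #include <stdio.h>
  #include "tokenizer.h"   /* struct tok_state, PyTokenizer_* */
  #include "token.h"       /* token type constants: ENDMARKER, ERRORTOKEN, ... */

  static void dump_tokens(const char *source)
  {
      /* "Constructor": create a PyTokenizer instance in string mode. */
      struct tok_state *tok = PyTokenizer_FromString(source);
      if (tok == NULL)
          return;

      char *start, *end;
      int type;
      /* "Member function": pull tokens until the end marker. */
      while ((type = PyTokenizer_Get(tok, &start, &end)) != ENDMARKER) {
          if (type == ERRORTOKEN)
              break;
          if (start != NULL)   /* some tokens carry no text */
              printf("type=%d text=%.*s\n", type, (int)(end - start), start);
      }

      /* "Destructor": release the tok_state. */
      PyTokenizer_Free(tok);
  }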

 
 
tok_state

tok_state holds the state of the PyTokenizer class, that is, the set of its private members. Part of its definition is as follows:

 
 
  /* Tokenizer state */
  struct tok_state {
      /* Input state; buf <= cur <= inp <= end */
      /* NB an entire line is held in the buffer */
      char *buf;                /* Input buffer, or NULL; malloc'ed if fp != NULL */
      char *cur;                /* Next character in buffer */
      char *inp;                /* End of data in buffer */
      char *end;                /* End of input buffer if buf != NULL */
      char *start;              /* Start of current token if not NULL */
      int done;                 /* E_OK normally, E_EOF at EOF, otherwise error code */
      /* NB If done != E_OK, cur must be == inp!!! */
      FILE *fp;                 /* Rest of input; NULL if tokenizing a string */
      int tabsize;              /* Tab spacing */
      int indent;               /* Current indentation index */
      int indstack[MAXINDENT];  /* Stack of indents */
      int atbol;                /* Nonzero if at begin of new line */
      int pendin;               /* Pending indents (if > 0) or dedents (if < 0) */
      char *prompt, *nextprompt;    /* For interactive prompting */
      int lineno;               /* Current line number */
      int level;                /* () [] {} Parentheses nesting level */
                                /* Used to allow free continuations inside them */
  };
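Before turning to the buffer pointers, note the indentation fields (indent, indstack, atbol, pendin): they let the tokenizer turn leading whitespace into INDENT and DEDENT tokens. As a simplified illustration (a sketch, not the real CPython code), the logic at the beginning of each line looks roughly like this:

  #define MAXINDENT 100

  struct mini_indent {
      int indent;                 /* index of the top of indstack */
      int indstack[MAXINDENT];    /* stack of indentation columns; [0] == 0 */
      int pendin;                 /* pending INDENTs (> 0) or DEDENTs (< 0) */
  };

  /* Called at the beginning of a line with its indentation column;
     pendin then tells the caller how many INDENT or DEDENT tokens
     to emit before the line's first real token. */
  static void update_indent(struct mini_indent *t, int col)
  {
      if (col > t->indstack[t->indent]) {     /* deeper: push, owe one INDENT */
          t->indstack[++t->indent] = col;
          t->pendin++;
      }
      else {
          while (t->indent > 0 && col < t->indstack[t->indent]) {
              t->indent--;                    /* shallower: pop, owe DEDENTs */
              t->pendin--;
          }
          /* The real tokenizer also reports an error here if col does not
             match the column now on top of the stack. */
      }
  }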

The most important fields, though, are buf, cur, inp, end, and start. These fields directly determine the buffer contents:

buf is the beginning of the buffer. If PyTokenizer is in string mode, buf points to the string itself; otherwise, it points to a buffer used for reading the file. cur points to the next character in the buffer. inp points to the end of the valid data in the buffer. PyTokenizer processes input line by line: the content of each line is stored between buf and inp, including the trailing '\n'. Normally PyTokenizer takes the next character directly from the buffer; once cur reaches the position pointed to by inp, it prepares to read in the next line.
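As a self-contained illustration of this buffer discipline (a simplified model, not the real CPython code, and with a fixed-size array where tok_state uses a malloc'ed buffer), the following reads one line at a time and hands out characters until cur reaches inp:

  #include <stdio.h>
  #include <string.h>

  /* Simplified model of the tokenizer's buffer pointers:
     buf <= cur <= inp <= end. */
  struct mini_tok {
      char buf[256];   /* input buffer (fixed-size here for simplicity) */
      char *cur;       /* next character to hand out */
      char *inp;       /* end of valid data: one full line, including '\n' */
      char *end;       /* physical end of the buffer */
      FILE *fp;        /* rest of the input */
  };

  /* Return the next character, refilling the buffer one line at a time,
     as the tokenizer does once cur reaches inp. */
  static int mini_nextc(struct mini_tok *t)
  {
      if (t->cur == t->inp) {                   /* current line exhausted */
          if (fgets(t->buf, sizeof t->buf, t->fp) == NULL)
              return EOF;                       /* no more lines */
          t->cur = t->buf;
          t->inp = t->buf + strlen(t->buf);     /* fgets keeps the '\n' */
      }
      return *t->cur++;
  }

  int main(void)
  {
      struct mini_tok t;
      t.fp = stdin;
      t.cur = t.inp = t.buf;          /* empty buffer: first call refills */
      t.end = t.buf + sizeof t.buf;
      int c;
      while ((c = mini_nextc(&t)) != EOF)
          putchar(c);                 /* echo the input, character by character */
      return 0;
  }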

When PyTokenizer is in different modes, its behavior differs slightly. end marks the physical end of the buffer and is not used in string mode. start points to the start position of the current token; if no token is being analyzed yet, start is NULL. The above is an introduction to tokenizer.h and tokenizer.c in the Parser directory, which implement Python's lexical analysis; we hope you got something out of it.
