If you are puzzled by how Python's lexical analysis actually works, this article walks through it: the relevant code lives in tokenizer.h and tokenizer.c under the Parser directory.
Python's lexical analysis is implemented in tokenizer.h and tokenizer.c under the Parser directory. Other parts of Python call the functions declared in tokenizer.h directly, as follows:
```c
extern struct tok_state *PyTokenizer_FromString(const char *);
extern struct tok_state *PyTokenizer_FromFile(FILE *, char *, char *);
extern void PyTokenizer_Free(struct tok_state *);
extern int PyTokenizer_Get(struct tok_state *, char **, char **);
```
All these functions start with the prefix PyTokenizer. This is a naming convention in the Python source code: although Python is implemented in C, its implementation borrows many object-oriented ideas. For lexical analysis, these four functions can be regarded as member functions of a PyTokenizer "class". The first two, PyTokenizer_FromXXXX, act as constructors and return a PyTokenizer instance.
The internal state of the PyTokenizer object, i.e. its member variables, is stored in tok_state. PyTokenizer_Free acts as the destructor, releasing the memory occupied by the PyTokenizer, i.e. the tok_state. PyTokenizer_Get is a member function of PyTokenizer that fetches the next token from the token stream.
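Putting the four functions together, a minimal sketch of this object life cycle might look as follows. It drives CPython's internal tokenizer by hand and is not a public API; it assumes a CPython 2.x source tree, with ENDMARKER coming from token.h and E_OK from errcode.h:

```c
/* Minimal sketch, assuming CPython 2.x internals; not a public API. */
#include <stdio.h>
#include "tokenizer.h"   /* Parser/tokenizer.h: tok_state and the four functions */
#include "token.h"       /* token type constants, e.g. ENDMARKER */
#include "errcode.h"     /* E_OK, E_EOF */

int main(void)
{
    /* "Constructor": build a PyTokenizer instance over a string. */
    struct tok_state *tok = PyTokenizer_FromString("a = b + 1\n");
    if (tok == NULL)
        return 1;

    for (;;) {
        char *start, *end;
        /* "Member function": fetch the next token from the token stream. */
        int type = PyTokenizer_Get(tok, &start, &end);
        if (type == ENDMARKER || tok->done != E_OK)
            break;
        if (start != NULL && end != NULL)
            printf("token type=%d text=%.*s\n", type, (int)(end - start), start);
    }

    /* "Destructor": release the memory held by the tok_state. */
    PyTokenizer_Free(tok);
    return 0;
}
```

Note how PyTokenizer_Get returns the token type and sets start and end to delimit the token's text inside the buffer.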
In Python's lexical analysis, both of the latter functions take a tok_state pointer, which mirrors the implicit this pointer that C++ passes to member functions. We can see that the idea of OO is really independent of any particular language: even a procedural language like C can be used to write object-oriented programs.
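For readers unfamiliar with this pattern, here is a tiny, purely hypothetical example of the same idea, unrelated to the tokenizer itself:

```c
/* Hypothetical illustration of the OO-in-C pattern used by PyTokenizer:
   the object pointer is passed explicitly, playing the role of C++'s `this`. */
struct counter {
    int value;                                 /* "member variable" */
};

static void counter_inc(struct counter *self) /* "member function" */
{
    self->value++;                             /* self is the explicit `this` */
}
```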
- tok_state
tok_state is equivalent to the state of the PyTokenizer class, i.e. the set of its private members. Part of its definition is as follows:
```c
/* Tokenizer state */
struct tok_state {
    /* Input state; buf <= cur <= inp <= end */
    /* NB an entire line is held in the buffer */
    char *buf;      /* Input buffer, or NULL; malloc'ed if fp != NULL */
    char *cur;      /* Next character in buffer */
    char *inp;      /* End of data in buffer */
    char *end;      /* End of input buffer if buf != NULL */
    char *start;    /* Start of current token if not NULL */
    int done;       /* E_OK normally, E_EOF at EOF, otherwise error code */
    /* NB If done != E_OK, cur must be == inp!!! */
    FILE *fp;       /* Rest of input; NULL if tokenizing a string */
    int tabsize;    /* Tab spacing */
    int indent;     /* Current indentation index */
    int indstack[MAXINDENT];    /* Stack of indents */
    int atbol;      /* Nonzero if at begin of new line */
    int pendin;     /* Pending indents (if > 0) or dedents (if < 0) */
    char *prompt, *nextprompt;  /* For interactive prompting */
    int lineno;     /* Current line number */
    int level;      /* () [] {} Parentheses nesting level */
                    /* Used to allow free continuations inside them */
};
```
The most important fields are buf, cur, inp, end, and start; together they determine the contents of the buffer.
buf is the beginning of the buffer. If the PyTokenizer is in string mode, buf points to the string itself; otherwise, it points to the buffer used for reading from the file. cur points to the next character in the buffer, and inp points to the end of the valid data in the buffer. The PyTokenizer processes its input line by line: the content of each line is stored between buf and inp, including the trailing \n. Normally the PyTokenizer takes the next character directly from the buffer; once cur reaches the position pointed to by inp, it prepares to read in the next line.
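A greatly simplified sketch of this buffer discipline is shown below. The real logic lives in tok_nextc() in Parser/tokenizer.c; the line-refill step is elided here, and the function name next_char is ours:

```c
/* Simplified sketch of fetching one character, assuming the tok_state
   fields shown above and E_OK from errcode.h (refill logic elided). */
static int next_char(struct tok_state *tok)
{
    if (tok->cur != tok->inp)
        return (unsigned char)*tok->cur++;  /* fast path: next char of the current line */
    if (tok->done != E_OK)
        return EOF;                         /* EOF or an error was already reached */
    /* cur has caught up with inp: refill the buffer with the next line,
       from the string (string mode) or from tok->fp (file mode)... */
    return EOF;                             /* refill step elided in this sketch */
}
```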
The PyTokenizer's behavior differs slightly depending on which mode it is in. end marks the end of the buffer and is not used in string mode. start points to the start position of the current token; if no token has been recognized yet, start is NULL. This concludes our introduction to tokenizer.h and tokenizer.c in the Parser directory, which implement Python's lexical analysis. I hope you got something out of it.