We all know that the execution of a Python program can be divided into five steps. This article introduces the first step, lexical analysis. If you are interested, you can also read the following article:
Python source code analysis 3: the lexical analyzer PyTokenizer
Introduction
The execution of a Python program can be divided into five steps:
The tokenizer performs lexical analysis, splitting the source program into tokens.
The parser builds a CST (concrete syntax tree) from the tokens.
The CST is converted to an AST (abstract syntax tree).
The AST is compiled into bytecode.
The bytecode is executed.
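All five stages can be observed from Python itself using standard-library modules (a sketch for illustration; the exact tokens, AST dump format, and bytecode vary between Python versions):

```python
import ast
import dis
import io
import tokenize

source = "sum = 0"

# Step 1: lexical analysis -- split the source into tokens.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
print([tok.string for tok in tokens if tok.string])  # ['sum', '=', '0']

# Steps 2-3: CPython builds a CST internally while parsing,
# then produces an AST; ast.parse exposes the AST.
tree = ast.parse(source)
print(ast.dump(tree))

# Step 4: the AST is compiled into a code object containing bytecode.
code = compile(tree, "<example>", "exec")

# Step 5: the bytecode is executed by the interpreter loop.
exec(code)

# dis shows the bytecode instructions of the code object.
dis.dis(code)
```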
This article describes the first step of Python program execution, that is, lexical analysis.
In simple terms, lexical analysis combines the characters of the source program into tokens.
For example, sum = 0 can be split into three tokens: 'sum', '=', and '0'. Whitespace in a program usually serves only as a separator and is discarded, so it does not appear in the token list. In Python, however, the syntax rules require tabs and spaces to be analyzed to determine program indentation, so whitespace handling in Python is slightly more complex than in a C/C++ compiler.
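This difference is visible in the token stream itself: Python's tokenizer turns changes in leading whitespace into explicit INDENT and DEDENT tokens, which the parser treats like block delimiters. A small sketch using the standard-library tokenize module (a pure-Python reimplementation of the lexer):

```python
import io
import tokenize

source = (
    "if x:\n"
    "    y = 1\n"
)

# Leading whitespace is not simply discarded: a change in indentation
# is emitted as an INDENT or DEDENT token.
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))
```

Running this prints an INDENT token before `y = 1` and a DEDENT token at the end of the block.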
In Python, lexical analysis is implemented in tokenizer.h and tokenizer.c under the Parser directory. Other parts of Python call the functions declared in tokenizer.h directly, as follows:
extern struct tok_state *PyTokenizer_FromString(const char *);
extern struct tok_state *PyTokenizer_FromFile(FILE *, char *, char *);
extern void PyTokenizer_Free(struct tok_state *);
extern int PyTokenizer_Get(struct tok_state *, char **, char **);
All these functions start with the PyTokenizer prefix. This is a naming convention in the Python source code. Although Python is implemented in C, its implementation borrows many object-oriented ideas. For lexical analysis, these four functions can be regarded as member functions of a PyTokenizer "class".
The first two functions, PyTokenizer_FromXXXX, can be regarded as constructors that return a PyTokenizer instance. The internal state of the PyTokenizer object, that is, its member variables, is stored in tok_state. PyTokenizer_Free can be regarded as the destructor, releasing the memory occupied by the PyTokenizer, that is, the tok_state.
PyTokenizer_Get is a member function of PyTokenizer that obtains the next token in the token stream. Both of these functions take a tok_state pointer, which parallels the implicit this pointer passed to member functions in C++. We can see that OO is really an idea independent of any particular language: even a procedural language like C can be used to write object-oriented programs.
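At the Python level, the pattern of repeatedly calling PyTokenizer_Get, pulling one token at a time from a stateful tokenizer object, can be mimicked with the generator returned by tokenize.generate_tokens. This is only an analogy (tokenize is a pure-Python reimplementation, not the C code path), but the shape of the API is the same: a stateful object plus a "give me the next token" call.

```python
import io
import tokenize

# The generator object plays the role of tok_state: it holds the
# tokenizer's internal state between calls.
token_stream = tokenize.generate_tokens(io.StringIO("a + b\n").readline)

# Each next() call is analogous to one PyTokenizer_Get call:
# it advances the shared state and yields the next token.
first = next(token_stream)
second = next(token_stream)
print(first.string, second.string)  # prints: a +
```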
The above is an introduction to lexical analysis, the first step in the execution of a Python program. I hope you will find it helpful.