Converter 3: a hand-written PHP-to-Python compiler, lexical part
Last week I wrote a converter from ThinkPHP templates to Flask and Django templates.
The main act is turning an entire PHP program into Python. With templates I could be lazy and get by with regular-expression matching; this time there is no way around writing a PHP compiler.
I searched the internet and found that most Python-to-xxx transpilers work directly on the AST: they skip the most important parts, the Tokenizer and the Parser, and just write a Visitor. The others are built on parser generators such as ANTLR and come with a large amount of generated code, which looks tedious to work with.
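To illustrate the AST-only approach described above, here is a minimal sketch using Python's standard `ast` module. The class name `NameCollector` and the sample source are made up for illustration; the point is only that `ast.parse` does the tokenizing and parsing, so the author of such a tool writes nothing but the Visitor.

```python
import ast


class NameCollector(ast.NodeVisitor):
    """Hypothetical visitor: records defined functions and called names."""

    def __init__(self):
        self.defined = []
        self.called = []

    def visit_FunctionDef(self, node):
        self.defined.append(node.name)
        self.generic_visit(node)  # keep walking into the function body

    def visit_Call(self, node):
        # Only plain names such as print(...) are recorded, not obj.method(...)
        if isinstance(node.func, ast.Name):
            self.called.append(node.func.id)
        self.generic_visit(node)  # nested calls live in the arguments


source = "def greet(name):\n    print(len(name))\n"
tree = ast.parse(source)  # tokenizing and parsing are done by the library
collector = NameCollector()
collector.visit(tree)
```

After this runs, `collector.defined` holds the function names and `collector.called` the called names in visiting order. Nothing comparable exists out of the box for PHP source in Python, which is why the front end has to be written by hand here.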
Since nobody seems to want to do this, I will try writing a PHP compiler by hand. It is implemented in three parts: Tokenizer, Parser, and Visitor.
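As a rough picture of how the three parts fit together, here is a toy sketch that handles only statements of the form `$x = a + b + ...;`. It is not the converter's code: the functions `tokenize`, `parse`, and `emit`, the tuple-based tree, and the tiny grammar are all assumptions chosen to keep the example short.

```python
import re

# One regex per token kind: $variable, integer, or one of the symbols = + ;
TOKEN_RE = re.compile(r"\s*(?:(\$[A-Za-z_]\w*)|(\d+)|([=+;]))")


def tokenize(src):
    """Stage 1 (Tokenizer): source text -> list of (kind, text) tokens."""
    tokens, pos = [], 0
    src = src.rstrip()
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        var, num, sym = m.groups()
        if var:
            tokens.append(("var", var[1:]))  # drop the leading '$'
        elif num:
            tokens.append(("num", num))
        else:
            tokens.append(("sym", sym))
        pos = m.end()
    return tokens


def parse(tokens):
    """Stage 2 (Parser): tokens -> ("assign", name, expr) tree."""
    if (len(tokens) < 4 or tokens[0][0] != "var"
            or tokens[1] != ("sym", "=") or tokens[-1] != ("sym", ";")):
        raise SyntaxError("expected `$name = expr;`")
    body = tokens[2:-1]
    # Operands sit at even positions, '+' signs at odd positions.
    if (len(body) % 2 == 0
            or any(t != ("sym", "+") for t in body[1::2])
            or any(t[0] == "sym" for t in body[0::2])):
        raise SyntaxError("expected operands separated by '+'")
    expr = body[0]
    for operand in body[2::2]:
        expr = ("add", expr, operand)  # left-associative chain
    return ("assign", tokens[0][1], expr)


def emit(node):
    """Stage 3 (Visitor-style walk): tree -> Python source text."""
    kind = node[0]
    if kind == "assign":
        return f"{node[1]} = {emit(node[2])}"
    if kind == "add":
        return f"{emit(node[1])} + {emit(node[2])}"
    return node[1]  # "var" or "num" leaf


python_line = emit(parse(tokenize("$total = $a + 2;")))
```

Here `python_line` ends up as `total = a + 2`. A real PHP front end needs far more token kinds and grammar rules, but the division of labor between the three stages stays the same.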
I consulted the Dragon Book (longshu) and the Tiger Book (Hu Shu) for reference. Only after studying PHP carefully did I realize how little I knew about it: PHP has a great many features, and writing a compiler for it is really tiring.
The lexical part is simple: it is a finite automaton (a state machine). I designed a structure to hold the automaton and then programmed against it in the crudest, most direct way, with no regard for performance. This is a one-off job.
Writing it was quick; debugging did not go so smoothly, but I won't go into that, haha.
The automaton itself is not complex. Take a look, and please correct me where I got something wrong.
    # Each state maps to a list of rules. 'token' is a regular expression,
    # 'next' is the state to switch to ('' means stay in the current state).
    self.statemachine = {
        'current': {'state': 'default', 'content': '', 'line': 0},
        # Outside PHP tags: wait for an opening tag.
        'default': [
            {'name': 'open', 'next': 'php', 'extra': 0, 'token': r'<\?', 'start': 0, 'end': 0, 'cache': ''},
            {'name': 'open', 'next': 'php', 'extra': 0, 'token': r'<\?php', 'start': 0, 'end': 0, 'cache': ''}],
        # Inside PHP code.
        'php': [
            {'name': 'close', 'next': 'default', 'extra': 0, 'token': r'\?>', 'start': 0, 'end': 0, 'cache': ''},
            {'name': 'lnum', 'next': '', 'extra': 0, 'token': r'[0-9]+', 'start': 0, 'end': 0, 'cache': ''},
            {'name': 'dnum', 'next': '', 'extra': 0, 'token': r'([0-9]*\.[0-9]+)|([0-9]+\.[0-9]*)', 'start': 0, 'end': 0, 'cache': ''},
            {'name': 'exponent', 'next': '', 'extra': 0, 'token': r'(([0-9]+|([0-9]*\.[0-9]+)|([0-9]+\.[0-9]*))[eE][+-]?[0-9]+)', 'start': 0, 'end': 0, 'cache': ''},
            {'name': 'hnum', 'next': '', 'extra': 0, 'token': r'0x[0-9a-fA-F]+', 'start': 0, 'end': 0, 'cache': ''},
            {'name': 'bnum', 'next': '', 'extra': 0, 'token': r'0b[01]+', 'start': 0, 'end': 0, 'cache': ''},
            {'name': 'label', 'next': '', 'extra': 0, 'token': r'[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*', 'start': 0, 'end': 0, 'cache': ''},
            {'name': 'comment', 'next': 'commentline', 'extra': 1, 'token': r'//', 'start': 0, 'end': 0, 'cache': ''},
            {'name': 'comment', 'next': 'commentline', 'extra': 1, 'token': r'#', 'start': 0, 'end': 0, 'cache': ''},
            {'name': 'comment', 'next': 'comment', 'extra': 1, 'token': r'/\*', 'start': 0, 'end': 0, 'cache': ''},
            {'name': 'string', 'next': 'string1', 'extra': 1, 'token': r'\'', 'start': 0, 'end': 0, 'cache': ''},
            {'name': 'string', 'next': 'string2', 'extra': 1, 'token': r'"', 'start': 0, 'end': 0, 'cache': ''},
            # '-' is escaped so that it is a literal and not the range '+' to '/'
            # (the matched set of characters is unchanged).
            {'name': 'symbol', 'next': '', 'extra': 0, 'token': r'[\\\{\};:,\.\[\]\(\)\|\^&\+\-/\*=%!~$<>\?@]', 'start': 0, 'end': 0, 'cache': ''}],
        # Single-quoted string body.
        'string1': [
            {'name': 'string', 'next': 'php', 'extra': 0, 'token': r'\'', 'start': 0, 'end': 0, 'cache': ''},
            {'name': 'string', 'next': 'escape1', 'extra': 1, 'token': r'\\', 'start': 0, 'end': 0, 'cache': ''},
            {'name': 'string', 'next': '', 'extra': 1, 'token': r'', 'start': 0, 'end': 0, 'cache': ''}],
        'escape1': [
            {'name': 'string', 'next': 'string1', 'extra': 1, 'token': r'.', 'start': 0, 'end': 0, 'cache': ''}],
        # Double-quoted string body; it must end at a double quote
        # (the closing pattern was r'\'' here, copied from 'string1').
        'string2': [
            {'name': 'string', 'next': 'php', 'extra': 0, 'token': r'"', 'start': 0, 'end': 0, 'cache': ''},
            {'name': 'string', 'next': 'escape2', 'extra': 1, 'token': r'\\', 'start': 0, 'end': 0, 'cache': ''},
            {'name': 'string', 'next': '', 'extra': 1, 'token': r'', 'start': 0, 'end': 0, 'cache': ''}],
        'escape2': [
            {'name': 'string', 'next': 'string2', 'extra': 1, 'token': r'.', 'start': 0, 'end': 0, 'cache': ''}],
        # Line comment (// or #): ends at the line break.
        # '\r\n' is listed first; after '\r' it could never match as a unit.
        'commentline': [
            {'name': 'comment', 'next': 'php', 'extra': 0, 'token': r'(\r\n|\r|\n)', 'start': 0, 'end': 0, 'cache': ''},
            {'name': 'comment', 'next': 'php', 'extra': 0, 'token': r'', 'start': 0, 'end': 0, 'cache': ''}],
        # Block comment: ends at '*/'.
        'comment': [
            {'name': 'comment', 'next': 'php', 'extra': 0, 'token': r'\*/', 'start': 0, 'end': 0, 'cache': ''},
            {'name': 'comment', 'next': '', 'extra': 1, 'token': r'', 'start': 0, 'end': 0, 'cache': ''}]}
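The post does not show the loop that drives this table, so the following is only a guess at how such a table could be consumed, not the code from the zip file. It uses a reduced, hypothetical `TABLE` with the same `name` / `next` / `extra` / `token` fields and makes three assumptions: `extra: 1` means "keep accumulating into the current token", an empty `token` pattern is a fallback that accepts any single character, and among competing rules the longest match wins.

```python
import re

# Reduced, hypothetical table; field meanings are assumptions (see above).
TABLE = {
    "php": [
        {"name": "dnum", "next": "", "extra": 0, "token": r"[0-9]*\.[0-9]+|[0-9]+\.[0-9]*"},
        {"name": "lnum", "next": "", "extra": 0, "token": r"[0-9]+"},
        {"name": "label", "next": "", "extra": 0, "token": r"[a-zA-Z_][a-zA-Z0-9_]*"},
        {"name": "string", "next": "string1", "extra": 1, "token": r"'"},
        {"name": "symbol", "next": "", "extra": 0, "token": r"[$=;.+]"},
    ],
    "string1": [
        {"name": "string", "next": "php", "extra": 0, "token": r"'"},
        {"name": "string", "next": "", "extra": 1, "token": r""},  # fallback
    ],
}


def tokenize(src, table=TABLE, state="php"):
    """Walk the source once, switching states as the matched rules dictate."""
    tokens, cache, pos = [], "", 0
    while pos < len(src):
        if state == "php" and src[pos].isspace():
            pos += 1  # whitespace between tokens is skipped
            continue
        rule, text = None, ""
        for candidate in table[state]:
            if not candidate["token"]:
                continue  # the fallback is handled after the loop
            m = re.compile(candidate["token"]).match(src, pos)
            if m and len(m.group(0)) > len(text):  # longest match wins
                rule, text = candidate, m.group(0)
        if rule is None:
            # Assumed meaning of an empty pattern: accept one character.
            rule = next((c for c in table[state] if not c["token"]), None)
            if rule is None:
                raise SyntaxError(f"unexpected {src[pos]!r} at offset {pos}")
            text = src[pos]
        cache += text
        pos += len(text)
        if rule["extra"] == 0:  # token is complete: emit it
            tokens.append((rule["name"], cache))
            cache = ""
        state = rule["next"] or state  # '' means stay in this state
    if cache:
        raise SyntaxError("input ended inside an unfinished token")
    return tokens


result = tokenize("$price = 3.14 + 'it';")
```

With these assumptions, `3.14` comes out as a single `dnum` token even though `lnum` also matches its first digit, and the quoted text is collected character by character until the closing quote switches the state back to `php`. Recompiling every pattern on each step is wasteful, which fits the "no regard for performance" spirit of the post; precompiling the patterns once would be the obvious improvement.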
Source code: converterV0.3.zip
<To be continued>