Objective
The abstract syntax tree (AST) is the basis of any interpreter or compiler, and the fundamental tool behind many front-end build tools such as webpack, PostCSS, and Less. For ECMAScript, so many wheels have already been built that the ground is thoroughly covered: there are plenty of parsers, such as uglify, acorn, babylon, typescript, and esprima, as well as a community standard for the AST itself: ESTree.
This article describes how to write an AST parser, but instead of analyzing JavaScript it walks through parsing html5 into a syntax tree. html5 was chosen for two reasons. First, its syntax is simple: it ultimately boils down to just two node kinds, Text and Tag. Second, JavaScript already has so many parsers that building yet another wheel would be pointless, and while there are many HTML AST parsers as well, such as htmlparser2 and parse5, none of them follows an ESTree-like standard, and they share a common problem: tag attributes cannot be manipulated in the resulting syntax tree. To solve this problem I wrote an HTML parser with a fully specified AST structure, and this article came out of that work.
AST definition
To keep track of the source location of each node, first define a base node; all node types inherit from it:
```typescript
export interface IBaseNode {
  start: number; // start offset of the node
  end: number;   // end offset of the node
}
```
As mentioned earlier, HTML5 syntax can ultimately be reduced to two node kinds: one is Text, the other is Tag. An enumeration type is used to distinguish them:
```typescript
export enum SyntaxKind {
  Text = 'Text', // text node
  Tag = 'Tag',   // tag node
}
```
For text, the only property is the original string value, so the structure is as follows:
```typescript
export interface IText extends IBaseNode {
  type: SyntaxKind.Text; // node kind
  value: string;         // the raw string
}
```
For Tag, the node should include the opening part of the tag open, the attribute list attributes, the tag name name, the child tags/text body, and the closing part of the tag close:
```typescript
export interface ITag extends IBaseNode {
  type: SyntaxKind.Tag;     // node kind
  open: IText;              // the opening part, e.g. <div id="1">
  name: string;             // tag name, normalized to lower case
  attributes: IAttribute[]; // attribute list
  body: Array<ITag | IText> // child nodes: an array for a non-self-closing tag whose open tag is complete
    | void                  // void 0 for a self-closing tag
    | null;                 // null if the open tag is unfinished
  close: IText              // the closing part: a text node, if present
    | void                  // self-closing tags have no closing part
    | null;                 // null for a non-self-closing tag whose close tag is missing
}
```
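To make the shape concrete, here is a hand-written sketch (not actual parser output; the offsets are computed by hand, and the attribute list is elided) of the tree such a structure could take for a small input:

```typescript
// Hand-built sketch of an ITag-shaped node for `<div id="1">hi</div>`.
// Offsets are zero-based [start, end) positions into the input string.
const input = '<div id="1">hi</div>';

const tag = {
  type: 'Tag',
  start: 0,
  end: 20,
  open: { type: 'Text', start: 0, end: 12, value: '<div id="1">' },
  name: 'div',     // normalized to lower case
  attributes: [],  // the id="1" attribute, elided here for brevity
  body: [{ type: 'Text', start: 12, end: 14, value: 'hi' }],
  close: { type: 'Text', start: 14, end: 20, value: '</div>' },
};
```

Because every node carries start/end, the raw source of any part of the tree can be recovered with input.slice(node.start, node.end).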
An attribute of a tag is a key-value pair with a name part and a value part, defined by the following structure:
```typescript
export interface IAttribute extends IBaseNode {
  name: IText;                   // attribute name
  value: IAttributeValue | void; // attribute value, absent for bare attributes
}
```
The name is an ordinary text node, but the value is special: it may be wrapped in single or double quotation marks, and the quotes themselves carry no meaning, so a dedicated attribute-value structure is defined:
```typescript
export interface IAttributeValue extends IBaseNode {
  value: string;            // the value, excluding the quotes
  quote: '\'' | '"' | void; // quote kind: ', ", or none
}
```
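As an illustration, the fragment id="1" inside <div id="1"> could be represented as follows (a hand-written sketch with offsets computed by hand; whether the value's start/end spans the quotes is an assumption here, not confirmed by the parser):

```typescript
const input = '<div id="1">';

// Sketch of an IAttribute-shaped object for `id="1"`:
const attribute = {
  start: 5,
  end: 11,
  name: { type: 'Text', start: 5, end: 7, value: 'id' }, // the text `id`
  value: {
    start: 8,
    end: 11,    // assumed to span the quotes: `"1"`
    value: '1', // the quotes themselves are stripped
    quote: '"', // which quote character wrapped the value
  },
};
```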
Token parsing
AST parsing first tokenizes the original text into a list of tokens, and then derives the final syntax tree through context-aware analysis.
Although HTML looks simple, unlike JSON it requires context: JSON can produce its final result directly from token analysis, but HTML cannot, so for HTML tokenization is only the first, mandatory step. (For JSON parsing, see another article of mine on hand-writing a JSON parser in Golang.)
During tokenization, the meaning of each character is analyzed according to the current state, yielding a list of tokens.
First define the structure of the token:
```typescript
export interface IToken {
  start: number;   // start offset
  end: number;     // end offset
  value: string;   // the token text
  type: TokenKind; // token kind
}
```
There are several token types:
```typescript
export enum TokenKind {
  Literal = 'Literal',         // text
  OpenTag = 'OpenTag',         // tag name
  OpenTagEnd = 'OpenTagEnd',   // end of an open tag: may be '/', '', or '--'
  CloseTag = 'CloseTag',       // close tag
  Whitespace = 'Whitespace',   // whitespace between attributes inside an open tag
  AttrValueEq = 'AttrValueEq', // the '=' inside an attribute
  AttrValueNq = 'AttrValueNq', // unquoted attribute value
  AttrValueSq = 'AttrValueSq', // single-quoted attribute value
  AttrValueDq = 'AttrValueDq', // double-quoted attribute value
}
```
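To make these kinds concrete, here is one plausible token stream for the input <a href=x>hi</a>. This is a hand-written illustration of the rules, not captured parser output, so the exact values (especially for OpenTagEnd and CloseTag) are assumptions:

```typescript
// A plausible tokenization of `<a href=x>hi</a>` (illustrative only):
const sampleTokens = [
  { type: 'OpenTag',     value: 'a' },    // tag name after `<`
  { type: 'Whitespace',  value: ' ' },    // separates name and attributes
  { type: 'AttrValueNq', value: 'href' }, // attribute pieces are uniform fragments...
  { type: 'AttrValueEq', value: '=' },    // ...with `=` as its own special fragment
  { type: 'AttrValueNq', value: 'x' },    // unquoted value
  { type: 'OpenTagEnd',  value: '' },     // end of the open tag
  { type: 'Literal',     value: 'hi' },   // text between tags
  { type: 'CloseTag',    value: 'a' },    // `</a>`
];
```

Note that the key/value pairing of href, =, and x is not decided here; that is the job of the upper-layer parser, as described next.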
Tokenization does not take the attribute's key/value relationship into account: every piece of an attribute is treated uniformly as a fragment, with = treated as a special, independent fragment, and the upper-layer parser then analyzes the key-value relationship. The reason for doing this is to avoid contextual processing during tokenization and to keep the state machine's state table simple. The list of states is as follows:
```typescript
enum State {
  Literal = 'Literal',
  BeforeOpenTag = 'BeforeOpenTag',
  OpeningTag = 'OpeningTag',
  AfterOpenTag = 'AfterOpenTag',
  InValueNq = 'InValueNq',
  InValueSq = 'InValueSq',
  InValueDq = 'InValueDq',
  ClosingOpenTag = 'ClosingOpenTag',
  OpeningSpecial = 'OpeningSpecial',
  OpeningDoctype = 'OpeningDoctype',
  OpeningNormalComment = 'OpeningNormalComment',
  InNormalComment = 'InNormalComment',
  InShortComment = 'InShortComment',
  ClosingNormalComment = 'ClosingNormalComment',
  ClosingTag = 'ClosingTag',
}
```
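For intuition, here is a hand-traced walk through a few of these states for the input <div> (an illustration of the idea, not generated from the real machine, which distinguishes more cases such as AfterOpenTag and the comment states):

```typescript
// Hand-traced state transitions for the input `<div>` (illustrative):
const trace = [
  { char: '<', from: 'Literal',       to: 'BeforeOpenTag' }, // a tag may begin
  { char: 'd', from: 'BeforeOpenTag', to: 'OpeningTag' },    // a letter: tag name starts
  { char: 'i', from: 'OpeningTag',    to: 'OpeningTag' },    // still inside the name
  { char: 'v', from: 'OpeningTag',    to: 'OpeningTag' },
  { char: '>', from: 'OpeningTag',    to: 'Literal' },       // open tag done, back to text
];
```

The useful invariant is that each step's target state is the next step's source state, which is what lets a single current-state variable drive the whole loop.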
The whole parser is written in a functional style rather than an OO one. To simplify passing state between functions, and because the operation is synchronous, it takes advantage of JavaScript's single-threaded execution model and keeps the state in module-level global variables. The variables required for token analysis are as follows:
```typescript
let state: State         // the current state
let buffer: string       // the input string
let bufSize: number      // length of the input string
let sectionStart: number // start offset of the token being parsed
let index: number        // offset of the character currently being parsed
let tokens: IToken[]     // the tokens parsed so far
let char: number         // Unicode code point of the character at the current offset
```
Before you begin parsing, you need to initialize the global variables:
```typescript
function init(input: string) {
  state = State.Literal
  buffer = input
  bufSize = input.length
  sectionStart = 0
  index = 0
  tokens = []
}
```
Parsing then proceeds by traversing every character of the input string and handling it according to the current state (changing state, emitting a token, and so on). After parsing completes, the global variables are reset and the result is returned.
```typescript
export function tokenize(input: string): IToken[] {
  init(input)
  while (index < bufSize) {
    char = buffer.charCodeAt(index)
    switch (state) {
      // ...handle each state accordingly
      // the per-state handling is omitted in this article; see the source code for details
    }
    index++
  }
  const result = tokens
  // reset the global state
  init('')
  return result
}
```
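As a self-contained illustration of this loop, here is a heavily simplified sketch (not the project's implementation): it recognizes only plain text and tag names, ignoring attributes, comments, and all the other states, but it shows the same pattern of a state variable, a sectionStart marker, and token emission:

```typescript
enum MiniState { Literal, BeforeOpenTag, InTagName }

interface MiniToken {
  start: number;
  end: number;
  value: string;
  type: 'Literal' | 'TagName';
}

function miniTokenize(input: string): MiniToken[] {
  const tokens: MiniToken[] = [];
  let state = MiniState.Literal;
  let sectionStart = 0;

  // Emit the pending section [sectionStart, end) as a token, if non-empty.
  const emit = (type: MiniToken['type'], end: number) => {
    if (end > sectionStart) {
      tokens.push({ start: sectionStart, end, value: input.slice(sectionStart, end), type });
    }
    sectionStart = end;
  };

  for (let index = 0; index < input.length; index++) {
    const ch = input[index];
    switch (state) {
      case MiniState.Literal:
        if (ch === '<') {
          emit('Literal', index); // flush pending text
          state = MiniState.BeforeOpenTag;
        }
        break;
      case MiniState.BeforeOpenTag:
        // First character after `<` starts the name (a `</x` close
        // tag is kept as the name `/x` in this simplified sketch).
        sectionStart = index;
        state = MiniState.InTagName;
        break;
      case MiniState.InTagName:
        if (ch === '>') {
          emit('TagName', index);
          sectionStart = index + 1;
          state = MiniState.Literal;
        }
        break;
    }
  }
  emit('Literal', input.length); // trailing text, if any
  return tokens;
}
```

For example, miniTokenize('&lt;b&gt;hi&lt;/b&gt;') yields three tokens: the names b and /b and the literal hi.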
Syntax Tree parsing
After obtaining the token list, the final node tree is parsed from the context. The approach is similar to tokenize: global variables hold the state, and all tokens are traversed. The difference is that no global state machine is needed, because the state can be determined entirely from the kind of node currently being parsed.
```typescript
export function parse(input: string): INode[] {
  init(input)
  while (index < count) {
    token = tokens[index]
    switch (token.type) {
      case TokenKind.Literal:
        if (!node) {
          node = createLiteral()
          pushNode(node)
        } else {
          appendLiteral(node)
        }
        break
      case TokenKind.OpenTag:
        node = void 0
        parseOpenTag()
        break
      case TokenKind.CloseTag:
        node = void 0
        parseCloseTag()
        break
      default:
        unexpected()
        break
    }
    index++
  }
  const _nodes = nodes
  init('')
  return _nodes
}
```
Rather than going into more detail here, you can view the full source code on GitHub.
Conclusion
The project is open source under the name html5parser, and can be installed with npm install html5parser or yarn add html5parser.
Or go to GitHub to view the source code: Acrazing/html5parser.
Normal HTML parsing is currently fully covered by tests. Known bugs include the parsing of comments and the parsing of unfinished input (both at the syntax-tree level; token analysis passes the tests).