My own regular expression engine is finished: notes on the ideas and design

Last Update:2014-10-26 Source: Internet

Author: User

Tags expression engine


I have been writing this for the last 20 days, and it is finally finished (after a seemingly endless road of refactoring). I would like to thank vczh and the author of re2 for the guidance in their blogs, and vczh for the source code I used as a reference. Thank you very much, it was a great inspiration. The scheme vczh uses to traverse the regular expression syntax tree is really subtle. I already knew that the visitor pattern is the way to traverse a heterogeneous tree, but I did not know how to write a visitor framework that met my needs. While using vczh's, I kept admiring how well it is designed. I have to admit that I copied that framework code :) because the implementation is simply too good. Everything else I designed and implemented myself, based on the references in the blogs.

This post records how the whole engine was implemented.

High-level design:

Broadly, implementing a regular expression engine means completing the following steps:

1. Lexical analysis.
2. Syntax analysis.
3. Character set orthogonalization.
4. Build the NFA.
5. Split the NFA into parts according to the types of its edges, and build a DFA for the parts where that is possible.
6. Write the regex parsing class, which holds both the DFA and the NFA form of the expression.
7. Write the matching algorithm, which switches between DFA matching and NFA matching at different stages of a match.
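
As a rough illustration of the character set orthogonalization step, the sketch below splits overlapping character ranges into disjoint pieces, so that every edge of the automaton can later be labeled with whole pieces. The names CharRange and orthogonalize are made up for this example and are not taken from the engine's source.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Inclusive range of character codes, e.g. {'a', 'z'}.
using CharRange = std::pair<int, int>;

// Split possibly overlapping ranges into disjoint ranges such that every
// input range is exactly a union of output ranges.
std::vector<CharRange> orthogonalize(const std::vector<CharRange>& ranges) {
    // Collect every point where some range starts (first) or stops (second + 1).
    std::vector<int> cuts;
    for (const auto& r : ranges) {
        cuts.push_back(r.first);
        cuts.push_back(r.second + 1);
    }
    std::sort(cuts.begin(), cuts.end());
    cuts.erase(std::unique(cuts.begin(), cuts.end()), cuts.end());

    std::vector<CharRange> result;
    for (std::size_t i = 0; i + 1 < cuts.size(); ++i) {
        CharRange piece{cuts[i], cuts[i + 1] - 1};
        // Keep the piece only if some input range covers it (gaps are dropped).
        bool covered = std::any_of(ranges.begin(), ranges.end(),
            [&](const CharRange& r) {
                return r.first <= piece.first && piece.second <= r.second;
            });
        if (covered) result.push_back(piece);
    }
    return result;
}
```

For example, the ranges a-z and m-p come out as the three pieces a-l, m-p and q-z.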

Module Design:
Lexical analysis of the regular expression language is very simple. The whole thing can be done with plain string matching; there is no need to build a DFA or anything like it.
The tokens to recognize are the quoted ("") items in the following grammar:
Alert = Unit "|" Alert
      = Unit
Unit = Express Unit | Express
Express = Factor Loop | Factor
Loop = "{" Number "}"
     = "{" Number "," "}"
     = "{" Number "," Number "}"
     = "{" Number "}?"
     = "{" Number "," "}?"
     = "{" Number "," Number "}?"
     = "*"
     = "?"
     = "+"
     = "*?"
     = "??"
     = "+?"
Factor = "(" Alert ")"
       = "(<" Name ">" Alert ")"
       = "(?:" Alert ")"
       = "(?=" Alert ")"
       = "(?!" Alert ")"
       = "(?<=" Alert ")"
       = "(?<!" Alert ")"
       = "$"
       = "^"
       = Backreference
       = Charset
       = Normalchar
Charset = "[^" Charsetcomponent "]" | "[" Charsetcomponent "]" | Char | "\x"
Charsetcomponent = Charunit Charsetcomponent | Charunit
Charunit = Char "-" Char | Char
Backreference = "\k" "<" Name ">" | "\" Number
Note = "(#" ... ")"
Lexical analysis:
The overall lexing routine is:

    Ptr<vector<RegexToken>> RegexLex::parsingpattern(int start_index, int end_index)
    {
        Ptr<vector<RegexToken>> result(make_shared<vector<RegexToken>>());
        for (auto index = start_index; index < end_index;)
        {
            // Try the longest candidate token first (4 characters), then shorter ones.
            for (auto catch_length = 4; catch_length >= 1; catch_length--)
            {
                // Copy the substring; binding a non-const reference to the temporary is not standard C++.
                auto key = pattern.substr(index, catch_length);
                if (RegexLex::action_map.find(key) != RegexLex::action_map.end())
                {
                    // The action leaves index at the correct position, so no ++ is needed here.
                    RegexLex::action_map[key](pattern, index, result, optional);
                    break;
                }
                if (catch_length == 1)
                {
                    // No token matched, so this is an ordinary character.
                    // The key L"normal" is longer than 4 characters, so it can never collide with pattern text :)
                    RegexLex::action_map[L"normal"](pattern, index, result, optional);
                }
            }
        }
        return result;
    }

Each key of action_map is the literal text of a token; the associated action emits a token whose enum value identifies the symbol type.
optional holds the matching options, modeled on the options of .NET regular expressions.
The result of lexical analysis is a vector of RegexToken:

    class RegexToken
    {
    public:
        TokenType type;
        CharRange position;
    };

Each token holds its type and the range of positions in the pattern where it was found.
This vector is then fed to the parser together with the original pattern string; the parser uses each token's position to look up whatever it needs in the pattern.
Notes for lexical analysis:
1. Inside [], every character other than an unescaped [ or ] is treated as an ordinary character.
2. A repetition metacharacter that directly follows a lookaround, such as the + in (?=...)+, is treated as an ordinary character.
3. Nested expressions need care. When "(" opens a subexpression, first use a stack to find the matching closing ")", so that everything in between is grouped into one subexpression. For example, in (12(321)), the outer "(" must be paired with the last ")", not the first one.
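
The stack-based search from note 3 can be sketched as follows. This is a minimal version under simplifying assumptions: it skips escaped characters but does not special-case parentheses inside [], and find_matching_paren is a hypothetical name, not the engine's function.

```cpp
#include <cstddef>
#include <stack>
#include <string>

// Returns the index of the ")" that closes the "(" at position `open`,
// or std::wstring::npos if the parentheses are unbalanced.
// Precondition: pattern[open] == L'('.
std::size_t find_matching_paren(const std::wstring& pattern, std::size_t open) {
    std::stack<std::size_t> pending;  // positions of "(" not yet closed
    for (std::size_t i = open; i < pattern.size(); ++i) {
        if (pattern[i] == L'\\') {
            ++i;  // skip the escaped character, e.g. "\(" or "\)"
            continue;
        }
        if (pattern[i] == L'(') {
            pending.push(i);
        } else if (pattern[i] == L')') {
            if (pending.empty()) return std::wstring::npos;
            pending.pop();
            if (pending.empty()) return i;  // the outermost "(" is now closed
        }
    }
    return std::wstring::npos;
}
```

For the pattern (12(321)), starting at index 0 the function returns the index of the last ")", and starting at the inner "(" it returns the index of the first ")".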
Syntax analysis:
The parser follows the parsing chapters of the Dragon Book. For my previous project, a YACC-style tool, I wrote an LALR parser, so for a change I wrote an LL parser this time; see the LL parsing section of the Dragon Book for details. The grammar used for syntax analysis is:

GRAMMAR:

Factor = "Capturebegin" Captureright
       = "Anonymitycapturebegin" Anonymitycaptureright
       = "Regexmacro" Captureright
       = "Nonecapture" Alert "captureend"
       = "Positivetivelookahead" Alert "captureend"
       = "Negativelookahead" Alert "captureend"
       = "Positivelookbehind" Alert "captureend"
       = "Negativelookbehind" Alert "captureend"
       = "Stringhead"
       = "Stringtail"
       = "Backreference"
       = "Charset"
       = "Normalchar"
       = "Linebegin"
       = "Lineend"
       = "Matchallsymbol"
       = "Generalmatch"
       = "Macroreference"
       = "Anonymitybackreference"
Captureright = "named" Alert "captureend" | Alert "captureend"

The quoted ("") names are the tokens produced by lexical analysis. As for the difficulty of syntax analysis... well, it is just the usual difficulty of hand-writing an LL parser. First you build the FIRST table; since I was not using YACC this time =.=, I worked the FIRST table out by hand. Then you write the LL parser from the grammar, and it returns a syntax tree.
Node types of the syntax tree:

    class Expression : public enable_shared_from_this<Expression>
    {
    public:
        virtual void apply(IRegexAlogrithm& algorithm) = 0;
        bool isequal(Ptr<Expression>& target);
        Ptr<vector<CharRange>> getcharsettable(const Ptr<vector<RegexControl>>& optional);
        void settreecharsetorthogonal(Ptr<CharTable>& target);
        pair<State*, State*> buildnfa(AutoMachine* target);
    private:
        void buildorthogonal(Ptr<vector<int>>& target);
    };

    // character set
    class CharSetExpression : public Expression
    {
    public:
        bool reverse;
        vector<CharRange> range;
    };

    // normal character
    class NormalCharExpression : public Expression
    {
    public:
        CharRange range;
    };

    // loop
    class LoopExpression : public Expression
    {
    public:
        Ptr<Expression> expression;
        int begin;
        int end;
        bool greedy;
    };

    class SequenceExpression : public Expression
    {
    public:
        Ptr<Expression> left;
        Ptr<Expression> right;
    };

    class AlternationExpression : public Expression
    {
    public:
        Ptr<Expression> left;
        Ptr<Expression> right;
    };

    class BeginExpression : public Expression {};
    class EndExpression : public Expression {};

    class CaptureExpression : public Expression
    {
    public:
        wstring name;
        Ptr<Expression> expression;
    };

    class AnonymityCaptureExpression : public Expression
    {
    public:
        int index = 0;
        Ptr<Expression> expression;
    };

    class MacroExpression : public Expression
    {
    public:
        wstring name;
        Ptr<Expression> expression;
    };

    class MacroReferenceExpression : public Expression
    {
    public:
        wstring name;
    };

    // non-capturing group
    class NoneCaptureExpression : public Expression
    {
    public:
        Ptr<Expression> expression;
    };

    // named backreference
    class BackReferenceExpression : public Expression
    {
    public:
        wstring name;
    };

    class AnonymityBackReferenceExpression : public Expression
    {
    public:
        int index;
    };

    class NegativeLookbehindExpression : public Expression
    {
    public:
        Ptr<Expression> expression;
    };

    class PositiveLookbehindExpression : public Expression
    {
    public:
        Ptr<Expression> expression;
    };

    class NegativeLookaheadExpression : public Expression
    {
    public:
        Ptr<Expression> expression;
    };

    class PositivetiveLookaheadExpression : public Expression
    {
    public:
        Ptr<Expression> expression;
    };
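
To give an idea of what a hand-written LL parser for this kind of grammar looks like, here is a minimal recursive-descent sketch. It handles only a reduced grammar (alternation, sequence, the loops *, + and ?, and parentheses) and returns a prefix-notation string instead of a tree of the node types above. MiniParser and its helpers are hypothetical names, not the engine's code.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Recursive-descent (LL(1)) parser for a reduced grammar:
//   Alert   = Unit "|" Alert | Unit
//   Unit    = Express Unit | Express
//   Express = Factor Loop | Factor
//   Loop    = "*" | "+" | "?"
//   Factor  = "(" Alert ")" | ordinary character
class MiniParser {
public:
    explicit MiniParser(std::string pattern) : text_(std::move(pattern)) {}

    std::string parse() {
        std::string tree = alert();
        if (pos_ != text_.size()) throw std::runtime_error("unexpected character");
        return tree;
    }

private:
    bool at_end() const { return pos_ >= text_.size(); }
    char peek() const { return text_[pos_]; }
    bool at_loop() const {
        return !at_end() && (peek() == '*' || peek() == '+' || peek() == '?');
    }
    // FIRST(Factor): "(" or any ordinary character.
    bool starts_factor() const {
        return !at_end() && !at_loop() && peek() != '|' && peek() != ')';
    }

    std::string alert() {
        std::string left = unit();
        if (!at_end() && peek() == '|') {
            ++pos_;
            return "(alt " + left + " " + alert() + ")";
        }
        return left;
    }
    std::string unit() {
        std::string left = express();
        if (starts_factor()) return "(seq " + left + " " + unit() + ")";
        return left;
    }
    std::string express() {
        std::string inner = factor();
        if (at_loop()) {
            char op = text_[pos_++];
            return std::string("(loop") + op + " " + inner + ")";
        }
        return inner;
    }
    std::string factor() {
        if (!starts_factor()) throw std::runtime_error("factor expected");
        if (peek() == '(') {
            ++pos_;
            std::string inner = alert();
            if (at_end() || peek() != ')') throw std::runtime_error("missing )");
            ++pos_;
            return inner;
        }
        return std::string(1, text_[pos_++]);
    }

    std::string text_;
    std::size_t pos_ = 0;
};
```

Each nonterminal becomes one member function, and the one-character lookahead in peek() plays the role of the FIRST table: it decides which production to take before anything is consumed.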

Apart from a few extra node types, the syntax tree is much like vczh's. The traversal and construction of the syntax tree follow vczh's blog directly:
Structure and traversal of the syntax tree: http://www.cppblog.com/vczh/archive/2009/10/18/98862.html
Character set and regularization: http://www.cppblog.com/vczh/archive/2009/10/18/98873.html
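
The traversal relies on the visitor pattern: each node type implements apply, which calls back into the algorithm with its own concrete type. A minimal sketch of that idea, with only two node types and hypothetical names (Node, CharNode, SequenceNode, IAlgorithm, ToStringAlgorithm), could look like this; vczh's actual framework is more elaborate.

```cpp
#include <memory>
#include <string>
#include <utility>

class CharNode;
class SequenceNode;

// One virtual overload per concrete node type.
class IAlgorithm {
public:
    virtual ~IAlgorithm() = default;
    virtual void visit(CharNode& node) = 0;
    virtual void visit(SequenceNode& node) = 0;
};

class Node {
public:
    virtual ~Node() = default;
    // Double dispatch: the node picks the overload matching its own type.
    virtual void apply(IAlgorithm& algorithm) = 0;
};

class CharNode : public Node {
public:
    explicit CharNode(char c) : value(c) {}
    void apply(IAlgorithm& algorithm) override { algorithm.visit(*this); }
    char value;
};

class SequenceNode : public Node {
public:
    SequenceNode(std::shared_ptr<Node> l, std::shared_ptr<Node> r)
        : left(std::move(l)), right(std::move(r)) {}
    void apply(IAlgorithm& algorithm) override { algorithm.visit(*this); }
    std::shared_ptr<Node> left;
    std::shared_ptr<Node> right;
};

// An example algorithm that flattens the tree back into text.
class ToStringAlgorithm : public IAlgorithm {
public:
    void visit(CharNode& node) override { text += node.value; }
    void visit(SequenceNode& node) override {
        node.left->apply(*this);
        node.right->apply(*this);
    }
    std::string text;
};
```

Adding a new pass over the tree (character set collection, NFA construction and so on) then means writing one more IAlgorithm subclass, without touching the node classes.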
NFA construction, DFA construction:
For this part, please read vczh's blog post on constructing extended regular expressions: http://www.cppblog.com/vczh/archive/2008/05/22/50763.html
My implementation differs, though. In vczh's design, the whole regular expression lives in a single graph, because each subexpression has a command edge and an end edge that delimit its scope. I did not add end edges, so I use subgraphs instead: each subexpression is a separate graph, bound to its command edge through an index.
After a command edge is matched, the matcher looks up the subexpression by that index and matches it.
In a directed graph like this, nodes that hold each other through shared_ptr form circular references. I recommend keeping all nodes in an array of shared_ptr, a node pool, and exposing only raw pointers to the code that works on the graph. Just remember that you never need to delete them yourself.
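
A minimal sketch of such a node pool is shown below. The layout of State and Edge and the member names new_state and new_edge are assumptions made for the example; only the idea (the pool owns everything, the graph uses raw pointers) comes from the text.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

struct State;

struct Edge {
    State* source;  // non-owning
    State* target;  // non-owning
};

struct State {
    std::vector<Edge*> outgoing;  // non-owning
    std::vector<Edge*> incoming;  // non-owning
};

// Owns every State and Edge. Graph code only sees raw pointers, so cycles in
// the graph never turn into shared_ptr reference cycles.
class AutoMachine {
public:
    State* new_state() {
        states_.push_back(std::make_shared<State>());
        return states_.back().get();
    }
    Edge* new_edge(State* source, State* target) {
        edges_.push_back(std::make_shared<Edge>(Edge{source, target}));
        Edge* edge = edges_.back().get();
        source->outgoing.push_back(edge);
        target->incoming.push_back(edge);
        return edge;
    }
    std::size_t state_count() const { return states_.size(); }

private:
    std::vector<std::shared_ptr<State>> states_;
    std::vector<std::shared_ptr<Edge>> edges_;
};
```

Everything is released together when the AutoMachine object is destroyed, even if the states point at each other in a loop.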
In the conversion from ε-NFA to NFA, vczh merges most of the edges directly. Merging every edge that does not consume a character is neat, but the data structure I use for edges (vector<Edge*>) gets in the way: when a merged run of non-consuming edges fails to match partway through, the position at which to try the next edge is no longer known. (I am not explaining this well; it helps to read the two vczh articles linked above first.)
So I did not merge those edges; only empty edges are merged. To guarantee that every subexpression has a unique start node and a unique end node, I append a final edge to the end of each subexpression. A final edge always matches unconditionally and is never optimized away.
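
Merging empty edges is the textbook ε-closure construction. The sketch below shows it on a deliberately simplified NFA in which states are integers and symbol 0 marks an empty edge; Nfa, epsilon_closure and remove_empty_edges are hypothetical names, and command and final edges are left out.

```cpp
#include <cstddef>
#include <set>
#include <vector>

struct Edge {
    int target;
    char symbol;  // 0 marks an empty (epsilon) edge
};

struct Nfa {
    std::vector<std::vector<Edge>> edges;  // edges[s] = outgoing edges of state s
    std::set<int> accepting;
};

// All states reachable from `start` through empty edges only (start included).
std::set<int> epsilon_closure(const Nfa& nfa, int start) {
    std::set<int> seen{start};
    std::vector<int> todo{start};
    while (!todo.empty()) {
        int s = todo.back();
        todo.pop_back();
        for (const Edge& e : nfa.edges[s])
            if (e.symbol == 0 && seen.insert(e.target).second)
                todo.push_back(e.target);
    }
    return seen;
}

// Build an equivalent NFA over the same state numbers without empty edges:
// each state inherits the consuming edges and the accepting flag of its closure.
Nfa remove_empty_edges(const Nfa& nfa) {
    Nfa out;
    out.edges.resize(nfa.edges.size());
    for (std::size_t s = 0; s < nfa.edges.size(); ++s) {
        for (int t : epsilon_closure(nfa, static_cast<int>(s))) {
            for (const Edge& e : nfa.edges[t])
                if (e.symbol != 0) out.edges[s].push_back(e);
            if (nfa.accepting.count(t)) out.accepting.insert(static_cast<int>(s));
        }
    }
    return out;
}
```

States that become unreachable after the merge are kept here for simplicity; a real implementation would usually prune them.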
The hard part here is lookbehind. Because it looks backwards, the subexpression inside it has to be matched in reverse. Take 34(?<=34)54 matched against 3454: when matching reaches the position between 4 and 5, the lookbehind starts, and it must match 4 first and then 3, not in the order written in the subexpression (?<=34).
So, once the complete NFA has been built, the engine traverses it and reverses every subgraph that hangs off a lookbehind edge in the first layer of the NFA (each subexpression is bound to a command edge). If such a subgraph itself contains a lookaround, as in (?<=3(?=5)), the direction of that one has to be flipped as well, otherwise the match cannot succeed.
Difficulties in the matching algorithm:
The matching algorithm has one tricky point: the input string has to be passed in as iterators rather than traversed with plain indexes. Lookbehind requires matching in the opposite direction. A plain int index cannot change direction by itself, so iterators are more convenient: wrapping one in a reverse iterator is enough. However, a subgraph is matched by calling the matching algorithm again, which makes this a recursion at compile time. With nested lookbehind edges, each level would wrap the iterator in yet another reverse iterator, and template instantiation would recurse without end. The fix is to split out the function that handles lookbehind edges: when the incoming iterator is a reverse iterator, call .base() on it, and when it is a forward iterator, construct a reverse iterator from it with reverse_iter(iter).
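
The direction switch can be sketched with two overloads. This is an illustration under assumptions: flip, match_literal and lookbehind are hypothetical names, and a literal string stands in for the reversed subgraph.

```cpp
#include <iterator>
#include <string>
#include <type_traits>

// flip() switches direction without ever nesting reverse_iterator:
// forward -> reverse wraps the iterator, reverse -> forward unwraps it with base().
template <typename It>
std::reverse_iterator<It> flip(It it) { return std::reverse_iterator<It>(it); }

template <typename It>
It flip(std::reverse_iterator<It> it) { return it.base(); }

// Flipping twice yields the original type, so nested lookbehinds cannot make
// template instantiation recurse forever.
static_assert(std::is_same<decltype(flip(flip(std::string::const_iterator{}))),
                           std::string::const_iterator>::value,
              "flip must not nest reverse_iterator");

// Consume `literal` from [current, end) in whatever direction the iterator walks.
// Returns true and advances `current` on success.
template <typename Iter>
bool match_literal(Iter& current, Iter end, const std::string& literal) {
    Iter probe = current;
    for (char expected : literal) {
        if (probe == end || *probe != expected) return false;
        ++probe;
    }
    current = probe;
    return true;
}

// Lookbehind at forward position `pos`: is `literal` the text just before pos?
// The literal is consumed back to front, as a reversed subgraph would be.
bool lookbehind(const std::string& text, std::string::const_iterator pos,
                const std::string& literal) {
    auto current = flip(pos);        // walks towards the start of text
    auto end = flip(text.cbegin());  // reverse end corresponds to forward begin
    std::string reversed(literal.rbegin(), literal.rend());
    return match_literal(current, end, reversed);
}
```

With the input 3454, a lookbehind for 34 succeeds at the position between 4 and 5 and fails one position earlier.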
The overall design is: subgraphs of the current state that have a DFA are matched with the DFA, and those without one are matched with the NFA. During NFA matching, every edge tried and every state entered is pushed onto a stack. When an edge matches, the next edge is tried from the new state. When every edge of the current state has failed, the current state is popped, the state now at back() is restored, and its next edge is tried. And so on, back and forth.
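
The NFA half of that loop can be sketched with an explicit stack of frames. This is a simplified illustration: Frame, Edge and backtrack_match are hypothetical names, the DFA fast path, captures and command edges are omitted, and the graph is assumed to contain no cycle made only of non-consuming edges.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct Edge {
    char symbol;  // 0 = edge that consumes no input
    int target;
};

struct Frame {
    int state;
    std::size_t position;   // input position when this state was entered
    std::size_t next_edge;  // index of the next outgoing edge to try
};

// Returns true if some path from `start` to `accept` consumes all of `input`.
bool backtrack_match(const std::vector<std::vector<Edge>>& edges, int start,
                     int accept, const std::string& input) {
    std::vector<Frame> stack;
    stack.push_back(Frame{start, 0, 0});
    while (!stack.empty()) {
        Frame& top = stack.back();
        if (top.state == accept && top.position == input.size()) return true;
        if (top.next_edge == edges[top.state].size()) {
            stack.pop_back();  // every edge failed: go back to the previous state
            continue;
        }
        const Edge& edge = edges[top.state][top.next_edge++];
        if (edge.symbol == 0) {
            stack.push_back(Frame{edge.target, top.position, 0});
        } else if (top.position < input.size() && input[top.position] == edge.symbol) {
            stack.push_back(Frame{edge.target, top.position + 1, 0});
        }
        // Otherwise the edge failed; the loop tries the next edge of the same state.
    }
    return false;
}
```

Because each frame remembers which edge to try next, popping a frame is all it takes to restore the earlier state and continue with its remaining edges.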
Compilation option design:
Explicitcapture, // disables anonymous capture groups. Handled during lexical analysis: when "(" is matched and optional contains this flag, a Nonecapture token is returned.
Ignorecase, // case-insensitive matching. Handled when character sets are built and orthogonalized: the uppercase or lowercase counterpart is added while the character set is traversed.
Multiline, // $ and ^ match at the end and start of each line; implemented as lookaround.
Righttoleft, // the input string is passed in through reverse iterators.
Singleline, // changes the range of characters that the match-all symbol (.) represents.
Those are roughly the hardest points. Everything else is covered in vczh's other blog posts, which are very well written.
The whole project is about 5000 lines of code. I collected a few hundred test cases from various blogs and from the MSDN documentation of the .NET regular expression engine. Having written it, I feel I have learned a lot. Although I have been writing C++ for almost two years, I still made plenty of low-level mistakes, such as mixing up emplace_back and push_back, or confusing & and * (references and pointers). Debugging directed graphs like the ones above is painful. I recommend writing a debug helper that prints the DFA and NFA graphs; it is much faster than stepping through breakpoints.
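
Such a debug helper does not need to be fancy. A minimal sketch, assuming a simplified adjacency-list graph with string labels (Edge and dump_graph are hypothetical names):

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

struct Edge {
    int target;
    std::string label;  // e.g. a character range, "eps", or a command name
};

// Render one line per edge, e.g. "0 --a--> 1", so the whole graph can be
// read at a glance in a log or a test failure message.
std::string dump_graph(const std::vector<std::vector<Edge>>& edges) {
    std::ostringstream out;
    for (std::size_t state = 0; state < edges.size(); ++state)
        for (const Edge& edge : edges[state])
            out << state << " --" << edge.label << "--> " << edge.target << '\n';
    return out.str();
}
```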
I hope to make fewer mistakes in the future.
Thank you again for your guidance! ^_^
This is the first blog post I have written this year. I hope to write more in the future :)
References:
1. http://www.cppblog.com/vczh/category/12070.html?Show=all
2. http://www.cppblog.com/vczh/archive/2008/05/22/50763.html
3. Compilers: Principles, Techniques, and Tools (the Dragon Book)

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
