Regular expressions are used by nearly every programmer, yet implementing a regular expression engine seems like a difficult task. In fact, a simple engine can be built once you understand lexical analysis, the front-end part of a compilers course. Here is a recommended course on NetEase Cloud Classroom: http://mooc.study.163.com/course/USTC-1000002001?tid=1000003000#/info
A basic regular expression consists of characters and metacharacters, and the whole expression describes a class of strings that share certain characteristics. For example, the expression abc represents the string "abc", formed by concatenating the three characters 'a', 'b', 'c' in sequence. The regular expressions implemented in this article are simple: they support only concatenation, alternation (choice) and closure (the * operator). The definition, taken directly from the course slides:
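The overall steps are roughly as follows: convert the regular expression to an NFA (Thompson algorithm), convert the NFA to a DFA (subset construction algorithm), and finally minimize the DFA (Hopcroft algorithm).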
NFA stands for nondeterministic finite automaton: on a given character, a state may transition to several different states. DFA stands for deterministic finite automaton: on any character, a state can transition to at most one state.
The Thompson algorithm is recursive: it first converts each single character into a small NFA and then combines those NFAs according to a few rules. For a single character (such as c), the NFA is a start state with one edge labeled c leading to an accepting state.
Concatenation (e.g. e1e2) connects the two NFAs in the middle with an ε edge, from the accepting state of e1 to the start state of e2.
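Next is alternation (e.g. e1|e2): in the standard construction, a new start state has ε edges to the start states of e1 and e2, and their accepting states have ε edges to a new accepting state.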
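Closure (e.g. e1*) is more complex: in the standard construction, a new start state has ε edges to the start state of e1 and to a new accepting state, and the accepting state of e1 has ε edges back to the start state of e1 and on to the new accepting state.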
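The four combination rules can be sketched in Python as follows. This is a minimal sketch, not the code of the engine described in this article: the NFA class, the build function, the use of None as the ε label, and the tuple-shaped syntax tree such as ("alt", left, right) are all assumptions made for illustration.

```python
from itertools import count

EPS = None  # assumption of this sketch: None marks an ε edge


class NFA:
    """Hypothetical container: maps each state to a list of (label, target)."""

    def __init__(self):
        self.edges = {}
        self._ids = count()

    def new_state(self):
        state = next(self._ids)
        self.edges[state] = []
        return state

    def add_edge(self, src, label, dst):
        self.edges[src].append((label, dst))


def build(nfa, node):
    """Recursively build the NFA fragment for a syntax tree node.

    Returns the (start, accept) pair of the fragment.
    """
    kind = node[0]
    if kind == "char":
        # Single character: start --c--> accept
        start, accept = nfa.new_state(), nfa.new_state()
        nfa.add_edge(start, node[1], accept)
        return start, accept
    if kind == "concat":
        # e1e2: accept of e1 --ε--> start of e2
        s1, a1 = build(nfa, node[1])
        s2, a2 = build(nfa, node[2])
        nfa.add_edge(a1, EPS, s2)
        return s1, a2
    if kind == "alt":
        # e1|e2: new start branches into both, both join a new accept
        start, accept = nfa.new_state(), nfa.new_state()
        for child in node[1:]:
            cs, ca = build(nfa, child)
            nfa.add_edge(start, EPS, cs)
            nfa.add_edge(ca, EPS, accept)
        return start, accept
    if kind == "star":
        # e1*: zero repetitions skip e1, further repetitions loop back
        start, accept = nfa.new_state(), nfa.new_state()
        cs, ca = build(nfa, node[1])
        nfa.add_edge(start, EPS, cs)
        nfa.add_edge(start, EPS, accept)
        nfa.add_edge(ca, EPS, cs)
        nfa.add_edge(ca, EPS, accept)
        return start, accept
    raise ValueError(f"unknown node kind: {kind!r}")


# Syntax tree for (a|b)*c, written by hand for this example
tree = ("concat", ("star", ("alt", ("char", "a"), ("char", "b"))), ("char", "c"))
nfa = NFA()
start, accept = build(nfa, tree)
```

Each rule adds at most two new states, so the size of the NFA grows linearly with the size of the expression.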
Now that we know how to combine small NFAs into a large one, how do we process a regular expression and convert it into an NFA? We could use two stacks, as when evaluating arithmetic expressions, but that only works in simple cases. Alternatively, we can use recursive descent parsing to build an abstract syntax tree; for example, the syntax tree of a|b has an alternation node as its root with the leaves a and b as its children. For the details of recursive descent parsing, see the course video recommended earlier or read the source code directly.
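As an illustration of the recursive descent approach, here is a small sketch of a parser for concatenation, alternation and closure. The Parser class, its method names, the tuple-shaped tree, and the support for parentheses are assumptions of this example and are not taken from the source code mentioned above. The grammar it assumes is: alt := concat ('|' concat)*, concat := star+, star := atom '*'*, atom := character | '(' alt ')'.

```python
class Parser:
    """Hypothetical recursive descent parser producing a tuple-based tree."""

    def __init__(self, pattern):
        self.pattern = pattern
        self.pos = 0

    def peek(self):
        # Current character, or None at the end of the pattern
        return self.pattern[self.pos] if self.pos < len(self.pattern) else None

    def advance(self):
        ch = self.pattern[self.pos]
        self.pos += 1
        return ch

    def parse(self):
        node = self.parse_alt()
        if self.peek() is not None:
            raise ValueError(f"unexpected {self.peek()!r} at position {self.pos}")
        return node

    def parse_alt(self):
        # alt := concat ('|' concat)*
        node = self.parse_concat()
        while self.peek() == "|":
            self.advance()
            node = ("alt", node, self.parse_concat())
        return node

    def parse_concat(self):
        # concat := star+  (stops at '|', ')' or the end of the pattern)
        node = self.parse_star()
        while self.peek() is not None and self.peek() not in "|)":
            node = ("concat", node, self.parse_star())
        return node

    def parse_star(self):
        # star := atom '*'*
        node = self.parse_atom()
        while self.peek() == "*":
            self.advance()
            node = ("star", node)
        return node

    def parse_atom(self):
        # atom := character | '(' alt ')'
        ch = self.peek()
        if ch is None or ch in "|)*":
            raise ValueError(f"unexpected {ch!r} at position {self.pos}")
        self.advance()
        if ch == "(":
            node = self.parse_alt()
            if self.peek() != ")":
                raise ValueError("missing ')'")
            self.advance()
            return node
        return ("char", ch)


tree_alt = Parser("a|b").parse()
tree_big = Parser("(a|b)*c").parse()
```

One method per precedence level keeps closure binding tighter than concatenation, and concatenation tighter than alternation. This sketch rejects the empty pattern with a ValueError.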
Since an NFA state may transition to several states on the same character, we need to convert the NFA to a DFA. Our NFA is actually an ε-NFA with many ε edges, while a DFA has no ε edges, so we remove them with the subset construction algorithm. The rough idea is this: for a state A and a character, collect all the states reachable from A on that character (including those reached by following ε edges after the character has been consumed) into one subset; then, for each such subset, collect the states it can reach on each character to form further subsets. Repeating this until no new subsets appear eliminates the ε edges and yields the DFA.
For the NFA
the steps of the subset construction algorithm are as follows:
The first column of the first row, I, is the set of states reachable from the initial state of the NFA through any number of ε edges. Ia is the set of states reachable from the set I after consuming the character a, and Ib is the set of states reachable after consuming the character b.
If Ia and Ib have not yet appeared in column I, add them to the following rows of column I. The result is as follows:
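The table-filling procedure above can be sketched in Python. This is an illustration under stated assumptions: the function names eps_closure, move and subset_construction are invented here, None marks an ε edge, and the example NFA is a hand-written ε-NFA for a*b (not the NFA pictured in this article).

```python
from collections import deque

EPS = None  # assumption of this sketch: None marks an ε edge


def eps_closure(edges, states):
    """All states reachable from `states` through ε edges only."""
    closure = set(states)
    stack = list(states)
    while stack:
        state = stack.pop()
        for label, target in edges[state]:
            if label is EPS and target not in closure:
                closure.add(target)
                stack.append(target)
    return frozenset(closure)


def move(edges, states, ch):
    """States reachable from `states` by consuming the character `ch`."""
    return {t for s in states for label, t in edges[s] if label == ch}


def subset_construction(edges, start, accept, alphabet):
    """Return (start set, DFA transition table, accepting sets)."""
    start_set = eps_closure(edges, {start})   # first row of column I
    dfa = {}                                  # subset -> {ch: subset}
    queue = deque([start_set])
    while queue:
        current = queue.popleft()
        if current in dfa:
            continue
        dfa[current] = {}
        for ch in alphabet:                   # the Ia, Ib, ... columns
            target = eps_closure(edges, move(edges, current, ch))
            if not target:
                continue                      # no transition on ch
            dfa[current][ch] = target
            if target not in dfa:
                queue.append(target)          # new row in column I
    accepting = {s for s in dfa if accept in s}
    return start_set, dfa, accepting


# Hand-written ε-NFA for a*b: start state 0, accepting state 5
edges = {
    0: [(EPS, 2), (EPS, 1)],
    1: [(EPS, 4)],
    2: [("a", 3)],
    3: [(EPS, 2), (EPS, 1)],
    4: [("b", 5)],
    5: [],
}
start_set, dfa, accepting = subset_construction(edges, 0, 5, "ab")
```

Each DFA state is a frozenset of NFA states, which lets the sets themselves serve as dictionary keys. A subset is accepting when it contains the accepting state of the NFA.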
The DFA is then minimized with the Hopcroft algorithm, whose idea is to merge equivalent states into a single node. For example, the following DFA
can be simplified to
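To make the idea of merging equivalent states concrete, here is a sketch of minimization by partition refinement. Note that it uses the simple "split the groups until nothing changes" form, not Hopcroft's worklist optimization, and that the minimize function, the state names, and the (state, character) transition table are assumptions of this example.

```python
def minimize(states, alphabet, delta, accepting):
    """Group equivalent DFA states; returns a dict state -> group id.

    delta maps (state, ch) to the next state; a missing entry means
    there is no transition on that character.
    """
    # Initial partition: accepting states versus non-accepting states
    block_of = {s: int(s in accepting) for s in states}
    while True:
        # Two states stay together only if they are in the same group
        # and reach the same groups on every character.
        signature = {
            s: (block_of[s],)
            + tuple(block_of.get(delta.get((s, ch))) for ch in alphabet)
            for s in states
        }
        ids = {}
        new_block_of = {
            s: ids.setdefault(signature[s], len(ids)) for s in states
        }
        # Refinement only ever splits groups, so an unchanged group
        # count means the partition is stable.
        if len(ids) == len(set(block_of.values())):
            return new_block_of
        block_of = new_block_of


# A DFA for a*b with a redundant state: q0 and q1 behave identically
states = ["q0", "q1", "q2"]
delta = {
    ("q0", "a"): "q1", ("q0", "b"): "q2",
    ("q1", "a"): "q1", ("q1", "b"): "q2",
}
groups = minimize(states, "ab", delta, {"q2"})
```

States that end up with the same group id can be collapsed into one node of the minimized DFA; here q0 and q1 share a group, so three states become two.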
That completes all the steps. For an input string, if following the transitions of the DFA character by character ends in an accepting state, the string matches. See the source code here; it may contain bugs, and the final DFA minimization step is not implemented, so please go easy on it.
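The matching step described above, walking the DFA and checking whether it ends in an accepting state, can be sketched as follows. The matches function and the hand-written DFA for a*b are assumptions of this example, not part of the source code referred to above.

```python
def matches(delta, start, accepting, text):
    """Walk the DFA over `text`; match only if we end in an accepting state.

    delta maps (state, ch) to the next state; a missing entry means the
    DFA has no transition, so the match fails immediately.
    """
    state = start
    for ch in text:
        state = delta.get((state, ch))
        if state is None:
            return False
    return state in accepting


# Hand-written minimal DFA for a*b: p0 loops on a, then b leads to p1
delta = {("p0", "a"): "p0", ("p0", "b"): "p1"}
results = {s: matches(delta, "p0", {"p1"}, s) for s in ["b", "aaab", "", "aba"]}
```

Because a DFA has at most one transition per character, matching takes a single pass over the input with no backtracking.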
Finally, here are a few recommended related links:
Wheel Brother's (vczh) tutorial: http://www.cppblog.com/vczh/archive/2008/05/22/50763.html
http://www.cnblogs.com/cute/p/4021689.html, whose author explains the topic quite clearly.
And once more, the NetEase course: http://mooc.study.163.com/learn/USTC-1000002001?tid=1000003000#/learn/content
Implementation of simple regular expression engine