Implementation of a simple regular expression engine

Source: Internet
Author: User
Tags: expression engine

Almost every programmer uses regular expressions, but implementing a regular expression engine seems like a difficult task. In fact, a simple engine can be built once you understand lexical analysis, the first stage of a compiler front end as taught in a compiler principles course. A recommended course on NetEase Cloud Classroom: http://mooc.study.163.com/course/USTC-1000002001?tid=1000003000#/info

Regular expression basics: a regular expression consists of characters and metacharacters, and the whole expression describes a class of strings that share certain properties. For example, the expression abc describes the string "abc", formed by concatenating the three characters 'a', 'b' and 'c' in order. The regular expressions implemented in this article are simple: only concatenation, alternation (choice) and closure are supported. The definition is taken directly from the course slides:
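The slide with the definition is not shown here, so the following is only a minimal sketch of the usual inductive definition, assuming four node kinds: a single character, concatenation, alternation and closure. The class names (Char, Concat, Alt, Star) and the helper show are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Char:
    """A single literal character."""
    c: str


@dataclass(frozen=True)
class Concat:
    """Concatenation e1e2: e1 followed by e2."""
    left: "Regex"
    right: "Regex"


@dataclass(frozen=True)
class Alt:
    """Alternation e1|e2: either e1 or e2."""
    left: "Regex"
    right: "Regex"


@dataclass(frozen=True)
class Star:
    """Closure e*: zero or more repetitions of e."""
    inner: "Regex"


Regex = Union[Char, Concat, Alt, Star]


def show(e: Regex) -> str:
    """Render a tree back to a fully parenthesised pattern."""
    if isinstance(e, Char):
        return e.c
    if isinstance(e, Concat):
        return f"({show(e.left)}{show(e.right)})"
    if isinstance(e, Alt):
        return f"({show(e.left)}|{show(e.right)})"
    return f"({show(e.inner)})*"


# The expression abc from the text: 'a', 'b', 'c' concatenated in order.
abc = Concat(Concat(Char("a"), Char("b")), Char("c"))
```

Every larger expression is built from smaller ones, which is what lets the later steps work recursively on the tree.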

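The implementation proceeds roughly as follows: the regular expression is converted to an NFA (Thompson algorithm), the NFA is converted to a DFA (subset construction), and the DFA is then minimized (Hopcroft algorithm).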

NFA stands for nondeterministic finite automaton: from a given state, the same character may lead to several different states, and ε edges (moves that consume no input) are allowed. DFA stands for deterministic finite automaton: from a given state, any character leads to at most one state.
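The difference shows up directly in how a transition table can be stored. The sketch below is an assumption for illustration, not the original author's data structure: transitions are kept as a dictionary from (state, symbol) to a set of target states, with None as a hypothetical marker for ε. A table is deterministic when it has no ε entries and every target set holds at most one state.

```python
from typing import Dict, Optional, Set, Tuple

EPS = None  # hypothetical marker for an ε edge

# (state, symbol) -> set of states reachable on that symbol
Delta = Dict[Tuple[int, Optional[str]], Set[int]]


def is_deterministic(delta: Delta) -> bool:
    """True if there are no ε edges and no symbol leads to two states."""
    for (_state, symbol), targets in delta.items():
        if symbol is EPS or len(targets) > 1:
            return False
    return True


# From state 0, 'a' may lead to state 0 or state 1: nondeterministic.
nfa_table: Delta = {(0, "a"): {0, 1}, (1, "b"): {2}}

# Every (state, symbol) pair has exactly one target: deterministic.
dfa_table: Delta = {(0, "a"): {1}, (1, "b"): {2}}
```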

The Thompson algorithm is recursive: it first converts each single character to an NFA and then combines the NFAs according to the rules below. For a single character (such as c), the NFA is a start state with one edge labelled c leading to an accepting state.

For concatenation (e.g. e1e2), the two NFAs are joined by an ε edge from the accepting state of e1 to the start state of e2.

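Next is alternation (e.g. e1|e2): a new start state has ε edges to the start states of e1 and e2, and the accepting states of e1 and e2 each have an ε edge to a new accepting state.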

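Closure (e.g. e1*) is more complex. In the usual construction, a new start state has ε edges to the start state of e1 and to a new accepting state (zero repetitions), and the accepting state of e1 has ε edges back to its own start state (repetition) and to the new accepting state.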
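The four rules can be sketched as follows. This is a minimal illustration of the standard Thompson construction, not the original source: the Builder class, the edge-list representation, and the accepts helper (a direct ε-NFA simulation used only to check the result) are all hypothetical names introduced here.

```python
from typing import List, Optional, Set, Tuple

EPS = None  # hypothetical marker for an ε edge
Edge = Tuple[int, Optional[str], int]  # (source, symbol, destination)
Frag = Tuple[int, int]                 # (start state, accepting state)


class Builder:
    def __init__(self) -> None:
        self.edges: List[Edge] = []
        self.count = 0

    def new_state(self) -> int:
        self.count += 1
        return self.count - 1

    def char(self, c: str) -> Frag:
        # start --c--> accept
        s, t = self.new_state(), self.new_state()
        self.edges.append((s, c, t))
        return s, t

    def concat(self, a: Frag, b: Frag) -> Frag:
        # accept of a --ε--> start of b
        self.edges.append((a[1], EPS, b[0]))
        return a[0], b[1]

    def alt(self, a: Frag, b: Frag) -> Frag:
        # new start branches into a and b; both rejoin at a new accept
        s, t = self.new_state(), self.new_state()
        self.edges += [(s, EPS, a[0]), (s, EPS, b[0]),
                       (a[1], EPS, t), (b[1], EPS, t)]
        return s, t

    def star(self, a: Frag) -> Frag:
        # skip edge (zero repetitions) plus loop-back edge (repetition)
        s, t = self.new_state(), self.new_state()
        self.edges += [(s, EPS, a[0]), (s, EPS, t),
                       (a[1], EPS, a[0]), (a[1], EPS, t)]
        return s, t


def accepts(edges: List[Edge], start: int, accept: int, text: str) -> bool:
    """Simulate the ε-NFA directly by tracking the set of live states."""
    def closure(states: Set[int]) -> Set[int]:
        seen, stack = set(states), list(states)
        while stack:
            s = stack.pop()
            for src, sym, dst in edges:
                if src == s and sym is EPS and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
        return seen

    current = closure({start})
    for ch in text:
        moved = {dst for src, sym, dst in edges if src in current and sym == ch}
        current = closure(moved)
    return accept in current


# NFA for a(b|c)*
b = Builder()
frag = b.concat(b.char("a"), b.star(b.alt(b.char("b"), b.char("c"))))
```

Each rule adds at most two states and four edges, so the NFA grows linearly with the size of the expression.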

Now that we know how to combine small NFAs into a larger one, how do we process a regular expression and convert it to an NFA? We could use two stacks, as when evaluating arithmetic expressions, but that only handles simple cases. Alternatively, recursive descent parsing can build an abstract syntax tree; for a|b, the tree is an alternation node whose two children are a and b. For the details of recursive descent parsing, see the course recommended above or read the source code directly.
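A recursive descent parser for this small language might look like the sketch below. The grammar (alternation binds loosest, then concatenation, then closure, with parentheses for grouping) and the tuple-based tree shape are assumptions made for illustration, not taken from the original source.

```python
class Parser:
    """Hypothetical grammar:
         alt    := concat ('|' concat)*
         concat := star star*
         star   := atom '*'*
         atom   := CHAR | '(' alt ')'
    """

    def __init__(self, pattern: str) -> None:
        self.p = pattern
        self.i = 0

    def peek(self):
        return self.p[self.i] if self.i < len(self.p) else None

    def parse(self):
        node = self.alt()
        if self.peek() is not None:
            raise ValueError(f"unexpected {self.peek()!r} at {self.i}")
        return node

    def alt(self):
        node = self.concat()
        while self.peek() == "|":
            self.i += 1
            node = ("alt", node, self.concat())
        return node

    def concat(self):
        node = self.star()
        # Keep concatenating until the end, a '|' or a closing ')'.
        while self.peek() not in (None, "|", ")"):
            node = ("cat", node, self.star())
        return node

    def star(self):
        node = self.atom()
        while self.peek() == "*":
            self.i += 1
            node = ("star", node)
        return node

    def atom(self):
        ch = self.peek()
        if ch == "(":
            self.i += 1
            node = self.alt()
            if self.peek() != ")":
                raise ValueError("missing ')'")
            self.i += 1
            return node
        if ch is None or ch in "|)*":
            raise ValueError(f"unexpected {ch!r} at {self.i}")
        self.i += 1
        return ("char", ch)


tree = Parser("a|b").parse()
```

For a|b this yields ("alt", ("char", "a"), ("char", "b")), the alternation node with two leaves described above. One function per precedence level is what makes the operator priorities fall out of the call structure.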

Because an NFA may move to several states on the same character, we convert it to a DFA. Our NFA is actually an ε-NFA with many ε edges, whereas a DFA has none, so we remove them with the subset construction algorithm. The rough idea is this: starting from a set of states A, collect all the states reachable on a given character (including those then reachable through ε edges) into a new subset, and repeat for every subset produced this way until no new subsets appear. Each subset becomes one DFA state, and the ε edges are eliminated.

For a given NFA, the steps of the subset construction algorithm are as follows.

In the first row, the first column I is the set of states reachable from the NFA's initial state through any number of ε edges. Ia is the set of states reachable from the set I after reading a (followed by any number of ε edges), and Ib is the set of states reachable after reading b.

If Ia or Ib does not yet appear in column I, add it to column I in the next row, and repeat until no new sets appear. The resulting rows are the states of the DFA.
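The table-filling procedure can be sketched as a worklist loop. Since the example NFA from the original figure is not available, the edge list below is an assumed Thompson-style ε-NFA for a(b|c)* with start state 0 and accepting state 9; the function names and the edge-list format are likewise hypothetical.

```python
from collections import deque
from typing import Dict, FrozenSet, Iterable, List, Optional, Set, Tuple

EPS = None  # hypothetical marker for an ε edge
Edge = Tuple[int, Optional[str], int]


def eps_closure(edges: List[Edge], states: Iterable[int]) -> FrozenSet[int]:
    """All states reachable from `states` using only ε edges."""
    seen, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for src, sym, dst in edges:
            if src == s and sym is EPS and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return frozenset(seen)


def move(edges: List[Edge], states: Iterable[int], ch: str) -> Set[int]:
    """States reachable from `states` by reading exactly one `ch`."""
    return {dst for src, sym, dst in edges if src in states and sym == ch}


def subset_construction(edges, start, accept, alphabet):
    first = eps_closure(edges, {start})        # column I of the first row
    ids: Dict[FrozenSet[int], int] = {first: 0}
    delta: Dict[Tuple[int, str], int] = {}
    queue = deque([first])
    while queue:
        current = queue.popleft()
        for ch in alphabet:                    # columns Ia, Ib, ...
            target = eps_closure(edges, move(edges, current, ch))
            if not target:
                continue                       # no transition on ch
            if target not in ids:              # unseen set: add a new row
                ids[target] = len(ids)
                queue.append(target)
            delta[(ids[current], ch)] = ids[target]
    accepting = {i for subset, i in ids.items() if accept in subset}
    return delta, accepting, ids


# Assumed ε-NFA for a(b|c)*: start state 0, accepting state 9.
edges = [
    (0, "a", 1), (1, EPS, 8),
    (2, "b", 3), (4, "c", 5),
    (6, EPS, 2), (6, EPS, 4), (3, EPS, 7), (5, EPS, 7),
    (8, EPS, 6), (8, EPS, 9), (7, EPS, 6), (7, EPS, 9),
]
delta, accepting, ids = subset_construction(edges, 0, 9, "abc")
```

For this NFA the loop produces four DFA states. State 0 is the start, and states 1, 2 and 3 are accepting because their subsets contain NFA state 9.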

Finally, the DFA is minimized with the Hopcroft algorithm, whose idea is to merge equivalent states into a single state. Two states are equivalent if no input string can tell them apart, that is, if exactly the same strings lead to an accepting state from either one.
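The merging idea can be sketched with plain partition refinement. Note that this is the simpler Moore-style refinement, not Hopcroft's worklist algorithm: it arrives at the same partition but does more work per round. The example DFA is an assumed one for a(b|c)* (start state 0, accepting states 1, 2, 3). Missing transitions are treated as leading to an implicit dead state, and the sketch assumes the DFA has no explicit dead states of its own.

```python
from typing import Dict, Iterable, Set, Tuple

Delta = Dict[Tuple[int, str], int]


def minimize(states: Iterable[int], alphabet: str, delta: Delta,
             accepting: Set[int]) -> Dict[int, int]:
    """Return a map state -> block id; states in one block are equivalent."""
    states = list(states)
    # Initial split: accepting versus non-accepting states.
    block = {s: int(s in accepting) for s in states}
    while True:
        ids: Dict[tuple, int] = {}
        new_block: Dict[int, int] = {}
        for s in states:
            # Signature: own block plus the block reached on each symbol
            # (None stands for a missing transition).
            sig = (block[s],) + tuple(
                block.get(delta.get((s, ch))) for ch in alphabet)
            new_block[s] = ids.setdefault(sig, len(ids))
        if len(ids) == len(set(block.values())):
            return new_block  # no block was split: the partition is stable
        block = new_block


def quotient(delta: Delta, accepting: Set[int], start: int,
             block: Dict[int, int]):
    """Build the merged DFA by replacing each state with its block id."""
    new_delta = {(block[s], ch): block[t] for (s, ch), t in delta.items()}
    return new_delta, {block[s] for s in accepting}, block[start]


# Assumed DFA for a(b|c)*: start 0, accepting {1, 2, 3}.
delta = {(0, "a"): 1, (1, "b"): 2, (1, "c"): 3,
         (2, "b"): 2, (2, "c"): 3, (3, "b"): 2, (3, "c"): 3}
block = minimize(range(4), "abc", delta, {1, 2, 3})
min_delta, min_accepting, min_start = quotient(delta, {1, 2, 3}, 0, block)
```

Here states 1, 2 and 3 behave identically on every input, so they collapse into a single state and the four-state DFA shrinks to two states.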

That completes all the steps. An input string matches if, following the DFA's transitions character by character, we end in an accepting state. For the complete source, see here. It may contain bugs, and the final DFA minimization step has not been implemented, so please go easy on it.
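Matching against the finished DFA is a single loop over the input. The sketch below assumes the DFA is stored as a dictionary from (state, character) to the next state; the function name dfa_match and the two-state example DFA for a(b|c)* are hypothetical.

```python
from typing import Dict, Set, Tuple


def dfa_match(delta: Dict[Tuple[int, str], int], start: int,
              accepting: Set[int], text: str) -> bool:
    """Walk the DFA over `text`; match only if we stop in an accepting state."""
    state = start
    for ch in text:
        state = delta.get((state, ch))
        if state is None:      # no transition on this character: reject
            return False
    return state in accepting


# Assumed two-state DFA for a(b|c)*: start 0, accepting {1}.
delta = {(0, "a"): 1, (1, "b"): 1, (1, "c"): 1}

matched = dfa_match(delta, 0, {1}, "abcb")
```

Because each character triggers exactly one dictionary lookup, matching takes time linear in the length of the input, which is the payoff for the earlier conversion work.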

Finally, a few recommended links:

Wheel Brother's tutorial: http://www.cppblog.com/vczh/archive/2008/05/22/50763.html

http://www.cnblogs.com/cute/p/4021689.html (this author explains it quite clearly)

And once more, the NetEase online course: http://mooc.study.163.com/learn/USTC-1000002001?tid=1000003000#/learn/content
