Implementation of a simple regular expression engine

Last Update: 2016-02-24 Source: Internet

Author: User



Nearly every programmer uses regular expressions, but implementing a regular expression engine seems like a difficult task. In fact, a simple engine can be built once you understand lexical analysis, the front-end stage taught in a compiler principles course. Here is a recommended course on NetEase Cloud Classroom: http://mooc.study.163.com/course/USTC-1000002001?tid=1000003000#/info

Basic regular expressions

A regular expression consists of characters and metacharacters, and the whole expression describes a class of strings that share certain characteristics. For example, the expression abc denotes the string "abc", formed by concatenating the three characters 'a', 'b', and 'c' in order. The regular expressions implemented in this article are simple: they support only concatenation, alternation (selection), and closure. The definition is taken directly from the course slides:

The overall steps are roughly as follows:

NFA stands for nondeterministic finite automaton: on a given character, a state may have several states it can move to. DFA stands for deterministic finite automaton: on any character, a state has at most one state it can move to.

The Thompson algorithm is recursive: it first converts each single character into an NFA and then combines these small NFAs according to a few rules. The conversion for a single character (such as c) is as follows:

Two expressions in sequence (e.g. e1e2) are connected in the middle by an ε edge:

Next is alternation (e.g. e1|e2):

Closure (e.g. e1*) is more complex:
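Since the figures for these rules are missing here, the four cases can be sketched in code. This is only a minimal illustration and not the author's source: it assumes the expression has already been parsed into nested tuples such as ("cat", left, right), and the names thompson, nfa_accepts and EPS are made up for this example. The NFA is a plain list of (source, label, target) edges, with None standing for ε.

```python
import itertools

EPS = None  # label used for ε edges (an assumption of this sketch)

def thompson(tree):
    """Return (start, accept, edges); edges is a list of (src, label, dst)."""
    counter = itertools.count()   # hands out fresh state numbers
    edges = []

    def build(node):
        kind = node[0]
        if kind == "char":        # single character: s --c--> t
            s, t = next(counter), next(counter)
            edges.append((s, node[1], t))
            return s, t
        if kind == "cat":         # e1e2: accept of e1 --ε--> start of e2
            s1, t1 = build(node[1])
            s2, t2 = build(node[2])
            edges.append((t1, EPS, s2))
            return s1, t2
        if kind == "alt":         # e1|e2: new start and accept around both
            s, t = next(counter), next(counter)
            for part in node[1:]:
                ps, pt = build(part)
                edges.append((s, EPS, ps))
                edges.append((pt, EPS, t))
            return s, t
        if kind == "star":        # e1*: loop back, plus a bypass for zero repeats
            s, t = next(counter), next(counter)
            ps, pt = build(node[1])
            edges.extend([(s, EPS, ps), (pt, EPS, ps), (pt, EPS, t), (s, EPS, t)])
            return s, t
        raise ValueError("unknown node kind: %r" % (kind,))

    start, accept = build(tree)
    return start, accept, edges

def nfa_accepts(nfa, text):
    """Simulate the ε-NFA directly by tracking the set of active states."""
    start, accept, edges = nfa

    def closure(states):
        seen, stack = set(states), list(states)
        while stack:
            s = stack.pop()
            for src, label, dst in edges:
                if src == s and label is EPS and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
        return seen

    current = closure({start})
    for ch in text:
        moved = {dst for src, label, dst in edges if src in current and label == ch}
        current = closure(moved)
    return accept in current

# Hand-written tree for (a|b)*c
TREE = ("cat", ("star", ("alt", ("char", "a"), ("char", "b"))), ("char", "c"))
NFA = thompson(TREE)
print(nfa_accepts(NFA, "abac"))  # True
print(nfa_accepts(NFA, "ab"))    # False
```

Simulating the NFA like this is slow because every step scans all edges, which is one reason the later steps convert it to a DFA.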

Now that we know how to combine small NFAs into a larger one, how do we process a regular expression and convert it to an NFA? We could use two stacks, as when evaluating arithmetic expressions, but that only works in simple cases. Alternatively, recursive descent parsing can build an abstract syntax tree; the syntax tree for a|b is as follows. For the details of recursive descent parsing, see the course video recommended earlier or read the source directly.
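As a rough illustration of the recursive descent approach, here is a small parser for the three operations plus parentheses. It is a sketch under my own assumptions, not the author's code: the grammar (alternation binds loosest, then concatenation, then closure), the nested-tuple tree format, and the function name parse are all chosen for this example.

```python
def parse(pattern):
    """Parse a pattern into a nested-tuple syntax tree.

    Assumed grammar, one function per rule:
        alternation   := concatenation ('|' concatenation)*
        concatenation := closure*
        closure       := atom '*'*
        atom          := CHAR | '(' alternation ')'
    """
    pos = 0

    def peek():
        return pattern[pos] if pos < len(pattern) else None

    def alternation():
        nonlocal pos
        node = concatenation()
        while peek() == "|":
            pos += 1
            node = ("alt", node, concatenation())
        return node

    def concatenation():
        node = None
        while peek() is not None and peek() not in "|)":
            right = closure()
            node = right if node is None else ("cat", node, right)
        # An empty branch, as in "a|", becomes an explicit empty node.
        return node if node is not None else ("eps",)

    def closure():
        nonlocal pos
        node = atom()
        while peek() == "*":
            pos += 1
            node = ("star", node)
        return node

    def atom():
        nonlocal pos
        ch = peek()
        if ch == "(":
            pos += 1
            node = alternation()
            if peek() != ")":
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1
            return node
        if ch is None or ch in "|)*":
            raise ValueError("unexpected %r at position %d" % (ch, pos))
        pos += 1
        return ("char", ch)

    tree = alternation()
    if pos != len(pattern):
        raise ValueError("unexpected %r at position %d" % (pattern[pos], pos))
    return tree

print(parse("a|b"))  # ('alt', ('char', 'a'), ('char', 'b'))
```

Each grammar rule maps to one function, so operator precedence falls out of which function calls which; no explicit operator stack is needed.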

Because an NFA state can have several target states on the same character, we need to convert the NFA to a DFA. Our NFA is actually an ε-NFA with many ε edges, whereas a DFA has no ε edges; the subset construction algorithm removes them. The rough idea is as follows: starting from a set of states A, collect every state reachable on a given character, including states reached by following ε edges after the character has been consumed. That collection forms one subset, which becomes a single DFA state. Repeating this for every newly discovered subset eliminates the ε edges and yields the DFA.

For the NFA

The subset construction algorithm steps are as follows

The first column of the first row, I, is the set of nodes reachable from the NFA's initial node through any number of ε edges. Ia is the set of states reachable from the set I after receiving an a, and Ib is the set of states reachable after receiving a b.

If Ia or Ib has not yet appeared in column I, add it as a new row of I. The results are as follows:
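The table-filling procedure can be sketched as a worklist loop. The NFA below is a small hand-written one for a*b (it is not the NFA from the missing figure), and the edge-list representation and the names eps_closure and subset_construction are assumptions made for this example. Each row of the table corresponds to one key of the table dictionary.

```python
EPS = None  # label used for ε edges (an assumption of this sketch)

def eps_closure(states, edges):
    """All states reachable from `states` using only ε edges."""
    seen, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for src, label, dst in edges:
            if src == s and label is EPS and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return frozenset(seen)

def subset_construction(start, accept, edges):
    """Return (start_set, accepting_sets, table) for the equivalent DFA."""
    alphabet = sorted({label for _, label, _ in edges if label is not EPS})
    start_set = eps_closure({start}, edges)   # first row, column I
    table = {}                                # I -> {char: I_char}
    worklist = [start_set]
    while worklist:
        current = worklist.pop()
        if current in table:                  # this row is already filled in
            continue
        table[current] = {}
        for ch in alphabet:
            moved = {dst for src, label, dst in edges
                     if src in current and label == ch}
            if moved:
                target = eps_closure(moved, edges)
                table[current][ch] = target
                worklist.append(target)       # may become a new row
    accepting = {s for s in table if accept in s}
    return start_set, accepting, table

# ε-NFA for a*b: start state 0, accepting state 5
NFA_EDGES = [
    (0, EPS, 2), (2, "a", 3), (3, EPS, 2), (3, EPS, 1), (0, EPS, 1),
    (1, EPS, 4), (4, "b", 5),
]
start, accepting, table = subset_construction(0, 5, NFA_EDGES)
for row, targets in table.items():
    print(sorted(row), {ch: sorted(t) for ch, t in targets.items()})
```

Using frozenset for each subset lets the sets themselves serve as dictionary keys, so checking whether a subset has already appeared in column I is a plain dictionary lookup.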

The DFA is then minimized with the Hopcroft algorithm, whose idea is to merge equivalent states into a single node. For example, the following DFA

can be simplified to
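The idea of merging equivalent states can be sketched with a simple partition refinement. Note that this is the plainer Moore-style refinement rather than Hopcroft's worklist algorithm; it is slower but arrives at the same grouping of states. It assumes a complete DFA (every state has a transition on every character), and the DFA below is a made-up example, not the one from the missing figure.

```python
def minimize(states, alphabet, delta, accepting):
    """Map each DFA state to a block number; equivalent states share a block.

    Assumes delta[(state, ch)] is defined for every state and character.
    """
    # Start with two blocks: accepting and non-accepting states.
    block_of = {s: int(s in accepting) for s in states}
    while True:
        # Two states stay together only if they are in the same block and
        # every character sends them into the same block.
        signature = {
            s: (block_of[s],) + tuple(block_of[delta[(s, ch)]] for ch in alphabet)
            for s in states
        }
        ids = {}
        new_block_of = {s: ids.setdefault(signature[s], len(ids)) for s in states}
        if len(ids) == len(set(block_of.values())):
            return new_block_of   # no block was split: the partition is stable
        block_of = new_block_of

# DFA over {a, b} accepting strings that contain at least one 'a'.
# States 0 and 2 behave identically, and so do states 1 and 3.
STATES = [0, 1, 2, 3]
ALPHABET = ["a", "b"]
DELTA = {
    (0, "a"): 1, (0, "b"): 2,
    (1, "a"): 1, (1, "b"): 3,
    (2, "a"): 1, (2, "b"): 2,
    (3, "a"): 1, (3, "b"): 3,
}
blocks = minimize(STATES, ALPHABET, DELTA, accepting={1, 3})
print(blocks)  # {0: 0, 1: 1, 2: 0, 3: 1}
```

Each block of the result becomes one state of the minimized DFA, so this four-state example shrinks to two states.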

That completes all the steps. For an input string, if we can follow the DFA's transitions and end in an accepting state, the string matches. The source code can be seen here; it may contain bugs, and the final DFA minimization step has not been implemented, so please go easy on it.
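The final matching step is short enough to sketch in a few lines. The DFA here is a hand-written one for a*b stored as a dictionary keyed by (state, character); that representation and the name dfa_match are assumptions of this example rather than the author's source.

```python
def dfa_match(start, accepting, delta, text):
    """Walk the DFA over `text`; match only if we end in an accepting state."""
    state = start
    for ch in text:
        state = delta.get((state, ch))
        if state is None:          # no transition on this character: reject
            return False
    return state in accepting

# DFA for a*b: state 0 loops on 'a', and 'b' leads to accepting state 1
DELTA = {(0, "a"): 0, (0, "b"): 1}

print(dfa_match(0, {1}, DELTA, "aaab"))  # True
print(dfa_match(0, {1}, DELTA, "aba"))   # False
```

Each input character costs a single dictionary lookup, which is the payoff for doing the subset construction up front.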

Finally, a few recommended links:

Wheel Brother's (vczh) tutorial: http://www.cppblog.com/vczh/archive/2008/05/22/50763.html

http://www.cnblogs.com/cute/p/4021689.html, where the author explains it quite clearly.

And again the NetEase course: http://mooc.study.163.com/learn/USTC-1000002001?tid=1000003000#/learn/content


