1. History:
Regular expressions grew out of neurophysiology research in the 1940s. The well-known mathematician Stephen Kleene summarized those studies and, in a paper on the algebra of regular sets, defined the "regular set", built an algebraic system over it, and introduced a notation for describing regular sets, which he called "regular expressions". After decades of study within theoretical mathematics, Ken Thompson, one of the creators of the Unix system, became the first to use regular expressions in computing: starting in 1968 he built them into the text-processing tools QED and, later, grep, with great success. Over the following decade and more, a large number of first-class computer scientists and hackers researched and experimented with regular expressions. In the early 1980s the two centers of the Unix movement, Bell Labs and the University of California, Berkeley, each studied and implemented regular expression engines around the grep tool. Around the same time, Alfred Aho, one of the authors of the compiler "Dragon Book", developed the egrep tool, which greatly extended and strengthened regular expression functionality; he later co-invented the popular awk text-processing language with Peter Weinberger and Brian Kernighan, co-author of The C Programming Language.
In 1986 regular expressions took a leap forward. First, the top-flight C hacker Henry Spencer published a regular expression library written in C in source form (the term "open source" did not exist yet), bringing the mystery of regular expressions into ordinary programmers' hands; then the technical wizard Larry Wall appeared and released the first version of the Perl language. Perl has been the standard-bearer of regular expressions ever since, and it is fair to say that the standards and status of today's regular expressions were shaped by Perl. After Perl 5.x was released, regular expressions entered a stable and mature period; their power has conquered almost every mainstream language platform, and they have become a basic tool that every professional developer must master.
2. Understanding DFA and NFA
Regular expression engines fall into two categories: DFA (deterministic finite automaton) and NFA (nondeterministic finite automaton). For either kind of engine to do its work, it needs two things, a regex and a text string; it holds one in its hand and consumes the other. A DFA holds the text and walks it against the regex: for each sub-expression of the regex it marks every position in the text that could still match, then looks at the next part of the regex and updates the markings according to the new results. An NFA holds the regex and walks the text: it eats one character at a time, compares it with the regex, and when it matches it makes a note, "matched here, at this point", then carries on. As soon as something fails to match, it spits the characters it just ate back out, one by one, until it is back at the last place that matched.
The differences between the DFA and NFA mechanisms have five practical consequences:
1. A DFA needs to scan each character of the text only once, so it is faster, but it supports fewer features; an NFA has to eat characters and spit them back out (backtrack), so it is slower, but it is feature-rich. For that reason NFAs are the more widely used today: the major regular expression engines, such as Perl, Ruby, Python's re module, and the regex libraries of Java and .NET, are all NFAs.
2. Only NFAs support features such as lazy quantifiers and backreferences (see the sketch after this list);
3. An NFA is eager to report success, so the leftmost alternative that can match wins, which occasionally misses the best overall match; a DFA instead reports the longest leftmost match.
4. NFAs use greedy quantifiers by default;
5. An NFA can fall into the trap of recursive backtracking (catastrophic backtracking), with very poor performance.
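Here is a minimal Python sketch (Python's re module is one of the NFA engines listed above) illustrating the NFA-only features and the backtracking trap; the patterns and strings are made up purely for illustration:

```python
import re

# Greedy (default) vs. lazy quantifier -- items 2 and 4.
html = "<b>bold</b>"
print(re.search(r"<.*>", html).group())   # '<b>bold</b>'  greedy: as much as possible
print(re.search(r"<.*?>", html).group())  # '<b>'          lazy: as little as possible

# Backreference, another NFA-only feature: \1 must repeat what group 1 captured.
print(re.search(r"\b(\w+) \1\b", "it is is a typo").group())  # 'is is'

# Catastrophic backtracking -- item 5: nested quantifiers give the NFA
# exponentially many ways to split the run of 'a's before the match fails.
# re.search(r"(a+)+$", "a" * 30 + "b")   # uncomment and watch it crawl
```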
Here is an example to illustrate the third effect.
Suppose we use the regex /perl|perlman/ to match the text 'perlman book'. An NFA is regex-directed: it holds the regex in its hand and eats the text character by character. After eating 'perl', the first alternative /perl/ has matched, so it makes a note of the success. It then eats an 'm', which does not fit /perl/, so it spits the 'm' back out and reports a successful match of 'perl'. It does not bother trying the second alternative /perlman/, so it never sees the better answer.
A DFA is text-directed: it holds the text in its hand and looks at the regex, consuming the text one bite at a time. As each character is eaten it ticks off the positions in the regex that are still alive. When /perl/ has been matched, the DFA does not stop; it tries another bite. At that point the first alternative has nothing left to consume, so it is dropped, and the second alternative /perlman/ continues with 'm'. That bite goes down well, since it still matches, so the DFA keeps eating until the regex is exhausted and contentedly reports a successful match of 'perlman'.
To make the NFA find the longer match, the pattern should be written as /perlman|perl/.
With the above example in mind, you can see why an NFA performs leftmost matching while a DFA performs longest-leftmost matching. In fact, a careful analysis of the differences between NFAs and DFAs will lead you to these principles on your own, and understanding them is very helpful for applying regular expressions effectively.
Note that the formal definition of regular expressions is deliberately kept minimal and avoids defining the redundant quantifiers ? and +, which can be expressed as a+ = aa* and a? = (a|ε). Sometimes the complement operator ~ is added: ~R denotes the set of all strings in Σ* that are not in R. The complement operator is redundant, since it can always be expressed with the other operators (although computing such a representation is complex, and the result may be exponentially larger).
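A quick, non-authoritative check of these equivalences using Python's re (a spot-check on a few sample strings, not a proof):

```python
import re

# Spot-check that a+ behaves like aa* and a? like (a|ε) on a few strings.
samples = ["", "a", "aa", "aaa", "b", "ab"]
for s in samples:
    assert bool(re.fullmatch(r"a+", s)) == bool(re.fullmatch(r"aa*", s))
    assert bool(re.fullmatch(r"a?", s)) == bool(re.fullmatch(r"a|", s))  # 'a|' encodes (a|ε)
print("a+ ≡ aa* and a? ≡ (a|ε) on all samples")
```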
Regular expressions in this sense can express exactly the regular languages, precisely the class of languages accepted by finite automata. There is, however, an important difference in conciseness: certain families of regular languages can only be described by deterministic automata whose size grows exponentially, while the length of the corresponding regular expressions grows only linearly. Regular expressions correspond to the Type-3 grammars of the Chomsky hierarchy. On the other hand, there is a simple mapping from regular expressions to nondeterministic finite automata (NFAs) that does not cause such a blowup in size; for this reason NFAs are often used as an alternative representation of regular expressions.
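The classic illustration of this conciseness gap is the language of strings over {a, b} whose n-th character from the end is an 'a': the regular expression grows linearly with n, but any DFA for it needs on the order of 2^n states, because it must remember the last n characters. A small Python sketch (the concrete pattern below is our own, chosen for illustration):

```python
import re

n = 4  # the n-th character from the end must be 'a'
pattern = re.compile(r"^[ab]*a[ab]{%d}$" % (n - 1))  # pattern length grows linearly in n
print(bool(pattern.match("bbbabbb")))  # True:  the 4th char from the end is 'a'
print(bool(pattern.match("bbbbbbb")))  # False: the 4th char from the end is 'b'
```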
It is also worth studying expressions within this formalism. As the a+ = aa* example above shows, different regular expressions can express the same language: the formalism is redundant.
It is possible to write an algorithm that, given two regular expressions, decides whether the languages they describe are equal: reduce each expression to a minimal deterministic finite automaton and check whether the two automata are isomorphic (equivalent).
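The full procedure (regex → NFA → DFA → minimization → isomorphism check) is too long to show here. As a stand-in, the following sketch, which is not that algorithm, merely compares two patterns on every string up to a bounded length; it can refute equivalence but never prove it:

```python
import re
from itertools import product

def agree_up_to(p1, p2, alphabet="ab", max_len=6):
    """Bounded check: do p1 and p2 accept the same strings of length <= max_len?"""
    r1, r2 = re.compile(p1), re.compile(p2)
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(r1.fullmatch(s)) != bool(r2.fullmatch(s)):
                return False
    return True

print(agree_up_to(r"a+", r"aa*"))         # True  (equivalent patterns)
print(agree_up_to(r"(ab)*", r"a(ba)*b"))  # False (only the first matches the empty string)
```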
To what extent can this redundancy be reduced? Can we find an interesting subset of regular expressions that still has the full expressive power? Kleene star and set union are obviously needed, but perhaps their use can be restricted. This turns out to be a surprisingly difficult question. Simple as regular expressions are, there is no method to systematically rewrite them into some normal form. The lack of an axiomatization in the past led to the star height problem. More recently, Dexter Kozen axiomatized regular expressions with Kleene algebra.
Many real-world "regular expression" engines implement features that cannot be expressed in regular expression algebra.
The engine types behind common regular expression implementations are:
Engine type     | Programs
DFA             | awk (most versions), egrep (most versions), flex, lex, MySQL, procmail
Traditional NFA | GNU Emacs, Java, grep (most versions), less, more, .NET languages, PCRE library, Perl, PHP (all three regex libraries), Python, Ruby, sed (most versions), vi
POSIX NFA       | mawk, Mortice Kern Systems' utilities, GNU Emacs (when explicitly requested)
DFA/NFA hybrid  | GNU awk, GNU grep/egrep, Tcl