DFA & NFA (simple comparison)

Source: Internet
Author: User
Tags: net, regex, expression, engine, egrep

1. History

 

Regular expressions trace back to neurophysiology research in the 1940s and were first formally described by the mathematician Stephen Kleene. Summarizing that research, Kleene defined "regular sets" in a paper on the algebra of regular sets, built an algebraic system on them, and introduced a notation for describing them: the "regular expression". After decades of study within theoretical mathematics, Ken Thompson, later a creator of the UNIX system, brought regular expressions into computing for the first time in 1968. Two practical text-processing tools built on them, QED and grep, were very successful.

Over the next decade, many first-rate computer scientists and hackers studied and practiced regular expressions intensively. In the early 1980s the two centers of the UNIX movement, Bell Labs and UC Berkeley, each studied and implemented regular expression engines around the grep tool. At the same time Alfred Aho, an author of the compiler "Dragon Book", developed the egrep tool, which greatly extended and strengthened regular expressions. Later he and Brian Kernighan, co-author of The C Programming Language, were among the inventors of the popular awk text-processing language.

Around 1986 regular expressions took a leap forward. First Henry Spencer, a top C hacker, released a regular expression library written in C in source-code form (it was not yet called open source), which brought the mysteries of regular expressions to ordinary programmers. Then Larry Wall released the first version of Perl. Since then Perl has been the flagship of regular expressions, and it is fair to say that today's regular expression standards and standing were shaped by Perl. With the release of Perl 5.x, regular expressions entered a stable, mature stage. Their power has won over almost every mainstream language platform, and they have become a basic tool that every professional developer must master.

 

2. Reference

Understanding DFA and NFA
Regular expression engines fall into two types: DFA and NFA. To do their work, both engines need a regular expression and a text string: they hold one in hand and eat their way through the other. A DFA holds the text string and compares it against the regular expression. When it reaches a subexpression, it marks every possible match, then looks at the next part of the regular expression and updates the marks according to the new result. An NFA holds the regular expression and compares it against the text. It eats a character, compares it with the regular expression, and notes any match ("on such a day of such a month of such a year, matched at such a place!") before carrying on. As soon as a character fails to match, it spits the characters back out one at a time until it is back at the last position that matched.
The difference between the DFA and NFA mechanisms has five consequences:
1. A DFA needs to scan each character of the text string only once, so it is faster, but it offers fewer features. An NFA has to eat and spit out characters over and over (backtracking), so it is slower, but it offers rich features and is therefore widely used. Today's major regular expression engines, such as those of Perl, Ruby, Python's re module, Java, and the .NET Regex library, are all NFA.
2. Only NFA supports features such as lazy quantifiers and backreferences.
3. An NFA is eager to claim success, so the leftmost sub-regular expression is matched first and the best match is occasionally missed. With a DFA, "the longest leftmost sub-regular expression is matched first".
4. NFA quantifiers are greedy by default (see item 4).
5. An NFA can fall into the trap of runaway recursive calls (backtracking), in which case performance is very poor.
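Items 2 and 4 can be seen directly in an NFA-style engine. The following is a minimal sketch using Python's standard `re` module (one of the NFA engines named above); the sample strings are made up for illustration.

```python
import re

text = "<a><b>"

# Greedy quantifier (the default): .+ takes as much as it can, then gives back.
greedy = re.search(r"<.+>", text).group()

# Lazy quantifier: .+? takes as little as it can and grows only when forced.
lazy = re.search(r"<.+?>", text).group()

# Backreference: \1 must repeat exactly the text captured by group 1.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test").group(1)

print(greedy)   # <a><b>
print(lazy)     # <a>
print(doubled)  # is
```

The greedy and lazy patterns differ only in the `?` after the quantifier, yet they return different matches from the same text.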

Here is an example that illustrates the third point.

Suppose the regular expression /perl|perlman/ is used to match the text 'perlman book'. An NFA is guided by the regular expression: holding the regular expression in hand with its eyes on the text, it eats one character at a time. After eating 'perl' it has matched the first alternative /perl/, so it notes the match. It then looks further and eats an 'm'. That is no good, because 'm' does not fit the alternative /perl/, so it spits the 'm' back out and reports a successful match of 'perl'. It cares about nothing else: it never tries the alternative /perlman/, so it never sees the better answer.

A DFA is guided by the text: holding the text in hand with its eyes on the regular expression, it eats one bite at a time. It eats /p/ and ticks off the 'p' in its hand, noting that this character has been matched, then eats on. When it has eaten /perl/, the DFA does not stop but tries one more bite. The first alternative is now used up, so the DFA drops it and eats the /m/ of the second alternative. That works, because it matches again, so it keeps eating. Once the regular expression is finished, it contentedly reports a successful match of 'perlman'.
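The text-directed behavior described above can be sketched for the special case of literal alternatives. This is not a real DFA construction, only an illustration of the idea: scan the text once, keep every alternative that still fits, and remember the longest one that has been fully consumed. The function name `longest_alternative_match` is hypothetical.

```python
def longest_alternative_match(alternatives, text, start=0):
    """Scan text once from `start`, tracking all alternatives that still fit.

    Returns the longest alternative that matches at `start`, or None.
    """
    live = list(alternatives)  # alternatives consistent with the text so far
    best = None
    for offset, ch in enumerate(text[start:]):
        # Keep only alternatives whose next character equals the text character.
        live = [alt for alt in live if len(alt) > offset and alt[offset] == ch]
        if not live:
            break
        # An alternative that is now fully consumed is a new, longer match.
        for alt in live:
            if len(alt) == offset + 1:
                best = alt
    return best


print(longest_alternative_match(["perl", "perlman"], "perlman book"))  # perlman
print(longest_alternative_match(["perlman", "perl"], "perlman book"))  # perlman
```

Because all alternatives are followed in parallel, the order in which they are written does not affect the result.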

So, to make an NFA give the right result here, the pattern should be written as /perlman|perl/.

The example shows why an NFA matches the leftmost alternative first, while a DFA matches the longest leftmost alternative first. In fact, every difference between NFA and DFA can be explained in this way if analyzed carefully. Understanding these principles helps in applying regular expressions effectively.
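The NFA side of the example can be reproduced with Python's standard `re` module, which is a backtracking (NFA-style) engine. A minimal sketch:

```python
import re

text = "perlman book"

# The leftmost alternative that succeeds wins, even if a longer one exists.
first = re.search(r"perl|perlman", text).group()

# Putting the longer alternative first gives the longer match.
second = re.search(r"perlman|perl", text).group()

print(first)   # perl
print(second)  # perlman
```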

 

The formal definition of regular expressions is deliberately kept minimal and avoids defining the redundant quantifiers ? and +, which can be expressed as a+ = aa* and a? = (a|ε). Sometimes a complement operator ~ is added; ~R denotes the set of all strings over Σ that are not in R. The complement operator is redundant because it can be expressed with the other operators (although computing such an expression is complex and the result may be exponentially larger).
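The two identities above can be spot-checked by brute force. The sketch below compares two patterns on every string up to a small length using Python's `re.fullmatch`; it is a bounded check, not a proof, and the helper name `same_language` and the length limit are assumptions made for illustration. The empty alternative in `(a|)` plays the role of ε.

```python
import re
from itertools import product


def same_language(p, q, alphabet="ab", max_len=6):
    """Return True if p and q accept the same strings up to max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(re.fullmatch(p, s)) != bool(re.fullmatch(q, s)):
                return False
    return True


print(same_language(r"a+", r"aa*"))   # True
print(same_language(r"a?", r"(a|)"))  # True
print(same_language(r"a+", r"a*"))    # False: they differ on the empty string
```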
Regular expressions in this sense can express the regular languages, which are exactly the class of languages accepted by finite state automata. There is an important difference in compactness, however. Some classes of regular languages can only be described by automata whose size grows exponentially, while the length of the required regular expression grows only linearly. Regular expressions correspond to the type-3 grammars of the Chomsky hierarchy. On the other hand, there is a simple mapping from regular expressions to nondeterministic finite automata (NFA) that does not lead to such a blowup. For this reason NFA are often used as an alternative representation of regular expressions.
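The size gap can be observed with a standard textbook language that is not taken from this article: strings over {a, b} whose n-th symbol from the end is 'a', written (a|b)*a(a|b)^(n-1). The sketch below builds an NFA with n + 1 states for it and counts the states produced by the subset construction. The function names are hypothetical.

```python
def nth_from_end_nfa(n):
    """NFA transitions for: the n-th symbol from the end is 'a'.

    States are 0..n; state 0 is the start state and state n is accepting.
    """
    delta = {(0, "a"): {0, 1}, (0, "b"): {0}}
    for i in range(1, n):
        delta[(i, "a")] = {i + 1}
        delta[(i, "b")] = {i + 1}
    return delta


def count_dfa_states(delta, start=0, alphabet="ab"):
    """Subset construction: count the DFA states reachable from {start}."""
    initial = frozenset({start})
    seen = {initial}
    todo = [initial]
    while todo:
        current = todo.pop()
        for ch in alphabet:
            nxt = frozenset(t for s in current for t in delta.get((s, ch), ()))
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return len(seen)


for n in range(1, 6):
    # NFA state count on the left, reachable DFA state count on the right.
    print(n + 1, count_dfa_states(nth_from_end_nfa(n)))
```

For this family the NFA grows by one state per step while the number of reachable DFA states doubles (2, 4, 8, 16, 32 for n = 1 to 5).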
Expressiveness can also be studied within this formalism. Different regular expressions can express the same language, for example (a|b)* and (a*b*)*, so the formalism contains redundancy.
It is possible to write an algorithm that decides, for two given regular expressions, whether the languages they describe are equal: reduce each expression to a minimal deterministic finite automaton and determine whether the two automata are isomorphic (equivalent).
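As a rough illustration of deciding equivalence, the sketch below uses a different technique from the one described above: instead of minimizing and testing for isomorphism, it walks the product of two DFAs and looks for a reachable state pair where exactly one side accepts. The DFAs are written by hand rather than derived from regular expressions, and all names are hypothetical.

```python
from collections import deque


def dfas_equivalent(dfa1, dfa2, alphabet):
    """Each DFA is (start, accepting_states, transitions).

    transitions maps (state, symbol) to the next state and must be total.
    """
    start1, accept1, trans1 = dfa1
    start2, accept2, trans2 = dfa2
    seen = {(start1, start2)}
    queue = deque(seen)
    while queue:
        s1, s2 = queue.popleft()
        # A reachable pair where only one DFA accepts separates the languages.
        if (s1 in accept1) != (s2 in accept2):
            return False
        for ch in alphabet:
            pair = (trans1[(s1, ch)], trans2[(s2, ch)])
            if pair not in seen:
                seen.add(pair)
                queue.append(pair)
    return True


# Strings over {a, b} that end in 'b', as a 2-state DFA.
ends_in_b = (0, {1}, {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1})

# The same language with a redundant third state.
ends_in_b_redundant = ("S", {"B"}, {
    ("S", "a"): "A", ("S", "b"): "B",
    ("A", "a"): "A", ("A", "b"): "B",
    ("B", "a"): "A", ("B", "b"): "B",
})

# A different language: strings that contain at least one 'b'.
contains_b = (0, {1}, {(0, "a"): 0, (0, "b"): 1, (1, "a"): 1, (1, "b"): 1})

print(dfas_equivalent(ends_in_b, ends_in_b_redundant, "ab"))  # True
print(dfas_equivalent(ends_in_b, contains_b, "ab"))           # False
```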
To what extent can this redundancy be eliminated? Can we find an interesting subset of regular expressions that is still fully expressive? The Kleene star and set union are obviously required, but perhaps their use can be restricted. This turns out to be a surprisingly difficult problem. Simple as regular expressions are, there is no method for systematically rewriting them into a normal form. The lack of an axiomatization in the past led to the star height problem. More recently, Dexter Kozen axiomatized regular expressions with Kleene algebra.
Many real-world "regular expression" engines implement features, backreferences for example, that cannot be expressed in the regular expression algebra.

 

 

The following programs use each type of regular expression engine:

 

Engine type        Programs
DFA                awk (most versions), egrep (most versions), flex, lex, MySQL, Procmail
Traditional NFA    GNU Emacs, Java, grep (most versions), less, more, .NET languages, PCRE library, Perl, PHP (all three regex libraries), Python, Ruby, sed (most versions), vi
POSIX NFA          mawk, Mortice Kern Systems' utilities, GNU Emacs (when requested)
DFA/NFA hybrid     GNU awk, GNU grep/egrep, Tcl
