Regular Expressions

What is a regular expression? A regular expression is a tool used primarily to describe string matching. It can of course be used to match other things, such as binary data, but strings are by far the most common case. The subject tends to bring up related questions:

How do you filter out malicious script code? Can you write a scripting-language engine or a compiler with regular expressions? What are the principles of compiling?

To be honest, explaining all the problems in this area properly may be beyond my ability, but I still want to write this down; if I don't, one day I may really forget everything.

First, regular expressions. Most regular expression engines are built on a non-deterministic finite automaton, or NFA, whereas compiler theory commonly works with the deterministic finite automaton, or DFA (often simply called a finite automaton or finite state machine). For a simple explanation of these concepts, a quick search for "finite automaton" turns up plenty of material, and I recommend one such article here.

What is the difference between an NFA and a DFA? The simplest difference is that an NFA may have several ways of interpreting the same input string, while a DFA has exactly one. This shows up in how matching proceeds. An NFA-style matcher may have to try one way of accepting the input and, when that fails, back up a level and try another. A DFA has only one way of reading the input, so on a mismatch there is nothing else to try and it can report "mismatch" straight away. For the same reason, a DFA should generally be more efficient than an NFA.

It is tempting to conclude that rules an NFA accepts cannot always be translated into rules a DFA understands. In the pure theory that is not the case: every NFA can be converted into an equivalent DFA by the subset construction, although the DFA may need many more states. What a DFA cannot express are the extensions that backtracking regex engines add on top, such as backreferences. The recommended article puts the comparison more concisely: the only difference between an NFA and a DFA is the state transition function.
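To make the point about the transition function concrete, here is a minimal sketch in Python. The automaton is a hypothetical example that does not come from the text above: it accepts strings over {a, b} that end in "ab". All names (`NFA_DELTA`, `nfa_accepts`, `build_dfa`, `dfa_accepts`) are invented for illustration. The NFA's transition function returns a set of possible next states, and the DFA derived from it by the subset construction returns exactly one.

```python
from itertools import chain

# Hypothetical NFA over {a, b} accepting strings that end in "ab".
# The transition function maps (state, symbol) to a SET of next states.
NFA_START = 0
NFA_ACCEPT = {2}
NFA_DELTA = {
    (0, "a"): {0, 1},  # on "a": stay in 0, or guess that the final "ab" starts here
    (0, "b"): {0},
    (1, "b"): {2},
}


def step(states, ch):
    """All NFA states reachable from any state in `states` on symbol `ch`."""
    return set(chain.from_iterable(NFA_DELTA.get((s, ch), ()) for s in states))


def nfa_accepts(text):
    """Simulate the NFA by tracking every state it could currently be in."""
    current = {NFA_START}
    for ch in text:
        current = step(current, ch)
    return bool(current & NFA_ACCEPT)


def build_dfa(alphabet="ab"):
    """Subset construction: each DFA state is a frozenset of NFA states."""
    start = frozenset({NFA_START})
    delta, todo, seen = {}, [start], {start}
    while todo:
        state = todo.pop()
        for ch in alphabet:
            target = frozenset(step(state, ch))
            delta[(state, ch)] = target  # exactly ONE next state per (state, symbol)
            if target not in seen:
                seen.add(target)
                todo.append(target)
    accepting = {s for s in seen if s & NFA_ACCEPT}
    return start, delta, accepting


def dfa_accepts(text, dfa):
    """Run the DFA: one state at a time, no alternatives to retry."""
    start, delta, accepting = dfa
    state = start
    for ch in text:
        if (state, ch) not in delta:  # symbol outside the alphabet
            return False
        state = delta[(state, ch)]
    return state in accepting


dfa = build_dfa()
print(nfa_accepts("aab"), dfa_accepts("aab", dfa))  # True True
```

For this small automaton the subset construction yields three DFA states; in general the number can grow exponentially, which is the price of removing the non-determinism.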

The grammar rules a compiler uses in lexical analysis actually look quite similar to regular expressions. So what do such rules look like? We are all familiar with things like this:

S -> A

A -> aA | B

B -> a | b

The three rules above are the state transition function; you can also think of them as derivation rules. S, A, B, a and b together make up the set of symbols: S, A and B are non-terminals, a and b are terminals, and S is the start symbol. From these rules we can see that the system accepts only two kinds of input: a...a or a...b. Why introduce non-terminals such as A and B? Because when describing a system like this you often run into parts that repeat and can be defined once, such as one or more spaces, \s+. We could write these out in full every time, or even express the whole thing as one single expression, but that quickly becomes troublesome and hard to read. For simplicity, parts that are repeated, lengthy, or important are each grouped under a non-terminal. Non-terminals cannot be defined arbitrarily; there are rules for doing so, which are not discussed here.

For compilers, there are two main ways to construct the DFA: LL parsing and LR parsing. Both work by predicting from the input, but the methods are very different: LL proceeds by derivation (top-down), LR by reduction (bottom-up). If that is hard to remember, think of the names: LL reads the input Left to right and builds a Leftmost derivation, while LR reads the input Left to right and builds a Rightmost derivation in reverse. LL is closer to the way people think, but it has some limitations. LR is harder to understand, but it applies to a wider range of grammars than LL, and its efficiency is also somewhat higher.

After all this it may seem I have said nothing about regular expressions, but everything so far was background; now for regular expressions proper. Look again at the three derivation rules given above. If everything is written using terminals only, they become:

a*(a|b)

This is the regular expression. For ordinary string matching, defining a large set of rules and state transition functions for the whole machine would be complicated and unnecessary; it is enough to hand the regular expression engine a single, simple matching string written entirely in terminals.
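As a rough check of that claim, the sketch below compares a small backtracking recognizer written directly from the three rules with Python's `re` module running `a*(a|b)`. It assumes the rules read S -> A, A -> aA | B, B -> a | b; the function names are made up for this illustration.

```python
import re
from itertools import product


def match_A(text, i):
    """A -> aA | B : yield every position where A could end, starting at i."""
    if i < len(text) and text[i] == "a":
        yield from match_A(text, i + 1)  # first alternative: aA
    yield from match_B(text, i)          # then try the second alternative: B


def match_B(text, i):
    """B -> a | b (assumed form of the third rule)."""
    if i < len(text) and text[i] in "ab":
        yield i + 1


def grammar_accepts(text):
    """S -> A : accept if some derivation of A consumes the whole input."""
    return any(end == len(text) for end in match_A(text, 0))


PATTERN = re.compile(r"a*(a|b)")


def regex_accepts(text):
    """Accept if the regular expression matches the whole input."""
    return PATTERN.fullmatch(text) is not None


def agree(max_len=6):
    """Compare both recognizers on every string over {a, b} up to max_len."""
    for n in range(max_len + 1):
        for chars in product("ab", repeat=n):
            s = "".join(chars)
            if grammar_accepts(s) != regex_accepts(s):
                return False
    return True


print(agree())  # True
```

Note how `match_A` tries one alternative and then falls back to the other, which is the NFA-style behaviour described earlier.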

But this does not mean we should construct an entire regular expression out of terminals alone. That can be very complicated and painful: for any slightly complex match, the expression becomes so convoluted that you cannot understand it, or get a headache just looking at it. Matching up the parentheses alone is enough to keep you busy for a good while, never mind things like escape characters. So being able to define non-terminals, as described above, is the key to building regular expressions efficiently. Unfortunately, regular expressions themselves have no such definition mechanism, and .NET does not give us an interface for it either. What's more, most of the people who make regular expression authoring tools don't seem to have thought of this at all.

Take the Regulator, which I mentioned last time. It has an analyzer that translates the whole expression into English, it has convenient snippets, and it has a good text editor, but that is all. The analyzer may help you understand an existing regular expression, but it does not help you construct one of your own. The snippets are modeled on the advanced features of C# 2.0; however, constructing a regular expression is not like writing a program. Typing a few parentheses and angle brackets is not where the effort goes, and regular expressions have no recurring code patterns to speak of, so snippets are not really helpful here. The text editor, however polished, does not step outside the box and offer non-terminals, so it does not improve efficiency either. You can try my Regular Expression Builder to see what I mean by the concept I'm proposing.
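The idea of building an expression from named non-terminals can be sketched in a few lines of Python. This is only an illustration of the concept, not how the builder tool mentioned above works: the `{{name}}` placeholder syntax, the `expand` helper and the fragment names are all invented for this example.

```python
import re

# Hypothetical placeholder syntax: {{name}} refers to a named fragment.
_PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def expand(template, fragments, _depth=0):
    """Replace every {{name}} with its definition, wrapped in a non-capturing
    group so that alternations inside a fragment keep their precedence."""
    if _depth > 20:
        raise ValueError("fragment definitions appear to be recursive")

    def substitute(match):
        name = match.group(1)
        return "(?:" + expand(fragments[name], fragments, _depth + 1) + ")"

    return _PLACEHOLDER.sub(substitute, template)


# The "non-terminals": short, named, reusable pieces.
FRAGMENTS = {
    "ws": r"\s+",
    "ident": r"[A-Za-z_]\w*",
    "number": r"\d+",
    "value": r"{{ident}}|{{number}}",  # fragments may refer to other fragments
}

# The top-level rule reads almost like a grammar production.
ASSIGNMENT = expand(r"{{ident}}{{ws}}?={{ws}}?{{value}}", FRAGMENTS)

print(ASSIGNMENT)
print(bool(re.fullmatch(ASSIGNMENT, "x = 42")))       # True
print(bool(re.fullmatch(ASSIGNMENT, "total=count")))  # True
print(bool(re.fullmatch(ASSIGNMENT, "1x = 2")))       # False
```

The final pattern is still a plain regular expression written only in terminals; the non-terminals exist only while the expression is being written, which is where they save effort.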

So what should you watch out for when constructing regular expressions? Because of its non-determinism, an NFA has fewer restrictions than a DFA and is more convenient to construct. But a badly constructed expression can be inefficient, so try to:

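1. Avoid non-determinism. Wherever possible, use the atomic group (?>A|B) in place of A|B or (?:A|B), where A and B stand for very long regular expression strings. An atomic group discards its backtracking positions once it has matched, so this is only safe when the rest of the pattern never needs the group to try a different alternative.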

2. Try to avoid deeply nested alternatives. For example:

a*(b*(c|d)|b*(e|f)|b*(g|h))|a*(b*(i*(j|k)|i*(l|m)|i*(n|o)))

This is very painful for a string such as a...b...i...o, because the matching process goes through:

a*b*c -> a*b* -> a*b*d -> a* -> a*b*e -> a*b* -> a*b*f -> a* -> a*b*g ... and so on: a long series of attempts, each followed by backtracking, before it finally reaches the a*b*i*o match. The expression is best transformed into the form a*b*(c|d|e|f|g|h|i*(j|k|l|m|n|o)). If you want to classify matches by group, I recommend changing your language format, keeping the grouping to a minimum, or using code to determine the actual classification.

3. If you can, and especially when the problem is complex, process the input in several simple steps to reduce the complexity of each expression. The simpler the expression, the more likely it is to be efficient and the less likely it is to contain a design error. Don't split the work into too many steps, though; after all, every pass of analysis takes time too.
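Before rewriting an expression as suggested in point 2, it is worth checking that the new form accepts the same strings. The sketch below does this by brute force in Python for the nested expression from point 2 (written with balanced parentheses) and its factored form. The constant names, the `same_language` helper and the reduced test alphabet are choices made for this illustration.

```python
import re
from itertools import product

# The nested expression from point 2, with balanced parentheses.
NESTED = re.compile(r"a*(b*(c|d)|b*(e|f)|b*(g|h))|a*(b*(i*(j|k)|i*(l|m)|i*(n|o)))")
# The factored form: shared prefixes are matched once, alternatives are flat.
FACTORED = re.compile(r"a*b*(c|d|e|f|g|h|i*(j|k|l|m|n|o))")


def same_language(alphabet="abcghijno", max_len=4):
    """Brute-force check: both patterns accept or reject every short string."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if (NESTED.fullmatch(s) is None) != (FACTORED.fullmatch(s) is None):
                return False
    return True


print(same_language())                            # True
print(FACTORED.fullmatch("aaabbbiio") is not None)  # True
```

A check like this only covers short strings over a reduced alphabet, so it is evidence rather than proof, but it catches most slips made while factoring an expression by hand.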