Perl Regular Expressions third-week notes

Last Update:2015-10-19 Source: Internet

Author: User

Tags perl regular expression posix egrep

Regex engine classification
There are two main types of regex engine:
DFA: egrep, awk, lex, flex
NFA: .NET, PHP, Perl, Ruby, Python, GNU Emacs, ed, sed, vi, grep, etc.
NFA has a longer history than DFA, but both kinds of engine have been in development for more than 20 years, and many variants have appeared. POSIX was introduced to standardize this situation. POSIX specifies not only the behavior of the metacharacters but also the results a regular expression must produce.
DFA engines conform to the POSIX standard naturally, but an NFA engine must be modified to comply with it.
So engines can be divided into three kinds: DFA, traditional NFA, and POSIX NFA.
Testing the engine type
The type of engine a tool uses determines which features it can support and how those features behave.
Traditional NFA is the most widely used engine type, and it is easy to identify.
Method 1: check whether lazy (ignore-priority) quantifiers such as *? are supported. If they are, the engine is almost certainly a traditional NFA.
Method 2: use nfa|nfa not to match the string "nfa not". If only "nfa" is matched, the engine is a traditional NFA; if the whole of "nfa not" is matched, it is either a DFA or a POSIX NFA.
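The two checks above can be tried directly. The following is a minimal sketch that uses Python's re module as a stand-in for a traditional NFA engine; the lazy pattern n.*?a is an illustrative choice, not taken from the notes.

```python
import re

# Method 2: with ordered alternation, the first alternative that succeeds is kept.
alternation_result = re.search(r"nfa|nfa not", "nfa not").group(0)

# Method 1: a lazy quantifier compiles and stops at the earliest possible "a".
# (n.*?a is an illustrative pattern, not one from the notes.)
lazy_result = re.search(r"n.*?a", "nfa not").group(0)

print(alternation_result)
print(lazy_result)
```

A DFA or POSIX NFA would report the longer "nfa not" for the first pattern, because those engines return the longest match that starts at the leftmost position.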
Telling DFA and POSIX NFA apart
Method 1: a DFA does not support capturing parentheses or backreferences.
Method 2: use X(.+)+X to match a string of the form "=XX=================". If the attempt takes a long time, the engine is an NFA; if it finishes quickly, it is a DFA.
Tools such as awk, lex, and egrep support neither backreferences nor $1.
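The timing test in Method 2 can be sketched as follows, again with Python's backtracking re module. The helper name probe and the string lengths are assumptions chosen so that the run stays short; longer strings make the slowdown far more dramatic.

```python
import re
import time


def probe(n):
    """Time X(.+)+X against '=XX' followed by n '=' characters.

    No match is possible, so a backtracking engine has to exhaust
    every way of splitting the text between the two quantifiers.
    """
    text = "=XX" + "=" * n
    start = time.perf_counter()
    matched = re.search(r"X(.+)+X", text) is not None
    return matched, time.perf_counter() - start


for n in (4, 8, 12, 16):
    matched, seconds = probe(n)
    print(n, matched, f"{seconds:.4f}s")
```

On a backtracking engine the time should grow roughly exponentially with n; a DFA would answer in time proportional to the length of the text.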
Matching basics: rule 1
Rule 1: the leftmost match wins.
Because a match is first attempted at the start of the string being searched, the match that begins furthest to the left always takes precedence over any other possible match.
Using cat to match "The dragging belly indicates that your cat is too fat." finds the cat inside "indicates". The word "cat" could also be matched, but the cat in "indicates" appears earlier, so that is the one matched.
If the engine cannot find a match starting at the beginning of the string, it tries again from the next position in the string.
Using fat|cat|belly|your to match "The dragging belly indicates that your cat is too fat." gives belly as the result, because belly starts earliest in the string.
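Rule 1 is easy to confirm. This short sketch uses Python's re module on the sentence from the notes; the variable names are illustrative.

```python
import re

text = "The dragging belly indicates that your cat is too fat."

# The first "cat" in the text sits inside "indicates", not in the word "cat".
cat_match = re.search(r"cat", text)
print(cat_match.start(), text.index(" cat ") + 1)

# The order of the alternatives does not matter here: the earliest start wins.
first_word = re.search(r"fat|cat|belly|your", text).group(0)
print(first_word)
```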
Matching basics: rule 2
Rule 2: the standard quantifiers are greedy (match-first).
Complex regular expressions rely on quantifier metacharacters such as the asterisk, the plus sign, and the question mark.
The name says it all: what a standard quantifier ends up matching may not be the longest of all possible matches, but it always tries to match as many characters as it can, up to its upper limit.
For example, use \b\w+s\b to match a word ending in s, such as "regexes". On its own, \w+ can match the entire word "regexes", but if it does, nothing is left for s to match. To let the overall match succeed, \w+ settles for "regexe" and leaves the final s for s\b.
Use [0-9]+ to match the digits in "marc 1998". Once 1 is matched, the minimum requirement for success is already met, but because the quantifier is greedy it goes on to match 998 and stops only when no more consecutive digits are available.
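Both examples can be checked with a few lines of Python. The capturing parentheses around \w+ are added here only to show what the quantifier ends up keeping; they are not part of the original pattern.

```python
import re

# \w+ would like the whole word, but must leave the final "s" for the rest.
word = re.search(r"\b(\w+)s\b", "regexes")
kept_by_quantifier = word.group(1)

# [0-9]+ is satisfied after one digit, yet it keeps going while digits remain.
digits = re.search(r"[0-9]+", "marc 1998").group(0)

print(kept_by_quantifier, digits)
```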
Being too greedy
Greediness makes a quantifier match as much as it can, but in practice it works like this: the quantifier first takes as much as possible, and is then "forced" to "give back" part of what it matched when the rest of the regular expression needs it. This is what happened with "regexes" above. Giving back can never violate the quantifier's own minimum, however; a plus sign, for example, always keeps at least one match.
Let's look at how ^.*([0-9][0-9]) matches "about 24 char long":
After .* matches the entire string, the first [0-9] must match a digit, so .* is forced to give up the last character, g. That does not let [0-9] match, so .* keeps giving back characters. After several repetitions the character given back is 4, and this time the first [0-9] matches. But the second [0-9] then fails on the space that follows, so .* is forced to give back one more character, 2. Now both [0-9] match successfully, and .* ends up matching only "about " (including the trailing space).
If you use ^.*([0-9]+) to match "copyright 2003", only 3 is matched by [0-9]+. When two quantifiers compete, the rule is first come, first served: .* takes everything and gives back only the minimum that lets the following + succeed. One digit is enough, so 0 is never given back.
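The give-back behaviour can be observed by capturing what each part keeps. In this sketch the extra parentheses around .* are an addition for illustration only.

```python
import re

# .* gives back characters only until both [0-9] can match.
m = re.search(r"^(.*)([0-9][0-9])", "about 24 char long")
prefix, two_digits = m.group(1), m.group(2)

# With [0-9]+ a single digit is enough, so .* keeps "200".
m2 = re.search(r"^(.*)([0-9]+)", "copyright 2003")
greedy_part, number = m2.group(1), m2.group(2)

print(repr(prefix), two_digits)
print(repr(greedy_part), number)
```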
NFA engines: expression-directed
We use to(nite|knight|night) to match "tonight".
The engine starts with the t of the regular expression and checks one element at a time; when an element matches, it moves on to the next. On reaching the alternation, the NFA engine tries nite, knight, and night as three separate alternatives.
If nite fails to match, it tries knight; if that fails, it tries night.
Control passes from nite to knight to night, that is, it is driven by the elements of the expression, so this way of matching is called expression-directed.
DFA engines: text-directed
Unlike an NFA, a DFA keeps track of all the "currently live" match possibilities as it scans the string. Each new character of the text is checked only against the possibilities that are still live; those that have already failed are simply dropped.
Text: tonight
Regex: to(nite|knight|night)

This way of matching is called text-directed, because each character of the text controls what the engine does.
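The idea of tracking live possibilities can be sketched in a few lines. This is not a real DFA implementation: the function text_directed_match is hypothetical, handles only a literal prefix followed by literal alternatives, matches only at the start of the text, and stops as soon as one alternative is complete.

```python
def text_directed_match(text, prefix="to", alternatives=("nite", "knight", "night")):
    """Scan the text once, dropping alternatives as soon as a character disagrees.

    Returns (matched_text or None, history), where history lists the
    alternatives still live after each character of the text.
    """
    if not text.startswith(prefix):
        return None, []
    live = set(alternatives)
    history = []
    for i, ch in enumerate(text[len(prefix):]):
        # Keep only the candidates whose i-th character agrees with the text.
        live = {alt for alt in live if i < len(alt) and alt[i] == ch}
        history.append((ch, sorted(live)))
        if not live:
            return None, history          # no live possibility remains: fail now
        finished = [alt for alt in live if len(alt) == i + 1]
        if finished:
            return prefix + finished[0], history
    return None, history


result, history = text_directed_match("tonight")
print(result)
for ch, still_live in history:
    print(ch, still_live)
```

Each character of the text is examined exactly once, and knight drops out at the first n without ever being retried, which is the contrast with the expression-directed approach.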
Comparing NFA and DFA
In general, a text-directed DFA engine is faster, because an expression-directed NFA engine may have to test the same text against several subexpressions. In the previous example there are three alternatives, so the text "night" may be examined up to three times, which wastes time.
In addition, an NFA engine cannot know whether a match succeeds until it reaches the end of the regular expression. A DFA engine can report overall failure as soon as no "live" state remains.
But because an NFA is driven by the expression, users can improve efficiency by changing how the regular expression is written, which is what makes the NFA engine interesting to study.
Backtracking
The most important property of an NFA is that it processes each subexpression or element in turn. When it has to choose between two possible ways forward, it picks one and remembers the other in case it is needed later.
The situations that require a choice are quantifiers (whether to attempt another repetition) and alternation (which alternative to try now and which to leave for later).
Whichever option is chosen, if it matches and the rest of the regular expression also matches, the match is complete. If the rest of the regular expression fails, the engine knows it must go back to an earlier choice point and try one of the remaining options. In this way the engine eventually tries every possible way of matching the expression, or, more precisely, as many as it needs before a match is found.
Text: tonight
Regex: to(nite|knight|night)

Two key points of backtracking
When there are several options, which should be tried first?
Alternatives are tried in order from left to right.
When the choice is between "make an attempt" and "skip the attempt", the engine chooses "make an attempt" for greedy (match-first) quantifiers and "skip the attempt" for lazy (ignore-priority) quantifiers.

When backtracking, which saved state should be used?
When a local failure forces backtracking, the most recently saved option is used. In other words, saved states are handled last in, first out, like a stack.
It is like leaving a pile of breadcrumbs at every fork in the road: when you hit a dead end, you walk back along the way you came until you reach a pile of breadcrumbs, and take the untried path from there.

Saved states
In NFA terminology these breadcrumbs are saved states (standby states). Each one marks a point from which the match can be resumed if necessary. A saved state records two positions: the position in the regular expression and the position in the string at which the untried option would resume.

A match without backtracking:
Match "abc" with ab?c

A match with backtracking:
Match "ac" with ab?c

An unsuccessful match:
Match "abX" with ab?c

A lazy (ignore-priority) match:
Match "abc" with ab??c
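The four cases above can be traced with a toy backtracking matcher. This is a minimal sketch, not how any real engine is implemented: the functions parse and match_at_start are hypothetical, support only literal characters, x? and x??, and try the match at the start of the text only. Saved states are kept on a stack as (position in the expression, position in the text, forced), where forced means the optional item must now be taken.

```python
def parse(pattern):
    """Turn a pattern made of literals, x? and x?? into (char, kind) items."""
    items, i = [], 0
    while i < len(pattern):
        ch = pattern[i]
        if pattern[i + 1:i + 3] == "??":
            items.append((ch, "lazy"))
            i += 3
        elif pattern[i + 1:i + 2] == "?":
            items.append((ch, "greedy"))
            i += 2
        else:
            items.append((ch, "literal"))
            i += 1
    return items


def match_at_start(pattern, text):
    """Return (matched_text or None, number_of_backtracks)."""
    items = parse(pattern)
    saved = []                      # stack of saved states (the breadcrumbs)
    backtracks = 0
    pi, ti, forced = 0, 0, False
    while True:
        if pi == len(items):
            return text[:ti], backtracks
        ch, kind = items[pi]
        if kind == "lazy" and not forced:
            saved.append((pi, ti, True))        # later option: take the item
            pi += 1                             # first choice: skip it
            continue
        if kind == "greedy" and not forced:
            saved.append((pi + 1, ti, False))   # later option: skip the item
        # Attempt to consume ch at the current text position.
        if ti < len(text) and text[ti] == ch:
            pi, ti, forced = pi + 1, ti + 1, False
        elif saved:
            pi, ti, forced = saved.pop()        # most recent state first (LIFO)
            backtracks += 1
        else:
            return None, backtracks


for pattern, text in [("ab?c", "abc"), ("ab?c", "ac"), ("ab?c", "abX"), ("ab??c", "abc")]:
    print(pattern, text, match_at_start(pattern, text))
```

A real engine would go on to retry from later starting positions after the failure on "abX"; the sketch stops after the first position to keep the trace short.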

Backtracking and greediness
The backtracking examples showed how the greedy ? and the lazy ?? work. Now let's look at the asterisk and the plus sign.
If you think of x* as roughly the same as x?x?x?x?x?..., the analysis is almost the same as before, just repeated many times.
Use [0-9]+ to match "a 1234 num". By the time [0-9] fails on the space after 4, the plus has left four saved states to backtrack to, one after each digit (the gap marks the saved position):
1 234
12 34
123 4
1234

This is because [0-9]+ behaves like [0-9][0-9]?[0-9]?...
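One way to watch these saved states being used is to follow [0-9]+ with a second element that demands a fixed number of digits. This sketch is an illustration only; the trailing [0-9]{k} is an assumption added to force the plus to give back k digits.

```python
import re

text = "a 1234 num"

# Demand k extra digits after [0-9]+ and record what the plus is left with.
kept = []
for k in range(4):
    m = re.search(r"([0-9]+)([0-9]{%d})" % k, text)
    kept.append(m.group(1))

print(kept)
```

Each extra digit demanded sends the engine back to the next most recent saved state, so the plus gives up one digit at a time, from the right.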
Using lazy quantifiers
Traditional NFA engines can support lazy (ignore-priority) quantifiers:
*? is the lazy counterpart of *.
+? is the lazy counterpart of +.
?? is the lazy counterpart of ?.
Example: use .*? to match against "Billions and millions of"

Summary of greedy matching, lazy matching, and backtracking
Greedy or lazy, a quantifier always serves the overall match. When the overall match requires it and either kind of quantifier runs into a local failure, the engine returns to a saved state (finds the breadcrumbs) and tries a path it has not tried yet. So whichever kind is used, when the engine reports failure it has tried every possibility. The two kinds test the paths in a different order, but failure is reported only after all possible paths have been tested.
If only one match is possible (there is only one successful path), greedy and lazy quantifiers both find it; only the order in which paths are tested differs.
If more than one match is possible, the two give different results. Take the text:
The name "McDonald's" is said "makudonarudo" in Japanese.
Here .* matches the longest result and .*? matches the shortest.
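The difference shows up clearly on the sentence above. The notes do not give the full pattern, so this sketch assumes the quantifier sits between two double quotes.

```python
import re

text = 'The name "McDonald\'s" is said "makudonarudo" in Japanese.'

# Assumed patterns: the quantifier is placed between two double quotes.
longest = re.search(r'".*"', text).group(0)
shortest = re.search(r'".*?"', text).group(0)

print(longest)
print(shortest)
```

Both matches start at the first double quote, as the leftmost rule requires; they differ only in which closing quote they stop at.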

Alternation
The NFA engines used by Perl, PHP, Java, .NET, and other languages check the alternatives of an alternation in order from left to right. With subject|date, subject is tried first. If it matches, the engine carries on with the rest of the expression and date is never tried. If it fails, the engine backtracks and tries date at the same place.
To match a date such as "Jan 31", the simple Jan [0123][0-9] is not enough: it matches "Jan 00" and "Jan 39", and it cannot match a date such as "Jan 7".

One way is to split the day into ranges: use 0?[1-9] for the first nine days, with an optional leading 0; use [12][0-9] for the tenth to the twenty-ninth; and use 3[01] for the last two days. But the order of the alternatives matters. If we use Jan (0?[1-9]|[12][0-9]|3[01]) to match "Jan 31", we get only "Jan 3", because the first alternative already matches 3. Putting the alternative that can match the shortest text last solves the problem:
Jan ([12][0-9]|3[01]|0?[1-9])
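The effect of the ordering can be checked with Python's re module, which also tries alternatives from left to right. The names naive_order and fixed_order are illustrative.

```python
import re

naive_order = re.compile(r"Jan (0?[1-9]|[12][0-9]|3[01])")
fixed_order = re.compile(r"Jan ([12][0-9]|3[01]|0?[1-9])")

# The first alternative that succeeds is kept, even if a later one is longer.
cut_short = naive_order.search("Jan 31").group(0)
complete = fixed_order.search("Jan 31").group(0)

print(repr(cut_short), repr(complete))
print(fixed_order.search("Jan 7").group(0), fixed_order.search("Jan 07").group(0))
```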
Matching an IP address
An IP address consists of four numbers separated by dots, for example 001.002.003.004.
Using [0-9]*\.[0-9]*\.[0-9]*\.[0-9]* is clearly not precise enough: it even matches a string of just three dots.
The second idea is to replace * with + so that each field must contain a digit: [0-9]+\.[0-9]+\.[0-9]+\.[0-9]+ is the result. This still does not meet the requirement, because it matches something like 1234.5678.91234.3215600564.
Each field should be limited to one, two, or three digits. In flavors that support interval quantifiers you can use \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3} for this; in those that do not, each \d{1,3} can be written as \d\d?\d? or \d(\d?\d)? instead.
This expression does match IP addresses, but now we go a step further and match only valid ones. How?
Look at which digits can appear where in a field. No field of an IP address exceeds 255, so a field with only one or two digits is always valid and can be handled with \d|\d\d. Three-digit fields that start with 0 or 1 need no further checks either, because 000-199 are all valid, so the expression becomes \d|\d\d|[01]\d\d.
A three-digit field that starts with 2 is valid if it is no greater than 255. If the second digit is less than 5, the field is valid; if the second digit is 5, the third digit must be less than 6. This can be expressed as 2[0-4]\d|25[0-5].
The expression for one field is now \d|\d\d|[01]\d\d|2[0-4]\d|25[0-5], and the first three alternatives can be simplified to give [01]?\d\d?|2[0-4]\d|25[0-5].
Wrapping this field expression in parentheses and using it four times, with \. between the fields, gives an expression that matches a valid IP address.
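Putting the field expression together into a full pattern might look like the sketch below. The names OCTET and IP_ADDRESS, the non-capturing group, and the ^ and $ anchors are assumptions added so that each field is grouped and the whole string is checked.

```python
import re

# One field: 0-199 with optional leading digits, 200-249, or 250-255.
OCTET = r"(?:[01]?\d\d?|2[0-4]\d|25[0-5])"

# Four fields separated by literal dots, anchored at both ends (an assumption).
IP_ADDRESS = re.compile(rf"^{OCTET}\.{OCTET}\.{OCTET}\.{OCTET}$")

for candidate in ["001.002.003.004", "255.255.255.255", "256.1.1.1",
                  "1234.5678.91234.3215600564", "..."]:
    print(candidate, bool(IP_ADDRESS.match(candidate)))
```

Without the grouping, the alternation would split the whole expression rather than a single field, so the parentheses around each field are essential.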

Processing file names
Removing the leading path from a file name:
To turn /usr/local/bin/gcc into gcc we can use the greediness of .* so that it first matches the whole path, then add / so that the regular expression backtracks to the last slash; that is, .* is forced to give back characters until a slash is found. In Perl: =~ s{^.*/}{}
Getting the file name from a path:
Use [^/]*$ to get the final file name from the path.
However, if you have fully understood the earlier sections, you will see that this expression involves a lot of backtracking. Even the short /usr/local/bin/gcc goes through more than 40 backtracks before the match is found.
Separating the path and the file name:
Use ^(.*)/(.*)$ to separate the content before and after the last slash.
To be more precise, use ^(.*)/([^/]*)$
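The three file-name expressions translate directly to Python's re module. This sketch assumes the example path /usr/local/bin/gcc; the variable names are illustrative.

```python
import re

path = "/usr/local/bin/gcc"

# Remove the leading path: greedy .* runs to the end, then backs up to the last slash.
stripped = re.sub(r"^.*/", "", path)

# Take the file name directly: everything after the last slash, up to the end.
name = re.search(r"[^/]*$", path).group(0)

# Split into directory and file name at the last slash.
directory, filename = re.search(r"^(.*)/([^/]*)$", path).groups()

print(stripped, name, directory, filename)
```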

Matching balanced parentheses
The goal is to match (bar(this), 3.7) in val = foo(bar(this), 3.7) + 2 * (that - 1); with one of:

\(.*\)
\([^)]*\)
\([^()]*\)
The first matches too much, the second too little, and the third can match only (this).
A regular expression cannot match nested structures of arbitrary depth,
but it can match parentheses nested to a fixed depth.
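The three attempts, plus one possible pattern for a single level of nesting, can be compared as follows. The one-level pattern is a sketch written for this example, not an expression given in the notes.

```python
import re

text = "val = foo(bar(this), 3.7) + 2 * (that - 1);"

too_long = re.search(r"\(.*\)", text).group(0)
too_short = re.search(r"\([^)]*\)", text).group(0)
innermost = re.search(r"\([^()]*\)", text).group(0)

# Assumed pattern for exactly one level of nesting: plain text, then any number
# of "(plain text)" groups, each followed by more plain text.
one_level = re.search(r"\([^()]*(?:\([^()]*\)[^()]*)*\)", text).group(0)

print(too_long)
print(too_short)
print(innermost)
print(one_level)
```

Each additional level of nesting needs another copy of the inner group, which is why the depth has to be fixed in advance.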
Matching an HTML tag
The most common approach is to use <[^>]+> to match an HTML tag. This usually works.
However, if the tag itself contains >, for example <input name="dir" value=">">, the match is not correct, because it stops at the first >.
Readers familiar with HTML will notice that a > inside a tag can appear only within a single- or double-quoted attribute value after an = sign, so we only need to divide the content between < and > into quoted and unquoted text to match tags in all of these cases.

<("[^"]*"|'[^']*'|[^'">])*>
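A quick comparison of the simple and the quote-aware expressions, using the example tag from the notes. The names TAG and naive are illustrative.

```python
import re

html = '<input name="dir" value=">">'

# The simple version stops at the first ">", which is inside the quoted value.
naive = re.search(r"<[^>]+>", html).group(0)

# The quote-aware version treats quoted strings as single units.
TAG = re.compile(r"""<("[^"]*"|'[^']*'|[^'">])*>""")
whole_tag = TAG.search(html).group(0)

print(naive)
print(whole_tag)
```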
Matching an HTML link
In HTML, a link looks like <a href="http://www.oreilly.com">O'Reilly</a>
We use <a\b([^>]+)>(.*?)</a> to capture the tag attributes and the link text separately.
Then $1 =~ m{href \s*=\s* (?: "([^"]*)" | '([^']*)' | ([^'">\s]+) )}xi
Then use the $+ variable, which holds the text of the highest-numbered capture group ($1, $2, and so on) that actually took part in the match.
That is the URL we want.
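The same two-step extraction can be sketched in Python. Python has no $+ variable; this example assumes that the match object's lastindex attribute, which gives the index of the last capturing group that matched, is a close enough analogue for a flat alternation like this one.

```python
import re

html = '<a href="http://www.oreilly.com">O\'Reilly</a>'

# Step 1: capture the attributes and the link text separately.
link = re.search(r"<a\b([^>]+)>(.*?)</a>", html, re.I | re.S)
attributes, link_text = link.group(1), link.group(2)

# Step 2: find the href value, whether double-quoted, single-quoted, or bare.
href = re.search(
    r"""href \s*=\s* (?: "([^"]*)" | '([^']*)' | ([^'">\s]+) )""",
    attributes,
    re.I | re.X,
)

# Only one of the three groups took part; lastindex says which one.
url = href.group(href.lastindex)

print(url, link_text)
```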

Validating an HTTP URL
Now that we have the URL, let's check whether it is an HTTP URL and, if so, break it down into a host name and a path.
The host name is whatever follows ^http:// up to the first slash (if any), and the path is everything else: ^http://([^/]+)(/.*)?$
The URL may also contain a port number, which sits between the host name and the path and starts with a colon:

^http://([^/:]+)(:(\d+))?(/.*)?$
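A small helper built on the final expression might look like this. The function name split_http_url and the fallback values (port 80 and path "/") are assumptions for illustration; the notes specify only the regular expression.

```python
import re

HTTP_URL = re.compile(r"^http://([^/:]+)(:(\d+))?(/.*)?$")


def split_http_url(url):
    """Return (host, port, path) for an HTTP URL, or None if it is not one."""
    m = HTTP_URL.match(url)
    if m is None:
        return None
    host, port, path = m.group(1), m.group(3), m.group(4)
    # Assumed defaults: port 80 and path "/" when the URL omits them.
    return host, int(port) if port else 80, path or "/"


print(split_http_url("http://www.oreilly.com:8080/catalog/"))
print(split_http_url("http://www.oreilly.com"))
print(split_http_url("ftp://www.oreilly.com"))
```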
Original address: http://www.cnblogs.com/XBlack/p/4892874.html
