Writing high-quality JS: understanding regular expression backtracking in JavaScript
When a regular expression scans the target string, it tests the components of the expression from left to right, one by one, to see whether a match can be found at each position. For each quantifier and alternation, it must decide how to proceed. For a quantifier (such as *, +?, or {2,}), the regex must decide when to try matching more characters. For an alternation (written with the | operator), it must pick one of the available options to try.
When a regular expression makes such a decision, it remembers the other options so that it can return to them later if necessary. If the chosen option matches, the regex continues through the rest of the pattern, and if the rest of the pattern also matches, the match is complete. However, if the chosen option cannot find a match, or something later in the pattern fails, the regex backtracks to the last decision point and chooses one of the remaining options. This continues until a match is found or every possible combination of quantifier and alternation options has failed, at which point the regex gives up on this starting position, moves to the next character in the string, and repeats the process.
For example, the following code shows how this process handles alternation through backtracking.
/h(ello|appy) hippo/.test("hello there, happy hippo");
The regular expression above matches "hello hippo" or "happy hippo". It starts by searching for an h, which it finds immediately because the first letter of the target string is h. Next, the subexpression (ello|appy) offers two ways to proceed. The regex chooses the leftmost option (alternation always works from left to right) and checks whether ello matches the next characters of the string. It does, and the regex then matches the following space as well.
At that point, however, the regex reaches a dead end, because the h in hippo cannot match the next letter in the string, t. The regex cannot give up yet, because it has not tried all of its options, so it backtracks to the last decision point (just after matching the leading h) and tries the second alternation option. That option fails too, and with no options left, the regex concludes that no match can start at the first character of the string and tries again from the second character. It does not find an h there, so it keeps searching until it reaches the h of happy. It then steps through the alternation again. This time ello fails to match, but after backtracking to the second option, appy, the regex continues until it has matched the full string "happy hippo", and the match succeeds.
As another example, the following code demonstrates backtracking with repetition quantifiers.
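As a rough illustration of the walkthrough above, the following sketch uses exec() instead of test() so that the position of the match and the branch finally taken can be inspected. The variable names are chosen here for illustration only.

```javascript
// The same pattern and target string as in the example above.
const branchRegex = /h(ello|appy) hippo/;
const branchTarget = "hello there, happy hippo";

// exec() returns the matched text, the captured groups, and the start index.
const branchMatch = branchRegex.exec(branchTarget);

console.log(branchMatch[0]);    // "happy hippo"
console.log(branchMatch[1]);    // "appy", the second alternative
console.log(branchMatch.index); // 13, because the attempt starting at index 0 was abandoned
```

The index shows that the attempt beginning at the first h was given up after both branches failed there.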
var str = "<p>Para 1.</p>" + "<p>Para 2.</p>" + "<div>Div.</div>";
/<p>.*<\/p>/i.test(str);
The regular expression first matches the three literal characters <p> at the start of the string. Next comes .*. The dot matches any character except line breaks, and the greedy asterisk quantifier repeats it zero or more times, as many times as possible. Because the target string contains no line breaks, this consumes the entire rest of the string. However, there is more of the pattern left to match, so the regex tries to match <. That fails at the end of the string, so the regex backtracks one character at a time, retrying < each time, until it gets back to the < at the start of the </div> tag. It then tries to match \/ (an escaped forward slash), which succeeds, followed by p, which fails. The regex backtracks again and repeats this process until it matches the </p> at the end of the second paragraph. The match succeeds, but it spans from the beginning of the first paragraph to the end of the last one, which is probably not the result you wanted.
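To make the greedy result concrete, the sketch below captures the matched text with exec(). It assumes a target string in which both paragraphs have an opening <p> tag; greedyTarget, greedyRegex, and greedyMatch are illustrative names.

```javascript
// Assumed target string: two paragraphs followed by a div.
const greedyTarget = "<p>Para 1.</p>" + "<p>Para 2.</p>" + "<div>Div.</div>";
const greedyRegex = /<p>.*<\/p>/i;

// The greedy .* first runs to the end of the string, then backs up
// until the remaining <\/p> can match, which happens at the LAST </p>.
const greedyMatch = greedyRegex.exec(greedyTarget);

console.log(greedyMatch[0]); // "<p>Para 1.</p><p>Para 2.</p>"
```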
To match individual paragraphs, change the greedy quantifier * to its lazy (also known as non-greedy) counterpart, *?. Backtracking for lazy quantifiers works in the opposite direction. When the regular expression /<p>.*?<\/p>/ reaches .*?, it first tries to skip it altogether and move straight on to matching <\/p>.
It does so because *? repeats the preceding element zero or more times, as few times as possible, and the fewest possible is zero. However, when the following < fails to match at that point in the string, the regex backtracks and tries the next-fewest number of characters: one. It keeps extending the match one character at a time in this way until it reaches the end of the first paragraph, where the <\/p> that follows the quantifier matches in full.
If the target string contains only one paragraph, the greedy and lazy versions of this regular expression find the same match, but they arrive at it by different routes.
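The difference between the two quantifiers can be checked with a short sketch like the one below. The sample strings and variable names are assumptions made for illustration.

```javascript
const greedyPattern = /<p>.*<\/p>/i;
const lazyPattern = /<p>.*?<\/p>/i;

const twoParagraphs = "<p>Para 1.</p><p>Para 2.</p><div>Div.</div>";
const oneParagraph = "<p>Para 1.</p><div>Div.</div>";

// With two paragraphs, the lazy version stops at the first </p>.
console.log(lazyPattern.exec(twoParagraphs)[0]);   // "<p>Para 1.</p>"
console.log(greedyPattern.exec(twoParagraphs)[0]); // "<p>Para 1.</p><p>Para 2.</p>"

// With a single paragraph, both versions return the same text.
const sameResult =
  lazyPattern.exec(oneParagraph)[0] === greedyPattern.exec(oneParagraph)[0];
console.log(sameResult); // true

// Adding the g flag to the lazy version yields each paragraph separately.
const paragraphs = twoParagraphs.match(/<p>.*?<\/p>/gi);
console.log(paragraphs); // ["<p>Para 1.</p>", "<p>Para 2.</p>"]
```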
When a regular expression stalls the browser for seconds or longer, the cause is most likely runaway backtracking. To illustrate the problem, consider the following regular expression, which is designed to match an entire HTML file. The expression is wrapped onto multiple lines to fit the page. Unlike most other regex flavors, JavaScript originally had no option that makes the dot match any character, including line breaks (the s flag was only added in ES2018), so this example uses [\s\S] to match any character.
/
This regular expression works fine when it matches a well-formed HTML string, but it performs very badly when one or more of the required tags are missing from the target string. For example, if the
The solution to this problem is to be as specific as possible about which characters may appear between the required delimiters. Take the pattern ".*?", which is intended to match a string enclosed in double quotes. Replacing the overly permissive .*? with the more specific [^"\r\n]* removes several ways for backtracking to go wrong, such as the dot matching a double quote or the search expanding beyond the intended range.
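The following sketch shows one way the overly permissive version can expand beyond the intended range. The trailing ! in both patterns and the sample string are hypothetical additions, included only to force the regex to backtrack.

```javascript
// Both patterns look for a double-quoted string followed by "!".
const lazyDot = /".*?"!/;
const negatedClass = /"[^"\r\n]*"!/;

const quotedTarget = '"one" and "two"!';

// The lazy dot may step over quotes, so after "one" is not followed by "!",
// it keeps expanding and ends up spanning both quoted strings.
console.log(lazyDot.exec(quotedTarget)[0]);      // "one" and "two"!

// The negated class cannot cross a quote, so only the intended string matches.
console.log(negatedClass.exec(quotedTarget)[0]); // "two"!
```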
In the HTML example, the fix is not as simple. A negated character class such as [^<] cannot replace [\s\S], because other tags may appear between the ones being searched for. However, the same effect can be achieved by repeating a non-capturing group that contains a negative lookahead (blocking the next required tag) followed by the [\s\S] (any character) metasequence. This ensures that the tag you are looking for fails at every intermediate position and, more importantly, that the [\s\S] patterns cannot expand past the tags blocked by the lookahead during backtracking. With this approach applied, the regular expression finally becomes:
/
This removes the potential for runaway backtracking and lets the regular expression fail against an incomplete HTML string in time that is linear in the length of the text, but it does not make the expression efficient. Repeating a lookahead for every matched character is costly in its own right and noticeably slows down successful matches. The approach works well enough for short strings, but matching an HTML file may require the lookahead to be tested thousands of times.
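The repeated lookahead group can be sketched on a much smaller pattern. The expression below is a hypothetical, shortened stand-in that checks only three tags, not the full expression discussed above, and the sample documents are assumptions.

```javascript
// Hypothetical, shortened pattern for illustration only.
// Each (?:(?!TAG)[\s\S])* group consumes one character at a time and
// refuses to step over the tag named in its negative lookahead.
const guardedHtml = new RegExp(
  "<html>(?:(?!<head>)[\\s\\S])*" +
  "<head>(?:(?!</head>)[\\s\\S])*" +
  "</head>(?:(?!</html>)[\\s\\S])*" +
  "</html>"
);

const completeDoc =
  "<html>\n<head><title>Test</title></head>\n<body>Hi</body>\n</html>";
// Same document with the closing </html> tag removed.
const brokenDoc = completeDoc.replace("</html>", "");

console.log(guardedHtml.test(completeDoc)); // true
console.log(guardedHtml.test(brokenDoc));   // false
```

Because each group stops at its blocked tag, a failure at the end of the pattern leaves the earlier groups with no alternative positions to try.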