Traps hidden in regular expressions

Last Update:2018-06-22 Source: Internet

Author: User

Tags expression engine high cpu usage


A few days ago, the monitoring for one of our projects suddenly raised an alert. When we logged on to the machine to check resource usage, we found CPU utilization close to 100%. Using Java's thread dump tool, we exported the stack traces of the affected process.

All of the stacks pointed to a method named validateUrl, and the dump contained more than 100 such entries. Reading the code showed that the main purpose of this method is to check whether a URL is valid.

It seemed strange that a regular expression could cause high CPU utilization. To reproduce the problem, we extracted the key code into a simple unit test.

public static void main(String[] args) {
    // The URL-validation regular expression under test.
    String badRegex = "^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://)(([A-Za-z0-9-~]+).)+([A-Za-z0-9-~\\\\/])+$";
    // The URL that triggered the CPU spike.
    String bugUrl = "http://www.fapiao.com/dddp-web/pdf/download?request=6e7jgxxxxx4ild-kexxxxxxxqj4-chlmqvnenxc692m74h38sdfdsazxcumfcoh2fafy1vw__%5edadifjgief";
    if (bugUrl.matches(badRegex)) {
        System.out.println("match!!");
    } else {
        System.out.println("no match!!");
    }
}

When we ran the example above, Resource Monitor showed the CPU utilization of the java process jump straight to 91.4%.

At this point we could be fairly sure that this regular expression was the culprit behind the high CPU utilization.
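If you want to reproduce this without pinning a CPU core, one option is to cap the amount of work the engine may do. The following is a minimal sketch, not the code from the incident: the names BudgetedRegex, CountingSequence, and tryMatch are hypothetical, and the budget of 1,000,000 character reads is an arbitrary assumption. It wraps the input in a CharSequence that counts every charAt call and aborts once the budget is used up. Whether the problem URL ends in NO_MATCH or GAVE_UP depends on the JDK version you run it on.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: runs a regex match but gives up after a fixed number of character reads. */
class BudgetedRegex {

    /** Thrown when the regex engine has read more characters than the budget allows. */
    static final class BudgetExceeded extends RuntimeException {
        BudgetExceeded() { super("read budget exceeded"); }
    }

    /** Wraps the input and counts every charAt call made by the regex engine. */
    static final class CountingSequence implements CharSequence {
        private final CharSequence inner;
        private final long budget;
        private long reads;

        CountingSequence(CharSequence inner, long budget) {
            this.inner = inner;
            this.budget = budget;
        }

        @Override public char charAt(int index) {
            if (++reads > budget) {
                throw new BudgetExceeded();
            }
            return inner.charAt(index);
        }

        @Override public int length() { return inner.length(); }

        @Override public CharSequence subSequence(int start, int end) {
            return new CountingSequence(inner.subSequence(start, end), budget);
        }

        @Override public String toString() { return inner.toString(); }
    }

    enum Outcome { MATCH, NO_MATCH, GAVE_UP }

    /** Matches the whole input against regex, or reports GAVE_UP once the budget is spent. */
    static Outcome tryMatch(String regex, String input, long budget) {
        try {
            boolean ok = Pattern.compile(regex)
                    .matcher(new CountingSequence(input, budget))
                    .matches();
            return ok ? Outcome.MATCH : Outcome.NO_MATCH;
        } catch (BudgetExceeded e) {
            return Outcome.GAVE_UP;
        }
    }

    public static void main(String[] args) {
        String badRegex = "^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://)(([A-Za-z0-9-~]+).)+([A-Za-z0-9-~\\\\/])+$";
        String bugUrl = "http://www.fapiao.com/dddp-web/pdf/download?request=6e7jgxxxxx4ild-kexxxxxxxqj4-chlmqvnenxc692m74h38sdfdsazxcumfcoh2fafy1vw__%5edadifjgief";
        // Assumed budget: one million character reads.
        System.out.println(tryMatch(badRegex, bugUrl, 1_000_000));
    }
}
```

The same wrapper can be kept around a validation call as a safety net, so that a pathological input costs a bounded amount of CPU instead of an unbounded one.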

So we focused our troubleshooting on that regular expression:

^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://)(([A-Za-z0-9-~]+).)+([A-Za-z0-9-~\\/])+$

At first glance this regular expression looks fine. It can be divided into three parts:

The first part matches the http and https protocols, the second part matches the "www." portion of the domain, and the third part matches the remaining characters. I stared at the expression for a long time and could not see anything seriously wrong with it.

The key reason for the high CPU usage is that Java's regular expression engine is implemented as an NFA automaton, and this kind of engine backtracks while matching characters. Once backtracking occurs, a match can take a long time, from a few minutes to a few hours, depending on how often the engine backtracks and how complex the backtracking is.

If it is still unclear what backtracking is at this point, don't worry. Let's start from how regular expressions work.

Regular expression engine

A regular expression is a convenient notation for matching text, but such a powerful and complex syntax needs an algorithm behind it, and the implementation of that algorithm is called the regular expression engine. Put simply, there are two ways to implement a regular expression engine: as a DFA (deterministic finite automaton) or as an NFA (nondeterministic finite automaton).

The two kinds of automata differ in many ways, and we won't go deep into their principles here. In short, a DFA runs in linear time and is more stable, but its features are limited. The time complexity of an NFA is less stable: sometimes it is very good and sometimes it is not, depending on the regular expression you write. But the NFA is more powerful, so languages such as Java, .NET, Perl, Python, Ruby, and PHP use an NFA to implement their regular expressions.

How does an NFA perform a match? We use the following text and expression to illustrate.

text="Today is a nice day." regex="day"

One important point to remember is that the NFA is driven by the regular expression. That is, the NFA reads one character of the regular expression and tries to match it against the target string. If the match succeeds, it moves on to the next character of the regular expression; otherwise it continues comparing with the next character of the target string. If that is not clear yet, let's walk through the example above step by step.

First, take the first character of the regular expression: d. Compare it with the characters of the string. The first character of the string is T, which does not match, so move to the next one. The second is o, which does not match either, so move on again. The third is d, which matches, so read the second character of the regular expression: a.

With the second character of the regular expression, a, continue with the fourth character of the string, a, which matches. Then read the third character of the regular expression: y.

With the third character of the regular expression, y, continue with the fifth character of the string, y, which matches. The engine then tries to read the next character of the regular expression, finds that there is none, and the match ends.

This is how an NFA automaton matches. Real matches are more complex than this, but the principle is the same.
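The walk-through above can be sketched in a few lines of Java. This is a hand-written illustration of the pattern-driven scan for a purely literal pattern, not how java.util.regex is implemented internally; the class and method names are hypothetical. The main method also runs the real engine on the same input for comparison.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LiteralScan {

    /** Returns the index where literal first occurs in text, or -1 if it never does. */
    static int firstMatch(String text, String literal) {
        for (int start = 0; start + literal.length() <= text.length(); start++) {
            int p = 0; // position inside the pattern
            while (p < literal.length() && text.charAt(start + p) == literal.charAt(p)) {
                p++; // this pattern character matched, read the next one
            }
            if (p == literal.length()) {
                return start; // no pattern characters left: the match is complete
            }
            // Mismatch: start over from the next character of the text.
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "Today is a nice day.";
        System.out.println(firstMatch(text, "day")); // prints 2

        Matcher m = Pattern.compile("day").matcher(text);
        if (m.find()) {
            System.out.println(m.start() + ".." + m.end()); // prints 2..5
        }
    }
}
```

Both approaches report the match at index 2, which is the third character of the string, as in the step-by-step description.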

Backtracking of NFA automata

Now that we know how the NFA matches strings, we can turn to the focus of this article: backtracking. To explain it, we use the following example.

text="abbc" regex="ab{1,3}c"

The goal of this example is simple: match a string that starts with a, ends with c, and has 1 to 3 b characters in between. The NFA processes it like this:

First, read the first element of the regular expression, a, and compare it with the first character of the string, a. They match, so read the second element of the regular expression.

Read the second element of the regular expression, b{1,3}, and compare it with the second character of the string, b. They match. But because b{1,3} stands for 1 to 3 b characters, and because the NFA is greedy (it matches as much as possible), it does not move on to the next element of the regular expression yet. Instead it uses b{1,3} again on the third character of the string, b, which also matches. It then uses b{1,3} on the fourth character of the string, c, and finds a mismatch. Backtracking occurs at this point.

What happens during backtracking? The fourth character of the string, c, which has already been read, is given back, and the pointer returns to the position after the third character. The engine then reads the next element of the regular expression, c, compares it with the next character at the current pointer, c, and finds a match. It then tries to read the next element, but the expression has ended.
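To make these steps visible, here is a small hand-written tracer. It is only a sketch under stated assumptions: it handles exactly one pattern shape (the letter a, then b{min,max}, then a literal tail), and the names GreedyTrace and matches are hypothetical. The second call in main uses the variant ab{1,3}bc, which is not from the original example, to show a case where the engine must give back a b it has already consumed.

```java
import java.util.ArrayList;
import java.util.List;

class GreedyTrace {

    /**
     * Matches the whole of text against: 'a', then greedy b{min,max}, then the literal tail.
     * Every step taken is appended to trace.
     */
    static boolean matches(String text, int min, int max, String tail, List<String> trace) {
        if (text.isEmpty() || text.charAt(0) != 'a') {
            trace.add("'a' fails at index 0");
            return false;
        }
        trace.add("'a' matches at index 0");

        String quant = "b{" + min + "," + max + "}";
        int pos = 1;
        int taken = 0;
        // Greedy phase: keep taking 'b' for as long as the quantifier allows it.
        while (taken < max && pos < text.length() && text.charAt(pos) == 'b') {
            trace.add(quant + " takes 'b' at index " + pos);
            pos++;
            taken++;
        }
        if (taken < max && pos < text.length()) {
            trace.add(quant + " rejects '" + text.charAt(pos) + "' at index " + pos);
        }
        if (taken < min) {
            trace.add(quant + " needs at least " + min + " 'b'");
            return false;
        }
        // Try the rest of the pattern; on failure give back one 'b' and retry.
        while (true) {
            if (text.startsWith(tail, pos) && pos + tail.length() == text.length()) {
                trace.add("\"" + tail + "\" matches at index " + pos);
                return true;
            }
            if (taken == min) {
                trace.add("\"" + tail + "\" fails at index " + pos + ", nothing left to give back");
                return false;
            }
            trace.add("\"" + tail + "\" fails at index " + pos + ", " + quant + " gives back one 'b'");
            pos--;
            taken--;
        }
    }

    public static void main(String[] args) {
        List<String> trace = new ArrayList<>();
        System.out.println(matches("abbc", 1, 3, "c", trace));  // the example from the text
        trace.forEach(System.out::println);

        trace.clear();
        System.out.println(matches("abbc", 1, 3, "bc", trace)); // hypothetical variant ab{1,3}bc
        trace.forEach(System.out::println);
    }
}
```

In the first run the quantifier stops at c and the tail matches right away. In the second run the tail fails first, one b is given back, and only then does the match succeed. Every such retry costs time, which is what adds up in the URL case.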

Now let's look again at the regular expression that validates the URL:

^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://)(([A-Za-z0-9-~]+).)+([A-Za-z0-9-~\\/])+$

The URL where the problem occurred is:

http://www.fapiao.com/dzfp-web/pdf/download?request=6e7jgm38jfjghvrv4ild-ken64hcux4ql4a4qj4-chlmqvnenxc692m74h5oxkjgdsyazxcumfcoh2fafy1vw__%5edadifjgief

We divide this regular expression into three parts:

Part one: verifies the protocol. ^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://)

Part two: verifies the domain name. (([A-Za-z0-9-~]+).)+

Part three: verifies the parameters. ([A-Za-z0-9-~\\/])+$

The part of the regular expression that checks the protocol, http://, is fine. But when verifying www.fapiao.com, the expression uses the form "xxxx." (one or more characters followed by a dot). Because the dot is not escaped, it matches any character, not only a literal dot. So the matching process actually goes like this:

Match www.

Match fapiao.

Match com/dzfp-web/pdf/download?request=6e7jgm38jf.... Because of greedy matching, the engine keeps reading the rest of the string, finally finds no dot to match, and then backtracks one character at a time.

This is the first problem with this regular expression.

The other problem is in the third part of the regular expression. The URL in question contains an underscore (_) and a percent sign (%), but the character class in the third part does not include them. As a result, the engine matches a long run of characters before it discovers that the match fails, and then it has to backtrack.

This is the second problem with this regular expression.
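A quick way to confirm which characters the third part cannot accept is to test each character of the query value against that character class. This is a small sketch for checking the claim above; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class UnmatchedChars {

    /** Returns the distinct characters of s that charClass rejects, in order of first appearance. */
    static String rejectedBy(String charClass, String s) {
        Pattern single = Pattern.compile(charClass);
        StringBuilder rejected = new StringBuilder();
        for (char c : s.toCharArray()) {
            String ch = String.valueOf(c);
            if (!single.matcher(ch).matches() && rejected.indexOf(ch) < 0) {
                rejected.append(c);
            }
        }
        return rejected.toString();
    }

    public static void main(String[] args) {
        String url = "http://www.fapiao.com/dzfp-web/pdf/download?request=6e7jgm38jfjghvrv4ild-ken64hcux4ql4a4qj4-chlmqvnenxc692m74h5oxkjgdsyazxcumfcoh2fafy1vw__%5edadifjgief";
        // Everything after "request=".
        String query = url.substring(url.indexOf('=') + 1);
        // Character class from part three of the regular expression.
        System.out.println(rejectedBy("[A-Za-z0-9-~\\\\/]", query)); // prints _%
    }
}
```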

Solution

Once we understand that backtracking is the cause of the problem, the solution is to reduce the backtracking. If I add the underscore and the percent sign to the third part, the program behaves normally.

public static void main(String[] args) {
    // Same expression as before, with _ and % added to the third part.
    String badRegex = "^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://)(([A-Za-z0-9-~]+).)+([A-Za-z0-9-~_%\\\\/])+$";
    String bugUrl = "http://www.fapiao.com/dddp-web/pdf/download?request=6e7jgxxxxx4ild-kexxxxxxxqj4-chlmqvnenxc692m74h38sdfdsazxcumfcoh2fafy1vw__%5edadifjgief";
    if (bugUrl.matches(badRegex)) {
        System.out.println("match!!");
    } else {
        System.out.println("no match!!");
    }
}

Run the program above and it prints match!! immediately.

But this is not enough. If other URLs later contain other unexpected characters, we would have to change the expression again, which is not realistic.

In fact, quantifiers in regular expressions have three modes: greedy, lazy, and possessive.

The quantifiers are ?, +, *, and {min,max}. Used on their own, they are greedy.

If you add a ? after them, the greedy mode becomes lazy mode, which matches as little as possible. But lazy mode still backtracks. Consider the following example:

text="abbc" regex="ab{1,3}?c"

The first element of the regular expression, a, matches the first character of the string, a. Then the second element, b{1,3}?, matches the second character of the string, b. Because of the minimal-matching rule, the engine next tries the third element of the regular expression, c, against the third character of the string, b, and finds that it does not match. So it backtracks and uses the second element, b{1,3}?, on the third character, b, which matches. Finally the third element, c, is compared with the fourth character of the string, c, and matches. The match ends.

If you add a + after them, the greedy mode becomes possessive mode, which matches as much as possible but does not backtrack.
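The three modes can be compared directly in Java. The sketch below uses only java.util.regex; the helper name firstGroup is hypothetical, and the last two patterns (ending in bc) are a variant added here to show a case where giving back a character is required.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierModes {

    /** Returns what group 1 captures on the first find(), or null if nothing is found. */
    static String firstGroup(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // How much does each mode take from "abbb" when nothing follows the quantifier?
        System.out.println(firstGroup("a(b{1,3})", "abbb"));   // greedy: bbb
        System.out.println(firstGroup("a(b{1,3}?)", "abbb"));  // lazy: b
        System.out.println(firstGroup("a(b{1,3}+)", "abbb"));  // possessive: bbb

        // The example from the text matches in all three modes.
        System.out.println("abbc".matches("ab{1,3}c"));   // true
        System.out.println("abbc".matches("ab{1,3}?c"));  // true
        System.out.println("abbc".matches("ab{1,3}+c"));  // true

        // Variant where one 'b' must be given back for the match to succeed.
        System.out.println("abbc".matches("ab{1,3}bc"));   // true: greedy backtracks
        System.out.println("abbc".matches("ab{1,3}+bc"));  // false: possessive never gives back
    }
}
```

The last line is the important one: a possessive quantifier saves the backtracking work, but it can also turn a match into a non-match, so switching modes is not always a behavior-preserving change.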

So to solve the problem thoroughly, we must keep the functionality while making sure that no backtracking occurs. I added a plus sign after the second part of the URL regular expression above, like this:

^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://)(([A-Za-z0-9-~]+).)++([A-Za-z0-9-~\\/])+$

(The added + is the second of the two plus signs that follow part two.)

After this change, the original program returns immediately. It prints no match!!, because the URL still contains _ and %, which the third part does not accept.
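The following sketch runs the possessive version against the problem URL and, for comparison, against a short URL. The short URL and the class name PossessiveUrlCheck are assumptions added for illustration. The results follow from tracing the two patterns by hand, so treat them as something to verify on your own JDK rather than as a guarantee.

```java
import java.util.regex.Pattern;

class PossessiveUrlCheck {

    // Original expression: part two uses a greedy +.
    static final Pattern GREEDY = Pattern.compile(
            "^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://)(([A-Za-z0-9-~]+).)+([A-Za-z0-9-~\\\\/])+$");

    // Modified expression: part two uses a possessive ++.
    static final Pattern POSSESSIVE = Pattern.compile(
            "^([hH][tT]{2}[pP]://|[hH][tT]{2}[pP][sS]://)(([A-Za-z0-9-~]+).)++([A-Za-z0-9-~\\\\/])+$");

    public static void main(String[] args) {
        String bugUrl = "http://www.fapiao.com/dddp-web/pdf/download?request=6e7jgxxxxx4ild-kexxxxxxxqj4-chlmqvnenxc692m74h38sdfdsazxcumfcoh2fafy1vw__%5edadifjgief";
        // Returns quickly, because part two never gives characters back.
        System.out.println(POSSESSIVE.matcher(bugUrl).matches()); // false

        // Hypothetical short URL, used only to compare the two versions.
        String shortUrl = "http://www.fapiao.com/index";
        System.out.println(GREEDY.matcher(shortUrl).matches());     // true
        System.out.println(POSSESSIVE.matcher(shortUrl).matches()); // false
    }
}
```

Note the last two lines. In part two the unescaped dot can consume any character, so the possessive group may swallow the end of the URL and leave nothing for part three. The possessive version therefore avoids the CPU problem but does not accept exactly the same URLs as the greedy one. It is worth testing it against known-good URLs before relying on it.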

Finally, I recommend a website that checks whether a regular expression will run into problems when matched against a given string.

Online regex tester and debugger: PHP, PCRE, Python, Golang and JavaScript

For example, checking the problematic URL from this article on the site produces the message: catastrophic backtracking.

When you click "regex debugger" in the lower-left corner, it tells you how many steps the match took, lists all of them, and indicates where backtracking occurred.

The regular expression in this article was stopped automatically after 110,000 steps. This shows that the regular expression does have a problem and needs to be improved.

But when I test with our modified regular expression, shown below:

^([hH][tT]{2}[pP]:\/\/|[hH][tT]{2}[pP][sS]:\/\/)(([A-Za-z0-9-~]+).)++([A-Za-z0-9-~\\\/])+$

The tool reports that the check finished in only 58 steps.

A single character makes the difference between 58 steps and more than 110,000, the point at which the tool gave up.

It is surprising that one small regular expression can bring the CPU to its knees. It is also a reminder for our everyday coding: when writing regular expressions, pay attention to greedy mode and backtracking, otherwise every expression we write is a land mine.

This article is an English version of an article originally published in Chinese on aliyun.com.
