Java regular expressions: the StackOverflowError problem and its optimization

Last Update:2015-08-02 Source: Internet

Author: User

Tags expression engine


Regular expressions can be regarded as a DSL, and a very widely used one: they make short work of many string matching and filtering problems. At the same time, there is an old saying:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

Today we'll look at the StackOverflowError that Java regular expressions can throw, and at some ways to optimize them.

1. The problem

Recently a colleague found that a regular expression ran without any problem locally, but always threw a StackOverflowError on the Hadoop cluster.

Here is a simplified version of the code:

package java8test;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) {
        // Five single-character alternatives inside a repeated group.
        final String TEST_REGEX = "([=+]|[\\s]|[\\p{P}]|[a-zA-Z0-9]|[\u4e00-\u9fa5])+";
        StringBuilder line = new StringBuilder();
        System.out.println("++++++++++++++++++++++++++++++");
        for (int i = 0; i < 10; i++) {
            line.append("http://hh.ooxx.com/ershoufang/?pgtid=14366988648680=+.7342327926307917&clickid=1&key=%2525u7261%2525u4e39%2525u5bcc%2525u8d35%2525u82b1%2525u56ed&sourcetype=1_5");
            line.append("http://wiki.corp.com/index.php?title=Track%E6%A0%87%E5%87%86%e6%97%a5%e5%bf%97hive%e8%a1%a8-%e5%8d%b3%e6%b8%85%e6%b4%97%e5%90%8e%e7%9a%84%e6%97%a5%e5%bf%97");
            line.append("http://www.baidu.com/s?ie=UTF-8&wd=58%cd%ac%b3%c7%b6%fe%ca%d6%b3%b5%b2%e2%ca%d4%ca%fd%be%dd&tn=11000003_hao_dg");
            // The key/cmcskey values below appear to have been Chinese text that was garbled in this copy.
            line.append("http://csThe design fee for the ooxx.com/yewu/?key= City &cmcskey= starts low &final=1&jump=1&specialtype=gls");
            line.append("http%3A%2F%2fcq.ooxx.com%2fjob%2f%3fkey%3d%25e7%25bd%2591%25e4%25b8%258a%25e5%2585%25bc%25e8%2581%258c%26cmcskey%3d%25e7%25bd%2591%25e4%25b8%258a%25e5%2585%25bc%25e8%2581%258c%26final%3d1%26jump%3d2%26specialtype%3dgls%26canclequery%3disbiz%253d0%26sourcetype%3d4");
        }
        line.append("\001 11111111111111111111111111");
        Pattern p_a = null;
        try {
            p_a = Pattern.compile(TEST_REGEX);
            Matcher m_a = p_a.matcher(line);
            while (m_a.find()) {
                String a = m_a.group();
                System.out.println(a);
            }
        } catch (Exception e) {
            // TODO: handle exception (note: this does not catch StackOverflowError)
        }
        System.out.println("line size: " + line.length());
    }
}

The result after execution is:

++++++++++++++++++++++++++++++
Exception in thread "main" java.lang.StackOverflowError
    at java.util.regex.Pattern$Loop.match(Unknown Source)
    at java.util.regex.Pattern$GroupTail.match(Unknown Source)
    at java.util.regex.Pattern$BranchConn.match(Unknown Source)
    at java.util.regex.Pattern$CharProperty.match(Unknown Source)
    ...

The problem first showed up on the cluster. This error has two notable characteristics:

(1) It cannot be caught with catch (Exception e), because StackOverflowError is an Error rather than an Exception: both descend from Throwable, but on separate branches. If you do want to catch it, you have to catch Error (or StackOverflowError itself).

(2) The stack trace contains no line numbers and shows only frames inside java.util.regex. When the code sits in a utility class several hundred lines long, with dozens of similar regular expressions, that makes the failing expression hard to locate, so it pays to have unit tests that can exercise each expression in isolation.
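
Point (1) can be checked with a minimal sketch. The class and method names here are made up for illustration, and the overflow comes from plain unbounded recursion rather than from a regular expression; the catch behaviour is the same either way.

```java
public class CatchStackOverflowDemo {

    // Unbounded recursion: sooner or later the thread stack is exhausted.
    static int recurseForever(int depth) {
        return recurseForever(depth + 1) + 1;
    }

    // Reports which catch clause handled the failure.
    static String runAndClassify() {
        try {
            recurseForever(0);
            return "completed";
        } catch (Exception e) {
            // Not reached for StackOverflowError: it is an Error, not an Exception.
            return "caught as Exception";
        } catch (StackOverflowError e) {
            return "caught as StackOverflowError";
        }
    }

    public static void main(String[] args) {
        System.out.println(runAndClassify()); // prints "caught as StackOverflowError"
    }
}
```

This is also why the catch (Exception e) block in the test program above never sees the error.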

Note:

(1) If your environment does not throw the error above, try increasing the number of iterations of the for loop, or shrink the thread stack with a JVM parameter such as -Xss1k. (Some JVMs reject very small values and report the minimum they accept.)

(2) If you are not sure what a StackOverflowError means, see the earlier article: Introduction to the JVM Runtime data area

2. Problem analysis

Regular expression engines fall into two classes: DFA (deterministic finite automaton) and NFA (non-deterministic finite automaton). Both need a regular expression and a text string to work with. A DFA engine is text-directed: it walks the text and, for each character, keeps track of every part of the expression that could still match, updating that set of candidates as it goes. An NFA engine is expression-directed: it walks the expression over the text, consuming a character, comparing it with the expression, and moving on if it matches. When a comparison fails, it gives back the characters it has just consumed, one by one, until it returns to the last position where a match was still possible (backtracking).

The difference between the two mechanisms has five consequences:

1. A DFA scans each character of the text only once, so it is faster but supports fewer features. An NFA repeatedly consumes and gives back characters, so it is slower but richer in features, which is why it is so widely used: today's major regular expression engines, such as those of Perl and Ruby, Python's re module, and the regex libraries of Java and .NET, are all NFA engines.

2. Only an NFA supports features such as lazy quantifiers and backreferences.

3. An NFA is eager: the leftmost alternative that succeeds wins, so it occasionally misses the best match. A DFA instead returns the longest of the leftmost matches.

4. An NFA uses greedy quantifiers by default.

5. An NFA can fall into the trap of deep recursive calls, and then performs very poorly.
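
Points 2 to 4 can be observed directly with java.util.regex. The following is a small sketch; the class name, the helper firstMatch, and the sample strings are illustrative and not taken from the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NfaSemanticsDemo {

    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Eager alternation: alternatives are tried left to right, first success wins.
        System.out.println(firstMatch("a|ab", "ab"));       // a  (not the longer "ab")
        System.out.println(firstMatch("ab|a", "ab"));       // ab

        // Greedy by default; a trailing ? makes the quantifier lazy.
        System.out.println(firstMatch("<.+>", "<a><b>"));   // <a><b>
        System.out.println(firstMatch("<.+?>", "<a><b>"));  // <a>

        // Backreference: \1 must repeat whatever group 1 captured.
        System.out.println(firstMatch("(\\w+) \\1", "hello hello world")); // hello hello
    }
}
```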

Under the hood, Java executes a regular expression match through recursive calls. Each level of recursion takes up some of the thread's stack, and if the recursion goes deep enough a StackOverflowError is thrown. Using regular expressions is therefore always a trade-off.

In a Java program, each thread has its own stack space. This space is not allocated from the heap, so its size is not affected by -Xmx and -Xms; those two JVM parameters only control the size of the heap. The stack holds the frames that are pushed for each method call, so when recursion goes too deep the stack space can run out and a StackOverflowError is thrown. The size of the stack space depends on the OS, the JVM, and environment settings. The default is generally 512K, and larger on 64-bit systems. Generally speaking, 128K of stack space is sufficient, but you need to observe your own program: if it does not throw StackOverflowError, you can use -Xss to reduce the stack size to 128K (e.g. -Xss128k).
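
To see that the stack belongs to the thread rather than to the heap, one option is to request a stack size per thread through the four-argument Thread constructor and measure how deep a recursion gets. This is only a sketch: the class and method names are invented for illustration, the requested size is a hint that some platforms may round or ignore, and the measured depths vary between runs and JVMs.

```java
public class ThreadStackSizeDemo {

    // Recurses until the current thread's stack is exhausted; returns the depth reached.
    static int maxDepth() {
        int[] depth = {0};
        try {
            descend(depth);
        } catch (StackOverflowError e) {
            // Expected: the stack of this thread is full.
        }
        return depth[0];
    }

    static void descend(int[] depth) {
        depth[0]++;
        descend(depth);
    }

    // Runs maxDepth() on a new thread created with the requested stack size (a hint only).
    static int maxDepthWithStack(long stackBytes) {
        int[] result = new int[1];
        Thread t = new Thread(null, () -> result[0] = maxDepth(), "stack-demo", stackBytes);
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println("256 KB requested: depth " + maxDepthWithStack(256 * 1024));
        System.out.println("16 MB requested:  depth " + maxDepthWithStack(16L * 1024 * 1024));
    }
}
```

Where the hint is honoured, the second depth should come out far larger than the first, with -Xmx and -Xms left untouched.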

The problem at the start of this article can be understood simply: the method calls nest too deeply, the frames of the outer calls have not yet been released, and the stack runs out of space.

Next, we look at some ways to optimize regular expression performance and avoid this kind of deep recursion.

3. Optimizing Java regular expressions

3.1 Precompile expressions with Pattern.compile()

If you use the same regular expression more than once in your program, compile it once with Pattern.compile() instead of calling Pattern.matches() directly. Pattern.matches() compiles the expression on every call, so using it repeatedly with the same expression, for example in a loop, is expensive. Also remember that a Matcher object can be reused with different input strings by calling its reset() method.
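
A minimal sketch of the two styles follows. The pattern \d+ and the class and method names are placeholders chosen for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PrecompiledPatternDemo {

    // Compiled once and shared; Pattern instances are immutable and thread-safe.
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Slow variant: Pattern.matches() recompiles the expression on every call.
    static int countNumericSlow(List<String> inputs) {
        int count = 0;
        for (String s : inputs) {
            if (Pattern.matches("\\d+", s)) count++;
        }
        return count;
    }

    // Faster variant: one Pattern, one Matcher re-targeted with reset(CharSequence).
    // A Matcher is not thread-safe, so keep it local to the calling thread.
    static int countNumericFast(List<String> inputs) {
        Matcher m = DIGITS.matcher("");
        int count = 0;
        for (String s : inputs) {
            if (m.reset(s).matches()) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> inputs = Arrays.asList("123", "abc", "42");
        System.out.println(countNumericSlow(inputs)); // 2
        System.out.println(countNumericFast(inputs)); // 2
    }
}
```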

3.2 Beware of alternation

Regular expressions of the form "(X|Y|Z)" have a reputation for being slow, so take extra care with them. First, consider the order of the alternatives: put the most common ones first so that they are matched sooner. Second, try to factor out common prefixes, for example by replacing "(ABCD|ABEF)" with "AB(CD|EF)". The latter is faster because the NFA tries to match AB first and does not try any alternative if AB is not found. (This example has only two alternatives; with many, the speed-up is significant.) Alternation really does slow a program down. In my test, the expression ".*(ABCD|EFGH|IJKL).*" was three times slower than calling String.indexOf() three times, once for each alternative.
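
The two rewrites described above can be sketched as follows. The class and method names are illustrative. No timings are claimed here; the sketch only shows that the alternatives give the same answers on single-line input.

```java
import java.util.regex.Pattern;

public class AlternationDemo {

    static final Pattern UNFACTORED = Pattern.compile("(ABCD|ABEF)");
    // Common prefix AB factored out of the alternation.
    static final Pattern FACTORED = Pattern.compile("AB(CD|EF)");

    static final Pattern ANY_KEYWORD = Pattern.compile(".*(ABCD|EFGH|IJKL).*");

    static boolean containsKeywordRegex(String s) {
        return ANY_KEYWORD.matcher(s).matches();
    }

    // Same test without a regex. Equivalent for single-line input only,
    // because "." does not match line terminators by default.
    static boolean containsKeywordIndexOf(String s) {
        return s.indexOf("ABCD") >= 0 || s.indexOf("EFGH") >= 0 || s.indexOf("IJKL") >= 0;
    }

    public static void main(String[] args) {
        String hit = "xxABEFyy";
        String miss = "xxABxxyy";
        System.out.println(UNFACTORED.matcher(hit).find() + " " + FACTORED.matcher(hit).find());   // true true
        System.out.println(UNFACTORED.matcher(miss).find() + " " + FACTORED.matcher(miss).find()); // false false
        System.out.println(containsKeywordRegex("..IJKL..") + " " + containsKeywordIndexOf("..IJKL..")); // true true
    }
}
```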

3.3 Reduce grouping and nesting

If you do not actually need the text captured by a group, use a non-capturing group: for example, "(?:X)" instead of "(X)".
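
As a small illustration (the date-like pattern and the class name are made up for this example), both forms match the same text, but only the capturing form records groups:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NonCapturingGroupDemo {

    static final Pattern CAPTURING = Pattern.compile("(\\d{4})-(\\d{2})");
    static final Pattern NON_CAPTURING = Pattern.compile("(?:\\d{4})-(?:\\d{2})");

    public static void main(String[] args) {
        Matcher a = CAPTURING.matcher("2015-08");
        Matcher b = NON_CAPTURING.matcher("2015-08");
        System.out.println(a.matches() + " " + b.matches());       // true true
        System.out.println(a.groupCount() + " " + b.groupCount()); // 2 0
        System.out.println(a.group(1));                            // 2015
    }
}
```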

To summarize: reduce alternation, reduce nested capturing groups, and reduce greedy matching.

4. Solutions

4.1 Temporary workaround

Wrapping the call in try...catch or increasing -Xss treats the symptom rather than the cause, and is not recommended.

4.2 Optimization is the right way

4.2.1 Syntax-level optimization

Following the advice on alternation in section 3.2, we can merge the five alternatives into a single character class:

final String TEST_REGEX = "([=+\\s\\p{P}a-zA-Z0-9\u4e00-\u9fa5])+";

In testing, with the JVM parameters unchanged, the for loop could be raised to 1,000,000 iterations and run until an OutOfMemoryError occurred, without the stack overflow from the start of the article ever reappearing.
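
The difference between the two expressions can be reproduced with a sketch like the one below. The class name, the helper methods, and the synthetic input are assumptions made for this example; the two patterns are the ones discussed in this article. With a default-sized stack, the alternation form is expected to overflow on an input this long, while the character class form is matched iteratively.

```java
import java.util.regex.Pattern;

public class CharClassVsAlternationDemo {

    // Original form: a repeated group containing an alternation.
    static final Pattern ALTERNATION =
            Pattern.compile("([=+]|[\\s]|[\\p{P}]|[a-zA-Z0-9]|[\u4e00-\u9fa5])+");

    // Optimized form: one character class, no alternation.
    static final Pattern CHAR_CLASS =
            Pattern.compile("([=+\\s\\p{P}a-zA-Z0-9\u4e00-\u9fa5])+");

    static String repeat(String unit, int times) {
        StringBuilder sb = new StringBuilder(unit.length() * times);
        for (int i = 0; i < times; i++) {
            sb.append(unit);
        }
        return sb.toString();
    }

    // Returns true if searching the input with the pattern exhausts the stack.
    static boolean overflows(Pattern pattern, CharSequence input) {
        try {
            pattern.matcher(input).find();
            return false;
        } catch (StackOverflowError e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // 1,200,000 characters, every one of which is accepted by both patterns.
        String input = repeat("key=value+1&", 100_000);
        System.out.println("alternation overflows: " + overflows(ALTERNATION, input));
        System.out.println("char class overflows:  " + overflows(CHAR_CLASS, input));
    }
}
```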

4.2.2 Business-logic-level optimization

Since I do not know the business scenario of the original code, I will not attempt any business-level optimization here. The general principle: when a regular expression becomes too complex, consider splitting the logic into steps, or handling only part of it with a regular expression. Treating regular expressions as a universal tool may cost more than it gains.
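
As one hypothetical way of avoiding the regular expression altogether, the character class from 4.2.1 could be replaced by a hand-written scan. This is only a sketch under assumptions: the class and method names are invented, it handles characters in the Basic Multilingual Plane only, and it is not taken from the original code.

```java
import java.util.ArrayList;
import java.util.List;

public class HandWrittenScannerDemo {

    // Mirrors the class [=+\s\p{P}a-zA-Z0-9\u4e00-\u9fa5] for BMP characters.
    static boolean isAllowed(char c) {
        if (c == '=' || c == '+') return true;
        // Default-mode regex whitespace: space, tab, newline, vertical tab, form feed, CR.
        if (c == ' ' || c == '\t' || c == '\n' || c == 0x0B || c == '\f' || c == '\r') return true;
        if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9')) return true;
        if (c >= 0x4E00 && c <= 0x9FA5) return true;
        switch (Character.getType(c)) { // the Unicode punctuation categories
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    // Collects maximal runs of allowed characters with a plain loop: no recursion.
    static List<String> findRuns(String input) {
        List<String> runs = new ArrayList<>();
        int start = -1;
        for (int i = 0; i < input.length(); i++) {
            if (isAllowed(input.charAt(i))) {
                if (start < 0) start = i;
            } else if (start >= 0) {
                runs.add(input.substring(start, i));
                start = -1;
            }
        }
        if (start >= 0) runs.add(input.substring(start));
        return runs;
    }

    public static void main(String[] args) {
        System.out.println(findRuns("http://a.b/?k=v+1\001 111$x"));
    }
}
```

The loop uses constant stack space regardless of input length, at the price of code that is longer and must be kept in sync with the intended character set by hand.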

Summary: for finding and matching strings, regular expressions can seem almost omnipotent, but in many scenarios their cost should not be underestimated. How to write efficient, maintainable regular expressions, and when to avoid them altogether, deserves careful thought.
