Security Bulletin: Regular expression denial of service attacks and defenses


In the November 2009 issue, I wrote an article titled "XML Denial of Service Attacks and Defenses" (msdn.microsoft.com/magazine/ee335713), in which I described some denial of service (DoS) attack techniques that are particularly effective against XML parsers. I received a lot of email from readers about that article, all of them wanting to know more, which tells me that people already understand how serious DoS attacks can be.

I believe that in the next four to five years, as privilege escalation attacks become harder to execute because of the growing adoption of memory protections such as Data Execution Prevention (DEP), Address Space Layout Randomization (ASLR), and isolation and privilege reduction techniques, attackers will shift their focus to DoS extortion attacks. Developers can continue to protect their applications by staying ahead of this shift in attack trends and addressing likely future DoS vectors today.

Regular expression DoS is one of these likely future DoS vectors. At the Open Web Application Security Project (OWASP) Israel Conference 2009, Checkmarx Chief Architect Alex Roichman and Senior Programmer Adar Weidman presented thorough research on regular expression DoS (also known as "ReDoS"). Their research shows that a carelessly written regular expression can be exploited so that a relatively short attack string (fewer than 50 characters) takes hours or longer to evaluate. In the worst case, the processing time is exponential in the number of characters in the input string, which means that adding a single character to the string doubles the processing time.
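To make the doubling claim concrete: if every added character doubles the work, then n extra characters multiply it by 2 to the power n. The sketch below is plain arithmetic in Python, not a measurement of any real engine; the function name relative_work is an illustrative assumption.

```python
def relative_work(extra_chars):
    """Work relative to the starting input when each added character doubles it."""
    return 2 ** extra_chars

# Show how quickly the multiplier grows as the attack string gets longer.
for n in (10, 20, 30, 40, 50):
    print(n, relative_work(n))
```

Even if a single evaluation step is very cheap, a multiplier of this size is why an attack string of a few dozen characters can be enough.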

In this article, I'll describe what makes a regular expression vulnerable to these attacks. I will also provide code for a regular expression fuzzer, a test utility designed to identify vulnerable regular expressions. It evaluates a regular expression against thousands of random inputs and flags any input that takes an unacceptably long time to process.

(Note: In this article, I assume that you are familiar with the syntax of regular expressions. If you are not, you may want to read the article ".NET Framework Regular Expressions" (msdn.microsoft.com/library/hs600312) to fill in the background. For a deeper study, see Jeffrey Friedl's reference book, "Mastering Regular Expressions, 3rd Edition" [O'Reilly, 2006].)

Backtracking: The root of the problem

Essentially, there are two different types of regular expression engines: deterministic finite automaton (DFA) engines and nondeterministic finite automaton (NFA) engines. A complete analysis of the differences between these two engine types is beyond the scope of this article; we only need to focus on the following two facts:

NFA engines are backtracking engines. Unlike DFA engines, which evaluate each character in the input string at most once, NFA engines can evaluate each character in the input string multiple times. (I'll explain how this backtracking algorithm works later.) The backtracking approach has advantages, because these engines can handle more complex regular expressions, such as those that contain back-references or capturing parentheses. It also has drawbacks, because the processing time can greatly exceed that of a DFA.

The Microsoft .NET Framework System.Text.RegularExpressions classes use an NFA engine.

An important side effect of backtracking is that although the engine can confirm a positive match fairly quickly (that is, the input string does match the given regular expression), confirming a negative match (the input string does not match the regular expression) can take much longer. The engine must confirm that none of the possible "paths" through the input string matches the regular expression, which means that all paths must be tested.

With a simple, non-grouping regular expression, the time spent confirming a negative match is not a big problem. For example, suppose the regular expression to match is:

^\d+$

This is a fairly simple regular expression that matches if the entire input string consists only of numeric characters. The ^ and $ characters represent the beginning and end of the string respectively, the expression \d represents a numeric character, and + indicates that one or more such characters will match. Let's test this expression using 123456X as an input string.
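As a quick check of this behavior, here is a minimal sketch in Python, whose standard re module is also a backtracking engine. It is only a stand-in for the .NET Regex class, and the helper name is_all_digits is an illustrative assumption rather than anything from the original article.

```python
import re

# Same pattern as above: one or more digits spanning the whole string.
DIGITS_ONLY = re.compile(r"^\d+$")

def is_all_digits(text):
    """Return True when the pattern matches the input string."""
    return DIGITS_ONLY.match(text) is not None

print(is_all_digits("123456"))   # a positive match
print(is_all_digits("123456X"))  # a negative match: X is not a digit
```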

This input string is obviously not a match, because X is not a numeric character. But how many paths does the sample regular expression have to evaluate to reach this conclusion? The engine starts at the beginning of the string and finds that the character 1 is a valid numeric character that matches the regular expression. It then moves on to the character 2, which also matches. At this point, the regular expression has matched the string 12. Next it tries 3 (matching 123), and so on until it reaches X, which does not match.

However, because our engine is a backtracking NFA engine, it does not stop at this point. Instead, it backs up from its current match (123456) to its last known good match (12345) and tries again from there. Because the next character after 5 is not the end of the string, the regular expression does not match, so the engine backs up to its previous known good match (1234) and tries again. This continues until the engine gets all the way back to its first match (1) and finds that the character after 1 is not the end of the string. At this point the engine gives up; no match has been found.

Overall, the engine evaluates six paths: 123456, 12345, 1234, 123, 12, and 1. If the input string were one character longer, the engine would evaluate one more path. This regular expression is therefore a linear algorithm relative to the length of the string, and there is no risk of it causing a DoS. A System.Text.RegularExpressions.Regex object that uses ^\d+$ as its pattern is fast enough to evaluate even very large input strings (more than 10,000 characters) almost instantly.
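The walk-through above can be sketched as a small Python function. This is a simplified model of the backtracking described in the text, written only for the pattern ^\d+$; it is not how the .NET engine is actually implemented, and the name backtracking_paths is an illustrative assumption.

```python
def backtracking_paths(text):
    """List the digit prefixes tried, in order (simplified model)."""
    # Greedy step: consume as many leading digits as possible.
    # str.isdigit() is used as an approximation of \d for ASCII input.
    end = 0
    while end < len(text) and text[end].isdigit():
        end += 1

    paths = []
    # Backtracking step: test for end-of-string after the longest prefix,
    # then give back one digit at a time until a match is found or only
    # one digit is left.
    for length in range(end, 0, -1):
        paths.append(text[:length])
        if length == len(text):  # end-of-string reached, so the match succeeds
            break
    return paths

print(backtracking_paths("123456X"))        # ['123456', '12345', '1234', '123', '12', '1']
print(len(backtracking_paths("1234567X")))  # 7
```

One extra digit in the input adds exactly one extra path, which is the linear growth described above.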

Now, let's change this regular expression to group the numeric characters:

^(\d+)$
