Regular expressions from beginner to proficient: worth rereading every time you use them

Last Update:2017-01-08 Source: Internet

Author: User

A regular expression (regex) is a string, written in the regular expression language, that is used to match and manipulate text. Regular expressions are used mainly to search for and replace text.

This article is only a brief summary of "Regular Expressions Must Know" and the well-known "Learn Regular Expressions in 30 Minutes" tutorial. It therefore does not introduce regular expressions from scratch; it only records some key points.

1. Some metacharacters

Metacharacter: meaning

. Any character except a line break. To match a literal period, escape it as \.

\d Any digit (equivalent to [0-9])

\D Any non-digit character (equivalent to [^0-9])

\w Any alphanumeric character or underscore (equivalent to [a-zA-Z0-9_]); think of it as the characters allowed in a variable name

\W Any character that is neither alphanumeric nor an underscore (equivalent to [^a-zA-Z0-9_])

\s Any whitespace character (equivalent to [ \f\n\r\t\v])

\S Any non-whitespace character (equivalent to [^ \f\n\r\t\v])

\b A word boundary, that is, a position where a word begins or ends

\B Any position that is not a word boundary

^ Placed at the front of the regular expression, it anchors the match to the start of the text

$ Placed at the end of the regular expression, it anchors the match to the end of the text
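
The metacharacters in the list above can be tried out with a minimal sketch like the following. The class name MetaCharDemo and the helpers fullMatch and firstMatch are made up for illustration; only java.util.regex.Pattern and Matcher come from the standard library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaCharDemo {
    // True when the regex matches the entire input.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // First substring matched by the regex, or null when nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("\\d\\d", "42"));         // \d: two digits
        System.out.println(fullMatch("\\D+", "abc"));          // \D: non-digits only
        System.out.println(fullMatch("\\w+", "user_name1"));   // \w: letters, digits, underscore
        System.out.println(fullMatch("a.b", "axb"));           // . matches any character
        System.out.println(fullMatch("a\\.b", "axb"));         // \. matches only a literal period
        System.out.println(firstMatch("\\bcat\\b", "concat cat")); // whole word "cat" only
        System.out.println(firstMatch("^\\s*\\S+", "  hi there")); // leading whitespace plus first word
        System.out.println(firstMatch("\\d+$", "a1 b22"));     // digits at the very end
    }
}
```

Note that every backslash is doubled because the patterns are written as Java string literals.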

2. Match a set of characters

Square brackets [] match any one of several characters. For example, [NC]BA matches NBA and also matches CBA.

A character class can also describe a range, written with a hyphen (-): [0-9a-zA-Z] and class[a-d] are both valid. The first and last characters of a range can be any characters in the ASCII table, but in practice the most commonly used ranges are digits and letters.

A character class can also be negated by placing ^ at the front. For example, [^0-5] matches any character other than 0 through 5 (among the digits, that means 6, 7, 8 and 9). Note that ^ negates every character and range in the class, not just the character or range that immediately follows it. So when ^ is the first character of a class, it means negation, and to match a literal ^ there you must escape it. When ^ is not the first character, no escape is needed.

Example 1

The following example is intended to match any string made up of the characters a and ^ (don't ask why; strange requirements do exist).

    import java.util.regex.Pattern;

    String s = "a^";
    // [^a] is a negated class: any character except 'a'
    Pattern pattern = Pattern.compile("[^a]*");
    System.out.println(pattern.matcher(s).matches());

This expression clearly means "any characters other than a", so it returns false.

The correct way is to escape the ^, or to put it after the a.

    import java.util.regex.Pattern;

    String s = "a^";
    // String regex = "[a^]*";   // alternative: ^ is not first, so it is literal
    String regex = "[\\^a]*";    // escaped ^ is literal even in first position
    Pattern pattern = Pattern.compile(regex);
    System.out.println(pattern.matcher(s).matches());

The hyphen (-) that denotes a range behaves similarly: when it has no character before it or after it inside the class, that is, when it is the first or last character, it does not need to be escaped.
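
As a rough illustration of the rules above for ^ and - inside a character class, here is a small sketch. CharClassDemo and its matches helper are hypothetical names used only for this example.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // True when the regex matches the entire input.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("[NC]BA", "CBA"));    // one of several characters
        System.out.println(matches("[^0-5]", "7"));      // negated range: a digit outside 0-5
        System.out.println(matches("[^0-5]", "x"));      // negation also admits non-digits
        System.out.println(matches("[^0-5]", "3"));      // 3 is inside the excluded range
        System.out.println(matches("[a^]*", "a^"));      // ^ not first: literal
        System.out.println(matches("[\\^a]*", "a^"));    // ^ first but escaped: literal
        System.out.println(matches("[a-]+", "a-a"));     // - last: literal
        System.out.println(matches("[-a]+", "-a"));      // - first: literal
        System.out.println(matches("[a\\-z]+", "b"));    // escaped -: no range, so b is rejected
    }
}
```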

3. Repetition

Character: meaning

? Matches zero or one occurrence (the ternary operator is a handy mnemonic)

+ Matches one or more occurrences

* Matches zero or more occurrences

{n} Matches exactly n times

{n,} Matches at least n times

{n,m} Matches at least n times and at most m times
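
A quick way to check the quantifiers listed above is a sketch along these lines; QuantifierDemo and its matches helper are illustrative names, not part of the original article.

```java
import java.util.regex.Pattern;

class QuantifierDemo {
    // True when the regex matches the entire input.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("colou?r", "color"));   // ? : the u is optional
        System.out.println(matches("colou?r", "colour"));
        System.out.println(matches("\\d+", ""));           // + : needs at least one digit
        System.out.println(matches("\\d*", ""));           // * : zero digits is acceptable
        System.out.println(matches("\\d{4}", "2017"));     // {n} : exactly four digits
        System.out.println(matches("\\d{2,}", "7"));       // {n,} : at least two digits
        System.out.println(matches("\\d{2,4}", "12345"));  // {n,m} : five digits is too many
    }
}
```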

However, these are not the main point. The main point is how to prevent over-matching.

Take a look at the following example.

Example 2

Suppose you have an HTML file and want to find the <p> tags in it. You might write <p>.*</p>. However, the result is wrong.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String s = "<p>hello</p><p>world</p>";
    String regex = "<p>.*</p>";   // greedy: .* runs to the last </p>
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(s);

    while (matcher.find()) {
        System.out.println(matcher.group());
    }

The loop prints a single match, the entire string <p>hello</p><p>world</p>, whereas the intent was to match the two tags separately, that is, to get two matches. This is over-matching.

Why does this happen? Because * and + are so-called "greedy" metacharacters: they match as much as possible, extending the match to the last place where the pattern can still succeed rather than stopping at the first place where it could succeed.

The fix is to use the lazy versions of these metacharacters.

Character: meaning

*? Matches zero or more occurrences, but as few as possible

+? Matches one or more occurrences, but as few as possible

?? Matches zero or one occurrence, but as few as possible

{n,}? Matches n or more occurrences, but as few as possible

{n,m}? Matches at least n and at most m occurrences, but as few as possible

So the example above only needs .* changed to .*?.
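
The difference between the greedy and the lazy form can be seen side by side in a short sketch. The class LazyDemo, the helper findAll and the sample HTML string are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyDemo {
    // Collects every non-overlapping match of the regex in the input.
    static List<String> findAll(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> matches = new ArrayList<>();
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String html = "<p>one</p><p>two</p>"; // hypothetical sample input

        // Greedy: a single match spanning both paragraphs.
        System.out.println(findAll("<p>.*</p>", html));

        // Lazy: one match per paragraph.
        System.out.println(findAll("<p>.*?</p>", html));
    }
}
```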

4. Subexpressions (groups)

The purpose of dividing an expression into a series of subexpressions is to treat each subexpression as an independent element. You can think of it as factoring out the repeated part.

Another major use of grouping is the backreference. A backreference lets the later part of a pattern refer to a subexpression defined in the earlier part of the same pattern.

Example 3

Suppose an HTML file contains heading tags from h1 to h6 and you want to match only the correct ones. "Correct" here means that the closing tag has the same name as the opening tag.

The text to be matched is as follows:

SaySomething.

Based on what has been covered so far, it is easy to write the following code:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String s = "...";   // the HTML text to be matched
    String regex = "<[hH][1-6]>.*?</[hH][1-6]>";
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(s);
    while (matcher.find()) {
        System.out.println(matcher.group());
    }

The result, however, also includes the last, illegal tag. Why is it illegal? Because a closing tag must match the name of its opening tag: once the pattern has matched h1, the closing tag may only be h1 and nothing else. In other words, the later part of the pattern needs to reuse the subexpression matched in the earlier part.

Changing the regular expression to <([hH][1-6])>.*?</\1> fixes this. Remember that in a Java string literal the backslash itself must be escaped, so \1 is written as "\\1".

Backreferences are usually numbered from 1 (\1, \2, and so on); in many implementations, number 0 refers to the match of the entire regular expression.
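
The effect of the backreference can be sketched as follows. The sample HTML is invented for this illustration (it is not the text used in the original article), and BackrefDemo and findAll are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackrefDemo {
    // Collects every non-overlapping match of the regex in the input.
    static List<String> findAll(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> matches = new ArrayList<>();
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Assumed sample: the last heading opens with h3 but closes with h4.
        String html = "<h1>Title</h1><h2>Subtitle</h2><h3>Broken</h4>";

        // Without a backreference, the mismatched pair is accepted too.
        System.out.println(findAll("<[hH][1-6]>.*?</[hH][1-6]>", html));

        // \1 must repeat exactly what group 1 captured, so <h3>...</h4> is rejected.
        System.out.println(findAll("<([hH][1-6])>.*?</\\1>", html));
    }
}
```

One caveat of this sketch: \1 compares the captured text exactly, so an opening <H1> paired with a closing </h1> would not match unless case-insensitive matching is enabled.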

So the question is: how are nested groups numbered?

In general, groups are numbered by the position of their opening parenthesis, from left to right, so an outer group is numbered before the groups nested inside it. To me it feels a bit like a pre-order traversal of a tree.

Take a look at the following example.

Example 4

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String s = "ABCDEFHG";
    String regex = "(A)((B)(C))(D)((E(F(H)))(G))";
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(s);

    while (matcher.find()) {
        System.out.println(matcher.group());
        // group 0 is the whole match; numbered groups start at 1
        for (int i = 1; i <= matcher.groupCount(); i++) {
            System.out.println("group" + i + ":" + matcher.group(i));
        }
    }

Does this regular expression look scary?

Run it and work through the result yourself.
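
For readers who want to check their analysis, the numbering can be reproduced with a small sketch. NestedGroupDemo and its groups helper are illustrative names; the expected values in the comments follow from counting opening parentheses from left to right.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestedGroupDemo {
    // Returns group 0..n of a full match; index i holds the text of group i.
    static List<String> groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> result = new ArrayList<>();
        if (m.matches()) {
            for (int i = 0; i <= m.groupCount(); i++) {
                result.add(m.group(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> g = groups("(A)((B)(C))(D)((E(F(H)))(G))", "ABCDEFHG");
        for (int i = 1; i < g.size(); i++) {
            System.out.println("group" + i + ":" + g.get(i));
        }
        // group1:A   group2:BC   group3:B    group4:C   group5:D
        // group6:EFHG  group7:EFH  group8:FH  group9:H  group10:G
    }
}
```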

Backreferences can also be used in replace operations.

However, when a group is repeated by a quantifier, only the last occurrence of the group can be referenced. I do not know the reason; explanations from experts are welcome. The following example runs into this situation in a replace operation.

Example 5

The task is to replace the first part of an IP address.

It is easy to write a regular expression such as ((\d|[1-9]\d|1\d{2}|2[0-4]\d|25[0-5])\.){3}(\d|[1-9]\d|1\d{2}|2[0-4]\d|25[0-5]). Following the analysis above, the replacement would then be written like this:

    String s = "192.168.56.1";
    String regex = "((\\d|[1-9]\\d|1\\d{2}|2[0-4]\\d|25[0-5])\\.){3}(\\d|[1-9]\\d|1\\d{2}|2[0-4]\\d|25[0-5])";

    System.out.println(s.replaceAll(regex, "100.$3$5$7"));

But this does not work. The pattern defines only three groups, so $5 and $7 refer to groups that do not exist, and Java rejects the replacement with an IndexOutOfBoundsException.

Let's print out what each group matched.

The output shows that the first three parts of the address were all matched by the same repeated group, and only its last occurrence was kept.

The example has to be changed as follows, with each part written out as its own group, to run correctly.

    String s = "192.168.56.1";
    String regex = "((\\d|[1-9]\\d|1\\d{2}|2[0-4]\\d|25[0-5])\\.)((\\d|[1-9]\\d|1\\d{2}|2[0-4]\\d|25[0-5])\\.)((\\d|[1-9]\\d|1\\d{2}|2[0-4]\\d|25[0-5])\\.)(\\d|[1-9]\\d|1\\d{2}|2[0-4]\\d|25[0-5])";

    // groups 3 and 5 are "168." and "56."; group 7 is the last part, "1"
    System.out.println(s.replaceAll(regex, "100.$3$5$7"));
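
The "last occurrence only" behaviour of a repeated group can be observed directly with a sketch like the one below. It deliberately uses a simplified octet pattern, \d{1,3}, which does not validate the 0 to 255 range; RepeatedGroupDemo, groups and replaceFirstOctet are hypothetical names. A likely explanation, stated here as an assumption rather than something the original article confirms, is that each capturing group has a single slot that is overwritten on every iteration of the quantifier.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupDemo {
    // Returns group 0..n of a full match; index i holds the text of group i.
    static List<String> groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> result = new ArrayList<>();
        if (m.matches()) {
            for (int i = 0; i <= m.groupCount(); i++) {
                result.add(m.group(i));
            }
        }
        return result;
    }

    // Alternative sketch: capture the remaining three parts in ONE group
    // (the outer parentheses) and keep them, replacing only the first part.
    static String replaceFirstOctet(String ip, String newOctet) {
        return ip.replaceAll("^\\d{1,3}((\\.\\d{1,3}){3})$",
                Matcher.quoteReplacement(newOctet) + "$1");
    }

    public static void main(String[] args) {
        // Simplified pattern: groups 1 and 2 sit inside the {3} repetition.
        System.out.println(groups("((\\d{1,3})\\.){3}(\\d{1,3})", "192.168.56.1"));
        // group 1 and group 2 hold only the third iteration ("56." and "56").

        System.out.println(replaceFirstOctet("192.168.56.1", "100"));
    }
}
```

Wrapping the whole repetition in an outer group, as replaceFirstOctet does, is one way to avoid writing each part out by hand when the individual parts are not needed separately.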

Some languages support converting letters to uppercase or lowercase during replacement. Java does not.

Metacharacter: meaning

\l Converts the next character to lowercase

\u Converts the next character to uppercase

\L Converts all characters between \L and \E to lowercase

\U Converts all characters between \U and \E to uppercase

\E Ends a \L or \U conversion
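
Since Java's replacement strings do not understand these conversion metacharacters, a similar effect has to be coded by hand. The following is only one possible sketch, built on Matcher.appendReplacement and appendTail; the class CaseConvertDemo, the method upperCaseHeadings and the sample input are assumptions for illustration.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaseConvertDemo {
    // Uppercases the text between <h1> and </h1>, leaving the tags untouched.
    static String upperCaseHeadings(String input) {
        Matcher m = Pattern.compile("(<h1>)(.*?)(</h1>)").matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replaced = m.group(1)
                    + m.group(2).toUpperCase(Locale.ROOT)
                    + m.group(3);
            // quoteReplacement keeps any $ or \ in the text from being
            // interpreted as a group reference.
            m.appendReplacement(out, Matcher.quoteReplacement(replaced));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(upperCaseHeadings("<h1>hello</h1> and <h1>world</h1>"));
    }
}
```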

5. Lookahead and lookbehind

The last topic is lookahead and lookbehind. Such a pattern is not returned as part of the match; it is used only to pin down the position where a correct match must occur.

Typical scenarios are extracting the content of a tag from an HTML file, or extracting the content that comes before or after some fixed format.

Any expression can be turned into a lookahead by giving it the ?= prefix, written as (?=...). Similarly, a lookbehind uses the ?<= prefix, written as (?<=...).

Note that a lookahead expression is placed after the pattern, while a lookbehind expression is placed in front of the pattern. Do not mix them up.

Example 6

Suppose there is a list of prices like this:

$5.00,$6.00,$7.00,

To extract the prices and add them up, the code is as follows:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String s = "$5.00,$6.00,$7.00,";
    // (?<=\\$) requires a '$' just before the match; (?=,) requires a ',' just after it
    String regex = "(?<=\\$).*?(?=,)";

    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(s);

    double total = 0;
    while (matcher.find()) {
        System.out.println(matcher.group());
        total += Double.parseDouble(matcher.group());
    }

    System.out.println("total:" + total);

The program prints 5.00, 6.00 and 7.00, one per line, followed by total:18.0.

Finally, a lookahead (or lookbehind) does in fact produce a match result of its own, but the length of that result is always 0. This is why it is sometimes called a zero-width match, or a zero-width assertion.
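
The zero-width property can be checked with a sketch like this one. ZeroWidthDemo, firstMatchLength and prices are made-up names, and the price pattern [0-9.]+ is an assumed variant chosen for clarity rather than the pattern used in the example above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZeroWidthDemo {
    // Length of the first match, or -1 when nothing matches.
    static int firstMatchLength(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.end() - m.start() : -1;
    }

    // Extracts the numbers that sit between a '$' and a ','.
    static List<String> prices(String input) {
        Matcher m = Pattern.compile("(?<=\\$)[0-9.]+(?=,)").matcher(input);
        List<String> result = new ArrayList<>();
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // A lookaround on its own matches a position, not characters.
        System.out.println(firstMatchLength("(?=,)", "$5.00,"));    // 0
        System.out.println(firstMatchLength("(?<=\\$)", "$5.00,")); // 0

        // The '$' and ',' are required but are not part of the result.
        System.out.println(prices("$5.00,$6.00,$7.00,"));
    }
}
```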

This article is an English version of an article originally published in Chinese on aliyun.com.
