Some basic uses of regular expressions in Java

Last Update:2016-06-17 Source: Internet

Author: User

Tags character classes posix


This is my first technical blog post; I hope it helps those who need it. If you reprint it, please credit the source.

In my previous work I had to process many strings. I am writing this down now so that I do not forget it later.

I first learned about regular expression processing by searching the Internet and reading a Chinese translation of the Java 1.6 API. Thanks to the team (organization) or individuals who translated that version, and to everyone who selflessly shares their knowledge.

The material below refers to the latest Java API provided by Oracle, which may differ somewhat from the 1.6 version I read. The API address is: http://docs.oracle.com/javase/8/docs/api/

One, overview

In Java, regular expression processing uses the Pattern and Matcher classes in the java.util.regex package. The most important part is actually writing the regular expression itself (hereinafter "regex"). After writing a regex you can test it on this website, which I often use: http://tool.oschina.net/regex/

However, that site does not recognize some regular expressions, so it is best to test them yourself by writing a unit test in Android Studio (AS) or Eclipse.

Two, Pattern

The class inherits from java.lang.Object and implements the Serializable interface.

The class has two static methods for obtaining instances: compile(String regex) and compile(String regex, int flags). The flags parameter of the second method customizes the matching mode of the resulting instance. As an aside, I suggest that besides reading the API you also open Android Studio or Eclipse, find the corresponding class, and read its source code; learning this way is more convenient and intuitive.

Back to the point. The Field Summary for Pattern in the API lists 9 constants defined by the class, as follows:

CANON_EQ: Enables canonical equivalence. (Not enabled by default; when set, two characters match only if their full canonical decompositions match.)

CASE_INSENSITIVE: Enables case-insensitive matching. (By default only US-ASCII characters are compared case-insensitively.)

COMMENTS: Permits whitespace and comments in the pattern.

DOTALL: Enables dotall mode. (In this mode "." matches any character, including a line terminator.)

LITERAL: Enables literal parsing of the pattern. (Metacharacters and escape sequences lose their special meaning.)

MULTILINE: Enables multiline mode. ("^" and "$" match at the beginning and end of each line, not only of the whole input.)

UNICODE_CASE: Enables Unicode-aware case folding. (Use it together with CASE_INSENSITIVE if your code uses case-insensitive matching on non-ASCII characters.)

UNICODE_CHARACTER_CLASS: Enables the Unicode version of predefined character classes and POSIX character classes.

UNIX_LINES: Enables Unix lines mode. (Only "\n" is recognized as a line terminator.)

When you obtain a Pattern instance with Pattern.compile(regex), no flags are specified, so 0 is passed by default. I will not elaborate on the rest. Before I read the source code and learned about these flags, I generally just used Pattern.compile(regex); even without multiline mode, a carefully written regex can still match across more than one line.

You can also call the flags() method to get the integer flag value of the current Pattern instance.
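The points above about compile and flags() can be sketched as follows. This is only an illustration: the class name PatternFlagsDemo and the helper matchesWith are made up for the example and are not part of the Java API.

```java
import java.util.regex.Pattern;

class PatternFlagsDemo {
    // Hypothetical helper: compiles the regex with the given flags and
    // checks whether the whole input matches.
    static boolean matchesWith(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // compile(regex) passes no flags, so flags() reports 0.
        System.out.println(Pattern.compile("java").flags()); // 0

        // Flags are bit masks, so several can be combined with |.
        int flags = Pattern.CASE_INSENSITIVE | Pattern.MULTILINE;
        System.out.println(Pattern.compile("java", flags).flags() == flags); // true

        System.out.println(matchesWith("java", 0, "JAVA"));     // false
        System.out.println(matchesWith("java", flags, "JAVA")); // true
    }
}
```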

If you only need to determine whether the whole input matches, you can call Pattern.matches(regex, input). It returns true if the match succeeds.
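One detail worth seeing in code: Pattern.matches requires the entire input to match, while Matcher.find (covered in the next section) only needs a matching part. A small sketch, with made-up class and helper names:

```java
import java.util.regex.Pattern;

class MatchesVsFindDemo {
    // True only when the entire input matches the regex.
    static boolean wholeInputMatches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // True when any part of the input matches the regex.
    static boolean containsMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(wholeInputMatches("ab\\d", "ab1"));     // true
        System.out.println(wholeInputMatches("ab\\d", "whatab1")); // false
        System.out.println(containsMatch("ab\\d", "whatab1"));     // true
    }
}
```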

Three, Matcher

Given a Pattern instance named pattern, call pattern.matcher(CharSequence input). The input argument must implement the CharSequence interface; it is usually a String.

Matcher has many methods; I only note the few that I have used.

boolean find(): tries to find the next subsequence of the input that matches the regex.

String group(): returns the input subsequence matched by the previous match.

String group(int): returns what the given capturing group captured during the previous match.

int groupCount(): returns the number of capturing groups in the pattern.

After find() succeeds, something like a pointer moves to just past the content that was matched, and the next find() continues from there. group() returns the matched content that precedes the pointer. We can write a unit test to verify this.

The input is whatab1whatab2whatab3whatab4ss and the regex is ab[\d], which occurs several times in the input.

Here is the method:

    // Part of a Utils class; needs java.util.regex.Matcher and java.util.regex.Pattern.
    static String method(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        StringBuilder builder = new StringBuilder();
        int i = 0;
        // Each successful find() moves on to the next match in the input.
        while (matcher.find()) {
            builder.append("This is match ").append(++i)
                   .append(": captured \"").append(matcher.group()).append("\".\n");
        }
        return builder.toString();
    }

Call the method with these arguments:

Utils.method("ab[\\d]", "whatab1whatab2whatab3whatab4ss")

The output is:

This is match 1: captured "ab1".
This is match 2: captured "ab2".
This is match 3: captured "ab3".
This is match 4: captured "ab4".

To use group(int) with an index other than 0, the regex must contain capturing groups, that is, "()".

Using the same input, change the regex to .*?(ab([\d])).*?

In the method of Utils, we append group(), group(0), group(1) and group(2) in turn, and print the results as follows.

group() print results

This is match 1: captured "whatab1".
This is match 2: captured "whatab2".
This is match 3: captured "whatab3".
This is match 4: captured "whatab4".

group(0) print results

This is match 1: captured "whatab1".
This is match 2: captured "whatab2".
This is match 3: captured "whatab3".
This is match 4: captured "whatab4".

group(1) print results

This is match 1: captured "ab1".
This is match 2: captured "ab2".
This is match 3: captured "ab3".
This is match 4: captured "ab4".

group(2) print results

This is match 1: captured "1".
This is match 2: captured "2".
This is match 3: captured "3".
This is match 4: captured "4".

From the printed results we can see that group() and group(0) return the same content. What is the difference between group(0) and group(1) when the entire regex is enclosed in "()"?

Change the regex to (.*?(ab([\d])).*?); the following are the printed results.

group(0) and group()

This is match 1: captured "whatab1".
This is match 2: captured "whatab2".
This is match 3: captured "whatab3".
This is match 4: captured "whatab4".

group(1)

This is match 1: captured "whatab1".
This is match 2: captured "whatab2".
This is match 3: captured "whatab3".
This is match 4: captured "whatab4".

group(2)

This is match 1: captured "ab1".
This is match 2: captured "ab2".
This is match 3: captured "ab3".
This is match 4: captured "ab4".

group(3)

This is match 1: captured "1".
This is match 2: captured "2".
This is match 3: captured "3".
This is match 4: captured "4".

If you want to capture the contents of "()", it is better to check the integer returned by groupCount() first, so that you do not request a capturing group that does not exist and cause an exception.
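A minimal sketch of that groupCount() check. The helper groupOrNull is hypothetical, written only for this example; without the check, asking for a group that the pattern does not have makes group(int) throw an IndexOutOfBoundsException.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SafeGroupDemo {
    // Hypothetical helper: returns group n of the first match, or null when
    // there is no match or the pattern has no capturing group with that number.
    static String groupOrNull(String regex, String input, int n) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        if (!matcher.find() || n < 0 || n > matcher.groupCount()) {
            return null;
        }
        return matcher.group(n);
    }

    public static void main(String[] args) {
        String regex = ".*?(ab([\\d])).*?"; // two capturing groups
        String input = "whatab1whatab2";
        System.out.println(groupOrNull(regex, input, 0)); // whatab1
        System.out.println(groupOrNull(regex, input, 1)); // ab1
        System.out.println(groupOrNull(regex, input, 2)); // 1
        System.out.println(groupOrNull(regex, input, 3)); // null
    }
}
```

Groups are numbered by their opening parenthesis from left to right, and group 0 (the whole match) is not included in groupCount().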

Four, regular expressions

To write a regex, you first need to understand what its symbols mean.

The following describes some commonly used symbols. Of course, most are not covered here; you can find them in the documentation of the java.util.regex.Pattern class in the Java API.

"()": capturing group, explained in the Matcher section.

"[]": character class (match range). Note that [ac] matches only a single a or c, not the string ac. For example, for the input abcac, [ac] produces four matches,

respectively: 1. a 2. c 3. a 4. c

"{}": number of occurrences, with several notations: {n,m} means at least n and at most m times, {n} means exactly n times, and {n,} means at least n times. {} is typically used together with [] to specify how many times a character in the range must occur. It is sometimes combined with "|" as well. For example, for the input xxabxxabxxacxxa1xxdav11, the regex [abvc]{2}|[a\d]{2} matches

ab
ab
ac
a1
av
11
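The character class and quantifier examples above can be checked with a small sketch like the following. The helper findAll is made up for the example; it simply collects every match that find() reports.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ClassAndQuantifierDemo {
    // Hypothetical helper: collects every match of regex in input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // [ac] matches one character at a time, never the string "ac".
        System.out.println(findAll("[ac]", "abcac")); // [a, c, a, c]

        // The example from the text: two alternatives, each exactly two characters.
        System.out.println(findAll("[abvc]{2}|[a\\d]{2}", "xxabxxabxxacxxa1xxdav11"));
        // [ab, ab, ac, a1, av, 11]

        // {n}, {n,} and {n,m} on runs of digits.
        System.out.println(findAll("\\d{2}", "1 12 1234"));   // [12, 12, 34]
        System.out.println(findAll("\\d{2,}", "1 12 1234"));  // [12, 1234]
        System.out.println(findAll("\\d{2,3}", "1 12 1234")); // [12, 123]
    }
}
```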

".": by default, matches any character except line terminators; spaces are matched too (as far as I understand).

"\d": equivalent to [0-9], any digit from 0 to 9.

"\D": the opposite of \d, any character that is not a digit.

"\w": a word character, [a-zA-Z_0-9].

"\W": a non-word character.

"\s": a whitespace character.

"\S": a non-whitespace character. When processing strings, if DOTALL mode is not enabled, you can replace "." with "[\\s\\S]" to match any character, including line terminators.

"^": inside "[]" it negates the class, for example "[^a]" means any character except a. Outside "[]", ^ represents the beginning of a line.

"$": the end of a line.
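How ".", "^" and "$" treat line terminators depends on the flags from the Pattern section. The sketch below illustrates this under the default line-terminator rules; the class and helper names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineModeDemo {
    // Hypothetical helper: does the whole input match under the given flags?
    static boolean matchesWhole(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    // Hypothetical helper: collects every match found under the given flags.
    static List<String> findAll(String regex, int flags, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex, flags).matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String twoLines = "one\ntwo";

        // "." does not match the line terminator unless DOTALL is set;
        // [\s\S] matches it in any mode.
        System.out.println(matchesWhole("one.two", 0, twoLines));              // false
        System.out.println(matchesWhole("one[\\s\\S]two", 0, twoLines));       // true
        System.out.println(matchesWhole("one.two", Pattern.DOTALL, twoLines)); // true

        // Without MULTILINE, ^ and $ refer to the whole input, not to each line.
        System.out.println(findAll("^\\w+$", 0, twoLines));                 // []
        System.out.println(findAll("^\\w+$", Pattern.MULTILINE, twoLines)); // [one, two]

        // ^ inside [] negates the class.
        System.out.println(findAll("[^a]", 0, "abca")); // [b, c]
    }
}
```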

————————————————————————————————————————————————————————

The following three symbols, as well as "{}", have three matching modes in regular expressions: greedy, reluctant, and possessive. They are distinguished below.

"X?": X is matched zero or one time. Note that when ? is added after another quantifier, that quantifier becomes non-greedy (reluctant).

"X+": X is matched one or more times.

"X*": X is matched zero or more times.

"X?" is the greedy mode, which matches as much content as possible. "X??" is the non-greedy (reluctant) mode, which matches the minimum content that satisfies the regex. "X?+" is a possessive quantifier; I have not used this mode and do not know how it behaves, so guidance is welcome.
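The three modes can be compared side by side. In the sketch below (class and helper names are made up), the possessive form behaves like the greedy one except that it never gives back what it has taken, which is why the last call finds nothing.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierModeDemo {
    // Hypothetical helper: returns the first match, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        // Greedy: b? takes the b when it can.
        System.out.println(firstMatch("ab?", "abb"));  // ab
        // Reluctant: b?? first tries to match nothing.
        System.out.println(firstMatch("ab??", "abb")); // a

        // Greedy b? gives the b back so that the final b can match.
        System.out.println(firstMatch("ab?b", "ab"));  // ab
        // Possessive b?+ keeps the b, so the final b has nothing left to match.
        System.out.println(firstMatch("ab?+b", "ab")); // null
    }
}
```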

——————————————————————————————————————————————————————————

"|": as in X|Y, matches X or Y.

"\Q" and "\E": as in \Q|.?\E, everything between them loses its regex meaning and is matched as ordinary characters. For example, for the input ab|.?ab, matching |.? normally requires \|\.\?, but with \Q|.?\E it can be matched directly without writing so many escape characters.
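A short sketch of the quoting example. It also uses Pattern.quote, a standard method not mentioned in the text above, which wraps a string in \Q and \E for you; the class and helper names are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteDemo {
    // Hypothetical helper: collects every match of regex in input.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "ab|.?ab";
        // Escaping each metacharacter by hand.
        System.out.println(findAll("\\|\\.\\?", input));        // [|.?]
        // Quoting the whole literal with \Q ... \E.
        System.out.println(findAll("\\Q|.?\\E", input));        // [|.?]
        // Letting Pattern.quote add the \Q ... \E.
        System.out.println(findAll(Pattern.quote("|.?"), input)); // [|.?]
    }
}
```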

"\b": a word boundary, useful for matching a specific word. For example, \bregex\b matches the regex in "hello regex!" but not in "helloregex!", because in the latter regex is not a separate word: there is no boundary in front of it.

"\B": like "\D" and "\W", the capital letter denotes the opposite, here a non-word boundary. I did not really know how to use this, so I tried it. With the same input, hello regex!, \Be\B matches three e's. But \Bregex\B finds no match, because the regex in hello regex! sits on word boundaries. If the input is changed to helloregexs!, \Bregex\B matches regex.
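The boundary experiments above can be reproduced with a sketch like this one (class and helper names are made up):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // Hypothetical helper: collects every match of regex in input.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // \b: "regex" must stand alone as a word.
        System.out.println(findAll("\\bregex\\b", "hello regex!")); // [regex]
        System.out.println(findAll("\\bregex\\b", "helloregex!"));  // []

        // \B: the position must NOT be a word boundary.
        System.out.println(findAll("\\Be\\B", "hello regex!"));     // [e, e, e]
        System.out.println(findAll("\\Bregex\\B", "hello regex!")); // []
        System.out.println(findAll("\\Bregex\\B", "helloregexs!")); // [regex]
    }
}
```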

"\A": the beginning of the input. For example, with the input "abcdcfg", "[\\w]*?c" matches abc and dc, but "\\A[\\w]*?c" matches only abc, because \A anchors the match to the beginning of the input.

"\G": the end of the previous match. I really do not know how to use this; please share and advise.

"\Z" and "\z": I do not know what these mean; please share.
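For what it is worth, the Pattern documentation describes \Z as the end of the input but for the final terminator, if any, and \z as the end of the input. The sketch below tries \A, \G, \Z and \z on small inputs; the class and helper names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorDemo {
    // Hypothetical helper: collects every match of regex in input.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // \A: only a match that starts at the beginning of the input counts.
        System.out.println(findAll("[\\w]*?c", "abcdcfg"));    // [abc, dc]
        System.out.println(findAll("\\A[\\w]*?c", "abcdcfg")); // [abc]

        // \G: each match must start exactly where the previous one ended,
        // so the chain stops at the first character that does not match.
        System.out.println(findAll("a", "aab aa"));    // [a, a, a, a]
        System.out.println(findAll("\\Ga", "aab aa")); // [a, a]

        // \z is the very end of the input; \Z also accepts a position
        // just before a final line terminator.
        System.out.println(findAll("abc\\z", "abc\n").isEmpty()); // true
        System.out.println(findAll("abc\\Z", "abc\n"));           // [abc]
    }
}
```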

"(?=X)": literally, a zero-width positive lookahead. First of all, the text that ends up matched lies to the left of X, not to its right: in xxabss, xx is in front of ab and ss is behind it. The construct succeeds at a position that is immediately followed by X, so it lets you match what is in front of X. For example, with the input "xxabcyyabc", the first match of "[\\s\\S]*?(?=abc)" is xx, the text in front of the first abc (later matches include empty strings, because the reluctant quantifier can match nothing at a position that is directly followed by abc). The API's explanation of this construct is not easy to understand. Note that the X in (?=X) is not consumed, so the rest of the regex can still match it: for the input "xabcsabc", "[\\w](?=abc)[\\w]{2}" produces 2 matches, xab and sab.
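A small sketch of positive lookahead. The second pattern, [\w]{2}(?=abc), is not from the text above; it is added here as a simpler way to pick out the two characters in front of each abc. The class and helper names are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadDemo {
    // Hypothetical helper: collects every match of regex in input.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // The lookahead checks for "abc" without consuming it,
        // so [\w]{2} can still match the "ab" that follows.
        System.out.println(findAll("[\\w](?=abc)[\\w]{2}", "xabcsabc")); // [xab, sab]

        // Two word characters that are immediately followed by "abc".
        System.out.println(findAll("[\\w]{2}(?=abc)", "xxabcyyabc"));    // [xx, yy]
    }
}
```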

"(?!X)": literally, a zero-width negative lookahead. My English is poor, so I will not try to paraphrase it further. (?=X) requires that what follows the current position is X; (?!X) requires that what follows is not X. If the input is "xxabcyyabc" and the regex is "[\\w]{2}(?!x)[\\w]{2}", the matches are xxab and cyya. Some readers may wonder why xxab matches: after the first two characters xx, the next character is a, not x, so the lookahead succeeds. With the same input and the regex "[^x]{2}(?!xx)[\\w]{2}", the matches are abcy and yabc.
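The two negative lookahead examples can be verified with a sketch like the following; the first call, a(?!b), is an extra minimal case added for illustration, and the class and helper names are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NegativeLookaheadDemo {
    // Hypothetical helper: collects every match of regex in input.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Only the "a" that is NOT followed by "b" matches.
        System.out.println(findAll("a(?!b)", "abac")); // [a]

        // The examples from the text.
        System.out.println(findAll("[\\w]{2}(?!x)[\\w]{2}", "xxabcyyabc"));  // [xxab, cyya]
        System.out.println(findAll("[^x]{2}(?!xx)[\\w]{2}", "xxabcyyabc")); // [abcy, yabc]
    }
}
```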

"(?<=X)": literally, a zero-width positive lookbehind. It requires that the text immediately before the current position is X, so in the test below each match ends with X. Test: with the input "xxabcyyabc" and the regex "[\\w].*?(?<=abc)", the matches are "xxabc" and "yyabc".

"(?<!X)": literally, a zero-width negative lookbehind. It requires that the text immediately before the current position is not X. Test: with the input "xxabcyyabc" and the regex "[\\w].*?(?<!abc)abc", the printed results are as follows.

This is match 1: captured "xxabc".
This is match 2: captured "yyabc".

In the first match, "[\\w].*?(?<!abc)" first matches "xx", which is then followed by "abc". The second match works the same way.
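A sketch of both lookbehind forms. The first two calls use a simpler pattern that is not from the text above (a digit that is, or is not, preceded by ab); the last two repeat the tests from the text. The class and helper names are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindDemo {
    // Hypothetical helper: collects every match of regex in input.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // A digit that is preceded by "ab" / not preceded by "ab".
        System.out.println(findAll("(?<=ab)\\d", "ab1cd2ab3")); // [1, 3]
        System.out.println(findAll("(?<!ab)\\d", "ab1cd2ab3")); // [2]

        // The tests from the text.
        System.out.println(findAll("[\\w].*?(?<=abc)", "xxabcyyabc"));    // [xxabc, yyabc]
        System.out.println(findAll("[\\w].*?(?<!abc)abc", "xxabcyyabc")); // [xxabc, yyabc]
    }
}
```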

"(?>X)": X as an independent, non-capturing group.
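"Independent" here means that once the group has matched, it does not give characters back to let the rest of the regex succeed. A minimal sketch comparing it with an ordinary non-capturing group (?:X); the class and helper names are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IndependentGroupDemo {
    // Hypothetical helper: returns the first match, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        // Ordinary non-capturing group: a+ gives one "a" back so "ab" can match.
        System.out.println(firstMatch("(?:a+)ab", "aaab")); // aaab

        // Independent group: a+ keeps every "a" it took, so "ab" cannot match.
        System.out.println(firstMatch("(?>a+)ab", "aaab")); // null

        // Neither form creates a capturing group.
        System.out.println(Pattern.compile("(?>a+)ab").matcher("").groupCount()); // 0
    }
}
```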

Five, summary

Because of my limited ability, many things are left unexplained here, including how the regex matching modes work internally, the methods of the other classes in the java.util.regex package, and the other regex symbols listed for Pattern. There may also be many errors in the text; I hope for your understanding and guidance. I have only scratched the surface, and I hope this helps those who need it.

As for writing regular expressions, I personally think it is like combining simple moves: you will find that the combination contains great power.


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
