Use regular expressions in Java

Source: Internet
Author: User
Recently I needed to parse HTML in a project and considered using NekoHTML. NekoHTML looked a little complicated for what I needed, so I took a cruder but more convenient route: regular expressions. They turned out to be very effective and let me finish the HTML parsing quickly. Along the way I studied many of the ways regular expressions are used in Java and picked up some experience that I did not want to keep to myself, so I am sharing it here.

Java regular expressions are implemented by the Pattern and Matcher classes in the java.util.regex package. (It is worth keeping the Java API documentation open while reading this article and checking each method's description as it comes up.)

The Pattern class represents a compiled regular expression, in other words a matching pattern. Its constructor is private, so it cannot be instantiated directly; instead, create one through the static factory method Pattern.compile(String regex).

Java code example:

    Pattern p = Pattern.compile("\\w+");
    p.pattern(); // returns \w+

pattern() returns the regex argument that was passed to Pattern.compile(String regex).

1. Pattern.split(CharSequence input)

Pattern has a split(CharSequence input) method that splits a string around matches of the pattern and returns a String[]. String.split(String regex) yields the same result as Pattern.compile(regex).split(input).

Java code example:

    Pattern p = Pattern.compile("\\d+");
    String[] str = p.split("My QQ is: 456456 my phone is: 0532214 my mailbox is: aaa@aaa.com");

Result:

    str[0] = "My QQ is: "
    str[1] = " my phone is: "
    str[2] = " my mailbox is: aaa@aaa.com"

2. Pattern.matches(String regex, CharSequence input)

This static method is a shortcut for matching a string quickly. It is suited to one-off matches and succeeds only when the entire input matches.

Java code example:

    Pattern.matches("\\d+", "2223");   // returns true
    Pattern.matches("\\d+", "2223aa"); // returns false: the whole input must match, and "aa" does not
    Pattern.matches("\\d+", "22bb23"); // returns false: the whole input must match, and "bb" does not

3. Pattern.matcher(CharSequence input)

After all that, the Matcher class finally makes its entrance. Pattern.matcher(CharSequence input) returns a Matcher object. Matcher has no public constructor, so an instance can only be obtained through Pattern.matcher(CharSequence input). The Pattern class on its own supports only simple matching operations; for more powerful and convenient matching it has to work together with Matcher. Matcher supports capturing groups and repeated matching against the same input.

Java code example:

    Pattern p = Pattern.compile("\\d+");
    Matcher m = p.matcher("22bb23");
    m.pattern(); // returns p, the Pattern object that created this Matcher

4. Matcher.matches() / Matcher.lookingAt() / Matcher.find()

Matcher provides three matching methods. All three return a boolean: true if the match succeeds, false otherwise.

matches() matches against the entire string and returns true only when the whole string matches.

Java code example:

    Pattern p = Pattern.compile("\\d+");
    Matcher m = p.matcher("22bb23");
    m.matches();  // returns false: \d+ cannot match "bb", so the string as a whole does not match
    Matcher m2 = p.matcher("2223");
    m2.matches(); // returns true: \d+ matches the entire string

Looking back at Pattern.matches(String regex, CharSequence input), it is equivalent to the following code:

    Pattern.compile(regex).matcher(input).matches()

lookingAt() matches from the beginning of the string and returns true only when a match starts at the very first character. Unlike matches(), it does not have to consume the whole string.

Java code example:

    Pattern p = Pattern.compile("\\d+");
    Matcher m = p.matcher("22bb23");
    m.lookingAt();  // returns true: \d+ matches the leading "22"
    Matcher m2 = p.matcher("aa2223");
    m2.lookingAt(); // returns false: \d+ cannot match the leading "aa"

find() searches the string for a match; the matched substring may be at any position.

Java code example:

    Pattern p = Pattern.compile("\\d+");
    Matcher m = p.matcher("22bb23");
    m.find();  // returns true
    Matcher m2 = p.matcher("aa2223");
    m2.find(); // returns true
    Matcher m3 = p.matcher("aa2223bb");
    m3.find(); // returns true
    Matcher m4 = p.matcher("aabb");
    m4.find(); // returns false

5. Matcher.start() / Matcher.end() / Matcher.group()

After a successful matches(), lookingAt(), or find(), these three methods give more detail about the match. start() returns the index of the first character of the matched substring. end() returns the index just past the last character of the matched substring. group() returns the matched substring itself.

Java code example:

    Pattern p = Pattern.compile("\\d+");
    Matcher m = p.matcher("aaa2223bb");
    m.find();   // matches "2223"
    m.start();  // returns 3
    m.end();    // returns 7, the index just after "2223"
    m.group();  // returns "2223"

    Matcher m2 = p.matcher("2223bb");
    m2.lookingAt(); // matches "2223"
    m2.start();     // returns 0; lookingAt() only matches at the beginning, so start() is always 0 after it
    m2.end();       // returns 4
    m2.group();     // returns "2223"

    // \d+ cannot match all of "2223bb", so matches() is shown here with \w+ instead
    Matcher m3 = Pattern.compile("\\w+").matcher("2223bb");
    m3.matches();   // matches the entire string
    m3.start();     // returns 0, for the same reason as above
    m3.end();       // returns 6, because matches() has to match the whole string
    m3.group();     // returns "2223bb"

That covers the basic usage of these methods.
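The three matching methods and the position accessors from sections 4 and 5 can be compared side by side. The following is a minimal sketch rather than anything from the original article: the class name MatchModesDemo and its helper methods are made up for illustration, and only the java.util.regex calls themselves come from the JDK.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class MatchModesDemo {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Formats a match attempt as "text@start-end", or "no match" on failure.
    static String describe(Matcher m, boolean matched) {
        // start(), end() and group() may only be called after a successful match
        return matched ? m.group() + "@" + m.start() + "-" + m.end() : "no match";
    }

    // matches(): the whole input must match the pattern
    static String viaMatches(String input) {
        Matcher m = DIGITS.matcher(input);
        return describe(m, m.matches());
    }

    // lookingAt(): the match must start at index 0 but may stop early
    static String viaLookingAt(String input) {
        Matcher m = DIGITS.matcher(input);
        return describe(m, m.lookingAt());
    }

    // find(): the match may start anywhere in the input
    static String viaFind(String input) {
        Matcher m = DIGITS.matcher(input);
        return describe(m, m.find());
    }

    public static void main(String[] args) {
        System.out.println("matches:   " + viaMatches("22bb23"));   // no match
        System.out.println("lookingAt: " + viaLookingAt("22bb23")); // 22@0-2
        System.out.println("find:      " + viaFind("aaa2223bb"));   // 2223@3-7
    }
}
```

Guarding the accessor calls behind the boolean result, as describe() does, avoids the IllegalStateException that start(), end(), and group() throw when no match is available.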
Regular expression groups are handled in Java through overloads: start(), end(), and group() each have a variant that takes a group number, namely start(int i), end(int i), and group(int i). Matcher also has groupCount(), which returns the number of capturing groups in the pattern.

Java code example:

    Pattern p = Pattern.compile("([a-z]+)(\\d+)");
    Matcher m = p.matcher("aaa2223bb");
    m.find();       // matches "aaa2223"
    m.groupCount(); // returns 2, because the pattern has 2 groups
    m.start(1);     // returns 0, the index where the substring matched by group 1 begins
    m.start(2);     // returns 3
    m.end(1);       // returns 3, the index just past the substring matched by group 1
    m.end(2);       // returns 7
    m.group(1);     // returns "aaa", the substring matched by group 1
    m.group(2);     // returns "2223", the substring matched by group 2

Now for a slightly more advanced matching operation. Suppose a piece of text contains several numbers scattered through it, and all of them need to be extracted. With Java's regular expressions this is simple.

Java code example:

    Pattern p = Pattern.compile("\\d+");
    Matcher m = p.matcher("My QQ is: 456456 my phone is: 0532214 my mailbox is: aaa123@aaa.com");
    while (m.find()) {
        System.out.println(m.group());
    }

Output:

    456456
    0532214
    123

If the while loop above is replaced with

    while (m.find()) {
        System.out.print(m.group());
        System.out.print(" start: " + m.start());
        System.out.println(" end: " + m.end());
    }

the output is:

    456456 start: 10 end: 16
    0532214 start: 30 end: 37
    123 start: 56 end: 59

As this shows, each successful matching operation updates what start(), end(), and group() return so that they describe the newly matched substring, and their overloads are updated in the same way.
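Group numbers and repeated find() calls combine naturally. The sketch below is an illustration under assumptions, not code from the original article: the class GroupDemo, the method pairs(), and the sample input are all invented here. It collects, for every match, the text and the half-open index range of each of the two groups.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class GroupDemo {
    // Group 1 captures a run of lowercase letters, group 2 the digits right after it
    private static final Pattern LETTERS_THEN_DIGITS = Pattern.compile("([a-z]+)(\\d+)");

    // Returns one entry per match, formatted as "letters[start,end) digits[start,end)".
    static List<String> pairs(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = LETTERS_THEN_DIGITS.matcher(input);
        while (m.find()) { // each successful find() refreshes the group information
            result.add(m.group(1) + "[" + m.start(1) + "," + m.end(1) + ") "
                    + m.group(2) + "[" + m.start(2) + "," + m.end(2) + ")");
        }
        return result;
    }

    public static void main(String[] args) {
        // prints: [aaa[0,3) 2223[3,7), cc[10,12) 45[12,14)]
        System.out.println(pairs("aaa2223bb cc45"));
    }
}
```

The "bb" in the sample input is skipped because no digits follow it, so the second match starts at "cc".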
Note: start(), end(), and group() can only be used after a successful matching operation, that is, only after matches(), lookingAt(), or find() has returned true. Otherwise a java.lang.IllegalStateException is thrown.

6. Matcher.replaceAll(String replacement) / Matcher.replaceFirst(String replacement)

You probably already know String.replaceAll() and String.replaceFirst(). Matcher.replaceAll() and Matcher.replaceFirst() do the same job; only the way they are called differs. For example, suppose every digit in a piece of text should be replaced with *. Using String:

Java code example:

    String str = "My QQ is: 456456 my phone is: 0532214 my mailbox is: aaa123@aaa.com";
    System.out.println(str.replaceAll("\\d", "*"));

Output:

    My QQ is: ****** my phone is: ******* my mailbox is: aaa***@aaa.com

Now the same requirement using Matcher:

Java code example:

    Pattern p = Pattern.compile("\\d");
    Matcher m = p.matcher("My QQ is: 456456 my phone is: 0532214 my mailbox is: aaa123@aaa.com");
    System.out.println(m.replaceAll("*"));

Output:

    My QQ is: ****** my phone is: ******* my mailbox is: aaa***@aaa.com

String.replaceAll() presumably calls Matcher.replaceAll() internally. str.replaceAll(regex, replacement) is equivalent to the following code:

    Pattern.compile(regex).matcher(str).replaceAll(replacement)

Matcher.replaceFirst() is just as simple and does the same thing as String.replaceFirst(), so I will not say much about it. str.replaceFirst(regex, replacement) is equivalent to the following code:

    Pattern.compile(regex).matcher(str).replaceFirst(replacement)

7. Matcher.appendReplacement(StringBuffer sb, String replacement) / Matcher.appendTail(StringBuffer sb)

appendReplacement(StringBuffer sb, String replacement) appends to sb the text between the end of the previous match and the start of the current match, followed by the replacement string in place of the current match. appendTail(StringBuffer sb) appends to sb whatever remains of the input after the last match.

Java code example:

    Pattern p = Pattern.compile("\\d+");
    Matcher m = p.matcher("My QQ is: 456456 my phone is: 0532214 my mailbox is: aaa123@aaa.com");
    StringBuffer sb = new StringBuffer();
    m.find();                      // matches "456456"
    m.appendReplacement(sb, "*");  // appends the text before "456456" to sb, then "*" in place of "456456"
    System.out.println(sb.toString());
    m.appendTail(sb);              // appends the rest of the input, which has not been replaced, to sb
    System.out.println(sb.toString());

Output:

    My QQ is: *
    My QQ is: * my phone is: 0532214 my mailbox is: aaa123@aaa.com

Here is another example.

Java code example:

    Pattern p = Pattern.compile("\\d+");
    Matcher m = p.matcher("My QQ is: 456456 my phone is: 0532214 my mailbox is: aaa123@aaa.com");
    StringBuffer sb = new StringBuffer();
    while (m.find()) {
        m.appendReplacement(sb, "*");
        System.out.println(sb.toString());
    }
    m.appendTail(sb);
    System.out.println("The final content after appendTail() is: " + sb.toString());

Output:

    My QQ is: *
    My QQ is: * my phone is: *
    My QQ is: * my phone is: * my mailbox is: aaa*
    The final content after appendTail() is: My QQ is: * my phone is: * my mailbox is: aaa*@aaa.com

That is all I will say about these two methods. If they are still unclear, the best way to understand them is to run the code yourself and study what it does.

8. Matcher.region(int start, int end) / Matcher.regionEnd() / Matcher.regionStart()

By default, a matching operation considers the entire string. For example, given the string "aabbcc" and the pattern "\\d+", find() starts at the first a, the position with index 0. When nothing matches at index 0 it moves on to the next position, and so on, until either a substring matches or the last character has been tried. Clearly "\\d+" cannot match anything in "aabbcc": once the attempt at the last c fails, the whole operation fails. In other words, the search covers the complete string. Can the search be restricted to part of the string? Yes: region(int start, int end) sets exactly this limit. Consider an example.

Java code example:

    Pattern p = Pattern.compile("\\d+");
    String content = "aaabb2233cc";
    Matcher m = p.matcher(content);
    System.out.println(m);

Output:

    java.util.regex.Matcher[pattern=\d+ region=0,11 lastmatch=]

Here region=0,11 means start = 0 and end = 11. Put simply, matching begins at index 0; if a substring matches there, the result is returned, and otherwise the attempt moves to the next position, stopping after index 11 - 1. Why 11? Because content.length() == 11. That should make the role of the region clear. Here is another example.

Java code example:

    Pattern p = Pattern.compile("\\d+");
    String content = "aaabb2233cc";
    Matcher m = p.matcher(content);
    m.find();        // returns true, matching "2233"
    Matcher m2 = p.matcher(content);
    m2.region(0, 5);
    m2.find();       // returns false: only indices 0 through 5 - 1 are searched, and they contain no digits
    Matcher m3 = p.matcher(content);
    m3.region(3, 8);
    m3.find();       // returns true
    m3.group();      // returns "223"; count the index positions to see why

Matcher.regionStart() returns the start value set by region(int start, int end); the default is 0.
Matcher.regionEnd() returns the end value set by region(int start, int end); the default is the length() of the string being matched.
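The effect of a region can be checked with a small helper. This is only a sketch under assumptions: the class RegionDemo and the method firstDigitsIn() are hypothetical names introduced for illustration, and returning null for "no match" is a choice made here for brevity.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class RegionDemo {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Returns the first run of digits found within [start, end), or null if there is none.
    static String firstDigitsIn(String input, int start, int end) {
        Matcher m = DIGITS.matcher(input);
        m.region(start, end); // restrict the search to indices start .. end - 1
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String content = "aaabb2233cc";
        Matcher m = DIGITS.matcher(content);
        // A fresh matcher's region spans the whole input
        System.out.println(m.regionStart() + "," + m.regionEnd());  // 0,11
        System.out.println(firstDigitsIn(content, 0, 5));           // null
        System.out.println(firstDigitsIn(content, 3, 8));           // 223
        System.out.println(firstDigitsIn(content, 0, 11));          // 2233
    }
}
```

Note that region() throws an IndexOutOfBoundsException when start or end falls outside the input, so callers of a helper like this need to pass valid bounds.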
9. Matcher.reset() / Matcher.reset(CharSequence input)

These methods reset the matcher. Take a look at the example.

Java code example:

    Pattern p = Pattern.compile("[a-z]+");
    String content = "aaabb2233cc";
    Matcher m = p.matcher(content);  // m has just been created and is in its initial state
    m.find();
    m.group();   // returns "aaabb"
    m.find();
    m.group();   // returns "cc"
    Matcher m2 = p.matcher(content); // m2 has just been created and is in its initial state
    m2.find();
    m2.group();  // returns "aaabb"
    m2.reset();  // back to the initial state, as if m2 had just been created
    m2.find();
    m2.group();  // returns "aaabb" again

Matcher.reset(CharSequence input) likewise restores the initial state and additionally replaces the string being matched with input. Later matching operations run against input instead of the original string.

10. Matcher.toMatchResult()

The description of the Matcher class in the Java API shows that it implements the MatchResult interface, which declares the following core methods:

    groupCount()
    group() / group(int i)
    start() / start(int i)
    end() / end(int i)

These methods have all been covered above. Now take a look at toMatchResult(). The fragment below assumes imports of java.util.ArrayList, java.util.List, and java.util.regex.*.

Java code example:

    Pattern p = Pattern.compile("\\d+");
    Matcher m = p.matcher("My QQ is: 456456 my phone is: 0532214 my mailbox is: aaa123@aaa.com");
    List<MatchResult> list = new ArrayList<>();
    while (m.find()) {
        list.add(m.toMatchResult()); // snapshot of the current match
    }
    int i = 1;
    for (MatchResult matchResult : list) {
        System.out.print("Match " + (i++) + ": ");
        System.out.println(matchResult.group() + "\t" + matchResult.start() + "\t" + matchResult.end());
    }

Output:

    Match 1: 456456     10      16
    Match 2: 0532214    30      37
    Match 3: 123        56      59

So toMatchResult() saves the information from one match so that it can be used later, even after the matcher has moved on.

That covers the usage. To finish, here is one more practical case. Suppose there is an HTML file and its content needs to be extracted without the HTML tags. With a regular expression this is an easy task, provided that only the content inside <body></body> has been kept.

Java code example:

    String html = "<div><font color='red'>example1</font></div>"; // can be the source of any HTML file, as long as it is well formed
    Pattern p = Pattern.compile("<[^>]*>");
    Matcher m = p.matcher(html);
    String result = m.replaceAll("");
    System.out.println(result);

Output:

    example1
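The tag-stripping idea can be wrapped in a reusable method. What follows is a minimal sketch, assuming simple, well-formed markup as the article does; the class TagStripper and the method stripTags() are hypothetical names, not part of any library. A '>' character inside an attribute value or an HTML comment would end a match too early, so this is not a substitute for a real HTML parser.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
class TagStripper {
    // Matches "<", then any run of characters other than ">", then ">"
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Removes every tag-like substring and keeps the text between tags.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // prints: example1
        System.out.println(stripTags("<div><font color='red'>example1</font></div>"));
    }
}
```

Compiling the pattern once into a static field avoids recompiling it on every call, which matters when many documents are processed.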

References:
java.util.regex API documentation
Chen Guangjia, "Java Regular Expressions: Pattern and Matcher"
