Introduction

When I was extracting information with a web crawler a few years ago, I used HtmlParser to obtain information hidden in the page source, such as comments and user information. This is simple, but it relies on the page's standard tags. Regular expressions in Java can do the same job, and they also work well on non-tag strings that follow a regular pattern. I used regular expressions in college, but only at a superficial level, and I have not studied them in depth since. The link above has a clear introduction for your reference. The rest of this post describes how I used regular-expression pattern matching to extract target information during my experiments, along with a few simple findings.

Capture groups

A capture group is written with () and groups are numbered from left to right by their opening parenthesis. The complete match of the regular expression is group 0, and the parenthesized groups are counted from 1, left to right. We can get the content of any capture group. For example:

[java]
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaptureGroupExample {
    public static void main(String[] args) {
        String stmt = "xx [new Date('2014/1/123'), 08/24], [new Date('2014/2/123'), 9.39],";
        // *? is a reluctant quantifier: it stops at the first "]," it can reach
        String regex = "\\[(.*?)\\],";
        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(stmt);
        // Number of capture groups in the pattern (group 0 is not counted)
        System.out.println(m.groupCount());
        while (m.find()) {
            System.out.println(m.group(0)); // the whole match
            System.out.println(m.group(1)); // the text inside the parentheses
        }
    }
}

The output is:

[plain]
1
[new Date('2014/1/123'), 08/24],
new Date('2014/1/123'), 08/24
[new Date('2014/2/123'), 9.39],
new Date('2014/2/123'), 9.39

From this we can see that group 0 is the entire expression and returns the whole string matched by the pattern, while the capture group in () returns only the content matched inside the parentheses.

Quantifiers

Using the same code as above, if we remove the ? from the regular expression (so that it becomes "\\[(.*)\\],"), the output is:

[plain]
1
[new Date('2014/1/123'), 08/24], [new Date('2014/2/123'), 9.39],
new Date('2014/1/123'), 08/24], [new Date('2014/2/123'), 9.39

The match now spans from the first [ to the last ], in the string. This is the so-called greedy strategy. The behavior comes from the difference between greedy, reluctant, and possessive quantifiers. For details, see: http://www.java3z.com/cwbwebhome/article/article8/Regex/Java.Regex.Tutorial.html#reg5_3

So I personally think that reluctant quantifiers are likely to be the most useful kind for information extraction.

Nested patterns

If the pattern to extract contains a subpattern that repeats many times, can we handle it with a single regular expression? For example, suppose we change the example above to:

[java]
// Same imports and main method as in the previous example
String stmt = "xx [[new Date('192/123456'), 08/24], [new Date('2014/1/123'), 08/24],];";
String regex = "\\[(\\[(.*?)\\],)*\\];";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(stmt);
System.out.println(m.groupCount());
while (m.find()) {
    System.out.println(m.group());  // same as m.group(0)
    System.out.println(m.group(0));
    System.out.println(m.group(1));
    System.out.println(m.group(2));
}

There is no effective way to obtain every string matched by the innermost capture group: a group that sits under a repetition such as (...)* keeps only the text of its last iteration, so m.group(2) can never return each inner match separately. Is that a limitation? We can extract the repeated part with one regular expression and then break it down with a second one, but that takes two steps instead of one.

Tips

1. By default, . does not match every character: it does not match line terminators, so a pattern will not match across line breaks. In that case, Pattern p = Pattern.compile(regex, Pattern.DOTALL) makes . match all characters.
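The points discussed above (reluctant versus greedy quantifiers, the two-step workaround for nested patterns, and Pattern.DOTALL) can be tried together in one small program. The following is only a minimal sketch, not the method used in the original experiment: the class name RegexExtractDemo, the helpers findAll and extractEntries, and all sample strings are made up for illustration. It assumes the inner entries have the form [...], and the outer block has the form [...];.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexExtractDemo {

    // Hypothetical helper: returns capture group 1 of every match of regex in input.
    public static List<String> findAll(String regex, int flags, String input) {
        List<String> results = new ArrayList<>();
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        while (m.find()) {
            results.add(m.group(1));
        }
        return results;
    }

    // Hypothetical helper: two-step extraction for the nested pattern.
    // Step 1 isolates the body of each outer "[ ... ];" block, using a
    // non-capturing group (?: ... ) for the repeated part.
    // Step 2 scans that body for every inner "[ ... ]," entry.
    public static List<String> extractEntries(String input) {
        List<String> entries = new ArrayList<>();
        for (String block : findAll("\\[((?:\\[.*?\\],)*)\\];", Pattern.DOTALL, input)) {
            entries.addAll(findAll("\\[(.*?)\\],", Pattern.DOTALL, block));
        }
        return entries;
    }

    public static void main(String[] args) {
        // Made-up sample data, not taken from the original experiment.
        String flat = "[a, 1],[b, 2],";
        String multiLine = "[a,\n1],";
        String nested = "xx[[new Date('2014/1/1'), 8.24],[new Date('2014/2/1'), 9.39],];";

        // Reluctant quantifier: one match per entry.
        System.out.println(findAll("\\[(.*?)\\],", 0, flat).size()); // prints 2
        // Greedy quantifier: a single match from the first "[" to the last "],".
        System.out.println(findAll("\\[(.*)\\],", 0, flat).size());  // prints 1

        // Without DOTALL, "." does not match the line break, so nothing is found.
        System.out.println(findAll("\\[(.*?)\\],", 0, multiLine).size());              // prints 0
        System.out.println(findAll("\\[(.*?)\\],", Pattern.DOTALL, multiLine).size()); // prints 1

        // Two-step extraction returns every inner entry, one per line.
        extractEntries(nested).forEach(System.out::println);
    }
}
```

The second step uses find() in a loop rather than a repeated capture group, because find() yields a fresh match for each entry while a repeated group only remembers its last iteration.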