Introduction
A few years ago, when I was doing information extraction after web crawling, I pulled the information hidden in the page source, such as comments, user information, and attribute information, directly with HtmlParser. That approach is simple, but it depends on the page using well-formed tags. In fact, regular expressions in Java can do the same job, and for strings that are not tags but still follow a regular pattern, regular expressions are the better tool. I was exposed to regular expressions in college, but only superficially, and I have not studied them in depth since; the page linked above has a fairly clear introduction that can be used for reference. Here are just a few examples of how to use regular expressions for pattern matching to extract target information, plus some simple findings from my experiments.

Capturing groups
A capturing group is written with (), and groups are numbered from left to right by their opening parentheses. The whole regular expression is group 0, and the () groups are then counted from left to right, starting at 1. We can get the content of any capturing group, as the following example shows:

[java]
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaptureGroupDemo {
    public static void main(String[] args) {
        String stmt = "Xx[new Date('08/24/2013'), 9.39],[new Date('08/24/2013'), 9.39],";
        // Reluctant quantifier: each match stops at the first "],"
        String regex = "\\[(.*?)\\],";
        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(stmt);
        // Number of capturing groups in the pattern (group 0 is not counted)
        System.out.println(m.groupCount());
        while (m.find()) {
            System.out.println(m.group(0)); // the whole match
            System.out.println(m.group(1)); // the content inside ()
        }
    }
}

The output is:

[plain]
1
[new Date('08/24/2013'), 9.39],
new Date('08/24/2013'), 9.39
[new Date('08/24/2013'), 9.39],
new Date('08/24/2013'), 9.39

From this we can see that the whole expression acts as capturing group 0, which returns the string that matches the entire pattern.
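To make the left-to-right numbering concrete when a pattern has more than one pair of parentheses, here is a minimal sketch. The class name GroupNumberingSketch, the helper firstItem, and the three-group pattern are my own illustration and not part of the original experiment; they assume every item looks like [new Date('...'), number],.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumberingSketch {
    // Hypothetical pattern with nested parentheses. Groups are numbered by
    // the position of their opening parenthesis:
    //   group 1 = the whole item inside the brackets
    //   group 2 = the date text inside the quotes
    //   group 3 = the number after the comma
    static final Pattern ITEM =
            Pattern.compile("\\[(new Date\\('(.*?)'\\), ([0-9.]+))\\],");

    // Returns groups 0..groupCount() of the first match, or an empty array.
    static String[] firstItem(String input) {
        Matcher m = ITEM.matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        String[] groups = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            groups[i] = m.group(i);
        }
        return groups;
    }

    public static void main(String[] args) {
        String stmt = "Xx[new Date('08/24/2013'), 9.39],[new Date('08/24/2013'), 9.39],";
        String[] groups = firstItem(stmt);
        for (int i = 0; i < groups.length; i++) {
            System.out.println(i + ": " + groups[i]);
        }
    }
}
```

Group 1 opens before group 2, so the outer parentheses get the lower number even though they close last.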
Each capturing group in () returns the content that it matched.

Quantifiers
With the same code as above, when we remove the ? from the regular expression (so that it becomes "\\[(.*)\\],"), the output is:

[plain]
1
[new Date('08/24/2013'), 9.39],[new Date('08/24/2013'), 9.39],
new Date('08/24/2013'), 9.39],[new Date('08/24/2013'), 9.39

The match now covers the whole string, which is the so-called greedy strategy. This is the difference between greedy, reluctant, and possessive quantifiers. Please refer to http://www.java3z.com/cwbwebhome/article/article8/regex/java.regex.tutorial.html#reg5_3 for details. Personally, I think reluctant quantifiers are the ones most likely to be useful for information extraction.

Nested patterns?
If the pattern we want to extract contains a sub-pattern that repeats many times, can it be handled with a single regular expression? For example, suppose the code is modified to:

[java]
String stmt = "Xx[[new Date('08/24/2013'), 9.39],[new Date('08/24/2013'), 9.39],];";
// Group 1 is the repeated item, group 2 is the content inside each item
String regex = "\\[(\\[(.*?)\\],)*\\];";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(stmt);
System.out.println(m.groupCount());
while (m.find()) {
    System.out.println(m.group());
    System.out.println(m.group(0));
    System.out.println(m.group(1));
    System.out.println(m.group(2));
}

With this pattern, group(1) and group(2) hold only what the last repetition matched, so there seems to be no effective way to get every string matched by the innermost capturing group. Is this a limitation? We can of course first use one regular expression to extract the repeated part and then use a second one to break it apart, but can it not be done in one step?

Tips
1. By default, . does not match every character (line terminators, for example), so you can use Pattern p = Pattern.compile(regex, Pattern.DOTALL) to make it match all characters.
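The two-step approach mentioned in the nested-pattern discussion could look roughly like the sketch below. The class name TwoStepExtractionSketch, the method extractItems, and the second sample value are assumptions of mine for illustration. The outer pattern uses a non-capturing group (?:...) so that the whole repeated region lands in group 1, and a second pattern then walks through that region item by item.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoStepExtractionSketch {
    // Step 1: capture the whole run of repeated items between "[" and "];".
    static final Pattern OUTER = Pattern.compile("\\[((?:\\[.*?\\],)*)\\];");
    // Step 2: split that run into individual items.
    static final Pattern INNER = Pattern.compile("\\[(.*?)\\],");

    // Hypothetical helper: returns the content of every inner [ ... ], item.
    static List<String> extractItems(String input) {
        List<String> items = new ArrayList<>();
        Matcher outer = OUTER.matcher(input);
        while (outer.find()) {
            // Group 1 holds all repetitions, not just the last one.
            Matcher inner = INNER.matcher(outer.group(1));
            while (inner.find()) {
                items.add(inner.group(1));
            }
        }
        return items;
    }

    public static void main(String[] args) {
        // Sample input; the second item's values are made up so the two
        // items can be told apart in the output.
        String stmt = "Xx[[new Date('08/24/2013'), 9.39],[new Date('08/25/2013'), 9.41],];";
        for (String item : extractItems(stmt)) {
            System.out.println(item);
        }
    }
}
```

This still takes two passes, but each pattern stays short, and every repetition is recovered rather than only the last one that a repeated capturing group retains.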
Java regular expressions, pattern matching, and information extraction