Java regular expressions, pattern matching, and information extraction

Source: Internet
Author: User

Introduction

A few years ago, when I was doing information extraction for a web crawler, I pulled information hidden in the page source, such as comments, user information, and attribute information, directly with HtmlParser. That is simple, but it relies on the page's tags being well-formed. In fact, regular expressions in Java can do the same job, and for non-tag strings that follow a regular, sequential pattern, regular expressions are the better fit.

I first met regular expressions in college but only learned them superficially. I will not study them in depth here either; the linked tutorial gives a fairly clear introduction and can be used as a reference. What follows is just a few examples of using regular expressions for pattern matching to extract target information, plus some simple findings from the experiments.

Capture Group

A pattern uses () to denote capturing groups, which are numbered from left to right by their opening parentheses. The whole regular expression is group 0, and each ( is then counted from left to right, starting at 1. We can get the contents of any capturing group, as the following example shows:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class CaptureGroupDemo {
        public static void main(String[] args) {
            String stmt = "Xx[new date('08/24/2013'), 9.39],[new date('08/24/2013'), 9.39],";
            // Reluctant .*? stops at the first "],"; group 1 is the text inside the brackets.
            String regex = "\\[(.*?)\\],";
            Pattern p = Pattern.compile(regex);
            Matcher m = p.matcher(stmt);
            System.out.println(m.groupCount()); // number of capturing groups in the pattern
            while (m.find()) {
                System.out.println(m.group(0)); // the whole match
                System.out.println(m.group(1)); // the first capturing group
            }
        }
    }

The output is:

    1
    [new date('08/24/2013'), 9.39],
    new date('08/24/2013'), 9.39
    [new date('08/24/2013'), 9.39],
    new date('08/24/2013'), 9.39

From this we can see that the entire expression acts as capturing group 0, which returns the string matched by the whole pattern, while each capturing group in () returns the content that it matched.

Quantifier

With the same code as above, when we remove the ? from the regular expression (so that it becomes "\\[(.*)\\],"), the output is:

    1
    [new date('08/24/2013'), 9.39],[new date('08/24/2013'), 9.39],
    new date('08/24/2013'), 9.39],[new date('08/24/2013'), 9.39

The match now runs from the first [ to the last ], in the input: this is the so-called greedy strategy. It illustrates the difference between greedy, reluctant, and possessive quantifiers. For details, see: http://www.java3z.com/cwbwebhome/article/article8/regex/java.regex.tutorial.html#reg5_3. Personally, I think information extraction will more often call for reluctant quantifiers.

Nested mode?

If the pattern we are extracting contains a sub-pattern that repeats many times, can a single regular expression handle it? For example, suppose the code is modified to:

    String stmt = "Xx[[new date('08/24/2013'), 9.39],[new date('08/24/2013'), 9.39],];";
    // Group 1 is one "[...]," item and is repeated; group 2 is the text inside that item.
    String regex = "\\[(\\[(.*?)\\],)*\\];";
    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(stmt);
    System.out.println(m.groupCount());
    while (m.find()) {
        System.out.println(m.group());  // same as group(0)
        System.out.println(m.group(0));
        System.out.println(m.group(1)); // only the last repetition of group 1
        System.out.println(m.group(2)); // only the last repetition of group 2
    }

There seems to be no effective way to get every string matched by the innermost capturing group, because a capturing group inside a repetition keeps only the text from its last iteration. Is that a problem? We can first use one regular expression to extract the repeating part and then use a second one to decompose it, but can it not be done in one step?

Tips

1. By default, . does not match every character (line terminators, for example, are excluded), so you can use Pattern p = Pattern.compile(regex, Pattern.DOTALL) to make it match all characters.
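
The two-step approach raised in the nested mode discussion (extract the repeating part first, then decompose it) could look roughly like the sketch below. It is only one possible arrangement, not a definitive answer to the one-step question. The class name NestedExtract, the method extractItems, and the sample value 9.40 are hypothetical names and data introduced for illustration; Pattern.DOTALL is used as in the tip so that . also crosses line terminators.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not part of the original article.
class NestedExtract {
    // Pass 1: the outer "[ ... ];" wrapper. Group 1 holds all repeated "[...]," items;
    // the inner (?: ) is non-capturing because its per-item text would be overwritten anyway.
    private static final Pattern OUTER =
            Pattern.compile("\\[((?:\\[.*?\\],)*)\\];", Pattern.DOTALL);

    // Pass 2: a single "[...]," item. Group 1 is the text inside the brackets.
    private static final Pattern ITEM =
            Pattern.compile("\\[(.*?)\\],", Pattern.DOTALL);

    static List<String> extractItems(String input) {
        List<String> items = new ArrayList<>();
        Matcher outer = OUTER.matcher(input);
        while (outer.find()) {
            // Re-scan only the repeated region found by pass 1.
            Matcher item = ITEM.matcher(outer.group(1));
            while (item.find()) {
                items.add(item.group(1));
            }
        }
        return items;
    }

    public static void main(String[] args) {
        // Sample input modeled on the article's string; the second value is changed
        // to 9.40 (an assumption) so the two items can be told apart.
        String stmt = "Xx[[new date('08/24/2013'), 9.39],[new date('08/24/2013'), 9.40],];";
        for (String s : extractItems(stmt)) {
            System.out.println(s);
        }
    }
}
```

Calling find() in a loop on the second matcher is what recovers every item: each call yields a fresh match with its own group 1, whereas a group inside a repetition in a single match only reports its last iteration.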
