Java regular expressions, pattern matching, and information extraction

Last Update:2014-09-04 Source: Internet

Author: User

Introduction

A few years ago, when I was doing information extraction after web crawling, I used HtmlParser directly to pull hidden information out of a page's source: comments, user information, attribute information, and so on. This is simple, but it depends on the page's tags being well formed. In fact, regular expressions in Java can do the same job, and for strings that are not tags but follow a regular combination of patterns, regular expressions are the better fit.

I first came across regular expressions in college, but only superficially, and I have not studied them in depth since; the linked page gives a fairly clear introduction that can be used for reference. Here are just a few examples of how to use regular expressions for pattern matching to extract target information, along with some simple findings from my experiments.

Capturing groups

A pattern uses () to denote capturing groups, which are numbered from left to right by their opening parentheses. The whole regular expression is group 0, and each ( then counts from left to right starting at 1. We can retrieve the content of any capturing group, as the following example shows:

[java]
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String stmt = "xx[new Date('08/24/2013'), 9.39],[new Date('08/24/2013'), 9.39],";
        // Reluctant .*? stops at the first "]," it can reach.
        String regex = "\\[(.*?)\\],";
        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(stmt);
        // Number of capturing groups in the pattern (group 0 is not counted).
        System.out.println(m.groupCount());
        while (m.find()) {
            System.out.println(m.group(0)); // the whole match
            System.out.println(m.group(1)); // the content of the first ()
        }
    }
}

The output is:

[plain]
1
[new Date('08/24/2013'), 9.39],
new Date('08/24/2013'), 9.39
[new Date('08/24/2013'), 9.39],
new Date('08/24/2013'), 9.39

From this we can see that the entire expression acts as capturing group 0, which returns each string that matches the whole pattern.
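To make the left-to-right numbering concrete, here is a minimal sketch with nested parentheses. The class name `GroupNumbering`, the helper `groupsOfFirstMatch`, and the three-group pattern are illustrative assumptions, not part of the original example; only `Pattern`, `Matcher`, `find`, `groupCount`, and `group` come from `java.util.regex`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumbering {
    // Hypothetical helper: returns groups 0..groupCount() of the first match,
    // or an empty array when the pattern does not match at all.
    public static String[] groupsOfFirstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        String[] groups = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            groups[i] = m.group(i);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Three opening parentheses, counted left to right:
        // group 1 = the whole entry, group 2 = the date, group 3 = the value.
        String regex = "\\[(new Date\\('(.*?)'\\), (.*?))\\],";
        String input = "xx[new Date('08/24/2013'), 9.39],";
        String[] groups = groupsOfFirstMatch(regex, input);
        for (int i = 0; i < groups.length; i++) {
            System.out.println(i + ": " + groups[i]);
        }
    }
}
```

Because group 1 opens before groups 2 and 3, it receives the lower number even though it closes last; the numbering follows opening parentheses only.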
Each capturing group in () returns the content it matched.

Quantifiers

Using the same code as above, if we remove the ? from the regular expression, the output becomes:

[plain]
1
[new Date('08/24/2013'), 9.39],[new Date('08/24/2013'), 9.39],
new Date('08/24/2013'), 9.39],[new Date('08/24/2013'), 9.39

The match now covers the whole string, which is the so-called greedy strategy. This is the difference between greedy, reluctant, and possessive quantifiers. Please refer to http://www.java3z.com/cwbwebhome/article/article8/regex/java.regex.tutorial.html#reg5_3 for details. Personally, I think information extraction mostly calls for reluctant quantifiers.

Nested patterns

If the pattern we are extracting contains a sub-pattern that repeats many times, can a single regular expression handle it? For example, suppose the code is modified to:

[java]
String stmt = "xx[[new Date('08/24/2013'), 9.39],[new Date('08/24/2013'), 9.39],];";
// Group 1 is the repeated "[...]," entry; group 2 is the content inside it.
String regex = "\\[(\\[(.*?)\\],)*\\];";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(stmt);
System.out.println(m.groupCount());
while (m.find()) {
    System.out.println(m.group());
    System.out.println(m.group(0));
    System.out.println(m.group(1));
    System.out.println(m.group(2));
}

There is no effective way to get every string matched by the innermost capturing group: a capturing group that repeats keeps only the text of its last iteration. Is this a problem? We can first use one regular expression to extract the repeated part and then use another to break it down, but can it not be done in one step?

Tips

1. By default, . does not match every character (line terminators, for example, are excluded), so use Pattern p = Pattern.compile(regex, Pattern.DOTALL) to make it match all characters.
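The two-step approach mentioned above (one regular expression to isolate the repeated part, a second to split it) might look like the following sketch. The class name `NestedExtraction`, the method `extractItems`, the two pattern constants, and the sample values are assumptions made for illustration; both patterns are compiled with `Pattern.DOTALL` so that an entry may span a line break, as the tip describes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestedExtraction {
    // Step 1: capture the whole run of "[...]," entries as a single group.
    // (?:...) is a non-capturing group, so the run itself is group 1.
    private static final Pattern OUTER =
            Pattern.compile("\\[((?:\\[.*?\\],)*)\\];", Pattern.DOTALL);

    // Step 2: applied repeatedly to the captured run, one entry per find().
    private static final Pattern INNER =
            Pattern.compile("\\[(.*?)\\],", Pattern.DOTALL);

    public static List<String> extractItems(String stmt) {
        List<String> items = new ArrayList<>();
        Matcher outer = OUTER.matcher(stmt);
        while (outer.find()) {
            Matcher inner = INNER.matcher(outer.group(1));
            while (inner.find()) {
                items.add(inner.group(1));
            }
        }
        return items;
    }

    public static void main(String[] args) {
        // Made-up sample: the second entry differs so the two results are distinguishable.
        String stmt = "xx[[new Date('08/24/2013'), 9.39],[new Date('08/25/2013'), 9.41],];";
        for (String item : extractItems(stmt)) {
            System.out.println(item);
        }
    }
}
```

This still takes two passes, so it does not answer the question of whether one regular expression can do it alone; it only shows that the workaround is short, because each `find()` on the inner matcher yields a fresh group 1 instead of overwriting the previous one.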

This article is an English version of an article originally published in Chinese on aliyun.com.
