Regular Expressions, pattern matching, and information extraction in Java

Last Update:2013-12-30 Source: Internet

Author: User

Introduction

When I was extracting information with a web crawler a few years ago, I used HtmlParser to obtain information hidden in the page source, such as comments and user information. That approach is simple, but it relies on the page's standard tags. Regular expressions in Java can do the same job, and they also work well on non-tag strings that follow a regular pattern. I used regular expressions in college, but only superficially, and never learned them properly. The link above has a clear introduction for your reference. Below I describe how I used pattern matching with regular expressions to extract target information in my experiments, along with a few simple findings.

Capture groups

A capture group is written with () and groups are numbered from left to right by their opening parenthesis. The whole regular expression is group 0, and the parenthesized groups are numbered from 1. The content of any capture group can be retrieved. For example:

[java]
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RegexTest {
        public static void main(String[] args) {
            String stmt = "xx[new Date('2014/1/123'),08/24],[new Date('2014/2/123'),9.39],";
            // Reluctant quantifier *? stops at the first "]," it can reach.
            String regex = "\\[(.*?)\\],";
            Pattern p = Pattern.compile(regex);
            Matcher m = p.matcher(stmt);
            // Number of capture groups in the pattern (group 0 is not counted).
            System.out.println(m.groupCount());
            while (m.find()) {
                System.out.println(m.group(0)); // the whole match
                System.out.println(m.group(1)); // the content inside ()
            }
        }
    }

The output is:

[plain]
    1
    [new Date('2014/1/123'),08/24],
    new Date('2014/1/123'),08/24
    [new Date('2014/2/123'),9.39],
    new Date('2014/2/123'),9.39

This shows that group 0 is the entire expression and returns the full substring matched by the pattern, while the group in () returns only the content matched inside the parentheses.

The quantifier in the code above is the reluctant *?. When the ? is removed, so that the regular expression becomes "\\[(.*)\\],", the output is:

[plain]
    1
    [new Date('2014/1/123'),08/24],[new Date('2014/2/123'),9.39],
    new Date('2014/1/123'),08/24],[new Date('2014/2/123'),9.39

Now a single match spans the input from the first [ to the last ], which is the so-called greedy strategy. The behavior comes from the difference between greedy, reluctant, and possessive quantifiers. For details, see: http://www.java3z.com/cwbwebhome/article/article8/Regex/Java.Regex.Tutorial.html#reg5_3. So I personally think that reluctant quantifiers are likely to be the more useful choice for information extraction.

Nested patterns

If the pattern to extract contains a frequently repeated subpattern, can a single regular expression handle it? For example, change the example above to:

[java]
    String stmt = "xx[[new Date('192/123456'),08/24],[new Date('2014/1/123'),08/24],];";
    String regex = "\\[(\\[(.*?)\\],)*\\];";
    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(stmt);
    System.out.println(m.groupCount());
    while (m.find()) {
        System.out.println(m.group());
        System.out.println(m.group(0));
        System.out.println(m.group(1));
        System.out.println(m.group(2));
    }

There is no effective way to obtain every string matched by the innermost capture group: when a group is repeated, group(1) and group(2) hold only the last repetition. Is that a problem? We can use one regular expression to extract the repeated part and then a second one to break it down, but isn't that one step too many?

Tips

1. By default, . does not match every character: it does not match line terminators. To make it match all characters, compile the pattern with Pattern p = Pattern.compile(regex, Pattern.DOTALL).
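
The effect of keeping or removing the ? after the quantifier can be checked with a small self-contained sketch. The class name QuantifierDemo, the helper method extract, and the shortened input string are assumptions made for illustration; they are not part of the original experiment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierDemo {

    // Hypothetical helper: collects capture group 1 of every match of regex in input.
    static List<String> extract(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed sample input with two bracketed items.
        String input = "xx[a,1],[b,2],";

        // Reluctant *? : each match ends at the nearest "],".
        System.out.println(extract("\\[(.*?)\\],", input)); // prints [a,1, b,2]

        // Greedy * : one match runs from the first "[" to the last "],".
        System.out.println(extract("\\[(.*)\\],", input));  // prints [a,1],[b,2]
    }
}
```

With the reluctant form the list has two elements, one per bracketed item; with the greedy form it has a single element that still contains the inner "],[" separator.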

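The two-step approach mentioned for repeated subpatterns (one regular expression to extract the repeated part, a second one to break it down) might look like the sketch below. It is needed because a repeated capturing group in java.util.regex retains only its last repetition. The class name NestedExtractDemo, the method extractItems, the two pattern constants, and the sample input are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NestedExtractDemo {

    // Step 1: capture everything between the outer "[" and "];".
    // The inner group is non-capturing (?:...) because its content is re-scanned in step 2.
    private static final Pattern OUTER = Pattern.compile("\\[((?:\\[.*?\\],)*)\\];");

    // Step 2: capture the content of each inner "[...]," item.
    private static final Pattern INNER = Pattern.compile("\\[(.*?)\\],");

    // Hypothetical helper: returns the content of every inner item.
    static List<String> extractItems(String input) {
        List<String> items = new ArrayList<>();
        Matcher outer = OUTER.matcher(input);
        while (outer.find()) {
            // Re-scan only the repeated part found by the outer pattern.
            Matcher inner = INNER.matcher(outer.group(1));
            while (inner.find()) {
                items.add(inner.group(1));
            }
        }
        return items;
    }

    public static void main(String[] args) {
        // Assumed sample input in the same shape as the nested example.
        System.out.println(extractItems("xx[[a,1],[b,2],];")); // prints [a,1, b,2]
    }
}
```

The second pass costs one extra Matcher per outer match, but it returns every repetition rather than only the last one.
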
This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
