Java and Regular Expressions
Labels: Java Basics
Regular Expression
A regular expression describes a rule, and that rule is used to match strings. Learning regular expressions means learning their syntax rules.
Regular expression syntax
Ordinary characters
Letters, digits, Chinese characters, underscores, and punctuation marks with no special meaning are ordinary characters. An ordinary character in an expression matches the identical character in the input string.
Escape characters
| Character | Explanation |
| --- | --- |
| \n | Line break |
| \t | Tab |
| \^ \$ \( \) \{ \} \? \+ \* \| \\ \[ \] | Match these characters literally |
Standard character sets
| Character | Matches |
| --- | --- |
| \d | Any digit |
| \w | Any letter, digit, or underscore |
| \s | Any whitespace character: space, tab, line break, and so on |
| . | Any character except the line break (\n) |
| [\s\S] | Any character, including \n |
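The entries in the table above can be checked with a few one-character matches. This is a minimal sketch; the class name `StandardSetsDemo` and the helper `fullMatch` are made up for illustration, while `Pattern.matches` is the standard JDK call.

```java
import java.util.regex.Pattern;

class StandardSetsDemo {
    // Hypothetical helper: true when the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("\\d", "7"));        // true: 7 is a digit
        System.out.println(fullMatch("\\w", "_"));        // true: the underscore counts as \w
        System.out.println(fullMatch("\\s", "\t"));       // true: a tab is whitespace
        System.out.println(fullMatch(".", "\n"));         // false: by default . skips the line break
        System.out.println(fullMatch("[\\s\\S]", "\n"));  // true: [\s\S] covers every character
    }
}
```

Note that each backslash is doubled in the Java string literal, because the Java compiler consumes one level of escaping before the regex engine sees the pattern.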
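Note: the uppercase form has the opposite meaning of the lowercase form. \D, \W, and \S match any character that \d, \w, and \s do not.
Custom character sets
Square brackets [] match any one of the characters listed inside them.
| Expression | Explanation |
| --- | --- |
| [ab5@v] | Matches any one of a, b, 5, @, v |
| [^ab5@v] | Matches any one character other than a, b, 5, @, v |
| [f-k] | Matches any one letter from f to k |
| [^f-k] | Matches any one character outside f to k |
| [f-k0-3] | Matches any one character in f to k or 0 to 3 |
| [^f-k0-3] | Matches any one character outside both f to k and 0 to 3 |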
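To see which characters a custom set selects, one option is to delete everything it matches and look at what is left. The sketch below does that; `CustomSetsDemo`, `strip`, and the sample input are illustrative names, not part of the original article.

```java
import java.util.regex.Pattern;

class CustomSetsDemo {
    // Hypothetical helper: removes every character matched by the character class.
    static String strip(String charClass, String input) {
        return Pattern.compile(charClass).matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "fig-k2@v5";
        System.out.println(strip("[ab5@v]", input));    // fig-k2  (@, v, 5 removed)
        System.out.println(strip("[f-k]", input));      // -2@v5   (f, i, g, k removed)
        System.out.println(strip("[^f-k0-3]", input));  // figk2   (only f-k and 0-3 survive)
    }
}
```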
Note:
Special symbols in a regular expression lose their special meaning when they appear inside [], except for - and ^. The backslash also keeps its escaping role.
Inside [], a leading ^ means negation.
Inside [], - denotes a range.
A standard character set other than the dot is included in the custom set when it appears inside brackets. The dot inside brackets is a literal decimal point.
For example, [\d.\-+] matches a digit, a decimal point, -, or +.
Quantifiers
| Quantifier | Explanation |
| --- | --- |
| {n} | The expression repeats exactly n times |
| {m,n} | The expression repeats at least m times and at most n times |
| {m,} | The expression repeats at least m times |
| ? | The expression appears 0 times or 1 time |
| + | The expression appears at least once, equivalent to {1,} |
| * | The expression appears zero or more times, equivalent to {0,} |
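A few whole-string matches make the quantifier table concrete. This is only a sketch; `QuantifierDemo` and `fullMatch` are hypothetical names.

```java
import java.util.regex.Pattern;

class QuantifierDemo {
    // Hypothetical helper: true when the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("a{3}", "aaa"));       // true: exactly three
        System.out.println(fullMatch("a{2,3}", "aaaa"));    // false: four is above the maximum
        System.out.println(fullMatch("a{2,}", "aaaa"));     // true: at least two
        System.out.println(fullMatch("colou?r", "color"));  // true: the u is optional
        System.out.println(fullMatch("a+", ""));            // false: + needs at least one
        System.out.println(fullMatch("a*", ""));            // true: * accepts zero
    }
}
```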
Greedy mode and non-greedy mode
Greedy mode: the quantifier matches as many characters as possible. This is the default.
Non-greedy mode: the quantifier matches as few characters as possible. Append ? to the quantifier to enable it.
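The difference between the two modes is easiest to see when a pattern can end in more than one place. The following sketch uses an invented HTML-like input; `GreedyDemo` and `firstMatch` are illustrative names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
    // Hypothetical helper: returns the first substring matched by regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<b>one</b><b>two</b>";
        // Greedy: .* runs to the last </b> it can reach.
        System.out.println(firstMatch("<b>.*</b>", input));   // <b>one</b><b>two</b>
        // Non-greedy: .*? stops at the first </b>.
        System.out.println(firstMatch("<b>.*?</b>", input));  // <b>one</b>
    }
}
```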
Example
Matching a mobile phone number:
1[358]\d{9}
Matching an email address:
([\w\-\.]+)@([0-9a-zA-Z\-]+)(\.[a-zA-Z]{2,4}){1,2}
Character boundaries
Zero-width match: what is matched is not a character but a position that meets certain conditions. No characters are consumed.
| Character | Explanation |
| --- | --- |
| ^ | Matches the start of the string |
| $ | Matches the end of the string |
| \b | Matches a word boundary, for example the position between a word and a space |
Note: \b matches a position where exactly one of the two adjacent characters is \w. The start and end of the input count as a non-\w side.
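The boundary characters matter most for validation. The sketch below reuses the mobile phone pattern from the earlier example and shows that, without ^ and $, it also succeeds on input that merely contains a phone number. `BoundaryDemo`, `found`, and the sample inputs are illustrative.

```java
import java.util.regex.Pattern;

class BoundaryDemo {
    // Hypothetical helper: true when the regex matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("1[358]\\d{9}", "id:13812345678x"));    // true: matches inside the text
        System.out.println(found("^1[358]\\d{9}$", "id:13812345678x"));  // false: extra text on both sides
        System.out.println(found("^1[358]\\d{9}$", "13812345678"));      // true: the whole input is a number
        System.out.println(found("\\bcat\\b", "a cat sat"));             // true: cat stands alone
        System.out.println(found("\\bcat\\b", "concatenate"));           // false: cat is inside a word
    }
}
```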
Alternation and groups
| Expression | Explanation |
| --- | --- |
| \| | An "or" relationship between the expressions on its left and right: matches either the left side or the right side |
| () capture group | A quantifier or other modifier placed after () applies to the enclosed expression as a whole. When the match result is read, the text matched by the expression inside () can be retrieved separately. Each pair of parentheses is assigned a number, counted by its opening parenthesis from left to right, starting at 1 |
| (?:exp) non-capturing group | Some expressions need () for grouping but do not need to save the text matched by the sub-expression. A non-capturing group avoids that side effect of () |
Note: non-capturing groups can reduce memory allocation when processing a large amount of text.
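The numbering rule and the effect of a non-capturing group can be observed through `Matcher.groupCount()` and `Matcher.group(int)`. This is a sketch with invented inputs; `GroupDemo` and `groups` are hypothetical names.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Hypothetical helper: returns group 0 (the whole match) followed by each numbered group,
    // or an empty array when the input does not match as a whole.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // Two capture groups, numbered 1 and 2 from left to right.
        System.out.println(Arrays.toString(groups("(\\d{4})-(\\d{2})", "2015-12")));    // [2015-12, 2015, 12]
        // The first pair is non-capturing, so the month becomes group 1.
        System.out.println(Arrays.toString(groups("(?:\\d{4})-(\\d{2})", "2015-12")));  // [2015-12, 12]
        // Alternation inside a group, followed by an optional s.
        System.out.println(Arrays.toString(groups("(cat|dog)s?", "dogs")));             // [dogs, dog]
    }
}
```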
Back-references
\nnn
As shown above, each capture group is assigned a number by default. A back-reference uses that number to refer to the text the group captured.
(\w{2})\1 matches strings such as toto, dodo, and gogo, in which a two-character unit is repeated.
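As a quick check of the doubled-word pattern, the sketch below tests it against whole words. `BackReferenceDemo` and `isDoubled` are illustrative names.

```java
import java.util.regex.Pattern;

class BackReferenceDemo {
    // Hypothetical helper: true when the word is a two-character unit written twice.
    static boolean isDoubled(String word) {
        // \1 must repeat exactly the text captured by group 1.
        return Pattern.matches("(\\w{2})\\1", word);
    }

    public static void main(String[] args) {
        System.out.println(isDoubled("toto"));  // true
        System.out.println(isDoubled("gogo"));  // true
        System.out.println(isDoubled("tote"));  // false: "te" does not repeat "to"
    }
}
```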
(img)\w+\1 matches a string that begins and ends with img.
Zero-width assertions (lookaround)
A zero-width assertion only tests its sub-expression; the text the sub-expression matches is not included in the final match result. It is a position match: it checks whether the text before or after the current position meets the specified condition, without keeping that text.
In a regular expression, if a sub-expression matches characters rather than a position and those characters are kept in the final match result, the sub-expression consumes characters. If a sub-expression matches only a position, or the text it matches is not saved to the final result, the sub-expression is zero-width. (Whether a sub-expression consumes characters or is zero-width depends on whether the matched content is kept in the final result.)
| Expression | Explanation |
| --- | --- |
| (?=exp) | Asserts that the text after this position matches exp |
| (?!exp) | Asserts that the text after this position does not match exp |
| (?<=exp) | Asserts that the text before this position matches exp |
| (?<!exp) | Asserts that the text before this position does not match exp |
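Each of the four assertions in the table can be tried on a short input. The sketch below collects every match of a pattern; `LookaroundDemo`, `findAll`, and the sample strings are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDemo {
    // Hypothetical helper: returns every substring matched by regex, in order.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Lookahead: letters followed by "ing"; "ing" itself is not part of the match.
        System.out.println(findAll("[a-z]+(?=ing)", "singing reading"));  // [sing, read]
        // Negative lookahead: a q that is not followed by u.
        System.out.println(findAll("q(?!u)", "Iraq quit"));               // [q]
        // Lookbehind: digits preceded by a dollar sign.
        System.out.println(findAll("(?<=\\$)\\d+", "cost $42, id 7"));    // [42]
        // Negative lookbehind: whole numbers not preceded by a dollar sign.
        System.out.println(findAll("(?<!\\$)\\b\\d+", "cost $42, id 7")); // [7]
    }
}
```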
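Example
[a-z]+(?=ing) matches the part of a word that precedes ing; ing itself is not included in the matched string.
[a-z]+(?=\d+) matches a run of letters that is followed by digits.
[a-z]+(?!\d+) matches a run of letters that is not followed by a digit. Because of backtracking, it can still match a shorter prefix of a word that ends in digits, for example ab in abc1.
(?<=(href=\")) matches the position right after href=", that is, the start of the string that follows href=".
Java Pattern and Matcher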
The java.util.regex package provides the Pattern and Matcher classes, which together provide regular expression support in Java.
A Pattern object is the in-memory representation of a regular expression after compilation, so the regular expression string must first be compiled into a Pattern object:
Pattern pattern = Pattern.compile("\\w+");
Then use the Pattern object to create a Matcher object for the input:
Matcher matcher = pattern.matcher(input);
A Matcher object is the engine that performs matching on a CharSequence: the state involved in a match operation is kept in the Matcher object. Multiple Matcher objects can share the same Pattern object.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.junit.Test; // JUnit 4 is assumed for @Test

/**
 * Created by jifang on 15/12/15.
 */
public class LearnRegexp {

    @Test
    public void testSearch() {
        String input = "hello1997 & 2000";
        // Compile the regular expression into a Pattern object
        Pattern pattern = Pattern.compile("\\w+");
        Matcher matcher = pattern.matcher(input);
        // matches() tries to match the entire character sequence against the pattern
        System.out.println(matcher.matches());
        // Reset the matcher so that scanning restarts from the beginning
        matcher.reset();
        // find() scans the input for the next subsequence that matches the pattern
        while (matcher.find()) {
            String group = matcher.group();
            System.out.println(group);
        }
    }

    /**
     * Replace all digits with '#'
     */
    @Test
    public void testReplace() {
        String input = "1j2h3h4g5o";
        Matcher replace = Pattern.compile("[0-9]").matcher(input);
        input = replace.replaceAll("#");
        System.out.println(input);
    }

    /**
     * Split the String by digits
     */
    @Test
    public void testSplit() {
        String input = "1j24h356h467g589o";
        String[] strings = input.split("\\d+");
        for (String str : strings) {
            System.out.println(str);
        }
    }
}
As the preceding example shows, some String methods also support regular expressions, such as split and replaceAll (for other uses of Pattern, Matcher, and String, see the JDK documentation).
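The String methods that accept a regular expression can be compared side by side with `String.replace`, which treats its argument as literal text. This is a minimal sketch; the class and helper names are made up for illustration.

```java
import java.util.Arrays;

class StringRegexDemo {
    // Hypothetical helpers wrapping the regex-aware String methods.
    static String maskDigits(String s) {
        return s.replaceAll("[0-9]", "#");
    }

    static String[] splitOnDigits(String s) {
        return s.split("\\d+");
    }

    static boolean isIsoDate(String s) {
        // String.matches requires the whole string to match.
        return s.matches("\\d{4}-\\d{2}-\\d{2}");
    }

    public static void main(String[] args) {
        System.out.println(maskDigits("1j2h3h"));                       // #j#h#h
        System.out.println(Arrays.toString(splitOnDigits("j24h356h"))); // [j, h, h]
        System.out.println(isIsoDate("2015-12-15"));                    // true
        // replace() is literal: the text "[0-9]" does not occur, so nothing changes.
        System.out.println("1j2h".replace("[0-9]", "#"));               // 1j2h
    }
}
```

Each of these String methods compiles the pattern on every call, so for a pattern that is applied many times it may be worth compiling a Pattern once and reusing it.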
Small experiment: capture all hyperlinks on a web page
import com.google.common.io.CharStreams;
import org.junit.Test; // JUnit 4 is assumed for @Test

import java.io.BufferedReader;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.net.URL;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Simulates a web crawler: downloads a site's HTML and extracts all the hyperlinks in it.
 * Created by jifang on 15/12/15.
 */
public class HtmlAnalyzer {

    private final String FILE_PATH = "/Users/jifang/save.txt";

    @Test
    public void client() throws IOException {
        String html = downloadHtml("http://www.163.com/", "gbk");
        // (?<=(href=\"))(?:[\w.\/\:\?\=\&]+)(?=\") is the regular expression that matches a URL
        Set<String> urlSet = analyzeHtml(html, "(?<=(href=\\\"))(?:[\\w.\\/\\:\\?\\=\\&]+)(?=\\\")");
        saveToFile(urlSet);
        System.out.println();
    }

    private String downloadHtml(String url, String charset) throws IOException {
        URL readUrl = new URL(url);
        // try-with-resources closes the reader once the page has been read
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(readUrl.openStream(), charset))) {
            return CharStreams.toString(reader);
        }
    }

    private Set<String> analyzeHtml(String html, String regex) {
        Set<String> urlSet = new HashSet<>();
        // Match the URL regular expression against the page
        Matcher matcher = Pattern.compile(regex).matcher(html);
        while (matcher.find()) {
            String group = matcher.group();
            urlSet.add(group);
        }
        return urlSet;
    }

    private void saveToFile(Set<String> urlSet) throws IOException {
        PrintStream printer = new PrintStream(new FileOutputStream(FILE_PATH));
        for (String url : urlSet) {
            printer.println(url);
        }
        printer.flush();
        printer.close();
    }
}
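The extraction step can also be tried without a network connection or Guava. The sketch below is a variation, not the article's method: it runs on an in-memory HTML string and uses a capture group instead of the lookbehind and lookahead pair. `HrefExtractor`, `extract`, and the sample HTML are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefExtractor {
    // Group 1 captures everything between the quotes that follow href=.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    // Returns the distinct href values in order of first appearance.
    static Set<String> extract(String html) {
        Set<String> urls = new LinkedHashSet<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            urls.add(matcher.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://example.com/a\">A</a> "
                + "<a href=\"/b?x=1\">B</a> "
                + "<a href=\"http://example.com/a\">A again</a>";
        System.out.println(extract(html));  // [http://example.com/a, /b?x=1]
    }
}
```

A LinkedHashSet is used here only so that the printed order is predictable. Regular expressions are a rough tool for HTML; a pattern like this one misses single-quoted or unquoted attributes.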
Appendix: to run the program, add the following dependency to pom.xml:
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>18.0</version>
</dependency>
Several regular expression testing tools are recommended:
Mac: RegExRX
Win: RegexBuddy
Web: Online regular expression test