The Thinking Logic of Computer Programs (90)
Section 88 introduced regular expression syntax, and the previous section described the Java APIs for regular expressions. This section analyzes some common regular expressions, including:
- Zip code
- Phone number, including mobile phone number and fixed phone number
- Date and Time
- ID card
- IP address
- URL
- Email Address
- Chinese characters
For any given purpose, a regular expression can usually be written in several ways, and there is rarely a single correct form. This section presents the main ones. It is often easy to write an expression that matches what you want, but hard to keep it from matching what you do not want. In other words, precision is hard to guarantee. In many cases, though, a fully precise expression is unnecessary: how precise it must be depends on the text being processed and on your requirements. Where a constraint is hard to express in a regular expression, you can check it afterwards in code. This may sound abstract, so let's look at concrete cases.
Zip code
The zip code is relatively simple. It is a six-digit number and the first digit cannot be 0. Therefore, the expression can be:
[1-9][0-9]{5}
This expression can be used to verify whether the input is a zip code, for example:
public static Pattern ZIP_CODE_PATTERN = Pattern.compile(
        "[1-9][0-9]{5}");

public static boolean isZipCode(String text) {
    return ZIP_CODE_PATTERN.matcher(text).matches();
}
However, this expression is not sufficient for searching. Consider this example:
public static void findZipCode(String text) {
    Matcher matcher = ZIP_CODE_PATTERN.matcher(text);
    while (matcher.find()) {
        System.out.println(matcher.group());
    }
}

public static void main(String[] args) {
    findZipCode("zip code 100013, phone 18612345678");
}
The text contains only one zip code, but the output is:
100013
186123
What should we do? Use the lookaround assertions (boundary matching) described in Section 88. For the left boundary, the character before the match cannot be a digit, so the negative lookbehind is:
(?<![0-9])
For the right boundary, the character after the match cannot be a digit, so the negative lookahead is:
(?![0-9])
Therefore, the complete expression can be:
(?<![0-9])[1-9][0-9]{5}(?![0-9])
To use this expression, change ZIP_CODE_PATTERN to:
public static Pattern ZIP_CODE_PATTERN = Pattern.compile(
        "(?<![0-9])" // no digit on the left
        + "[1-9][0-9]{5}"
        + "(?![0-9])"); // no digit on the right
With this change, the search prints the expected result: only 100013.
Is every six-digit number that does not start with 0 a zip code? Of course not, so this expression is still not precise. If you need stricter verification, write code to check further.
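To compare the two zip code expressions side by side, here is a minimal, self-contained sketch. The class name ZipCodeSearchDemo and the helper findAll are illustrative and not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class ZipCodeSearchDemo {
    // Unbounded expression: fine with matches(), too loose with find().
    static final Pattern LOOSE = Pattern.compile("[1-9][0-9]{5}");
    // Bounded expression: no digit allowed directly before or after the match.
    static final Pattern BOUNDED =
            Pattern.compile("(?<![0-9])[1-9][0-9]{5}(?![0-9])");

    // Collects every substring of text that the pattern finds.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "zip code 100013, phone 18612345678";
        System.out.println(findAll(LOOSE, text));   // [100013, 186123]
        System.out.println(findAll(BOUNDED, text)); // [100013]
    }
}
```

The loose expression picks up the first six digits of the phone number, while the bounded one rejects them because a digit follows.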
Mobile phone number
China's mobile phone numbers are all 11 digits. Therefore, the simplest expression is:
[0-9]{11}
However, at present the first digit of every mobile phone number is 1 and the second digit is 3, 4, 5, 7, or 8. Therefore, a more precise expression is:
1[34578][0-9]{9}
To make mobile phone numbers easier to read, hyphens (minus signs '-') are often inserted in the middle, for example:
186-1234-5678
To allow these optional hyphens, the expression can be changed to:
1[34578][0-9]-?[0-9]{4}-?[0-9]{4}
The number may be preceded by 0, +86, or 0086, possibly with a space between the prefix and the number, for example:
018612345678
+86 18612345678
0086 18612345678
To allow these forms, add the following expression before the number:
((0|\+86|0086)\s?)?
As with the zip code, if you want to extract numbers from text, you must add lookaround boundaries on both sides so that neither neighbor is a digit. Therefore, the complete expression is:
(?<![0-9])((0|\+86|0086)\s?)?1[34578][0-9]-?[0-9]{4}-?[0-9]{4}(?![0-9])
The Java code is:
public static Pattern MOBILE_PHONE_PATTERN = Pattern.compile(
        "(?<![0-9])" // no digit on the left
        + "((0|\\+86|0086)\\s?)?" // optional prefix: 0, +86, or 0086
        + "1[34578][0-9]-?[0-9]{4}-?[0-9]{4}" // e.g. 186-1234-5678
        + "(?![0-9])"); // no digit on the right
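As an example of further processing in code, the sketch below extracts mobile numbers and normalizes them to 11 plain digits. It assumes the second digit is written as the character class [34578] and wraps the number part in an extra capturing group (group 3). The class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class MobilePhoneDemo {
    static final Pattern MOBILE_PHONE_PATTERN = Pattern.compile(
            "(?<![0-9])"                              // no digit on the left
            + "((0|\\+86|0086)\\s?)?"                 // groups 1 and 2: optional prefix
            + "(1[34578][0-9]-?[0-9]{4}-?[0-9]{4})"   // group 3: the number itself
            + "(?![0-9])");                           // no digit on the right

    // Finds every mobile number in text and strips the prefix and hyphens.
    static List<String> extractNormalized(String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = MOBILE_PHONE_PATTERN.matcher(text);
        while (matcher.find()) {
            result.add(matcher.group(3).replace("-", ""));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(extractNormalized(
                "018612345678, +86 186-1234-5678, 0086 13912345678"));
    }
}
```

Capturing the number in its own group avoids a second expression for removing the prefix.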
Landline phone
Leaving extensions aside, a fixed-line number in China generally has two parts: the area code and the local number. The area code has 3 to 4 digits and starts with 0, and the local number has 7 to 8 digits. The area code expression can be:
0[0-9]{2,3}
The local number expression is:
[0-9]{7,8}
The area code may be enclosed in parentheses, and there may be a hyphen between the area code and the local number, as in the following forms:
010-62265678
(010)62265678
The entire area code is optional, so the entire expression is:
(\(?0[0-9]{2,3}\)?-?)?[0-9]{7,8}
After adding the left and right lookaround boundaries, the complete Java representation is:
public static Pattern FIXED_PHONE_PATTERN = Pattern.compile(
        "(?<![0-9])" // no digit on the left
        + "(\\(?0[0-9]{2,3}\\)?-?)?" // area code
        + "[0-9]{7,8}" // local number
        + "(?![0-9])"); // no digit on the right
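A short usage sketch for the fixed-line expression follows. It checks that the three written forms are found and that an 11-digit mobile number is not mistaken for a landline. The class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class FixedPhoneDemo {
    static final Pattern FIXED_PHONE_PATTERN = Pattern.compile(
            "(?<![0-9])"                    // no digit on the left
            + "(\\(?0[0-9]{2,3}\\)?-?)?"    // optional area code
            + "[0-9]{7,8}"                  // local number
            + "(?![0-9])");                 // no digit on the right

    // Returns every fixed-line number found in text.
    static List<String> findAll(String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = FIXED_PHONE_PATTERN.matcher(text);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(findAll("010-62265678, (010)62265678, 62265678"));
        System.out.println(findAll("18612345678")); // []
    }
}
```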
Date
Dates can be written in many ways. We look at only one form, for example:
2017-06-21
2016-11-1
The year, month, and day are separated by hyphens, and the month and day may each have only one digit.
The simplest regular expression can be:
\d{4}-\d{1,2}-\d{1,2}
There is generally no limit on the year, but the month can only be set to 1 to 12, and the day can only be set to 1 to 31. How can this restriction be expressed?
For the month, there are two cases. For January through September, the expression can be:
0?[1-9]
For October through December, the expression can be:
1[0-2]
Therefore, the month expression is:
(0?[1-9]|1[0-2])
For the day, there are three cases:
- 1 to 9, expression: 0?[1-9]
- 10 to 29, expression: [1-2][0-9]
- 30 and 31, expression: 3[01]
Therefore, the entire expression is:
\d{4}-(0?[1-9]|1[0-2])-(0?[1-9]|[1-2][0-9]|3[01])
The complete Java representation is:
public static Pattern DATE_PATTERN = Pattern.compile(
        "(?<![0-9])" // no digit on the left
        + "\\d{4}-" // year
        + "(0?[1-9]|1[0-2])-" // month
        + "(0?[1-9]|[1-2][0-9]|3[01])" // day
        + "(?![0-9])"); // no digit on the right
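The date expression still accepts impossible dates such as 2017-02-30, because the valid day range depends on the month and year. This is a case where a follow-up check in code is simpler than a longer expression. The sketch below is one possible approach: it adds capturing groups around year, month, and day and lets java.time.LocalDate do the calendar check. The class and method names are illustrative.

```java
import java.time.DateTimeException;
import java.time.LocalDate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class DateCheckDemo {
    // Same shape as the date expression, with groups for year, month, and day.
    static final Pattern DATE_PATTERN = Pattern.compile(
            "(\\d{4})-(0?[1-9]|1[0-2])-(0?[1-9]|[1-2][0-9]|3[01])");

    // True only if text has the expected format and names a real calendar date.
    static boolean isRealDate(String text) {
        Matcher matcher = DATE_PATTERN.matcher(text);
        if (!matcher.matches()) {
            return false;
        }
        try {
            // LocalDate.of throws DateTimeException for dates such as February 30.
            LocalDate.of(Integer.parseInt(matcher.group(1)),
                    Integer.parseInt(matcher.group(2)),
                    Integer.parseInt(matcher.group(3)));
            return true;
        } catch (DateTimeException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isRealDate("2016-11-1"));  // true
        System.out.println(isRealDate("2017-02-30")); // false
    }
}
```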
Time
Consider the 24-hour system. Only the hour and minute are considered. The hour and minute are represented by two fixed digits. The format is as follows:
10:57
The basic expression is:
\d{2}:\d{2}
The hour ranges from 0 to 23, so a more precise expression is:
([0-1][0-9]|2[0-3])
The minute value ranges from 0 to 59. The more precise expression is:
[0-5][0-9]
Therefore, the entire expression is:
([0-1][0-9]|2[0-3]):[0-5][0-9]
The complete Java representation is:
public static Pattern TIME_PATTERN = Pattern.compile(
        "(?<![0-9])" // no digit on the left
        + "([0-1][0-9]|2[0-3])" // hour
        + ":" + "[0-5][0-9]" // minute
        + "(?![0-9])"); // no digit on the right
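For whole-string validation, the lookaround boundaries are not needed, since matches() already requires the entire input to match. A minimal sketch, with an illustrative class and method name:

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class TimeCheckDemo {
    // Hour 00 to 23, a colon, minute 00 to 59; both parts are two fixed digits.
    static final Pattern TIME_PATTERN =
            Pattern.compile("([0-1][0-9]|2[0-3]):[0-5][0-9]");

    static boolean isTime(String text) {
        return TIME_PATTERN.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isTime("10:57")); // true
        System.out.println(isTime("24:00")); // false
        System.out.println(isTime("9:30"));  // false, the hour needs two digits
    }
}
```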
ID card
ID cards come in a first generation and a second generation. A first-generation number has 15 digits and a second-generation number has 18. Neither can start with 0. In a second-generation number the last character may be x or X, and all other characters are digits.
The expression for a first-generation ID card can be:
[1-9][0-9]{14}
The expression for a second-generation ID card can be:
[1-9][0-9]{16}[0-9xX]
The first part of the two expressions is the same. The second-generation ID card adds the following part:
[0-9]{2}[0-9xX]
Therefore, they can be combined into an expression, namely:
[1-9][0-9]{14}([0-9]{2}[0-9xX])?
The complete Java representation is:
public static Pattern ID_CARD_PATTERN = Pattern.compile(
        "(?<![0-9])" // no digit on the left
        + "[1-9][0-9]{14}" // first-generation ID card
        + "([0-9]{2}[0-9xX])?" // additional part of the second-generation ID card
        + "(?![0-9])"); // no digit on the right
Is every number that meets this requirement an ID card number? Of course not. ID card numbers have more specific requirements, which are not discussed here.
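Because the second-generation part is an optional capturing group, code can tell the two generations apart by checking whether that group took part in the match. The sketch below shows this. The class and method names are illustrative, and the sample numbers are made up for format testing only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class IdCardDemo {
    static final Pattern ID_CARD_PATTERN =
            Pattern.compile("[1-9][0-9]{14}([0-9]{2}[0-9xX])?");

    // Returns 1 or 2 for the generation, or 0 if the format does not match.
    static int generation(String text) {
        Matcher matcher = ID_CARD_PATTERN.matcher(text);
        if (!matcher.matches()) {
            return 0;
        }
        // group(1) is null when the optional group matched nothing.
        return matcher.group(1) == null ? 1 : 2;
    }

    public static void main(String[] args) {
        System.out.println(generation("110101900101123"));    // 1
        System.out.println(generation("11010119900101123X")); // 2
        System.out.println(generation("012345678901234"));    // 0
    }
}
```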
IP address
The IP address format is as follows:
192.168.3.5
Numbers are separated by periods, and each number ranges from 0 to 255. The simplest expression is:
(\d{1,3}\.){3}\d{1,3}
\d{1,3} is too simple and does not enforce the 0 to 255 constraint. To enforce it, several cases must be considered.
If the value has one digit, it may be preceded by zero to two 0s. The expression is:
0{0,2}[0-9]
If the value has two digits, it may be preceded by one 0. The expression is:
0?[0-9]{2}
If the value has three digits and starts with 1, the last two digits are unrestricted. The expression is:
1[0-9]{2}
If it starts with 2 and the second digit is 0 to 4, the third digit is unrestricted. The expression is:
2[0-4][0-9]
If it starts with 2 and the second digit is 5, the third digit must be 0 to 5. The expression is:
25[0-5]
Therefore, the more precise replacement for \d{1,3} is:
(0{0,2}[0-9]|0?[0-9]{2}|1[0-9]{2}|2[0-4][0-9]|25[0-5])
Therefore, the complete Java representation of the IP address is:
public static Pattern IP_PATTERN = Pattern.compile(
        "(?<![0-9])" // no digit on the left
        + "((0{0,2}[0-9]|0?[0-9]{2}|1[0-9]{2}|2[0-4][0-9]|25[0-5])\\.){3}"
        + "(0{0,2}[0-9]|0?[0-9]{2}|1[0-9]{2}|2[0-4][0-9]|25[0-5])"
        + "(?![0-9])"); // no digit on the right
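Since the 0 to 255 segment appears twice, one way to keep the Java code readable is to hold it in a string constant and build the full expression from it. This is a sketch for whole-string validation only. The names SEGMENT and isIpAddress are illustrative.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class IpAddressDemo {
    // One number from 0 to 255, with optional leading zeros up to three digits.
    static final String SEGMENT =
            "(0{0,2}[0-9]|0?[0-9]{2}|1[0-9]{2}|2[0-4][0-9]|25[0-5])";

    // Three "segment plus period" parts followed by a final segment.
    static final Pattern IP_PATTERN =
            Pattern.compile("(" + SEGMENT + "\\.){3}" + SEGMENT);

    static boolean isIpAddress(String text) {
        return IP_PATTERN.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isIpAddress("192.168.3.5")); // true
        System.out.println(isIpAddress("256.1.1.1"));   // false
    }
}
```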
URL
The URL format is complex. Its specification is defined at https://tools.ietf.org/html/rfc1738. We discuss only the HTTP protocol, whose general format is:
http://<host>:<port>/<path>?<searchpart>
A URL starts with http://, followed by the host name, an optional port, an optional path, and an optional query string introduced by ?.
Some examples:
http://www.example.com
http://www.example.com/ab/c/def.html
http://www.example.com:8080/ab/c/def?q1=abc&q2=def
The host name can contain letters, numbers, minus signs, and periods. Therefore, the expression can be:
[-0-9a-zA-Z.]+
The port can be written as follows:
(:\d+)?
A path consists of one or more sub-paths. Each sub-path starts with a slash (/) followed by zero or more non-slash characters. A simple expression is:
(/[^/]*)*
A more precise version lists all the allowed characters:
(/[-\w$.+!*'(),%;:@&=]*)*
A query string, put simply, starts with ? followed by non-whitespace characters. The expression is:
\?[\S]*
A more precise version lists all the allowed characters:
\?[-\w$.+!*'(),%;:@&=]*
The path and query string are optional, and a query string can appear only if there is at least one path level. The pattern is:
(/<sub_path>(/<sub_path>)*(\?<search>)?)?
Therefore, the simple expression of the path and query section is:
(/[^/]*(/[^/]*)*(\?[\S]*)?)?
The exact expression is:
(/[-\w$.+!*'(),%;:@&=]*(/[-\w$.+!*'(),%;:@&=]*)*(\?[-\w$.+!*'(),%;:@&=]*)?)?
The complete Java expression for HTTP URLs is:
public static Pattern HTTP_PATTERN = Pattern.compile(
        "http://" + "[-0-9a-zA-Z.]+" // host name
        + "(:\\d+)?" // port
        + "(" // optional path and query - start
        + "/[-\\w$.+!*'(),%;:@&=]*" // first-level path
        + "(/[-\\w$.+!*'(),%;:@&=]*)*" // optional further path levels
        + "(\\?[-\\w$.+!*'(),%;:@&=]*)?" // optional query string
        + ")?"); // optional path and query - end
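The sketch below runs the HTTP expression against the three sample URLs, plus a URL that has a query string but no path, which the expression rejects by design. The character class is kept in a constant named CHARS, which is an illustrative name, as are the class and method names.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class HttpUrlDemo {
    // Characters allowed inside a path level or a query string.
    static final String CHARS = "[-\\w$.+!*'(),%;:@&=]*";

    static final Pattern HTTP_PATTERN = Pattern.compile(
            "http://" + "[-0-9a-zA-Z.]+"   // host name
            + "(:\\d+)?"                   // optional port
            + "("                          // optional path and query, start
            + "/" + CHARS                  // first-level path
            + "(/" + CHARS + ")*"          // further path levels
            + "(\\?" + CHARS + ")?"        // optional query string
            + ")?");                       // optional path and query, end

    static boolean isHttpUrl(String text) {
        return HTTP_PATTERN.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isHttpUrl("http://www.example.com"));
        System.out.println(isHttpUrl("http://www.example.com:8080/ab/c/def?q1=abc&q2=def"));
        System.out.println(isHttpUrl("http://www.example.com?q=1")); // false: query without path
    }
}
```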
Email Address
The complete email specification is complex. It is defined at https://tools.ietf.org/html/rfc822. We first look at some rules commonly used in practice.
Take Sina mail as an example. Its addresses look like this:
abc@sina.com
The username must be 4-16 characters long and can contain lowercase letters, numbers, and underscores (_), but cannot start or end with an underscore.
How can the user name be verified? The expression can be:
[a-z0-9][a-z0-9_]{2,14}[a-z0-9]
The complete Java expression of Sina mail is:
public static Pattern SINA_EMAIL_PATTERN = Pattern.compile(
        "[a-z0-9]"
        + "[a-z0-9_]{2,14}"
        + "[a-z0-9]@sina\\.com");
Let's take a look at QQ mail. The user name requirements are as follows:
- It must be 3-18 characters long and can contain English letters, numbers, minus signs, dots, or underscores.
- It must start with an English letter and end with an English letter or number.
- Dots, minus signs, and underscores cannot appear two or more times in a row.
If only the first rule applied, the expression could be:
[-0-9a-zA-Z._]{3,18}
To meet the second rule, change it to:
[a-zA-Z][-0-9a-zA-Z._]{1,16}[a-zA-Z0-9]
How can we meet the third rule? Use a negative lookahead, adding the following expression on the left:
(?![-0-9a-zA-Z._]*(--|\.\.|__))
The complete expression can be:
(?![-0-9a-zA-Z._]*(--|\.\.|__))[a-zA-Z][-0-9a-zA-Z._]{1,16}[a-zA-Z0-9]
The complete Java expression for QQ mail is:
public static Pattern QQ_EMAIL_PATTERN = Pattern.compile(
        "(?![-0-9a-zA-Z._]*(--|\\.\\.|__))" // dots, minus signs, and underscores cannot appear two or more times in a row
        + "[a-zA-Z]" // must start with an English letter
        + "[-0-9a-zA-Z._]{1,16}" // English letters, digits, minus signs, dots, underscores
        + "[a-zA-Z0-9]@qq\\.com"); // must end with an English letter or digit
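To see the negative lookahead at work, the sketch below checks a few user names against the QQ mail expression. The addresses are made-up test inputs, and the class and method names are illustrative.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class QqEmailDemo {
    static final Pattern QQ_EMAIL_PATTERN = Pattern.compile(
            "(?![-0-9a-zA-Z._]*(--|\\.\\.|__))"  // reject "--", "..", or "__" anywhere in the user name
            + "[a-zA-Z]"                         // first character: a letter
            + "[-0-9a-zA-Z._]{1,16}"             // middle characters
            + "[a-zA-Z0-9]@qq\\.com");           // last character: a letter or digit

    static boolean isQqEmail(String text) {
        return QQ_EMAIL_PATTERN.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isQqEmail("a.b-c_d@qq.com")); // true
        System.out.println(isQqEmail("a..b@qq.com"));    // false, two dots in a row
        System.out.println(isQqEmail("abc.@qq.com"));    // false, ends with a dot
    }
}
```

The lookahead consumes no input, so the three parts after it still see the whole user name.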
These are the requirements of specific email service providers. What are the general rules for email addresses? In general, @ is the separator, with the user name before it and the domain name after it.
The general rules for user names are:
- It consists of English letters, digits, underscores, minus signs, and dots.
- It has at least 1 character and at most 64 characters.
- It cannot start with a minus sign, a dot, or an underscore.
For example:
h_llo-abc.good@example.com
The user name expression can be:
[0-9a-zA-Z][-._0-9a-zA-Z]{0,63}
The domain name is divided into parts by periods (.), and there must be at least two parts. The last part is the top-level domain name, which consists of 2 to 3 English letters. The expression can be:
[a-zA-Z]{2,3}
Each of the other period-separated parts of the domain name generally consists of letters, digits, and minus signs, cannot start with a minus sign, and cannot exceed 63 characters. The expression can be:
[0-9a-zA-Z][-0-9a-zA-Z]{0,62}
Therefore, the domain name expression is:
([0-9a-zA-Z][-0-9a-zA-Z]{0,62}\.)+[a-zA-Z]{2,3}
The complete Java representation is:
public static Pattern GENERAL_EMAIL_PATTERN = Pattern.compile(
        "[0-9a-zA-Z][-._0-9a-zA-Z]{0,63}" // user name
        + "@"
        + "([0-9a-zA-Z][-0-9a-zA-Z]{0,62}\\.)+" // domain name parts
        + "[a-zA-Z]{2,3}"); // top-level domain name
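A short check of the general email expression, including one consequence of the rules as stated: because the top-level domain is limited to 2 to 3 letters, an address under a longer top-level domain such as .info is rejected. The class and method names are illustrative.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class GeneralEmailDemo {
    static final Pattern GENERAL_EMAIL_PATTERN = Pattern.compile(
            "[0-9a-zA-Z][-._0-9a-zA-Z]{0,63}"         // user name
            + "@"
            + "([0-9a-zA-Z][-0-9a-zA-Z]{0,62}\\.)+"   // domain name parts
            + "[a-zA-Z]{2,3}");                       // top-level domain name

    static boolean isEmail(String text) {
        return GENERAL_EMAIL_PATTERN.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isEmail("h_llo-abc.good@example.com")); // true
        System.out.println(isEmail("_abc@example.com"));           // false
        System.out.println(isEmail("abc@example.info"));           // false, 4-letter top-level domain
    }
}
```

If longer top-level domains must be accepted, the {2,3} quantifier is the part to adjust for your own requirements.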
Chinese characters
The Unicode code points of Chinese characters generally lie between \u4e00 and \u9fff. Therefore, an expression matching any single Chinese character can be:
[\u4e00-\u9fff]
Java expression:
public static Pattern CHINESE_PATTERN = Pattern.compile( "[\\u4e00-\\u9fff]");
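As a usage sketch, the following counts the Chinese characters in a string. The sample text is written with Unicode escapes so the source file does not depend on its encoding. The class and method names are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class ChineseCharDemo {
    static final Pattern CHINESE_PATTERN = Pattern.compile("[\\u4e00-\\u9fff]");

    // Counts how many characters of text fall in the \u4e00 to \u9fff range.
    static int countChinese(String text) {
        Matcher matcher = CHINESE_PATTERN.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "\u7f16\u7a0b" is a two-character Chinese word.
        System.out.println(countChinese("Java \u7f16\u7a0b 8")); // 2
    }
}
```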
Summary
This section analyzed some common regular expressions in detail. In actual development, some can be used directly, while others need to be adjusted to the specific text and requirements.
This concludes our introduction to regular expressions. I hope you now have a clearer understanding of them!
In the previous sections, we have discussed Java 7. Starting from the next section, we will discuss some features of Java 8, especially functional programming.
(As in other chapters, all the code in this section is located at https://github.com/swiftma/program-logic, under the package shuo.laoma.regex.c90)