If you are not familiar with the term, a "regular expression" is a string of characters that defines a pattern for searching and matching text.
Many languages, including Perl, PHP, Python, JavaScript, and JScript, support regular expressions for text processing, and some text editors use regular expressions to implement advanced search-and-replace functions. So what about Java? At the time of writing, a Java Specification Request that includes a regular expression library for text processing has been approved, and you can expect to see it in the next JDK version.
But what if you need regular expressions now? You can download the open source Jakarta-ORO library from Apache.org. This article first gives a brief introduction to regular expressions and then uses the Jakarta-ORO API as an example to show how to use them.
I. Basic knowledge of regular expressions
Let's start with a simple one. Suppose you want to search for strings containing the characters "cat"; the regular expression to use is "cat". If the search is case insensitive, the words "catalog", "Catherine", and "sophisticated" all match.
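As a rough sketch of this first search, the following uses the JDK's own java.util.regex package rather than Jakarta-ORO, which is introduced later. The class name CatSearch and the method containsCat are made up for illustration.

```java
import java.util.regex.Pattern;

final class CatSearch {
    // The literal pattern "cat", compiled for case-insensitive matching.
    private static final Pattern CAT = Pattern.compile("cat", Pattern.CASE_INSENSITIVE);

    // True when "cat" occurs anywhere in the word, in any letter case.
    static boolean containsCat(String word) {
        return CAT.matcher(word).find();
    }

    public static void main(String[] args) {
        for (String word : new String[] {"catalog", "Catherine", "sophisticated", "dog"}) {
            System.out.println(word + " -> " + containsCat(word));
        }
    }
}
```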
Suppose you are playing an English word game and want to find all three-letter words that start with "t" and end with "n". Suppose also that you have an English dictionary whose entire contents you can search with a regular expression. To construct this expression, you can use the wildcard character, the period ("."). The complete expression is "t.n". It matches "tan", "ten", "tin", and "ton", but also "t#n", "tpn", and even "t n", along with many other meaningless combinations. This is because the period matches any single character, including spaces and tab characters; it matches line breaks as well only when single-line mode is enabled.
To solve the problem of the period matching too broadly, you can list the characters that are meaningful inside square brackets ("[]"). Only the characters specified in the brackets take part in the match. That is, the regular expression "t[aeio]n" matches only "tan", "ten", "tin", and "ton". It does not match "toon", because a bracket expression matches exactly one character.
If you want to match "toon" in addition to all the words matched above, you can use the "|" operator, which means "or". To match "toon" as well, use the regular expression "t(a|e|i|o|oo)n". Square brackets cannot be used here because they match only a single character; parentheses "()" must be used instead. Parentheses can also be used for grouping, as described in a later section.
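To compare the three expressions discussed so far, here is a small sketch. It assumes the JDK's java.util.regex package instead of Jakarta-ORO, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class ThreeLetterWords {
    // Returns true when the whole input matches the regular expression.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String[] regexes = {"t.n", "t[aeio]n", "t(a|e|i|o|oo)n"};
        String[] inputs = {"tan", "t#n", "toon"};
        for (String regex : regexes) {
            for (String input : inputs) {
                System.out.println(regex + " vs " + input + " -> " + matchesWhole(regex, input));
            }
        }
    }
}
```

The period accepts "t#n", the bracket expression rejects it, and only the alternation in parentheses accepts "toon".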
1.4 Specifying the number of matches
Table 1 shows the symbols that specify the number of matches. Each symbol determines how many times the element immediately to its left may occur:
Suppose we want to search for American Social Security numbers in a text file. The number has the format 999-99-9999. Figure 1 shows the regular expression used to match it. Inside square brackets, a hyphen ("-") has a special meaning: it denotes a range, for example from 0 to 9. The hyphens that separate the digit groups are therefore written with a preceding escape character ("\") to mark them as literal; outside square brackets the escape is not strictly required, but it is harmless.
Figure 1: Matches all Social Security numbers in the 123-12-1234 format
Suppose you want the hyphens to be optional, so that both 999-99-9999 and 999999999 are valid formats. In this case, you can add the "?" quantifier after each hyphen, as shown in Figure 2:
Figure 2: Matches all Social Security numbers in the 123-12-1234 and 123121234 formats
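Since the figures themselves are not reproduced here, the following sketch shows one way the two Social Security number expressions could look. The exact patterns are my reconstruction from the description, the code uses java.util.regex rather than Jakarta-ORO, and the names SsnCheck and isSsn are hypothetical.

```java
import java.util.regex.Pattern;

final class SsnCheck {
    // Three digits, hyphen, two digits, hyphen, four digits; both hyphens required.
    static final Pattern STRICT = Pattern.compile("[0-9]{3}\\-[0-9]{2}\\-[0-9]{4}");

    // The "?" after each hyphen makes that hyphen optional.
    static final Pattern OPTIONAL_HYPHENS = Pattern.compile("[0-9]{3}\\-?[0-9]{2}\\-?[0-9]{4}");

    // True when the whole text matches the given pattern.
    static boolean isSsn(Pattern pattern, String text) {
        return pattern.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSsn(STRICT, "123-12-1234"));
        System.out.println(isSsn(STRICT, "123121234"));
        System.out.println(isSsn(OPTIONAL_HYPHENS, "123121234"));
    }
}
```

Note that the optional form also accepts mixed input such as 123-121234, because each hyphen is optional independently of the other.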
Let's look at another example. One format for American license plates is four digits followed by two letters. Its regular expression consists of the digit part "[0-9]{4}" followed by the letter part "[A-Z]{2}". Figure 3 shows the complete regular expression.
Figure 3: Matches a typical U.S. license plate number, such as 8836KV
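The license plate expression can also be used to pull every plate out of a longer text. This is only a sketch under the assumption that java.util.regex is available; the class name PlateFinder and the sample sentence are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PlateFinder {
    // Exactly four digits followed by exactly two uppercase letters.
    private static final Pattern PLATE = Pattern.compile("[0-9]{4}[A-Z]{2}");

    // Collects every substring of the text that matches the plate pattern.
    static List<String> findPlates(String text) {
        List<String> plates = new ArrayList<>();
        Matcher matcher = PLATE.matcher(text);
        while (matcher.find()) {
            plates.add(matcher.group());
        }
        return plates;
    }

    public static void main(String[] args) {
        System.out.println(findPlates("Seen: 8836KV and 1204AB, not 88KV."));
    }
}
```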
The "^" symbol is called the "not" symbol. When it appears as the first character inside square brackets, "^" indicates the characters that you do not want to match. For example, the regular expression in Figure 4 matches all words except those starting with the letter "X".
Figure 4: Matches all words except those starting with "X"
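The expression in Figure 4 is not reproduced here, so the pattern below, "[^X][a-z]+", is only my guess at an expression with the described behavior. The sketch uses java.util.regex, and the class name is hypothetical.

```java
import java.util.regex.Pattern;

final class NotStartingWithX {
    // Any first character except uppercase "X", then one or more lowercase letters.
    private static final Pattern WORD = Pattern.compile("[^X][a-z]+");

    // True when the whole word matches the pattern.
    static boolean accepts(String word) {
        return WORD.matcher(word).matches();
    }

    public static void main(String[] args) {
        for (String word : new String[] {"Yellow", "Xenon", "xenon"}) {
            System.out.println(word + " -> " + accepts(word));
        }
    }
}
```

Because matching is case sensitive by default, this pattern rejects "Xenon" but still accepts "xenon".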
1.6 Parentheses and whitespace
Suppose you want to extract the month from a birth date in the format "June 26, 1951". Figure 5 shows a regular expression that matches such a date:
Figure 5: Matches all dates in the Month DD, YYYY format
The new "\s" symbol is the whitespace symbol; it matches any whitespace character, including tabs. Once the string matches, how do we extract the month? Simply put parentheses around the month part to create a group, and then use the ORO API (discussed later in this article) to retrieve its value. Figure 6 shows the modified regular expression:
Figure 6: Matches all dates in the Month DD, YYYY format and defines the month as the first group
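To make the grouping idea concrete, here is a sketch that extracts the month with a capture group. The date pattern is my own assumption about what such an expression might look like, and the code uses java.util.regex rather than the ORO API; DateMonth and extractMonth are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DateMonth {
    // Group 1 captures the month name; the day and year are matched but not captured.
    private static final Pattern DATE =
            Pattern.compile("([A-Za-z]+)\\s+[0-9]{1,2},\\s*[0-9]{4}");

    // Returns the month name, or null when the text is not in Month DD, YYYY format.
    static String extractMonth(String text) {
        Matcher matcher = DATE.matcher(text);
        return matcher.matches() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractMonth("June 26, 1951"));
    }
}
```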
For simplicity, you can use the shorthand symbols defined for common regular expressions. See Table 2:
For example, in the earlier Social Security number example, "\d" can be used everywhere "[0-9]" appears. Figure 7 shows the modified regular expression:
Figure 7: Matches all Social Security numbers in the 123-12-1234 format
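A quick way to convince yourself that the shorthand is equivalent is to run both forms side by side. The sketch below assumes java.util.regex, where "\d" matches the digits 0 to 9 by default; the patterns are reconstructed from the description and the class name is made up.

```java
import java.util.regex.Pattern;

final class DigitShorthand {
    // The long form, using an explicit digit range.
    static final Pattern LONG_FORM = Pattern.compile("[0-9]{3}\\-[0-9]{2}\\-[0-9]{4}");

    // The same expression with "\d" in place of "[0-9]".
    static final Pattern SHORT_FORM = Pattern.compile("\\d{3}\\-\\d{2}\\-\\d{4}");

    // True when both patterns give the same verdict for the whole text.
    static boolean agree(String text) {
        return LONG_FORM.matcher(text).matches() == SHORT_FORM.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(agree("123-12-1234"));
        System.out.println(agree("123-12-12a4"));
    }
}
```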
Many open source regular expression libraries are available to Java programmers, and many of them support a regular expression syntax compatible with Perl 5. I chose the Jakarta-ORO regular expression library here. It is one of the most comprehensive regular expression APIs, it is fully compatible with Perl 5 regular expressions, and it is also one of the best optimized.
The Jakarta-ORO library was formerly known as OROMatcher; Daniel Savarese generously donated it to the Jakarta Project. You can download it by following the reference at the end of this article.
I will first briefly introduce the objects that you must create and access when using the Jakarta-ORO library, and then show how to use the Jakarta-ORO API.
First, create an instance of the Perl5Compiler class and assign it to a variable of the PatternCompiler interface type. Perl5Compiler is an implementation of the PatternCompiler interface; it lets you compile a regular expression into a Pattern object used for matching.
To compile a regular expression into a Pattern object, call the compile() method of the compiler object and pass the regular expression as an argument. For example, you can compile the regular expression "t[aeio]n" in the following way:
By default, the compiler creates a case-sensitive pattern. The pattern compiled above therefore matches only "tin", "tan", "ten", and "ton", but not "Tin" or "taN". To create a case-insensitive pattern, pass an additional argument when calling the compiler:
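For comparison, the same compile step looks like this with the JDK's java.util.regex package. This is not the Jakarta-ORO API; it is an analogous sketch in which Pattern.compile plays the role of the compiler and Pattern.CASE_INSENSITIVE plays the role of the extra argument. The class name CaseModes is hypothetical.

```java
import java.util.regex.Pattern;

final class CaseModes {
    // Default compilation: matching is case sensitive.
    static final Pattern SENSITIVE = Pattern.compile("t[aeio]n");

    // An extra flag argument requests case-insensitive matching.
    static final Pattern INSENSITIVE = Pattern.compile("t[aeio]n", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) {
        for (String word : new String[] {"tin", "Tin", "taN"}) {
            System.out.println(word
                    + " sensitive=" + SENSITIVE.matcher(word).matches()
                    + " insensitive=" + INSENSITIVE.matcher(word).matches());
        }
    }
}
```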
After creating a Pattern object, you can use the PatternMatcher class to perform pattern matching.
A PatternMatcher object checks for matches based on a Pattern object and a string. Instantiate the Perl5Matcher class and assign the result to a variable of the PatternMatcher interface type. Perl5Matcher is an implementation of the PatternMatcher interface that performs pattern matching according to Perl 5 regular expression syntax:
With the PatternMatcher object, you can perform matching through several methods. The first parameter of each method is the string to be matched against the regular expression:
· boolean matches(String input, Pattern pattern): use this when the input string must match the regular expression exactly. In other words, the regular expression must describe the entire input string.
· boolean matchesPrefix(String input, Pattern pattern): use this when the regular expression must match the beginning of the input string.
· boolean contains(String input, Pattern pattern): use this when the regular expression must match some part of the input string (that is, a substring).
In each of the three method calls above, you can also pass a PatternMatcherInput object instead of a String. Matching can then continue from the position where the last match ended. This is useful when a string may contain several substrings that match a given regular expression. With a PatternMatcherInput parameter in place of the String, the three methods have the following signatures:
· boolean matches(PatternMatcherInput input, Pattern pattern)
· boolean matchesPrefix(PatternMatcherInput input, Pattern pattern)
· boolean contains(PatternMatcherInput input, Pattern pattern)
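The three kinds of match described above have close counterparts in the JDK's java.util.regex package, which I use here only as an analogy to the Jakarta-ORO methods: Matcher.matches() checks the whole input, lookingAt() checks a prefix, and find() checks for a substring. Calling find() repeatedly on the same Matcher resumes after the previous match, much as a PatternMatcherInput does. The class and method names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatchModes {
    private static final Pattern TN = Pattern.compile("t[aeio]n");

    // The pattern must describe the entire input.
    static boolean whole(String input) {
        return TN.matcher(input).matches();
    }

    // The pattern must match at the start of the input.
    static boolean prefix(String input) {
        return TN.matcher(input).lookingAt();
    }

    // The pattern must match somewhere in the input.
    static boolean anywhere(String input) {
        return TN.matcher(input).find();
    }

    // Each find() call continues from the end of the previous match.
    static List<String> allMatches(String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = TN.matcher(input);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(whole("tango") + " " + prefix("tango") + " " + anywhere("satin"));
        System.out.println(allMatches("tan ten button"));
    }
}
```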
Next, let's look at some example applications of the Jakarta-ORO library.
Task: analyze a Web server log file to determine how much time each user spends on the website. In a typical BEA WebLogic log file, entries have the following format:
Examining this log entry shows that two items need to be extracted from the log file: the IP address and the page access time. You can use grouping symbols (parentheses) to extract the IP address and the timestamp from each entry.
Let's look at the IP address first. An IP address consists of four bytes, each with a value from 0 to 255, separated by periods. Each byte therefore has at least one and at most three digits. Figure 8 shows the regular expression for the IP address:
The periods in the IP address must be escaped (preceded by "\") because they stand for literal periods here, not for the special wildcard meaning that a period has in regular expression syntax, as described earlier.
The time part of the log entry is enclosed in a pair of square brackets. You can extract everything inside the brackets as follows: first find the opening square bracket ("["), then extract all characters that are not a closing square bracket ("]"), continuing until the closing bracket is found. Figure 9 shows the regular expression for this part.
Figure 9: Matches at least one character, until "]" is found
Now combine the two regular expressions, each wrapped in grouping symbols (parentheses), into a single expression so that the IP address and the time can both be extracted from a log entry. Note that "\s-\s" is added to the regular expression to match (but not extract) the "- -" between them. Figure 10 shows the complete regular expression.
Figure 10: Matches the IP address and the timestamp
Now that the regular expression has been written, you can write the Java code that uses the regular expression library.
To use the Jakarta-ORO library, first create the regular expression string and the log entry string to be analyzed:
The regular expression used here is almost the same as the one in Figure 10, with one difference: in Java, you must escape every backslash ("\"). Figure 10 is not written in Java notation, so a "\" must be added before each "\" to avoid compilation errors. Unfortunately, this escaping is error-prone, so take care. One approach is to type the unescaped regular expression first and then replace each "\" with "\\" from left to right. To check the result, you can print the string to the screen.
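The escaping rule is easy to check on a small fragment. The sketch below takes a two-byte piece of an IP address expression, writes it as a Java string literal, and prints it back. The fragment and the names EscapeDemo and REGEX are chosen for illustration, and java.util.regex stands in for the regular expression library.

```java
import java.util.regex.Pattern;

final class EscapeDemo {
    // Regular expression as designed:      [0-9]{1,3}\.[0-9]{1,3}
    // The same expression as a Java literal: every "\" is doubled.
    static final String REGEX = "[0-9]{1,3}\\.[0-9]{1,3}";

    // True when the whole text matches the expression.
    static boolean matches(String text) {
        return Pattern.matches(REGEX, text);
    }

    public static void main(String[] args) {
        // Printing shows the single backslash that the regex engine actually receives.
        System.out.println(REGEX);
        System.out.println(matches("10.25"));
        System.out.println(matches("10x25"));
    }
}
```

Printing the string is a cheap review step: if the output is the expression you designed, the escaping is correct.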
After initializing the strings, instantiate a PatternCompiler object and use it to compile the regular expression into a Pattern object:
Now create a PatternMatcher object and call the contains() method of the PatternMatcher interface to check for a match:
Next, use the MatchResult object returned by the PatternMatcher interface to print the matched groups. Because the logEntry string contains matching content, you will see output similar to the following:
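The whole log example can be sketched end to end as follows. Several things here are assumptions rather than the original listing: the sample log line is invented to fit the description (an IP address, "- -", and a timestamp in square brackets), the expression is my reconstruction with "\s-\s-\s" between the two groups, and java.util.regex is used in place of the Jakarta-ORO classes. The names LogEntryParser and parse are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LogEntryParser {
    // Group 1: four dot-separated bytes of one to three digits each.
    // Group 2: everything between "[" and the next "]".
    private static final Pattern ENTRY = Pattern.compile(
            "([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3})\\s-\\s-\\s\\[([^\\]]+)\\]");

    // Returns {ipAddress, timestamp}, or null when the entry does not match.
    static String[] parse(String logEntry) {
        Matcher matcher = ENTRY.matcher(logEntry);
        if (!matcher.find()) {
            return null;
        }
        return new String[] {matcher.group(1), matcher.group(2)};
    }

    public static void main(String[] args) {
        // Hypothetical log line, made up for this sketch.
        String logEntry = "172.26.155.241 - - [26/Feb/2001:10:56:03 -0500] "
                + "\"GET /IsAlive.htm HTTP/1.0\" 200 15";
        String[] parts = parse(logEntry);
        if (parts != null) {
            System.out.println("IP: " + parts[0]);
            System.out.println("Timestamp: " + parts[1]);
        }
    }
}
```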
3.2 HTML processing example 1 |
The next task is to analyze all the attributes of the FONT tags in an HTML page. A typical FONT tag in an HTML page looks like this:
The program prints the attributes of each FONT tag in the following form:
For this task, I recommend using two regular expressions. The first, shown in Figure 11, extracts the attribute string face="Arial, Serif" size="+2" color="red" from the FONT tag.
Figure 11: Matches all attributes of a FONT tag
The second regular expression, shown in Figure 12, splits each attribute into a name-value pair.
Figure 12: Matches a single attribute and splits it into a name-value pair
Now let's look at the Java code that performs this task. First create the two regular expression strings and compile them into Pattern objects with Perl5Compiler. When compiling, specify the Perl5Compiler.CASE_INSENSITIVE_MASK option so that matching is case insensitive.
Next, create a Perl5Matcher object to perform the matching.
Suppose a String variable named html holds one line of the HTML file. If the html string contains a FONT tag, the matcher returns true. You can then use the MatchResult object returned by the matcher to obtain the first group, which contains all the attributes of the FONT tag:
Next, create a PatternMatcherInput object. This object lets you continue matching from the position of the last match, which makes it well suited to extracting the name-value pairs of the attributes in the FONT tag. Create the PatternMatcherInput object, passing the string to be matched as a parameter. Then use the matcher instance to extract each attribute of the FONT tag by calling the matcher's contains() method repeatedly, with the PatternMatcherInput object as the parameter instead of a String. On each iteration, the PatternMatcherInput object advances its internal pointer, so the next search starts just after the previous match.
The output of this example is as follows:
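The two-step approach can be sketched as follows. Both expressions are my own reconstruction of what the two figures describe, and the code uses java.util.regex, where repeated find() calls take the place of the PatternMatcherInput loop. The class name FontAttributes and the method parse are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FontAttributes {
    // Step 1: capture everything between "<font" and ">" as group 1.
    private static final Pattern FONT_TAG =
            Pattern.compile("<\\s*font\\s+([^>]*)\\s*>", Pattern.CASE_INSENSITIVE);

    // Step 2: capture one attribute as a name (group 1) and a quoted value (group 2).
    private static final Pattern ATTRIBUTE =
            Pattern.compile("([a-z]+)\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // Returns the attributes of the first FONT tag in document order, or an empty map.
    static Map<String, String> parse(String html) {
        Map<String, String> attributes = new LinkedHashMap<>();
        Matcher tag = FONT_TAG.matcher(html);
        if (tag.find()) {
            Matcher attribute = ATTRIBUTE.matcher(tag.group(1));
            // Each find() resumes just after the previous attribute.
            while (attribute.find()) {
                attributes.put(attribute.group(1), attribute.group(2));
            }
        }
        return attributes;
    }

    public static void main(String[] args) {
        String html = "<font face=\"Arial, Serif\" size=\"+2\" color=\"red\">";
        for (Map.Entry<String, String> entry : parse(html).entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```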
3.3 HTML processing example 2 |
Let's look at another HTML processing example. This time, assume that the Web server has moved from widgets.acme.com to newserver.acme.com, and you need to update some of the links in your pages:
Figure 13 shows the regular expression that performs this search:
Figure 13: Matches the link before modification
If this regular expression matches, you can replace the link matched in Figure 13 with the following content:
Note the "$1" that follows the "#" character. Perl regular expression syntax uses $1, $2, and so on to refer to the groups that were matched and captured. The replacement appends whatever the expression in Figure 13 matched and captured as a group to the end of the link.
Now back to Java. As before, you must create the test strings, create the objects needed to compile the regular expression into a Pattern object, and create a PatternMatcher object:
Next, perform the replacement with the static substitute() method of the Util class in the com.oroinc.text.regex package, and print the resulting string:
The syntax of the Util.substitute() method is as follows:
The first two parameters of this call are the PatternMatcher and Pattern objects created earlier. The third parameter is a Substitution object, which determines how the replacement is carried out. This example uses a Perl5Substitution object, which performs Perl 5 style substitution. The fourth parameter is the string on which the replacement is performed. The last parameter specifies whether to replace all substrings that match the pattern (Util.SUBSTITUTE_ALL) or only a given number of them.
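As an illustration of substitution with a captured group, here is a sketch that rewrites links from the old server to the new one. The link layout (an interface.html page with a "#" anchor) and the expression are assumptions of mine, since the original listing is not reproduced here, and the code uses Matcher.replaceAll from java.util.regex rather than Util.substitute(). The class name LinkRewriter is hypothetical.

```java
import java.util.regex.Pattern;

final class LinkRewriter {
    // Group 1 captures the anchor name that follows "#" in the old link.
    private static final Pattern OLD_LINK = Pattern.compile(
            "<\\s*a\\s+href\\s*=\\s*\"http://widgets\\.acme\\.com/interface\\.html#([^\"]+)\">");

    // "$1" in the replacement is filled in with the text captured by group 1.
    private static final String NEW_LINK =
            "<a href=\"http://newserver.acme.com/interface.html#$1\">";

    // Replaces every old-style link in the text.
    static String rewrite(String html) {
        return OLD_LINK.matcher(html).replaceAll(NEW_LINK);
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://widgets.acme.com/interface.html#How_To_Buy\">Buy</a>";
        System.out.println(rewrite(html));
    }
}
```

In this API, replaceAll corresponds to replacing every match; Matcher.replaceFirst would replace only the first one.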
[Conclusion] In this article, I introduced the powerful capabilities of regular expressions, which are very useful for extracting strings and modifying text, and showed how to use them in Java programs through the Jakarta-ORO library. Whether you stay with the traditional string processing approach (StringTokenizer, charAt, and substring) or switch to regular expressions is up to you.