First, Regular Expression Basics
Let's start with something simple. Suppose you are searching for strings that contain the characters "cat"; the regular expression to use is simply "cat". If the search is not case sensitive, the words "catalog", "Catherine", and "sophisticated" all match. In other words, the expression matches those three letters wherever they appear inside a word.
1.1 Period Symbol
Suppose you are playing English Scrabble and want to find three-letter words that start with the letter "t" and end with the letter "n". If you also have an English dictionary, you can use a regular expression to search its entire contents. To construct this expression, use a wildcard character, the period symbol ".". The complete expression is "t.n", which matches "tan", "ten", "tin", and "ton", but also "t#n", "tpn", and even "t n", along with many other meaningless combinations. This is because the period matches any single character, including spaces and tab characters. In Perl 5 syntax it does not match a line break by default; that requires single-line mode.
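The behavior of the period can be checked with a few lines of code. This is a minimal sketch that uses the JDK's own java.util.regex package as a stand-in rather than Jakarta-ORO, so it runs without extra libraries; the class and method names (PeriodDemo, isMatch) are made up for illustration.

```java
import java.util.regex.Pattern;

class PeriodDemo {
    // "t.n": the letter t, any single character, then the letter n.
    static final Pattern T_DOT_N = Pattern.compile("t.n");

    // True when the whole word is described by the pattern.
    static boolean isMatch(String word) {
        return T_DOT_N.matcher(word).matches();
    }

    public static void main(String[] args) {
        for (String w : new String[] {"tan", "ten", "t#n", "t n", "toon"}) {
            System.out.println("\"" + w + "\" -> " + isMatch(w));
        }
    }
}
```

Only "toon" is rejected here, because the period stands for exactly one character.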
1.2 Square Bracket Symbol
To narrow the overly broad match of the period symbol, you can list the characters that make sense inside square brackets ("[]"). Only the characters listed inside the brackets then take part in the match. In other words, the regular expression "t[aeio]n" matches only "tan", "ten", "tin", and "ton". "toon" does not match, because a bracket expression matches exactly one character.
1.3 "Or" Symbol
If you also want to match "toon" in addition to all the words matched above, you can use the "|" operator, whose basic meaning is "or". To match "toon", use the regular expression "t(a|e|i|o|oo)n". Square brackets cannot be used here because they match only a single character, so parentheses "()" are required. Parentheses can also be used for grouping, as described later.
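The difference between a bracket expression and alternation can be sketched as follows. The example again relies on java.util.regex as an assumption made for self-containment, not on Jakarta-ORO, and the class name BracketVsAlternation is hypothetical.

```java
import java.util.regex.Pattern;

class BracketVsAlternation {
    // One character out of the set a, e, i, o between t and n.
    static final Pattern BRACKETS = Pattern.compile("t[aeio]n");
    // Alternation inside parentheses may list multi-character choices such as "oo".
    static final Pattern ALTERNATION = Pattern.compile("t(a|e|i|o|oo)n");

    static boolean bracketsMatch(String word) {
        return BRACKETS.matcher(word).matches();
    }

    static boolean alternationMatch(String word) {
        return ALTERNATION.matcher(word).matches();
    }

    public static void main(String[] args) {
        System.out.println("brackets, toon:    " + bracketsMatch("toon"));
        System.out.println("alternation, toon: " + alternationMatch("toon"));
    }
}
```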
1.4 Quantifier Symbols
Table 1 shows the symbols that specify the number of matches. Each one determines how many times the symbol immediately to its left may occur:
Suppose we want to search a text file for United States Social Security numbers. The format of this number is 999-99-9999. The regular expression used to match it is shown in Figure 1. Inside square brackets, a hyphen ("-") has a special meaning: it denotes a range, such as 0 to 9. For that reason the hyphens in the Social Security number are preceded by the escape character "\". Outside square brackets a hyphen is an ordinary character, so the escape is harmless there but not strictly required.
Figure 1: Matches all Social Security numbers of the form 123-12-1234
If you want the hyphens to be optional when searching, so that both 999-99-9999 and 999999999 count as valid, add the "?" quantifier after each hyphen, as shown in Figure 2:
Figure 2: Matches all Social Security numbers of the form 123-12-1234 or 123121234
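Since the figures themselves are not reproduced here, the two patterns below are a reconstruction from the description above, not the article's exact expressions. The sketch uses java.util.regex, and the names SsnDemo, isStrict, and isLoose are invented for the example.

```java
import java.util.regex.Pattern;

class SsnDemo {
    // {n} means "exactly n occurrences" of the item to its left.
    static final Pattern STRICT =
            Pattern.compile("[0-9]{3}-[0-9]{2}-[0-9]{4}");
    // "?" makes the hyphen to its left optional (zero or one occurrence).
    static final Pattern OPTIONAL_HYPHEN =
            Pattern.compile("[0-9]{3}-?[0-9]{2}-?[0-9]{4}");

    static boolean isStrict(String s) {
        return STRICT.matcher(s).matches();
    }

    static boolean isLoose(String s) {
        return OPTIONAL_HYPHEN.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isStrict("123-12-1234") + " " + isStrict("123121234"));
        System.out.println(isLoose("123-12-1234") + " " + isLoose("123121234"));
    }
}
```

Note that the looser pattern also accepts mixed forms such as 123-121234, because each hyphen is optional independently of the other.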
Let's look at another example. One format for American license plates is four digits followed by two letters. Its regular expression consists of the digit part "[0-9]{4}" followed by the letter part "[A-Z]{2}". Figure 3 shows the complete regular expression.
Figure 3: Matches a typical U.S. license plate number, such as 8836KV
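Joining the two parts described above gives a pattern that can be tried directly. This is a small sketch with java.util.regex; the class name PlateDemo is hypothetical, and uppercase letters are assumed because the sample plate uses them.

```java
import java.util.regex.Pattern;

class PlateDemo {
    // Four digits followed by two uppercase letters.
    static final Pattern PLATE = Pattern.compile("[0-9]{4}[A-Z]{2}");

    static boolean isPlate(String s) {
        return PLATE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isPlate("8836KV"));
    }
}
```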
1.5 "Not" Symbol
The "^" symbol is called the "not" symbol. When it is the first character inside square brackets, "^" means that the listed characters must not be matched. For example, the regular expression in Figure 4 matches all words except those that begin with the letter "X".
Figure 4: Matches all words except those beginning with "X"
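One possible way to write such an expression is sketched below. The pattern is an assumption, since Figure 4 is not available, and it excludes both "X" and "x" as the first letter; NotDemo is a made-up class name and java.util.regex stands in for Jakarta-ORO.

```java
import java.util.regex.Pattern;

class NotDemo {
    // [^Xx] matches any single character except X or x;
    // the rest of the word is made of letters only.
    static final Pattern NOT_X_WORD = Pattern.compile("[^Xx][A-Za-z]*");

    static boolean isMatch(String word) {
        return NOT_X_WORD.matcher(word).matches();
    }

    public static void main(String[] args) {
        System.out.println(isMatch("Apple") + " " + isMatch("Xray"));
    }
}
```

A negated set accepts every other character, including digits and punctuation, so a stricter first position may be preferable in practice.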
1.6 Parentheses and Whitespace Symbols
Suppose you want to extract the month from a birth date formatted as "June 26, 1951". The regular expression that matches such a date is shown in Figure 5:
Figure 5: Matches all dates in Month DD, YYYY format
The new "\s" symbol is the whitespace symbol; it matches any whitespace character, including the tab character. Once the string matches, how do you extract the month? Simply put parentheses around the month part to create a group, then retrieve its value through the ORO API (discussed in detail later in this article). The modified regular expression is shown in Figure 6:
Figure 6: Matches all dates in Month DD, YYYY format and defines the month as the first group
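Group extraction can be illustrated with the following sketch. The pattern is a plausible reconstruction rather than the original Figure 6, and the code uses java.util.regex, where group(1) plays the role that the ORO MatchResult plays in the article; MonthExtractor and extractMonth are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MonthExtractor {
    // Group 1 captures the month name; \s+ matches the whitespace
    // between the parts. Backslashes are doubled in Java string literals.
    static final Pattern DATE =
            Pattern.compile("([A-Za-z]+)\\s+[0-9]{1,2},\\s+[0-9]{4}");

    // Returns the month name, or null when the input is not in Month DD, YYYY form.
    static String extractMonth(String date) {
        Matcher m = DATE.matcher(date);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractMonth("June 26, 1951"));
    }
}
```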
1.7 Other Symbols
For convenience, you can use shortcut symbols defined for commonly used expressions, as shown in Table 2:
Table 2: Common symbols
For example, in the earlier Social Security number example, every occurrence of "[0-9]" can be replaced with "\d". The modified regular expression is shown in Figure 7:
Figure 7: Matches all Social Security numbers of the form 123-12-1234
Second, the Jakarta-ORO Library
Many open source regular expression libraries are available to Java programmers, and many of them support Perl 5 compatible regular expression syntax. The one I have chosen here is the Jakarta-ORO library, which is among the most comprehensive regular expression APIs and is fully compatible with Perl 5 regular expressions. It is also one of the best-optimized APIs.
The Jakarta-ORO library was formerly known as OROMatcher; Daniel Savarese generously donated it to the Jakarta Project. See the resource description at the end of this article for download information.
I will start with a brief introduction to the objects you must create and access when using the Jakarta-ORO library, and then describe how to use the Jakarta-ORO API.
▲PatternCompiler Object
First, create an instance of the Perl5Compiler class and assign it to a PatternCompiler interface object. Perl5Compiler is an implementation of the PatternCompiler interface that compiles regular expressions into Pattern objects used for matching.
▲Pattern Object
To compile a regular expression into a Pattern object, call the compile() method of the compiler object and pass the regular expression as the argument. For example, you can compile the regular expression "t[aeio]n" in this way:
By default, the compiler creates a case-sensitive pattern. As a result, the pattern compiled above matches only "tin", "tan", "ten", and "ton", but not "Tin" or "taN". To create a case-insensitive pattern, pass an additional parameter when invoking the compiler:
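The effect of the extra compile-time option can be sketched without the ORO jar by using the JDK's java.util.regex, which is an assumption made here for self-containment. Its Pattern.compile(regex) and Pattern.compile(regex, flags) calls are rough analogues of the compiler's one- and two-argument compile() methods, and Pattern.CASE_INSENSITIVE corresponds in spirit to the case-insensitive option of Perl5Compiler. CompileDemo is a hypothetical class name.

```java
import java.util.regex.Pattern;

class CompileDemo {
    // Compiled without options: matching is case-sensitive.
    static final Pattern SENSITIVE = Pattern.compile("t[aeio]n");
    // Compiled with a flag: letter case is ignored.
    static final Pattern INSENSITIVE =
            Pattern.compile("t[aeio]n", Pattern.CASE_INSENSITIVE);

    static boolean sensitive(String s) {
        return SENSITIVE.matcher(s).matches();
    }

    static boolean insensitive(String s) {
        return INSENSITIVE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(sensitive("Tin") + " " + insensitive("Tin"));
    }
}
```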
Once you have created the Pattern object, you can use the PatternMatcher class to perform pattern matching with it.
▲PatternMatcher Object
The PatternMatcher object checks a string against a Pattern object. You instantiate the Perl5Matcher class and assign the result to the PatternMatcher interface. Perl5Matcher is an implementation of the PatternMatcher interface that performs pattern matching based on Perl 5 regular expression syntax:
With the PatternMatcher object you can perform matching through several methods. The first parameter of each is the string to be matched against the regular expression:
· boolean matches(String input, Pattern pattern): used when the input string and the regular expression must match exactly. In other words, the regular expression must describe the entire input string.
· boolean matchesPrefix(String input, Pattern pattern): used when the regular expression must match the beginning of the input string.
· boolean contains(String input, Pattern pattern): used when the regular expression must match some part of the input string (that is, a substring).
In addition, you can pass a PatternMatcherInput object instead of a String in the three method calls above, which lets matching continue from the position of the last match in the string. This is useful when a string may contain more than one substring that matches the given regular expression. With a PatternMatcherInput parameter, the signatures of the three methods are as follows:
· boolean matches(PatternMatcherInput input, Pattern pattern)
· boolean matchesPrefix(PatternMatcherInput input, Pattern pattern)
· boolean contains(PatternMatcherInput input, Pattern pattern)
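The three kinds of matching, and the idea of resuming from the previous match, can be tried out with the JDK's java.util.regex as a stand-in. The mapping used here is an approximation chosen for this sketch: Matcher.matches() for a whole-string match, Matcher.lookingAt() for a prefix match, and Matcher.find() for a substring match, with repeated find() calls playing the role of a PatternMatcherInput. MatchModes and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchModes {
    static final Pattern P = Pattern.compile("t[aeio]n");

    // The pattern must describe the entire input.
    static boolean matchesWhole(String input) {
        return P.matcher(input).matches();
    }

    // The pattern must match at the start of the input.
    static boolean matchesPrefix(String input) {
        return P.matcher(input).lookingAt();
    }

    // The pattern must match somewhere inside the input.
    static boolean contains(String input) {
        return P.matcher(input).find();
    }

    // Each find() resumes after the previous match, collecting every hit.
    static List<String> allMatches(String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = P.matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("tank") + " " + matchesPrefix("tank")
                + " " + contains("satan"));
        System.out.println(allMatches("tan ten tun ton"));
    }
}
```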
Third, Application Examples
Let's look at some application examples of the Jakarta-ORO library.
3.1 Log File Processing
Task: analyze a Web server log file to determine how long each user spends on the site. In a typical BEA WebLogic log file, a log record has the following format:
Examining this log record shows that two items need to be extracted: the IP address and the page access time. You can extract both from the log record with the grouping symbol (parentheses).
First, let's look at the IP address. An IP address consists of 4 bytes, each with a value between 0 and 255, separated by periods. Each byte therefore has at least one and at most three digits. Figure 8 shows a regular expression written for an IP address:
Figure 8: Matches an IP address
The period characters in the IP address must be escaped (preceded by "\") because here the period has its literal meaning rather than its special meaning in regular expression syntax. That special meaning was described earlier in this article.
The time portion of the log record is enclosed in a pair of square brackets. You can extract everything inside them as follows: first search for the opening square bracket ("["), then extract everything that is not a closing square bracket ("]"), continuing forward until the closing bracket is found. Figure 9 shows the regular expression for this part.
Figure 9: Matches at least one character until "]" is found
Now combine the two regular expressions above into a single expression, using the grouping symbol (parentheses) so that the IP address and the time can be extracted from the log record. Note that "\s-\s-\s" is added in the middle to match "- -" without extracting it. The complete regular expression is shown in Figure 10.
Figure 10: Matches the IP address and time stamp
Now that the regular expression has been written, you can write the Java code that uses the regular expression library.
To use the Jakarta-ORO library, first create the regular expression string and the log record string to be parsed:
The regular expression used here is almost identical to the one in Figure 10, with one exception: in a Java string literal you must escape every backslash ("\"). Figure 10 is not written in Java notation, so each "\" must be preceded by another "\" to avoid compilation errors. Unfortunately, this escaping is error-prone, so be careful. One approach is to type the unescaped regular expression first and then replace each "\" with "\\" from left to right. To double-check the result, print the string to the screen.
After initializing the strings, instantiate the PatternCompiler object and use it to compile the regular expression into a Pattern object:
Now create the PatternMatcher object and call the contains() method of the PatternMatcher interface to check for a match:
Next, use the MatchResult object returned by the PatternMatcher interface to output the matched groups. Because the logEntry string contains matching content, the output looks like the following:
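The whole log-processing walk-through can be approximated in a self-contained form. The sketch below makes several assumptions: it uses java.util.regex instead of Jakarta-ORO, the regular expression is rebuilt from the prose description rather than copied from Figure 10, and the sample log line is invented in the common access-log layout. LogEntryParser, SAMPLE, and parse are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogEntryParser {
    // Made-up sample record: IP address, " - - ", then a bracketed time stamp.
    static final String SAMPLE =
            "172.26.155.241 - - [26/Feb/2001:10:56:03 -0500] \"GET /IsAlive.htm HTTP/1.0\" 200 15";

    // Group 1: dotted IP address. Group 2: everything between [ and ].
    // Every regex backslash is doubled because this is a Java string literal.
    static final Pattern ENTRY = Pattern.compile(
            "([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3})"
            + "\\s-\\s-\\s"
            + "\\[([^\\]]+)\\]");

    // Returns {ip, timestamp}, or null when the record does not match.
    static String[] parse(String logEntry) {
        Matcher m = ENTRY.matcher(logEntry);
        if (!m.find()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2)};
    }

    public static void main(String[] args) {
        String[] parts = parse(SAMPLE);
        System.out.println("IP: " + parts[0]);
        System.out.println("Timestamp: " + parts[1]);
    }
}
```

Printing ENTRY.pattern() is a quick way to confirm that the doubled backslashes collapse to the intended expression.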
3.2 HTML Processing Example One
The next task is to parse all the attributes of the font tags in an HTML page. A typical font tag in an HTML page looks like this:
The program should output the attributes of each font tag as follows:
For this case I recommend using two regular expressions. The first, shown in Figure 11, extracts face="Arial, Serif" size="+2" color="red" from the font tag.
Figure 11: Matches all attributes of a font tag
The second regular expression, shown in Figure 12, splits each attribute into a name-value pair.
Figure 12: Matches a single attribute and splits it into a name-value pair
The result of the split is:
Now let's look at the Java code that performs this task. First, create the two regular expression strings and compile them into Pattern objects with Perl5Compiler. When compiling, specify the Perl5Compiler.CASE_INSENSITIVE_MASK option so that matching is not case-sensitive.
Next, create a Perl5Matcher object to perform the matching.
Suppose you have a String variable named html that holds a single line of an HTML file. If the html string contains a font tag, the matcher returns true. You can then use the MatchResult object returned by the matcher to obtain the first group, which contains all the attributes of the font tag:
Next, create a PatternMatcherInput object. This object lets you continue matching from the position of the last match, which makes it well suited to extracting the name-value pairs of the attributes inside the font tag. Create the PatternMatcherInput object by passing in the string to be matched as a parameter, then extract each attribute with the matcher instance. This is done by passing the PatternMatcherInput object (instead of a String object) as the parameter and calling the PatternMatcher object's contains() method repeatedly. On each iteration the PatternMatcherInput object moves its internal pointer forward, so the next search begins after the previous match.
The output of this example is as follows:
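The two-step approach, first isolating the attribute text and then walking through it pair by pair, can be sketched as follows. Both patterns are reconstructions from the description, not the article's Figures 11 and 12, and java.util.regex is used in place of Jakarta-ORO, with repeated Matcher.find() calls standing in for the PatternMatcherInput loop. FontAttributes and parse are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FontAttributes {
    // Step 1: group 1 holds everything between "<font " and ">".
    static final Pattern FONT_TAG =
            Pattern.compile("<\\s*font\\s+([^>]*)\\s*>", Pattern.CASE_INSENSITIVE);
    // Step 2: group 1 is the attribute name, group 2 the quoted value.
    static final Pattern ATTRIBUTE =
            Pattern.compile("([a-z]+)\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // Returns the attributes of the first font tag in document order.
    static Map<String, String> parse(String html) {
        Map<String, String> attrs = new LinkedHashMap<>();
        Matcher tag = FONT_TAG.matcher(html);
        if (!tag.find()) {
            return attrs; // no font tag on this line
        }
        Matcher attr = ATTRIBUTE.matcher(tag.group(1));
        while (attr.find()) { // each call resumes after the previous match
            attrs.put(attr.group(1), attr.group(2));
        }
        return attrs;
    }

    public static void main(String[] args) {
        String html = "<font face=\"Arial, Serif\" size=\"+2\" color=\"red\">";
        for (Map.Entry<String, String> e : parse(html).entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```

This handles only double-quoted values; unquoted or single-quoted attributes would need a broader second pattern.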
3.3 HTML Processing Example Two
Let's look at another HTML processing example. This time, assume that the Web server has moved from widgets.acme.com to newserver.acme.com, and you need to update the links in some pages:
The regular expression that performs this search is shown in Figure 13:
Figure 13: Matches the links before modification
If the regular expression matches, you can replace the link matched by Figure 13 with the following:
Note that the # character is followed by $1. Perl regular expression syntax uses $1, $2, and so on to refer to groups that have been matched and extracted. The replacement therefore appends everything that the expression in Figure 13 captured as a group to the end of the new link.
Now, back to Java. As before, you must create a test string, create the objects needed to compile the regular expression into a Pattern object, and create a PatternMatcher object:
Next, perform the replacement with the static substitute() method of the Util class in the com.oroinc.text.regex package and output the resulting string:
The syntax of the Util.substitute() method is as follows:
The first two parameters of this call are the PatternMatcher and Pattern objects created earlier. The third parameter is a Substitution object that determines how the replacement is performed. This example uses a Perl5Substitution object, which performs Perl 5 style substitution. The fourth parameter is the string in which replacements are to be made, and the last parameter specifies whether to replace all substrings that match the pattern (Util.SUBSTITUTE_ALL) or only a given number of them.
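A substitution with a group reference can be tried in a self-contained way with the JDK's java.util.regex, used here as a stand-in for Util.substitute(): Matcher.replaceAll() replaces every match and also understands $1 in the replacement text, while Matcher.replaceFirst() would replace only one. The link pattern, the page name interface.html, and the anchor names are assumptions made for the example, as are the names LinkRewriter and rewrite.

```java
import java.util.regex.Pattern;

class LinkRewriter {
    // Group 1 captures the anchor name that follows '#'.
    static final Pattern OLD_LINK = Pattern.compile(
            "<a\\s+href\\s*=\\s*\"http://widgets\\.acme\\.com/interface\\.html#([^\"]+)\">",
            Pattern.CASE_INSENSITIVE);

    // $1 is replaced by whatever group 1 captured in each match.
    static final String NEW_LINK =
            "<a href=\"http://newserver.acme.com/interface.html#$1\">";

    // Rewrites every old link in the given text.
    static String rewrite(String html) {
        return OLD_LINK.matcher(html).replaceAll(NEW_LINK);
    }

    public static void main(String[] args) {
        String link = "<a href=\"http://widgets.acme.com/interface.html#How_To_Buy\">";
        System.out.println(rewrite(link));
    }
}
```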