A detailed description of the Java regular expression

Last Update:2015-02-01 Source: Internet

Author: User

Tags perl regular expression

1. Regular expression basics

Let's start with something simple. Suppose you are searching for strings that contain the characters "cat"; the regular expression to use is simply "cat". If the search is not case sensitive, the words "catalog", "Catherine", and "sophisticated" all match. In other words, the expression matches any string that contains the consecutive letters c, a, and t.
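As a minimal sketch of this literal search, the following uses the JDK's own java.util.regex package rather than the Jakarta-ORO library discussed later, so that it runs without extra libraries. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class CatSearch {
    // Case-insensitive pattern for the literal text "cat".
    static final Pattern CAT = Pattern.compile("cat", Pattern.CASE_INSENSITIVE);

    // True when the word contains "cat" anywhere, in any letter case.
    static boolean containsCat(String word) {
        return CAT.matcher(word).find();
    }

    public static void main(String[] args) {
        for (String word : new String[] {"catalog", "Catherine", "sophisticated", "dog"}) {
            System.out.println(word + " -> " + containsCat(word));
        }
    }
}
```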

1.1 Period Symbol

Suppose you are playing English Scrabble and want to find three-letter words that start with the letter "t" and end with the letter "n". If you have an English dictionary, you can search all of its contents with a regular expression. To construct this regular expression, use a wildcard character, the period symbol ".". The complete expression is "t.n", which matches "tan", "ten", "tin", and "ton", but also "t#n", "tpn", and even "t n", along with many other meaningless combinations. This is because the period matches any single character, including spaces and tab characters. In Perl 5 style syntax it does not match a line break by default; it does so only when single-line mode is enabled.
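The behavior of the period can be checked with a small sketch. It uses java.util.regex instead of Jakarta-ORO; in that package the DOTALL flag plays the role of Perl's single-line mode. The class and method names are illustrative only.

```java
import java.util.regex.Pattern;

class DotDemo {
    // "." stands for exactly one character between "t" and "n".
    static final Pattern DEFAULT = Pattern.compile("t.n");
    // With DOTALL, "." also matches a line break.
    static final Pattern DOTALL = Pattern.compile("t.n", Pattern.DOTALL);

    static boolean matchesDefault(String s) {
        return DEFAULT.matcher(s).matches();
    }

    static boolean matchesDotAll(String s) {
        return DOTALL.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"tan", "t#n", "t n", "toon", "t\nn"};
        for (String s : samples) {
            // Show the line break visibly when printing.
            String shown = s.replace("\n", "\\n");
            System.out.println(shown + " default=" + matchesDefault(s)
                    + " dotall=" + matchesDotAll(s));
        }
    }
}
```

Without the flag, only the sample containing a line break and the four-letter "toon" fail to match.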

1.2 Square brackets Symbol

To solve the problem that the period matches too broadly, you can list the characters that are meaningful inside square brackets ("[]"). Only the characters listed inside the brackets take part in the match. In other words, the regular expression "t[aeio]n" matches only "tan", "ten", "tin", and "ton". "toon" does not match, because a square-bracket expression matches exactly one character.

1.3 "or" symbol

If you want to match "toon" in addition to all the words matched above, you can use the "|" operator, whose basic meaning is "or". To match "toon", use the regular expression "t(a|e|i|o|oo)n". You cannot use square brackets here, because they match only a single character; parentheses "()" must be used instead. Parentheses can also be used for grouping, as described later.
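To compare square brackets with the "or" operator, here is a short sketch written against java.util.regex rather than Jakarta-ORO. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class OrDemo {
    // One character out of a, e, i, o between "t" and "n".
    static final Pattern BRACKETS = Pattern.compile("t[aeio]n");
    // Alternation also allows the two-character alternative "oo".
    static final Pattern ALTERNATION = Pattern.compile("t(a|e|i|o|oo)n");

    static boolean bracketsMatch(String s) {
        return BRACKETS.matcher(s).matches();
    }

    static boolean alternationMatch(String s) {
        return ALTERNATION.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"tan", "ton", "toon", "tun"}) {
            System.out.println(s + " brackets=" + bracketsMatch(s)
                    + " alternation=" + alternationMatch(s));
        }
    }
}
```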

1.4 Symbol for number of matches

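Table One shows the symbols that represent the number of matches. Each one specifies how many times the element immediately to its left may occur: "*" means zero or more times, "+" one or more times, "?" zero or one time, "{n}" exactly n times, and "{n,m}" from n to m times.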

Suppose we want to search a text file for United States Social Security numbers. The format of this number is 999-99-9999, and the regular expression used to match it is shown in Figure One. Inside square brackets a hyphen ("-") has a special meaning: it denotes a range, such as 0 to 9. For that reason the hyphens between the digit groups are preceded by the escape character "\", although outside square brackets the escape is optional.

Figure One: matching Social Security numbers in all 123-12-1234 forms

If you want the hyphens to be optional when searching, so that both 999-99-9999 and 999999999 are in the right format, you can add the "?" quantifier after each hyphen, as Figure Two shows:

Figure Two: Matching Social Security numbers in all 123-12-1234 and 123121234 forms
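The figures themselves are not reproduced here, so the pattern below is a reconstruction from the description (three, two, and four digits with optional hyphens), not necessarily the exact expression of the original figure. It is a sketch in java.util.regex with an illustrative class name.

```java
import java.util.regex.Pattern;

class SsnDemo {
    // {3}, {2}, {4} fix the digit counts; "?" makes each hyphen optional.
    static final Pattern SSN = Pattern.compile("[0-9]{3}-?[0-9]{2}-?[0-9]{4}");

    static boolean isSsn(String s) {
        return SSN.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"123-12-1234", "123121234", "12-123-1234"}) {
            System.out.println(s + " -> " + isSsn(s));
        }
    }
}
```

Because each hyphen is optional on its own, a mixed form such as 123-121234 is accepted as well.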

Let's look at another example. One format for American car license plates is four digits followed by two letters. Its regular expression consists of the number part "[0-9]{4}" followed by the letter part "[A-Z]{2}". Figure Three shows the complete regular expression.

Figure Three: Matching a typical U.S. car license number, such as 8836KV
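A brief sketch of the license plate pattern, again using java.util.regex rather than Jakarta-ORO. It also shows why the letter range should be written "[A-Z]": the range "[A-z]" spans the ASCII punctuation between "Z" and "a", so it accepts characters such as "_". The class and method names are assumptions.

```java
import java.util.regex.Pattern;

class PlateDemo {
    // Four digits followed by two upper-case letters.
    static final Pattern STRICT = Pattern.compile("[0-9]{4}[A-Z]{2}");
    // "[A-z]" covers every ASCII code from 'A' to 'z', punctuation included.
    static final Pattern LOOSE = Pattern.compile("[0-9]{4}[A-z]{2}");

    static boolean strict(String s) {
        return STRICT.matcher(s).matches();
    }

    static boolean loose(String s) {
        return LOOSE.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"8836KV", "8836__", "88KV36"}) {
            System.out.println(s + " strict=" + strict(s) + " loose=" + loose(s));
        }
    }
}
```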

1.5 "no" symbol

The "^" symbol is called the "not" symbol. When it is the first character inside square brackets, "^" indicates characters that you do not want to match. For example, the regular expression in Figure Four matches all words except those that begin with the letter "X".

Figure Four: Matches all words except those that begin with "X"
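Since Figure Four is not reproduced, the expression below is only an assumed form of it: one character that is not "X", followed by one or more lower-case letters. The sketch uses java.util.regex and a hypothetical class name.

```java
import java.util.regex.Pattern;

class NotDemo {
    // "[^X]" matches any single character except an upper-case X.
    static final Pattern NOT_X = Pattern.compile("[^X][a-z]+");

    static boolean matches(String word) {
        return NOT_X.matcher(word).matches();
    }

    public static void main(String[] args) {
        for (String word : new String[] {"Apple", "Xenon", "xenon"}) {
            System.out.println(word + " -> " + matches(word));
        }
    }
}
```

Note that the negated class is case sensitive: a lower-case "x" is still accepted as the first character.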

1.6 Parentheses and blank symbols

Suppose you want to extract the month portion from a birthday formatted as "June 26, 1951". The regular expression used to match such a date is shown in Figure Five:

Figure Five: Matching all dates in "Month DD, YYYY" format

The new "\s" symbol is the whitespace symbol, which matches any whitespace character, including the tab character. Once the string matches, how do you extract the month part? Simply put parentheses around the month to create a group, and then extract its value using the ORO API (discussed in detail later in this article). Figure Six shows the modified regular expression:

Figure Six: Matching all dates in "Month DD, YYYY" format and defining the month value as the first group
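Group extraction can be sketched as follows. The pattern is an assumption reconstructed from the description (a month name, whitespace, a one- or two-digit day, a comma, and a four-digit year), and the code uses java.util.regex in place of the ORO API. The names MonthExtractor and month are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MonthExtractor {
    // The parentheses around the month name define group 1.
    static final Pattern DATE = Pattern.compile(
            "([a-z]+)\\s+[0-9]{1,2},\\s*[0-9]{4}", Pattern.CASE_INSENSITIVE);

    // Returns the month name, or null when the text is not in this date format.
    static String month(String date) {
        Matcher m = DATE.matcher(date);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(month("June 26, 1951"));
        System.out.println(month("1951-06-26"));
    }
}
```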

1.7 Other symbols

For convenience, shortcut symbols are available for commonly used character classes, as shown in Table Two:

Table Two: Common symbols

For example, in the earlier Social Security number example, we can replace every occurrence of "[0-9]" with "\d". The modified regular expression is shown in Figure Seven:

Figure Seven: Matching Social Security numbers in all 123-12-1234 forms

2. The Jakarta-ORO library

Many open source regular expression libraries are available to Java programmers, and many of them support Perl 5 compatible regular expression syntax. The one I have chosen here is the Jakarta-ORO regular expression library, which is one of the most comprehensive regular expression APIs and is fully compatible with Perl 5 regular expressions. It is also one of the best optimized APIs.

The Jakarta-ORO library was formerly known as OROMatcher; Daniel Savarese generously donated it to the Jakarta Project. You can download it by following the resource description at the end of this article.

I'll start with a brief introduction to the objects you have to create and access when you use the Jakarta-ORO library, and then describe how to use the Jakarta-ORO API.

▲ PatternCompiler object

First, create an instance of the Perl5Compiler class and assign it to a variable of the PatternCompiler interface type. Perl5Compiler is an implementation of the PatternCompiler interface that compiles regular expressions into Pattern objects used for matching.

▲ Pattern object

To compile a regular expression into a Pattern object, call the compile() method of the compiler object and pass the regular expression as the argument. For example, you can compile the regular expression "t[aeio]n" in this way:

By default, the compiler creates a case-sensitive pattern. As a result, the pattern compiled above matches only "tin", "tan", "ten", and "ton", but does not match "Tin" or "taN". To create a case-insensitive pattern, specify an additional parameter when invoking the compiler:
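The same distinction exists in the JDK's java.util.regex package, which is used here as a stand-in so the sketch runs without the Jakarta-ORO jar. It is an analogue of the compile step, not the ORO API itself, and the class name is made up.

```java
import java.util.regex.Pattern;

class CompileDemo {
    // Default compilation is case sensitive.
    static final Pattern SENSITIVE = Pattern.compile("t[aeio]n");
    // An extra flag argument requests case-insensitive matching.
    static final Pattern INSENSITIVE =
            Pattern.compile("t[aeio]n", Pattern.CASE_INSENSITIVE);

    static boolean sensitive(String s) {
        return SENSITIVE.matcher(s).matches();
    }

    static boolean insensitive(String s) {
        return INSENSITIVE.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"tin", "Tin", "taN"}) {
            System.out.println(s + " sensitive=" + sensitive(s)
                    + " insensitive=" + insensitive(s));
        }
    }
}
```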

Once you have created the Pattern object, you can use the PatternMatcher class to perform pattern matching with it.

▲ PatternMatcher object

The PatternMatcher object checks a string against a Pattern object. You instantiate the Perl5Matcher class and assign the result to a variable of the PatternMatcher interface type. The Perl5Matcher class is an implementation of the PatternMatcher interface that performs pattern matching based on Perl 5 regular expression syntax:

With the PatternMatcher object, you can use several methods for matching. The first parameter of each method is the string to be matched against the regular expression:

· boolean matches(String input, Pattern pattern): used when the input string and the regular expression must match exactly. In other words, the regular expression must describe the entire input string.

· boolean matchesPrefix(String input, Pattern pattern): used when the regular expression must match the beginning of the input string.

· boolean contains(String input, Pattern pattern): used when the regular expression must match some part of the input string (that is, a substring).

In addition, in each of the three method calls above you can pass a PatternMatcherInput object instead of a String object; matching can then continue from the position of the last match in the string. This is useful when a string may contain more than one substring that matches a given regular expression. With a PatternMatcherInput object as the parameter, the three methods have the following signatures:

· boolean matches(PatternMatcherInput input, Pattern pattern)

· boolean matchesPrefix(PatternMatcherInput input, Pattern pattern)

· boolean contains(PatternMatcherInput input, Pattern pattern)
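The three kinds of match have close counterparts in java.util.regex, which the sketch below uses instead of Jakarta-ORO: Matcher.matches() checks the whole input, lookingAt() checks a prefix, and find() looks for a substring. Calling find() repeatedly on one Matcher resumes after the previous match, which is roughly the role PatternMatcherInput plays above. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchModes {
    static final Pattern P = Pattern.compile("t[aeio]n");

    // The pattern must describe the entire input.
    static boolean wholeMatch(String s) {
        return P.matcher(s).matches();
    }

    // The pattern must match at the start of the input.
    static boolean prefixMatch(String s) {
        return P.matcher(s).lookingAt();
    }

    // The pattern must match somewhere in the input.
    static boolean containsMatch(String s) {
        return P.matcher(s).find();
    }

    // Each find() call continues after the previous match.
    static List<String> allMatches(String s) {
        List<String> found = new ArrayList<>();
        Matcher m = P.matcher(s);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(wholeMatch("tank"));
        System.out.println(prefixMatch("tank"));
        System.out.println(containsMatch("a tank"));
        System.out.println(allMatches("tan ten tin"));
    }
}
```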

3. Application examples

Let's take a look at some application examples of the Jakarta-oro library.

3.1 Log file Processing

Task: Analyze a Web server log file to determine how long each user spends on the site. In a typical BEA WebLogic log file, the format of the log record is as follows:

Analyzing this log record, you can find that the content to extract from this log file has two items: IP address and page access time. You can extract the IP address and time stamp from the log record with the grouping symbol (parentheses).

First, let's look at the IP address. The IP address has 4 bytes, and each byte has a value between 0 and 255, and each byte is separated by a period. Therefore, each byte in the IP address has at least one, up to three digits. Figure Eight shows a regular expression written for an IP address:

Figure Eight: Matching IP address

The period characters in the IP address must be escaped (preceded by "\"), because here the period has its literal meaning rather than the special meaning it carries in regular expression syntax. That special meaning is described earlier in this article.

The time portion of the log record is surrounded by a pair of square brackets. You can extract everything inside them as follows: first search for the opening square bracket character ("["), then extract every character that is not the closing square bracket character ("]"), continuing forward until the closing square bracket is found. Figure Nine shows the regular expression for this part.

Figure Nine: Matching at least one character until "]" is found

Now combine the two regular expressions above into a single expression, using the grouping notation (parentheses) so that the IP address and the time can be extracted from the log record. Note that "\s-\s-\s" is added in the middle of the expression in order to match "- -" (without extracting it). Figure 10 shows the complete regular expression.

Figure 10: Matching IP address and time stamp

Now that the regular expression has been written, you can then write the Java code that uses the regular expression library.

To use the Jakarta-oro library, first create the regular expression string and the log record string to be parsed:

The regular expression used here is almost exactly the same as the one in Figure 10, with one exception: in a Java string literal, you must escape each backslash ("\"). Figure 10 is not in Java notation, so we need to precede each "\" with another "\" to avoid compilation errors. Unfortunately, this escaping is error-prone, so be careful. You can first type the unescaped regular expression and then replace each "\" with "\\" from left to right. To double-check the result, you can print it to the screen.
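A tiny sketch of the doubling rule, using the "\s-\s-\s" fragment mentioned earlier. The class name is illustrative, and String.matches from the JDK is used only to show that the escaped literal behaves as intended.

```java
class EscapeDemo {
    // Regex as written outside Java:  \s-\s-\s
    // Java string literal: every backslash is doubled.
    static final String REGEX = "\\s-\\s-\\s";

    public static void main(String[] args) {
        // Printing shows the regex as the regex engine sees it: \s-\s-\s
        System.out.println(REGEX);
        // The fragment matches " - - " (whitespace, hyphen, whitespace, hyphen, whitespace).
        System.out.println(" - - ".matches(REGEX));
    }
}
```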

After initializing the strings, instantiate the PatternCompiler object and use it to compile the regular expression into a Pattern object:

Now create the PatternMatcher object and call the contains() method of the PatternMatcher interface to check for a match:

Next, output the matched groups using the MatchResult object returned by the PatternMatcher interface. Because the logEntry string contains matching content, you can see the output of the class as follows:
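As a rough end-to-end sketch of the steps in this section, the following program extracts the IP address and time stamp from a log line. It is written against java.util.regex rather than Jakarta-ORO, the pattern is assembled from the descriptions of Figures Eight to 10 rather than copied from them, and the sample log line is a hypothetical entry in the common access-log layout.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogParser {
    // Group 1: dotted IP address. Group 2: everything between "[" and "]".
    static final Pattern ENTRY = Pattern.compile(
            "([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3})"
            + "\\s-\\s-\\s"
            + "\\[([^\\]]+)\\]");

    // Returns {ip, timestamp}, or null when the line contains no match.
    static String[] parse(String logEntry) {
        Matcher m = ENTRY.matcher(logEntry);
        if (!m.find()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2)};
    }

    public static void main(String[] args) {
        // Hypothetical sample record, not taken from a real server.
        String logEntry = "172.26.155.241 - - [26/Feb/2001:10:56:03 -0500] "
                + "\"GET /IsAlive.htm HTTP/1.0\" 200 15";
        String[] parts = parse(logEntry);
        System.out.println("IP: " + parts[0]);
        System.out.println("Timestamp: " + parts[1]);
    }
}
```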

3.2 HTML Processing Example one

The following task is to parse all the attributes of a font tag within an HTML page. The typical font tags within an HTML page are as follows:

The program will output the properties of each font tag as follows:

In this case, I recommend that you use two regular expressions. The first, shown in Figure 11, extracts face="Arial, Serif" size="+2" color="red" from the font tag.

Figure 11: Matching all attributes of a font tag

The second regular expression, shown in Figure 12, splits each attribute into a name-value pair.

Figure 12: Match a single attribute and split it into a name-value pair

The result of the split is:

Now let's take a look at the Java code that completes this task. First, create the two regular expression strings and compile them into Pattern objects with Perl5Compiler. When compiling the regular expressions, specify the Perl5Compiler.CASE_INSENSITIVE_MASK option so that matching is not case sensitive.

Next, create a Perl5Matcher object that performs the matching operation.

Suppose you have a variable html of type String that holds a single line of an HTML file. If the html string contains a font tag, the matcher returns true. At this point, you can use the MatchResult object returned by the matcher to get the first group, which contains all the attributes of the font tag:

Next, create a PatternMatcherInput object. This object lets you continue matching from the position of the last match, so it is well suited to extracting the name-value pairs of the attributes within the font tag. Create the PatternMatcherInput object by passing in the string to be matched as a parameter. Then extract each attribute with the matcher instance: pass the PatternMatcherInput object (instead of a String object) as the parameter and call the PatternMatcher object's contains() method repeatedly. On each iteration the PatternMatcherInput object moves its internal pointer forward, so the next search begins after the previous match.

The output of this example is as follows:
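The two-step approach of this example might look like the sketch below. It substitutes java.util.regex for Jakarta-ORO, and both patterns are assumptions written from the descriptions of Figures 11 and 12, not the original expressions. FontAttributes and parse are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FontAttributes {
    // Step 1: capture everything between "<font" and ">" as group 1.
    static final Pattern FONT_TAG = Pattern.compile(
            "<\\s*font\\s+([^>]*)\\s*>", Pattern.CASE_INSENSITIVE);
    // Step 2: split one attribute into a name (group 1) and a quoted value (group 2).
    static final Pattern ATTRIBUTE = Pattern.compile(
            "([a-z]+)\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // Returns the attributes of the first font tag in document order.
    static Map<String, String> parse(String html) {
        Map<String, String> attributes = new LinkedHashMap<>();
        Matcher tag = FONT_TAG.matcher(html);
        if (tag.find()) {
            Matcher attr = ATTRIBUTE.matcher(tag.group(1));
            // Each find() resumes after the previous attribute.
            while (attr.find()) {
                attributes.put(attr.group(1), attr.group(2));
            }
        }
        return attributes;
    }

    public static void main(String[] args) {
        String html = "<font face=\"Arial, Serif\" size=\"+2\" color=\"red\">";
        for (Map.Entry<String, String> e : parse(html).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

The attribute pattern only handles double-quoted values; unquoted or single-quoted attributes would need a broader expression.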

3.3 HTML Processing Example two

Let's take a look at another example of HTML processing. This time, we assume that the Web server moved from widgets.acme.com to newserver.acme.com. Now you want to modify the links in some pages:

The regular expression that performs this search is shown in Figure 13:

Figure 13: Matching the pre-modified link

If you can match the regular expression, you can replace the link in Figure 13 with the following:

Note that the "#" character is followed by "$1". Perl regular expression syntax uses $1, $2, and so on to represent groups that have been matched and extracted. The replacement appends the content that the expression in Figure 13 matched and extracted as a group to the end of the link.

Now, return to Java. As before, you must create a test string, create the objects needed to compile the regular expression into a Pattern object, and create a PatternMatcher object:

Next, perform the replacement with the static substitute() method of the Util class in the com.oroinc.text.regex package and output the result string:

The syntax of the Util.substitute() method is as follows:

The first two parameters of this call are the previously created PatternMatcher and Pattern objects. The third parameter is a Substitution object that determines how the replacement is performed. This example uses a Perl5Substitution object, which performs Perl 5 style substitution. The fourth parameter is the string in which the replacement is to be made, and the last parameter specifies whether to replace all substrings that match the pattern (Util.SUBSTITUTE_ALL) or only a specified number of them.
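For comparison, the same kind of group-based replacement can be sketched with java.util.regex, where Matcher.replaceAll also understands "$1" in the replacement text. This is a stand-in for Util.substitute(), not the ORO call itself. The page name interface.html and the anchor in the sample link are hypothetical, as are the class and method names.

```java
import java.util.regex.Pattern;

class LinkRewriter {
    // Group 1 captures the anchor name that follows "#" in the old link.
    static final Pattern OLD_LINK = Pattern.compile(
            "<\\s*a\\s+href\\s*=\\s*\"http://widgets\\.acme\\.com/interface\\.html#([^\"]+)\"\\s*>",
            Pattern.CASE_INSENSITIVE);
    // "$1" re-inserts the captured anchor name after "#" in the new link.
    static final String NEW_LINK =
            "<a href=\"http://newserver.acme.com/interface.html#$1\">";

    // Replaces every matching link; text without such links is returned unchanged.
    static String rewrite(String html) {
        return OLD_LINK.matcher(html).replaceAll(NEW_LINK);
    }

    public static void main(String[] args) {
        String link = "<a href=\"http://widgets.acme.com/interface.html#How_To_Buy\">Buy</a>";
        System.out.println(rewrite(link));
    }
}
```

Using replaceFirst instead of replaceAll would limit the change to the first matching link.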
