Java Regular Expressions in Detail

Source: Internet
Author: User
If you have ever used Perl or any other language with built-in regular expression support, you know how simple regular expressions make text processing and pattern matching. If you are unfamiliar with the term, a regular expression is a string of characters that defines a pattern used to search for matching strings.

Many languages, including Perl, PHP, Python, JavaScript, and JScript, support regular expressions for text processing, and some text editors use them to implement advanced search-and-replace. So what about Java? At the time of writing, a Java Specification Request (JSR) covering regular expressions for text processing has been approved, and you can expect to see it in the next version of the JDK.

But what if you need regular expressions now? You can download the open-source Jakarta-ORO library from apache.org. This article first introduces the basics of regular expressions and then uses the Jakarta-ORO API as an example to show how to use them.

1. Regular expression basics

Let's start simple. Suppose you are searching for strings that contain the characters "cat"; the regular expression for the search is "cat". If the search is case-insensitive, the words "catalog", "Catherine", and "sophisticated" all match.

1.1 The period symbol

Suppose you are playing Scrabble and want to find three-letter words that begin with the letter "t" and end with the letter "n". Suppose also that you have an English dictionary whose entire contents you can search with regular expressions. To construct this regular expression, you can use a wildcard character: the period symbol ".".
With the period, the complete expression is "t.n", which matches "tan", "ten", "tin", and "ton", but also "t#n", "tpn", and even "t n", along with many other meaningless combinations. This is because the period matches any single character, including spaces and tabs (and, in Perl 5-style engines, line breaks as well when single-line mode is enabled).

1.2 Square brackets

To narrow the overly broad match of the period, list the characters that make sense inside square brackets ("[]"). Only the characters listed in the brackets then take part in the match. That is, the regular expression "t[aeio]n" matches only "tan", "ten", "tin", and "ton". "toon" does not match, because a bracket expression matches exactly one character.

1.3 The "or" symbol

To match "toon" in addition to all the words above, use the "|" operator, whose basic meaning is "or". To match "toon", use the regular expression "t(a|e|i|o|oo)n". Square brackets cannot be used here because they match only a single character; you must use parentheses "()". Parentheses can also be used for grouping, as described later in this article.

1.4 Symbols for the number of matches

A symbol that represents a number of matches (a quantifier) determines how many times the symbol immediately to its left may occur.
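The constructs from sections 1.1 to 1.4 can be tried out with a few lines of Java. This is a minimal sketch that uses the JDK's own java.util.regex package instead of Jakarta-ORO, so that it runs without extra libraries. The class and method names are made up for illustration, and the quantifier examples assume the standard Perl 5 set: "*" (zero or more), "+" (one or more), "?" (zero or one), "{n}" (exactly n), and "{n,m}" (n to m).

```java
import java.util.regex.Pattern;

class RegexBasics {
    // Returns true when the entire word matches the regular expression.
    static boolean matchesWhole(String regex, String word) {
        return Pattern.matches(regex, word);
    }

    public static void main(String[] args) {
        // 1.1 Period: any single character fits between "t" and "n".
        System.out.println(matchesWhole("t.n", "ten"));              // true
        System.out.println(matchesWhole("t.n", "t#n"));              // true

        // 1.2 Square brackets: only one of the listed characters fits.
        System.out.println(matchesWhole("t[aeio]n", "tin"));         // true
        System.out.println(matchesWhole("t[aeio]n", "t#n"));         // false
        System.out.println(matchesWhole("t[aeio]n", "toon"));        // false

        // 1.3 "Or": parentheses allow multi-character alternatives.
        System.out.println(matchesWhole("t(a|e|i|o|oo)n", "toon"));  // true

        // 1.4 Quantifiers apply to the symbol immediately to their left.
        System.out.println(matchesWhole("to*n", "tn"));              // true
        System.out.println(matchesWhole("to+n", "tn"));              // false
        System.out.println(matchesWhole("to?n", "toon"));            // false
        System.out.println(matchesWhole("to{2}n", "toon"));          // true
        System.out.println(matchesWhole("to{1,3}n", "tooon"));       // true
    }
}
```

Pattern.matches requires the whole input to match, which keeps each line focused on a single construct.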

Suppose we want to search for U.S. Social Security numbers in a text file. The format of this number is 999-99-9999. The regular expression used to match it is shown in Figure 1. Inside square brackets, a hyphen ("-") has a special meaning: it denotes a range, for example from 0 to 9. Therefore, the literal hyphens in the Social Security number are preceded by the escape character "\". (Outside square brackets the escape is not strictly required, but it is harmless.)

Figure 1: Matching all Social Security numbers of the form 123-12-1234

Suppose you want the hyphens to be optional in the search, so that both 999-99-9999 and 999999999 count as correctly formatted. In that case, add the "?" quantifier after each hyphen, as shown in Figure 2:

Figure 2: Matching all Social Security numbers of the forms 123-12-1234 and 123121234
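The two Social Security number patterns can be sketched in Java as follows. The figures themselves are not reproduced in this text, so the expressions below are reconstructions from the description (three digits, two digits, four digits, with escaped hyphens) and should be read as assumptions. The sketch uses java.util.regex from the JDK rather than Jakarta-ORO, and the class and method names are illustrative only.

```java
import java.util.regex.Pattern;

class SsnPatterns {
    // Hyphens required, as in 123-12-1234 (assumed form of Figure 1).
    static final Pattern WITH_HYPHENS =
            Pattern.compile("[0-9]{3}\\-[0-9]{2}\\-[0-9]{4}");

    // Each hyphen made optional with "?" (assumed form of Figure 2).
    static final Pattern OPTIONAL_HYPHENS =
            Pattern.compile("[0-9]{3}\\-?[0-9]{2}\\-?[0-9]{4}");

    static boolean hasHyphens(String s) {
        return WITH_HYPHENS.matcher(s).matches();
    }

    static boolean hyphensOptional(String s) {
        return OPTIONAL_HYPHENS.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(hasHyphens("123-12-1234"));       // true
        System.out.println(hasHyphens("123121234"));         // false
        System.out.println(hyphensOptional("123-12-1234"));  // true
        System.out.println(hyphensOptional("123121234"));    // true
    }
}
```

Because each "?" applies independently, the second pattern also accepts mixed forms such as 123-121234.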

Here is another example. One format for U.S. car license plates is four digits followed by two letters. Its regular expression consists of the digit part "[0-9]{4}" followed by the letter part "[A-Z]{2}". Figure 3 shows the complete regular expression.

Figure 3: Matching a typical U.S. car license plate number, such as 8836KV
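A short Java sketch of the license plate check, again using java.util.regex as a stand-in for Jakarta-ORO; the class and method names are hypothetical. It also shows why the letter range should be written "[A-Z]": the wider range "[A-z]" additionally covers the punctuation characters that sit between "Z" and "a" in ASCII.

```java
import java.util.regex.Pattern;

class LicensePlate {
    // Four digits followed by two uppercase letters.
    static final Pattern PLATE = Pattern.compile("[0-9]{4}[A-Z]{2}");

    static boolean isPlate(String s) {
        return PLATE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isPlate("8836KV"));  // true
        System.out.println(isPlate("8836kv"));  // false: lowercase letters
        System.out.println(isPlate("883KV"));   // false: only three digits

        // The range A-z also contains [ \ ] ^ _ and `, so it is too wide.
        System.out.println(Pattern.matches("[0-9]{4}[A-z]{2}", "8836K_"));  // true
    }
}
```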

1.5 The "not" symbol

The "^" symbol is called the "not" symbol. When used as the first character inside square brackets, "^" denotes the characters you do not want to match. For example, the regular expression in Figure 4 matches all words except those that begin with the letter "X".

Figure 4: Matching all words except those beginning with "X"
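The negated bracket can be sketched like this. The expression is an assumption based on the description (any first character except "X", followed by lowercase letters), not a copy of the missing figure, and it is run with java.util.regex rather than Jakarta-ORO. Names are illustrative.

```java
import java.util.regex.Pattern;

class NotSymbol {
    // First character: anything except "X"; then one or more lowercase letters.
    static final Pattern NOT_X = Pattern.compile("[^X][a-z]+");

    static boolean accepts(String word) {
        return NOT_X.matcher(word).matches();
    }

    public static void main(String[] args) {
        System.out.println(accepts("Yankee"));  // true
        System.out.println(accepts("Xray"));    // false: begins with "X"
        System.out.println(accepts("xray"));    // true: lowercase "x" is not "X"
    }
}
```

Note that "[^X]" still consumes one character, so it excludes "X" but accepts any other character in that position, including digits and punctuation.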

1.6 Parentheses and whitespace symbols

Suppose you want to extract the month from a birth date formatted like "June 26, 1951". A regular expression that matches the date is shown in Figure 5:

Figure 5: Matching all dates in the Month DD, YYYY format

The new symbol "\s" is the whitespace symbol; it matches any whitespace character, including tabs. If the string matches, how do you extract the month? Simply create a group by putting parentheses around the month part, and then retrieve its value with the ORO API (discussed in detail later in this article). The modified regular expression is shown in Figure 6:

Figure 6: Matching all dates in the Month DD, YYYY format, with the month defined as the first group
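Group extraction of the month might look like the sketch below. The date expression is an assumed reconstruction of the figure (month name, whitespace, one or two digits, a comma, optional whitespace, four digits), and the code uses java.util.regex, where Matcher.group(1) plays the role that the ORO match result plays in the article. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BirthdayMonth {
    // The parentheses around the month part define group 1.
    static final Pattern DATE =
            Pattern.compile("([A-Za-z]+)\\s+[0-9]{1,2},\\s*[0-9]{4}");

    // Returns the month name, or null when the text is not a Month DD, YYYY date.
    static String monthOf(String date) {
        Matcher m = DATE.matcher(date);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(monthOf("June 26, 1951"));  // June
        System.out.println(monthOf("26/06/1951"));     // null
    }
}
```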

1.7 Other symbols

For simplicity, you can use shortcut symbols defined for common regular expressions, as listed in Table 2.

Table 2: Common symbols

For example, in the earlier Social Security number example, "\d" can be used everywhere "[0-9]" appears. The modified regular expression is shown in Figure 7:

Figure 7: Matching all Social Security numbers of the form 123-12-1234
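A brief sketch of the shortcut symbols in Java, using java.util.regex in place of Jakarta-ORO. Table 2 is not reproduced in this text, so the shortcuts shown here are the standard Perl 5 ones ("\d" for a digit, "\D" for a non-digit, "\w" for a word character, "\s" for whitespace) and are an assumption about what the table listed. The class and method names are illustrative.

```java
import java.util.regex.Pattern;

class ShortcutSymbols {
    // Same check as the [0-9] version, written with the \d shortcut.
    static boolean isSsn(String s) {
        return Pattern.matches("\\d{3}\\-\\d{2}\\-\\d{4}", s);
    }

    public static void main(String[] args) {
        System.out.println(isSsn("123-12-1234"));                                // true
        System.out.println(isSsn("12a-12-1234"));                                // false
        System.out.println(Pattern.matches("\\D+", "abc"));                      // true
        System.out.println(Pattern.matches("\\w+\\s\\w+", "regular expression")); // true
    }
}
```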

2. The Jakarta-ORO library

Many open-source regular expression libraries are available to Java programmers, and many of them support Perl 5-compatible regular expression syntax. The one used here is Jakarta-ORO, one of the most comprehensive regular expression APIs; it is fully compatible with Perl 5 regular expressions and is also one of the best-optimized APIs. The Jakarta-ORO library was formerly called OROMatcher; Daniel Savarese generously donated it to the Jakarta Project. You can download it by following the instructions in the reference resources at the end of this article.

I'll start with an overview of the objects you must create and access when using the Jakarta-ORO library, and then explain how to use the Jakarta-ORO API.

▲ PatternCompiler object

First, create an instance of the Perl5Compiler class and assign it to a variable of the PatternCompiler interface type. Perl5Compiler is an implementation of the PatternCompiler interface that lets you compile a regular expression into a Pattern object used for matching.

▲ Pattern object

To compile a regular expression into a Pattern object, call the compile() method of the compiler object and pass the regular expression as the argument; the regular expression "t[aeio]n", for example, can be compiled this way. By default, the compiler creates a case-sensitive pattern, so a pattern compiled from "t[aeio]n" without options matches only "tin", "tan", "ten", and "ton", and does not match "Tin" or "taN". To create a case-insensitive pattern, pass an extra parameter when calling the compiler. After you create the Pattern object, you can use the PatternMatcher class to match against it.

▲ PatternMatcher object

The PatternMatcher object checks a string against the Pattern object. Instantiate a Perl5Matcher class and assign the result to a variable of the PatternMatcher interface type.
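The compile step and its case-sensitivity option can be illustrated with the JDK's java.util.regex package, which follows the same compile-then-match idea. This is a sketch under that substitution, not the Jakarta-ORO code from the original article: Pattern.compile and Pattern.CASE_INSENSITIVE are the JDK counterparts of the ORO compiler call and its case-insensitivity option, and the class and method names are made up.

```java
import java.util.regex.Pattern;

class CaseSensitivity {
    // Default compilation: case-sensitive.
    static final Pattern SENSITIVE = Pattern.compile("t[aeio]n");

    // Extra flag parameter: case-insensitive.
    static final Pattern INSENSITIVE =
            Pattern.compile("t[aeio]n", Pattern.CASE_INSENSITIVE);

    static boolean matchesSensitive(String s) {
        return SENSITIVE.matcher(s).matches();
    }

    static boolean matchesInsensitive(String s) {
        return INSENSITIVE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesSensitive("tin"));    // true
        System.out.println(matchesSensitive("Tin"));    // false
        System.out.println(matchesSensitive("taN"));    // false
        System.out.println(matchesInsensitive("Tin"));  // true
        System.out.println(matchesInsensitive("taN"));  // true
    }
}
```

Compiling each pattern once and reusing it is the usual design choice, since compilation is the expensive part of the work.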
The Perl5Matcher class is an implementation of the PatternMatcher interface that matches patterns according to Perl 5 regular expression syntax. A PatternMatcher object offers several methods for matching. The first parameter of each method is the string to be matched against the regular expression:

· boolean matches(String input, Pattern pattern): use when the input string and the regular expression must match exactly. In other words, the regular expression must describe the entire input string.
· boolean matchesPrefix(String input, Pattern pattern): use when the regular expression must match the beginning of the input string.
· boolean contains(String input, Pattern pattern): use when the regular expression must match some part of the input string (that is, a substring).

In each of these three methods you can also pass a PatternMatcherInput object instead of a String, in which case matching can resume from where the last match ended. This is useful when a string may contain more than one substring matching a given regular expression. With a PatternMatcherInput parameter, the three methods have the following signatures:

· boolean matches(PatternMatcherInput input, Pattern pattern)
· boolean matchesPrefix(PatternMatcherInput input, Pattern pattern)
· boolean contains(PatternMatcherInput input, Pattern pattern)

3. Application examples

The following examples show the Jakarta-ORO library in use.

3.1 Log file processing

Task: analyze a web server log file to determine how long each user spends on the site. The task works on records from a typical BEA WebLogic log file. Analyzing such a log record shows that two items need to be extracted from it: the IP address and the page access time.
You can extract the IP address and the timestamp from a log record using group symbols (parentheses). Let's look at the IP address first. An IP address consists of 4 bytes, each with a value between 0 and 255, separated by periods. Each byte therefore has at least one and at most three digits. Figure 8 shows the regular expression written for the IP address:

Figure 8: Matching an IP address

The periods in the IP address must be escaped (preceded by "\") because they are meant literally rather than with the special meaning a period has in regular expression syntax, described earlier in this article.

The time portion of the log record is enclosed in a pair of square brackets. You can extract everything inside them by first finding the opening square bracket ("[") and then taking all characters that are not a closing square bracket ("]"), up to the point where the closing bracket is found. Figure 9 shows the regular expression for this part.

Figure 9: Matching at least one character until "]" is found

Now combine the two regular expressions into a single expression, using grouping symbols (parentheses), so that you can extract both the IP address and the time from a log record. Note that "\s-\s-\s" is added in the middle of the regular expression to match " - - " (without extracting it). The complete regular expression is shown in Figure 10.

Figure 10: Matching the IP address and the timestamp
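Put together, the log record parsing could be sketched as below. Everything specific here is an assumption: the combined expression is reconstructed from the description above (four dot-separated groups of one to three digits, then " - - ", then the bracketed time), the sample log record is hypothetical, and the code uses java.util.regex from the JDK, where Matcher.find() stands in for the ORO contains() call.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogEntryParser {
    // Group 1: IP address. Group 2: everything between "[" and "]".
    static final Pattern LOG_ENTRY = Pattern.compile(
            "([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3})"
            + "\\s-\\s-\\s"
            + "\\[([^\\]]+)\\]");

    // Returns {ip, time}, or null when the record does not contain a match.
    static String[] parse(String logEntry) {
        Matcher m = LOG_ENTRY.matcher(logEntry);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        // Hypothetical record in a common access-log layout.
        String logEntry = "172.26.155.241 - - [26/Feb/2001:10:56:03 -0500] "
                + "\"GET /IsAlive.htm HTTP/1.0\" 200 15";
        String[] parts = parse(logEntry);
        System.out.println("IP: " + parts[0]);    // IP: 172.26.155.241
        System.out.println("Time: " + parts[1]);  // Time: 26/Feb/2001:10:56:03 -0500
    }
}
```

The expression checks only the digit count of each byte, not the 0 to 255 range, which matches the level of strictness described in the text.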

Now that the regular expression has been written, you can write the Java code that uses the regular expression library.

To use the Jakarta-ORO library, first create the regular expression string and the log record string to be parsed. The regular expression used in the code is almost exactly the same as the one in Figure 10, with one exception: in a Java string literal you must escape each backslash ("\"). Figure 10 is not in Java notation, so each "\" must be preceded by another "\" to avoid a compilation error. Unfortunately, this escaping is error-prone, so be careful. You can first type the regular expression without escaping and then, working from left to right, replace each "\" with "\\". To double-check the result, print it to the screen.

After initializing the strings, instantiate a PatternCompiler object and use it to compile the regular expression into a Pattern object. Then create a PatternMatcher object and call the contains() method of the PatternMatcher interface to check for a match. Next, use the MatchResult object returned by the PatternMatcher to output the matched groups. Because the logEntry string contains matching content, the program prints the extracted groups.

3.2 HTML processing, example 1

The next task is to parse all the attributes of the font tags in an HTML page. For each font tag, the program prints its attributes. For this task I suggest using two regular expressions. The first, shown in Figure 11, extracts the attribute text, such as face="Arial, Serif" size="+2" color="red", from the font tag.

Figure 11: Matching all attributes of the font tag

The second regular expression, shown in Figure 12, splits each attribute into a name-value pair.

Figure 12: Matching a single attribute and splitting it into a name-value pair
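The two-expression approach can be sketched in Java as follows. Both expressions are assumed reconstructions of the missing figures, the sample font tag is hypothetical, and the code uses java.util.regex instead of Jakarta-ORO: calling Matcher.find() in a loop resumes after the previous match, which is the same role PatternMatcherInput plays in the ORO API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FontAttributes {
    // Group 1: everything between "<font " and ">".
    static final Pattern FONT_TAG =
            Pattern.compile("<\\s*font\\s+([^>]*)\\s*>", Pattern.CASE_INSENSITIVE);

    // Group 1: attribute name. Group 2: the quoted attribute value.
    static final Pattern ATTRIBUTE =
            Pattern.compile("([a-z]+)\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // Returns the attributes of the first font tag in insertion order.
    static Map<String, String> parse(String html) {
        Map<String, String> attributes = new LinkedHashMap<>();
        Matcher tag = FONT_TAG.matcher(html);
        if (tag.find()) {
            Matcher attribute = ATTRIBUTE.matcher(tag.group(1));
            // Each find() continues after the end of the previous match.
            while (attribute.find()) {
                attributes.put(attribute.group(1), attribute.group(2));
            }
        }
        return attributes;
    }

    public static void main(String[] args) {
        String html = "<font face=\"Arial, Serif\" size=\"+2\" color=\"red\">";
        // Prints: {face=Arial, Serif, size=+2, color=red}
        System.out.println(parse(html));
    }
}
```

This sketch handles only double-quoted attribute values on a single line; unquoted or single-quoted values would need a wider expression.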

Now let's look at the Java code that completes this task. First, create the two regular expression strings and compile them into Pattern objects with Perl5Compiler. When compiling, specify the Perl5Compiler.CASE_INSENSITIVE_MASK option so that matching is case-insensitive. Next, create a Perl5Matcher object to perform the matching.

Suppose you have a String variable named html that holds one line of an HTML file. If the html string contains a font tag, the matcher returns true. You can then use the MatchResult object returned by the matcher to get the first group, which contains all the attributes of the font tag.

Next, create a PatternMatcherInput object. This object lets you continue matching from where the last match ended, so it is well suited to extracting the name-value pairs of the attributes in the font tag. Create the PatternMatcherInput object, passing the string to be matched as a parameter. Then extract each attribute of the font tag with the matcher: call the PatternMatcher object's contains() method repeatedly, passing the PatternMatcherInput object (rather than a String) as the parameter. On each iteration the PatternMatcherInput object advances its internal pointer, so the next search begins after the previous match.

3.3 HTML processing, example 2

Let's look at another example that processes HTML. This time, assume the web server has moved from widgets.acme.com to newserver.acme.com, and you want to update the links in some pages. The regular expression that performs the search is shown in Figure 13:

Figure 13: The link before the modification
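The link rewrite for this example might be sketched as below. The search expression, the page name interface.html, and the anchor names are all hypothetical, since the figure is not reproduced in this text. The code uses java.util.regex rather than Jakarta-ORO: Matcher.replaceAll accepts Perl-style "$1" group references in the replacement string and replaces every match.

```java
import java.util.regex.Pattern;

class LinkRewriter {
    // Group 1 captures the anchor name that follows "#".
    static final Pattern OLD_LINK = Pattern.compile(
            "<\\s*a\\s+href\\s*=\\s*\"http://widgets\\.acme\\.com/interface\\.html#([^\"]+)\"\\s*>",
            Pattern.CASE_INSENSITIVE);

    // "$1" re-inserts the captured anchor name after the "#".
    static final String NEW_LINK =
            "<a href=\"http://newserver.acme.com/interface.html#$1\">";

    static String rewrite(String html) {
        return OLD_LINK.matcher(html).replaceAll(NEW_LINK);
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://widgets.acme.com/interface.html#How_To_Buy\">";
        // Prints: <a href="http://newserver.acme.com/interface.html#How_To_Buy">
        System.out.println(rewrite(html));
    }
}
```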

If the regular expression matches, you can replace the link from Figure 13 with a link to the new server. Note that in the replacement the "#" character is followed by "$1". Perl regular expression syntax uses $1, $2, and so on to refer to groups that have been matched and extracted. The expression in Figure 13 captures the matched content as a group, which the replacement appends to the end of the new link.

Now, back to Java. As before, create a test string, create the objects needed to compile the regular expression into a Pattern object, and create a PatternMatcher object. Next, perform the replacement with the static substitute() method of the Util class in the com.oroinc.text.regex package and print the resulting string.

In the call to Util.substitute(), the first two parameters are the previously created PatternMatcher and Pattern objects. The third parameter is a Substitution object that determines how the substitution is performed; this example uses a Perl5Substitution object, which performs Perl 5-style substitutions. The fourth parameter is the string in which replacements are made, and the last parameter specifies whether to replace all matching substrings of the pattern (Util.SUBSTITUTE_ALL) or only a given number of them.

Closing

In this article, I introduced the powerful features of regular expressions. Applied correctly, regular expressions can play a significant role in string extraction and text modification. I also described how to use regular expressions in Java programs through the Jakarta-ORO library. Whether you stay with old-fashioned string processing (StringTokenizer, charAt, and substring) or adopt regular expressions is up to you.
