Simplified pattern matching

Source: Internet
Author: User
Regular expression pattern matching can automate many text processing tasks, such as search and replace, input validation, text conversion, and filtering. A powerful regular expression engine greatly reduces the amount of code needed: work that used to take many lines of code now takes only a few lines of regular expressions. Some programming languages (such as Perl) and system tools (such as grep) have supported regular expressions for many years. The JDK, however, had no regular expression support before J2SE 1.4, and developers had to rely on third-party packages such as Jakarta Regexp and IBM's com.ibm.regex (a commercial product). The arrival of java.util.regex changed that by providing a standard implementation of a regular expression engine. This article describes how to get started quickly with java.util.regex for pattern-based searching. It first introduces the basic concepts of regular expressions, then discusses the package in detail, and finally demonstrates several simple applications.
  
What is a regular expression? Why is it so important?

If you have used regular expressions in other languages, the following sections describe how they are applied in Java and point out some new features. If you are new to regular expressions, you will soon learn to use them effectively for text processing, and you may be surprised by how much they can do.
A regular expression is a mechanism for finding matches in a given text sequence based on a given pattern. In other words, it is a pattern language. A regular expression generally consists of two kinds of characters: literals and metacharacters. Literals are ordinary characters such as a, b, c, 1, and 2. Metacharacters (such as * and $) convey a special meaning to the regular expression engine (see below). The engine interprets the regular expression, performs the pattern matching, and processes the results. This combination of a pattern language and a processing engine makes regular expressions a powerful tool for simplified pattern matching. Implementations such as java.util.regex and jregex provide additional processing functions (such as replacing and splitting) that make it easy to modify the target text. For other Java implementations of regular expressions, and for implementations in other languages, refer to the references listed at the end of this article.
  
Metacharacters

Metacharacters give regular expressions their advanced expressive power. This article describes the common metacharacters supported by Java. For a complete list, see the Sun API documentation for the java.util.regex.Pattern class.
  
Anchors

An anchor is a predefined position in the target text. It works like a reference point: it fixes the position of the other elements of a regular expression relative to it. Although an anchor can match arbitrary positions (by using the lookaround constructs in Listing 1), its typical use is to match the boundaries of strings, lines, words, and so on. A lookaround construct matches a position that satisfies a given condition. A positive lookahead, for example "(?=Neo)", matches a position that is followed by "Neo", and a negative lookahead, for example "(?!Neo)", matches a position that is not followed by "Neo". The lookbehind constructs, "(?<=...)" and its negative form "(?<!...)", work the same way but look in the opposite direction, at the text before the position.
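The anchor and lookaround behaviour described above can be sketched as follows. This is a minimal illustration, not code from the article's listings; the class name LookaroundDemo, the helper firstMatchStart, and the sample sentence are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: shows where anchors and lookarounds match.
final class LookaroundDemo {
    // Returns the start index of the first match of regex in text, or -1 if there is none.
    static int firstMatchStart(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        String text = "Mr. Anderson is Neo";
        // \b is a word-boundary anchor: "Neo" must stand alone as a word.
        System.out.println(firstMatchStart("\\bNeo\\b", text));  // 16
        // Positive lookahead: the position directly before "Neo"; no characters are consumed.
        System.out.println(firstMatchStart("(?=Neo)", text));    // 16
        // Negative lookahead: the first position that is NOT followed by "Neo".
        System.out.println(firstMatchStart("(?!Neo)", text));    // 0
        // Positive lookbehind: the position directly after "is ".
        System.out.println(firstMatchStart("(?<=is )", text));   // 16
    }
}
```

Lookarounds match positions rather than characters, which is why the lookahead and lookbehind examples both report index 16 without consuming any text.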
  
Character classes, class shorthands, and alternation

The character class construct "[...]" specifies a list of characters that may appear at that point in a regular expression, and the construct "[^...]" specifies a list of characters that may not. With "[...]", the match succeeds if any one of the listed characters is found in the target text. For example, the regular expression "[cw]ould" matches "could" and "would". A character class implies a logical OR between its elements, which is also known as alternation. Alternation means that "(x|y)" matches successfully on either "x" or "y". The same regular expression can therefore also be written as "(c|w)ould".
Inside a character class, the special metacharacter "-" specifies a range, so "[a-z]" matches any lowercase English letter. A class shorthand is an abbreviated form of a commonly used character class, such as digits, word characters, or whitespace. Listing 1 lists the class shorthands available in Java.
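As a small sketch of the constructs just described, the following compares the character class form with the equivalent alternation, and shows a range and a class shorthand. The class and method names are assumptions chosen for this example.

```java
// Hypothetical demo class for character classes, ranges, shorthands, and alternation.
final class CharClassDemo {
    // "[cw]ould": the first character must be 'c' or 'w'.
    static boolean matchesClass(String word) {
        return word.matches("[cw]ould");
    }

    // "(c|w)ould": the same condition written as an alternation.
    static boolean matchesAlternation(String word) {
        return word.matches("(c|w)ould");
    }

    // "[a-z]+": one or more lowercase English letters (a range inside a class).
    static boolean isLowercaseWord(String word) {
        return word.matches("[a-z]+");
    }

    // "\d+": one or more digits, using the class shorthand for "[0-9]".
    static boolean isDigits(String text) {
        return text.matches("\\d+");
    }

    public static void main(String[] args) {
        System.out.println(matchesClass("could"));        // true
        System.out.println(matchesAlternation("would"));  // true
        System.out.println(matchesClass("should"));       // false
        System.out.println(isLowercaseWord("neo"));       // true
        System.out.println(isDigits("010101"));           // true
    }
}
```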
  
Quantifiers

A quantifier specifies how many instances of an element must be present for a match to succeed. Java supports three kinds of quantifiers: greedy, reluctant, and possessive. A greedy quantifier tries to match as much as possible, while a reluctant quantifier tries to match as little as possible. A greedy quantifier therefore keeps consuming input even after a possible match has been found, potentially up to the end of the input, and only then gives characters back as needed. If the target text is large, the performance overhead can be large as well. A reluctant quantifier (also known as a lazy quantifier) stops as soon as a successful match is found instead of continuing to the end of the input. A possessive quantifier is used to optimize matching: it does not save the state of the earlier match attempts, so it never backtracks. Listing 1 lists these three kinds of quantifiers.
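The difference between the three kinds of quantifiers is easiest to see on the same input. The following is a minimal sketch; the class name, the helper firstMatch, and the sample text are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing greedy, reluctant, and possessive quantifiers.
final class QuantifierDemo {
    // Returns the text of the first match of regex in text, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "<b>Neo</b> and <b>Trinity</b>";
        // Greedy ".*" consumes as much as possible, so the match runs to the last "</b>".
        System.out.println(firstMatch("<b>.*</b>", text));   // <b>Neo</b> and <b>Trinity</b>
        // Reluctant ".*?" consumes as little as possible, so the match stops at the first "</b>".
        System.out.println(firstMatch("<b>.*?</b>", text));  // <b>Neo</b>
        // Possessive ".*+" consumes everything and never backtracks, so "</b>" can no longer match.
        System.out.println(firstMatch("<b>.*+</b>", text));  // null
    }
}
```

The possessive case shows the trade-off: skipping the saved backtracking state makes a failing match fail fast, but a pattern that relies on backtracking will stop matching.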
  
Mode modifiers

Some constructs enable or disable specific engine features. These features are disabled by default because they incur additional overhead. For example, "(?i)" inside a regular expression enables case-insensitive matching. Java also supports specifying mode modifiers at compile time, through static members of java.util.regex.Pattern. The Pattern class is described below.
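The two ways of enabling a modifier can be compared side by side. This is a small sketch using case-insensitive matching; the class and method names are assumptions made for this example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: inline modifier versus compile-time flag.
final class ModifierDemo {
    // Inline modifier: "(?i)" switches on case-insensitive matching inside the expression.
    private static final Pattern INLINE = Pattern.compile("(?i)neo");
    // Compile-time flag: the same effect through a static member of Pattern.
    private static final Pattern FLAGGED = Pattern.compile("neo", Pattern.CASE_INSENSITIVE);
    // No modifier: matching is case-sensitive by default.
    private static final Pattern DEFAULT = Pattern.compile("neo");

    static boolean matchesInline(String s)   { return INLINE.matcher(s).matches(); }
    static boolean matchesWithFlag(String s) { return FLAGGED.matcher(s).matches(); }
    static boolean matchesDefault(String s)  { return DEFAULT.matcher(s).matches(); }

    public static void main(String[] args) {
        System.out.println(matchesInline("NEO"));    // true
        System.out.println(matchesWithFlag("NEO"));  // true
        System.out.println(matchesDefault("NEO"));   // false
    }
}
```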
  
Example 1: input verification

Now let's look at an example that uses the metacharacters just discussed to validate passwords as required by Zion. Zion's security standards require that a password contain only letters and digits, contain at least one digit, and be 6 to 32 characters long.
Listing 2 and Listing 3 show two possible solutions. The first (Listing 2) uses the regular expression support built into the matches() method of java.lang.String. The second (Listing 3) uses the classes provided by the java.util.regex package. The underlying mechanism of the two approaches is exactly the same, as discussed below. The next section covers the API in detail.
Now consider how the solutions meet the requirements. The regular expression on line 3 of Listing 2 is the same as the pattern Pcontent in Listing 3. This pattern combines several metacharacters: the character class "[a-z]", the class shorthand "\d" (which stands for the character class "[0-9]"), and the greedy quantifiers * and +. The pattern "\\b(?i)([a-z]*\\d+[a-z]*)\\b" matches, between word boundaries, zero or more letters followed by one or more digits followed by zero or more letters. The modifier "(?i)" makes the search case-insensitive. Note that the regular expressions in the two listings differ slightly. The obvious difference is that Listing 3 uses comments. The other difference is less conspicuous but very important. Have you spotted it? The answer is in the next section.
Line 4 of Listing 2 specifies the password length. It uses {min,max} to specify the minimum and maximum number of repetitions. The pattern "\\b(?i)([a-z0-9]){6,32}\\b" matches if there are 6 to 32 letters or digits between the word boundaries. Note that Listing 3 uses a constant of the Pattern class to specify the case-insensitive option, which improves the readability of the regular expression. These constants are discussed later in this article.
Used together, the two regular expressions determine that "010101" and "m0rpheus" are valid, while "agentsmith" is invalid. There are other ways to validate a password besides regular expressions, but this shows how simple and elegant a regular expression solution can be.
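Since the listings themselves are not reproduced here, the following is a sketch of what the two approaches might look like, built only from the patterns quoted in the text. The class name, method names, and constant names are assumptions, not the original listings.

```java
import java.util.regex.Pattern;

// Hypothetical reconstruction of the two password checks described in the text.
final class PasswordCheck {
    // Precompiled patterns, reusable across calls. Non-capturing groups and a
    // compile-time flag stand in for the capturing groups and inline "(?i)".
    private static final Pattern P_CONTENT =
            Pattern.compile("\\b(?:[a-z]*\\d+[a-z]*)\\b", Pattern.CASE_INSENSITIVE);
    private static final Pattern P_LENGTH =
            Pattern.compile("\\b(?:[a-z0-9]){6,32}\\b", Pattern.CASE_INSENSITIVE);

    // Approach 1: String.matches(), which compiles each pattern on every call.
    static boolean isValidWithString(String password) {
        return password.matches("\\b(?i)([a-z]*\\d+[a-z]*)\\b")
                && password.matches("\\b(?i)([a-z0-9]){6,32}\\b");
    }

    // Approach 2: java.util.regex with precompiled Pattern objects.
    static boolean isValid(String password) {
        return P_CONTENT.matcher(password).matches()
                && P_LENGTH.matcher(password).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("010101"));      // true
        System.out.println(isValid("m0rpheus"));    // true
        System.out.println(isValid("agentsmith"));  // false: no digit
    }
}
```

One thing this sketch makes visible: the content pattern, as quoted, accepts only a single run of digits, so a password such as "a1b2c3" is rejected even though it satisfies the stated policy.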
  
A deeper look at metacharacters: grouping and capturing

Parentheses in a regular expression are used for grouping and capturing.
Parentheses "()" group elements and capture the corresponding submatch. Backreferences (\1, \2, and so on) refer, within the same regular expression, to the text captured by the corresponding group. Groups are numbered from left to right, by the position of their opening (left) parenthesis in the regular expression, and that number determines what each backreference refers to. Java also supports access to captured text from outside the regular expression: "$1", "$2", and so on can be used in replacement text, where "$1" refers to the same content as "\1", and the group(n) method of java.util.regex.Matcher returns the same text.
The other kind of parentheses, "(?:)", only groups and does not capture any text. It is useful when no backreference is needed: because it does not retain the matched text, it speeds up the matching operation. This is the difference between Listing 2 and Listing 3 mentioned above. Listing 3 is more efficient because it uses non-capturing parentheses.
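The three ideas in this section (backreferences inside the expression, "$1" in replacement text, and non-capturing groups) can be sketched together. The class name, method names, and the doubled-word example are assumptions chosen for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for capturing groups, backreferences, and non-capturing groups.
final class GroupingDemo {
    // "\1" inside the expression refers to the text captured by group 1,
    // so this pattern finds a word that is immediately repeated.
    private static final Pattern REPEATED = Pattern.compile("\\b(\\w+) \\1\\b");

    static String firstRepeatedWord(String text) {
        Matcher m = REPEATED.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    // "$1" in the replacement text refers to the same captured group.
    static String collapseRepeats(String text) {
        return REPEATED.matcher(text).replaceAll("$1");
    }

    // Number of capturing groups in an expression; "(?:...)" groups are not counted.
    static int captureCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(firstRepeatedWord("there is is no spoon"));  // is
        System.out.println(collapseRepeats("there is is no spoon"));    // there is no spoon
        System.out.println(captureCount("(c|w)ould"));                  // 1
        System.out.println(captureCount("(?:c|w)ould"));                // 0
    }
}
```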
  
The java.util.regex package

As introduced in J2SE 1.4, this package is quite small. It consists of just two final classes, java.util.regex.Pattern and java.util.regex.Matcher, and one exception class, java.util.regex.PatternSyntaxException. Together these classes form the Java regular expression framework.
An instance of the Pattern class is the result of compiling a regular expression string. A Matcher object is responsible for matching a given character sequence against the pattern and provides additional functions for accessing and using the match results. The Pattern class also provides a static matches(String regex, CharSequence input) method for cases where the pattern (and the matcher) does not need to be kept and reused. Note that both the Pcontent and plength patterns in Listing 3 are defined outside the method, so the pattern objects can be reused across method calls, which improves efficiency. This option is not available with the matches() method of java.lang.String, as used in Listing 2.
The compile() method of the Pattern class takes the regular expression string to be compiled and checks its syntax. If it finds a syntax error, it throws the unchecked exception java.util.regex.PatternSyntaxException. The modifiers discussed above can be supplied at compile time. The Pattern class represents them as flags in the form of static final constants (CASE_INSENSITIVE, DOTALL, and so on). Multiple flags can be combined with the "|" operator, as shown by the plength pattern in Listing 3, line 23.
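The compile-time behaviour described above can be sketched as follows: syntax checking with PatternSyntaxException, and combining flags with "|". The class name, method names, and sample pattern are assumptions made for this example.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class for Pattern.compile(), flags, and syntax errors.
final class CompileDemo {
    // Two flags combined with "|": ignore case, and let "." also match line terminators.
    private static final Pattern NAMES =
            Pattern.compile("neo.trinity", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // compile() checks the syntax and throws the unchecked PatternSyntaxException on error.
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    static boolean matchesNames(String text) {
        return NAMES.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidRegex("[a-z]+"));        // true
        System.out.println(isValidRegex("[a-z"));          // false: unclosed character class
        System.out.println(matchesNames("NEO\nTrinity"));  // true
    }
}
```

Because PatternSyntaxException is unchecked, the compiler does not force callers to handle it; catching it explicitly is mainly useful when the expression comes from user input.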
Similarly, there are many methods for querying the results. group() and group(int i) are the two most commonly used. The group() method returns the text matched by the most recent match operation, while group(int i) returns the text captured by group i (defined with capturing parentheses). For example, when the regular expression "(\\d)([a-c])" is applied to the string "2140ad", group(1) returns "0", group(2) returns "a", and group(0) returns "0a". Note that group(0) always returns the entire matched text.
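The group example from the text can be run directly. In this sketch the class name and the helper groups are assumptions; the pattern and input are the ones quoted above.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: collects group(0)..group(n) of the first match.
final class GroupDemo {
    static String[] groups(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        if (!m.find()) {
            return new String[0];  // no match, so there are no groups to report
        }
        // Index 0 is the whole match; indexes 1..groupCount() are the capturing groups.
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        // The first digit that is directly followed by a, b, or c is the "0" before "a".
        System.out.println(Arrays.toString(groups("(\\d)([a-c])", "2140ad")));  // [0a, 0, a]
    }
}
```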
Finally, let's discuss the text replacement functions of the Matcher class and look at two examples. The replaceAll(String replacement) method replaces all matched text with the given new text, while the replaceFirst(String replacement) method replaces only the first match. The advanced replacement operations appendReplacement() and appendTail() provide fine-grained control over replacement. Listing 4 demonstrates their use.
  
Example 2: Text Conversion

So far we have discussed validating an input string. Now let's extend the password validation example so that the target text is modified to meet a requirement. Zion's security policy requires the password to be encrypted in some way before transmission. Listing 4 shows a simple text conversion tool that uses regular expressions to invert the digits in the text. The purpose of this example is not to produce a serious encryption algorithm, but to demonstrate some advanced features of regular expressions.
The program uses a simple regular expression (line 5) to match the digits in the password. The find() method searches for digits and also serves to iterate over the match results (line 7). The group(1) method returns the text captured by the parentheses in "(\\d)", that is, a single digit. Each captured digit is then inverted by looking it up in an array of numbers and added to the StringBuffer that holds the new password. The appendReplacement() method inserts the replacement text, and appendTail() appends the remaining text. The password "010101" is converted to "989898".
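Listing 4 is not reproduced here, so the following is a sketch of the conversion as the text describes it. The class name, the lookup array INVERSE, and the method invert are assumptions; the mapping of each digit d to 9 - d is inferred from the "010101" to "989898" example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical reconstruction of the digit-inverting conversion described in the text.
final class DigitInverter {
    // Lookup array: the digit d is replaced by INVERSE[d], which is 9 - d.
    private static final String[] INVERSE =
            {"9", "8", "7", "6", "5", "4", "3", "2", "1", "0"};
    private static final Pattern DIGIT = Pattern.compile("(\\d)");

    static String invert(String password) {
        Matcher m = DIGIT.matcher(password);
        StringBuffer converted = new StringBuffer();
        // find() both searches for the next digit and drives the iteration.
        while (m.find()) {
            int digit = Integer.parseInt(m.group(1));
            // Copies the text since the previous match, then the replacement for this match.
            m.appendReplacement(converted, INVERSE[digit]);
        }
        // Copies whatever follows the last match.
        m.appendTail(converted);
        return converted.toString();
    }

    public static void main(String[] args) {
        System.out.println(invert("010101"));    // 989898
        System.out.println(invert("m0rpheus"));  // m9rpheus
    }
}
```

Unlike replaceAll(), this loop lets the replacement depend on what was matched, which is the fine-grained control that appendReplacement() and appendTail() are meant to provide.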
  
Example 3

The patterns in Listing 5 demonstrate the use of regular expressions for matching email addresses and web addresses. The email pattern (Listing 5, line 1) matches all addresses ending with matrix.com, matrix.net, or matrix.org. It also matches the name of the person that immediately follows the address. For example, in "tn@matrix.com (Trinity)" the pattern matches the address and the name "Trinity" that follows it. The URL pattern matches the host and an optional path, as in http://www.zion.com/antimatrix.html. Note the additional Pattern.MULTILINE option, which is used because the input may span multiple lines: with it, "^" and "$" match at the beginning and end of each line, whereas by default they match only at the beginning and end of the entire input.
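Listing 5 is not reproduced here. The sketch below shows one way such patterns could be written; both expressions, the class name, and the method names are assumptions made for illustration and are deliberately simplistic, not general-purpose email or URL validators.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical patterns in the spirit of the email and URL examples described in the text.
final class AddressDemo {
    // Group 1: an address at matrix.com, matrix.net, or matrix.org.
    // Group 2: an optional name in parentheses directly after the address.
    private static final Pattern EMAIL =
            Pattern.compile("([\\w.]+@matrix\\.(?:com|net|org))(?:\\s*\\((\\w+)\\))?");

    // Group 1: host. Group 2: optional path. MULTILINE lets "^" and "$"
    // match at line boundaries, so a URL on its own line is found in multi-line input.
    private static final Pattern URL =
            Pattern.compile("^https?://([\\w.-]+)(/\\S*)?$", Pattern.MULTILINE);

    // Returns {address, name}; name is null if absent. Returns null if nothing matches.
    static String[] emailAndName(String text) {
        Matcher m = EMAIL.matcher(text);
        return m.find() ? new String[] {m.group(1), m.group(2)} : null;
    }

    // Returns {host, path}; path is null if absent. Returns null if nothing matches.
    static String[] hostAndPath(String text) {
        Matcher m = URL.matcher(text);
        return m.find() ? new String[] {m.group(1), m.group(2)} : null;
    }

    public static void main(String[] args) {
        String[] email = emailAndName("Contact tn@matrix.com (Trinity) now");
        System.out.println(email[0] + " / " + email[1]);  // tn@matrix.com / Trinity
        String[] url = hostAndPath("See:\nhttp://www.zion.com/antimatrix.html\nend");
        System.out.println(url[0] + " / " + url[1]);      // www.zion.com / /antimatrix.html
    }
}
```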
  
Conclusion

Regular expressions make it easier to write pattern matching programs for tasks such as form validation, converting text to HTML and back, parsing text, and building helper utilities. With the introduction of the regular expression package in Java, using regular expressions has become more convenient and no longer depends on external packages. As we have seen, Java supports the commonly used regular expression constructs, and this article has demonstrated their use. Because of space limitations, however, this article cannot cover more of them. For more information, see the documents listed at the end of this article.
Note: regular expression constructs in Java may have slightly different meanings from those in other languages (such as Perl and .NET), so regular expressions may not be completely portable.
Listing 1 lists a subset of the regular expression metacharacters supported by Java. Listing 2 demonstrates password validation using java.lang.String. Listing 3 demonstrates password validation using java.util.regex. Listing 4 demonstrates text conversion using java.util.regex.
