Notes on learning regular expressions (page 1 of 2)

Source: Internet
Author: User
Of course, the ... notation is not precise enough to specify a pattern. You need to state exactly which sequences of characters count as a legitimate match, and describing a pattern requires a special syntax.

Here is a simple example. The regular expression
[Jj]ava.+
matches any string of the following form:

The first letter is a J or j.
The next three letters are ava.
The remainder of the string consists of one or more arbitrary characters.

For example, the string "javaness" matches this particular regular expression, but the string "Core Java" does not.
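The example above can be checked with a few lines of Java. This is a minimal sketch: the class name JavaPrefixDemo and the helper isMatch are made up for illustration, and only Pattern.matches comes from java.util.regex.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only Pattern.matches is part of the standard library.
class JavaPrefixDemo {
    // The expression from the text: J or j, then "ava", then one or more characters.
    static final String REGEX = "[Jj]ava.+";

    // Pattern.matches succeeds only if the entire input matches the expression.
    static boolean isMatch(String input) {
        return Pattern.matches(REGEX, input);
    }

    public static void main(String[] args) {
        System.out.println(isMatch("javaness"));  // true
        System.out.println(isMatch("Core Java")); // false: the input starts with "Core "
        System.out.println(isMatch("Java"));      // false: .+ needs at least one more character
    }
}
```

Note that the whole string has to match; "Core Java" contains "Java" but neither starts with it nor has any character after it.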

As you can see, you need to know a little syntax to understand the meaning of a regular expression. Fortunately, for most purposes, a small number of straightforward constructs is sufficient.

A character class is a set of character alternatives enclosed in brackets, such as [Jj], [0-9], [A-Za-z], or [^0-9]. Here the - denotes a range (all characters whose Unicode values fall between the two bounds), and ^ denotes the complement (all characters except those specified).
There are many predefined character classes, such as \d (digits) or \p{Sc} (Unicode currency symbols), as shown in Tables 12-8 and 12-9.
Most characters match themselves, such as the ava characters in the example above.
The . symbol matches any character (except possibly line terminators, depending on flag settings).
\ is used as an escape character: \. matches a period, and \\ matches a backslash.
^ and $ match the beginning and the end of a line, respectively.
If X and Y are both regular expressions, XY means "any match for X followed by a match for Y", and X|Y means "any match for X or Y".
You can apply quantifiers to an expression: X+ means X repeated 1 or more times, X* means X repeated 0 or more times, and X? means X repeated 0 or 1 times.
By default, a quantifier matches the longest possible repetition that still lets the overall match succeed. You can change that behavior with the suffix ? (called a reluctant or stingy match, which matches the smallest number of repetitions) or the suffix + (called a possessive or greedy match, which matches the largest number of repetitions even if that makes the overall match fail).


For example, the string cab matches [a-z]*ab but does not match [a-z]*+ab. In the first case, [a-z]* matches only the character c, so the characters ab match exactly the rest of the pattern. But the greedy version [a-z]*+ matches all of cab, leaving nothing for the remaining ab of the pattern, so the overall match fails.
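The difference between the three quantifier modes can be sketched as follows. The class QuantifierDemo and its helpers are hypothetical names; the tag-like strings used for the reluctant case are an extra illustration that does not come from the text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing default, possessive, and reluctant quantifiers.
class QuantifierDemo {
    // True if the whole input matches the expression.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Returns the first substring that the expression matches, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        // Default quantifier: [a-z]* gives back "ab" so that the rest of the pattern can match.
        System.out.println(matches("[a-z]*ab", "cab"));   // true
        // Possessive quantifier: [a-z]*+ keeps all of "cab", so "ab" has nothing left to match.
        System.out.println(matches("[a-z]*+ab", "cab"));  // false

        // Default versus reluctant on an illustrative input.
        System.out.println(firstMatch("<.+>", "<a><b>"));  // <a><b>
        System.out.println(firstMatch("<.+?>", "<a><b>")); // <a>
    }
}
```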

You can use groups to define subexpressions. Enclose a group in ( ), as in ([+-]?)([0-9]+). You can then ask the pattern matcher to return the match for each group, or refer back to a group with \n, where n is the group number (starting with \1).
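A short sketch of both uses of groups follows. GroupDemo and its two helpers are illustrative names, and the back-reference pattern ([a-z])\1 with the input "letter" is an assumed example rather than one from the text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for capturing groups and back-references.
class GroupDemo {
    // Splits a signed integer into its sign and digits using the two groups from the text.
    static String[] signAndDigits(String input) {
        Matcher matcher = Pattern.compile("([+-]?)([0-9]+)").matcher(input);
        if (!matcher.matches()) {
            return null;
        }
        return new String[] { matcher.group(1), matcher.group(2) };
    }

    // Finds the first doubled lowercase letter; \1 refers back to whatever group 1 matched.
    static String firstDoubledLetter(String input) {
        Matcher matcher = Pattern.compile("([a-z])\\1").matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String[] parts = signAndDigits("-42");
        System.out.println(parts[0] + " | " + parts[1]);   // - | 42
        System.out.println(firstDoubledLetter("letter"));  // tt
    }
}
```

In a Java string literal the backslash itself must be escaped, which is why \1 is written as "\\1".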

Here is a slightly more complex but useful regular expression. It describes integers in decimal or hexadecimal notation:
[+-]?[0-9]+|0[Xx][0-9A-Fa-f]+
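As a rough check of what this expression accepts, the sketch below tests a handful of inputs. IntegerLiteralDemo and isInteger are hypothetical names; the sample inputs are chosen for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for the decimal-or-hexadecimal integer expression.
class IntegerLiteralDemo {
    // Compiled once and reused; either a signed decimal number or 0x/0X plus hex digits.
    static final Pattern INTEGER = Pattern.compile("[+-]?[0-9]+|0[Xx][0-9A-Fa-f]+");

    static boolean isInteger(String input) {
        return INTEGER.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isInteger("42"));    // true
        System.out.println(isInteger("-7"));    // true
        System.out.println(isInteger("0x1F"));  // true
        System.out.println(isInteger("0x"));    // false: no hex digits after the prefix
        System.out.println(isInteger("1F"));    // false: hex digits need the 0x prefix
        System.out.println(isInteger("+0x1F")); // false: the sign belongs only to the decimal alternative
    }
}
```

The last case follows from operator precedence: | binds more loosely than concatenation, so [+-]? applies only to the decimal alternative.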

Unfortunately, the syntax is not completely standardized across the various programs and libraries that use regular expressions. There is consensus on the basic constructs, but there are many maddening differences in the details. The Java regular expression classes use a syntax that is similar to, but not quite the same as, the one used in the Perl language. Table 12-8 shows all the regular expression constructs of the Java syntax. To learn more about regular expressions, refer to the API documentation for the Pattern class, or to Jeffrey E. F. Friedl's book "Mastering Regular Expressions" (O'Reilly and Associates, 1997). (Translator's note: I just checked at a bookstore, and Southeast University Press has published a photocopy reprint of the second edition.)

Table 12-8 Regular Expression Syntax

--------------------------------------------------------------------------------
Syntax    Explanation

--------------------------------------------------------------------------------
Characters
c                                 The character c
\unnnn, \xnn, \0n, \0nn, \0nnn    The code unit with the given hexadecimal or octal value
\0n                               The character with octal value 0n (0 <= n <= 7)
\0nn                              The character with octal value 0nn (0 <= n <= 7)
\0mnn                             The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
\xnn                              The character with hexadecimal value 0xnn
\uhhhh                            The character with hexadecimal value 0xhhhh
\t, \n, \r, \f, \a, \e            The control characters tab, newline, carriage return, form feed, alert, and escape
\cc                               The control character corresponding to the character c

--------------------------------------------------------------------------------
Character classes
[C1C2...]     Any of the characters represented by C1, C2, ... Each Ci can be a character, a character range (c1-c2), or a character class.
[^...]        Complement of a character class
[...&&...]    Intersection of two character classes

--------------------------------------------------------------------------------
Predefined character classes
.           Any character except line terminators (or any character if the DOTALL flag is set)
\d          A digit [0-9]
\D          A nondigit [^0-9]
\s          A whitespace character [ \t\n\r\f\x0B]
\S          A non-whitespace character
\w          A word character [a-zA-Z0-9_]
\W          A nonword character
\p{name}    A named character class, see Table 12-9
\P{name}    The complement of a named character class

--------------------------------------------------------------------------------
Boundary matchers
^ $    Beginning and end of the input (or beginning and end of a line in multiline mode)
\b     A word boundary
\B     A nonword boundary
\A     Beginning of the input
\z     End of the input
\Z     End of the input except for the final line terminator
\G     End of the previous match

--------------------------------------------------------------------------------
Quantifiers
X?                   Optional X (that is, X may or may not appear)
X*                   X, repeated 0 or more times
X+                   X, repeated 1 or more times
X{n} X{n,} X{n,m}    X repeated n times, at least n times, between n and m times

--------------------------------------------------------------------------------
Quantifier suffixes
?    Turn the default (greedy) match into a reluctant match
+    Turn the default (greedy) match into a possessive match

--------------------------------------------------------------------------------
Set operations
XY     A match of X followed by a match of Y
X|Y    A match of X or a match of Y

--------------------------------------------------------------------------------
Grouping
(X)    Matches X and captures it in an automatically numbered group
\n     The match of the nth group

--------------------------------------------------------------------------------
Escapes
\c         The character c (must not be a letter)
\Q...\E    Quote ... verbatim
(?...)     Special construct, see the API documentation of the Pattern class

--------------------------------------------------------------------------------

The simplest use of a regular expression is to test whether a particular string matches it. Here is how to program that test in Java. First, construct a Pattern object from the string that denotes the regular expression. Then get a Matcher object from the pattern and call its matches() method:

Pattern pattern = Pattern.compile(patternString);

Matcher matcher = pattern.matcher(input);

if (matcher.matches()) ...
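The three steps above can be put into a complete program along the following lines. This is only a sketch: MatchTester and filterMatches are made-up names, and the digit pattern and inputs are illustrative. It also shows that one compiled Pattern can produce a Matcher for each of several inputs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: compile a pattern once, then test several inputs against it.
class MatchTester {
    // Returns the inputs that match the expression in their entirety.
    static List<String> filterMatches(String patternString, List<String> inputs) {
        Pattern pattern = Pattern.compile(patternString);
        List<String> result = new ArrayList<>();
        for (String input : inputs) {
            Matcher matcher = pattern.matcher(input);
            if (matcher.matches()) {
                result.add(input);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> inputs = Arrays.asList("123", "12a", "", "7");
        System.out.println(filterMatches("[0-9]+", inputs)); // [123, 7]
    }
}
```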


Table 12-9 Predefined Character Class Names

--------------------------------------------------------------------------------
Lower                     ASCII lowercase letter [a-z]
Upper                     ASCII uppercase letter [A-Z]
Alpha                     ASCII letter [A-Za-z]
Digit                     ASCII digit [0-9]
Alnum                     ASCII letter or digit [A-Za-z0-9]
XDigit                    Hexadecimal digit [0-9A-Fa-f]
Print or Graph            Printable ASCII character [\x21-\x7E]
Punct                     ASCII character that is neither a letter nor a digit [\p{Print}&&\P{Alnum}]
ASCII                     Any ASCII character [\x00-\x7F]
Cntrl                     ASCII control character [\x00-\x1F]
Blank                     Space or tab [ \t]
Space                     Whitespace character [ \t\n\r\f\x0B]
javaLowerCase             Lowercase character, as determined by Character.isLowerCase()
javaUpperCase             Uppercase character, as determined by Character.isUpperCase()
javaWhitespace            Whitespace character, as determined by Character.isWhitespace()
javaMirrored              Mirrored character, as determined by Character.isMirrored()
InBlock                   Block is the name of a Unicode character block, with spaces removed, such as BasicLatin or Mongolian. For a list of block names, see http://www.unicode.org
Category or InCategory    Category is the name of a Unicode character category, such as L (letter) or Sc (currency symbol). For a list of category names, see http://www.unicode.org

--------------------------------------------------------------------------------
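A few of the names in Table 12-9 can be tried out as follows. This is a minimal sketch; NamedClassDemo and its helper are hypothetical, and the sample strings are chosen only for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class exercising some named character classes with \p{...} and \P{...}.
class NamedClassDemo {
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // One ASCII uppercase letter followed by ASCII lowercase letters.
        System.out.println(matches("\\p{Upper}\\p{Lower}+", "Java"));     // true
        // A Unicode currency symbol (category Sc) followed by ASCII digits.
        System.out.println(matches("\\p{Sc}\\p{Digit}+", "$100"));        // true
        // Hexadecimal digits only.
        System.out.println(matches("\\p{XDigit}+", "CAFE"));              // true
        // Characters from the Unicode block Basic Latin (written without the space).
        System.out.println(matches("\\p{InBasicLatin}+", "abc"));         // true
        // Uppercase P is the complement: a single character that is not an ASCII letter.
        System.out.println(matches("\\P{Alpha}", "1"));                   // true
        // javaLowerCase follows Character.isLowerCase, so it also accepts non-ASCII letters.
        System.out.println(matches("\\p{javaLowerCase}+", "\u00e9t\u00e9")); // true
        System.out.println(matches("\\p{Lower}+", "\u00e9t\u00e9"));         // false: e-acute is not ASCII
    }
}
```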

The input of the matcher can be an object of any class that implements the CharSequence interface, such as String, StringBuilder, or CharBuffer.
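The following sketch passes each of the three types named above to the same matcher call. CharSequenceDemo and allDigits are illustrative names, and the digit pattern is an assumed example.

```java
import java.nio.CharBuffer;
import java.util.regex.Pattern;

// Hypothetical demo class: Pattern.matcher accepts any CharSequence.
class CharSequenceDemo {
    static final Pattern DIGITS = Pattern.compile("[0-9]+");

    // The parameter type is CharSequence, so no conversion to String is needed.
    static boolean allDigits(CharSequence input) {
        return DIGITS.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(allDigits("2024"));                               // String: true
        System.out.println(allDigits(new StringBuilder("20").append("24"))); // StringBuilder: true
        System.out.println(allDigits(CharBuffer.wrap("20x4")));              // CharBuffer: false
    }
}
```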

When compiling the pattern, you can set one or more flags, for example:

Pattern pattern = Pattern.compile(patternString,

Pattern.CASE_INSENSITIVE + Pattern.UNICODE_CASE);


The following six flags are supported:

CASE_INSENSITIVE: Match characters independently of letter case. By default, this flag takes only US ASCII characters into account.

UNICODE_CASE: When used in combination with CASE_INSENSITIVE, use Unicode letter case for matching.

MULTILINE: ^ and $ match the beginning and end of a line, not of the entire input.

UNIX_LINES: Only '\n' is recognized as a line terminator when matching ^ and $ in multiline mode.

DOTALL: When this flag is set, the . symbol matches all characters, including line terminators.

CANON_EQ: Take canonical equivalence of Unicode characters into account.
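The effect of several of these flags can be observed with a small experiment. The sketch below uses made-up names (FlagDemo, matches, finds) and illustrative inputs; it combines flags with the bitwise | operator, which for these distinct bit flags gives the same value as the + used above. The find() method, which searches for a match anywhere in the input, is used for the MULTILINE case.

```java
import java.util.regex.Pattern;

// Hypothetical demo class showing how compile flags change matching behavior.
class FlagDemo {
    // True if the whole input matches under the given flags (0 means no flags).
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    // True if some part of the input matches under the given flags.
    static boolean finds(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        // CASE_INSENSITIVE on ASCII letters.
        System.out.println(matches("java", 0, "JAVA"));                        // false
        System.out.println(matches("java", Pattern.CASE_INSENSITIVE, "JAVA")); // true

        // A non-ASCII letter (e-acute) needs UNICODE_CASE as well.
        int both = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        System.out.println(matches("\u00e9", Pattern.CASE_INSENSITIVE, "\u00c9")); // false
        System.out.println(matches("\u00e9", both, "\u00c9"));                     // true

        // DOTALL lets . match a line terminator.
        System.out.println(matches("a.b", 0, "a\nb"));              // false
        System.out.println(matches("a.b", Pattern.DOTALL, "a\nb")); // true

        // MULTILINE lets ^ and $ match at line boundaries inside the input.
        System.out.println(finds("^b$", 0, "a\nb\nc"));                 // false
        System.out.println(finds("^b$", Pattern.MULTILINE, "a\nb\nc")); // true
    }
}
```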


If the regular expression contains groups, the Matcher object can reveal the group boundaries. The methods

int start(int groupIndex)

int end(int groupIndex)

return the starting index and the past-the-end index of a particular group.
You can simply extract the matched string by calling String group(int groupIndex).
Group 0 is the entire match; the index of the first actual group is 1. Call groupCount() to get the total number of groups.

Nested groups are ordered by their opening parentheses. For example, given the pattern
((1?[0-9]):([0-5][0-9]))[ap]m
and the input
11:59am
the Matcher object reports the following groups:


--------------------------------------------------------------------------------
Group index    Start    End    String

--------------------------------------------------------------------------------
0              0        7      11:59am
1              0        5      11:59
2              0        2      11
3              3        5      59

--------------------------------------------------------------------------------
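The table above can be reproduced with the start, end, group, and groupCount methods. In this sketch, GroupBoundaryDemo and groupReport are hypothetical names; each row is formatted as index, start, end, and matched string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class that lists the boundaries of every group in a match.
class GroupBoundaryDemo {
    // Returns one row per group ("index start end string"), or an empty list if there is no match.
    static List<String> groupReport(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> rows = new ArrayList<>();
        if (matcher.matches()) {
            // Group 0 is the entire match, so the loop runs from 0 to groupCount() inclusive.
            for (int i = 0; i <= matcher.groupCount(); i++) {
                rows.add(i + " " + matcher.start(i) + " " + matcher.end(i) + " " + matcher.group(i));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String row : groupReport("((1?[0-9]):([0-5][0-9]))[ap]m", "11:59am")) {
            System.out.println(row);
        }
        // Prints:
        // 0 0 7 11:59am
        // 1 0 5 11:59
        // 2 0 2 11
        // 3 3 5 59
    }
}
```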