Regular Expressions (39)

Last Update:2016-08-08 Source: Internet

Author: User

Tags pear posix


Introduction to Regular Expressions:

A regular expression is a grammar for describing patterns of characters. It is used mainly to split, match, search, and replace strings. The exact text matching used earlier is itself a simple form of regular expression.
In PHP, a regular expression is a formal description of a text pattern, made up of ordinary characters and some special characters (similar to wildcards).

In PHP, regular expressions serve three purposes:

- Matching, which is often used to extract information from a string.
- Replacing matched text with new text.
- Splitting a string into a set of smaller pieces.

A regular expression contains at least one atom.
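
The three uses listed above can be sketched with the PCRE functions preg_match, preg_replace, and preg_split. The sample string and variable names are made up for illustration; this is a minimal sketch, not the only way to do it.

```php
<?php
// Hypothetical sample data, used only for illustration.
$line = 'date=2016-08-08; tags=pear,posix';

// 1. Match: extract the year, month, and day.
preg_match('/(\d{4})-(\d{2})-(\d{2})/', $line, $m);
// $m[0] is the whole match; $m[1]..$m[3] are the captured parts.

// 2. Replace: swap the dashes in the date for slashes.
$replaced = preg_replace('/(\d{4})-(\d{2})-(\d{2})/', '$1/$2/$3', $line);

// 3. Split: break a tag list on commas.
$tags = preg_split('/,/', 'pear,posix');

print_r($m);
echo $replaced, "\n";
print_r($tags);
```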

PHP has two sets of regular expression functions. They are functionally similar but differ slightly in execution efficiency:

- One set is provided by the PCRE (Perl Compatible Regular Expressions) library. Its functions are named with the prefix "preg_".
- The other set is provided by the POSIX (Portable Operating System Interface of Unix) extension. Most of its functions are named with the prefix "ereg". These POSIX functions were deprecated in PHP 5.3 and removed in PHP 7.0, so current code should use the "preg_" functions.

One reason for using regular expressions is that an ordinary search-and-replace operation can only match exact text, which makes searching for dynamic text difficult or even impossible.

Syntax rules for regular expressions

PCRE regular expressions:
- PCRE stands for Perl Compatible Regular Expressions.
- PCRE comes from the Perl language, one of the most powerful languages for string manipulation, and PHP's early design drew heavily on Perl.
- PCRE syntax supports more features and is more powerful than POSIX syntax. For the same task, the PCRE library is also slightly more efficient. The two syntaxes still have a lot in common.
- In PCRE, a pattern (that is, a regular expression) is usually enclosed between two forward slashes "/", such as "/apple/". You simply put the pattern to be matched between the delimiters. The delimiter is not limited to "/": any character other than a letter, a digit, a backslash "\", or whitespace can be used, so "#", "|", and "!" all work.
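
As a small illustration of the delimiter rule, the sketch below writes the same pattern with three delimiters and shows why a different delimiter can be convenient. The subject strings are invented for the example.

```php
<?php
// The same pattern written with three different delimiters.
$subject = 'an apple a day';

$withSlash = preg_match('/apple/', $subject);
$withHash  = preg_match('#apple#', $subject);
$withBang  = preg_match('!apple!', $subject);

// Choosing another delimiter avoids escaping "/" inside the pattern.
// The path below is a made-up example value.
$path = '/usr/local/bin';
$escaped   = preg_match('/\/local\//', $path);
$unescaped = preg_match('#/local/#', $path);

var_dump($withSlash, $withHash, $withBang, $escaped, $unescaped);
```

preg_match returns 1 when the pattern matches and 0 when it does not, so all five results here should be 1.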

Atom

An atom is the basic unit of a regular expression and should be treated as a whole when a regular expression is analyzed.
Atoms include all English letters, digits, punctuation marks, and other symbols. The following also count as atoms:
- A single character or digit, such as a-z, A-Z, 0-9.
- A pattern unit, such as (abc), which can be understood as one large atom made up of several atoms.
- An atom table, such as [abc].
- A reused pattern unit, such as \\1.
- Ordinary escape characters, such as \d, \D, \w.
- Escaped metacharacters, such as \*, \.

Ordinary escape characters

Atom    Description
-----------------------------------------------------------------
\d      matches a digit; equivalent to [0-9]
\D      matches any character except a digit; equivalent to [^0-9]
\w      matches an English letter, digit, or underscore; equivalent to [0-9a-zA-Z_]
\W      matches any character except English letters, digits, and underscores; equivalent to [^0-9a-zA-Z_]
\s      matches a whitespace character; equivalent to [ \f\n\r\t] (recent PCRE versions also include the vertical tab \x0b)
\S      matches any character except whitespace; the negation of \s
\f      matches a form feed (page break); equivalent to \x0c or \cL
\n      matches a line feed; equivalent to \x0a or \cJ
\r      matches a carriage return; equivalent to \x0d or \cM
\t      matches a tab; equivalent to \x09 or \cI
\v      matches a vertical tab (\x0b or \cK); in current PCRE versions it matches any vertical whitespace character
\0NN    matches the character with octal code NN
\xNN    matches the character with hexadecimal code NN
\cX     matches the control character named by X (for example, \cJ is a line feed)
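
To make the table more concrete, here is a short sketch that applies \d, \w, \s, and the negated form \D to one string. The sample text and variable names are assumptions chosen for the example.

```php
<?php
// Hypothetical sample text; "\t" is a tab because the string is double-quoted.
$text = "Room 42,\tfloor_3";

// \d+ finds runs of digits.
preg_match_all('/\d+/', $text, $digits);

// \w+ finds runs of letters, digits, and underscores.
preg_match_all('/\w+/', $text, $words);

// \s matches the space and the tab; preg_match_all returns the match count.
$spaces = preg_match_all('/\s/', $text);

// \D is the negation of \d: remove everything that is not a digit.
$onlyDigits = preg_replace('/\D/', '', $text);

print_r($digits[0]);
print_r($words[0]);
```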

Metacharacters

Metacharacters are characters with a special meaning that are used to build regular expressions. To match a metacharacter itself in a regular expression, you must escape it by putting "\" in front of it.

Metacharacter   Description
-----------------------------------------------------------------
*       matches the preceding atom 0, 1, or more times
+       matches the preceding atom 1 or more times
?       matches the preceding atom 0 or 1 times
|       matches one of two or more alternatives
^ or \A matches at the start of the string
$ or \Z matches at the end of the string
\b      matches a word boundary
\B      matches any position except a word boundary
[]      matches any one of the atoms in the square brackets
[^]     matches any one character except the atoms in the square brackets
{m}     the preceding atom appears exactly m times
{m,n}   the preceding atom appears at least m times and at most n times (n>m)
{m,}    the preceding atom appears at least m times
()      treats the enclosed expression as a single atom
.       matches any one character except a line break
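
The counting forms {m}, {m,n}, and {m,}, and the escaping rule, are easy to confuse, so a brief sketch may help. The test strings are invented; the anchors ^ and $ are added so each pattern has to cover the whole string.

```php
<?php
// {m}: exactly four digits.
$isYear = preg_match('/^\d{4}$/', '2016');

// {m,n}: between 3 and 5 lowercase letters.
$fits    = preg_match('/^[a-z]{3,5}$/', 'pear');
$tooLong = preg_match('/^[a-z]{3,5}$/', 'banana');

// {m,}: at least two "o" characters.
$many = preg_match('/^go{2,}gle$/', 'gooogle');

// Escaping: "\." matches only a literal dot, a bare "." matches any character.
$literalDot = preg_match('/^3\.14$/', '3x14');
$anyChar    = preg_match('/^3.14$/', '3x14');

var_dump($isYear, $fits, $tooLong, $many, $literalDot, $anyChar);
```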

String boundary restrictions

In some cases the match needs to be restricted to a position in the string to get more accurate results. "^" and "$" specify the start and the end of the string, respectively.
For example, take a string that begins with "Tom and Jerry chased ..." and ends with "... in":
- The metacharacter "^" or "\A" is placed at the beginning of the pattern to ensure that the match occurs at the start of the string:
  /^Tom/
- The metacharacter "$" or "\Z" is placed at the end of the pattern to ensure that the match occurs at the end of the string:
  /in$/
- Without boundary metacharacters, you get more matches:
  /^Tom$/ is an exact match (the whole string must be "Tom"); /Tom/ is a fuzzy match (the string only needs to contain "Tom").
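
The effect of the boundary metacharacters can be checked with a short sketch. The subject string below is a hypothetical stand-in that starts with "Tom" and ends with "in"; it is not the sentence from the text.

```php
<?php
// Hypothetical subject string that starts with "Tom" and ends with "in".
$s = 'Tom and Jerry came in';

$atStart  = preg_match('/^Tom/', $s);    // anchored at the start
$atEnd    = preg_match('/in$/', $s);     // anchored at the end
$exact    = preg_match('/^Tom$/', $s);   // the whole string must be "Tom"
$fuzzy    = preg_match('/Tom/', $s);     // "Tom" anywhere in the string
$notFirst = preg_match('/^Jerry/', $s);  // "Jerry" is present, but not at the start

var_dump($atStart, $atEnd, $exact, $fuzzy, $notFirst);
```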

Word boundary restrictions

When you use the Find feature of an editor, you can get more accurate results by selecting "Match whole word". Regular expressions provide similar functionality.
For example, in the string "This island is a beautiful land":
- The metacharacter "\b" matches a word boundary.
  /\bis\b/ matches the word "is" and does not match "This" or "island".
  /\bis/ matches the "is" in the words "is" and "island", but not the one in "This".
- The metacharacter "\B" matches any position except a word boundary.
  /\Bis\B/ explicitly requires that neither side of "is" be a word boundary, so it matches only inside a word. There is no match in this example.
  /\Bis/ matches the "is" in the word "This".
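
The four patterns can be compared by counting their matches. This sketch assumes the example sentence contains the standalone word "is"; preg_match_all returns the number of matches, and PREG_OFFSET_CAPTURE reports where a match starts.

```php
<?php
// Assumed example sentence containing the standalone word "is".
$s = 'This island is a beautiful land';

$wholeWord  = preg_match_all('/\bis\b/', $s);  // only the word "is"
$wordStart  = preg_match_all('/\bis/', $s);    // "is" and the start of "island"
$inside     = preg_match_all('/\Bis\B/', $s);  // "is" strictly inside a word
$notAtStart = preg_match_all('/\Bis/', $s);    // "is" preceded by a word character

// PREG_OFFSET_CAPTURE stores [matched text, byte offset] for each entry.
preg_match('/\bis\b/', $s, $m, PREG_OFFSET_CAPTURE);

var_dump($wholeWord, $wordStart, $inside, $notAtStart);
print_r($m);
```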

Repeat matching

Some metacharacters in regular expressions are used to match an atom repeatedly: "?", "*", "+". They differ in the number of repetitions they allow.
- Metacharacter "?": 0 or 1 occurrence of the atom immediately before it.
  For example: /colou?r/ matches "colour" or "color".
- Metacharacter "*": 0, 1, or more occurrences of the atom immediately before it.
  For example: /zo*/ matches "z", "zo", "zoo".
- Metacharacter "+": 1 or more occurrences of the atom immediately before it.
  For example: /go+gle/ matches "gogle", "google", "gooogle", and so on, with one or more "o" characters in the middle.
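
A quick way to see the difference between the three quantifiers is to test each one against strings that sit on either side of its limit. The anchors ^ and $ are an addition here so that the whole string has to match.

```php
<?php
// "?": the "u" may appear zero times or once, but not twice.
$color   = preg_match('/^colou?r$/', 'color');
$colour  = preg_match('/^colou?r$/', 'colour');
$colouur = preg_match('/^colou?r$/', 'colouur');

// "*": the "o" may be absent or repeated.
$z   = preg_match('/^zo*$/', 'z');
$zoo = preg_match('/^zo*$/', 'zoo');

// "+": at least one "o" is required.
$ggle    = preg_match('/^go+gle$/', 'ggle');
$gooogle = preg_match('/^go+gle$/', 'gooogle');

var_dump($color, $colour, $colouur, $z, $zoo, $ggle, $gooogle);
```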

Any single character

The metacharacter "." matches any one character except a line break.
- It is equivalent to [^\n] (Unix systems) or [^\r\n] (Windows systems); PHP's PCRE functions normally treat "\n" as the line break.
- For example: /pr.y/ matches the strings "prey", "pray", "pr%y", and so on.
- The combination ".*" is commonly used to match any sequence of characters that contains no line break. Some books also call this a "full match" or "single-containing match".
- For example:
  /^a.*z$/ matches any string without a line break that begins with the letter "a" and ends with the letter "z".
  /.+/ works similarly, except that it must match at least one character.
  /^a.+z$/ matches "a%z" but does not match the string "az".
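
The sketch below checks the dot against an ordinary character and a line break, and contrasts ".*" with ".+". The test strings are chosen for illustration only.

```php
<?php
// "." matches "%" but not a line break.
$percent = preg_match('/^pr.y$/', 'pr%y');
$newline = preg_match('/^pr.y$/', "pr\ny");

// ".*" may match nothing at all, ".+" needs at least one character.
$starEmpty = preg_match('/^a.*z$/', 'az');
$plusEmpty = preg_match('/^a.+z$/', 'az');
$plusOne   = preg_match('/^a.+z$/', 'a%z');

var_dump($percent, $newline, $starEmpty, $plusEmpty, $plusOne);
```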

Atom table: the square bracket expression

An atom table "[]" holds a group of atoms of equal standing and matches exactly one of them. To match either "a" or "e", use [ae].
- For example: /pr[ae]y/ matches "pray" or "prey".
- The atom table "[^]", also called an exclusion atom table, matches any character except the atoms in the table.
  For example: /p[^u]/ matches "pa" in "part", but cannot match "pu" in "computer" because "u" is excluded from the match.
- A "-" inside an atom table joins a range of atoms in ASCII order, which shortens the pattern.
  For example: /x[0123456789]/ can be written as /x[0-9]/, which matches the letter "x" followed by one digit.
- More examples:
  /[a-zA-Z]/ matches any uppercase or lowercase letter.
  /^[a-z][0-9]$/ matches strings such as "z2", "t6", "g7".
  /0[xX][0-9a-fA-F]/ matches a simple hexadecimal number, such as "0x9".
  /[^0-9a-zA-Z_]/ matches any character except English letters, digits, and underscores, which is equivalent to \W.
  /0?[xX][0-9a-fA-F]+/ matches hexadecimal numbers such as "0x9B3C" or "X800".
  /<[a-zA-Z][a-zA-Z0-9]*>/ matches simple HTML opening tags, that is, "<" followed by a letter, then any number of letters or digits, then ">", in either uppercase or lowercase.

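The atom table examples can be tried out as follows. This is a sketch with made-up input strings; the HTML fragment in particular is only an illustrative value.

```php
<?php
// [ae]: either vowel is accepted.
$prey = preg_match('/^pr[ae]y$/', 'prey');

// [^u]: "p" followed by anything except "u".
preg_match('/p[^u]/', 'part', $part);            // finds "pa"
$computer = preg_match('/p[^u]/', 'computer');   // the only "p" is followed by "u"

// Ranges: an optional "0", then "x" or "X", then hexadecimal digits.
$hex1   = preg_match('/^0?[xX][0-9a-fA-F]+$/', '0x9B3C');
$hex2   = preg_match('/^0?[xX][0-9a-fA-F]+$/', 'X800');
$notHex = preg_match('/^0?[xX][0-9a-fA-F]+$/', '0xZZ');

// Simple opening tags; the closing tag "</p>" is not matched.
preg_match_all('/<[a-zA-Z][a-zA-Z0-9]*>/', '<p>text</p><H1>', $tags);

print_r($part);
print_r($tags[0]);
```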
Pattern selector

The metacharacter "|" is also called the pattern selector. It matches one of two or more alternatives in a regular expression.
For example:
In the string "There are many apples and pears.", /apple|pear/ matches "apple" the first time it is run and matches "pear" when run again. You can keep adding alternatives, such as /apple|pear|banana|lemon/.
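
In PHP, "running the pattern again" is usually done with preg_match_all, which collects every match in one call, while preg_match stops at the first. A minimal sketch using the sentence from the example:

```php
<?php
$s = 'There are many apples and pears.';

// preg_match stops at the first alternative that matches.
preg_match('/apple|pear/', $s, $first);

// preg_match_all keeps going and collects every match.
$count = preg_match_all('/apple|pear/', $s, $all);

// Alternatives that never occur in the subject simply do not match.
$more = preg_match_all('/apple|pear|banana|lemon/', $s);

print_r($first);
print_r($all[0]);
```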

Pattern unit

The metacharacter "()" turns the regular expression inside it into a single atom (also called a pattern unit). As with parentheses in a mathematical expression, the contents of "()" are treated as one unit.
For example:
- /(dog)+/ matches "dog", "dogdog", "dogdogdog", and so on, because the atom immediately before "+" is the string "dog" enclosed in "()".
- /you (very )+old/ matches "you very old", "you very very old".
- /hello (world|earth)/ matches "hello world", "hello earth".
An expression inside a pattern unit is matched or evaluated first.
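
The sketch below shows how parentheses change what a quantifier applies to, and how the text matched by a pattern unit can be read back from the matches array. The anchors and test strings are additions for the example.

```php
<?php
// With parentheses, "+" repeats the whole word "dog".
$grouped = preg_match('/^(dog)+$/', 'dogdogdog');

// Without them, "+" repeats only the letter "g".
$ungrouped = preg_match('/^dog+$/', 'dogdog');

// The unit "(very )" includes the trailing space, so it can repeat cleanly.
$very = preg_match('/^you (very )+old$/', 'you very very old');

// Alternation inside a unit; $m[1] holds the alternative that matched.
preg_match('/^hello (world|earth)$/', 'hello earth', $m);

var_dump($grouped, $ungrouped, $very);
print_r($m);
```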

Reusing pattern units

The system automatically stores the text matched by each pattern unit "()", and it can be referenced later as "\1", "\2", "\3", and so on. This is very convenient when the same pattern unit appears more than once in a regular expression. Note that in a PHP double-quoted string the backslash must be doubled, so the references are written as "\\1", "\\2"; a bare "\1" there would be read as an octal character escape.
For example:
- /^\d{2}([\W])\d{2}\\1\d{4}$/ matches strings such as "12-31-2006", "09/27/1996", "86 01 4321". However, it does not match "12/34-5678". This is because the result "/" of the pattern "[\W]" has been stored, so when "\1" is referenced at the next position it can match only the same character "/".
- When you do not need to store the matched text, use a non-storing pattern unit "(?:)".
- For example, /(?:a|b|c)(d|e|f)\\1g/ matches "aeeg". In some regular expressions a non-storing pattern unit is necessary; otherwise, the numbering of later references has to change. The example above can also be written as /(a|b|c)(d|e|f)\\2g/.
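
A short sketch of both ideas follows: a backreference that forces the two separators of a date to agree, and the way a non-storing unit shifts the numbering. Single-quoted PHP strings are used here, in which "\1" reaches PCRE unchanged; the variable names are assumptions for the example.

```php
<?php
// Single-quoted strings pass "\1" to PCRE as written.
$date = '/^\d{2}(\W)\d{2}\1\d{4}$/';

$same  = preg_match($date, '12-31-2006');  // both separators are "-"
$space = preg_match($date, '86 01 4321');  // both separators are spaces
$mixed = preg_match($date, '12/34-5678');  // "/" then "-": no match

// (?:...) groups without storing, so (d|e|f) is unit 1.
$nonStoring = preg_match('/^(?:a|b|c)(d|e|f)\1g$/', 'aeeg', $m1);

// With a storing first unit, (d|e|f) becomes unit 2.
$storing = preg_match('/^(a|b|c)(d|e|f)\2g$/', 'aeeg', $m2);

var_dump($same, $space, $mixed);
print_r($m1);
print_r($m2);
```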

That concludes this overview of regular expressions (39). I hope it is helpful to readers interested in the PHP tutorial.



This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
