Basic usage of regular expressions

Last Update:2016-10-31 Source: Internet

Author: User


Perl has many features that other languages lack, and its strong built-in support for regular expressions is one of the most prominent. Regular expressions make Perl fast, flexible, and reliable when working with text, and powerful text processing is one of the things that sets Perl apart from many other languages. Learning Perl therefore inevitably means learning regular expressions. That adds a little to the learning burden, but fortunately regular expressions are not unique to Perl: they are a widely used notation supported by many tools and programming languages, such as grep, awk, sed, and vi. Programmers run into them on many occasions, so the benefit of mastering them is obvious, and because the syntax is simple, they are not too difficult to learn.

In Perl and in some other languages and tools, regular expressions are often called patterns. A regular expression is essentially a string template used to check whether a string conforms to the template's format: any given string either conforms to the template or it does not. In Perl, a regular expression is enclosed in slashes and applied with the =~ operator, as in ($string =~ /pattern/). Here $string is the string being matched; the expression is true only if $string matches the pattern, and false otherwise. For example, "abc ef ffff" =~ /ef/ is true and "abc" =~ /ee/ is false.
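
The two matches above can be tried directly in a small script. This is only a minimal sketch; the variable names $found and $missing are made up for illustration.

```perl
use strict;
use warnings;

# =~ applies the pattern on the right to the string on the left
# and yields a true or false value.
my $found   = ("abc ef ffff" =~ /ef/) ? 1 : 0;  # "ef" occurs in the string
my $missing = ("abc" =~ /ee/)         ? 1 : 0;  # "ee" does not occur

print "found=$found missing=$missing\n";
```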

Of course, the example above is only the simplest case. If regular expressions could do nothing more than match literal characters like this, they would not be so powerful.

Regular expressions use a class of special characters, called metacharacters (loosely called wildcards here), that stand for something other than themselves and carry a particular meaning during matching. A brief introduction to each follows.

(1) Dot (.)

The dot matches any single character (except the newline \n; I will not mention this exception again below).

So "twoon" =~ /tw.on/ is true.

"twvon" =~ /tw.on/ is also true.

Because the dot has a special meaning in a regular expression, we need to escape it with a backslash when we want to match a literal dot.

"twoo.n" =~ /twoo\.n/ is true.

All the other metacharacters can be escaped in the same way: the escaped form loses its special meaning and matches the character itself.

As you can see, \ is itself a metacharacter, so it too must be escaped when you want to match a literal backslash.

"two\\on" =~ /two\\on/ is true (the double-quoted string "two\\on" contains a single backslash).

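As a rough sketch of the dot and of escaping, the following script checks a few strings. The variable names and the extra strings "tw\non" and "twooxn" are illustrative assumptions, not part of the original examples.

```perl
use strict;
use warnings;

my $any_char    = ("twvon"   =~ /tw.on/)   ? 1 : 0;  # . matches the "v"
my $no_newline  = ("tw\non"  =~ /tw.on/)   ? 1 : 0;  # . does not match "\n" by default
my $literal_dot = ("twoo.n"  =~ /twoo\.n/) ? 1 : 0;  # \. matches only a real dot
my $not_a_dot   = ("twooxn"  =~ /twoo\.n/) ? 1 : 0;  # "x" is not a dot
my $backslash   = ("two\\on" =~ /two\\on/) ? 1 : 0;  # \\ matches one literal backslash

print "$any_char $no_newline $literal_dot $not_a_dot $backslash\n";
```
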
(2) Asterisk (*)

The asterisk matches the character before it any number of times (zero or more). It is a quantifier, so it must come after another character; otherwise the expression is invalid.

Example: "twoon" =~ /two*n/ is true.

"twn" =~ /two*n/ is also true.

"twoon" =~ /*twoon/ is an invalid expression, because * must be preceded by something to repeat.

The asterisk can also follow the dot. Since the dot stands for any character other than a newline, .* means any character, any number of times. This is a common idiom: /twoon.*walks/ matches any string that contains "twoon" followed somewhere later by "walks". For this reason .* is also known as "any old junk": it matches anything.
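
The asterisk examples can be sketched as below. The test sentences and variable names are assumptions chosen for illustration; the invalid pattern is compiled inside eval so that the error can be observed without stopping the script.

```perl
use strict;
use warnings;

my $two_os      = ("twoon" =~ /two*n/) ? 1 : 0;  # "o" repeated twice
my $zero_os     = ("twn"   =~ /two*n/) ? 1 : 0;  # "o" repeated zero times
my $junk_middle = ("twoon slowly walks" =~ /twoon.*walks/) ? 1 : 0;  # .* absorbs " slowly "
my $wrong_order = ("walks then twoon"   =~ /twoon.*walks/) ? 1 : 0;  # "walks" must come later

# A pattern that starts with * has nothing to repeat, so compiling it fails.
my $bad_pattern = "*twoon";
my $compiles    = eval { my $re = qr/$bad_pattern/; 1 } ? 1 : 0;

print "$two_os $zero_os $junk_middle $wrong_order $compiles\n";
```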

(3) Plus sign (+): the plus sign is a metacharacter similar to the asterisk (*). It is also a quantifier, and it matches the preceding character one or more times (at least once).

This is the only difference between the two: the asterisk can match zero times, while the plus sign must match at least once.

For example: "twoon" =~ /two+n/ is true.

"twn" =~ /two+n/ is false.

(4) Question mark (?): the question mark is also a quantifier. It matches the preceding character zero times or once.

Example: "twoon" =~ /twoo?n/ is true.

"twoon" =~ /two?n/ is false.

"twn" =~ /two?n/ is true.

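To compare the three quantifiers *, + and ? side by side, here is a small sketch that records which of three sample words each one accepts. The word list and the %matched hash are illustrative assumptions.

```perl
use strict;
use warnings;

my @words = ("twn", "twon", "twoon");   # zero, one, and two "o" characters
my %matched;

for my $quantifier ('*', '+', '?') {
    # Build the patterns two*n, two+n and two?n in turn.
    my $pattern = "two${quantifier}n";
    $matched{$quantifier} = join " ", grep { /$pattern/ } @words;
}

print "$_ : $matched{$_}\n" for ('*', '+', '?');
```

Under these assumptions, * accepts all three words, + rejects "twn", and ? rejects "twoon".
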
(5) Parentheses (()): parentheses are used to form a group. Earlier I said that a quantifier acts on the previous character, which is not quite accurate: it acts on the previous group. A single character is a group by itself, but several characters can also be combined into one group by enclosing them in parentheses.

For example: "twoon" =~ /tw(o)*n/ is true.

"twoon" =~ /tw(oo)*n/ is true.

"twowon" =~ /t(wo)*n/ is true.

"twoon" =~ /t(wv)*oon/ is false.

"twoon" =~ /t(wo)+on/ is true.

"twon" =~ /t(wo)+on/ is false.

You can place any characters inside the parentheses, including other metacharacters, for example "aaabcc" =~ /(aa+b)?cc/.
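
A brief sketch of quantifiers applied to groups follows. The variable names and the extra string "cc" are assumptions added for illustration.

```perl
use strict;
use warnings;

my $repeat_pair    = ("twowon" =~ /t(wo)*n/)   ? 1 : 0;  # "wo" repeated twice
my $needs_one_pair = ("twon"   =~ /t(wo)+on/)  ? 1 : 0;  # after "wo", only "n" is left, not "on"
my $group_present  = ("aaabcc" =~ /(aa+b)?cc/) ? 1 : 0;  # optional group matches "aaab"
my $group_absent   = ("cc"     =~ /(aa+b)?cc/) ? 1 : 0;  # optional group matches nothing

print "$repeat_pair $needs_one_pair $group_present $group_absent\n";
```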

(6) Back references: a backslash followed by a number, such as \1, \2, \3, is called a back reference. It refers to the text matched by an earlier parenthesized group. For example:

"twoonwo" =~ /t(wo)on\1/ is true.

At first glance the benefit is not obvious: in the example above we could do without \1 and write /t(wo)on(wo)/ instead.

For that example the objection is fair. Sometimes, however, a group is written like (we...). Because the dot stands for any character, the pattern alone does not say what the group actually matched, so without a back reference there is no way to require the same text again later. For example:

"weabceeweabc" =~ /(we...)ee\1/ is true.

"weabceeweabc" =~ /(we...)ee(we...)/ is true.

"weabceewecdf" =~ /(we...)ee(we...)/ is also true.

The second and third examples show the difference: a second (we...) group accepts any text of that shape, whereas \1 matches only exactly the same text that the first group matched, so the third string would not match /(we...)ee\1/. \1, \2, \3 refer to the groups in order, counted from left to right.

"abcdefdefabc" =~ /(...)(...)\2\1/ is true.

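The difference between repeating a group and using a back reference can be sketched as follows. The variable names are illustrative, and the second check applies \1 to the string whose two halves differ.

```perl
use strict;
use warnings;

my $same_text  = ("weabceeweabc" =~ /(we...)ee\1/)       ? 1 : 0;  # \1 must repeat "weabc"
my $other_text = ("weabceewecdf" =~ /(we...)ee\1/)       ? 1 : 0;  # "wecdf" is not "weabc"
my $two_groups = ("weabceewecdf" =~ /(we...)ee(we...)/)  ? 1 : 0;  # second group may differ
my $mirrored   = ("abcdefdefabc" =~ /(...)(...)\2\1/)    ? 1 : 0;  # abc def def abc

print "$same_text $other_text $two_groups $mirrored\n";
```
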
In Perl, back references can be written in two ways: directly as \1, \2, ... \9, or as \g{1}, \g{2}, ... \g{9}. The second form is more verbose, but it helps Perl understand what you mean. The backslash has a special meaning in programming languages, where it usually introduces an escape, so Perl has to guess what you intend when it sees one. If you write something like \123, it is unclear how to parse it: did you mean \1 followed by the digits 23, \12 followed by 3, or \123? A backslash followed by digits can also denote an octal character code, so the notation is ambiguous. (\1 through \9 are always treated as back references; from \10 upward, Perl treats the sequence as a back reference only if that many groups have been opened, and otherwise as an octal escape.) Perl 5.10 introduced \g{N} as an unambiguous way to write a back reference. N can even be negative, in which case it is a relative position: \g{-N} refers to the Nth group counting backward from the current position. For example:

"twooavvboonn" =~ /tw(oo)a(vv)b\g{-2}(nn)/ is true.

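The \g{N} notation can be tried with a short sketch like the one below. It assumes Perl 5.10 or later; the string "wo-wo23" and the variable names are made up to show a back reference that is immediately followed by digits.

```perl
use strict;
use warnings;
use 5.010;   # \g{N} needs Perl 5.10 or later

my $absolute     = ("twoonwo"      =~ /t(wo)on\g{1}/)           ? 1 : 0;  # same as \1
my $relative     = ("twooavvboonn" =~ /tw(oo)a(vv)b\g{-2}(nn)/) ? 1 : 0;  # \g{-2} is the (oo) group
my $digits_after = ("wo-wo23"      =~ /(wo)-\g{1}23/)           ? 1 : 0;  # group 1, then literal "23"

print "$absolute $relative $digits_after\n";
```
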
(7) Square brackets ([]): square brackets denote a character set.

A character set, as the name implies, is a collection of characters. Its members are listed inside the brackets, and the set matches exactly one character that is any one of them. For example: "twoon" =~ /[tw]woo/ is true.

When a set has many members, such as the 26 letters or the digits, writing them all out inside the brackets would be tedious, so you can use a hyphen to express a range: [a-z] is the set of lowercase letters and [A-Z] is the set of uppercase letters.

The usage above lists the characters that are allowed, but excluding a range of characters is an equally common requirement. For that, put a caret (^) at the beginning of the set. The set then matches any character that is not in it, exactly the opposite of the previous usage. For example: "twoon" =~ /[^two]woon/ is false.

"ewoon" =~ /[^two]woon/ is true.

As these usages show, ^ and - have special meanings inside a set, so to include either character literally we have to escape it, as in [\^ab\-].
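
The character-set rules can be sketched as follows. The strings "x7" and "a-b" and the variable names are illustrative assumptions.

```perl
use strict;
use warnings;

my $one_of_set    = ("twoon" =~ /[tw]woo/)     ? 1 : 0;  # "t" is in the set
my $ranges        = ("x7"    =~ /[a-z][0-9]/)  ? 1 : 0;  # a lowercase letter, then a digit
my $negated_fails = ("twoon" =~ /[^two]woon/)  ? 1 : 0;  # "t" is excluded by the set
my $negated_ok    = ("ewoon" =~ /[^two]woon/)  ? 1 : 0;  # "e" is not t, w or o
my $literal_dash  = ("a-b"   =~ /a[\^\-]b/)    ? 1 : 0;  # escaped - matches a real hyphen

print "$one_of_set $ranges $negated_fails $negated_ok $literal_dash\n";
```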

Some character sets, such as letters and digits, are used so often that Perl provides shorthand for them. For example, \d stands for a digit and is equivalent to [0-9]. The shorthand classes include the following (the equivalents given are for ASCII text):

\w -- (lowercase w) matches a letter, a digit, or an underscore, equivalent to [a-zA-Z0-9_]

\W -- (uppercase W) matches any character that \w does not, the opposite of \w

\s -- (lowercase s) matches a whitespace character, including space, newline, carriage return, tab, and form feed, roughly equivalent to [ \n\r\t\f]

\S -- (uppercase S) matches a non-whitespace character, the opposite of \s

\d -- matches a decimal digit, equivalent to [0-9]
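
A few quick checks of the shorthand classes, as a sketch only; the sample strings and variable names are assumptions.

```perl
use strict;
use warnings;

my $word_chars = ("a_1"      =~ /\w\w\w/)  ? 1 : 0;  # letter, underscore, digit
my $non_word   = ("a-1"      =~ /\w\W\w/)  ? 1 : 0;  # the hyphen is not a word character
my $whitespace = ("two\ton"  =~ /two\son/) ? 1 : 0;  # \s matches the tab
my $non_space  = ("   "      =~ /\S/)      ? 1 : 0;  # only spaces, so \S finds nothing
my $two_digits = ("room 42"  =~ /\d\d/)    ? 1 : 0;  # "42"

print "$word_chars $non_word $whitespace $non_space $two_digits\n";
```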

(8) Curly braces ({})

Curly braces specify how many times the previous character (or group) is repeated:

{n} repeats exactly n times

{n,m} repeats n to m times

{n,} repeats at least n times

{,m} repeats at most m times (accepted as a quantifier only since Perl 5.34; in older versions write {0,m})

Example: "twoon" =~ /two{2}n/ is true.
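
The counted forms can be sketched as below. The sample strings and variable names are assumptions, and {0,3} is used instead of {,3} so that the sketch does not depend on the Perl version.

```perl
use strict;
use warnings;

my $exactly_two = ("twoon"    =~ /two{2}n/)   ? 1 : 0;  # exactly two "o"
my $too_few     = ("twon"     =~ /two{2}n/)   ? 1 : 0;  # only one "o"
my $in_range    = ("twooon"   =~ /two{1,3}n/) ? 1 : 0;  # three "o", within 1 to 3
my $at_least    = ("twoooon"  =~ /two{2,}n/)  ? 1 : 0;  # four "o", at least 2
my $at_most     = ("twoooon"  =~ /two{0,3}n/) ? 1 : 0;  # four "o" is more than 3
my $group_count = ("twowowon" =~ /t(wo){3}n/) ? 1 : 0;  # braces also apply to a group

print "$exactly_two $too_few $in_range $at_least $at_most $group_count\n";
```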

(9) ^ and $: these two metacharacters anchor a match to the beginning or the end of the string.

Normally, when we write "twoon" =~ /oo/, the match is attempted from the beginning of "twoon"; since "tw" is not "oo", the attempt moves on to the next position, and so on. Sometimes, however, we only want to match at the beginning or at the end, and that is where ^ and $ come in handy: ^ matches the start of the string and $ matches the end of the string. For example: "twoon" =~ /^tw/ is true.

"twoon" =~ /oo/ is true.

"twoon" =~ /^oo/ is false.

"twoon" =~ /on$/ is true.

"twoon" =~ /oo$/ is false.

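The anchor examples can be checked with a sketch like this one. The variable names are illustrative, and the last two checks are additions: one anchors both ends, the other shows that in Perl $ also matches just before a trailing newline.

```perl
use strict;
use warnings;

my $at_start     = ("twoon"   =~ /^tw/)     ? 1 : 0;  # string begins with "tw"
my $anywhere     = ("twoon"   =~ /oo/)      ? 1 : 0;  # unanchored, found in the middle
my $not_at_start = ("twoon"   =~ /^oo/)     ? 1 : 0;  # string does not begin with "oo"
my $at_end       = ("twoon"   =~ /on$/)     ? 1 : 0;  # string ends with "on"
my $not_at_end   = ("twoon"   =~ /oo$/)     ? 1 : 0;  # string does not end with "oo"
my $whole_string = ("twoon"   =~ /^twoon$/) ? 1 : 0;  # both anchors: the entire string
my $before_nl    = ("twoon\n" =~ /on$/)     ? 1 : 0;  # $ also matches before a final newline

print "$at_start $anywhere $not_at_start $at_end $not_at_end $whole_string $before_nl\n";
```
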
(10) The "or" metacharacter: a vertical bar (|) in a regular expression means "or". (ab|cd) matches either of the alternatives on the two sides of the bar. The bar applies to everything on each side of it, so when only part of the pattern should alternate, the alternatives must be enclosed in parentheses.

Example: "twoon" =~ /t|ewoon/ is true (the alternatives are "t" and "ewoon", and the string contains "t").

"twoon" =~ /(tw|ee)oon/ is true.

"twoon" =~ /(ee|gs)oon/ is false.

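How far the vertical bar reaches with and without parentheses can be sketched as follows. The single-character string "t" and the variable names are assumptions added to make the difference visible.

```perl
use strict;
use warnings;

my $either_side  = ("twoon" =~ /t|ewoon/)    ? 1 : 0;  # matches the alternative "t"
my $bare_t       = ("t"     =~ /t|ewoon/)    ? 1 : 0;  # "t" alone satisfies t OR ewoon
my $grouped_t    = ("t"     =~ /(t|e)woon/)  ? 1 : 0;  # grouped: "woon" is still required
my $grouped_ok   = ("twoon" =~ /(tw|ee)oon/) ? 1 : 0;  # "tw" then "oon"
my $grouped_fail = ("twoon" =~ /(ee|gs)oon/) ? 1 : 0;  # neither "ee" nor "gs" precedes "oon"

print "$either_side $bare_t $grouped_t $grouped_ok $grouped_fail\n";
```
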
This article has briefly introduced the meaning and usage of the metacharacters used in regular expressions. As you can see, the syntax is simple, but that does not mean regular expressions can only do simple things: they can be written to be very complex, and whether a syntax is simple or complex has little to do with how powerful the language is. Knowing regular expressions helps in many situations, especially when using the tools found on Unix-like systems, but like any language, they are easily forgotten without use. This article only covers the basic syntax and the basic ways of writing patterns; to become fluent, you need to practice in your daily work, experiment often, and pay attention to the details.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
