Perl has many features that other languages lack, and strong support for regular expressions is among the most prominent.
Regular expressions make Perl very powerful at processing text: fast, flexible, and reliable.
It is fair to say that text processing is what makes Perl stand out among languages.
Learning Perl therefore also means learning regular expressions, which adds somewhat to the burden of learning the language.
Fortunately, regular expressions are not unique to Perl. They are effectively a small language of their own,
widely supported in other tools and programming languages such as grep, awk, sed, and vi.
Programmers run into them constantly, so the payoff from mastering them is obvious.
Better still, the syntax is simple and not hard to learn.
In Perl, as in other languages and tools, a regular expression is usually called a pattern. A regular expression is essentially a string template
used to check whether a string conforms to the template's format. Any given string either matches the template or does not.
In Perl, regular expressions are enclosed in slashes (/), for example:
($string =~ /pattern/)
For now, call $string the target string. The match is true only when $string contains the text "pattern", and false otherwise.
For example, "abc ef ffff" =~ /ef/ is true, and "abc" =~ /ee/ is false.
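The two matches above can be tried directly. This is a minimal sketch; the variable names $string, $found, and $missing are chosen here for illustration only.

```perl
use strict;
use warnings;

my $string = "abc ef ffff";

# =~ binds the string on the left to the pattern on the right.
my $found   = ($string =~ /ef/) ? "match" : "no match";
# "abc" contains no "ee", so this match fails.
my $missing = ("abc" =~ /ee/)   ? "match" : "no match";

print "$found, $missing\n";   # prints "match, no match"
```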
The above is the simplest case. If regular expressions could only do this kind of literal matching, they would not be nearly so powerful.
Regular expressions use a special class of characters, called metacharacters or wildcards, to describe text loosely. These characters take on special meanings during matching.
The following is a brief introduction.
(1) Dot (.): matches any single character except the newline \n (this exception is not repeated below).
So: "twoon" =~ /tw.on/ is true.
"twvon" =~ /tw.on/ is also true.
Because the dot has a special meaning in a regular expression, matching a literal dot requires escaping it with a backslash.
"twoo.n" =~ /twoo\.n/ is true.
Every other wildcard can be escaped the same way, which removes its special meaning so that the character itself is matched.
As you can see, \ is itself a wildcard. To match a literal backslash, apply the same rule and write \\.
'two\on' =~ /two\\on/ is true (the string is single-quoted so that Perl keeps the backslash as written).
(2) Asterisk (*): matches the preceding character any number of times (zero or more). It is a quantifier,
so it must follow another character; otherwise the expression is invalid.
For example, "twoon" =~ /two*n/ is true.
"twn" =~ /two*n/ is also true.
"twoon" =~ /*twoon/ is an invalid expression, because * must follow something.
The asterisk can also be applied to the dot (.). Since the dot stands for any character other than a newline, (.*) matches any characters, any number of times.
This is a common usage:
For example, /twoon.*walks/ matches any string that contains "twoon" followed later by "walks".
For this reason (.*) is also called "any old junk": it matches anything.
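The asterisk cases can be sketched as follows. The sentences "twoon slowly walks" and "walks twoon" are made-up inputs used only to show that .* fills the gap between two words and that order matters.

```perl
use strict;
use warnings;

my $two_os   = ("twoon" =~ /two*n/) ? 1 : 0;   # o* matches "oo"
my $zero_os  = ("twn"   =~ /two*n/) ? 1 : 0;   # o* matches nothing at all
# .* swallows everything between the two words.
my $junk     = ("twoon slowly walks" =~ /twoon.*walks/) ? 1 : 0;
# "walks" has to come after "twoon" for the pattern to match.
my $reversed = ("walks twoon" =~ /twoon.*walks/) ? 1 : 0;

print "$two_os $zero_os $junk $reversed\n";   # prints "1 1 1 0"
```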
(3) Plus sign (+): a wildcard similar to the asterisk (*). It is also a quantifier, matching the preceding character one or more times.
The difference is that the asterisk allows zero occurrences, while the plus sign requires at least one.
For example, "twoon" =~ /two+n/ is true.
"twn" =~ /two+n/ is false.
(4) Question mark (?): also a quantifier. It matches the preceding character zero times or once.
For example, "twoon" =~ /twoo?n/ is true.
"twoon" =~ /two?n/ is false.
"twn" =~ /two?n/ is true.
(5) Parentheses (): parentheses form a group. The quantifiers above were described as applying to the preceding character,
which is not quite accurate: they apply to the preceding group. A single character is a group by itself, but several characters can also be grouped together.
Parentheses mark such a group by enclosing it.
For example, "twoon" =~ /tw(o)*n/ is true.
"twoon" =~ /tw(oo)*n/ is true.
"twowon" =~ /t(wo)*n/ is true.
"twoon" =~ /t(wv)*oon/ is false.
"twoon" =~ /t(wo)+on/ is true.
"twon" =~ /t(wo)+on/ is false.
Any characters or other wildcards can be placed inside the parentheses.
For example, "aaabcc" =~ /(aa+b)?cc/ is true.
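A quick sketch of quantifiers applied to groups, using the strings from the examples above (variable names are illustrative):

```perl
use strict;
use warnings;

my $repeat_group = ("twowon" =~ /t(wo)*n/)     ? 1 : 0;  # "wo" repeated twice
my $zero_groups  = ("twoon"  =~ /t(wv)*oon/)   ? 1 : 0;  # "t" is not directly followed by "oon"
my $one_or_more  = ("twoon"  =~ /t(wo)+on/)    ? 1 : 0;  # "wo" once, then "on"
my $too_short    = ("twon"   =~ /t(wo)+on/)    ? 1 : 0;  # nothing left for "on"
my $optional     = ("aaabcc" =~ /(aa+b)?cc/)   ? 1 : 0;  # the whole group is optional

print "$repeat_group $zero_groups $one_or_more $too_short $optional\n";  # prints "1 0 1 0 1"
```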
(6) Back references: a backslash followed by a digit, such as \1, \2, or \3, is a back reference.
It refers to the text matched by an earlier pair of parentheses.
For example:
"twoonwo" =~ /t(wo)on\1/ is true.
At first glance this seems to add little, since the same example can be written without \1 as /t(wo)on(wo)/.
For that example the objection is fair.
But sometimes the parentheses contain wildcards, as in (we...). Because the dot stands for any character,
the back reference can no longer be replaced by repeating the group. Compare the following examples:
"weabceeweabc" =~ /(we...)ee\1/ is true.
"weabceeweabc" =~ /(we...)ee(we...)/ is true.
"weabceewecdf" =~ /(we...)ee(we...)/ is also true.
The second and third examples show the difference: a repeated (we...) accepts any text of that shape, whereas \1 must match exactly the text the earlier group matched, so "weabceewecdf" =~ /(we...)ee\1/ is false.
\1, \2, \3 refer to the groups numbered from left to right.
"abcdefdefabc" =~ /(...)(...)\2\1/ is true.
In Perl, \1 through \9 are always treated as back references, and they can be written directly as \1, \2, ..., \9.
They can also be written as \g{1}, \g{2}, ..., \g{9}. This form is more verbose, but it tells Perl exactly what you mean.
The backslash has a special meaning in the language, usually introducing an escape, so when Perl meets a backslash it has to guess your intent.
If you write something like \123, it is not obvious how to parse it: as \1 followed by the digits 23,
as \12 followed by 3, or as \123, given that a backslash followed by digits can also denote an octal character code.
So there is ambiguity here.
Perl 5.10 introduced the \g{N} notation for back references. N can even be negative, in which case it denotes a relative position:
the Nth group counting leftward from the current position.
For example:
"twooavvboonn" =~ /tw(oo)a(vv)b\g{-2}(nn)/ is true.
(7) Square brackets []: square brackets denote a character class (a set of characters).
As the name suggests, a character class is a collection of characters. Its members are listed inside the brackets, and the class matches exactly one of them at a time.
For example, "twoon" =~ /[tw]woo/ is true.
When the set has many members, such as all 26 letters plus the digits, listing them one by one is tedious,
so a hyphen can be used to express a range. For example, [a-z] is the set of lower-case letters and [a-zA-Z] is the set of all letters, lower-case and upper-case.
The usage above lists the characters that are allowed, but it is also very common to want anything outside a given set.
In that case, put a caret (^) at the start of the set.
The class then matches any character that is not in the set, the opposite of the usage above.
For example, "twoon" =~ /[^two]woon/ is false.
"ewoon" =~ /[^two]woon/ is true.
As these usages show, ^ and - have special meanings inside a set. To include them as ordinary members, escape them.
For example, [\^ab\-]
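Character classes can be tried out as follows. The inputs "x7" and "a-b" are illustrative additions for the range and the escaped hyphen.

```perl
use strict;
use warnings;

my $member       = ("twoon" =~ /[tw]woo/)     ? 1 : 0;  # "t" is in the set
my $range        = ("x7"    =~ /[a-z][0-9]/)  ? 1 : 0;  # a letter, then a digit
my $negated_miss = ("twoon" =~ /[^two]woon/)  ? 1 : 0;  # "t" is excluded by the set
my $negated_hit  = ("ewoon" =~ /[^two]woon/)  ? 1 : 0;  # "e" is outside the set
my $literal      = ("a-b"   =~ /a[\^\-]b/)    ? 1 : 0;  # escaped - is an ordinary member

print "$member $range $negated_miss $negated_hit $literal\n";  # prints "1 1 0 1 1"
```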
Some character sets, such as letters and digits, are used very often. Perl provides abbreviations for them.
For example, \d stands for a digit and is equivalent to [0-9].
These abbreviations include (the bracketed equivalents hold for ASCII text):
\w -- (lower-case w) a letter, digit, or underscore, equivalent to [a-zA-Z0-9_]
\W -- (upper-case W) any character that is not a letter, digit, or underscore; the opposite of \w
\s -- (lower-case s) a whitespace character: space, line feed, carriage return, tab, or form feed, roughly equivalent to [ \n\r\t\f]
\S -- (upper-case S) any non-whitespace character; the opposite of \s
\d -- a decimal digit, equivalent to [0-9]
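The abbreviations can be checked with a few made-up strings. Note in particular that the underscore counts as a \w character while the hyphen does not.

```perl
use strict;
use warnings;

my $digit      = ("room 42" =~ /\d\d/)     ? 1 : 0;  # two digits in a row
my $underscore = ("a_1"     =~ /\w\w\w/)   ? 1 : 0;  # "_" is a \w character
my $hyphen     = ("a-1"     =~ /\w\w\w/)   ? 1 : 0;  # "-" is not a \w character
my $non_word   = ("a-1"     =~ /\w\W\w/)   ? 1 : 0;  # "-" matches \W
my $space      = ("a\tb"    =~ /a\sb/)     ? 1 : 0;  # a tab is whitespace
my $non_space  = ("a b"     =~ /a\Sb/)     ? 1 : 0;  # a space is not \S

print "$digit $underscore $hyphen $non_word $space $non_space\n";  # prints "1 1 0 1 1 0"
```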
(8) Braces {}:
Braces specify how many times the preceding character is repeated:
{n} exactly n times
{n,m} from n to m times
{n,} at least n times
{,m} at most m times (accepted as a quantifier from Perl 5.34; in earlier versions write {0,m})
Example:
"twoon" =~ /two{2}n/ is true.
(9) ^ and $: these two wildcards anchor the match to the start or the end of the target string.
Normally we write something like "twoon" =~ /oo/, and the pattern is first tried at the start of twoon.
Since "tw" is not "oo", matching continues further along the string. Sometimes, however, we only want a match at the very start or end, and that is where ^ and $ come in.
^ matches at the start of the string, and $ matches at the end of the string.
For example, "twoon" =~ /^tw/ is true.
"twoon" =~ /oo/ is true.
"twoon" =~ /^oo/ is false.
"twoon" =~ /on$/ is true.
"twoon" =~ /oo$/ is false.
(10) The "or" wildcard: a vertical bar | means "or". (ab|cd) matches either of the character groups on the two sides of the bar. Each alternative extends as far as it can on its side, so when an alternative is longer than one character and is only part of the pattern, wrap the alternation in parentheses.
For example, "twoon" =~ /t|ewoon/ is true, because the pattern means "t" or "ewoon".
"twoon" =~ /(tw|ee)oon/ is true.
"twoon" =~ /(ee|gs)oon/ is false.
That concludes this brief look at the meaning and usage of the wildcards in regular expressions. As shown, the syntax is simple,
but that does not mean regular expressions can only do simple things. They can become very complex, and simplicity of syntax has little to do with how powerful they are.
Learning regular expressions pays off in many situations, especially when working with Unix tools. As with many other languages, though, what is learned is easily forgotten.
This article only introduces the basic syntax and usage. To make it practical, you need to practice often and apply it in your daily work.