"Python syntax" regular expression

Source: Internet
Author: User

Why use regular expressions

Manipulating strings is one of the most common tasks in almost every programming language. The reason is simple: most human information is carried as text, that is, as strings, yet much of that text is not exactly what we need, so programs constantly have to extract or validate part of a string.

A regular expression is a tool for matching strings. It defines a small syntax in which a handful of descriptive characters specify a rule for the characters of a string. If a string conforms to the rule, we say it matches.

For example, to determine whether a string is a legitimate email address, the method is:

    • Write a regular expression that describes a valid email address.
    • Use that regular expression to match the input string and decide whether it is legal.
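
The two steps above can be sketched in Python as follows. The pattern is a deliberately simplified assumption for illustration (real email syntax is far more permissive), and the names EMAIL_RE and is_plausible_email are hypothetical.

```python
import re

# Step 1: write the pattern. Simplified on purpose, not a full email validator:
# local part, an @, one or more domain labels, and a suffix of 2+ letters.
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)*\.[A-Za-z]{2,}$")

def is_plausible_email(text):
    """Step 2: return True if text matches the simplified pattern."""
    return EMAIL_RE.match(text) is not None

print(is_plausible_email("someone@example.com"))  # True
print(is_plausible_email("not-an-email"))         # False
```

A pattern like this only checks the shape of the text; it cannot tell whether the address actually exists.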

Regular expression meta-characters

Use \d to match a single digit and \w to match a single letter, digit, or underscore.

Meta-character    Matches
.                 Any character except a line break (\n)
\w                A letter, digit, or underscore
\s                A whitespace character (space, tab, and so on)
\d                A digit

For example, 'py.' matches 'pyc', 'pyo', 'py!', and so on. Because . stands for any character, it matches ordinary letters and it also matches '!'.

Note that a meta-character stands for exactly one character: \w, for instance, matches only a single letter, digit, or underscore.
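
A minimal sketch of the 'py.' example, using Python's re.fullmatch so that the whole string has to fit the pattern. The helper name matches is hypothetical.

```python
import re

def matches(pattern, text):
    """Return True if the entire text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

# 'py.' means: p, y, then exactly one arbitrary character (but not a newline).
for candidate in ["pyc", "pyo", "py!", "py", "py\n"]:
    print(repr(candidate), matches(r"py.", candidate))

# \d and \w also stand for exactly one character each.
print(matches(r"\d\d\w", "00A"))   # True
print(matches(r"\d\d\w", "00AB"))  # False: the pattern describes only three characters
```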

Use [] to express a range. For example, [0-9] matches any single digit from 0 to 9.

    • [0-9a-zA-Z\_] matches a digit, a letter, or an underscore; for ASCII text it is equivalent to \w.

Sometimes you need to match any character that is not in a class that is easy to define. This is negation:

Code/Syntax    Matches
[^x]           Any character other than x
[^aeiou]       Any character other than the letters a, e, i, o, u
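
As a small illustration of ranges and negation, the sketch below uses re.findall to list every single character that a class matches. The sample strings are made up for the example.

```python
import re

# [0-9] matches any single digit; findall returns every non-overlapping match.
digits = re.findall(r"[0-9]", "regex 101")

# [^aeiou] matches any single character that is NOT one of a, e, i, o, u.
non_vowels = re.findall(r"[^aeiou]", "regex")

print(digits)      # ['1', '0', '1']
print(non_vowels)  # ['r', 'g', 'x']
```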

Match a variable length

To match a run of characters of variable length, use * for 0 or more characters, + for 1 or more characters, and ? for 0 or 1 character.

You can also use curly braces: {n} means exactly n characters and {n,m} means n to m characters.

Code/Syntax    Description
*              Repeat 0 or more times, equivalent to {0,}
+              Repeat 1 or more times, equivalent to {1,}
?              Repeat 0 or 1 time, equivalent to {0,1}
{n}            Repeat exactly n times
{n,}           Repeat n or more times
{n,m}          Repeat n to m times

So what kinds of strings does \d{3}\s+\d{3,8} match?
Read from left to right:

    • \d{3} matches exactly 3 digits, e.g. '010';
    • \s matches one whitespace character (a space, a tab, and so on), so \s+ means at least one whitespace character, such as ' ' or '   ';
    • \d{3,8} matches 3 to 8 digits, such as '1234567'.
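
The left-to-right reading above can be checked with a short sketch. The test strings are invented for illustration.

```python
import re

pattern = re.compile(r"\d{3}\s+\d{3,8}")

print(bool(pattern.fullmatch("010 12345")))      # True: one space
print(bool(pattern.fullmatch("010   1234567")))  # True: several spaces
print(bool(pattern.fullmatch("010\t12345")))     # True: a tab also counts as whitespace
print(bool(pattern.fullmatch("01012345")))       # False: \s+ needs at least one whitespace
print(bool(pattern.fullmatch("010 12")))         # False: fewer than 3 digits after the space
```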

What if you want to match a number like '010-12345'? Outside a character class, '-' has no special meaning, so it can be written literally (escaping it as \- is also harmless). The expression is \d{3}-\d{3,8}.

    • [0-9a-zA-Z\_]+ matches a string of at least one digit, letter, or underscore, for example 'a100', '0_Z', 'Py3000', and so on;

    • [a-zA-Z\_][0-9a-zA-Z\_]* matches a letter or underscore followed by any number of digits, letters, or underscores, which is the form of a valid (ASCII) Python variable name;

    • [a-zA-Z\_][0-9a-zA-Z\_]{0,19} limits the length of the variable name more precisely to 1-20 characters (1 leading character + at most 19 following characters). Write {0,19} without a space: Python's re module treats {0, 19} as literal text, not as a repetition.
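
A minimal sketch of the last pattern in use. The names IDENT_RE and is_short_identifier are hypothetical, and the check covers only the ASCII shape of a name (it does not reject Python keywords such as "for").

```python
import re

# One letter or underscore, then 0 to 19 letters, digits, or underscores.
IDENT_RE = re.compile(r"[a-zA-Z_][0-9a-zA-Z_]{0,19}")

def is_short_identifier(name):
    """Return True if name is 1-20 characters long and shaped like a variable name."""
    return IDENT_RE.fullmatch(name) is not None

print(is_short_identifier("Py3000"))   # True
print(is_short_identifier("_tmp"))     # True
print(is_short_identifier("3abc"))     # False: starts with a digit
print(is_short_identifier("a" * 21))   # False: longer than 20 characters
```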

Note the difference from wildcards: on the Linux Bash command line, * on its own stands for any run of characters in a filename pattern. In a regular expression, * only repeats the preceding item, so you must write .* to match any run of characters.

Returning to the earlier phone number example, we can use a more complex expression: \(?0\d{2}[) -]?\d{8}. It matches (010)88886666, 022-22334455, 02912345678, and so on.

    • First comes an escaped ( written as \(, which can occur 0 or 1 times (?),
    • then a 0, followed by 2 digits (\d{2}),
    • then one of ), -, or a space, which appears once or not at all (?),
    • and finally 8 digits (\d{8}).

But this expression also matches "incorrect" formats such as 010)12345678 or (022-87654321. Later we will see how to solve this problem.
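
The behaviour described above, including the unwanted matches, can be observed with a short sketch. The name PHONE_RE is hypothetical.

```python
import re

PHONE_RE = re.compile(r"\(?0\d{2}[) -]?\d{8}")

well_formed = ["(010)88886666", "022-22334455", "02912345678"]
malformed = ["010)12345678", "(022-87654321"]

for text in well_formed + malformed:
    # fullmatch requires the whole string to fit the pattern
    print(text, bool(PHONE_RE.fullmatch(text)))
# Every line prints True: the optional "(" and the optional separator are
# independent of each other, so unbalanced parentheses slip through.
```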

Boundary qualifier

Boundary    Matches
^           Start of the string
$           End of the string

For example, the expression ^\d{5,12}$ must match from the start of the string to its end, so it matches only a string that consists entirely of 5 to 12 digits.
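
A small sketch contrasting the anchored pattern with an unanchored one. The function name is_5_to_12_digits and the sample strings are hypothetical.

```python
import re

def is_5_to_12_digits(text):
    """Return True only if the whole text is 5 to 12 digits."""
    return re.match(r"^\d{5,12}$", text) is not None

print(is_5_to_12_digits("12345"))          # True
print(is_5_to_12_digits("1234"))           # False: too short
print(is_5_to_12_digits("1234567890123"))  # False: 13 digits
print(is_5_to_12_digits("12345abc"))       # False: trailing letters

# Without ^ and $ the pattern may match just a piece of a longer string.
partial = re.search(r"\d{5,12}", "id=1234567890123!")
print(partial.group())                     # 123456789012
```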

Branch conditions

A branch condition is similar to a logical "or": the expression matches if any one of the conditions is satisfied. The different rules are separated with |.

Take the phone number examples discussed earlier.

    • 0\d{2}-\d{8}|0\d{3}-\d{7}: this expression matches
      • a three-digit area code with an 8-digit local number (such as 010-12345678),
      • a 4-digit area code with a 7-digit local number (such as 0376-2233445).
    • \(0\d{2}\)[- ]?\d{8}|0\d{2}[- ]?\d{8}: this expression is divided by | into two conditions
      • The expression on the left: \(0\d{2}\) matches an area code in parentheses such as (010), and [- ]? means the separator can be a -, a space, or absent.
      • The expression on the right: 0\d{2}[- ]?\d{8} covers the case where the area code is not enclosed in parentheses.

Note: when matching branch conditions, each condition is tested from left to right, and as soon as one branch is satisfied the remaining branches are not tried.
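
The sketch below tries the first branch expression and then illustrates the left-to-right rule. The five-digit/nine-digit code pattern in the second half is a made-up example, not taken from the text above.

```python
import re

phone = re.compile(r"0\d{2}-\d{8}|0\d{3}-\d{7}")
print(bool(phone.fullmatch("010-12345678")))   # True: 3-digit area code + 8 digits
print(bool(phone.fullmatch("0376-2233445")))   # True: 4-digit area code + 7 digits
print(bool(phone.fullmatch("0376-22334455")))  # False: 4 + 8 fits neither branch

# Branches are tried from left to right, so their order can change the result.
short_first = re.match(r"\d{5}|\d{5}-\d{4}", "12345-6789").group()
long_first = re.match(r"\d{5}-\d{4}|\d{5}", "12345-6789").group()
print(short_first)  # 12345
print(long_first)   # 12345-6789
```

When one branch is a prefix of another, putting the longer branch first is usually the safer order.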

Group

So far we have seen how to repeat a single character (just put a quantifier directly after the character).
But what if you want to repeat several characters? You can use parentheses to specify a sub-expression (also called a group) and then specify how many times the sub-expression repeats.

For example, (\d{1,3}\.){3}\d{1,3} can be analyzed step by step:

    • \d{1,3} matches a number of 1 to 3 digits,
    • (\d{1,3}\.){3} matches 1 to 3 digits followed by a literal period (this whole unit is the group), repeated 3 times,
    • finally comes one more number of 1 to 3 digits (\d{1,3}).
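
A minimal sketch of the repeated group. The name ip_like is hypothetical; note that the pattern checks only the shape of the text, not whether each number lies in the range 0 to 255.

```python
import re

ip_like = re.compile(r"(\d{1,3}\.){3}\d{1,3}")

print(bool(ip_like.fullmatch("192.168.1.1")))      # True
print(bool(ip_like.fullmatch("192.168.1")))        # False: the group repeats only twice
print(bool(ip_like.fullmatch("999.999.999.999")))  # True: shape is right, values are not checked
```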

Summarize

Seeing so many symbols at once can be bewildering. Let's summarize what the symbols {}, [], and () are used for.

    • {2,3}: must be combined with the item in front of it; for example, a{2,3} means a occurs two or three times
    • []: has 3 meanings
      • [a-z]: represents a range, that is, one character between a and z
      • [.*]: inside [], . and * lose their special meaning and are just ordinary symbols, so this matches either a period or an asterisk
      • [^a]: represents any character that is not a. Do not confuse it with ^a, which means an a at the beginning of a line.
    • (): groups a sub-expression so that it can be repeated as a whole, as in (\d{1,3}\.){3}

Greedy Match and Lazy match

The expression a.*b matches the longest string that starts with a and ends with b. For example, when searching aabab it matches the entire string aabab. This is greedy matching: match as many characters as possible.

Lazy matching means matching as few characters as possible. Adding a ? after .* switches it to lazy mode, so .*? means "repeat as few times as possible while still letting the whole match succeed". For example, a.*?b applied to aabab matches aab and then ab.

Why is the first match aab rather than ab? Because regular expressions have a rule with even higher priority: the match that begins earliest wins.

Code/Syntax    Description
*?             Repeat 0 or more times, but as few times as possible
+?             Repeat 1 or more times, but as few times as possible
??             Repeat 0 or 1 time, but as few times as possible
{n,m}?         Repeat n to m times, but as few times as possible
{n,}?          Repeat n or more times, but as few times as possible
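
The aabab example can be reproduced with re.findall, which returns every non-overlapping match from left to right. This is a minimal sketch; the variable names are hypothetical.

```python
import re

text = "aabab"

greedy = re.findall(r"a.*b", text)   # .* grabs as much as possible
lazy = re.findall(r"a.*?b", text)    # .*? stops at the first b that works

print(greedy)  # ['aabab']
print(lazy)    # ['aab', 'ab']
```

The first lazy match is aab because matching starts at the leftmost a; laziness only shortens where the match ends.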

Match Chinese characters

The expression for matching Chinese characters is [\u4E00-\u9FA5]. This is a range of Unicode code points that covers the commonly used Chinese characters, not a range of UTF-8 bytes.
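
A short sketch of this range in Python 3, where str patterns are matched against Unicode code points and the re module understands \uXXXX escapes. The sample sentence and the name han are made up for illustration.

```python
import re

# One or more consecutive characters in the range U+4E00 to U+9FA5.
han = re.compile(r"[\u4e00-\u9fa5]+")

found = han.findall("Python 正则表达式 regex 教程")
print(found)  # ['正则表达式', '教程']
```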

Using regular expressions in Python

Python provides the re module, which contains all of the regular expression functionality. Because Python strings themselves also use \ as an escape character, pay special attention:
for example, with s = 'ABC\\-001' the Python string, and therefore the regular expression, is actually ABC\-001.
So it is best to use the r prefix on the Python string so that you do not have to think about escaping, for example s = r'ABC\-001'.
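
The equivalence of the two spellings can be verified directly. This is a minimal sketch with hypothetical variable names.

```python
import re

plain = 'ABC\\-001'   # ordinary string: the backslash itself must be escaped
raw = r'ABC\-001'     # raw string: backslashes are kept as written

print(plain == raw)   # True: both hold the 8 characters ABC\-001
print(len(raw))       # 8
print(bool(re.fullmatch(raw, 'ABC-001')))  # True: \- in the pattern matches a literal -
```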

How to tell whether a regular expression matches:

    • Import the re module: import re
    • Use the match method: if the match succeeds it returns a Match object, otherwise it returns None

      test = 'user-entered string'

      if re.match(r'regular expression', test):
          print('ok')
      else:
          print('failed')

Splitting a string

With regular expressions, splitting a string becomes more flexible.

With the ordinary split method, consecutive spaces are not treated as a single separator:

>>> 'a b   c'.split(' ')
['a', 'b', '', '', 'c']

A regular expression allows more complex splitting:

>>> re.split(r'[\s\,\;]+', 'a,b;; c  d')
['a', 'b', 'c', 'd']

Group

Besides testing whether a string matches, regular expressions also have the powerful ability to extract substrings. Whatever is enclosed in () is a group to be extracted.
For example:

m = re.match(r'^(\d{3})-(\d{3,8})$', '010-12345')

This regular expression defines two groups, which match the parts before and after the -.

    • m.group(0): returns '010-12345'
    • m.group(1): returns '010'
    • m.group(2): returns '12345'

group(0) is always the entire text matched by the expression, and group(1), group(2), ... are the 1st, 2nd, ... captured substrings.
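
To see that group(0) is the matched text rather than the whole input, the sketch below uses re.search on a longer string. The input string is invented for illustration.

```python
import re

m = re.search(r"(\d{3})-(\d{3,8})", "tel: 010-12345 (office)")

print(m.group(0))  # 010-12345  (only the part that matched)
print(m.group(1))  # 010
print(m.group(2))  # 12345
print(m.groups())  # ('010', '12345')  (all captured groups as a tuple)
```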

Greedy match

Regular expressions use greedy matching by default. For example:

>>> re.match(r'^(\d+)(0*)$', '102300').groups()
('102300', '')

\d+ is greedy, so it consumes all the trailing zeros and 0* is left to match only the empty string.

For 0* to match the trailing zeros, \d+ must use a non-greedy match (that is, match as little as possible). Adding a ? after \d+ makes it non-greedy:

>>> re.match(r'^(\d+?)(0*)$', '102300').groups()
('1023', '00')

Another example:

import re

line = "boooooobby123"
reg_str = ".*(b.*b).*"
match_obj = re.match(reg_str, line)
if match_obj:
    print(match_obj.group(1))

Because the leading .* is greedy, it consumes as much as it can (boooooo), so the parentheses actually capture only bb.

In non-greedy mode, that is, with a ? added after .*:

import re

line = "boooooobby123"
reg_str = ".*?(b.*?b).*"
match_obj = re.match(reg_str, line)
if match_obj:
    print(match_obj.group(1))

Now the leading .*? matches nothing, and the parentheses capture boooooob.

Example: Extract Date

Next, we want to automatically extract a birthday from a piece of text. Because the date format was not specified in advance, the date may be written in several ways, such as

    • 出生于2018年1月23日 (born on January 23, 2018)
    • 出生于2018/1/23
    • 出生于2018-1-23
    • 出生于2018-01-23
    • 出生于2018-01
    • 出生于2018年01月 (born in January 2018)

We need a regular expression that matches all of the date formats above.

    • First match the year part of the date. From the text above, it appears only in the forms 2018年, 2018-, and 2018/ (年 means "year"). So first use \d{4} for the digits, then [年/-] for the symbol. Together with the prefix 出生于 ("born on"), that gives

      regex = r"出生于(\d{4}[年/-])"
    • Next, the digits of the month can only take the two forms 01 and 1: \d{1,2}
    • The part after the month is more complex. We can classify the cases and then use branch conditions to express them in one pattern.
      • The part after the month in 2018年1月23日, 2018-01-23, and 2018/1/23 (月 means "month", 日 means "day"): [月/-]\d{1,2}日?
      • The part after the month in 2018年01月: [月/-]$
      • The part after the month in 2018-01 is simply the end anchor: $
      • Finally, use () and | to combine the cases.

        ([月/-]\d{1,2}日?|[月/-]$|$)

Finally, merge all the parts together.

import re

lines = [
    "出生于2018年1月23日",
    "出生于2018/1/23",
    "出生于2018-1-23",
    "出生于2018-01-23",
    "出生于2018-01",
    "出生于2018年01月",
]
regex = r"出生于(\d{4}[年/-]\d{1,2}([月/-]\d{1,2}日?|[月/-]$|$))"
for line in lines:
    m = re.match(regex, line)
    if m:
        print(m.group(1))  # the date part, e.g. 2018年1月23日

Compile

When a regular expression is used, the re module does two things internally:

    • It compiles the regular expression; if the expression itself is not valid, an error is raised at this point;
    • It uses the compiled regular expression to match the string.

If a regular expression is going to be used many times, you can precompile it for efficiency:

# Compile:
>>> re_telephone = re.compile(r'^(\d{3})-(\d{3,8})$')
# Use:
>>> re_telephone.match('010-12345').groups()
('010', '12345')
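
Both internal steps can be observed in a short sketch: a compiled pattern is reused for several inputs, and an invalid pattern fails at compile time with re.error. The helper name is_valid_pattern and the sample inputs are hypothetical.

```python
import re

# Compile once, then reuse the compiled object for every input.
re_telephone = re.compile(r"^(\d{3})-(\d{3,8})$")

for number in ["010-12345", "010-8086", "10-12345"]:
    m = re_telephone.match(number)
    print(number, m.groups() if m else None)

def is_valid_pattern(pattern):
    """Return True if the pattern compiles, False if re rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(is_valid_pattern(r"(\d{3})"))  # True
print(is_valid_pattern(r"(\d{3}"))   # False: the parenthesis is never closed
```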