A Complete Guide to Python Regular Expressions (Part 1)

Regular expressions are a tool for processing text. Most programming languages support them, and they are used for form validation, text extraction, search-and-replace, and similar tasks. Web crawlers rely on them heavily; used well, a regular expression gets twice the result with half the effort.

Before introducing regular expressions, let's look at a problem. The following text is a fragment of a Douban web page, which I have trimmed down. Question: how do you extract all the email addresses in the text?

html = """
<style>
    .qrcode-app {
        display: block;
        background: url(/pics/qrcode_app4@2x.png) no-repeat;
    }
</style>
<div class="reply-doc content">
    <p class="">34613453@qq.com, thank you</p>
    <p class="">30604259@qq.com trouble landlord</p>
</div>
<p class="">490010464@163.com<br/>Thank you</p>
"""

If you have not worked with regular expressions before, you will probably be at a loss, because without them it is hard to think of a better way to handle this. Let's set the problem aside for now and come back to it after learning regular expressions.

String Representation

Python strings can be written in several forms. A string that starts with u is called a Unicode string, which is not covered in this article. Besides that, you have probably seen these two forms:

>>> foo = "hello"
>>> bar = r"hello"

The former is a regular string; the latter, which starts with r, is a raw string. What is the difference between the two? In the example above there is none, because both consist only of ordinary text characters, as the following shows:

>>> foo is bar
True
>>> foo == bar
True

However, what happens if a string contains special characters? Let's look at an example:

>>> foo = "\n"
>>> bar = r"\n"
>>> foo, len(foo)
('\n', 1)
>>> bar, len(bar)
('\\n', 2)
>>> foo == bar
False

"\n" is an escape sequence that represents the line feed (newline) character in ASCII, whereas r"\n" is a raw string. A raw string does not process escape sequences, so it is simply a string made up of the two characters "\" and "n".

A raw string can start with a lowercase r or an uppercase R; both r"\b" and R"\b" are allowed. In Python, regular expressions are generally written as raw strings. Why?

Take "\b" as an example. In a normal string it is the ASCII backspace character, but in a regular expression it is a metacharacter that matches a word boundary. For the regular expression compiler to see the intended meaning, you need to use a raw string. Alternatively, you can escape the backslash itself with another backslash "\" in a normal string:

>>> foo = "\\b"
>>> bar = r"\b"
>>> foo == bar
True
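To make the effect of forgetting the r prefix concrete, here is a minimal sketch. The sample text "a foo b" is an assumption chosen for illustration, and re.search (which scans the whole string) is used instead of match so that the word does not have to sit at the start.

```python
import re

text = "a foo b"

# Normal string: Python turns "\b" into the backspace character (0x08)
# before the regex engine ever sees it, so this looks for
# backspace + "foo" + backspace and finds nothing.
plain = re.search("\bfoo\b", text)

# Raw string: the backslash survives, so the regex engine receives
# the word-boundary metacharacter.
raw = re.search(r"\bfoo\b", text)

print(plain)        # None
print(raw.group())  # foo
```

The two patterns look identical on screen but are different strings, which is why raw strings are the safer default for regular expressions.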
Regular Expressions

A regular expression consists of two kinds of characters: ordinary text characters and special characters (metacharacters). Metacharacters have special meaning in a regular expression and are what make it expressive. For example, in the regular expression r"a.d", the characters 'a' and 'd' are ordinary characters and '.' is a metacharacter that stands for any single character, so the pattern matches 'a1d', 'a2d', 'acd', and so on.

Python's built-in re module is used to work with regular expressions.

>>> import re
>>> rex = r"a.d"                          # regular expression text
>>> original_str = "and"                  # original text
>>> pattern = re.compile(rex)             # regular expression object
>>> m = pattern.match(original_str)       # match object
>>> m
<_sre.SRE_Match object at 0x101c85b28>
# equivalent to
>>> re.match(r"a.d", "and")
<_sre.SRE_Match object at 0x10a15dcc8>

If the original text matches the regular expression, the match method returns a Match object; otherwise it returns None. Checking whether m is None is enough to implement form validation.
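The None check can be wrapped in a small helper. This is only a sketch: the function name matches_a_dot_d is hypothetical, and it reuses the r"a.d" pattern from the example above rather than a realistic form rule.

```python
import re

PATTERN = re.compile(r"a.d")

def matches_a_dot_d(text):
    """Return True when text starts with 'a', any one character, then 'd'."""
    # match() returns a Match object on success and None on failure.
    return PATTERN.match(text) is not None

print(matches_a_dot_d("and"))  # True
print(matches_a_dot_d("a1d"))  # True
print(matches_a_dot_d("bad"))  # False: match() only tries the start of the string
```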

Next, we need to learn more metacharacters.

Basic metacharacters
  • .: Matches any character except a line break. For example, "a.c" fully matches "abc" and also matches the "abc" in "abcef"
  • \: Escape character, which gives a special character its literal meaning. For example, 1\.2 matches 1.2
  • [...]: Matches any one character listed in the square brackets. For example, a[bcd]e matches abe, ace, and ade. Ranges are also supported: a to z can be written as "a-z" and 0 to 9 as "0-9". Note that most special characters lose their special meaning inside "[]" and are taken literally. For example, [.*] matches . or *
  • [^...]: Negates the character set, matching any character that is not listed in the brackets. For example, a[^bcd]e matches aee and afe
>>> re.match(r"a.c", "abc").group()
'abc'
>>> re.match(r"a.c", "abcef").group()
'abc'
>>> re.match(r"1\.2", "1.2").group()
'1.2'
>>> re.match(r"a[0-9]b", "a2b").group()
'a2b'
>>> re.match(r"a[0-9]b", "a5b11").group()
'a5b'
>>> re.match(r"a[.*?]b", "a.b").group()
'a.b'
>>> re.match(r"abc[^\w]", "abc!123").group()
'abc!'

 

The group method returns the substring (abc) of the original string (abcef) that matches the regular expression. The match must succeed first: only then does the match method return a Match object, which provides the group method.

Preset metacharacters
  • \w: Matches any word character, including digits and the underscore. For ASCII text it is equivalent to [A-Za-z0-9_]. For example, a\wc matches abc and acc
  • \W: Matches any non-word character, the opposite of \w. For ASCII text it is equivalent to [^A-Za-z0-9_]. For example, a\Wc matches a!c
  • \s: Matches any whitespace character; spaces and line breaks are both whitespace. For example, a\sc matches a\nc, where \n is a line break
  • \S: Matches any non-whitespace character
  • \d: Matches any digit. For ASCII text it is equivalent to [0-9]. For example, a\dc matches a1c, a2c, ...
  • \D: Matches any non-digit character
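The preset metacharacters listed above can be checked with a short sketch. The sample strings are assumptions picked so that each pattern has exactly one class to test.

```python
import re

# Each pattern is paired with a string it is expected to match.
samples = {
    r"a\wc": "a_c",   # underscore counts as a word character
    r"a\Wc": "a!c",   # "!" is not a word character
    r"a\sc": "a\nc",  # a line break counts as whitespace
    r"a\Sc": "abc",   # "b" is not whitespace
    r"a\dc": "a7c",   # "7" is a digit
    r"a\Dc": "axc",   # "x" is not a digit
}

results = {pattern: re.match(pattern, text) is not None
           for pattern, text in samples.items()}

for pattern, ok in results.items():
    print(pattern, ok)  # every line prints True

# A class and its uppercase opposite never match the same character.
print(re.match(r"a\dc", "axc"))  # None
```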
Boundary match

Boundary metacharacters constrain where a match may occur.

  • ^: Matches at the start of the string. For example, ^abc matches a string that begins with a followed by bc, such as abc
  • $: Matches at the end of the string. For example, hello$ matches a string that ends with hello
>>> re.match(r"^abc", "abc").group()
'abc'
>>> re.match(r"^abc$", "abc").group()
'abc'
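The examples above succeed with or without the anchors, so here is a sketch where the anchors change the outcome. The input strings are assumptions, and re.search (which scans the whole string rather than only its start) is brought in to show the effect of ^.

```python
import re

# match() is already tied to the start of the string, so $ is what
# makes the difference here.
without_end = re.match(r"abc", "abcdef")
with_end = re.match(r"abc$", "abcdef")
print(without_end is not None)  # True: "abc" is a prefix of "abcdef"
print(with_end is not None)     # False: "def" follows, so $ fails

# search() looks anywhere in the string, so ^ is what pins it down.
unanchored = re.search(r"abc", "xxabc")
anchored = re.search(r"^abc", "xxabc")
print(unanchored is not None)   # True: found at position 2
print(anchored is not None)     # False: the string does not start with "abc"
```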
Repeated match

The metacharacters so far each match a single character. To match a character repeatedly, for example an ID card number that is 18 characters long, you need the repetition metacharacters.

  • *: Repeats zero or more times
  • ?: Repeats zero times or once
  • +: Repeats one or more times
  • {n}: Repeats exactly n times
  • {n,}: Repeats at least n times
  • {n,m}: Repeats n to m times
# Simple ID card number match: the first 17 characters are digits, and the last one can be a digit or the letter X
>>> re.match(r"\d{17}[\dX]", "42350119900101153X").group()
'42350119900101153X'
# Match a QQ number of 5 to 12 digits
>>> re.match(r"\d{5,12}$", "4235011990").group()
'4235011990'
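The remaining quantifiers can be compared side by side. This sketch uses made-up inputs built from the letters a, b, and c, where only the number of b characters varies.

```python
import re

# * allows zero or more b's
star_zero = re.match(r"ab*c", "ac")
star_many = re.match(r"ab*c", "abbbc")
print(star_zero.group())  # ac
print(star_many.group())  # abbbc

# ? allows at most one b
opt_one = re.match(r"ab?c", "abc")
opt_two = re.match(r"ab?c", "abbc")
print(opt_one.group())    # abc
print(opt_two)            # None

# + requires at least one b
plus_zero = re.match(r"ab+c", "ac")
print(plus_zero)          # None

# {2,} requires at least two b's
at_least_two = re.match(r"ab{2,}c", "abbbc")
print(at_least_two.group())  # abbbc
```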
Logical Branch

Consider matching a landline phone number. Rules differ by region: some regions use a three-digit area code with an eight-digit number, while others use a four-digit area code with a seven-digit number, and the area code and number are separated by a hyphen (-). How can one pattern satisfy both? This is where the branch operator | comes in. It splits the expression into a left part and a right part. The left part is tried first, and if it matches, the right part is not tried. This is a logical "or" relationship.

# abc|cde can match abc or cde, but abc is tried first
>>> re.match(r"aa(abc|cde)", "aaabccde").group()
'aaabc'

The expression 0\d{2}-\d{8}|0\d{3}-\d{7} matches numbers that start with 0: either a three-digit area code followed by an eight-digit number, or a four-digit area code followed by a seven-digit number.

>>> re.match(r"0\d{2}-\d{8}|0\d{3}-\d{7}", "0755-4348767").group()
'0755-4348767'
>>> re.match(r"0\d{2}-\d{8}|0\d{3}-\d{7}", "010-34827637").group()
'010-34827637'
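Because the left branch is tried first, the order of the branches can change the result. The sketch below shows this with two made-up patterns, and then wraps the phone expression in parentheses with a trailing $ so that over-long input is rejected; the name PHONE and the test numbers are assumptions for illustration.

```python
import re

# The first branch that succeeds wins, even if a later one would match more.
short_first = re.match(r"abc|abcde", "abcde").group()
long_first = re.match(r"abcde|abc", "abcde").group()
print(short_first)  # abc
print(long_first)   # abcde

# Parentheses make $ apply to both branches instead of only the right one.
PHONE = re.compile(r"(0\d{2}-\d{8}|0\d{3}-\d{7})$")

print(PHONE.match("010-34827637") is not None)   # True
print(PHONE.match("0755-4348767") is not None)   # True
print(PHONE.match("0755-43487671") is not None)  # False: one digit too many
```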
Group

The repetition rules described so far apply to a single character. To repeat a sequence of several characters, use a subexpression (also called a group), written with parentheses "()". For example, (abc){2} matches abc twice. To match an IP address, you can use (\d{1,3}\.){3}\d{1,3}. An IP address consists of four groups of digits separated by three dots, so the first three groups, each with its trailing dot, can be treated as one group repeated three times, and the last part is a string of 1 to 3 digits. For example: 192.168.0.1.
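Here is a sketch of the IP address pattern in use. Note that the pattern only checks the shape of the text, so it also accepts values such as 999.999.999.999. The helper is_ipv4 is a hypothetical addition that layers a numeric range check on top; it is not part of the re module.

```python
import re

# Same expression as in the text, with $ added so trailing text is rejected.
IP = re.compile(r"(\d{1,3}\.){3}\d{1,3}$")

print(IP.match("192.168.0.1") is not None)      # True
print(IP.match("192.168.0") is not None)        # False: only two dots
print(IP.match("999.999.999.999") is not None)  # True: shape is right, values are not

def is_ipv4(text):
    """Hypothetical helper: shape check by regex, range check by arithmetic."""
    if IP.match(text) is None:
        return False
    return all(int(part) <= 255 for part in text.split("."))

print(is_ipv4("192.168.0.1"))      # True
print(is_ipv4("999.999.999.999"))  # False
```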

The group method extracts the text matched by each group. By default, the match of the entire expression is group 0, which you get by calling group() without arguments or group(0). The group in the first pair of parentheses is obtained with group(1), and so on.

>>> m = re.match(r"(\d+)(\w+)", "123abc")
# group 0: the match of the entire regular expression
>>> m.group()
'123abc'
# equivalent to
>>> m.group(0)
'123abc'
# group 1: the match of the first pair of parentheses
>>> m.group(1)
'123'
# group 2: the match of the second pair of parentheses
>>> m.group(2)
'abc'
>>>

 

Through grouping, we can extract the desired information from a string. Groups can also be given names and retrieved by name.

# The first group is named number
# The second group is named char
>>> m = re.match(r"(?P<number>\d+)(?P<char>\w+)", "123abc")
>>> m.group("number")
'123'
# equivalent to
>>> m.group(1)
'123'
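When a pattern has several groups, the Match object can return them all at once. The sketch below reuses the pattern from the example above and shows the groups and groupdict methods of the Match object.

```python
import re

m = re.match(r"(?P<number>\d+)(?P<char>\w+)", "123abc")

# All groups in order, as a tuple
all_groups = m.groups()
print(all_groups)  # ('123', 'abc')

# Named groups only, as a dictionary keyed by group name
named = m.groupdict()
print(named)       # {'number': '123', 'char': 'abc'}

# group() also accepts several names or indexes at once
pair = m.group("number", "char")
print(pair)        # ('123', 'abc')
```

Named groups keep working when more parentheses are added to the pattern later, whereas numeric indexes shift.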
Greedy and non-greedy

By default, a repetition in a regular expression matches as many characters as possible while still letting the whole expression match. This is called greedy mode. For example, r"a.*b" matches a string that starts with a and ends with b, with any number of characters in between. Applied to aaabcb, it matches the entire string.

>>> re.match(r"a.*b", "aaabcb").group()
'aaabcb'

 

Sometimes we want to match as few characters as possible. What should we do? Just add a question mark "?" after the quantifier, which makes it match as few characters as possible while still letting the whole expression match. For example, to match only aaab in the example above, change the regular expression to r"a.*?b".

>>> re.match(r"a.*?b", "aaabcb").group()
'aaab'
>>>
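The difference becomes more visible when a string contains several candidate matches. This sketch uses re.findall on a made-up piece of markup; the text and the b tags are assumptions for illustration.

```python
import re

text = "<b>one</b> and <b>two</b>"

# Greedy: .* runs to the last </b>, so there is a single long match.
greedy = re.findall(r"<b>.*</b>", text)
print(greedy)  # ['<b>one</b> and <b>two</b>']

# Non-greedy: .*? stops at the first </b> each time.
# With one group in the pattern, findall returns the group contents.
lazy = re.findall(r"<b>(.*?)</b>", text)
print(lazy)    # ['one', 'two']
```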

 

Non-greedy mode is used frequently in crawlers. For example, in an earlier article on the public account "Zen of Python" I crawled a website and converted it into a PDF file. The src attributes of the img tags on those pages were relative paths, and they had to be replaced with absolute paths.

>>> html = ' '
# Two img tags, matched in non-greedy mode
# Switch to greedy mode to see how many matches you get
>>> rex = r' '
>>> re.findall(rex, html)
['/images/category.png', '/images/js_framework.png']
>>>
>>> def fun(match):
...     img_tag = match.group()
...     src = match.group(1)
...     full_src = "http://foofish.net" + src
...     new_img_tag = img_tag.replace(src, full_src)
...     return new_img_tag
...
>>> re.sub(rex, fun, html)

 

The sub function accepts a function as the replacement argument; the function's return value replaces the text matched by the regular expression. Here the regular expression covers the entire img tag, so group() returns the whole tag and group(1) returns /images/category.png. Finally, the replace method swaps the relative path for the absolute one.
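The markup and the pattern in the transcript above did not survive in this copy of the article, so the following is a reconstruction under stated assumptions rather than the author's exact code. The html string and the rex pattern are both guesses chosen to reproduce the findall output shown above; only the fun callback and the http://foofish.net prefix come from the original.

```python
import re

# ASSUMED input: two img tags whose src values match the output shown earlier.
html = '<img src="/images/category.png"><img src="/images/js_framework.png">'

# ASSUMED pattern: non-greedy throughout, with the src value as group 1.
rex = r'<img.*?src="(.*?)".*?>'

found = re.findall(rex, html)
print(found)  # ['/images/category.png', '/images/js_framework.png']

def fun(match):
    """Return the matched img tag with its relative src made absolute."""
    img_tag = match.group()   # the whole img tag
    src = match.group(1)      # the relative path only
    full_src = "http://foofish.net" + src
    return img_tag.replace(src, full_src)

result = re.sub(rex, fun, html)
print(result)
```

Changing any of the three .*? to .* makes the pattern run across both tags, which is the experiment the comment in the transcript suggests.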

At this point you should have a basic understanding of regular expressions, and I think you can now solve the problem posed at the beginning of the article.
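One possible answer to the opening problem is sketched below. The pattern is an assumption tailored to the sample: it only accepts addresses ending in .com, which is what keeps the image name qrcode_app4@2x.png in the stylesheet from being picked up. Real-world email matching needs a broader rule. The html string is a tidied copy of the sample text, and EMAIL is a hypothetical name.

```python
import re

html = """
<style>.qrcode-app {display: block; background: url(/pics/qrcode_app4@2x.png) no-repeat;}</style>
<div class="reply-doc content">
<p class="">34613453@qq.com, thank you</p>
<p class="">30604259@qq.com trouble landlord</p>
</div>
<p class="">490010464@163.com<br/>Thank you</p>
"""

# Word characters or dots, an @ sign, a domain name, then ".com".
# ASSUMPTION: every address in the sample uses a .com domain.
EMAIL = re.compile(r"[\w.]+@\w+\.com")

emails = EMAIL.findall(html)
print(emails)  # ['34613453@qq.com', '30604259@qq.com', '490010464@163.com']
```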

That concludes this basic introduction to regular expressions. Although the code examples use several functions from the re module, I have not formally introduced the module yet. To keep this article to a reasonable length, I will cover the common re functions in the next article.

 

Follow the public account "Zen of Python" (id: vttalk).
