Regular Expressions
This section covers regular expressions, a powerful tool for handling strings. Regular expressions have their own syntax, and with them, searching, replacing, and validating strings becomes easy.
For a crawler, they make it very convenient to extract the information we want from HTML.
Example Introduction
That description is still fairly abstract, so let's work through a few examples to get a feel for how regular expressions are used.
Open the regular expression test tool provided by OSChina (Open Source China) at http://tool.oschina.net/regex/. There we can enter the text to be matched and then select one of the common regular expressions; the tool shows the corresponding matches from the text we entered.
For example, here we enter the text to be matched as follows:
Hello, my phone number is 010-86432100 and email is cqc@cuiqingcai.com, and my website is http://cuiqingcai.com.
This string contains a phone number and an email address. Next we try to extract them with regular expressions.
If we select the email-address pattern on the page, the email address from the text appears in the results below. If we select the URL pattern, the URL from the text appears instead. Isn't that neat?
What happens here is regular expression matching: using a rule to extract specific text. An email address, for example, starts with a string, followed by an @ symbol, and then a domain name, so it has a specific form. A URL starts with the protocol type, followed by a colon and a double slash, and then the domain name plus a path.
For URLs, we can use the following regular expression to match:
[a-zA-Z]+://[^\s]*
If we match a string against this regular expression, any text in the string that looks like a URL is extracted.
This regular expression looks like a jumble of symbols, but it follows definite syntax rules. For example, a-z matches any lowercase letter, \s matches any whitespace character, and * matches any number of repetitions of the preceding item. The whole expression is a combination of such rules that together achieve a specific matching function.
Once the regular expression is written, we can search a long string with it; whatever else the string contains, any text that fits our rule will be found. For a web page, if we want to find all the URLs in its source code, we can match the source against the URL regular expression.
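As a minimal sketch of this idea, the URL pattern can be applied with re.findall(), which returns every non-overlapping match in a string. The HTML snippet and variable names are invented for illustration, and the pattern is slightly adapted (an assumption, not part of the original rule): it also excludes the double quote so that a match stops at the end of an attribute value.

```python
import re

# Hypothetical page source, invented for this illustration.
html = (
    '<a href="http://cuiqingcai.com">blog</a> '
    '<a href="https://www.python.org/doc/">docs</a>'
)

# Letters, then "://", then any run of characters that are neither
# whitespace nor a double quote (the quote exclusion is our addition).
url_pattern = r'[a-zA-Z]+://[^\s"]*'

urls = re.findall(url_pattern, html)
print(urls)  # ['http://cuiqingcai.com', 'https://www.python.org/doc/']
```

In plain text, where URLs are delimited by whitespace, the original pattern [a-zA-Z]+://[^\s]* can be used unchanged.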
We have mentioned a few matching rules above, but how many rules are there in total? Here is a summary of the common ones:
Pattern    Description
\w         Matches letters, digits, and underscores
\W         Matches any character that is not a letter, digit, or underscore
\s         Matches any whitespace character, equivalent to [ \t\n\r\f]
\S         Matches any non-whitespace character
\d         Matches any digit, equivalent to [0-9]
\D         Matches any non-digit character
\A         Matches the start of the string
\Z         Matches the end of the string; if there is a trailing newline, matches only up to just before it
\z         Matches the end of the string
\G         Matches the position where the last match finished
\n         Matches a newline character
\t         Matches a tab
^          Matches the beginning of a string
$          Matches the end of a string
.          Matches any character except a newline; when the re.DOTALL flag is specified, matches any character including a newline
[...]      Represents a set of characters, listed individually: [amk] matches 'a', 'm', or 'k'
[^...]     Matches characters not in the brackets: [^abc] matches any character other than a, b, or c
*          Matches 0 or more of the preceding expression
+          Matches 1 or more of the preceding expression
?          Matches 0 or 1 of the fragment defined by the preceding regular expression, non-greedy
{n}        Matches exactly n of the preceding expression
{n, m}     Matches the fragment defined by the preceding regular expression n to m times, greedy
a|b        Matches a or b
( )        Matches the expression within the parentheses and also represents a group
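A few of these rules can be tried out directly in Python. The following is a small sketch using the re module; the sample text and variable names are made up for illustration.

```python
import re

# Two-line sample text, invented for this illustration.
text = 'Tel 010-86432100\nMail cqc@cuiqingcai.com'

digits = re.findall(r'\d+', text)         # runs of one or more digits
words = re.findall(r'\w+', text)          # runs of letters, digits, underscores
phone = re.findall(r'\d{3}-\d{8}', text)  # exactly 3 digits, a hyphen, 8 digits
labels = re.findall(r'Tel|Mail', text)    # either alternative
capitals = re.findall(r'[TM]', text)      # any one character from the set

# By default "." does not match a newline, so this cannot reach "com".
no_flag = re.match(r'Tel.*com', text)
# With re.DOTALL, "." also matches the newline and the match succeeds.
with_flag = re.match(r'Tel.*com', text, re.DOTALL)

print(digits)     # ['010', '86432100']
print(phone)      # ['010-86432100']
print(labels)     # ['Tel', 'Mail']
print(no_flag)    # None
print(with_flag.group() == text)  # True
```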
This list may be a little overwhelming at first. Don't worry: below we explain the most commonly used rules in detail and show how to use them to extract the information we want from web pages.
Usage in Python
Regular expressions are not unique to Python; they are available in other programming languages as well. Python's re library provides a complete regular expression implementation, and it is the library almost always used for regular expressions in Python.
Let's look at how to use it.
match()
The first common matching method is match(). We pass it a regular expression and the string to match, and it tells us whether the regular expression matches the string.
The match() method tries to match the regular expression starting at the beginning of the string. If the match succeeds, it returns the match result; otherwise it returns None.
Let's look at an example:
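import re

content = 'Hello 123 4567 World_This is a Regex Demo'
print(len(content))
# Match from the start: Hello, a space, three digits, a space,
# four digits, a space, then ten word characters.
result = re.match(r'^Hello\s\d\d\d\s\d{4}\s\w{10}', content)
print(result)          # the match object
print(result.group())  # the matched text
print(result.span())   # (start, end) positions of the match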
Output:
41
<_sre.SRE_Match object; span=(0, 25), match='Hello 123 4567 World_This'>
Hello 123 4567 World_This
(0, 25)
Here we first declare a string containing English letters, whitespace characters, digits, and so on, and then write the regular expression ^Hello\s\d\d\d\s\d{4}\s\w{10} to match it.
The leading ^ matches the start of the string, so the match must begin with Hello. Then \s matches a whitespace character, here the space in the target string. \d matches a digit, so three \d match 123, and another \s matches the next space. For 4567 we could write four \d, but that is tedious; instead, \d followed by {4} matches the preceding item four times, that is, four digits. Another \s matches the following space, and finally \w{10} matches 10 letters, digits, or underscores. The regular expression ends there. Note that it does not cover the whole target string; the match still succeeds, but the matched result is shorter than the string.
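To illustrate the point about {4}, here is a short sketch comparing the repeated form with the counted form. Both patterns are shortened versions of the one above, written only for this comparison.

```python
import re

content = 'Hello 123 4567 World_This is a Regex Demo'

# Every digit written out with its own \d.
long_form = re.match(r'^Hello\s\d\d\d\s\d\d\d\d', content)
# The same rule written with counted repetition.
short_form = re.match(r'^Hello\s\d{3}\s\d{4}', content)

print(long_form.group())   # Hello 123 4567
print(short_form.group())  # Hello 123 4567
print(long_form.span() == short_form.span())  # True
```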
When calling match(), the first argument is the regular expression and the second is the string to match.
Printing the result shows an SRE_Match object (newer Python versions display it as re.Match), which proves that the match succeeded. Two of its methods are useful here: group() returns the matched content, here Hello 123 4567 World_This, which is exactly what our regular expression matched, and span() returns the matched range, here (0, 25), which is the position of the matched text within the original string.
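Because match() returns None when the match fails, it is usually worth checking the result before calling group() or span(). The following sketch, with illustrative variable names, shows a failed match and confirms that the span indexes into the original string.

```python
import re

content = 'Hello 123 4567 World_This is a Regex Demo'

# match() only tries at the start of the string, so a pattern that
# occurs later in the text does not match.
miss = re.match(r'World', content)
print(miss)  # None

hit = re.match(r'^Hello\s\d{3}', content)
if hit is not None:
    start, end = hit.span()
    # Slicing the original string with the span gives the matched text.
    print(content[start:end] == hit.group())  # True
```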
The example above shows the basics of using regular expressions in Python to match a piece of text.
Match Target
We just used match() to get the matched string, but what if we want to extract only part of the string, like the email address or phone number in the earlier example?
We can enclose the substring we want to extract in parentheses (). Each pair of parentheses marks the start and end of a subexpression, and each marked subexpression becomes a group, numbered in order. Calling group() with a group's index returns the extracted result.
Let's look at an example:
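import re

content = 'Hello 1234567 World_This is a Regex Demo'
# The parentheses around \d+ make the run of digits group 1.
result = re.match(r'^Hello\s(\d+)\sWorld', content)
print(result)           # the match object
print(result.group())   # the whole match
print(result.group(1))  # only the first group
print(result.span())    # (start, end) positions of the whole match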
The string is much like the previous one; this time we want to match it and extract the number 1234567. We enclose the numeric part of the regular expression in parentheses and then call group(1) to get the result.
The output is as follows:
<_sre.SRE_Match object; span=(0, 19), match='Hello 1234567 World'>
Hello 1234567 World
1234567
(0, 19)
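When a pattern contains more than one pair of parentheses, the groups are numbered from left to right. The sketch below extends the example with a second group; the extended pattern is our own illustration, not part of the original example.

```python
import re

content = 'Hello 1234567 World_This is a Regex Demo'

# Group 1 captures the digits, group 2 the word that follows.
result = re.match(r'^Hello\s(\d+)\s(\w+)', content)
if result:
    print(result.group(0))  # whole match, same as group()
    print(result.group(1))  # 1234567
    print(result.group(2))  # World_This
    print(result.groups())  # ('1234567', 'World_This')
```

Here group(0) is the whole match, and groups() returns all captured groups as a tuple.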