The earlier Python 3 introductory series covered the basics of Python. Starting with this chapter, I am sharing a tutorial on Python crawlers. Put simply, a crawler fetches data from the network so that it can be analyzed and processed. This chapter walks through a few small crawler experiments and introduces the tools a crawler relies on, such as sets, queues, and regular expressions.
To crawl a specified page with Python:
The code is as follows:
import urllib.request

url = "http://www.baidu.com"
data = urllib.request.urlopen(url).read()  # read the response body as bytes
data = data.decode('UTF-8')
print(data)
According to the official documentation, urllib.request.urlopen(url) returns an http.client.HTTPResponse object for HTTP URLs. This object has a variety of methods, such as the read() method used here, which returns the response body as bytes.
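As a minimal sketch of working with the returned response object, the helper below reads the body and decodes it with the charset declared in the response headers, falling back to UTF-8. The name fetch_text is hypothetical (it is not part of urllib), and the sketch assumes the server declares its charset correctly.

```python
import urllib.request


def fetch_text(url, default_charset="utf-8"):
    """Fetch a URL and return its body as text (hypothetical helper)."""
    with urllib.request.urlopen(url) as resp:
        # resp.headers is a message object; get_content_charset() returns
        # the charset from the Content-Type header, or None if it is absent.
        charset = resp.headers.get_content_charset() or default_charset
        raw = resp.read()  # the body as bytes
    # errors="replace" keeps going when a byte sequence cannot be decoded
    return raw.decode(charset, errors="replace")
```

A call such as fetch_text("http://www.baidu.com") would return the page as a str. Using the response in a with block closes the connection once the body has been read.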
Fetching a URL with query parameters:
import urllib.parse
import urllib.request

data = {}
data['word'] = 'one peace'
url_values = urllib.parse.urlencode(data)
url = "http://www.baidu.com/s?"
full_url = url + url_values
a = urllib.request.urlopen(full_url)
data = a.read()
data = data.decode('UTF-8')
print(data)
# Print the final URL:
print(a.geturl())
data is a dictionary; urllib.parse.urlencode() converts it to the string 'word=one+peace', which is then appended to url to form full_url.
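The encoding step can be checked without touching the network. The sketch below builds the same query string and then decodes it again with urllib.parse.parse_qs; the variable names are chosen for illustration only.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

params = {"word": "one peace"}
query = urlencode(params)  # spaces are encoded as '+'
full_url = "http://www.baidu.com/s?" + query

# parse_qs reverses the encoding; each value comes back as a list
decoded = parse_qs(urlsplit(full_url).query)

print(query)     # word=one+peace
print(full_url)  # http://www.baidu.com/s?word=one+peace
print(decoded)   # {'word': ['one peace']}
```

Letting urlencode build the query string, rather than concatenating it by hand, also takes care of characters such as '&' and '=' inside values.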
Description of Python regular expressions:
Introduction to Queues
A crawler uses a breadth-first search algorithm, and the data structure that algorithm relies on is a queue. You can of course implement a queue with a list, but that is inefficient, because removing the first element of a list takes time proportional to its length. The collections module provides a container suited to this job: collections.deque.
# Simple queue test:
from collections import deque

queue = deque(["peace", "rong", "sisi"])
queue.append("nick")
queue.append("pishi")
print(queue.popleft())
print(queue.popleft())
print(queue)
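To show how a deque drives a breadth-first traversal, here is a small sketch over an in-memory link graph. The graph, the page names, and the function name bfs_order are made up for illustration; a real crawler would discover the links by fetching pages.

```python
from collections import deque


def bfs_order(graph, start):
    """Return the nodes reachable from start in breadth-first order."""
    queue = deque([start])
    seen = {start}
    order = []
    while queue:
        node = queue.popleft()  # O(1), unlike list.pop(0)
        order.append(node)
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return order


# Hypothetical link graph: page -> pages it links to
links = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(bfs_order(links, "a"))  # ['a', 'b', 'c', 'd']
```

Pages closer to the starting page are visited first, which is the behavior a breadth-first crawler wants.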
Introduction to Sets:
In a crawler, to avoid crawling pages that have already been crawled, we put the URL of each crawled page into a set and check whether the set already contains a URL before crawling it. If it does, we skip the URL; if it does not, we first add the URL to the set and then crawl the page.
Python includes a data type for this: set. A set is an unordered collection of distinct elements. Its basic uses include membership testing and eliminating duplicate elements. Set objects also support mathematical operations such as union, intersection, difference, and symmetric difference.
Curly braces or the set() function can be used to create sets. Note: to create an empty set you must use set() rather than {}, because {} creates an empty dictionary.
The following illustrates creating and using sets:
a = {"peace", "peace", "rong", "rong", "nick"}
print(a)
"peace" in a  # membership test; evaluates to True
b = set(["peace", "peace", "rong", "rong"])
print(b)
# Union
print(a | b)
# Intersection
print(a & b)
# Difference
print(a - b)
# Symmetric difference
print(a ^ b)
# Output (the order of elements within a set may vary):
{'peace', 'rong', 'nick'}
{'peace', 'rong'}
{'peace', 'rong', 'nick'}
{'peace', 'rong'}
{'nick'}
{'nick'}
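The "check before crawling" idea described for sets can be sketched as a small filter. The function name unvisited and the /about path are hypothetical, used only to illustrate the technique.

```python
def unvisited(candidates, visited):
    """Return the candidates not seen before, recording them in visited."""
    fresh = []
    for url in candidates:
        if url not in visited:  # average O(1) membership test on a set
            visited.add(url)
            fresh.append(url)
    return fresh


visited = {"http://rlovep.com"}
found = [
    "http://rlovep.com",        # already crawled
    "http://rlovep.com/about",  # new (hypothetical path)
    "http://rlovep.com/about",  # duplicate within the same page
]
new_urls = unvisited(found, visited)
print(new_urls)  # ['http://rlovep.com/about']
```

Adding each URL to the set at the moment it is accepted also filters out duplicates that appear several times on one page.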
Regular expressions
What a crawler fetches is generally a stream of characters, and picking the URLs out of it requires some string processing; regular expressions make this task easy.
Using a regular expression takes three steps: 1. compile the regular expression; 2. match it against a string; 3. process the result.
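The three steps can be sketched on a small HTML fragment. The fragment and the pattern are simplified assumptions: the pattern only handles double-quoted href attributes and is not a general HTML parser.

```python
import re

html = '<a href="http://rlovep.com/a">A</a> <a href="/local">B</a>'

# 1. Compile the regular expression
link_re = re.compile(r'href="([^"]*)"')

# 2. Match it against the string; findall returns the captured group
hrefs = link_re.findall(html)

# 3. Process the result: keep only absolute http links
http_links = [h for h in hrefs if h.startswith("http")]

print(hrefs)       # ['http://rlovep.com/a', '/local']
print(http_links)  # ['http://rlovep.com/a']
```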
The following figure lists the syntax for regular expressions:
Using regular expressions in Python requires importing the re module; some of the module's methods are described below.
1. compile and match
In the re module, compile generates a Pattern object. Calling the match method of the Pattern instance yields a Match object, and the Match object is then used to obtain information about the match.
import re

# Compile the regular expression into a Pattern object
pattern = re.compile(r'rlovep')
# Match text with the pattern; the result is None when there is no match
m = pattern.match('rlovep.com')
if m:
    # Use the Match object to get the group information
    print(m.group())
### Output ###
# rlovep
re.compile(strPattern[, flag]):
This method is the factory method of the Pattern class; it compiles a regular expression given as a string into a Pattern object. The second parameter, flag, is the matching mode. Several modes can be combined with the bitwise OR operator '|' so that they take effect together, such as re.I | re.M. Alternatively, you can specify the modes inside the regex string: re.compile('pattern', re.I | re.M) is equivalent to re.compile('(?im)pattern').
Optional values are:
re.I (re.IGNORECASE): Ignore case (the full name is given in parentheses, likewise below)
re.M (re.MULTILINE): Multiline mode, changes the behavior of '^' and '$' (see above)
re.S (re.DOTALL): Dot-matches-all mode, changes the behavior of '.'
re.L (re.LOCALE): Makes the predefined character classes \w \W \b \B \s \S depend on the current locale
re.U (re.UNICODE): Makes the predefined character classes \w \W \b \B \s \S \d \D depend on the character properties defined by Unicode
re.X (re.VERBOSE): Verbose mode. In this mode the regular expression can span multiple lines, whitespace is ignored, and comments can be added.
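A short sketch of how the flags behave, comparing flags passed to re.compile with the equivalent inline form. The sample text is made up for illustration.

```python
import re

text = "first line\nRLOVEP second line"

# Flags passed as the second argument, combined with bitwise OR
with_flags = re.compile(r"^rlovep", re.I | re.M)
# The same modes written inline at the start of the pattern
inline = re.compile(r"(?im)^rlovep")

m1 = with_flags.search(text)
m2 = inline.search(text)
# Without the flags, '^' only matches at the very start and case matters
no_flags = re.search(r"^rlovep", text)

print(m1.group(), m1.span())  # RLOVEP (11, 17)
print(m2.group(), m2.span())  # RLOVEP (11, 17)
print(no_flags)               # None
```

re.M lets '^' match right after the newline, and re.I lets the lowercase pattern match the uppercase text.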
Match: A Match object is the result of a match. It contains a lot of information about the match, which you can obtain through the read-only attributes and the methods that Match provides.
Attributes:
string: The text used for matching.
re: The Pattern object used for matching.
pos: The index in the text at which the regular expression starts searching. The value is the same as the parameter of the same name passed to Pattern.match() and Pattern.search().
endpos: The index in the text at which the regular expression stops searching. The value is the same as the parameter of the same name passed to Pattern.match() and Pattern.search().
lastindex: The index of the last captured group. It is None if no group was captured.
lastgroup: The alias of the last captured group. It is None if that group has no alias or if no group was captured.
Methods:
group([group1, ...]):
Gets the string captured by one or more groups; when several arguments are given, the results are returned as a tuple. group1 can be a group number or an alias. Number 0 represents the entire matched substring, and calling group() without arguments is the same as group(0). A group that captured nothing returns None, and a group that captured several times returns the last captured substring.
groups([default]):
Returns the strings captured by all groups as a tuple. Equivalent to calling group(1, 2, ..., last). default is the value substituted for groups that captured nothing; it defaults to None.
groupdict([default]):
Returns a dictionary whose keys are the aliases of the named groups and whose values are the substrings captured by those groups; groups without an alias are not included. default has the same meaning as above.
start([group]):
Returns the start index in string of the substring captured by the specified group (the index of the first character of the substring). group defaults to 0.
end([group]):
Returns the end index in string of the substring captured by the specified group (the index of the last character of the substring + 1). group defaults to 0.
span([group]):
Returns (start(group), end(group)).
expand(template):
Substitutes the matched groups into template and returns the result. template can reference groups with \id, \g<id>, or \g<name>, but number 0 cannot be used. \id is equivalent to \g<id>; however, \10 is taken to mean the 10th group, so to express \1 followed by the character '0' you must write \g<1>0.
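The attributes and methods above can be tried on a single match. This is a small sketch; the address peace@rlovep.com and the group names user and host are made up for illustration.

```python
import re

m = re.match(r"(?P<user>\w+)@(?P<host>\w+)\.com", "peace@rlovep.com")

whole = m.group()        # same as m.group(0): the entire match
pair = m.group(1, 2)     # several groups at once, returned as a tuple
all_groups = m.groups()  # all groups, starting from group 1
named = m.groupdict()    # only named groups, keyed by alias
host_span = (m.start("host"), m.end("host"))
swapped = m.expand(r"\g<host>:\g<user>")  # substitute groups into a template

print(whole)                       # peace@rlovep.com
print(pair)                        # ('peace', 'rlovep')
print(all_groups)                  # ('peace', 'rlovep')
print(named)                       # {'user': 'peace', 'host': 'rlovep'}
print(host_span, m.span("host"))   # (6, 12) (6, 12)
print(m.lastindex, m.lastgroup)    # 2 host
print(swapped)                     # rlovep:peace
```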
Pattern: A Pattern object is a compiled regular expression; the methods that Pattern provides can be used to match and search text.
A Pattern cannot be instantiated directly and must be constructed with re.compile().
Pattern provides several read-only attributes for obtaining information about the expression:
pattern: The expression string used at compile time.
flags: The matching modes used at compile time, in numeric form.
groups: The number of groups in the expression.
groupindex: A dictionary whose keys are the aliases of the named groups in the expression and whose values are the corresponding group numbers; groups without an alias are not included.
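A quick sketch of reading these attributes from a compiled pattern. The pattern itself is arbitrary and chosen only for illustration.

```python
import re

p = re.compile(r"(?P<word>\w+) (\d+)", re.I)

print(p.pattern)                # (?P<word>\w+) (\d+)
print(bool(p.flags & re.I))     # True: the IGNORECASE bit is set
print(p.groups)                 # 2
# groupindex is a read-only mapping; convert it to a dict for display
print(dict(p.groupindex))       # {'word': 1}
```

Testing a flag with a bitwise AND is more robust than comparing p.flags to a number, because Python 3 also sets re.UNICODE on str patterns.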
Instance method [| re module function]:
match(string[, pos[, endpos]]) | re.match(pattern, string[, flags]):
This method tries to match pattern starting at index pos of string. If the end of pattern is reached with everything matched, it returns a Match object; if pattern fails to match along the way, or endpos is reached before the match completes, it returns None.
pos and endpos default to 0 and len(string) respectively. re.match() cannot take these two parameters; its flags parameter specifies the matching modes used when compiling pattern.
Note: this method does not require a complete match. If string still has characters left when the end of pattern is reached, the match still counts as successful. To require a complete match, add the boundary matcher '$' at the end of the expression.
search(string[, pos[, endpos]]) | re.search(pattern, string[, flags]):
This method looks for a substring of string that matches. It tries to match pattern starting at index pos of string and returns a Match object if the end of pattern is reached with everything matched; otherwise it increments pos by 1 and tries again, returning None if no match has been found by the time pos equals endpos. pos and endpos default to 0 and len(string) respectively. re.search() cannot take these two parameters; its flags parameter specifies the matching modes used when compiling pattern.
split(string[, maxsplit]) | re.split(pattern, string[, maxsplit]):
Splits string at each matching substring and returns the pieces as a list. maxsplit specifies the maximum number of splits; if it is not given, the string is split at every match.
findall(string[, pos[, endpos]]) | re.findall(pattern, string[, flags]):
Searches string and returns all matching substrings as a list.
finditer(string[, pos[, endpos]]) | re.finditer(pattern, string[, flags]):
Searches string and returns an iterator that yields each match result (Match object) in order.
sub(repl, string[, count]) | re.sub(pattern, repl, string[, count]):
Replaces each matching substring of string with repl and returns the resulting string. When repl is a string, it can reference groups with \id, \g<id>, or \g<name>, but number 0 cannot be used. When repl is a function, it should accept a single argument (the Match object) and return the replacement string (group references in the returned string are not expanded). count specifies the maximum number of substitutions; if it is not given, all matches are replaced.
subn(repl, string[, count]) | re.subn(pattern, repl, string[, count]):
Returns (sub(repl, string[, count]), number of substitutions).
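The methods listed above can be compared side by side on one input. This is only a sketch: the sample string is arbitrary, and count and maxsplit are passed as keyword arguments.

```python
import re

p = re.compile(r"\d+")
s = "one1two22three333"

parts = p.split(s)                  # split at every run of digits
first_split = p.split(s, maxsplit=1)
found = p.findall(s)                # all matching substrings
spans = [m.span() for m in p.finditer(s)]  # one Match object per hit
masked = p.sub("#", s)              # replace every match
lengths = p.sub(lambda m: str(len(m.group())), s)  # repl as a function
limited = p.subn("#", s, count=2)   # (new string, number of substitutions)

# match() succeeds on a prefix; adding '$' requires a complete match
prefix = p.match("123abc")
exact = re.match(r"\d+$", "123abc")

print(parts)        # ['one', 'two', 'three', '']
print(first_split)  # ['one', 'two22three333']
print(found)        # ['1', '22', '333']
print(spans)        # [(3, 4), (7, 9), (14, 17)]
print(masked)       # one#two#three#
print(lengths)      # one1two2three3
print(limited)      # ('one#two#three333', 2)
print(prefix.group(), exact)  # 123 None
```

Note the trailing empty string in the split result: the input ends with a match, so the piece after it is empty.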
2. re.match(pattern, string, flags=0)
Function parameter description:
Parameter | Description
pattern | The regular expression to match.
string | The string to match against.
flags | Flags that control how the regular expression is matched, such as case sensitivity, multiline matching, and so on.
We can use the group(num) or groups() method of the match object to get the matched text.
Match object method | Description
group(num=0) | The string matched by the entire expression. group() can take several group numbers at once, in which case it returns a tuple containing the values for those groups.
groups() | Returns a tuple containing the strings of all groups, from group 1 up to the last group.
The demo is as follows:
# re.match
import re

print(re.match("rlovep", "rlovep.com"))  # matches rlovep
print(re.match("rlovep", "rlovep.com").span())  # matches rlovep from the start
print(re.match("com", "http://rlovep.com"))  # not at the start, so the match fails
# Output:
<_sre.SRE_Match object; span=(0, 6), match='rlovep'>
(0, 6)
None
Example two: using group
import re

line = "This is my blog"
# Match a string containing "is"
matchobj = re.match(r'(.*) is (.*?) .*', line, re.M | re.I)
# Groups are used for the output: group() without arguments is the entire match;
# group(1) is the first parenthesized group counting from the left, and so on.
if matchobj:
    print("matchobj.group():", matchobj.group())    # the entire match
    print("matchobj.group(1):", matchobj.group(1))  # the first group
    print("matchobj.group(2):", matchobj.group(2))  # the second group
else:
    print("No match!!")
# Output:
matchobj.group(): This is my blog
matchobj.group(1): This
matchobj.group(2): my
3. re.search method
re.search scans the entire string and returns the first successful match.
Function syntax:
re.search(pattern, string, flags=0)
Function parameter description:
Parameter | Description
pattern | The regular expression to match.
string | The string to match against.
flags | Flags that control how the regular expression is matched, such as case sensitivity, multiline matching, and so on.
We can use the group(num) or groups() method of the match object to get the matched text.
Match object method | Description
group(num=0) | The string matched by the entire expression. group() can take several group numbers at once, in which case it returns a tuple containing the values for those groups.
groups() | Returns a tuple containing the strings of all groups, from group 1 up to the last group.
Example one:
import re

print(re.search("rlovep", "rlovep.com").span())
print(re.search("com", "http://rlovep.com").span())
# Output:
(0, 6)
(14, 17)
Example two:
import re

line = "This is my blog"
# Match a string containing "is"
matchobj = re.search(r'(.*) is (.*?) .*', line, re.M | re.I)
# Groups are used for the output: group() without arguments is the entire match;
# group(1) is the first parenthesized group counting from the left, and so on.
if matchobj:
    print("matchobj.group():", matchobj.group())    # the entire match
    print("matchobj.group(1):", matchobj.group(1))  # the first group
    print("matchobj.group(2):", matchobj.group(2))  # the second group
else:
    print("No match!!")
# Output:
matchobj.group(): This is my blog
matchobj.group(1): This
matchobj.group(2): my
Difference between search and match: re.match only matches at the start of the string; if the start of the string does not match the regular expression, the match fails and the function returns None. re.search scans the whole string until it finds a match.
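The difference can be seen directly on one string. This is a minimal sketch; the variable names are for illustration only.

```python
import re

s = "http://rlovep.com"

at_start = re.match("http", s)    # matches: "http" is at the start
not_start = re.match("com", s)    # None: "com" is not at the start
anywhere = re.search("com", s)    # found later in the string
anchored = re.search("^com", s)   # '^' makes search behave like match here

print(at_start.span())   # (0, 4)
print(not_start)         # None
print(anywhere.span())   # (14, 17)
print(anchored)          # None
```

Since both functions return None on failure, check the result before calling span() or group() on it.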
A small Python crawler test
The following uses Python to crawl all HTTP links on a page and then recursively crawl the links on the subpages, using a set and a queue. The entry page is my own site, and this first version has many bugs. The code is as follows:
import re
import urllib.request
from collections import deque

# Use a queue to hold the URLs waiting to be crawled
queue = deque()
# Use visited to avoid crawling the same page twice
visited = set()
url = 'http://rlovep.com'  # entry page; can be replaced with another
# Enqueue the initial page
queue.append(url)
visited.add(url)
cnt = 0
while queue:
    url = queue.popleft()  # take the element at the head of the queue
    print('Crawled so far: ' + str(cnt) + '   now crawling <--- ' + url)
    cnt += 1
    # Use try ... except so that one bad page does not abort the program
    try:
        # Fetch the page
        urlop = urllib.request.urlopen(url)
        # Skip anything that is not an HTML page
        if 'html' not in (urlop.getheader('Content-Type') or ''):
            continue
        # Decode the body as UTF-8
        data = urlop.read().decode('utf-8')
    except Exception:
        continue
    # Use a regular expression to extract all links in the page, check whether
    # they have been seen, and add the new ones to the pending queue
    linkre = re.compile("href=['\"]([^\"'>]*?)['\"].*?")
    for x in linkre.findall(data):  # findall returns a list of all matches
        # Keep only http links that have not been seen yet
        if 'http' in x and x not in visited:
            visited.add(x)  # mark on enqueue so a link is not queued twice
            queue.append(x)
            print('Join queue ---> ' + x)
The results are as follows: