Python 3 Crawler Basics and Regular Expressions
The previous Python 3 getting-started series covered the language basics; this chapter introduces web crawlers. A crawler is conceptually simple: it fetches network data for analysis and processing. Here we walk through a few small crawler experiments and introduce the tools crawlers rely on, such as sets, queues, and regular expressions.
Fetching a specified page with Python:
The code is as follows:
```python
import urllib.request

url = "http://www.baidu.com"
data = urllib.request.urlopen(url).read()
# data = data.decode('UTF-8')
print(data)
```
urllib.request.urlopen(url) returns an http.client.HTTPResponse object, which provides various methods; the read() method used here returns the response body as bytes.
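The read()-then-decode pattern can also be tried offline; the sketch below fetches a local temporary file through urlopen via a file:// URL instead of a live site, purely so the example runs without network access:

```python
import os
import tempfile
import urllib.request
from pathlib import Path

# Write a small local file and open it through urlopen via a file:// URL,
# so this example runs without network access; a real crawler uses http URLs
tmp = tempfile.NamedTemporaryFile(mode='w', suffix='.html', delete=False)
tmp.write('<html>hello</html>')
tmp.close()

resp = urllib.request.urlopen(Path(tmp.name).as_uri())
data = resp.read()            # read() returns bytes
print(data.decode('utf-8'))   # decode the bytes into a str
os.unlink(tmp.name)           # clean up the temporary file
```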
Fetching a URL with query parameters:
```python
import urllib
import urllib.parse
import urllib.request

data = {}
data['word'] = 'one peace'
url_values = urllib.parse.urlencode(data)
url = 'http://www.baidu.com/s?'
full_url = url + url_values
a = urllib.request.urlopen(full_url)
data = a.read()
data = data.decode('utf-8')
print(data)
# print the URL that was actually fetched
print(a.geturl())
```
data is a dictionary; urllib.parse.urlencode(data) converts it into the string 'word=one+peace'. Finally the base url and url_values are joined into full_url.
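As a quick standalone check of the encoding step (no network needed):

```python
import urllib.parse

data = {'word': 'one peace'}
url_values = urllib.parse.urlencode(data)
print(url_values)   # the space becomes '+': word=one+peace

full_url = 'http://www.baidu.com/s?' + url_values
print(full_url)     # http://www.baidu.com/s?word=one+peace
```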
Crawler tools: queue introduction
A crawler program uses a breadth-first traversal, and that algorithm relies on a queue data structure. You could implement a queue with a plain list, but popping from the front of a list is inefficient. The standard library offers a better option:
The collections module provides a queue-like container: collections.deque
```python
# Simple queue test
from collections import deque

queue = deque(["peace", "rong", "sisi"])
queue.append("nick")
queue.append("pishi")
print(queue.popleft())
print(queue.popleft())
print(queue)
```
Set introduction:
In a crawler program, to avoid fetching pages that have already been fetched, we put the URL of each crawled page into a set. Before crawling a URL, we first check whether it is already in the set: if it is, we skip it; if not, we add it to the set and then crawl the page.
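That skip-if-visited logic can be sketched on its own; the crawl function below is a hypothetical stand-in for the real fetching code:

```python
visited = set()

def crawl(url):
    """Crawl a URL only if it has not been seen before (fetching omitted)."""
    if url in visited:
        return False      # already crawled: skip it
    visited.add(url)      # mark as visited before fetching
    # ... fetch and parse the page here ...
    return True

print(crawl('http://rlovep.com/a'))  # True: first visit
print(crawl('http://rlovep.com/a'))  # False: skipped on repeat
```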
Python also includes a set data type. A set is an unordered collection of non-repeating elements. Its basic uses include membership testing and eliminating duplicate entries. Set objects also support mathematical operations such as union, intersection, difference, and symmetric difference.
A set can be created with braces or the set() function. Note: to create an empty set you must use set(), not {} — {} creates an empty dictionary.
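A quick check of that pitfall:

```python
empty_dict = {}       # braces alone make a dictionary
empty_set = set()     # the only way to make an empty set
print(type(empty_dict))  # <class 'dict'>
print(type(empty_set))   # <class 'set'>
```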
The following shows how to create a collection:
```python
a = {"peace", "peace", "rong", "rong", "nick"}
print(a)
print("peace" in a)  # membership test
b = set(["peace", "peace", "rong", "rong"])
print(b)
print(a | b)  # union
print(a & b)  # intersection
print(a - b)  # difference
print(a ^ b)  # symmetric difference
# Output (set element order may vary):
# {'peace', 'rong', 'nick'}
# True
# {'peace', 'rong'}
# {'peace', 'rong', 'nick'}
# {'peace', 'rong'}
# {'nick'}
# {'nick'}
```
Regular Expression
When a crawler fetches a page, it generally gets back a response stream; picking URLs out of that text requires basic string-processing ability, and regular expressions make this task easy.
Working with a regular expression involves three steps: 1. compile the regular expression; 2. match the regular expression against a string; 3. process the result.
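The three steps can be sketched in a few lines (the href pattern and sample string here are just illustrative):

```python
import re

# Step 1: compile the regular expression
pattern = re.compile(r'href="([^"]+)"')

# Step 2: match it against a string (search scans the whole string)
m = pattern.search('<a href="http://rlovep.com">blog</a>')

# Step 3: process the result
if m:
    print(m.group(1))  # http://rlovep.com
```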
Standard regular expression syntax is assumed here. To use regular expressions in Python, you need to import the re module; the following describes some of the methods in this module.
1. compile and match
re.compile in the re module produces a Pattern object; calling the Pattern instance's match method against some text then yields a Match instance, from which the match information is obtained.
```python
import re

# Compile the regular expression into a Pattern object
pattern = re.compile(r'rlovep')

# Use the Pattern to match the text; returns None if the match fails
m = pattern.match('rlovep.com')
if m:
    # Use the Match object to get group information
    print(m.group())

### Output ###
# rlovep
```
re.compile(strPattern[, flag]):
This is a factory method of the Pattern class; it compiles a regular expression given as a string into a Pattern object. The second parameter, flag, is the matching mode. Multiple flags can be combined with the bitwise OR operator '|', for example re.I | re.M. The mode can also be specified inside the regex string itself: re.compile('pattern', re.I | re.M) is equivalent to re.compile('(?im)pattern').
Optional values:
- re.I (re.IGNORECASE): case-insensitive matching
- re.M (re.MULTILINE): multiline mode; changes the behavior of '^' and '$'
- re.S (re.DOTALL): "dot matches all" mode; changes the behavior of '.' so it also matches newlines
- re.L (re.LOCALE): makes the predefined character classes \w \W \b \B \s \S depend on the current locale
- re.U (re.UNICODE): makes the predefined character classes \w \W \b \B \s \S \d \D depend on Unicode character properties
- re.X (re.VERBOSE): verbose mode; the regular expression can span multiple lines, whitespace is ignored, and comments can be added
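A small illustration of combining flags (the pattern and text are made up for the demo): re.M lets '^' and '$' anchor at each line, while re.I ignores case.

```python
import re

text = "First line\nsecond LINE"
# re.I ignores case; re.M makes ^ and $ match at each line boundary
pattern = re.compile(r'^\w+ line$', re.I | re.M)
print(pattern.findall(text))  # ['First line', 'second LINE']
```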
Match:
A Match object is a matching result that contains a lot of information about this matching. You can use the readable attributes or methods provided by Match to obtain this information.
Attributes:
- string: the text used for matching.
- re: the Pattern object used for matching.
- pos: the index in the text at which the regular expression starts searching; the same value passed to Pattern.match() and Pattern.search().
- endpos: the index in the text at which the search ends; the same value passed to Pattern.match() and Pattern.search().
- lastindex: the index of the last captured group; None if no group was captured.
- lastgroup: the alias of the last captured group; None if that group has no alias or was not captured.

Methods:
- group([group1, ...]): returns the substring(s) captured by one or more groups. With multiple arguments, the substrings are returned as a tuple. Each argument may be a group number or an alias; number 0 means the whole matched substring. With no arguments, group(0) is returned. A group that captured nothing returns None; a group that captured several times returns its last capture.
- groups([default]): returns the substrings captured by all groups as a tuple, equivalent to calling group(1, 2, ..., last). default is the value substituted for groups that captured nothing; it defaults to None.
- groupdict([default]): returns a dictionary mapping the alias of each named group to its captured substring; unnamed groups are not included. default has the same meaning as above.
- start([group]): returns the start index in the string of the substring captured by the given group (the index of its first character). group defaults to 0.
- end([group]): returns the end index in the string of the substring captured by the given group (the index of its last character + 1). group defaults to 0.
- span([group]): returns (start(group), end(group)).
- expand(template): substitutes the matched groups into template and returns the result. In template, groups can be referenced with \id or \g<id> or \g<name>, but \0 is not allowed. \id and \g<id> are equivalent, but \10 is interpreted as a reference to group 10; to express group 1 followed by the literal character '0', you must write \g<1>0.
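A brief demonstration of several of these Match attributes and methods on a throwaway pattern:

```python
import re

m = re.match(r'(?P<first>\w+) (?P<second>\w+)', 'hello world python')
print(m.group(0))      # hello world
print(m.groups())      # ('hello', 'world')
print(m.groupdict())   # {'first': 'hello', 'second': 'world'}
print(m.span(1))       # (0, 5)
# expand: \2 references group 2, \g<first> references the named group
print(m.expand(r'\2-\g<first>'))  # world-hello
```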
Pattern:
The Pattern object is a compiled regular expression. You can use a series of methods provided by Pattern to search for the text.
Pattern cannot be directly instantiated and must be constructed using re. compile.
Pattern provides several read-only attributes for obtaining information about the expression:
- pattern: the expression string that was compiled.
- flags: the matching mode used at compile time, as an integer.
- groups: the number of groups in the expression.
- groupindex: a dictionary mapping the alias of each named group in the expression to the group's number; unnamed groups are not included.

Instance methods [and the equivalent re module functions]:
- match(string[, pos[, endpos]]) | re.match(pattern, string[, flags]): tries to match the pattern starting at index pos of string. If the pattern can still match when the attempt finishes, a Match object is returned; if matching fails along the way, or endpos is reached before matching completes, None is returned. pos and endpos default to 0 and len(string) respectively; re.match() cannot specify these two parameters, and its flags parameter specifies the matching mode used when compiling pattern. Note that this method does not require a full match: if string has characters left over when pattern is exhausted, the match still counts as successful. For a full match, add the boundary assertion '$' at the end of the expression.
- search(string[, pos[, endpos]]) | re.search(pattern, string[, flags]): looks for a substring that matches successfully. It tries to match the pattern starting at pos; if that succeeds, a Match object is returned; otherwise pos is incremented by 1 and the attempt is repeated; if pos reaches endpos without a match, None is returned. Defaults are as above; re.search() cannot specify pos and endpos.
- split(string[, maxsplit]) | re.split(pattern, string[, maxsplit]): splits string on the matched substrings and returns the pieces as a list. maxsplit gives the maximum number of splits; if not specified, all possible splits are performed.
- findall(string[, pos[, endpos]]) | re.findall(pattern, string[, flags]): searches string and returns all matching substrings as a list.
- finditer(string[, pos[, endpos]]) | re.finditer(pattern, string[, flags]): searches string and returns an iterator that yields each match result (a Match object) in order.
- sub(repl, string[, count]) | re.sub(pattern, repl, string[, count]): replaces each matching substring in string with repl and returns the resulting string. When repl is a string, it may reference groups with \id or \g<id> or \g<name>, but not \0. When repl is a callable, it must take a single Match object and return a replacement string (which cannot reference groups). count gives the maximum number of replacements; if not specified, all matches are replaced.
- subn(repl, string[, count]) | re.subn(pattern, repl, string[, count]): returns the tuple (sub(repl, string[, count]), number of replacements).
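The Pattern methods above can be exercised together on a small example:

```python
import re

p = re.compile(r'\d+')
text = 'one1two22three333'

print(p.findall(text))   # ['1', '22', '333']
print(p.split(text))     # ['one', 'two', 'three', '']
print(p.sub('#', text))  # one#two#three#
print(p.subn('#', text)) # ('one#two#three#', 3)
```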
2. re.match(pattern, string, flags=0)
Function parameter description:
| Parameter | Description |
| --- | --- |
| pattern | The regular expression to be matched |
| string | The string to match against |
| flags | Flags that control the matching mode, such as case-insensitive or multiline matching |
The re.match method returns a Match object on success; otherwise it returns None.
We can use the group(num) or groups() methods of the Match object to obtain the matched expression.
| Match object method | Description |
| --- | --- |
| group(num=0) | Returns the string matched by the whole expression; group() can also take multiple group numbers at once, in which case it returns a tuple containing those groups' values |
| groups() | Returns a tuple containing all group strings, from group 1 up to the last group number |
The demo is as follows:
```python
# re.match example
import re

print(re.match('rlovep', 'rlovep.com'))         # matches 'rlovep'
print(re.match('rlovep', 'rlovep.com').span())  # matches 'rlovep'
print(re.match('com', 'http://rlovep.com'))     # not at the start, so no match
# Output:
# <_sre.SRE_Match object; span=(0, 6), match='rlovep'>
# (0, 6)
# None
```
Example 2: Use group
```python
import re

line = "This is my blog"
# Match a string containing "is"
matchObj = re.match(r'(.*) is (.*?) .*', line, re.M | re.I)
# group() without arguments returns the whole successful match;
# group(1) is the first parenthesized group from the left, and so on
if matchObj:
    print("matchObj.group():", matchObj.group())    # the whole match
    print("matchObj.group(1):", matchObj.group(1))  # the first group
    print("matchObj.group(2):", matchObj.group(2))  # the second group
else:
    print("No match!!")
# Output:
# matchObj.group(): This is my blog
# matchObj.group(1): This
# matchObj.group(2): my
```
3. The re.search method
re.search scans the entire string and returns the first successful match.
Function Syntax:
```python
re.search(pattern, string, flags=0)
```
Function parameter description:
| Parameter | Description |
| --- | --- |
| pattern | The regular expression to be matched |
| string | The string to match against |
| flags | Flags that control the matching mode, such as case-insensitive or multiline matching |
The re.search method returns a Match object on success; otherwise it returns None.
We can use the group(num) or groups() methods of the Match object to obtain the matched expression.
| Match object method | Description |
| --- | --- |
| group(num=0) | Returns the string matched by the whole expression; group() can also take multiple group numbers at once, in which case it returns a tuple containing those groups' values |
| groups() | Returns a tuple containing all group strings, from group 1 up to the last group number |
Example 1:
```python
import re

print(re.search('rlovep', 'rlovep.com').span())
print(re.search('com', 'http://rlovep.com').span())
# Output:
# (0, 6)
# (14, 17)
```
Example 2:
```python
import re

line = "This is my blog"
# Search the string
matchObj = re.search(r'(.*) is (.*?) .*', line, re.M | re.I)
# group() without arguments returns the whole successful match;
# group(1) is the first parenthesized group from the left, and so on
if matchObj:
    print("matchObj.group():", matchObj.group())    # the whole match
    print("matchObj.group(1):", matchObj.group(1))  # the first group
    print("matchObj.group(2):", matchObj.group(2))  # the second group
else:
    print("No match!!")
# Output:
# matchObj.group(): This is my blog
# matchObj.group(1): This
# matchObj.group(2): my
```
Differences between search and match:
re.match only matches at the beginning of the string: if the start of the string does not fit the regular expression, the match fails and the function returns None. re.search, by contrast, scans the entire string until it finds a match.
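The difference is easy to see side by side:

```python
import re

text = 'www.rlovep.com'
print(re.match('com', text))          # None: 'com' is not at the start
print(re.search('com', text).span())  # (11, 14): found later in the string
```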
Python crawler Test
Use Python to capture all the http links on a page and recursively crawl the links found on sub-pages, using the sets and queues introduced above. Here the crawl starts from my own site; this first version still has plenty of bugs.
The code is as follows:
```python
import re
import urllib.request
import urllib
from collections import deque

# Use a queue to store the URLs waiting to be crawled
queue = deque()
# Use visited to avoid crawling the same page twice
visited = set()

url = 'http://rlovep.com'  # entry page; change it to any other site
queue.append(url)          # enqueue the first page
cnt = 0

while queue:
    url = queue.popleft()  # dequeue the URL at the head of the queue
    visited |= {url}       # mark it as visited
    print('crawled: ' + str(cnt) + '  crawling <--- ' + url)
    cnt += 1

    # Fetch the page
    urlop = urllib.request.urlopen(url)
    # Only process HTML pages
    if 'html' not in urlop.getheader('Content-Type'):
        continue

    # Use try..except so an exception does not hang the program
    try:
        # Decode the response body as UTF-8
        data = urlop.read().decode('utf-8')
    except:
        continue

    # Extract all links on the page with a regular expression; for each one,
    # check whether it has been visited, then add it to the crawl queue
    linkre = re.compile(r'href=[\'"]([^\'">]*?)[\'"]')
    for x in linkre.findall(data):  # findall returns a list of all matches
        # Keep only http links that have not been crawled yet
        if 'http' in x and x not in visited:
            queue.append(x)
            print('add to queue ---> ' + x)
```
The result is as follows: