Python 3.x Crawler Tutorial: Webpage Crawling, Image Crawling, and Automatic Login
Original work by Lin Bingwen (Evankaka). When reprinting, please indicate the source: http://blog.csdn.net/evankaka
Abstract: This article uses Python 3.4 to crawl webpages, crawl images, and log in automatically. Before we start crawling, this section gives a brief explanation of the HTTP protocol, which will make the crawling process much clearer.
I. The HTTP Protocol
HTTP is short for HyperText Transfer Protocol. Its development was a cooperation between the World Wide Web Consortium (W3C) and the Internet Engineering Task Force (IETF), which eventually published a series of RFCs. RFC 1945 defines HTTP/1.0, and the best-known of them, RFC 2616, defines the commonly used version HTTP/1.1.
HTTP is the transfer protocol used to deliver hypertext from a WWW server to a local browser. It makes browsers more efficient and reduces network traffic. It not only ensures that computers transfer hypertext documents correctly and quickly, but also determines which parts of a document are transmitted and which parts of the content are displayed first (for example, text before graphics).
HTTP Request Response Model
In the HTTP protocol, a request is always initiated by the client, and the server returns the response.
This restricts the use of the HTTP protocol: the server cannot push a message to the client when the client has not initiated a request.
HTTP is a stateless protocol: a request from a client has no relation to the previous request from the same client.
Workflow
An HTTP operation is called a transaction. The procedure can be divided into four steps:
1) First, the client and the server establish a connection. For example, clicking a hyperlink starts an HTTP session.
2) After the connection is established, the client sends a request to the server. The request consists of the Uniform Resource Identifier (URI), the protocol version number, and MIME information, followed by request modifiers, client information, and possibly a body.
3) After receiving the request, the server returns the corresponding response. The response consists of a status line, including the protocol version number and a success or error code, and MIME information, followed by server information, entity metadata, and possibly a body.
4) The client receives the information returned by the server, the browser displays it on the user's screen, and the client disconnects from the server.
If an error occurs in any of these steps, the error message is returned to the client and displayed. For the user, these steps are all handled by HTTP; the user only needs to click and wait for the information to appear.
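To make these four steps concrete, here is a minimal sketch using Python's standard http.client module; it is an added illustration, and the host name www.example.com and the path /index.html are placeholders rather than values from the article.

import http.client

# Step 1: establish a connection to the server
conn = http.client.HTTPConnection("www.example.com", 80)

# Step 2: send the request (the library fills in the method, URI, and protocol version line)
conn.request("GET", "/index.html")

# Step 3: receive the response: status line, header fields, and entity body
resp = conn.getresponse()
print(resp.status, resp.reason)   # e.g. 200 OK
print(resp.getheaders())          # the response header fields
body = resp.read()                # the entity body

# Step 4: close the connection
conn.close()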
Request Header
The request header allows the client to send additional request information and client information to the server.
Common request headers
Accept
The Accept request header field is used to specify the types of content the client accepts. For example, Accept: image/gif indicates that the client wants to accept resources in GIF image format, and Accept: text/html indicates that the client wants to accept HTML text.
Accept-Charset
The Accept-Charset request header field is used to specify the character sets accepted by the client, for example Accept-Charset: iso-8859-1, gb2312. If this field is not set in the request message, any character set is acceptable by default.
Accept-Encoding
The Accept-Encoding request header field is similar to Accept, but it is used to specify the acceptable content encodings, for example Accept-Encoding: gzip, deflate. If this field is not set in the request message, the server assumes the client can accept all content encodings.
Accept-Language
The Accept-Language request header field is similar to Accept, but it is used to specify a natural language, for example Accept-Language: zh-cn. If this field is not set in the request message, the server assumes the client accepts all languages.
Authorization
The Authorization request header field is used to prove that the client has the right to view a resource. When a browser accesses a page and the server's response code is 401 (Unauthorized), the browser can resend the request with an Authorization header field, asking the server to verify it.
Host (this header field is required when a request is sent)
The Host request header field is used to specify the Internet host and port number of the requested resource. It is usually extracted from the HTTP URL. For example:
If we enter http://www.guet.edu.cn/index.html in the browser,
the request message sent by the browser contains the Host request header field, as follows:
Host: www.guet.edu.cn
The default port number is 80. If another port is specified, the field becomes Host: www.guet.edu.cn followed by a colon and the specified port number.
User-Agent
When we log on to an online forum, we often see welcome messages listing the name and version of our operating system and of our browser, which surprises many people. In fact, the server application obtains this information from the User-Agent request header field. The User-Agent request header field allows the client to tell the server its operating system, browser, and other attributes. This header field is not required, however: if we write our own client and do not send a User-Agent header field, the server simply has no way to learn this information about us.
Example of request header:
GET /form.html HTTP/1.1 (CRLF)
Accept: image/gif, image/x-xbitmap, image/jpeg, application/x-shockwave-flash, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */* (CRLF)
Accept-Language: zh-cn (CRLF)
Accept-Encoding: gzip, deflate (CRLF)
If-Modified-Since: Wed, 05 Jan 2007 11:21:25 GMT (CRLF)
If-None-Match: W/"80b1a4c018f3c41: 8317" (CRLF)
User-Agent: Mozilla/4.0 (compatible; MSIE6.0; Windows NT 5.0) (CRLF)
Host: www.guet.edu.cn (CRLF)
Connection: Keep-Alive (CRLF)
(CRLF)
Response Header
The Response Header allows the server to transmit additional response information that cannot be placed in the status line, as well as information about the server and the next access to the resource identified by the Request-URI.
Common Response Headers
Location
The Location response header field is used to redirect the receiver to a new location. It is often used, for example, when a domain name has changed.
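As an added illustration (not from the original article), the sketch below shows what a Location-driven redirect looks like from urllib's point of view: urlopen() follows the redirect automatically, so geturl() reports the final address. The URL is only a placeholder.

import urllib.request

# a URL that answers with 301/302 and a Location header (placeholder address)
response = urllib.request.urlopen("http://www.example.com/old-page")

# urllib follows the Location header automatically,
# so geturl() returns the address we were redirected to
print(response.geturl())
print(response.getcode())  # status code of the final response, e.g. 200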
Server
The Server response header field contains the software information used by the server to process the request. It is the counterpart of the User-Agent request header field. Below is an example of the Server response header field:
Server: Apache-Coyote/1.1
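To read such response header fields from Python (a small added sketch, not part of the original article; the URL is simply the one used later in this tutorial), the response object returned by urlopen() exposes them through getheader() and info():

import urllib.request

webPage = urllib.request.urlopen("http://www.douban.com/")
print(webPage.getheader('Server'))  # the server software string, if the server sends one
print(webPage.info())               # all response header fields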
WWW-Authenticate
The WWW-Authenticate response header field must be included in 401 (Unauthorized) response messages. When the client receives the 401 response, it can resend the request with an Authorization header field so that the server can verify it.
For example, WWW-Authenticate: Basic realm="Basic Auth Test!" shows that the server uses the Basic authentication mechanism for the requested resource.
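Here is a short sketch of how a client can answer such a challenge with urllib.request; it is an added illustration, and the URL, user name, and password are placeholders, not values from the original article.

import urllib.request

# placeholder values for illustration only
url = "http://www.example.com/protected/"
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, url, "someuser", "somepassword")

# the handler answers the 401 challenge by resending the request with an Authorization header
auth_handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
opener = urllib.request.build_opener(auth_handler)

response = opener.open(url)
print(response.read().decode('utf-8'))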
II. Python 3.4 Crawler Programming
1. In the first example, we will use a simple crawler to fetch someone else's webpage.
# Python3.4 crawler tutorial
# A simple example crawler
# Lin bingwen Evankaka (blog: http://blog.csdn.net/evankaka/)
import urllib.request

url = "http://www.douban.com/"
webPage = urllib.request.urlopen(url)
data = webPage.read()
data = data.decode('utf-8')
print(data)
print(type(webPage))
print(webPage.geturl())
print(webPage.info())
print(webPage.getcode())
Here is the output of the crawled webpage:
What happened in the middle? Let's open Fiddler to see it:
The red icon on the left indicates that the access was successful: HTTP 200.
At the top right is the request header generated by Python:
A very simple header. Next, let's look at the HTML response.
The response here is the webpage we printed in the Python IDLE shell above!
2. Crawling a webpage while disguised as a browser
Some webpages, such as login pages, will not respond if the request does not come from a browser. In that case we need to write the request header ourselves and send it to the webpage's server, so that the server thinks we are a normal browser. Then we can crawl it!
# Python3.4 crawler tutorial
# A simple example crawler
# Lin bingwen Evankaka (blog: http://blog.csdn.net/evankaka/)
import urllib.request

weburl = "http://www.douban.com/"
webheader = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
req = urllib.request.Request(url=weburl, headers=webheader)
webPage = urllib.request.urlopen(req)
data = webPage.read()
data = data.decode('utf-8')
print(data)
print(type(webPage))
print(webPage.geturl())
print(webPage.info())
print(webPage.getcode())
Let's take a look at the request header in Fiddler; it is exactly what we set.
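If you do not have Fiddler at hand, a small added sketch (not from the original article) shows how to inspect the headers attached to a Request object directly from Python:

import urllib.request

req = urllib.request.Request(
    url="http://www.douban.com/",
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'})
print(req.header_items())            # all header fields currently set on the request
print(req.get_header('User-agent'))  # note: urllib stores header names with only the first letter capitalized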
The returned result is the same:
A more complex request header:
# Python3.4 crawler tutorial
# A simple example crawler
# Lin bingwen Evankaka (blog: http://blog.csdn.net/evankaka/)
import urllib.request

weburl = "http://www.douban.com/"
webheader1 = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
webheader2 = {
    'Connection': 'Keep-Alive',
    'Accept': 'text/html, application/xhtml+xml, */*',
    'Accept-Language': 'en-US,en;q=0.8,zh-Hans-CN;q=0.5,zh-Hans;q=0.3',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko',
    # 'Accept-Encoding': 'gzip, deflate',
    'Host': 'www.douban.com',
    'DNT': '1'
}
req = urllib.request.Request(url=weburl, headers=webheader2)
webPage = urllib.request.urlopen(req)
data = webPage.read()
data = data.decode('utf-8')
print(data)
print(type(webPage))
print(webPage.geturl())
print(webPage.info())
print(webPage.getcode())
Check the generated results:
The returned result is still the same:
3. Crawling images on the website
Now that we can crawl webpages, we can automatically download the various kinds of data on a page in batches. For example, here I want to download all the images on this page.
# Python3.4 crawler tutorial
# Crawling pictures on the website
# Lin bingwen Evankaka (blog: http://blog.csdn.net/evankaka/)
import urllib.request
import socket
import re
import sys
import os

targetDir = r"D:\PythonWorkPlace\load"  # local folder where the images are stored

def destFile(path):
    # create the target folder if necessary and build the local file name from the URL
    if not os.path.isdir(targetDir):
        os.mkdir(targetDir)
    pos = path.rindex('/')
    t = os.path.join(targetDir, path[pos + 1:])
    return t

if __name__ == "__main__":
    weburl = "http://www.douban.com/"
    webheaders = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
    req = urllib.request.Request(url=weburl, headers=webheaders)  # construct the request with the header
    webpage = urllib.request.urlopen(req)  # send the request
    contentBytes = webpage.read()
    for link, t in set(re.findall(r'(http:[^\s]*?(jpg|png|gif))', str(contentBytes))):  # use a regular expression to find all image links
        print(link)
        try:
            urllib.request.urlretrieve(link, destFile(link))  # download the image
        except:
            print('failed')  # an exception was thrown
This is a running process:
Open the corresponding folder on the computer and look at the pictures. This is only part of them!
The images on the real webpage:
4. Saving the crawled data
def saveFile(data):
    save_path = 'D:\\temp.out'
    f_obj = open(save_path, 'wb')  # 'wb' means open for binary writing
    f_obj.write(data)
    f_obj.close()

# The crawler code is omitted here
# ...
# Suppose the crawled data has been put into the variable dat
saveFile(dat)  # save the dat variable to disk D
For example:
# Python3.4 crawler tutorial
# A simple example crawler
# Lin bingwen Evankaka (blog: http://blog.csdn.net/evankaka/)
import urllib.request

def saveFile(data):
    save_path = 'D:\\temp.out'
    f_obj = open(save_path, 'wb')  # 'wb' means open for binary writing
    f_obj.write(data)
    f_obj.close()

weburl = "http://www.douban.com/"
webheader1 = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
webheader2 = {
    'Connection': 'Keep-Alive',
    'Accept': 'text/html, application/xhtml+xml, */*',
    'Accept-Language': 'en-US,en;q=0.8,zh-Hans-CN;q=0.5,zh-Hans;q=0.3',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko',
    # 'Accept-Encoding': 'gzip, deflate',
    'Host': 'www.douban.com',
    'DNT': '1'
}
req = urllib.request.Request(url=weburl, headers=webheader2)
webPage = urllib.request.urlopen(req)
data = webPage.read()
saveFile(data)  # save the raw bytes to disk D
data = data.decode('utf-8')
print(data)
print(type(webPage))
print(webPage.geturl())
print(webPage.info())
print(webPage.getcode())
Then look at disk D:
Open with NotePad:
Hmm. Yes. The webpage has been crawled.
III. Automatic Login with Python 3.x
Under normal circumstances, we enter the email address and password and then log in. Let's take a look at the content of the submitted form.
The Python 3.4 code:
import gzip
import re
import http.cookiejar
import urllib.request
import urllib.parse

# decompression function
def ungzip(data):
    try:  # try to decompress
        print('extracting.....')
        data = gzip.decompress(data)
        print('decompressed!')
    except:
        print('not compressed, no need to decompress')
    return data

# get the _xsrf token from the page
def getXSRF(data):
    cer = re.compile('name=\"_xsrf\" value=\"(.*)\"', flags=0)
    strlist = cer.findall(data)
    return strlist[0]

# build the opener; the cookie processor stores the cookies sent by the server
# and sends the local cookies back with later requests
def getOpener(head):
    cj = http.cookiejar.CookieJar()
    pro = urllib.request.HTTPCookieProcessor(cj)
    opener = urllib.request.build_opener(pro)
    header = []
    for key, value in head.items():
        elem = (key, value)
        header.append(elem)
    opener.addheaders = header
    return opener

# construct the header; it usually needs at least these fields,
# which are obtained from packet analysis
header = {
    'Connection': 'Keep-Alive',
    'Accept': 'text/html, application/xhtml+xml, */*',
    'Accept-Language': 'en-US,en;q=0.8,zh-Hans-CN;q=0.5,zh-Hans;q=0.3',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Accept-Encoding': 'gzip, deflate',
    'Host': 'www.zhihu.com',
    'DNT': '1'
}

url = 'http://www.zhihu.com/'
opener = getOpener(header)
op = opener.open(url)
data = op.read()
data = ungzip(data)  # decompress
_xsrf = getXSRF(data.decode())

# the page that receives and processes the POST data (we send the constructed POST data to this page)
url += 'login/email'
id = 'ling20081005@126.com'
password = 'christmas258@'
# construct the POST data; the fields are also obtained from packet analysis
postDict = {
    '_xsrf': _xsrf,  # special field; different websites may differ
    'email': id,
    'password': password,
    'rememberme': 'y'
}
postData = urllib.parse.urlencode(postDict).encode()  # encode the POST data
op = opener.open(url, postData)
data = op.read()
data = ungzip(data)
print(data.decode())
Let's take a look at the results:
At this point:
The request header sent:
The response header received:
What does the returned data mean?
It's easy: we just transcode it:
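(An added illustration, not part of the original article.) If the returned data is, as is typical for this kind of login interface, a JSON string whose Chinese characters appear as \uXXXX escapes, one way to turn it into readable text is with the json module; the raw string below is only a hypothetical example of such a response:

import json

# hypothetical example of what the server might return
raw = '{"r": 0, "msg": ["\\u767b\\u5f55\\u6210\\u529f"]}'
obj = json.loads(raw)  # json.loads resolves the \uXXXX escapes
print(obj["msg"][0])   # prints the readable Chinese text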
Copyright notice: This article is an original work by the blogger Lin Bingwen (Evankaka) and may not be reproduced without the blogger's permission.