What is a Web page downloader?
First, the web page downloader is the core component of a crawler: it downloads the content of the target page locally for later processing.
Second, there are two common web page downloaders in Python: the built-in urllib2 module and the third-party requests library.
urllib2 supports: 1. downloading directly from a URL; 2. submitting data directly to a web page; 3. cookie handling for pages that require a login; 4. proxy handling for sites that must be accessed through a proxy.
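The third-party requests library mentioned above covers the same features with a simpler interface. A minimal sketch for comparison, assuming requests has been installed (for example via pip install requests):

# -*- coding: utf-8 -*-
import requests

# request the page directly; requests follows redirects automatically
response = requests.get("http://www.baidu.com")

# get the status code; 200 indicates success
print response.status_code

# read the fetched content as text
print response.text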
Third, the three ways urllib2 downloads pages
Method 1: direct download
The corresponding code is as follows:
# -*- coding: utf-8 -*-
# import the urllib2 module
import urllib2

# request the page directly
response = urllib2.urlopen("http://www.baidu.com")

# get the status code; 200 indicates success
print response.getcode()

# read the crawled content
print response.read()
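As a side note, urllib2 only exists under Python 2; in Python 3 its functionality was merged into urllib.request. A minimal sketch of the same direct download for readers on the newer interpreter:

# -*- coding: utf-8 -*-
# Python 3 equivalent: urllib2 was merged into urllib.request
import urllib.request

# request the page directly
response = urllib.request.urlopen("http://www.baidu.com")

# get the status code; 200 indicates success
print(response.getcode())

# read the crawled content (returned as bytes)
print(response.read())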
Method 2: add data and HTTP headers
Data: the values the user needs to submit to the page
HTTP header: used mainly to submit HTTP header information
Pass the three parameters url, data, and header to urllib2's Request class to generate a Request object, then pass that object as the parameter to urllib2's urlopen method to send the web request.
The corresponding code is as follows:
# coding=utf-8
import urllib
import urllib2

# create a Request object
request = urllib2.Request("the URL that you want to crawl")

# add data a=1 (add_data takes a single url-encoded string)
request.add_data(urllib.urlencode({'a': '1'}))

# add an HTTP header
request.add_header('User-Agent', 'Mozilla/5.0')

# send the request and get the result
response = urllib2.urlopen(request)

print response.getcode()

print response.read()
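Note that once data has been attached with add_data, urllib2 sends the request as an HTTP POST; without data it sends a GET.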
Method 3: add handlers for special situations
Some web pages require a login before they can be accessed, so cookie handling must be added; urllib2's HTTPCookieProcessor is used for this.
For access through a proxy: ProxyHandler (a proxy sketch follows the cookie example below)
For web pages that use the HTTPS encryption protocol: HTTPSHandler
For URLs that automatically redirect to one another: HTTPRedirectHandler
Pass these handlers to urllib2's build_opener(handler) method to create an opener object, then pass the opener to install_opener(opener); after that, urllib2 is able to handle all of these scenarios.
The code for cookie-enhanced processing is as follows:
# -*- coding: utf-8 -*-

# import the urllib2 and cookielib modules
import urllib2
import cookielib

# create a cookie container to store cookie data
cj = cookielib.CookieJar()

# create an opener: use urllib2's HTTPCookieProcessor with the CookieJar cj
# as a parameter to generate a handler, then pass this handler to the
# build_opener method to generate an opener object
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# install the opener on urllib2 to enhance its request handling
urllib2.install_opener(opener)

# use the cookie-enabled urllib2 to access and crawl the web page
response = urllib2.urlopen("http://www.baidu.com")
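The other handlers listed above are passed to build_opener in the same way. A minimal sketch of proxy access with ProxyHandler, where the proxy address 127.0.0.1:8080 is only a placeholder:

# -*- coding: utf-8 -*-
import urllib2

# map the http protocol to the proxy address (placeholder, replace with a real proxy)
proxy_handler = urllib2.ProxyHandler({'http': 'http://127.0.0.1:8080'})

# build an opener with the proxy handler and install it on urllib2
opener = urllib2.build_opener(proxy_handler)
urllib2.install_opener(opener)

# subsequent requests are routed through the proxy
response = urllib2.urlopen("http://www.baidu.com")
print response.getcode()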