In Python 2, the urllib2 module is imported to fetch web pages; in Python 3.x it was renamed to urllib.request.
The crawling process works much like a program simulating a browser such as IE: it sends the URL to the server as the content of an HTTP request, and then reads the resources the server sends back in response.
Implementation process:
    import urllib2

    response = urllib2.urlopen('http://gs.ccnu.edu.cn/')
    html = response.read()
    print html
Printing the returned HTML gives the same text you see when you right-click a page in the browser and view its source. The browser renders the visible page from this source.
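Since the module was renamed in Python 3.x (as noted above), here is a minimal sketch of the same fetch there; note that read() returns bytes in Python 3, and decoding as UTF-8 assumes that is the page's encoding:

    # Python 3.x equivalent of the fetch above
    import urllib.request

    response = urllib.request.urlopen('http://gs.ccnu.edu.cn/')
    html = response.read().decode('utf-8')  # read() returns bytes in Python 3; assumes the page is UTF-8
    print(html)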
In addition to "http:", URLs can also be replaced with "ftp:", "File:" And so on.
HTTP is based on a request-and-response mechanism: the client issues a request, and the server returns a response.
urllib2 mirrors this mechanism: you build a Request object to represent the request, pass it as a parameter to urlopen(), and then read the returned content.
    import urllib2

    req = urllib2.Request('http://gs.ccnu.edu.cn/')
    response2 = urllib2.urlopen(req)
    page = response2.read()
    print page
To make an FTP request:
    req = urllib2.Request("ftp://example.com/")
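Opening and reading the FTP request works exactly like the HTTP case; a minimal sketch, assuming the server allows anonymous FTP access:

    response = urllib2.urlopen(req)  # urllib2 dispatches ftp: URLs to its FTPHandler
    listing = response.read()        # for a directory URL this returns the directory listing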
There are two more things you can do when making an HTTP request.
1. Sending form data
Sometimes when crawling a page we need to submit a form, to simulate a login or registration operation.
Normally this is done with an HTTP POST, and before the request is sent, the form data must be encoded into standard form with urllib's urlencode().
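As a quick illustration of what urlencode() produces (a list of pairs is passed here, since a dict does not guarantee ordering):

    import urllib

    print urllib.urlencode([('input1', 'seekhit'), ('input2', '123 456')])
    # prints: input1=seekhit&input2=123+456  -- spaces become '+', pairs are joined with '&'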
    import urllib
    import urllib2

    url = 'http://www.someserver.com/register.cgi'

    values = {"input1": "seekhit",
              "input2": "123456",
              "__EVENTTARGET": "btnLogin",
              "__EVENTARGUMENT": ""}

    data = urllib.urlencode(values)   # encode the form data
    req = urllib2.Request(url, data)  # a request that carries the form data
    response = urllib2.urlopen(req)   # send it and receive the server's response
    the_page = response.read()        # read the response content
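In Python 3.x the same POST looks slightly different: urlencode() moved to urllib.parse, and the encoded data must be passed to the request as bytes. A minimal sketch of the conversion:

    # Python 3.x version of the form POST above
    import urllib.parse
    import urllib.request

    url = 'http://www.someserver.com/register.cgi'
    values = {"input1": "seekhit", "input2": "123456"}

    data = urllib.parse.urlencode(values).encode('ascii')  # POST data must be bytes in Python 3
    req = urllib.request.Request(url, data)
    response = urllib.request.urlopen(req)
    the_page = response.read()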
2. Setting headers on the HTTP request

Sometimes when an HTTP connection is made, the server returns different content depending on the User-Agent header the client sends over; this is how different displays are achieved (for example, UC Browser on Android identifies the device and serves a mobile, desktop, or iPad version).
Python supports customizing the User-Agent header that gets sent: create the Request with a dictionary containing the custom User-Agent header as an extra parameter.
The following code disguises the User-Agent as Internet Explorer in order to access the site. In that User-Agent string:
1. the application version "Mozilla/4.0", which here indicates a browser such as Maxthon 2.0 running on the IE8 engine;
2. the browser version token "MSIE 8.0";
3. the platform identification "Windows NT", meaning the operating system is Windows.
    url = 'http://www.someserver.com/register.cgi'
    user_agent = 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT)'
    headers = {'User-Agent': user_agent}
    values = {"input1": "seekhit",
              "input2": "123456",
              "__EVENTTARGET": "btnLogin",
              "__EVENTARGUMENT": ""}

    data = urllib.urlencode(values)            # encode the form data
    req = urllib2.Request(url, data, headers)  # send the form data and the disguised User-Agent together
    response = urllib2.urlopen(req)            # receive the server's response
    the_page = response.read()                 # read the response content
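To check that the disguised request went through as intended, the response object can be inspected with urllib2's standard response methods:

    print response.getcode()  # HTTP status code, e.g. 200
    print response.geturl()   # the final URL, after any redirects
    print response.info()     # the headers the server sent back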