In Python, the urllib2 module is imported to crawl web pages. (In Python 3.x it was renamed urllib.request.)
Crawling follows a process much like what a browser does: the program sends the URL to the server as the content of an HTTP request, then reads the resources the server sends back in response.
Implementation process:
import urllib2

response = urllib2.urlopen('http://gs.ccnu.edu.cn/')
html = response.read()
print html
Printing the returned HTML gives the same source you see when you right-click the page and view its source. The browser renders the actual page content from this source.
In addition to "http:", the URL scheme can also be "ftp:", "file:", and so on.
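As a quick illustration (a Python 3 sketch, where the module is urllib.request rather than urllib2), urlopen can fetch a file: URL just as it fetches an http: one; here we write a small local file and read it back through its file:// URL:

```python
import pathlib
import tempfile
import urllib.request

# Write a small local file, then fetch it back through a file:// URL.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write('hello from a local file')
    path = pathlib.Path(f.name)

url = path.as_uri()                       # e.g. file:///tmp/tmpabc123.txt
response = urllib.request.urlopen(url)    # same call works for http:, ftp:, file:
content = response.read().decode('utf-8')
print(content)                            # prints: hello from a local file
```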
HTTP is based on a request/response mechanism: the client sends a request, and the server returns a response.
Similarly, urllib2 can simulate a request by constructing a Request object and passing it as a parameter to urlopen; the returned content is then read from the response.
import urllib2

req = urllib2.Request('http://gs.ccnu.edu.cn/')
response2 = urllib2.urlopen(req)
page = response2.read()
print page
To impersonate an FTP request:
req = urllib2.Request("ftp://example.com/")
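To make the Request object concrete without going over the network, here is a Python 3 sketch (urllib.request.Request instead of urllib2.Request) that builds a request and inspects it, using the URL from the examples above:

```python
import urllib.request

# Construct the request object without sending it yet
req = urllib.request.Request('http://gs.ccnu.edu.cn/')
print(req.full_url)      # the URL the request targets
print(req.get_method())  # 'GET' -- no form data attached, so this is a GET request
```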
There are two things you can do when you make an HTTP request.
1. Sending data forms
Sometimes when crawling a page we need to submit a form to simulate a login or registration operation.
Normally such forms are submitted via HTTP POST; when making the request, the form data must first be encoded into standard form with urllib's urlencode.
import urllib
import urllib2

url = 'http://www.someserver.com/register.cgi'

values = {"input1": "seekhit",
          "input2": "123456",
          "__EVENTTARGET": "btnLogin",
          "__EVENTARGUMENT": ""}

data = urllib.urlencode(values)   # encode the form data
req = urllib2.Request(url, data)  # send the request together with the form data
response = urllib2.urlopen(req)   # receive the response
the_page = response.read()        # read the content of the response
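In Python 3 the same flow uses urllib.parse.urlencode and urllib.request.Request. A minimal sketch, stopping before urlopen so no network access is needed (the URL is the same placeholder as above):

```python
import urllib.parse
import urllib.request

url = 'http://www.someserver.com/register.cgi'
values = {"input1": "seekhit", "input2": "123456"}

# urlencode produces application/x-www-form-urlencoded text;
# Request expects bytes in Python 3, hence the .encode()
data = urllib.parse.urlencode(values).encode('ascii')
req = urllib.request.Request(url, data)  # attaching data turns the request into a POST
print(req.get_method())                  # prints: POST
```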
2. Set header to HTTP request
Sometimes when an HTTP connection is established, the server returns different content depending on the User-Agent header the browser sends, so different clients see different results. (For example, the UC browser on Android identifies the device and serves a mobile, desktop, or iPad version accordingly.)
Python lets you customize the User-Agent header that is sent: create a Request and pass a dictionary containing the custom User-Agent header as a parameter.
The following code disguises the User-Agent as an IE browser when making the request.
1. The application version "Mozilla/4.0" indicates a Maxthon 2.0 browser running on the IE8 kernel;
2. The version identifier "MSIE 8.0";
3. The platform's identifying information "Windows NT" means the operating system is Windows.
import urllib
import urllib2

url = 'http://www.someserver.com/register.cgi'
user_agent = 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT)'
headers = {'User-Agent': user_agent}
values = {"input1": "seekhit",
          "input2": "123456",
          "__EVENTTARGET": "btnLogin",
          "__EVENTARGUMENT": ""}

data = urllib.urlencode(values)            # encode the form data
req = urllib2.Request(url, data, headers)  # send the request with the form data and a simulated User-Agent
response = urllib2.urlopen(req)            # receive the response
the_page = response.read()                 # read the content of the response
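The Python 3 equivalent of the header trick, again stopping before the actual urlopen call (a sketch with the same placeholder URL):

```python
import urllib.request

url = 'http://www.someserver.com/register.cgi'
user_agent = 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT)'

req = urllib.request.Request(url, headers={'User-Agent': user_agent})
# Request normalizes stored header names to 'Xxxx-yyyy' capitalization,
# so the header is retrieved as 'User-agent'
print(req.get_header('User-agent'))  # prints the disguised User-Agent string
```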
Python Web Crawler (II): Using urllib2 to Capture Web Content