In Python, the urllib2 module is imported to crawl web pages. (In Python 3.x it was renamed urllib.request.)
Crawling follows a process much like what a browser does: the program sends the URL to the server as the content of an HTTP request, then reads the resources the server sends back in response.
Implementation process:
import urllib2

response = urllib2.urlopen('http://gs.ccnu.edu.cn/')
html = response.read()
print html
Printing the returned HTML gives the same source you see when you right-click the page and view its source. The browser renders the actual page content from this source.
In addition to "http:", the URL scheme can also be "ftp:", "file:", and so on.
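As a quick illustration (a Python 3 sketch, where the module is urllib.request rather than urllib2), urlopen can fetch a file: URL just as it fetches an http: one; here we write a small local file and read it back through its file:// URL:

```python
import pathlib
import tempfile
import urllib.request

# Write a small local file, then fetch it back through a file:// URL.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write('hello from a local file')
    path = pathlib.Path(f.name)

url = path.as_uri()                       # e.g. file:///tmp/tmpabc123.txt
response = urllib.request.urlopen(url)    # same call works for http:, ftp:, file:
content = response.read().decode('utf-8')
print(content)                            # prints: hello from a local file
```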
HTTP is based on a request/response mechanism: the client sends a request, and the server returns a response.
Similarly, urllib2 can simulate a request by constructing a Request object and passing it as a parameter to urlopen; the returned content is then read from the response.
import urllib2

req = urllib2.Request('http://gs.ccnu.edu.cn/')
response2 = urllib2.urlopen(req)
page = response2.read()
print page
To impersonate an FTP request:
req = urllib2.Request("ftp://example.com/")
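To make the Request object concrete without going over the network, here is a Python 3 sketch (urllib.request.Request instead of urllib2.Request) that builds a request and inspects it, using the URL from the examples above:

```python
import urllib.request

# Construct the request object without sending it yet
req = urllib.request.Request('http://gs.ccnu.edu.cn/')
print(req.full_url)      # the URL the request targets
print(req.get_method())  # 'GET' -- no form data attached, so this is a GET request
```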
There are two things you can do when you make an HTTP request.
1. Sending data forms
Sometimes when crawling a page we need to submit a form to simulate a login or registration operation.
Normally such forms are submitted via HTTP POST; when making the request, the form data must first be encoded into standard form with urllib's urlencode.
import urllib
import urllib2

url = 'http://www.someserver.com/register.cgi'

values = {"input1": "seekhit",
          "input2": "123456",
          "__EVENTTARGET": "btnLogin",
          "__EVENTARGUMENT": ""}

data = urllib.urlencode(values)   # encode the form data
req = urllib2.Request(url, data)  # send the request together with the form data
response = urllib2.urlopen(req)   # receive the response
the_page = response.read()        # read the content of the response
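In Python 3 the same flow uses urllib.parse.urlencode and urllib.request.Request. A minimal sketch, stopping before urlopen so no network access is needed (the URL is the same placeholder as above):

```python
import urllib.parse
import urllib.request

url = 'http://www.someserver.com/register.cgi'
values = {"input1": "seekhit", "input2": "123456"}

# urlencode produces application/x-www-form-urlencoded text;
# Request expects bytes in Python 3, hence the .encode()
data = urllib.parse.urlencode(values).encode('ascii')
req = urllib.request.Request(url, data)  # attaching data turns the request into a POST
print(req.get_method())                  # prints: POST
```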
2. Set header to HTTP request
Sometimes when an HTTP connection is established, the server returns different content depending on the User-Agent header the browser sends, so different clients see different results. (For example, the UC browser on Android identifies the device and serves a mobile, desktop, or iPad version accordingly.)
Python lets you customize the User-Agent header that is sent: create a Request and pass a dictionary containing the custom User-Agent header as a parameter.
The following code disguises the User-Agent as an IE browser when making the request.
1. The application version "Mozilla/4.0" indicates a Maxthon 2.0 browser running on the IE8 kernel;
2. The version identifier "MSIE 8.0";
3. The platform's identifying information "Windows NT" means the operating system is Windows.
import urllib
import urllib2

url = 'http://www.someserver.com/register.cgi'
user_agent = 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT)'
headers = {'User-Agent': user_agent}
values = {"input1": "seekhit",
          "input2": "123456",
          "__EVENTTARGET": "btnLogin",
          "__EVENTARGUMENT": ""}

data = urllib.urlencode(values)            # encode the form data
req = urllib2.Request(url, data, headers)  # send the request with the form data and a simulated User-Agent
response = urllib2.urlopen(req)            # receive the response
the_page = response.read()                 # read the content of the response
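The Python 3 equivalent of the header trick, again stopping before the actual urlopen call (a sketch with the same placeholder URL):

```python
import urllib.request

url = 'http://www.someserver.com/register.cgi'
user_agent = 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT)'

req = urllib.request.Request(url, headers={'User-Agent': user_agent})
# Request normalizes stored header names to 'Xxxx-yyyy' capitalization,
# so the header is retrieved as 'User-agent'
print(req.get_header('User-agent'))  # prints the disguised User-Agent string
```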
Python Web Crawler (II): Using urllib2 to Capture Web Content