Python Web Crawler II: Using urllib2 to Fetch Web Content

Source: Internet
Author: User
Tags: urlencode, python, web crawler

In Python, the urllib2 module is imported to crawl web pages. (In Python 3.x it was renamed to urllib.request.)

The crawling process is similar to using a program to simulate what a browser such as IE does: send the URL as the content of an HTTP request to the server, then read the resource the server sends back in response.

Implementation process:
import urllib2

response = urllib2.urlopen('http://gs.ccnu.edu.cn/')
html = response.read()
print html

Printing the returned HTML gives the same source you see when you right-click a page and view its source. The browser takes this source and renders the actual page content from it.
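As noted above, urllib2 became urllib.request in Python 3.x, so the same fetch can be sketched there as well. A minimal Python 3 sketch (a data: URL is used here only so the example runs without network access; for a real fetch you would pass the http:// URL):

```python
import urllib.request

# In Python 3, urllib2.urlopen lives at urllib.request.urlopen.
# The data: URL below is a stand-in so this runs offline; replace it
# with e.g. 'http://gs.ccnu.edu.cn/' to fetch a real page.
response = urllib.request.urlopen('data:text/plain;charset=utf-8,hello')
html = response.read()            # read() returns bytes in Python 3
print(html.decode('utf-8'))       # decode before printing
```

Note that in Python 3 read() returns bytes rather than a str, so the content must be decoded before it is printed or parsed as text.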

In addition to "http:", the URL scheme can also be "ftp:", "file:", and so on.

HTTP is based on the request and response mechanism:

The client sends a request, and the server returns a response.

Similarly, with urllib2 you can construct a Request object, pass it as a parameter to urlopen, and then read the returned content.

import urllib2

req = urllib2.Request('http://gs.ccnu.edu.cn/')
response2 = urllib2.urlopen(req)
page = response2.read()
print page

To build an FTP request instead:

req = urllib2.Request("ftp://example.com/")

There are two things you can do when you make an HTTP request.

1. Sending form data

Sometimes when we crawl a web page, we need to submit a form to simulate a login or registration operation.

Normally this is done with an HTTP POST; when making the request, the submitted form data must first be encoded into standard form with urllib's urlencode.

import urllib
import urllib2

url = 'http://www.someserver.com/register.cgi'

values = {"INPUT1": "Seekhit",
          "Input2": "123456",
          "__EVENTTARGET": "btnLogin",
          "__EVENTARGUMENT": ""}

data = urllib.urlencode(values)   # encode the form data
req = urllib2.Request(url, data)  # request that carries the form data
response = urllib2.urlopen(req)   # receive the server's response
the_page = response.read()        # read the response content
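The encoding step itself can be checked in isolation, without sending anything. A Python 3 sketch (urllib.urlencode moved to urllib.parse.urlencode; the form fields are the placeholder values from the example above):

```python
import urllib.parse

values = {"INPUT1": "Seekhit",
          "Input2": "123456",
          "__EVENTTARGET": "btnLogin",
          "__EVENTARGUMENT": ""}

# urlencode turns the dict into an
# application/x-www-form-urlencoded string, e.g. "INPUT1=Seekhit&..."
data = urllib.parse.urlencode(values)
print(data)
```

In Python 3 the POST body must be bytes, so before passing it to a Request you would additionally call data.encode('ascii').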
2. Setting headers on an HTTP request

Sometimes when an HTTP connection is established, the server returns different content to the client based on the User-Agent header the browser sends, so the same URL can produce different displayed results. (For example, the UC browser on Android identifies the device and can serve a mobile, desktop, or iPad version.)

Python lets you customize the User-Agent header that is sent: create the Request with a dictionary containing the User-Agent header as a parameter.

The following code disguises the User-Agent as an IE browser when making the request.

1. The application version "Mozilla/4.0" means: a Maxthon 2.0 browser using the IE8 engine;
2. The version identifier "MSIE 8.0";
3. The platform identification "Windows NT" means the operating system is Windows.

url = 'http://www.someserver.com/register.cgi'
user_agent = 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT)'
headers = {'User-Agent': user_agent}
values = {"INPUT1": "Seekhit",
          "Input2": "123456",
          "__EVENTTARGET": "btnLogin",
          "__EVENTARGUMENT": ""}

data = urllib.urlencode(values)            # encode the form data
req = urllib2.Request(url, data, headers)  # request with form data and custom headers
response = urllib2.urlopen(req)            # receive the server's response
the_page = response.read()                 # read the response content
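The header-setting step can be verified without any network traffic, because a Request object records the headers it will send. A Python 3 sketch (urllib2.Request became urllib.request.Request; the URL and User-Agent string are the placeholders from the example above):

```python
import urllib.request

url = 'http://www.someserver.com/register.cgi'
user_agent = 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT)'

# Build the request with a custom User-Agent header; nothing is sent yet.
req = urllib.request.Request(url, headers={'User-Agent': user_agent})

# Request normalizes header names by capitalizing them ('User-agent').
print(req.get_header('User-agent'))
```

Only when the Request is passed to urlopen does the header actually go out on the wire.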
