1.4. Basic use of the urllib2 module

Source: Internet
Author: User

Next, let's really get started on our web crawling journey!

Basic use of the urllib2 library

Crawling a web page means reading the network resource identified by a URL from the network stream and saving it locally. Python has many libraries that can be used to crawl web pages; we will learn urllib2 first.

urllib2 is a module that ships with Python 2.7 (nothing to download; just import it).

urllib2 official documentation: https://docs.python.org/2/library/urllib2.html

urllib2 source: https://hg.python.org/cpython/file/2.7/Lib/urllib2.py

In Python 3.x, urllib2 was renamed to urllib.request.
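Because of this rename, code that has to run on both versions often uses a guarded import. The following is a minimal sketch of that pattern, not part of the original tutorial; the helper name `build_request` is hypothetical, and the request is only built, never sent.

```python
# Guarded import: use urllib2 on Python 2.7, urllib.request on Python 3.x.
try:
    import urllib2  # Python 2.7
except ImportError:
    import urllib.request as urllib2  # Python 3.x: Request and urlopen live here


def build_request(url):
    """Build (but do not send) a request object for the given URL."""
    return urllib2.Request(url)


request = build_request("http://www.baidu.com")
print(request.get_full_url())
```

With the alias in place, the names used in this section (`urllib2.Request`, `urllib2.urlopen`) resolve on either version. The Python 2 `print` statements in the examples below would still need to be changed for Python 3.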

urlopen

Let's start with a piece of code:

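    # urllib2_urlopen.py

    # Import the urllib2 library
    import urllib2

    # Send a request to the given URL; the server's response is returned as a file-like object
    response = urllib2.urlopen("http://www.baidu.com")

    # The file-like object supports file methods, e.g. read() reads the whole content and returns a string
    html = response.read()

    # Print the string
    print html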

Run the Python code and it prints the result:

    $ python urllib2_urlopen.py

In fact, if we open the Baidu home page in a browser and right-click "View Source", we will find that it is exactly the same as what we just printed. In other words, the 4 lines of code above have fetched the entire source of Baidu's home page for us.

The Python code for a basic URL request really is that simple.

Request

In our first example, the argument to urlopen() is a URL.

However, to perform more complex operations, such as adding HTTP headers, you must create a Request instance and pass it to urlopen(); the URL you want to access is in turn passed as an argument to the Request instance.

We edit urllib2_request.py:

    # urllib2_request.py

    import urllib2

    # The url is passed to Request(), which builds and returns a Request object
    request = urllib2.Request("http://www.baidu.com")

    # The Request object is passed to urlopen(), which sends it to the server and receives the response
    response = urllib2.urlopen(request)

    html = response.read()
    print html

The result is exactly the same.

When creating a Request instance, you can set two more parameters in addition to the URL:

  1. data (default None): the data submitted along with the URL (such as the data to POST). When it is given, the HTTP request changes from "GET" to "POST".

  2. headers (default empty): a dictionary containing the key-value pairs of the HTTP headers to send.

Both parameters come up again below.
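The effect of these two parameters can be checked on the Request object itself, without sending anything. This is a minimal sketch under stated assumptions: the payload `key=value` and the User-Agent value `ExampleBot/0.1` are made up for illustration, and the guarded import is there only so the sketch also runs on Python 3.

```python
try:
    import urllib2  # Python 2.7
except ImportError:
    import urllib.request as urllib2  # Python 3.x

url = "http://www.baidu.com"

# No data: the request uses GET.
get_request = urllib2.Request(url)

# With data: the request switches to POST. The payload is a made-up
# example; a bytes literal is used so the sketch also runs on Python 3.
post_request = urllib2.Request(url, data=b"key=value")

# headers is a plain dict of header names and values.
# "ExampleBot/0.1" is a hypothetical value.
header_request = urllib2.Request(url, headers={"User-Agent": "ExampleBot/0.1"})

print(get_request.get_method())
print(post_request.get_method())
print(header_request.header_items())
```

None of the three requests is sent; `get_method()` and `header_items()` only report what would go out if the object were passed to urlopen().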

User-Agent

However, sending requests to a website directly with urllib2 like this is a little abrupt. It is as if every house has a door and you, a passer-by, simply walk in: obviously not very polite. Some sites also do not like being accessed by programs (non-human access) and may refuse your requests.

If we make our requests under a legitimate identity, the site is far more likely to accept them, so we should add an identity to our code: the User-Agent header.

  • The browser is the universally accepted identity on the Internet. If we want our crawler to look more like a real user, the first step is to present it as a recognized browser. Different browsers send different User-Agent headers with their requests. The default User-Agent header of urllib2 is Python-urllib/x.y (x and y are the Python major and minor version numbers, e.g. Python-urllib/2.7).
    # urllib2_useragent.py

    import urllib2

    url = "http://www.itcast.cn"

    # User-Agent of IE 9.0, stored in ua_header
    ua_header = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;"}

    # Build the Request from the url together with the headers; this request carries the IE 9.0 browser's User-Agent
    request = urllib2.Request(url, headers=ua_header)

    # Send the request to the server
    response = urllib2.urlopen(request)

    html = response.read()
    print html

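The default User-Agent mentioned above can be inspected without contacting any server. This is a small sketch, assuming the default opener's `addheaders` list is where the library keeps that header; the guarded import is only there so the sketch also runs on Python 3.

```python
try:
    import urllib2  # Python 2.7
except ImportError:
    import urllib.request as urllib2  # Python 3.x

# build_opener() creates an opener with the default handlers;
# nothing is sent over the network here.
opener = urllib2.build_opener()

# addheaders is a list of (name, value) pairs sent with every request
# made through this opener.
default_headers = dict(opener.addheaders)
print(default_headers["User-agent"])
```

The printed value has the Python-urllib/x.y form, with x.y matching the interpreter that runs the sketch.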
Add more header information

A complete HTTP request message is built by adding specific headers to the HTTP request.

You can add or modify a specific header by calling Request.add_header(), and view an existing header by calling Request.get_header().

    • Add a specific header
    # urllib2_headers.py

    import urllib2

    url = "http://www.itcast.cn"

    # User-Agent of IE 9.0
    header = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;"}
    request = urllib2.Request(url, headers=header)

    # A specific header can also be added/modified by calling Request.add_header()
    request.add_header("Connection", "keep-alive")

    # Header information can also be viewed by calling Request.get_header()
    # request.get_header(header_name="Connection")

    response = urllib2.urlopen(request)
    print response.code  # the response status code

    html = response.read()
    print html

    • Randomly add/modify the User-Agent
    # urllib2_add_headers.py

    import urllib2
    import random

    url = "http://www.itcast.cn"

    ua_list = [
        "Mozilla/5.0 (Windows NT 6.1;) Apple .... ",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) ... ",
        "Mozilla/5.0 (Macintosh; U PPC Mac OS X .... ",
        "Mozilla/5.0 (Macintosh; Intel Mac OS ... ",
    ]

    user_agent = random.choice(ua_list)

    request = urllib2.Request(url)

    # A specific header can also be added/modified by calling Request.add_header()
    request.add_header("User-Agent", user_agent)

    # get_header() expects the first letter capitalized and all the rest lowercase
    request.get_header("User-agent")

    response = urllib2.urlopen(request)

    html = response.read()
    print html
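The "first letter capitalized, the rest lowercase" rule can be seen directly on a Request object, because add_header() stores the header name in that form. The sketch below is illustrative only: the User-Agent value `ExampleBot/0.1` and the fallback string are made up, and no request is sent.

```python
try:
    import urllib2  # Python 2.7
except ImportError:
    import urllib.request as urllib2  # Python 3.x

request = urllib2.Request("http://www.itcast.cn")

# The name is given as "User-Agent" but stored as "User-agent".
request.add_header("User-Agent", "ExampleBot/0.1")  # hypothetical value

# Lookups are case-sensitive and must use the stored spelling.
print(request.has_header("User-agent"))
print(request.has_header("User-Agent"))
print(request.get_header("User-agent"))

# The second argument is the default returned when the name is not found.
print(request.get_header("User-Agent", "not found"))
```

Looking the header up as "User-Agent" finds nothing, which is why the example above calls get_header("User-agent") even though the header was added as "User-Agent".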
