A Brief Introduction to urllib2
Reference URL: http://www.voidspace.org.uk/python/articles/urllib2.shtml
Fetching URLs
The simplest way to use urllib2 is as follows:
1.
import urllib2
response = urllib2.urlopen('http://python.org/')
html = response.read()
2.
import urllib2
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
the_page = response.read()
3.
req = urllib2.Request('ftp://example.com/')
4.
import urllib
import urllib2
url = 'http://www.someserver.com/cgi-bin/register.cgi'
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}
data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
the_page = response.read()
5.
Data can also be passed in an HTTP GET request by encoding it in the URL itself.
>>> import urllib2
>>> import urllib
>>> data = {}
>>> data['name'] = 'somebody here'
>>> data['location'] = 'Northampton'
>>> data['language'] = 'Python'
>>> url_values = urllib.urlencode(data)
>>> print url_values
name=somebody+here&language=Python&location=Northampton
>>> url = 'http://www.example.com/example.cgi'
>>> full_url = url + '?' + url_values
>>> data = urllib2.urlopen(full_url)
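For readers on Python 3: urllib and urllib2 were merged there, with urlencode moving to urllib.parse and urlopen to urllib.request. The GET-encoding step above can be sketched like this (same placeholder URL as the original; nothing is fetched):

```python
from urllib.parse import urlencode

# Encode a dict of query parameters into a GET query string
# (urllib.urlencode in Python 2, urllib.parse.urlencode in Python 3).
data = {'name': 'somebody here', 'location': 'Northampton', 'language': 'Python'}
url_values = urlencode(data)
full_url = 'http://www.example.com/example.cgi' + '?' + url_values
print(full_url)
# http://www.example.com/example.cgi?name=somebody+here&location=Northampton&language=Python
```

Note that spaces become `+` and the parameters appear in dict insertion order on Python 3.7+.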
6.
import urllib
import urllib2
url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}
headers = {'User-Agent': user_agent}
data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()
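As a cross-check of the POST example above: in Python 3 the same request is built with urllib.request.Request, and the encoded form data must be bytes. A minimal sketch, using the placeholder server URL from the original (no request is actually sent):

```python
import urllib.request
from urllib.parse import urlencode

values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}
data = urlencode(values).encode('ascii')   # POST bodies must be bytes in Python 3
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
req = urllib.request.Request('http://www.someserver.com/cgi-bin/register.cgi',
                             data, headers)
print(req.get_method())   # a Request carrying a data payload becomes a POST
# POST
```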
7. Handling Exceptions
1) URLError
>>> req = urllib2.Request('http://www.jianshu.com/p/5c7a1af4aa531')
>>> try:
...     urllib2.urlopen(req)
... except urllib2.URLError, e:
...     print e.reason  # the reason for the failure; for an HTTPError, e.code also gives the HTTP status code
2) HTTPError
HTTPError is the subclass of URLError raised in the specific case of HTTP URLs.
Every HTTP response from the server contains a numeric "status code". Sometimes the status code indicates that the server is unable to fulfil the request.
3) Error Codes
responses = {
    100: ('Continue', 'Request received, please continue'),
    101: ('Switching Protocols',
          'Switching to new protocol; obey Upgrade header'),
    200: ('OK', 'Request fulfilled, document follows'),
    201: ('Created', 'Document created, URL follows'),
    202: ('Accepted',
          'Request accepted, processing continues off-line'),
    203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
    204: ('No Content', 'Request fulfilled, nothing follows'),
    205: ('Reset Content', 'Clear input form for further input.'),
    206: ('Partial Content', 'Partial content follows.'),
    300: ('Multiple Choices',
          'Object has several resources -- see URI list'),
    301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
    302: ('Found', 'Object moved temporarily -- see URI list'),
    303: ('See Other', 'Object moved -- see Method and URL list'),
    304: ('Not Modified',
          'Document has not changed since given time'),
    305: ('Use Proxy',
          'You must use proxy specified in Location to access this '
          'resource.'),
    307: ('Temporary Redirect',
          'Object moved temporarily -- see URI list'),
    400: ('Bad Request',
          'Bad request syntax or unsupported method'),
    401: ('Unauthorized',
          'No permission -- see authorization schemes'),
    402: ('Payment Required',
          'No payment -- see charging schemes'),
    403: ('Forbidden',
          'Request forbidden -- authorization will not help'),
    404: ('Not Found', 'Nothing matches the given URI'),
    405: ('Method Not Allowed',
          'Specified method is invalid for this server.'),
    406: ('Not Acceptable', 'URI not available in preferred format.'),
    407: ('Proxy Authentication Required', 'You must authenticate with '
          'this proxy before proceeding.'),
    408: ('Request Timeout', 'Request timed out; try again later.'),
    409: ('Conflict', 'Request conflict.'),
    410: ('Gone',
          'URI no longer exists and has been permanently removed.'),
    411: ('Length Required', 'Client must specify Content-Length.'),
    412: ('Precondition Failed', 'Precondition in headers is false.'),
    413: ('Request Entity Too Large', 'Entity is too large.'),
    414: ('Request-URI Too Long', 'URI is too long.'),
    415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
    416: ('Requested Range Not Satisfiable',
          'Cannot satisfy request range.'),
    417: ('Expectation Failed',
          'Expect condition could not be satisfied.'),
    500: ('Internal Server Error', 'Server got itself in trouble'),
    501: ('Not Implemented',
          'Server does not support this operation'),
    502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
    503: ('Service Unavailable',
          'The server cannot process the request due to a high load'),
    504: ('Gateway Timeout',
          'The gateway server did not receive a timely response'),
    505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
}
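In Python 3 an equivalent of this table ships in the standard library as http.client.responses, a dict mapping status codes to their reason phrases, handy for labeling an error code yourself:

```python
import http.client

# http.client.responses maps int status codes to reason phrases.
for code in (200, 301, 404, 503):
    print(code, http.client.responses[code])
# 200 OK
# 301 Moved Permanently
# 404 Not Found
# 503 Service Unavailable
```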
4) Examples
Example 1:
from urllib2 import Request, urlopen, URLError, HTTPError
req = Request(someurl)
try:
    response = urlopen(req)
except HTTPError, e:
    print 'The server couldn\'t fulfill the request.'
    print 'Error code: ', e.code
except URLError, e:
    print 'We failed to reach a server.'
    print 'Reason: ', e.reason
else:
    # everything is fine
Note: HTTPError is a subclass of URLError, so the except HTTPError clause must be written first.
Example 2:
from urllib2 import Request, urlopen, URLError
req = Request(someurl)
try:
    response = urlopen(req)
except URLError, e:
    if hasattr(e, 'reason'):
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
    elif hasattr(e, 'code'):
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
else:
    # everything is fine
Example 3:
from urllib2 import Request, urlopen
req = Request(someurl)
try:
    response = urlopen(req)
except IOError, e:
    if hasattr(e, 'reason'):
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
    elif hasattr(e, 'code'):
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
else:
    # everything is fine
Note: URLError is a subclass of IOError; in rare cases a socket.error may be raised instead.
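The attribute checks used in Examples 2 and 3 can be exercised without any network access. A sketch in Python 3, where both exception classes live in urllib.error (the HTTPError here is constructed by hand purely for illustration):

```python
from urllib.error import URLError, HTTPError

def describe(e):
    # HTTPError carries a .code; a plain URLError only has a .reason.
    if hasattr(e, 'code'):
        return 'Error code: %d' % e.code
    elif hasattr(e, 'reason'):
        return 'Reason: %s' % e.reason

print(describe(HTTPError('http://example.com', 404, 'Not Found', None, None)))
# Error code: 404
print(describe(URLError('connection refused')))
# Reason: connection refused
```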
8. info and geturl
geturl
This returns the real URL of the page fetched. It is useful because urlopen (or the opener object used) may have followed a redirect, so the URL of the page obtained may not be the same as the URL requested.
info
This returns a dictionary-like object that describes the page fetched, particularly the headers sent by the server. It is currently an httplib.HTTPMessage instance.
Example:
from urllib2 import Request, urlopen, URLError, HTTPError
url = 'https://passport.baidu.com/center?_t=1510744860'
req = Request(url)
response = urlopen(req)
print response.info()
print response.geturl()
9. Openers and Handlers
Openers:
When you fetch a URL you use an opener (an instance of urllib2.OpenerDirector). Normally we use the default opener, via urlopen, but you can create custom openers with build_opener. You will want to create an opener if you need to fetch URLs with specific handlers installed, for example to get an opener that handles cookies, or one that does not handle redirections.
The following snippet shows how handlers and an opener are combined to log in through a proxy IP while handling cookies:
self.proxy = urllib2.ProxyHandler({'http': self.proxy_url})
self.cookie = cookielib.LWPCookieJar()
self.cookie_handler = urllib2.HTTPCookieProcessor(self.cookie)
self.opener = urllib2.build_opener(self.cookie_handler, self.proxy, urllib2.HTTPHandler)
Handlers:
Openers use handlers; all the "heavy" work is done by the handlers. Each handler knows how to open URLs for a particular protocol, or how to handle some aspect of opening a URL, such as HTTP redirections or HTTP cookies.
More information on openers and handlers: http://www.voidspace.org.uk/python/articles/urllib2.shtml#openers-and-handlers
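The handler/opener relationship can be seen without touching the network. A minimal Python 3 sketch (build_opener and the handler classes moved to urllib.request; the proxy address is a hypothetical placeholder, and no request is opened):

```python
import urllib.request
from http.cookiejar import CookieJar

cookie_jar = CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie_jar),            # handles cookies
    urllib.request.ProxyHandler({'http': '127.0.0.1:3128'}),   # hypothetical proxy
)
# build_opener returns an OpenerDirector with these handlers installed;
# opener.open(url) would route requests through them.
print(type(opener).__name__)
# OpenerDirector
```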
10. Proxies
Creating an opener with a proxy IP.
Note: currently urllib2 does not support fetching of https locations through a proxy. This can be a problem.
(http://www.voidspace.org.uk/python/articles/urllib2.shtml#proxies)
Example:
import urllib2
proxy_handler = urllib2.ProxyHandler({'http': '54.186.78.110:3128'})  # make sure this proxy IP is usable
opener = urllib2.build_opener(proxy_handler)
request = urllib2.Request(url, post_data, login_headers)  # this example also submits post_data and header information
response = opener.open(request)
print response.read().encode('utf-8')
11. Sockets and Layers
Example:
import socket
import urllib2

# timeout in seconds
timeout = 10
socket.setdefaulttimeout(timeout)

# this call to urllib2.urlopen now uses the default timeout
# we have set in the socket module
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
12. Cookies
urllib2 handles cookies automatically. If you need to get the value of a cookie entry, you can do this:
Example:
import urllib2
import cookielib

cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open('http://www.baidu.com')
for item in cookie:
    print 'Name = ' + item.name
    print 'Value = ' + item.value
After running, it outputs the cookies set when visiting Baidu:
Name = BAIDUID
Value = C664216C4F7BD6B98DB0B300292E0A23:FG=1
Name = BIDUPSID
Value = C664216C4F7BD6B98DB0B300292E0A23
Name = H_PS_PSSID
Value = 1464_21099_17001_24879_22159
Name = PSTM
Value = 1510747061
Name = BDSVRTM
Value = 0
Name = BD_HOME
Value = 0
13. Defeating "anti-hotlinking"
Some sites have so-called anti-hotlinking protection. It is actually very simple:
the server checks the Referer header of the request you send to see whether it points to the site itself.
So we just need to set the Referer in the headers to that site. Taking Baidu as an example:
#...
headers = {
    'Referer': 'http://www.baidu.com/'
}
#...
headers is a dict data structure; you can put any header you want into it to do some disguising.
For example, some websites like to read X-Forwarded-For from the headers to learn the client's real IP, so you can simply change X-Forwarded-For.
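The same disguise can be assembled offline. A Python 3 sketch (the target URL and the forwarded IP are placeholders; no request is sent):

```python
import urllib.request

# Build a request with disguised headers: a spoofed Referer and a
# hypothetical X-Forwarded-For "real" client IP.
headers = {
    'Referer': 'http://www.baidu.com/',
    'X-Forwarded-For': '1.2.3.4',
}
req = urllib.request.Request('http://www.example.com/', headers=headers)
print(req.get_header('Referer'))
# http://www.baidu.com/
```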
That concludes this brief summary of the Python urllib2 library.