1. Basic methods
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
- url: the URL to be opened
- data: the data submitted with a POST request
- timeout: the timeout for accessing the site (see the timeout sketch below)
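A minimal sketch of the timeout parameter, assuming an arbitrary 5-second limit (the URL is only an example): a request that cannot complete in time raises an exception.

import socket
from urllib import error, request

try:
    response = request.urlopen('http://python.org/', timeout=5)   # 5 seconds is an illustrative value
except (error.URLError, socket.timeout):
    print('request timed out or failed')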
urlopen() in the urllib.request module fetches the page directly. The page data is of type bytes and has to be converted to str with decode().
from urllib import request
response = request.urlopen(r'http://python.org/')
page = response.read().decode('utf-8')
urlopen() returns an object that provides the following methods (a short usage sketch follows the list):
- read(), readline(), readlines(), fileno(), close(): operate on the HTTPResponse data
- info(): returns an HTTPMessage object representing the header information returned by the remote server
- getcode(): returns the HTTP status code; for an HTTP request, 200 means the request completed successfully and 404 means the URL was not found
- geturl(): returns the URL that was requested
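A minimal usage sketch of these accessors, reusing the example URL from above:

from urllib import request

response = request.urlopen(r'http://python.org/')
print(response.getcode())   # e.g. 200 on success
print(response.geturl())    # the final URL, after any redirects
print(response.info())      # the response headers (an HTTPMessage)
page = response.read().decode('utf-8')
response.close()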
2. Using Request
urllib.request.Request(url, data=None, headers={}, method=None)
Use Request() to wrap the request, then fetch the page with urlopen().
url = r'http://www.lagou.com/zhaopin/Python/?labelWords=label'
headers = {
    'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  r'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
    'Referer': r'http://www.lagou.com/zhaopin/Python/?labelWords=label',
    'Connection': 'keep-alive'
}
req = request.Request(url, headers=headers)
page = request.urlopen(req).read()
page = page.decode('utf-8')
Header fields used to wrap the request (an add_header() sketch follows this list):
- User-Agent: carries information such as the browser name and version, the operating system name and version, and the default language
- Referer: can be used to prevent hotlinking; some sites that serve images from http://***.com check the Referer to verify where the request came from
- Connection: indicates the state of the connection and records the session state.
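Headers can also be attached one at a time with Request.add_header() instead of passing a dict; a minimal sketch reusing the illustrative values from above:

from urllib import request

url = r'http://www.lagou.com/zhaopin/Python/?labelWords=label'
req = request.Request(url)
req.add_header('User-Agent',
               'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
               '(KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36')
req.add_header('Referer', url)
req.add_header('Connection', 'keep-alive')
page = request.urlopen(req).read().decode('utf-8')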
3. POSTing data
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
The data parameter of urlopen() defaults to None; when data is not None, urlopen() submits the request as a POST.
from urllib import request, parse
url = r'http://www.lagou.com/jobs/positionAjax.json?'
headers = {
    'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  r'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
    'Referer': r'http://www.lagou.com/zhaopin/Python/?labelWords=label',
    'Connection': 'keep-alive'
}
data = {
    'first': 'true',
    'pn': 1,
    'kd': 'Python'
}
data = parse.urlencode(data).encode('utf-8')
req = request.Request(url, headers=headers, data=data)
page = request.urlopen(req).read()
page = page.decode('utf-8')
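Since this endpoint returns JSON, a natural follow-up (not part of the original code, shown only as a sketch) is to parse the decoded text with the standard json module:

import json

result = json.loads(page)   # page is the decoded str from the request above
print(result.keys())        # inspect the structure of the returned JSON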
urllib.parse.urlencode(query, doseq=False, safe='', encoding=None, errors=None)
The main job of urlencode() is to convert the data to be submitted into a URL-encoded query string that can be attached to the URL.
data = {
    'first': 'true',
    'pn': 1,
    'kd': 'Python'
}
data = parse.urlencode(data).encode('utf-8')
The data converted by urlencode() is first=true&pn=1&kd=Python, so the final URL submitted is
http://www.lagou.com/jobs/positionAjax.json?first=true&pn=1&kd=Python
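For comparison, the same encoded string can simply be appended to the URL to issue a GET request; a sketch, not from the original, and whether this particular server accepts a plain GET here is an assumption:

from urllib import parse, request

query = parse.urlencode({'first': 'true', 'pn': 1, 'kd': 'Python'})
url = 'http://www.lagou.com/jobs/positionAjax.json?' + query
page = request.urlopen(url).read().decode('utf-8')   # no data argument, so this is a GET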
The POST data must be bytes or an iterable of bytes, not str, so encode() is required. Of course, the data can also be passed directly as the data parameter of urlopen() instead of being wrapped in Request():
page = request.urlopen(req, data=data).read()
4. Exception handling
from urllib import request, parse, error

def get_page(url):
    headers = {
        'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      r'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
        'Referer': r'http://www.lagou.com/zhaopin/Python/?labelWords=label',
        'Connection': 'keep-alive'
    }
    data = {
        'first': 'true',
        'pn': 1,
        'kd': 'Python'
    }
    data = parse.urlencode(data).encode('utf-8')
    req = request.Request(url, headers=headers)
    try:
        page = request.urlopen(req, data=data).read()
        page = page.decode('utf-8')
    except error.HTTPError as e:
        print(e.code)
        print(e.read().decode('utf-8'))
        page = None   # no page on error
    return page
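Besides error.HTTPError, urlopen() can also raise error.URLError (for example when the host is unreachable). A minimal sketch of catching both, with HTTPError first because it is a subclass of URLError:

from urllib import error, request

try:
    page = request.urlopen('http://www.lagou.com/jobs/positionAjax.json').read().decode('utf-8')
except error.HTTPError as e:
    print(e.code)                    # HTTP status code, e.g. 404
    print(e.read().decode('utf-8'))  # body of the error response
except error.URLError as e:
    print(e.reason)                  # why the connection failed (DNS, refused, ...)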
5. Using a proxy
urllib.request.ProxyHandler(proxies=None)
When the site to be crawled has access restrictions in place, a proxy is needed to fetch the data.
data = {
    'first': 'true',
    'pn': 1,
    'kd': 'Python'
}
proxy = request.ProxyHandler({'http': '5.22.195.215:80'})  # set the proxy
opener = request.build_opener(proxy)                        # build an opener that uses the proxy
request.install_opener(opener)                              # install the opener
data = parse.urlencode(data).encode('utf-8')
page = opener.open(url, data).read()
page = page.decode('utf-8')
return page
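Once install_opener() has been called, the plain request.urlopen() also goes through the proxy; a minimal sketch (the proxy address is the illustrative one from above and may not be reachable):

from urllib import request

proxy = request.ProxyHandler({'http': '5.22.195.215:80'})
opener = request.build_opener(proxy)
request.install_opener(opener)   # from now on, urlopen() uses this opener
page = request.urlopen('http://www.lagou.com/').read().decode('utf-8')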