Usage of Python 3 urllib
1. Basic Methods
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
- url: the URL to be opened
- data: data submitted with a POST request
- timeout: timeout in seconds for accessing the website
Calling urlopen() from the urllib.request module directly fetches the page. The page data is returned as bytes and must be decoded with decode() to convert it to str.
from urllib import request
response = request.urlopen(r'http://python.org/')  # returns an HTTPResponse object
page = response.read().decode('utf-8')              # bytes -> str
The object returned by urlopen() provides the following methods (see the sketch after this list):
- read(), readline(), readlines(), fileno(), close(): operate on the HTTPResponse data.
- info(): returns an HTTPMessage object containing the headers returned by the remote server.
- getcode(): returns the HTTP status code, e.g. 200 if the request completed successfully, 404 if the URL was not found.
- geturl(): returns the URL of the request.
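A minimal sketch of these accessors, assuming python.org is reachable from your machine:

from urllib import request

response = request.urlopen('http://python.org/')
print(response.getcode())   # HTTP status code, e.g. 200
print(response.geturl())    # final URL after any redirects
print(response.info())      # HTTPMessage with the response headers
body = response.read()      # raw bytes of the page
response.close()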
2. Use Request
urllib.request.Request(url, data=None, headers={}, method=None)
Use Request() to wrap the request, then pass it to urlopen() to fetch the page.
url = r'http://www.lagou.com/zhaopin/Python/?labelWords=label'
headers = {
    'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  r'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
    'Referer': r'http://www.lagou.com/zhaopin/Python/?labelWords=label',
    'Connection': 'keep-alive'
}
req = request.Request(url, headers=headers)
page = request.urlopen(req).read()
page = page.decode('utf-8')
Fields commonly placed in the headers dict (an alternative using add_header() is sketched after this list):
- User-Agent: identifies the browser name and version, the operating system name and version, and the default language.
- Referer: can be used to prevent hotlinking; some sites only serve images when the Referer shows the request came from their own pages.
- Connection: indicates the connection state, e.g. keep-alive to reuse the TCP connection across requests.
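Headers can also be attached to an existing Request object one at a time with add_header(); a minimal sketch, with the User-Agent string shortened for illustration:

from urllib import request

url = r'http://www.lagou.com/zhaopin/Python/?labelWords=label'
req = request.Request(url)
# add_header(key, value) sets a single header on the Request object
req.add_header('User-Agent', 'Mozilla/5.0')   # shortened UA string for illustration
req.add_header('Referer', url)
req.add_header('Connection', 'keep-alive')
page = request.urlopen(req).read().decode('utf-8')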
3. Post Data
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
The data parameter of urlopen() defaults to None. When data is not None, urlopen() submits the request via POST instead of GET.
from urllib import request, parse

url = r'http://www.lagou.com/jobs/positionAjax.json?'
headers = {
    'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  r'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
    'Referer': r'http://www.lagou.com/zhaopin/Python/?labelWords=label',
    'Connection': 'keep-alive'
}
data = {
    'first': 'true',
    'pn': 1,
    'kd': 'Python'
}
data = parse.urlencode(data).encode('utf-8')
req = request.Request(url, headers=headers, data=data)
page = request.urlopen(req).read()
page = page.decode('utf-8')
urllib.parse.urlencode(query, doseq=False, safe='', encoding=None, errors=None)
urlencode() converts the data to be submitted into a URL-encoded query string that can be appended to the URL.
data = {
    'first': 'true',
    'pn': 1,
    'kd': 'Python'
}
data = parse.urlencode(data).encode('utf-8')
After urlencode() conversion, the data becomes first=true&pn=1&kd=Python, so the final submitted URL is
http://www.lagou.com/jobs/positionAjax.json?first=true&pn=1&kd=Python
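For a plain GET request, the same query string can simply be appended to the URL; a minimal sketch using the example endpoint above (a real request to this site may still require the headers shown earlier):

from urllib import request, parse

params = {'first': 'true', 'pn': 1, 'kd': 'Python'}
url = 'http://www.lagou.com/jobs/positionAjax.json?' + parse.urlencode(params)
# url is now http://www.lagou.com/jobs/positionAjax.json?first=true&pn=1&kd=Python
page = request.urlopen(url).read().decode('utf-8')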
POST data must be bytes or an iterable of bytes, not str, so encode() is required.
page = request.urlopen(req, data=data).read()
Of course, data can also be passed directly to urlopen(), as in the line above, instead of being attached to the Request object.
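Both placements are equivalent; a minimal sketch of the two options, with a shortened User-Agent string for illustration:

from urllib import request, parse

url = r'http://www.lagou.com/jobs/positionAjax.json?'
headers = {'User-Agent': 'Mozilla/5.0'}  # shortened for illustration
data = parse.urlencode({'first': 'true', 'pn': 1, 'kd': 'Python'}).encode('utf-8')

# Option 1: attach data when building the Request
req = request.Request(url, headers=headers, data=data)
page = request.urlopen(req).read().decode('utf-8')

# Option 2: build the Request without data and pass data to urlopen()
req = request.Request(url, headers=headers)
page = request.urlopen(req, data=data).read().decode('utf-8')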
4. Exception Handling
from urllib import request, parse, error

def get_page(url):
    headers = {
        'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      r'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
        'Referer': r'http://www.lagou.com/zhaopin/Python/?labelWords=label',
        'Connection': 'keep-alive'
    }
    data = {
        'first': 'true',
        'pn': 1,
        'kd': 'Python'
    }
    data = parse.urlencode(data).encode('utf-8')
    req = request.Request(url, headers=headers)
    page = None
    try:
        page = request.urlopen(req, data=data).read()
        page = page.decode('utf-8')
    except error.HTTPError as e:
        print(e.code)                    # e.code is an attribute, not a method
        print(e.read().decode('utf-8'))
    return page
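HTTPError covers responses that came back with an error status code; connection-level failures (DNS errors, refused connections, timeouts) raise urllib.error.URLError instead. A minimal sketch that handles both (HTTPError must be caught first, since it is a subclass of URLError):

from urllib import request, error

def fetch(url):
    try:
        return request.urlopen(url, timeout=10).read().decode('utf-8')
    except error.HTTPError as e:
        # the server answered, but with an error status code
        print('HTTP error:', e.code)
        print(e.read().decode('utf-8'))
    except error.URLError as e:
        # the server could not be reached at all
        print('URL error:', e.reason)
    return None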
5. Use a Proxy
urllib.request.ProxyHandler(proxies=None)
When the website being crawled restricts access (for example, by blocking or rate-limiting your IP), you need a proxy to fetch the data.
def get_page(url):
    data = {
        'first': 'true',
        'pn': 1,
        'kd': 'python'
    }
    proxy = request.ProxyHandler({'http': '5.22.195.215:80'})  # set up the proxy
    opener = request.build_opener(proxy)                       # build an opener that uses the proxy
    request.install_opener(opener)                             # install it globally (optional)
    data = parse.urlencode(data).encode('utf-8')
    page = opener.open(url, data).read()
    page = page.decode('utf-8')
    return page
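If the proxy itself is unreachable, open() raises urllib.error.URLError; a minimal sketch wrapping a proxied request, using a placeholder proxy address:

from urllib import request, error

# placeholder proxy address; replace with a working HTTP proxy
proxy = request.ProxyHandler({'http': 'http://127.0.0.1:8080'})
opener = request.build_opener(proxy)

try:
    page = opener.open('http://python.org/', timeout=10).read().decode('utf-8')
except error.URLError as e:
    print('proxy request failed:', e.reason)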