Usage of Python 3 urllib

1. Basic Method
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

- url: the URL to be opened

- data: data submitted with a POST request

- timeout: sets the timeout, in seconds, for accessing the website

Call the urllib.request module's urlopen() directly to fetch the page. The page data is of type bytes and must be decoded with decode() to convert it to str.

from urllib import request
response = request.urlopen(r'http://python.org/')
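The timeout parameter mentioned above can be used to abandon a slow request; a minimal sketch (the 5-second value is arbitrary):

import socket
from urllib import request, error

try:
    # Give up if the server does not respond within 5 seconds.
    response = request.urlopen(r'http://python.org/', timeout=5)
    page = response.read().decode('utf-8')
except error.URLError as e:
    print('request failed:', e.reason)
except socket.timeout:
    print('request timed out while reading')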

urlopen() returns an object that provides the following methods:

- read(), readline(), readlines(), fileno(), close(): operate on the HTTPResponse data.

- info(): returns an HTTPMessage object containing the headers sent by the remote server.

- getcode(): returns the HTTP status code. For an HTTP request, 200 means the request completed successfully and 404 means the URL was not found.

- geturl(): returns the URL of the request.
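
A short sketch of these accessors (the URL is illustrative):

from urllib import request

response = request.urlopen(r'http://python.org/')
print(response.getcode())  # e.g. 200 on success
print(response.geturl())   # the final URL, after any redirects
print(response.info())     # response headers as an HTTPMessage object
page = response.read().decode('utf-8')
response.close()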

2. Use Request
urllib.request.Request(url, data=None, headers={}, method=None)

Wrap the request with Request(), then use urlopen() to fetch the page.

url = r'http://www.lagou.com/zhaopin/Python/?labelWords=label'
headers = {
    'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  r'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
    'Referer': r'http://www.lagou.com/zhaopin/Python/?labelWords=label',
    'Connection': 'keep-alive'
}
req = request.Request(url, headers=headers)
page = request.urlopen(req).read()
page = page.decode('utf-8')

Fields commonly set in the headers:

- User-Agent: identifies the client; it contains the browser name and version, the operating system name and version, and the default language.

- Referer: can be used to prevent hotlinking. For example, a site may serve its images only when the Referer identifies a page under http://***.com.

- Connection: indicates the connection state; keep-alive asks the server to hold the connection open for further requests in the session.

3. Post Data
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

The data parameter of urlopen() defaults to None. When data is not None, urlopen() submits the request via POST.

from urllib import request, parse

url = r'http://www.lagou.com/jobs/positionAjax.json?'
headers = {
    'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  r'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
    'Referer': r'http://www.lagou.com/zhaopin/Python/?labelWords=label',
    'Connection': 'keep-alive'
}
data = {
    'first': 'true',
    'pn': 1,
    'kd': 'Python'
}
data = parse.urlencode(data).encode('utf-8')
req = request.Request(url, headers=headers, data=data)
page = request.urlopen(req).read()
page = page.decode('utf-8')
urllib.parse.urlencode(query, doseq=False, safe='', encoding=None, errors=None)

urlencode() converts the data to be submitted into a URL-encoded query string, which can be appended to the URL or sent as the POST body.

data = {
    'first': 'true',
    'pn': 1,
    'kd': 'Python'
}
data = parse.urlencode(data).encode('utf-8')

After urlencode() conversion, the data becomes first=true&pn=1&kd=Python, and the final submitted URL is

http://www.lagou.com/jobs/positionAjax.json?first=true&pn=1&kd=Python
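
A quick check of that conversion (the comment shows the expected output):

from urllib import parse

query = parse.urlencode({'first': 'true', 'pn': 1, 'kd': 'Python'})
print(query)  # first=true&pn=1&kd=Python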

POST data must be bytes or an iterable of bytes, not str, so it has to be encoded with encode().

page = request.urlopen(req, data=data).read()

As shown above, data can also be passed directly as a parameter of urlopen() instead of being wrapped into the Request object.

4. Exception Handling

from urllib import request, parse, error

def get_page(url):
    headers = {
        'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      r'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
        'Referer': r'http://www.lagou.com/zhaopin/Python/?labelWords=label',
        'Connection': 'keep-alive'
    }
    data = {
        'first': 'true',
        'pn': 1,
        'kd': 'Python'
    }
    data = parse.urlencode(data).encode('utf-8')
    req = request.Request(url, headers=headers)
    page = None  # stays None if the request fails
    try:
        page = request.urlopen(req, data=data).read()
        page = page.decode('utf-8')
    except error.HTTPError as e:
        print(e.code)  # e.code is an attribute, not a method
        print(e.read().decode('utf-8'))
    return page
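
HTTPError only covers responses that did arrive but carried an error status; network-level failures (DNS errors, refused connections, timeouts) raise error.URLError instead. Since HTTPError is a subclass of URLError, catch it first; a minimal sketch:

from urllib import request, error

try:
    page = request.urlopen(r'http://python.org/').read().decode('utf-8')
except error.HTTPError as e:
    # The server responded, but with an error status such as 404 or 500.
    print('HTTP error:', e.code)
except error.URLError as e:
    # The request never reached the server (DNS failure, refused connection, ...).
    print('URL error:', e.reason)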
5. Use a Proxy
urllib.request.ProxyHandler(proxies=None)

When the website being crawled has access restrictions in place, you need to use a proxy to fetch the data.

# request, parse and url are as in the previous examples
data = {
    'first': 'true',
    'pn': 1,
    'kd': 'Python'
}
proxy = request.ProxyHandler({'http': '5.22.195.215:80'})  # set the proxy
opener = request.build_opener(proxy)  # build an opener that uses the proxy
request.install_opener(opener)  # install it as the global opener
data = parse.urlencode(data).encode('utf-8')
page = opener.open(url, data).read()
page = page.decode('utf-8')
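
Note that once install_opener() has been called, the proxy also applies to plain request.urlopen() calls, not just opener.open(). A minimal sketch (the proxy address is a placeholder, not a working proxy):

from urllib import request

proxy = request.ProxyHandler({'http': 'http://5.22.195.215:80'})  # placeholder address
opener = request.build_opener(proxy)
request.install_opener(opener)  # from now on, request.urlopen() also routes through the proxy

page = request.urlopen('http://python.org/').read().decode('utf-8')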
