Brief introduction
urllib is Python's standard library module for fetching URLs (Uniform Resource Locators) and can be used to crawl remote data.
Common methods
(1) urlopen
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
urllib.request.urlopen() fetches a page. The page content is returned as bytes, so it needs decode() to convert it to str.
Parameter description:
- url: the URL to open
- data: the data to submit, as a dict. It defaults to None, in which case urlopen() sends a GET request; when data is not None, urlopen() submits it as a POST. Note that for a POST the data must be converted to bytes.
- timeout: the timeout, in seconds, for accessing the site
```python
from urllib import request

response = request.urlopen("http://members.3322.org/dyndns/getip")
# <http.client.HTTPResponse object at 0x031f63b0>
page = response.read()          # b'106.37.169.186\n'
page = page.decode("utf-8")     # '106.37.169.186\n'
```
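As a minimal sketch of the data and timeout parameters (httpbin.org here is only an assumed test endpoint, not part of the original article), a POST through urlopen() might look like this:

```python
from urllib import request, parse

# Assumed test endpoint; any URL that accepts a POST would work.
url = 'http://httpbin.org/post'
data = parse.urlencode({'kd': 'Python'}).encode('utf-8')  # POST data must be bytes
# Passing data makes urlopen() send a POST; timeout is in seconds.
response = request.urlopen(url, data=data, timeout=10)
print(response.read().decode('utf-8'))
```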
The response object returned by urlopen() provides these methods:
- read(), readline(), readlines(), fileno(), close(): operate on the HTTPResponse data
- info(): returns an HTTPMessage object with the header information sent by the remote server
- getcode(): returns the HTTP status code (for an HTTP request, 200 means the request completed successfully, 404 means the page was not found)
- geturl(): returns the URL of the request
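A short hedged example of these accessor methods, reusing the IP-echo URL from above:

```python
from urllib import request

response = request.urlopen('http://members.3322.org/dyndns/getip')
print(response.getcode())   # e.g. 200 when the request succeeded
print(response.geturl())    # 'http://members.3322.org/dyndns/getip'
print(response.info())      # HTTPMessage containing the response headers
page = response.read().decode('utf-8')
response.close()            # release the connection when finished
```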
(2) Request
urllib.request.Request(url, data=None, headers={}, method=None)
```python
from urllib import request

url = r'http://www.lagou.com/zhaopin/Python/?labelWords=label'
headers = {
    'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  r'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
    'Referer': r'http://www.lagou.com/zhaopin/Python/?labelWords=label',
    'Connection': 'keep-alive',
}
req = request.Request(url, headers=headers)
page = request.urlopen(req).read()
page = page.decode('utf-8')
```
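The remaining parameters of Request can be sketched as follows; the explicit method argument and the add_header() call are standard urllib features, and the values used are illustrative only:

```python
from urllib import request, parse

url = r'http://www.lagou.com/jobs/positionAjax.json?'
data = parse.urlencode({'kd': 'Python'}).encode('utf-8')
# method can be set explicitly; otherwise POST is inferred from the presence of data.
req = request.Request(url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/5.0')   # headers can also be added one at a time
page = request.urlopen(req).read().decode('utf-8')
```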
(3) parse.urlencode
urllib.parse.urlencode(query, doseq=False, safe='', encoding=None, errors=None)
The main purpose of urlencode() is to encode the data to be submitted so that it can be attached to the URL (for GET) or sent as the request body (for POST).
```python
from urllib import request, parse

url = r'http://www.lagou.com/jobs/positionAjax.json?'
headers = {
    'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  r'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
    'Referer': r'http://www.lagou.com/zhaopin/Python/?labelWords=label',
    'Connection': 'keep-alive',
}
data = {
    'first': 'true',
    'pn': 1,
    'kd': 'Python',
}
# urlencode() turns the dict into 'first=true&pn=1&kd=Python'.
# POST data must be bytes or an iterable of bytes, not str, so it is encoded as well.
data = parse.urlencode(data).encode('utf-8')   # b'first=true&pn=1&kd=Python'
req = request.Request(url, headers=headers, data=data)
# req: <urllib.request.Request object at 0x02f52a30>
page = request.urlopen(req).read()
# page is bytes:
# b'{"success":false,"msg":"\xe6\x82\xa8\xe6\x93\x8d\xe4\xbd\x9c\xe5\xa4\xaa\xe9\xa2\x91\xe7\xb9\x81,\xe8\xaf\xb7\xe7\xa8\x8d\xe5\x90\x8e\xe5\x86\x8d\xe8\xae\xbf\xe9\x97\xae","clientIP":"106.37.169.186"}\n'
page = page.decode('utf-8')
# page is now a str:
# '{"success":false,"msg":"You are operating too frequently, please revisit later","clientIP":"106.37.169.186"}\n'
```
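For a GET request, the output of urlencode() can simply be appended to the URL as a query string; a minimal sketch using the same parameters:

```python
from urllib import request, parse

base_url = r'http://www.lagou.com/jobs/positionAjax.json'
params = parse.urlencode({'first': 'true', 'pn': 1, 'kd': 'Python'})
# No data argument, so this is a plain GET of
# http://www.lagou.com/jobs/positionAjax.json?first=true&pn=1&kd=Python
page = request.urlopen(base_url + '?' + params).read().decode('utf-8')
```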
(4) Proxies: request.ProxyHandler(proxies=None)
When the site to be crawled has access restrictions in place, a proxy is needed to fetch the data.
```python
from urllib import request, parse

url = r'http://www.lagou.com/jobs/positionAjax.json?'   # same URL as in the previous example
data = {
    'first': 'true',
    'pn': 1,
    'kd': 'Python',
}
proxy = request.ProxyHandler({'http': '5.22.195.215:80'})  # set the proxy
opener = request.build_opener(proxy)                       # build an opener that uses the proxy
request.install_opener(opener)                             # install the opener globally
data = parse.urlencode(data).encode('utf-8')
page = opener.open(url, data).read()
page = page.decode('utf-8')
```
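Once install_opener() has been called, the global request.urlopen() also goes through the proxy, so the opener does not have to be used directly; a brief sketch with the same illustrative proxy address:

```python
from urllib import request

proxy = request.ProxyHandler({'http': '5.22.195.215:80'})
request.install_opener(request.build_opener(proxy))
# After install_opener(), urlopen() routes HTTP requests through the proxy.
page = request.urlopen('http://members.3322.org/dyndns/getip').read().decode('utf-8')
print(page)   # should report the proxy's IP rather than your own
```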
Article reference: https://www.cnblogs.com/Lands-ljk/p/5447127.html
Python3's urllib module