urllib.request
The most commonly used parts of urllib cover sending requests, handling responses, simulating browsers, using proxies, handling cookies, and related functions.
1. Quick Request
The object returned by urlopen() provides some basic methods:
- read() returns the response body
- info() returns the header information sent by the server
- getcode() returns the HTTP status code
- geturl() returns the URL that was actually requested
# request.urlopen(url, data=None, timeout=10)
#   url:     the URL to open
#   data:    data submitted via POST
#   timeout: timeout in seconds for accessing the site

from urllib import request
import ssl

# In some environments this avoids:
# <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed>
ssl._create_default_https_context = ssl._create_unverified_context

url = 'https://www.jianshu.com'
# Returns an <http.client.HTTPResponse object at 0x0000000002E34550>
response = request.urlopen(url, data=None, timeout=10)
# The page fetched with urlopen() from urllib.request is of type bytes;
# it must be decoded with decode() to convert it to str.
page = response.read().decode('utf-8')
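As a quick illustration of the helper methods listed above, a minimal sketch calling them on a fresh response object (the exact status and header values naturally depend on the site):

from urllib import request

response = request.urlopen('https://www.jianshu.com', timeout=10)
print(response.getcode())   # HTTP status code, e.g. 200
print(response.geturl())    # the URL that was actually fetched
print(response.info())      # header information returned by the server
print(response.read().decode('utf-8')[:200])  # first part of the decoded body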
2. Simulating PC and mobile browsers
You need to add header information; urlopen() does not support this directly, so use a Request object.
PC
import urllib.request

url = 'https://www.jianshu.com'
# Add headers
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.96 Safari/537.36'}
request = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(request)
# In urllib, whether a request is GET or POST depends on whether the data parameter is submitted
print(request.get_method())

>> Output
GET
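To illustrate the comment above, a small sketch: supplying a data payload to Request switches the method reported by get_method() from GET to POST (httpbin.org is used here only as a convenient test endpoint):

from urllib import request, parse

data = parse.urlencode({'hello': 'world'}).encode('utf-8')  # POST body must be bytes
req = request.Request('http://httpbin.org/post', data=data)
print(req.get_method())  # POST, because a data payload was supplied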
Mobile
from urllib import request

req = request.Request('http://www.douban.com/')
req.add_header('User-Agent',
               'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) '
               'AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376E Safari/8536.25')
with request.urlopen(req) as f:
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read().decode('utf-8'))
3. Using cookies
Cookies are stored on the client to record the user's identity and keep login information.
import http.cookiejar, urllib.request

# 1. Create a CookieJar object
cookie = http.cookiejar.CookieJar()
# Create a cookie processor with HTTPCookieProcessor
handler = urllib.request.HTTPCookieProcessor(cookie)
# Build an opener object
opener = urllib.request.build_opener(handler)
# Install the opener globally
urllib.request.install_opener(opener)
data = urllib.request.urlopen(url)

# 2. Save cookies as text
import http.cookiejar, urllib.request
filename = 'cookie.txt'
# There are several save formats
# Format 1
cookie = http.cookiejar.MozillaCookieJar(filename)
# Format 2
cookie = http.cookiejar.LWPCookieJar(filename)

# Read the file back with the matching class
cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
...
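The snippet above creates the file-backed cookie jars but never writes them out. A minimal sketch of the full save step, assuming httpbin.org/cookies/set is used purely as an example endpoint that sets a cookie:

import http.cookiejar, urllib.request

filename = 'cookie.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
# Any response that sets cookies will populate the jar
opener.open('http://httpbin.org/cookies/set?name=value')
# Write the collected cookies to cookie.txt
cookie.save(ignore_discard=True, ignore_expires=True)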
4. Setting a proxy
When the site you need to crawl has access restrictions, a proxy can be used to fetch the data.
import urllib.request

url = 'http://httpbin.org/ip'
proxy = {'http': '39.134.108.89:8080', 'https': '39.134.108.89:8080'}
# Create the proxy handler
proxies = urllib.request.ProxyHandler(proxy)
# Create a specific opener object
opener = urllib.request.build_opener(proxies, urllib.request.HTTPHandler)
# Install the opener globally, so urlopen() uses this specific opener
urllib.request.install_opener(opener)
data = urllib.request.urlopen(url)
print(data.read().decode())
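If you prefer not to change the global opener, the same proxy can be used per request by calling the opener directly; a minimal sketch under the same example proxy address:

import urllib.request

proxies = urllib.request.ProxyHandler({'http': '39.134.108.89:8080'})
opener = urllib.request.build_opener(proxies)
# Use the opener directly instead of installing it globally
with opener.open('http://httpbin.org/ip') as resp:
    print(resp.read().decode())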
urllib.error
urllib.error receives the exceptions raised by urllib.request. The two classes most commonly used from urllib.error are URLError and HTTPError. URLError is a subclass of OSError,
and HTTPError is a subclass of URLError. The server's HTTP response carries a status code, and from that status code we can tell whether the request succeeded.
URLError
URLError is usually raised when the network cannot connect, the server does not exist, and so on.
For example, accessing a URL that does not exist:
import urllib.error
import urllib.request

request = urllib.request.Request('http://www.usahfkjashfj.com/')
try:
    urllib.request.urlopen(request).read()
except urllib.error.URLError as e:
    print(e.reason)
else:
    print('Success')

>> Output
[Errno 11004] getaddrinfo failed
HTTPError
HTTPError is a subclass of URLError. When you make a request with urlopen(), the server returns a response object that contains a numeric status code. Some responses, for example a redirect that requires fetching the document from a different address, are handled by urllib itself. For those it cannot handle, urlopen() raises an HTTPError that carries the corresponding status code.
The HTTP status code represents the status of the response returned by the HTTP protocol.
from urllib import request, error

try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.URLError as e:
    print(e.reason)

# Catch the subclass error first
try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request successful')
>> Output
Not Found
-------------
Not Found
404
Server: nginx/1.10.3 (Ubuntu)
Date: Thu, 2018 14:45:39 GMT
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Connection: close
Vary: Cookie
Expires: Wed, Jan 1984 05:00:00 GMT
urllib.parse
urllib.parse.urljoin: joining URLs
Constructs an absolute URL from a base URL and another URL. The two URLs should belong to the same site; otherwise the host in the second argument overrides the host in the base URL.
from urllib import parse

print(parse.urljoin('https://www.jianshu.com/xyz', 'FAQ.html'))
print(parse.urljoin('http://www.baidu.com/about.html', 'http://www.baidu.com/FAQ.html'))

>> Output
https://www.jianshu.com/FAQ.html
http://www.baidu.com/FAQ.html
urllib.parse.urlencode: dictionary to query string
from urllib import request, parse

url = r'https://www.jianshu.com/collections/20f7f4031550/mark_viewed.json'
headers = {
    'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
    'Referer': r'https://www.jianshu.com/c/20f7f4031550?utm_medium=index-collections&utm_source=desktop',
    'Connection': 'keep-alive'
}
data = {
    'uuid': '5a9a30b5-3259-4fa0-ab1f-be647dbeb08a',
}
# POST data must be bytes or an iterable of bytes, not str, so encode() is required
data = parse.urlencode(data).encode('utf-8')
print(data)
req = request.Request(url, headers=headers, data=data)
page = request.urlopen(req).read()
page = page.decode('utf-8')
print(page)

>> Output
b'uuid=5a9a30b5-3259-4fa0-ab1f-be647dbeb08a'
{"message": "Success"}
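As a smaller, self-contained illustration of urlencode (a sketch with arbitrary example values): multiple keys are joined with &, values are percent-encoded, and parse_qs reverses the process:

from urllib import parse

params = {'wd': 'hello world', 'page': 2}
query = parse.urlencode(params)
print(query)                  # wd=hello+world&page=2
print(parse.parse_qs(query))  # {'wd': ['hello world'], 'page': ['2']}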
urllib.parse.quote: URL encoding
urllib.parse.unquote: URL decoding
URLs are transmitted in ASCII rather than Unicode, for example:
http://so.biquge.la/cse/search?s=7138806708853866527&q=%CD%EA%C3%C0%CA%C0%BD%E7
from urllib import parse

x = parse.quote('山西', encoding='gb18030')  # or encoding='gbk'
print(x)  # %C9%BD%CE%F7
city = parse.unquote('%E5%B1%B1%E8%A5%BF')  # default encoding='utf-8'
print(city)  # 山西 (Shanxi)
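A further sketch of quote's defaults: with no encoding argument it uses UTF-8, and the safe parameter controls which characters are left unescaped (by default only '/'):

from urllib import parse

print(parse.quote('山西'))                 # %E5%B1%B1%E8%A5%BF, UTF-8 by default
print(parse.quote('/search q/'))           # /search%20q/, '/' is kept by default
print(parse.quote('/search q/', safe=''))  # %2Fsearch%20q%2F, '/' escaped as well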
The urllib3 package
urllib3 is a powerful, well-organized HTTP client library for Python, and many native Python systems have started using it. urllib3 provides a number of important features that the Python standard library lacks:
1. Thread safety
2. Connection pooling
3. Client-side SSL/TLS verification (see the sketch after this list)
4. File uploads with multipart encoding
5. Helpers for retrying requests and handling HTTP redirects
6. Support for compressed encodings
7. Support for HTTP and SOCKS proxies
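As an example of item 3, a minimal sketch of enabling certificate verification on a PoolManager; it assumes the certifi package is installed to supply a CA bundle:

import certifi
import urllib3

# Verify HTTPS certificates against certifi's CA bundle
http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
r = http.request('GET', 'https://httpbin.org/ip')
print(r.status)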
Installation:
urllib3 can be installed with pip:
$ pip install urllib3
You can also download the latest source code from GitHub and install it after unpacking:
$ git clone git://github.com/shazow/urllib3.git
$ python setup.py install
Using urllib3:
GET requests
import urllib3
import requests

# Ignore the warning: InsecureRequestWarning: Unverified HTTPS request is being made.
# Adding certificate verification is strongly advised.
requests.packages.urllib3.disable_warnings()

# A PoolManager instance generates requests; it handles the connection pool and all thread-safety details
http = urllib3.PoolManager()
# Create a request with the request() method:
r = http.request('GET', 'http://cuiqingcai.com/')
print(r.status)  # 200
# Get the HTML source, decoded as UTF-8
print(r.data.decode())
GET requests (with parameters)
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36'
}
r = http.request('GET',
                 'https://www.baidu.com/s',
                 fields={'wd': 'hello'},
                 headers=header)
print(r.status)  # 200
print(r.data.decode())
POST requests
# You can also pass extra information, such as headers, to the request via the request() method:
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36'
}
r = http.request('POST',
                 'http://httpbin.org/post',
                 fields={'hello': 'world'},
                 headers=header)
print(r.data.decode())
# For POST and PUT requests, data passed in the URL needs to be encoded manually and appended to the URL:
import urllib.parse

encode_arg = urllib.parse.urlencode({'arg': 'my'})
print(encode_arg.encode())
r = http.request('POST',
                 'http://httpbin.org/post?' + encode_arg,
                 headers=header)
# Unicode decoding
print(r.data.decode('unicode_escape'))
Send JSON data
# JSON: when initiating a request, you can send encoded JSON data by setting the body parameter
# and the Content-Type header:
import json

data = {'attribute': 'value'}
encode_data = json.dumps(data).encode()
r = http.request('POST',
                 'http://httpbin.org/post',
                 body=encode_data,
                 headers={'Content-Type': 'application/json'})
print(r.data.decode('unicode_escape'))
Uploading files
# To upload a file with multipart/form-data encoding, pass it the same way as form data,
# defining the file as a tuple (file_name, file_data):
with open('1.txt', 'r+', encoding='utf-8') as f:
    file_read = f.read()
r = http.request('POST',
                 'http://httpbin.org/post',
                 fields={'filefield': ('1.txt', file_read, 'text/plain')})
print(r.data.decode('unicode_escape'))

# Binary files
with open('websocket.jpg', 'rb') as f2:
    binary_read = f2.read()
r = http.request('POST',
                 'http://httpbin.org/post',
                 body=binary_read,
                 headers={'Content-Type': 'image/jpeg'})
# print(json.loads(r.data.decode('utf-8'))['data'])
print(r.data.decode('utf-8'))
Using timeout
# With timeout you can control how long a request may run. In simple applications,
# set the timeout parameter to a floating-point number:
r = http.request('POST', 'http://httpbin.org/post', timeout=3.0)
print(r.data.decode('utf-8'))

# To make every request follow the same timeout, define the timeout parameter on the PoolManager:
http = urllib3.PoolManager(timeout=3.0)
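For finer control, urllib3 also exposes a Timeout class that separates connect and read timeouts; a minimal sketch (the 2.0 and 4.0 second values are arbitrary):

import urllib3

# Fail if connecting takes longer than 2 s or reading the response takes longer than 4 s
timeout = urllib3.Timeout(connect=2.0, read=4.0)
http = urllib3.PoolManager(timeout=timeout)
r = http.request('GET', 'http://httpbin.org/ip')
print(r.status)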
Controlling retries and redirects
# Retries are controlled with the retries parameter. By default urllib3 retries a request
# 3 times and follows 3 redirects.
r = http.request('GET', 'http://httpbin.org/ip', retries=5)  # retry up to 5 times
print(r.data.decode('utf-8'))

# To turn off both retrying requests and redirects, set retries to False:
r = http.request('GET', 'http://httpbin.org/redirect/1', retries=False, redirect=False)
print('d1', r.data.decode('utf-8'))

# To turn off redirects but keep retrying requests, set the redirect parameter to False:
r = http.request('GET', 'http://httpbin.org/redirect/1', redirect=False)
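For finer-grained control than a plain integer, urllib3 also provides a Retry object; a minimal sketch (the counts are arbitrary) allowing 3 retries but only 2 followed redirects:

import urllib3

retry = urllib3.Retry(3, redirect=2)
http = urllib3.PoolManager(retries=retry)
r = http.request('GET', 'http://httpbin.org/redirect/1')
print(r.status)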