Python urllib and urllib3 Packages

Tags: connection pooling, urlencode, uuid

urllib.request

The most commonly used features of urllib cover requests, responses, browser simulation, proxies, cookies, and more.

1. Quick Request

The object returned by urlopen() provides some basic methods:

    • read() returns the response body (as bytes)
    • info() returns the header information sent by the server
    • getcode() returns the HTTP status code
    • geturl() returns the requested URL
urllib.request.urlopen(url, data=None, timeout=10)
# url: the URL to open
# data: data to submit via POST
# timeout: access timeout for the site, in seconds

from urllib import request
import ssl

# Work around <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed>
# in some environments
ssl._create_default_https_context = ssl._create_unverified_context

url = 'https://www.jianshu.com'
# Returns an <http.client.HTTPResponse object at 0x0000000002E34550>
response = request.urlopen(url, data=None, timeout=10)
# urlopen() returns the page data as bytes; decode() converts it to str
page = response.read().decode('utf-8')
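A short sketch exercising the other accessor methods from the list above, reusing the response object from the example (the printed values depend on the site):

print(response.info())     # header information sent by the server
print(response.getcode())  # HTTP status code, e.g. 200 on success
print(response.geturl())   # the final URL, after any redirects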

2. Simulating PC and Mobile Browsers

This requires adding header information; urlopen() does not support headers directly, so you need to build a Request object.

PC

import urllib.request

url = 'https://www.jianshu.com'
# Add a header
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.96 Safari/537.36'}
request = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(request)
# In urllib, whether a request is GET or POST is determined by whether
# the data parameter was submitted
print(request.get_method())

>> Output: GET

Mobile

req = request.Request('http://www.douban.com/')
req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376E Safari/8536.25')
with request.urlopen(req) as f:
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read().decode('utf-8'))

3. Using Cookies

Cookies are used on the client side to record the user's identity and maintain login state.

import http.cookiejar, urllib.request

# 1. Create a CookieJar object
cookie = http.cookiejar.CookieJar()
# Create a cookie processor with HTTPCookieProcessor
handler = urllib.request.HTTPCookieProcessor(cookie)
# Build an opener object
opener = urllib.request.build_opener(handler)
# Install the opener globally
urllib.request.install_opener(opener)
data = urllib.request.urlopen(url)

# 2. Save cookies as text
import http.cookiejar, urllib.request
filename = 'cookie.txt'
# There are several save formats
# Format 1
cookie = http.cookiejar.MozillaCookieJar(filename)
# Format 2
cookie = http.cookiejar.LWPCookieJar(filename)

# Read the file back with the matching class
cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
...
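Note that the snippet above never actually writes cookie.txt. A minimal save flow might look like this (assuming the same url variable as above), since MozillaCookieJar and LWPCookieJar both provide save():

import http.cookiejar, urllib.request

cookie = http.cookiejar.LWPCookieJar('cookie.txt')
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
opener.open(url)  # url is assumed to be defined as above
# Write the captured cookies to cookie.txt, keeping session and expired cookies
cookie.save(ignore_discard=True, ignore_expires=True)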

4. Setting Up a Proxy

When a site you need to crawl imposes access restrictions, a proxy lets you fetch the data anyway.

import urllib.request

url = 'http://httpbin.org/ip'
proxy = {'http': '39.134.108.89:8080', 'https': '39.134.108.89:8080'}
# Create a proxy handler
proxies = urllib.request.ProxyHandler(proxy)
# Create an opener object that uses it
opener = urllib.request.build_opener(proxies, urllib.request.HTTPHandler)
# Install the opener globally, turning urlopen() into this specific opener
urllib.request.install_opener(opener)
data = urllib.request.urlopen(url)
print(data.read().decode())

urllib.error

urllib.error defines the exceptions raised by urllib.request. Its two commonly used classes are URLError and HTTPError. URLError is a subclass of OSError,

and HTTPError is a subclass of URLError. The server's HTTP response carries a status code, and from that status code we can tell whether our request succeeded.
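The hierarchy is easy to verify directly; a quick check:

import urllib.error

print(issubclass(urllib.error.URLError, OSError))                 # True
print(issubclass(urllib.error.HTTPError, urllib.error.URLError))  # True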

URLError

URLError is usually raised when the network is unreachable, the server does not exist, and so on.

For example, accessing a URL that does not exist:

import urllib.error
import urllib.request

request = urllib.request.Request('http://www.usahfkjashfj.com/')
try:
    urllib.request.urlopen(request).read()
except urllib.error.URLError as e:
    print(e.reason)
else:
    print('Success')

>> Output: [Errno 11004] getaddrinfo failed

HTTPError

HTTPError is a subclass of URLError. When you make a request with urlopen(), the server returns a response object that carries a numeric status code.

For some codes, such as a redirect telling the client to fetch the document from a different address, urllib handles the response itself. For the codes it cannot handle, urlopen() raises an HTTPError that carries the corresponding status code.

The HTTP status code represents the status of the response returned by the HTTP protocol.

from urllib import request, error

try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.URLError as e:
    print(e.reason)

# Catch the subclass error first
try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request successful')

>> Output:

Not Found
-------------
Not Found
404
Server: nginx/1.10.3 (Ubuntu)
Date: Thu, 2018 14:45:39 GMT
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Connection: close
Vary: Cookie
Expires: Wed, Jan 1984 05:00:00 GMT

urllib.parse

urllib.parse.urljoin: joining URLs

Constructs an absolute URL from a base URL and another URL. The two should belong to the same site; otherwise, when the second argument is itself an absolute URL, its host overrides the base URL's host.

print(parse.urljoin('https://www.jianshu.com/xyz', 'FAQ.html'))
print(parse.urljoin('http://www.baidu.com/about.html', 'http://www.baidu.com/FAQ.html'))

>> Output:
https://www.jianshu.com/FAQ.html
http://www.baidu.com/FAQ.html
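urljoin also resolves relative path segments against the base URL; a small sketch of that standard-library behavior:

from urllib import parse

print(parse.urljoin('https://www.jianshu.com/a/b', '../c'))  # https://www.jianshu.com/c
print(parse.urljoin('https://www.jianshu.com/a/b', '/c'))    # https://www.jianshu.com/c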

urllib.parse.urlencode: dictionary to query string

from urllib import request, parse

url = r'https://www.jianshu.com/collections/20f7f4031550/mark_viewed.json'
headers = {
    'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
    'Referer': r'https://www.jianshu.com/c/20f7f4031550?utm_medium=index-collections&utm_source=desktop',
    'Connection': 'keep-alive'
}
data = {
    'uuid': '5a9a30b5-3259-4fa0-ab1f-be647dbeb08a',
}
# POST data must be bytes or an iterable of bytes, not str, so encode() is required
data = parse.urlencode(data).encode('utf-8')
print(data)
req = request.Request(url, headers=headers, data=data)
page = request.urlopen(req).read()
page = page.decode('utf-8')
print(page)

>> Output:
b'uuid=5a9a30b5-3259-4fa0-ab1f-be647dbeb08a'
{"message":"success"}
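urlencode can also expand sequence values into repeated query parameters with doseq; a small standard-library sketch:

from urllib import parse

print(parse.urlencode({'wd': 'hello', 'pn': 10}))        # wd=hello&pn=10
print(parse.urlencode({'tag': ['a', 'b']}, doseq=True))  # tag=a&tag=b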

urllib.parse.quote: URL encoding

urllib.parse.unquote: URL decoding

URLs may contain only ASCII characters, not Unicode, for example:

http://so.biquge.la/cse/search?s=7138806708853866527&q=%CD%EA%C3%C0%CA%C0%BD%E7

from urllib import parse

x = parse.quote('山西', encoding='gb18030')  # or encoding='gbk'; 山西 = Shanxi
print(x)  # %C9%BD%CE%F7
city = parse.unquote('%E5%B1%B1%E8%A5%BF')  # encoding defaults to 'utf-8'
print(city)  # 山西
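In Python 3, quote() defaults to UTF-8 and leaves '/' unescaped unless the safe parameter is overridden; a quick sketch:

from urllib import parse

print(parse.quote('山西'))                # %E5%B1%B1%E8%A5%BF (UTF-8 by default)
print(parse.quote('/path with spaces'))  # /path%20with%20spaces
print(parse.quote('/path', safe=''))     # %2Fpath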

The urllib3 Package

urllib3 is a powerful, well-organized HTTP client library for Python, and much of the Python ecosystem already uses urllib3. It provides a number of important features that are missing from the Python standard library (a short pooling sketch follows the list):

1. Thread safety
2. Connection pooling
3. Client-side SSL/TLS verification
4. File uploads with multipart encoding
5. Helpers for retrying requests and following HTTP redirects
6. Support for gzip and deflate encoding
7. Proxy support for HTTP and SOCKS
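As a minimal sketch of the first two points, one PoolManager can be shared across threads, and connections to the same host are reused from the pool (the host and pool sizes here are illustrative):

import threading
import urllib3

http = urllib3.PoolManager(num_pools=2, maxsize=4)  # pool settings are illustrative

def fetch():
    # Threads can safely share the same PoolManager; connections to the
    # same host are drawn from, and returned to, the pool
    r = http.request('GET', 'http://httpbin.org/ip')
    print(r.status)

threads = [threading.Thread(target=fetch) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()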

Installation:

urllib3 can be installed with pip:

$ pip install urllib3

You can also download the latest source code on GitHub and install it after unpacking:

$ git clone git://github.com/shazow/urllib3.git

$ python setup.py install

Using urllib3:

GET requests

import urllib3
import requests

# Ignore the warning: InsecureRequestWarning: Unverified HTTPS request is being made.
# Adding certificate verification is strongly advised.
requests.packages.urllib3.disable_warnings()

# A PoolManager instance generates requests; it handles the connection pool
# and all the details of thread safety
http = urllib3.PoolManager()
# Create a request with the request() method
r = http.request('GET', 'http://cuiqingcai.com/')
print(r.status)  # 200
# Get the HTML source and decode it as UTF-8
print(r.data.decode())

GET requests (with query data)

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36'
}
r = http.request('GET',
                 'https://www.baidu.com/s',
                 fields={'wd': 'hello'},
                 headers=header)
print(r.status)  # 200
print(r.data.decode())

POST request

# You can also pass extra information to the request via the request() method, e.g.:
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36'
}
r = http.request('POST',
                 'http://httpbin.org/post',
                 fields={'hello': 'world'},
                 headers=header)
print(r.data.decode())

# For POST and PUT requests, query data must be encoded manually and appended to the URL:
encode_arg = urllib.parse.urlencode({'arg': 'my'})
print(encode_arg.encode())
r = http.request('POST',
                 'http://httpbin.org/post?' + encode_arg,
                 headers=header)
# Unicode decoding
print(r.data.decode('unicode_escape'))

Sending JSON data

# JSON: when initiating a request, you can send JSON-encoded data by setting the
# body parameter and the Content-Type header:
import json

data = {'attribute': 'value'}
encode_data = json.dumps(data).encode()
r = http.request('POST',
                 'http://httpbin.org/post',
                 body=encode_data,
                 headers={'Content-Type': 'application/json'})
print(r.data.decode('unicode_escape'))
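Note that urllib3 2.x adds a json keyword that performs the encoding and sets the Content-Type header for you; a sketch, assuming a 2.x install:

# urllib3 >= 2.0 only
r = http.request('POST', 'http://httpbin.org/post', json={'attribute': 'value'})
print(r.json())  # decoded response body, also added in 2.0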

Uploading files

# To upload a file using multipart/form-data encoding, pass it the same way as form
# data, defining the file as a tuple (file_name, file_data):
with open('1.txt', 'r+', encoding='utf-8') as f:
    file_read = f.read()
r = http.request('POST',
                 'http://httpbin.org/post',
                 fields={'filefield': ('1.txt', file_read, 'text/plain')})
print(r.data.decode('unicode_escape'))

# Binary file
with open('websocket.jpg', 'rb') as f2:
    binary_read = f2.read()
r = http.request('POST',
                 'http://httpbin.org/post',
                 body=binary_read,
                 headers={'Content-Type': 'image/jpeg'})
# print(json.loads(r.data.decode('utf-8'))['data'])
print(r.data.decode('utf-8'))

Using timeout

# With timeout you can control how long a request may run. In simple applications,
# set the timeout parameter to a float:
r = http.request('POST',
                 'http://httpbin.org/post',
                 timeout=3.0)
print(r.data.decode('utf-8'))

# To make every request follow the same timeout, define it on the PoolManager:
http = urllib3.PoolManager(timeout=3.0)
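For finer control, urllib3 also accepts a Timeout object that separates the connect and read limits; a sketch:

import urllib3

# Separate limits for establishing the connection and reading the response
http = urllib3.PoolManager(timeout=urllib3.Timeout(connect=1.0, read=2.0))
# A per-request Timeout overrides the pool-wide default
r = http.request('GET', 'http://httpbin.org/delay/1',
                 timeout=urllib3.Timeout(connect=1.0, read=3.0))
print(r.status)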

Controlling retries and redirects

# Retries are controlled with the retries parameter. By default, urllib3 retries
# a request 3 times and follows 3 redirects.
r = http.request('GET',
                 'http://httpbin.org/ip',
                 retries=5)  # retry the request up to 5 times
print(r.data.decode('utf-8'))

# To turn off both request retries and redirects, set retries to False:
r = http.request('GET',
                 'http://httpbin.org/redirect/1',
                 retries=False, redirect=False)
print('d1', r.data.decode('utf-8'))

# To turn off redirects but keep retries, set the redirect parameter to False:
r = http.request('GET',
                 'http://httpbin.org/redirect/1',
                 redirect=False)
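Retries can likewise be tuned with a Retry object, which separates the total retry budget from the redirect budget; a sketch:

import urllib3

# Allow up to 3 retries overall, but follow at most 2 redirects
r = http.request('GET',
                 'http://httpbin.org/redirect/1',
                 retries=urllib3.Retry(3, redirect=2))
print(r.status)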
