Python Crawler's Urllib library


The urllib request library

urllib is mainly divided into four modules:

urllib.request      sends requests
urllib.error        the exceptions raised while processing a request
urllib.parse        processes URLs
urllib.robotparser  parses robots.txt, which defines a site's crawler permissions (see the sketch below)
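robotparser is the only module not demonstrated later in this article, so here is a minimal sketch; the jianshu.com robots.txt URL is just an illustration:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://www.jianshu.com/robots.txt')
rp.read()  # fetch and parse robots.txt
print(rp.can_fetch('*', 'https://www.jianshu.com/p/example'))  # may our crawler fetch this URL?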

urllib.request methods

data = urllib.request.urlopen(url)  # returns a response object

data.read()     -> the page source (bytes; call decode() to get utf-8 text). Note that the source does not include data rendered by JS
data.info()     -> the response headers
data.getcode()  -> the HTTP status code
data.geturl()   -> the URL that was actually requested
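A minimal sketch tying these calls together, using www.baidu.com (one of the sites referenced later in this article):

import urllib.request

data = urllib.request.urlopen('http://www.baidu.com')
html = data.read().decode('utf-8')  # bytes -> str
print(data.getcode())  # e.g. 200
print(data.geturl())   # the final URL after any redirects
print(data.info())     # the response headers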

A request sent from a script carries the User-Agent header python-urllib/3.6 (matching the Python version). Some sites use this header to decide whether a request was issued by a script and then filter out the crawler. So how do we simulate browser access?
Use urllib.request.Request() to carry custom header information.

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'}
# instantiate a Request object; headers can be passed in at initialization
req = urllib.request.Request(url, headers=headers)
# urlopen() accepts not only a URL but also a Request object
data = urllib.request.urlopen(req)

How does urllib.request decide whether a request is a GET or a POST?
Looking at the urllib.request source code: if the request carries a data parameter it is sent as a POST, otherwise as a GET.
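A minimal POST sketch; httpbin.org/post is an echo service chosen here for illustration, not part of the original article:

import urllib.parse
import urllib.request

# urlencode the form fields, then encode to bytes: the data parameter must be bytes
data = urllib.parse.urlencode({'name': 'walker'}).encode('utf-8')
req = urllib.request.Request('http://httpbin.org/post', data=data)  # data present -> POST
print(urllib.request.urlopen(req).read().decode())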

Use of cookies

The standard pattern:

import http.cookiejar
import urllib.request

# create a CookieJar object
cookie_jar = http.cookiejar.CookieJar()
# use HTTPCookieProcessor to create a cookie handler and build an opener with it
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
# install the opener as the global urlopen; after installation urlopen saves cookies too
urllib.request.install_opener(opener)
# if you don't want to install it, open URLs directly with opener.open(url)
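A quick way to confirm that cookies were captured; the httpbin.org cookie endpoint is an assumption used for illustration, any site that sets cookies will do:

opener.open('http://httpbin.org/cookies/set?session=abc123')  # hypothetical cookie-setting URL
for cookie in cookie_jar:
    print(cookie.name, '=', cookie.value)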
Setting up a proxy

The standard pattern:

proxy = {'http': '183.232.188.18:80', 'https': '183.232.188.18:80'}  # proxy ip:port
# create a proxy handler
proxies = urllib.request.ProxyHandler(proxy)
# create an opener object
opener = urllib.request.build_opener(proxies, urllib.request.HTTPHandler)
urllib.request.install_opener(opener)
# if you don't want to install it, open URLs directly with opener.open(url);
# then the opener uses the proxy while plain urlopen does not
urllib.error

URLError is the parent class. It is raised when the network is down or the server does not exist, and it has a reason attribute but no code attribute.
HTTPError is a subclass. The server exists, but the address does not (for example, a 404); it has both code and reason attributes.

# url = 'http://www.adfdsfdsdsa.com'                    -> URLError
# url = 'https://jianshu.com/p/jkhsdhgjkasdhgjkadfhg'   -> HTTPError

How to tell them apart:

try:
    data = urllib.request.urlopen(url)
    print(data.read().decode())
except urllib.error.URLError as e:
    if hasattr(e, 'code'):
        print('HTTPError')
    elif hasattr(e, 'reason'):
        print('URLError')
urllib.parse

import urllib.parse

# urllib.parse.urljoin()   joins URLs; if the second argument is an absolute URL, it overrides the base
# urllib.parse.urlencode() turns a dictionary into a query string
# urllib.parse.quote()     percent-encodes a URL into ASCII; required when the URL contains Chinese
# urllib.parse.unquote()   decodes a percent-encoded URL

urlencode:

>>> from urllib import parse
>>> query = {'name': 'walker', 'age': 99}
>>> parse.urlencode(query)
'name=walker&age=99'

quote / quote_plus:

>>> from urllib import parse
>>> parse.quote('a&b/c')       # does not encode the slash
'a%26b/c'
>>> parse.quote_plus('a&b/c')  # encodes the slash
'a%26b%2Fc'

unquote / unquote_plus:

>>> from urllib import parse
>>> parse.unquote('1+2')       # does not decode the plus sign
'1+2'
>>> parse.unquote_plus('1+2')  # decodes the plus sign to a space
'1 2'

url = 'http://www.baidu.com/?wd=书包'  # 书包 ("schoolbag"): the query contains Chinese
urllib.request.urlopen(url)           # raises an encoding error
# to request correctly, encode the Chinese part of the URL first:
url = 'http://www.baidu.com/?wd={}'.format(urllib.parse.quote('书包'))
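urljoin() is the one helper above without a demonstration; a minimal sketch:

>>> from urllib import parse
>>> parse.urljoin('http://www.baidu.com/path/', 'page.html')  # a relative path is appended
'http://www.baidu.com/path/page.html'
>>> parse.urljoin('http://www.baidu.com/path/', 'https://www.jianshu.com/')  # an absolute URL overrides the base
'https://www.jianshu.com/'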
urllib3

import urllib3

# urllib3 is a separate third-party library built around connection pooling
http = urllib3.PoolManager()
r = http.request('GET', 'https://www.jianshu.com', redirect=False)  # turn off redirects
print(r.status)
