Python Crawler's Urllib library


The urllib request library

urllib is mainly divided into four modules:

urllib.request      sends requests
urllib.error        the exceptions raised while processing a request
urllib.parse        processes URLs
urllib.robotparser  parses robots.txt, which defines a site's crawler permissions (see the sketch below)
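robotparser is the only module not demonstrated later in this article, so here is a minimal sketch; the jianshu.com robots.txt URL is just an illustration:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://www.jianshu.com/robots.txt')
rp.read()  # fetch and parse robots.txt
print(rp.can_fetch('*', 'https://www.jianshu.com/p/example'))  # may our crawler fetch this URL?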

urllib.request methods

data = urllib.request.urlopen(url)  # returns a response object

data.read()     -> the page source (bytes; call decode() to get utf-8 text). Note that the source does not include data rendered by JS
data.info()     -> the response headers
data.getcode()  -> the HTTP status code
data.geturl()   -> the URL that was actually requested
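A minimal sketch tying these calls together, using www.baidu.com (one of the sites referenced later in this article):

import urllib.request

data = urllib.request.urlopen('http://www.baidu.com')
html = data.read().decode('utf-8')  # bytes -> str
print(data.getcode())  # e.g. 200
print(data.geturl())   # the final URL after any redirects
print(data.info())     # the response headers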

A request sent from a script carries the User-Agent header python-urllib/3.6 (matching the Python version). Some sites use this header to decide whether a request was issued by a script and then filter out the crawler. So how do we simulate browser access?
Use urllib.request.Request() to carry custom header information.

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'}
# instantiate a Request object; headers can be passed in at initialization
req = urllib.request.Request(url, headers=headers)
# urlopen() accepts not only a URL but also a Request object
data = urllib.request.urlopen(req)

How does urllib.request decide whether a request is a GET or a POST?
Looking at the urllib.request source code: if the request carries a data parameter it is sent as a POST, otherwise as a GET.
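A minimal POST sketch; httpbin.org/post is an echo service chosen here for illustration, not part of the original article:

import urllib.parse
import urllib.request

# urlencode the form fields, then encode to bytes: the data parameter must be bytes
data = urllib.parse.urlencode({'name': 'walker'}).encode('utf-8')
req = urllib.request.Request('http://httpbin.org/post', data=data)  # data present -> POST
print(urllib.request.urlopen(req).read().decode())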

Use of cookies

The standard pattern:

import http.cookiejar
import urllib.request

# create a CookieJar object
cookie_jar = http.cookiejar.CookieJar()
# use HTTPCookieProcessor to create a cookie handler and build an opener with it
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
# install the opener as the global urlopen; after installation urlopen saves cookies too
urllib.request.install_opener(opener)
# if you don't want to install it, open URLs directly with opener.open(url)
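A quick way to confirm that cookies were captured; the httpbin.org cookie endpoint is an assumption used for illustration, any site that sets cookies will do:

opener.open('http://httpbin.org/cookies/set?session=abc123')  # hypothetical cookie-setting URL
for cookie in cookie_jar:
    print(cookie.name, '=', cookie.value)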
Setting up a proxy

The standard pattern:

proxy = {'http': '183.232.188.18:80', 'https': '183.232.188.18:80'}  # proxy ip:port
# create a proxy handler
proxies = urllib.request.ProxyHandler(proxy)
# create an opener object
opener = urllib.request.build_opener(proxies, urllib.request.HTTPHandler)
urllib.request.install_opener(opener)
# if you don't want to install it, open URLs directly with opener.open(url);
# then the opener uses the proxy while plain urlopen does not
urllib.error

URLError is the parent class. It is raised when the network is down or the server does not exist, and it has a reason attribute but no code attribute.
HTTPError is a subclass. The server exists, but the address does not (for example, a 404); it has both code and reason attributes.

# url = 'http://www.adfdsfdsdsa.com'                    -> URLError
# url = 'https://jianshu.com/p/jkhsdhgjkasdhgjkadfhg'   -> HTTPError

How to tell them apart:

try:
    data = urllib.request.urlopen(url)
    print(data.read().decode())
except urllib.error.URLError as e:
    if hasattr(e, 'code'):
        print('HTTPError')
    elif hasattr(e, 'reason'):
        print('URLError')
urllib.parse

import urllib.parse

# urllib.parse.urljoin()   joins URLs; if the second argument is an absolute URL, it overrides the base
# urllib.parse.urlencode() turns a dictionary into a query string
# urllib.parse.quote()     percent-encodes a URL into ASCII; required when the URL contains Chinese
# urllib.parse.unquote()   decodes a percent-encoded URL

urlencode:

>>> from urllib import parse
>>> query = {'name': 'walker', 'age': 99}
>>> parse.urlencode(query)
'name=walker&age=99'

quote / quote_plus:

>>> from urllib import parse
>>> parse.quote('a&b/c')       # does not encode the slash
'a%26b/c'
>>> parse.quote_plus('a&b/c')  # encodes the slash
'a%26b%2Fc'

unquote / unquote_plus:

>>> from urllib import parse
>>> parse.unquote('1+2')       # does not decode the plus sign
'1+2'
>>> parse.unquote_plus('1+2')  # decodes the plus sign to a space
'1 2'

url = 'http://www.baidu.com/?wd=书包'  # 书包 ("schoolbag"): the query contains Chinese
urllib.request.urlopen(url)           # raises an encoding error
# to request correctly, encode the Chinese part of the URL first:
url = 'http://www.baidu.com/?wd={}'.format(urllib.parse.quote('书包'))
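urljoin() is the one helper above without a demonstration; a minimal sketch:

>>> from urllib import parse
>>> parse.urljoin('http://www.baidu.com/path/', 'page.html')  # a relative path is appended
'http://www.baidu.com/path/page.html'
>>> parse.urljoin('http://www.baidu.com/path/', 'https://www.jianshu.com/')  # an absolute URL overrides the base
'https://www.jianshu.com/'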
urllib3

import urllib3

# urllib3 is a separate third-party library built around connection pooling
http = urllib3.PoolManager()
r = http.request('GET', 'https://www.jianshu.com', redirect=False)  # turn off redirects
print(r.status)
