Knowledge Points of the Urllib Library for Web Crawlers


Urllib Library

Urllib is one of the most basic network request libraries in Python. It can mimic a browser's behavior, sending requests to a specified server and saving the data the server returns.

urlopen()

In Python 3's urllib library, all methods related to network requests are collected in the urllib.request module. Here is the most basic way to use the urlopen() method:

from urllib import request

resp = request.urlopen('https://www.baidu.com')
print(resp.read())

After the above code runs, the terminal prints the source code of the Baidu home page.

The urlopen() method takes two commonly used parameters:

    • url: the URL to request
    • data: the data to send with the request; if this value is set, the request becomes a POST request

The return value of urlopen() is an http.client.HTTPResponse object, which is a file-like object. It has methods such as read(size), readline(), readlines(), and getcode().
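
For example, here is a minimal sketch of a POST request that also exercises these file-like methods. It uses http://httpbin.org/post (a public request-echo service, not part of the original example) as the test endpoint:

from urllib import request, parse

# passing `data` turns the request into a POST; data must be URL-encoded bytes
data = parse.urlencode({'word': 'hello'}).encode('utf-8')
resp = request.urlopen('http://httpbin.org/post', data=data)

print(resp.getcode())   # HTTP status code, e.g. 200
print(resp.readline())  # read a single line of the response body
print(resp.read(100))   # read at most 100 more bytes of the body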

urlretrieve()

This method makes it easy to save a file from a web page to the local disk. The following code conveniently downloads the Baidu homepage to a local file:

from urllib import request

request.urlretrieve('http://www.baidu.com/', 'baidu.html')
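
urlretrieve() also accepts an optional reporthook callback that is invoked as the download progresses, which is handy for larger files. A minimal sketch (the progress function here is an illustrative addition, not part of the original example):

from urllib import request

# reporthook receives (blocks transferred so far, block size in bytes, total file size);
# total size is -1 when the server does not send a Content-Length header
def progress(block_num, block_size, total_size):
    if total_size > 0:
        percent = min(100.0, block_num * block_size * 100.0 / total_size)
        print('downloaded %.1f%%' % percent)

request.urlretrieve('http://www.baidu.com/', 'baidu.html', reporthook=progress)
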
urlencode()

If a URL contains Chinese or other special characters, the browser encodes them for us automatically. For a crawler to simulate the browser, the encoding has to be done manually, which is what the urlencode() method is for. It converts dictionary data into URL-encoded data. For example:

from urllib import parse, request

url = 'https://www.baidu.com/s'
params = {"wd": "教学设计"}
qs = parse.urlencode(params)
url = url + "?ie=UTF-8&" + qs
print(qs)
print(url)
resp = request.urlopen(url)
print(resp.read())
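
A side note beyond the original example: urlencode() works on dictionaries, while parse.quote() percent-encodes a single string (and parse.quote_plus() additionally turns spaces into '+'). A minimal sketch:

from urllib import parse

print(parse.quote('教学设计'))           # %E6%95%99%E5%AD%A6%E8%AE%BE%E8%AE%A1
print(parse.quote_plus('hello world'))  # hello+world
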
parse_qs()

This method decodes encoded URL parameters. For example:

from urllib import parse

params = {'name': '木易', 'age': '21', 'sex': 'male'}
qs = parse.urlencode(params)
result = parse.parse_qs(qs)
print(result)  # note: parse_qs returns each value as a list, e.g. {'name': ['木易'], ...}
urlparse() and urlsplit()

These two methods split a URL into its components. You can get all the components at once, or take out each part individually. For example:

from urllib import parse

url = 'https://baike.baidu.com/item/hello%20world/85501?fr=aladdin#2_13'
result = parse.urlparse(url)
print(result)  # print all components at once

# take out each part separately
print('scheme:', result.scheme)
print('netloc:', result.netloc)
print('path:', result.path)
print('params:', result.params)  # urlsplit() does not return this item
print('query:', result.query)
print('fragment:', result.fragment)

urlparse() and urlsplit() are almost identical; the one difference is that urlparse() has a params attribute, while urlsplit() does not. For example, given the URL "https://www.baidu.com/s;hello?wd=python&username=abc#1", urlparse() can get "hello" from params, while urlsplit() cannot.
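
A quick sketch verifying this difference with the URL above:

from urllib import parse

url = 'https://www.baidu.com/s;hello?wd=python&username=abc#1'

r1 = parse.urlparse(url)
print(r1.path, r1.params)  # '/s' 'hello' -- the part after ';' becomes params

r2 = parse.urlsplit(url)
print(r2.path)             # '/s;hello' -- urlsplit() leaves it inside path
# r2 has no .params attribute at all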

The request.Request Class

If you want to disguise the crawler, you need to add request headers to the request, which requires the request.Request class. For example, the following code requests Lagou (拉勾网):

from urllib import request, parse

url = 'https://www.lagou.com/jobs/positionAjax.json?px=default&gx=%E5%85%A8%E8%81%8C&needAddtionalResult=false&isSchoolJob=1'

# The User-Agent and Referer values here were captured by visiting the site
# in Chrome and inspecting the request with F12 (DevTools); so was the data below.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
    'Referer': 'https://www.lagou.com/jobs/list_Go?px=default&gx=%E5%85%A8%E8%81%8C&gj=&isSchoolJob=1&city=%E5%85%A8%E5%9B%BD'
}
data = {
    'first': 'true',
    'pn': 1,
    'kd': 'python'
}
req = request.Request(url, headers=headers, data=parse.urlencode(data).encode('utf-8'), method='POST')
resp = request.urlopen(req)
print(resp.read().decode('utf-8'))
ProxyHandler (Proxy Settings)

Many websites detect how many times a given IP visits them (through traffic statistics, system logs, etc.) and will block an IP whose number of visits looks abnormal. To work around this, we can set up proxy servers and switch proxies regularly; even if one IP is banned, we can switch to another and keep crawling data.
In urllib, proxies are used through ProxyHandler. The following code shows how to build a custom opener that goes through a proxy:

from urllib import request

# without a proxy
url = 'http://httpbin.org/ip'
resp = request.urlopen(url)
print(resp.read())

# with a proxy
# 1. Create a handler with ProxyHandler, passing in the proxy
handler = request.ProxyHandler({"https": "183.163.40.223:31773"})
# 2. Build an opener from the handler created above
opener = request.build_opener(handler)
# 3. Use the opener to send a request
resp = opener.open(url)
print(resp.read())
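
If you want every subsequent request to go through the proxy without calling opener.open() each time, request.install_opener() makes the opener the global default. A minimal sketch, reusing the handler from above:

from urllib import request

handler = request.ProxyHandler({"https": "183.163.40.223:31773"})
opener = request.build_opener(handler)

# install_opener() registers this opener as the global default, so plain
# request.urlopen() calls will also be routed through the proxy from now on
request.install_opener(opener)
resp = request.urlopen('http://httpbin.org/ip')
print(resp.read())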

You can check your machine's public (external) IP address via http://httpbin.org/ip.
Commonly used proxy providers include:
- Xici free proxy IP: http://www.xicidaili.com/
- Kuaidaili: https://www.kuaidaili.com/
- Daili Yun: http://www.dailiyun.com/
