Knowledge Points of the Urllib Library for Web Crawlers


Urllib Library

Urllib is one of the most basic network request libraries in Python. It can mimic a browser's behavior, sending requests to a specified server and saving the data the server returns.

urlopen()

In Python 3's urllib library, all methods related to network requests are collected in the urllib.request module. Here is the most basic way to use the urlopen() method:

from urllib import request

resp = request.urlopen('https://www.baidu.com')
print(resp.read())

After the above code runs, the terminal prints the source code of the Baidu home page.

The urlopen() method takes two commonly used parameters:

    • url: the URL to request
    • data: the data to send with the request; if this value is set, the request becomes a POST request

The return value of urlopen() is an http.client.HTTPResponse object, which is a file-like object. It has methods such as read(size), readline(), readlines(), and getcode().
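
For example, here is a minimal sketch of a POST request that also exercises these file-like methods. It uses http://httpbin.org/post (a public request-echo service, not part of the original example) as the test endpoint:

from urllib import request, parse

# passing `data` turns the request into a POST; data must be URL-encoded bytes
data = parse.urlencode({'word': 'hello'}).encode('utf-8')
resp = request.urlopen('http://httpbin.org/post', data=data)

print(resp.getcode())   # HTTP status code, e.g. 200
print(resp.readline())  # read a single line of the response body
print(resp.read(100))   # read at most 100 more bytes of the body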

urlretrieve()

This method makes it easy to save a file from a web page to the local disk. The following code conveniently downloads the Baidu homepage to a local file:

from urllib import request

request.urlretrieve('http://www.baidu.com/', 'baidu.html')
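
urlretrieve() also accepts an optional reporthook callback that is invoked as the download progresses, which is handy for larger files. A minimal sketch (the progress function here is an illustrative addition, not part of the original example):

from urllib import request

# reporthook receives (blocks transferred so far, block size in bytes, total file size);
# total size is -1 when the server does not send a Content-Length header
def progress(block_num, block_size, total_size):
    if total_size > 0:
        percent = min(100.0, block_num * block_size * 100.0 / total_size)
        print('downloaded %.1f%%' % percent)

request.urlretrieve('http://www.baidu.com/', 'baidu.html', reporthook=progress)
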
urlencode()

If a URL contains Chinese or other special characters, the browser encodes them for us automatically. For a crawler to simulate the browser, the encoding has to be done manually, which is what the urlencode() method is for. It converts dictionary data into URL-encoded data. For example:

from urllib import parse, request

url = 'https://www.baidu.com/s'
params = {"wd": "教学设计"}
qs = parse.urlencode(params)
url = url + "?ie=UTF-8&" + qs
print(qs)
print(url)
resp = request.urlopen(url)
print(resp.read())
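
A side note beyond the original example: urlencode() works on dictionaries, while parse.quote() percent-encodes a single string (and parse.quote_plus() additionally turns spaces into '+'). A minimal sketch:

from urllib import parse

print(parse.quote('教学设计'))           # %E6%95%99%E5%AD%A6%E8%AE%BE%E8%AE%A1
print(parse.quote_plus('hello world'))  # hello+world
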
parse_qs()

This method decodes encoded URL parameters. For example:

from urllib import parse

params = {'name': '木易', 'age': '21', 'sex': 'male'}
qs = parse.urlencode(params)
result = parse.parse_qs(qs)
print(result)  # note: parse_qs returns each value as a list, e.g. {'name': ['木易'], ...}
urlparse() and urlsplit()

These two methods split a URL into its components. You can get all the components at once, or take out each part individually. For example:

from urllib import parse

url = 'https://baike.baidu.com/item/hello%20world/85501?fr=aladdin#2_13'
result = parse.urlparse(url)
print(result)  # print all components at once

# take out each part separately
print('scheme:', result.scheme)
print('netloc:', result.netloc)
print('path:', result.path)
print('params:', result.params)  # urlsplit() does not return this item
print('query:', result.query)
print('fragment:', result.fragment)

urlparse() and urlsplit() are almost identical; the one difference is that urlparse() has a params attribute, while urlsplit() does not. For example, given the URL "https://www.baidu.com/s;hello?wd=python&username=abc#1", urlparse() can get "hello" from params, while urlsplit() cannot.
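
A quick sketch verifying this difference with the URL above:

from urllib import parse

url = 'https://www.baidu.com/s;hello?wd=python&username=abc#1'

r1 = parse.urlparse(url)
print(r1.path, r1.params)  # '/s' 'hello' -- the part after ';' becomes params

r2 = parse.urlsplit(url)
print(r2.path)             # '/s;hello' -- urlsplit() leaves it inside path
# r2 has no .params attribute at all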

The request.Request Class

If you want to disguise the crawler, you need to add request headers to the request, which requires the request.Request class. For example, the following code requests Lagou (拉勾网):

from urllib import request, parse

url = 'https://www.lagou.com/jobs/positionAjax.json?px=default&gx=%E5%85%A8%E8%81%8C&needAddtionalResult=false&isSchoolJob=1'

# The User-Agent and Referer values here were captured by visiting the site
# in Chrome and inspecting the request with F12 (DevTools); so was the data below.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
    'Referer': 'https://www.lagou.com/jobs/list_Go?px=default&gx=%E5%85%A8%E8%81%8C&gj=&isSchoolJob=1&city=%E5%85%A8%E5%9B%BD'
}
data = {
    'first': 'true',
    'pn': 1,
    'kd': 'python'
}
req = request.Request(url, headers=headers, data=parse.urlencode(data).encode('utf-8'), method='POST')
resp = request.urlopen(req)
print(resp.read().decode('utf-8'))
ProxyHandler (Proxy Settings)

Many websites detect how many times a given IP visits them (through traffic statistics, system logs, etc.) and will block an IP whose number of visits looks abnormal. To work around this, we can set up proxy servers and switch proxies regularly; even if one IP is banned, we can switch to another and keep crawling data.
In urllib, proxies are used through ProxyHandler. The following code shows how to build a custom opener that goes through a proxy:

from urllib import request

# without a proxy
url = 'http://httpbin.org/ip'
resp = request.urlopen(url)
print(resp.read())

# with a proxy
# 1. Create a handler with ProxyHandler, passing in the proxy
handler = request.ProxyHandler({"https": "183.163.40.223:31773"})
# 2. Build an opener from the handler created above
opener = request.build_opener(handler)
# 3. Use the opener to send a request
resp = opener.open(url)
print(resp.read())
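
If you want every subsequent request to go through the proxy without calling opener.open() each time, request.install_opener() makes the opener the global default. A minimal sketch, reusing the handler from above:

from urllib import request

handler = request.ProxyHandler({"https": "183.163.40.223:31773"})
opener = request.build_opener(handler)

# install_opener() registers this opener as the global default, so plain
# request.urlopen() calls will also be routed through the proxy from now on
request.install_opener(opener)
resp = request.urlopen('http://httpbin.org/ip')
print(resp.read())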

You can check your machine's public (external) IP address via http://httpbin.org/ip.
Commonly used proxy providers include:
- Xici free proxy IP: http://www.xicidaili.com/
- Kuaidaili: https://www.kuaidaili.com/
- Daili Yun: http://www.dailiyun.com/
