The urllib Library and URLError Exception Handling
What is the urllib library?
urllib is a package in Python's standard library for working with URLs.
Crawling web pages quickly with urllib
First, import the module:
import urllib.request
After importing, crawl a page:
file = urllib.request.urlopen("http://www.baidu.com")
There are three ways to read the response (see the sketch after this list):
1. file.read() reads the entire contents of the response and, unlike readlines(), assigns them to a single string variable.
2. file.readlines() also reads the entire contents but, unlike read(), stores them in a list of lines; this is the recommended way to read a whole page.
3. file.readline() reads one line of the response at a time.
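A minimal sketch comparing the three, using the same Baidu page as above; note that each response stream can only be consumed once, so a fresh urlopen() is used per method:
import urllib.request
data = urllib.request.urlopen("http://www.baidu.com").read()        # entire body as one bytes string
lines = urllib.request.urlopen("http://www.baidu.com").readlines()  # entire body as a list of byte lines
first = urllib.request.urlopen("http://www.baidu.com").readline()   # just the first line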
How to save a crawled web page locally as a web page file
1. First, crawl the web page and assign the crawled content to a variable.
2. Open a local file in binary write mode ("wb", since read() returns bytes) and give it a web file name such as *.html.
3. Write the variable from step 1 to the file.
4. Close the file.
fhandle=open(r"C:\Users\My\Desktop\git学习\1.html",‘wb‘)
fhandle.write(file.read())
fhandle.close()
The urlretrieve() function
urlretrieve() writes the fetched content directly to a file. Format: urllib.request.urlretrieve(url, filename=local file path)
urllib.request.urlretrieve("http://www.51cto.com/",filename=r"C:\Users\My\Desktop\git学习\2.html")
urlretrieve() creates a cache while it runs; the cache can be cleared with:
urllib.request.urlcleanup()
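As an aside, urlretrieve() also returns a (filename, headers) tuple, which is handy for confirming where the page landed; a small sketch reusing the 51cto URL above with a hypothetical relative path:
import urllib.request
filename, headers = urllib.request.urlretrieve("http://www.51cto.com/", filename="2.html")
print(filename)              # the path the page was saved to
urllib.request.urlcleanup()  # clear the cache created by urlretrieve()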
Use info() to get the header information of the current response: file.info()
Use getcode() to get the status code of the crawled page; a return value of 200 means success, any other value indicates a problem:
file.getcode()
Use geturl() to get the URL of the crawled page: file.geturl()
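Putting the three calls together on the file object returned by urlopen() earlier (a sketch; the actual output depends on the site):
print(file.info())     # response headers
print(file.getcode())  # 200 on success
print(file.geturl())   # the final URL, after any redirects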
In general, the URL standard only allows a subset of ASCII characters: digits, letters, and some symbols. Other characters, such as Chinese characters, do not conform to the URL standard, so using them in a URL causes problems, and URL encoding is needed to resolve this. For example, entering Chinese characters or symbols like ":" or "&" in a URL requires encoding.
To encode, use urllib.request.quote(); for example, let's encode "http://www.sina.com.cn":
urllib.request.quote("http://www.sina.com.cn")
Output: 'http%3A//www.sina.com.cn'
Decoding
urllib.request.unquote("http%3A//www.sina.com.cn")
Output: 'http://www.sina.com.cn'
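Note that the "/" characters stayed unencoded above: quote() treats "/" as safe by default. The function actually lives in urllib.parse (urllib.request merely re-exports it), and its safe parameter controls which characters are left alone:
import urllib.parse
urllib.parse.quote("http://www.sina.com.cn", safe="")  # 'http%3A%2F%2Fwww.sina.com.cn'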
Browser emulation --- the headers attribute
Sometimes we cannot crawl certain pages and receive a 403 error; this is because those sites have anti-crawler measures to prevent malicious harvesting of their data.
To crawl them anyway, we can set headers information so the request masquerades as a browser visit. To set the headers, first find a User-Agent string (your browser's developer tools show one).
There are two ways to make a crawler masquerade as a browser:
1. Use build_opener() to modify the headers
import urllib.request
url="网页地址"
headers=("User-Agent","浏览器里找到的User-Agent")
opener=urllib.request.build_opener()
opener.addheaders=[headers]
data=opener.open(url).read()
fhandler=open("存放网页的地址","wb")
fhandler.write(data)
fhandler.close()
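Optionally, the opener can also be installed globally with install_opener(), so that plain urlopen() calls send the same headers from then on (a sketch continuing from the opener built above):
urllib.request.install_opener(opener)
data=urllib.request.urlopen(url).read()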
2. Add headers using add_header()
import urllib.request
url="网页地址"
req=urllib.request.Request(url)
req.add_header('User-Agent',"the User-Agent string found in your browser")
data=urllib.request.urlopen(req).read()
Timeout settings
Sometimes, when visiting a web page, if the page does not respond for a long time, the system decides the request has timed out, i.e., the page cannot be opened.
import urllib.request
file=urllib.request.urlopen("page URL",timeout=5)  # timeout value in seconds
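When the timeout expires, urlopen() raises an exception; a minimal sketch of handling it, assuming http://www.baidu.com as the test page (a timeout during connect surfaces as urllib.error.URLError, one during reading as socket.timeout):
import socket
import urllib.request
import urllib.error

try:
    data = urllib.request.urlopen("http://www.baidu.com", timeout=1).read()
    print(len(data))
except (urllib.error.URLError, socket.timeout) as e:
    print("request failed or timed out:", e)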
HTTP Protocol Requests in Practice
There are 6 main types of HTTP protocol requests.
1. GET request: passes information through the URL. You can write the information directly into the URL, or submit it through a form; if a form is used, the form data is automatically converted into parameters in the URL address and passed along.
2. POST request: submits data to the server; it is a mainstream and relatively secure way of transmitting data. For example, login forms usually send their data via POST.
3. PUT request: asks the server to store a resource, usually specifying the storage location.
4. DELETE request: asks the server to delete a resource.
5. HEAD request: requests only the corresponding HTTP header information.
6. OPTIONS request: gets the request types supported by the current URL.
Beyond these, there are also TRACE and CONNECT requests; TRACE is mainly used for testing or diagnostics. A HEAD request sketch follows below.
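The less common methods can be issued with urllib through the method parameter of Request (available since Python 3.3); a sketch of a HEAD request against the Baidu page used earlier:
import urllib.request

req = urllib.request.Request("http://www.baidu.com", method="HEAD")
resp = urllib.request.urlopen(req)
print(resp.getcode())  # status code
print(resp.info())     # headers only; a HEAD response carries no body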
A detailed look at GET and POST
- GET request example analysis
import urllib.request
keywd="hello"
url="http://www.baidu.com/s?wd="+keywd
req=urllib.request.Request(url)
data=urllib.request.urlopen(req).read()
fhandle=open("文件存储地址","wb")
fhandle.write(data)
fhandle.close()
When the keyword contains Chinese characters, this code raises a UnicodeEncodeError, so the keyword must be URL-encoded first:
import urllib.request
url="http://www.baidu.com/s?wd="
key="韦玮老师"
key_code=urllib.request.quote(key)
url_all=url+key_code
req=urllib.request.Request(url_all)
data=urllib.request.urlopen(req).read()
fh=open("文件地址","wb")
fh.write(data)
fh.close()
Using GET requests, the idea is as follows:
1. Build the URL address, containing the field names and field values of the GET request, in the GET format: "http://address?fieldname1=fieldvalue1&fieldname2=fieldvalue2".
2. Build a Request object with this URL as its parameter.
3. Open the built Request object with urlopen().
4. Perform follow-up processing as required, such as reading the page content or writing it to a file.
- POST request example analysis
Using a POST request, the idea is as follows:
1. Set the URL address.
2. Build the form data and encode it with urllib.parse.urlencode().
3. Create the Request object, passing the URL address and the encoded data as parameters.
4. Use add_header() to add header information and simulate a browser crawl.
5. Use urllib.request.urlopen() to open the Request object, completing the transfer of the information.
6. Do subsequent processing, such as reading the content or writing it to a file.
import urllib.request
import urllib.parse
url="网页地址"
postdata=urllib.parse.urlencode({
    'name':'aaa',
    'pass':'bbb'}).encode('utf-8')  # encode the data with urlencode(), then convert it to utf-8 bytes with encode()
req=urllib.request.Request(url,postdata)
req.add_header('User-Agent','User-Agent string')
data=urllib.request.urlopen(req).read()
fd=open("存储地址",‘wb‘)
fd.write(data)
fd.close()
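For reference, urlencode() turns the dictionary into a standard form string before the utf-8 encoding step:
import urllib.parse
urllib.parse.urlencode({'name': 'aaa', 'pass': 'bbb'})  # returns 'name=aaa&pass=bbb'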
Proxy server Settings
# Proxy server list site: http://yum.iqianyue.com/proxy
def use_proxy(proxy_addr,url):
    import urllib.request
    proxy=urllib.request.ProxyHandler({'http':proxy_addr})
    opener=urllib.request.build_opener(proxy,urllib.request.HTTPHandler)
    urllib.request.install_opener(opener)
    data=urllib.request.urlopen(url).read().decode('utf-8')
    return data
proxy_addr="代理服务器地址加端口"
data=use_proxy(proxy_addr,"网页地址")
print(len(data))
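A note on the design: install_opener() makes the proxy global for every later urlopen() call. To keep the proxy local to a single request, skip install_opener() and open through the opener directly (a sketch of the alternative line inside use_proxy()):
data=opener.open(url).read().decode('utf-8')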
DebugLog in practice & the exception-handling tool --- URLError
import urllib.request
import urllib.error
try:
    urllib.request.urlopen("page URL")
except urllib.error.URLError as e:
    print(e.code)
    print(e.reason)
Error output:
403
Forbidden
A URLError can occur for several reasons:
1. The server cannot be connected to.
2. The remote URL does not exist.
3. There is no network connection.
4. An HTTPError was triggered (the 403 above is such a case). Since HTTPError is a subclass of URLError, the code can be changed to:
import urllib.request
import urllib.error
try:
    urllib.request.urlopen("page URL")
except urllib.error.HTTPError as e:
    print(e.code)
    print(e.reason)

Status code meanings:
200 OK: everything is normal
301 Moved Permanently: redirected to a new URL, permanently
302 Found: redirected to a temporary URL, not permanent
304 Not Modified: the requested resource has not been modified
400 Bad Request: illegal request
401 Unauthorized: the request is unauthorized
403 Forbidden: access is forbidden
404 Not Found: no matching page was found
500 Internal Server Error: an error occurred inside the server
501 Not Implemented: the server does not support the functionality required to fulfill the request
HTTPError cannot handle the first three URLError causes above, because they carry no status code; handling them with HTTPError alone raises an error. The robust pattern is to catch HTTPError first and fall back to URLError:
import urllib.request
import urllib.error
try:
    urllib.request.urlopen("http://blog.baidusss.net")
except urllib.error.HTTPError as e:
    print(e.code)
    print(e.reason)
except urllib.error.URLError as e:
    print(e.reason)
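An equivalent single-handler sketch: since HTTPError is a subclass of URLError, one except clause can cover both by checking for the optional attributes:
import urllib.request
import urllib.error

try:
    urllib.request.urlopen("http://blog.baidusss.net")
except urllib.error.URLError as e:
    if hasattr(e, "code"):
        print(e.code)    # present only when the error is an HTTPError
    if hasattr(e, "reason"):
        print(e.reason)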
Python Crawler Learning, Chapter 4