When browsing the Web, we often come across good-looking pictures that we would like to download and save, whether to use as desktop wallpaper or as design material. This article introduces how to implement the simplest web crawler in Python; friends who need it can refer to the details below.
Objective
A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically crawls information from the World Wide Web according to certain rules. I have recently taken a strong interest in Python crawlers, so I am sharing my learning path here; suggestions are welcome, and we can learn from each other and make progress together. Without further ado, let's look at the detailed introduction:
1. Development tools
The tool I use is Sublime Text 3; it is small and nimble (perhaps some of you won't like that phrase), which I find fascinating. I recommend it to everyone; of course, if your computer is well configured, PyCharm may be more suitable for you.
For setting up a Python development environment in Sublime Text 3, this article is recommended:
[Sublime build python development environment]
2. Crawler introduction
A crawler, as the name implies, is like an insect crawling over the big net that is the Internet; in this way, we can grab the content we want.
Since we crawl on the Internet, we need to know about the URL, formally the "Uniform Resource Locator" and nicknamed the "link". Its structure mainly consists of three parts:
(1) Protocol: for example, the HTTP protocol we commonly see in URLs.
(2) Domain name or IP address: a domain name such as www.baidu.com, or the IP address that the domain name resolves to.
(3) Path: the directory or file, etc. A short sketch that splits a URL into these parts follows below.
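To make these three parts concrete, here is a minimal sketch that splits an example URL with urllib.parse (one of the standard-library modules introduced in the next section); the URL itself is only an illustration:

from urllib import parse

# split an example URL into protocol, domain, and path
result = parse.urlparse("http://www.baidu.com/index.html")
print(result.scheme)   # protocol: 'http'
print(result.netloc)   # domain name: 'www.baidu.com'
print(result.path)     # path: '/index.html'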
3. Developing the simplest crawler with urllib
(1) Introduction to urllib
Module | Description
urllib.error | Exception classes raised by urllib.request.
urllib.parse | Parse URLs into components or assemble them from components.
urllib.request | Extensible library for opening URLs.
urllib.response | Response classes used by urllib.
urllib.robotparser | Load a robots.txt file and answer questions about fetchability of other URLs.
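As a quick illustration of the last module in the table, here is a minimal sketch, assuming the site serves a robots.txt file at the standard location, that asks urllib.robotparser whether a URL may be fetched:

from urllib import robotparser

# load the site's robots.txt and query whether crawling is allowed
rp = robotparser.RobotFileParser()
rp.set_url("http://www.baidu.com/robots.txt")
rp.read()
print(rp.can_fetch("*", "http://www.baidu.com/"))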
(2) Developing the simplest crawler
The Baidu homepage is simple and clean, which makes it very suitable for our crawler.
The crawler code is as follows:
from urllib import request

def visit_baidu():
    url = "http://www.baidu.com"
    # open the URL
    req = request.urlopen(url)
    # read the page
    html = req.read()
    # decode the page to utf-8
    html = html.decode("utf-8")
    print(html)

if __name__ == '__main__':
    visit_baidu()
The output is the HTML source of the Baidu homepage. You can verify this by right-clicking on the Baidu homepage, choosing "Inspect element", and comparing it with what the crawler printed.
Of course, request can also construct a Request object, which can then be opened with the urlopen method.
The code is as follows:
from urllib import request

def visits_baidu():
    # create a Request object
    req = request.Request('http://www.baidu.com')
    # open the Request object
    response = request.urlopen(req)
    # read the response
    html = response.read()
    html = html.decode('utf-8')
    print(html)

if __name__ == '__main__':
    visits_baidu()
The result is the same as before.
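A practical benefit of building a Request object, rather than passing the URL straight to urlopen, is that it can carry custom headers; some sites reject urllib's default client, so a browser-like User-Agent can help. A minimal sketch (the User-Agent string is just an illustrative value):

from urllib import request

def visit_baidu_with_headers():
    # attach a browser-like User-Agent header (illustrative value)
    headers = {'User-Agent': 'Mozilla/5.0'}
    req = request.Request('http://www.baidu.com', headers=headers)
    response = request.urlopen(req)
    html = response.read().decode('utf-8')
    print(html)

if __name__ == '__main__':
    visit_baidu_with_headers()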
(3) Error handling
Error handling is done through the urllib.error module, mainly with the URLError and HTTPError exceptions. HTTPError is a subclass of URLError, so an HTTPError can also be caught as a URLError.
An HTTPError can be identified by its code attribute.
The code to handle HTTPError is as follows:
from urllib import request
from urllib import error

def err():
    url = "https://segmentfault.com/zzz"
    req = request.Request(url)
    try:
        response = request.urlopen(req)
        html = response.read().decode("utf-8")
        print(html)
    except error.HTTPError as e:
        # print the HTTP status code
        print(e.code)

if __name__ == '__main__':
    err()
Run result:
404
The error code is printed; you can look up more details about this status code yourself.
A URLError can be identified by its reason attribute.
The code to handle URLError is as follows:
from urllib import request
from urllib import error

def err():
    url = "https://segmentf.com/"
    req = request.Request(url)
    try:
        response = request.urlopen(req)
        html = response.read().decode("utf-8")
        print(html)
    except error.URLError as e:
        # print the reason for the failure
        print(e.reason)

if __name__ == '__main__':
    err()
Run result: the reason for the failure is printed (for an unreachable domain, typically a name-resolution error).
To deal with errors thoroughly, it is best to catch both exceptions in the code; after all, the more detail, the clearer. Note that HTTPError is a subclass of URLError, so be sure to put the except clause for HTTPError before the one for URLError; otherwise the URLError branch will catch everything, and a 404, for example, would be output as Not Found rather than as its code.
The code is as follows:
from urllib import request
from urllib import error

# first method: catch both HTTPError and URLError
def err():
    url = "https://segmentfault.com/zzz"
    req = request.Request(url)
    try:
        response = request.urlopen(req)
        html = response.read().decode("utf-8")
        print(html)
    except error.HTTPError as e:
        print(e.code)
    except error.URLError as e:
        print(e.reason)
You can change the URL to see the various forms of error output.
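For example, here is a minimal sketch that tries both of the URLs used above and reports whichever error occurs (which error the second URL raises depends on your network):

from urllib import request
from urllib import error

def check(url):
    try:
        response = request.urlopen(url)
        print(url, "->", response.getcode())
    except error.HTTPError as e:
        print(url, "-> HTTP error", e.code)
    except error.URLError as e:
        print(url, "-> URL error:", e.reason)

if __name__ == '__main__':
    # one URL should trigger HTTPError (404), the other URLError (bad domain)
    for url in ["https://segmentfault.com/zzz", "https://segmentf.com/"]:
        check(url)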
Summary