Python's simplest web crawler tutorial

When browsing the Web, we often come across attractive images that we would like to download and save, whether as desktop wallpaper or as design material. This article introduces how to implement the simplest web crawler in Python; readers who are interested can follow the walkthrough below.

Objective

A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically crawls information from the World Wide Web according to certain rules. I have recently taken a strong interest in Python crawlers and would like to share my learning path; suggestions are welcome, so we can exchange ideas and make progress together. Without further ado, let's look at the details:

1. Development tools

The tool I use is Sublime Text 3; it is small but powerful (perhaps men don't like that description), which fascinates me. I recommend it to everyone. Of course, if your computer is well configured, PyCharm may suit you better.

For setting up a Python development environment in Sublime Text 3, this article is recommended:

[Sublime build python development environment]

2. Introduction to crawlers

A crawler, as the name implies, is like a bug crawling across the huge web that is the Internet. Along the way, it can grab the content we want.

Since we are crawling on the Internet, we need to know about the URL, formally the "Uniform Resource Locator" and informally the "link". Its structure consists mainly of three parts:

(1) Protocol: for example, the HTTP protocol we commonly see in URLs.

(2) Domain name or IP address: a domain name such as www.baidu.com, or the IP address that the domain name resolves to.

(3) Path: a directory or file, etc.
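To make these three parts concrete, here is a minimal sketch using Python's urllib.parse module (the path /index.html is just an illustrative example):

from urllib import parse

# split a URL into its components
parts = parse.urlparse("http://www.baidu.com/index.html")
print(parts.scheme)  # protocol: 'http'
print(parts.netloc)  # domain name: 'www.baidu.com'
print(parts.path)    # path: '/index.html'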

3. Developing the simplest crawler with urllib

(1) Introduction to urllib

Module              Description
urllib.error        Exception classes raised by urllib.request.
urllib.parse        Parse URLs into components, or assemble URLs from components.
urllib.request      Extensible library for opening URLs.
urllib.response     Response classes used by urllib.
urllib.robotparser  Load a robots.txt file and answer questions about fetchability of other URLs.
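As a small aside, urllib.robotparser, the last module in the table, can be tried in just a few lines. This is a minimal sketch, and the robots.txt URL is only an illustrative example:

from urllib import robotparser

# load a site's robots.txt and ask whether a page may be fetched
rp = robotparser.RobotFileParser()
rp.set_url("http://www.baidu.com/robots.txt")
rp.read()
print(rp.can_fetch("*", "http://www.baidu.com/"))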

(2) Developing the simplest crawler

The Baidu homepage is simple and clean, which makes it very suitable for our crawler.

The crawler code is as follows:


from urllib import request

def visit_baidu():
    url = "http://www.baidu.com"
    # open the URL
    req = request.urlopen(url)
    # read the page
    html = req.read()
    # decode the page to utf-8
    html = html.decode("utf_8")
    print(html)

if __name__ == '__main__':
    visit_baidu()

Running it prints the HTML source of the Baidu homepage.


We can verify the output by right-clicking on the Baidu homepage, choosing "Inspect element", and comparing.

Of course, request can also construct a Request object, which can then be opened with the urlopen method.

The code is as follows:


from urllib import request

def visit_baidu():
    # create a Request object
    req = request.Request('http://www.baidu.com')
    # open the Request object
    response = request.urlopen(req)
    # read the response
    html = response.read()
    html = html.decode('utf-8')
    print(html)

if __name__ == '__main__':
    visit_baidu()

The result is the same as before.

(3) Error handling

Error handling is done through the urllib.error module, mainly with the URLError and HTTPError exceptions. HTTPError is a subclass of URLError, so an HTTPError can also be caught as a URLError.

An HTTPError can be identified by its code attribute, which holds the HTTP status code.

The code to handle HTTPError is as follows:


from urllib import request
from urllib import error

def err():
    url = "https://segmentfault.com/zzz"
    req = request.Request(url)
    try:
        response = request.urlopen(req)
        html = response.read().decode("utf-8")
        print(html)
    except error.HTTPError as e:
        print(e.code)

if __name__ == '__main__':
    err()

The run result:

404 is printed as the error code; you can look up the details of this status code yourself.

A URLError can be identified by its reason attribute, which describes the cause of the failure.

The code to handle URLError is as follows:


from urllib import request
from urllib import error

def err():
    url = "https://segmentf.com/"
    req = request.Request(url)
    try:
        response = request.urlopen(req)
        html = response.read().decode("utf-8")
        print(html)
    except error.URLError as e:
        print(e.reason)

if __name__ == '__main__':
    err()

The run result prints the reason for the failure.


To handle errors thoroughly, it is best to catch both exceptions in the code; after all, the more detailed, the clearer. Note that since HTTPError is a subclass of URLError, the HTTPError handler must come before the URLError handler; otherwise the URLError branch will match first, and a 404, for example, would be reported as Not Found rather than as the code 404.

The code is as follows:


from urllib import request
from urllib import error

# first method: catch HTTPError first, then URLError
def err():
    url = "https://segmentfault.com/zzz"
    req = request.Request(url)
    try:
        response = request.urlopen(req)
        html = response.read().decode("utf-8")
        print(html)
    except error.HTTPError as e:
        print(e.code)
    except error.URLError as e:
        print(e.reason)

You can change the URL to see the various error outputs.
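As noted above, HTTPError is a subclass of URLError, so a single URLError handler can also catch HTTP errors. The following is a minimal sketch of that alternative, reusing the example URLs from above; checking for a code attribute tells the two cases apart:

from urllib import request
from urllib import error

# second method: one URLError handler catches HTTPError too
def err(url):
    req = request.Request(url)
    try:
        response = request.urlopen(req)
        print(response.read().decode("utf-8"))
    except error.URLError as e:
        if hasattr(e, "code"):
            # an HTTPError: the server replied with an error status
            print(e.code)
        else:
            # a plain URLError: the request never got a valid reply
            print(e.reason)

if __name__ == '__main__':
    err("https://segmentfault.com/zzz")  # prints 404
    err("https://segmentf.com/")         # prints the failure reason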

Summary

That is all it takes to write the simplest web crawler in Python: open a URL with urllib.request, read and decode the response, and handle URLError and HTTPError. I hope this article is helpful for your study or work with Python.
