Python 3 Crawler Example (1): A Simple Web Crawl

Source: Internet
Author: User
Tags: rfc

Before You Start Crawling

Before you start writing a crawler, make sure that you have the knowledge you need.

First: Understand the relevant parts of the HTTP protocol

HTTP is the abbreviation of Hypertext Transfer Protocol. Its development was a collaboration between the World Wide Web Consortium (W3C) and the Internet Engineering Task Force (IETF), which eventually published a series of RFCs: RFC 1945 defines HTTP/1.0, and the most notable, RFC 2616, defines HTTP/1.1, the version commonly used today.
HTTP is the transfer protocol used to deliver hypertext from a WWW server to the local browser. It makes the browser more efficient and reduces network traffic. It not only ensures that hypertext documents are transferred correctly and quickly, but also determines which part of a document is transmitted and which content is displayed first (for example, text before graphics).
The request/response model of HTTP
In HTTP, the client always initiates the request, and the server returns the response.
This limits the protocol's use: when the client has not initiated a request, the server cannot push a message to it.
HTTP is a stateless protocol: there is no built-in correspondence between a request and the previous request from the same client.
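Statelessness is visible directly in urllib: each Request object starts with no memory of earlier requests, and any state (such as a cookie) must be attached by the client itself. A minimal sketch, where the session value is a made-up example:

```python
import urllib.request

# Each Request is built from scratch; urllib carries no state over
# from previous requests unless we add it ourselves.
first = urllib.request.Request("https://www.douban.com/")
second = urllib.request.Request("https://www.douban.com/")

# The second request knows nothing about the first:
print(first.headers)   # {}
print(second.headers)  # {}

# To carry state across requests, the client must resend it explicitly,
# e.g. as a Cookie header (the token below is purely illustrative).
second.add_header("Cookie", "session=abc123")
print(dict(second.header_items()))
```

This is exactly what cookie-handling layers automate: they store what the server sent and re-attach it to every subsequent request on the client's behalf.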
Workflow
An HTTP operation is called a transaction, and its working process can be divided into four steps:
1) First, the client and the server establish a connection. The HTTP work begins as soon as you click a hyperlink.
2) After the connection is established, the client sends a request to the server, consisting of the Uniform Resource Identifier (URI) and the protocol version number, followed by MIME-style information including request modifiers, client information, and possibly a request body.
3) When the server receives the request, it responds with a status line containing the protocol version and a success or error code, followed by MIME-style information including server information, entity metadata, and possibly a response body.
4) The client receives the information returned by the server, the browser displays it on the user's screen, and the client then disconnects from the server.
If an error occurs in any of these steps, an error message is returned to the client and displayed. For the user, these steps are handled by HTTP itself: the user just clicks the mouse and waits for the information to appear.
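The four steps above can be sketched with Python's standard library. To keep the example self-contained, it starts a tiny local HTTP server rather than contacting a real site; the handler class and page body are illustrative only:

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# A tiny local server so the whole transaction runs on one machine.
class Hello(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"<html><body>hello</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), Hello)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# 1) establish a connection
conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
# 2) send the request: method, path, and (implicitly) protocol + headers
conn.request("GET", "/index.html")
# 3) receive the status line, response headers, then the body
resp = conn.getresponse()
print(resp.status, resp.reason)       # 200 OK
print(resp.read().decode("utf-8"))
# 4) disconnect
conn.close()
server.shutdown()
```

Each `conn.request()`/`getresponse()` pair is one complete transaction in the sense described above.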

Second: Learn Python's urllib library

The Python 2 series used urllib and urllib2; Python 3 consolidates them into a single urllib package. What we need to learn is just a few common functions; details can be found in the official documentation.
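As a taste of the consolidated package, here is a short sketch of its main submodules (the URLs are examples, and the failing fetch at the end is deliberate, to show urllib.error in action):

```python
from urllib import parse, request, error

# urllib.parse: split URLs and build query strings (no network needed)
parts = parse.urlparse("https://www.douban.com/search?q=python")
print(parts.scheme, parts.netloc, parts.path)   # https www.douban.com /search

print(parse.urlencode({"q": "web crawler"}))    # q=web+crawler

# urllib.request provides urlopen()/Request for fetching pages;
# urllib.error defines URLError/HTTPError, raised when a fetch fails.
try:
    request.urlopen("https://invalid.invalid/", timeout=1)
except error.URLError as exc:
    print("fetch failed:", exc.reason)
```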

Third: Development tools

Python comes with IDLE, a very concise built-in editor and shell; PyCharm is an interactive Python IDE; and Fiddler is a web request monitoring tool that we can use to understand the detailed steps that occur when a user triggers a page request.

Simple web crawler

Code

"The first example: a simple web crawler crawling Watercress homepage" import urllib.request# URL url = "https://www.douban.com/" #请求request = urllib.request.Request (URL) #爬取结果response = urllib.request.urlopen (request) data = Response.read () #设置解码方式data = Data.decode (' Utf-8 ') # Printing results print (data) #打印爬取网页的各类信息print (type (response)) print (Response.geturl ()) print (Response.info ()) Print ( Response.getcode ())

Results

Part of the captured output is shown below:

