Python crawler: what you should know before learning to write a crawler
This is the 14th article in the Python series, and it introduces the principles behind web crawlers.
When we talk about crawlers, we have to talk about web pages, because the crawlers we write are aimed at web pages: parsing a page and extracting its data is exactly what a crawler does.
Most web pages are built from three languages: HTML, CSS, and JavaScript. When we crawl data, most of what we extract comes from the HTML (and, to a lesser extent, the CSS).
So, before learning crawlers, we need to know the following things.
First, understand how the client and the server interact.
Every time we open a page, the browser actually sends a request to the server. After receiving it, the server sends back a response. Together, these two actions make up the HTTP protocol.
In other words, HTTP is the way our client (the browser) communicates with the server.
HTTP defines eight request methods: GET, POST, HEAD, PUT, OPTIONS, CONNECT, TRACE, and DELETE. Most of the time we use GET; the details will be covered in a later article.
The response is the information the server returns to us: when we send a request, the server replies with the data we asked for.
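The request and response cycle described above can be seen in a few lines of Python. This is an illustrative sketch: it starts a tiny local server (so the example runs without touching the real internet) and then sends a GET request to it with the standard library.

```python
import http.server
import threading
import urllib.request

# A minimal local server standing in for a real website,
# so the request/response cycle is visible end to end.
class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"<html><body><h1>hello</h1></body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging

server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# The client sends a GET request; the server answers with a response.
url = f"http://127.0.0.1:{server.server_port}/"
with urllib.request.urlopen(url) as resp:
    status = resp.status          # 200 means the request succeeded
    html = resp.read().decode()   # the response body is the page's HTML

server.shutdown()
print(status)
print(html)
```

Against a real site you would simply pass the site's URL to `urlopen` (or to the Requests library, used later in this series); the request/response mechanics are the same.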
Secondly, understand the basic structure of web pages
A webpage generally consists of three parts: a header, a content area, and a footer.
We can open any webpage, such as the featured page of PMCAFF.
Then right-click and choose Inspect to view the page's source code. Looking closely, the common tags include at least the following:
- <div>...</div> division (a block-level container)
- <li>...</li> list item
- <p>...</p> paragraph
- <h1>...</h1> heading
- <img src=""> image
- <a href="">...</a> link
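To see how these tags fit together, here is a small hypothetical page using the tags listed above, walked with the standard library's `html.parser` to list the page's structure and pull out its links (the page content and URLs are invented for illustration):

```python
from html.parser import HTMLParser

# A tiny page built from the common tags listed above (invented content).
page = """
<div>
  <h1>Featured</h1>
  <p>Intro paragraph.</p>
  <ul>
    <li><a href="/post/1">First post</a></li>
    <li><a href="/post/2">Second post</a></li>
  </ul>
  <img src="cover.png">
</div>
"""

# Record every opening tag, and collect the href of each <a> tag.
class TagCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == "a":
            self.links.append(dict(attrs).get("href"))

collector = TagCollector()
collector.feed(page)
print(collector.tags)   # the page's structure, tag by tag
print(collector.links)  # every link found on the page
```

Extracting the `href` of each `<a>` tag, as done here, is the core move of a crawler: those links are the pages it visits next.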
Finally, before we can crawl, we need to learn how to parse webpages.
For that, we will learn to use BeautifulSoup.
The details will be explained in the next article, which will use Requests + BeautifulSoup to crawl real web page data.
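As a preview of that workflow, here is a hedged sketch of the parsing step. A real crawl would fetch the HTML with Requests; here the HTML is inlined (invented content) so the example runs without a network connection:

```python
from bs4 import BeautifulSoup

# Stand-in for the HTML that requests.get(url).text would return.
html = """
<div class="feed">
  <h1>Featured</h1>
  <li><a href="/post/1">First post</a></li>
  <li><a href="/post/2">Second post</a></li>
</div>
"""

# Parse the page, then extract the text and URL of every link.
soup = BeautifulSoup(html, "html.parser")
titles = [a.get_text() for a in soup.find_all("a")]
links = [a["href"] for a in soup.find_all("a")]

print(titles)
print(links)
```

`find_all` selects every matching tag, `get_text()` strips the markup, and indexing with `["href"]` reads an attribute; these three operations cover most everyday scraping.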
Operating environment: Python 3.6; PyCharm 2016.2; computer: Mac
----- End -----
Author: Du Wangdan, public account: Du Wangdan, Internet product manager.