Python Crawler: What You Should Know Before Learning Crawlers


This is the 14th article in the Python series; it introduces the basic principles of web crawlers.

When it comes to crawlers, we have to talk about web pages, because the crawlers we write target web pages: parsing a page and extracting its data is exactly what a crawler does.

Most web pages are built from three languages: HTML, CSS, and JavaScript. When we crawl data, we mostly extract it from the HTML and CSS.

So, before learning crawlers, we need to know the following things.

First, you need to understand how the client and the server communicate.

Every time we visit a page, we are actually sending a request to the server. After receiving the request, the server sends back a response. Together, this request/response exchange is what the HTTP protocol defines.

That is to say, the HTTP protocol is how our client (the browser) communicates with the server.

A request sent to the server can use one of eight HTTP methods: GET, POST, HEAD, PUT, OPTIONS, CONNECT, TRACE, and DELETE. Most of the time we use GET; the detailed operations will be covered later.

A response is the information the server returns to us: when we send a request, the server replies with the data we asked for.
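The request/response cycle above can be sketched with only Python's standard library. This is an illustrative example, not code from the article: we start a tiny local server in a background thread, then act as the client by sending a GET request and reading the response that comes back.

```python
# Minimal sketch of the HTTP request/response cycle (standard library only).
# The local server, port, and HTML body are made up for illustration.
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen


class EchoHandler(BaseHTTPRequestHandler):
    def do_GET(self):  # called whenever the server receives a GET request
        body = b"<html><body><h1>Hello</h1></body></html>"
        self.send_response(200)                      # status line of the response
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)                       # response body

    def log_message(self, *args):                    # silence console logging
        pass


server = HTTPServer(("127.0.0.1", 0), EchoHandler)   # port 0 = pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: one request goes out, one response comes back.
url = f"http://127.0.0.1:{server.server_port}/"
with urlopen(url) as response:                       # sends a GET request
    status = response.status
    html = response.read().decode()

server.shutdown()
print(status)   # 200
print(html)     # the HTML the server sent back
```

In a real crawler you would point the request at a remote URL (typically via the third-party `requests` library), but the exchange is the same: the client sends a request, the server answers with a status code and a body.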

Secondly, understand the basic structure of web pages

A webpage consists of three parts: the header, the content, and the footer.

We can open any webpage, such as the featured page of PMCAFF.

Then right-click and choose Inspect to view the page's source code. Look closely and you will find at least the following common tags:

  • <div>... </div> division (block container)
  • <li>... </li> list item
  • <p>... </p> paragraph
  • <h1>... </h1> heading
  • <img src=""> image
  • <a href="">... </a> link
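These tags are what a crawler actually walks through. As a minimal sketch using only the standard-library `HTMLParser` (the HTML snippet and link paths below are invented for illustration; BeautifulSoup, covered next, offers a much friendlier API for the same job):

```python
# Walk an HTML fragment and collect every href found in <a> tags.
# The page content here is a made-up example, not a real site.
from html.parser import HTMLParser

PAGE = """
<div>
  <h1>Featured Posts</h1>
  <p>Picked by the editors.</p>
  <ul>
    <li><a href="/post/1">First post</a></li>
    <li><a href="/post/2">Second post</a></li>
  </ul>
  <img src="/logo.png">
</div>
"""


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":                        # only link tags matter here
            self.links.extend(v for k, v in attrs if k == "href")


parser = LinkCollector()
parser.feed(PAGE)
print(parser.links)   # ['/post/1', '/post/2']
```

Extracting data is mostly a matter of finding the right tags (`<a>`, `<li>`, `<div>`, ...) and reading their attributes or text.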

Finally, before we can crawl, we need to learn how to parse webpages.

For that, we will use the BeautifulSoup library.

The next article will explain this in detail and use requests + BeautifulSoup to crawl real web page data.

Environment: Python 3.6; PyCharm 2016.2; macOS

----- End -----

Author: Du Wangdan, public account: Du Wangdan, Internet product manager.
