Sesame HTTP: Basic Principles of Crawlers
We can compare the Internet to a large web, with crawlers (web spiders) as the spiders moving across it. The nodes of the web are like web pages: when a spider reaches a node, it visits the page and obtains its information. The threads between nodes are like the links between pages: after passing one node, the spider can follow a thread to the next one. In other words, starting from one webpage it can keep reaching subsequent webpages, so that eventually every node on the web can be visited and the site's data can be crawled.
1. Crawler Overview
To put it simply, a crawler is an automated program that retrieves web pages, then extracts and stores information. The following is an overview of each step.
(1) Webpage Retrieval
A crawler first needs to obtain the webpage source code, which contains the page's useful information; once we have the source code, we can extract the information we want from it.
Here the concepts of request and response come in: when we send a request to a website's server, the body of the response it returns is the webpage source code. So the most important part is to construct a request, send it to the server, then receive the response and parse it. How can this process be implemented? Surely we are not going to copy the source code out of the browser by hand?
Don't worry. Python provides many libraries to help us do this, such as urllib and requests, which we can use to perform HTTP request operations. Both the request and the response can be represented by the data structures these libraries provide. After receiving a response, we only need to parse its body to obtain the webpage source code, so the whole retrieval step can be handled by a program.
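As a rough illustration (the URL here is only a placeholder), fetching a page with requests and reading its source code might look like this:

import requests

# Send an HTTP GET request; the URL is only a placeholder for illustration.
response = requests.get('https://example.com')

# The status code tells us whether the request succeeded (200 means OK).
print(response.status_code)

# response.text is the response body, i.e. the webpage source code.
print(response.text)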
(2) Information Extraction
After obtaining the webpage source code, we analyze it and extract the data we want. The most basic method is extraction with regular expressions. This is a universal approach, but constructing regular expressions is complicated and error-prone.
In addition, because webpages have a certain structure, there are libraries that extract information based on node attributes, CSS selectors, or XPath, such as Beautiful Soup, pyquery, and lxml. With these libraries we can quickly and efficiently extract webpage information, such as node attributes and text values.
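For instance, here is a minimal sketch using Beautiful Soup with a CSS selector; the HTML snippet is made up purely for illustration:

from bs4 import BeautifulSoup

# A tiny HTML snippet, invented only to demonstrate extraction.
html = '<div class="item"><a href="/page1">First page</a></div>'

soup = BeautifulSoup(html, 'html.parser')

# Select the link inside the div via a CSS selector, then read its attribute and text.
link = soup.select_one('.item a')
print(link['href'])      # /page1
print(link.get_text())   # First page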
Information extraction is a very important part of crawling: it turns messy data into something well organized, which makes later processing and analysis much easier.
(3) Saving Data
After extracting the information, we usually save it somewhere for later use. The data can be stored in many forms: as simple TXT or JSON text, in a database such as MySQL or MongoDB, or on a remote server, for example via SFTP.
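As a small sketch, assuming the items below stand in for data extracted earlier, the results could be written out as JSON text like this:

import json

# The items here are made up, standing in for data extracted earlier.
items = [
    {'title': 'First page', 'url': '/page1'},
    {'title': 'Second page', 'url': '/page2'},
]

# Write the data out as JSON text; ensure_ascii=False keeps non-ASCII characters readable.
with open('items.json', 'w', encoding='utf-8') as f:
    json.dump(items, f, ensure_ascii=False, indent=2)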
(4) Automated Programs
By automated program we mean that the crawler can perform these operations in place of a person. We could of course extract the information by hand, but when the amount of data is very large, or when we want to obtain a large amount of data quickly, we have to rely on a program. A crawler is exactly such an automated program that does the crawling work for us: during crawling it can handle exceptions, retry on errors, and so on, to keep the crawl running continuously and efficiently.
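A minimal sketch of such exception handling and retrying (the fetch function and its parameters are our own invention, not part of any library) might look like this:

import time
import requests

def fetch(url, retries=3):
    # Request a URL, retrying a few times if a network error occurs.
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()  # raise an error for 4xx/5xx status codes
            return response.text
        except requests.RequestException as error:
            print('Attempt', attempt, 'failed:', error)
            time.sleep(1)  # pause briefly before retrying
    return None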
2. What Data Can Be Crawled?
We can see all kinds of information on web pages. The most common type is the regular web page, which corresponds to HTML code, and the most common kind of crawling is crawling this HTML source code.
In addition, some web pages return not HTML code but a JSON string (most API interfaces use this format). Data in this format is easy to transmit and parse, it can also be crawled, and extracting data from it is even more convenient.
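As a sketch, with a purely hypothetical API URL, a JSON response can be parsed directly:

import requests

# A purely hypothetical API endpoint that returns JSON.
response = requests.get('https://api.example.com/data')

# response.json() parses the JSON string in the response body into Python objects.
data = response.json()
print(data)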
We can also see various kinds of binary data, such as images, videos, and audio. With a crawler we can fetch this binary data and save it to the corresponding files.
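For example, here is a sketch of downloading an image (the URL is a placeholder) and saving the binary data to a file:

import requests

# A placeholder image URL; any binary resource works the same way.
url = 'https://example.com/logo.png'
response = requests.get(url)

# response.content is the raw binary body; write it to a file in binary mode.
with open('logo.png', 'wb') as f:
    f.write(response.content)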
In addition, there are files with various extensions, such as CSS, JavaScript, and configuration files. As long as these files can be accessed in a browser, you can crawl them as well.
In fact, all of the content above corresponds to a URL and is served over HTTP or HTTPS, so a crawler can crawl this kind of data.
3. JavaScript-Rendered Pages
Sometimes, when we fetch a webpage with urllib or requests, the source code we get is actually different from what we see in the browser.
This is a very common problem. Nowadays more and more web pages are built with Ajax and front-end modular tools, and the whole page may be rendered by JavaScript, which means the original HTML code is just an empty shell, for example:
<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8">
  </head>
  <body>
    <div id="container">
    </div>
  </body>
  <script src="app.js"></script>
</html>
The body contains only a single node with the id container, but note that app.js is introduced after the body node, and it is responsible for rendering the whole site.
When you open this page in a browser, the HTML content is loaded first; the browser then notices that an app.js file is referenced and requests that file as well. After obtaining the file, the browser executes the JavaScript code in it, and the JavaScript modifies the HTML nodes and adds content to them, so that in the end the complete page is displayed.
However, when we request this page with a library such as urllib or requests, we only get that bare HTML code; the library does not go on to load the JavaScript file for us, so we do not see the content that the browser shows.
This also explains why sometimes the source code we get is different from what we see in the browser.
Therefore, the source code obtained with a basic HTTP request library may differ from the page source we see in the browser. In such cases, we can analyze the backend Ajax interface, or use a library such as Selenium or Splash to simulate JavaScript rendering.
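As an example of the second approach, here is a minimal Selenium sketch (it assumes Chrome and its matching driver are installed; the URL is a placeholder):

from selenium import webdriver

# Requires Chrome and its driver to be installed; the URL is a placeholder.
driver = webdriver.Chrome()
driver.get('https://example.com')

# page_source contains the HTML after JavaScript has run,
# unlike the bare response returned by urllib or requests.
html = driver.page_source
print(html)

driver.quit()

Because Selenium drives a real browser, it is slower than a plain HTTP request, but it sees the page exactly as a user would.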
Later, we will explain in detail how to crawl JavaScript-rendered web pages.
This section described some basic principles of crawlers, which will help us write crawlers later.