What is a crawler? A crawler is a program that fetches the content of a web page, parses it to extract useful data, and stores that data in a database.
Basic steps:
1. Get the page content by constructing a request so that the server believes a real browser is making it, then read the response. Python has many request libraries, such as urllib and requests; I personally prefer the requests library, which is very easy to get started with.
2. Parse the fetched content with regular expressions, bs4 (Beautiful Soup), XPath, or other parsing tools to extract the data you need.
3. Store the data in a database. The three most popular databases today, MySQL, MongoDB, and Redis, all have Python libraries for interacting with them (a combined sketch of these three steps follows this list).
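Below is a minimal sketch of the three steps, assuming a hypothetical target page at http://example.com whose title and links we want to keep. sqlite3 stands in for MySQL/MongoDB/Redis here only so the example is self-contained; swap in the driver for whichever database you actually use.

```python
import sqlite3

import requests
from bs4 import BeautifulSoup

URL = "http://example.com"  # placeholder target page

# Step 1: construct a browser-like request and fetch the page.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
response = requests.get(URL, headers=headers, timeout=10)
response.raise_for_status()

# Step 2: parse the HTML and pull out the data we care about.
soup = BeautifulSoup(response.text, "html.parser")
title = soup.title.string if soup.title else ""
links = [a.get("href") for a in soup.find_all("a") if a.get("href")]

# Step 3: persist the results to a database (sqlite3 as a stand-in).
conn = sqlite3.connect("crawl.db")
conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT, title TEXT, link TEXT)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?)",
    [(URL, title, link) for link in links],
)
conn.commit()
conn.close()
```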
Sessions and Cookies
What is a session? What is a cookie? Both relate to the HTTP protocol. Because HTTP is a stateless protocol, the server cannot tell by itself whether two successive requests come from the same user, which would force the same information to be sent over and over again. Sessions and cookies solve this. When a user logs in, the back-end server creates a session for that user, containing an ID that identifies the session, the user's login status, and the user's information, and returns the session ID to the client through the Set-Cookie header. The next time the client requests a page that requires login, the server checks the cookie the client sends; if it can find the corresponding session from that cookie, it then checks the user's login status. Sites usually set a session expiration time, and once the session expires you have to log in again.
In summary, the session is information stored on the server side and the cookie is information stored on the client side; together they are used to maintain the user's login state.
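A minimal sketch of how this looks from the crawler's side: a requests.Session automatically stores the cookies the server sets (including the session ID sent via Set-Cookie) and attaches them to later requests. The login URL and form field names below are hypothetical; adjust them for the real site.

```python
import requests

# The Session object keeps the cookies returned by the server and
# sends them with every subsequent request.
session = requests.Session()

# Hypothetical login endpoint and form fields.
login_url = "https://example.com/login"
session.post(login_url, data={"username": "alice", "password": "secret"})

# Because the session ID cookie is attached automatically, a page that
# requires login can now be fetched without logging in again.
profile = session.get("https://example.com/profile")
print(profile.status_code)
print(session.cookies.get_dict())  # inspect the cookies the server set
```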
Proxies
How does a proxy work? A proxy is really a proxy server: once we configure one, the proxy acts as the server from our point of view, while it in turn acts as a client toward the server we actually want to reach. It forwards our request, receives the response, and returns that response to our local client, which successfully hides our local IP address.
Why use a proxy? Some websites count, on the back end, how many times the same IP has accessed them within a given period; once the count passes a threshold they simply refuse service, which is what people mean when they say an IP has been banned. To prevent this, we need a proxy to hide our IP. When crawling data, if we keep switching proxies, the server loses track of us.
Common proxy setups: use free proxies found online, or use a paid proxy service.
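Here is a minimal sketch of setting a proxy with the requests library; the proxy address is a placeholder, so substitute a free or paid proxy you actually have access to.

```python
import requests

# Placeholder proxy address; replace with your own free or paid proxy.
proxies = {
    "http": "http://127.0.0.1:7890",
    "https": "http://127.0.0.1:7890",
}

# The request goes to the proxy server, which forwards it to the target
# site, so the target only sees the proxy's IP, not ours.
response = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=10)
print(response.text)  # shows the IP address the target site saw
```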