What is a crawler? A crawler is a program that fetches the content of a web page, parses it to extract useful data, and stores that data in a database.
Basic steps:
1. Get the page content by constructing a request so that the server believes a real browser is making it, then read the response. Python has many request libraries, such as urllib and requests; I personally prefer the requests library, which is very easy to get started with.
2. Parse the fetched content with regular expressions, bs4 (Beautiful Soup), XPath, or other parsing tools to extract the data you need.
3. Store the data in a database. The three most popular databases today, MySQL, MongoDB, and Redis, all have Python libraries for interacting with them (a combined sketch of these three steps follows this list).
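Below is a minimal sketch of the three steps, assuming a hypothetical target page at http://example.com whose title and links we want to keep. sqlite3 stands in for MySQL/MongoDB/Redis here only so the example is self-contained; swap in the driver for whichever database you actually use.

```python
import sqlite3

import requests
from bs4 import BeautifulSoup

URL = "http://example.com"  # placeholder target page

# Step 1: construct a browser-like request and fetch the page.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
response = requests.get(URL, headers=headers, timeout=10)
response.raise_for_status()

# Step 2: parse the HTML and pull out the data we care about.
soup = BeautifulSoup(response.text, "html.parser")
title = soup.title.string if soup.title else ""
links = [a.get("href") for a in soup.find_all("a") if a.get("href")]

# Step 3: persist the results to a database (sqlite3 as a stand-in).
conn = sqlite3.connect("crawl.db")
conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT, title TEXT, link TEXT)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?)",
    [(URL, title, link) for link in links],
)
conn.commit()
conn.close()
```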
Sessions and Cookies
What is a session? What is a cookie? Both relate to the HTTP protocol. Because HTTP is a stateless protocol, the server cannot tell by itself whether two successive requests come from the same user, which would force the same information to be sent over and over again. Sessions and cookies solve this. When a user logs in, the back-end server creates a session for that user, containing an ID that identifies the session, the user's login status, and the user's information, and returns the session ID to the client through the Set-Cookie header. The next time the client requests a page that requires login, the server checks the cookie the client sends; if it can find the corresponding session from that cookie, it then checks the user's login status. Sites usually set a session expiration time, and once the session expires you have to log in again.
In summary, the session is information stored on the server side and the cookie is information stored on the client side; together they are used to maintain the user's login state.
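A minimal sketch of how this looks from the crawler's side: a requests.Session automatically stores the cookies the server sets (including the session ID sent via Set-Cookie) and attaches them to later requests. The login URL and form field names below are hypothetical; adjust them for the real site.

```python
import requests

# The Session object keeps the cookies returned by the server and
# sends them with every subsequent request.
session = requests.Session()

# Hypothetical login endpoint and form fields.
login_url = "https://example.com/login"
session.post(login_url, data={"username": "alice", "password": "secret"})

# Because the session ID cookie is attached automatically, a page that
# requires login can now be fetched without logging in again.
profile = session.get("https://example.com/profile")
print(profile.status_code)
print(session.cookies.get_dict())  # inspect the cookies the server set
```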
Proxies
How does a proxy work? A proxy is really a proxy server: once we configure one, the proxy acts as the server from our point of view, while it in turn acts as a client toward the server we actually want to reach. It forwards our request, receives the response, and returns that response to our local client, which successfully hides our local IP address.
Why use a proxy? Some websites count, on the back end, how many times the same IP has accessed them within a given period; once the count passes a threshold they simply refuse service, which is what people mean when they say an IP has been banned. To prevent this, we need a proxy to hide our IP. When crawling data, if we keep switching proxies, the server loses track of us.
Common proxy setups: use free proxies found online, or use a paid proxy service.
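Here is a minimal sketch of setting a proxy with the requests library; the proxy address is a placeholder, so substitute a free or paid proxy you actually have access to.

```python
import requests

# Placeholder proxy address; replace with your own free or paid proxy.
proxies = {
    "http": "http://127.0.0.1:7890",
    "https": "http://127.0.0.1:7890",
}

# The request goes to the proxy server, which forwards it to the target
# site, so the target only sees the proxy's IP, not ours.
response = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=10)
print(response.text)  # shows the IP address the target site saw
```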