Today we are going to cover some basic conceptual knowledge of Python. Many people who have just come into contact with Python have the same questions: what is a Python crawler? And why is Python called the crawler language?
What is a Python crawler?
Before getting into the article, we first need to know what a crawler is. A crawler, or web crawler, can be understood as a spider crawling across the internet: if the Internet is likened to a large web, the crawler is the spider crawling on that web, and whenever it encounters its prey (the resources it needs), it grabs them. For example, while crawling one web page it may discover a path, which is in fact a hyperlink to another web page, and it can then crawl over to that page to get more data.
Because Python is a scripting language, it is easy to set up, its string handling is very flexible, and it has a rich set of web crawling modules, so the two are often linked together. A Python crawler engineer starts from one page of a website (usually the homepage), reads the contents of that page, finds the other link addresses in it, and then uses those addresses to find the next page, looping like this until all the pages of the site have been crawled. If you treat the entire Internet as one website, a web spider can use this principle to crawl down every page on the Internet.
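As a minimal sketch of that loop, the following uses only Python 3's standard library; the start URL is a placeholder, and the naive regex for finding links stands in for the proper HTML parsing a real crawler would do:

```python
# A minimal sketch of the crawl loop described above, using only the
# Python 3 standard library. The start URL is a placeholder.
import re
import urllib.request
from urllib.parse import urljoin

start_url = "https://example.com/"   # hypothetical starting page
to_visit = [start_url]
visited = set()

while to_visit:
    url = to_visit.pop()
    if url in visited:
        continue
    visited.add(url)
    try:
        # Download the page and decode it as text.
        with urllib.request.urlopen(url, timeout=10) as resp:
            html = resp.read().decode("utf-8", errors="replace")
    except Exception:
        continue  # skip pages that fail to download
    # Find hyperlinks (a naive regex; real crawlers use an HTML parser).
    for href in re.findall(r'href="([^"]+)"', html):
        link = urljoin(url, href)
        if link.startswith(start_url) and link not in visited:
            to_visit.append(link)  # queue the next page to crawl

print(f"Crawled {len(visited)} pages")
```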
Crawlers can capture the content of a website or an app and extract useful value from it. They can also simulate a user's actions in a browser or app to automate those operations. All of the following can be implemented with crawlers:
Ticket-grabbing tools
Automated voting tools
Forecasting (stock market forecasts, box office forecasts)
Public sentiment analysis
Social relationship network mapping
As mentioned above, a crawler generally refers to the crawling of network resources. Because Python is a scripting language that is not only easy to configure but also handles text very flexibly, and because it has a rich set of web crawling modules, the two are often linked together. That is why Python is so strongly identified with crawlers.
Why is Python called the crawler language?
As a programming language, Python is free software, and it is loved by programmers for its concise, clear syntax and its mandatory use of whitespace for indentation. For example, to complete a given task, C might take 1,000 lines of code, Java 100 lines, and Python only 20. Doing programming tasks in Python means writing less code; the code is shorter and more readable, so a team develops more quickly and works more efficiently.
This makes it a great language for developing web crawlers. Python's interfaces for fetching web documents are more concise than those of other static languages, and its urllib2 package offers a more complete API for accessing web documents than other dynamic scripting languages. In addition, Python has excellent third-party packages that implement page crawling efficiently and can filter a page's tags with very little code.
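For instance, fetching a document takes only a few lines. Note that urllib2 is the Python 2 name; in Python 3 the same functionality lives in urllib.request, which this sketch uses (the URL and header are placeholders):

```python
# Fetching a web document in a few lines with urllib.request,
# the Python 3 successor to urllib2.
import urllib.request

req = urllib.request.Request(
    "https://example.com/",                 # placeholder URL
    headers={"User-Agent": "Mozilla/5.0"},  # some sites reject bare clients
)
with urllib.request.urlopen(req, timeout=10) as resp:
    print(resp.status)      # HTTP status code
    print(resp.read(200))   # first 200 bytes of the document
```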
A Python crawler's architecture consists of three parts (a short sketch of all three follows the list):
1. URL manager: manages the set of URLs waiting to be crawled and the set of URLs already crawled, and hands URLs to be crawled to the web page downloader;
2. Web page downloader: downloads the page behind a URL, stores it as a string, and sends it to the web page parser;
3. Web page parser: parses out the valuable data, stores it, and at the same time adds newly discovered URLs back to the URL manager.
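Here is one way those three components might look in code; the class and method names are illustrative assumptions, not a standard API:

```python
# A sketch of the three crawler components described above.
import re
import urllib.request
from urllib.parse import urljoin

class UrlManager:
    """Tracks which URLs are waiting to be crawled and which are done."""
    def __init__(self):
        self.new_urls = set()   # URLs waiting to be crawled
        self.old_urls = set()   # URLs already crawled

    def add(self, url):
        if url and url not in self.old_urls:
            self.new_urls.add(url)

    def has_next(self):
        return bool(self.new_urls)

    def next(self):
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url

class Downloader:
    """Fetches the page behind a URL and returns it as a string."""
    def fetch(self, url):
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")

class Parser:
    """Extracts the 'valuable data' (here: the page title) and new links."""
    def parse(self, base_url, html):
        links = [urljoin(base_url, h) for h in re.findall(r'href="([^"]+)"', html)]
        m = re.search(r"<title>(.*?)</title>", html, re.S)
        return links, (m.group(1) if m else None)
```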
The crawler's workflow is then as follows:
The crawler asks the URL manager whether there is a URL left to crawl. If there is, the scheduler passes it to the downloader, which downloads the URL's content and hands it, again through the scheduler, to the parser. The parser extracts the valuable data and a list of new URLs; the new URLs go back to the URL manager, while the valuable data is passed through the scheduler to the application, which outputs it.
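Wired together, with a plain while loop standing in for the scheduler, the workflow might look like this (this snippet reuses the UrlManager, Downloader, and Parser classes from the sketch above):

```python
# The workflow above, wired together. Assumes the UrlManager,
# Downloader, and Parser classes from the previous sketch.
manager, downloader, parser = UrlManager(), Downloader(), Parser()
manager.add("https://example.com/")            # placeholder root URL

while manager.has_next():
    url = manager.next()                       # take an uncrawled URL from the manager
    try:
        html = downloader.fetch(url)           # downloader fetches the page content
    except Exception:
        continue                               # skip pages that fail to download
    links, title = parser.parse(url, html)     # parser extracts data and new URLs
    for link in links:
        manager.add(link)                      # new URLs go back to the URL manager
    print(url, "->", title)                    # the application outputs the valuable data
```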
Python is a very suitable programming language for developing web crawlers. It provides modules such as urllib, re, json, and pyquery, and there are also many mature frameworks, such as the Scrapy framework and the PySpider crawler system. Python itself is simple and convenient to use, so it is the preferred programming language for web crawlers!
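To give a feel for how much of the plumbing above a framework like Scrapy takes over, here is a hypothetical minimal spider; the spider name, start URL, and selectors are all illustrative (run it with `scrapy runspider titles_spider.py`):

```python
# A hypothetical minimal Scrapy spider: the framework handles the URL
# manager, downloader, and scheduler, leaving only the parsing to us.
import scrapy

class TitleSpider(scrapy.Spider):
    name = "titles"
    start_urls = ["https://example.com/"]   # placeholder start page

    def parse(self, response):
        # The "valuable data": each page's title.
        yield {"url": response.url, "title": response.css("title::text").get()}
        # Follow every link on the page; Scrapy schedules and dedupes them.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

Hopefully this article gives a little help to friends who have just come into contact with the Python language.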