The purpose of learning web crawlers:
1. You can build your own customized search engine, and in doing so gain a deep understanding of how search engines work.
2. In the era of big data, data analysis starts with data sources; learning to write crawlers lets us obtain more data.
3. Practitioners who understand how crawlers work can use that knowledge to optimize their own programs, for example by making their sites easier for search engines to index.
The composition of a web crawler
A web crawler consists of a control node, crawler nodes, and a resource database.
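To make that division of labor concrete, here is a minimal single-process sketch of the architecture: the control node schedules URLs, a crawler node downloads pages, and the crawled content goes into the resource database. The class names, the in-memory resource_db dictionary, and the seed URL are illustrative choices, not something prescribed by the article:

```python
import urllib.request
from collections import deque

class ControlNode:
    """Assigns URLs to crawler nodes and tracks what has been seen."""
    def __init__(self, seeds):
        self.frontier = deque(seeds)   # URLs waiting to be crawled
        self.seen = set(seeds)         # avoid crawling the same URL twice

    def next_url(self):
        return self.frontier.popleft() if self.frontier else None

    def report(self, new_urls):
        """Receive links discovered by a crawler node."""
        for url in new_urls:
            if url not in self.seen:
                self.seen.add(url)
                self.frontier.append(url)

class CrawlerNode:
    """Downloads a page and hands the raw HTML back."""
    def fetch(self, url):
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")

# Resource database: a simple in-memory stand-in (url -> page content).
resource_db = {}

control = ControlNode(["https://example.com/"])  # hypothetical seed URL
worker = CrawlerNode()
url = control.next_url()
if url is not None:
    resource_db[url] = worker.fetch(url)
```

In a real deployment the control node and crawler nodes would run as separate processes or machines, and the resource database would be persistent storage rather than a dictionary.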
Types of crawlers
1. General-purpose web crawler: also known as a whole-web crawler; it crawls target resources across the entire web.
2. Focused web crawler: mainly used to crawl specific kinds of information, usually to serve a specific group of users.
3. Incremental web crawler: "incremental" means incremental updating, i.e. when updating, only the parts that have changed are re-crawled and unchanged parts are skipped, so an incremental crawler ensures, as far as possible, that the pages it fetches are up to date (see the sketch after this list).
4. Deep web crawler: web pages can be divided, by how they are stored, into surface pages and deep pages. A surface page is a static page that can be reached through static links without submitting a form; a deep page is one that can only be reached by submitting a form or certain keywords.
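One common way to implement the incremental behavior described in type 3 is to fingerprint each page and only re-store it when the fingerprint changes. The sketch below assumes a simple in-memory page_fingerprints store and SHA-256 hashing; both are illustrative choices rather than the article's prescription:

```python
import hashlib
import urllib.request

# Hypothetical store of url -> hash of the last crawled version.
page_fingerprints = {}

def fetch(url):
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read()

def crawl_if_changed(url):
    """Re-download a page and keep it only if its content changed."""
    body = fetch(url)
    digest = hashlib.sha256(body).hexdigest()
    if page_fingerprints.get(url) == digest:
        return None                 # unchanged: skip, as an incremental crawler would
    page_fingerprints[url] = digest
    return body                     # new or changed: process/store this version
```

A production crawler would typically persist the fingerprints and also use HTTP validators such as ETag or Last-Modified headers to avoid downloading unchanged pages at all.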
Python in action: a web crawler
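As a starting point for the hands-on part, a minimal breadth-first crawler might look like the following. The example.com seed, the page limit, and the regex-based link extraction are all simplifications for illustration; a real crawler would use an HTML parser and respect robots.txt:

```python
import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

def crawl(seed, max_pages=10):
    """Breadth-first crawl starting from a seed URL."""
    frontier, seen, pages = deque([seed]), {seed}, {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue                        # skip pages that fail to download
        pages[url] = html
        # Naive link extraction; an HTML parser is more robust in practice.
        for href in re.findall(r'href="([^"#]+)"', html):
            link = urljoin(url, href)
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages

if __name__ == "__main__":
    for url in crawl("https://example.com/"):
        print(url)
```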