1 Beginner crawler
(1) Web front-end basics: HTML, CSS, JavaScript, the DOM, DHTML, Ajax, jQuery, JSON, etc.;
(2) Regular expressions: able to extract the desired information from ordinary web pages, such as specific text and link information, and understand the difference between lazy (non-greedy) and greedy matching;
(3) Able to use re, BeautifulSoup, XPath, and so on to extract information from nodes in the DOM tree;
(4) Know what depth-first and breadth-first crawling algorithms are, and how to apply them in practice;
(5) Able to analyze the structure of a simple website, and use the urllib, urllib2, or requests libraries for simple data fetching;
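The greedy-versus-lazy distinction in point (2) is easy to see on a small HTML snippet. The sketch below (using only the standard-library re module; the sample HTML is made up for illustration) shows how a greedy `.*` swallows everything up to the last quote, while the lazy `.*?` stops at the first one:

```python
import re

# A made-up snippet with two links.
html = '<a href="/page1">First</a> <a href="/page2">Second</a>'

# Greedy: .* matches as much as possible, so it runs to the LAST closing quote.
greedy = re.findall(r'href="(.*)"', html)

# Lazy: .*? matches as little as possible, stopping at the FIRST closing quote.
lazy = re.findall(r'href="(.*?)"', html)

print(greedy)  # ['/page1">First</a> <a href="/page2']
print(lazy)    # ['/page1', '/page2']
```

For extracting links and similar repeated fragments, the lazy form is almost always what you want.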
2 Intermediate crawler
(1) Understand what a hash is, and be able to use simple algorithms such as MD5 and SHA-1 to hash data for storage;
(2) Familiar with the basics of the HTTP and HTTPS protocols: understand the GET and POST methods, and HTTP header information, including response status codes, encodings, User-Agent, cookies, sessions, etc.;
(3) Able to set the User-Agent when crawling data, configure proxies, and so on;
(4) Know what a request and a response are; able to use tools such as Fiddler and Wireshark to capture and analyze simple network packets. For dynamic pages, learn to analyze Ajax requests, simulate constructing POST requests, and capture client session information; for some simple websites, able to log in automatically by simulating the request packets;
(5) For websites that are harder to handle, learn to use PhantomJS + Selenium to crawl dynamic web content;
(6) Concurrent downloading: speed up data fetching by downloading in parallel;
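Point (1) above, hashing data for storage, is commonly used in crawlers to deduplicate URLs. A minimal sketch with the standard-library hashlib (the helper names `url_fingerprint` and `should_fetch` are hypothetical, chosen for this example):

```python
import hashlib

def url_fingerprint(url: str) -> str:
    """Return a fixed-length MD5 hex digest used as a deduplication key."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()

seen = set()

def should_fetch(url: str) -> bool:
    """Return True only the first time a URL's fingerprint is seen."""
    fp = url_fingerprint(url)
    if fp in seen:
        return False
    seen.add(fp)
    return True

print(should_fetch("http://example.com/a"))  # True  (first visit)
print(should_fetch("http://example.com/a"))  # False (duplicate, skipped)
```

Storing the 32-character digest instead of the raw URL keeps the dedup keys uniform in size, which matters when the seen-set is persisted to a database such as Redis.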
3 Advanced crawler
(1) Able to use libraries such as Tesseract and Baidu AI for CAPTCHA recognition;
(2) Able to use data-mining techniques such as classification algorithms to avoid dead links;
(3) Use common databases for data storage and querying, such as MongoDB and Redis (caching for large data volumes); cache downloads, learn how to avoid repeated downloads through caching, and know how to use Bloom filters;
(4) Able to use machine-learning techniques to dynamically adjust the crawler's crawling strategy, so as to avoid getting its IP addresses banned, etc.;
(5) Able to use open-source frameworks such as Scrapy and Celery to build distributed crawlers, and to deploy and control them for large-scale data capture;
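The Bloom filter mentioned in point (3) trades a small false-positive rate for a big memory saving over a plain seen-set. A minimal pure-Python sketch (the class and its parameters are illustrative; production crawlers would use a tuned library implementation):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k bit positions per item in an m-bit array.
    Membership tests may give false positives, never false negatives."""

    def __init__(self, m: int = 1 << 16, k: int = 4):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, item: str):
        # Derive k indices by slicing one 16-byte MD5 digest into 4-byte chunks.
        digest = hashlib.md5(item.encode("utf-8")).digest()
        for i in range(self.k):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
bf.add("http://example.com/seen")
print("http://example.com/seen" in bf)  # True  (guaranteed: no false negatives)
print("http://example.com/new" in bf)   # almost certainly False
```

With one item inserted, only 4 of the 65,536 bits are set, so the false-positive probability for an unseen URL is negligible; it grows as the filter fills, which is the trade-off to size for.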
Python Practice Notes: beginner, intermediate and advanced crawler knowledge