Python Practice Notes: Beginner, Intermediate, and Advanced Crawler Knowledge

Source: Internet
Author: User
Tags: IP

1 Beginner Crawler

(1) Web front-end knowledge: HTML, CSS, JavaScript, DOM, DHTML, Ajax, jQuery, JSON, etc.;

(2) Regular expressions: be able to extract the desired information from ordinary web pages, such as specific text and link targets; know what lazy and greedy matching are;
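
A minimal sketch of the greedy/lazy distinction with Python's re module (the HTML string is invented for illustration):

```python
import re

html = '<a href="/page1">first</a> <a href="/page2">second</a>'

# Greedy: .* consumes as much as possible, spanning both tags.
print(re.findall(r'<a.*>', html))
# ['<a href="/page1">first</a> <a href="/page2">second</a>']

# Lazy: .*? stops at the first '>' that lets the match succeed.
print(re.findall(r'<a.*?>', html))
# ['<a href="/page1">', '<a href="/page2">']

# Extracting the link targets with a lazy capture group.
print(re.findall(r'href="(.*?)"', html))
# ['/page1', '/page2']
```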

(3) Be able to use re, BeautifulSoup, XPath, and so on to extract information from nodes in the DOM structure;
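
A minimal sketch of the BeautifulSoup and XPath approaches side by side, assuming the beautifulsoup4 and lxml packages are installed; the HTML snippet is invented:

```python
from bs4 import BeautifulSoup
from lxml import etree

html = """
<html><body>
  <div class="item"><a href="/a">Link A</a></div>
  <div class="item"><a href="/b">Link B</a></div>
</body></html>
"""

# BeautifulSoup: navigate the DOM by tag and attribute.
soup = BeautifulSoup(html, "html.parser")
for a in soup.select("div.item a"):
    print(a["href"], a.get_text())

# XPath via lxml: the same nodes addressed by path expressions.
tree = etree.HTML(html)
for href in tree.xpath('//div[@class="item"]/a/@href'):
    print(href)
```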

(4) Know what the depth-first and breadth-first crawling algorithms are, and the rules of thumb for using each in practice;
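
The difference is only the order in which discovered links are visited: a FIFO queue gives breadth-first, a LIFO stack gives depth-first. A sketch with a hypothetical extract_links helper, stubbed out here:

```python
from collections import deque

def extract_links(url):
    # Hypothetical stub: a real crawler would fetch `url` and
    # return the links found on that page.
    return []

def bfs_crawl(seed, max_pages=100):
    seen = {seed}
    queue = deque([seed])          # FIFO -> breadth-first
    while queue and len(seen) < max_pages:
        url = queue.popleft()      # use queue.pop() for depth-first
        for link in extract_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

print(bfs_crawl("https://example.com"))
```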

(5) Be able to analyze the structure of a simple website, and use the urllib, urllib2, or requests libraries for simple data fetching;
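
A minimal fetch with requests (note that urllib2 exists only in Python 2; in Python 3 it was merged into urllib.request):

```python
import requests

resp = requests.get("https://example.com", timeout=10)
resp.raise_for_status()          # raise on 4xx/5xx responses
print(resp.status_code)          # 200
print(resp.text[:200])           # first 200 characters of the page
```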

2 Intermediate Crawler

(1) Understand what a hash is, and be able to use simple algorithms such as MD5 and SHA-1 to hash data for storage;
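
Hashing a page's content with hashlib gives a short fixed-length key that can stand in for the full text when storing or de-duplicating it:

```python
import hashlib

content = b"<html>...page body...</html>"

print(hashlib.md5(content).hexdigest())   # 32 hex characters
print(hashlib.sha1(content).hexdigest())  # 40 hex characters

# A typical use: key the stored page by its hash, so identical
# content is only written once.
store = {}
key = hashlib.sha1(content).hexdigest()
store.setdefault(key, content)
```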

(2) Be familiar with the basics of the HTTP and HTTPS protocols: understand the GET and POST methods, and understand HTTP header information, including return status codes, encoding, User-Agent, cookies, sessions, and so on;
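
These pieces of the protocol are easy to inspect from a requests response; httpbin.org is a public echo service used here for illustration:

```python
import requests

# GET: parameters travel in the query string.
r = requests.get("https://httpbin.org/get", params={"q": "python"}, timeout=10)
print(r.status_code)              # e.g. 200
print(r.encoding)                 # encoding requests inferred from the headers (may be None)
print(r.headers["Content-Type"])  # a response header
print(r.cookies.get_dict())       # cookies the server set, if any

# POST: data travels in the request body instead.
r = requests.post("https://httpbin.org/post", data={"user": "demo"}, timeout=10)
print(r.json()["form"])           # {'user': 'demo'}
```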

(3) Be able to set the User-Agent when crawling data, set up proxies, and so on;
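
Setting a browser-like User-Agent and routing through a proxy with requests; the proxy address below is a placeholder:

```python
import requests

headers = {
    # A browser-like User-Agent; many sites block the default
    # "python-requests/x.y" identifier.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
}
proxies = {
    # Placeholder address: substitute a proxy you actually control.
    "http": "http://127.0.0.1:8080",
    "https": "http://127.0.0.1:8080",
}

resp = requests.get("https://example.com", headers=headers,
                    proxies=proxies, timeout=10)
print(resp.status_code)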

(4) Know what a request is and what a response is; be able to use tools such as Fiddler and Wireshark to capture and analyze simple network packets. For dynamic crawling, learn to analyze Ajax requests, simulate POST packet requests, and capture client session information; for some simple websites, be able to log in automatically by replaying the captured packets;
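
After inspecting the login packet in Fiddler or the browser's network panel, the same POST can be replayed with a requests.Session, which carries the session cookie into later requests. The URL and field names below are hypothetical and must be read off the captured packet:

```python
import requests

session = requests.Session()

# Hypothetical URL and field names: copy the real ones from the
# POST packet captured in Fiddler / the browser's network panel.
payload = {"username": "demo", "password": "secret"}
resp = session.post("https://example.com/login", data=payload, timeout=10)
resp.raise_for_status()

# The session object now sends the login cookies automatically.
profile = session.get("https://example.com/profile", timeout=10)
print(profile.status_code)
```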

(5) For harder-to-handle websites, learn to use PhantomJS + Selenium to crawl dynamic web content;
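
Note that PhantomJS is no longer maintained and recent Selenium releases dropped support for it, so the sketch below drives headless Chrome instead; the principle is the same: let a real browser engine execute the JavaScript, then read the rendered DOM:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")      # no visible browser window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")
    # page_source holds the DOM *after* JavaScript has run,
    # unlike the raw HTML returned by requests.
    print(driver.page_source[:200])
finally:
    driver.quit()
```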

(6) Concurrent downloading: speed up data fetching by downloading in parallel;
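
A minimal parallel downloader with ThreadPoolExecutor; threads suit crawling well because the work is I/O-bound:

```python
import requests
from concurrent.futures import ThreadPoolExecutor

urls = [
    "https://example.com",
    "https://example.org",
    "https://example.net",
]

def fetch(url):
    resp = requests.get(url, timeout=10)
    return url, resp.status_code

# Five worker threads download in parallel instead of one by one.
with ThreadPoolExecutor(max_workers=5) as pool:
    for url, status in pool.map(fetch, urls):
        print(url, status)
```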

3 Advanced Crawler

(1) Be able to use Tesseract, Baidu AI, and other libraries for CAPTCHA recognition;
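
A minimal sketch assuming the Tesseract binary plus the pytesseract and Pillow packages are installed; captcha.png is a placeholder path (Baidu AI would be called through its web API instead):

```python
from PIL import Image
import pytesseract

# Placeholder path: an image of a simple text CAPTCHA.
image = Image.open("captcha.png")

# Convert to grayscale first: simple preprocessing often
# improves recognition of noisy CAPTCHA images.
text = pytesseract.image_to_string(image.convert("L"))
print(text.strip())
```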

(2) Be able to use data mining techniques, such as classification algorithms, to avoid dead links;
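
One reading of this item, sketched below with scikit-learn: train a classifier on URLs whose live/dead status is already known, then skip URLs predicted dead before wasting a request on them. The training data here is invented:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented training data: URLs labeled 1 (alive) or 0 (dead).
urls = [
    "https://example.com/article/123",
    "https://example.com/article/456",
    "https://example.com/tmp/old-page",
    "https://example.com/tmp/removed",
]
labels = [1, 1, 0, 0]

# Character n-grams capture path patterns like "/tmp/".
model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(3, 5)),
    LogisticRegression(),
)
model.fit(urls, labels)

print(model.predict(["https://example.com/tmp/another"]))  # likely [0]
```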

(3) Use common databases for data storage and querying, such as MongoDB and Redis (as a cache for large data volumes); cache downloads, and learn how to avoid duplicate downloads through caching and the use of Bloom filters;
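
A Bloom filter answers "have I seen this URL before?" in constant memory, with a small false-positive rate and no false negatives, which is what makes it work for de-duplicating downloads at scale. A minimal pure-Python sketch:

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive several bit positions by salting the hash with an index.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

seen = BloomFilter()
seen.add("https://example.com/page1")
print("https://example.com/page1" in seen)   # True
print("https://example.com/page2" in seen)   # almost certainly False
```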

(4) Be able to use machine learning techniques to dynamically adjust the crawler's crawling strategy, so as to avoid getting its IP banned, and so on;
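
A full learned policy is site-specific, but the core loop, adjusting the request rate from server feedback before a ban happens, can be sketched with a plain feedback rule standing in for the machine-learning model:

```python
import time
import requests

class AdaptiveFetcher:
    """Slow down when the server pushes back, speed up when it doesn't."""

    def __init__(self, min_delay=0.5, max_delay=60.0):
        self.delay = min_delay
        self.min_delay = min_delay
        self.max_delay = max_delay

    def get(self, url):
        time.sleep(self.delay)
        resp = requests.get(url, timeout=10)
        if resp.status_code in (403, 429):
            # Signs of rate limiting or a looming ban: back off sharply.
            self.delay = min(self.delay * 2, self.max_delay)
        else:
            # Success: creep back toward the minimum delay.
            self.delay = max(self.delay * 0.9, self.min_delay)
        return resp
```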

(5) Be able to use open-source frameworks such as Scrapy and Celery to build distributed crawlers, and be able to deploy and control them for large-scale data capture.
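
A minimal Scrapy spider against quotes.toscrape.com, the public demo site from the Scrapy documentation; Scrapy itself handles scheduling, retries, and concurrency:

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # CSS selectors matching the structure of the demo site.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow pagination; Scrapy schedules the request for us.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Saved as quotes_spider.py, it runs with `scrapy runspider quotes_spider.py -o quotes.json`. Distributed setups commonly replace the default scheduler with a shared Redis queue (scrapy-redis), or drive many such workers from a Celery task queue, so several machines can crawl the same frontier.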
