1 Beginner crawler
(1) Web front-end basics: HTML, CSS, JavaScript, the DOM, DHTML, Ajax, jQuery, JSON, etc.;
(2) Regular expressions: able to extract the desired information from ordinary web pages, such as specific text and link information, and understand the difference between lazy (non-greedy) and greedy matching;
(3) Able to use re, BeautifulSoup, XPath, and so on to extract information from nodes in the DOM tree;
(4) Know what depth-first and breadth-first crawling algorithms are, and how to apply them in practice;
(5) Able to analyze the structure of a simple website, and use the urllib, urllib2, or requests libraries for simple data fetching;
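The greedy-versus-lazy distinction in point (2) is easy to see on a small HTML snippet. The sketch below (using only the standard-library re module; the sample HTML is made up for illustration) shows how a greedy `.*` swallows everything up to the last quote, while the lazy `.*?` stops at the first one:

```python
import re

# A made-up snippet with two links.
html = '<a href="/page1">First</a> <a href="/page2">Second</a>'

# Greedy: .* matches as much as possible, so it runs to the LAST closing quote.
greedy = re.findall(r'href="(.*)"', html)

# Lazy: .*? matches as little as possible, stopping at the FIRST closing quote.
lazy = re.findall(r'href="(.*?)"', html)

print(greedy)  # ['/page1">First</a> <a href="/page2']
print(lazy)    # ['/page1', '/page2']
```

For extracting links and similar repeated fragments, the lazy form is almost always what you want.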
2 Intermediate crawler
(1) Understand what a hash is, and be able to use simple algorithms such as MD5 and SHA-1 to hash data for storage;
(2) Familiar with the basics of the HTTP and HTTPS protocols: understand the GET and POST methods, and HTTP header information, including response status codes, encodings, User-Agent, cookies, sessions, etc.;
(3) Able to set the User-Agent when crawling data, configure proxies, and so on;
(4) Know what a request and a response are; able to use tools such as Fiddler and Wireshark to capture and analyze simple network packets. For dynamic pages, learn to analyze Ajax requests, simulate constructing POST requests, and capture client session information; for some simple websites, able to log in automatically by simulating the request packets;
(5) For websites that are harder to handle, learn to use PhantomJS + Selenium to crawl dynamic web content;
(6) Concurrent downloading: speed up data fetching by downloading in parallel;
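Point (1) above, hashing data for storage, is commonly used in crawlers to deduplicate URLs. A minimal sketch with the standard-library hashlib (the helper names `url_fingerprint` and `should_fetch` are hypothetical, chosen for this example):

```python
import hashlib

def url_fingerprint(url: str) -> str:
    """Return a fixed-length MD5 hex digest used as a deduplication key."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()

seen = set()

def should_fetch(url: str) -> bool:
    """Return True only the first time a URL's fingerprint is seen."""
    fp = url_fingerprint(url)
    if fp in seen:
        return False
    seen.add(fp)
    return True

print(should_fetch("http://example.com/a"))  # True  (first visit)
print(should_fetch("http://example.com/a"))  # False (duplicate, skipped)
```

Storing the 32-character digest instead of the raw URL keeps the dedup keys uniform in size, which matters when the seen-set is persisted to a database such as Redis.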
3 Advanced crawler
(1) Able to use libraries such as Tesseract and Baidu AI for CAPTCHA recognition;
(2) Able to use data-mining techniques such as classification algorithms to avoid dead links;
(3) Use common databases for data storage and querying, such as MongoDB and Redis (caching for large data volumes); cache downloads, learn how to avoid repeated downloads through caching, and know how to use Bloom filters;
(4) Able to use machine-learning techniques to dynamically adjust the crawler's crawling strategy, so as to avoid getting its IP addresses banned, etc.;
(5) Able to use open-source frameworks such as Scrapy and Celery to build distributed crawlers, and to deploy and control them for large-scale data capture;
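The Bloom filter mentioned in point (3) trades a small false-positive rate for a big memory saving over a plain seen-set. A minimal pure-Python sketch (the class and its parameters are illustrative; production crawlers would use a tuned library implementation):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k bit positions per item in an m-bit array.
    Membership tests may give false positives, never false negatives."""

    def __init__(self, m: int = 1 << 16, k: int = 4):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, item: str):
        # Derive k indices by slicing one 16-byte MD5 digest into 4-byte chunks.
        digest = hashlib.md5(item.encode("utf-8")).digest()
        for i in range(self.k):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
bf.add("http://example.com/seen")
print("http://example.com/seen" in bf)  # True  (guaranteed: no false negatives)
print("http://example.com/new" in bf)   # almost certainly False
```

With one item inserted, only 4 of the 65,536 bits are set, so the false-positive probability for an unseen URL is negligible; it grows as the filter fills, which is the trade-off to size for.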
Python Practice Notes: beginner, intermediate and advanced crawler knowledge