I recently wrote a crawler and want to optimize it. Scrapy + Redis looks like a good way to implement a distributed crawler, so today I am learning Redis.
One: A brief introduction to Redis
Redis is a high-performance key-value database. It stores data in memory, so reads and writes are very fast.
What it has in common with other key-value databases:
1. It supports data persistence: after a restart, the data can be loaded and used again;
2. It supports not only simple key-value (string) data, but also data structures such as lists, sets, zsets, and hashes;
3. It supports data backup.
Where it differs:
1. Redis has richer data structures and provides atomic operations on them;
2. Redis runs in memory but can be persisted to disk, so the amount of data cannot exceed physical memory; on the other hand, operating on complex data structures in memory is much simpler than doing the same on disk.
Redis data types
It supports five data types: string, hash, list, set, and zset (sorted set).
The string type is binary safe, which means a string can contain any data;
A hash is a collection of key-value pairs, i.e. a mapping table of string fields to string values; hashes are particularly suitable for storing objects;
A list is a simple list of strings, ordered by insertion order;
A set is an unordered collection of strings implemented with a hash table, so adding, removing, and looking up elements are all O(1). The SADD command adds a string element to the set stored at the given key; it returns 1 on success and 0 if the element is already in the set;
A zset differs from a set in that each element is associated with a double-precision score, which Redis uses to sort the members of the set from small to large. The command for adding an element is ZADD.
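A minimal sketch with the redis-py client to exercise the five types (this assumes a local Redis instance on the default port; the key names are made up for illustration):

import redis

# Connect to a local Redis instance (assumed host/port/db).
r = redis.Redis(host='localhost', port=6379, db=0, decode_responses=True)

# String: binary safe, can hold any data.
r.set('page:title', 'Redis notes')
print(r.get('page:title'))

# Hash: field -> value mapping, good for storing objects.
r.hset('user:1', mapping={'name': 'alice', 'age': '20'})
print(r.hgetall('user:1'))

# List: strings kept in insertion order.
r.rpush('tasks', 'url1', 'url2', 'url3')
print(r.lrange('tasks', 0, -1))

# Set: SADD returns 1 for a new element, 0 if it is already in the set.
print(r.sadd('seen_urls', 'url1'))   # 1
print(r.sadd('seen_urls', 'url1'))   # 0

# Zset: each member carries a double score; members are sorted from small to large.
r.zadd('ranking', {'url1': 1.0, 'url2': 2.5})
print(r.zrange('ranking', 0, -1, withscores=True))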
Redis Key (Key)
The basic syntax for the Redis key command is: command key_name
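For example, common key commands all follow that pattern (a small redis-py sketch; the key name is illustrative and a local Redis is assumed):

import redis

r = redis.Redis(decode_responses=True)  # assumes a local Redis on the default port

r.set('page:title', 'Redis notes')
print(r.exists('page:title'))   # EXISTS page:title -> 1
print(r.type('page:title'))     # TYPE page:title   -> string
r.expire('page:title', 60)      # EXPIRE page:title 60  (expire after 60 seconds)
print(r.ttl('page:title'))      # TTL page:title    -> remaining seconds
r.delete('page:title')          # DEL page:title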
Redis HyperLogLog
HyperLogLog is an algorithm used for cardinality statistics. Its advantage is that even when the number or volume of input elements is very large, the space required to compute the cardinality stays small and fixed.
What is cardinality? For the dataset {1, 3, 5, 7, 5, 7, 8}, the cardinality set is {1, 3, 5, 7, 8}, so the cardinality is 5. Cardinality estimation means quickly computing the cardinality within an acceptable margin of error.
The relevant commands are:
PFADD: adds the specified elements to a HyperLogLog;
PFCOUNT: returns the cardinality estimate for the given HyperLogLog(s);
PFMERGE: merges multiple HyperLogLogs into one.
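A small redis-py sketch of these three commands (key names are illustrative, and a local Redis is assumed):

import redis

r = redis.Redis(decode_responses=True)

# PFADD: add elements to two HyperLogLogs.
r.pfadd('visitors:monday', 'user1', 'user2', 'user3')
r.pfadd('visitors:tuesday', 'user2', 'user4')

# PFCOUNT: cardinality estimate (approximate number of distinct elements).
print(r.pfcount('visitors:monday'))    # ~3

# PFMERGE: merge several HyperLogLogs into one, then count the union.
r.pfmerge('visitors:week', 'visitors:monday', 'visitors:tuesday')
print(r.pfcount('visitors:week'))      # ~4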
Two: About distributed crawlers
How Redis serves as the hub of a distributed crawler: the URLs obtained by all crawlers are placed in one Redis queue, and every crawler takes its requests from that single queue. A Scrapy crawler defaults to breadth-first search. Suppose there are now two crawlers; how is the distribution achieved? The specific steps are as follows:
First run crawler A: the engine requests the links in spider A's start_urls and hands them to the scheduler, then the engine asks the scheduler for URLs to crawl and passes them to the downloader; the downloaded responses are returned to the spider, which extracts new links according to the defined rules and hands them back to the scheduler through the engine.
Then start crawler B: B's start_urls go to the same scheduler (the shared Redis queue) as A's, and B's engine also requests URLs to crawl. The scheduler dispatches to B the URLs that A has not yet downloaded, so A and B download the remaining links in parallel until everything is finished.
By default Scrapy-redis uses the SpiderPriorityQueue, a queue implemented with a sorted set that is neither FIFO nor LIFO.
Each time you re-crawl, the data stored in Redis should be emptied, otherwise it will interfere with the crawler.
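For example, assuming the default scrapy-redis key names ('<spider>:requests', '<spider>:dupefilter', '<spider>:start_urls') and a spider called myspider (a hypothetical name), the leftover keys can be cleared roughly like this:

import redis

r = redis.Redis(host='localhost', port=6379)  # the master's Redis, assumed local here

# Default scrapy-redis key names for a spider named "myspider"; adjust to your project.
for key in ('myspider:requests', 'myspider:dupefilter', 'myspider:start_urls'):
    r.delete(key)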
The difference between a request and a URL: a request is constructed by the spider; the spider returns requests to the Scrapy engine, which delivers them to the scheduler. A URL is either defined in the spider (e.g. in start_urls) or extracted by the spider from responses.
A crawler is a spider in Scrapy's architecture; the spider's role is to provide start_urls, analyze the downloaded responses to get the desired content, and continue extracting URLs from them.
If you use Python + Redis + another database to implement distributed crawler storage, Redis is used only for URL storage and is not involved in storing the crawled data itself. On each slave, set the address from which Scrapy-redis fetches URLs to the master's address; although there are multiple slaves, URLs are only fetched from the Redis database on the master server. Thanks to Scrapy-redis's own queue mechanism, the links the slaves obtain do not conflict with one another. After finishing its crawl task, each slave aggregates its data to the server (the data store is no longer Redis; it can be MongoDB or MySQL, etc.).
For an existing Scrapy project, the steps to extend it into a distributed one are as follows:
1. Find a high-performance server for Redis queue maintenance and data storage;
2. Extend the Scrapy program so that it gets its start_urls from the server's Redis, and rewrite the data-storage part of the pipeline so that the storage address points to the server (see the sketch after this list);
3. Write some scripts that generate URLs on the server, and run them on a schedule.
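A rough sketch of what steps 2 and 3 can look like with scrapy-redis; the spider name, redis_key, and the master address are placeholders, while the setting names are the ones provided by the scrapy-redis project:

# settings.py on every slave: take requests from the master's Redis instead of the local scheduler.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True                      # keep the queue between runs
REDIS_URL = "redis://master-server:6379"      # placeholder master address

# myspider.py: a RedisSpider reads its start URLs from a Redis list instead of start_urls.
from scrapy_redis.spiders import RedisSpider

class MySpider(RedisSpider):
    name = "myspider"
    redis_key = "myspider:start_urls"   # the list the URL-generating script pushes to

    def parse(self, response):
        # extract items / follow links here
        pass

# push_urls.py, run periodically on the server (step 3): feed new start URLs into the queue.
import redis

r = redis.Redis(host="master-server", port=6379)
for url in ("http://example.com/page1", "http://example.com/page2"):
    r.lpush("myspider:start_urls", url)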
Ways to avoid being blocked by anti-crawling measures
1. Set DOWNLOAD_DELAY, although this reduces the crawler's efficiency;
2. Randomly generate the user_agent, or rewrite the middleware so that the program gets a random user_agent each time it runs;
3. Set up a proxy IP pool (a sketch follows this list);
4. Set the Domain and Host in the request headers.
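For the proxy IP pool in point 3, one common approach is a small downloader middleware that attaches a random proxy to each request; a sketch, with the proxy addresses and the middleware path as placeholders:

# middlewares.py: pick a random proxy for every request (the proxies below are placeholders).
import random

PROXIES = [
    "http://111.111.111.111:8080",
    "http://222.222.222.222:3128",
]

class RandomProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(PROXIES)

# settings.py: enable it (the module path depends on your project layout).
# DOWNLOADER_MIDDLEWARES = {
#     "myproject.middlewares.RandomProxyMiddleware": 543,
# }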
On the method of using a random user-agent
Add the following to settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'crawler.comm.rotate_useragent.RotateUserAgentMiddleware': 400,
}
Then add a user_agent list in the corresponding crawler code, as follows:
user_agent_list = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
]
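The RotateUserAgentMiddleware registered in settings.py above is not shown in the original post; a minimal sketch of what it could look like follows, assuming user_agent_list is defined in the same module and using the old scrapy.contrib import path to match the settings above (newer Scrapy versions expose it as scrapy.downloadermiddlewares.useragent):

import random

from scrapy.contrib.downloadermiddleware.useragent import UserAgentMiddleware

class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # Pick a random entry from user_agent_list (defined above) for every request.
        ua = random.choice(user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)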
Redis Primary Knowledge