This article describes a DNS resolution caching method for Python crawlers. Using a concrete example, it shows how to cache DNS lookups through the socket module and points out what to watch for along the way.
It is shared here for your reference, as follows:
Objective:
This is the core code of the DNS resolution cache module in a Python crawler. It was written last year, and I am publishing it now for anyone who is interested.
In general, resolving a domain name through DNS takes 10 to 60 milliseconds. That may seem trivial, but it is not negligible for a large crawler. Suppose we crawl Sina Weibo and send 10 million requests to the same domain (which is not a large number for such a crawl). At 10 to 60 milliseconds per lookup, DNS resolution alone costs 100,000 to 600,000 seconds, while a day has only 86,400 seconds. In other words, DNS resolution by itself would take roughly one to seven days, so adding a DNS resolution cache makes an obvious difference.
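The estimate can be checked with a few lines of arithmetic. This is only a rough sketch: it assumes every request triggers its own lookup, that lookups run one after another, and that nothing else (such as the operating system resolver) caches results. The function name total_dns_seconds is made up for illustration.

```python
# Rough cost of uncached DNS lookups for a large crawl.
REQUESTS = 10000000          # 10 million requests to the same domain
SECONDS_PER_DAY = 86400


def total_dns_seconds(requests, lookup_ms):
    """Total time spent on DNS if every request does one lookup."""
    return requests * lookup_ms / 1000.0


low = total_dns_seconds(REQUESTS, 10)    # 10 ms per lookup
high = total_dns_seconds(REQUESTS, 60)   # 60 ms per lookup

print("%.0f to %.0f seconds" % (low, high))
print("%.1f to %.1f days" % (low / SECONDS_PER_DAY, high / SECONDS_PER_DAY))
```

With a cache, the same domain is resolved once and the remaining lookups cost only a dictionary access.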
The code comes first, and the explanation follows it.
Code:
# encoding=utf-8
# ---------------------------------------
#   version: 0.1
#   Date: 2016-04-26
#   Author: Jiu Cha <bone_ace@163.com>
#   Development environment: Win64 + Python 2.7
# ---------------------------------------
import socket
# from gevent import socket

# Maps the positional arguments of getaddrinfo to the resolved result.
_dnscache = {}


def _setDNSCache():
    """ DNS cache """

    def _getaddrinfo(*args, **kwargs):
        # Note: only the positional arguments form the cache key.
        if args in _dnscache:
            # print str(args) + " in cache"
            return _dnscache[args]
        else:
            # print str(args) + " not in cache"
            _dnscache[args] = socket._getaddrinfo(*args, **kwargs)
            return _dnscache[args]

    # Patch only once: keep the original resolver as socket._getaddrinfo
    # and route socket.getaddrinfo through the caching wrapper.
    if not hasattr(socket, '_getaddrinfo'):
        socket._getaddrinfo = socket.getaddrinfo
        socket.getaddrinfo = _getaddrinfo
Description:
There is really nothing difficult here: the results returned by socket.getaddrinfo are saved in a cache, so the same name is not resolved over and over again.
You can put the code above in a dns_cache.py file and have the crawler framework call the _setDNSCache() method.
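The following is a minimal, self-contained sketch of the same idea written for Python 3 (the original targets Python 2.7). The names install_dns_cache and _dns_cache are made up for illustration, and including the keyword arguments in the cache key is an added assumption, not part of the original code.

```python
import socket

# Hypothetical module-level cache: lookup arguments -> getaddrinfo result.
_dns_cache = {}


def install_dns_cache():
    """Route socket.getaddrinfo through an in-process cache (idempotent)."""
    if hasattr(socket, "_getaddrinfo"):
        return  # already patched
    socket._getaddrinfo = socket.getaddrinfo

    def _cached_getaddrinfo(*args, **kwargs):
        # Assumption: keyword arguments are part of the key, so calls that
        # differ only in family/type/proto do not share an entry.
        key = (args, tuple(sorted(kwargs.items())))
        if key not in _dns_cache:
            _dns_cache[key] = socket._getaddrinfo(*args, **kwargs)
        return _dns_cache[key]

    socket.getaddrinfo = _cached_getaddrinfo


# Call once at crawler startup, before any requests are sent.
install_dns_cache()
```

After the call, code that resolves names through socket.getaddrinfo, such as the standard library's urllib, goes through the cache without further changes. Entries are never expired in this sketch, so a long-running crawler would keep using an address even if the DNS record changes.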
One thing needs explaining: if you use gevent and call monkey.patch_all(), the crawler is then using gevent's socket, so the DNS resolution cache module should use gevent's socket as well (that is, the commented-out from gevent import socket line in the code).
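One way to keep the cache and the crawler on the same socket module is to pass the module in explicitly. The sketch below is written for Python 3 and is only an illustration: install_dns_cache and crawler_socket are hypothetical names, and the fallback to the standard socket module exists only so the example runs when gevent is not installed.

```python
def install_dns_cache(socket_module):
    """Cache getaddrinfo results on the given socket module (idempotent)."""
    if hasattr(socket_module, "_getaddrinfo"):
        return  # this module is already patched
    cache = {}
    original = socket_module.getaddrinfo
    socket_module._getaddrinfo = original

    def _cached_getaddrinfo(*args, **kwargs):
        key = (args, tuple(sorted(kwargs.items())))
        if key not in cache:
            cache[key] = original(*args, **kwargs)
        return cache[key]

    socket_module.getaddrinfo = _cached_getaddrinfo


try:
    # Assumption: the crawler runs on gevent and uses gevent's socket module.
    from gevent import socket as crawler_socket
except ImportError:
    # Fallback so the sketch still runs without gevent.
    import socket as crawler_socket

install_dns_cache(crawler_socket)
```

Passing the module as a parameter avoids hard-coding which socket implementation is patched, so the same function serves both the plain and the gevent-based crawler.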