Python crawler DNS How to parse caching methods in detail

Source: Internet
Author: User
This article mainly introduced the Python crawler DNS parsing caching method, combined with the concrete instance form analysis Python uses the socket module to parse the DNS cache the related operation skill and the attention matter, needs the friend can refer to the next

In this paper, the DNS parsing caching method of Python crawler is described. Share to everyone for your reference, as follows:

Objective:

This is the Python crawler in the DNS parsing cache module core code, is the last year's code, now put out interested can look at.

In general, DNS parsing time for a domain name is between 10~60 milliseconds, which may seem trivial, but it is not negligible for large crawlers. For example, we want to crawl Sina Weibo, the same domain name request has 10 million (this is not too much), then time-consuming between 10~60 million seconds, a day only 86,400 seconds. That is, the single DNS resolution of this one has been used for several days, at this time, plus the DNS resolution cache, the effect is obvious.

The following code is put directly below, explained in the back.

Code:

# encoding=utf-8#---------------------------------------#  version: 0.1#  Date: 2016-04-26#  Author: Jiu Cha <bone_ ace@163.com>#  Development environment: Win64 + Python 2.7#---------------------------------------import socket# from gevent Import Socket_dnscache = {}def _setdnscache (): "" "  DNS Cache" ""  def _getaddrinfo (*args, **kwargs):    if args in _ DnsCache:      # Print str (args) + "in cache"      return _dnscache[args]    else:      # print str (args) + ' not ' in cache "      _dnscache[args] = Socket._getaddrinfo (*args, **kwargs)      return _dnscache[args]  if not hasattr (socket , ' _getaddrinfo '):    socket._getaddrinfo = socket.getaddrinfo    socket.getaddrinfo = _getaddrinfo

Description

In fact, there is no difficulty, is to save the cache inside the socket, to avoid repeated access.
You can put the above code in a dns_cache.py file, the crawler frame to call this _setDNSCache() method is OK.

The

need to explain is, if you use the gevent, and use the monkey.patch_all () , note that at this time the crawler has to use the socket inside the gevent, The DNS parsing cache module should also use the gevent socket.

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.