The gethostbyname blocking problem in multithreaded environments

Last updated: 2018-12-05 Source: Internet



Reposted from: http://blog.csdn.net/zouxinfox/article/details/2234225

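On Unix/Linux, the gethostbyname function is commonly used to query DNS for the IP address of a domain name. Because DNS resolution is recursive, a gethostbyname call can take a very long time to return for some domain names. Unlike connect or read, it offers no way to set a timeout through setsockopt or select, so it often becomes a bottleneck in a program. One proposed workaround is to set a timer signal with alarm and, if the timer fires, use setjmp and longjmp to jump out of the gethostbyname call (I have not tried this approach and do not know how well it works).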
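The alarm/longjmp idea is specific to C. As a rough illustration of the same goal (bounding how long the caller waits on a blocking lookup), here is a minimal Python sketch that uses a different technique: it runs the lookup in a worker thread and stops waiting after a deadline. The names resolve_with_timeout and _POOL, the pool size of 8, and the resolver parameter are assumptions made for this illustration; they do not come from the original post.

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared pool (size is an arbitrary assumption). A timed-out lookup keeps
# running in its worker thread, so the pool size bounds how many stuck
# lookups can pile up.
_POOL = ThreadPoolExecutor(max_workers=8)


def resolve_with_timeout(host, timeout, resolver=socket.gethostbyname):
    """Return the IP address for host, or None if the lookup fails
    or takes longer than timeout seconds."""
    future = _POOL.submit(resolver, host)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        return None  # the caller gives up; the worker thread is not cancelled
    except OSError:
        return None  # resolution failed (socket.gaierror subclasses OSError)
```

For example, resolve_with_timeout("example.com", 2.0) returns an address string or None. This only limits how long the caller waits: the stuck lookup still occupies a worker until the underlying call returns, so it does not make the resolution itself any faster.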

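In a multithreaded program, gethostbyname has an even more serious problem: if a gethostbyname call blocks in one thread, every other thread that calls gethostbyname blocks as well. I ran into this while writing a crawler, and it puzzled me for a long time: all the crawler threads were blocked in gethostbyname, which made the crawler very slow. I spent a long time searching Google for this problem without finding an answer. Today I happened to find an e-book in our lab's Google Group, "Mining the Web - Discovering Knowledge from Hypertext Data", whose discussion of crawlers includes the following passages: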

    Many clients for DNS resolution are coded poorly. Most UNIX systems provide an implementation of gethostbyname (the DNS client API, application program interface), which cannot concurrently handle multiple outstanding requests. Therefore, the crawler cannot issue many resolution requests together and poll at a later time for completion of individual requests, which is critical for acceptable performance. Furthermore, if the system-provided client is used, there is no way to distribute load among a number of DNS servers. For all these reasons, many crawlers choose to include their own custom client for DNS name resolution. The Mercator crawler from Compaq System Research Center reduced the time spent in DNS from as high as 87% to a modest 25% by implementing a custom client. The ADNS asynchronous DNS client library is ideal for use in crawlers.
    In spite of these optimizations, a large-scale crawler will spend a substantial fraction of its network time not waiting for HTTP data transfer, but for address resolution. For every hostname that has not been resolved before (which happens frequently with crawlers), the local DNS may have to go across many network hops to fill its cache for the first time. To overlap this unavoidable delay with useful work, prefetching can be used. When a page that has just been fetched is parsed, a stream of HREFs is extracted. Right at this time, that is, even before any of the corresponding URLs are fetched, hostnames are extracted from the HREF targets, and DNS resolution requests are made to the caching server. The prefetching client is usually implemented using UDP instead of TCP, and it does not wait for resolution to be completed. The request serves only to fill the DNS cache so that resolution will be fast when the page is actually needed later on.
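The prefetching step described in the passage can be sketched roughly as follows. This is a minimal Python illustration, not the book's or any crawler's actual code. It assumes absolute HREFs and plain ASCII hostnames, and the function names hostnames_from_hrefs, build_dns_query and prefetch are hypothetical. The packet layout follows the standard DNS query format (RFC 1035), and dns_server stands for whatever caching server the crawler normally uses.

```python
import socket
import struct
from urllib.parse import urlsplit


def hostnames_from_hrefs(hrefs):
    """Return the distinct hostnames found in absolute HREF targets,
    in first-seen order."""
    seen = {}
    for href in hrefs:
        host = urlsplit(href).hostname  # None for relative links
        if host:
            seen.setdefault(host, None)
    return list(seen)


def build_dns_query(hostname, query_id):
    """Encode a recursive DNS query for the A record of hostname."""
    # Header: ID, flags (0x0100 = recursion desired), 1 question,
    # 0 answer / authority / additional records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed by its length; a zero byte ends the name.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE 1 = A, QCLASS 1 = IN.
    return header + qname + struct.pack("!HH", 1, 1)


def prefetch(hrefs, dns_server, port=53):
    """Send one UDP query per hostname and return without reading replies.

    The replies are deliberately ignored: the only purpose is to make the
    caching server resolve the names ahead of time.
    """
    hosts = hostnames_from_hrefs(hrefs)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for query_id, host in enumerate(hosts, start=1):
            packet = build_dns_query(host, query_id & 0xFFFF)
            sock.sendto(packet, (dns_server, port))
    return hosts
```

Because prefetch never waits for an answer, calling it right after parsing a page costs little more than a few sendto calls. Whether it actually helps depends on the later lookups going through the same caching server.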

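    The gist of the quoted passage is that the Unix gethostbyname cannot handle concurrent requests, and that this is an inherent limitation that cannot be changed. Large crawlers usually avoid gethostbyname and implement their own custom DNS client instead. That makes it possible to balance load across several DNS servers, and asynchronous resolution can greatly speed up DNS lookups. Such a DNS client is usually implemented over UDP and can resolve a URL's IP address ahead of time, before the crawler fetches the page. The passage also mentions adns, an open-source asynchronous DNS library, whose home page is http://www.chiark.greenend.org.uk/~ian/adns/
    In summary, gethostbyname is not suitable for multithreaded environments or for other programs that need fast DNS resolution.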
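To make "issue many requests together and poll later for completion" more concrete, here is a minimal Python sketch of the polling half of a custom UDP client. It is not the adns API and not code from any real crawler. The names parse_a_records and collect_replies, and the pending dictionary that maps query IDs to hostnames, are assumptions made for this illustration. It only handles A records and assumes the queries were already sent on the same socket.

```python
import select
import socket
import struct
import time


def _skip_name(packet, offset):
    """Return the offset just past the (possibly compressed) name."""
    while True:
        length = packet[offset]
        if length & 0xC0 == 0xC0:  # compression pointer: 2 bytes, ends name
            return offset + 2
        if length == 0:            # root label: end of name
            return offset + 1
        offset += 1 + length


def parse_a_records(packet):
    """Return (query_id, [IPv4 strings]) from a DNS response packet."""
    query_id, _flags, qdcount, ancount = struct.unpack_from("!HHHH", packet, 0)
    offset = 12                    # fixed-size DNS header
    for _ in range(qdcount):       # skip the echoed question section
        offset = _skip_name(packet, offset) + 4
    addresses = []
    for _ in range(ancount):
        offset = _skip_name(packet, offset)
        rtype, _rclass, _ttl, rdlength = struct.unpack_from("!HHIH", packet, offset)
        offset += 10
        if rtype == 1 and rdlength == 4:  # A record
            addresses.append(socket.inet_ntoa(packet[offset:offset + 4]))
        offset += rdlength
    return query_id, addresses


def collect_replies(sock, pending, timeout):
    """Poll sock until every ID in pending is answered or timeout expires.

    pending maps query ID -> hostname and is consumed as replies arrive.
    Returns a dict of hostname -> list of IPv4 addresses.
    """
    results = {}
    deadline = time.monotonic() + timeout
    while pending:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        readable, _, _ = select.select([sock], [], [], remaining)
        if not readable:
            break
        packet, _addr = sock.recvfrom(4096)
        try:
            query_id, addresses = parse_a_records(packet)
        except (struct.error, IndexError):
            continue               # ignore malformed datagrams
        host = pending.pop(query_id, None)
        if host is not None:
            results[host] = addresses
    return results
```

One slow name no longer holds up the others, since all queries are outstanding at once and a single deadline bounds the total wait. A production client would also need retries, source-address checks on replies, and handling for truncated responses, all of which are omitted here.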

