Reposted from: http://blog.csdn.net/zouxinfox/article/details/2234225
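On Unix/Linux, the gethostbyname function is commonly used to query DNS for the IP address of a domain name. Because DNS resolution is recursive, a gethostbyname call can stall for a long time while resolving a name. Unlike connect or read, whose timeouts can be set through setsockopt or select, gethostbyname offers no way to set a timeout, so it often becomes a bottleneck in a program. One proposed workaround is to set a timer signal with alarm and, on timeout, use setjmp and longjmp to jump out of the gethostbyname call (I have not tried this approach and do not know how well it works in practice).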
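As a rough illustration of bounding a lookup's waiting time, here is a minimal Python sketch. It is not the alarm/setjmp approach mentioned above: it runs the blocking call in a worker thread and stops waiting after a deadline. The name `resolve_with_timeout`, the pool size, and the default timeout are assumptions made for this example, and `socket.gethostbyname` is Python's wrapper around the platform resolver.

```python
import socket
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Shared worker pool (size chosen arbitrarily for this sketch). A lookup that
# times out keeps occupying its worker until the underlying call returns.
_POOL = ThreadPoolExecutor(max_workers=8)


def resolve_with_timeout(hostname, timeout=2.0, resolver=socket.gethostbyname):
    """Return the resolver's result, or None on timeout or lookup failure."""
    future = _POOL.submit(resolver, hostname)
    try:
        # Wait at most `timeout` seconds for the worker thread to finish.
        return future.result(timeout=timeout)
    except FutureTimeout:
        # The caller moves on; the blocked lookup itself is NOT cancelled.
        return None
    except OSError:
        # socket.gaierror and socket.herror are subclasses of OSError.
        return None
```

The caller gets control back after `timeout` seconds, but the stuck lookup still ties up a worker thread until the resolver gives up, so this limits the caller's wait rather than the cost of the lookup.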
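In a multithreaded program gethostbyname has an even more serious problem: if one thread blocks inside gethostbyname, every other thread that calls gethostbyname blocks there too. I ran into this while writing a crawler, and it puzzled me for a long time: all the crawler threads were blocked in gethostbyname, which made the crawler very slow. I spent a long time searching Google for this problem without finding an answer. Today I happened to come across an e-book in my lab's Google Group, "Mining the Web - Discovering Knowledge from Hypertext Data", whose discussion of crawlers includes the following passage: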
Many clients for DNS resolution are coded poorly. Most UNIX systems provide an implementation of gethostbyname (the DNS client API, application program interface), which cannot concurrently handle multiple outstanding requests. Therefore, the crawler cannot issue many resolution requests together and poll at a later time for completion of individual requests, which is critical for acceptable performance. Furthermore, if the system-provided client is used, there is no way to distribute load among a number of DNS servers. For all these reasons, many crawlers choose to include their own custom client for DNS name resolution. The Mercator crawler from Compaq System Research Center reduced the time spent in DNS from as high as 87% to a modest 25% by implementing a custom client. The ADNS asynchronous DNS client library is ideal for use in crawlers.
In spite of these optimizations, a large-scale crawler will spend a substantial fraction of its network time not waiting for HTTP data transfer, but for address resolution. For every hostname that has not been resolved before (which happens frequently with crawlers), the local DNS may have to go across many network hops to fill its cache for the first time. To overlap this unavoidable delay with useful work, prefetching can be used. When a page that has just been fetched is parsed, a stream of HREFs is extracted. Right at this time, that is, even before any of the corresponding URLs are fetched, hostnames are extracted from the HREF targets, and DNS resolution requests are made to the caching server. The prefetching client is usually implemented using UDP instead of TCP, and it does not wait for resolution to be completed. The request serves only to fill the DNS cache so that resolution will be fast when the page is actually needed later on.
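The prefetching idea in the excerpt could be sketched roughly as follows in Python: encode a minimal DNS query for an A record by hand (RFC 1035 wire format), send it over UDP to the caching server, and return without reading the reply. The function names `build_dns_query` and `prefetch`, and the `dns_server` parameter, are hypothetical names for this example; the book does not show any code.

```python
import random
import socket
import struct


def build_dns_query(hostname, query_id=None):
    """Encode a minimal DNS query for an A record (RFC 1035 wire format)."""
    if query_id is None:
        query_id = random.randint(0, 0xFFFF)
    # Header: ID, flags (only RD, "recursion desired", is set),
    # QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed by its length; a zero byte ends the name.
    # This sketch assumes ASCII labels of at most 63 bytes.
    qname = b""
    for label in hostname.rstrip(".").split("."):
        raw = label.encode("ascii")
        qname += bytes([len(raw)]) + raw
    qname += b"\x00"
    question = qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question


def prefetch(hostnames, dns_server, port=53):
    """Send one query per hostname and return without reading any reply."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for name in hostnames:
            sock.sendto(build_dns_query(name), (dns_server, port))
    finally:
        sock.close()
```

Because the replies are never read, the only effect is to make the caching server resolve and cache the names, which is exactly what the excerpt describes. A later real lookup for the same hostname should then be answered from that cache.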
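In short, the passage says that Unix's gethostbyname cannot handle multiple outstanding requests concurrently, a limitation built into the interface that callers cannot work around. Large-scale crawlers therefore tend not to use gethostbyname and instead implement their own custom DNS client. A custom client can balance load across several DNS servers, and asynchronous resolution greatly improves DNS resolution speed. The prefetching client is usually implemented over UDP, so the hostnames in URLs can be resolved ahead of time, before the crawler fetches the pages. The passage also mentions adns, an open-source asynchronous DNS library, whose home page is http://www.chiark.greenend.org.uk/~ian/adns/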
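To make "issue many requests together and collect the results later" concrete, here is a small Python sketch. It is only a stand-in for a real asynchronous DNS client such as adns: it runs blocking `socket.getaddrinfo` calls in a thread pool rather than multiplexing DNS queries itself. The names `lookup_ipv4` and `resolve_many` and the worker count are assumptions made for this example.

```python
import socket
from concurrent.futures import ThreadPoolExecutor, as_completed


def lookup_ipv4(hostname):
    """Return the sorted IPv4 addresses for hostname, using getaddrinfo."""
    infos = socket.getaddrinfo(hostname, None, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for AF_INET the sockaddr is (address, port).
    return sorted({info[4][0] for info in infos})


def resolve_many(hostnames, workers=16, lookup=lookup_ipv4):
    """Resolve hostnames concurrently; map each name to its addresses or None."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Submit every lookup up front, then collect results as they finish.
        futures = {pool.submit(lookup, name): name for name in hostnames}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except OSError:
                # A failed lookup (e.g. socket.gaierror) is recorded as None.
                results[name] = None
    return results
```

A crawler could call something like `resolve_many(hosts)` on the hostnames extracted from a freshly parsed page, so that one slow name does not hold up the others. The `lookup` parameter exists only so that a different resolver can be substituted.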
From the above, gethostbyname is not suitable for multithreaded programs or for any other program that needs fast DNS resolution.