Locating a Golang DNS Query Timeout Fault

Source: Internet
Author: User
Tags: nameserver
A reverse proxy in production recently logged this exception: lookup xxx.xxx.com on 10.0.0.1:53: dial udp 10.0.0.1:53: i/o timeout. It is obviously caused by a DNS query timing out. The puzzle is that in Golang the dial UDP operation merely creates the epoll object and performs no real I/O, so what is there to time out?

So the plan is to reproduce the problem first, and then locate its cause.

Operating Environment

    • CentOS Release 6.4 (Final) virtual machine
    • Go version go1.7 linux/amd64

Reproducing the Problem
      Code address
    • Configure a crontab task that runs every minute; each run issues 300 queries per second for 5 seconds
      • */1 * * * * cd /home/d.cao && ./dnsquery_cost -host=xxx.xxx.com -epoch=10 -batch=30 -duration=5 >> /home/d.cao/run.log 2>&1 &
      • Concurrency = epoch * batch
      • duration is the run time, in seconds
    • After some time, the log shows: 2017/06/23 15:20:08 lookup xxx.xxx.com on 10.0.0.1:53: dial udp 10.0.0.1:53: i/o timeout

Experimental conclusion

    • The problem is reproducible, which rules out an occasional fluke (UDP packet loss)
    • It also rules out a bug in the online service code
    • Hypotheses: 1. a bug in the library implementation; 2. a problem with the machine's operating environment

Source Tracing

Since the log alone yields no further conclusions, the next step is to trace through the source.

    • Call Chain
      • -LookupHost(host string)
      • --goLookupHostOrder(ctx context.Context, name string, order hostLookupOrder) (Golang has two DNS query implementations: 1. cgo, which uses glibc; 2. pure Go, the default. Set the environment variable GODEBUG=netdns=cgo to switch)
      • ---tryOneName(ctx context.Context, cfg *dnsConfig, name string, qtype uint16)
      • ----exchange(ctx context.Context, server, name string, qtype uint16) (in src/net/dnsclient_unix.go)

The call chain is shallow, and the exchange function is what finally performs the DNS query.
    • Code comments

    • Problem Locator

Reading the implementations of d.dialDNS and d.dnsRoundTrip shows that "dial udp 10.0.0.1:53: i/o timeout" is produced inside the dialDNS function; that is, the dial UDP step really did time out! If the timeout were set at the millisecond level this would be believable, but the timeout is 10 seconds, which makes no sense.

Going back up to the tryOneName function reveals a retry loop, with the timeout set once before the loop rather than separately for each retry.

This means: 1. if cfg.attempts is greater than 1 or cfg.servers has more than one entry, the error that tryOneName returns is only the error from the last failed attempt; 2. if the first exchange call times out, every remaining attempt returns immediately from the d.dialDNS call, and its Error() string is "dial udp 10.0.0.x:53: i/o timeout".

    • Observations
      • On Linux, cfg is the object that represents the /etc/resolv.conf configuration file
        • servers corresponds to the nameserver entries (at most three), attempts corresponds to the attempts field in options, and timeout corresponds to the timeout field in options
        • attempts defaults to 2
        • timeout defaults to 5 (seconds)
      • On inspection, the server's /etc/resolv.conf lists three nameservers, the last of which is exactly 10.0.0.1; no options are set.
    • Conclusion
      • Because the Golang library implements timeout and retry in such an odd way (compared with the glibc implementation), the exception it logs is not the crux of the problem, and the log is misleading.
      • The timeout that actually fails the query occurs while reading the DNS query response
    • Verify
      • Edit /etc/resolv.conf
        • Keep only one nameserver
        • Add options attempts:1
      • After a while, run.log shows: 2017/06/23 19:26:26 lookup xxx.xxx.com on 10.0.0.1:53: read udp 10.0.0.1:49757->10.0.0.1:53: i/o timeout. A read UDP timeout, as expected!

Solutions
    • DNS server: troubleshoot and fix the problem on the server side
    • DNS client cache
      • In fact, we already run dnsmasq on the machine as a DNS cache, so let's first look at a rather hacky way to work around the problem.
      • While tracing the source, I noticed hostLookupOrder, which decides the order in which DNS lookups are performed. Linux has the /etc/hosts file, a static mapping of domain names; hostLookupOrder determines whether the hosts file is consulted and in what order.
        • Fix: change Golang's default lookup order so that DNS is queried first and the hosts file is consulted only when the DNS query fails
        • Concrete steps: 1. add the domain name and its static IP to /etc/hosts as a fallback; 2. change the hosts entry in /etc/nsswitch.conf to "hosts: dns files", i.e. DNS query first, hosts file second

Closing Remarks

Although the specific cause of the read UDP failures has not been identified, the online dial UDP i/o timeout exceptions disappeared after the hack above. Two open questions remain: 1. the timeout and retry logic in the Golang library is very confusing; 2. with dnsmasq in place, is the DNS server still under heavy load? If so, why do only a limited number of domain names have problems?

This was written in a hurry; if there are any mistakes, please correct me!
