First disclosure: the extreme challenges faced by the Alibaba DNS system.
Every year, Double Eleven (the November 11 shopping festival) puts Alibaba's technologies to the test, and DNS is no exception.
DNS grew out of the earliest hosts file and was standardized in RFC 1034 and RFC 1035. At about 40 years old, it is one of the oldest protocols in the TCP/IP protocol suite. Since then, hundreds of RFCs have extended and hardened DNS, making it one of the most commonly used and most indispensable pieces of Internet infrastructure. For most network programs, the first packet sent after startup is the DNS request generated by a gethostbyname() call. Programs communicating over a network have long since abandoned connecting directly by IP address and instead call each other by domain name. This brings many benefits: domain names are easier to debug, easier to maintain, and easier for humans to understand. In an IPv6 network, where addresses are even harder to reference directly, an easy-to-read domain name is all the more necessary.
The DNS system has three main roles: the DNS resolver, recursive DNS, and authoritative DNS. The DNS resolver is the piece of DNS code built into operating systems such as Windows and Linux; on Linux, for example, it is glibc's implementation of gethostbyname(). Its job is to accept a call from the application above it, send a DNS request packet onto the network, receive the response, extract the resolution result, and return it to the application.
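To make the "DNS request packet" concrete, the following minimal sketch builds the bytes of a single-question query in the RFC 1035 wire format. It is an illustration only, not glibc's code; the function name build_dns_query and the query ID 0x1234 are assumptions chosen for the example.

```python
import struct

def build_dns_query(name: str, query_id: int = 0x1234, qtype: int = 1) -> bytes:
    """Encode a single-question DNS query in RFC 1035 wire format."""
    # Header fields: ID, flags (only RD, "recursion desired", is set: 0x0100),
    # QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label is prefixed by its length; a zero byte ends the name.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE (1 = A record) followed by QCLASS (1 = IN, the Internet class).
    return header + qname + struct.pack("!HH", qtype, 1)

packet = build_dns_query("www.taobao.com")
print(len(packet), packet.hex())
```

A resolver would send these bytes to the configured recursive DNS server, usually over UDP port 53, and then parse the answer section of the reply.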
Recursive DNS is deployed at a network location close to PCs and servers. Its service IP is the DNS server IP configured in Windows, or the nameserver IP in /etc/resolv.conf on Linux. Recursive DNS receives requests from DNS resolvers and queries the authoritative DNS servers at each level on their behalf. It caches the results locally, so the next time it receives the same request it answers from the cache instead of recursing again. Recursive DNS is itself only a cache and does not store authoritative data.
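The caching behavior can be sketched as a small TTL cache. This is a simplified illustration under stated assumptions, not the code of any real recursive DNS: the class RecursiveCache, the resolve_upstream callback (standing in for the full recursion), and the 192.0.2.10 documentation address are all hypothetical.

```python
import time
from typing import Callable, Dict, Tuple

class RecursiveCache:
    """Answer from cache while an entry is fresh; otherwise ask upstream."""

    def __init__(self,
                 resolve_upstream: Callable[[str], Tuple[str, float]],
                 clock: Callable[[], float] = time.monotonic) -> None:
        # resolve_upstream(name) returns (ip, ttl_seconds); it stands in
        # for the real recursion to the authoritative servers.
        self._resolve_upstream = resolve_upstream
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}  # name -> (ip, expires_at)
        self.upstream_calls = 0

    def lookup(self, name: str) -> str:
        now = self._clock()
        entry = self._entries.get(name)
        if entry is not None and entry[1] > now:
            return entry[0]  # cache hit: no recursion needed
        # Cache miss or expired entry: recurse and remember the result.
        self.upstream_calls += 1
        ip, ttl = self._resolve_upstream(name)
        self._entries[name] = (ip, now + ttl)
        return ip

# Usage with a fake clock so the example does not have to sleep.
fake_now = [0.0]
cache = RecursiveCache(lambda name: ("192.0.2.10", 60.0), clock=lambda: fake_now[0])
cache.lookup("www.taobao.com")   # miss: goes upstream
cache.lookup("www.taobao.com")   # hit: served from cache
fake_now[0] = 61.0               # the 60-second TTL has now passed
cache.lookup("www.taobao.com")   # expired: goes upstream again
print(cache.upstream_calls)
```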
Authoritative DNS maintains the mapping between domain names and IP addresses, delegating authority level by level from the root through the top-level domains. For example, the root's authoritative DNS stores the IPs of the authoritative DNS servers for each top-level domain, such as .com, .net, and .cn. The authoritative DNS of .com holds the IPs of the authoritative DNS servers of taobao.com. The mapping between the domain name www.taobao.com. and its IP address is stored on the authoritative DNS servers of taobao.com. Recursive DNS searches level by level, obtains the IP of www.taobao.com. from the authoritative DNS of taobao.com., and returns it to the user.
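The level-by-level search can be sketched with an in-memory model of the hierarchy. Everything here is a hypothetical stand-in: the SERVERS table, the server labels, and the address 192.0.2.80 are invented for illustration, and no network traffic is involved.

```python
from typing import Dict, List, Tuple

# Hypothetical model: each "server" either delegates a suffix to another
# server or holds the final record for a name.
SERVERS: Dict[str, Dict[str, Dict[str, str]]] = {
    "root": {"delegations": {"com.": "com-servers"}, "records": {}},
    "com-servers": {"delegations": {"taobao.com.": "taobao-servers"}, "records": {}},
    "taobao-servers": {"delegations": {}, "records": {"www.taobao.com.": "192.0.2.80"}},
}

def resolve_from_root(name: str) -> Tuple[str, List[str]]:
    """Follow delegations from the root until a server holds the record."""
    server = "root"
    path = [server]
    while True:
        zone = SERVERS[server]
        if name in zone["records"]:
            return zone["records"][name], path
        # Find a delegation whose suffix covers the requested name.
        next_server = None
        for suffix, target in zone["delegations"].items():
            if name == suffix or name.endswith("." + suffix):
                next_server = target
        if next_server is None:
            raise LookupError(f"no record or delegation for {name} at {server}")
        server = next_server
        path.append(server)

ip, path = resolve_from_root("www.taobao.com.")
print(ip, path)
```

The returned path lists the servers consulted in order, which mirrors the root, then .com, then taobao.com sequence described above.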
As this shows, the complex and time-consuming recursive logic is handled by recursive DNS, while the DNS client simply sends out a DNS request. The DNS system encapsulates the complexity internally and presents users with a simple, clear interface. Recursive DNS is the window through which the service is provided to users, and it is also the component under the heaviest request load. Inside Alibaba, the QPS of requests received by recursive DNS is very high, for several reasons:
Numerous invalid DNS requests generated by a wide variety of application code
Bursts of DNS requests from stress tests run by various business teams
Excessive reverse-resolution requests for host names from old OS/glibc versions
A massive number of servers, each sending only a few DNS requests, which add up to huge traffic once aggregated at the DNS servers
DNS attacks, here meaning specifically attacks against recursive DNS and authoritative DNS
DNS technical challenges of Double Eleven
Systems within the group call each other by domain name, not by IP. A single click on a Taobao or Tmall web page causes multiple servers to send dozens of DNS requests to the DNS servers. Every year at midnight on Double Eleven, a huge number of user requests pour into the Alibaba network. Like a stone that stirs up a thousand waves, they activate Alibaba's internal systems, and the massive volume of calls among those systems drives DNS requests to their peak. Against this volume of DNS requests, the traditional approach of simply piling on more machines is no longer effective.
During the midnight peak of Double Eleven, recursive DNS traffic is many times its usual level. Authoritative DNS traffic does not rise nearly as much, because most of the requests during Double Eleven are for the same domain names. One user's purchase request may trigger calls among systems A, B, ... Z, and the next purchase request will ask for the same domain names again. As a result, the recursive DNS cache hit rate is high and most requests never reach the authoritative DNS.
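A rough simulation shows why repeated domain names keep authoritative traffic low. The numbers (10,000 purchases, 26 systems per purchase) and the domain names are made up for illustration, and the sketch assumes no cached entry expires during the simulated window.

```python
from typing import Set, Tuple

def simulate_peak(purchases: int, domains_per_purchase: int) -> Tuple[int, int]:
    """Count queries seen by recursive DNS vs. those forwarded to authoritative DNS."""
    cached: Set[str] = set()
    recursive_queries = 0
    authoritative_queries = 0
    for _ in range(purchases):
        for i in range(domains_per_purchase):
            # Every purchase touches the same set of (hypothetical) system names.
            name = f"system-{i}.example.internal"
            recursive_queries += 1
            if name not in cached:
                # Only the first request for a name has to go to authoritative DNS.
                authoritative_queries += 1
                cached.add(name)
    return recursive_queries, authoritative_queries

recursive, authoritative = simulate_peak(purchases=10_000, domains_per_purchase=26)
hit_rate = 1 - authoritative / recursive
print(recursive, authoritative, f"{hit_rate:.4%}")
```

Under these assumptions the recursive layer sees every request, while the authoritative layer sees only one request per distinct name.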
Double Eleven is not the only reason for so many DNS requests; the way operating systems handle domain name resolution also contributes.
The Windows operating system provides its own DNS cache, and you can view its current contents with ipconfig /displaydns. Linux has no DNS cache by default, but nscd is a caching daemon that provides one, similar to the Windows cache: once the nscd service is started, DNS results are cached. In general, though, a stock Linux system does no DNS caching. If Linux did cache DNS locally, then when the IP address for a domain name became invalid or changed, local applications would not notice immediately, which could cause failures.
In addition to the operating system's cache, applications generally implement their own DNS cache to maximize efficiency. Take the JVM as an example: by default (OpenJDK, with no security manager installed) it caches the result of a successful lookup for 30 seconds.
There is another significant problem. Many systems use DNS for load balancing by configuring one domain name with multiple IPs. However, the upper application typically uses only the first IP returned by gethostbyname(). Recursive DNS therefore has to shuffle the order of the IPs and return them in random order. If the clusters behind the different IPs have different service capacities, recursive DNS also has to order the returned IPs according to weight.
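Both orderings can be sketched in a few lines. This is one possible approach, not a description of any particular DNS server: the function names, the weights, and the 192.0.2.x documentation addresses are assumptions made for the example.

```python
import random
from typing import Dict, List, Sequence

def shuffled_answer(ips: Sequence[str], rng: random.Random) -> List[str]:
    """Return the IPs in uniformly random order, so each one is first equally often."""
    answer = list(ips)
    rng.shuffle(answer)
    return answer

def weighted_answer(weighted_ips: Dict[str, float], rng: random.Random) -> List[str]:
    """Order IPs by weighted random draws without replacement.

    An IP with a larger (positive) weight is more likely to come first,
    and therefore more likely to be the one the application uses.
    """
    remaining = dict(weighted_ips)
    order: List[str] = []
    while remaining:
        candidates = list(remaining)
        pick = rng.choices(candidates, weights=[remaining[ip] for ip in candidates], k=1)[0]
        order.append(pick)
        del remaining[pick]
    return order

rng = random.Random(42)  # seeded only to make the example repeatable
print(shuffled_answer(["192.0.2.1", "192.0.2.2", "192.0.2.3"], rng))
# A cluster with weight 3 should appear first about three times as often
# as a cluster with weight 1.
print(weighted_answer({"192.0.2.1": 3, "192.0.2.2": 1}, rng))
```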
Alibaba's self-developed DNS solution
Alibaba's DNS team has developed its own DNS system, DNS Mega, to cope with the midnight peak traffic of Double Eleven with ease.
DNS Mega is a recursive DNS system built to handle high volumes of DNS requests, and it delivers very high QPS from a single physical server. In the past, increasing DNS capacity meant scaling out horizontally by adding servers, a relatively primitive method that is slow to scale both up and down. With DNS Mega, a cluster of only a few servers is enough to meet the DNS needs of a large number of business servers. When a cached domain name is about to expire, DNS Mega re-requests it, so that the domain names in DNS Mega always reflect the latest state of the authoritative DNS.
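The idea of refreshing an entry shortly before it expires can be sketched as follows. This is not DNS Mega's implementation, which the article does not describe; the class PrefetchingCache, the 5-second prefetch window, and the addresses are assumptions. A production system would likely refresh only names that are still being requested.

```python
from typing import Callable, Dict, List, Optional, Tuple

class PrefetchingCache:
    """Cache that re-resolves entries shortly before their TTL runs out."""

    def __init__(self,
                 resolve_upstream: Callable[[str], Tuple[str, float]],
                 prefetch_window: float = 5.0) -> None:
        self._resolve_upstream = resolve_upstream  # returns (ip, ttl_seconds)
        self._prefetch_window = prefetch_window
        self._entries: Dict[str, Tuple[str, float]] = {}  # name -> (ip, expires_at)

    def store(self, name: str, now: float) -> None:
        ip, ttl = self._resolve_upstream(name)
        self._entries[name] = (ip, now + ttl)

    def get(self, name: str, now: float) -> Optional[str]:
        entry = self._entries.get(name)
        if entry is not None and entry[1] > now:
            return entry[0]
        return None  # missing or expired

    def refresh_expiring(self, now: float) -> List[str]:
        """Re-resolve every entry that expires within the prefetch window."""
        due = [name for name, (_, expires_at) in self._entries.items()
               if expires_at - now <= self._prefetch_window]
        for name in due:
            self.store(name, now)
        return due

# Usage: the upstream answer changes while the entry is cached.
current = {"www.taobao.com": "192.0.2.1"}
cache = PrefetchingCache(lambda name: (current[name], 30.0))
cache.store("www.taobao.com", now=0.0)        # expires at t=30
early = cache.refresh_expiring(now=20.0)      # 10 s left: nothing is due yet
current["www.taobao.com"] = "192.0.2.2"       # authoritative data changes
late = cache.refresh_expiring(now=26.0)       # 4 s left: entry is refreshed
print(early, late, cache.get("www.taobao.com", now=31.0))
```

Because the entry was refreshed at t=26, a lookup at t=31 is still a cache hit and already carries the new address, instead of missing and waiting for a full recursion.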
DNS Mega has its own set of logic for dealing with DNS attacks. The most common attacks against recursive DNS are pan-domain attacks, which request domain names with a constantly changing prefix, such as a.taobao.com, b.taobao.com, and so on. In a typical recursive DNS system, these requests miss the cache and trigger recursion to the authoritative DNS, consuming a large share of the recursive DNS's resources. DNS Mega can distinguish normal requests from attack requests and allocate more resources to the normal ones.
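The article does not say how DNS Mega tells the two kinds of request apart. Purely as an illustration, one simple heuristic is to count the distinct names under each parent domain that miss the cache and to treat further misses as suspect once the count passes a threshold. The class MissTracker, the threshold, and the crude "last two labels" parent rule are all assumptions of this sketch.

```python
from collections import defaultdict
from typing import DefaultDict, Set

class MissTracker:
    """Flag parent domains that produce many distinct cache-missing names."""

    def __init__(self, threshold: int = 100) -> None:
        self._threshold = threshold
        # parent domain -> distinct names under it that missed the cache
        self._missed: DefaultDict[str, Set[str]] = defaultdict(set)

    @staticmethod
    def parent_domain(name: str) -> str:
        # Crude assumption: the parent is the last two labels (e.g. taobao.com).
        labels = name.rstrip(".").split(".")
        return ".".join(labels[-2:])

    def classify(self, name: str, in_cache: bool) -> str:
        if in_cache:
            return "normal"  # cache hits are cheap and never penalized
        parent = self.parent_domain(name)
        self._missed[parent].add(name)
        if len(self._missed[parent]) > self._threshold:
            return "suspect"  # candidate for a lower-priority recursion queue
        return "normal"

tracker = MissTracker(threshold=3)
results = [tracker.classify(f"{prefix}.taobao.com", in_cache=False)
           for prefix in ["a", "b", "c", "d"]]
print(results)
print(tracker.classify("www.taobao.com", in_cache=True))
```

A real system would also need to reset the counters periodically and would probably combine several signals; this sketch only shows the shape of the idea.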
It is worth mentioning that Alibaba Public DNS, which includes DNS Mega, already serves a wide range of Internet users. It can be used free of charge on PCs, mobile devices, servers, and other terminals. Users only need to set their local DNS addresses to 223.5.5.5 and 223.6.6.6 to use this completely free and fast public DNS service.