Translated from: http://kb.cnblogs.com/page/166210/
Why Google Went Offline Today, and a Bit about how the Internet Works
Note: CloudFlare, mentioned in this article, is a San Francisco-based content delivery network (CDN) company founded in 2009 by three former developers of Project Honey Pot. In October 2011 the Wall Street Journal named it the most innovative network technology company.
Today, Google's services suffered a brief outage that lasted about 27 minutes and affected Internet users in some regions. The cause lies in a deep, dark corner of the Internet. I am a network engineer at CloudFlare, and I lent a hand in helping Google recover from this outage. Here is what happened.
At about 6:24 p.m. Pacific Standard Time on November 5, 2012 (02:24 UTC on November 6, 2012), CloudFlare employees noticed that Google's services were down. We use Google's email and other services, so when they stop working properly, people in the office find out quickly. I work in the network engineering group, so I immediately got online to see what was going on: a local problem or a global one.
Troubleshooting
I quickly realized that none of Google's services could be reached, not even 8.8.8.8, Google's public DNS server, so I started by tracing DNS.
$ dig +trace google.com
Here's what I got when I queried the name servers for google.com:
google.com. 172800 IN NS ns2.google.com.
google.com. 172800 IN NS ns1.google.com.
google.com. 172800 IN NS ns3.google.com.
google.com. 172800 IN NS ns4.google.com.
;; Received 164 bytes from 192.12.94.30#53(e.gtld-servers.net) in ms
;; connection timed out; no servers could be reached
Getting no response from any server proved that something was wrong. Specifically, it meant that from our office we could not reach any of Google's DNS servers.
I then turned to the network layer to see whether the problem was there.
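The same check can be scripted. The following is a minimal Python sketch, not what was run during the incident: it builds a bare DNS query by hand and reports whether a server answers before a timeout. The function names `build_dns_query` and `server_answers`, the query ID, and the 2-second timeout are all illustrative assumptions.

```python
import socket
import struct


def build_dns_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query packet asking for the A record of `name`."""
    # Header: ID, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label is prefixed by its length, ended by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE 1 = A record, QCLASS 1 = IN (Internet).
    return header + qname + struct.pack("!HH", 1, 1)


def server_answers(server_ip: str, name: str, timeout: float = 2.0) -> bool:
    """Return True if `server_ip` sends back a DNS reply before the timeout."""
    query = build_dns_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server_ip, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable-network errors
            return False
    # A reply to our query echoes the same 16-bit ID in its first two bytes.
    return reply[:2] == query[:2]
```

Calling `server_answers("216.239.32.10", "google.com")` from a network in the state described here would be expected to return False, which is the scripted equivalent of the timeout dig reported.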
PING 216.239.32.10 (216.239.32.10): data bytes
Request timeout for icmp_seq 0
bytes from 1-1-15.edge2-eqx-sin.moratelindo.co.id (202.43.176.217)
Something strange shows up here. Normally we should never see the name of an Indonesian network service provider (Moratel) on a route to Google. I immediately logged into one of CloudFlare's routers to see what was happening. Meanwhile, reports on Twitter from other parts of the world showed that we were not the only ones having trouble.
Internet routing
To understand what went wrong, you need to know some basics of how the Internet works. The Internet is made up of many networks, known as Autonomous Systems (AS). Each network has a unique number that identifies it, known as its AS number. CloudFlare's AS number is 13335, and Google's is 15169. The networks are connected to one another through a technology called the Border Gateway Protocol (BGP). BGP is often called the glue of the Internet: it announces which IP addresses belong to which network, and it establishes the routes from one autonomous system to another. An Internet "route" is exactly what the word suggests: the path from an IP address in one autonomous system to an IP address in another.
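As a rough illustration of "which IP addresses belong to which network", here is a minimal Python sketch of a prefix-to-origin table with a longest-prefix lookup. The two prefixes and AS 15169 are taken from the router output quoted later in this article; the table layout and the `origin_as` function are assumptions made for the example, and a real routing table holds hundreds of thousands of entries.

```python
import ipaddress

# Prefix -> AS number that originates it (values taken from this article).
ANNOUNCED_PREFIXES = {
    ipaddress.ip_network("8.8.8.0/24"): 15169,
    ipaddress.ip_network("216.239.34.0/24"): 15169,
}


def origin_as(address: str):
    """Return the AS announcing the most specific prefix that covers `address`."""
    ip = ipaddress.ip_address(address)
    matches = [net for net in ANNOUNCED_PREFIXES if ip in net]
    if not matches:
        return None  # no known network claims this address
    best = max(matches, key=lambda net: net.prefixlen)  # longest prefix wins
    return ANNOUNCED_PREFIXES[best]


print(origin_as("8.8.8.8"))        # 15169
print(origin_as("216.239.34.10"))  # 15169
```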
BGP is built on mutual trust. Each network tells the others which IP addresses belong to it, and the others take its word for it. When you send a packet or a request across the network, your network service provider asks its upstream or peer providers which route from it to the destination is the closest.
Unfortunately, if a network announces that an IP address or a network is inside it when that is not the case, and its upstream or peer networks trust the announcement, packets end up getting lost. That is what happened here.
I looked at the BGP route for one of Google's IP addresses, and it pointed to Moratel (AS 23947), an Indonesian network service provider. Our office is in California, not far from Google's data centers, so our packets should never go through Indonesia. Most likely, Moratel was announcing an incorrect route.
The BGP route I saw was:
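The effect of that trust can be sketched in a few lines of Python. This is a toy model, not real BGP: the `Announcement` and `TrustingRouter` names are made up, the AS numbers and prefix come from ranges reserved for documentation, and "closest" is reduced to "shortest AS path", whereas real routers weigh several other attributes first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Announcement:
    prefix: str     # for example "192.0.2.0/24"
    as_path: tuple  # ASes the announcement passed through; the last is the origin


class TrustingRouter:
    """Accepts every announcement without checking who really owns the prefix."""

    def __init__(self):
        self.routes = {}  # prefix -> list of accepted announcements

    def receive(self, announcement):
        # No validation at all: this is the mutual trust described above.
        self.routes.setdefault(announcement.prefix, []).append(announcement)

    def best_route(self, prefix):
        # Simplification: the shortest AS path stands in for "closest".
        candidates = self.routes.get(prefix, [])
        return min(candidates, key=lambda a: len(a.as_path), default=None)


router = TrustingRouter()
router.receive(Announcement("192.0.2.0/24", (64510, 64511, 64500)))  # genuine origin 64500
router.receive(Announcement("192.0.2.0/24", (64501,)))               # false claim by 64501
chosen = router.best_route("192.0.2.0/24")
print(chosen.as_path)  # (64501,)
```

In this toy model the false claim wins simply because nothing checks it, so traffic for the prefix would be sent toward AS 64501 and never reach its real owner.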
[email protected]> show route 216.239.34.10
inet.0: 422168 destinations, 422168 routes (422154 active, 0 holddown, hidden)
+ = Active Route, - = Last Active, * = Both
216.239.34.0/24 *[BGP/170] 00:15:47, MED, localpref 100
AS path: 4436 3491 23947 15169 I
> to 69.22.153.1 via ge-1/0/9.0
I looked at other routes, such as the one to Google's public DNS, and they had been hijacked onto the same (incorrect) path:
[email protected]> show route 8.8.8.8
inet.0: 422196 destinations, 422196 routes (422182 active, 0 holddown, hidden)
+ = Active Route, - = Last Active, * = Both
8.8.8.0/24 *[BGP/170] 00:27:02, MED, localpref 100
AS path: 4436 3491 23947 15169 I
> to 69.22.153.1 via ge-1/0/9.0
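To make the AS path in the output above easier to read, here is a small Python sketch that splits it into origin and transit networks and flags the hop that should not be there. The helper names `parse_as_path` and `unexpected_transit` are illustrative, and the name table only covers the AS numbers this article identifies.

```python
# AS numbers named in this article; other ASes in the path are left unnamed.
AS_NAMES = {13335: "CloudFlare", 15169: "Google", 23947: "Moratel"}


def parse_as_path(line: str):
    """Extract the AS numbers from a line such as 'AS path: 4436 3491 23947 15169 I'."""
    _, _, rest = line.partition(":")
    # The trailing 'I' is the origin code, not an AS number, so it is skipped.
    return [int(token) for token in rest.split() if token.isdigit()]


def unexpected_transit(as_path, not_expected):
    """Return the transit ASes (everything before the origin) that should not appear."""
    transit = as_path[:-1]
    return [asn for asn in transit if asn in not_expected]


path = parse_as_path("AS path: 4436 3491 23947 15169 I")
origin = path[-1]  # the last AS in the path originated the route
suspects = unexpected_transit(path, not_expected={23947})

print(origin, AS_NAMES.get(origin))           # 15169 Google
print([AS_NAMES.get(a, a) for a in suspects])  # ['Moratel']
```

The origin is still Google (15169), which is why this looks like a leak rather than a false ownership claim: the route is Google's own, but it reaches us through Moratel.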
Route leaks
Problems like this are known in the industry as "route leaks", because the route has leaked beyond its normal paths. This kind of thing is not without precedent. Google suffered a similar outage before, when Pakistan was reportedly trying to block a video on YouTube and Pakistan's national ISP null-routed the YouTube site's IP addresses. Unfortunately, that route leaked to the outside: PCCW, the upstream provider of Pakistan Telecom, trusted what Pakistan Telecom sent it and passed the route on to the rest of the Internet. The incident made YouTube unreachable for about 2 hours.
What happened today is a similar situation. Someone at Moratel most likely had a "fat finger" moment, and the wrong routes leaked out. PCCW, Moratel's upstream provider, trusted the routes that Moratel sent it, and the wrong routes quickly spread across the Internet. Given how BGP's trust model works, this was far more likely an honest mistake than a malicious act.
Repair
The solution was to get Moratel to stop announcing the wrong routes. For a network engineer, especially one working at a large network company like CloudFlare, a big part of the job is staying in touch with network engineers in other parts of the world. When I found the problem, I contacted a colleague at Moratel and told him what had happened. He fixed the issue at around 6:50 p.m. Pacific Standard Time (02:50 UTC). About 3 minutes later, routing was back to normal and Google's services were working again.
Looking at the network traffic graphs, I estimate that 3-5% of Internet users worldwide were affected by this outage. The hardest hit was Hong Kong, where PCCW is headquartered. If you were in a region that could not reach Google's services at the time, you now know the reason.
Building a better Internet
I am telling this story to show how the Internet is built on a mechanism of mutual trust. Today's incident shows that even if you are a company as big as Google, outside factors beyond your control can affect your users and leave them unable to reach you. That is why it is essential to have a network engineering team that monitors routing and manages your connections to the rest of the world. CloudFlare's daily job is to make sure our customers get the best routes. We look after the websites on our network to make sure they are delivered as fast as possible. What happened today is just a small slice of our work.