What really happens if you navigate to a URL

Source: Internet
Author: User
Tags: anycast, nameserver

As a software developer, you certainly have a high-level picture of how web apps work and what kinds of technologies are involved: the browser, HTTP, HTML, web server, request handlers, and so on.

In this article, we'll take a deeper look at the sequence of events that take place when you visit a URL.

1. You enter a URL into the browser

It all starts here:

2. The browser looks up the IP address for the domain name

The first step in the navigation is to figure out the IP address for the visited domain. The DNS lookup proceeds as follows:

  • Browser cache – The browser caches DNS records for some time. Interestingly, the OS does not tell the browser the time-to-live for each DNS record, and so the browser caches them for a fixed duration (varies between browsers, 2–30 minutes).
  • OS cache – If the browser cache does not contain the desired record, the browser makes a system call (gethostbyname in Windows). The OS has its own cache.
  • Router cache – The request continues on to your router, which typically has its own DNS cache.
  • ISP DNS cache – The next place checked is the cache of your ISP's DNS server. With a cache, naturally.
  • Recursive search – Your ISP's DNS server begins a recursive search, from the root nameserver, through the .com top-level nameserver, to Facebook's nameserver. Normally, the DNS server will have the names of the .com nameservers in cache, and so a hit to the root nameserver will not be necessary.
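
The first two cache layers above can be sketched in a few lines of Python. This is only a toy model, not how any particular browser implements it: the class name `FixedTtlDnsCache`, the 60-second lifetime, and the injectable `resolver` and `clock` parameters are all assumptions made for illustration. The default resolver calls `socket.getaddrinfo`, which goes through the operating system's resolver.

```python
import socket
import time


class FixedTtlDnsCache:
    """Toy model of a browser-style DNS cache with one fixed lifetime for all records."""

    def __init__(self, ttl_seconds=60.0, resolver=None, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        # The resolver stands in for the OS lookup; by default it asks the OS.
        self.resolver = resolver or self._os_lookup
        self.clock = clock
        self._entries = {}  # hostname -> (expiry time, list of addresses)

    @staticmethod
    def _os_lookup(hostname):
        # getaddrinfo goes through the OS resolver, which has its own cache.
        infos = socket.getaddrinfo(hostname, 80, proto=socket.IPPROTO_TCP)
        return sorted({info[4][0] for info in infos})

    def lookup(self, hostname):
        now = self.clock()
        cached = self._entries.get(hostname)
        if cached is not None and cached[0] > now:
            return cached[1]  # cache hit: no system call needed
        addresses = self.resolver(hostname)
        self._entries[hostname] = (now + self.ttl_seconds, addresses)
        return addresses
```

Because the cache never sees the real time-to-live of a record, every entry expires after the same fixed interval, which mirrors the behavior described in the first bullet.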

Here's a diagram of what a recursive DNS search looks like:

One worrying thing about DNS is that an entire domain like wikipedia.org or facebook.com seems to map to a single IP address. Fortunately, there are ways of mitigating the bottleneck:

    • Round-robin DNS is a solution where the DNS lookup returns multiple IP addresses, rather than just one. For example, facebook.com actually maps to four IP addresses.
    • A load balancer is a piece of hardware that listens on a particular IP address and forwards the requests to other servers. Major sites will typically use expensive high-performance load balancers.
    • Geographic DNS improves scalability by mapping a domain name to different IP addresses, depending on the client's geographic location. This is great for hosting static content so that different servers don't have to update shared state.
    • Anycast is a routing technique where a single IP address maps to multiple physical servers. Unfortunately, anycast does not fit well with TCP and is rarely used in that scenario.

Most of the DNS servers themselves use Anycast to achieve high availability and low latency of the DNS lookups.
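
As a rough illustration of the round-robin idea from the list above, the sketch below rotates a list of addresses on every query so that successive clients see a different address first. The class name `RoundRobinRecord` is hypothetical, the addresses used with it should come from a documentation range such as 192.0.2.0/24, and real DNS servers may order their answers differently.

```python
from collections import deque


class RoundRobinRecord:
    """Toy DNS record set that rotates its address list on every query."""

    def __init__(self, addresses):
        self._addresses = deque(addresses)

    def answer(self):
        # Return the current ordering, then rotate so the next client
        # sees a different address first.
        current = list(self._addresses)
        self._addresses.rotate(-1)
        return current
```

Clients usually connect to the first address in the answer, so rotating the order spreads connections across the servers without any coordination between them.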

3. The browser sends an HTTP request to the Web server

You can be pretty sure that Facebook's homepage will not be served from the browser cache because dynamic pages expire either very quickly or immediately (expiry date set to past).

So, the browser would send this request to the Facebook server:

    [...]
    [...]
    Accept-Encoding: gzip, deflate
    Connection: Keep-Alive
    Host: facebook.com
    Cookie: datr=1265876274-[...]; locale=en_us; lsd=WW[...]; c_user=2101[...]

The GET request names the URL to fetch: "http://facebook.com/". The browser identifies itself (User-Agent header), and states what types of responses it will accept (Accept and Accept-Encoding headers). The Connection header asks the server to keep the TCP connection open for further requests.
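
To make the wire format concrete, here is a minimal sketch that serializes a GET request by hand. The function name `build_get_request` is an assumption for illustration, and the sketch writes the request line in the path-only form with a separate Host header, which is what a browser typically sends on a direct connection; it does not send anything over the network.

```python
def build_get_request(host, path="/", extra_headers=None):
    """Serialize a minimal HTTP/1.1 GET request as bytes."""
    headers = {
        "Host": host,
        "Accept-Encoding": "gzip, deflate",
        "Connection": "keep-alive",
    }
    headers.update(extra_headers or {})
    lines = [f"GET {path} HTTP/1.1"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    # Header lines end with CRLF; an empty line marks the end of the headers.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")
```

Calling `build_get_request("facebook.com")` yields a request line followed by one header per line, the same shape as the request shown above.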

The request also contains the cookies that the browser has for this domain. As you probably already know, cookies are key-value pairs that track the state of a website between different page requests. And so the cookies store the name of the logged-in user, a secret number that was assigned to the user by the server, some of the user's settings, etc. The cookies are stored in a text file on the client, and sent to the server with every request.
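
On the server side, the Cookie header has to be split back into its key-value pairs. The sketch below does this with Python's standard `http.cookies` module; the helper name `parse_cookie_header` and the sample values are made up for illustration and are not Facebook's real cookies.

```python
from http.cookies import SimpleCookie


def parse_cookie_header(header_value):
    """Turn a Cookie header value into a plain dict of names to values."""
    jar = SimpleCookie()
    jar.load(header_value)
    return {name: morsel.value for name, morsel in jar.items()}
```

For example, `parse_cookie_header("locale=en_US; c_user=2101")` gives a dictionary with the two names `locale` and `c_user`.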

There is a variety of tools that let you view the raw HTTP requests and corresponding responses. My favorite tool for viewing the raw HTTP traffic is Fiddler, but there are many other tools (e.g., Firebug). These tools are a great help when optimizing a site.

In addition to GET requests, another type of request that you may be familiar with is a POST request, typically used to submit forms. A GET request sends its parameters via the URL (e.g., http://robozzle.com/puzzle.aspx?id=85). A POST request sends its parameters in the request body, just under the headers.
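
The difference between the two request types can be seen by constructing both with Python's standard `urllib`. This is only a sketch: the requests are built but never sent, and posting `id=85` to puzzle.aspx is a hypothetical example rather than something that page is known to accept.

```python
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request

params = {"id": "85"}

# GET: the parameters travel in the query string of the URL.
get_request = Request("http://robozzle.com/puzzle.aspx?" + urlencode(params))

# POST: the same parameters travel in the request body instead.
post_request = Request(
    "http://robozzle.com/puzzle.aspx",
    data=urlencode(params).encode("ascii"),
)

# Both encodings carry the same key-value pairs.
get_params = parse_qs(urlsplit(get_request.full_url).query)
post_params = parse_qs(post_request.data.decode("ascii"))
```

`urllib` picks the method from the presence of a body, so `post_request.get_method()` reports POST while `get_request.get_method()` reports GET.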

The trailing slash in the URL "http://facebook.com/" is important. In this case, the browser can safely add the slash. For URLs of the form http://example.com/folderOrFile, the browser cannot automatically add a slash, because it is not clear whether folderOrFile is a folder or a file. In such cases, the browser will visit the URL without the slash, and the server will respond with a redirect, resulting in an unnecessary roundtrip.

4. The Facebook server responds with a permanent redirect

This is the response that the Facebook server sent back to the browser request:

    HTTP/1.1 301 Moved Permanently
    Cache-Control: private, no-store, no-cache, must-revalidate, post-check=0, pre-check=0
    Expires: Sat, [...] 00:00:00 GMT
    Location: http://www.facebook.com/
    P3P: CP="DSP LAW"
    Pragma: no-cache
    Set-Cookie: made_write_conn=deleted; expires=Thu, 12-Feb-2009 05:09:50 GMT; path=/; domain=.facebook.com; httponly
    Content-Type: text/html; charset=utf-8
    X-Cnection: close
    Date: Fri, [...] 05:09:51 GMT
    Content-Length: 0

The server responded with a 301 Moved Permanently response to tell the browser to go to "http://www.facebook.com/" instead of "http://facebook.com/".

There are interesting reasons why the server insists on the redirect instead of immediately responding with the web page that the user wants to see.

One reason has to do with search engine rankings. See, if there are two URLs for the same page, say http://www.igoro.com/ and http://igoro.com/, a search engine may consider them to be two different sites, each with fewer incoming links and thus a lower ranking. Search engines understand permanent redirects (301), and will combine the incoming links from both sources into a single ranking.

Also, multiple URLs for the same content are not cache-friendly. When a piece of content has multiple names, it will potentially appear multiple times in caches.
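
A server-side rule that produces this kind of redirect could look roughly like the sketch below. It is not Facebook's actual code: the function name `respond` and the constant `CANONICAL_HOST` are assumptions, and the returned headers are reduced to the ones that matter for the redirect.

```python
CANONICAL_HOST = "www.facebook.com"  # assumption: the one host that serves pages


def respond(host, path):
    """Return (status, headers) for a request, redirecting non-canonical hosts."""
    if host != CANONICAL_HOST:
        # 301 tells browsers and search engines that the move is permanent.
        return 301, {
            "Location": f"http://{CANONICAL_HOST}{path}",
            "Content-Length": "0",
        }
    return 200, {"Content-Type": "text/html; charset=utf-8"}
```

With this rule every page has exactly one public name, which is what keeps both the search ranking and the caches from being split across two URLs.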

5. The browser follows the redirect

The browser now knows that "http://www.facebook.com/" is the correct URL to go to, and so it sends out another GET request:

    [...]
    [...]
    Accept-Encoding: gzip, deflate
    Connection: Keep-Alive
    Cookie: lsd=xw[...]; c_user=21[...]; x-referer=[...]
    Host: www.facebook.com

The meaning of the headers is the same as for the first request.

6. The server 'handles' the request

The server would receive the GET request, process it, and send back a response.

This may seem like a straightforward task, but in fact there is a lot of interesting stuff that happens here, even on a simple site like my blog, let alone on a massively scalable site like Facebook.

    • Web server software
      The web server software (e.g., IIS or Apache) receives the HTTP request and decides which request handler should be executed to handle this request. A request handler is a program (in ASP.NET, PHP, Ruby, ...) that reads the request and generates the HTML for the response.

      In the simplest case, the request handlers can be stored in a file hierarchy whose structure mirrors the URL structure, and so for example the http://example.com/folder1/page1.aspx URL would map to the file /httpdocs/folder1/page1.aspx. The web server software can also be configured so that URLs are manually mapped to request handlers, and so the public URL of page1.aspx could be http://example.com/folder1/page1.

    • Request handler
      The request handler reads the request, its parameters, and cookies. It will read and possibly update some data stored on the server. Then, the request handler will generate an HTML response.
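
The two ways of mapping URLs to request handlers described above can be sketched as follows. This is an illustration rather than how IIS or Apache is actually configured: the names `DOC_ROOT`, `ROUTES`, and `resolve_handler` are assumptions, and the route table contains a single hypothetical entry taken from the example URL.

```python
import posixpath

DOC_ROOT = "/httpdocs"

# Manually configured routes: public URL path -> handler file (hypothetical).
ROUTES = {"/folder1/page1": "/httpdocs/folder1/page1.aspx"}


def resolve_handler(url_path):
    """Pick the handler file for a URL path."""
    if url_path in ROUTES:
        return ROUTES[url_path]
    # Fall back to the file hierarchy that mirrors the URL structure.
    # normpath collapses ".." segments so the result stays under DOC_ROOT.
    normalized = posixpath.normpath("/" + url_path.lstrip("/"))
    return DOC_ROOT + normalized
```

Both `/folder1/page1.aspx` and the shorter public path `/folder1/page1` resolve to the same handler file in this sketch.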

One interesting difficulty that every dynamic website faces is how to store data. Smaller sites will often have a single SQL database to store their data, but sites that store a large amount of data and/or have many visitors have to find a way to split the database across multiple machines. Solutions include sharding (splitting up a table across multiple databases based on the primary key), replication, and usage of simplified databases with weakened consistency semantics.
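
A minimal sketch of sharding by primary key follows, assuming four databases with the made-up names `db0` to `db3`. Real systems often use more elaborate schemes, such as consistent hashing or lookup tables, so that shards can be added without moving most of the rows; this version only shows the basic idea.

```python
import zlib

SHARDS = ["db0", "db1", "db2", "db3"]  # hypothetical database names


def shard_for(primary_key):
    """Map a primary key to one shard, the same way on every call."""
    # crc32 is stable across processes, unlike Python's built-in hash() for str.
    digest = zlib.crc32(str(primary_key).encode("utf-8"))
    return SHARDS[digest % len(SHARDS)]
```

Every server that runs this function sends a given key to the same database, so a row can be found again without consulting a central directory.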

One technique to keep data updates cheap is to defer some of the work to a batch job. For example, Facebook has to update the newsfeed in a timely fashion, but the data backing the 'People you may know' feature may only need to be updated nightly (my guess, I don't actually know how they implement this feature). Batch job updates result in staleness of some less important data, but can make data updates much faster and simpler.

7. The server sends back an HTML response

Here is the response that the server generated and sent back:

    HTTP/1.1 200 OK
    Cache-Control: private, no-store, no-cache, must-revalidate, post-check=0, pre-check=0
    Expires: Sat, [...] 00:00:00 GMT
    P3P: CP="DSP LAW"
    Pragma: no-cache
    Content-Encoding: gzip
    Content-Type: text/html; charset=utf-8
    X-Cnection: close
    Transfer-Encoding: chunked
    Date: Fri, [...] 09:05:55 GMT

    2b3
    [...]

The bulk of the response is in the byte blob at the end, which I trimmed.

The Content-Encoding header tells the browser that the response body is compressed using the gzip algorithm. After decompressing the blob, you'll see the HTML you'd expect:

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

In addition to compression, headers specify whether and how to cache the page, any cookies to set (none in this response), privacy information, etc.
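
The compression round trip can be reproduced with Python's standard `gzip` module. The HTML snippet and the `headers` dictionary below are placeholders for illustration, not the actual Facebook response.

```python
import gzip

html = b"<!DOCTYPE html><html><body>Hello</body></html>"

# Server side: compress the body and announce it via Content-Encoding.
body = gzip.compress(html)
headers = {"Content-Encoding": "gzip", "Content-Type": "text/html; charset=utf-8"}

# Browser side: undo the encoding named in the header before parsing.
if headers.get("Content-Encoding") == "gzip":
    decoded = gzip.decompress(body)
else:
    decoded = body
```

The compressed body is opaque binary data, which is why the raw response looks like a blob until it is decoded.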

Notice the header that sets Content-Type to text/html. The header instructs the browser to render the response content as HTML, instead of, say, downloading it as a file. The browser will use the header to decide how to interpret the response, but will consider other factors as well, such as the extension of the URL.
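
Reading the media type and character set out of the Content-Type header can be sketched with the standard `email.message` module, which understands this header syntax. The helper name `parse_content_type` is an assumption, and a real browser applies further rules, such as content sniffing, that are not modeled here.

```python
from email.message import Message


def parse_content_type(header_value):
    """Split a Content-Type header into (media type, charset)."""
    message = Message()
    message["Content-Type"] = header_value
    return message.get_content_type(), message.get_content_charset()
```

For the response above, `parse_content_type("text/html; charset=utf-8")` separates the media type, which selects the HTML renderer, from the charset, which selects how the bytes are decoded into text.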

8. The browser begins rendering the HTML

Even before the browser has received the entire HTML document, it begins rendering the website:

9. The browser sends requests for objects embedded in HTML

As the browser renders the HTML, it will notice tags that require fetching of other URLs. The browser will send a GET request to retrieve each of these files.

Here are a few URLs that my visit to facebook.com retrieved:

    • Images
      http://static.ak.fbcdn.net/rsrc.php/z12E0/hash/8q2anwu7.gif
      http://static.ak.fbcdn.net/rsrc.php/zBS5C/hash/7hwy7at6.gif
      ...
    • CSS style sheets
      http://static.ak.fbcdn.net/rsrc.php/z448Z/hash/2plh8s4n.css
      http://static.ak.fbcdn.net/rsrc.php/zANE1/hash/cvtutcee.css
      ...
    • JavaScript files
      http://static.ak.fbcdn.net/rsrc.php/zEMOA/hash/c8yzb6ub.js
      http://static.ak.fbcdn.net/rsrc.php/z6R9L/hash/cq2lgbs8.js
      ...

Each of these URLs will go through a process similar to what the HTML page went through. So, the browser will look up the domain name in DNS, send a request to the URL, follow redirects, etc.
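
The way a browser discovers these embedded URLs can be approximated with Python's standard `html.parser`. The class name `ResourceCollector` is an assumption, and the sketch only looks at images, scripts, and style sheet links, whereas a real browser handles many more tags and starts fetching while the document is still arriving.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceCollector(HTMLParser):
    """Collect URLs of images, style sheets, and scripts referenced by a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag in ("img", "script") and attributes.get("src"):
            # Relative references are resolved against the page URL.
            self.urls.append(urljoin(self.base_url, attributes["src"]))
        elif (
            tag == "link"
            and attributes.get("rel") == "stylesheet"
            and attributes.get("href")
        ):
            self.urls.append(urljoin(self.base_url, attributes["href"]))
```

Feeding a page to the collector with `feed()` fills `urls` in document order; each entry would then get its own DNS lookup and GET request.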

However, static files, unlike dynamic pages, allow the browser to cache them. Some of the files may be served up from cache, without contacting the server at all. The browser knows how long to cache a particular file because the response that returned the file contained an Expires header. Additionally, each response may also contain an ETag header that works like a version number: if the browser sees an ETag for a version of the file it already has, it can stop the transfer immediately.
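
Both caching checks can be sketched in a few lines. The function names `is_fresh` and `revalidate` are assumptions. Note that the ETag part is modeled as the standard conditional request, where the browser sends its stored ETag in an If-None-Match header and the server answers 304 Not Modified with no body, which is a slightly different mechanism from aborting a transfer midway.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def is_fresh(expires_header, now):
    """True if a cached response may be reused without contacting the server."""
    return parsedate_to_datetime(expires_header) > now


def revalidate(cached_etag, current_etag):
    """Server-side check: 304 means the cached copy is still good, with no body."""
    return 304 if cached_etag == current_etag else 200
```

A dynamic page with an Expires date in the past fails `is_fresh` on every visit, which is why it is always requested again, while a static file with a far-future date is served straight from the cache.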

Can you guess what "fbcdn.net" in the URLs stands for? A safe bet is that it means "Facebook content delivery network". Facebook uses a content delivery network (CDN) to distribute static content: images, style sheets, and JavaScript files. So, the files will be copied to many machines across the globe.

Static content often represents the bulk of the bandwidth of a site, and can be easily replicated across a CDN. Often, sites will use a third-party CDN provider, instead of operating a CDN themselves. For example, Facebook's static files are hosted by Akamai, the largest CDN provider.

As a demonstration, when you try to ping static.ak.fbcdn.net, you will get a response from an akamai.net server. Also, interestingly, if you ping the URL a couple of times, you may get responses from different servers, which demonstrates the load-balancing that happens behind the scenes.

10. The browser sends further asynchronous (AJAX) requests

In the spirit of Web 2.0, the client continues to communicate with the server even after the page is rendered.

For example, Facebook chat will continue to update the list of your logged-in friends as they come and go. To update the list of your logged-in friends, the JavaScript executing in your browser has to send an asynchronous request to the server. The asynchronous request is a programmatically constructed GET or POST request that goes to a special URL. In the Facebook example, the client sends a POST request to http://www.facebook.com/ajax/chat/buddy_list.php to fetch the list of your friends who are online.

This pattern is sometimes referred to as "AJAX", which stands for "Asynchronous JavaScript and XML", even though there is no particular reason why the server has to format the response as XML. For example, Facebook returns snippets of JavaScript code in response to asynchronous requests.

Among other things, the Fiddler tool lets you view the asynchronous requests sent by your browser. In fact, not only can you observe the requests passively, but you can also modify and resend them. The fact that it is this easy to "spoof" AJAX requests causes a lot of grief to developers of online games with scoreboards. (Obviously, please don't cheat that way.)

Facebook chat provides an example of an interesting problem with AJAX: pushing data from the server to the client. Since HTTP is a request-response protocol, the chat server cannot push new messages to the client. Instead, the client has to poll the server every few seconds to see if any new messages have arrived.

Long polling is an interesting technique to decrease the load on the server in these types of scenarios. If the server does not have any new messages when polled, it simply does not send a response back. And, if a message for this client is received within the timeout period, the server will find the outstanding request and return the message with the response.
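
The server side of long polling can be modeled with a blocking queue. This is a simplified sketch, not Facebook's implementation: the function name `long_poll` and the per-client `mailbox` queue are assumptions, and a production server would usually hold the open requests with asynchronous I/O rather than one blocked thread per client.

```python
import queue


def long_poll(mailbox, timeout_seconds):
    """Hold the request open until a message arrives or the timeout passes."""
    try:
        # Blocks instead of answering "nothing new" right away.
        return mailbox.get(timeout=timeout_seconds)
    except queue.Empty:
        return None  # timeout: the client is expected to poll again
```

Compared with polling every few seconds, the client makes one request per message or per timeout, and a new message is delivered as soon as it is put into the mailbox.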

Conclusion

Hopefully this gives you a better idea of how the different web pieces work together.
