As a software developer, you certainly have a high-level picture of how web applications work and which technologies are involved: the browser, HTTP, HTML, web servers, request handlers, and so on.
This article takes a deeper look at what happens behind the scenes when you visit a website.
1. You enter a URL into the browser
2. The browser looks up the IP address for the domain name
The first step in the navigation is to find the IP address of the domain being visited. The DNS lookup proceeds as follows:
Browser cache: The browser caches DNS records for some time. Interestingly, the operating system does not tell the browser the time-to-live of each DNS record, so the browser keeps records for a fixed duration (which varies between browsers, from 2 to 30 minutes).
System cache: If the browser cache does not contain the record, the browser makes a system call (gethostbyname on Windows), which gives it access to the records in the operating system's cache.
Router cache: The request continues on to your router, which typically has its own DNS cache.
ISP DNS cache: Next, the cache of your ISP's DNS server is checked. The record can often be found here.
Recursive search: Your ISP's DNS server begins a recursive search, from the root name server, through the .com top-level name server, to Facebook's name server. Normally the DNS server already has the .com name servers in its cache, so a query to the root name server is not necessary.
(Figure: a recursive DNS lookup.)
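The lookup order above can be sketched as a chain of caches that is consulted in turn. This is a simplified model and not how any particular browser or resolver is implemented: the `resolve` function, the cache labels, and the addresses (taken from the documentation-only 192.0.2.0/24 range) are assumptions made for illustration, and record expiry is left out.

```python
from typing import Callable, Dict, List, Tuple

Cache = Dict[str, str]  # maps a domain name to an IP address


def resolve(
    name: str,
    caches: List[Tuple[str, Cache]],
    recursive_lookup: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (ip, source), checking each cache in order before recursing."""
    for index, (label, cache) in enumerate(caches):
        if name in cache:
            ip = cache[name]
            # Fill the caches that were checked earlier and missed.
            for _, earlier in caches[:index]:
                earlier[name] = ip
            return ip, label
    # Nothing was cached anywhere: fall back to the full recursive search.
    ip = recursive_lookup(name)
    for _, cache in caches:
        cache[name] = ip
    return ip, "recursive lookup"


# Hypothetical cache contents; only the ISP cache knows the name at first.
caches = [
    ("browser cache", {}),
    ("system cache", {}),
    ("router cache", {}),
    ("ISP DNS cache", {"facebook.com": "192.0.2.10"}),
]


def fake_recursive_lookup(name: str) -> str:
    # Stand-in for the root -> .com -> authoritative name server walk.
    return "192.0.2.99"


first = resolve("facebook.com", caches, fake_recursive_lookup)
second = resolve("facebook.com", caches, fake_recursive_lookup)
print(first)   # answered by the ISP DNS cache
print(second)  # answered by the browser cache, which was filled in
```

The second call never reaches the ISP cache, which is the point of caching close to the client.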
One worrying thing about DNS is that an entire domain such as wikipedia.org or facebook.com seems to map to a single IP address. Fortunately, there are several ways to mitigate this bottleneck:
Round-robin DNS is a solution in which the DNS lookup returns multiple IP addresses rather than just one. For example, facebook.com actually maps to four IP addresses.
A load balancer is a piece of hardware that listens on a particular IP address and forwards requests to other servers in a cluster. Major sites typically use expensive high-performance load balancers.
Geographic DNS improves scalability by mapping a domain name to different IP addresses depending on the client's geographic location. This works very well for static content, because the different servers do not have to keep shared state in sync.
Anycast is a routing technique in which a single IP address maps to multiple physical servers. Unfortunately, anycast does not fit well with TCP, so it is rarely used in that scenario.
Most DNS servers themselves use anycast to achieve efficient, low-latency DNS lookups.
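As a rough illustration of round-robin (cyclic) DNS, the sketch below hands out a different address to each successive lookup. The four addresses are hypothetical values from the documentation-only 192.0.2.0/24 range, not Facebook's real addresses. Real DNS servers usually return the whole list in rotated order rather than a single entry, so treat this as a simplification.

```python
from itertools import cycle

# Hypothetical addresses; a real zone would list the site's actual servers.
addresses = ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"]
rotation = cycle(addresses)


def next_address() -> str:
    """Return the address the next client would be directed to."""
    return next(rotation)


picked = [next_address() for _ in range(6)]
print(picked)  # wraps around to the first address after the fourth lookup
```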
3. The browser sends an HTTP request to the web server
You can be fairly sure that a dynamic page like the Facebook homepage will not be served from the browser cache, because dynamic pages expire very quickly or even immediately (their expiry date is set in the past).
Therefore, the browser sends the following request to the Facebook server:
GET http://facebook.com/ HTTP/1.1
Accept: application/x-ms-application, image/jpeg, application/xaml+xml, [...]
User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; [...]
Accept-Encoding: gzip, deflate
Connection: Keep-Alive
Host: facebook.com
Cookie: datr=1265876274-[...]; locale=en_US; lsd=WW[...]; c_user=2101[...]
The GET request names the URL to fetch: "http://facebook.com/". The browser identifies itself (User-Agent header) and states which types of response it will accept (Accept and Accept-Encoding headers). The Connection header asks the server to keep the TCP connection open for further requests.
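To make the structure of such a request concrete, here is a minimal sketch that splits a raw request into its request line and headers. `parse_request` is a helper written for this illustration, and the sample text is a shortened copy of the trace above. A production parser has to handle many more cases (repeated headers, malformed lines, a request body).

```python
from typing import Dict, Tuple

RAW_REQUEST = (
    "GET http://facebook.com/ HTTP/1.1\r\n"
    "Accept-Encoding: gzip, deflate\r\n"
    "Connection: Keep-Alive\r\n"
    "Host: facebook.com\r\n"
    "\r\n"
)


def parse_request(raw: str) -> Tuple[str, str, str, Dict[str, str]]:
    """Split a raw HTTP/1.1 request head into its request line and headers."""
    head = raw.split("\r\n\r\n", 1)[0]
    lines = head.split("\r\n")
    method, target, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so normalise them to lower case.
        headers[name.strip().lower()] = value.strip()
    return method, target, version, headers


method, target, version, headers = parse_request(RAW_REQUEST)
print(method, target, version)
print(headers["host"], "|", headers["connection"])
```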
The request also contains the cookies that the browser has stored for this domain. As you probably already know, cookies are key-value pairs that track the state of a website between page requests. The cookies store the name of the logged-in user, a secret value assigned to the user by the server, some of the user's settings, and so on. Cookies are stored as text on the client and sent to the server with every request.
There are many tools that let you view raw HTTP requests and the corresponding responses. The author prefers Fiddler, but there are other tools as well, such as Firebug. These tools are very helpful when optimizing a website.
In addition to GET requests, another type of request you may be familiar with is the POST request, which is typically used to submit forms. A GET request passes its parameters in the URL (e.g., http://robozzle.com/puzzle.aspx?id=85). A POST request sends its parameters in the request body, just below the headers.
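The difference between the two request types can be shown with Python's standard library. The sketch below only builds the request objects and never sends them. Whether the robozzle.com page accepts a POST at all is an assumption; the URL is reused from the example above purely for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {"id": 85}

# GET: the parameters are appended to the URL as a query string.
get_request = Request("http://robozzle.com/puzzle.aspx?" + urlencode(params))

# POST: the same parameters travel in the request body instead.
post_request = Request(
    "http://robozzle.com/puzzle.aspx",
    data=urlencode(params).encode("ascii"),
    method="POST",
)

print(get_request.get_method(), get_request.full_url, get_request.data)
print(post_request.get_method(), post_request.full_url, post_request.data)
```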
The trailing slash in the URL "http://facebook.com/" is important. In this case, the browser can safely add the slash. For a URL whose path ends in a name that could be either a folder or a file, the browser cannot add the slash automatically. It then requests the address without the slash, and the server responds with a redirect, resulting in an unnecessary round trip.
4. The Facebook server responds with a permanent redirect
This is the response that the Facebook server sent back to the browser:
HTTP/1.1 301 Moved Permanently
Cache-Control: private, no-store, no-cache, must-revalidate, post-check=0,
pre-check=0
Expires: Sat, 01 Jan 2000 00:00:00 GMT
Location: http://www.facebook.com/
P3P: CP="DSP LAW"
Pragma: no-cache
Set-Cookie: made_write_conn=deleted; expires=Thu, 12-Feb-2009 05:09:50 GMT;
path=/; domain=.facebook.com; httponly
Content-Type: text/html; charset=utf-8
X-Cnection: close
Date: Fri, 12 Feb 2010 05:09:51 GMT
Content-Length: 0
The server returns a 301 Moved Permanently response, telling the browser to go to "http://www.facebook.com/" instead of "http://facebook.com/".
Why does the server insist on a redirect instead of immediately sending the page the user wants to see? There are several interesting answers to this question.
One reason has to do with search engine rankings. If a page has two addresses, such as http://www.igoro.com/ and http://igoro.com/, a search engine may treat them as two different sites, each with fewer incoming links and therefore a lower ranking. Search engines understand permanent redirects (301), so the addresses with and without www are ranked as a single site.
Another reason is that multiple addresses for the same content are not cache-friendly. When a page has several names, it may appear in caches several times.
5. The browser follows the redirect
Now the browser knows that "http://www.facebook.com/" is the correct address to visit, so it sends another GET request:
GET http://www.facebook.com/ HTTP/1.1
Accept: application/x-ms-application, image/jpeg, application/xaml+xml, [...]
Accept-Language: en-US
User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; [...]
Accept-Encoding: gzip, deflate
Connection: Keep-Alive
Cookie: lsd=XW[...]; c_user=21[...]; x-referer=[...]
Host: www.facebook.com
The header information has the same meaning as in the previous request.
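Steps 3 to 5 amount to a small loop on the client: request a URL, and if the answer is a redirect, request the address in the Location header instead. The sketch below simulates this with an in-memory table rather than real network calls. `FAKE_SERVER`, `fetch_following_redirects`, and the hop limit are assumptions made for illustration; only the 301 from http://facebook.com/ is taken from the trace above.

```python
from typing import Dict, List, Tuple

# A stand-in for the network: URL -> (status code, response headers).
FAKE_SERVER: Dict[str, Tuple[int, Dict[str, str]]] = {
    "http://facebook.com/": (301, {"Location": "http://www.facebook.com/"}),
    "http://www.facebook.com/": (200, {"Content-Type": "text/html; charset=utf-8"}),
}

REDIRECT_CODES = {301, 302, 303, 307, 308}


def fetch_following_redirects(url: str, max_hops: int = 5) -> Tuple[int, List[str]]:
    """Return the final status and the list of URLs visited along the way."""
    visited = [url]
    for _ in range(max_hops):
        status, headers = FAKE_SERVER[url]
        if status not in REDIRECT_CODES:
            return status, visited
        # Follow the redirect to the address the server named.
        url = headers["Location"]
        visited.append(url)
    # Guard against redirect loops.
    raise RuntimeError("too many redirects")


result = fetch_following_redirects("http://facebook.com/")
print(result)
```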
6. The server "handles" the request
The server receives the GET request, processes it, and sends back a response.
This may seem like a straightforward task, but a lot of interesting things happen along the way, even on a simple site like the author's blog, let alone on a site with traffic as heavy as Facebook's.
Web Server Software
The web server software (such as IIS or Apache) receives the HTTP request and decides which request handler should process it. A request handler is a program (written in ASP.NET, PHP, Ruby, ...) that reads the request and generates the HTML for the response.
In the simplest case, the request handlers are stored in a file hierarchy whose structure mirrors the URL structure. For example, the address http://example.com/folder1/page1.aspx maps to the file /httpdocs/folder1/page1.aspx. The web server software can also be configured to map addresses to request handlers manually, so that the public address of page1.aspx could be http://example.com/folder1/page1.
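The two mapping styles can be sketched in a few lines. This is an illustrative model, not the configuration mechanism of IIS or Apache: `DOCUMENT_ROOT`, `MANUAL_ROUTES`, and `handler_for` are hypothetical names, and a real server would also have to reject paths that try to escape the document root (for example with "..").

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

DOCUMENT_ROOT = PurePosixPath("/httpdocs")

# Manually configured routes: public path -> handler file (hypothetical table).
MANUAL_ROUTES = {"/folder1/page1": "/httpdocs/folder1/page1.aspx"}


def handler_for(url: str) -> str:
    """Return the path of the handler file that should serve this URL."""
    path = urlsplit(url).path
    if path in MANUAL_ROUTES:
        return MANUAL_ROUTES[path]
    # Default: mirror the URL path under the document root.
    return str(DOCUMENT_ROOT / path.lstrip("/"))


by_file = handler_for("http://example.com/folder1/page1.aspx")
by_route = handler_for("http://example.com/folder1/page1")
print(by_file)
print(by_route)
```

Both addresses end up at the same handler file, one through the mirrored hierarchy and one through the manual route.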
Request Handler
The request handler reads the request, its parameters, and its cookies. It reads, and possibly updates, some data stored on the server. Then the request handler generates an HTML response.
Every dynamic website faces one interesting challenge: how to store data. Smaller sites often have a single SQL database for their data, but sites that store a large amount of data and/or receive a lot of traffic have to find a way to split the database across multiple machines. Solutions include sharding (splitting a table across multiple databases based on the primary key), replication, and simplified databases with weakened consistency semantics.
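As a minimal sketch of sharding by primary key, the function below picks a database from an integer key with a modulo. The shard names and the modulo scheme are assumptions made for illustration; nothing here describes how Facebook actually partitions its data. Plain modulo makes it costly to add shards later, which is why schemes such as consistent hashing are often used instead.

```python
SHARDS = ["db0", "db1", "db2", "db3"]  # hypothetical database names


def shard_for(user_id: int) -> str:
    """Pick a shard deterministically from the row's integer primary key."""
    return SHARDS[user_id % len(SHARDS)]


# Every lookup for the same key lands on the same database.
print(shard_for(2101))
print(shard_for(2102))
```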
One technique for keeping data updates cheap is to defer some of the work to a batch job. For example, Facebook has to update the news feed in a timely manner, but the data behind the "People you may know" feature may only need to be updated nightly (this is a guess; the author does not know how the feature is actually implemented). Batch updates make some less important data stale, but they can make data updates much faster and simpler.
7. The server sends back an HTML response
Here is the response that the server generated and sent back:
HTTP/1.1 200 OK
Cache-Control: private, no-store, no-cache, must-revalidate, post-check=0,
pre-check=0
Expires: Sat, 01 Jan 2000 00:00:00 GMT
P3P: CP="DSP LAW"
Pragma: no-cache
Content-Encoding: gzip
Content-Type: text/html; charset=utf-8
X-Cnection: close
Transfer-Encoding: chunked
Date: Fri, 12 Feb 2010 09:05:55 GMT
2b3Tn@[...]
The entire response is 35 kB, most of it in the binary blob at the end, which is truncated above.
The Content-Encoding header tells the browser that the response body is compressed using the gzip algorithm. After decompressing the blob, you see the HTML you would expect:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"
      lang="en" id="facebook" class=" no_js">
...
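The gzip round trip can be reproduced with Python's standard `gzip` module. The HTML bytes below are a made-up stand-in for the real page, which is truncated in the trace. The sketch also ignores the chunked transfer encoding that the real response used.

```python
import gzip

# Hypothetical page content standing in for the real response body.
html = b'<html lang="en" id="facebook" class=" no_js">...</html>'

# What a server does before sending "Content-Encoding: gzip".
compressed_body = gzip.compress(html)

# What the browser does after reading that header.
restored = gzip.decompress(compressed_body)

print(compressed_body[:2])  # gzip streams start with the bytes 1f 8b
print(restored == html)
```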
In addition to compression, the headers specify whether and how to cache the page, which cookies to set (none in this response), and privacy information.
Note that the Content-Type header is set to "text/html". This header instructs the browser to render the response content as HTML rather than, say, downloading it as a file. The browser uses the header to decide how to interpret the response, but it also considers other factors, such as the extension of the URL.
8. The browser starts displaying HTML
Even before the browser has received the entire HTML document, it begins rendering the page.
9. The browser requests the objects embedded in the HTML
As the browser renders the HTML, it notices tags that require fetching other addresses. The browser sends a GET request to retrieve each of these files.
Here are some of the URLs that were retrieved during the visit to facebook.com:
Images
http://static.ak.fbcdn.net/rsrc.php/z12E0/hash/8q2anwu7.gif
http://static.ak.fbcdn.net/rsrc.php/zBS5C/hash/7hwy7at6.gif
...
CSS style sheets
http://static.ak.fbcdn.net/rsrc.php/z448Z/hash/2plh8s4n.css
http://static.ak.fbcdn.net/rsrc.php/zANE1/hash/cvtutcee.css
...
JavaScript files
http://static.ak.fbcdn.net/rsrc.php/zEMOA/hash/c8yzb6ub.js
http://static.ak.fbcdn.net/rsrc.php/z6R9L/hash/cq2lgbs8.js
...
Each of these URLs goes through a process similar to the one the HTML page went through: the browser looks up the domain name in DNS, sends a request, follows redirects, and so on.
Unlike dynamic pages, however, static files can be cached by the browser. Some files may be served straight from the cache without contacting the server at all. The server's response states how long a static file may be kept, so the browser knows how long to cache it. In addition, each response may contain an ETag header (an entity tag that identifies one specific version of the file), which works like a version number. If the browser already holds the version that the ETag identifies, the file does not need to be transferred again.
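In standard HTTP, the ETag check is usually carried out as a conditional request: the browser sends the stored ETag back in an If-None-Match header, and the server answers 304 Not Modified with an empty body if the file has not changed. The sketch below models only the server-side decision. `make_etag` and `serve` are hypothetical helpers, and deriving the tag from a content hash is one common choice rather than a requirement.

```python
import hashlib
from typing import Dict, Tuple


def make_etag(body: bytes) -> str:
    """Derive a validator from the content (one common choice, not the only one)."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def serve(body: bytes, request_headers: Dict[str, str]) -> Tuple[int, Dict[str, str], bytes]:
    """Answer 304 with no body when the client's cached copy is still current."""
    etag = make_etag(body)
    if request_headers.get("If-None-Match") == etag:
        return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag}, body


stylesheet = b"body { margin: 0; }"

# First visit: nothing is cached, so the full file is transferred.
status1, headers1, body1 = serve(stylesheet, {})

# Later visit: the browser revalidates with the ETag it stored.
status2, headers2, body2 = serve(stylesheet, {"If-None-Match": headers1["ETag"]})

# After the file changes on the server, the old ETag no longer matches.
status3, headers3, body3 = serve(b"body { margin: 1px; }", {"If-None-Match": headers1["ETag"]})

print(status1, status2, status3)
```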
Can you guess what "fbcdn.net" in these addresses stands for? A safe bet is "Facebook content delivery network". Facebook uses a content delivery network (CDN) to distribute static files such as images, style sheets, and JavaScript files. These files are therefore copied to many CDN data centers around the world.
Static content often accounts for the bulk of a site's bandwidth, and it can easily be replicated across a CDN. Sites often use a third-party CDN rather than operating one themselves. For example, Facebook's static files are hosted by Akamai, the largest CDN provider.
As a demonstration, when you ping static.ak.fbcdn.net, you may get a response from an akamai.net server. Interestingly, if you ping it again, the responding server may be different, which shows the load balancing that happens behind the scenes.
10. The browser sends further asynchronous (AJAX) requests
In the spirit of Web 2.0, the client keeps communicating with the server even after the page has been rendered.
Take Facebook chat as an example: it keeps in touch with the server to update the list of your friends shown as online or offline. To update this list, the JavaScript running in the browser sends an asynchronous request to the server. The asynchronous request is a programmatically constructed GET or POST request that goes to a special address. In the Facebook example, the client sends a request to http://www.facebook.com/ajax/chat/buddy_list.php to fetch the online status of your friends.
This pattern is often called "AJAX", which stands for "Asynchronous JavaScript and XML", even though there is no particular reason why the server has to respond in XML. For example, Facebook returns snippets of JavaScript code in response to asynchronous requests.
Among other things, Fiddler lets you see the asynchronous requests sent by the browser. In fact, you can not only observe these requests passively, but also modify and resend them. The fact that AJAX requests are this easy to spoof causes a lot of grief for developers of online games with scoreboards. (Of course, please do not cheat that way.)
Facebook chat is an example of an interesting problem with AJAX: pushing data from the server to the client. Because HTTP is a request-response protocol, the chat server cannot push new messages to the client. Instead, the client has to poll the server every few seconds to check whether any new messages have arrived.
In this kind of scenario, long polling is an interesting technique for reducing the load on the server. If the server has no new messages when it is polled, it simply does not send a response yet. When a message for this client arrives before the request times out, the server finds the outstanding request and returns the message in the response.
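Long polling can be imitated in a single process with a queue and a thread. The sketch below is a toy model under stated assumptions: `inbox`, `handle_poll`, and `deliver_later` are hypothetical names, the queue stands in for the server's per-client message store, and the blocking call stands in for an HTTP request that the server leaves unanswered until a message arrives or the timeout expires.

```python
import queue
import threading
import time

inbox: "queue.Queue[str]" = queue.Queue()  # messages waiting for one client


def handle_poll(timeout_seconds: float) -> str:
    """Server side: hold the request open until a message arrives or time runs out."""
    try:
        return inbox.get(timeout=timeout_seconds)
    except queue.Empty:
        return ""  # empty response; the client simply polls again


def deliver_later(message: str, delay_seconds: float) -> None:
    """Simulate another user sending a chat message after a short delay."""
    time.sleep(delay_seconds)
    inbox.put(message)


sender = threading.Thread(target=deliver_later, args=("hello", 0.2))
sender.start()

started = time.monotonic()
reply = handle_poll(timeout_seconds=5.0)  # returns once the message arrives
elapsed = time.monotonic() - started
sender.join()

timed_out = handle_poll(timeout_seconds=0.1)  # nothing queued this time

print(repr(reply), repr(timed_out))
```

The first poll returns as soon as the message is delivered rather than after the full timeout, so the client sees new messages promptly without sending a request every few seconds.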