As a software developer, you certainly have a solid, layered understanding of how web applications work, and of the technologies involved: browsers, HTTP, HTML, web servers, request handling, and so on.
This article takes a deeper look at what exactly happens behind the scenes when you enter a URL.
1. First, you type a URL into the browser:
2. The browser finds the IP address of the domain name
The first step in navigation is resolving the domain name being visited to an IP address. The DNS lookup proceeds as follows:
* Browser cache – The browser caches DNS records for some time. Interestingly, the operating system does not tell the browser how long to keep them, so each browser caches records for a fixed period of its own choosing (anywhere from 2 to 30 minutes).
* OS cache – If the browser cache does not contain the needed record, the browser makes a system call (gethostbyname on Windows), which checks the operating system's cache.
* Router cache – Next, the query reaches your router, which typically has its own DNS cache.
* ISP DNS cache – The next place checked is your ISP's caching DNS server, where the record can usually be found.
* Recursive search – Your ISP's DNS server begins a recursive search, from the root name server, through the .com top-level name server, down to Facebook's name server. Normally the caching DNS server already has the names of the .com name servers, so a trip all the way up to the root is not usually necessary.
The DNS recursive lookup proceeds as described above (the original article illustrated it with a diagram).
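As a concrete illustration (a sketch, not from the original article), here is how a program asks the OS resolver for an address; the resolver walks the caches above before querying a DNS server:

```python
import socket

# Ask the operating system's resolver for an address. Under the hood
# this walks the caches described above before querying a DNS server.
ip = socket.gethostbyname("facebook.com")
print(ip)  # prints whichever address the lookup returns
```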
One worrying thing about DNS is that an entire domain name, such as wikipedia.org or facebook.com, appears to map to a single IP address. Fortunately, there are several ways to eliminate this bottleneck:
* Round-robin DNS is a solution in which the DNS lookup returns multiple IP addresses instead of one. For example, facebook.com actually maps to four IP addresses.
* A load balancer is a hardware device that listens on a particular IP address and forwards requests to a cluster of servers. Major sites typically use expensive, high-performance load balancers.
* Geo-DNS improves scalability by mapping a domain name to different IP addresses depending on the user's geographic location. This works best for content the different servers do not have to keep in sync, which makes it a great fit for static content.
* Anycast is a routing technique that maps many physical hosts to a single IP address. The fly in the ointment is that anycast does not fit well with TCP, so it is rarely used in that scenario.
Most DNS servers themselves use anycast to achieve efficient, low-latency DNS lookups.
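Round-robin DNS, mentioned above, is easy to observe yourself. A minimal sketch (not from the original article): getaddrinfo may return several addresses for one name, and clients typically try them in order.

```python
import socket

# List every IP address the lookup returns for one domain name.
infos = socket.getaddrinfo("facebook.com", 80, proto=socket.IPPROTO_TCP)
for *_, sockaddr in infos:
    print(sockaddr[0])  # one line per returned address
```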
3. The browser sends an HTTP request to the Web server
Dynamic pages such as the Facebook home page expire from the browser cache very quickly after being opened, so there is no doubt they cannot be read from the cache.
The browser therefore sends the following request to the server Facebook lives on:
GET http://facebook.com/ HTTP/1.1
Accept: application/x-ms-application, image/jpeg, application/xaml+xml, [...]
User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; [...]
Accept-Encoding: gzip, deflate
Connection: Keep-Alive
Host: facebook.com
Cookie: datr=1265876274-[...]; locale=en_US; lsd=WW[...]; c_user=2101[...]
This GET request names the URL to retrieve: "http://facebook.com/". The browser identifies itself (the User-Agent header) and states what kinds of responses it will accept (the Accept and Accept-Encoding headers). The Connection header asks the server not to close the TCP connection, so it can be reused for subsequent requests.
The request also carries the cookies the browser has stored for this domain. As you probably know, cookies are key-value pairs that preserve a website's state across different page requests. Cookies thus store the logged-in user name, a secret number the server assigned to the user, some of the user's settings, and so on. Cookies are kept as a text file on the client machine and are sent to the server with every request.
There is a variety of tooling for viewing the raw HTTP request and its corresponding response. The author's favorite is fiddler, though there are other tools as well (Firebug, for example). Such software is a great help when optimizing a website.
Besides GET requests, there is another kind you may be familiar with: POST requests, typically used to submit forms. A GET request passes its parameters via the URL (e.g. http://robozzle.com/puzzle.aspx?id=85). A POST request sends its parameters in the request body, just below the headers.
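To make the GET/POST distinction concrete, here is a small sketch using Python's standard library. The GET URL is the article's example (it may no longer resolve); the POST endpoint is hypothetical.

```python
import urllib.parse
import urllib.request

# GET: the parameters travel in the URL itself.
with urllib.request.urlopen("http://robozzle.com/puzzle.aspx?id=85") as resp:
    print(resp.status)

# POST: the parameters travel in the request body, below the headers.
data = urllib.parse.urlencode({"id": "85"}).encode("ascii")
req = urllib.request.Request("http://example.com/submit", data=data)  # hypothetical URL
with urllib.request.urlopen(req) as resp:
    print(resp.status)
```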
Note that the URL "http://facebook.com/" ends with a trailing slash. In this case the browser can safely add the slash itself. For an address of the form "http://example.com/folderOrFile", the browser cannot add a slash automatically, because it has no way of knowing whether folderOrFile is a folder or a file. It therefore visits the address without a slash, and the server responds with a redirect, costing an unnecessary round trip.
4. Facebook's server responds with a permanent redirect
Here is the response that the Facebook server sent back to the browser:
HTTP/1.1 301 Moved Permanently
Cache-Control: private, no-store, no-cache, must-revalidate, post-check=0,
    pre-check=0
Expires: Sat, [...] 00:00:00 GMT
Location: http://www.facebook.com/
P3P: CP="DSP LAW"
Pragma: no-cache
Set-Cookie: made_write_conn=deleted; expires=Thu, 12-Feb-2009 05:09:50 GMT;
    path=/; domain=.facebook.com; HttpOnly
Content-Type: text/html; charset=utf-8
X-Cnection: close
Date: Fri, [...] 05:09:51 GMT
Content-Length: 0
The server answered the browser with a 301 Moved Permanently response, so the browser will visit "http://www.facebook.com/" rather than "http://facebook.com/".
Why does the server insist on the redirect instead of immediately sending the page the user wants to see? There are several interesting answers.
One reason has to do with search engine rankings. If the same page is reachable at two addresses, say http://www.igoro.com/ and http://igoro.com/, a search engine may treat them as two different sites, each with fewer incoming links and therefore a lower ranking. Search engines understand what a 301 permanent redirect means, and will credit the addresses with and without www to the same site's ranking.
Another reason is cache friendliness. When one page goes by several names, it may show up in caches several times.
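You can observe the 301 yourself with a short sketch (not from the original article); http.client, unlike a browser, does not follow redirects on its own, so the raw status and Location header stay visible. The exact values returned today may differ from the article's.

```python
import http.client

# Request the bare domain and inspect the redirect instead of following it.
conn = http.client.HTTPConnection("facebook.com")
conn.request("GET", "/")
resp = conn.getresponse()
print(resp.status, resp.reason)    # expected: 301 Moved Permanently
print(resp.getheader("Location"))  # expected: the address with www
conn.close()
```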
5. The browser follows the redirect
The browser now knows that "http://www.facebook.com/" is the correct address to visit, so it sends another GET request:
GET http://www.facebook.com/ HTTP/1.1
Accept: application/x-ms-application, image/jpeg, application/xaml+xml, [...]
Accept-Language: en-US
User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; [...]
Accept-Encoding: gzip, deflate
Connection: Keep-Alive
Cookie: lsd=xw[...]; c_user=21[...]; x-referer=[...]
Host: www.facebook.com
The headers carry the same meanings as in the previous request.
6. Server "Processing" requests
The server receives the fetch request, and then processes and returns a response.
This appears to be a forward-looking task, but there are a lot of interesting things going on in the middle-just like the simple website of the author's blog, not to mention the large-scale website like Facebook.
* Web server software – The web server software (IIS or Apache, for example) receives the HTTP request and decides which request handler should deal with it. A request handler is a program (written in ASP.NET, PHP, Ruby, ...) that reads the request and generates the HTML for the response.
In the simplest case, request handlers can be stored in a file hierarchy whose structure mirrors the site's URL structure, so an address like http://example.com/folder1/page1.aspx maps to the file /httpdocs/folder1/page1.aspx. The web server software can also be configured to map addresses to request handlers manually, so that the public address of page1.aspx could be http://example.com/folder1/page1. (A sketch of the simple mapping appears after this list.)
* Request handler – The request handler reads the request, its parameters, and its cookies. It may read and possibly update some data stored on the server. It then generates an HTML response.
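The simplest URL-to-handler mapping, sketched in a few lines (the paths are the article's examples, the function name is made up):

```python
import os.path

# Document root of the site; path taken from the article's example.
DOC_ROOT = "/httpdocs"

def handler_for(url_path: str) -> str:
    """Map a request path onto the mirroring file hierarchy."""
    # /folder1/page1.aspx -> /httpdocs/folder1/page1.aspx
    return os.path.join(DOC_ROOT, url_path.lstrip("/"))

print(handler_for("/folder1/page1.aspx"))
```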
All dynamic sites face an interesting challenge: how to store their data. Most small sites get by with a single SQL database, but sites that store a great deal of data and/or serve heavy traffic have to find a way to spread the database across multiple machines. Solutions include sharding (splitting a table across multiple databases according to the primary key), replication, and simplified databases with weakened consistency semantics.
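A minimal sketch of sharding by primary key (the host names are hypothetical): the key alone decides which database holds the row.

```python
# Three database shards; each row lives on exactly one of them.
SHARDS = ["db0.example.com", "db1.example.com", "db2.example.com"]

def shard_for(primary_key: int) -> str:
    """Pick the shard that stores the row with this primary key."""
    return SHARDS[primary_key % len(SHARDS)]

print(shard_for(2101))  # -> db1.example.com (2101 % 3 == 1)
```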
Delegating work to batch jobs is a cheap technique for keeping data up to date. For example, Facebook has to show fresh news-feed items promptly, but the data behind the "People you may know" feature may only need updating nightly (the author's guess; how that feature is actually tuned is unclear). Batch updates leave some of the less important data stale, but they make keeping data updated faster and cleaner.
7. The server sends back an HTML response
Here is the response the server generated and sent back:
HTTP/1.1 200 OK
Cache-Control: private, no-store, no-cache, must-revalidate, post-check=0,
    pre-check=0
Expires: Sat, [...] Jan [...] 00:00:00 GMT
P3P: CP="DSP LAW"
Pragma: no-cache
Content-Encoding: gzip
Content-Type: text/html; charset=utf-8
X-Cnection: close
Transfer-Encoding: chunked
Date: Fri, [...] 09:05:55 GMT

(gzip-compressed response body [...])
The entire response is 35 kB, the bulk of it transferred as a compressed blob.
The Content-Encoding header tells the browser that the response body is compressed with the gzip algorithm. After decompressing the blob, you can see the HTML you would expect:
"HTTP://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" >lang= "en" id= "Facebook" >
...
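What the browser did with that blob to recover the HTML, sketched briefly (the sample markup is a stand-in for the real page):

```python
import gzip

# Simulate receiving a gzip-encoded body and decompressing it, as the
# Content-Encoding header instructs the browser to do.
body = gzip.compress(b"<html lang='en' id='facebook'>...</html>")
html = gzip.decompress(body).decode("utf-8")
print(html)
```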
Besides compression, the headers specify whether and how to cache the page, what cookies to set (none in this response), privacy information, and so on.
Notice that the Content-Type header is set to "text/html". It instructs the browser to render the response as an HTML page rather than, say, download it as a file. The browser relies on this header to decide how to interpret the response, though it also weighs other factors, such as the extension in the URL.
8. The browser starts rendering the HTML
The browser begins displaying the page even before it has received the entire HTML document:
9. The browser requests the objects embedded in the HTML
As the browser renders the HTML, it notices tags that refer to content at other addresses, and it sends a GET request to retrieve each of those files.
Here are a few of the URLs fetched on a visit to facebook.com:
* Images
  http://static.ak.fbcdn.net/rsrc.php/z12E0/hash/8q2anwu7.gif
  http://static.ak.fbcdn.net/rsrc.php/zBS5C/hash/7hwy7at6.gif
  ...
* CSS style sheets
  http://static.ak.fbcdn.net/rsrc.php/z448Z/hash/2plh8s4n.css
  http://static.ak.fbcdn.net/rsrc.php/zANE1/hash/cvtutcee.css
  ...
* JavaScript files
  http://static.ak.fbcdn.net/rsrc.php/zEMOA/hash/c8yzb6ub.js
  http://static.ak.fbcdn.net/rsrc.php/z6R9L/hash/cq2lgbs8.js
  ...
Each of these addresses goes through a process much like the one the HTML page went through: the browser looks the domain name up in DNS, sends the request, follows redirects, and so on.
Static files, however, unlike dynamic pages, allow the browser to cache them; some may be served straight from the cache without contacting the server at all. The server's response includes information about how long the static file may be kept, so the browser knows how long to cache it. In addition, each response may carry an ETag header, which works like a version number for the requested entity: if the browser sees an ETag for a version of the file it already has, it can stop the transfer immediately.
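A sketch of ETag revalidation (the ETag value is hypothetical, and the article's file URL may no longer resolve): the client presents the version it holds, and the server answers 304 Not Modified if that version is still current.

```python
import urllib.error
import urllib.request

# Revalidate a cached file: present the version we hold and let the
# server answer 304 Not Modified if it is still current.
req = urllib.request.Request(
    "http://static.ak.fbcdn.net/rsrc.php/z448Z/hash/2plh8s4n.css",
    headers={"If-None-Match": '"cached-version-tag"'},  # hypothetical ETag
)
try:
    with urllib.request.urlopen(req) as resp:
        print(resp.status)  # 200: the file changed, full body sent
except urllib.error.HTTPError as err:
    print(err.code)         # 304: our cached copy is still valid
```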
Try to guess what "fbcdn.net" in those addresses stands for. A safe bet is "Facebook content delivery network". Facebook uses a content delivery network (CDN) to distribute static content: images, CSS style sheets, and JavaScript files. As a result, these files are replicated in many CDN data centers around the world.
Static content often accounts for the bulk of a site's bandwidth, and it replicates easily across a CDN. Sites often use a third-party CDN provider rather than run their own. For example, Facebook's static files are hosted by Akamai, the largest CDN provider.
As a demonstration, when you try to ping static.ak.fbcdn.net, you will get a response from an akamai.net server. Interestingly, if you ping again, the response may come from a different server, which shows the behind-the-scenes load balancing at work.
10. The browser sends asynchronous (AJAX) requests
In the grand spirit of Web 2.0, the client keeps communicating with the server even after the page is rendered.
Take Facebook chat as an example: it keeps in touch with the server to update the status of your logged-in friends as they come and go. To update those statuses, the JavaScript executing in the browser sends an asynchronous request to the server: a programmatically constructed GET or POST request to a special address. In the Facebook case, the client sends a POST request to http://www.facebook.com/ajax/chat/buddy_list.php to fetch the list of your friends who are online.
This pattern is often called "AJAX", short for "Asynchronous JavaScript And XML", although there is no particular reason the server must respond in XML format. For example, Facebook returns snippets of JavaScript code in response to its asynchronous requests.
Among other things, fiddler lets you watch the asynchronous requests a browser sends. In fact, you need not be a passive spectator: you can modify the requests and resend them. The fact that AJAX requests are this easy to tamper with causes real grief to the developers of online games with scoreboards. (Obviously, please don't cheat that way.)
The Facebook chat feature illustrates an interesting problem with AJAX: pushing data from the server to the client. Since HTTP is a request-response protocol, the chat server cannot send new messages to the client on its own. Instead, the client has to poll the server every few seconds to see whether there are any new messages.
Long polling is an interesting technique for easing the server's load in these situations. If the server has no new messages when polled, it simply does not respond. And if a message for the client arrives before the request times out, the server finds the outstanding request and returns the new message as the response to it.
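A client-side sketch of long polling (the endpoint is hypothetical): the request blocks until the server has news or the timeout elapses, and the client immediately reconnects either way.

```python
import urllib.error
import urllib.request

POLL_URL = "http://example.com/ajax/poll"  # hypothetical endpoint

def poll_forever():
    while True:
        try:
            # The server holds this request open until a message arrives.
            with urllib.request.urlopen(POLL_URL, timeout=30) as resp:
                print("new message:", resp.read().decode("utf-8"))
        except (TimeoutError, urllib.error.URLError):
            # No news within the window; reconnect and keep waiting.
            continue
```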
Summary
I hope that after reading this article you have a sense of how the different pieces of the network work together.
This article comes from the Webmaster Information Network; original link: http://www.chinahtml.com/1007/127890385919293_2.html
Full process analysis from the input URL to the displayed page
We seem to do this every day: open a browser, type a URL, press Enter, and a blank page suddenly has something on it, perhaps a search page like Baidu's, perhaps a portal packed with text and pictures. What actually happens between opening the browser and seeing what we wanted to see?
Let's understand the process from three angles: first the browser, second the server, and third the protocol by which browser and server communicate. But before looking at these three, we must first understand the word that ties them together: the Web.
1. World Wide Web
The Web we usually speak of is the World Wide Web. Generally speaking, it is a technology for accessing resources through a browser. What we casually call "surfing the internet" is mostly the World Wide Web, yet we often confuse the World Wide Web with the Internet. The Internet is a network-interconnection technology; it refers to interconnection at the physical level, while the World Wide Web is better viewed as a service running on top of the Internet.
We usually access the Web through a browser, seeing pages that contain hypertext, images, video, and audio. These resources are provided by individual sites, connected to one another through the Internet. We follow hyperlinks from one page to another and from one site to the next; all of this together weaves an enormous net, and that is the Web.
As for the technology underpinning the Web: first comes the underlying network, because the Web is built on the Internet. The Web's basic protocol is HTTP, which runs on top of TCP; TCP in turn requires the IP protocol, and IP is supported by the underlying links. From top to bottom, then, we see a stack: HTTP -> TCP -> IP -> link-layer protocols. For understanding the Web, going down as far as IP is enough.
What resources live on the Web? First there was text, then images were added, and by now there are all kinds of audio and video resources. Every resource on the Web is identified by something called a URI; what we see more often is the URL. There is no need to dwell on the difference between the two here: URLs are a subset of URIs, and a URL gives us the address of a resource so that we can find it.
Now look at a URL, say the URL of a picture. It follows this syntax: scheme://domain:port/path?query_string#fragment_id. The scheme is the protocol, usually http in a browser; https, as in the example, is HTTP combined with SSL/TLS, providing encrypted communication and authentication of the web server (http://zh.wikipedia.org/zh/HTTPS). Next comes the domain name; every site has at least one. In the example the domain part is www.google.com.hk, which itself splits into parts: www is the hostname, and com.hk is the top-level domain (besides com there are also cn, net, and so on). After the domain name comes the port number, 80 by default and usually omitted; this is the port the server-side software listens on, a TCP port number. Then comes the path, that is, where the resource sits on the server. Finally, the part after the question mark is the query string, which the client uses to pass some parameter values to the server through the URL.
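The pieces are easy to see with a URL parser; a sketch (the URL below is a made-up example in the same shape):

```python
from urllib.parse import urlparse

# Pick a URL apart along the syntax described above.
url = "https://www.google.com.hk/images/logo.png?hl=zh-CN#top"
parts = urlparse(url)
print(parts.scheme)    # https
print(parts.hostname)  # www.google.com.hk
print(parts.port)      # None: the default port is implied
print(parts.path)      # /images/logo.png
print(parts.query)     # hl=zh-CN
print(parts.fragment)  # top
```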
2. Protocols
(1) HTTP protocol
The most important protocol on the Web is HTTP which, in the classic ISO seven-layer network model, sits at the highest layer: the application layer. HTTP follows the client/server model, so there are two kinds of HTTP messages, requests and responses. The client sends a request to the server, and the server sends a response back to the client. Let's look at the format of the two message types.
Each is explained below.
First, the HTTP request message.
Request line: the request line begins with a method token, followed by the request URI and the protocol version, separated by spaces. Common request methods include GET, POST, HEAD, PUT, and so on.
Message headers: general headers are the header fields that apply to both request and response messages but describe only the message being transmitted, not the transferred entity. Request headers let the client pass additional information about the request, and about the client itself, to the server. Both request and response messages may carry an entity, which consists of entity-header fields and an entity body; the two need not be sent together, and entity-header fields can be sent alone. Entity headers define meta-information about the entity body or, when no body is present, about the resource identified by the request. The content of a POST request goes in the entity body.
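To see the message format first-hand, here is a sketch that writes a request by hand over a TCP socket (example.com is a reserved demonstration domain):

```python
import socket

# Write an HTTP request message by hand: request line, headers, blank line.
request = (
    "GET / HTTP/1.1\r\n"     # request line: method, URI, version
    "Host: example.com\r\n"  # request headers follow
    "Connection: close\r\n"
    "\r\n"                   # a blank line ends the header section
)
with socket.create_connection(("example.com", 80)) as sock:
    sock.sendall(request.encode("ascii"))
    response = b""
    while chunk := sock.recv(4096):
        response += chunk
print(response.decode("utf-8", errors="replace"))
```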
The HTTP response message.
Status line: the key field is the server's response code, for example 200 OK, 401 Unauthorized, 403 Forbidden, 404 Not Found, 500 Internal Server Error, 503 Service Unavailable.
Message headers: the general headers and entity headers are similar to those of a request. The difference lies in the response headers, which let the server pass additional information that does not fit in the status line, along with information about the server itself and about further access to the resource identified by the Request-URI.
(This part is only a sketch; there are plenty of resources online, for example http://blog.csdn.net/gueter/article/details/1524447 and http://book.51cto.com/art/200902/109036.htm.)
Below are a GET message, a POST message, and a response message captured with Ethereal, for reference.
(2) TCP protocol
HTTP rides on top of TCP: all HTTP content is encapsulated as the payload of TCP segments. TCP is a connection-oriented, reliable transport mechanism. This means that in the course of transferring data the client and server interact to establish and release connections; you can see related fields (such as Connection) among the HTTP headers shown above. TCP also has a powerful window mechanism that adapts to the sending and receiving capacity of both endpoints, and adjusts to the condition of the network as a whole.
(3) IP protocol
The IP protocol sits at the linchpin of the whole TCP/IP protocol family. Hosts on the Internet are located by 32-bit IP addresses. An HTTP URL can also be considered an address, but a higher-level one that the IP protocol does not understand, so a translation from URL (domain name) to IP address is needed. That translation is performed through DNS (the Domain Name System). Every computer we use is configured with the address of a DNS server (if none is configured, your gateway acts as the default); when we have a URL and need the corresponding IP, we send a query to the DNS server, and it sends back the result.
3. Browser
The most visible player in the Web world is the browser. As noted when discussing the HTTP protocol, there are two kinds of HTTP messages, requests and responses. The browser's main task is to send HTTP request messages and to receive and process HTTP response messages. I have not read any browser's source code or documentation, but I think a piece of software can basically be called a browser as long as it can do the following few things:
(1) Generate the appropriate HTTP request message from the user's action. For example, when the user enters an address in the address bar, the browser must generate an HTTP GET message; submitting a form must produce a POST message; and so on.
(2) Handle the various kinds of responses.
(3) Render HTML documents: build the document tree, interpret CSS, and provide a JavaScript engine.
(4) Initiate DNS queries to obtain IP addresses.
A browser is a very complex piece of software. These days, support for the HTTP protocol is hardly a problem for browsers; they mostly wrestle with the rendering of HTML documents, with users' endless new demands, and with an endless stream of new standards. The browser's journey has arguably only just begun.
4. Server
"Server" carries two levels of meaning. It can be a machine, which holds everything a site has; or it can be software, installed on a machine called a server, which helps that machine hand out the things users want. I have not studied servers in much depth and have only used Apache a few times, so I will just briefly share my understanding.
The most basic function of a server is to respond to clients' requests for resources. The server first listens on port 80 for HTTP requests, then processes each according to what it asks for: a request for an image is answered by locating the resource by its path and sending it back, and static HTML pages are handled the same way; if the request is for a dynamic page such as PHP, the server first invokes the PHP compiler (or rather, interpreter) to generate HTML code, and then returns it to the client. Of course, another problem the server must solve is responding to heavy traffic in parallel.
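As a toy illustration of the above (a sketch, not production software), Python's standard library can serve files from a directory, mapping request paths onto the file hierarchy. Port 8080 is used here because binding port 80 normally requires administrator privileges.

```python
from http.server import HTTPServer, SimpleHTTPRequestHandler

# Listen for HTTP requests and serve files from the current directory,
# mapping request paths onto the file hierarchy.
server = HTTPServer(("0.0.0.0", 8080), SimpleHTTPRequestHandler)
server.serve_forever()
```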
Source: http://blog.csdn.net/saiwaifeike/article/details/8789624