"Domestic large-scale portal site architecture analysis" is divided into two parts. The original author wrote it in 2004, but it is still a useful reference for the architecture of today's large websites.
Sina and Sohu are household names in China, and each receives more than tens of millions of hits a day. With traffic on that scale, the first requirement is to use limited resources so that users get the fastest possible access. Internet companies have left the money-burning stage and moved on to healthy growth, so every bit of spending has to count. At the same time, the technical staff have to rack their brains to make sure the site is never unreachable or painfully slow. If it were, even the best editors and salespeople would struggle to sell ads, and the company would be headed for closure. None of this has happened, of course, because their engineers have made full use of the available resources and pushed them to the limit.
In short, Squid is used as the web cache server, while Apache provides the real web service behind Squid. This kind of architecture only works if most pages are static, which requires the programmers' cooperation: pages have to be converted to static pages before they are returned to the client. That is the basic architecture. Here is how I worked it out, along with the specifics:
Tool 1: nslookup
In practice:
nslookup www.sina.com.cn
Server: ns-px.online.sh.cn
Address: 202.96.209.5
Non-authoritative answer:
Name: taurus.sina.com.cn
Addresses: 61.172.201.230, 61.172.201.231, 61.172.201.232, 61.172.201.233
61.172.201.221, 61.172.201.222, 61.172.201.223, 61.172.201.224, 61.172.201.225
61.172.201.226, 61.172.201.227, 61.172.201.228, 61.172.201.229
Aliases: www.sina.com.cn, jupiter.sina.com.cn
Here you can see that Sina's home page uses a lot of IPs. At first glance you might think Sina simply has deep pockets. But keep looking:
nslookup news.sina.com.cn
Server: ns-px.online.sh.cn
Address: 202.96.209.5
Non-authoritative answer:
Name: taurus.sina.com.cn
Addresses: 61.172.201.228, 61.172.201.229, 61.172.201.230, 61.172.201.231
61.172.201.232, 61.172.201.233, 61.172.201.221, 61.172.201.222, 61.172.201.223
61.172.201.224, 61.172.201.225, 61.172.201.226, 61.172.201.227
Aliases: news.sina.com.cn, jupiter.sina.com.cn
A careful reader will notice that the news channel returns the same number of IPs as the home page, and that the IPs themselves are identical. In other words, in Sina's DNS these IPs are the A records of the name taurus.sina.com.cn, while news, sports, jczs.news and so on are all CNAME records. DNS round-robin then rotates the answers automatically. If you don't believe it, here is one more, the sports channel:
nslookup sports.sina.com.cn
Server: ns-px.online.sh.cn
Address: 202.96.209.5
Non-authoritative answer:
Name: taurus.sina.com.cn
Addresses: 61.172.201.222, 61.172.201.223, 61.172.201.224, 61.172.201.225
61.172.201.226, 61.172.201.227, 61.172.201.228, 61.172.201.229, 61.172.201.230
61.172.201.231, 61.172.201.232, 61.172.201.233, 61.172.201.221
Aliases: sports.sina.com.cn, jupiter.sina.com.cn
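The round-robin claim can be checked mechanically. The sketch below is only an illustration: it hard-codes the last octets from the three nslookup answers above and tests whether each list is the same set of addresses in a rotated order. The helper names (addresses, is_rotation, resolve) are my own, and resolve is an optional live lookup that will not return these 2004 answers today.

```python
import socket

PREFIX = "61.172.201."
# Last octets, in the order each nslookup answer above listed them.
WWW = [230, 231, 232, 233, 221, 222, 223, 224, 225, 226, 227, 228, 229]
NEWS = [228, 229, 230, 231, 232, 233, 221, 222, 223, 224, 225, 226, 227]
SPORTS = [222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 221]


def addresses(octets):
    """Build full dotted addresses from the last octets."""
    return [PREFIX + str(o) for o in octets]


def is_rotation(a, b):
    """True if list b is list a shifted cyclically by some offset."""
    if len(a) != len(b):
        return False
    if not a:
        return True
    doubled = a + a
    return any(doubled[i:i + len(b)] == b for i in range(len(a)))


def resolve(name):
    """Optional live lookup: (canonical name, aliases, IPv4 addresses)."""
    return socket.gethostbyname_ex(name)


www, news, sports = addresses(WWW), addresses(NEWS), addresses(SPORTS)
print(set(www) == set(news) == set(sports))                   # same pool
print(is_rotation(www, news), is_rotation(www, sports))      # same order, shifted
```

Identical sets in a cyclically shifted order are what you would expect from a name server rotating one pool of A records between queries.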
You can try the other channels yourself. Now let's look at Sohu:
nslookup www.sohu.com
Server: ns-px.online.sh.cn
Address: 202.96.209.5
Non-authoritative answer:
Name: pagegrp1.sohu.com
Addresses: 61.135.132.172, 61.135.132.173, 61.135.132.176, 61.135.133.109
61.135.145.47, 61.135.150.65, 61.135.150.67, 61.135.150.69, 61.135.150.74
61.135.150.75, 61.135.150.145, 61.135.131.73, 61.135.131.91, 61.135.131.180
61.135.131.182, 61.135.131.183, 61.135.132.65, 61.135.132.80
Aliases: www.sohu.com
--------------------------------------------
nslookup news.sohu.com
Server: ns-px.online.sh.cn
Address: 202.96.209.5
Non-authoritative answer:
Name: pagegrp1.sohu.com
Addresses: 61.135.150.145, 61.135.131.73, 61.135.131.91, 61.135.131.180
61.135.131.182, 61.135.131.183, 61.135.132.65, 61.135.132.80, 61.135.132.172
61.135.132.173, 61.135.132.176, 61.135.133.109, 61.135.145.47, 61.135.150.65
61.135.150.67, 61.135.150.69, 61.135.150.74, 61.135.150.75
Aliases: news.sohu.com
The picture is the same as with Sina. On the surface Sohu has more IPs than Sina, so does Sohu put more servers behind each channel? Not necessarily: one server can bind multiple IPs, so the number of IPs does not tell you how many servers are in use.
From these experiments we can basically see that Sina and Sohu use the same technique for their channels: Squid listens on port 80 of these IPs, and the real web server listens on another port. Users notice no difference, but compared with letting clients connect directly to the web server, this approach saves a significant amount of bandwidth and servers. Access also feels faster to the user.
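To make the saving concrete, here is a toy model of a caching front end. It is a sketch of the general idea only, not Squid's actual algorithm: the class name FrontCache, the fixed 60-second lifetime, and the origin function standing in for Apache are all assumptions made for illustration.

```python
import time


class FrontCache:
    """Toy stand-in for a caching reverse proxy placed in front of a web server."""

    def __init__(self, origin, ttl_seconds, clock=time.monotonic):
        self.origin = origin      # callable that produces the page (the "Apache" side)
        self.ttl = ttl_seconds    # how long a stored copy stays fresh
        self.clock = clock        # injectable clock, handy for experiments
        self.store = {}           # url -> (expires_at, body)
        self.hits = 0
        self.misses = 0

    def get(self, url):
        now = self.clock()
        entry = self.store.get(url)
        if entry is not None and entry[0] > now:
            self.hits += 1        # served from cache, origin not contacted
            return entry[1]
        self.misses += 1          # absent or stale: fetch and store a fresh copy
        body = self.origin(url)
        self.store[url] = (now + self.ttl, body)
        return body


origin_calls = []


def origin(url):
    """Hypothetical origin server: records each request it really has to serve."""
    origin_calls.append(url)
    return "<html>static page for %s</html>" % url


cache = FrontCache(origin, ttl_seconds=60)
for _ in range(1000):
    cache.get("/index.html")
print(cache.hits, cache.misses, len(origin_calls))  # prints: 999 1 1
```

With a static page and a one-minute lifetime, a thousand requests cost the back end a single page generation, which is why this layout depends on most pages being static.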
Of course, a handful of DNS records is not enough to conclude that they use Squid as a front-end cache. You can access one of the IPs directly and see what comes back; the result is shown in the figure:
This shows that Sina's DNS points many IPs at the name sqsh-19.sina.com.cn, while all the other channels of the same kind are just aliases of sqsh-19.sina.com.cn, defined with CNAME records. The DNS must be set up this way, and the servers listen on port 80 with Squid 2.5.STABLE5 (the latest stable version is STABLE6). All of this is based on analyzing several pieces of information and should be basically correct. What follows is my personal guesswork:
Its real web server also listens on port 80, because one of the directives in the Squid configuration file is:
httpd_accel_port 80
If a different port number were set (for example, 88), the error message in the figure would become:
While trying to retrieve the URL: http://61.172.201.19:88
Tool 2: nmap, a scanner that can check which ports a server has open.
I used nmap to scan one of Sina's IPs, 61.172.201.19, for analysis.
bash-2.05$ nmap 61.172.201.19
Starting nmap 3.50 ( http://www.insecure.org/nmap/ ) at 2004-07-30 13:31 GMT
Interesting ports on 61.172.201.19:
(The 1657 ports scanned but not shown below are in state: filtered)
PORT   STATE SERVICE
22/tcp open  ssh
80/tcp open  http
Nmap run completed -- 1 IP address (1 host up) scanned in 73.191 seconds
As you can see, the host has only 2 ports open. Port 80 is the one Squid listens on, as was just verified. Port 22 is for SSH remote connections, which the system administrators (SAs) use to operate the server remotely in a highly secure way.
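If nmap is not at hand, a plain TCP connect test gives a rough equivalent for a few ports. This is a minimal sketch, not what nmap does internally: the function names tcp_port_open and scan are my own, and a filtered port simply shows up as "not open" once the timeout expires. Only probe hosts you are allowed to test; the example call targets the local machine.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a full TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise
        return s.connect_ex((host, port)) == 0


def scan(host, ports, timeout=3.0):
    """Return the subset of ports that accepted a connection."""
    return [p for p in ports if tcp_port_open(host, p, timeout)]


if __name__ == "__main__":
    # The two ports the scan above reported as open.
    print(scan("127.0.0.1", [22, 80]))
```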
Tool 3: lynx, or any other tool or program that can read HTTP headers. An example makes this easier to understand:
HTTP/1.0 200 OK
Date: Fri, June 05:49:47 GMT
Server: Apache/2.0.49 (Unix)
Last-Modified: Fri, June 05:48:16 GMT
Accept-Ranges: bytes
Vary: Accept-Encoding
Cache-Control: max-age=60
Expires: Fri, June 05:50:47 GMT
Content-Length: 180747
Content-Type: text/html
Age: 37
X-Cache: HIT from sqsh-230.sina.com.cn
Connection: close
The above is the HTTP header information returned by Sina, and it contains a lot of valuable detail. For example, the Apache behind Squid is version 2.0.49, and it sets an expiry time (max-age=60, so Expires is one minute after Date) and a Last-Modified time. These features are enabled when Apache is compiled; Last-Modified in particular needed a small source change, at least in my case.
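The same reading of the headers can be done in code. The sketch below is illustrative only: it copies the headers above except the date fields, and the helper names (parse_headers, max_age, summarize) are my own. Remaining freshness is taken as max-age minus Age, a simplification of the HTTP caching rules.

```python
RAW_HEADERS = """\
Server: Apache/2.0.49 (Unix)
Cache-Control: max-age=60
Content-Length: 180747
Content-Type: text/html
Age: 37
X-Cache: HIT from sqsh-230.sina.com.cn
Connection: close"""


def parse_headers(raw):
    """Turn 'Name: value' lines into a dict keyed by lower-cased header name."""
    headers = {}
    for line in raw.strip().splitlines():
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers


def max_age(headers):
    """Extract max-age (seconds) from Cache-Control, or None if absent."""
    for directive in headers.get("cache-control", "").split(","):
        key, _, value = directive.strip().partition("=")
        if key.lower() == "max-age" and value.isdigit():
            return int(value)
    return None


def summarize(headers):
    """Pull out the facts the article reads off the response by eye."""
    x_cache = headers.get("x-cache", "")
    age = int(headers.get("age", "0"))
    limit = max_age(headers)
    return {
        "origin_server": headers.get("server"),
        "cache_hit": x_cache.upper().startswith("HIT"),
        "cache_node": x_cache.rpartition(" ")[2] or None,
        "seconds_until_stale": None if limit is None else max(limit - age, 0),
    }


summary = summarize(parse_headers(RAW_HEADERS))
print(summary)
```

Here the cached copy was 37 seconds old with a 60-second lifetime, so it had 23 seconds of freshness left when the cache node sqsh-230.sina.com.cn served it.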
Summary
Sina's architecture is probably as follows. Squid sits at the front. With a typical current server (2U, 2 GB of memory), each machine can generally run at least 4 instances of Squid 2.5.STABLE5, so 4 servers cover 16 IPs. The back layer is Apache 2.0.49, probably on 2 machines. Those 2 machines most likely use private IPs, which the Squid servers in front resolve through their hosts file. I will write up the specific implementation from my experiment notes next time. Apache's htdocs probably lives on one or 2 disk arrays exported over NFS. Apache should mount the NFS server read-only, and there is probably one more server dedicated to editing, which the editors use to update articles. That server should have write permission on the NFS server.
----This is the complete setup I believe Sina uses. Much of it is guesswork, of course. I have had no contact with Sina's technical staff (I don't know any of them); otherwise I would not be writing this up. Other sites such as Sohu and 163 probably have a similar architecture.
Final note: this is only the architecture of the channels made up of static pages. Sina has many other servers, for downloads, blogs, search, pictures, forums and so on, which are not part of this architecture.