What kind of hardware runs Stack Overflow?

Source: Internet
Author: User

I prefer to think of Stack Overflow as running with scale but not at scale. By that I mean our sites run very efficiently, but at least so far we are not "big". Let's use some numbers to show the current size of Stack Overflow. The following are some core figures from the past day (24 hours). It was a typical workday, and the figures count only our active data center, that is, the servers we host ourselves. Requests and traffic to CDN nodes are excluded because they never reach our network directly.

  • 148,084,833 HTTP requests received by the load balancers
  • A subset of those requests were page loads
  • 833,992,982,627 bytes (776 GB) of HTTP traffic sent
  • 286,574,644,032 bytes (267 GB) received in total
  • 1,125,992,557,312 bytes (1,048 GB) sent in total
  • 334,572,103 SQL queries (from HTTP requests only)
  • 412,865,051 Redis requests
  • 3,603,418 tag engine requests
  • 558,224,585 ms (155 hours) spent running SQL queries
  • 99,346,916 ms (27 hours) spent on Redis requests
  • 132,384,059 ms (36 hours) spent on tag engine requests
  • 2,728,177,045 ms (757 hours) spent processing in ASP.NET
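To make these totals easier to picture, here is a minimal Python sketch that turns a few of them into per-second and per-request averages. The constants are copied from the list above; the helper names (`ms_to_hours`, `bytes_to_gib`) are my own, and treating "GB" as 1024^3 bytes is an assumption that happens to match the figures in parentheses to within rounding.

```python
# Figures copied from the list above (one 24-hour period).
HTTP_REQUESTS = 148_084_833
SQL_QUERIES = 334_572_103
SQL_MS = 558_224_585
ASPNET_MS = 2_728_177_045
HTTP_BYTES_SENT = 833_992_982_627

SECONDS_PER_DAY = 24 * 60 * 60
MS_PER_HOUR = 60 * 60 * 1000
BYTES_PER_GIB = 1024 ** 3  # assumption: "GB" in the list means 2^30 bytes


def ms_to_hours(ms: int) -> float:
    """Convert a duration in milliseconds to hours."""
    return ms / MS_PER_HOUR


def bytes_to_gib(n: int) -> float:
    """Convert a byte count to binary gigabytes."""
    return n / BYTES_PER_GIB


# Averages over the whole day; real traffic is burstier than this.
requests_per_second = HTTP_REQUESTS / SECONDS_PER_DAY
sql_queries_per_request = SQL_QUERIES / HTTP_REQUESTS
avg_sql_ms_per_query = SQL_MS / SQL_QUERIES
avg_aspnet_ms_per_request = ASPNET_MS / HTTP_REQUESTS

if __name__ == "__main__":
    print(f"HTTP requests per second (average): {requests_per_second:.0f}")
    print(f"SQL queries per HTTP request:       {sql_queries_per_request:.2f}")
    print(f"Average SQL time per query (ms):    {avg_sql_ms_per_query:.2f}")
    print(f"Average ASP.NET time per request:   {avg_aspnet_ms_per_request:.1f} ms")
    print(f"SQL time in hours:                  {ms_to_hours(SQL_MS):.1f}")
    print(f"HTTP traffic sent (GiB):            {bytes_to_gib(HTTP_BYTES_SENT):.1f}")
```

These are daily averages only; they say nothing about peak load, which is what the hardware actually has to absorb.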

(I think I should write a post about how we collect this data quickly and why it is worth the effort.)

Note that these figures cover the entire Stack Exchange network, but they still do not include everything. They also come only from HTTP requests that we log for performance tracking. "Whoa, that is a lot of hours in one day. How do you do that?" We like to call it magic. Some people prefer to say "multiple servers with multi-core processors", but we are sticking with magic. Here is the hardware that runs the Stack Exchange network in that data center:

  • 4 MS SQL servers
  • 11 IIS web servers
  • 2 Redis servers
  • 3 tag engine servers (any request that hits tags goes through them, such as /questions/tagged/c++)
  • 3 ElasticSearch servers
  • 2 load balancers (HAProxy)
  • 2 switches (Nexus 5596 and Fabric Extenders)
  • 2 Cisco 5525-X ASAs (think of these as firewalls)
  • 2 Cisco 3945 routers

Here is what that looks like:

We don't only run the sites. The other servers in the nearby rack are virtual machines and supporting infrastructure that do not serve the sites directly; they handle deployment, domain controllers, monitoring, and operations databases. Two of the database servers in the list above were backups only until recently; they now carry read-only load (mainly for the Stack Exchange API), so we can keep scaling without having to think much about it. Some of the web servers are dedicated to dev and meta and carry very little load.

Core hardware

If we strip away the excess, here is what is actually needed to run Stack Exchange at its current performance level:

  • 2 MS SQL servers (Stack Overflow is on one and everything else is on the other; they could run on a single machine and still have headroom)
  • 2 web servers (maybe 3, but I am confident that 2 would be enough)
  • 1 Redis server
  • 1 tag engine server
  • 1 ElasticSearch server
  • 1 load balancer
  • 1 switch
  • 1 ASA
  • 1 router

(We really should test this configuration one day by turning off equipment and seeing where the breaking point is.)

There are also some virtual machines running in the background for auxiliary functions such as domain controllers, but those are relatively light loads, so we will not cover them. Here we focus on Stack Overflow itself and what it takes to render pages at full speed. If you want to be more precise and complete, add a VMware server to the list for all of the auxiliary work. That is not a large number of machines, but their specifications are usually hard to get in the cloud, at least not at a reasonable price. Here are some quick notes on these "scale up" servers:

  • The database servers have … GB of memory and … TB of SSD storage
  • The Redis servers have 96 GB of memory
  • The ElasticSearch servers have … GB of memory
  • The tag engine servers have the fastest processors we can afford
  • Every switch port has 10 Gb/s of bandwidth
  • The web servers are nothing special: 32 GB of memory, 2 quad-core processors, and … GB of SSD storage
  • Some servers (such as the database servers) have two 10 Gb/s network interfaces; the others have four 1 Gb/s interfaces

Is 20 Gb/s of bandwidth overkill? Yes. On average, the active database servers use only 100-200 Mb/s of that 20 Gb/s pipe. However, operations such as backups and rebuilds can completely saturate it, given how much memory and SSD storage these machines have, so the design does serve a purpose.
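To put those numbers side by side, here is a small Python sketch of the arithmetic. It assumes the link is 20 Gb/s (two 10 Gb/s interfaces) and that average database traffic is 100-200 Mb/s. The 1.06 TB transfer size is purely a hypothetical example of a large backup, and the calculation assumes decimal units and an ideal link with no protocol overhead, so treat the transfer times as lower bounds.

```python
def utilization_percent(used_mbps: float, capacity_gbps: float) -> float:
    """Share of a link in use, as a percentage (1 Gb/s = 1000 Mb/s)."""
    return used_mbps / (capacity_gbps * 1000.0) * 100.0


def transfer_seconds(size_tb: float, link_gbps: float) -> float:
    """Ideal time to move size_tb terabytes over a link_gbps link.

    Assumes decimal units (1 TB = 10^12 bytes) and zero overhead.
    """
    bits = size_tb * 1e12 * 8
    return bits / (link_gbps * 1e9)


if __name__ == "__main__":
    # Everyday load: a tiny fraction of the pipe.
    for mbps in (100, 200):
        pct = utilization_percent(mbps, 20)
        print(f"{mbps} Mb/s on a 20 Gb/s link: {pct:.1f}% utilized")

    # Hypothetical bulk transfer: this is where the extra bandwidth pays off.
    for gbps in (4, 20):
        minutes = transfer_seconds(1.06, gbps) / 60
        print(f"1.06 TB over {gbps} Gb/s: about {minutes:.0f} minutes")
```

The point of the comparison is that the link is sized for the rare bulk transfer, not for the everyday average.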

Storage

We currently have about 2 TB of database storage in use (the first cluster has 18 SSDs, 1.63 TB in total with 1.06 TB in use; the second cluster has 4 SSDs, 1.45 TB in total with 889 GB in use). That is what we would need in the cloud (hmm, that word again). Keep in mind that all of this is SSD. Thanks to how well the memory and storage perform, the average write time on our databases is 0 ms, which is below what we can even measure. The actual read/write ratio of the Stack Overflow database is 40:60. You read that right: 60% of operations are writes. In addition, each web server has two 320 GB SSDs in RAID 1. Each ElasticSearch server needs about … GB, and since we write and rebuild indexes frequently, SSDs are the better choice there too.
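As a quick check on the "about 2 TB" figure, the sketch below simply adds up the two clusters. The dictionary layout and names are my own, and converting 889 GB to 0.889 TB assumes decimal units.

```python
# Capacity and usage per cluster, in TB, copied from the paragraph above.
# Assumption: 889 GB is treated as 0.889 TB (decimal units).
clusters = {
    "first cluster (18 SSDs)": {"total_tb": 1.63, "used_tb": 1.06},
    "second cluster (4 SSDs)": {"total_tb": 1.45, "used_tb": 0.889},
}

used_tb = sum(c["used_tb"] for c in clusters.values())
total_tb = sum(c["total_tb"] for c in clusters.values())
used_fraction = used_tb / total_tb

if __name__ == "__main__":
    print(f"In use:   {used_tb:.2f} TB")
    print(f"Capacity: {total_tb:.2f} TB")
    print(f"Utilized: {used_fraction:.0%}")
```

So "about 2 TB" matches the space in use, while raw capacity across both clusters is a little over 3 TB.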

It is worth noting that we do have a SAN (storage area network) connected to the core network: an EqualLogic PS6110X with 24 hot-swappable 10K SAS disks and two 10 Gb controllers. It is used only by the VM servers as shared storage to keep the virtual machines highly available, and it does not really support the operation of the websites. In other words, if the SAN died, the sites would not even notice for a while (only the domain controllers, which run as virtual machines, would).

Putting it all together

What does all of this hardware add up to? Performance. We need high performance; it is a very important feature for us. The main page loaded on all of our sites is the question page, which we call Question/Show internally. On November 12, that page rendered in an average of 28 ms, against our goal of at most 50 ms. To improve the user experience, we do everything we can to shorten page load time, even by a single millisecond. When it comes to performance, every developer here always wants better, which also helps keep our sites fast. Here are the average render times of some popular pages on Stack Overflow, again from the previous 24 hours:

  • Question/Show: 28 ms (29.7 million hits)
  • User Profiles: 39 ms (1.7 million hits)
  • Question List: 78 ms (1.1 million hits)
  • Home page: 65 ms (1 million hits) (this is slow for us; Kevin Montrose is fixing it)
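To see why Question/Show dominates the overall experience, here is a minimal Python sketch that computes a hit-weighted average render time across these four pages. The `PageStat` type and function names are illustrative only. Applying the 50 ms goal to every page is my assumption; the text states it for the question page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageStat:
    name: str
    avg_render_ms: float
    hits_millions: float


# Values copied from the list above.
PAGES = [
    PageStat("Question/Show", 28, 29.7),
    PageStat("User Profiles", 39, 1.7),
    PageStat("Question List", 78, 1.1),
    PageStat("Home page", 65, 1.0),
]

TARGET_MS = 50  # assumption: the 50 ms goal is applied to every page here


def weighted_avg_render_ms(pages):
    """Average render time, weighted by how often each page is hit."""
    total_hits = sum(p.hits_millions for p in pages)
    return sum(p.avg_render_ms * p.hits_millions for p in pages) / total_hits


def pages_over_target(pages, target_ms):
    """Names of pages whose average render time exceeds the target."""
    return [p.name for p in pages if p.avg_render_ms > target_ms]


if __name__ == "__main__":
    print(f"Weighted average: {weighted_avg_render_ms(PAGES):.1f} ms")
    print("Over target:", ", ".join(pages_over_target(PAGES, TARGET_MS)))
```

Because the question page accounts for the vast majority of hits, the weighted average stays close to its 28 ms even though two of the other pages are well above 50 ms.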

By recording timings for every single request, we can see exactly what goes into each page load. You need data like this; otherwise, what would you base your decisions on, guesswork? With the data in hand, we can chart and monitor performance over time.
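As a rough illustration of per-request timing, here is a small Python sketch that records how long each named step of a request takes. This is not how Stack Overflow implements it. The `RequestTimings` class, the step names, and the `time.sleep` calls standing in for real work are all hypothetical.

```python
import time
from contextlib import contextmanager


class RequestTimings:
    """Collects (step name, elapsed milliseconds) pairs for one request."""

    def __init__(self):
        self.steps = []

    @contextmanager
    def step(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the step even if the wrapped code raised.
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.steps.append((name, elapsed_ms))

    def total_ms(self):
        return sum(ms for _, ms in self.steps)


def handle_request():
    """Hypothetical request handler; sleeps stand in for real work."""
    timings = RequestTimings()
    with timings.step("sql"):
        time.sleep(0.002)
    with timings.step("redis"):
        time.sleep(0.001)
    with timings.step("render"):
        time.sleep(0.003)
    return timings


if __name__ == "__main__":
    result = handle_request()
    for name, ms in result.steps:
        print(f"{name:>6}: {ms:.1f} ms")
    print(f" total: {result.total_ms():.1f} ms")
```

Summing records like these across all requests is what would produce totals such as "hours spent on SQL queries" per day.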

If you are interested in the numbers for a specific page, I am happy to publish them. Here I focus on render time because it shows how long our servers take to produce a web page. Network transmission speed is a completely different topic (although admittedly a closely related one), and I will cover it in the future.

Room to grow

It is worth mentioning that these servers run at very low utilization. For example, the web servers average 5-15% CPU, use only 15.5 GB of memory, and see just 20-40 Mb/s of network traffic. The database servers average 5-10% CPU, use … GB of memory, and see 100-200 Mb/s of network traffic. This lets us do a few important things: we do not need to upgrade hardware every time the sites grow; when something goes wrong (a bad query, bad code, an attack, whatever it may be), we can keep the sites up; and we can scale back power when needed. Here is a monitoring view of our web tier:

The main reason utilization is so low is efficient code. That is not the topic of this article, but efficient code is critical to getting more out of your hardware. Anything you do that does not need doing costs more than not doing it, and the same applies to any part of your code that could be more efficient. That cost shows up as power consumption, as hardware (you need more and faster servers), as code that is harder for developers to understand (to be fair, this cuts both ways, since efficient code is not necessarily simple), and as slower page rendering, which can mean fewer users sticking around for another page or coming back at all. Inefficient code can cost much more than you think.

Now we know what kind of hardware Stack Overflow runs on. Next time we can discuss why we do not run in the cloud.
