Gigahttpd design idea version 0.1

Project URL: http://gigahttpd.sourceforge.net/

Version: 0.1
Submission time:
Author: Lu Yiming (Yiming Lu), lu.yiming.lu@gmail.com
Description: gigahttpd development documentation

* Tell the truth first

When I wrote down this design idea, I knew I was still quite ignorant. Many of these ideas may be wrong; the whole design may even be completely infeasible. But that doesn't matter. Let's keep improving it.

* Simple description of HTTP server functions

Receive an HTTP request and return the requested data.

* Function details

(1) Establish a TCP connection.

(2) Receive an HTTP request (GET or POST), authenticate the identity session, and parse the URL.

(3) Find the object corresponding to the URL among all objects.

(4) If the object must be computed (for example, from data parsed out of the POST body), start the computation and save the result.

(5) Return the object's data to the user.

* Challenges we face

Process and respond to 1 billion HTTP requests per second. Each request must go through the steps above.

* Our available resources

(1) 1000 PCs.

(2) Each PC has 8 to 16 64-bit CPU cores.

(3) Each PC has 8 to 16 GB of memory.

(4) Each PC has 8 to 16 1G/10G Ethernet cards.

(5) Each PC has two 1 TB hard disks, presented as a single disk by a hardware RAID card.

(6) There are enough 10G Ethernet switches.

(7) Open-source Linux operating system.

* System running environment

The following conditions are assumed to exist, or to exist in the future; that is, we do not consider them here.

(1) There is enough room for the equipment, and the power supply is sufficient :)

(2) Sufficient Internet bandwidth for 1G (one billion) users to access at the same time. It may consist of many broadband access channels, including DNS load balancing.

(3) All IPv6. If there is IPv4 access, an IPv6/IPv4 NAT converter sits at the front end.

(4) Enough TCP-connection load balancers to distribute 1G TCP connections across tens of thousands of NICs.

* One application!

Before starting the design, let's clarify one core point: our system processes only one application! That is, an application already reduced to its minimum, which cannot be decomposed any further.

Example 1: suppose the application is a large forum with many sub-forums, each relatively independent. This application does not need our system, because each sub-forum can use a simple HTTP server. Of course, for some reasons, such as unified session authentication or complicated internal settlement transactions, it may still be appropriate to use our system.

Example 2: suppose the application is a real-time multi-user drawing system that lets 1 billion people draw simultaneously on one sheet of paper. This application is very hard to decompose, and our system suits it well.

Oh, don't laugh at the madness of some of these ideas. We just want to support even crazier public-welfare or commercial ideas and make them practical :)

* TCP Connection

TCP connections are handled in the NIC driver. That is, packets never enter the kernel's TCP/IP protocol stack, and no process blocks in a socket's listen().

Therefore, every PC needs a "normal" NIC and a dedicated "data" NIC, and we write a separate driver for the "data" NIC.

The "data" NIC driver makes its decisions directly from the Ethernet packets: if a packet is a TCP connection request, it replies immediately to accept the connection.

The data path of the "normal" NIC is unmodified, so we can log in to Linux on the PC through the "normal" NIC for management.

* Identity Authentication

User identity authentication includes the first authentication (user name/password verification) and subsequent HTTP session verification.

Assume each user occupies 4 KB (one page), including the user name, password, all of the user's session/TCP connection information, the TCP input buffer, TCP output state, and some extra information. Storing all user information therefore requires 4 TB of memory in total.

To achieve high performance, all data in the system is kept in memory. For unified access, we design a 10 TB memory as one continuous address space; that is, internal data references are plain 64-bit address pointers.

Each PC has about 10 GB of memory, so the 4 TB of user data is spread over 400-odd PCs. Any one of 1G users can be identified with 32 bits. The 1G user records are therefore stored contiguously within the 4 TB region and located with 44-bit memory addresses inside the 10 TB space.
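
The sizing above can be checked with a little arithmetic. This is a sketch; the function names are ours, and note that 4 TB at 10 GB per PC strictly needs about 410 machines, close to the text's round figure of 400:

```python
# Check the sizing arithmetic: 1G users, one 4 KB page each.
PAGE = 4096                      # one 4 KB user record
USERS = 1 << 30                  # "1G" users, indexable with 32 bits
PER_PC = 10 * (1 << 30)          # ~10 GB of user memory per PC

def user_address(user_index):
    """Byte offset of a user's record in the flat address space."""
    return user_index * PAGE

def owning_pc(user_index):
    """Index of the user-management PC that holds this record."""
    return user_address(user_index) // PER_PC

total = USERS * PAGE
print(total == 4 * (1 << 40))    # True: 4 TB in total
print(owning_pc(USERS - 1) + 1)  # 410 PCs at 10 GB each
```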

For the user name/password check, the user name is first hashed into an address pointer, and the record is then searched for within a small range of memory. Once the record is found, the user name and password are verified. After verification passes, a large random number is combined with the user's 32-bit index to form the session ID. Subsequent session checks are then simple. To improve performance, the session ID can be placed in the URL; we will discuss this later.
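
A minimal sketch of the session scheme described above. The hash function and names are our assumptions; only the idea (hash the user name to narrow the search, then combine a large random number with the user's 32-bit index) comes from the text:

```python
# Sketch: hashed user lookup plus nonce|index session IDs.
import hashlib
import secrets

USERS = 1 << 30   # 1G users

def username_to_index_hint(username):
    """Hash a user name into a 32-bit starting index for a short scan."""
    digest = hashlib.sha256(username.encode()).digest()
    return int.from_bytes(digest[:4], "big") % USERS

def make_session_id(user_index):
    """Upper 32 bits: a large random nonce; lower 32 bits: the user index."""
    return (secrets.randbits(32) << 32) | user_index

def check_session_id(session_id, stored_nonce_lookup):
    """Session check: extract the index, compare the stored nonce."""
    index = session_id & 0xFFFFFFFF
    return stored_nonce_lookup(index) == session_id >> 32

sid = make_session_id(12345)
assert (sid & 0xFFFFFFFF) == 12345
```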

Therefore, when a PC's "data" NIC receives a user login request, the corresponding CPU core first computes the user's 32-bit index. If that address is not on this PC, the request is forwarded to the PC where the address lives.

For performance, when designing a web page, do not let a single form submission exceed the capacity of one Ethernet packet (about 1.4 KB). This is also the TCP window size sent to the client. Receiving uploaded files is left out of the design that follows.
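
The "about 1.4 KB" figure can be derived from standard frame sizes (a sketch assuming IPv6, as stated earlier, and no TCP options):

```python
# Why a form submission should stay under ~1.4 KB: a standard Ethernet
# frame carries at most 1500 bytes of payload; IPv6 and TCP headers eat
# into that, leaving roughly 1.4 KB for HTTP data in a single packet.
MTU = 1500          # standard Ethernet payload limit
IPV6_HEADER = 40    # fixed IPv6 base header
TCP_HEADER = 20     # TCP header without options
payload = MTU - IPV6_HEADER - TCP_HEADER
print(payload)      # 1440 bytes, i.e. about 1.4 KB
```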

Therefore, the CPU cores in each PC are divided into two types: "normal" CPU cores and "data" CPU cores. A "data" CPU core handles only the send and receive work of the "data" NICs. It mainly runs at kernel level, reading and writing the NICs at high speed; if the overhead of crossing in and out of the kernel can be tolerated, it could also run at user level. When it runs at kernel level, it can be given a separate memory mapping (its own page directory table), so that it cannot touch, and therefore cannot corrupt, normal operating-system data. Naturally, a "data" CPU core does not take part in normal operating-system task scheduling; it has its own scheduling method. When a "normal" CPU core is idle, it can temporarily be turned into a "data" CPU core.


* Parse the URL and find the object

For performance, every object in the system is saved as a static data region in memory and accessed through an address pointer.

Results computed from user-submitted data are also saved as static memory regions, for example forum posts, blog articles, blog comments, and images produced by real-time drawing. The data region of a real-time image can be reused after it expires.

Static files are pre-loaded into memory from the hard disk.

If the application serves big data files, such as movies, we can add memory to the whole system or buffer from the hard disk according to some algorithm; we can also cap each user's data rate (as long as the video plays continuously) to reduce bandwidth and memory usage. This issue is left for future design discussion.

Therefore, objects in the system can be nested within other objects.

When an object is clearly expected to be static data, its address can serve directly as its URL; even if the system restarts, the address stays the same after all data is reloaded. If the object is dynamic, its URL can be a string, or a string plus an ID.

Thus all objects in the system live in one huge address space. On the other hand, the total number of objects is limited. Much of what 1G users access is repeated, and the data that many users access at the same moment is not large, for example commodity and stock transaction data, or popular films and TV shows. Hot stock-transaction data must be stored well, but a large amount of historical data need not be accessible in real time. Where many people do access a large amount of data at the same time, such as a global fine-grained map system or every movie humanity has shot in high definition, the traffic per user is still bounded: the data is static and each user's bandwidth can be limited to a certain range, so a hard-disk buffering mechanism can be used.

Therefore, the whole system can use 5 TB to hold all objects. Each object is accessed through a pointer address, and all objects are distributed over 500-odd PCs. Objects with heavy traffic should keep copies on multiple PCs for performance. Load balancing among them is discussed later.

To check whether a user may access an object, we can design a certain number of "authorization codes" in some format, and give each protected object one or several authorization codes. These codes can also form roles, or multi-level (multi-layer) roles; a user can hold multiple roles, which enables large-scale authorization. For authorizing an object to a single user, the user ID (address pointer) can be recorded in the object itself; for example, blog content can only be modified by the blog's author.
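
A toy sketch of the authorization-code and role idea; the structure and names are ours, not from the source:

```python
# Each protected object carries authorization codes and an optional owner;
# a user's roles boil down to a set of codes they hold.
class Obj:
    def __init__(self, auth_codes=(), owner=None):
        self.auth_codes = set(auth_codes)
        self.owner = owner   # user index, e.g. a blog's author

def can_access(user_index, user_codes, obj):
    # The recorded owner is always allowed (the blog-author example).
    if obj.owner == user_index:
        return True
    # Otherwise any matching authorization code grants access.
    return bool(obj.auth_codes & user_codes)

blog = Obj(auth_codes={"editors"}, owner=42)
assert can_access(42, set(), blog)           # the author
assert can_access(7, {"editors"}, blog)      # role match
assert not can_access(7, {"readers"}, blog)  # no permission
```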

The permission data is designed to occupy between 1 MB and 100 MB of memory. We can continue to discuss this design.

* Calculate an object

This section is specifically about computing large objects, such as real-time drawing or matching trades of a commodity.

If the computation can be decomposed in parallel and then merged, as in real-time drawing, it is split across several CPU cores and the parts are then sent to a single CPU core for synthesis. If the computation cannot be parallelized, such as matching trades of a single commodity, it runs on one CPU core.
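
The decompose-and-merge pattern can be sketched like this. It is a pure-Python stand-in: a thread pool plays the role of the CPU cores, and the canvas-strip split is a hypothetical example of the real-time drawing case:

```python
# Decompose a drawing computation into canvas strips, one per "core",
# then merge the partial results on a single "core".
from multiprocessing.dummy import Pool  # thread pool standing in for cores

def render_strip(strip):
    # Per-core work: keep each stroke whose y falls inside this strip.
    lo, hi, strokes = strip
    return [(y, s) for (y, s) in strokes if lo <= y < hi]

def render(strokes, height=100, cores=4):
    step = height // cores
    strips = [(i, i + step, strokes) for i in range(0, height, step)]
    with Pool(cores) as p:
        parts = p.map(render_strip, strips)          # decompose
    return sorted(pt for part in parts for pt in part)  # merge

strokes = [(5, "a"), (55, "b"), (95, "c")]
assert render(strokes) == sorted(strokes)
```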

The result is saved as static data. To copy it to multiple PCs, broadcast it over Ethernet; if too many broadcasts hurt performance, subnets can be divided.

* Send object data to the user

For performance, no output buffer is used; that is, data is never copied within global memory to serve as an output buffer.

Each object gets an output TCP connection pool, and each entry in the pool is the sending state of one TCP connection. The complete HTTP response is assembled here directly from the static object data: a piece of the object, smaller than both an Ethernet payload and the user's TCP receive window, is packed into an IP packet (carrying the server's public IP as the source address) inside one Ethernet frame. Once the packet is ready, it is pushed to the router and sent to the user.
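
The zero-copy packetization idea can be sketched as follows. SEGMENT and the function name are our assumptions; `memoryview` stands in for sending the object's bytes in place, without an output buffer:

```python
# Slice a static object into packet-sized segments without copying it.
SEGMENT = 1400  # payload bytes per packet, per the text's ~1.4 KB figure

def segments(obj_data, offset=0):
    """Yield (seq_offset, payload) pairs referencing the object in place."""
    view = memoryview(obj_data)   # no copy: the "no output buffer" idea
    while offset < len(view):
        yield offset, view[offset:offset + SEGMENT]
        offset += SEGMENT

data = bytes(10 * 1024)  # the walkthrough's 10 KB object
parts = list(segments(data))
assert len(parts) == 8   # ceil(10240 / 1400)
assert sum(len(p) for _, p in parts) == len(data)
```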

The memory for these TCP connection pools is allocated dynamically. Its total size is close to 1 TB, spread across the PCs where the objects live. It can be managed with whole-page allocation, similar to the Linux kernel slab allocator. Because TCP sends can time out, the memory becomes reusable within a bounded period.
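
A rough sketch of the page-granular, slab-like pool (the structure and names are hypothetical): connection records are carved out of whole pages, and a page becomes reclaimable once all of its records have been freed or have timed out:

```python
# Slab-style pool: whole pages hold fixed-size connection records.
PAGE_SLOTS = 32   # records per 4 KB page (assuming 128 B per record)

class ConnPool:
    def __init__(self):
        self.pages = []        # each page: the set of live slot numbers

    def alloc(self):
        """Return a (page, slot) handle, grabbing a new page if needed."""
        for i, live in enumerate(self.pages):
            if len(live) < PAGE_SLOTS:
                slot = next(s for s in range(PAGE_SLOTS) if s not in live)
                live.add(slot)
                return (i, slot)
        self.pages.append({0})   # take a fresh whole page
        return (len(self.pages) - 1, 0)

    def free(self, handle):
        """Free a record; report whether its page is now reclaimable."""
        page, slot = handle
        self.pages[page].discard(slot)
        return len(self.pages[page]) == 0

pool = ConnPool()
h = pool.alloc()
assert pool.free(h) is True    # last record on the page: reclaim it
```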

* A typical HTTP Request/Response Process

(1) Using a client such as a browser or a chat tool, the user sends a TCP connection request (one IP packet) to the system; it arrives first at the load balancer.

(2) The load balancer forwards the TCP request to the "data" NIC of some PC; the "data" NIC's driver replies directly and establishes the TCP connection.

(3) The load balancer then sends the user's HTTP POST request (assumed to fit in one Ethernet packet) to the "data" NIC of a user-management PC.

(4) A "data" CPU core on that user-management PC parses the session ID and finds that the user is not one of its own, so it forwards the request to a second user-management PC.

(5) The second user-management PC checks that the user's session ID is correct, parses the URL to get the object ID, and then forwards the request, according to the load-balancing algorithm, to the "data" NIC of an object-management PC.

(6) A "data" CPU core on the object-management PC parses the user's POST body, computes, and updates the object's static data.

(7) The object-management PC broadcasts the updated static data to the other object-management PCs.

(8) The object-management PC adds a connection record to the object's TCP connection pool.

(9) The object-management PC notifies the load balancer that subsequent IP packets of this TCP connection should be sent to this machine, ideally to this very NIC.

(10) The object-management PC takes a piece of the static data and packs it into an Ethernet frame containing an IP packet whose destination is the user's IP address, then sends it to the router.

(11) When the user receives the IP packet, the client sends back an ACK. The load balancer delivers the ACK to the object-management PC, which then packs and sends the next packet.

(12) When the object-management PC has sent all the data, or the connection has timed out, it deletes the connection record from the pool. When every connection record on a page has been deleted, the page is reclaimed.

(13) For a 10 KB object, this request/response process can complete within one second, with 1G requests in flight at the same time.

* Legacy issues

The following issues are not considered in the current version of the design.

(1) SSL/TLS

(2) Hot swapping/fault tolerance

(3) How to store data on the hard disk, and whether to use a database

(4) Internal load balancing

(5) Packet loss in internal network forwarding

(6) CPU cache optimization
