Notes on Building a High-Performance Web Site: Infrastructure and Applications

Motivation

I finished this 400-page book, Building a High-Performance Web Site, in less than a month. I have to say this is the first time I have truly read a technical book from cover to cover, even though I have read many of them. One reason is that most technical books tend to be dry; even when you start with real purpose and desire, it is easy to give up halfway. This book is different: it held my interest throughout, and almost every chapter resonated with me, so I got through it quickly.

The author uses a typical LAMP stack as the running example. I have hardly touched that stack myself, but I believe the ideas carry over; the underlying way of thinking is what matters. This article is therefore a summary of the book's own summary.

 

Overview

The following figure took a long time to draw. It extracts the basic architecture-design material from the book and condenses it into one picture, with an image map: you can click a dotted box to jump to the corresponding section below. Note, however, that the figure is just a hodgepodge of techniques, not an actual architecture.


Component separation

DNS load balancing

Different kinds of web content are distributed to different servers, and subdomains are carved out so that DNS naturally routes requests to the right server. Content falls into two main categories:

    • Dynamic content: CPU- and I/O-intensive
    • Static content: I/O-intensive

By configuring multiple A records in DNS, requests are spread across the servers in a cluster. This also helps large websites with geographic distribution, since DNS can direct users to a nearby web server. DNS software such as BIND provides a wide range of scheduling policies. However, if a host in the cluster fails, it usually takes some time for DNS caches to expire; in addition, a client can bypass DNS scheduling entirely by editing its hosts file.
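As an illustration, here is a minimal sketch of round-robin DNS behavior in Python: several A records for one name, rotated between lookups. The zone data and IP addresses are made up for the example.

```python
import itertools

# Hypothetical zone data: multiple A records for the same name,
# as a BIND zone might define them (name and IPs are invented).
ZONE = {
    "www.example.com": ["192.0.2.10", "192.0.2.11", "192.0.2.12"],
}

# One round-robin iterator per name, mimicking how a DNS server
# rotates the order of A records between responses.
_rotations = {name: itertools.cycle(ips) for name, ips in ZONE.items()}

def resolve(name: str) -> str:
    """Return the next A record for `name` in round-robin order."""
    return next(_rotations[name])

# Three consecutive lookups land on three different servers.
print([resolve("www.example.com") for _ in range(3)])
```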

Cross-subdomain cookie sharing: widen the cookie's domain attribute to the parent domain so that every subdomain can read it.
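A quick sketch of widening a cookie to the parent domain, using Python's standard http.cookies module (the cookie name, value, and domain are illustrative):

```python
from http.cookies import SimpleCookie

# Setting Domain to the parent domain lets every subdomain --
# www.example.com, img.example.com, ... -- receive the cookie.
cookie = SimpleCookie()
cookie["session_id"] = "abc123"
cookie["session_id"]["domain"] = ".example.com"

# The attribute string a server would emit in its Set-Cookie header.
print(cookie["session_id"].OutputString())
```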

 

HTTP redirection

Distribute request pressure by redirecting the client. For example, download services commonly keep several mirror storage servers and redirect each client to one of them.
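A minimal sketch of this pattern: a front server picks a mirror and answers with an HTTP 302 pointing at it. The mirror host names are invented for the example.

```python
import random

# Hypothetical mirror list for a download service (hosts are made up).
MIRRORS = ["dl1.example.com", "dl2.example.com", "dl3.example.com"]

def redirect_for(path: str, seed=None):
    """Pick a mirror and return the (status, Location) pair the
    front server would send back as an HTTP redirect."""
    rng = random.Random(seed)
    host = rng.choice(MIRRORS)
    return 302, "http://{}{}".format(host, path)

status, location = redirect_for("/files/app.tar.gz", seed=1)
print(status, location)
```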

 

Distributed cache

When page-level caching is not possible, consider caching data directly, for example with memcached. In that case you must think about concurrent writes to memcached. In addition, as the memcached tier scales horizontally and the number of servers grows, you need an algorithm that lets the application know which memcached server to connect to for a given key (for example, a modulo operation). A distributed cache can simply be rebuilt after a node goes down, so server failure is not fatal.
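A sketch of the modulo scheme mentioned above, with made-up server addresses: hash the key, then take it modulo the server count. (Consistent hashing would reduce the remapping caused by adding or removing servers, but the simple version looks like this.)

```python
import hashlib

# Hypothetical memcached cluster (addresses are invented).
SERVERS = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"]

def server_for(key: str) -> str:
    """Pick a server by hashing the key and taking the hash modulo
    the server count -- the simple scheme the text mentions."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]

# The same key always maps to the same server.
assert server_for("user:42") == server_for("user:42")
print(server_for("user:42"))
```

Note that changing `len(SERVERS)` remaps most keys, which is why the text says the cache gets rebuilt when the cluster grows.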

 

Load balancing

A load balancer distributes requests, which raises the question of how to design scheduling policies that maximize cluster performance. When the hosts in the cluster have equivalent capability, plain round-robin scheduling is enough; when their capabilities are uneven, more sophisticated (for example, weighted) scheduling is needed. As the problem grows more complex, always watch the scheduler's own performance, so that scheduling itself does not become the bottleneck.
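For the uneven-capability case, here is a weighted round-robin sketch (host names and weights are invented): each host receives requests in proportion to its weight.

```python
# A minimal weighted round-robin scheduler for hosts of uneven
# capability. Hosts and weights are made-up examples.
WEIGHTED_HOSTS = [("fast-host", 3), ("slow-host", 1)]

def schedule(n: int) -> list:
    """Return the first n scheduling decisions: each host appears
    in proportion to its weight."""
    expanded = [h for h, w in WEIGHTED_HOSTS for _ in range(w)]
    return [expanded[i % len(expanded)] for i in range(n)]

print(schedule(8))
```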

Reverse proxy load balancing

A reverse proxy works at the HTTP layer, much like an ordinary proxy server, except that it acts on behalf of the backend servers rather than on behalf of the client. This is similar to NAT, except that NAT works at the network layer. Likewise, a reverse proxy emphasizes "forwarding" rather than mere "transferring", because it relays both the client's request and the server's response. Software that can act as a reverse proxy includes Nginx, lighttpd, and Apache; there are also dedicated proxy appliances that work at the application layer, such as A10 devices.

Pay attention to the following issues when using proxy forwarding:

    • Because of its forwarding role, the reverse proxy itself may become a performance bottleneck. Proxying generally suits CPU-intensive requests; for I/O-intensive workloads, this cluster mode may not deliver its full benefit.
    • Health checks should be enabled on the proxy so that faulty machines in the cluster are discovered promptly and the forwarding policy adjusted. This generally reacts faster than DNS-based failover.
    • Sticky sessions: for an application that keeps user state in a server-side session, or that uses a dynamic-content cache on the backend server, all of a user's requests within one session must stay on the same server. Proxy servers generally support this kind of configuration. However, try not to make the application so server-local: for example, store user data in cookies, use distributed sessions, or use a distributed cache instead.

IP load balancing

As the name implies, requests are forwarded at the network layer, similar to a NAT gateway. Gateway forwarding can become a bandwidth bottleneck, however, because there is only one egress, so the egress link needs high bandwidth. The netfilter module in Linux can be configured through iptables: for example, requests arriving from the Internet on port 8001 are forwarded to one intranet server, while requests on port 8002 are forwarded to another. This method is easy to use, but its scheduling options are limited. LVS-NAT also forwards at the network layer in Linux; unlike plain netfilter, it supports dynamic scheduling algorithms such as least-connections, weighted least-connections, and shortest expected delay.
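A toy version of the least-connections policy mentioned above (server names are made up): each new request goes to whichever server currently has the fewest active connections. A real LVS director tracks this state in the kernel; this sketch only shows the idea.

```python
# Simulated connection counts per real server (names are invented).
active = {"web1": 0, "web2": 0, "web3": 0}

def pick_least_connections() -> str:
    """Choose the server with the fewest active connections and
    record the new connection against it."""
    server = min(active, key=active.get)
    active[server] += 1  # the new connection is now counted
    return server

# Simulate five arrivals; load spreads evenly across the servers.
picks = [pick_least_connections() for _ in range(5)]
print(picks, active)
```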

Direct routing

With direct routing, the scheduler modifies the destination MAC address of the request packet and forwards it, while the response packet travels directly to the Internet. The obvious advantage is that there is no gateway bottleneck to worry about, but the real servers and the scheduler must be attached to the same external switch, and each needs its own Internet-facing IP address.

The working principle of this method is slightly complicated:

First, each server is given an IP alias, a virtual IP address as seen by the client. Only the proxy server answers ARP requests for this alias, so request packets that the client sends to this IP address arrive at the proxy server first. The proxy server then rewrites the destination MAC address of the request packet to the MAC address of the chosen real server (selected by a scheduling algorithm). Because the real server also holds the IP alias, it accepts and processes the forwarded packet. Finally, since the response packet carries the virtual IP the client originally requested as its source address, the real server sends the response directly to the client through the switch, bypassing the proxy server.

Linux implements direct routing through LVS-DR.

IP tunneling

With IP tunneling, the scheduler encapsulates the original IP packet inside a new IP packet and forwards it to the chosen real server, which can then send its response packet directly to the client.

 

Shared File System

For simple services that serve file downloads (including the static resources referenced by HTML), the natural move is a cluster to spread the load. But how do we keep these resources synchronized across the hosts in the cluster?

NFS

One solution is to let all hosts fetch data from the same place, for example over NFS (Network File System), which is based on RPC. This is easy to set up, but the NFS server's disk throughput, concurrency limits, and bandwidth often make it a serious constraint.

Redundancy Distribution

Another solution is to store the resources redundantly on every host, so that each host reads from its own local disk instead of a shared file system. That, however, raises a synchronization problem. How do we synchronize the data?

Active distribution, which is further divided into single-level and multi-level distribution; distribution can be implemented with SCP, SFTP, or the HTTP extension protocol WebDAV.

    • Single-level distribution: content reaches every host in one distribution hop. However, the distributing host's disk and network bandwidth can become a bottleneck, making this hard to scale.
    • Multi-level distribution: content reaches its destination through several distribution tiers. This spreads the disk and bandwidth pressure and scales easily; the drawback is higher cost.

Passive synchronization is easy to understand: rsync, for example, decides whether to synchronize a file based on its last-modification time. As a result, if a directory tree contains a great many files, rsync takes a long time to scan it. By keeping meaningful modification times on directories and planning the directory layout sensibly, you can speed up rsync's scan. Even without rsync, the same idea can drive a home-grown synchronization program.
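The modification-time criterion can be sketched in a few lines of Python. This is a simplified illustration of the idea, not a replacement for rsync (which also checks file sizes and can compare checksums):

```python
from pathlib import Path

def files_to_sync(src: str, dst: str) -> list:
    """Return source files that are missing from dst, or whose
    last-modification time is newer than the dst copy's -- the
    quick-check idea the text attributes to rsync."""
    out = []
    src_path = Path(src)
    for f in src_path.rglob("*"):
        if not f.is_file():
            continue
        rel = f.relative_to(src_path)
        target = Path(dst) / rel
        if not target.exists() or f.stat().st_mtime > target.stat().st_mtime:
            out.append(str(rel))
    return sorted(out)
```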

 

Distributed File System

A distributed file system works at the user-process layer. It is a file-management platform that internally handles redundancy, retrieval, tracking, scheduling, and similar work, and it usually comprises an organizational structure at the physical layer and one at the logical layer. The physical layout is maintained by the distributed file system itself; the logical layout is what users see. The "tracker" plays the key role in connecting the two.

MogileFS is an open-source distributed file system written in Perl, consisting of a tracker, storage nodes, and management tools. It keeps all of the file system's metadata in MySQL and replicates files over WebDAV. Another well-known example is Hadoop's HDFS.

Each file is identified by a key. To read a file, you present its key; the tracker returns an actual path, and you fetch the file from that address. You can even cache the key-to-path mapping in a distributed cache, which reduces the tracker's query load but forfeits the file system's own scheduling policy. Alternatively, a reverse proxy that supports reproxying (such as Perlbal) can resolve the path on the proxy itself.
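The key-to-path caching idea might look like this in Python. The tracker is simulated with an in-memory dict; in a real deployment a MogileFS tracker would be queried over the network, and a shared cache such as memcached would replace the local dict.

```python
# Simulated tracker database mapping file keys to storage paths
# (key and path are invented for the example).
TRACKER_DB = {"photo:99": "http://store-node-2/dev1/0/000/099.fid"}

_lookup_cache = {}   # stands in for a distributed cache
tracker_queries = 0  # counts simulated tracker round trips

def locate(key: str) -> str:
    """Return the storage path for `key`, consulting the tracker
    only on a cache miss."""
    global tracker_queries
    if key not in _lookup_cache:
        tracker_queries += 1  # simulated network query to the tracker
        _lookup_cache[key] = TRACKER_DB[key]
    return _lookup_cache[key]

locate("photo:99")
locate("photo:99")
print(tracker_queries)  # the second lookup hit the cache
```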

 

Database scaling

Master-slave replication, read/write splitting

This method uses the database's replication or mirroring feature to keep the same data on multiple databases, and then separates reads from writes: writes are concentrated on one primary database, while reads are spread across several replicas. It suits websites that read far more than they write. If you would rather not maintain this separation mapping at the application layer, a database reverse proxy can perform the read/write splitting automatically.
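An application-layer sketch of read/write splitting (connection strings are invented): statements that begin with SELECT go to a replica in round-robin order, everything else goes to the primary.

```python
import itertools

# Invented connection endpoints for the example.
PRIMARY = "db-primary:3306"
REPLICAS = itertools.cycle(["db-replica-1:3306", "db-replica-2:3306"])

def route(sql: str) -> str:
    """Pick a server for a statement: anything that is not a SELECT
    is treated as a write and sent to the primary."""
    if sql.lstrip().upper().startswith("SELECT"):
        return next(REPLICAS)
    return PRIMARY

print(route("INSERT INTO t VALUES (1)"))  # goes to the primary
print(route("SELECT * FROM t"))           # goes to a replica
```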

Vertical partitioning

Tables that never need to be joined with each other can be placed on different database servers; this is called vertical partitioning. Each partition can, of course, also use read/write splitting.

Horizontal partitioning

Splitting the records of a single table across different tables, or even different servers, is called horizontal partitioning. It usually requires a stable algorithm so that reads can find the right server: for example, simply taking the ID modulo the shard count, partitioning by ID range, or storing an explicit mapping relationship. You can also use a proxy-style product such as Spock Proxy.
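The simple modulo scheme mentioned above, sketched with invented shard names:

```python
# Invented shard names for the example.
SHARDS = ["users_shard_0", "users_shard_1", "users_shard_2"]

def shard_for(user_id: int) -> str:
    """Route a record to a shard by taking its ID modulo the
    shard count -- stable as long as the shard count is fixed."""
    return SHARDS[user_id % len(SHARDS)]

assert shard_for(7) == "users_shard_1"  # 7 % 3 == 1
print(shard_for(7))
```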

 

Summary

This article summarizes the basic architecture material in Building a High-Performance Web Site. It is only a concise overview; almost every topic here has far more depth to explore.

Original work; please credit the source when reposting: http://www.cnblogs.com/P_Chou/archive/2012/10/10/high-performance-web-site-infrastructure.html
