Key points for developing large-scale and high-load website applications


 


Author: maid
Source: http://www.phpchina.com/bbs/thread-15484-1-1.html

After reading some posts about so-called large projects, I was a little uncomfortable with how they never discussed the underlying ideas, so let me share my own view. Personally, I think it is hard to define whether a project is "large": even a simple application becomes a challenge under high load and high growth. So in my opinion the real topic is simply high load. Many of the problems that must be considered under high concurrency or high growth have little to do with application code; they are tied to the overall system architecture.

  • Database

Yes, the database comes first; it is the first SPOF (single point of failure) most applications face. For web applications in particular, database responsiveness must be solved first. MySQL is the most common choice. You may start with a single MySQL host, but once the data grows past about one million rows, MySQL's performance drops sharply. The usual optimization is M-S (master-slave) synchronous replication, splitting queries and write operations across different servers. I recommend the M-M-Slaves layout: two master MySQL instances plus multiple slaves. Note that although there are two masters, only one of them is active at any given time; we can switch between them when needed. The two masters exist to ensure the master does not become the system's SPOF. The slaves, combined with LVS, can further balance SELECT operations across the different slave instances.
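The read/write split described above can be sketched in a few lines. This is a minimal illustration, not a real driver: `FakeConn` and `ReplicatedDB` are hypothetical names, and a production setup would hold real MySQL connections, detect the currently active master, and handle failover.

```python
import itertools

class ReplicatedDB:
    """Route writes to the active master and spread SELECTs across slaves.

    `master` stands for whichever of the two masters is currently active
    in the M-M layout; `slaves` are the read replicas behind LVS.
    """
    def __init__(self, master, slaves):
        self.master = master
        self._slave_cycle = itertools.cycle(slaves)

    def execute(self, sql):
        # SELECTs rotate across slaves; everything else hits the master.
        if sql.lstrip().upper().startswith("SELECT"):
            return next(self._slave_cycle).run(sql)
        return self.master.run(sql)

class FakeConn:
    """Stand-in for a real MySQL connection."""
    def __init__(self, name):
        self.name = name
    def run(self, sql):
        return (self.name, sql)

db = ReplicatedDB(FakeConn("master-1"), [FakeConn("slave-1"), FakeConn("slave-2")])
print(db.execute("SELECT * FROM users")[0])   # slave-1
print(db.execute("UPDATE users SET x=1")[0])  # master-1
print(db.execute("SELECT 1")[0])              # slave-2
```

In practice the routing usually lives behind LVS or in a database access layer, not in every application.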

The architecture above can withstand a certain load, but as the number of users grows and your user table exceeds ten million rows, the master itself becomes the SPOF. You cannot keep adding slaves at will, or the overhead of replication synchronization climbs. What then? My approach is table partitioning, partitioned at the business layer. Take user data as the simplest example: based on some splitting key such as the user ID, users are split across different database clusters, and a global database holds the metadata for lookups. The disadvantage is that every query gains an extra step. For example, to look up a user, you first consult the global database group to find the cluster ID for that user, then fetch the user's actual data from the designated cluster.
Each cluster can itself be M-M or M-M-Slaves.
This is a scalable structure: as the load grows, you simply add new MySQL clusters.
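The two-step lookup can be sketched as follows; the dicts stand in for the global metadata database and the per-cluster databases, and all names and IDs are illustrative.

```python
# Global metadata: user id -> cluster id (stand-in for the global database).
user_cluster = {1001: "c1", 1002: "c3"}

# Each cluster is itself an M-M or M-M-Slaves MySQL group; dicts stand in here.
clusters = {
    "c1": {1001: {"name": "alice"}},
    "c3": {1002: {"name": "bob"}},
}

def get_user(user_id):
    # Step 1: ask the global database group for the user's cluster id.
    cluster_id = user_cluster[user_id]
    # Step 2: fetch the actual row from the designated cluster.
    return clusters[cluster_id][user_id]

print(get_user(1002)["name"])  # bob
```

Adding capacity means adding a new cluster and assigning new users to it; existing lookups are unaffected.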

Note that:
1. Disable all auto_increment fields.
2. IDs must be allocated centrally by a common algorithm.
3. You need a good way to monitor the load of each MySQL host and the health of the service. Once you have more than 30 MySQL databases running, you will understand what I mean.
4. Do not use persistent connections (do not use pconnect). Instead, use a third-party connection pool such as SQLRelay, or simply write one yourself, because PHP4's MySQL persistent connections are notoriously buggy.
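Points 1 and 2 go together: once tables are sharded, per-table auto_increment can collide across clusters, so IDs come from one central authority. A common scheme is handing out blocks of IDs; this sketch uses a local counter, whereas a real system would back the counter with a dedicated ticket table or service.

```python
import threading

class IdAllocator:
    """Central ID allocator replacing per-table auto_increment.

    Handing out whole blocks keeps allocation cheap while a single
    authority still guarantees global uniqueness across all clusters.
    """
    def __init__(self, block_size=1000):
        self._lock = threading.Lock()
        self._next_block = 0
        self.block_size = block_size

    def next_block(self):
        # An app server grabs a block, then assigns IDs from it locally.
        with self._lock:
            start = self._next_block * self.block_size + 1
            self._next_block += 1
        return range(start, start + self.block_size)

alloc = IdAllocator()
a = alloc.next_block()
b = alloc.next_block()
print(a.start, a.stop - 1)  # 1 1000
print(b.start, b.stop - 1)  # 1001 2000
```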

  • Cache

Caching is another big issue. I generally use memcached as a cache cluster, usually around ten servers (roughly a 10 GB memory pool). Note: never let the cache touch swap; it is best to disable swap on Linux altogether.
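The usual access pattern with such a memcached pool is cache-aside: check the cache, fall back to the database on a miss, then populate the cache. A sketch, with a plain dict standing in for the memcached client (a real client would expose the same get/set shape, plus a TTL):

```python
cache = {}          # stands in for the memcached cluster
db_hits = {"n": 0}  # instrumentation so the example is checkable

def load_user_from_db(user_id):
    # Stand-in for the real MySQL query.
    db_hits["n"] += 1
    return {"id": user_id, "name": "user%d" % user_id}

def get_user(user_id):
    key = "user:%d" % user_id
    row = cache.get(key)            # memcached get
    if row is None:
        row = load_user_from_db(user_id)
        cache[key] = row            # memcached set (a real set takes a TTL)
    return row

get_user(7); get_user(7); get_user(7)
print(db_hits["n"])  # 1  (two of the three reads were cache hits)
```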

  • Load Balancing/acceleration

When caching comes up, some people first think of static pages, so-called static HTML. I consider that common sense, not a key point. What static pages really bring is the question of load balancing and acceleration for static content. I think lighttpd + squid is the best approach:

LVS <-------> lighttpd ==> squid(s) === lighttpd

I often use the setup above. Note that I do not use Apache; I never deploy it unless required, because I usually run PHP-FastCGI with lighttpd, whose performance is much better than Apache + mod_php.
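For reference, the lighttpd-to-PHP-FastCGI wiring described here might look roughly like this. This is a sketch only: the paths, port, and host are illustrative, and your distribution's defaults will differ.

```text
# lighttpd.conf (sketch)
server.modules += ( "mod_fastcgi" )

fastcgi.server = ( ".php" => ((
    "host" => "127.0.0.1",   # or a dedicated PHP FastCGI server's address
    "port" => 9000,
    # alternatively, spawn php-cgi locally:
    # "bin-path" => "/usr/local/bin/php-cgi",
)))
```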

Squid can also solve file synchronization and similar problems, but you must monitor the cache hit rate closely and push it above 90% if at all possible.
Squid and lighttpd each deserve plenty of discussion on their own.

  • Storage

Storage is also a big problem. One side is small-file storage, such as images; the other is large-file storage, such as search-engine indexes, where a single file is often larger than 2 GB.
The simplest way to serve small files is to distribute them with lighttpd. Alternatively, use Red Hat GFS: the advantage is that it is transparent to the application, the disadvantage is cost, by which I mean you have to buy a disk array. In my project the storage volume is 2-10 TB and I use distributed storage, so file replication and redundancy must be solved at this layer. That way each file can have its own redundancy level; for details, refer to Google's GFS paper.
For large-file storage, see the Nutch solution, which is now an independent Hadoop sub-project. (You can Google it.)
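The per-file redundancy idea can be sketched as a placement function in the spirit of the GFS paper. Everything here is illustrative (node names, hash choice, replica counts); real systems also track node health and rebalance.

```python
import hashlib

NODES = ["store-%d" % i for i in range(6)]

def place_replicas(path, copies):
    """Pick `copies` distinct storage nodes for a file, deterministically.

    Each file carries its own redundancy level, as described above:
    e.g. hot small files might use copies=3, bulk index data copies=2.
    """
    h = int(hashlib.md5(path.encode()).hexdigest(), 16)
    start = h % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(copies)]

print(len(place_replicas("img/1.jpg", 3)))        # 3
print(len(set(place_replicas("idx/part-0", 2))))  # 2
```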

Others:
Other things, such as a passport (single sign-on) system, also need consideration, but they are all relatively simple.

Off to dinner now; I will stop writing here.

 

[Reply]

9tmd:
There are several key parts that must be considered, such as the squid group, and LVS or a VIP (layer-4 switching). For logical table sharding you do not need to look up the ID in the master database on every request; the mapping can be cached periodically or controlled by program logic.
Sharing my own experience: http://www.toplee.com/blog/archives/337.html (discussion welcome)

nightsailer:
The poster above puts it well.
Let me explain why we query the primary (lookup) table. The most important reason is replication and maintenance. Suppose that, according to the program's logic, the user nightsailer should live in cluster S1, but for various reasons I need to move nightsailer's data from S1 to S5, or in some cases merge the data of several clusters. During maintenance I only need to update that user's cluster ID in the primary database from 1 to 5, and the maintenance work can proceed independently without touching the application logic. The program's ID-allocation logic could perhaps encode this, but then the logic would be scattered across every application and the resulting code would be tightly coupled. With the lookup-table approach, by contrast, you only perform the initial allocation once, and no other application needs to know about these algorithms and logic.
Of course, the extra query I mentioned at the beginning does not mean hitting the master database on every request; a caching policy must be layered on top.
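nightsailer's maintenance argument fits in a few lines: moving a user between clusters is a single metadata update, and the application code never changes. Dicts stand in for the lookup database and the clusters; the data is made up.

```python
lookup = {"nightsailer": "S1"}                     # primary/lookup database
clusters = {"S1": {"nightsailer": {"posts": 42}}, "S5": {}}

def get_user_data(name):
    # Application code only ever goes through the lookup table.
    return clusters[lookup[name]][name]

def migrate(name, dst):
    """Operational move: copy the data, then flip one row of metadata."""
    src = lookup[name]
    clusters[dst][name] = clusters[src].pop(name)  # copy data S1 -> S5
    lookup[name] = dst                             # update the cluster id

before = get_user_data("nightsailer")
migrate("nightsailer", "S5")
after = get_user_data("nightsailer")
print(before == after, lookup["nightsailer"])  # True S5
```

Had the cluster been derived from a hash of the name inside every application, the same move would require coordinated code changes everywhere.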

As for why auto_increment should be disabled: it should be obvious that auto_increment-generated IDs cannot survive data merging and splitting.

nightsailer:
In my spare time I have looked at a lot of PHP optimization options. The main measures:
1. Use FastCGI with lighttpd or Zeus.
I personally prefer Zeus: simple and reliable, but it costs money. Lighty is also good, and its configuration file is simple and clean. The latest 1.5, although unstable, shows a clear performance improvement with AIO on Linux; even the current stable version can reach very high performance with epoll on 2.6 kernels. Lighty's disadvantage compared with Zeus is its limited multi-processor support, so you either load-balance across multiple servers or simply start separate processes listening on different ports.
2. A dedicated PHP FastCGI server.
This has many benefits. On this server you run PHP's FastCGI service and can add caches such as XCache, which I personally like, among other things; if the machine has plenty of memory, install everything that can be installed. In addition, you then maintain only one PHP environment, which Apache, Zeus, and lighttpd can all share at the same time, provided they all talk to PHP in FastCGI mode; that way XCache gets fully utilized!
3. Apache + mod_fastcgi
Apache is not useless; sometimes it is required. For example, when I use PHP as a WebDAV server, only Apache will do. In that case, install mod_fastcgi in Apache and, via its external-server feature, point it at the PHP FastCGI processes configured in point 2.
4. Optimized compilation
ICC, Intel's compiler, is my first choice. Recompiling PHP, MySQL, and lighty with ICC yields a considerable gain, especially when you are running on an Intel CPU.
5. 64-bit PHP4 needs a patch
It seems almost no one compiles PHP4 on Linux x86_64, and not only in China; even abroad it is rarely done.
A reminder: none of the official PHP downloads (including the latest php-4.4.4) will compile as-is. The problem lies in autoconf: you must manually modify config.m4, generally in key extensions such as MySQL, GD, and LDAP, as well as in the phpize script, adding /usr/lib64 to the search paths in config.m4.
That said, probably few people are as stubborn as me in sticking with PHP4; PHP5 is fine.
I have also considered migrating to PHP 5.2; writing code in it is just too convenient, and I have been holding back.

nightsailer:
Quote (original post by Wuexp):
"Table sharding makes data operations (updates, deletes, queries) very complex, and sorting in particular is more troublesome. Just hash the user ID and insert into the corresponding table."

I see what you mean.

However, we may not be discussing exactly the same thing. The partitioning I describe is divided according to the business situation:

1. Vertical partitioning.
For example, splitting by business entity: user A's blog posts, tags, and comments are all stored in database A. Taken further, each split carries a complete copy of the data structures (in which case it should really be called database sharding).

2. Horizontal partitioning is also possible.
The rows of one table are divided across different databases.
For example, a message table might be split into daily_message and history_message: daily_message holds the hot objects, week_message the warm ones, and posts older than two months are the cold objects. These objects are placed into different database groups according to their access frequency.

3. Combination of the two

In any case, though, updates and deletes are not complex; there is no difference from unpartitioned tables.
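The hot/warm/cold split above reduces to a routing function. Table names follow the example in the text; the exact age thresholds are illustrative.

```python
from datetime import datetime, timedelta

def message_table(posted_at, now):
    """Route a message to daily_message / week_message / history_message
    by age, mirroring the hot/warm/cold groups described above."""
    age = now - posted_at
    if age <= timedelta(days=1):
        return "daily_message"    # hot
    if age <= timedelta(days=60):
        return "week_message"     # warm
    return "history_message"      # cold

now = datetime(2024, 1, 1)
print(message_table(datetime(2023, 12, 31, 12), now))  # daily_message
print(message_table(datetime(2023, 12, 1), now))       # week_message
print(message_table(datetime(2023, 10, 1), now))       # history_message
```

A periodic job would migrate rows between the groups as they cool.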

As for queries and sorting, you obviously cannot just SELECT ... ORDER BY across everything; instead you generate summary tables, index tables, and reference tables...
In addition, analyze the concrete business to cut out junk data: sometimes only the first 10,000 records are needed, so there is no need to sort the data of every table. Many traditional businesses, such as retail, have huge transaction-log tables, yet their reports are not generated in real time. You should become as familiar with reporting as they are.

You can also refer to the practices of many websites, such as Technorati and Flickr.

The so-called trouble is really that you must think about the system structure when designing the database and put more care in at that stage. As long as the project framework is designed well up front, most of this is transparent to developers. The premise is that you design it properly, rather than letting programmers design it as they write code; that would be a nightmare.

I have written all this rambling not only for programmers; perhaps it is more useful to designers.

9tmd:
With program-logic-controlled sharding you only need to maintain one configuration file for database access. For developers it is completely transparent: they do not worry about where data lives and only call common interfaces. Production systems use this kind of approach all the time, especially for site-wide passport systems, community posts, and the like.

I used this architecture at Yahoo and Mop, and overall it feels trustworthy. After all, a single table carries real risk at extreme data volumes.

 

9tmd:
People keep asking about auto_increment. MySQL's official documentation on M/S replication actually covers this: because MySQL replication works by replaying the SQL log, one-way master->slave use of auto_increment is fine, while the bidirectional M/M mode runs into trouble; think about it for a moment and you will see why. Official documentation:
Http://dev.mysql.com/tech-resour... ql-replication.html
Http://dev.mysql.com/doc/refman/... auto-increment.html

Also, when using MySQL replication, be careful when writing SQL in your own code: do not use functions such as MySQL's NOW(). Instead, compute the time in the PHP program and substitute it into the SQL statement; otherwise the values may differ after replication. Similar problems can arise elsewhere; think it through.
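9tmd's point about NOW() can be made concrete: bind an application-computed timestamp instead of letting each server evaluate the function when the statement is replayed. A sketch (the table, columns, and placeholder style are illustrative):

```python
from datetime import datetime

def build_insert(user_id, now):
    """Return (sql, params) with the timestamp computed in the application.

    A literal timestamp replays identically on the slave, whereas a
    function like NOW() would be re-evaluated there and could differ,
    which is exactly the pitfall the reply above warns about.
    """
    ts = now.strftime("%Y-%m-%d %H:%M:%S")
    # BAD:  "INSERT INTO log (uid, created) VALUES (%s, NOW())"
    sql = "INSERT INTO log (uid, created) VALUES (%s, %s)"
    return sql, (user_id, ts)

sql, params = build_insert(7, datetime(2024, 1, 1, 12, 0, 0))
print(params)  # (7, '2024-01-01 12:00:00')
```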

 
