By a large website I mean one with high concurrency, massive data volumes, and complex business logic. Building such a site usually requires technical optimization from many angles. I have no hands-on experience with some of the techniques described in this article; it is only a record for later study.
I. Database related issues
In a large website, the database is usually the most likely performance bottleneck, because a relational database can sustain only a limited number of concurrent connections. It can be optimized from the following angles.
1. Optimize database queries thoroughly. Most websites read far more often than they write, so fetching the data quickly, closing the connection, and releasing resources is the simplest optimization and the first one to do.
- Index the table fields that queries filter on appropriately;
- Query only the fields you need; leave the fields you do not need out of the statement;
- Try to avoid queries that cannot use an index, such as fuzzy matching with a leading wildcard and, depending on the database, IN queries;
- Avoid transactions where possible. A transaction can lock tables, and when handled badly it easily causes deadlocks. You can move this logic into the application instead; using a distributed lock there can effectively relieve the database, as briefly introduced later;
- Under normal circumstances, do not use triggers;
- If fuzzy queries are frequent and the data volume is large, a full-text search engine such as Lucene is recommended. It is widely used on the Java and .NET platforms and there is plenty of material online, for example http://www.cnblogs.com/MeteorSeed/archive/2012/12/24/2703716.html. I will not say more here, since this is all basic knowledge.
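The first two points can be sketched with Python's built-in `sqlite3` module. This is only an illustration under assumptions: the `orders` table, its columns, and the index name `idx_orders_user` are made up for the example, and other databases report their query plans differently.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical `orders` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, "
        "status TEXT, total REAL, note TEXT)"
    )
    # Index the column that the frequent query filters on.
    conn.execute("CREATE INDEX idx_orders_user ON orders (user_id)")
    rows = [(i, i % 10, "paid", i * 1.5, "n/a") for i in range(1, 101)]
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def orders_for_user(conn, user_id):
    """Select only the needed columns, filtered on the indexed column."""
    cur = conn.execute(
        "SELECT id, total FROM orders WHERE user_id = ? ORDER BY id",
        (user_id,),
    )
    return cur.fetchall()


def query_plan(conn, user_id):
    """Return SQLite's plan text so we can check whether the index is used."""
    cur = conn.execute(
        "EXPLAIN QUERY PLAN SELECT id, total FROM orders WHERE user_id = ?",
        (user_id,),
    )
    return " ".join(str(row[-1]) for row in cur.fetchall())
```

Calling `query_plan(conn, 3)` on this demo database should mention `idx_orders_user`, which is a quick way to confirm that a query is served by the index rather than a full table scan.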
2. Splitting databases. Splitting the database can effectively relieve the pressure caused by high concurrency and is also a relatively simple approach. There are two kinds: vertical splitting and horizontal splitting.
- Splitting the database along business lines into two databases placed on two servers is called vertical splitting;
- When one table holds too much data, splitting it into multiple tables is called horizontal splitting.
Vertical splitting raises two problems:
- Join queries. When two tables live in two different databases, a join across them is very inefficient. To avoid this, split the join into multiple queries;
- Transactions. When inserts, updates, and deletes must be committed as one transaction and the tables involved live in different databases, an ordinary database transaction cannot cover them.
To avoid these situations, pay attention to the granularity of the split and keep closely related tables in the same database.
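Splitting a join into multiple queries might look like the sketch below. It assumes two separate in-memory SQLite databases standing in for two servers; the `users` and `orders` tables and the helper names are hypothetical.

```python
import sqlite3


def make_user_db():
    """Stand-in for the database server that holds user data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])
    conn.commit()
    return conn


def make_order_db():
    """Stand-in for the database server that holds order data."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(10, 1, 20.0), (11, 1, 5.0), (12, 2, 7.5)],
    )
    conn.commit()
    return conn


def orders_with_user_names(user_db, order_db, min_total):
    """Replace a cross-database join with two rounds of simple queries."""
    # Round 1: fetch the matching orders from the order database.
    orders = order_db.execute(
        "SELECT id, user_id, total FROM orders WHERE total >= ? ORDER BY id",
        (min_total,),
    ).fetchall()
    # Round 2: one primary-key lookup per distinct user in the user database.
    names = {}
    for user_id in {order[1] for order in orders}:
        row = user_db.execute(
            "SELECT name FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        names[user_id] = row[0] if row else None
    # Combine the two result sets in the application.
    return [(oid, names[uid], total) for oid, uid, total in orders]
```

The combination step runs in the application, so it only stays cheap when the first query returns a small result set.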
Horizontal splitting is more complex. It has not only the problems above but also additional ones:
- The two problems above become much harder under horizontal splitting, because the data is split along more dimensions;
- Designing the primary key becomes difficult;
- We need a rule that decides which of the split tables each row is stored in;
- As the number of tables grows, the development and maintenance workload and the server management burden both increase.
1). We can split a table horizontally by how hot the data is. For the order table, for example, we can keep the orders from the last month in one table and the rest in another, which can greatly improve query efficiency.
2). When splitting a table horizontally, the primary keys must not repeat across the split tables, because logically they are still the same table; see http://www.cnblogs.com/heyuquan/p/3261250.html
3). Sometimes we need to query data across the horizontally split tables. This is fairly hard to implement and should be avoided where possible. Most of the time we do not need to return all the data, so we should avoid this situation at the business level. Often there is no perfect solution and something has to be given up. For the order table split in 1), for example, we can offer two tabs: one queries orders from the last month and the other queries historical orders. Whatever can be solved at the business level should not be solved with technology, because technical solutions tend to increase risk, so evaluate fully before choosing one.
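Points 1) and 2) can be sketched as follows. The table names `orders_recent` and `orders_history`, the 30-day window, and the ID layout (millisecond timestamp, 10-bit worker number, 12-bit sequence) are all assumptions for illustration, not something the linked article prescribes.

```python
import threading
import time
from datetime import date, timedelta

RECENT_DAYS = 30  # assumed "hot" window of roughly one month


def table_for_order(order_date, today):
    """Pick the (hypothetical) table that holds an order placed on order_date."""
    if today - order_date <= timedelta(days=RECENT_DAYS):
        return "orders_recent"
    return "orders_history"


class IdGenerator:
    """Generate IDs that do not repeat across the split tables.

    Assumed layout: timestamp in ms | 10-bit worker id | 12-bit sequence.
    """

    def __init__(self, worker_id):
        if not 0 <= worker_id < 1024:
            raise ValueError("worker_id must fit in 10 bits")
        self.worker_id = worker_id
        self.last_ms = -1
        self.sequence = 0
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now_ms = int(time.time() * 1000)
            if now_ms < self.last_ms:
                # Never go backwards, even if the clock does.
                now_ms = self.last_ms
            if now_ms == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF
                if self.sequence == 0:
                    # Sequence used up for this millisecond: borrow the next one.
                    now_ms = self.last_ms + 1
            else:
                self.sequence = 0
            self.last_ms = now_ms
            return (now_ms << 22) | (self.worker_id << 12) | self.sequence
```

Each application server would need its own `worker_id`; how those numbers are assigned is left open here.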
3. Master-slave replication and read/write separation
If the pressure on the database is too high, we can separate reads from writes. Writes go to a single database, the master, and reads go to several load-balanced databases, the slaves. The master synchronizes its data to the slaves, but this introduces some delay, so you cannot guarantee that the data you read is the latest. Whether that is acceptable needs to be evaluated; see http://www.cnblogs.com/netfocus/p/4055346.html

If we need both real-time reads and read/write separation, we can add a middle tier. When writing to the database, write the data to the middle tier as well, and delete it from the middle tier once synchronization to the slaves is confirmed. A read checks the middle tier first and then the database, and where both hold the same record the middle tier's copy wins because the middle tier is always up to date. The remaining problem is that we have no direct way of knowing whether the data has been synchronized, so we need a monitoring program that compares the slaves with the middle tier to confirm that synchronization is complete. This is my own idea and I have not tried it in practice.
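The middle-tier idea can be sketched in memory as below. Plain dictionaries stand in for the master, the slaves, and the middle tier, and `replicate()` plays the role of both replication and the monitoring program; the class and method names are hypothetical, and a real deployment would have to deal with failures that this sketch ignores.

```python
import itertools


class ReadWriteRouter:
    """Route writes to the master and reads to slaves, with a middle tier."""

    def __init__(self, slave_count=2):
        self.master = {}
        self.slaves = [{} for _ in range(slave_count)]
        # Middle tier: writes not yet confirmed on every slave.
        self.pending = {}
        self._next_slave = itertools.cycle(range(slave_count))

    def write(self, key, value):
        # Write to the master and to the middle tier at the same time.
        self.master[key] = value
        self.pending[key] = value

    def read(self, key):
        # The middle tier is always up to date, so it is checked first.
        if key in self.pending:
            return self.pending[key]
        # Otherwise read from the slaves in round-robin order.
        slave = self.slaves[next(self._next_slave)]
        return slave.get(key)

    def replicate(self):
        """Simulate replication catching up, then clear confirmed entries."""
        for slave in self.slaves:
            slave.update(self.master)
        # Monitoring step: drop entries that every slave now holds.
        for key in list(self.pending):
            if all(s.get(key) == self.pending[key] for s in self.slaves):
                del self.pending[key]
```

Before `replicate()` runs, a read of a freshly written key is answered from the middle tier; afterwards it is answered from a slave.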
4. Queues and group commits
- When too many concurrent writes hit the master at once, the database may crash. In this case we can queue the requests and allow only a fixed number of tasks to run at the same time. This may delay responses, but a delayed response is better than a crash. Queues are briefly introduced later;
- Group commits. We can merge a batch of write requests into one commit. For example, when several requests each reduce the stock of the same item, we can add up the amounts and subtract them in a single commit. We can also run several requests over the same database connection, because opening and closing connections is also expensive. The batch size needs to be kept under control.
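The stock example from the group commit point could be sketched like this with `sqlite3`. The `stock` table and the function names are hypothetical, and the sketch leaves out the check that stock does not drop below zero.

```python
import sqlite3
from collections import Counter


def make_stock_db():
    """Create an in-memory database with a hypothetical `stock` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE stock (item_id INTEGER PRIMARY KEY, quantity INTEGER)"
    )
    conn.executemany("INSERT INTO stock VALUES (?, ?)", [(1, 100), (2, 50)])
    conn.commit()
    return conn


def group_commit(conn, requests):
    """Merge (item_id, amount) requests per item and commit them once.

    Returns the number of UPDATE statements that were actually issued.
    """
    totals = Counter()
    for item_id, amount in requests:
        totals[item_id] += amount
    # One connection, one transaction, one UPDATE per distinct item.
    with conn:
        conn.executemany(
            "UPDATE stock SET quantity = quantity - ? WHERE item_id = ?",
            [(amount, item_id) for item_id, amount in totals.items()],
        )
    return len(totals)
```

Four requests that touch two items turn into two updates in a single commit, which is where the saving comes from.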
5. Caching
When data is read often and updated rarely, and some staleness is acceptable, we can put it in a cache. Caching can greatly reduce the load on the database, since reading from memory is orders of magnitude faster than reading from disk. Caching is covered further below.
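A minimal sketch of this read pattern, assuming an in-process dictionary as the cache and a caller-supplied `loader` function standing in for the database query (both names are made up for the example):

```python
import time


class TTLCache:
    """Cache values for a fixed time; reload from the source after expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the expiry logic can be tested
        self.store = {}  # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self.clock()
        entry = self.store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: the database is not touched
        value = loader(key)  # cache miss or expired: go to the database
        self.store[key] = (now + self.ttl, value)
        return value

    def invalidate(self, key):
        """Drop a key, for example right after the underlying row is updated."""
        self.store.pop(key, None)
```

The `ttl_seconds` value is the "certain delay" the data is allowed to have: a longer value means fewer database reads and staler results.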
6. NoSQL
There are already countless NoSQL solutions, such as MongoDB, which is widely used in China: http://www.cnblogs.com/huangxincheng/archive/2012/02/18/2356595.html. Combining a relational database with a NoSQL database can effectively reduce service latency.
II. Other related issues
1. Problems and solutions for distributed applications
To withstand a large number of concurrent requests, a large website will certainly load-balance its web servers. There are many ways to do this, such as an Nginx reverse proxy, DNS load balancing, and IP load balancing:
http://www.open-open.com/lib/view/open1416924842581.html
http://blog.sina.com.cn/s/blog_493a84550102vjlq.html
http://www.cnblogs.com/edisonchou/p/4126742.html
http://www.cnblogs.com/edisonchou/p/4281978.html
A distributed system has at least 4 problems to solve:
1). Locks. When the site is deployed on different machines and a piece of business logic may be executed by only one thread at a time, we need a distributed lock. As mentioned earlier, a flood of row locks, table locks, and transactions puts great pressure on the database. One way to take the locking burden off the database is ZooKeeper's distributed lock service: http://www.cnblogs.com/shanyou/archive/2012/09/22/2697818.html, http://www.cnblogs.com/shanyou/p/3221990.html
2). Configuration. A site usually has some configuration information, and maintaining it becomes troublesome when the site is deployed in several places. ZooKeeper also provides configuration management for this.
3). Files. A site usually lets users upload pictures, and managing a large number of pictures is another problem to solve. There are many distributed file systems; one option is FastDFS: http://www.cnblogs.com/lori/p/3142598.html
4). Distributed cache. There are many distributed cache solutions:
- Memcached http://www.cnblogs.com/HCCZX/archive/2012/12/23/2829645.html
- Redis http://www.cnblogs.com/yangecnu/p/Introduct-Redis-in-DotNET-Part2.html
- SSDB http://www.cnblogs.com/shanyou/p/3496163.html
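The contract that a distributed lock in 1) has to offer can be sketched in a single process. This is not the ZooKeeper API: `LeaseLockService` and its methods are hypothetical, and the time-limited lease only approximates ZooKeeper, where a lock is tied to the client's session rather than to a fixed timeout.

```python
import threading
import time


class LeaseLockService:
    """In-process stand-in for an external lock service."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so lease expiry can be tested
        self._guard = threading.Lock()
        self._leases = {}  # lock name -> (owner, expires_at)

    def acquire(self, name, owner, ttl_seconds):
        """Try to take the lock; return False if another owner holds it."""
        with self._guard:
            now = self._clock()
            lease = self._leases.get(name)
            if lease is not None and lease[1] > now and lease[0] != owner:
                return False
            # Free, expired, or already ours: (re)grant the lease.
            self._leases[name] = (owner, now + ttl_seconds)
            return True

    def release(self, name, owner):
        """Release the lock, but only if this owner still holds it."""
        with self._guard:
            lease = self._leases.get(name)
            if lease is not None and lease[0] == owner:
                del self._leases[name]
                return True
            return False
```

The lease expiry matters because a web server that crashes while holding the lock would otherwise block every other server forever.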
2. Distributed message queues
When a site has a backlog of tasks waiting to be processed, such as sending text messages, sending mail, or deferred database operations, we generally handle them with a queue. Several queue components are described below.
ActiveMQ is a popular and powerful open source message bus from Apache. It is a JMS provider that fully supports the JMS 1.1 and J2EE 1.4 specifications, and clients exist for many languages and protocols. Languages: Java, C, C++, C#, Ruby, Perl, Python, PHP. Application protocols: OpenWire, STOMP, REST, WS-Notification, XMPP, AMQP. Home page: http://activemq.apache.org/ C# client: http://activemq.apache.org/nms/
Beanstalkd is a simple, fast message queue. Beanstalkd is to RabbitMQ what Nginx is to Apache and Varnish is to Squid. After using beanstalkd in a project, I found it simple, lightweight, fast, and easy to use, with features such as priorities, multiple queues, persistence, distributed fault tolerance, and timeout control. Client libraries exist for many programming languages, including C#: https://github.com/kr/beanstalkd/wiki/client-libraries
Kafka is a high-throughput distributed publish-subscribe messaging system: http://kafka.apache.org/documentation.html C# client: https://github.com/Jroland/kafka-net
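Whichever component is chosen, the basic pattern is the same: producers put tasks on a queue and a fixed number of workers take them off. The sketch below shows that pattern with Python's standard `queue` and `threading` modules inside one process; it does not use any of the brokers above, and `run_workers` and its parameters are hypothetical names.

```python
import queue
import threading

_STOP = object()  # sentinel telling a worker that no more tasks will arrive


def run_workers(tasks, handler, worker_count=2, max_backlog=100):
    """Process tasks through a bounded queue with a fixed number of workers.

    Returns (results, failures); the order of results is not guaranteed.
    """
    backlog = queue.Queue(maxsize=max_backlog)
    results, failures = [], []
    guard = threading.Lock()

    def worker():
        while True:
            task = backlog.get()
            if task is _STOP:
                return
            try:
                outcome = handler(task)
            except Exception as exc:  # keep the worker alive on bad tasks
                with guard:
                    failures.append((task, exc))
            else:
                with guard:
                    results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for task in tasks:
        backlog.put(task)  # blocks while the backlog is full
    for _ in threads:
        backlog.put(_STOP)
    for thread in threads:
        thread.join()
    return results, failures
```

`worker_count` caps how many tasks run at the same time, so a burst of requests is delayed instead of overwhelming the database or the mail server behind `handler`.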
Building large-scale website related technologies