With the widespread adoption of Internet applications, storing and accessing massive amounts of data has become a bottleneck in system design. For a large-scale Internet application, millions or even hundreds of millions of page views (PV) per day put a considerable load on the database and create serious problems for the stability and scalability of the system.
First, load balancing technology
A load-balancing cluster consists of a set of independent computer systems connected by an ordinary or private network and joined together through routers. The nodes cooperate with one another, share the load, and even out the pressure. To the client, the whole cluster looks like a single server with very high performance.
1. Implementation principle
To load balance a database, we first need a control side that manages connections to the databases. This control side cuts the direct connection between the programs and the databases: all programs access a middle tier, and the middle tier accesses the databases. Because the middle tier decides which database each connection goes to, it can apply an effective balancing strategy based on the current load of each database.
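The middle tier described above can be sketched in a few lines. This is only an illustration under assumptions: the article does not name a balancing strategy, so the sketch uses least-connections, and `DatabaseNode`, `MiddleTier`, and the server names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DatabaseNode:
    name: str                    # hypothetical server name
    active_connections: int = 0  # stand-in for the "current load" of the database


class MiddleTier:
    """Sits between the programs and the databases and assigns each connection."""

    def __init__(self, nodes):
        if not nodes:
            raise ValueError("at least one database node is required")
        self.nodes = list(nodes)

    def acquire(self):
        # Least-connections strategy (an assumption): pick the least loaded node.
        node = min(self.nodes, key=lambda n: n.active_connections)
        node.active_connections += 1
        return node

    def release(self, node):
        # Called when a program is done with its connection.
        node.active_connections -= 1


tier = MiddleTier([DatabaseNode("db_a", 3), DatabaseNode("db_b", 1)])
first = tier.acquire()   # db_b has the lowest load
second = tier.acquire()  # db_b now has 2 connections, still fewer than db_a
third = tier.acquire()   # tie at 3 connections; min() keeps the first node, db_a
```

Because programs only ever talk to `MiddleTier`, the strategy inside `acquire` could be swapped (for example, for a weighted one) without changing the programs.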
2. Synchronizing data across multiple databases
For load balancing, the most important requirement is that the data on all servers is synchronized in real time. This is essential for a cluster: if the data is out of sync, a user reading from one server sees different data from a user reading from another server, which is not acceptable. Data synchronization between the databases must therefore be implemented, so that queries can be served by multiple servers and the load can be balanced. A commonly used approach is a Moebius for SQL Server cluster. Moebius for SQL Server hosts its core program, called the Moebius for SQL Server middleware, inside the database on each machine. The middleware's main role is to monitor data changes in the database and synchronize the changed data to the other databases. The client receives its response only after synchronization has completed. Synchronization to the other databases runs concurrently, so synchronizing to several databases takes about as long as synchronizing to one, and it runs within the transaction, which keeps the data consistent across the databases at all times. Because the middleware is hosted inside the database, it knows not only which data changed but also which SQL statement caused the change, and it chooses a synchronization strategy according to the type of SQL statement to keep the cost of synchronization as low as possible:
When there are few rows and the content is small, the data is synchronized directly.
When there are few rows but they contain large data types, such as text or binary data, the data is compressed before it is synchronized, which reduces network bandwidth consumption and transmission time.
When there are many rows, the middleware obtains the SQL statement that caused the change, parses it, analyzes its execution plan and execution cost, and then chooses whether to synchronize the data or the SQL statement to the other databases. This is useful when the table structure is adjusted or data is changed in bulk.
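The three cases above amount to a small decision procedure. The following sketch is not Moebius code; it only illustrates the decision, and the thresholds (`MANY_ROWS`, `LARGE_PAYLOAD_BYTES`), the `Change` type, and the cost fields are all hypothetical, since the article gives no concrete numbers.

```python
from dataclasses import dataclass
from enum import Enum


class SyncStrategy(Enum):
    DIRECT_DATA = "synchronize the data directly"
    COMPRESSED_DATA = "compress the data, then synchronize"
    SQL_STATEMENT = "synchronize the SQL statement"


# Hypothetical thresholds; the source does not state any.
MANY_ROWS = 10_000
LARGE_PAYLOAD_BYTES = 1_000_000


@dataclass
class Change:
    row_count: int
    payload_bytes: int
    data_sync_cost: float = 0.0       # assumed estimate for shipping the rows
    statement_sync_cost: float = 0.0  # assumed estimate for replaying the SQL


def choose_strategy(change):
    # Many rows: compare the estimated costs and replay the statement if cheaper.
    if change.row_count >= MANY_ROWS:
        if change.statement_sync_cost < change.data_sync_cost:
            return SyncStrategy.SQL_STATEMENT
    # Large content (text, binary): compress before sending.
    if change.payload_bytes >= LARGE_PAYLOAD_BYTES:
        return SyncStrategy.COMPRESSED_DATA
    # Small change: send the data as it is.
    return SyncStrategy.DIRECT_DATA


small = choose_strategy(Change(row_count=10, payload_bytes=2_000))
blob = choose_strategy(Change(row_count=10, payload_bytes=5_000_000))
bulk = choose_strategy(Change(1_000_000, 50_000_000,
                              data_sync_cost=90.0, statement_sync_cost=2.0))
```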
3. Advantages and Disadvantages
Advantages:
(1) Strong scalability: when the system needs higher database processing speed, it can be scaled out simply by adding database servers.
(2) Maintainability: when a node fails, the system automatically detects the failure and moves that node's applications to other nodes, so the database keeps working without interruption.
(3) Security: because the data is synchronized to multiple servers, the data set is redundant, and the multiple copies protect the data. In addition, the database can be placed on the internal network, which better protects it.
(4) Ease of use: the cluster is completely transparent to the application and exposes a single IP address.
Disadvantages:
(1) The load cannot be allocated according to the processing capacity of the Web server.
(2) A failure of the load balancer (the control side) can paralyze the entire database system.
Second, database read/write separation
1. Implementation principle: Put simply, read/write separation means sending read operations and write operations to different database servers, which effectively reduces the pressure on the database and also relieves I/O pressure. The primary database handles writes and the secondary databases handle reads; in many systems, reads make up most of the workload. When data is written to the primary database, it is synchronized to the secondary databases, which effectively preserves the integrity of the data.
(eBay's read/write separation; eBay's read-to-write ratio is 260:1)
(Microsoft Database Distribution)
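The read/write separation described above can be sketched from the application's side as a small router. This is a simplified illustration, not a production design: `ReadWriteRouter` and the server names are hypothetical, and classifying a statement by its first keyword is an assumption that ignores cases such as `SELECT ... INTO` or stored procedures that write.

```python
import itertools


class ReadWriteRouter:
    """Sends writes to the primary and spreads reads over the read-only servers."""

    def __init__(self, primary, replicas):
        if not replicas:
            raise ValueError("at least one read-only server is required")
        self.primary = primary
        self._replicas = itertools.cycle(replicas)  # simple round-robin

    def route(self, sql):
        words = sql.split()
        if not words:
            raise ValueError("empty SQL statement")
        # Assumption: only statements starting with SELECT are treated as reads.
        if words[0].upper() == "SELECT":
            return next(self._replicas)
        return self.primary


router = ReadWriteRouter("primary", ["replica_1", "replica_2"])
targets = [
    router.route("SELECT * FROM orders"),
    router.route("select name FROM users"),
    router.route("INSERT INTO orders (id) VALUES (1)"),
    router.route("SELECT 1"),
]
```

Note that a row written to the primary becomes visible on a read-only server only after it has been synchronized, so a read issued immediately after a write may not see that row.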
2. Implementation method: In MS SQL Server, database replication can be set up by defining publications, and replication can be used to implement read/write separation. Replication is a technology that copies a set of data from one data source to multiple data sources, and it is an effective way to publish a copy of the data to multiple storage sites. With replication, users can publish a piece of data to multiple servers. Replication keeps the data consistent by automatically synchronizing updates across the different locations. SQL Server provides three types of replication: snapshot replication, transactional replication, and merge replication. SQL Server handles replication mainly through publications and subscriptions. The server that holds the source data is the Publisher, which is responsible for publishing the data. The Publisher copies all changes to the published data to the Distributor, which holds a distribution database that receives and stores those changes and then distributes them to the Subscribers.
3. Advantages and Disadvantages
(1) Poor real-time behavior: data is not synchronized to the read-only servers in real time. After data is written to the primary server, it can be queried only after the next synchronization.
(2) Poor synchronization efficiency with large data volumes: when a single table holds too much data, performance degrades because of indexing, disk I/O, and similar problems.
(3) Multiple (at least two) databases must be connected at the same time: the application connects to at least two databases, and the decision about where reads and writes go is made in program code, which easily causes confusion.
(4) High reliability and scalability for reads: because a read-only server has no write operations, performance problems such as disk I/O are greatly reduced and efficiency improves considerably. Read-only servers can be load balanced, and the primary database can publish to multiple read-only servers to scale out read operations.
Third, database/data table split (distributed)
Data that was stored in a single database is distributed across multiple databases according to some condition, and routing rules determine which database to access. Each request then goes to one of N servers rather than to a single server, which reduces the load on any one machine. Tip: SQL Server 2005 and later versions support table partitioning natively.
Vertical split: splits by functional module, for example an order database, a product database, a user database, and so on. With this approach, the table structures differ between the databases.
Horizontal split: divides the data of a single table into blocks and stores them in different databases whose table structures are identical.
(Split vertically)
(Split horizontally)
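For a vertical split, one simple routing rule is a lookup from table name to database. The sketch below only illustrates the idea; the table names and database names in `TABLE_TO_DATABASE` are hypothetical.

```python
# Hypothetical mapping: each functional module lives in its own database.
TABLE_TO_DATABASE = {
    "orders": "order_db",
    "order_items": "order_db",
    "products": "product_db",
    "users": "user_db",
}


def database_for_table(table_name):
    """Return the database that holds the given table."""
    try:
        return TABLE_TO_DATABASE[table_name.lower()]
    except KeyError:
        raise LookupError(f"no database configured for table {table_name!r}") from None


order_target = database_for_table("orders")
user_target = database_for_table("Users")
```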
1. Implementation principle: Whether to use vertical splitting depends mainly on whether the application suits it. If the system can be divided into clearly separated business systems, such as an order system, a product management system, and a user management system, vertical splitting spreads the database pressure well. A system whose business modules are unclear and whose tables are tightly coupled (many table associations) is not suited to this kind of split. Vertical splitting does not solve every pressure problem, however. Suppose there is an order table with 50 million rows: the pressure on the order database is still very high. When we insert a new row into this table, the database must update the table's indexes after the insert completes, and the cost of maintaining an index over 50 million rows is not negligible. If instead we divide the table into 100 tables, table_001 through table_100, the 50 million rows average out to only 500,000 rows per sub-table. Inserting into a table with only 500,000 rows reduces the indexing time by an order of magnitude, which greatly improves the runtime efficiency of the database and increases its concurrency. This is horizontal splitting.
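The arithmetic in this example, and one possible way of choosing a sub-table, can be written out as follows. The table names table_001 to table_100 come from the text; assigning rows by `order_id` modulo the table count is an assumption made for illustration, since the text does not say how rows are assigned.

```python
TOTAL_ROWS = 50_000_000  # the 50-million-row order table from the text
TABLE_COUNT = 100        # table_001 through table_100

# Average number of rows per sub-table after the split.
rows_per_table = TOTAL_ROWS // TABLE_COUNT


def sub_table_for(order_id):
    """Map an order id to a sub-table name (assumed rule: id modulo table count)."""
    index = order_id % TABLE_COUNT + 1  # 1..100
    return f"table_{index:03d}"


example = sub_table_for(12345)  # 12345 % 100 = 45, so the 46th table
```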
2. Implementation method: Vertical splitting is relatively simple to implement: different databases are accessed according to the table name. Horizontal splitting has many possible rules; the main ones are summarized here.
(1) Sequential split: for example, orders can be split by year according to the order date, with 2003 in DB1, 2004 in DB2, and so on. You can also split on the primary key.
Advantage: data can be partially migrated.
Disadvantage: data is distributed unevenly; for example, there may be 1 million orders for 2003 and 5 million for 2008.
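A sequential split by year reduces to simple arithmetic on the order date. The sketch below follows the mapping given in the text (2003 in DB1, 2004 in DB2, and so on); the function name and the handling of earlier years are assumptions.

```python
from datetime import date

FIRST_YEAR = 2003  # per the text, 2003 orders live in DB1


def database_for_order(order_date):
    """Return the database for an order, one database per calendar year."""
    if order_date.year < FIRST_YEAR:
        # Assumption: orders before the first year are not expected.
        raise ValueError(f"no database configured for year {order_date.year}")
    return f"DB{order_date.year - FIRST_YEAR + 1}"


db_2003 = database_for_order(date(2003, 5, 17))
db_2004 = database_for_order(date(2004, 1, 1))
db_2008 = database_for_order(date(2008, 12, 31))
```

Because each year maps to exactly one database, an old year can be migrated or archived on its own, which is the "partially migrated" advantage noted above.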
(2) Hash modulus: hash user_id (if user_id is numeric, its value can be used directly), then take the result modulo a specific number. For example, if the application needs to split one database into 4 databases, we take the hash value of user_id modulo 4, that is, user_id % 4. Each operation then has four possible results: 1 corresponds to DB1, 2 corresponds to DB2, 3 corresponds to DB3, and 0 corresponds to DB4. This distributes the data very evenly across the 4 databases.
Advantage: data is distributed evenly.
Disadvantages: data migration is troublesome, and data cannot be allocated according to machine performance.
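The modulo rule above (1 to DB1, 2 to DB2, 3 to DB3, 0 to DB4) can be sketched directly. The function names are hypothetical, and using CRC32 as the hash for non-numeric IDs is an assumption; it is chosen here only because, unlike Python's built-in `hash()` for strings, it gives the same value in every process.

```python
import zlib
from collections import Counter

DATABASE_COUNT = 4


def database_for_user(user_id):
    """Numeric user_id: use the value directly, as the text suggests."""
    remainder = user_id % DATABASE_COUNT
    # Remainders 1, 2, 3 map to DB1..DB3; remainder 0 maps to DB4.
    return f"DB{remainder}" if remainder else f"DB{DATABASE_COUNT}"


def database_for_user_name(user_name):
    """Non-numeric id: hash it first (CRC32 is an assumed, stable choice)."""
    return database_for_user(zlib.crc32(user_name.encode("utf-8")))


# Sequential ids 1..1000 spread evenly over the 4 databases.
distribution = Counter(database_for_user(uid) for uid in range(1, 1001))
```

The same rule shows why migration is awkward: changing `DATABASE_COUNT` changes the remainder for most existing IDs, so most rows would have to move.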
(3) Save the database configuration in an authentication database
Set up a separate DB that stores the mapping from user_id to DB. On each access, query this DB first to obtain the specific DB information, and then perform the required query against that DB.
Advantages: very flexible; one-to-one mapping.
Disadvantage: an extra lookup is required before every query, which causes some performance loss.
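The lookup step can be sketched with an in-memory SQLite database standing in for the separate mapping DB. Everything named here is hypothetical: the `user_db_map` table, its columns, and the sample rows are illustrative only, and a real deployment would keep the mapping in SQL Server rather than SQLite.

```python
import sqlite3


def build_mapping_db():
    """Create a stand-in mapping database with a few sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE user_db_map ("
        "user_id INTEGER PRIMARY KEY, db_name TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO user_db_map (user_id, db_name) VALUES (?, ?)",
        [(1001, "DB1"), (1002, "DB3"), (1003, "DB1")],
    )
    conn.commit()
    return conn


def lookup_database(mapping_conn, user_id):
    """First query: find which database holds this user's data."""
    row = mapping_conn.execute(
        "SELECT db_name FROM user_db_map WHERE user_id = ?", (user_id,)
    ).fetchone()
    if row is None:
        raise LookupError(f"no database mapping for user_id {user_id}")
    return row[0]


mapping = build_mapping_db()
target = lookup_database(mapping, 1002)  # the real query would then go to this DB
```

The call to `lookup_database` is the extra query that the disadvantage above refers to; it runs before every real query.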
This article is from http://blog.csdn.net/dinglang_2009/ and http://www.cnblogs.com/dinglang/. Please credit the source when reprinting.
SQL Server Database large Application Solution Summary