With the widespread adoption of Internet applications, storing and accessing massive amounts of data has become a bottleneck in system design. For a large Internet application, millions or even billions of page views (PV) per day place a considerable load on the database, which creates serious problems for the stability and scalability of the system.
First, load balancing technology
A load-balanced cluster is composed of a set of independent computer systems connected by a regular network or a private network and joined together by routers. The nodes cooperate with one another, share the load, and balance the pressure. To the client, the whole cluster can be regarded as a single stand-alone server with very high performance.
1. Principle of realization
To implement load balancing for a database, we first need a control end that manages the connections to the databases. It cuts the direct connection between the programs and the databases: all programs access a middle tier, and the middle tier accesses the databases. In this way, we can control exactly which database is accessed, and we can apply an effective balancing strategy based on the current load of each database to decide which database each connection goes to.
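The middle-tier idea can be sketched as follows. This is a minimal illustration, assuming a "least active connections" balancing strategy; the class names `Backend` and `MiddleTier` and the server names are hypothetical, and a real middle tier would also handle health checks and real network connections.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """One database server behind the middle tier (hypothetical name)."""
    name: str
    active_connections: int = 0


class MiddleTier:
    """Sends each new connection to the backend with the fewest active connections."""

    def __init__(self, backends):
        self.backends = list(backends)

    def acquire(self) -> Backend:
        # min() returns the first backend with the lowest count, so ties
        # go to the backend listed first.
        backend = min(self.backends, key=lambda b: b.active_connections)
        backend.active_connections += 1
        return backend

    def release(self, backend: Backend) -> None:
        # Called when the program is done with its connection.
        backend.active_connections -= 1


tier = MiddleTier([Backend("db1"), Backend("db2")])
first = tier.acquire()
second = tier.acquire()
tier.release(first)
third = tier.acquire()
```

Because programs only ever talk to `MiddleTier`, the balancing strategy can be changed without touching application code.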
2. Synchronizing data across multiple databases
For load balancing, the most important thing is that the data on all servers is synchronized in real time. This is essential for a cluster: if the data is not synchronized in real time, or differs between servers, then the data a user reads from one server will differ from the data read from another, which is not acceptable. Data synchronization between the databases must therefore be implemented, so that queries can be balanced across multiple servers.
A fairly common approach is Moebius for SQL Server. A Moebius for SQL Server cluster uses a core program, the Moebius for SQL Server middleware, that resides in the database on each machine. Its main role is to monitor data changes in its database and synchronize the changed data to the other databases. The client receives its response only after synchronization has completed. Synchronization to the other databases runs concurrently, so synchronizing to multiple databases takes about as long as synchronizing to one. Synchronization also runs inside a transaction, which guarantees that the data is consistent at any point in time. Because the Moebius middleware is hosted inside the database, it knows not only which data changed but also which SQL statement caused the change, and it chooses a synchronization strategy according to the type of SQL statement in order to minimize the cost of synchronization:
(1) A small number of rows with small content: the data is synchronized directly.
(2) A small number of rows that contain large data types such as text or binary data: the data is compressed before it is synchronized, which reduces network bandwidth consumption and transmission time.
(3) A large number of rows: the middleware obtains the SQL statement that caused the change, parses it, analyzes its execution plan and execution cost, and chooses whether to synchronize the data or to replay the SQL statement on the other databases. This is useful when adjusting a table structure or changing data in batches.
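The three cases above amount to a small decision procedure, sketched below. This is not Moebius code: the thresholds `MANY_ROWS` and `LARGE_PAYLOAD_BYTES`, the cost parameters, and the function names are all assumptions made for illustration, since the source gives no concrete numbers.

```python
import zlib

# Hypothetical thresholds; the source does not state concrete values.
MANY_ROWS = 1000
LARGE_PAYLOAD_BYTES = 64 * 1024


def choose_sync_strategy(row_count, payload_bytes, data_cost=0.0, sql_cost=0.0):
    """Pick a synchronization strategy for one data change."""
    if row_count >= MANY_ROWS:
        # Many rows: compare the estimated cost of shipping the data with
        # the estimated cost of replaying the SQL statement elsewhere.
        return "replay_sql" if sql_cost < data_cost else "sync_data"
    if payload_bytes >= LARGE_PAYLOAD_BYTES:
        # Few rows but large values (text, binary): compress first.
        return "compress_then_sync"
    # Few rows, small content: send as-is.
    return "sync_directly"


def pack_payload(payload: bytes, strategy: str) -> bytes:
    """Compress the payload only when the chosen strategy asks for it."""
    if strategy == "compress_then_sync":
        return zlib.compress(payload)
    return payload
```

A batch update that touches many rows would typically have a low `sql_cost` relative to `data_cost`, which is why replaying the statement is attractive in that case.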
3. Advantages and disadvantages
Advantages:
(1) Strong scalability: when the system needs higher database processing speed, it can be scaled out simply by adding database servers.
(2) Maintainability: when a node fails, the system automatically detects the fault and moves the failed node's workload to other nodes, so the database keeps working.
(3) Security: because the data is synchronized to multiple servers, the data set is redundant, and the multiple copies protect the data. In addition, the database can be placed on the intranet, which protects it better.
(4) Ease of use: the cluster is fully transparent to the application, which sees only a single IP address.
Disadvantages:
(1) The load cannot be allocated according to the processing capacity of the Web server.
(2) If the load balancer (control end) fails, the entire database system is paralyzed.
Second, database read/write separation
1. Implementation principle: put simply, read/write separation sends the read and write operations on the database to different database servers, which effectively reduces the pressure on the database and also reduces I/O pressure. The primary database handles write operations and the secondary databases handle read operations; in many systems, most operations are reads. When data is written to the primary database, it is synchronized to the secondary databases, which effectively preserves the integrity of the data.
(eBay's read/write ratio is 260:1; eBay's read/write separation)
(Microsoft Database Distribution)
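A minimal sketch of the routing side of read/write separation, assuming that any statement starting with SELECT is a read and that reads are spread round-robin over the secondary databases. The class name `ReadWriteRouter` and the server names are hypothetical.

```python
import itertools


class ReadWriteRouter:
    """Sends writes to the primary and spreads reads over the secondaries."""

    def __init__(self, primary, secondaries):
        self.primary = primary
        # cycle() yields the secondaries in round-robin order forever.
        self._secondaries = itertools.cycle(secondaries)

    def route(self, sql: str) -> str:
        # Simplification: anything that is not a SELECT counts as a write.
        is_read = sql.lstrip().upper().startswith("SELECT")
        return next(self._secondaries) if is_read else self.primary


router = ReadWriteRouter("primary", ["secondary1", "secondary2"])
```

Keeping this decision in one place, rather than scattered through program code, avoids the confusion that comes from each caller choosing its own connection.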
2. Implementation method: in MS SQL Server, database replication can be set up by defining publications and subscriptions, which achieves read/write separation. Replication is a technology for copying a set of data from one data source to multiple data sources, and it is an effective way to distribute copies of data to multiple storage sites. With replication, users can publish a piece of data to multiple servers. Replication keeps the data consistent by automatically synchronizing and updating it at the different locations. SQL Server has three types of replication: snapshot replication, transactional replication, and merge replication. SQL Server handles replication mainly through publications and subscriptions. The server that holds the source data is the Publisher and is responsible for publishing the data. The Publisher sends all changes to the published data to the Distributor, which contains a distribution database that receives the changes, stores them, and distributes them to the Subscribers.
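The flow from Publisher to Distributor to Subscribers can be illustrated with a toy in-memory model. This is not the SQL Server replication API; the classes and the dictionary-based subscribers below are hypothetical stand-ins that only show how changes are stored and then forwarded.

```python
class Distributor:
    """Stores changes from the publisher and forwards them to subscribers."""

    def __init__(self):
        self.pending = []        # stands in for the distribution database
        self.subscribers = []    # each subscriber is a plain dict here

    def receive(self, key, value):
        self.pending.append((key, value))

    def distribute(self):
        # Apply every stored change to every subscriber, in order.
        for key, value in self.pending:
            for subscriber in self.subscribers:
                subscriber[key] = value
        self.pending.clear()


class Publisher:
    """Holds the source data and reports every change to the distributor."""

    def __init__(self, distributor):
        self.data = {}
        self.distributor = distributor

    def write(self, key, value):
        self.data[key] = value
        self.distributor.receive(key, value)


distributor = Distributor()
subscriber = {}
distributor.subscribers.append(subscriber)
publisher = Publisher(distributor)

publisher.write("order:1", "paid")
before_sync = dict(subscriber)   # snapshot taken before distribution runs
distributor.distribute()
```

The snapshot `before_sync` is empty because the subscriber only sees a change after the distributor has run, which is the source of the replication delay on read-only servers.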
3. Advantages and disadvantages
Advantages:
(1) High performance, reliability, and scalability for reads: because a read-only server performs no write operations, performance problems such as disk I/O are greatly reduced and efficiency improves considerably. Read-only servers can be load-balanced, and the primary database can publish to multiple read-only servers, which makes read operations scalable.
Disadvantages:
(1) Poor real-time data freshness: data written to the primary server is not synchronized to the read-only servers immediately, and it can be queried there only after the next synchronization.
(2) Poor synchronization efficiency for large amounts of data: when a single table receives a large volume of inserts and updates, performance becomes very poor because of index maintenance, disk I/O, and similar problems.
(3) Multiple (at least two) databases must be connected at the same time: the choice between the read and write databases is made in program code, which easily causes confusion.
Third, splitting databases and tables (distributed storage)
Based on certain conditions, the data stored in one database is distributed across multiple databases. Storage is distributed, and routing rules determine which specific database to access, so requests go to n servers rather than a single one, which reduces the load on any one machine. Tip: starting with SQL Server 2005, "table partitioning" is supported in a user-friendly way.
Vertical split: dividing by functional module, for example into an order database, a commodity database, a user database, and so on. With this approach the table structures in the different databases are different.
Horizontal split: dividing the rows of the same table into separate databases, where the table structure is exactly the same.
(Split vertically)
(Split horizontally)
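For a vertical split, routing can be as simple as a lookup from table name to database. The sketch below assumes a hypothetical mapping; the table and database names are invented for illustration.

```python
# Hypothetical table-to-database mapping for a vertical split.
TABLE_TO_DATABASE = {
    "orders": "order_db",
    "order_items": "order_db",
    "products": "commodity_db",
    "users": "user_db",
}


def database_for_table(table: str) -> str:
    """Return the database that holds the given table."""
    try:
        return TABLE_TO_DATABASE[table]
    except KeyError:
        # Fail loudly rather than guessing a database for an unknown table.
        raise ValueError(f"no database configured for table {table!r}") from None
```

Tables that are joined often, such as `orders` and `order_items` here, are kept in the same database so that the join does not cross servers.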
1. Implementation principle: whether to use a vertical split depends mainly on whether the type of application suits it. If the system can be divided into an order system, a commodity management system, and a user management system, and the business boundaries are relatively clear, a vertical split disperses the database pressure very well. A system whose business modules are not clearly separated and whose coupling (table associations) is high is not suited to this method. A vertical split also does not solve every pressure problem. For example, with an order table of 50 million rows, the pressure on the order database is still very large. When we insert a new row into this table, the database has to update the table's indexes after the insert, and the overhead of maintaining an index over 50 million rows is not negligible. Conversely, if we divide the table into 100 tables, from table_001 to table_100, the 50 million rows are spread out so that each child table holds only 500,000 rows. Inserting into a table of only 500,000 rows reduces the indexing time by an order of magnitude, which greatly improves the run-time efficiency of the DB and increases its concurrency. This is a horizontal split.
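The arithmetic in this example, and one possible way to pick a child table, can be sketched as follows. The modulo rule in `child_table` is an assumption; the text only says that the table is divided into table_001 through table_100.

```python
TOTAL_ROWS = 50_000_000
TABLE_COUNT = 100

# Rows per child table if the rows are spread evenly: 500,000.
rows_per_table = TOTAL_ROWS // TABLE_COUNT


def child_table(order_id: int) -> str:
    """Map an order id onto table_001 .. table_100 (assumed modulo rule)."""
    return f"table_{order_id % TABLE_COUNT + 1:03d}"
```

With sequential ids, a modulo rule spreads new inserts over all 100 tables instead of concentrating them on the newest one.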
2. Implementation method: a vertical split is relatively simple to implement: different databases are accessed according to the table name. A horizontal split can follow many rules; the main ones are summarized here.
(1) Sequential split: for example, orders can be split by year according to the order date, with 2003 in DB1, 2004 in DB2, and so on. You can also split by primary key.
Benefits: data can be migrated in parts.
Disadvantages: data is distributed unevenly; there may be 1 million orders for 2003 but 5 million for 2008.
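A minimal sketch of the sequential rule, assuming one database per calendar year starting with 2003 in DB1. The function name and the error handling are hypothetical.

```python
from datetime import date

FIRST_YEAR = 2003  # orders from 2003 go to DB1, 2004 to DB2, and so on


def database_for_order(order_date: date) -> str:
    """Pick the database for an order from the year of its order date."""
    if order_date.year < FIRST_YEAR:
        raise ValueError("order date is earlier than the first database")
    return f"DB{order_date.year - FIRST_YEAR + 1}"
```

Since each year lives in its own database, an old year can be archived or moved without touching the others, which is the partial-migration benefit.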
(2) Hash modulus: hash the user_id (or, if user_id is numeric, use the user_id value directly) and take the result modulo a specific number. For example, if the application needs to split one database into 4 databases, we take the hash value of user_id modulo 4, that is, user_id % 4. Each operation then has four possible results: a result of 1 corresponds to DB1, a result of 2 to DB2, a result of 3 to DB3, and a result of 0 to DB4. In this way the data is distributed very evenly across the 4 databases.
Benefits: data is evenly distributed.
Disadvantages: data migration is troublesome; data cannot be allocated according to machine performance.
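The hash modulus rule can be sketched as follows. Using CRC32 for non-numeric ids is an assumption made here for illustration; any hash that gives the same result on every machine and every run would do.

```python
import zlib

DATABASE_COUNT = 4


def database_for_user(user_id) -> str:
    """Map a user_id to DB1..DB4 with the modulo rule described above."""
    if isinstance(user_id, int):
        key = user_id  # numeric ids are used directly
    else:
        # crc32 is stable across processes, unlike Python's built-in
        # hash() for strings, which is randomized per process by default.
        key = zlib.crc32(str(user_id).encode("utf-8"))
    remainder = key % DATABASE_COUNT
    # Remainders 1, 2, 3 map to DB1..DB3; remainder 0 maps to DB4.
    return f"DB{remainder or DATABASE_COUNT}"
```

Changing `DATABASE_COUNT` changes the remainder for most ids, so most rows would have to move to a different database, which is why data migration is troublesome with this rule.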
(3) Saving the database configuration in the authentication database: create a separate DB that stores the mapping from user_id to DB. Each access to the database must first query this database to get the specific DB information before the actual query can be run.
Advantages: flexibility; one-to-one mapping.
Disadvantages: the extra query before every query causes some loss of performance.