From: http://database.csdn.net/page/b12503e6-9f14-4cc1-90dd-88632444a46e
In the Web 2.0 era, websites often face rapid growth in traffic, and the question is how our applications can keep up with users' demands. In practice, the performance bottleneck almost always sits at the database. This is not the database's fault: handling a huge volume of traffic puts heavy pressure on any database, whether a commercial product such as Oracle, MS SQL Server, or DB2, or an open-source one such as MySQL or PostgreSQL. The solution, however, is conceptually simple: distribute the data across different databases (physically or logically). This article discusses how to store data in such a distributed manner.
Currently, the main distributed storage methods split the data according to some rule, chiefly vertically or horizontally; the two can also be combined to reach a more appropriate segmentation granularity.
1. Vertical split: data is divided into separate databases by site service or product, for example user data, blog article data, photo data, tag data, group data, and so on, with each business getting an independent database or database server.
2. Horizontal split: all data is treated as one big pool, and the flat data is spread across different databases or database servers according to some key (such as the user name). This is the method discussed in this article.
This article mainly targets the open-source MySQL/PostgreSQL databases on a Linux/FreeBSD platform, using scripting languages such as PHP, Perl, Ruby, or Python, with web applications running on web servers such as Apache or Lighttpd. Storage of static files such as videos, images, CSS, and JS is another topic.
Note: the Node mentioned repeatedly below refers to a database node, either a physical database server or a single database; in general, a node is a database server with a Master/Slave structure. [Figure: the architecture of such a node]
I. Hash-based distribution
1. Introduction to hashing
Hash-based distributed storage relies on a primary key and a hash algorithm. For example, in a user-centric application the central entity is the user, so the Key can be the user ID, user name, email address, and so on (this value must be available everywhere on the site). This unique value is used as the Key, and hashing the Key distributes different users' data across different database nodes.
A simple example illustrates this: suppose an application uses the user ID as the Key and there are 10 database nodes. The simplest hash algorithm takes the Key modulo the number of nodes, and the remainder identifies the node: node = user ID % total number of nodes. The user whose ID is 125 lands on 125 % 10 = 5, that is, the node named 5. In the same spirit, a stronger and more reasonable hash algorithm can be constructed to distribute users evenly across the nodes.
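As a minimal sketch, here is the modulo scheme in Python (one of the scripting languages the article targets); NODE_COUNT and the function names are illustrative assumptions, and the string-key variant simply hashes a user name or email to an integer first:

```python
import hashlib

# Illustrative sketch of modulo-based node selection; names are assumptions.
NODE_COUNT = 10  # total number of database nodes

def node_for_key(user_id: int, node_count: int = NODE_COUNT) -> int:
    """node = user ID % total number of nodes."""
    return user_id % node_count

def node_for_string_key(key: str, node_count: int = NODE_COUNT) -> int:
    """For string keys (user name, email), hash to an integer first."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % node_count

print(node_for_key(125))                        # -> 5, the node named "5"
print(node_for_string_key("user@example.com"))  # some node in 0..9
```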
2. Expanding hash-based storage
Once a hash algorithm is fixed, keys are deterministically routed to their assigned nodes. But what if the current set of nodes no longer meets demand? That is the expansion problem, and its biggest cost is that the hash algorithm must be modified, and the data must be migrated or handled according to the new algorithm. The two main approaches are described below and sketched in code after the list.
(1) Expansion by migration: the hash algorithm is modified after nodes are added; for example, if the cluster grows from 10 nodes to 20, the algorithm becomes [mod 20]. A large amount of data that the old algorithm assigned to the existing nodes now hashes elsewhere, while the newly added nodes hold little, so the data is unbalanced. We can therefore recompute each Key under the new hash algorithm and migrate the affected data from its old node to its new node. The cost of doing this is relatively high and it adds instability; the advantage is that the data ends up fairly uniform and both new and old nodes are fully utilized.
(2) Fully dedicating the new nodes: after nodes are added, the hash algorithm sends all new data to the new nodes and no longer allocates data to the old ones, so there is no data-migration cost. The advantage is that you only need to modify the hash algorithm and can add nodes simply, without moving data. However, when querying, old Keys must use the old hash algorithm and newly added Keys the new one; otherwise the node holding the data cannot be found. The disadvantage is growing complexity in the hash algorithm: if nodes are added frequently, the algorithm becomes very complex and unmaintainable. The other drawback is that the old nodes cannot make full use of their resources, since they only retain old-Key data; this, too, has workable remedies.
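A hedged Python sketch of both expansion strategies; the node counts, the CUTOFF_ID boundary, and the helper names are assumptions made for illustration, not details from the article:

```python
# Strategy (1) and (2) from the list above, under illustrative assumptions.
OLD_NODE_COUNT = 10
NEW_NODE_COUNT = 20

def plan_migration(user_ids):
    """Strategy (1): recompute every Key under the new modulus and list
    the rows whose node assignment changed and must therefore move."""
    moves = []
    for uid in user_ids:
        old_node = uid % OLD_NODE_COUNT
        new_node = uid % NEW_NODE_COUNT
        if old_node != new_node:
            moves.append((uid, old_node, new_node))  # (key, from, to)
    return moves

# Strategy (2): old Keys keep the old algorithm; new Keys go only to the
# newly added nodes. CUTOFF_ID is the first ID issued after expansion
# (assumes monotonically increasing IDs).
CUTOFF_ID = 1_000_000

def node_for_key(user_id: int) -> int:
    if user_id < CUTOFF_ID:
        return user_id % OLD_NODE_COUNT  # old data stays on nodes 0..9
    return OLD_NODE_COUNT + user_id % (NEW_NODE_COUNT - OLD_NODE_COUNT)  # nodes 10..19
```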
To sum up, adding nodes under hash-based distribution is difficult and cumbersome, but there are many scenarios where it fits well, especially applications whose future data size can be predicted. Ordinary web sites, however, generally cannot predict their data volume.
II. Global node allocation
1. Introduction to global node allocation
The mapping between every Key and its database node is recorded and saved in a global table. When a node must be accessed, the global table is consulted first to find and locate the node. Global tables are generally stored in one of two ways:
(1) The mapping information is stored in a database node itself (MySQL/PostgreSQL) and can be accessed remotely. To ensure performance, a HEAP (MEMORY) table is used, or Memcached caches and accelerates the node lookups.
(2) A local file database such as BDB (BerkeleyDB) or DBM/GDBM/NDBM is used; lookup performance in these key => value hash databases is relatively high, and a cache such as APC or Memcached can accelerate it further.
The first storage method is easy to query (including remotely), but its performance is not great (a common problem of all relational databases). The second method is very fast for local lookups (especially in a hash database, where the time complexity is O(1)); its disadvantages are that it cannot be used remotely, data cannot be synchronized and shared among multiple machines, and data-consistency problems arise.
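As a minimal sketch of the second storage method, Python's standard dbm module can stand in for a BDB/DBM/GDBM/NDBM key => value database:

```python
import dbm

# Store the Key -> NodeID mapping in a local hash database.
with dbm.open("key_to_node", "c") as table:  # "c": create if missing
    table["125"] = "6"

with dbm.open("key_to_node", "r") as table:  # O(1)-style local lookup
    node_id = int(table["125"])
print(node_id)  # -> 6
```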
Let us describe the rough structure of an implementation. Suppose we have 10 database nodes and a global database stores the Key-to-node mapping. Assume the global database has a table named AllNode with two fields, Key and NodeID. Continuing the case above, the user ID is the Key, and there is a user with ID 125; the corresponding node is obtained by querying the table:
Key    NodeID
13     2
148    5
22     9
125    6
Having confirmed that the user with ID 125 lives on node 6, we can quickly locate that user and process the data.
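A small sketch of the global-table lookup; SQLite stands in here for the global MySQL/PostgreSQL database, and the data matches the table above:

```python
import sqlite3

# Simulate the global AllNode table (Key -> NodeID) with SQLite.
db = sqlite3.connect(":memory:")
db.execute('CREATE TABLE AllNode ("Key" INTEGER PRIMARY KEY, NodeID INTEGER)')
db.executemany('INSERT INTO AllNode VALUES (?, ?)',
               [(13, 2), (148, 5), (22, 9), (125, 6)])

def node_for_key(key: int) -> int:
    row = db.execute('SELECT NodeID FROM AllNode WHERE "Key" = ?', (key,)).fetchone()
    if row is None:
        raise KeyError(key)  # Key not yet allocated to any node
    return row[0]

print(node_for_key(125))  # -> 6
```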
[Figure: the distributed storage structure]
2. Expanding global node allocation
The global node allocation method also faces the expansion problem, but it anticipates this from the start: the design exists precisely to make expansion easy. There are two main expansion methods:
(1) Expanding the Key-to-node mapping by adding nodes
This is the most typical, simplest, and most machine-resource-saving expansion method. Generally each node is assigned a fixed quota of data; for example, each node stores 100,000 users: the first node stores users 0-100,000, the second users 100,000-200,000, the third users 200,000-300,000, and so on. When the user count reaches a node's quota, a node server is added, new Keys are allocated to the new node, and the mappings are recorded in the global table. Nodes can thus be added without limit, as sketched below. The problem: if users on the early nodes are accessed relatively rarely while users on the later nodes are accessed frequently, the load across the node servers is unbalanced; this, too, requires its own solution.
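A minimal sketch of this quota-based allocation, assuming the 100,000-users-per-node quota from the example above:

```python
# Each node holds a fixed range of user IDs; the quota is illustrative.
USERS_PER_NODE = 100_000

def allocate_node(user_id: int) -> int:
    """Node 1 holds IDs 0-99,999, node 2 holds 100,000-199,999, etc.
    The resulting (Key, NodeID) pair is recorded in the global table."""
    return user_id // USERS_PER_NODE + 1

print(allocate_node(125))      # -> 1
print(allocate_node(250_000))  # -> 3
```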
(2) Expanding by mapping Keys to nodes with a probability algorithm
This method, building on the existing nodes, assigns each node a probability of being allocated a Key; when a Key is allocated, a node is chosen according to these per-node probabilities. If the average data volume per node exceeds a specified proportion of capacity, for example 50%, a new node is added, and the new node's probability of receiving Keys is set higher than the old nodes'.
In general, the node allocation probabilities are also recorded in the database. For example, we let the probabilities total 100; with 10 nodes in all, each node's probability of receiving data is set to 10. The data table looks like this:
NodeID    Weight
1         10
2         10
3         10
Now a new node is added, and its probability of receiving a Key must exceed that of the old nodes, so the new node's probability must be computed. Let x be the probability of a single old node (all old nodes share the same probability) and y the probability of the new node. The formula is: 10x + y = 100, with y > x, which gives x in {1 ... 9} and y in {10 ... 90}. Within this range, choose y according to the probability requirements of the particular application.
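A sketch of weighted node selection under this formula; the particular solution x = 5, y = 50 is an illustrative choice, not prescribed by the article:

```python
import random

# 10 old nodes with weight x = 5 each, one new node with weight y = 50:
# 10*5 + 50 = 100 and 50 > 5, satisfying the formula above.
weights = {node_id: 5 for node_id in range(1, 11)}  # old nodes 1..10
weights[11] = 50                                    # the new node

def pick_node() -> int:
    """Choose a node with probability proportional to its weight."""
    nodes = list(weights)
    return random.choices(nodes, weights=list(weights.values()), k=1)[0]

# The chosen (Key, NodeID) pair is then recorded in the global table.
print(pick_node())
```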
III. Existing Problems
Now let us analyze the problems in the two distributed storage methods above, so that when designing an architecture we can avoid, or work around, these problems and shortcomings.
1. Problems with the hash and global allocation methods
(1) The hash method is inconvenient to expand: the hash algorithm must be modified and data migrated. Its advantage is that locating a node from a Key is very fast, with O(1) time complexity and essentially no database query, which saves response time.
(2) The most obvious problem of the global allocation method is the single point of failure: if the global database goes down, every application is affected. Another problem is query volume: every operation on a Key's node must first go through the global database, which is therefore under great pressure. The advantages are convenient expansion and simple node addition.
2. Search and statistics problems caused by distributed storage
(1) Search and statistics normally have to process all of the data, but after splitting, the data is scattered across the machines of different nodes, so global search and statistics cannot be performed directly. One solution: keep the primary basic data in a global table for easy search and statistics; this data should not grow large and should cover only some core fields.
(2) Use an in-site search engine to index and record all data; for example, use Lucene or another open-source indexing system to index everything for convenient search. For statistics, run non-real-time jobs in the background that traverse all nodes, although the efficiency is low; a sketch follows below.
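A minimal sketch of such background, cross-node aggregation; query_node is a hypothetical helper that executes a statement on a single node and returns a number:

```python
# Non-real-time statistics: traverse every node and aggregate the results.
def count_all_users(node_ids, query_node):
    return sum(query_node(n, "SELECT COUNT(*) FROM users") for n in node_ids)

# Example with a fake per-node backend standing in for real connections:
fake_counts = {1: 120, 2: 95, 3: 143}
print(count_all_users(fake_counts, lambda n, _sql: fake_counts[n]))  # -> 358
```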
3. Performance Optimization Problems
(1) The hash algorithm, node probabilities, and allocation logic can be implemented in a compiled language for performance and packaged as a library or as a PHP extension.
(2) When MySQL is used, a custom database connection pool can be built and loaded as an Apache module, with the various connection behaviors made configurable.
(3) Global data, or frequently accessed data, can be cached using APC, Memcached, DBM, BDB, shared memory, the file system, and other methods to reduce the access pressure on the database; a minimal sketch follows after this list.
(4) Use powerful data-handling mechanisms, such as MySQL 5 table partitioning or MySQL Cluster. In addition, in an actual architecture we recommend InnoDB as the primary storage engine and MyISAM for logs and statistical data, balancing safety, reliability, and speed.
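Relating to point (3) above, here is a minimal read-through cache sketch; a plain dict stands in for APC/Memcached/shared memory, and load_from_db is a hypothetical loader that queries the global database on a miss:

```python
# Read-through caching of Key -> NodeID lookups to reduce database pressure.
cache = {}

def cached_node_for_key(key, load_from_db):
    if key not in cache:
        cache[key] = load_from_db(key)  # cache miss: hit the database once
    return cache[key]

print(cached_node_for_key(125, lambda k: {125: 6}[k]))  # -> 6 (from the DB)
print(cached_node_for_key(125, lambda k: {125: 6}[k]))  # -> 6 (from the cache)
```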