1 The rise of database sharding
In recent years, database sharding has become more and more popular as the transaction volume and sheer size of business application databases have grown dramatically. The success of many online service providers, software-as-a-service (SaaS) providers, and social networking sites illustrates the trend.
Database sharding can be defined simply as a "shared-nothing" partitioning scheme for large databases across multiple servers, one that lifts database performance and scalability to a new level. Picture a pane of broken glass and you have the idea behind sharding (fragmentation): the database is broken into smaller pieces called "shards" and spread across a number of distributed servers.
The term "sharding" was coined by Google engineers and popularized through the publication of their Bigtable architecture¹. However, the idea of "shared-nothing" database partitioning has been around for a decade or more, and many implementations have appeared over that period, notably the well-known in-house solutions of leading online service providers such as eBay, Amazon, Digg, Flickr, Skype, YouTube, Friendster, and Wikipedia.
This article focuses on the requirements that drive the need for database sharding, the alternative approaches to database partitioning, and some key considerations for a successful sharding implementation.
1 Bigtable: A Distributed Storage System for Structured Data, by Fay Chang, Jeffrey Dean, and other Google engineers.
2 What drives the need for database sharding?
Database sharding is a highly scalable way to improve the throughput and overall performance of high-transaction, large, database-centric business applications. Since the advent of the relational database, it has been clear that business databases tend to grow over time, and that application engineers and designers demand continuous improvement in the performance and capacity of their databases. Moreover, the growth of the networked economy, the information age, and the pervasiveness of high-volume electronic commerce have caused business data to expand sharply, making the trend all the more pronounced.
As any experienced database administrator or application developer knows, it is axiomatic that as the data tier grows linearly in size and transaction volume, response times tend to grow logarithmically, as shown below:
Figure 1. Growth in database size and transaction volume has a significant impact on response time.
The root cause of the database performance and scalability challenge lies in the fundamental design of the database management system itself. On any computer, the database relies primarily on three components:
- CPU
- Memory
- Disk
Benchmarks show that any one of these components on a single server can scale only so far before other measures must be taken. Disk I/O is plainly the chief bottleneck, since even an improved database management system still makes heavy use of CPU and memory. In practice, we have observed that the maximum performance of a database is determined by how well these three factors are matched: you cannot raise the overall performance of the system simply by adding CPUs (or processing cores) without also increasing memory capacity or improving the disk-drive subsystem. Adding resources to a single database server clearly yields diminishing returns. These factors are most pronounced in mixed-use systems that execute a high volume of read and write transactions while also supporting broad business reporting tasks.
As business applications grow more sophisticated, the demand keeps growing, and architects, developers, and database administrators face a constant struggle to maintain database performance for their mission-critical systems. It is this landscape that drives the need for database sharding.
3 Alternative approaches to database partitioning
Database partitioning has long been known to improve the performance and scalability of relational databases. The techniques that have evolved to date include:
- Master/slave replication: This is the simplest option and one used by many organizations. A single master server handles all write operations (create, update, delete), while one or more read-only slave servers handle read operations. The master replicates data to each slave using standard, near-real-time replication. Master/slave mode can improve overall database performance to a degree by offloading read-intensive work to the slaves (a minimal routing sketch follows the list of limitations below), but it has the following drawbacks:
A single master server handles all writes, which imposes a hard scalability limit and quickly becomes a bottleneck.
Master/slave replication is "near real time", meaning the slaves are not guaranteed to hold an up-to-the-instant snapshot of the master. That is workable for some applications, but undesirable if yours requires the very latest data.
Many organizations also use master/slave replication for high availability, but they face the same limitation: master and slaves are never fully synchronized. If a catastrophic failure strikes the master, any transactions not yet replicated are lost, which is unacceptable for most commercial transaction applications.
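To make the read/write separation concrete, here is a minimal sketch of an application-side router (all names are ours, not from the article) that directs write statements to the master and rotates read statements across the slaves:

```python
import itertools

class ReadWriteRouter:
    """Route writes to the master; spread reads round-robin over the slaves."""

    def __init__(self, master_conn, slave_conns):
        self.master = master_conn
        self.slaves = itertools.cycle(slave_conns)  # read-only replicas

    def execute(self, sql, params=()):
        verb = sql.lstrip().split()[0].upper()
        if verb in ("INSERT", "UPDATE", "DELETE", "CREATE", "DROP"):
            conn = self.master        # every write funnels through one server
        else:
            conn = next(self.slaves)  # reads may see slightly stale data
        return conn.execute(sql, params)
```

The single `self.master` branch is exactly where the scalability limit described above lives: no matter how many slaves are added, all writes still queue on one machine.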
- Clustering: Multiple servers are organized as a compute cluster and share information through the cluster's nodes, in most cases depending on a centralized shared disk facility, typically a storage area network (SAN). Each node in the cluster runs a single instance of the database server, and clusters work in different modes:
For high availability, multiple nodes in the cluster can serve reads, but only one handles write (create, update, delete) operations. Reads can therefore be accelerated, while writes gain nothing. If one node fails, another node in the cluster takes over and continues working against the shared disk facility. This arrangement limits scalability because writes remain the bottleneck, and even reads eventually hit a ceiling: a centrally shared disk facility can carry only so much load before the performance gains taper off. The read limit becomes starkly evident when an application requires complex joins or runs poorly optimized SQL statements.
More advanced clustering techniques rely on real-time memory replication between nodes, using a real-time messaging system to keep the memory images of the cluster nodes synchronized. This lets each node operate in either read or write mode, but throughput is ultimately bounded by how much traffic can be carried between the nodes (over a typical network or other high-speed interconnect). As nodes are added, the communication and memory-replication overhead multiplies steeply, which severely limits scalability, often to a fairly small number of nodes. And this scheme is constrained by the same shared-disk limitation as a traditional cluster: a single large, growing database generates ever more intense disk reads and writes.
- Table partitioning: Many database management systems support table partitioning, whereby the data of a single large table can be spread across multiple disks to improve disk I/O utilization. The partitioning is usually horizontal (across sets of rows), though some systems also partition vertically (different columns placed on different partitions). The approach can ease the disk I/O bottleneck for a particular table, but it often slows joins and other operations, and because everything still runs within a single DBMS instance, contention for CPU and memory imposes the same limits on scalability.
- Federated tables: An offshoot of table partitioning is the federated table approach, in which tables can be accessed across multiple servers. It is very complex to administer and inefficient, since the federated tables must be reached over the network. It may suit some reporting or analytical work, but it is a poor choice for general read/write transactions.
The common shortcoming of all these approaches is their reliance on shared facilities and resources. Whether they depend on shared memory, centralized disk, or a shared processor, scalability is limited, to say nothing of the other drawbacks, including complex administration, lack of support for critical business requirements, and constrained high availability.
4 Database sharding, a shared-nothing approach
Database sharding offers a path to scalability across multiple independent servers, each with its own CPU, memory, and disk. Unlike the traditional methods of boosting database performance, it carries none of their typical limitations. Shared-nothing databases have been researched and explored for more than fifteen years, but only in recent years, with the enormous growth in application data volumes, has the broader market come to demand them.
The basic concept of database sharding is straightforward: take a large database and break it into a number of smaller databases across servers, as shown below:
Figure 2. Database sharding breaks one large database into several smaller databases.
The obvious advantage of shared-nothing sharding is greatly improved scalability: as servers are added to the network, capacity grows in a nearly linear fashion. But when weighing a sharding scheme, several other advantages of the smaller resulting databases should not be overlooked:
• Smaller databases are easier to manage. A production database must be fully managed: regular backups, database optimization, and other routine tasks. With one large database, completing these daily tasks is very difficult, if only because of the time each operation takes: regular table and index optimizations can stretch over hours or days, in some cases forcing maintenance to be deferred. With sharding, each individual "shard" can be maintained separately, so management is simpler and multiple maintenance tasks can run in parallel.
• Smaller databases are faster. The scalability of sharding is obvious, achieved through distributed processing across the shards and their servers. Less obvious is that each shard also outperforms a single large database purely by virtue of its smaller size. Because each shard has a server to itself, the ratio of memory to disk data improves greatly, reducing disk I/O. The result is less resource contention, better join performance, faster index searches, and fewer database locks. A sharded system therefore not only scales to higher capacity but also completes individual transactions faster.
• Database sharding can cut costs. Most sharding schemes can take advantage of low-cost open-source databases, or even the workgroup editions of commercial databases. Sharding also runs well on commodity multi-core server hardware, which costs far less than high-end multiprocessor servers and expensive storage area networks (SANs). The combined savings in licensing, software maintenance, and hardware can be considerable, at times 70% or more compared with other solutions. Database sharding is unquestionably a viable option for many organizations, as the practices of many large online vendors and software-as-a-service (SaaS) providers (giants such as Amazon, eBay, and of course Google) attest.
5 The practicality of database sharding
If database sharding is so scalable, so much cheaper, and so much faster, why has the technique not been widely adopted? Is it right for your organization?
Database sharding is in fact a very useful technique, but like any approach it brings a number of factors that must be weighed for a successful implementation. It also has limitations and does not suit every type of commercial application. This section discusses the key factors and how to address them.
5.1 Challenges of database sharding
Given the distributed nature of the individual databases, several key factors must be considered:
- Reliability. First and foremost, any production commercial application must be reliable and fault-tolerant, and must not suffer frequent outages. The data tier is usually the most critical element of any reliability design, and a sharding implementation is no exception. Indeed, given the distributed nature of multiple sharded databases, a well-designed scheme matters all the more. Ensuring reliability and fault tolerance requires the following (a failover sketch appears after this list):
• Automated backups of each individual shard.
• Shard redundancy, ensuring at least two live copies of every shard that can take over in the event of an outage or server failure. This demands a high-performance, efficient, reliable replication mechanism.
• Cost-effective hardware redundancy, both within a server and across servers.
• Automated failover when an outage or server failure occurs.
• Disaster-recovery site management.
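As one illustration of the failover requirement, the sketch below (our own construction; the function and parameter names are hypothetical) walks an ordered list of replica addresses for a shard and returns the first connection that succeeds:

```python
def connect_with_failover(replica_addresses, connect):
    """Return a connection to the first reachable replica of a shard.

    `replica_addresses` lists the shard's copies in preference order
    (primary first); `connect` is any function that opens a connection
    to an address or raises OSError on failure.
    """
    last_error = None
    for address in replica_addresses:
        try:
            return connect(address)  # success: this replica takes over
        except OSError as err:       # refused, timed out, unreachable, ...
            last_error = err
    raise ConnectionError(f"all replicas of this shard failed: {last_error}")
```

A production system would promote the surviving replica automatically rather than merely redirect connections, but the preference-ordered retry is the heart of the pattern.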
- Distributed queries. Many types of queries run much faster under distributed parallel query processing: each shard executes the query independently, and the partial results from the shards are then merged. The technique can improve performance by an order of magnitude, in many cases ten times or more. For queries to be distributed transparently, it is important to have a facility that issues the query to every shard and merges the results into a single result set for the application tier (a minimal sketch follows this list). Common queries that benefit from the pattern include:
• Statistical summaries that require a full scan of the system's data. For example, computing product sales usually means evaluating the entire database.
• Queries supporting complex reports, such as a list of all customers who bought a specified product in the previous day, week, or month.
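Here is a minimal scatter-gather sketch. In-memory SQLite databases stand in for the shards, and all names are illustrative; the point is the pattern of running one statement everywhere in parallel and merging the partial results:

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

def scatter_gather(shards, sql, params=()):
    """Run `sql` on every shard in parallel and collect the row lists."""
    def run(conn):
        return conn.execute(sql, params).fetchall()
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        return list(pool.map(run, shards))

# Example: a statistical summary that must scan every shard's data.
shards = [sqlite3.connect(":memory:", check_same_thread=False) for _ in range(3)]
for i, conn in enumerate(shards):
    conn.execute("CREATE TABLE sales (product TEXT, amount REAL)")
    conn.execute("INSERT INTO sales VALUES ('book', ?)", (100.0 * (i + 1),))

partials = scatter_gather(shards, "SELECT SUM(amount) FROM sales")
total = sum(rows[0][0] for rows in partials)  # merge step: sum per-shard sums
print(total)                                  # 600.0
```

The merge step depends on the query: sums are summed, counts are added, sorted result sets are merge-sorted, and so on.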
- Avoiding cross-shard joins. In a sharded system, queries that join tables across shards, and other cross-shard statements, are inefficient and difficult to execute. In most cases, with the right approach, such joins are not actually needed. The principal technique is to replicate global tables: relatively static tables that are frequently joined against the large primary tables. Tables of status codes, countries, types, even products, fall into this class. What is needed is an automated replication mechanism that keeps the global-table values synchronized on every shard, minimizing or eliminating cross-shard joins; a sketch follows.
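The global-table technique amounts to copying a small reference table onto every shard so joins against it stay local. A sketch (SQLite stand-ins again; in a real deployment the refresh would be triggered automatically whenever the source table changes):

```python
def replicate_global_table(source, shards, table, columns):
    """Copy a small, rarely-changing reference table to every shard so that
    joins against it never have to cross shard boundaries."""
    col_list = ", ".join(columns)
    rows = source.execute(f"SELECT {col_list} FROM {table}").fetchall()
    placeholders = ", ".join("?" for _ in columns)
    for conn in shards:
        conn.execute(f"DELETE FROM {table}")  # refresh the local copy
        conn.executemany(
            f"INSERT INTO {table} ({col_list}) VALUES ({placeholders})", rows)
        conn.commit()
```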
- Auto-increment key management. The typical auto-increment feature of a database management system generates a sequential key for each newly inserted row. That is fine for a single database, but once data is sharded, key values must be coordinated across the shards. What is needed is a seamless, automated way for the application to generate keys that are guaranteed unique across the entire system; one common construction is sketched below.
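One common construction (our assumption; the article does not prescribe a specific algorithm) is to interleave the shards' key spaces, so each shard hands out keys from its own residue class and no coordination is needed at insert time:

```python
import threading

class ShardKeyGenerator:
    """Generate globally unique keys: key = counter * shard_count + shard_id.

    Each shard owns one residue class modulo the shard count, so two
    shards can never produce the same key value.
    """

    def __init__(self, shard_id, shard_count, start=0):
        self.shard_id = shard_id
        self.shard_count = shard_count
        self._counter = start
        self._lock = threading.Lock()  # safe under concurrent inserts

    def next_key(self):
        with self._lock:
            key = self._counter * self.shard_count + self.shard_id
            self._counter += 1
            return key

gen = ShardKeyGenerator(shard_id=2, shard_count=4)
print([gen.next_key() for _ in range(3)])  # [2, 6, 10]
```

Alternatives include allocating blocks of keys from a central sequence or using large random identifiers (UUIDs); the interleaving shown here keeps keys compact and strictly increasing per shard.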
- Support for multiple sharding schemes. Note that sharding works because it scales out, and speeds up, the specific application built on it; the payoff is tied directly to how well the sharding algorithm matches the problems that application faces. What is needed is a varied, flexible menu of sharding schemes, each addressing a particular class of problem and each with its own performance characteristics and benefits. Choosing the wrong scheme can cap performance and fall well short of the desired effect. It is not uncommon for a single application to employ several schemes, each applied to the part of the application where it fits best. Some common sharding schemes are listed below, with a routing sketch after the list:
• Session-based sharding. Used when an individual user or process interacts with one specific shard for the duration of the user or process session. This is the easiest sharding technique to implement and adds almost no overhead to overall performance, since the shard is chosen only once per session. Applications that benefit are typically customer-centric business applications in which all data belonging to a customer lives on one shard.
• Transaction-based sharding. The shard is chosen by examining the first SQL statement of a given transaction, usually by evaluating the "shard key" value in that statement (for example, the order number). Every other statement in the transaction is then directed to the same shard.
• Statement-based sharding. Statement-based sharding is the most processing-intensive of the three: every SQL statement is evaluated to determine the correct shard to route it to, again by evaluating the shard-key value. This scheme suits high volumes of small, fine-grained transactions, such as recording call logs.
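A statement-based router might look like the following sketch. The regular expression for spotting the shard key is purely illustrative (a production router would parse the SQL properly), and the column name `customer_id` is an assumption of ours:

```python
import re

def route_statement(sql, params, shard_count):
    """Pick a shard for one SQL statement by its shard-key value.

    Assumes statements filter on a customer_id column, e.g.
    "... WHERE customer_id = ?", with the key as the first parameter.
    """
    if not re.search(r"customer_id\s*=\s*\?", sql, re.IGNORECASE):
        # No shard key: the statement must be scatter-gathered instead.
        raise ValueError("statement carries no shard key")
    customer_id = params[0]
    return customer_id % shard_count  # modulo placement, described below

print(route_statement("SELECT * FROM orders WHERE customer_id = ?", (42,), 4))
# -> 2
```

A session-based router would run this logic once at session start and cache the answer; a transaction-based router would run it on the first statement of each transaction.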
- Deciding how best to shard the data. This is another area that varies widely from one application to the next, and it is closely tied to the choice of sharding scheme above. There are many ways to decide how to shard your data; what matters is knowing your transaction rates, your table sizes, the distribution of your key values, and the other characteristics of your application. With that data in hand, you can determine the optimal sharding strategy:
• Shard by the table's primary key. This is the most straightforward choice and the easiest to map onto the application, but it works well only if your data is distributed suitably. For example, if you shard by customer ID (a sequential numeric value) and most transactions concern new customers, the sharding achieves little; if instead you choose a key that naturally spreads the transactions, the gains are large.
• Shard by the modulus of a key value. This widely used method applies a modulo function to the key value and distributes transactions according to the result. In effect you pre-establish some number of shards, and the modulo function assigns new key values to them in round-robin fashion, spreading the data very evenly across the whole database.
• Maintain a master shard-index table. This technique uses a separate master table that maps key values to shards. It is very flexible and suits a wide range of applications, but it often costs database performance, because every sharded SQL statement requires an extra lookup query. The last two options are contrasted in the sketch after this list.
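The modulo scheme and the master-index scheme can be contrasted in a few lines (SQLite stands in for the index database, and all table and column names are illustrative):

```python
import sqlite3

# Scheme 1: modulo placement. No lookup, but the shard count is fixed up front.
def shard_by_modulo(key, shard_count):
    return key % shard_count

# Scheme 2: master index table. Flexible mapping, one extra query per statement.
index_db = sqlite3.connect(":memory:")
index_db.execute(
    "CREATE TABLE shard_index (key_value INTEGER PRIMARY KEY, shard INTEGER)")
index_db.executemany("INSERT INTO shard_index VALUES (?, ?)",
                     [(101, 0), (102, 2), (103, 1)])

def shard_by_index(key):
    row = index_db.execute(
        "SELECT shard FROM shard_index WHERE key_value = ?", (key,)).fetchone()
    if row is None:
        raise KeyError(f"key {key} is not assigned to any shard")
    return row[0]

print(shard_by_modulo(103, 4))  # 3 -- computed, no I/O
print(shard_by_index(103))      # 1 -- found via the extra lookup query
```

The trade-off is exactly as described: the index table can move keys between shards at will, but pays for that flexibility with a lookup on every statement.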
Many factors must be weighed, and many conditions met, for database sharding to succeed and to deliver on its promise of affordable, higher levels of scalability and performance.
5.2 When database sharding is appropriate
Database sharding suits many types of commercial applications with mainstream database requirements. It can also be applied effectively to data warehousing, but we will not pursue that here, since many products and technologies already serve that space.
The general database requirements that suit sharding are as follows:
- Highly transactional database applications
- Mixed-workload database applications
• Frequent read operations, including complex queries and joins
• Write-intensive transactions (CRUD statements, including INSERT, UPDATE, DELETE)
• Contention for common tables and/or common rows
- General business reporting
• Typical "repeating segment" report generation
• Some data-analysis work (mixed in with the other workloads)
The most important step in deciding whether sharding fits your particular application or environment is to evaluate how well your database structure lends itself to sharding. In essence, database sharding is a form of "horizontal" partitioning: the rows of individual tables (as opposed to their columns) are distributed across multiple shards. To judge the merits of sharding in a given situation, the essentials are:
- Identify all the transaction-intensive tables in your schema
- Determine the transaction volume your database currently handles (or is expected to handle)
- Identify all the common SQL statements (SELECT, INSERT, UPDATE, DELETE) and establish how heavily each is used
- Understand the "table hierarchy" of your schema; in other words, the dependencies between tables
- Determine the "key distribution" of transactions against the high-volume tables, to see whether they are spread evenly or concentrated in narrow ranges
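On the last point, key distribution can be eyeballed with a simple bucketing query; in the sketch below (illustrative table and column names, SQLite as a stand-in) an even count per bucket suggests the candidate key will spread transactions well, while a skewed count warns of hot spots:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_orders (order_id INTEGER, customer_id INTEGER)")
conn.executemany("INSERT INTO customer_orders VALUES (?, ?)",
                 [(i, (i * 7) % 100) for i in range(1000)])

# Bucket the candidate shard key the same way the shards would (modulo 10
# here) and count rows per bucket.
for bucket, count in conn.execute(
        "SELECT customer_id % 10 AS b, COUNT(*) FROM customer_orders GROUP BY b"):
    print(bucket, count)  # ten buckets of ~100 rows each: an even spread
```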
With this information you can quickly assess the value and applicability of sharding to your application. As an example, here is a simple bookstore database schema that shows how data can be sharded:
Figure 3. The example bookstore database schema, showing how the data is sharded.
In the bookstore example, the primary shard table is the Customers table: the table by which the data is sharded. The Customers table is the parent of the shard hierarchy, and the Customer Orders and Order Details tables are its children. The data is sharded on the customer ID attribute, and all child-table rows bearing a given customer ID are sharded along with their parent row. The global tables are the common, relatively small reference tables; they are replicated to every shard to avoid cross-shard joins. A routing sketch for this schema follows.
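To tie the example together, here is a sketch of how statements against this schema would be routed (the table names mirror the figure; the modulo placement and shard count are choices of ours, not the article's):

```python
SHARD_COUNT = 4

def shard_for_customer(customer_id):
    # Customers is the primary shard table: every row whose lineage leads
    # back to this customer_id lives on the same shard.
    return customer_id % SHARD_COUNT

def route_insert(table, customer_id):
    if table in ("customers", "customer_orders", "order_details"):
        # Parent and child tables inherit the customer's shard, so an
        # order and its details join locally with their customer.
        return shard_for_customer(customer_id)
    # Global tables (e.g. the book catalog) are written to every shard.
    return "all"

print(route_insert("customer_orders", customer_id=1234))
# -> 2: this order lands on the same shard as customer 1234
```

Because parent and child rows always land together, the common queries ("this customer's orders and their details") never have to leave a single shard.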
Although the example is simple, it shows the basic factors to weigh when deciding how to shard a given database application. From such an assessment you can determine whether sharding suits your particular environment and what benefits sharding the database would bring.
6 Concluding remarks
This article has provided an overview of database sharding, including a discussion of the challenges sharding faces and the basic approaches to completing a sharding scheme. Database sharding has proven itself in many large organizations and may be well suited to the specific problems your application encounters. Applied correctly, database sharding will certainly help a great many commercial transaction applications reach the goal of low-cost, near-linear scalability.