When a relational database tries to store terabytes of data in a single table, overall performance usually suffers. Indexing that much data is expensive, for writes as well as for reads. NoSQL datastores (Google's Bigtable, for example) are particularly well suited to storing big data, but NoSQL is a decidedly non-relational approach. For developers who prefer the ACID guarantees and solid structure of a relational database, and for projects that require them, sharding can be an exciting alternative.
Sharding is an offshoot of database partitioning, but it is not a native database technology: it happens at the application level. Among the various sharding implementations, Hibernate Shards is one of the most popular in the Java™ technology world. This project lets you work almost seamlessly with sharded datasets using POJOs that are mapped to a logical database (I'll explain the "almost" shortly). When you use Hibernate Shards, you do not need to map your POJOs specifically to a shard; you map them the Hibernate way, as you would for any ordinary relational database. Hibernate Shards manages the low-level sharding tasks for you.
So far in this series, I've used a simple domain based on the analogy of races and runners to demonstrate various datastore technologies. This month, I'll continue with that familiar example to introduce a practical sharding strategy and then implement it in Hibernate Shards. Note that most of the work involved in sharding has little to do with Hibernate; in fact, coding for Hibernate Shards is the easy part. The real work is deciding how to shard and what to shard.
About this series
The Java development landscape has changed radically since Java technology first emerged. Thanks to mature open source frameworks and reliable for-rent deployment infrastructures, Java applications can now be assembled, tested, run, and maintained quickly and inexpensively. In this series, Andrew Glover explores the range of technologies and tools that make this new style of Java development possible.
Introduction to sharding
Database partitioning is an inherently relational process that divides the rows of a table into smaller groups based on some logical piece of data. For example, if you partition a very large table named Foo by timestamp, all data from before August 2010 might go into partition A, and everything from then on into partition B. Partitioning can speed up reads and writes because each operation targets the smaller dataset in a single partition.
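The timestamp rule in the Foo example can be sketched in a few lines of Java. This is only an illustration of the rule itself: a database that supports partitioning applies it internally, and the class and method names here are hypothetical.

```java
import java.time.LocalDate;

// Illustrative only: a real database applies its partitioning rule internally.
final class TimestampPartitionRule {
    // Cutoff taken from the example in the text: August 2010.
    private static final LocalDate CUTOFF = LocalDate.of(2010, 8, 1);

    // Returns "A" for rows dated before the cutoff and "B" for everything else.
    static String partitionFor(LocalDate rowDate) {
        return rowDate.isBefore(CUTOFF) ? "A" : "B";
    }

    public static void main(String[] args) {
        System.out.println(partitionFor(LocalDate.of(2010, 7, 31)));
        System.out.println(partitionFor(LocalDate.of(2010, 8, 1)));
    }
}
```

A read or write for a given date only has to touch the one partition the rule selects, which is where the speedup comes from.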
Partitioning is not always available (MySQL did not support it until version 5.1), and with a commercial system it can be cost-prohibitive. More importantly, most partitioning implementations store the data on the same physical machine, so you are still bound by the limits of that hardware. Partitioning also does nothing about the reliability, or unreliability, of your hardware. As a result, many smart people began looking for new ways to scale.
Sharding is essentially partitioning at the database level: instead of dividing the rows of a table by some piece of data, it splits the database itself by some logical data element (usually across different machines). In other words, sharding does not break a table into smaller chunks; it breaks the entire database into smaller chunks.
The canonical example of sharding takes a large database of worldwide customer data and divides it by region: Shard A stores customers in the United States, Shard B stores customers in Asia, Shard C stores customers in Europe, and so on. The shards live on separate machines, and each shard holds all the related data, such as customer preferences or subscription history.
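At the application level, the region-based split amounts to a lookup from region to database. The following is a minimal sketch of that idea in plain Java, not the Hibernate Shards API; the class name, region codes, and JDBC URLs are all assumptions made up for illustration.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical application-level router: picks a shard from a customer's region.
final class RegionShardRouter {
    // Region codes and JDBC URLs are placeholders, not real endpoints.
    private static final Map<String, String> SHARD_URL_BY_REGION = Map.of(
            "US", "jdbc:mysql://shard-a.example.com/customers",
            "ASIA", "jdbc:mysql://shard-b.example.com/customers",
            "EUROPE", "jdbc:mysql://shard-c.example.com/customers");

    // Returns the URL of the shard that holds all data for the given region.
    static String shardUrlFor(String region) {
        String url = SHARD_URL_BY_REGION.get(region.toUpperCase(Locale.ROOT));
        if (url == null) {
            throw new IllegalArgumentException("No shard configured for region: " + region);
        }
        return url;
    }

    public static void main(String[] args) {
        System.out.println(shardUrlFor("us"));
        System.out.println(shardUrlFor("Europe"));
    }
}
```

Because every piece of a customer's data goes through the same lookup, the preferences and history for a United States customer end up on the same shard as the customer record.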
The benefit of sharding, like that of partitioning, is that it makes big data smaller: the individual tables in each shard are relatively small, which allows faster reads and writes and therefore better performance. Sharding can also improve reliability, because even if one shard fails unexpectedly, the others can still serve data. And because sharding is done at the application level, you can shard a database that does not support regular partitioning. The monetary cost is potentially lower as well.
Sharding and strategy
Like most technologies, sharding involves some trade-offs. Because sharding is not a native database technology (that is, you must implement it in your application), you need to work out your sharding strategy before you start. Primary keys and cross-shard queries both play a major role in that strategy, mainly by defining what you cannot do.
Primary keys
Sharding uses multiple databases, each of which functions independently and knows nothing about the other shards. Therefore, if you rely on database sequences (for automatic primary key generation, say), the same primary key will very likely show up in more than one database in the set. You can coordinate sequences across distributed databases, but doing so increases the complexity of the system. The safest way to avoid duplicate primary keys is to have the application (which will be managing the sharded system anyway) generate the keys.
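One common way to generate keys in the application is to use random UUIDs, which need no coordination between databases. The sketch below uses the standard java.util.UUID class; the wrapper class and method name are hypothetical, and UUIDs are one option among several rather than the approach this article prescribes.

```java
import java.util.UUID;

// Hypothetical helper: the application, not a database sequence, creates keys.
final class ShardSafeKeys {
    // A random (version 4) UUID is generated locally, so no shard has to
    // consult any other shard before a row is inserted.
    static String newPrimaryKey() {
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        System.out.println(newPrimaryKey());
        System.out.println(newPrimaryKey());
    }
}
```

The trade-off is a 36-character string key instead of a compact integer, which makes indexes larger than a sequence-based key would.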
Cross-shard queries
Most sharding implementations (including Hibernate Shards) do not support cross-shard queries, which means you must go to extra lengths if you want to use two sets of data that live on different shards. (Interestingly, Amazon's SimpleDB also prohibits queries across domains.) For example, if you store United States customers in Shard 1, you also need to store all of their related data there. If you try to store that data in Shard 2, things get complicated and system performance is likely to suffer. This also ties back to the previous point: if you find that you need to do cross-shard joins for some reason, you had better be managing keys in a way that rules out duplicates!
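To make the "extra lengths" concrete, here is a minimal sketch of a join performed in application code when customers and their preferences sit on different shards. The two maps are in-memory stand-ins for the results of two separate shard queries; the class name, method name, and data are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical application-level join across two shards.
final class CrossShardLookup {
    // namesOnShard1: customer ID -> name, as fetched from one shard.
    // preferencesOnShard2: customer ID -> preference, as fetched from another.
    static List<String> joinInApplication(Map<String, String> namesOnShard1,
                                          Map<String, String> preferencesOnShard2) {
        List<String> rows = new ArrayList<>();
        // Copy into a TreeMap so results come back in a stable, key-sorted order.
        for (Map.Entry<String, String> customer : new TreeMap<>(namesOnShard1).entrySet()) {
            // The match relies on customer IDs being unique across all shards.
            String preference = preferencesOnShard2.get(customer.getKey());
            if (preference != null) {
                rows.add(customer.getValue() + " prefers " + preference);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<String, String> names = Map.of("c1", "Ada", "c2", "Bob");
        Map<String, String> preferences = Map.of("c1", "email");
        System.out.println(joinInApplication(names, preferences));
    }
}
```

Each call costs two round trips plus merging in memory, work that a single database would do in one join, and the result is only correct if no two shards ever issue the same key.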
Clearly, you must think your sharding strategy through before you set up the database. Once you have chosen a particular direction, you are more or less tied to it, because it is hard to move data around after it has been sharded.
Avoid premature sharding
Sharding is best implemented later rather than sooner. As with premature optimization, sharding based on anticipated data growth can be a recipe for disaster. Successful sharding implementations are based on an understanding of how the application's data has grown over time, and on extrapolating from that into the future. Once data has been sharded, moving it around can be very difficult.