Sharding with Hibernate Shards

When a relational database tries to store terabytes of data in a single table, overall performance usually degrades. Indexing all that data is expensive for writes as well as for reads. NoSQL datastores (such as Google's Bigtable) are particularly well suited to storing big data, but NoSQL is a decidedly non-relational approach. For developers who prefer the ACID guarantees and solid structure of a relational database, and for projects that require that structure, sharding can be an exciting alternative.

Sharding is a form of database partitioning, but it is not a native database technology: it happens at the application level. Among the various sharding implementations, Hibernate Shards is one of the most popular in the Java™ technology world. This nifty project lets you work almost seamlessly with sharded datasets using POJOs that are mapped to a logical database (I'll explain the "almost" shortly). When you use Hibernate Shards, you don't have to map your POJOs specifically to a shard; you map them as you would for any ordinary relational database using Hibernate. Hibernate Shards manages the low-level sharding tasks for you.

So far in this series, I've used a simple domain based on the analogy of races and runners to demonstrate various datastore technologies. This month, I'll use this familiar example to introduce a practical sharding strategy and then implement it in Hibernate Shards. Note that most of the work involved in sharding has little to do with Hibernate; in fact, coding for Hibernate Shards is the easy part. The real work is deciding how to shard and what to shard.

About this series

Since Java technology first appeared, the Java development landscape has changed dramatically. Thanks to mature open source frameworks and reliable for-rent deployment infrastructures, Java applications can now be assembled, tested, run, and maintained quickly and inexpensively. In this series, Andrew Glover explores the technologies and tools that make this new style of Java development possible.

Introduction to Sharding

Database partitioning is an inherently relational process that divides the rows of a table into different groups according to some logical piece of data. For example, if you partition a large table named Foo by timestamp, all data from before August 2010 goes into partition A and all data from then on goes into partition B. Partitioning can speed up reads and writes because each operation targets a smaller dataset in an individual partition.
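The timestamp rule can be sketched in Java as a small routing function. This is only an illustration of the rule itself: a real database applies it internally, and the class name, the method name, and the exact cutoff of 1 August 2010 are assumptions made for the example.

```java
import java.time.LocalDate;

/** Illustrative only: mimics how a timestamp-based rule could route rows of Foo. */
final class FooPartitionRule {
    // Assumed cutoff: rows dated before 1 August 2010 belong to partition A.
    private static final LocalDate CUTOFF = LocalDate.of(2010, 8, 1);

    /** Returns the name of the partition that a row with the given date falls into. */
    static String partitionFor(LocalDate rowDate) {
        return rowDate.isBefore(CUTOFF) ? "A" : "B";
    }

    public static void main(String[] args) {
        System.out.println(partitionFor(LocalDate.of(2009, 12, 31)));
        System.out.println(partitionFor(LocalDate.of(2011, 1, 15)));
    }
}
```

Every read or write that carries a date only has to touch the partition this rule selects, which is where the speedup comes from.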

Partitioning is not always available (MySQL did not support it until version 5.1), and the commercial systems that offer it can be prohibitively expensive. More importantly, most partitioning implementations store the data on the same physical machine, so they are still bound by the limits of that hardware. Nor does partitioning do anything about hardware reliability, or the lack of it. As a result, many smart people began looking for new ways to scale.

Sharding is essentially partitioning at the database level: instead of dividing the rows of a table by some piece of data, it splits the database itself by some logical data element (usually across different machines). In other words, sharding does not divide a table into small chunks; it divides the entire database into small chunks.

A typical example of sharding is a large database of worldwide customer data that is split by region: shard A stores information about customers in the U.S., shard B stores Asian customers, shard C stores European customers, and so on. The shards live on separate machines, and each shard stores all the data related to its customers, such as preferences or order history.
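As a minimal sketch of that regional split, the application could hold a lookup from region to shard. The region codes and shard names below are hypothetical; in practice each name would stand for the connection settings of a separate database server.

```java
import java.util.Map;

/** Illustrative only: routes a customer to a shard by region. */
final class RegionShardRouter {
    // Hypothetical region codes and shard names, following the A/B/C example.
    private static final Map<String, String> SHARD_BY_REGION = Map.of(
            "US", "shard-A",
            "ASIA", "shard-B",
            "EUROPE", "shard-C");

    /** Returns the shard for a region, or fails loudly if the region is unmapped. */
    static String shardFor(String region) {
        String shard = SHARD_BY_REGION.get(region);
        if (shard == null) {
            throw new IllegalArgumentException("No shard configured for region: " + region);
        }
        return shard;
    }

    public static void main(String[] args) {
        System.out.println(shardFor("ASIA"));
    }
}
```

Failing on an unmapped region, rather than falling back to a default shard, keeps a customer's data from silently landing in the wrong database.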

The benefit of sharding, like that of partitioning, is that it shrinks big data: the individual tables in each shard are relatively small, which supports faster reads and writes and so improves performance. Sharding can also improve reliability, because even if one shard fails unexpectedly, the other shards can still serve data. And because sharding is done at the application level, you can shard a database that does not support regular partitioning. Lower capital cost is another potential advantage.


Sharding and Strategy

As with most technologies, sharding involves trade-offs. Because sharding is not a native database technology (that is, you must implement it in your application), you need to work out your sharding strategy before you start. Primary keys and cross-shard queries both play a major role in sharding, mainly by defining what you can't do.

Primary keys
Sharding uses multiple databases, each of which functions independently and knows nothing about the other shards. Consequently, if you rely on database sequences (such as automatic primary key generation), the same primary key is likely to show up in more than one database in the set. You can coordinate sequences across distributed databases, but doing so adds complexity to the system. The safest way to avoid duplicate primary keys is to have the application (which manages the sharded system anyway) generate the keys.
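One simple way for the application to generate keys is a random UUID from the Java standard library. The sketch below is an assumption about how that could look, not the article's implementation; the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

/** Illustrative only: the application, not any single database, creates primary keys. */
final class ApplicationKeyGenerator {
    /** Returns a key that needs no coordination between shards to be unique in practice. */
    static String newPrimaryKey() {
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        // Keys destined for different shards still never collide in practice.
        Set<String> keys = new HashSet<>();
        for (int i = 0; i < 5; i++) {
            keys.add(newPrimaryKey());
        }
        System.out.println(keys.size() + " distinct keys generated");
    }
}
```

Because each key is generated without consulting any database, two shards can accept inserts at the same time without a shared sequence.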

Cross-shard queries
Most sharding implementations (including Hibernate Shards) do not support cross-shard queries, which means that you have to go to extra lengths if you want to combine two sets of data that live in different shards. (Interestingly, Amazon's SimpleDB also prohibits queries across domains.) For example, if you store U.S. customer information in shard 1, you also need to store all of the related data there. If you try to store that related data in shard 2, things become complicated and system performance can suffer. This also ties back to the previous point: if for some reason you do need to join across shards, you had better manage keys in a way that rules out duplicates.
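To give a feel for those "extra lengths", here is a sketch in which the application asks each shard separately and merges the answers in memory. Each shard is simulated as an in-memory map from customer ID to order totals; the class and method names are hypothetical, and a real system would run one query per database instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative only: an application-level merge standing in for a cross-shard query. */
final class CrossShardLookup {
    /** Collects a customer's order totals from every shard, in shard order. */
    static List<Double> ordersAcrossShards(String customerId,
                                           List<Map<String, List<Double>>> shards) {
        List<Double> merged = new ArrayList<>();
        for (Map<String, List<Double>> shard : shards) {
            // One lookup per shard; a shard that does not know the customer adds nothing.
            merged.addAll(shard.getOrDefault(customerId, List.of()));
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, List<Double>> shard1 = Map.of("c1", List.of(10.0, 20.0));
        Map<String, List<Double>> shard2 = Map.of("c1", List.of(5.0));
        System.out.println(ordersAcrossShards("c1", List.of(shard1, shard2)));
    }
}
```

The cost is visible even in this toy version: every lookup touches every shard, which is why keeping related data in one shard is the better design.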

Clearly, the sharding strategy must be thought through before the databases are set up. Once you've chosen a particular direction, you're more or less tied to it: moving data after it has been sharded is hard.

Avoid premature sharding

Sharding is best done late. As with premature optimization, sharding based on anticipated data growth can be a recipe for disaster. Successful sharding implementations are based on an understanding of how the application's data has grown over time and on what that implies for the future. Once data has been sharded, moving it can be very difficult.

A strategy example

Because sharding ties you to a linear data model (that is, you can't easily join data that lives in different shards), you need a clear idea of how the data in each shard will be logically organized. The usual way to get there is to focus on the primary node of the domain. In an e-commerce system, for example, the primary node could be an order or a customer. If you pick the customer as the node for your sharding strategy, all data related to a customer moves to that customer's shard, but you still have to decide which shard that is.

For customers, you could shard by location (Europe, Asia, Africa, and so on) or by some other element. That's up to you. Your sharding strategy should, however, include some means of distributing the data evenly across all the shards. The whole point of sharding is to break large datasets into small ones; so if a particular e-commerce domain has a large set of European customers and a relatively small set of American customers, sharding by customer location may not make sense.
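One way to check whether a candidate strategy spreads the data evenly is to count how many keys each shard would receive. The sketch below does that for a hash-based assignment on customer ID, which is an alternative technique not taken from the article; the class name, method names, and the sequential IDs are assumptions for illustration.

```java
import java.util.Arrays;

/** Illustrative only: measures how evenly a sharding rule spreads customer IDs. */
final class ShardBalanceCheck {
    /** Hash-based assignment: ignores location and spreads IDs by their hash. */
    static int shardFor(long customerId, int shardCount) {
        return Math.floorMod(Long.hashCode(customerId), shardCount);
    }

    /** Counts how many of the given customer IDs land on each shard. */
    static int[] distribution(long[] customerIds, int shardCount) {
        int[] counts = new int[shardCount];
        for (long id : customerIds) {
            counts[shardFor(id, shardCount)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] ids = new long[1000];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = i;
        }
        System.out.println(Arrays.toString(distribution(ids, 4)));
    }
}
```

Running the same count against a location-based rule and real customer data would show immediately whether one region dominates. Note that hashing trades away the ability to keep a region's data together, so it only fits when nothing needs that locality.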


Back to the races: Using shards

Now let's return to the racing application I've used throughout this series. I could shard it by race or by runner. In this case, I'll shard by race, because I see the domain as organized around runners who take part in different races; the race is therefore the root of the domain. More specifically, I'll shard by race distance, because the racing application holds many races of different lengths with different runners.

Note that in making this decision, I've already accepted a compromise: a runner who takes part in more than one race may have races in different shards. Hibernate Shards (like most sharding implementations) does not support cross-shard joins. I'll have to live with this minor inconvenience and allow runners to appear in multiple shards; that is, I'll re-create the runner in each shard that holds one of the runner's races.

For the sake of simplicity, I'll create two shards: one for races of up to 10 miles and another for races longer than 10 miles.
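With two shards split at 10 miles, the compromise about runners can be made concrete: a runner has to be copied into every shard that holds one of the runner's races. The sketch below computes that set of shards. It is illustrative only; the class and method names are hypothetical, and treating exactly 10.0 miles as the shorter shard is an assumption.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative only: works out which shards need a copy of a runner. */
final class RunnerShardPlacement {
    /** Shard 1 holds races longer than 10 miles; shard 0 holds everything else. */
    static int shardForDistance(double miles) {
        return miles > 10.0 ? 1 : 0;
    }

    /** Returns every shard that contains at least one of the runner's races. */
    static Set<Integer> shardsForRunner(List<Double> raceDistancesInMiles) {
        Set<Integer> shards = new TreeSet<>();
        for (double miles : raceDistancesInMiles) {
            shards.add(shardForDistance(miles));
        }
        return shards;
    }

    public static void main(String[] args) {
        // A runner with a 5K (3.1 miles) and a marathon (26.2 miles) lives in both shards.
        System.out.println(shardsForRunner(List.of(3.1, 26.2)));
    }
}
```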


Implementing Hibernate Shards

Hibernate Shards works almost seamlessly with existing Hibernate projects. The only catch is that Hibernate Shards needs some specific information and behavior from you: a shard access strategy, a shard selection strategy, and a shard resolution strategy. These are interfaces that you must implement, although in some cases you can use a default implementation. We'll look at each interface in the following sections.

ShardAccessStrategy

When a query is executed, Hibernate Shards needs a mechanism for deciding which shard to hit first, second, and so on. Hibernate Shards does not need to figure out what the query is looking for (that is the job of Hibernate Core and the underlying database), but it does know that more than one shard may have to be queried before an answer is found. Hibernate Shards therefore provides two ready-made implementations: one queries the shards sequentially (one at a time) until an answer is found; the other is a parallel access strategy, which uses a threading model to query all the shards at once.

To keep things simple, I'll use the sequential strategy, named SequentialShardAccessStrategy. We'll configure it later.
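The difference between the two access styles can be sketched with plain Java, using in-memory maps as stand-ins for shards. This is not how Hibernate Shards is implemented internally; the class and method names are hypothetical, and the parallel version assumes at least one shard.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Illustrative only: sequential versus parallel lookups over simulated shards. */
final class ShardAccessSketch {
    /** Sequential: ask one shard at a time and stop at the first hit. */
    static Optional<String> findSequentially(String key, List<Map<String, String>> shards) {
        for (Map<String, String> shard : shards) {
            String value = shard.get(key);
            if (value != null) {
                return Optional.of(value);
            }
        }
        return Optional.empty();
    }

    /** Parallel: submit one lookup per shard to a thread pool, then take the first hit. */
    static Optional<String> findInParallel(String key, List<Map<String, String>> shards) {
        ExecutorService pool = Executors.newFixedThreadPool(shards.size());
        try {
            List<Callable<String>> lookups = new ArrayList<>();
            for (Map<String, String> shard : shards) {
                lookups.add(() -> shard.get(key));
            }
            // invokeAll blocks until every shard has answered.
            for (Future<String> result : pool.invokeAll(lookups)) {
                String value = result.get();
                if (value != null) {
                    return Optional.of(value);
                }
            }
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while querying shards", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A shard lookup failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Map<String, String>> shards =
                List.of(Map.of("race-1", "5K"), Map.of("race-2", "Marathon"));
        System.out.println(findSequentially("race-2", shards));
        System.out.println(findInParallel("race-2", shards));
    }
}
```

The sequential version does the least work when the answer sits in an early shard; the parallel version pays for threads but its latency does not grow with the number of shards that miss.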

ShardSelectionStrategy

When a new object is created (for example, when a new Race or Runner is created through Hibernate), Hibernate Shards needs to know which shard the corresponding data should be written to. You therefore have to implement this interface and code your sharding logic in it. If you would rather use a default implementation, there is one named RoundRobinShardSelectionStrategy, which assigns data to the shards in round-robin fashion.
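Round-robin selection simply cycles through the shard IDs. The following sketch shows the idea with a thread-safe counter; it is an assumption about the general technique, not the source of RoundRobinShardSelectionStrategy, and the class name is hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative only: hands out shard IDs 0, 1, ..., n-1, 0, 1, ... in turn. */
final class RoundRobinSketch {
    private final int shardCount;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinSketch(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        this.shardCount = shardCount;
    }

    /** Returns the shard for the next new object; safe to call from several threads. */
    int nextShardId() {
        // floorMod keeps the result non-negative even after the counter overflows.
        return Math.floorMod(next.getAndIncrement(), shardCount);
    }

    public static void main(String[] args) {
        RoundRobinSketch selector = new RoundRobinSketch(2);
        for (int i = 0; i < 4; i++) {
            System.out.println(selector.nextShardId());
        }
    }
}
```

Round-robin balances the shards well, but it ignores the content of the object, so it cannot keep related data together the way a domain-specific rule can.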

For the racing application, I need behavior that shards by race distance. So I'll implement the ShardSelectionStrategy interface and, in its selectShardIdForNewObject method, provide some simple logic that picks a shard based on the distance of the Race object. (I'll show you the Race object later.)

At runtime, when a save-like method is called on one of my domain objects, this interface's behavior is invoked deep inside Hibernate's core.
Listing 1. A simple shard selection strategy

import org.hibernate.shards.ShardId;
import org.hibernate.shards.strategy.selection.ShardSelectionStrategy;

public class RacerShardSelectionStrategy implements ShardSelectionStrategy {

 public ShardId selectShardIdForNewObject(Object obj) {
  if (obj instanceof Race) {
   // A race is placed according to its own distance
   Race rce = (Race) obj;
   return this.determineShardId(rce.getDistance());
  } else if (obj instanceof Runner) {
   Runner runnr = (Runner) obj;
   if (runnr.getRaces().isEmpty()) {
    throw new IllegalArgumentException("Runners must have at least one race");
   } else {
    // A runner follows the first race found
    double dist = 0.0;
    for (Race rce : runnr.getRaces()) {
     dist = rce.getDistance();
     break;
    }
    return this.determineShardId(dist);
   }
  } else {
   throw new IllegalArgumentException("A non-shardable object is being created");
  }
 }

 // Shard 1 holds races longer than 10 miles; shard 0 holds all others
 private ShardId determineShardId(double distance) {
  if (distance > 10.0) {
   return new ShardId(1);
  } else {
   return new ShardId(0);
  }
 }
}

As you can see in Listing 1, if the object being persisted is a Race, its distance is determined and a shard is chosen accordingly. In this case there are two shards, 0 and 1: shard 1 holds races longer than 10 miles, and shard 0 holds all other races.

If a Runner or some other object is persisted, things are slightly more involved. I've coded a logical rule with three stipulations. First, a Runner cannot exist without a corresponding Race. Second, if a Runner has been created with multiple Races, the Runner is persisted in the shard of the first Race found. (This rule does not bode well for the future, by the way.) Third, if any other domain object is saved, an exception is thrown for now.

With that, you can wipe the sweat from your brow, because most of the hard work is done. The logic I've used may not be flexible enough as the racing application grows, but it is good enough for this demonstration.

ShardResolutionStrategy

When an object is looked up by key, Hibernate Shards needs a way to decide which shard to hit first. You guide it by means of the ShardResolutionStrategy interface.

As I mentioned earlier, sharding forces you to pay attention to primary keys, because you have to manage them yourself. Fortunately, Hibernate is already good at key and UUID generation. Hibernate Shards builds on this with an ID generator named ShardedUUIDGenerator, which embeds the shard ID in the UUID itself.

If you end up using ShardedUUIDGenerator for key generation (as I will in this article), you can also use the ShardResolutionStrategy implementation that ships with Hibernate Shards, named AllShardsShardResolutionStrategy, which can determine which shard to search based on the ID of a particular object.
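To illustrate the general idea of an ID that carries its shard, the sketch below stores a shard ID in the top 16 bits of a random UUID and reads it back. The bit layout is an assumption chosen for the example; it is not the actual format used by ShardedUUIDGenerator, and the class and method names are hypothetical.

```java
import java.util.UUID;

/** Illustrative only: an ID that remembers which shard its row lives in. */
final class ShardAwareId {
    /** Creates a random UUID whose top 16 bits are replaced by the shard ID. */
    static UUID newId(int shardId) {
        if (shardId < 0 || shardId > 0xFFFF) {
            throw new IllegalArgumentException("shardId must fit in 16 bits: " + shardId);
        }
        UUID random = UUID.randomUUID();
        long lower48 = random.getMostSignificantBits() & 0x0000FFFFFFFFFFFFL;
        long msb = ((long) shardId << 48) | lower48;
        return new UUID(msb, random.getLeastSignificantBits());
    }

    /** Recovers the shard ID from an ID produced by newId. */
    static int shardIdOf(UUID id) {
        return (int) (id.getMostSignificantBits() >>> 48);
    }

    public static void main(String[] args) {
        UUID id = newId(1);
        System.out.println(id + " lives in shard " + shardIdOf(id));
    }
}
```

With such an ID, a lookup by key can go straight to the right database instead of trying each shard in turn.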

With the three interfaces that Hibernate Shards requires in place, we can move on to the second step of sharding the example application: bootstrapping Hibernate's SessionFactory.


Configuring Hibernate Shards

One of Hibernate's core interface objects is its SessionFactory. All of Hibernate's magic happens through this small object as it configures a Hibernate application, for instance by loading mapping files and configurations. Whether you use annotations or Hibernate's venerable .hbm files, you still need a SessionFactory to tell Hibernate which objects are persistable and where to persist them.

Accordingly, when you use Hibernate Shards, you must use an enhanced SessionFactory type that can be configured for multiple databases. It is aptly named ShardedSessionFactory and is, of course, a SessionFactory. When you create a ShardedSessionFactory, you must supply the three shard strategy implementations configured earlier (ShardAccessStrategy, ShardSelectionStrategy, and ShardResolutionStrategy). You also need to supply all the mapping files your POJOs require. (Things may differ if you use an annotation-based Hibernate POJO configuration.) Finally, a ShardedSessionFactory instance needs multiple Hibernate configuration files, one for each shard.

Creating a Hibernate configuration

I've created a ShardedSessionFactoryBuilder type with one main method, createSessionFactory, which creates a properly configured SessionFactory. Later, I'll wire everything together with Spring (who doesn't use an IoC container these days?). For now, Listing 2 shows the main job of ShardedSessionFactoryBuilder, which is creating a Hibernate configuration:
Listing 2. Creating a Hibernate configuration

private Configuration getPrototypeConfig(String hibernateFile, List<String>
  resourceFiles) {
 Configuration config = new Configuration().configure(hibernateFile);
 // Register each POJO mapping resource with the configuration
 for (String res : resourceFiles) {
  config.addResource(res);
 }
 return config;
}

As you can see in Listing 2, a simple Configuration is created from a Hibernate configuration file. That file contains information such as the type of database being used, the user name and password, and all the necessary resource files, such as the .hbm files for the POJOs. With sharding, you would normally need multiple database configurations, but Hibernate Shards simplifies matters by letting you use a single hibernate.cfg.xml file as the prototype configuration (although, as you will see in Listing 4, you still need to prepare one hibernate.cfg.xml file for each shard you use).

Next, in Listing 3, I collect all the shard configurations into a list:
Listing 3. Shard configuration list

List<ShardConfiguration> shardConfigs = new ArrayList<ShardConfiguration>();
for (String hibConfig : this.hibernateConfigurations) {
 shardConfigs.add(buildShardConfig(hibConfig));
}

Spring Configuration

In Listing 3, the reference to hibernateConfigurations points to a List of Strings, each of which holds the name of a Hibernate configuration file. This List is wired in through Spring. Listing 4 is an excerpt from my Spring configuration file:
Listing 4. Part of the Spring configuration file

<bean id="shardedSessionFactoryBuilder"
  class="org.disco.racer.shardsupport.ShardedSessionFactoryBuilder">
    <property name="resourceConfigurations">
        <list>
            <value>racer.hbm.xml</value>
        </list>
    </property>
    <property name="hibernateConfigurations">
        <list>
            <value>shard0.hibernate.cfg.xml</value>
            <value>shard1.hibernate.cfg.xml</value>
        </list>
    </property>
</bean>

As you can see in Listing 4, ShardedSessionFactoryBuilder is wired with one POJO mapping file and two shard configuration files. Listing 5 is an excerpt from the POJO mapping file:
Listing 5. The Race POJO mapping

<class name="org.disco.racer.domain.Race" table="race" dynamic-update="true"
   dynamic-insert="true">

 <id name="id" column="race_id" unsaved-value="-1">
  <generator class="org.hibernate.shards.id.ShardedUUIDGenerator"/>
 </id>

 <set name="participants" cascade="save-update" inverse="false"
    table="race_participants" lazy="false">
  <key column="race_id"/>
  <many-to-many column="runner_id" class="org.disco.racer.domain.Runner"/>
 </set>

 <set name="results" inverse="true" table="race_results" lazy="false">
  <key column="race_id"/>
  <one-to-many class="org.disco.racer.domain.Result"/>
 </set>

 <property name="name" column="name" type="string"/>
 <property name="distance" column="distance" type="double"/>
 <property name="date" column="date" type="date"/>
 <property name="description" column="DESCRIPTION" type="string"/>
</class>

Note that the only unusual aspect of the POJO mapping in Listing 5 is the ID generator class: it is ShardedUUIDGenerator, which (as you might imagine) embeds the shard ID in the UUID. That is the only sharding-specific part of my POJO mapping.

Shard configuration file

Next, as shown in Listing 6, I've configured a shard. In this example, the files for shard 0 and shard 1 are identical except for the shard ID and the connection information.
Listing 6. Hibernate Shards configuration file

<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE hibernate-configuration PUBLIC
  "-//Hibernate/Hibernate Configuration DTD//EN"
  "http://hibernate.sourceforge.net/hibernate-configuration-3.0.dtd">

As its name implies, the enable_cross_shard_relationship_checks property checks for cross-shard relationships. According to the Hibernate Shards documentation, this check is expensive and should be turned off in production environments.

Finally, ShardedSessionFactoryBuilder pulls everything together by creating a ShardStrategyFactory and adding the three strategy types to it (including the RacerShardSelectionStrategy from Listing 1), as shown in Listing 7:
Listing 7. Creating a ShardStrategyFactory

private ShardStrategyFactory buildShardStrategyFactory() {
 ShardStrategyFactory shardStrategyFactory = new ShardStrategyFactory() {
  public ShardStrategy newShardStrategy(List<ShardId> shardIds) {
   ShardSelectionStrategy pss = new RacerShardSelectionStrategy();
   ShardResolutionStrategy prs = new AllShardsShardResolutionStrategy(shardIds);
   ShardAccessStrategy pas = new SequentialShardAccessStrategy();
   return new ShardStrategyImpl(pss, prs, pas);
  }
 };
 return shardStrategyFactory;
}

Lastly, I implement the method named createSessionFactory, which in this case creates a ShardedSessionFactory, as shown in Listing 8:
Listing 8. Creating a ShardedSessionFactory

public SessionFactory createSessionFactory() {
 // The first shard's configuration file doubles as the prototype configuration
 Configuration prototypeConfig = this.getPrototypeConfig
  (this.hibernateConfigurations.get(0), this.resourceConfigurations);

 List<ShardConfiguration> shardConfigs = new ArrayList<ShardConfiguration>();
 for (String hibConfig : this.hibernateConfigurations) {
  shardConfigs.add(buildShardConfig(hibConfig));
 }

 ShardStrategyFactory shardStrategyFactory = buildShardStrategyFactory();
 ShardedConfiguration shardedConfig = new ShardedConfiguration(
  prototypeConfig, shardConfigs, shardStrategyFactory);
 return shardedConfig.buildShardedSessionFactory();
}

Wiring domain objects with Spring

Now take a deep breath, because we're almost done. So far, I've created a builder class that properly configures a ShardedSessionFactory, which is really just an implementation of Hibernate's ubiquitous SessionFactory type. The ShardedSessionFactory does all the sharding magic. It uses the shard selection strategy I implemented in Listing 1 and reads and writes data in the two shards I've configured. (Listing 6 shows the configuration, which is almost identical for shard 0 and shard 1.)

All that is left is to wire up my domain objects, which, because they rely on Hibernate, need a SessionFactory in order to work. I'll simply use my ShardedSessionFactoryBuilder to supply one, as shown in Listing 9:
Listing 9. Wiring the POJOs in Spring

<bean id="mySessionFactory"
 factory-bean="shardedSessionFactoryBuilder"
 factory-method="createSessionFactory">
</bean>

<bean id="race_dao" class="org.disco.racer.domain.RaceDAOImpl">
 <property name="sessionFactory">
  <ref bean="mySessionFactory"/>
 </property>
</bean>

As you can see in Listing 9, I first create a factory-style bean in Spring; my RaceDAOImpl type has a property named sessionFactory, which is of type SessionFactory. The mySessionFactory reference then creates an instance of SessionFactory by calling the createSessionFactory method on the ShardedSessionFactoryBuilder that was defined in Listing 4.

When I ask Spring (which I use mainly as a giant factory that returns preconfigured objects) for an instance of my Race object, everything is in place. Although it isn't shown here, the RaceDAOImpl type is an object that uses Hibernate's template for data storage and retrieval. My Race type holds an instance of RaceDAOImpl, to which it delegates all datastore-related activity. Cozy, isn't it?

Note that my DAOs are not tied to Hibernate Shards in code; they are tied to it through configuration. The configuration (shown in Listing 5) binds them to a shard-specific UUID generation scheme, which means that I can reuse domain objects from an existing Hibernate implementation when I need to shard.


Sharding: A test drive with easyb

Next, I need to verify that my sharding implementation works. I have two databases and I'm sharding by distance, so when I create a marathon (a race longer than 10 miles), the Race instance should be found in shard 1. A shorter race, such as a 5-kilometer race (3.1 miles), should be found in shard 0. After creating a Race, I can check each database for the record.

In Listing 10, I create a marathon and then verify that the record really is in shard 1 and not in shard 0. To make things more interesting (and easier), I use easyb, a Groovy-based behavior-driven development framework that supports verification in natural language. easyb also works easily with Java code. Even if you don't know Groovy or easyb, you should be able to tell from the code in Listing 10 that everything works as expected. (Note that I helped create easyb and have written about it on developerWorks.)
Listing 10. An excerpt from an easyb story that verifies sharding correctness

scenario "races greater than 10.0 miles should be in shard 1 or db02", {
 given "a newly created race that is over 10.0 miles", {
  new Race("Leesburg Marathon", new Date(), 26.2,
   "Race the beautiful streets of Leesburg!").create()
 }
 then "everything should work fine w/respect to Hibernate", {
  rce = Race.findByName("Leesburg Marathon")
  rce.distance.shouldBe 26.2
 }
 and "the race should be stored in shard 1 or db02", {
  sql = Sql.newInstance(db02url, name, psswrd, driver)
  sql.eachRow("select race_id, distance, name from race where name=?",
   ["Leesburg Marathon"]) { row ->
   row.distance.shouldBe 26.2
  }
  sql.close()
 }
 and "the race should not be stored in shard 0 or db01", {
  sql = Sql.newInstance(db01url, name, psswrd, driver)
  sql.eachRow("select race_id, distance, name from race where name=?",
   ["Leesburg Marathon"]) { row ->
   fail "shard 0 contains a marathon!"
  }
  sql.close()
 }
}

Of course, my work isn't finished: I also need to create a short race and verify that it ends up in shard 0 and not in shard 1. You can see that verification in the code download that accompanies this article.


The pros and cons of sharding

Sharding can increase your application's read and write speeds, especially if the application holds large amounts of data (terabytes, say) or your domain is growing without limit, as at Google or Facebook.

Before you shard, make sure that the size and growth of your application warrant it. The costs (or disadvantages) of sharding include having to code application-specific logic for how data is stored and retrieved. Once you shard, you are more or less locked into your sharding model, because re-sharding is not easy.

Implemented correctly, sharding can solve the scale and speed problems of a traditional RDBMS. It is a very cost-effective choice for organizations that are tied to a relational infrastructure and cannot keep upgrading hardware to meet the demand for massive, scalable data storage.
