Open Source Database Sharding Technology

Source: Internet
Author: User
Tags comparison table database sharding

Content abstract: Sharding is not a feature tied to a specific database product but an abstraction layered on top of concrete technical details. It is a horizontal scaling (scale-out) solution whose main purpose is to break through the I/O limits of a single-node database server and solve database scalability problems.

  From Shard to sharding

The English word "shard" means "fragment". As a database-related technical term, it seems to have first appeared in massively multiplayer online role-playing games (MMORPGs). "Sharding" is the practice of splitting data into such fragments.

Sharding is not a new technology but a relatively simple software concept. As is well known, MySQL only gained table partitioning in the 5.x series. Before that, many potential MySQL users worried about MySQL's scalability, and whether partitioning was available was a key indicator (though not the only one) of a database's scalability. Database scalability is a perennial topic. MySQL advocates are often asked: when a single database can no longer cope with the application data and something like partitioning is needed, how is it done? The answer is sharding.

Sharding is not a feature tied to a specific database product but an abstraction layered on top of concrete technical details. It is a horizontal scaling (scale-out) solution whose main purpose is to break through the I/O limits of a single-node database server and solve database scalability problems.

  Database scalability

Database scalability is a big topic. Commercial databases have their own scalability solutions, which were relatively mature in the past. However, the rapid development of the Internet inevitably changes some computing models, and as a result many mainstream commercial systems expose shortcomings. Oracle RAC, for example, uses a shared-storage mechanism, so for I/O-intensive applications the storage easily becomes the bottleneck. That mechanism means later expansion can only be done by scaling up, which is relatively costly in hardware, developer skill requirements, and maintenance.

Sharding is essentially a scalability solution for open-source databases; it is rarely mentioned in connection with commercial databases. The current industry trend is to embrace scale-out and gradually move away from scale-up.

  Sharding application scenarios

Every technology works best in the right setting, and sharding is no exception. Online games, instant messaging (IM), and blog service providers (BSPs) are well suited to sharding. What these scenarios have in common is that the abstracted data objects have very few associations with one another. In IM, for example, if each user is abstracted into a data object, that object can be stored independently anywhere, and the data objects share nothing. Likewise, the content on a blog service provider's site is mostly user-generated content (UGC), which can be split by user into different storage sets in a way that is transparent to users.
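As a minimal sketch of the share-nothing idea for IM users, the following routes each user to one shard by hashing the user ID. The shard names and the helper functions are hypothetical, and plain dictionaries stand in for real databases.

```python
import zlib

# Hypothetical shard names; a real deployment would map these to connection settings.
SHARDS = ["im_shard_0", "im_shard_1", "im_shard_2", "im_shard_3"]


def shard_for_user(user_id: str) -> str:
    """Pick a shard from a stable hash of the user ID."""
    # zlib.crc32 is deterministic across processes, unlike the built-in hash() for str.
    return SHARDS[zlib.crc32(user_id.encode("utf-8")) % len(SHARDS)]


def store_message(stores: dict, user_id: str, message: str) -> None:
    """Append a message to the owning user's shard (dicts stand in for databases)."""
    shard = stores.setdefault(shard_for_user(user_id), {})
    shard.setdefault(user_id, []).append(message)


stores = {}
store_message(stores, "alice", "hello")
store_message(stores, "alice", "again")
store_message(stores, "bob", "hi")
```

Because every operation on a user touches only that user's shard, no shard ever needs data held by another one.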

This "share nothing" is a concept borrowed from database clusters. Some types of data are not "share nothing" at this granularity, such as history tables of transaction records. If a record contains both seller and buyer information, and buyers and sellers keep trading with other users over time, then the records for a given buyer and seller inevitably end up spread across different shard databases. Querying by seller then becomes more expensive.

Sharding is not a silver bullet for database scaling, and there are scenarios where it does not fit. Transaction-heavy applications, for example, become very complicated: for transactions that span different databases, integrity is hard to guarantee and the gain is not worth the cost. Sharding should therefore be applied flexibly rather than as a rigid formula.
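The extra cost can be illustrated with a small sketch. It assumes, purely for illustration, that transaction records are placed by buyer ID across four shards (lists stand in for databases); a buyer's history then needs one shard, while a seller's history needs all of them.

```python
NUM_SHARDS = 4  # assumed shard count
shards = [[] for _ in range(NUM_SHARDS)]  # each list stands in for one database


def record_transaction(buyer_id: int, seller_id: int, amount: int) -> None:
    # Records are placed by buyer, so one buyer's history lives on one shard.
    shards[buyer_id % NUM_SHARDS].append((buyer_id, seller_id, amount))


def history_for_buyer(buyer_id: int) -> list:
    # Single-shard lookup: only the buyer's own shard is read.
    return [t for t in shards[buyer_id % NUM_SHARDS] if t[0] == buyer_id]


def history_for_seller(seller_id: int) -> tuple:
    # Fan-out lookup: the seller's records may sit on any shard.
    rows, visited = [], 0
    for shard in shards:
        visited += 1
        rows.extend(t for t in shard if t[1] == seller_id)
    return rows, visited


record_transaction(1, 100, 30)
record_transaction(2, 100, 45)
record_transaction(3, 100, 12)
```

Here `history_for_seller(100)` has to read all four shards to collect three records, which is the overhead the paragraph above describes.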

  Sharding and database partitioning

Sharding is sometimes treated as roughly the same thing as horizontal partitioning, and in many places on the Internet the term horizontal partitioning is used to refer to sharding. I personally think the two differ. Sharding does derive from partitioning, but database partitioning is mostly handled at the level of data objects such as tables and indexes, where each sub-dataset can have different physical storage attributes, and it operates within a single database. Sharding can span databases or even physical machines. (See the comparison table)

  Sharding strategy

Sharding strategies resemble table partitioning in many ways: data can be split by table, by ID range, by the time the data was generated, or by service in an SOA design. Unlike traditional table partitioning, a sharding strategy is more tightly coupled to the business. Successful sharding requires a thorough understanding of one's own business and must rest on extensive feasibility analysis; it is "business logic-driven".
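An ID-range strategy, one of the options listed above, might look like the following sketch. The range boundaries and shard names are assumptions chosen for illustration.

```python
import bisect

# Assumed exclusive upper bound of each ID range, and the hypothetical shard that owns it.
RANGE_UPPER_BOUNDS = [1_000_000, 2_000_000, 3_000_000]
RANGE_SHARDS = ["shard_a", "shard_b", "shard_c"]


def shard_for_id(record_id: int) -> str:
    """Return the shard whose ID range contains record_id."""
    if record_id < 0:
        raise ValueError("record_id must be non-negative")
    # bisect_right counts the bounds <= record_id, which is the index of the owning range.
    idx = bisect.bisect_right(RANGE_UPPER_BOUNDS, record_id)
    if idx == len(RANGE_SHARDS):
        raise ValueError(f"no shard configured for id {record_id}")
    return RANGE_SHARDS[idx]
```

Range-based routing keeps neighboring IDs together, which suits range scans, but it can concentrate load on the newest range; whether that trade-off is acceptable depends on the business, as noted above.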

  Case study of sharding: Digg website

Digg.com, one of the most popular Web 2.0 websites, has a large user base, but its database is not massive. Around this time last year the primary data was about 30 GB; it should be larger now, but probably not by an order of magnitude. The database software is MySQL 5.x. Digg.com is under heavy I/O pressure and is a read-intensive application (98% of I/O is read requests). Because it provides a news service, its data has a distinctive pattern: recent data bears most of the read load.

Based on these business characteristics, Digg.com sharded its main business data by time range, effectively isolating the less than 10% of the data that is "hot" and serving it from better hardware to give a better user experience. The remaining 90% of the data is rarely accessed, so although access to it is somewhat slower, the impact on users is very small. Through sharding, Digg achieved the expected results.
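A time-range split of this kind can be sketched as follows. The 30-day cutoff and the shard labels are assumptions for illustration; the article does not say where Digg drew the line or how its routing was implemented.

```python
from datetime import datetime, timedelta

HOT_WINDOW = timedelta(days=30)  # assumed cutoff, not a figure from Digg


def shard_for_story(published_at: datetime, now: datetime) -> str:
    """Route recent stories to the 'hot' shard and everything older to 'archive'."""
    return "hot" if now - published_at <= HOT_WINDOW else "archive"
```

For example, with `now = datetime(2008, 6, 30)`, a story published on 20 June 2008 goes to the hot shard, while one published on 1 January 2008 goes to the archive shard. Only the hot shard then needs the faster hardware.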

  Introduction to existing sharding Software

There are currently many sharding-related software implementations, some at the database layer and some at the DAO layer, and in a variety of languages. A brief introduction follows.

  MySQL Proxy + HSCALE

A promising combination. MySQL Proxy (http://forge.mysql.com/wiki/MySQL_Proxy) is scripted in Lua and sits between the client and the server as a proxy, providing query analysis, failover, query filtering, query modification, and similar functions. As of version 0.6 it cannot yet perform read/write splitting. HSCALE is a MySQL Proxy plug-in, also implemented in Lua, that simplifies the sharding process. Both MySQL Proxy and HSCALE add some overhead, but this overhead is small compared with the cost of running each query against a single centralized database.

  Hibernate Shards

This project was contributed by Google's engineering team (http://www.hibernate.org/414.html) and grew out of sharding the data of Google's financial system. Because it is implemented at the framework layer, it has some distinctive features: it keeps the standard Hibernate programming model, so anyone who knows Hibernate can use it at low technical cost; it offers relatively flexible sharding strategies; and it supports virtual shards.
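The general idea behind virtual shards can be sketched in a few lines. This is not the Hibernate Shards API; the names, the count of eight virtual shards, and the database labels are all hypothetical.

```python
NUM_VIRTUAL_SHARDS = 8  # assumed; fixed for the lifetime of the data set

# Many virtual shards map onto a few physical databases (hypothetical names).
virtual_to_physical = {v: f"db{v % 2}" for v in range(NUM_VIRTUAL_SHARDS)}


def virtual_shard(customer_id: int) -> int:
    # The key-to-virtual-shard rule never changes.
    return customer_id % NUM_VIRTUAL_SHARDS


def physical_shard(customer_id: int) -> str:
    # Only this lookup table changes when data is moved.
    return virtual_to_physical[virtual_shard(customer_id)]


before = physical_shard(13)       # virtual shard 5 currently lives on db1
virtual_to_physical[5] = "db2"    # rebalance: move virtual shard 5 to a new database
after = physical_shard(13)        # the same customer now resolves to db2
```

Because keys are bound to virtual shards rather than to machines, adding a database means moving whole virtual shards and updating the table, without rehashing every key.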

  Spock Proxy

This is another open-source project born of practical need. Spock (http://www.spock.com/) is a Web 2.0 people-search website. The Spock Proxy project (http://spockproxy.sourceforge.net/) grew out of the need to shard its single database effectively. Spock Proxy is a fork of MySQL Proxy and provides a range-based sharding mechanism. Spock runs on Rails, so Spock Proxy was built with Rails applications in mind; readers interested in Ruby on Rails (RoR) should not miss this project.

  HiveDB

The previous project targets RoR, while HiveDB (http://www.hivedb.org/) is implemented in Java. Another small difference is that the project is backed by a commercial company.

  PL/Proxy

The solutions above are all for MySQL, while PL/Proxy is for PostgreSQL. Its design is similar to Teradata's hash mechanism. Data storage is transparent to the client: client requests are sent to PL/Proxy, which calls stored procedures on the distributed databases and routes the calls in a uniform way. PL/Proxy is designed to act as a "data bus" at this layer, so when throughput becomes insufficient you only need to add more PL/Proxy servers. Skype famously uses PL/Proxy.
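The "data bus" pattern, in which a proxy layer hashes a key and forwards a procedure call to the partition that owns it, can be sketched as below. This is an illustration in Python, not PL/Proxy code; the `Partition` class, the procedure names, and the power-of-two partition count (which lets a bit mask select the partition) are assumptions.

```python
import zlib


class Partition:
    """Stands in for one backend database holding part of the users."""

    def __init__(self, name: str):
        self.name = name
        self.emails = {}

    def set_email(self, username: str, email: str) -> str:
        self.emails[username] = email
        return self.name

    def get_email(self, username: str):
        return self.emails.get(username)


PARTITIONS = [Partition(f"part{i}") for i in range(4)]  # 4 is a power of two


def call_on_partition(procedure: str, username: str, *args):
    """Hash the key, pick a partition with a bit mask, and forward the call."""
    mask = len(PARTITIONS) - 1
    target = PARTITIONS[zlib.crc32(username.encode("utf-8")) & mask]
    return getattr(target, procedure)(username, *args)


owner = call_on_partition("set_email", "alice", "alice@example.com")
```

The caller only names the procedure and the key; which partition actually serves the call stays hidden behind `call_on_partition`, so the proxy layer itself holds no data and can be replicated freely.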

  Pyshards

Pyshards (http://code.google.com/p/pyshards/wiki/Pyshards) is a Python-based solution. One design goal of this tool is re-balancing, which is a fairly ambitious idea. Currently, only MySQL is supported.

  Conclusion

Sharding is an "old" technology that is still developing rapidly. With the growth of Web 2.0, sharding has gradually moved from a "virtual" concept to a "real" one, and the tide of open-source software has injected new vigor into it. I believe more and more projects will adopt sharding, and more mature sharding solutions and supporting database software will emerge.

Has your site been sharded?
