From Shard to Sharding
The word "shard" means "fragment" in English. As a database-related technical term, it seems to have first appeared in massively multiplayer online role-playing games (MMORPGs). "Sharding" is then the practice of splitting data into such shards.
Sharding is not a new technology, but a relatively simple software concept. As you know, table partitioning only became available after MySQL 5, so many potential MySQL users had concerns about its scalability, and whether a database supports partitioning is a key indicator of its scalability (though of course not the only one). Database scalability is a perennial topic, and MySQL advocates are often asked: how do you handle application data on a single database once it needs to be split up? The answer is sharding.
Sharding is not a feature tied to a particular database product, but an abstraction above the specific technical details: a solution for horizontal scaling (scale out). Its main purpose is to get past the I/O limits of a single database server node and so solve the database scalability problem.
Database scalability
Database scalability is a very big topic. Today's commercial databases have their own scalability solutions, which in the past were relatively mature. But the rapid development of the Internet has inevitably changed the computing model, and many mainstream commercial systems have exposed shortcomings as a result. Oracle's RAC, for example, uses a shared-storage mechanism, so for I/O-intensive applications the bottleneck easily falls on the storage. Such a mechanism means that further expansion can only be of the scale-up type, which is relatively costly in hardware, developer requirements, and maintenance.
Sharding is basically a scalability solution for open source databases; one rarely hears of a commercial database being sharded. The current industry trend is to embrace scale-out and gradually move away from scale-up.
Application scenarios for sharding
Any technology plays its proper role only in the right situation, and sharding is no exception. Online games, IM, and BSPs (blog service providers) are well suited to sharding. What they have in common is that their abstracted data objects have very few relationships to one another. In IM, for example, if each user is abstracted into a data object, that object can be stored independently anywhere; the data objects share nothing. Likewise, the content of a blog service provider's site is mostly user-generated content (UGC), so different users can be completely isolated in different storage sets, and this is transparent to the user.
"Share nothing" is a term borrowed from database clustering. Some kinds of data are not "share nothing" at this granularity. Take a transaction history table: each record contains both seller and buyer information. Over time, buyers and sellers keep trading with other users, so the two parties' information inevitably ends up spread across different shard databases. If a buyer then queries his purchases, the query has to span multiple shards, and the cost is relatively high.
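To make that cost concrete, here is a minimal sketch, not taken from any of the systems discussed here. It assumes trade records are placed on a shard chosen by seller ID; the shard count, the modulo placement rule, and all function names are hypothetical. A seller's query touches one shard, while a buyer's query has to visit every shard.

```python
NUM_SHARDS = 4  # hypothetical shard count, chosen only for illustration


def shard_for(user_id: int) -> int:
    """Pick a shard index from a user ID (assumed placement rule)."""
    return user_id % NUM_SHARDS


def store_trade(shards, trade):
    """Store a trade record on the shard that owns the seller."""
    shards[shard_for(trade["seller_id"])].append(trade)


def sales_of(shards, seller_id):
    """Seller query: only the seller's own shard is read."""
    shard = shards[shard_for(seller_id)]
    return [t for t in shard if t["seller_id"] == seller_id], 1


def purchases_of(shards, buyer_id):
    """Buyer query: every shard must be scanned, since placement ignores the buyer."""
    hits = []
    queried = 0
    for shard in shards:
        queried += 1
        hits.extend(t for t in shard if t["buyer_id"] == buyer_id)
    return hits, queried
```

Both functions return the matching records together with the number of shards they had to read, which is the quantity that grows as more shards are added.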
Sharding is not a silver bullet for database scaling, and it does not suit every scenario; transaction-oriented applications, for example, can become very complicated. For transactions that span databases, integrity is hard to guarantee, and the costs can outweigh the gains. So the form of sharding to use should be decided case by case, not applied mechanically.
The difference between sharding and database partitioning (Partition)
Sharding sometimes resembles horizontal partitioning, and many places on the web use "horizontal partitioning" to mean sharding, but I personally think the two differ. The idea of sharding does come from partitioning, but database partitioning basically works at the level of data objects, such as table and index partitions, where each subset can have different physical storage properties; it remains an operation within a single database. Sharding, by contrast, can span databases and even physical machines. (See comparison table)
Sharding strategy
Data sharding strategies have much in common with table partitioning strategies. You can shard by table, by ID range, by the time the data was generated, or by service under the SOA approach. Unlike traditional table partitioning, though, a sharding strategy is more tightly bound to the business. Successful sharding requires knowing your own business well and doing feasibility analysis on that basis; it must be "business logic driven."
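As an illustration of two of these strategies, the following is a minimal sketch under stated assumptions: the database names, the range boundaries, and the table mapping are all hypothetical, and a real deployment would derive them from its own business analysis.

```python
import bisect

# Hypothetical exclusive upper bounds of each ID range, in ascending order.
RANGE_BOUNDS = [1_000_000, 2_000_000, 3_000_000]
# Hypothetical database that holds each range (same order as RANGE_BOUNDS).
RANGE_SHARDS = ["db0", "db1", "db2"]

# Hypothetical table-to-database mapping for sharding by table.
TABLE_SHARDS = {"users": "db_users", "posts": "db_posts"}


def shard_by_id_range(record_id: int) -> str:
    """Return the database whose ID range contains record_id."""
    idx = bisect.bisect_right(RANGE_BOUNDS, record_id)
    if record_id < 0 or idx >= len(RANGE_SHARDS):
        raise ValueError(f"no shard configured for id {record_id}")
    return RANGE_SHARDS[idx]


def shard_by_table(table: str) -> str:
    """Return the database that holds an entire table."""
    return TABLE_SHARDS[table]
```

With range-based routing, adding capacity means appending a new bound and database name; existing records do not move.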
Sharding implementation case study: Digg
Digg.com is a Web 2.0 site with a huge user base, yet its database is not large: at this time last year the main data was only about 30GB. It should be larger now, but should not have grown by an order of magnitude. The database software is MySQL 5.x. Digg.com's I/O load is very heavy, and the application is read-dominated (98% of I/O is read requests). Because the site provides a news service, its data has a particular characteristic: the most recent data usually carries most of the read load.
Based on these business characteristics, Digg.com sharded its main business data by time range. This effectively isolates the "hot" data, less than 10% of the total, and puts it on better hardware to give a better user experience. The other 90% of the data is rarely accessed, so even if access to it is slightly slower, the impact on users is minimal. Through sharding, Digg achieved the effect it wanted.
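The idea of time-range sharding can be sketched as follows. This is only an illustration of the approach, not Digg's actual implementation: the 30-day window and the names hot_db and archive_db are assumptions made for the example.

```python
from datetime import datetime, timedelta

# Assumed cutoff: anything newer than this counts as "hot" data.
HOT_WINDOW = timedelta(days=30)


def shard_for_story(created_at: datetime, now: datetime) -> str:
    """Route recent records to the fast shard, older ones to the archive shard."""
    if now - created_at <= HOT_WINDOW:
        return "hot_db"      # hypothetical shard on better hardware
    return "archive_db"      # hypothetical shard for rarely read data
```

Passing the current time in as a parameter keeps the routing rule easy to test; a job that periodically moves records past the cutoff from the hot shard to the archive shard would be needed as well and is not shown.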
Introduction to existing sharding software
There are actually many sharding-related software implementations today, at the database layer and at the DAO layer, and in a variety of languages. Space is limited, so here is only a brief introduction.
MySQL Proxy + Hscale
A rather promising combination. MySQL Proxy (http://forge.mysql.com/wiki/MySQL_Proxy) is scripted in Lua and sits between the client and the server, playing the role of a proxy and providing query analysis, failover, query filtering and rewriting, and other functions. The current 0.6 version still does not support read/write splitting. Hscale is a plug-in for MySQL Proxy, also implemented in Lua, which simplifies the sharding process considerably. Note that MySQL Proxy and Hscale each add some overhead, but this overhead is still small compared with the cost of a single query under centralized data processing.
Hibernate Shards
This is a contribution from a Google engineering team (http://www.hibernate.org/414.html), born out of sharding the data of Google's financial system. Because it is implemented at the framework level, it has its own distinctive features: it keeps the standard Hibernate programming model, so anyone who can use Hibernate can use it and the technical cost is low; its sharding strategies are relatively flexible; and it supports virtual shards.
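The general idea behind virtual shards can be sketched in a few lines. This is a language-neutral illustration in Python and does not use or reproduce the Hibernate Shards API; the number of virtual shards, the database names, and the function names are all hypothetical. Records are assigned to a fixed set of virtual shards, and a separate table maps each virtual shard to a physical database, so data can later be moved by editing the table instead of changing the assignment rule.

```python
NUM_VIRTUAL_SHARDS = 8  # hypothetical; fixed for the lifetime of the data

# Hypothetical initial layout: 8 virtual shards spread over 2 physical databases.
virtual_to_physical = {v: f"db{v % 2}" for v in range(NUM_VIRTUAL_SHARDS)}


def virtual_shard(user_id: int) -> int:
    """Assign a record to a virtual shard; this rule never changes."""
    return user_id % NUM_VIRTUAL_SHARDS


def physical_shard(user_id: int, mapping=virtual_to_physical) -> str:
    """Resolve the virtual shard to the database that currently holds it."""
    return mapping[virtual_shard(user_id)]


# Adding a third database: only virtual shard 3 is remapped (and its data copied).
rebalanced = dict(virtual_to_physical)
rebalanced[3] = "db2"
```

After the remapping, only the records in virtual shard 3 resolve to the new database; every other record stays where it was.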
Spock Proxy
This is another open source project born of real-world need. Spock (http://www.spock.com/) is a Web 2.0 people-search site. The Spock Proxy (http://spockproxy.sourceforge.net/) project came out of Spock's work to shard its own single database effectively. Spock Proxy is a fork of MySQL Proxy that provides a range-based sharding mechanism. Spock is built on Rails, so Spock Proxy is also built around Rails, and anyone interested in RoR should not miss this project.
Hivedb
The project above targets RoR, whereas Hivedb (http://www.hivedb.org/) is implemented in Java. One further difference is that this project is backed by a commercial company.
Pl/proxy
The tools above are all sharding solutions for MySQL; Pl/proxy is for PostgreSQL. Its design resembles Teradata's hash mechanism: data storage is transparent to the client, and client requests are sent to Pl/proxy, which dispatches them uniformly as distributed stored procedure calls. Pl/proxy is designed to act as a "data bus" at this layer, so when throughput becomes insufficient you only need to add more Pl/proxy servers. The well-known Skype uses the Pl/proxy solution.
Pyshards
This is a Python-based solution (http://code.google.com/p/pyshards/wiki/Pyshards). One goal of the tool is to have rebalancing built in, which is a rather ambitious idea. Currently only MySQL is supported.
Conclusion
Sharding is an "old" technology that is still developing rapidly. With the growth of Web 2.0, sharding has gradually moved from a rather "abstract" concept to a more "concrete" practice, and the open source wave has injected new vitality into it. I believe more and more projects will adopt sharding, and that more mature sharding solutions and database add-on software will emerge.
Is your site sharded yet?