PostgreSQL Replication, Chapter 1: Understanding Replication Concepts (Part 3)


1.3 Using sharding and data distribution

In this section you will learn about basic scalability techniques, such as database sharding. Sharding is widely used in high-end systems and offers a simple and reliable way to scale out. In recent years, sharding has become a standard way to scale professional systems.

1.3.1 Understanding the purpose of sharding

What happens if your data volume grows beyond the processing power of a single machine? What if you are running so many transactions that one server simply cannot keep up? Let us assume you have millions of users and tens of thousands of them want to perform a certain task at the very same time.

Obviously, at some point you can no longer solve the problem by buying a server big enough to handle an infinite load; it is simply impossible to run an application the size of Facebook or Google on a single box. At some point, you have to come up with a scalability strategy that serves your needs. This is when sharding comes into play.

The idea behind sharding is simple: what if you could split the data in such a way that it can reside on different nodes?

An example of designing a sharded system

To illustrate the basic concept of sharding, let's assume the following situation:

We want to store information about millions of users. Each user has a unique user ID. We further assume that we have only two servers. In this case, we could store users with even IDs on server 1 and users with odd IDs on server 2.

The following diagram shows how this can be done:

As you can see in the diagram, the data has been distributed nicely. Once the data has been distributed, we can send queries to the system, for example:

SELECT * FROM T_user WHERE id = 4;

The client can easily figure out where to find the data by inspecting the filter in the query. In our example, the query will be sent to the first node because we are dealing with an even number.

Because we have distributed the data based on a key (in this case, the user ID), we can easily look up any user as long as we know that key. In large systems, it is common practice to refer to users via a key, so this approach is quite suitable. With this simple method, we can also easily double the number of machines in our setup.
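To make this more concrete, here is a minimal sketch (not taken from the book) of how server 1 could be set up so that it only ever holds the even user IDs; a CHECK constraint is just one possible way to enforce the split:

-- Sketch only: the t_user table as it could look on server 1 (even IDs).
-- The CHECK constraint rejects rows that belong on the other server.
CREATE TABLE t_user (
    id    int PRIMARY KEY,
    name  text,
    CHECK (id % 2 = 0)    -- server 2 would use CHECK (id % 2 = 1)
);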

When designing the system, we can pick basically any number of servers; all we have to do is come up with a clever partitioning function to distribute the data within the server farm. If we want to split the data across ten servers (not a problem), we can simply use id % 10 as the partitioning function.

However, when you try to distribute data across several hosts, you have to make sure that you are using a sane partitioning function that distributes the data nicely, so that each host ends up with more or less the same amount of data.
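One simple way to verify this (a sketch, assuming the t_user table from our example and id % 10 as the candidate function) is to count how many rows each partition would receive before committing to the function:

-- How many rows would each of the ten partitions receive under id % 10?
SELECT id % 10 AS partition, count(*) AS row_count
FROM t_user
GROUP BY id % 10
ORDER BY partition;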

Distributing users alphabetically may not be a good idea. The reason is that not all letters are equally likely; we simply cannot assume that names starting with the letters A to M occur as often as names starting with N to Z. This becomes a real problem if you want to distribute a data set across a thousand servers rather than just a handful of machines. As stated before, you need a robust partitioning function that produces an even distribution.

[In most cases, a hash function will provide you with nicely and evenly distributed data. This is especially useful when working with character fields such as names, email addresses, and so on.]
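As a small illustration (a sketch using PostgreSQL's built-in hashtext() function; any reasonable hash function would do), a text key such as a name can be turned into a partition number like this:

-- Derive a partition number from a text column; abs() is needed because
-- hashtext() may return negative values.
SELECT name, abs(hashtext(name)) % 10 AS partition
FROM t_user
LIMIT 5;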

Examples of querying different fields

In the previous section, you saw how a user can easily be looked up using a key. Let's dig a little deeper and see what happens with the following query:

SELECT * FROM t_user WHERE name = 'Max';

Remember, we distributed the data using the user ID, but in this query we are searching by name. The application cannot know which partition to use, because there is no rule telling us what is stored where.

The logical consequence is that the application has to ask every partition for the name. This may be acceptable if looking up a name is a rare operation; however, we cannot rely on that. Having to ask many servers instead of one is clearly a deoptimization and is therefore not acceptable.

We have two options to deal with this problem:

• Come up with a smarter partitioning function

• Store the data redundantly

Coming up with a smarter partitioning function would certainly be the best option, but it is close to impossible if you want to query the data by different fields.

This leaves us with the second option, which is to store the data redundantly. Storing a data set twice, or even more often, is not uncommon and is actually a good way to attack the problem. The following diagram shows how this can be done:

As you can see, we now have two clusters in this scenario. When a query arrives, the system has to decide which cluster holds the data we need. For queries by name, we have (for simplicity) simply split the data into two halves alphabetically. In the first cluster, the data is still split by user ID.
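As a rough sketch of the second, name-based cluster (again, just one possible way to lay it out), each node there could carry a CHECK constraint describing the slice of the alphabet it is responsible for:

-- Node 1 of the name-based cluster: names starting with A through M.
CREATE TABLE t_user (
    id    int,
    name  text,
    CHECK (lower(left(name, 1)) BETWEEN 'a' AND 'm')
);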

1.3.2 Advantages and disadvantages of sharding

It is important to understand that sharding is not a simple one-way street. Anyone who decides to use sharding should be aware of the advantages and disadvantages of the technology. As always, there is no silver bullet that magically solves every problem without further thought.

Every real-world use case is different, and there is no substitute for common sense and careful thinking.

First, let's look at the advantages of sharding, listed below:

• It allows you to scale a system beyond the capacity of a single server

• It is a fairly simple approach

• It is supported by many frameworks

• It can be combined with a variety of other replication methods

• It works nicely with PostgreSQL (for example, using PL/Proxy)

Light and shadow tend to go together, and so sharding also has its downsides, as follows:

• Adding servers to a running cluster can be cumbersome (depending on the type of partitioning function)

• Your flexibility may be severely reduced.

• Not all types of queries will be as efficient as they would be on a single server.

• The complexity of the overall setup increases (for example, for failover handling).

• Backup requires more planning.

• You may face redundancy and additional storage requirements.

• Application developers need to be aware of the shards to make sure they write efficient queries.

In Chapter 13, Using the PL/Proxy Extension, we will discuss how to use sharding efficiently together with PostgreSQL and how to set up PL/Proxy for maximum performance and scalability.

1.3.3 Choosing between sharding and redundancy

Learning how to shard a table is only the first step in designing a scalable system architecture. In the example shown earlier, we had just a single table, which could easily be split using a key. But what if we have more than one table? Let's assume we have two tables:

• A table named T_user stores users in our system

• A table named T_language stores the languages supported by our system

We might be able to split the t_user table nicely using some partitioning criterion and distribute it over a reasonable number of servers. But what about the t_language table? Our system might support no more than ten languages.

Sharding and distributing hundreds of millions of users is fine, but what about sharding ten languages? That is obviously pointless. In addition, we will most likely need the language table on every node so that we can join against it.

The way out of this problem is simple: we need a full copy of the language table on every node. This does not cause a storage problem, because the table is simply so small.
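Assuming a hypothetical language_id column in t_user, the benefit is that every shard can resolve the language locally; here is a small sketch:

-- t_language is fully replicated, so each node can join its local slice of
-- t_user against it without contacting any other node.
SELECT u.name, l.name AS language_name
FROM t_user AS u
JOIN t_language AS l ON l.id = u.language_id;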

[Make sure that only large tables are sharded. For small tables, full replication of the table may make a lot more sense.]

Again, every situation must be well thought out.

1.3.4 Increasing and decreasing the size of a cluster

So far, we have implicitly assumed that the size of a sharded setup is constant: we designed the setup around a fixed number of partitions inside the cluster. This limitation may not reflect everyday needs. How can you really tell, at design time, how many nodes will be needed at some point in the future? People may have a rough idea of the hardware requirements, but actually knowing the load in advance is more of an art than a science.

[To account for this, you have to design a system that can be resized easily.]

A common mistake is that people tend to grow their setup in unnecessarily small steps. Someone might want to move from five machines to six or seven. This can be tricky. Assume that at some point we have split the data using user_id % 5 as the partitioning function. What if we now want to use user_id % 6? This is not easy, because we would have to rebalance the data within our cluster to reflect the new rule.

Keep in mind that we turned to sharding (that is, partitioning) in the first place because we have so much data and so much load that a single server cannot handle the requests. If we now come up with a strategy that requires the data to be rebalanced, we are already on the wrong track. You certainly do not want to rebalance 20 TB of data just to add two or three servers to your existing setup.

In practice, it is easier to simply double the number of partitions. Doubling does not require rebalancing the data, because you can simply follow the strategy below:

• Create a copy of each partition

• Delete half the data on each partition

If your partitioning function was user_id % 5 before, it should be user_id % 10 afterwards. The advantage of doubling is that no data ever has to move between two of the old partitions. When it comes to doubling, people may object that the cluster then grows too quickly. This is true, but if the system is running at its limit, adding just 10 percent more resources will not solve the scalability problem anyway.
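Here is what the "copy, then delete half" step could look like for a single partition (a sketch assuming the old function was id % 5 and the new one is id % 10, and that partition 3 has just been copied to a fresh node):

-- Rows with id % 5 = 3 fall into partitions 3 and 8 under the new function.
-- On the old node, which keeps partition 3:
DELETE FROM t_user WHERE id % 10 = 8;
-- On the freshly copied node, which becomes partition 8:
DELETE FROM t_user WHERE id % 10 = 3;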

Instead of just doubling the cluster (which is fine for most cases), you can also invest more effort in writing a more sophisticated partitioning function that leaves old data in place but handles the most recent data more cleverly. A time-dependent partitioning function may cause problems of its own, but it can be a path worth investigating.

[Some NoSQL systems use range partitioning to spread out data. Range partitioning means that each server holds the data for a specific time frame. This can be beneficial if you want to perform time series analysis or similar work. However, it can be counterproductive if you want to make sure that the data is split evenly.]

If you expect your cluster to grow, we recommend starting with more partitions than you initially need and packing more than one partition onto a single server. Later on, it is easy to move individual partitions to hardware that has been added to the cluster. Some cloud services are able to do this, but that topic is beyond the scope of this book.
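A hypothetical sketch of such a layout: sixteen logical partitions packed onto four servers, tracked in a small mapping table so that individual partitions can later be moved to new hardware without re-splitting anything (all names here are made up for illustration):

CREATE TABLE partition_map (
    partition int PRIMARY KEY,   -- 0 .. 15, e.g. abs(hashtext(name)) % 16
    host      text NOT NULL      -- server currently holding this partition
);

INSERT INTO partition_map
SELECT p, 'node' || (p % 4)
FROM generate_series(0, 15) AS p;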

To shrink the cluster, you can simply apply the reverse strategy and move more than one partition onto a single server. This leaves the door open for adding servers again in the future.

1.3.5 Combining sharding and replication

Once the data has been broken up into useful chunks, each of which can be handled by one server or partition, we have to think about how to make the whole setup more reliable and fail-safe.

The more servers you have in your setup, the more likely it is that one of them will be down or unavailable for some other reason.

[Always avoid single points of failure when designing a highly scalable system.]

To ensure maximum throughput and maximum availability, we can again turn to redundancy. The design approach can be summed up in a simple formula that should always be in the back of a system architect's mind:

"One isnoneand the other is one"

One server is nowhere near enough to provide us with high availability. Every system needs a backup that can take over in case of a serious emergency. Splitting a data set does not, by itself, improve availability; on the contrary, we now have more servers, each of which can fail at any point in time. To fix this problem, we can add replicas to each of our partitions (shards), as shown in the following diagram:

Each partition is a separate PostgreSQL database instance, and each instance can have its own replica (or several replicas).

Keep in mind that you can choose from the whole arsenal of features and options discussed in this book (for example, synchronous and asynchronous replication). All of the strategies described in this book can be combined flexibly; a single technique is often not enough, so feel free to combine the various technologies in different ways to achieve your goals.

1.3.6 Various sharding solutions

In recent years, sharding has emerged as an industry-standard answer to many scalability-related problems. As a result, many programming languages, frameworks, and products already provide plug-and-play support for sharding.

When implementing sharding, you can basically choose between two strategies:

• Rely on a framework or middleware

• Rely on PostgreSQL-side means to solve the problem

In the next two sections, we will briefly discuss both options. This short overview is not meant to be a comprehensive guide; rather, it is an overview to get you started with sharding.

PostgreSQL-based Sharding

PostgreSQL cannot shard data out of the box, but it provides all the interfaces and means needed to implement sharding through add-ons. One of the most widely used add-ons is PL/Proxy. It has been around for many years and offers excellent transparency as well as good scalability.

The basic idea behind PL/Proxy is to use a local virtual table to hide an array of servers behind it.
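To give a first impression of what this looks like in practice, here is a minimal sketch of a PL/Proxy function; the cluster name 'usercluster' and the function name are made up for illustration, and the cluster configuration itself is omitted (Chapter 13 covers the real setup):

-- Runs the lookup on exactly one partition of the cluster; the lower bits of
-- the key select the target partition.
CREATE FUNCTION get_user(i_id integer)
RETURNS SETOF t_user AS $$
    CLUSTER 'usercluster';
    RUN ON i_id;
$$ LANGUAGE plproxy;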

PL/Proxy will be discussed in depth in Chapter 13, Using the PL/Proxy Extension.

External Framework/Middleware

Instead of relying on PostgreSQL alone, you can also use external tools. Some of the most widely used and well-known tools are:

• Hibernate Shards (Java)

• Rails (Ruby)

• SQLAlchemy (Python)

1.4 Summary

In this chapter, you learned about basic replication-related concepts as well as the physical limitations involved. We covered theoretical concepts that form the foundation for material appearing later in this book.

In the next chapter, you will be guided through the PostgreSQL transaction log, and we will outline all the important aspects of this vital component. You will learn what the transaction log is good for and how it can be applied.
