Analysis of Greenplum Technology

Source: Internet
Author: User

Speaking of Greenplum, SUN was the first product that was introduced when SUN pushed their data warehouse product DWA. He was not very familiar with this database product stacked by PgSQL, at that time, the focus was still on DWA's hardware. Of course, DWA still had some features.

Later, we found that general PC + SAS disks have very good throughput capabilities, not inferior to some expensive storage devices. In this way, we tried to build an environment with PC + Greenplum, and the effect completely exceeded our expectation, and the throughput completely exceeded our large storage. Since then, we have no longer been superstitious about expensive hosts and storage and started to try new things, such as using PC + SAS/SATA to stack low-cost storage, use Greenplum to build a data warehouse computing environment, search for hadoop clusters, PC + SSD to build an OLTP database, and use Intel Nehalem to replace minicomputers.

Yesterday, I went to the Data Warehouse department to share a technology on Greenplum, during which I listed a lot of performance data comparisons, especially the comparison with the current set of Oracle RAC. The result is self-evident. The performance of Data Warehouse applications, especially the processing of large data volumes, varies greatly. Now the problem is coming. Many people feel that this product is amazing and can solve all the problems in the data warehouse, as if it is a gift from God. Finally, many people are asking: Oracle is so bad. Why should we use such a good device with such poor performance? The Greenplum is good, but it is not "magical". We should not be blocked by the "magic" data.

For Greenplum, I am actually in a completely unknown state. I would like to explain to you why Greenplum is faster? What "magic" technology does he use?

There are two main factors to improve the processing capability of a data warehouse: first, the throughput capability is the so-called IO; second, the parallel computing capability.

We all know that Oracle RAC is a shared everything architecture, while Greenplum is a shared nothing architecture. The entire cluster consists of multiple segment hosts (data nodes) and master hosts (control nodes). Each segment host runs multiple PgSQL databases (segment ).

When data enters the database, we first need to do data distribution, that is, to evenly distribute the data of a table to each segment as much as possible. We need to specify a distribute column for each table, then, data is distributed based on hash. The purpose of this operation is to make full use of the I/O capabilities of each node. We know that the I/O capabilities of PCs are considerable now. Sun Fire x4500 server, a specially designed data node such as DWA, A box is integrated with 48 SATA disks, known as "Scan 1 terabyte of data in 60 seconds ". In fact, there is no need to buy DWA, domestic manufacturers all have that disk-intensive PC, the price is cheap enough, we have been using it.

When many people see the greenplum architecture, the first problem is what features the master machine has taken on? Will it become a bottleneck of the system? This is also an important feature of the greenplum system. The master only undertakes a very small amount of control functions and interacts with the client, and does not undertake any computing at all. If there is a central node, it means that the system has no way to linear expansion, because the master will definitely become the bottleneck of the system. This problem does not exist in greenplum. The data interaction between nodes does not need to pass through the master node, but is completed directly between nodes.

Now, if we want to query the data of a table, we only need to allocate the work to each node. IO is no longer a problem. Next we need to solve the problem of parallel computing, the core issue is to join multiple tables. Because the table is distributed through the DT column, each node knows that the data is on a node through the DT column. Assume that the two tables use the DT column for join, because the same data is on the same node, you only need to calculate the corresponding node and then merge the results. If the join operation is performed on non-DT columns, A redistribute process is performed because data distribution is unknown between nodes ). Let's take a look at the example below. All three tables use id columns as DT columns. First, we use id for join. Because we design a join to a non-DT column, Greenplum will perform redistribute, the function is to re-distribute data according to hash. The purpose of this function is to let the node know the node on which the data is located, so as to complete the join operation. We can see that the group by column is also redistristri, because the group by column is not a DT column, and the hash aggregate action also requires data exchange between nodes, and the distribution of data must be known between nodes. Is redistribute efficient? Because redistribute only targets the required data, and all the data is completed in the node cache, it must be slower than joining the DT column, but the efficiency is still very high.

Greenplum truly gives full play to the ubiquitous advantage of parallelism. It starts multiple PgSQL databases on a single host at the same time, so that the multi-core CPU on the hardware can take full advantage of it. Someone asked me: Can Greenplum process multiple tasks in parallel? The answer is: Impossible. Since Greenplun has fully utilized the machine's IO and processing capabilities, it is no longer possible to process multiple tasks at the same time.

Greenplum also has an interesting feature: During data loading, instead of having a central data distribution node as we generally imagine, all nodes read data at the same time, and then according to the hash algorithm, store your own data and send the data of other nodes to the node through the network. Therefore, the data loading speed is very fast.

Greenplum HA Architecture

Currently, Greenplum is not magical. In fact, Oracle RAC is also a very good solution for data warehouses. Similar technologies are available in all Oracle. We can assume that, for a fixed SQL statement, I can use Oracle RAC to do Greenplum, we can partition the table by Hash + Range (in fact, Greenplum is also a hash + range partition, and hash is used to distribute data to different databases, then, use range to partition the tables in each data base, and then use the parallel processing capability of RAC. Oracle also has a similar partition-wise join function, but there is no redistribute operation for data. The biggest problem with Oracle is the shared everything architecture. As a result, I/O processing capability is limited, and our large storage throughput is 1.4 GB/S, And the scalability is limited. Oracle database machine, which has previously been introduced, is a solution specifically provided by Oracle for data warehouses.

In fact, there are some magical technologies. The reason why Greenplum is magical is that our scenario has played its own role. In fact, we can also design a scenario to draw a very bad conclusion about Greenplum, therefore, do not trust the vendor's data or the technology that can solve all problems.

"Don't be infatuated with Brother, brother is just a legend. "

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.