Greenplum + Hadoop learning notes 11: distributed database storage and query processing

Source: Internet
Author: User

3.1. Distributed Storage
Greenplum is a distributed database system, so all of its business data is physically stored in the databases of the Segment instances in the cluster. Every table in a Greenplum database is distributed: each table is sliced, and each Segment instance database stores its own fragment of the data. For example, the data in the four tables sale, customer, vendor, and product is sliced and stored across all segments. All Segment instances work at the same time, and because each Segment only has to process part of the data, computing efficiency improves greatly.



3.2. Table distribution policy: the foundation of parallel computing
3.2.1. Hash Distribution
Syntax format:
CREATE TABLE ... DISTRIBUTED BY (column [,…])
Rows with the same distribution key value are always allocated to the same Segment. When selecting the Hash distribution policy, you can specify one or more columns of the table as the Hash Key. GP calculates a hash value for each row from the specified Hash Key columns and maps it to the corresponding Segment instance. When the values of the selected Hash Key columns are unique, the data is distributed evenly across all Segment instances. The GP database uses Hash distribution by default. If no Distributed Key is specified when the table is created, the Primary Key is used as the Distributed Key. If the table has no Primary Key, the first column of the table is used as the Distributed Key.
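
The mapping from Hash Key to Segment can be illustrated with a small simulation. This is only a sketch under stated assumptions: the segment count, the `segment_for` helper, and the use of CRC32 are hypothetical stand-ins and are not Greenplum's actual hash function.

```python
import zlib
from collections import Counter

NUM_SEGMENTS = 4  # hypothetical cluster size


def segment_for(key_values, num_segments=NUM_SEGMENTS):
    """Map a row's distribution-key values to a segment index.

    CRC32 is a stand-in chosen because it is deterministic;
    it is NOT the hash function Greenplum uses internally.
    """
    encoded = "|".join(str(v) for v in key_values).encode("utf-8")
    return zlib.crc32(encoded) % num_segments


# Unique key values: rows spread over the segments.
unique_rows = Counter(segment_for((cust_id,)) for cust_id in range(10_000))

# Low-cardinality key: all rows sharing a value pile up on one segment.
regions = ["east"] * 9_000 + ["west"] * 1_000
skewed_rows = Counter(segment_for((region,)) for region in regions)

print("unique key:", sorted(unique_rows.items()))
print("skewed key:", sorted(skewed_rows.items()))
```

The second counter shows why a key with few distinct values is a poor choice: equal values always hash to the same segment, so the data can occupy at most as many segments as there are distinct values.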

3.2.2. Cyclic (random) Distribution
Syntax format:
CREATE TABLE ... DISTRIBUTED RANDOMLY
Rows with the same value are not guaranteed to be stored on the same Segment, so joins and aggregations on any column may require data to be moved between segments. For this reason, random distribution is generally not recommended.
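
A minimal sketch of cyclic placement shows the consequence described above. It assumes a strict round-robin assignment; the segment count, the `distribute_round_robin` helper, and the sample rows are hypothetical, and the real placement algorithm may differ.

```python
from collections import defaultdict
from itertools import cycle

NUM_SEGMENTS = 4  # hypothetical cluster size


def distribute_round_robin(rows, num_segments=NUM_SEGMENTS):
    """Assign rows to segments in turn, ignoring their content."""
    segments = defaultdict(list)
    for seg, row in zip(cycle(range(num_segments)), rows):
        segments[seg].append(row)
    return segments


# Eight rows that all belong to the same customer.
rows = [{"cust_id": 7, "amount": n} for n in range(8)]
placed = distribute_round_robin(rows)

# Which segments hold at least one row for customer 7?
segments_holding_cust_7 = sorted(
    seg for seg, seg_rows in placed.items()
    if any(r["cust_id"] == 7 for r in seg_rows)
)
sizes = {seg: len(seg_rows) for seg, seg_rows in placed.items()}

print("segments holding cust_id 7:", segments_holding_cust_7)
print("rows per segment:", sizes)
```

The row counts are perfectly balanced, but the rows of a single customer are scattered over every segment, so no segment can answer a per-customer question from its local data alone.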

3.3. Query planning and distribution

The client submits its request to the Master node, which generates the query plan and distributes it to every segment node. This applies to UPDATE, DELETE, CREATE, and other operations as well as to queries. After the plan has been executed on each segment node, the results are returned to the Master node, which presents them to the client.

3.4. SQL Query Processing Mechanism

The QD (query dispatcher) process runs on the Master node, and QE (query executor) processes run on the segment nodes. When the Master node distributes the query plan to the segment nodes, the QE processes on each segment node execute it. To improve execution efficiency, GP splits a query plan into multiple slices that run in parallel; a slice that consumes the output of another slice waits for that slice to deliver its result. The QE processes that work on the same slice across the segments are called a gang. For example, when slice 1 finishes processing, it sends its result to slice 2, and slice 2 returns the summarized result to the Master node.
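
The slice 1 to slice 2 to Master flow can be sketched as a toy simulation. Everything here is hypothetical: the per-segment data, the function names, and the use of threads to stand in for a gang of QE processes; it illustrates the idea, not Greenplum's implementation.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical fragments: the rows each segment stores locally.
SEGMENT_DATA = {0: [5, 12, 7], 1: [3, 9], 2: [20, 1, 4], 3: [8]}


def slice1_qe(segment_id):
    """One QE of the slice-1 gang: scan the local fragment, pre-aggregate."""
    fragment = SEGMENT_DATA[segment_id]
    return sum(fragment), len(fragment)


def slice2_summarize(partials):
    """Slice 2: combine the partial results received from slice 1."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return {"total": total, "count": count}


def qd_run_query():
    """QD on the Master: dispatch slice 1 to every segment, then summarize."""
    with ThreadPoolExecutor(max_workers=len(SEGMENT_DATA)) as gang:
        # One worker per segment plays the role of one QE in the gang.
        partials = list(gang.map(slice1_qe, sorted(SEGMENT_DATA)))
    return slice2_summarize(partials)


result = qd_run_query()
print(result)
```

Slice 2 cannot start summarizing until the slice-1 workers have delivered their partial results, which mirrors the waiting behaviour described above.
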
3.5. Parallel query plan
SELECT customer, amount FROM sales JOIN customer USING (cust_id) WHERE date = '2015-03-22';

GP query plan: first, a full table scan is performed on each table. After the full table scans complete, the data is redistributed across the segments and hashed. The redistribution takes place in slice 1 and the hashing in slice 2, and the hash join is performed once both have finished. When the hash join ends, a gather motion collects the results from all segments on the Master in slice 3.
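
The sequence of scan, redistribution, hash join, and gather can be traced with a small simulation. This is a sketch under assumptions that the notes do not state: sales is taken to be distributed on a hypothetical sale_id column (so it is the table that gets redistributed), customer is distributed on cust_id, and `seg_of` is a trivial stand-in for the real hash.

```python
NUM_SEGMENTS = 3  # hypothetical cluster size


def seg_of(key):
    """Stand-in for the distribution hash (not Greenplum's)."""
    return key % NUM_SEGMENTS


def place(rows, key):
    """Store rows on segments according to their distribution key."""
    segments = [[] for _ in range(NUM_SEGMENTS)]
    for row in rows:
        segments[seg_of(key(row))].append(row)
    return segments


# customer(cust_id, customer), assumed distributed by cust_id
customers = [(1, "Ann"), (2, "Bo"), (3, "Cy"), (4, "Di")]
# sales(sale_id, cust_id, amount, date), assumed distributed by sale_id
sales = [
    (10, 1, 50, "2015-03-22"), (11, 2, 20, "2015-03-22"),
    (12, 1, 5, "2015-03-21"), (13, 4, 70, "2015-03-22"),
    (14, 3, 15, "2015-03-22"),
]
customer_segs = place(customers, key=lambda r: r[0])
sales_segs = place(sales, key=lambda r: r[0])

# Scan and filter each sales fragment, then redistribute by the join key.
redistributed = [[] for _ in range(NUM_SEGMENTS)]
for fragment in sales_segs:
    for sale in fragment:
        if sale[3] == "2015-03-22":
            redistributed[seg_of(sale[1])].append(sale)


def local_hash_join(customer_fragment, sales_fragment):
    """Build a hash table on customer, probe it with the local sales rows."""
    hash_table = {cust_id: name for cust_id, name in customer_fragment}
    return [(hash_table[s[1]], s[2]) for s in sales_fragment if s[1] in hash_table]


# Gather: the Master collects the join output of every segment.
result = []
for seg in range(NUM_SEGMENTS):
    result.extend(local_hash_join(customer_segs[seg], redistributed[seg]))

print(sorted(result))
```

After redistribution, every sales row sits on the same segment as its matching customer row, so each segment can complete its part of the join using only local data.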
