Data partitioning in Impala and Hive (1)

Source: Internet
Author: User


Partitioning data can greatly improve query efficiency, and with big data now in wide use it is essential knowledge. So how do we create partitions, and how do we load data into them?

First, partitioning the accounts table by state in Impala/Hive

(1) Example: the non-partitioned accounts table

[Image: 11.png]

If the table is created as shown above, all of its data is stored in the accounts directory. But what if most of Loudacre's analysis of the customer table is done by state? For example:

[Image: 22.png]

In this case, if the data volume is large, we can create partitions to avoid a full table scan. Without partitions, every query has to scan all files in the table directory. Partitioning by state stores the data for each state in its own subdirectory, so a query that filters on "NY" scans only the matching subdirectory. Next, let's look at how partitions are created.
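The effect of partitioning on scanning can be illustrated with a small Python simulation. This is only a sketch of the directory layout idea, not how Impala or Hive is implemented; the sample rows, the `data.csv` file name, and the helper functions are all hypothetical.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical sample rows: (cust_id, name, state). Not taken from the article.
ROWS = [
    (1, "Ann", "NY"),
    (2, "Bob", "CA"),
    (3, "Cy", "NY"),
    (4, "Dee", "OR"),
]

def write_partitioned(table_dir, rows):
    """Store each row under a state=<value> subdirectory of the table directory."""
    for cust_id, name, state in rows:
        part_dir = table_dir / f"state={state}"
        part_dir.mkdir(parents=True, exist_ok=True)
        with open(part_dir / "data.csv", "a", newline="") as f:
            # The state value lives in the directory name, not in the file.
            csv.writer(f).writerow([cust_id, name])

def query_by_state(table_dir, state):
    """Read only the subdirectory matching the filter; return (rows, files_scanned)."""
    part_dir = table_dir / f"state={state}"
    files = sorted(part_dir.glob("*.csv")) if part_dir.is_dir() else []
    rows = []
    for path in files:
        with open(path, newline="") as f:
            rows.extend(csv.reader(f))
    return rows, len(files)

with tempfile.TemporaryDirectory() as tmp:
    table = Path(tmp) / "accounts"
    write_partitioned(table, ROWS)
    ny_rows, scanned = query_by_state(table, "NY")
    total_files = len(list(table.rglob("*.csv")))
    print(ny_rows, f"scanned {scanned} of {total_files} files")
```

With the table laid out this way, the "NY" query touches one file instead of every file under accounts.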

Second, partition creation

(1) Using PARTITIONED BY to create a partitioned table

[Image: 33.png]

Note that state has been removed from the regular column list because it is now the partition field. Partition values are not stored in the actual data files, so state does not appear among the ordinary columns. In other words, a partition key is a virtual column. So how do we see the partition column? Does it appear in the table structure? It does.

Third, view the partition column

Use DESCRIBE to display the partition column. It appears as the last column of the table structure, but it is a virtual column, not a column that actually exists in the data.

[Image: 44.png]
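To make the "virtual column" idea concrete, here is a minimal Python sketch in which the partition value is recovered from the directory name and appended after the columns stored in the file. The path `accounts/state=NY/part-0.csv`, the column names, and both helper functions are assumptions for illustration only.

```python
from pathlib import PurePosixPath

def partition_values(file_path):
    """Extract key=value pairs from the directory components of a data file path."""
    values = {}
    for part in PurePosixPath(file_path).parent.parts:
        if "=" in part:
            key, _, value = part.partition("=")
            values[key] = value
    return values

def read_row(file_path, stored_columns):
    """Combine the columns stored in the file with virtual columns taken from the path."""
    row = dict(stored_columns)
    row.update(partition_values(file_path))  # partition columns are appended last
    return row

# Hypothetical file path and stored columns.
row = read_row("accounts/state=NY/part-0.csv", {"cust_id": 1, "name": "Ann"})
print(row)
```

The stored columns never contain state, yet the reader still sees it as the last column of each row.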

So far we have created a single level of partitioning, but sometimes partitions are nested. How do we handle them?

Fourth, creating nested partitions

[Image: 55.png]
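With nested partitions, each additional partition key adds one more directory level. The sketch below builds such a path in Python; the second key `zipcode` and the sample row are hypothetical, since the original screenshot is not available.

```python
def partition_path(table_dir, partition_keys, row):
    """Build the nested subdirectory for a row, one level per partition key, in order."""
    parts = [f"{key}={row[key]}" for key in partition_keys]
    return "/".join([table_dir, *parts])

# Hypothetical nested partitioning: first by state, then by zipcode.
path = partition_path(
    "accounts",
    ["state", "zipcode"],
    {"cust_id": 1, "state": "NY", "zipcode": "10001"},
)
print(path)  # accounts/state=NY/zipcode=10001
```

The order of the keys matters: the first key becomes the outer directory, so it should usually be the one queries filter on most often.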

Once the partitions are created, how do we load data into them? There are two ways: dynamic partitioning and static partitioning. With dynamic partitioning, Impala/Hive automatically adds new partitions as data is loaded and stores each row in the correct partition (subdirectory) based on its column values. With static partitioning, we define the partition in advance using ADD PARTITION and, when loading the data, specify which partition it should be stored in. So what are the characteristics of dynamic and static partitioning? That will be covered in a follow-up post.
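The difference between the two loading styles can be sketched with a toy in-memory model in Python. It follows the description above rather than the engines' actual behavior; the class, its method names, and the sample rows are all hypothetical.

```python
class PartitionedTable:
    """Toy in-memory model: maps a partition value to the rows stored in it."""

    def __init__(self):
        self.partitions = {}

    def add_partition(self, state):
        """Static step: declare a partition before loading into it."""
        self.partitions.setdefault(state, [])

    def load_static(self, state, rows):
        """Static load: the caller names the target partition, which must already exist."""
        if state not in self.partitions:
            raise KeyError(f"partition state={state} has not been added")
        self.partitions[state].extend(rows)

    def load_dynamic(self, rows, key="state"):
        """Dynamic load: pick the partition from each row's value, creating it on demand."""
        for row in rows:
            row = dict(row)        # copy so the caller's row is not modified
            state = row.pop(key)   # the partition value is not stored with the row
            self.partitions.setdefault(state, []).append(row)

table = PartitionedTable()
table.add_partition("NY")
table.load_static("NY", [{"cust_id": 1, "name": "Ann"}])
table.load_dynamic([
    {"cust_id": 2, "name": "Bob", "state": "CA"},
    {"cust_id": 3, "name": "Cy", "state": "NY"},
])
print(sorted(table.partitions))
```

In the static call the caller decides where the rows go; in the dynamic call the data itself decides, and the "CA" partition appears without being declared first.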

As for big data, we should embrace it and keep learning. The field does not yet have a mature system and is still developing rapidly, so only continuous learning lets us keep pace. I suggest everyone study and exchange ideas regularly. I usually follow the "Big Data cn" public account, which I personally find very good and recommend.


This article is from the "11872756" blog, please be sure to keep this source http://11882756.blog.51cto.com/11872756/1891301
