Code Analysis of hive MetaStore

Source: Internet
Author: User

1. Hive MetaStore Internal Structure

1.1 Package Structure

From the package structure's perspective, the MetaStore module consists of five main packages. Let's take a look at what each of them contains.

(1) The metastore package is the entry point of the MetaStore module and the core of the whole module. It contains the HiveMetaStore class, the heart of the module, which receives requests from Hive and returns the required metadata.

(2) The metastore.api package contains the interfaces used to call and access the MetaStore module, along with the interface parameter and return-value types. Users of the MetaStore module access it through these APIs.

(3) The metastore.events package implements the observer pattern inside the MetaStore module, which backs its notification mechanism and other follow-up processing. When MetaStore performs operations on metadata, it generates corresponding events; registered listeners capture these events and handle them accordingly, for example by sending notifications.

(4) The metastore.model package is related to data persistence. The MetaStore module uses the DataNucleus ORM framework to persist model objects to the database; each model class here corresponds to a table in the database.

(5) The metastore.tools package provides tools for back-end metadata administrators to view and modify metadata.
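The observer mechanism behind the metastore.events package can be sketched with a toy example. This is a minimal illustration of the pattern only; the class and method names below (CreateTableEvent, MetaListener, Handler) are made up for the sketch and are not Hive's actual API.

```python
# Minimal sketch of the MetaStore-style observer pattern.
# All names here are illustrative, not Hive's real classes.

class CreateTableEvent:
    """An event generated when a metadata operation runs."""
    def __init__(self, table_name):
        self.table_name = table_name

class MetaListener:
    """Base listener; concrete listeners override on_event."""
    def on_event(self, event):
        raise NotImplementedError

class NotifyingListener(MetaListener):
    """A listener that reacts to events, e.g. by recording notifications."""
    def __init__(self):
        self.notifications = []
    def on_event(self, event):
        self.notifications.append(f"created {event.table_name}")

class Handler:
    """Performs metadata operations, then fires events to its listeners."""
    def __init__(self, listeners):
        self.listeners = listeners
    def create_table(self, name):
        # ... perform the actual metadata change, then notify observers
        event = CreateTableEvent(name)
        for listener in self.listeners:
            listener.on_event(event)

listener = NotifyingListener()
handler = Handler([listener])
handler.create_table("t1")
print(listener.notifications)  # ['created t1']
```

The key property is that the handler performing the operation does not need to know what each listener does with the event.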

 

1.2. Class Structure

Among these five packages, the key one is the metastore package. Let's look at its class structure in detail.

(1) First, the client side. HiveMetaStoreClient implements the IMetaStoreClient interface; it can access and call the HiveMetaStore server both locally and remotely.

(2) The core part is HiveMetaStore's inner class HMSHandler, which implements the IHMSHandler interface. IHMSHandler extends the ThriftHiveMetastore.Iface interface, so its methods can be invoked remotely over Thrift. All of the MetaStore module's interface methods are implemented in HiveMetaStore.HMSHandler.

(3) ObjectStore implements the RawStore interface and is used to persist data. It can fetch data from the database and map it onto model objects, and it can also save model objects back to the database.

(4) HiveAlterHandler implements the AlterHandler interface, isolating the ALTER operations from HMSHandler.

(5) Warehouse is mainly used to operate on files in HDFS, because modifying metadata may involve file operations there, such as mkdir and deleteDir.

(6) MetaStorePreEventListener, MetaStoreEventListener, and MetaStoreEndFunctionListener process the generated events through the observer pattern. All three are abstract classes that concrete listener classes extend and implement.
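The persistence role described in (3), mapping model objects to database rows and back, can be sketched with a toy stand-in for ObjectStore. Everything below (MTable, ToyObjectStore, the dict-backed "database") is illustrative; Hive's real ObjectStore delegates this mapping to DataNucleus.

```python
# Toy sketch of the RawStore/ObjectStore idea: persist model objects as rows
# and map rows back onto model objects. Names are illustrative only.

class MTable:
    """Model object corresponding to one row of a 'tables' table."""
    def __init__(self, name, owner):
        self.name = name
        self.owner = owner

class ToyObjectStore:
    def __init__(self):
        self.rows = {}  # table name -> row dict, standing in for the RDBMS

    def save(self, mtable):
        # persist a model object as a row
        self.rows[mtable.name] = {"name": mtable.name, "owner": mtable.owner}

    def get_table(self, name):
        # load a row and map it back onto a model object
        row = self.rows[name]
        return MTable(row["name"], row["owner"])

store = ToyObjectStore()
store.save(MTable("t1", "alice"))
print(store.get_table("t1").owner)  # alice
```

The point of the RawStore interface is exactly this seam: callers see model objects, while the concrete store decides how rows are read and written.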

 

2. Interaction between hive and MetaStore

Let's look at how Hive interacts with MetaStore through two examples: a DDL HQL command and an ANALYZE HQL command.

 

2.1 DDL HQL

(1) To compile the command, ql.Driver calls the analyze method of ql.parse.SemanticAnalyzer, which analyzes and optimizes the command. ql.parse.SemanticAnalyzer calls the getTable method of ql.metadata.Hive to obtain the table information. ql.metadata.Hive is the class the ql module uses to interact with the MetaStore module; it accesses and calls MetaStore through HiveMetaStoreClient.

(2) The compile process also calls the getSchema method of ql.parse.SemanticAnalyzer to obtain the schema information, which was saved in ql.parse.SemanticAnalyzer during the previous step.

(3) DDLTask then calls DDL methods on ql.metadata.Hive to carry out the DDL processing.

This example involves no Map-Reduce processing; it only modifies metadata in MetaStore. The next example involves both Map-Reduce processing and metadata modification in MetaStore.
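The compile flow above can be sketched roughly in a few lines. ToyHive and ToySemanticAnalyzer below are simplified stand-ins for ql.metadata.Hive and ql.parse.SemanticAnalyzer; the real classes are far richer, so treat this only as the shape of the call chain.

```python
# Rough sketch of the DDL compile flow: Driver.compile -> SemanticAnalyzer.analyze
# -> Hive.get_table (which, in real Hive, goes through HiveMetaStoreClient).

class ToyHive:
    """Stand-in for ql.metadata.Hive, the ql module's gateway to MetaStore."""
    def __init__(self, metastore):
        self.metastore = metastore  # a dict stands in for the MetaStore service
    def get_table(self, name):
        return self.metastore[name]

class ToySemanticAnalyzer:
    """Stand-in for ql.parse.SemanticAnalyzer."""
    def __init__(self, hive):
        self.hive = hive
        self.schema = None
    def analyze(self, command):
        # during analysis, look up the referenced table's metadata
        table = self.hive.get_table(command["table"])
        self.schema = table["schema"]
    def get_schema(self):
        # return the schema saved during the analyze step
        return self.schema

metastore = {"t1": {"schema": ["id int", "name string"]}}
analyzer = ToySemanticAnalyzer(ToyHive(metastore))
analyzer.analyze({"table": "t1"})
print(analyzer.get_schema())  # ['id int', 'name string']
```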

 

2.2 ANALYZE HQL

(1) The compile process here is similar to the previous example.

(2) In the execute phase, MapRedTask creates a map-reduce job, submits it to Hadoop via the command line, receives the Map-Reduce processing result, and stores the result in the context.

(3) In its execute phase, StatsTask calls the updateTableColumnStatistics or updatePartitionColumnStatistics method of ql.metadata.Hive to update the statistics in MetaStore according to the command.
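The statistics-update step can be sketched as follows. ToyMetaStore and run_stats_task are hypothetical stand-ins; in particular, update_table_column_statistics here only imitates the purpose of Hive's updateTableColumnStatistics, not its actual signature.

```python
# Toy sketch of step (3): after the map-reduce result lands in the context,
# a stats task pushes column statistics into the metastore.

class ToyMetaStore:
    def __init__(self):
        self.column_stats = {}  # (table, column) -> stats dict

    def update_table_column_statistics(self, table, column, stats):
        # imitates the purpose (not the signature) of Hive's
        # updateTableColumnStatistics
        self.column_stats[(table, column)] = stats

def run_stats_task(metastore, context):
    # read the map-reduce output saved in the context and update stats
    for (table, column), stats in context["mr_result"].items():
        metastore.update_table_column_statistics(table, column, stats)

ms = ToyMetaStore()
ctx = {"mr_result": {("t1", "id"): {"num_nulls": 0, "num_distinct": 100}}}
run_stats_task(ms, ctx)
print(ms.column_stats[("t1", "id")])  # {'num_nulls': 0, 'num_distinct': 100}
```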

 

3. Estimating the Database Capacity and Scalability of MetaStore

In terms of capacity and scalability, the models (tables) that occupy most of the storage space are:

Database, Table, Partition, and Role

The following estimates the storage space required when adding databases, tables, partitions, and roles.

Note: R(x) denotes the size of one row in table x; N(x) denotes the number of instances of x in the system, i.e., the number of rows in table x; K denotes a small constant.

(1) Creating N(Database) databases adds rows to the Database and DatabaseParams tables:

[R(Database) + K * R(DatabaseParams)] * N(Database)

 

(2) Creating N(Table) tables: each table adds rows to the Table and TableParams tables,

R(Table) + K * R(TableParams)

Each table also adds rows to TableColumnPrivilege, TableColumnStatistics, and TablePrivilege; the per-table size is

N(Column) * R(TableColumnPrivilege) + R(TableColumnStatistics) + R(TablePrivilege)

Each table also needs a StorageDescriptor, which holds the column information; the per-table size is

N(Column) * R(FieldSchema) + R(SerdeInfo)

In total, adding N(Table) tables adds:

[R(Table) + K * R(TableParams) +
N(Column) * R(TableColumnPrivilege) + R(TableColumnStatistics) + R(TablePrivilege) +
N(Column) * R(FieldSchema) + R(SerdeInfo)]
* N(Table)
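As a sanity check, the per-table formula can be evaluated with a small calculator. The numbers plugged in at the bottom (row size, column count, K) are made-up placeholders, not measured values, and the simplification of setting every R(x) to a common r follows the approximation used at the end of this section.

```python
def table_storage(n_table, n_column, r, k):
    """Evaluate the per-table storage formula, assuming every row size
    R(x) equals the common value r (a simplifying assumption).

    n_table:  number of tables, N(Table)
    n_column: columns per table, N(Column)
    r:        assumed common row size in bytes (placeholder)
    k:        small constant for the number of param rows
    """
    per_table = (
        r + k * r                # Table + K * TableParams
        + n_column * r + r + r   # TableColumnPrivilege + stats + privilege
        + n_column * r + r       # FieldSchema per column + SerdeInfo
    )
    return per_table * n_table

# e.g. 1000 tables with 10 columns each, 100-byte rows, K = 2 (placeholders)
print(table_storage(1000, 10, 100, 2))  # 2600000
```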

 

(3) Creating N(Partition) partitions: each partition adds rows to the Partition, PartitionParams, and FieldSchema tables,

R(Partition) + K * R(PartitionParams) + R(FieldSchema)

Each partition also adds rows to PartitionColumnPrivilege, PartitionColumnStatistics, and PartitionPrivilege; the per-partition size is

N(Column) * R(PartitionColumnPrivilege) + R(PartitionColumnStatistics) + R(PartitionPrivilege)

Adding N(Partition) partitions therefore adds:

[R(Partition) + K * R(PartitionParams) + R(FieldSchema) +
N(Column) * R(PartitionColumnPrivilege) + R(PartitionColumnStatistics) + R(PartitionPrivilege)]
* N(Partition)

(4) Creating N(Role) roles adds:

N(Role) * R(Role)

 

The total size of the stored data is therefore:

[R(Database) + K * R(DatabaseParams)] * N(Database) +

[R(Table) + K * R(TableParams) +
N(Column) * R(TableColumnPrivilege) + R(TableColumnStatistics) + R(TablePrivilege) +
N(Column) * R(FieldSchema) + R(SerdeInfo)]
* N(Table) +

[R(Partition) + K * R(PartitionParams) + R(FieldSchema) +
N(Column) * R(PartitionColumnPrivilege) + R(PartitionColumnStatistics) + R(PartitionPrivilege)]
* N(Partition) +

N(Role) * R(Role)

 

Approximating every R(x) by a common row size R and keeping only the dominant terms, the total size is roughly:

K * R * N(Database) + 2 * R * N(Column) * N(Table) + R * N(Column) * N(Partition) + R * N(Role)

= [K * N(Database) + N(Role) + 2 * N(Column) * N(Table) + N(Column) * N(Partition)] * R
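The simplified formula translates directly into a small estimator. The sample inputs below are illustrative placeholders; the function returns the multiplier of R (an approximate total row count), which you would multiply by an assumed average row size to get bytes.

```python
def total_rows_estimate(n_database, n_table, n_partition, n_column, n_role, k):
    """Simplified capacity estimate: dominant terms only, common row size R.

    Returns the coefficient of R, i.e. the approximate total number of rows:
    K * N(Database) + N(Role) + 2 * N(Column) * N(Table)
      + N(Column) * N(Partition)
    """
    return (k * n_database + n_role
            + 2 * n_column * n_table
            + n_column * n_partition)

# e.g. 10 databases, 1000 tables, 100000 partitions, 10 columns per table,
# 50 roles, K = 2 -- all placeholder numbers
print(total_rows_estimate(10, 1000, 100000, 10, 50, 2))  # 1020070
```

Note that the N(Column) * N(Partition) term dominates in practice: partition count grows much faster than anything else, which is why heavily partitioned warehouses put the most pressure on the metastore database.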

 
