Architecture Diagram of Hive-based Offline Analysis Big Data Tool Hive

Last Update:2020-06-22 Source: Internet

Author: User


The principle and use of Hive
Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides SQL-like query functions by converting SQL statements into MapReduce jobs. The advantage is a low learning cost: simple MapReduce statistics can be produced quickly with SQL-like statements, without developing dedicated MapReduce applications, which makes Hive well suited to statistical analysis in data warehouses. Facebook first developed and open-sourced the Hive framework. From Hadoop's point of view, Hive acts as a client.
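
To make the SQL-to-MapReduce idea concrete, the following is a minimal sketch in plain Python of how a query such as select dept, count(*) from emp group by dept could be broken into a map phase, a shuffle, and a reduce phase. It does not use Hive or Hadoop and does not show Hive's actual generated plan; the table emp, its columns, and the sample rows are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for a table emp(name, dept).
ROWS = [
    ("alice", "sales"),
    ("bob", "sales"),
    ("carol", "ops"),
]


def map_phase(rows):
    """Emit a (dept, 1) pair for every input row."""
    for _name, dept in rows:
        yield dept, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values of each key, which gives count(*) per dept."""
    return {key: sum(values) for key, values in groups.items()}


def count_by_dept(rows):
    """Run the three phases in order on an in-memory list of rows."""
    return reduce_phase(shuffle(map_phase(rows)))


if __name__ == "__main__":
    print(count_by_dept(ROWS))
```

With Hive, the user writes only the one-line query; the map, shuffle, and reduce steps sketched here are what would otherwise have to be coded by hand.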

1. The role of the Hive framework:

(1) Data analysts who do not know Java can use Hadoop for data analysis.

(2) MapReduce development is tedious and complex; using Hive improves efficiency.

(3) Unified metadata management: Hive metadata can be shared with Impala and Spark.

2. Hive basics:

(1) Hive uses HQL as its query interface, uses MapReduce for computation, stores data on HDFS, and runs on YARN.

(2) Hive is flexible and extensible: it supports user-defined functions (UDFs) and multiple file formats.

(3) Hive is suitable for offline data analysis (batch processing, where high latency is acceptable).
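
As an illustration of extensibility, here is a hedged sketch of a row-processing script in Python. Hive UDFs proper are usually written in Java; this example instead relies on a related extension point, the TRANSFORM clause, which streams rows to an external script as tab-separated lines on standard input and reads tab-separated lines back from standard output. The script name upper_name.py, the table emp, and its two columns are assumptions for illustration only.

```python
import sys


def transform_line(line):
    """Turn one tab-separated input row into one tab-separated output row.

    Assumes exactly two columns (name, dept); upper-cases the name.
    """
    name, dept = line.rstrip("\n").split("\t")
    return f"{name.upper()}\t{dept}"


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read rows from stdin and write transformed rows to stdout."""
    for line in stdin:
        stdout.write(transform_line(line) + "\n")


if __name__ == "__main__":
    main()
```

Under these assumptions the script would be registered with add file upper_name.py and invoked with a query of the form select transform(name, dept) using 'python upper_name.py' as (name, dept) from emp.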

Hive is a SQL parsing engine that translates SQL statements into MapReduce jobs and executes them on Hadoop. A Hive table is actually an HDFS directory, with one directory per table name. For a partitioned table, each partition value is a subdirectory, and the data files in these directories can be read directly by MapReduce jobs.
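
The directory layout can be sketched as a small path-building helper. This is an illustration, not Hive code: it assumes a managed table stored under the default warehouse directory /user/hive/warehouse (the default of hive.metastore.warehouse.dir, which a deployment may override), and it ignores the escaping Hive applies to special characters in partition values. The table name logs, the database name web, and the partition columns dt and country are hypothetical.

```python
from posixpath import join

# Default value of hive.metastore.warehouse.dir; deployments may override it.
WAREHOUSE_DIR = "/user/hive/warehouse"


def table_dir(table, database="default"):
    """Return the HDFS directory of a managed table.

    Tables in the default database sit directly under the warehouse
    directory; other databases get a "<database>.db" subdirectory.
    """
    if database == "default":
        return join(WAREHOUSE_DIR, table)
    return join(WAREHOUSE_DIR, f"{database}.db", table)


def partition_dir(table, partition, database="default"):
    """Return the subdirectory for one partition.

    `partition` maps partition columns to values, in partition-column
    order; each pair becomes one "column=value" path component.
    """
    parts = [f"{col}={val}" for col, val in partition.items()]
    return join(table_dir(table, database), *parts)


if __name__ == "__main__":
    print(partition_dir("logs", {"dt": "2020-06-22", "country": "cn"}))
```

Because the partition value is part of the path, a query that filters on a partition column only needs to read the matching subdirectories.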

Hive system structure
HDFS and MapReduce are the foundation of the Hive architecture. The architecture includes the following components: CLI (command line interface), JDBC/ODBC, Thrift Server, Web GUI, Metastore, and Driver (Compiler, Optimizer, and Executor). These components fall into two categories: server-side components and client-side components.

(1) Client components:

① CLI: the command line interface.

② Thrift client: the Thrift client is not shown in the architecture diagram, but many of Hive's client interfaces, including the JDBC and ODBC interfaces, are built on it.

③ Web GUI: Hive provides a way to access its services through a web page. This corresponds to the Hive Web Interface (HWI); the hwi service must be started before use.

(2) Server-side components:

① Driver component: this component includes the Compiler, Optimizer, and Executor. It parses, compiles, and optimizes the HiveQL (SQL-like) statements that users write, generates an execution plan, and then calls the underlying MapReduce computing framework.

② Metastore component: the metadata service component, which stores Hive metadata. The metadata is kept in a relational database; the databases supported by Hive include Derby and MySQL. Because metadata is critical to Hive, the metastore service can be separated out and installed on a remote server cluster. This decouples the Hive service from the metastore service and makes Hive more robust.

③ Thrift service: Thrift is a software framework developed by Facebook for building scalable, cross-language services. Hive integrates this service so that programs written in different languages can call the Hive interface.
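
To illustrate what it means for the metastore to keep table metadata in a relational database, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for Derby or MySQL. The two-table schema is a deliberately simplified assumption and is not Hive's real metastore schema; the function names and the sample table emp are likewise hypothetical.

```python
import sqlite3


def create_metastore(conn):
    """Create a toy metadata schema: one row per table, one row per column."""
    conn.executescript(
        """
        CREATE TABLE tables (
            name     TEXT PRIMARY KEY,
            location TEXT NOT NULL
        );
        CREATE TABLE columns (
            table_name TEXT NOT NULL REFERENCES tables(name),
            position   INTEGER NOT NULL,
            name       TEXT NOT NULL,
            type       TEXT NOT NULL
        );
        """
    )


def register_table(conn, name, location, columns):
    """Record a table's storage location and its (column, type) list."""
    conn.execute("INSERT INTO tables VALUES (?, ?)", (name, location))
    conn.executemany(
        "INSERT INTO columns VALUES (?, ?, ?, ?)",
        [(name, i, col, typ) for i, (col, typ) in enumerate(columns)],
    )


def describe(conn, name):
    """Look up where a table's data lives and what its columns are."""
    location = conn.execute(
        "SELECT location FROM tables WHERE name = ?", (name,)
    ).fetchone()[0]
    cols = conn.execute(
        "SELECT name, type FROM columns WHERE table_name = ? ORDER BY position",
        (name,),
    ).fetchall()
    return location, cols


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_metastore(conn)
    register_table(conn, "emp", "/user/hive/warehouse/emp",
                   [("name", "string"), ("dept", "string")])
    print(describe(conn, "emp"))
```

The point of the sketch is the separation of concerns: the data files stay on HDFS, while the database holds only the description needed to interpret them, which is why other engines can share the same metadata.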

(3) The underlying foundation:

Hive data is stored in HDFS, and most queries are executed by MapReduce. Simple queries that only read rows, such as select * from table with no filtering or aggregation, do not generate MapReduce tasks.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
