Architecture of Hive, a Hadoop-Based Tool for Offline Big Data Analysis
Source: Internet
Author: User
Keywords: big data, hive, hadoop
The principle and use of Hive
Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides a simple SQL-like query language (HiveQL), translating queries into MapReduce jobs for execution. Its main advantage is a low learning cost: simple MapReduce statistics can be produced quickly through SQL-like statements, without developing dedicated MapReduce applications, which makes it well suited to statistical analysis in data warehouses. The Hive framework was originally developed and open-sourced by Facebook. In practice, Hive acts as a client that translates SQL statements directly into MapReduce programs.
1. Why use Hive:
(1) Analysts who do not know Java can still use Hadoop for data analysis;
(2) MapReduce development is tedious and complex; using Hive improves efficiency.
(3) Unified metadata management: metadata can be shared with Impala and Spark.
2. Hive basics:
(1) Uses HQL as the query interface, MapReduce for computation, HDFS for storage, and YARN for resource management.
(2) Hive is flexible and extensible, supporting user-defined functions (UDFs) and multiple file formats.
(3) Hive is suited to offline data analysis (batch processing where high latency is acceptable).
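One face of that extensibility is Hive's TRANSFORM clause, which streams table rows through an external script. Below is a minimal sketch of such a script in Python; the tab-separated row format matches what Hive streams by default, while the two-column layout and the uppercasing logic are illustrative assumptions:

```python
import sys

def transform(lines):
    """Uppercase the second column of tab-separated rows,
    the format a Hive TRANSFORM script receives on stdin."""
    out = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 2:
            cols[1] = cols[1].upper()
        out.append("\t".join(cols))
    return out

if __name__ == "__main__":
    # Hive streams rows on stdin and reads transformed rows from stdout.
    for row in transform(sys.stdin):
        print(row)
```

Such a script would be invoked from HiveQL along the lines of `SELECT TRANSFORM(id, name) USING 'python script.py' AS (id, name) FROM t;` (table and column names here are hypothetical).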
Hive is essentially a SQL parsing engine: it translates SQL statements into MapReduce jobs and executes them on Hadoop. Each Hive table corresponds to an HDFS directory named after the table; for a partitioned table, each partition value becomes a subdirectory, and the files underneath can be consumed directly by MapReduce jobs.
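As a rough illustration of what "translating SQL into MapReduce" means, a query such as `SELECT word, COUNT(*) FROM t GROUP BY word` corresponds to a map phase that emits (key, 1) pairs and a reduce phase that sums them per key. A simplified, single-process Python sketch (no real Hadoop involved; the sample data is invented):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(rows):
    # Emit (word, 1) for each input row, like a Hadoop mapper.
    return [(row, 1) for row in rows]

def reduce_phase(pairs):
    # Shuffle/sort: group pairs by key, then sum per key, like a reducer.
    pairs = sorted(pairs, key=itemgetter(0))
    return {key: sum(v for _, v in group)
            for key, group in groupby(pairs, key=itemgetter(0))}

rows = ["hive", "hadoop", "hive"]
print(reduce_phase(map_phase(rows)))  # {'hadoop': 1, 'hive': 2}
```

A real Hive query plan adds stages (projection, filtering, possibly multiple MR jobs), but the map/shuffle/reduce skeleton is the same.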
Hive system structure
HDFS and MapReduce form the foundation of the Hive architecture. The architecture includes the following components: CLI (command-line interface), JDBC/ODBC, Thrift Server, web GUI, metastore, and Driver (compiler, optimizer, and executor). These components fall into two categories: client components and server-side components.
(1) Client components:
① CLI: the command-line interface.
② Thrift client: although not drawn explicitly in the architecture diagram above, many of Hive's client interfaces, including the JDBC and ODBC interfaces, are built on the Thrift client.
③ Web GUI: a way to access Hive's services through a web page. This corresponds to the Hive Web Interface (HWI), and the hwi service must be started before use.
(2) Server-side components:
① Driver component: this component includes the compiler, optimizer, and executor. It parses, compiles, and optimizes the HiveQL (SQL-like) statements we write, generates an execution plan, and then invokes the underlying MapReduce computation framework.
② Metastore component: the metadata service. Hive metadata is stored in a relational database; the supported databases include Derby and MySQL. Because metadata is critical to Hive, the metastore service can be split out and installed on a remote server, decoupling the Hive service from the metastore service and making Hive more robust.
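For instance, a Hive client can be pointed at such a remote metastore service through `hive.metastore.uris` in hive-site.xml (the hostname below is a placeholder; 9083 is the metastore's conventional default port):

```xml
<configuration>
  <!-- Point the Hive client at a remote metastore service
       instead of an embedded local Derby database. -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore-host.example.com:9083</value>
  </property>
</configuration>
```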
③ Thrift service: Thrift is a software framework developed by Facebook for building scalable, cross-language services. Hive integrates Thrift so that programs written in different languages can call the Hive interface.
(3) The underlying foundation:
Hive data is stored in HDFS, and most queries are executed by MapReduce (simple full-table fetches such as `select * from table` do not generate MapReduce jobs).
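The table-as-directory layout on HDFS described above can be sketched as a simple path mapping. The warehouse root below is Hive's common default location, and the database, table, and partition names are illustrative:

```python
# Sketch of how Hive lays out tables and partitions as HDFS paths.
# /user/hive/warehouse is the common default warehouse root.
WAREHOUSE = "/user/hive/warehouse"

def table_path(db, table, partitions=None):
    """Build the HDFS directory for a table, adding one subdirectory
    per partition column=value pair, as Hive does."""
    path = f"{WAREHOUSE}/{db}.db/{table}"
    for col, value in (partitions or {}).items():
        path += f"/{col}={value}"
    return path

print(table_path("sales", "orders"))
# /user/hive/warehouse/sales.db/orders
print(table_path("sales", "orders", {"dt": "2023-01-01"}))
# /user/hive/warehouse/sales.db/orders/dt=2023-01-01
```

Because a partition value is just a subdirectory, a query filtered on `dt` can skip whole directories instead of scanning the full table.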