First, the historical value of Hive
1, Big data is known for Hadoop, and much of Hadoop's usefulness comes from Hive. Hive is the killer application on Hadoop: it is the data warehouse on Hadoop, and it provides both the storage and the query engine of a data warehouse. Spark SQL is a much better and more advanced query engine, but it does not provide storage functionality. So Spark SQL cannot replace Hive, and in today's enterprise applications Spark SQL + Hive is the most efficient and popular combination for big data in the industry.
2, Hive was launched by Facebook, primarily to enable engineers who are not proficient in Java to harness Hadoop clusters for multidimensional analysis of distributed data through SQL; you can even operate Hive directly through a web interface;
3, The core of Hive is to translate its SQL dialect, HQL, into MapReduce code and then hand that code to the Hadoop cluster for execution, which means that Hive itself is single-machine (standalone) software!
4, Because business requirements are met by writing SQL, Hive is much simpler and more flexible than programming MapReduce directly, and it can easily keep up with business needs and changing scenarios;
5, Hive exists in almost every company that uses big data!
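The point about HQL being translated into MapReduce can be made concrete with a small sketch. The Python below is not Hive's generated code; it is a minimal, in-memory imitation of the map, shuffle, and reduce phases that a GROUP BY query is conceptually turned into. The table name employees, the sample rows, and all function names are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows of an "employees" table: (name, dept)
rows = [
    ("alice", "sales"),
    ("bob", "sales"),
    ("carol", "engineering"),
    ("dave", "sales"),
    ("erin", "engineering"),
]

# The HQL a Hive user would write (shown for comparison only, not executed here)
HQL = "SELECT dept, COUNT(*) FROM employees GROUP BY dept"


def map_phase(records):
    """Map: emit a (dept, 1) pair for every input row."""
    for _name, dept in records:
        yield dept, 1


def shuffle_phase(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_dept(records):
    """Chain the three phases, mimicking what the one-line HQL expresses."""
    return reduce_phase(shuffle_phase(map_phase(records)))


if __name__ == "__main__":
    print(HQL)
    print(count_by_dept(rows))
```

The contrast is the one the list above describes: the query is a single line of SQL, while the hand-written equivalent needs three separate phases even in this toy form.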
Second, the design of Hive's architecture
1, The architecture diagram of Hive is as follows:
[Figure: Hive architecture diagram, http://s2.51cto.com/wyfs02/M00/7D/55/wKioL1bmWinCrlXXAABz8usKgU8418.jpg]
2, You can connect to Hive in a number of ways, and Hive only needs to be installed on one machine.
3, Metastore (which databases exist in Hive, which tables they contain, and which columns and data types each table has): Hive's metadata is stored in Derby by default, but the default embedded Derby only supports a single user at a time, so a MySQL database is typically used to store Hive metadata.
4, Where the data that Hive operates on is stored is determined by the Hive configuration file; the data lives in HDFS (in fact it consists of ordinary files on HDFS, just organized by Hive);
5, From Hive's point of view, the data is a set of tables, and what we do is run SQL-based multidimensional queries on those tables.
6, People have been trying to replace the traditional data warehouse with Hive (for its scalability and extensibility), but have failed, because Hive is too slow. So the golden combination in the industry's current trend is Hive (the storage engine of the data warehouse) + Spark SQL (the distributed analytic query engine).
7, HQL is parsed and optimized by Hive to generate a query plan, and in general the query plan is converted into MapReduce tasks. However, a simple query such as select * from table is not converted into a MapReduce task.
8, Hive does not have indexes comparable to those in a traditional database!!
Indexes are standard database technology, and Hive has supported them since the 0.7 release. Hive provides only limited indexing functionality: it has no concept of a "key" as in a traditional relational database, but users can create indexes on certain columns to speed up certain operations, and the index data created for a table is saved in another table. Hive's indexing is still relatively new and offers few options. However, the index is designed to be customizable through pluggable Java code, so users can extend this functionality to suit their needs. Of course, not all queries will benefit from a Hive index. Users can use the EXPLAIN syntax to analyze whether a HiveQL statement can use an index to improve query performance. As with an index in an RDBMS, it is necessary to evaluate whether creating the index is justified; after all, an index requires more disk space, and creating and maintaining it has a cost. The user has to weigh the benefits of the index against its costs.
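The trade-off described above (index data kept in a separate table, faster lookups, extra storage and build cost) can be illustrated with a toy model. This is a sketch under simplifying assumptions, not Hive's actual index implementation: the file names, the user_id column, and the function names are all hypothetical, and the "tables" are plain Python dictionaries.

```python
from collections import defaultdict

# Hypothetical base table: file name -> list of (byte_offset, user_id) rows
base_table = {
    "part-00000": [(0, 7), (64, 3), (128, 7)],
    "part-00001": [(0, 5), (64, 3)],
    "part-00002": [(0, 9)],
}


def build_index(table):
    """Build the separate index table: column value -> {file: [offsets]}.

    Building it reads every row once and stores extra data, which is the
    creation and storage cost the text mentions.
    """
    index = defaultdict(lambda: defaultdict(list))
    for file_name, rows in table.items():
        for offset, user_id in rows:
            index[user_id][file_name].append(offset)
    return {value: dict(files) for value, files in index.items()}


def lookup_full_scan(table, wanted):
    """Without an index: read every row of every file."""
    hits, rows_read = [], 0
    for file_name, rows in table.items():
        for offset, user_id in rows:
            rows_read += 1
            if user_id == wanted:
                hits.append((file_name, offset))
    return hits, rows_read


def lookup_with_index(index, wanted):
    """With an index: consult the index table and touch only listed offsets."""
    files = index.get(wanted, {})
    hits = [(name, off) for name, offsets in files.items() for off in offsets]
    return hits, len(hits)


if __name__ == "__main__":
    idx = build_index(base_table)
    print(lookup_full_scan(base_table, 7))
    print(lookup_with_index(idx, 7))
```

In this model the indexed lookup reads only the matching rows, while the full scan reads all of them; a query that filters on a column without an index gains nothing, which is why the benefit has to be checked per query.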
This article is from the "Ding Dong" blog, please be sure to keep this source http://lqding.blog.51cto.com/9123978/1750882
Lesson 53: Hive Lesson 1: The value of Hive and an introduction to Hive's architecture design