Abstract: Because Hive uses HQL, a SQL-like query language, it is easy to
mistake Hive for a database. In fact, apart from the similar query language,
Hive and a database have little in common structurally. This article explains
the differences between Hive and a database from several aspects. A database
can be used in online applications, whereas Hive is designed for data
warehousing; keeping this in mind helps in understanding Hive's
characteristics from an application perspective.

Comparison of Hive and a database

Item                    Hive            Database
Query language          HQL             SQL
Data storage location   HDFS            Raw device or local FS
Data format             User-defined    System-determined
Data updates            Not supported   Supported
Index                   None            Yes
Execution               MapReduce       Executor
Execution latency       High            Low
Scalability             High            Low
Data size               Large           Small

Query language. Because SQL is widely used in data warehousing, HQL, a
SQL-like query language, was designed specifically around Hive's
characteristics. Developers familiar with SQL can easily develop with Hive.
Data storage location. Hive is built on Hadoop, and all Hive data is stored
in HDFS. A database can store its data on a block device or in a local file
system.
Data format. Hive does not define a specific data format; the format can be
specified by the user. A user-defined data format needs to specify three
properties: the column delimiter (usually a space, "\t", or "\x001"), the row
delimiter ("\n"), and the method for reading file data (Hive has three default
file formats: TextFile, SequenceFile, and RCFile). Because loading data
requires no conversion from the user's data format to a Hive-defined format,
Hive does not modify the data itself during loading; it simply copies or moves
the data content into the corresponding HDFS directory. In a database, by
contrast, different databases have different storage engines, each defining
its own data format. All data is stored according to a certain organization,
so loading data into a database is time-consuming.
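
The idea that a delimited file is stored untouched and only interpreted when
it is read can be sketched in a few lines of Python. This is a minimal
illustration, not Hive's actual reader: the sample content RAW and the helper
read_rows are hypothetical, and the column delimiter is assumed to be the
Ctrl-A character (written "\x01" in Python).

```python
# Hypothetical raw file content, stored exactly as the user supplied it.
# Columns are separated by Ctrl-A ("\x01"), rows by a newline.
RAW = "1\x01alice\x0130\n2\x01bob\x0125\n"


def read_rows(raw, col_delim="\x01", row_delim="\n"):
    """Split raw text into rows and columns at read time.

    Nothing is converted when the data is stored; the delimiters are
    applied only when the data is read.
    """
    return [line.split(col_delim) for line in raw.split(row_delim) if line]


rows = read_rows(RAW)
for row in rows:
    print(row)
```

Changing col_delim to "\t" or " " reads a tab- or space-delimited file with
the same function, which is the sense in which the format is user-defined.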
Data updates. Hive is designed for data warehouse applications, and data
warehouse content is read far more often than it is written. Therefore Hive
does not support rewriting or adding to data; all data is fixed at load time.
Data in a database, by contrast, is often modified, so you can use
INSERT INTO ... VALUES to add data and UPDATE ... SET to modify it.
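
The database side of this contrast can be tried with Python's built-in
sqlite3 module. This is only a sketch of row-level INSERT and UPDATE in an
ordinary database; SQLite stands in for "a database" here, and the table
users and its columns are hypothetical.

```python
import sqlite3

# An in-memory SQLite database stands in for a conventional database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")

# INSERT INTO ... VALUES adds a single row.
conn.execute("INSERT INTO users (id, name, age) VALUES (?, ?, ?)", (1, "alice", 30))

# UPDATE ... SET modifies that row in place.
conn.execute("UPDATE users SET age = ? WHERE id = ?", (31, 1))

row = conn.execute("SELECT name, age FROM users WHERE id = 1").fetchone()
print(row)
```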
Index. As mentioned above, Hive does no processing of the data while loading
it, and does not even scan it, so no index is built on any key in the data.
To access specific values that satisfy a condition, Hive must brute-force scan
the entire data set, so access latency is high. Because it uses MapReduce,
Hive can access data in parallel, so even without an index it can still show
an advantage on large volumes of data. In a database, an index is typically
created on one or several columns, so for access to a small amount of data
matching specific conditions the database can be highly efficient, with low
latency. Because of its high data-access latency, Hive is not suitable for
online data queries.
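
The difference between a brute-force scan and an indexed lookup can be
illustrated with a small in-memory sketch. It is not how Hive or any
particular database is implemented: the table contents, full_scan, and
index_lookup are hypothetical, and a Python dict plays the role of an index
on the first column.

```python
# Hypothetical table of (id, name) rows.
rows = [(i, f"user{i}") for i in range(10_000)]


def full_scan(rows, key):
    """Examine every row, as a system without an index must."""
    matches, examined = [], 0
    for row in rows:
        examined += 1
        if row[0] == key:
            matches.append(row)
    return matches, examined


# The index is built once, ahead of time, on the id column.
index = {row[0]: row for row in rows}


def index_lookup(index, key):
    """Go straight to the matching row through the index."""
    row = index.get(key)
    return ([row] if row is not None else []), 1


scan_result, scan_examined = full_scan(rows, 4242)
idx_result, idx_examined = index_lookup(index, 4242)
print(scan_examined, idx_examined)
```

Both calls return the same row, but the scan touches every row in the table
while the indexed lookup touches one entry.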
Execution. Most queries in Hive are executed through MapReduce, which is
provided by Hadoop (a query such as SELECT * FROM tbl does not require
MapReduce). A database usually has its own execution engine.
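
To give a feel for what executing a query through MapReduce means, here is a
toy, single-process sketch of the map, shuffle, and reduce steps for an
aggregation of the form SELECT city, COUNT(*) FROM tbl GROUP BY city. It does
not use any Hadoop API and is not the plan Hive actually generates; the data
and the function names map_phase, shuffle, and reduce_phase are hypothetical.

```python
from collections import defaultdict

# Hypothetical table of (name, city) rows.
ROWS = [("alice", "Beijing"), ("bob", "Shanghai"), ("carol", "Beijing")]


def map_phase(rows):
    """Emit one (key, 1) pair per row; the key is the GROUP BY column."""
    for _name, city in rows:
        yield city, 1


def shuffle(pairs):
    """Group all values that share a key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values for each key to get the per-group count."""
    return {key: sum(values) for key, values in groups.items()}


result = reduce_phase(shuffle(map_phase(ROWS)))
print(result)
```

In a real cluster the map and reduce steps run as separate tasks on many
machines, which is where both the parallelism and the startup overhead come
from.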
Execution latency. As mentioned above, Hive has no index, so it must scan the
entire table when querying data, which makes latency high. Another factor
behind Hive's high execution latency is the MapReduce framework. Because
MapReduce itself has high latency, executing Hive queries with MapReduce also
incurs high latency. By comparison, database execution latency is low. Of
course, this holds only when the data size is small; once the data grows
beyond the processing capacity of the database, the advantage of Hive's
parallel computation becomes evident.
Scalability. Because Hive is built on Hadoop, Hive's scalability is the same
as Hadoop's (the world's largest Hadoop cluster was at Yahoo!, at around 4,000
nodes in 2009). A database, because of the strict constraints of ACID
semantics, has very limited scalability. Oracle, currently the most advanced
parallel database, can in theory scale to only about 100 machines.
Data size. Because Hive is built on a cluster and can use MapReduce for
parallel computation, it can support very large data sets; correspondingly, a
database can support only smaller data sizes.