On the similarities and differences of Hive (III): Hive and databases

Last Update:2015-03-16 Source: Internet

Author: User


Abstract: Because Hive uses HQL, a SQL-like query language, it is easy to mistake Hive for a database. In fact, structurally, Hive and databases have nothing in common beyond a similar query language. This article explains the differences between Hive and databases from several aspects. A database can be used in online applications, whereas Hive is designed for the data warehouse. Keeping this in mind helps to understand Hive's characteristics from an application perspective.

Comparison of Hive and databases

Item                    Hive            Database
Query language          HQL             SQL
Data storage location   HDFS            Raw device or local FS
Data format             User-defined    System-determined
Data update             Not supported   Supported
Index                   None            Yes
Execution               MapReduce       Executor
Execution latency       High            Low
Scalability             High            Low
Data size               Large           Small

Query language. Because SQL is widely used in data warehouses, HQL, a SQL-like query language, was designed specifically around Hive's characteristics. Developers familiar with SQL can easily develop with Hive.
Data storage location. Hive is built on Hadoop, and all Hive data is stored in HDFS. A database can store its data on a block device or in a local file system.
Data format. Hive does not define a specific data format; the format can be specified by the user. A user-defined data format needs to specify three properties: the column delimiter (usually a space, "\t", or "\x001"), the row delimiter ("\n"), and the method for reading file data (Hive has three default file formats: TextFile, SequenceFile, and RCFile). Because loading does not require converting the user's data format into a format defined by Hive, Hive makes no modification to the data itself during loading; it simply copies or moves the data content into the corresponding HDFS directory. In databases, by contrast, each database has its own storage engine and defines its own data format. All data is stored according to a certain organization, so loading data into a database is time-consuming.
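
The idea that a format is only a pair of delimiters applied when the data is read can be illustrated with a minimal sketch. This is not Hive code; the function name read_rows and the sample data are assumptions made up for illustration.

```python
def read_rows(raw, field_delim="\t", row_delim="\n"):
    """Split raw file content into rows and columns at read time.

    The stored text is never rewritten; the delimiters only tell the
    reader how to interpret it. (Hypothetical helper, not a Hive API.)
    """
    rows = []
    for line in raw.split(row_delim):
        if line:  # skip the empty piece after a trailing row delimiter
            rows.append(line.split(field_delim))
    return rows


# Two files with different column delimiters, read by the same function.
tab_data = "1\talice\n2\tbob\n"
space_data = "1 alice\n2 bob\n"

tab_rows = read_rows(tab_data, field_delim="\t")
space_rows = read_rows(space_data, field_delim=" ")
```

Both calls produce the same rows, which is the point: loading can be a plain copy of the file, and the cost of interpreting the format is paid at query time instead.
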
Data updates. Because Hive is designed for data warehouse applications, and the content of a data warehouse is read often and written rarely, Hive does not support rewriting or adding data; all data is fixed at load time. Data in a database, by contrast, usually needs to be modified frequently, so you can use INSERT INTO ... VALUES to add data and UPDATE ... SET to modify data.
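
As a small illustration of the database side of this comparison, the sketch below runs the two statements named above against an in-memory SQLite database through Python's standard sqlite3 module. SQLite is only a stand-in for "a database" here, and the table users and its columns are hypothetical.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# Row-level insert: INSERT INTO ... VALUES
conn.execute("INSERT INTO users (id, name) VALUES (1, 'alice')")
conn.execute("INSERT INTO users (id, name) VALUES (2, 'bob')")

# Row-level modification: UPDATE ... SET
conn.execute("UPDATE users SET name = 'carol' WHERE id = 2")

rows = conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()
conn.close()
```

After the update, the second row holds the new name while the first is untouched, which is the kind of in-place change the article says Hive does not offer.
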
Index. As mentioned above, Hive does not process the data in any way during loading and does not even scan it, so no index is built on any key in the data. To access specific values that satisfy a condition, Hive has to brute-force scan the entire data set, so access latency is high. Thanks to MapReduce, Hive can access data in parallel, so even without indexes Hive can still show an advantage on large volumes of data. In a database, indexes are typically created on one or several columns, so a database can be highly efficient, with low latency, when accessing a small amount of data under specific conditions. The high latency of data access means that Hive is not suitable for online data queries.
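
The difference between a full scan and an indexed lookup can be observed in a small database. The sketch below uses SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module; SQLite is a stand-in for "a database", and the table events, its columns, and the index name are hypothetical. The exact wording of the plan text differs between SQLite versions, so no literal output is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(i % 100, "x") for i in range(1000)],
)

query = "SELECT payload FROM events WHERE user_id = 42"


def plan(sql):
    """Return the human-readable plan text (4th column of each plan row)."""
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


# Without an index the engine has to scan every row of the table.
plan_without_index = plan(query)

# With an index on the filter column it can search the index instead.
conn.execute("CREATE INDEX idx_events_user ON events (user_id)")
plan_with_index = plan(query)

matches = conn.execute(query).fetchall()
conn.close()
```

The first plan describes a scan of the whole table and the second names the index. Hive, as described above, is always in the first situation and compensates by running the scan in parallel.
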
Execution. Most queries in Hive are executed through MapReduce, which is provided by Hadoop (a query such as SELECT * FROM tbl does not require MapReduce). A database usually has its own execution engine.
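
To give a rough feel for what executing a query through MapReduce means, here is a toy, single-process sketch of how an aggregation such as SELECT city, COUNT(*) FROM orders GROUP BY city could be split into map, shuffle, and reduce steps. It does not use Hadoop and is not the plan Hive actually generates; the table orders, its columns, and all function names are assumptions for illustration.

```python
from collections import defaultdict


def map_phase(records):
    """Emit one (key, 1) pair per input record, keyed by the GROUP BY column."""
    for city, _amount in records:
        yield city, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine the values of each key; summing the 1s gives COUNT(*)."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical rows of an "orders" table: (city, amount).
records = [("paris", 10), ("tokyo", 5), ("paris", 7)]
counts = reduce_phase(shuffle(map_phase(records)))
```

In a real cluster the map and reduce steps run as separate tasks on many machines, and starting and coordinating those tasks is part of the reason even small queries are slow.
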
Execution latency. As mentioned above, because Hive has no indexes, it has to scan the entire table when querying data, which results in high latency. Another factor behind Hive's high execution latency is the MapReduce framework. Because MapReduce itself has high latency, executing Hive queries with MapReduce also incurs high latency. By comparison, the execution latency of a database is low. Of course, this holds only under a condition: the data size must be small. When the data size grows beyond the processing capacity of the database, the advantage of Hive's parallel computing becomes evident.
Scalability. Because Hive is built on Hadoop, Hive's scalability is consistent with Hadoop's (in 2009, the world's largest Hadoop cluster, at Yahoo!, had around 4,000 nodes). Databases, because of the strict constraints of ACID semantics, have very limited scalability. Oracle, currently the most advanced parallel database, can in theory scale to only about 100 nodes.
Data size. Because Hive is built on a cluster and can use MapReduce for parallel computation, it can support very large data volumes; correspondingly, the data volume a database can support is smaller.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
