Introduction: Apache hive is a data warehouse built on top of Hadoop (Distributed system infrastructure), Apache HBase is a nosql (=not only SQL, non-relational database) database system running on the top level of HDFs. This is a column-oriented database that differs from hive,hbase with the ability to read and write.
For users who have just come into contact with big data, it is difficult to differentiate between hive and HBase. In this paper, we will try to analyze it from the aspects of its definition, characteristics, limitations, application scenarios and so on.
What is Hive?
Apache Hive is a data warehouse built on top of Hadoop (Distributed system infrastructure), note that this is not a database. Hive can be thought of as a user programming interface, which itself does not store and compute data; it relies on HDFs (Hadoop Distributed File System) and MapReduce (a programming model, mapping and simplification; for big data parallel operations). Its operation on HDFs is similar to sql-named HQL, which provides a rich SQL query to analyze the data stored in HDFs, and HQL through its own SQL to query the content of the analysis after it has been compiled into a mapreduce job, even if you are not familiar with MapReduce Users can also easily query, summarize, and analyze data using the SQL language. and MapReduce developers can use their own mapper and reducer as plugins to support hive for more complex data analysis.
What is HBase?
Apache HBase is a nosql (=not only SQL, non-relational database) database system running on the top level of HDFs. This is a column-oriented database that differs from hive,hbase with the ability to read and write. HBase stores data as a table, which consists of rows and columns, and columns are divided into several column families (row family). For example, a message column cluster contains the sender, the recipient, the date sent, the message title, and the message content. Each pair of key values in HBase is defined as a cell, where the key consists of a row-key (row key), a column cluster, a column, and a timestamp. In HBase, each row represents a combination of key-value mappings identified by the row key. HBase targets rely primarily on scale-out to increase compute and storage capacity by increasing the number of inexpensive commercial servers.
Characteristics
The JDBC-compliant hive not only allows users with SQL knowledge to perform mapreduce operations indirectly, but also integrates the current SQL-based operational tools. However, because the default data reads are full-table traversal, the time consuming is inevitably relatively large. However, there is a limit to the amount of data that can be read by the different hive partitioning methods. The hive partition allows filtering queries on data stored on separate files, returning filtered data. For example, log file access for a date, provided that the file name of the class contains date information.
HBase stores data in the form of key-value pairs. It contains 4 main data manipulation methods:
Add or update data rows
Scan to get cells within a range
Returns the corresponding cells for a specific data row
Delete data row/column, or column description information from a data table
Column information can be used to get the value before the data is changed (through the HBase compression policy you can delete the column information history to free up storage space).
Limit
Hive does not support regular SQL UPDATE statements, such as: Data insertions, updates, and deletions. Because its operation on the data is for the entire data table. This also makes it possible to calculate data queries in minutes or hours. In addition, its mapreduce conversion process must follow the predefined conversion rules.
HBase data query is a set of its own SQL-like operating language, which requires a certain learning to master. In addition, to run the hbase,zookeeper is required to be equipped. Zookeeper is a reliable coordination system for large-scale distributed systems, offering features such as configuration maintenance, name services, distributed synchronization, group services, and more.
Application examples
Hive is suitable for large, static data queries such as Web logs. For example: User consumption behavior record, website visit footstep and so on. But not for online real-time online queries.
HBase is able to take on big data online real-time query scenarios. For example, Fackbook uses its online real-time analysis of messages sent between users.
Summary
Both Hive and HBase are based on different technologies on Hadoop. Hive is a class-SQL programming interface capable of executing mapreduce jobs, and HBase is a non-relational database structure. Combined with each other's own characteristics, may be able to receive a complementary effect. For example, using hive to process static off-line data, using HBase for online real-time queries, and then consolidating the result set between the two, which makes the data complete and youthful, providing good support for further business analysis.
HBaseHbase–hadoop Database, a distributed, column-oriented, open source data base, comes from the Google paper "Bigtable: A distributed storage system of structured data" written by Chang et al. Just as BigTable leverages the distributed data store provided by the Google File system, HBase provides bigtable-like capabilities on top of Hadoop. HBase is a sub-project of the Apache Hadoop project. HBase differs from the general relational database, which is a database suitable for unstructured data storage. The other difference is that HBase is column-based instead of row-based patterns. by line: In a row of units to save, anyway, there are advantagesall data files in HBase are stored on the Hadoop HDFs file system, mainly including the two file types presented above:1.
hfile, hbase keyvalue data storage format, hfile is a hadoop binary format file, in fact StoreFile is the hfile to do a lightweight packaging, That is, the StoreFile bottom is hfile . 2.
HLog file, the storage format of the Wal (Write Ahead Log) in HBase, is physically the sequence file of Hadoop
HiveHive is a Hadoop-based data warehousing tool that maps structured data files into a single database table and provides full SQL query functionality that can be translated into a mapreduce task to run. The advantage is that the learning cost is low, the simple mapreduce statistics can be quickly realized through the class SQL statements, and it is very suitable for the statistical analysis of data Warehouse without developing specialized mapreduce applications. The other one is the Windows registry file.
MapReduceMapReduce is a programming model for parallel operations of large datasets (larger than 1TB). The concepts "map" and "Reduce", and their main ideas, are borrowed from functional programming languages, as well as the features borrowed from vector programming languages. He greatly facilitates programmers to run their own programs on distributed systems without distributed parallel programming . The current software implementation is to specify a map function that maps a set of key-value pairs into a new set of key-value pairs, specifying the concurrency reduction function, which is used to guarantee that each of the mapped key-value pairs share the same set of keys
ZooKeeperZookeeper is a distributed, open-source, distributed application coordination Service that contains a simple set of primitives and is an important component of Hadoop and HBase. Distributed applications can use it to implement functions such as unified naming services, configuration management, distributed lock services, cluster management, and so on.
The difference of Phive HBase project in Hadoo ecology