Hive is a framework that plays an important role in the Hadoop ecosystem and is used in many real-world businesses; in fact, much of Hadoop's popularity is due to the presence of Hive. So what exactly is Hive, and why does it occupy such an important position in the Hadoop family? This article focuses on Hive's architecture, Hive operations, the differences between Hive and HBase, and so on.
Before that, let's walk through a business scenario that will give you a sense of why Hive is so popular.
Business description: using the business table consumer.txt, count how many customers are in Beijing. The corresponding business data is as follows:
ID City Name Sex
0001 Beijing Zhangli Man
0002 Guizhou Lifang Woman
0003 Tianjin Wangwei Man
0004 Chengde Wanghe Woman
0005 Beijing Lidong Man
0006 Lanzhou wuting Woman
0007 Beijing Guona Woman
0008 Chengde Houkuo Man
First, let's implement this analysis with the familiar MapReduce programming model. The complete code is as follows:
package it;

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class Consumer {
    public static String path1 = "hdfs://192.168.80.80:9000/consumer.txt";
    public static String path2 = "hdfs://192.168.80.80:9000/dir";

    public static void main(String[] args) throws Exception {
        FileSystem fileSystem = FileSystem.get(new URI(path1), new Configuration());
        // Remove a leftover output directory from a previous run, if any.
        if (fileSystem.exists(new Path(path2))) {
            fileSystem.delete(new Path(path2), true);
        }

        Job job = new Job(new Configuration(), "Consumer");
        FileInputFormat.setInputPaths(job, new Path(path1));
        job.setInputFormatClass(TextInputFormat.class);
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setNumReduceTasks(1);
        job.setPartitionerClass(HashPartitioner.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(path2));
        job.waitForCompletion(true);

        // View the execution result written by the single reducer.
        FSDataInputStream fr = fileSystem.open(new Path("hdfs://hadoop80:9000/dir/part-r-00000"));
        IOUtils.copyBytes(fr, System.out, 1024, true);
    }

    public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        public static long sum = 0L;

        protected void map(LongWritable k1, Text v1, Context context)
                throws IOException, InterruptedException {
            // Fields are tab-separated: ID, City, Name, Sex.
            String[] splited = v1.toString().split("\t");
            if (splited[1].equals("Beijing")) {
                sum++;
            }
        }

        protected void cleanup(Context context) throws IOException, InterruptedException {
            // Emit the accumulated count once, after all lines are mapped.
            String str = "Beijing";
            context.write(new Text(str), new LongWritable(sum));
        }
    }

    public static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        protected void reduce(Text k2, Iterable<LongWritable> v2s, Context context)
                throws IOException, InterruptedException {
            for (LongWritable v2 : v2s) {
                context.write(k2, v2);
            }
        }
    }
}
Running the MapReduce program produces the following result:
Beijing	3
From this result we can see that the consumer.txt business table contains three customers in Beijing. Next, we will use Hive to implement the same function, i.e. count how many customers in the business table consumer.txt are in Beijing.
The Hive operation is as follows:
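Here is a minimal sketch of the Hive statements involved, assuming a table named consumer and the data file at /usr/local/consumer.txt (the path reused from the loading examples later in this article):

CREATE TABLE consumer
(
    id int,
    city string,
    name string,
    sex string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

LOAD DATA LOCAL INPATH '/usr/local/consumer.txt' INTO TABLE consumer;

SELECT count(*) FROM consumer WHERE city = 'Beijing';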
Hive launches a MapReduce job behind the scenes, and the result of the Hive operation is the same: three customers in Beijing.
Doesn't this give you the sense that Hive is an amazing framework? Just a few lines of SQL implement the same business logic and get the results we need. That is exactly why Hive is so popular. Hive's main advantages are:
① Hive supports standard SQL syntax, eliminating the need for users to write MapReduce programs and greatly reducing a company's development costs.
② The advent of Hive enables users who are proficient in SQL but unfamiliar with MapReduce, weak in programming, or not good at the Java language to easily query, summarize, and analyze data in large data sets on HDFS. After all, far more people are proficient in SQL than in Java.
③ Hive is built for big data batch processing; its emergence relieves the bottleneck that traditional relational databases (MySQL, Oracle) hit when processing large-scale data.
Well, the simple business scenario above illustrates Hive's great advantages; the next step is to get to the main point of this article.
(i) Introduction to the Hive Architecture
1. The concept of Hive:
① Hive is a framework that simplifies writing MapReduce programs. Anyone who has done data analysis with MapReduce knows that many analysis programs are essentially the same except for the business logic; in such cases, a user programming interface like Hive is needed. Hive provides a SQL-like query language called QL. SQL was chosen in the design of the Hive framework because the language is widely familiar and the cost of switching to it is low, which greatly broadens the range of potential Hadoop users. Pig, a similar tool, is not based on SQL.
Hive is an open-source data warehouse system based on Hadoop. It can map structured data files onto database tables and provide full SQL query functionality, and it converts the tables and fields in SQL into directories and files in HDFS.
② Hive is a data warehouse infrastructure built on Hadoop, a batch system designed to reduce the effort of writing MapReduce jobs. Hive itself does not store or compute data; it relies entirely on HDFS and MapReduce. Hive can be understood as a client tool that transforms our SQL operations into the corresponding MapReduce jobs and then runs them on Hadoop.
In the consumer.txt example at the beginning, everything from the SQL we wrote to the final analysis result "Beijing 3" was actually carried out by a MapReduce program. But that MapReduce program was not written by the user; instead, the Hive client tool translated our SQL operation into the corresponding MapReduce program, which is reflected in the log output printed while the SQL command runs.
As can be seen from that log, Hive parses our SQL command into the corresponding MapReduce tasks and finally returns the results of our analysis.
③ Hive can be thought of as a wrapper around MapReduce. The significance of Hive is that, in business analytics, it replaces complex, hard-to-write MapReduce programs with the user-friendly, easy-to-write SQL language, which greatly lowers the threshold for learning Hadoop and lets more users exploit Hadoop for data mining and analysis.
To make it easy for everyone to understand the essence of Hive ("Hive is a SQL parsing engine that translates SQL statements into corresponding MapReduce programs"), the blogger uses a diagram as an illustration:
As the diagram shows, Hive is in a sense a wrapper around many "SQL-to-MapReduce" translations: it parses a user-written SQL statement into the corresponding MapReduce program, hands the job to the MapReduce computation framework to run, and finally returns the results of the operation to the client.
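You can see this translation for yourself with Hive's EXPLAIN command, which prints the plan of stages (including the MapReduce stages) a query compiles into; a sketch using the consumer table from the earlier example:

EXPLAIN SELECT count(*) FROM consumer WHERE city = 'Beijing';

-- The (abridged) plan lists the stages Hive generated, for example a
-- map-side filter on city followed by a reduce-side count aggregation.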
2. Introduction to the Hive architecture
Here is the architecture diagram for Hive:
The architecture of Hive can be divided into the following parts:
① User interfaces: these include shell commands, JDBC/ODBC, and the WebUI, of which the most common is the shell client way of operating Hive.
② Hive parser (the Driver): the core function of the Hive parser is to match the SQL written by the user against the corresponding MapReduce templates, generate the corresponding MapReduce job, and submit it for execution.
③ Hive metastore (Metastore): Hive stores a table's metadata in a database such as Derby (built in) or MySQL (what is configured in real work). The metadata includes the table's name, its columns and partitions, its properties (whether it is an external table, etc.), the directory where the table's data resides, and so on. Hive's parser reads the relevant information from the Metastore when it runs.
Let's talk about why, in real business, we do not use Derby, the database that ships with Hive, but reconfigure the metastore to a new database, MySQL. The Derby database has a serious limitation: it does not allow multiple clients to operate on it at the same time; only one client can be open against it, i.e. only one user can use it at any given moment. Naturally this is inconvenient in practice, so we reconfigure the metastore to use another database.
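A minimal sketch of the hive-site.xml entries that point the metastore at MySQL; the connection URL, user name, and password below are placeholders, not values from this article:

<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive</value>
</property>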
④ Hadoop: Hive stores with HDFS and computes with MapReduce. That is, the data in a Hive table is stored in HDFS, and the actual business analysis is performed by MapReduce.
As can be seen from the architecture above, with the help of Hadoop's HDFS and MapReduce plus MySQL, Hive essentially uses the Hive parser to translate the user's SQL statements into the corresponding MapReduce programs. In other words, Hive is just a client tool, which is why there is no distributed or pseudo-distributed mode in the Hive installation process. (Hive is like Liu Bang: by making good use of Zhang Liang, Han Xin, and Xiao He, he accomplished great things!)
3. The operating mechanism of Hive
The operating mechanism of Hive is as follows:
After a table is created, the user only needs to write SQL statements according to the business requirements; the Hive framework then parses the SQL into the corresponding MapReduce program, runs the job through the MapReduce computation framework, and returns our final analysis results.
When running Hive, the user only needs to create tables, import data, and write SQL analysis statements; the rest of the process is completed automatically by the Hive framework. And creating tables, importing data, and writing SQL analysis statements are really just database skills. This also explains why the presence of Hive greatly lowers the learning threshold for Hadoop, and why Hive occupies such an important place in the Hadoop family.
(ii) Operation of Hive
To the user, operating Hive is essentially operating tables and databases. Here we will cover two aspects:
1. Hive tables: creating internal tables, external tables, and partitioned tables
A so-called internal table is just a normal table. The creation syntax is:
CREATE TABLE tablename            -- internal table name
(
    id int,                       -- field name and field type
    name string,
    city string,
    sex string
)
ROW FORMAT DELIMITED              -- one line of text corresponds to one record in the table
FIELDS TERMINATED BY '\t';        -- the delimiter between fields in the input file
The creation syntax for an external table (EXTERNAL TABLE) is:
CREATE EXTERNAL TABLE tablename   -- external table name
(
    id int,                       -- field name and field type
    name string,
    city string,
    sex string
)
ROW FORMAT DELIMITED              -- one line of text corresponds to one record in the table
FIELDS TERMINATED BY '\t'         -- the delimiter between fields in the input file
LOCATION 'hdfs://namenode:9000/dir';  -- link the table to a directory of files in HDFS
Note: the LOCATION in the last line points to the directory dir, not to a file; the Hive table automatically reads all the files under the dir directory.
The difference between an internal table and an external table:
For an internal table, the actual data is moved into the data warehouse directory (hive.metastore.warehouse.dir) during loading, and subsequent access to the data happens directly in that directory. When an internal table is deleted, the table's data and its metadata are deleted together.
For an external table, the actual data is not moved into the data warehouse directory during loading; the table merely keeps a link to it (the equivalent of a shortcut to a file). When an external table is deleted, only that link is deleted.
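A quick sketch of this difference in action; the table names here are illustrative:

-- Dropping an internal table removes both its metadata and its data files
-- under the warehouse directory (hive.metastore.warehouse.dir).
DROP TABLE consumer_internal;

-- Dropping an external table removes only the metadata (the "link"); the
-- files under its LOCATION directory in HDFS are left untouched.
DROP TABLE consumer_external;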
The concept of a partitioned table: our data can be partitioned, that is, the files are divided up according to some field. A partitioned table is created by adding PARTITIONED BY when the table is created.
The creation syntax for a partitioned table is:
CREATE TABLE tablename            -- partitioned table name
(
    id int,                       -- field name and field type
    name string,
    city string,
    sex string
)
PARTITIONED BY (day int)          -- partition field
ROW FORMAT DELIMITED              -- one line of text corresponds to one record in the table
FIELDS TERMINATED BY '\t';        -- the delimiter between fields in the input file
Note: a partitioned table must have its partition field specified when loading data, otherwise an error is reported. The correct way to load is as follows:
LOAD DATA LOCAL INPATH '/usr/local/consumer.txt' INTO TABLE t1 PARTITION (day=2);
The rest of the operations are the same as for internal and external tables.
2. Loading (importing) data files into a Hive table
After creating a table in Hive, we naturally want to import data into it, but importing here differs from a traditional database (MySQL, Oracle): Hive does not support inserting records one by one with INSERT statements, nor does it support UPDATE operations. Data is loaded into a Hive table in bulk with LOAD. Once the data is imported, it cannot be modified; you either drop the whole table or create a new table and import new data.
The syntax format for importing data is:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
[PARTITION (partcol1=val1, partcol2=val2 ...)];
Here are a few things to consider when importing data:
① LOCAL with INPATH means importing data from the local Linux file system into the Hive table; INPATH alone means importing data from HDFS into the Hive table.
② By default the data is appended to the existing contents of the Hive table; OVERWRITE means the original data in the table is overwritten by the import.
③ PARTITION is unique to partitioned tables and must be supplied when importing data, otherwise an error is reported.
④ The LOAD operation is just a copy/move operation: it copies or moves the data file to the location of the Hive table. That is, Hive does not modify the data itself in any way during loading; it simply copies or moves the file into the corresponding table directory.
Taking our consumer table (a partitioned table) as an example, the process of creating the table and loading the data (loading from the local file system) is as follows:
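A minimal sketch of the whole process; the table name consumer_p is illustrative, and the file path and day partition value are reused from the examples above:

CREATE TABLE consumer_p
(
    id int,
    city string,
    name string,
    sex string
)
PARTITIONED BY (day int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

LOAD DATA LOCAL INPATH '/usr/local/consumer.txt' INTO TABLE consumer_p PARTITION (day=2);

-- Check the load and rerun the Beijing count against this partition.
SELECT count(*) FROM consumer_p WHERE day = 2 AND city = 'Beijing';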
(iii) The difference between Hive and HBase
Strictly speaking, Hive and HBase should not really be compared at all. The reason people look for a difference is that both Hive and HBase involve creating tables, inserting data into tables, and so on, so we naturally want to find the difference between them; yet the two are hard to compare directly, for the following reasons:
1. As analyzed above, Hive is to some extent a wrapper around many "SQL-to-MapReduce" translations, i.e. Hive is a wrapper around MapReduce; its significance is that it lets users write the easy SQL language in business analytics instead of complex, hard-to-write MapReduce programs.
2. HBase can be considered a wrapper around HDFS. Its essence is data storage; it is a NoSQL database. HBase is deployed on top of HDFS and overcomes HDFS's weakness at random reads and writes.
Therefore, asking about the difference between Hive and HBase is rather like asking about the difference between HDFS and MapReduce, and comparing HDFS with MapReduce is not very meaningful.
But since we must talk about the differences between HBase and Hive, here are a few points to discuss:
Hive and HBase are two different Hadoop-based technologies: Hive is a SQL-like engine that runs MapReduce jobs, while HBase is a NoSQL key/value database on top of Hadoop. Of course, the two tools can be used together, just as Google is used for search and Facebook for social networking: Hive can handle statistical queries, HBase can handle real-time queries, and data can be written from Hive into HBase and read back from HBase into Hive.
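As a sketch of that wiring (not from the original post): Hive ships an HBase storage handler that lets a Hive table read from and write to an HBase table; the table and column names below are illustrative:

CREATE TABLE hbase_consumer (id int, city string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:city")
TBLPROPERTIES ("hbase.table.name" = "consumer_hbase");

-- Data written through hbase_consumer (e.g. with INSERT OVERWRITE ... SELECT)
-- lands in the HBase table consumer_hbase, and Hive queries over
-- hbase_consumer read live HBase data.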
Hive is ideal for analytical queries over data accumulated during a period of time, for example computing trends or analyzing website logs. Hive should not be used for real-time queries, because it can take a long time to return results.
HBase is ideal for real-time querying of big data. Facebook uses HBase for messages and real-time analytics; it can also be used to count Facebook connections.
That is the difference between HBase and Hive, and this Hive getting-started note ends here. Please leave a message if you have any questions.