Usage of Hive Beeline
Reprint: http://www.teckstory.com/hadoop-ecosystem/hive-new-cli-beeline-for-hive/
Hive is the data warehouse software of the Hadoop ecosystem. It provides a mechanism to project structure onto large data sets stored in Hadoop. Hive lets you query this data using a SQL-like language called HiveQL. The use case for Hive is already established and it is widely adopted. In February 2015, Hive 1.0.0 was released. Before this release Hive was at 0.14, and the planned 0.14.1 release was voted by the Hive community to be released as 1.0.0. Similarly, the next major release, Hive 0.15.0, was renamed 1.1.0.
When it comes to interacting with any database, including Hive, the first and most basic yet powerful method is a command line tool. We have been using the Hive CLI for a long time, but Beeline is now the new Hive CLI, and the old Hive CLI is being deprecated in favor of it. If you are looking for a command line tool to interact with Hive, Beeline is the recommended one. This article discusses the most common tasks you used to perform with the old Hive CLI and shows how to do them with Beeline, to give you a jumpstart in migrating from the old CLI.
What are the things you would want to do with a command line tool? Let's look at examples of the most common tasks and how to do them using the Hive Beeline CLI. I'll use the Cloudera QuickStart VM 5.4.x to execute commands and generate the output for this article. If you are using any other Hadoop environment, your output may differ because of the difference in versions. I am also truncating the Beeline prompt in some places to fit the page.
Hive Shell Command (old CLI)
The hive shell command is the gateway to Hive services. It is used to invoke various Hive services, including the Hive command line interface (CLI). You can get a list of services using the command below.
[cloudera@quickstart ~]$ hive --help
Usage ./hive <parameters> --service serviceName <service parameters>
Service List: beeline cli help hiveburninclient hiveserver2 hiveserver hwi jar lineage metastore metatool orcfiledump rcfilecat schemaTool version
You can see the various services available, including the CLI. The CLI is the most important service and it is the default, so if you just issue the hive command without specifying anything, it starts the CLI service.
[cloudera@quickstart ~]$ hive
Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive>
This hive shell command with the default service is what is known as the "Hive CLI", or Hive command line interface. You can see that I am getting a warning that the Hive CLI is deprecated, with advice to migrate to Beeline. In the service list you can see many other services. One of them is hiveserver2, which is the most important service for the current discussion, and we'll talk about it next. One more thing you might have noticed is that just by executing the hive command, you got connected to Hive. At this stage, you are ready to execute Hive database commands.
HiveServer2
The old Hive CLI connects directly to the Hive database and metastore. It can be used only on hosts where you have access to these services, for example cluster nodes (name node and data nodes) or edge nodes. This is a little unrealistic in an enterprise environment, where cluster nodes are usually protected by firewalls and nobody other than admins has direct access to them. Edge nodes are the only place with access to the Hive CLI, but there are usually few of them and everyone would have to log in remotely to these systems. This isn't how most enterprise database systems work. We needed a client/server architecture with concurrency and authentication, just like other database systems.
This is offered by HiveServer2 and Beeline together. HiveServer2 is the server: it enables remote clients to send their queries to Hive and retrieve the results, and it supports authentication and concurrent access from multiple clients. Beeline is the client, which is nothing but a JDBC application. It is based on the popular SQLLine CLI.
The default authentication for HiveServer2 is NONE. However, we can configure username/password based authentication like any other database. Such authentication can be backed by LDAP. If you have LDAP configured, your username and password will be validated by HiveServer2 using LDAP.
With all this background, we are now ready to take a tour of Beeline. Let's see how to achieve the basic and most common things we'll need during our day-to-day work with a Hive database.
Connecting to Hive
You can start the Beeline tool by simply typing the beeline command.
[cloudera@quickstart ~]$ beeline
Beeline version 1.1.0-cdh5.4.2 by Apache Hive
You are not yet connected to the Hive database. To connect, you'll have to use the !connect command. Beeline supports two connection modes: embedded and remote.
Embedded Mode
To connect in embedded mode, you need to be on the machine where Hive is installed. It is the same type of connection we used with the old Hive CLI. To connect to Hive in embedded mode using Beeline, use the command below.
beeline> !connect jdbc:hive2:// scott tiger
scan complete in 2ms
Connecting to jdbc:hive2://
Added [/usr/lib/hive/lib/hive-contrib.jar] to class path
Added resources: [/usr/lib/hive/lib/hive-contrib.jar]
Connected to: Apache Hive (version 1.1.0-cdh5.4.2)
Driver: Hive JDBC (version 1.1.0-cdh5.4.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://>
In this example, !connect is a Beeline command. Next comes the connection string and finally the username and password: scott is the username and tiger is the password. We have passed a username and password for the sake of passing them, but since my HiveServer2 is not configured with LDAP to verify my credentials, HiveServer2 ignores them and allows me to connect. If you don't pass a user and password, Beeline will ask for them.
At this stage, you are connected to HiveServer2 and ready to enter your Hive commands interactively. As we already discussed, this type of connection is useful for those who have access to cluster nodes.
Remote Mode
Embedded mode is mostly used by admins because they have access to the machines where Hive is installed. For developers, it's almost always a remote mode connection. When you connect to Hive in remote mode you are, in fact, interacting with HiveServer2, and you need TCP network connectivity to HiveServer2. To connect to HiveServer2 remotely using Beeline, you can use the !connect command. This command takes the JDBC URL to connect to your database. A simplified format of the JDBC URL is given below.
!connect jdbc:hive2://
The detailed format of the JDBC URL for HiveServer2 is given below.
jdbc:hive2://
We'll discuss hive_conf_list and hive_var_list and show examples for them. For now, let's connect to Hive.
beeline> !connect jdbc:hive2://192.168.172.143:10000/default scott tiger
scan complete in 2ms
Connecting to jdbc:hive2://192.168.172.143:10000/default
Connected to: Apache Hive (version 1.1.0-cdh5.4.2)
Driver: Hive JDBC (version 1.1.0-cdh5.4.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://192.168.172.143:10000/default>
At this stage, you are ready to enter your Hive commands interactively.
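For scripted connections it can help to assemble the connection string in code. The following is a minimal Python sketch, not part of Beeline: hive2_url is a hypothetical helper, and the layout it assumes (host, port and database, then optional session variables after ';', Hive configuration after '?' and Hive variables after '#') is the general shape documented for HiveServer2 JDBC clients, which may differ in detail between Hive versions.

```python
def hive2_url(host, port=10000, db="default",
              session_vars=None, hive_conf=None, hive_vars=None):
    """Assemble a HiveServer2 JDBC URL (hypothetical helper, for illustration)."""
    def join(pairs):
        # Each list is written as key=value pairs separated by semicolons.
        return ";".join(f"{key}={value}" for key, value in pairs.items())

    url = f"jdbc:hive2://{host}:{port}/{db}"
    if session_vars:
        url += ";" + join(session_vars)   # assumed: session variables follow ';'
    if hive_conf:
        url += "?" + join(hive_conf)      # assumed: Hive configuration follows '?'
    if hive_vars:
        url += "#" + join(hive_vars)      # assumed: Hive variables follow '#'
    return url


print(hive2_url("192.168.172.143"))
# jdbc:hive2://192.168.172.143:10000/default
```

With only a host given, the helper produces the same URL that was passed to !connect in the remote mode example.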
I mentioned earlier that you need TCP network connectivity to be able to connect to HiveServer2 in remote mode. HiveServer2 also supports remote connections over HTTP, but your administrator would have to start HiveServer2 in HTTP transport mode. At any one time, HiveServer2 can accept either TCP requests or HTTP requests. If your HiveServer2 is running in HTTP transport mode, you can use the command below to connect to Hive.
!connect jdbc:hive2://
!connect jdbc:hive2://c15738:10001/default?hive.server2.transport.mode=http;hive.server2.thrift.http.path=cliservice
The default port is 10001 and the default endpoint is cliservice. At this stage, again, you are ready to enter your Hive commands interactively.
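Going the other way, a script can take such a URL apart, for example to confirm that it targets the HTTP port before connecting. The Python sketch below is illustrative only: split_hive2_url is a hypothetical helper that handles just the form shown in this article (host:port/db followed by an optional '?key=value;key=value' list), not session variables or Hive variables.

```python
def split_hive2_url(url):
    """Split a jdbc:hive2:// URL into host, port, database and conf settings."""
    prefix = "jdbc:hive2://"
    if not url.startswith(prefix):
        raise ValueError("expected a jdbc:hive2:// URL")
    # Everything after '?' is treated as the key=value configuration list.
    location, _, conf_part = url[len(prefix):].partition("?")
    authority, _, db = location.partition("/")
    host, _, port = authority.partition(":")
    conf = dict(pair.split("=", 1) for pair in conf_part.split(";") if pair)
    return {
        "host": host,
        "port": int(port) if port else None,  # no port in an embedded URL
        "db": db,
        "conf": conf,
    }


parts = split_hive2_url(
    "jdbc:hive2://c15738:10001/default"
    "?hive.server2.transport.mode=http;hive.server2.thrift.http.path=cliservice"
)
print(parts["port"], parts["conf"]["hive.server2.transport.mode"])
# 10001 http
```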
It is not necessary to start Beeline and then use the !connect command. You can pass connection parameters to beeline and get connected to Hive automatically on start.
[cloudera@quickstart ~]$ beeline -u jdbc:hive2://192.168.172.143:10000/test -n scott -p tiger
scan complete in 3ms
Connecting to jdbc:hive2://192.168.172.143:10000/test
Connected to: Apache Hive (version 1.1.0-cdh5.4.2)
Driver: Hive JDBC (version 1.1.0-cdh5.4.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.1.0-cdh5.4.2 by Apache Hive
0: jdbc:hive2://192.168.172.143:10000/test>
In the above example, -u takes the JDBC URL, -n takes the username and -p takes the password.
Interactively executing your DML/DDL statements
This is the most common requirement. Once you are connected to your HiveServer2, it's very simple to run your Hive queries interactively from the Beeline command line interface. Let us execute one simple SELECT statement.
0: jdbc:hive2://> select stock_symbol,exchange_code,stock_date,stock_open from stock_data limit 5;
+---------------+----------------+-------------+-------------+--+
| stock_symbol  | exchange_code  | stock_date  | stock_open  |
+---------------+----------------+-------------+-------------+--+
| ABB           | NYSE           | 2015-03-31  | 21.15       |
| ABB           | NYSE           | 2015-03-30  | 21.41       |
| ABB           | NYSE           | 2015-03-27  | 21.290001   |
| ABB           | NYSE           | 2015-03-26  | 21.35       |
| ABB           | NYSE           | 2015-03-25  | 21.66       |
+---------------+----------------+-------------+-------------+--+
5 rows selected (0.233 seconds)
0: jdbc:hive2://>
Executing a single Hive query from Beeline
You can use the -e option to execute a single Hive query, just as you did with the old Hive CLI. This option is useful for a quick query or when you want to achieve something using shell scripting.
[cloudera@quickstart ~]$ beeline -u jdbc:hive2://192.168.172.143:10000/test -n scott -p tiger -e "select stock_symbol,exchange_code,stock_date,stock_open from stock_data limit 5"
scan complete in 3ms
Connecting to jdbc:hive2://192.168.172.143:10000/test
Connected to: Apache Hive (version 1.1.0-cdh5.4.2)
Driver: Hive JDBC (version 1.1.0-cdh5.4.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
+---------------+----------------+-------------+-------------+--+
| stock_symbol  | exchange_code  | stock_date  | stock_open  |
+---------------+----------------+-------------+-------------+--+
| ABB           | NYSE           | 2015-03-31  | 21.15       |
| ABB           | NYSE           | 2015-03-30  | 21.41       |
| ABB           | NYSE           | 2015-03-27  | 21.290001   |
| ABB           | NYSE           | 2015-03-26  | 21.35       |
| ABB           | NYSE           | 2015-03-25  | 21.66       |
+---------------+----------------+-------------+-------------+--+
5 rows selected (0.863 seconds)
Beeline version 1.1.0-cdh5.4.2 by Apache Hive
Closing: 0: jdbc:hive2://192.168.172.143:10000/test
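When a call like this is driven from another program rather than typed into a shell, quoting the query is the easy part to get wrong. One option is to pass the arguments as a list so that no shell quoting is needed. Here is a minimal Python sketch, assuming beeline is on the PATH; beeline_query_command is a hypothetical helper name, while -u, -n, -p and -e are the options used in the session above.

```python
import shlex


def beeline_query_command(url, user, password, query):
    """Build the argument list for a one-off `beeline -e` call."""
    return ["beeline", "-u", url, "-n", user, "-p", password, "-e", query]


cmd = beeline_query_command(
    "jdbc:hive2://192.168.172.143:10000/test",
    "scott",
    "tiger",
    "select stock_symbol,stock_open from stock_data limit 5",
)

# Shell-quoted form of the same command, handy for logging.
print(shlex.join(cmd))

# To actually run it (needs beeline installed and a reachable HiveServer2):
# import subprocess
# subprocess.run(cmd, check=True)
```

Keeping the query as a single list element means spaces and quotes inside it reach Beeline unchanged.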
Executing your Hive queries from files
The -f option is an extension of the previous one and can be used in similar situations where you have multiple HQL statements stored in a file. It is very common to combine this feature with single-statement execution to build test scripts, deployment scripts and various other automation in your project.
[cloudera@quickstart ~]$ beeline -u jdbc:hive2://192.168.172.143:10000/test -n scott -p tiger -f test.hql
scan complete in 3ms
Connecting to jdbc:hive2://192.168.172.143:10000/test
Connected to: Apache Hive (version 1.1.0-cdh5.4.2)
Driver: Hive JDBC (version 1.1.0-cdh5.4.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://192.168.172.143:10000/test> select stock_symbol,exchange_code,stock_date,stock_open from stock_data limit 5;
+---------------+----------------+-------------+-------------+--+
| stock_symbol  | exchange_code  | stock_date  | stock_open  |
+---------------+----------------+-------------+-------------+--+
| ABB           | NYSE           | 2015-03-31  | 21.15       |
| ABB           | NYSE           | 2015-03-30  | 21.41       |
| ABB           | NYSE           | 2015-03-27  | 21.290001   |
| ABB           | NYSE           | 2015-03-26  | 21.35       |
| ABB           | NYSE           | 2015-03-25  | 21.66       |
+---------------+----------------+-------------+-------------+--+
5 rows selected (0.198 seconds)
0: jdbc:hive2://192.168.172.143:10000/test>
Closing: 0: jdbc:hive2://192.168.172.143:10000/test
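Deployment scripts often generate the HQL file before handing it to Beeline. The Python sketch below writes a list of statements to a file and builds the matching -f command. write_hql_script and beeline_file_command are hypothetical helpers, the file name test.hql follows the session above, and the `use test` statement is included only as an illustration.

```python
import os
import tempfile


def write_hql_script(statements, directory, name="test.hql"):
    """Write HiveQL statements to a file, one per line, each ending in ';'."""
    path = os.path.join(directory, name)
    with open(path, "w") as handle:
        for statement in statements:
            # Normalise so every statement ends with exactly one semicolon.
            handle.write(statement.rstrip().rstrip(";") + ";\n")
    return path


def beeline_file_command(url, user, password, script_path):
    """Build the argument list for `beeline -f <script>`."""
    return ["beeline", "-u", url, "-n", user, "-p", password, "-f", script_path]


workdir = tempfile.mkdtemp()
script = write_hql_script(
    ["use test", "select stock_symbol,stock_open from stock_data limit 5;"],
    workdir,
)
cmd = beeline_file_command(
    "jdbc:hive2://192.168.172.143:10000/test", "scott", "tiger", script
)

with open(script) as handle:
    print(handle.read(), end="")
```

The resulting cmd list can be passed to subprocess.run in an environment where beeline and a HiveServer2 instance are available.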
Initializing your Hive environment
In a typical Hive project, you'll need to set some Hive configuration properties to customize Hive behavior. These are common initialization requirements for specific situations. A detailed list of Hive configuration properties is available in the Hive documentation. You can set Hive configuration using the --hiveconf option of Beeline. The example below shows how to set the hive.cli.print.current.db configuration property.
[cloudera@quickstart ~]$ beeline -u jdbc:hive2://192.168.172.143:10000/test -n scott -p tiger --hiveconf hive.cli.print.current.db=false
scan complete in 3ms
Connecting to jdbc:hive2://192.168.172.143:10000/test
Connected to: Apache Hive (version 1.1.0-cdh5.4.2)
Driver: Hive JDBC (version 1.1.0-cdh5.4.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.1.0-cdh5.4.2 by Apache Hive
0: jdbc:hive2://192.168.172.143:10000/test>
Unfortunately, this configuration variable doesn't have any effect in Beeline, and we still see the database name in the Beeline prompt. But we can use the set <variable_name> command to check the current value of the configuration variable.
0: jdbc:hive2://192.168.172.143:10000/test> set hiveconf:hive.cli.print.current.db;
+-------------------------------------------+--+
|                    set                    |
+-------------------------------------------+--+
| hiveconf:hive.cli.print.current.db=false  |
+-------------------------------------------+--+
1 row selected (0.2 seconds)
You can also use the SET command to change the configuration variable while you are in interactive mode.
0: jdbc:hive2://192.168.172.143:10000/test> set hiveconf:hive.cli.print.current.db=true;
No rows affected (0.012 seconds)
You can check it again, and the value of this variable should now be true.
Setting variables and parameters in Hive queries
When you are trying to automate something, you'll need parameters in your queries. In Hive you can define variables, set their values, and use these variables in your queries. This feature is very powerful, as the examples below show.
0: jdbc:hive2://> set hivevar:sopen=21.15;
No rows affected (0.01 seconds)
0: jdbc:hive2://> select stock_symbol,exchange_code,stock_date,stock_open from stock_data where stock_open=${hivevar:sopen};
+---------------+----------------+-------------+-------------+--+
| stock_symbol  | exchange_code  | stock_date  | stock_open  |
+---------------+----------------+-------------+-------------+--+
| ABB           | NYSE           | 2015-03-31  | 21.15       |
| ABB           | NYSE           | 2015-01-02  | 21.15       |
| ABB           | NYSE           | 2012-01-31  | 21.15       |
+---------------+----------------+-------------+-------------+--+
3 rows selected (0.273 seconds)
The above example shows how you can define variables and use them in the WHERE clause of your query. Let us take another example.
0: jdbc:hive2://192.168.172.143:10000/test> set hivevar:tbl=stock_data;
No rows affected (0.02 seconds)
0: jdbc:hive2://192.168.172.143:10000/test> select stock_symbol,exchange_code,stock_date,stock_open from ${hivevar:tbl} limit 3;
+---------------+----------------+-------------+-------------+--+
| stock_symbol  | exchange_code  | stock_date  | stock_open  |
+---------------+----------------+-------------+-------------+--+
| ABB           | NYSE           | 2015-03-31  | 21.15       |
| ABB           | NYSE           | 2015-03-30  | 21.41       |
| ABB           | NYSE           | 2015-03-27  | 21.290001   |
+---------------+----------------+-------------+-------------+--+
3 rows selected (0.188 seconds)
0: jdbc:hive2://192.168.172.143:10000/test>
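The second example works because the variable reference is replaced in the query text before the statement runs, so a variable can stand in for a table name as well as a value. The Python sketch below imitates that behaviour for ${hivevar:name} references only. substitute_hivevars is a hypothetical function, and Hive's own substitution handles more cases (other namespaces, for instance) than this does. Raising an error for an unset variable is a choice made for this sketch, not a statement about what Hive does.

```python
import re


def substitute_hivevars(query, variables):
    """Replace ${hivevar:name} placeholders with their values, as plain text."""
    def lookup(match):
        name = match.group(1)
        if name not in variables:
            # Sketch-only behaviour: fail loudly on an unset variable.
            raise KeyError(f"hivevar not set: {name}")
        return str(variables[name])

    return re.sub(r"\$\{hivevar:(\w+)\}", lookup, query)


variables = {"tbl": "stock_data", "sopen": "21.15"}
template = "select stock_symbol from ${hivevar:tbl} where stock_open=${hivevar:sopen}"
print(substitute_hivevars(template, variables))
# select stock_symbol from stock_data where stock_open=21.15
```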