SparkSQL 1.1 Introduction VII: ThriftServer and CLI

Source: Internet
Author: User
Tags: shuffle

Compared with Spark 1.0, the biggest change in Spark 1.1 is the addition of the long-awaited CLI and ThriftServer. They make it easy for Hive users and command-line-oriented RDBMS administrators to get started with SparkSQL and truly enter the SQL era. The following is a brief introduction to their use; due to time constraints, source-code analysis will be attached later.
1: ThriftServer and CLI Command Parameters
A: The Surprising CLI
As soon as Spark 1.1 was deployed, I could not wait to test the CLI. For someone accustomed to a SQL command line, losing Shark made SparkSQL 1.0 hard to accept (although for development and debugging, spark-shell is a tool that makes it very convenient to use every component of the Spark ecosystem). Eagerly, without shutting down the Hive metastore service, I ran bin/spark-sql to enter the command line, and could then query Hive directly through the Hive metastore:
spark-sql> use saledata;
-- the number of orders and the total sales amount for each year
spark-sql> select c.theyear, count(distinct a.ordernumber), sum(b.amount)
    from tblStock a
    join tblStockDetail b on a.ordernumber = b.ordernumber
    join tblDate c on a.dateid = c.dateid
    group by c.theyear
    order by c.theyear;
Operation Result:
By the way, I tested the syntax of Hive 0.13 (the test system runs Hive 0.13, while Spark 1.1 was compiled against Hive 0.12; unsurprisingly, Hive 0.13 syntax cannot be used in the CLI — you must use the Hive syntax matching the version Spark was built with). The reason the CLI can connect to the Hive metastore is that Hive's configuration file hive-site.xml is placed under conf/, so the CLI reads Hive's configuration directly. Now let's look at the command parameters of the ThriftServer and the CLI. They consist of two parts: the first part is Spark's own runtime parameters (see the spark-submit parameters in the Spark 1.0.0 Application Deployment Tool article), and the latter part is the ThriftServer's or the CLI's own parameters.
B: ThriftServer Command Parameters
[email protected] spark110]$ sbin/start-thriftserver.sh --help
Usage: ./sbin/start-thriftserver [options] [thrift server options]
Thrift server options:
    --hiveconf <property=value>   Use value for given property
Here [options] are the parameters of the SparkSQL application that the ThriftServer starts. If --master is not set, the ThriftServer runs in local mode on the machine where it is started and can only be monitored at http://machine-name:4040. To provide a ThriftServer for a cluster, you must configure master, executor-memory and other parameters. [thrift server options] are the ThriftServer's own parameters, which can be supplied with --hiveconf; in real applications, because there are many such parameters, they are usually configured in conf/hive-site.xml instead.
C: CLI Command Parameters
[[email protected] spark110]$ bin/spark-sql --help
Usage: ./bin/spark-sql [options] [cli option]
CLI options:
 -d,--define <key=value>          Variable substitution to apply to hive
                                  commands. e.g. -d A=B or --define A=B
    --database <databasename>     Specify the database to use
 -e <quoted-query-string>         SQL from command line
 -f <filename>                    SQL from files
 -h
Here [options] are the parameters of the SparkSQL application that the CLI starts; as with the ThriftServer, if --master is not set, the CLI runs in local mode and can only be monitored at http://machine-name:4040. [cli option] are the CLI's own parameters, through which you can run SQL files directly, run SQL commands from the command line, and so on, similar to how Shark was used. Note that the CLI does not use a JDBC connection, so it cannot connect to a ThriftServer; but it can be configured via conf/hive-site.xml to connect to Hive's metastore (as described in "The Surprising CLI" above) and then query Hive data.
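As a concrete illustration of the -e and -f options, here is a minimal sketch. The master URL, database and table names (saledata, tblStock) are taken from the article's example cluster and sample data; the script path is a placeholder:

```shell
# Run a single statement non-interactively with -e
# (master URL and sample database are from the article's test cluster).
bin/spark-sql --master spark://hadoop1:7077 --executor-memory 3g \
  --database saledata \
  -e "select count(distinct ordernumber) from tblStock;"

# Run a SQL script file with -f (the path is a placeholder).
bin/spark-sql --master spark://hadoop1:7077 -f /tmp/yearly_report.sql
```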
2: ThriftServer
When the ThriftServer starts, it launches a SparkSQL application and opens a listener, waiting for JDBC clients to connect and submit queries. Therefore, when configuring the ThriftServer, at least configure its hostname and port; if you want to use Hive data, also provide the URIs of the Hive metastore. In general, you can define these in conf/hive-site.xml, or configure them via environment variables (environment variables take precedence over hive-site.xml). The following is the hive-site.xml used to start the ThriftServer on hadoop2 in the experimental cluster:
Configuring hive-site.xml for the ThriftServer:
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://hadoop3:9083</value>
    <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
  </property>
  <property>
    <name>hive.server2.thrift.min.worker.threads</name>
    <value>5</value>
    <description>Minimum number of Thrift worker threads</description>
  </property>
  <property>
    <name>hive.server2.thrift.max.worker.threads</name>
    <value>500</value>
    <description>Maximum number of Thrift worker threads</description>
  </property>
  <property>
    <name>hive.server2.thrift.port</name>
    <value>10000</value>
    <description>Port number of HiveServer2 Thrift interface. Can be overridden by setting $HIVE_SERVER2_THRIFT_PORT</description>
  </property>
  <property>
    <name>hive.server2.thrift.bind.host</name>
    <value>hadoop2</value>
    <description>Bind host on which to run the HiveServer2 Thrift interface. Can be overridden by setting $HIVE_SERVER2_THRIFT_BIND_HOST</description>
  </property>
</configuration>
Then run sbin/start-thriftserver.sh on hadoop2 to start the ThriftServer directly. Be careful that the hive.server2.thrift.bind.host configuration must not be localhost, otherwise remote clients cannot connect. Then switch to the client wyy and connect to the ThriftServer using bin/beeline:
You can then work directly with the SQL command:
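A minimal sketch of such a beeline session, assuming the ThriftServer configuration shown above (hadoop2:10000) and the article's sample database; the credentials prompted for depend on how your cluster authenticates:

```shell
bin/beeline
# Inside beeline, connect over JDBC to the ThriftServer
# (host and port match hive.server2.thrift.bind.host/port above):
#   beeline> !connect jdbc:hive2://hadoop2:10000
# After entering credentials, SQL can be issued directly, e.g.:
#   0: jdbc:hive2://hadoop2:10000> use saledata;
#   0: jdbc:hive2://hadoop2:10000> select count(*) from tblStock;
```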
Of course, you can also use a JDBC connection to the ThriftServer in your own programs; please refer to the HiveServer2 Clients documentation. At this point, viewing hadoop2:4040 shows the user's operations:
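Besides embedding the Hive JDBC driver in a program, the same JDBC endpoint can be driven non-interactively from a script with beeline's -u/-e options; a sketch, with the username and query as placeholders:

```shell
# Connect with -u (the JDBC URL), run one statement with -e, then exit.
# -n/-p (username/password) depend on how the cluster authenticates.
bin/beeline -u jdbc:hive2://hadoop2:10000 -n hive -p "" \
  -e "show tables;"
```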
Finally, the ThriftServer runs in a listening state after startup; the user can exit it with Ctrl+C, and exit beeline with the !q command.
3: CLI
The configuration of the CLI is simple: just specify the URIs of the Hive metastore in conf/hive-site.xml. To use spark-sql on the client wyy, configure conf/hive-site.xml as follows:
Configuring hive-site.xml for the CLI:
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://hadoop3:9083</value>
    <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
  </property>
</configuration>
Then start spark-sql:
bin/spark-sql --master spark://hadoop1:7077 --executor-memory 3g
On the cluster monitoring page you can see the launched SparkSQL application:
You can then query Hive data using HQL statements, and also use commands, such as setting parameters with SET. By default, SparkSQL shuffles into 200 partitions; you can modify this parameter with the following command:
SET spark.sql.shuffle.partitions=20;
Running the same query again, the number of tasks (partitions) changes from 200 to 20 after the parameter is modified.
Basically, most Hive features can be used in the CLI.
4: Summary
Overall, the introduction of the ThriftServer and the CLI makes it easier for SparkSQL to work with Hive data, and brings SparkSQL closer to users rather than only to developers.

