Using the Spark Thrift Server to Operate on CarbonData on CDH

Source: Internet
Author: User

CarbonData is a new tabular file format for distributed computing. This article operates on CarbonData through the Spark Thrift server and briefly describes how to start the Spark CarbonData Thrift server.

Version

CDH 5.10.3
Spark 2.1.0
CarbonData 1.2.0

Download

Spark: /spark-2.1.0-bin-hadoop2.6.tgz
CarbonData: carbondata

CarbonData compilation

The build environment was jdk-1.8.0_151 and maven-3.5.2. I have not tried lower versions, but the official documentation gives the minimums as JDK 7 and Maven 3.3.

cd carbondata-parent-1.2.0
mvn -DskipTests -Pspark-2.1 -Dspark.version=2.1.0 clean package

After the build succeeds, carbondata_2.11-1.2.0-shade-hadoop2.2.0.jar appears in the carbondata-parent-1.2.0/assembly/target/scala-2.11/ directory (the jar name may differ between build versions).

Startup steps

1. The /carbondata directory on HDFS used in the commands below must exist before startup. If the user who starts the service lacks permission to create it, create it as the hdfs user and grant the startup user read and write access with chmod.
2. Do not start the service on the same node as HiveServer2, to avoid a port conflict (I have not yet found a way to change the Thrift server port).
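
Note 1 could be carried out roughly as follows. This is a minimal sketch, not the author's exact procedure: it assumes the HDFS superuser is named hdfs, that sudo is available on the node, and that the startup user is called sparkuser (a hypothetical name). It also uses chown before chmod, which is one possible way to give the startup user access. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch: prepare the CarbonData store directory on HDFS.
# Assumptions (not from the article): the HDFS superuser is "hdfs",
# sudo is available, and START_USER is a hypothetical startup user.
STORE_DIR=/carbondata
START_USER=${START_USER:-sparkuser}
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

prepare_store() {
    # Create the directory as the hdfs user.
    run sudo -u hdfs hdfs dfs -mkdir -p "$STORE_DIR"
    # Make the startup user the owner, then give the owner full access.
    run sudo -u hdfs hdfs dfs -chown "$START_USER" "$STORE_DIR"
    run sudo -u hdfs hdfs dfs -chmod 755 "$STORE_DIR"
}

prepare_store
```

Reviewing the printed commands first makes it easy to adjust the user name or permission bits before anything is changed on the cluster.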

tar -zxvf spark-2.1.0-bin-hadoop2.6.tgz
cd spark-2.1.0-bin-hadoop2.6
cp /etc/hive/conf/hive-site.xml conf    # lets Spark read the Hive tables
cp conf/ conf/
vi conf/
Add:
export HADOOP_CONF_DIR=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf.cloudera.yarn/

mkdir carbondata_lib
# Copy carbondata_2.11-1.2.0-shade-hadoop2.2.0.jar into the carbondata_lib directory. To set CarbonData parameters, the file under the carbondata-parent-1.2.0/ conf directory must be renamed and copied into the spark-2.1.0-bin-hadoop2.6/conf directory.
cd bin
./spark-submit \
--master yarn \
--deploy-mode client \
--conf spark.sql.hive.thriftServer.singleSession=true \
--class org.apache.carbondata.spark.thriftserver.CarbonThriftServer \
../carbondata_lib/carbondata_2.11-1.2.0-shade-hadoop2.2.0.jar \
hdfs://[namenodeip]:8020/carbondata

Using the Thrift server

beeline -u jdbc:hive2://[start node ip]:10000 -n [user name]

Postscript

Personally, I think this is the best way to operate on CarbonData, since JDBC is, after all, the most convenient approach.
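
As a rough illustration of the JDBC route, the sketch below prepares a few statements and prints the beeline command that would replay them. The host name, user name, table name demo_table, and the file location are hypothetical placeholders. The STORED BY 'carbondata' clause follows the CarbonData 1.x table syntax. The command is printed rather than executed because it needs a running Thrift server.

```shell
#!/bin/sh
# Sketch: run a few statements against the Thrift server through beeline.
# START_NODE, DB_USER and demo_table are hypothetical placeholders.
START_NODE=${START_NODE:-start-node-ip}
DB_USER=${DB_USER:-username}
JDBC_URL="jdbc:hive2://${START_NODE}:10000"

# Write the statements to a file so that beeline can replay them with -f.
SQL_FILE=${SQL_FILE:-${TMPDIR:-/tmp}/carbon_demo.sql}
cat > "$SQL_FILE" <<'EOF'
CREATE TABLE IF NOT EXISTS demo_table (id INT, name STRING) STORED BY 'carbondata';
SHOW TABLES;
SELECT count(*) FROM demo_table;
EOF

# Print the command instead of running it, since it needs a live server.
echo beeline -u "$JDBC_URL" -n "$DB_USER" -f "$SQL_FILE"
```

Keeping the statements in a file makes the same session easy to repeat from any JDBC client, not only beeline.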
spark-sql and spark-shell also work, but even within the same version, different modes may support different sets of operations.
If resources are plentiful, you can tune the Spark parameters so that the containers are started in advance and SQL tasks get resources as quickly as possible. If several applications share the same YARN resources, you can use Spark's dynamic allocation, which requests resources and creates containers only when they are needed.
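
The two sizing strategies could look roughly like this on the spark-submit command line. This is a sketch under assumptions: every number is an illustrative placeholder rather than a tuned value, and the dynamic mode assumes the external shuffle service is already configured on the YARN NodeManagers, which Spark requires for dynamic allocation on YARN. The script only prints the resulting command.

```shell
#!/bin/sh
# Sketch: two ways to size the Thrift server, following the postscript.
# All numbers are illustrative placeholders, not tuned recommendations.
# fixed   -> start a fixed set of executors up front
# dynamic -> let Spark request executors on demand; on YARN this assumes
#            the external shuffle service is already set up.
resource_args() {
    case "$1" in
        fixed)
            echo "--num-executors 4 --executor-cores 2 --executor-memory 4g"
            ;;
        dynamic)
            echo "--conf spark.dynamicAllocation.enabled=true" \
                 "--conf spark.shuffle.service.enabled=true" \
                 "--conf spark.dynamicAllocation.minExecutors=1" \
                 "--conf spark.dynamicAllocation.maxExecutors=10"
            ;;
        *)
            echo "unknown mode: $1" >&2
            return 1
            ;;
    esac
}

MODE=${MODE:-dynamic}
# Print the command instead of running it; fill in the placeholders
# before using it on the start node.
echo ./spark-submit --master yarn --deploy-mode client \
    $(resource_args "$MODE") \
    --conf spark.sql.hive.thriftServer.singleSession=true \
    --class org.apache.carbondata.spark.thriftserver.CarbonThriftServer \
    ../carbondata_lib/carbondata_2.11-1.2.0-shade-hadoop2.2.0.jar \
    "hdfs://[namenodeip]:8020/carbondata"
```

Running it with MODE=fixed prints the variant that reserves executors at startup, while the default prints the dynamic allocation variant.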

If you spot any mistakes, please point them out.
