CarbonData is a new columnar file format for distributed computing. This post works with CarbonData through the Spark Thrift mode and briefly describes how to start a Spark CarbonData Thrift server.

Versions: CDH 5.10.3, Spark 2.1.0, CarbonData 1.2.0

Downloads:
Spark: https://archive.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.6.tgz
CarbonData: https://dist.apache.org/repos/dist/release/carbondata/1.2.0/ (apache-carbondata-1.2.0-source-release.zip)
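For reference, the two packages can be fetched with wget (or any other HTTP client) using the URLs above:

wget https://archive.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.6.tgz
wget https://dist.apache.org/repos/dist/release/carbondata/1.2.0/apache-carbondata-1.2.0-source-release.zip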
CarbonData compilation

The build environment is JDK 1.8.0_151 and Maven 3.5.2 (lower versions have not been tried, but the official minimum is JDK 7 and Maven 3.3).
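Before compiling, you can quickly confirm the tool versions on the build machine:

java -version   # should report 1.8.x (JDK 7 is the stated minimum)
mvn -version    # should report 3.5.x (Maven 3.3 is the stated minimum)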
unzip apache-carbondata-1.2.0-source-release.zip
cd carbondata-parent-1.2.0
mvn -DskipTests -Pspark-2.1 -Dspark.version=2.1.0 clean package
After the build succeeds, carbondata_2.11-1.2.0-shade-hadoop2.2.0.jar appears in the carbondata-parent-1.2.0/assembly/target/scala-2.11/ directory (with different build options the jar name may differ).

Startup steps
Note:
1. In the commands below, the /carbondata directory on HDFS should be created by the hdfs user; if the user starting the server does not have permission to create it, grant that user read and write access with the chmod command (see the sketch after these notes).
2. Do not start the server on a node that also runs the HiveServer2 service, to avoid a port conflict (I have not yet found a way to change the Thrift server port).
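A minimal sketch of note 1, assuming the hdfs user is the HDFS superuser and the server will be started by a hypothetical user named sparkuser:

sudo -u hdfs hdfs dfs -mkdir -p /carbondata
sudo -u hdfs hdfs dfs -chown sparkuser:sparkuser /carbondata
# or, more loosely, grant read/write to everyone:
sudo -u hdfs hdfs dfs -chmod 777 /carbondata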
tar -zxvf spark-2.1.0-bin-hadoop2.6.tgz
cd spark-2.1.0-bin-hadoop2.6
cp /etc/hive/conf/hive-site.xml conf/   # so that Hive tables can be read
cp conf/spark-env.sh.template conf/spark-env.sh
vi conf/spark-env.sh
Add:
export HADOOP_CONF_DIR=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf.cloudera.yarn/
mkdir carbondata_lib
# Copy carbondata_2.11-1.2.0-shade-hadoop2.2.0.jar into the carbondata_lib directory. If you need to set CarbonData parameters, rename the carbon.properties.template file under the carbondata-parent-1.2.0/conf directory to carbon.properties and copy it to the spark-2.1.0-bin-hadoop2.6/conf directory.
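For example, a couple of commonly adjusted carbon.properties entries might look like the following; the exact keys and defaults should be checked against the template file, and the paths here are only illustrative:

# HDFS location where CarbonData stores table data
carbon.storelocation=hdfs://[namenodeip]:8020/carbondata/store
# local directory where records rejected during data load are written
carbon.badRecords.location=/tmp/carbondata/badrecords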
cd bin
# The startup command is:
./spark-submit \
--master yarn \
--deploy-mode client \
--conf spark.sql.hive.thriftServer.singleSession=true \
--class org.apache.carbondata.spark.thriftserver.CarbonThriftServer \
../carbondata_lib/carbondata_2.11-1.2.0-shade-hadoop2.2.0.jar \
hdfs://[namenodeip]:8020/carbondata
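Once the driver is running you can check that the Thrift server is listening; 10000 is the default HiveServer2 port, which is why the node must not also run HiveServer2 (the commands below are just one way to check):

netstat -tlnp | grep 10000   # the JDBC port should be in LISTEN state
yarn application -list       # the Spark application submitted above should be RUNNING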
Thrift server usage

beeline -u jdbc:hive2://[start node ip]:10000 -n <username>
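After connecting, CarbonData tables can be managed with ordinary SQL at the beeline prompt; a minimal sketch with a made-up table name and data path:

CREATE TABLE IF NOT EXISTS carbon_demo (id INT, name STRING) STORED BY 'carbondata';
LOAD DATA INPATH 'hdfs://[namenodeip]:8020/tmp/carbon_demo.csv' INTO TABLE carbon_demo;
SELECT COUNT(*) FROM carbon_demo;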
Postscript

Personally, I think this is the best way to work with CarbonData; after all, the JDBC approach is the most convenient.
Spark SQL and spark-shell can also be used to work with CarbonData, but different modes of operation on the same version may support different sets of operations.
If resources are plentiful, you can tune the Spark parameters so that the containers are started up front and SQL tasks can get resources as quickly as possible. If several applications share the YARN cluster, you can use Spark dynamic allocation, which requests resources and creates containers only when they are needed, as sketched below.
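For example, either approach is expressed with standard Spark options added to the spark-submit command above (the numbers are only illustrative):

# fixed resources requested up front:
--num-executors 10 --executor-cores 4 --executor-memory 8g
# or dynamic allocation (requires the external shuffle service on the NodeManagers):
--conf spark.dynamicAllocation.enabled=true \
--conf spark.shuffle.service.enabled=true \
--conf spark.dynamicAllocation.minExecutors=1 \
--conf spark.dynamicAllocation.maxExecutors=20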
If there is anything wrong above, please point it out.