First, a brief introduction to Shark
Shark is a SQL query engine built on top of Spark and Hive; its architecture diagram and performance figures can be found on the project website. (P.S. I also ran my own benchmark; see my Shark performance testing report.)
Two dependent components are involved: one is Apache Spark, and the other is AMPLab's Hive 0.11.
Pay attention to the version selection here. Be sure to use the officially recommended combination:
Spark 0.9.1 + AMPLab Hive 0.11 + Shark 0.9.1
Be sure to compile them yourself so that they match your own cluster.
Second, Shark cluster setup
1. Build the Spark cluster. For this step, refer to my earlier post on Spark cluster setup.
2. Compile AMPLab's Hive 0.11: go to its root directory and run ant package.
3. Compile Shark. This step is similar to compiling Spark. To stay compatible with your HDFS version, change the Hadoop version number in SharkBuild.scala under the project directory, then run sbt/sbt assembly.
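The Hadoop version edit in step 3 is a one-line change before assembling. A minimal sketch, with the caveat that the variable name HADOOP_VERSION is an assumption about the Shark 0.9.x build file layout (the stand-in file below takes the place of project/SharkBuild.scala; the cdh3u5 version string matches the assembly jar name that appears later in this post):

```shell
# Stand-in for project/SharkBuild.scala, with a hypothetical original version.
f=SharkBuild.scala.example
echo 'val HADOOP_VERSION = "1.0.4"' > "$f"

# Pin the build to the cluster's Hadoop version before running sbt/sbt assembly.
sed -i 's/val HADOOP_VERSION = ".*"/val HADOOP_VERSION = "0.20.2-cdh3u5"/' "$f"
cat "$f"
# then, in the real shark directory: sbt/sbt assembly
```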
Third, starting Spark + Shark
First, start Spark. Adjust Spark's configuration here, setting the following in spark-env.sh:
HADOOP_CONF_DIR=/home/hadoop/src/hadoop/conf
SPARK_CLASSPATH=/home/hadoop/src/hadoop/lib/:/app/hadoop/shengli/sharklib/*
SPARK_LOCAL_DIRS=/app/hadoop/shengli/spark/data
SPARK_MASTER_IP=10.1.8.210
SPARK_MASTER_WEBUI_PORT=7078
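Note that spark-env.sh is an ordinary shell script that Spark's launch scripts source, so each setting is a plain variable assignment. A self-contained illustration with a throwaway file (paths and values are examples only):

```shell
# Settings in a spark-env.sh-style file are plain shell assignments,
# picked up by sourcing the file.
cat > spark-env.example.sh <<'EOF'
SPARK_MASTER_IP=10.1.8.210
SPARK_MASTER_WEBUI_PORT=7078
EOF

. ./spark-env.example.sh
echo "master=$SPARK_MASTER_IP webui=$SPARK_MASTER_WEBUI_PORT"
```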
Next, configure Spark's spark-defaults.conf
spark.master spark://10.1.8.210:7077
spark.executor.memory 32g
spark.shuffle.spill true
java.library.path /usr/local/lib
spark.shuffle.consolidateFiles true
# spark.eventLog.enabled true
# spark.eventLog.dir hdfs://namenode:8021/directory
# spark.serializer org.apache.spark.serializer.KryoSerializer
Then configure slaves:
10.1.8.210  # the master node will not be used for caching
10.1.8.211
10.1.8.212
10.1.8.213
Finally, start the cluster with sbin/start-all.sh; the Spark cluster configuration is now complete.
Shark depends on a number of jar packages. We copy them all into a single directory:
#!/bin/bash
for jar in `find /home/hadoop/shengli/shark/lib -name '*jar'`; do
  cp $jar /home/hadoop/shengli/sharklib/
done
for jar in `find /home/hadoop/shengli/shark/lib_managed/jars -name '*jar'`; do
  cp $jar /home/hadoop/shengli/sharklib/
done
for jar in `find /home/hadoop/shengli/shark/lib_managed/bundles -name '*jar'`; do
  cp $jar /home/hadoop/shengli/sharklib/
done
Next, configure Shark in shark/conf/shark-env.sh:
# format as the JVM's -Xmx option, e.g. 300m or 1g.
export JAVA_HOME=/usr/java/jdk1.7.0_25

# (Required) Set the master program's memory
#export SHARK_MASTER_MEM=1g

# (Optional) Specify the location of Hive's configuration directory. By default,
# Shark run scripts will point it to $SHARK_HOME/conf
#export HIVE_CONF_DIR=""
export HADOOP_HOME=/home/hadoop/src/hadoop

# For running Shark in distributed mode, set the following:
export SHARK_MASTER_MEM=1g
export HADOOP_HOME=$HADOOP_HOME
export SPARK_HOME=/app/hadoop/shengli/spark
export SPARK_MASTER_IP=10.1.8.210
export MASTER=spark://10.1.8.210:7077

# Only required if using Mesos:
#export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so

# Only required if running Shark with Spark on YARN
#export SHARK_EXEC_MODE=yarn
#export SPARK_ASSEMBLY_JAR=
#export SHARK_ASSEMBLY_JAR=

# (Optional) Extra classpath
#export SPARK_LIBRARY_PATH=""

# Java options
# On EC2, change the local.dir to /mnt/tmp

# (Optional) Tachyon related configuration
#export TACHYON_MASTER=""                    # e.g. "localhost:19998"
#export TACHYON_WAREHOUSE_PATH=/sharktables  # could be any valid path name

#export HIVE_HOME=/home/hadoop/shengli/hive/build/dist
export HIVE_CONF_DIR=/app/hadoop/shengli/hive/conf
export CLASSPATH=$CLASSPATH:/home/hadoop/src/hadoop/lib:/home/hadoop/src/hadoop/lib/native:/app/hadoop/shengli/sharklib/*
export SCALA_HOME=/app/hadoop/shengli/scala-2.10.3
#export SPARK_LIBRARY_PATH=/home/hadoop/src/hadoop/lib/native/Linux-amd64-64
#export LD_LIBRARY_PATH=/home/hadoop/src/hadoop/lib/native/Linux-amd64-64

# Spark conf copied here
SPARK_JAVA_OPTS="-Dspark.cores.max=8 -Dspark.local.dir=/app/hadoop/shengli/spark/data -Dspark.deploy.defaultCores=2 -Dspark.executor.memory=24g -Dspark.shuffle.spill=true -Djava.library.path=/usr/local/lib "
SPARK_JAVA_OPTS+="-Xmx4g -Xms4g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseCompressedOops "
export SPARK_JAVA_OPTS
Next, deploy the Shark cluster: distribute the compiled Spark, Shark, and Hive to every node, and keep them up to date with rsync.
rsync --update -pav --progress /app/hadoop/shengli/spark/ [email protected]:/app/hadoop/shengli/spark/
......
rsync --update -pav --progress /app/hadoop/shengli/shark/ [email protected]:/app/hadoop/shengli/shark/
......
rsync --update -pav --progress /app/hadoop/shengli/hive/ [email protected]:/app/hadoop/shengli/hive/
......
rsync --update -pav --progress /app/hadoop/shengli/sharklib/ [email protected]:/app/hadoop/shengli/sharklib/
......
rsync --update -pav --progress /usr/java/jdk1.7.0_25/ [email protected]:/usr/java/jdk1.7.0_25/
......
Start Shark and check the cluster status on the web UI (port 7078, as configured above).
Go to SHARK_HOME/bin:
drwxr-xr-x 4 hadoop games 4.0K Jun 16 10:01 .
drwxr-xr-x   hadoop games 4.0K Jun 16 16:59 ..
-rwxr-xr-x 1 hadoop games  882 Apr    19:18 beeline
drwxr-xr-x 2 hadoop games 4.0K Jun 16 10:01 dev
drwxr-xr-x 2 hadoop games 4.0K Jun 16 10:01 ext
-rwxr-xr-x 1 hadoop games 1.4K Apr    19:18 shark
-rwxr-xr-x 1 hadoop games  730 Apr    19:18 shark-shell
-rwxr-xr-x 1 hadoop games  840 Apr    19:18 shark-withdebug
-rwxr-xr-x 1 hadoop games  838 Apr    19:18 shark-withinfo
Running shark launches the Shark CLI directly.
shark-shell is similar to spark-shell.
shark-withdebug launches the CLI with log4j at DEBUG level, which is useful for troubleshooting and for understanding how queries execute.
shark-withinfo does the same at INFO level.
Shark also provides SharkServer, which lets multiple clients share one application and its cached RDDs.
bin/shark -h 10.1.8.210 -p 7100
Starting the Shark Command line client
Logging initialized using configuration in jar:file:/app/hadoop/shengli/sharklib/hive-common-0.11.0-shark-0.9.1.jar!/hive-log4j.properties
Hive history file=/tmp/root/[email protected]_201406171640_1172020906.txt
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/app/hadoop/shengli/sharklib/slf4j-log4j12-1.7.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/app/hadoop/shengli/sharklib/shark-assembly-0.9.1-hadoop0.20.2-cdh3u5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/app/hadoop/shengli/shark/lib_managed/jars/org.slf4j/slf4j-log4j12/slf4j-log4j12-1.7.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2.870: [GC 262208K->21869K(1004928K), 0.0274310 secs]
[10.1.8.210:7100] shark>
This way, multiple clients can connect to the same port.
bin/shark -h 10.1.8.210 -p 7100
Starting the Shark Command line client
Logging initialized using configuration in jar:file:/app/hadoop/shengli/sharklib/hive-common-0.11.0-shark-0.9.1.jar!/hive-log4j.properties
Hive history file=/tmp/hadoop/[email protected]_201406171719_457245737.txt
SLF4J: (same multiple-bindings warnings as for the first client)
show ta3.050: [GC 262208K->22324K(1004928K), 0.0240010 secs]ble
[10.1.8.210:7100] shark> show tables;
Time taken (including network latency): 0.072 seconds
At this point, Shark has started successfully.
Fourth, testing
Let's run a simple test to see whether everything works, processing a 21 GB file.
[[email protected] shark]$ hadoop dfs -ls /user/hive/warehouse/log/
Found 1 items
-rw-r--r--   3 hadoop supergroup 22499035249 2014-06-16 18:32 /user/hive/warehouse/log/21gfile
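Joining the split size field in the listing (224990 35249 → 22499035249 bytes) matches the stated 21 GB:

```shell
# Express the file size from the hadoop dfs -ls output in GiB.
awk 'BEGIN { printf "%.1f GiB\n", 22499035249 / (1024*1024*1024) }'   # prints 21.0 GiB
```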
CREATE TABLE log (
  c1 string,
  c2 string,
  c3 string,
  c4 string,
  c5 string,
  c6 string,
  c7 string,
  c8 string,
  c9 string,
  c10 string,
  c11 string,
  c12 string
);
load data inpath '/user/hive/warehouse/log/21gfile' into table log;
Count the Log table:
[10.1.8.210:7100] shark> select count(1) from log;
171802086
Time taken (including network latency): 33.753 seconds
That took about 33 seconds.
Now load the log table into memory and count log_cached:
CREATE TABLE log_cached TBLPROPERTIES ("shark.cache" = "true") AS SELECT * FROM log;
Time taken (including network latency): 481.96 seconds
[10.1.8.210:7100] shark> select count(1) from log_cached;
171802086
Time taken (including network latency): 6.051 seconds
The cached query takes only about 6 seconds, a speedup of at least 5x.
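The speedup claim checks out against the two timings above (33.753 s uncached vs 6.051 s cached):

```shell
# Compute the cached-vs-uncached speedup from the measured query times.
awk 'BEGIN { printf "%.1fx\n", 33.753 / 6.051 }'   # prints 5.6x
```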
(Web UI screenshots omitted: executor and task status, and the Storage tab showing the cached table.)
At this point, the Shark cluster setup and a simple test are complete.
I may later write up common problems encountered during setup, along with more detailed Shark test conclusions.
Note: this is an original article; please credit the source when reprinting. From: http://blog.csdn.net/oopsoom/article/details/30513929
-eof-