Shark Cluster Setup Configuration

Tags: shuffle, rsync

First, a brief introduction to Shark

Shark is a SQL query engine built on top of Spark and Hive; its architecture diagram and performance comparison can be found on the official website. (PS: I also ran my own performance test; see my Shark performance testing report.)


Two dependent components are involved: one is Apache Spark, and the other is AMPLab's Hive 0.11.

Note the choice of version numbers here. Be sure to use the officially recommended combination:

Spark 0.9.1 + AMPLab Hive 0.11 + Shark 0.9.1

Be sure to compile them yourself and deploy them to your own cluster.


Second, building the Shark cluster

1. Build the Spark cluster. Refer to: Spark cluster build.

2. Compile AMPLab Hive 0.11: in its root folder, run ant package directly.

3. Compile Shark. This step is the same as compiling Spark: to stay compatible with your HDFS version, change the Hadoop version number in SharkBuild.scala under project/, then run sbt/sbt assembly, as sketched below.
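A minimal sketch of this step (the exact variable name inside SharkBuild.scala varies by release, so treat the grep target as an assumption and check your own copy):

# Inspect the Hadoop version Shark will build against.
cd /app/hadoop/shengli/shark
grep -n HADOOP project/SharkBuild.scala
# Edit the Hadoop version value to match your cluster's HDFS version, then build:
sbt/sbt assembly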


Third, start Spark + Shark

First, to start Spark, edit Spark's configuration files. Configure spark-env.sh:

HADOOP_CONF_DIR=/home/hadoop/src/hadoop/conf
SPARK_CLASSPATH=/home/hadoop/src/hadoop/lib/:/app/hadoop/shengli/sharklib/*
SPARK_LOCAL_DIRS=/app/hadoop/shengli/spark/data
SPARK_MASTER_IP=10.1.8.210
SPARK_MASTER_WEBUI_PORT=7078

Next, configure Spark's spark-defaults.conf

spark.master spark://10.1.8.210:7077
spark.executor.memory 32g
spark.shuffle.spill true
java.library.path /usr/local/lib
spark.shuffle.consolidateFiles true
# spark.eventLog.enabled true
# spark.eventLog.dir hdfs://namenode:8021/directory
# spark.serializer org.apache.spark.serializer.KryoSerializer


Then configure slaves:

10.1.8.210  # the master node here will not be used for caching
10.1.8.211
10.1.8.212
10.1.8.213

Finally, start the cluster with sbin/start-all.sh; the Spark cluster configuration is then complete. A quick sanity check is sketched below.
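A quick sanity check (a sketch; the process names come from standard Spark standalone mode, and the web UI port is the 7078 configured above):

# On the master, a Master process should be running; on each slave, a Worker.
jps | grep -E 'Master|Worker'
# The master web UI should list all the workers from the slaves file.
curl -s http://10.1.8.210:7078 | grep -ci worker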

Shark depends on a number of jar packages; we copy them all into a single directory:

#!/bin/bash
for jar in `find /home/hadoop/shengli/shark/lib -name '*jar'`; do
  cp $jar /home/hadoop/shengli/sharklib/
done
for jar in `find /home/hadoop/shengli/shark/lib_managed/jars -name '*jar'`; do
  cp $jar /home/hadoop/shengli/sharklib/
done
for jar in `find /home/hadoop/shengli/shark/lib_managed/bundles -name '*jar'`; do
  cp $jar /home/hadoop/shengli/sharklib/
done

Next configure Shark, in shark/conf/shark-env.sh:

# Format as the JVM's -Xmx option, e.g. 300m or 1g.
export JAVA_HOME=/usr/java/jdk1.7.0_25

# (Required) Set the master program's memory
#export SHARK_MASTER_MEM=1g

# (Optional) Specify the location of Hive's configuration directory. By default,
# Shark run scripts will point it to $SHARK_HOME/conf
#export HIVE_CONF_DIR=""
export HADOOP_HOME=/home/hadoop/src/hadoop

# For running Shark in distributed mode, set the following:
export SHARK_MASTER_MEM=1g
export HADOOP_HOME=$HADOOP_HOME
export SPARK_HOME=/app/hadoop/shengli/spark
export SPARK_MASTER_IP=10.1.8.210
export MASTER=spark://10.1.8.210:7077

# Only required if using Mesos:
#export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so

# Only required if running Shark with Spark on YARN
#export SHARK_EXEC_MODE=yarn
#export SPARK_ASSEMBLY_JAR=
#export SHARK_ASSEMBLY_JAR=

# (Optional) Extra classpath
#export SPARK_LIBRARY_PATH=""

# Java options
# On EC2, change the local.dir to /mnt/tmp

# (Optional) Tachyon related configuration
#export TACHYON_MASTER=""    # e.g. "localhost:19998"
#export TACHYON_WAREHOUSE_PATH=/sharktables  # Could be any valid path name

#export HIVE_HOME=/home/hadoop/shengli/hive/build/dist
export HIVE_CONF_DIR=/app/hadoop/shengli/hive/conf
export CLASSPATH=$CLASSPATH:/home/hadoop/src/hadoop/lib:/home/hadoop/src/hadoop/lib/native:/app/hadoop/shengli/sharklib/*
export SCALA_HOME=/app/hadoop/shengli/scala-2.10.3
#export SPARK_LIBRARY_PATH=/home/hadoop/src/hadoop/lib/native/Linux-amd64-64
#export LD_LIBRARY_PATH=/home/hadoop/src/hadoop/lib/native/Linux-amd64-64

# Spark conf copied here
SPARK_JAVA_OPTS="-Dspark.cores.max=8 -Dspark.local.dir=/app/hadoop/shengli/spark/data -Dspark.deploy.defaultCores=2 -Dspark.executor.memory=24g -Dspark.shuffle.spill=true -Djava.library.path=/usr/local/lib "
SPARK_JAVA_OPTS+="-Xmx4g -Xms4g -verbose:gc -XX:-PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseCompressedOops "
export SPARK_JAVA_OPTS


Next, configure the Shark cluster: distribute the compiled Spark, Shark, and Hive to every node, and keep them up to date with rsync (a loop version is sketched after the commands below).

rsync --update -pav --progress /app/hadoop/shengli/spark/ <user>@<slave>:/app/hadoop/shengli/spark/
......
rsync --update -pav --progress /app/hadoop/shengli/shark/ <user>@<slave>:/app/hadoop/shengli/shark/
......
rsync --update -pav --progress /app/hadoop/shengli/hive/ <user>@<slave>:/app/hadoop/shengli/hive/
......
rsync --update -pav --progress /app/hadoop/shengli/sharklib/ <user>@<slave>:/app/hadoop/shengli/sharklib/
......
rsync --update -pav --progress /usr/java/jdk1.7.0_25/ <user>@<slave>:/usr/java/jdk1.7.0_25/
...
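The same distribution can be written as one loop (a sketch; the hadoop user and the slave IPs are assumptions taken from the slaves file above):

#!/bin/bash
# Sync every component to every slave node.
for host in 10.1.8.211 10.1.8.212 10.1.8.213; do
  for dir in spark shark hive sharklib; do
    rsync --update -pav --progress /app/hadoop/shengli/$dir/ hadoop@$host:/app/hadoop/shengli/$dir/
  done
  rsync --update -pav --progress /usr/java/jdk1.7.0_25/ hadoop@$host:/usr/java/jdk1.7.0_25/
done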

Start Shark, and view the cluster status on the web UI (web UI port 7078 is configured above).

Go into SHARK_HOME/bin:

drwxr-xr-x  4 hadoop games 4.0K Jun    10:01 .
drwxr-xr-x    hadoop games 4.0K Jun 16 16:59 ..
-rwxr-xr-x  1 hadoop games  882 Apr    19:18 beeline
drwxr-xr-x  2 hadoop games 4.0K Jun    10:01 dev
drwxr-xr-x  2 hadoop games 4.0K Jun    10:01 ext
-rwxr-xr-x  1 hadoop games 1.4K Apr    19:18 shark
-rwxr-xr-x  1 hadoop games  730 Apr    19:18 shark-shell
-rwxr-xr-x  1 hadoop games  840 Apr    19:18 shark-withdebug
-rwxr-xr-x  1 hadoop games  838 Apr    19:18 shark-withinfo




Here, shark starts the Shark command-line client directly.

shark-shell is similar to spark-shell.

shark-withdebug starts execution with log4j in DEBUG mode, which is useful for troubleshooting errors and understanding execution.

shark-withinfo is the same, but at INFO level.


Shark also provides a shark-server mode, in which multiple clients share one application and its cached RDDs.
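The server side is started first, along the following lines (a sketch: the --service flag follows the Hive CLI convention that Shark wraps, so verify the exact form against your Shark build):

# Start a SharkServer listening on port 7100; clients then attach with -h/-p.
bin/shark --service sharkserver -p 7100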

bin/shark -h 10.1.8.210 -p 7100
-h 10.1.8.210 -p 7100
Starting the Shark Command Line Client
Logging initialized using configuration in jar:file:/app/hadoop/shengli/sharklib/hive-common-0.11.0-shark-0.9.1.jar!/hive-log4j.properties
Hive history file=/tmp/root/[email protected]_201406171640_1172020906.txt
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/app/hadoop/shengli/sharklib/slf4j-log4j12-1.7.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/app/hadoop/shengli/sharklib/shark-assembly-0.9.1-hadoop0.20.2-cdh3u5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/app/hadoop/shengli/shark/lib_managed/jars/org.slf4j/slf4j-log4j12/slf4j-log4j12-1.7.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2.870: [GC 262208K->21869K(1004928K), 0.0274310 secs]
[10.1.8.210:7100] shark>

Multiple clients can then connect to the same port.

bin/shark -h 10.1.8.210 -p 7100
-h 10.1.8.210 -p 7100
Starting the Shark Command Line Client
Logging initialized using configuration in jar:file:/app/hadoop/shengli/sharklib/hive-common-0.11.0-shark-0.9.1.jar!/hive-log4j.properties
Hive history file=/tmp/hadoop/[email protected]_201406171719_457245737.txt
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/app/hadoop/shengli/sharklib/slf4j-log4j12-1.7.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/app/hadoop/shengli/sharklib/shark-assembly-0.9.1-hadoop0.20.2-cdh3u5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/app/hadoop/shengli/shark/lib_managed/jars/org.slf4j/slf4j-log4j12/slf4j-log4j12-1.7.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
show ta3.050: [GC 262208K->22324K(1004928K), 0.0240010 secs]ble
[10.1.8.210:7100] shark> show tables;
Time taken (including network latency): 0.072 seconds

At this point, Shark startup is complete.


Fourth, testing

As a simple test of whether it works, process a 21 GB file.

[[email protected] shark]$ hadoop dfs -ls /user/hive/warehouse/log/
Found 1 items
-rw-r--r--   3 hadoop supergroup 22499035249 2014-06-16 18:32 /user/hive/warehouse/log/21gfile

CREATE TABLE log (
  c1 string,
  c2 string,
  c3 string,
  c4 string,
  c5 string,
  c6 string,
  c7 string,
  c8 string,
  c9 string,
  c10 string,
  c11 string,
  c12 string,

load data inpath '/user/hive/warehouse/log/21gfile' into table log;

Count the Log table:

[10.1.8.210:7100] shark> select count(1) from log;
171802086
Time taken (including network latency): 33.753 seconds

This takes about 33 seconds.


Load the log table into memory as log_cached, then count it:

shark> CREATE TABLE log_cached TBLPROPERTIES ("shark.cache" = "true") AS SELECT * FROM log;
Time taken (including network latency): 481.96 seconds
shark> select count(1) from log_cached;
171802086
Time taken (including network latency): 6.051 seconds

The cached count takes 6 seconds, a speedup of at least 5x.
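As an aside, Shark's table-name convention offers an equivalent way to cache (a sketch; the _cached suffix behavior is Shark's documented default, while the -e flag is assumed to pass through from the underlying Hive CLI):

# Any table whose name ends in "_cached" is cached by default,
# so the TBLPROPERTIES clause can be omitted:
bin/shark -h 10.1.8.210 -p 7100 -e 'CREATE TABLE log_cached AS SELECT * FROM log;'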



View the executor and task status, and the storage status (Storage tab), on the web UI; the screenshots are in the original post.



At this point, the Shark cluster setup and a simple test are complete.

I may later write up common problems encountered during environment setup, along with more detailed Shark test conclusions.


Note: original article; please indicate the source when reprinting. From: http://blog.csdn.net/oopsoom/article/details/30513929

-eof-
