First, a brief introduction to Shark
Shark is a SQL query engine built on top of Spark and Hive; its architecture diagram and performance figures can be found on the project website. (P.S. I also ran my own benchmark; see my Shark performance testing report.)
Two dependent components are involved: one is Apache Spark, and the other is AMPLab's Hive 0.11.
Pay attention to the version selection here. Be sure to use the officially recommended combination:
Spark 0.9.1 + AMPLab Hive 0.11 + Shark 0.9.1
Be sure to compile them yourself so that they match your own cluster.
Second, Shark cluster setup
1. Build the Spark cluster. For this step, refer to my earlier post on Spark cluster setup.
2. Compile AMPLab's Hive 0.11: go to its root directory and run ant package.
3. Compile Shark. This step is similar to compiling Spark. To stay compatible with your HDFS version, change the Hadoop version number in SharkBuild.scala under the project directory, then run sbt/sbt assembly.
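The Hadoop version edit in step 3 is a one-line change before assembling. A minimal sketch, with the caveat that the variable name HADOOP_VERSION is an assumption about the Shark 0.9.x build file layout (the stand-in file below takes the place of project/SharkBuild.scala; the cdh3u5 version string matches the assembly jar name that appears later in this post):

```shell
# Stand-in for project/SharkBuild.scala, with a hypothetical original version.
f=SharkBuild.scala.example
echo 'val HADOOP_VERSION = "1.0.4"' > "$f"

# Pin the build to the cluster's Hadoop version before running sbt/sbt assembly.
sed -i 's/val HADOOP_VERSION = ".*"/val HADOOP_VERSION = "0.20.2-cdh3u5"/' "$f"
cat "$f"
# then, in the real shark directory: sbt/sbt assembly
```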
Third, starting Spark + Shark
First, start Spark. Adjust Spark's configuration here, setting the following in spark-env.sh:
HADOOP_CONF_DIR=/home/hadoop/src/hadoop/conf
SPARK_CLASSPATH=/home/hadoop/src/hadoop/lib/:/app/hadoop/shengli/sharklib/*
SPARK_LOCAL_DIRS=/app/hadoop/shengli/spark/data
SPARK_MASTER_IP=10.1.8.210
SPARK_MASTER_WEBUI_PORT=7078
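Note that spark-env.sh is an ordinary shell script that Spark's launch scripts source, so each setting is a plain variable assignment. A self-contained illustration with a throwaway file (paths and values are examples only):

```shell
# Settings in a spark-env.sh-style file are plain shell assignments,
# picked up by sourcing the file.
cat > spark-env.example.sh <<'EOF'
SPARK_MASTER_IP=10.1.8.210
SPARK_MASTER_WEBUI_PORT=7078
EOF

. ./spark-env.example.sh
echo "master=$SPARK_MASTER_IP webui=$SPARK_MASTER_WEBUI_PORT"
```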
Next, configure Spark's spark-defaults.conf
spark.master spark://10.1.8.210:7077
spark.executor.memory 32g
spark.shuffle.spill true
java.library.path /usr/local/lib
spark.shuffle.consolidateFiles true
# spark.eventLog.enabled true
# spark.eventLog.dir hdfs://namenode:8021/directory
# spark.serializer org.apache.spark.serializer.KryoSerializer
Then configure slaves:
10.1.8.210  # the master node will not be used for caching
10.1.8.211
10.1.8.212
10.1.8.213
Finally, start the cluster with sbin/start-all.sh; the Spark cluster configuration is now complete.
Shark depends on a number of jar packages. We copy them all into a single directory:
#!/bin/bash
for jar in `find /home/hadoop/shengli/shark/lib -name '*jar'`; do
  cp $jar /home/hadoop/shengli/sharklib/
done
for jar in `find /home/hadoop/shengli/shark/lib_managed/jars -name '*jar'`; do
  cp $jar /home/hadoop/shengli/sharklib/
done
for jar in `find /home/hadoop/shengli/shark/lib_managed/bundles -name '*jar'`; do
  cp $jar /home/hadoop/shengli/sharklib/
done
Next, configure Shark in shark/conf/shark-env.sh:
# format as the JVM's -Xmx option, e.g. 300m or 1g.
export JAVA_HOME=/usr/java/jdk1.7.0_25

# (Required) Set the master program's memory
#export SHARK_MASTER_MEM=1g

# (Optional) Specify the location of Hive's configuration directory. By default,
# Shark run scripts will point it to $SHARK_HOME/conf
#export HIVE_CONF_DIR=""
export HADOOP_HOME=/home/hadoop/src/hadoop

# For running Shark in distributed mode, set the following:
export SHARK_MASTER_MEM=1g
export HADOOP_HOME=$HADOOP_HOME
export SPARK_HOME=/app/hadoop/shengli/spark
export SPARK_MASTER_IP=10.1.8.210
export MASTER=spark://10.1.8.210:7077

# Only required if using Mesos:
#export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so

# Only required if running Shark with Spark on YARN
#export SHARK_EXEC_MODE=yarn
#export SPARK_ASSEMBLY_JAR=
#export SHARK_ASSEMBLY_JAR=

# (Optional) Extra classpath
#export SPARK_LIBRARY_PATH=""

# Java options
# On EC2, change the local.dir to /mnt/tmp

# (Optional) Tachyon related configuration
#export TACHYON_MASTER=""                    # e.g. "localhost:19998"
#export TACHYON_WAREHOUSE_PATH=/sharktables  # could be any valid path name

#export HIVE_HOME=/home/hadoop/shengli/hive/build/dist
export HIVE_CONF_DIR=/app/hadoop/shengli/hive/conf
export CLASSPATH=$CLASSPATH:/home/hadoop/src/hadoop/lib:/home/hadoop/src/hadoop/lib/native:/app/hadoop/shengli/sharklib/*
export SCALA_HOME=/app/hadoop/shengli/scala-2.10.3
#export SPARK_LIBRARY_PATH=/home/hadoop/src/hadoop/lib/native/Linux-amd64-64
#export LD_LIBRARY_PATH=/home/hadoop/src/hadoop/lib/native/Linux-amd64-64

# Spark conf copied here
SPARK_JAVA_OPTS="-Dspark.cores.max=8 -Dspark.local.dir=/app/hadoop/shengli/spark/data -Dspark.deploy.defaultCores=2 -Dspark.executor.memory=24g -Dspark.shuffle.spill=true -Djava.library.path=/usr/local/lib "
SPARK_JAVA_OPTS+="-Xmx4g -Xms4g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseCompressedOops "
export SPARK_JAVA_OPTS
Next, deploy the Shark cluster: distribute the compiled Spark, Shark, and Hive to every node, and keep them up to date with rsync.
rsync --update -pav --progress /app/hadoop/shengli/spark/ [email protected]:/app/hadoop/shengli/spark/
......
rsync --update -pav --progress /app/hadoop/shengli/shark/ [email protected]:/app/hadoop/shengli/shark/
......
rsync --update -pav --progress /app/hadoop/shengli/hive/ [email protected]:/app/hadoop/shengli/hive/
......
rsync --update -pav --progress /app/hadoop/shengli/sharklib/ [email protected]:/app/hadoop/shengli/sharklib/
......
rsync --update -pav --progress /usr/java/jdk1.7.0_25/ [email protected]:/usr/java/jdk1.7.0_25/
......
Start Shark and check the cluster status on the web UI (port 7078, as configured above).
Go to SHARK_HOME/bin:
drwxr-xr-x 4 hadoop games 4.0K Jun 16 10:01 .
drwxr-xr-x   hadoop games 4.0K Jun 16 16:59 ..
-rwxr-xr-x 1 hadoop games  882 Apr    19:18 beeline
drwxr-xr-x 2 hadoop games 4.0K Jun 16 10:01 dev
drwxr-xr-x 2 hadoop games 4.0K Jun 16 10:01 ext
-rwxr-xr-x 1 hadoop games 1.4K Apr    19:18 shark
-rwxr-xr-x 1 hadoop games  730 Apr    19:18 shark-shell
-rwxr-xr-x 1 hadoop games  840 Apr    19:18 shark-withdebug
-rwxr-xr-x 1 hadoop games  838 Apr    19:18 shark-withinfo
Running shark launches the Shark CLI directly.
shark-shell is similar to spark-shell.
shark-withdebug launches the CLI with log4j at DEBUG level, which is useful for troubleshooting and for understanding how queries execute.
shark-withinfo does the same at INFO level.
Shark also provides SharkServer, which lets multiple clients share one application and its cached RDDs.
bin/shark -h 10.1.8.210 -p 7100
Starting the Shark Command line client
Logging initialized using configuration in jar:file:/app/hadoop/shengli/sharklib/hive-common-0.11.0-shark-0.9.1.jar!/hive-log4j.properties
Hive history file=/tmp/root/[email protected]_201406171640_1172020906.txt
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/app/hadoop/shengli/sharklib/slf4j-log4j12-1.7.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/app/hadoop/shengli/sharklib/shark-assembly-0.9.1-hadoop0.20.2-cdh3u5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/app/hadoop/shengli/shark/lib_managed/jars/org.slf4j/slf4j-log4j12/slf4j-log4j12-1.7.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2.870: [GC 262208K->21869K(1004928K), 0.0274310 secs]
[10.1.8.210:7100] shark>
This way, multiple clients can connect to the same port.
bin/shark -h 10.1.8.210 -p 7100
Starting the Shark Command line client
Logging initialized using configuration in jar:file:/app/hadoop/shengli/sharklib/hive-common-0.11.0-shark-0.9.1.jar!/hive-log4j.properties
Hive history file=/tmp/hadoop/[email protected]_201406171719_457245737.txt
SLF4J: (same multiple-bindings warnings as for the first client)
show ta3.050: [GC 262208K->22324K(1004928K), 0.0240010 secs]ble
[10.1.8.210:7100] shark> show tables;
Time taken (including network latency): 0.072 seconds
At this point, Shark has started successfully.
Fourth, testing
Let's run a simple test to see whether everything works, processing a 21 GB file.
[[email protected] shark]$ hadoop dfs -ls /user/hive/warehouse/log/
Found 1 items
-rw-r--r--   3 hadoop supergroup 22499035249 2014-06-16 18:32 /user/hive/warehouse/log/21gfile
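Joining the split size field in the listing (224990 35249 → 22499035249 bytes) matches the stated 21 GB:

```shell
# Express the file size from the hadoop dfs -ls output in GiB.
awk 'BEGIN { printf "%.1f GiB\n", 22499035249 / (1024*1024*1024) }'   # prints 21.0 GiB
```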
CREATE TABLE log (
  c1 string,
  c2 string,
  c3 string,
  c4 string,
  c5 string,
  c6 string,
  c7 string,
  c8 string,
  c9 string,
  c10 string,
  c11 string,
  c12 string
);
load data inpath '/user/hive/warehouse/log/21gfile' into table log;
Count the Log table:
[10.1.8.210:7100] shark> select count(1) from log;
171802086
Time taken (including network latency): 33.753 seconds
That took about 33 seconds.
Now load the log table into memory and count log_cached:
CREATE TABLE log_cached TBLPROPERTIES ("shark.cache" = "true") AS SELECT * FROM log;
Time taken (including network latency): 481.96 seconds
[10.1.8.210:7100] shark> select count(1) from log_cached;
171802086
Time taken (including network latency): 6.051 seconds
The cached query takes only about 6 seconds, a speedup of at least 5x.
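The speedup claim checks out against the two timings above (33.753 s uncached vs 6.051 s cached):

```shell
# Compute the cached-vs-uncached speedup from the measured query times.
awk 'BEGIN { printf "%.1fx\n", 33.753 / 6.051 }'   # prints 5.6x
```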
(Web UI screenshots omitted: executor and task status, and the Storage tab showing the cached table.)
At this point, the Shark cluster setup and a simple test are complete.
I may later write up common problems encountered during setup, along with more detailed Shark test conclusions.
Note: this is an original article; please credit the source when reprinting. From: http://blog.csdn.net/oopsoom/article/details/30513929
-eof-