Transferred from: http://in.sdo.com/?p=325
A Small Test of Spark/Shark
Recently I set up Spark and Shark on our test cluster to try them out.
Spark is a highly efficient distributed computing system; its authors claim it can be up to 100x faster than Hadoop. Spark offers a higher-level API than Hadoop, so the same algorithm implemented on Spark is often only 1/10 or even 1/100 the length of its Hadoop equivalent. Shark is "SQL on Spark", an implementation of a data warehouse on top of Spark; while remaining Hive-compatible, it can run queries up to 100x faster than Hive.
Spark
Download Spark
$ wget https://github.com/downloads/mesos/spark/spark-0.6.0-prebuilt.tar.gz
$ tar xvfz spark-0.6.0-prebuilt.tar.gz
Edit spark-0.6.0/conf/slaves and add each slave node's hostname, one per line.
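For example, with three hypothetical worker hosts named slave1 through slave3, the slaves file would contain:
slave1
slave2
slave3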
Edit spark-0.6.0/conf/spark-env.sh, setting SCALA_HOME, SPARK_WORKER_MEMORY, etc.:
export SCALA_HOME=/home/hdfs/guojian/scala-2.9.2
export JAVA_HOME=/home/hdfs/java-current
export SPARK_WORKER_MEMORY=8g
export SPARK_MASTER_IP=10.133.103.11
export SPARK_MASTER_PORT=7077
SPARK_WORKER_MEMORY is the maximum amount of memory Spark may use on each worker node. The larger it is set, the more data can be cached in memory.
Besides Scala and Java above, Spark also needs an HDFS cluster.
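Before going further, it is worth confirming that HDFS is reachable from the master node; assuming the Hadoop client is already on the PATH, a quick check is:
$ hadoop fs -ls /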
If Scala is not installed on your system, you can download it as follows:
$ wget http://www.scala-lang.org/downloads/distrib/files/scala-2.9.2.tgz
$ tar xvfz scala-2.9.2.tgz
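You can verify that the unpacked Scala runs correctly (the path below matches the SCALA_HOME set above):
$ /home/hdfs/guojian/scala-2.9.2/bin/scala -version    # should report version 2.9.2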
Go to the spark-0.6.0/bin directory and run the start-all.sh script to start the Spark cluster.
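Concretely:
$ cd spark-0.6.0/bin
$ ./start-all.sh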
Once it is up, you can check the status of the Spark cluster at http://10.133.103.11:8080/.
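If you only have shell access to the cluster, a rough sanity check (assuming the default web UI port of 8080) is to fetch the status page and look for the registered workers:
$ curl -s http://10.133.103.11:8080/ | grep -i worker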
Shark
$ wget https://github.com/downloads/amplab/shark/shark-0.2.1-bin.tgz
$ tar xvfz shark-0.2.1-bin.tgz
Edit shark-0.2.1/conf/shark-env.sh, setting HIVE_HOME, SCALA_HOME, MASTER, etc.:
export HADOOP_HOME=/home/hdfs/hadoop-current
export HIVE_HOME=/home/hdfs/guojian/hive-0.9.0-bin
export MASTER=spark://10.133.103.11:7077
export SPARK_HOME=/home/hdfs/guojian/spark-0.6.0
export SCALA_HOME=/home/hdfs/guojian/scala-2.9.2
export SPARK_MEM=5g
source $SPARK_HOME/conf/spark-env.sh
It is worth noting that the value of SPARK_MEM must not be greater than the SPARK_WORKER_MEMORY set earlier.
Copy the spark-0.6.0 and shark-0.2.1 directories to each slave node (this assumes passwordless SSH from the master to every slave). One way to do it:
$ while read slave_host; do
$   rsync -Pav spark-0.6.0 shark-0.2.1 $slave_host:
$ done < /path/to/spark/conf/slaves
Now it's time to start Shark:
$ ./bin/shark-withinfo
You can run some SQL to test whether Shark works. Note that Shark treats a table whose name ends in _cached as an in-memory cached table, which is why the last two statements below create and query src_cached.
CREATE TABLE src (key INT, value STRING);
LOAD DATA LOCAL INPATH '${env:HIVE_HOME}/examples/files/kv1.txt' INTO TABLE src;
SELECT COUNT(1) FROM src;
CREATE TABLE src_cached AS SELECT * FROM src;
SELECT COUNT(1) FROM src_cached;
The test results are as follows:
As a simple comparison: in MapReduce, even the simplest count takes at least 20-30 seconds, while with Spark/Shark it takes only about 1 second. Because Spark/Shark works mostly in memory, it greatly reduces IO overhead and improves computational efficiency. Of course, the speedup differs from one kind of computation to another; I will run further tests on other operations later.
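To reproduce this kind of rough timing yourself, one option is to time a single query from the command line. This assumes the Shark CLI accepts Hive's -e option (Shark's CLI is derived from Hive's); note the measured time also includes JVM startup, so it overstates the query time itself:
$ time ./bin/shark -e 'SELECT COUNT(1) FROM src_cached;'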
Reference: http://www.csdn.net/article/2013-04-26/2815057-Spark-Reynold
Spark project home: http://spark-project.org/
Shark project home: https://github.com/amplab/shark/wiki
Downloads: https://github.com/amplab/shark/downloads