Setting Up a Spark and Shark Environment

Source: Internet
Author: User

Transferred from: http://in.sdo.com/?p=325


Spark/Shark: A Small Test

I recently set up Spark and Shark on a test cluster to try them out.

Spark is a highly efficient distributed computing system that claims to be up to 100 times faster than Hadoop. Spark provides a higher-level API than Hadoop, and the same algorithm implemented in Spark is often only 1/10 or 1/100 the length of the Hadoop version. Shark, in effect "SQL on Spark", is a data warehouse implementation on Spark that stays compatible with Hive while running up to 100 times faster than Hive.

Spark

Download Spark

$ wget https://github.com/downloads/mesos/spark/spark-0.6.0-prebuilt.tar.gz
$ tar xvfz spark-0.6.0-prebuilt.tar.gz

Edit spark-0.6.0/conf/slaves and add each slave node's hostname, one per line.
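
As a minimal sketch of this step, the helper below writes one hostname per line to a file. The function name write_slaves, the example hostnames and the output path are all hypothetical; substitute your own nodes and point the first argument at spark-0.6.0/conf/slaves.

```shell
# write_slaves FILE HOST...: write one hostname per line to FILE.
# (Hypothetical helper; not part of Spark.)
write_slaves() {
  slaves_file=$1
  shift
  printf '%s\n' "$@" > "$slaves_file"
}

# Example with placeholder hostnames and a placeholder output path.
write_slaves ./slaves.example slave1.example.com slave2.example.com
cat ./slaves.example
```

Writing the file from a script rather than by hand makes it easier to keep the list identical on every machine.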

Edit spark-0.6.0/conf/spark-env.sh and set SCALA_HOME and SPARK_WORKER_MEMORY:

export SCALA_HOME=/home/hdfs/guojian/scala-2.9.2
export JAVA_HOME=/home/hdfs/java-current
export SPARK_WORKER_MEMORY=8g
export SPARK_MASTER_IP=10.133.103.11
export SPARK_MASTER_PORT=7077

SPARK_WORKER_MEMORY is the maximum amount of memory Spark may use on each node. The larger the value, the more data can be cached in memory.

Besides Scala and Java, Spark also needs an HDFS cluster.
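
A quick way to see which prerequisites are missing is sketched below. It assumes the java, scala and hadoop commands would be on PATH if installed, which may not match a setup that relies only on SCALA_HOME and JAVA_HOME; the function name missing_tools is hypothetical.

```shell
# missing_tools NAME...: print each command name that is not found on PATH.
# (Hypothetical helper; assumes the tools are expected on PATH.)
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

missing=$(missing_tools java scala hadoop)
if [ -n "$missing" ]; then
  echo "Not found on PATH:" $missing
else
  echo "java, scala and hadoop are all on PATH"
fi
```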

If Scala is not installed on your system, you can download it as follows:

$ wget http://www.scala-lang.org/downloads/distrib/files/scala-2.9.2.tgz 
$ tar xvfz scala-2.9.2.tgz

Go to the spark-0.6.0/bin directory and run the start-all.sh script to start the Spark cluster.

Once the cluster is running, you can view the status of Spark at http://10.133.103.11:8080/, as shown below:

Shark

$ wget https://github.com/downloads/amplab/shark/shark-0.2.1-bin.tgz
$ tar xvfz shark-0.2.1-bin.tgz

Edit shark-0.2.1/conf/shark-env.sh and set HIVE_HOME, SCALA_HOME, MASTER, and so on:

export HADOOP_HOME=/home/hdfs/hadoop-current
export HIVE_HOME=/home/hdfs/guojian/hive-0.9.0-bin
export MASTER=spark://10.133.103.11:7077
export SPARK_HOME=/home/hdfs/guojian/spark-0.6.0
export SCALA_HOME=/home/hdfs/guojian/scala-2.9.2
export SPARK_MEM=5g

source $SPARK_HOME/conf/spark-env.sh

Note that the value of SPARK_MEM must not be greater than the SPARK_WORKER_MEMORY value set earlier.
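
This constraint can be checked before starting anything. The sketch below assumes both values are written as an integer followed by m or g (as in 5g and 8g above); the helper names to_mb and check_spark_mem are hypothetical and not part of Spark or Shark.

```shell
# to_mb SIZE: convert a size such as 512m or 8g to megabytes.
# (Assumes an integer followed by m/M or g/G; anything else is rejected.)
to_mb() {
  case "$1" in
    *[gG]) echo $(( ${1%[gG]} * 1024 )) ;;
    *[mM]) echo "${1%[mM]}" ;;
    *) echo "unsupported size: $1" >&2; return 1 ;;
  esac
}

# check_spark_mem SPARK_MEM SPARK_WORKER_MEMORY:
# succeeds when the first value does not exceed the second.
check_spark_mem() {
  mem_mb=$(to_mb "$1") || return 2
  worker_mb=$(to_mb "$2") || return 2
  [ "$mem_mb" -le "$worker_mb" ]
}

# Values taken from the two configuration files above.
if check_spark_mem 5g 8g; then
  echo "SPARK_MEM fits within SPARK_WORKER_MEMORY"
else
  echo "SPARK_MEM exceeds SPARK_WORKER_MEMORY" >&2
fi
```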

Copy the spark and shark directories to each slave node. One way to do this:

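$ while read slave_host; do
    # The trailing colon makes the destination remote; the target directory
    # here is the parent of SPARK_HOME as configured above.
    rsync -Pav spark-0.6.0 shark-0.2.1 "$slave_host":/home/hdfs/guojian/
  done < /path/to/spark/conf/slaves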
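
Before copying, it can help to preview what would be run. The sketch below only prints one rsync command per host and executes nothing. The function name print_sync_commands is hypothetical, the destination directory is whatever you pass in, and skipping blank lines and lines starting with # is a convenience of this sketch rather than documented behavior of the slaves file.

```shell
# print_sync_commands SLAVES_FILE DEST_DIR:
# print (but do not run) one rsync command per host listed in SLAVES_FILE.
print_sync_commands() {
  while read -r slave_host; do
    case "$slave_host" in
      ''|'#'*) continue ;;   # skip blank lines and comment lines
    esac
    echo "rsync -Pav spark-0.6.0 shark-0.2.1 $slave_host:$2"
  done < "$1"
}

# Example with placeholder hostnames.
printf '%s\n' node1.example.com '' node2.example.com > ./slaves.preview
print_sync_commands ./slaves.preview /home/hdfs/guojian/
```

Once the printed commands look right, the same loop can run rsync directly instead of echoing it.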

Now it's time to start Shark.

$ ./bin/shark-withinfo

You can use SQL to test whether Shark is working.

CREATE TABLE src (key INT, value STRING);
LOAD DATA LOCAL INPATH '${env:HIVE_HOME}/examples/files/kv1.txt' INTO TABLE src;
SELECT COUNT(1) FROM src;
CREATE TABLE src_cached AS SELECT * FROM src;
SELECT COUNT(1) FROM src_cached;

The test results are as follows:

As a simple comparison, even the simplest count operation takes at least 20-30 seconds in MapReduce, whereas with Spark/Shark it takes only about 1 second. Because Spark/Shark does most of its work in memory, it greatly reduces I/O overhead and improves computational efficiency. Of course, the improvement varies from one kind of computation to another; I plan to test other operations when time permits.

Reference: http://www.csdn.net/article/2013-04-26/2815057-Spark-Reynold
Spark project home: http://spark-project.org/
Shark project home: https://github.com/amplab/shark/wiki
Resources download: https://github.com/amplab/shark/downloads
