Transferred from: http://in.sdo.com/?p=325
A Small Test of Spark/Shark
Recently I set up Spark and Shark on our test cluster to try them out.
Spark is a highly efficient distributed computing system; its authors claim it can be up to 100x faster than Hadoop. Spark offers a higher-level API than Hadoop, so the same algorithm implemented on Spark is often only 1/10 or even 1/100 the length of its Hadoop equivalent. Shark is "SQL on Spark", an implementation of a data warehouse on top of Spark; while remaining Hive-compatible, it can run queries up to 100x faster than Hive.
Spark
Download Spark
$ wget https://github.com/downloads/mesos/spark/spark-0.6.0-prebuilt.tar.gz
$ tar xvfz spark-0.6.0-prebuilt.tar.gz
Edit spark-0.6.0/conf/slaves and add each slave node's hostname, one per line.
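For example, with three hypothetical worker hosts named slave1 through slave3, the slaves file would contain:
slave1
slave2
slave3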
Edit spark-0.6.0/conf/spark-env.sh, setting SCALA_HOME, SPARK_WORKER_MEMORY, etc.:
export SCALA_HOME=/home/hdfs/guojian/scala-2.9.2
export JAVA_HOME=/home/hdfs/java-current
export SPARK_WORKER_MEMORY=8g
export SPARK_MASTER_IP=10.133.103.11
export SPARK_MASTER_PORT=7077
SPARK_WORKER_MEMORY is the maximum amount of memory Spark may use on each worker node. The larger it is set, the more data can be cached in memory.
Besides Scala and Java above, Spark also needs an HDFS cluster.
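Before going further, it is worth confirming that HDFS is reachable from the master node; assuming the Hadoop client is already on the PATH, a quick check is:
$ hadoop fs -ls /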
If Scala is not installed on your system, you can download it as follows:
$ wget http://www.scala-lang.org/downloads/distrib/files/scala-2.9.2.tgz
$ tar xvfz scala-2.9.2.tgz
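You can verify that the unpacked Scala runs correctly (the path below matches the SCALA_HOME set above):
$ /home/hdfs/guojian/scala-2.9.2/bin/scala -version    # should report version 2.9.2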
Go to the spark-0.6.0/bin directory and run the start-all.sh script to start the Spark cluster.
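Concretely:
$ cd spark-0.6.0/bin
$ ./start-all.sh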
Once it is up, you can check the status of the Spark cluster at http://10.133.103.11:8080/.
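If you only have shell access to the cluster, a rough sanity check (assuming the default web UI port of 8080) is to fetch the status page and look for the registered workers:
$ curl -s http://10.133.103.11:8080/ | grep -i worker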
Shark
$ wget https://github.com/downloads/amplab/shark/shark-0.2.1-bin.tgz
$ tar xvfz shark-0.2.1-bin.tgz
Edit shark-0.2.1/conf/shark-env.sh, setting HIVE_HOME, SCALA_HOME, MASTER, etc.:
export HADOOP_HOME=/home/hdfs/hadoop-current
export HIVE_HOME=/home/hdfs/guojian/hive-0.9.0-bin
export MASTER=spark://10.133.103.11:7077
export SPARK_HOME=/home/hdfs/guojian/spark-0.6.0
export SCALA_HOME=/home/hdfs/guojian/scala-2.9.2
export SPARK_MEM=5g
source $SPARK_HOME/conf/spark-env.sh
It is worth noting that the value of SPARK_MEM must not be greater than the SPARK_WORKER_MEMORY set earlier.
Copy the spark-0.6.0 and shark-0.2.1 directories to each slave node (this assumes passwordless SSH from the master to every slave). One way to do it:
$ while read slave_host; do
$   rsync -Pav spark-0.6.0 shark-0.2.1 $slave_host:
$ done < /path/to/spark/conf/slaves
Now it's time to start Shark:
$ ./bin/shark-withinfo
You can run some SQL to test whether Shark works. Note that Shark treats a table whose name ends in _cached as an in-memory cached table, which is why the last two statements below create and query src_cached.
CREATE TABLE src (key INT, value STRING);
LOAD DATA LOCAL INPATH '${env:HIVE_HOME}/examples/files/kv1.txt' INTO TABLE src;
SELECT COUNT(1) FROM src;
CREATE TABLE src_cached AS SELECT * FROM src;
SELECT COUNT(1) FROM src_cached;
The test results are as follows:
As a simple comparison: in MapReduce, even the simplest count takes at least 20-30 seconds, while with Spark/Shark it takes only about 1 second. Because Spark/Shark works mostly in memory, it greatly reduces IO overhead and improves computational efficiency. Of course, the speedup differs from one kind of computation to another; I will run further tests on other operations later.
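To reproduce this kind of rough timing yourself, one option is to time a single query from the command line. This assumes the Shark CLI accepts Hive's -e option (Shark's CLI is derived from Hive's); note the measured time also includes JVM startup, so it overstates the query time itself:
$ time ./bin/shark -e 'SELECT COUNT(1) FROM src_cached;'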
Reference: http://www.csdn.net/article/2013-04-26/2815057-Spark-Reynold
Spark project home: http://spark-project.org/
Shark project home: https://github.com/amplab/shark/wiki
Downloads: https://github.com/amplab/shark/downloads