Spark Installation and Learning

Abstract: Spark is a new-generation big data distributed processing framework that follows Hadoop, led by Matei Zaharia of UC Berkeley. I can only describe it as a remarkable tool built by a remarkable person; for details see http://www.spark-project.org/.

1 Scala installation

At the time of writing the latest version of Spark is 0.5, but when I started this document the current version was still 0.4, so everything below is based on version 0.4.

That said, engineers at Taobao have already tried 0.5 and written up installation notes at http://rdc.taobao.com/team/jm/archives/tag/spark.

~~~~~~~~~~~~~~~ my installation notes start below ~~~~~~~~~~~~~~

The version of Spark I use is 0.4, which is only available on GitHub and requires Scala 2.9.1.final. So first go to http://www.scala-lang.org/node/165 and download scala-2.9.1.final.tar.gz. After extracting it, place it under /opt, then add the following to /etc/profile:

export SCALA_HOME=/opt/scala-2.9.1.final

export PATH=$SCALA_HOME/bin:$PATH

2 Git installation

Git is required to download and compile Spark, so install Git first, either from the Ubuntu Software Center or with apt-get. After installing it, register an account at https://github.com (mine is jerrylead) with an email address and password, then follow the site's Get Started instructions to generate an RSA key.

Note: if you already have a local id_rsa.pub and authorized_keys, back them up, or convert the original key to DSA form, so that the GitHub key and the original key do not conflict.

3 Spark installation

Download the latest source code first

git clone git://github.com/mesos/spark.git

This produces a spark directory. Enter it, go into the conf subdirectory, rename spark-env.sh.template to spark-env.sh, and add the following line:

export SCALA_HOME=/opt/scala-2.9.1.final

Go back to the spark directory and start the build by running

$ sbt/sbt update compile

This command downloads many jars from the network and then compiles Spark. A successful build ends with a message like

[success] Total time: 1228 s, completed May 9, 3:42:11 PM

You can interact with Spark by running spark-shell.

You can also run the example programs first, using ./run <class> <params>:

./run spark.examples.SparkLR local[2]

This runs logistic regression locally with two threads.

./run spark.examples.SparkPi local

This runs the Pi estimator locally.

There are more examples under examples/src/main/scala.

4 Exporting Spark

Before using Spark it is convenient to export the compiled classes into a single jar. You can run

$ sbt/sbt assembly

which packages Spark and its dependencies into one jar, placed at

core/target/spark-core-assembly-0.4-SNAPSHOT.jar

Add this jar to your classpath and you can develop Spark applications against it.

In general, developing a Spark application requires importing some Spark classes and implicit conversions, so add the following at the beginning of the program:

import spark.SparkContext

import SparkContext._
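
To show these imports in context, here is a rough sketch of a minimal standalone job. The class name, input path and the two-argument SparkContext constructor used here are illustrative assumptions about the 0.4-era API, not something taken from the original article:

import spark.SparkContext
import SparkContext._

// Hypothetical example: count the lines of a local file that contain "spark".
object SimpleJob {
  def main(args: Array[String]) {
    // "local[2]" runs with two local threads; a Mesos master URL could be used instead.
    val sc = new SparkContext("local[2]", "SimpleJob")
    val lines = sc.textFile("/tmp/file01.txt")   // illustrative input path
    val matches = lines.filter(_.contains("spark")).count()
    println("Lines containing 'spark': " + matches)
  }
}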

5 Using Spark in interactive mode

1. Run ./spark-shell.sh

2. scala> val data = Array(1, 2, 3, 4, 5)   // generate the data

data: Array[Int] = Array(1, 2, 3, 4, 5)

3. scala> val distData = sc.parallelize(data)   // turn the data into an RDD

distData: spark.RDD[Int] = spark.ParallelCollection@7a0ec850 (the displayed type is RDD)

4. scala> distData.reduce(_ + _)   // operate on the RDD: sum its elements

12/05/10 09:36:20 INFO spark.SparkContext: Starting job...

5. The run finally returns

12/05/10 09:36:20 INFO spark.SparkContext: Job finished in 0.076729174 s

res2: Int = 15

6 Using Spark to process Hadoop datasets

Spark can create distributed datasets from HDFS, the local FS, Amazon S3, Hypertable and HBase. Spark supports text files, SequenceFiles and any other Hadoop InputFormat.

For example, to read text from HDFS and create an RDD:

scala> val distFile = sc.textFile("hdfs://m120:9000/user/lijiexu/demo/file01.txt")

12/05/10 09:49:01 INFO mapred.FileInputFormat: Total input paths to process : 1

distFile: spark.RDD[String] = spark.MappedRDD@59bf8a16

You can then count the number of characters in the text: map(_.size) processes each line of text to get the number of characters per line, producing a list, and reduce adds up all the elements of that list.

scala> distFile.map(_.size).reduce(_ + _)

12/05/10 09:50:02 INFO spark.SparkContext: Job finished in 0.139610772 s

res3: Int = 79

textFile can take a second argument that specifies the number of slices (a slice corresponds to the split/block concept in Hadoop; one task processes one slice). By default Spark treats one HDFS block as one slice, but you can increase the number of slices; it cannot be smaller than the number of blocks. For that you need to know how many blocks the HDFS file occupies, which you can check through the DFS web UI on port 50070.
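
For example, a sketch of requesting more slices than the default (the value 4 here is just illustrative):

scala> val distFile = sc.textFile("hdfs://m120:9000/user/lijiexu/demo/file01.txt", 4)   // ask for 4 slices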

For SequenceFiles, you can use SparkContext's sequenceFile[K, V] method to generate an RDD, where K and V must be the types the SequenceFile was stored with, i.e. subclasses of Writable. Spark also allows reading them with native types, for example sequenceFile[Int, String].
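
As a sketch (the path is illustrative, and the Writable variant assumes the standard Hadoop IntWritable/Text classes):

scala> import org.apache.hadoop.io.{IntWritable, Text}
scala> val seqData = sc.sequenceFile[IntWritable, Text]("hdfs://m120:9000/user/lijiexu/demo/data.seq")
scala> // or, using the native-type shorthand mentioned above:
scala> val seqData2 = sc.sequenceFile[Int, String]("hdfs://m120:9000/user/lijiexu/demo/data.seq")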

For more complex inputs, you can use the SparkContext.hadoopRDD method, which takes a JobConf together with the InputFormat, key class and value class, the same way a Hadoop Java client reads data.
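
A rough sketch of such a call, based on the parameter list just described (the JobConf setup, classes and path are illustrative, and the exact signature may differ slightly in this version):

scala> import org.apache.hadoop.mapred.{JobConf, FileInputFormat, TextInputFormat}
scala> import org.apache.hadoop.io.{LongWritable, Text}
scala> val conf = new JobConf()
scala> FileInputFormat.setInputPaths(conf, "hdfs://m120:9000/user/lijiexu/demo/file01.txt")
scala> val hadoopData = sc.hadoopRDD(conf, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])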

7 Distributed dataset operations

A distributed dataset supports two kinds of operations: transformations and actions. A transformation generates a new dataset from an old one, while an action computes over the dataset and returns a result to the driver program. (Each Spark application contains a driver program that executes the user's main function.) For example, map is a transformation that passes each element of the dataset through a function and produces a new dataset, while reduce is an action that aggregates the contents of the dataset and returns the result to the driver program. One exception is reduceByKey, which belongs to the transformations and returns a distributed dataset.

Note that Spark's transformations are lazy: a transformation only records the operation, and nothing is computed until an action needs to return a result to the driver program.
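
To make the transformation/action distinction and the lazy evaluation concrete, here is a small interactive sketch (the data is illustrative):

scala> val words = sc.parallelize(Array("a", "b", "a", "c"))
scala> val pairs = words.map(w => (w, 1))        // transformation: only recorded, nothing runs yet
scala> val counts = pairs.reduceByKey(_ + _)     // transformation: still lazy, returns a distributed dataset
scala> counts.collect()                          // action: the job actually runs and the result comes back to the driver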

Another feature is caching. If you ask Spark to cache a dataset (RDD), its slices will be stored in the memory of the corresponding nodes' partitions, so that reusing the dataset becomes much more efficient; this matters especially for iterative and interactive applications. If a cached RDD is lost, it is rebuilt by rerunning the transformations.
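
For example, a sketch reusing the text file from the earlier section (whether cache() marks the original RDD or returns a cached copy depends on the version, so the returned value is used here):

scala> val cachedFile = sc.textFile("hdfs://m120:9000/user/lijiexu/demo/file01.txt").cache()
scala> cachedFile.map(_.size).reduce(_ + _)   // the first action reads from HDFS and fills the node caches
scala> cachedFile.map(_.size).reduce(_ + _)   // later actions reuse the in-memory slices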

8 Shared variables

Unlike Hadoop's MapReduce, Spark allows shared variables, but only two restricted kinds: broadcast variables and accumulators.

A broadcast variable, as the name suggests, is "broadcast": it maintains a read-only variable on each node. For example, when a Hadoop map task needs a read-only dictionary to process text, every task has to load its own copy of the dictionary because there are no shared variables (of course, you can use the DistributedCache to work around this). In Spark, with broadcast, storing one copy of the dictionary per node is enough, so the granularity goes from per task up to per node, and the resource savings are obvious. Spark's broadcast routing algorithm also takes communication overhead into account.

Wrapping and sharing a variable v is done with SparkContext.broadcast(v).

scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))

12/05/10 10:54:21 INFO spark.BoundedMemoryCache: Asked to add key ((1,a5c2a151-185d-4ea4-aad1-9ec642eebc5d),0)
12/05/10 10:54:21 INFO spark.BoundedMemoryCache: Estimated size for key ((1,a5c2a151-185d-4ea4-aad1-9ec642eebc5d),0) is 12
12/05/10 10:54:21 INFO spark.BoundedMemoryCache: Size estimation for key ((1,a5c2a151-185d-4ea4-aad1-9ec642eebc5d),0) took 0 ms
12/05/10 10:54:21 INFO spark.BoundedMemoryCache: ensureFreeSpace((1,a5c2a151-185d-4ea4-aad1-9ec642eebc5d)) called with curBytes=12, maxBytes=339585269
12/05/10 10:54:21 INFO spark.BoundedMemoryCache: Adding key ((1,a5c2a151-185d-4ea4-aad1-9ec642eebc5d),0)
12/05/10 10:54:21 INFO spark.BoundedMemoryCache: Number of entries is now 2

broadcastVar: spark.broadcast.Broadcast[Array[Int]] = spark.Broadcast(a5c2a151-185d-4ea4-aad1-9ec642eebc5d)

After you create the broadcast variable, you can access the read-only raw variable v through .value.

scala> broadcastVar.value

res4: Array[Int] = Array(1, 2, 3)
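
Continuing the dictionary idea from above, a broadcast variable might be used inside a distributed operation roughly like this (the lookup table is illustrative):

scala> val dict = sc.broadcast(Map(1 -> "one", 2 -> "two", 3 -> "three"))
scala> sc.parallelize(Array(1, 2, 3)).map(k => dict.value.getOrElse(k, "?")).collect()
// would return Array(one, two, three); each node reads the dictionary from its local broadcast copy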

The other kind of shared variable is the accumulator, which as the name implies is a variable that can only be "added to"; MapReduce's counters are similarly variables that are accumulated continuously. Spark natively supports accumulators of type Int and Double.

An accumulator is created with SparkContext.accumulator(v), and running tasks can then add to it with the += operator. However, tasks cannot read the variable; only the driver program can read it (through .value), which also avoids heavy use of read/write locks.

Create an accumulator initialized to 0.

scala> val accum = sc.accumulator(0)

accum: spark.Accumulator[Int] = 0

Accumulate over the generated RDD; no reduce is needed this time.

scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)

12/05/10 11:05:48 INFO spark.SparkContext: Starting job...

scala> accum.value

res7: Int = 20

9 Installing Mesos

The Mesos version recommended for Spark 0.4 is revision 1205738, not the latest Mesos. I suspect the latest version would also work, but for now I use revision 1205738 here.

First download Mesos

svn checkout -r 1205738 https://svn.apache.org/repos/asf/incubator/mesos/trunk mesos

After you get the Mesos directory, install the software needed for the build

apt-get install python2.6 python2.6-dev

Unfortunately, although Ubuntu 11.04 ships Python 2.7, the WebUI (the Mesos web interface) requires Python 2.6, which is why it is installed above. Then run

apt-get install libcppunit-dev (installs cppunit)

Make sure the g++ version is greater than 4.1

If automake is missing, install it:

apt-get install autoconf automake libtool

Because my system is Ubuntu 11.04 (GNU/Linux 2.6.38-8-generic x86_64), i.e. Natty, I can use ./configure.template.ubuntu-natty-64 directly. However, the JDK I use is Sun's, so I modify --with-java-home inside configure.template.ubuntu-natty-64 to /opt/jdk1.6.0_27.

The whole procedure is as follows:

cp configure.template.ubuntu-natty-64 configure.template.ubuntu-my-natty-64

Modify configure.template.ubuntu-my-natty-64 so that it contains the following:

#!/bin/sh
export PYTHON=python2.7

$(dirname $0)/configure \
  --with-python-headers=/usr/include/python2.7 \
  --with-java-home=/opt/jdk1.6.0_27 \
  --with-webui \
  --with-included-zookeeper $@

Compiling Mesos

root@master:/opt/mesos# ./configure.template.ubuntu-my-natty-64

When that finishes, run

root@master:/opt/mesos# make

10 Configuring Mesos and Spark

First install Mesos on slave1, slave2, slave3 and master; here I install it in /opt/mesos.

Enter the conf directory, modify deploy-env.sh, and add MESOS_HOME:

# This works with a newer version of hostname on Ubuntu.
#FULL_IP=`hostname --all-ip-addresses`
#export LIBPROCESS_IP=`echo $FULL_IP | sed 's/\([^ ]*\) .*/\1/'`

export MESOS_HOME=/opt/mesos

Modify mesos.conf and add

# mesos-slave with --help.
failover_timeout=1

Enter /opt/spark, modify conf/spark-env.sh, and add

# variables to set are:
# - MESOS_HOME, to point to your Mesos installation
# - SCALA_HOME, to point to your Scala installation
# - SPARK_CLASSPATH, to add elements to Spark's classpath
# - SPARK_JAVA_OPTS, to add JVM options
# - SPARK_MEM, to change the amount of memory used per node (this should
#   be in the same format as the JVM's -Xmx option, e.g. 300m or 1g)
# - SPARK_LIBRARY_PATH, to add extra search paths for native libraries.

export SCALA_HOME=/opt/scala-2.9.1.final
export MESOS_HOME=/opt/mesos
export PATH=$PATH:/opt/jdk1.6.0_27/bin
export SPARK_MEM=10g   # the maximum amount of memory Spark may use per node; set it according to your machine

Source: http://www.cnblogs.com/jerrylead/archive/2012/08/13/2636115.html
