Spark official documentation: write and run Scala programs locally

Quick Start

This article describes how to write a standalone (single-machine) Spark program in Scala, Java, or Python. First, you only need to build Spark successfully on one machine. To do so, enter the Spark root directory and run:

$ sbt/sbt package

(Because of the Great Firewall, this download step may fail from mainland China unless you have a working proxy.) If you would rather not build from source, you can download a pre-compiled Spark package such as spark-0.7.2-prebuilt-hadoop1.tgz.

Interactive Analysis with the Spark Shell

I. Basics

Concept: the Spark shell is a simple way to learn the API and a powerful tool for analyzing datasets interactively. Run ./spark-shell from the Spark root directory to start it. Spark's core abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). An RDD can be created in two ways: 1. from input in a Hadoop file system (such as HDFS); 2. by transforming an existing RDD into a new one.

Practice:

1. First, use the README file in the Spark directory to create a new RDD:

scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3

2. An RDD supports two types of operations: actions, which return values, and transformations, which return a new RDD. Let's start with a few actions:
scala> textFile.count() // Number of items in this RDD
res0: Long = 74

scala> textFile.first() // First item in this RDD
res1: String = # Spark

3. Use the filter transformation to return a new RDD containing a subset of the file's items:
scala> textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"?
res3: Long = 15

II. More operations on RDDs
1. RDD actions and transformations can be combined for more complex computations. For example, suppose we want to find the line with the most words:
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
res4: Long = 16

2. To make this simpler, we can import an existing library and use its functions in the program:
scala> import java.lang.Math
import java.lang.Math

scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
res5: Int = 16

3. Spark can easily implement MapReduce data flows.
scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts: spark.RDD[(java.lang.String, Int)] = spark.ShuffledAggregatedRDD@71f027b8
Here we combine the flatMap, map, and reduceByKey transformations to compute how many times each word appears in the file, producing an RDD of (String, Int) pairs. The same pipeline, broken into separate steps, is sketched below.
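For readers who want to see what each transformation produces, here is a step-by-step version of the same pipeline. It is only an illustrative sketch: the intermediate names words, pairs, and counts are hypothetical and are not part of the original example.

scala> val words = textFile.flatMap(line => line.split(" "))   // RDD[String]: one element per word
scala> val pairs = words.map(word => (word, 1))                // RDD[(String, Int)]: each word paired with 1
scala> val counts = pairs.reduceByKey((a, b) => a + b)         // RDD[(String, Int)]: per-word totals

The chained form above and this step-by-step form compute the same result; the intermediate RDDs simply make the types explicit.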

4. Use the collect action to return the computed values:
scala> wordCounts.collect()
res6: Array[(java.lang.String, Int)] = Array((need,2), ("",43), (Extra,3), (using,1), (passed,1), (etc.,1), (its,1), (`/usr/local/lib/libmesos.so`,1), (`SCALA_HOME`,1), (option,1), (these,1), (#,1), (`PATH`,,2), (200,1), (To,3),...
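If the full array is too large to print comfortably, a smaller sample can be fetched instead. This is a minimal sketch assuming the standard take action; the output is omitted here rather than guessed.

scala> wordCounts.take(5) // return only the first five (word, count) pairs to the driver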

III. Caching
Spark also supports pulling datasets into an in-memory cache. This avoids the cost of repeated disk I/O when a dataset is accessed many times, for example in iterative workloads such as machine learning algorithms. Since memory access is orders of magnitude faster than disk access, the efficiency gain is substantial.
1. As a small example, let's mark the linesWithSpark dataset (the result of the filter transformation used earlier) to be cached; a fuller sketch of the sequence follows the output below:
scala> linesWithSpark.cache()
res7: spark.RDD[String] = spark.FilteredRDD@17e51082

scala> linesWithSpark.count()
res8: Long = 15
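Because the earlier filter example did not bind its result to a name, here is a minimal sketch of the full sequence under that assumption; the val binding is added only for illustration:

scala> val linesWithSpark = textFile.filter(line => line.contains("Spark")) // derive the subset RDD
scala> linesWithSpark.cache()  // mark it for in-memory caching
scala> linesWithSpark.count()  // the first action computes the RDD and populates the cache
scala> linesWithSpark.count()  // later actions can read the cached data from memory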

IV. A standalone Scala job
/*** SimpleJob.scala ***/
import spark.SparkContext
import SparkContext._

object SimpleJob {
  def main(args: Array[String]) {
    val logFile = "/var/log/syslog" // Should be some file on your system
    val sc = new SparkContext("local", "Simple Job", "$YOUR_SPARK_HOME",
      List("target/scala-2.9.3/simple-project_2.9.3-1.0.jar"))
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
Program explanation:
First, the program creates a SparkContext object, passing four parameters:
1. the scheduler to use (the local scheduler in this example);
2. the program name;
3. the Spark installation path;
4. the name of the JAR that contains the program's code.
Note: in distributed mode, the last two parameters must be set. The installation path tells Spark where it is installed on the nodes it runs on, and the JAR name lets Spark automatically ship the JAR file to the slave nodes. A sketch of how the constructor call might look outside single-threaded local mode follows below.
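As an illustration of the first (scheduler) parameter, here is a minimal sketch of alternative constructor calls. The "local[4]" master runs locally with four worker threads; the Mesos master URL in the commented-out line is purely a placeholder assumption and depends on your cluster.

// Local mode with 4 worker threads (still a single machine)
val sc = new SparkContext("local[4]", "Simple Job", "$YOUR_SPARK_HOME",
  List("target/scala-2.9.3/simple-project_2.9.3-1.0.jar"))

// Hypothetical cluster mode: the master URL below is only a placeholder
// val sc = new SparkContext("mesos://HOST:5050", "Simple Job", "/path/to/spark",
//   List("target/scala-2.9.3/simple-project_2.9.3-1.0.jar"))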

This program depends on the Spark API, so we also need an sbt configuration file that describes the dependency between the program and Spark. The following is the configuration file simple.sbt:
name := "Simple Project"version := "1.0"scalaVersion := "2.9.3"libraryDependencies += "org.spark-project" %% "spark-core" % "0.7.3"resolvers ++= Seq(  "Akka Repository" at "http://repo.akka.io/releases/",  "Spray Repository" at "http://repo.spray.cc/")

For sbt to work correctly, we must lay out SimpleJob.scala and simple.sbt according to the typical sbt directory structure. Once the layout is in place, we can create a JAR containing the program's code and then use sbt's run command to execute the sample program.
$ find .
.
./simple.sbt
./src
./src/main
./src/main/scala
./src/main/scala/SimpleJob.scala

$ sbt package
$ sbt run
...
Lines with a: 8422, Lines with b: 1836

This completes the example of running the program locally.
