Spark Shell: Simple to Use

Source: Internet
Author: User
Tags: log4j

Basics

The Spark shell is a powerful interactive data analysis tool and an easy way to learn the API. It can be used with Scala (which runs on the Java Virtual Machine and is a good way to call existing Java libraries) or Python. Start it from the Spark directory as follows:

    ./bin/spark-shell
In the Spark shell a dedicated SparkContext has already been created for you, in the variable named sc; a SparkContext you create yourself will not work. You can use the --master argument to set which cluster the SparkContext connects to, and --jars to add JAR packages to the classpath; if there are multiple JARs, separate them with commas. For example, to run spark-shell locally on 4 cores, use:

    ./bin/spark-shell --master local[4]
Or, to also add code.jar to the classpath, use:

    ./bin/spark-shell --master local[4] --jars code.jar
You can run spark-shell --help for a complete list of options. Spark's main abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (for example, HDFS files) or by transforming other RDDs. Let's create a new RDD from the README.md text file in the Spark source directory:
  1. scala> val textFile = sc.textFile("file:///home/hadoop/hadoop/spark/README.md")
  2. 16/07/24 03:30:53 INFO storage.MemoryStore: ensureFreeSpace(217040) called with curMem=321016, maxMem=280248975
  3. 16/07/24 03:30:53 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 212.0 KB, free 266.8 MB)
  4. 16/07/24 03:30:53 INFO storage.MemoryStore: ensureFreeSpace(20024) called with curMem=538056, maxMem=280248975
  5. 16/07/24 03:30:53 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 19.6 KB, free 266.7 MB)
  6. 16/07/24 03:30:53 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:43303 (size: 19.6 KB, free: 267.2 MB)
  7. 16/07/24 03:30:53 INFO spark.SparkContext: Created broadcast 2 from textFile at <console>:21
  8. textFile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[5] at textFile at <console>:21
Note: 1. Lines 2~7 are log output; for the moment you only need to pay attention to the last line. Log output will not be shown in later examples. You can also go to the conf folder under the Spark directory, where there is a log4j.properties.template file; copy it to log4j.properties with the commands below and then edit log4j.properties:

    cp log4j.properties.template log4j.properties
    vim log4j.properties
In that file, change the log level from INFO to WARN (the log4j.rootCategory line in the template) so that INFO messages like those above are no longer printed. A shell-based alternative is sketched after these notes.

2. In addition, in file:///home/hadoop/hadoop/spark/README.md, the file: prefix indicates a local path; note that file: is followed by three slashes (/). The /home/hadoop/hadoop/spark part is my Spark installation directory, and readers can replace it with their own path.
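
As an aside related to note 1 (not part of the original workflow): Spark 1.4 and later can also adjust the log level from inside the shell itself, without editing log4j.properties. A minimal sketch, using the built-in sc:

    // A minimal sketch: set the log level for the current shell session only.
    // Valid levels include "ERROR", "WARN" and "INFO".
    sc.setLogLevel("WARN")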

Actions on an RDD return a value computed from the RDD, while transformations produce a new RDD and return a reference to it. Here are a few actions:

    scala> textFile.count()
    res0: Long = 98
    scala> textFile.first()
    res1: String = # Apache Spark
Here count returns the total number of items (lines) in the RDD, and first returns the first item (the first line) of the RDD.

Now let's use a transformation: we apply the filter function to textFile to keep only the lines containing the string "Spark", which returns a new RDD:

    scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
    linesWithSpark: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at <console>:23
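
Below is a minimal sketch (assuming the linesWithSpark RDD just created) showing that filter is lazy: nothing is computed until an action such as take runs.

    // A minimal sketch, assuming the linesWithSpark RDD created above.
    // filter is lazy; take(3) is the action that triggers the computation
    // and returns the first three matching lines to the driver for printing.
    linesWithSpark.take(3).foreach(println)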
Of course, you can also chain transformations and actions together:

    scala> textFile.filter(line => line.contains("Spark")).count()
    res2: Long = 19
The statement above counts how many lines contain the string "Spark".

More RDD operations

RDD actions and transformations can be combined for more complex computations. For example, to find the largest number of words in any single line:

    scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
    res3: Int = 14
First, map converts each line into an integer (its word count), producing a new RDD. Then reduce is called on that RDD to find the largest word count. The arguments to map and reduce are Scala function literals (closures), which can use any language feature and any Scala/Java library. For example, we can easily call functions declared elsewhere. Here we use the Math.max() function to make the code easier to understand:

    scala> import java.lang.Math
    import java.lang.Math
    scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
    res4: Int = 14
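
The same idea works with a named function. Below is a minimal sketch, assuming the textFile RDD created earlier; the helper name wordsPerLine is made up purely for illustration.

    // A minimal sketch, assuming the textFile RDD created earlier.
    // wordsPerLine is a hypothetical helper name used only for illustration;
    // map accepts it just like an anonymous closure.
    def wordsPerLine(line: String): Int = line.split(" ").size
    textFile.map(wordsPerLine).reduce((a, b) => Math.max(a, b))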
A common data-flow pattern popularized by Hadoop is MapReduce, and Spark makes MapReduce flows easy to implement:

    scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
    wordCounts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[8] at reduceByKey at <console>:24
Here we combine flatMap, map, and reduceByKey to count how many times each word occurs in the file; the result is an RDD of (String, Int) key-value pairs. We can use the collect action to gather the word counts:

    scala> wordCounts.collect()
    res5: Array[(String, Int)] = Array((package,1), (for,2), (Programs,1), (processing.,1), (because,1), (the,1), (cluster.,1), (its,1), ([run,1), (APIs,1), (have,1), (try,1), (computation,1), (through,1), (several,1), (This,2), ("yarn-cluster",1), (graph,1), (hive,2), (storage,1), (["specifying,1), (To,2), (page](http://spark.apache.org/documentation.html),1), (once,1), (application,1), (prefer,1), (sparkpi,2), (engine,1), (version,1), (file,1), (documentation,,1), (processing,,2), (the,21), (are,1), (systems.,1), (params,1), (not,1), (different,1), (refer,2), (interactive,2), (given.,1), (if,4), (build,3), (when,1), (be,2), (tests,1), (apache,1), (all,1), (./bin/run-example,2), (programs,,1), (including,3), (spark.,1), (package.,1), (+).count(),1), (versions,1), (hdfs,1), (data.,1), ...)
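
collect brings the entire result back to the driver, which is fine for a small file but not for large datasets. Below is a minimal sketch, assuming the wordCounts RDD computed above, of how you might look at only the most frequent words instead:

    // A minimal sketch, assuming the wordCounts RDD computed above.
    // Swap each (word, count) pair to (count, word), sort by count in
    // descending order, and bring back only the five most frequent words.
    wordCounts.map(_.swap).sortByKey(ascending = false).take(5).foreach(println)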

Cache

Spark supports caching datasets in memory, which is useful when data is accessed repeatedly. A simple example:

    scala> linesWithSpark.cache()
    res6: linesWithSpark.type = MapPartitionsRDD[2] at filter at <console>:23
    scala> linesWithSpark.count()
    res7: Long = 19
    scala> linesWithSpark.count()
    res8: Long = 19
    scala> linesWithSpark.count()
    res9: Long = 19
The linesWithSpark dataset is cached first, and then count is called on it repeatedly. On a file this small we will not notice an obvious change in query speed, but when caching is used on large datasets it can significantly speed up repeated and iterative access.
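
As a minimal sketch (assuming the wordCounts RDD from the previous section), the same pattern applies to any RDD, and unpersist releases the cached memory when it is no longer needed:

    // A minimal sketch, assuming the wordCounts RDD from the previous section.
    // cache() only marks the RDD for in-memory storage; the first action
    // materializes and caches it, later actions reuse the cached partitions.
    wordCounts.cache()
    wordCounts.count()      // computed and cached
    wordCounts.count()      // served from the cache
    wordCounts.unpersist()  // release the cached memory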
