Simple Use of the Spark Shell (RDD)


Basics

Spark's shell is a powerful tool for interactive data analysis and an easy way to learn the API. It can be used from Scala (a good way to use existing Java libraries, since it runs on the Java Virtual Machine) or Python. Start it from the Spark directory with:

./bin/spark-shell
In the Spark shell, a special SparkContext has already been created for you, in a variable named sc; a SparkContext you create yourself will not work. You can choose which cluster the SparkContext connects to with the --master option, and add JAR files to the classpath with --jars; multiple JARs are separated by commas. For example, to run spark-shell on four local cores, use:
./bin/spark-shell --master local[4]
Or, to also add code.jar to the classpath, use:
./bin/spark-shell --master local[4] --jars code.jar
You can run spark-shell --help for a complete list of options. Spark's most important abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs. Let's create a new RDD from the README.md text file in the Spark source directory:
scala> val textFile = sc.textFile("file:///home/hadoop/hadoop/spark/README.md")
16/07/24 03:30:53 INFO storage.MemoryStore: ensureFreeSpace(217040) called with curMem=321016, maxMem=280248975
16/07/24 03:30:53 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 212.0 KB, free 266.8 MB)
16/07/24 03:30:53 INFO storage.MemoryStore: ensureFreeSpace(20024) called with curMem=538056, maxMem=280248975
16/07/24 03:30:53 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 19.6 KB, free 266.7 MB)
16/07/24 03:30:53 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:43303 (size: 19.6 KB, free: 267.2 MB)
16/07/24 03:30:53 INFO spark.SparkContext: Created broadcast 2 from textFile at <console>:21
textFile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[5] at textFile at <console>:21
Note: 1. Lines 2~7 above are log output and can be ignored; the important part is the last line. Log output will not be shown in later examples. You can also suppress it yourself: go into the conf folder under the Spark directory, where there is a log4j.properties.template file, and run the following commands to copy it to log4j.properties and then edit it.

cp log4j.properties.template log4j.properties
vim log4j.properties
In log4j.properties, change the logging level from INFO to WARN so that INFO messages like the ones shown above are no longer printed.
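
The edit itself looks roughly like the following. This is a sketch that assumes a Spark 1.x-style log4j.properties.template in which the root logger is configured on a single log4j.rootCategory line; the exact line may differ slightly in your Spark version.

# in conf/log4j.properties: lower the root logger level from INFO to WARN
# before: log4j.rootCategory=INFO, console
log4j.rootCategory=WARN, console
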
2. In addition, in file:///home/hadoop/hadoop/spark/README.md, the file: prefix indicates the local file system. Note that file: is followed by three slashes (///), and /home/hadoop/hadoop/spark is my Spark installation directory; readers should replace it with their own path.

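If the file is stored in HDFS rather than on the local file system, you can pass an hdfs:// URI instead. The following is only a sketch: the NameNode address namenode:9000 and the path /user/hadoop/README.md are hypothetical placeholders, not values from the original article.

scala> // host, port, and path below are placeholders; adjust them to your cluster
scala> val textFileHdfs = sc.textFile("hdfs://namenode:9000/user/hadoop/README.md")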

RDD actions return a value, while RDD transformations return a reference to a new RDD. Let's start with a few actions:

scala> textFile.count()
res0: Long =

scala> textFile.first()
res1: String = # Apache Spark
Here count() returns the total number of items (lines) in the RDD, and first() returns the first item in the RDD.

Now let's use a transformation. We will use the filter transformation on the textFile RDD to return a new RDD containing only the lines that include the string "Spark":

scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at <console>:23
Of course, you can also chain transformations and actions together:

scala> textFile.filter(line => line.contains("Spark")).count()
res2: Long = 19
The statement above counts how many lines contain the string "Spark".


More RDD operations

RDD actions and transformations can be combined for more complex computations. For example, to find the line with the most words:

scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
res3: Int = 14
First, map converts each line into an integer (its number of words), producing a new RDD. Then reduce is called on that RDD to find the largest word count. The arguments to map and reduce are Scala function literals (closures), which can use any language feature and any Scala/Java library. For example, we can easily call functions declared elsewhere. Using the Math.max() function makes the code easier to understand:

scala> import java.lang.Math
import java.lang.Math

scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
res4: Int = 14
As you may know, a common data flow pattern in Hadoop is MapReduce. Spark can implement MapReduce flows easily:

scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[8] at reduceByKey at <console>:24
Here we combine the flatMap, map, and reduceByKey transformations to compute the number of occurrences of each word in the file, producing an RDD of (String, Int) pairs. To collect the word counts in the shell, we can use the collect action:

scala> wordCounts.collect()
res5: Array[(String, Int)] = Array((package,1), (for,2), (programs,1), (processing.,1), (because,1), (the,1), (cluster.,1), (its,1), ([run,1), (apis,1), (have,1), (try,1), (computation,1), (through,1), (several,1), (this,2), ("yarn-cluster",1), (graph,1), (hive,2), (storage,1), (["specifying,1), (to,2), (page](http://spark.apache.org/documentation.html),1), (once,1), (application,1), (prefer,1), (sparkpi,2), (engine,1), (version,1), (file,1), (documentation,,1), (processing,,2), (the,21), (are,1), (systems.,1), (params,1), (not,1), (different,1), (refer,2), (interactive,2), (given.,1), (if,4), (build,3), (when,1), (be,2), (tests,1), (apache,1), (all,1), (./bin/run-example,2), (programs,,1), (including,3), (spark.,1), (package.,1), (1000).count(),1), (versions,1), (hdfs,1), (data.,1), ...
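
Note that collect() brings the entire result back to the driver, which is fine for a small file like README.md. For a larger result you might prefer to inspect only a few elements; the following is an optional addition, not part of the original session, using the standard take() action:

scala> wordCounts.take(10)   // returns an Array with just the first 10 (word, count) pairs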


Cache

Spark also supports caching datasets in memory, which is useful when data is accessed repeatedly. A simple example:

scala> linesWithSpark.cache()
res6: linesWithSpark.type = MapPartitionsRDD[2] at filter at <console>:23

scala> linesWithSpark.count()
res7: Long = 19

scala> linesWithSpark.count()
res8: Long = 19

scala> linesWithSpark.count()
res9: Long = 19
We first cache the linesWithSpark dataset and then call count on it repeatedly. With a file this small you will not notice an obvious change in query speed, but when caching is used on a large dataset, the speed of repeated or iterative access improves significantly.
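
When the cached data is no longer needed, its memory can be released. This follow-up is not part of the original session; it is a minimal sketch using the standard RDD unpersist() method:

scala> linesWithSpark.unpersist()   // removes the cached blocks from memory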

