This article covers the basic operations of the RDD in Spark. The RDD is the data model at the heart of Spark; for now this article does not unfold the advanced concepts behind it, such as what "Resilient Distributed Dataset" really means or what a directed acyclic graph (DAG) is. While reading, you can simply think of an RDD as an array, which is very helpful for learning the RDD API. All sample code in this article is written in Scala.
Since all computation in Spark is done through RDDs, the first question when learning RDDs is how to construct one. There are two main ways: the first builds the RDD directly from data in memory, and the second reads it from a file system. The file system here is most often HDFS or the local file system.
The first way constructs the RDD from memory, using the makeRDD and parallelize methods, as shown in the following code:
```scala
/* Use makeRDD to create the RDD */
/* List */
val rdd01 = sc.makeRDD(List(1,2,3,4,5,6))
val r01 = rdd01.map { x => x * x }
println(r01.collect().mkString(","))
/* Array */
val rdd02 = sc.makeRDD(Array(1,2,3,4,5,6))
val r02 = rdd02.filter { x => x < 5 }
println(r02.collect().mkString(","))

/* Use parallelize to create the RDD */
/* List */
val rdd03 = sc.parallelize(List(1,2,3,4,5,6), 1)
val r03 = rdd03.map { x => x + 1 }
println(r03.collect().mkString(","))
/* Array */
val rdd04 = sc.parallelize(List(1,2,3,4,5,6), 1)
val r04 = rdd04.filter { x => x > 3 }
println(r04.collect().mkString(","))
```
As you can see, an RDD behaves essentially like an array, and the source data is supplied using the List (linked list) and Array (array) types.
The second approach constructs the RDD from the file system, as shown in the code below:
```scala
val rdd: RDD[String] = sc.textFile("file:///D:/sparkdata.txt", 1)
val r: RDD[String] = rdd.flatMap { x => x.split(",") }
println(r.collect().mkString(","))
```
This example uses the local file system, so the file path carries the file:// protocol prefix.
Once the RDD object is constructed, the next question is how to operate on it. RDD operations fall into two categories: transformations and actions. This split is tied to the RDD's lazy evaluation: when a transformation is invoked, no actual computation is performed; only when an action is invoked is a computing job submitted and the corresponding computation carried out. The distinction is simple: a transformation creates a new RDD from an existing one, while an action triggers the actual computation.
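Lazy evaluation is easy to see even without a Spark cluster. The sketch below uses plain Scala views as an analogy (the object name, data, and counter are illustrative, not from this article): declaring a map does no work, and the computation only runs when the result is forced, just as an RDD transformation waits for an action.

```scala
// Analogy only: Scala views are lazy in the same spirit as RDD transformations.
object LazyDemo {
  def main(args: Array[String]): Unit = {
    var invocations = 0
    // Declaring the map does not run it -- like rdd.map(...) before any action.
    val transformed = List(1, 2, 3).view.map { x =>
      invocations += 1
      x * 2
    }
    println(s"after declaring map: $invocations invocations") // still 0
    // Forcing the view plays the role of an action such as collect().
    val result = transformed.toList
    println(s"after forcing: $invocations invocations") // now 3
    println(result.mkString(",")) // 2,4,6
  }
}
```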
Below is an introduction to the basic RDD operation API:
| Operation type | Function | Description |
| --- | --- | --- |
| Transformation | map() | Takes a function as its parameter, applies it to every element of the RDD, and returns a new RDD |
| Transformation | flatMap() | Takes a function as its parameter, applies it to every element, flattens each result into its individual elements, and returns a new RDD |
| Transformation | filter() | Takes a predicate function, drops the elements that do not satisfy it, and returns a new RDD |
| Transformation | distinct() | Takes no parameters; removes duplicate elements from the RDD |
| Transformation | union() | Takes another RDD; produces a new RDD containing all elements of both |
| Transformation | intersection() | Takes another RDD; returns the elements common to both |
| Transformation | subtract() | Takes another RDD; removes from the original RDD the elements that also appear in the parameter RDD |
| Transformation | cartesian() | Takes another RDD; computes the Cartesian product of the two |
| Action | collect() | Returns all elements of the RDD |
| Action | count() | Returns the number of elements in the RDD |
| Action | countByValue() | Returns the number of occurrences of each element in the RDD |
| Action | reduce() | Merges all elements of the RDD in parallel, e.g. a sum |
| Action | fold(0)(func) | Like reduce, but folds with an initial value |
| Action | aggregate(0)(seqOp, combOp) | Like reduce, but the returned type may differ from the element type of the original RDD |
| Action | foreach(func) | Applies the given function to each element of the RDD |
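Among these, aggregate() deserves a closer look, because its two functions play different roles: seqOp folds the elements within one partition, and combOp merges the per-partition results. The sketch below imitates that flow in plain Scala without Spark; the partition split, seqOp, and combOp here are illustrative assumptions, not Spark internals.

```scala
// Illustrative sketch of aggregate() semantics using plain Scala (no Spark).
object AggregateDemo {
  def main(args: Array[String]): Unit = {
    // Pretend an RDD of 1..6 is split into two partitions.
    val partitions = List(List(1, 2, 3), List(4, 5, 6))
    // seqOp: fold one partition's elements into a (sum, count) accumulator.
    val seqOp = (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1)
    // combOp: merge the accumulators produced by different partitions.
    val combOp = (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)
    val perPartition = partitions.map(p => p.foldLeft((0, 0))(seqOp))
    val (sum, count) = perPartition.reduce(combOp)
    println(s"sum=$sum count=$count avg=${sum.toDouble / count}") // sum=21 count=6 avg=3.5
  }
}
```

Note that the accumulator type (Int, Int) differs from the element type Int, which is exactly why aggregate() can return a type different from the original RDD's.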
Here is the sample code for the above API operations, as follows:
Transformation operations:
```scala
val rddInt: RDD[Int] = sc.makeRDD(List(1,2,3,4,5,6,2,5,1))
val rddStr: RDD[String] = sc.parallelize(Array("a","b","c","d","b","a"), 1)
val rddFile: RDD[String] = sc.textFile(path, 1)
val rdd01: RDD[Int] = sc.makeRDD(List(1,3,5,3))
val rdd02: RDD[Int] = sc.makeRDD(List(2,4,5,1))

/* map operation */
println("======map operation======")
println(rddInt.map(x => x + 1).collect().mkString(","))
println("======map operation======")
/* filter operation */
println("======filter operation======")
println(rddInt.filter(x => x > 4).collect().mkString(","))
println("======filter operation======")
/* flatMap operation */
println("======flatMap operation======")
println(rddFile.flatMap { x => x.split(",") }.first())
println("======flatMap operation======")
/* distinct operation */
println("======distinct operation======")
println(rddInt.distinct().collect().mkString(","))
println(rddStr.distinct().collect().mkString(","))
println("======distinct operation======")
/* union operation */
println("======union operation======")
println(rdd01.union(rdd02).collect().mkString(","))
println("======union operation======")
/* intersection operation */
println("======intersection operation======")
println(rdd01.intersection(rdd02).collect().mkString(","))
println("======intersection operation======")
/* subtract operation */
println("======subtract operation======")
println(rdd01.subtract(rdd02).collect().mkString(","))
println("======subtract operation======")
/* cartesian operation */
println("======cartesian operation======")
println(rdd01.cartesian(rdd02).collect().mkString(","))
println("======cartesian operation======")
```
Action operations:

```scala
val rddInt: RDD[Int] = sc.makeRDD(List(1,2,3,4,5,6,2,5,1))
val rddStr: RDD[String] = sc.parallelize(Array("a","b","c","d","b","a"), 1)

/* count operation */
println("======count operation======")
println(rddInt.count())
println("======count operation======")
/* countByValue operation */
println("======countByValue operation======")
println(rddInt.countByValue())
println("======countByValue operation======")
/* reduce operation */
println("======reduce operation======")
println(rddInt.reduce((x, y) => x + y))
println("======reduce operation======")
/* fold operation */
println("======fold operation======")
println(rddInt.fold(0)((x, y) => x + y))
println("======fold operation======")
/* aggregate operation */
println("======aggregate operation======")
val res: (Int, Int) = rddInt.aggregate((0, 0))((x, y) => (x._1 + x._2, y), (x, y) => (x._1 + x._2, y._1 + y._2))
println(res._1 + "," + res._2)
println("======aggregate operation======")
/* foreach operation */
println("======foreach operation======")
rddStr.foreach { x => println(x) }
println("======foreach operation======")
```
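The countByValue() call above returns a Map from each element to its occurrence count. For intuition, here is a plain-Scala sketch of the same computation on the same data (no Spark needed; the object name is illustrative):

```scala
// Illustrative: what countByValue() computes, done with plain Scala collections.
object CountByValueDemo {
  def main(args: Array[String]): Unit = {
    val data = List(1, 2, 3, 4, 5, 6, 2, 5, 1) // same data as rddInt above
    // Group identical elements together, then count each group.
    val counts: Map[Int, Long] =
      data.groupBy(identity).map { case (k, vs) => (k, vs.size.toLong) }
    println(counts) // a Map such as Map(5 -> 2, 1 -> 2, 6 -> 1, 2 -> 2, 3 -> 1, 4 -> 1)
  }
}
```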
That is as far as we will take RDD operations for now; the rest will be covered in the next article. Next I want to talk about how to develop for Spark. Installing Spark will get its own dedicated article, so assume here that Spark is already installed. On an installed Spark server we can use spark-shell, the interactive shell for Spark, and write Spark programs directly in it. But spark-shell is cumbersome in practice: only one user can use it at a time, and when another user starts spark-shell the previous user is kicked off; nor does the shell offer IDE features such as code completion and validation, which makes it painful to use.
But Spark really is a remarkable framework, and the remarkable part here is that local development and debugging of Spark is very simple: it needs no installed Spark system at all. We just create a project, in either Java or Scala, put spark-assembly-1.6.1-hadoop2.6.0.jar into the project's classpath, and we can develop and debug Spark programs locally.
Let's look at our full code in Eclipse, which has the Scala plugin installed:
```scala
package cn.com.sparktest

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object SparkTest {
  val conf: SparkConf = new SparkConf().setAppName("xtq").setMaster("local[2]")
  val sc: SparkContext = new SparkContext(conf)

  /**
   * Ways of creating data: constructing it from memory (the basics)
   */
  def createDataMethod(): Unit = {
    /* Use makeRDD to create the RDD */
    /* List */
    val rdd01 = sc.makeRDD(List(1,2,3,4,5,6))
    val r01 = rdd01.map { x => x * x }
    println("===================createDataMethod:makeRDD:List=====================")
    println(r01.collect().mkString(","))
    println("===================createDataMethod:makeRDD:List=====================")
    /* Array */
    val rdd02 = sc.makeRDD(Array(1,2,3,4,5,6))
    val r02 = rdd02.filter { x => x < 5 }
    println("===================createDataMethod:makeRDD:Array=====================")
    println(r02.collect().mkString(","))
    println("===================createDataMethod:makeRDD:Array=====================")
    /* Use parallelize to create the RDD */
    /* List */
    val rdd03 = sc.parallelize(List(1,2,3,4,5,6), 1)
    val r03 = rdd03.map { x => x + 1 }
    println("===================createDataMethod:parallelize:List=====================")
    println(r03.collect().mkString(","))
    println("===================createDataMethod:parallelize:List=====================")
    /* Array */
    val rdd04 = sc.parallelize(List(1,2,3,4,5,6), 1)
    val r04 = rdd04.filter { x => x > 3 }
    println("===================createDataMethod:parallelize:Array=====================")
    println(r04.collect().mkString(","))
    println("===================createDataMethod:parallelize:Array=====================")
  }

  /**
   * Create a pair RDD
   */
  def createPairRDD(): Unit = {
    val rdd: RDD[(String, Int)] = sc.makeRDD(List(("key01", 1), ("key02", 2), ("key03", 3)))
    val r: RDD[String] = rdd.keys
    println("===========================createPairRDD=================================")
    println(r.collect().mkString(","))
    println("===========================createPairRDD=================================")
  }

  /**
   * Create an RDD from a file.
   * File data:
   *   key01,1,2.3
   *   key02,5,3.7
   *   key03,23,4.8
   *   key04,12,3.9
   *   key05,7,1.3
   */
  def createDataFromFile(path: String): Unit = {
    val rdd: RDD[String] = sc.textFile(path, 1)
    val r: RDD[String] = rdd.flatMap { x => x.split(",") }
    println("=========================createDataFromFile==================================")
    println(r.collect().mkString(","))
    println("=========================createDataFromFile==================================")
  }

  /**
   * Basic RDD transformation operations
   */
  def basicTransformRDD(path: String): Unit = {
    val rddInt: RDD[Int] = sc.makeRDD(List(1,2,3,4,5,6,2,5,1))
    val rddStr: RDD[String] = sc.parallelize(Array("a","b","c","d","b","a"), 1)
    val rddFile: RDD[String] = sc.textFile(path, 1)
    val rdd01: RDD[Int] = sc.makeRDD(List(1,3,5,3))
    val rdd02: RDD[Int] = sc.makeRDD(List(2,4,5,1))

    /* map operation */
    println("======map operation======")
    println(rddInt.map(x => x + 1).collect().mkString(","))
    println("======map operation======")
    /* filter operation */
    println("======filter operation======")
    println(rddInt.filter(x => x > 4).collect().mkString(","))
    println("======filter operation======")
    /* flatMap operation */
    println("======flatMap operation======")
    println(rddFile.flatMap { x => x.split(",") }.first())
    println("======flatMap operation======")
    /* distinct operation */
    println("======distinct operation======")
    println(rddInt.distinct().collect().mkString(","))
    println(rddStr.distinct().collect().mkString(","))
    println("======distinct operation======")
    /* union operation */
    println("======union operation======")
    println(rdd01.union(rdd02).collect().mkString(","))
    println("======union operation======")
    /* intersection operation */
    println("======intersection operation======")
    println(rdd01.intersection(rdd02).collect().mkString(","))
    println("======intersection operation======")
    /* subtract operation */
    println("======subtract operation======")
    println(rdd01.subtract(rdd02).collect().mkString(","))
    println("======subtract operation======")
    /* cartesian operation */
    println("======cartesian operation======")
    println(rdd01.cartesian(rdd02).collect().mkString(","))
    println("======cartesian operation======")
  }

  /**
   * Basic RDD action operations
   */
  def basicActionRDD(): Unit = {
    val rddInt: RDD[Int] = sc.makeRDD(List(1,2,3,4,5,6,2,5,1))
    val rddStr: RDD[String] = sc.parallelize(Array("a","b","c","d","b","a"), 1)

    /* count operation */
    println("======count operation======")
    println(rddInt.count())
    println("======count operation======")
    /* countByValue operation */
    println("======countByValue operation======")
    println(rddInt.countByValue())
    println("======countByValue operation======")
    /* reduce operation */
    println("======reduce operation======")
    println(rddInt.reduce((x, y) => x + y))
    println("======reduce operation======")
    /* fold operation */
    println("======fold operation======")
    println(rddInt.fold(0)((x, y) => x + y))
    println("======fold operation======")
    /* aggregate operation */
    println("======aggregate operation======")
    val res: (Int, Int) = rddInt.aggregate((0, 0))((x, y) => (x._1 + x._2, y), (x, y) => (x._1 + x._2, y._1 + y._2))
    println(res._1 + "," + res._2)
    println("======aggregate operation======")
    /* foreach operation */
    println("======foreach operation======")
    rddStr.foreach { x => println(x) }
    println("======foreach operation======")
  }

  def main(args: Array[String]): Unit = {
    println(System.getenv("HADOOP_HOME"))
    createDataMethod()
    createPairRDD()
    createDataFromFile("file:///D:/sparkdata.txt")
    basicTransformRDD("file:///D:/sparkdata.txt")
    basicActionRDD()
    /* Printed results:
    D:\hadoop
    ===================createDataMethod:makeRDD:List=====================
    1,4,9,16,25,36
    ===================createDataMethod:makeRDD:List=====================
    ===================createDataMethod:makeRDD:Array=====================
    1,2,3,4
    ===================createDataMethod:makeRDD:Array=====================
    ===================createDataMethod:parallelize:List=====================
    2,3,4,5,6,7
    ===================createDataMethod:parallelize:List=====================
    ===================createDataMethod:parallelize:Array=====================
    4,5,6
    ===================createDataMethod:parallelize:Array=====================
    ===========================createPairRDD=================================
    key01,key02,key03
    ===========================createPairRDD=================================
    =========================createDataFromFile==================================
    key01,1,2.3,key02,5,3.7,key03,23,4.8,key04,12,3.9,key05,7,1.3
    =========================createDataFromFile==================================
    ======map operation======
    2,3,4,5,6,7,3,6,2
    ======map operation======
    ======filter operation======
    5,6,5
    ======filter operation======
    ======flatMap operation======
    key01
    ======flatMap operation======
    ======distinct operation======
    4,6,2,1,3,5
    ======distinct operation======
    ======union operation======
    1,3,5,3,2,4,5,1
    ======union operation======
    ======intersection operation======
    1,5
    ======intersection operation======
    ======subtract operation======
    3,3
    ======subtract operation======
    ======cartesian operation======
    (1,2),(1,4),(3,2),(3,4),(1,5),(1,1),(3,5),(3,1),(5,2),(5,4),(3,2),(3,4),(5,5),(5,1),(3,5),(3,1)
    ======cartesian operation======
    ======count operation======
    9
    ======count operation======
    ======countByValue operation======
    Map(5 -> 2, 1 -> 2, 6 -> 1, 2 -> 2, 3 -> 1, 4 -> 1)
    ======countByValue operation======
    ======reduce operation======
    29
    ======reduce operation======
    ======fold operation======
    29
    ======fold operation======
    ======aggregate operation======
    19,10
    ======aggregate operation======
    ======foreach operation======
    a
    b
    c
    d
    b
    a
    ======foreach operation======
    */
  }
}
```
When Spark executes, we need to construct a SparkContext, which in turn requires a SparkConf object, as in the code: setAppName("xtq").setMaster("local[2]").
appName is the name of the Spark job, and a master of local[2] means run in local mode, starting 2 threads to execute the Spark job.
When running the spark program in Eclipse, the following error is reported:
```
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
	at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:355)
	at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:370)
	at org.apache.hadoop.util.Shell.<clinit>(Shell.java:363)
	at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
	at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:104)
	at org.apache.hadoop.security.Groups.<init>(Groups.java:86)
	at org.apache.hadoop.security.Groups.<init>(Groups.java:66)
	at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:280)
	at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:271)
	at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:248)
	at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:763)
	at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:748)
	at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:621)
	at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2160)
	at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2160)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2160)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:322)
	at cn.com.sparktest.SparkTest$.<init>(SparkTest.scala:10)
	at cn.com.sparktest.SparkTest$.<clinit>(SparkTest.scala)
	at cn.com.sparktest.SparkTest.main(SparkTest.scala)
```
This error does not affect the program's operation, but it is always uncomfortable to see. The problem arises because Spark depends on Hadoop, yet Hadoop cannot really be installed natively on Windows (it can only be simulated with Cygwin), and newer Hadoop versions on Windows need winutils.exe. The fix is simple: download a winutils.exe, taking care to pick the version matching your operating system (32-bit or 64-bit), and place it in a directory such as:
D:\hadoop\bin\winutils.exe
Then define HADOOP_HOME=D:\hadoop in the environment variables.
After changing the environment variable, restart Eclipse so that the new value takes effect; the program will then run without reporting this error.
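If you prefer not to touch system-wide environment variables, Hadoop's Shell utility also honors the hadoop.home.dir system property, so the same effect can be achieved in code before the SparkContext is created. The path below is illustrative; point it at wherever you placed winutils.exe.

```scala
// Illustrative alternative to setting the HADOOP_HOME environment variable:
// Hadoop's Shell class also reads the hadoop.home.dir system property.
// The path "D:\hadoop" is an example -- use your own winutils.exe location.
object HadoopHomeFix {
  def main(args: Array[String]): Unit = {
    System.setProperty("hadoop.home.dir", "D:\\hadoop")
    println(System.getProperty("hadoop.home.dir")) // D:\hadoop
    // ... create the SparkConf / SparkContext after this point
  }
}
```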
Spark notes: RDD basic operations (UP)