This article covers the basic operations of the RDD in Spark. The RDD is the data model at the heart of Spark; for now this article does not unfold the advanced concepts behind it, such as what "Resilient Distributed Dataset" really means or what a directed acyclic graph (DAG) is. While reading, you can simply think of an RDD as an array, which is very helpful for learning the RDD API. All sample code in this article is written in Scala.
Since all computation in Spark is done through RDDs, the first question when learning RDDs is how to construct one. There are two main ways: the first builds the RDD directly from data in memory, and the second reads it from a file system. The file system here is most often HDFS or the local file system.
The first way constructs the RDD from memory, using the makeRDD and parallelize methods, as shown in the following code:
```scala
/* Use makeRDD to create the RDD */
/* List */
val rdd01 = sc.makeRDD(List(1,2,3,4,5,6))
val r01 = rdd01.map { x => x * x }
println(r01.collect().mkString(","))
/* Array */
val rdd02 = sc.makeRDD(Array(1,2,3,4,5,6))
val r02 = rdd02.filter { x => x < 5 }
println(r02.collect().mkString(","))

/* Use parallelize to create the RDD */
/* List */
val rdd03 = sc.parallelize(List(1,2,3,4,5,6), 1)
val r03 = rdd03.map { x => x + 1 }
println(r03.collect().mkString(","))
/* Array */
val rdd04 = sc.parallelize(List(1,2,3,4,5,6), 1)
val r04 = rdd04.filter { x => x > 3 }
println(r04.collect().mkString(","))
```
As you can see, an RDD behaves essentially like an array, and the source data is supplied using the List (linked list) and Array (array) types.
The second approach constructs the RDD from the file system, as shown in the code below:
```scala
val rdd: RDD[String] = sc.textFile("file:///D:/sparkdata.txt", 1)
val r: RDD[String] = rdd.flatMap { x => x.split(",") }
println(r.collect().mkString(","))
```
This example uses the local file system, so the file path carries the file:// protocol prefix.
Once the RDD object is constructed, the next question is how to operate on it. RDD operations fall into two categories: transformations and actions. This split is tied to the RDD's lazy evaluation: when a transformation is invoked, no actual computation is performed; only when an action is invoked is a computing job submitted and the corresponding computation carried out. The distinction is simple: a transformation creates a new RDD from an existing one, while an action triggers the actual computation.
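Lazy evaluation is easy to see even without a Spark cluster. The sketch below uses plain Scala views as an analogy (the object name, data, and counter are illustrative, not from this article): declaring a map does no work, and the computation only runs when the result is forced, just as an RDD transformation waits for an action.

```scala
// Analogy only: Scala views are lazy in the same spirit as RDD transformations.
object LazyDemo {
  def main(args: Array[String]): Unit = {
    var invocations = 0
    // Declaring the map does not run it -- like rdd.map(...) before any action.
    val transformed = List(1, 2, 3).view.map { x =>
      invocations += 1
      x * 2
    }
    println(s"after declaring map: $invocations invocations") // still 0
    // Forcing the view plays the role of an action such as collect().
    val result = transformed.toList
    println(s"after forcing: $invocations invocations") // now 3
    println(result.mkString(",")) // 2,4,6
  }
}
```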
Below is an introduction to the basic RDD operation API:
| Operation type | Function | Description |
| --- | --- | --- |
| Transformation | map() | Takes a function as its parameter, applies it to every element of the RDD, and returns a new RDD |
| Transformation | flatMap() | Takes a function as its parameter, applies it to every element, flattens each result into its individual elements, and returns a new RDD |
| Transformation | filter() | Takes a predicate function, drops the elements that do not satisfy it, and returns a new RDD |
| Transformation | distinct() | Takes no parameters; removes duplicate elements from the RDD |
| Transformation | union() | Takes another RDD; produces a new RDD containing all elements of both |
| Transformation | intersection() | Takes another RDD; returns the elements common to both |
| Transformation | subtract() | Takes another RDD; removes from the original RDD the elements that also appear in the parameter RDD |
| Transformation | cartesian() | Takes another RDD; computes the Cartesian product of the two |
| Action | collect() | Returns all elements of the RDD |
| Action | count() | Returns the number of elements in the RDD |
| Action | countByValue() | Returns the number of occurrences of each element in the RDD |
| Action | reduce() | Merges all elements of the RDD in parallel, e.g. a sum |
| Action | fold(0)(func) | Like reduce, but folds with an initial value |
| Action | aggregate(0)(seqOp, combOp) | Like reduce, but the returned type may differ from the element type of the original RDD |
| Action | foreach(func) | Applies the given function to each element of the RDD |
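Among these, aggregate() deserves a closer look, because its two functions play different roles: seqOp folds the elements within one partition, and combOp merges the per-partition results. The sketch below imitates that flow in plain Scala without Spark; the partition split, seqOp, and combOp here are illustrative assumptions, not Spark internals.

```scala
// Illustrative sketch of aggregate() semantics using plain Scala (no Spark).
object AggregateDemo {
  def main(args: Array[String]): Unit = {
    // Pretend an RDD of 1..6 is split into two partitions.
    val partitions = List(List(1, 2, 3), List(4, 5, 6))
    // seqOp: fold one partition's elements into a (sum, count) accumulator.
    val seqOp = (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1)
    // combOp: merge the accumulators produced by different partitions.
    val combOp = (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)
    val perPartition = partitions.map(p => p.foldLeft((0, 0))(seqOp))
    val (sum, count) = perPartition.reduce(combOp)
    println(s"sum=$sum count=$count avg=${sum.toDouble / count}") // sum=21 count=6 avg=3.5
  }
}
```

Note that the accumulator type (Int, Int) differs from the element type Int, which is exactly why aggregate() can return a type different from the original RDD's.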
Here is the sample code for the above API operations, as follows:
Transformation operations:
```scala
val rddInt: RDD[Int] = sc.makeRDD(List(1,2,3,4,5,6,2,5,1))
val rddStr: RDD[String] = sc.parallelize(Array("a","b","c","d","b","a"), 1)
val rddFile: RDD[String] = sc.textFile(path, 1)
val rdd01: RDD[Int] = sc.makeRDD(List(1,3,5,3))
val rdd02: RDD[Int] = sc.makeRDD(List(2,4,5,1))

/* map operation */
println("======map operation======")
println(rddInt.map(x => x + 1).collect().mkString(","))
println("======map operation======")
/* filter operation */
println("======filter operation======")
println(rddInt.filter(x => x > 4).collect().mkString(","))
println("======filter operation======")
/* flatMap operation */
println("======flatMap operation======")
println(rddFile.flatMap { x => x.split(",") }.first())
println("======flatMap operation======")
/* distinct operation */
println("======distinct operation======")
println(rddInt.distinct().collect().mkString(","))
println(rddStr.distinct().collect().mkString(","))
println("======distinct operation======")
/* union operation */
println("======union operation======")
println(rdd01.union(rdd02).collect().mkString(","))
println("======union operation======")
/* intersection operation */
println("======intersection operation======")
println(rdd01.intersection(rdd02).collect().mkString(","))
println("======intersection operation======")
/* subtract operation */
println("======subtract operation======")
println(rdd01.subtract(rdd02).collect().mkString(","))
println("======subtract operation======")
/* cartesian operation */
println("======cartesian operation======")
println(rdd01.cartesian(rdd02).collect().mkString(","))
println("======cartesian operation======")
```
Action operations:

```scala
val rddInt: RDD[Int] = sc.makeRDD(List(1,2,3,4,5,6,2,5,1))
val rddStr: RDD[String] = sc.parallelize(Array("a","b","c","d","b","a"), 1)

/* count operation */
println("======count operation======")
println(rddInt.count())
println("======count operation======")
/* countByValue operation */
println("======countByValue operation======")
println(rddInt.countByValue())
println("======countByValue operation======")
/* reduce operation */
println("======reduce operation======")
println(rddInt.reduce((x, y) => x + y))
println("======reduce operation======")
/* fold operation */
println("======fold operation======")
println(rddInt.fold(0)((x, y) => x + y))
println("======fold operation======")
/* aggregate operation */
println("======aggregate operation======")
val res: (Int, Int) = rddInt.aggregate((0, 0))((x, y) => (x._1 + x._2, y), (x, y) => (x._1 + x._2, y._1 + y._2))
println(res._1 + "," + res._2)
println("======aggregate operation======")
/* foreach operation */
println("======foreach operation======")
rddStr.foreach { x => println(x) }
println("======foreach operation======")
```
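The countByValue() call above returns a Map from each element to its occurrence count. For intuition, here is a plain-Scala sketch of the same computation on the same data (no Spark needed; the object name is illustrative):

```scala
// Illustrative: what countByValue() computes, done with plain Scala collections.
object CountByValueDemo {
  def main(args: Array[String]): Unit = {
    val data = List(1, 2, 3, 4, 5, 6, 2, 5, 1) // same data as rddInt above
    // Group identical elements together, then count each group.
    val counts: Map[Int, Long] =
      data.groupBy(identity).map { case (k, vs) => (k, vs.size.toLong) }
    println(counts) // a Map such as Map(5 -> 2, 1 -> 2, 6 -> 1, 2 -> 2, 3 -> 1, 4 -> 1)
  }
}
```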
That is as far as we will take RDD operations for now; the rest will be covered in the next article. Next I want to talk about how to develop for Spark. Installing Spark will get its own dedicated article, so assume here that Spark is already installed. On an installed Spark server we can use spark-shell, the interactive shell for Spark, and write Spark programs directly in it. But spark-shell is cumbersome in practice: only one user can use it at a time, and when another user starts spark-shell the previous user is kicked off; nor does the shell offer IDE features such as code completion and validation, which makes it painful to use.
But Spark really is a remarkable framework, and the remarkable part here is that local development and debugging of Spark is very simple: it needs no installed Spark system at all. We just create a project, in either Java or Scala, put spark-assembly-1.6.1-hadoop2.6.0.jar into the project's classpath, and we can develop and debug Spark programs locally.
Let's look at our full code in Eclipse, which has the Scala plugin installed:
```scala
package cn.com.sparktest

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object SparkTest {
  val conf: SparkConf = new SparkConf().setAppName("xtq").setMaster("local[2]")
  val sc: SparkContext = new SparkContext(conf)

  /**
   * Ways of creating data: constructing it from memory (the basics)
   */
  def createDataMethod(): Unit = {
    /* Use makeRDD to create the RDD */
    /* List */
    val rdd01 = sc.makeRDD(List(1,2,3,4,5,6))
    val r01 = rdd01.map { x => x * x }
    println("===================createDataMethod:makeRDD:List=====================")
    println(r01.collect().mkString(","))
    println("===================createDataMethod:makeRDD:List=====================")
    /* Array */
    val rdd02 = sc.makeRDD(Array(1,2,3,4,5,6))
    val r02 = rdd02.filter { x => x < 5 }
    println("===================createDataMethod:makeRDD:Array=====================")
    println(r02.collect().mkString(","))
    println("===================createDataMethod:makeRDD:Array=====================")
    /* Use parallelize to create the RDD */
    /* List */
    val rdd03 = sc.parallelize(List(1,2,3,4,5,6), 1)
    val r03 = rdd03.map { x => x + 1 }
    println("===================createDataMethod:parallelize:List=====================")
    println(r03.collect().mkString(","))
    println("===================createDataMethod:parallelize:List=====================")
    /* Array */
    val rdd04 = sc.parallelize(List(1,2,3,4,5,6), 1)
    val r04 = rdd04.filter { x => x > 3 }
    println("===================createDataMethod:parallelize:Array=====================")
    println(r04.collect().mkString(","))
    println("===================createDataMethod:parallelize:Array=====================")
  }

  /**
   * Create a pair RDD
   */
  def createPairRDD(): Unit = {
    val rdd: RDD[(String, Int)] = sc.makeRDD(List(("key01", 1), ("key02", 2), ("key03", 3)))
    val r: RDD[String] = rdd.keys
    println("===========================createPairRDD=================================")
    println(r.collect().mkString(","))
    println("===========================createPairRDD=================================")
  }

  /**
   * Create an RDD from a file.
   * File data:
   *   key01,1,2.3
   *   key02,5,3.7
   *   key03,23,4.8
   *   key04,12,3.9
   *   key05,7,1.3
   */
  def createDataFromFile(path: String): Unit = {
    val rdd: RDD[String] = sc.textFile(path, 1)
    val r: RDD[String] = rdd.flatMap { x => x.split(",") }
    println("=========================createDataFromFile==================================")
    println(r.collect().mkString(","))
    println("=========================createDataFromFile==================================")
  }

  /**
   * Basic RDD transformation operations
   */
  def basicTransformRDD(path: String): Unit = {
    val rddInt: RDD[Int] = sc.makeRDD(List(1,2,3,4,5,6,2,5,1))
    val rddStr: RDD[String] = sc.parallelize(Array("a","b","c","d","b","a"), 1)
    val rddFile: RDD[String] = sc.textFile(path, 1)
    val rdd01: RDD[Int] = sc.makeRDD(List(1,3,5,3))
    val rdd02: RDD[Int] = sc.makeRDD(List(2,4,5,1))

    /* map operation */
    println("======map operation======")
    println(rddInt.map(x => x + 1).collect().mkString(","))
    println("======map operation======")
    /* filter operation */
    println("======filter operation======")
    println(rddInt.filter(x => x > 4).collect().mkString(","))
    println("======filter operation======")
    /* flatMap operation */
    println("======flatMap operation======")
    println(rddFile.flatMap { x => x.split(",") }.first())
    println("======flatMap operation======")
    /* distinct operation */
    println("======distinct operation======")
    println(rddInt.distinct().collect().mkString(","))
    println(rddStr.distinct().collect().mkString(","))
    println("======distinct operation======")
    /* union operation */
    println("======union operation======")
    println(rdd01.union(rdd02).collect().mkString(","))
    println("======union operation======")
    /* intersection operation */
    println("======intersection operation======")
    println(rdd01.intersection(rdd02).collect().mkString(","))
    println("======intersection operation======")
    /* subtract operation */
    println("======subtract operation======")
    println(rdd01.subtract(rdd02).collect().mkString(","))
    println("======subtract operation======")
    /* cartesian operation */
    println("======cartesian operation======")
    println(rdd01.cartesian(rdd02).collect().mkString(","))
    println("======cartesian operation======")
  }

  /**
   * Basic RDD action operations
   */
  def basicActionRDD(): Unit = {
    val rddInt: RDD[Int] = sc.makeRDD(List(1,2,3,4,5,6,2,5,1))
    val rddStr: RDD[String] = sc.parallelize(Array("a","b","c","d","b","a"), 1)

    /* count operation */
    println("======count operation======")
    println(rddInt.count())
    println("======count operation======")
    /* countByValue operation */
    println("======countByValue operation======")
    println(rddInt.countByValue())
    println("======countByValue operation======")
    /* reduce operation */
    println("======reduce operation======")
    println(rddInt.reduce((x, y) => x + y))
    println("======reduce operation======")
    /* fold operation */
    println("======fold operation======")
    println(rddInt.fold(0)((x, y) => x + y))
    println("======fold operation======")
    /* aggregate operation */
    println("======aggregate operation======")
    val res: (Int, Int) = rddInt.aggregate((0, 0))((x, y) => (x._1 + x._2, y), (x, y) => (x._1 + x._2, y._1 + y._2))
    println(res._1 + "," + res._2)
    println("======aggregate operation======")
    /* foreach operation */
    println("======foreach operation======")
    rddStr.foreach { x => println(x) }
    println("======foreach operation======")
  }

  def main(args: Array[String]): Unit = {
    println(System.getenv("HADOOP_HOME"))
    createDataMethod()
    createPairRDD()
    createDataFromFile("file:///D:/sparkdata.txt")
    basicTransformRDD("file:///D:/sparkdata.txt")
    basicActionRDD()
    /* Printed results:
    D:\hadoop
    ===================createDataMethod:makeRDD:List=====================
    1,4,9,16,25,36
    ===================createDataMethod:makeRDD:List=====================
    ===================createDataMethod:makeRDD:Array=====================
    1,2,3,4
    ===================createDataMethod:makeRDD:Array=====================
    ===================createDataMethod:parallelize:List=====================
    2,3,4,5,6,7
    ===================createDataMethod:parallelize:List=====================
    ===================createDataMethod:parallelize:Array=====================
    4,5,6
    ===================createDataMethod:parallelize:Array=====================
    ===========================createPairRDD=================================
    key01,key02,key03
    ===========================createPairRDD=================================
    =========================createDataFromFile==================================
    key01,1,2.3,key02,5,3.7,key03,23,4.8,key04,12,3.9,key05,7,1.3
    =========================createDataFromFile==================================
    ======map operation======
    2,3,4,5,6,7,3,6,2
    ======map operation======
    ======filter operation======
    5,6,5
    ======filter operation======
    ======flatMap operation======
    key01
    ======flatMap operation======
    ======distinct operation======
    4,6,2,1,3,5
    ======distinct operation======
    ======union operation======
    1,3,5,3,2,4,5,1
    ======union operation======
    ======intersection operation======
    1,5
    ======intersection operation======
    ======subtract operation======
    3,3
    ======subtract operation======
    ======cartesian operation======
    (1,2),(1,4),(3,2),(3,4),(1,5),(1,1),(3,5),(3,1),(5,2),(5,4),(3,2),(3,4),(5,5),(5,1),(3,5),(3,1)
    ======cartesian operation======
    ======count operation======
    9
    ======count operation======
    ======countByValue operation======
    Map(5 -> 2, 1 -> 2, 6 -> 1, 2 -> 2, 3 -> 1, 4 -> 1)
    ======countByValue operation======
    ======reduce operation======
    29
    ======reduce operation======
    ======fold operation======
    29
    ======fold operation======
    ======aggregate operation======
    19,10
    ======aggregate operation======
    ======foreach operation======
    a
    b
    c
    d
    b
    a
    ======foreach operation======
    */
  }
}
```
When Spark executes, we need to construct a SparkContext, which in turn requires a SparkConf object, as in the code: setAppName("xtq").setMaster("local[2]").
appName is the name of the Spark job, and a master of local[2] means run in local mode, starting 2 threads to execute the Spark job.
When running the spark program in Eclipse, the following error is reported:
```
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
	at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:355)
	at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:370)
	at org.apache.hadoop.util.Shell.<clinit>(Shell.java:363)
	at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
	at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:104)
	at org.apache.hadoop.security.Groups.<init>(Groups.java:86)
	at org.apache.hadoop.security.Groups.<init>(Groups.java:66)
	at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:280)
	at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:271)
	at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:248)
	at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:763)
	at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:748)
	at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:621)
	at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2160)
	at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2160)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2160)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:322)
	at cn.com.sparktest.SparkTest$.<init>(SparkTest.scala:10)
	at cn.com.sparktest.SparkTest$.<clinit>(SparkTest.scala)
	at cn.com.sparktest.SparkTest.main(SparkTest.scala)
```
This error does not affect the program's operation, but it is always uncomfortable to see. The problem arises because Spark depends on Hadoop, yet Hadoop cannot really be installed natively on Windows (it can only be simulated with Cygwin), and newer Hadoop versions on Windows need winutils.exe. The fix is simple: download a winutils.exe, taking care to pick the version matching your operating system (32-bit or 64-bit), and place it in a directory such as:
D:\hadoop\bin\winutils.exe
Then define HADOOP_HOME=D:\hadoop in the environment variables.
After changing the environment variable, restart Eclipse so that the new value takes effect; the program will then run without reporting this error.
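If you prefer not to touch system-wide environment variables, Hadoop's Shell utility also honors the hadoop.home.dir system property, so the same effect can be achieved in code before the SparkContext is created. The path below is illustrative; point it at wherever you placed winutils.exe.

```scala
// Illustrative alternative to setting the HADOOP_HOME environment variable:
// Hadoop's Shell class also reads the hadoop.home.dir system property.
// The path "D:\hadoop" is an example -- use your own winutils.exe location.
object HadoopHomeFix {
  def main(args: Array[String]): Unit = {
    System.setProperty("hadoop.home.dir", "D:\\hadoop")
    println(System.getProperty("hadoop.home.dir")) // D:\hadoop
    // ... create the SparkConf / SparkContext after this point
  }
}
```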
Spark notes: RDD basic operations (UP)