1. Initialize Spark
import org.apache.spark.{SparkContext, SparkConf}

val conf = new SparkConf().setAppName("RDD1").setMaster("local")
val sc = new SparkContext(conf)
2. How to create an RDD
Memory: parallelize or makeRDD
External files: textFile
1. Both parallelize and makeRDD can create an RDD from an in-memory collection:
val distData = sc.parallelize(data)   // parallelize
val distData1 = sc.makeRDD(data)      // makeRDD
3. Key-value pairs
The following two are equivalent:
myRDD.map(s => (s, 1))
myRDD.map((_, 1))
reduceByKey, sortByKey, and groupByKey
distFile.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).collect().foreach(println)
distFile.flatMap(_.split(" ")).map(s => (s, 1)).sortByKey().collect().foreach(println)
distFile.flatMap(_.split(" ")).map(s => (s, 1)).groupByKey().collect().foreach(println)
1) Returns each key with its count: (key, cnt)
2) Returns the (key, value) pairs sorted by key
3) Returns (key, (value1, value2, ...))
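The three results above can be sketched without a cluster by modeling the pair-RDD operations on plain Scala collections. This is only an illustrative analogy (the sample sentences and the name PairOps are assumptions, not part of the Spark API): groupBy plus a per-key sum stands in for reduceByKey, sortBy for sortByKey, and groupBy alone for groupByKey.

```scala
object PairOps {
  def main(args: Array[String]): Unit = {
    // Stand-in for the lines read from distFile
    val lines = Seq("spark makes rdds", "rdds are resilient")
    val pairs = lines.flatMap(_.split(" ")).map(s => (s, 1))

    // reduceByKey(_ + _): one (key, count) entry per word
    val counts = pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }
    println(counts.toSeq.sortBy(_._1))

    // sortByKey(): all pairs ordered by key, duplicates kept
    val sorted = pairs.sortBy(_._1)
    println(sorted)

    // groupByKey(): (key, Seq(value1, value2, ...))
    val grouped = pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2)) }
    println(grouped.toSeq.sortBy(_._1))
  }
}
```

In real Spark the same chains run distributed over partitions; reduceByKey additionally combines values on each partition before shuffling, which is why it is preferred over groupByKey for aggregations.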
4. RDD Persistence
persist() or cache() marks an RDD to be kept in memory after it is first computed.
unpersist() removes a cached RDD.
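The point of persist()/cache() can be sketched without Spark: an uncached RDD recomputes its whole lineage on every action, while a cached one is materialized once and reused. The lazy val below is only a local model of that behavior (the names lineage and CacheSketch are illustrative, not Spark API).

```scala
object CacheSketch {
  def main(args: Array[String]): Unit = {
    var computations = 0
    // Stand-in for an RDD's lineage: recomputed each time it is invoked
    def lineage(): Seq[Int] = { computations += 1; (1 to 5).map(_ * 2) }

    // No persist(): each action (sum, max) recomputes the lineage
    val s1 = lineage().sum
    val m1 = lineage().max
    println(s"uncached: $computations computations")   // 2

    // With persist()/cache(): the first action materializes the data,
    // later actions reuse it (modeled here by a lazy val)
    computations = 0
    lazy val cached = lineage()
    val s2 = cached.sum
    val m2 = cached.max
    println(s"cached: $computations computation")      // 1
  }
}
```

In Spark the analogous calls are rdd.cache() (memory only) or rdd.persist(StorageLevel...) for other storage levels.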
5. Broadcast variables and accumulators
- Defined with sc.broadcast(v) and sc.accumulator(initialValue, name)
- The value is accessed via .value
- A broadcast variable cannot be modified (it is read-only on executors)
- An accumulator can only be modified with add or +=
sc.broadcast(v) creates a broadcast variable that can be used in place of v anywhere on the cluster:
val broadcastVar = sc.broadcast(Array(1, 2, 3))
println(broadcastVar.value(0), broadcastVar.value(1), broadcastVar.value(2))
val accum = sc.accumulator(0, "My Accumulator")
sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
println(accum.value)
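The two sharing patterns can be modeled locally without a cluster. This sketch is only an analogy (the names broadcastLike, accumLike, and SharedVarsSketch are illustrative): a broadcast variable behaves like a value tasks may read but never write, and an accumulator like a counter tasks may only add to while the driver reads the total.

```scala
object SharedVarsSketch {
  def main(args: Array[String]): Unit = {
    // Broadcast-style: a read-only value every "task" may read, never write
    val broadcastLike = Array(1, 2, 3)
    println((broadcastLike(0), broadcastLike(1), broadcastLike(2)))

    // Accumulator-style: "tasks" only add; only the "driver" reads the result
    var accumLike = 0
    Seq(1, 2, 3, 4).foreach(x => accumLike += x)
    println(accumLike)
  }
}
```

Note that sc.accumulator(initialValue, name) is the Spark 1.x API; since Spark 2.0 the preferred form is sc.longAccumulator(name) with accum.add(x).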
Spark Programming Basics