Contents

1. Prepare the file
2. Load the file
3. Show a row
4. Function use: (1) Map (2) Collect (3) Filter (4) FlatMap (5) Union (6) Join (7) Lookup (8) GroupByKey (9) SortByKey

1. Prepare the file

wget http://statweb.stanford.edu/~tibs/elemstatlearn/datasets/spam.data
2. Load the file

scala> val inFile = sc.textFile("/home/scipio/spam.data")
Output:

14/06/28 12:15:34 INFO MemoryStore: ensureFreeSpace(32880) called with curMem=65736, maxMem=311387750
14/06/28 12:15:34 INFO MemoryStore: Block broadcast_2 stored as values to memory (estimated size 32.1 KB, free 296.9 MB)
inFile: org.apache.spark.rdd.RDD[String] = MappedRDD[7] at textFile at <console>:12
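As an aside, a minimal sketch (not from the original post, same spark-shell session assumed): textFile also accepts a minimum-partition hint, and cache() asks Spark to keep the RDD in memory once an action has computed it.

// hypothetical variant of the load above
val inFile = sc.textFile("/home/scipio/spam.data", 4) // hint: at least 4 partitions
inFile.cache() // keep the rows in memory after the first action materializes them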
3. Show a row

scala> inFile.first()
Output:

14/06/28 12:15:39 INFO FileInputFormat: Total input paths to process : 1
14/06/28 12:15:39 INFO SparkContext: Starting job: first at <console>:
14/06/28 12:15:39 INFO DAGScheduler: Got job 0 (first at <console>:) with 1 output partitions (allowLocal=true)
14/06/28 12:15:39 INFO DAGScheduler: Final stage: Stage 0 (first at <console>:)
14/06/28 12:15:39 INFO DAGScheduler: Parents of final stage: List()
14/06/28 12:15:39 INFO DAGScheduler: Missing parents: List()
14/06/28 12:15:39 INFO DAGScheduler: Computing the requested partition locally
14/06/28 12:15:39 INFO HadoopRDD: Input split: file:/home/scipio/spam.data:0+349170
14/06/28 12:15:39 INFO SparkContext: Job finished: first at <console>:, took 0.532360118 s
res2: String = 0 0.64 0.64 0 0.32 0 0 0 0 0 0 0.64 0 0 0 0.32 0 1.29 1.93 0 0.96 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.778 0 0 3.756 61 278 1
This output shows that Spark loads the file row by row, with each row stored as one string. The result is an RDD[String], essentially an array of strings, which lets the entire file be held in memory.
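To see the row-wise structure directly, a minimal sketch (same spark-shell session assumed; count() and take() are standard RDD actions, and this snippet is not from the original post):

inFile.count()  // number of rows (lines) in spam.data
inFile.take(2)  // the first two rows, each as a single String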
4. Function use

(1) Map

scala> val nums = inFile.map(x => x.split(' ').map(_.toDouble))
nums: org.apache.spark.rdd.RDD[Array[Double]] = MappedRDD[8] at map at <console>:

scala> nums.first()
14/06/28 12:19:07 INFO SparkContext: Starting job: first at <console>:
14/06/28 12:19:07 INFO DAGScheduler: Got job 1 (first at <console>:) with 1 output partitions (allowLocal=true)
14/06/28 12:19:07 INFO DAGScheduler: Final stage: Stage 1 (first at <console>:)
14/06/28 12:19:07 INFO DAGScheduler: Parents of final stage: List()
14/06/28 12:19:07 INFO DAGScheduler: Missing parents: List()
14/06/28 12:19:07 INFO DAGScheduler: Computing the requested partition locally
14/06/28 12:19:07 INFO HadoopRDD: Input split: file:/home/scipio/spam.data:0+349170
14/06/28 12:19:07 INFO SparkContext: Job finished: first at <console>:, took 0.011412903 s
res3: Array[Double] = Array(0.0, 0.64, 0.64, 0.0, 0.32, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.64, 0.0, 0.0, 0.0, 0.32, 0.0, 1.29, 1.93, 0.0, 0.96, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.778, 0.0, 0.0, 3.756, 61.0, 278.0, 1.0)
This command converts each row's string into a corresponding array of Double values, so all the data can be represented as a two-dimensional array, RDD[Array[Double]].

(2) Collect

scala> val rdd = sc.parallelize(List(1, 2, 3, 4, 5))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:

scala> val mapRdd = rdd.map(2 * _)
mapRdd: org.apache.spark.rdd.RDD[Int] = MappedRDD[10] at map at <console>:

scala> mapRdd.collect
14/06/28 12:24:45 INFO SparkContext: Job finished: collect at <console>:, took 1.789249751 s
res4: Array[Int] = Array(2, 4, 6, 8, 10)
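Note that collect is an action: it copies every element of the RDD back to the driver as a local array, so it is best reserved for small results. As a minimal sketch (same session assumed, not from the original post), take(n) retrieves just a few elements instead:

mapRdd.take(3)  // Array(2, 4, 6), without materializing the whole RDD on the driver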
(3) Filter
scala> val filterRdd = sc.parallelize(List(1, 2, 3, 4, 5)).map(_ * 2).filter(_ > 5)
filterRdd: org.apache.spark.rdd.RDD[Int] = FilteredRDD[] at filter at <console>:

scala> filterRdd.collect
14/06/28 12:27:45 INFO SparkContext: Job finished: collect at <console>:, took 0.056086178 s
res5: Array[Int] = Array(6, 8, 10)
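One detail worth making explicit (a general Spark fact, not stated in the original post): map and filter are lazy transformations, so no job runs until an action such as collect or count is called. A minimal sketch:

val doubled = sc.parallelize(1 to 10).map(_ * 2).filter(_ > 5)  // no job submitted yet
doubled.count()  // returns 8; the action is what triggers the computation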
(4) FlatMap