SparkContext
SparkContext is usually the entry point of a Spark program; through it you create RDDs.
If you think of the Spark cluster as the server side, then the Spark driver is the client, and SparkContext is the core of that client.
As the API docstring says, SparkContext is used to connect to a Spark cluster and to create RDDs, accumulators, and broadcast variables on it.
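For example, a minimal sketch of those three creations (the application name 'demo' and the sample data are placeholders, not from the original):

from pyspark import SparkContext

sc = SparkContext('local', 'demo')   # connect to a local Spark "cluster"
rdd = sc.parallelize([1, 2, 3])      # create an RDD
acc = sc.accumulator(0)              # create an accumulator
bc = sc.broadcast({'a': 1})          # create a broadcast variable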
map operation:
Applies a specified operation to each input element and returns exactly one object for each input.
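A small sketch (assuming the SparkContext sc from above; the nums RDD is a made-up example):

nums = sc.parallelize([1, 2, 3])
print(nums.map(lambda n: n * 2).collect())  # [2, 4, 6] - one output per input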
flatMap operation:
"Map first, then flatten."
Step 1: the same as map: apply the specified operation to each input element, returning one object per input.
Step 2: finally merge (flatten) all the returned objects into one collection.
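A sketch contrasting map and flatMap on a hypothetical RDD of lists:

pairs = sc.parallelize([[1, 2], [3, 4]])
print(pairs.map(lambda p: p).collect())      # [[1, 2], [3, 4]] - one object per input
print(pairs.flatMap(lambda p: p).collect())  # [1, 2, 3, 4] - results flattened into one list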
zip function:

x = [1, 2, 3]
y = [4, 5, 6, 7]
xy = zip(x, y)
print(list(xy))  # [(1, 4), (2, 5), (3, 6)]

zip stops at the end of the shorter sequence, so the extra 7 in y is dropped. (In Python 3, zip returns an iterator, hence the list() call before printing.)
WordCount example:

from pyspark import SparkContext

sc = SparkContext('local')

# At the start of a Spark program, sc.parallelize creates an RDD
# (a ParallelCollectionRDD), i.e. a parallelized collection.
# doc consists of 2 elements here.
doc = sc.parallelize([['a', 'b', 'c'], ['b', 'd', 'd']])
print(doc.count())  # 2

# map: applies the given operation to each input element and returns
# one object per input element.
# flatMap: "map first, then flatten" -
#   step 1: same as map, apply the operation to each input element;
#   step 2: merge (flatten) all the returned objects into one collection.
words = doc.map(lambda d: d).collect()
print(words)
words = doc.flatMap(lambda d: d).collect()
print(words)
words = doc.flatMap(lambda d: d).distinct().collect()
print(words)

# zip(list1, list2) turns list1 and list2 into a list of pairs
# [(e1 of list1, e1 of list2), (e2 of list1, e2 of list2), ...].
# Here each word is labelled with an index from range(len(words)).
word_dict = {w: i for w, i in zip(words, range(len(words)))}

# Broadcasting word_dict passes it to each worker node efficiently.
# word_dict_b is the handle for word_dict inside functions that run on
# the workers; the content is the same, but it is accessed via .value.
word_dict_b = sc.broadcast(word_dict)

def word_count_per_doc(d):
    dict_tmp = {}
    wd = word_dict_b.value
    for w in d:
        dict_tmp[wd[w]] = dict_tmp.get(wd[w], 0) + 1
    return dict_tmp

# word_count_per_doc is called once for each doc.
print(doc.map(word_count_per_doc).collect())
print("successful!")
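For the input above, flatMap(...).distinct() yields the four distinct words. Assuming they come back in the order ['a', 'b', 'c', 'd'] (distinct() does not guarantee an order), word_dict is {'a': 0, 'b': 1, 'c': 2, 'd': 3} and the final print shows [{0: 1, 1: 1, 2: 1}, {1: 1, 3: 2}]: for each document, a map from word index to count.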
SparkContext, map, flatMap, zip, and the WordCount example