SparkContext, map, flatMap, zip, and a WordCount routine


  SparkContext

It usually serves as the entry point of a Spark program; through it you can create and return RDDs.

If you think of the Spark cluster as the server side, then the Spark driver is the client, and SparkContext is the core of that client.

As its docstring notes, SparkContext is used to connect to a Spark cluster and to create RDDs, accumulators, and broadcast variables.
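
As a minimal sketch (assuming a local PySpark installation; the app name 'sparkcontext-demo' is just an illustrative placeholder), a single SparkContext can create each of those three kinds of objects:

from pyspark import SparkContext

# Connect to a local "cluster"; in a real deployment the master URL would point at the cluster.
sc = SparkContext('local', 'sparkcontext-demo')

rdd = sc.parallelize([1, 2, 3, 4])     # an RDD
acc = sc.accumulator(0)                # an accumulator
bc = sc.broadcast({'a': 0, 'b': 1})    # a broadcast variable

rdd.foreach(lambda x: acc.add(x))      # tasks running on the workers add into the accumulator
print(acc.value)                       # 10
print(bc.value['a'])                   # 0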

Map operation:
For each input element, the specified operation is applied and exactly one object is returned per input.
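
For example (a short sketch, assuming an existing SparkContext named sc as above):

nums = sc.parallelize([1, 2, 3])
# map applies the function to every element and keeps one output per input
print(nums.map(lambda x: x * 10).collect())  # [10, 20, 30]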

flatMap operation:
"Map first, then flatten."
Step 1: the same as map: the specified operation is applied to each input and one object is returned per input.
Step 2: all of the resulting collections are then merged (flattened) into a single collection.
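
The difference from map is easiest to see on nested data (again a sketch assuming an existing SparkContext named sc):

pairs = sc.parallelize([['a', 'b'], ['c', 'd']])
print(pairs.map(lambda d: d).collect())      # [['a', 'b'], ['c', 'd']]  -- still nested
print(pairs.flatMap(lambda d: d).collect())  # ['a', 'b', 'c', 'd']      -- flattened into one list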

zip function:

x = [1, 2, 3]
y = [4, 5, 6, 7]
xy = zip(x, y)
print(list(xy))  # [(1, 4), (2, 5), (3, 6)]

In Python 3, zip returns an iterator, so it is wrapped in list() before printing; elements of the longer list that have no partner (here 7) are dropped.
WordCount routine:
from pyspark import SparkContext

sc = SparkContext('local')

# At the beginning of a Spark program, sc.parallelize turns a collection into an RDD
# (a ParallelCollectionRDD), i.e. a parallelized collection. doc consists of 2 elements (documents) here.
doc = sc.parallelize([['a', 'b', 'c'], ['b', 'd', 'd']])
print(doc.count())  # 2

# map: the specified operation is applied to each input and one object is returned per input.
# flatMap ("map first, then flatten"): same as map, then all results are merged into one collection.
words = doc.map(lambda d: d).collect()
print(words)
words = doc.flatMap(lambda d: d).collect()
print(words)
words = doc.flatMap(lambda d: d).distinct().collect()
print(words)

# zip(list1, list2) pairs up the elements of list1 and list2: [(e1 of list1, e1 of list2), (e2 ...)].
# Here each word is labelled with an index from range(len(words)).
word_dict = {w: i for w, i in zip(words, range(len(words)))}

# Broadcasting: word_dict is efficiently shipped to every child (worker) node.
# word_dict_b is the alias of word_dict inside the worker-side function; the content is the same,
# the difference is that it is accessed through .value.
word_dict_b = sc.broadcast(word_dict)

def word_count_per_doc(d):
    dict_tmp = {}
    wd = word_dict_b.value
    for w in d:
        dict_tmp[wd[w]] = dict_tmp.get(wd[w], 0) + 1
    return dict_tmp

# word_count_per_doc is called once for each doc.
print(doc.map(word_count_per_doc).collect())
print("successful!")
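
For reference, the same counting idea can also be written with the classic flatMap + map + reduceByKey pattern. This is just a sketch under the same assumptions (the doc RDD and local SparkContext from above), not part of the original routine:

# Classic WordCount: flatten the documents into words, pair each word with 1,
# then sum the counts per word.
counts = (doc.flatMap(lambda d: d)
             .map(lambda w: (w, 1))
             .reduceByKey(lambda a, b: a + b)
             .collect())
print(counts)  # e.g. [('a', 1), ('c', 1), ('b', 2), ('d', 2)] -- order may vary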
