Spark Programming Basics

Source: Internet
Author: User

1. Initialize Spark
    import org.apache.spark.{SparkContext, SparkConf}

    // "RDD1" is the application name; "local" runs Spark locally with one worker thread.
    val conf = new SparkConf().setAppName("RDD1").setMaster("local")
    val sc = new SparkContext(conf)
2. How to create an RDD

From memory: parallelize or makeRDD

From external files: textFile

Both parallelize and makeRDD create an RDD from an in-memory collection (here data is a local collection such as an Array or a List):

    val distData = sc.parallelize(data)    // parallelize
    val distData1 = sc.makeRDD(data)       // makeRDD
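The three creation methods listed above can be sketched together as follows. This is a minimal sketch, not a definitive implementation: the sample values, the application name "CreateRDD", and the temporary file are assumptions made for illustration, and SparkContext.getOrCreate is used so the snippet can also be pasted into a session where a context already exists.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

// Reuse a running context if there is one (for example in spark-shell).
val conf = new SparkConf().setAppName("CreateRDD").setMaster("local")
val sc = SparkContext.getOrCreate(conf)

// Hypothetical sample data, not taken from the original article.
val data = Seq(1, 2, 3, 4, 5)

// Two ways to build an RDD from an in-memory collection.
val distData = sc.parallelize(data)
val distData1 = sc.makeRDD(data)

// Build an RDD from an external file: each line becomes one element.
// The temporary file only exists to keep the sketch self-contained.
val path = Files.createTempFile("spark-basics", ".txt")
Files.write(path, "hello spark\nhello rdd\n".getBytes("UTF-8"))
val distFile = sc.textFile(path.toString)

println(distData.count())                  // number of elements in the RDD
println(distData1.collect().mkString(",")) // elements brought back to the driver
println(distFile.count())                  // number of lines in the file
```

For a local collection, parallelize and makeRDD produce the same RDD, so the choice between them is mostly a matter of taste.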
3. Key-value pairs

The following two are equivalent:

    myRDD.map(s => (s, 1))
    myRDD.map((_, 1))

reduceByKey, sortByKey, and groupByKey

    distFile.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).collect().foreach(println)
    distFile.flatMap(_.split(" ")).map(s => (s, 1)).sortByKey().collect().foreach(println)
    distFile.flatMap(_.split(" ")).map(s => (s, 1)).groupByKey().collect().foreach(println)

1) reduceByKey returns each key with the number of times it occurs: (key, count)

2) sortByKey returns the (key, value) pairs sorted by key

3) groupByKey returns each key with all of its values: (key, (value1, value2, ...))
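To make the three results concrete, here is a small sketch that runs the same pipeline on two in-memory lines instead of a file. The input strings and the application name "PairRDD" are assumptions made for illustration; the results are collected into local values so they are easy to inspect.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("PairRDD").setMaster("local")
val sc = SparkContext.getOrCreate(conf)

// Hypothetical input lines standing in for the contents of a text file.
val lines = sc.parallelize(Seq("a b a", "c b a"))

// Split each line into words and pair every word with 1.
val pairs = lines.flatMap(_.split(" ")).map((_, 1))

// 1) (key, count): sum the 1s for each word.
val counts = pairs.reduceByKey(_ + _).collect().toMap

// 2) (key, value) pairs sorted by key.
val sorted = pairs.sortByKey().collect()

// 3) (key, values): all values for a word gathered together.
val grouped = pairs.groupByKey().mapValues(_.toList).collect().toMap

println(counts)
println(sorted.mkString(" "))
println(grouped)
```

With this input, the word "a" appears three times, so reduceByKey yields ("a", 3) while groupByKey yields ("a", List(1, 1, 1)). When only the totals are needed, reduceByKey is usually the better choice because it combines values before they are shuffled.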

4. RDD Persistence

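persist() or cache() marks an RDD to be kept after it is first computed, so that later actions reuse it instead of recomputing it.

unpersist() removes a cached RDD from the cache.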
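The lifecycle of a cached RDD might look like the sketch below. The numbers and the application name "Persistence" are assumptions made for illustration; getStorageLevel is used only to show what each call changes.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf().setAppName("Persistence").setMaster("local")
val sc = SparkContext.getOrCreate(conf)

// Hypothetical data: the even numbers 2, 4, ..., 200.
val nums = sc.parallelize(1 to 100).map(_ * 2)

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
nums.cache()
val levelAfterCache = nums.getStorageLevel

// The first action computes the RDD and stores it; the second reuses it.
val n = nums.count()
val total = nums.sum()

// unpersist() drops the cached data and resets the storage level.
nums.unpersist()
val levelAfterUnpersist = nums.getStorageLevel

// After unpersist(), a different storage level can be assigned.
nums.persist(StorageLevel.MEMORY_AND_DISK)

println(s"count=$n total=$total")
```

Caching pays off only when an RDD is used by more than one action; for a single pass it just costs memory.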

5. Broadcast variables and accumulators
    • Defined by sc.broadcast(v) and sc.accumulator(initialValue, name)
    • The value is read through .value
    • A broadcast variable cannot be modified
    • An accumulator can only be modified with add or +=
SparkContext.broadcast(v) creates a broadcast variable that can be used in place of v anywhere in the cluster:

    val broadcastVar = sc.broadcast(array)    // array: a local Array with at least three elements
    println(broadcastVar.value(0), broadcastVar.value(1), broadcastVar.value(2))

    val accum = sc.accumulator(0, "My accumulator")
    sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
    println(accum.value)
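The sketch below shows a broadcast variable being read inside a task and an accumulator being updated from tasks. It is an illustration under stated assumptions: the array contents and the application name "SharedVariables" are made up, and it uses sc.longAccumulator, which replaced the deprecated sc.accumulator in Spark 2.0 and later, rather than the older call used in this article.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("SharedVariables").setMaster("local")
val sc = SparkContext.getOrCreate(conf)

// Hypothetical read-only lookup data shipped once to each executor.
val broadcastVar = sc.broadcast(Array(1, 2, 3))

// Tasks read the broadcast data through .value; they never modify it.
val scaled = sc.parallelize(Seq(0, 1, 2))
  .map(i => broadcastVar.value(i) * 10)
  .collect()

// Assumption: Spark 2.0 or later, where longAccumulator is available.
val accum = sc.longAccumulator("My accumulator")

// Tasks may only add to the accumulator; the driver reads the result.
sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum.add(x))

println(scaled.mkString(","))
println(accum.value)
```

Accumulator updates are reliable when made inside actions such as foreach; updates made inside transformations may be applied more than once if a task is re-executed.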

  
