Common Spark application examples in Scala

Source: Internet
Author: User
Tags: prepare, spark, rdd

As a beginner who has just started learning Spark, I would like to share my own experience.

When learning Spark programming, the first step is to prepare the development environment and decide on a programming language. I used Scala with IntelliJ IDEA as the IDE. You also need four packages on the classpath: spark-assembly-1.3.1-hd-2.6.0.jar, scala-compiler.jar, scala-library.jar, and scala-reflect.jar. Import these four packages and you can start your Scala programming journey.

Because I have not set up a Hadoop environment, these Scala programming exercises cannot read data from HDFS. That is not a problem: for the purpose of practicing programming we can read a local TXT file and save the results back to a TXT file. This still lets us feel how powerful Spark RDDs are while achieving our goal of practicing programming. Below, examples illustrate the commonly used Spark RDD operations.

First we have to configure SparkConf. Usually the file would be read from HDFS, but here we read a local TXT file, so SparkConf is configured as follows:

val conf = new SparkConf().setAppName("Test").setMaster("local[4]")

Explanation: local[N] means local mode, using N threads.
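
The introduction above mentions saving results back to a local TXT file, which none of the later examples shows, so here is a minimal sketch of that step. The paths (e:/spark/input.txt, e:/spark/output), the object name SaveLocalTxt, and the trimming logic are illustrative assumptions; saveAsTextFile writes the RDD out as a directory of part files.

import org.apache.spark.{SparkConf, SparkContext}

object SaveLocalTxt {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SaveLocalTxt").setMaster("local[4]")
    val sc = new SparkContext(conf)
    // read a local TXT file (placeholder path)
    val lines = sc.textFile("e:/spark/input.txt")
    // example transformation: keep only non-empty, trimmed lines (illustrative)
    val cleaned = lines.map(_.trim).filter(_.nonEmpty)
    // saveAsTextFile writes the RDD out as a directory of part files
    cleaned.saveAsTextFile("e:/spark/output")
    sc.stop()
  }
}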

The following program uses count() to count the number of lines:

import org.apache.spark.{SparkConf, SparkContext}

object YB {
  /**
   * Count the number of lines (rows) in the file.
   */
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Test").setMaster("local[4]")
    val sc = new SparkContext(conf)
    val lines = sc.textFile("e:/spark/Tianchi Large Data/data_format1/yb.txt")
    val countX = lines.count() // count the rows
    println(countX) // output: 10485750
  }
}

Count word frequencies and sort by frequency:

import org.apache.spark.{SparkConf, SparkContext}

object YB {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Test").setMaster("local[4]")
    val sc = new SparkContext(conf)
    val lines = sc.textFile("e:/spark/Tianchi large Data/data_format1/100w.txt")
    /*
     * sortByKey takes two parameters: 1) true for ascending order, false for descending;
     * 2) the number of partitions.
     * flatMap flattens each line into the sequence of its words.
     */
    val wordCounts = lines.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey((a, b) => a + b)
      .map { case (key, value) => (value, key) }
      .sortByKey(true, 1)
    wordCounts.foreach(println)
  }
}

Difference between map() and flatMap():
object YB {
  def main(args: Array[String]): Unit = {
    val m = List(List(1, 2), List(3, 4))
    println(m.map(x => x))     // List(List(1, 2), List(3, 4))
    println(m)                 // List(List(1, 2), List(3, 4))
    val x = m.flatten
    println(x)                 // List(1, 2, 3, 4)
    println(m.flatMap(x => x)) // List(1, 2, 3, 4)
  }
}

From the above program we can see that flatMap is the composition of map and flatten: flatMap ultimately outputs a single flattened sequence, while map outputs a collection of collections.
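
The example above uses plain Scala Lists; the same contrast holds for Spark RDDs. Below is a minimal sketch with made-up sample data and a hypothetical object name MapVsFlatMap: map keeps one output element per input line, while flatMap flattens all the words into a single RDD.

import org.apache.spark.{SparkConf, SparkContext}

object MapVsFlatMap {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MapVsFlatMap").setMaster("local[4]")
    val sc = new SparkContext(conf)
    val lines = sc.parallelize(List("a b", "c d"))
    // map keeps one element per input line: Array(a, b), Array(c, d)
    lines.map(line => line.split(" ")).collect().foreach(arr => println(arr.mkString(",")))
    // flatMap flattens all words into a single RDD: a, b, c, d
    lines.flatMap(line => line.split(" ")).collect().foreach(println)
    sc.stop()
  }
}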

union() usage:

object YB {
  def main(args: Array[String]): Unit = {
    val m1 = List(List(1, 2), List(3, 4))
    val m2 = List(List(1, 2), List(3, 4))
    val unionX = m1.union(m2) // combines the two collections
    println(unionX)
    val mx1 = List(1, 2)
    val mx2 = List(3, 4)
    val unionXX = mx1.union(mx2) // combines the two collections
    println(unionXX)
  }
}

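union() also exists on RDDs and simply concatenates two datasets without removing duplicates. A minimal sketch, with made-up data and a hypothetical object name RddUnion:

import org.apache.spark.{SparkConf, SparkContext}

object RddUnion {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RddUnion").setMaster("local[4]")
    val sc = new SparkContext(conf)
    val rdd1 = sc.parallelize(List(1, 2))
    val rdd2 = sc.parallelize(List(3, 4))
    // union concatenates the two RDDs: 1, 2, 3, 4 (duplicates are kept)
    rdd1.union(rdd2).collect().foreach(println)
    sc.stop()
  }
}
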
Cartesian product, cartesian() usage:
import org.apache.spark.{SparkConf, SparkContext}

object YB {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Test").setMaster("local[4]")
    val sc = new SparkContext(conf)
    // parallelize the lists: cartesian() operates on RDDs, so the data must be RDDs
    val data1 = sc.parallelize(List(1, 2, 3))
    val data2 = sc.parallelize(List(4, 5, 6))
    data1.cartesian(data2).foreach(println)
  }
}

Difference between groupByKey() and reduceByKey():
import org.apache.spark.{SparkConf, SparkContext}

object YB {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Test").setMaster("local[4]")
    val sc = new SparkContext(conf)
    val lines = sc.textFile("e:/spark/Tianchi large Data/data_format1/100w.txt")
    /*
     * sortByKey takes two parameters: 1) true for ascending order, false for descending;
     * 2) the number of partitions.
     */
    val wordCounts = lines.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey((a, b) => a + b)
      .map { case (key, value) => (value, key) }
      .sortByKey(false, 1) // sort from largest to smallest
    val topK = wordCounts.top(10)
    topK.foreach(println) // print the ten highest word frequencies
  }
}

groupByKey does not combine values locally; all records for a key are shuffled and only merged on the node that receives that key.
reduceByKey first combines values locally within each partition and then merges the partial results after the shuffle, so it usually transfers much less data.
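
The word-count example above only uses reduceByKey, so here is a minimal sketch (made-up data, hypothetical object name GroupVsReduce) that computes the same counts both ways, illustrating the difference described above: reduceByKey pre-aggregates inside each partition before the shuffle, while groupByKey ships every pair and sums only after grouping.

import org.apache.spark.{SparkConf, SparkContext}

object GroupVsReduce {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("GroupVsReduce").setMaster("local[4]")
    val sc = new SparkContext(conf)
    val pairs = sc.parallelize(List(("a", 1), ("b", 1), ("a", 1)))
    // reduceByKey: partial sums are computed inside each partition before the shuffle
    pairs.reduceByKey((a, b) => a + b).collect().foreach(println) // (a,2), (b,1)
    // groupByKey: every pair is shuffled, values are summed only after grouping
    pairs.groupByKey().map { case (k, vs) => (k, vs.sum) }.collect().foreach(println) // (a,2), (b,1)
    sc.stop()
  }
}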

reduce() usage:

object YB {
  def main(args: Array[String]): Unit = {
    val data = List(1, 2, 3, 4)
    val sum = data.reduce((x, y) => x + y)
    println(sum) // output: 10
  }
}

reduce passes the elements of the RDD to the input function two at a time, producing a new value; that new value and the next element are then passed to the input function again, and so on until the last element has been consumed. Conceptually it is like a full binary tree in which each parent node is the result of combining its left and right children.
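
The example above calls reduce on a plain Scala List; on an RDD the call looks the same, except that each partition is reduced locally and the partial results are then combined, so the function should be associative and commutative. A minimal sketch with made-up data and a hypothetical object name RddReduce:

import org.apache.spark.{SparkConf, SparkContext}

object RddReduce {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RddReduce").setMaster("local[4]")
    val sc = new SparkContext(conf)
    val data = sc.parallelize(List(1, 2, 3, 4))
    // each partition is reduced locally, then the partial sums are combined
    val sum = data.reduce((x, y) => x + y)
    println(sum) // 10
    sc.stop()
  }
}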




