As a beginner who has just started learning Spark, I would like to share my own experience.
When learning Spark programming, first prepare the development environment and decide on a programming language. I used Scala with IntelliJ IDEA as the IDE, and added four jars to the project: spark-assembly-1.3.1-hd-2.6.0.jar, scala-compiler.jar, scala-library.jar and scala-reflect.jar. Once these four jars are imported, you can start your own Scala programming journey.
Because I had not set up a Hadoop environment, I could not read HDFS data while practicing Scala programming. That is not a problem: for practice we can read a local txt file and save the results back to a txt file. This still gives a feel for how powerful Spark RDDs are, and it achieves the goal of practicing programming. Below, examples are used to illustrate the commonly used RDD operations.
First we have to configure SparkConf(). Normally the file would be read from HDFS, but here we read a local txt file, so SparkConf is configured as follows:
val conf = new SparkConf().setAppName("Test").setMaster("local[4]")
Explanation: local[N] means local mode, using N threads.
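Before moving on, here is a minimal sketch of the local-file workflow mentioned above: reading a local txt file and writing a result back to disk with saveAsTextFile. The input and output paths here are made up for illustration.

import org.apache.spark.{SparkConf, SparkContext}

object LocalFileDemo {
  def main(args: Array[String]): Unit = {
    // local mode with 4 threads, as configured above
    val conf = new SparkConf().setAppName("Test").setMaster("local[4]")
    val sc = new SparkContext(conf)
    val lines = sc.textFile("e:/spark/input.txt")   // hypothetical input path
    val upper = lines.map(line => line.toUpperCase)
    upper.saveAsTextFile("e:/spark/output")         // writes part files into a local directory
    sc.stop()
  }
}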
The following program uses count() to count the number of lines:
import org.apache.spark.{SparkConf, SparkContext}

object YB {
  /**
   * Count the number of lines in the file.
   */
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Test").setMaster("local[4]")
    val sc = new SparkContext(conf)
    val lines = sc.textFile("e:/spark/Tianchi Large Data/data_format1/yb.txt")
    val countX = lines.count() // count the lines
    println(countX)            // output: 10485750
  }
}
Counting word frequencies and sorting by frequency:
import org.apache.spark.{SparkConf, SparkContext}

object YB {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Test").setMaster("local[4]")
    val sc = new SparkContext(conf)
    val lines = sc.textFile("e:/spark/Tianchi Large Data/data_format1/100w.txt")
    /*
     * sortByKey takes two parameters:
     * 1. true for ascending order, false for descending;
     * 2. the number of partitions.
     * flatMap flattens the words of all lines into a single collection.
     */
    val wordCounts = lines.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey((a, b) => a + b)
      .map { case (key, value) => (value, key) }
      .sortByKey(true, 1)
    wordCounts.foreach(println)
  }
}
The difference between map() and flatMap():
object YB {
  def main(args: Array[String]): Unit = {
    val m = List(List(1, 2), List(3, 4))
    println(m.map(x => x))     // List(List(1, 2), List(3, 4))
    println(m)                 // List(List(1, 2), List(3, 4))
    val x = m.flatten
    println(x)                 // List(1, 2, 3, 4)
    println(m.flatMap(x => x)) // List(1, 2, 3, 4)
  }
}
As the program above shows, flatMap can be composed from map followed by flatten: flatMap ultimately produces a single flattened sequence, while map produces a collection of collections.
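The same distinction holds for Spark RDDs. Here is a minimal sketch (the sample data is made up for illustration) showing map versus flatMap on an RDD of lines:

import org.apache.spark.{SparkConf, SparkContext}

object RddFlatMapDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Test").setMaster("local[4]")
    val sc = new SparkContext(conf)
    val lines = sc.parallelize(List("hello world", "hello spark"))
    // map yields one array of words per line: (hello, world) and (hello, spark)
    lines.map(line => line.split(" ")).collect().foreach(a => println(a.mkString(",")))
    // flatMap yields a single flat sequence of words: hello, world, hello, spark
    lines.flatMap(line => line.split(" ")).collect().foreach(println)
    sc.stop()
  }
}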
union() usage:
object YB {
  def main(args: Array[String]): Unit = {
    val m1 = List(List(1, 2), List(3, 4))
    val m2 = List(List(1, 2), List(3, 4))
    val unionX = m1.union(m2) // combines the two lists
    println(unionX)           // List(List(1, 2), List(3, 4), List(1, 2), List(3, 4))
    val mx1 = List(1, 2)
    val mx2 = List(3, 4)
    val unionXX = mx1.union(mx2) // combines the two lists
    println(unionXX)             // List(1, 2, 3, 4)
  }
}
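The example above uses Scala's List.union; RDDs offer a union transformation as well. A minimal sketch (the sample numbers are made up), noting that RDD union keeps duplicates:

import org.apache.spark.{SparkConf, SparkContext}

object RddUnionDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Test").setMaster("local[4]")
    val sc = new SparkContext(conf)
    val rdd1 = sc.parallelize(List(1, 2, 3))
    val rdd2 = sc.parallelize(List(3, 4, 5))
    // union does not deduplicate; call distinct() afterwards if needed
    rdd1.union(rdd2).collect().foreach(println) // 1 2 3 3 4 5
    sc.stop()
  }
}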
Cartesian product: cartesian() usage:
import org.apache.spark.{SparkConf, SparkContext}

object YB {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Test").setMaster("local[4]")
    val sc = new SparkContext(conf)
    // parallelize the lists first: cartesian() operates on RDDs, so the data must be RDDs
    val data1 = sc.parallelize(List(1, 2, 3))
    val data2 = sc.parallelize(List(4, 5, 6))
    data1.cartesian(data2).foreach(println) // prints all 3 x 3 = 9 pairs, e.g. (1,4), (1,5), ...
  }
}
The difference between groupByKey() and reduceByKey():
import org.apache.spark.{SparkConf, SparkContext}

object YB {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Test").setMaster("local[4]")
    val sc = new SparkContext(conf)
    val lines = sc.textFile("e:/spark/Tianchi Large Data/data_format1/100w.txt")
    /*
     * sortByKey takes two parameters:
     * 1. true for ascending order, false for descending;
     * 2. the number of partitions.
     */
    val wordCounts = lines.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey((a, b) => a + b)
      .map { case (key, value) => (value, key) }
      .sortByKey(false, 1)        // sort from largest to smallest count
    val topK = wordCounts.top(10) // take the ten most frequent words
    topK.foreach(println)         // print the ten most frequent words
  }
}
groupByKey does not combine values locally; every (key, value) pair is shuffled and the values are only merged on the receiving node.
reduceByKey first combines the values for each key locally on every node and then merges these partial results across nodes, so it usually shuffles far less data.
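A minimal sketch (the sample pairs are made up) showing that both approaches give the same counts, even though reduceByKey pre-aggregates within each partition:

import org.apache.spark.{SparkConf, SparkContext}

object GroupVsReduce {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Test").setMaster("local[4]")
    val sc = new SparkContext(conf)
    val pairs = sc.parallelize(List(("a", 1), ("b", 1), ("a", 1)))
    // groupByKey shuffles every (key, value) pair, then we sum on the receiving side
    val viaGroup = pairs.groupByKey().map { case (k, vs) => (k, vs.sum) }
    // reduceByKey sums within each partition first, then merges the partial sums
    val viaReduce = pairs.reduceByKey(_ + _)
    viaGroup.collect().foreach(println)  // (a,2) (b,1)
    viaReduce.collect().foreach(println) // (a,2) (b,1)
    sc.stop()
  }
}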
reduce() usage:
object YB {
  def main(args: Array[String]): Unit = {
    val data = List(1, 2, 3, 4)
    val sum = data.reduce((x, y) => x + y)
    println(sum) // output: 10
  }
}

reduce passes the elements of the RDD to the input function two at a time, producing a new value; that new value is then combined with the next element, and so on until the last element has been processed. Conceptually this is like a full binary tree in which each parent node is the combination of its left and right children.
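The example above uses a Scala List; reduce works the same way on an RDD. A minimal sketch (the sample numbers are made up):

import org.apache.spark.{SparkConf, SparkContext}

object RddReduceDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Test").setMaster("local[4]")
    val sc = new SparkContext(conf)
    val data = sc.parallelize(List(1, 2, 3, 4))
    // the function is applied to pairs of elements until a single value remains
    println(data.reduce((x, y) => x + y)) // 10
    sc.stop()
  }
}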