First, let's understand how Spark processes data. Spark can run in standalone, local, YARN and other modes, and while each mode differs in the details, the overall flow is the same: roughly, the client submits a job to the cluster manager, which generates a directed acyclic graph (DAG). The DAG describes how the job is divided into stages, each stage into tasks, and which executor runs each task; the Spark cluster then schedules the work according to this DAG and produces the result.
Below is an example that computes PV and UV from web server logs. The code is packaged into a jar and executed with spark-submit, and implements the following steps:
1. Clean the data, keeping only the date, URL and GUID
2. Create a Spark schema, convert the RDD to a DataFrame, and register a temporary table
3. Query UV and PV with a SQL statement
4. Save the results to a database
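The cleansing logic of step 1 can be sketched as a pure function, independent of Spark, so it is easy to reason about and test. The field positions (timestamp at index 0, URL at index 1, GUID at index 5) follow the job code below; the sample log line is hypothetical:

```scala
object EtlSketch {
  // Parse one tab-separated log line into (date, guid, url).
  // Returns None for empty or malformed lines, mirroring the
  // filters applied in the Spark job.
  def parseLine(line: String): Option[(String, String, String)] = {
    val arr = line.split("\t")
    if (line.isEmpty || arr.length < 6 || arr(0).length < 10) None
    else {
      val date = arr(0).substring(0, 10) // keep only yyyy-MM-dd
      val url  = arr(1)
      val guid = arr(5)
      if (url.isEmpty) None else Some((date, guid, url))
    }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample: timestamp, url, three filler fields, guid
    val sample = "2015-08-28 18:00:01\thttp://example.com/a\tx\ty\tz\tguid-001"
    println(parseLine(sample))
  }
}
```

In the Spark job this function's body appears inline inside a `map`, followed by a `filter` that drops lines with an empty URL.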
package com.stanley.scala.objects

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType}
import org.apache.spark.sql.Row
import java.util.Properties

object WebLog {
  def main(args: Array[String]): Unit = {
    // Create configuration, select yarn-client mode
    val conf = new SparkConf().setAppName("SparkTest").setMaster("yarn-client")
    val sc = new SparkContext(conf)
    // Read data
    val fileRdd = sc.textFile(args(0))
    // ETL: clean the data, keep only date, GUID and URL
    val weblogRdd = fileRdd.filter(_.length > 0).map(line => {
      val arr = line.split("\t")
      val url = arr(1)
      val guid = arr(5)
      val date = arr(0).substring(0, 10)
      (date, guid, url)
    }).filter(_._3.length > 0)
    // Establish SparkSQL context
    val sqlContext = new SQLContext(sc)
    // Establish schema
    val schema = StructType(List(
      StructField("date", StringType, true),
      StructField("guid", StringType, true),
      StructField("url", StringType, true)
    ))
    val rowRdd = weblogRdd.map(tuple => Row(tuple._1, tuple._2, tuple._3))
    val weblogDf = sqlContext.createDataFrame(rowRdd, schema)
    // Register temporary table
    weblogDf.registerTempTable("weblog")
    // SQL statement to query UV, PV
    val uvSql = "select count(*) pv, count(distinct(guid)) uv from weblog"
    val uvpvDf = sqlContext.sql(uvSql)
    uvpvDf.show()
    // Write results to MySQL
    val url = "jdbc:mysql://master:3306/test?user=root&password=123456"
    val properties = new Properties()
    uvpvDf.write.jdbc(url, "UVPV", properties)
    // Close resources
    sc.stop()
  }
}
It is worth noting that, because the amount of data is not very large, we can lower the number of shuffle partitions in spark-defaults.conf to speed up the job; if this parameter is not set, it defaults to 200 and the shuffle will produce 200 tasks:
Spark.sql.shuffle.partitions 10
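Instead of changing spark-defaults.conf globally, the same property can be supplied for a single job at submit time; `--conf` is a standard spark-submit option:

```shell
# Override shuffle partitions for this submission only
./bin/spark-submit \
  --conf spark.sql.shuffle.partitions=10 \
  --class com.stanley.scala.objects.WebLog \
  /opt/testfile/sparktest.jar \
  /input/ 2015082818
```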
Next we run the program: start the cluster, open the HistoryServer, then enter the Spark directory and issue the spark-submit command:
./bin/spark-submit \
--class com.stanley.scala.objects.WebLog \
/opt/testfile/sparktest.jar \
/input/ 2015082818
Through the WebUI we can see the directed acyclic graph, divided into three stages.
Looking at the log, the number of shuffle partitions is the 10 we configured.
Entering MySQL, we can see that the UVPV table now exists.