Spark Processing Network Logs to Query PV and UV: An Example

Source: Internet
Author: User
Tags: mysql, view

Let's first review how Spark processes data. Spark has standalone, local, YARN, and several other deployment modes, and each mode differs in the details, but the overall flow is the same: roughly, the client submits a job to the cluster manager, which generates a directed acyclic graph (DAG). The DAG describes how many stages the job has, how many tasks each stage contains, and which executor runs each task; the Spark cluster then executes the tasks in the order given by the DAG and produces the result.
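To make the stage split concrete, here is a minimal sketch (a hypothetical word-count job, not the web-log example below; the input path and local master are assumptions): the narrow flatMap and map transformations stay inside one stage, while the reduceByKey shuffle introduces a stage boundary, so this DAG has two stages.

import org.apache.spark.{SparkConf, SparkContext}

object DagSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DagSketch").setMaster("local[2]"))
    val counts = sc.textFile("/tmp/words.txt")  // hypothetical input file
      .flatMap(_.split(" "))                    // narrow transformation, same stage
      .map(word => (word, 1))                   // narrow transformation, same stage
      .reduceByKey(_ + _)                       // wide transformation: shuffle, new stage
    counts.collect().foreach(println)           // action: triggers execution of the DAG
    sc.stop()
  }
}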

Below is an example of computing PV and UV from network logs. The code is packaged into a jar and run with spark-submit, and implements the following:

1. Data cleansing, preserving only the date, URL, and GUID

2. Create the Spark schema, convert the RDD to a DataFrame, and register a temporary table

3. Query UV and PV using an SQL statement

4. Save the result to a database

package com.stanley.scala.objects

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.Row

object WebLog {
  def main(args: Array[String]): Unit = {
    // Create the configuration, selecting yarn-client mode
    val conf = new SparkConf().setAppName("SparkTest").setMaster("yarn-client")
    val sc = new SparkContext(conf)
    // Read the data
    val fileRdd = sc.textFile(args(0))
    // ETL: clean the data, keeping only date, GUID, and URL
    val weblogRdd = fileRdd.filter(_.length > 0).map(line => {
      val arr = line.split("\t")
      val url = arr(1)
      val guid = arr(5)
      val date = arr(0).substring(0, 10)
      (date, guid, url)
    }).filter(_._3.length > 0)
    // Create the SQLContext
    val sqlContext = new SQLContext(sc)
    // Define the schema
    val schema = StructType(List(
      StructField("date", StringType, true),
      StructField("guid", StringType, true),
      StructField("url", StringType, true)))
    val rowRdd = weblogRdd.map(tuple => Row(tuple._1, tuple._2, tuple._3))
    val weblogDf = sqlContext.createDataFrame(rowRdd, schema)
    // Register the temporary table
    weblogDf.registerTempTable("weblog")
    // SQL statement to query UV and PV
    val uvSql = "select count(*) pv, count(distinct(guid)) uv from weblog"
    val uvpvDf = sqlContext.sql(uvSql)
    uvpvDf.show()
    // Write the result to MySQL
    val url = "jdbc:mysql://master:3306/test?user=root&password=123456"
    import java.util.Properties
    val properties = new Properties()
    uvpvDf.write.jdbc(url, "uvpv", properties)
    // Release resources
    sc.stop()
  }
}
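One note on the JDBC write: in the code above the MySQL credentials are embedded in the connection URL. An equivalent way, sketched below with the same assumed host, database, user, and password, is to pass them through the java.util.Properties object:

import java.util.Properties

// Sketch: pass the MySQL credentials via Properties instead of the URL query string
val jdbcUrl = "jdbc:mysql://master:3306/test"
val props = new Properties()
props.setProperty("user", "root")
props.setProperty("password", "123456")
uvpvDf.write.jdbc(jdbcUrl, "uvpv", props)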

It is worth noting that, because the amount of data is not very large, we can set the number of shuffle partitions in spark-defaults.conf to speed up the job; if this parameter is not set, the default number of partitions is 200, which produces 200 tasks.

spark.sql.shuffle.partitions 10
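If you prefer not to edit spark-defaults.conf, the same value can be set programmatically (a minimal sketch, assuming the sqlContext from the code above; apply it before running the SQL query):

// Equivalent programmatic setting of the shuffle partition count
sqlContext.setConf("spark.sql.shuffle.partitions", "10")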

Next we run the program: start the cluster first and open the history server, then enter the Spark directory and run the spark-submit command.

./bin/spark-submit \
--class com.stanley.scala.objects.WebLog \
/opt/testfile/sparktest.jar \
/input/2015082818

In the web UI we can see the directed acyclic graph, which is divided into three stages.


Then look at the log; the number of partitions is the 10 we set.


Enter MySQL and verify that the uvpv table now exists.
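As an optional cross-check from the Spark side (a sketch, assuming the same connection URL and sqlContext as above), the table can also be read back over JDBC:

// Read the uvpv table back through JDBC to verify the write
val checkDf = sqlContext.read.jdbc(
  "jdbc:mysql://master:3306/test?user=root&password=123456",
  "uvpv",
  new java.util.Properties())
checkDf.show()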



