Spark Processing Network Logs to Query PV and UV: An Example

Source: Internet
Author: User
Tags: mysql, view

Let's first review how Spark processes data. Spark has standalone, local, YARN, and several other deployment modes, and each mode differs in the details, but the overall flow is the same: roughly, the client submits a job to the cluster manager, which generates a directed acyclic graph (DAG). The DAG describes how many stages the job has, how many tasks each stage contains, and which executor runs each task; the Spark cluster then executes the tasks in the order given by the DAG and produces the result.
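To make the stage split concrete, here is a minimal sketch (a hypothetical word-count job, not the web-log example below; the input path and local master are assumptions): the narrow flatMap and map transformations stay inside one stage, while the reduceByKey shuffle introduces a stage boundary, so this DAG has two stages.

import org.apache.spark.{SparkConf, SparkContext}

object DagSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DagSketch").setMaster("local[2]"))
    val counts = sc.textFile("/tmp/words.txt")  // hypothetical input file
      .flatMap(_.split(" "))                    // narrow transformation, same stage
      .map(word => (word, 1))                   // narrow transformation, same stage
      .reduceByKey(_ + _)                       // wide transformation: shuffle, new stage
    counts.collect().foreach(println)           // action: triggers execution of the DAG
    sc.stop()
  }
}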

Below is an example of computing PV and UV from network logs. The code is packaged into a jar and run with spark-submit, and implements the following:

1. Data cleansing, preserving only the date, URL, and GUID

2. Create the Spark schema, convert the RDD to a DataFrame, and register a temporary table

3. Query UV and PV using an SQL statement

4. Save the result to a database

package com.stanley.scala.objects

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.Row

object WebLog {
  def main(args: Array[String]): Unit = {
    // Create the configuration, selecting yarn-client mode
    val conf = new SparkConf().setAppName("SparkTest").setMaster("yarn-client")
    val sc = new SparkContext(conf)
    // Read the data
    val fileRdd = sc.textFile(args(0))
    // ETL: clean the data, keeping only date, GUID, and URL
    val weblogRdd = fileRdd.filter(_.length > 0).map(line => {
      val arr = line.split("\t")
      val url = arr(1)
      val guid = arr(5)
      val date = arr(0).substring(0, 10)
      (date, guid, url)
    }).filter(_._3.length > 0)
    // Create the SQLContext
    val sqlContext = new SQLContext(sc)
    // Define the schema
    val schema = StructType(List(
      StructField("date", StringType, true),
      StructField("guid", StringType, true),
      StructField("url", StringType, true)))
    val rowRdd = weblogRdd.map(tuple => Row(tuple._1, tuple._2, tuple._3))
    val weblogDf = sqlContext.createDataFrame(rowRdd, schema)
    // Register the temporary table
    weblogDf.registerTempTable("weblog")
    // SQL statement to query UV and PV
    val uvSql = "select count(*) pv, count(distinct(guid)) uv from weblog"
    val uvpvDf = sqlContext.sql(uvSql)
    uvpvDf.show()
    // Write the result to MySQL
    val url = "jdbc:mysql://master:3306/test?user=root&password=123456"
    import java.util.Properties
    val properties = new Properties()
    uvpvDf.write.jdbc(url, "uvpv", properties)
    // Release resources
    sc.stop()
  }
}
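One note on the JDBC write: in the code above the MySQL credentials are embedded in the connection URL. An equivalent way, sketched below with the same assumed host, database, user, and password, is to pass them through the java.util.Properties object:

import java.util.Properties

// Sketch: pass the MySQL credentials via Properties instead of the URL query string
val jdbcUrl = "jdbc:mysql://master:3306/test"
val props = new Properties()
props.setProperty("user", "root")
props.setProperty("password", "123456")
uvpvDf.write.jdbc(jdbcUrl, "uvpv", props)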

It is worth noting that, because the amount of data is not very large, we can set the number of shuffle partitions in spark-defaults.conf to speed up the job; if this parameter is not set, the default number of partitions is 200, which produces 200 tasks.

spark.sql.shuffle.partitions 10
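If you prefer not to edit spark-defaults.conf, the same value can be set programmatically (a minimal sketch, assuming the sqlContext from the code above; apply it before running the SQL query):

// Equivalent programmatic setting of the shuffle partition count
sqlContext.setConf("spark.sql.shuffle.partitions", "10")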

Next we run the program: start the cluster first and open the history server, then enter the Spark directory and run the spark-submit command.

./bin/spark-submit \
--class com.stanley.scala.objects.WebLog \
/opt/testfile/sparktest.jar \
/input/2015082818

In the web UI we can see the directed acyclic graph, which is divided into three stages.


Then look at the log; the number of partitions is the 10 we set.


Enter MySQL and verify that the uvpv table now exists.
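As an optional cross-check from the Spark side (a sketch, assuming the same connection URL and sqlContext as above), the table can also be read back over JDBC:

// Read the uvpv table back through JDBC to verify the write
val checkDf = sqlContext.read.jdbc(
  "jdbc:mysql://master:3306/test?user=root&password=123456",
  "uvpv",
  new java.util.Properties())
checkDf.show()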



