Implementing offline log batch processing with Spark SQL


First, the basic offline data processing architecture:
    1. Data acquisition: Flume writes the web logs to HDFS.
    2. Data cleansing: dirty data is cleaned with Spark, Hive, MapReduce, or another compute framework; the cleaned data is written back to HDFS.
    3. Data processing: business statistics and analysis according to the requirements, also done with a compute framework.
    4. The processing results are stored in an RDBMS or NoSQL database.
    5. Data visualization: the results are displayed graphically with ECharts, HUE, Zeppelin, etc.
In the process block diagram, steps 1-7 form the offline processing path, where step 5 is not necessarily Hive (Spark SQL and others also work) and step 6 is not necessarily an RDBMS (NoSQL is also possible). Execution is driven by a scheduling framework such as Oozie or Azkaban, which specifies when each task runs. The other branch of the diagram is real-time processing. The project requirements are:
    1. Count the Top N most popular items in a given time period, together with their corresponding number of visits.
    2. Count the most popular items by city, extracting the city information from the IP address.
    3. Count the Top N items ranked by access traffic.
Internet logs generally include: access time, accessed URL, consumed traffic, and the access IP address. We extract the data we need from the log. Suppose we have only one machine to use as a learning cluster; to prevent memory overflow, the log first has to be cut down: use the head -10000 command to take the first 10,000 lines, since too large a data volume may cause errors in the IDE.

Second, log processing (data cleansing). Step one: extract the useful information from the raw log, in this case the time, URL, traffic, and IP:
    1. Read the log file to get an RDD, split each line into an array with the map method, and select the useful fields from the array (use breakpoints to see which fields are useful and match them to the corresponding variables).
    2. Some of the extracted information may be wrong, for example because of threading problems. The first version used SimpleDateFormat (which is not thread-safe) to convert the time format, which caused some time conversion errors; it is usually replaced with FastDateFormat.
Implementation code: 
Extract the useful information and convert the time format:

import org.apache.spark.sql.SparkSession

object SparkStatFormatJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SparkStatFormatJob").master("local[2]").getOrCreate()
    val access = spark.sparkContext.textFile("/Users/kingheyleung/downloads/data/10000_access.log")
    //access.take(10).foreach(println)
    access.map(line => {
      val splits = line.split(" ")
      val ip = splits(0)
      // Use breakpoints to observe the splits array and find which fields hold the time, URL and traffic
      // The DateUtils class below converts the raw time into a common format
      // Strip the extra quotation marks around the URL
      val time = splits(3) + " " + splits(4)
      val url = splits(11).replaceAll("\"", "")
      val traffic = splits(9)
      //(ip, DateUtils.parse(time), url, traffic)  // used to test whether the output is normal
      // Reassemble the trimmed data, tab-separated
      DateUtils.parse(time) + "\t" + url + "\t" + traffic + "\t" + ip
    }).saveAsTextFile("file:///usr/local/mycode/immooclog/")
    spark.stop()
  }
}
 
Date parsing utility:

import java.util.{Date, Locale}
import org.apache.commons.lang3.time.FastDateFormat

object DateUtils {
  // input format
  val ORIGINAL_TIME_FORMAT = FastDateFormat.getInstance("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH)
  // output format
  val TARGET_TIME_FORMAT = FastDateFormat.getInstance("yyyy-MM-dd HH:mm:ss")

  def parse(time: String) = {
    TARGET_TIME_FORMAT.format(new Date(getTime(time)))
  }

  def getTime(time: String) = {
    try {
      ORIGINAL_TIME_FORMAT.parse(time.substring(time.indexOf("[") + 1, time.lastIndexOf("]"))).getTime
    } catch {
      case e: Exception => 0L
    }
  }
}

Log processing generally requires partitioning the data.

This example partitions by the access time in the log.
Step two: parse the output of the previous step. What I call the parsed log is really just the tidier data log: parse out the meaning of each field and turn the RDD into a DataFrame. In this case the transformation is: input (access time, accessed URL, consumed traffic, access IP address) ==> output (URL, type (the URL suffix, in this example article or video), the corresponding ID number, traffic, IP, city, time, and day (used for grouping)). Creating the DataFrame means defining Row and StructType, where Row maps to the fields of each original log line and StructType defines the row's schema according to the desired output. Implementation code:
Parse the log:

import org.apache.spark.sql.SparkSession

object SparkStatCleanJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SparkStatCleanJob").master("local[2]").getOrCreate()
    val accessRDD = spark.sparkContext.textFile("file:///Users/kingheyleung/downloads/data/access_10000.log")
    // Convert the RDD to a DataFrame by defining the Row and the StructType
    val accessDF = spark.createDataFrame(accessRDD.map(line => LogConvertUtils.convertToRow(line)), LogConvertUtils.struct)
    //accessDF.printSchema()
    //accessDF.show(false)
    spark.stop()
  }
}
Utility class for converting the RDD to a DataFrame:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object LogConvertUtils {
  // build the struct (the schema)
  val struct = StructType(Array(
    StructField("url", StringType),
    StructField("cmsType", StringType),
    StructField("cmsId", LongType),
    StructField("traffic", LongType),
    StructField("ip", StringType),
    StructField("city", StringType),
    StructField("time", StringType),
    StructField("day", StringType)
  ))

  // extract the information and build the Row
  def convertToRow(line: String) = {
    try {
      val splits = line.split("\t")
      val url = splits(1)
      val traffic = splits(2).toLong
      val ip = splits(3)
      val domain = "http://www.imooc.com/"
      val cms = url.substring(url.indexOf(domain) + domain.length())
      val cmsSplits = cms.split("/")
      var cmsType = ""
      var cmsId = 0L
      // check whether the URL contains both a type and an id
      if (cmsSplits.length > 1) {
        cmsType = cmsSplits(0)
        cmsId = cmsSplits(1).toLong
      }
      val city = IpUtils.getCity(ip)  // resolved through the IP parsing tool, see below
      val time = splits(0)
      val day = time.substring(0, 10).replaceAll("-", "")
      // define the Row in the same order as the struct
      Row(url, cmsType, cmsId, traffic, ip, city, time, day)
    } catch {
      case e: Exception => Row(0)
    }
  }
}

Note: when converting, be sure to cast the fields to the right types (e.g. toLong)!
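The LogConvertUtils code above calls an IpUtils helper that is not shown in this write-up. Below is a minimal sketch of such a helper; it assumes the ipdatabase project described in the next paragraph exposes com.ggstar.util.ip.IpHelper.findRegionByIp (an assumption; check the project's documentation for the actual API):

import com.ggstar.util.ip.IpHelper

// Hypothetical wrapper around the ipdatabase library; the IpHelper.findRegionByIp
// call is an assumption and should be verified against the project's examples.
object IpUtils {
  def getCity(ip: String): String = {
    IpHelper.findRegionByIp(ip)
  }
}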

Further analysis: resolving IP addresses to city information. To turn an IP address into readable city information, I used an open-source project on GitHub: https://github.com/wzhe06/ipdatabase.git
Compile the downloaded project with Maven: mvn clean package -DskipTests
Install the jar into your local Maven repository: mvn install:install-file -Dfile=<path to the jar> -DgroupId=com.ggstar -DartifactId=ipdatabase -Dversion=1.0 -Dpackaging=jar
Then add the dependency to the IDE's pom.xml, following the dependency shown in the pom.xml on the project's GitHub home page. There was still an error: java.io.FileNotFoundException: file:/Users/rocky/maven_repos/com/ggstar/ipdatabase/1.0/ipdatabase-1.0.jar!/ipRegion.xlsx (No such file or directory). As the message suggests, we need to find the corresponding files in the project source and copy them into main/resources in the IDE.

Storing the cleansed data: partitionBy stores the output partitioned by day; mode(SaveMode.Overwrite) overwrites any existing output; coalesce is said to be used often in production and is one of the project's tuning points, since it controls the size and number of output files (a minimal sketch of this write is shown after the step list below).

Third, implementing the statistics. Function one: count the Top N videos. Step 1: read the data with read.format().load. Step 2: do the statistics with either of the following:
    1. Statistical analysis using the DataFrame API
    2. Statistical analysis using the SQL API
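Before the statistics job can read anything, the cleansed data from step two has to be written out partitioned by day, as described in the storage paragraph above. A minimal sketch of that write, assuming the cleansed DataFrame is named accessDF, that one output file per partition (coalesce(1)) is acceptable, and using the same path the statistics job later loads from:

import org.apache.spark.sql.SaveMode

// Write the cleansed data as parquet, partitioned by day; coalesce controls the number
// of output files, and Overwrite replaces any previous run's output.
accessDF.coalesce(1)
  .write.format("parquet")
  .mode(SaveMode.Overwrite)
  .partitionBy("day")
  .save("/Users/kingheyleung/downloads/data/clean/")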
Finally, save the statistical results in the MySQL database. A note on reading: when reading the parquet files, Spark will by default infer the data types of the partition columns, but sometimes we just want them as strings; in that case add the following when defining the SparkSession: config("spark.sql.sources.partitionColumnTypeInference.enabled", "false"), so the columns are read with their original type. Two approaches: if you use the DataFrame API, you need to import the implicit conversions (this is what converts a column name into a Column): import spark.implicits._, and to use the count() aggregate function you need to import the package org.apache.spark.sql.functions._. If you use the SQL API, create a temporary table with createTempView and write the SQL statement carefully: when wrapping lines it is easy to overlook a missing space. Implementation code:
Complete the statistics operation:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object TopNStatJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("TopNStatJob")
      .config("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
      .master("local[2]").getOrCreate()
    val accessDF = spark.read.format("parquet").load("/Users/kingheyleung/downloads/data/clean/")
    dfCountTopNVideo(spark, accessDF)
    sqlCountTopNVideo(spark, accessDF)
    //accessDF.printSchema()
    spark.stop()
  }

  def dfCountTopNVideo(spark: SparkSession, accessDF: DataFrame): Unit = {
    /*
     * DataFrame API
     */
    // Import the implicit conversions so that $ can be used, and import the functions package
    // so that the count aggregate function is available inside agg; without $ the result
    // cannot be sorted by times desc
    import spark.implicits._
    val topNDF = accessDF.filter($"day" === "20170511" && $"cmsType" === "video")
      .groupBy("day", "cmsId").agg(count("cmsId").as("times")).orderBy($"times".desc)
    topNDF.show(false)
  }

  def sqlCountTopNVideo(spark: SparkSession, accessDF: DataFrame): Unit = {
    /*
     * SQL API
     */
    // Create the temporary table access_view; when wrapping lines it is easy to miss a space
    accessDF.createTempView("access_view")
    val topNDF = spark.sql("select day, cmsId, count(1) as times from access_view " +
      "where day == '20170511' and cmsType == 'video' " +
      "group by day, cmsId " +
      "order by times desc")
    topNDF.show(false)
  }
}
Before you can save your data, you need to write a tool class that connects to the MySQL database, using the java.sql package
    1. Use DriverManager to connect to MySQL on port 3306
    2. Release the resources (Connection and PreparedStatement), taking care to handle exceptions
Note: if the test cannot obtain a connection and you see the following error, the mysql-connector dependency has not been added (or the wrong one was selected): java.sql.SQLException: No suitable driver found for jdbc:mysql://localhost:3306/imooc_project?user=root&password=666. There was also this error: scalac: error while loading <root>, error accessing /Users/kingheyleung/.m2/repository/mysql/mysql-connector-java/5.0.8/mysql-connector-java-5.0.8.jar. I finally chose version 5.1.40. Implementation code:
import java.sql.{Connection, DriverManager, PreparedStatement}

/**
 * Tool class for connecting to and operating on the MySQL database
 */
object MySQLUtils {
  // get a connection (note: do not declare the return type as Unit, otherwise no Connection is returned)
  def getConnection() = {
    DriverManager.getConnection("jdbc:mysql://localhost:3306/imooc_project?user=root&password=666")
  }

  // release the resources
  def release(connection: Connection, pstmt: PreparedStatement): Unit = {
    try {
      if (pstmt != null) {
        pstmt.close()
      }
    } catch {
      case e: Exception => e.printStackTrace()
    } finally {
      connection.close()
    }
  }
}
Save statistics to MySQL
    1. Create a table in MySQL with three fields: day, cms_id, times (mind the respective data types, define the columns as NOT NULL, and make day and cms_id the primary key)
    2. Create a model class (a case class) with three constructor parameters: day, cmsId, times
    3. Create a DAO class for operating on the database; its input parameter is a list (a ListBuffer loaded with the model class above), and its job is to insert the records into the database. Inside the DAO:
    4. First prepare the JDBC connection: create the Connection and PreparedStatement, also write the code that closes the connection, and wrap everything in try/catch/finally to handle exceptions;
    5. Then write the SQL statement, with a placeholder wherever the PreparedStatement needs a value to be bound;
    6. Traverse the list and set each object's fields on the pstmt;
    7. A tuning point!!! Before the traversal, switch off auto-commit; add each pstmt to a batch; execute the batch after the traversal; and finally commit the connection manually.
Implementation code:
import java.sql.{Connection, PreparedStatement}
import scala.collection.mutable.ListBuffer

// Course visits entity class
case class VideoAccessStat(day: String, cmsId: Long, times: Long)

/**
 * DAO operations for each dimension's statistics
 */
object StatDAO {
  /*
   * Save VideoAccessStat records to the database in a batch
   */
  def insertDayAccessTopN(list: ListBuffer[VideoAccessStat]): Unit = {
    var connection: Connection = null  // JDBC preparation: define the connection
    var pstmt: PreparedStatement = null
    try {
      connection = MySQLUtils.getConnection()  // actually get the connection
      connection.setAutoCommit(false)  // turn off the default auto-commit so batching can be used
      val sql = "insert into day_topn_video(day, cms_id, times) values (?, ?, ?)"  // placeholders
      pstmt = connection.prepareStatement(sql)  // the pstmt object generated from the SQL; the placeholders are filled in afterwards
      for (ele <- list) {
        pstmt.setString(1, ele.day)
        pstmt.setLong(2, ele.cmsId)
        pstmt.setLong(3, ele.times)
        pstmt.addBatch()  // add to the batch
      }
      pstmt.executeBatch()  // execute the batch
      connection.commit()   // commit manually
    } catch {
      case e: Exception => e.printStackTrace()
    } finally {
      MySQLUtils.release(connection, pstmt)
    }
  }
}

To match step 3 above, the DataFrame of statistics records has to be turned into a list of model objects:
    1. Create a list of the corresponding model class
    2. Traverse the records; take each field of a record as a parameter and create a model class object
    3. Add each object to the list
    4. Pass the list into the DAO class
The following code is added to the TopNStatJob class above to save the result records of the previously generated topNDF to MySQL:
try {
  topNDF.foreachPartition(partitionOfRecords => {
    val list = new ListBuffer[VideoAccessStat]  // create a list to hold the statistics records
    // traverse each record and take out the three fields mentioned above: day, cmsId, times
    partitionOfRecords.foreach(info => {
      val day = info.getAs[String]("day")  // fetch each field out of the record
      val cmsId = info.getAs[Long]("cmsId")
      val times = info.getAs[Long]("times")
      // each iteration creates a VideoAccessStat object and appends one entry to the list
      list.append(VideoAccessStat(day, cmsId, times))
    })
    // pass the list to the DAO class
    StatDAO.insertDayAccessTopN(list)
  })
} catch {
  case e: Exception => e.printStackTrace()
}
At this point the first requirement is complete. Function two: find the Top N videos per city. Building on function one, this uses the row_number window function. Implementation code:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// First count the number of accesses, grouped by day, cmsId and city
val cityAccessTopNDF = accessDF.filter(accessDF.col("day") === "20170511" && accessDF.col("cmsType") === "video")
  .groupBy("day", "cmsId", "city").agg(count("cmsId").as("times"))

// Then sort within each city: use the row_number function to generate a rank,
// name it times_rank, and keep the top 3
cityAccessTopNDF.select(
  cityAccessTopNDF.col("day"),
  cityAccessTopNDF.col("cmsId"),
  cityAccessTopNDF.col("times"),
  cityAccessTopNDF.col("city"),
  row_number().over(Window.partitionBy(cityAccessTopNDF.col("city"))
    .orderBy(cityAccessTopNDF.col("times").desc)).as("times_rank")
).filter("times_rank <= 3").show(false)
The other steps are the same as in function one, but an error appears when inserting into MySQL, because the connection's character set cannot handle the Chinese city names. First, the character set can be changed on the MySQL command line: set character_set_client = utf8. You can check the current character encoding settings with show variables like 'character_set_%';. Then add useUnicode=true&characterEncoding=utf8 to the JDBC connection URL. After this change the data could be imported into MySQL without garbled characters, but only a subset of the data arrived, and the console reported errors in com.mysql.jdbc.PreparedStatement.fillSendPacket and com.mysql.jdbc.PreparedStatement.execute. After removing the batching, all the data could be imported: insert directly with pstmt.executeUpdate() instead of batching.

Function three: Top N videos ranked by traffic. This is almost exactly the same as function one, except that the aggregation is not the count function but the sum of the traffic (a minimal sketch follows the delete function below).

For code reusability, and to prevent duplicate data from being generated, StatDAO defines a delete function:
def deleteDayData(day: String) = {
  var connection: Connection = null
  var pstmt: PreparedStatement = null
  val tables = Array("day_topn_video",
    "day_city_topn_video",
    "traffic_topn_video"
  )
  try {
    connection = MySQLUtils.getConnection()
    for (table <- tables) {
      val deleteSQL = s"delete from $table where day = ?"  // Scala string interpolation handles the table name
      pstmt = connection.prepareStatement(deleteSQL)
      pstmt.setString(1, day)
      pstmt.executeUpdate()
    }
  } catch {
    case e: Exception => e.printStackTrace()
  } finally {
    MySQLUtils.release(connection, pstmt)
  }
}
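As mentioned above, function three only changes the aggregation. A minimal sketch of the traffic Top N statistics, assuming the same accessDF and day as function one (the method name and the traffics alias are illustrative, not from the original code):

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

def trafficTopNVideo(spark: SparkSession, accessDF: DataFrame): Unit = {
  import spark.implicits._
  // Same filter and grouping as function one, but aggregate with sum("traffic")
  // instead of count, then sort by the summed traffic in descending order
  val trafficTopNDF = accessDF.filter($"day" === "20170511" && $"cmsType" === "video")
    .groupBy("day", "cmsId").agg(sum("traffic").as("traffics"))
    .orderBy($"traffics".desc)
  trafficTopNDF.show(false)
  // saving to MySQL then follows the same foreachPartition / DAO pattern as function one
}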

Note that the table name gets special handling in the delete statement: it is interpolated into the SQL string with Scala's s-interpolator rather than bound through the PreparedStatement. What follows next is visualizing the results above, modifying the job to run on YARN, and performance tuning.

