First, the basic offline data processing architecture:
- Data acquisition: Flume writes the web logs to HDFS
- Data cleansing: dirty data is cleaned by Spark, Hive, MapReduce or another computing framework; once cleaned, the data is put back into HDFS
- Data processing: business statistics and analysis according to the requirements, also done through a computing framework
- Processing results stored in an RDBMS or NoSQL store
- Data visualization: the results are displayed graphically with ECharts, HUE or Zeppelin
Process block diagram: steps 1 to 7 form the offline processing path, where step 5 is not necessarily Hive (Spark SQL and others work too) and step 6 is not necessarily an RDBMS (it can be NoSQL). Execution is driven by a scheduling framework such as Oozie or Azkaban, which specifies when each task runs; the other line in the diagram is real-time processing. The requirements proposed for this project:
- Count the Top-N most popular items and their corresponding visit counts within a certain time period
- Count the most popular items by city, extracting the city information from the IP address
- Per-access traffic statistics
Internet logs generally include: access time, access URL, traffic consumed, and access IP address. We extract the data we need from the log. Suppose we now have only one computer, used for learning as the whole cluster; to prevent memory overflow it is necessary to cut the log down, intercepting the first 10,000 lines with the head -10000 command (if the data volume is too large, the IDE may report errors).

Two, log processing

Data cleansing, step one: extract the useful information from the raw log, in this case the time, URL, traffic and IP:
- Read the log file to get an RDD, split each line into an array with the map method, and pick out the useful items in the array (use breakpoints to analyse which items are useful and match them to the corresponding variables)
- The information obtained may still contain errors caused by problems such as thread safety: this step originally used SimpleDateFormat (which is not thread-safe) to transform the time format, which produced some wrong time conversions. It is usually replaced with FastDateFormat; a small sketch illustrating the difference follows this list.
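A minimal sketch of that difference (not part of the project code; the sample timestamp is made up), showing why a shared SimpleDateFormat is risky inside Spark tasks while FastDateFormat can be shared safely:

import java.util.Locale
import java.text.SimpleDateFormat
import org.apache.commons.lang3.time.FastDateFormat

object DateFormatSafetyDemo {
  // SimpleDateFormat keeps mutable parsing state, so one instance shared by several
  // threads (as happens when it sits in an object used inside Spark tasks) can throw
  // or silently produce wrong dates
  val unsafeFormat = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH)

  // FastDateFormat from commons-lang3 is immutable and thread-safe, which is why
  // the DateUtils object below uses it
  val safeFormat = FastDateFormat.getInstance("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH)

  def main(args: Array[String]): Unit = {
    val sample = "10/Nov/2016:00:01:02 +0800"
    // single-threaded use works for both; the difference only shows up under concurrency
    println(safeFormat.format(unsafeFormat.parse(sample)))
  }
}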
Implementation code:
Extract the useful information and convert the format:

import org.apache.spark.sql.SparkSession

object SparkStatFormatJob {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SparkStatFormatJob")
      .master("local[2]")
      .getOrCreate()

    val access = spark.sparkContext.textFile("/Users/kingheyleung/downloads/data/10000_access.log")
    //access.take(10).foreach(println)

    access.map(line => {
      val splits = line.split(" ")
      val ip = splits(0)
      // Using the breakpoint method, observe the splits array to find out which fields
      // hold the time, the URL and the traffic
      // The DateUtils class below converts the time into a common expression
      // The extra "" quotation marks are stripped off the URL
      val time = splits(3) + " " + splits(4)
      val url = splits(11).replaceAll("\"", "")
      val traffic = splits(9)
      //(ip, DateUtils.parse(time), url, traffic)  // used to test whether the output is normal
      // re-assemble the trimmed data, tab-separated
      DateUtils.parse(time) + "\t" + url + "\t" + traffic + "\t" + ip
    }).saveAsTextFile("file:///usr/local/mycode/immooclog/")

    spark.stop()
  }
}
Date resolution:

import java.util.{Date, Locale}
import org.apache.commons.lang3.time.FastDateFormat

object DateUtils {
  // input format, e.g. [10/Nov/2016:00:01:02 +0800]
  val ORIGINAL_TIME_FORMAT = FastDateFormat.getInstance("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH)
  // output format
  val TARGET_TIME_FORMAT = FastDateFormat.getInstance("yyyy-MM-dd HH:mm:ss")

  def parse(time: String) = {
    TARGET_TIME_FORMAT.format(new Date(getTime(time)))
  }

  def getTime(time: String) = {
    try {
      ORIGINAL_TIME_FORMAT.parse(time.substring(time.indexOf("[") + 1, time.lastIndexOf("]"))).getTime
    } catch {
      case e: Exception => 0L
    }
  }
}
General log processing needs partitioning.
This example is partitioned by the access time in the log.
Step two: parse the result of the previous step. I call it the parsed log, though it is really just the tidier data log; work out the meaning of each field and convert the RDD into a DataFrame. In this case the task is:

Input: access time, access URL, traffic consumed, access IP address
==> Output: URL, type (in this example the URL suffix is either article or video), the corresponding ID number, traffic, IP, city, time, and day (used for grouping)

Then create the DataFrame, that is, define the Row and the StructType, where the Row corresponds to each field of the original log and the StructType defines each line according to the desired output. Implementation code:
// Parse the log
import org.apache.spark.sql.SparkSession

object SparkStatCleanJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SparkStatCleanJob")
      .master("local[2]")
      .getOrCreate()

    val accessRDD = spark.sparkContext.textFile("file:///Users/kingheyleung/downloads/data/access_10000.log")

    // RDD converted to DF: define the Row and the StructType
    val accessDF = spark.createDataFrame(
      accessRDD.map(line => LogConvertUtils.convertToRow(line)),
      LogConvertUtils.struct)

    //accessDF.printSchema()
    //accessDF.show(false)

    spark.stop()
  }
}
// RDD-to-DF tool class
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object LogConvertUtils {
  // build the struct
  val struct = StructType(Array(
    StructField("url", StringType),
    StructField("cmsType", StringType),
    StructField("cmsId", LongType),
    StructField("traffic", LongType),
    StructField("ip", StringType),
    StructField("city", StringType),
    StructField("time", StringType),
    StructField("day", StringType)
  ))

  // extract the information and build the Row
  def convertToRow(line: String) = {
    try {
      val splits = line.split("\t")
      val url = splits(1)
      val traffic = splits(2).toLong
      val ip = splits(3)

      val domain = "http://www.imooc.com/"
      val cms = url.substring(url.indexOf(domain) + domain.length())
      val cmsSplits = cms.split("/")

      var cmsType = ""
      var cmsId = 0L
      // determine whether the URL carries a type/id pair
      if (cmsSplits.length > 1) {
        cmsType = cmsSplits(0)
        cmsId = cmsSplits(1).toLong
      }

      val city = IpUtils.getCity(ip)   // resolved through the IP parsing tool, see below
      val time = splits(0)
      val day = time.substring(0, 10).replaceAll("-", "")

      // define the Row, in the same order as the struct
      Row(url, cmsType, cmsId, traffic, ip, city, time, day)
    } catch {
      case e: Exception => Row(0)
    }
  }
}
Note: when building the Row, be sure to remember the type conversions! The struct declares cmsId and traffic as LongType, so the corresponding string fields must be converted with toLong.
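The IpUtils tool referenced in convertToRow never appears in this post; a minimal sketch, assuming it is just a thin wrapper around the ipdatabase project introduced below (the IpHelper class and findRegionByIp method are how I remember that project's API, so verify them against its source):

import com.ggstar.util.ip.IpHelper

object IpUtils {
  // resolve an IP address string into a city/region name via the ipdatabase project
  def getCity(ip: String): String = {
    IpHelper.findRegionByIp(ip)
  }
}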
Further analysis: resolving IP addresses into city information. To turn the IP addresses into intuitive city information, I used an open-source project on GitHub: https://github.com/wzhe06/ipdatabase.git

Compile the downloaded project with Maven: mvn clean package -DskipTests

Install the jar into your Maven repository: mvn install:install-file -Dfile=path.jar -DgroupId=com.ggstar -DartifactId=ipdatabase -Dversion=1.0 -Dpackaging=jar

Add the dependency to the IDE's pom.xml, referring to the dependency shown in the pom.xml on the project's GitHub home page. But then there is an error: java.io.FileNotFoundException: file:/users/rocky/maven_repos/com/ggstar/ipdatabase/1.0/ipdatabase-1.0.jar!/ipRegion.xlsx (No such file or directory). As the message suggests, we need to find the corresponding files in the project source code and copy them into main/resources in the IDE.

Storing the cleansed data: store it partitioned by day with partitionBy; mode(SaveMode.Overwrite) overwrites any existing output; coalesce is said to be used often in production and is one of the project's tuning points, controlling the size and number of output files. A minimal write sketch is given right after the list below.

Three, implementing the statistics functions

Function one: count the Top-N videos. First step: read the data with read.format().load. Second step, do the statistics in one of two ways:
- Statistical analysis using the DataFrame API
- SQL API
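The write of the cleansed data described above (partitioned by day, overwrite mode, coalesce to control the number of output files) has no code in this post; a minimal sketch, assuming the cleansed DataFrame from SparkStatCleanJob is accessDF and the output path is the one that TopNStatJob loads below:

import org.apache.spark.sql.SaveMode

// save the cleansed data as parquet, partitioned by day; coalesce(1) keeps the number
// of output files down, which is the tuning point mentioned above
accessDF.coalesce(1)
  .write
  .format("parquet")
  .mode(SaveMode.Overwrite)
  .partitionBy("day")
  .save("/users/kingheyleung/downloads/data/clean/")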
Finally, the statistical results are saved into the MySQL database.

A note on reading: when reading the parquet files, Spark will by default infer the data types of the partition fields, but sometimes we just want them to be strings; this needs to be added when defining the SparkSession: config("spark.sql.sources.partitionColumnTypeInference.enabled", "false"), so the fields are read back with their original types.

For the two statistics methods: if you use the DataFrame API, you need to import the implicit conversions (this is what turns a column name into a Column!) with spark.implicits._, and to use the count() aggregate you need to import the package org.apache.spark.sql.functions._. If you use the SQL API, create a temporary table with createTempView and write the SQL statement carefully; when wrapping it across lines it is easy to miss the spaces. Implementation code:
// Complete the statistics operation
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object TopNStatJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TopNStatJob")
      .config("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
      .master("local[2]")
      .getOrCreate()

    val accessDF = spark.read.format("parquet").load("/users/kingheyleung/downloads/data/clean/")

    dfCountTopNVideo(spark, accessDF)
    sqlCountTopNVideo(spark, accessDF)
    //accessDF.printSchema()

    spark.stop()
  }

  def dfCountTopNVideo(spark: SparkSession, accessDF: DataFrame): Unit = {
    /*
     * DataFrame API
     */
    // import the implicit conversions so $ can be used, and the functions package so the
    // agg aggregate function count is available; without $ the times column cannot be
    // sorted in descending order
    import spark.implicits._
    val topNDF = accessDF.filter($"day" === "20170511" && $"cmsType" === "video")
      .groupBy("day", "cmsId")
      .agg(count("cmsId").as("times"))
      .orderBy($"times".desc)
    topNDF.show(false)
  }

  def sqlCountTopNVideo(spark: SparkSession, accessDF: DataFrame): Unit = {
    /*
     * SQL API
     */
    // create the temporary table access_view; when breaking the SQL string across lines
    // it is easy to forget the spaces
    accessDF.createTempView("access_view")
    val topNDF = spark.sql("select day, cmsId, count(1) as times from access_view " +
      "where day = '20170511' and cmsType = 'video' " +
      "group by day, cmsId " +
      "order by times desc")
    topNDF.show(false)
  }
}
Before you can save your data, you need to write a tool class that connects to the MySQL database, using the java.sql package
- Use DriverManager to connect to MySQL on port 3306
- Release the resources, both the Connection and the PreparedStatement, paying attention to exception handling
Note: if the test cannot obtain a connection and the following error appears, the mysql-connector dependency has not been added, or the wrong version of the package was selected: java.sql.SQLException: No suitable driver found for jdbc:mysql://localhost:3306/imooc_project?user=root&password=666. There was also: error: scalac: error while loading <root>, Error accessing /users/kingheyleung/.m2/repository/mysql/mysql-connector-java/5.0.8/mysql-connector-java-5.0.8.jar. I finally chose the 5.1.40 version. Implementation code:
import java.sql.{Connection, DriverManager, PreparedStatement}

/**
 * Tool class for connecting to and operating the MySQL database
 */
object MySQLUtils {
  // get a connection
  def getConnection() = {
    DriverManager.getConnection("jdbc:mysql://localhost:3306/imooc_project?user=root&password=666")
  }

  // release the resources
  def release(connection: Connection, pstmt: PreparedStatement): Unit = {
    try {
      if (pstmt != null) {
        pstmt.close()
      }
    } catch {
      case e: Exception => e.printStackTrace()
    } finally {
      if (connection != null) {
        connection.close()
      }
    }
  }
}
Save statistics to MySQL
- Create a table in MySQL with the three fields day, cms_id and times (pay attention to their data types, define them as NOT NULL, and use day and cms_id together as the primary key); a DDL sketch follows this list
- Create a model class (a case class) with three parameters: day, cmsId, times
- Create a DAO class for operating the database; its input parameter is a list loaded with the model class above, and its job is to insert the records into the database. Inside the DAO the steps are:
- First, prepare the JDBC connection: create the Connection and the PreparedStatement, also write the code that closes the connection, and handle exceptions with try/catch/finally;
- Then write the SQL statement, putting a placeholder wherever the PreparedStatement needs a value to be assigned;
- Traverse the list and put each object into the pstmt;
- Tuning point: before the traversal, switch off auto-commit; during the traversal, add each record to the pstmt batch; after the traversal, execute the batch; finally, commit the connection manually.
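The result table itself is not created anywhere in this post; a minimal sketch of creating it through the MySQLUtils connection defined above, where only the field names, the NOT NULL constraints and the primary key come from the list above and the column widths are my own assumptions:

// create the result table for function one (day_topn_video, the table used by StatDAO below)
val connection = MySQLUtils.getConnection()
val statement = connection.createStatement()
statement.execute(
  """create table if not exists day_topn_video (
    |  day varchar(8) not null,
    |  cms_id bigint not null,
    |  times bigint not null,
    |  primary key (day, cms_id)
    |)""".stripMargin)
statement.close()
connection.close()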
Implementation code:
import java.sql.{Connection, PreparedStatement}
import scala.collection.mutable.ListBuffer

// Course visits entity class
case class VideoAccessStat(day: String, cmsId: Long, times: Long)

/**
 * DAO operations for the statistics of each dimension
 */
object StatDAO {
  /*
   * Bulk-save VideoAccessStat records to the database
   */
  def insertDayAccessTopN(list: ListBuffer[VideoAccessStat]): Unit = {
    var connection: Connection = null          // JDBC preparation: define the connection
    var pstmt: PreparedStatement = null

    try {
      connection = MySQLUtils.getConnection()  // actually get the connection
      connection.setAutoCommit(false)          // turn off the default auto-commit so batching works

      val sql = "insert into day_topn_video(day, cms_id, times) values (?, ?, ?)"  // placeholders
      pstmt = connection.prepareStatement(sql) // the SQL statement generates the pstmt object; after this the placeholders can be filled with data

      for (ele <- list) {
        pstmt.setString(1, ele.day)
        pstmt.setLong(2, ele.cmsId)
        pstmt.setLong(3, ele.times)
        pstmt.addBatch()                       // add to the batch
      }

      pstmt.executeBatch()                     // execute the batch
      connection.commit()                      // commit manually
    } catch {
      case e: Exception => e.printStackTrace()
    } finally {
      MySQLUtils.release(connection, pstmt)
    }
  }
}
To match step 3 above, the DataFrame of statistic records is turned into a list of objects:
- Create a list of the corresponding model class
- Traverse the records, take each field of a record as a parameter, and create a model class object
- Add each object to the list
- Pass the list into the DAO class
The following code is added to the TopNStatJob class above to save the result records of the previously generated topNDF into MySQL:
// requires: import scala.collection.mutable.ListBuffer
try {
  topNDF.foreachPartition(partitionOfRecords => {
    // create a list to hold the statistics records
    val list = new ListBuffer[VideoAccessStat]

    // traverse each record and take out the three fields above: day, cmsId, times
    partitionOfRecords.foreach(info => {
      val day = info.getAs[String]("day")
      val cmsId = info.getAs[Long]("cmsId")
      val times = info.getAs[Long]("times")

      // each iteration creates a VideoAccessStat object and appends one entry to the list
      list.append(VideoAccessStat(day, cmsId, times))
    })

    // hand the list over to the DAO class
    StatDAO.insertDayAccessTopN(list)
  })
} catch {
  case e: Exception => e.printStackTrace()
}
So far this requirement of the project has been completed.

Function two: find the Top-N videos by city. On the basis of function one, this uses the row_number window function. Implementation code:
// requires: import org.apache.spark.sql.expressions.Window and org.apache.spark.sql.functions._

// first count the accesses, grouped by day, cmsId and city
val cityAccessTopNDF = accessDF.filter(accessDF.col("day") === "20170511" && accessDF.col("cmsType") === "video")
  .groupBy("day", "cmsId", "city")
  .agg(count("cmsId").as("times"))

// sort within each city: use the row_number function to generate a rank, name it times_rank, and keep the top 3
cityAccessTopNDF.select(
  cityAccessTopNDF.col("day"),
  cityAccessTopNDF.col("cmsId"),
  cityAccessTopNDF.col("times"),
  cityAccessTopNDF.col("city"),
  row_number().over(Window.partitionBy(cityAccessTopNDF.col("city"))
    .orderBy(cityAccessTopNDF.col("times").desc)).as("times_rank")
).filter("times_rank <= 3").show(false)
The other steps are the same as in function one, but inserting into MySQL fails because the MySQL character set is not configured for Chinese text. First, the character set can be changed on the MySQL command line: set character_set_client = utf8; the current character encoding settings can be viewed with show variables like 'character_set_%';. Then add useUnicode=true&characterEncoding=utf8 to the JDBC connection string. After these changes the data can be imported into MySQL without garbled characters, but only a subset of the data arrives, and the console reports errors in com.mysql.jdbc.PreparedStatement.fillSendPacket and com.mysql.jdbc.PreparedStatement.execute. After removing the batching, all of the data could be imported: insert with pstmt.executeUpdate(), i.e. without batch processing.

Function three: Top-N videos sorted by traffic. This is almost exactly the same as function one, except that the aggregation is not the count function but the sum of the traffic (a minimal sketch follows the delete function below). For code reusability, and to prevent duplicate data from being generated, StatDAO defines a delete function:
def deleteDayData(day: String) = {
  var connection: Connection = null
  var pstmt: PreparedStatement = null
  val tables = Array("day_topn_video", "day_city_topn_video", "traffic_topn_video")

  try {
    connection = MySQLUtils.getConnection()

    for (table <- tables) {
      // Scala string interpolation splices the table name into the SQL; only day is a placeholder
      val deleteSql = s"delete from $table where day = ?"
      pstmt = connection.prepareStatement(deleteSql)
      pstmt.setString(1, day)
      pstmt.executeUpdate()
    }
  } catch {
    case e: Exception => e.printStackTrace()
  } finally {
    MySQLUtils.release(connection, pstmt)
  }
}
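The statistics code for function three is not given in this post; a minimal sketch, assuming the same spark session and accessDF as in function one, with sum("traffic") replacing the count:

// requires: import org.apache.spark.sql.functions._ and import spark.implicits._
// Top-N videos by total traffic: same shape as function one, but aggregated with sum
val trafficTopNDF = accessDF.filter($"day" === "20170511" && $"cmsType" === "video")
  .groupBy("day", "cmsId")
  .agg(sum("traffic").as("traffics"))
  .orderBy($"traffics".desc)
trafficTopNDF.show(false)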
Note that the table name gets special treatment in the pstmt here: a table name cannot be bound as a PreparedStatement parameter, so it is spliced into the SQL with string interpolation and only day is set through the placeholder. What follows next is visualizing the results above, modifying the job to run on YARN, and performance tuning.
Spark SQL implementation of offline log batch processing