sparkSQL1.1 Introduction VIII: The comprehensive application of sparkSQL


Besides in-memory computing, what makes Spark eye-catching is its all-in-one design, realizing "one stack to rule them all". Below is a simple simulation of several integrated scenarios that use not only sparkSQL but also other Spark components:
    • Store classification: classify stores according to their sales
    • Goods allocation: allocate goods based on the quantity of goods sold and the distance between stores
The former uses sparkSQL together with MLlib's clustering algorithm; the latter uses sparkSQL together with GraphX. The experiment uses IntelliJ IDEA to write and debug the code, packages it into doc.jar, and then submits it to the cluster with spark-submit.
1: Store classification

Classification is very common in practical applications, such as classifying customers or stores; applying different strategies to different categories can effectively reduce operating costs and increase revenue. Clustering in machine learning is a method of dividing data into several classes according to their feature data, combined with a user-specified number of categories. Here is a simple example based on the Hive data from Part V of this series: stores are divided into 3 levels according to two features, sales quantity and sales amount. Create an object SQLMLlib in IDEA:
package doc

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.catalyst.expressions.Row
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object SQLMLlib {
  def main(args: Array[String]) {
    // Mask unnecessary logs on the terminal
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)

    // Set up the running environment
    val sparkConf = new SparkConf().setAppName("SQLMLlib")
    val sc = new SparkContext(sparkConf)
    val hiveContext = new HiveContext(sc)

    // Use sparkSQL to query the sales quantity and amount of each store
    hiveContext.sql("use saledata")
    val sqlData = hiveContext.sql("select a.locationid, sum(b.qty) totalqty, sum(b.amount) totalamount from tblStock a join tblStockDetail b on a.ordernumber=b.ordernumber group by a.locationid")

    // Convert the query result into vectors
    val parsedData = sqlData.map {
      case Row(_, totalqty, totalamount) =>
        val features = Array[Double](totalqty.toString.toDouble, totalamount.toString.toDouble)
        Vectors.dense(features)
    }

    // Cluster the data into 3 classes with 20 iterations to build the model.
    // Note: the number of partitions is not set here, so the default is used.
    val numClusters = 3
    val numIterations = 20
    val model = KMeans.train(parsedData, numClusters, numIterations)

    // Use the model to classify the data that was read, and save the output.
    // Because no partition count is set, the output consists of 200 small files,
    // which can be merged locally with bin/hdfs dfs -getmerge.
    val result2 = sqlData.map {
      case Row(locationid, totalqty, totalamount) =>
        val features = Array[Double](totalqty.toString.toDouble, totalamount.toString.toDouble)
        val lineVector = Vectors.dense(features)
        val prediction = model.predict(lineVector)
        locationid + " " + totalqty + " " + totalamount + " " + prediction
    }.saveAsTextFile(args(0))

    sc.stop()
  }
}
Compile and package the code, then run it:
While it runs, you can see that the clustering stage uses 200 partitions:
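The 200 partitions correspond to sparkSQL's default shuffle setting (spark.sql.shuffle.partitions, which defaults to 200), since sqlData comes from a group-by query. Purely as a hypothetical sketch, not part of the original code, the number of output files could be reduced in either of two ways:

// Hypothetical tweaks, not in the original example:

// 1. Lower sparkSQL's shuffle partition count before running the group-by query,
//    so sqlData (and everything derived from it) uses fewer partitions.
hiveContext.setConf("spark.sql.shuffle.partitions", "20")

// 2. Or coalesce the classified RDD before saving, e.g. change the last step to:
//    sqlData.map { ... }.coalesce(1).saveAsTextFile(args(0))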
After it finishes, use hdfs dfs -getmerge to merge the output into a local file and view the results:
Finally, use R to plot the 3 categories in different colors.
2: Goods allocation

In commercial activities, how to put goods where they are needed most is an eternal question. In Spark, this problem can be solved with graph computation: each point of sale is a vertex of the graph, whose properties can be features such as goods sales and inventory; allocation factors such as distance, travel time, and transfer cost are the edges. Goods-transfer information is then obtained by polling the goods and the allocation points. Below is a code framework that uses sparkSQL and GraphX:
package doc

// Due to a temporary lack of data, this example only gives a framework; it will be filled in when there is a chance.
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

object SQLGraphX {
  def main(args: Array[String]) {
    // Mask unnecessary logs on the terminal
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)

    // Set up the running environment
    val sparkConf = new SparkConf().setAppName("SQLGraphX")
    val sc = new SparkContext(sparkConf)
    val hiveContext = new HiveContext(sc)

    // Switch to the sales database
    hiveContext.sql("use saledata")

    // Use sparkSQL to extract store sales and inventory as the vertices of the graph.
    // locationid is the VertexId; (sales, inventory) is the VD, typically an (Int, Int).
    val vertexData = hiveContext.sql("select a.locationid, b.saleQty, b.invQty from a join b on a.col1=b.col2 where conditions")

    // Use sparkSQL to extract the distance between stores (or travel time and other
    // allocation-related properties) as the edges of the graph. Distance is the ED,
    // which can be a type such as Int, Long, or Double.
    val edgeData = hiveContext.sql("select srcid, distid, distance from distanceinfo")

    // Construct the vertexRDD and edgeRDD
    val vertexRDD: RDD[(Long, (Int, Int))] = vertexData.map(...)
    val edgeRDD: RDD[Edge[Int]] = edgeData.map(...)

    // Construct the graph Graph[VD, ED]
    val graph: Graph[(Int, Int), Int] = Graph(vertexRDD, edgeRDD)

    // Map processing according to the allocation rules
    val initialGraph = graph.mapVertices(...)
    initialGraph.pregel(...)

    // Output
    sc.stop()
  }
}
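The map and Pregel steps above are left as placeholders by the author. Purely as an illustration, here is a minimal, hypothetical sketch of how they might be filled in: the column names follow the queries in the framework, the source store id is made up, and the Pregel logic shown (single-source shortest distance) is only a stand-in for a real allocation rule.

import org.apache.spark.sql.catalyst.expressions.Row

// Hypothetical completion of the placeholders above (illustration only).
val vertexRDD: RDD[(Long, (Int, Int))] = vertexData.map {
  case Row(locationid, saleqty, invqty) =>
    (locationid.toString.toLong, (saleqty.toString.toInt, invqty.toString.toInt))
}
val edgeRDD: RDD[Edge[Int]] = edgeData.map {
  case Row(srcid, distid, distance) =>
    Edge(srcid.toString.toLong, distid.toString.toLong, distance.toString.toInt)
}
val graph: Graph[(Int, Int), Int] = Graph(vertexRDD, edgeRDD)

// As one possible "allocation rule", compute the shortest distance from a chosen
// source store (hypothetical id) to every other store with Pregel.
val sourceId: VertexId = 1L
val initialGraph = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

val distances = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist),      // vertex program: keep the smaller distance
  triplet => {                                         // send a message when a shorter path is found
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    else
      Iterator.empty
  },
  (a, b) => math.min(a, b)                             // merge messages: take the minimum
)

distances.vertices.collect().foreach(println)

In a real application, the shortest-distance logic would be replaced by the actual allocation rule, for example one that also weighs the inventory surplus at the source store and the demand at the destination.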

3: Summary

From the code above, it can be seen that, apart from the final write to disk, everything is computed in memory. This avoids the cost of landing intermediate data when exchanging it between multiple systems and improves efficiency. That is what makes the Spark ecosystem truly powerful: one stack to rule them all. In addition, sparkSQL combined with Spark Streaming can form the currently popular lambda architecture and provide a solution for CEP (complex event processing). This power attracts a large number of open source enthusiasts and drives the development of the Spark ecosystem.
The material accumulated over the past few years is being consolidated into the new courseware for the Spark Big Data Fast Computing Platform (Phase III); this article will be further polished over the coming days.
