Detailed explanation of Spark SQL user-defined functions: UDF and UDAF

Source: Internet
Author: User
Tags: hadoop, mapreduce

UDAF = User-Defined Aggregation Function

Spark SQL provides a wealth of built-in functions for developers to use, so why do we need user-defined functions? Because real business scenarios can be complex and the built-in functions cannot cover every case, Spark SQL exposes an extensible function interface: when the built-in functions cannot meet a requirement, you define a SQL function according to the interface specification and use it however your business needs.
For example, suppose a MySQL database has a task table with two fields: taskid (the task ID) and taskparam (the task request parameters in JSON format). For simplicity, only one record is listed here:

taskid     1
taskparam  {"endAge":[" -"],"endDate":["2016-06-21"],"startAge":["10"],"startDate":["2016-06-21"]}

Suppose the application has read this record from MySQL and registered the resulting DataFrame as a temporary table named task (a minimal sketch of this setup follows below). Here is the question: how do we get the first value of startAge inside taskparam?
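A hedged sketch of that setup, assuming Spark 1.x with the MySQL JDBC driver on the classpath; the application name, connection URL, and credentials are hypothetical placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("TaskParamUDF"))
val sqlContext = new SQLContext(sc)

// Read the MySQL task table via JDBC (hypothetical URL and credentials) and expose it to Spark SQL.
val taskDF = sqlContext.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/mydb")
  .option("dbtable", "task")
  .option("user", "root")
  .option("password", "******")
  .load()

taskDF.registerTempTable("task")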

sqlContext.sql("select taskid, getJsonFieldUDF(taskparam, 'startAge') from task")

This is where we need a custom UDF, named getJsonFieldUDF. The Java version of the code is roughly as follows:

package cool.pengych.sparker.product;

import org.apache.spark.sql.api.java.UDF2;
import com.alibaba.fastjson.JSONObject;

/**
 * User-defined function
 * @author pengyucheng
 */
public class GetJsonObjectUDF implements UDF2<String, String, String> {

    /**
     * Get the value of an array-typed field inside a JSON string.
     */
    @Override
    public String call(String json, String field) throws Exception {
        try {
            JSONObject jsonObject = JSONObject.parseObject(json);
            return jsonObject.getJSONArray(field).getString(0);
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    }
}
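For reference, here is a minimal Scala sketch of registering and invoking such a UDF. Instead of the Java UDF2 class above, it registers an equivalent Scala closure under the same name; the task table and field names follow the example above, and fastjson is assumed to be on the classpath:

import com.alibaba.fastjson.JSON

// Register a Scala closure that does the same job as the Java getJsonFieldUDF above.
sqlContext.udf.register("getJsonFieldUDF", (json: String, field: String) => {
  try {
    // Parse the JSON string and return the first element of the requested array-valued field.
    JSON.parseObject(json).getJSONArray(field).getString(0)
  } catch {
    case _: Exception => null
  }
})

sqlContext.sql("select taskid, getJsonFieldUDF(taskparam, 'startAge') as startAge from task").show()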

Requirements like this are common in real-world projects: request parameters are often stored in the database in JSON format. That was the first example; next is a simple "hello world" level example implemented in Scala to get a feel for how UDFs and UDAFs are used.

Problem

Given the following array:

val bigData = Array("Spark", "Hadoop", "Flink", "Spark", "Hadoop", "Flink", "Spark", "Hadoop", "Flink", "Spark", "Hadoop", "Flink")

Group the strings in the array, then compute each string's length and the number of times it appears. The expected result is as follows:

+------+-----+------+
|  name|count|length|
+------+-----+------+
| spark|    4|     5|
| flink|    4|     5|
|hadoop|    4|     6|
+------+-----+------+

Note: the string "spark" has a length of 5 and appears 4 times.

Analysis
      • Customize a function that computes the length of a string
        A custom SQL function is just like an ordinary Scala function, except that it must be registered with the SQLContext before it can be used.
      • Customize an aggregate function
        After grouping by the string (the name column), call the custom aggregate function to count the occurrences.
        The abstraction is straightforward; see the code below.
Code
package main.scala

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types.{DataType, IntegerType, StringType, StructField, StructType}

/**
 * Spark SQL UDF and UDAF (user-defined aggregation function).
 * UDF: the input is a single data record; the implementation is an ordinary Scala function that only needs to be registered.
 * UDAF: the function operates on a set of records and can implement custom aggregation logic.
 */
object SparkSQLUDF {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("SparkSQLWindowFunctionOps")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val bigData = Array("Spark", "Hadoop", "Flink", "Spark", "Hadoop", "Flink",
      "Spark", "Hadoop", "Flink", "Spark", "Hadoop", "Flink")
    val bigDataRDD = sc.parallelize(bigData)
    val bigDataRowRDD = bigDataRDD.map(line => Row(line))
    val structType = StructType(Array(StructField("name", StringType, true)))
    val bigDataDF = sqlContext.createDataFrame(bigDataRowRDD, structType)

    bigDataDF.registerTempTable("bigDataTable")

    /** Register the UDF with the SQLContext; with Scala 2.10.x a UDF can accept at most 22 input parameters. */
    sqlContext.udf.register("computeLength", (input: String) => input.length)
    sqlContext.sql("select name, computeLength(name) as length from bigDataTable").show
    // while (true) {}

    sqlContext.udf.register("wordCount", new MyUDAF)
    sqlContext.sql("select name, wordCount(name) as count, computeLength(name) as length from bigDataTable group by name").show
  }
}

/**
 * User-defined aggregate function.
 */
class MyUDAF extends UserDefinedAggregateFunction {

  /**
   * The type of the input data.
   * The field name is arbitrary: it can be "name" or any other string.
   */
  override def inputSchema: StructType = StructType(Array(StructField("name", StringType, true)))

  /**
   * The type of the intermediate result (buffer) used while the aggregation is being computed.
   */
  override def bufferSchema: StructType = StructType(Array(StructField("count", IntegerType, true)))

  /**
   * The return type.
   */
  override def dataType: DataType = IntegerType

  /**
   * Whether the function always returns the same output for the same input. true: yes.
   */
  override def deterministic: Boolean = true

  /**
   * Initializes the given aggregation buffer.
   */
  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0
  }

  /**
   * How to update the aggregation within a group whenever a new input value arrives.
   * This is the local aggregation step, equivalent to a combiner in the Hadoop MapReduce model.
   */
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    buffer(0) = buffer.getInt(0) + 1
  }

  /**
   * The global merge that combines the partial results produced on the distributed nodes.
   */
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getInt(0) + buffer2.getInt(0)
  }

  /**
   * Returns the final result of the UDAF.
   */
  override def evaluate(buffer: Row): Any = buffer.getInt(0)
}
Execution Result:
INFO DAGScheduler: ResultStage 5 (show at SparkSQLUDF.scala) finished in 1.625 s
+------+-----+------+
|  name|count|length|
+------+-----+------+
| spark|    4|     5|
| flink|    4|     5|
|hadoop|    4|     6|
+------+-----+------+
INFO DAGScheduler: Job 3 finished: show at SparkSQLUDF.scala, took 1.717878 s
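As an aside, for this particular counting task the built-in aggregation would produce the same numbers; the UDAF above is there to demonstrate the mechanics for cases the built-in functions cannot cover. A quick sanity check, assuming the bigDataDF DataFrame from the code above:

// Built-in equivalent of the wordCount UDAF for this simple counting case.
bigDataDF.groupBy("name").count().show()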
Summary
      • A plea to the Spark developers to streamline the UDAF API
        To implement a SQL aggregate function myself, I have to extend UserDefinedAggregateFunction and implement 8 abstract methods. Eight of them! What a chore! Still, to get an aggregate (the A in UDAF) function in SQL that fits a particular business scenario, a UDAF is the way to go.
        How should MutableAggregationBuffer be understood? It is the storage for the intermediate result; aggregation means accumulating (or otherwise combining) values across multiple records. A small trace illustrating this is sketched after the registration syntax below.

      • UDF and UDAF registration syntax

sqlContext.udf.register("computeLength", (input: String) => input.length)
sqlContext.udf.register("wordCount", new MyUDAF)
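To make the role of MutableAggregationBuffer concrete, here is a minimal plain-Scala sketch (not the Spark API) that mimics how the four main UDAF callbacks cooperate for the "spark" group, pretending the group is split across two partitions:

object UDAFTrace {
  def main(args: Array[String]): Unit = {
    // The "spark" group: four records, pretend they live on two partitions.
    val (partition1, partition2) = Seq("Spark", "Spark", "Spark", "Spark").splitAt(2)

    val initial = 0                                           // initialize: buffer(0) = 0
    def update(buffer: Int, input: String): Int = buffer + 1  // update: one more record seen locally
    def merge(b1: Int, b2: Int): Int = b1 + b2                // merge: combine partial counts from partitions
    def evaluate(buffer: Int): Int = buffer                   // evaluate: the final count for the group

    val local1 = partition1.foldLeft(initial)(update)         // local aggregation on partition 1 -> 2
    val local2 = partition2.foldLeft(initial)(update)         // local aggregation on partition 2 -> 2
    println(evaluate(merge(local1, local2)))                  // prints 4, matching the count column above
  }
}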
