Using Spark SQL UDFs


Spark 1.1 introduces user-defined functions (UDFs), which let users write the custom functions they actually need for processing data in Spark SQL.

The set of functions currently built into Spark SQL is limited, and some commonly used ones, such as len and concat, are missing; with UDFs it is very convenient to implement whatever functions the business needs.

A Spark SQL UDF is really just a Scala function. Catalyst wraps it in an expression node, and that node's eval method computes the UDF's result for the current row; see Spark SQL source analysis: UDF for details.
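To make that concrete, here is a rough, simplified sketch of the wrapping idea (the names and signatures below are illustrative only, not the actual Catalyst classes): the registered function is stored in an expression node together with the expressions that produce its arguments, and eval applies it to the argument values of the current row.

// Illustrative sketch only; not the real Catalyst implementation.
trait Expr { def eval(row: Seq[Any]): Any }

// A column reference: evaluating it just reads the value at its index in the row.
case class Col(index: Int) extends Expr {
  def eval(row: Seq[Any]): Any = row(index)
}

// A UDF node holds the user's Scala function plus its argument expressions.
case class Udf(function: AnyRef, children: Seq[Expr]) extends Expr {
  // Dispatch on arity and apply the function to the evaluated arguments.
  def eval(row: Seq[Any]): Any = children.size match {
    case 1 => function.asInstanceOf[Any => Any](children(0).eval(row))
    case 2 => function.asInstanceOf[(Any, Any) => Any](children(0).eval(row), children(1).eval(row))
    // ... a real implementation would enumerate arities up to 22
  }
}

// Example: evaluating len(line) against a one-column row.
val len = Udf((x: String) => x.length, Seq(Col(0)))
println(len.eval(Seq("# Apache Spark"))) // 14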

Spark SQL UDFs are easy to use; there are two steps:

First, registration

Once we import a SQLContext or HiveContext, we gain the ability to register UDFs:

registerFunction(udfName: String, func: FunctionN)

Because of a Scala language limitation (FunctionN only goes up to Function22), a UDF here supports at most 22 parameters.
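For example (a minimal sketch; the UDF names here are made up for illustration, and a SQLContext or HiveContext is assumed to be in scope as in the console session below):

// Register a one-argument UDF.
registerFunction("to_upper", (s: String) => s.toUpperCase)
// A two-argument UDF works the same way; anything from 1 to 22 parameters is allowed.
registerFunction("key_value", (k: Int, v: String) => k.toString + ":" + v)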

Second, use

select udfName(param1, param2, ...) from tableName
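Sticking with the hypothetical to_upper UDF registered above, the call site is just an ordinary SQL expression (assuming a table dual with a string column line, as created in the example below):

sql("select to_upper(line) from dual").collect()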

Third, examples

We create two tables here: the first, dual, reads its records from README.md and has a single field, line: string; the second, src, has two fields, key and value, and holds the Spark SQL test data.
We use sbt/sbt hive/console to enter the test environment:
1. String length len(). First create table dual:
Scala> SQL ("CREATE table dual (line string)"). Collect () 14/09/19 17:41:34 INFO Metastore. hivemetastore:0: Create_table:table (tablename:dual, Dbname:default, Owner:root, createtime:1411119694, lastaccesstime:0, retention:0, Sd:storagedescriptor (Cols:[fieldschema (Name:line, type:string, Comment:null)], Location:null, InputFormat:org.apache.hadoop.mapred.TextInputFormat, Outputformat:o Rg.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, Compressed:false, Numbuckets:-1, Serdeinfo:serdeinfo ( Name:null, SerializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Parameters:{serialization.format =1}), bucketcols:[], sortcols:[], parameters:{}, Skewedinfo:skewedinfo (skewedcolnames:[], skewedcolvalues:[], skewedcolvaluelocationmaps:{}), Storedassubdirectories:false), partitionkeys:[], parameters:{}, ViewOriginalText: NULL, Viewexpandedtext:null, tabletype:managed_table, Privileges:principalprivilegeset (UserPrivileges:null, Groupprivileges:null, Roleprivileges:null)) 14/09/19 17:41: Hivemetastore.audit:ugi=root ip=unknown-ip-addr cmd=create_table:table (tablename:dual, DbName:default, Owner:root, createtime:1411119694, lastaccesstime:0, retention:0, Sd:storagedescriptor (Cols:[FieldSchema (name:line , type:string, Comment:null)], Location:null, InputFormat:org.apache.hadoop.mapred.TextInputFormat, Outputformat:o Rg.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, Compressed:false, Numbuckets:-1, Serdeinfo:serdeinfo ( Name:null, SerializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Parameters:{serialization.format =1}), bucketcols:[], sortcols:[], parameters:{}, Skewedinfo:skewedinfo (skewedcolnames:[], skewedcolvalues:[], skewedcolvaluelocationmaps:{}), Storedassubdirectories:false), partitionkeys:[], parameters:{}, ViewOriginalText: NULL, Viewexpandedtext:null, tabletype:managed_table, Privileges:principalprivilegeset (UserPrivileges:null, Groupprivileges:null, Roleprivileges:null))
Load the README.md data:
SQL ("Load data local inpath ' readme.md ' into table dual"). Collect () scala> SQL ("SELECT * from Dual"). Collect () Res4:arr Ay[org.apache.spark.sql.row] = Array ([# Apache Spark], [], [Spark is a fast and general cluster computing system for Big D Ata. IT provides], [high-level APIs in Scala, Java, and Python, and an optimized engine so], [supports General computation GR Aphs for data analysis. It also supports a], [rich set of higher-level tools including Spark SQL for SQL and structured], [data processing, MLLIB For machine learning, GraphX for graph processing,], [and Spark streaming.], [], [
Write the len function and register it:
Scala> registerfunction ("Len", (x:string) =>x.length)
Test it (each number returned is the length of the corresponding line of README.md; the first line, "# Apache Spark", has 14 characters):
scala> SQL ("Select Len" from dual). Collect () 14/09/19 17:45:07 INFO Spark. Sparkcontext:job Finished:collect at sparkplan.scala:85, took 0.072239295 sres6:array[org.apache.spark.sql.row] = Arra Y ([14], [0], [78], [72], [73], [73], [73], [20], [0], [26], [0], [0], [23], [0], [68], [78], [56], [0], [17], [0], [75], [ 0], [22], [0], [67], [0], [26], [0], [64], [0], [21], [0], [52], [0], [44], [0], [27], [0], [66], [0], [17], [4], [61], [0 ], [43], [0], [19], [0], [+ 74], [74], [0], [29], [0], [32], [0], [75], [63], [67], [74], [72], [22], [0], [54], [0], [16], [0], [84], [17], [0], [19], [0], [31], [0], [77], [76], [77], [77], [0], [67], [27], [0], [25], [45], [0], [42] , [58], [0], [91], [29], [0], [31], [58], [0], [42], [61], [0], [35], [52], [0], [77], [79], [74], [22], [0], [51], [0], [  90], [0], [16], [42], [44], [+ 30], [17], [0], [0], [56], [0], [46], [86], [78], [0], [30], [0], [16], [0], [97], [0], [+], [0], [...] 

2. String concatenation concat_str. For simplicity, the example is based on the src table's key and value fields, of types Int and String:
Scala> SQL ("desc src"). Collect () Res8:array[org.apache.spark.sql.row] = Array ([Key,int,null], [Value,string,null] )
Scala> SQL ("select * from src limit"). Collect () Res7:array[org.apache.spark.sql.row] = Array ([238,val_238], [86, VAL_86], [311,val_311], [27,val_27], [165,val_165], [409,val_409], [255,val_255], [278,val_278], [98,val_98], [484,val _484])

Write and register the concat_str function:
Scala> registerfunction ("Concat_str", (A:int, b:string) =>a.tostring+b)
Test the concat_str function:
Scala> SQL ("Select CONCAT_STR (key,value) from SRC"). Collect ()

--eof--
This is an original article; when reproducing it, please credit the source: http://blog.csdn.net/oopsoom/article/details/39401391
