Spark 1.1 introduces user-defined functions (UDFs), which let users implement whatever custom functions they actually need to process data in Spark SQL.
Because the set of functions Spark SQL currently supports out of the box is limited, some commonly used ones are missing, such as len, concat, etc. But it is very convenient to implement them as UDFs according to business needs.
A Spark SQL UDF is actually just a Scala function, which Catalyst wraps as an expression node; the UDF's result is then computed against the current row by the eval method. See: Spark SQL source analysis of UDF.
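As a rough illustration of that wrapping (a simplified sketch, not Spark's actual code: the real class in Spark 1.1 is catalyst's ScalaUdf, and the SimpleRow/Attr names below are our own stand-ins):

```scala
// Simplified model of how Catalyst wraps a Scala function as an
// expression node. A row is modeled as a Seq[Any]; eval computes the
// node's value against the current row, as the real ScalaUdf does.
case class SimpleRow(values: Seq[Any])

trait Expr { def eval(row: SimpleRow): Any }

// A "column reference" expression: picks the i-th field of the row.
case class Attr(i: Int) extends Expr {
  def eval(row: SimpleRow): Any = row.values(i)
}

// The UDF node: evaluates its child expression, then applies the
// user's Scala function to the result.
case class SimpleScalaUdf(f: Any => Any, child: Expr) extends Expr {
  def eval(row: SimpleRow): Any = f(child.eval(row))
}

// e.g. wrapping a length function over column 0 (the cast works at
// runtime thanks to type erasure, which is also how Spark does it):
val lenUdf = SimpleScalaUdf(
  ((x: String) => x.length).asInstanceOf[Any => Any],
  Attr(0))
```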
Using a Spark SQL UDF is easy and takes two steps:
First, registration
When we import the members of SQLContext or HiveContext, we gain the ability to register UDFs:
registerFunction(udfName: String, func: FunctionN)
Because of a Scala language limitation, a UDF here can take at most 22 parameters.
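That ceiling comes from Scala itself only defining the traits Function1 through Function22, and a UDF body is just an ordinary Scala function value. A plain-Scala illustration (the variable names are our own):

```scala
// A UDF body is an ordinary Scala function value. Scala's standard
// library defines Function1 ... Function22, hence the 22-parameter cap.
val len: Function1[String, Int] = _.length                 // 1-arg UDF body
val concatStr: Function2[Int, String, String] =
  (a, b) => a.toString + b                                 // 2-arg UDF body
```

registerFunction("len", len) and registerFunction("concat_str", concatStr) would then expose these to SQL, as the REPL sessions below show.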
Second, use
select udfName(param1, param2, ...) from tableName
Third, examples. We create two tables here: the first, dual, reads its records from README.md and has a single field, line: string; the second, src, has two fields, key and value, holding Spark SQL's test data.
We use sbt/sbt hive/console to enter the test environment:
1. String length: len(). Create table dual:
scala> sql("create table dual (line string)").collect()
14/09/19 17:41:34 INFO metastore.HiveMetaStore: 0: create_table: Table(tableName:dual, dbName:default, owner:root, createTime:1411119694, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:line, type:string, comment:null)], location:null, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE, privileges:PrincipalPrivilegeSet(userPrivileges:null, groupPrivileges:null, rolePrivileges:null))
Load the README.md data:
scala> sql("load data local inpath 'README.md' into table dual").collect()
scala> sql("select * from dual").collect()
res4: Array[org.apache.spark.sql.Row] = Array([# Apache Spark], [], [Spark is a fast and general cluster computing system for Big Data. It provides], [high-level APIs in Scala, Java, and Python, and an optimized engine that], [supports general computation graphs for data analysis. It also supports a], [rich set of higher-level tools including Spark SQL for SQL and structured], [data processing, MLlib for machine learning, GraphX for graph processing,], [and Spark Streaming.], [], [...
Write the len function and register it:

scala> registerFunction("len", (x: String) => x.length)
Test it:

scala> sql("select len(line) from dual").collect()
14/09/19 17:45:07 INFO spark.SparkContext: Job finished: collect at SparkPlan.scala:85, took 0.072239295 s
res6: Array[org.apache.spark.sql.Row] = Array([14], [0], [78], [72], [73], [73], [73], [20], [0], [26], [0], [0], [23], [0], [68], [78], [56], [0], [17], [0], [75], [0], [22], [0], [67], [0], [26], [0], [64], [0], [21], [0], [52], [0], [44], [0], [27], [0], [66], [0], [17], [4], [61], [0], [43], [0], [19], [0], ...
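The returned numbers are just the lengths of README.md's lines. Since the registered UDF body is a plain Scala function, it can be sanity-checked without Spark at all:

```scala
// The UDF body is an ordinary Scala function. README.md's first line is
// "# Apache Spark" (14 characters), and the blank lines account for the
// [0] entries in res6 above.
val len = (x: String) => x.length
val lengths = Seq("# Apache Spark", "").map(len)  // Seq(14, 0)
```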
2. String concatenation: concat_str. For simplicity, we base the example on the src table's key (int) and value (string) columns:

scala> sql("desc src").collect()
res8: Array[org.apache.spark.sql.Row] = Array([key,int,null], [value,string,null])
scala> sql("select * from src limit 10").collect()
res7: Array[org.apache.spark.sql.Row] = Array([238,val_238], [86,val_86], [311,val_311], [27,val_27], [165,val_165], [409,val_409], [255,val_255], [278,val_278], [98,val_98], [484,val_484])
Write and register the concat_str function:

scala> registerFunction("concat_str", (a: Int, b: String) => a.toString + b)
Test the concat_str function:
scala> sql("select concat_str(key, value) from src").collect()
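Again, the function itself can be checked in plain Scala; applied to src's first row (238, val_238), it yields "238val_238":

```scala
// concat_str as registered above: converts the Int key to a String and
// appends the value.
val concatStr = (a: Int, b: String) => a.toString + b
val result = concatStr(238, "val_238")  // first row of src
```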
--eof--
Original article; when reproducing, please cite the source: http://blog.csdn.net/oopsoom/article/details/39401391