Spark 1.1 introduces user-defined functions (UDFs), which let users implement whatever custom functions they actually need to process data in Spark SQL.
Because the set of functions Spark SQL currently supports out of the box is limited, some commonly used ones are missing, such as len, concat, etc. But it is very convenient to implement them as UDFs according to business needs.
A Spark SQL UDF is actually just a Scala function, which Catalyst wraps as an expression node; the UDF's result is then computed against the current row by the eval method. See: Spark SQL source analysis of UDF.
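As a rough illustration of that wrapping (a simplified sketch, not Spark's actual code: the real class in Spark 1.1 is catalyst's ScalaUdf, and the SimpleRow/Attr names below are our own stand-ins):

```scala
// Simplified model of how Catalyst wraps a Scala function as an
// expression node. A row is modeled as a Seq[Any]; eval computes the
// node's value against the current row, as the real ScalaUdf does.
case class SimpleRow(values: Seq[Any])

trait Expr { def eval(row: SimpleRow): Any }

// A "column reference" expression: picks the i-th field of the row.
case class Attr(i: Int) extends Expr {
  def eval(row: SimpleRow): Any = row.values(i)
}

// The UDF node: evaluates its child expression, then applies the
// user's Scala function to the result.
case class SimpleScalaUdf(f: Any => Any, child: Expr) extends Expr {
  def eval(row: SimpleRow): Any = f(child.eval(row))
}

// e.g. wrapping a length function over column 0 (the cast works at
// runtime thanks to type erasure, which is also how Spark does it):
val lenUdf = SimpleScalaUdf(
  ((x: String) => x.length).asInstanceOf[Any => Any],
  Attr(0))
```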
Using a Spark SQL UDF is easy and takes two steps:
First, registration
When we import the members of SQLContext or HiveContext, we gain the ability to register UDFs:
registerFunction(udfName: String, func: FunctionN)
Because of a Scala language limitation, a UDF here can take at most 22 parameters.
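That ceiling comes from Scala itself only defining the traits Function1 through Function22, and a UDF body is just an ordinary Scala function value. A plain-Scala illustration (the variable names are our own):

```scala
// A UDF body is an ordinary Scala function value. Scala's standard
// library defines Function1 ... Function22, hence the 22-parameter cap.
val len: Function1[String, Int] = _.length                 // 1-arg UDF body
val concatStr: Function2[Int, String, String] =
  (a, b) => a.toString + b                                 // 2-arg UDF body
```

registerFunction("len", len) and registerFunction("concat_str", concatStr) would then expose these to SQL, as the REPL sessions below show.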
Second, use
select udfName(param1, param2, ...) from tableName
Third, examples. We create two tables here: the first, dual, reads its records from README.md and has a single field, line: string; the second, src, has two fields, key and value, holding Spark SQL's test data.
We use sbt/sbt hive/console to enter the test environment:
1. String length: len(). Create table dual:
scala> sql("create table dual (line string)").collect()
14/09/19 17:41:34 INFO metastore.HiveMetaStore: 0: create_table: Table(tableName:dual, dbName:default, owner:root, createTime:1411119694, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:line, type:string, comment:null)], location:null, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE, privileges:PrincipalPrivilegeSet(userPrivileges:null, groupPrivileges:null, rolePrivileges:null))
Load the README.md data:
scala> sql("load data local inpath 'README.md' into table dual").collect()
scala> sql("select * from dual").collect()
res4: Array[org.apache.spark.sql.Row] = Array([# Apache Spark], [], [Spark is a fast and general cluster computing system for Big Data. It provides], [high-level APIs in Scala, Java, and Python, and an optimized engine that], [supports general computation graphs for data analysis. It also supports a], [rich set of higher-level tools including Spark SQL for SQL and structured], [data processing, MLlib for machine learning, GraphX for graph processing,], [and Spark Streaming.], [], [...
Write the len function and register it:

scala> registerFunction("len", (x: String) => x.length)
Test it:

scala> sql("select len(line) from dual").collect()
14/09/19 17:45:07 INFO spark.SparkContext: Job finished: collect at SparkPlan.scala:85, took 0.072239295 s
res6: Array[org.apache.spark.sql.Row] = Array([14], [0], [78], [72], [73], [73], [73], [20], [0], [26], [0], [0], [23], [0], [68], [78], [56], [0], [17], [0], [75], [0], [22], [0], [67], [0], [26], [0], [64], [0], [21], [0], [52], [0], [44], [0], [27], [0], [66], [0], [17], [4], [61], [0], [43], [0], [19], [0], ...
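The returned numbers are just the lengths of README.md's lines. Since the registered UDF body is a plain Scala function, it can be sanity-checked without Spark at all:

```scala
// The UDF body is an ordinary Scala function. README.md's first line is
// "# Apache Spark" (14 characters), and the blank lines account for the
// [0] entries in res6 above.
val len = (x: String) => x.length
val lengths = Seq("# Apache Spark", "").map(len)  // Seq(14, 0)
```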
2. String concatenation: concat_str. For simplicity, we base the example on the src table's key (int) and value (string) columns:

scala> sql("desc src").collect()
res8: Array[org.apache.spark.sql.Row] = Array([key,int,null], [value,string,null])
scala> sql("select * from src limit 10").collect()
res7: Array[org.apache.spark.sql.Row] = Array([238,val_238], [86,val_86], [311,val_311], [27,val_27], [165,val_165], [409,val_409], [255,val_255], [278,val_278], [98,val_98], [484,val_484])
Write and register the concat_str function:

scala> registerFunction("concat_str", (a: Int, b: String) => a.toString + b)
Test the concat_str function:
scala> sql("select concat_str(key, value) from src").collect()
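Again, the function itself can be checked in plain Scala; applied to src's first row (238, val_238), it yields "238val_238":

```scala
// concat_str as registered above: converts the Int key to a String and
// appends the value.
val concatStr = (a: Int, b: String) => a.toString + b
val result = concatStr(238, "val_238")  // first row of src
```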
--eof--
Original article; when reproducing, please cite the source: http://blog.csdn.net/oopsoom/article/details/39401391