Initial knowledge of Spark UDF

Going straight to the code; the details are in the comments.
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by zxh on 2016/6/10.
 */
object udf_test {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    implicit val sc = new SparkContext(conf)
    implicit val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._

    // Note: the value paired with "dddddd" was lost in the original post; 20 is a stand-in.
    val data = sc.parallelize(Seq(("a", 1), ("bb", 5), ("cccc", 10), ("dddddd", 20))).toDF("a", "b")
    data.registerTempTable("data")

    {
      // Case 1: the function body takes primitive types (not Column). Wrap it with udf()
      // for Column-based calls, and register it on sqlContext.udf for expression/SQL calls.
      import org.apache.spark.sql.functions._

      // Function body
      val filter_length_f = (str: String, _length: Int) => { str.length > _length }

      // Register the function body on the current sqlContext. Note: a function registered on
      // sqlContext.udf must not take Column parameters. Once registered, it can be used in:
      // 1. df.selectExpr  2. df.filter with an expression string  3. SQL, after registering
      // the DataFrame as a temp table.
      sqlContext.udf.register("filter_length", filter_length_f)

      // To work with Columns conveniently, wrap the function body; the wrapped version takes Columns.
      val filter_length = udf(filter_length_f)

      data.select($"*", filter_length($"a", lit(2))).show // udf()-wrapped, so every argument must be a Column; note lit(2)
      data.selectExpr("*", "filter_length(a,2) as ax").show // calling the function in an expression string requires selectExpr
      data.filter(filter_length($"a", lit(2))).show // same as select
      data.filter("filter_length(a,2)").show // an expression string can be passed straight to df.filter
      sqlContext.sql("select *, filter_length(a,2) from data").show
      sqlContext.sql("select *, filter_length(a,2) from data where filter_length(a,2)").show
    }

    {
      // Case 2: the function body takes a Column; it cannot be registered on sqlContext.udf.
      // After udf() wrapping, every argument must be a Column; by defining the function
      // ourselves we can mix types, e.g. one Column parameter and one primitive.
      import org.apache.spark.sql.functions._
      import org.apache.spark.sql.Column

      val filter_length_f2 = (str: Column, _length: Int) => { length(str) > _length }

      // sqlContext.udf.register("filter_length", filter_length_f2) // TODO: sorry, this cannot be registered; functions on sqlContext.udf do not support Column parameters
      data.select($"*", filter_length_f2($"a", 2)).show // no udf() wrapper, fully self-defined, so the length can be a plain Int
      // data.selectExpr("*", "filter_length_f2(a,2) as ax").show // TODO: sorry, this does not work; filter_length_f2 is not registered
      data.filter(filter_length_f2($"a", 2)).show // same as select
      // data.filter("filter_length_f2(a,2)").show // TODO: sorry, this does not work either
    }
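    // [Added sketch, not from the original post] A UDF registered on sqlContext.udf can also
    // be called by name on Columns via org.apache.spark.sql.functions.callUDF (Spark 1.5+),
    // without keeping a reference to the udf()-wrapped function:
    {
      import org.apache.spark.sql.functions.{callUDF, lit}
      data.select($"*", callUDF("filter_length", $"a", lit(2))).show // uses the case-1 registration
    }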
    {
      // Finally, a reasonably general pattern: define two function bodies, one taking Columns
      // and one taking primitive types, and register the primitive-typed one on sqlContext.udf.
      import org.apache.spark.sql.functions._
      import org.apache.spark.sql.Column

      // Function body (primitive types)
      val filter_length_f = (str: String, _length: Int) => { str.length > _length }

      // The main function, used below in df.select and df.filter
      val filter_length = (str: Column, _length: Int) => { length(str) > _length }

      // Register the primitive-typed body on the current sqlContext; as before, it can then be
      // used in df.selectExpr, df.filter expression strings, and SQL against the temp table.
      sqlContext.udf.register("filter_length", filter_length_f)

      // Here we do not use udf(); we call our own Column-based function directly.
      // val filter_length = udf(filter_length_f)

      data.select($"*", filter_length($"a", 2)).show // our own Column wrapper, so a plain 2 works (no lit needed)
      data.selectExpr("*", "filter_length(a,2) as ax").show // selectExpr is still needed for expression strings
      data.filter(filter_length($"a", 2)).show // same as select
      data.filter("filter_length(a,2)").show // expression strings go straight to df.filter
      sqlContext.sql("select *, filter_length(a,2) from data").show
      sqlContext.sql("select *, filter_length(a,2) from data where filter_length(a,2)").show
    }
  }
}
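A side note that is not in the original code: when a UDF mixes one Column argument with a fixed primitive argument, a common idiom is to close over the primitive value and wrap only the String => Boolean part, which avoids lit() at every call site. A minimal sketch; the name filter_length_curried is ours, and it assumes the data DataFrame and the sqlContext.implicits._ import from the example above:

import org.apache.spark.sql.functions.udf

// Close over the primitive argument; the returned UDF takes only Columns.
def filter_length_curried(_length: Int) = udf((str: String) => str.length > _length)

data.select($"*", filter_length_curried(2)($"a")).show // no lit(2) needed
data.filter(filter_length_curried(2)($"a")).show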
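The code above targets Spark 1.x (HiveContext, registerTempTable). On Spark 2.x and later, the same pattern runs through SparkSession instead; the sketch below is our rough translation of case 1, not taken from the original post, and enableHiveSupport() is only needed if you actually want Hive support:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, udf}

object udf_test_2x {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("udf_test").enableHiveSupport().getOrCreate()
    import spark.implicits._

    val data = Seq(("a", 1), ("bb", 5), ("cccc", 10)).toDF("a", "b")
    data.createOrReplaceTempView("data") // replaces registerTempTable

    val filter_length_f = (str: String, _length: Int) => str.length > _length
    spark.udf.register("filter_length", filter_length_f) // replaces sqlContext.udf.register
    val filter_length = udf(filter_length_f)

    data.select($"*", filter_length($"a", lit(2))).show()
    spark.sql("select *, filter_length(a,2) from data").show()
  }
}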