A First Look at Spark UDFs


Let's go straight to the code; the explanations are in the comments.

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkContext, SparkConf}

/**
 * Created by zxh on 2016/6/10.
 */
object UDF_test {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    implicit val sc = new SparkContext(conf)
    implicit val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._

    // the fourth value was lost in the original source; 20 is a placeholder
    val data = sc.parallelize(Seq(("a", 1), ("bb", 5), ("cccc", 10), ("dddddd", 20))).toDF("a", "b")
    data.registerTempTable("data")

    {
      // The function body takes primitive (non-Column) types; wrap it with udf,
      // and register the function body to sqlContext.udf.
      import org.apache.spark.sql.functions._

      // function body
      val filter_length_f = (str: String, _length: Int) => { str.length > _length }

      // Register the function body to the current sqlContext. Note that a function
      // registered to sqlContext.udf cannot take Column parameters. After
      // registration it can be used in: 1. df.selectExpr, 2. df.filter with an
      // expression string, 3. SQL, after registering the DataFrame as a temp table.
      sqlContext.udf.register("filter_length", filter_length_f)

      // To work with Columns conveniently, wrap the function body with udf;
      // the wrapped version takes Column parameters.
      val filter_length = udf(filter_length_f)

      data.select($"*", filter_length($"a", lit(2))).show   // udf-wrapped: must pass Columns, hence lit(2)
      data.selectExpr("*", "filter_length(a, 2) as ax").show // calling the function in a select expression requires selectExpr
      data.filter(filter_length($"a", lit(2))).show          // same as select
      data.filter("filter_length(a, 2)").show                // filter accepts an expression string directly
      sqlContext.sql("select *, filter_length(a, 2) from data").show
      sqlContext.sql("select *, filter_length(a, 2) from data where filter_length(a, 2)").show
    }

    {
      // A function body whose parameters are of Column type cannot be registered
      // to sqlContext.udf. With the udf wrapper every argument must be a Column;
      // by defining the function ourselves we can mix types, e.g. one Column
      // parameter and one of another type.
      import org.apache.spark.sql.functions._
      import org.apache.spark.sql.Column

      val filter_length_f2 = (str: Column, _length: Int) => { length(str) > _length }

      // sqlContext.udf.register("filter_length", filter_length_f2) // TODO: sorry, this cannot be registered; functions registered to sqlContext.udf do not support Column parameters

      data.select($"*", filter_length_f2($"a", 2)).show // no udf wrapper, so _length can be passed as a plain Int
      // data.selectExpr("*", "filter_length_f2(a, 2) as ax").show // TODO: sorry, this does not work here
      data.filter(filter_length_f2($"a", 2)).show       // same as select
      // data.filter("filter_length(a, 2)").show        // TODO: sorry, this does not work here
    }

    {
      // Finally, a relatively general approach: define two function bodies, one
      // taking Column types and one taking primitive types, and register the
      // primitive-typed one to sqlContext.udf.
      import org.apache.spark.sql.functions._
      import org.apache.spark.sql.Column

      // function body
      val filter_length_f = (str: String, _length: Int) => { str.length > _length }

      // the main function, used below in df.select and df.filter
      val filter_length = (str: Column, _length: Int) => { length(str) > _length }

      // Register the function body to the current sqlContext (again, the
      // registered function cannot take Column parameters). After registration
      // it can be used in df.selectExpr, df.filter expression strings, and SQL.
      sqlContext.udf.register("filter_length", filter_length_f)

      // Here we do not use the udf wrapper; we use our own Column-based function directly.
      // val filter_length = udf(filter_length_f)

      data.select($"*", filter_length($"a", 2)).show    // no lit(2) needed this time
      data.selectExpr("*", "filter_length(a, 2) as ax").show
      data.filter(filter_length($"a", 2)).show          // same as select
      data.filter("filter_length(a, 2)").show           // filter accepts an expression string directly
      sqlContext.sql("select *, filter_length(a, 2) from data").show
      sqlContext.sql("select *, filter_length(a, 2) from data where filter_length(a, 2)").show
    }
  }
}
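
The listing above targets the Spark 1.x API (HiveContext, registerTempTable). As a minimal sketch of the same pattern on Spark 2.x and later, where SparkSession replaces the two contexts and createOrReplaceTempView replaces registerTempTable (the object name and data values here are illustrative, not from the original article):

import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{length, lit, udf}

// Illustrative port of the example to Spark 2.x+; the object name is made up.
object UDFTestOnSpark2 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("udf_test").getOrCreate()
    import spark.implicits._

    val data = Seq(("a", 1), ("bb", 5), ("cccc", 10), ("dddddd", 20)).toDF("a", "b")
    data.createOrReplaceTempView("data") // replaces registerTempTable

    // primitive-typed body: registered for use in SQL and expression strings
    val filter_length_f = (str: String, n: Int) => str.length > n
    spark.udf.register("filter_length", filter_length_f)

    // Column-typed helper: usable in select/filter without lit()
    val filter_length = (str: Column, n: Int) => length(str) > n

    data.select($"*", filter_length($"a", 2)).show()
    data.select($"*", udf(filter_length_f)($"a", lit(2))).show() // udf-wrapped variant still needs lit()
    data.filter("filter_length(a, 2)").show()
    spark.sql("select *, filter_length(a, 2) from data").show()
  }
}

The udf() wrapper behaves the same on both versions: once a function body is wrapped, every argument must be a Column, so constants still need lit().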
