When doing data analysis with MapReduce or Spark applications, Hive SQL or Spark SQL can save us a great deal of code, and the various built-in UDFs they ship with are handy tools for data processing. When the built-in UDFs do not meet our needs, Hive SQL and Spark SQL also provide a custom UDF interface that lets us extend them as required.

Using a custom UDF in the Hive world is comparatively involved. We need to develop the UDF (or UDAF, UDTF) in Java as required, compile and package the UDF code together with its dependencies into a jar, and then register it in one of two ways:

(1) Temporary functions. Within a session, create a temporary function with statements such as:

    add jar /run/jar/udf_test.jar;
    create temporary function my_add as 'com.hive.udf.Add';

This approach has a drawback: the function has to be created anew in every session that uses it, and it is only valid in the current session.

(2) Permanent functions. This feature requires a relatively recent version of Hive. Its advantage is that the UDF jar is stored in HDFS, so the function is created once and can be used permanently:

    create function func.iptolocationbysina as 'com.sina.dip.hive.function.IPToLocationBySina' using jar 'hdfs://dip.cdh5.dev:8020/user/hdfs/func/location.jar';

Although permanent functions have some advantages over temporary ones, the barrier of Java development largely hinders the use of UDFs in actual data analysis work. After all, most of our data analysts work mainly with Python and SQL, so every UDF requires an engineer's involvement, which hurts both development efficiency and the end result (UDFs may need frequent updates). The advent of PySpark is a good solution to this problem: it makes it very easy to register an ordinary Python function as a UDF.
To illustrate how to use Python UDFs in Spark (Hive) SQL, we first simulate a data table, temp_table. For simplicity it has a single column named col, of type string, whose values are not allowed to be null, and only one row of data.

We demonstrate UDF usage on top of temp_table. First we define an ordinary Python function, func_string; for simplicity it takes no parameters and just returns a simple string. We can then register func_string as a UDF via HiveContext registerFunction, which receives two parameters: the UDF name and the Python function associated with the UDF. Finally we can use this UDF in Spark (Hive) SQL and output the results.

Note that HiveContext registerFunction actually has three parameters: name (the UDF name), f (the Python function associated with the UDF), and returnType (the return value type of the UDF, i.e. of the Python function), which defaults to StringType(). In the example above, because our UDF returns a string, the returnType parameter was omitted when registering the UDF, meaning it took its default value StringType(). If the return value type of the UDF (the Python function) is not a string, returnType must be specified explicitly. We demonstrate this with the types IntegerType, ArrayType, StructType, and MapType.

(1) IntegerType

(2) ArrayType

Note: ArrayType (array) must guarantee consistency of element types. If the declared UDF return type is ArrayType(IntegerType()), the return value of the function func_array must be a list or tuple whose elements are all of type int.
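The registration flow described above can be sketched as follows. The function names and return values are illustrative, and the pyspark import is deferred into a helper so the sketch stands on its own; HiveContext.registerFunction(name, f, returnType) is the Spark 1.x API.

```python
def func_string():
    # A UDF need not take arguments; it can simply return a value.
    return "abc"

def func_int():
    # Returns an int, so registration must declare IntegerType().
    return 42

def func_array():
    # Returns a list whose elements are all int, matching
    # ArrayType(IntegerType()).
    return [1, 2, 3]

def register_udfs(hive_context):
    # Registration against a HiveContext (Spark 1.x). returnType
    # defaults to StringType(), so it is omitted for func_string.
    from pyspark.sql.types import IntegerType, ArrayType
    hive_context.registerFunction("func_string", func_string)
    hive_context.registerFunction("func_int", func_int, IntegerType())
    hive_context.registerFunction("func_array", func_array,
                                  ArrayType(IntegerType()))
```

After registration, the UDFs can be used directly in SQL, e.g. `hive_context.sql("select func_string(), func_int(), func_array() from temp_table")`.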
(3) StructType

Note: StructType requires that the function's return value is a tuple, and that when registering the UDF via HiveContext registerFunction you specify a name and a type for each element of the tuple in turn. In the example above, the first element is named first with type IntegerType; the second element is named second with type FloatType; and the third element is named third with type StringType.

(4) MapType

Note: MapType requires that the function's return value is a dict, that all keys are of the same type, and that all values are of the same type.
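The StructType and MapType cases can be sketched in the same style; the field names (first, second, third) follow the example above, the return values are illustrative, and the pyspark import is again deferred into a helper.

```python
def func_struct():
    # Must return a tuple matching the declared struct fields
    # in order: (int, float, string).
    return (1, 2.0, "three")

def func_map():
    # Must return a dict; all keys share one type (string here)
    # and all values share one type (int here).
    return {"a": 1, "b": 2}

def register_struct_map_udfs(hive_context):
    from pyspark.sql.types import (StructType, StructField, IntegerType,
                                   FloatType, StringType, MapType)
    # Each tuple element gets a name and a type, in order.
    schema = StructType([
        StructField("first", IntegerType()),
        StructField("second", FloatType()),
        StructField("third", StringType()),
    ])
    hive_context.registerFunction("func_struct", func_struct, schema)
    hive_context.registerFunction("func_map", func_map,
                                  MapType(StringType(), IntegerType()))
```

In SQL the struct fields can then be addressed by name, e.g. `select func_struct().first from temp_table`.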
Use of UDFs in Spark (Hive) SQL (Python) [repost]