When doing data analysis with MapReduce or Spark applications, Hive SQL or Spark SQL can save us a great deal of code, and the various built-in UDFs they ship with are handy tools for data processing. When the built-in UDFs do not meet our needs, Hive SQL and Spark SQL also provide a custom UDF interface that lets us extend them as required.

Using a custom UDF in the Hive world is comparatively involved. We need to develop the UDF (or UDAF, UDTF) in Java as required, compile and package the UDF code together with its dependencies into a jar, and then register it in one of two ways:

(1) Temporary functions. Within a session, create a temporary function with statements such as:

    add jar /run/jar/udf_test.jar;
    create temporary function my_add as 'com.hive.udf.Add';

This approach has a drawback: the function has to be created anew in every session that uses it, and it is only valid in the current session.

(2) Permanent functions. This feature requires a relatively recent version of Hive. Its advantage is that the UDF jar is stored in HDFS, so the function is created once and can be used permanently:

    create function func.iptolocationbysina as 'com.sina.dip.hive.function.IPToLocationBySina' using jar 'hdfs://dip.cdh5.dev:8020/user/hdfs/func/location.jar';

Although permanent functions have some advantages over temporary ones, the barrier of Java development largely hinders the use of UDFs in actual data analysis work. After all, most of our data analysts work mainly with Python and SQL, so every UDF requires an engineer's involvement, which hurts both development efficiency and the end result (UDFs may need frequent updates). The advent of PySpark is a good solution to this problem: it makes it very easy to register an ordinary Python function as a UDF.
To illustrate how to use Python UDFs in Spark (Hive) SQL, we first simulate a data table, temp_table. For simplicity it has a single column named col, of type string, whose values are not allowed to be null, and only one row of data.

We demonstrate UDF usage on top of temp_table. First we define an ordinary Python function, func_string; for simplicity it takes no parameters and just returns a simple string. We can then register func_string as a UDF via HiveContext registerFunction, which receives two parameters: the UDF name and the Python function associated with the UDF. Finally we can use this UDF in Spark (Hive) SQL and output the results.

Note that HiveContext registerFunction actually has three parameters: name (the UDF name), f (the Python function associated with the UDF), and returnType (the return value type of the UDF, i.e. of the Python function), which defaults to StringType(). In the example above, because our UDF returns a string, the returnType parameter was omitted when registering the UDF, meaning it took its default value StringType(). If the return value type of the UDF (the Python function) is not a string, returnType must be specified explicitly. We demonstrate this with the types IntegerType, ArrayType, StructType, and MapType.

(1) IntegerType

(2) ArrayType

Note: ArrayType (array) must guarantee consistency of element types. If the declared UDF return type is ArrayType(IntegerType()), the return value of the function func_array must be a list or tuple whose elements are all of type int.
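The registration flow described above can be sketched as follows. The function names and return values are illustrative, and the pyspark import is deferred into a helper so the sketch stands on its own; HiveContext.registerFunction(name, f, returnType) is the Spark 1.x API.

```python
def func_string():
    # A UDF need not take arguments; it can simply return a value.
    return "abc"

def func_int():
    # Returns an int, so registration must declare IntegerType().
    return 42

def func_array():
    # Returns a list whose elements are all int, matching
    # ArrayType(IntegerType()).
    return [1, 2, 3]

def register_udfs(hive_context):
    # Registration against a HiveContext (Spark 1.x). returnType
    # defaults to StringType(), so it is omitted for func_string.
    from pyspark.sql.types import IntegerType, ArrayType
    hive_context.registerFunction("func_string", func_string)
    hive_context.registerFunction("func_int", func_int, IntegerType())
    hive_context.registerFunction("func_array", func_array,
                                  ArrayType(IntegerType()))
```

After registration, the UDFs can be used directly in SQL, e.g. `hive_context.sql("select func_string(), func_int(), func_array() from temp_table")`.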
(3) StructType

Note: StructType requires that the function's return value is a tuple, and that when registering the UDF via HiveContext registerFunction you specify a name and a type for each element of the tuple in turn. In the example above, the first element is named first with type IntegerType; the second element is named second with type FloatType; and the third element is named third with type StringType.

(4) MapType

Note: MapType requires that the function's return value is a dict, that all keys are of the same type, and that all values are of the same type.
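The StructType and MapType cases can be sketched in the same style; the field names (first, second, third) follow the example above, the return values are illustrative, and the pyspark import is again deferred into a helper.

```python
def func_struct():
    # Must return a tuple matching the declared struct fields
    # in order: (int, float, string).
    return (1, 2.0, "three")

def func_map():
    # Must return a dict; all keys share one type (string here)
    # and all values share one type (int here).
    return {"a": 1, "b": 2}

def register_struct_map_udfs(hive_context):
    from pyspark.sql.types import (StructType, StructField, IntegerType,
                                   FloatType, StringType, MapType)
    # Each tuple element gets a name and a type, in order.
    schema = StructType([
        StructField("first", IntegerType()),
        StructField("second", FloatType()),
        StructField("third", StringType()),
    ])
    hive_context.registerFunction("func_struct", func_struct, schema)
    hive_context.registerFunction("func_map", func_map,
                                  MapType(StringType(), IntegerType()))
```

In SQL the struct fields can then be addressed by name, e.g. `select func_struct().first from temp_table`.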
Use of UDFs in Spark (Hive) SQL (Python) [repost]