ODPS_ELE-UDF Python API

Source: Internet
Author: User
Tags: explode, scalar

Custom Functions (UDFs)

UDF stands for User Defined Function. ODPS provides many built-in functions to meet common computing needs, and users can also create custom functions for requirements the built-ins do not cover. UDFs are used in the same way as ordinary SQL built-in functions.

In ODPS, users can extend three types of UDFs:

UDF Category | Description

UDF (User Defined Scalar Function) | A user-defined scalar function. Its input and output have a one-to-one relationship: it reads one row of data and writes one output value.

UDAF (User Defined Aggregation Function) | A user-defined aggregate function. Its input and output have a many-to-one relationship: it aggregates multiple input records into one output value. It can be used together with the GROUP BY clause in SQL. See the aggregate functions documentation for the specific syntax.

UDTF (User Defined Table Valued Function) | A user-defined table function, used when one function call must output multiple rows. It is the only kind of custom function that can return multiple fields; a UDF or UDAF can compute only one return value at a time.
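The three input/output relationships can be illustrated with plain Python functions. This is only a sketch of the row-level semantics; the function names are hypothetical and nothing here uses the ODPS API.

```python
# Plain-Python illustration of the three UDF shapes (all names are hypothetical).

def scalar_plus_one(value):
    """UDF shape: one input row -> one output value."""
    return value + 1


def aggregate_sum(values):
    """UDAF shape: many input rows -> one output value."""
    total = 0
    for v in values:
        total += v
    return total


def table_split(value):
    """UDTF shape: one input row -> zero or more output rows."""
    return [(part,) for part in value.split(',')]


rows = [1, 2, 3]
scalar_out = [scalar_plus_one(r) for r in rows]  # one output per input row
aggregate_out = aggregate_sum(rows)               # one output for all rows
table_out = table_split('a,b,c')                  # several rows from one input
```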

Note

In the broad sense, UDF refers to all three kinds of custom function: scalar functions, aggregate functions, and table functions. In the narrow sense, it refers only to user-defined scalar functions. This document uses the term often; readers should determine the intended meaning from context.

Restricted environment

ODPS Python UDFs run on Python 2.7. User code executes in sandbox mode, a restricted runtime environment in which the following behaviors are prohibited:

    • Reading and writing local files
    • Starting child processes
    • Starting threads
    • Using socket communication
    • Making other system calls

For these reasons, uploaded code must be implemented in pure Python; C extension modules are forbidden.

In addition, not every module in the Python standard library is available: modules that involve the features above are banned. The available standard library modules are:

    1. All modules implemented in pure Python (with no dependency on extension modules) are available.
    2. The following C extension modules are available:
  • array; audioop;
  • binascii; _bisect;
  • cmath; _codecs_cn; _codecs_hk; _codecs_iso2022; _codecs_jp; _codecs_kr;
  • _codecs_tw; _collections; cStringIO;
  • datetime;
  • _functools; future_builtins;
  • _hashlib; _heapq;
  • itertools;
  • _json;
  • _locale; _lsprof;
  • math; _md5; _multibytecodec;
  • operator;
  • _random;
  • _sha256; _sha512; _sha; _struct; strop;
  • time;
  • unicodedata;
  • _weakref;
  • cPickle;
    3. Some modules have limited functionality. For example, the sandbox limits how much user code can write to standard output and standard error: sys.stdout/sys.stderr can write at most 20Kb, and any characters beyond that are ignored.

Third-party libraries

In addition to the standard library, commonly used third-party libraries are installed in the runtime environment as a complement to the standard library. The supported third-party libraries are:

    • NumPy

Warning

Third-party libraries are subject to the same restrictions: local and network I/O and the other prohibited behaviors remain unavailable, so APIs in third-party libraries that rely on those features are also banned.

Parameters and return value types

@odps.udf.annotate(signature)

Python UDFs currently support the ODPS SQL data types bigint, string, double, boolean, and datetime. The parameter types and return value types of every function must be determined before the SQL statement executes. Because Python is a dynamically typed language, the function signature must be specified by adding a decorator to the UDF class.

The function signature is specified by a string with the following syntax:

    arg_type_list '->' type_list
    arg_type_list: type_list | '*' | ''
    type_list: [type_list ','] type
    type: 'bigint' | 'string' | 'double' | 'boolean' | 'datetime'

    1. The left side of the arrow gives the parameter types and the right side gives the return value types.
    2. Only a UDTF can return multiple columns; a UDF or UDAF can return only one column.
    3. '*' denotes variable-length parameters; with it, a UDF, UDTF, or UDAF can match any input parameters.

The following are examples of legal signatures:

    'bigint,double->string'            # parameters are bigint and double; return value is string
    'bigint,boolean->string,datetime'  # UDTF: parameters are bigint and boolean; return values are string and datetime
    '*->string'                        # variable-length parameters (any inputs); return value is string
    '->double'                         # no parameters; return value is double
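To make the signature grammar concrete, here is a small parser sketch. The helper name parse_signature and its strict, case-sensitive matching are assumptions made for illustration; it is not part of odps.udf and does not reflect how ODPS itself validates signatures.

```python
# Hypothetical helper that parses a signature string following the grammar above.
VALID_TYPES = ('bigint', 'string', 'double', 'boolean', 'datetime')


def parse_signature(signature):
    """Return (arg_types, return_types); arg_types is '*' for variable-length."""
    if '->' not in signature:
        raise ValueError("signature must contain '->'")
    left, right = [side.strip() for side in signature.split('->', 1)]

    def parse_type_list(text):
        types = [t.strip() for t in text.split(',')]
        for t in types:
            if t not in VALID_TYPES:
                raise ValueError('unsupported type: %r' % t)
        return types

    if left == '*':
        arg_types = '*'          # matches any input parameters
    elif left == '':
        arg_types = []           # function takes no parameters
    else:
        arg_types = parse_type_list(left)
    return arg_types, parse_type_list(right)
```

For example, parse_signature('->double') yields an empty parameter list and a single double return type, while a signature with nothing to the right of the arrow is rejected.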

During semantic analysis, the query is checked for function calls that do not conform to the function signature, and an error is thrown to prevent execution. At execution time, the UDF's parameters are passed to user code with the types specified by the signature. The value returned by user code must also match the type specified by the signature; otherwise the type mismatch is reported as an error. ODPS SQL data types correspond to Python types as follows:

ODPS SQL type | bigint | string | double | boolean | datetime
Python type | int | str | float | bool | int

Note

  • The datetime type is passed to user code as an int whose value is the number of milliseconds since the epoch (UTC). The datetime module in the Python standard library can be used to process date-time values.
  • A NULL value corresponds to None in Python.
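Because datetime values arrive as epoch milliseconds, a UDF usually converts them before doing calendar arithmetic. The sketch below uses only the standard datetime module; the helper names are hypothetical, and the results are naive datetimes interpreted as UTC.

```python
import datetime

EPOCH = datetime.datetime(1970, 1, 1)  # naive datetime, interpreted as UTC


def millis_to_datetime(millis):
    """Convert epoch milliseconds to a naive UTC datetime; None stays None."""
    if millis is None:
        return None
    return EPOCH + datetime.timedelta(milliseconds=millis)


def datetime_to_millis(dt):
    """Convert a naive UTC datetime back to epoch milliseconds; None stays None."""
    if dt is None:
        return None
    delta = dt - EPOCH
    whole_seconds = delta.days * 86400 + delta.seconds
    return whole_seconds * 1000 + delta.microseconds // 1000
```

Both helpers pass None through unchanged so that SQL NULL values survive the round trip.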

odps.udf.int(value[, silent=True])

A modified version of the Python built-in function int with an added silent parameter. When silent is True and value cannot be converted to int, no exception is thrown and None is returned instead.
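The behavior described above can be approximated outside ODPS as follows. This is a stand-in written for illustration (silent_int is a hypothetical name), not the actual odps.udf.int implementation, and which exception types the real function suppresses is an assumption.

```python
def silent_int(value, silent=True):
    """Approximation of the described behaviour: return None instead of raising."""
    try:
        return int(value)
    except (TypeError, ValueError):
        if silent:
            return None   # conversion failed, report it as a NULL-like value
        raise             # silent=False keeps the normal int() behaviour
```

Returning None fits the NULL mapping described earlier, so an unparsable input can flow through a query as NULL rather than aborting it.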

UDF

Implementing a Python UDF is simple: define a new-style class and implement the evaluate method. Here is an example:

    from odps.udf import annotate

    @annotate("bigint,bigint->bigint")
    class MyPlus(object):

        def evaluate(self, arg0, arg1):
            # A NULL input yields a NULL result.
            if None in (arg0, arg1):
                return None
            return arg0 + arg1

Note: a Python UDF must specify its function signature via annotate.
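A UDF class of this shape can be exercised locally before it is uploaded. In the sketch below, annotate is a local stand-in that only records the signature on the class (the attribute name is an assumption), so the snippet runs without the odps package.

```python
def annotate(signature):
    """Local stand-in for odps.udf.annotate; it only records the signature."""
    def decorator(cls):
        cls.signature = signature  # attribute name chosen for this sketch
        return cls
    return decorator


@annotate('bigint,bigint->bigint')
class MyPlus(object):

    def evaluate(self, arg0, arg1):
        if None in (arg0, arg1):
            return None
        return arg0 + arg1


# Simulate the framework calling evaluate once per input row.
udf = MyPlus()
rows = [(1, 2), (None, 5), (10, -3)]
results = [udf.evaluate(a, b) for a, b in rows]
```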

UDAF

class odps.udf.BaseUDAF

Inherit from this class to implement a Python UDAF.

BaseUDAF.new_buffer()

Implement this method to return the buffer that holds the aggregate function's intermediate value. The buffer must be a mutable object (such as a list or dict), and its size should not grow with the amount of data. In the extreme case, the buffer should not exceed 2Mb after marshalling.

BaseUDAF.iterate(buffer[, args, ...])

Implement this method to aggregate args into the intermediate value buffer.

BaseUDAF.merge(buffer, pbuffer)

Implement this method to combine two intermediate values by merging pbuffer into buffer.

BaseUDAF.terminate(buffer)

Implement this method to convert the intermediate value buffer to an ODPS SQL basic type.

Here is an example of a UDAF that computes an average.

    #coding:utf-8
    from odps.udf import annotate
    from odps.udf import BaseUDAF

    @annotate('double->double')
    class Average(BaseUDAF):

        def new_buffer(self):
            # buffer[0] holds the running sum, buffer[1] the count
            return [0, 0]

        def iterate(self, buffer, number):
            if number is not None:
                buffer[0] += number
                buffer[1] += 1

        def merge(self, buffer, pbuffer):
            buffer[0] += pbuffer[0]
            buffer[1] += pbuffer[1]

        def terminate(self, buffer):
            if buffer[1] == 0:
                # float, to match the declared double return type
                return 0.0
            return buffer[0] / buffer[1]
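The four methods can be understood by simulating how a framework might call them: one buffer per partition, then a merge of the partial buffers, then terminate. The driver run_udaf and the local BaseUDAF stand-in below are assumptions made for offline experimentation; the actual ODPS scheduling may differ.

```python
class BaseUDAF(object):
    """Local stand-in for odps.udf.BaseUDAF, for offline experiments only."""


class Average(BaseUDAF):

    def new_buffer(self):
        return [0, 0]  # [running sum, count]

    def iterate(self, buffer, number):
        if number is not None:
            buffer[0] += number
            buffer[1] += 1

    def merge(self, buffer, pbuffer):
        buffer[0] += pbuffer[0]
        buffer[1] += pbuffer[1]

    def terminate(self, buffer):
        if buffer[1] == 0:
            return 0.0
        return buffer[0] / buffer[1]


def run_udaf(udaf, partitions):
    """Hypothetical driver: aggregate each partition, then merge the partials."""
    partials = []
    for rows in partitions:
        buf = udaf.new_buffer()
        for value in rows:
            udaf.iterate(buf, value)
        partials.append(buf)
    final = udaf.new_buffer()
    for pbuffer in partials:
        udaf.merge(final, pbuffer)
    return udaf.terminate(final)


average = run_udaf(Average(), [[1.0, 2.0, None], [3.0], []])
```

Keeping only a sum and a count in the buffer is what lets its size stay constant regardless of how many rows are aggregated.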

UDTF

class odps.udf.BaseUDTF

The base class for Python UDTFs. Inherit from this class and implement methods such as process and close.

BaseUDTF.__init__()

The initialization method. A subclass that implements this method must call the base class initializer, super(BaseUDTF, self).__init__(), at the beginning.

The __init__ method is called only once in the UDTF's life cycle, before the first record is processed. When a UDTF needs to keep internal state, all of that state can be initialized in this method.

BaseUDTF.process([args, ...])

This method is called by the ODPS SQL framework once for each record in SQL. The parameters of process are the UDTF input parameters specified in the SQL statement.

BaseUDTF.forward([args, ...])

The output method of the UDTF, called by user code. Each call to forward outputs one record. The parameters of forward are the UDTF output parameters specified in the SQL statement.

BaseUDTF.close()

The termination method of the UDTF. It is called by the ODPS SQL framework exactly once, after the last record has been processed.

Here is an example of a UDTF.

    #coding:utf-8
    # explode.py

    from odps.udf import annotate
    from odps.udf import BaseUDTF

    @annotate('string->string')
    class Explode(BaseUDTF):
        """Split a comma-separated string into multiple output records."""

        def process(self, arg):
            props = arg.split(',')
            for p in props:
                self.forward(p)
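The process/forward/close life cycle can be tried out locally with a stand-in base class that simply collects whatever forward receives. Both the BaseUDTF stand-in and the run_udtf driver below are hypothetical and exist only so the sketch runs without the odps package.

```python
class BaseUDTF(object):
    """Local stand-in for odps.udf.BaseUDTF that collects forwarded records."""

    def __init__(self):
        self.output = []

    def forward(self, *args):
        self.output.append(args)  # each call emits one output record

    def close(self):
        pass


class Explode(BaseUDTF):
    """Split a comma-separated string into multiple output records."""

    def process(self, arg):
        for p in arg.split(','):
            self.forward(p)


def run_udtf(udtf, rows):
    """Hypothetical driver: call process once per row, then close once."""
    for row in rows:
        udtf.process(*row)
    udtf.close()
    return udtf.output


records = run_udtf(Explode(), [('a,b',), ('c',)])
```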

Note

A Python UDTF can also omit annotate and leave its parameter types and return value types unspecified. The function then matches any input parameters when used in SQL, but the return value types cannot be inferred, so all output parameters are treated as string. Therefore, every value passed to forward must be converted to str.

Referencing resources

Python UDFs can reference resource files through the odps.distcache module, which currently supports file resources and table resources.

odps.distcache.get_cache_file(resource_name)

New in release-2012.09.03.

Returns the content of the resource with the specified name. resource_name is a str that names a resource already existing in the current project. If the resource name is invalid or no such resource exists, an exception is thrown.

The return value is a file-like object. After using the object, the caller is responsible for calling its close method to release the open resource file.

Below is an example of using get_cache_file:

    from odps.udf import annotate
    from odps.distcache import get_cache_file

    @annotate('bigint->string')
    class DistCacheExample(object):

        def __init__(self):
            cache_file = get_cache_file('test_distcache.txt')
            kv = {}
            for line in cache_file:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                k, v = line.split()
                kv[int(k)] = v
            cache_file.close()
            self.kv = kv

        def evaluate(self, arg):
            return self.kv.get(arg)
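The parsing step can be tested without ODPS by feeding it any file-like object. In this sketch io.StringIO stands in for the object returned by get_cache_file, and load_kv is a hypothetical helper; the try/finally is a design choice so the file is closed even when a line is malformed.

```python
import io


def load_kv(cache_file):
    """Parse 'key value' lines from a file-like object, then close it."""
    kv = {}
    try:
        for line in cache_file:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            k, v = line.split()
            kv[int(k)] = v
    finally:
        cache_file.close()  # the caller is responsible for releasing the file
    return kv


# io.StringIO is only a stand-in for the resource file object.
fake_resource = io.StringIO(u"1 one\n\n2 two\n")
kv = load_kv(fake_resource)
```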

odps.distcache.get_cache_table(resource_name)

New in release-2012.11.14.

Returns the content of the specified resource table. resource_name is a str that names a resource table already existing in the current project. If the resource name is invalid or no such resource exists, an exception is thrown.

The return value is a generator. The caller iterates over it to obtain the table content; each iteration yields one record of the table as a tuple.

Here is an example of using get_cache_table:

    from odps.udf import annotate
    from odps.distcache import get_cache_table

    @annotate('->string')
    class DistCacheTableExample(object):

        def __init__(self):
            # Materialize the generator once so records can be indexed.
            self.records = list(get_cache_table('udf_test'))
            self.counter = 0
            self.ln = len(self.records)

        def evaluate(self):
            if self.counter > self.ln - 1:
                return None
            ret = self.records[self.counter]
            self.counter += 1
            return str(ret)
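Because the return value is a generator, it can be consumed only once, which is why the example wraps it in list(). The sketch below demonstrates this with a hypothetical generator, fake_cache_table, standing in for get_cache_table; the sample rows are made up.

```python
def fake_cache_table():
    """Stand-in for get_cache_table(...): yields one tuple per record."""
    yield (1, 'one')
    yield (2, 'two')


gen = fake_cache_table()
records = list(gen)    # materialize the table content once
leftover = list(gen)   # the generator is exhausted, so nothing remains

# Records are tuples, so columns are accessed by position.
first_column = [record[0] for record in records]
```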

Considerations for Expression Optimization

When a query contains multiple calls to the same UDF with identical parameters, the calls are optimized so that the UDF executes only once. For example:

    import random
    from odps.udf import annotate

    random.seed(12345)

    @annotate('bigint->bigint')
    class MyRand(object):

        def evaluate(self, a):
            return random.randint(0, 10)

This implements a rand-style function that is intended to return a random value on every call.

    > select myrand(c_int_a), myrand(c_int_a) from udf_test;
    +------------+------------+
    | _c0        | _c1        |
    +------------+------------+
    | 4          | 4          |
    | 0          | 0          |
    | 9          | 9          |
    | 3          | 3          |
    +------------+------------+

As you can see, by default the two myrand calls on the same row return the same value, because after optimization the function is executed only once. If you do not want this optimization, disable it by setting the configuration item odps.sql.udf.optimize.reuse:

    > set odps.sql.udf.optimize.reuse=false;
    > select myrand(c_int_a), myrand(c_int_a) from udf_test;
    +------------+------------+
    | _c0        | _c1        |
    +------------+------------+
    | 4          | 0          |
    | 9          | 3          |
    | 4          | 2          |
    | 6          | 1          |
    +------------+------------+
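The effect of the reuse optimization can be imitated in plain Python. The driver run_query below is a hypothetical model of evaluating two identical calls per row, not how ODPS implements the optimization, and the values it produces will not necessarily match the tables above.

```python
import random


class MyRand(object):

    def evaluate(self, a):
        return random.randint(0, 10)


def run_query(udf, rows, reuse):
    """Evaluate udf(x), udf(x) per row; with reuse, both columns share one call."""
    output = []
    for x in rows:
        first = udf.evaluate(x)
        second = first if reuse else udf.evaluate(x)
        output.append((first, second))
    return output


random.seed(12345)
with_reuse = run_query(MyRand(), [1, 2, 3, 4], reuse=True)

random.seed(12345)
without_reuse = run_query(MyRand(), [1, 2, 3, 4], reuse=False)
```

With reuse enabled the two columns are always equal; with it disabled each column comes from a separate call, so a non-deterministic function can return different values within one row.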

Summary

ODPS provides the following for Python:

1. Parameters and return value types

@odps.udf.annotate(signature) specifies the function signature, which maps ODPS SQL data types to Python types.

odps.udf.int(value[, silent=True])

2. UDF

Define a new-style class and implement the evaluate method.


3. UDAF

class odps.udf.BaseUDAF: inherit from this class to implement a Python UDAF.

The BaseUDAF class has four methods:

BaseUDAF.new_buffer()
BaseUDAF.iterate(buffer[, args, ...])
BaseUDAF.merge(buffer, pbuffer)
BaseUDAF.terminate(buffer)


4. UDTF

class odps.udf.BaseUDTF: the base class for Python UDTFs. Inherit from this class and implement methods such as process and close.

The BaseUDTF class has four methods:

BaseUDTF.__init__()
BaseUDTF.process([args, ...])
BaseUDTF.forward([args, ...])
BaseUDTF.close()
