Spark SQL Catalyst Source Analysis (Part 8): UDF

Tags: eval, reflection

/** Spark SQL Source Analysis Series */

In the SQL world, besides the commonly used functions provided out of the box, an extensible interface for user-defined functions (UDFs) is generally offered as well; this has become a de facto standard.

An earlier article in this series on the core flow of Spark SQL already introduced the role of the Catalyst Analyzer, including its ResolveFunctions rule. With the release of Spark 1.1, however, the Spark SQL code has gained many improvements and new features, so it differs somewhat from my previous 1.0-based analysis; one example is the new UDF support:

The Spark 1.0 (and earlier) implementation:

    protected[sql] lazy val catalog: Catalog = new SimpleCatalog

    @transient
    protected[sql] lazy val analyzer: Analyzer =
      new Analyzer(catalog, EmptyFunctionRegistry, caseSensitive = true) // EmptyFunctionRegistry is an empty (no-op) implementation

    @transient
    protected[sql] val optimizer = Optimizer

The Spark 1.1 (and later) implementation:

    protected[sql] lazy val functionRegistry: FunctionRegistry = new SimpleFunctionRegistry // SimpleFunctionRegistry: supports simple UDFs

    @transient
    protected[sql] lazy val analyzer: Analyzer =
      new Analyzer(catalog, functionRegistry, caseSensitive = true)

I. Primer

Functions in a SQL statement are parsed by SqlParser into UnresolvedFunction nodes, which are eventually resolved by the Analyzer.
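For instance, a query calling a UDF named len over a table named people (both hypothetical names used throughout this article) would be parsed roughly as sketched below; the tree shape is illustrative, not the literal parser output:

    import org.apache.spark.sql.catalyst.analysis.{UnresolvedAttribute, UnresolvedFunction}

    // Sketch: the node SqlParser would produce for `SELECT len(name) FROM people`
    val fn = UnresolvedFunction("len", Seq(UnresolvedAttribute("name")))
    fn.resolved // false: nothing is known yet about the function's type or implementation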

SqlParser:

Besides the officially provided functions, custom functions you define are parsed by SqlParser through the same grammar rule:

    ident ~ "(" ~ repsep(expression, ",") <~ ")" ^^ {
      case udfName ~ _ ~ exprs => UnresolvedFunction(udfName, exprs)
    }

SqlParser encapsulates the incoming udfName and exprs in UnresolvedFunction, a case class that extends Expression.

But this expression's dataType and other such properties, along with its eval method, are not accessible; forcing access throws an exception, because the node is not yet resolved. It is merely a carrier for the name and arguments.

    case class UnresolvedFunction(name: String, children: Seq[Expression]) extends Expression {
      override def dataType = throw new UnresolvedException(this, "dataType")
      override def foldable = throw new UnresolvedException(this, "foldable")
      override def nullable = throw new UnresolvedException(this, "nullable")
      override lazy val resolved = false

      // Unresolved functions are transient at compile time and don't get evaluated during execution.
      override def eval(input: Row = null): EvaluatedType =
        throw new TreeNodeException(this, s"No function to evaluate expression. type: ${this.nodeName}")

      override def toString = s"'$name(${children.mkString(",")})"
    }
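Continuing the earlier sketch, touching an unresolved property throws, exactly as the overrides above dictate:

    fn.resolved // false
    fn.dataType // throws UnresolvedException: dataType is unavailable until the Analyzer resolves the node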

Analyzer:

When the Analyzer is initialized, it is given a Catalog, which maintains the metadata for databases and tables, and a FunctionRegistry, which maintains the mapping from UDF names to UDF implementations; here the registry used is SimpleFunctionRegistry.

    /**
     * Replaces [[UnresolvedFunction]]s with concrete [[catalyst.expressions.Expression Expressions]].
     */
    object ResolveFunctions extends Rule[LogicalPlan] {
      def apply(plan: LogicalPlan): LogicalPlan = plan transform {
        case q: LogicalPlan =>
          q transformExpressions { // transformExpressions over the current LogicalPlan
            case u @ UnresolvedFunction(name, children) if u.childrenResolved => // when we reach an UnresolvedFunction
              registry.lookupFunction(name, children) // look up the UDF in the registry's metadata table
          }
      }
    }

II. UDF Registration

2.1 UDFRegistration

    registerFunction("len", (x: String) => x.length)

registerFunction is a method of UDFRegistration, and SQLContext now mixes in the UDFRegistration trait, so after an import sqlContext._ you can register and use UDFs directly.
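A minimal end-to-end sketch, assuming a running SparkContext named sc and a table named people already registered (both hypothetical):

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext._ // brings registerFunction (from UDFRegistration) and sql into scope

    registerFunction("len", (x: String) => x.length) // register the Scala function under the name "len"
    sql("SELECT len(name) FROM people").collect()    // the UDF is parsed, resolved, and evaluated like any expression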

The core method of UDFRegistration is registerFunction:

The signature of registerFunction (the one-argument variant): def registerFunction[T: TypeTag](name: String, func: Function1[_, T]): Unit

It accepts a UDF name and a FunctionN, which can be anything from Function1 to Function22. That is, UDFs support only 1 to 22 parameters. (Scala's pain.)
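There is one overload per arity, so continuing the sketch above, registering functions of different arities looks like this (concat2 is a made-up name for illustration):

    registerFunction("len", (s: String) => s.length)             // Function1[String, Int]
    registerFunction("concat2", (a: String, b: String) => a + b) // Function2[String, String, String]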

Internally, a builder constructs the expression through ScalaUdf, which extends Expression (put simply, the current simple UDF is just another Catalyst expression). The Scala function passed in serves as the UDF's implementation, and reflection is used to check whether its types are allowed by Catalyst; see ScalaReflection.

    def registerFunction[T: TypeTag](name: String, func: Function1[_, T]): Unit = {
      def builder(e: Seq[Expression]) = ScalaUdf(func, ScalaReflection.schemaFor(typeTag[T]).dataType, e) // construct the Expression
      functionRegistry.registerFunction(name, builder) // register with the SQLContext's functionRegistry, which maintains a HashMap of UDF mappings
    }
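Mirroring the schemaFor call above, the Catalyst DataType is inferred from the function's return type via reflection; a sketch of what that lookup yields, assuming the API as shown in the snippet:

    import scala.reflect.runtime.universe.typeTag
    import org.apache.spark.sql.catalyst.ScalaReflection

    ScalaReflection.schemaFor(typeTag[Int]).dataType    // IntegerType, e.g. for len's Int result
    ScalaReflection.schemaFor(typeTag[String]).dataType // StringType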

2.2 Register Function:

Note: here FunctionBuilder is defined as type FunctionBuilder = Seq[Expression] => Expression

    class SimpleFunctionRegistry extends FunctionRegistry {
      val functionBuilders = new mutable.HashMap[String, FunctionBuilder]() // maintains the UDF mapping [udfName -> FunctionBuilder]

      def registerFunction(name: String, builder: FunctionBuilder) = { // put the builder into the map
        functionBuilders.put(name, builder)
      }

      override def lookupFunction(name: String, children: Seq[Expression]): Expression = {
        functionBuilders(name)(children) // look up the UDF and build its Expression
      }
    }

At this point, we have registered a Scala function as a Catalyst Expression; this is Spark's simple UDF.
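To see the two registry methods working together, here is an illustrative sketch that bypasses SQLContext and drives SimpleFunctionRegistry directly (the names and the IntegerType choice are assumptions for illustration):

    import org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry
    import org.apache.spark.sql.catalyst.expressions.{Expression, Literal, ScalaUdf}
    import org.apache.spark.sql.catalyst.types.IntegerType

    val registry = new SimpleFunctionRegistry
    val len = (s: String) => s.length

    // registerFunction: store a builder that wraps the Scala function in a ScalaUdf expression
    registry.registerFunction("len", (e: Seq[Expression]) => ScalaUdf(len, IntegerType, e))

    // lookupFunction: apply the builder to the children, yielding a concrete, resolved expression
    val expr = registry.lookupFunction("len", Seq(Literal("spark"))) // a ScalaUdf node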

III. UDF Calculation

Since the UDF has already been encapsulated as an Expression node in the Catalyst tree, computing the UDF simply means invoking the eval method of ScalaUdf.

The arguments the function needs are computed by evaluating the child expressions against the input Row; the function itself is then invoked through a cast (asInstanceOf) to the matching FunctionN type, which accomplishes the UDF computation.

ScalaUdf extends Expression:

ScalaUdf accepts a function, a DataType, and a sequence of child expressions.

It is simple enough that the comments suffice:

    case class ScalaUdf(function: AnyRef, dataType: DataType, children: Seq[Expression])
      extends Expression {

      type EvaluatedType = Any

      def nullable = true

      override def toString = s"scalaUDF(${children.mkString(",")})"

      override def eval(input: Row): Any = {
        val result = children.size match {
          case 0 => function.asInstanceOf[() => Any]()
          case 1 => function.asInstanceOf[(Any) => Any](children(0).eval(input)) // cast and call the function
          case 2 =>
            function.asInstanceOf[(Any, Any) => Any](
              children(0).eval(input), // each argument is computed by evaluating a child expression
              children(1).eval(input))
          case 3 =>
            function.asInstanceOf[(Any, Any, Any) => Any](
              children(0).eval(input),
              children(1).eval(input),
              children(2).eval(input))
          case 4 =>
            // ...... (cases 4 through 21 follow the same pattern)
          case 22 => // Scala functions support at most 22 parameters, enumerated case by case up to here
            function.asInstanceOf[(Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any,
                Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any) => Any](
              children(0).eval(input),
              children(1).eval(input),
              children(2).eval(input),
              children(3).eval(input),
              children(4).eval(input),
              children(5).eval(input),
              children(6).eval(input),
              children(7).eval(input),
              children(8).eval(input),
              children(9).eval(input),
              children(10).eval(input),
              children(11).eval(input),
              children(12).eval(input),
              children(13).eval(input),
              children(14).eval(input),
              children(15).eval(input),
              children(16).eval(input),
              children(17).eval(input),
              children(18).eval(input),
              children(19).eval(input),
              children(20).eval(input),
              children(21).eval(input))
        }
        result
      }
    }
IV. Summary

Spark's current UDF is in fact just a Scala function. The Scala function is encapsulated as a Catalyst Expression, and SQL evaluation simply calls its eval method on the current input Row, the same as for any other expression.

Writing a Spark UDF is therefore very simple: give the UDF a name and pass in a Scala function. Building on Scala's functional programming facilities, Scala UDFs are easy to write, and they are more approachable than Hive UDFs.

--eof--

Original article; when reproducing, please attribute:

Reprinted from: OopsOutOfMemory (Shengli's blog)

This article's link: http://blog.csdn.net/oopsoom/article/details/39395641

Note: This article is licensed under the Attribution-NonCommercial-NoDerivs 2.5 China (CC BY-NC-ND 2.5 CN) license. Reprinting, forwarding, and commenting are welcome, but please retain the author's attribution and the link to the article. Please contact me to negotiate commercial use or other licensing.
