/** Spark SQL Source Analysis Series Article */
In the world of SQL, besides the built-in functions an engine provides out of the box, an extensible interface for user-defined functions (UDFs) is generally offered as well; this has become a de facto standard.
In the earlier article on the core flow of the Spark SQL source code, the role of the Spark SQL Catalyst Analyzer was introduced, including its ResolveFunctions rule. With the release of Spark 1.1, however, the Spark SQL code has gained many improvements and new features, so my previous 1.0-based analysis is somewhat out of date. UDF support is one example:
Spark 1.0 and earlier implementation:
@transient
protected[sql] lazy val catalog: Catalog = new SimpleCatalog

@transient
protected[sql] lazy val analyzer: Analyzer =
  new Analyzer(catalog, EmptyFunctionRegistry, caseSensitive = true) // EmptyFunctionRegistry is an empty implementation

@transient
protected[sql] val optimizer = Optimizer
Spark 1.1 and later implementation:
@transient
protected[sql] lazy val functionRegistry: FunctionRegistry = new SimpleFunctionRegistry // SimpleFunctionRegistry supports simple UDFs

@transient
protected[sql] lazy val analyzer: Analyzer =
  new Analyzer(catalog, functionRegistry, caseSensitive = true)
I. Primer
Functions appearing in a SQL statement are parsed by SqlParser into UnresolvedFunction nodes, and an UnresolvedFunction is eventually resolved by the Analyzer.
SqlParser:
Besides the officially defined functions, user-defined functions are also parsed here by SqlParser, through the same production:
ident ~ "(" ~ repsep(expression, ",") <~ ")" ^^ {
  case udfName ~ _ ~ exprs => UnresolvedFunction(udfName, exprs)
}
SqlParser wraps the parsed udfName and exprs into an UnresolvedFunction, a case class that extends Expression.
However, the dataType and related properties of this Expression, as well as its eval method, cannot be used yet; accessing them throws an exception, because the node has not been resolved and is only a placeholder.
case class UnresolvedFunction(name: String, children: Seq[Expression]) extends Expression {
  override def dataType = throw new UnresolvedException(this, "dataType")
  override def foldable = throw new UnresolvedException(this, "foldable")
  override def nullable = throw new UnresolvedException(this, "nullable")
  override lazy val resolved = false

  // Unresolved functions are transient at compile time and don't get evaluated during execution.
  override def eval(input: Row = null): EvaluatedType =
    throw new TreeNodeException(this, s"No function to evaluate expression. type: ${this.nodeName}")

  override def toString = s"'$name(${children.mkString(",")})"
}
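To see this placeholder behavior concretely, the node can be built by hand and probed in a REPL. This is only a hedged sketch: the function name "len" and the literal are invented, these catalyst classes are internal rather than a supported API, and the paths are the Spark 1.1-era ones.

import org.apache.spark.sql.catalyst.analysis.UnresolvedFunction
import org.apache.spark.sql.catalyst.expressions.Literal

// Build the node by hand, roughly what SqlParser would produce for len('spark').
val u = UnresolvedFunction("len", Seq(Literal("spark")))
println(u.resolved)   // false: no one has looked the function up yet
// u.dataType         // would throw UnresolvedException
// u.eval(null)       // would throw TreeNodeException("No function to evaluate expression...")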
Analyzer:
When the Analyzer is constructed, it needs a Catalog, which maintains the metadata for databases and tables, and a FunctionRegistry, which maintains the mapping from UDF names to UDF implementations; the FunctionRegistry used here is SimpleFunctionRegistry.
/**
 * Replaces [[UnresolvedFunction]]s with concrete [[catalyst.expressions.Expression Expressions]].
 */
object ResolveFunctions extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case q: LogicalPlan =>
      q transformExpressions { // transformExpressions on the current LogicalPlan
        case u @ UnresolvedFunction(name, children) if u.childrenResolved => // when an UnresolvedFunction is reached
          registry.lookupFunction(name, children) // look the UDF up in the function registry
      }
  }
}
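Putting the two stages together, here is a rough, hedged sketch against the Spark 1.1-era API, written in spark-shell style; the case class Person, the table name "people", and the UDF name "len" are all made up for illustration:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("udf-demo").setMaster("local[2]"))
val sqlContext = new SQLContext(sc)
import sqlContext._   // brings in createSchemaRDD, sql and the registerFunction methods

case class Person(name: String)
sc.parallelize(Seq(Person("spark"), Person("sql"))).registerTempTable("people")
registerFunction("len", (x: String) => x.length)

val rdd = sql("SELECT len(name) FROM people")
// Before analysis the plan still carries an UnresolvedFunction node;
// after ResolveFunctions runs it has been replaced by a ScalaUdf expression.
println(rdd.queryExecution.logical)
println(rdd.queryExecution.analyzed)
println(rdd.collect().mkString(", "))   // expect rows [5] and [3]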
II. UDF Registration
2.1 UDFRegistration
registerFunction("len", (x: String) => x.length)
registerFunction is a method of UDFRegistration. SQLContext now mixes in the UDFRegistration trait, so as soon as you import a SQLContext you can use the UDF feature.
The core method of UDFRegistration is registerFunction:
The registerFunction method signature: def registerFunction[T: TypeTag](name: String, func: Function1[_, T]): Unit
It accepts a UDF name and a FunctionN, where N ranges from 1 to 22; in other words a UDF supports only 1 to 22 parameters (a pain point inherited from Scala's FunctionN types).
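For example, a two-argument UDF goes through the Function2 overload and a three-argument one through Function3. A hedged sketch, reusing the sqlContext and the "people" table from the earlier example; the names "concat2" and "clip" are invented:

// Assumes `import sqlContext._` from the sketch above.
registerFunction("concat2", (a: String, b: String) => a + b)                          // Function2 overload
registerFunction("clip", (s: String, from: Int, to: Int) => s.substring(from, to))   // Function3 overload

sql("SELECT concat2(name, name), clip(name, 0, 2) FROM people").collect()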
Internally, a builder constructs an Expression via ScalaUdf. ScalaUdf itself extends Expression (put simply, the current simple UDF is just a Catalyst Expression); the Scala function passed in serves as the UDF implementation, and reflection is used to check whether the field types are allowed by Catalyst, see ScalaReflection.
def registerFunction[T: TypeTag](name: String, func: Function1[_, T]): Unit = {
  def builder(e: Seq[Expression]) =
    ScalaUdf(func, ScalaReflection.schemaFor(typeTag[T]).dataType, e) // construct the Expression
  functionRegistry.registerFunction(name, builder) // register with SQLContext's functionRegistry, which maintains a HashMap of UDF mappings
}
2.2 Register Function:
Note: here FunctionBuilder is a type alias, type FunctionBuilder = Seq[Expression] => Expression
class SimpleFunctionRegistry extends FunctionRegistry {
  val functionBuilders = new mutable.HashMap[String, FunctionBuilder]() // maintains the UDF mapping [udfName, FunctionBuilder]

  def registerFunction(name: String, builder: FunctionBuilder) = {
    functionBuilders.put(name, builder) // put the builder into the map
  }

  override def lookupFunction(name: String, children: Seq[Expression]): Expression = {
    functionBuilders(name)(children) // find the UDF and return an Expression
  }
}
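Stripped of the Spark types, the pattern is just a mutable map from a name to a builder closure that gets applied later. A toy, Spark-free sketch of the same idea, where the "expression" is reduced to a plain Int:

import scala.collection.mutable

// Toy analog of SimpleFunctionRegistry: the builder turns the children into a result.
type Builder = Seq[Int] => Int
val builders = new mutable.HashMap[String, Builder]()

def register(name: String, b: Builder): Unit = builders.put(name, b)
def lookup(name: String, children: Seq[Int]): Int = builders(name)(children)

register("sum", xs => xs.sum)
println(lookup("sum", Seq(1, 2, 3)))   // 6, analogous to lookupFunction(name, children)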
At this point we have registered a Scala function as a Catalyst Expression; that is Spark's simple UDF.
III. UDF Computation
Since the UDF has already been wrapped as an Expression node in the Catalyst tree, computing it simply means calling the eval method of ScalaUdf.
The parameters the function needs are first computed from the input Row via the child Expressions, and the stored function is then cast back to the matching FunctionN and invoked, which yields the UDF result.
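The cast trick this relies on can be seen in isolation; a minimal, Spark-free sketch:

// The registered Scala function is held as an AnyRef and cast back to a FunctionN at call time.
val f: AnyRef = (x: String) => x.length

// The cast compiles because FunctionN is erased at runtime; the generated bridge
// method then dispatches to the typed apply(String): Int.
val result = f.asInstanceOf[(Any) => Any]("spark")
println(result)   // 5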
ScalaUdf extends Expression:
ScalaUdf takes a function, a DataType, and a sequence of child Expressions.
It is straightforward; the comments say it all:
case class ScalaUdf(function: AnyRef, dataType: DataType, children: Seq[Expression])
  extends Expression {

  type EvaluatedType = Any

  def nullable = true

  override def toString = s"scalaUDF(${children.mkString(",")})"

  override def eval(input: Row): Any = {
    val result = children.size match {
      case 0 => function.asInstanceOf[() => Any]()
      case 1 => function.asInstanceOf[(Any) => Any](children(0).eval(input)) // cast the function and invoke it
      case 2 =>
        function.asInstanceOf[(Any, Any) => Any](
          children(0).eval(input), // each child Expression is evaluated against the input Row
          children(1).eval(input))
      case 3 =>
        function.asInstanceOf[(Any, Any, Any) => Any](
          children(0).eval(input),
          children(1).eval(input),
          children(2).eval(input))
      case 4 => ...
      case 22 => // Scala functions support at most 22 parameters, so the cases are enumerated up to here
        function.asInstanceOf[(Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any,
            Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any) => Any](
          children(0).eval(input),  children(1).eval(input),  children(2).eval(input),
          children(3).eval(input),  children(4).eval(input),  children(5).eval(input),
          children(6).eval(input),  children(7).eval(input),  children(8).eval(input),
          children(9).eval(input),  children(10).eval(input), children(11).eval(input),
          children(12).eval(input), children(13).eval(input), children(14).eval(input),
          children(15).eval(input), children(16).eval(input), children(17).eval(input),
          children(18).eval(input), children(19).eval(input), children(20).eval(input),
          children(21).eval(input))
    }
    result
  }
}
IV. Summary
A Spark UDF today is really just a Scala function. The Scala function is wrapped as a Catalyst Expression, and at SQL evaluation time the usual eval mechanism computes it against the current input Row.
Writing a Spark UDF is very simple: give the UDF a name and pass in a Scala function. Thanks to Scala's functional style, a Scala UDF is both easier to write and easier to understand than a Hive UDF.
--eof--
Original article; when reposting please cite the source: http://blog.csdn.net/oopsoom/article/details/39395641