Spark SQL Catalyst Source Code Analysis: UDF


/** Spark SQL Source Analysis article series */

In the world of SQL, besides the commonly used built-in functions that are officially provided, an extensible interface for user-defined functions (UDFs) is usually offered as well; this has become a de facto standard.

In the previous article on the core flow of Spark SQL source analysis, the role of the Spark SQL Catalyst Analyzer was introduced, including its ResolveFunctions rule. However, with the Spark 1.1 release, the Spark SQL code has received many improvements and new features, so it differs somewhat from my earlier 1.0-based source analysis, for example in UDF support:

Spark 1.0 and earlier implementation:

  protected[sql] lazy val catalog: Catalog = new SimpleCatalog

  @transient
  protected[sql] lazy val analyzer: Analyzer =
    new Analyzer(catalog, EmptyFunctionRegistry, caseSensitive = true)  // EmptyFunctionRegistry: an empty implementation

  @transient
  protected[sql] val optimizer = Optimizer

Spark 1.1 and later implementation:

  protected[sql] lazy val functionRegistry: FunctionRegistry = new SimpleFunctionRegistry  // SimpleFunctionRegistry implementation, supports simple UDFs

  @transient
  protected[sql] lazy val analyzer: Analyzer =
    new Analyzer(catalog, functionRegistry, caseSensitive = true)

I. Primer

Functions in SQL statements are parsed by SqlParser into UnresolvedFunction nodes, which are eventually resolved by the Analyzer.

SqlParser:

In addition to the officially defined functions, user-defined functions are parsed by SqlParser in the same way.

  ident ~ "(" ~ repsep(expression, ",") <~ ")" ^^ {
    case udfName ~ _ ~ exprs => UnresolvedFunction(udfName, exprs)
  }
SqlParser wraps the incoming udfName and exprs into an UnresolvedFunction, a case class that extends Expression.

However, this expression's dataType and related properties, as well as its eval method, cannot be accessed yet; forcing access throws an exception, because the node has not been resolved and is only a carrier for the name and arguments.

case class UnresolvedFunction(name: String, children: Seq[Expression]) extends Expression {
  override def dataType = throw new UnresolvedException(this, "dataType")
  override def foldable = throw new UnresolvedException(this, "foldable")
  override def nullable = throw new UnresolvedException(this, "nullable")
  override lazy val resolved = false

  // Unresolved functions are transient at compile time and don't get evaluated during execution.
  override def eval(input: Row = null): EvaluatedType =
    throw new TreeNodeException(this, s"No function to evaluate expression. type: ${this.nodeName}")

  override def toString = s"'$name(${children.mkString(",")})"
}
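As a hedged illustration (class names taken from the Spark 1.1 Catalyst source; this is only a sketch, not part of the article's original code), the node the parser builds for a fragment such as len(name) looks roughly like this:

  import org.apache.spark.sql.catalyst.analysis.{UnresolvedAttribute, UnresolvedFunction}

  // Roughly what SqlParser produces for the fragment len(name)
  val node = UnresolvedFunction("len", Seq(UnresolvedAttribute("name")))
  node.toString   // prints something like 'len('name)
  // node.dataType or node.eval() would throw, because the node is still unresolved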

Analyzer:

When the Analyzer is initialized, it requires a Catalog, which maintains database and table metadata, and a FunctionRegistry, which maintains the mapping from UDF names to UDF implementations; here SimpleFunctionRegistry is used.

  /**
   * Replaces [[UnresolvedFunction]]s with concrete [[catalyst.expressions.Expression Expressions]].
   */
  object ResolveFunctions extends Rule[LogicalPlan] {
    def apply(plan: LogicalPlan): LogicalPlan = plan transform {
      case q: LogicalPlan =>
        q transformExpressions {  // transformExpressions operates on the expressions of the current LogicalPlan
          case u @ UnresolvedFunction(name, children) if u.childrenResolved =>  // when an UnresolvedFunction with resolved children is reached
            registry.lookupFunction(name, children)  // look up the UDF in the function registry and return the concrete expression
        }
    }
  }

II. UDF Registration

2.1 UDFRegistration


registerFunction("len", (x: String) => x.length)

registerFunction is a method of the UDFRegistration trait, which SQLContext now mixes in, so as long as you have a SQLContext in scope you can use the UDF feature.
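A minimal end-to-end sketch of the registration API, assuming Spark 1.1, an existing SparkContext sc, and a hypothetical table people with a string column name:

  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)                           // sc: an existing SparkContext (assumed)

  sqlContext.registerFunction("len", (x: String) => x.length)   // register the UDF under the name "len"

  // assumes a table "people" with a string column "name" has already been registered
  sqlContext.sql("SELECT len(name) FROM people").collect().foreach(println)

After registration, the Analyzer's ResolveFunctions rule (section I) can replace 'len('name) with a concrete ScalaUdf expression.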

The core method of UDFRegistration is registerFunction:

registerFunction method signature: def registerFunction[T: TypeTag](name: String, func: Function1[_, T]): Unit

It accepts a UDF name and a FunctionN, where N can be 1 to 22. In other words, UDFs support only 1 to 22 parameters (a limitation inherited from Scala's FunctionN traits).
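For example, continuing the sketch above, a two-argument UDF simply goes through the Function2 overload of registerFunction (the table and column names are hypothetical):

  sqlContext.registerFunction("str_concat", (a: String, b: String) => a + b)
  sqlContext.sql("SELECT str_concat(first_name, last_name) FROM people")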

Internally, a builder constructs an expression through ScalaUdf, which extends Expression (you can simply think of the current simple UDF as a Catalyst expression). The Scala function that is passed in becomes the UDF implementation, and reflection checks whether the return type is one that Catalyst allows; see ScalaReflection.

  def registerFunction[T: TypeTag](name: String, func: Function1[_, T]): Unit = {
    def builder(e: Seq[Expression]) =
      ScalaUdf(func, ScalaReflection.schemaFor(typeTag[T]).dataType, e)  // construct the expression
    functionRegistry.registerFunction(name, builder)  // register with SQLContext's functionRegistry, which maintains a HashMap of UDF name -> builder
  }
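As a hedged illustration of the reflection step (API names per the Spark 1.1 source; shown only as a sketch), the UDF's return type T is mapped to a Catalyst DataType roughly like this:

  import org.apache.spark.sql.catalyst.ScalaReflection

  ScalaReflection.schemaFor[Int].dataType     // IntegerType
  ScalaReflection.schemaFor[String].dataType  // StringType
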
2.2 Registering the function

Note: here FunctionBuilder is defined as type FunctionBuilder = Seq[Expression] => Expression

class SimpleFunctionRegistry extends FunctionRegistry {
  val functionBuilders = new mutable.HashMap[String, FunctionBuilder]()  // maintains the UDF mapping [udfName -> FunctionBuilder]

  def registerFunction(name: String, builder: FunctionBuilder) = {
    functionBuilders.put(name, builder)  // put the builder into the map
  }

  override def lookupFunction(name: String, children: Seq[Expression]): Expression = {
    functionBuilders(name)(children)  // look up the UDF and return the built Expression
  }
}
At this point, we have registered a Scala function as a Catalyst Expression; this is Spark's simple UDF.
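The registry is nothing more than a name-to-builder map. Below is a minimal standalone sketch of the same pattern (toy code, not Spark's), using Int in place of Expression:

  import scala.collection.mutable

  object ToyRegistry {
    type Builder = Seq[Int] => Int                        // stand-in for Seq[Expression] => Expression
    private val builders = new mutable.HashMap[String, Builder]()

    def register(name: String, b: Builder): Unit = builders.put(name, b)  // cf. registerFunction
    def lookup(name: String, args: Seq[Int]): Int = builders(name)(args)  // cf. lookupFunction
  }

  ToyRegistry.register("sum", args => args.sum)
  ToyRegistry.lookup("sum", Seq(1, 2, 3))  // 6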

III. UDF Computation

Since the UDF has been encapsulated as an Expression node in the Catalyst tree, computing it simply means calling the eval method of ScalaUdf.

The arguments required by the function are computed from the input Row by evaluating the child expressions, and the function is then cast to the corresponding FunctionN type and invoked, which evaluates the UDF.

ScalaUdf extends Expression:

ScalaUdf accepts a function, a DataType, and a sequence of child expressions.

It is simple enough; the inline comments explain it:

case class ScalaUdf(function: AnyRef, dataType: DataType, children: Seq[Expression])
  extends Expression {

  type EvaluatedType = Any

  def nullable = true
  override def toString = s"scalaUDF(${children.mkString(",")})"

  override def eval(input: Row): Any = {
    val result = children.size match {
      case 0 => function.asInstanceOf[() => Any]()
      case 1 => function.asInstanceOf[(Any) => Any](children(0).eval(input))  // cast the function and invoke it
      case 2 =>
        function.asInstanceOf[(Any, Any) => Any](
          children(0).eval(input),  // each argument is obtained by evaluating a child expression against the row
          children(1).eval(input))
      case 3 =>
        function.asInstanceOf[(Any, Any, Any) => Any](
          children(0).eval(input),
          children(1).eval(input),
          children(2).eval(input))
      case 4 => ...
      // Scala functions support at most 22 parameters, so each arity up to 22 is enumerated here.
      case 22 =>
        function.asInstanceOf[(Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any,
            Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any) => Any](
          children(0).eval(input),
          children(1).eval(input),
          children(2).eval(input),
          children(3).eval(input),
          children(4).eval(input),
          children(5).eval(input),
          children(6).eval(input),
          children(7).eval(input),
          children(8).eval(input),
          children(9).eval(input),
          children(10).eval(input),
          children(11).eval(input),
          children(12).eval(input),
          children(13).eval(input),
          children(14).eval(input),
          children(15).eval(input),
          children(16).eval(input),
          children(17).eval(input),
          children(18).eval(input),
          children(19).eval(input),
          children(20).eval(input),
          children(21).eval(input))
    }
    result
  }
}
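A minimal standalone sketch (not Spark code) of the cast-and-apply trick that eval relies on:

  // The function is held as an AnyRef and cast back to the matching FunctionN before being applied.
  val f: AnyRef = (x: String) => x.length       // the registered Scala function, typed as AnyRef
  val arg: Any = "spark"                        // what children(0).eval(input) would produce
  val result = f.asInstanceOf[Any => Any](arg)  // cast to Function1[Any, Any] and apply; yields 5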

IV. Summary

Spark's current UDF is really just a Scala function. The Scala function is wrapped as a Catalyst Expression, and, like any other SQL expression, it is computed against the current input Row via eval.

Writing a Spark UDF is very simple: just give the UDF a name and pass in a Scala function. Thanks to Scala's functional programming, writing a Scala UDF is easier and more readable than writing a Hive UDF.

--eof--

Original article; when reposting, please credit: http://blog.csdn.net/oopsoom/article/details/39395641
