Spark SQL Catalyst Source Code Analysis, Part 8: UDF


/** Spark SQL Source Code Analysis Series */

In the SQL world, in addition to the commonly used built-in functions, an extensible interface for user-defined functions (UDFs) is generally provided as well; this has become a de facto standard.

In the earlier article on the core flow of Spark SQL, the role of the Spark SQL Catalyst Analyzer was introduced, including its ResolveFunctions rule. With the Spark 1.1 release, however, the Spark SQL code has seen many improvements and new features, so my earlier 1.0-based analysis differs in places; UDF support is one example:

Spark 1.0 and earlier:

```scala
protected[sql] lazy val catalog: Catalog = new SimpleCatalog
@transient
protected[sql] lazy val analyzer: Analyzer =
  new Analyzer(catalog, EmptyFunctionRegistry, caseSensitive = true) // EmptyFunctionRegistry: a no-op implementation
@transient
protected[sql] val optimizer = Optimizer
```

Spark 1.1 and later:

```scala
protected[sql] lazy val functionRegistry: FunctionRegistry = new SimpleFunctionRegistry // SimpleFunctionRegistry: supports simple UDFs
@transient
protected[sql] lazy val analyzer: Analyzer =
  new Analyzer(catalog, functionRegistry, caseSensitive = true)
```

First, Primer:

Functions in SQL statements are parsed by SqlParser into UnresolvedFunction nodes, which are eventually resolved by the Analyzer.


Besides the officially provided functions, custom functions you define are parsed by SqlParser in the same way.

```scala
ident ~ "(" ~ repsep(expression, ",") <~ ")" ^^ {
  case udfName ~ _ ~ exprs => UnresolvedFunction(udfName, exprs)
}
```

SqlParser wraps the parsed udfName and exprs into the case class UnresolvedFunction, which extends Expression.

At this stage, the expression's dataType and related properties, as well as its eval method, cannot be accessed; forced access throws an exception, because the node is not yet resolved. It is merely a placeholder.

```scala
case class UnresolvedFunction(name: String, children: Seq[Expression]) extends Expression {
  override def dataType = throw new UnresolvedException(this, "dataType")
  override def foldable = throw new UnresolvedException(this, "foldable")
  override def nullable = throw new UnresolvedException(this, "nullable")
  override lazy val resolved = false

  // Unresolved functions are transient at compile time and don't get evaluated during execution.
  override def eval(input: Row = null): EvaluatedType =
    throw new TreeNodeException(this, s"No function to evaluate expression. type: ${this.nodeName}")

  override def toString = s"'$name(${children.mkString(",")})"
}
```


When the Analyzer is initialized, it requires a Catalog to maintain the metadata of databases and tables, and a FunctionRegistry to maintain the mapping between UDF names and UDF implementations; SimpleFunctionRegistry is used here.

```scala
/**
 * Replaces [[UnresolvedFunction]]s with concrete [[catalyst.expressions.Expression Expressions]].
 */
object ResolveFunctions extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case q: LogicalPlan =>
      q transformExpressions { // transformExpressions on the current LogicalPlan
        case u @ UnresolvedFunction(name, children) if u.childrenResolved => // when an UnresolvedFunction is found
          registry.lookupFunction(name, children) // look up the UDF in the registry's metadata table
      }
  }
}
```
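The rule above is essentially a partial function applied over a tree during traversal. The same replace-on-traverse idea can be seen in a minimal standalone sketch; the names here (Expr, Lit, Call, Unresolved, registry) are hypothetical stand-ins, not Catalyst code:

```scala
// Toy model of the ResolveFunctions rule: walk an expression tree and replace
// placeholder nodes with concrete ones built from a registry lookup.
sealed trait Expr {
  // Post-order transform: rewrite children first, then try the rule on this
  // node, mirroring transformExpressions on a LogicalPlan.
  def transform(rule: PartialFunction[Expr, Expr]): Expr = {
    val withNewChildren = this match {
      case Call(f, args)       => Call(f, args.map(_.transform(rule)))
      case Unresolved(n, args) => Unresolved(n, args.map(_.transform(rule)))
      case other               => other
    }
    if (rule.isDefinedAt(withNewChildren)) rule(withNewChildren) else withNewChildren
  }
}
case class Lit(value: Any) extends Expr
case class Unresolved(name: String, children: List[Expr]) extends Expr
case class Call(f: List[Any] => Any, children: List[Expr]) extends Expr {
  def eval: Any = f(children.map {
    case Lit(v)  => v
    case c: Call => c.eval
    case _       => sys.error("unresolved child")
  })
}

// Like FunctionBuilder = Seq[Expression] => Expression: a name maps to a builder.
val registry = Map[String, List[Expr] => Expr](
  "len" -> (args => Call(xs => xs.head.asInstanceOf[String].length, args))
)

// Analogous to registry.lookupFunction inside the transform rule.
val plan = Unresolved("len", List(Lit("spark")))
val resolved = plan.transform {
  case Unresolved(name, cs) if registry.contains(name) => registry(name)(cs)
}
```

After the transform, the placeholder node has been replaced by an evaluable Call node, just as UnresolvedFunction is replaced by a concrete expression.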

Second, UDF registration

2.1 UDFRegistration

```scala
registerFunction("len", (x: String) => x.length)
```

registerFunction is a method of UDFRegistration. SQLContext mixes in the UDFRegistration trait, so with a SQLContext in scope you can register and use UDFs.

The core method of UDFRegistration is registerFunction. Its signature (for the one-argument case):

def registerFunction[T: TypeTag](name: String, func: Function1[_, T]): Unit

It accepts a UDF name and a FunctionN, which can be anything from Function1 to Function22. That is, a UDF supports only 1 to 22 parameters. (One of Scala's pains.)
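The 22-parameter ceiling comes from the Scala standard library itself, which only defines the traits Function0 through Function22. A quick illustration with hypothetical values (not Spark code):

```scala
// Each Scala lambda is sugar for an instance of one of the FunctionN traits
// (Function0 through Function22; there is no Function23).
val len: Function1[String, Int] = (x: String) => x.length
val add: Function2[Int, Int, Int] = (a: Int, b: Int) => a + b

// The desugared, explicit form of the same thing:
val len2 = new Function1[String, Int] {
  def apply(x: String): Int = x.length
}
```

Because a 23-argument function has no FunctionN trait to inhabit, it cannot be passed to registerFunction at all.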

Internally, a builder constructs a ScalaUdf expression from the function; ScalaUdf extends Expression (put simply, this simple UDF is itself a Catalyst expression). The Scala function passed in becomes the UDF's implementation, and reflection checks whether its field types are allowed by Catalyst; see ScalaReflection.

```scala
def registerFunction[T: TypeTag](name: String, func: Function1[_, T]): Unit = {
  def builder(e: Seq[Expression]) = ScalaUdf(func, ScalaReflection.schemaFor(typeTag[T]).dataType, e) // construct the Expression
  functionRegistry.registerFunction(name, builder) // register with SQLContext's functionRegistry (a HashMap maintaining the UDF mappings)
}
```

2.2 Register Function:

Note: FunctionBuilder here is a type alias: type FunctionBuilder = Seq[Expression] => Expression

```scala
class SimpleFunctionRegistry extends FunctionRegistry {
  val functionBuilders = new mutable.HashMap[String, FunctionBuilder]() // maintains the [udfName, builder] mapping

  def registerFunction(name: String, builder: FunctionBuilder) = { // put the builder into the map
    functionBuilders.put(name, builder)
  }

  override def lookupFunction(name: String, children: Seq[Expression]): Expression = {
    functionBuilders(name)(children) // find the UDF and return the built Expression
  }
}
```

At this point, we have registered a Scala function as a Catalyst expression; this is Spark's simple UDF.
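The whole registration flow of sections 2.1 and 2.2 can be reduced to a standalone sketch. Expression here is a toy stand-in for Catalyst's Expression, and MiniRegistry and register are hypothetical names:

```scala
import scala.collection.mutable

// A stripped-down SimpleFunctionRegistry, runnable on its own.
case class Expression(eval: () => Any)
type FunctionBuilder = Seq[Expression] => Expression

class MiniRegistry {
  private val functionBuilders = mutable.HashMap[String, FunctionBuilder]()
  def registerFunction(name: String, builder: FunctionBuilder): Unit =
    functionBuilders.put(name, builder)
  def lookupFunction(name: String, children: Seq[Expression]): Expression =
    functionBuilders(name)(children)
}

// Mirrors UDFRegistration.registerFunction: close over the Scala function
// inside a builder, then hand the builder to the registry.
def register(reg: MiniRegistry, name: String, func: String => Int): Unit = {
  def builder(e: Seq[Expression]) =
    Expression(() => func(e.head.eval().asInstanceOf[String]))
  reg.registerFunction(name, builder)
}

val reg = new MiniRegistry
register(reg, "len", (x: String) => x.length)
val lenExpr = reg.lookupFunction("len", Seq(Expression(() => "spark")))
```

The design point is that the registry never stores the Scala function directly: it stores a builder closure, so that each lookup can stamp out a fresh expression wired to the actual argument expressions of that call site.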

Third, UDF evaluation:

Since the UDF is already encapsulated as an Expression node in the Catalyst tree, evaluating the UDF amounts to calling ScalaUdf's eval method.

The arguments the function needs are obtained by evaluating the child expressions against the input Row; the function itself is then invoked through a type cast (asInstanceOf), which completes the UDF evaluation.

ScalaUdf extends Expression:

ScalaUdf accepts a function, a DataType, and a sequence of child expressions.

It is straightforward; the comments suffice:

```scala
case class ScalaUdf(function: AnyRef, dataType: DataType, children: Seq[Expression])
  extends Expression {

  type EvaluatedType = Any

  def nullable = true

  override def toString = s"scalaUDF(${children.mkString(",")})"

  override def eval(input: Row): Any = {
    val result = children.size match {
      case 0 => function.asInstanceOf[() => Any]()
      case 1 => function.asInstanceOf[(Any) => Any](children(0).eval(input)) // cast the function and invoke it
      case 2 =>
        function.asInstanceOf[(Any, Any) => Any](
          children(0).eval(input), // evaluate each argument expression
          children(1).eval(input))
      case 3 =>
        function.asInstanceOf[(Any, Any, Any) => Any](
          children(0).eval(input),
          children(1).eval(input),
          children(2).eval(input))
      case 4 =>
        ......
      case 22 => // Scala functions support at most 22 parameters, so the cases are enumerated up to here
        function.asInstanceOf[(Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any, Any) => Any](
          children(0).eval(input),
          children(1).eval(input),
          children(2).eval(input),
          children(3).eval(input),
          children(4).eval(input),
          children(5).eval(input),
          children(6).eval(input),
          children(7).eval(input),
          children(8).eval(input),
          children(9).eval(input),
          children(10).eval(input),
          children(11).eval(input),
          children(12).eval(input),
          children(13).eval(input),
          children(14).eval(input),
          children(15).eval(input),
          children(16).eval(input),
          children(17).eval(input),
          children(18).eval(input),
          children(19).eval(input),
          children(20).eval(input),
          children(21).eval(input))
    }
    result
  }
}
```
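The cast-based dispatch above works because of JVM type erasure: at the bytecode level every FunctionN exposes a generic apply taking Object arguments, so casting the stored AnyRef to an (Any, ..., Any) => Any succeeds and the call site simply applies it. A standalone sketch with hypothetical values (not Spark code):

```scala
// Store functions behind AnyRef, just as ScalaUdf stores `function`.
val f1: AnyRef = (x: String) => x.length
val f2: AnyRef = (a: Int, b: Int) => a + b

// Erasure makes these casts succeed; a wrong arity or argument type would
// only fail later, at call time, with a ClassCastException.
val r1 = f1.asInstanceOf[(Any) => Any]("spark")
val r2 = f2.asInstanceOf[(Any, Any) => Any](2, 3)
```

This also explains why the UDF gives up compile-time type checking of its arguments: everything is Any until the function body actually runs.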

Fourth, summary

Spark's current UDF is really just a Scala function. The function is encapsulated as a Catalyst expression, and SQL evaluation then simply calls the expression's eval method on the current input Row, the same as for any other expression.

Writing a Spark UDF is very simple: just give the UDF a name and pass in a Scala function. Leaning on Scala's functional-programming strengths makes writing a Scala UDF easier, and more approachable, than writing a Hive UDF.


Original article; please attribute when reprinting:

From: OopsOutOfMemory (Shengli's blog)

This article link address: http://blog.csdn.net/oopsoom/article/details/39395641

Note: This article is released under the Attribution-NonCommercial-NoDerivatives 2.5 China (CC BY-NC-ND 2.5 CN) license. You are welcome to repost, forward, and comment, but please retain the author's attribution and the link to this article. Please contact me to negotiate commercial use or other licensing.
