How to pass functions to Spark: how to make your Spark application more efficient and robust


Many people hit the Task not serializable exception when they start using Spark, most often because an operator references an object that cannot be serialized. Why must the objects passed into an operator be serializable? The answer starts with Spark itself. Spark is a distributed computing framework, and the RDD (Resilient Distributed Dataset) is its abstraction of a distributed dataset: the data actually lives spread across the nodes of the cluster, and the RDD abstraction simply makes the user feel as if they were working with a local collection. At run time, however, the operations inside an operator are shipped to the compute nodes (executors) for execution, which requires that every object referenced by the operator be serializable.
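As a concrete illustration, here is a minimal sketch of code that triggers this exception; the class name is hypothetical and an existing RDD[String] named myRdd is assumed:

class NotSerializableHelper {            // note: does NOT extend Serializable
  def transform(s: String): String = s.toUpperCase
}

val helper = new NotSerializableHelper
// The closure below captures `helper`; Spark must serialize the closure to
// ship the task to executors, so this fails with
// org.apache.spark.SparkException: Task not serializable.
val upper = myRdd.map(x => helper.transform(x))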

Spark's operators are largely implemented by passing functions from the driver program to run on the cluster, so the key to writing a Spark application is using operators (transformations) to pass functions to Spark. There are two recommended ways to pass functions to Spark (from the official Spark Programming Guide):

The first: anonymous functions. For short pieces of code, you can write an anonymous function directly inside the operator:

        

myRdd.map(x => x + 1)



The second: a static method in a global singleton object. First define the singleton object MyFunctions with the method funcOne, then pass MyFunctions.funcOne to the RDD operator:

        

       

object MyFunctions {
  def funcOne(s: String): String = { ... }
}

myRdd.map(MyFunctions.funcOne)



In real-world development, you may also need to pass a reference to a method of a class instance into an RDD operator, that is, an instance method of the class:

      

class MyClass {
  def funcOne(s: String): String = { ... }
  def doStuff(rdd: RDD[String]): RDD[String] = { rdd.map(funcOne) }
}



In this example, we define a class MyClass whose instance method doStuff calls another instance method of the class, funcOne, inside an RDD operator. When we create a new MyClass instance and call its doStuff method, the entire instance object is sent to the cluster, so the class MyClass must be serializable, i.e., it must extend Serializable.
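A hedged sketch of that fix follows; the body of funcOne here is an illustrative placeholder, not from the original:

import org.apache.spark.rdd.RDD

// Making the class serializable lets Spark ship the whole instance to executors.
class MyClass extends Serializable {
  def funcOne(s: String): String = s.trim   // placeholder implementation
  def doStuff(rdd: RDD[String]): RDD[String] = rdd.map(funcOne)
}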

Similarly, accessing a field of the enclosing object references the whole object, so the entire object must be sent to the cluster:

       

class MyClass {
  val field = "Hello"
  def doStuff(rdd: RDD[String]): RDD[String] = { rdd.map(x => field + x) }
}



To prevent the entire object from being sent to the cluster, you can copy the field of the outer object into a local variable. Especially with large objects, this avoids shipping the whole object and improves efficiency:

      

def doStuff(rdd: RDD[String]): RDD[String] = {
  val field_ = this.field
  rdd.map(x => field_ + x)
}



    

Spark applications ultimately run on a cluster, and many problems are not exposed in a single-machine local environment; results that differ between local and cluster runs are a common symptom. This calls for a functional programming style: write pure functions as far as possible. The benefits of pure functions are that they are stateless and thread-safe, require no thread synchronization, and the application or runtime environment can cache their results, making execution faster.

So what is a pure function?

A pure function is one whose inputs and outputs are all explicit. Explicit means the function exchanges data with the outside world through only one channel: its parameters and return value. All input the function receives from the outside arrives through its parameters, and all information it outputs leaves through its return value. If a function obtains data from, or emits data to, the outside world implicitly, it is not a pure function; it is called an impure function. Implicit means exchanging data through channels other than parameters and return values: for example, reading or modifying a global variable, or using I/O APIs to read a configuration file, write to a file, or print to the screen.
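A small sketch of the difference; both functions are hypothetical examples:

// Pure: all input arrives via parameters, all output leaves via the return value.
def addPure(a: Int, b: Int): Int = a + b

// Impure: it exchanges data with the outside world through hidden channels.
var counter = 0                       // global mutable state
def addImpure(a: Int): Int = {
  counter += 1                        // implicit output: modifies a global
  a + counter                         // implicit input: reads mutable state
}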

When the computation has to interact with objects, try to use stateless objects: for example, a bean whose member variables are all val, creating a new instance wherever the data changes.
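A minimal sketch of such a stateless bean; the UserRecord name and its fields are hypothetical:

// All members are val, so an instance cannot change after construction;
// "modifying" the data produces a new object instead.
case class UserRecord(id: Long, name: String)

val u1 = UserRecord(1L, "alice")
val u2 = u1.copy(name = "bob")        // a new instance; u1 is untouched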

Finally, on the commutative and associative laws: the functions passed to reduce, reduceByKey, and other merging or aggregation operations must satisfy both laws, which we know from mathematics:

a + b = b + a, (a + b) + c = a + (b + c)

That is, a function f(a, b) must satisfy f(a, b) = f(b, a), and f(f(a, b), c) and f(a, f(b, c)) must give the same result.
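The sketch below (assuming an existing RDD[Int] named numbersRdd) shows why this matters: Spark may combine partition results in any order, so a function that violates these laws gives nondeterministic results:

// Addition is commutative and associative: reduce is deterministic.
val sum = numbersRdd.reduce((a, b) => a + b)

// Subtraction is neither, so this result can vary from run to run
// depending on partitioning and scheduling order.
val unstable = numbersRdd.reduce((a, b) => a - b)   // avoid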

Finally, let's talk about broadcast variables and accumulators. Do not define shared global variables in your program: if a piece of data must be shared across multiple nodes, use a broadcast variable; if you need a global aggregate computation, use an accumulator.
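A minimal sketch of both, assuming an existing SparkContext sc, an RDD[String] named myRdd, and the Spark 2.x longAccumulator API:

// Broadcast variable: ship a read-only value to each node once, instead of
// capturing it in every task's closure.
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
val codes = myRdd.map(x => lookup.value.getOrElse(x, 0))

// Accumulator: executors only add to it; the driver reads the final value.
val misses = sc.longAccumulator("misses")
myRdd.foreach(x => if (!lookup.value.contains(x)) misses.add(1))
println(misses.value)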
