Spark "Task not serializable" problem analysis (Spark)

Source: Internet
Author: User
Tags: serialization
Problem description and cause analysis
When writing Spark programs, a task may fail to serialize because externally defined variables and functions are used inside map and other operators. Yet using external variables in Spark operators is often unavoidable, for example filtering in a filter operator based on an externally specified condition, or transforming in map according to some configuration. This article studies and summarizes how to solve this task serialization problem.

The error "org.apache.spark.SparkException: Task not serializable" usually appears because the closures passed to map, filter, and so on use external variables that cannot be serialized (this does not mean external variables cannot be referenced, only that serialization must be handled properly, as detailed later). The most common scenario is referencing a member function or member variable of a class, often the current class, which requires all members of that class (the entire class) to support serialization. Even when the current class declares "extends Serializable", the problem still occurs if some of its fields do not support serialization, because serializing the entire class then fails, producing a task serialization error.
Example analysis: referencing member variables
As mentioned above, when a class member function or variable is referenced inside map, filter, and other operators in a Spark program, all members of the class must support serialization, and if some member variables do not support serialization, the task cannot be serialized. To verify this, we wrote the example program below. The class filters a list of domain names (an RDD) for a particular top-level domain (rootDomain, such as .com, .cn, .org), and that top-level domain is specified when the class is instantiated.
class MyTest1(conf: String) extends Serializable {
  val list = List("a.com", "www.b.com", "a.cn", "a.com.cn", "a.org")
  private val sparkConf = new SparkConf().setAppName("AppName")
  private val sc = new SparkContext(sparkConf)
  val rdd = sc.parallelize(list)

  private val rootDomain = conf

  def getResult(): Array[String] = {
    val result = rdd.filter(item => item.contains(rootDomain))
    result.take(result.count().toInt)
  }
}

Based on the analysis above, the filter closure depends on the current class's member variable rootDomain, so the current class must be serialized, and an error occurs because some of its fields cannot be serialized. The actual behavior matches this analysis: the error below occurs at runtime, and the stack trace shows it is caused by sc (SparkContext).
Exception in thread "main" org.apache.spark.SparkException: Task not serializable at
    org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166) at
    org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) at
    org.apache.spark.SparkContext.clean(SparkContext.scala:1435) ...
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
    - field (class "com.ntci.test.MyTest1", name: "sc", type: "class org.apache.spark.SparkContext")
    - object (class "com.ntci.test.MyTest1", com.ntci.test.MyTest1@63700353)
    - field (class "com.ntci.test.MyTest1$$anonfun$1", name: "$outer", type: "class com.ntci.test.MyTest1")

To verify this conclusion, the member variable sc, which does not need to be serialized, is annotated with the keyword @transient, telling the compiler that this member variable of the current class should not be serialized. The function is executed again, and a similar error appears:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable at
    org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166) ...
Caused by: java.io.NotSerializableException: org.apache.spark.SparkConf
    - field (class "com.ntci.test.MyTest1", name: "sparkConf", type: "class org.apache.spark.SparkConf")
    - object (class "com.ntci.test.MyTest1", com.ntci.test.MyTest1@6107799e)

Although the cause of the error is the same, the field that triggers it this time is sparkConf (SparkConf). After annotating both sc (SparkContext) and sparkConf (SparkConf) with @transient so that neither is serialized, the program executes normally.
class MyTest1(conf: String) extends Serializable {
  val list = List("a.com", "www.b.com", "a.cn", "a.com.cn", "a.org")
  @transient
  private val sparkConf = new SparkConf().setAppName("AppName")
  @transient
  private val sc = new SparkContext(sparkConf)
  val rdd = sc.parallelize(list)

  private val rootDomain = conf

  def getResult(): Array[String] = {
    val result = rdd.filter(item => item.contains(rootDomain))
    result.take(result.count().toInt)
  }
}

From the above example we can conclude: when a class member function or variable is referenced inside map, filter, and other operators in a Spark program, all members of the class must support serialization, and if some member variables do not support serialization, the task cannot be serialized. Conversely, once the member variables that do not support serialization are annotated with @transient, the entire class can be serialized properly, eliminating the task serialization problem.
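The effect of @transient can be demonstrated outside Spark with plain Java serialization, which is what the task serialization ultimately relies on. The following is a minimal sketch (the class names are illustrative, not from the original example): a class declared Serializable still fails to serialize while it holds a non-serializable field, and succeeds once that field is marked @transient.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Stand-in for a non-serializable field type such as SparkContext.
class NotSerializableThing

// "extends Serializable" alone is not enough: the plain field drags
// NotSerializableThing into serialization and makes it fail.
class WithPlainField extends Serializable {
  val thing = new NotSerializableThing
}

// @transient tells the serializer to skip this field,
// so the rest of the object serializes normally.
class WithTransientField extends Serializable {
  @transient val thing = new NotSerializableThing
}

object SerializationDemo {
  // Returns true if obj survives Java serialization, false otherwise.
  def canSerialize(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    }
}
```

Here SerializationDemo.canSerialize(new WithPlainField) is false, while SerializationDemo.canSerialize(new WithTransientField) is true, mirroring the before/after behavior of the MyTest1 example.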
Example analysis: referencing member functions
Referencing a member function has the same effect on serialization as referencing a member variable: it causes all members of the class to have to support serialization. To verify this, we call a member function of the current class inside map: if a domain name does not start with "www.", the "www." prefix is added. (Note: since rootDomain is defined inside the getResult function here, no class member variable is referenced, so the problem discussed in the previous example does not arise; this example therefore isolates the effect of member function references. Not referencing class member variables directly, for instance by defining a local variable inside the function as done here, is itself one way to avoid this kind of problem; specific ways to avoid it are outlined in the next section.) The code below fails with the same error as the example above, because the two member variables sc (SparkContext) and sparkConf (SparkConf) in the current class cannot be serialized, making serialization of the current class fail.
class MyTest1(conf: String) extends Serializable {
  val list = List("a.com", "www.b.com", "a.cn", "a.com.cn", "a.org")
  private val sparkConf = new SparkConf().setAppName("AppName")
  private val sc = new SparkContext(sparkConf)
  val rdd = sc.parallelize(list)

  def getResult(): Array[String] = {
    val rootDomain = conf
    val result = rdd.filter(item => item.contains(rootDomain))
      .map(item => addWww(item))
    result.take(result.count().toInt)
  }

  def addWww(str: String): String = {
    if (str.startsWith("www."))
      str
    else
      "www." + str
  }
}

As before, when the two member variables sc (SparkContext) and sparkConf (SparkConf) are annotated with @transient so that the current class does not serialize them, the program works correctly. In addition, slightly differently from a member variable, a member function that does not depend on any particular member variable can be defined in a Scala object (similar to a static function in Java), which also removes the dependency on the class. As shown in the example below, addWww is moved into an object (UtilTool) and called directly in the map operation, after which the program runs correctly.
def getResult(): Array[String] = {
  val rootDomain = conf
  val result = rdd.filter(item => item.contains(rootDomain))
    .map(item => UtilTool.addWww(item))
  result.take(result.count().toInt)
}

object UtilTool {
  def addWww(str: String): String = {
    if (str.startsWith("www."))
      str
    else
      "www." + str
  }
}
Verification of the requirement for full class serialization
As mentioned, referencing a member function or variable of a class requires the class and all its members to support serialization. Therefore, when a class's member variables or functions are used, the class itself must first be serializable (extends Serializable), and member variables that do not need to be serialized should be annotated so they do not affect serialization. In the two examples above, referencing a member variable or function of the class required the class and all its members to support serialization, and @transient was used to exclude certain member variables from serialization.
To further validate the claim that the entire class needs serialization, we take the code above that works correctly with the @transient annotations and remove the class serialization declaration (removing extends Serializable). The program then reports that the class cannot be serialized, as shown below, which confirms the claim.
Caused by: java.io.NotSerializableException: com.ntci.test.MyTest1
    - field (class "com.ntci.test.MyTest1$$anonfun$1", name: "$outer", type: "class com.ntci.test.MyTest1")

The examples above show that map and other operators can reference external variables and class member variables, but the serialization of the class must be handled properly: the class needs to extend Serializable, and in addition the member variables that cause errors must be dealt with, since they are the main cause of the task serialization problem. For this type of problem, first find the member variable that cannot be serialized from the error message, then annotate it with @transient.
Also, the class containing the map operation does not always have to be serializable (extend Serializable): when no class member variable or function is referenced, the class does not need to be serialized, as the following example shows. The filter operation references no class member variable or function, so the current class need not be serializable, and the program executes normally.
class MyTest1(conf: String) {
  val list = List("a.com", "www.b.com", "a.cn", "a.com.cn", "a.org")
  private val sparkConf = new SparkConf().setAppName("AppName")
  private val sc = new SparkContext(sparkConf)
  val rdd = sc.parallelize(list)

  def getResult(): Array[String] = {
    val rootDomain = conf
    val result = rdd.filter(item => item.contains(rootDomain))
    result.take(result.count().toInt)
  }
}
Solutions and programming recommendations
As discussed above, this problem is caused by referencing a class's member variables or functions without handling the serialization of that class. There are two ways to solve it:
(1) Do not directly reference a member function or member variable of a class (usually the current class) inside the closures of map and other operators.
(a) For cases that depend on a class member variable: if the value is relatively fixed, consider hard-coding it, defining it inside the map or filter operation, or defining it in a Scala object (similar to a static variable in Java). If the value must be specified dynamically when the program is called (as a function argument), do not reference the member variable directly inside map, filter, and so on; instead, define a local variable from the member variable's value inside getResult, as in the example above, so that the operator's closure no longer references the class's member variable.
(b) For cases that depend on a class member function: if the function is functionally independent, define it in a Scala object (similar to a static method in Java), removing the dependency on any particular class.
(2) If a member function or variable of a class must be referenced, serialize the corresponding class: first make the class extend Serializable, then annotate the member variables that cannot be serialized with @transient, telling the compiler that they do not need to be serialized.
In addition, where possible, put the dependent variables into a small class and make that class serializable; this reduces network traffic and improves efficiency.
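That last suggestion can be sketched as follows (the class and field names are illustrative, and Spark itself is left out so the pattern stands alone): copy just the values a closure needs into a small serializable case class, so the closure captures only that small object instead of the whole enclosing class.

```scala
// Small serializable holder for just the values the closure needs.
// Case classes are serializable by default; extends Serializable makes it explicit.
case class FilterConfig(rootDomain: String) extends Serializable

class DomainFilter(conf: String) {
  // Stand-in for heavy non-serializable state such as SparkContext.
  private val heavyState = new Object

  def getResult(domains: Seq[String]): Seq[String] = {
    val cfg = FilterConfig(conf)  // local val: the closure captures cfg, not `this`
    domains.filter(item => item.contains(cfg.rootDomain))
  }
}
```

Because cfg is a local value, the filter closure holds a reference to the small FilterConfig object only; the enclosing DomainFilter, with its heavy state, is never dragged into serialization.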
