Spark Source Reading (2): SparkApplication Running Process

Source: Internet
Author: User
Code version: Spark 2.2.0

This article describes the running process of a Spark application, generally divided into three parts: (1) SparkConf creation, (2) SparkContext creation, (3) task execution.

Suppose we write a WordCount program in Scala to count the words in a file:

```scala
package com.spark.myapp

import org.apache.spark.{SparkContext, SparkConf}

object WordCount {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("WordCount").setMaster("spark://master:7077")
    val sc = new SparkContext(conf)
    sc.textFile("")
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .collect()
      .foreach(println)
    sc.stop()
  }
}
```

After compiling and packaging the jar, we submit the task in a standalone cluster environment by typing a command in the Spark directory of the machine from which the task is submitted:

```shell
spark-submit --class com.spark.myapp.WordCount --master spark://master:7077 /home/xx/myapps/wordcount.jar
```

For how spark-submit gets to the point of running the task code, refer to the previous article "Spark-submit execution".

1. SparkConf creation

SparkConf contains the various configuration parameters of the Spark application. Its class description reads: "Configuration for a Spark application. Used to set various Spark parameters as key-value pairs. Most of the time, you would create a SparkConf object with `new SparkConf()`, which will load values from any `spark.*` Java system properties set in your application as well. In this case, parameters you set directly on the `SparkConf` object take priority over system properties."

The point is that `new SparkConf()` reads the Spark-related parameters, as key-value pairs, from the system properties, and you can then use SparkConf's set functions to override what was read.
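The priority rule above (an explicit `set()` beats a `spark.*` system property) can be sketched in a few lines of plain Scala. `MiniConf` below is an illustrative stand-in for the pattern, not Spark's actual class:

```scala
// Illustrative stand-in for SparkConf's key-value behaviour (not Spark's real class):
// defaults come from "spark.*" system properties, an explicit set() takes priority.
class MiniConf {
  private val settings = scala.collection.mutable.Map[String, String]()
  // load any spark.* system properties, as SparkConf's constructor does
  for ((k, v) <- sys.props if k.startsWith("spark.")) settings(k) = v

  def set(key: String, value: String): MiniConf = { settings(key) = value; this }
  def get(key: String, default: String): String = settings.getOrElse(key, default)
}

sys.props("spark.app.name") = "FromSystemProps"
val conf = new MiniConf().set("spark.app.name", "WordCount") // explicit set wins
println(conf.get("spark.app.name", "unset")) // WordCount
```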
Common parameter-setting functions are as follows:

(1) Set the master URL: `def setMaster(master: String): SparkConf`
(2) Set the application name, displayed in the Spark web UI: `def setAppName(name: String): SparkConf`
(3) Set the jar packages: `def setJars(jars: Seq[String]): SparkConf`
(4) Set an executor environment variable: `def setExecutorEnv(variable: String, value: String): SparkConf`
(5) Set the Spark home installation directory: `def setSparkHome(home: String): SparkConf`

2. SparkContext creation

SparkContext is an important object in the Spark development process: it is the intermediary between the upper-level Spark application and the underlying API. The SparkContext constructor parameter is the SparkConf described above. As its class description warns: "Only one SparkContext may be active per JVM. You must `stop()` the active SparkContext before creating a new one."

Several key properties are SparkEnv, SchedulerBackend, TaskScheduler and DAGScheduler.

(1) Create the SparkEnv:

```scala
// Create the Spark execution environment (cache, map output tracker, etc.)
_env = createSparkEnv(_conf, isLocal, listenerBus)
SparkEnv.set(_env)
```

(2) Create the SchedulerBackend and TaskScheduler:

```scala
// Create and start the scheduler
val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
_schedulerBackend = sched
_taskScheduler = ts
// start TaskScheduler after taskScheduler sets DAGScheduler reference in
// DAGScheduler's constructor
_taskScheduler.start()
```

createTaskScheduler returns the SchedulerBackend and TaskScheduler corresponding to the master parameter passed in, similar to the factory pattern:

```scala
master match {
  case "local" => ...
  case LOCAL_N_REGEX(threads) => ...
  case LOCAL_N_FAILURES_REGEX(threads, maxFailures) => ...
  case SPARK_REGEX(sparkUrl) => ...        // standalone goes into this branch
  case LOCAL_CLUSTER_REGEX(numSlaves, coresPerSlave, memoryPerSlave) => ...
  case masterUrl => ...                    // other clusters (YARN or Mesos) go into this branch
}
```

(3) Create the DAGScheduler:

```scala
_dagScheduler = new DAGScheduler(this)
```
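The branch selection above is ordinary Scala pattern matching on the master string. A minimal standalone sketch of the same factory pattern (the `MiniBackend` names are illustrative, not Spark's types):

```scala
// Factory pattern as in SparkContext.createTaskScheduler: the master URL
// string decides which backend to construct. Types here are illustrative.
sealed trait MiniBackend
case object LocalBackend extends MiniBackend           // local / local[N]
case object StandaloneBackend extends MiniBackend      // spark://host:port
case object ExternalClusterBackend extends MiniBackend // YARN, Mesos, ...

val LocalN = """local\[(\d+)\]""".r
val SparkUrl = """spark://(.+)""".r

def createBackend(master: String): MiniBackend = master match {
  case "local"      => LocalBackend
  case LocalN(_)    => LocalBackend
  case SparkUrl(_)  => StandaloneBackend
  case _            => ExternalClusterBackend
}

println(createBackend("spark://master:7077")) // StandaloneBackend
```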
3. Task execution

The basis of Spark task execution is the RDD (Resilient Distributed Dataset). Each Spark operator applied to an RDD outputs a new RDD, until the result is finally output to the screen, a file, or memory.

```scala
sc.textFile("")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
  .collect()
  .foreach(println)
```

The first step is sc.textFile(""):

```scala
def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}
```

The output of hadoopFile is a HadoopRDD (2 partitions by default); the map operation on it then produces a new MapPartitionsRDD, shown in the code below. For the detailed procedure refer to the previous article "Spark read and write file code analysis". There are only two ways to create an RDD: one is to read data input from a file system or database, and the other is to compute a new RDD from a parent RDD.

```scala
/**
 * Return a new RDD by applying a function to all elements of this RDD.
 */
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}
```

The second step is flatMap(_.split(" ")):

```scala
/**
 * Return a new RDD by first applying a function to all elements of this
 * RDD, and then flattening the results.
 */
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
}
```

It splits on spaces to extract the words, flattens them into one collection, and returns a MapPartitionsRDD.

The third step, map((_, 1)), attaches a count of 1 to each data item and again returns a MapPartitionsRDD. Given the line "hello world", the second step produces (hello, world) and the third step ((hello, 1), (world, 1)).

The fourth step is reduceByKey(_ + _). This function is not in the RDD file, but in PairRDDFunctions.
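The first few steps of the chain can be reproduced on a plain Scala collection, with groupBy-plus-sum standing in for reduceByKey, to see what each stage produces (no cluster needed; this is an illustration, not Spark code):

```scala
// The WordCount chain on a local collection: flatMap -> map -> reduce by key.
val lines = Seq("hello world", "hello spark")
val words = lines.flatMap(_.split(" "))   // Seq(hello, world, hello, spark)
val pairs = words.map((_, 1))             // Seq((hello,1), (world,1), (hello,1), (spark,1))
val counts = pairs
  .groupBy(_._1)                          // local stand-in for reduceByKey's shuffle
  .map { case (w, ps) => (w, ps.map(_._2).sum) }
println(counts("hello")) // 2
```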
Looking at the code, an RDD implicit conversion is at work:

```scala
/**
 * Defines implicit functions that provide extra functionalities on RDDs of specific types.
 * For example, [[RDD.rddToPairRDDFunctions]] converts an RDD into a [[PairRDDFunctions]] for
 * key-value-pair RDDs, enabling extra functionalities such as `PairRDDFunctions.reduceByKey`.
 */
implicit def rddToPairRDDFunctions[K, V](rdd: RDD[(K, V)])
    (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null): PairRDDFunctions[K, V] = {
  new PairRDDFunctions(rdd)
}

def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
  reduceByKey(defaultPartitioner(self), func)
}

def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
  combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}
```

Merging the data requires repartitioning according to the partitioner object; the default Partitioner is HashPartitioner.

The fifth step, collect().foreach(println), is described as one step because at this point the action operators appear (collect and foreach are both action operators); everything before them was a transformation operator. Transformation operations are deferred and wait until an action operator actually triggers the computation, at which point the job is committed to the executors for execution.

```scala
def collect(): Array[T] = withScope {
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  Array.concat(results: _*)
}

def foreach(f: T => Unit): Unit = withScope {
  val cleanF = sc.clean(f)
  sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
}
```

Both use the SparkContext's runJob. Looking at its implementation:

```scala
def runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U): Array[U] = {
  runJob(rdd, func, 0 until rdd.partitions.length)
}
```

which finally calls the following function:

```scala
/**
 * Run a function on a given set of partitions in an RDD and pass the results to the given
 * handler function. This is the main entry point for all actions in Spark.
 *
 * @param rdd target RDD to run tasks on
 * @param func a function to run on each partition of the RDD
 * @param partitions set of partitions to run on; some jobs may not want to compute on all
 *   partitions of the target RDD, e.g. for operations like `first()`
 * @param resultHandler callback to pass each result to
 */
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    resultHandler: (Int, U) => Unit): Unit = {
  if (stopped.get()) {
    throw new IllegalStateException("SparkContext has been shutdown")
  }
  val callSite = getCallSite
  val cleanedFunc = clean(func)
  logInfo("Starting job: " + callSite.shortForm)
  if (conf.getBoolean("spark.logLineage", false)) {
    logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
  }
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
  progressBar.foreach(_.finishAll())
  rdd.doCheckpoint()
}
```
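The deferred nature of transformations described above can be illustrated with a plain Scala Iterator, whose map is also lazy: the function is only applied once something consumes the iterator, just as an RDD transformation only runs once an action triggers runJob (this is an analogy, not Spark's mechanism):

```scala
// Iterator.map is lazy: the body runs only when an "action" consumes it.
var evaluated = 0
val mapped = Iterator(1, 2, 3).map { x => evaluated += 1; x * 2 }
assert(evaluated == 0)       // transformation recorded, nothing computed yet
val result = mapped.toList   // the "action": forces evaluation
assert(evaluated == 3)
println(result) // List(2, 4, 6)
```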