program and dispatches execution. The tasks of each stage are stored in the TaskScheduler; when an ExecutorBackend reports in to the SchedulerBackend, the TaskScheduler hands tasks to that ExecutorBackend for execution. The job ends when all stages are completed.
5. What is an RDD?
RDD, short for Resilient Distributed Dataset, is a read-only, fault-tolerant, parallel, distributed collection of data. An RDD can be cached in memory and iterated over, and the RDD is the core abstraction of Spark
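As a rough illustration of "partitioned, read-only data operated on in parallel", here is a minimal sketch using plain Python lists. Spark is not required, and the partition helper below is our own illustrative code, not Spark's implementation.

```python
# A minimal sketch of the RDD idea using plain Python lists (no Spark
# required). The partition layout and helper name are illustrative
# assumptions, not Spark's actual implementation.
data = list(range(10))

def partition(seq, n):
    """Split a sequence into n roughly equal chunks ("partitions")."""
    size = (len(seq) + n - 1) // n
    return [seq[i:i + size] for i in range(0, len(seq), size)]

partitions = partition(data, 3)                     # "distributed": data split into chunks
squared = [[x * x for x in p] for p in partitions]  # a transformation runs per partition

print(partitions)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(squared)
```

Note that the transformation builds a new nested list rather than mutating the old one, mirroring the read-only nature of an RDD.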
bits in its binary representation, so the terminal operation produces the desired result:
.forEach(mp -> System.out.println(mp.bitLength() + ": " + mp));
For many tasks it is not obvious whether to use a stream or an iteration. Consider, for example, the task of initializing a new deck of cards. Assume Card is an immutable value class that encapsulates a Rank and a Suit, both of which are enum types. This task is representative of any task that requires computing all the pairs of elements that can be chosen from two sets.
flatMap(func)
Similar to the map operation, except that each input element can be mapped to zero or more output elements.
filter(func)
Returns a new DStream containing only those elements of the source DStream for which func returns true.
repartition(numPartitions)
Changes the number of partitions of the DStream to the value of the input parameter numPartitions.
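The flatMap entry above deserves a concrete illustration. The sketch below mimics the semantics with plain Python lists (no Spark or DStream involved); flat_map is a hypothetical helper written just for this example.

```python
# flatMap vs. map, illustrated with plain Python lists (no Spark needed).
# flat_map is an illustrative helper, not a Spark API.
def flat_map(func, seq):
    out = []
    for x in seq:
        out.extend(func(x))  # each input element may yield 0..n outputs
    return out

lines = ["to be", "or not"]
print(list(map(lambda s: s.split(), lines)))  # [['to', 'be'], ['or', 'not']]
print(flat_map(lambda s: s.split(), lines))   # ['to', 'be', 'or', 'not']
```

The difference is exactly the one the table states: map yields one (possibly nested) output per input, while flatMap flattens the per-element results into one sequence.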
flatMap; and if you only need to read the data, you can use the MongoSpark.read(spark) method to get a DataFrameReader directly.
val spark = SparkSession.builder()
  .master("spark://192.168.2.51:7077")
  .config(new SparkConf().setJars(Array(
    "hdfs://192.168.2.51:9000/mongolib/mongo-spark-connector_2.11-2.0.0.jar",
    "hdfs://192.168.2.51:9000/mongolib/bson-3.4.2.jar",
    "hdfs://192.168.2.51:9000/mongolib/mongo-java-driver-
// map converts one record at a time; mapPartitions converts one partition at a time
val mapPartitionResult = file.mapPartitions(x => {
  // one input partition corresponds to one output partition
  var info = new Array[String](3)
  for (line <- x) yield {  // yield has a return value; all records are collected and returned together
    info = line.split("\\t")
    (info(0), info(1))
  }
})
mapPartitionResult.take(10).foreach(println)
// To turn one row into multiple rows, use flatMap
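The map-versus-mapPartitions distinction above can be simulated with nested Python lists, where the outer list stands in for the partition layout; the names and data here are purely illustrative.

```python
# map vs. mapPartitions, simulated with nested Python lists: the outer list
# plays the role of the partition layout. Data and names are illustrative.
partitions = [["a\t1", "b\t2"], ["c\t3"]]

# map: the function sees one record at a time
mapped = [[line.split("\t")[0] for line in p] for p in partitions]

# mapPartitions: the function sees a whole partition at once, so any
# per-partition setup (e.g. opening a database connection) runs once
# per partition instead of once per record
def per_partition(part):
    # per-partition setup work would go here
    return [tuple(line.split("\t")) for line in part]

map_partitioned = [per_partition(p) for p in partitions]
print(mapped)           # [['a', 'b'], ['c']]
print(map_partitioned)  # [[('a', '1'), ('b', '2')], [('c', '3')]]
```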
wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
The statement above uses the three transformations flatMap, map and reduceByKey to count how often each word occurs in the file README.md, returning a new RDD whose items have the form (string, int), i.e. a word and its count. Here flatMap(func) is similar to map, but each input item can be mapped to zero or more output items.
Transformations
The following table lists some of the common transformations supported by Spark. Refer to the RDD API doc (Scala, Java, Python) and the pair RDD functions doc (Scala, Java) for details.
Transformation
Meaning
map(func)
Return a new distributed dataset formed by passing each element of the source through the function func.
filter(func)
Return a new dataset formed by selecting those elements of the source on which func returns true.
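The two table entries above behave like Python's built-in map and filter applied element-wise; a quick sketch (plain Python, no Spark):

```python
# map(func) and filter(func) from the table above, shown with Python's
# built-ins; the RDD versions behave the same way element-wise.
nums = [1, 2, 3, 4, 5]
doubled = list(map(lambda x: x * 2, nums))        # map(func): transform every element
evens = list(filter(lambda x: x % 2 == 0, nums))  # filter(func): keep elements where func is true
print(doubled)  # [2, 4, 6, 8, 10]
print(evens)    # [2, 4]
```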
can be flatMap-processed to generate multiple elements, which build a new RDD. Example: for each element x in the original RDD, generate the elements 1 to x:
scala> val a = sc.parallelize(1 to 4, 2)
scala> val b = a.flatMap(x => 1 to x)
scala> b.collect
res12: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4)
flatMapWith
flatMapWith is very similar to mapWith: both receive two functions, the first taking the partition index as input,
Spark summary: the Spark engine's RDD (Resilient Distributed Dataset) is composed of partitions; a partition is a concrete concept, a contiguous chunk of data on one physical node. The defining properties of an RDD are: (1) it consists of a group of partitions; (2) an operator applied to the RDD is applied to each of its partitions; (3) every RDD records its dependencies on other RDDs; (4) if the RDD holds (k, v) key-value pairs, it can have a partitioner, used by operators such as groupByKey, reduceByKey and countByKey; (5) some RDDs have preferred computing locations
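The five properties listed above can be summarized as an abstract interface. The sketch below is in Python for illustration only; Spark's real RDD class is written in Scala and differs in detail, and all names here are our own.

```python
# A sketch of the five RDD properties as an abstract interface. The class
# and method names are illustrative; Spark's actual RDD class differs.
class SketchRDD:
    def partitions(self):
        """Property 1: the list of partitions this RDD is split into."""
        raise NotImplementedError

    def compute(self, partition):
        """Property 2: how to compute one partition (operators apply per partition)."""
        raise NotImplementedError

    def dependencies(self):
        """Property 3: the parent RDDs this RDD was derived from."""
        return []

    def partitioner(self):
        """Property 4 (optional): how (k, v) records map to partitions."""
        return None

    def preferred_locations(self, partition):
        """Property 5 (optional): nodes where this partition is cheapest to compute."""
        return []
```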
operation more convenient. 4. An easier API: support for Python, Scala and Java. Spark can in fact also express MapReduce, but MapReduce here is not a single algorithm; Spark just provides a map stage and a reduce stage, and supplies many operators within the two phases: for the map stage, map, flatMap, filter and keyBy; for the reduce stage, reduceByKey, sortByKey, mean, groupBy, sort, and so on. The above is some knowledge sharing, just a few personal views; for the specific concepts
Unlike in Python, functions in Java cannot have default parameter values.
The commonly used functional programming methods map, reduce, flatMap, filter and sort matter more in practice than abstract functional-programming concepts.
Struct vs. class
A struct is a value type and a class is a reference type. The Java language has no struct; C/C++/C# do, although in C a struct cannot contain methods.
For Swift development, struct rather than class is recommended.
of the elements at the specified indices.
assertEquals(listOf(2, 4, 5), list.slice(listOf(1, 3, 4)))
Take
Returns a list of the first n elements.
assertEquals(listOf(1, 2), list.take(2))
TakeLast
Returns a list of the last n elements.
assertEquals(listOf(5, 6), list.takeLast(2))
TakeWhile
Returns a list of the leading elements that satisfy the given predicate.
assertEquals(listOf(1, 2), list.takeWhile { it
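For comparison, the same list operations expressed in plain Python; the mapping to the Kotlin names is our own, and the takeWhile predicate (x < 3) is an assumed example, since the original snippet is truncated.

```python
# Python equivalents of the Kotlin list operations above. itertools.takewhile
# is standard library; the mapping to Kotlin names is ours for illustration,
# and the takeWhile predicate is an assumption (the original is truncated).
from itertools import takewhile

lst = [1, 2, 3, 4, 5, 6]
print(lst[:2])                                # take(2)        -> [1, 2]
print(lst[-2:])                               # takeLast(2)    -> [5, 6]
print(list(takewhile(lambda x: x < 3, lst)))  # takeWhile      -> [1, 2]
print([lst[i] for i in (1, 3, 4)])            # slice(indices) -> [2, 4, 5]
```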
18.3 Mapping operation flatMap
Creates a new collection by applying a function to the value of each element in the dataset and flattening the results.
kv1.sortByKey().collect
kv1.groupByKey().collect // group the data by the key of each element in the dataset
kv1.reduceByKey(_ + _).collect
Note: the differences between sortByKey, groupByKey and reduceByKey are as follows:
val kv2 = sc.parallelize(List(("A", 4), ("A", 4), ("C", 3), ("A", 4), ("B", 5)))
kv2.distinct.collect // deduplicate with distinct
kv1.union(kv2).collect // union kv1 with kv2
kv1
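The note above mentions the differences between these operators; the core contrast between groupByKey and reduceByKey can be simulated with a plain Python dict (no Spark needed):

```python
# groupByKey vs. reduceByKey, simulated with plain Python dicts:
# groupByKey collects all values per key, while reduceByKey folds them
# into a single value per key as it goes.
from collections import defaultdict

pairs = [("A", 4), ("A", 4), ("C", 3), ("A", 4), ("B", 5)]

grouped = defaultdict(list)
for k, v in pairs:
    grouped[k].append(v)               # groupByKey: key -> [all values]

reduced = {}
for k, v in pairs:
    reduced[k] = reduced.get(k, 0) + v # reduceByKey(_ + _): key -> sum

print(dict(grouped))  # {'A': [4, 4, 4], 'C': [3], 'B': [5]}
print(reduced)        # {'A': 12, 'C': 3, 'B': 5}
```

In Spark the difference also matters for performance: reduceByKey can combine values on each partition before shuffling, whereas groupByKey ships every value across the network.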
with driver")
executor = new Executor(executorId, Utils.parseHostPort(hostPort)._1, sparkProperties, false)
Executor log location: the console, or $SPARK_HOME/logs
E. Run the task
Sample Code:
sc.textFile("hdfs://hadoop000:8020/hello.txt").flatMap(_.split("\t")).map((_, 1)).reduceByKey(_ + _).collect
After the SchedulerBackend receives the executor's registration message, it splits the submitted Spark job into multiple concrete tasks and then dispatches them
processing. So, for WordCount, we can do it as simply as this:
val file = spark.textFile("hdfs://...")
file.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
If you use Hadoop, it is not so convenient. Fortunately, Scalding, one of Twitter's open-source frameworks, provides an abstraction over Hadoop MapReduce and lets us write the MapReduce job in Scala's style:
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .