Spark Learning Summary

Source: Internet
Author: User

Spark engine

RDD: Resilient Distributed Dataset. An RDD is made up of partitions, and a partition is a concrete thing: a contiguous chunk of data sitting on a physical node. The five properties of an RDD:

1. It consists of a group of partitions.
2. An operator applied to the RDD is applied to each of its partitions.
3. Every RDD carries dependencies on its parent RDDs.
4. If the RDD holds key/value pairs, it can have a partitioner, used by operators such as groupByKey, reduceByKey and countByKey.
5. Some RDDs have preferred compute locations (data locality), for example HadoopRDD; an RDD built from a local collection is the counter-example and has no preferred location.

Operators

Transformations: map, mapPartitions, flatMap, reduceByKey, groupByKey, filter, sortByKey, mapValues, sample ... In essence each one just builds a new RDD, e.g. new MapPartitionsRDD().
Actions: collect (use with caution), reduce, count, take, foreach, foreachPartition ... In essence each one submits a job to the cluster for computation via sc.runJob(). (A small word-count sketch of this split follows these notes.)

RDD fault tolerance

1. Lineage: recompute! Recomputation walks back through the RDDs it depends on; if nothing along the way was persisted, the data is re-read from the data source.
2. cache() / persist(): the default storage level is MEMORY_ONLY, nothing is written to disk, so whatever is lost is recomputed next time. The _2 and _SER suffixes add replication and serialization. Take care to distinguish: MEMORY_AND_DISK spills to the local disk of the worker node, and OFF_HEAP by default goes to Tachyon.
3. checkpoint(): before checkpointing you must call sc.setCheckpointDir("hdfs://...") with a directory on a distributed file system. (See the persist/checkpoint sketch below.)

Cluster model

Spark standalone cluster --> Worker nodes --> Executors --> Threads.
YARN cluster --> NodeManagers --> Containers --> Threads. The ApplicationMaster is the intermediary, or bridge, between the Driver and the ResourceManager.

Application (Driver: DAGScheduler, TaskScheduler) --> Jobs (one per action) --> Stages (split at wide dependencies / shuffles) --> Tasks (pipelined; however many partitions the final RDD of a stage has, that is how many tasks the stage is split into).

Scheduling

The DAGScheduler splits a job into tasks and works out the best compute location for each task. It pushes backwards, that is, it walks up the pipeline to the front-most RDD in that line. If nothing in the line was persisted and the front-most RDD is, say, a HadoopRDD, the preferred location is determined by the location of the block; if something was persisted, the preferred location is wherever the persisted data lives. Finally, if there is no persisted data and no block location, there is no preferred location at all, and the task is thrown to any free executor in the resource list, which means the data has to travel over the network.

The TaskScheduler asks for a batch of executors/containers when it is initialized. When it receives a TaskSet from the DAGScheduler (one TaskSet corresponds to one stage; in other words, the order in which stages are submitted is decided by the DAGScheduler), it takes the tasks out of the TaskSet and sends them to the worker nodes to execute. Only on the worker nodes does the data actually start to be read. After a task finishes on a worker node it sends its result back to the driver, so wherever the driver runs is where you look at the results.
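
To make the transformation/action split concrete, here is a minimal word-count sketch in Scala; the local master setting and the input path are assumptions made for illustration, not part of the original notes.

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCountSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("WordCountSketch").setMaster("local[2]")
        val sc = new SparkContext(conf)

        // Transformations: each call only builds a new RDD (e.g. a MapPartitionsRDD); nothing runs yet.
        val counts = sc.textFile("hdfs:///tmp/input.txt") // hypothetical path
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        // Action: collect() goes through sc.runJob() and submits a job to the cluster.
        counts.collect().foreach(println)

        sc.stop()
      }
    }

Nothing touches the cluster until collect() runs; that call is what ends up in sc.runJob().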
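
And a short spark-shell style fragment for cache/persist plus checkpoint, assuming an existing SparkContext named sc; both HDFS paths are placeholders.

    import org.apache.spark.storage.StorageLevel

    // checkpoint() needs a directory on a distributed file system set up first.
    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints") // placeholder path

    val cleaned = sc.textFile("hdfs:///tmp/logs") // placeholder input
      .filter(_.nonEmpty)

    // MEMORY_AND_DISK spills partitions that do not fit in memory to the worker's local disk.
    cleaned.persist(StorageLevel.MEMORY_AND_DISK)
    cleaned.checkpoint() // written out when the first job on this RDD runs

    cleaned.count() // the action materializes both the cache and the checkpoint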
Deploy modes

Standalone: the difference between --deploy-mode client and cluster. YARN: the difference between --master yarn-client and yarn-cluster.

Spark Core

new SparkContext(conf). Operator exercises: TopN; grouped TopN (Collections.sort(list), insertion sort); secondary sort (build a custom key); PageRank (note that when the number of iterations is large the DAG gets complex, so the RDD of each iteration can be checkpointed); SparkPi.

Spark SQL

new SQLContext() / new HiveContext(). A DataFrame is data plus a schema. Turning an RDD into a DataFrame:

1. By reflection, from a JavaBean.
2. Dynamically, by building StructField / StructType.

Data sources: JSON, MySQL (spark-defaults.conf), Hive (note that when running the code in yarn-cluster mode you need to pass 3 extra jar packages with --jars).

Window functions: row_number() produces a row number, so row_number() over (partition by ... order by ... desc) as rank, filtered with where rank <= 3, gives a grouped top 3 (see the SQL sketch at the end of this part).

Custom UDF and UDAF: a UDF is registered with sqlContext.udf.register("name", ...). A UDF works row by row (one row in, one value out), while a UDAF aggregates (many rows in, one value out); a UDF sketch follows below.

Spark Streaming

val ssc = new StreamingContext(conf, ...); ssc.start(); ssc.awaitTermination(). It is essentially micro-batch processing: every batch interval it cuts off an RDD, and every RDD submits a job, so under the hood it is still our Spark engine doing the processing.

Reading port data: socketTextStream() uses the receiver mechanism and consumes an extra thread.
Reading HDFS data: textFileStream() uses no receiver; in essence it reads the newly arrived block data every interval.
Reading Kafka data: 1. receiver-based, KafkaUtils.createStream(); 2. direct mode, no receiver, KafkaUtils.createDirectStream(). Benefits of the direct mode: 1. one-to-one mapping: however many partitions the Kafka topic has, the RDD inside Spark has the same number; 2. efficiency: no WAL (write-ahead log), so no extra disk IO; 3. exactly-once: everything is computed exactly once, no more and no less. (See the Kafka sketch below.)
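
A sketch of the grouped top-3 with row_number(); it assumes an existing SparkContext named sc, a HiveContext (the 1.x-era API these notes use), and a hypothetical sales_table with product, category and sales columns.

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)

    // Rank rows inside each category, then keep the top 3 per group.
    val top3 = hiveContext.sql(
      """SELECT product, category, sales
        |FROM (
        |  SELECT product, category, sales,
        |         row_number() OVER (PARTITION BY category ORDER BY sales DESC) AS rk
        |  FROM sales_table
        |) ranked
        |WHERE rk <= 3""".stripMargin)

    top3.show()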
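
A sketch of registering a UDF; the name strLen and its logic are invented, and sqlContext is assumed to be an existing SQLContext or HiveContext.

    // One value in per row, one value out per row.
    sqlContext.udf.register("strLen", (s: String) => if (s == null) 0 else s.length)

    // Afterwards it can be used from SQL, e.g.
    // sqlContext.sql("SELECT name, strLen(name) FROM people")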
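
A sketch of the direct (receiver-less) Kafka read with the 0.8-era KafkaUtils API these notes refer to; the broker address and topic name are placeholders, and an existing StreamingContext named ssc plus the spark-streaming-kafka artifact are assumed.

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map[String, String]("metadata.broker.list" -> "broker1:9092") // placeholder broker
    val topics = Set("events") // placeholder topic

    // One Kafka partition maps to one RDD partition; no receiver and no WAL.
    val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    directStream.map(_._2).print() // the values, ignoring the keys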
Compared with Spark Core, Spark Streaming has three characteristic transform operations (sketches of the stateful and windowed variants follow these notes):

1. updateStateByKey(): note that it needs ssc.checkpoint("hdfs://...") and it keeps updating the previously accumulated state.
2. transform(): its feature is that it hands you the underlying RDD, so you can operate on it directly with any Spark Core operator, and finally you give it back an RDD, which is then passed on downstream.
3. Window-based operations: note that three durations are involved: the batch interval (how often an RDD is cut), the slide duration (how often a computation is triggered, slideDuration), and the window duration (how much data each computation covers, windowDuration). Take reduceByKeyAndWindow() as an example; it has two APIs: reduceByKeyAndWindow(_ + _, windowDuration, slideDuration), and an optimized one that requires checkpointing to be set up first, reduceByKeyAndWindow(_ + _, _ - _, windowDuration, slideDuration).

Parallelism

When reading the data in: sc.parallelize(...) takes a parameter for how many partitions, and sc.textFile(...) takes a parameter for at least how many partitions.
During the computation we can use the repartition or coalesce operators to change the number of partitions.
During the computation we can also specify the number of reduce tasks on the reduce side of a shuffle, for example through groupByKey([numTasks]) or reduceByKey([numTasks]), or through conf.set("spark.default.parallelism", "100").
Priority: groupByKey([numTasks]) > conf.set("spark.default.parallelism", "100"). If neither is set, the parallelism of the last (parent) RDD is used. (A parallelism sketch follows at the end of these notes.)
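
A sketch of the micro-batch model combining socketTextStream() with updateStateByKey(); the host, port, batch interval and checkpoint directory are all placeholders.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StatefulWordCountSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("StatefulWordCountSketch").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(5)) // an RDD is cut every 5 seconds

        // updateStateByKey() requires a checkpoint directory.
        ssc.checkpoint("hdfs:///tmp/streaming-checkpoints") // placeholder path

        // Receiver-based source: the receiver occupies one thread, hence local[2].
        val lines = ssc.socketTextStream("localhost", 9999)

        val counts = lines.flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
            Some(newValues.sum + state.getOrElse(0))
          }

        counts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }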
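
A fragment showing the two reduceByKeyAndWindow() forms; it assumes an existing DStream[(String, Int)] named pairs and a StreamingContext whose checkpointing has already been enabled, and the 30s/10s durations are arbitrary.

    import org.apache.spark.streaming.Seconds

    // Plain form: recomputes the whole 30-second window every 10 seconds.
    val windowed = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

    // Optimized form: adds the new slide and subtracts the slide that left the window.
    // This is the variant that needs ssc.checkpoint(...) to be set first.
    val windowedIncremental = pairs.reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b, // reduce
      (a: Int, b: Int) => a - b, // inverse reduce
      Seconds(30),               // windowDuration
      Seconds(10))               // slideDuration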
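
Finally, a fragment touching the parallelism knobs listed above; the dataset, partition counts and the value 100 are arbitrary choices.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("ParallelismSketch")
      .setMaster("local[4]")
      .set("spark.default.parallelism", "100") // default number of partitions after a shuffle

    val sc = new SparkContext(conf)

    // At read time: choose the partition count up front.
    val nums  = sc.parallelize(1 to 1000000, 8)          // exactly 8 partitions
    val lines = sc.textFile("hdfs:///tmp/input.txt", 16) // at least 16 partitions (minPartitions)

    // During the computation: repartition (with shuffle) or coalesce (narrow) the data.
    val fewer = lines.coalesce(4)

    // At shuffle time: the per-operator numTasks argument wins over spark.default.parallelism.
    val counts = nums.map(n => (n % 10, 1)).reduceByKey(_ + _, 20) // 20 reduce tasks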
