Spark Learning Summary

Source: Internet
Author: User

Spark engine

RDD: Resilient Distributed Dataset. An RDD is made up of partitions, and a partition is a concrete thing: a contiguous chunk of data sitting on a physical node. The five properties of an RDD:

1. It consists of a group of partitions.
2. An operator applied to the RDD is applied to each of its partitions.
3. Every RDD carries dependencies on its parent RDDs.
4. If the RDD holds key/value pairs, it can have a partitioner, used by operators such as groupByKey, reduceByKey and countByKey.
5. Some RDDs have preferred compute locations (data locality), for example HadoopRDD; an RDD built from a local collection is the counter-example and has no preferred location.

Operators

Transformations: map, mapPartitions, flatMap, reduceByKey, groupByKey, filter, sortByKey, mapValues, sample ... In essence each one just builds a new RDD, e.g. new MapPartitionsRDD().
Actions: collect (use with caution), reduce, count, take, foreach, foreachPartition ... In essence each one submits a job to the cluster for computation via sc.runJob(). (A small word-count sketch of this split follows these notes.)

RDD fault tolerance

1. Lineage: recompute! Recomputation walks back through the RDDs it depends on; if nothing along the way was persisted, the data is re-read from the data source.
2. cache() / persist(): the default storage level is MEMORY_ONLY, nothing is written to disk, so whatever is lost is recomputed next time. The _2 and _SER suffixes add replication and serialization. Take care to distinguish: MEMORY_AND_DISK spills to the local disk of the worker node, and OFF_HEAP by default goes to Tachyon.
3. checkpoint(): before checkpointing you must call sc.setCheckpointDir("hdfs://...") with a directory on a distributed file system. (See the persist/checkpoint sketch below.)

Cluster model

Spark standalone cluster --> Worker nodes --> Executors --> Threads.
YARN cluster --> NodeManagers --> Containers --> Threads. The ApplicationMaster is the intermediary, or bridge, between the Driver and the ResourceManager.

Application (Driver: DAGScheduler, TaskScheduler) --> Jobs (one per action) --> Stages (split at wide dependencies / shuffles) --> Tasks (pipelined; however many partitions the final RDD of a stage has, that is how many tasks the stage is split into).

Scheduling

The DAGScheduler splits a job into tasks and works out the best compute location for each task. It pushes backwards, that is, it walks up the pipeline to the front-most RDD in that line. If nothing in the line was persisted and the front-most RDD is, say, a HadoopRDD, the preferred location is determined by the location of the block; if something was persisted, the preferred location is wherever the persisted data lives. Finally, if there is no persisted data and no block location, there is no preferred location at all, and the task is thrown to any free executor in the resource list, which means the data has to travel over the network.

The TaskScheduler asks for a batch of executors/containers when it is initialized. When it receives a TaskSet from the DAGScheduler (one TaskSet corresponds to one stage; in other words, the order in which stages are submitted is decided by the DAGScheduler), it takes the tasks out of the TaskSet and sends them to the worker nodes to execute. Only on the worker nodes does the data actually start to be read. After a task finishes on a worker node it sends its result back to the driver, so wherever the driver runs is where you look at the results.
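
To make the transformation/action split concrete, here is a minimal word-count sketch in Scala; the local master setting and the input path are assumptions made for illustration, not part of the original notes.

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCountSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("WordCountSketch").setMaster("local[2]")
        val sc = new SparkContext(conf)

        // Transformations: each call only builds a new RDD (e.g. a MapPartitionsRDD); nothing runs yet.
        val counts = sc.textFile("hdfs:///tmp/input.txt") // hypothetical path
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        // Action: collect() goes through sc.runJob() and submits a job to the cluster.
        counts.collect().foreach(println)

        sc.stop()
      }
    }

Nothing touches the cluster until collect() runs; that call is what ends up in sc.runJob().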
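
And a short spark-shell style fragment for cache/persist plus checkpoint, assuming an existing SparkContext named sc; both HDFS paths are placeholders.

    import org.apache.spark.storage.StorageLevel

    // checkpoint() needs a directory on a distributed file system set up first.
    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints") // placeholder path

    val cleaned = sc.textFile("hdfs:///tmp/logs") // placeholder input
      .filter(_.nonEmpty)

    // MEMORY_AND_DISK spills partitions that do not fit in memory to the worker's local disk.
    cleaned.persist(StorageLevel.MEMORY_AND_DISK)
    cleaned.checkpoint() // written out when the first job on this RDD runs

    cleaned.count() // the action materializes both the cache and the checkpoint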
Deploy modes

Standalone: the difference between --deploy-mode client and cluster. YARN: the difference between --master yarn-client and yarn-cluster.

Spark Core

new SparkContext(conf). Operator exercises: TopN; grouped TopN (Collections.sort(list), insertion sort); secondary sort (build a custom key); PageRank (note that when the number of iterations is large the DAG gets complex, so the RDD of each iteration can be checkpointed); SparkPi.

Spark SQL

new SQLContext() / new HiveContext(). A DataFrame is data plus a schema. Turning an RDD into a DataFrame:

1. By reflection, from a JavaBean.
2. Dynamically, by building StructField / StructType.

Data sources: JSON, MySQL (spark-defaults.conf), Hive (note that when running the code in yarn-cluster mode you need to pass 3 extra jar packages with --jars).

Window functions: row_number() produces a row number, so row_number() over (partition by ... order by ... desc) as rank, filtered with where rank <= 3, gives a grouped top 3 (see the SQL sketch at the end of this part).

Custom UDF and UDAF: a UDF is registered with sqlContext.udf.register("name", ...). A UDF works row by row (one row in, one value out), while a UDAF aggregates (many rows in, one value out); a UDF sketch follows below.

Spark Streaming

val ssc = new StreamingContext(conf, ...); ssc.start(); ssc.awaitTermination(). It is essentially micro-batch processing: every batch interval it cuts off an RDD, and every RDD submits a job, so under the hood it is still our Spark engine doing the processing.

Reading port data: socketTextStream() uses the receiver mechanism and consumes an extra thread.
Reading HDFS data: textFileStream() uses no receiver; in essence it reads the newly arrived block data every interval.
Reading Kafka data: 1. receiver-based, KafkaUtils.createStream(); 2. direct mode, no receiver, KafkaUtils.createDirectStream(). Benefits of the direct mode: 1. one-to-one mapping: however many partitions the Kafka topic has, the RDD inside Spark has the same number; 2. efficiency: no WAL (write-ahead log), so no extra disk IO; 3. exactly-once: everything is computed exactly once, no more and no less. (See the Kafka sketch below.)
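
A sketch of the grouped top-3 with row_number(); it assumes an existing SparkContext named sc, a HiveContext (the 1.x-era API these notes use), and a hypothetical sales_table with product, category and sales columns.

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)

    // Rank rows inside each category, then keep the top 3 per group.
    val top3 = hiveContext.sql(
      """SELECT product, category, sales
        |FROM (
        |  SELECT product, category, sales,
        |         row_number() OVER (PARTITION BY category ORDER BY sales DESC) AS rk
        |  FROM sales_table
        |) ranked
        |WHERE rk <= 3""".stripMargin)

    top3.show()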
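
A sketch of registering a UDF; the name strLen and its logic are invented, and sqlContext is assumed to be an existing SQLContext or HiveContext.

    // One value in per row, one value out per row.
    sqlContext.udf.register("strLen", (s: String) => if (s == null) 0 else s.length)

    // Afterwards it can be used from SQL, e.g.
    // sqlContext.sql("SELECT name, strLen(name) FROM people")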
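
A sketch of the direct (receiver-less) Kafka read with the 0.8-era KafkaUtils API these notes refer to; the broker address and topic name are placeholders, and an existing StreamingContext named ssc plus the spark-streaming-kafka artifact are assumed.

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map[String, String]("metadata.broker.list" -> "broker1:9092") // placeholder broker
    val topics = Set("events") // placeholder topic

    // One Kafka partition maps to one RDD partition; no receiver and no WAL.
    val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    directStream.map(_._2).print() // the values, ignoring the keys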
Compared with Spark Core, Spark Streaming has three characteristic transform operations (sketches of the stateful and windowed variants follow these notes):

1. updateStateByKey(): note that it needs ssc.checkpoint("hdfs://...") and it keeps updating the previously accumulated state.
2. transform(): its feature is that it hands you the underlying RDD, so you can operate on it directly with any Spark Core operator, and finally you give it back an RDD, which is then passed on downstream.
3. Window-based operations: note that three durations are involved: the batch interval (how often an RDD is cut), the slide duration (how often a computation is triggered, slideDuration), and the window duration (how much data each computation covers, windowDuration). Take reduceByKeyAndWindow() as an example; it has two APIs: reduceByKeyAndWindow(_ + _, windowDuration, slideDuration), and an optimized one that requires checkpointing to be set up first, reduceByKeyAndWindow(_ + _, _ - _, windowDuration, slideDuration).

Parallelism

When reading the data in: sc.parallelize(...) takes a parameter for how many partitions, and sc.textFile(...) takes a parameter for at least how many partitions.
During the computation we can use the repartition or coalesce operators to change the number of partitions.
During the computation we can also specify the number of reduce tasks on the reduce side of a shuffle, for example through groupByKey([numTasks]) or reduceByKey([numTasks]), or through conf.set("spark.default.parallelism", "100").
Priority: groupByKey([numTasks]) > conf.set("spark.default.parallelism", "100"). If neither is set, the parallelism of the last (parent) RDD is used. (A parallelism sketch follows at the end of these notes.)
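
A sketch of the micro-batch model combining socketTextStream() with updateStateByKey(); the host, port, batch interval and checkpoint directory are all placeholders.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StatefulWordCountSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("StatefulWordCountSketch").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(5)) // an RDD is cut every 5 seconds

        // updateStateByKey() requires a checkpoint directory.
        ssc.checkpoint("hdfs:///tmp/streaming-checkpoints") // placeholder path

        // Receiver-based source: the receiver occupies one thread, hence local[2].
        val lines = ssc.socketTextStream("localhost", 9999)

        val counts = lines.flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
            Some(newValues.sum + state.getOrElse(0))
          }

        counts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }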
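
A fragment showing the two reduceByKeyAndWindow() forms; it assumes an existing DStream[(String, Int)] named pairs and a StreamingContext whose checkpointing has already been enabled, and the 30s/10s durations are arbitrary.

    import org.apache.spark.streaming.Seconds

    // Plain form: recomputes the whole 30-second window every 10 seconds.
    val windowed = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

    // Optimized form: adds the new slide and subtracts the slide that left the window.
    // This is the variant that needs ssc.checkpoint(...) to be set first.
    val windowedIncremental = pairs.reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b, // reduce
      (a: Int, b: Int) => a - b, // inverse reduce
      Seconds(30),               // windowDuration
      Seconds(10))               // slideDuration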
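
Finally, a fragment touching the parallelism knobs listed above; the dataset, partition counts and the value 100 are arbitrary choices.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("ParallelismSketch")
      .setMaster("local[4]")
      .set("spark.default.parallelism", "100") // default number of partitions after a shuffle

    val sc = new SparkContext(conf)

    // At read time: choose the partition count up front.
    val nums  = sc.parallelize(1 to 1000000, 8)          // exactly 8 partitions
    val lines = sc.textFile("hdfs:///tmp/input.txt", 16) // at least 16 partitions (minPartitions)

    // During the computation: repartition (with shuffle) or coalesce (narrow) the data.
    val fewer = lines.coalesce(4)

    // At shuffle time: the per-operator numTasks argument wins over spark.default.parallelism.
    val counts = nums.map(n => (n % 10, 1)).reduceByKey(_ + _, 20) // 20 reduce tasks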
