A detailed introduction to Spark's working mechanism, compiling the Spark source code, and hands-on Spark programming

Source: Internet
Author: User
Tags: shuffle
Spark communication module
1, The Spark cluster manager supports several deployment modes, including local, standalone, Mesos, and YARN, in order to
Several communication modes
1, RPC (Remote Procedure Call)
Spark communication mechanism: Spark uses the Akka framework for message communication within the cluster.
The advantages and characteristics of Akka are as follows:
1, Parallel and distributed: Akka is designed around asynchronous communication and a distributed architecture.
2, Reliability: it provides monitoring and recovery mechanisms for both local and remote actors.
3, High performance: on a single machine it can send 50 million messages per second, and 1 GB of memory can hold about 2.5 million actor objects.
4, Decentralized: unlike the master-slave model, it uses an architecture with no central node.
5, Scalability: it can scale out in a distributed environment, expanding computing capacity linearly.
As these points show, Akka has powerful concurrent processing capabilities.
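To make the actor model behind these points more concrete, here is a minimal sketch in plain Python. It is not Akka and not Spark code; the names CounterActor, tell, stop, and sum_with_actor are made up for illustration. It only shows the core idea: an actor owns its state, receives messages asynchronously through a mailbox, and processes them one at a time.

```python
import queue
import threading


class CounterActor:
    """Toy actor: private state, a mailbox, and one thread that drains it."""

    _STOP = object()  # sentinel message telling the actor to shut down

    def __init__(self):
        self._mailbox = queue.Queue()
        self.total = 0  # only ever modified by the actor's own thread
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def tell(self, message):
        """Asynchronous send: enqueue the message and return immediately."""
        self._mailbox.put(message)

    def stop(self):
        """Let the actor finish the queued messages, then wait for it to exit."""
        self._mailbox.put(self._STOP)
        self._thread.join()

    def _run(self):
        # Messages are handled strictly one at a time, so `total` needs no lock.
        while True:
            message = self._mailbox.get()
            if message is self._STOP:
                break
            self.total += message


def sum_with_actor(values):
    """Send every value to a fresh actor and return the sum it accumulated."""
    actor = CounterActor()
    for value in values:
        actor.tell(value)
    actor.stop()
    return actor.total
```

Because senders only enqueue messages and never touch the actor's state directly, many senders can run concurrently without locks, which is the property the list above attributes to Akka's design.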


4.5 Fault-tolerance mechanisms
4.5.1 Lineage mechanism and RDD dependencies
RDD dependencies are either narrow or wide.
Narrow dependency: each parent partition is used by at most one child partition, so a lost partition can be recomputed from its own parent partitions by following the lineage, which is simple.
Wide dependency: a child partition depends on many parent partitions, so recovery is more costly. This is why the checkpoint mechanism was added: it is essentially a backup of the data, kept for fault tolerance.
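As a rough illustration of why recovery cost differs between the two dependency types, the hypothetical helper below lists which parent partitions are needed to rebuild one lost child partition. It assumes only the simplest cases (a one-to-one narrow dependency such as map or filter, and a shuffle-style wide dependency in which every parent partition may contribute) and is not part of Spark's API.

```python
def parents_to_recompute(dependency, lost_partition, num_parent_partitions):
    """Return the parent partition indices needed to rebuild one lost child partition."""
    if not 0 <= lost_partition < num_parent_partitions:
        raise ValueError("lost_partition is out of range")
    if dependency == "narrow":
        # One-to-one case (e.g. map, filter): child i is built from parent i only.
        return [lost_partition]
    if dependency == "wide":
        # Shuffle case (e.g. groupByKey): child i may pull records from every parent.
        return list(range(num_parent_partitions))
    raise ValueError(f"unknown dependency type: {dependency!r}")
```

With four parent partitions, losing child partition 2 needs only parent 2 under a narrow dependency, but all four parents under a wide one. If those parents are themselves gone, the recomputation cascades further up the lineage, which is the situation a checkpoint is meant to cut short.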
You can set the storage path for checkpoint data with SparkContext.setCheckpointDir(). Checkpointing an RDD saves its data to that path, after which Spark removes all of the RDD's references to its ancestor RDDs. The RDD must be marked for checkpointing before any job has been executed on it.
Official recommendation: persist an RDD in memory before checkpointing it; otherwise it has to be recomputed in order to be written to the checkpoint file, which adds extra overhead.
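The effect of a checkpoint on lineage can be sketched with a toy model in plain Python. ToyRDD and all of its methods are hypothetical stand-ins written for this illustration (a single partition, an in-memory dict instead of a checkpoint directory); in Spark itself the corresponding calls are SparkContext.setCheckpointDir() and RDD.checkpoint().

```python
class ToyRDD:
    """Single-partition stand-in for an RDD: a parent link plus a transform."""

    def __init__(self, parent=None, fn=None, data=None):
        self.parent = parent      # lineage: the dataset this one is derived from
        self.fn = fn              # how to derive it from the parent's rows
        self.data = data          # set for source datasets and after checkpointing
        self.compute_calls = 0    # how many times fn had to be run

    def map(self, fn):
        return ToyRDD(parent=self, fn=lambda rows: [fn(row) for row in rows])

    def collect(self):
        if self.data is not None:
            return list(self.data)          # materialised: no recomputation
        self.compute_calls += 1
        return self.fn(self.parent.collect())

    def lineage_depth(self):
        return 0 if self.parent is None else 1 + self.parent.lineage_depth()

    def checkpoint(self, store, key):
        """Save the computed rows into `store` and drop the link to the ancestors."""
        store[key] = self.collect()         # recomputes, since nothing is cached
        self.data = store[key]
        self.parent = None
        self.fn = None
```

In this model, calling collect() and then checkpoint() on an uncached dataset runs the whole chain twice, which mirrors the reason for the recommendation to persist an RDD before checkpointing it. After the checkpoint, lineage_depth() is 0 and later reads no longer touch the ancestors.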


4.6 Shuffle mechanism
Shuffle Write
Shuffle Fetch
Shuffle Aggregator
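The three shuffle phases named above can be sketched in plain Python as follows. This is a simplified in-memory model under stated assumptions (hash partitioning, no spilling to disk, no network transfer), and the function names shuffle_write, shuffle_fetch, shuffle_aggregate, and run_shuffle are invented for the example rather than taken from Spark.

```python
def shuffle_write(map_output, num_reducers):
    """Map side: split one map task's (key, value) records into per-reducer buckets."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in map_output:
        # The same key always hashes to the same reducer within one run.
        buckets[hash(key) % num_reducers].append((key, value))
    return buckets


def shuffle_fetch(all_buckets, reducer_id):
    """Reduce side: gather the bucket addressed to this reducer from every map task."""
    fetched = []
    for buckets in all_buckets:
        fetched.extend(buckets[reducer_id])
    return fetched


def shuffle_aggregate(records, combine):
    """Reduce side: merge all values that share a key using `combine`."""
    merged = {}
    for key, value in records:
        merged[key] = combine(merged[key], value) if key in merged else value
    return merged


def run_shuffle(map_outputs, num_reducers, combine):
    """Run write, fetch, and aggregate for every reducer and merge the results."""
    all_buckets = [shuffle_write(output, num_reducers) for output in map_outputs]
    result = {}
    for reducer_id in range(num_reducers):
        fetched = shuffle_fetch(all_buckets, reducer_id)
        result.update(shuffle_aggregate(fetched, combine))
    return result
```

Each reducer fetches from every map task, which is why a shuffle creates a wide dependency between the stages on either side of it.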


During execution, the driver controls the application lifecycle. For scheduling, Spark uses the classic FIFO and FAIR scheduling algorithms to allocate internal resources at different levels. In Spark I/O, data is abstracted into blocks for management, and each partition of an RDD is one block to be processed. Communication within the cluster is essential for delivering commands and status, and Spark uses the Akka framework for cluster messaging. Spark ensures fault tolerance through the lineage and checkpoint mechanisms: lineage re-executes the operations that produced the lost data, while a checkpoint keeps a redundant backup of the data. Finally, the shuffle mechanism was introduced: Spark borrows from the MapReduce model, but its shuffle mechanism has been innovated and optimized.
Chapter 5: Spark development environment configuration and workflow


Chapter 6: Spark programming in practice
1, WordCount
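The original listing for this example is not present here, so the following is only a sketch of the usual WordCount logic, written in plain Python so that it runs without a Spark installation. The function name word_count is an assumption; each step is labeled with the RDD operation it stands in for.

```python
def word_count(lines):
    """Count words stage by stage, mirroring the classic RDD pipeline."""
    # flatMap: split every line into words
    words = [word for line in lines for word in line.split()]
    # map: pair each word with the count 1
    pairs = [(word, 1) for word in words]
    # reduceByKey: add up the counts for each distinct word
    counts = {}
    for word, one in pairs:
        counts[word] = counts.get(word, 0) + one
    return counts


# With PySpark available, the same pipeline is commonly written as follows,
# where `sc` is an existing SparkContext and `path` is a placeholder input path:
#   sc.textFile(path).flatMap(lambda line: line.split()) \
#     .map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
```

The reduceByKey step is where Spark performs a shuffle, since all counts for one word have to end up in the same partition.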

























