Hands-on analysis: how Spark works and RDDs demystified

Source: Internet
Author: User
Tags: spark, rdd

1. How Spark works in practice
Interactive queries (shell, SQL)
Batch processing (machine learning, graph computation)
First of all, Spark is a distributed, efficient, memory-based computing framework with a one-stack design that supports three kinds of workload: stream processing, real-time interactive queries, and batch processing. Spark is especially well suited to iterative computation, so it has strong support for machine learning and graph computation, and it provides libraries for both.
(1) Distributed: distributed computation
Characteristics of running across multiple machines:
A Spark application is submitted to the cluster from a client. The cluster contains many machines, and the job runs on the distributed nodes. Each node generally processes only part of the data; the purpose of distribution is parallelism.

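Cluster nodes and the client
An analogy: counting the books in a library. The head librarian plays the role of the cluster manager and assigns 1,000 people to the job; each person counts the books on their own shelves. The work is distributed so that it can be done in parallel.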
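The library analogy can be sketched in a few lines of plain Python. This is only a toy model, not Spark code: the names count_shelf and count_library, the 1,000 shelves, and the 30 books per shelf are assumptions made up for illustration, and a thread pool stands in for the cluster's workers.

```python
from concurrent.futures import ThreadPoolExecutor


def count_shelf(shelf):
    """Work done by one 'person': count the books on a single shelf."""
    return len(shelf)


def count_library(shelves, workers=4):
    """Hand the shelves out to workers, then add up the partial counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_shelf, shelves))
    return sum(partial_counts)


# Hypothetical library: 1,000 shelves holding 30 books each.
shelves = [["book"] * 30 for _ in range(1000)]
total = count_library(shelves)
```

Each worker sees only its own shelves, and only the small partial counts travel back to be summed, which is the same shape as a distributed count.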

The cluster has a resource management service (the Cluster Manager) and worker nodes (Worker Nodes) that run the job's tasks. Each application has a task control node (the Driver), and each machine node has execution processes (Executors) that run the actual tasks. An Executor has two advantages. First, it runs tasks in multiple threads rather than using a process model as MapReduce does, which reduces task start-up overhead. Second, each Executor has a BlockManager storage module, similar to a key-value store that uses both memory and disk as storage devices. When a job iterates over multiple rounds, intermediate data can be kept in this storage and read from it directly the next time it is needed, without reading and writing a file system such as HDFS. In interactive query scenarios, tables can be cached in this storage in advance to improve read/write I/O performance. In addition, during a shuffle Spark avoids unnecessary sort operations in groupBy, join, and similar scenarios. Compared with MapReduce, which offers only the map and reduce modes, Spark provides a richer set of operations such as filter, groupBy, and join.
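(2) Memory-based
Spark can use memory efficiently.
For example, with 3 million records on three machines, each machine handles 1 million records. Data goes into memory first. If memory can hold only 500,000 records, the other 500,000 go to disk. Spark keeps as much as possible in memory because memory is fast.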
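The arithmetic in this example can be written out as a small Python function. It is a sketch only: place_records and its parameter names are hypothetical, and it assumes the records divide evenly across the machines and that memory capacity is measured in records.

```python
def place_records(total_records, machines, memory_capacity_per_machine):
    """Split records evenly, then fill memory before spilling to disk."""
    per_machine = total_records // machines
    in_memory = min(per_machine, memory_capacity_per_machine)
    on_disk = per_machine - in_memory
    return {"per_machine": per_machine, "in_memory": in_memory, "on_disk": on_disk}


# The numbers from the text: 3 million records, 3 machines,
# room for 500,000 records in memory on each machine.
placement = place_records(3_000_000, 3, 500_000)
```

With those inputs each machine holds 1,000,000 records, 500,000 of them in memory and 500,000 on disk.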

(3) Good at iterative computation, which is the true essence of Spark
First-stage computation -> second-stage computation -> third-stage computation
After a stage finishes, its results may be moved to other machines. This is the shuffle: data moves from one node to another.
Hadoop MapReduce has only two phases, map and reduce, and reads and writes disk every time.
Spark is much more flexible: after the first stage it can run further stages (iteration). Each stage's results are kept in memory where possible, and the next stage can read the in-memory data.
Spark scheduling: the DAGScheduler, driven by lineage

Why do many companies use Java to develop Spark applications?
1. Talent availability
2. Easier integration, since many front-end systems are built on Java EE
3. Easier maintenance

Spark SQL can replace only Hive's compute engine; it cannot replace Hive's data storage.
The driver program runs on the Driver; the tasks execute on the Workers.
Data sources Spark can process: HDFS, HBase, Hive, DB, S3 (Amazon Simple Storage Service, an online storage service that Amazon provides through Amazon Web Services)
Output destinations: HDFS, HBase, Hive, DB, S3, or back to the Driver (the program itself)

2. RDDs demystified
A general-purpose resilient (elastic) distributed dataset
The RDD is the heart of Spark.
An RDD represents the data to be processed, and it is distributed while it is being processed.
(1) An RDD is a series of shards stored on the nodes, in memory by default. When the data does not fit in memory, part of it is placed on disk, and storage switches automatically between memory and disk (the first kind of resilience).
(2) If step 900 of a 1,000-step computation fails, Spark can recompute starting from step 900 instead of from the very beginning, which speeds up error recovery.
(3) If a task with 1,000 computation steps fails while recovering at step 900, it is retried automatically. The retry limit is usually set to between 3 and 5; the default is 4.
(4) If a task fails for good, the whole stage fails and is resubmitted. On resubmission, only the failed tasks are submitted again: if 5 of 1,000,000 tasks failed, the other 1,000,000 - 5 are not resubmitted, only those 5 (default 3 times).
Resilience one: automatic switching of data storage between memory and disk
Resilience two: efficient fault tolerance based on lineage
Resilience three: a failed task is automatically retried a set number of times
Resilience four: a failed stage is automatically retried a set number of times
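The third point, automatic task retry, can be modelled in plain Python. This is a toy model and not how Spark implements it: TaskFailed, run_with_retries, and make_flaky_task are hypothetical names, and the limit of 4 is the default quoted in the text, interpreted here as "give up after the 4th failure".

```python
class TaskFailed(Exception):
    """Raised by a toy task to signal a failed attempt."""


def run_with_retries(task, max_failures=4):
    """Run task() until it succeeds or has failed max_failures times.

    Returns (result, number_of_failures_before_success).
    """
    failures = 0
    while True:
        try:
            return task(), failures
        except TaskFailed:
            failures += 1
            if failures >= max_failures:
                raise  # retry budget used up: give up on this task


def make_flaky_task(fail_times):
    """Hypothetical task that fails fail_times times, then returns 'done'."""
    state = {"calls": 0}

    def task():
        state["calls"] += 1
        if state["calls"] <= fail_times:
            raise TaskFailed(f"attempt {state['calls']} failed")
        return "done"

    return task


# Three failures stay inside the budget of 4, so the fourth attempt succeeds.
result, failures = run_with_retries(make_flaky_task(fail_times=3))
```

A task that fails four times in a row exhausts the budget, and the error propagates to the caller, which in the text's terms is the point at which the stage itself is considered failed.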

When to cache:
1. The computation is particularly time-consuming
2. The computation chain is very long (expensive to recompute), for example 1,000 steps with recovery at step 900
3. After a shuffle: with the result cached, a failure does not require redoing the shuffle (fetching data from other nodes)
4. Checkpoint: writes the entire dataset to disk, so the steps before the checkpoint never need to be recomputed
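The benefit of caching or checkpointing a long chain comes down to simple arithmetic, sketched below in Python. The function steps_to_recompute is a hypothetical helper, and the model assumes a strictly linear chain of numbered steps in which a saved step's output is always still available after the failure.

```python
def steps_to_recompute(failed_step, saved_steps):
    """Count the steps that must be re-run to rebuild the output of failed_step.

    saved_steps holds the step numbers whose output was cached or checkpointed.
    Recovery restarts just after the latest saved step before failed_step;
    with nothing saved, it restarts from step 1.
    """
    earlier = [s for s in saved_steps if s < failed_step]
    restart_after = max(earlier, default=0)
    return failed_step - restart_after


# The example from the text: a failure at step 900 of a long chain.
without_cache = steps_to_recompute(900, saved_steps=[])
with_checkpoint = steps_to_recompute(900, saved_steps=[899])
```

With nothing saved, all 900 steps are re-run; with the output of step 899 saved, only step 900 is.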

An RDD is a series of data shards distributed across different nodes on different machines and managed as partitions. Each partition is a subset of the data, and the RDD also carries the function used to compute each partition.
The most commonly used RDDs are those built from data on Hadoop.
Start the file system

Start Spark



The single entry point to the cluster is SparkContext. All work goes through SparkContext, and SparkContext creates RDDs.



It is obtained automatically, in either local or cluster mode.

Every Spark operation works on RDDs, and each transformation produces a new RDD.

textFile
This is lazy, like a transformation: nothing runs yet.

data.count
count is an action, so the job runs.
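The difference between a lazy transformation and an action can be shown with a toy class in plain Python. This is not Spark's API: LazyDataset and read_lines are hypothetical names, and the three input lines are made up. The class only records transformations, and the source is read when count is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, ops=()):
        self._source = source    # zero-argument callable that yields the records
        self._ops = tuple(ops)   # pending transformations, applied in order

    def map(self, fn):
        # Transformation: returns a new dataset, computes nothing.
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        # Transformation: returns a new dataset, computes nothing.
        return LazyDataset(self._source, self._ops + (("filter", pred),))

    def count(self):
        # Action: reads the source and applies every pending transformation.
        records = self._source()
        for kind, fn in self._ops:
            records = map(fn, records) if kind == "map" else filter(fn, records)
        return sum(1 for _ in records)


reads = []


def read_lines():
    reads.append("read")  # lets us observe when the source is actually touched
    return iter(["spark rdd", "", "hadoop hdfs"])


data = LazyDataset(read_lines)                    # nothing is read yet
non_empty = data.filter(lambda line: line != "")  # still nothing is read
reads_before_action = len(reads)
n = non_empty.count()                             # the action triggers the read
```

Building data and non_empty leaves the reads list empty; only the call to count reads the source, once, and counts the two non-empty lines.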

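What is the relationship between HDFS blocks and the partitions of a Spark RDD?
When Spark reads data from HDFS, each RDD partition corresponds to one HDFS block, so partition size = block size (128 MB). The exception is a record that straddles two blocks: the last record of a partition may extend into the next block, so a partition is not always exactly 128 MB.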
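Under the one-partition-per-block rule described here, the number of partitions for a file can be estimated with a short calculation. This is a back-of-the-envelope sketch: partitions_for_file is a hypothetical helper, the 128 MB block size is the figure from the text, and real input splitting applies further rules that are ignored here.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the block size quoted in the text


def partitions_for_file(file_size_bytes, block_size=BLOCK_SIZE):
    """Estimate partitions as one per block, rounding a partial last block up."""
    return (file_size_bytes + block_size - 1) // block_size


one_gb = partitions_for_file(1024 * 1024 * 1024)
three_hundred_mb = partitions_for_file(300 * 1024 * 1024)
```

A 1 GB file gives 8 partitions, and a 300 MB file gives 3, the last of which covers only the remaining 44 MB.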

Partitioning can use different strategies, such as hash or range.
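The two strategies named here can be compared with a minimal Python sketch. The functions hash_partition and range_partition are hypothetical, not Spark's partitioner classes; CRC32 is used only as a convenient stable hash, and the range boundaries are made up for illustration.

```python
import bisect
import zlib


def hash_partition(key, num_partitions):
    """Hash strategy: a stable hash of the key modulo the partition count."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def range_partition(key, upper_bounds):
    """Range strategy: upper_bounds is sorted; keys <= upper_bounds[i] go to
    partition i, and keys above the last bound go to the final partition."""
    return bisect.bisect_left(upper_bounds, key)


bounds = [100, 200]  # three ranges: up to 100, 101 to 200, above 200
range_ids = [range_partition(k, bounds) for k in (5, 100, 150, 999)]
hash_ids = [hash_partition(k, 4) for k in ("a", "b", "a")]
```

Hash partitioning sends equal keys to the same partition but scatters neighbouring keys; range partitioning keeps neighbouring keys together, which helps when the output must be sorted.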


The reduce after the shuffle

Saving to HDFS

The data after any shuffle
PROCESS_LOCAL

The Spark that ships with Cloudera Manager is not the latest version and cannot be updated manually (it is supplied by the vendor), so it is not recommended; it mainly suits people who want convenience.
Spark + Tachyon + HDFS will be a golden combination in the future.
