Spark Operating Principles and RDD Analysis (DT Big Data DreamWorks)

Source: Internet
Author: User
Tags: shuffle

Spark is primarily memory-based, falling back to disk when necessary.

Spark first puts data in memory; whatever does not fit spills to disk.

It can therefore process not only data that fits in memory, but also datasets that memory alone cannot hold.

If the data is larger than memory, the data placement strategy and the algorithms need attention, because Spark was originally designed as a unified, one-stack processing engine.

Spark runs at scales from 5-10 machines up to around 8,000 machines.

Big data computing problems Spark addresses: interactive queries (shell-based, Spark SQL), batch processing, machine learning and graph computing, and more.

At the bottom of it all is the RDD (Resilient Distributed Dataset), which supports a wide variety of paradigms on top, such as stream processing, SQL, SparkR, etc.

========== Spark features ============

To understand Spark, understand the following:

1. Distributed, multi-machine computing

Different nodes each process part of the data; each node processes its share without interfering with the others, which is what makes the work parallel.

The cluster manager is responsible for allocating resources to each node; after each node finishes its computation, the partial results are aggregated and the unified output is produced.
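The split-compute-aggregate flow above can be sketched with plain Scala collections (no cluster needed; the three "nodes" here are just sub-lists, a deliberately simplified stand-in for real workers):

```scala
// Simulate 3 nodes each summing its own slice, then aggregating.
val data = (1 to 9).toList

// The "cluster manager" assigns a slice of the data to each of 3 nodes.
val slices = data.grouped(3).toList          // List(List(1,2,3), List(4,5,6), List(7,8,9))

// Each node computes its partial result independently, without interference.
val partials = slices.map(_.sum)             // List(6, 15, 24)

// The partial results are aggregated into the unified output.
val total = partials.sum                     // 45
```

In real Spark the slices are partitions on different machines and the aggregation happens through actions such as `reduce`, but the shape of the computation is the same.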


2. In-memory + disk computing

For example: 3 million records allocated to 3 machines, say 1 million per machine (the split may not be even). On each machine, whatever fits in memory stays in memory, and the rest goes to disk.

3. Iterative computing is the true essence of Spark

The computation is divided into n stages; when one stage ends, the next begins.

A shuffle moves data from one node to another.
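The idea of a shuffle, regrouping records by key so that same-key records end up together, can be sketched locally with a `groupBy` (the two "nodes" here are hypothetical; real Spark moves the data over the network):

```scala
// Records start out spread across two "nodes" in no particular key order.
val node1 = List(("a", 1), ("b", 1))
val node2 = List(("a", 1), ("c", 1))

// A shuffle brings all records with the same key to the same place...
val grouped = (node1 ++ node2).groupBy(_._1)

// ...so they can be reduced together, as reduceByKey does after its shuffle.
val counts = grouped.map { case (k, vs) => (k, vs.map(_._2).sum) }
```

This is why shuffles are expensive: unlike a `map`, they cannot stay within one node's slice of the data.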

========== Development ============

We write the program, and the driver distributes it to each machine.


Why is most production code written in Java?

Because there are far more Java programmers than Scala programmers;

Java integrates more conveniently with Java EE systems;

follow-up maintenance is easier.

Cons: developing Spark in Java is comparatively cumbersome.

The following example is implemented in both Scala and Java.

Development happens on one machine; the job is submitted from another machine.

Data sources Spark can process: Spark workers can draw data from a wide variety of sources; besides HDFS and HBase, also Hive, Oracle, and MySQL.

Note: Hive is a data warehouse and data engine; Spark SQL can implement this functionality, but it does not completely replace Hive.

Where the output can go: HDFS, HBase, Hive, Oracle, S3, or directly back to the client, etc.


========== Running ============

Everything runs on the RDD (Resilient Distributed Dataset).

Elasticity, aspect 1:

Data is sharded and kept in memory by default; if memory cannot hold it all, part is saved to disk. The user does not need to care where the data lives; the RDD switches between memory and disk automatically.

Elasticity, aspect 2:

Efficient lineage-based fault tolerance. Suppose a job has 1,000 steps and an error occurs at step 901: Spark recomputes from the result of step 900 rather than starting over from step 1.

Elasticity, aspect 3:

If a task fails, it is automatically retried a specific number of times. Continuing the example (an error at step 901, retrying from the result of step 900): there is a fixed retry limit, and once it is exhausted, the task fails for good.

Elasticity, aspect 4:

If a stage fails, it is automatically retried a specific number of times, and only the failed shards are recomputed.

Note: a stage is simply a phase of the job.
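Lineage-based recovery can be sketched as replaying the recorded transformations over the source partition. This is a minimal plain-Scala illustration of the idea, not Spark's actual implementation:

```scala
// A lineage is just the ordered list of transformations applied to the source.
val lineage: List[Int => Int] = List(_ * 2, _ + 1)

val sourcePartition = List(1, 2, 3)

// Normal computation: apply each recorded step in order.
def compute(part: List[Int]): List[Int] =
  lineage.foldLeft(part)((p, f) => p.map(f))

val result = compute(sourcePartition)        // List(3, 5, 7)

// If a node dies and its result is lost, the partition is recomputed
// from the lineage instead of being restored from a full data replica.
val recovered = compute(sourcePartition)
```

Because only the recipe is stored, not copies of every intermediate result, recovery is cheap in storage and touches only the lost partitions.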


========== When to cache ============

1. The computation is particularly time-consuming.

2. The computation chain is very long.

3. After a shuffle: if a later step fails, the cached result means the shuffle does not have to be redone.

4. Before a checkpoint: the steps leading up to it are cached, so if the checkpoint step breaks, the earlier results are preserved.
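Why caching after a long chain pays off can be felt with a memoized value: the expensive chain runs once, and every later use reads the saved result instead of recomputing. This is a plain-Scala analogy (a `lazy val` standing in for `rdd.cache()`), not Spark's own mechanism:

```scala
// Counts how many times the "long chain" actually runs.
var runs = 0

def longChain(): List[Int] = {
  runs += 1                        // stands in for minutes of shuffle + compute
  (1 to 5).toList.map(_ * 2)
}

// Without caching: every downstream use recomputes the whole chain.
val a = longChain().sum
val b = longChain().length
val runsWithoutCache = runs        // 2

// With caching: the chain runs once, later uses hit the stored result.
runs = 0
lazy val cached = longChain()
val c = cached.sum
val d = cached.length
val runsWithCache = runs           // 1
```

In Spark the same applies to recovery: if a step after the cache point fails, recomputation restarts from the cached RDD, not from the original source.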

========== RDD example ============

Besides starting Hadoop, also run Spark's start-all.

Hadoop is started with ./start-dfs.sh

Spark is started with its start-all script.

http://master:18080 shows information about jobs that have already run.

Under Spark's bin, run ./spark-shell --master spark://master:7077

val data = sc.textFile("/library/wordcount/input/data") or sc.textFile("hdfs://master:9000/library/wordcount/input/data")


Spark creates the RDD itself.


data.toDebugString shows the RDD's lineage (its dependencies).

It shows that the MapPartitionsRDD is partitioned and distributed across different machines.

data.count triggers the job and counts the records.

http://master:4040 shows the running job in the Spark UI.


The data does not move; the code moves to the data, which stays distributed across the machines.

A block is typically 128 MB; the actual partition size and the block size may differ.

val flatted = data.flatMap(_.split(" "))

This creates a new MapPartitionsRDD.


val mapped = flatted.map(word => (word, 1)) // pair each word with a count of 1


val reduced = mapped.reduceByKey(_ + _)

Values with the same key are summed, which produces a shuffle.


reduced.saveAsTextFile("/library/wordcount/input/data/output/onclick4")
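The whole chain above can be mirrored on plain Scala collections so the per-step types are visible (the real run, of course, goes through sc.textFile and the cluster; the input lines here are made up):

```scala
// Two "lines" of input, standing in for the HDFS file.
val lines = List("spark rdd spark", "rdd shuffle")

// flatMap(_.split(" ")): lines -> words
val flatted = lines.flatMap(_.split(" "))

// map(word => (word, 1)): words -> (word, 1) pairs
val mapped = flatted.map(word => (word, 1))

// reduceByKey(_ + _): same-key values summed; in Spark this step shuffles
val reduced = mapped.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).sum) }
```

Note that only `reduceByKey` needs data from other partitions; everything before it runs independently on each node's slice.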


In addition:


data2222 does not exist, but loading is lazy, so sc.textFile raises no error; only an action such as data.count raises it.
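The same laziness can be felt in plain Scala: a lazy val over a nonexistent file (the path below is made up) raises no error when declared, only when first used, just as sc.textFile on a missing path fails only at the first action:

```scala
import scala.io.Source

// Declaring this touches nothing on disk yet (lazy, like sc.textFile).
lazy val data2222 = Source.fromFile("/no/such/path/data2222").getLines().size

// Only forcing the value (the "action") triggers the failure.
val failedOnlyWhenUsed =
  try { data2222; false }
  catch { case _: java.io.FileNotFoundException => true }
```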

Next: developing, testing, and running programs in Java and Scala with Eclipse.

Homework: write a blog post describing your basic understanding of Spark.

Teacher Liaoliang's card:

China's Spark "first person"

Sina Weibo: http://weibo.com/ilovepains

WeChat public account: Dt_spark

Blog: http://blog.sina.com.cn/ilovepains

Mobile: 18610086859

QQ: 1740415547

Email: [Email protected]


This article is from the "A Flower Proud of the Cold" blog; reprinting is declined.

