Will Spark load the data into memory?

Source: Internet
Author: User
Tags: shuffle

Reprinted from: https://www.iteblog.com/archives/1648

Objective:

Many beginners do not really understand Spark's programming model or the concept of an RDD, and hold some misconceptions. For example, we often assume that a file is read entirely into memory and then transformed step by step. This misunderstanding usually comes from two concepts:

1. The definition of an RDD: an RDD is an immutable, distributed dataset;

2. Spark is an in-memory processing engine.

If you do not explicitly cache/persist an RDD, it is only a conceptual, virtual dataset; you never actually see the complete data of that RDD, because it is never fully materialized in memory.

What is the nature of an RDD?

An RDD is essentially a function, and an RDD transformation is just the nesting of one function inside another (a minimal sketch follows the list below). I think there are two types of RDDs:

1. Input RDDs, typically KafkaRDD, JdbcRDD, and HadoopRDD;

2. Transformed RDDs, such as MapPartitionsRDD.
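
To make this concrete, here is a minimal sketch of the idea in plain Scala. The names SketchRDD and SketchSourceRDD are invented for illustration; this is not Spark's actual source, only the shape of the pattern:

    // A simplified sketch of why an RDD is essentially a function: each RDD only
    // knows how to turn a partition into an iterator, and a transformed RDD just
    // wraps its parent's iterator in another function.
    trait SketchRDD[T] {
      def compute(partitionId: Int): Iterator[T]

      // map() touches no data; it only nests another function around the parent.
      def map[U](f: T => U): SketchRDD[U] = {
        val parent = this
        new SketchRDD[U] {
          def compute(partitionId: Int): Iterator[U] =
            parent.compute(partitionId).map(f)   // lazy: pulls one record at a time
        }
      }
    }

    // An "input" RDD: produces records from a source (here just an in-memory stub).
    class SketchSourceRDD(lines: Seq[String]) extends SketchRDD[String] {
      def compute(partitionId: Int): Iterator[String] = lines.iterator
    }

    // Nothing is computed until some action consumes the iterator:
    val nested = new SketchSourceRDD(Seq("a", "b")).map(_.toUpperCase)
    nested.compute(0).foreach(println)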

Let's take the following code as an example for the analysis:

sc.textFile("abc.log").map().saveAsTextFile("")

textFile builds a HadoopRDD and then returns a MapPartitionsRDD; the map function builds another MapPartitionsRDD; and saveAsTextFile triggers the actual execution.

So an RDD is just the encapsulation of a function, and when the function has processed the data we get the RDD's dataset (which is a virtual one, as will be explained later).
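
For reference, here is a fuller, runnable version of that pipeline. The local[*] master, the _.toUpperCase transformation, and the output path are placeholders chosen for illustration; in spark-shell, sc already exists and the first two lines can be dropped:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("demo").setMaster("local[*]"))

    val lines  = sc.textFile("abc.log")      // builds the HadoopRDD-backed RDD; no data is read yet
    val mapped = lines.map(_.toUpperCase)    // adds a MapPartitionsRDD on top; still no data is read
    mapped.saveAsTextFile("out")             // the action: only now do records start flowing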

HadoopRDD is the data source. Each partition is responsible for fetching its data, and it does so one record at a time through iterator.next. Suppose that at some moment it fetches a record A; A is immediately processed by the map function into B (the transformation is done), B is then written out, and the same thing happens for the next record, over and over. So over the whole process:

1. In theory, the data a MapPartitionsRDD actually holds in memory at any moment equals its number of partitions (one record per partition), which is a very small number.

2. HadoopRDD holds slightly more, because it is the data source and reads the file: assuming the read buffer is 1 MB, at most partitionNum * 1 MB of data is in memory.

3. saveAsTextFile is the same: writing files to HDFS requires a buffer, so the maximum amount of data held is buffer * partitionNum.

So the whole process is actually a streaming one, with each record handled in turn by the functions wrapped in each RDD.
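
The same pull model can be seen with plain Scala iterators, which is what the compute() chain boils down to. The record values and the println "writer" below are placeholders:

    val source: Iterator[String] = Iterator("a", "b", "c")   // stands in for HadoopRDD's record reader
    val mapped: Iterator[String] = source.map { rec =>
      println(s"map sees $rec")                              // runs only at the moment the record is pulled
      rec.toUpperCase
    }
    // Nothing has been computed yet. Only when the "action" consumes the
    // iterator does data flow, one record at a time:
    mapped.foreach(out => println(s"write $out"))            // stands in for saveAsTextFile's writer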

I have mentioned nested functions several times; how do we know they are really nested?

If you write code like this:

sc.textFile("abc.log").map().map().map().saveAsTextFile("")

then with thousands of map calls, a stack overflow is likely. Why? Because the functions are nested too deeply.
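
A hypothetical way to see this (10,000 is an arbitrary count; whether it actually overflows depends on the JVM stack size and the Spark version):

    var rdd = sc.textFile("abc.log")
    for (_ <- 1 to 10000) {
      rdd = rdd.map(identity)      // each call wraps the previous RDD one level deeper
    }
    rdd.saveAsTextFile("out")      // evaluating the deeply nested chain may throw a StackOverflowError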

By the above logic, memory usage is actually very small, and processing 100 TB of data with 10 GB of memory is not difficult. But then why does Spark so often crash because of memory problems? Let's keep reading:

What is the nature of shuffle?

This is why jobs have to be divided into stages. Each stage works exactly as described above: each record is handled by N nested functions (that is, your transformations). When a shuffle is encountered, the pipeline is cut off. A shuffle essentially means the data is temporarily written to disk according to some partitioning rule; it is equivalent to performing a saveAsTextFile action, except that it writes to the local disk. The next stage, on the other side of the cut, then uses this local-disk data as its data source and re-walks the process described above.

Let me describe it in another way:

A shuffle simply slices the processing pipeline, appends a write-to-disk action to the end of one slice (what we call stage M), and turns the data source of the next slice (stage M+1) into the disk files written by stage M. Every stage then follows the description above: each record is processed by N nested functions and is finally stored by an action, whether the shuffle write or the one the user specified.
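
A small word-count example (paths are placeholders) makes the stage cut visible: reduceByKey forces a shuffle, so everything before it is pipelined in one stage and everything after it forms the next.

    val words  = sc.textFile("abc.log").flatMap(_.split(" "))   // narrow, pipelined transformations (stage M)
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)      // reduceByKey introduces the shuffle boundary
    println(counts.toDebugString)                               // the printed lineage shows a ShuffledRDD where the cut happens
    counts.saveAsTextFile("counts")                             // stage M+1 reads the shuffle files written by stage M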

Why a shuffle can easily cause Spark to crash

As mentioned earlier, a shuffle just quietly adds something like a saveAsLocalDiskFile action for you. But writing to disk is expensive, so Spark keeps as much data in memory as possible and then writes the files in batches. Reading the shuffle files back is also memory-hungry. And once you start holding data in memory, a problem appears: how much memory do, say, 10,000 records occupy? That is actually hard to estimate, so it is easy to accidentally run into an out-of-memory error, which is a rather helpless situation.
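
For completeness, two shuffle-related settings govern part of this buffering; the names below exist in Spark 1.6+/2.x, the values are only examples, and you should check the configuration docs for your version:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.shuffle.file.buffer", "32k")   // in-memory buffer per shuffle output stream before hitting disk
      .set("spark.memory.fraction", "0.6")       // share of the heap used for execution (shuffle) and storage (cache)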

What does doing cache/persist mean?

It essentially adds a saveAsMemoryBlockFile-like action to a stage, so that the next time the data is needed it does not have to be recomputed. The data kept in memory represents the result of an RDD's processing, and this is where Spark really is an in-memory engine. In MapReduce you would have to write it to HDFS, but Spark allows you to keep intermediate results in memory.
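
A small placeholder example: persist the intermediate RDD, run one action to materialize it, and the second action is then served from the in-memory blocks instead of re-reading and re-mapping the file.

    import org.apache.spark.storage.StorageLevel

    val cleaned = sc.textFile("abc.log").map(_.trim).persist(StorageLevel.MEMORY_ONLY)
    println(cleaned.count())            // first action: computes the partitions and stores them in memory
    cleaned.saveAsTextFile("cleaned")   // second action: reuses the cached blocks, no recomputation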
