Will Spark load data into memory?


Objective

Many beginners do not really understand Spark's programming model or the concept of an RDD, so some misunderstandings arise.

For example, we often assume that a file is first read fully into memory and then goes through various transformations. This assumption usually comes from being misled by two concepts:

    1. The definition of an RDD: an RDD is an immutable, distributed collection of data
    2. Spark is an in-memory processing engine

If you do not actively cache/persist an RDD, it is just a conceptual, virtual dataset; you never actually see the complete data of that RDD (it is not really put into memory).

What is the nature of an RDD?

An RDD is essentially a function, and an RDD transformation is simply a nesting of functions. In my view, RDDs fall into two categories:

    1. Input RDDs, such as KafkaRDD and JdbcRDD
    2. Transformed RDDs, such as MapPartitionsRDD

Let's analyze the following code as an example:

sc.textFile("abc.log").map(...).saveAsTextFile("...")
    • textFile builds a NewHadoopRDD
    • when map runs, it builds a MapPartitionsRDD
    • saveAsTextFile triggers the execution of the actual processing

So an RDD is just a wrapper around a function; when the function has finished processing the data, we get the RDD's dataset (which is a virtual one, as will be explained later).
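
To make this laziness concrete, here is a minimal sketch of the same chain written out step by step (the variable names and output path are made up for illustration, and sc is assumed to be an existing SparkContext):

    val lines   = sc.textFile("abc.log")          // builds a NewHadoopRDD; nothing is read yet
    val lengths = lines.map(line => line.length)  // builds a MapPartitionsRDD; still nothing is read
    lengths.saveAsTextFile("output")              // the action: only now does data start to flow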

NewHadoopRDD is the data source. Each partition is responsible for obtaining its data, fetching one record at a time via iterator.next. Suppose we fetch a record A; A is immediately processed by the function in map to produce B (the transformation is complete), and B is then written to HDFS. The same is repeated for every other record. So for the whole process:

    • In theory, the number of records a MapPartitionsRDD holds in memory at any moment equals its number of partitions, which is a very small value.
    • NewHadoopRDD holds slightly more, because as the data source it reads files through a buffer. Assuming the read buffer is 1 MB, at most partitionNum * 1 MB of data is in memory.
    • saveAsTextFile behaves the same way: writing files to HDFS needs a buffer, so the maximum amount of data held is buffer * partitionNum.
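
As a worked example with made-up numbers: with a 1 MB read buffer and 200 partitions, NewHadoopRDD keeps at most about 200 MB buffered in memory at any moment, and the write buffers used by saveAsTextFile scale the same way.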

So the whole process is actually a streaming process, with each record handled in turn by the functions wrapped in each RDD.
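
A rough way to picture this streaming behaviour, using plain Scala iterators rather than Spark's actual internals (the file name is only an example), is the following sketch:

    // Each layer wraps the previous iterator; a record is pulled, transformed and
    // consumed before the next record is pulled.
    val source: Iterator[String] = scala.io.Source.fromFile("abc.log").getLines() // plays the role of NewHadoopRDD
    val mapped: Iterator[Int]    = source.map(_.length)                           // plays the role of MapPartitionsRDD
    mapped.foreach(println)                                                        // plays the role of the action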

I mentioned nested functions repeatedly just now; how do we know they are nested?

If you write code like this:

sc.textFile("abc.log").map(...).map(...).map(...).saveAsTextFile("...")

with thousands of maps in the chain, the stack is likely to overflow. Why? Because the function nesting is too deep.
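
As an illustration (the loop count and paths are arbitrary, not from the original article), a chain like the following, built in a loop, can fail with a StackOverflowError when the deeply nested chain is evaluated or serialized:

    var rdd = sc.textFile("abc.log")
    for (_ <- 1 to 5000) {
      rdd = rdd.map(identity)     // each map adds one more layer of nesting
    }
    rdd.saveAsTextFile("output")  // may throw StackOverflowError because of the deep nesting
    // A common workaround is to checkpoint periodically (after sc.setCheckpointDir(...)),
    // which truncates the chain.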

Following the logic above, memory usage is actually very small; processing 100 TB of data with 10 GB of memory is not difficult. But then why does Spark so often die from memory problems? Let's keep reading.

What is the nature of shuffle?

That is why we have to divide the job into stages. Each stage is exactly what I described above: a set of data handled by n nested functions (that is, your transformations). When a shuffle is encountered, the chain is cut. The so-called shuffle essentially spills the data to disk temporarily according to some partitioning rules, equivalent to performing a saveAsTextFile action, except that it saves to local disk. Then the next stage begins, using this local-disk data as its data source and walking through the same process described above again.

Let's describe it another way:

The so-called shuffle simply slices the processing chain: the last segment before the cut (call it stage m) is given an action that stores its output to disk, and the data source of the next segment (stage m+1) becomes the disk files that stage m wrote. Within each stage the processing works exactly as described above, so each record is processed by n nested functions and finally stored through the user-specified action.
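
For example, in a word-count-style job (a sketch with made-up paths, not code from the original article), reduceByKey introduces exactly such a cut:

    val pairs  = sc.textFile("abc.log").flatMap(_.split(" ")).map(word => (word, 1)) // stage m: narrow transformations
    val counts = pairs.reduceByKey(_ + _)   // shuffle boundary: stage m's output is written to local disk, partitioned by key
    counts.saveAsTextFile("counts")         // stage m+1 reads those shuffle files as its data source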

Why shuffle can easily cause Spark to crash

As mentioned earlier, a shuffle just quietly adds something like a saveAsLocalDiskFile action for you. However, writing to disk is an expensive operation, so Spark buffers as much data as possible in memory and then writes files in batches; reading those disk files back also consumes memory. Buffering data in memory raises a problem: how much memory will, say, 10,000 records occupy? That is really hard to predict, so it is easy to accidentally cause a memory overflow. This is, frankly, rather unavoidable.
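
One concrete case where this buffering is hard to predict (a sketch, not taken from the original article): groupByKey has to hold every value for a key together on the reduce side, while reduceByKey only keeps a running aggregate per key:

    val pairs = sc.textFile("abc.log").map(line => (line.take(1), line))
    val risky = pairs.groupByKey()                           // buffers all values of a key together; a skewed key can exhaust memory
    val safer = pairs.mapValues(_ => 1L).reduceByKey(_ + _)  // keeps only one partial sum per key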

What does cache/persist actually do?

In effect, it adds something like a saveAsMemoryBlockFile action to a stage, so that the next time the data is needed it does not have to be recomputed. The data sitting in memory represents the result of an RDD's processing. This is where Spark really is an in-memory engine: in MapReduce you would have to put intermediate results in HDFS, but Spark lets you keep them in memory.
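
A minimal sketch of how this is used in practice (the path and parsing are invented for illustration; sc is assumed to be an existing SparkContext):

    import org.apache.spark.storage.StorageLevel

    val parsed = sc.textFile("abc.log").map(_.split("\t"))
    parsed.persist(StorageLevel.MEMORY_ONLY)   // same as parsed.cache()

    val total  = parsed.count()                // first action: computes the chain and materializes the partitions in memory
    val sample = parsed.take(10)               // second action: served from the cached blocks instead of re-reading the file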

Summary

We have explained, from a somewhat different perspective, what RDDs and shuffle really are.


Original link: http://www.jianshu.com/p/b70fe63a77a8
