Spark: solving data skew by breaking up the hot key

1. Data skew caused by a hot key

In large-scale data statistics and processing, data skew caused by a hot key is very common and very annoying: it often makes a job run far longer than it should, or drives it to an OOM failure. In a WordCount job, for example, if one word is a hot word with a huge number of occurrences, the job's total run time is determined by the run time of the single task that handles that hotspot word. We therefore need to improve the code and optimize the job to make the whole thing run efficiently.
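To make this concrete, here is a minimal WordCount sketch (the function name and input path are placeholders of mine, not from the original article): because reduceByKey shuffles every partial count for a given word to a single task, the task that owns the hot word finishes last and sets the job's run time.

    import org.apache.spark.SparkContext

    // Plain WordCount: all partial counts for a given word are reduced by one task,
    // so a single extremely frequent word turns that task into a straggler.
    def wordCount(sc: SparkContext, inputPath: String) = {
        sc.textFile(inputPath)
            .flatMap(_.split("\\s+"))
            .map(word => (word, 1L))
            .reduceByKey(_ + _)   // the hot word's reducer dominates the runtime
    }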

2. An actual case

Here is a simple practical example:
There is a path "xxx" on HDFS holding several hundred GB of data, and we want to count the total number of rows across all the files under that path.
If the data volume were small, a single line in spark-shell would solve the problem:

scala> sc.textFile("xxx").count()

But when the data volume is large, the job runs unacceptably slowly, and in the end the program exits with an OOM error without ever producing a result. What can we do?

3. Breaking up the hot key to do the calculation

Let's restate the requirement slightly:
Counting the rows of all the data is equivalent to assuming every row maps to the same key, "all". Each row emits the pair ("all", 1), and the job reduces to a simple WordCount over this single hotspot key, followed by a sum.
Since we know exactly which key is hot, the usual practice is to break up (salt) the hot key first and then gather the pieces back together, as the sketches below show.
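For contrast, here is what the unsalted version of that idea would look like (a sketch of mine, not code from the original article): every row emits the same key, so every partial sum is shuffled to the single task that owns "all", and that task becomes the bottleneck.

    import org.apache.spark.SparkContext

    // Naive version: every row maps to the one key "all",
    // so the entire final aggregation funnels through a single reducer task.
    def lineStatsNaive(sc: SparkContext) = {
        sc.textFile("xxx")
            .map(_ => ("all", 1L))
            .reduceByKey(_ + _)   // "all" is the hot key; one task does all the work
    }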
Here is the salted version of the code:

    import org.apache.commons.lang3.StringUtils
    import org.apache.spark.SparkContext

    def lineStats(sc: SparkContext) = {
        val inputPath = "xxx"
        sc.textFile(inputPath)
            .map(x => {
                // salt the hot key "all" with a random prefix
                // (100 buckets here; tune the count to your parallelism)
                val randomNum = (new java.util.Random).nextInt(100)
                val allKey = randomNum + "_all"
                (allKey, 1)
            })
            .reduceByKey((x, y) => x + y)       // first aggregation, on the salted keys
            .map(x => {
                val (keyWithRandom, num) = (x._1, x._2)
                // strip the random prefix to recover the original key
                val key = StringUtils.split(keyWithRandom, "_")(1)
                (key, num.toLong)
            })
            .reduceByKey((x, y) => x + y)       // second aggregation, on the real key
            .map(x => "%s\t%s".format(x._1, x._2))
            .repartition(1)                     // collapse to a single output partition
    }

The idea behind the code is as follows:
1. First, salt the key: give every "all" key a random prefix.
2. Do the first aggregation on the randomly prefixed keys, i.e. the first reduceByKey, to get a partial count per salted key.
3. Then remove the random prefix and do the second aggregation, i.e. the second reduceByKey, to get the final result.

The same two-stage pattern also works when the hot keys are not known in advance, as sketched below.
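For instance, it can be applied to the WordCount from section 1. The following sketch is my own generalization, not code from the original article (the function name and the default bucket count of 100 are arbitrary choices): salt every key, aggregate, strip the salt, aggregate again.

    import org.apache.spark.SparkContext
    import scala.util.Random

    // Generic two-stage aggregation: salt each key with a random bucket id,
    // partially aggregate, then drop the salt and sum the partial results.
    def saltedWordCount(sc: SparkContext, inputPath: String, buckets: Int = 100) = {
        sc.textFile(inputPath)
            .flatMap(_.split("\\s+"))
            .map(word => (Random.nextInt(buckets) + "_" + word, 1L))
            .reduceByKey(_ + _)                  // first aggregation, on salted keys
            .map { case (salted, count) =>
                (salted.split("_", 2)(1), count) // strip the "<bucket>_" prefix
            }
            .reduceByKey(_ + _)                  // second aggregation, on real keys
    }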
