Spark: using CombineTextInputFormat to mitigate too many tasks caused by excessive small files


Transferred from: http://www.cnblogs.com/yurunmiao/p/5195754.html

At present, the platform uses Kafka + Flume for real-time data ingestion. The data in Kafka is written by the business side; part of it is consumed by Spark Streaming, and the rest is stored by Flume into HDFS for data mining or machine learning. In HDFS, the smallest logical unit of the directory hierarchy is the "hour". To guarantee data integrity during computation (i.e., that once an hour directory is fully written, it no longer changes), we apply the following strategy in Flume: every five minutes, close the file currently being written and create a new file to receive subsequent data. This way, computation over the previous hour's directory can begin five minutes into the current hour, which to some extent improves the timeliness of offline data processing.

As the business grew, a business side reported: "the amount of data actually being analyzed in HDFS is small, but the Spark app has a remarkably large number of tasks, which does not seem normal." After following up, we found the problem stems from three factors:

(1) Kafka's real-time write volume is relatively small;
(2) Flume is deployed as multiple instances, each consuming data from Kafka and writing to HDFS;
(3) Flume re-creates the file it writes to every five minutes (as described above).

This scenario directly results in HDFS storing a large number of files that each contain very little data, which in turn inflates the number of tasks in a Spark app. We illustrate with Spark WordCount; the Spark version is 1.5.1. Suppose the HDFS directory "/user/yurun/spark/textfile" contains 16 text files, of which only three contain any data: part-00005, part-00010, and part-00015, each 6 bytes in size. The remaining files are 0 bytes, which matches the small-file scenario.
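The baseline WordCount used for the comparison below can be sketched roughly as follows (a minimal sketch against the Spark 1.5.1 RDD API; the input path is the one from the example above, while the output path and the whitespace-splitting regex are assumptions, since the original post does not show its code):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))

    // textFile uses TextInputFormat: each small file becomes its own split,
    // so the 16 files in this directory yield 16 tasks per stage.
    val lines = sc.textFile("/user/yurun/spark/textfile")

    val counts = lines
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile("/user/yurun/spark/wordcount-output") // hypothetical output path

    sc.stop()
  }
}
```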
Note: _SUCCESS is effectively a "hidden" file and is usually ignored during actual processing.

General implementation

We use SparkContext.textFile for data input. After the application runs, the Spark History Server page shows that execution produced one job containing two stages, each stage with 16 tasks, for a total of 32 tasks. Each stage contains 16 tasks because 16 text files exist in the directory (_SUCCESS does not participate in the computation).

Optimized implementation

In the optimized version, we use SparkContext.newAPIHadoopFile for data input, and the key point is the class org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat: it can combine multiple small files into a single split, and one split is handled by one task, thereby reducing the number of tasks. During execution, this application produces two jobs: Job 0 contains one stage with one task; Job 1 contains two stages, each with one task. The total number of tasks is therefore 3.

As you can see, using org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat can largely alleviate the problem of small files causing too many tasks in a Spark app.
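The optimized input path can be sketched as follows (again a minimal sketch, not the original post's code; the output path is hypothetical, and the split-size cap is an optional assumption, since CombineTextInputFormat otherwise merges as many files as it can into each split):

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object WordCountCombine {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCountCombine"))

    // Optional: cap the size of a combined split (here 128 MB). Without a cap,
    // CombineTextInputFormat may pack all small files into a single split.
    sc.hadoopConfiguration.setLong(
      "mapreduce.input.fileinputformat.split.maxsize", 128L * 1024 * 1024)

    // CombineTextInputFormat packs many small files into few splits, so only
    // a few tasks are launched instead of one task per file.
    val lines = sc.newAPIHadoopFile(
      "/user/yurun/spark/textfile",
      classOf[CombineTextInputFormat],
      classOf[LongWritable],
      classOf[Text]
    ).map(pair => pair._2.toString)

    val counts = lines
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile("/user/yurun/spark/wordcount-combined-output") // hypothetical output path

    sc.stop()
  }
}
```

Choosing the max split size is a trade-off: a smaller cap yields more tasks and more parallelism, a larger cap yields fewer tasks with more data each.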

