Spark vs. MapReduce: 66% Less Execution Time, 40% Lower Computing Cost


MapReduce provides powerful support for big-data mining, but complex mining algorithms often require multiple MapReduce jobs to complete. Between those jobs there are redundant disk read/write passes and repeated resource-request cycles, so MapReduce-based implementations of such algorithms suffer serious performance problems. The up-and-coming Spark, thanks to its strengths in iterative and in-memory computation, can schedule a complex computing task automatically and avoid both the intermediate disk I/O and the repeated resource requests, which makes it well suited to data mining algorithms. Tencent's TDW Spark platform is a deeply customized build of the latest community Spark release, with large improvements in performance, stability, and scale, and it provides strong support for big-data mining tasks.

This article compares TDW Spark and MapReduce implementations of item-based collaborative filtering recommendation. Compared with MapReduce, the TDW Spark implementation reduced execution time by 66% and computing cost by 40%.

Algorithm Introduction

The growth of the Internet has led to an information explosion. Faced with massive amounts of information, how to sift and filter it so that the information a user cares about most is displayed in front of that user has become an urgent problem. A recommendation system connects users with information: on one hand it helps users find information that is useful to them, and on the other hand it puts information in front of the users who are interested in it, benefiting both the information provider and the user.

Collaborative filtering is the most classic and most commonly used family of recommendation algorithms. By analyzing user interests, it finds users in the user group who are similar to a given user, combines those similar users' evaluations of a piece of information, and forms a prediction of how much the given user will like that information. Collaborative filtering can be subdivided into the following three kinds:

user-based CF: evaluates the similarity between users from the ratings different users give to the same items, and makes recommendations based on that user-to-user similarity.

item-based CF: evaluates the similarity between items from the ratings users give to different items, and makes recommendations based on that item-to-item similarity.

model-based CF: first fits a model on historical data, then uses that model to make predictive recommendations.

Problem Description

Input data format: uid,itemid,rating (user uid's rating of item itemid).

Output data: for each itemid, the top N itemids with the highest similarity to it.
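For illustration, with N = 3 (all IDs and values below are hypothetical):

    Input line:   u1,item101,4.5
                  (user u1 rates item101 as 4.5)
    Output line:  item101 -> item205, item318, item042
                  (the 3 items most similar to item101)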

Due to space limitations, we solve this problem here only with the item-based collaborative filtering algorithm.

Algorithm Logic

The basic assumption of item-based collaborative filtering is that two similar items are more likely to be praised by the same user. The algorithm therefore first computes each user's preference for each item, then computes the similarity between items from those preferences, and finally finds the top N items most similar to each item. In detail:

Computing user preferences: rating behavior can differ significantly from user to user, so each user's ratings are first binarized. For example, if a user's rating of an item is greater than that user's own average rating, the preference is marked 1; otherwise it is 0.

Computing item similarity: the Jaccard coefficient is used to compute the similarity of two items. The narrow-sense Jaccard similarity measures the similarity of two sets as the size of their intersection divided by the size of their union. Concretely this breaks into the following three steps (a code sketch follows the steps).

1) Item praise statistics: count the number of users who praise each item.


2) Item pair praise statistics: count, for each pair of items, the number of users who praise both.


3) Compute the similarity of each pair of related items.
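A minimal, framework-agnostic Scala sketch of the preference and similarity computation (in-memory collections with illustrative sample data; the binarization rule is the strictly-greater-than-average rule described above):

    // ratings: (uid, itemId, rating) triples, here as a tiny in-memory sample.
    val ratings: Seq[(String, String, Double)] = Seq(
      ("u1", "i1", 5.0), ("u1", "i2", 2.0), ("u1", "i3", 4.0),
      ("u2", "i1", 5.0), ("u2", "i2", 1.0), ("u2", "i3", 4.0))

    // Binarize: 1 if a rating exceeds the user's own average, else 0.
    // We keep only the "liked" (uid, itemId) pairs, i.e. the 1s.
    val userAvg: Map[String, Double] =
      ratings.groupBy(_._1).map { case (u, rs) => u -> rs.map(_._3).sum / rs.size }
    val liked: Seq[(String, String)] =
      ratings.collect { case (u, i, r) if r > userAvg(u) => (u, i) }

    // Step 1: number of users who praise each item.
    val itemCount: Map[String, Int] =
      liked.groupBy(_._2).map { case (i, xs) => i -> xs.size }

    // Step 2: for each item pair, the number of users who praise both.
    val pairCount: Map[(String, String), Int] = liked
      .groupBy(_._1).values
      .flatMap(userItems => userItems.map(_._2).sorted.combinations(2))
      .toSeq.groupBy(p => (p(0), p(1))).map { case (p, xs) => p -> xs.size }

    // Step 3: Jaccard similarity = |A intersect B| / |A union B|
    //       = co-praise count / (praise(a) + praise(b) - co-praise count).
    val sim: Map[(String, String), Double] = pairCount.map { case ((a, b), co) =>
      (a, b) -> co.toDouble / (itemCount(a) + itemCount(b) - co)
    }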

Finding the top N most similar items: in this step the item similarities also need to be normalized and integrated, after which the top N items most similar to each item are selected. This again breaks into the following three steps (see the sketch after the steps).

1) Normalize the item similarities.


2) Integrate the item similarity scores.


3) Take the top N items with the highest similarity for each item.
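Continuing the sketch, a minimal Scala illustration of these steps. The article does not spell out the exact normalization, so dividing each item's scores by that item's maximum score is assumed here purely for illustration, as are the similarity values and N; the symmetric integration is applied first so each item sees all of its scores:

    // sim: item-pair similarities, e.g. as produced by the previous sketch.
    val sim: Map[(String, String), Double] =
      Map(("i1", "i2") -> 0.4, ("i1", "i3") -> 0.8, ("i2", "i3") -> 0.1)

    // Step 2 (integration): make the relation symmetric so every item sees
    // all of its neighbors, whichever side of the pair it appeared on.
    val bothDirections: Seq[(String, String, Double)] =
      sim.toSeq.flatMap { case ((a, b), s) => Seq((a, b, s), (b, a, s)) }

    // Step 1 (normalization): divide each item's scores by its maximum score.
    val maxPerItem: Map[String, Double] =
      bothDirections.groupBy(_._1).map { case (i, xs) => i -> xs.map(_._3).max }
    val normalized = bothDirections.map { case (a, b, s) => (a, b, s / maxPerItem(a)) }

    // Step 3: the top N most similar items for each item.
    val topN = 3 // illustrative value of N
    val result: Map[String, Seq[(String, Double)]] = normalized
      .groupBy(_._1)
      .map { case (i, xs) => i -> xs.map(t => (t._2, t._3)).sortBy(-_._2).take(topN) }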



Implementation Scheme Based on MapReduce

Under the MapReduce programming model, each of the steps above is implemented as one MapReduce job, for seven MapReduce jobs in total. Every job contains both a map and a reduce phase: the map reads its input from HDFS and its output is sent via shuffle to the reduce; the reduce phase takes <key, Iterable<value>> as input and writes the processed key-value pairs back to HDFS. The operating principle is shown in Figure 1.

Figure 1
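For reference, each of the seven jobs has the same skeletal shape. A sketch in Scala against the Hadoop MapReduce API (class names and the per-record logic are illustrative, not the TDW implementation):

    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapreduce.{Mapper, Reducer}

    // Map side: read a "uid,itemid,rating" line from HDFS and emit a
    // key-value pair; the shuffle groups the pairs by key for the reducer.
    class StepMapper extends Mapper[Object, Text, Text, Text] {
      override def map(key: Object, value: Text,
                       ctx: Mapper[Object, Text, Text, Text]#Context): Unit = {
        val Array(uid, itemId, rating) = value.toString.split(",")
        ctx.write(new Text(uid), new Text(s"$itemId:$rating"))
      }
    }

    // Reduce side: receive <key, Iterable<value>> after the shuffle,
    // aggregate, and write the result back to HDFS.
    class StepReducer extends Reducer[Text, Text, Text, Text] {
      override def reduce(key: Text, values: java.lang.Iterable[Text],
                          ctx: Reducer[Text, Text, Text, Text]#Context): Unit = {
        var n = 0
        val it = values.iterator()
        while (it.hasNext) { it.next(); n += 1 }
        ctx.write(key, new Text(n.toString)) // e.g. a simple count per key
      }
    }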

Seven MapReduce jobs mean seven rounds of reading and writing HDFS, and the jobs' inputs and outputs are interrelated; the relationship among the seven jobs' input and output data is shown in Figure 2.


Figure 2

Implementing this algorithm on MapReduce raises the following problems:

1) Seven MapReduce jobs are needed to implement a single piece of business logic, and data exchange between the seven jobs goes through HDFS, which adds network and disk overhead.

2) All seven jobs must be dispatched to the cluster and run separately, which adds resource-scheduling overhead on the Gaia cluster.

3) MR2 and MR3 read the same data repeatedly, which causes redundant HDFS read overhead.

Together, these problems greatly increase the running time and the computing cost of the job.


Implementation Scheme Based on Spark

Compared with the MapReduce programming model, Spark provides a more flexible DAG (directed acyclic graph) programming model. Besides the traditional map and reduce interfaces, it offers filter, flatMap, union, and other operators, which makes writing Spark programs more flexible and convenient. The business logic above, implemented with the Spark programming interface, is shown in Figure 3.

Figure 3
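A skeletal Scala sketch of how the pipeline in Figure 3 collapses into a single Spark job (RDD API; the paths, N = 10, and the exact per-step logic are illustrative assumptions, not the TDW code):

    import org.apache.spark.{SparkConf, SparkContext}

    object ItemCF {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("item-cf"))

        // One read from HDFS instead of seven: "uid,itemid,rating" lines.
        val ratings = sc.textFile("hdfs:///path/to/ratings") // illustrative path
          .map(_.split(",")).map(a => (a(0), a(1), a(2).toDouble))

        // Binarize against each user's average rating (preference step).
        val userAvg = ratings.map { case (u, _, r) => (u, (r, 1)) }
          .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
          .mapValues { case (s, n) => s / n }
        val liked = ratings.map { case (u, i, r) => (u, (i, r)) }
          .join(userAvg)
          .filter { case (_, ((_, r), avg)) => r > avg }
          .map { case (u, ((i, _), _)) => (u, i) }
        liked.cache() // reused by both the item counts and the pair counts

        // Praise count per item and co-praise count per item pair.
        val itemCount = liked.map { case (_, i) => (i, 1) }.reduceByKey(_ + _)
        val pairCount = liked.groupByKey()
          .flatMap { case (_, items) =>
            items.toSeq.sorted.combinations(2).map(p => ((p(0), p(1)), 1)) }
          .reduceByKey(_ + _)

        // Jaccard similarity, then the top N items per item; one write to HDFS.
        // (itemCount is collected to the driver for brevity; at the scale in
        // this article a join or a broadcast variable would be used instead.)
        val counts = itemCount.collectAsMap()
        val topN = pairCount
          .map { case ((a, b), co) =>
            (a, (b, co.toDouble / (counts(a) + counts(b) - co))) }
          .groupByKey()
          .mapValues(_.toSeq.sortBy(-_._2).take(10))
        topN.saveAsTextFile("hdfs:///path/to/output") // illustrative path
      }
    }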

Relative to MapReduce, Spark optimizes the job's execution time and resource usage in the following ways.

1) DAG programming model. Through Spark's DAG programming model, the seven MapReduce jobs can be reduced to a single Spark job. Spark automatically splits this job into eight stages, each containing several tasks that can run in parallel. Data between stages is passed via shuffle, so in the end HDFS only needs to be read and written once, eliminating six rounds of HDFS I/O and cutting HDFS reads and writes by about 70%.

2) Shared executor resources. When the Spark job starts, it requests all the executor resources it needs once; the tasks of every stage then run as threads that share those executors. Compared with the MapReduce approach, the resources requested are reduced by nearly 90%.

3) RDDs and caching. Spark introduces the RDD (resilient distributed dataset) model: intermediate data is held as RDDs, and RDDs are stored distributed across the slave nodes' memory, which reduces disk reads and writes during the computation. RDDs also provide a cache mechanism; for example, once rdd3 above is cached, both rdd4 and rdd7 can read rdd3's data directly, avoiding the problem of MR2 and MR3 repeatedly reading the same data, as the sketch below illustrates.
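A small illustration of the cache point (the RDD names mirror Figure 3; the data and transformations are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("cache-demo"))
    val rdd2 = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // Without cache(), rdd3 would be recomputed from its lineage once per
    // consumer; with cache(), the second consumer reuses the partitions
    // already materialized in executor memory.
    val rdd3 = rdd2.reduceByKey(_ + _)
    rdd3.cache()
    val rdd4 = rdd3.mapValues(_ * 2)  // first consumer (triggers the compute)
    val rdd7 = rdd3.filter(_._2 > 1)  // second consumer (reads the cache)
    rdd4.count(); rdd7.count()        // actions force the evaluation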

Effect comparison. The test uses resources of the same scale. The MapReduce configuration uses 200 maps and 100 reduces, each with 4G of memory. Since Spark no longer needs reduce resources, and MapReduce's main logic and resource consumption sit on the map side, the Spark runs use 200 and 400 executors, each with 4G of memory. The results are shown in the table below; the input is about 3.8 billion records.

Run mode     Computing resources          Running time (min)   Cost (slot*seconds)
MapReduce    200 map + 100 reduce (4G)    120                  693872
Spark        200 executor (4G)            33                   396000
Spark        400 executor (4G)            21                   504000

Comparing the first and second rows of the table, the Spark approach cuts both running time and cost significantly relative to MapReduce: the DAG model reduces HDFS reads and writes by 70% and the cache avoids re-reading shared input, which shortens the running time and lowers the cost, while the reduced resource scheduling improves overall efficiency.

Comparing the second and third rows, doubling the number of executors reduces the running time from 33 to 21 minutes (about 36%) while the cost rises from 396000 to 504000 slot*seconds (about 27%). Adding executor resources therefore effectively reduces running time, but the speed-up is not fully linear. One reason is that tasks do not all run for the same time: some tasks process more data than others, so a stage can spend its final stretch waiting for a few straggler tasks before the next stage can start. Meanwhile the job keeps holding all of its executors, and the resulting executor idle time drives the cost up.

Summary

Data mining workloads have complex processing logic, and traditional MapReduce/Pig-style frameworks suffer serious performance problems on such tasks. For these tasks, exploiting Spark's strengths in iterative and in-memory computation can significantly reduce both running time and computing cost. TDW now maintains a Spark cluster at the scale of thousands of nodes, and will continue to improve resource utilization, stability, and ease of use to provide even better support for the business.
