MapReduce provides powerful support for mining large data sets, but complex mining algorithms often need several MapReduce jobs to complete. Between those jobs there is redundant disk read/write overhead and repeated resource requests, so MapReduce-based implementations of such algorithms suffer from serious performance problems. The up-and-coming Spark, thanks to its strengths in iterative and in-memory computation, can automatically schedule complex computing tasks and avoid both the intermediate disk I/O and the repeated resource requests, which makes it well suited to data mining algorithms. Tencent's TDW Spark platform is a deeply customized version of the latest community Spark release, with large improvements in performance, stability and scale, and it provides strong support for large-scale data mining tasks.
Using item-based collaborative filtering recommendation as a case study, this article compares the TDW Spark and MapReduce implementations: relative to MapReduce, TDW Spark reduced execution time by 66% and computing cost by 40%.
Algorithm Introduction
The growth of the Internet has led to an information explosion. Faced with massive amounts of information, how to sift and filter it so that the information a user cares about and is most interested in is placed in front of that user has become an urgent problem. A recommendation system connects users with information: on one hand it helps users find useful information, and on the other hand it lets information reach the users who are interested in it, benefiting both the information provider and the user.
Collaborative filtering is the most classic and most widely used family of recommendation algorithms. By analysing users' interests, it finds users in the user group who are similar to a given user, combines those similar users' evaluations of an item, and produces a prediction of how much the given user will like that item. Collaborative filtering can be subdivided into the following three kinds:
user-based CF: user-based collaborative filtering evaluates the similarity between users from their ratings of items, and makes recommendations based on that user-to-user similarity;
item-based CF: item-based collaborative filtering evaluates the similarity between items from users' ratings of different items, and makes recommendations based on that item-to-item similarity;
model-based CF: model-based collaborative filtering first learns a model from historical data, then uses that model to predict recommendations.
Problem description
Input data format: uid,itemid,rating (user uid's rating of itemid).
Output data: for each itemid, the n itemids with the highest similarity to it.
Due to space limitations, this article only uses the item-based collaborative filtering algorithm to solve this example.
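As a hypothetical illustration of the formats, an input line `10001,205,4.5` would mean that user 10001 gave item 205 a rating of 4.5; with n = 3, the output for item 205 would be the three item IDs most similar to it.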
Algorithm logic
The basic assumption of item-based collaborative filtering is that two similar items are more likely to receive praise from the same user. The algorithm therefore first computes each user's preference for items, then computes the similarity between items from those preferences, and finally finds the n items most similar to each item. The detailed description of the algorithm is as follows:
Compute user preferences. Different users' rating scales can differ significantly, so each user's ratings need to be binarized: for example, if a user's rating of an item is greater than that user's average rating, it is marked as a like (1), otherwise as a dislike (0).
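For example (hypothetical numbers), if a user has rated three items 2, 4 and 5, the user's average rating is about 3.7, so the items rated 4 and 5 are marked as likes (1) and the item rated 2 as a dislike (0).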
Compute item similarity, using the Jaccard coefficient as the similarity measure between two items. The Jaccard similarity measures the similarity between two sets: the size of their intersection divided by the size of their union, J(A, B) = |A ∩ B| / |A ∪ B|. The computation is broken into the following three steps.
1 Item like statistics: count the number of users who like each item.
2 Item pair statistics: for every pair of related items, count the number of users who like both.
3 Item similarity calculation: compute the similarity of every related pair of items.
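As a small illustration (hypothetical data): if item A is liked by users {u1, u2, u3} and item B by users {u2, u3, u4}, then |A ∩ B| = 2 and |A ∪ B| = 4, giving a Jaccard similarity of 2/4 = 0.5.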
Find the n most similar items. In this step, the item similarities also need to be normalized and merged before the n items most similar to each item are selected; this breaks down into the following three steps.
1 Normalize the item similarities.
2 Merge the item similarity scores.
3 Get the top N items with the highest similarity for each item.
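As an illustration only (the article does not spell out the normalization formula, so this is an assumption): one common choice is to divide each item's similarity scores by the largest score observed for that item, so that its closest neighbour scores 1.0; after the scores are merged, each item's candidate list is sorted and the first N entries are kept.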
Implementation scheme based on MapReduce
Using the MapReduce programming model, each step above has to be implemented as a MapReduce job, seven MapReduce jobs in total. Each job consists of a map phase and a reduce phase: the map phase reads from HDFS and its output is shuffled to the reduce phase as key-value pairs, and the reduce phase writes the processed key-value pairs back to HDFS. The operating principle is shown in Figure 1.
Seven MapReduce jobs mean seven rounds of reading and writing HDFS, and their inputs and outputs are interrelated; the input/output relationships of the seven jobs are shown in Figure 2.
Implementing this algorithm on MapReduce has the following problems:
To implement one piece of business logic, seven MapReduce jobs are used, and the data exchanged between those jobs goes through HDFS, increasing network and disk overhead.
All seven jobs have to be dispatched to the cluster and run separately, increasing the resource scheduling cost on the Gaia cluster.
MR2 and MR3 read the same data repeatedly, causing redundant HDFS read overhead.
These problems greatly increase both the running time and the cost of the job.
Implementation scheme based on Spark
Compared with the MapReduce programming model, Spark provides a more flexible DAG (directed acyclic graph) programming model that includes not only the traditional map and reduce interfaces but also filter, flatMap, union and other operators, making Spark programs more flexible and convenient to write. Implementing the above business logic with the Spark programming interface is shown in Figure 3.
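To make Figure 3 concrete, below is a minimal Scala sketch of the whole flow expressed as a single Spark job. The HDFS paths, the topN value and the exact operator choices are illustrative assumptions rather than the actual TDW implementation; the structure simply mirrors the steps described earlier (binarize ratings, count likes and co-likes, compute Jaccard similarity, keep the top N), with the reused preference RDD cached.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ItemCF {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ItemCF"))
    val topN = 10                                          // assumed value; the article does not fix n

    // Parse "uid,itemid,rating" lines into (uid, (itemid, rating)).
    val ratings = sc.textFile("hdfs:///path/to/ratings")   // hypothetical path
      .map(_.split(","))
      .map(f => (f(0), (f(1), f(2).toDouble)))

    // Step 1: binarize -- an item counts as "liked" when its rating
    // exceeds the user's own average rating.
    val userMean = ratings
      .mapValues(_._2)
      .groupByKey()
      .mapValues(rs => rs.sum / rs.size)
    val likes = ratings.join(userMean)
      .filter { case (_, ((_, r), mean)) => r > mean }
      .map { case (uid, ((item, _), _)) => (uid, item) }
      .cache()                                             // reused by steps 2a and 2b below

    // Step 2a: per-item like counts, |A| and |B|.
    val itemCount = likes.map { case (_, item) => (item, 1L) }.reduceByKey(_ + _)

    // Step 2b: co-like counts for item pairs liked by the same user, |A ∩ B|.
    val pairCount = likes.groupByKey()
      .flatMap { case (_, items) =>
        val distinctItems = items.toSeq.distinct
        for (a <- distinctItems; b <- distinctItems if a.compareTo(b) < 0)
          yield ((a, b), 1L)
      }
      .reduceByKey(_ + _)

    // Step 2c: Jaccard similarity |A ∩ B| / (|A| + |B| - |A ∩ B|).
    val sims = pairCount
      .map { case ((a, b), both) => (a, (b, both)) }
      .join(itemCount)
      .map { case (a, ((b, both), countA)) => (b, (a, both, countA)) }
      .join(itemCount)
      .map { case (b, ((a, both, countA), countB)) =>
        (a, (b, both.toDouble / (countA + countB - both)))
      }

    // Step 3: keep the topN most similar items for each item (in both directions).
    val topSimilar = sims
      .flatMap { case (a, (b, s)) => Seq((a, (b, s)), (b, (a, s))) }
      .groupByKey()
      .mapValues(_.toSeq.sortBy(-_._2).take(topN))

    topSimilar.saveAsTextFile("hdfs:///path/to/output")    // hypothetical path
    sc.stop()
  }
}
```

In this sketch the single cached likes RDD plays the role of the input that MR2 and MR3 each had to re-read from HDFS in the MapReduce version.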
Relative to MapReduce, Spark optimizes the job's execution time and resource usage in the following ways.
DAG programming model. With Spark's DAG programming model, the seven MapReduce jobs can be simplified into a single Spark job. Spark automatically splits the job into eight stages, each containing several tasks that can execute in parallel; data between stages is passed through the shuffle. Ultimately HDFS only needs to be read and written once, eliminating six rounds of HDFS reads and writes and cutting HDFS I/O by about 70%.
When a Spark job starts, it requests the executor resources it needs once; the tasks of all stages then run as threads that share those executors. Compared with MapReduce, the Spark approach reduces resource requests by nearly 90%.
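For instance (an illustration, not the actual TDW submission used in the test), a Spark-on-YARN job can declare its executors once up front through settings such as `spark.executor.instances` and `spark.executor.memory`; every stage's tasks then run as threads inside those long-lived executors, whereas each of the seven MapReduce jobs has to negotiate its own containers with the resource manager.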
Spark introduces the RDD (Resilient Distributed Dataset) model. Intermediate data is stored as RDDs, and RDDs are distributed across the memory of the worker nodes, which reduces the number of disk reads and writes during the computation. RDDs also provide a cache mechanism: for example, if RDD3 above is cached, both RDD4 and RDD7 can access RDD3's data directly, which, relative to MapReduce, removes the problem of MR2 and MR3 repeatedly reading the same data.
Effect comparison
The test uses resources of the same scale. The MapReduce version uses 200 maps and 100 reduces, each configured with 4 GB of memory. Since Spark no longer needs reduce resources, and the main logic and resource consumption of the MapReduce version are on the map side, the Spark tests use 200 and 400 executors, each with 4 GB of memory. The input contains about 3.8 billion records, and the results are shown in the following table.
Run mode   | Compute resources            | Run time (min) | Cost (slot*seconds)
MapReduce  | 200 map + 100 reduce (4 GB)  | 120            | 693872
Spark      | 200 executors (4 GB)         | 33             | 396000
Spark      | 400 executors (4 GB)         | 21             | 504000
Comparing the first and second rows of the table, Spark's advantage in running efficiency and cost over the MapReduce approach is obvious: the DAG model cuts HDFS reads and writes by 70% and the cache eliminates the duplicate reads, both of which shorten the running time and lower the cost, while the reduction in resource scheduling further improves efficiency.
Comparing the second and third rows, doubling the number of executors reduces the running time from 33 to 21 minutes while increasing the cost by about 25%. This shows that adding executor resources can effectively shorten the running time, but the speed-up is not fully linear. One reason is that the running time of each task is not exactly equal: some tasks process more data than others, so a stage may sit waiting for its last few tasks before the next stage can start. Meanwhile the executors are held for the whole duration of the job, so some executors are idle during those periods, which drives the cost up.
Summary
Data mining workloads have complex processing logic, and traditional MapReduce/Pig frameworks have serious performance problems with this kind of task. For such tasks, taking advantage of Spark's strengths in iterative and in-memory computation significantly reduces both running time and computing cost. TDW now operates Spark clusters at the scale of thousands of machines, and will continue to improve resource utilization, stability, ease of use and other aspects to provide even better support for the business.