Guess you like "recommended algorithm Contest champion sharing _ Data mining

I have recently been organizing some of my past experience; the following article shares the ideas behind my first-place solution in the "Guess You Like" recommendation algorithm contest on DataCastle.

I am yes,boy!, from the College of Computer Science at Northeastern University. I was very fortunate to take first place in the "Guess You Like" recommendation system contest, and below I briefly introduce my approach.
The competition provides about 34 million records from an e-commerce site, each recording the rating (an integer from 1 to 5) that a customer gave a particular product at a particular point in time. The goal is to learn from this data and accurately predict the rating a user will give a particular product at some point in the future.
From this background, it is clear that this is a recommendation-system problem. Research on prediction accuracy in recommender systems has two main directions: one is top-N recommendation, which produces a personalized recommendation list for each user and is usually evaluated by the precision of the recommendations; the other is rating prediction, which is usually measured by RMSE or MAE. This contest uses RMSE to evaluate prediction quality, so the methods below focus on optimizing the RMSE of the predicted ratings.
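For reference, here is a minimal sketch of the RMSE computation that the contest optimizes (the function name is my own):

import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))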
In the concrete process, I think there are a few points worth noting:
1. Analyze the data first and understand its general patterns
2. Try simple methods before complex ones, and tune the complex methods carefully
Based on this, and going from simple to complex, I used three kinds of models:
1. Clustering based recommendations
2. Recommendation based on collaborative filtering
3. Recommendation based on model learning
Each of these three categories contains many methods, and it is impossible to say absolutely which type of model is best; that depends on the specific form and content of the data.
For the first kind of model, I used roughly the following methods:
1. Global mean
2. Item mean
3. User mean
4. User class-item mean
5. Item class-user mean
6. User activity
7. Item activity
8. Improved user activity
9. Improved item activity
...
The common feature of these models is that they group users and items with a hand-designed clustering rule and predict a user's score with the mean rating of the corresponding group. Implementing them also builds a basic understanding of the characteristics of the users and products.

The following is the code for one of these methods (user class-item mean):

import pandas as pd
import numpy as np

train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

# Bucket each user into a group by doubling and truncating their mean score.
rate_rank = train.groupby('uid').mean().loc[:, ['score']].iloc[:, -1]
rate_rank = pd.DataFrame(np.int32((rate_rank * 2).values),
                         index=rate_rank.index, columns=['group'])
rate_rank_des = rate_rank.reset_index()

# Attach each user's group to the train and test records.
train_plus = pd.merge(train, rate_rank_des, how='left', on='uid')
test_plus = pd.merge(test, rate_rank_des, how='left', on='uid')

# Predict with the mean score of each item within each user group;
# unseen (item, group) pairs fall back to the midpoint 3.0.
res = train_plus.groupby(['iid', 'group']).mean().reset_index().loc[:, ['iid', 'group', 'score']]
result6 = pd.merge(test_plus, res, how='left', on=['iid', 'group']).fillna(3.0)

The main method in the second type of model is item-based collaborative filtering. Its core idea is that when predicting a user's rating for an item, we mainly consider items similar to it that the user has already rated. The similarity measure is therefore especially important; options include Euclidean distance, the Pearson correlation, cosine similarity, and adjusted cosine similarity. (User-based collaborative filtering was not used because the user-user similarity matrix would be too large.)

For the implementation, see similarity.py and model4.py.
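Since that code is not included in this post, here is a minimal sketch of item-based collaborative filtering with cosine similarity; the dense matrix layout, function name, and fallback value are my own assumptions rather than the original similarity.py:

import numpy as np

def item_based_predict(ratings, uid, iid, k=20):
    """Predict ratings[uid, iid] from the user's k most similar rated items.

    ratings: dense user x item matrix, 0 where the rating is missing.
    """
    # Cosine similarity between item iid and every other item column.
    target = ratings[:, iid]
    norms = np.linalg.norm(ratings, axis=0) * np.linalg.norm(target) + 1e-9
    sims = ratings.T @ target / norms

    rated = np.nonzero(ratings[uid])[0]        # items this user has rated
    rated = rated[rated != iid]
    top = rated[np.argsort(sims[rated])[-k:]]  # k most similar among them

    weights = sims[top]
    if weights.sum() <= 0:
        return 3.0                             # no usable neighbors: midpoint
    return float(ratings[uid, top] @ weights / weights.sum())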

The third class of models uses the following methods:
1. SVD
2. NMF
3. RSVD
4. SVD++
5. SVDFeature
6. libMF
7. libFM
The common feature of this type of model is matrix factorization: the user-item rating matrix is decomposed into several smaller matrices whose product approximates the original matrix, which makes it possible to predict the entries that are empty in the original matrix. Important parameters in these methods include the number of latent features, the learning rate of the stochastic gradient descent, the regularization coefficients, and the total number of iterations; the optimal values differ from method to method.
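To make the idea concrete, here is a minimal sketch of a regularized matrix factorization trained with SGD (an RSVD-style model; the function name and hyperparameter values are illustrative assumptions):

import numpy as np

def train_mf(triples, n_users, n_items, n_factors=64,
             lr=0.005, reg=0.02, n_epochs=20, mu=3.0):
    """Fit r_ui ~ mu + P[u] . Q[i] by stochastic gradient descent."""
    rng = np.random.default_rng(0)
    P = rng.normal(0, 0.1, (n_users, n_factors))
    Q = rng.normal(0, 0.1, (n_items, n_factors))
    for _ in range(n_epochs):
        for u, i, r in triples:                 # (uid, iid, score) tuples
            pu, qi = P[u].copy(), Q[i].copy()
            err = r - (mu + pu @ qi)
            P[u] += lr * (err * qi - reg * pu)  # regularized gradient steps
            Q[i] += lr * (err * pu - reg * qi)
    return P, Q

The prediction for user u and item i is then mu + P[u] @ Q[i].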
The following describes in more detail the two models that performed best on this problem: SVDFeature and libFM.
SVDFeature is a toolkit for feature-based collaborative filtering and ranking, developed at the Shanghai Jiao Tong University lab where Tianqi Chen worked; the famous XGBoost also comes from him. It conveniently implements SVD, SVD++, and other methods. Its usage involves the following steps:
1. Data preprocessing: user and item IDs are not contiguous and need to be remapped to consecutive integers from 1 up to the number of users/items (see the sketch after this list).
2. Data format conversion: convert the data into the format the tool requires.
3. To save storage space and speed up computation, it is best to convert the input to binary form.
4. Set the various parameters.
5. Predict.
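A minimal sketch of the ID remapping in step 1 (the uid/iid column names are carried over from the earlier code; everything else is an assumption):

import pandas as pd

train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

# Build contiguous 1..N mappings over all IDs seen in train and test.
uid_map = {u: k + 1 for k, u in enumerate(pd.concat([train['uid'], test['uid']]).unique())}
iid_map = {i: k + 1 for k, i in enumerate(pd.concat([train['iid'], test['iid']]).unique())}

for df in (train, test):
    df['uid'] = df['uid'].map(uid_map)
    df['iid'] = df['iid'].map(iid_map)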
With the main parameters set as follows, the online score can reach 7.86:
base_score = 3 (global bias)
learning_rate = 0.005 (learning rate)
wd_item, wd_user = 0.004 (regularization parameters)
num_factor = 4000 (number of latent features)

libFM is a powerful tool dedicated to factorization models. In particular, it implements the MCMC (Markov Chain Monte Carlo) optimization algorithm, which is more accurate than the common SGD optimizer but runs more slowly. Algorithms such as SGD, SGDA (adaptive SGD), and ALS (alternating least squares) are also implemented in libFM.
libFM has many parameters and methods that can be set flexibly, such as -dim (dimensionality), -iter (number of iterations), -learn_rate (learning rate), -method (optimization method), -task (task type), -validation (validation set), and -regular (regularization parameters).
The data processing is similar to the above. With the main parameters set as follows, the best online result of this method is 7.88:
-iter 100 -dim '1,1,64' -method mcmc -task r
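Put together, a full invocation might look like the line below; the input and output file names are assumptions, while the flags come from the libFM manual:

./libFM -task r -train train.libfm -test test.libfm -dim '1,1,64' -iter 100 -method mcmc -out predictions.txt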
In addition, the libMF model also works well; its best result reached 7.85. The SVD in SciPy and the NMF in scikit-learn are very effective on small data sets, but the results were not satisfactory when the data volume got especially large; it may simply be that my tuning and optimization were not good enough.
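For completeness, here is a minimal sketch of the truncated-SVD approach with SciPy, assuming the IDs have already been remapped to 0-based integers; the rank and the zero fill for missing ratings are my assumptions:

import pandas as pd
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

train = pd.read_csv('data/train.csv')  # columns: uid, iid, score

n_users, n_items = train['uid'].max() + 1, train['iid'].max() + 1
R = csr_matrix((train['score'].astype(float), (train['uid'], train['iid'])),
               shape=(n_users, n_items))

# Rank-64 truncated SVD of the zero-filled rating matrix.
U, s, Vt = svds(R, k=64)

def predict(u, i):
    return float(U[u] * s @ Vt[:, i])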

About Fusion
There are two main approaches. One is cascade fusion, in which the predictions of one model are fed into the next model as input while the next model's objective function is adjusted accordingly. The other is weighted model fusion; the simplest form is linear fusion, which uses each model's results on the validation set together with the hyperopt optimization package to find the best blending coefficients, and then applies those coefficients to blend the online predictions.
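A minimal sketch of the linear blending search with hyperopt; the prediction files and the weight normalization are my assumptions:

import numpy as np
from hyperopt import fmin, tpe, hp

# Per-model predictions on the validation set, plus the true scores.
preds = [np.load(f'val_pred_{k}.npy') for k in range(3)]  # placeholder files
y = np.load('val_true.npy')

def loss(ws):
    w = np.array(ws) / (np.sum(ws) + 1e-12)         # normalize to sum to 1
    blend = sum(wk * pk for wk, pk in zip(w, preds))
    return np.sqrt(np.mean((blend - y) ** 2))       # validation RMSE

space = [hp.uniform(f'w{k}', 0.0, 1.0) for k in range(len(preds))]
best = fmin(fn=loss, space=space, algo=tpe.suggest, max_evals=200)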

About Time
I did not use time factors. I later read that time factors can be incorporated into SVD++, which would be expected to improve the model's performance.

Because I did not write my ideas down during the competition and only began summarizing them afterwards, there may be places that are unclear; feel free to point them out, and we can inspire each other in the discussion.
