Project address: https://github.com/Huangtuzhi/AlibabaRecommand
AlibabaRecommand
Alibaba mobile recommendation algorithm competition.
Competition Introduction
The contest provides one month of user behavior data from the mobile terminal and asks for a recommendation of what each user will purchase on the following day.
Directory structure
├── license  # license
└── readme.md  # usage instructions

# table creation
├── create_table.sql  # create the base tables
├── add_table.sql  # tables added later
├── add_index.sql  # build indexes on the tables
├── add_table_31day.sql  # create the tables that hold the 31 days of data, same structure as above
└── add_index_31day.sql  # build indexes on the tables

# data import
├── datatodb.sql  # import the competition's raw CSV data into the base tables
└── featuretodb.sql  # import feature.txt into the corresponding table

# main
├── __init__.py
├── trainmodel.py
├── obtainpredict.py
└── getfeature31day.py

# data
├── feature.txt  # records that match a given criterion (user_id,item_id,look,store,cart,buy)
├── data_features.txt  # n-dimensional features of the records in feature.txt
├── data_features.npy  # the same features converted to matrix format (NumPy library)
├── data_labels.txt  # labels of the records in feature.txt (1/0 = purchased/not purchased)
├── data_labels.npy
├── feature_pos.txt  # all positive examples in feature.txt
├── feature_p.npy
├── feature_neg.txt  # all negative examples in feature.txt
├── feature_p.npy
├── trainset.npy  # training set
├── testset.npy  # test set
└── 31day_data_features.txt  # n-dimensional features of all 31 days of data

# results
├── predict_all_pairs.txt  # all predicted userid itemid pairs
└── filter_pairs.txt  # userid itemid pairs after filtering with train_item
Principle
The competition provides 31 days of data, and we chose day 30 as the split point. We extract n-dimensional features from the first 30 days of data (each [user_id,item_id] pair yields one row of features) and label each row using the real data from day 31.
For example: suppose the [user_id,item_id] pair [9909811,266982489] appears in the first 30 days. If it also appears on day 31 with a behavior_type of purchase, the label for its row is 1; otherwise the label is 0.
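The labeling rule can be sketched as follows. This is only an illustration, not the repository's code: the function name `build_labeled_rows`, the `(user_id, item_id, behavior_type, day)` record layout, the numeric behavior codes, and the use of plain look/store/cart/buy counts as the feature row are all assumptions made for the example.

```python
from collections import defaultdict

# Assumed behavior_type codes. The source only names the four actions
# (look, store, cart, buy); this numeric mapping is hypothetical.
LOOK, STORE, CART, BUY = 1, 2, 3, 4


def build_labeled_rows(records, split_day=30):
    """Return {(user_id, item_id): (feature_row, label)}.

    records: iterable of (user_id, item_id, behavior_type, day) tuples,
    with day counted from 1. The feature row here is just the
    [look, store, cart, buy] counts over days 1..split_day.
    """
    counts = defaultdict(lambda: [0, 0, 0, 0])
    bought_next_day = set()
    for user_id, item_id, behavior_type, day in records:
        pair = (user_id, item_id)
        if day <= split_day:
            counts[pair][behavior_type - 1] += 1
        elif day == split_day + 1 and behavior_type == BUY:
            bought_next_day.add(pair)
    # Only pairs seen in the first split_day days get a row.
    return {
        pair: (row, 1 if pair in bought_next_day else 0)
        for pair, row in counts.items()
    }
```

A pair that shows up only on day 31 produces no row, because there is nothing in the first 30 days to build its features from.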
This produces a large set of labeled feature rows. We feed these rows into logistic regression and obtain a trained binary classification model.
What the model predicts is the label described above, which is the model's output. A label of 1 means we expect the user to buy the item. The model's input is the features extracted from all 31 days of data.
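To make the training step concrete, here is a minimal pure-Python logistic regression fitted by batch gradient descent. It is a sketch under stated assumptions, not what trainmodel.py necessarily does: the function names, the learning rate, the epoch count, and the toy feature rows are all hypothetical.

```python
import math


def sigmoid(z):
    # Logistic function, written to avoid overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(features, labels, lr=0.1, epochs=500):
    """Batch gradient descent on the mean log-loss; returns (weights, bias)."""
    n, dim = len(features), len(features[0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            err = sigmoid(z) - y  # gradient of the log-loss w.r.t. z
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_label(weights, bias, x, threshold=0.5):
    """Return 1 (will buy) if the predicted probability reaches threshold."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if sigmoid(z) >= threshold else 0


# Toy rows in the assumed order [look, store, cart, buy]; made-up data.
toy_features = [[1, 0, 0, 0], [2, 0, 0, 0], [3, 0, 0, 0],
                [1, 0, 2, 0], [2, 1, 3, 0], [0, 0, 2, 1]]
toy_labels = [0, 0, 0, 1, 1, 1]
toy_weights, toy_bias = train_logistic_regression(toy_features, toy_labels)
```

In practice a library implementation would normally replace the hand-written loop; the point here is only the shape of the data going in (one feature row and one 0/1 label per pair) and the model coming out.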
Days 1~30 -> label for day 31
Days 1~31 -> label for day 32
Since the day-31 labels are known, they can be used to evaluate the trained model. The day-32 labels are the output we submit.
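One way to run the evaluation on the known labels is to compare the set of predicted (user_id, item_id) pairs with the pairs that were actually purchased. The sketch below assumes precision, recall, and F1 as the metrics; the text above does not say which metric is used, and `evaluate_pairs` is a hypothetical helper.

```python
def evaluate_pairs(predicted_pairs, actual_pairs):
    """Return (precision, recall, f1) for predicted vs. actual purchase pairs."""
    predicted, actual = set(predicted_pairs), set(actual_pairs)
    hits = len(predicted & actual)  # pairs predicted and actually purchased
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    if hits == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, if 4 pairs are predicted, 3 pairs were actually purchased, and 2 of them overlap, precision is 2/4 and recall is 2/3.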
Description
This is a prediction framework; the feature engineering still needs further improvement.
2015 Alibaba Tianchi big data competition algorithm design