MDD Cup 2017 Notes (Meituan-Dianping's First Internal Algorithm Contest)

Tags: xgboost, GBDT

MDD Cup 2017 was Meituan-Dianping's first internal algorithm contest. The task was to predict delivery time, a regression problem; here I briefly walk through my thinking during the contest as a record.

After getting the data, a quick pass over the contest description showed that the training set covers one month of data and the test set the following month. Importantly, the training set contains 24 hours of data per day, while the test set only requires predictions for the two peak hours at 11:00 and 17:00; the 10:00 and 16:00 data are provided to assist the analysis (so the test set contains only 4 hours of data per day). Since the test set covers only peak hours, the training data needs to be made consistent with it. The first step was to look at the order distribution for each hour of the training set (shown in the chart) and filter out invalid data; based on the hourly order counts, we kept only the data from 10:00 to 20:00.

We want to keep as much data as possible for training: a model's quality depends on its data, features, and parameters, and the more valid data, the better the generalization. For the validation set, however, we kept only the 11:00 and 17:00 data, which keeps the validation distribution consistent with the test distribution. There is much more detail work involved in ensuring this consistency, and it is a big part of the effort.

Now look at the label (the delivery time of each order). The first step is to remove noise by filtering out orders whose delivery times are implausibly short or long.
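As a sketch of the filtering described above (column names like `order_time` and `delivery_seconds` are assumptions for illustration, not the contest's actual schema), keeping only the 10:00-20:00 window and dropping implausible labels might look like:

```python
import pandas as pd

# Hypothetical order log; column names and thresholds are illustrative.
df = pd.DataFrame({
    "order_time": pd.to_datetime(
        ["2017-03-01 09:30", "2017-03-01 11:10",
         "2017-03-01 17:40", "2017-03-01 23:05"]),
    "delivery_seconds": [1800, 2400, 120, 30000],
})

hour = df["order_time"].dt.hour
# Keep only the 10:00-20:00 window to match the peak-hour test set.
df = df[(hour >= 10) & (hour <= 20)]

# Filter label noise: drop deliveries that are implausibly short or long
# (5 minutes / 2 hours here are example cutoffs).
df = df[df["delivery_seconds"].between(5 * 60, 2 * 3600)]
print(len(df))  # only the 11:10 order survives
```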
Then look at the distribution. The original label distribution of the training data is shown on the left; after smoothing it looks like the figure on the right. The goal is to make the label distribution conform to a normal distribution as far as possible: we train on np.log1p(label) and then apply np.expm1() to the predictions to restore the original scale. This can be taken further, for example by segmenting the training data into 30-minute intervals. That is roughly it for this step; after the model is trained, the predicted label can be analyzed in the same way.
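A minimal sketch of the log1p/expm1 round trip on the label (the values are illustrative delivery times in seconds):

```python
import numpy as np

# Illustrative labels: delivery time in seconds.
y = np.array([900.0, 1800.0, 2700.0, 3600.0])

# Train on log1p(label) so the skewed target is closer to normal ...
y_log = np.log1p(y)

# ... and invert the model's predictions with expm1 at inference time.
y_back = np.expm1(y_log)
assert np.allclose(y_back, y)
```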

Now for the features. Features determine the upper limit of a model. In the contest talks, the expert Yan Peng repeatedly reminded everyone that the quality of the final ensemble depends on the quality of the single models, and the quality of a single model depends on its features. In practice most people use raw features, statistical features, combination features, complex features, and future (leakage) information; with those, the final fused model should be able to get under 400 (the contest metric is MAE, in seconds). To go further than that, every step needs more careful work, done with full effort.

A few points first, because any one of these problems can be enough to make someone collapse and give up. The first is handling categorical features. XGBoost does not support categorical features; using them requires one-hot encoding, and if the encoded dimensionality blows up, training becomes very slow, sometimes so slow that a few iterations take a whole day. Iterating too slowly is a cardinal sin in a contest and easily wears people down. One workaround for XGBoost is to encode the categorical features as integers and feed them in as numeric columns. In my experience, across many contests and at work, this performs about the same as one-hot encoding while being much faster. As to how to explain this: when the trees are deep, integer-encoded splits can effectively emulate one-hot encoding, although in practice the trees do not need to be very deep to reach one-hot-level performance; it may also just be the particular cases I happened to run into.

LightGBM, by contrast, supports categorical features natively: no one-hot encoding is needed, you simply mark which columns are categorical and LightGBM handles them itself.
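A minimal sketch of the integer-encoding workaround (the `merchant_id` column is a hypothetical example):

```python
import pandas as pd

orders = pd.DataFrame({"merchant_id": ["m3", "m1", "m3", "m2"]})

# Integer-encode the category instead of one-hot encoding: one numeric
# column instead of thousands of indicator columns, keeping XGBoost fast.
codes, uniques = pd.factorize(orders["merchant_id"])
orders["merchant_code"] = codes
print(orders["merchant_code"].tolist())  # [0, 1, 0, 2]

# With LightGBM no encoding is needed at all -- just declare the column:
#   lgb.Dataset(X, label=y, categorical_feature=["merchant_id"])
# (the column must be integer or pandas 'category' dtype).
```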
One caveat: in some earlier versions of LightGBM, declaring categorical features would make it crash; this was in fact a bug and has been fixed in the latest version.

Also, in a contest, use LightGBM rather than XGBoost whenever you can: once feature extraction has been repeated enough times, XGBoost becomes unbearably slow, one iteration per day, while LightGBM can iterate in a few minutes. That difference is enormous.

Because submissions are limited to 5 per day, how you split the data sets so that an offline improvement also translates into an online improvement after submission is essential; getting this right is equivalent to having unlimited submissions. In general, though, it is hard to find a split where offline gains reliably carry over. An offline gain followed by no improvement, or even a drop, after submission is normal; but if the offline score suddenly improves a lot and the submitted score drops sharply, the extracted features have generally overfit.

Some solutions: first, use cross-validation, assuming each day is independent, splitting the data evenly by day and then cross-validating. The third-place team cleverly used the last 15 minutes of the hour preceding each test hour as an offline validation set, since the hour immediately before each test hour is closest to the test data. Another approach is to train offline on 20 days, test on 5 days, and validate on 5 days. There are many other schemes I will not enumerate here; some choices can also be made by feel, such as the number of boosting rounds for each GBDT.
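The day-wise cross-validation scheme can be sketched with scikit-learn's GroupKFold, treating the day as the group so no day is split across folds (the data here is synthetic):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Synthetic stand-in: 30 days of data, a few samples per day.
rng = np.random.default_rng(0)
day = np.repeat(np.arange(30), 4)          # day index of each sample
X = rng.normal(size=(len(day), 5))
y = rng.normal(size=len(day))

# Assume each day is independent and never split a day across folds,
# mirroring the "divide evenly by day, then cross-validate" scheme.
gkf = GroupKFold(n_splits=5)
for train_idx, val_idx in gkf.split(X, y, groups=day):
    # Every day appears either in training or in validation, never both.
    assert set(day[train_idx]).isdisjoint(day[val_idx])
```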
After the model is trained, we need to analyze feature importance. At first, most people only use the default F-score (split-count) ranking in XGBoost and LightGBM and overlook ranking by gain. To be clear, these metrics only assist us in analyzing the model: a feature ranking near the top does not necessarily mean it is important. That is, if we remove that dimension, the model will not necessarily get worse; it may even improve. In particular, when a newly extracted feature has high importance but the model's score drops rather than rises, do not rush to throw the feature away: it can be used in another model, and the models fused afterwards. This is one reason to fuse multiple GBDTs.

XGBoost: model.get_score(importance_type='gain')
LightGBM: model.feature_importance(importance_type='gain')
On extracting effective features: besides blindly following the usual routines and trying everything, the most important thing is to understand the business. First, understand which factors determine a rider's delivery time. The contest officially provides:

Order: order time, price, number of dishes
User: user location
Rider: rider load
Merchant: meal-preparation capacity, location, order backlog
Region: number of riders
Time of day: lunch peak, evening peak, off-peak
Weather: temperature, wind speed, precipitation

Feature extraction should revolve around these factors. I also talked to some riders in private about what, from the rider's point of view, determines delivery time. The key factors are the merchant's meal-preparation speed and the load in the region. Most importantly, each order in the dispatch system comes with an estimated time, and a rider who exceeds it is penalized, so this time is critical to riders: a rider with several orders to deliver schedules the deliveries around these estimates. This made me wonder whether a model that gets closer to the system's own estimated times would predict more accurately, i.e., whether what we are really doing is fitting the original model. All in all, only by engaging deeply enough with the business can we extract more effective features. The contest took a lot of time and energy, and every time I think of it I shudder, because it reminds me how much work my own middle-page ranking still has left to do.
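As one hedged example of statistical features built around these factors (all column names here are hypothetical), merchant-level and region-level aggregates could be computed and joined back onto the orders:

```python
import pandas as pd

# Hypothetical order log; column names and values are illustrative.
orders = pd.DataFrame({
    "merchant_id": ["m1", "m1", "m2", "m2", "m2"],
    "region_id":   ["r1", "r1", "r1", "r2", "r2"],
    "delivery_seconds": [1500, 1700, 2100, 2000, 2200],
})

# Merchant meal-prep capacity proxy: mean historical delivery time.
merchant_mean = (orders.groupby("merchant_id")["delivery_seconds"]
                 .mean().rename("merchant_mean_delivery"))

# Region load proxy: number of historical orders in the region.
region_load = (orders.groupby("region_id").size()
               .rename("region_order_count"))

features = (orders
            .join(merchant_mean, on="merchant_id")
            .join(region_load, on="region_id"))
print(features[["merchant_mean_delivery", "region_order_count"]])
```

In a real pipeline these aggregates would be computed on a historical window only, to avoid leaking future information into the features.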

The author is a Meituan-Dianping algorithm engineer, mainly responsible for middle-page ranking.
