ICDM Winner's Interview: 3rd place, Roberto Diaz
This summer, the ICDM conference sponsored a competition focused on making individual user connections across multiple digital devices. Top teams were invited to submit a paper for presentation in an ICDM workshop.
Roberto Diaz, competing as team "Cookiemonster", took 3rd place. In this blog, he shares how he became a Kaggle addict, what he values in a competition, and most importantly, details on his approach to this unique dataset. Congrats to Roberto for achieving his goal of becoming a top Kaggle user!
407 players on 340 teams competed in ICDM 2015: Drawbridge Cross-Device Connections
The Basics
What was your background prior to entering this challenge?
In addition to being a Kaggle addict, I am a researcher at Treelogic working in the machine learning area. In parallel, I am working on my PhD thesis at the University Carlos III de Madrid, focused on the parallelization of kernel methods.
Roberto's Kaggle profile
Did you have any prior experience or domain knowledge that helped you succeed in this competition?
I didn't have any knowledge of this domain. The topic is quite new and I couldn't find any papers related to this problem, most probably because there were no public datasets.
How did you get started competing on Kaggle?
I started with the first Facebook competition a long time ago. A friend of mine was taking part in the challenge and he encouraged me to compete. That caught my initial curiosity, so I went to the challenge's forum and read a post with a solution that scored quite well on the leaderboard, and I thought "I think I can do better than that". In the end I finished 9th on the leaderboard.
For my second challenge (EMC Israel Data Science Challenge) I was on a team with my PhD mates. We finished 3rd, receiving a prize.
After that it was too late for me; I had become an addict.
What made you decide to enter this competition?
The things I value most in a challenge are:
- A conference associated with the challenge: It's a good opportunity to publish your results. For example, my solution in the Higgs Boson Machine Learning Challenge:
Díaz-Morales, R., & Navia-Vázquez, A. (September). Optimization of AMS using Weighted AUC optimized models. In *JMLR: Workshop and Conference Proceedings*, Vol, pp. 109-127.
- A domain unknown to me: It's the best way to learn how to work with a different kind of data.
- The need to preprocess and extract the features from raw data to build the dataset: It gives the chance to use your intuition and imagination.
This challenge looked very interesting to me because all the conditions were met.
Let's Get Technical
What preprocessing and supervised learning methods did you use?
In this challenge we had a list of devices and a list of cookies, and we had to tell which cookies belonged to the person using each device.
The most important part is the feature extraction procedure. The features had to contain information about the relation between devices and cookies (for example, the number of IP addresses visited by each one and by both of them).
Once I had the features, I tried both simple and complex supervised machine learning algorithms (my winning methodology was a semi-supervised learning procedure using gradient boosting + bagging), and the score just grew from 0.865 to 0.88.
What was your most important insight into the data?
A key part of the solution is the initial selection of candidates and the post processing:
- Initial Selection: It is not possible to create a training set containing every combination of devices and cookies due to the high number of them. In order to reduce the initial complexity of the problem and to create an affordable dataset, some basic rules were created to obtain an initial reduced set of candidate cookies for every device. The rules are based on the IP addresses that both device and cookie have in common and how frequent they are in other devices and cookies.
- Supervised Learning: Every pattern in the training and test set represents a device/candidate cookie pair obtained by the previous step and contains information about the device (operating system (OS), country, ...), the cookie (cookie browser version, cookie computer OS, ...) and the relation between them (number of IP addresses shared by both device and cookie, number of other cookies with the same handle as the cookie, ...).
- Post Processing: If the initial selection of candidates did not find a candidate with enough likelihood (logistic output of the classifier), we choose a new set of candidate cookies, selecting every cookie that shares an IP address with the device, and we score them using the classifier.
The initial selection of candidates reduces the complexity of the problem, and the post processing step recovers most of the device/cookie pairs lost by that initial selection strategy.
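The interplay between the initial selection and the post processing step can be sketched as follows. This is only a minimal illustration under stated assumptions: the rule "ignore IPs shared by more than `max_per_ip` devices or cookies", the `min_score` threshold, the toy scoring function and all names are hypothetical stand-ins, not the actual rules or classifier used in the solution.

```python
from collections import defaultdict


def invert(entity_ips):
    """Map each IP to the set of entities (devices or cookies) seen on it."""
    by_ip = defaultdict(set)
    for entity, ips in entity_ips.items():
        for ip in ips:
            by_ip[ip].add(entity)
    return by_ip


def candidates(device, device_ips, cookies_by_ip, devices_by_ip, max_per_ip=None):
    """Cookies sharing an IP with the device.

    If max_per_ip is set, IPs used by too many cookies or devices
    (e.g. public networks) are ignored. This is an assumed rule.
    """
    found = set()
    for ip in device_ips[device]:
        cookies = cookies_by_ip.get(ip, set())
        devices = devices_by_ip.get(ip, set())
        crowded = max_per_ip is not None and (
            len(cookies) > max_per_ip or len(devices) > max_per_ip
        )
        if not crowded:
            found |= cookies
    return found


def best_cookie(device, device_ips, cookies_by_ip, devices_by_ip, score,
                max_per_ip=3, min_score=0.5):
    """Pick the top-scoring candidate, widening the search if confidence is low."""

    def top(cands):
        ranked = sorted(cands, key=lambda c: score(device, c), reverse=True)
        return (ranked[0], score(device, ranked[0])) if ranked else (None, 0.0)

    # Initial selection: reduced candidate set.
    cookie, value = top(
        candidates(device, device_ips, cookies_by_ip, devices_by_ip, max_per_ip)
    )
    # Post processing: fall back to every cookie sharing an IP with the device.
    if value < min_score:
        cookie, value = top(
            candidates(device, device_ips, cookies_by_ip, devices_by_ip)
        )
    return cookie


# Toy data (hypothetical): "cafe" is a crowded IP, "home" is not.
device_ips = {"dev_1": {"home", "cafe"}, "dev_2": {"cafe"}}
cookie_ips = {"ck_1": {"cafe"}, "ck_2": {"cafe"}, "ck_3": {"cafe"},
              "ck_4": {"cafe"}, "ck_5": {"home"}}
cookies_by_ip = invert(cookie_ips)
devices_by_ip = invert(device_ips)

toy_scores = {("dev_1", "ck_5"): 0.2, ("dev_1", "ck_1"): 0.9}


def toy_score(device, cookie):
    """Stand-in for the classifier's logistic output."""
    return toy_scores.get((device, cookie), 0.1)


initial = candidates("dev_1", device_ips, cookies_by_ip, devices_by_ip, max_per_ip=3)
matched = best_cookie("dev_1", device_ips, cookies_by_ip, devices_by_ip, toy_score)
```

In this toy case the reduced candidate set only contains `ck_5`, which scores poorly, so the fallback widens the search and recovers `ck_1` from the crowded IP.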
Were you surprised by any of your findings?
Yes. When I sorted the scores obtained by the classifier for every candidate, I saw that if the first score was high and the second was very low, it was extremely likely that the first cookie belonged to the device. I made use of this information to create a semi-supervised learning procedure, updating some features in the training set and retraining the algorithm with this new information to improve the results.
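The selection rule described above (first score high, second score very low) might look like the sketch below. The thresholds `high` and `low`, the input format and the function name are assumptions for illustration; the interview does not state the actual values, nor exactly which training features were updated with the selected pairs.

```python
def confident_matches(scores_by_device, high=0.9, low=0.1):
    """Return {device: cookie} for devices whose best candidate clearly stands out.

    scores_by_device maps each device to {cookie: classifier score}.
    """
    confident = {}
    for device, scores in scores_by_device.items():
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        if not ranked:
            continue
        top_cookie, top_score = ranked[0]
        # A device with a single candidate has no competing second score.
        second_score = ranked[1][1] if len(ranked) > 1 else 0.0
        if top_score >= high and second_score <= low:
            confident[device] = top_cookie
    return confident


# Toy scores (hypothetical).
scores = {
    "dev_1": {"ck_1": 0.97, "ck_2": 0.03},  # clear winner
    "dev_2": {"ck_3": 0.95, "ck_4": 0.80},  # second candidate too strong
    "dev_3": {"ck_5": 0.40},                # best candidate too weak
}
pseudo_labels = confident_matches(scores)
```

Pairs selected this way can be treated as if they were labeled, which is what makes the retraining step semi-supervised.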
This figure shows the F0.5 score and the percentage of devices that fulfill the condition when we match each device with its first candidate cookie, provided the second candidate scores less than a threshold:
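For reference, the F0.5 score mentioned here is the F-beta measure with beta = 0.5, which weights precision more heavily than recall. A minimal sketch of computing it per device and averaging is shown below; the set-based input format and the function names are assumptions for this example, not the competition's official scoring code.

```python
import math


def f_beta(predicted, actual, beta=0.5):
    """F-beta score between a predicted and an actual set of cookies."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def mean_f_beta(predictions, truth, beta=0.5):
    """Average the per-device F-beta over every device in the ground truth."""
    scores = [
        f_beta(predictions.get(device, set()), cookies, beta)
        for device, cookies in truth.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Toy example (hypothetical): one exact match, one prediction with an extra cookie.
truth = {"dev_1": {"ck_1"}, "dev_2": {"ck_3"}}
predictions = {"dev_1": {"ck_1"}, "dev_2": {"ck_3", "ck_4"}}

exact = f_beta({"ck_1"}, {"ck_1"})
with_extra = f_beta({"ck_3", "ck_4"}, {"ck_3"})
average = mean_f_beta(predictions, truth)
```

With one extra wrong cookie, precision drops to 0.5 while recall stays at 1, giving 5/9 for that device; this is why returning only a clearly dominant first candidate can pay off under this metric.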
Which tools did you use?
This solution has been implemented in Python and uses the external software XGBoost.
The Python libraries used were:
How did you spend your time on this competition?
I spent about 20% of the time on feature engineering, 10% on the supervised learning part and 70% eagerly awaiting the results.
What is the run time for both training and prediction of your winning solution?
Too much. The training procedure takes around 9 hours using the cores.
The prediction procedure takes around minutes, since it is necessary to extract some features from the relational database.
Words of Wisdom
What have you taken away from this competition?
I was trying to reach a place in the top of the users' global ranking, and I finally got it.
Regarding the challenge:
- I have learned how useful it is to save intermediate results in order not to repeat the full training procedure only to change the last steps of the algorithm.
- A paper with my approach to the problem for the next ICDM workshop dedicated to the challenge.
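The first point above, saving intermediate results, can be sketched with a small caching helper. This is a generic pattern using Python's standard `pickle` module, not the author's actual code; the helper name `cached`, the file name and the toy feature function are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def cached(path, compute):
    """Load a result from disk if present; otherwise compute it and save it."""
    path = Path(path)
    if path.exists():
        with path.open("rb") as fh:
            return pickle.load(fh)
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(result, fh)
    return result


calls = []


def expensive_features():
    """Stand-in for a slow step such as feature extraction."""
    calls.append(1)
    return {"dev_1": [1, 2, 3]}


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "features.pkl"
    first = cached(cache_file, expensive_features)   # computed and saved
    second = cached(cache_file, expensive_features)  # loaded from disk
```

Wrapping each slow stage this way means that changing only the last steps of a pipeline does not trigger the earlier ones again, as long as the cached files are deleted whenever their inputs change.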
Do you have any advice for those just getting started in data science?
"All hope abandon, ye who enter here".
No, seriously, at the beginning you may feel frustrated because it's a difficult area, but you are in the correct place if you know:
- Statistics more than other software engineers
- Software engineering more than other statisticians.
Bio
Roberto Diaz is a researcher in the R&D Department of Treelogic, a Spanish SME focused on machine learning, computer vision and big data that takes part in many EU research and innovation programmes. In parallel, he is working on his PhD thesis at the University Carlos III de Madrid, focused on the parallelization of kernel methods.