Liberty Mutual Property Inspection, Winner's Interview: Qingchen Wang
The hugely popular Liberty Mutual Group: Property Inspection Prediction competition wrapped up in August, with Qingchen Wang at the top of a crowded leaderboard. A total of 2,362 players on 2,236 teams competed to predict how many hazards a property inspector would count during a home inspection.
This blog outlines Qingchen's approach, and how a relative newbie to Kaggle competitions learned from the community and ultimately took first place.
The Basics
What was your background prior to entering this challenge?
I did my bachelor's in computer science. After working for a few months at EA Sports as a software engineer, I felt a strong need to learn statistics and machine learning, as the problems that interested me the most were about predicting things algorithmically. Since then I've earned master's degrees in machine learning and business, and I've just started a PhD in marketing analytics.
Qingchen's profile on Kaggle
How did you get started competing on Kaggle?
I had an applied machine learning course during my master's at UCL, and the course project was to compete in the Heritage Health Prize. Although at the time I didn't really know what I was doing, it was still a very enjoyable experience. I've competed briefly in other competitions since, but this is the first time I've been able to take part in a competition from start to finish, and it turned out to have been quite a rewarding experience.
What made you decide to enter this competition?
I was in a period of unemployment, so I decided to work on data science competitions full-time until I found something else to do. I actually wanted to do the Caterpillar competition at first and decided to give this one a quick go since the data didn't require any preprocessing to start. My early submissions were not very good, so I became determined to improve and ended up spending the whole time doing this.
What made the competition so rewarding was how much I learned. As more or less a Kaggle newbie, I spent the whole two months trying and learning new things. I hadn't known about methods like gradient boosted trees, or tricks like stacking/blending and the variety of ways to handle categorical variables. At the same time, it was probably the intuition that I developed through previous education that set my model apart from some of the other competitors, so I was able to validate my existing knowledge as well.
Do you have any prior experience or domain knowledge that helped you succeed in this competition?
I have zero prior experience or domain knowledge for this competition. It's interesting because during the middle of the competition I hit a wall, and a number of the top-10 ranked competitors have worked in the insurance industry, so I thought maybe they had some domain knowledge which gave them an advantage. It turned out not to be the case. As far as data science competitions go, I think this one is rather straightforward.
Histogram of all the variables in the dataset, with labels. Script by competition participant Rajiv Shah
Let's Get Technical
What preprocessing and supervised learning methods did you use?
I used only XGBoost (I tried others, but none of them performed well enough to end up in my ensemble). The key to my result is that I also did a binary transformation of the hazards, which turned the regression problem into a set of classification problems. I noticed from the forum thread that some other people also tried this method, but it seems that they didn't go far enough with the binary transformation, as that is the best-performing part of my ensemble.
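The interview does not spell out how the binary transformation was done, so the following is only a minimal sketch of one common way to do it. It assumes integer hazard scores and one binary target per threshold ("is the hazard greater than t?"). The function names, the toy scores, and the recombination by summing exceedance probabilities are illustrative assumptions, not the winning solution's code.

```python
from typing import List, Sequence


def binarize_target(y: Sequence[int], thresholds: Sequence[int]) -> List[List[int]]:
    """Return one 0/1 label vector per threshold: 1 where the score exceeds it."""
    return [[1 if value > t else 0 for value in y] for t in thresholds]


def recombine(exceed_probs: Sequence[Sequence[float]], minimum: int) -> List[float]:
    """Turn per-threshold estimates of P(y > t) back into one score per row.

    For integer scores and consecutive thresholds minimum, minimum + 1, ...,
    the expected score equals minimum plus the sum of P(y > t) over thresholds.
    """
    n_rows = len(exceed_probs[0])
    return [minimum + sum(p[i] for p in exceed_probs) for i in range(n_rows)]


# Toy hazard scores (made up for illustration).
hazards = [1, 4, 2, 7]
thresholds = list(range(1, 7))  # 1..6, assuming scores lie between 1 and 7

binary_targets = binarize_target(hazards, thresholds)

# In practice each label vector would train its own classifier (for example
# XGBoost with the "binary:logistic" objective), and that classifier's
# predicted probabilities would be passed to recombine(). Here the true
# labels stand in for those probabilities.
reconstructed = recombine(binary_targets, minimum=1)
print(binary_targets[0])  # labels for "hazard > 1"
print(reconstructed)
```

With real classifiers the recombined values are fractional scores rather than exact integers, which is enough when only the ordering of predictions matters.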
I also played with different encodings of categorical variables and interactions; nothing sophisticated, just the standard tricks that many others have used.
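The specific encodings are not named in the interview. As a rough illustration of the "standard tricks", here is a sketch of integer (label) encoding, one-hot encoding, and a simple pairwise interaction of two categorical columns. All function names and sample values are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def label_encode(values: Sequence[str]) -> Tuple[List[int], Dict[str, int]]:
    """Map each category to an integer code, assigned in sorted order."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def one_hot_encode(values: Sequence[str]) -> Tuple[List[List[int]], List[str]]:
    """Return one 0/1 indicator column per category, in sorted order."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories


def interact(a: Sequence[str], b: Sequence[str]) -> List[str]:
    """Combine two categorical columns into a single interaction column."""
    return [f"{x}_{y}" for x, y in zip(a, b)]


# Made-up categorical columns for illustration.
col_a = ["B", "A", "C", "A"]
col_b = ["x", "y", "x", "x"]

codes, mapping = label_encode(col_a)
one_hot, categories = one_hot_encode(col_a)
pairs = interact(col_a, col_b)

print(codes)
print(one_hot)
print(pairs)
```

The interaction column can itself be passed through either encoder, which is one simple way to let a tree model see combinations of categories directly.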
Were you surprised by any of your findings?
I'm surprised by how poor our prediction accuracies were. This seemed like a problem that is well suited for data science algorithms, and it is both disappointing and exciting to see such high prediction errors. I guess that's the difference between real life and the toy examples in courses.
Which tools did you use?
I only used XGBoost. It's really been a learning experience for me, as I entered this competition having no idea what gradient boosted trees were. After throwing random forests at the problem and getting nowhere near the top of the leaderboard, I installed XGBoost and worked really hard on tuning its parameters.
XGBoost fans or those new to boosting, check out this great blog post by Jeremy Kun on the math behind boosting and why it doesn't overfit
How did you spend your time on this competition?
Since the variables were anonymous, there wasn't much feature engineering to be done. Instead I treated feature engineering as just another parameter to tune and spent all of my time tuning parameters. My final solution is an ensemble of different specifications, so there were a lot of parameters to tune.
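One way to read "feature engineering as just another parameter" is to put the encoding choice into the same search grid as the model's hyperparameters. The sketch below assumes an exhaustive grid search. `toy_score` is a made-up stand-in for a real cross-validation score, and the grid values are illustrative (`max_depth` and `eta` are real XGBoost parameter names, but the values shown are not from the interview).

```python
from itertools import product
from typing import Any, Callable, Dict, List, Tuple


def grid_search(
    param_grid: Dict[str, List[Any]],
    score_fn: Callable[[Dict[str, Any]], float],
) -> Tuple[Dict[str, Any], float]:
    """Evaluate every combination in the grid and return the best one."""
    keys = list(param_grid)
    best_params: Dict[str, Any] = {}
    best_score = float("-inf")
    for combo in product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, combo))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params: Dict[str, Any]) -> float:
    """Hypothetical scoring function standing in for cross-validation."""
    base = 1.0 if params["encoding"] == "one_hot" else 0.5
    return base - 0.1 * abs(params["max_depth"] - 5) - abs(params["eta"] - 0.01)


# The encoding choice sits in the grid next to the model hyperparameters.
grid = {
    "encoding": ["label", "one_hot"],
    "max_depth": [3, 5, 7],
    "eta": [0.01, 0.1],
}

best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
```

In a real run, `score_fn` would build the features according to `params["encoding"]`, train a model with the remaining parameters, and return a validation score.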
What is the run time for both training and prediction of your winning solution?
The combination of training and prediction of my winning solution takes about 2 hours on my personal laptop (2.2 GHz Intel i7 processor).
Words of Wisdom
What have you taken away from this competition?
One thing I learned, which I've always overlooked before, is that parameter tuning really goes a long way in performance improvements. While in absolute terms it may not be much, in terms of leaderboard improvement it can be of great value. Of course, without the community and the public scripts I wouldn't have won and might still not know about gradient boosted trees, so a big thanks to all of the people who shared their ideas and code. I learned so much from both sources, so it's been a worthwhile experience.
Click through to an animated view of the community's leaderboard progression over time, and the influence of benchmark code sharing. Script by competition participant, inversion
Do you have any advice for those just getting started in data science?
For those who don't already have an established field, I strongly endorse education. All of my data science experience and expertise came from courses taken during my bachelor's and master's degrees. I believe that without already having been so well educated in machine learning, I wouldn't have been able to adapt so quickly to the new methods used in practice and the tricks that people have talked about.
There are now a number of very good education programs in data science, which I suggest everyone who wants to start in data science look into. For those who already have their own established fields and are doing data science on the side, I think their own approaches could be very useful when combined with the standard machine learning methods. It's always important to think outside the box, and it's all the more rewarding when you bring in your own ideas and get them to work.
Finally, don't be afraid to hit walls and grind through long periods of trying out ideas that don't work. A failed idea gets you one step closer to a successful idea, and having many failed ideas can often result in a string of ideas that work down the road. Throughout this competition I tried every idea I thought of and only a few worked. It was a combination of patience, curiosity, and optimism that got me through it. The same applies to learning the technical aspects of machine learning and data science. I still remember the pain that my classmates and I endured in the machine learning courses.
Just for Fun
If you could run a Kaggle competition, what problem would you want to pose to other Kagglers?
I'm a sports junkie, so I'd love to see some competitions on sports analytics. It's a shame that I missed the one on March Madness predictions earlier this year. Maybe one day I'll really run a competition on this stuff.
Editor's note: March Machine Learning Mania is an annual competition, so you can catch it again in 2016!
What is your dream job?
My dream job is to lead a data science team, preferably in an industry that's full of new and interesting prediction problems. I'd be just as happy as a data scientist though, but it's always nice to have greater responsibilities.
Bio
Qingchen Wang is a PhD student in marketing analytics at the Amsterdam Business School, VU Amsterdam, and ORTEC. His interests are in applications of machine learning methods to complex real-world problems in all domains. He has a bachelor's degree in computer science and biology from the University of British Columbia, a master's degree in machine learning from University College London, and a master's degree in business administration from INSEAD. In his free time Qingchen competes in data science competitions and reads about sports.