How Much did It Rain? Winner ' s interview:1st place, Devin Anzelmo

An early insight into the importance of splitting the data on the number of radar scans in each row helped Devin Anzelmo take first place in the How Much Did It Rain? competition. In this blog, he details his approach and shares key visualizations (with code!) from his analysis.

351 players on 321 teams built models to predict probabilistic distributions of hourly rainfall

The basics

What was your background prior to entering this challenge?

My background is primarily in cognitive science and biology, but I dabbled in many different areas while in school. My particular interest is in human learning and behavior, and how we can use human activity traces to learn to shape future actions.

Devin's profile on Kaggle

My interest in gaming as a means of teaching, and my competitive nature, have made Kaggle a great fit for my learning style. I started competing seriously on Kaggle in October 2014. I did not have much experience with programming or applied machine learning, and thought entering a competition would provide a structured introduction. Once I started competing I found I had a difficult time stopping.

What made you decide to enter this competition?

I thought there was a decent chance I could get into the top five in this competition, and this drove me to enter. After finishing the BCI competition I had to decide between the Otto Group Product Classification Challenge and this one. I chose How Much Did It Rain? because the dataset was difficult to process, and it wasn't obvious how to approach the problem. These factors favored my skills. I didn't feel like I could compete in Otto, where the determining factor was going to primarily be ensembling skills.

Let's get technical

What preprocessing and supervised learning methods did you use?

Most of the preprocessing was just feature generation. Like most other competitors I used descriptive statistics and counts of the different error codes. These made up the bulk of my features and turned out to be enough to get first place. We were given QC'd reflectivity data, but instead of using this information to limit the data used in feature generation, I included it as a feature and let the learning algorithm (gradient boosted decision trees) use it as needed.
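As a rough illustration of this kind of feature generation, the sketch below computes descriptive statistics and error-code counts per row. The column name, error-code values, and data layout are assumptions for illustration, not the actual competition schema (though the -99903 code is mentioned later in the interview).

```python
import numpy as np
import pandas as pd

# Assumed sentinel values marking error codes in the raw radar readings.
ERROR_CODES = (-99000.0, -99901.0, -99903.0)

def row_features(reading_str):
    """Descriptive statistics plus error-code counts for one gauge-hour,
    stored as a space-delimited string of radar readings."""
    values = np.array([float(v) for v in reading_str.split()])
    valid = values[~np.isin(values, ERROR_CODES)]
    # One count feature per error code, e.g. "count_-99903".
    feats = {f"count_{int(c)}": int((values == c).sum()) for c in ERROR_CODES}
    if valid.size:
        feats.update(mean=valid.mean(), std=valid.std(),
                     vmin=valid.min(), vmax=valid.max())
    else:  # all readings were error codes
        feats.update(mean=np.nan, std=np.nan, vmin=np.nan, vmax=np.nan)
    return feats

# Toy frame standing in for the real data (column name is hypothetical).
df = pd.DataFrame({"Reflectivity": ["25.5 30.0 -99903.0", "-99000.0 -99000.0"]})
features = pd.DataFrame([row_features(s) for s in df["Reflectivity"]])
```

A feature table like this can then be fed directly to a gradient boosted tree model, which handles the NaN rows on its own.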

The most important decision with regard to supervised learning was how to model the output probability distribution. I decided to model it by transforming the problem into a multi-class classification problem with soft outputs. Since there wasn't enough data to perform classification with the full set of classes, the problem had to be reduced further. It turned out there were many different ways people solved this problem, and I highly recommend reading the end-of-competition thread for some other approaches.

See the code on Scripts

I ended up using a simple method in which basic component probability distributions were combined using the output of a classification algorithm. For classes that had enough data, a step function was used as the CDF. When there was less data, several labels were combined and replaced by a single value. In that case an estimate of the empirical distribution for the combined class was used as the component CDF. This method worked well and I used it for most of the competition. I did try regression and classification just on the data from the minority classes, but it never performed quite as well as just using the empirical distribution.
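A minimal sketch of mixing component CDFs with a classifier's soft output. The three classes, their component shapes, and the probabilities below are illustrative assumptions, not the author's actual class scheme:

```python
import numpy as np

THRESHOLDS = np.arange(70)  # the competition asked for P(rain <= n mm), n = 0..69

def step_cdf(amount_mm):
    """Step-function CDF: jumps from 0 to 1 at the class's rain amount."""
    return (THRESHOLDS >= amount_mm).astype(float)

# Hypothetical component CDFs: two step functions for well-populated
# classes and one empirical-CDF stand-in for a merged minority class.
component_cdfs = np.vstack([
    step_cdf(0),                          # "no rain" class
    step_cdf(2),                          # "light rain" class
    np.minimum(1.0, THRESHOLDS / 40.0),   # merged heavier-rain class
])

def predict_cdf(class_probs):
    """Weight each component CDF by the classifier's soft output."""
    return class_probs @ component_cdfs

pred = predict_cdf(np.array([0.7, 0.2, 0.1]))
```

Because each component is a valid CDF and the weights sum to one, the mixture is itself a valid, monotone CDF.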

What was your most important insight into the data?

Early in the competition I discovered that it was helpful to split the data based on the number of radar scans in each row. Each row has data spanning the hour previous to the rain gauge reading. In some cases there is only one radar scan; in others there are more than 50. There are over one hundred thousand rows in the training set with many radar scans. For this data I wanted to create features which took into account the change in weather conditions over time. In doing this I realized it was not possible to make these features for the rows that had only 1 or 2 radar scans. This was the initial reason for splitting the dataset. When I started looking for places to split it, I found that there was also a strong positive correlation between the number of radar scans and the average rain amount. Of the rows with 1 scan, 95% had 0mm of rain, while in the subset with more scans only 48% of the data had 0mm of rain. Interestingly, for the data with few radar scans many of the most important features were the counts of the error codes.
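The split itself is simple to sketch. The toy frame, column names, and the cutoff of 2 scans below are illustrative assumptions; the point is that each subset gets its own feature set and model:

```python
import pandas as pd

# Toy frame: one row per gauge-hour, with the number of radar scans in
# that hour and the gauge reading (column names are hypothetical).
df = pd.DataFrame({
    "n_scans":  [1, 1, 2, 8, 12, 30],
    "Expected": [0.0, 0.0, 0.5, 1.0, 3.2, 10.1],
})

# Split into subsets that are modeled separately.
low  = df[df["n_scans"] <= 2]   # too few scans for time-varying features
high = df[df["n_scans"] > 2]    # enough scans for temporal features

# Fraction of dry hours in each subset: even in this toy data, more
# scans go with more rain, mirroring the correlation in the real data.
dry_rate_low  = (low["Expected"] == 0).mean()
dry_rate_high = (high["Expected"] == 0).mean()
```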

See the code on Scripts

In contrast, the most important features in the data with many scans were derived from Reflectivity and HybridScan, which have a physical relationship to rain amount. Splitting the data allowed me to use many more features for the higher-scan data, which gave a large boost to the score. Over 65% of the error came from the data with more than 7 scans. The data with few scans contributed a very small amount to the final score, and I was able to spend less time modeling those subsets.

Were you surprised by any of your findings?

The most mysterious aspect of the competition was that a small fraction of the training data had expected rain amounts over 70mm. The requirements of the competition only asked us to model up to 69mm of rain in an hour, but the evaluation metric punished large classification errors so severely that I felt compelled to figure out how to predict these large values. A quick calculation showed that, of the 1.1 million rows in the training set, these large values, if mis-predicted, would account for half of my error.
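The metric was a CRPS-style score: the squared difference between the predicted CDF and the observed outcome's step function, averaged over the 0-69mm thresholds. The sketch below (array shapes and the example values are illustrative) shows why a badly mis-predicted large value is so costly:

```python
import numpy as np

def crps(pred_cdfs, actuals, thresholds=np.arange(70)):
    """Mean squared gap between predicted CDFs (n_samples x 70) and the
    step CDF of each observed rain amount, averaged over all thresholds."""
    # heaviside[i, n] = 1 if threshold n >= actual rain for sample i
    heaviside = (thresholds[None, :] >= actuals[:, None]).astype(float)
    return ((pred_cdfs - heaviside) ** 2).mean()

# Predicting "no rain" (CDF = 1 everywhere) on a 75mm hour is wrong at
# every one of the 70 thresholds, so the penalty is maximal.
no_rain_cdf = np.ones((1, 70))
big_error = crps(no_rain_cdf, np.array([75.0]))
```

Since a label above 69mm sits past every threshold, its true step CDF is zero everywhere in the scored range, so a confident low prediction pays the full squared error at all 70 points; a handful of such rows can dominate the average.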

It turned out that many of the samples with labels above 70mm did not have reflectivity values indicating heavy rain. I was still able to improve my local validation score by treating the large rain amount samples as their own class and using an all-zero CDF in generating the final prediction. Unfortunately this also worsened my public leaderboard score by a large amount.

See the code on Scripts

Through leaderboard feedback I was able to determine that there were differences in the distribution of these large values between the training set and the test set. Removing the rows with large values from the training set turned out to be the best course of action.

My hypothesis about the large values is that they were generated by specific rain gauges, which the learning algorithm was able to detect using features based on DistanceToRadar and the -99903 error code. The -99903 error code can correspond to physical blockage of a radar beam by mountains or other physical objects. Both of these features can help identify specific rain gauges, which would lead to overfitting the training set if there were fixes to the malfunction before the start of 2014. As I don't have access to the labels, this will remain speculation for now.

Which tools did you use?

I used Python for this competition, relying heavily on pandas for data exploration and direct NumPy implementations when I needed things to be fast. This was my first competition using XGBoost, and I was very pleased with its ease of use and speed.

How did you spend your time on this competition?

I probably spent 50% of my time coding, and then having to refactor when I realized my implementation was not flexible enough to incorporate my new ideas. I also tried several crazy things that required substantial programming time which I didn't end up using.

The other 50% was split pretty equally between feature engineering, data exploration, and tweaking my classification framework.

Words of wisdom

What have you taken away from this competition?

I spent many hours coding and refactoring in this competition. Since I had to do nearly the same thing on five different datasets, having to manually code everything made it difficult to try new ideas. Having a flexible framework for trying out many ideas is critical, and this was one of the things I spent time learning how to do in this competition. The effort has already paid off in other competitions.

With only one submission a day, it is important to try things out systematically. What worked best was changing one aspect of my method and seeing whether it improved my score. I needed to keep records of everything I did, or I could waste time redoing things I had already tried. Having the discipline to keep track, and to not try too many things at once, is critical for doing well, and this competition put me to the test in this regard.

Do you have any advice for those just getting started competing on Kaggle?

Read the Kaggle blog post profiling Kazanova for a great high-level perspective on competing. I read this shortly before the end of the competition, and I started saving my models and predictions and automating more of my process, which allowed for some late improvements.

Other than this, I think it is very helpful to read the forums and follow up on hints given by those at the top of the leaderboard. Very often people will give small hints, and I got into the habit of following up on even the smallest clues. This has taught me many new things and helped me find critical insights into problems.

Just for fun

If you could run a Kaggle competition, what problem would you want to pose to other Kagglers?

With the introduction of Kaggle Scripts, it seems it will now be possible to have solution code evaluated remotely instead of requiring competitors to submit a CSV submission file. I think having this functionality opens up the possibility of solving new types of problems that were not feasible in the past.

With this in mind, I would like to run a problem that favors reinforcement learning based solutions. As a simple example, we could teach an agent to explore mazes. The training set would consist of several different mazes (perhaps it would be good for competitors to generate their own training data), and the test set could be another set of unseen mazes hosted on Kaggle. All the training code would be required to run directly in Scripts, making the transition to an evaluation server easy. I don't think this type of problem would have worked without Scripts, and I think it would be fun to see if it's possible to turn agent learning problems into Kaggle competitions.

Another possibility with remote execution of solutions would be a Rock Paper Scissors programming tournament. There are already some RPS tournaments available online. Perhaps hosting a variant as a knowledge competition would be possible, as these types of competitions are really fun.

What is your dream job?

Ideally I would like to work with neural and behavioural data to help improve human performance and alleviate problems related to mental illness. There are many very challenging problems in this area. Unfortunately most of the current classification frameworks for mental illness are deeply flawed. My dream job would allow for the application of diverse descriptions, methods, and sensors, without the need to push a product out immediately.

My sense is that the amount of theoretical upheaval needed is holding back academia, and the ineffectiveness of most current techniques is hampering the development of new businesses (plus the legal issues of the health industry). I would be interested in any project that is making progress through this mire.

Beyond these interests, what motivates me is the ability to explore complex problems and the freedom to try new solutions. Any job in which I can tackle a difficult problem, with the ability to actually try to solve it, is fundamentally interesting to me.

Bio

Devin Anzelmo is a recent graduate of the University of California, Merced, where he received a B.S. in Cognitive Science and a minor in Applied Mathematics. His interests include machine learning, neuroscience, and animal learning and behavior. He enjoys immersing himself in complex problems and puzzles. He currently spends most of his time competing on Kaggle, but is also interested in employment opportunities in the San Luis Obispo area.
