Facebook IV Winner's Interview: 1st place, Peter Best (aka Fakeplastictrees)


Peter Best (aka Fakeplastictrees) took 1st place in Human or Robot?, our fourth Facebook recruiting competition. Finishing ahead of 984 other data scientists, Peter ignored early results from the public leaderboard and stuck to his own methodology (which involved removing select bots from the training set). In this blog, he shares what led to this winning approach and how the competition helped him grow as a data scientist.

The Basics

What was your background prior to entering this challenge?

After studying chemistry for my first degree, I worked for twenty years in quantitative asset management, culminating in being a partner in my own firm. After this, I went back to university to obtain a second degree, in maths this time. Recently, I had been looking for the right project to commit to. My programming experience extends all the way back to a ZX81. Whilst I would never describe myself as a professional programmer, I seem to be able to program.

Kaggle profile for Peter Best (aka Fakeplastictrees)

How did you get started competing on Kaggle?

With all my experience, I decided my area of greatest interest lay in analysing complex data. I also quickly realised that my coding skills were from the previous century, and thus I opted to learn Python. Perhaps the best way to learn a programming language is to actually do something in it, and this led me to Kaggle.

Peter's top competition finishes

Do you have any prior experience or domain knowledge that helped you succeed in this competition?

Not really. Finance is somewhat like auctions, and the people who work in finance are somewhat like bots (occasionally!), but I don't think I used any specialist knowledge in this competition. Many examples of data analysis have similarities despite relating to completely different fields.

What made you decide to enter this competition?

The key to this competition looked like it was going to be feature engineering, something I regard as a strength. In addition, the data set was small enough to manage easily on my Mac with short run times.

Let's Get Technical

What preprocessing and supervised learning methods did you use?

Most features were composed by manipulating the data in the bids database, then transferring some aggregate measure into a smaller database uniquely indexed by bidder. This smaller database was then used as input to a boosted trees model.
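As a rough illustration of this kind of pipeline (not the actual winning code), the per-bidder aggregation followed by boosted trees might look like the sketch below. The file and column names (bids.csv, train.csv, bidder_id, auction, device, country, ip, outcome) and all model parameters are assumptions based on the competition's public data layout.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Assumed layout: bids.csv has one row per bid, train.csv has one row per
# labelled bidder with an `outcome` column (1 = bot, 0 = human).
bids = pd.read_csv("bids.csv")
train = pd.read_csv("train.csv")

# Collapse the bid-level table into per-bidder aggregate features.
features = bids.groupby("bidder_id").agg(
    n_bids=("auction", "count"),
    n_auctions=("auction", "nunique"),
    n_devices=("device", "nunique"),
    n_countries=("country", "nunique"),
    n_ips=("ip", "nunique"),
)

# Join the features onto the labelled bidders; bidders with no bids get zeros.
labelled = train.set_index("bidder_id").join(features).fillna(0)
train_y = labelled["outcome"]
train_x = labelled[features.columns]

# Boosted trees on the small per-bidder table.
model = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05)
model.fit(train_x, train_y)
```

Collapsing the bid-level data first keeps the model's input small, which is consistent with the short run times mentioned elsewhere in the interview.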

What was your most important insight into the data?

As soon as I started doing some cross-validation on an early version of my model, I found that the statistical error on the estimate of the area under the ROC curve (the evaluation metric used in the competition) was going to be quite high. This led me to undertake extensive resampling in my cross-validation to ensure my estimate of the AUC was accurate enough to base decisions upon. In addition, it suggested that the public leaderboard score was likely to be a poor guide to the eventual private leaderboard score and that it was going to be much better to trust my own cross-validation. As a consequence of this, I only actually submitted three entries into the competition.
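A minimal sketch of this kind of resampled cross-validation, reusing the hypothetical train_x and train_y from the previous sketch (the fold counts and model parameters are illustrative, not the actual settings used):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Repeating the stratified k-fold split many times both tightens the estimate
# of the mean AUC and reveals how noisy any single split actually is.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=20, random_state=0)
model = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05)

scores = cross_val_score(model, train_x, train_y, scoring="roc_auc", cv=cv)
print(f"AUC = {scores.mean():.3f} +/- {scores.std():.3f} "
      f"across {len(scores)} resampled folds")
```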

Were you surprised by any of your findings?

One of the most straightforward features to envisage is simply the number of bids made by each bidder. Ex ante, one could suppose that a bot would place far more bids than the average human. When a histogram of the number of bids placed by each bot was considered, however, there was a striking anomaly: five bots were identified as having only one bid each in the data set, which looked unusual in the distribution.
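A hedged sketch of how such a histogram could be produced, again reusing the hypothetical bids and train frames from the earlier sketch; the log scale is an assumption made here to keep the long-tailed distribution readable, not a description of the original plot.

```python
import numpy as np
import matplotlib.pyplot as plt

# Bids per bidder, restricted to the bidders labelled as bots in train.csv.
bids_per_bidder = bids.groupby("bidder_id").size()
bots = train.loc[train["outcome"] == 1, "bidder_id"]
bot_bid_counts = bids_per_bidder.reindex(bots).fillna(0)

plt.hist(np.log10(bot_bid_counts + 1), bins=50)
plt.xlabel("log10(bids per bot + 1)")
plt.ylabel("number of bots")
plt.show()

# The anomaly: bots that placed exactly one bid.
one_bid_bots = bot_bid_counts[bot_bid_counts == 1].index.tolist()
print(len(one_bid_bots), "bots with exactly one bid")
```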

Having only one bid per bot suggests either that there are some links with other bots in the data, or that they have been labelled as bots using data which we don't have. Investigation yielded no obvious links to other bots, and the latter point implies that the data from these bots won't generalise well to a different data set. Thus, the algorithms that I used could be negatively affected by trying to learn from these bot examples. I therefore decided to create a solution that removed these five observations from the training data set and submitted this as one of my selected entries. It was this entry that won the competition.
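A minimal sketch of that removal step, assuming the one_bid_bots list from the previous sketch identifies the five anomalous bidders:

```python
# Drop the anomalous one-bid bots before the final fit. In the winning entry
# the five observations were identified and removed by hand, so treat this
# purely as an illustration of the idea.
mask = ~train_x.index.isin(one_bid_bots)
model_clean = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05)
model_clean.fit(train_x[mask], train_y[mask])
```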

Also, note this from the home page of the competition: there are five bots! Was that a clue?

Which tools did you use?

I did everything in Python, using the excellent Scikit-learn package for the boosted trees algorithm.

How did you spend your time on this competition?

I tend to flit to and fro between the two [feature engineering and machine learning] as I'm going along. I tend to start with some feature engineering, then perform some gross parameter tuning for the machine learning algorithm, then return to some more feature engineering, and so on. In this competition, the feature engineering took up a lot more time than the machine learning. I spent quite a lot of time lovingly hand-crafting features that didn't improve the model in cross-validation, which is normal for this sort of competition. Just because you think a feature should be useful doesn't mean it will be.

What is the run time for both training and prediction of your winning solution?

Just a few minutes, though each cross-validation run took around an hour.

Words of Wisdom

What have you taken away from this competition?

Enhanced self-belief in persevering with what I know to be the correct methodology and not getting distracted by things that don't really matter (in this competition, that meant the public leaderboard!).

Also, a realisation that even though I think my methods gave me the best chance of doing well, actually winning the competition required the cards to fall my way.

Do you have any advice for those just getting started in data science?

Firstly, look through the Kaggle website, especially the forums, to pick up a vast amount of good advice on how to improve your data science.

Mostly though, my advice is to just have a go, make mistakes, and learn from them.

Bio

Peter Best is a graduate in both chemistry and mathematics. He worked for many years in the quantitative asset management industry, where he performed extensive data analysis. He is currently competing in Kaggle competitions and trying to decide on his next project.
