CrowdFlower Winner's Interview: 1st place, Chenglong Chen
The CrowdFlower Search Results Relevance competition asked Kagglers to evaluate the accuracy of e-commerce search engines on a scale of 1-4 using a dataset of queries & results. Chenglong Chen finished ahead of 1,423 other data scientists to take first place. He shares his approach with us from his home in Guangzhou, Guangdong, China. (To compare winning methodologies, you can read a write-up from the third place team here.)
The competition ran from May 11-July 6, 2015.
The Basics
What was your background prior to entering this challenge?
I was a Ph.D. student at Sun Yat-sen University, Guangzhou, China, and my research mainly focused on passive digital image forensics. I have applied various machine learning methods, e.g., SVM and deep learning, to detect whether a digital image has been edited/doctored, or how much the image under investigation has been resized/rotated.
Chenglong's profile on Kaggle
I am very interested in machine learning and have read quite a lot of related papers. I also love to compete on Kaggle to test out what I have learnt and to improve my coding skills. Kaggle is a great place for data scientists, and it offers real world problems and data from various domains.
Do you have any prior experience or domain knowledge that helped you succeed in this competition?
I have a background in image processing and had limited knowledge about NLP beyond BOW/TF-IDF kinds of things. During the competition, I frequently referred to the book Python Text Processing with NLTK 2.0 Cookbook or Googled for how to clean the text or create features from text.
I did read the paper about ensemble selection (which was the ensembling method I used in this competition) a long time ago, but I hadn't had the opportunity to try it out myself on a real world problem. I had previously only tried simple (weighted) averaging or majority voting. This was the first time I got so serious about the model ensembling part.
How did you get started competing on Kaggle?
It dates back a year and a half ago. At that time, I was taking Prof. Hsuan-Tien Lin's Machine Learning Foundations course on Coursera. He encouraged us to compete on Kaggle to apply what we had learnt to real world problems. From then on, I have occasionally participated in competitions I find interesting. And to be honest, most of my programming skills in Python and R were learnt during kaggling.
What made you decide to enter this competition?
After I passed my Ph.D. dissertation defense in early May, I had some spare time before starting my job at an Internet company. I decided that I should learn something new and get prepared for my job. Since my job would be on advertising and mostly NLP related, I thought this challenge would be a great opportunity to familiarize myself with some basic or advanced NLP concepts. This was the main reason that drove me to enter.
Another reason was that this dataset isn't very large, which is ideal for practicing ensembling skills. While I had read papers about ensembling methods, I hadn't got very serious about ensembling in previous competitions. Usually, I would try very simple (weighted) averaging. I thought this was a good chance to try out some of the methods I had read about, e.g., stacked generalization and ensemble selection.
Let's Get Technical
What preprocessing and supervised learning methods did you use?
The documentation and code for my approach are available here. Below is a high level overview of my method.
Figure 1. Flowchart of my method
For preprocessing, I mainly performed HTML tag dropping, word replacement, and stemming. For the supervised learning method, I used ensemble selection to generate an ensemble from a model library. The model library was built with models trained using various algorithms, various parameter settings, and various feature sets. I used Hyperopt (usually used for parameter tuning) to choose parameter settings from a pre-defined parameter space for training different models.
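As a rough illustration of ensemble selection, the greedy procedure from Caruana et al.'s paper can be sketched as below (this is a generic sketch with toy names of my own choosing, not the winning implementation):

```python
import numpy as np

def ensemble_selection(library, y_valid, metric, n_iter=10):
    """Greedy forward selection with replacement over a model library.

    library: dict mapping model name -> validation-set predictions
    metric:  callable(y_true, y_pred) -> score (higher is better)
    """
    chosen, ens = [], None
    for _ in range(n_iter):
        best = None
        for name, pred in library.items():
            # Candidate ensemble: running mean including this model.
            cand = pred if ens is None else (ens * len(chosen) + pred) / (len(chosen) + 1)
            score = metric(y_valid, cand)
            if best is None or score > best[1]:
                best = (name, score, cand)
        chosen.append(best[0])   # selection is *with replacement*
        ens = best[2]
    return chosen, ens
```

Because selection is with replacement, strong models get picked repeatedly, which implicitly weights them more heavily in the final average.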
I tried various objectives, e.g., MSE, softmax, and pairwise ranking. MSE turned out to be the best with an appropriate decoding method. The following is the decoding method I used for MSE (i.e., regression):
- Calculate the pdf/cdf of each median relevance level: 1 is about 7.6%, 1 + 2 is about 22%, 1 + 2 + 3 is about 40%, and 1 + 2 + 3 + 4 is 100%.
- Rank the raw predictions in ascending order.
- Set the first 7.6% to 1, 7.6% – 22% to 2, 22% – 40% to 3, and the rest to 4.
In CV, the pdf/cdf is calculated using the training fold only, and in the final model training, it is computed using the whole training data.
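A minimal sketch of this decoding step (the default cutoffs below are the approximate cdf values quoted above; in practice they are recomputed from the training fold):

```python
import numpy as np

def cdf_decode(raw_pred, cdf=(0.076, 0.22, 0.40, 1.0)):
    """Map raw regression outputs to relevance levels 1-4 by rank.

    `cdf` holds the cumulative share of each relevance level; the
    defaults are the approximate training-set values quoted above.
    """
    raw_pred = np.asarray(raw_pred)
    order = np.argsort(raw_pred)          # ascending by raw prediction
    labels = np.empty(len(raw_pred), dtype=int)
    prev = 0
    for level, c in enumerate(cdf, start=1):
        cut = int(round(c * len(raw_pred)))
        labels[order[prev:cut]] = level   # next slice of ranks -> level
        prev = cut
    return labels
```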
Figure 2 shows some histograms from my reproduced best single model for one run of CV (only one validation fold is used). Specifically, I plotted histograms of 1) raw prediction, 2) rounding decoding, 3) ceiling decoding, and 4) the above cdf decoding, grouped by the true relevance. It's most obvious that both the rounding and ceiling decoding methods have difficulty in predicting relevance 4.
Figure 2. Histograms of raw prediction and predictions using various decoding methods grouped by true relevance. (The code to generate this plot is available here.)
Following are the kappa scores for each decoding method (using 3-run, 3-fold CV). The above cdf decoding method exhibits the best performance among the three methods considered.
Method | CV Mean | CV Std
------ | ------- | ------
Rounding | 0.404277 | 0.005069
Ceiling | 0.513138 | 0.006485
CDF | 0.681876 | 0.005259
What was your most important insight into the data?
I found that the most important features for predicting search results relevance were the correlation or distance between query and product title/description. In my solution, I had features like intersect word counting features, Jaccard coefficients, Dice distance, and cooccurrence word TF-IDF features, etc. Also, it's important to perform some word replacements/alignments, e.g., spelling correction and synonym replacement, to align those words with the same or similar meaning.
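For illustration, the Jaccard coefficient and Dice coefficient between a query and a product title can be computed over word sets (a minimal sketch, not the competition code):

```python
def jaccard(a, b):
    """Jaccard coefficient between the word sets of two strings."""
    a, b = set(a.lower().split()), set(b.lower().split())
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def dice(a, b):
    """Dice coefficient between the word sets of two strings."""
    a, b = set(a.lower().split()), set(b.lower().split())
    denom = len(a) + len(b)
    return 2 * len(a & b) / denom if denom else 0.0
```

A query like "led tv" against the title "samsung led tv 42 inch" shares two words, giving a Jaccard score of 2/5 and a Dice score of 4/7; misspelling correction beforehand keeps such overlaps from being missed.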
While I didn't have much time to explore word embedding methods, they are very promising for this problem. During the competition, I came across a paper entitled "From word embeddings to document distances". The authors of this paper used the Word Mover's Distance (WMD) metric together with word2vec embeddings to measure the distance between text documents. This metric is shown to have superior performance to BOW and TF-IDF features.
Were you surprised by any of your findings?
I tried optimizing kappa directly using XGBoost (see below), but it performed a bit worse than plain regression. This might have something to do with the Hessian, which I couldn't get to work unless I used some scaling and changed it to its absolute value (see here).
Which tools did you use?
I used Python for this competition. For the feature engineering part, I heavily relied on pandas and NumPy for data manipulation, and TfidfVectorizer and SVD in sklearn for extracting text features. For the model training part, I mostly used XGBoost, sklearn, Keras and RGF.
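A minimal sketch of that kind of text-feature pipeline (toy documents and assumed parameter values, not the winning configuration):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "led tv 42 inch",
    "samsung led tv",
    "usb cable 2m",
    "hdmi cable gold plated",
]

# Word-level TF-IDF followed by truncated SVD (i.e., LSA) to obtain
# a small dense representation of each document.
tfidf = TfidfVectorizer(ngram_range=(1, 2))
X_sparse = tfidf.fit_transform(docs)
svd = TruncatedSVD(n_components=2, random_state=0)
X_dense = svd.fit_transform(X_sparse)   # shape: (4, 2)
```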
I would like to say a few more words on XGBoost, which I have been using very often. It is great: accurate, fast and easy to use. Most importantly, it supports customized objectives. To use this functionality, you have to provide the gradient and Hessian of your objective. This was quite helpful in my case. During the competition, I tried to optimize quadratic weighted kappa directly using XGBoost. Also, I implemented two ordinal regression algorithms within the XGBoost framework (both by specifying a customized objective). These models contributed to the final winning submission too.
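The custom-objective hook can be illustrated with plain squared error (this matches the `obj=` callback signature that `xgboost.train` expects, but it is not the author's kappa objective, which additionally needed the Hessian scaled and made positive):

```python
import numpy as np

def squared_error_obj(preds, dtrain):
    """Custom objective in the form xgboost.train(..., obj=...) expects:
    return the per-row gradient and Hessian of the loss
    0.5 * (pred - label)^2 with respect to the prediction."""
    labels = dtrain.get_label()
    grad = preds - labels          # first derivative of the loss
    hess = np.ones_like(preds)     # second derivative (constant here)
    return grad, hess
```

A kappa or ordinal-regression objective plugs into the same hook: only the `grad` and `hess` computations change.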
How did you spend your time on this competition?
Where I spent my time changed as the competition progressed.
- In the early stage, I mostly focused on data preprocessing. I spent quite a lot of time researching and coding up methods to perform text cleaning. I have to mention that quite a lot of effort was spent on exploring the data (e.g., figuring out misspellings and synonyms, etc.)
- Then, I spent most of my time on feature extraction, trying to figure out what features would be useful for this task. The time was split pretty equally between researching and coding.
- In the same period, I decided to build a model using ensemble selection and realized my implementation was not flexible enough for that goal. So, I spent most of the time refactoring my implementation.
- After this, most of my time was spent on coding up the training and prediction parts of various models. I didn't spend much time on tuning each model's performance. I utilized Hyperopt for parameter tuning and model library building.
- With the pipeline for ensemble selection built, most of my time was spent on figuring out new features and exploring the provided data.
In short, I would say I did a lot of researching and coding during this competition.
What was the run time for both training and prediction of your winning solution?
Since the dataset is kind of small and kappa isn't very stable, I utilized bagged ensemble selection from a model library containing hundreds or thousands of models to combat overfitting and stabilize my results. I don't have an exact number of hours or days, but it should take quite a large amount of time to train and make predictions. Furthermore, this also depends on the trade-off between the size of the model library (computation burden) and the performance.
That being said, you should be able to train the best single model (i.e., XGBoost with linear booster) in a few hours. It'll give you a model with a kappa score of about 0.708 (Private LB), which should be enough for a top place. For this model, feature extraction occupied most of the time. The training part (using the best parameters I found) should take a few minutes using multiple threads (e.g., 8).
Words of Wisdom
What have you taken away from this competition?
- Ensembling a bunch of diverse models helps a lot. Figure 3 shows the CV mean, Public LB, and Private LB scores of my best Public LB submissions generated using ensemble selection. As time went by, I trained more and more different models, which turned out to be helpful for ensemble selection in both CV and Private LB.
- Do not ever underestimate the power of linear models. They can be much better than tree-based models or SVR with RBF/poly kernels when using raw TF-IDF features. They can be even better if you introduce appropriate nonlinearities.
- Hyperopt is very useful for parameter tuning, and can be used to build the model library for ensemble selection.
- Keep your implementation flexible and scalable. I was lucky to refactor my implementation early on. This allowed me to add new models to the model library very easily.
Figure 3. CV mean, Public LB, and Private LB scores of my best Public LB submissions. One standard deviation of the CV score is plotted via error bars. (The code to generate this plot is available here.)
Do you have any advice for those just getting started in data science?
- Use Google to find a few relevant papers, especially if you are not a domain expert.
- Read the winning solutions from previous competitions. They contain lots of insights and tricks, which are quite inspiring and useful.
- Practice makes perfect. Choose one competition that you are interested in on Kaggle and start kaggling today (and every day)!
Bio
Chenglong Chen is a recent graduate of Sun Yat-sen University (SYSU), Guangzhou, China, where he received a B.S. degree in Physics and recently got a Ph.D. degree in Communication and Information Systems. As a Ph.D. student, his research interests included image processing, multimedia security, pattern recognition, and in particular digital image forensics. He'll be starting his career at Tencent this August, working on advertising. Chenglong can be reached at: [email protected]