How to Use Machine Learning to Solve Practical Problems: The Keyword Relevance Model as an Example


Using the literal relevance model of the Baidu keyword search recommendation tool as an example, this article walks through the design and implementation of a concrete machine learning task: goal setting, training data preparation, feature selection and filtering, and model training and optimization. The approach extends naturally to semantic relevance models, and its design and implementation can also be transplanted to related learning tasks such as search engine relevance and learning to rank (LTR).

Target setting: improve keyword search relevance

As a search-plus-recommendation product, the Baidu keyword search recommendation system recommends business-relevant keywords to Fengchao (Baidu's advertising platform) users. For example, when an advertiser who sells flowers wants to run keyword promotion on Baidu, he needs to submit keywords related to his business, such as 'flowers'. He can search in the Baidu keyword search recommendation system and select the keywords that suit him.

Figure: query search in the Baidu keyword search recommendation system

This is a typical search problem involving many factors: how to query, trigger, and sort, and how to handle regional signals. To improve search quality, we must first ensure relevance between the input query and the recommended words. The main problem to solve here is how to judge, quickly and accurately, the relevance between two keywords (the input query and a recommended word). Note that our main goal is to make the product results feel reliable to users, so we consider only literal relevance; broader semantic extensions are out of scope.

Note: the research and implementation of this model translate easily to semantic relevance. For example, adding more semantic features, such as PLSA-based similarity and word2vec similarity features (or extended relevance validation, such as expanding a word with abstracts from its Baidu search results), increases the contribution of semantic signals.

Relevance is the cornerstone of all search problems, but different systems use it differently. In general web search, relevance carries a large weight and ranking is largely driven by it; in commercial systems, relevance often serves as a display threshold that controls the quality of promoted results (if only CTR were considered, showing express-delivery results when a user searches for flowers might yield a higher CTR but poor relevance). Of course, relevance between two keywords can be estimated directly, for example by computing TF-IDF or BM25 between them, but a single score like this is not very effective. Better results require more features, and more features naturally call for a model to combine them into the final score.
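To make the simple baseline concrete, here is a minimal sketch of literal relevance as TF-IDF cosine similarity between two keywords. The toy IDF table and the pre-tokenized inputs are assumptions; a real system would compute IDF over its own corpus, use a proper segmenter, and could score BM25 analogously.

```python
import math
from collections import Counter

# Assumed toy IDF values; a production system derives these from its corpus.
IDF = {"fresh": 2.5, "flower": 2.1, "delivery": 3.0}

def tfidf_vector(terms):
    tf = Counter(terms)
    return {t: tf[t] * IDF.get(t, 5.0) for t in tf}  # default IDF for unseen terms

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Literal relevance between a query and one recommended word
print(cosine(tfidf_vector(["fresh", "flower"]), tfidf_vector(["flower", "delivery"])))
```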

Figure: Location of relevance in the keyword system

Machine learning is used to solve this problem. The following sections describe data preparation, feature selection, and model selection and optimization, and show how the problem is solved in the Baidu keyword search recommendation system.

Data, Features, and Models

When solving problems with machine learning, three levers for optimization are usually mentioned: data, features, and models. First, find sufficient and accurate labeled data (for supervised learning tasks such as relevance or LTR); then extract the features that contribute the most as the input space, using the labels as the ground-truth output; finally, optimize the model (the hypothesis). The whole optimization process is described along these three aspects below:

Prepare training data

There are several methods to obtain training data:

  1. Manual labeling: the advantage is high quality and low noise; one disadvantage is that the labels depend on the annotator's understanding. For example, when judging the relevance of 'Apple' and 'mobile phone' in a search engine, young people generally consider them relevant, but many elderly people may not. Another disadvantage is that manual labeling is expensive.
  2. Mining from logs: the advantage is a large data volume at low cost (a few Hadoop scripts to aggregate statistics over the logs); the disadvantage is heavy noise, for example data polluted by malicious crawling of the search engine.

For the relevance model, we first used the manual feedback data of the Baidu keyword search recommendation system as labels, extracting 1.5 million query-recommendation word pairs as positive and negative examples for feature extraction and model training.

When users like a keyword during interaction, they click 'thumbs up' to indicate that the result meets their needs (positive feedback; the query-recommendation word pair can be used as a positive example). When users think a keyword does not meet their needs, they click the 'recycle bin' icon and throw the keyword into the recycle bin (negative feedback; the pair can be used as a negative example).
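A minimal sketch of turning this feedback into labeled pairs. The tab-separated layout and the action names ("thumbs_up", "recycle_bin") are illustrative stand-ins, not the system's real log schema:

```python
def load_labeled_pairs(path):
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            query, rec_word, action = line.rstrip("\n").split("\t")
            if action == "thumbs_up":
                pairs.append((query, rec_word, 1))  # positive example
            elif action == "recycle_bin":
                pairs.append((query, rec_word, 0))  # negative example, noisy (see below)
    return pairs
```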

In the experiment, we found that the positive examples were reliable, but many negative examples looked like this: the query and the recommended word are actually related, but this particular user is not in that business, so the pair was marked negative. Negative samples were therefore strongly personalized. We asked our product manager to filter and relabel the negative examples before proceeding with feature extraction and model training.

Then we shuffled the positive and negative examples (directly using Python's random.shuffle) and split them into 10 parts for cross-validation.
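A sketch of that split, following the text's own recipe (random.shuffle, then 10 folds, each rotated in turn as the held-out set):

```python
import random

def ten_fold_splits(pairs, seed=0):
    data = list(pairs)
    random.seed(seed)
    random.shuffle(data)                      # the shuffle mentioned above
    folds = [data[i::10] for i in range(10)]  # 10 roughly equal parts
    for i in range(10):
        test = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        yield train, test
```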

Figure: set standards and samples before model training

Note: the choice of training samples completely determines the problem objective, so samples must be selected accurately from the start. If possible, all cases should be labeled, or at least reviewed, manually; proceed with the follow-up work only after confirming they are sound. Relevance-like judgments are especially subjective; for example, PM and RD may disagree on particular cases.

After the training samples and evaluation criteria are confirmed, the next step is feature and model optimization.

Feature Extraction

In general, feature selection and processing greatly affect the performance of a learning task. A common practice is to add features incrementally and test their effect. For a relevance model, we can start from traditional information retrieval features, which fall roughly into the following categories (a combined extraction sketch follows the list):

  1. General structural features of the query and candidate words: e.g., query/candidate word length and term count
  2. Relevance measures between the query and candidate words: e.g., TF-IDF, BM25, LMIR and their variants, PLSA similarity, word2vec semantic vector similarity. When the keyword text is sparse, the keyword can also be expanded with search engine abstracts before measuring similarity.
  3. Importance measures of the keyword's terms in the information retrieval sense, e.g., IDF and language model importance.
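A minimal sketch of assembling these three groups into one feature vector per (query, candidate) pair. The scorer callables are placeholders for whatever TF-IDF / BM25 / PLSA / word2vec implementations the system actually provides:

```python
def extract_features(query_terms, cand_terms, scorers):
    common = set(query_terms) & set(cand_terms)
    return [
        len(query_terms),                           # group 1: structural
        len(cand_terms),
        len(common),
        scorers["tfidf"](query_terms, cand_terms),  # group 2: relevance measures
        scorers["bm25"](query_terms, cand_terms),
        scorers["plsa"](query_terms, cand_terms),
        scorers["word2vec"](query_terms, cand_terms),
        sum(scorers["idf"](t) for t in cand_terms), # group 3: term importance
    ]
```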

At the beginning, all constructed features can be added to the feature vector for experiments. Features are then added class by class to see how much each class improves the overall goal, which gives an intuitive picture of that class's contribution.

Our experiments showed that as features were added, the effect kept improving, and no features needed to be removed at this stage (the model is a random forest used for binary classification).

When the feature set is nearly complete and the model's accuracy has stopped improving much, consider feature pruning. A simple, crude, and effective pruning method is to use a tree model: features with low contribution and low importance are cut directly. For example, features with zero contribution can be removed without affecting the accuracy of the relevance model.

Figure: feature contribution

When adding features stops improving the effect, feature selection is needed to avoid overfitting and to keep online prediction fast. With a tree model, a feature's contribution and the number of split nodes that use it can directly decide whether to remove it. The following is an example of using a tree model to select features:
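A hedged sketch of that selection step, assuming scikit-learn (the project's actual tooling is not specified): train a forest, read per-feature importances, and drop near-zero features.

```python
from sklearn.ensemble import RandomForestClassifier

# X: n_samples x n_features NumPy matrix; y: 0/1 relevance labels.
def prune_features(X, y, feature_names, threshold=1e-6):
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X, y)
    keep = model.feature_importances_ > threshold   # near-zero contribution is cut
    kept_names = [name for name, k in zip(feature_names, keep) if k]
    return X[:, keep], kept_names
```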

Directly removing features with zero contribution and zero use in splits leaves the model's effect almost unchanged while improving prediction efficiency.

Some experience is worth sharing when selecting features:

  1. BM25 features and term-weight features contribute greatly to the classification task
  2. Some ratio features contribute little on their own, e.g., the ratio of shared terms to the query's term count or to the recommendation word's term count. However, when these ratios are combined with the raw term counts of the query and the recommendation word, they contribute a lot. Some features therefore only play a major role in combination.
  3. Feature selection must be consistent with the target. For example, word2vec is a sophisticated and reliable technique, but it contributes little to a literal relevance target (if the target were semantic relevance, word2vec, like PLSA, would contribute a lot)
  4. Some features exist to handle special cases. They contribute little overall but need to be retained (or implemented as hard rules cooperating with the model), e.g., a feature marking that the query and the recommendation word share the same pinyin.

Model Selection

Classic Model

At first, we tried maximum entropy, SVM, and AdaBoost models. Considering online efficiency, we finally chose AdaBoost as the online model: its results were not the best, but a model built from simple weak learners is indeed faster (see the blog post "AdaBoost"). The launched AdaBoost model achieved good results: after launch, not only did recall increase, but in 90% of cases the relevance was higher than or equal to the original (non-model) strategy.
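A sketch of that choice, with scikit-learn standing in for whatever internal implementation was actually used; depth-1 trees (decision stumps) keep per-example prediction cost low, which matches the stated motivation:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

model = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # simple weak learner; `estimator` in scikit-learn >= 1.2
    n_estimators=100,  # number of boosting rounds is an assumed value
)
# model.fit(X_train, y_train); scores = model.predict_proba(X_test)[:, 1]
```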

 

Figure: distribution of evaluation results (scores 2 to -2 indicate that the extended recall results are more relevant than, slightly more relevant than, equal to, slightly less relevant than, or less relevant than the online strategy)

Ensemble Tree Model

Tree models are especially popular nowadays partly because they spare feature normalization: SVM-like models require normalized features, whereas with a tree model you can throw the raw feature vector and label at it directly. The model selects the most suitable split points based on information gain or the Gini coefficient (for the specific splitting criteria, see the blog post on using impurity to select split nodes in tree models). Open-source tree models, such as Quinlan's well-known C4.5 or C5.0, can also serve as a basis for feature selection.
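For intuition, a minimal sketch of the Gini criterion for binary labels, the quantity whose decrease a tree maximizes when choosing a split point:

```python
# Gini impurity of a set of 0/1 labels
def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    p_pos = sum(labels) / n
    return 1.0 - p_pos ** 2 - (1 - p_pos) ** 2

# Impurity decrease for a candidate split into left/right children
def gini_gain(left, right):
    n = len(left) + len(right)
    parent = gini(left + right)
    return parent - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)
```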

In particular, the emergence of ensemble tree models has greatly improved on single trees, so in my current projects I prefer running performance experiments with an ensemble tree model when adding features. For details on tree model usage, see the post "Integrated Tree Model and Its Application in Search Recommendation System".

Ensemble Tree Model Configuration Selection

Configuration selection here differs slightly from traditional model hyperparameters. For an ensemble tree model, it mainly means the number of trees and the feature and sample sampling factors of each tree. Balancing accuracy and speed, the final setting in the project was 20 trees, with both the feature selection factor and the sample selection factor at 0.65 (each tree is trained on a random 65% of the samples and 65% of the features).
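A sketch of that configuration, with scikit-learn's random forest as a stand-in for the internal ensemble implementation:

```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=20,    # number of trees, as stated above
    max_samples=0.65,   # each tree sees a random 65% of the samples
    bootstrap=True,     # required for max_samples to take effect
    max_features=0.65,  # note: scikit-learn samples 65% of features per split,
                        # not per tree, so this only approximates the setup
)
```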

For the concrete product results, see the ranking results of the Baidu keyword search recommendation system at www2.baidu.com.

How to personalize

The first thing to consider is whether our labeled samples contain personalized cases (here, the answer is no). Suppose the labeled cases were personalized; the model training process itself would not differ much. The main difference lies in choosing features that capture personalization, for example the similarity between the PLSA topic vectors of an account (unit) in Baidu Fengchao and those of the query.
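A sketch of how such a personalization feature could be appended to the existing feature vector, assuming the PLSA topic distributions are available as NumPy arrays (the vectors themselves would come from the unspecified PLSA model):

```python
import numpy as np

def add_personalization_feature(features, account_topics, query_topics):
    denom = np.linalg.norm(account_topics) * np.linalg.norm(query_topics)
    sim = float(np.dot(account_topics, query_topics) / denom) if denom else 0.0
    return list(features) + [sim]  # cosine similarity as one extra feature
```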

Log on to www2.baidu.com -> keyword tool -> search query -> view the results.

For more information, see: http://semocean.com
