Implementing Related Search by Clustering Search Engine Click Logs

Source: Internet
Author: User

Our group often recruits interns, and alongside the technical questions I usually ask them one more: "How would you design and implement Baidu's related search?" The goal is mainly to see what ideas the intern comes up with, how broadly they think, and how many approaches they can propose. If they are stuck, I give a hint to see whether that sparks some ideas.

In fact, although the "related search" features of the major search engines involve many details, such as how to balance clicks, user experience, and revenue, their core mining algorithms are quite similar. In terms of data, they mostly rely on users' search session data and click data; where monetization is involved, advertiser information may also be introduced.

Baidu's own implementation of related search is not described here (it might involve confidential information). Instead, this post introduces a paper I read earlier, "Agglomerative Clustering of a Search Engine Query Log", which clusters queries using the search click log and uses the hierarchical clustering results to mine and recommend related searches. I hope it is helpful.

The novel point of the algorithm is that URLs are clustered at the same time as queries. The clustering process does not use the content of the queries; it relies directly on user behavior (searches and clicks). This is similar to collaborative filtering, which ignores the content of the recommended items and makes recommendations directly from user behavior data.

The algorithm first uses the click log, in the form of <query, url> pairs, to construct a bipartite graph with queries on the left and URLs on the right. If a user clicked a URL after searching for a query, an edge is added between the node for that query and the node for that URL.

[Figure: example bipartite graph. Queries are on the left, clicked URLs are on the right, and each edge means that a search for the query led to a click on the corresponding URL.]
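The graph construction described above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `build_bipartite_graph`, the adjacency-set representation, and the sample log entries are assumptions made for this example, not details from the paper.

```python
from collections import defaultdict


def build_bipartite_graph(click_log):
    """Build both sides of the bipartite graph from (query, url) pairs.

    Returns two adjacency maps: query -> set of clicked URLs,
    and URL -> set of queries that led to a click on it.
    """
    query_to_urls = defaultdict(set)
    url_to_queries = defaultdict(set)
    for query, url in click_log:
        # One edge per distinct (query, url) pair; repeated clicks are ignored.
        query_to_urls[query].add(url)
        url_to_queries[url].add(query)
    return query_to_urls, url_to_queries


# Hypothetical click log: each tuple is one click on `url` after searching `query`.
click_log = [
    ("cheap flights", "flights.example.com"),
    ("airline tickets", "flights.example.com"),
    ("airline tickets", "tickets.example.com"),
    ("python tutorial", "docs.example.org"),
]
query_to_urls, url_to_queries = build_bipartite_graph(click_log)
```

Storing neighbors as sets makes the later overlap computations (intersection and union) direct. Click counts are dropped in this sketch; a weighted variant would need a different structure.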

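Let N(x) denote the set of neighbors of node x in the bipartite graph. The similarity of two query nodes x and y can then be defined as follows:

$$\sigma(x, y) = \begin{cases} \dfrac{|N(x) \cap N(y)|}{|N(x) \cup N(y)|} & \text{if } |N(x) \cup N(y)| > 0 \\ 0 & \text{otherwise} \end{cases}$$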

The similarity lies in [0,1]. When x and y are queries, their similarity is measured by how much the sets of URLs adjacent to x and to y overlap. Conversely, when x and y are URLs, their similarity is defined by the queries that led to clicks on both of them.
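As a small sketch of this measure, the function below takes the neighbor sets of two nodes and returns the share of neighbors they have in common (the Jaccard coefficient), which is how the cited Beeferman and Berger paper defines the similarity. The function name and the sample node names are made up for illustration. The same function serves both sides of the graph: pass URL sets to compare queries, or query sets to compare URLs.

```python
def similarity(neighbors_x, neighbors_y):
    """Overlap of two neighbor sets, in [0, 1].

    Returns 0.0 when both sets are empty, so isolated nodes are never similar.
    """
    union = neighbors_x | neighbors_y
    if not union:
        return 0.0
    return len(neighbors_x & neighbors_y) / len(union)


# Two queries that share one of three distinct clicked URLs (hypothetical names).
query_sim = similarity({"u1", "u2"}, {"u2", "u3"})

# Two URLs reached from exactly the same set of queries.
url_sim = similarity({"q1", "q2"}, {"q1", "q2"})
```

Here `query_sim` is 1/3 and `url_sim` is 1.0.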

The next step is to cluster iteratively: use the URLs to compute the pairwise similarity of the queries and merge the two most similar queries, then use the queries as features to compute the pairwise similarity of the URLs and merge the two most similar URLs, and repeat until the termination condition is met.
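The alternating loop can be sketched as follows. This is an illustrative, brute-force version rather than the paper's implementation: each cluster is represented as a frozenset of its original members, every pair is rescored on every pass, and the names `merge_most_similar`, `cluster`, `min_similarity`, and `max_iterations` are assumptions introduced here. The loop stops when neither side has a pair whose similarity exceeds `min_similarity`.

```python
from itertools import combinations


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_most_similar(side, other, min_similarity=0.0):
    """Merge the most similar pair of clusters on `side`; return True if merged.

    `side` maps each cluster (a frozenset of original items) to the set of
    clusters it is linked to on the opposite side; `other` is the opposite map.
    """
    best_pair, best_score = None, min_similarity
    for x, y in combinations(list(side), 2):
        score = jaccard(side[x], side[y])
        if score > best_score:
            best_pair, best_score = (x, y), score
    if best_pair is None:
        return False
    x, y = best_pair
    merged = x | y
    neighbors = side.pop(x) | side.pop(y)
    side[merged] = neighbors
    # Re-point every neighbor from the two old clusters to the merged one.
    for n in neighbors:
        other[n] -= {x, y}
        other[n].add(merged)
    return True


def cluster(click_log, min_similarity=0.0, max_iterations=1000):
    """Alternate query merges and URL merges until no pair can be merged."""
    queries, urls = {}, {}
    for query, url in click_log:
        q, u = frozenset([query]), frozenset([url])
        queries.setdefault(q, set()).add(u)
        urls.setdefault(u, set()).add(q)
    for _ in range(max_iterations):
        merged_q = merge_most_similar(queries, urls, min_similarity)
        merged_u = merge_most_similar(urls, queries, min_similarity)
        if not (merged_q or merged_u):
            break
    return list(queries), list(urls)


# Hypothetical log: queries a, b, c are chained through URLs 1 and 2.
sample_log = [("a", "1"), ("b", "1"), ("b", "2"), ("c", "2"), ("d", "3")]
query_clusters, url_clusters = cluster(sample_log)
```

On this sample, queries a, b, and c end up in one cluster and d stays alone, while URLs 1 and 2 are merged and 3 stays alone. Rescoring all pairs on every pass is quadratic per merge, so a real click log would need incremental updates or candidate pruning.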

Queries and URLs are merged separately in each iteration because only in this way can clustering relationships that were not obvious at first be discovered. In the original illustration, for example, the relationship between a and c cannot be seen until nodes on the other side have been merged into 1'.

The termination condition has always been an important problem in agglomerative clustering. The general idea is to keep merging until nothing more can be merged, but in experiments this usually produces many overly large clusters. In practice, therefore, the termination condition for merging is often a limit on the size of each cluster (or on the number of merge layers), on the number of clusters, and so on.
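One possible way to express such limits in code is sketched below. The `Limits` fields, their default values, and both function names are hypothetical choices for this example; the source does not give concrete thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    max_cluster_size: int = 50  # assumed cap on members per cluster
    max_depth: int = 5          # assumed cap on merge-tree layers
    min_clusters: int = 1       # stop once this few clusters remain


def merge_allowed(size_a, size_b, depth_a, depth_b, limits):
    """Check whether merging two disjoint clusters stays within the limits."""
    merged_size = size_a + size_b
    merged_depth = max(depth_a, depth_b) + 1
    return (merged_size <= limits.max_cluster_size
            and merged_depth <= limits.max_depth)


def should_stop(best_score, num_clusters, limits):
    """Stop when no pair has positive similarity or few enough clusters remain."""
    return best_score <= 0.0 or num_clusters <= limits.min_clusters


limits = Limits(max_cluster_size=4, max_depth=2, min_clusters=2)
ok = merge_allowed(2, 2, 0, 1, limits)       # merged size 4, depth 2: allowed
too_big = merge_allowed(3, 2, 0, 0, limits)  # merged size 5: rejected
```

A clustering loop would call `merge_allowed` when picking the best pair, skipping pairs that would violate a cap, and `should_stop` once per iteration.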

You can use this method to cluster the click logs of a search engine. In practice, when a user enters a query, determine which cluster the query belongs to and use the other queries in that cluster as candidates for related search. Which of the cluster's queries are actually shown, and in what order, depends on many further factors, such as CTR, user experience, and the ability to drive traffic, which are not discussed here.

The most important advantage of this method is that it does not need the content of the queries; it clusters directly on user behavior (much like collaborative filtering in recommender systems). In a concrete project, of course, we can also borrow ideas from recommender systems and fold content features into the clustering to get better results. For example, the similarity measure can be redefined as a weighted combination of the click-based similarity and the content similarity:

$$\text{similarity}(x, y) = \alpha \cdot \text{content\_similarity}(x, y) + \beta \cdot \text{cross\_ref\_similarity}(x, y)$$

Here cross_ref_similarity is the similarity based on the click relationship data, content_similarity is the similarity based on the query text, and α and β are the weights given to each. See "Query Clustering Using User Logs" for more information.
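A weighted combination of this kind could look like the sketch below. Everything beyond the idea of a weighted sum is an assumption for illustration: the content similarity is a simple word-overlap measure chosen for brevity, and the weights `alpha` and `beta`, the function names, and the sample data are hypothetical.

```python
def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def combined_similarity(query_x, query_y, urls_x, urls_y, alpha=0.5, beta=0.5):
    """Weighted sum of a content-based and a click-based similarity."""
    # Content side: overlap of lowercased whitespace-separated words (assumed measure).
    content_similarity = jaccard(set(query_x.lower().split()),
                                 set(query_y.lower().split()))
    # Click side: overlap of the URL sets clicked for each query.
    cross_ref_similarity = jaccard(urls_x, urls_y)
    return alpha * content_similarity + beta * cross_ref_similarity


# Hypothetical queries and clicked-URL sets.
score = combined_similarity(
    "cheap flights", "cheap airline tickets",
    {"u1", "u2"}, {"u2"},
)
```

With equal weights, the word overlap (1/4) and the click overlap (1/2) give a score of 0.375. In a real system the weights would need tuning against evaluation data, and whitespace tokenization would not suit Chinese queries without a word segmenter.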

For more information, see the original papers:

Beeferman, Doug and Berger, Adam. 2000. Agglomerative Clustering of a Search Engine Query Log. Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 407-416.

Wen, Ji-Rong, Nie, Jian-Yun and Zhang, Hong-Jiang. 2002. Query Clustering Using User Logs. ACM Transactions on Information Systems, January 2002, vol. 20, no. 1, pp. 59-81.

You can also follow me on Weibo: weibo.com

or visit directly: http://semocean.com

