Recommendation Algorithm Practice at Meituan


Objective

Recommender systems are not new; they have existed for a long time. However, it is only in recent years that recommendation has truly entered the public eye and become an important module at Internet companies of all kinds.

With the development of the Internet, more and more information is published online, producing serious information overload. Without some means of filtering, it is difficult for users to find the information that is valuable to them in such a flood.

There are two main ways to address information overload. The first is search: when a user has a clear information need, the intent is converted into a few short words or phrases (that is, a query), which are submitted to a search engine; the search engine then retrieves results relevant to the query from a vast repository of information. The second is recommendation: in many cases the user's intent is not clear, is hard to express in explicit terms, or the user may not even be aware of the need; in such cases search falls short. In recent years especially, with the rise of e-commerce, users do not necessarily browse with a clear purpose; often they are just "window shopping." In this scenario, a recommendation system is a better choice for resolving information overload, understanding user intent, and pushing personalized results to the user.

The rapid growth of our site has brought a large number of users and rich user behavior data, which provide indispensable conditions for applying and optimizing a recommendation system. Below we introduce some of our practices in building and optimizing our recommendation system, and share them with you.

Framework

Architecturally, the recommendation system can be divided into a data layer, a candidate-set triggering layer, a fusion and filtering layer, and a ranking layer. The data layer covers data generation and storage: various data-processing tools clean raw logs into formatted data, which lands in different types of storage systems for downstream algorithms and models. The candidate-set triggering layer produces recommendation candidate sets based on triggering strategies such as the user's historical behavior, real-time behavior, and geographic location. The candidate-set fusion and filtering layer has two functions: it fuses the different candidate sets produced by the triggering layer to improve the coverage and accuracy of the recommendation strategy, and it also performs filtering, applying manually defined rules from a product and operations perspective to remove unqualified items. The ranking layer mainly uses machine-learned models to reorder the candidate sets that survive fusion and filtering.

The candidate-set triggering and reranking layers are the two layers modified most frequently during effect iteration, so both need to support A/B testing. To support efficient iteration, we decoupled candidate-set triggering from reranking so that their results are orthogonal; experiments can therefore be run on each layer independently without affecting the other. Within each layer, we further split traffic into multiple buckets by user, supporting multiple strategies running and being compared online in parallel.

Data application

Data is the foundation of algorithms and models. As a trading platform with a rapidly growing user base, we accumulate a huge wealth of user behavior data. Of course, different types of data differ in value and in the strength of user intent they reflect.

Behavior Category         Behavior Details
Active behavior data      Search, filter, click, bookmark, order, pay, rate
UGC                       Text reviews, uploaded pictures
Negative feedback data    Left-swipe delete, unbookmark, cancel order, refund, bad review, low rating
User portrait             Demographic attributes, group-buy DNA, category preference, consumption level, workplace and residence

    1. Active behavior data record the user's various actions on the platform. On one hand, they are used for offline computation in the candidate-set triggering algorithms (described in the next section); on the other hand, different actions represent different intent strengths, so different regression target values can be set for different behaviors when training the reranking model, describing user behavior intensity at a finer granularity. In addition, user-deal behavior can also serve as cross features for offline training and online prediction of the reranking model.
    2. Negative feedback data reflect that the current results fail to satisfy the user in some respect, so subsequent candidate-set triggering needs to take the specific factors into account to filter or demote items, reducing the risk of recurring negative factors and improving the user experience. In reranking model training, negative feedback data can also serve as rare negative examples; these negatives carry far more signal than samples that were merely displayed without a click or an order.
    3. The user portrait is the basic data describing user attributes. Some attributes are raw data obtained directly, while others are mined through secondary processing. These attributes can be used to boost or demote deals during candidate-set triggering, and can also serve as user-dimension features in the reranking model.
    4. Keywords can be extracted from UGC through data mining and used to tag deals, enabling personalized display of deals.

Triggering strategies

We stressed the importance of data above, but data only comes to life through algorithms and models. Raw data is just an accumulation of bytes; we must clean it to remove noise, then use algorithms and models to learn its regularities, in order to maximize its value. This section describes the algorithms used in triggering the recommendation candidate sets.

1. Collaborative filtering

When it comes to recommendation, collaborative filtering is inescapable; it is used in almost every recommender system. The basic algorithm is very simple, but getting good results usually requires business-specific adaptations.

    • Eliminate noise such as cheating, click-farming, and proxy-purchase data. Such data can seriously degrade the algorithm's effectiveness, so it is removed in the first data-cleansing step.

    • Select training data reasonably. The time window of the training data should be neither too long nor too short; the concrete window size needs to be determined by repeated experiments. Introducing time decay is also worth considering, since recent user behavior better reflects the user's next action.

    • Combine user-based and item-based approaches.

A comparison of the two approaches:

    • Group vs. individual: user-based relies more on the collective behavior of user groups similar to the current user; item-based focuses more on the user's own individual behavior.
    • Computation cost: user-based suits applications with a smaller number of users; item-based suits applications with a smaller number of items.
    • Application scenario: user-based fits timeliness-sensitive occasions where personalized interest is less pronounced; item-based fits occasions with a rich long tail of items and strong personalization needs.
    • Cold start: under user-based, newly added items quickly enter the recommendation list; under item-based, newly added users can receive recommendations as soon as they act on an item.
    • Explainability: user-based is weak; item-based is strong.
    • Real-time behavior: under user-based, a user's new behavior does not necessarily change the recommendation results immediately; under item-based, it necessarily does.

    • Try different similarity measures. In practice, we use the log-likelihood ratio [1] as our similarity measure; Mahout also provides the log-likelihood ratio as one of its similarity measures.
      The following table shows the co-occurrence relationship between event A and event B, where:
      k11: number of times events A and B occur together
      k12: number of times event B occurs without event A
      k21: number of times event A occurs without event B
      k22: number of times neither A nor B occurs

                            Event A               Everything but A
      Event B               A and B (k11)         B without A (k12)
      Everything but B      A without B (k21)     Neither A nor B (k22)

Then:

logLikelihoodRatio = 2 * (matrixEntropy - rowEntropy - columnEntropy)

where:
rowEntropy = entropy(k11, k12) + entropy(k21, k22)
columnEntropy = entropy(k11, k21) + entropy(k12, k22)
matrixEntropy = entropy(k11, k12, k21, k22)
(entropy denotes the Shannon entropy of a system consisting of the given elements)
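The computation above can be sketched in Python. This is a minimal version: `entropy` here is the unnormalized Shannon entropy of the given counts, matching the parenthetical definition above.

```python
import math

def entropy(*counts):
    """Unnormalized Shannon entropy of a system of counts: -sum(x * ln(x / N))."""
    total = sum(counts)
    return sum(-x * math.log(x / total) for x in counts if x > 0)

def log_likelihood_ratio(k11, k12, k21, k22):
    """Log-likelihood ratio similarity from the 2x2 co-occurrence counts."""
    row_entropy = entropy(k11, k12) + entropy(k21, k22)
    column_entropy = entropy(k11, k21) + entropy(k12, k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    return 2.0 * (matrix_entropy - row_entropy - column_entropy)
```

Independent events score near zero; the more strongly A and B co-occur beyond chance, the larger the score.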

2. location-based

For mobile devices, one of the biggest differences from the PC is that their location changes frequently. Different geographic locations reflect different user scenarios, and in certain businesses the user's location can be exploited to great effect. In candidate-set triggering, we likewise trigger corresponding strategies based on the user's current location, workplace, and residence.

    • Based on the user's historical consumption, historical browsing, and other behavior, mine a regional consumption hot list and a regional purchase hot list at a certain geographic granularity (such as the business district).


[Figure: regional consumption hot list]

[Figure: regional purchase hot list]

    • When a new online user request arrives, the regional consumption hot list and regional purchase hot list corresponding to the user's locations are combined with weights to produce the final recommendation list.

    • In addition, user similarity for collaborative filtering can also be computed based on users' geographic locations.
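A minimal sketch of the weighted blending step above. The data layouts, the single `alpha` weight between the two lists, and all names are illustrative assumptions, not the production logic:

```python
def location_candidates(consume_hot, purchase_hot, user_regions, alpha=0.6, top_n=10):
    """Blend regional hot lists into one recommendation list.

    consume_hot / purchase_hot: {region: {deal_id: heat_score}}
    user_regions: {region: weight}, derived from the user's current location,
    workplace, and residence; alpha is the assumed weight of the purchase list.
    """
    scores = {}
    for region, region_w in user_regions.items():
        for deal, heat in consume_hot.get(region, {}).items():
            scores[deal] = scores.get(deal, 0.0) + region_w * (1 - alpha) * heat
        for deal, heat in purchase_hot.get(region, {}).items():
            scores[deal] = scores.get(deal, 0.0) + region_w * alpha * heat
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```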

3. query-based

Search is a strong expression of user intent and reflects the user's wishes relatively clearly, but in many cases, for various reasons, it does not end in a conversion. Nonetheless, we believe this scenario still reveals some of the user's intent and can be exploited. The specific practice is as follows:

    • Mine the user's searches over a period of time that did not lead to a conversion, and compute each user's weight for each query.

    • Compute the weight of each deal under each query.

    • When the user requests again, combine the user's query weights with the deal weights under each query, and recommend the top N deals by weighted score.
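The three steps above amount to a weighted score per deal. A minimal sketch, with illustrative data layouts:

```python
def query_based_candidates(user_query_weights, query_deal_weights, top_n=10):
    """user_query_weights: {query: weight}, mined from the user's unconverted searches.
    query_deal_weights: {query: {deal_id: weight}}, computed offline per query.
    Each deal's score is the sum over queries of query_weight * deal_weight."""
    scores = {}
    for query, qw in user_query_weights.items():
        for deal, dw in query_deal_weights.get(query, {}).items():
            scores[deal] = scores.get(deal, 0.0) + qw * dw
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```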

4. graph-based

In collaborative filtering, the graph distance between users or between deals that is considered is two hops; relationships at greater distances cannot be taken into account. Graph algorithms can break this limit: treat the user-deal relationship as a bipartite graph, on which relationships can propagate. SimRank [2] is a graph algorithm for measuring the similarity of structurally equivalent entities. Its basic idea is that two entities are similar if they are related to similar entities; that is, similarity can propagate.

    • Let s(A, B) denote the similarity between users A and B, for A != B:

      s(A, B) = C / (|O(A)| * |O(B)|) * sum over i in O(A), j in O(B) of s(i, j)

      Let s(c, d) denote the similarity between items c and d, for c != d:

      s(c, d) = C / (|I(c)| * |I(d)|) * sum over i in I(c), j in I(d) of s(i, j)

      where O(A), O(B) are the sets of out-neighbors of nodes A and B, I(c), I(d) are the sets of in-neighbors of nodes c and d, and C is a decay factor between 0 and 1.

    • Compute SimRank by matrix iteration.

    • Once the similarity matrices have been computed, online recommendation can be done in the same way as collaborative filtering.
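A naive sketch of SimRank on the user-deal bipartite graph by matrix iteration. This is toy-scale only (production computation would be distributed), and the decay factor C = 0.8 is a typical choice from the literature, not necessarily the value used in production:

```python
import numpy as np

def simrank_bipartite(adj, c=0.8, iterations=10):
    """adj: |users| x |deals| 0/1 interaction matrix.
    Returns (user_sim, deal_sim) similarity matrices."""
    n_users, n_deals = adj.shape
    user_sim = np.eye(n_users)
    deal_sim = np.eye(n_deals)
    user_deg = np.maximum(adj.sum(axis=1), 1)  # deals touched per user
    deal_deg = np.maximum(adj.sum(axis=0), 1)  # users touching each deal
    for _ in range(iterations):
        # s(A,B) = c / (|O(A)||O(B)|) * sum of s(i,j) over neighbor pairs
        new_user = c * (adj @ deal_sim @ adj.T) / np.outer(user_deg, user_deg)
        np.fill_diagonal(new_user, 1.0)
        new_deal = c * (adj.T @ user_sim @ adj) / np.outer(deal_deg, deal_deg)
        np.fill_diagonal(new_deal, 1.0)
        user_sim, deal_sim = new_user, new_deal
    return user_sim, deal_sim
```

Users 0 and 1 with identical deal neighborhoods converge to similarity C, while similarity to users with disjoint histories stays low but nonzero, which is exactly the beyond-two-hops propagation described above.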

5. Real-time user behavior

At present, our business includes rich user behaviors such as searching, filtering, bookmarking, browsing, and ordering, which are an important basis for effective optimization. We would of course like every user's behavior stream to reach a conversion, but reality falls far short of that.

When a user performs certain upstream behaviors, a significant share of those streams never form a conversion. Yet these upstream behaviors are very important prior knowledge for us. In many cases, the fact that the user did not convert at the time does not mean the user is uninterested in the item. When the user arrives at our recommendation slot again, we use this prior behavior to understand and identify the user's true intent, show deals that match that intent, and guide the user downstream along the behavior flow toward the ultimate goal of placing an order.

The real-time user behaviors currently incorporated include real-time browsing and real-time bookmarking.

6. Fallback strategies

Although we have a series of candidate-set triggering algorithms based on the user's historical behavior, for new users or users with sparse histories the candidate sets triggered by the algorithms above are too small, so they need to be padded with some fallback strategies.

    • Hot sale list: the items that sell best within a certain period; the influence of time decay and similar factors can be taken into account.
    • Top rated list: items with higher scores from user-generated reviews.
    • City list: items that meet basic qualifications within the city of the user's request.
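For the hot sale list, time decay can be applied when aggregating orders. A minimal sketch; the exponential half-life form and the parameter values are assumptions for illustration:

```python
import math
import time

def decayed_hot_list(orders, now=None, half_life_days=7.0, top_n=10):
    """orders: [(deal_id, order_timestamp_seconds), ...].
    Each order contributes exp(-lambda * age), halving every half_life_days,
    so recent sales dominate the ranking."""
    now = time.time() if now is None else now
    lam = math.log(2) / (half_life_days * 86400)
    heat = {}
    for deal, ts in orders:
        heat[deal] = heat.get(deal, 0.0) + math.exp(-lam * (now - ts))
    return sorted(heat, key=heat.get, reverse=True)[:top_n]
```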

Strategy fusion

To combine the merits of the different triggering algorithms and improve the diversity and coverage of the candidate sets, the different triggering algorithms need to be fused. Common fusion methods include [3]:

    1. Weighted: the simplest fusion method; assign each algorithm a weight based on empirical values, weight the candidate sets each algorithm produces accordingly, and sort by weight.
    2. Graded: use the best-performing algorithm first; when the candidate set it produces is not large enough to meet the target size, fall back to the second-best algorithm, and so on.
    3. Quota-based (modulated): each algorithm contributes candidates in a fixed proportion, and the contributions are stacked to produce the final candidate set.
    4. Filtering: each algorithm filters the candidate set produced by the previous one; the candidate set is filtered step by step, yielding a small but refined result.

At present, we combine the quota-based (modulated) and graded fusion methods: each algorithm is allotted a candidate-set proportion according to its historical effect, while the triggers of the better-performing algorithms are used first; if the candidate set is still not large enough, the next-best algorithm is used, and so on.
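The combined quota-plus-graded fusion can be sketched as follows; the quota values and names are illustrative:

```python
def fuse_candidates(triggers, quotas, target_size):
    """triggers: [(algo_name, ranked_candidates)], ordered by historical effect, best first.
    quotas: {algo_name: fraction of target_size}.
    First fill each algorithm's quota (modulated), then top up from the best
    algorithms in order (graded) if the set is still too small."""
    fused, seen = [], set()
    for name, cands in triggers:                      # quota pass
        quota = int(quotas.get(name, 0.0) * target_size)
        taken = 0
        for deal in cands:
            if taken >= quota or len(fused) >= target_size:
                break
            if deal not in seen:
                fused.append(deal)
                seen.add(deal)
                taken += 1
    for name, cands in triggers:                      # graded top-up pass
        for deal in cands:
            if len(fused) >= target_size:
                return fused
            if deal not in seen:
                fused.append(deal)
                seen.add(deal)
    return fused
```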

Candidate Set reordering

As mentioned above, deciding an item's position solely by the historical effect of the algorithm that triggered it is rather crude, and within each algorithm the ordering of items is determined by only one or a few factors. These orderings can serve only as a first rough pass; the final ranking needs to be determined with a machine-learned ranking model that combines the various factors.

1. Model

Non-linear models can capture non-linear relationships among features well, but their training and prediction costs are higher than those of linear models, which also leads to longer update cycles. Linear models, on the other hand, demand more of feature engineering: features need to be preprocessed using domain knowledge and experience. But because linear models are simple, training and prediction are efficient, so update cycles can be shortened, and online learning can even be attempted in combination with the business. In our practice, both non-linear and linear models are used.

    • Non-linear model
      At present our main non-linear model is the tree-based Additive Groves [4] (AG). Compared with a linear model, a non-linear model handles non-linear relationships among features better, without requiring as much effort on feature preprocessing and feature combination. AG is an additive model consisting of several groves; the final prediction is obtained by bagging over the groves, which reduces the effect of overfitting.

      Each grove contains multiple trees. During training, the fitting target of each tree is the residual between the true value and the sum of the other trees' predictions. Once the specified number of trees is reached, the trees are retrained and replaced one by one; after many iterations, the grove converges.
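      To make this cyclic residual-fitting loop concrete, here is a toy sketch: a single grove of depth-1 "trees" (stumps) on one-dimensional data, each retrained in turn on the residual left by the others. This illustrates the idea only; it is not the Additive Groves implementation, which uses full trees, multiple groves, and bagging.

```python
def fit_stump(xs, ys):
    """Fit a depth-1 regression tree on 1-D inputs: a threshold and two leaf means."""
    best_err, best = float("inf"), None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - (lm if x < t else rm)) ** 2 for x, y in zip(xs, ys))
        if err < best_err:
            best_err, best = err, (t, lm, rm)
    if best is None:  # no valid split: predict a constant
        mean = sum(ys) / len(ys)
        best = (min(xs), mean, mean)
    return best

def stump_predict(stump, x):
    t, lm, rm = stump
    return lm if x < t else rm

def fit_grove(xs, ys, n_trees=3, n_rounds=5):
    """Cyclically refit each tree on the residual of the true values minus
    the sum of the other trees' predictions, until the grove stabilizes."""
    stumps = [(min(xs), 0.0, 0.0)] * n_trees
    for _ in range(n_rounds):
        for i in range(n_trees):
            others = [sum(stump_predict(s, x) for j, s in enumerate(stumps) if j != i)
                      for x in xs]
            residual = [y - o for y, o in zip(ys, others)]
            stumps[i] = fit_stump(xs, residual)
    return stumps

def grove_predict(stumps, x):
    return sum(stump_predict(s, x) for s in stumps)
```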

    • Linear model
      At present the most widely used linear model is logistic regression. To capture changes in the data distribution in real time, we introduced online learning: we connect to the real-time data stream and update the model online using Google's FTRL [5] method.

The main steps are as follows:

    • Write feature vectors to HBase online
    • Use Storm to parse the real-time click and order log streams, overwriting the labels of the corresponding feature vectors in HBase
    • Update the model weights with FTRL
    • Push the new model parameters online
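The per-coordinate FTRL-Proximal update used in the weight-update step can be sketched for logistic regression over sparse binary features as follows. This is a minimal version; the hyperparameter values are illustrative defaults, not production settings:

```python
import math

class FTRLProximal:
    """Per-coordinate FTRL-Proximal for logistic regression over sparse binary features."""

    def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = {}  # per-feature accumulated adjusted gradients
        self.n = {}  # per-feature accumulated squared gradients

    def _weight(self, i):
        z = self.z.get(i, 0.0)
        if abs(z) <= self.l1:
            return 0.0  # L1 sparsity: small coordinates stay exactly zero
        n = self.n.get(i, 0.0)
        return -(z - math.copysign(self.l1, z)) / (
            (self.beta + math.sqrt(n)) / self.alpha + self.l2)

    def predict(self, features):
        s = sum(self._weight(i) for i in features)
        return 1.0 / (1.0 + math.exp(-max(min(s, 35.0), -35.0)))

    def update(self, features, label):  # label in {0, 1}
        p = self.predict(features)
        g = p - label  # log-loss gradient for each active binary feature
        for i in features:
            n_old = self.n.get(i, 0.0)
            sigma = (math.sqrt(n_old + g * g) - math.sqrt(n_old)) / self.alpha
            self.z[i] = self.z.get(i, 0.0) + g - sigma * self._weight(i)
            self.n[i] = n_old + g * g
```

Because all state is per-coordinate, each streaming example touches only its active features, which is what makes this update cheap enough to run online against the log stream.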
2. Data
    • Sampling: for CTR estimation, positive and negative samples are severely unbalanced, so the negative examples need to be downsampled.
    • Negative examples: positive examples are samples where the user produced a conversion behavior such as a click or an order, but is every non-converted impression necessarily a negative example? In fact, users never even see many of the impressions, so treating all of them as negatives is unreasonable and hurts the model. A common remedy is skip-above: impressions displayed above a clicked position can be treated as negatives. The negatives above are implicit negative feedback; in addition, we have explicit negative feedback where users actively delete impressions, and these are high-quality negatives.
    • Denoising: data contaminated by click-farming and other cheating behavior must be excluded from the training data, otherwise it directly hurts the model.
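The skip-above labeling can be sketched like this (a simplified illustration; positions are 0-based from the top of the displayed list):

```python
def label_impressions(impressions, clicked_positions, deleted_positions=()):
    """impressions: deal ids in display order. Clicked impressions are positives;
    unclicked impressions above the lowest click are skip-above negatives;
    impressions below the last click may simply be unseen, so they are dropped.
    Actively deleted impressions are kept as high-quality explicit negatives."""
    clicked = set(clicked_positions)
    labeled = []
    if clicked:
        last_click = max(clicked)
        for pos, deal in enumerate(impressions[:last_click + 1]):
            labeled.append((deal, 1 if pos in clicked else 0))
    for pos in deleted_positions:
        labeled.append((impressions[pos], 0))
    return labeled
```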
3. Features

Our current reranking model uses several classes of features:

    • Deal (that is, group-buy order, likewise below) dimension features: mainly properties of the deal itself, including price, discount, sales, rating, category, CTR, etc.
    • User dimension features: including the user's level, demographic attributes, client type, etc.
    • User-deal cross features: including the user's clicks on, bookmarks of, and purchases of the deal, etc.
    • Distance features: including the distances from the deal's POI to the user's real-time location, frequent locations, workplace, and residence.

For non-linear models, the features above can be used directly, whereas for linear models the feature values need to be binned, normalized, and so on, so that each feature becomes either a continuous value in [0, 1] or a 0/1 indicator.
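This preprocessing can be sketched as follows; equal-frequency binning and min-max scaling are assumed choices for illustration, not necessarily the production scheme:

```python
def quantile_boundaries(train_values, n_bins=10):
    """Learn bin boundaries at the quantiles of the training distribution."""
    ordered = sorted(train_values)
    return [ordered[len(ordered) * i // n_bins] for i in range(1, n_bins)]

def one_hot_bin(value, boundaries):
    """Map a continuous value to 0/1 indicators, one per bin."""
    index = sum(value >= b for b in boundaries)
    vector = [0] * (len(boundaries) + 1)
    vector[index] = 1
    return vector

def min_max_normalize(value, low, high):
    """Scale a continuous value into [0, 1], clipping out-of-range inputs."""
    if high == low:
        return 0.0
    return min(max((value - low) / (high - low), 0.0), 1.0)
```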

Summary

Grounded in data and carved with algorithms: only the organic combination of the two brings real improvement. For us, the following two milestones stand out in the optimization process:

    • Fusing candidate sets: improved the coverage, diversity, and accuracy of recommendations
    • Introducing the reranking model: solved the ordering problem among deals once the candidate set grew




The above is a summary of our practice; of course, there is still much we have to do. We are still on the way!

Note:

This article is the collective work of our recommendation and personalization team; thanks to every member for their hard work. The team is also hiring algorithm engineers and platform development engineers on a long-term basis; interested candidates please contact [email protected] with the subject line "Recommendation System Engineer Candidate".

