Application of Natural Language Processing (NLP) Technology in Recommendation Systems


Author: Zhang, algorithm architect at 58 Group, responsible for search, recommendation, and related algorithm work in the search and recommendation department. Over the years he has mainly worked on recommendation systems and machine learning, has also worked on computational advertising and anti-cheating, and is keen to explore practical applications of big data and machine learning technologies in other fields.
Editor: He Yongcan (heyc@csdn.net)
This article is an original article from "Programmer" magazine; for more articles, please subscribe to "Programmer".

Personalized recommendation is an indispensable technology in the big data era and plays an important role in e-commerce, information distribution, computational advertising, Internet finance, and other fields. Specifically, personalized recommendation is central to the efficient use of traffic, the efficient distribution of information, improving the user experience, and mining long-tail items. Recommendation systems often need to deal with all kinds of text data, such as product descriptions, news articles, and user messages. Concretely, we use text data to accomplish the following tasks:

Candidate item recall. Candidate recall is the first step of the recommendation pipeline and produces the set of items to be recommended; its core operation is to generate candidate sets with various recall algorithms. Recall based on text data is an important class of recall algorithm: it does not depend on user behavior and offers rich diversity, which makes it especially valuable when text information is abundant or user information is scarce.

Relevance calculation. Relevance calculations appear throughout the recommendation pipeline, for example in the various text similarity algorithms used for recall and in some of the relevance calculations used to build user profiles.

Features for ranking models (CTR/CVR). In the ranking layer that follows candidate recall, text features often carry a great deal of information and therefore become important ranking features.

However, compared with structured information (such as the attributes of a product), text information has some inherent drawbacks when put to concrete use.

First, text data carries little structural information. Strictly speaking, text data usually has no internal structure at all; at most it has coarse structure such as "title", "body", and "comments" that distinguishes where the text came from, and beyond that there is generally no further structural information. Why do we care about structure? Because structure carries information: whether we use algorithms or business rules, structured fields such as "color" and "style" allow us to formulate strategies like "recall all long blue down jackets". But if the product database contains no such structured information and only a free-text sentence such as "this down jacket is a long blue down jacket", then no structural information is available for formulating strategies.

Second, the content of text is uncertain. Along with the lack of structure comes uncertainty in both content and quantity. For example, different users may describe the same second-hand item very differently, with large variations in wording, emphasis, and text length; two pieces of information that appear in one item's description may not appear at all in another's. This kind of variation makes it difficult to treat text data as a stable and reliable data source, especially in scenarios dominated by user-generated content (UGC).

Third, free text contains many ambiguities. Ambiguity resolution is an important research topic in natural language processing, and ambiguity also affects how text data can be used in recommender systems. For example, a user describing a second-hand phone might write "selling my iPhone 6, planning to put the money toward an iPhone 7". The meaning is perfectly clear to a person, but it causes real trouble for a machine: is the phone for sale an iPhone 6 or an iPhone 7? Guaranteeing the accuracy of recommendations in such a context is a challenge.

But text data is by no means useless; alongside its shortcomings it has advantages that structured data lacks. Large volume: unstructured text is generally easy to obtain, for example through various UGC channels or web crawling, which can yield large amounts of text data. Rich diversity: being unstructured is a double-edged sword; the downside has been analyzed above, while the upside is that its openness brings rich diversity and it often contains information that predefined structured data does not. Timeliness: when new terms or new things emerge, microblogs and social feeds are usually the first places to reflect the change, and those are pure text data; analyzing them properly yields, faster than anything else, information that structured, predefined data cannot provide. These are the advantages of text data.

To sum up, text data is a large-volume, complex, and rich data source that plays an important role in recommendation systems. This article introduces common text processing methods for the recommendation-system applications mentioned above.

Starting from here: the bag-of-words model

The bag-of-words model (Bag of Words, BoW) is the simplest text processing method. Its core assumption is very simple: a document is represented as the multiset of the words it contains (a multiset differs from an ordinary set in that the number of occurrences of each element matters). This is about the simplest assumption possible, ignoring grammar, word order, and other important factors and considering only the number of occurrences of each word. Such a simple assumption obviously discards a great deal of information, but the benefit is that it is simple to use and compute, and quite flexible.

In a recommendation system, if an item is treated as a bag of words, we can recall related items based on the words in the bag. For example, if a user has browsed a product whose description contains the keyword "down jacket", we can recall other products containing "down jacket" as recommendation candidates, and we can rank the recalled products by how many times (the frequency with which) the word appears in each bag.
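A minimal sketch of this idea, assuming a toy item corpus and a naive whitespace tokenizer (both are illustrative, not a production setup): build an inverted index from words to items and recall items ranked by the word's frequency in each description.

```python
from collections import Counter, defaultdict

# Hypothetical item corpus: item_id -> free-text description (already segmented here).
items = {
    "item_1": "blue long down jacket down jacket warm",
    "item_2": "red short down jacket",
    "item_3": "leather shoes brown",
}

# Build the bag-of-words representation and an inverted index: word -> {item_id: term frequency}.
inverted_index = defaultdict(dict)
for item_id, text in items.items():
    bag = Counter(text.split())          # multiset of words in this item's description
    for word, freq in bag.items():
        inverted_index[word][item_id] = freq

def recall(keyword, top_k=10):
    """Recall items containing the keyword, ranked by raw term frequency."""
    postings = inverted_index.get(keyword, {})
    return sorted(postings.items(), key=lambda x: x[1], reverse=True)[:top_k]

print(recall("down"))   # item_1 ranks first because "down" appears more often in it
```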

There are obviously a number of problems with this simple approach:

First of all, not every word is suitable for recall and ranking. Stop words, that is, common function words such as "the", "of", "you", "me", and "he", should be removed. In addition, words that occur extremely frequently or extremely rarely need special handling; otherwise the recalled results will have low relevance, or there will be too few of them.

Secondly, using raw word frequency to measure importance is not very reasonable. Take the "down jacket" recall above as an example: if we measure product relevance by how often the category word "down jacket" appears in the product description, all down jackets end up with roughly the same relevance, because their descriptions use the term a similar number of times. We therefore need a more scientific and reasonable way to measure relevance between texts.

In addition to the usage above, we can also feed each word in the bag to a ranking model as a one-dimensional feature. For example, in a CTR ranking model based on logistic regression (LR), if the weight of the feature for a word is w, it can be interpreted as "a sample containing this word has its log-odds of being clicked increased by w". To strengthen the discriminative power of word features in ranking models, we often upgrade the simple bag-of-words model to an n-gram bag-of-words model.

An n-gram treats n consecutive words as a single unit. For example, the sentence "John likes to watch movies. Mary likes movies too." is processed by the simple bag-of-words model into:

["John": 1, "likes": 2, "to": 1, "Watch": 1, "movies": 2, "Mary": 1, "too": 1]

The result of processing it as bigrams (2-grams) is:

["John likes": 1, "likes to": 1, "to watch": 1, "Watch movies": 1, "Mary likes": 1, "likes movies": 1, "movies Too": 1]

What do we gain by doing this? If bigrams are used as features in a ranking model or in similarity calculations, the most obvious benefit is stronger discriminative power: two items that share n bigrams are more strongly related than two items that share n single words. Fundamentally, this is because the probability of two documents coinciding on a bigram is lower than the probability of coinciding on a unigram (a single word). Does that mean the larger n is, the better? No. As n grows, the discriminative power of the features increases, but so does data sparsity. In the extreme case, suppose n is 100; almost no two documents would share a 100-gram, so the feature loses its meaning. In practice, bigrams and trigrams (3-grams) strike a good balance between discriminative power and sparsity; if n keeps increasing, sparsity rises sharply while the effect does not improve noticeably and may even degrade.

Although the bag-of-words model has obvious drawbacks, it only requires simple processing of the text before it can be used, so it is a quick way to put text data to work. With suitable preprocessing (commonly including stop-word removal, removing or down-weighting very high-frequency and very low-frequency words, and using high-quality external data to filter and constrain the free text so as to obtain higher-quality raw data), it can often achieve good results.

Unifying weights and measures: weight calculation and the vector space model

From the discussion above, we can see that after proper preprocessing, a simple bag-of-words model can be used to recall candidate items in a recommendation system. But when it comes to computing the relevance between items and keywords, or between items themselves, using raw word frequency as a ranking factor is clearly unreasonable. To solve this problem, we can introduce the more expressive TF-IDF weighting method. In TF-IDF, the weight of a word t in document d is calculated as:

w_{t,d} = tf_{t,d} × log(N / df_t)

where tf_{t,d} is the frequency of t in d, df_t is the number of documents containing t, and N is the total number of documents.

TF-IDF and its various improvements and variants (for a detailed introduction, see Chapter 6 of Introduction to Information Retrieval) all aim, compared with using raw TF alone, at measuring the importance of a word more sensibly. For example: the original TF-IDF adds the IDF factor on top of TF, which reduces the importance of high-frequency words that carry no discriminative power, stop words being the typical case. Because a word's importance in a document is not completely linearly related to its number of occurrences, raw TF can be replaced by log-scaled TF, which reduces the weight of words that occur extremely often. And since how often a word appears in a document is related not only to its importance but also to the document's length, all TF values in a document may be normalized by the maximum TF to remove this discrepancy.
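A minimal sketch of the weighting schemes just described (raw TF, log-scaled TF, and the IDF factor), computed on a toy corpus; the documents and variable names are illustrative only.

```python
import math
from collections import Counter

# Toy corpus: each document is a list of tokens.
docs = [
    ["down", "jacket", "blue", "long", "down", "jacket"],
    ["down", "jacket", "red", "short"],
    ["leather", "shoes", "brown"],
]
N = len(docs)

# Document frequency: number of documents containing each term.
df = Counter()
for doc in docs:
    df.update(set(doc))

def tf_idf(term, doc, log_scale=True):
    """TF-IDF weight of a term in a document; log-scaled TF dampens very frequent terms."""
    tf = doc.count(term)
    if tf == 0:
        return 0.0
    tf_weight = 1 + math.log(tf) if log_scale else tf
    idf = math.log(N / df[term])
    return tf_weight * idf

print(tf_idf("down", docs[0]))   # appears in most docs -> modest weight despite tf = 2
print(tf_idf("blue", docs[0]))   # rarer term -> relatively higher IDF, higher weight
```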

The goal of all these methods is to measure the importance of words within documents more reasonably; on that basis we can improve the frequency-based methods above, for example by ranking items according to TF-IDF score instead of raw word frequency.

Beyond that, we need a unified way to measure the relevance between keywords and documents, and between documents and documents. That is the vector space model (VSM).

The core idea of VSM is to represent a document as a vector in which each dimension corresponds to a word. On this basis, vector operations can be used to compute similarities between documents, the most important of which is the cosine similarity of two vectors:

sim(d1, d2) = V(d1) · V(d2) / (|V(d1)| × |V(d2)|)

where V(d1) and V(d2) are the vector representations of the two documents. This seemingly simple formula actually has great significance. First, it gives a general template for relevance calculation: as long as two items can be represented as vectors, the formula can be used to compute their relevance. Second, it places no restriction on how the vectors are represented: the same formula is used in user-behavior-based collaborative filtering, and in text relevance calculation the vectors can be filled with TF-IDF weights or n-grams, as well as with the topic probability distributions and various word vectors described later. With a solid understanding of the formula's meaning, we can construct a reasonable vector representation for whatever the task requires. Third, the formula is highly interpretable: it decomposes the overall relevance into a sum of per-dimension contributions that can be adjusted, which makes the method relatively easy to explain even to non-technical people; this matters a great deal when explaining algorithms to product, operations, and other non-technical teams. Finally, the formula admits some very efficient engineering optimizations in practice, so it can comfortably handle the massive data volumes of big data environments, something other relevance calculation methods find hard to match.
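A minimal sketch of the cosine similarity above for sparse vectors stored as dictionaries; as the text notes, the vectors could be filled with TF-IDF weights, n-gram counts, topic probabilities, or any other representation. The two example vectors are hypothetical TF-IDF weights.

```python
import math

def cosine_similarity(v1, v2):
    """Cosine similarity of two sparse vectors represented as {dimension: weight} dicts."""
    dot = sum(w * v2.get(dim, 0.0) for dim, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

# Hypothetical TF-IDF vectors of two item descriptions.
d1 = {"down": 0.69, "jacket": 0.41, "blue": 1.10}
d2 = {"down": 0.41, "jacket": 0.41, "red": 1.10}
print(cosine_similarity(d1, d2))
```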

VSM is a "heavy sword without an edge" kind of method: simple in form yet endlessly adaptable. Understand its essence, and it can be put to great effect.

Seeing the essence through the phenomena: the latent semantic model

The uses of text data described so far are all "explicit", in the sense that the features used for relevance calculation, item recall, and ranking models are human-readable text. The advantage is that the approach is direct and it is easy to see what is doing the work; the drawback is that it cannot capture the deeper information hidden beneath the surface of the text. For example, "down jacket" and "coat" refer to similar things, and "down jacket" and "shoes" are strongly correlated; such deep information cannot be captured by explicit text processing, so we need more sophisticated methods. The latent semantic model (Latent Semantic Analysis, LSA) is one of the earliest of these methods.

The "latent" in the latent semantic model refers to latent topics. The model's core assumption is that although a document contains many words, the topics behind those words are few. In other words, the words are generated by the topics behind them, and those underlying topics are the more essential information. This idea of drilling down from words to topics runs through the other models we will introduce and is the common central idea of all text topic models (topic models), so it is very important to understand it.

Before applying LSA decomposition to documents, we need to construct the relationship between documents and words. A simple example consisting of 5 documents and 5 words is as follows:

The LSA approach applies singular value decomposition (SVD) to this original matrix C, in the following form:

C ≈ C_k = U_k ∑_k V_k^T

Here U is the matrix of orthonormal eigenvectors of CC^T, V is the matrix of orthonormal eigenvectors of C^T C, ∑_k is a diagonal matrix containing the top k singular values, and k is a dimensionality-reduction parameter chosen in advance. The decomposition yields a low-dimensional representation of the original data in which each retained dimension can be thought of as a topic; these reduced dimensions carry richer information than individual words, for example recognizing synonyms and polysemous words. A document d that was not in the training set can also be mapped into the new vector space, so that similarities between documents can be computed in the reduced space (such a mapping cannot capture information that is new to the document, such as newly appearing words, so the model needs to be fully retrained periodically). Because the new vector space encodes deeper information such as synonymy, this transformation improves both the precision and the recall of similarity calculations.
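A minimal sketch of this decomposition using scikit-learn (assuming scikit-learn is available; the toy documents are illustrative). Note that scikit-learn arranges documents as rows, i.e. the transpose of the term-document convention used in the text; the idea is the same: reduce the matrix to k topic dimensions and compute document similarity in the reduced space.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "blue long down jacket",
    "warm down coat for winter",
    "brown leather shoes",
    "running shoes lightweight",
]

# Build the document-term matrix (here weighted with TF-IDF rather than raw counts).
vectorizer = TfidfVectorizer()
C = vectorizer.fit_transform(docs)

# Truncated SVD keeps the top k singular values/vectors; each kept dimension acts like a topic.
k = 2
svd = TruncatedSVD(n_components=k, random_state=0)
doc_topic = svd.fit_transform(C)             # low-dimensional document representations

print(cosine_similarity(doc_topic))          # document similarities in the reduced space

# A new document can be folded into the same space (it cannot add new vocabulary, as noted above).
new_doc_topic = svd.transform(vectorizer.transform(["down jacket for winter"]))
print(cosine_similarity(new_doc_topic, doc_topic))
```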

Why does LSA have this ability? Consider it from the following angle: each element (CC^T)_{i,j} represents the number of documents that contain both word i and word j, and each element (C^T C)_{i,j} represents the number of words shared by document i and document j. These two matrices therefore encode the co-occurrence of words and the sharing of words between documents, and decomposing this information yields lower-dimensional, topic-like data that carries more information than individual keywords.

From another perspective, LSA is equivalent to a soft clustering of documents: each reduced dimension can be viewed as a cluster, and a document's value on that dimension represents the degree to which the document belongs to that cluster.

What can be done for recommendation with data processed by LSA? First, we can index items by the new (topic) dimensions instead of by words, map user behavior onto the same dimensions, and, once both are ready, recall candidate items in the new space. After recall, similarities can be computed with VSM as described above; computation in the reduced space brings higher precision and recall and reduces the interference of noise words. Typically, two documents can still be related even if they share no words at all, which is one of LSA's core advantages. In addition, the reduced representation can be used as features in the ranking model.

Simply put, whatever we can do with ordinary keywords we can still do with LSA, because in essence LSA just gives the original data a reduced-dimensional semantic representation that can be treated as keywords carrying more information.

As we can see, LSA is a big step forward from raw keywords, mainly in the increase of information content, the reduction of dimensionality, and the handling of synonyms and polysemy. But LSA also has drawbacks. Training is expensive: LSA is trained via SVD, and SVD itself is computationally heavy, making it hard to compute over massive document collections and large vocabularies; some optimizations reduce the cost, but they do not solve the problem fundamentally. Retrieval (recall) is also expensive: as mentioned above, recalling with LSA requires mapping documents or query keywords into the LSA vector space, which is clearly time-consuming. Finally, the value of a word under a topic in LSA has no probabilistic meaning and can even be negative; it only reflects relative magnitude. This makes it hard to interpret and understand the topic-word relationship probabilistically, which limits richer uses of the results.

The magic of probability: the probabilistic latent semantic model

To push the latent semantic model further and try to overcome LSA's problems, Thomas Hofmann proposed the probabilistic latent semantic model (probabilistic latent semantic analysis, pLSA) in 1999. As the LSA introduction showed, although the concrete optimization method there is matrix decomposition, from another point of view the vectors in the decomposed U and V matrices can be regarded as the representations of documents and words in a latent semantic space; for example, a document's latent vector might be (1, 2, 0)^T, meaning its value is 1 on the first latent dimension, 2 on the second, and 0 on the third. If these values could form a probability distribution, the model's results would not only be easier to understand but would also gain many good properties. That is the core idea of pLSA: treat the relationship between documents and words as a probability distribution and then try to find that distribution; with the probability distributions of documents and words in hand, we can get everything we want.

Under pLSA's basic assumptions, document d and word w are generated as follows: select a document d with probability P(d); select a latent class (topic) z with probability P(z|d); generate a word w from z with probability P(w|z). Both P(z|d) and P(w|z) are multinomial distributions.

The process is expressed by the joint probability:

P(d, w) = P(d) ∑_z P(z|d) P(w|z)

Figure 1 The pLSA generation process

As can be seen, the latent variable z serves as a bridge linking documents and words, forming a well-defined, interlocking chain of probabilistic generation (shown in Figure 1). Although pLSA is at heart a probabilistic model, it can also be expressed as a matrix decomposition similar to LSA (LSI). To do so, we redefine the three matrices on the right-hand side of the LSA equation:

U_{i,k} = P(d_i | z_k),  ∑_{k,k} = P(z_k),  V_{j,k} = P(w_j | z_k)
Under these definitions, the original matrix C can still be expressed as C = U ∑ V^T. This correspondence both shows more clearly that pLSA is well defined and explicit in probabilistic terms, and reveals the close relationship between latent semantic probabilistic models and matrix decomposition (for more on that relationship, see http://www.cs.cmu.edu/~epxing/Class/10708-15/slides/LDA_SC.pdf). Under this definition the meaning of the latent variable z as a topic also becomes clearer: we can explicitly view each z as a topic, with the words under a topic and the topics within a document carrying definite probabilistic meaning. It is precisely because of these good properties, together with convenient optimization methods, that starting from pLSA text topics began to occupy an important place in all kinds of big data applications.
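A minimal numeric sketch of this correspondence, using randomly generated toy distributions (sizes and seeds are arbitrary): the joint probability P(d, w) computed by summing over topics equals the matrix product U ∑ V^T under the definitions above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_words, n_topics = 4, 6, 3

# P(z): topic prior; P(d|z) and P(w|z): each column is a distribution conditioned on one topic.
p_z = rng.dirichlet(np.ones(n_topics))                     # shape (K,)
p_d_given_z = rng.dirichlet(np.ones(n_docs), n_topics).T   # shape (D, K), columns sum to 1
p_w_given_z = rng.dirichlet(np.ones(n_words), n_topics).T  # shape (W, K), columns sum to 1

# Joint probability by explicit summation over topics: P(d, w) = sum_z P(z) P(d|z) P(w|z).
joint = np.zeros((n_docs, n_words))
for k in range(n_topics):
    joint += p_z[k] * np.outer(p_d_given_z[:, k], p_w_given_z[:, k])

# The same quantity written as a matrix factorization U diag(P(z)) V^T.
U, Sigma, V = p_d_given_z, np.diag(p_z), p_w_given_z
print(np.allclose(joint, U @ Sigma @ V.T))   # True
```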

From the matrix point of view, LSA and pLSA look very similar, but their underlying natures are fundamentally different. The most important difference is the optimization objective: LSA essentially minimizes the squared error between the matrix reconstructed from the SVD and the original matrix, while pLSA maximizes a likelihood function, the standard machine-learning optimization routine. This difference in essence leads to differences in both the optimization results and the interpretability of the two methods.

So far we have seen that pLSA recasts LSA's idea in terms of probability distributions and obtains better results, but pLSA still has problems of its own, mainly the following. Because pLSA generates a set of document-level parameters for each document, the number of model parameters grows linearly with the number of documents, so the model overfits easily when there are many documents. And although pLSA treats each document as a mixture of topics, the mixing proportions themselves have no corresponding generative probability model; in other words, pLSA cannot assign a well-founded topic distribution to a new document outside the training set. In short, pLSA is not a complete generative model.

LDA was introduced precisely to solve these problems.

The probability of probabilities: a generative probabilistic model

To address the problems of pLSA described above, David Blei proposed a new model in 2003 called Latent Dirichlet Allocation (LDA), a rather obscure name that does not even sound like a model. Here is an attempt at an informal interpretation. "Latent" needs no explanation: the model is still a latent semantic (topic) model. "Dirichlet" says that the main probability distributions involved in the model are Dirichlet distributions. "Allocation" says that the model's generative process uses Dirichlet distributions to continually allocate topics and words.

This is not an official explanation, but hopefully it will help to understand the model.

The central idea of LDA is to add priors on top of pLSA, so that both the topic distribution of a document and the word distribution under a topic themselves have generative probabilities. This solves the "non-generative" problem of pLSA described above and also reduces the number of model parameters, thereby addressing pLSA's other problem. In LDA, the words of a document d_i are generated as follows:

1. Sample a number N from a Poisson distribution as the document length (this step is not essential and does not affect the rest of the process).
2. Sample θ_i from the Dirichlet distribution Dir(α), representing the topic distribution of the document.
3. Sample a set of φ_k from the Dirichlet distribution Dir(β), representing the word distribution under each topic k.
4. For each word position j from 1 to N: sample a topic z_{i,j} from the multinomial distribution Multinomial(θ_i), then sample a word w_{i,j} from the multinomial distribution Multinomial(φ_{z_{i,j}}).


Fig. 2 The generation process of LDA
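A minimal simulation of this generative story using numpy; the vocabulary, hyperparameters, and sizes are toy values chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["down", "jacket", "blue", "phone", "screen", "battery"]
n_topics, alpha, beta = 2, 0.5, 0.1

# Word distribution for each topic: phi_k ~ Dirichlet(beta).
phi = rng.dirichlet(np.full(len(vocab), beta), n_topics)        # shape (K, V)

def generate_document():
    n_words = rng.poisson(8)                                    # document length (optional step)
    theta = rng.dirichlet(np.full(n_topics, alpha))             # topic distribution of this document
    words = []
    for _ in range(n_words):
        z = rng.choice(n_topics, p=theta)                       # sample a topic for this position
        w = rng.choice(len(vocab), p=phi[z])                    # sample a word from that topic
        words.append(vocab[w])
    return words

print(generate_document())
```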

Ignoring the first step of choosing the document length, we see that LDA's generative process adds a layer of probabilistic generation on top of both the document-to-topic distribution and the topic-to-word distribution. This extra layer of uncertainty over pLSA naturally accommodates documents and words not seen in the training data, giving LDA better probabilistic properties than pLSA.

Applications of LDA

In this part we first point out some things to note when using LDA for similarity calculation and as ranking features, and then introduce applications of text topic models, represented by LDA, to recommendation systems from some different angles.

Calculation of similarity

As mentioned above, the output of LSA can be plugged directly into VSM for similarity calculation; the same can be done with LDA, by treating the document's topic distribution as a vector and applying the cosine formula. However, it is better to replace cosine similarity with KL divergence or Jensen–Shannon divergence, because the topic distribution given by LDA consists of genuine probability values, and a measure designed for comparing probability distributions is more appropriate.
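A minimal sketch of comparing two topic distributions with KL and Jensen–Shannon divergence, written in plain numpy; the two item distributions are hypothetical, and a small epsilon guards against log(0).

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions given as arrays."""
    p, q = np.asarray(p, dtype=float) + eps, np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def js_divergence(p, q):
    """Symmetric Jensen-Shannon divergence, better suited as a distance between topic distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p / p.sum() + q / q.sum())
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical LDA topic distributions of two items.
item_a = [0.70, 0.20, 0.05, 0.05]
item_b = [0.60, 0.30, 0.05, 0.05]
print(js_divergence(item_a, item_b))   # smaller value -> more similar topic distributions
```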

Ranking features

Using an item's LDA topic distribution as features in a ranking model is a natural idea, but not all topics are useful. The topic distributions of items usually fall into two patterns: either a few topics (three or fewer) take up most of the probability mass and the remaining topics together account for little, or all topics have similar, relatively small probabilities.

In the first case only the few high-probability topics are useful; in the second case essentially none of them are. So how do we tell the two cases apart? One method is to run a simple k-means clustering on the topic probability values with k = 2: in the first case the two clusters differ greatly in size, one containing the small number of useful topics and the other containing the rest, whereas in the second case the two clusters are of similar size. This distinction can be used to judge the importance of the topics. A second method is to compute the information entropy of the topic distribution, which is relatively small in the first case and relatively large in the second, so choosing a suitable threshold also separates the two cases, as sketched below.
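A minimal sketch of the entropy-based check just described; the threshold is a hypothetical value that would be tuned on real data.

```python
import numpy as np

def topic_entropy(topic_dist):
    """Shannon entropy of an item's topic distribution; low entropy means a few dominant topics."""
    p = np.asarray(topic_dist, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

concentrated = [0.80, 0.15, 0.02, 0.02, 0.01]   # case 1: a few dominant topics
flat = [0.21, 0.20, 0.20, 0.20, 0.19]           # case 2: nearly uniform, uninformative

threshold = 1.0   # hypothetical cutoff
for dist in (concentrated, flat):
    useful = topic_entropy(dist) < threshold
    print(round(topic_entropy(dist), 3), "has useful topics" if useful else "no clearly dominant topic")
```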

Tagging items & propagating tags to users

After computing an item's topic distribution and the word distributions under those topics, we can select the topics with the highest probabilities and then pick the most probable words under those topics as tags for the item. On this basis, if a user has interacted with the item, the tags can be propagated to the user.
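A minimal sketch of extracting tags this way, given an item's topic distribution and the topic-word matrix produced by LDA; the vocabulary, matrix values, and cutoffs are illustrative.

```python
import numpy as np

vocab = np.array(["down", "jacket", "winter", "phone", "screen", "battery"])
# Hypothetical LDA outputs: the item's topic distribution and per-topic word distributions.
item_topics = np.array([0.75, 0.20, 0.05])            # shape (K,)
topic_word = np.array([                                # shape (K, V), rows sum to 1
    [0.40, 0.35, 0.20, 0.02, 0.02, 0.01],
    [0.02, 0.03, 0.05, 0.50, 0.20, 0.20],
    [0.10, 0.10, 0.30, 0.10, 0.20, 0.20],
])

def item_tags(item_topics, topic_word, n_topics=1, n_words=2):
    """Pick the item's top topics, then the top words of those topics, as tags."""
    top_topics = np.argsort(item_topics)[::-1][:n_topics]
    tags = []
    for k in top_topics:
        top_words = np.argsort(topic_word[k])[::-1][:n_words]
        tags.extend(vocab[top_words])
    return tags

print(item_tags(item_topics, topic_word))   # e.g. ['down', 'jacket'], which can be propagated to users
```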

This way of assigning tags comes with very intuitive interpretability and, in suitable scenarios, can serve as the explanation for a recommendation. For example, in personalized push notifications on mobile, the space available for display copy is very small; we can tag items in the way described above, propagate the tags to users according to their behavior, and then show these tags in the push both as a recall source and as the recommendation reason, so that users understand why the recommendation was made.

Measuring the importance of topics & words

Although the topics produced by LDA training are all formally equal, their importance differs: some topics carry important information while others do not. For example, a topic might contain words such as "education", "reading", and "school"; a document strongly associated with such a topic is, generally speaking, about education, so this is a high-information topic. Conversely, if LDA is trained on all the books of a book-selling website, a topic containing words like "volume one", "volume two", "volume three" may well emerge, because many box sets contain such phrases; a document associated with such a topic could be about any subject at all, so this is a low-information topic.

It is therefore useful to distinguish topics by importance. The example above offers some inspiration: important topics do not appear everywhere but only in a small number of related documents, while unimportant topics may appear in all kinds of documents. Based on this idea, we can use information entropy to measure the amount of information a topic carries. By appropriately transforming LDA's output we can obtain each topic's probability distribution over the documents and then compute the information entropy of that distribution. In plain terms, information entropy measures how spread out a probability distribution is: the more dispersed, the larger the entropy; the more concentrated, the smaller. So in our problem, the smaller the entropy, the fewer documents the topic corresponds to, and the more important the topic.
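A minimal sketch of this measurement: normalize each topic's column of the document-topic matrix into a distribution over documents, then compute its entropy; lower entropy suggests a more concentrated, and hence more informative, topic. The matrix here is a toy example.

```python
import numpy as np

# Hypothetical document-topic matrix from LDA: rows are documents, columns are topics.
doc_topic = np.array([
    [0.90, 0.05, 0.05],
    [0.85, 0.10, 0.05],
    [0.05, 0.60, 0.35],
    [0.05, 0.05, 0.90],
    [0.10, 0.30, 0.60],
])

def topic_importance(doc_topic):
    """Entropy of each topic's distribution over documents; smaller entropy = more important topic."""
    p = doc_topic / doc_topic.sum(axis=0, keepdims=True)   # column-wise distribution over documents
    return -np.sum(p * np.log(p + 1e-12), axis=0)

print(topic_importance(doc_topic))   # topics concentrated in few documents get lower entropy
```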

Using a similar approach, we can also measure the importance of words; we will not repeat the details here.

More applications

Beyond the uses above, LDA has many other applications and is widely used even outside text, in images and other fields. The core of LSA/pLSA/LDA and other topic models is the co-occurrence of words within documents, on top of which various probability distributions are built; once this core foundation is grasped, many more applications of text topic models can be found. For example, in collaborative filtering the basic data is likewise the shared behavior of users over items, which also fits the foundation of topic models, so LDA can be used to model user behavior, obtaining "topics" of user behavior and the items associated with each topic, and then making item or user recommendations accordingly.

Capturing contextual information: neural probabilistic language models

Topic models represented by LDA decompose the co-occurrence information of words and extract a lot of useful information from it, but pLSA/LDA carries an important assumption: given the topic distributions, the documents in a collection, and the words within a document, are independent and exchangeable. In other words, the model considers neither the order of words nor the relationships between them, which implies two things: in the generative process, previously generated words have no influence on the next word to be generated; and two documents containing the same words in different orders are exactly the same as far as LDA is concerned.

This assumption makes LDA lose some important information, and in recent years the increasingly popular neural probabilistic language models, represented by word2vec, complement LDA in exactly this respect, capturing information that LDA cannot.

The central idea of word2vec in one sentence: a word is characterized by the company it keeps, that is, a word's meaning is determined by the words around it.

This sounds rather philosophical, much like the saying "birds of a feather flock together". Concretely, the word vector model constructs training samples of the form "surrounding words => current word" or "current word => surrounding words", then trains a neural network on them; after training, the input-layer weight vector of each word becomes that word's vector representation, as shown in Figure 3.

This kind of training essentially says that two words with similar contexts (the context being the surrounding words) will end up with similar vector representations. With word vectors in hand, many things become possible; the most common is to use this vector layer as the embedding layer of a deeper model. Beyond deep learning, there are many other uses in recommendation systems, one of which is clustering words and finding similar words. We know that LDA can naturally cluster words and compute similar words, so how do results computed with word2vec differ from LDA's? The differences show up in two respects. First, the granularity of the clustering is different: LDA's topic-level granularity is coarser, while word vectors capture lower-level syntactic and semantic relationships. For example, for the three words "Apple", "Xiaomi", and "Samsung", LDA would very likely cluster them into the same topic, whereas in the word vector space "Apple" and "Xiaomi" may be more similar to each other, just as "Jobs" and "Lei Jun" have an analogous relationship, so in the word vector space we may find: vector(Xiaomi) - vector(Apple) + vector(Jobs) ≈ vector(Lei Jun).
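A minimal sketch of such nearest-neighbor and analogy queries with gensim, assuming gensim is installed and a word2vec model has already been trained and saved on a suitable corpus; the file name and the vocabulary words are hypothetical.

```python
from gensim.models import KeyedVectors

# Load previously trained word vectors (hypothetical path).
wv = KeyedVectors.load("item_word_vectors.kv")

# Words with similar contexts end up with similar vectors.
print(wv.most_similar("apple", topn=5))

# Analogy query of the form vector(xiaomi) - vector(apple) + vector(jobs) ~ vector(lei_jun).
print(wv.most_similar(positive=["xiaomi", "jobs"], negative=["apple"], topn=5))
```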

Second, since word2vec has the ability to "predict the current item from its context", with suitable modifications it can also be used to predict users' behavioral preferences. First we collect user behavior logs and split them into sessions, obtaining training data analogous to a text corpus; training a word2vec model on this data gives a model that "predicts the current behavior from the surrounding behaviors". However, the objects of behavior in raw logs are usually at the ID level, such as product or video IDs; training on them directly leads to slow training and poor generalization, so the raw behaviors first need to be reduced in dimensionality, for example mapped to search terms, LDA topics, or categories, before training. For instance, we can train a word2vec model on users' search terms and then predict the next search from a user's search history, making recommendations on that basis. This approach takes context into account, but it does not handle the ordering relationship optimally: word2vec's idea is "predict the current content from the context on both sides", whereas the model we want is "predict the next behavior from the historical behaviors", which is subtly different. For example, if a user's behavior sequence is ABCDE, where each letter represents an action on an item (or a keyword), the standard word2vec algorithm may construct samples such as AC→B, BD→C, CE→D..., but what we actually want is AB→C, BC→D, CD→E.... We therefore need to modify word2vec's sample-generation logic so that it produces only the one-directional samples we need, in order to get the results we really expect from the final model, as sketched below.
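A minimal sketch of the modified sample construction described above: from each behavior session, only "preceding window → next item" pairs are generated, instead of word2vec's symmetric two-sided context. The session data and window size are illustrative, and the resulting pairs could feed a customized word2vec-style trainer or any other sequence model.

```python
from typing import Iterable, List, Tuple

def directional_samples(session: List[str], window: int = 2) -> Iterable[Tuple[List[str], str]]:
    """Yield (history_window, next_item) pairs, e.g. AB -> C, BC -> D for session ABCDE."""
    for i in range(1, len(session)):
        history = session[max(0, i - window):i]   # only items that came BEFORE position i
        yield history, session[i]

# Hypothetical user session of item/keyword IDs.
session = ["A", "B", "C", "D", "E"]
for history, target in directional_samples(session):
    print(history, "->", target)
# ['A'] -> B, ['A', 'B'] -> C, ['B', 'C'] -> D, ['C', 'D'] -> E
```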

The following are some prediction examples generated by this method:

As can be seen, the predicted search terms are closely related to the historical search terms, either extending them (the student-desk and toaster examples) or refining them (the turtle and Citizen-watch examples); they have good predictive quality and make an excellent source of recommendation strategies. Going further along this path, we could modify word2vec to obtain a model more sensitive to ordering, or try pure sequence models such as RNNs and LSTMs for even better prediction, but space does not allow us to expand on that here.

Industry application status

Since text topic models were proposed, their good probabilistic properties and their ability to cluster and abstract text data in a meaningful way have led to wide adoption across Internet industries. Search giant Google uses text topic models extensively throughout its systems and developed the large-scale text topic system Rephil. For example, when generating ads for users, text topics are used to compute the match between page content and advertisements, which was one of the important factors in the success of its advertising products; text topics are also used to improve recall and precision when matching user search terms to web pages. Yahoo! likewise uses LDA topic features extensively in its search ranking models and has open-sourced the well-known Yahoo! LDA tool.

In China, the best-known text topic system is Tencent's Peacock, which can mine topics at the scale of millions and plays an important role in Tencent's advertising classification, web page classification, precise ad targeting, QQ group classification, and other important businesses. The HDP (Hierarchical Dirichlet Process) model used in the system is an extension of LDA that can automatically choose the number of topics in the data and capture long-tail topics. Beyond Tencent, text topic models are widely used in the recommendation and search businesses of other companies, each applying them according to its own needs.

Neural network models represented by word2vec have also been widely used in recent years, for example for clustering words, discovering synonyms, query expansion, and expanding recommendation interests. Facebook has developed fastText, an alternative to word2vec that adds the notion of subwords on top of traditional word vectors and obtains better results than word2vec.

Summary and outlook

Starting from simple text keywords, and following the thread of structure, dimensionality reduction, clustering, probability, and sequence, we have introduced some common natural language processing techniques for recommendation systems and their concrete uses, in combination with specific applications such as candidate-set recall, relevance calculation, and ranking-model features. Natural language processing has made great progress in recent years on the back of deep learning, and its close relationship with recommendation systems means there is still huge room for improvement in this area; let us wait and see.

