An Introduction to the Algorithms Behind Weibo
Weibo is a social application that many people use. People who browse Weibo daily perform a handful of operations every day: posting original content, forwarding, replying, reading, following, @-mentioning, and so on. The first four act on short posts, while following and @-mentioning act on the relationships between users: following someone means you become their fan and they become your friend, and @-mentioning someone means you want them to see your post.
Weibo is regarded as "self-media", where ordinary people share "news" related to themselves. Recently, reports of people profiting from their influence on such media have become common. How is a person's influence on Weibo calculated? What other algorithms on Weibo act as invisible hands that manage us? And how does each of our actions affect those algorithms?
Intuitively, the Weibo network is a simple microcosm of human society, and some of its characteristics may help us discover laws that hold in real social networks. Thanks to the explosive growth of social networks, social computing, and social network analysis in particular, has become the new darling of data mining. What follows is a brief introduction to some algorithms for analyzing the Weibo network, some of which may also be useful for other social applications.
Label Propagation
Weibo has a vast user base, and different people have different interests. Mining each user's interests helps target advertising and content recommendations more precisely. To capture a user's interests, we can tag the user: each tag (label) represents one interest, and a user can have one or more tags. To obtain the final user tags, we make the first assumption:
Among each user's friends (or fans), the majority share the user's interests.
This leads to the first algorithm introduced in this article, the label propagation algorithm. In this algorithm, each user takes the one or more tags that occur most often among their friends or fans. Friends and fans can also be considered together, in which case the tags of friends and the tags of fans can be given different weights. The label propagation algorithm proceeds as follows:
1) Assign initial tags to a subset of users.
2) For each user, count the tags of their friends and fans, and give the user the one or more most frequent tags.
3) Repeat step 2 until the users' tags no longer change significantly.
Calculation of user similarity
The label propagation algorithm is simple to implement. Its disadvantage is that when the assumption does not match reality, the results become very poor. For example, out of social politeness we usually follow our friends and relatives, and these people do not necessarily share our tags. The solution is to weight the contribution of a friend's or fan's tags to the user's tags by the similarity between the two users. This gives the second assumption:
The more similar a friend or fan is to a user, the more likely that friend's or fan's tags are also the user's tags.
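The label propagation loop, with the optional similarity weighting from the second assumption, might be sketched as follows. This is only an illustration under stated assumptions: the names `propagate_labels`, `neighbors`, `seed_labels`, and `similarity` are hypothetical, each user receives a single tag, the initially tagged users keep their tags, and ties are broken alphabetically.

```python
from collections import defaultdict


def propagate_labels(neighbors, seed_labels, similarity=None, max_iters=20):
    """Spread tags from tagged users to their neighbors.

    neighbors:   dict mapping a user to the list of their friends and fans.
    seed_labels: dict mapping the initially tagged users to their tag.
    similarity:  optional function (user, other) -> weight of other's vote;
                 when None, every neighbor's vote counts as 1.
    """
    labels = dict(seed_labels)
    for _ in range(max_iters):
        updated = dict(labels)
        for user, linked in neighbors.items():
            if user in seed_labels:
                continue  # assumption: initial tags stay fixed
            votes = defaultdict(float)
            for other in linked:
                if other in labels:
                    weight = 1.0 if similarity is None else similarity(user, other)
                    votes[labels[other]] += weight
            if votes:
                # sorted() makes tie-breaking deterministic (alphabetical)
                updated[user] = max(sorted(votes), key=votes.get)
        if updated == labels:
            break  # tags no longer change
        labels = updated
    return labels
```

Calling it without `similarity` gives the plain majority vote of the first algorithm; passing a similarity function lets one very similar neighbor outweigh several dissimilar ones.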
So how do we measure the similarity between users? This requires looking at the posts a user has published, both forwarded and original. Because we care about the similarity between users rather than between individual posts, in practice all of a user's posts are aggregated into one document for the calculation. One option is to use the bag-of-words model to represent the posts as a word vector and then compute cosine similarity directly. But this method is too simple to achieve good results, so here we introduce a similarity calculation based on LDA (Latent Dirichlet Allocation).
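For reference, the simple bag-of-words baseline mentioned above might look like the sketch below. The function names are hypothetical, and splitting on whitespace is an assumption made for brevity; real Weibo text would need Chinese word segmentation first.

```python
import math
from collections import Counter


def bag_of_words(posts):
    """Merge all of a user's posts into one word-count vector."""
    counts = Counter()
    for post in posts:
        counts.update(post.lower().split())  # assumption: whitespace tokens
    return counts


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two word-count vectors."""
    shared = vec_a.keys() & vec_b.keys()
    dot = sum(vec_a[w] * vec_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a user with no words is similar to nobody
    return dot / (norm_a * norm_b)
```

Two users who use exactly the same words in the same proportions score 1, and two users with no words in common score 0.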
LDA still uses the bag-of-words model to represent text, but adds a topic layer in the middle, forming a three-layer "document-topic-word" probabilistic model: each document is regarded as a probability distribution over topics, and each topic as a probability distribution over words. Under the LDA model, a document can be considered to be generated as follows:
1) For each document:
2) draw a topic from the document's topic distribution;
3) draw a word from that topic's word distribution;
4) repeat steps 2 and 3 until all the words in the document have been generated.
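The generative steps above can be illustrated with a short sketch. It takes the topic distribution and the per-topic word distributions as given inputs, which is a simplification: full LDA also draws these distributions from Dirichlet priors. The function name and the example distributions are hypothetical.

```python
import random


def generate_document(topic_dist, topic_word_dists, length, rng=None):
    """Generate `length` words by repeatedly drawing a topic, then a word.

    topic_dist:       dict topic -> probability for this document.
    topic_word_dists: dict topic -> dict word -> probability.
    """
    rng = rng or random.Random()
    topics = list(topic_dist)
    topic_weights = [topic_dist[t] for t in topics]
    words = []
    for _ in range(length):
        # step 2: draw a topic from the document's topic distribution
        topic = rng.choices(topics, weights=topic_weights)[0]
        # step 3: draw a word from that topic's word distribution
        word_dist = topic_word_dists[topic]
        vocab = list(word_dist)
        word = rng.choices(vocab, weights=[word_dist[w] for w in vocab])[0]
        words.append(word)
    return words


if __name__ == "__main__":
    topic_words = {
        "sports": {"match": 0.5, "goal": 0.5},
        "tech": {"phone": 0.6, "code": 0.4},
    }
    print(generate_document({"sports": 0.7, "tech": 0.3}, topic_words, 10))
```

Parameter estimation runs this story in reverse: given only the words, it infers topic distributions that make the observed documents likely.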
Estimating the parameters of the LDA model is beyond the scope of this article. It is enough to know that LDA yields the topic distribution of each user's posts. The similarity between users' topic distributions can then be computed with cosine similarity, KL divergence (KL distance), or similar measures, and that similarity is used to weight the label propagation.
Time factor and network factor
What are the disadvantages of the above algorithm?
A user's interests change over time, so it is not reasonable to aggregate all of a user's posts when computing similarity. Instead, we can take only the N most recent posts. For example, for each user, select the 50 most recent posts and feed them into LDA training. N should be neither too large nor too small: if it is too large, it does not reflect how the user's interests change over time; if it is too small, the randomness of individual posts easily causes interest drift. For the best results there is no need to stick to a fixed N; for example, N can be chosen adaptively for each user based on the time series of their posts.
So far, the algorithm has not considered the network information formed by replies, forwards, @-mentions, and so on. Take forwarding as an example: if a user frequently forwards a particular friend's posts, the user's similarity to that friend should be higher than to other friends. This gives assumption three:
The more frequently a user forwards a friend's posts, the more similar the user's interests are to that friend's.
Similarly, we get assumption four:
The more frequently a user @-mentions a friend in their posts, the more similar the user's interests are to that friend's.
This gives additional factors for computing similarity. There are many ways to add a new factor to the original similarity calculation; for example, the forwarding frequency can be quantified as a value and added as a weight to the similarity measure.
Community Discovery
A Weibo community is a group of closely related people on Weibo: people within a community are closely connected, while connections between communities are sparse. "Closely related" here has two meanings. The first is that people within the community have similar interests. The second is that people within the community are close in the relationship network, for example by requiring that any two users in the community be within two degrees of each other, where a two-degree connection is a friend of a friend.
Interest similarity was described above; relationship similarity has to be computed from the follow relationships between users. Treating each follow as a directed edge, the relationships among all Weibo users form a huge directed graph. The simplest measure of relationship similarity is the reciprocal of the shortest path length between two users. But this measure is imprecise: the real world has the six degrees of separation theory, and Weibo and other social networks tend to be even more tightly connected. So this simple relationship similarity can take only about six discrete values, which is clearly not precise enough.
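The reciprocal-of-shortest-path measure can be sketched with a breadth-first search over the follow graph. The names are hypothetical, and two choices here are assumptions rather than anything the article specifies: paths follow the direction of the follow edges, and unreachable users get similarity 0.

```python
from collections import deque


def shortest_path_length(follows, source, target):
    """Breadth-first search; returns the number of hops or None.

    follows: dict mapping a user to the users they follow.
    """
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        user, dist = queue.popleft()
        for followed in follows.get(user, ()):
            if followed == target:
                return dist + 1
            if followed not in seen:
                seen.add(followed)
                queue.append((followed, dist + 1))
    return None  # target cannot be reached from source


def path_similarity(follows, a, b):
    """Reciprocal of the shortest path length from a to b."""
    hops = shortest_path_length(follows, a, b)
    if hops is None:
        return 0.0  # assumption: unreachable users have similarity 0
    if hops == 0:
        return 1.0  # a user compared with themselves
    return 1.0 / hops
```

Because the result can only be 1, 1/2, 1/3, and so on, the coarseness described above is easy to see.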
To get better results, we use not only the shortest path as an explicit measure but also some implicit measures. We first state two assumptions, assumption five and assumption six:
The more friends two users have in common, the more similar the two users are.
The more fans two users have in common, the more similar the two users are.
Both assumptions can be quantified along the lines of Jaccard similarity, as the size of the intersection divided by the size of the union. The measure for assumption five is called common-pointing similarity: the number of friends the two users have in common divided by the number of all friends of the two users. The measure for assumption six is called common-pointed-to similarity and is computed in the same way, using fans instead of friends. In a sense, these two similarities measure not only relationships but also, to some extent, the similarity of interests: intuitively, the more friends two users both follow, the more similar their interests. These two measures also have a technical name: similarity based on structural context.
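A minimal sketch of these two Jaccard-style measures is shown below. The function names and the `follows` dictionary layout are assumptions made for illustration.

```python
def jaccard(set_a, set_b):
    """Size of the intersection divided by the size of the union."""
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def common_friend_similarity(follows, a, b):
    """Assumption five: overlap between the users that a and b follow."""
    return jaccard(set(follows.get(a, ())), set(follows.get(b, ())))


def common_fan_similarity(follows, a, b):
    """Assumption six: overlap between the users who follow a and b."""
    fans_a = {user for user, followed in follows.items() if a in followed}
    fans_b = {user for user, followed in follows.items() if b in followed}
    return jaccard(fans_a, fans_b)
```

For example, if a follows {x, y} and b follows {y, z}, they share one friend out of three, so the common-friend similarity is 1/3.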
Once the shortest-path similarity and the two common-neighbor similarities (common friends and common fans) have been obtained, a weighted function can fuse them into the final similarity. Clustering algorithms such as K-means and DBSCAN can then be used to obtain the final community clusters. Alternatively, the similarity-weighted label propagation algorithm can be used, treating people with the same tag as one community.
Influence calculation
Community discovery used the relationship network on Weibo to improve the accuracy of similarity calculation. But a relationship network can do much more, and influence calculation is one of its more important applications.
Influence calculation borrows from web page ranking algorithms. The best-known page ranking algorithm is PageRank, invented by Google founders Larry Page and Sergey Brin, which became famous along with Google's commercial success. The algorithm ranks pages based on the links between them. Its core is one assumption: a page that is pointed to by high-quality pages must itself be of high quality.
Following the idea of PageRank, we get the corresponding assumption for influence on Weibo, called assumption seven:
A user who is followed by high-influence users must also have high influence.
Users are treated as the web pages in PageRank, and follow relationships as the links between pages. Following the PageRank procedure, we get an algorithm for computing influence on the Weibo follow network:
1) Give all users the same influence weight.
2) Each user distributes their influence weight equally among the users they follow.
3) Each user's influence becomes the sum of the weights their fans assigned to them.
4) Iterate steps 2 and 3 until the weights no longer change.
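The four steps above might be sketched as follows. Two details are additions that the steps do not mention and should be read as assumptions: a damping factor (standard in PageRank, here 0.85) so that the iteration settles, and an even redistribution of the weight held by users who follow nobody. The name `influence_scores` and the `follows` layout are hypothetical.

```python
def influence_scores(follows, damping=0.85, max_iters=100, tol=1e-10):
    """PageRank-style influence on a follow graph.

    follows: dict mapping a user to the collection of users they follow
             (no duplicates within one user's collection).
    """
    users = sorted(set(follows) | {v for vs in follows.values() for v in vs})
    if not users:
        return {}
    n = len(users)
    scores = {u: 1.0 / n for u in users}  # step 1: equal starting weight
    for _ in range(max_iters):
        incoming = {u: 0.0 for u in users}
        dangling = 0.0  # weight held by users who follow nobody
        for user in users:
            followed = follows.get(user, ())
            if followed:
                share = scores[user] / len(followed)  # step 2: split equally
                for target in followed:
                    incoming[target] += share  # step 3: sum what fans give
            else:
                dangling += scores[user]
        new_scores = {
            u: (1.0 - damping) / n + damping * (incoming[u] + dangling / n)
            for u in users
        }
        change = sum(abs(new_scores[u] - scores[u]) for u in users)
        scores = new_scores
        if change < tol:  # step 4: stop once the weights settle
            break
    return scores
```

The scores always sum to 1, so they can be read as each user's share of the total influence in the network.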
Among page ranking methods, other algorithms based on network relationships include HITS and Hilltop, and influence calculation can borrow from these as well.
What is the disadvantage of the above algorithm?
If influence is based on the relationship network alone, people with many fans will almost inevitably have high influence. Some users can then achieve high influence just by buying zombie fans. Such an algorithm clearly cannot cope with the real situation, because too much information goes unused.
Besides the follow relationships, a user's influence also depends heavily on personal attributes, such as how active the user is and the quality of their posts. Activity can be measured by how frequently the user posts, and post quality by how often the posts are forwarded and replied to. Combining these measures with the result of the algorithm above gives a more accurate influence score.
We can also consider that users' reply relationships, forwarding relationships, and @ relationships each form a network. Each has a corresponding assumption: assumptions eight, nine, and ten, respectively.
The more a user's posts are replied to by high-influence users, the higher the influence of that user.
The more a post is forwarded by high-influence users, the higher the influence of the post's original author.
High-influence users tend to @-mention other high-influence users in their posts.
In this way we get three more networks: the forwarding network, the reply network, and the @ network. Applying the PageRank idea to each of them gives three more influence results. Fusing these with the influence from the follow network gives the final influence score. The fusion can simply be a weighted sum of the results; more complex fusion methods are beyond the scope of this article.
Topic factors and domain factors
What can be done once we have a way to calculate influence?
We can analyze the current hot topics and find the opinion leaders for each hot topic on Weibo. To do so, find the posts related to the current hot topic, and from them the users taking part in it. How do we find the posts related to a hot topic? Posts that carry a topic tag need no further work. For posts without a topic tag, we can use the LDA algorithm described above: it can find the topic distribution of all of a user's posts, and it can also find the topic distribution of a single post. Because posts are limited to 140 characters and are therefore short, a single post rarely covers many topics, so it is enough to take the topic with the highest probability in the post's topic distribution as its topic.
Once the posts and users for a topic have been found, running the influence calculation gives the most influential users for that topic. This is also one aspect of public opinion monitoring and social hot-spot monitoring.
Using the result of the label propagation algorithm, we can compute the influence of the users who share the same tag and obtain an influence ranking within that tag. For example, Kai-fu Lee may not be the most influential user across all fields, but he is among the most influential in the IT field.
Spam user identification
The discussion of influence mentioned the need to keep zombie users from distorting the calculation. If such users can be identified and excluded from the influence calculation, the results improve and the amount of computation decreases.
As with influence calculation, identifying spam users should consider both user attributes and link relationships.
Spam users have some statistical features that differ from those of normal users, for example:
- Spam users' posting times usually follow a regular pattern, which can be measured with entropy. Entropy measures randomness: the greater the randomness, the larger the entropy. To compute it, split time into slices of some granularity, compute the probability that the user posts in each slice, and compute the entropy from those probabilities. The smaller the entropy, the more regular the user's posting times, and the more likely the user is a spam user.
- Some spam users maliciously @-mention others in their posts, so the proportion of @-mentions in their posts is higher than for normal users.
- Some spam users add many URLs to their posts to promote advertising. This can be measured by the proportion of URLs in their posts. Some users also try to cheat for URL clicks with post content that does not match the page the URL points to. In that case we need to judge the consistency between the post and the URL's content. A simple approach is to use the bag-of-words model to represent the post and the linked page as word vectors, and check how frequently the words of the post appear in the linked page.
- For users who post advertisements, we can also run a text classifier on their posts to decide whether each post is an advertisement. If a considerable share of a user's posts are advertisements, the user may be a spam user.
- Spam users usually follow others indiscriminately, so their ratio of fans to friends differs from that of normal users. Normal users also tend to add friends through friends, which forms follow triangles: for example, A sees that their friend B follows C, and if A then follows C as well, a triangle is formed in which A follows B and C, and B follows C. Because spam users follow at random, their proportion of follow triangles generally differs from that of normal users.
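As an illustration of the first feature in the list, the posting-time entropy might be computed as in the sketch below. Using the hour of day as the time slice is an assumption; any other granularity works the same way. The function name is hypothetical.

```python
import math
from collections import Counter


def posting_time_entropy(post_hours):
    """Shannon entropy (in bits) of the hours at which a user posts.

    post_hours: iterable with the hour of day (0-23) of each post.
    A low value means posts cluster in a few time slices.
    """
    counts = Counter(post_hours)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts.values():
        p = count / total  # probability of posting in this time slice
        entropy -= p * math.log2(p)
    return entropy
```

A user who always posts at the same hour gets entropy 0, while a user whose posts are spread evenly over four different hours gets 2 bits. With 24 hourly slices the maximum is log2(24), about 4.58 bits.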
Of course, the differences between spam users and normal users go beyond these; this article does not list them one by one. Identifying spam users is essentially a binary classification problem. Once these attributes have been obtained, they can be fed into a machine learning classifier such as logistic regression (LR), a decision tree, or Naive Bayes.
So far, no link information has been used. In general, spam users follow normal users, but normal users do not follow spam users. This is assumption eleven:
Normal users tend not to follow spam users.
This allows the PageRank algorithm to be used once more, this time to compute the probability that a user is a spam user. Note that the algorithm is initialized with the results of the classifier above: the probability is set to 1 for spam users and 0 for normal users. During the PageRank-style computation, a simple summation formula cannot be used: for example, if a user follows multiple users, the summed probability may exceed 1. So some normalization method or an exponential-family function is needed to update the probabilities.
Conclusion
This article has briefly introduced algorithms for common problems on Weibo; the algorithms used in practice are far more complex than described here. The coverage is also incomplete: friend recommendation, hot-topic tracking, and so on are not discussed. Still, as the old saying goes, one can know the whole leopard from a single spot, and I hope this article helps you better understand social applications like Weibo.
The assumptions stated throughout the text all seem consistent with our intuition, and many effective algorithms can be derived from them. So sometimes, as long as you are willing to look, algorithms are all around you.
from:http://blog.csdn.net/stdcoutzyx/article/details/18814627