The algorithms behind Weibo


Introduction

Weibo is a social app used by a huge number of people. Every day, its users perform a handful of operations: posting original content, reposting, replying, reading, following, and @-mentioning. The first four act on short posts; the last two, following and @-mentioning, concern relationships between users. Following someone makes you their fan (follower) and makes them your friend (someone you follow); @-mentioning someone means you want them to see your post.

Weibo is considered a form of "self-media": a way for ordinary people to share the "news" connected to their own lives. There have recently been reports of people profiting from their influence on this medium. So how is a user's influence on Weibo calculated? What other algorithms on Weibo manage us like invisible hands? And how does each of our actions feed back into those algorithms?

Intuitively, Weibo is a small-scale model of human society, and properties of its network may reveal laws of real social networks. Thanks to the explosive growth of social networks, social computing, and social network analysis in particular, has become a new favorite of data mining. This article briefly introduces some algorithms for analyzing the Weibo network; several of them apply to other social applications as well.

Label Propagation

Weibo's user base is vast, and different people have different interests. Mining each user's interests enables more accurate ad targeting and content recommendation. To capture those interests, each user can be given labels, where each label represents one interest; a user may carry one or several labels. To derive the final labels, we first make an assumption:

Assumption 1: For each user, the majority of their friends (or fans) share that user's interests.

This leads to the first algorithm introduced in this article: label propagation. In this algorithm, each user adopts one or more labels from their friends or fans. When merging the two sources, the labels of friends and the labels of fans can of course be given different weights. The label propagation algorithm proceeds as follows:

1) assign initial labels to some users;

2) for each user, count the labels of their friends and fans, and give the user the one or more labels that occur most often;

3) repeat step 2 until users' labels no longer change significantly.
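The three steps can be sketched in Python. This is a minimal synchronous variant under simplifying assumptions: friends and fans are merged into one neighbor list, ties are broken by count order, and all names are illustrative; a production version would weight friend and fan labels differently.

```python
from collections import Counter

def propagate_labels(neighbors, seed_labels, max_iters=20):
    """Synchronous label propagation.

    neighbors: user -> list of friends and fans
    seed_labels: user -> initial label (these users keep their labels)
    """
    labels = dict(seed_labels)                  # step 1: initial labels
    for _ in range(max_iters):
        changed = False
        for user, nbrs in neighbors.items():
            if user in seed_labels:
                continue                        # seed users keep their labels
            counts = Counter(labels[n] for n in nbrs if n in labels)
            if not counts:
                continue                        # no labeled neighbor yet
            top = counts.most_common(1)[0][0]   # step 2: most frequent label
            if labels.get(user) != top:
                labels[user] = top
                changed = True
        if not changed:                         # step 3: stop once stable
            break
    return labels
```

On a tiny chain graph seeded with one label, the label spreads outward until every reachable user carries it.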

Calculating user similarity

The drawback of label propagation is that when the assumption does not match reality, the results degrade badly. For example, out of social politeness we tend to follow friends and relatives, who do not necessarily share our labels. The remedy is to compute the similarity between users and use it to weight how much each friend's or fan's label contributes to the user's own. This yields a second assumption:

Assumption 2: The more similar a friend or fan is to a user, the more likely that friend's or fan's label is also the user's label.

So how do we measure similarity between users? Here we consider the posts a user has published, both original and reposted. Since we want the similarity between users rather than between individual posts, in practice all of a user's posts are aggregated into one document for the calculation. One option is the bag-of-words method: represent the aggregated posts as a word vector and compute cosine similarity directly. But this method is too crude to give good results, so here we use a similarity calculation based on LDA (latent Dirichlet allocation).

LDA still represents text with a bag of words, but inserts a topic layer in the middle, forming a three-layer "document-topic-word" probabilistic model: each document is treated as a probability distribution over topics, and each topic as a probability distribution over words. Under the LDA model, a document can be seen as generated like this:

1) for each document:

2) draw a topic from the document's topic distribution;

3) draw a word from that topic's word distribution;

4) repeat steps 2 and 3 until all the words in the document have been generated.

Parameter estimation for the LDA model is beyond the scope of this article; it is enough to know that LDA yields a topic distribution for each user's aggregated posts. The similarity between two users can then be computed from their topic distributions with cosine similarity, KL divergence, or similar measures, and that similarity is used to weight label propagation.
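Assuming LDA has already produced a topic vector for each user's aggregated posts, both measures are a few lines of Python:

```python
import math

def cosine_sim(p, q):
    """Cosine similarity between two users' topic vectors (1 = identical)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

def kl_divergence(p, q, eps=1e-12):
    """KL divergence D(p || q): 0 for identical distributions, larger as
    they diverge. Note it is not symmetric; eps guards against log(0)."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q))
```

Cosine is a similarity (higher is closer) while KL is a divergence (lower is closer), so one of them must be inverted before the two can be mixed into a single weighting.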

Time factors and network factors

What are the drawbacks of the above algorithm?

As time passes, a user's interests change, so aggregating all of a user's posts for every similarity calculation is unreasonable. Instead, select the N posts closest to the current time; for example, take each user's 50 most recent posts, aggregate them, and train LDA on those. N should be neither too large nor too small: too large, and the result fails to reflect how the user's interests change over time; too small, and the randomness of what a user happens to post causes spurious interest drift. For the best effect, N need not be fixed; for instance, an adaptive N can be chosen per user according to the time series of their posts.

So far the algorithm has not used the network information in Weibo relationships: replies, reposts, @-mentions, and so on. Take reposting as an example: if one friend's posts are reposted frequently in a user's timeline, the user's similarity to that friend should be higher than to other friends. This can be stated as a third assumption:

Assumption 3: The more frequently a user reposts a friend's posts, the closer the user's relationship with that friend.

Similarly, we obtain a fourth assumption:

Assumption 4: The more frequently a user @-mentions a friend in their posts, the closer the user's relationship with that friend.

These assumptions supply extra factors for the similarity calculation. There are many ways to fold a new factor into the original similarity measure; for example, the repost frequency can be quantized into a value and added as a weight to the similarity measure.
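One way to fold the interaction counts into the interest similarity is a weighted blend. Everything here is an illustrative assumption, not the article's prescribed formula: the weights, and the half-saturation constant that squashes raw counts into [0, 1), would all be tuned in practice.

```python
def adjusted_similarity(topic_sim, repost_count, mention_count,
                        alpha=0.7, beta=0.2, gamma=0.1, half=10.0):
    """Blend interest similarity (assumption 2) with the interaction
    signals of assumptions 3 and 4. Counts are squashed into [0, 1) so
    a handful of reposts matters but a thousand cannot dominate."""
    def squash(count):
        return count / (count + half)   # half = count at which squash() = 0.5
    return (alpha * topic_sim
            + beta * squash(repost_count)
            + gamma * squash(mention_count))
```

Because alpha + beta + gamma = 1 and every term lies in [0, 1], the blended score stays in [0, 1] as well.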

Community Discovery

A Weibo community is a group of closely connected people: relationships within a community are dense, while relationships between communities are relatively sparse. "Relationship" here has two senses. The first is the interest similarity between people within the community. The second is the closeness of their connections, for example requiring that any two users within a community be within two degrees of each other, where a two-degree connection means a friend of a friend.

Interest similarity has been covered above; relationship similarity must be computed from the follow relationships between users. Treating each follow as a directed link, the relationships among all Weibo users form one enormous graph. A simple relationship similarity is easy to imagine, such as the reciprocal of the shortest-path length between two users. But this is inaccurate: by the six-degrees-of-separation theory, and since Weibo and other social networks tend to bring people even closer, this simple similarity takes at most about six discrete values, which is clearly not precise enough.

For better results, we use not only the shortest path as an explicit measure but also some implicit measures, starting from two more assumptions, numbers 5 and 6:

Assumption 5: The more friends two users have in common, the higher their relationship similarity.

Assumption 6: The more fans two users have in common, the higher their relationship similarity.

The quantization of these two assumptions can borrow from the Jaccard similarity: the size of the intersection of two sets divided by the size of their union. The measure for Assumption 5 may be called common-followee similarity: the number of friends two users share divided by the number of all their friends combined. The measure for Assumption 6 may correspondingly be called common-follower similarity and is computed analogously. In a sense these two similarities measure not only the relationship but also the interest similarity between users: intuitively, the more friends two users share, the more interests they share. These similarities also have a technical name: similarity computed from structural context.
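The two Jaccard-style measures can be sketched directly. The 50/50 weight combining them is an arbitrary assumption for illustration:

```python
def jaccard(set_a, set_b):
    """Jaccard similarity: |intersection| / |union|."""
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

def structural_similarity(u, v, followees, followers, w=0.5):
    """Combine common-followee (assumption 5) and common-follower
    (assumption 6) similarity; followees[u] is the set u follows,
    followers[u] the set of u's fans."""
    return (w * jaccard(followees[u], followees[v])
            + (1 - w) * jaccard(followers[u], followers[v]))
```

Sets, rather than lists, make the intersection and union operations both correct and fast.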

Once the shortest-path similarity, common-followee similarity, and common-follower similarity have been obtained, a weighting function can combine them into a final similarity. Then a clustering algorithm such as K-means or DBSCAN yields the final community clusters. Alternatively, the similarity-weighted label propagation algorithm can be run, and people with the same label treated as one community.

Influence calculation

In community discovery, using the Weibo relationship network improved the accuracy of the similarity calculation. But the relationship network can do much more, and influence calculation is one of its more important applications.

Influence calculation borrows from web page ranking algorithms. The best known of these is PageRank, invented by Google founders Larry Page and Sergey Brin, which rose to fame along with Google's commercial success. The algorithm ranks pages by the links between them, and its core is one assumption: pages linked to by high-quality pages must themselves be high quality.

Following PageRank's reasoning, we can state an assumption about influence on Weibo, Assumption 7:

Assumption 7: Users followed by high-influence users also tend to have high influence.

Treat a user as PageRank treats a web page, and a follow relationship as a link between pages. Then, based on the PageRank algorithm, we obtain an influence calculation over the Weibo network:

1) give all users the same initial influence weight;

2) divide each user's influence weight equally among the people they follow;

3) for each user, set their influence to the sum of the weights assigned to them by their fans;

4) iterate steps 2 and 3 until the weights no longer change significantly.
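A minimal sketch of the iteration follows. Two details are additions beyond the four steps: the usual PageRank damping factor, and an even redistribution for users who follow nobody (otherwise their weight would leak out of the system).

```python
def influence_pagerank(followees, damping=0.85, iters=50):
    """PageRank-style influence on the follow network.

    followees: user -> list of users they follow.
    Weight flows from a fan to the people they follow (assumption 7
    in the text: being followed by influential users confers influence).
    """
    users = list(followees)
    n = len(users)
    rank = {u: 1.0 / n for u in users}              # step 1: equal weights
    for _ in range(iters):                          # step 4: iterate
        new = {u: (1 - damping) / n for u in users}
        for u in users:
            outs = followees[u]
            if not outs:                            # follows nobody:
                for v in users:                     # spread weight evenly
                    new[v] += damping * rank[u] / n
            else:
                share = damping * rank[u] / len(outs)   # step 2: split weight
                for v in outs:
                    new[v] += share                 # step 3: fans' shares sum up
        rank = new
    return rank
```

Because weight is conserved each round, the scores always sum to 1, which makes runs on different snapshots of the graph comparable.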

Among web ranking algorithms based on link structure there are also HITS, the Hilltop algorithm, and others; these too can be borrowed for influence calculation.

What are the drawbacks of the above algorithm?

If influence is based only on the relationship network, it is easy to game: users with many fans inevitably score high. This leads some users to buy zombie fans (fake followers) to obtain high influence. Such an algorithm clearly cannot cope with reality, because too much available information goes unused.

A user's influence depends not only on their follow relationships but also, to a large extent, on personal attributes such as activity level and post quality. Activity can be measured by posting frequency, and post quality by repost and reply counts. Measuring these values and combining them with the result of the algorithm above gives a more accurate influence score.

We can also observe that a user's reply, repost, and @-mention relationships each form a network of their own, with corresponding assumptions, numbers 8, 9, and 10:

Assumption 8: The more high-influence users reply to a post, the higher the influence of the post's author.

Assumption 9: The more high-influence users repost a post, the higher the influence of the post's original author.

Assumption 10: High-influence users tend to @-mention other high-influence users in their posts.

In this way we obtain three more networks: the repost network, the reply network, and the @-mention network. Borrowing the PageRank algorithm on each yields three more influence scores. Fusing them with the influence from the follow network gives the final result. The fusion here can simply be a weighted sum of the scores; more complex fusion methods are beyond the scope of this article.

Topic factors and domain factors

What can be done once influence has been calculated?

We can analyze current hot topics and identify the opinion leaders on those topics on Weibo. The approach is to find the posts relevant to a current hot topic, and thereby the users participating in it. How do we find the relevant posts? Posts carrying a topic hashtag need no further work. For posts without one, the LDA algorithm described above can help: besides producing a topic distribution over all of a user's posts, it can produce a topic distribution for a single post. Because posts are limited to 140 characters and therefore short, a single post rarely contains many topics, so we take the highest-probability topic in the post's distribution as its topic.
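Assuming per-post topic distributions are already available from LDA, the argmax labeling and the user lookup are each one line (names here are illustrative):

```python
def post_topic(topic_dist):
    """Label a short post with its single most probable topic index."""
    return max(range(len(topic_dist)), key=topic_dist.__getitem__)

def users_on_topic(post_topics, hot_topic):
    """Collect users whose posts fall under the hot topic.

    post_topics: iterable of (user, per-post topic distribution) pairs.
    """
    return {user for user, dist in post_topics if post_topic(dist) == hot_topic}
```

The resulting user set is exactly the subgraph over which the influence calculation would then be rerun.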

Having found the posts and users for a topic, running the influence calculation algorithm over them yields the most influential users on that topic. This is one aspect of public opinion monitoring and social hotspot monitoring.

Using the results of label propagation, running the influence calculation over the users who share a label yields an influence ranking under that label, that is, a ranking within a field. For example, Lee's influence across all fields may not be the highest, but within the IT field it is certainly among the best.

Spam user identification

The influence calculation mentioned avoiding interference from zombie users. If such users can be identified and excluded before influence is computed, the results improve and the amount of computation drops.

As with influence calculation, identifying spam users should consider both user attributes and link relationships.

Spam users exhibit statistical features that differ from those of normal users, for example:

• Spam users' posts tend to follow a temporal pattern, which can be measured with entropy. Entropy is a measure of randomness: the greater the randomness, the greater the entropy. Concretely, slice time at some granularity, compute the probability of a post falling into each slice, and calculate the entropy of that distribution. The smaller the entropy, the more regular the user's posting times, and the more likely the account is a spam user.

• Some users tend to @-mention others maliciously, so the proportion of @-mentions in their posts is higher than for ordinary users.

• Some spam users pack their posts with URLs to push advertisements, which can be measured by the proportion of URLs in their posts. Other users bait clicks with URLs whose landing pages do not match the post text, so text-URL consistency needs to be checked: a simple approach is to represent both the post and the landing page as bag-of-words vectors and see how often the post's words appear on the page.

• For users posting promotional advertisements, text classification can be applied to their posts to decide whether each is an ad; if a substantial fraction of a user's posts are ads, the user is probably a spam user.

• Spam users generally follow others indiscriminately, so their ratio of fans to friends differs from normal users'. Moreover, normal users usually add friends through friends, which forms follow triangles: if A sees that friend B follows C, and A then follows C as well, the triangle "A follows B, B follows C, A follows C" is formed. Because spam users follow indiscriminately, the proportion of such triangles among their relationships differs from normal users'.
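The posting-time entropy feature from the first point can be sketched as follows. Bucketing by hour of day is an illustrative choice of granularity, not something the article prescribes:

```python
import math
from collections import Counter

def posting_time_entropy(timestamps, slot_seconds=3600):
    """Shannon entropy of a user's posting-time distribution.

    Posts (Unix timestamps) are bucketed by hour of day; an account
    that posts on a fixed schedule concentrates in few slots and
    scores low entropy, flagging it as a likely bot."""
    slots = [int(t // slot_seconds) % 24 for t in timestamps]
    total = len(slots)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(slots).values())
```

A user posting only within a single hour scores 0, while posts spread evenly across all 24 hours score log2(24) ≈ 4.58, the maximum for this bucketing.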

Of course, the differences between spam users and normal users go beyond these, and this article will not enumerate them all. Spam user identification is essentially a binary classification problem: after extracting these attributes, feed them into a machine learning classifier such as logistic regression (LR), a decision tree, or naive Bayes to classify the users.

So far the link information has not been used. In general, spam users follow normal users, while normal users do not follow spam users. This is Assumption 11:

Assumption 11: Normal users do not tend to follow spam users.

This allows the PageRank algorithm to be used again, this time to compute the probability that a user is a spam user. Note that the algorithm is initialized with the classifier's results: spam users get probability 1 and normal users probability 0. During the PageRank-style computation, a simple summation formula will not do; for example, if a user follows several spam users, the summed probability could exceed 1, so some normalization method or an exponential-family squashing function is needed when updating the probabilities.
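One minimal sketch of this idea, with averaging over followees substituted for PageRank's summation as the normalization (an assumption of this sketch, not the article's prescription):

```python
def propagate_spam_prob(followees, classifier_prob, alpha=0.5, iters=20):
    """Spread spam probability along follow edges (assumption 11: a
    user who mostly follows spam accounts is itself suspicious).

    followees: user -> list of users they follow.
    classifier_prob: user -> spam probability from the classifier.
    Averaging over followees keeps every score inside [0, 1]."""
    prob = dict(classifier_prob)            # initialize from the classifier
    for _ in range(iters):
        new = {}
        for user, outs in followees.items():
            neighbor = (sum(prob[v] for v in outs) / len(outs)) if outs else 0.0
            # blend the classifier prior with the neighborhood signal
            new[user] = (1 - alpha) * classifier_prob[user] + alpha * neighbor
        prob = new
    return prob
```

Keeping the classifier prior in every update anchors known-spam accounts near 1 and prevents the scores from all washing out toward a uniform value.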

Conclusion

This article has introduced algorithms for common problems on Weibo; the algorithms used in practice are more complex than those presented here. Nor does this article cover every topic: friend recommendation, trend tracking, and so on are not addressed. But as the old saying goes, a single glimpse can reveal the whole picture; I hope this article helps you better understand social applications such as Weibo.

Throughout the article we have seen assumptions that match our everyday intuition, and from them many effective algorithms can be derived. So sometimes, as long as you are willing to look, algorithms are all around you.

Zhang Yushi

Links:http://blog.csdn.net/stdcoutzyx/article/details/18814627

