Gold mines in the eyes of data scientists

Last Update:2014-11-07 Source: Internet

Author: User

With Facebook's stock market listing, social networking has once again become a focus of attention. Compared with traditional forums and blogs, a social network is a bridge between the virtual world and the real one: relationships between people in real life are re-established on the Internet. Facebook, Twitter, and LinkedIn represent three different kinds of social network. Facebook is built on strong ties between friends and helps users maintain and deepen those relationships. Twitter is built on weak ties formed by one-way following, which favors the emergence of opinion leaders and the spread of messages. LinkedIn is a professional network for business people that helps users draw on their social relationships for business communication and job hunting.

These three kinds of social network generate large amounts of user-generated content (UGC) every day, at an unprecedented scale, and this attracts countless researchers hoping to extract valuable information from unordered data. The situation resembles the coin-toss example often used in probability and statistics: a handful of tosses reveals no pattern, but after tens of thousands of tosses it is easy to see that heads and tails occur almost equally often. Social networks produce exactly this kind of large-scale, population-level data. Experts and scholars in computer science, psychology, sociology, journalism and communication, and other fields study it in the hope that stronger analysis and processing capabilities will reveal regularities of human behavior that have not yet been explored.

Social network analysis covers many interesting research topics: identifying community circles (community detection), computing the influence of individuals, modeling how information spreads, identifying false information and bot accounts, and predicting stock markets, elections, and infectious disease outbreaks from social network data. The field is interdisciplinary. Researchers usually take basic conclusions and principles from sociology, psychology, and even medicine as guidance, and then apply algorithms from artificial intelligence, such as machine learning and graph theory, to simulate and predict behavior and future trends on social networks.

Identification of social circles

Unlike content-oriented communities such as forums, the core of a social network is human relationships and the social circles (communities) they form, and each person can belong to several circles depending on his or her relationships and interests. Everything we publish on a social network spreads outward through our circles. The messages we receive come directly from the people we follow, while messages from farther away must be passed along before they reach us. How to discover social circles is therefore a fundamental research problem in social network analysis. Figure 1 shows an example of social circles.

Figure 1. Community detection results based on the OSLOM algorithm

When computers process a social network, the whole network is usually treated as a graph: each user is a node, and each relationship between two people is an edge. Depending on the type of network, the graph can be directed or undirected, and the strength of a relationship can be expressed as a weight on the edge. For a circle discovery algorithm, the quality of a circle depends on how tightly its members are connected to one another and how well separated it is from other circles. For graphs with hundreds of billions of nodes, however, current circle discovery algorithms struggle with the scale of the data, so many researchers have proposed heuristic methods that reduce computational complexity and return an approximate solution.
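The graph view and the idea of circle quality can be sketched in a few lines of Python. This is only an illustration: the article does not name a specific quality measure, so Newman's modularity is used here as one common choice, and the helper names (`build_graph`, `modularity`) and the toy edges are assumptions made for the example.

```python
from collections import defaultdict

def build_graph(edges):
    """Build an undirected weighted graph as {node: {neighbor: weight}}."""
    graph = defaultdict(dict)
    for u, v, w in edges:
        graph[u][v] = w
        graph[v][u] = w
    return dict(graph)

def modularity(graph, circle_of):
    """Newman modularity Q for a partition given as {node: circle label}.

    Q is high when edges inside circles are denser than expected by chance.
    """
    # Every undirected edge is stored twice, so this sum equals 2m.
    two_m = sum(w for nbrs in graph.values() for w in nbrs.values())
    strength = {u: sum(nbrs.values()) for u, nbrs in graph.items()}
    q = 0.0
    for u in graph:
        for v in graph:
            if circle_of[u] != circle_of[v]:
                continue
            q += graph[u].get(v, 0.0) - strength[u] * strength[v] / two_m
    return q / two_m

# Toy data (hypothetical): two triangles joined by a single edge c-d.
edges = [("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 1.0),
         ("d", "e", 1.0), ("e", "f", 1.0), ("d", "f", 1.0),
         ("c", "d", 1.0)]
graph = build_graph(edges)
two_circles = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_circle = {node: 0 for node in graph}
q_two = modularity(graph, two_circles)
q_one = modularity(graph, one_circle)
print(round(q_two, 4), round(q_one, 4))
```

Splitting the toy graph into its two triangles scores higher than lumping everything into one circle, which matches the intuition of tight circles with few links between them. The double loop is quadratic in the number of nodes, which is exactly why heuristics are needed at real scale.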

Real social circles are more complex, because users have many interests and can belong to several circles at once. Research on discovering such circles is known as overlapping community detection. A relatively simple heuristic takes a high-degree node in the network as the initial circle, then repeatedly adds the neighboring node that contributes most to the circle until the contribution measure reaches an extreme value, at which point the circle is complete. A boundary node that contributes strongly to more than one circle is added to each of them. More recently, label propagation and particle swarm optimization have also been proposed for overlapping community detection.
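The seed-and-grow heuristic described above might look roughly like the following Python sketch. The article does not define the "contribution" measure, so the `fitness` function here (the share of a circle's edge endpoints that stay inside the circle) is an assumption, as are all function names and the toy graph.

```python
from itertools import combinations

def clique_graph(cliques):
    """Toy helper: build {node: set of neighbors} from fully connected groups."""
    graph = {}
    for members in cliques:
        for u, v in combinations(members, 2):
            graph.setdefault(u, set()).add(v)
            graph.setdefault(v, set()).add(u)
    return graph

def fitness(graph, circle):
    """Share of the circle's edge endpoints that point inside the circle."""
    inside = outside = 0
    for u in circle:
        for v in graph[u]:
            if v in circle:
                inside += 1
            else:
                outside += 1
    total = inside + outside
    return inside / total if total else 0.0

def grow_circle(graph, seed):
    """Greedily add the neighbor that raises fitness most, until none does."""
    circle = {seed}
    while True:
        frontier = {v for u in circle for v in graph[u]} - circle
        best, best_fit = None, fitness(graph, circle)
        for v in sorted(frontier):
            candidate_fit = fitness(graph, circle | {v})
            if candidate_fit > best_fit:
                best, best_fit = v, candidate_fit
        if best is None:
            return circle
        circle.add(best)

def overlapping_circles(graph):
    """Grow one circle per uncovered seed, trying high-degree seeds first."""
    circles, covered = [], set()
    for seed in sorted(graph, key=lambda n: (-len(graph[n]), n)):
        if seed in covered:
            continue
        circle = grow_circle(graph, seed)
        circles.append(circle)
        covered |= circle
    return circles

# Toy data (hypothetical): two groups of four that share the member "x".
graph = clique_graph([["a", "b", "c", "x"], ["d", "e", "f", "x"]])
circles = overlapping_circles(graph)
print(circles)
```

In this toy graph the shared member "x" ends up in both circles, because it is absorbed again when the second circle is grown. That is the overlapping behavior the heuristic is meant to capture.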

The value of circle discovery is not limited to relationships that users have established on their own initiative. More importantly, it can uncover latent relationships that users have never made explicit. From the resulting circles we can see more clearly which people belong together. There are, of course, many ways to divide social circles, such as relationship-oriented circles and interest-oriented circles. An algorithm that uses relationship density as its primary indicator will produce a different division from one that uses shared interests.

One question that arises is whether online circles are consistent with real offline social circles. When two people interact with each other on a social network, are they really friends offline? This is hard to answer from an algorithmic point of view, but we can approach it differently by considering offline contact: if A and B have each other's mobile phone numbers, they are very likely to be real friends offline. If products such as Fetion, MiTalk, and WeChat can build a social network from the phone address book, then social circles can be judged comprehensively across heterogeneous social networks, and the value of that would be immeasurable.

The calculation of influence

On social networks, opinion leaders have a huge effect on the spread of information and on the behavior of ordinary users because of their strong online influence. On Sina Weibo, for example, the most visible sign of influence is the verified ("V") celebrity: even a post about a meal can be reposted hundreds of times, whereas for an ordinary user a post that reaches double-digit reposts is cause for celebration.

So, as in the real world, people on social networks belong to different tiers and have different levels of influence. But how can influence be measured and calculated? As mentioned earlier, computers usually represent a social network as a graph, and this matches the structure used by search engines, as shown in Table 1. In a search engine the nodes are web pages and the edges are links, and the PageRank algorithm ranks the pages. Applying PageRank to a social network lets us compute each person's influence iteratively. Besides PageRank, other algorithms such as w-entropy are also used to calculate influence in social networks.
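To make the iteration concrete, here is a minimal PageRank sketch in Python applied to a "who follows whom" graph. It assumes that following someone passes influence to that person (an edge from follower to followee), uses the conventional damping factor of 0.85, and runs on made-up users. It illustrates the general algorithm and is not the scoring method of any particular platform.

```python
def pagerank(follows, damping=0.85, iterations=50):
    """Iteratively score users; `follows` maps each user to the set they follow."""
    users = set(follows) | {v for targets in follows.values() for v in targets}
    n = len(users)
    rank = {u: 1.0 / n for u in users}
    for _ in range(iterations):
        # Users who follow nobody spread their score evenly over everyone.
        dangling = sum(rank[u] for u in users if not follows.get(u))
        new_rank = {u: (1.0 - damping) / n + damping * dangling / n
                    for u in users}
        # Each user splits their current score among the users they follow.
        for u, targets in follows.items():
            for v in targets:
                new_rank[v] += damping * rank[u] / len(targets)
        rank = new_rank
    return rank

# Toy data (hypothetical users): everyone follows "star"; carol also follows alice.
follows = {
    "alice": {"star"},
    "bob": {"star"},
    "carol": {"star", "alice"},
    "star": set(),
}
rank = pagerank(follows)
for user in sorted(rank, key=rank.get, reverse=True):
    print(user, round(rank[user], 3))
```

The most-followed account comes out on top, and alice outranks bob because she has a follower while bob has none. The scores always sum to 1, so they can be read as shares of total influence.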

Table 1 Different definitions of graph structure between social networks and search engines

A person's influence, however, differs from one field to another. For example, Kai-Fu Lee's influence is mainly in science and technology, Huang's is in sports, and Shire's is mainly in investment and public welfare. Evaluating a person's influence in different fields is therefore also an important problem. Some scholars have proposed TAP (Topic Affinity Propagation), a topic-level influence evaluation model, to address it. The algorithm has been applied to large-scale social network data and shows good results.

Outside China, companies such as Famecount and Klout have designed algorithms to score each person's influence on social networks. Some companies even offer differentiated real-world services based on that score. Hong Kong's Cathay Pacific, for example, lets travelers with a Klout score of at least 40 use its airport VIP lounge. The practice has been criticized by many as snobbish, but it can also be seen as a new exploration of business models built on online influence. In China, Sina's micro-data and miu+ have also explored influence calculation for Weibo, and the field still has plenty of room to develop.

Modeling of information dissemination

On social networks, everyone is a media outlet. Traditional media depend on content to drive communication, whereas the spread of information on social networks relies more on the publisher's influence and social relationships: information reaches the network through the publisher's friends or followers. Those friends and followers see it and, with a certain probability, share or repost it, which spreads it further. Figure 2 visualizes the propagation of a single microblog post.

Figure 2. Propagation map of a single microblog post (from www.doodod.com)

Some scholars model information transmission on social networks by analogy with the spread of infectious diseases in a population and the spread of rumors in society, and then use epidemic dynamics and complex network theory to model and predict propagation behavior. More intuitively, if the whole social network is a graph, with users as nodes and relationships as edges, then information starts at the publishing user's node and travels along adjacent edges. Each neighboring user, depending on factors such as timing and topic, either passes the information on or drops it with a certain probability. In the epidemic dynamics model, nodes are usually divided into three types: propagating nodes, uninfected nodes, and immune nodes. A propagating node has received the information and is able to pass it to its neighbors. An uninfected node has not yet received the information from its neighbors but may still receive it, that is, it may be infected. An immune node has received the information from its neighbors but no longer passes it on. The following propagation rules are then defined:

If a propagating node contacts an uninfected node, the uninfected node becomes a propagating node with a certain probability.

If a propagating node contacts an immune node, the propagating node becomes an immune node with a certain probability.

A propagating node does not spread information forever: at a certain rate it stops propagating and becomes an immune node, without needing contact with any other node.
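The three rules can be tried out with a small stochastic simulation. The Python sketch below is only an illustration under stated assumptions: the probabilities `p_infect`, `p_stifle`, and `p_stop` are hypothetical parameters (the article gives no values), each propagating node contacts one random neighbor per step, and the ring-shaped toy network stands in for a real social graph.

```python
import random
from collections import Counter

def ring_lattice(n, reach=2):
    """Toy network: node i is linked to its `reach` nearest nodes on each side."""
    return {i: sorted({(i + d) % n for d in range(-reach, reach + 1)} - {i})
            for i in range(n)}

def simulate_spread(graph, source, p_infect=0.5, p_stifle=0.3, p_stop=0.1,
                    seed=0, max_steps=10_000):
    """Run the three propagation rules until no propagating node is left."""
    rng = random.Random(seed)
    state = {u: "uninfected" for u in graph}
    state[source] = "propagating"
    for _ in range(max_steps):
        spreaders = [u for u in graph if state[u] == "propagating"]
        if not spreaders:
            break
        for u in spreaders:
            if graph[u]:
                v = rng.choice(graph[u])
                if state[v] == "uninfected" and rng.random() < p_infect:
                    # Rule 1: the uninfected neighbor starts propagating.
                    state[v] = "propagating"
                elif state[v] == "immune" and rng.random() < p_stifle:
                    # Rule 2: meeting an immune node makes the spreader immune.
                    state[u] = "immune"
            if state[u] == "propagating" and rng.random() < p_stop:
                # Rule 3: the spreader stops on its own, without any contact.
                state[u] = "immune"
    return state

final_state = simulate_spread(ring_lattice(30), source=0)
print(Counter(final_state.values()))
```

When the run ends, every node is either immune (it received the message at some point) or still uninfected (the message never reached it). Repeating the simulation with different seeds, source nodes, or probabilities gives a feel for how far a message typically travels.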

From these rules, state transition equations can be established using the methods of epidemic dynamics. Once the propagation model is built, we can examine how the degree of the initial propagating node (its number of friends or followers) and the strength of its relationships (the edge weights) affect transmission, and so discover the regularities of information propagation on social networks.
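As an illustration of what such state transition equations can look like, the following is one possible mean-field form of the three rules for a homogeneous network. All symbols are assumptions introduced here, since the article does not give them: i(t), s(t), and r(t) are the fractions of uninfected, propagating, and immune nodes, \bar{k} is the average degree, \lambda is the probability in the first rule, \alpha is the probability in the second rule, and \delta is the spontaneous stopping rate in the third rule.

```latex
\begin{aligned}
\frac{di(t)}{dt} &= -\lambda \bar{k}\, i(t)\, s(t), \\
\frac{ds(t)}{dt} &= \lambda \bar{k}\, i(t)\, s(t) - \alpha \bar{k}\, s(t)\, r(t) - \delta\, s(t), \\
\frac{dr(t)}{dt} &= \alpha \bar{k}\, s(t)\, r(t) + \delta\, s(t),
\end{aligned}
\qquad i(t) + s(t) + r(t) = 1 .
```

The three right-hand sides sum to zero, so the total population is conserved. Each term corresponds to one rule: the \lambda term moves nodes from uninfected to propagating, and the \alpha and \delta terms move nodes from propagating to immune.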

Identification of false users

Identifying false information and false users is basic work for both in-depth research and practical applications of social networks, and it matters a great deal. As information spreads on a social network, it inevitably meets interference from false content or from paid posters operating fake accounts (the so-called "water army"). If false users and false content can be identified, the real views and state of public opinion can be recovered more faithfully, giving enterprises doing marketing and governments tracking public opinion more accurate and useful data. In general, false users are easier to identify on social networks than on anonymous forums, because they can be examined along more dimensions. Fake accounts usually have no real social interaction, and most of the accounts they are linked to are also fake. In addition, an account's reposting behavior and the content it posts can be used to identify false users effectively. We use the following eight behavioral features to judge whether a Sina Weibo user is fake:

Consistency of the account's creation time

The account's avatar and name

Ratio of accounts followed to followers

Quality of the account's followers

Number of posts published

Distribution of sources among the last 200 reposts

Frequency of reposting the same post

Text written when reposting

Using these eight features, a classification model can be trained with a machine learning algorithm and then applied to new accounts. This finds false users effectively, so that they can be removed during public opinion analysis and the true state of information dissemination and public opinion (speech published openly on the network) can be restored.
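A rough sketch of the train-then-predict workflow follows. The article does not say which classifier was used, so plain logistic regression written in standard-library Python is assumed here. For brevity only three of the eight features are encoded, and both the numeric encoding and the labeled accounts are invented purely for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(samples, labels, lr=0.5, epochs=2000):
    """Fit logistic regression with full-batch gradient descent."""
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, labels):
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / len(samples) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(samples)
    return weights, bias

def is_fake(x, weights, bias):
    """Classify an account as fake when the predicted probability is >= 0.5."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) >= 0.5

# Hypothetical encoding, each feature scaled to [0, 1]:
#   [followed-to-follower ratio, share of reposts repeating one post, default avatar]
# Invented toy accounts: label 0 = real user, 1 = fake user.
samples = [
    [0.10, 0.00, 0], [0.20, 0.10, 0], [0.30, 0.05, 0], [0.15, 0.20, 0],
    [0.90, 0.80, 1], [0.80, 0.90, 1], [0.95, 0.70, 0], [0.70, 0.85, 1],
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

weights, bias = train_logistic(samples, labels)
print(is_fake([0.85, 0.90, 1], weights, bias))  # account with bot-like behavior
print(is_fake([0.10, 0.05, 0], weights, bias))  # account with ordinary behavior
```

In practice the hard part is not the classifier but obtaining reliable labels and turning features such as "quality of followers" into numbers. A real system would also be evaluated on held-out accounts rather than on the training data.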

Predicting the future with data

The most intriguing research on social network data is predicting the future. Social networks attract hundreds of millions of people to publish their data, status, and moods online every day, and the sheer scale of that data gives data scientists a chance to discover unknown regularities of human behavior.

By monitoring public sentiment on Twitter, American scientists have found that it correlates strongly with many social phenomena and events. For example, some researchers found that a rise in emotional expression, whether "hopeful" or "fearful", signals a fall in US stock market indexes. Some researchers therefore suggest that a sudden change in public mood on social networks reflects uncertainty about the stock market and can be used to predict its future trend.

In epidemiology, British scientists are tracking flu outbreaks using Twitter data. They rely mainly on keywords in users' messages, such as "I am having a headache", combine them with the location from which each message was posted, compare the results with official data from the regional health department, and finally build a forecasting model. The startup Sickweather has even built its business around disease prediction.
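The first step of such a pipeline, counting symptom mentions per region, can be sketched as follows. The keyword list, the posts, and the region names are all hypothetical. The later steps (calibrating the counts against official health data and building the forecasting model) are not shown.

```python
import re
from collections import Counter

# Hypothetical keyword list. Word boundaries keep "flu" from matching "influence".
SYMPTOM_PATTERN = re.compile(r"\b(headache|fever|sore throat|flu)\b", re.IGNORECASE)

def symptom_counts_by_region(posts):
    """Count posts per region that mention at least one symptom keyword."""
    counts = Counter()
    for region, text in posts:
        if SYMPTOM_PATTERN.search(text):
            counts[region] += 1
    return counts

# Invented example posts as (region, text) pairs.
posts = [
    ("London", "I am having a headache"),
    ("London", "Great match today"),
    ("Leeds", "Fever and sore throat all week"),
    ("London", "Down with the flu"),
    ("Leeds", "Her influence is growing"),
]
counts = symptom_counts_by_region(posts)
print(counts)
```

Simple keyword matching also picks up jokes, news reports, and figurative uses, which is one reason the counts have to be checked against official figures before they are trusted.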

Many researchers have also used data mining to predict movie box office revenue and the trends and results of US elections, with surprisingly good results.

However, we should not be too optimistic about the predictive power of social network data. Prediction from social networks depends on massive data, and current algorithms for analyzing massive text data have not yet reached ideal accuracy. Judging emotion from text looks simple but is in essence a problem at the intersection of natural language processing and the psychology of emotion. Present natural language processing relies mainly on probabilistic and statistical methods together with lexical and syntactic analysis. Sentiment in text is judged either with methods based on lexicons and grammatical structure or with methods based on machine learning. Both struggle with even slightly complicated language, especially irony and implication. In addition, social network users do not fully represent the general population, because they differ greatly from it in age, geography, race, and other respects, so predictions based only on social network data are likely to be biased. Scientific and effective sampling from a population perspective is therefore also an important part of social network prediction.
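The weakness of lexicon-based sentiment judgment with irony is easy to demonstrate. The Python sketch below uses a deliberately tiny, made-up word list and simply counts positive and negative words. Real systems use far larger lexicons and also consider grammatical structure, but the failure mode is the same in kind.

```python
import re

# Tiny hypothetical lexicon, for illustration only.
POSITIVE = {"great", "love", "happy", "good", "wonderful"}
NEGATIVE = {"bad", "hate", "sad", "terrible", "awful"}

def lexicon_sentiment(text):
    """Label text by comparing counts of positive and negative lexicon words."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

plain_praise = lexicon_sentiment("I love this phone, the camera is great")
plain_complaint = lexicon_sentiment("What a terrible, awful day")
ironic = lexicon_sentiment("Oh great, my flight is delayed again. I just love waiting.")
print(plain_praise, plain_complaint, ironic)
```

The two plain sentences are labeled as expected, but the ironic complaint is labeled positive because it contains "great" and "love". A person reading the sentence sees the opposite meaning at once.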

Summary

The understanding and mining of social network data is still at a relatively early stage, and methods for mining large-scale, high-dimensional data are still evolving. Many basic problems, such as sentiment analysis of text and prediction of propagation on social networks, cannot yet be solved effectively, which limits in-depth study of social networks. However, as artificial intelligence advances, especially through the combination of cognitive neuroscience and artificial intelligence, there is new hope. When we are truly able to solve these problems, social networks will become a useful tool for predicting future trends. Making full use of social network data also means exposing more and more of users' privacy, so finding a balance between user privacy and data completeness is another problem for future data workers.

About the author: Zhang Wenhao, founder of Unique Technology (DOODOD), Ph.D., Department of Computer Science, Tsinghua University, focuses on research and development in social network analysis, text sentiment mining, and related fields.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.