Big Data Learning Note 5 • Big data in Social computing (3)

Source: Internet
Author: User

The first two articles described our research on understanding the regularities of user movement, including how to handle missing data in user trajectories and how to recommend points of interest to users. In this article, I will present our research on user characterization.


First, I would like to introduce our recently launched Lifespec project. Its goal is to use user data from social networks to explore every aspect of urban lifestyles.

We collected data from multiple social networks, including Jiepang, Weibo, the book and film review site Douban, and the well-known restaurant review site Dianping. The data covers more than 1.4 million users, each of whom has at least two accounts in our dataset.

The figure on the right shows the percentage of users by number of accounts. Every user in our dataset has at least two accounts; this was a requirement of the data collection. About 40% of the users have at least three accounts.
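As a minimal sketch of how such account-count percentages can be computed, the snippet below keeps only users with at least two linked accounts and reports the share with at least k accounts. The function name and the toy numbers are made up for illustration and do not come from the Lifespec dataset.

```python
def account_count_shares(accounts_per_user, min_accounts=2):
    """Return {k: share of retained users with at least k accounts}.

    Users with fewer than `min_accounts` accounts are dropped first,
    mirroring the collection requirement described in the text.
    """
    kept = [n for n in accounts_per_user.values() if n >= min_accounts]
    if not kept:
        return {}
    total = len(kept)
    return {
        k: sum(1 for n in kept if n >= k) / total
        for k in range(min_accounts, max(kept) + 1)
    }


# Toy data (hypothetical): user id -> number of linked accounts.
toy_accounts = {"u1": 2, "u2": 3, "u3": 2, "u4": 1, "u5": 4}
shares = account_count_shares(toy_accounts)
```

Here `shares[2]` is always 1.0 by construction, which is the same observation the figure makes about the real dataset.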

From these social networks, we collect different types of user footprints, including microblog posts, photos, check-ins, movies, books, music, offline events, online shopping history, and more. Because users may share profile information on some of these networks, we can also assemble a public profile of each user from the different websites, including age, gender, relationship status, occupation, university, high school, and so on.

In total, we have collected 53 million footprints, covering check-ins, movie and music reviews, events, and book reviews. We also have about 3 million social relationships between users. About 39 million of the footprints are check-ins, which means that most footprints are location check-ins, so location is an important kind of data in our dataset. Our users come from different cities in China. Shanghai, Beijing, Guangzhou and a few other large cities have more users than the rest.

Let's look at some simple statistics for the dataset.

    • The figure at the top left shows the daily number of check-ins for two cities, taking Beijing and Guangzhou as examples. The x-axis represents each day of the year, and the y-axis represents the number of check-ins. The figure shows that there are more check-ins on weekends than on weekdays. There are also more check-ins during national holidays, such as the May 1 (Labor Day) and October 1 (National Day) holidays.
    • If we look at the number of check-ins at different times of the day, we see different patterns. In the figure at the bottom left (still comparing Beijing and Guangzhou), the x-axis again represents the day of the year, and the y-axis represents the time of day. People are less active at night because they are usually asleep. Comparing the two cities, we also find that people in Beijing go to bed earlier than people in Guangzhou. This observation is consistent with a separate questionnaire survey conducted by the Chinese Physicians Association, whose results were released on World Sleep Day 2013: on average, people in Beijing fall asleep at 10:15 in the evening, while people in Guangzhou fall asleep after 11:00 in the evening.
    • We study the movement patterns of people in different cities. For example, we studied the movement pattern of Shanghai residents in Beijing, that is, people from Shanghai who go to Beijing for tourism or on business.

      These graphs show the check-in density distributions in Beijing, Shanghai and Hong Kong, broken down by where the users come from. The graph at the top left shows the movement pattern of Beijing residents in Beijing, which is the local movement pattern. The graph at the top middle shows Shanghai residents in Beijing, that is, the movement pattern of visitors from Shanghai. Looking at all 9 graphs, we can see that the local movement pattern usually covers a larger area of the city than the non-local one. If you live in a city, you also go to places that are not so famous. If you visit another city, you are likely to go to tourist attractions, convention centers, airports or railway stations. So if we know whether a person is a local, we can use this property to help with location prediction. In our experiments, we found that this improves the accuracy of location prediction.
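The statistics above can be sketched in a few lines. This is a minimal illustration, not the analysis code behind the figures: the function names, the 0.01-degree grid (roughly 1 km) and the toy coordinates are all assumptions. The first function compares average daily check-ins on weekends and weekdays; the second counts how many grid cells a set of check-ins touches, a simple proxy for how much of a city a group covers.

```python
import math
from collections import Counter
from datetime import date


def weekend_weekday_means(checkin_dates):
    """Mean check-ins per observed day, as (weekend_mean, weekday_mean).

    Days with no check-ins at all are not observed and are not counted.
    """
    per_day = Counter(checkin_dates)
    weekend = [n for d, n in per_day.items() if d.weekday() >= 5]
    weekday = [n for d, n in per_day.items() if d.weekday() < 5]

    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0

    return mean(weekend), mean(weekday)


def covered_cells(points, cell_deg=0.01):
    """Set of grid cells touched by (lat, lon) check-ins."""
    return {
        (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        for lat, lon in points
    }


# Toy data (hypothetical): 4 and 5 May 2013 fall on a weekend.
toy_dates = (
    [date(2013, 5, 4)] * 3
    + [date(2013, 5, 5)] * 2
    + [date(2013, 5, 6), date(2013, 5, 7)]
)
weekend_mean, weekday_mean = weekend_weekday_means(toy_dates)

# Locals spread over the city; visitors cluster around one spot.
local_points = [(39.905, 116.395), (39.955, 116.305),
                (39.985, 116.485), (39.865, 116.455)]
visitor_points = [(39.905, 116.395), (39.9052, 116.3951),
                  (39.9058, 116.3957)]
local_coverage = len(covered_cells(local_points))
visitor_coverage = len(covered_cells(visitor_points))
```

In this toy example the locals touch four cells and the visitors only one. A coverage figure of this kind is one way a local versus visitor flag could be turned into a feature for location prediction.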

We connect a user's accounts on different social networks based on two types of self-disclosed information.

    • Cross-domain posting: a user posts a message on one social network and synchronizes it to other social networks. For example, if you post a check-in on Foursquare, you can also sync it to Facebook. Then, based on the content, time and location, we can tell that the two accounts belong to the same person.
    • User profile: users often list their accounts on other social networks on their profile pages. For example, a user might display their LinkedIn, Facebook, and Twitter accounts on their home page. We can therefore also use this information to connect different user accounts.
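The cross-posting signal can be illustrated with a small matcher. This is a sketch under stated assumptions, not the authors' method: each post is assumed to be a tuple of account id, text, timestamp and venue, and the 5-minute window and the two-match threshold are arbitrary choices for the example.

```python
from datetime import datetime, timedelta


def link_by_cross_posts(posts_a, posts_b,
                        max_gap=timedelta(minutes=5), min_matches=2):
    """Pair accounts on two networks that share near-simultaneous posts.

    A post matches when the normalized text and the venue are identical
    and the timestamps differ by at most `max_gap`. An account pair is
    linked once it has at least `min_matches` matching posts.
    """
    index = {}
    for account, text, ts, venue in posts_b:
        index.setdefault((text.strip().lower(), venue), []).append((account, ts))

    votes = {}
    for acc_a, text, ts_a, venue in posts_a:
        for acc_b, ts_b in index.get((text.strip().lower(), venue), []):
            if abs(ts_a - ts_b) <= max_gap:
                votes[(acc_a, acc_b)] = votes.get((acc_a, acc_b), 0) + 1
    return {pair for pair, n in votes.items() if n >= min_matches}


# Toy data (hypothetical accounts and posts).
t0 = datetime(2013, 5, 4, 12, 0)
posts_a = [
    ("a_alice", "Lunch at the noodle bar", t0, "venue_1"),
    ("a_alice", "Great coffee here", t0 + timedelta(hours=3), "venue_2"),
    ("a_bob", "Great coffee here", t0 + timedelta(days=1), "venue_2"),
]
posts_b = [
    ("b_alice", "Lunch at the noodle bar", t0 + timedelta(seconds=20), "venue_1"),
    ("b_alice", "great coffee here ", t0 + timedelta(hours=3, seconds=5), "venue_2"),
    ("b_carol", "Great coffee here", t0 + timedelta(hours=5), "venue_2"),
]
links = link_by_cross_posts(posts_a, posts_b)
```

Requiring more than one matching post guards against two different people who happen to write the same short message at the same venue.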

Based on this self-disclosed information, we developed the iconnect algorithm.

The iconnect algorithm discovers user profiles on different social networks, follows the connected accounts, and recursively discovers more accounts and connections. In this way, we crawl multiple social networks and collect user data. Our user data consists of three parts. The first part is the profile, including age, gender and other personal background information. The second part is the footprints, including microblog posts, check-ins and various reviews. The third part is the friend relationships, that is, the relationships between different users.
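The recursive discovery step can be pictured as a graph traversal. The sketch below is not the iconnect implementation, whose details are not given here. It is a plain breadth-first search in which `links_on_profile` is a hypothetical callback standing in for whatever crawler returns the accounts a profile links to.

```python
from collections import deque


def discover_accounts(seed, links_on_profile):
    """Collect every account reachable from `seed` via self-disclosed links.

    `links_on_profile(account)` must return an iterable of accounts that
    the given account's profile page points to.
    """
    seen = {seed}
    queue = deque([seed])
    while queue:
        account = queue.popleft()
        for other in links_on_profile(account):
            if other not in seen:  # avoid revisiting and infinite loops
                seen.add(other)
                queue.append(other)
    return seen


# Toy link graph (hypothetical): accounts are (network, user id) pairs.
toy_links = {
    ("net_a", "alice"): [("net_b", "alice88")],
    ("net_b", "alice88"): [("net_a", "alice"), ("net_c", "a.lice")],
    ("net_c", "a.lice"): [],
    ("net_a", "bob"): [("net_b", "bobby")],
}
found = discover_accounts(("net_a", "alice"),
                          lambda account: toy_links.get(account, []))
```

Starting from one account, the traversal reaches the two other accounts of the same person and never touches the unrelated ones.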

After collecting these datasets, we want to use them to study every aspect of the lifestyle of a group of people. Thanks to the user profiles, we can group people by location, university, age or company. Once we define a group, we can obtain all the footprints of that group. We use a tree to represent the lifestyle of the group. The root node of the tree represents the footprints, or lifestyle, common to the whole group. The root node of each subtree represents behavior common to a subgroup.
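Grouping users by a profile attribute and pooling their footprints might look like the following. This is a minimal sketch, assuming profiles and footprints are plain dictionaries keyed by user id; the attribute names and the toy values are illustrative only.

```python
from collections import Counter, defaultdict


def footprints_by_group(profiles, footprints, attribute):
    """Pool footprint counts per group, where a group is a profile value.

    Users whose profile lacks `attribute` are skipped, since not every
    user discloses every field.
    """
    groups = defaultdict(Counter)
    for user, profile in profiles.items():
        key = profile.get(attribute)
        if key is None:
            continue
        groups[key].update(footprints.get(user, []))
    return groups


# Toy data (hypothetical users).
toy_profiles = {
    "u1": {"city": "Beijing", "occupation": "finance"},
    "u2": {"city": "Beijing"},
    "u3": {"city": "Shanghai"},
}
toy_footprints = {
    "u1": ["office", "fast food"],
    "u2": ["fast food", "comedy"],
    "u3": ["coffee"],
}
by_city = footprints_by_group(toy_profiles, toy_footprints, "city")
by_occupation = footprints_by_group(toy_profiles, toy_footprints, "occupation")
```

The pooled counts per group are the raw material from which a lifestyle tree would then be built.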

This figure shows the lifestyle of people in Beijing. The root node can be represented by three footprints: shopping in the daytime, going to work and eating fast food. The sub-lifestyles of subgroups of Beijing users include watching comedies and going to the office in the daytime and in the evening. A smaller group likes coffee and Western food and goes to bars in the evening. In this way, we can visualize the lifestyle of this group of users.

We can also compare the lifestyle of this group with the lifestyles of other groups. To generate this lifestyle tree, we designed a relational hierarchical LDA model.

Here, we treat each user as a document and his or her footprints as the words in that document. So, for a group of people, we have a set of documents. For this set of documents, we use a topic model to generate a topic tree. Because this is hierarchical LDA, the tree has a hierarchical structure. The "relational" part refers to the social connections between users, which act as links between the documents. This is similar to citation relationships between documents.
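The user-as-document view can be made concrete by preparing the model input. The sketch below only builds the corpus: integer-encoded documents plus document-to-document links. It does not implement relational hierarchical LDA, and the function name and data layout are assumptions made for illustration.

```python
def build_corpus(footprints, friendships):
    """Turn users into documents and friendships into document links.

    Returns (docs, links, vocab): `docs[i]` is the list of word ids for
    user i (users sorted by id), `links` is a sorted list of undirected
    (i, j) pairs with i < j, and `vocab[k]` is the footprint for word id k.
    """
    users = sorted(footprints)
    vocab = sorted({word for words in footprints.values() for word in words})
    word_id = {word: k for k, word in enumerate(vocab)}
    user_id = {user: i for i, user in enumerate(users)}

    docs = [[word_id[w] for w in footprints[u]] for u in users]
    links = sorted({
        tuple(sorted((user_id[a], user_id[b])))
        for a, b in friendships
        if a in user_id and b in user_id and a != b  # drop unknown users
    })
    return docs, links, vocab


# Toy data (hypothetical users and footprints).
toy_footprints = {
    "u2": ["coffee", "bar"],
    "u1": ["office", "coffee", "office"],
    "u3": ["bar"],
}
toy_friendships = [("u1", "u2"), ("u2", "u1"), ("u3", "u9")]
docs, links, vocab = build_corpus(toy_footprints, toy_friendships)
```

Repeated footprints stay repeated in the document, just as repeated words would in a text, and a friendship listed in both directions collapses into one undirected link.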

Research examples

Now, let's take a look at some examples.

Here we divide users into two groups according to occupation: one group of finance practitioners and one group of software practitioners. For finance practitioners, the most common nodes in the generated tree indicate that they like reading economics books. We also see that they like to go to bars and banks. For software practitioners, there are no economics books in the tree, but we find that most of them enjoy reading computer and programming books. Some of them like books on user experience design.
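A much simpler way to see such contrasts, without any topic model, is to compare the share of members in each group who have a given footprint. The sketch below uses this plain share difference as a stand-in. It is not the tree-based comparison described in the text, and the toy groups are invented.

```python
def distinctive_footprints(group_a, group_b, top=3):
    """Footprints most over-represented in group_a relative to group_b.

    Each group is a list of per-user footprint lists. A footprint's share
    is the fraction of users in the group who have it at least once.
    """
    def shares(group):
        n = len(group) or 1  # avoid division by zero for an empty group
        counts = {}
        for user_footprints in group:
            for footprint in set(user_footprints):
                counts[footprint] = counts.get(footprint, 0) + 1
        return {f: c / n for f, c in counts.items()}

    share_a, share_b = shares(group_a), shares(group_b)
    diffs = {
        f: share_a.get(f, 0.0) - share_b.get(f, 0.0)
        for f in set(share_a) | set(share_b)
    }
    # Largest difference first; ties broken alphabetically for stability.
    return sorted(diffs, key=lambda f: (-diffs[f], f))[:top]


# Toy data (hypothetical users), loosely following the example in the text.
finance = [
    ["economics books", "bar", "bank"],
    ["economics books", "bank"],
    ["economics books", "coffee"],
]
software = [
    ["programming books", "coffee"],
    ["programming books", "ux design books"],
    ["programming books", "bar"],
]
finance_top = distinctive_footprints(finance, software, top=2)
software_top = distinctive_footprints(software, finance, top=2)
```

Footprints that both groups share equally, such as bars and coffee in this toy data, drop out of the ranking, leaving the ones that set each group apart.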

Another example:

Here, we group users by year of birth and show two generations as examples: a younger generation and Gen Y. In the hierarchy generated for the younger generation, some users like to go to coffee shops, while others like to read Hong Kong articles and play video games. This indicates that these users are very young. For Gen Y, many like hot pot, some like Sichuan cuisine, and they go to the office during the day and in the evening. This suggests that they are older, since they need to go to work.

To summarize:

    • In the Lifespec project, we developed a computational framework for discovering urban lifestyles.
    • As part of the system, we designed the iconnect algorithm, which identifies connected user accounts based on self-disclosed information.
    • We designed a relational hierarchical model to summarize the lifestyles of user groups.
