Big Data Learning Note 3 • Big data in social computing (1)

Source: Internet
Author: User

Background information

This note first covers what user behavior data is, how it accumulates, and why understanding users is so important. The second part introduces our recent research on understanding users' mobility patterns: for example, how to deal with missing data in user trajectories and how to recommend interesting places to users. The last part presents our recent research projects in user analytics and privacy protection.

This graph shows the amount of data that users generate per minute on some websites.

    • How user data is collected


More than 20 years ago, the concept of pervasive computing had just been raised. It was introduced by Mark Weiser. With mainframes and personal computers already in place, people wanted to know what the future of computing would look like. Weiser proposed that pervasive computing was that future.
So what is pervasive computing? Weiser put forward four principles.

    1. The purpose of a computer is to help people do something else.
    2. The best computer is a quiet, invisible servant.
    3. The more a person can do by intuition, the smarter that person is. The computer should extend the human unconscious.
    4. Technology should create calm, which is important for pervasive computing.

After the concept of pervasive computing was put forward, researchers began to build prototypes. Basically, they wanted to make computers smarter.
The researchers designed three form factors for this type of device: tabs, pads, and boards.

Tabs are centimeter-scale devices, such as smartphones and smart cards. They are very easy to carry, so people can take them anywhere.
Pads are devices such as laptop computers. They can be carried around, but not in a pocket.
Boards are meter-scale devices. Because they are so large, they cannot easily be carried or moved. However, they are ideal for applications such as photo sharing or games.

Because these devices are portable, a straightforward idea is to detect their location and use it to build context-aware applications. For example, if we know the location of a device, we can use it to infer the user's location.

Today's mobile devices, such as smartphones, already contain many sensors. For example, a smartphone can record the time and location of the device, with the location obtained from GPS, Wi-Fi, mobile base stations, or Bluetooth signals. Smartphones also have many sensors that record device movement, such as accelerometers, gyroscopes, and digital compasses. Using these motion signals, we can infer the user's movement and activity. Sensors can also record environmental signals. For example, we can use a microphone to detect sound and a camera to detect visual signals. A device may also have ambient light sensors, proximity sensors, barometers, humidity sensors, and thermometers.

    • Examples of user behavior data


Flickr has a large number of geo-tagged photos. Users usually take photos while traveling to other cities or at gatherings. These photos show how users traveled in these cities and where they took photos, so they contain a lot of information about both places and users.

Beijing is said to have more than 60,000 taxis. Most of them are equipped with GPS, which means that we can record their trajectories. Based on the taxi trajectory data, we drew this heat map. From the heat map, we can easily see the trunk road network and the popular areas. We have done a lot of research work in this area. Mining knowledge from taxi trajectories can be useful for many applications, such as city planning, location recommendation, and taxi ride-sharing.

    • User understanding

These data have several things in common. They are all produced by people, directly or indirectly. They reflect activities in the physical world. For example, a geo-tagged photo indicates where people take pictures. A location check-in indicates where people stay, such as restaurants and cinemas. A taxi trajectory shows how taxis travel in the city. All of this data is structured. For example, every record contains at least one timestamp and one location tag. A location can be represented by coordinates, such as longitude and latitude, or by a place name, such as a bus stop.
These data carry a certain privacy risk, because they contain a lot of information that users may not want to disclose. So we have to be very careful when mining this data. With human behavioral data, our goal is to make the best use of the data to generate knowledge about the user. This knowledge is then used to make various cloud services more personalized and to provide users with better recommendations.
If we understand users better, we can use this knowledge to personalize different applications.

Understanding user mobility patterns: reconstructing individual mobility from smart card transaction data.

In this work, our research focuses on the smart card transaction data generated in the bus system. Smart cards are designed for digital payment and for monitoring users' payments. Their data is valuable for revealing users' mobility patterns, which are useful for urban planning, location-based social networks, GIS, and transportation applications.

There is always a gap between the data that the source application produces and the data that the target application needs.
Taxi trajectories are an example. We use taxi trajectories in a variety of transportation-related applications, such as traffic estimation. However, because the trajectories are collected for taxi management, the sampling frequency is usually very low. In other words, there is only one data point every 3 to 5 minutes. This is common, because the designers of the source application often cannot anticipate where its data will be used in the future. There will always be new applications that can benefit from this data.

Smart card transactions always contain a lot of uncertainty, because they were not designed for this application, and a lot of data is missing. To solve this problem, we take advantage of different types of constraints in the monetary space, the temporal space, and the geographic space. I'll show you how to make the data more complete and more useful for these applications.

Data set


This slide shows the fields in the data set.

    • CardID: The anonymized ID of the card.
    • Bus: The bus line number.
    • Boarding: The boarding stop.
    • Alighting: The alighting stop.
      There are two types of buses: non-ladder-priced (flat-fare) buses and ladder-priced (distance-based-fare) buses. On a non-ladder-priced bus, the passenger pays a fixed fare per ride; on a ladder-priced bus, the passenger pays according to the distance traveled. The two types record boarding and alighting stops differently. For non-ladder-priced buses, the boarding and alighting stops are not recorded.
      For ladder-priced buses, each stop is recorded as a code that represents its distance from the terminal; the code is 0 at the starting point of the bus line.
    • Time: The transaction time.
      This field also differs between the two types of buses. For non-ladder-priced buses, the time is the boarding time, because the transaction takes place when the passenger boards. For ladder-priced buses, the time is the alighting time, because the transaction is completed when the passenger alights.
    • Expense: The fare paid.
      The fare of a non-ladder-priced bus is fixed, while the fare of a ladder-priced bus depends on the distance traveled.
    • Balance: The card balance.
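
As a rough illustration of the fields above, a single record could be modelled as follows. This is only a sketch: the class name `Transaction`, the field types, and the sample values are assumptions made for illustration, not the actual schema of the data set.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Transaction:
    """One smart card bus transaction (hypothetical schema)."""
    card_id: str               # anonymized card ID
    bus: str                   # bus line number
    boarding: Optional[int]    # distance code of the boarding stop, None if not recorded
    alighting: Optional[int]   # distance code of the alighting stop, None if not recorded
    time: datetime             # boarding time (non-ladder-priced) or alighting time (ladder-priced)
    expense: float             # fare paid
    balance: float             # card balance

    def has_missing_stops(self) -> bool:
        """True when either stop is unrecorded and has to be inferred."""
        return self.boarding is None or self.alighting is None


# Made-up sample values, for illustration only.
flat = Transaction("c001", "300", None, None, datetime(2012, 8, 1, 8, 5), 0.4, 19.6)
ladder = Transaction("c001", "919", 0, 12, datetime(2012, 8, 1, 18, 40), 2.0, 17.6)
```

Using `None` for unrecorded stops makes the missing values explicit, which is the part of the data the rest of this note tries to recover.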

We collected a total of 22 million travel records from 700,000 cardholders between August 2012 and May 2013. In this table, you can see that there are a lot of missing values, especially the boarding and alighting stops of non-ladder-priced buses. So, if we want to study users' mobility patterns, we need to fill in the actual values of these missing stops. Otherwise, we will lose many of the users' mobility patterns.

To solve this problem, we have also collected some other datasets.

These include passengers' top-up records. We have about 6 million of them. Each top-up record includes the anonymized card ID, the top-up time, the top-up amount, and the card balance.
We also have Beijing's road network data, which contains information about all road segments in Beijing. The road network is represented as a graph with about 148,000 nodes and 200,000 edges. The edges are the road segments.

We also collected bus line information from an online service.

This information contains the name of each bus line and the coordinates of all its bus stops. It also includes pricing information for both ladder-priced and non-ladder-priced bus lines.

Our data includes transaction records, road network information, some labeled travel records, and bus line information. We need to fill in as many of the missing boarding and alighting stops as possible, together with their longitude and latitude coordinates. Then we can use this data to study users' mobility patterns.

Spatial positioning framework

    • Monetary: The monetary space shows the changes in the user's smart card balance.
    • Temporal: The temporal space shows the transaction or payment time. Usually, this is the time of a ride or a top-up.
    • Geospatial: The geographic space shows the locations of the user's rides.

We need to connect these three spaces to discover and reconstruct users' mobility patterns from these data sets. If we can connect the three spaces, we can link points in money, time, and space to each other. For example, once we know a passenger's boarding time, boarding stop, and alighting stop, we can reconstruct the user's trajectory from the data. Then we will be able to identify these users' homes, workplaces, and other important places, as well as their travel patterns.

Preprocessing a data set


Here, we need to divide each user's rides into segments, each of which should be contiguous in the monetary space. This means that a segment must not contain other payments. Beijing's smart card can be used in taxis and subways, and also for shopping, but we want to keep only the bus payment records. This segmentation can be done in linear time: we only check, for each ride, whether its fare and balance match the previous balance.
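
The balance check described above can be sketched as a single pass over a user's records. This is a minimal sketch, not the authors' implementation: it assumes the balance is recorded after each transaction, stores amounts as integer cents, and uses the hypothetical function name `split_into_segments`.

```python
from typing import List, Tuple

# Each record is (expense, balance_after), both in integer cents (an assumption
# made here to avoid floating-point rounding).
Record = Tuple[int, int]


def split_into_segments(records: List[Record]) -> List[List[Record]]:
    """Split a time-ordered record list wherever the balance does not match.

    Two consecutive records stay in the same segment when
    previous balance - current expense == current balance.
    The list is scanned once, so the cost is linear in the number of records.
    """
    segments: List[List[Record]] = []
    for expense, balance in records:
        if segments:
            _, prev_balance = segments[-1][-1]
            if prev_balance - expense == balance:
                segments[-1].append((expense, balance))
                continue
        # First record, or a gap caused by some payment we did not observe.
        segments.append([(expense, balance)])
    return segments


# Made-up bus records: the balance drops unexpectedly before the third ride.
rides = [(40, 1960), (40, 1920), (100, 1500), (40, 1460)]
segments = split_into_segments(rides)
```

In this example the third record starts a new segment, because 1920 - 100 does not equal 1500, which suggests an unobserved payment in between.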

Data set conversions

Below, we define two types of conversions: internal and external. They indicate whether the user is on a bus.

Suppose we have a user's sequence of bus rides. Li is a bus ride with boarding stop Oi and alighting stop Di, which are its starting point and end point respectively. When the passenger is on a bus, we call the movement an internal conversion, that is, from Oi to Di. When the user is not on a bus, we call it an external conversion, that is, from Di to the next boarding stop Oi+1.

If only the bus line number is known, the passenger may have taken many different ride plans: for a line with ni stops, the number of possible boarding and alighting combinations is ni × (ni - 1). So, as we can see, there are many possibilities for a passenger's ride plan.
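
To make the count concrete, the candidate plans for one line can be enumerated as ordered pairs of distinct stops. This is a small sketch under the assumption that the travel direction is unknown, so both orders of a pair are kept; the stop names and the function name `candidate_plans` are hypothetical.

```python
from itertools import permutations
from typing import List, Tuple


def candidate_plans(stops: List[str]) -> List[Tuple[str, str]]:
    """All ordered (boarding, alighting) pairs of distinct stops on one line."""
    return list(permutations(stops, 2))


line_stops = ["A", "B", "C", "D"]   # hypothetical stop names
plans = candidate_plans(line_stops)
n = len(line_stops)
```

For a line with n stops the list has n × (n - 1) entries, and a sequence of several rides multiplies these counts together, which is why constraints are needed to prune the candidates.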

Data set constraints

Below, we will apply some constraints to reduce these possibilities.

The first constraint is a proximity constraint, which applies to external conversions.
We assume that a person's walking speed and walking duration are limited. If two consecutive rides are so far apart that no pair of nearby bus stops can be found, the ride sequence is split into two segments at that point. Our algorithm therefore defines a distance threshold and uses it to reduce the number of candidate bus stops. As a result, the number of possible ride plans is greatly reduced.
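
The distance threshold can be sketched as a filter over pairs of stops from two consecutive bus lines. This is an illustrative sketch only: the haversine distance, the 500 m threshold, the coordinates, and the function names are assumptions, not values or code from the original work.

```python
import math
from typing import Dict, List, Tuple

Coord = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_m(a: Coord, b: Coord) -> float:
    """Great-circle distance between two coordinates, in metres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000.0 * math.asin(math.sqrt(h))


def nearby_pairs(prev_line: Dict[str, Coord], next_line: Dict[str, Coord],
                 threshold_m: float) -> List[Tuple[str, str]]:
    """Keep (alighting stop, next boarding stop) pairs within walking distance."""
    return [(d, o)
            for d, d_xy in prev_line.items()
            for o, o_xy in next_line.items()
            if haversine_m(d_xy, o_xy) <= threshold_m]


# Made-up stop coordinates for two consecutive bus lines.
prev_line = {"P1": (39.9000, 116.4000), "P2": (39.9500, 116.4000)}
next_line = {"N1": (39.9010, 116.4000), "N2": (39.9900, 116.4000)}
walkable = nearby_pairs(prev_line, next_line, threshold_m=500.0)
```

An empty result for two consecutive rides would correspond to the case where the sequence is split into two segments.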

The second constraint is an expense constraint, which is designed for internal conversions, that is, for the time the user is on the bus. On a ladder-priced bus, passengers are charged according to the distance traveled. So if we know the fare a passenger paid for a ride, we can estimate the distance traveled by looking it up in the fare table. In this way, the number of possible internal conversions can be further reduced.
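
The fare lookup can be sketched as another filter over candidate stop pairs. Everything concrete here is an assumption for illustration: the fare rule in `toy_fare` is invented and is not Beijing's actual fare table, and the stop codes and function names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def fare_consistent_pairs(stop_codes: Dict[str, int],
                          fare_of: Callable[[int], int],
                          observed_fare: int) -> List[Tuple[str, str]]:
    """Keep (boarding, alighting) pairs whose table fare equals the paid fare."""
    return [(o, d)
            for o, o_code in stop_codes.items()
            for d, d_code in stop_codes.items()
            if o != d and fare_of(abs(d_code - o_code)) == observed_fare]


def toy_fare(distance: int) -> int:
    """Invented ladder fare in cents: 100 up to 10 units, +50 per further 5 units."""
    if distance <= 10:
        return 100
    return 100 + 50 * -(-(distance - 10) // 5)   # ceiling division


# Made-up distance codes (0 at the starting point of the line).
stops = {"S0": 0, "S1": 6, "S2": 13, "S3": 22}
kept = fare_consistent_pairs(stops, toy_fare, observed_fare=150)
```

Only the pairs whose distance falls in the matching fare step survive, in both travel directions, since the direction is not assumed to be known here.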

The third constraint is a time constraint, which applies to both internal and external conversions. For non-ladder-priced buses we have the boarding time, and for ladder-priced buses we have the alighting time. So we have many key time points with which to bound the duration of each bus ride.
For example, in this figure, the travel time on the first bus line is denoted Δt1. Δt1 should be less than the interval between the boarding time on the first bus line and the boarding time on the second. Because we also know the time recorded for the third bus line, the combined travel time on the second and third bus lines should be less than t3 - t2.
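
One way to apply such a time bound is to discard candidate stop pairs that could not be ridden within the available interval. This is a hedged sketch rather than the method used in the original work: it assumes stop positions along the line in kilometres and an assumed maximum bus speed, and the function name `time_feasible_pairs` is hypothetical.

```python
from typing import Dict, List, Tuple


def time_feasible_pairs(pairs: List[Tuple[str, str]],
                        stop_pos_km: Dict[str, float],
                        t1_s: float, t2_s: float,
                        max_speed_kmh: float) -> List[Tuple[str, str]]:
    """Drop (boarding, alighting) pairs that cannot be ridden within t2 - t1.

    The shortest possible ride time for a pair is its distance along the line
    divided by the assumed maximum bus speed.
    """
    window_h = (t2_s - t1_s) / 3600.0
    return [(o, d) for o, d in pairs
            if abs(stop_pos_km[d] - stop_pos_km[o]) / max_speed_kmh < window_h]


# Made-up positions along the line and a 20-minute gap between two boardings.
positions = {"A": 0.0, "B": 5.0, "C": 30.0}
feasible = time_feasible_pairs([("A", "B"), ("A", "C")], positions,
                               t1_s=0.0, t2_s=1200.0, max_speed_kmh=60.0)
```

Here the 30 km candidate is rejected because even at the assumed top speed it would take longer than the 20 minutes between the two recorded times.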

Labeling missing data

After introducing the three types of constraints, we use a conditional random field (CRF) to label the missing data.

In the observation sequence, each node is defined as a combination of two consecutive bus lines. In the hidden sequence, each node includes the starting point and end point of the first bus line and the starting point of the second bus line. We use a constrained semi-supervised training method from the previous literature to solve this problem.
We evaluated the algorithm on both the entire data set and the user-labeled data set.

For the entire data set, we used only ladder-priced ride records and removed the labels from these records, to see whether our algorithm could recover them. We compared the conditional random field algorithm and the constrained conditional random field algorithm with two previous methods, TC+MF and TC+MS. From the first diagram, we can see that if only the conditional random field model is used, the performance of the algorithm is similar to the previous work. However, once the constraints are added, the performance improves greatly. We also get similar performance on the MSRA user data.

Understanding users' mobility patterns

Here we use a simple application to understand users' mobility patterns: home address and workplace detection.
We used a common method for home address and workplace detection. We asked the 102 users who participated in our user study to mark their home addresses and workplaces, and compared our results with their markings.
We found that after recovering the missing data, the accuracy of home address detection increased by 88%, and the accuracy of workplace detection increased by 35%. This result is consistent with the results of the local household survey.
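
The source does not say which detection method was used. Purely as an illustration of the kind of heuristic often applied to transit data, the sketch below guesses home as the most frequent first boarding stop of a day and workplace as the most frequent alighting stop of that first trip. The heuristic, the data layout, and the function name `guess_home_and_work` are all assumptions.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# One trip: (day, boarding stop, alighting stop); trips are listed in time order.
Trip = Tuple[str, str, str]


def guess_home_and_work(trips: List[Trip]) -> Tuple[Optional[str], Optional[str]]:
    """Guess home as the most common first boarding stop of a day and
    workplace as the most common alighting stop of that first trip."""
    first_trip_of_day: Dict[str, Trip] = {}
    for trip in trips:
        # setdefault keeps only the earliest trip seen for each day.
        first_trip_of_day.setdefault(trip[0], trip)
    if not first_trip_of_day:
        return None, None
    homes = Counter(t[1] for t in first_trip_of_day.values())
    works = Counter(t[2] for t in first_trip_of_day.values())
    return homes.most_common(1)[0][0], works.most_common(1)[0][0]


# Made-up trips over three days.
trips = [("d1", "H", "W"), ("d1", "W", "H"),
         ("d2", "H", "W"), ("d2", "W", "X"),
         ("d3", "G", "W")]
home, work = guess_home_and_work(trips)
```

A heuristic like this only works when boarding and alighting stops are known, which is why recovering the missing stops first has such a large effect on detection accuracy.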

Summary
    • We propose a spatial positioning framework that combines three spaces: the monetary space, the temporal space, and the geographic space. Of these, the monetary space has seldom been taken into account before.
    • We designed a general method for recovering missing data from smart card records. The method is suitable for recovering missing alighting stops, boarding stops, and bus lines.
    • Experimental results show that we achieve very high accuracy in recovering users' mobility patterns. We invited 102 users to label their data over 4 months and used this data to evaluate our algorithms. The results show the significance and potential of this work for applications of mobility pattern analysis.
