Objective
Rapid urbanization has modernized many people's lives, but it has also brought challenges such as traffic congestion, energy consumption, and air pollution.
Given the complexity of cities, addressing these challenges once seemed almost impossible. Recently, advances in sensing technology and large-scale computing infrastructure have produced a wide variety of big data, from social media data to traffic data, and from geographic data to meteorological data. Used properly, this data can help address the challenges facing cities.
Inspired by this opportunity, we propose urban computing as a solution. It connects urban sensing, urban data management, urban data analysis, and service delivery into a recurring cycle that continuously and unobtrusively improves people's lives, city operation systems, and the environment.
Urban computing needs traffic data, demographic data, and even pollution data. How to unlock the power of knowledge from multiple datasets in different domains has therefore become a new challenge, and it makes urban computing inherently different from traditional data mining and machine learning tasks. I will introduce the concepts, methods, and applications of urban computing, presenting representative studies of urban sensing, urban data management, and urban data analysis in turn. Applications of these studies include transportation, urban planning, the environment, and energy consumption.
Urban Big Data classification
1. Data that is static in both space and time
This type of data can be divided into three sub-categories: points, lines, and graphs. For example, a point of interest is a static data point whose value does not change over time; a route can be modeled with a line, and a road network can be modeled using a graph.
- Point data
The figure shows the distribution of two types of point-of-interest data: yellow dots for cinemas and blue dots for bars. Over the past five years, the number of cinemas in urban Beijing has kept growing, reaching 260. This suggests that more and more people prefer going to the cinema to buying DVDs. Mining several years of such data can reveal many stories like this.
- Line data
This picture shows the road network in Beijing. The red lines represent the highways connecting Beijing with other cities, the blue lines represent Beijing's ring roads, and the black lines represent Beijing's main roads. With a few years of data, you can see how a city's road network expands.
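The three static sub-categories (points, lines, and graphs) can be sketched in code as follows. This is only a minimal illustration: the class name `PointOfInterest`, the helper `build_road_graph`, and all coordinates and node labels are hypothetical and do not come from the Beijing datasets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PointOfInterest:
    """Point data: a fixed location whose attributes do not change over time."""
    name: str
    category: str
    lon: float
    lat: float


def build_road_graph(segments):
    """Graph data: undirected adjacency sets built from (node_a, node_b) road segments."""
    graph = {}
    for a, b in segments:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


# Illustrative values only (assumptions, not real data).
cinema = PointOfInterest("Example Cinema", "cinema", 116.40, 39.90)
# Line data: a route is an ordered list of (lon, lat) coordinates.
route = [(116.40, 39.90), (116.41, 39.91), (116.42, 39.91)]
# Graph data: intersections are nodes, road segments are edges.
road_graph = build_road_graph([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])

print(sorted(road_graph["C"]))  # ['A', 'B', 'D']
```

The frozen dataclass mirrors the idea that a static point does not change after it is recorded.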
2. Spatially static, temporally dynamic data
Unlike the first type of data, the values associated with each point change over time. We call this temporal dynamics. Sensor network data is a typical example of this kind of urban big data.
Air quality data is one example. Many cities have set up air quality monitoring stations on the ground that report the ambient air quality to the public every hour. Each monitoring station has static spatial information, but the air quality at each station varies over time, so we call this data temporally dynamic but spatially static.
Another example is meteorological data, such as wind, temperature, and humidity. There are many meteorological monitoring stations in a city. As in the air quality example, each station has a fixed location, but its weather readings change over time. The real estate market is a further example: each residential property has a fixed geographical location, but its price and attributes change over time.
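A spatially static, temporally dynamic record can be sketched as a fixed location paired with a growing series of readings. The class `MonitoringStation`, its field names, and the hourly values below are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class MonitoringStation:
    """A station whose location is fixed while its readings change over time."""
    station_id: str
    lon: float
    lat: float
    readings: dict = field(default_factory=dict)  # maps hour -> reading

    def record(self, hour, value):
        """Store the reading reported for a given hour."""
        self.readings[hour] = value

    def latest(self):
        """Return (hour, value) for the most recent hour recorded."""
        hour = max(self.readings)
        return hour, self.readings[hour]


# Made-up hourly air quality index values for one hypothetical station.
station = MonitoringStation("S1", 116.40, 39.90)
for hour, aqi in [(8, 95), (9, 120), (10, 150)]:
    station.record(hour, aqi)

print(station.latest())  # (10, 150)
```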
This image shows a dynamic heat map of Beijing:
It shows the number of taxi arrivals in each area during each time period. The darker the color, the more arrivals in that area within the given period. First, northern Beijing is more popular than other parts of the city. This is the CBD of Beijing. Comparing the same areas across two different types of days, we can see that more people arrive in central Beijing on workdays than on holidays, because most people leave the city for the holiday.
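The counts behind such a heat map can be sketched by bucketing taxi drop-offs into grid cells and time slots. The function `arrival_counts`, the 0.01-degree cell size, the one-hour slot, and the sample drop-offs are all assumptions for illustration, not the setup used to produce the figure.

```python
import math
from collections import Counter


def arrival_counts(dropoffs, cell_size=0.01, slot_hours=1):
    """Count arrivals per (grid cell, time slot).

    dropoffs: iterable of (lon, lat, hour_of_day) tuples.
    """
    counts = Counter()
    for lon, lat, hour in dropoffs:
        cell = (math.floor(lon / cell_size), math.floor(lat / cell_size))
        slot = int(hour // slot_hours)
        counts[(cell, slot)] += 1
    return counts


# Made-up drop-offs: two in the same cell during the 8 o'clock slot.
dropoffs = [
    (116.405, 39.905, 8.2),
    (116.407, 39.903, 8.9),
    (116.455, 39.955, 8.5),
    (116.405, 39.905, 18.0),
]
counts = arrival_counts(dropoffs)
busiest_key, busiest_count = counts.most_common(1)[0]
print(busiest_count)  # 2
```

A darker cell in the heat map corresponds to a larger count for that (cell, slot) key.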
3. Data that is dynamic in both space and time
This kind of data is dynamic in both the spatial and the temporal dimension. The most complex data structure in this category is the trajectory.
Suppose we have many points, each associated with geographic and temporal information: an x-coordinate, a y-coordinate, and a timestamp. By connecting these points in chronological order, we form a trajectory.
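Forming a trajectory from timestamped points can be sketched as a sort by time. The names `GpsPoint`, `build_trajectory`, and `path_length` are hypothetical, and the coordinates are made-up planar values rather than real GPS fixes.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class GpsPoint:
    x: float
    y: float
    t: int  # timestamp in seconds


def build_trajectory(points):
    """Order the points chronologically to form a trajectory."""
    return sorted(points, key=lambda p: p.t)


def path_length(trajectory):
    """Sum of straight-line distances between consecutive points."""
    return sum(
        math.hypot(b.x - a.x, b.y - a.y)
        for a, b in zip(trajectory, trajectory[1:])
    )


# Points arrive out of order; sorting by timestamp restores the path.
raw_points = [GpsPoint(3.0, 0.0, 20), GpsPoint(0.0, 0.0, 0), GpsPoint(3.0, 4.0, 10)]
trajectory = build_trajectory(raw_points)

print([p.t for p in trajectory])  # [0, 10, 20]
print(path_length(trajectory))    # 9.0
```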
Many sources produce trajectory data. One is the movement of people: we can record our travels with a GPS logger, and we can study physical activity by analyzing our own trajectories. Check-in data also forms a trajectory. The movement of vehicles, such as taxis and buses, can be recorded as trajectories as well. Animal migration is another kind of trajectory data, and the movement of natural phenomena such as hurricanes and tornadoes can also be seen as trajectories.
This picture shows a heat map of the GPS trajectories generated by more than 3,000 taxis in Beijing. This data tells us not only about traffic patterns on the ground but also about people's mobility patterns in the city, because we know where people get into and out of taxis.
Concepts, framework, and challenges of urban computing
Concepts
Let's start with an example of what urban computing is.
Air pollution is now a global problem, especially in developing countries. Many cities have built air quality monitoring stations on the ground, which report the ambient air quality to the public every hour. In this picture, each icon represents an air quality monitoring station, and the number next to each icon is the air quality index measured by that station. The smaller the number, the better the air quality; the larger the number, the worse the air quality.
We can see that even at the same moment, the air quality measured at different stations can vary greatly. This is not surprising, because air quality is affected by many complex factors, such as traffic flow, energy consumption, and the distribution of buildings, factories, and parks in an area. These factors differ from one part of the city to another. So, at a location without a monitoring station, we cannot know exactly what the air quality is.
We cannot use linear interpolation to calculate the air quality at such a location, because the distribution of air quality in a city is highly non-linear and biased. Nor can we use the average reading of the existing stations to represent it. To solve this problem, we use two kinds of big data to infer real-time, fine-grained air quality for the entire city.
- The first kind is the real-time and historical air quality readings available from existing stations.
- The second kind consists of five other data sources: meteorological data, such as wind, temperature, and humidity; traffic flow; human mobility data; point-of-interest data, such as the number of restaurants, the number of factories, and the density of buildings in a particular area; and road network data, such as the number of intersections, the number of traffic lights, and the total length of expressways in a given area.
Using machine learning and data mining techniques, we can build a model that links the data observed in an area to the air quality of that area. The result is a fine-grained picture of the city's air quality, and it is clearly non-linear. With such fine-grained air quality information, we can inform people's decisions, such as where to hike and when to close the window. It is also a step toward finding the root causes of air pollution in the future.
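To make the idea of linking an area's observed data to its air quality concrete, here is a minimal sketch. It uses a k-nearest-neighbour average in feature space purely as a stand-in; it is not the model used in the Urban Air project. The function `predict_aqi`, the two-feature vectors (assumed to be already scaled to the range 0 to 1), and all values are hypothetical.

```python
import math


def predict_aqi(target_features, labeled_regions, k=2):
    """Average the AQI of the k labeled regions most similar to the target.

    labeled_regions: list of (feature_vector, aqi) pairs for regions that
    have a monitoring station. Similarity is Euclidean distance between
    feature vectors, not geographic distance.
    """
    nearest = sorted(
        labeled_regions,
        key=lambda region: math.dist(region[0], target_features),
    )[:k]
    return sum(aqi for _, aqi in nearest) / len(nearest)


# Made-up features per region: (traffic intensity, factory density).
labeled = [
    ((0.9, 0.8), 160),
    ((0.8, 0.9), 140),
    ((0.1, 0.2), 40),
    ((0.2, 0.1), 60),
]

print(predict_aqi((0.85, 0.85), labeled))  # 150.0
print(predict_aqi((0.15, 0.15), labeled))  # 50.0
```

Comparing regions by their features rather than by distance on the map reflects the point that air quality does not vary linearly across space.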
The framework and challenges of urban computing
It can be seen that the framework of urban computing mainly includes data collection, management, analysis and output. There are different challenges at different levels.
Challenges in urban sensing
- Data loss and sparsity
We only have sampled data, and generating the true distribution of the whole dataset from samples is a challenge.
- Biased distribution
We have check-in data from some users, but we want a mobility model for people across the whole city. Clearly, these samples do not represent the real human mobility patterns in the city. This is called a biased distribution.
- Limited resources
We have limited resources, budget, or manpower to encourage people to contribute their data.
For example, we have trajectory data for taxis, but we want to estimate the traffic flow of all vehicles on the road. The distribution of taxis may differ from the distribution of all vehicles, so we need to be able to derive the overall traffic distribution from sampled data.
In the Urban Air project, only a limited number of air quality monitoring stations have been built in the city, and we only get sample data from them. The data is very sparse, but we want to recover the data for the whole city.
There are two types of data collection strategy. The first is static sensing, which deploys sensors at fixed locations. The question for this strategy is where to deploy stations so as to gain the most knowledge. The second is dynamic incentives: in crowd-sensing strategies, we want to place the right incentives in the right places to obtain more data.
Challenges in urban data management
- Multimodal data
These data have different representations, use different units, and have different densities.
- Dynamic, high-speed, massive data
We have to think about how to handle frequent updates to the data.
In the Urban Air project, we need to use five different datasets, including weather data, traffic data, and POI data. These datasets are completely different from one another. They are multimodal, with different measurements, densities, and representations. Most of the data has associated spatial and temporal information. Some of it is categorical and some is numeric. In this project, we need to quickly extract the various data for a given region within a given time period, so we need an index structure that manages multimodal data well.
In addition, we need to consider how often the data is updated and how much of it there is. First, the data is updated very frequently, so we need a flexible index structure that supports frequent updates. Second, different datasets are updated at different frequencies. If we simply organize the different datasets into a single index structure, we face a big problem: whenever one piece of data in one dataset is updated, the entire structure has to be updated, which would be a disaster. Third, the data is massive. We cannot store all of it on a single machine, so how to partition and distribute the data across machines so that it can be processed in parallel is a new challenge for data management.
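One possible way to address the retrieval and update concerns above is to keep a separate grid-and-time index per dataset, so that an update to one dataset never touches the others. The sketch below is an assumption-laden illustration, not the index used in the project: the class `GridTimeIndex`, the cell size, the slot length, and the records are all hypothetical.

```python
import math
from collections import defaultdict


class GridTimeIndex:
    """Buckets records by (grid cell, time slot) for fast region/time lookup."""

    def __init__(self, cell_size=0.01, slot_seconds=3600):
        self.cell_size = cell_size
        self.slot_seconds = slot_seconds
        self._buckets = defaultdict(list)

    def _key(self, lon, lat, timestamp):
        return (
            math.floor(lon / self.cell_size),
            math.floor(lat / self.cell_size),
            timestamp // self.slot_seconds,
        )

    def insert(self, lon, lat, timestamp, record):
        """Append a record; only the affected bucket changes."""
        self._buckets[self._key(lon, lat, timestamp)].append(record)

    def query(self, lon, lat, timestamp):
        """Return all records in the cell and time slot containing the query."""
        return list(self._buckets.get(self._key(lon, lat, timestamp), []))


# One index per dataset: frequent traffic updates leave the weather index alone.
indexes = {"weather": GridTimeIndex(), "traffic": GridTimeIndex()}
indexes["weather"].insert(116.405, 39.905, 1000, {"humidity": 0.4})
indexes["traffic"].insert(116.405, 39.905, 1000, {"speed_kmh": 22})
indexes["traffic"].insert(116.407, 39.903, 2000, {"speed_kmh": 18})

# Extract every dataset's records for the same region and time period.
region_view = {name: idx.query(116.406, 39.904, 3000) for name, idx in indexes.items()}
print(len(region_view["traffic"]), len(region_view["weather"]))  # 2 1
```

Because the bucket key is a plain tuple, the same key could also be used to assign buckets to different machines, which is one way to approach the partitioning problem.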
- Identifying correlation patterns between multiple data sources in different domains
There is great value in the correlation patterns that span multiple domains, but identifying such patterns is challenging.
First, there is no clear concept of a transaction. For example, a supermarket transaction records that someone bought milk, bread, and diapers at the same time. Here, however, we have different data sources with no clear notion of co-occurrence, so we have to define what co-occurrence across different data sources means. Second, we have many data sources, each with many attributes, so the data sources and attributes can be combined in many ways, and exploring these combinations is very time-consuming. Third, we need to deal with co-occurrence across different modalities. It is easy to find co-occurrences between categorical values; this is how traditional association rule mining handles transaction data. But what if numeric data occurs together with numeric data, or numeric data with categorical data? This is a new challenge that we need to address.
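One way to define co-occurrence across data sources, sketched here under stated assumptions, is to treat two observations as co-occurring when they fall in the same region and time slot, and to discretize numeric values into categories first. The helpers `bin_value` and `co_occurrences`, the thresholds, and the readings are hypothetical and chosen only for illustration.

```python
from collections import Counter


def bin_value(value, thresholds, labels):
    """Map a numeric value to a label; labels has one more entry than thresholds."""
    for threshold, label in zip(thresholds, labels):
        if value < threshold:
            return label
    return labels[-1]


def co_occurrences(source_a, source_b):
    """Count label pairs observed in the same (region, time slot).

    Each source maps (region, slot) -> categorical label.
    """
    counts = Counter()
    for key in source_a.keys() & source_b.keys():
        counts[(source_a[key], source_b[key])] += 1
    return counts


# Made-up numeric readings keyed by (region, hour).
speed = {("r1", 8): 15.0, ("r1", 9): 50.0, ("r2", 8): 12.0, ("r3", 8): 55.0}
aqi = {("r1", 8): 180, ("r1", 9): 60, ("r2", 8): 160, ("r2", 9): 70}

# Discretize so numeric data can be treated like transaction items.
speed_labels = {k: bin_value(v, [30], ["slow", "fast"]) for k, v in speed.items()}
aqi_labels = {k: bin_value(v, [100], ["good", "bad"]) for k, v in aqi.items()}

pairs = co_occurrences(speed_labels, aqi_labels)
print(pairs[("slow", "bad")], pairs[("fast", "good")])  # 2 1
```

Only keys present in both sources contribute, so regions or hours covered by a single source are ignored.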
Challenges in urban data analysis
- Spatial and spatio-temporal data analysis
Traditional data mining and machine learning techniques are mostly used to process text and image data. Now we have spatial and spatio-temporal data, which is a new field that we need to explore.
- Cross-domain data fusion
We have multiple data sources that span multiple domains. How to unlock the power of knowledge from multiple datasets in different domains is a new challenge. Urban computing is also an end-to-end service that requires integrating different technologies, including machine learning, data management, and visualization. We need to bring these technologies together.
Here, I divide cross-domain data fusion methods into three categories.
The first category uses different datasets at different stages of a task. For example, we first use the road network to divide the city into regions, and then use traffic data to analyze the commuting patterns between those regions. This is known as stage-based data fusion.
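The two-stage example can be sketched as follows. Stage one here uses rectangles as a simplified stand-in for regions derived from a road network, and stage two counts trips between regions. The functions `region_of` and `commute_flows`, the region boundaries, and the trips are all hypothetical.

```python
from collections import Counter


def region_of(lon, lat, boundaries):
    """Stage 1: find the region containing a location.

    boundaries maps region name -> (min_lon, min_lat, max_lon, max_lat).
    Rectangles stand in for regions that a road network would define.
    """
    for name, (x0, y0, x1, y1) in boundaries.items():
        if x0 <= lon < x1 and y0 <= lat < y1:
            return name
    return None


def commute_flows(trips, boundaries):
    """Stage 2: count trips between different regions.

    trips: iterable of ((origin_lon, origin_lat), (dest_lon, dest_lat)).
    """
    flows = Counter()
    for (o_lon, o_lat), (d_lon, d_lat) in trips:
        origin = region_of(o_lon, o_lat, boundaries)
        dest = region_of(d_lon, d_lat, boundaries)
        if origin is not None and dest is not None and origin != dest:
            flows[(origin, dest)] += 1
    return flows


# Made-up regions and trips in an abstract coordinate system.
boundaries = {"west": (0, 0, 1, 1), "east": (1, 0, 2, 1)}
trips = [
    ((0.2, 0.5), (1.5, 0.5)),
    ((0.3, 0.1), (1.2, 0.9)),
    ((1.5, 0.5), (0.5, 0.5)),
    ((0.1, 0.1), (0.2, 0.2)),  # stays inside "west", so it is not a commute flow
]
flows = commute_flows(trips, boundaries)
print(flows[("west", "east")], flows[("east", "west")])  # 2 1
```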
The second category is feature-level fusion. We extract features from different datasets, concatenate them into a new feature vector, and use that vector for a classification or information retrieval task. More advanced feature-level fusion methods use deep neural networks to learn new representations of the features extracted from different datasets.
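The simple form of feature-level fusion can be sketched as scaling each dataset's features and concatenating them per region. Min-max scaling is an assumption added here so that features with different units are comparable; the helpers `min_max_scale`, `scale_table`, and `fuse`, and the feature values, are hypothetical.

```python
def min_max_scale(column):
    """Scale a list of numbers to the range 0..1 (all zeros if constant)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(value - lo) / (hi - lo) for value in column]


def scale_table(table):
    """Scale every column of a table given as a list of rows."""
    columns = [min_max_scale(list(col)) for col in zip(*table)]
    return [list(row) for row in zip(*columns)]


def fuse(tables):
    """Concatenate the scaled features of each region across all tables.

    Every table must list the same regions in the same order.
    """
    scaled = [scale_table(table) for table in tables]
    return [sum(rows, []) for rows in zip(*scaled)]


# Made-up features for three regions from two datasets.
weather = [[10.0], [20.0], [30.0]]   # temperature
poi = [[2, 0], [4, 1], [10, 1]]      # restaurants, factories
fused = fuse([weather, poi])
print(fused[1])  # [0.5, 0.25, 1.0]
```

Each fused row can then be passed to any classifier as a single feature vector.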
The third category is semantic meaning-based fusion. Here we need to understand the semantic meaning of the data. This category has four subclasses.
Challenges in urban data output
- Decision-making and service delivery must be dynamic and city-wide. A service cannot be limited to a single road segment; it must be a city-wide service that influences people's decision making.
- Some services predict the future, while others help us understand history.
For example, we want to infer the fine-grained air quality of the entire city. This can serve as a service for understanding the city's current air quality. We can also predict future air quality, which is a way of understanding the future. Sometimes we need to look at history to understand our data, for example to ask what the root cause of a city's air pollution is.
Report:
1. Urban Air Project home: http://urbanair.msra.cn/
2. Urban Air Project paper: http://research.microsoft.com/en-us/projects/urbanair/default.aspx
3. More on Urban computing content and data download: http://research.microsoft.com/en-us/projects/urbancomputing/
Big Data Learning Note 7 • Urban Computing (1)