1 Introduction
First, a web crawler collects all online second-hand housing listings for Nanjing, and the collected data is cleaned. Next, the cleaned data is analyzed visually to explore the patterns hidden behind the large volume of data. Finally, a clustering algorithm is applied to all the second-hand housing data, and the listings are broadly categorized according to the clustering results in order to summarize the data as a whole. Through this analysis we can understand the basic characteristics and distribution of the second-hand homes currently on the market, which helps in making a purchase decision.
2 Technologies used
1) Python web crawling
2) Python data analysis
3) K-means clustering algorithm
4) AMap (Gaode Map) developer JS API
3 Data acquisition and cleaning
3.1 Data collection
In this part, a web crawler collects all Nanjing second-hand housing listings on Lianjia. The resulting raw data is the cornerstone of the entire analysis.
3.1.1 Structure of the Lianjia website
The Lianjia second-hand housing home page is shown in Figures 1 and 2. The upper red box on the home page shows the names of the districts of Nanjing in which second-hand homes are listed, the middle red box shows the total number of listings, and the lower red box shows the listing thumbnails. This last area contains the link tags holding the URL of each listing's information page. The red box at the bottom of Figure 2 shows the number of listings on the home page.
Upper part of the Lianjia second-hand housing home page:
Figure 1 Lianjia home page (upper part)
Lower part of the Lianjia second-hand housing home page:
Figure 2 Lianjia home page (lower part)
The listing information page is shown in Figures 3 and 4. The target data is collected from this page and falls into three major categories: basic information, housing attributes, and transaction attributes. Each category includes the following data items:
1) Basic information: community name, district, total price, unit price.
2) Housing attributes: layout, floor, floor area, layout structure, interior area, building type, orientation, building structure, decoration, elevator-to-household ratio, elevator availability, property-right term.
3) Transaction attributes: listing date, transaction ownership type, last transaction date, housing use, holding period, property ownership, mortgage information, property certificate documents.
Figure 3 Housing Listings Information page
Figure 4 Housing Listings Information page
3.1.3 Key crawler problems
1) Problem 1: The Lianjia home page shows at most 100 pages of listings, so collecting listing URLs from it alone misses some listings, and in the end only part of the data can be collected.
Solution: Crawl the Nanjing second-hand housing data district by district. 100 pages can display at most 3,000 listings, so a district with fewer than 3,000 listings can be crawled directly, while a district with more than 3,000 listings is split into smaller areas.
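The splitting rule can be sketched in a few lines of Python. This is only an illustration: the district counts, the sub-area names, and the `plan_crawl` helper are hypothetical, and 30 listings per page is derived from the 3,000 listings over 100 pages mentioned above.

```python
LISTINGS_PER_PAGE = 30   # 3,000 listings / 100 pages
MAX_PAGES = 100          # the site shows at most 100 result pages
MAX_VISIBLE = LISTINGS_PER_PAGE * MAX_PAGES

def plan_crawl(region, counts, children):
    """Return a list of (region, pages) pairs that together cover `region`.

    counts   -- dict: region name -> number of listings (hypothetical data)
    children -- dict: region name -> list of smaller sub-areas
    """
    total = counts[region]
    if total <= MAX_VISIBLE or not children.get(region):
        # Small enough to crawl directly (or cannot be split any further).
        pages = min(MAX_PAGES, -(-total // LISTINGS_PER_PAGE))  # ceiling division
        return [(region, pages)]
    plan = []
    for sub in children[region]:
        plan.extend(plan_crawl(sub, counts, children))
    return plan

# Hypothetical numbers, for illustration only.
counts = {"Jiangning": 5200, "Jiangning-A": 2900, "Jiangning-B": 2300, "Liuhe": 1}
children = {"Jiangning": ["Jiangning-A", "Jiangning-B"]}
print(plan_crawl("Jiangning", counts, children))
print(plan_crawl("Liuhe", counts, children))
```

A district that still exceeds the limit and has no known sub-areas is crawled up to the 100-page cap, so some of its listings would be missed.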
2) Problem 2: If the crawler runs too fast, it triggers Lianjia's anti-crawler mechanism after two to three thousand records have been collected. All requests are then redirected to Lianjia's human-verification page, which causes the rest of the crawl to fail.
Solution: ① Construct the headers of every HTTP request in the program and rotate the User-Agent header through the entries of a user_agents list, so that the requests appear to come from different browsers. ② After each HTTP request and response is processed, the crawler sleeps for a random 1-3 seconds, and after every 2,500 requests it sleeps for 20 minutes, which keeps the request rate under control.
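A minimal sketch of both measures, using only the standard library. The User-Agent strings and the helper names `build_headers` and `pause_after` are made up for illustration. The actual HTTP call is left as a comment because the original crawler code is not shown here.

```python
import random

# Hypothetical User-Agent strings; a real crawler would use a longer list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0",
]

SHORT_PAUSE = (1.0, 3.0)   # seconds to sleep after an ordinary request
LONG_PAUSE = 20 * 60       # seconds to sleep after every 2,500th request
BATCH_SIZE = 2500

def build_headers(rng=random):
    """Pick a random User-Agent so requests look like different browsers."""
    return {"User-Agent": rng.choice(USER_AGENTS)}

def pause_after(request_count, rng=random):
    """Return how many seconds to sleep after the request_count-th request."""
    if request_count % BATCH_SIZE == 0:
        return LONG_PAUSE
    return rng.uniform(*SHORT_PAUSE)

# Usage inside a crawl loop (fetch is a hypothetical download function):
# for i, url in enumerate(urls, start=1):
#     page = fetch(url, headers=build_headers())
#     time.sleep(pause_after(i))
```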
3.2 Data cleaning
The data collected by the crawler cannot be analyzed directly. We need to remove "dirty" data, correct erroneous data, and unify the format of every field so that the scattered data becomes uniformly structured.
3.2.1 Main parts of the raw data that need cleaning
The main parts to clean are as follows:
1) Aligning the data items of misaligned records
2) Cleaning the formats of some data items
3) Handling missing values
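The three cleaning tasks can be illustrated with a small sketch. The field names, the raw string formats, and the "unknown" placeholder are assumptions made for the example; they are not taken from the actual crawled data.

```python
import re

# Expected order of fields in a cleaned record (a subset, for illustration).
FIELDS = ["community", "district", "total_price", "unit_price", "floor_area"]

def to_number(text):
    """Extract the first number from strings such as '89.5 m2'."""
    if text is None:
        return None
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None

def clean_record(raw):
    """Align a raw dict to FIELDS, normalise formats, and mark missing values as None."""
    record = {name: raw.get(name) for name in FIELDS}          # 1) align data items
    for name in ("total_price", "unit_price", "floor_area"):    # 2) clean formats
        record[name] = to_number(record[name])
    for name in ("community", "district"):                      # 3) missing values
        value = (record[name] or "").strip()
        record[name] = value if value and value != "unknown" else None
    return record

# Hypothetical raw record with a missing field, padding, and unit suffixes.
raw = {"district": " Gulou ", "unit_price": "38000 yuan/m2",
       "floor_area": "89.5 m2", "community": "unknown"}
print(clean_record(raw))
```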
3.2.3 Data cleaning results
The raw data before cleaning is shown in Figure 8 and the cleaned data in Figure 9. After cleaning, the data is much more structured.
Figure 8 Raw data before cleaning
Figure 9 Data after cleaning
4 Visual analysis of the data
Once data cleaning is complete, we can begin visualizing the data. This stage is mainly an exploratory analysis whose results are presented visually. It helps people understand the data better and more intuitively, and it distills the information hidden behind the large volume of data. This article mainly analyzes the total price, unit price, floor area, layout, and district of the listings.
The main steps of the visual analysis are: 1) data loading; 2) data conversion; 3) data visualization.
4.1 Data loading
Much of the work in data analysis and modeling goes into data preparation: cleaning, loading, transformation, and so on. After cleaning, the data is still stored in a text file (CSV format), so to visualize it we must first load it into memory. We use the DataFrame object provided by pandas to load and process the cleaned data; pandas also provides functions that read tabular data into a DataFrame. The main points to watch during loading are:
1) handling of the row and column indexes;
2) data type inference and conversion;
3) handling of missing values.
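As a rough illustration of these three points, the sketch below loads a tiny inline CSV with `pandas.read_csv`. The column names and values are invented stand-ins for the cleaned file, which is not reproduced in this article.

```python
import io
import pandas as pd

# A tiny stand-in for the cleaned CSV file; column names are hypothetical.
csv_text = """id,district,total_price,unit_price,floor_area,mortgage
1,Gulou,350,38000,92.1,
2,Jiangning,260,24000,108.3,none
3,Pukou,,21000,89.0,
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    index_col="id",                      # 1) use the id column as the row index
    dtype={"unit_price": "float64"},     # 2) override the inferred integer type
    na_values=["none"],                  # 3) extra strings to treat as missing
)

print(df.dtypes)
print(df.isna().sum())                   # number of missing values per column
```

With a real file, the first argument would simply be the path to the CSV.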
4.2 Overall data quality analysis
4.2.1 Basic data overview
The basic information about the loaded data is shown in Figure 10. The loaded data has 20,527 rows and 25 columns and occupies 3.9+ MB of memory. In terms of data types, 3 columns are float64, 2 columns are int64, and 20 columns are object. Apart from three columns with many missing values (layout structure, interior area, and mortgage information), the columns have few missing values, so the overall quality of the data is good.
Figure 10 Basic data overview
4.2.2 Word cloud of the overall data file
The word cloud of the overall data file (see Figure 11) shows the words that appear most often in Nanjing second-hand housing listings, such as commodity housing, ordinary residence, one elevator per two households, steel-concrete structure, and fully decorated. These high-frequency words give a rough idea of the basic content of the whole data file.
Figure 11 Word cloud of the overall data file
4.2.3 Line chart of the number of listings in each district of Nanjing
In the line chart of the number of second-hand listings by district (see Figure 13), the horizontal axis is the name of each administrative district of Nanjing and the vertical axis is the number of listings (units). Jiangning has the most listings on sale, more than 5,000, about a quarter of the total. At the other extreme, Liuhe has only 1 listing on sale, which is far too few. The numbers for the other districts are similar to one another. The later analysis of Liuhe is therefore subject to some error.
Figure 13 Line chart of the number of second-hand listings by district in Nanjing
4.2.4 Horizontal bar chart of housing use for Nanjing second-hand housing
In the horizontal bar chart of housing use (see Figure 14), the horizontal axis is the number of listings (units) and the vertical axis is the type of housing use. There are 5 types of use: ordinary residence, villa, commercial office, serviced apartment, and garage. The ordinary residence type, which is our main concern, has nearly 20,000 listings and accounts for the bulk of the total. We therefore did not exclude records with other types of use in this article: they make up a relatively small share of the sample and will not affect the results of the subsequent analysis, and they also fall within the scope of second-hand housing.
Figure 14 Horizontal bar chart of housing use for Nanjing second-hand housing
4.2.5 Summary of overall data quality
From the analysis above, the overall quality of the data file is good. Some data items have many missing values, but the items we care about have few, and most of the items with many missing values are secondary and do not affect our analysis. As for housing use, the data file contains 5 types of second-hand listings, of which ordinary residences account for more than 98%, so the later analysis can be regarded as an analysis of ordinary residential second-hand homes, which matches our expectations. The only shortcoming of the data file is that the Liuhe sample is too small, which introduces some error into the analysis of Liuhe.
4.3 Visual analysis of basic information of Nanjing second-hand housing
The visual analysis of basic information covers four attributes of second-hand housing: district, total price, unit price, and floor area.
4.3.1 Bar chart of the average unit price in each district of Nanjing
In the bar chart of the average unit price by district (see Figure 15), the horizontal axis is the name of each district of Nanjing and the vertical axis is the unit price (yuan per square meter). Jianye and Gulou have the highest average unit prices, nearly 40,000 yuan per square meter. Jianye is a downtown district that has developed rapidly in recent years; its prices have soared and it is now one of the most expensive areas of Nanjing. Gulou, the core of Nanjing, has many shopping malls and school-district homes, and its average price has long been high. Overall, the average unit price in every district of Nanjing (except the district whose figure is subject to error) has exceeded 20,000 yuan per square meter, which reflects the surge in Nanjing house prices in recent years. Although prices in Pukou are relatively low, they have almost doubled compared with a few years ago.
Figure 15 Average unit price in each district of Nanjing
4.3.2 Box plots of unit price and total price in each district of Nanjing
In the box plot of unit price by district (see Figure 16), the horizontal axis is the district name and the vertical axis is the unit price (yuan per square meter). The average unit price is an important reference, but an average cannot effectively represent the overall distribution of the data, especially the distribution of outliers; a box plot is needed to show that information. As Figure 16 shows, the unit prices within the normal range in Jianye and Gulou are not very concentrated: the middle 50% of unit prices fall in the 30,000-50,000 interval, which is wider than in other districts. Although the average unit price in Jianye is slightly higher than in Gulou, Gulou has a great many outliers, with numerous listings above 50,000 and a highest unit price of 100,000. Its upper limit is far above that of Jianye, which has relatively few outliers. On this basis, Gulou should be regarded as the district with the highest unit prices in Nanjing. In Xuanwu and Qinhuai, which are adjacent to Gulou, the normal-range unit prices are more concentrated, with 50% of the data between 30,000 and 40,000, but these two districts also have many outliers and very high upper limits. The many high-priced outliers in these districts are closely tied to the concentration of educational and medical resources there.
Figure 16 Box plot of unit price by district in Nanjing
In the box plots of total price by district (see Figures 17 and 18), the horizontal axis is the district name and the vertical axis is the total price (in units of 10,000 yuan). Figure 18 zooms in on the vertical axis of Figure 17 to make it easier to read and is otherwise identical. In terms of total price, Gulou and Jianye, the two districts with the highest unit prices, also have very high total prices: second-hand homes priced at 5 million yuan still fall within the normal range there. In most other districts of Nanjing, total prices are concentrated between 2 and 4 million yuan, and the lower quartile is very close to 2 million. Although unit prices in Jiangning and Qixia are not high, total prices there are not low. In Jiangning in particular, where prices have risen in recent years, there are already many outliers above 5 million. Total prices in Pukou are the most concentrated, with most of the data falling within the 2-3 million interval.
Figure 17 Box plot of total price by district in Nanjing
Figure 18 Box plot of total price by district in Nanjing (enlarged vertical axis)
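To make the box plot vocabulary in this section concrete, here is a small sketch that computes the quartiles and flags outliers for one list of prices. It assumes the common 1.5 × IQR whisker rule, which the article does not state explicitly, and the sample prices are invented.

```python
import statistics

def box_stats(values):
    """Return quartiles and outliers using the 1.5 * IQR rule (an assumption)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < low or v > high]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}

# Hypothetical unit prices (yuan per square meter) for one district.
prices = [30000, 32000, 35000, 38000, 40000, 42000, 45000, 50000, 100000]
print(box_stats(prices))
```

The box spans q1 to q3 (the middle 50% of listings), and any listing beyond the whisker limits is drawn as an individual outlier point.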
4.3.3 Top 20 highest unit prices of Nanjing second-hand housing
In the horizontal bar chart of the 20 highest unit prices (see Figure 19), the horizontal axis is the unit price (yuan per square meter) and the vertical axis is the community name. The unit prices of all top-20 listings exceed 90,000 and they are concentrated in Gulou, which confirms the many outliers seen for Gulou in the box plot above.
Figure 19 Top 20 highest unit prices of Nanjing second-hand housing
4.3.4 Heat maps of unit price and total price of Nanjing second-hand housing
In the unit price heat map (see Figure 20) and the total price heat map (see Figure 21), red areas represent high listing density and high prices. The upper part of Gulou, Xuanwu, Qinhuai, and Jianye is the densest area. These 4 districts are in the center of Nanjing, with convenient transportation and abundant medical, educational, and other resources, and these factors together produce their high prices.
Figure 20 Unit price heat map of Nanjing second-hand housing
Figure 21 Total price heat map of Nanjing second-hand housing
4.3.5 Distribution map of listings with a total price below 2 million
There are more than 6,000 Nanjing second-hand listings with a total price below 2 million yuan; their distribution is shown in Figure 23. Apart from Gulou and Jianye, where they are relatively few, homes under 2 million can still be found in the other districts.
Figure 23 Distribution map of Nanjing second-hand listings with a total price below 2 million
4.3.6 Analysis of the floor area of Nanjing second-hand housing
In the chart of floor-area intervals (Figure 24), the horizontal axis is the number of listings (units) and the vertical axis is the floor-area interval (square meters). The 50-100 interval has the most listings, more than 10,000, followed by the 100-150 interval and the interval below 50.
Figure 24 Bar chart of floor-area intervals for Nanjing second-hand housing
In the bar chart of average floor area by district (Figure 25), the horizontal axis is the district name and the vertical axis is the floor area (square meters). In the old urban districts of Xuanwu, Qinhuai, and Gulou the average floor area is relatively small, about 80 square meters. By contrast, Jiangning and Pukou, the two districts with the lowest unit prices, have the largest average floor area, more than 100 square meters.
Figure 25 Bar chart of average floor area by district in Nanjing
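Grouping floor areas into 50-square-meter intervals, as in the interval chart of this section, takes only a few lines. The `area_bin` helper and the sample values are hypothetical.

```python
from collections import Counter

def area_bin(area, width=50):
    """Label the interval of `width` square meters that contains `area`."""
    start = int(area // width) * width
    return f"{start}-{start + width}"

areas = [45.0, 62.5, 89.0, 97.3, 108.3, 120.0, 151.2]   # hypothetical values
counts = Counter(area_bin(a) for a in areas)
print(counts.most_common())
```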
4.3.7 Scatter plots of unit price, total price, and floor area
In the scatter plot of total price against floor area (Figure 26), the horizontal axis is the floor area (square meters) and the vertical axis is the total price (in units of 10,000 yuan). Total price and floor area are positively correlated. The data points are fairly concentrated, mostly in the region with a total price of up to 15 million yuan and a floor area of 0-400 square meters.
Figure 26 Scatter plot of total price and floor area of Nanjing second-hand housing
In the scatter plot of unit price against floor area (Figure 27), the horizontal axis is the floor area (square meters) and the vertical axis is the unit price (yuan per square meter). Floor area and unit price show no obvious relationship. The sample points are again fairly concentrated, with few outliers. The listings with especially high unit prices do not have particularly large floor areas, probably because these homes are generally located in the city center.
Figure 27 Scatter plot of unit price and floor area of Nanjing second-hand housing
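The relationships read off the scatter plots can also be checked numerically with a correlation coefficient. The sketch below implements the Pearson coefficient by hand on a few invented listings; it is not the computation used for the figures.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical listings: floor area (m2) and total price (10,000 yuan).
area = [50, 70, 90, 110, 150]
total_price = [150, 200, 280, 330, 460]
unit_price = [t * 10000 / a for t, a in zip(total_price, area)]

print(round(pearson(area, total_price), 3))   # close to 1: strong positive correlation
print(round(pearson(area, unit_price), 3))
```

A value near 1 indicates a strong positive linear relationship, and a value near 0 indicates no linear relationship.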
4.4 Visual analysis of housing attributes of Nanjing second-hand housing
4.4.1 Share of each layout in Nanjing second-hand housing
The layout pie chart (Figure 28) shows that 2 bedrooms with 1 living room and 2 bedrooms with 2 living rooms are the standard configurations, together making up nearly half of the listings. 3 bedrooms with 2 living rooms and 3 bedrooms with 1 living room also account for a large share, while the other layouts account for relatively little.
Figure 28 Layout pie chart of Nanjing second-hand housing
4.4.2 Decoration status of Nanjing second-hand housing
The decoration pie chart (Figure 29) shows that for nearly 60% of the listings the decoration status is "other", probably because these are all second-hand homes that their owners have renovated themselves.
Figure 29 Decoration pie chart of Nanjing second-hand housing
4.4.3 Orientation distribution of Nanjing second-hand housing
In the orientation bar chart (Figure 30), the horizontal axis is the orientation and the vertical axis is the number of listings (units). Only a few orientations have many listings and the rest have very few, which is clearly a long-tailed (heavily skewed) distribution. This also matches our understanding that more than half of homes face south.
Figure 30 Bar chart of the orientation distribution of Nanjing second-hand housing
4.4.4 Building types of Nanjing second-hand housing
The building type pie chart (Figure 31) shows that 65.6% of the buildings are slab-type buildings. Tower buildings, which real estate developers now prefer to build, are fewer, which is consistent with the relatively old age of Nanjing's second-hand housing stock.
Figure 31 Building type pie chart of Nanjing second-hand housing
5 Cluster analysis of the data
This stage uses the K-means clustering algorithm to cluster all the second-hand housing data. Based on the clustering results and on experience, the listings are then roughly classified in order to summarize the data. For clustering we selected three numeric variables, floor area, total price, and unit price, as the cluster attributes of the sample points.
5.1 Principle of the K-means algorithm
5.1.1 Fundamentals
K-means is one of the most popular clustering algorithms. It is an unsupervised learning algorithm that aims to group similar objects into the same cluster; the more similar the objects within a cluster, the better the clustering. The algorithm is not suitable for discrete attributes, but it clusters continuous attributes well.
5.1.2 Criterion for judging the clustering
The criterion for evaluating the final K-means result is the sum of squared distances between each sample point and the centroid of its cluster. The smaller this sum, the better the clustering.
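Written as a formula, this criterion is the sum of squared errors (SSE). The notation is ours and is only one common way to write it: the data is split into clusters C_1, ..., C_k, x_i is a sample point, and μ_j is the centroid (mean) of cluster C_j.

```latex
\mathrm{SSE} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```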
5.1.3 Algorithm steps
1) Select the value of K.
2) Create K points as the starting centroids of the K clusters.
3) Compute the distance from each remaining element to the centroids of the K clusters and assign each element to the cluster whose centroid is nearest.
4) Based on the clustering result, recompute the centroid of each of the K clusters as the arithmetic mean, in each dimension, of all elements in the cluster.
5) Re-cluster all elements according to the new centroids.
6) Repeat steps 4 and 5 until the clustering result no longer changes.
7) Finally, output the clustering result.
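The steps above can be sketched in plain Python as follows. This is a teaching sketch, not the code used for this article. In particular, it samples the starting centroids from the data points unless `init` is given, and it keeps the old centroid when a cluster becomes empty. Both choices are assumptions.

```python
import math
import random

def kmeans(points, k, init=None, seed=0, max_iter=100):
    """Cluster `points` (tuples of numbers) into k groups; return (labels, centroids)."""
    rng = random.Random(seed)
    # Step 2: k starting centroids (given explicitly or sampled from the data).
    centroids = list(init) if init is not None else rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Steps 3 and 5: assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:                 # step 6: stop when nothing changes
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                          # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids                     # step 7: output the result

points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Passing `init=[(0.0, 0.0), (10.0, 10.0)]` fixes the starting centroids, which makes a run reproducible regardless of the random seed.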
5.1.4 Disadvantages of the algorithm
Although K-means is simple in principle, it has its own drawbacks:
1) The number of clusters K must be given before clustering, but in many cases K is hard to estimate: we often do not know in advance how many classes the data set should be divided into.
2) K-means requires the initial centroids to be chosen manually. Different initial centroids may give very different clustering results, and there is no guarantee that the algorithm converges to the global optimum.
3) It is sensitive to outliers.
4) The results are unstable (affected by input order).
5) The time complexity is high, O(nkt), where n is the total number of objects, k is the number of clusters, and t is the number of iterations.
5.2 Key implementation issues
5.2.1 Choosing the value of K
The principle of clustering is that differences within a group should be small and differences between groups should be large. We first compute the SSE (Sum of Squared Errors) for different values of K and then draw a line chart (Figure 32) to compare them and select the best value. When K reaches 5, the change in SSE levels off, so we select 5 as the value of K.
Figure 32 Line chart of SSE for different values of K
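The way such an SSE curve is produced can be sketched as follows. To keep the example reproducible, this version starts from k points spread evenly through the sorted data instead of random centroids, which is an assumption of the sketch. The nine sample points (three tight groups) are invented.

```python
import math

def kmeans_sse(points, k, max_iter=100):
    """Run a basic k-means and return the final sum of squared errors (SSE)."""
    ordered = sorted(points)
    # Deterministic start: k points spread evenly through the sorted data.
    centroids = [ordered[i * len(ordered) // k] for i in range(k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(d) / len(members) for d in zip(*members))
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))

# Three well-separated groups of three points each (hypothetical data).
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 8), (1, 9), (2, 8)]
sse_by_k = {k: kmeans_sse(points, k) for k in range(1, 6)}
for k, sse in sse_by_k.items():
    print(k, round(sse, 2))
```

On this toy data the SSE drops sharply up to k = 3 and changes little afterwards, which is the "elbow" used to pick K.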
5.2.2 Choosing the initial K centroids
The initial K centroids are chosen at random: K centroids are drawn at random, following a normal distribution, from between the maximum and minimum values of each column.
5.2.3 About outliers
Outliers are data points that lie far from the rest and are very unusual. Because K-means is very sensitive to outliers, extremely large and extremely small values should be removed before clustering; otherwise they distort the clustering results. Outliers are identified using the scatter plots and box plots from the earlier visual analysis. Based on those plots, the ranges to remove are as follows:
1) Unit price: values are essentially all within 100,000, so there are no particular outliers to remove.
2) Total price: values are mostly within 3,000, so outliers above 3,000 are removed.
3) Floor area: values are mostly within 500, so outliers above 500 are removed.
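A filter based on these three limits might look like the sketch below. The field names and the sample listings are hypothetical; only the limits come from the list above.

```python
# Upper limits taken from the list above; field names are hypothetical.
LIMITS = {"unit_price": 100000, "total_price": 3000, "floor_area": 500}

def within_limits(listing):
    """True if every clustering attribute is at or below its upper limit."""
    return all(listing[field] <= limit for field, limit in LIMITS.items())

listings = [
    {"unit_price": 38000, "total_price": 350, "floor_area": 92.1},
    {"unit_price": 95000, "total_price": 4200, "floor_area": 442.0},  # total price too high
    {"unit_price": 30000, "total_price": 1800, "floor_area": 600.0},  # floor area too large
]
kept = [row for row in listings if within_limits(row)]
print(len(kept), "of", len(listings), "listings kept")
```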
5.2.4 Data standardization
The total price is measured in units of 10,000 yuan, the unit price in yuan per square meter, and the floor area in square meters, so a Euclidean distance computed directly from these values mixes units and is not meaningful. Moreover, the total price is a number within 3,000 and the floor area a number within 500, whereas the unit price is typically more than 20,000. In the distance calculation the unit price therefore carries far more weight than the total price, and both carry far more weight than the floor area, so the clustering result is distorted. We need to standardize the data by scaling it into a common interval. This removes the units and converts the values into dimensionless numbers, so that quantities with different units or magnitudes can be combined and compared.
We map the unit price, total price, and floor area onto a range of about 500. The floor area is already within 500 and needs no special treatment. When computing distances, the unit price is multiplied by a scale factor of 0.005 and the total price by a scale factor of 0.16. The clustering before and after standardization is compared in the following figures: Figures 32 and 33 are the scatter plots of the clustering before standardization, and Figures 34 and 35 are the scatter plots after standardization.
Scatter plot of the clustering by unit price and floor area before standardization:
Figure 32 Scatter plot of unit price and floor area before data standardization
Scatter plot of the clustering by total price and floor area before standardization:
Figure 33 Scatter plot of total price and floor area before data standardization
Scatter plot of the clustering by unit price and floor area after standardization:
Figure 34 Scatter plot of unit price and floor area after data standardization
Scatter plot of the clustering by total price and floor area after standardization:
Figure 35 Scatter plot of total price and floor area after data standardization
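The scaling described in this section can be sketched as follows. The scale factors 0.005 and 0.16 come from the text (100,000 × 0.005 = 500 and 3,000 × 0.16 = 480); the field names, helper functions, and sample listings are assumptions for illustration.

```python
import math

# Scale factors from the text: each attribute is mapped to roughly 0-500.
SCALE = {"unit_price": 0.005, "total_price": 0.16, "floor_area": 1.0}
ORDER = ("unit_price", "total_price", "floor_area")

def scaled_vector(listing):
    """Convert a listing into a scaled, dimensionless feature vector."""
    return tuple(listing[name] * SCALE[name] for name in ORDER)

def distance(a, b):
    """Euclidean distance between two listings after scaling."""
    return math.dist(scaled_vector(a), scaled_vector(b))

# Hypothetical listings: unit price (yuan/m2), total price (10,000 yuan), area (m2).
a = {"unit_price": 40000, "total_price": 400, "floor_area": 100}
b = {"unit_price": 20000, "total_price": 200, "floor_area": 100}
print(scaled_vector(a))
print(distance(a, b))
```

Without scaling, the unit price difference of 20,000 would dominate the distance between these two listings; after scaling, all three attributes contribute on a comparable range.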
5.3 Analysis of the clustering results
The clustering results are as follows:
1) The statistics of the clustering results are as follows:
2) For the post-clustering scatter plots of unit price against floor area and total price against floor area, see Figures 34 and 35.
3) The district distribution maps of cluster groups 0, 1, 2, 3, and 4 are shown in Figures 36, 37, 38, 39, and 40.
District distribution map of cluster group 0:
Figure 36 District distribution map of cluster group 0
District distribution map of cluster group 1:
Figure 37 District distribution map of cluster group 1
District distribution map of cluster group 2:
Figure 38 District distribution map of cluster group 2
District distribution map of cluster group 3:
Figure 39 District distribution map of cluster group 3
District distribution map of cluster group 4:
Figure 40 District distribution map of cluster group 4
Based on the clustering results above and our own experience, the roughly 20,000 listings can be divided into the following 4 categories:
A. Large homes (large floor area, high price), corresponding to cluster 0. The average floor area exceeds 200 square meters. Such homes are relatively few and are mainly found in Gulou, Jianye, Jiangning, Qixia, and elsewhere (see the district distribution maps for details).
B. Prime-location homes (high unit price), corresponding to clusters 2 and 4. These homes are distributed around the center of Nanjing, with excellent locations and convenient transportation, mainly in Gulou, Xuanwu, Jianye, and elsewhere (see the district distribution maps for details).
C. Mass-market homes (small floor area, relatively low prices, many listings), corresponding to cluster 3. These homes are widely distributed, mainly along both sides of the subway lines. Typical districts are Qinhuai, Gulou, Jiangning, Xuanwu, and Pukou.
D. High value-for-money homes (relatively large floor area, low unit price), corresponding to cluster 1. Typical districts are Qixia, Pukou, and Jiangning.
PS: I will publish the GitHub address once people have read this :). It feels like this might take off.
Visual analysis of Nanjing second-hand housing data based on Python