In spatial analysis, the most important concept is distance: different distances lead to different results. There is a related concept called "spatial scale"; if you are interested, please look it up on Baidu yourself (the usual rule: don't ask me things Baidu already knows).

Therefore, in clustering studies, the most important thing is to determine the distance between data points; otherwise the result looks like this:

In cluster analysis, the distance between features is a key parameter: how far apart can two elements be and still be considered one group? In every clustering algorithm, finding a suitable distance is a thorny problem. Experts have proposed all kinds of algorithms to optimize this distance search and reduce its computational overhead.

The same data will show different clustering behavior at different distances, so today we look at a rather magical tool. Unlike analysis tools that work at a single distance, it summarizes how the data aggregates at every distance, helping us choose an appropriate scale for analysis.

First, let's see what Ripley's K function is. Ripley's K is a method for analyzing point data patterns: the function can analyze a point dataset at a range of different distances, as shown below:

When the distance is 5, the position and density of the feature centroids are as shown on the left; when the distance expands to 10, the center of mass and the number of features involved change, so the density of the data changes as well.

Therefore, Ripley's K function describes the degree of spatial clustering or dispersion of the feature centroids, and how that degree changes as the neighborhood size changes.

Let's take a look at the fundamentals of this algorithm.

First we set a starting distance; you can also specify a final distance and an increment step, for example starting at 5 and increasing by 3 on each iteration. As the search distance grows, more neighboring features are naturally included, so the density of the data contained within each distance can be calculated.

When all distances have been processed, the densities are averaged, and this average density is used as the standard density value for comparison.

The density of the data contained within each distance is then compared to this standard density value. If it is greater than the standard density, we consider the data clustered at that distance; if it is less than the standard value, we consider it dispersed. As shown below:
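The simplified procedure described above (count neighbors at each distance, average the densities, compare each distance against the average) can be sketched in Python. This is a minimal illustration, not the ArcGIS implementation; the function names and the naive density normalization (pairs divided by n·π·d²) are my own assumptions.

```python
import math

def density_per_distance(points, distances):
    """For each search distance d, count the ordered point pairs within d
    and express the result as a simple per-distance density."""
    n = len(points)
    densities = {}
    for d in distances:
        pairs = sum(
            1
            for i in range(n)
            for j in range(n)
            if i != j and math.dist(points[i], points[j]) <= d
        )
        # Naive normalization: pairs per point, per search-circle area.
        densities[d] = pairs / (n * math.pi * d * d)
    return densities

def classify_by_average(densities):
    """Compare each distance's density against the average density:
    above average -> clustered, otherwise -> dispersed."""
    standard = sum(densities.values()) / len(densities)
    return {
        d: ("clustered" if v > standard else "dispersed")
        for d, v in densities.items()
    }
```

For example, four points at the corners of a unit square are "clustered" at distance 1.0 (each point has two neighbors exactly 1.0 away) and "dispersed" at distance 1.5 relative to the average density across the two distances.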

As you can see, the overall distribution of the data is not linear, and this notion of clustered versus dispersed at a given distance is largely qualitative. To judge at which distances the clustering is strong and at which the dispersion is large, we generally compare the observed K values with the expected K values.

The so-called observed K value is the density value we actually calculate, while the expected K value is the value we would expect under a random distribution.

Some students will ask: didn't we just use the average density for the comparison? What are this expected K value and random distribution about?

We have discussed the problem with averages before: although the mean is easy to use, its shortcomings are just as obvious as its advantages. It is fine for describing an algorithm, but in actual use the problems it exposes will drive an analyst crazy, especially in studies of spatial distribution. If you rely only on average density to study data with spatial structure, problems like the following arise:

Therefore, to avoid crude averaging, studies of spatial distribution more often use a null hypothesis of a random distribution to set the expected value.

In practice, the approach is as follows:

In each study area, a random-distribution assumption is made independently; that is, the expectation under the null hypothesis is set separately for each study area, which avoids the errors of the overall average described above.

Therefore, when the calculation is complete, the algorithm produces two sets of data, one called the "observed K value" and one called the "expected K value". Their characteristics are as follows:

If the observed K value at a specific distance is greater than the expected K value, the distribution is more clustered than a random distribution at that distance (that analysis scale). If the observed K value is less than the expected K value, the distribution is more dispersed than a random distribution at that distance.
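For a completely random point pattern, the expected K value at distance d is π·d² (a standard result for complete spatial randomness; the article does not give the formula). The observed-versus-expected rule above then reduces to a one-line comparison; the function name here is my own.

```python
import math

def compare_to_expected(k_observed, d):
    """Classify a distance by comparing the observed K value to the
    complete-spatial-randomness expectation pi * d^2."""
    k_expected = math.pi * d * d
    if k_observed > k_expected:
        return "more clustered than random"
    if k_observed < k_expected:
        return "more dispersed than random"
    return "consistent with random"
```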

Of course, before we use the null hypothesis (or any hypothesis), it is best to establish a confidence level. As we said before, when running a test, first decide how many coins to throw; Sir Ronald Fisher also summed this up as the 5% rule.

Of course, you can also skip setting a confidence level, which amounts to "calculate however you like, as long as the result looks good."

But for better results, of course, you should set the confidence level.

Within this algorithm, the expected K value is determined by generating random points: if you have N data points, the algorithm generates N random points, scatters them randomly across your study area, and uses this random distribution as the hypothesis against which your data is validated.

Because they are random, when you place these points you cannot be sure where they will land; they might all end up in one pile. So the best approach is to generate several sets of random points and place them several times to get a reliable result. How many sets would be better? In theory, the more the better, but in practice there is a limit. In the Multi-Distance Spatial Cluster Analysis (Ripley's K Function) tool provided by ArcGIS, a parameter called "Compute_confidence_envelope" (compute confidence envelope) is provided, with a total of 4 options:

- 0_permutations_-_no_confidence_envelope: no confidence envelope is created.
- 9_permutations: randomly places 9 sets of points/values.
- 99_permutations: randomly places 99 sets of points/values.
- 999_permutations: randomly places 999 sets of points/values.

Here, 9 corresponds to 90% confidence, 99 to 99%, and 999 to 99.9%.
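A rough sketch of how such a confidence envelope can be built (the function and parameter names are my own, and this is a simplification of what the ArcGIS tool does): for each permutation, place the same number of points uniformly at random in the study area, compute K at the target distance, and take the minimum and maximum of the simulated values as the lower and upper envelope.

```python
import math
import random

def k_estimate(points, area, d):
    """Ripley's K estimate: K(d) = area / n^2 * (ordered pairs within d)."""
    n = len(points)
    pairs = sum(
        1
        for i in range(n)
        for j in range(n)
        if i != j and math.dist(points[i], points[j]) <= d
    )
    return area * pairs / (n * n)

def confidence_envelope(n_points, width, height, d, permutations=99, seed=0):
    """Simulate `permutations` random point sets in a width x height
    study area; return the (low, high) envelope of K at distance d."""
    rng = random.Random(seed)
    area = width * height
    sims = [
        k_estimate(
            [(rng.uniform(0, width), rng.uniform(0, height))
             for _ in range(n_points)],
            area,
            d,
        )
        for _ in range(permutations)
    ]
    return min(sims), max(sims)
```

With only 9 permutations the envelope is wide and crude, which is exactly why the tool offers 99 and 999 as well.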

When this parameter is used, the algorithm also calculates two pieces of data, lwconfenv and hiconfenv, which record the confidence interval information for each iteration of the calculation (as specified by the distance-segment-count parameter).

If the observed K value is greater than the hiconfenv value, spatial clustering at that distance is statistically significant. If the observed K value is less than the lwconfenv value, spatial dispersion at that distance is statistically significant.
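Putting that rule into code (the envelope names follow the ArcGIS output described above; the helper name is my own):

```python
def significance(observed_k, lwconfenv, hiconfenv):
    """Apply the envelope rule: above the upper bound means statistically
    significant clustering, below the lower bound means statistically
    significant dispersion, and anything in between is not significant."""
    if observed_k > hiconfenv:
        return "statistically significant clustering"
    if observed_k < lwconfenv:
        return "statistically significant dispersion"
    return "not statistically significant"
```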

How this algorithm is computed, what its results look like, and how to interpret them: we will continue in the next article.

To be continued.

Copyright notice: this is an original article by the blogger; reproduction without the blogger's permission is prohibited.

Vernacular Spatial Statistics 15: Multi-Distance Spatial Cluster Analysis (Ripley's K Function) (Part 1)