Geostatistical Analysis Notes (I): Exploring the Data

Source: Internet
Author: User

Before performing geostatistical analysis, it is crucial to browse, become familiar with, and check your data. Plotting and checking the data is a necessary stage of the analysis, and what we learn from it guides the subsequent work.


Stage 1: Plot the data

Plotting the data with ArcMap's layer rendering schemes gives a first impression of it.

For example, single-symbol rendering shows where the sampling points are dense or sparse, and classified rendering shows how high and low values are distributed among the sampling points.


Stage 2: Check the data

Use the Exploratory Spatial Data Analysis (ESDA) tools for the second stage of data exploration. These tools provide a more quantitative way to check the data than plotting it. They give us a deeper understanding of the phenomenon being studied, which helps us make better decisions about how to build the interpolation model.

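The ESDA tools covered below are the Histogram, Normal QQ Plot, General QQ Plot, Voronoi Map, Semivariogram/Covariance Cloud, Trend Analysis, and Cross-covariance Cloud.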


Ⅰ Is the data normally distributed?

Histogram

The histogram shows the frequency distribution of the dataset of interest and calculates summary statistics. How do we interpret the graph and the statistics?

If the data is normally distributed, the mean is close to the median, the skewness is close to zero, and the kurtosis is close to three.

The mean is the arithmetic average of the data. It provides a measure of the center of the distribution.

The median corresponds to a cumulative proportion of 0.5. If the data is sorted in ascending order, 50% of the values lie below the median and 50% lie above it. The median provides another measure of the center of the distribution.

The first and third quartiles correspond to cumulative proportions of 0.25 and 0.75, respectively. If the data is sorted in ascending order, 25% of the values lie below the first quartile and 25% lie above the third quartile. Quartiles are special cases of quantiles.

The skewness coefficient measures the symmetry of the distribution. For a symmetric distribution, the skewness coefficient is zero. A distribution with a long right tail of large values is positively skewed; a distribution with a long left tail of small values is negatively skewed. For a positively skewed distribution, the mean is greater than the median; for a negatively skewed distribution, the mean is less than the median.

Kurtosis depends on the size of the tails of the distribution and measures how likely the distribution is to produce outliers. The kurtosis of a normal distribution is three. A distribution with thick tails is called leptokurtic, and its kurtosis is greater than three. A distribution with thin tails is called platykurtic, and its kurtosis is less than three.

The variance is the average squared deviation of the values from the mean, so it is sensitive to unusually high or low values. The standard deviation is the square root of the variance and describes how widely the data is spread around the mean. The smaller the variance and standard deviation, the more tightly the measured values cluster around the mean.
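The statistics described above can be reproduced outside ArcMap for a quick check. The following is a minimal sketch using only the Python standard library; the function name `summarize` and the sample values are hypothetical, and it uses population moments (dividing by n), which may differ slightly from what a particular tool reports.

```python
import math
import statistics

def summarize(values):
    """Return the summary statistics a histogram tool typically reports."""
    n = len(values)
    mean = statistics.fmean(values)
    # Population moments about the mean (divide by n, not n - 1).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": mean,
        "median": median,
        "q1": q1,
        "q3": q3,
        "variance": m2,
        "std_dev": math.sqrt(m2),
        "skewness": m3 / m2 ** 1.5,   # 0 for a symmetric distribution
        "kurtosis": m4 / m2 ** 2,     # non-excess form: 3 for a normal distribution
    }

# Hypothetical sample with one large value in the right tail.
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]
stats = summarize(sample)
```

For this sample the mean exceeds the median and the skewness is positive, which matches the description of a positively skewed distribution.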


Normal QQ Plot

The points on a normal QQ plot indicate how close the univariate distribution of the dataset is to normal. If the data is normally distributed, the points fall on the 45-degree reference line. If the data is not normally distributed, the points deviate from the reference line.
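To make the construction concrete, here is a rough sketch of how the points of a normal QQ plot could be computed. It standardizes the data so that the 45-degree line is the reference, and it uses the plotting position (i - 0.5)/n, which is one common convention among several. The function names are hypothetical and this is not the ArcMap implementation.

```python
from statistics import NormalDist, fmean, pstdev

def normal_qq_points(values):
    """Pair each standardized, sorted value with a standard normal quantile."""
    n = len(values)
    mean, sd = fmean(values), pstdev(values)
    if sd == 0:
        raise ValueError("all values are identical")
    points = []
    for i, v in enumerate(sorted(values), start=1):
        p = (i - 0.5) / n                      # plotting position (assumed convention)
        theoretical = NormalDist().inv_cdf(p)  # x coordinate: standard normal quantile
        points.append((theoretical, (v - mean) / sd))
    return points

def max_deviation(points):
    """Largest vertical distance of any point from the 45-degree line."""
    return max(abs(y - x) for x, y in points)
```

A small `max_deviation` suggests the data is close to normal; a large one means the points leave the reference line.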


General QQ Plot

A general QQ plot is used to assess how similar the distributions of two datasets are. It is created in much the same way as the normal QQ plot, except that the second dataset does not have to follow a normal distribution; any dataset can be used. If the two datasets have the same distribution, the points in the general QQ plot fall on a 45-degree straight line.
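As an illustration, the points of a general QQ plot can be sketched by pairing matching quantiles of the two datasets. The function name and the choice of 20 quantiles are assumptions made for this example.

```python
from statistics import quantiles

def general_qq_points(first, second, n_quantiles=20):
    """Pair matching quantiles of two datasets; the sample sizes may differ."""
    qa = quantiles(first, n=n_quantiles, method="inclusive")
    qb = quantiles(second, n=n_quantiles, method="inclusive")
    return list(zip(qa, qb))

# Two hypothetical samples covering the same range with different sizes.
a = list(range(101))
b = [i / 2 for i in range(201)]
points = general_qq_points(a, b)
```

Because `a` and `b` are spread identically over the same range, every point has equal x and y coordinates. Shifting one dataset by a constant moves all points off the 45-degree line by that constant.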



### Data Transformation

Some interpolation methods in Geostatistical Analyst require the data to be normally distributed. If the data is skewed (asymmetrically distributed), you may need to transform it to bring it closer to a normal distribution.

Box-Cox transformation (also called power transformation)
Suppose the data consists of counts. If the counts in one part of the study area are small, the variability there is smaller than in another part where the counts are larger. In this case, the square root transformation helps make the variance more constant across the study area and often makes the data appear normally distributed as well. The square root transformation is the special case of the Box-Cox transformation with λ = 1/2.
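A minimal sketch of the transformation follows, assuming the standard textbook form (x^λ - 1)/λ with the natural logarithm as the limiting case at λ = 0. The text does not state the exact form Geostatistical Analyst applies, so treat this as an illustration; the function name is hypothetical.

```python
import math

def box_cox(value, lam):
    """Box-Cox power transformation of a positive value (textbook form)."""
    if value <= 0:
        raise ValueError("Box-Cox requires positive values")
    if lam == 0:
        return math.log(value)          # limiting case as lambda approaches 0
    return (value ** lam - 1) / lam

counts = [1, 4, 9, 16, 100]             # hypothetical count data
sqrt_like = [box_cox(c, 0.5) for c in counts]
```

With λ = 1/2 the result is 2 * (sqrt(x) - 1), a shifted and rescaled square root, so it has the same variance-stabilizing effect on count data as the plain square root.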

Logarithm Transformation
The logarithm transformation is the special case of the Box-Cox transformation with λ = 0. It is usually used for positively skewed data that contains a few very large values. If these large values are located in the study area, the logarithm transformation helps make the variance more constant and brings the data closer to a normal distribution.

[Figures omitted: an example data distribution, and a comparison before and after the transformation.]
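The effect of the logarithm transformation on skewness can also be checked numerically. The sketch below uses made-up, positively skewed values and a hypothetical `skewness` helper based on population moments.

```python
import math

def skewness(values):
    """Population skewness coefficient: third moment over variance^1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical positive data with a few very large values.
raw = [1, 2, 2, 3, 4, 5, 8, 15, 40, 120]
logged = [math.log(v) for v in raw]

skew_before = skewness(raw)
skew_after = skewness(logged)
```

For this sample the skewness drops noticeably after the transformation, although it does not reach zero; a transformation brings data closer to normal, it does not guarantee normality.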

Arcsin Transformation
The arcsin transformation can be used for data that represents proportions or percentages. For proportion data, the variance is generally smallest near 0 and 1 and largest near 0.5. The arcsin transformation helps make the variance more constant across the study area and often makes the data appear normally distributed.
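For illustration only, the sketch below uses the classical angular form from statistics, arcsin(sqrt(p)). The text does not give the formula that Geostatistical Analyst uses, and the tool's definition may differ, so this is an assumption; the function name is hypothetical.

```python
import math

def arcsine_transform(p):
    """Classical angular transformation of a proportion in [0, 1] (assumed form)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("proportion must lie in [0, 1]")
    return math.asin(math.sqrt(p))

proportions = [0.0, 0.05, 0.5, 0.95, 1.0]   # hypothetical proportion data
transformed = [arcsine_transform(p) for p in proportions]
```

The transformation stretches the scale near 0 and 1, where proportion data has the least variance, and leaves the middle of the range comparatively compressed.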


Ⅱ Are there outliers?

A global outlier is a measured sample point with a very high or very low value relative to all values in the dataset.
A local outlier is a measured sample point whose value is within the normal range of the entire dataset but is unusually high or low compared with the surrounding points.

If an outlier is a real anomaly in the phenomenon, it may be the most important point for studying and understanding that phenomenon. If an outlier is caused by an error during data entry, correct or remove it before creating the surface.

Histogram

If there is an isolated bar at the far left (minimum) or far right (maximum) of the histogram, the points represented by that bar may be outliers. The more isolated the bar is from the main group of bars, the more likely it is that those points are outliers.
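The visual check can be approximated in code. The sketch below bins the values into equal-width bars and flags an edge bar that is separated from the rest by at least one empty bar. Both function names and the "one empty bar" rule are assumptions made for this example; in practice the judgment is made by eye.

```python
def histogram_counts(values, n_bins):
    """Count values in n_bins equal-width bins between the minimum and maximum."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("all values are identical")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # the maximum goes in the last bin
        counts[idx] += 1
    return counts

def isolated_edge_bars(counts):
    """Report edge bars that are separated from the others by an empty bin."""
    flagged = []
    if len(counts) > 2 and counts[0] > 0 and counts[1] == 0:
        flagged.append("left")
    if len(counts) > 2 and counts[-1] > 0 and counts[-2] == 0:
        flagged.append("right")
    return flagged

values = [10, 11, 12, 12, 13, 13, 14, 15, 60]   # hypothetical data, 60 is suspicious
counts = histogram_counts(values, 10)
flags = isolated_edge_bars(counts)
```

Here the value 60 ends up alone in the last bar, far from the main group, so the right edge is flagged as a candidate global outlier.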


Voronoi Map

The Voronoi map is built from the Thiessen polygons formed around the sampling points.

When viewing the map, check whether the color of any polygon differs markedly from that of its neighbors.

For example, a red polygon whose value is clearly different from the surrounding values points to a possible local outlier.


Semivariogram/Covariance Cloud

The Semivariogram/Covariance Cloud tool can be used to examine the local characteristics of spatial autocorrelation in a dataset and to find local outliers.

Each point in the cloud represents a pair of points in the dataset: the x axis represents the distance between the two locations, and the y axis represents the squared difference of the values at those locations. Each point in the semivariogram cloud therefore represents a pair of locations, not a single location on the map, so the number of points in the cloud grows rapidly as the number of points in the dataset increases. If the dataset contains n points, the Semivariogram/Covariance Cloud displays n * (n-1)/2 points. For this reason, the tool is not recommended for datasets with more than a few thousand points. If a dataset is larger than that, use the Subset Features tool to select points at random, and then use the subset in the Semivariogram/Covariance Cloud.
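A minimal sketch of how such a cloud is built, following the description above: one (distance, squared difference) entry per pair of sample points. The function name and the `(x, y, value)` tuple layout are assumptions for this example. Note that the semivariance itself is half the squared difference; the sketch keeps the squared difference as described in the text.

```python
import math
from itertools import combinations

def semivariogram_cloud(samples):
    """samples: list of (x, y, value). Returns (distance, squared_difference) per pair."""
    cloud = []
    for (x1, y1, v1), (x2, y2, v2) in combinations(samples, 2):
        distance = math.hypot(x2 - x1, y2 - y1)
        cloud.append((distance, (v1 - v2) ** 2))
    return cloud

# Hypothetical sample points.
samples = [(0, 0, 1), (3, 4, 3), (6, 8, 2), (0, 5, 7)]
cloud = semivariogram_cloud(samples)
```

Four sample points give 4 * 3 / 2 = 6 pairs, and 50 points already give 1225, which shows why the cloud becomes unwieldy for large datasets.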

The Semivariogram/Covariance Cloud tool is particularly useful for detecting local outliers. They appear as pairs of points that are close together (low values on the x axis) but high on the y axis, indicating that the values of the two points in the pair are very different. This contradicts the expectation that points close to each other have similar values.
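One way to turn this observation into a rough screening step is to collect the pairs that are close together yet have a large squared difference, and count how often each sample point appears in them. This is only a sketch; the function name and both thresholds are hypothetical and would need to be chosen for the data at hand.

```python
import math
from collections import Counter
from itertools import combinations

def suspect_points(samples, max_distance, min_sq_diff):
    """Count, per sample index, the close pairs with a large squared difference."""
    counts = Counter()
    for (i, (x1, y1, v1)), (j, (x2, y2, v2)) in combinations(enumerate(samples), 2):
        close = math.hypot(x2 - x1, y2 - y1) <= max_distance
        if close and (v1 - v2) ** 2 >= min_sq_diff:
            counts[i] += 1
            counts[j] += 1
    return counts.most_common()

# Hypothetical 3 x 3 grid with a smooth surface and one odd value in the center.
samples = [(x, y, 50 if (x, y) == (1, 1) else 10 + x + y)
           for x in range(3) for y in range(3)]
ranking = suspect_points(samples, max_distance=1.5, min_sq_diff=400)
```

The center point (index 4) takes part in all eight flagged pairs, while each of its neighbors appears only once, so it stands out as the likely local outlier.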


Ⅲ Is there a trend?

Trend Analysis

The Trend Analysis tool provides a three-dimensional perspective of the data. The locations of the sampling points are plotted on the x,y plane, and the z value represents the attribute of interest. The tool projects the points as scatter plots onto the x,z and y,z planes and fits a polynomial curve to each projection.

Look at the thick lines on the vertical walls of the graph. These lines represent trends. One shows the trend along the x axis (usually the longitudinal trend), and the other shows the trend along the y axis (usually the latitudinal trend). If the curve through the projected points is flat, there is no trend; if the polynomial curve has a definite pattern (such as the blue and green lines), there is a trend in the data.

It is also useful to change the order of the polynomials when checking for trends. To check for trends in directions other than the standard N-S and E-W directions, rotate the trend axes.
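The projection-and-fit idea can be sketched with an ordinary least-squares polynomial fit on each projection. The example below solves the normal equations in pure Python; the function name and sample data are hypothetical, and this is not the fitting routine ArcMap uses. Axis rotation is not covered.

```python
def fit_polynomial(xs, zs, order):
    """Least-squares fit; returns coefficients c[0] + c[1]*x + ... + c[order]*x**order."""
    size = order + 1
    # Normal equations (A^T A) c = A^T z for the Vandermonde matrix A.
    ata = [[sum(x ** (r + c) for x in xs) for c in range(size)] for r in range(size)]
    atz = [sum(z * x ** r for x, z in zip(xs, zs)) for r in range(size)]
    aug = [row + [rhs] for row, rhs in zip(ata, atz)]
    # Gaussian elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, size):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, size + 1):
                aug[r][k] -= factor * aug[col][k]
    # Back substitution.
    coeffs = [0.0] * size
    for r in reversed(range(size)):
        known = sum(aug[r][k] * coeffs[k] for k in range(r + 1, size))
        coeffs[r] = (aug[r][size] - known) / aug[r][r]
    return coeffs

# Hypothetical grid of samples (x, y, z) whose value rises with x only.
samples = [(x, y, 3 + 2 * x) for x in range(5) for y in range(5)]
xz_trend = fit_polynomial([s[0] for s in samples], [s[2] for s in samples], order=1)
yz_trend = fit_polynomial([s[1] for s in samples], [s[2] for s in samples], order=1)
```

The fit on the x,z projection recovers a slope of about 2, while the fit on the y,z projection is flat, which corresponds to a trend along x and none along y. Raising `order` to 2 or 3 mirrors changing the polynomial order in the tool.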


Ⅳ Is there spatial autocorrelation?

We can explore the spatial autocorrelation of the data by examining pairs of sample points at different locations. We again use the ESDA tool mentioned above, the semivariogram cloud.


Semivariogram/Covariance Cloud

If spatial autocorrelation exists, pairs of points that are close together (at the far left of the x axis) should have small differences (low on the y axis). As the distance between points increases (moving right along the x axis), the squared difference should also increase (moving up the y axis). Usually the squared difference levels off beyond a certain distance. Pairs of locations farther apart than this distance are considered uncorrelated.

If the points in the semivariogram cloud form a horizontal straight line, there may be no spatial autocorrelation in the data, and interpolating the data would be meaningless.
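The shape of the cloud is easier to judge after averaging it by distance class. The following sketch bins the pairs into lags and averages half the squared difference (the semivariance) in each lag; the function name, lag width, and sample data are assumptions for this example.

```python
import math
from itertools import combinations

def empirical_semivariogram(samples, lag_width, n_lags):
    """Average semivariance per distance bin; returns (bin_center, semivariance) pairs."""
    sums = [0.0] * n_lags
    counts = [0] * n_lags
    for (x1, y1, v1), (x2, y2, v2) in combinations(samples, 2):
        lag = int(math.hypot(x2 - x1, y2 - y1) // lag_width)
        if lag < n_lags:
            sums[lag] += 0.5 * (v1 - v2) ** 2
            counts[lag] += 1
    return [((i + 0.5) * lag_width, sums[i] / counts[i])
            for i in range(n_lags) if counts[i]]

# Hypothetical transect whose value changes steadily with position.
samples = [(i, 0, i) for i in range(20)]
curve = empirical_semivariogram(samples, lag_width=2, n_lags=5)
```

For this spatially structured example the averaged semivariance rises with distance. A curve that stays roughly level from the first lag onward would correspond to the horizontal pattern described above.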

A basic assumption of geostatistical methods is that any two locations separated by the same distance and direction should have similar squared differences. This property is called stationarity. When spatial autocorrelation depends only on the distance between two locations, it is called isotropy. When things are more alike in some directions than in others, that is, when the semivariogram and covariance are affected by direction, it is called anisotropy.


Cross-covariance cloud

The cross-covariance cloud tool can be used to study the cross-correlation between two datasets. The cloud shows the empirical cross-covariance for all pairs of locations between the two datasets, plotted as a function of the distance between the two locations. Like the tool above, it also provides a covariance surface with a search direction feature.
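A rough sketch of the cloud itself: for every pair made of one location from each dataset, it records the distance and the product of the two values' deviations from their own dataset means. The function name and the `(x, y, value)` layout are assumptions for this example, and the covariance surface and search direction are not covered.

```python
import math
from statistics import fmean

def cross_covariance_cloud(first, second):
    """first, second: lists of (x, y, value). Returns (distance, cross_covariance) per pair."""
    mean1 = fmean(v for _, _, v in first)
    mean2 = fmean(v for _, _, v in second)
    cloud = []
    for x1, y1, v1 in first:
        for x2, y2, v2 in second:
            distance = math.hypot(x2 - x1, y2 - y1)
            cloud.append((distance, (v1 - mean1) * (v2 - mean2)))
    return cloud

# Two hypothetical datasets measured at the same two locations.
first = [(0, 0, 1), (1, 0, 3)]
second = [(0, 0, 10), (1, 0, 30)]
cloud = cross_covariance_cloud(first, second)
```

In this toy example the pairs at distance 0 have positive cross-covariance and the pairs at distance 1 have negative cross-covariance, meaning the two variables rise and fall together at the same locations.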


We have now formed a first impression of the data and checked it with the ESDA tools, which gives us some prior knowledge about the data under study. The next step is to choose an interpolation method and create a surface, which the next article covers.
