Data Mining-Understanding data

Last Update:2018-10-08 Source: Internet

Author: User


Data objects and attribute types

A dataset consists of data objects. A data object represents an entity; in a sales database, for example, an object can be a customer or a commodity. An attribute is a data field that represents a feature of a data object.

Attribute types

Nominal attribute: The values are names of things; each value represents a category, code, or state. The values have no meaningful order and are not quantitative, so the mean and median are meaningless, but the mode is meaningful. For example, the color of an object may be black, red, white, and so on, and an occupation may be teacher, doctor, and so on.
Binary attribute: A nominal attribute with only two categories or states, 0 or 1. A binary attribute is symmetric if both states are equally important, such as gender with the states male and female. It is asymmetric if the two outcomes are not equally important, such as the positive and negative results of an HIV test; by convention, the more important outcome (usually the rarer one) is coded as 1 and the other as 0.
Ordinal attribute: The values have a meaningful order, but the size of the difference between successive values is unknown. The central tendency can be expressed by the mode and the median, but the mean cannot be defined. For example, grades may take the values A+, A, A-, and so on.

The three types above are qualitative attributes: they describe a feature of an object without giving an actual size or quantity, and their values are only codes, not measurable amounts.

Numeric attributes are quantitative: they are measurable quantities, expressed as integer or real values.
- Interval-scaled attribute: The differences between values can be compared and quantified, but there is no true zero point, so ratios and multiples are not meaningful. The median, mode, and mean can all be computed. For example, with temperature in degrees Celsius, we cannot say that 10 degrees Celsius is twice as warm as 5 degrees Celsius.
- Ratio-scaled attribute: There is an inherent zero point, so one value can be described as a multiple of another. The mean, median, and mode can all be computed. Examples include count attributes such as years of work experience and number of articles.

Basic statistical descriptions of data

To grasp the overall picture of the data, we focus on measures of central tendency, measures of dispersion, and graphical displays.

Measures of central tendency

A measure of central tendency locates the middle or center of a data distribution. In other words, given an attribute, where do most of its values fall?

Mean

The most common and most effective measure of the center is the arithmetic mean:
\[\overline{x} = \frac{\sum_{i=1}^n x_i}{n} \]
Alternatively, a weighted average attaches a weight \(w_i\) to each value \(x_i\) to reflect its significance, importance, or frequency of occurrence:
\[\overline{x} = \frac{\sum_{i=1}^n w_ix_i}{\sum_{i=1}^n w_i} \]
The mean is sensitive to extreme values, however, so for skewed (asymmetric) data the median is a better measure of the center.
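The two formulas above, and the sensitivity of the mean to extreme values, can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the helper name `weighted_mean` and the salary figures are assumptions made up for the example.

```python
from statistics import mean, median


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight


# Hypothetical salaries in thousands; the last value is an extreme one.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(mean(salaries))    # 58, pulled upward by the value 110
print(median(salaries))  # 54.0, less affected by the extreme value

# A score of 90 that counts three times as much as a score of 80.
print(weighted_mean([80, 90], [1, 3]))  # 87.5
```

Dropping the value 110 from the list moves the mean noticeably while the median barely changes, which is the reason the median is preferred for skewed data.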

Median

The median is the middle value of the ordered data; it separates the higher half of the data from the lower half.
The median is expensive to compute when the number of observations is large, but it can be approximated. Assume that the data is grouped into intervals according to the values and that the frequency (number of data values) of each interval is known. The interval that contains the median is called the median interval, and the median is approximated by:
\[median \approx l_1 + \left( \frac{n/2 - \sum freq}{freq_{median}} \right) width \]
where \(l_1\) is the lower bound of the median interval, \(n\) is the number of values in the entire data set, \(\sum freq\) is the sum of the frequencies of all intervals below the median interval, \(freq_{median}\) is the frequency of the median interval, and \(width\) is the width of the median interval.
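The grouped-data approximation described above interpolates linearly inside the median interval. The sketch below is one possible reading of it; the function name `approximate_median` and the `(lower, upper, freq)` tuple layout are assumptions chosen for illustration.

```python
def approximate_median(intervals):
    """Approximate the median from grouped data.

    intervals: list of (lower, upper, freq) tuples, sorted by lower bound.
    """
    n = sum(freq for _, _, freq in intervals)
    cumulative = 0  # sum of frequencies of intervals below the current one
    for lower, upper, freq in intervals:
        # The median interval is the first one whose cumulative count reaches n/2.
        if freq > 0 and cumulative + freq >= n / 2:
            width = upper - lower
            return lower + (n / 2 - cumulative) / freq * width
        cumulative += freq
    raise ValueError("intervals contain no data")


# Hypothetical grouped data: 2 values in [0, 10), 5 in [10, 20), 3 in [20, 30).
groups = [(0, 10, 2), (10, 20, 5), (20, 30, 3)]
print(approximate_median(groups))  # 10 + (5 - 2) / 5 * 10 = 16.0
```

Only one pass over the interval counts is needed, so the cost depends on the number of intervals rather than on the number of observations.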

Mode

The mode is the value that occurs most frequently in the data. A data set with one, two, or three modes is called unimodal, bimodal, or trimodal, respectively.
When the data is symmetric and unimodal, mode = median = mean.
When the frequency distribution is skewed to the right (positively skewed), the mean is pulled toward the high values and typically lies to the right of the mode, with the median between the two: mode < median < mean.
Conversely, when the distribution is skewed to the left (negatively skewed), the mean is pulled toward the low values and typically lies to the left of the mode, with the median still between the two: mean < median < mode.
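As a quick check of the ordering for right-skewed data, the sketch below computes all three measures with Python's standard `statistics` module (`multimode` requires Python 3.8 or later). The sample values are hypothetical and were picked only to have a long right tail.

```python
from statistics import mean, median, multimode

# Hypothetical right-skewed sample: most values are small, a few are large.
data = [1, 2, 2, 2, 3, 3, 4, 5, 9, 14]

modes = multimode(data)  # list of all most-frequent values
print(modes)             # [2]  -> unimodal
print(median(data))      # 3.0
print(mean(data))        # 4.5

# For this sample the ordering is mode < median < mean.
print(modes[0] < median(data) < mean(data))  # True

# A sample with two equally frequent values is bimodal.
print(multimode([1, 1, 2, 3, 3]))  # [1, 3]
```

The ordering is a tendency of skewed unimodal data rather than a guarantee, so it is worth checking on the actual data instead of assuming it.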

Midrange

The average of the maximum and minimum values.

Dispersion of data

Range

The difference between maximum and minimum values.

Quartiles

The quartiles are three points that divide the sorted data into four consecutive sets of essentially equal size.
\(Q_1\): 25% of the data lies below this value;
\(Q_2\): 50% of the data lies below this value, that is, the median;
\(Q_3\): 75% of the data lies below this value.
The interquartile range (IQR) gives the range covered by the middle half of the data.
\[IQR = Q_3-Q_1 \]
For skewed distributions, no single numeric measure of spread, such as the IQR, is very useful on its own. A common rule for identifying suspected outliers is to single out values that fall at least \(1.5 \times IQR\) above \(Q_3\) or below \(Q_1\).
Five-number summary: minimum, \(Q_1\), median, \(Q_3\), maximum. A boxplot visualizes the five-number summary.
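The quartiles, the five-number summary, and the 1.5 × IQR outlier rule can be sketched with `statistics.quantiles` from the standard library (Python 3.8 or later). The helper names and the sample data are assumptions for illustration, and the `"inclusive"` method is just one of several quartile conventions; other conventions give slightly different quartile values.

```python
from statistics import quantiles


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return min(data), q1, q2, q3, max(data)


def iqr_outliers(data, k=1.5):
    """Values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]


# Hypothetical salaries in thousands.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(five_number_summary(salaries))  # (30, 49.25, 54.0, 64.75, 110)
print(iqr_outliers(salaries))         # [110]
```

Here the IQR is 15.5, so the fences sit at 26 and 88, and only the value 110 is flagged.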

Variance and standard deviation

\[\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i-\overline{x})^2 \]
\(\sigma^2\) is the variance, and its square root \(\sigma\) is the standard deviation.
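The formula divides by n, which is the population variance. The sketch below writes it out directly and compares it with `pvariance` and `pstdev` from the standard library; the function name `population_variance` and the data are assumptions for illustration.

```python
from math import sqrt
from statistics import pstdev, pvariance


def population_variance(data):
    """Mean of the squared deviations from the mean (divides by n)."""
    n = len(data)
    center = sum(data) / n
    return sum((x - center) ** 2 for x in data) / n


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations, mean = 5

print(population_variance(data))        # 4.0
print(sqrt(population_variance(data)))  # 2.0
print(pvariance(data), pstdev(data))    # library equivalents
```

Note that `statistics.variance` and `statistics.stdev` divide by n - 1 instead (the sample variance), so they return slightly larger values on the same data.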

Graphical displays

Quantile plot

Let \(x_i\), for \(i = 1, \dots, n\), be the data sorted in ascending order, so that \(x_1\) is the smallest observation and \(x_n\) is the largest. Each observation \(x_i\) is paired with a percentage \(f_i\), which indicates that approximately \(f_i \times 100\%\) of the data is less than the value \(x_i\).
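The points of a quantile plot can be computed as sketched below. The text does not say how \(f_i\) is obtained, so this example assumes the common convention \(f_i = (i - 0.5)/n\); both that formula and the function name `quantile_plot_points` are assumptions, not something stated in the article.

```python
def quantile_plot_points(data):
    """Return (f_i, x_i) pairs, assuming f_i = (i - 0.5) / n."""
    ordered = sorted(data)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


# Hypothetical unit prices.
for f, x in quantile_plot_points([30, 10, 20, 40]):
    print(f"{f:.3f}  {x}")
```

Plotting \(x_i\) against \(f_i\) then shows the whole distribution at once, including unusual values at either end.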

Quantile-quantile plot

A quantile-quantile (q-q) plot graphs the quantiles of one distribution against the corresponding quantiles of another, which lets the user see whether there is a shift from one distribution to the other.
For example, consider a q-q plot of the unit prices of goods sold by two departments.

At \(Q_2\), for example, 50% of the goods sold by Department 1 cost 78 U.S. dollars or less, while 50% of the goods sold by Department 2 cost 85 U.S. dollars or less. The 45-degree solid line through the middle represents no shift. Overall, the goods sold by Department 1 tend to be priced lower than those sold by Department 2.

Histogram

If the attribute is nominal, the chart is usually called a bar chart; if the attribute is numeric, the term histogram is preferred.

Scatter plot

A scatter plot is a useful way to examine bivariate data. It shows whether two variables are positively correlated, negatively correlated, or uncorrelated.

Measuring data similarity and dissimilarity

Similarity and dissimilarity are collectively referred to as proximity; they assess how alike or unalike objects are compared with one another.

Data matrix and dissimilarity matrix

Data matrix

n objects are described by p attributes.
This is an object-by-attribute structure: rows represent objects and columns represent attributes. Because rows and columns represent different kinds of entities, the data matrix is often called a two-mode matrix.
\[\left[\begin{matrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{matrix} \right]\]

Dissimilarity Matrix

Stores the proximities of all pairs of n objects.
This is an object-by-object structure; because it contains only one kind of entity, it is called a one-mode matrix.
\[\left[\begin{matrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{matrix} \right]\]
Here \(d(i,j)\) is the measured dissimilarity between object i and object j. The matrix is symmetric, \(d(i,j) = d(j,i)\), so only the lower triangle is shown, and \(d(i,i) = 0\): the difference between an object and itself is 0.
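A dissimilarity matrix can be built from the rows of a data matrix once a dissimilarity function is chosen. The sketch below stores only the lower triangle, relying on the symmetry noted above; the function name `dissimilarity_matrix` and the one-dimensional example are assumptions for illustration.

```python
def dissimilarity_matrix(objects, d):
    """Lower-triangular matrix whose row i holds d(i, 0), ..., d(i, i)."""
    n = len(objects)
    return [[d(objects[i], objects[j]) for j in range(i + 1)] for i in range(n)]


# Hypothetical objects described by a single numeric attribute,
# compared by absolute difference.
values = [1, 4, 6]
matrix = dissimilarity_matrix(values, lambda a, b: abs(a - b))

for row in matrix:
    print(row)
# [0]
# [3, 0]
# [5, 2, 0]
```

Any of the measures discussed for specific attribute types can be passed in as `d`, as long as it returns 0 for identical objects.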

Here we discuss the proximity metrics for different types of data.

Proximity measures for nominal attributes

A nominal attribute can take several states, such as red, yellow, and so on, which can be represented by letters, symbols, or a set of integers.
The dissimilarity between two objects can be computed from the ratio of mismatches.
\[d(i,j) = \frac{p-m}{p} \]
where m is the number of matches, that is, the number of attributes for which i and j are in the same state, and p is the total number of attributes.
The similarity can be computed as:
\[sim(i,j) = 1-d(i,j) = \frac{m}{p} \]
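The mismatch ratio translates directly into code. In this minimal sketch each object is a tuple of nominal values; the function names and the example objects are assumptions for illustration.

```python
def nominal_dissimilarity(a, b):
    """d(i, j) = (p - m) / p for two objects with p nominal attributes."""
    if len(a) != len(b) or not a:
        raise ValueError("objects must have the same, non-zero number of attributes")
    p = len(a)
    m = sum(1 for x, y in zip(a, b) if x == y)  # number of matching attributes
    return (p - m) / p


def nominal_similarity(a, b):
    """sim(i, j) = 1 - d(i, j) = m / p."""
    return 1 - nominal_dissimilarity(a, b)


# Hypothetical objects described by (color, occupation, marital status).
first = ("red", "teacher", "single")
second = ("red", "doctor", "single")

print(nominal_dissimilarity(first, second))  # one mismatch out of three
print(nominal_dissimilarity(first, first))   # 0.0
```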

A nominal attribute can also be encoded with asymmetric binary attributes by creating one binary attribute for each state. For a color attribute, for example, an object that is yellow has its yellow attribute set to 1 and all the other color attributes set to 0.

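Proximity measures for binary attributes

The matches and mismatches between two objects i and j can be summarized in a contingency table, with the value for object i in the rows and the value for object j in the columns:

i \ j	1	0
1	q	r
0	s	t

In the table above, q is the number of attributes that equal 1 for both objects i and j, r is the number that equal 1 for i but 0 for j, s is the number that equal 0 for i but 1 for j, and t is the number that equal 0 for both.
For symmetric binary attributes:
\[d(i,j) = \frac{r+s}{q+r+s+t} \]

For asymmetric binary attributes, a match on 1 (a positive match) is considered more significant than a match on 0 (a negative match), so the number of negative matches t is usually ignored.
\[d(i,j) = \frac{r+s}{q+r+s} \]
The corresponding asymmetric binary similarity is called the Jaccard coefficient and is widely cited in the literature.
\[sim(i,j) = \frac{q}{q+r+s} = 1-d(i,j) \]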
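The three binary measures above all follow from the four contingency counts. The sketch below is a minimal illustration; the function names, the test-result vectors, and the choice to return 0.0 when two objects share no 1s at all are assumptions.

```python
def contingency(a, b):
    """Return the counts (q, r, s, t) for two equal-length 0/1 vectors."""
    q = r = s = t = 0
    for x, y in zip(a, b):
        if x == 1 and y == 1:
            q += 1
        elif x == 1 and y == 0:
            r += 1
        elif x == 0 and y == 1:
            s += 1
        else:
            t += 1
    return q, r, s, t


def symmetric_binary_dissimilarity(a, b):
    q, r, s, t = contingency(a, b)
    return (r + s) / (q + r + s + t)


def asymmetric_binary_dissimilarity(a, b):
    q, r, s, _ = contingency(a, b)  # negative matches t are ignored
    # Assumption: two objects with no 1s at all are treated as identical.
    return (r + s) / (q + r + s) if (q + r + s) else 0.0


def jaccard(a, b):
    return 1 - asymmetric_binary_dissimilarity(a, b)


# Hypothetical test results for two patients (1 = positive).
patient_a = (1, 0, 1, 0, 0, 0)
patient_b = (1, 0, 1, 0, 1, 0)

print(contingency(patient_a, patient_b))                      # (2, 0, 1, 3)
print(symmetric_binary_dissimilarity(patient_a, patient_b))   # 1/6
print(asymmetric_binary_dissimilarity(patient_a, patient_b))  # 1/3
```

Ignoring the three shared negatives doubles the dissimilarity in this example, which shows how much the choice between the symmetric and asymmetric measure matters when 1s are rare.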

Dissimilarity of numeric attributes: Minkowski distance

\[d(i,j) = \sqrt[h]{|x_{i1}-x_{j1}|^h + \cdots + |x_{ip}-x_{jp}|^h}\]
where \(h \geq 1\) is a real number.
When \(h=1\), it is the Manhattan distance: \(d(i,j) = |x_{i1}-x_{j1}| + \cdots + |x_{ip}-x_{jp}|\)
When \(h=2\), it is the Euclidean distance: \(d(i,j) = \sqrt{|x_{i1}-x_{j1}|^2 + \cdots + |x_{ip}-x_{jp}|^2}\)
When h approaches infinity, it is the supremum distance (also called the Chebyshev distance), that is, the largest difference between the two objects over all attributes: \(d(i,j) = \max_{f=1}^{p}|x_{if}-x_{jf}|\)
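The special cases of the Minkowski distance can be compared side by side with a short sketch. The function names and the two sample points are assumptions for illustration.

```python
def minkowski(x, y, h):
    """Minkowski distance of order h (h >= 1) between two numeric vectors."""
    if h < 1:
        raise ValueError("h must be at least 1")
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)


def supremum(x, y):
    """Limit of the Minkowski distance as h grows: the largest attribute difference."""
    return max(abs(a - b) for a, b in zip(x, y))


# Hypothetical objects with two numeric attributes.
x1 = (1, 2)
x2 = (3, 5)

print(minkowski(x1, x2, 1))  # Manhattan: 2 + 3 = 5.0
print(minkowski(x1, x2, 2))  # Euclidean: sqrt(4 + 9)
print(supremum(x1, x2))      # 3
```

As h increases, the result moves from the Manhattan value toward the supremum value, because the largest attribute difference dominates the sum more and more.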

Proximity measures for ordinal attributes

Suppose \(f\) is an ordinal attribute with \(m_f\) ordered states, which define the ranking \(1, \dots, m_f\), and the value of \(f\) for object i is \(x_{if}\). Replace each \(x_{if}\) with its corresponding rank \(r_{if} \in \{1,\dots,m_f\}\).
Because each ordinal attribute can have a different number of states, it is common to map the range of each attribute onto \([0.0, 1.0]\) so that every attribute has equal weight. This normalization replaces \(r_{if}\) with \(z_{if}\):
\[z_{if} = \frac{r_{if}-1}{m_{f}-1} \]
The dissimilarity can then be computed with any of the distance measures for numeric attributes, using \(z_{if}\) as the value of \(f\) for object i.
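The rank-and-normalize step can be sketched as follows. The function name `normalize_ordinal` and the three-state rating scale are assumptions for illustration.

```python
def normalize_ordinal(value, ordered_states):
    """Map an ordinal value to z = (r - 1) / (m - 1) in [0.0, 1.0]."""
    m = len(ordered_states)
    if m < 2:
        raise ValueError("an ordinal attribute needs at least two states")
    rank = ordered_states.index(value) + 1  # ranks run from 1 to m
    return (rank - 1) / (m - 1)


# Hypothetical rating scale, listed from lowest to highest.
states = ["fair", "good", "excellent"]

z = [normalize_ordinal(v, states) for v in ["fair", "good", "excellent"]]
print(z)  # [0.0, 0.5, 1.0]

# Once normalized, any numeric distance applies, for example absolute difference.
print(abs(normalize_ordinal("fair", states) - normalize_ordinal("excellent", states)))  # 1.0
```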

Dissimilarity of attributes of mixed types

An object may be described by attributes of several different types: nominal, symmetric binary, asymmetric binary, numeric, or ordinal. Suppose the data set contains \(p\) attributes of mixed types:
\[d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{[f]} d_{ij}^{[f]}}{\sum_{f=1}^{p} \delta_{ij}^{[f]}}\]
where \(\delta_{ij}^{[f]}=0\) if object i or object j has no measurement of attribute f (that is, \(x_{if}\) or \(x_{jf}\) is missing), or if \(x_{if} = x_{jf} = 0\) and f is an asymmetric binary attribute; otherwise \(\delta_{ij}^{[f]}=1\).
The contribution \(d_{ij}^{[f]}\) of attribute f to the dissimilarity depends on its type:

If f is numeric: \(d_{ij}^{[f]} = \frac{|x_{if}-x_{jf}|}{\max_{h} x_{hf}-\min_{h} x_{hf}}\), where h runs over all objects with a non-missing value of attribute f.
If f is nominal or binary: \(d_{ij}^{[f]}=0\) if \(x_{if} = x_{jf}\); otherwise \(d_{ij}^{[f]}=1\).
If f is ordinal: compute the rank \(r_{if}\) and \(z_{if} = \frac{r_{if}-1}{m_f-1}\), and then treat \(z_{if}\) as a numeric attribute.
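The combined measure for mixed attribute types can be sketched as a single loop over the attributes. This is one possible reading of the rules above, not a reference implementation: the function name, the `kinds` labels, the use of `None` for missing values, and the precomputed `ranges` argument are all assumptions. Ordinal attributes are expected to be converted to \(z_{if}\) beforehand and passed as numeric attributes with range 1.

```python
def mixed_dissimilarity(x, y, kinds, ranges):
    """Dissimilarity between two objects with attributes of mixed types.

    kinds[f]  : "numeric", "nominal", "symmetric_binary" or "asymmetric_binary"
    ranges[f] : max - min of attribute f over all objects (numeric attributes only)
    None marks a missing value.
    """
    numerator = 0.0
    denominator = 0
    for f, kind in enumerate(kinds):
        a, b = x[f], y[f]
        if a is None or b is None:
            continue  # delta = 0: attribute not measured for one of the objects
        if kind == "asymmetric_binary" and a == 0 and b == 0:
            continue  # delta = 0: negative match on an asymmetric binary attribute
        if kind == "numeric":
            d = abs(a - b) / ranges[f] if ranges[f] else 0.0
        else:
            d = 0.0 if a == b else 1.0  # nominal or binary
        numerator += d
        denominator += 1
    if denominator == 0:
        raise ValueError("no comparable attributes")
    return numerator / denominator


# Hypothetical objects: (color, price, test result).
kinds = ["nominal", "numeric", "asymmetric_binary"]
ranges = [None, 40.0, None]
obj_a = ("red", 30.0, 0)
obj_b = ("blue", 50.0, 0)

print(mixed_dissimilarity(obj_a, obj_b, kinds, ranges))  # (1 + 0.5) / 2 = 0.75
```

The shared negative test result is skipped, so only the color mismatch and the scaled price difference contribute to the average.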

Cosine similarity

To compare documents, each document is represented by a so-called term-frequency vector. Such vectors are usually very long and sparse, and traditional distance measures do not work well on them.
\[sim(x,y) = \frac{x \cdot y}{||x|| \, ||y||} \]
where \(||x||\) is the Euclidean norm of the vector \(x\).
When the attributes are binary valued, \(x \cdot y\) is the number of attributes shared by \(x\) and \(y\), and \(||x|| \, ||y||\) is the geometric mean of the number of attributes possessed by \(x\) and the number possessed by \(y\), so \(sim(x,y)\) is a measure of the relative possession of common attributes. A simple variation of cosine similarity for this case is:
\[sim(x,y) = \frac{x \cdot y}{x \cdot x + y \cdot y-x \cdot y}\]
This is known as the Tanimoto coefficient or Tanimoto distance, and it is frequently used in information retrieval and biological taxonomy.
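Both measures need only dot products, as the following sketch shows. The function names and the term-frequency vectors are assumptions for illustration.

```python
from math import sqrt


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def cosine_similarity(x, y):
    """x . y / (||x|| ||y||); undefined if either vector is all zeros."""
    return dot(x, y) / (sqrt(dot(x, x)) * sqrt(dot(y, y)))


def tanimoto(x, y):
    """x . y / (x . x + y . y - x . y)."""
    return dot(x, y) / (dot(x, x) + dot(y, y) - dot(x, y))


# Hypothetical term-frequency vectors for two documents.
doc1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
doc2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine_similarity(doc1, doc2), 2))  # 0.94

# For binary vectors the Tanimoto coefficient is shared / (total distinct) attributes.
bits1 = (1, 1, 0, 1)
bits2 = (1, 0, 0, 1)
print(tanimoto(bits1, bits2))  # 2 shared out of 3 distinct attributes
```

For binary vectors the Tanimoto value equals q / (q + r + s), so it coincides with the Jaccard coefficient defined for asymmetric binary attributes.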


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
