Data Mining: Concepts and Techniques reading notes (II): Understanding data

Last Update:2016-01-24 Source: Internet

Author: User

2.1 Data objects and attribute types

2.1.1 What is an attribute

2.1.2 Nominal attribute: Its values are symbols or names of things. Each value represents a category, code, or state, so nominal attributes are also called categorical.

A nominal attribute is not quantitative: its mean and median are meaningless, but its mode (the most frequent value) is meaningful and serves as a measure of central tendency.
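
Base R has no built-in function for the most frequent value of a nominal attribute (its mode), so a minimal sketch is to count the categories with table(). The hair-colour data and the variable names are hypothetical, chosen only for illustration.

```r
# Hypothetical nominal attribute: hair colour of six people
hair <- c("black", "brown", "black", "blond", "black", "brown")

counts <- table(hair)                            # frequency of each category
modes  <- names(counts)[counts == max(counts)]   # every category tied for the highest count
modes
```

Comparing against max(counts) keeps all tied categories, which matters when the data have more than one most frequent value.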

2.1.3 Binary attribute: A nominal attribute with only two categories or states, 0 or 1; also known as a Boolean attribute.

A binary attribute is symmetric if both states are equally valuable, so there is no preference for which outcome is coded 0 or 1.

A binary attribute is asymmetric if its two states are not equally important, such as a positive or negative test result. By convention, the more important outcome is coded 1 and the other 0.

2.1.4 Ordinal attribute: Its possible values have a meaningful order or ranking, but the magnitude of the difference between successive values is unknown. Examples: large, medium, small; excellent, good, fair, pass; very dissatisfied, somewhat dissatisfied, neutral, satisfied, very satisfied.

The central tendency of an ordinal attribute can be described by its mode and median, but its mean is undefined.
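
As a rough illustration, an ordinal attribute can be stored in R as an ordered factor, and its median can be read off the integer rank codes. The survey answers below are hypothetical, and an odd number of answers is assumed so that the median is an actual level.

```r
# Hypothetical survey answers on a five-level ordered scale
scale_levels <- c("very dissatisfied", "somewhat dissatisfied", "neutral",
                  "satisfied", "very satisfied")
answers <- factor(c("neutral", "satisfied", "very satisfied",
                    "satisfied", "somewhat dissatisfied"),
                  levels = scale_levels, ordered = TRUE)

codes <- as.integer(answers)                 # rank of each answer: 3 4 5 4 2
median_level <- scale_levels[median(codes)]  # level at the middle rank
median_level
```

The codes only record order, so averaging them would treat the gaps between levels as equal, which is exactly what an ordinal scale does not guarantee.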

2.1.5 Numeric attributes: Can be interval-scaled or ratio-scaled

1. Interval-scaled attribute: Measured on a scale of equal-size units. Its values are ordered and can be positive, zero, or negative. The mean, median, and mode can all be computed.

2. Ratio-scaled attribute: A numeric attribute with an inherent zero point. Differences and ratios between values are meaningful, and the mean, median, and mode can be computed.

2.1.6 Discrete and continuous attributes

2.2 Basic statistical descriptions of data

2.2.1 Measuring central tendency: mean, median, and mode

Mean: Sensitive to extreme values (outliers).

Weighted arithmetic mean or weighted average: Each value x_i is given a weight w_i, and the result is (w_1*x_1 + ... + w_N*x_N) / (w_1 + ... + w_N).

Trimmed mean: The mean computed after dropping the values at the high and low extremes.

Median: The middle value of an ordered set of data values.

Mode: The value that occurs most frequently in the data set.

Midrange: The average of the maximum and minimum values.

Positive skew: The mode is less than the median.

Negative skew: The mode is greater than the median.
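
The measures listed above map onto a few base R calls. This is a small sketch on hypothetical values (one of them extreme); the scores and weights passed to weighted.mean() are also made up for illustration.

```r
# Hypothetical values; 110 is an extreme value
x <- c(30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110)

avg      <- mean(x)                # arithmetic mean, pulled up by 110
trimmed  <- mean(x, trim = 0.1)    # drops floor(0.1 * 12) = 1 value at each end
med      <- median(x)              # average of the two middle values (even n)
midrange <- (min(x) + max(x)) / 2  # average of the minimum and maximum

# Weighted mean of two hypothetical scores with weights 1 and 3
wavg <- weighted.mean(c(80, 90), w = c(1, 3))
```

Here the mean (58) sits above the median (54) because of the single large value, while the trimmed mean (55.6) is much closer to the median.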

2.2.2 Measuring the dispersion of data: range, quartiles, variance, standard deviation, and interquartile range

1. Range, quartiles, and interquartile range

Range: The difference between the maximum and minimum values.

Quantiles: Points that divide the data distribution into consecutive sets of essentially equal size.

Quartiles: The three points that divide the distribution into 4 equal parts.

Percentiles: Divide the distribution into 100 consecutive sets of equal size.

First quartile: Q1, the 25th percentile.

Third quartile: Q3, the 75th percentile.

Interquartile range: IQR = Q3 - Q1
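
A short sketch of these measures in base R, on hypothetical values. Note that quantile() interpolates between data points (its default is type 7), so the quartiles it reports can differ slightly from a hand calculation that uses another convention.

```r
# Hypothetical values, already sorted for readability
x <- c(30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110)

rng <- max(x) - min(x)             # range
q1  <- unname(quantile(x, 0.25))   # first quartile (25th percentile)
q3  <- unname(quantile(x, 0.75))   # third quartile (75th percentile)
iqr <- q3 - q1                     # interquartile range, same as IQR(x)
```

For these values the calls give Q1 = 49.25, Q3 = 64.75, and an IQR of 15.5.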

2. Five-number summary, boxplots, and outliers

Common rule for identifying suspected outliers: single out values that fall at least 1.5*IQR above the third quartile or below the first quartile.

Five-number summary: minimum, Q1, median, Q3, and maximum.
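
The five-number summary and the 1.5*IQR rule can be sketched as follows, again on hypothetical values. One caveat: fivenum() reports Tukey's hinges, which may differ a little from the quartiles returned by quantile(); the fences below are computed from quantile().

```r
# Hypothetical values; 110 is a suspected outlier
x <- c(30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110)

five <- fivenum(x)   # minimum, lower hinge, median, upper hinge, maximum

q1  <- unname(quantile(x, 0.25))
q3  <- unname(quantile(x, 0.75))
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr      # values below this are suspected outliers
upper_fence <- q3 + 1.5 * iqr      # values above this are suspected outliers
outliers <- x[x < lower_fence | x > upper_fence]
outliers
```

boxplot(x) draws the same summary, with whiskers limited to 1.5 times the box length by default and points beyond them plotted individually.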

3. Variance and Standard deviation

A low standard deviation means the observations tend to lie very close to the mean, while a high standard deviation means the data are spread over a large range of values.

Standard deviation

Variance
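
The formulas are not reproduced in these notes, so the sketch below shows both common forms: the population variance, which divides by N, and the sample variance that R's var() and sd() compute, which divides by N - 1. The observations are hypothetical.

```r
# Hypothetical observations
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

pop_var  <- mean((x - mean(x))^2)  # population variance: divide by N
pop_sd   <- sqrt(pop_var)          # population standard deviation
samp_var <- var(x)                 # sample variance: divide by N - 1
samp_sd  <- sd(x)                  # sample standard deviation
```

With eight observations the two variances are 4 and 32/7, so the choice of denominator is worth stating whenever results are compared.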

2.2.3 Graphic displays of basic statistical descriptions of data

1. Quantile plot

2. Quantile-quantile plot (q-q plot)

3. Histogram:

4. Scatter plot: One of the most effective graphical methods for determining whether there is a connection, pattern, or trend between two numeric variables.

2.3 Visualization of data

2.4 Measuring data similarity and dissimilarity

2.4.1 Data matrix and dissimilarity matrix

2.4.2 Proximity measures for nominal attributes

Mismatch ratio: d(i,j) = (p-m)/p, where p is the total number of attributes describing the objects and m is the number of attributes on which i and j match.

Similarity: sim(i,j) = 1 - d(i,j) = m/p
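
A minimal sketch of the matching-based measure for two objects described by nominal attributes. The objects, attribute names, and values are hypothetical.

```r
# Two hypothetical objects described by p = 4 nominal attributes
obj_i <- c(colour = "red", shape = "round",  size = "small", brand = "A")
obj_j <- c(colour = "red", shape = "square", size = "small", brand = "B")

p <- length(obj_i)         # total number of attributes
m <- sum(obj_i == obj_j)   # number of attributes on which the objects match
d   <- (p - m) / p         # dissimilarity
sim <- 1 - d               # similarity, equal to m / p
```

The two objects agree on colour and size only, so m = 2 and both d and sim come out as 0.5.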

2.4.3 Proximity measures for binary attributes

r: the number of attributes that equal 1 for i and 0 for j

s: the number of attributes that equal 0 for i and 1 for j

q: the number of attributes that equal 1 for both i and j

t: the number of attributes that equal 0 for both i and j

Symmetric binary dissimilarity: d(i,j) = (r+s)/(q+r+s+t)

Asymmetric binary dissimilarity: d(i,j) = (r+s)/(q+r+s)

Asymmetric binary similarity: sim(i,j) = q/(q+r+s) = 1 - d(i,j), also known as the Jaccard coefficient
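
The four counts and the three measures can be computed directly from two 0/1 vectors. The vectors below are hypothetical, and the counts are stored under the names n_q, n_r, n_s, n_t (an arbitrary choice that avoids masking R's built-in q() and t()).

```r
# Two hypothetical objects described by six binary attributes
obj_i <- c(1, 0, 1, 0, 0, 0)
obj_j <- c(1, 1, 0, 0, 0, 0)

n_q <- sum(obj_i == 1 & obj_j == 1)   # 1 for both
n_r <- sum(obj_i == 1 & obj_j == 0)   # 1 for i, 0 for j
n_s <- sum(obj_i == 0 & obj_j == 1)   # 0 for i, 1 for j
n_t <- sum(obj_i == 0 & obj_j == 0)   # 0 for both

d_sym   <- (n_r + n_s) / (n_q + n_r + n_s + n_t)  # symmetric dissimilarity
d_asym  <- (n_r + n_s) / (n_q + n_r + n_s)        # asymmetric dissimilarity
jaccard <- n_q / (n_q + n_r + n_s)                # asymmetric similarity
```

Dropping the 0/0 matches raises the dissimilarity from 1/3 to 2/3 here, which is the point of the asymmetric form: shared absences are not treated as evidence of similarity.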

2.4.4 Dissimilarity of numeric data: Minkowski distance

Euclidean distance:

Weighted Euclidean distance:

Manhattan Distance:

They have the following mathematical properties:

Non-negativity: d(i,j) >= 0; distance is never negative.

Identity: d(i,i) = 0; the distance from an object to itself is 0.

Symmetry: d(i,j) = d(j,i); distance is a symmetric function.

Triangle inequality: d(i,j) <= d(i,k) + d(k,j); going directly from object i to object j is never longer than making a detour through any other object k.

Minkowski Distance:
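
The distance formulas are not reproduced in these notes, so the following is a sketch under the usual definition: the Minkowski distance of order h is the h-th root of the sum of the absolute coordinate differences raised to the power h, with h = 1 giving the Manhattan distance and h = 2 the Euclidean distance. The helper name minkowski, the symbol h, the points, and the weights are all assumptions made for illustration.

```r
# Hypothetical helper: Minkowski distance of order h between two numeric vectors
minkowski <- function(a, b, h) sum(abs(a - b)^h)^(1 / h)

a <- c(1, 2)
b <- c(3, 5)
k <- c(2, 2)   # a third point, used to check the triangle inequality

manhattan <- minkowski(a, b, 1)   # |1-3| + |2-5|
euclid    <- minkowski(a, b, 2)   # sqrt(2^2 + 3^2)

# Weighted Euclidean distance with hypothetical attribute weights
w <- c(1, 0.5)
weighted_euclid <- sqrt(sum(w * (a - b)^2))

# Cross-check against base R's dist(), which takes the order as argument p
d_base <- as.numeric(dist(rbind(a, b), method = "minkowski", p = 2))

# Triangle inequality: the direct distance is no longer than the detour via k
tri_ok <- euclid <= minkowski(a, k, 2) + minkowski(k, b, 2)
```

For whole data sets, dist() computes all pairwise distances at once and also accepts method = "manhattan" and method = "euclidean".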

2.4.5 Proximity measures for ordinal attributes

2.4.6 Dissimilarity for attributes of mixed types

2.4.7 Cosine similarity

Supremum distance (Chebyshev distance)

Exercises: R version

2.2 Suppose the data for analysis include the attribute age, whose values in the data tuples are 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.

a) Mean? Median?

b) Mode?

c) Midrange?

d) Q1, Q3?

e) Five-number summary?

f) Boxplot?

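    # Age values from the exercise statement above
    data <- c(13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
              33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70)
    mean(data)                             # a) mean
    median(data)                           # a) median
    counts <- table(data)
    names(counts)[counts == max(counts)]   # b) mode(s); which.max() would report only the first of tied values
    (max(data) + min(data)) / 2            # c) midrange
    quantile(data, 0.25)                   # d) Q1
    quantile(data, 0.75)                   # d) Q3
    fivenum(data)                          # e) five-number summary
    boxplot(data)                          # f) boxplot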

2.3

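    # Frequencies of the age intervals 1-5, 6-15, 16-20, 21-50, 51-80, 81-110.
    # The numbers were garbled in the source; the values of textbook exercise 2.3 are assumed here.
    data <- c(200, 450, 300, 1500, 700, 44)
    half <- sum(data) / 2      # N/2, the position of the median
    cum <- 0                   # cumulative frequency of the intervals below the median interval
    for (i in 1:(length(data) - 1)) {
      cum <- cum + data[i]
      if (cum < half && cum + data[i + 1] > half) break
    }
    # After the loop, i + 1 is the index of the median interval, i.e. 20~50
    # (lower boundary 20 and width 30 are assumed from that interval)
    20 + ((half - cum) / data[i + 1]) * 30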

2.4

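    # The age values were garbled in the source; the values of textbook exercise 2.4 are assumed here.
    age <- c(23, 23, 27, 27, 39, 41, 47, 49, 50, 52, 54, 54, 56, 57, 58, 58, 60, 61)
    fat <- c(9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2, 34.6,
             42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7)
    mean(age); median(age); sd(age)
    mean(fat); median(fat); sd(fat)
    boxplot(age)        # boxplot of age
    boxplot(fat)        # boxplot of %fat
    plot(age, fat)      # scatter plot
    qqplot(age, fat)    # q-q plot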

2.6

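    # The two vectors were partly garbled in the source; the values of textbook exercise 2.6 are assumed here.
    v1 <- c(22, 1, 42, 10)
    v2 <- c(20, 0, 36, 8)
    sqrt(sum((v1 - v2)^2))       # Euclidean distance
    sum(abs(v1 - v2))            # Manhattan distance
    sum(abs(v1 - v2)^3)^(1/3)    # Minkowski distance with exponent 3
    max(abs(v1 - v2))            # supremum distance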

2.8

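    A1 <- c(1.5, 2, 1.6, 1.2, 1.5)
    A2 <- c(1.7, 1.9, 1.8, 1.5, 1.0)
    data <- data.frame(A1, A2)
    x <- c(1.4, 1.6)             # query point
    e <- c(); m <- c(); u <- c(); co <- c()
    for (i in 1:nrow(data)) {
      obj <- as.numeric(data[i, ])                 # i-th data point as a plain vector
      e <- c(e, sqrt(sum((x - obj)^2)))            # Euclidean distance
      m <- c(m, sum(abs(x - obj)))                 # Manhattan distance
      u <- c(u, max(abs(x - obj)))                 # supremum distance
      co <- c(co, sum(x * obj) / (sqrt(sum(x^2)) * sqrt(sum(obj^2))))   # cosine similarity
    }
    rank(e)
    rank(m)
    rank(u)
    rank(-co)   # cosine is a similarity, so the largest value is ranked first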

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
