2.1 Data objects and attribute types
2.1.1 What is an attribute
2.1.2 Nominal attribute: Its values are symbols or names of things. Each value represents a category, code, or state, so nominal attributes are also called categorical.
A nominal attribute is not quantitative: its mean or median is meaningless, but its mode (the most frequently occurring value) is meaningful and serves as a measure of central tendency.
2.1.3 Binary attribute: A nominal attribute with only two categories or states, 0 or 1. It is also called a Boolean attribute when the two states correspond to true and false.
A binary attribute is symmetric if both states are equally valuable, so there is no preference for which outcome is coded as 0 or 1.
A binary attribute is asymmetric if its outcomes are not equally important, such as a positive or negative test result. By convention, the more important outcome is coded as 1 and the other as 0.
2.1.4 Ordinal attribute: Its possible values have a meaningful order or ranking, but the magnitude of the difference between successive values is unknown. Examples: large, medium, small; excellent, good, medium, pass; very dissatisfied, somewhat dissatisfied, neutral, satisfied, very satisfied.
The central tendency of an ordinal attribute can be expressed by its mode and median, but the mean cannot be defined.
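A minimal sketch of why the mode and median work for ordinal data: both only need counts and ranks, never arithmetic on the values. The survey responses below are made up for illustration, and the variable names are assumptions rather than anything from the notes.

```r
# Hypothetical survey responses on an ordinal scale (illustrative data only).
levels_ordered <- c("very dissatisfied", "somewhat dissatisfied", "neutral",
                    "satisfied", "very satisfied")
responses <- factor(
  c("neutral", "satisfied", "satisfied", "very satisfied",
    "somewhat dissatisfied", "satisfied", "neutral"),
  levels = levels_ordered, ordered = TRUE
)

# Mode: the most frequent category.
ordinal_mode <- names(which.max(table(responses)))

# Median: sort the rank codes and take the middle one
# (this simple form assumes an odd number of responses).
ranks <- sort(as.integer(responses))
ordinal_median <- levels_ordered[ranks[(length(ranks) + 1) / 2]]

print(ordinal_mode)
print(ordinal_median)
```

Averaging the rank codes would silently assume equal spacing between categories, which is exactly what an ordinal scale does not guarantee.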
2.1.5 Numeric attributes: Can be interval-scaled or ratio-scaled
1. Interval-scaled attribute: Measured on a scale of equal-size units. Its values are ordered and can be positive, 0, or negative. Differences between values are meaningful, and the mean, median, and mode can all be computed.
2. Ratio-scaled attribute: A numeric attribute with an inherent zero point, so one value can be expressed as a multiple of another. Differences, the mean, the median, and the mode can all be computed.
2.1.6 Discrete attributes and continuous attributes
2.2 Basic statistical descriptions of data
2.2.1 Measuring central tendency: mean, median, and mode
Mean: Sensitive to extreme values (outliers).
Weighted arithmetic mean or weighted average: Each value x_i carries a weight w_i, and the result is sum(w_i * x_i) / sum(w_i).
Trimmed mean: The mean computed after dropping the values at the high and low extremes.
Median: The middle value of the ordered data values.
Mode: The value that occurs most frequently in the data set.
Midrange: The average of the maximum and minimum values.
Positive skew: The mode occurs at a value smaller than the median.
Negative skew: The mode occurs at a value greater than the median.
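The measures above can be sketched in a few lines of R. The age values are the ones listed in exercise 2.2 of these notes; the scores and weights used for the weighted mean are hypothetical, and the 10% trim level is an arbitrary choice for illustration.

```r
# Age values from exercise 2.2 in these notes.
age <- c(13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
         33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70)

avg      <- mean(age)
trimmed  <- mean(age, trim = 0.1)        # drops the 2 lowest and 2 highest values here
med      <- median(age)
counts   <- table(age)
modes    <- as.numeric(names(counts)[counts == max(counts)])  # all most-frequent values
midrange <- (min(age) + max(age)) / 2

# Weighted mean with hypothetical scores and weights (not from the notes).
weighted <- weighted.mean(c(80, 90), w = c(1, 3))

print(c(mean = avg, trimmed = trimmed, median = med, midrange = midrange))
print(modes)
print(weighted)
```

Collecting every value that reaches the maximum count, rather than calling which.max() alone, matters here because this data set is bimodal.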
2.2.2 Measuring the dispersion of data: range, quartiles, variance, standard deviation, and interquartile range
1. Range, quartiles, and interquartile range
Range: The difference between the maximum value and the minimum value.
Quantiles: Points that divide the data into consecutive sets of essentially equal size.
Quartiles: Divide the data into 4 parts.
Percentiles: Divide the data into 100 consecutive sets of equal size.
First quartile: Q1, the 25th percentile
Third quartile: Q3, the 75th percentile
Interquartile range: IQR = Q3 - Q1
2. Five-number summary, boxplots, and outliers
A common rule for identifying suspected outliers: single out values that fall at least 1.5*IQR above the third quartile or below the first quartile.
Five-number summary: minimum, Q1, median, Q3, and maximum.
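As a rough sketch, the 1.5*IQR rule and the five-number summary can be computed as follows, again on the age values from exercise 2.2. Note that quantile() supports several quantile definitions (its default is used here) and fivenum() uses Tukey's hinges, so results on other data may differ slightly between the two; the fence variable names are assumptions.

```r
# Age values from exercise 2.2 in these notes.
age <- c(13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
         33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70)

q1  <- quantile(age, 0.25, names = FALSE)
q3  <- quantile(age, 0.75, names = FALSE)
iqr <- q3 - q1

# Values beyond these fences are flagged as suspected outliers.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
outliers <- age[age < lower_fence | age > upper_fence]

# Minimum, Q1, median, Q3, maximum.
summary5 <- fivenum(age)

print(c(Q1 = q1, Q3 = q3, IQR = iqr))
print(outliers)
print(summary5)
```

boxplot(age) draws the same summary graphically, with suspected outliers plotted as individual points.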
3. Variance and standard deviation
A low standard deviation means that the observations tend to be very close to the mean, while a high standard deviation means that the data are spread out over a large range of values.
Standard deviation: sigma, the square root of the variance.
Variance: sigma^2 = (1/N) * sum((x_i - mean)^2), the average squared deviation from the mean.
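One practical point when computing these in R: the built-in var() and sd() divide by N - 1 (the sample versions), while the definition that divides by N has to be written out by hand. A small sketch with made-up numbers:

```r
# Hypothetical observations (illustrative data only).
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Variance and standard deviation dividing by N.
pop_var <- sum((x - mean(x))^2) / n
pop_sd  <- sqrt(pop_var)

# R's built-ins divide by N - 1 instead.
sample_var <- var(x)
sample_sd  <- sd(x)

print(c(pop_var = pop_var, pop_sd = pop_sd))
print(c(sample_var = sample_var, sample_sd = sample_sd))
```

For large N the two versions are nearly identical; the difference matters mostly for small data sets like the exercises here.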
2.2.3 Graphic displays of basic statistical descriptions of data
1. Quantile plot
2. Quantile-quantile plot (q-q plot)
3. Histogram
4. Scatter plot: One of the most effective graphical methods for determining whether there is a relationship, pattern, or trend between two numeric variables.
2.3 Data visualization
2.4 Measuring data similarity and dissimilarity
2.4.1 Data matrix and dissimilarity matrix
2.4.2 Proximity measures for nominal attributes
Dissimilarity (mismatch ratio): d(i,j) = (p - m)/p, where p is the total number of attributes describing the objects and m is the number of matches (attributes on which i and j take the same state).
Similarity: sim(i,j) = 1 - d(i,j) = m/p
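A minimal sketch of the mismatch ratio for two objects described by nominal attributes. The function name and the two example objects are hypothetical.

```r
# Mismatch ratio between two objects with nominal attributes.
# Hypothetical helper, not from the notes.
nominal_dissimilarity <- function(a, b) {
  stopifnot(length(a) == length(b))
  p <- length(a)      # total number of attributes
  m <- sum(a == b)    # number of matching attributes
  (p - m) / p
}

# Two made-up objects described by three nominal attributes.
obj_i <- c(color = "red", shape = "round",  size = "small")
obj_j <- c(color = "red", shape = "square", size = "small")

d_nominal   <- nominal_dissimilarity(obj_i, obj_j)
sim_nominal <- 1 - d_nominal

print(d_nominal)
print(sim_nominal)
```

Here two of the three attributes match, so the dissimilarity is 1/3 and the similarity is 2/3.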
2.4.3 Proximity measures for binary attributes
r: the number of attributes that equal 1 for object i and 0 for object j
s: the number of attributes that equal 0 for object i and 1 for object j
q: the number of attributes that equal 1 for both i and j
t: the number of attributes that equal 0 for both i and j
Symmetric binary dissimilarity: d(i,j) = (r+s)/(q+r+s+t)
Asymmetric binary dissimilarity: d(i,j) = (r+s)/(q+r+s)
Asymmetric binary similarity: sim(i,j) = q/(q+r+s) = 1 - d(i,j), also known as the Jaccard coefficient
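The four counts and the three measures can be sketched directly from two 0/1 vectors. The vectors are made up for illustration; the count of 0/0 matches is stored as n_t instead of t only to avoid shadowing R's built-in t() function.

```r
# Two hypothetical objects described by six binary attributes.
i <- c(1, 0, 1, 0, 0, 0)
j <- c(1, 1, 0, 0, 0, 0)

q   <- sum(i == 1 & j == 1)   # both 1
r   <- sum(i == 1 & j == 0)   # 1 in i, 0 in j
s   <- sum(i == 0 & j == 1)   # 0 in i, 1 in j
n_t <- sum(i == 0 & j == 0)   # both 0 (called t in the text)

d_symmetric  <- (r + s) / (q + r + s + n_t)
d_asymmetric <- (r + s) / (q + r + s)      # ignores 0/0 matches
jaccard      <- q / (q + r + s)            # equals 1 - d_asymmetric

print(c(q = q, r = r, s = s, t = n_t))
print(c(symmetric = d_symmetric, asymmetric = d_asymmetric, jaccard = jaccard))
```

The asymmetric version drops the 0/0 matches because, for asymmetric attributes, two objects both lacking the important outcome says little about how alike they are.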
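2.4.4 Dissimilarity of numeric data: Minkowski distance
In the formulas below, objects i = (x_i1, ..., x_ip) and j = (x_j1, ..., x_jp) are described by p numeric attributes.
Euclidean distance: d(i,j) = sqrt((x_i1 - x_j1)^2 + ... + (x_ip - x_jp)^2)
Weighted Euclidean distance: d(i,j) = sqrt(w_1*(x_i1 - x_j1)^2 + ... + w_p*(x_ip - x_jp)^2)
Manhattan distance: d(i,j) = |x_i1 - x_j1| + ... + |x_ip - x_jp|
They have the following mathematical properties:
Non-negativity: d(i,j) >= 0
Identity: The distance from an object to itself is 0, d(i,i) = 0
Symmetry: Distance is a symmetric function, d(i,j) = d(j,i)
Triangle inequality: d(i,j) <= d(i,k) + d(k,j), so going directly from object i to object j is never longer than a detour through any other object k.
Minkowski distance: d(i,j) = (|x_i1 - x_j1|^h + ... + |x_ip - x_jp|^h)^(1/h) with h >= 1. It generalizes the Manhattan distance (h = 1) and the Euclidean distance (h = 2).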
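A short sketch of these distances as one R function, cross-checked against the built-in dist(). The helper name minkowski, the three points, and the weights are assumptions chosen for illustration.

```r
# Minkowski distance of order h between two numeric vectors.
# Hypothetical helper, not from the notes.
minkowski <- function(x, y, h) {
  stopifnot(length(x) == length(y), h >= 1)
  sum(abs(x - y)^h)^(1 / h)
}

# Made-up points in two dimensions.
x <- c(1, 2)
y <- c(3, 5)
z <- c(0, 4)

manhattan <- minkowski(x, y, 1)     # h = 1
euclidean <- minkowski(x, y, 2)     # h = 2
supremum  <- max(abs(x - y))        # limit as h grows without bound

# Weighted Euclidean distance with hypothetical weights.
w <- c(0.5, 2)
weighted_euclidean <- sqrt(sum(w * (x - y)^2))

# Cross-check against the built-in implementation for h = 3.
builtin_h3 <- as.numeric(dist(rbind(x, y), method = "minkowski", p = 3))

# Triangle inequality: the direct route is never longer than a detour via z.
triangle_ok <- euclidean <= minkowski(x, z, 2) + minkowski(z, y, 2)

print(c(manhattan = manhattan, euclidean = euclidean, supremum = supremum))
print(weighted_euclidean)
print(triangle_ok)
```

For whole data sets, dist() computes the full dissimilarity matrix in one call, which is usually preferable to looping over pairs.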
2.4.5 Proximity measures for ordinal attributes
2.4.6 Dissimilarity for attributes of mixed types
2.4.7 Cosine similarity
Supremum distance (Chebyshev distance)
Exercises: R version
2.2 Suppose that the data for analysis include the attribute age, whose values in the data tuples are 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
a) Mean? Median?
b) Mode?
c) Midrange?
d) Q1, Q3?
e) Five-number summary?
f) Boxplot?
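data <- c(13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70)
mean(data)                             # a) mean
median(data)                           # a) median
counts <- table(data)
names(counts)[counts == max(counts)]   # b) mode(s); which.max() alone reports only the first one
(max(data) + min(data)) / 2            # c) midrange
quantile(data, 0.25)                   # d) Q1
quantile(data, 0.75)                   # d) Q3
fivenum(data)                          # e) five-number summary
boxplot(data)                          # f) boxplot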
2.3
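# Frequencies of the age intervals 1-5, 6-15, 16-20, 21-50, 51-80, 81-110 (values as listed in exercise 2.3 of the textbook)
data <- c(200, 450, 300, 1500, 700, 44)
half <- sum(data) / 2
cum <- 0
for (i in 1:length(data)) {
  cum <- cum + data[i]
  if (cum < half && cum + data[i + 1] > half) break
}
# After the loop, i + 1 is the index of the median interval, i.e. the one with lower boundary 20 and width 30
20 + ((half - cum) / data[i + 1]) * 30   # approximate median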
2.4
# Age values as listed in exercise 2.4 of the textbook
age <- c(23, 23, 27, 27, 39, 41, 47, 49, 50, 52, 54, 54, 56, 57, 58, 58, 60, 61)
fat <- c(9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2, 34.6, 42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7)
mean(age); median(age); sd(age)
mean(fat); median(fat); sd(fat)
barplot(table(age))
barplot(table(fat))
plot(age, fat)      # scatter plot
qqplot(age, fat)    # q-q plot
2.6
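# Tuples as listed in exercise 2.6 of the textbook
v1 <- c(22, 1, 42, 10)
v2 <- c(20, 0, 36, 8)
sqrt(sum((v1 - v2)^2))       # Euclidean distance
sum(abs(v1 - v2))            # Manhattan distance
sum(abs(v1 - v2)^3)^(1/3)    # Minkowski distance with h = 3
max(abs(v1 - v2))            # supremum distance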
2.8
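a)
A1 <- c(1.5, 2, 1.6, 1.2, 1.5)
A2 <- c(1.7, 1.9, 1.8, 1.5, 1.0)
data <- data.frame(A1, A2)
x <- c(1.4, 1.6)
e <- c(); m <- c(); u <- c(); co <- c()
for (i in 1:nrow(data)) {
  xi <- unlist(data[i, ])                                        # row i as a plain numeric vector
  e  <- c(e, sqrt(sum((x - xi)^2)))                              # Euclidean distance
  m  <- c(m, sum(abs(x - xi)))                                   # Manhattan distance
  u  <- c(u, max(abs(x - xi)))                                   # supremum distance
  co <- c(co, sum(x * xi) / (sqrt(sum(x^2)) * sqrt(sum(xi^2))))  # cosine similarity
}
rank(e)
rank(m)
rank(u)
rank(-co)   # cosine is a similarity, so the largest value ranks first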
Data Mining: Concepts and Techniques, reading notes (II): Understanding data