Data Mining: Concepts and Techniques, reading notes (II): Understanding data

Source: Internet
Author: User

2.1 Data objects and attribute types

2.1.1 What is an attribute

2.1.2 Nominal attribute: its values are symbols or names of things. Each value represents a category, code, or state, so nominal attributes are also called categorical.

A nominal attribute is not quantitative: its mean and median are meaningless, but its mode (the most frequent value) is meaningful and serves as a measure of central tendency.

2.1.3 Binary attribute: a nominal attribute with only two categories or states, 0 or 1. It is also called a Boolean attribute when the two states correspond to true and false.

A binary attribute is symmetric if its two states are equally important, so there is no preference for which outcome is coded 0 or 1.

A binary attribute is asymmetric if its two states are not equally important, such as the positive and negative results of a medical test. By convention, the more important outcome is coded 1 and the other 0.

2.1.4 Ordinal attribute: its possible values have a meaningful order or ranking, but the size of the difference between successive values is unknown. Examples: small, medium, large; pass, fair, good, excellent; very dissatisfied, somewhat dissatisfied, neutral, satisfied, very satisfied.

The central tendency of an ordinal attribute can be described by its mode and median, but its mean cannot be defined.

2.1.5 Numeric attributes: can be interval-scaled or ratio-scaled

1. Interval-scaled attribute: measured on a scale of equal-size units. Its values are ordered and can be positive, zero, or negative. Differences between values are meaningful, and the mean can be computed in addition to the median and the mode.

2. Ratio-scaled attribute: a numeric attribute with an inherent zero point, so one value can be expressed as a multiple of another. Differences, the mean, the median, and the mode can all be computed.

2.1.6 Discrete and continuous attributes

2.2 Basic statistical descriptions of data

2.2.1 Measuring central tendency: mean, median, and mode

Mean: sensitive to extreme values (outliers).

Weighted arithmetic mean (weighted average): each value x_i carries a weight w_i, and the mean is (w_1*x_1 + ... + w_N*x_N) / (w_1 + ... + w_N).

Trimmed mean: the mean computed after dropping the extreme values at the high and low ends.

Median: the middle value of the ordered data values.

Mode: the value that occurs most frequently in the data set.

Midrange: the average of the maximum and minimum values.

Positive skew: the mode is smaller than the median.

Negative skew: the mode is greater than the median.
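The measures of central tendency listed in this section can be sketched in R as follows. This is only an illustration: the ages are taken from Exercise 2.2 at the end of these notes, `all_modes` is a hypothetical helper (base R has no function that returns the statistical mode), and the values and weights passed to `weighted.mean` are made up.

```r
# Ages from Exercise 2.2 of these notes.
age <- c(13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
         33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70)

# Hypothetical helper (not a base R function): returns every value that
# attains the highest frequency, so multimodal data yields several modes.
all_modes <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}

age_mean     <- mean(age)               # arithmetic mean
age_trimmed  <- mean(age, trim = 0.1)   # drops the 2 lowest and 2 highest values first
age_median   <- median(age)             # middle value of the sorted data
age_modes    <- all_modes(age)          # most frequent value(s)
age_midrange <- (min(age) + max(age)) / 2

# Weighted mean on made-up values and weights: (10*1 + 20*1 + 30*2) / 4
wm <- weighted.mean(c(10, 20, 30), w = c(1, 1, 2))

print(c(mean = age_mean, trimmed = age_trimmed, median = age_median,
        midrange = age_midrange, weighted = wm))
print(age_modes)
```

For this data the two values 25 and 35 both occur four times, which is why the helper returns a vector rather than a single number.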

2.2.2 Measuring the dispersion of data: range, quartiles, variance, standard deviation, and interquartile range

1. Range, quartiles, and interquartile range

Range: the difference between the maximum value and the minimum value.

Quantiles: points that split the data into consecutive sets of essentially equal size.

Quartiles: the three points that split the data into 4 equal parts.

Percentiles: split the data into 100 consecutive sets of equal size.

First quartile: Q1, the 25th percentile

Third quartile: Q3, the 75th percentile

Interquartile range: IQR = Q3 - Q1
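As a rough sketch, the range, the first and third quartiles, and the interquartile range can be computed in R like this, again using the ages from Exercise 2.2. Note the assumption: `quantile` uses R's default interpolation rule (`type = 7`); other quantile definitions can give slightly different quartiles for the same data.

```r
# Ages from Exercise 2.2 of these notes.
age <- c(13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
         33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70)

age_range <- max(age) - min(age)          # range: maximum minus minimum

# 25th and 75th percentiles (R's default type = 7 interpolation)
q  <- quantile(age, probs = c(0.25, 0.75))
q1 <- unname(q[1])
q3 <- unname(q[2])

iqr <- q3 - q1                            # interquartile range; same as IQR(age)

print(c(range = age_range, Q1 = q1, Q3 = q3, IQR = iqr))
```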

2. Five-number summary, boxplots, and outliers

A common rule for identifying suspected outliers: single out values that fall at least 1.5*IQR above the third quartile or below the first quartile.

Five-number summary: minimum, Q1, median (Q2), Q3, and maximum.
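The 1.5*IQR rule can be tried out on the Exercise 2.2 ages with a short R sketch. One assumption to keep in mind: `fivenum` returns Tukey's hinges for the two quartile positions, which can differ slightly from the quartiles returned by `quantile`, so the fences below are one reasonable choice rather than the only one.

```r
# Ages from Exercise 2.2 of these notes.
age <- c(13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
         33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70)

# Five-number summary in the order: minimum, Q1, median, Q3, maximum
five <- fivenum(age)
q1 <- five[2]
q3 <- five[4]

# Values more than 1.5 * IQR beyond the quartiles are flagged as suspected outliers
fence    <- 1.5 * (q3 - q1)
outliers <- age[age < q1 - fence | age > q3 + fence]

print(five)
print(outliers)
```

`boxplot(age)` draws the same summary graphically and, with its default settings, marks points beyond the same fences individually.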

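3. Variance and standard deviation

A low standard deviation means the observations tend to lie very close to the mean, while a high standard deviation means the data are spread over a wide range of values.

Standard deviation: sigma, the square root of the variance sigma^2.

Variance: sigma^2 = ((x_1 - mean)^2 + ... + (x_N - mean)^2) / N, where mean is the mean of the N observations.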
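A small R sketch of variance and standard deviation on the Exercise 2.2 ages follows. It assumes the population form of the variance (dividing by N); R's built-in `var` and `sd` divide by N - 1 instead, so both versions are shown side by side.

```r
# Ages from Exercise 2.2 of these notes.
age <- c(13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
         33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70)
n <- length(age)

# Population variance: mean squared deviation from the mean (divides by N)
pop_var <- mean((age - mean(age))^2)
pop_sd  <- sqrt(pop_var)

# Equivalent shortcut: mean of the squares minus the square of the mean
pop_var_alt <- mean(age^2) - mean(age)^2

# R's built-ins use the sample form (divide by N - 1)
samp_var <- var(age)
samp_sd  <- sd(age)

print(c(pop_var = pop_var, pop_sd = pop_sd, samp_var = samp_var, samp_sd = samp_sd))
```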

2.2.3 Graphical displays of basic statistical descriptions of data

1. Quantile plot

2. Quantile-quantile plot (q-q plot)

3. Histogram

4. Scatter plot: one of the most effective graphical methods for determining whether there is a relationship, pattern, or trend between two numeric variables.

2.3 Data visualization

2.4 Measuring data similarity and dissimilarity

2.4.1 Data matrix and dissimilarity matrix

2.4.2 Proximity measures for nominal attributes

Mismatch ratio: d(i,j) = (p-m)/p, where p is the total number of attributes describing the objects and m is the number of matches (attributes on which objects i and j are in the same state).

Similarity: sim(i,j) = 1 - d(i,j) = m/p
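The mismatch ratio is simple enough to sketch directly in R. The function name `nominal_dissim` and the two example objects are made up for illustration; the only assumption is that both objects are described by the same attributes in the same order.

```r
# Hypothetical helper: mismatch ratio d(i,j) = (p - m) / p for two objects
# described by the same nominal attributes in the same order.
nominal_dissim <- function(a, b) {
  stopifnot(length(a) == length(b))
  p <- length(a)       # total number of attributes
  m <- sum(a == b)     # number of attributes on which the objects match
  (p - m) / p
}

# Two made-up objects with three nominal attributes each
obj_i <- c(color = "red", shape = "round",  size = "small")
obj_j <- c(color = "red", shape = "square", size = "small")

d_ij   <- nominal_dissim(obj_i, obj_j)   # one mismatch out of three attributes
sim_ij <- 1 - d_ij                       # equals m / p

print(c(dissimilarity = d_ij, similarity = sim_ij))
```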

2.4.3 Proximity measures for binary attributes

r: the number of attributes that equal 1 for object i and 0 for object j

s: the number of attributes that equal 0 for object i and 1 for object j

q: the number of attributes that equal 1 for both objects i and j

t: the number of attributes that equal 0 for both objects i and j

Symmetric binary dissimilarity: d(i,j) = (r+s)/(q+r+s+t)

Asymmetric binary dissimilarity: d(i,j) = (r+s)/(q+r+s)

Asymmetric binary similarity: sim(i,j) = q/(q+r+s) = 1 - d(i,j), also known as the Jaccard coefficient
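To make the four counts concrete, here is a minimal R sketch that tallies them for two made-up 0/1 vectors and then applies the three formulas. The helper name `binary_counts` and the example data are assumptions for illustration only.

```r
# Hypothetical helper: contingency counts for two 0/1 vectors of equal length.
binary_counts <- function(i, j) {
  stopifnot(length(i) == length(j))
  c(q = sum(i == 1 & j == 1),   # both objects are 1
    r = sum(i == 1 & j == 0),   # i is 1, j is 0
    s = sum(i == 0 & j == 1),   # i is 0, j is 1
    t = sum(i == 0 & j == 0))   # both objects are 0
}

# Two made-up objects described by six binary attributes
obj_i <- c(1, 0, 1, 0, 0, 0)
obj_j <- c(1, 1, 0, 0, 0, 0)

k <- binary_counts(obj_i, obj_j)

d_sym   <- unname((k["r"] + k["s"]) / sum(k))                       # symmetric
d_asym  <- unname((k["r"] + k["s"]) / (k["q"] + k["r"] + k["s"]))   # asymmetric
jaccard <- unname(k["q"] / (k["q"] + k["r"] + k["s"]))              # 1 - d_asym

print(k)
print(c(symmetric = d_sym, asymmetric = d_asym, jaccard = jaccard))
```

The asymmetric versions ignore t, the number of 0/0 matches, because agreeing on the unimportant state says little about how alike two objects are.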

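2.4.4 Dissimilarity of numeric attributes: Minkowski distance

Euclidean distance: d(i,j) = sqrt((x_i1 - x_j1)^2 + (x_i2 - x_j2)^2 + ... + (x_ip - x_jp)^2)

Weighted Euclidean distance: d(i,j) = sqrt(w_1*(x_i1 - x_j1)^2 + w_2*(x_i2 - x_j2)^2 + ... + w_p*(x_ip - x_jp)^2)

Manhattan distance: d(i,j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|

Both the Euclidean and the Manhattan distance have the following mathematical properties:

Non-negativity: d(i,j) >= 0

Identity: d(i,i) = 0, the distance from an object to itself is 0

Symmetry: d(i,j) = d(j,i)

Triangle inequality: d(i,j) <= d(i,k) + d(k,j); going directly from object i to object j is never longer than making a detour through any other object k.

Minkowski distance: d(i,j) = (|x_i1 - x_j1|^h + |x_i2 - x_j2|^h + ... + |x_ip - x_jp|^h)^(1/h), with h >= 1. It generalizes the Manhattan distance (h = 1) and the Euclidean distance (h = 2).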
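The distances in this section are all special cases of one formula, which the following R sketch tries to show. The function names `minkowski` and `supremum` and the three points are made up for illustration; base R's `dist` offers the same measures through its `method` argument.

```r
# Hypothetical helpers: Minkowski distance of order h, and its limit as h grows
minkowski <- function(x, y, h) sum(abs(x - y)^h)^(1 / h)
supremum  <- function(x, y) max(abs(x - y))

# Three made-up points with two numeric attributes each
x1 <- c(1, 2)
x2 <- c(3, 5)
x3 <- c(6, 1)

d_manhattan <- minkowski(x1, x2, 1)   # |1-3| + |2-5|
d_euclid    <- minkowski(x1, x2, 2)   # sqrt(2^2 + 3^2)
d_h3        <- minkowski(x1, x2, 3)   # (2^3 + 3^3)^(1/3)
d_sup       <- supremum(x1, x2)       # largest single attribute difference

# Triangle inequality: the direct path is never longer than a detour via x3
triangle_ok <- d_euclid <= minkowski(x1, x3, 2) + minkowski(x3, x2, 2)

print(c(manhattan = d_manhattan, euclidean = d_euclid, h3 = d_h3, supremum = d_sup))
print(triangle_ok)
```

As h increases, the Minkowski distance shrinks from the Manhattan value toward the supremum value, which is why the four numbers printed above appear in decreasing order.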

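2.4.5 Proximity measures for ordinal attributes

2.4.6 Dissimilarity for attributes of mixed types

2.4.7 Cosine similarity

Supremum distance (Chebyshev distance): the Minkowski distance as h goes to infinity, equal to the largest difference over all attributes, max_f |x_if - x_jf|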

Exercises: R version

2.2 Suppose the data for analysis include the attribute age, whose values in the data tuples are 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.

a) What is the mean? The median?

b) What is the mode?

c) What is the midrange?

d) What are Q1 and Q3?

e) What is the five-number summary?

f) Show a boxplot.

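    data <- c(13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70)
    mean(data); median(data)                          # a) mean and median
    names(which(table(data) == max(table(data))))     # b) mode(s)
    (max(data) + min(data)) / 2                       # c) midrange
    quantile(data, 0.25); quantile(data, 0.75)        # d) Q1 and Q3
    fivenum(data)                                     # e) five-number summary
    boxplot(data)                                     # f) boxplot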

2.3

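    # Interval frequencies from textbook Exercise 2.3 (ages 1-5, 6-15, 16-20, 21-50, 51-80, 81-110)
    data <- c(200, 450, 300, 1500, 700, 44)
    half <- sum(data) / 2
    cum <- 0
    for (i in 1:length(data)) {
      cum <- cum + data[i]                                  # frequency accumulated so far
      if (cum < half && cum + data[i + 1] > half) break
    }
    # After the loop, i + 1 is the index of the median interval, i.e. 20~50
    20 + ((half - cum) / data[i + 1]) * 30                  # approximate median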

2.4

    age <- c(23, 23, 27, 27, 39, 41, 47, 49, 50, 52, 54, 54, 56, 57, 58, 58, 60, 61)
    fat <- c(9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2, 34.6, 42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7)
    mean(age); median(age); sd(age)
    mean(fat); median(fat); sd(fat)
    boxplot(age); boxplot(fat)     # boxplots
    plot(age, fat)                 # scatter plot
    qqplot(age, fat)               # q-q plot

2.6

    v1 <- c(22, 1, 42, 10)
    v2 <- c(20, 0, 36, 8)
    sqrt(sum((v1 - v2)^2))         # Euclidean distance
    sum(abs(v1 - v2))              # Manhattan distance
    sum(abs(v1 - v2)^3)^(1/3)      # Minkowski distance, h = 3
    max(abs(v1 - v2))              # supremum distance

2.8

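a)

    a1 <- c(1.5, 2, 1.6, 1.2, 1.5)
    a2 <- c(1.7, 1.9, 1.8, 1.5, 1.0)
    data <- data.frame(a1, a2)
    x <- c(1.4, 1.6)
    e <- c(); m <- c(); u <- c(); co <- c()
    for (i in 1:nrow(data)) {
      p <- as.numeric(data[i, ])                                      # i-th data point as a plain vector
      e <- c(e, sqrt(sum((x - p)^2)))                                 # Euclidean distance
      m <- c(m, sum(abs(x - p)))                                      # Manhattan distance
      u <- c(u, max(abs(x - p)))                                      # supremum distance
      co <- c(co, sum(x * p) / (sqrt(sum(x^2)) * sqrt(sum(p^2))))     # cosine similarity
    }
    rank(e); rank(m); rank(u)
    rank(-co)    # cosine is a similarity, so rank it in decreasing order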
