Reading notes on "Data Mining: Concepts and Techniques": getting to know your data

Source: Internet
Author: User

Attribute types:

    • Nominal attributes (qualitative)
    • Binary attributes (qualitative)
    • Ordinal attributes (qualitative)
    • Numeric attributes (quantitative)

Nominal attribute: "relating to names"; each value is a symbol or the name of a thing (a category, code, or state).

e.g. hair color (black, brown, blond, red)

marital status (single, married, divorced, widowed)

Binary attribute: a nominal attribute with only two categories or states, 0 or 1 (also called a Boolean attribute when the two states mean false and true).

Binary attributes are either symmetric or asymmetric. Symmetric: both states are equally valuable and carry the same weight, e.g. sex.

Asymmetric: the two states are not equally important, e.g. a medical test result (negative 0, positive 1).

Ordinal attribute: its values have a meaningful order or ranking, but the size of the difference between successive values is unknown.

e.g. grades (A+, A, A-, B+, B, ...)

ratings (0: negative, 1: neutral, 2: positive)

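Numeric attribute: its value is a measurable quantity, expressed as an integer or a real number. It can be interval-scaled or ratio-scaled.

Interval scale: values are measured in equal-size units but have no true zero, so differences are meaningful and ratios are not, e.g. temperature in °C (5°, 10°, 15°, ...)

Ratio scale: values have a true zero, so one value can be a multiple of another, e.g. weight, height, speed, amount of money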

Classification used in machine learning:

    • Discrete attributes
    • Continuous attributes

————————————————————————————————————————————————————————————————————————————

Basic statistical descriptions of data

Measures of central tendency: mean, median, mode
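The three measures can be computed with Python's standard `statistics` module. This is a minimal sketch; the sample values and the helper name `central_tendency` are hypothetical and chosen only for illustration.

```python
from statistics import mean, median, multimode

# Hypothetical sample values, used only to illustrate the three measures.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]


def central_tendency(data):
    """Return the mean, the median, and the list of modes of a numeric sample."""
    # multimode returns every most-frequent value, so bimodal data is handled.
    return mean(data), median(data), multimode(data)


avg, mid, modes = central_tendency(values)
print(avg, mid, modes)
```

Note that the single large value 110 pulls the mean above the median, which is why the median is often preferred for skewed data.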

Measures of data dispersion: range, quartiles, variance, standard deviation, interquartile range

    • Range: maximum value - minimum value
    • Quartiles: sort all values in ascending order and divide them into four equal parts; the values at the three cut points are the quartiles.

The first quartile (Q1), also called the lower quartile, is the value at the 25% position once all sample values are sorted in ascending order. Position of Q1 = (n+1) × 0.25
The second quartile (Q2), also called the median, is the value at the 50% position. Position of Q2 = (n+1) × 0.5
The third quartile (Q3), also called the upper quartile, is the value at the 75% position. Position of Q3 = (n+1) × 0.75
The distance between the third and first quartiles is the interquartile range (IQR).

    • Interquartile range: IQR = Q3 - Q1
    • Five-number summary: minimum, Q1, median (Q2), Q3, maximum
    • Boxplot: a boxplot helps identify the following characteristics of the data:
      1. Outliers in the data set, which are plotted as individual points.
      2. The spread and skewness of the data set (look at the length of the box, the sizes of its upper and lower parts, and the lengths of the whiskers).
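The position formulas above can be turned into a short sketch that produces the five-number summary and the IQR. The function names and the sample are hypothetical. Two points are assumptions not stated in these notes: fractional positions are resolved by linear interpolation between neighbouring values, and outliers are flagged with the common rule of thumb of 1.5 × IQR beyond the quartiles.

```python
def value_at_position(sorted_values, p):
    """Value at 1-based position (n + 1) * p in an ascending list."""
    n = len(sorted_values)
    pos = min(max((n + 1) * p, 1), n)  # keep the position inside 1..n
    lower = int(pos)
    frac = pos - lower
    if frac == 0:
        return sorted_values[lower - 1]
    # Assumption: interpolate linearly when the position is fractional.
    left, right = sorted_values[lower - 1], sorted_values[lower]
    return left + frac * (right - left)


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum)."""
    s = sorted(data)
    q1, q2, q3 = (value_at_position(s, p) for p in (0.25, 0.5, 0.75))
    return s[0], q1, q2, q3, s[-1]


def iqr_outliers(data):
    """Values more than 1.5 * IQR below Q1 or above Q3 (rule of thumb)."""
    _, q1, _, q3, _ = five_number_summary(data)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [x for x in data if x < low or x > high]


# Hypothetical sample with n = 11, so the positions 3, 6 and 9 are whole numbers.
sample = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 110]
summary = five_number_summary(sample)
iqr = summary[3] - summary[1]
outliers = iqr_outliers(sample)
print(summary, iqr, outliers)
```

These five numbers are exactly what a boxplot draws: the box spans Q1 to Q3, the line inside it marks the median, and the whiskers reach toward the minimum and maximum.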

    • Variance & Standard deviation
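As a minimal sketch, the variance is the mean squared deviation from the mean, and the standard deviation is its square root. The version below divides by N (population variance), which is an assumption about the intended definition; the sample variance divides by N - 1 instead. The function names and data are hypothetical.

```python
import math


def variance(data):
    """Population variance: mean of squared deviations from the mean."""
    n = len(data)
    mu = sum(data) / n
    return sum((x - mu) ** 2 for x in data) / n


def std_dev(data):
    """Standard deviation: square root of the variance."""
    return math.sqrt(variance(data))


# Hypothetical sample with mean 5.
observations = [2, 4, 4, 4, 5, 5, 7, 9]
print(variance(observations), std_dev(observations))
```

A small standard deviation means the values sit close to the mean; a large one means they are widely spread.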

Graphic displays

    • Quantile plot: for observing the distribution of univariate data

The single variable here is unit price.

    • Quantile-quantile plot (q-q plot): shows whether there is a shift in going from one distribution to another

In statistics, a Q-Q plot [1] (Q stands for quantile) is a graphical method for comparing two probability distributions by plotting their quantiles against each other. First, a set of quantile intervals is chosen. A point (x, y) on the plot then corresponds to a quantile of the second distribution (y-axis) plotted against the same quantile of the first distribution (x-axis). The plot is therefore a parametric curve whose parameter is the index of the quantile interval.
If the two distributions being compared are similar, the points of the Q-Q plot lie approximately on the line y = x. If the distributions are linearly related, the points lie approximately on a straight line, but not necessarily on y = x. Q-Q plots can also be used to estimate the location parameters of a distribution.
A Q-Q plot compares the shapes of two distributions and shows graphically whether properties such as location, scale, and skewness are similar or different. It can be used to check how well the empirical distribution of a data set agrees with a theoretical distribution. [2] A Q-Q plot is also a non-parametric way to compare the distributions underlying two samples. For comparing two samples it is generally more powerful than comparing their histograms, but interpreting it requires more background knowledge.
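The points of a Q-Q plot can be computed without any plotting library by pairing the same quantiles of two samples. This is a sketch under stated assumptions: `qq_points` and the two branch samples are hypothetical, and `statistics.quantiles` is used with its default interpolation method.

```python
from statistics import quantiles


def qq_points(sample_x, sample_y, n_intervals=4):
    """Pair the same quantiles of two samples: x from the first, y from the second."""
    qx = quantiles(sample_x, n=n_intervals)  # n_intervals - 1 cut points
    qy = quantiles(sample_y, n=n_intervals)
    return list(zip(qx, qy))


# Hypothetical unit prices at two branches; branch 2 is branch 1 shifted up by 10.
branch_1 = [40, 43, 47, 52, 56, 60, 64, 68, 74, 80, 92]
branch_2 = [price + 10 for price in branch_1]

points = qq_points(branch_1, branch_2)
for x, y in points:
    print(x, y)
```

Because the second sample is a pure shift of the first, every point lies on a line parallel to y = x and 10 units above it, which is how a shift between two distributions shows up in the plot.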

    • Scatter plots and data correlation: show whether there appears to be a relationship between two numeric variables.
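A scatter plot shows a relationship visually; one way to put a number on a linear relationship is the Pearson correlation coefficient. The notes do not name a specific coefficient, so using Pearson's r here is an assumption, and the function name and data are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical paired observations.
xs = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]   # points climb from lower left to upper right
falling = [10, 8, 6, 4, 2]  # points fall from upper left to lower right

print(pearson_r(xs, rising), pearson_r(xs, falling))
```

Values near +1 correspond to a rising point cloud, values near -1 to a falling one, and values near 0 to no linear pattern.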

————————————————————————————————————————————————————————————————————————————

Data visualization

    • Pixel-oriented techniques
    • Geometric projection techniques
    • Icon-based techniques
    • Hierarchical techniques
    • Visualizing complex data and relations

————————————————————————————————————————————————————————————————————————————

Measuring data similarity and dissimilarity (proximity measures)

The dissimilarity d(i, j) between objects i and j is computed differently for each attribute type.

    • Nominal attributes

    • Binary attributes

    • Numeric attributes: Minkowski distance, Euclidean distance, Manhattan distance

    • Ordinal attributes

Get:
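The per-type dissimilarities can be sketched as follows. The formulas are the commonly used ones and are an assumption about what these notes intended, since the notes list only the attribute types: the mismatch ratio for nominal attributes, the Jaccard-style measure for asymmetric binary attributes, the Minkowski distance for numeric attributes (h = 1 gives Manhattan, h = 2 gives Euclidean), and rank normalization for ordinal attributes. All function names and sample objects are hypothetical.

```python
def nominal_dissimilarity(a, b):
    """d = (p - m) / p: share of the p nominal attributes whose values differ."""
    p = len(a)
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return (p - matches) / p


def asymmetric_binary_dissimilarity(a, b):
    """d = (r + s) / (q + r + s); joint 0/0 matches are ignored."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    total = q + r + s
    return (r + s) / total if total else 0.0


def minkowski(a, b, h):
    """Minkowski distance of order h between two numeric vectors."""
    return sum(abs(x - y) ** h for x, y in zip(a, b)) ** (1 / h)


def ordinal_to_unit_interval(rank, n_states):
    """Map a 1-based rank onto [0, 1] so it can be treated as numeric."""
    return (rank - 1) / (n_states - 1)


# Hypothetical objects.
print(nominal_dissimilarity(("black", "single"), ("black", "married")))
print(asymmetric_binary_dissimilarity((1, 0, 1, 0, 0, 0), (1, 0, 1, 0, 1, 0)))
print(minkowski((1, 2), (3, 5), 1))  # Manhattan distance
print(minkowski((1, 2), (3, 5), 2))  # Euclidean distance
print(ordinal_to_unit_interval(2, 3))
```

For a symmetric binary attribute the 0/0 matches would count as well, which makes the measure the same as the nominal mismatch ratio.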

    • Mixed-type attributes: compute a dissimilarity for each attribute (Test1, 2, 3 in the example) and average them
    • Similarity measures: cosine similarity (for comparing documents), Tanimoto coefficient
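The last two items can be sketched in a few lines. The function names, the term-frequency vectors, and the per-attribute dissimilarities are hypothetical. The mixed-type helper assumes every attribute is present for both objects, so it reduces to a plain average.

```python
import math


def cosine_similarity(x, y):
    """cos(x, y) = (x . y) / (||x|| * ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def tanimoto(x, y):
    """Tanimoto coefficient: (x . y) / (x . x + y . y - x . y)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sum(a * a for a in x) + sum(b * b for b in y) - dot)


def mixed_dissimilarity(per_attribute_d):
    """Average the dissimilarities computed separately for each attribute."""
    return sum(per_attribute_d) / len(per_attribute_d)


# Hypothetical term-frequency vectors of two documents.
doc_1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
doc_2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

print(cosine_similarity(doc_1, doc_2))
print(tanimoto(doc_1, doc_2))
print(mixed_dissimilarity([0.0, 1.0, 0.5]))
```

Cosine similarity ignores vector length, so a long and a short document with the same word proportions still score close to 1.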

