Chapter 2: Understanding Data (Notes)

Source: Internet
Author: User

I. Data Objects and attribute types

1. Attribute: a data field that represents a characteristic or feature of a data object. (The terms attribute, dimension, feature, and variable are often used interchangeably.)

2. Nominal attribute: "nominal" means "relating to names". The values of a nominal attribute are symbols or names of things. Each value represents a category, code, or state, so nominal attributes are also called categorical. For example, a person's hair color (black, white, brown, red, yellow, ...) and marital status (unmarried, married, divorced, ...) are nominal attributes. Nominal values can be encoded as numbers such as 1, 2, and 3, but these numbers have no meaningful order and are not quantitative. The mean and median of a nominal attribute are therefore meaningless; the mode is meaningful.

3. Binary attribute: a nominal attribute with only two categories or states, 0 and 1. Typically 0 means the attribute is absent and 1 means it is present. Binary attributes are also called Boolean attributes when the two states are true and false. A binary attribute is either symmetric or asymmetric. It is symmetric if both states are equally valuable and carry the same weight, such as gender (male and female). It is asymmetric if the outcomes are not equally important, such as a virus test result (positive or negative).

4. Ordinal attribute: its possible values have a meaningful order or ranking, but the magnitude of the difference between successive values is unknown. Ordinal attributes are commonly used in rating surveys.

Nominal, binary, and ordinal attributes are qualitative. They describe a feature of an object without giving an actual size or quantity. The values of a qualitative attribute are typically words that represent categories.

5. Numeric attribute: quantitative. It is a measurable quantity, represented by an integer or a real value. Numeric attributes can be interval-scaled or ratio-scaled. Interval-scaled attribute: measured on a scale of equal-size units.
Values of an interval-scaled attribute are ordered, for example 20 degrees and 15 degrees (a temperature attribute). Ratio-scaled attribute: a numeric attribute with an inherent zero point, so that one value can be expressed as a multiple (ratio) of another. Ratio-scaled values are also ordered. Differences between values can be computed, as can the mean, median, and mode.

6. Discrete and continuous attributes: classification algorithms developed in machine learning usually divide attributes into discrete and continuous. Discrete attribute: has a finite or countably infinite set of values, which may or may not be represented as integers. For example, hair color and marital status have finitely many values, so they are discrete.

II. Basic statistical descriptions of data

Basic statistical descriptions are crucial for successful data preprocessing. There are three kinds:

Measures of central tendency: measure the middle or center of a data distribution. They include the mean, median, mode, and midrange.

Dispersion of the data: common measures include the range, quartiles, interquartile range, five-number summary, box plot, and the variance and standard deviation. (These can be used to identify outliers.)

Graphic displays of data: bar chart, pie chart, line chart, quantile plot, quantile-quantile plot, histogram, scatter plot.

1. Measuring central tendency

Mean: the most common and effective numeric measure of the center of a data set is the (arithmetic) mean. It corresponds to the SQL aggregate function AVG() in a database.

Weighted average (weighted arithmetic mean): each weight reflects the significance, importance, or occurrence frequency of its value (weight w_i corresponds to value x_i).

The mean is not always the best measure of the center of the data, because it is sensitive to extreme values (outliers). One remedy is the trimmed mean: the mean computed after discarding the values at the high and low extremes (not necessarily a single value at each end; several values may be dropped).
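As a minimal sketch of the three means described above, the following Python code computes the arithmetic mean, a weighted mean, and a trimmed mean. The function names, the `fraction` parameter, and the sample salaries are illustrative assumptions and do not come from the original notes.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Weighted arithmetic mean: value x_i counts with weight w_i."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the sorted values at each end.

    `fraction` is a hypothetical parameter chosen for this sketch:
    0.1 drops the lowest 10% and the highest 10% of the values.
    """
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of values cut from each end
    return mean(ordered[k:len(ordered) - k])


# Made-up salaries (in thousands); the last value is an outlier.
salaries = [30, 31, 33, 35, 36, 38, 40, 42, 45, 250]

print(mean(salaries))               # pulled upward by the outlier
print(trimmed_mean(salaries, 0.1))  # drops 30 and 250 before averaging
print(weighted_mean([1, 2, 3], [3, 1, 1]))
```

With these made-up numbers the single outlier moves the plain mean well above most of the salaries, while the trimmed mean stays close to the bulk of the data.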
Median: a better measure of the center for skewed (asymmetric) data. It is the middle value of the ordered data values.

Mode: another measure of central tendency. It is the value that occurs most frequently in the set. A data set with several modes is multimodal. At the other extreme, if each data value occurs only once, the data set has no mode. (This usually corresponds to asymmetric data.)

Midrange: the average of the largest and smallest values in the data set.

2. Measuring the dispersion of data

Range: the difference between the maximum and minimum values of the set.

Quantiles: points taken at regular intervals of a data distribution, dividing it into essentially equal-size consecutive sets.

2-quantile: one data point that divides the data distribution into two halves. The 2-quantile corresponds to the median.

4-quantiles: three data points that divide the data distribution into four equal parts, so that each part represents one quarter of the distribution.

100-quantiles (percentiles): they divide the data distribution into 100 equal-size consecutive sets.

Quartiles: the first quartile, denoted Q1, is the 25th percentile (it cuts off the lowest 25% of the data). The second quartile, denoted Q2, is the 50th percentile; as the median, it gives the center of the data distribution. The third quartile, denoted Q3, is the 75th percentile (it cuts off the lowest 75% of the data).

Interquartile range (IQR): IQR = Q3 - Q1.

Five-number summary: consists of the median (Q2), the quartiles Q1 and Q3, and the minimum and maximum values, written in the order Min, Q1, median, Q3, Max.

Rule for identifying suspected outliers: a common rule is to single out values that fall at least 1.5 times the IQR above the third quartile or below the first quartile.

Variance and standard deviation: the variance of N observations x_1, ..., x_N is the average squared deviation from the mean, sigma^2 = (1/N) * sum over i of (x_i - mean)^2. The standard deviation sigma is the square root of the variance.

III. Data visualization

Data visualization aims to communicate data clearly and effectively through graphical representation.

IV. Measuring data similarity and dissimilarity

2. Formula for the dissimilarity between two objects i and j: [formula not preserved in the source]

3.
Proximity measures for binary attributes: q, r, s, and t count the attributes on which objects i and j take the value combinations (1, 1), (1, 0), (0, 1), and (0, 0) respectively. (For example, an attribute with i = 1 and j = 1 adds 1 to q.) The total number of attributes is p = q + r + s + t.

Dissimilarity of two objects i and j for symmetric binary attributes: d(i, j) = (r + s) / (q + r + s + t).

Sometimes the attributes on which both objects are 0 carry no information and can be ignored. This gives the dissimilarity for asymmetric binary attributes: d(i, j) = (r + s) / (q + r + s).

The corresponding similarity is sim(i, j) = q / (q + r + s) = 1 - d(i, j). sim(i, j) is also called the Jaccard coefficient.

4. Dissimilarity of numeric attributes: Minkowski distance, Euclidean distance, and Manhattan distance.

Manhattan distance: the city block distance, that is, the distance in blocks between two points in a city (horizontal distance + vertical distance).

Minkowski distance: d(i, j) = (|x_i1 - x_j1|^h + |x_i2 - x_j2|^h + ... + |x_ip - x_jp|^h)^(1/h), where h is a real number with h >= 1. (This distance is also called the Lp norm, where the p in "Lp" corresponds to h here.) h = 1 gives the Manhattan distance and h = 2 gives the Euclidean distance.

Supremum distance (also known as Lmax, the L-infinity norm, and the Chebyshev distance): a generalization of the Minkowski distance for h -> infinity. It is the maximum difference between the two objects on any single attribute.

5. Ordinal attributes: the values of an ordinal attribute have a meaningful order or ranking, but the magnitude of the difference between successive values is unknown.

7. Cosine similarity: a measure that can be used to compare documents or to rank documents against a given vector of query words. A cosine value of 0 means that the two vectors are orthogonal (at 90°) and have no match. The closer the value is to 1, the smaller the angle and the greater the match between the vectors.
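The proximity measures in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: objects are plain lists of numbers (0/1 for the binary measures), all function names and sample vectors are made up for the example, and cosine similarity is computed in its usual form, the dot product divided by the product of the vector lengths.

```python
import math


def binary_counts(x, y):
    """Return q, r, s, t for two equal-length 0/1 vectors."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return q, r, s, t


def symmetric_binary_dissimilarity(x, y):
    """(r + s) / (q + r + s + t): both states count equally."""
    q, r, s, t = binary_counts(x, y)
    return (r + s) / (q + r + s + t)


def jaccard_similarity(x, y):
    """q / (q + r + s): 0-0 matches are ignored (asymmetric case).

    Undefined (division by zero) if both vectors are all zeros.
    """
    q, r, s, _ = binary_counts(x, y)
    return q / (q + r + s)


def minkowski(x, y, h):
    """Minkowski distance; h = 1 is Manhattan, h = 2 is Euclidean."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)


def supremum(x, y):
    """Supremum (Chebyshev) distance: largest per-attribute difference."""
    return max(abs(a - b) for a, b in zip(x, y))


def cosine_similarity(x, y):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


# Made-up sample objects.
i_bin, j_bin = [1, 0, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0]
print(symmetric_binary_dissimilarity(i_bin, j_bin))  # uses q, r, s, t
print(jaccard_similarity(i_bin, j_bin))              # ignores t

x1, x2 = [1, 2], [3, 5]
print(minkowski(x1, x2, 1), minkowski(x1, x2, 2), supremum(x1, x2))
print(cosine_similarity([1, 0], [0, 1]))  # orthogonal vectors
```

For the two sample points (1, 2) and (3, 5), the Manhattan distance is 2 + 3 = 5, the Euclidean distance is the square root of 13, and the supremum distance is 3, the larger of the two per-attribute differences.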
