Data Mining: Data (Study Notes)

Source: Internet
Author: User

Data mining is a technology that combines traditional data analysis methods with sophisticated algorithms for processing large volumes of data. It is the process of automatically discovering useful information in large databases, and it can also be used to predict the outcome of future observations. The object of data mining is data; without data, there is nothing to mine. These are my notes from studying "Introduction to Data Mining", written down to consolidate what I have learned.

I. Types of data

Data objects go by other names, such as records, points, vectors, patterns, events, cases, samples, observations, or entities.

1. Attributes and measurement

An attribute is a property or characteristic of an object that may vary from one object to another or over time.

A measurement scale is a rule that associates a numerical or symbolic value with an attribute of an object.

There are four types of attributes: nominal, ordinal, interval, and ratio. Nominal and ordinal attributes are collectively referred to as categorical or qualitative attributes; interval and ratio attributes are referred to as numeric or quantitative attributes.
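The four attribute types can be told apart by which operations are meaningful on their values. The following is a minimal sketch of that idea, assuming the usual convention that each type inherits the operations of the types before it; the names LEVELS, NEW_OPERATIONS, and meaningful_operations are hypothetical and chosen only for illustration.

```python
# Attribute types ordered from least to most informative.
LEVELS = ["nominal", "ordinal", "interval", "ratio"]

# Operations that become meaningful at each level (assumed convention);
# each level also inherits the operations of every earlier level.
NEW_OPERATIONS = {
    "nominal": {"==", "!="},
    "ordinal": {"<", ">"},
    "interval": {"+", "-"},
    "ratio": {"*", "/"},
}


def meaningful_operations(attribute_type):
    """Return the set of operations that are meaningful for an attribute type."""
    index = LEVELS.index(attribute_type)
    operations = set()
    for level in LEVELS[: index + 1]:
        operations |= NEW_OPERATIONS[level]
    return operations


def is_qualitative(attribute_type):
    """Nominal and ordinal attributes are categorical (qualitative)."""
    return attribute_type in ("nominal", "ordinal")
```

For example, a ZIP code is nominal, so only equality tests make sense, while a length is a ratio attribute, so all of the listed operations apply.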

2. Types of data sets

Data sets have three important general characteristics: dimensionality, sparsity, and resolution.

Data sets fall into the following types:

* Record data, including: transaction data (market basket data), the data matrix, and the sparse data matrix.

* Graph-based data, including: data with relationships among objects, and data whose objects are themselves graphs.

* Ordered data, including: sequential (temporal) data, sequence data, time series data, and spatial data.
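To make the record data types more concrete, here is a small sketch that turns market basket transactions into a binary data matrix and then stores only the non-zero entries as a sparse matrix. The basket contents and the function names are made up for illustration.

```python
def transactions_to_matrix(transactions):
    """Build a binary data matrix: one row per transaction, one column per item."""
    items = sorted({item for t in transactions for item in t})
    matrix = [[1 if item in t else 0 for item in items] for t in transactions]
    return items, matrix


def to_sparse(matrix):
    """Keep only the non-zero cells, keyed by (row, column)."""
    return {
        (r, c): value
        for r, row in enumerate(matrix)
        for c, value in enumerate(row)
        if value != 0
    }


# Hypothetical baskets, used only as sample input.
baskets = [{"bread", "milk"}, {"beer", "bread", "diapers"}, {"milk"}]
items, matrix = transactions_to_matrix(baskets)
sparse = to_sparse(matrix)
```

The sparse form pays off when most cells are zero, which is typical for basket data with many possible items.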

II. Data quality

1. Measurement and data collection issues

Measurement errors and data collection errors:

Noise and artifacts: noise is the random component of a measurement error, while artifacts are deterministic distortions, such as a streak that appears in the same place on a set of photographs.

Precision, bias, and accuracy: precision is usually measured by the standard deviation of a set of repeated measurements, while bias is measured by the difference between the mean of that set and the known value of the quantity being measured.
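As a quick worked example of precision and bias, the sketch below assumes a reference mass known to be 1.0 g that is weighed five times. The readings are illustrative values, and using the sample standard deviation (statistics.stdev) rather than the population one is an assumption.

```python
import statistics


def precision_and_bias(measurements, true_value):
    """Precision: standard deviation of repeated measurements.
    Bias: mean of the measurements minus the known true value."""
    precision = statistics.stdev(measurements)  # sample standard deviation
    bias = statistics.mean(measurements) - true_value
    return precision, bias


# Illustrative repeated readings of a 1.0 g reference mass.
readings = [1.015, 0.990, 1.013, 1.001, 0.986]
precision, bias = precision_and_bias(readings, true_value=1.0)
print(f"precision = {precision:.3f}, bias = {bias:.3f}")
```

Here the mean reading is 1.001, so the bias is about 0.001, and the spread of the readings gives a precision of about 0.013.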

Outliers: data objects whose characteristics differ from most of the other data objects in the data set, or attribute values that are unusual relative to the typical values of that attribute. Outliers are also known as anomalous objects.

Missing values: information about one or more attributes of an object that was not collected. There are several strategies for handling missing values, such as eliminating the affected data objects or attributes, estimating the missing values, and ignoring the missing values during analysis.
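The three strategies for missing values can be sketched as follows. The records, attribute names, and the choice of the mean as the estimate are assumptions made for the example; None stands for a value that was not collected.

```python
import statistics

# Hypothetical records; None marks a missing value.
records = [
    {"age": 25, "income": 40000},
    {"age": None, "income": 52000},
    {"age": 41, "income": None},
    {"age": 33, "income": 61000},
]


def drop_incomplete(rows):
    """Strategy 1: eliminate objects that have any missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def impute_mean(rows, attribute):
    """Strategy 2: estimate missing values, here with the attribute mean."""
    known = [r[attribute] for r in rows if r[attribute] is not None]
    fill = statistics.mean(known)
    return [
        {**r, attribute: fill if r[attribute] is None else r[attribute]}
        for r in rows
    ]


def mean_ignoring_missing(rows, attribute):
    """Strategy 3: ignore missing values during the analysis itself."""
    return statistics.mean(
        r[attribute] for r in rows if r[attribute] is not None
    )
```

Dropping objects is the simplest option but can discard a lot of data; imputation keeps every object at the cost of introducing estimated values.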

Duplicate data: a data set may contain data objects that are duplicates, or almost duplicates, of one another.

2. Issues related to applications

Data quality also depends on the intended application, so in addition to the issues above, the following properties should be considered: timeliness and relevance.

III. Data preprocessing

1. Aggregation

Aggregation is the merging of two or more objects into a single object.
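A minimal sketch of aggregation, assuming individual sales records that are merged into one total per store. The field names and amounts are hypothetical.

```python
from collections import defaultdict

# Hypothetical per-transaction records.
sales = [
    {"store": "A", "amount": 12.5},
    {"store": "A", "amount": 7.5},
    {"store": "B", "amount": 20.0},
    {"store": "B", "amount": 5.0},
]


def aggregate_by(rows, key, value):
    """Merge all rows sharing the same key into a single summed object."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


store_totals = aggregate_by(sales, key="store", value="amount")
```

Four transaction objects become two store objects, which reduces the data volume and gives a higher-level view at the cost of losing per-transaction detail.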

2. Sampling

Sampling is a common approach for selecting a subset of the data objects to analyze. It rests on the idea that if the sample is representative, working with the sample gives almost the same results as working with the entire data set.

There are several sampling methods: sampling without replacement, sampling with replacement, stratified sampling, and progressive sampling.
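The first three sampling methods can be sketched with the standard random module. The helper names are hypothetical, and the stratified version assumes an equal number of objects is drawn from each group; progressive sampling (growing the sample until the results stop improving) is not shown.

```python
import random


def sample_without_replacement(data, n, rng):
    """Each object can be selected at most once."""
    return rng.sample(data, n)


def sample_with_replacement(data, n, rng):
    """The same object can be selected more than once."""
    return rng.choices(data, k=n)


def stratified_sample(data, key, per_group, rng):
    """Draw up to per_group objects from each group defined by key(item)."""
    groups = {}
    for item in data:
        groups.setdefault(key(item), []).append(item)
    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample


rng = random.Random(0)  # fixed seed so the sketch is repeatable
data = list(range(100))
without = sample_without_replacement(data, 10, rng)
with_repl = sample_with_replacement(data, 10, rng)
# Treat even and odd numbers as two hypothetical strata.
strat = stratified_sample(data, lambda v: v % 2, 3, rng)
```

Stratified sampling is useful when some groups are rare, because simple random sampling may miss them entirely.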

3. Dimensionality reduction

Dimensionality reduction differs from aggregation: aggregation merges objects, whereas dimensionality reduction reduces the number of attributes, that is, the number of dimensions. It does so by creating new attributes that combine some of the old attributes.

Curse of dimensionality: the phenomenon in which many types of data analysis become significantly harder as the dimensionality of the data increases.

Linear algebra techniques for dimensionality reduction: principal component analysis (PCA) and singular value decomposition (SVD).
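To illustrate the idea behind PCA, the sketch below finds the first principal component of a tiny data set by power iteration on the covariance matrix and projects the data onto it. This is a pure-Python teaching sketch under simplifying assumptions (the starting vector is not orthogonal to the leading direction, and the data are not all identical); in practice a linear algebra library would be used instead. The sample points are made up.

```python
import math


def covariance_matrix(rows):
    """Center the data and return (covariance matrix, centered rows)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    return cov, centered


def first_principal_component(cov, iterations=200):
    """Power iteration: repeatedly multiply by cov and renormalize."""
    d = len(cov)
    v = [1.0 / (i + 1) for i in range(d)]  # arbitrary starting vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def project(centered, component):
    """Reduce each centered row to a single coordinate along the component."""
    return [sum(c * w for c, w in zip(row, component)) for row in centered]


# Made-up 2-D points that lie exactly on the line y = x.
points = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
cov, centered = covariance_matrix(points)
pc1 = first_principal_component(cov)
scores = project(centered, pc1)
```

Because the points lie on a line, one new attribute (the coordinate along that line) captures all of the variation, so the two original attributes can be replaced by one.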

4. Feature subset selection

Another way to reduce the dimensionality is to use only a subset of the features. The subset replaces the original set of attributes and captures the important information in the data set more efficiently. There are three standard approaches to feature selection: embedded, filter, and wrapper.
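As one possible illustration of the filter approach, the sketch below scores each feature independently of any mining algorithm and keeps the top k. Using the absolute Pearson correlation with a target as the score is an assumption made for the example, as are the feature names and values; constant features would need separate handling because their standard deviation is zero.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def filter_select(features, target, k):
    """Rank features by |correlation with target| and keep the best k."""
    scores = {
        name: abs(pearson(values, target)) for name, values in features.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


# Hypothetical features and target.
features = {
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [2.0, 1.0, 4.0, 3.0, 5.0],
    "f3": [1.0, -1.0, 0.0, -1.0, 1.0],
}
target = [2.0, 4.0, 6.0, 8.0, 10.0]
selected = filter_select(features, target, k=2)
```

A wrapper approach would instead evaluate candidate subsets by running the target mining algorithm on them, which is usually more expensive.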

Feature weighting: more important features (attributes) are assigned a higher weight, while less important features are assigned a lower weight.

5. Discretization and binarization

In data mining, it is often necessary to transform a continuous attribute into a categorical attribute (discretization), and both continuous and discrete attributes may need to be transformed into one or more binary attributes (binarization).
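A small sketch of both steps: equal-width binning is used here as one simple discretization method, and each resulting category is then turned into its own binary attribute. The function names and the age values are assumptions for illustration; the binning helper assumes the values are not all equal.

```python
def equal_width_bins(values, n_bins):
    """Discretization: map each value to the index of its equal-width bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def one_hot(categories):
    """Binarization: one binary attribute per distinct category."""
    labels = sorted(set(categories))
    return [[1 if c == label else 0 for label in labels] for c in categories]


ages = [18, 25, 40, 62, 78]          # continuous attribute
bins = equal_width_bins(ages, 3)     # categorical attribute with 3 values
binary = one_hot(bins)               # three binary attributes
```

With these values the bin width is 20, so 18 and 25 fall in bin 0, 40 in bin 1, and 62 and 78 in bin 2.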

6. Variable transformation

A variable transformation is a transformation that is applied to all the values of a variable, that is, an attribute transformation. There are two important types of variable transformations: simple functional transformations, and normalization (also called standardization).
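Both kinds of transformation can be sketched in a few lines. The base-10 logarithm is used as an example of a simple function, and standardization is shown as subtracting the mean and dividing by the sample standard deviation; both choices, and the sample values, are assumptions for illustration.

```python
import math
import statistics


def log_transform(values):
    """Simple function: compress a wide range of positive values."""
    return [math.log10(v) for v in values]


def standardize(values):
    """Standardization: shift to mean 0 and scale to standard deviation 1."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]


sizes = [1.0, 10.0, 100.0, 1000.0]
logged = log_transform(sizes)

scores = [2.0, 4.0, 6.0]
z_scores = standardize(scores)
```

Standardization is helpful when attributes with very different scales, such as age and income, are combined in a single distance computation.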

IV. Measures of similarity and dissimilarity

1. Proximity is the general term for either similarity or dissimilarity. The similarity between two objects is a numerical measure of the degree to which they are alike. The dissimilarity between two objects (often also called distance) is a numerical measure of the degree to which they differ.

2. Dissimilarity between data objects

The classic measure is the Euclidean distance.
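For two objects x and y with n numeric attributes, the Euclidean distance is the square root of the sum of the squared differences of their attribute values. A minimal sketch, with a hypothetical function name:

```python
import math


def euclidean_distance(x, y):
    """Square root of the sum of squared attribute differences."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of attributes")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


distance = euclidean_distance((0.0, 0.0), (3.0, 4.0))
```

For the points (0, 0) and (3, 4) this is the familiar 3-4-5 right triangle, so the distance is 5.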

3. Similarity between data objects

Similarity measures between objects that contain only binary attributes are also called similarity coefficients. Two are commonly used. Simple matching coefficient: SMC = number of matching attribute values / number of attributes. Jaccard coefficient: J = number of 1-1 matches / number of attributes not involved in 0-0 matches.
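The two coefficients can be computed by counting the four kinds of attribute pairs. In the sketch below, f11 counts attributes where both objects are 1, f00 where both are 0, and f10 and f01 where they disagree; the function names and the two sample vectors are chosen for illustration.

```python
def match_counts(x, y):
    """Count the (0,0), (0,1), (1,0) and (1,1) attribute pairs."""
    f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    return f00, f01, f10, f11


def smc(x, y):
    """Simple matching coefficient: matches / all attributes."""
    f00, f01, f10, f11 = match_counts(x, y)
    return (f11 + f00) / (f00 + f01 + f10 + f11)


def jaccard(x, y):
    """Jaccard coefficient: 1-1 matches / attributes not in 0-0 matches."""
    f00, f01, f10, f11 = match_counts(x, y)
    denominator = f01 + f10 + f11
    return f11 / denominator if denominator else 0.0


x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
```

For these vectors there are seven 0-0 matches and no 1-1 matches, so SMC is 0.7 while the Jaccard coefficient is 0. This shows why Jaccard is preferred for sparse data, where shared zeros say little about similarity.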

Cosine similarity: commonly used to compare two vectors, especially sparse ones such as document vectors. In Web mining, it is often used to compare the similarity of two web pages.
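Cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal sketch follows; the two term-count vectors are illustrative, and returning 0.0 for an all-zero vector is an assumption made to avoid division by zero.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumed convention for empty documents
    return dot / (norm_x * norm_y)


# Illustrative term-count vectors for two documents.
doc_a = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
doc_b = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
similarity = cosine_similarity(doc_a, doc_b)
print(round(similarity, 3))
```

Here the dot product is 5 and the vector lengths are about 6.48 and 2.45, which gives a similarity of about 0.315. Like Jaccard, the measure ignores 0-0 matches, and it also ignores vector length, so a long and a short document on the same topic can still score highly.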

Extended Jaccard coefficient (Tanimoto coefficient): an extension of the Jaccard coefficient that can be used for document data.
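A common definition of the Tanimoto coefficient divides the dot product of the two vectors by the sum of their squared lengths minus the dot product. The sketch below uses that definition; the function name, the sample vectors, and the 0.0 result for two all-zero vectors are assumptions for illustration.

```python
def tanimoto(x, y):
    """Extended Jaccard: x.y / (|x|^2 + |y|^2 - x.y)."""
    dot = sum(a * b for a, b in zip(x, y))
    squared_x = sum(a * a for a in x)
    squared_y = sum(b * b for b in y)
    denominator = squared_x + squared_y - dot
    return dot / denominator if denominator else 0.0


# Illustrative term-count vectors for two documents.
doc_a = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
doc_b = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
coefficient = tanimoto(doc_a, doc_b)
```

For binary vectors the dot product counts the 1-1 matches and the squared lengths count the 1s in each vector, so the formula reduces to the ordinary Jaccard coefficient.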

4. Issues in proximity calculation

Combining similarities for heterogeneous attributes: compute the similarity for each attribute separately, then combine these similarities using a method that yields a similarity between 0 and 1. If some attributes are asymmetric, handle them as follows: if both objects have a value of 0 on an asymmetric attribute, ignore that attribute when computing the similarity.
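One way to sketch this procedure is to average the per-attribute similarities while skipping asymmetric attributes on which both objects are 0. The per-attribute similarity functions, the value range of 40, and the sample objects below are assumptions chosen for illustration.

```python
def nominal_similarity(a, b):
    """1 if the two values are equal, otherwise 0."""
    return 1.0 if a == b else 0.0


def make_numeric_similarity(value_range):
    """Similarity in [0, 1] for a numeric attribute with a known range."""
    def similarity(a, b):
        return 1.0 - abs(a - b) / value_range
    return similarity


def combined_similarity(x, y, similarity_fns, asymmetric):
    """Average per-attribute similarities, ignoring 0-0 asymmetric pairs."""
    total = 0.0
    counted = 0
    for a, b, sim, is_asym in zip(x, y, similarity_fns, asymmetric):
        if is_asym and a == 0 and b == 0:
            continue  # shared absence carries no information
        total += sim(a, b)
        counted += 1
    return total / counted if counted else 0.0


# Hypothetical objects: (color, asymmetric binary flag, numeric value).
x = ("red", 0, 10.0)
y = ("red", 0, 20.0)
fns = [nominal_similarity, nominal_similarity, make_numeric_similarity(40.0)]
asym = [False, True, False]
score = combined_similarity(x, y, fns, asym)
```

The color attribute contributes 1.0 and the numeric attribute 0.75; the 0-0 pair on the asymmetric attribute is skipped, so the combined similarity is 0.875 rather than being inflated by a meaningless match.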
