Data Mining note (4)-Definition and broad knowledge

Last Update:2018-12-04 Source: Internet

Author: User

1. Classification of data mining: from the perspective of data analysis, data mining can be divided into two types. Descriptive data mining expresses the meaningful properties present in the data in a concise manner. Predictive data mining applies a specific method to a given dataset to obtain one or a group of data models, and then uses these models to predict the relevant properties of new, unseen data.

2. Concepts of generalized (broad sense) knowledge

(1) Definition: generalized knowledge is general descriptive knowledge about the features of a class, also known as a concept description. It reflects the common properties of similar things and is a summary, refinement, and abstraction of the data.

Generalized knowledge is obtained by induction and generalization over a large amount of data, which extracts general, descriptive, statistical knowledge.

(2) The simplest form of descriptive data mining (generalized knowledge mining) is qualitative induction, often referred to as concept description. A concept description involves a group of objects in the same category, such as regular customers.

Concept description produces both a qualitative description of the data and a comparative qualitative description.

A qualitative concept description provides a concise and clear description of a whole data collection (the concept).

A comparative qualitative concept description provides a comparative concept description (concept extension) based on multiple groups (different categories) of data.

3. Methods for discovering generalized knowledge

Data mining function: data generalization is an analytical process that abstracts and summarizes a large amount of task-related data in the database, moving it from a relatively low conceptual level to a higher one.

There are two methods for describing a large amount of data effectively and flexibly: (1) the data cube method and (2) attribute-oriented induction.

(1) The data cube method (also known as the OLAP method) generalizes data by storing, in a data cube, the precomputed aggregates over some or all of the dimensions (attributes).

Generalization and refinement of a multidimensional data cube are achieved through roll-up and drill-down operations, respectively.
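The idea of storing aggregates over some or all dimensions can be sketched in a few lines of Python. This is only an illustration: the fact table `ROWS`, the dimension names, and the helper `build_cube` are hypothetical, and a real OLAP engine would store and index its cuboids very differently.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact table: (region, product, quarter, sales).
ROWS = [
    ("East", "TV", "Q1", 100),
    ("East", "TV", "Q2", 150),
    ("East", "Phone", "Q1", 200),
    ("West", "TV", "Q1", 50),
    ("West", "Phone", "Q2", 300),
]
DIMENSIONS = ("region", "product", "quarter")


def build_cube(rows, dimensions):
    """Return {kept dimension names: {dimension values: summed sales}}."""
    cube = {}
    for size in range(len(dimensions) + 1):
        # Every subset of the dimensions defines one cuboid (one group-by).
        for kept in combinations(range(len(dimensions)), size):
            cuboid = defaultdict(int)
            for row in rows:
                key = tuple(row[i] for i in kept)
                cuboid[key] += row[-1]  # the measure is the last column
            cube[tuple(dimensions[i] for i in kept)] = dict(cuboid)
    return cube


cube = build_cube(ROWS, DIMENSIONS)
print(cube[()])                     # grand total over all dimensions
print(cube[("region",)])            # totals per region
print(cube[("region", "product")])  # totals per region and product
```

With three dimensions the sketch materializes all 2^3 = 8 cuboids; storing only some of them corresponds to the "partial" case mentioned above.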

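Roll-up: moves from more detailed data to less detailed data. It can be achieved by climbing up the concept hierarchy of a dimension or by removing a dimension.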

Drill-down: the inverse of roll-up, moving from less detailed data to more detailed data. It can be achieved by stepping down the concept hierarchy of a dimension or by introducing a new dimension.
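A minimal sketch of roll-up on a small cuboid, assuming a hypothetical city-to-country concept hierarchy and made-up sales figures. Both ways of reducing detail are shown: climbing the hierarchy of one dimension and dropping a dimension.

```python
from collections import Counter

# Hypothetical concept hierarchy for the location dimension: city -> country.
CITY_TO_COUNTRY = {"Hangzhou": "China", "Beijing": "China", "Seattle": "USA"}

# Hypothetical detailed cuboid: (city, quarter) -> sales.
sales_by_city = {
    ("Hangzhou", "Q1"): 120,
    ("Beijing", "Q1"): 80,
    ("Seattle", "Q1"): 60,
    ("Hangzhou", "Q2"): 90,
}


def roll_up_location(cuboid, mapping):
    """Roll up by climbing the location hierarchy (city -> country)."""
    result = Counter()
    for (city, quarter), amount in cuboid.items():
        result[(mapping[city], quarter)] += amount
    return dict(result)


def roll_up_drop_quarter(cuboid):
    """Roll up by removing the quarter dimension entirely."""
    result = Counter()
    for (city, _quarter), amount in cuboid.items():
        result[(city,)] += amount
    return dict(result)


by_country = roll_up_location(sales_by_city, CITY_TO_COUNTRY)
by_city_only = roll_up_drop_quarter(sales_by_city)
print(by_country)
print(by_city_only)
```

Drill-down goes the other way, from `by_country` back to `sales_by_city`. The aggregated values alone cannot be split again, so the more detailed cuboid (or the base data) has to be kept available.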

Limitations of the data cube method:

① Data type restriction: most commercial data cube implementations restrict dimensions to non-numeric types and restrict processing to simple numeric aggregation. Because many applications need to analyze more complex data types, the data cube method is of limited use in those cases.

② Lack of clear criteria: the data cube method does not answer some important questions that concept description can address, such as which dimensions should be used in the description and at which level of abstraction the generalization should stop. Users are left to answer these questions themselves.

(2) Attribute-oriented induction (AOI)

Basic idea: first, use a relational database query to collect the task-related data; then generalize the data by examining the number of distinct values of each attribute in the task-related dataset. Generalization is performed either by removing attributes or by generalizing attributes (also known as climbing the concept hierarchy). Identical rows that result from generalization are merged and their counts are accumulated, which naturally reduces the size of the generalized dataset. The generalized results are presented to users in various forms, such as charts and rules.

For example, when mining a university database, the first step of the AOI method is to extract the student data (relevant to the mining task) with a database query language and to specify the set of attributes relevant to the task. Users may, however, provide too many attributes. In this case, the data cleaning and dimension reduction methods described in data preprocessing are needed to filter out irrelevant or weakly relevant attributes before descriptive data mining.
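The merge-and-count step can be illustrated with a small sketch. The student tuples, the two concept hierarchies, and the GPA cut-off below are all hypothetical; the point is only that generalized rows that become identical are merged and their counts accumulated.

```python
from collections import Counter

# Hypothetical task-related tuples: (name, major, birth_city, gpa).
students = [
    ("Ann", "Physics", "Hangzhou", 3.7),
    ("Bob", "Chemistry", "Beijing", 3.8),
    ("Cid", "History", "Hangzhou", 3.1),
    ("Dee", "Physics", "Seattle", 3.9),
]

# Hypothetical concept hierarchies (one level each).
MAJOR_TO_FIELD = {"Physics": "Science", "Chemistry": "Science", "History": "Arts"}
CITY_TO_COUNTRY = {"Hangzhou": "China", "Beijing": "China", "Seattle": "USA"}


def gpa_level(gpa):
    """Map a numeric GPA to a coarse level (the cut-off is an assumption)."""
    return "excellent" if gpa >= 3.5 else "good"


def generalize(row):
    """Drop the name attribute and lift the other attributes one level."""
    _name, major, city, gpa = row
    return (MAJOR_TO_FIELD[major], CITY_TO_COUNTRY[city], gpa_level(gpa))


# Identical generalized rows are merged; Counter accumulates their counts.
generalized = Counter(generalize(row) for row in students)
for values, count in generalized.items():
    print(values, "count =", count)
```

Four input tuples collapse into three generalized tuples, and the counts still add up to the original number of rows.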

There are two types of AOI operations:

① Attribute removal: applied according to the following rule. If an attribute (in the initial dataset) has many distinct values, and either (a) the attribute cannot be generalized (for example, no concept hierarchy is defined for it), or (b) its higher-level concepts are already expressed by other attributes, then the attribute can be removed from the dataset.

② Attribute generalization: applied according to the following rule. If an attribute (in the initial dataset) has many distinct values and a set of generalization operators exists for it, then one of these operators can be selected and applied to the attribute.
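The two rules can be read as a small decision procedure. In the sketch below, "many distinct values" is made concrete with a numeric threshold, which is an assumption for illustration, and `choose_operation` and its parameters are hypothetical names.

```python
def choose_operation(values, threshold, hierarchy=None, covered_by_other=False):
    """Decide what AOI should do with one attribute.

    values           -- the attribute's values in the initial dataset
    threshold        -- assumed cut-off for "many distinct values"
    hierarchy        -- concept hierarchy for the attribute, or None
    covered_by_other -- True if other attributes already express its
                        higher-level concepts
    """
    if len(set(values)) <= threshold:
        return "keep"          # few distinct values: nothing to do
    if hierarchy is None or covered_by_other:
        return "remove"        # rule ①: cannot or need not be generalized
    return "generalize"        # rule ②: a generalization operator exists


names = ["Ann", "Bob", "Cid", "Dee"]
majors = ["Physics", "Chemistry", "History", "Physics"]
genders = ["F", "M", "M", "F"]
major_hierarchy = {"Physics": "Science", "Chemistry": "Science", "History": "Arts"}

print(choose_operation(names, threshold=2))                              # remove
print(choose_operation(majors, threshold=2, hierarchy=major_hierarchy))  # generalize
print(choose_operation(genders, threshold=2))                            # keep
```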

Methods for controlling the generalization process:

① Attribute generalization threshold control: this technique sets either one threshold for all attributes or a separate threshold for each attribute. If the number of distinct values of an attribute exceeds the threshold, further attribute removal or generalization is performed on that attribute. Data mining systems usually have a default attribute threshold (typically between 2 and 8).

② Generalized relation threshold control: if the number of distinct rows (tuples) in the generalized relation exceeds the generalized relation threshold, the relevant attributes are generalized further; otherwise, no further generalization is required. This threshold is usually preset in the data mining system (typically between 10 and 30).

These two techniques can be applied in sequence: attribute threshold control is applied first to generalize each attribute, and generalized relation threshold control is then applied to further reduce the size of the generalized relation.
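Applying the two controls in sequence might look like the sketch below. The hierarchies, the toy rows, and the tiny thresholds (2 and 3, far below the typical defaults quoted above) are hypothetical. The choice of which attribute to generalize in the second step is also an assumption, since the text does not prescribe one.

```python
from collections import Counter

# Hypothetical multi-level concept hierarchies: value -> parent concept.
HIERARCHIES = {
    "city": {"Hangzhou": "China", "Beijing": "China", "Seattle": "USA",
             "Boston": "USA", "China": "ANY", "USA": "ANY"},
    "major": {"Physics": "Science", "Chemistry": "Science",
              "History": "Arts", "Music": "Arts",
              "Science": "ANY", "Arts": "ANY"},
}
ATTRIBUTES = ("city", "major")

rows = [
    ("Hangzhou", "Physics"), ("Beijing", "Chemistry"), ("Seattle", "History"),
    ("Boston", "Music"), ("Hangzhou", "History"), ("Seattle", "Physics"),
]


def climb(rows, index, parent):
    """Replace column `index` of every row by its parent concept."""
    return [row[:index] + (parent.get(row[index], row[index]),) + row[index + 1:]
            for row in rows]


def distinct(rows, index):
    """Number of distinct values in column `index`."""
    return len({row[index] for row in rows})


def generalize_with_thresholds(rows, attribute_threshold, relation_threshold):
    # Step 1: attribute threshold control, one attribute at a time.
    for i, name in enumerate(ATTRIBUTES):
        while distinct(rows, i) > attribute_threshold:
            lifted = climb(rows, i, HIERARCHIES[name])
            if lifted == rows:      # top of the hierarchy reached
                break
            rows = lifted
    # Step 2: relation threshold control on the number of distinct tuples.
    while len(set(rows)) > relation_threshold:
        # Assumed heuristic: lift the attribute with the most distinct values.
        i = max(range(len(ATTRIBUTES)), key=lambda j: distinct(rows, j))
        lifted = climb(rows, i, HIERARCHIES[ATTRIBUTES[i]])
        if lifted == rows:
            break
        rows = lifted
    return Counter(rows)            # merged tuples with their counts


print(generalize_with_thresholds(rows, attribute_threshold=2, relation_threshold=4))
print(generalize_with_thresholds(rows, attribute_threshold=2, relation_threshold=3))
```

With a relation threshold of 4, the first step alone is enough. With a threshold of 3, the second step lifts one more attribute to bring the relation under the limit.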

4. Association Rules

Definition 1. Let D be the dataset for association rule mining (D is generally a transaction database), D = {T1, T2, ..., Tk, ..., Tn}, where each Tk (k = 1, 2, ..., n) is a transaction.

Tk = {i1, i2, ..., ij, ..., ip}, and each element ij (j = 1, 2, ..., p) is called an item.

Definition 2. The set I = {i1, i2, ..., ij, ..., im} is the set of all items in D. Any subset X of I (X ⊆ I) is called an itemset in D. If |X| = k, X is called a k-itemset. Let Ti and X be a transaction and an itemset in D, respectively. If X ⊆ Ti, then the transaction Ti contains the itemset X. Obviously, Ti ⊆ I.
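These definitions map directly onto Python sets. The sketch below uses a hypothetical four-transaction database; `D`, `I`, and `X` follow the notation of the definitions, and the item names are made up.

```python
# Hypothetical transaction database D: each transaction is a set of items.
D = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "diapers", "beer"}),
    frozenset({"milk", "diapers", "beer"}),
    frozenset({"bread", "milk", "diapers"}),
]

# I is the set of all items that occur in D.
I = frozenset().union(*D)

# X is an itemset (a subset of I); with |X| = 2 it is a 2-itemset.
X = frozenset({"bread", "milk"})
k = len(X)

# A transaction T contains X exactly when X is a subset of T.
containing = [T for T in D if X <= T]

print(sorted(I))
print(k, len(containing))
print(all(T <= I for T in D))  # every transaction is a subset of I
```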

5. Association knowledge reflects the dependency or correlation between one event and other events. If two or more attributes are associated, the value of one of them can be predicted from the values of the others.

6. Association rule mining finds valuable associations between data items in a large amount of data. As the data collected and stored in databases grows, people are increasingly interested in mining association knowledge from it. For example, discovering valuable association knowledge from a large number of commercial transaction records can help with catalog design, cross-marketing, and other related business decisions.

7. A typical application of association knowledge mining is market basket analysis.

"Which groups or sets of products are customers most likely to buy together in a single shopping trip?"

Given: a transaction database in which each transaction is a list of items (the items bought by a customer in one visit).

Find: all rules that associate the presence of one set of items with the presence of another set of items.

E.g., 98% of customers who buy auto parts also buy auto services.

Applications:

* → Maintenance agreement (what can be done to boost sales of maintenance agreements?)

Household appliances → * (which other products should be kept well stocked?)
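As a first step toward answering the market basket question, one can simply count how often each pair of items appears in the same transaction. The sketch below does this for a hypothetical list of baskets; it is a brute-force illustration, not an efficient frequent-itemset algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: items bought together in one visit.
transactions = [
    {"auto parts", "auto services"},
    {"auto parts", "auto services", "tires"},
    {"auto parts", "auto services", "oil"},
    {"auto services", "car wash"},
]

pair_counts = Counter()
for basket in transactions:
    # Sort so that each pair is always counted under the same key.
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

top_pair, top_count = pair_counts.most_common(1)[0]
print(top_pair, top_count)
```

Pairs that co-occur often are candidates for rules such as "auto parts → auto services". Whether such a rule is actually useful depends on its support and confidence.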

Confidence and support in rules

	Low confidence (accuracy)	High confidence (accuracy)
High support (coverage)	The rule is rarely correct but applies frequently.	The rule is correct in most cases and applies frequently.
Low support (coverage)	The rule is rarely correct and rarely applies.	The rule is correct in most cases but rarely applies.
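The notes do not spell out how support and confidence are computed, so the sketch below assumes the usual definitions. The support of X → Y is the fraction of transactions that contain both X and Y, and the confidence is the fraction of transactions containing X that also contain Y. The transactions and function names are hypothetical.

```python
def count_containing(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def support(antecedent, consequent, transactions):
    """Fraction of all transactions that contain antecedent and consequent."""
    both = count_containing(antecedent | consequent, transactions)
    return both / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Fraction of transactions with the antecedent that also have the consequent."""
    with_antecedent = count_containing(antecedent, transactions)
    if with_antecedent == 0:
        return 0.0  # the rule never applies
    both = count_containing(antecedent | consequent, transactions)
    return both / with_antecedent


# Hypothetical transactions.
transactions = [
    {"auto parts", "auto services"},
    {"auto parts", "auto services", "tires"},
    {"auto parts", "tires"},
    {"auto services", "car wash"},
    {"auto parts", "auto services"},
]

X = {"auto parts"}
Y = {"auto services"}
print(support(X, Y, transactions))     # 3 of 5 transactions
print(confidence(X, Y, transactions))  # 3 of the 4 transactions with auto parts
```

In terms of the table, this rule has fairly high support (it covers 60% of the transactions) and fairly high confidence (it holds in 75% of the cases where it applies).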

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
