[Introduction to Data Mining]-Introduction to data types and Data Mining

**Data Type**

Datasets differ in many ways. For example, the attributes that describe data objects can be of different types: quantitative or qualitative. A dataset may also have special characteristics, such as containing time series or objects that are explicitly related to one another. The data type matters because it determines which tools and techniques can be used to analyze the data. In addition, data mining research is often driven by the need to accommodate new application areas and new data types.

**Data quality** Data is often far from perfect. Although most data mining techniques tolerate some imperfection in the data, understanding and improving data quality typically improves the quality of the resulting analysis.

**Preprocessing steps that make data suitable for data mining** Raw data usually has to be processed before it can be analyzed. Processing serves two purposes: improving the quality of the data, and modifying the data so that it better fits a specific data mining technique or tool.

**Analyzing data in terms of its relationships** One approach to data analysis is to find the relationships among data objects and then perform the rest of the analysis on these relationships rather than on the data objects themselves.

Generally, a dataset can be viewed as a collection of data objects. A data object may also be called a record, point, vector, or pattern. Each data object is described by a set of attributes that capture its basic characteristics; attributes are also known as variables, fields, features, or dimensions.

**Attributes and Measurements**

**What is an attribute** An **attribute** is a property or characteristic of an object. It may vary from object to object or over time. At the most basic level, an attribute is not a number or a symbol. However, to discuss and analyze the characteristics of objects, we assign numbers or symbols to them. To do this in a well-defined way, we need a measurement scale.

**Measurement scale** A measurement scale is a rule (function) that associates a numeric or symbolic value with an attribute of an object. Formally, the process of measurement applies a measurement scale to associate a value with a particular attribute of a specific object. This may sound abstract, but we perform measurements all the time in daily life. For example, when we board a bus, we check whether there are any seats left for us. In such cases, the physical value of an object's attribute is mapped to a numeric or symbolic value.

**Attribute type** It follows from the previous discussion that the properties of an attribute need not be the same as the properties of the values used to measure it. In other words, the values used to represent an attribute may have properties that the attribute itself does not have, and vice versa.

The type of an attribute tells us which properties of the attribute are reflected in the values used to measure it. Knowing the attribute type is important because it tells us which properties of the measured values are consistent with the underlying properties of the attribute, which lets us avoid foolish actions such as computing the average employee ID. Note that the attribute type is often called the type of the measurement scale.

**Different types of attributes** A useful way to specify the type of an attribute is to identify the properties of numbers that correspond to the underlying properties of the attribute. For example, an attribute such as length has many of the properties of numbers: it makes sense to compare objects by length, to order them by length, and to talk about differences and ratios of length. The following operations on numbers are typically used to describe attributes: distinctness (= and ≠), order (<, ≤, >, ≥), addition (+ and -), and multiplication (* and /).

Given these properties, we can define four types of attributes: nominal, ordinal, interval, and ratio.
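As a rough illustration of how the four types relate to the operations that are meaningful on their values, the sketch below encodes the usual hierarchy: each type supports the operations of the previous type plus one more. The names `MEANINGFUL_OPS`, `REQUIRED_OP`, and `is_meaningful` are hypothetical helpers introduced only for this example, and the mapping from summaries to operations is one reasonable choice rather than an exhaustive list.

```python
# Operations that are meaningful for each attribute type.
# Each type inherits the operations of the types listed before it.
MEANINGFUL_OPS = {
    "nominal": {"distinctness"},
    "ordinal": {"distinctness", "order"},
    "interval": {"distinctness", "order", "addition"},
    "ratio": {"distinctness", "order", "addition", "multiplication"},
}

# Which operation a given summary relies on (illustrative assumption).
REQUIRED_OP = {
    "mode": "distinctness",
    "median": "order",
    "mean": "addition",
    "ratio_of_values": "multiplication",
}


def is_meaningful(summary: str, attribute_type: str) -> bool:
    """Return True if the summary makes sense for the attribute type."""
    return REQUIRED_OP[summary] in MEANINGFUL_OPS[attribute_type]


# An employee ID is nominal: finding the most common ID is fine,
# but averaging IDs is not.
print(is_meaningful("mode", "nominal"))   # True
print(is_meaningful("mean", "nominal"))   # False
print(is_meaningful("mean", "interval"))  # True
```

A check like this could be run before computing summary statistics, so that an analysis never averages an identifier column by accident.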

The type of an attribute can also be described in terms of the transformations that do not change its meaning. For example, the meaning of a length attribute is the same whether it is measured in meters or in feet. A second table, following the table of the four attribute types, lists the allowed transformations for each type.
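To make the idea of a meaning-preserving transformation concrete, here is a small sketch. It assumes the standard conversions (1 m = 3.28084 ft, F = 9/5 C + 32); the function names and sample values are made up for illustration. It shows that a unit change on a ratio attribute (length) keeps ratios of values intact, whereas a unit change on an interval attribute (temperature in Celsius) keeps only ratios of differences.

```python
import math


def meters_to_feet(m: float) -> float:
    """Ratio-scale transformation: new_value = a * old_value."""
    return m * 3.28084


def celsius_to_fahrenheit(c: float) -> float:
    """Interval-scale transformation: new_value = a * old_value + b."""
    return c * 9.0 / 5.0 + 32.0


# Length: "twice as long" survives the change of unit.
a_m, b_m = 2.0, 4.0
ratio_m = b_m / a_m
ratio_ft = meters_to_feet(b_m) / meters_to_feet(a_m)
print(math.isclose(ratio_m, ratio_ft))  # True

# Temperature: the ratio of two values is NOT preserved...
a_c, b_c, c_c = 10.0, 20.0, 40.0
ratio_c = b_c / a_c
ratio_f = celsius_to_fahrenheit(b_c) / celsius_to_fahrenheit(a_c)
print(math.isclose(ratio_c, ratio_f))  # False

# ...but the ratio of two differences is.
diff_ratio_c = (c_c - b_c) / (b_c - a_c)
diff_ratio_f = (
    (celsius_to_fahrenheit(c_c) - celsius_to_fahrenheit(b_c))
    / (celsius_to_fahrenheit(b_c) - celsius_to_fahrenheit(a_c))
)
print(math.isclose(diff_ratio_c, diff_ratio_f))  # True
```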

**Describing attributes by the number of values** An independent way to distinguish attributes is by the number of values they can take. **Discrete** A discrete attribute has a finite or countably infinite set of values. Discrete attributes are usually represented by integer variables. A binary attribute is a special case of a discrete attribute that takes only two values, for example true/false, yes/no, or 0/1. Binary attributes are usually represented by Boolean variables.

**Continuous** A continuous attribute is one whose values are real numbers, such as temperature or height. Continuous attributes are usually represented by floating-point variables.

In theory, any of the measurement scale types (nominal, ordinal, interval, ratio) could be combined with any of the types based on the number of attribute values (binary, discrete, continuous). However, some combinations occur only rarely or make little sense.

**Asymmetric attributes** For asymmetric attributes, only non-zero values are regarded as important. For example, consider a dataset in which each object is a student and each attribute records whether the student took a particular university course. If the student took the course corresponding to an attribute, the value is 1; otherwise, it is 0. Because each student takes only a small fraction of all available courses, most of the values in such a dataset are 0, so it makes more sense to focus on the non-zero values. A binary attribute for which only non-zero values matter is called an asymmetric binary attribute.

**Dataset types** There are many types of datasets. Generally, we divide them into three groups: record data, graph-based data, and ordered data.

**Common characteristics of datasets**

**Dimensionality** The dimensionality of a dataset is the number of attributes that its objects have. Data can be low-, medium-, or high-dimensional. Analyzing high-dimensional data runs into difficulties collectively known as the **curse of dimensionality**. For this reason, an important motivation for data preprocessing is **dimensionality reduction**.

**Sparsity** In some datasets, such as those with asymmetric attributes, most attributes of an object have the value 0. In many cases, fewer than 1% of the entries are non-zero. In practical terms, sparsity is an advantage, because only the non-zero values need to be stored and processed. This saves a great deal of computing time and storage space.
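The storage advantage of sparse, asymmetric binary data can be sketched as follows. The course list, student names, and enrolments are made-up data, and `to_dense_row` is a hypothetical helper; the point is only that keeping the set of non-zero attributes per object carries the same information as the full 0/1 row.

```python
# Hypothetical course catalogue (the attributes).
courses = ["algebra", "biology", "chemistry", "drawing", "economics",
           "french", "geology", "history", "italian", "java"]

# Sparse representation: for each student, keep only the courses taken,
# i.e. the attributes whose value is 1.
enrolled = {
    "alice": {"algebra", "java"},
    "bob": {"history"},
    "carol": {"biology", "chemistry"},
}


def to_dense_row(taken):
    """Expand one student's sparse record into a full 0/1 row."""
    return [1 if course in taken else 0 for course in courses]


dense = {student: to_dense_row(taken) for student, taken in enrolled.items()}

total_cells = len(enrolled) * len(courses)
nonzero_cells = sum(len(taken) for taken in enrolled.values())

print(dense["alice"])  # [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
print(nonzero_cells, "of", total_cells, "cells are non-zero")  # 5 of 30 ...
```

With a realistic catalogue of thousands of courses, the gap between the two representations would be far larger than in this toy example.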

**Resolution** Data can often be obtained at different resolutions, and its properties differ from one resolution to another. For example, the surface of the Earth looks very uneven at a resolution of a few meters, but relatively smooth at a resolution of tens of kilometers.

**Record data** Many data mining tasks assume that the dataset is a collection of records (data objects), each of which consists of a fixed set of data fields (attributes). The following describes different types of record data:

**Transaction data or market basket data** **Transaction data** is a special type of record data in which each record (transaction) involves a set of items. For example, the set of products a customer buys during one shopping trip constitutes a transaction, and the individual products purchased are the items. This type of data is called **market basket data**.
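A minimal sketch of transaction data, assuming each basket is stored as a Python set of item names. The baskets themselves are made-up data; the example simply counts in how many transactions each item appears.

```python
from collections import Counter

# Each transaction is the set of items bought in one shopping trip.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer", "cola"},
]

# Count the number of transactions that contain each item.
item_counts = Counter(item for basket in transactions for item in basket)

print(item_counts["bread"])  # 2
print(item_counts["cola"])   # 1
```

Using sets reflects the fact that, in this representation, a transaction records which items were bought, not how many of each.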

**Data matrix** If all data objects in a dataset have the same fixed set of numeric attributes, the data objects can be viewed as points (vectors) in a multi-dimensional space, where each dimension represents a different attribute of the object. Such a set of data objects can be represented by an m × n matrix with m rows, one per object, and n columns, one per attribute. This matrix is called a **data matrix** or **pattern matrix**.

**Sparse data matrix** A sparse data matrix is a special case of a data matrix in which the attributes are of the same type and are asymmetric; that is, only non-zero values are important. Transaction data is an example of a sparse data matrix that contains only 0-1 entries. Another common example is document data. The representation of a collection of documents is usually called a **document-term matrix** (Figure 2-2d in the book): the documents are the rows of the matrix, and the words are the columns.
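A document-term matrix can be sketched in a few lines. The two documents below are made up, tokenization is a plain whitespace split (a simplifying assumption; real text processing would normalize case, punctuation, and so on), and each entry is the number of times the word occurs in the document.

```python
# Made-up documents for illustration.
documents = [
    "data mining finds patterns in data",
    "graph data links objects",
]

# Columns: the sorted vocabulary across all documents.
vocabulary = sorted({word for doc in documents for word in doc.split()})

# Rows: one per document; each entry counts occurrences of the word.
matrix = [[doc.split().count(word) for word in vocabulary]
          for doc in documents]

print(vocabulary)
# ['data', 'finds', 'graph', 'in', 'links', 'mining', 'objects', 'patterns']
for row in matrix:
    print(row)
# [2, 1, 0, 1, 0, 1, 0, 1]
# [1, 0, 1, 0, 1, 0, 1, 0]
```

Even in this tiny example many entries are 0; with a realistic vocabulary almost all of them would be, which is why such matrices are normally stored in a sparse form.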

**Graph-based data** A graph can sometimes be a convenient and effective representation of data. We consider two specific cases: the graph captures the relationships among data objects, and the data objects themselves are represented as graphs.

**Data with relationships among objects** The relationships among objects often carry important information. In this case, the data is often represented as a graph: data objects are mapped to the nodes of the graph, and the relationships among objects are captured by the links between nodes and by link properties such as direction and weight. Web pages that link to one another are an example.
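One simple way to hold such linked data in memory is an adjacency dictionary. In the sketch below, the page names and link counts are hypothetical; the weight on each directed link is assumed to be the number of hyperlinks from the source page to the target page, and `in_degree` is a helper introduced for this example.

```python
# Directed, weighted graph of hypothetical web pages.
# links[source][target] = number of hyperlinks from source to target.
links = {
    "home.html": {"about.html": 1, "blog.html": 2},
    "about.html": {"home.html": 1},
    "blog.html": {"home.html": 1, "about.html": 1},
}


def in_degree(page):
    """Number of distinct pages that link to `page`."""
    return sum(1 for targets in links.values() if page in targets)


print(in_degree("home.html"))  # 2
print(in_degree("blog.html"))  # 1
```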

**Data with objects that are graphs** If an object has structure, that is, if it contains sub-objects that are related to one another, it is often represented as a graph. The structure of a chemical compound is one example.

**Ordered data** For some types of data, the attributes have relationships that involve order in time or space, as follows:

**Sequential data** **Sequential data**, also known as **temporal data**, can be seen as an extension of record data in which each record has a time associated with it. A time can also be associated with each attribute. For example, each record could be the purchase history of a customer, with a list of the items bought at different times. Using this information, we might find, for instance, that people who buy an iPhone tend not to be interested in low-end Android devices.

**Sequence data** **Sequence data** is a dataset that consists of a sequence of individual entities, such as a sequence of words or letters, or a genome sequence.

**Time series data** **Time series data** is a special type of sequential data in which each record is a **time series**, that is, a series of measurements taken over a period of time. Figure 2-4c in the book shows a time series of averages from January 1, 1982 to January 1, 1994. Note: when analyzing temporal data, consider **temporal autocorrelation**; that is, if two measurements are close in time, their values are usually very similar.
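Temporal autocorrelation can be quantified in several ways; one common choice is the sample lag-1 autocorrelation, which compares each measurement with the next one. The sketch below uses that statistic on two made-up series. The function name `lag1_autocorrelation` and the data are assumptions for illustration, not something taken from the book.

```python
def lag1_autocorrelation(series):
    """Sample lag-1 autocorrelation of a list of numbers."""
    n = len(series)
    mean = sum(series) / n
    # Covariance-like term between each value and its successor.
    num = sum((series[t] - mean) * (series[t + 1] - mean)
              for t in range(n - 1))
    # Total variation around the mean.
    den = sum((x - mean) ** 2 for x in series)
    return num / den


# Slowly changing series: neighbouring values are similar.
smooth = [10, 11, 12, 13, 14, 15, 14, 13, 12, 11]
# Alternating series: neighbouring values are dissimilar.
jagged = [10, 15, 10, 15, 10, 15, 10, 15, 10, 15]

print(lag1_autocorrelation(smooth) > 0)  # True
print(lag1_autocorrelation(jagged) < 0)  # True
```

A clearly positive value, as for the smooth series, is the pattern the note above describes: measurements close in time tend to have similar values.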

**Spatial data** Some data also has spatial attributes, such as positions or areas. There are many examples of spatial data, such as meteorological data collected at different locations. An important aspect of spatial data is **spatial autocorrelation**; that is, objects that are physically close tend to be similar in other ways as well.

**Handling non-record data** Most data mining algorithms are designed for record data or its variations (transaction data, data matrices). Record-oriented techniques can still be applied to non-record data by extracting features from the data objects and using these features to create a record for each object. For example, for chemical structure data, given a set of common substructures, each compound can be represented as a record with binary attributes that indicate whether the compound contains a specific substructure. This representation is in effect a transaction dataset, in which the transactions are the compounds and the items are the substructures.
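The conversion described above can be sketched as follows. The substructure names and compounds are placeholders, not real chemistry data, and `to_record` is a hypothetical helper; in practice, deciding whether a compound contains a substructure would require a graph-matching step that is assumed to have been done already.

```python
# Hypothetical set of common substructures (the binary attributes).
substructures = ["benzene_ring", "hydroxyl", "carboxyl", "amine"]

# For each compound, the substructures assumed to have been found in it.
compounds = {
    "compound_A": {"benzene_ring", "hydroxyl"},
    "compound_B": {"carboxyl", "amine"},
    "compound_C": {"benzene_ring", "carboxyl"},
}


def to_record(found):
    """Binary record: 1 if the compound contains the substructure."""
    return [int(s in found) for s in substructures]


records = {name: to_record(found) for name, found in compounds.items()}

print(records["compound_A"])  # [1, 1, 0, 0]
print(records["compound_C"])  # [1, 0, 1, 0]
```

Once in this form, the compounds can be passed to any algorithm that expects record or transaction data.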

Introduction

This book comprehensively introduces the theory and methods of data mining, focusing on how to apply data mining to practical problems across a wide range of disciplines and applications. It contains many figures, comprehensive examples, and a wide variety of exercises, and it uses examples, concise descriptions of key algorithms, and exercises to focus as much as possible on the main concepts of data mining. The book does not require a database background and needs only a modest background in statistics or mathematics, so it is suitable for a wide range of readers.

This book comprehensively introduces the theory and methods of data mining, aiming to give readers the knowledge they need to apply data mining to practical problems. It covers five topics: data, classification, association analysis, clustering, and anomaly detection. Apart from anomaly detection, each topic is covered in two chapters: the first describes basic concepts, representative algorithms, and evaluation techniques, and the second discusses advanced concepts and algorithms in depth. The goal is to give readers a thorough understanding of the basics of data mining while also covering important advanced topics. In addition, the book provides a large number of examples, figures, tables, and exercises.

This book is suitable for data mining courses for senior undergraduates and graduate students in related majors. It can also serve as a reference for researchers and application developers working on data mining.

--------------------------------------------------------------------------------

Author Profile

He is now an assistant professor in the Department of Computer Science and Engineering at Michigan State University, mainly teaching data mining, database systems, and other courses. Previously, he was an associate researcher at the U.S. Army High Performance Computing Research Center at the University of Minnesota (2002-2003).

--------------------------------------------------------------------------------

Table of Contents

Chapter 1 Introduction

1.1 What Is Data Mining

1.2 Challenges in Data Mining

1.3 Origins of Data Mining

1.4 Data Mining Tasks

1.5 Contents and Organization of the Book