[Introduction to Data Mining]-Introduction to Data Mining

Source: Internet
Author: User

Introduction to Data Mining Reading Notes
Prerequisites for data mining: rapid advances in data collection and storage technologies. Data mining is a technology that combines traditional data analysis methods with sophisticated algorithms for processing large volumes of data. It opens up exciting opportunities to explore and analyze new types of data, and to analyze existing types of data in new ways.
Data mining is the process of automatically discovering useful information in large data repositories.
Data Mining and Knowledge Discovery
Data mining is an integral part of knowledge discovery in databases (KDD). KDD is the overall process of converting raw data into useful information.
Input data: The input can be stored in a variety of formats and may reside in one or more data repositories or be distributed across multiple sites.
Data preprocessing: Transforms the raw input data into a form suitable for analysis. The steps include fusing data from multiple sources, cleaning the data to remove noise and duplicate observations, and selecting the records and features relevant to the data mining task at hand. Preprocessing is often the most laborious and time-consuming step in the entire knowledge discovery process.
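The preprocessing steps above (fusing sources, removing duplicates and noisy values, selecting features) can be illustrated with a minimal sketch. The records, field names, and the valid age range are hypothetical and chosen only for illustration; real preprocessing pipelines are considerably more involved.

```python
# Hypothetical records from two sources; field names are illustrative only.
source_a = [
    {"id": 1, "age": 34, "income": 52000, "notes": "call back"},
    {"id": 2, "age": None, "income": 61000, "notes": ""},        # missing age
]
source_b = [
    {"id": 1, "age": 34, "income": 52000, "notes": "call back"},  # duplicate of id 1
    {"id": 3, "age": 290, "income": 48000, "notes": "vip"},       # implausible age (noise)
    {"id": 4, "age": 45, "income": 75000, "notes": ""},
]


def preprocess(sources, features, valid_age=(0, 120)):
    """Fuse sources, drop duplicate and noisy records, keep selected features."""
    seen_ids = set()
    cleaned = []
    for source in sources:
        for record in source:
            # Duplicate removal: keep only the first record seen for each id.
            if record["id"] in seen_ids:
                continue
            seen_ids.add(record["id"])
            # Cleaning: skip records with a missing or out-of-range age.
            age = record["age"]
            if age is None or not (valid_age[0] <= age <= valid_age[1]):
                continue
            # Feature selection: keep only the attributes relevant to the task.
            cleaned.append({name: record[name] for name in features})
    return cleaned


cleaned = preprocess([source_a, source_b], features=("id", "age", "income"))
print(cleaned)
```

Only the records with id 1 and id 4 survive: the record with id 2 has a missing age, the second copy of id 1 is a duplicate, and the record with id 3 has an age outside the assumed valid range.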
Post-processing: Ensures that only valid and useful results are integrated into the decision support system. In business applications, for example, the rules revealed by data mining can be combined with campaign management tools so that effective business campaigns can be carried out or tested.
Problems to be Solved in Data Mining
Scalability: As data generation and collection technology advances, massive data sets are becoming more and more common. A data mining algorithm that has to process such data sets must be scalable. Scalability can be improved by using sampling or by developing parallel and distributed algorithms.
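As a sketch of how sampling can help with scalability, the following uses reservoir sampling, one common technique that the notes do not name specifically. It draws a fixed-size uniform random sample in a single pass, without holding the full data set in memory. The function name and the seed parameter are assumptions made for this illustration.

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    reservoir = []
    for index, item in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (index + 1).
            j = rng.randint(0, index)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# The stream is consumed lazily, so memory use stays proportional to k.
sample = reservoir_sample(range(100_000), k=5)
print(sample)
```

A mining algorithm can then be run on the sample instead of the full data, trading some accuracy for a large reduction in processing cost.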
High dimensionality: It is now common to encounter data sets with hundreds or thousands of attributes. Data sets with temporal or spatial components also tend to have high dimensionality. Traditional data analysis techniques developed for low-dimensional data often do not work well on high-dimensional data. In addition, for some data analysis algorithms the computational complexity increases rapidly as the dimensionality (the number of features) increases.
Heterogeneous and complex data: Traditional data analysis methods often deal only with data sets whose attributes are all of the same type, either continuous or categorical. As data mining plays an increasingly important role in business, science, and other fields, there is a growing need to handle heterogeneous attributes and complex objects, for example DNA data with sequential and three-dimensional structure. Techniques developed for mining such complex objects should take the relationships in the data into account, such as temporal and spatial autocorrelation and graph connectivity.
Data ownership and distribution: Sometimes the data to be analyzed is not stored at one site or owned by one organization, but is geographically distributed across multiple organizations. This requires the development of distributed data mining techniques. The main challenges facing distributed data mining algorithms are how to reduce the amount of communication needed for the distributed computation, how to effectively consolidate the data mining results obtained from multiple sources, and how to address data security issues.
Non-traditional analysis: The traditional statistical approach is based on a hypothesize-and-test paradigm: propose a hypothesis, design an experiment to collect data, and then analyze the data with respect to the hypothesis. This process is slow and labor-intensive, so hypothesis generation and evaluation need to be automated. In addition, the data sets analyzed in data mining are typically not the result of a carefully designed experiment, and they often represent opportunistic samples of the data rather than random samples.

The Origins of Data Mining
To address these challenges, data mining draws on ideas from the following areas:

  • Sampling, estimation, and hypothesis testing from statistics
  • Search algorithms, modeling techniques, and learning theory from artificial intelligence, pattern recognition, and machine learning
  • Optimization
  • Evolutionary computing
  • Information theory
  • Signal processing
  • Visualization
  • Information retrieval
  • Database systems
  • High-performance (parallel) computing
  • Distributed computing


Data Mining Tasks
Data mining tasks are generally divided into two categories:
Predictive tasks: Predict the value of a particular attribute based on the values of other attributes. The attribute to be predicted is called the target or dependent variable. The attributes used to make the prediction are called explanatory or independent variables.
Descriptive tasks: Derive patterns (correlations, trends, clusters, trajectories, and anomalies) that summarize the underlying relationships in the data. Descriptive data mining tasks are usually exploratory in nature and often require post-processing techniques to validate and explain the results.
Predictive Modeling
Predictive modeling refers to building a model for the target variable as a function of the explanatory variables. There are two types of predictive modeling tasks: classification, which is used for discrete target variables, and regression, which is used for continuous target variables. For example, predicting whether a web user will make a purchase at an online store is a classification task, because the target variable is binary. Predicting the future price of a stock is a regression task, because price is a continuous-valued attribute. In both tasks, the goal is to learn a model that minimizes the error between the predicted and actual values of the target variable.
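The difference between the two tasks can be sketched with two deliberately tiny models: a least-squares line for a continuous target and a single-threshold rule for a binary target. Both are fitted by minimizing an error measure on the training data. The data values and function names are hypothetical, and these models are illustrations rather than the methods the book itself presents.

```python
def fit_line(xs, ys):
    """Regression: fit y = slope * x + intercept by minimizing squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def fit_threshold(xs, labels):
    """Classification: pick t minimizing errors of the rule 'predict 1 if x >= t'."""
    best_t, best_errors = None, None
    for t in sorted(set(xs)):
        errors = sum((x >= t) != bool(label) for x, label in zip(xs, labels))
        if best_errors is None or errors < best_errors:
            best_t, best_errors = t, errors
    return best_t, best_errors


# Continuous target (hypothetical data that follows y = 2x + 1 exactly).
slope, intercept = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])

# Binary target (hypothetical: minutes on site vs. whether the user purchased).
threshold, errors = fit_threshold([1, 2, 8, 9], [0, 0, 1, 1])

print(slope, intercept, threshold, errors)
```

The regression model outputs a real number for any input, while the classifier outputs one of two labels. That is the only structural difference between the two tasks in this sketch.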

Association Analysis
Association analysis is used to discover patterns that describe strongly associated features in the data. The discovered patterns are usually represented as implication rules or feature subsets. Because the search space is exponential in size, the goal of association analysis is to extract the most interesting patterns in an efficient manner.
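To make the idea of an implication rule concrete, the sketch below scores a rule with support and confidence, two standard measures that the notes above do not define. It also counts the candidate itemsets to show why the search space grows exponentially. The transactions are hypothetical market-basket data.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """How often the consequent appears in transactions containing the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# Score the rule {diapers} -> {beer}.
rule_support = support({"diapers", "beer"}, transactions)
rule_confidence = confidence({"diapers"}, {"beer"}, transactions)

# Every non-empty subset of the items is a candidate itemset.
items = set().union(*transactions)
candidate_itemsets = 2 ** len(items) - 1

print(rule_support, rule_confidence, candidate_itemsets)
```

Here the rule holds in 3 of 5 transactions (support 0.6) and in 3 of the 4 transactions that contain diapers (confidence of about 0.75). With only 6 distinct items there are already 63 candidate itemsets, which is why practical algorithms prune the search rather than enumerate it.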
Cluster Analysis
Cluster analysis seeks groups of closely related observations, so that observations in the same cluster are more similar to each other than to observations in other clusters. Clustering has been used, for example, to group related customers and to find areas of the ocean that have a significant impact on the Earth's climate.
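One widely used way to form such groups is k-means, shown here as a minimal sketch. The notes do not name a specific clustering algorithm, so this choice, the sample points, and the starting centroids are assumptions for illustration.

```python
def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and recomputing centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical 2-D observations forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])

print(centroids)
print(clusters)
```

With these starting centroids the first three points end up in one cluster and the last three in the other. In practice the result depends on the initial centroids and on the chosen number of clusters.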

Anomaly Detection
Anomaly detection is the task of identifying observations whose characteristics differ significantly from the rest of the data. Such observations are known as anomalies or outliers. The goal of an anomaly detection algorithm is to discover the real anomalies while avoiding falsely labeling normal objects as anomalous. In other words, a good anomaly detector must have a high detection rate and a low false positive rate. Applications include network intrusion detection and fraud detection.
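The two quality measures mentioned above can be computed as in the following sketch. The detector is a simple rule based on the median absolute deviation, chosen only for illustration. The data, the ground-truth labels, and the threshold of 3 are all hypothetical.

```python
from statistics import median


def flag_anomalies(values, threshold=3.0):
    """Flag values farther than threshold * MAD from the median."""
    center = median(values)
    mad = median([abs(v - center) for v in values])  # median absolute deviation
    return [abs(v - center) > threshold * mad for v in values]


def rates(flags, truth):
    """Return (detection rate, false positive rate) for flags against true labels."""
    true_positives = sum(f and t for f, t in zip(flags, truth))
    false_positives = sum(f and not t for f, t in zip(flags, truth))
    anomalies = sum(truth)
    normals = len(truth) - anomalies
    return true_positives / anomalies, false_positives / normals


# Hypothetical measurements; only 50 and -30 are labeled as real anomalies.
values = [10, 11, 9, 10, 14, 10, 50, 11, 9, -30]
truth = [False, False, False, False, False, False, True, False, False, True]

flags = flag_anomalies(values)
detection_rate, false_positive_rate = rates(flags, truth)
print(detection_rate, false_positive_rate)
```

On this data the detector finds both real anomalies (detection rate 1.0) but also flags the normal value 14, giving a false positive rate of 1 in 8 (0.125). Raising the threshold would remove that false alarm, and raising it far enough would start to miss real anomalies. That trade-off is what the two rates capture.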









Answers to exercises in Data Mining

Introduction
This book comprehensively introduces the theories and methods of data mining, focusing on how to use data mining knowledge to solve practical problems across a wide range of subject areas and applications. It contains a large number of charts, comprehensive examples, and a wide range of exercises, and it uses examples, concise descriptions of key algorithms, and exercises to keep the focus on the main concepts of data mining as far as possible. The book does not require a database background and needs only a small amount of statistical or mathematical background knowledge, so it is suitable for a wide range of readers.

This book comprehensively introduces the theories and methods of data mining, aiming to provide readers with the knowledge necessary to apply data mining to practical problems. It covers five topics: data, classification, association analysis, clustering, and anomaly detection. Apart from anomaly detection, each topic is covered in two chapters: the first describes basic concepts, representative algorithms, and evaluation techniques, and the second discusses advanced concepts and algorithms in depth. The goal is to give readers a thorough understanding of the basics of data mining while also covering the more important advanced topics. In addition, the book provides a large number of examples, tables, and exercises.
This book is suitable for data mining courses for senior undergraduates and graduate students in related majors. It can also serve as a reference for data mining researchers and application developers.

--------------------------------------------------------------------------------

Author Profile
He is now an assistant professor in the Department of Computer Science and Engineering at Michigan State University, mainly teaching data mining, database systems, and other courses. Previously, he was a research associate at the U.S. Army High Performance Computing Research Center at the University of Minnesota (2002-2003).

--------------------------------------------------------------------------------

Table of Contents

Chapter 1 Introduction
1.1 What Is Data Mining
1.2 Challenges in Data Mining
1.3 Origins of Data Mining
1.4 Data Mining Tasks
1.5 Contents and Organization of the Book
