Introduction to Data Mining Reading Notes
Data mining was made possible by rapid advances in data collection and storage technologies. It combines traditional data analysis methods with sophisticated algorithms for processing large volumes of data, and it opens exciting opportunities to explore new types of data and to analyze familiar data types in new ways.
Data mining is the process of automatically discovering useful information in large data repositories.
Data Mining and Knowledge Discovery: Data mining is an integral part of Knowledge Discovery in Databases (KDD), the overall process of converting raw data into useful information.
Input data: can be stored in a variety of formats and can reside in a centralized data repository or be distributed across multiple sites.
Data preprocessing: converts raw input data into a form suitable for analysis. This includes fusing data from multiple sources, cleaning the data to remove noise and duplicate observations, and selecting the records and features relevant to the current data mining task. It is the most laborious and time-consuming step in the entire knowledge discovery process.
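As a minimal sketch of these preprocessing steps, the following Python/pandas snippet fuses two toy tables, removes duplicates and missing values, and keeps only the relevant records and features. All table and column names (customer_id, age, region, amount) are hypothetical placeholders, not part of the original notes.

```python
import pandas as pd

# Toy stand-ins for data arriving from two different sources;
# all names and values are illustrative only.
orders = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "amount": [25.0, 40.0, 40.0, -5.0, 12.5],   # one duplicate row, one bad value
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "age": [34, 28, None, 51],                  # one missing value (noise)
    "region": ["north", "south", "east", "west"],
})

# Fuse data from multiple sources on a shared key.
data = orders.merge(customers, on="customer_id", how="inner")

# Clean: drop duplicate observations and rows with missing values.
data = data.drop_duplicates().dropna()

# Select only the records and features relevant to the mining task.
data = data[data["amount"] > 0][["age", "region", "amount"]]
print(data)
```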
Post-processing: ensures that only valid and useful results are integrated into the decision support system, for example by combining the rules revealed by data mining with business activity management tools in order to carry out or test effective business actions.
Problems to be Solved in Data Mining
Scalability: As data generation and collection technologies advance, massive datasets are becoming increasingly common. For a data mining algorithm to handle such datasets, it must be scalable. Scalability can be improved by using sampling techniques or by developing parallel and distributed algorithms.
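A small illustration of the sampling idea, assuming a synthetic NumPy array as a stand-in for a massive dataset (the sizes are arbitrary): statistics computed on a random sample approximate those of the full data at a fraction of the cost.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
large_dataset = rng.normal(size=(1_000_000, 5))  # stand-in for a massive dataset

# Instead of processing every row, work on a uniform random sample.
sample_idx = rng.choice(large_dataset.shape[0], size=10_000, replace=False)
sample = large_dataset[sample_idx]

# The sample mean approximates the full-data mean at much lower cost.
print("sample mean:", sample.mean(axis=0))
print("full mean:  ", large_dataset.mean(axis=0))
```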
High dimensionality: Datasets today often have hundreds or thousands of attributes, and datasets with temporal or spatial components tend to have especially high dimensionality. Traditional data analysis techniques developed for low-dimensional data do not handle high-dimensional data well, and for some algorithms the computational complexity grows rapidly as the number of dimensions (features) increases.
Heterogeneous and complex data: Traditional data analysis methods handle datasets whose attributes are all of the same type, either continuous or categorical. As data mining plays an increasingly important role in business, science, and other fields, it increasingly needs to handle heterogeneous attributes and complex objects, such as DNA data with sequence and three-dimensional structure. Techniques for mining such complex objects should take the relationships in the data into account, for example temporal and spatial autocorrelation and graph connectivity.
Data ownership and distribution: Sometimes the data to be analyzed is not stored at a single site or owned by a single organization, but is geographically distributed across multiple organizations. This requires distributed data mining techniques. The main challenges for distributed data mining algorithms include how to reduce the communication required for distributed computation, how to effectively consolidate the mining results obtained from multiple sources, and how to address data security.
Non-traditional analysis: The traditional statistical approach is based on a hypothesize-and-test paradigm: propose a hypothesis, design an experiment to collect data, and then analyze the data with respect to the hypothesis. This process is laborious, so data mining must be able to generate and evaluate hypotheses automatically. In addition, the data analyzed in data mining is typically not the result of carefully designed experiments; it is often an opportunistic sample rather than a random sample.
The Origins of Data Mining: To address these challenges, data mining draws on ideas from the following areas:
- Statistical sampling, estimation, hypothesis test
- Artificial intelligence, pattern recognition, and machine learning: search algorithms, modeling techniques, and learning theory
- Optimization
- Evolutionary computing
- Information Theory
- Signal Processing
- Visualization
- Information Retrieval
- Database System
- High-Performance Parallel Computing Technology
- Distributed Technology
Data Mining tasks
Data mining tasks generally fall into two categories:
Prediction tasks: predict the value of a particular attribute based on the values of other attributes. The attribute being predicted is called the target variable (or dependent variable); the attributes used for prediction are called explanatory variables (or independent variables).
Description tasks: derive patterns (correlations, trends, clusters, trajectories, and anomalies) that summarize the underlying relationships in the data. Descriptive data mining tasks are usually exploratory in nature and often require post-processing techniques to validate and explain the results.
Predictive modeling: builds a model for the target variable as a function of the explanatory variables. There are two kinds of predictive modeling tasks: classification, used for discrete target variables, and regression, used for continuous target variables. For example, predicting whether a web user will make an online purchase is a classification task because the target variable is binary, while predicting the future price of a stock is a regression task because price is a continuous-valued attribute. Both tasks train a model that minimizes the error between the predicted and actual values of the target variable.
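A minimal sketch of the two predictive modeling tasks, assuming scikit-learn and synthetic data generated on the fly (the datasets and model choices are illustrative, not taken from the notes):

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.model_selection import train_test_split

# Classification: discrete (here binary) target, e.g. "will the user buy?"
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)
print("classification accuracy:", clf.score(X_te, y_te))

# Regression: continuous target, e.g. a future stock price.
X, y = make_regression(n_samples=500, n_features=5, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
reg = LinearRegression().fit(X_tr, y_tr)
print("regression R^2:", reg.score(X_te, y_te))
```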
Association analysis: discovers patterns that describe strongly associated features in the data. The discovered patterns are usually expressed as implication rules or feature subsets. Because the search space is exponential in size, the goal of association analysis is to extract the most interesting patterns in an efficient manner.
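A toy sketch of the underlying idea: counting the support of small itemsets over a handful of made-up transactions. The brute-force enumeration below is only for illustration; real algorithms such as Apriori prune the exponential search space rather than enumerating it.

```python
from itertools import combinations
from collections import Counter

# Toy market-basket transactions; items are illustrative only.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

min_support = 0.6
n = len(transactions)

# Count how often every 1- and 2-item candidate occurs (brute force).
counts = Counter()
for t in transactions:
    for size in (1, 2):
        for itemset in combinations(sorted(t), size):
            counts[itemset] += 1

# Keep only "frequent" itemsets whose support meets the threshold.
frequent = {s: c / n for s, c in counts.items() if c / n >= min_support}
for itemset, support in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(itemset, round(support, 2))
```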
Cluster analysis: finds groups of closely related observations so that observations in the same cluster are as similar as possible to one another, and as different as possible from observations in other clusters. Clustering has been used, for example, to group related customers and to find ocean regions that have a significant influence on the Earth's climate.
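A minimal clustering sketch, assuming scikit-learn's k-means as one possible algorithm and synthetic 2-D points drawn from three separated groups (the data and parameters are placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(seed=0)
# Synthetic observations from three well-separated groups.
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

# k-means groups observations so points within a cluster are close together.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print("cluster centers:\n", kmeans.cluster_centers_)
print("first few labels:", kmeans.labels_[:10])
```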
Anomaly detection: identifies observations whose characteristics differ significantly from the rest of the data. Such observations are called anomalies or outliers. The goal of an anomaly detection algorithm is to find the genuine anomalies while avoiding mislabeling normal objects as anomalous; in other words, a good anomaly detector must have a high detection rate and a low false alarm rate. Applications include network intrusion detection and fraud detection.
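As one simple illustration (not the only approach), the sketch below flags observations whose z-score deviates strongly from the bulk of synthetic data; the injected anomaly values and the threshold are arbitrary assumptions, and raising the threshold trades a lower false alarm rate against a possibly lower detection rate.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
values = rng.normal(loc=100.0, scale=5.0, size=1000)  # "normal" observations
values[::250] = [250.0, -40.0, 300.0, 190.0]          # inject a few anomalies

# Flag observations whose z-score is far from the bulk of the data.
z = (values - values.mean()) / values.std()
threshold = 4.0
anomalies = np.where(np.abs(z) > threshold)[0]
print("flagged indices:", anomalies)
```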