Chapter 1 Introduction
Data mining is a technology that blends traditional data analysis methods with sophisticated algorithms for processing large volumes of data. It provides exciting opportunities to explore and analyze new types of data, and to analyze old types of data in new ways. This chapter gives an overview of data mining and lists the key topics to be covered.
Here are some example applications of data mining:
Business: With point-of-sale data collection technology (barcode scanners, RFID, and smart card technology), retailers can collect up-to-date data on customer purchases at the checkout counters of their stores. Retailers can combine this information with other important business data, such as e-commerce website logs and customer service records from online stores, to better understand the needs of their customers and make more informed business decisions.
Data mining techniques can be used to support a wide range of business intelligence applications, such as customer profiling, targeted marketing, workflow management, store layout, and fraud detection. Data mining can also help retailers answer important business questions, such as "Who are the most valuable customers?", "What products can be cross-sold or up-sold?", and "What is the revenue outlook for the company next year?" Some of these questions gave rise to a new data analysis technique: association analysis.
Medicine, science, and engineering: For example, to gain a deeper understanding of the Earth's climate system, NASA has deployed a series of Earth-orbiting satellites that continuously collect global observations of the land surface, oceans, and atmosphere. However, because of the scale and spatio-temporal nature of these data, traditional methods are often unsuitable for analyzing such data sets. Data mining techniques can help scientists answer questions like "What is the link between the frequency and intensity of ecosystem disturbances such as droughts and hurricanes and global warming?", "What is the effect of ocean surface temperature on land precipitation and temperature?", and "How can the start and end of a region's growing season be predicted accurately?"
1.1 What is Data mining
Data mining is the process of automatically discovering useful information in large data repositories: it finds useful patterns that were previously unknown, and it can also predict future observations. For example, it can predict whether a new customer will spend more than $100 in a department store.
Not all information discovery tasks are considered data mining. For example, using a database management system to look up individual records, or using an Internet search engine to find particular Web pages, are tasks in the field of information retrieval. Although these tasks are important and may involve sophisticated algorithms and data structures, they rely mainly on traditional computer science techniques and obvious features of the data to build index structures for efficiently organizing and retrieving information. Nonetheless, data mining techniques are being used to enhance the capabilities of information retrieval systems.
Data Mining and knowledge discovery
Data mining is an integral part of knowledge discovery in databases (KDD), which is the whole process of converting raw data into useful information. This process consists of a series of transformation steps, from data preprocessing to postprocessing of the data mining results.
The input data can be stored in various formats and can reside in a centralized data repository or be distributed across multiple sites. The purpose of data preprocessing is to transform the raw input data into a form suitable for analysis. The steps involved in data preprocessing include merging data from multiple sources, cleaning the data to eliminate noise and duplicate observations, and selecting the records and features that are relevant to the data mining task at hand. Because of the many ways data can be collected and stored, data preprocessing can be the most laborious and time-consuming step in the entire knowledge discovery process.
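As a rough illustration of these preprocessing steps, the following sketch strings them together with pandas; the file names, column names, and thresholds are assumptions made purely for illustration.

```python
import pandas as pd

# Hypothetical input sources; file and column names are illustrative assumptions.
sales = pd.read_csv("pos_transactions.csv")    # point-of-sale records
weblog = pd.read_csv("web_clickstream.csv")    # e-commerce site logs

# Merge data from multiple sources on a shared customer key.
data = sales.merge(weblog, on="customer_id", how="inner")

# Clean the data: remove duplicate observations and records with missing values.
data = data.drop_duplicates()
data = data.dropna(subset=["amount", "age"])

# Eliminate obvious noise, e.g. negative purchase amounts.
data = data[data["amount"] > 0]

# Select only the records and features relevant to the task at hand.
relevant = data.loc[data["amount"] > 100,
                    ["customer_id", "age", "amount", "num_visits"]]
print(relevant.head())
```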
"Closing the loop" usually refers to the process of integrating data mining results into a decision support system. For example, in a business application, the patterns revealed by data mining can be combined with campaign management tools so that effective marketing promotions can be conducted and tested. Such integration requires a postprocessing step to ensure that only valid and useful results are incorporated into the decision support system. One example of postprocessing is visualization, which allows data analysts to explore the data and the data mining results from a variety of viewpoints. Statistical measures or hypothesis tests can also be applied during postprocessing to eliminate spurious data mining results.
1.2 Problems to be solved in data mining
Scalability: With advances in data generation and collection technology, data sets with sizes of gigabytes, terabytes, or even petabytes are becoming common. If data mining algorithms are to handle these massive data sets, they must be scalable. Many data mining algorithms employ special search strategies to cope with exponentially large search spaces. Achieving scalability may also require implementing novel data structures to access individual records in an efficient manner. For instance, out-of-core algorithms may be necessary when the data to be processed cannot fit into main memory. Using sampling techniques or developing parallel and distributed algorithms can also improve scalability.
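As a rough sketch of the out-of-core idea (the file name and column are assumptions), the code below computes a simple statistic by streaming a data set in chunks, so that only one chunk needs to fit in memory at a time.

```python
import pandas as pd

CHUNK_SIZE = 1_000_000     # rows held in memory at any one time
total, count = 0.0, 0

# Stream a hypothetical file that is too large to load all at once.
for chunk in pd.read_csv("huge_transactions.csv", chunksize=CHUNK_SIZE):
    total += chunk["amount"].sum()
    count += len(chunk)

print("mean transaction amount:", total / count)
```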
High dimensionality: It is now common to encounter data sets with hundreds or thousands of attributes. In bioinformatics, for example, progress in microarray technology has produced gene expression data involving thousands of features. Data sets with temporal or spatial components also tend to have very high dimensionality. For example, consider a data set containing temperature measurements at various locations: if the measurements are taken repeatedly over a long period of time, the number of dimensions (features) grows in proportion to the number of measurements. Traditional data analysis techniques developed for low-dimensional data often do not work well on such high-dimensional data. In addition, for some data analysis algorithms, the computational complexity increases rapidly as the dimensionality grows.
Heterogeneous and complex data: Traditional data analysis methods deal only with data sets whose attributes are all of the same type, either continuous or categorical. As the role of data mining in business, science, medicine, and other fields grows, so does the need for techniques that can handle heterogeneous attributes. Recent years have also seen the emergence of more complex data objects. Examples of such non-traditional data types include collections of Web pages containing semi-structured text and hyperlinks, DNA data with sequential and three-dimensional structure, and meteorological data consisting of time series measurements (temperature, pressure, etc.) at various locations on the Earth's surface. Techniques developed for mining such complex objects should take into account the relationships in the data, such as temporal and spatial autocorrelation, graph connectivity, and parent-child relationships between the elements of semi-structured text and XML documents.
Data ownership and distribution: Sometimes the data needed for an analysis is not stored at a single site or owned by a single organization, but is geographically distributed among resources belonging to multiple entities. This requires the development of distributed data mining techniques. The key challenges faced by distributed data mining algorithms include: (1) how to reduce the amount of communication needed to perform the distributed computation, (2) how to effectively consolidate data mining results obtained from multiple sources, and (3) how to address data security issues.
Non-traditional analysis: The traditional statistical approach is based on a hypothesize-and-test paradigm: a hypothesis is proposed, an experiment is designed to collect data, and the data is then analyzed with respect to the hypothesis. However, this process is labor-intensive. Current data analysis tasks often require generating and evaluating thousands of hypotheses, and the desire to automate this process of hypothesis generation and evaluation has prompted the development of data mining techniques. In addition, the data sets analyzed in data mining are typically not the result of carefully designed experiments; they often represent opportunistic samples of the data rather than random samples. Moreover, these data sets frequently involve non-traditional data types and data distributions.
1.3 Origins of data mining
Data mining draws on ideas from the following areas: (1) sampling, estimation, and hypothesis testing from statistics, and (2) search algorithms, modeling techniques, and learning theories from artificial intelligence, pattern recognition, and machine learning. Data mining has also quickly adopted ideas from other areas, including optimization, evolutionary computing, information theory, signal processing, visualization, and information retrieval.
Some other areas also play an important supporting role. In particular, database systems are needed to provide efficient storage, indexing, and query processing support. Techniques from high-performance (parallel) computing are often important for handling massive data sets. Distributed techniques can also help to process massive amounts of data, and they are essential when the data cannot be gathered in one location for processing.
(A figure showing the relationships between data mining and these other areas is not reproduced here.)
1.4 Data Mining tasks
Data mining tasks fall into two broad categories:
Predictive tasks: The goal of these tasks is to predict the value of a particular attribute based on the values of other attributes. The attribute to be predicted is commonly known as the target variable or dependent variable, while the attributes used to make the prediction are known as explanatory variables or independent variables.
Descriptive tasks: The goal is to derive patterns (correlations, trends, clusters, trajectories, and anomalies) that summarize the underlying relationships in the data. Descriptive data mining tasks are usually exploratory in nature and often require postprocessing techniques to validate and explain the results.
The four major data mining tasks that will be covered are as follows:
Predictive modeling: This involves building a model for the target variable as a function of the explanatory variables. There are two types of predictive modeling tasks: classification, which is used for discrete target variables, and regression, which is used for continuous target variables. For example, predicting whether a Web user will buy a book at an online bookstore is a classification task because the target variable is binary, while forecasting the future price of a stock is a regression task because price is a continuous-valued attribute. The goal of both tasks is to learn a model that minimizes the error between the predicted and actual values of the target variable. Predictive modeling can be used to identify customers who will respond to a product promotion, predict disturbances in the Earth's ecosystem, or judge whether a patient has a particular disease based on the results of medical tests.
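As a minimal numeric illustration of this idea (the data points are made up), the sketch below fits a regression model of the target variable as a function of a single explanatory variable so that the sum of squared errors between predicted and actual values is minimized.

```python
import numpy as np

# Made-up explanatory variable x and continuous target y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

# Fit y ~ w*x + b by least squares, i.e. minimize sum((y_pred - y)**2).
w, b = np.polyfit(x, y, deg=1)
y_pred = w * x + b
sse = np.sum((y_pred - y) ** 2)

print(f"model: y = {w:.2f}*x + {b:.2f}, sum of squared errors = {sse:.3f}")
```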
Example 1.1 (Predicting the species of a flower). Consider the task of predicting the species of a flower based on its characteristics. In particular, consider classifying an iris flower into one of the following three species: Setosa, Versicolour, or Virginica. To perform this task, we need a data set containing the characteristics of flowers of these three species. One such data set is the well-known Iris data set, available from the UCI Machine Learning Repository (Http://www.ics.uci.edu/~mlearn). In addition to the species of the flower, the data set contains four other attributes: sepal width, sepal length, petal length, and petal width. (A plot of petal width versus petal length for the 150 flowers in the data set is not reproduced here.) Petal width is divided into the categories low, medium, and high, corresponding to the intervals [0, 0.75), [0.75, 1.75), and [1.75, ∞). Petal length is likewise divided into low, medium, and high, corresponding to the intervals [0, 2.5), [2.5, 5), and [5, ∞). Based on these categories of petal width and petal length, rules for assigning the species can be derived (see the sketch following this example).
Although these rules do not classify all the flowers, they classify most of them well. Note that, based on petal width and petal length, flowers of the Setosa species are well separated from the Versicolour and Virginica species, but the latter two species overlap somewhat on these attributes.
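The notes do not reproduce the rules themselves, so the sketch below uses plausible rules consistent with the categories described (low width and low length for Setosa, medium width and medium length for Versicolour, everything else treated as Virginica); treat it as an illustration rather than the original rule set. It relies on the copy of the Iris data set that ships with scikit-learn.

```python
from sklearn.datasets import load_iris

iris = load_iris()
petal_length = iris.data[:, 2]              # petal length (cm)
petal_width = iris.data[:, 3]               # petal width (cm)
species = iris.target_names[iris.target]    # 'setosa', 'versicolor', 'virginica'

def bucket(value, low_cut, high_cut):
    """Map a measurement to 'low', 'medium', or 'high' using the given cut points."""
    if value < low_cut:
        return "low"
    return "medium" if value < high_cut else "high"

correct = 0
for pl, pw, true_label in zip(petal_length, petal_width, species):
    width_cat = bucket(pw, 0.75, 1.75)      # intervals [0,0.75), [0.75,1.75), [1.75,inf)
    length_cat = bucket(pl, 2.5, 5.0)       # intervals [0,2.5), [2.5,5), [5,inf)
    # Illustrative rules consistent with the categories described above.
    if width_cat == "low" and length_cat == "low":
        pred = "setosa"
    elif width_cat == "medium" and length_cat == "medium":
        pred = "versicolor"
    else:
        pred = "virginica"
    correct += (pred == true_label)

print(f"rule-based accuracy: {correct}/{len(species)} flowers classified correctly")
```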
Association analysis: This is used to discover patterns that describe strongly associated features in the data. The discovered patterns are usually expressed in the form of implication rules or feature subsets. Because the size of the search space is exponential, the goal of association analysis is to extract the most interesting patterns in an efficient manner. Useful applications of association analysis include finding groups of genes with related functionality, identifying Web pages that are accessed together, and understanding the relationships between different elements of the Earth's climate system.
Example 1.2 (Market basket analysis). The transactions in question are point-of-sale data collected at the checkout counters of a grocery store. Association analysis can be applied to find items that are frequently bought together by customers. For example, we may discover the rule {diaper} → {milk}, which suggests that customers who buy diapers are also likely to buy milk. This type of rule can be used to identify potential cross-selling opportunities among different items.
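A minimal sketch of the measures behind such a rule, using a handful of made-up grocery transactions: the support of {diaper, milk} and the confidence of {diaper} → {milk} are computed directly from their definitions.

```python
# Made-up transactions, each represented as a set of purchased items.
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "cola"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "cola"},
]

n = len(transactions)
support_diaper = sum("diaper" in t for t in transactions) / n
support_both = sum({"diaper", "milk"} <= t for t in transactions) / n

# Confidence of {diaper} -> {milk}: the fraction of diaper-containing
# transactions that also contain milk.
confidence = support_both / support_diaper

print(f"support({{diaper, milk}}) = {support_both:.2f}")
print(f"confidence(diaper -> milk) = {confidence:.2f}")
```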
Cluster analysis: The aim is to find groups of closely related observations, so that observations belonging to the same cluster are more similar to each other than to observations in other clusters. Clustering has been used to group sets of related customers, find areas of the ocean that have a significant impact on the Earth's climate, and compress data.
Example 1.3 (Document clustering). A collection of news articles can be grouped according to their respective topics. Each article is represented as a set of word-frequency pairs (w, c), where w is a word and c is the number of times the word appears in the article. In the data set considered here (table not reproduced), there are two natural clusters: the first cluster consists of the first four articles, which correspond to economic news, while the second cluster contains the last four articles, which correspond to health care news. A good clustering algorithm should be able to identify the two clusters based on the similarity of the words that appear in the articles.
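A small sketch of this idea using a made-up eight-article corpus (four economic snippets and four health care snippets), with TF-IDF weights standing in for the raw word-frequency pairs and K-means producing the two clusters.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Made-up corpus: four economy-related and four health-care-related snippets.
docs = [
    "stocks fell as interest rates and inflation worries grew",
    "the central bank raised interest rates to curb inflation",
    "quarterly earnings beat forecasts and the market rallied",
    "unemployment and trade deficits weighed on the economy",
    "the new vaccine reduced infection rates in clinical trials",
    "hospitals report a rise in flu patients this winter",
    "doctors recommend regular screening for early diagnosis",
    "the clinical study measured patient recovery after surgery",
]

# Represent each article by its weighted word frequencies, then cluster into two groups.
vectors = TfidfVectorizer().fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)   # articles sharing a label were assigned to the same cluster
```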
Anomaly detection: The task is to identify observations whose characteristics differ significantly from the rest of the data. Such observations are known as anomalies or outliers. The goal of an anomaly detection algorithm is to discover the true anomalies while avoiding falsely labeling normal objects as anomalous. In other words, a good anomaly detector must have a high detection rate and a low false alarm rate. Applications of anomaly detection include detecting fraud, network intrusions, unusual disease patterns, and ecosystem disturbances.
Example 1.4 (Credit card fraud detection). A credit card company records the transactions made by every cardholder, along with personal information such as credit limit, age, annual income, and address. Since the number of fraudulent cases is relatively small compared with the number of legitimate transactions, anomaly detection techniques can be applied to build a profile of each user's legitimate transactions. When a new transaction arrives, it is compared with the profile; if its characteristics are very different from the previously constructed profile, the transaction is flagged as potentially fraudulent.
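A very simplified sketch of profile-based detection (the transaction amounts and the threshold are made-up assumptions): the cardholder's legitimate history is summarized by its mean and standard deviation, and a new transaction is flagged when it deviates too far from that profile.

```python
import numpy as np

# Made-up history of one cardholder's legitimate transaction amounts (dollars).
history = np.array([23.5, 41.0, 18.2, 36.9, 52.3, 29.8, 44.1, 31.7])

# Profile of legitimate behaviour: mean and standard deviation of past amounts.
mu, sigma = history.mean(), history.std()

def looks_fraudulent(amount, threshold=3.0):
    """Flag a transaction whose amount deviates strongly from the profile."""
    z_score = abs(amount - mu) / sigma
    return z_score > threshold

for new_amount in (38.0, 950.0):
    flag = "flagged as potentially fraudulent" if looks_fraudulent(new_amount) else "looks normal"
    print(f"${new_amount:.2f}: {flag}")
```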
1.5 Contents and organization of the book
This book introduces the main principles and techniques used in data mining from an algorithmic perspective. Studying these principles and techniques is essential for a better understanding of how data mining technology is applied to various kinds of data.
The technical discussion of the book begins with data (Chapter 2), covering the basic types of data, data quality, preprocessing techniques, and measures of similarity and dissimilarity; these topics form an important foundation for data analysis. Chapter 3 discusses data exploration: summary statistics, visualization techniques, and on-line analytical processing (OLAP), all of which can be used to gain a quick yet thorough understanding of a data set.
Chapters 4 and 5 cover classification. Chapter 4 uses decision tree classifiers as the basis for discussing several important classification issues: overfitting, performance evaluation, and the comparison of different classifier models. Building on this, Chapter 5 introduces other important classification techniques: rule-based systems, nearest-neighbor classifiers, Bayesian classifiers, artificial neural networks, support vector machines, and ensemble classifiers (an ensemble classifier is a collection of classifiers). This chapter also discusses multiclass problems and the class imbalance problem.
Chapters 6 and 7 cover association analysis. Chapter 6 introduces the fundamentals: frequent itemsets, association rules, and some of the algorithms that generate them. Special types of frequent itemsets (maximal, closed, and hyperclique itemsets) that are important for data mining are also discussed, and the chapter concludes with evaluation measures for association analysis. Chapter 7 considers a variety of more advanced topics, including how to apply association analysis to categorical and continuous data, and to data with a concept hierarchy. (A concept hierarchy is a hierarchical categorization of objects, e.g., shoes, footwear, clothing, store merchandise.) It also describes how to extend association analysis to discover sequential patterns, patterns in graphs, and negative associations (if one item is present, then another is not).
Cluster analysis is discussed in Chapters 8 and 9. Chapter 8 first describes the different types of clusters and then presents three specific clustering techniques: K-means, agglomerative hierarchical clustering, and DBSCAN. It then discusses techniques for validating the results of clustering algorithms. Additional clustering concepts and techniques are examined in Chapter 9, including fuzzy and probabilistic clustering, self-organizing maps (SOM), graph-based clustering, and density-based clustering. This chapter also discusses scalability issues and the factors to consider when selecting a clustering algorithm.
The last chapter is about anomaly detection. After some basic definitions are given, several types of anomaly detection are described, including statistical, distance-based, density-based, and cluster-based approaches.