Data Mining
Abstract: Data mining is an important and still-young research field. This paper introduces the concept and purpose of data mining, its common methods, the data mining process, and criteria for evaluating data mining software. It also outlines the problems the field currently faces and its likely directions.
Keywords: data mining; data collection
1. Introduction
Data mining is the process of extracting implicit, previously unknown, but potentially useful information and knowledge from large volumes of incomplete, noisy, fuzzy, and random data. With the rapid development of information technology, the amount of accumulated data is growing quickly, and extracting useful knowledge from massive data has become an urgent need. Data mining is a data processing technology developed to meet this need. It is a critical step in Knowledge Discovery in Databases (KDD).
2. The task of data mining
The main tasks of data mining are association analysis, cluster analysis, classification, prediction, time-series pattern discovery, and deviation analysis.
⑴ Association analysis (association)
Association rule mining was first proposed by Rakesh Agrawal and others. When the values of two or more variables show some regularity, that regularity is called an association. Associations are an important kind of discoverable knowledge in databases, and they are divided into simple associations, sequential associations, and causal associations. The purpose of association analysis is to find the hidden network of associations in a database. Association rules are generally measured by two thresholds, support and confidence; additional measures such as interestingness and correlation are introduced so that the mined rules better match requirements.
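The two thresholds can be illustrated with a minimal sketch. It assumes the usual definitions (support is the fraction of transactions containing an itemset; confidence is the support of the whole rule divided by the support of its left-hand side). The function names and the basket data are hypothetical and exist only for illustration.

```python
from itertools import combinations

def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base

def pair_rules(transactions, min_support, min_confidence):
    """Enumerate single-item rules lhs -> rhs that pass both thresholds."""
    items = sorted({item for t in transactions for item in t})
    rules = []
    for a, b in combinations(items, 2):
        for lhs, rhs in ((a, b), (b, a)):
            s = support(transactions, {lhs, rhs})
            c = confidence(transactions, {lhs}, {rhs})
            if s >= min_support and c >= min_confidence:
                rules.append((lhs, rhs, s, c))
    return rules

# Hypothetical market-basket data, invented for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
```

Calling pair_rules(baskets, 0.5, 0.6) keeps only rules whose item pair appears in at least half of the baskets and whose confidence is at least 0.6. Exhaustive enumeration like this is only practical for tiny item sets.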
⑵ Cluster analysis (clustering)
Clustering groups data into several categories according to similarity, so that data in the same category are similar to each other and data in different categories are dissimilar. Cluster analysis can establish macroscopic concepts and discover the distribution pattern of the data and possible relationships among data attributes.
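As one possible illustration of grouping by similarity, the sketch below uses k-means (Lloyd's algorithm) with Euclidean distance. The choice of k-means, the function name, and the sample points are assumptions made here; the text above does not prescribe a particular clustering algorithm.

```python
import math

def kmeans(points, centers, max_iterations=10):
    """Alternate between assigning points to the nearest center
    and moving each center to the mean of its assigned points."""
    centers = [tuple(c) for c in centers]
    clusters = [[] for _ in centers]
    for _ in range(max_iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else old
            for old, members in zip(centers, clusters)
        ]
        if updated == centers:
            break  # assignments are stable
        centers = updated
    return centers, clusters

# Hypothetical 2-D points forming two visibly separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = kmeans(points, [(0, 0), (10, 10)])
```

The result depends on the initial centers, which are supplied by the caller here to keep the sketch deterministic.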
⑶ Classification (classification)
Classification finds a conceptual description of each category, that is, a description of the class that summarizes the overall information of its data, and uses this description to construct a model, generally expressed as rules or a decision tree. The classification rules are obtained by applying an algorithm to a training dataset. Classification can be used for both rule description and prediction.
⑷ Prediction (prediction)
Prediction uses historical data to find regularities of change, builds a model, and uses the model to predict the kinds and characteristics of future data. Prediction is concerned with accuracy and uncertainty, which are usually measured by the prediction variance.
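A minimal sketch of this idea, assuming a straight-line trend fitted by ordinary least squares and using the residual variance as a rough measure of uncertainty. The model choice, the function names, and the sales figures are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def residual_variance(xs, ys, slope, intercept):
    """Estimate of the error variance with n - 2 degrees of freedom."""
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    return sum(r * r for r in residuals) / (len(xs) - 2)

def predict(x, slope, intercept):
    """Extrapolate the fitted line to a new x value."""
    return slope * x + intercept

# Hypothetical historical data: period number and observed value.
periods = [1, 2, 3, 4, 5]
sales = [2.1, 3.9, 6.2, 7.8, 10.0]
slope, intercept = fit_line(periods, sales)
forecast = predict(6, slope, intercept)
spread = residual_variance(periods, sales, slope, intercept)
```

A small residual variance suggests that the historical points lie close to the fitted line; it does not guarantee that the trend will continue.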
⑸ Time-series pattern (time-series pattern)
A time-series pattern is a pattern with a high probability of recurrence that is found by searching time-series data. Like regression, it uses known data to predict future values; the difference is that the data are distinguished by the time at which the variable is observed.
⑹ Deviation analysis (deviation)
Deviations contain a great deal of useful knowledge. Data in a database often contain anomalies, and detecting them is very important. The basic method of deviation detection is to look for differences between observed results and a reference.
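One simple way to compare observations with a reference is a z-score test. The sketch below assumes that the reference is the mean of the data and that a value counts as a deviation when it lies more than a chosen number of standard deviations away. Both choices, and the readings themselves, are illustrative assumptions.

```python
from statistics import mean, pstdev

def find_deviations(values, threshold=2.0):
    """Return (index, value) pairs lying more than `threshold`
    population standard deviations away from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing deviates
    return [(i, v) for i, v in enumerate(values)
            if abs(v - mu) / sigma > threshold]

# Hypothetical readings with one obvious anomaly.
readings = [10, 11, 9, 10, 12, 10, 11, 50]
outliers = find_deviations(readings)
```

Because the anomaly itself inflates the mean and the standard deviation, more robust references such as the median are often preferred on real data.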
3. Data Mining Objects
Classified by storage format, the objects that can be mined include relational databases, object-oriented databases, data warehouses, text data sources, multimedia databases, spatial databases, temporal databases, heterogeneous databases, and the Internet.
4. Data mining process
⑴ Problem definition: Clearly define the business problem and determine the purpose of the data mining.
⑵ Data preparation: This includes data selection (extracting the target dataset for mining from large databases and data warehouses) and data preprocessing (checking data integrity and consistency, removing noise, filling in missing fields, deleting invalid data, and so on).
⑶ Data mining: Select an appropriate algorithm according to the type of mining function and the characteristics of the data, and mine the cleaned and transformed dataset.
⑷ Result analysis: Interpret and evaluate the results of data mining and convert them into knowledge that users can understand.
⑸ Knowledge application: Integrate the knowledge gained from the analysis into the organizational structure of the business information system.
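The data preparation step can be sketched as follows. This is a minimal example that assumes records are dictionaries, that a single numeric field has a known valid range, and that missing values are filled with the mean of the known values. The function name, the field names, and the sample rows are hypothetical.

```python
def prepare(records, numeric_field, valid_range):
    """Delete invalid rows and exact duplicates, then fill missing
    values of numeric_field with the mean of the known values."""
    low, high = valid_range
    seen = set()
    cleaned = []
    for row in records:
        value = row.get(numeric_field)
        if value is not None and not (low <= value <= high):
            continue  # delete invalid data
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # remove exact duplicates
        seen.add(key)
        cleaned.append(dict(row))
    known = [r[numeric_field] for r in cleaned
             if r.get(numeric_field) is not None]
    fill = sum(known) / len(known) if known else None
    for r in cleaned:
        if r.get(numeric_field) is None:
            r[numeric_field] = fill  # fill in the missing field
    return cleaned

# Hypothetical raw rows: one missing age, one impossible age, one duplicate.
rows = [
    {"id": 1, "age": 30},
    {"id": 2, "age": None},
    {"id": 3, "age": 250},
    {"id": 1, "age": 30},
    {"id": 4, "age": 40},
]
clean_rows = prepare(rows, "age", (0, 120))
```

Whether to drop, correct, or fill a problematic value depends on the business problem defined in step ⑴; mean filling is only one option.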
5. Methods of data mining
⑴ Neural network method
Neural networks offer good robustness, self-organization and adaptivity, parallel processing, distributed storage, and high fault tolerance, which makes them well suited to data mining problems, and they have received increasing attention in recent years. Typical neural network models fall into three main categories: feedforward models used for classification, prediction, and pattern recognition, represented by the perceptron, the BP (back-propagation) model, and functional networks; feedback models used for associative memory and optimization, represented by the discrete and continuous Hopfield models; and self-organizing mapping models used for clustering, represented by the ART model and the Kohonen model. The disadvantage of the neural network method is its "black box" nature: it is difficult for people to understand the network's learning and decision-making process.
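The simplest of the feedforward models named above, the perceptron, can be sketched in a few lines. The example assumes a single threshold unit trained with the classic perceptron learning rule on a linearly separable problem (the logical AND function). The function names and the training data are illustrative.

```python
def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Perceptron rule: adjust the weights only when a sample is misclassified."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for x, target in zip(samples, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            output = 1 if activation > 0 else 0
            delta = target - output
            if delta != 0:
                weights = [w + lr * delta * xi for w, xi in zip(weights, x)]
                bias += lr * delta
                errors += 1
        if errors == 0:
            break  # every training sample is classified correctly
    return weights, bias

def classify(x, weights, bias):
    """Apply the trained threshold unit to one input vector."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0

# Logical AND: linearly separable, so the perceptron rule converges.
and_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_labels = [0, 0, 0, 1]
weights, bias = train_perceptron(and_inputs, and_labels)
```

The learned weights are just numbers with no attached explanation, which is a small-scale instance of the "black box" problem mentioned above.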
⑵ Genetic algorithm
A genetic algorithm is a stochastic search algorithm based on biological natural selection and genetic mechanisms; it is a bionic global optimization method. Its implicit parallelism and ease of integration with other models have led to its use in data mining.
Sunil successfully developed a data mining tool based on genetic algorithms and used it to carry out data mining experiments on real databases of two aircraft crashes; the results show that the genetic algorithm is one of the effective methods of data mining [4]. Genetic algorithms are also applied in combination with neural networks and rough set techniques. For example, a genetic algorithm can optimize the structure of a neural network by removing redundant connections and hidden-layer units without increasing the error rate; the network can be trained with a combination of the genetic algorithm and the BP algorithm, and rules can then be extracted from it. However, genetic algorithms are relatively complex, and the problem of premature convergence to local minima has not yet been solved.
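The selection, crossover, and mutation cycle can be sketched on a toy problem. The example below assumes bit-string individuals, tournament selection, single-point crossover, and elitism, and it maximizes the number of 1 bits (the "OneMax" problem). All names and parameter values are illustrative assumptions and are unrelated to the tool cited above.

```python
import random

def one_max(bits):
    """Toy fitness function: the number of 1 bits."""
    return sum(bits)

def genetic_search(fitness, length=20, pop_size=30, generations=40,
                   mutation_rate=0.02, seed=0):
    rng = random.Random(seed)  # fixed seed keeps the run repeatable
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        next_gen = [population[0][:]]  # elitism: keep the best unchanged
        while len(next_gen) < pop_size:
            # Tournament selection: best of three random individuals.
            p1 = max(rng.sample(population, 3), key=fitness)
            p2 = max(rng.sample(population, 3), key=fitness)
            cut = rng.randrange(1, length)  # single-point crossover
            child = p1[:cut] + p2[cut:]
            child = [1 - b if rng.random() < mutation_rate else b
                     for b in child]  # bit-flip mutation
            next_gen.append(child)
        population = next_gen
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history

best, history = genetic_search(one_max)
```

Thanks to elitism, the best fitness recorded in history never decreases. A history that stops improving well short of the optimum would be a sign of the premature convergence mentioned above.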
⑶ Decision tree method
The decision tree is an algorithm commonly used in predictive models; it finds valuable, latent information by purposefully classifying large amounts of data. Its main advantages are that its description is simple and its classification is fast, which makes it especially suitable for large-scale data processing. The earliest and most influential decision tree method is the well-known ID3 algorithm proposed by Quinlan, which is based on information entropy. Its main problems are that ID3 is a non-incremental learning algorithm; that the ID3 decision tree is a univariate decision tree, which makes complex concepts difficult to express; that it gives too little emphasis to the relationships among attributes; and that its noise resistance is poor. Many improved algorithms address these problems, such as the ID4 incremental learning algorithm designed by Schlimmer and Fisher and the IBLE algorithm proposed by Zhong Ming, Chen Wenwei, and others.
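The entropy-based attribute choice at the heart of ID3 can be sketched as follows. The example computes the information gain of each attribute and picks the largest, which is the test ID3 applies at each node; it does not build the full tree. The function names and the toy dataset are illustrative assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    total = len(rows)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(part) / total * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder

def best_attribute(rows, labels, attributes):
    """The attribute ID3 would test at this node: the one with the largest gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))

# Hypothetical training set in which "outlook" alone decides the class.
rows = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": True},
    {"outlook": "rain", "windy": False},
    {"outlook": "rain", "windy": True},
]
labels = ["no", "no", "yes", "yes"]
chosen = best_attribute(rows, labels, ["outlook", "windy"])
```

Each split tests a single attribute, which is what makes the resulting tree univariate.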
⑷ Rough set method
Rough set theory is a mathematical tool for studying imprecise and uncertain knowledge. The rough set method has several advantages: it requires no additional information, it simplifies the representation space of the input information, and its algorithms are simple and easy to operate. The objects processed by rough sets are information tables similar to two-dimensional relational tables. Mature relational database management systems and newly developed data warehouse management systems have laid a solid foundation for rough set data mining. However, the mathematical basis of rough sets is set theory, which makes it difficult to handle continuous attributes directly, and continuous attributes are common in real information tables. The discretization of continuous attributes is therefore the main difficulty restricting the practical application of rough set theory. Several rough-set-based software tools have been developed internationally, such as KDD-R, developed at the University of Regina in Canada, and LERS, developed at the University of Kansas in the United States.
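The core rough set construction, the lower and upper approximations of a target set, can be sketched on a small information table. The example assumes discrete attribute values. The function names and the table contents are hypothetical.

```python
from collections import defaultdict

def equivalence_classes(table, attributes):
    """Group objects that are indistinguishable on the given attributes."""
    classes = defaultdict(set)
    for obj, row in table.items():
        classes[tuple(row[a] for a in attributes)].add(obj)
    return list(classes.values())

def approximations(table, attributes, target):
    """Lower approximation: classes entirely inside the target set.
    Upper approximation: classes that overlap the target set."""
    lower, upper = set(), set()
    for block in equivalence_classes(table, attributes):
        if block <= target:
            lower |= block
        if block & target:
            upper |= block
    return lower, upper

# Hypothetical information table: p1 and p2 look identical on both attributes.
table = {
    "p1": {"fever": "yes", "cough": "yes"},
    "p2": {"fever": "yes", "cough": "yes"},
    "p3": {"fever": "no", "cough": "yes"},
    "p4": {"fever": "no", "cough": "no"},
}
target = {"p1", "p3"}  # objects known to belong to the concept
lower, upper = approximations(table, ["fever", "cough"], target)
```

Objects in the upper but not the lower approximation form the boundary region: the attributes cannot decide whether they belong to the concept. A continuous attribute would put almost every object in its own class, which is why discretization is needed first.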
⑸ Method of covering positive examples and excluding counterexamples
This method finds rules by covering all positive examples and rejecting all counterexamples. First, a seed is selected from the set of positive examples and compared with each counterexample in turn. Selectors formed from field values that are compatible with the counterexample are discarded, and the others are retained. Looping over all positive seeds in this way yields the rules for the positive examples (conjunctions of selectors). Typical algorithms include Michalski's AQ11 method, Hongjiayong's improved AQ15 method, and his AE5 method.
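The covering idea can be sketched in a heavily simplified form. The code below is not AQ11, AQ15, or AE5. It is a greedy seed-based variant written for illustration: for each uncovered positive seed it adds selectors (attribute = value pairs taken from the seed) until no counterexample matches. It assumes that no counterexample is identical to a positive example. All names and data are hypothetical.

```python
def covers(rule, example):
    """A rule is a conjunction of selectors: every attribute must match."""
    return all(example[a] == v for a, v in rule.items())

def learn_rules(positives, negatives, attributes):
    rules = []
    uncovered = list(positives)
    while uncovered:
        seed = uncovered[0]
        rule = {}
        remaining_neg = list(negatives)
        candidates = list(attributes)
        while remaining_neg and candidates:
            # Pick the seed selector that rejects the most counterexamples.
            best = max(candidates,
                       key=lambda a: sum(1 for n in remaining_neg
                                         if n[a] != seed[a]))
            rule[best] = seed[best]
            candidates.remove(best)
            remaining_neg = [n for n in remaining_neg
                             if n[best] == seed[best]]
        rules.append(rule)
        # The seed always satisfies its own rule, so the loop terminates.
        uncovered = [p for p in uncovered if not covers(rule, p)]
    return rules

# Hypothetical examples described by two discrete attributes.
positives = [
    {"color": "red", "shape": "round"},
    {"color": "red", "shape": "square"},
    {"color": "green", "shape": "round"},
]
negatives = [
    {"color": "blue", "shape": "round"},
    {"color": "blue", "shape": "square"},
]
rules = learn_rules(positives, negatives, ["color", "shape"])
```

The learned rule set is read as a disjunction: an example is classified as positive when at least one rule covers it.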
⑹ Statistical analysis method
There are two kinds of relationships between database fields: functional relationships (deterministic relationships that can be expressed by function formulas) and correlation relationships (which cannot be expressed by function formulas but are still definite relationships of dependence). Both can be analyzed with statistical methods, that is, by applying statistical principles to the information in the database. Available techniques include common statistics (finding the maximum, minimum, sum, average, and so on of a large amount of data), regression analysis (using a regression equation to represent the quantitative relationship between variables), correlation analysis (using correlation coefficients to measure the degree of correlation between variables), and difference analysis (using differences in sample statistics to determine whether the population parameters differ).
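Two of these techniques, common statistics and correlation analysis, can be sketched with the standard library alone. The example assumes the Pearson coefficient as the correlation measure. The function names and the sample values are illustrative.

```python
import math

def summarize(values):
    """Common statistics over one numeric field."""
    return {
        "max": max(values),
        "min": min(values),
        "sum": sum(values),
        "mean": sum(values) / len(values),
    }

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long fields."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical field values: the second field is exactly twice the first.
heights = [1, 2, 3]
doubled = [2, 4, 6]
stats = summarize(heights)
r = pearson(heights, doubled)
```

A coefficient near 1 or -1 indicates a strong linear relationship, and a value near 0 indicates little linear relationship. The function fails with a division by zero if either field is constant.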
⑺ Fuzzy set method
This method applies fuzzy set theory to practical problems through fuzzy evaluation, fuzzy decision-making, fuzzy pattern recognition, and fuzzy cluster analysis. The more complex a system is, the fuzzier it is; fuzzy set theory generally describes fuzzy things by their degree of membership. Building on traditional fuzzy theory and probability statistics, Li Deyi and others proposed the cloud model, a model for converting between qualitative concepts and quantitative values under uncertainty, and developed cloud theory from it.
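The degree of membership can be sketched with a triangular membership function, a common textbook choice that is assumed here for illustration. The sketch requires left < peak < right. The function names and the temperature example are hypothetical, and the sketch does not implement the cloud model.

```python
def triangular(x, left, peak, right):
    """Membership degree in [0, 1] for a triangular fuzzy set.
    Assumes left < peak < right."""
    if x <= left or x >= right:
        return 0.0
    if x == peak:
        return 1.0
    if x < peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

def fuzzy_and(a, b):
    """Standard fuzzy intersection: the minimum of two membership degrees."""
    return min(a, b)

def fuzzy_or(a, b):
    """Standard fuzzy union: the maximum of two membership degrees."""
    return max(a, b)

# Hypothetical fuzzy sets over temperature in degrees Celsius.
warm = triangular(15, 10, 20, 30)  # 15 degrees is partly "warm"
cool = triangular(15, 0, 10, 20)   # 15 degrees is also partly "cool"
both = fuzzy_and(warm, cool)
```

Unlike a crisp set, the same value can belong to several fuzzy sets at once, each to a different degree.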
6. Issues to consider when evaluating data mining software
More and more software vendors are entering the data mining market. Evaluating commercial software correctly and choosing the right product is key to applying data mining successfully.
A data mining software product should be evaluated mainly in the following four areas:
⑴ Computing performance: whether the software can run on different commercial platforms; its architecture; whether it can connect to different data sources; whether its performance degrades linearly or exponentially when it operates on large datasets; its computational efficiency; whether it is component-based and easy to extend; its operational stability; and so on.
⑵ Functionality: whether the software provides a sufficient variety of algorithms; whether it avoids treating the mining process as a black box; whether its algorithms can be applied to many types of data; whether the user can adjust the algorithms and their parameters; whether the software can randomly sample data from the dataset to build a preliminary mining model; and whether the mining results can be presented in different forms.
⑶ Usability: whether the user interface is friendly; whether the software is easy to learn and use; whether its intended users are beginners, advanced users, or experts; whether its error reporting is helpful for user debugging; and whether it specializes in a particular application field or applies to many fields.
⑷ Auxiliary functions: whether users are allowed to correct erroneous values in a dataset or to clean the data; whether global substitution of values is allowed; whether continuous data can be discretized; whether a subset can be extracted from the dataset according to user-defined rules; whether null values can be replaced with an appropriate mean or a user-specified value; whether the results of one analysis can be fed into another; and so on.
7. Conclusion
Data mining is a young and promising field of research, and strong commercial interest will continue to drive its development. New data mining methods and models appear every year, and research on the subject is becoming broader and deeper. Nevertheless, data mining still faces many problems and challenges. The efficiency of data mining methods needs to improve, especially on large-scale datasets. Mining methods that handle multiple data types and tolerate noise need to be developed to address mining in heterogeneous datasets. Mining of dynamic data and knowledge, and mining in networked and distributed environments, also remain open problems. In addition, multimedia databases have developed rapidly in recent years, and mining techniques and software for multimedia databases will become a focus of future research and development.