R Language Data Mining in Practice (1)
I. Data Mining Basics
Data mining: "panning for gold" in data. It means extracting hidden, previously unknown, potentially valuable relationships, patterns, and trends from large volumes of data (including text), and using that knowledge and those rules to build models for decision support. The term covers the methods, tools, and processes that provide predictive decision support.
Tasks for Data Mining
Data mining uses classification and prediction, cluster analysis, association rules, time series patterns, deviation detection, intelligent recommendation, and other methods to help enterprises extract the business value contained in their data and improve their competitiveness.
Data Mining Modeling Process
Define the mining goal. Decide what you want the analysis to achieve.
Data sampling. Extract a subset of sample data that is relevant to the mining goal. The criteria for selecting data are relevance, reliability, and validity. The quality of sampled data is measured by two things: (1) completeness, meaning all indicators are present, and (2) accuracy, meaning the data reflects the normal (not abnormal) state. Common sampling methods include random sampling, equidistant sampling, stratified sampling, sequential sampling from a starting position, and sampling by class.
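Three of these sampling methods can be sketched in a few lines. This is only an illustration, written in Python with the standard library; the original post gives no code, and the customer records, field names, and function names below are hypothetical.

```python
import random
from collections import defaultdict


def random_sample(records, n, seed=0):
    """Simple random sampling: every record has the same chance of selection."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return rng.sample(records, n)


def equidistant_sample(records, step, start=0):
    """Equidistant (systematic) sampling: take every `step`-th record."""
    return records[start::step]


def stratified_sample(records, key, fraction, seed=0):
    """Stratified sampling: draw the same fraction from each group (stratum)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))  # at least one per stratum
        sample.extend(rng.sample(group, k))
    return sample


# Hypothetical data: 100 customers, 80 in segment "A" and 20 in segment "B".
customers = [{"id": i, "segment": "A" if i < 80 else "B"} for i in range(100)]

simple = random_sample(customers, 10)
systematic = equidistant_sample(customers, step=10)
stratified = stratified_sample(customers, key=lambda r: r["segment"], fraction=0.1)
```

Stratified sampling keeps the 80/20 split between segments in the sample, which simple random sampling does not guarantee on a small draw.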
Data exploration. The purpose of data exploration and preprocessing is to ensure the quality of the sample data and so lay the foundation for the quality of the model. Common data exploration methods include outlier analysis, missing value analysis, correlation analysis, and periodicity analysis.
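As a rough illustration of three of these checks, the sketch below counts missing values, flags outliers with the box-plot (1.5 x IQR) rule, and computes a Pearson correlation. It is written in Python with the standard library (Python 3.8 or later for `statistics.quantiles`). The sales and visitor figures are made up, and the box-plot rule is one common choice for outlier analysis, not necessarily the one the author has in mind.

```python
import statistics


def missing_count(values):
    """Missing value analysis: count entries recorded as None."""
    return sum(1 for v in values if v is None)


def iqr_outliers(values):
    """Outlier analysis: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in present if v < low or v > high]


def pearson(xs, ys):
    """Correlation analysis: Pearson coefficient over pairs with no missing value."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    mean_x = statistics.fmean([x for x, _ in pairs])
    mean_y = statistics.fmean([y for _, y in pairs])
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical daily figures; None marks a missing record.
sales = [120, 130, None, 125, 128, 900, 122, 127]
visitors = [60, 66, 61, 63, 64, 65, 61, 64]

n_missing = missing_count(sales)
outliers = iqr_outliers(sales)
r = pearson(sales, visitors)
```

Here the value 900 is flagged for review. Whether it is an error or a genuine peak is a business question that the rule alone cannot answer.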
Data preprocessing. When the sampled data has many dimensions, preprocessing must address how to reduce the dimensionality and how to handle missing values. Common preprocessing methods include data filtering, variable transformation, missing value handling, bad data handling, standardization, principal component analysis, attribute selection, and data reduction.
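Two of the listed methods, missing value handling and standardization, can be sketched as follows. This is a minimal Python example using only the standard library. Mean imputation and z-score scaling are assumed here as simple, common choices, and the `ages` column is hypothetical.

```python
import statistics


def impute_mean(values):
    """Missing value handling: replace None with the mean of the observed values."""
    present = [v for v in values if v is not None]
    mean = statistics.fmean(present)
    return [mean if v is None else v for v in values]


def standardize(values):
    """Standardization (z-score): rescale to mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / sd for v in values]


# Hypothetical attribute with two missing entries.
ages = [20, None, 40, 30, None]

filled = impute_mean(ages)
scaled = standardize(filled)
```

Mean imputation shrinks the variance of the attribute, so for columns with many missing values a model-based fill or dropping the column may be the better choice.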
Mining modeling. Decide which kind of problem this application is (classification, clustering, association rules, time series patterns, or intelligent recommendation) and which algorithm to use to build the model.
Model evaluation. Identify the best model among the candidate models, then interpret and apply it in light of the business context.
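The modeling and evaluation steps can be sketched together: fit several candidate models on training data, score each on held-out data, and keep the best. The Python example below is only an illustration under stated assumptions. The two toy classifiers, the one-feature data set, and the use of hold-out accuracy as the selection criterion are all choices made for this sketch, not taken from the original post.

```python
from collections import Counter


def fit_majority(train):
    """Baseline candidate: always predict the most frequent training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label


def fit_nearest_mean(train):
    """Second candidate: predict the class whose feature mean is closest to x."""
    by_label = {}
    for x, y in train:
        by_label.setdefault(y, []).append(x)
    centers = {y: sum(xs) / len(xs) for y, xs in by_label.items()}
    return lambda x: min(centers, key=lambda y: abs(x - centers[y]))


def accuracy(model, data):
    """Share of records in `data` that the model labels correctly."""
    return sum(model(x) == y for x, y in data) / len(data)


# Hypothetical one-feature data set: (feature value, class label).
train = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1)]
holdout = [(2.5, 0), (3.5, 0), (6.5, 1), (9, 1)]

candidates = {"majority": fit_majority, "nearest_mean": fit_nearest_mean}
scores = {name: accuracy(fit(train), holdout) for name, fit in candidates.items()}
best = max(scores, key=scores.get)
```

Scoring on data the models did not see during fitting is what makes the comparison meaningful. Accuracy on the training set alone would favor models that simply memorize it.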
Common Data Mining Modeling Tools
(1) R
R is a language and environment for statistical computing and graphics. It is an implementation of the S language, which was developed by Rick Becker, John Chambers, and Allan Wilks at Bell Labs.
(2) Python
Python is an easy-to-learn, powerful programming language with efficient high-level data structures and a simple but effective approach to object-oriented programming.
(3) SAS Enterprise Miner
Enterprise Miner (EM) is an integrated data mining system from SAS. It lets users apply and compare different techniques, and it also integrates complex database management software.
(4) IBM SPSS Modeler
IBM SPSS Modeler packages advanced statistical and data mining techniques for obtaining predictive knowledge and deploying the resulting decision-making solutions into existing business systems and processes. It offers an intuitive interface, automated data preparation, and proven predictive analytics models.
(5) SQL Server
The data mining component, Analysis Services, is integrated into Microsoft SQL Server. SQL Server 2008 provides nine commonly used data mining algorithms, including decision trees, clustering, Naive Bayes, association rules, time series, neural networks, and linear regression. Its platform portability, however, is relatively poor.
(6) MATLAB
MATLAB is application software developed by MathWorks in the United States, with strong scientific and engineering computing capabilities. Beyond its powerful matrix-based numerical computation and analysis, it offers rich graphics and visualization functions and convenient programming facilities.
(7) WEKA
WEKA (Waikato Environment for Knowledge Analysis) is a well-known open-source machine learning and data mining package.
(8) TIPDM
TIPDM (the top data mining platform) is developed in Java. It can obtain data from a variety of data sources and build multiple data mining models. It currently integrates dozens of predictive algorithms and analysis techniques, covering most of the algorithms supported by the mainstream domestic and international mining systems.
This article is from the "Rangers" blog. Please keep this source link when reposting: http://ccnupxz.blog.51cto.com/8803964/1930452