R Language Data Mining in Practice series (1)

1. Data mining basics

Data mining: "panning for gold" in data. It means extracting hidden, previously unknown, potentially valuable relationships, patterns, and trends from large volumes of data (including text), and using that knowledge and those rules to build models for decision support. The term covers the methods, tools, and processes that provide predictive decision support.

Tasks for Data Mining

Data mining uses classification and prediction, cluster analysis, association rules, time series patterns, deviation detection, intelligent recommendation, and other methods to help enterprises extract the business value contained in their data and improve their competitiveness.

Data Mining Modeling Process

Define the mining goal. That is, decide what you want to achieve.

Data sampling. Extract a subset of sample data relevant to the mining goal. The criteria for extracting data are relevance, reliability, and effectiveness. The quality of the sampled data is measured by: (1) completeness, meaning the data and the full range of indicators are present, and (2) accuracy, meaning the data reflects the normal (not abnormal) state. Common sampling methods include random sampling, equidistant sampling, stratified sampling, sequential sampling from the start, and classified sampling.
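
Several of the sampling methods above can be sketched in a few lines. The example below is a minimal illustration in Python using only the standard library, not a reference implementation. The function names, the toy `records` list, and the `segment` field are all hypothetical and chosen only for this example.

```python
import random
from collections import defaultdict


def random_sample(records, n, seed=0):
    """Simple random sampling without replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return rng.sample(records, n)


def equidistant_sample(records, step, start=0):
    """Equidistant sampling: take every `step`-th record, beginning at `start`."""
    return records[start::step]


def stratified_sample(records, key, fraction, seed=0):
    """Stratified sampling: draw the same fraction from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))  # keep at least one per stratum
        sample.extend(rng.sample(group, k))
    return sample


def head_sample(records, n):
    """Sequential sampling from the start: take the first `n` records."""
    return records[:n]


# Hypothetical toy data: 100 customers, 20 "vip" and 80 "regular".
records = [
    {"id": i, "segment": "vip" if i % 5 == 0 else "regular"}
    for i in range(100)
]

simple = random_sample(records, 10)
spaced = equidistant_sample(records, step=10)
strat = stratified_sample(records, key=lambda r: r["segment"], fraction=0.1)
first = head_sample(records, 10)
```

Stratified sampling keeps the vip/regular proportions of the full data in the sample, which simple random sampling only does on average.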

Data exploration. The purpose of data exploration and preprocessing is to guarantee the quality of the sample data, and thus lay the foundation for the quality of the model. Common data exploration methods include outlier analysis, missing value analysis, correlation analysis, and periodicity analysis.
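
As a rough illustration of three of these checks, the sketch below counts missing values, flags outliers, and computes a correlation coefficient in plain Python. The article does not say which outlier rule or correlation measure to use; the box-plot (1.5 × IQR) rule and Pearson correlation here are assumptions made for the example, and the `monthly_spend` data is invented.

```python
import math
import statistics


def count_missing(values):
    """Missing value analysis: count entries recorded as None."""
    return sum(v is None for v in values)


def iqr_outliers(values):
    """Outlier analysis with the box-plot rule (an assumed choice):
    flag values more than 1.5 * IQR outside the quartiles."""
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in present if v < low or v > high]


def pearson(xs, ys):
    """Correlation analysis: Pearson correlation coefficient.
    Assumes neither series is constant (otherwise the divisor is zero)."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical column with two missing entries and one suspicious value.
monthly_spend = [10, 12, 11, None, 13, 12, 95, 11, None, 12]

missing = count_missing(monthly_spend)
outliers = iqr_outliers(monthly_spend)
r = pearson([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

Whether a flagged value is really abnormal is a business question; the rule only points out candidates for review.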

Data preprocessing. When the sampled data has many dimensions, data preprocessing has to solve how to reduce the dimensionality and how to handle missing values. Commonly used data preprocessing methods include data filtering, variable transformation, missing value handling, bad data handling, standardization, principal component analysis, attribute selection, and data reduction.
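
Two of these steps, missing value handling and standardization, are easy to show in miniature. The sketch below assumes mean imputation and z-score standardization, which are common choices but not ones the article prescribes; the function names and the sample column are hypothetical.

```python
import statistics


def impute_mean(values):
    """Missing value handling: replace None with the mean of the observed values."""
    present = [v for v in values if v is not None]
    mean = statistics.fmean(present)
    return [mean if v is None else v for v in values]


def standardize(values):
    """Z-score standardization: rescale to mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return [0.0 for _ in values]  # a constant column carries no spread
    return [(v - mean) / std for v in values]


# Hypothetical column with one missing entry.
raw = [1.0, None, 3.0]
filled = impute_mean(raw)
scaled = standardize(filled)
```

Imputation runs before standardization here so that the mean and standard deviation are computed on a complete column.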

Mining modeling. Decide which kind of data mining problem this application is (classification, clustering, association rules, time series patterns, or intelligent recommendation), and which algorithm to use to build the model.

Model evaluation. Automatically find the best model among the candidate models, then interpret and apply it according to the business context.
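
One simple way to "automatically find the best model" is to score every candidate on held-out data and keep the winner. The sketch below does this for two toy classifiers. Everything in it is an assumption made for illustration: the two candidates (a majority-class baseline and a one-nearest-neighbor rule), accuracy as the metric, and the tiny train and validation sets.

```python
from collections import Counter


def fit_majority(train):
    """Baseline model: always predict the most frequent training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label


def fit_nearest_neighbor(train):
    """1-nearest-neighbor model on a single numeric feature."""
    def predict(x):
        nearest = min(train, key=lambda pair: abs(pair[0] - x))
        return nearest[1]
    return predict


def accuracy(model, data):
    """Share of records whose predicted label matches the true label."""
    return sum(model(x) == y for x, y in data) / len(data)


def select_best(candidates, train, validation):
    """Fit every candidate on `train`, score it on `validation`, return the winner."""
    scores = {
        name: accuracy(fit(train), validation)
        for name, fit in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical data: (feature, label) pairs.
train = [(1.0, "low"), (2.0, "low"), (3.0, "low"), (8.0, "high"), (9.0, "high")]
validation = [(1.5, "low"), (2.5, "low"), (8.5, "high"), (9.5, "high")]

candidates = {
    "majority": fit_majority,
    "nearest_neighbor": fit_nearest_neighbor,
}
best, scores = select_best(candidates, train, validation)
```

Scoring on data the models did not see during fitting is what makes the comparison meaningful; interpreting the winning model against the business goal remains a manual step.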

Common data mining modeling tools

(1) R

R is a language and environment for statistical computing and graphics. It is an implementation of the S language, which was developed at Bell Labs by Rick Becker, John Chambers, and Allan Wilks.

(2) Python

Python is an easy-to-learn, powerful programming language with efficient high-level data structures and a simple but effective approach to object-oriented programming.

(3) SAS Enterprise Miner

Enterprise Miner (EM) is an integrated data mining system from SAS. It lets users apply and compare different techniques, and it also integrates complex database management software.

(4) IBM SPSS Modeler

IBM SPSS Modeler encapsulates advanced statistics and data mining techniques to produce predictive knowledge and to deploy the resulting decision-making solutions into existing business systems and processes. It offers an intuitive interface, automated data preparation, and proven predictive analytics models.

(5) SQL Server

Microsoft SQL Server integrates a data mining component, Analysis Services. SQL Server 2008 provides nine commonly used data mining algorithms, including decision trees, clustering, Naive Bayes, association rules, time series, neural networks, and linear regression. Its platform portability, however, is relatively poor.

(6) MATLAB

MATLAB is application software developed by MathWorks in the United States, with strong scientific and engineering computing capabilities. Besides powerful matrix-based mathematical computation and analysis, it offers rich graphics and visualization functions and convenient programming facilities.

(7) WEKA

WEKA (Waikato Environment for Knowledge Analysis) is a well-known, open-source machine learning and data mining software package.

(8) TIPDM

TIPDM (the top data mining platform) is developed in Java. It can obtain data from a variety of data sources and build multiple data mining models. It currently integrates dozens of predictive algorithms and analysis techniques, covering most of the algorithms supported by the main domestic and international mining systems.


This article is from the "Rangers" blog. Please keep this source attribution: http://ccnupxz.blog.51cto.com/8803964/1930452
