Open-source tools for data mining
==============================================================================
Blaz Zupan, PhD, and Janez Demsar, PhD (compiled by: idmer)
The history of data mining software is short. Even the term "data mining" was formally proposed only in the 1990s. The field integrates statistics, machine learning, data visualization, knowledge engineering, and other research areas, and it is now quite mature in data exploration and model inference. Compared with today's tools, the data mining software of that time was clumsy and generally offered only a command-line interface. For many users without a background in computer science, it was too difficult to use.
Today, commercial data mining software is mature and provides easy-to-use visual interfaces, integrating a complete set of functions for data processing, modeling, and evaluation. Open-source data mining tools cannot match commercial software in stability and maturity (idmer: in addition, open-source tools cannot offer commercial users reliable performance guarantees or after-sales support), but some open-source tools are still quite good, and you can choose them for less critical analysis and mining work.
This article briefly reviews the evolution of open-source data mining tools and presents a selection of the better ones to choose from.
Evolution of open-source data mining tools
--------------------------
Early model inference and machine learning programs emerged in the 1980s. They were generally run from the command line (Unix or DOS), with the input data file name and algorithm parameters specified in the command. The widely known classification tree induction algorithm C4.5 is one such program (for the C4.5 source, see http://www.rulequest.com/Personal). Others were rule-based learning algorithms such as AQ and CN2. These programs were used mostly in the medical field, for example in the diagnosis and prediction of cancer.
These programs generally did not include data sampling or other processing functions; users usually handled such tasks with a scripting language (such as Perl). At the same time, some research groups developed libraries to support data format sharing, model evaluation, and reporting, such as MLC++, a machine learning library written in C++.
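To give a sense of the kind of helper script users wrote around those command-line programs, here is a minimal sketch of a random holdout split. It is written in Python rather than Perl, and the function name `holdout_split`, its parameters, and the toy rows are assumptions made for illustration only.

```python
import random


def holdout_split(rows, test_fraction=0.3, seed=0):
    """Shuffle rows and split them into training and test sets."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)                      # copy, so the input is untouched
    random.Random(seed).shuffle(shuffled)      # seeded for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical data: ten case identifiers standing in for data file rows.
rows = [f"case_{i}" for i in range(10)]
train, test = holdout_split(rows, test_fraction=0.3, seed=42)
print(len(train), len(test))
```

The two resulting lists could then be written to separate files and passed to a learning program as its training and test inputs.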
The command-line interface made interactive data analysis difficult, and text output was not intuitive. The next step in the development of data mining tools was built-in data visualization and better interactivity. In the 1990s, Silicon Graphics acquired MLC++ and developed it into MineSet, arguably the most comprehensive data mining platform of its time. Clementine was another very popular commercial data mining package of that period, notable for the ease of use of its interface.
Today, most open-source data mining software adopts visual programming (idmer: a graphical way of building the entire mining process). This approach is flexible and easy to use, and it suits users who lack computer science knowledge.
Flexibility and extensibility are very important in analysis software, because they allow you to develop new mining algorithms and extend existing ones. In this regard, WEKA (idmer: practically the representative of open-source data mining software) provides comprehensive documentation of its Java functions and class libraries and is well suited to extension. Of course, you must first understand the WEKA architecture thoroughly and be proficient in Java programming. Another well-known open-source package, R, takes a rather different approach. R provides rich statistical analysis and data mining functions, and its core is implemented in C. To develop new mining algorithms in R, however, you do not need C; you use R's own scripting language instead. The advantages of a scripting language are speed (idmer: meaning that new algorithms take less time to develop, because a scripting language is higher-level and simpler), flexibility (complex functions in the mining software can be called directly from scripts), and extensibility (functions of other data mining software can be called through interfaces). Graphical interfaces are certainly easier to use, but developing new algorithms in a scripting language can meet specific analysis needs.
Expected functions of an open-source data mining toolbox
--------------------------------------------------------
- A set of basic statistical tools for routine data exploration;
- Multiple data visualization techniques, such as histograms, scatterplots, distribution charts, parallel coordinate visualizations, and mosaic and sieve diagrams;
- Standard data processing components, including database querying, case selection, feature ranking and subset selection, and feature discretization;
- Unsupervised data analysis techniques, such as principal component analysis, various clustering techniques, inference of association rules, and subgroup mining;
- Supervised data analysis techniques, such as classification rules and trees, support vector machines, naive Bayesian classifiers, and discriminant analysis;
- Model evaluation and scoring tools, including graphical presentation of results (such as ROC curves and lift charts);
- Model visualization (idmer: for example, displaying trained decision trees in a tree structure, clusters in a bubble chart, and associations in a network diagram);
- An exploratory data analysis environment;
- The ability to save models in a standard format (such as PMML) for sharing and porting;
- A reporting function for generating analysis reports and saving users' remarks or descriptions.
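As a concrete illustration of the model evaluation item in the list above, here is a minimal sketch that computes the area under the ROC curve from classifier scores using only the Python standard library. The function name `roc_auc` and the toy labels and scores are assumptions made for illustration; they do not come from any of the tools discussed in this article.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: sequence of 0/1 class labels (1 = positive).
    scores: sequence of classifier scores, higher = more likely positive.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = disease present, 0 = disease absent.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking. The pairwise loop is quadratic in the number of examples, which is acceptable for a sketch but not for large datasets.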
Several excellent open-source data mining tools
--------------------------
This article examines only a few popular open-source data mining platforms, such as WEKA and R. To find more open-source data mining software, see KDnuggets and the Open Directory. To evaluate the software, we used the heart disease diagnosis dataset from the UCI Machine Learning Repository.
R
R (http://www.r-project.org) is a language and environment for statistical analysis and graphics. For performance, its core computing modules are written in C, C++, and Fortran; for ease of use, it provides a scripting language, also called R. The R language is similar to the S language developed at Bell Labs. R supports a range of analysis techniques, including statistical testing, predictive modeling, and data visualization. Numerous open-source extension packages are available on CRAN (http://cran.r-project.org).
The primary interface of R is the command line, where analysis functions are called by writing scripts. If you lack programming skills, you can use a graphical interface instead, such as R Commander (http://socserv.mcmaster.ca/jfox/Misc/Rcmdr/) or Rattle (http://rattle.togaware.com).
Tanagra
Tanagra (http://eric.univ-lyon2.fr/wricco/tanagra/) is graphical data mining software that organizes analysis components in a tree structure similar to Windows Explorer. Tanagra lacks advanced visualization capabilities, but its strength is statistical analysis: it provides a wide range of parametric and nonparametric tests. It also offers many feature selection methods.
WEKA
WEKA (Waikato Environment for Knowledge Analysis, http://www.cs.waikato.ac.nz/ml/weka/) is perhaps the best-known open-source machine learning and data mining software. Advanced users can call its analysis components from Java programs or from the command line. For ordinary users, WEKA also provides graphical interfaces, such as the WEKA KnowledgeFlow Environment and the WEKA Explorer. Compared with R, WEKA is weaker in statistical analysis but stronger in machine learning. On the WEKA forum (http://weka.sourceforge.net/wiki/index.php/Related_Projects) you can find many extension packages for text mining, visualization, grid computing, and so on. Many other open-source data mining packages also support calling WEKA's analysis functions.
Yale (idmer: now renamed RapidMiner)
Yale (Yet Another Learning Environment, http://rapid-i.com) provides a graphical interface that organizes analysis components in a tree structure similar to Windows Explorer; each node of the tree represents an operator. Yale provides a large number of operators for data processing, transformation, exploration, modeling, and evaluation. Yale is developed in Java and builds on WEKA, which means it can call the analysis components in WEKA.
KNIME
KNIME (Konstanz Information Miner, http://www.knime.org) is a well-developed data mining tool built on the Eclipse development environment. It requires no installation and is easy to use (idmer: haha, the portable "green" version everyone loves). Like Yale, KNIME is developed in Java and can be extended with the mining algorithms in WEKA. Unlike Yale, KNIME builds the analysis and mining process as a data flow (idmer: I like this; it is similar to commercial data mining software such as SAS EM or SPSS Clementine). A mining process consists of a series of functional nodes, each with input and output ports for receiving data or models and exporting results. (idmer: KNIME is easier to use than WEKA's KnowledgeFlow. Connecting nodes is convenient: you simply drag from one port to another with the mouse. In WEKA, you have to right-click a node and select the successor node, which is cumbersome; it took me half a day just to connect the nodes.)
Each node in KNIME carries a traffic light that indicates its status: red when the node is unconnected, unconfigured, or lacks input data; yellow when it is ready to run; and green once execution has finished. A distinctive KNIME feature, HiLite, lets users mark records of interest in a node's results and explore them further.
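The traffic-light behaviour described here can be pictured as a small state machine. The following is a minimal sketch in Python, not KNIME's actual API (KNIME itself is written in Java); the `Node` class, the `Status` enum, and all method names are hypothetical.

```python
from enum import Enum


class Status(Enum):
    RED = "not ready"      # unconnected, unconfigured, or missing input
    YELLOW = "ready"       # configured and has input, not yet executed
    GREEN = "executed"     # execution finished


class Node:
    """Hypothetical workflow node with a traffic-light status."""

    def __init__(self, name, func):
        self.name = name
        self.func = func           # function applied to the input data
        self.configured = False
        self.input_data = None
        self.output_data = None
        self.executed = False

    @property
    def status(self):
        if self.executed:
            return Status.GREEN
        if self.configured and self.input_data is not None:
            return Status.YELLOW
        return Status.RED

    def configure(self):
        self.configured = True

    def receive(self, data):
        self.input_data = data
        self.executed = False      # new input invalidates earlier results

    def execute(self):
        if self.status is Status.RED:
            raise RuntimeError(f"{self.name} is not ready to run")
        self.output_data = self.func(self.input_data)
        self.executed = True
        return self.output_data


# Hypothetical node that keeps only values greater than 2.
node = Node("row filter", lambda rows: [r for r in rows if r > 2])
states = [node.status]             # red: nothing configured or connected
node.configure()
node.receive([1, 2, 3, 4])
states.append(node.status)         # yellow: ready to run
result = node.execute()
states.append(node.status)         # green: finished
print([s.name for s in states], result)
```

Deriving the status from the node's fields, rather than storing it separately, keeps the light consistent with what the node actually holds.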
Orange
Orange (http://www.ailab.si/orange) is a data mining tool similar to KNIME and WEKA KnowledgeFlow. Its graphical environment is called Orange Canvas, where you place analysis controls (widgets) on a canvas and connect them to form a mining process. These controls are similar to nodes in KNIME in that each performs a specific function. However, whereas the inputs and outputs of a KNIME node come in only two types (model and data), Orange controls can pass many different signals, such as learners, classifiers, evaluation results, distance matrices, and dendrograms. Orange controls are also less fine-grained than KNIME nodes; that is, the same analysis and mining task may need fewer controls in Orange than nodes in KNIME. The advantage of Orange is that it is simpler to use; the disadvantage is that it offers less detailed control over the process than KNIME.
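To illustrate the difference between a learner signal and a classifier signal, here is a minimal sketch in plain Python. It does not use Orange's API; `MajorityLearner`, `MajorityClassifier`, `accuracy`, and the toy data are hypothetical names chosen for illustration. The idea is that a learner is an untrained algorithm that produces a classifier when given data, so a downstream control can receive either kind of object.

```python
from collections import Counter


class MajorityClassifier:
    """A trained model: always predicts the class it was built with."""

    def __init__(self, majority_class):
        self.majority_class = majority_class

    def __call__(self, instance):
        return self.majority_class


class MajorityLearner:
    """An untrained algorithm: calling it on data returns a classifier."""

    def __call__(self, rows):
        # rows are (features, class_label) pairs
        counts = Counter(label for _, label in rows)
        return MajorityClassifier(counts.most_common(1)[0][0])


def accuracy(learner, train_rows, test_rows):
    """An evaluation step that receives a learner, not a trained model."""
    classifier = learner(train_rows)
    hits = sum(classifier(x) == y for x, y in test_rows)
    return hits / len(test_rows)


# Hypothetical toy data: (age, sex) features with a diagnosis label.
train = [((63, 1), "sick"), ((41, 0), "healthy"), ((57, 1), "sick")]
test = [((50, 1), "sick"), ((35, 0), "healthy")]

learner = MajorityLearner()
classifier = learner(train)
prediction = classifier((60, 1))
score = accuracy(learner, train, test)
print(prediction, score)
```

Passing the learner rather than a trained classifier lets an evaluation step retrain the model on whatever data it chooses, which is what procedures such as cross-validation require.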
In addition to a user-friendly, easy-to-use interface, Orange offers a large number of visualization methods for displaying data and models graphically, can intelligently search for suitable visualizations, and supports interactive data exploration.
Orange is weak in traditional statistical analysis: it does not support statistical tests, and its reporting capabilities are limited. Orange's core is written in C++, and users can extend it with the Python scripting language (see http://www.scipy.org).
GGobi
Data visualization is an important part of data mining. GGobi (http://www.ggobi.org) is open-source software for interactive visualization that uses the brushing technique. GGobi can be used as a plug-in for R or called from Perl, Python, and other scripting languages.
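Brushing means selecting records in one view and having the same records highlighted in every other linked view. The sketch below shows the idea with plain Python data structures; it is not GGobi's API, and the functions `brush` and `linked_view`, the column names, and the toy records are all hypothetical.

```python
def brush(records, x_key, y_key, x_range, y_range):
    """Return the indices of records inside a rectangular brush.

    The brush is drawn in the scatterplot of x_key against y_key.
    """
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    return {
        i
        for i, rec in enumerate(records)
        if x_lo <= rec[x_key] <= x_hi and y_lo <= rec[y_key] <= y_hi
    }


def linked_view(records, x_key, y_key, selected):
    """Describe a second view, flagging the records brushed elsewhere."""
    return [
        (rec[x_key], rec[y_key], i in selected)
        for i, rec in enumerate(records)
    ]


# Hypothetical toy records.
records = [
    {"age": 63, "chol": 233, "max_hr": 150},
    {"age": 41, "chol": 204, "max_hr": 172},
    {"age": 57, "chol": 354, "max_hr": 163},
    {"age": 67, "chol": 286, "max_hr": 108},
]

# Brush older patients with high cholesterol in the age/chol view ...
selected = brush(records, "age", "chol", (55, 70), (250, 400))
# ... and flag the same patients in the age/max_hr view.
view = linked_view(records, "age", "max_hr", selected)
print(sorted(selected), view)
```

The selection is stored as record indices rather than coordinates, which is what allows any other view of the same records to reuse it.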
Conclusion
----
All of the software described above is excellent open-source data mining software, each with its own strengths and weaknesses. Readers can choose according to their needs or combine several packages. Ordinary users can pick a user-friendly, easy-to-use package; users who want to develop algorithms can choose according to the development language they prefer (Java, R, C++, Python, and so on). The software above (except GGobi) provides most of the features we expect.
(From: http://idmer.blog.sohu.com/106647744.html)