R
R (http://www.r-project.org) is a computer language and set of tools for statistical analysis and graphics. For performance, its core computing modules are written in C, C++, and FORTRAN, while a scripting language (the R language) is provided for ease of use. The R language is similar to the S language developed at Bell Labs. R supports a range of analysis techniques, including statistical testing, predictive modeling, and data visualization. Many open-source extension packages are available on CRAN (http://cran.r-project.org).
The preferred interface to R is the command line, where analysis functions are called by writing scripts. Users who lack programming skills can use a graphical interface instead, such as R Commander (http://socserv.mcmaster.ca/jfox/Misc/Rcmdr/) or Rattle (http://rattle.togaware.com).
Tanagra
Tanagra (http://eric.univ-lyon2.fr/wricco/tanagra/) is a graphical data mining program that organizes analysis components in a tree structure similar to Windows Explorer. Tanagra lacks advanced visualization capabilities, but its strength is statistical analysis: it provides a wide range of parametric and nonparametric tests, along with many feature selection methods.
WEKA
WEKA (Waikato Environment for Knowledge Analysis, http://www.cs.waikato.ac.nz/ml/weka/) may be the best-known open-source machine learning and data mining software. Advanced users can call its analysis components from Java programs or from the command line. WEKA also provides graphical interfaces for ordinary users, such as the WEKA KnowledgeFlow Environment and the WEKA Explorer. Compared with R, WEKA is weaker in statistical analysis but stronger in machine learning. The WEKA related-projects page (http://weka.sourceforge.net/wiki/index.php/Related_Projects) lists many extension packages for tasks such as text mining, visualization, and grid computing. Many other open-source data mining tools also support calling WEKA's analysis functions.
Yale (idmer: now renamed RapidMiner)
Yale (Yet Another Learning Environment, http://rapid-i.com) provides a graphical interface that organizes analysis components in a tree structure similar to Windows Explorer, where each node of the tree represents a different operator. Yale provides a large number of operators covering data processing, transformation, exploration, modeling, and evaluation. Yale is developed in Java and built on WEKA, which means it can call the various analysis components in WEKA.
KNIME
KNIME (Konstanz Information Miner, http://www.knime.org) is a well-developed data mining tool based on the Eclipse development environment. No installation is required and it is easy to use (idmer: haha, everyone's favorite "green" version). Like Yale, KNIME is developed in Java and can be extended with the mining algorithms in WEKA. Unlike Yale, KNIME uses a data flow to build an analysis and mining process (idmer: I like this; it is similar to commercial data mining software such as SAS EM or SPSS Clementine). A mining process consists of a series of functional nodes, each of which has input and output ports for receiving data or models and exporting results. (idmer: KNIME is easier to use than WEKA's KnowledgeFlow. Connecting nodes is convenient: you drag from one connection port to another with the mouse. In WEKA, you need to right-click a node and select the subsequent node, which is troublesome. It took me half a day just to connect the nodes.)
Each node in KNIME carries a traffic light that indicates its status: red when the node is not connected, not configured, or lacks input data; yellow when it is ready to execute; and green after execution is complete. A special KNIME feature, HiLite, lets users mark records of interest in a node's results and explore them further.
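The traffic-light behavior described above can be sketched as a small state model. This is only an illustration of the idea, not KNIME's actual API: the `Node` and `Status` classes, their methods, and the two example nodes are all hypothetical names introduced here.

```python
from enum import Enum


class Status(Enum):
    RED = "not ready"    # not connected, not configured, or missing input data
    YELLOW = "ready"     # configured with input available, not yet executed
    GREEN = "executed"   # execution complete, output available


class Node:
    """Hypothetical workflow node with one input port (illustration only)."""

    def __init__(self, name, func, needs_input=True):
        self.name = name
        self.func = func
        self.needs_input = needs_input
        self.upstream = None
        self.configured = False
        self.executed = False
        self.output = None

    def connect(self, upstream):
        self.upstream = upstream
        return self

    def configure(self):
        self.configured = True
        return self

    @property
    def status(self):
        if self.executed:
            return Status.GREEN
        if not self.configured:
            return Status.RED
        if self.needs_input and (
            self.upstream is None or self.upstream.status is not Status.GREEN
        ):
            return Status.RED  # no input data yet
        return Status.YELLOW

    def execute(self):
        if self.status is Status.RED:
            raise RuntimeError(f"{self.name} is not ready to execute")
        if not self.executed:
            data = self.upstream.output if self.needs_input else None
            self.output = self.func(data)
            self.executed = True
        return self.output


# A two-node flow: a source node feeding a sorting node.
reader = Node("reader", lambda _: [3, 1, 2], needs_input=False).configure()
sorter = Node("sorter", sorted).connect(reader).configure()

before = sorter.status      # RED: the upstream node has produced no data yet
reader.execute()
ready = sorter.status       # YELLOW: input is available, node can run
result = sorter.execute()
print(before.name, ready.name, sorter.status.name, result)
```

In this sketch the status is derived from the node's state on demand rather than stored, so a downstream node turns yellow as soon as its upstream node finishes.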
Orange
Orange (http://www.ailab.si/orange) is a data mining tool similar to KNIME and WEKA KnowledgeFlow. Its graphical environment is called the Orange Canvas, where you place analysis controls (widgets) on the canvas and connect them to form a mining process. These widgets are similar to nodes in KNIME, and each one performs a specific function. The difference is that the inputs and outputs of a KNIME node come in only two types (model and data), whereas an Orange widget can pass many different kinds of signals, such as learners, classifiers, evaluation results, distance matrices, and dendrograms. Orange widgets are also coarser-grained than KNIME nodes, so the same analysis and mining task may need fewer widgets in Orange than nodes in KNIME. The advantage is that Orange is simpler to use; the disadvantage is that it offers less fine-grained control than KNIME.
Besides its user-friendly interface and ease of use, Orange provides a large number of visualization methods for displaying data and models graphically, can intelligently search for suitable visualizations, and supports interactive data exploration.
Orange is weak in traditional statistical analysis: it does not support statistical tests and has limited reporting capabilities. Orange's underlying core is also written in C++, and users can extend it with the Python scripting language (see http://www.scipy.org).
GGobi
Data visualization is an important part of data mining. GGobi (http://www.ggobi.org) is open-source software for interactive visualization that uses the brushing technique. GGobi can be used as a plug-in for R or called from scripting languages such as Perl and Python.
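For readers unfamiliar with brushing, the general idea can be sketched as follows: the user drags a rectangle over points in one plot, and the same records are highlighted in every other linked view. The sketch below is not GGobi's API; the `brush` and `render` functions, the field names, and the sample records are all hypothetical and exist only for illustration.

```python
# Hypothetical records; field names and values are illustrative only.
records = [
    {"id": 0, "height": 150, "weight": 50, "age": 23},
    {"id": 1, "height": 165, "weight": 60, "age": 35},
    {"id": 2, "height": 172, "weight": 75, "age": 41},
    {"id": 3, "height": 185, "weight": 90, "age": 29},
]


def brush(records, x_field, y_field, x_range, y_range):
    """Return the ids of records whose (x, y) point lies inside the brush rectangle."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    return {
        r["id"]
        for r in records
        if x_lo <= r[x_field] <= x_hi and y_lo <= r[y_field] <= y_hi
    }


def render(records, field, selected):
    """Text stand-in for a second, linked view: brushed records are marked with '*'."""
    return [
        f"{'*' if r['id'] in selected else ' '} id={r['id']} {field}={r[field]}"
        for r in records
    ]


# Brush a rectangle in the height/weight view, then show the linked age view.
selected = brush(records, "height", "weight", (160, 180), (55, 80))
for line in render(records, "age", selected):
    print(line)
```

The selection is kept as a set of record ids rather than plot coordinates, which is what lets any number of views share the same highlighted records.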
Conclusion
----
All of the software described above is excellent open-source data mining software, and each tool has its own strengths and weaknesses. Readers can choose according to their needs or combine several tools. Ordinary users can pick the user-friendly, easy-to-use tools; users who want to develop algorithms can choose the software that matches their development language (Java, R, C++, Python, etc.). The above software (except GGobi) basically provides most of the features we expect.
(idmer: I have tried all of the open-source software above. WEKA is very famous, but it is not convenient to use and its interface is plain. RapidMiner is currently gaining popularity, but its operation method differs greatly from that of commercial software: it does not support analysis flowcharts, and it is hard to read when there are many operators. KNIME and Orange both look good. The Orange interface looks refreshing, but I found that it does not support Chinese. I recommend installing both the WEKA and R extension packages.)
(idmer: my comments are purely personal opinions; criticism and discussion are welcome. In my actual work I rarely use open-source mining tools, and most of the time I use SAS Enterprise Miner.)