http://blog.csdn.net/yiweis/article/category/1315006
I. Orange Data Format
In addition to C4.5 and other formats, the data mining tool Orange also has its own native data format.
Native Data Format
Unlike C4.5, the native data format does not consist of multiple files but of a single file. This file has the extension .tab.
The first line lists the attribute names, separated by tabs.
The second line gives the type of each attribute. Continuous attributes are marked with c and discrete attributes with d.
The third line carries optional extra information about each attribute. For example, class marks a column as the class attribute, and i tells Orange to ignore a column during mining.
The following example is the well-known Iris data set:
sepal length	sepal width	petal length	petal width	iris
c	c	c	c	d
				class
5.1 3.5 1.4 0.2 iris-setosa
4.9 3.0 1.4 0.2 iris-setosa
4.7 3.2 1.3 0.2 iris-setosa
4.6 3.1 1.5 0.2 iris-setosa
5.0 3.6 1.4 0.2 iris-setosa
5.4 3.9 1.7 0.4 iris-setosa
4.6 3.4 1.4 0.3 iris-setosa
5.0 3.4 1.5 0.2 iris-setosa
4.4 2.9 1.4 0.2 iris-setosa
4.9 3.1 1.5 0.1 iris-setosa
5.4 3.7 1.5 0.2 iris-setosa
4.8 3.4 1.6 0.2 iris-setosa
4.8 3.0 1.4 0.1 iris-setosa
4.3 3.0 1.1 0.1 iris-setosa
5.8 4.0 1.2 0.2 iris-setosa
5.7 4.4 1.5 0.4 iris-setosa
5.4 3.9 1.3 0.4 iris-setosa
5.1 3.5 1.4 0.3 iris-setosa
5.7 3.8 1.7 0.3 iris-setosa
......
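The three header lines described above can be read with plain Python, without Orange. The following is a minimal sketch, not Orange's own loader: the function name read_tab is hypothetical, the sample text is a shortened copy of the Iris example, and it assumes that fields are separated by single tabs.

```python
import csv
import io


def read_tab(stream):
    """Parse an Orange-style .tab stream into (names, types, flags, rows).

    Hypothetical helper for illustration; it only understands the
    three-line header (names, types, flags) followed by data lines.
    """
    reader = csv.reader(stream, delimiter="\t")
    names = next(reader)   # line 1: attribute names
    types = next(reader)   # line 2: c = continuous, d = discrete
    flags = next(reader)   # line 3: e.g. "class" or "i" (ignore)
    # Pad short header lines so every attribute has a type and a flag.
    types += [""] * (len(names) - len(types))
    flags += [""] * (len(names) - len(flags))
    rows = [row for row in reader if row]  # skip empty lines
    return names, types, flags, rows


# Shortened copy of the Iris example, with explicit tab separators.
sample = (
    "sepal length\tsepal width\tpetal length\tpetal width\tiris\n"
    "c\tc\tc\tc\td\n"
    "\t\t\t\tclass\n"
    "5.1\t3.5\t1.4\t0.2\tiris-setosa\n"
    "4.9\t3.0\t1.4\t0.2\tiris-setosa\n"
)

names, types, flags, rows = read_tab(io.StringIO(sample))
class_index = flags.index("class")
print(names[class_index], types[class_index], len(rows))
```

Because attribute names such as "sepal length" contain spaces, the tab separator is what keeps the columns apart, so the sketch splits on tabs only.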
For the C4.5 data format, see:
http://www.cs.washington.edu/dm/vfml/appendixes/c45.htm
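To make the "single file versus multiple files" difference concrete, here is a rough sketch that turns the contents of one .tab table into the text of a C4.5 .names file and a .data file. The function name to_c45 is hypothetical, and the sketch makes simplifying assumptions: class labels and discrete values are collected from the rows themselves, so values that do not occur in the data would be missing from the .names text.

```python
def to_c45(names, types, class_index, rows):
    """Return (names_text, data_text) for a C4.5 .names/.data file pair.

    Hypothetical helper for illustration only.
    """
    # First entry of the .names file: the class labels, ended by a period.
    classes = sorted({row[class_index] for row in rows})
    lines = [", ".join(classes) + ".", ""]
    # One line per attribute: "name: continuous." or "name: v1, v2."
    for i, (name, kind) in enumerate(zip(names, types)):
        if i == class_index:
            continue
        if kind == "c":
            lines.append(name + ": continuous.")
        else:
            values = sorted({row[i] for row in rows})
            lines.append(name + ": " + ", ".join(values) + ".")
    # The .data file: comma-separated values, class label last.
    data_lines = []
    for row in rows:
        attrs = [v for i, v in enumerate(row) if i != class_index]
        data_lines.append(",".join(attrs + [row[class_index]]))
    return "\n".join(lines) + "\n", "\n".join(data_lines) + "\n"


names = ["sepal length", "sepal width", "petal length", "petal width", "iris"]
types = ["c", "c", "c", "c", "d"]
rows = [
    ["5.1", "3.5", "1.4", "0.2", "iris-setosa"],
    ["4.9", "3.0", "1.4", "0.2", "iris-setosa"],
]
names_text, data_text = to_c45(names, types, 4, rows)
print(names_text)
print(data_text)
```

The attribute names and types that Orange keeps in the header lines of one file end up in the .names file, while the examples go into a separate .data file.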
II. Clustering
import Orange

# Load the data
data = Orange.data.Table("iris")

# Hierarchical clustering. By default, the similarity between
# clusters is computed with average linkage.
root = Orange.clustering.hierarchical.clustering(data)
labels = [str(d.get_class()) for d in data]

# Write the resulting dendrogram to hclust-dendrogram.png
Orange.clustering.hierarchical.dendrogram_draw("hclust-dendrogram.png", root, labels=labels)
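The average-linkage rule mentioned in the comment above can be illustrated without Orange. The following is a naive sketch of bottom-up clustering, not Orange's implementation: the function names are hypothetical, the distance is assumed to be Euclidean, the first three points are rows from the Iris example, and the last two are made-up values chosen to lie far away from them.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def average_linkage(c1, c2):
    """Mean distance over all pairs with one point from each cluster."""
    total = sum(euclidean(p, q) for p in c1 for q in c2)
    return total / (len(c1) * len(c2))


def agglomerate(points, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]  # start with one cluster per point
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe: combinations guarantees j > i
    return clusters


points = [
    (5.1, 3.5, 1.4, 0.2),  # rows from the Iris example
    (4.9, 3.0, 1.4, 0.2),
    (4.7, 3.2, 1.3, 0.2),
    (9.0, 9.0, 9.0, 9.0),  # made-up points, for illustration only
    (9.2, 9.1, 8.9, 9.0),
]
clusters = agglomerate(points, 2)
print(sorted(len(c) for c in clusters))
```

A real implementation records every merge and its height so that a dendrogram can be drawn; this sketch only keeps the final clusters.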
import Orange

# Load the data
iris = Orange.data.Table("iris")

# k-nearest neighbors classifier with k = 10
knn = Orange.classification.knn.kNNLearner(iris, k=10)
for i in iris:
    # Print only the predictions that differ from the actual class
    if i.getclass() != knn(i):
        print i.getclass(), knn(i)
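What the k-nearest neighbors classifier does with each example can be sketched in a few lines of plain Python. This is a simplified illustration, not Orange's kNNLearner: it assumes Euclidean distance and an unweighted majority vote, and the function name knn_predict and the toy training data are made up.

```python
import math
from collections import Counter


def euclidean(p, q):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_predict(train, query, k):
    """Predict the label of query by majority vote among its k nearest
    training examples. train is a list of (features, label) pairs."""
    nearest = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up two-class toy data, for illustration only.
train = [
    ((1.0, 1.0), "a"), ((1.2, 0.9), "a"), ((0.8, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]

# Same check as in the Orange snippet: collect only the examples whose
# predicted class differs from the actual class.
mismatches = [
    (label, knn_predict(train, features, 3))
    for features, label in train
    if knn_predict(train, features, 3) != label
]
print(mismatches)
```

As in the Orange snippet, the classifier is evaluated on its own training data, so each example counts itself among its neighbors and the error shown this way is optimistic.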
III. C4.5 decision tree
Installing C4.5 for Orange
- Download http://www.rulequest.com/personal/c4.5r8.tar.gz and decompress it.
- Download ensemble.c and buildc45.py to the Src subfolder of the folder decompressed in the previous step.
- Run buildc45.py.
import Orange

iris = Orange.data.Table("iris")
tree = Orange.classification.tree.C45Learner(iris)

print "\n\nC4.5 with default arguments"
# Compare the predicted and the actual class for the first five examples
for i in iris[:5]:
    print tree(i), i.getclass()