1. Writing a new learning scheme
Suppose you need to implement a special-purpose learning algorithm that is not included in WEKA, or you want to experiment with a machine learning scheme, or you simply want to learn more about the inner workings of an induction algorithm by actually programming it. This section uses a simple example to show how to write a classifier and how to make full use of WEKA's class hierarchy while doing so.
WEKA includes the elementary learning schemes listed in Table 15-1, mainly for educational purposes. None of them takes any scheme-specific command-line options. They are all useful for understanding the inner workings of a classifier. As an example, we describe weka.classifiers.trees.Id3, which implements the ID3 decision tree learner of Section 4.3.
Table 15-1 Simple learning schemes in WEKA

Scheme | Description
weka.classifiers.bayes.NaiveBayesSimple | Probabilistic learner
weka.classifiers.trees.Id3 | Decision tree learner
weka.classifiers.rules.Prism | Rule learner
weka.classifiers.lazy.IB1 | Instance-based learner
2. A classifier example
Figure 15-1 gives the source code of weka.classifiers.trees.Id3. As you can see from the code, it extends the Classifier class. Every classifier in WEKA must do this, whether it predicts a nominal class or a numeric one.
The first method in weka.classifiers.trees.Id3 is globalInfo(): we mention it here before moving on to the more interesting parts. When this scheme is selected in WEKA's graphical user interface, this method simply returns a string that is displayed on the screen.

package weka.classifiers.trees;
import weka.classifiers.*;
import weka.core.*;
import java.io.*;
import java.util.*;

/**
 * Class implementing an Id3 decision tree classifier.
 */
public class Id3 extends Classifier {
  /** The node's successors. */
  private Id3[] m_Successors;

  /** Attribute used for splitting. */
  private Attribute m_Attribute;

  /** Class value if node is leaf. */
  private double m_ClassValue;

  /** Class distribution if node is leaf. */
  private double[] m_Distribution;

  /** Class attribute of dataset. */
  private Attribute m_ClassAttribute;
  /**
   * Returns a string describing the classifier.
   * @return a description suitable for the GUI.
   */
  public String globalInfo() {
    return "Class for constructing an unpruned decision tree based on the ID3 "
      + "algorithm. Can only deal with nominal attributes. No missing values "
      + "allowed. Empty leaves may result in unclassified instances. For more "
      + "information see:\n\n"
      + "R. Quinlan (1986). \"Induction of decision "
      + "trees\". Machine Learning. Vol.1, No.1, pp. 81-106";
  }
  /**
   * Builds Id3 decision tree classifier.
   *
   * @param data the training data
   * @exception Exception if classifier can't be built successfully
   */
  public void buildClassifier(Instances data) throws Exception {
    if (!data.classAttribute().isNominal()) {
      throw new UnsupportedClassTypeException("Id3: nominal class, please.");
    }
    Enumeration enumAtt = data.enumerateAttributes();
    while (enumAtt.hasMoreElements()) {
      if (!((Attribute) enumAtt.nextElement()).isNominal()) {
        throw new UnsupportedAttributeTypeException("Id3: only nominal "
          + "attributes, please.");
      }
    }
    Enumeration enumInst = data.enumerateInstances();
    while (enumInst.hasMoreElements()) {
      if (((Instance) enumInst.nextElement()).hasMissingValue()) {
        throw new NoSupportForMissingValuesException("Id3: no missing values, "
          + "please.");
      }
    }
    data = new Instances(data);
    data.deleteWithMissingClass();
    makeTree(data);
  }
  /**
   * Method for building an Id3 tree.
   *
   * @param data the training data
   * @exception Exception if decision tree can't be built successfully
   */
  private void makeTree(Instances data) throws Exception {

    // Check if no instances have reached this node.
    if (data.numInstances() == 0) {
      m_Attribute = null;
      m_ClassValue = Instance.missingValue();
      m_Distribution = new double[data.numClasses()];
      return;
    }

    // Compute attribute with maximum information gain.
    double[] infoGains = new double[data.numAttributes()];
    Enumeration attEnum = data.enumerateAttributes();
    while (attEnum.hasMoreElements()) {
      Attribute att = (Attribute) attEnum.nextElement();
      infoGains[att.index()] = computeInfoGain(data, att);
    }
    m_Attribute = data.attribute(Utils.maxIndex(infoGains));

    // Make leaf if information gain is zero.
    // Otherwise create successors.
    if (Utils.eq(infoGains[m_Attribute.index()], 0)) {
      m_Attribute = null;
      m_Distribution = new double[data.numClasses()];
      Enumeration instEnum = data.enumerateInstances();
      while (instEnum.hasMoreElements()) {
        Instance inst = (Instance) instEnum.nextElement();
        m_Distribution[(int) inst.classValue()]++;
      }
      Utils.normalize(m_Distribution);
      m_ClassValue = Utils.maxIndex(m_Distribution);
      m_ClassAttribute = data.classAttribute();
    } else {
      Instances[] splitData = splitData(data, m_Attribute);
      m_Successors = new Id3[m_Attribute.numValues()];
      for (int j = 0; j < m_Attribute.numValues(); j++) {
        m_Successors[j] = new Id3();
        m_Successors[j].makeTree(splitData[j]);
      }
    }
  }
  /**
   * Classifies a given test instance using the decision tree.
   *
   * @param instance the instance to be classified
   * @return the classification
   */
  public double classifyInstance(Instance instance)
    throws NoSupportForMissingValuesException {

    if (instance.hasMissingValue()) {
      throw new NoSupportForMissingValuesException("Id3: no missing values, "
        + "please.");
    }
    if (m_Attribute == null) {
      return m_ClassValue;
    } else {
      return m_Successors[(int) instance.value(m_Attribute)].
        classifyInstance(instance);
    }
  }
  /**
   * Computes class distribution for instance using decision tree.
   *
   * @param instance the instance for which distribution is to be computed
   * @return the class distribution for the given instance
   */
  public double[] distributionForInstance(Instance instance)
    throws NoSupportForMissingValuesException {

    if (instance.hasMissingValue()) {
      throw new NoSupportForMissingValuesException("Id3: no missing values, "
        + "please.");
    }
    if (m_Attribute == null) {
      return m_Distribution;
    } else {
      return m_Successors[(int) instance.value(m_Attribute)].
        distributionForInstance(instance);
    }
  }
  /**
   * Prints the decision tree using the private toString method from below.
   *
   * @return a textual description of the classifier
   */
  public String toString() {

    if ((m_Distribution == null) && (m_Successors == null)) {
      return "Id3: no model built yet.";
    }
    return "Id3\n" + toString(0);
  }
  /**
   * Computes information gain for an attribute.
   *
   * @param data the data for which info gain is to be computed
   * @param att the attribute
   * @return the information gain for the given attribute and data
   */
  private double computeInfoGain(Instances data, Attribute att)
    throws Exception {

    double infoGain = computeEntropy(data);
    Instances[] splitData = splitData(data, att);
    for (int j = 0; j < att.numValues(); j++) {
      if (splitData[j].numInstances() > 0) {
        infoGain -= ((double) splitData[j].numInstances() /
                     (double) data.numInstances()) *
          computeEntropy(splitData[j]);
      }
    }
    return infoGain;
  }
  /**
   * Computes the entropy of a dataset.
   *
   * @param data the data for which entropy is to be computed
   * @return the entropy of the data's class distribution
   */
  private double computeEntropy(Instances data) throws Exception {

    double[] classCounts = new double[data.numClasses()];
    Enumeration instEnum = data.enumerateInstances();
    while (instEnum.hasMoreElements()) {
      Instance inst = (Instance) instEnum.nextElement();
      classCounts[(int) inst.classValue()]++;
    }
    double entropy = 0;
    for (int j = 0; j < data.numClasses(); j++) {
      if (classCounts[j] > 0) {
        entropy -= classCounts[j] * Utils.log2(classCounts[j]);
      }
    }
    entropy /= (double) data.numInstances();
    return entropy + Utils.log2(data.numInstances());
  }
  /**
   * Splits a dataset according to the values of a nominal attribute.
   *
   * @param data the data which is to be split
   * @param att the attribute to be used for splitting
   * @return the sets of instances produced by the split
   */
  private Instances[] splitData(Instances data, Attribute att) {

    Instances[] splitData = new Instances[att.numValues()];
    for (int j = 0; j < att.numValues(); j++) {
      splitData[j] = new Instances(data, data.numInstances());
    }
    Enumeration instEnum = data.enumerateInstances();
    while (instEnum.hasMoreElements()) {
      Instance inst = (Instance) instEnum.nextElement();
      splitData[(int) inst.value(att)].add(inst);
    }
    for (int i = 0; i < splitData.length; i++) {
      splitData[i].compactify();
    }
    return splitData;
  }
  /**
   * Outputs a tree at a certain level.
   *
   * @param level the level at which the tree is to be printed
   */
  private String toString(int level) {

    StringBuffer text = new StringBuffer();

    if (m_Attribute == null) {
      if (Instance.isMissingValue(m_ClassValue)) {
        text.append(": null");
      } else {
        text.append(": " + m_ClassAttribute.value((int) m_ClassValue));
      }
    } else {
      for (int j = 0; j < m_Attribute.numValues(); j++) {
        text.append("\n");
        for (int i = 0; i < level; i++) {
          text.append("|  ");
        }
        text.append(m_Attribute.name() + " = " + m_Attribute.value(j));
        text.append(m_Successors[j].toString(level + 1));
      }
    }
    return text.toString();
  }
  /**
   * Main method.
   *
   * @param args the options for the classifier
   */
  public static void main(String[] args) {
    try {
      System.out.println(Evaluation.evaluateModel(new Id3(), args));
    } catch (Exception e) {
      System.err.println(e.getMessage());
    }
  }
}
Figure 15-1 Source code of the ID3 decision tree learner
3. buildClassifier()
The buildClassifier() method constructs a classifier from a training dataset. ID3 cannot handle a nonnominal class, nonnominal attributes, or missing attribute values, so buildClassifier() first checks the data for these. Then it makes a copy of the training set (to avoid changing the original data) and calls a method of weka.core.Instances to delete all instances with missing class values, because these instances are useless during training. Finally it calls makeTree(), which recursively builds the decision tree by generating all the subtrees attached to the root node.
4. makeTree()
The first step in makeTree() is to check whether the dataset is empty. If it is, a leaf is created by setting m_Attribute to null. The class value m_ClassValue assigned to this leaf is set to missing, and the estimated probability of each class in m_Distribution is initialized to 0. If training instances are present, makeTree() finds the attribute that yields the greatest information gain for them. It first creates a Java enumeration of the dataset's attributes. If the index of the class attribute has been set, as it is for this dataset, the class attribute is automatically excluded from the enumeration.
Within the enumeration, each attribute's information gain is computed by computeInfoGain() and stored in an array; we return to this method later. The index() method of weka.core.Attribute returns the attribute's index in the dataset, which is used to index that array. Once the enumeration is complete, the attribute with the greatest information gain is stored in the instance variable m_Attribute. The maxIndex() method of weka.core.Utils returns the index of the largest value in an array of integers or doubles. (If there is more than one element with the maximum value, the first one is returned.) The index of this attribute is passed to the attribute() method of weka.core.Instances, which returns the corresponding attribute.
You might wonder what happens to the array element that corresponds to the class attribute. We need not worry about this because Java automatically initializes all elements of the array to 0, and the information gain is always greater than or equal to 0. If the maximum information gain is 0, makeTree() creates a leaf. In that case m_Attribute is set to null, and makeTree() computes both the distribution of class probabilities and the class with the greatest probability. (The normalize() method of weka.core.Utils normalizes an array of doubles so that its elements sum to 1.)
When it makes a leaf with a class value assigned to it, makeTree() stores the class attribute in m_ClassAttribute. This is because the method that outputs the decision tree needs to access this attribute to print the class label.
If an attribute with nonzero information gain is found, makeTree() splits the dataset according to the attribute's values and recursively builds a subtree for each of the new datasets. To make the split it calls the method splitData(). This creates as many empty datasets as there are attribute values, stores them in an array (setting the initial capacity of each dataset to the number of instances in the original dataset), and then iterates through all instances in the original dataset, allocating each of them to the new dataset that corresponds to its value of the splitting attribute. It then reduces memory requirements by compacting the Instances objects. Back in makeTree(), the resulting array of datasets is used to build the subtrees: the method creates an array of Id3 objects, one for each attribute value, and calls makeTree() on each of them, passing it the corresponding dataset.
5. computeInfoGain()
Returning to computeInfoGain(), the information gain associated with an attribute and a dataset is calculated by a straightforward implementation of the formula described in Section 4.3. First the entropy of the dataset is computed. Then splitData() is used to divide it into subsets, and computeEntropy() is called on each one. Finally, the difference between the former entropy and the weighted sum of the latter ones, the information gain, is returned. The method computeEntropy() uses the log2() method of weka.core.Utils to obtain the base-2 logarithm of a number.
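For reference, the quantity these two methods compute can be written as follows (this is simply the standard ID3 information gain formula; the symbols D, A, D_v, and p_c are introduced here for illustration and do not appear in the code):

\[
\mathrm{Gain}(D, A) = H(D) - \sum_{v \in \mathrm{values}(A)} \frac{|D_v|}{|D|}\, H(D_v),
\qquad
H(D) = -\sum_{c} p_c \log_2 p_c
\]

where D_v is the subset of instances for which attribute A takes value v, and p_c is the proportion of instances in D that belong to class c. Note that computeEntropy() accumulates raw class counts rather than proportions and rescales at the end (dividing by the number of instances and adding its base-2 logarithm), which gives the same result.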
6. classifyInstance()
Having seen how ID3 constructs a decision tree, we now examine how it uses the tree structure to predict class values and probabilities. Every classifier must implement the classifyInstance() method or the distributionForInstance() method, or both. The Classifier superclass contains default implementations of both methods. The default implementation of classifyInstance() calls distributionForInstance(). If the class is nominal, it predicts the class with the greatest probability, or a missing value if all probabilities returned by distributionForInstance() are zero. If the class is numeric, distributionForInstance() must return a single-element array holding the numeric prediction, which classifyInstance() extracts and returns. Conversely, the default implementation of distributionForInstance() wraps the prediction obtained from classifyInstance() into a single-element array. If the class is nominal, distributionForInstance() assigns a probability of 1 to the class predicted by classifyInstance() and a probability of 0 to the others; if classifyInstance() returns a missing value, all probabilities are set to 0. weka.classifiers.trees.Id3 overrides both methods.
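To make the division of labor concrete, the default behavior described above corresponds roughly to the following logic (a simplified sketch written for illustration, not the actual source of the Classifier superclass):

  // Simplified sketch of the default classifyInstance() behavior.
  public double classifyInstance(Instance instance) throws Exception {
    double[] dist = distributionForInstance(instance);
    if (instance.classAttribute().isNominal()) {
      double max = 0;
      int maxIndex = 0;
      for (int i = 0; i < dist.length; i++) {
        if (dist[i] > max) {
          maxIndex = i;
          max = dist[i];
        }
      }
      // If every probability is zero, the prediction is a missing value.
      return (max > 0) ? (double) maxIndex : Instance.missingValue();
    } else {
      // Numeric class: the single array element is the prediction itself.
      return dist[0];
    }
  }

A classifier such as Id3 that overrides both methods bypasses this default entirely.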
Let's look first at classifyInstance(), which predicts a class value for a given instance. As mentioned in the previous section, nominal class values, like nominal attribute values, are coded and stored as double variables representing the index of the value's name in the attribute declaration. This coding, rather than a more elegant object-oriented representation, is used in the interest of execution speed. In the ID3 implementation, classifyInstance() first checks whether there are missing values in the instance to be classified; if so, it throws an exception. Otherwise, it descends the tree recursively, guided by the instance's attribute values, until a leaf is reached. Then it returns the class value m_ClassValue stored at that leaf. Note that this value may itself be missing, in which case the instance is left unclassified. The distributionForInstance() method works in exactly the same way, returning the probability distribution stored in m_Distribution.
Most machine learning models, and decision trees in particular, serve as a more or less comprehensible description of the structure of the data. Accordingly, each WEKA classifier, like many other Java objects, implements a toString() method that produces a textual representation of itself as a String variable. ID3's toString() method outputs a decision tree in roughly the same format as J4.8 (Figure 10-5). It recursively prints the tree structure into a String variable by accessing the attribute information stored at the nodes, using the name() and value() methods of weka.core.Attribute to obtain the name and values of each attribute. Empty leaves without a class value are indicated by the string null.
7. main()
The only method of weka.classifiers.trees.Id3 not yet described is main(), which is called whenever the class is executed from the command line. As you can see, it is simple: it basically just tells WEKA's Evaluation class to evaluate Id3 with the given command-line options and prints the resulting string. The one-line expression that does this is enclosed in a try-catch statement, which catches the various exceptions that can be thrown by WEKA's routines or other Java methods.
The evaluateModel() method in weka.classifiers.Evaluation interprets the generic command-line options that apply to any learning scheme, discussed in Section 13.3, and acts appropriately. For example, it takes the -t option, which gives the name of the training file, and loads the corresponding dataset. If no test file is given, it performs a cross-validation by creating a classifier and repeatedly calling buildClassifier() and classifyInstance() or distributionForInstance() on different subsets of the training data. Unless the user suppresses output of the model by setting the corresponding command-line option, it also calls toString() to output the model built from the full training dataset.
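For example, assuming the class has been compiled and a training file named weather.arff is at hand (the file name here is only illustrative), the learner can be run and evaluated from the command line with

  java weka.classifiers.trees.Id3 -t weather.arff

which loads the dataset given by -t, builds the model, and prints it together with the evaluation statistics.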
What happens if the scheme needs to interpret a specific option, such as a pruning parameter? This is accomplished using the OptionHandler interface in weka.core. A classifier that implements this interface contains three methods, listOptions(), setOptions(), and getOptions(), which can be used to list the classifier's scheme-specific options, to set some of them, and to get the options that are currently set. If a classifier implements the OptionHandler interface, the evaluateModel() method in Evaluation calls these methods automatically: after processing the generic options, it calls setOptions() to process the remaining options before it uses buildClassifier() to generate a new classifier and output it, and when outputting the classifier it uses getOptions() to print a list of the options that are currently set. A simple example of how to implement these methods can be found in the source code of weka.classifiers.rules.OneR.
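As a rough illustration of the shape these methods take, suppose a classifier has a hypothetical pruning parameter that should be set with a -P option (the field name, default value, and option letter are invented here; weka.classifiers.rules.OneR shows a real implementation). Assuming imports of weka.core.Option, weka.core.Utils, and java.util.*, the three methods might look like this:

  /** Hypothetical scheme-specific option (for illustration only). */
  private double m_PruningParameter = 0.25;

  public Enumeration listOptions() {
    Vector newVector = new Vector(1);
    newVector.addElement(new Option("\tPruning parameter.\n\t(default 0.25)",
                                    "P", 1, "-P <pruning parameter>"));
    return newVector.elements();
  }

  public void setOptions(String[] options) throws Exception {
    String pruningString = Utils.getOption('P', options);
    if (pruningString.length() != 0) {
      m_PruningParameter = Double.parseDouble(pruningString);
    }
  }

  public String[] getOptions() {
    return new String[] {"-P", "" + m_PruningParameter};
  }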
The OptionHandler interface makes it possible to set options from the command line. To set them from the graphical user interface, WEKA uses the Java Beans framework. All that is required is that the classifier provides set...() and get...() methods for each parameter. For example, a pruning parameter would need setPruningParameter() and getPruningParameter() methods. A further method, pruningParameterTipText(), should return a description of the parameter that is displayed in the graphical user interface. Again, see weka.classifiers.rules.OneR for an example.
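Continuing the hypothetical pruning parameter from the previous sketch, the bean-style methods simply follow the naming convention just described (again an invented example, not code from an actual WEKA class):

  /** Returns the tip text for this property, shown in the GUI. */
  public String pruningParameterTipText() {
    return "The pruning parameter used when building the tree.";
  }

  /** Sets the pruning parameter. */
  public void setPruningParameter(double value) {
    m_PruningParameter = value;
  }

  /** Gets the current value of the pruning parameter. */
  public double getPruningParameter() {
    return m_PruningParameter;
  }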
Some classifiers can be incrementally updated as new training instances arrive one by one; they do not have to process the whole dataset in one batch. In WEKA, incremental classifiers implement the UpdateableClassifier interface in weka.classifiers. This interface declares only one method, updateClassifier(), which takes a single training instance as its argument. For an example of how to use this interface, see the source code of weka.classifiers.lazy.IBk.
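The pattern is simple enough to show in a toy example. The following sketch (the class and its counting logic are made up purely for illustration; weka.classifiers.lazy.IBk is a real implementation) processes instances one at a time and merely keeps class counts:

  import weka.classifiers.*;
  import weka.core.*;
  import java.util.*;

  /** Toy incremental scheme that only tracks class counts (illustration only). */
  public class ToyIncremental extends Classifier implements UpdateableClassifier {

    private double[] m_Counts;

    public void buildClassifier(Instances data) throws Exception {
      m_Counts = new double[data.numClasses()];
      Enumeration instEnum = data.enumerateInstances();
      while (instEnum.hasMoreElements()) {
        updateClassifier((Instance) instEnum.nextElement());
      }
    }

    /** Incorporates a single new training instance into the model. */
    public void updateClassifier(Instance instance) throws Exception {
      if (!instance.classIsMissing()) {
        m_Counts[(int) instance.classValue()]++;
      }
    }

    public double classifyInstance(Instance instance) {
      return Utils.maxIndex(m_Counts);
    }
  }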
If a classifier is able to make use of instance weights, it should implement the WeightedInstancesHandler interface in weka.core. Then other algorithms, such as those for boosting, can take advantage of this property.
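WeightedInstancesHandler is a marker interface that declares no methods; what changes is that the learner accumulates instance weights rather than counting each instance as 1. For example, a weighted version of the class-distribution loop in makeTree() would look roughly like this (a sketch, not part of the Id3 code in Figure 15-1):

  // Weighted class counts: add the instance weight instead of 1.
  Enumeration instEnum = data.enumerateInstances();
  while (instEnum.hasMoreElements()) {
    Instance inst = (Instance) instEnum.nextElement();
    m_Distribution[(int) inst.classValue()] += inst.weight();
  }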
There are many other useful interfaces for classifiers, for example Randomizable, Summarizable, Drawable, and Graphable. For more information on what these do, see the Javadoc for the corresponding interfaces in weka.core.
8. Conventions for implementing classifiers
There are some conventions that you must obey when implementing classifiers in WEKA. If you do not, things will go awry; for example, WEKA's evaluation module might not compute the classifier's statistics properly when evaluating it.
The first convention has already been mentioned: each time a classifier's buildClassifier() method is called, it must reset the model. The CheckClassifier class performs tests to ensure that this is the case. When buildClassifier() is called on a dataset, the same result must always be obtained, regardless of how often the classifier has previously been applied to the same or other datasets. However, buildClassifier() must not reset instance variables that correspond to scheme-specific options, because those settings, once set, must persist through multiple calls of buildClassifier(). Also, calling buildClassifier() must never change the input data.
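In practice this means that buildClassifier() should begin by clearing any state learned from previous data while leaving option fields alone, and should work on a copy of the data if it needs to modify it. A rough sketch of the pattern for an Id3-like learner (illustrative only, not taken from a particular WEKA class):

  public void buildClassifier(Instances data) throws Exception {
    // Reset everything learned from earlier calls...
    m_Successors = null;
    m_Attribute = null;
    m_Distribution = null;
    // ...but leave user-set options (e.g., a pruning parameter) untouched.
    // Copy the data so the caller's dataset is never modified.
    data = new Instances(data);
    data.deleteWithMissingClass();
    makeTree(data);
  }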
Two other conventions have already been mentioned. One is that when a classifier cannot make a prediction, its classifyInstance() method must return Instance.missingValue() and its distributionForInstance() method must return probabilities of zero for all classes. The ID3 implementation in Figure 15-1 does this. Another is that for classifiers used for numeric prediction, classifyInstance() must return the numeric value that the classifier predicts. Some classifiers can handle both nominal classes, with class probabilities, and numeric class values; weka.classifiers.lazy.IBk is an example. These classifiers implement the distributionForInstance() method, which, if the class is numeric, returns a single-element array whose only element contains the predicted numeric value.
A final convention is not absolutely essential but is useful nonetheless: every classifier should implement a toString() method that outputs a textual description of itself.