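Source code project files (VS2005): http://d.download.csdn.net/down/1018461/cctt_1
Some time ago I found a piece of code online, but it needed changes in several places, and I also wanted to practice my own C++ coding along the way.
First I needed a real tree structure, so I reused my earlier GeneralTree.h (with some changes to it: I added a few functions and made some private functions public).
The training text is formatted as follows. Name it decision2.txt and put it in your project directory (or change the relevant source code). The first column is the class (概念, concept); the remaining columns are the attributes 顏色 (color), 形狀 (shape) and 輕重 (weight). The values are 蘋果 apple, 香蕉 banana, 草莓 strawberry, 西瓜 watermelon, 桔子 tangerine; 紅 red, 綠 green, 黃 yellow, 桔黃 orange; 球 sphere, 橢球 ellipsoid, 彎月 crescent; 一般 medium, 輕 light, 重 heavy.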
概念 顏色 形狀 輕重
蘋果 紅 球 一般
蘋果 綠 球 一般
香蕉 黃 彎月 一般
草莓 紅 球 輕
草莓 綠 球 輕
西瓜 綠 橢球 重
西瓜 綠 球 重
桔子 桔黃 橢球 輕
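The test text is formatted as follows. Name it test.txt and put it in the project directory (or try changing the source code). It has the same attribute columns as the training file, without the class column.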
顏色 形狀 輕重
紅 球 一般
綠 球 一般
黃 彎月 一般
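The classes here really ought to be split into separate files, but I kept them together so the code is easier to read in one place.
Here is the code: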
/* created by chico chen
 * date 2009/02/02
 * Please credit the source when reposting.
 */
#include "stdafx.h"
#include <iostream>
#include <fstream>
#include <string>
#include <sstream>
#include <vector>
#include <map>
#include <cmath>
#include "D://Tools//not Finished//TreeTest//TreeTest//GeneralTree.h"
using namespace std;

// this class is for computing attribute entropy
class AttribDiff
{
public:
    string attribName;                        // attribute name
    map<string,int> attribNum;                // attribute value -> number of rows
    map<string,map<string,int> > typeNumber;
    // Outer key: attribute value; inner key: type (class);
    // int: number of rows of that type among the rows with that value.
    // Example:  visible  type        shape
    //           1        watermelon  round
    //           1        wax gourd   flat
    //           0        tangerine   round
    // For the value "round" and the type "watermelon", the count is the
    // number of rows whose shape is round and whose type is watermelon.

    AttribDiff(const string& attribName)
    {
        this->attribName = attribName;
    }

    // computes the decision value G (weighted entropy) of one attribute
    // over the rows that are still visible
    double AttribDifferComputer(const vector<vector<string> >& infos, int i_attrib, int i_types, const vector<int>& visible)
    {
        double probability = 0;
        double entropy = 0;
        double attribG = 0;
        for(size_t i = 0; i < infos.size(); i++)
        {
            if(visible[i] != 0)
            {
                ++attribNum[infos[i][i_attrib]];
                ++typeNumber[infos[i][i_attrib]][infos[i][i_types]];
            }
        }
        map<string,int>::iterator i_number;
        map<string,int>::iterator i_type;

        for(i_number = attribNum.begin(); i_number != attribNum.end(); i_number++)
        {
            // The divisor is the size of the whole training set, not the number
            // of visible rows. Inside one node this scales every attribute's G
            // by the same factor, so the attribute with the smallest G is the same.
            probability = (*i_number).second/(double)infos.size();
            cout << (*i_number).first << "機率為:" << probability << endl;
            entropy = 0;

            for(i_type = typeNumber[(*i_number).first].begin(); i_type != typeNumber[(*i_number).first].end(); i_type++)
            {
                entropy += ComputerEntropyHelp((*i_type).second/(double)(*i_number).second);
            }

            attribG += (-1)*probability * entropy;
        }

        return attribG;
    }

    // returns pi*log2(pi), one term of the entropy sum before negation
    double ComputerEntropyHelp(double pi)
    {
        return pi*log(pi)/log((double)2);
    }
};

// this class is the data structure stored in a general tree node
class NodeInfo
{
public:
    // e.g. color
    //        red
    //        blue
    string attribName;           // the attribute name, such as 顏色 (color)
    vector<string> detailAttrib; // all values under this attribute name, for example 紅 (red)

    NodeInfo()
    {
        attribName = "";
    }
    NodeInfo(const string & attribName)
    {
        this->attribName = attribName;
    }
    NodeInfo(const NodeInfo & ni)
    {
        this->attribName = ni.attribName;
        this->detailAttrib = ni.detailAttrib;
    }
    // add a value to this NodeInfo if it is not already present
    void NodeDetailInfoAdd(const string & detailA)
    {
        if(!CheckIsHas(detailA))
        {
            this->detailAttrib.push_back(detailA);
        }
    }
    // If the value is already in the detailAttrib list, return true;
    // else return false;
    bool CheckIsHas(const string & name)
    {
        for(size_t i = 0; i < detailAttrib.size(); i++)
        {
            if(name == detailAttrib[i])
            {
                // the same attribute value
                return true;
            }
        }
        return false;
    }
    // print callback used when printing the tree
    static void Print(NodeInfo& info)
    {
        cout << info.attribName << "(";

        for(size_t i = 0; i < info.detailAttrib.size(); i++)
        {
            cout << info.detailAttrib[i] << " ";
        }
        cout << ")\n";
    }
};

// this class is the decision tree
class DT
{
protected:
    const string filename;         // the data file name
    vector<vector<string> > infos; // the array for storing the training rows
    vector<string> attribs;        // the array for storing the attribute names
    GeneralTree<NodeInfo> gt;      // the general tree for storing the decision tree
    const int START;               // the first attribute column, not counting the type column
    const int I_TYPE;              // the column index of the type
    const int MAX_ENTROPY;         // upper bound used when searching for the minimal entropy
private:
    // to help print: one ".." per tree level
    void PrintHelp(int helpPrint)
    {
        for(int i = 0; i < helpPrint; i++)
        {
            cout << "..";
        }
    }
    // to find the index of attribName in the attribs array, or -1
    int Find(const string& attribName, const vector<string>& attribs)
    {
        for(size_t i = 0; i < attribs.size(); i++)
        {
            if(attribName == attribs[i])
            {
                // the same
                return (int)i;
            }
        }
        return -1;
    }
    // this function detects whether the algorithm is finished for this branch:
    // true if the visible rows contain at most one type (returned in `type`)
    bool CheckOver(vector<int>& visible, string& type)
    {
        map<string,int> types;
        int temp = 0;
        for(size_t i = 0; i < infos.size(); i++)
        {
            if(visible[i] != 0)
            {
                type = infos[i][I_TYPE];
                temp = types[infos[i][I_TYPE]];
                if(temp == 0)
                {
                    types[infos[i][I_TYPE]] = 1;
                }
                if(types.size() > 1)
                {
                    return false; // there is more than one type
                }
            }
        }
        return true; // there is only one type (or no visible row at all)
    }
    // to create a Decision Tree
    void DTCreate(GeneralTreeNode<NodeInfo> *parent, vector<int> visible, vector<int> visibleA, int i_attrib, const string& detailA, int helpPrint)
    {
        if(i_attrib >= START)
        {
            for(size_t i = 0; i < infos.size(); i++)
            {
                if(infos[i][i_attrib] != detailA)
                {
                    // not the same attribute value
                    visible[i] = 0;
                }
            }
        }
        string type = "";
        if(CheckOver(visible,type))
        {
            // the algorithm is finished here: add the type node into the general tree
            NodeInfo n(type);
            GeneralTreeNode<NodeInfo> * node = new GeneralTreeNode<NodeInfo>(n);
            gt.Insert(node,parent);
            PrintHelp(helpPrint);
            cout << "decision type:" << n.attribName << endl;
            return;
        }

        map<string,double> attribGs; // this is for deciding which attribute should be used

        for(size_t i = START; i < attribs.size(); i++)
        {
            // iterate over the attributes that have not been used yet
            if(visibleA[i] != 0)
            {
                AttribDiff ad(attribs[i]);
                attribGs[attribs[i]] = ad.AttribDifferComputer(infos,(int)i,I_TYPE,visible);
                cout << attribs[i] << "的G為:" << attribGs[attribs[i]] << endl;
            }
        }
        if(attribGs.empty())
        {
            // No unused attribute is left but the visible rows still mix types.
            // Guard added while cleaning up the code: without it Find() below
            // returns -1 and infos[i][-1] is read.
            throw "no attribute left to split on";
        }
        // to find the decision attribute: the one with the smallest G
        // (with >=, ties go to the last key in map order)
        double min = MAX_ENTROPY;
        string attributeName;
        for(map<string,double>::iterator i = attribGs.begin(); i != attribGs.end(); i++)
        {
            if(min >= (*i).second)
            {
                attributeName = (*i).first;
                min = (*i).second;
            }
        }
        NodeInfo n(attributeName);
        int i_max = Find(attributeName,attribs);
        // All rows are scanned here, not only the visible ones, so a branch can be
        // created for a value that does not occur in the current subset; that
        // branch ends in a leaf with an empty type.
        for(size_t i = 0; i < infos.size(); i++)
        {
            n.NodeDetailInfoAdd(infos[i][i_max]);
        }
        GeneralTreeNode<NodeInfo> * node = new GeneralTreeNode<NodeInfo>(n);
        gt.Insert(node,parent);
        visibleA[i_max] = 0;
        PrintHelp(helpPrint);
        cout << "choose attribute:" << attributeName << endl;
        for(size_t i = 0; i < node->data.detailAttrib.size(); i++)
        {
            PrintHelp(helpPrint);
            cout << "go into the branch:" << node->data.detailAttrib[i] << endl;
            // go into every branch and decide there
            DTCreate(node,visible,visibleA,i_max,node->data.detailAttrib[i],helpPrint+1);
        }
    }
public:
    // Note that decision2.txt must be in the project directory.
    // You can of course write an absolute path if you prefer.
    // Note the file format:
    // the first column is the type (class), followed by the attributes.
    // Example:  type        shape
    //           watermelon  round
    //           wax gourd   flat
    //           tangerine   round
    DT():filename("decision2.txt"),START(1),I_TYPE(0),MAX_ENTROPY(10000)
    {
        GetInfo(attribs,infos,filename);
        DTCreate();
    }

    // this function reads data from the file
    // and fills the attribute array and the information array
    // pre:  filename is not empty and the file exists
    // post: attribs has at least one element
    //       infos has at least one element
    void GetInfo(vector<string>& attribs, vector<vector<string> >& infos, const string& filename)
    {
        ifstream read(filename.c_str());

        string info = "";
        getline(read,info);
        istringstream iss(info);
        string attrib;

        while(iss >> attrib)
        {
            attribs.push_back(attrib);
        }
        while(true)
        {
            info = "";
            getline(read,info);
            if(info.length() <= 1)
            {
                // end of file, or an empty (or one-character) line
                break;
            }
            vector<string> infoline;
            istringstream stream(info);

            while(stream >> attrib)
            {
                infoline.push_back(attrib);
            }
            infos.push_back(infoline);
        }
        read.close();
    }
    // create the DT
    void DTCreate()
    {
        vector<int> visible(infos.size(),1);
        vector<int> visibleA(attribs.size(),1); // marks which attributes are still unused
        string temp = "";
        DTCreate(NULL,visible,visibleA,START-1,temp,0);
    }
    // print the DT
    void Print()
    {
        gt.LevelPrint(NodeInfo::Print);
    }
    // classify every row of the test file and write the result file
    void Judge(const string& testFilename, vector<string>& types, const string& testResultFileName)
    {
        vector<string> attribs_test;
        vector<vector<string> > infos_test;
        GetInfo(attribs_test,infos_test,testFilename);

        if(!CheckFileFormat(attribs_test))
        {
            throw "file format error";
        }
        GeneralTreeNode<NodeInfo> * root = gt.GetRoot();
        for(size_t i = 0; i < infos_test.size(); i++)
        {
            types.push_back(JudgeType(root,infos_test[i],attribs_test));
        }
        WriteTestTypesInfo(testResultFileName,types);
    }
    void WriteTestTypesInfo(const string& filename, vector<string>& types)
    {
        ofstream out(filename.c_str());
        out << "類別" << endl;
        for(size_t i = 0; i < types.size(); i++)
        {
            out << types[i] << endl;
        }
        out.close();
    }
    // walk down the tree for one test row and return the type at the leaf
    string JudgeType(GeneralTreeNode<NodeInfo> * node, vector<string>& info, vector<string>& attribs_test)
    {
        if(gt.GetChildNodeNum(node) == 0)
        {
            return node->getData().attribName;
        }
        int index = Find(node->getData().attribName,attribs_test);
        int branch_index = Find(info[index],node->getData().detailAttrib);
        if(branch_index == -1)
        {
            // this value was not found in the node's detailAttrib
            // there are two ways to deal with this situation
            // 1. treat every branch as a possible choice
            // 2. report that there is no such type and refuse to judge
            // the first solution lowers the accuracy
            // the second solution has no fault tolerance
            // and here I choose the second solution.
            // if I have more free time later, I will write the first solution
            throw "no such type";
        }
        GeneralTreeNode<NodeInfo> * childNode = gt.GetAChild(node,branch_index);
        return JudgeType(childNode,info,attribs_test);
    }
    // the test header must contain only known attributes and must have
    // exactly one column fewer than the training header (no type column)
    bool CheckFileFormat(vector<string>& attribs_test)
    {
        bool isCorrect = true;
        for(size_t j = 0; j < attribs_test.size(); j++)
        {
            if(Find(attribs_test[j],attribs) == -1)
            {
                isCorrect = false;
            }
        }
        return isCorrect && attribs_test.size() == attribs.size() - 1;
    }
};
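The main function looks like this (I used VS2005):
int _tmain(int argc, _TCHAR* argv[])
{
    DT dt;
    //dt.Print();
    string testFile = "test.txt";
    string testResult = "testResult.txt";
    vector<string> types;
    dt.Judge(testFile,types,testResult);
    return 0;
}
I think the comments in DT are fairly detailed, so I won't explain much more in this blog post. The code writes the test results to testResult.txt in the project directory.
The console also prints information about how the ID3 decision tree is built. In the output, 機率為 means "probability is" and 的G為 means "G value of". For example: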
紅機率為:0.25
黃機率為:0.125
桔黃機率為:0.125
綠機率為:0.5
顏色的G為:1
球機率為:0.625
橢球機率為:0.25
彎月機率為:0.125
形狀的G為:1.20121
輕機率為:0.375
一般機率為:0.375
重機率為:0.25
輕重的G為:0.688722
choose attribute:輕重
go into the branch:一般
紅機率為:0.125
黃機率為:0.125
綠機率為:0.125
顏色的G為:0
球機率為:0.25
彎月機率為:0.125
形狀的G為:0
..choose attribute:顏色
..go into the branch:紅
....decision type:蘋果
..go into the branch:綠
....decision type:蘋果
..go into the branch:黃
....decision type:香蕉
..go into the branch:桔黃
....decision type:
go into the branch:輕
紅機率為:0.125
桔黃機率為:0.125
綠機率為:0.125
顏色的G為:0
球機率為:0.25
橢球機率為:0.125
形狀的G為:0
..choose attribute:顏色
..go into the branch:紅
....decision type:草莓
..go into the branch:綠
....decision type:草莓
..go into the branch:黃
....decision type:
..go into the branch:桔黃
....decision type:桔子
go into the branch:重
..decision type:西瓜
What does this block of output mean?
紅機率為:0.25
黃機率為:0.125
桔黃機率為:0.125
綠機率為:0.5
顏色的G為:1
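紅, 黃, 桔黃 and 綠 are the specific values of the attribute 顏色, each printed with its probability. The entropy of each value is not printed here. If any junior students at the Chinese Academy of Sciences happen to see this code, you can print each entropy by adding a few lines to AttribDifferComputer(). Your teacher will have you read the code anyway, so treat it as a homework exercise. (Also, the result computed for this decision tree example in the teacher's Chapter 10 machine learning slides is wrong, as you will find if you work it through carefully.) "顏色的G" is the decision value G of 顏色. The smaller the decision value, the better the attribute: the code picks the attribute with the smallest G.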
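As a rough sketch of that exercise, the standalone functions below recompute G for a single column as the sum over its values v of P(v) * H(class | v), and also hand back the entropy of each value. `ValueAndClass`, `EntropyBits`, `WeightedEntropy` and `WeightColumn` are names made up for this illustration and are not part of the original code. Unlike AttribDifferComputer(), the sketch divides by the number of rows it is given, which is the same thing only at the root, where all rows are visible.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// One training row reduced to (attribute value, class label). Hypothetical type.
typedef std::pair<std::string, std::string> ValueAndClass;

// Entropy in bits of a class-count table; `total` is the sum of the counts.
double EntropyBits(const std::map<std::string, int>& classCounts, int total)
{
    assert(total > 0);
    double h = 0.0;
    for (std::map<std::string, int>::const_iterator it = classCounts.begin();
         it != classCounts.end(); ++it)
    {
        double p = it->second / static_cast<double>(total);
        h -= p * std::log(p) / std::log(2.0);
    }
    return h;
}

// G = sum over values v of P(v) * H(class | v).
// If perValue is not NULL it receives the entropy of each value.
double WeightedEntropy(const std::vector<ValueAndClass>& rows,
                       std::map<std::string, double>* perValue)
{
    std::map<std::string, std::map<std::string, int> > counts; // value -> class -> rows
    std::map<std::string, int> totals;                         // value -> rows
    for (std::size_t i = 0; i < rows.size(); ++i)
    {
        ++counts[rows[i].first][rows[i].second];
        ++totals[rows[i].first];
    }
    double g = 0.0;
    for (std::map<std::string, int>::const_iterator it = totals.begin();
         it != totals.end(); ++it)
    {
        double h = EntropyBits(counts[it->first], it->second);
        if (perValue != NULL)
        {
            (*perValue)[it->first] = h;
        }
        g += it->second / static_cast<double>(rows.size()) * h;
    }
    return g;
}

// The 輕重 column of decision2.txt paired with the class column.
std::vector<ValueAndClass> WeightColumn()
{
    static const char* const kRows[][2] = {
        {"一般", "蘋果"}, {"一般", "蘋果"}, {"一般", "香蕉"},
        {"輕", "草莓"}, {"輕", "草莓"},
        {"重", "西瓜"}, {"重", "西瓜"},
        {"輕", "桔子"}
    };
    std::vector<ValueAndClass> rows;
    for (std::size_t i = 0; i < sizeof(kRows) / sizeof(kRows[0]); ++i)
    {
        rows.push_back(ValueAndClass(kRows[i][0], kRows[i][1]));
    }
    return rows;
}
```

Calling `WeightedEntropy(WeightColumn(), &h)` should give about 0.6887, in line with the 輕重的G為:0.688722 line in the console output, with `h` holding roughly 0.918 for 一般 and 輕 and 0 for 重.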
So what does the decision tree look like?
choose attribute:輕重
go into the branch:一般
..choose attribute:顏色
..go into the branch:紅
......................
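Look at the lines above. They say that the root node is 輕重; we then enter the 一般 branch, where the node is 顏色, and then enter the 紅 branch. Pay attention to the "..": lines with the same number of ".." are on the same level of the tree.
I wrote this ID3 decision tree code mainly so that junior students can use it to check their answers for the exam. I have only tested it on the example from the teacher's slides, so I cannot guarantee it is correct for every example. The exam questions the teacher sets are also rather brutal (around ten attributes); working one by hand probably takes about an hour.
I regret not writing a program first back then. Good luck on your exams (I suspect this code will be searched for right before the exam).
One more reminder: ID3 is not a particularly good algorithm. When two attributes have the same G value, it offers no better criterion for choosing between them, and simply taking them in order may well produce a decision tree that is not minimal. This is worth further study.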
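To make the ".." convention concrete, here is a small sketch that builds the prefix for a given level and recovers the level from a printed line. `DepthPrefix` and `DepthOfLine` are hypothetical helpers written for this illustration; the original code only has PrintHelp(), which prints the prefix directly.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Builds the ".." prefix for a given tree level (the root is level 0).
std::string DepthPrefix(int depth)
{
    assert(depth >= 0);
    std::string prefix;
    for (int i = 0; i < depth; ++i)
    {
        prefix += "..";
    }
    return prefix;
}

// Recovers the level of a console line by counting its leading ".." pairs.
int DepthOfLine(const std::string& line)
{
    std::size_t pos = 0;
    while (line.compare(pos, 2, "..") == 0)
    {
        pos += 2;
    }
    return static_cast<int>(pos / 2);
}
```

A "go into the branch" line carries the same prefix as the "choose attribute" line of the node it belongs to; the node reached through that branch is printed one level deeper.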
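For the tie case, the code above compares with `>=` while iterating a std::map, so among equal G values the last attribute name in map order wins. One possible alternative, sketched below purely as an assumption and not as the author's method, is to keep the G values in column order and let the earliest column win a tie. `PickAttribute` and its `eps` tolerance are hypothetical, and this does nothing about the non-minimal tree problem.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the index of the attribute to split on, or -1 if g is empty.
// g holds the G values in column order. A later candidate replaces the
// current best only when it is smaller by more than eps, so on a tie the
// earliest column is kept.
int PickAttribute(const std::vector<double>& g, double eps)
{
    assert(eps >= 0.0);
    int best = -1;
    for (std::size_t i = 0; i < g.size(); ++i)
    {
        if (best < 0 || g[i] < g[best] - eps)
        {
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

With the root values from the console output in the order 顏色, 形狀, 輕重 (1, 1.20121, 0.688722) this would return index 2, and with two equal values such as 0 and 0 it would return index 0.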