1. Introduction
This article contains my notes from studying the book Data Mining and Machine Learning: Weka Application Technology and Practice. The electronic version of the book is available at: http://download.csdn.net/detail/fhb292262794/8759397
The previous blog post summarized how to run machine learning algorithms through the Weka 3.8 client software.
This article instead uses the Java API, so that Weka's machine learning algorithms can be invoked programmatically to process data.
The book's examples use Weka 3.7. I downloaded the latest version, Weka 3.8, and updated the code to work with it; my notes are recorded below.
1. Classification (hands-on code)
1.1 Linear regression
Predicting house prices.
House price data:
@RELATION House
@ATTRIBUTE housesize NUMERIC
@ATTRIBUTE lotsize NUMERIC
@ATTRIBUTE bedrooms NUMERIC
@ATTRIBUTE granite NUMERIC
@ATTRIBUTE bathroom NUMERIC
@ATTRIBUTE sellingprice NUMERIC
@DATA
3529,9191,6,0,0,205000
3247,10061,5,1,1,224900
4032,10150,5,0,1,197900
2397,14156,4,1,0,189900
2200,9600,4,0,1,195000
3536,19994,6,1,1,325000
2983,9365,5,0,1,230000
Requirement: predict the price of a new house from the information and selling prices of nearby houses. The house information is houseSize: 3198, lotSize: 9669, bedrooms: 5, granite: 3, bathroom: 1; predict its selling price.
The logic required is:
1. Load the house price data.
2. Set the class attribute.
3. Build the classifier and compute the regression coefficients.
4. Use the regression coefficients to predict the unknown house price.
The code is as follows:
public static final String Weka_path = "data/weka/";
public static final String Weather_nominal_path = "data/weka/weather.nominal.arff";
public static final String Weather_numeric_path = "data/weka/weather.numeric.arff";
public static final String Segment_challenge_path = "data/weka/segment-challenge.arff";
public static final String Segment_test_path = "data/weka/segment-test.arff";
public static final String Ionosphere_path = "data/weka/ionosphere.arff";
public static void pln(String str) {
    System.out.println(str);
}

@Test
public void testLinearRegression() throws Exception {
    Instances dataSet = ConverterUtils.DataSource.read(Weka_path + "houses.arff");
    dataSet.setClassIndex(dataSet.numAttributes() - 1);
    LinearRegression linearRegression = new LinearRegression();
    try {
        linearRegression.buildClassifier(dataSet);
    } catch (Exception e) {
        e.printStackTrace();
    }
    // coefficients() returns one entry per attribute (the class attribute's
    // slot is 0) plus the intercept in the last position, hence coef[6]
    double[] coef = linearRegression.coefficients();
    double myHouseValue = (coef[0] * 3198) + (coef[1] * 9669) + (coef[2] * 5)
            + (coef[3] * 3) + (coef[4] * 1) + coef[6];
    System.out.println(myHouseValue);
}
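Summing the coefficients by hand works, but the classifier can apply them for us via classifyInstance(). Below is a self-contained sketch of that alternative; the class and helper names are my own, and the dataset is built in code from the table above instead of being read from houses.arff:

```java
import java.util.ArrayList;

import weka.classifiers.functions.LinearRegression;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;

public class HousePricePrediction {

    // Build the dataset from the table above instead of loading houses.arff
    static Instances houseData() {
        ArrayList<Attribute> attrs = new ArrayList<>();
        for (String name : new String[] { "houseSize", "lotSize", "bedrooms",
                "granite", "bathroom", "sellingPrice" }) {
            attrs.add(new Attribute(name));  // numeric attribute
        }
        Instances data = new Instances("house", attrs, 7);
        double[][] rows = {
            { 3529, 9191, 6, 0, 0, 205000 }, { 3247, 10061, 5, 1, 1, 224900 },
            { 4032, 10150, 5, 0, 1, 197900 }, { 2397, 14156, 4, 1, 0, 189900 },
            { 2200, 9600, 4, 0, 1, 195000 }, { 3536, 19994, 6, 1, 1, 325000 },
            { 2983, 9365, 5, 0, 1, 230000 }
        };
        for (double[] row : rows) {
            data.add(new DenseInstance(1.0, row));
        }
        data.setClassIndex(data.numAttributes() - 1);  // sellingPrice is the class
        return data;
    }

    static double predict() throws Exception {
        Instances data = houseData();
        LinearRegression lr = new LinearRegression();
        lr.buildClassifier(data);
        // classifyInstance() applies the learned coefficients for us, so we
        // avoid indexing the coefficient array by hand
        DenseInstance unknown = new DenseInstance(1.0,
                new double[] { 3198, 9669, 5, 3, 1, 0 });  // class value is ignored
        unknown.setDataset(data);
        return lr.classifyInstance(unknown);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(predict());
    }
}
```

This avoids the subtle pitfall in the manual version: skipping the class attribute's coefficient slot when multiplying out the terms.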
1.2 Random Forest
Code:
@Test
public void testRandomForestClassifier() throws Exception {
    ArffLoader loader = new ArffLoader();
    loader.setFile(new File(Weka_path + "segment-challenge.arff"));
    Instances instances = loader.getDataSet();
    instances.setClassIndex(instances.numAttributes() - 1);
    System.out.println(instances);
    System.out.println("------------");
    RandomForest rf = new RandomForest();
    rf.buildClassifier(instances);
    System.out.println(rf);
}
1.3 Meta classifier
AttributeSelectedClassifier is a meta-classifier: it performs attribute selection first, then trains a base classifier on the reduced data.
@Test
public void testMetaClassifier() throws Exception {
    Instances data = ConverterUtils.DataSource.read(Weather_numeric_path);
    if (data.classIndex() == -1)
        data.setClassIndex(data.numAttributes() - 1);
    AttributeSelectedClassifier classifier = new AttributeSelectedClassifier();
    CfsSubsetEval eval = new CfsSubsetEval();
    GreedyStepwise stepwise = new GreedyStepwise();
    stepwise.setSearchBackwards(true);
    J48 base = new J48();
    classifier.setClassifier(base);
    classifier.setEvaluator(eval);
    classifier.setSearch(stepwise);
    Evaluation evaluation = new Evaluation(data);
    evaluation.crossValidateModel(classifier, data, 10, new Random(1234));
    pln(evaluation.toSummaryString());
}
1.4 Predicting classification results (batch processing)
Code:
/** Use a training set to predict the classes of a test set (batch processing). */
@Test
public void testOutputClassDistribution() throws Exception {
    ArffLoader loader = new ArffLoader();
    loader.setFile(new File(Segment_challenge_path));
    Instances train = loader.getDataSet();
    train.setClassIndex(train.numAttributes() - 1);
    ArffLoader loader1 = new ArffLoader();
    loader1.setFile(new File(Segment_test_path));
    Instances test = loader1.getDataSet();
    test.setClassIndex(test.numAttributes() - 1);
    J48 classifier = new J48();
    classifier.buildClassifier(train);
    System.out.println("num\t-\tfact\t-\tpred\t-\terr\t-\tdistribution");
    for (int i = 0; i < test.numInstances(); i++) {
        double pred = classifier.classifyInstance(test.instance(i));
        double[] dist = classifier.distributionForInstance(test.instance(i));
        StringBuilder sb = new StringBuilder();
        sb.append(i + 1).append("\t-\t")
          .append(test.instance(i).toString(test.classIndex())).append("\t-\t")
          .append(test.classAttribute().value((int) pred)).append("\t-\t");
        if (pred != test.instance(i).classValue())
            sb.append("yes");
        else
            sb.append("no");
        sb.append("\t-\t");
        sb.append(Utils.arrayToString(dist));
        System.out.println(sb.toString());
    }
}
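Before settling on J48, it is worth comparing candidate classifiers with cross-validation on the same data. A minimal comparison sketch follows; the toy dataset and class name are my own so the example runs without the segment ARFF files:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.RandomForest;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;

public class ClassifierComparison {

    // A toy two-class dataset: the label is "b" when x + y > 1, else "a"
    static Instances toyData() {
        ArrayList<Attribute> attrs = new ArrayList<>();
        attrs.add(new Attribute("x"));
        attrs.add(new Attribute("y"));
        attrs.add(new Attribute("class", Arrays.asList("a", "b")));
        Instances data = new Instances("toy", attrs, 60);
        Random rnd = new Random(1);
        for (int i = 0; i < 60; i++) {
            double x = rnd.nextDouble();
            double y = rnd.nextDouble();
            data.add(new DenseInstance(1.0,
                    new double[] { x, y, (x + y > 1) ? 1 : 0 }));
        }
        data.setClassIndex(2);
        return data;
    }

    // Ten-fold cross-validated accuracy of one classifier
    static double accuracy(Classifier c, Instances data) throws Exception {
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(c, data, 10, new Random(1234));
        return eval.pctCorrect();
    }

    public static void main(String[] args) throws Exception {
        Instances data = toyData();
        for (Classifier c : new Classifier[] { new J48(), new RandomForest() }) {
            System.out.println(c.getClass().getSimpleName() + ": "
                    + accuracy(c, data) + "% correct");
        }
    }
}
```

The same accuracy() helper works for any Classifier, so extending the comparison to NaiveBayes, SMO, and so on is a one-line change.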
This example uses J48, a decision-tree classifier. Other classifiers may perform better; substitute them, compare the results, and choose accordingly.
1.5 Cross-validation
Code:
/** Cross-validation with predictions. */
@Test
public void testOnceCVAndPrediction() throws Exception {
    Instances data = ConverterUtils.DataSource.read(Ionosphere_path);
    data.setClassIndex(data.numAttributes() - 1);
    Classifier classifier = new J48();
    int seed = 1234;
    int folds = 10;
    Debug.Random random = new Debug.Random(seed);
    Instances newData = new Instances(data);
    newData.randomize(random);
    if (newData.classAttribute().isNominal())
        newData.stratify(folds);
    // Perform cross-validation and collect the predictions
    Instances predictedData = null;
    Evaluation eval = new Evaluation(newData);
    for (int i = 0; i < folds; i++) {
        Instances train = newData.trainCV(folds, i);
        Instances test = newData.testCV(folds, i);
        Classifier clsCopy = AbstractClassifier.makeCopy(classifier);
        clsCopy.buildClassifier(train);
        eval.evaluateModel(clsCopy, test);
        // Add the predictions to the test instances
        AddClassification filter = new AddClassification();
        filter.setClassifier(classifier);
        filter.setOutputClassification(true);
        filter.setOutputDistribution(true);
        filter.setOutputErrorFlag(true);
        filter.setInputFormat(train);
        Filter.useFilter(train, filter);
        Instances pred = Filter.useFilter(test, filter);
        if (predictedData == null)
            predictedData = new Instances(pred, 0);
        for (int j = 0; j < pred.numInstances(); j++)
            predictedData.add(pred.instance(j));
    }
    pln("Classifier: " + classifier.getClass().getName() + " "
            + Utils.joinOptions(((OptionHandler) classifier).getOptions()));
    pln("Data: " + data.relationName());
    pln("Seed: " + seed);
    pln(eval.toSummaryString("=== " + folds + "-fold cross-validation ===", false));
    // Write the data with predictions
    ConverterUtils.DataSink.write(Weka_path + "predictions.arff", predictedData);
}
2. Clustering (hands-on code)
2.1 EM
@Test
public void testEM() throws Exception {
    Instances instances = ConverterUtils.DataSource.read(Weka_path + "contact-lenses.arff");
    EM cluster = new EM();
    // -I: maximum iterations, -M: minimum allowable standard deviation
    // (the option values here are Weka's defaults; the originals were not preserved)
    cluster.setOptions(new String[] { "-I", "100", "-M", "1e-6" });
    cluster.buildClusterer(instances);
    pln(cluster.toString());
}
2.2 Evaluating a clusterer
/** Three ways to evaluate a clusterer. */
@Test
public void testEvaluation() throws Exception {
    String filePath = Weka_path + "contact-lenses.arff";
    Instances instances = ConverterUtils.DataSource.read(filePath);
    // 1st: command-line style evaluation
    String[] options = new String[] { "-t", filePath };
    String output = ClusterEvaluation.evaluateClusterer(new EM(), options);
    pln(output);
    // 2nd: evaluation via a ClusterEvaluation object
    DensityBasedClusterer dbc = new EM();
    dbc.buildClusterer(instances);
    ClusterEvaluation clusterEvaluation = new ClusterEvaluation();
    clusterEvaluation.setClusterer(dbc);
    clusterEvaluation.evaluateClusterer(new Instances(instances));
    pln(clusterEvaluation.clusterResultsToString());
    // 3rd: cross-validation for a density-based clusterer
    DensityBasedClusterer newDbc = new EM();
    double logLikelihood = ClusterEvaluation.crossValidateModel(newDbc, instances, 10,
            instances.getRandomNumberGenerator(1234));
    pln("logLikelihood: " + logLikelihood);
}
2.3 Classes-to-clusters evaluation
@Test
public void testClassesToClusters() throws Exception {
    String filePath = Weka_path + "contact-lenses.arff";
    Instances data = ConverterUtils.DataSource.read(filePath);
    data.setClassIndex(data.numAttributes() - 1);
    // Remove the class attribute before clustering
    Remove remove = new Remove();
    remove.setAttributeIndices("" + (data.classIndex() + 1));
    remove.setInputFormat(data);
    Instances dataCluster = Filter.useFilter(data, remove);
    Clusterer cluster = new EM();
    cluster.buildClusterer(dataCluster);
    // Evaluate against the original data, which still has the class attribute
    ClusterEvaluation eval = new ClusterEvaluation();
    eval.setClusterer(cluster);
    eval.evaluateClusterer(data);
    pln(eval.clusterResultsToString());
}
2.4 Outputting cluster distributions
@Test
public void testOutputClusterDistribution() throws Exception {
    Instances train = ConverterUtils.DataSource.read(Segment_challenge_path);
    Instances test = ConverterUtils.DataSource.read(Segment_test_path);
    if (!train.equalHeaders(test))
        throw new Exception("Train data and test data are not compatible.");
    EM clusterer = new EM();
    clusterer.buildClusterer(train);
    pln("id - cluster - distribution");
    for (int i = 0; i < test.numInstances(); i++) {
        int cluster = clusterer.clusterInstance(test.instance(i));
        double[] dists = clusterer.distributionForInstance(test.instance(i));
        StringBuilder sb = new StringBuilder();
        sb.append(i + 1).append(" - ").append(cluster).append(" - ")
          .append(Utils.arrayToString(dists));
        pln(sb.toString());
    }
}
3. Attribute selection (hands-on code)
Automatic attribute selection, using CfsSubsetEval as the evaluator and GreedyStepwise as the search method.
Attribute selection through the low-level API:
@Test
public void testUseLowAPI() throws Exception {
    ConverterUtils.DataSource source = new ConverterUtils.DataSource(Weather_nominal_path);
    Instances data = source.getDataSet();
    if (data.classIndex() == -1)
        data.setClassIndex(data.numAttributes() - 1);
    AttributeSelection attributeSelection = new AttributeSelection();
    CfsSubsetEval eval = new CfsSubsetEval();
    GreedyStepwise search = new GreedyStepwise();
    search.setSearchBackwards(true);
    attributeSelection.setEvaluator(eval);
    attributeSelection.setSearch(search);
    attributeSelection.SelectAttributes(data);
    int[] indices = attributeSelection.selectedAttributes();
    pln(Utils.arrayToString(indices));
}
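The same CfsSubsetEval/GreedyStepwise pair can also be wrapped in the supervised AttributeSelection filter, which returns a reduced dataset directly instead of an index array. A sketch follows; the toy dataset (one informative attribute, one noise attribute) and class name are my own, used so the example is self-contained:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Random;

import weka.attributeSelection.CfsSubsetEval;
import weka.attributeSelection.GreedyStepwise;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

public class AttributeSelectionFilterDemo {

    // One attribute correlates with the class, the other is pure noise
    static Instances toyData() {
        ArrayList<Attribute> attrs = new ArrayList<>();
        attrs.add(new Attribute("informative"));
        attrs.add(new Attribute("noise"));
        attrs.add(new Attribute("class", Arrays.asList("a", "b")));
        Instances data = new Instances("toy", attrs, 50);
        Random rnd = new Random(42);
        for (int i = 0; i < 50; i++) {
            double cls = i % 2;
            data.add(new DenseInstance(1.0, new double[] {
                    cls + 0.1 * rnd.nextGaussian(), rnd.nextGaussian(), cls }));
        }
        data.setClassIndex(2);
        return data;
    }

    static Instances select(Instances data) throws Exception {
        // This is the filter in weka.filters.supervised.attribute,
        // not the low-level weka.attributeSelection.AttributeSelection
        AttributeSelection filter = new AttributeSelection();
        filter.setEvaluator(new CfsSubsetEval());
        GreedyStepwise search = new GreedyStepwise();
        search.setSearchBackwards(true);
        filter.setSearch(search);
        filter.setInputFormat(data);
        return Filter.useFilter(data, filter);
    }

    public static void main(String[] args) throws Exception {
        Instances reduced = select(toyData());
        System.out.println(reduced.numAttributes() + " attributes kept");
    }
}
```

The filter form is convenient when the reduced dataset is needed for further processing, while the low-level API above is better when only the selected indices matter.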
4. Other
4.1 Database table operations
@Test
public void testSaveCSV() throws Exception {
    DatabaseLoader loader = new DatabaseLoader();
    loader.setUrl(SqlUtil.url);
    loader.setUser(SqlUtil.user);
    loader.setPassword(SqlUtil.password);
    loader.setQuery("select question from question");
    Instances data1 = loader.getDataSet();
    if (data1.classIndex() == -1)
        data1.setClassIndex(data1.numAttributes() - 1);
    System.out.println(data1);
    CSVSaver saver = new CSVSaver();
    saver.setInstances(data1);
    saver.setFile(new File("data/weka/baidubook-csvsaver.csv"));
    saver.writeBatch();
}
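The converters also work in the other direction: a CSV file can be loaded with CSVLoader and written out as ARFF with ArffSaver. A small self-contained sketch (the tiny CSV contents and file names are placeholders of my own):

```java
import java.io.File;
import java.nio.file.Files;
import java.util.Arrays;

import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

public class CsvToArff {

    static File convert() throws Exception {
        // Write a tiny CSV so the example is self-contained
        File csv = File.createTempFile("toy", ".csv");
        Files.write(csv.toPath(), Arrays.asList("x,y,label", "1,2,a", "3,4,b"));
        // Load the CSV; the loader infers attribute types from the values
        CSVLoader loader = new CSVLoader();
        loader.setSource(csv);
        Instances data = loader.getDataSet();
        // Save the same instances in ARFF format
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        File arff = new File(csv.getAbsolutePath().replace(".csv", ".arff"));
        saver.setFile(arff);
        saver.writeBatch();
        return arff;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(convert().getAbsolutePath());
    }
}
```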
4.2 Filters
Removing an attribute from a dataset with the Remove filter:
@Test
public void testFilter() throws Exception {
    Instances instances = ConverterUtils.DataSource.read("data/weka/houses.arff");
    instances.setClassIndex(instances.numAttributes() - 1);
    System.out.println(instances);
    String[] options = new String[2];
    options[0] = "-R";  // -R: range of attributes to remove
    options[1] = "1";   // remove the first attribute
    Remove remove = new Remove();
    remove.setOptions(options);
    remove.setInputFormat(instances);
    Instances newData = Filter.useFilter(instances, remove);
    System.out.println(newData);
}
Filtering on the fly during classification:
@Test
public void testFilterOnTheFly() throws Exception {
    Instances instances = ConverterUtils.DataSource.read("data/weka/weather.nominal.arff");
    instances.setClassIndex(instances.numAttributes() - 1);
    System.out.println(instances);
    Remove remove = new Remove();
    remove.setAttributeIndices("1");
    // The original snippet was cut off here; the completion below is an
    // assumption following Weka's standard FilteredClassifier pattern.
    FilteredClassifier fc = new FilteredClassifier();
    fc.setFilter(remove);
    fc.setClassifier(new J48());
    fc.buildClassifier(instances);
    System.out.println(fc);
}