Hello, WEKA

Source: Internet
Author: User

From http://dreamhead.blogbus.com/logs/16813833.html

WEKA is data mining software written in Java. Data mining is, literally, the process of digging useful information out of data. It covers a lot of ground, though, so here we look at just one aspect in detail: classification.
Classification is as simple as the name suggests: you are given something and you assign it to a class. How do you know which class? Obviously, from existing experience. Where does a computer get that experience? Only from what people tell it; that is, we need to train the computer with a batch of data. Once trained, the computer has some recognition ability and can handle simple classification tasks. Classification comes up often in practice. For example, one of my earlier projects used this approach to identify vehicles.
The following shows how to write a classification program with WEKA.
import weka.classifiers.Classifier;
import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.core.Attribute;
import weka.core.FastVector;
import weka.core.Instance;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

// Uses the older WEKA API in which Instance is a concrete class and FastVector holds attributes.
public class Main {
    private static final String GOOD = "G";
    private static final String BAD = "B";
    private static final String CATEGORY = "category";
    private static final String TEXT = "text";
    private static final int INIT_CAPACITY = 100;
    private static final String[][] TRAINING_DATA = {
        {"Good", GOOD},
        {"Wonderful", GOOD},
        {"Cool", GOOD},
        {"Bad", BAD},
        {"Disaster", BAD},
        {"Terrible", BAD}
    };
    private static final String TEST_DATA = "good";
    private static Filter filter = new StringToWordVector();
    private static Classifier classifier = new NaiveBayesMultinomial();

    public static void main(String[] args) throws Exception {
        // Nominal class attribute with the two possible categories.
        FastVector categories = new FastVector();
        categories.addElement(GOOD);
        categories.addElement(BAD);
        // A string attribute for the text, plus the class attribute.
        FastVector attributes = new FastVector();
        attributes.addElement(new Attribute(TEXT, (FastVector) null));
        attributes.addElement(new Attribute(CATEGORY, categories));
        Instances instances = new Instances("Weka", attributes, INIT_CAPACITY);
        instances.setClassIndex(instances.numAttributes() - 1);
        // Turn each {text, category} pair into a labelled training instance.
        for (String[] pair : TRAINING_DATA) {
            String text = pair[0];
            String category = pair[1];
            Instance instance = createInstanceByText(instances, text);
            instance.setClassValue(category);
            instances.add(instance);
        }
        // Convert the text attribute to word attributes, then train.
        filter.setInputFormat(instances);
        Instances filteredInstances = Filter.useFilter(instances, filter);
        classifier.buildClassifier(filteredInstances);
        // Test
        String testText = TEST_DATA;
        Instance testInstance = createTestInstance(instances.stringFreeStructure(), testText);
        double predicted = classifier.classifyInstance(testInstance);
        String category = instances.classAttribute().value((int) predicted);
        System.out.println(category);
    }

    private static Instance createInstanceByText(Instances data, String text) {
        Attribute textAtt = data.attribute(TEXT);
        int index = textAtt.addStringValue(text);
        Instance instance = new Instance(2);
        instance.setValue(textAtt, index);
        instance.setDataset(data);
        return instance;
    }

    private static Instance createTestInstance(Instances data, String text) throws Exception {
        Instance testInstance = createInstanceByText(data, text);
        // Run the test instance through the same filter that was used for training.
        filter.input(testInstance);
        return filter.output();
    }
}
The program has two parts: the first half trains the classifier and the second half tests it.
To train a classifier, we need to choose a classification algorithm and prepare training data. In WEKA, every classification algorithm is a subclass of Classifier, so the algorithm can be swapped easily without changing the rest of the program.
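The point about swapping algorithms can be illustrated without WEKA at all. What follows is only a minimal sketch in plain Java: LabelPredictor, MajorityPredictor, LookupPredictor and SwapSketch are hypothetical names made up for this illustration and are not part of WEKA. The idea is simply that the calling code depends on a common type, so a different algorithm can be dropped in at one place.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a shared classifier base type; this is not WEKA's API.
interface LabelPredictor {
    void train(String[][] labelledTexts); // each row is {text, label}
    String predict(String text);
}

// Strategy 1: always answers with the most frequent training label.
final class MajorityPredictor implements LabelPredictor {
    private String majority = "";

    public void train(String[][] rows) {
        Map<String, Integer> counts = new HashMap<>();
        int best = 0;
        for (String[] row : rows) {
            int n = counts.merge(row[1], 1, Integer::sum);
            if (n > best) {
                best = n;
                majority = row[1];
            }
        }
    }

    public String predict(String text) {
        return majority;
    }
}

// Strategy 2: looks the text up verbatim, falling back to "?" for unseen text.
final class LookupPredictor implements LabelPredictor {
    private final Map<String, String> seen = new HashMap<>();

    public void train(String[][] rows) {
        for (String[] row : rows) {
            seen.put(row[0], row[1]);
        }
    }

    public String predict(String text) {
        return seen.getOrDefault(text, "?");
    }
}

final class SwapSketch {
    // The caller only knows the interface, so any implementation can be passed in.
    static String trainAndPredict(LabelPredictor predictor, String[][] rows, String text) {
        predictor.train(rows);
        return predictor.predict(text);
    }

    public static void main(String[] args) {
        String[][] rows = {{"good", "G"}, {"cool", "G"}, {"bad", "B"}};
        System.out.println(trainAndPredict(new MajorityPredictor(), rows, "bad")); // prints G
        System.out.println(trainAndPredict(new LookupPredictor(), rows, "bad"));   // prints B
    }
}
```

In the WEKA program, the corresponding swap point is the single line that constructs the classifier; everything else talks to the Classifier type.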
In fact, anyone with a little background in this area knows that although the classification algorithm matters, what really determines how capable a classifier is is the training data. To get a good classifier, you have to keep adjusting the training data and retraining. It is the same with human cognition: the more you have seen, the better you can tell things apart.

In WEKA, the training data is held in an Instances object. As the name suggests, it is the plural of Instance: a single training example is an Instance, and the Instances class holds what the examples have in common, such as the attribute definitions. In the code above, to use text as training data we convert each text into an Instance. Likewise, when testing the classifier, we convert the test text into an Instance and then classify it.
There is also the concept of a filter which, like filters elsewhere, gives us a chance to process the data before the real work begins. Here the filter is a StringToWordVector, which transforms each instance by turning its text attribute into word attributes that the classifier can work with.
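To make the filter step a little more concrete, here is a rough sketch in plain Java of the kind of transformation a text-to-word-vector filter performs: a string goes in and per-word counts come out. WordCountSketch and toWordCounts are hypothetical names for this illustration. The real filter has options (tokenizer, word presence versus counts, and so on) that are not modelled here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class WordCountSketch {
    // Splits on whitespace and counts each token. Case-sensitive: "Good" and "good" differ.
    static Map<String, Integer> toWordCounts(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(toWordCounts("good good movie")); // prints {good=2, movie=1}
    }
}
```

The same transformation has to be applied to training and test text alike, which is why the program sends the test instance through the filter that was set up on the training data.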
Once we have a trained classifier, we can use it to classify. The key line is
    classifier.classifyInstance(testInstance);
The call returns the index of the predicted class as a double. As in the program above, casting this value to int and passing it to classAttribute().value() gives the category of the data being tested.
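In the listing above, the result of classifyInstance is cast to int and used to look up the class label. The following toy sketch in plain Java shows one way such a number can come about. It is a hand-rolled multinomial naive Bayes with add-one smoothing, written for illustration only: TinyMultinomialNB and its methods are hypothetical names, and this is not how WEKA's NaiveBayesMultinomial is implemented internally.

```java
import java.util.HashMap;
import java.util.Map;

final class TinyMultinomialNB {
    private final String[] labels;
    private final int[] docsPerClass;
    private final int[] tokensPerClass;
    private final Map<String, int[]> tokenCounts = new HashMap<>();
    private int totalDocs;

    TinyMultinomialNB(String... labels) {
        this.labels = labels.clone();
        this.docsPerClass = new int[labels.length];
        this.tokensPerClass = new int[labels.length];
    }

    // Records one training text under the class with the given index.
    void train(String text, int classIndex) {
        docsPerClass[classIndex]++;
        totalDocs++;
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            tokenCounts.computeIfAbsent(token, k -> new int[labels.length])[classIndex]++;
            tokensPerClass[classIndex]++;
        }
    }

    // Returns the index of the best-scoring class as a double.
    double classify(String text) {
        int vocabulary = tokenCounts.size();
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < labels.length; c++) {
            // Log prior with add-one smoothing.
            double score = Math.log((docsPerClass[c] + 1.0) / (totalDocs + labels.length));
            for (String token : text.trim().split("\\s+")) {
                int[] counts = tokenCounts.get(token);
                if (counts == null) {
                    continue; // words never seen in training contribute nothing
                }
                // Log likelihood of the token given the class, add-one smoothed.
                score += Math.log((counts[c] + 1.0) / (tokensPerClass[c] + vocabulary));
            }
            if (score > bestScore) {
                bestScore = score;
                best = c;
            }
        }
        return best;
    }

    // Maps the numeric prediction back to its label.
    String label(double predicted) {
        return labels[(int) predicted];
    }

    public static void main(String[] args) {
        TinyMultinomialNB nb = new TinyMultinomialNB("G", "B");
        nb.train("good", 0);
        nb.train("wonderful", 0);
        nb.train("cool", 0);
        nb.train("bad", 1);
        nb.train("disaster", 1);
        nb.train("terrible", 1);
        System.out.println(nb.label(nb.classify("good"))); // prints G
    }
}
```

Note that a word the sketch has never seen adds nothing to any class score, so the prediction then rests on the class priors alone. With training data this small, most inputs fall into that case.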
The code itself is not complicated. As mentioned above, a good classifier depends on its data, so if you change the test data you will find that the classifier implemented here is not powerful at all. Making it powerful inevitably means expanding the training data. That is beside the point of this post, though, which is only meant to say hello to WEKA. Going further takes more work.
