Data Preprocessing and the Use of weka.filters - Data Mining Learning and WEKA Usage (3)

The previous article introduced the ARFF format, WEKA's own file format. In general, we need to extract or obtain data from other data sources; WEKA supports conversion from CSV files and from databases. The interface is shown in the figure below.
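If you prefer code to the GUI, the CSV-to-ARFF conversion can also be done with WEKA's converter classes. A minimal sketch (the file names are placeholders, not files from the article):

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

// Load the CSV file (placeholder path)
CSVLoader loader = new CSVLoader();
loader.setSource(new File("data/mydata.csv"));
Instances data = loader.getDataSet();

// Save it as ARFF (placeholder path)
ArffSaver saver = new ArffSaver();
saver.setInstances(data);
saver.setFile(new File("data/mydata.arff"));
saver.writeBatch();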

The WEKA installation directory contains a data directory with some sample datasets for testing and learning.

Importing data is just the beginning. We also need to pre-process the data.

Data preprocessing

Data preprocessing refers to the processing applied to data before the main mining step.

In the real world, data is often incomplete and inconsistent. Such dirty data cannot be mined directly, or the mining results will be unsatisfactory.

In order to improve the quality of data mining, data preprocessing technology is introduced.

Data pre-processing involves data cleaning, data integration, data transformation, and data reduction. These data processing technologies are used before data mining, which greatly improves the quality of the data mining model and reduces the time required for actual data mining.

Data cleaning is used frequently and mainly includes:

(1) Missing value handling

Currently, the most common approach is to fill a missing value with the most likely value. For example, regression, Bayesian methods, or decision tree induction can be used to determine the missing value. These methods rely on the existing data to infer the missing values, so the filled-in values have a better chance of preserving the relationships with the other attributes.

You can also replace missing values with a global constant, fill them with the mean of the attribute, or group all tuples by some attribute and fill each missing value with the mean of the attribute within the same group. If there are many missing values, these methods may mislead the mining results.
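As a concrete illustration of mean/mode filling, here is a minimal sketch using WEKA's ReplaceMissingValues filter; the file labor.arff (from the WEKA data directory) is only an assumed example of a dataset with missing values:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

// labor.arff is an assumed example file with missing values
Instances data = DataSource.read("data/labor.arff");

// ReplaceMissingValues fills numeric attributes with the mean and nominal attributes with the mode
ReplaceMissingValues fill = new ReplaceMissingValues();
fill.setInputFormat(data);
Instances filled = Filter.useFilter(data, fill);
System.out.println(filled.toSummaryString());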

(2) Noisy data handling

Noise is random error or deviation in a measured variable, including incorrect values and isolated points that deviate from the expected values. Common methods for handling noise include binning, regression, computer or manual inspection, and clustering.

 

Data transformation mainly uses methods such as smoothing, aggregation, data generalization, and normalization to convert the data into a format that is more suitable for data mining.
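As a small example of such a transformation, the sketch below scales all numeric attributes into [0, 1] with the unsupervised Normalize attribute filter (the file name is only an assumed example):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;

// Scale every numeric attribute into [0, 1]
Instances data = DataSource.read("data/cpu.arff");   // assumed example file
Normalize normalize = new Normalize();
normalize.setInputFormat(data);
Instances normalized = Filter.useFilter(data, normalize);
System.out.println(normalized.toSummaryString());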

 

Data reduction is mainly used to compress the data volume: a reduced representation of the dataset is derived from the source data that largely preserves the integrity of the original data while being much smaller. Mining the reduced data takes less time and memory than mining the original data and produces the same, or almost the same, analysis results. Common methods include dimensionality reduction, data compression, and numerosity reduction.
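For instance, a very simple form of numerosity reduction is random subsampling. A sketch with the Resample instance filter (file name assumed) might look like this:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.instance.Resample;

// Keep a random 50% subsample of the instances
Instances data = DataSource.read("data/cpu.arff");   // assumed example file
Resample resample = new Resample();
resample.setOptions(new String[] { "-Z", "50", "-S", "1" });  // -Z: sample size in percent, -S: random seed
resample.setInputFormat(data);
Instances reduced = Filter.useFilter(data, resample);
System.out.println(reduced.numInstances() + " of " + data.numInstances() + " instances kept");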

weka.filters

weka.filters contains simple (and in practice sufficient) implementations of data preprocessing, divided into two main categories: supervised filters (weka.filters.supervised) and unsupervised filters (weka.filters.unsupervised).

If you are using the GUI, click Choose under Filter to select a filter.

After selecting, click the name of the chosen filter to modify its parameters.

After modifying the parameters, click Apply.

I mostly use unsupervised filters. The following describes some of the more common ones.

First, the filters under the weka.filters.unsupervised.attribute package, which preprocess attributes in an unsupervised way.

1. Add

Adds a new attribute to the dataset; the new attribute contains only missing values. Optional parameters:

attributeIndex: position of the attribute, counted from 1; last means the last position and first means the first.

attributeName: the attribute name.

attributeType: the attribute type, chosen from the four available types.

dateFormat: the date format; see ISO 8601.

nominalLabels: the nominal labels; multiple values are separated by commas (,).

2. AddExpression

Adds an attribute computed from the existing attributes with a given expression. Supported operators and functions: +, -, *, /, ^, log, abs, cos, exp, sqrt, floor, ceil, rint, tan, sin. Existing attributes are referenced as "a" plus the attribute index (a1, a2, ...); a code sketch is given after this list.

3. AddID

Literally: adds an ID attribute.

4. AddNoise

Valid only for nominal attributes; changes a given percentage of the values.

5. Center

Centers numeric attributes so that their mean becomes 0.

6. ChangeDateFormat

Changes the date format.

7. Copy

Copies an attribute and names the copy "Copy of XX".

8. Discretize

Simple binning-based discretization. Parameters:

attributeIndices: the attribute range, such as 1-5 or first-last.

bins: the number of bins.

9. FirstOrder

Replaces the n-th value with the difference between the (n+1)-th value and the n-th value.

10. MathExpression

Similar in function to AddExpression, but supports more operations, notably MAX and MIN. All supported operators: +, -, *, /, pow, log, abs, cos, exp, sqrt, tan, sin, ceil, floor, rint, (, ), A, MEAN, MAX, MIN, SD, COUNT, SUM, SUMSQUARED, ifelse.

11. Reorder

Rearranges the attributes. Entering 2-last,1 moves the first attribute to the end; entering 1,3,5,... keeps only the listed attributes and drops the rest.

12. Standardize

Roughly the same as Center, but additionally standardizes to unit variance.

13. StringToNominal

Converts a string attribute to a nominal attribute.

14. SwapValues

Swaps two values of a nominal attribute.
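As an example of applying one of these attribute filters from code, here is a minimal sketch of AddExpression (item 2 above) that adds a new attribute equal to the sum of the first two attributes; the file name, expression, and attribute name are assumed for illustration:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.AddExpression;

// Add a new attribute "sum" computed as a1 + a2 (first attribute plus second attribute)
Instances data = DataSource.read("data/cpu.arff");   // assumed example file
AddExpression addExpr = new AddExpression();
addExpr.setOptions(new String[] { "-E", "a1+a2", "-N", "sum" });  // -E: expression, -N: name of the new attribute
addExpr.setInputFormat(data);
Instances withSum = Filter.useFilter(data, addExpr);
System.out.println(withSum.toSummaryString());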

Next come the filters under the weka.filters.unsupervised.instance package.

1. NonSparseToSparse

Converts all input instances to the sparse format.

2. Normalize

Normalizes the entire instance set.

3. RemoveFolds

Splits the data into cross-validation folds; does not support stratification. If stratification is needed, use the supervised version.

4. RemoveRange

Removes the instances in a specified range of row numbers.

5. Resample

Random sampling: produces a new, smaller sample from the existing instances.

6. SubsetByExpression

Filters instances based on a rule expression; supports logical operators, attribute values, absolute values, and so on. A code sketch follows below.
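A minimal sketch of SubsetByExpression; the expression and file name are assumptions for illustration (attributes are referenced as ATT1, ATT2, ...):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.instance.SubsetByExpression;

// Keep only the instances whose first attribute is greater than 100
Instances data = DataSource.read("data/cpu.arff");   // assumed example file
SubsetByExpression subset = new SubsetByExpression();
subset.setOptions(new String[] { "-E", "ATT1 > 100" });  // -E: the filter expression
subset.setInputFormat(data);
Instances filtered = Filter.useFilter(data, subset);
System.out.println(filtered.numInstances() + " instances match the expression");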

The weka.filters.supervised package contains fewer filters and involves some algorithmic principles; it is not introduced here and will be covered gradually in later articles.

Calling WEKA for data preprocessing

The use of WEKA is not limited to its built-in GUI or command line; we can also use WEKA's Java API to work with its infrastructure and the algorithms it implements.

Create a Java project and add a reference to weka.jar. This package is usually in the installation directory; my 3.6 version is about 6 MB.

Instances is the most important dataset container. Reading an ARFF file and initializing it looks like this:

Instances instances = DataSource.read("data/cpu.arff");
System.out.println(instances.toSummaryString());

Output:

 

The general process of using a filter is: instantiate the filter, set its options, and apply it with Filter.useFilter.

For example, to add an ID attribute to the cpu dataset, use the AddID filter.

First instantiate AddID:

 
AddID filter = new AddID();

The filter takes two options: the position and the name. Create a string array of length 4 and fill in the options:

String[] options = new String[4];
options[0] = "-C";   // position of the new ID attribute
options[1] = "first";
options[2] = "-N";   // name of the new attribute
options[3] = "ID";
filter.setOptions(options);
filter.setInputFormat(instances);

Apply the filter, then print the result:

 
Instances newInstances = Filter.useFilter(instances, filter);
System.out.println(newInstances.toSummaryString());

Complete code:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.AddID;

// Load the dataset and print a summary
Instances instances = DataSource.read("data/cpu.arff");
System.out.println(instances.toSummaryString());

// Configure AddID: insert an attribute named "ID" at the first position
AddID filter = new AddID();
String[] options = new String[4];
options[0] = "-C";
options[1] = "first";
options[2] = "-N";
options[3] = "ID";
filter.setOptions(options);
filter.setInputFormat(instances);

// Apply the filter and print the summary of the new dataset
Instances newInstances = Filter.useFilter(instances, filter);
System.out.println(newInstances.toSummaryString());

Output:

The following demonstrates using a discretization filter and saving the new data:

// Discretize attributes 2 through last into 8 bins
Discretize discretize = new Discretize();
options = new String[6];
options[0] = "-B";     // number of bins
options[1] = "8";
options[2] = "-M";     // desired weight of instances per bin (equal-frequency mode); -1 leaves it unused
options[3] = "-1.0";
options[4] = "-R";     // range of attributes to discretize
options[5] = "2-last";
discretize.setOptions(options);
discretize.setInputFormat(newInstances);
Instances newInstances2 = Filter.useFilter(newInstances, discretize);
System.out.println(newInstances2.toSummaryString());

// DataSink is weka.core.converters.ConverterUtils.DataSink
DataSink.write("data/newcpu.arff", newInstances2);

As you can see, calling the WEKA API from Java is not difficult. The key is to understand data mining and WEKA itself, and to have a clear idea of which method to use and what parameters it requires.

Extending WEKA: implementing your own filter

The WEKA version I use is 3.6.6; if your version differs, some details may differ.

Everything starts from the Filter class and then SimpleFilter. In general, we extend SimpleStreamFilter or SimpleBatchFilter.

The essential difference between the two is that one reads the full dataset at once while the other processes the data as a stream; the code can be almost identical, and the difference mainly affects efficiency and memory usage.
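For reference, a SimpleStreamFilter subclass overrides a process method that works on a single instance at a time. Using the rounding-down operation from the batch example developed below, the stream form would look roughly like this sketch (WEKA 3.6's Instance class assumed):

// Sketch of the stream variant: process one instance at a time
protected Instance process(Instance instance) throws Exception {
    double[] values = new double[instance.numAttributes()];
    for (int n = 0; n < instance.numAttributes(); n++) {
        values[n] = Math.floor(instance.value(n));  // round each value down
    }
    return new Instance(1, values);  // weight 1; weka.core.Instance in WEKA 3.6
}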

For example, to round all attribute values down, extend SimpleBatchFilter and implement/override the following methods:

public Capabilities getCapabilities()
public String globalInfo()
protected Instances determineOutputFormat(Instances inputFormat)
protected Instances process(Instances inst)

Complete code:

@Override
public Capabilities getCapabilities() {
    Capabilities capabilities = super.getCapabilities();
    capabilities.enableAllAttributes();
    capabilities.enableAllClasses();
    capabilities.enable(Capability.NO_CLASS);  // the filter also works without a class attribute
    return capabilities;
}

public String globalInfo() {
    return "A simple batch filter that rounds all attribute values down "
            + "to the nearest integer.";
}

protected Instances determineOutputFormat(Instances inputFormat) {
    Instances result = new Instances(inputFormat, 0);
    return result;
}

protected Instances process(Instances inst) {
    Instances result = new Instances(determineOutputFormat(inst), 0);
    for (int i = 0; i < inst.numInstances(); i++) {
        double[] values = new double[result.numAttributes()];
        for (int n = 0; n < inst.numAttributes(); n++)
            values[n] = Math.floor(inst.instance(i).value(n));
        result.add(new Instance(1, values));  // WEKA 3.6: weka.core.Instance, weight 1
    }
    return result;
}

Output:
