"Python Machine Learning" notes (iv)

Source: Internet
Author: User

Data preprocessing--building a good training data set

The ultimate learning outcome of machine learning algorithms depends on two main factors: the quality of the data and the amount of useful information contained in the data.

Processing of missing data

In practice, it is not uncommon for a sample to be missing one or more values, for a variety of reasons: errors occurred during data acquisition, a commonly used measurement is not applicable to a particular feature, or some fields were simply left blank in a survey. A missing value usually shows up as a blank cell in the data table, or as a placeholder such as NaN (Not a Number).

If we simply ignore these missing values, most computational tools will either fail to process the raw data or produce unpredictable results. Therefore, missing values must be dealt with before any deeper analysis.

Deleting features or samples with missing values

The simplest way to handle missing data is to remove the affected features (columns) or samples (rows) from the dataset entirely. Rows containing missing values can be removed with the dropna method, which is available on the pandas DataFrame data structure.

Similarly, we can set the axis parameter to 1 to drop any column that contains at least one NaN value.
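
A minimal sketch of both calls, assuming df is a small, hypothetical pandas DataFrame that already contains some NaN entries:

import pandas as pd
import numpy as np

# hypothetical example DataFrame with missing values
df = pd.DataFrame([[1.0, 2.0, 3.0, 4.0],
                   [5.0, 6.0, np.nan, 8.0],
                   [10.0, 11.0, 12.0, np.nan]],
                  columns=['A', 'B', 'C', 'D'])

df.dropna()        # drop rows (axis=0) that contain at least one NaN
df.dropna(axis=1)  # drop columns that contain at least one NaN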


The dropna method also supports several other parameters for handling different missing-value situations:

# only drop rows where all columns are NaN
df.dropna(how='all')

# drop rows that do not have at least 4 non-NaN values
df.dropna(thresh=4)

# only drop rows where NaN appears in specific columns (here: 'C')
df.dropna(subset=['C'])

Deleting missing data may seem convenient, but it has certain drawbacks: we may remove too many samples, which makes the analysis results less reliable. On the other hand, if we remove too many feature columns, we risk losing valuable information that the classifier needs to discriminate between classes.


Imputing missing data

In general, deleting samples or dropping entire feature columns is often not feasible, because we might lose too much valuable data. In this case we can use interpolation techniques to estimate the missing values from the other training samples in the dataset. One of the most commonly used techniques is mean imputation, where each missing value is replaced by the mean of the corresponding feature column. We can implement this conveniently with the Imputer class in scikit-learn:

from sklearn.preprocessing import Imputer
imr = Imputer(missing_values='NaN', strategy='mean', axis=0)
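
A short usage sketch, assuming df is the hypothetical DataFrame with NaN entries from above; the fit/transform calls follow the usual scikit-learn estimator API (in newer scikit-learn versions the equivalent class is SimpleImputer in sklearn.impute):

imr = imr.fit(df.values)                  # learn the column means from the data
imputed_data = imr.transform(df.values)   # replace each NaN with its column mean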

If the parameter axis=0 is changed to axis=1, the means are computed across each row instead and used for the substitution. The strategy parameter also accepts median or most_frequent; the latter replaces missing values with the most frequent value in the corresponding column (or row), and is often used for imputing categorical feature values.


Working with categorical data

So far we have only dealt with numerical data, but real-world datasets often contain one or more categorical feature columns. When discussing categorical data, we can further distinguish between nominal and ordinal features. An ordinal feature takes category values that can be ordered or sorted; nominal features, by contrast, have no natural ordering. For example, a t-shirt size is ordinal (XL > L > M), while a t-shirt color is nominal.

Mapping ordinal features

To make sure the learning algorithm interprets ordinal features correctly, we need to convert the categorical string values to integers. Unfortunately, there is no convenient function that can automatically derive the correct order of an ordinal feature, so we have to define the mapping manually.

size_mapping = {'XL': 3, 'L': 2, 'M': 1}

df['size'] = df['size'].map(size_mapping)  # apply the mapping to the size column


Encoding class labels

Many machine learning libraries require class labels to be encoded as integer values. Although most classifiers in scikit-learn convert class labels to integers internally, it is considered good practice to provide the class labels as integer arrays anyway, to avoid technical glitches. To encode the class labels we can use an approach similar to the mapping of ordinal features discussed above. Keep in mind, however, that class labels are not ordinal, so it does not matter which integer we assign to which label; we can therefore simply enumerate the labels starting at 0.
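
A minimal sketch of this enumeration approach, assuming the DataFrame has a class-label column named 'classlabel' (the column name is hypothetical):

import numpy as np

# map each unique class label to an integer, starting at 0
class_mapping = {label: idx for idx, label in enumerate(np.unique(df['classlabel']))}
df['classlabel'] = df['classlabel'].map(class_mapping)

# the inverse mapping can convert the integers back to the original strings
inv_class_mapping = {v: k for k, v in class_mapping.items()}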

One-hot encoding of nominal features

We have already used a dictionary mapping to convert ordinal features to integers. Because scikit-learn's estimators treat class labels as unordered data, we can use the LabelEncoder class from scikit-learn to convert string class labels to integers. The same approach can be applied to a nominal column such as the color column of the dataset.
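
A short sketch using LabelEncoder, assuming the dataset contains a nominal 'color' column (the column name is an assumption):

from sklearn.preprocessing import LabelEncoder

color_le = LabelEncoder()
df['color'] = color_le.fit_transform(df['color'].values)  # e.g. 'blue' -> 0, 'green' -> 1, ...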

One-hot encoding creates a new dummy (virtual) feature for each unique value of the nominal feature. For example, a color feature with the values r, g, b would be expanded into three binary columns, one per color.

When we initialize the OneHotEncoder object, we use the categorical_features parameter to select the position of the feature column we want to transform. By default, OneHotEncoder's transform method returns a sparse matrix; for easier inspection we can convert it to a regular NumPy array via the toarray method. Sparse matrices are an efficient way to store large datasets, especially when the data contains many zero values, and they are supported by many scikit-learn functions. To skip the toarray step, we can instead pass sparse=False during initialization, i.e. OneHotEncoder(..., sparse=False), so that a regular NumPy array is returned directly.
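
A minimal sketch, assuming X is a NumPy feature matrix whose first column (index 0) holds the integer-encoded color values; the variable name X and that layout are assumptions, and the categorical_features parameter belongs to the older scikit-learn API used throughout these notes:

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(categorical_features=[0])
X_onehot = ohe.fit_transform(X).toarray()   # dense NumPy array with the dummy columns

# equivalently, return a dense array directly
ohe = OneHotEncoder(categorical_features=[0], sparse=False)
X_onehot = ohe.fit_transform(X)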

Alternatively, we can create these one-hot dummy features even more conveniently with the get_dummies method in pandas. When applied to a DataFrame, get_dummies converts only the string columns and leaves all other columns unchanged.
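
A short sketch, assuming the DataFrame has 'price', 'color' and 'size' columns (the column names are assumptions); only the string column 'color' would be expanded into dummy columns:

import pandas as pd

pd.get_dummies(df[['price', 'color', 'size']])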

Dividing a dataset into a training dataset and a test data set

A convenient way to randomly partition a dataset into training and test subsets is the train_test_split function from scikit-learn's cross_validation module:

from sklearn.cross_validation import train_test_split
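
A short usage sketch, assuming X holds the feature columns and y the class labels; the 30% test split and the random_state value are arbitrary choices (in newer scikit-learn versions, train_test_split lives in sklearn.model_selection instead):

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)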

Scaling feature values to the same interval

Feature scaling is an important step in data preprocessing, but one that is easily overlooked.

Decision trees and random forests are among the few machine learning algorithms that do not require feature scaling, but for most other machine learning and optimization algorithms, scaling the features to the same interval leads to better performance.

Currently there are two common approaches to bring different features onto the same scale: normalization and standardization.


Normalization (min-max scaling):

from sklearn.preprocessing import MinMaxScaler
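
A minimal usage sketch, assuming X_train and X_test come from the split above; the scaler is fitted on the training data only and then reused for the test data:

mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)  # learn min/max from the training data
X_test_norm = mms.transform(X_test)        # apply the same parameters to the test data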

Standardization:

from sklearn.preprocessing import StandardScaler
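
A corresponding sketch for standardization (zero mean, unit variance), again fitting only on the training data:

stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)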

Selecting meaningful features

If a model performs much better on the training dataset than on the test dataset, it is overfitting the training data.

Commonly used approaches to reduce the generalization error are:

(1) Collect more training data

(2) Introduce a penalty for complexity via regularization (see the short sketch after this list)

(3) Choose a simpler model with fewer parameters

(4) Reduce the dimensionality of the data
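
As a minimal sketch of item (2), a regularized logistic regression in scikit-learn; the penalty type and the inverse regularization strength C are hypothetical choices (a smaller C means stronger regularization), and the standardized arrays are assumed to come from the scaling sketches above:

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(penalty='l2', C=0.1)
lr.fit(X_train_std, y_train)
print(lr.score(X_test_std, y_test))  # accuracy on the standardized test data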

Sequential feature selection algorithms

Another way to reduce model complexity and avoid overfitting is dimensionality reduction via feature selection, which is particularly useful for unregularized models.

Dimensionality reduction techniques fall into two main categories: feature selection and feature extraction. With feature selection we choose a subset of the original features, while feature extraction constructs a new feature subspace by deriving information from the existing features. In this section we will look at some classic feature selection algorithms.

Sequential feature selection algorithms are greedy search methods that compress the original d-dimensional feature space into a k-dimensional feature subspace, where k < d.
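
A rough sketch of the greedy idea in a backward-selection variant (not the exact implementation from the book): starting from all d features, repeatedly drop the single feature whose removal hurts cross-validated accuracy the least, until only k features remain. The estimator, the cv value and the function name are assumptions; X is assumed to be a NumPy array.

from sklearn.cross_validation import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def sequential_backward_selection(X, y, k, estimator=None):
    estimator = estimator or KNeighborsClassifier(n_neighbors=5)
    indices = list(range(X.shape[1]))                 # start with all d feature indices
    while len(indices) > k:
        scores = []
        for i in indices:
            subset = [j for j in indices if j != i]   # candidate subset: drop feature i
            score = cross_val_score(estimator, X[:, subset], y, cv=5).mean()
            scores.append((score, i))
        _, worst_feature = max(scores)                # dropping this feature hurts least
        indices.remove(worst_feature)
    return indices                                    # indices of the k selected features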

"Python Machine Learning" notes (iv)

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.