Learning sklearn and Kaggle problems: what is one-hot encoding? Why use one-hot encoding? Under what circumstances can one-hot encoding be used? And what other encoding methods are there?
First, learn about the two kinds of features in machine learning: continuous and discrete features.
Given the raw features, each feature should be normalized. For example, suppose the value range of feature A is [-1000, 1000] and the value range of feature B is [-1, 1]. In a logistic regression w1*x1 + w2*x2, the values of x1 are so much larger that x2 basically plays no role. Therefore features must be normalized, and each feature is normalized separately.
For continuous features:
- Rescale bounded continuous features: all continuous inputs that are bounded should be rescaled to [-1, 1] via x' = (2x - max - min) / (max - min), a linear map onto [-1, 1].
- Standardize all continuous features: every continuous input should be standardized; that is, for every continuous feature, compute its mean u and standard deviation s and set x' = (x - u) / s, giving mean 0 and variance 1. (Both transforms are sketched in code after this list.)
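A minimal sketch of both transforms with sklearn (MinMaxScaler and StandardScaler are the standard tools; the toy data is made up):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data: feature A spans [-1000, 1000], feature B spans [-1, 1]
X = np.array([[-1000.0, -1.0], [0.0, 0.0], [1000.0, 1.0]])

# x' = (2x - max - min) / (max - min): a linear map onto [-1, 1]
rescaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)

# x' = (x - u) / s: zero mean and unit variance per feature
standardized = StandardScaler().fit_transform(X)

print(rescaled)       # each column now spans [-1, 1]
print(standardized)   # each column now has mean 0 and std 1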
For discrete features:
- Binarize categorical/discrete features: discrete features are basically handled with one-hot encoding; a feature with k distinct values is represented with k bits.
One. What is one-hot encoding?
One-hot encoding, intuitively, is a coding scheme with as many bits as there are states, where exactly one bit is 1 and all the others are 0. An example follows:
Suppose a color feature has three values: red, yellow, blue. Machine learning algorithms generally require the input to be vectorized or numeric, so you might be tempted to set red = 1, yellow = 2, blue = 3. That is label encoding: it gives each category a numeric tag. But it also means the machine may learn that "red < yellow < blue", which is not our intent; we only want the machine to distinguish the categories, with no notion of magnitude. So label encoding is not enough and a further transformation is needed. Since there are three color states, we use 3 bits: red = 1 0 0, yellow = 0 1 0, blue = 0 0 1. As a result, the Euclidean distance between any two of these vectors is sqrt(2), so the categories are equidistant in vector space, there is no spurious ordering, and algorithms based on vector-space metrics are essentially unaffected.
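To make the color example concrete, here is a minimal sketch using sklearn's OneHotEncoder (the sparse_output flag is from recent sklearn; older versions use sparse=False instead):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array([["red"], ["yellow"], ["blue"]])
enc = OneHotEncoder(sparse_output=False)
onehot = enc.fit_transform(colors)
print(enc.categories_)   # [array(['blue', 'red', 'yellow'], dtype=object)]
print(onehot)            # rows: red=[0,1,0], yellow=[0,0,1], blue=[1,0,0]
                         # (sklearn sorts the categories alphabetically)

# Any two distinct colors are the same Euclidean distance apart: sqrt(2)
print(np.linalg.norm(onehot[0] - onehot[1]))   # 1.414...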
Natural (binary) status code: 000, 001, 010, 011, 100, 101
One-hot code: 000001, 000010, 000100, 001000, 010000, 100000
Here is an example using sklearn:
from sklearn import preprocessing
enc = preprocessing.OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
enc.transform([[0, 1, 3]]).toarray()  # encode
Output: array([[1., 0., 0., 1., 0., 0., 0., 0., 1.]])
The data matrix is 4×3: 4 samples and 3 feature dimensions.
0 0 3
1 1 0
0 2 1
1 0 2

Looking at the data matrix above: the first column is the first feature dimension, which takes the two values 0 and 1, so its corresponding codes are 10 and 01. Similarly, the second column is the second feature dimension with the three values 0, 1, 2, so its codes are 100, 010, 001. The third column is the third feature dimension with the four values 0, 1, 2, 3, so its codes are 1000, 0100, 0010, 0001.
Now look at the sample to encode, [0, 1, 3]: the 0 in the first feature encodes as 10, the 1 in the second feature as 010, and the 3 in the third feature as 0001. Concatenated, the encoded result is 1 0 0 1 0 0 0 0 1.
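In recent sklearn versions you can inspect the categories the encoder learned per column, which makes the bit layout above explicit (a short continuation of the snippet above; categories_ is the modern attribute name):

print(enc.categories_)
# [array([0, 1]), array([0, 1, 2]), array([0, 1, 2, 3])]
# 2 + 3 + 4 = 9 output columns in total, matching the 9-bit result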
Two. Why use one-hot encoding?
As stated above, one-hot encoding (also known as dummy variables) is used because most algorithms compute with metrics in a vector space: we want values that have no natural ordering to stay unordered, and to be equidistant from one another. One-hot encoding maps each value of a discrete feature to a point in Euclidean space, which makes distance calculations between features more reasonable. After one-hot encoding, each dimension of a discrete feature can be treated as a continuous feature and normalized just like one, e.g. to [-1, 1] or to mean 0 and variance 1.
Why map feature vectors to Euclidean space?
Discrete features are mapped to Euclidean space via one-hot encoding because, in regression, classification, clustering, and other machine learning algorithms, computing distances or similarities between features is central, and the common distance and similarity measures, including cosine similarity, are defined in Euclidean space.
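For instance, cosine similarity only behaves sensibly after the mapping. A minimal sketch (reusing the red/yellow/blue codes from the color example; the helper function is ad hoc):

import numpy as np

def cos_sim(a, b):
    # Cosine similarity: dot product over the product of norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Label encoding: every category is a positive scalar, so cosine similarity
# is always 1 and carries no information
print(cos_sim(np.array([1.0]), np.array([3.0])))   # 1.0

# One-hot encoding: different categories are orthogonal (similarity 0)
red, blue = np.array([1, 0, 0]), np.array([0, 0, 1])
print(cos_sim(red, blue))   # 0.0
print(cos_sim(red, red))    # 1.0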
Three. Advantages and disadvantages of one-hot encoding
- Advantages: one-hot encoding solves the problem of classifiers that cannot handle categorical attributes, and to a certain extent it also expands the feature set. The values are only 0 and 1, and different categories occupy mutually perpendicular (orthogonal) directions in the vector space.
- Cons: when the number of categories is large, the feature space becomes very large. In this case, PCA can generally be used to reduce the dimensionality, and the one-hot encoding + PCA combination is very useful in practice (sketched below).
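A sketch of that combination as a sklearn Pipeline (the data and n_components are made up for illustration):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import PCA

# One categorical column with 4 distinct values -> 4 one-hot columns
X = np.array([["a"], ["b"], ["c"], ["d"], ["a"], ["c"]])

# PCA needs dense input, hence sparse_output=False (sparse=False in old sklearn)
pipe = make_pipeline(OneHotEncoder(sparse_output=False), PCA(n_components=2))
reduced = pipe.fit_transform(X)
print(reduced.shape)   # (6, 2): the 4 one-hot columns compressed to 2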
Four. Under what circumstances should one-hot encoding (not) be used?
- Use it: one-hot encoding solves the problem of categorical data taking discrete, unordered values.
- Skip it: the point of one-hot encoding a discrete feature is to make distance calculations more reasonable, so if the feature is discrete but distances can already be computed reasonably without one-hot encoding, there is no need for it. Some tree-based algorithms do not rely on vector-space metrics when handling variables: a value is only a category symbol, with no partial-order relationship, so one-hot encoding is unnecessary. Tree models do not need one-hot encoding: for a decision tree, one-hot encoding essentially just increases the depth of the tree.
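A sketch of the tree case: a decision tree consumes integer category codes directly, so one-hot encoding is unnecessary (OrdinalEncoder is just one way to produce such codes; the toy data is made up):

import numpy as np
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

X = np.array([["red"], ["yellow"], ["blue"], ["red"]])
y = np.array([0, 1, 1, 0])

# Integer codes are enough: tree splits are threshold-based partitions of
# the values, not vector-space distance computations
X_codes = OrdinalEncoder().fit_transform(X)   # blue=0, red=1, yellow=2
clf = DecisionTreeClassifier(random_state=0).fit(X_codes, y)
print(clf.predict(X_codes))   # [0 1 1 0]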
In general, if the number of categories to one-hot encode is not too large, one-hot encoding is recommended as the default.
Five. Under what circumstances is normalization (not) needed?
- Needed: parameter-based and distance-based models require feature normalization.
- Not required: tree-based methods do not require feature normalization, e.g. random forests, bagging, and boosting.
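A sketch contrasting the two cases with standard sklearn pieces (the dataset is synthetic):

from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Parameter/distance-based model: normalize first
lr = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

# Tree-based model: trains fine on raw, unscaled features
rf = RandomForestClassifier(random_state=0).fit(X, y)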
Six. Label encoding: LabelEncoder
Function: LabelEncoder() converts categorical values into consecutive numeric codes. That is, non-consecutive numbers or text are numbered, for example:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit([1, 5, 67, 100])
le.transform([1, 1, 100, 67, 5])
Output: array([0, 0, 3, 2, 1])
>>> le = preprocessing.LabelEncoder()
>>> le.fit(["Paris", "Paris", "Tokyo", "Amsterdam"])
LabelEncoder()
>>> list(le.classes_)  # the three classes map to 0, 1, 2
['Amsterdam', 'Paris', 'Tokyo']
>>> le.transform(["Tokyo", "Tokyo", "Paris"])
array([2, 2, 1])
>>> list(le.inverse_transform([2, 2, 1]))  # the inverse mapping
['Tokyo', 'Tokyo', 'Paris']
Limitations: the color example above already touched on label encoding. Label encoding is useful in some cases, but its applicability is limited. One more example: given [dog, cat, dog, mouse, cat], we convert it to [1, 2, 1, 3, 2]. A strange artifact appears: the average of dog and mouse is cat. So label encoding is not widely used.
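The dog/cat/mouse artifact in code (a sketch; note that LabelEncoder assigns codes alphabetically, cat=0, dog=1, mouse=2, so the numbers differ from the text but the same problem appears):

from sklearn.preprocessing import LabelEncoder

animals = ["dog", "cat", "dog", "mouse", "cat"]
codes = LabelEncoder().fit_transform(animals)
print(codes)   # [1 0 1 2 0]

# Spurious arithmetic: the average of cat (0) and mouse (2) is dog (1)
print((codes[1] + codes[3]) / 2)   # 1.0, the code for "dog"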
Attached: Basic machine learning process
References:
Quora: What are good ways to handle discrete and continuous inputs together?
Data preprocessing: one-hot encoding (One-hot Encoding)
Elegant data mining with sklearn
A universal framework for data mining competitions
Label Encoding vs One Hot Encoding
[Scikit-learn] Pitfalls of the feature binary-coding functions
OneHotEncoder one-hot encoding and LabelEncoder label encoding