Reposted from: http://blog.sina.com.cn/s/blog_5252f6ca0102uy47.html
Origin of the problem
In many machine learning tasks, features are not always continuous values; they may instead be categorical.
For example, consider the following three features:
["male", "female"]
["from Europe", "from US", "from Asia"]
["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"]
Representing these features as numbers makes them much more efficient to process. For example:
["male", "from US", "uses Internet Explorer"] is encoded as [0, 1, 3]
["female", "from Asia", "uses Chrome"] is encoded as [1, 2, 1]
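The mapping above can be sketched in plain Python. This is only an illustration and not part of the original post: the names FEATURES and encode_sample are made up for this example, and each code is assumed to be the value's position in its feature list.

```python
# Hypothetical helper: integer-encode a sample by list position.
FEATURES = [
    ["male", "female"],
    ["from Europe", "from US", "from Asia"],
    ["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"],
]

def encode_sample(sample):
    """Replace each categorical value with its index in the matching feature list."""
    return [values.index(value) for values, value in zip(FEATURES, sample)]

print(encode_sample(["male", "from US", "uses Internet Explorer"]))  # [0, 1, 3]
print(encode_sample(["female", "from Asia", "uses Chrome"]))         # [1, 2, 1]
```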
However, even after conversion to a numeric representation, this data cannot be fed directly into a classifier. Classifiers often assume by default that the data is continuous and ordered, whereas the numbers above carry no order: they were assigned arbitrarily.
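A small sketch may help show why arbitrary integer codes can mislead a model that treats its inputs as ordered numbers. It is illustrative only; BROWSER_CODE and code_distance are names invented for this example.

```python
# Integer codes assigned arbitrarily to the browser feature (illustrative).
BROWSER_CODE = {
    "uses Firefox": 0,
    "uses Chrome": 1,
    "uses Safari": 2,
    "uses Internet Explorer": 3,
}

def code_distance(a, b):
    """Absolute difference between two integer codes, as a distance-based model would see it."""
    return abs(BROWSER_CODE[a] - BROWSER_CODE[b])

# Firefox looks "closer" to Chrome than to Internet Explorer,
# although the numbering carries no such meaning.
print(code_distance("uses Firefox", "uses Chrome"))             # 1
print(code_distance("uses Firefox", "uses Internet Explorer"))  # 3
```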
One-Hot Encoding
One possible solution to this problem is one-hot encoding.
One-hot encoding, also known as one-bit-effective encoding, uses an N-bit state register to encode N states: each state has its own dedicated register bit, and at any time exactly one bit is set.
For example:
Natural binary state codes: 000, 001, 010, 011, 100, 101
One-hot codes: 000001, 000010, 000100, 001000, 010000, 100000
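The two code tables for six states can be generated with a few lines of Python. This is a minimal sketch; the function names natural_codes and one_hot_codes are assumptions of this example.

```python
def natural_codes(n):
    """Binary-count encoding: state i is i written in binary, using the fewest bits that fit n states."""
    width = max(1, (n - 1).bit_length())
    return [format(i, "0{}b".format(width)) for i in range(n)]

def one_hot_codes(n):
    """One-hot encoding: state i sets only bit i (counting from the right) in an n-bit word."""
    return [format(1 << i, "0{}b".format(n)) for i in range(n)]

print(natural_codes(6))  # ['000', '001', '010', '011', '100', '101']
print(one_hot_codes(6))  # ['000001', '000010', '000100', '001000', '010000', '100000']
```

Six states need only three bits in the natural code but six bits in the one-hot code; the extra width buys the property that exactly one bit is set in every code word.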
In other words, a feature with m possible values becomes m binary features after one-hot encoding. These features are mutually exclusive, with exactly one active at a time, so the resulting data is sparse.
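For a single feature, the expansion can be sketched as follows. The helper one_hot and the list REGIONS are hypothetical names used only for illustration.

```python
def one_hot(value, categories):
    """Return len(categories) binary features with a single 1 at the position of value."""
    return [1 if value == category else 0 for category in categories]

REGIONS = ["from Europe", "from US", "from Asia"]  # m = 3 possible values

for region in REGIONS:
    vector = one_hot(region, REGIONS)
    # Exactly one of the m binary features is active.
    assert sum(vector) == 1
    print(region, vector)
```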
The main benefits are:
It solves the problem that classifiers do not handle categorical data well.
To some extent, it also serves to expand the feature set.
Example
Here is a simple example using Python and scikit-learn:
from sklearn import preprocessing
enc = preprocessing.OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
enc.transform([[0, 1, 3]]).toarray()
Output:
array([[1., 0., 0., 1., 0., 0., 0., 0., 1.]])
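To see where the nine output columns come from, the following sketch mimics the fit and transform steps in plain Python. It is not how scikit-learn is implemented; fit_categories and transform_row are hypothetical helpers written for this explanation, and the sketch assumes that each column's categories are the sorted distinct values seen during fitting.

```python
def fit_categories(rows):
    """Collect the sorted distinct values of each column."""
    return [sorted(set(column)) for column in zip(*rows)]

def transform_row(row, categories):
    """Concatenate one one-hot block per column."""
    encoded = []
    for value, column_categories in zip(row, categories):
        encoded.extend(1.0 if value == c else 0.0 for c in column_categories)
    return encoded

training = [[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]
categories = fit_categories(training)
print(categories)  # [[0, 1], [0, 1, 2], [0, 1, 2, 3]] -> 2 + 3 + 4 = 9 columns
print(transform_row([0, 1, 3], categories))
# [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0]
```

The first two columns encode the first feature (value 0), the next three encode the second feature (value 1), and the last four encode the third feature (value 3).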