Classification methods usually require converting the various properties of the data into a vector representation: each data sample becomes a vector, and each dimension of that vector corresponds to one feature attribute.
Suppose the data to be converted has attributes such as gender, height, weight, and age. A is a woman, 168 cm, 70 kg, 30 years old; B is a man, 180 cm, 90 kg, 20 years old. Encoding the values directly gives the vectors (0, 168, 70, 30) and (1, 180, 90, 20). But 168, 70, and 30 belong to different attributes with very different scales, and gender is categorical: the numeric gap between 0 and 1 is not comparable to the differences in the other dimensions.
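To see why raw numeric vectors are a problem, a small sketch (using the A and B values above) shows that in a Euclidean distance the gender dimension is drowned out by height and weight:

```python
import math

# Raw numeric vectors for A and B: (gender, height, weight, age)
a = [0, 168, 70, 30]
b = [1, 180, 90, 20]

# Euclidean distance between A and B
dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Fraction of the squared distance contributed by the gender dimension.
# A difference of 1 is negligible next to the height/weight differences,
# even though it encodes a change of category.
gender_share = (a[0] - b[0]) ** 2 / dist ** 2
print(dist, gender_share)
```

Here the gender dimension contributes well under 1% of the squared distance, so a distance-based classifier would effectively ignore it.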
One remedy is to make each dimension dimensionless by normalizing its values into a common range, for example [0, 1] or [-0.5, +0.5].
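A minimal sketch of this step, assuming simple min-max scaling into [0, 1] (the helper name is illustrative, not from the original):

```python
def min_max_scale(values):
    """Scale a list of numeric values into [0, 1] (min-max normalization)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

heights = [168, 180, 175]
print(min_max_scale(heights))  # smallest -> 0.0, largest -> 1.0
```

Shifting the result by 0.5 gives the [-0.5, +0.5] variant mentioned above.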
However, this still does not help categorical attributes such as gender: after scaling, 0 and 1 still look like ordered numbers with no meaningful relation to the other dimensions. This is what one-hot encoding addresses: each attribute is encoded as a vector in which exactly one position is active (non-zero) at a time. A's gender becomes (1, 0) and B's gender becomes (0, 1). Age, weight, height, and so on can likewise be expressed as longer one-hot vectors. The length does not need to cover every conceivable value, only the categories that actually appear in the data; if only three distinct heights occur, then (0, 0, 1) is enough to represent one of them.
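A minimal one-hot encoder along these lines might look as follows (the function name and category lists are illustrative assumptions):

```python
def one_hot(value, categories):
    """Encode `value` as a one-hot vector over the observed `categories`."""
    vec = [0] * len(categories)
    vec[categories.index(value)] = 1  # exactly one activated position
    return vec

genders = ["female", "male"]
print(one_hot("female", genders))  # A's gender -> [1, 0]
print(one_hot("male", genders))    # B's gender -> [0, 1]

# Only categories actually present in the data need a slot:
heights = [160, 168, 180]          # only three distinct heights observed
print(one_hot(180, heights))       # -> [0, 0, 1]
```

In practice, libraries such as scikit-learn provide the same behavior via `OneHotEncoder`, but the hand-rolled version makes the idea explicit.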
Finally, the encoded attributes are concatenated into one very sparse feature vector; for example, concatenating gender and height gives (0, 1, 0, 0, 1). This keeps each categorical value clearly separated from the others.
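The concatenation step can be sketched like this, reproducing the (0, 1, 0, 0, 1) example for B (male, 180 cm); the category lists are illustrative assumptions:

```python
# Assumed per-attribute vocabularies, built from the observed data
GENDER_CATS = ["female", "male"]
HEIGHT_CATS = [160, 168, 180]

def one_hot(value, categories):
    """Encode `value` as a one-hot vector over the observed `categories`."""
    vec = [0] * len(categories)
    vec[categories.index(value)] = 1
    return vec

def encode(gender, height):
    """Concatenate per-attribute one-hot codes into one sparse vector."""
    return one_hot(gender, GENDER_CATS) + one_hot(height, HEIGHT_CATS)

# B: male, 180 cm -> gender [0, 1] + height [0, 0, 1]
print(encode("male", 180))  # -> [0, 1, 0, 0, 1]
```

With many attributes and many categories per attribute, most entries are zero, which is why such vectors are usually stored in a sparse format.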
Related references:
http://blog.csdn.net/google19890102/article/details/44039761