One-hot encoding maps a column of categorical features (also called nominal features) to a set of binary features that can be treated as continuous. If the original categorical feature has several possible values, it is mapped to that many binary features, each representing one value: the feature takes 1 if the sample exhibits that value, and 0 otherwise.
One-hot encoding suits algorithms that expect continuous features, such as logistic regression.
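To make the mapping concrete, here is a minimal pure-Python sketch of the idea, independent of Spark. The helper name one_hot and the alphabetical column order are assumptions made for illustration only; Spark orders its columns differently.

```python
def one_hot(values):
    """Map each categorical value to a list of 0/1 indicators.

    Hypothetical helper for illustration; not part of any library.
    """
    # Assumption: one column per distinct value, in alphabetical order.
    categories = sorted(set(values))
    # Each row has a 1 in the column of its own category and 0 elsewhere.
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


values = ["a", "b", "c", "a", "a", "c"]
categories, rows = one_hot(values)
for value, row in zip(values, rows):
    print(value, row)
```

Every row contains exactly one 1, which is where the name "one-hot" comes from.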
First, create a DataFrame that contains a column of categorical features. Note that before converting with OneHotEncoder, the DataFrame must first be passed through StringIndexer to convert the original labels into numeric indices:
# Import the required classes
from pyspark.sql import SparkSession
from pyspark.ml.feature import OneHotEncoder, StringIndexer

# Create the SparkSession object and configure Spark
spark = SparkSession.builder.master('local').appName('OneHotEncoderDemo').getOrCreate()

# Create a simple DataFrame to use as the training set
df = spark.createDataFrame([
    (0, "a"),
    (1, "b"),
    (2, "c"),
    (3, "a"),
    (4, "a"),
    (5, "c")
], ["id", "category"])

# Create the StringIndexer object and set the input and output columns
indexer = StringIndexer(inputCol='category', outputCol='categoryIndex')
# Fit the indexer to produce a model
model = indexer.fit(df)
# Use the fitted model to transform the DataFrame
indexed = model.transform(df)

# Create the OneHotEncoder object and set the input and output columns
encoder = OneHotEncoder(inputCol='categoryIndex', outputCol='categoryVec')

# Encode the indexed DataFrame. The encoded binary features are sparse
# vectors, in the same order as the StringIndexer indices. Note that the
# last category ("b") is encoded as an all-zero vector. If you want "b" to
# have its own binary feature as well, pass dropLast=False (or call
# setDropLast(False)) when creating the OneHotEncoder.
# In Spark 3.x, OneHotEncoder is an Estimator and must be fitted first;
# in Spark 2.x, call encoder.transform(indexed) directly instead.
encoded = encoder.fit(indexed).transform(indexed)
encoded.show()
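The reason the least frequent category ends up as an all-zero vector can be sketched in plain Python. This is only an approximation of the behavior described above, not Spark's implementation: the helper names index_by_frequency and encode are hypothetical, and breaking frequency ties alphabetically is an assumption.

```python
from collections import Counter


def index_by_frequency(values):
    """Assign index 0 to the most frequent label, 1 to the next, and so on.

    Hypothetical stand-in for the indexing step; ties are broken
    alphabetically here, which is an assumption.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: i for i, label in enumerate(ordered)}


def encode(index, size, drop_last=True):
    """One-hot encode an index, optionally dropping the last category."""
    width = size - 1 if drop_last else size
    vec = [0] * width
    if index < width:  # the dropped category keeps an all-zero vector
        vec[index] = 1
    return vec


values = ["a", "b", "c", "a", "a", "c"]
mapping = index_by_frequency(values)
dropped = {v: encode(i, len(mapping)) for v, i in mapping.items()}
full = {v: encode(i, len(mapping), drop_last=False) for v, i in mapping.items()}
print(mapping)
print(dropped)
print(full)
```

With the last category dropped, each vector has one fewer dimension and the vectors no longer always sum to 1, which avoids a redundant column for linear models. Keeping it gives every category, including "b", its own position.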
Feature extraction: conversion of labels and indexes with OneHotEncoder