Before formal modeling begins, you need a good understanding of the data you will be modeling with. This article introduces some common methods for observing and processing data.

1. Data observation
(1) Compute the missing rate of each column in the data table
%pyspark
# Construct sample raw data
df = spark.createDataFrame([
    (1, 175, 72, 28, 'm', 10000),
    (2, 171, 70, 45, 'm', None),
    (3, 172, None, None, None, None),
    (4, 180, 78, 33, 'm', None),
    (5, None, 48, 54, 'f', None),
    (6, 160, 45, 30, 'f', 5000),
    (7, 169, 65, None, 'm', 5000)],
    ['id', 'height', 'weight', 'age', 'gender', 'income'])
res_df = df.rdd.map(list).collect()
# Compute the missing rate of each column
for i in range(6):
    # Get the data of column i
    columns = [item[i] for item in res_df]
    # Count the non-null values in column i
    count = sum([1 for item in columns if item is not None])
    # Compute the missing rate of column i
    missing_rate = 1 - count / len(res_df)
    print("The missing rate of column {} is: {:.4f}%".format(i + 1, missing_rate * 100))
The output results are as follows:
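Note that the approach above collects the whole table to the driver, which does not scale to large tables. As a minimal sketch of an alternative that computes the same missing rates inside Spark (assuming the df defined above; F.count only counts non-null values):

%pyspark
from pyspark.sql import functions as F
total = df.count()
# One output column per input column, holding its missing rate
missing = df.select([(1 - F.count(c) / total).alias(c) for c in df.columns])
missing.show()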
(2) Compute summary statistics of a specified column
%pyspark
from pyspark.sql import functions as F
# Construct sample raw data
df = spark.createDataFrame([
    (1, 175, 72, 28, 'm', 10000),
    (2, 171, 70, 45, 'm', 8000),
    (3, 172, None, 27, 'f', 7000),
    (4, 180, 78, 30, 'm', 4000),
    (5, None, 48, 54, 'f', 6000),
    (6, 160, 45, 30, 'f', 5000),
    (7, 169, 65, 36, 'm', 7500)],
    ['id', 'height', 'weight', 'age', 'gender', 'income'])
# Group by gender first, then apply aggregate functions (max, min, mean, stddev) to the age column
df_summary = sorted(df.groupBy(df.gender).agg(
    F.max(df.age), F.min(df.age), F.mean(df.age), F.stddev(df.age)).collect())
print(df_summary)
The output results are as follows:
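The default aggregate column names (such as max(age)) can be awkward to reference downstream. A small variant, assuming the same df, renames them with alias:

%pyspark
df_summary2 = df.groupBy("gender").agg(
    F.max("age").alias("age_max"),
    F.min("age").alias("age_min"),
    F.mean("age").alias("age_mean"),
    F.stddev("age").alias("age_std"))
df_summary2.show()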
(3) Extract the values of a Vector column in a DataFrame
%pyspark
from pyspark.ml.linalg import Vectors
df = sc.parallelize([
    ("assert", Vectors.dense([1, 2, 3])),
    ("require", Vectors.sparse(3, {1: 2})),
    ("announce", Vectors.sparse(3, {0: 1, 2: 4}))
]).toDF(["word", "vector"])

# Extract the values from the Vector column of the DataFrame
def extract(row):
    return (row.word,) + tuple(row.vector.toArray().tolist())

res_df = df.rdd.map(extract).toDF(["word", "v_1", "v_2", "v_3"])
res_df.show()
# Select the specified columns
res_df.select("word", "v_1").show()
The output results are as follows:
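On Spark 3.0 or later, the same expansion can be done without dropping to the RDD API. A sketch using pyspark.ml.functions.vector_to_array, assuming the df above:

%pyspark
from pyspark.ml.functions import vector_to_array
from pyspark.sql import functions as F
# Convert the Vector column to an array column, then index into it
arr_df = df.withColumn("arr", vector_to_array(F.col("vector")))
res_df2 = arr_df.select(
    "word", *[F.col("arr")[i].alias("v_{}".format(i + 1)) for i in range(3)])
res_df2.show()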
2. Data processing
This section records some handy data-processing techniques.
(1) Generate an index for a list
%pyspark
# Use enumerate to generate an index for col_list
col_list = ['username', 'id', 'gender', 'age']
mapping_list = list(enumerate(sorted(col_list)))
print(mapping_list)
The output results are as follows:
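If the index should start from something other than 0, enumerate also accepts a start argument; a small illustration on the same col_list:

%pyspark
# Start numbering at 1 instead of 0
print(list(enumerate(sorted(col_list), start=1)))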
(2) Convert a list to a dict
%pyspark
# Swap the keys and values in mapping_list and convert the result to a dict
revs_maplist = {value: idx for idx, value in mapping_list}
print(revs_maplist)
The output results are as follows:
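For comparison, since mapping_list is already a list of (index, value) pairs, calling dict() on it directly yields the index-to-name mapping, the inverse of revs_maplist:

%pyspark
# index -> column name, the inverse of revs_maplist
print(dict(mapping_list))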
(3) For-loop shorthand (list comprehensions)
%pyspark
test_list = [1, 2, -3, 10, None, -5, 0, 10.5]
# Shorthand 1: the if filter comes after the for clause
result1 = [2 * item for item in test_list if item is not None]
print(result1)
# Shorthand 2: the if-else expression must appear together and precede the for clause
result2 = [1 if item > 0 else 0 for item in result1]
print(result2)
The output results are as follows:
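The same shorthand extends to nested loops: for clauses chain left to right, exactly as the equivalent nested for statements would. A small sketch:

%pyspark
# Equivalent to: for x in [1, 2]: for y in [10, 20]: ...
pairs = [(x, y) for x in [1, 2] for y in [10, 20]]
print(pairs)  # [(1, 10), (1, 20), (2, 10), (2, 20)]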
(4) Add new columns based on specified conditions
%pyspark
from pyspark.sql import functions as F
# Construct sample raw data
df = spark.createDataFrame([
    (1, 175, 72, 28, 'm', 10000),
    (2, 171, 70, 45, 'm', 8000),
    (3, 172, None, None, 'f', 7000),
    (4, 180, 78, 33, 'm', 4000),
    (5, None, 48, 54, 'f', 6000),
    (6, 160, 45, 30, 'f', 5000),
    (7, 169, 65, None, 'm', 7500)],
    ['id', 'height', 'weight', 'age', 'gender', 'income'])
# 1. Add a column 'income2' to df, income2 = income + a fixed amount
#    (the original increment was lost; 1000 is a placeholder)
test1 = df.withColumn("income2", df.income + 1000)
# print(test1.show())
# 2. Add a column 'label' to test1: label = 1 when gender == 'm', otherwise label = 0
test2 = test1.withColumn("label", F.when(test1.gender == 'm', 1).otherwise(0))
test2.show()
The output results are as follows:
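F.when calls can also be chained to handle more than two branches. A minimal sketch, assuming the test1 DataFrame from above; the 'income_level' column and its thresholds are made up for illustration:

%pyspark
# Conditions are checked top to bottom; the first match wins
test3 = test1.withColumn(
    "income_level",
    F.when(test1.income >= 8000, "high")
     .when(test1.income >= 5000, "medium")
     .otherwise("low"))
test3.show()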