PySpark Learning Notes (6) -- Data Processing

Source: Internet
Author: User
Tags: pyspark

Before formal modeling, you need to understand the data you will be working with, so this article introduces some common methods for observing and processing data.

1. Data observation

(1) Compute the missing rate of each column in the data table

%pyspark

# Construct sample data
df = spark.createDataFrame([
    (1, 175, 72, 28, 'm', 10000),
    (2, 171, 70, 45, 'm', None),
    (3, 172, None, None, None, None),
    (4, 180, 78, 33, 'm', None),
    (5, None, 48, 54, 'f', None),
    (6, 160, 45, 30, 'f', 5000),
    (7, 169, 65, None, 'm', 5000)],
    ['id', 'height', 'weight', 'age', 'gender', 'income'])

res_df = df.rdd.map(list).collect()

# Compute the missing rate of each column
for i in range(6):
    # Get the data of column i
    columns = [item[i] for item in res_df]
    # Count the non-null values in column i
    count = sum([1 for item in columns if item is not None])
    # Compute the missing rate of column i
    missing_rate = 1 - count / len(res_df)
    print("The missing rate of column {} is: {:.4f}%".format(i + 1, missing_rate * 100))

The output results are as follows:
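Collecting every row to the driver works for this small example, but the same statistic can be computed with DataFrame aggregations so the counting stays distributed. A minimal sketch of that alternative, reusing the df defined above:

%pyspark

from pyspark.sql import functions as F

# Count the nulls in each column in a single aggregation pass;
# only the final per-column rates are brought back to the driver.
total = df.count()
missing_rates = df.select([
    (F.sum(F.when(F.col(c).isNull(), 1).otherwise(0)) / total).alias(c)
    for c in df.columns
])
missing_rates.show()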

(2) Compute summary statistics for a specified column

%pyspark

from pyspark.sql import functions as F

# Construct sample data
df = spark.createDataFrame([
    (1, 175, 72, 28, 'm', 10000),
    (2, 171, 70, 45, 'm', 8000),
    (3, 172, None, 27, 'f', 7000),
    (4, 180, 78, 30, 'm', 4000),
    (5, None, 48, 54, 'f', 6000),
    (6, 160, 45, 30, 'f', 5000),
    (7, 169, 65, 36, 'm', 7500)],
    ['id', 'height', 'weight', 'age', 'gender', 'income'])

# Group by gender, then apply the aggregate functions (max, min, mean, stddev)
# to summarize the age column
df_summary = sorted(df.groupBy(df.gender).agg(
    F.max(df.age), F.min(df.age), F.mean(df.age), F.stddev(df.age)).collect())

print(df_summary)

The output results are as follows:
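The result columns are auto-named after the expressions (e.g. max(age)), which is awkward to reference later; .alias() gives them readable names. A small variation on the aggregation above:

%pyspark

# Same grouping, but with explicit column names for each aggregate
df.groupBy('gender').agg(
    F.max('age').alias('age_max'),
    F.min('age').alias('age_min'),
    F.mean('age').alias('age_mean'),
    F.stddev('age').alias('age_stddev')
).show()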


(3) Extract the values of a Vector column in a DataFrame

%pyspark

from pyspark.ml.linalg import Vectors

df = sc.parallelize([
    ("Assert", Vectors.dense([1, 2, 3])),
    ("Require", Vectors.sparse(3, {1: 2})),
    ("Announce", Vectors.sparse(3, {0: 1, 2: 4}))
    ]).toDF(["word", "vector"])

# Extract the values of the Vector column in the DataFrame
def extract(row):
    return (row.word,) + tuple(row.vector.toArray().tolist())

res_df = df.rdd.map(extract).toDF(["word", "v_1", "v_2", "v_3"])
res_df.show()

# Get the data of the specified columns
res_df.select("word", "v_1").show()

The output results are as follows:
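On Spark 3.0 and later, the same flattening can be done without dropping to the RDD API: pyspark.ml.functions.vector_to_array converts the Vector column into an array column that can be indexed directly. A minimal sketch, reusing the df above:

%pyspark

from pyspark.ml.functions import vector_to_array

# Convert the Vector column to an array column, then split it into scalar columns
arr_df = df.withColumn("arr", vector_to_array("vector"))
arr_df.select("word", *[arr_df.arr[i].alias("v_{}".format(i + 1)) for i in range(3)]).show()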



2. Data processing

This section records some small data-processing techniques.

(1) Generate an index for a list

%pyspark

# Use enumerate to generate an index for col_list
col_list = ['username', 'id', 'gender', 'age']
mapping_list = list(enumerate(sorted(col_list)))
print(mapping_list)

The output results are as follows:

(2) Convert a list to a dict

%pyspark

# Swap the keys and values of mapping_list and convert it to a dict
revs_maplist = {value: idx for idx, value in mapping_list}
print(revs_maplist)

The output results are as follows:


(3) For-loop shorthand (list comprehensions)

%pyspark

test_list = [1, 2, -3, 10, None, -5, 0, 10.5]

# For-loop shorthand 1 (here the if comes after the for)
result1 = [2 * item for item in test_list if item is not None]
print(result1)

# For-loop shorthand 2 (here if and else must appear together, before the for)
result2 = [1 if item > 0 else 0 for item in result1]
print(result2)

The output results are as follows:


(4) Add new columns based on specified conditions

%pyspark

from pyspark.sql import functions as F

# Construct sample data
df = spark.createDataFrame([
    (1, 175, 72, 28, 'm', 10000),
    (2, 171, 70, 45, 'm', 8000),
    (3, 172, None, None, 'f', 7000),
    (4, 180, 78, 33, 'm', 4000),
    (5, None, 48, 54, 'f', 6000),
    (6, 160, 45, 30, 'f', 5000),
    (7, 169, 65, None, 'm', 7500)],
    ['id', 'height', 'weight', 'age', 'gender', 'income'])

# 1. Add a column 'income2' to df: income2 = income plus a constant
#    (the constant was lost in the original text; 1000 is assumed here)
test1 = df.withColumn("income2", df.income + 1000)
# print(test1.show())

# 2. Add a column 'label' to test1: when gender == 'm', label = 1, otherwise label = 0
test2 = test1.withColumn("label", F.when(test1.gender == 'm', 1).otherwise(0))
test2.show()

The output results are as follows:
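when() calls can also be chained to handle more than two branches. A minimal sketch adding a hypothetical age_group column to the same df (note that rows with a null age fall through to the otherwise() branch):

%pyspark

# Chain when() clauses before the final otherwise()
test3 = df.withColumn(
    "age_group",
    F.when(df.age < 30, "young")
     .when(df.age < 50, "middle")
     .otherwise("senior"))
test3.show()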







