Mapping Single Rows to Multiple Pairs
Goal:
Take data such as the following:
Input Data
00001 sku010:sku933:sku022
00002 sku912:sku331
00003 sku888:sku022:sku010:sku594
00004 sku411
Convert it into the following form,
where each key is listed once for every value it carries:
(00001,sku010)
(00001,sku933)
(00001,sku022)
...
(00002,sku912)
(00002,sku331)
(00003,sku888)
This is what is meant by Mapping Single Rows to Multiple Pairs.
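As a rough illustration of the intended result, independent of Spark, the transformation can be sketched in plain Python. The function name rows_to_pairs is hypothetical, and the sketch assumes the key and the SKU list are separated by whitespace, as in the sample input above.

```python
def rows_to_pairs(lines):
    """Expand each 'key sku:sku:...' row into one (key, sku) pair per SKU."""
    pairs = []
    for line in lines:
        # Split once on whitespace: the key first, then the colon-separated SKU list.
        key, skus = line.split(None, 1)
        for sku in skus.split(":"):
            pairs.append((key, sku))
    return pairs


# Sample rows copied from the input data shown above.
sample = [
    "00001 sku010:sku933:sku022",
    "00002 sku912:sku331",
    "00003 sku888:sku022:sku010:sku594",
    "00004 sku411",
]

for pair in rows_to_pairs(sample):
    print(pair)
```

A row with a single SKU, such as 00004, simply yields a single pair.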
The steps are as follows:
$ vim act001.txt
$ cat act001.txt
00001	ku010:sku933:sku022
00002	sku912:sku331
00003	sku888:sku022:sku010:sku594
00004	sku411
$ hdfs dfs -put act001.txt
$ hdfs dfs -cat act001.txt
00001	ku010:sku933:sku022
00002	sku912:sku331
00003	sku888:sku022:sku010:sku594
00004	sku411
Then, in the PySpark shell, with mydata assumed to have been loaded from act001.txt beforehand (for example with sc.textFile("act001.txt"); that step is not shown here):
In [6]: mydata01=mydata.map(lambda line: line.split("\t"))
In [7]: type(mydata01)
Out[7]: pyspark.rdd.PipelinedRDD
In [8]: mydata02=mydata01.map(lambda fields: (fields[0],fields[1]))
In [9]: type(mydata02)
Out[9]: pyspark.rdd.PipelinedRDD
In [10]:
In [11]: mydata03 = mydata02.flatMapValues(lambda skus: skus.split(":"))
In [12]: type(mydata03)
Out[12]: pyspark.rdd.PipelinedRDD
In [13]: mydata03.take(1)
Out[13]: [(u'00001', u'ku010')]
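To make the role of each step easier to see, the same pipeline can be mimicked on an ordinary Python list. This is only an emulation under assumptions: flat_map_values is a hypothetical helper written here to imitate what flatMapValues does to (key, value) pairs, not Spark's own implementation, and the rows are assumed to be tab-separated, as the split("\t") call in the session implies.

```python
def flat_map_values(pairs, func):
    """Imitate flatMapValues: apply func to each value and emit one
    (key, item) pair per item returned, keeping the original key."""
    return [(key, item) for key, value in pairs for item in func(value)]


# Two tab-separated rows, with the values as they appear in the session.
lines = [
    "00001\tku010:sku933:sku022",
    "00002\tsku912:sku331",
]

# Step 1: split each line into [key, sku_list].
step1 = [line.split("\t") for line in lines]

# Step 2: turn each two-element list into a (key, sku_list) tuple.
step2 = [(fields[0], fields[1]) for fields in step1]

# Step 3: expand every colon-separated SKU list into separate (key, sku) pairs.
step3 = flat_map_values(step2, lambda skus: skus.split(":"))

# Roughly what take(1) returns in the session: the first pair only.
print(step3[:1])
```

The key point is step 3: a plain map over the values would leave one list per key, whereas the flattening step repeats the key once for every SKU in its list.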