[Spark][Python]Mapping Single Rows to Multiple Pairs


Mapping Single Rows to Multiple Pairs
Goal:

Take data like the following,

Input Data

00001 sku010:sku933:sku022
00002 sku912:sku331
00003 sku888:sku022:sku010:sku594
00004 sku411


and transform it into this: one key, with each of its associated values listed as a separate pair:

(00001,sku010)
(00001,sku933)
(00001,sku022)

...
(00002,sku912)
(00002,sku331)
(00003,sku888)

This is what is meant by Mapping Single Rows to Multiple Pairs.
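The row-to-pairs expansion can be sketched in plain Python (no Spark needed); `expand_row` is a hypothetical helper name introduced here for illustration, not part of the Spark API:

```python
# Expand one tab-separated row "key\tv1:v2:..." into a list of
# (key, value) pairs, mirroring what the Spark job below does per row.
def expand_row(line):
    key, skus = line.split("\t")
    return [(key, sku) for sku in skus.split(":")]

rows = [
    "00001\tsku010:sku933:sku022",
    "00002\tsku912:sku331",
]
# Flatten the per-row pair lists into one list of pairs.
pairs = [pair for row in rows for pair in expand_row(row)]
# pairs == [("00001", "sku010"), ("00001", "sku933"), ("00001", "sku022"),
#           ("00002", "sku912"), ("00002", "sku331")]
```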

The steps are as follows:

[[email protected] ~]$ vim act001.txt
[[email protected] ~]$
[[email protected] ~]$ cat act001.txt
00001	ku010:sku933:sku022
00002	sku912:sku331
00003	sku888:sku022:sku010:sku594
00004	sku411
[[email protected] ~]$ hdfs dfs -put act001.txt
[[email protected] ~]$
[[email protected] ~]$ hdfs dfs -cat act001.txt
00001	ku010:sku933:sku022
00002	sku912:sku331
00003	sku888:sku022:sku010:sku594
00004	sku411
[[email protected] ~]$

In [5]: mydata = sc.textFile("act001.txt")

In [6]: mydata01 = mydata.map(lambda line: line.split("\t"))

In [7]: type(mydata01)
Out[7]: pyspark.rdd.PipelinedRDD

In [8]: mydata02=mydata01.map(lambda fields: (fields[0],fields[1]))

In [9]: type(mydata02)
Out[9]: pyspark.rdd.PipelinedRDD

In [10]:

In [11]: mydata03 = mydata02.flatMapValues(lambda skus: skus.split(":"))

In [12]: type(mydata03)
Out[12]: pyspark.rdd.PipelinedRDD

In [13]: mydata03.take(1)
Out[13]: [(u'00001', u'ku010')]
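To sanity-check the pipeline above without a Spark cluster, the three transformations can be emulated in plain Python; `flat_map_values` here is a hypothetical stand-in for the RDD method of the same name:

```python
def flat_map_values(pairs, f):
    # Emulate RDD.flatMapValues: apply f to each value and pair
    # every resulting element with the original key.
    return [(k, v) for (k, value) in pairs for v in f(value)]

lines = ["00001\tku010:sku933:sku022", "00002\tsku912:sku331"]
mydata01 = [line.split("\t") for line in lines]      # map: split key from values
mydata02 = [(f[0], f[1]) for f in mydata01]          # map: build (key, skus) pairs
mydata03 = flat_map_values(mydata02, lambda skus: skus.split(":"))
# mydata03[0] == ("00001", "ku010"), matching Out[13] above
```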
