(Continuing from the [Spark][Python] sortByKey example.)
[Spark][Python] groupByKey Example
In [29]: mydata003.collect()
Out[29]:
[[u'00001', u'sku933'],
 [u'00001', u'sku022'],
 [u'00001', u'sku912'],
 [u'00001', u'sku331'],
 [u'00002', u'sku010'],
 [u'00003', u'sku888'],
 [u'00004', u'sku411']]
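(For readers starting here: mydata003 was built in the earlier sortByKey post. An RDD of the same shape could be recreated directly; the sc.parallelize call below is my own sketch, not the original post's code.)

In [ ]: mydata003 = sc.parallelize([
   ...:     [u'00001', u'sku933'],
   ...:     [u'00001', u'sku022'],
   ...:     [u'00001', u'sku912'],
   ...:     [u'00001', u'sku331'],
   ...:     [u'00002', u'sku010'],
   ...:     [u'00003', u'sku888'],
   ...:     [u'00004', u'sku411']])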
In [31]: mydata005 = mydata003.groupByKey()

In [32]: mydata005.count()
Out[32]: 4
In [33]: mydata005.collect()
Out[33]:
[(u'00004', <pyspark.resultiterable.ResultIterable at 0x7fcebe436b10>),
 (u'00001', <pyspark.resultiterable.ResultIterable at 0x7fcebe436850>),
 (u'00003', <pyspark.resultiterable.ResultIterable at 0x7fcebe436050>),
 (u'00002', <pyspark.resultiterable.ResultIterable at 0x7fcebe4361d0>)]
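Note that the values come back as ResultIterable objects rather than plain lists. One way to inspect the grouped contents directly (my addition, not from the original post) is to convert each value with mapValues before collecting:

In [ ]: mydata005.mapValues(list).collect()

This returns ordinary (key, [sku, ...]) pairs, at the cost of materializing each group as a list on the driver.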
So, for input like this:
(00004,sku411)
(00003,sku888)
(00003,sku022)
(00003,sku010)
(00003,sku594)
(00002,sku912)
In theory, groupByKey turns it into this form:
(00002,[sku912,sku331])
(00001,[sku022,sku010,sku933])
(00003,[sku888,sku022,sku010,sku594])
(00004,[sku411])
How do we print them out in a format like the one below? I think we need a function that takes the value of each RDD record as a list and then traverses it. (The implementation will follow in the next post.) The desired output:
00002
sku912
sku331
00001
sku022
sku010
sku933
00003
sku888
sku022
sku010
sku594
00004
sku411
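Until the follow-up post, here is one possible sketch of such a function (my own, under the assumptions above; print_group is a hypothetical name, not from the original post):

def print_group(pair):
    # Each pair is (key, ResultIterable); print the key,
    # then each sku in the group on its own line.
    key, skus = pair
    print(key)
    for sku in skus:
        print(sku)

# collect() brings the grouped data back to the driver before printing;
# printing inside foreach() on a cluster would go to executor logs instead.
for pair in mydata005.collect():
    print_group(pair)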