[Spark][Python] Example of taking a limited number of records from a DataFrame
Continuing from the previous example:
In [4]: peopleDF.select("age")
Out[4]: DataFrame[age: bigint]
In [5]: myDF=people.select("age")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-5-b5b723b62a49> in <module>()
----> 1 myDF=people.select("age")
NameError: name 'people' is not defined
In [6]: myDF=peopleDF.select("age")
In [7]: myDF.take(3)
17/10/05 05:13:02 INFO storage.MemoryStore: Block broadcast_5 stored as values in memory (estimated size 230.1 KB, free 871.7 KB)
17/10/05 05:13:02 INFO storage.MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 21.4 KB, free 893.1 KB)
17/10/05 05:13:02 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on localhost:55073 (size: 21.4 KB, free: 208.7 MB)
17/10/05 05:13:02 INFO spark.SparkContext: Created broadcast 5 from take at <ipython-input-7-745486715568>:1
17/10/05 05:13:02 INFO storage.MemoryStore: Block broadcast_6 stored as values in memory (estimated size 251.1 KB, free 1144.2 KB)
17/10/05 05:13:02 INFO storage.MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 21.6 KB, free 1165.8 KB)
17/10/05 05:13:02 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on localhost:55073 (size: 21.6 KB, free: 208.7 MB)
17/10/05 05:13:02 INFO spark.SparkContext: Created broadcast 6 from take at <ipython-input-7-745486715568>:1
17/10/05 05:13:03 INFO mapred.FileInputFormat: Total input paths to process : 1
17/10/05 05:13:03 INFO spark.SparkContext: Starting job: take at <ipython-input-7-745486715568>:1
17/10/05 05:13:03 INFO scheduler.DAGScheduler: Got job 2 (take at <ipython-input-7-745486715568>:1) with 1 output partitions
17/10/05 05:13:03 INFO scheduler.DAGScheduler: Final stage: ResultStage 2 (take at <ipython-input-7-745486715568>:1)
17/10/05 05:13:03 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/10/05 05:13:03 INFO scheduler.DAGScheduler: Missing parents: List()
17/10/05 05:13:03 INFO scheduler.DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[14] at take at <ipython-input-7-745486715568>:1), which has no missing parents
17/10/05 05:13:03 INFO storage.MemoryStore: Block broadcast_7 stored as values in memory (estimated size 4.3 KB, free 1170.2 KB)
17/10/05 05:13:03 INFO storage.MemoryStore: Block broadcast_7_piece0 stored as bytes in memory (estimated size 2.5 KB, free 1172.6 KB)
17/10/05 05:13:03 INFO storage.BlockManagerInfo: Added broadcast_7_piece0 in memory on localhost:55073 (size: 2.5 KB, free: 208.7 MB)
17/10/05 05:13:03 INFO spark.SparkContext: Created broadcast 7 from broadcast at DAGScheduler.scala:1006
17/10/05 05:13:03 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 2 (MapPartitionsRDD[14] at take at <ipython-input-7-745486715568>:1)
17/10/05 05:13:03 INFO scheduler.TaskSchedulerImpl: Adding task set 2.0 with 1 tasks
17/10/05 05:13:03 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, localhost, partition 0,PROCESS_LOCAL, 2149 bytes)
17/10/05 05:13:03 INFO executor.Executor: Running task 0.0 in stage 2.0 (TID 2)
17/10/05 05:13:03 INFO rdd.HadoopRDD: Input split: hdfs://localhost:8020/user/training/people.json:0+179
17/10/05 05:13:03 INFO codegen.GenerateUnsafeProjection: Code generated in 113.719806 ms
17/10/05 05:13:03 INFO executor.Executor: Finished task 0.0 in stage 2.0 (TID 2). 2235 bytes result sent to driver
17/10/05 05:13:03 INFO scheduler.DAGScheduler: ResultStage 2 (take at <ipython-input-7-745486715568>:1) finished in 0.493 s
17/10/05 05:13:03 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 487 ms on localhost (1/1)
17/10/05 05:13:03 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
17/10/05 05:13:03 INFO scheduler.DAGScheduler: Job 2 finished: take at <ipython-input-7-745486715568>:1, took 0.737231 s
Out[7]: [Row(age=None), Row(age=30), Row(age=19)]
In [8]:
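The session shows that select alone only describes a new DataFrame, while take(3) actually runs a job and returns the first three rows to the driver as a Python list. The first entry is None, most likely because the first record in people.json has no age field. As a rough illustration of that behaviour, here is a plain-Python analogy that needs no Spark installation. It is only a sketch: the sample records and the helper names `select_column` and `take` are hypothetical, and it does not reproduce how Spark executes the query.

```python
import json
from itertools import islice

# Hypothetical JSON-lines input, one record per line.
# The first record deliberately has no "age" field.
SAMPLE_LINES = [
    '{"name": "A"}',
    '{"name": "B", "age": 30}',
    '{"name": "C", "age": 19}',
    '{"name": "D", "age": 46}',
]


def select_column(lines, column):
    """Lazily yield the value of `column` for each record (None if absent)."""
    for line in lines:
        yield json.loads(line).get(column)


def take(values, n):
    """Return at most the first n values as a list, reading no further."""
    return list(islice(values, n))


# Nothing is parsed until take() pulls values from the generator.
first_three = take(select_column(SAMPLE_LINES, "age"), 3)
print(first_three)  # [None, 30, 19]
```

In the same spirit, the list returned by myDF.take(3) in the session is an ordinary Python list, so a missing age has to be checked against None before it is used in arithmetic.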