[Spark] [Python] Example of taking a limited number of records from a DataFrame:
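# HiveContext is already imported in the pyspark shell; import it explicitly in a script
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)                    # sc is the shell's SparkContext
peopleDF = sqlContext.read.json("people.json")  # one JSON object per line
peopleDF.limit(3).show()                        # keep only three rows and print them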
===
$ hdfs dfs -cat people.json
{"name": "Alice", "pcode": "94304"}
{"name": "Brayden", "age": 30, "pcode": "94304"}
{"name": "Carla", "age": 19, "pcoe": "10036"}
{"name": "Diana", "age": 46}
{"name": "Etienne", "pcode": "94104"}
$
In [1]: sqlContext = HiveContext(sc)
In [2]: peopleDF = sqlContext.read.json("people.json")
17/10/05 05:03:11 INFO hive.HiveContext: Initializing execution hive, version 1.1.0
17/10/05 05:03:11 INFO client.ClientWrapper: Inspected Hadoop version: 2.6.0-cdh5.7.0
17/10/05 05:03:11 INFO client.ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0-cdh5.7.0
17/10/05 05:03:14 INFO hive.metastore: Trying to connect to metastore with URI thrift://localhost.localdomain:9083
17/10/05 05:03:14 INFO hive.metastore: Opened a connection to metastore, current connections: 1
17/10/05 05:03:15 INFO hive.metastore: Connected to metastore.
17/10/05 05:03:16 INFO session.SessionState: Created HDFS directory: file:/tmp/spark-99a33db4-b69a-46a9-8032-f87d63299040/scratch/training
17/10/05 05:03:16 INFO session.SessionState: Created local directory: /tmp/4e1c5259-7ae8-482c-ae77-94d3a0c51f91_resources
17/10/05 05:03:16 INFO session.SessionState: Created HDFS directory: file:/tmp/spark-99a33db4-b69a-46a9-8032-f87d63299040/scratch/training/4e1c5259-7ae8-482c-ae77-94d3a0c51f91
17/10/05 05:03:16 INFO session.SessionState: Created local directory: /tmp/training/4e1c5259-7ae8-482c-ae77-94d3a0c51f91
17/10/05 05:03:16 INFO session.SessionState: Created HDFS directory: file:/tmp/spark-99a33db4-b69a-46a9-8032-f87d63299040/scratch/training/4e1c5259-7ae8-482c-ae77-94d3a0c51f91/_tmp_space.db
17/10/05 05:03:16 INFO session.SessionState: No Tez session required at this point. hive.execution.engine=mr.
17/10/05 05:03:16 INFO json.JSONRelation: Listing hdfs://localhost:8020/user/training/people.json on driver
17/10/05 05:03:19 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 251.1 KB, free 251.1 KB)
17/10/05 05:03:20 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 21.6 KB, free 272.7 KB)
17/10/05 05:03:20 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:55073 (size: 21.6 KB, free: 208.8 MB)
17/10/05 05:03:20 INFO spark.SparkContext: Created broadcast 0 from json at NativeMethodAccessorImpl.java:-2
17/10/05 05:03:20 INFO mapred.FileInputFormat: Total input paths to process : 1
17/10/05 05:03:21 INFO spark.SparkContext: Starting job: json at NativeMethodAccessorImpl.java:-2
17/10/05 05:03:21 INFO scheduler.DAGScheduler: Got job 0 (json at NativeMethodAccessorImpl.java:-2) with 1 output partitions
17/10/05 05:03:21 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (json at NativeMethodAccessorImpl.java:-2)
17/10/05 05:03:21 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/10/05 05:03:21 INFO scheduler.DAGScheduler: Missing parents: List()
17/10/05 05:03:21 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[3] at json at NativeMethodAccessorImpl.java:-2), which has no missing parents
17/10/05 05:03:21 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.3 KB, free 277.1 KB)
17/10/05 05:03:21 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.4 KB, free 279.5 KB)
17/10/05 05:03:21 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:55073 (size: 2.4 KB, free: 208.8 MB)
17/10/05 05:03:21 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
17/10/05 05:03:21 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[3] at json at NativeMethodAccessorImpl.java:-2)
17/10/05 05:03:21 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
17/10/05 05:03:21 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2149 bytes)
17/10/05 05:03:21 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
17/10/05 05:03:21 INFO rdd.HadoopRDD: Input split: hdfs://localhost:8020/user/training/people.json:0+179
17/10/05 05:03:21 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/10/05 05:03:21 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
17/10/05 05:03:21 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
17/10/05 05:03:21 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
17/10/05 05:03:21 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
17/10/05 05:03:22 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 2354 bytes result sent to driver
17/10/05 05:03:22 INFO scheduler.DAGScheduler: ResultStage 0 (json at NativeMethodAccessorImpl.java:-2) finished in 0.931 s
17/10/05 05:03:22 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 850 ms on localhost (1/1)
17/10/05 05:03:22 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
17/10/05 05:03:22 INFO scheduler.DAGScheduler: Job 0 finished: json at NativeMethodAccessorImpl.java:-2, took 1.388410 s
17/10/05 05:03:23 INFO hive.HiveContext: default warehouse location is /user/hive/warehouse
17/10/05 05:03:23 INFO hive.HiveContext: Initializing metastore client version 1.1.0 using Spark classes.
17/10/05 05:03:23 INFO client.ClientWrapper: Inspected Hadoop version: 2.6.0-cdh5.7.0
17/10/05 05:03:23 INFO client.ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0-cdh5.7.0
17/10/05 05:03:23 INFO spark.ContextCleaner: Cleaned accumulator 2
17/10/05 05:03:23 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on localhost:55073 in memory (size: 2.4 KB, free: 208.8 MB)
17/10/05 05:03:25 INFO hive.metastore: Trying to connect to metastore with URI thrift://localhost.localdomain:9083
17/10/05 05:03:25 INFO hive.metastore: Opened a connection to metastore, current connections: 1
17/10/05 05:03:25 INFO hive.metastore: Connected to metastore.
17/10/05 05:03:25 INFO session.SessionState: Created local directory: /tmp/684b38e5-72f0-4712-81d4-4c439e093f5c_resources
17/10/05 05:03:25 INFO session.SessionState: Created HDFS directory: /tmp/hive/training/684b38e5-72f0-4712-81d4-4c439e093f5c
17/10/05 05:03:25 INFO session.SessionState: Created local directory: /tmp/training/684b38e5-72f0-4712-81d4-4c439e093f5c
17/10/05 05:03:25 INFO session.SessionState: Created HDFS directory: /tmp/hive/training/684b38e5-72f0-4712-81d4-4c439e093f5c/_tmp_space.db
17/10/05 05:03:25 INFO session.SessionState: No Tez session required at this point. hive.execution.engine=mr.
In [3]: peopleDF.limit(3).show()
17/10/05 05:04:09 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 65.5 KB, free 338.2 KB)
17/10/05 05:04:10 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 21.4 KB, free 359.6 KB)
17/10/05 05:04:10 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:55073 (size: 21.4 KB, free: 208.8 MB)
17/10/05 05:04:10 INFO spark.SparkContext: Created broadcast 2 from showString at NativeMethodAccessorImpl.java:-2
17/10/05 05:04:10 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 251.1 KB, free 610.7 KB)
17/10/05 05:04:11 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 21.6 KB, free 632.4 KB)
17/10/05 05:04:11 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:55073 (size: 21.6 KB, free: 208.7 MB)
17/10/05 05:04:11 INFO spark.SparkContext: Created broadcast 3 from showString at NativeMethodAccessorImpl.java:-2
17/10/05 05:04:12 INFO mapred.FileInputFormat: Total input paths to process : 1
17/10/05 05:04:12 INFO spark.SparkContext: Starting job: showString at NativeMethodAccessorImpl.java:-2
17/10/05 05:04:12 INFO scheduler.DAGScheduler: Got job 1 (showString at NativeMethodAccessorImpl.java:-2) with 1 output partitions
17/10/05 05:04:12 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (showString at NativeMethodAccessorImpl.java:-2)
17/10/05 05:04:12 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/10/05 05:04:12 INFO scheduler.DAGScheduler: Missing parents: List()
17/10/05 05:04:12 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[9] at showString at NativeMethodAccessorImpl.java:-2), which has no missing parents
17/10/05 05:04:12 INFO storage.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 5.9 KB, free 638.2 KB)
17/10/05 05:04:12 INFO storage.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 3.3 KB, free 641.5 KB)
17/10/05 05:04:12 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on localhost:55073 (size: 3.3 KB, free: 208.7 MB)
17/10/05 05:04:12 INFO spark.SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:1006
17/10/05 05:04:12 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[9] at showString at NativeMethodAccessorImpl.java:-2)
17/10/05 05:04:12 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
17/10/05 05:04:12 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, partition 0,PROCESS_LOCAL, 2149 bytes)
17/10/05 05:04:12 INFO executor.Executor: Running task 0.0 in stage 1.0 (TID 1)
17/10/05 05:04:12 INFO rdd.HadoopRDD: Input split: hdfs://localhost:8020/user/training/people.json:0+179
17/10/05 05:04:14 INFO codegen.GenerateUnsafeProjection: Code generated in 1563.240244 ms
17/10/05 05:04:14 INFO codegen.GenerateSafeProjection: Code generated in 182.529448 ms
17/10/05 05:04:15 INFO executor.Executor: Finished task 0.0 in stage 1.0 (TID 1). 2328 bytes result sent to driver
17/10/05 05:04:15 INFO scheduler.DAGScheduler: ResultStage 1 (showString at NativeMethodAccessorImpl.java:-2) finished in 2.549 s
17/10/05 05:04:15 INFO scheduler.DAGScheduler: Job 1 finished: showString at NativeMethodAccessorImpl.java:-2, took 2.852393 s
17/10/05 05:04:15 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 2547 ms on localhost (1/1)
17/10/05 05:04:15 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
+----+-------+-----+-----+
| age|   name|pcode| pcoe|
+----+-------+-----+-----+
|null|  Alice|94304| null|
|  30|Brayden|94304| null|
|  19|  Carla| null|10036|
+----+-------+-----+-----+
In [4]:
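
The table has four columns, including the misspelled pcoe, and several null cells. That is because the JSON reader builds one schema from the union of all keys it sees and fills in null wherever a record lacks a key. The sketch below imitates that behaviour in plain Python so it can be run without a cluster. It is only an approximation and not Spark's implementation: infer_columns and limit_rows are hypothetical helper names, the ages 30 and 19 are taken from the printed table, and type inference is left out.

```python
import json

# The five records of people.json, one JSON object per line.
# Keys are lowercase, matching the column names Spark printed.
RAW_LINES = [
    '{"name": "Alice", "pcode": "94304"}',
    '{"name": "Brayden", "age": 30, "pcode": "94304"}',
    '{"name": "Carla", "age": 19, "pcoe": "10036"}',
    '{"name": "Diana", "age": 46}',
    '{"name": "Etienne", "pcode": "94104"}',
]


def infer_columns(records):
    """Hypothetical helper: union of all keys, sorted by name."""
    return sorted({key for record in records for key in record})


def limit_rows(records, columns, n):
    """Hypothetical helper: first n records as tuples, None for missing keys."""
    return [tuple(record.get(col) for col in columns) for record in records[:n]]


records = [json.loads(line) for line in RAW_LINES]
columns = infer_columns(records)
rows = limit_rows(records, columns, 3)

print(columns)
for row in rows:
    print(row)
```

The misspelled key pcoe in Carla's record becomes its own column, which is why her pcode cell is null in the output above.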