[Spark][Python] Spark example of obtaining a DataFrame from an Avro file
Download the sample file from the following address:
https://github.com/databricks/spark-avro/raw/master/src/test/resources/episodes.avro
Import it into HDFS (with no destination given, the file lands in the current user's HDFS home directory):
hdfs dfs -put episodes.avro
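The download and upload steps above can also be scripted. The following is only a sketch under stated assumptions: the helper names `download_episodes`, `hdfs_put_command` and `upload_to_hdfs` are hypothetical, the `hdfs` command is assumed to be on the PATH, and the script is assumed to run under Python 3 outside the Spark shell.

```python
import subprocess
from urllib.request import urlretrieve

# URL taken from the post; whether it still resolves is not verified here.
EPISODES_URL = ("https://github.com/databricks/spark-avro/raw/master/"
                "src/test/resources/episodes.avro")


def download_episodes(dest="episodes.avro", url=EPISODES_URL):
    """Fetch the sample Avro file into the local working directory."""
    path, _headers = urlretrieve(url, dest)
    return path


def hdfs_put_command(local_path, hdfs_dir=None):
    """Build the argument list for `hdfs dfs -put`.

    With hdfs_dir=None the command matches the one in the post,
    which copies into the user's HDFS home directory.
    """
    cmd = ["hdfs", "dfs", "-put", local_path]
    if hdfs_dir is not None:
        cmd.append(hdfs_dir)
    return cmd


def upload_to_hdfs(local_path, hdfs_dir=None):
    """Run the put command; raises CalledProcessError on failure."""
    subprocess.check_call(hdfs_put_command(local_path, hdfs_dir))
```

Typical use would be `upload_to_hdfs(download_episodes())`. Keeping the command construction in its own function makes it possible to inspect the exact command before anything touches the cluster.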
Read it in:
mydata001 = sqlContext.read.format("com.databricks.spark.avro").load("episodes.avro")
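Note that the `com.databricks.spark.avro` data source is not bundled with Spark 1.x, so the shell has to be started with the spark-avro package on its classpath (for example through the `--packages` option; the exact coordinate depends on the Spark and Scala versions in use and is not given in this post). The read call itself can be wrapped as below. This is a minimal sketch: the helper name `load_avro` and the constant `AVRO_FORMAT` are hypothetical, and `sql_context` is assumed to be the `sqlContext` object predefined in the pyspark shell.

```python
AVRO_FORMAT = "com.databricks.spark.avro"  # format name used in this post


def load_avro(sql_context, path, fmt=AVRO_FORMAT):
    """Return a DataFrame for the Avro file at `path`.

    `sql_context` only needs to expose the DataFrameReader API
    through its `.read` attribute.
    """
    return sql_context.read.format(fmt).load(path)
```

In the shell this would be called as `mydata001 = load_avro(sqlContext, "episodes.avro")`. On Spark 2.4 and later, where the Avro source moved into the Apache Spark project, passing `fmt="avro"` is the likely substitute, though that has not been tried against the setup shown here.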
Interactive run results:
In [7]: mydata001 = sqlContext.read.format("com.databricks.spark.avro").load("episodes.avro")
17/10/03 07:00:47 INFO avro.AvroRelation: Listing hdfs://localhost:8020/user/training/episodes.avro on driver
In [8]: type(mydata001)
Out[8]: pyspark.sql.dataframe.DataFrame
In [9]: mydata001.count()
17/10/03 07:01:05 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 65.5 KB, free 65.5 KB)
17/10/03 07:01:05 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 21.4 KB, free 86.9 KB)
17/10/03 07:01:05 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:40075 (size: 21.4 KB, free: 208.8 MB)
17/10/03 07:01:05 INFO spark.SparkContext: Created broadcast 3 from count at NativeMethodAccessorImpl.java:-2
17/10/03 07:01:05 INFO storage.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 230.4 KB, free 317.3 KB)
17/10/03 07:01:06 INFO storage.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 21.5 KB, free 338.8 KB)
17/10/03 07:01:06 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on localhost:40075 (size: 21.5 KB, free: 208.8 MB)
17/10/03 07:01:06 INFO spark.SparkContext: Created broadcast 4 from hadoopFile at AvroRelation.scala:121
17/10/03 07:01:06 INFO mapred.FileInputFormat: Total input paths to process : 1
17/10/03 07:01:07 INFO spark.SparkContext: Starting job: count at NativeMethodAccessorImpl.java:-2
17/10/03 07:01:07 INFO scheduler.DAGScheduler: Registering RDD (count at NativeMethodAccessorImpl.java:-2)
17/10/03 07:01:07 INFO scheduler.DAGScheduler: Got job 1 (count at NativeMethodAccessorImpl.java:-2) with 1 output partitions
17/10/03 07:01:07 INFO scheduler.DAGScheduler: Final stage: ResultStage 3 (count at NativeMethodAccessorImpl.java:-2)
17/10/03 07:01:07 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 2)
17/10/03 07:01:07 INFO scheduler.DAGScheduler: Missing parents: List(ShuffleMapStage 2)
17/10/03 07:01:07 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 2 (MapPartitionsRDD[16] at count at NativeMethodAccessorImpl.java:-2), which has no missing parents
17/10/03 07:01:07 INFO storage.MemoryStore: Block broadcast_5 stored as values in memory (estimated size 11.5 KB, free 350.3 KB)
17/10/03 07:01:07 INFO storage.MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 5.7 KB, free 356.0 KB)
17/10/03 07:01:07 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on localhost:40075 (size: 5.7 KB, free: 208.8 MB)
17/10/03 07:01:07 INFO spark.SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:1006
17/10/03 07:01:07 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 2 (MapPartitionsRDD[16] at count at NativeMethodAccessorImpl.java:-2)
17/10/03 07:01:07 INFO scheduler.TaskSchedulerImpl: Adding task set 2.0 with 1 tasks
17/10/03 07:01:07 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, localhost, partition 0,PROCESS_LOCAL, 2249 bytes)
17/10/03 07:01:07 INFO executor.Executor: Running task 0.0 in stage 2.0 (TID 2)
17/10/03 07:01:07 INFO rdd.HadoopRDD: Input split: hdfs://localhost:8020/user/training/episodes.avro:0+597
17/10/03 07:01:08 INFO executor.Executor: Finished task 0.0 in stage 2.0 (TID 2). 2484 bytes result sent to driver
17/10/03 07:01:08 INFO scheduler.DAGScheduler: ShuffleMapStage 2 (count at NativeMethodAccessorImpl.java:-2) finished in 0.691 s
17/10/03 07:01:08 INFO scheduler.DAGScheduler: looking for newly runnable stages
17/10/03 07:01:08 INFO scheduler.DAGScheduler: running: Set()
17/10/03 07:01:08 INFO scheduler.DAGScheduler: waiting: Set(ResultStage 3)
17/10/03 07:01:08 INFO scheduler.DAGScheduler: failed: Set()
17/10/03 07:01:08 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 693 ms on localhost (1/1)
17/10/03 07:01:08 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
17/10/03 07:01:08 INFO scheduler.DAGScheduler: Submitting ResultStage 3 (MapPartitionsRDD[19] at count at NativeMethodAccessorImpl.java:-2), which has no missing parents
17/10/03 07:01:08 INFO storage.MemoryStore: Block broadcast_6 stored as values in memory (estimated size 12.6 KB, free 368.5 KB)
17/10/03 07:01:08 INFO storage.MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 6.1 KB, free 374.7 KB)
17/10/03 07:01:08 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on localhost:40075 (size: 6.1 KB, free: 208.8 MB)
17/10/03 07:01:08 INFO spark.SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:1006
17/10/03 07:01:08 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 3 (MapPartitionsRDD[19] at count at NativeMethodAccessorImpl.java:-2)
17/10/03 07:01:08 INFO scheduler.TaskSchedulerImpl: Adding task set 3.0 with 1 tasks
17/10/03 07:01:08 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 3.0 (TID 3, localhost, partition 0,NODE_LOCAL, 1999 bytes)
17/10/03 07:01:08 INFO executor.Executor: Running task 0.0 in stage 3.0 (TID 3)
17/10/03 07:01:08 INFO storage.ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
17/10/03 07:01:08 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
17/10/03 07:01:08 INFO executor.Executor: Finished task 0.0 in stage 3.0 (TID 3). 1666 bytes result sent to driver
17/10/03 07:01:08 INFO scheduler.DAGScheduler: ResultStage 3 (count at NativeMethodAccessorImpl.java:-2) finished in 0.344 s
17/10/03 07:01:08 INFO scheduler.DAGScheduler: Job 1 finished: count at NativeMethodAccessorImpl.java:-2, took 1.480495 s
17/10/03 07:01:08 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 3.0 (TID 3) in 345 ms on localhost (1/1)
17/10/03 07:01:08 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
Out[9]: 8
In [10]: mydata001.take(1)
17/10/03 07:01:18 INFO storage.MemoryStore: Block broadcast_7 stored as values in memory (estimated size 230.1 KB, free 604.8 KB)
17/10/03 07:01:18 INFO storage.MemoryStore: Block broadcast_7_piece0 stored as bytes in memory (estimated size 21.4 KB, free 626.2 KB)
17/10/03 07:01:18 INFO storage.BlockManagerInfo: Added broadcast_7_piece0 in memory on localhost:40075 (size: 21.4 KB, free: 208.7 MB)
17/10/03 07:01:18 INFO spark.SparkContext: Created broadcast 7 from take at <ipython-input-10-35862abbc114>:1
17/10/03 07:01:18 INFO storage.MemoryStore: Block broadcast_8 stored as values in memory (estimated size 230.5 KB, free 856.7 KB)
17/10/03 07:01:18 INFO storage.MemoryStore: Block broadcast_8_piece0 stored as bytes in memory (estimated size 21.5 KB, free 878.2 KB)
17/10/03 07:01:18 INFO storage.BlockManagerInfo: Added broadcast_8_piece0 in memory on localhost:40075 (size: 21.5 KB, free: 208.7 MB)
17/10/03 07:01:18 INFO spark.SparkContext: Created broadcast 8 from take at <ipython-input-10-35862abbc114>:1
17/10/03 07:01:18 INFO mapred.FileInputFormat: Total input paths to process : 1
17/10/03 07:01:18 INFO spark.SparkContext: Starting job: take at <ipython-input-10-35862abbc114>:1
17/10/03 07:01:18 INFO scheduler.DAGScheduler: Got job 2 (take at <ipython-input-10-35862abbc114>:1) with 1 output partitions
17/10/03 07:01:18 INFO scheduler.DAGScheduler: Final stage: ResultStage 4 (take at <ipython-input-10-35862abbc114>:1)
17/10/03 07:01:18 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/10/03 07:01:18 INFO scheduler.DAGScheduler: Missing parents: List()
17/10/03 07:01:18 INFO scheduler.DAGScheduler: Submitting ResultStage 4 (MapPartitionsRDD[27] at take at <ipython-input-10-35862abbc114>:1), which has no missing parents
17/10/03 07:01:19 INFO storage.MemoryStore: Block broadcast_9 stored as values in memory (estimated size 5.6 KB, free 883.8 KB)
17/10/03 07:01:19 INFO storage.MemoryStore: Block broadcast_9_piece0 stored as bytes in memory (estimated size 3.0 KB, free 886.9 KB)
17/10/03 07:01:19 INFO storage.BlockManagerInfo: Added broadcast_9_piece0 in memory on localhost:40075 (size: 3.0 KB, free: 208.7 MB)
17/10/03 07:01:19 INFO spark.SparkContext: Created broadcast 9 from broadcast at DAGScheduler.scala:1006
17/10/03 07:01:19 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 4 (MapPartitionsRDD[27] at take at <ipython-input-10-35862abbc114>:1)
17/10/03 07:01:19 INFO scheduler.TaskSchedulerImpl: Adding task set 4.0 with 1 tasks
17/10/03 07:01:19 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 4.0 (TID 4, localhost, partition 0,PROCESS_LOCAL, 2260 bytes)
17/10/03 07:01:19 INFO executor.Executor: Running task 0.0 in stage 4.0 (TID 4)
17/10/03 07:01:19 INFO rdd.HadoopRDD: Input split: hdfs://localhost:8020/user/training/episodes.avro:0+597
17/10/03 07:01:19 INFO codegen.GenerateUnsafeProjection: Code generated in 124.624053 ms
17/10/03 07:01:19 INFO executor.Executor: Finished task 0.0 in stage 4.0 (TID 4). 2237 bytes result sent to driver
17/10/03 07:01:19 INFO scheduler.DAGScheduler: ResultStage 4 (take at <ipython-input-10-35862abbc114>:1) finished in 0.415 s
17/10/03 07:01:19 INFO scheduler.DAGScheduler: Job 2 finished: take at <ipython-input-10-35862abbc114>:1, took 0.565858 s
17/10/03 07:01:19 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 4.0 (TID 4) in 415 ms on localhost (1/1)
17/10/03 07:01:19 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool
Out[10]: [Row(title=u'The Eleventh Hour', air_date=u'3 April', doctor=11)]
In [11]:
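Most of the transcript above is INFO-level scheduler and storage logging rather than results: the only outputs are the DataFrame type, the count of 8, and one row. A minimal sketch for getting the same information with less noise follows. It assumes the `sc` SparkContext predefined in the pyspark shell and relies on `SparkContext.setLogLevel`, which exists from Spark 1.4 onward; the helper names `quiet_logs` and `preview` are hypothetical.

```python
def quiet_logs(sc, level="WARN"):
    """Raise the log threshold so INFO lines stop interleaving with results."""
    sc.setLogLevel(level)
    return level


def preview(df, n=1):
    """Return the row count and the first `n` rows of a DataFrame.

    Each of the two calls triggers its own Spark job, just like
    the separate count() and take(1) calls in the transcript.
    """
    return {"count": df.count(), "rows": df.take(n)}
```

In the shell this would be used as `quiet_logs(sc)` followed by `preview(mydata001)`. Fields of a returned row can then be read by name, for example `row.title` or `row.doctor`.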