[Spark] [Python] Spark example of obtaining a DataFrame from an Avro file


Download the sample file from the following address:
https://github.com/databricks/spark-avro/raw/master/src/test/resources/episodes.avro
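The download step can also be scripted. The following is a minimal sketch, assuming the URL above is still reachable and that a Python 3 interpreter is available outside the Spark shell; `download` is a hypothetical helper name, not part of any library.

```python
import os
import urllib.request

# URL of the sample file, as given in the text.
EPISODES_URL = (
    "https://github.com/databricks/spark-avro/raw/master/"
    "src/test/resources/episodes.avro"
)


def download(url, dest_dir="."):
    """Download `url` into `dest_dir` and return the local file path."""
    # The local file name is the last path component of the URL.
    local_path = os.path.join(dest_dir, os.path.basename(url))
    urllib.request.urlretrieve(url, local_path)
    return local_path


if __name__ == "__main__":
    print(download(EPISODES_URL))
```

Run as a script, this saves `episodes.avro` in the current directory, ready to be uploaded to HDFS.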

Upload the file to HDFS (with no destination given, it lands in the current user's HDFS home directory):
hdfs dfs -put episodes.avro

Read it into a DataFrame:
mydata001 = sqlContext.read.format("com.databricks.spark.avro").load("episodes.avro")
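The `com.databricks.spark.avro` format comes from the external spark-avro package, so that package has to be on the classpath when the shell starts, for example via `pyspark --packages com.databricks:spark-avro_2.10:2.0.1` (these coordinates are an assumption; pick the build that matches your Spark and Scala versions). From Spark 2.4 onward, Avro support ships as Spark's own `spark-avro` module and the short format name `avro` can be used instead. The sketch below wraps the read call so the format name is a parameter; `load_avro` is a hypothetical helper, not a Spark API.

```python
def load_avro(ctx, path, fmt="com.databricks.spark.avro"):
    """Return a DataFrame for the Avro file at `path`.

    ctx  -- any object exposing `.read`: a SQLContext on Spark 1.x,
            or a SparkSession on Spark 2.x and later (assumption)
    fmt  -- data source name; "avro" on Spark 2.4+ with the
            spark-avro module available
    """
    return ctx.read.format(fmt).load(path)


# Usage inside the PySpark shell (not executed here):
#   mydata001 = load_avro(sqlContext, "episodes.avro")
#   mydata001 = load_avro(spark, "episodes.avro", fmt="avro")  # Spark 2.4+
```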

Results of the interactive run:

In [7]: mydata001 = sqlContext.read.format("com.databricks.spark.avro").load("episodes.avro")
17/10/03 07:00:47 INFO avro.AvroRelation: Listing hdfs://localhost:8020/user/training/episodes.avro on driver

In [8]: type(mydata001)
Out[8]: pyspark.sql.dataframe.DataFrame

In [9]: mydata001.count()
17/10/03 07:01:05 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 65.5 KB, free 65.5 KB)
17/10/03 07:01:05 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 21.4 KB, free 86.9 KB)
17/10/03 07:01:05 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:40075 (size: 21.4 KB, free: 208.8 MB)
17/10/03 07:01:05 INFO spark.SparkContext: Created broadcast 3 from count at NativeMethodAccessorImpl.java:-2
17/10/03 07:01:05 INFO storage.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 230.4 KB, free 317.3 KB)
17/10/03 07:01:06 INFO storage.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 21.5 KB, free 338.8 KB)
17/10/03 07:01:06 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on localhost:40075 (size: 21.5 KB, free: 208.8 MB)
17/10/03 07:01:06 INFO spark.SparkContext: Created broadcast 4 from hadoopFile at AvroRelation.scala:121
17/10/03 07:01:06 INFO mapred.FileInputFormat: Total input paths to process : 1
17/10/03 07:01:07 INFO spark.SparkContext: Starting job: count at NativeMethodAccessorImpl.java:-2
17/10/03 07:01:07 INFO scheduler.DAGScheduler: Registering RDD (count at NativeMethodAccessorImpl.java:-2)
17/10/03 07:01:07 INFO scheduler.DAGScheduler: Got job 1 (count at NativeMethodAccessorImpl.java:-2) with 1 output partitions
17/10/03 07:01:07 INFO scheduler.DAGScheduler: Final stage: ResultStage 3 (count at NativeMethodAccessorImpl.java:-2)
17/10/03 07:01:07 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 2)
17/10/03 07:01:07 INFO scheduler.DAGScheduler: Missing parents: List(ShuffleMapStage 2)
17/10/03 07:01:07 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 2 (MapPartitionsRDD[16] at count at NativeMethodAccessorImpl.java:-2), which has no missing parents
17/10/03 07:01:07 INFO storage.MemoryStore: Block broadcast_5 stored as values in memory (estimated size 11.5 KB, free 350.3 KB)
17/10/03 07:01:07 INFO storage.MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 5.7 KB, free 356.0 KB)
17/10/03 07:01:07 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on localhost:40075 (size: 5.7 KB, free: 208.8 MB)
17/10/03 07:01:07 INFO spark.SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:1006
17/10/03 07:01:07 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 2 (MapPartitionsRDD[16] at count at NativeMethodAccessorImpl.java:-2)
17/10/03 07:01:07 INFO scheduler.TaskSchedulerImpl: Adding task set 2.0 with 1 tasks
17/10/03 07:01:07 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, localhost, partition 0,PROCESS_LOCAL, 2249 bytes)
17/10/03 07:01:07 INFO executor.Executor: Running task 0.0 in stage 2.0 (TID 2)
17/10/03 07:01:07 INFO rdd.HadoopRDD: Input split: hdfs://localhost:8020/user/training/episodes.avro:0+597
17/10/03 07:01:08 INFO executor.Executor: Finished task 0.0 in stage 2.0 (TID 2). 2484 bytes result sent to driver
17/10/03 07:01:08 INFO scheduler.DAGScheduler: ShuffleMapStage 2 (count at NativeMethodAccessorImpl.java:-2) finished in 0.691 s
17/10/03 07:01:08 INFO scheduler.DAGScheduler: looking for newly runnable stages
17/10/03 07:01:08 INFO scheduler.DAGScheduler: running: Set()
17/10/03 07:01:08 INFO scheduler.DAGScheduler: waiting: Set(ResultStage 3)
17/10/03 07:01:08 INFO scheduler.DAGScheduler: failed: Set()
17/10/03 07:01:08 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 693 ms on localhost (1/1)
17/10/03 07:01:08 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
17/10/03 07:01:08 INFO scheduler.DAGScheduler: Submitting ResultStage 3 (MapPartitionsRDD[19] at count at NativeMethodAccessorImpl.java:-2), which has no missing parents
17/10/03 07:01:08 INFO storage.MemoryStore: Block broadcast_6 stored as values in memory (estimated size 12.6 KB, free 368.5 KB)
17/10/03 07:01:08 INFO storage.MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 6.1 KB, free 374.7 KB)
17/10/03 07:01:08 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on localhost:40075 (size: 6.1 KB, free: 208.8 MB)
17/10/03 07:01:08 INFO spark.SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:1006
17/10/03 07:01:08 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 3 (MapPartitionsRDD[19] at count at NativeMethodAccessorImpl.java:-2)
17/10/03 07:01:08 INFO scheduler.TaskSchedulerImpl: Adding task set 3.0 with 1 tasks
17/10/03 07:01:08 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 3.0 (TID 3, localhost, partition 0,NODE_LOCAL, 1999 bytes)
17/10/03 07:01:08 INFO executor.Executor: Running task 0.0 in stage 3.0 (TID 3)
17/10/03 07:01:08 INFO storage.ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
17/10/03 07:01:08 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
17/10/03 07:01:08 INFO executor.Executor: Finished task 0.0 in stage 3.0 (TID 3). 1666 bytes result sent to driver
17/10/03 07:01:08 INFO scheduler.DAGScheduler: ResultStage 3 (count at NativeMethodAccessorImpl.java:-2) finished in 0.344 s
17/10/03 07:01:08 INFO scheduler.DAGScheduler: Job 1 finished: count at NativeMethodAccessorImpl.java:-2, took 1.480495 s
17/10/03 07:01:08 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 3.0 (TID 3) in 345 ms on localhost (1/1)
17/10/03 07:01:08 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
Out[9]: 8

In [10]: mydata001.take(1)
17/10/03 07:01:18 INFO storage.MemoryStore: Block broadcast_7 stored as values in memory (estimated size 230.1 KB, free 604.8 KB)
17/10/03 07:01:18 INFO storage.MemoryStore: Block broadcast_7_piece0 stored as bytes in memory (estimated size 21.4 KB, free 626.2 KB)
17/10/03 07:01:18 INFO storage.BlockManagerInfo: Added broadcast_7_piece0 in memory on localhost:40075 (size: 21.4 KB, free: 208.7 MB)
17/10/03 07:01:18 INFO spark.SparkContext: Created broadcast 7 from take at <ipython-input-10-35862abbc114>:1
17/10/03 07:01:18 INFO storage.MemoryStore: Block broadcast_8 stored as values in memory (estimated size 230.5 KB, free 856.7 KB)
17/10/03 07:01:18 INFO storage.MemoryStore: Block broadcast_8_piece0 stored as bytes in memory (estimated size 21.5 KB, free 878.2 KB)
17/10/03 07:01:18 INFO storage.BlockManagerInfo: Added broadcast_8_piece0 in memory on localhost:40075 (size: 21.5 KB, free: 208.7 MB)
17/10/03 07:01:18 INFO spark.SparkContext: Created broadcast 8 from take at <ipython-input-10-35862abbc114>:1
17/10/03 07:01:18 INFO mapred.FileInputFormat: Total input paths to process : 1
17/10/03 07:01:18 INFO spark.SparkContext: Starting job: take at <ipython-input-10-35862abbc114>:1
17/10/03 07:01:18 INFO scheduler.DAGScheduler: Got job 2 (take at <ipython-input-10-35862abbc114>:1) with 1 output partitions
17/10/03 07:01:18 INFO scheduler.DAGScheduler: Final stage: ResultStage 4 (take at <ipython-input-10-35862abbc114>:1)
17/10/03 07:01:18 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/10/03 07:01:18 INFO scheduler.DAGScheduler: Missing parents: List()
17/10/03 07:01:18 INFO scheduler.DAGScheduler: Submitting ResultStage 4 (MapPartitionsRDD[27] at take at <ipython-input-10-35862abbc114>:1), which has no missing parents
17/10/03 07:01:19 INFO storage.MemoryStore: Block broadcast_9 stored as values in memory (estimated size 5.6 KB, free 883.8 KB)
17/10/03 07:01:19 INFO storage.MemoryStore: Block broadcast_9_piece0 stored as bytes in memory (estimated size 3.0 KB, free 886.9 KB)
17/10/03 07:01:19 INFO storage.BlockManagerInfo: Added broadcast_9_piece0 in memory on localhost:40075 (size: 3.0 KB, free: 208.7 MB)
17/10/03 07:01:19 INFO spark.SparkContext: Created broadcast 9 from broadcast at DAGScheduler.scala:1006
17/10/03 07:01:19 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 4 (MapPartitionsRDD[27] at take at <ipython-input-10-35862abbc114>:1)
17/10/03 07:01:19 INFO scheduler.TaskSchedulerImpl: Adding task set 4.0 with 1 tasks
17/10/03 07:01:19 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 4.0 (TID 4, localhost, partition 0,PROCESS_LOCAL, 2260 bytes)
17/10/03 07:01:19 INFO executor.Executor: Running task 0.0 in stage 4.0 (TID 4)
17/10/03 07:01:19 INFO rdd.HadoopRDD: Input split: hdfs://localhost:8020/user/training/episodes.avro:0+597
17/10/03 07:01:19 INFO codegen.GenerateUnsafeProjection: Code generated in 124.624053 ms
17/10/03 07:01:19 INFO executor.Executor: Finished task 0.0 in stage 4.0 (TID 4). 2237 bytes result sent to driver
17/10/03 07:01:19 INFO scheduler.DAGScheduler: ResultStage 4 (take at <ipython-input-10-35862abbc114>:1) finished in 0.415 s
17/10/03 07:01:19 INFO scheduler.DAGScheduler: Job 2 finished: take at <ipython-input-10-35862abbc114>:1, took 0.565858 s
17/10/03 07:01:19 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 4.0 (TID 4) in 415 ms on localhost (1/1)
17/10/03 07:01:19 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool
Out[10]: [Row(title=u'The Eleventh Hour', air_date=u'3 April', doctor=11)]

In [11]:
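
Most of the session output is INFO-level logging from the driver; the actual results are only the `Out[...]` lines. The sketch below repeats the same `count()` and `take()` inspection with logging turned down first. It assumes `sc` is the shell's SparkContext and `df` is the DataFrame loaded from the Avro file; `summarize` is a hypothetical helper name.

```python
def summarize(df, sc=None, n=1, log_level="WARN"):
    """Return (row_count, first n rows as dicts) for a DataFrame.

    If a SparkContext is passed as `sc`, its log level is lowered
    first so that INFO messages do not drown out the results.
    """
    if sc is not None:
        # SparkContext.setLogLevel is available from Spark 1.4 onward.
        sc.setLogLevel(log_level)
    rows = [row.asDict() for row in df.take(n)]
    return df.count(), rows


# Usage inside the PySpark shell (not executed here):
#   total, sample = summarize(mydata001, sc)
```

Converting each `Row` with `asDict()` makes the field names (`title`, `air_date`, `doctor`) easy to read and to pass on to plain Python code.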
