[Spark] [Python] An example of loading a JSON file into a DataFrame:
[training@localhost ~]$ cat people.json
{"Name": "Alice", "Pcode": "94304"}
{"Name": "Brayden", "age": +, "Pcode": "94304"}
{"Name": "Carla", "age": +, "Pcoe": "10036"}
{"Name": "Diana", "Age": 46}
{"Name": "Etienne", "Pcode": "94104"}
[training@localhost ~]$
[training@localhost ~]$ hdfs dfs -put people.json
[training@localhost ~]$ hdfs dfs -cat people.json
{"Name": "Alice", "Pcode": "94304"}
{"Name": "Brayden", "age": +, "Pcode": "94304"}
{"Name": "Carla", "age": +, "Pcoe": "10036"}
{"Name": "Diana", "Age": 46}
{"Name": "Etienne", "Pcode": "94104"}
In [1]: sqlContext = HiveContext(sc)
In [2]: peopleDF = sqlContext.read.json("people.json")
17/10/01 05:20:22 INFO hive.HiveContext: Initializing execution hive, version 1.1.0
17/10/01 05:20:22 INFO client.ClientWrapper: Inspected Hadoop version: 2.6.0-cdh5.7.0
17/10/01 05:20:22 INFO client.ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0-cdh5.7.0
17/10/01 05:20:23 INFO hive.metastore: Trying to connect to metastore with URI thrift://localhost.localdomain:9083
17/10/01 05:20:23 INFO hive.metastore: Opened a connection to metastore, current connections: 1
17/10/01 05:20:23 INFO hive.metastore: Connected to metastore.
17/10/01 05:20:23 INFO session.SessionState: Created HDFS directory: file:/tmp/spark-839b35f5-91a1-436c-aae5-922ebacb27f1/scratch/training
17/10/01 05:20:23 INFO session.SessionState: Created local directory: /tmp/b3e52bfc-fe3a-4abe-ac7b-da071104b2f9_resources
17/10/01 05:20:23 INFO session.SessionState: Created HDFS directory: file:/tmp/spark-839b35f5-91a1-436c-aae5-922ebacb27f1/scratch/training/b3e52bfc-fe3a-4abe-ac7b-da071104b2f9
17/10/01 05:20:23 INFO session.SessionState: Created local directory: /tmp/training/b3e52bfc-fe3a-4abe-ac7b-da071104b2f9
17/10/01 05:20:23 INFO session.SessionState: Created HDFS directory: file:/tmp/spark-839b35f5-91a1-436c-aae5-922ebacb27f1/scratch/training/b3e52bfc-fe3a-4abe-ac7b-da071104b2f9/_tmp_space.db
17/10/01 05:20:23 INFO session.SessionState: No Tez session required at this point. hive.execution.engine=mr.
17/10/01 05:20:23 INFO json.JSONRelation: Listing hdfs://localhost:8020/user/training/people.json on driver
17/10/01 05:20:25 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 251.1 KB, free 251.1 KB)
17/10/01 05:20:25 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 21.6 KB, free 272.7 KB)
17/10/01 05:20:25 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:42171 (size: 21.6 KB, free: 208.8 MB)
17/10/01 05:20:25 INFO spark.SparkContext: Created broadcast 0 from json at NativeMethodAccessorImpl.java:-2
17/10/01 05:20:26 INFO mapred.FileInputFormat: Total input paths to process : 1
17/10/01 05:20:26 INFO spark.SparkContext: Starting job: json at NativeMethodAccessorImpl.java:-2
17/10/01 05:20:26 INFO scheduler.DAGScheduler: Got job 0 (json at NativeMethodAccessorImpl.java:-2) with 1 output partitions
17/10/01 05:20:26 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (json at NativeMethodAccessorImpl.java:-2)
17/10/01 05:20:26 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/10/01 05:20:26 INFO scheduler.DAGScheduler: Missing parents: List()
17/10/01 05:20:26 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[3] at json at NativeMethodAccessorImpl.java:-2), which has no missing parents
17/10/01 05:20:26 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.3 KB, free 277.1 KB)
17/10/01 05:20:26 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.4 KB, free 279.5 KB)
17/10/01 05:20:26 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:42171 (size: 2.4 KB, free: 208.8 MB)
17/10/01 05:20:26 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
17/10/01 05:20:26 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[3] at json at NativeMethodAccessorImpl.java:-2)
17/10/01 05:20:26 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
17/10/01 05:20:26 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2149 bytes)
17/10/01 05:20:26 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
17/10/01 05:20:26 INFO rdd.HadoopRDD: Input split: hdfs://localhost:8020/user/training/people.json:0+179
17/10/01 05:20:27 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/10/01 05:20:27 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
17/10/01 05:20:27 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
17/10/01 05:20:27 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
17/10/01 05:20:27 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
17/10/01 05:20:27 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 2354 bytes result sent to driver
17/10/01 05:20:27 INFO scheduler.DAGScheduler: ResultStage 0 (json at NativeMethodAccessorImpl.java:-2) finished in 0.715 s
17/10/01 05:20:27 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 667 ms on localhost (1/1)
17/10/01 05:20:27 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
17/10/01 05:20:27 INFO scheduler.DAGScheduler: Job 0 finished: json at NativeMethodAccessorImpl.java:-2, took 1.084685 s
17/10/01 05:20:27 INFO hive.HiveContext: default warehouse location is /user/hive/warehouse
17/10/01 05:20:28 INFO hive.HiveContext: Initializing metastore client version 1.1.0 using Spark classes.
17/10/01 05:20:28 INFO client.ClientWrapper: Inspected Hadoop version: 2.6.0-cdh5.7.0
17/10/01 05:20:28 INFO client.ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0-cdh5.7.0
17/10/01 05:20:28 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on localhost:42171 in memory (size: 2.4 KB, free: 208.8 MB)
17/10/01 05:20:28 INFO spark.ContextCleaner: Cleaned accumulator 2
17/10/01 05:20:30 INFO hive.metastore: Trying to connect to metastore with URI thrift://localhost.localdomain:9083
17/10/01 05:20:30 INFO hive.metastore: Opened a connection to metastore, current connections: 1
17/10/01 05:20:30 INFO hive.metastore: Connected to metastore.
17/10/01 05:20:30 INFO session.SessionState: Created HDFS directory: /tmp/hive/training
17/10/01 05:20:30 INFO session.SessionState: Created local directory: /tmp/8c1eba54-7260-4314-abbf-7b7de85bdf0a_resources
17/10/01 05:20:30 INFO session.SessionState: Created HDFS directory: /tmp/hive/training/8c1eba54-7260-4314-abbf-7b7de85bdf0a
17/10/01 05:20:30 INFO session.SessionState: Created local directory: /tmp/training/8c1eba54-7260-4314-abbf-7b7de85bdf0a
17/10/01 05:20:30 INFO session.SessionState: Created HDFS directory: /tmp/hive/training/8c1eba54-7260-4314-abbf-7b7de85bdf0a/_tmp_space.db
17/10/01 05:20:30 INFO session.SessionState: No Tez session required at this point. hive.execution.engine=mr.
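The job above is launched only to scan the file and infer its schema. The relative path "people.json" is resolved against the user's HDFS home directory, which is why the log reports hdfs://localhost:8020/user/training/people.json. An equivalent call that spells the location out, taking the URI from the log above, would be:

# Same load, but with the fully qualified HDFS URI seen in the log.
peopleDF = sqlContext.read.json("hdfs://localhost:8020/user/training/people.json")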
In [3]: type(peopleDF)
Out[3]: pyspark.sql.dataframe.DataFrame
In [4]:
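The captured session ends after confirming that peopleDF is a DataFrame. The calls one would typically run next are not part of the transcript; the following is only a sketch using standard Spark 1.6 DataFrame methods:

# Inspect the inferred schema and the rows (output omitted here).
peopleDF.printSchema()
peopleDF.show()

# DataFrame-style queries; "pcoe" becomes its own column because of the
# misspelled key in Carla's record.
peopleDF.select("name", "pcode").show()
peopleDF.filter(peopleDF.age > 20).show()

# Or register the DataFrame as a temporary table and query it with SQL.
peopleDF.registerTempTable("people")
sqlContext.sql("SELECT name, age FROM people WHERE age IS NOT NULL").show()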