[Spark] [Python] Example of reading a JSON file into a DataFrame

Source: Internet
Author: User
Tags: deprecated, hdfs dfs

[Spark] [Python] An example of reading a JSON file into a DataFrame:

[training@localhost ~]$ cat people.json
{"Name": "Alice", "Pcode": "94304"}
{"Name": "Brayden", "age": +, "Pcode": "94304"}
{"Name": "Carla", "age": +, "Pcoe": "10036"}
{"Name": "Diana", "Age": 46}
{"Name": "Etienne", "Pcode": "94104"}
[training@localhost ~]$

[training@localhost ~]$ hdfs dfs -put people.json

[training@localhost ~]$ hdfs dfs -cat people.json
{"Name": "Alice", "Pcode": "94304"}
{"Name": "Brayden", "age": +, "Pcode": "94304"}
{"Name": "Carla", "age": +, "Pcoe": "10036"}
{"Name": "Diana", "Age": 46}
{"Name": "Etienne", "Pcode": "94104"}


In [1]: sqlContext = HiveContext(sc)

In [2]: peopleDF = sqlContext.read.json("people.json")

17/10/01 05:20:22 INFO hive.HiveContext: Initializing execution hive, version 1.1.0
17/10/01 05:20:22 INFO client.ClientWrapper: Inspected Hadoop version: 2.6.0-cdh5.7.0
17/10/01 05:20:22 INFO client.ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0-cdh5.7.0
17/10/01 05:20:23 INFO hive.metastore: Trying to connect to metastore with URI thrift://localhost.localdomain:9083
17/10/01 05:20:23 INFO hive.metastore: Opened a connection to metastore, current connections: 1
17/10/01 05:20:23 INFO hive.metastore: Connected to metastore.
17/10/01 05:20:23 INFO session.SessionState: Created HDFS directory: file:/tmp/spark-839b35f5-91a1-436c-aae5-922ebacb27f1/scratch/training
17/10/01 05:20:23 INFO session.SessionState: Created local directory: /tmp/b3e52bfc-fe3a-4abe-ac7b-da071104b2f9_resources
17/10/01 05:20:23 INFO session.SessionState: Created HDFS directory: file:/tmp/spark-839b35f5-91a1-436c-aae5-922ebacb27f1/scratch/training/b3e52bfc-fe3a-4abe-ac7b-da071104b2f9
17/10/01 05:20:23 INFO session.SessionState: Created local directory: /tmp/training/b3e52bfc-fe3a-4abe-ac7b-da071104b2f9
17/10/01 05:20:23 INFO session.SessionState: Created HDFS directory: file:/tmp/spark-839b35f5-91a1-436c-aae5-922ebacb27f1/scratch/training/b3e52bfc-fe3a-4abe-ac7b-da071104b2f9/_tmp_space.db
17/10/01 05:20:23 INFO session.SessionState: No Tez session required at this point. hive.execution.engine=mr.
17/10/01 05:20:23 INFO json.JSONRelation: Listing hdfs://localhost:8020/user/training/people.json on driver
17/10/01 05:20:25 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 251.1 KB, free 251.1 KB)
17/10/01 05:20:25 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 21.6 KB, free 272.7 KB)
17/10/01 05:20:25 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:42171 (size: 21.6 KB, free: 208.8 MB)
17/10/01 05:20:25 INFO spark.SparkContext: Created broadcast 0 from json at NativeMethodAccessorImpl.java:-2
17/10/01 05:20:26 INFO mapred.FileInputFormat: Total input paths to process : 1
17/10/01 05:20:26 INFO spark.SparkContext: Starting job: json at NativeMethodAccessorImpl.java:-2
17/10/01 05:20:26 INFO scheduler.DAGScheduler: Got job 0 (json at NativeMethodAccessorImpl.java:-2) with 1 output partitions
17/10/01 05:20:26 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (json at NativeMethodAccessorImpl.java:-2)
17/10/01 05:20:26 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/10/01 05:20:26 INFO scheduler.DAGScheduler: Missing parents: List()
17/10/01 05:20:26 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[3] at json at NativeMethodAccessorImpl.java:-2), which has no missing parents
17/10/01 05:20:26 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.3 KB, free 277.1 KB)
17/10/01 05:20:26 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.4 KB, free 279.5 KB)
17/10/01 05:20:26 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:42171 (size: 2.4 KB, free: 208.8 MB)
17/10/01 05:20:26 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
17/10/01 05:20:26 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[3] at json at NativeMethodAccessorImpl.java:-2)
17/10/01 05:20:26 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
17/10/01 05:20:26 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2149 bytes)
17/10/01 05:20:26 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
17/10/01 05:20:26 INFO rdd.HadoopRDD: Input split: hdfs://localhost:8020/user/training/people.json:0+179
17/10/01 05:20:27 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/10/01 05:20:27 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
17/10/01 05:20:27 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
17/10/01 05:20:27 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
17/10/01 05:20:27 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
17/10/01 05:20:27 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 2354 bytes result sent to driver
17/10/01 05:20:27 INFO scheduler.DAGScheduler: ResultStage 0 (json at NativeMethodAccessorImpl.java:-2) finished in 0.715 s
17/10/01 05:20:27 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 667 ms on localhost (1/1)
17/10/01 05:20:27 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
17/10/01 05:20:27 INFO scheduler.DAGScheduler: Job 0 finished: json at NativeMethodAccessorImpl.java:-2, took 1.084685 s
17/10/01 05:20:27 INFO hive.HiveContext: default warehouse location is /user/hive/warehouse
17/10/01 05:20:28 INFO hive.HiveContext: Initializing metastore client version 1.1.0 using Spark classes.
17/10/01 05:20:28 INFO client.ClientWrapper: Inspected Hadoop version: 2.6.0-cdh5.7.0
17/10/01 05:20:28 INFO client.ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0-cdh5.7.0
17/10/01 05:20:28 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on localhost:42171 in memory (size: 2.4 KB, free: 208.8 MB)
17/10/01 05:20:28 INFO spark.ContextCleaner: Cleaned accumulator 2
17/10/01 05:20:30 INFO hive.metastore: Trying to connect to metastore with URI thrift://localhost.localdomain:9083
17/10/01 05:20:30 INFO hive.metastore: Opened a connection to metastore, current connections: 1
17/10/01 05:20:30 INFO hive.metastore: Connected to metastore.
17/10/01 05:20:30 INFO session.SessionState: Created HDFS directory: /tmp/hive/training
17/10/01 05:20:30 INFO session.SessionState: Created local directory: /tmp/8c1eba54-7260-4314-abbf-7b7de85bdf0a_resources
17/10/01 05:20:30 INFO session.SessionState: Created HDFS directory: /tmp/hive/training/8c1eba54-7260-4314-abbf-7b7de85bdf0a
17/10/01 05:20:30 INFO session.SessionState: Created local directory: /tmp/training/8c1eba54-7260-4314-abbf-7b7de85bdf0a
17/10/01 05:20:30 INFO session.SessionState: Created HDFS directory: /tmp/hive/training/8c1eba54-7260-4314-abbf-7b7de85bdf0a/_tmp_space.db
17/10/01 05:20:30 INFO session.SessionState: No Tez session required at this point. hive.execution.engine=mr.


In [3]: type(peopleDF)
Out[3]: pyspark.sql.dataframe.DataFrame

In [4]:
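
For reference, the whole flow above can also be run as a short standalone script rather than in the interactive shell. This is a minimal sketch assuming the same Spark 1.6 / CDH 5.7 environment as the transcript (HiveContext is the Spark 1.x entry point) and that people.json has already been uploaded to the user's HDFS home directory; the app name is arbitrary, and printSchema() and show() are standard DataFrame methods included only to illustrate typical next steps.

# Sketch of the walkthrough as a standalone PySpark script (Spark 1.x API).
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="people-json-dataframe")  # in the pyspark shell, sc already exists
sqlContext = HiveContext(sc)

# Read the line-delimited JSON file into a DataFrame; Spark infers the schema from the records.
peopleDF = sqlContext.read.json("people.json")

peopleDF.printSchema()  # inferred schema; column names follow the keys in people.json
peopleDF.show()         # print the rows; fields missing from a record appear as null

sc.stop()

On Spark 2.x and later, the equivalent entry point is SparkSession, e.g. spark.read.json("people.json") in the shell; HiveContext remains only for backward compatibility.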
