Spark: loading a JSON file from HDFS into a SQL table via the RDD

Source: Internet
Author: User

RDD definition

RDD stands for Resilient Distributed Dataset, the core abstraction of Spark. Through it you can read many kinds of files; this article demonstrates reading a file from HDFS. All Spark work happens on RDDs: creating a new RDD, transforming an existing RDD, or computing a result from the current RDD.

An RDD is an immutable collection of objects in Spark that can be split into multiple partitions and stored on different nodes.


Creating an RDD

There are two ways to create an RDD. One is to load an external data set, such as the HDFS file below, running in spark-shell:
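A minimal sketch of this first approach, assuming spark-shell is running so that `sc` is already bound to the SparkContext (the HDFS path is the log file used later in this article):

```scala
// In spark-shell, `sc` is the pre-created SparkContext.
// textFile loads a file from HDFS; each element of the RDD is one line.
val path = "hdfs://namenode-host:9000/input/dean/obd_hdfs-writer-4-9-1447126914492.log"
val lines = sc.textFile(path)

// Nothing is read yet -- textFile is lazy; an action such as count()
// triggers the actual computation.
println(lines.count())
```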

The other approach is to call the SparkContext.parallelize method in the driver program; it is not discussed in detail here.
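For completeness, a quick sketch of the parallelize approach (the sample data here is made up purely for illustration):

```scala
// Turn a local Scala collection into an RDD distributed across the cluster.
val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5))

// Transformations and actions work the same as on file-backed RDDs.
println(rdd.map(_ * 2).reduce(_ + _))   // 2+4+6+8+10 = 30
```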

Reading JSON files

The log file above is actually in JSON format, so we can change the way we read it:

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@...

scala> val path = "hdfs://namenode-host:9000/input/dean/obd_hdfs-writer-4-9-1447126914492.log"
path: String = hdfs://namenode-host:9000/input/dean/obd_hdfs-writer-4-9-1447126914492.log

scala> val c = sqlContext.read.json(path)
c: org.apache.spark.sql.DataFrame = [data: struct<client_version:bigint,corp_id:string,east:bigint,ext_o_latitude:double,ext_o_longitude:double,gps_num:array<struct<east:bigint,gps_num:bigint,gpstime:bigint,latitude:double,longitude:double,msg_id:bigint,msg_length:bigint,msg_type:bigint,north:bigint,terminal:string,tsp_obd_n900_head:array<bigint>>>,gpstime:bigint,heading:bigint,k:string,latitude:double,longitude:double,msg_id:bigint,msg_length:bigint,msg_type:bigint,north:bigint,syn_type:bigint,systime_driverstorage:bigint,systime_listenerserver:bigint,target_id:string,target_name:string,terminal:string,terminal_id:string,terminal_status_desc:string,tsp_obd_n900_head:array<bigint>,type:bigint,update_time:bigint>, drivername: string, type: string]

scala> c.printSchema()
root
 |-- data: struct (nullable = true)
 |    |-- client_version: long (nullable = true)
 |    |-- corp_id: string (nullable = true)
 |    |-- east: long (nullable = true)
 |    |-- ext_o_latitude: double (nullable = true)
 |    |-- ext_o_longitude: double (nullable = true)
 |    |-- gps_num: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- east: long (nullable = true)
 |    |    |    |-- gps_num: long (nullable = true)
 |    |    |    |-- gpstime: long (nullable = true)
 |    |    |    |-- latitude: double (nullable = true)
 |    |    |    |-- longitude: double (nullable = true)
 |    |    |    |-- msg_id: long (nullable = true)
 |    |    |    |-- msg_length: long (nullable = true)
 |    |    |    |-- msg_type: long (nullable = true)
 |    |    |    |-- north: long (nullable = true)
 |    |    |    |-- terminal: string (nullable = true)
 |    |    |    |-- tsp_obd_n900_head: array (nullable = true)
 |    |    |    |    |-- element: long (containsNull = true)
 |    |-- gpstime: long (nullable = true)
 |    |-- heading: long (nullable = true)
 |    |-- k: string (nullable = true)
 |    |-- latitude: double (nullable = true)
 |    |-- longitude: double (nullable = true)
 |    |-- msg_id: long (nullable = true)
 |    |-- msg_length: long (nullable = true)
 |    |-- msg_type: long (nullable = true)
 |    |-- north: long (nullable = true)
 |    |-- syn_type: long (nullable = true)
 |    |-- systime_driverstorage: long (nullable = true)
 |    |-- systime_listenerserver: long (nullable = true)
 |    |-- target_id: string (nullable = true)
 |    |-- target_name: string (nullable = true)
 |    |-- terminal: string (nullable = true)
 |    |-- terminal_id: string (nullable = true)
 |    |-- terminal_status_desc: string (nullable = true)
 |    |-- tsp_obd_n900_head: array (nullable = true)
 |    |    |-- element: long (containsNull = true)
 |    |-- type: long (nullable = true)
 |    |-- update_time: long (nullable = true)
 |-- drivername: string (nullable = true)
 |-- type: string (nullable = true)


Converting to a table


Now register the DataFrame as a temp table OBD and iterate over the contents of the table:

scala> c.registerTempTable("OBD")

scala> val set = sqlContext.sql("SELECT * FROM OBD")

scala> set.collect().foreach(println)

Spark will automatically flatten the tree structure of the JSON. Whether that is good or not, at least we get a table that can be used.
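Nested fields can also be addressed directly with dot notation in the SQL, so the tree does not have to be flattened by hand. A sketch, assuming the temp table OBD has been registered as above (field names taken from the schema printed earlier):

```scala
// Query fields nested inside the `data` struct using dot notation.
val positions = sqlContext.sql(
  "SELECT data.target_id, data.latitude, data.longitude FROM OBD")

// Each row holds one target's id and GPS coordinates.
positions.collect().foreach(println)
```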


This mixed program-and-SQL pattern is interesting, but it has shortcomings. Since it is a program, you would like auto-completion of functions, which spark-shell does not provide.








Copyright notice: this is the blogger's original article and may not be reproduced without the blogger's permission.

