Spark: loading JSON files from HDFS into SQL tables through RDDs

RDD Definition

RDD stands for Resilient Distributed Dataset, the core abstraction in Spark. RDDs can be created by reading data from many kinds of sources; here we demonstrate reading files from HDFS. All Spark jobs operate on RDDs: you can create a new RDD, transform an existing RDD, or run an action on an existing RDD to compute a result.

An RDD is an immutable collection of objects in Spark. It can be divided into multiple partitions that are stored on different nodes of the cluster.
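
As a minimal sketch of those three kinds of operations in spark-shell (sc is the preconfigured SparkContext; the HDFS path and names below are illustrative):

// Create an RDD by loading an external file from HDFS (evaluated lazily).
val lines = sc.textFile("hdfs://namenode-host:9000/input/dean/sample.log")

// Transformation: derives a new RDD, still without running a job.
val errors = lines.filter(_.contains("ERROR"))

// Action: triggers the distributed computation and returns a result to the driver.
val errorCount = errors.count()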

 

Create RDD

There are two ways to create an RDD. One is to load an external dataset, for example loading an HDFS file as follows in spark-shell:

 

val textFile = sc.textFile("hdfs://namenode-host:9000/input/dean/obd_hdfs-writer-4-9-1447126914492.log")
textFile.count()
res1: Long = 3574

 

The other is to use the parallelize method of SparkContext in the driver program. It is not covered in detail here, but a minimal sketch follows.
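
A sketch of that second method (the collection and names here are illustrative):

// Distribute an in-memory collection across the cluster as an RDD.
val data = Seq(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)

// It behaves like any other RDD:
distData.map(_ * 2).collect()  // Array(2, 4, 6, 8, 10)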

 

Read JSON files

 

The content of the log file above is actually JSON, so another reading method can be used:

 

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@2f92b5a1

scala> val path = "hdfs://namenode-host:9000/input/dean/obd_hdfs-writer-4-9-1447126914492.log"
path: String = hdfs://namenode-host:9000/input/dean/obd_hdfs-writer-4-9-1447126914492.log

scala> val c = sqlContext.read.json(path)
c: org.apache.spark.sql.DataFrame = [data: struct<client_version:bigint,corp_id:string,east:bigint,ext_o_latitude:double,ext_o_longitude:double,gps_num:array<struct<east:bigint,gps_num:bigint,gpstime:bigint,latitude:double,longitude:double,msg_id:bigint,msg_length:bigint,msg_type:bigint,north:bigint,terminal:string,tsp_obd_n900_head:array<bigint>>>,gpstime:bigint,heading:bigint,k:string,latitude:double,longitude:double,msg_id:bigint,msg_length:bigint,msg_type:bigint,north:bigint,syn_type:bigint,systime_driverStorage:bigint,systime_listenerserver:bigint,target_id:string,target_name:string,terminal:string,terminal_id:string,terminal_status_desc:string,tsp_obd_n900_head:array<bigint>,type:bigint,update_time:bigint>, driverName: string, type: string]

scala> c.printSchema()
root
 |-- data: struct (nullable = true)
 |    |-- client_version: long (nullable = true)
 |    |-- corp_id: string (nullable = true)
 |    |-- east: long (nullable = true)
 |    |-- ext_o_latitude: double (nullable = true)
 |    |-- ext_o_longitude: double (nullable = true)
 |    |-- gps_num: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- east: long (nullable = true)
 |    |    |    |-- gps_num: long (nullable = true)
 |    |    |    |-- gpstime: long (nullable = true)
 |    |    |    |-- latitude: double (nullable = true)
 |    |    |    |-- longitude: double (nullable = true)
 |    |    |    |-- msg_id: long (nullable = true)
 |    |    |    |-- msg_length: long (nullable = true)
 |    |    |    |-- msg_type: long (nullable = true)
 |    |    |    |-- north: long (nullable = true)
 |    |    |    |-- terminal: string (nullable = true)
 |    |    |    |-- tsp_obd_n900_head: array (nullable = true)
 |    |    |    |    |-- element: long (containsNull = true)
 |    |-- gpstime: long (nullable = true)
 |    |-- heading: long (nullable = true)
 |    |-- k: string (nullable = true)
 |    |-- latitude: double (nullable = true)
 |    |-- longitude: double (nullable = true)
 |    |-- msg_id: long (nullable = true)
 |    |-- msg_length: long (nullable = true)
 |    |-- msg_type: long (nullable = true)
 |    |-- north: long (nullable = true)
 |    |-- syn_type: long (nullable = true)
 |    |-- systime_driverStorage: long (nullable = true)
 |    |-- systime_listenerserver: long (nullable = true)
 |    |-- target_id: string (nullable = true)
 |    |-- target_name: string (nullable = true)
 |    |-- terminal: string (nullable = true)
 |    |-- terminal_id: string (nullable = true)
 |    |-- terminal_status_desc: string (nullable = true)
 |    |-- tsp_obd_n900_head: array (nullable = true)
 |    |    |-- element: long (containsNull = true)
 |    |-- type: long (nullable = true)
 |    |-- update_time: long (nullable = true)
 |-- driverName: string (nullable = true)
 |-- type: string (nullable = true)
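
In newer Spark versions (2.x and later), SparkSession replaces SQLContext as the entry point; in spark-shell it is preconfigured as spark, and the equivalent read would be:

scala> val c = spark.read.json(path)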
  
 

 

 

Convert to a table


Now register the DataFrame as a temporary table named obd and iterate over its contents:

 

c.registerTempTable("obd")
val set = sqlContext.sql("select * from obd")
set.collect().foreach(println)

 

Spark maps the nested JSON tree onto a table schema automatically. Whether or not the mapping is ideal, you at least end up with a queryable table.
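
Nested fields remain addressable with dot notation in Spark SQL. For example, a sketch using field names taken from the schema above:

// Select a few nested fields from the struct column "data".
val pos = sqlContext.sql("select data.latitude, data.longitude, data.gpstime from obd")
pos.show()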

 

This mixes programmatic code with SQL, which is interesting, but it has shortcomings: since it is a program, features such as automatic completion would help, and spark-shell does not provide them. Moving the same flow into a standalone application, as sketched below, lets an IDE supply those features.
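
A minimal sketch of the same flow as a standalone driver program (the object name is illustrative; assumes a Spark 1.x build with spark-core and spark-sql on the classpath):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ObdJsonToTable {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ObdJsonToTable")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Same steps as in spark-shell: read JSON, register a temp table, query it.
    val path = "hdfs://namenode-host:9000/input/dean/obd_hdfs-writer-4-9-1447126914492.log"
    val df = sqlContext.read.json(path)
    df.registerTempTable("obd")
    sqlContext.sql("select * from obd").collect().foreach(println)

    sc.stop()
  }
}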

 


 
