Spark: loading JSON files from HDFS into SQL tables via RDDs
RDD Definition
RDD stands for Resilient Distributed Dataset, the core abstraction of Spark. It can be used to read many kinds of files; here we demonstrate reading HDFS files. All Spark jobs operate on RDDs: for example, you can create a new RDD, transform an existing RDD, or run a computation on an existing RDD to obtain a result.
An RDD is an immutable collection of objects in Spark. It can be split into multiple partitions, which are stored on different nodes.
Create RDD
There are two ways to create an RDD. One is to load an external dataset, such as the HDFS file below, running in spark-shell:
val textFile = sc.textFile("hdfs://namenode-host:9000/input/dean/obd_hdfs-writer-4-9-1447126914492.log")
textFile.count()
res1: Long = 3574
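To make the ideas above concrete: transformations build a new RDD (the original is immutable), while actions actually run the computation. A minimal sketch against the textFile RDD just created; the length threshold of 80 is an arbitrary illustration:

// transformation: returns a new RDD, textFile itself is left unchanged
val longLines = textFile.filter(_.length > 80)
// action: triggers the computation and returns a value to the driver
longLines.count()
// the underlying data is split into partitions that can sit on different nodes
textFile.partitions.length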
The other way is to call the parallelize method of SparkContext in the driver program. It is not covered in detail here, but a minimal sketch follows.
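Assuming the same spark-shell session, where sc is the predefined SparkContext (the collection and partition count are made up for illustration):

// turn a local driver-side collection into a distributed RDD with 4 partitions
val numbers = sc.parallelize(1 to 1000, 4)
// sum is an action available on numeric RDDs
numbers.sum()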
Read JSON files
The content of the log file above is actually JSON, so it can be read a different way:
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@2f92b5a1

scala> val path = "hdfs://namenode-host:9000/input/dean/obd_hdfs-writer-4-9-1447126914492.log"
path: String = hdfs://namenode-host:9000/input/dean/obd_hdfs-writer-4-9-1447126914492.log

scala> val c = sqlContext.read.json(path)
c: org.apache.spark.sql.DataFrame = [data: struct<client_version:bigint,corp_id:string,east:bigint,ext_o_latitude:double,ext_o_longitude:double,gps_num:array<struct<east:bigint,gps_num:bigint,gpstime:bigint,latitude:double,longitude:double,msg_id:bigint,msg_length:bigint,msg_type:bigint,north:bigint,terminal:string,tsp_obd_n900_head:array<bigint>>>,gpstime:bigint,heading:bigint,k:string,latitude:double,longitude:double,msg_id:bigint,msg_length:bigint,msg_type:bigint,north:bigint,syn_type:bigint,systime_driverStorage:bigint,systime_listenerserver:bigint,target_id:string,target_name:string,terminal:string,terminal_id:string,terminal_status_desc:string,tsp_obd_n900_head:array<bigint>,type:bigint,update_time:bigint>, driverName: string, type: string]

scala> c.printSchema()
root
 |-- data: struct (nullable = true)
 |    |-- client_version: long (nullable = true)
 |    |-- corp_id: string (nullable = true)
 |    |-- east: long (nullable = true)
 |    |-- ext_o_latitude: double (nullable = true)
 |    |-- ext_o_longitude: double (nullable = true)
 |    |-- gps_num: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- east: long (nullable = true)
 |    |    |    |-- gps_num: long (nullable = true)
 |    |    |    |-- gpstime: long (nullable = true)
 |    |    |    |-- latitude: double (nullable = true)
 |    |    |    |-- longitude: double (nullable = true)
 |    |    |    |-- msg_id: long (nullable = true)
 |    |    |    |-- msg_length: long (nullable = true)
 |    |    |    |-- msg_type: long (nullable = true)
 |    |    |    |-- north: long (nullable = true)
 |    |    |    |-- terminal: string (nullable = true)
 |    |    |    |-- tsp_obd_n900_head: array (nullable = true)
 |    |    |    |    |-- element: long (containsNull = true)
 |    |-- gpstime: long (nullable = true)
 |    |-- heading: long (nullable = true)
 |    |-- k: string (nullable = true)
 |    |-- latitude: double (nullable = true)
 |    |-- longitude: double (nullable = true)
 |    |-- msg_id: long (nullable = true)
 |    |-- msg_length: long (nullable = true)
 |    |-- msg_type: long (nullable = true)
 |    |-- north: long (nullable = true)
 |    |-- syn_type: long (nullable = true)
 |    |-- systime_driverStorage: long (nullable = true)
 |    |-- systime_listenerserver: long (nullable = true)
 |    |-- target_id: string (nullable = true)
 |    |-- target_name: string (nullable = true)
 |    |-- terminal: string (nullable = true)
 |    |-- terminal_id: string (nullable = true)
 |    |-- terminal_status_desc: string (nullable = true)
 |    |-- tsp_obd_n900_head: array (nullable = true)
 |    |    |-- element: long (containsNull = true)
 |    |-- type: long (nullable = true)
 |    |-- update_time: long (nullable = true)
 |-- driverName: string (nullable = true)
 |-- type: string (nullable = true)
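Because read.json has already inferred the nested schema shown above, nested fields can be reached with dot notation directly on the DataFrame, before any table is registered. A small sketch using column names from that schema (the msg_type value 1 is hypothetical):

// pick a few nested columns out of the data struct
c.select("data.target_id", "data.latitude", "data.longitude").show(5)
// filtering on a nested field also works
c.filter("data.msg_type = 1").count()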
Convert to table
Now register it as a temporary table named obd and iterate over the table's contents:
c.registerTempTable("obd")
val set = sqlContext.sql("select * from obd")
set.collect().foreach(println)
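The nested structure survives in the temporary table, so the same dot notation works inside SQL. A sketch of a more selective query (the latitude threshold of 30.0 is invented for illustration):

val gps = sqlContext.sql(
  "select data.target_id, data.gpstime, data.latitude, data.longitude " +
  "from obd where data.latitude > 30.0")
gps.show(10)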
The JSON tree structure is flattened automatically. Elegant or not, you at least end up with a usable table.
This mixes program code with SQL, which is interesting in its own right, but it still has shortcomings. Since you are writing a program, you want features such as auto-completion, and spark-shell does not provide them.