Spark: loading JSON files from HDFS into SQL tables via RDDs
RDD Definition
RDD stands for Resilient Distributed Dataset, the core abstraction of Spark. It can be used to read many kinds of files; here we demonstrate reading HDFS files. All Spark jobs operate on RDDs: for example, you can create a new RDD, transform an existing RDD, or run a computation on an existing RDD to obtain a result.
An RDD is an immutable collection of objects in Spark. It can be split into multiple partitions, which are stored on different nodes.
Create RDD
There are two ways to create an RDD. One is to load an external dataset, such as the HDFS file below, running in spark-shell:
val textFile = sc.textFile("hdfs://namenode-host:9000/input/dean/obd_hdfs-writer-4-9-1447126914492.log")
textFile.count()
res1: Long = 3574
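To make the ideas above concrete: transformations build a new RDD (the original is immutable), while actions actually run the computation. A minimal sketch against the textFile RDD just created; the length threshold of 80 is an arbitrary illustration:

// transformation: returns a new RDD, textFile itself is left unchanged
val longLines = textFile.filter(_.length > 80)
// action: triggers the computation and returns a value to the driver
longLines.count()
// the underlying data is split into partitions that can sit on different nodes
textFile.partitions.length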
The other way is to call the parallelize method of SparkContext in the driver program. It is not covered in detail here, but a minimal sketch follows.
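Assuming the same spark-shell session, where sc is the predefined SparkContext (the collection and partition count are made up for illustration):

// turn a local driver-side collection into a distributed RDD with 4 partitions
val numbers = sc.parallelize(1 to 1000, 4)
// sum is an action available on numeric RDDs
numbers.sum()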
Read JSON files
The content of the log file above is actually JSON, so it can be read a different way:
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@2f92b5a1

scala> val path = "hdfs://namenode-host:9000/input/dean/obd_hdfs-writer-4-9-1447126914492.log"
path: String = hdfs://namenode-host:9000/input/dean/obd_hdfs-writer-4-9-1447126914492.log

scala> val c = sqlContext.read.json(path)
c: org.apache.spark.sql.DataFrame = [data: struct<client_version:bigint,corp_id:string,east:bigint,ext_o_latitude:double,ext_o_longitude:double,gps_num:array<struct<east:bigint,gps_num:bigint,gpstime:bigint,latitude:double,longitude:double,msg_id:bigint,msg_length:bigint,msg_type:bigint,north:bigint,terminal:string,tsp_obd_n900_head:array<bigint>>>,gpstime:bigint,heading:bigint,k:string,latitude:double,longitude:double,msg_id:bigint,msg_length:bigint,msg_type:bigint,north:bigint,syn_type:bigint,systime_driverStorage:bigint,systime_listenerserver:bigint,target_id:string,target_name:string,terminal:string,terminal_id:string,terminal_status_desc:string,tsp_obd_n900_head:array<bigint>,type:bigint,update_time:bigint>, driverName: string, type: string]

scala> c.printSchema()
root
 |-- data: struct (nullable = true)
 |    |-- client_version: long (nullable = true)
 |    |-- corp_id: string (nullable = true)
 |    |-- east: long (nullable = true)
 |    |-- ext_o_latitude: double (nullable = true)
 |    |-- ext_o_longitude: double (nullable = true)
 |    |-- gps_num: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- east: long (nullable = true)
 |    |    |    |-- gps_num: long (nullable = true)
 |    |    |    |-- gpstime: long (nullable = true)
 |    |    |    |-- latitude: double (nullable = true)
 |    |    |    |-- longitude: double (nullable = true)
 |    |    |    |-- msg_id: long (nullable = true)
 |    |    |    |-- msg_length: long (nullable = true)
 |    |    |    |-- msg_type: long (nullable = true)
 |    |    |    |-- north: long (nullable = true)
 |    |    |    |-- terminal: string (nullable = true)
 |    |    |    |-- tsp_obd_n900_head: array (nullable = true)
 |    |    |    |    |-- element: long (containsNull = true)
 |    |-- gpstime: long (nullable = true)
 |    |-- heading: long (nullable = true)
 |    |-- k: string (nullable = true)
 |    |-- latitude: double (nullable = true)
 |    |-- longitude: double (nullable = true)
 |    |-- msg_id: long (nullable = true)
 |    |-- msg_length: long (nullable = true)
 |    |-- msg_type: long (nullable = true)
 |    |-- north: long (nullable = true)
 |    |-- syn_type: long (nullable = true)
 |    |-- systime_driverStorage: long (nullable = true)
 |    |-- systime_listenerserver: long (nullable = true)
 |    |-- target_id: string (nullable = true)
 |    |-- target_name: string (nullable = true)
 |    |-- terminal: string (nullable = true)
 |    |-- terminal_id: string (nullable = true)
 |    |-- terminal_status_desc: string (nullable = true)
 |    |-- tsp_obd_n900_head: array (nullable = true)
 |    |    |-- element: long (containsNull = true)
 |    |-- type: long (nullable = true)
 |    |-- update_time: long (nullable = true)
 |-- driverName: string (nullable = true)
 |-- type: string (nullable = true)
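Because read.json has already inferred the nested schema shown above, nested fields can be reached with dot notation directly on the DataFrame, before any table is registered. A small sketch using column names from that schema (the msg_type value 1 is hypothetical):

// pick a few nested columns out of the data struct
c.select("data.target_id", "data.latitude", "data.longitude").show(5)
// filtering on a nested field also works
c.filter("data.msg_type = 1").count()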
Convert to table
Now register it as a temporary table named obd and iterate over the table's contents:
c.registerTempTable("obd")
val set = sqlContext.sql("select * from obd")
set.collect().foreach(println)
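The nested structure survives in the temporary table, so the same dot notation works inside SQL. A sketch of a more selective query (the latitude threshold of 30.0 is invented for illustration):

val gps = sqlContext.sql(
  "select data.target_id, data.gpstime, data.latitude, data.longitude " +
  "from obd where data.latitude > 30.0")
gps.show(10)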
The JSON tree structure is flattened automatically. Elegant or not, you at least end up with a usable table.
This mixes program code with SQL, which is interesting in its own right, but it still has shortcomings. Since you are writing a program, you want features such as auto-completion, and spark-shell does not provide them.