RDD definition
RDD stands for Resilient Distributed Dataset, the core abstraction of Spark. Through RDDs you can read a variety of file formats; this article demonstrates reading files from HDFS. All Spark work happens on RDDs: creating a new RDD, transforming an existing RDD, or computing a result from the current RDD.
An RDD is an immutable collection of objects in Spark that can be divided into multiple partitions and stored across different nodes.
Creating an RDD
There are two ways to create one. The first is to load an external data set, for example a file loaded from HDFS, running in spark-shell:
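A minimal sketch of loading an HDFS text file as an RDD in spark-shell (the path below is hypothetical; substitute your own NameNode address and file):

```scala
// Load a text file from HDFS as an RDD of lines.
// sc is the SparkContext that spark-shell creates automatically.
val lines = sc.textFile("hdfs://namenode-host:9000/input/dean/some.log")

// Nothing is read yet -- RDDs are lazy. An action such as count()
// triggers the actual computation:
lines.count()
```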
The other approach is to call the SparkContext's parallelize method in the driver program; it is not discussed in depth here.
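For completeness, a small sketch of the parallelize approach (again in spark-shell, where sc already exists; the collection is an arbitrary example):

```scala
// Turn a local Scala collection into a distributed RDD.
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))

// Transformations build new RDDs; actions compute results.
val doubled = nums.map(_ * 2)
doubled.collect()  // Array(2, 4, 6, 8, 10)
```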
Reading JSON files
The log file above is actually in JSON format, so we can read it with the JSON reader instead:
```scala
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@...

scala> val path = "hdfs://namenode-host:9000/input/dean/obd_hdfs-writer-4-9-1447126914492.log"
path: String = hdfs://namenode-host:9000/input/dean/obd_hdfs-writer-4-9-1447126914492.log

scala> val c = sqlContext.read.json(path)
c: org.apache.spark.sql.DataFrame = [data: struct<client_version:bigint,corp_id:string,east:bigint,ext_o_latitude:double,ext_o_longitude:double,gps_num:array<struct<east:bigint,gps_num:bigint,gpstime:bigint,latitude:double,longitude:double,msg_id:bigint,msg_length:bigint,msg_type:bigint,north:bigint,terminal:string,tsp_obd_n900_head:array<bigint>>>,gpstime:bigint,heading:bigint,k:string,latitude:double,longitude:double,msg_id:bigint,msg_length:bigint,msg_type:bigint,north:bigint,syn_type:bigint,systime_driverstorage:bigint,systime_listenerserver:bigint,target_id:string,target_name:string,terminal:string,terminal_id:string,terminal_status_desc:string,tsp_obd_n900_head:array<bigint>,type:bigint,update_time:bigint>, drivername: string, type: string]

scala> c.printSchema()
root
 |-- data: struct (nullable = true)
 |    |-- client_version: long (nullable = true)
 |    |-- corp_id: string (nullable = true)
 |    |-- east: long (nullable = true)
 |    |-- ext_o_latitude: double (nullable = true)
 |    |-- ext_o_longitude: double (nullable = true)
 |    |-- gps_num: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- east: long (nullable = true)
 |    |    |    |-- gps_num: long (nullable = true)
 |    |    |    |-- gpstime: long (nullable = true)
 |    |    |    |-- latitude: double (nullable = true)
 |    |    |    |-- longitude: double (nullable = true)
 |    |    |    |-- msg_id: long (nullable = true)
 |    |    |    |-- msg_length: long (nullable = true)
 |    |    |    |-- msg_type: long (nullable = true)
 |    |    |    |-- north: long (nullable = true)
 |    |    |    |-- terminal: string (nullable = true)
 |    |    |    |-- tsp_obd_n900_head: array (nullable = true)
 |    |    |    |    |-- element: long (containsNull = true)
 |    |-- gpstime: long (nullable = true)
 |    |-- heading: long (nullable = true)
 |    |-- k: string (nullable = true)
 |    |-- latitude: double (nullable = true)
 |    |-- longitude: double (nullable = true)
 |    |-- msg_id: long (nullable = true)
 |    |-- msg_length: long (nullable = true)
 |    |-- msg_type: long (nullable = true)
 |    |-- north: long (nullable = true)
 |    |-- syn_type: long (nullable = true)
 |    |-- systime_driverstorage: long (nullable = true)
 |    |-- systime_listenerserver: long (nullable = true)
 |    |-- target_id: string (nullable = true)
 |    |-- target_name: string (nullable = true)
 |    |-- terminal: string (nullable = true)
 |    |-- terminal_id: string (nullable = true)
 |    |-- terminal_status_desc: string (nullable = true)
 |    |-- tsp_obd_n900_head: array (nullable = true)
 |    |    |-- element: long (containsNull = true)
 |    |-- type: long (nullable = true)
 |    |-- update_time: long (nullable = true)
 |-- drivername: string (nullable = true)
 |-- type: string (nullable = true)
```
Converting to a table
Now register it as the temporary table OBD and iterate over the contents of the table:
```scala
c.registerTempTable("OBD")
val set = sqlContext.sql("SELECT * FROM OBD")
set.collect().foreach(println)
```
Spark automatically flattens out the tree structure of the JSON. Good or not, it at least yields a table that can be queried.
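Since the nested fields survive as struct columns, they can also be reached with dot notation in SQL. The column names below come from the schema printed earlier; the query itself is a hypothetical example:

```scala
// Select a few nested fields from the struct column "data".
val points = sqlContext.sql(
  "SELECT data.latitude, data.longitude, data.gpstime FROM OBD")
points.collect().foreach(println)
```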
This pattern of mixing program code with SQL is somewhat interesting, but it has shortcomings: since it is a program, you want auto-completion for functions, which spark-shell does not provide.
Copyright notice: this is an original article by the author and may not be reproduced without the author's permission.
Spark: loading a JSON file from HDFS into a SQL table via the RDD