1. Spark reads the CSV file
You can read a CSV file with Databricks's third-party spark-csv package: download the package and place it in the specified path.
1.1 Defining the data format
Before importing the data, we need to define a schema that matches the format of the dataset.
Use StructType to define the field layout, with one entry corresponding to each field in the dataset.
The three parameters of StructField are the field name, the field's data type, and whether the field is allowed to be null.
import org.apache.spark.sql.types._

val fieldSchema = StructType(Array(
  StructField("TID", StringType, true),
  StructField("Lat", DoubleType, true),
  StructField("Lon", DoubleType, true),
  StructField("Time", StringType, true)))
2. Spark reads data
After the schema has been defined, call the read interface provided by SQLContext, specifying com.databricks.spark.csv (defined in the third-party library) as the load format. Because the first row of the dataset used in this lesson does not contain column names, the read option header must be set to false. Finally, pass the path of the dataset file to the load method.
val taxiDF = sqlContext.read.format("com.databricks.spark.csv").option("header", "false").schema(fieldSchema).load("/home/shiyanlou/taxi.csv")
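To confirm the load succeeded, you can inspect the resulting DataFrame. A minimal sketch, assuming the taxiDF value created above (printSchema and show are standard DataFrame methods):

// Print the schema: field names, types, and nullability, as defined in fieldSchema
taxiDF.printSchema()
// Display the first five rows to verify the data was parsed correctly
taxiDF.show(5)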
Note that since we are entering code interactively in the Spark shell, the shell has already created the sqlContext object during startup, so we can use it directly. If you are developing a Spark program as a standalone application, create the SQLContext manually from the SparkContext.
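A minimal sketch of creating the SQLContext manually in a standalone Spark 1.x application; the object name TaxiCSVApp and the application name are placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object TaxiCSVApp {
  def main(args: Array[String]): Unit = {
    // Configure and start the Spark context
    val conf = new SparkConf().setAppName("TaxiCSV")
    val sc = new SparkContext(conf)
    // Create the SQL context manually from the Spark context
    val sqlContext = new SQLContext(sc)
    // sqlContext.read can now be used exactly as in the shell example above
  }
}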