"SparkSQL" Create DataFrame

First, we create a SparkSession.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Test")
  .master("local")
  .getOrCreate()
import spark.implicits._  // enables converting RDDs to DataFrames and using SQL operations
Then we create DataFrames through the SparkSession.
1. toDF
Create a DataFrame using the toDF function.
By importing spark.implicits._, you can convert a local sequence (Seq), an Array, or an RDD to a DataFrame, as long as the element types can be inferred.
import spark.implicits._
val df = Seq(
  (1, "Zhangyuhang", java.sql.Date.valueOf("2018-05-15")),
  (2, "Zhangqiuyue", java.sql.Date.valueOf("2018-05-15"))
).toDF("id", "name", "created_time")
Note: if you call toDF() without specifying column names, the default column names are "_1", "_2", and so on.
We can rename the columns with df.withColumnRenamed("_1", "newName1").withColumnRenamed("_2", "newName2").
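Since the same import also lets you call toDF on an RDD of tuples, here is a minimal sketch (the personRDD name and the columns are illustrative, not from the original):

// convert an RDD of tuples to a DataFrame; column types are inferred from the tuple elements
val personRDD = spark.sparkContext.parallelize(Seq((1, "Zhangyuhang"), (2, "Zhangqiuyue")))
val personDF = personRDD.toDF("id", "name")
personDF.show()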
2. createDataFrame
Create a DataFrame using the createDataFrame function.
Create with a schema plus Rows:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// define the structure (schema) of the DataFrame
val schema = StructType(List(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true),
  StructField("create_time", DateType, nullable = true)
))

// define the DataFrame content as an RDD of Rows
val rdd = spark.sparkContext.parallelize(Seq(
  Row(1, "Zhangyuhang", java.sql.Date.valueOf("2018-05-15")),
  Row(2, "Zhangqiuyue", java.sql.Date.valueOf("2018-05-15"))
))

// create the DataFrame
val df = spark.createDataFrame(rdd, schema)
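To verify that the schema was applied, a quick check (illustrative, not part of the original example):

df.printSchema()  // prints id: integer (nullable = false), name: string, create_time: date
df.show()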
3. Create a DataFrame directly from a file
(1) Create from a Parquet file
val df = spark.read.parquet("hdfs:///path/to/file")
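As a hedged round-trip sketch, an existing DataFrame df can be written out as Parquet and read back (the /tmp path is illustrative):

df.write.parquet("/tmp/person.parquet")                    // write the DataFrame as Parquet files
val parquetDF = spark.read.parquet("/tmp/person.parquet")  // read it back into a DataFrame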
(2) Create from a JSON file
val df = spark.read.json("examples/src/main/resources/people.json")
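Assuming the people.json sample that ships with the Spark distribution (one JSON object per line), the result would look like this:

df.show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+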
(3) Create from a CSV file
val df = spark.read
  .format("com.databricks.spark.csv")
  .option("header", "true")         // read the first line as column headers
  .option("mode", "DROPMALFORMED")  // drop rows that fail to parse
  .load("csv/file/path")
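Since Spark 2.0 the CSV reader is built in, so the com.databricks.spark.csv format string is no longer needed; a minimal equivalent sketch:

val csvDF = spark.read
  .option("header", "true")       // read the first line as column names
  .option("inferSchema", "true")  // optionally infer column types
  .csv("csv/file/path")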
(4) Create from a Hive table
spark.table("test.person")      // format: database.table
  .registerTempTable("person")  // register it as a temporary table
spark.sql(
  """
    |SELECT *
    |FROM person
    |LIMIT 10
  """.stripMargin).show()
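Note that registerTempTable has been deprecated since Spark 2.0; the equivalent with the current API is createOrReplaceTempView:

spark.table("test.person").createOrReplaceTempView("person")
spark.sql("SELECT * FROM person LIMIT 10").show()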
Finally, remember to call spark.stop() to close the SparkSession.
"Sparksql" Create Dataframe