1. Creating a DataFrame from structured data
public static void main(String[] args) {
    SparkConf conf = new SparkConf()
            .setAppName("DataFrameCreate");
    JavaSparkContext sc = new JavaSparkContext(conf);
    SQLContext sqlContext = new SQLContext(sc);
    // Structured data (JSON) is loaded directly into a DataFrame
    DataFrame df = sqlContext.read().json("hdfs://spark1:9000/students.json");
    df.show();
}
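For reference, Spark's json() reader expects newline-delimited JSON, one object per line. The original does not show the file, but a hypothetical students.json compatible with the example above might look like this (names and values are illustrative only):

```json
{"id": 1, "name": "leo", "age": 18}
{"id": 2, "name": "jack", "age": 19}
{"id": 3, "name": "marry", "age": 17}
```

Each line must be a complete JSON object; a single pretty-printed JSON array spanning multiple lines would not parse correctly with this reader.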
2. Two ways to create a DataFrame from an RDD
(Data source: students.txt)
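The original does not show the contents of students.txt. Based on how the code below splits each line, a hypothetical file would be comma-separated with the fields id, name, age (sample values are illustrative only):

```text
1,leo,18
2,jack,19
3,marry,17
```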
2.1 Create a DataFrame by reflection, when the element type is a known JavaBean:
public static void main(String[] args) {
    SparkConf conf = new SparkConf()
            .setMaster("local")
            .setAppName("RDD2DataFrameReflection");
    JavaSparkContext sc = new JavaSparkContext(conf);
    SQLContext sqlContext = new SQLContext(sc);
    JavaRDD<String> lines = sc.textFile("D://students.txt");
    // Convert lines to JavaRDD<Student>
    JavaRDD<Student> students = lines.map(new Function<String, Student>() {
        private static final long serialVersionUID = 1L;
        @Override
        public Student call(String line) throws Exception {
            String[] strSplits = line.split(",");
            Student stu = new Student();
            stu.setId(Integer.valueOf(strSplits[0]));
            stu.setName(strSplits[1]);
            stu.setAge(Integer.valueOf(strSplits[2]));
            return stu;
        }
    });
    // Convert the RDD to a DataFrame using reflection.
    // This requires that the JavaBean implement the Serializable interface.
    // Create the DataFrame from the Student class's schema and the RDD
    DataFrame studentsDF = sqlContext.createDataFrame(students, Student.class);
    studentsDF.show();
}
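The reflection approach above assumes a Student JavaBean, which the original never shows. A minimal sketch of such a bean, following the requirements stated in the code comments (public class, getter/setter pairs for each field, implements Serializable), could look like this; the field names determine the DataFrame's column names:

```java
import java.io.Serializable;

// Hypothetical Student JavaBean assumed by the reflection example.
// createDataFrame(rdd, Student.class) derives the schema (id, name, age)
// from these getter/setter pairs via reflection.
public class Student implements Serializable {
    private static final long serialVersionUID = 1L;

    private int id;
    private String name;
    private int age;

    public int getId() { return id; }
    public void setId(int id) { this.id = id; }

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }

    public int getAge() { return age; }
    public void setAge(int age) { this.age = age; }
}
```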
2.2 Create a DataFrame by building the schema manually:
public static void main(String[] args) {
    // ... omitting the creation of the SQLContext
    JavaRDD<String> lines = sc.textFile("D://students.txt");
    // First step: convert the plain RDD into a JavaRDD<Row>
    JavaRDD<Row> rowRDD = lines.map(new Function<String, Row>() {
        private static final long serialVersionUID = 1L;
        @Override
        public Row call(String line) throws Exception {
            String[] strArray = line.split(",");
            Row row = RowFactory.create(
                    Integer.valueOf(strArray[0]),   // id
                    strArray[1],                    // name
                    Integer.valueOf(strArray[2]));  // age
            return row;
        }
    });
    // Second step: create the metadata, i.e. the schema
    List<StructField> structFields = new ArrayList<StructField>();
    structFields.add(DataTypes.createStructField("id", DataTypes.IntegerType, true));
    structFields.add(DataTypes.createStructField("name", DataTypes.StringType, true));
    structFields.add(DataTypes.createStructField("age", DataTypes.IntegerType, true));
    StructType structType = DataTypes.createStructType(structFields);
    // Convert the JavaRDD<Row> to a DataFrame using the schema
    DataFrame studentDF = sqlContext.createDataFrame(rowRDD, structType);
    studentDF.show();
}
Converting DataFrame -> RDD -> List:
JavaRDD<Row> rows = studentDF.javaRDD();
List<Row> studentList = rows.collect();
3. Basic usage of DataFrame
// Print all the data in the DataFrame (select * from ...)
df.show();
// Print the DataFrame's metadata (schema)
df.printSchema();
// Query all the data of one column
df.select("name").show();
// Query several columns and compute on a column
df.select(df.col("name"), df.col("age").plus(1)).show();
// Filter rows based on the value of a column (age > 18)
df.filter(df.col("age").gt(18)).show();
// Group by a column, then aggregate
df.groupBy(df.col("age")).count().show();
DataFrame studentDF = sqlContext.createDataFrame(rowRDD, structType);
studentDF.show();
// Register the DataFrame as a temporary table named "students"
studentDF.registerTempTable("students");
// Run a SQL query against the "students" temporary table
DataFrame oldStudentDF = sqlContext.sql("select * from students where age > 18");
oldStudentDF.show();