A DataFrame in Spark SQL is similar to a table in a relational database. Single-table queries and operations from a relational database can be implemented on a DataFrame by calling its API; this article uses the DataFrame API provided for Scala.
The code in this article is written against Spark 1.6.2.
First, generating DataFrame objects
Spark SQL can generate a DataFrame from a variety of data sources: an existing RDD, Parquet files, JSON files, Hive tables, or a JDBC connection to a relational database. This article uses a MySQL database as the data source, generates DataFrame objects from it, and then performs the DataFrame operations on them.
The code used in this article to generate the DataFrames is as follows:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object DataFrameOperations {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("Spark SQL DataFrame Operations").setMaster("local[2]")
    val sparkContext = new SparkContext(sparkConf)
    val sqlContext = new SQLContext(sparkContext)

    val url = "jdbc:mysql://m000:3306/test"
    val jdbcDF = sqlContext.read.format("jdbc").options(
      Map("url" -> url, "user" -> "root", "password" -> "root", "dbtable" -> "spark_sql_test")).load()
    val joinDF1 = sqlContext.read.format("jdbc").options(
      Map("url" -> url, "user" -> "root", "password" -> "root", "dbtable" -> "spark_sql_join1")).load()
    val joinDF2 = sqlContext.read.format("jdbc").options(
      Map("url" -> url, "user" -> "root", "password" -> "root", "dbtable" -> "spark_sql_join2")).load()

    // ......
  }
}
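For comparison, the other data sources mentioned above are read in much the same way. A minimal sketch, assuming Spark 1.6; the file paths and the Person case class are hypothetical, not from the original article:
// Hypothetical paths and names, for illustration only.
// In real code, define the case class outside the method.
case class Person(id: Int, name: String)

val jsonDF = sqlContext.read.json("/path/to/data.json")          // JSON file
val parquetDF = sqlContext.read.parquet("/path/to/data.parquet") // Parquet file
// Hive tables require a HiveContext in Spark 1.6:
// val hiveDF = hiveContext.sql("SELECT * FROM some_hive_table")
import sqlContext.implicits._                                    // enables rdd.toDF()
val rddDF = sparkContext.parallelize(Seq(Person(1, "a"), Person(2, "b"))).toDF()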
Second, actions on a DataFrame object
1. show: display data
Displays the data in jdbcDF as a table, similar in function to select * from spark_sql_test.
The show method can be called in four ways:
(1) show
Displays only the first 20 records.
Example:
jdbcDF.show
Results omitted.
(2) show(numRows: Int)
Displays the first numRows records.
Example:
jdbcDF.show(3)
Results omitted.
(3) show(truncate: Boolean)
Controls whether strings longer than 20 characters are truncated; the default is true.
Example:
jdbcDF.show(true)
jdbcDF.show(false)
Results omitted.
(4) show(numRows: Int, truncate: Boolean)
Combines the two above: the number of records to display and the truncation of long strings.
Example:
jdbcDF.show(3, false)
Results omitted.
2. collect: get all data as an array
Unlike the show method above, collect fetches all the data in jdbcDF and returns it as an Array.
jdbcDF.collect()
In the result, the array contains one element per record of jdbcDF, each represented by a GenericRowWithSchema object that stores both field names and field values.
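Since each element of the array is a Row, individual fields can be read back by name. A minimal sketch, assuming the id field of the spark_sql_test table is an integer:
val rows = jdbcDF.collect()
rows.foreach { row =>
  // getAs reads a field from the Row by name
  val id = row.getAs[Int]("id")
  println(id)
}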
3. collectAsList: get all data as a List
Similar in function to collect, except that the result is returned as a List. It is used as follows:
jdbcDF.collectAsList()
Results omitted.
4. describe(cols: String*): get statistics for specified fields
This method takes one or more field names of type String and returns a DataFrame of statistics, such as count, mean, stddev, min, and max, for fields of numeric type.
It is used as follows, where c1 is a character field, c2 is an integer field, and c4 is a floating-point field:
jdbcDF.describe("c1", "c2", "c4").show()
Results omitted.
5. first, head, take, takeAsList: get a few rows of records
These four methods are quite similar:
(1) first gets the first row
(2) head gets the first row; head(n: Int) gets the first n rows
(3) take(n: Int) gets the first n rows
(4) takeAsList(n: Int) gets the first n rows and returns them as a List
They return a single row as a Row, or several rows as an Array[Row]. first and head behave identically.
take and takeAsList return the fetched data to the driver, so pay attention to the amount of data when using these two methods, to avoid an OutOfMemoryError on the driver.
Usage is sketched below; results omitted.
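A minimal sketch of the four calls and their return types (the row count 3 is arbitrary):
val r1 = jdbcDF.first()       // Row: the first record
val r2 = jdbcDF.head(3)       // Array[Row]: the first 3 records
val r3 = jdbcDF.take(3)       // Array[Row]: the first 3 records
val r4 = jdbcDF.takeAsList(3) // java.util.List[Row]: the first 3 records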
Third, conditional queries and join operations on a DataFrame object
The following methods return DataFrame objects, so they can be chained.
1. where conditions
(1) where(conditionExpr: String): a conditional expression, as after the WHERE keyword in SQL
Takes a filter condition expression, in which and and or can be used, and returns a result of type DataFrame.
Example:
jdbcDF.where("id = 1 or c1 = 'b'").show()
Results omitted.
(2) filter: filter by condition
Takes a filter condition expression and returns a result of type DataFrame. Used the same way as where.
Example:
jdbcDF.filter("id = 1 or c1 = 'b'").show()
Results omitted.
2. Querying specified fields
(1) select: get the values of specified fields
Gets the values of the specified fields according to the field names (of type String) passed in, and returns a DataFrame.
Example:
jdbcDF.select("id", "c3").show(false)
Results omitted.
There is also an overloaded select method that takes parameters of type Column rather than String. With it you can express logic such as select id, id + 1 from test:
jdbcDF.select(jdbcDF("id"), jdbcDF("id") + 1).show(false)
Results omitted.
A Column can be obtained with the apply method or the col method; in general, the apply method is more convenient.
(2) selectExpr: special handling of specified fields
You can call UDF functions directly on specified fields, give them aliases, and so on. Takes parameters of type String and returns a DataFrame.
Example: query the id field, alias the c3 field as time, and round the c4 field:
jdbcDF.selectExpr("id", "c3 as time", "round(c4)").show(false)
Results omitted.
(3) col: get a specified field
Only one field can be obtained at a time; the returned object is of type Column.
val idCol = jdbcDF.col("id")
Results omitted.
(4) apply: get a specified field
Only one field can be obtained at a time; the returned object is of type Column.
Example:
val idCol1 = jdbcDF.apply("id")
val idCol2 = jdbcDF("id")
Results omitted.
(5) drop: remove a specified field, keeping the others
Returns a new DataFrame that no longer contains the removed field. Only one field can be dropped at a time.
Example:
jdbcDF.drop("id")
jdbcDF.drop(jdbcDF("id"))
Results omitted.
3. limit
The limit method gets the first n rows of records and returns a new DataFrame. Unlike take and head, limit is not an action.
jdbcDF.limit(3).show(false)
Results omitted.
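To make that distinction concrete, a minimal sketch contrasting limit with take:
val limited = jdbcDF.limit(3) // transformation: returns a new DataFrame, nothing is fetched yet
val fetched = jdbcDF.take(3)  // action: immediately returns Array[Row] to the driver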
4. ORDER BY
(1) orderBy and sort: sort by specified fields, ascending by default
Example 1: sort by a specified field. A leading - indicates a descending sort. sort is used the same way as orderBy.
jdbcDF.orderBy(- jdbcDF("c4")).show(false)
// or
jdbcDF.orderBy(jdbcDF("c4").desc).show(false)
Results omitted.
Example 2: sort ascending by field name string
jdbcDF.orderBy("c4").show(false)
Results omitted.
(2) sortWithinPartitions
Similar to the sort method above; the difference is that sortWithinPartitions returns a DataFrame sorted within each partition.
5. GROUP BY
(1) groupBy: group by fields, as with group by
groupBy can be called in two ways: you can pass in field names of type String, or objects of type Column.
It is used as follows:
jdbcDF.groupBy("c1")
jdbcDF.groupBy(jdbcDF("c1"))
(2) cube and rollup: extensions of group by
Similar in function to group by cube/rollup in SQL; details omitted, but a usage sketch follows.
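For reference only, a minimal sketch of cube and rollup against the same test table (the choice of grouping columns is illustrative):
// counts for every (c1, c2) combination plus all subtotal levels
jdbcDF.cube("c1", "c2").count().show()
// counts for (c1, c2) with subtotals rolled up from right to left
jdbcDF.rollup("c1", "c2").count().show()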
(3) The GroupedData object
The groupBy method returns an object of type GroupedData, whose API provides the usual group by operations, for example:
max(colNames: String*): gets the maximum value of the specified fields, or of all numeric fields, within each group; numeric fields only
min(colNames: String*): gets the minimum value of the specified fields, or of all numeric fields, within each group; numeric fields only
mean(colNames: String*): gets the average of the specified fields, or of all numeric fields, within each group; numeric fields only
sum(colNames: String*): gets the sum of the specified fields, or of all numeric fields, within each group; numeric fields only
count(): gets the number of elements in each group
Example results for count and max are omitted; a usage sketch follows this list.
The more complicated members are the following two methods: agg, which is described in detail below and can perform aggregation operations on specified fields, and pivot.
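A minimal usage sketch of the simpler GroupedData methods listed above, against the same test table:
val grouped = jdbcDF.groupBy("c1") // GroupedData
grouped.count().show()             // rows per group
grouped.max("c2", "c4").show()     // per-group maxima of the numeric fields
grouped.mean("c4").show()          // per-group average of c4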
6. distinct
(1) distinct: returns a DataFrame without duplicate records
Returns the rows of the current DataFrame with duplicates removed. This is the same as calling the dropDuplicates() method below without specifying any fields.
Example:
jdbcDF.distinct()
Results omitted.
(2) dropDuplicates: deduplicate according to specified fields
Removes duplicates according to the specified fields, similar to select distinct a, b.
Example:
jdbcDF.dropDuplicates(Seq("c1"))
Results omitted.
7. Aggregation
Aggregation operations call the agg method, which has several overloads and is generally used in conjunction with the groupBy method (see the sketch below).
The simplest, most direct usage is the following example, which takes the maximum of the id field and the sum of the c4 field.
jdbcDF.agg("id" -> "max", "c4" -> "sum")
Results omitted.
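Combined with groupBy, the same pair syntax aggregates within each group; a minimal sketch using the same field names:
// max(id) and sum(c4) within each c1 group
jdbcDF.groupBy("c1").agg("id" -> "max", "c4" -> "sum").show()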
8. Union
unionAll method: combines two DataFrames
Similar to the UNION ALL operation in SQL.
Example:
jdbcDF.unionAll(jdbcDF.limit(1))
Results omitted.
9. Join
Now for the highlight. join is heavily used in SQL, and DataFrame provides join functionality as well.
The following introduces the join methods. Six overloads of join are available on DataFrame.
(1) Cartesian product
joinDF1.join(joinDF2)
(2) using one field
The following join is similar to the form a join b using column1 and requires the two DataFrames to have a column with the same name.
joinDF1.join(joinDF2, "id")
joinDF1 and joinDF2 are joined on the field id. The result is omitted; note that the using field is displayed only once.
(3) using multiple fields
Besides the single-field using case above, you can also join using multiple fields, as follows:
joinDF1.join(joinDF2, Seq("id", "name"))
(4) specifying the join type
A join of two DataFrames can be of type inner, outer, left_outer, right_outer, or leftsemi. In the multi-field using case above, you can pass a third parameter of type String to specify the join type, as shown below.
joinDF1.join(joinDF2, Seq("id", "name"), "inner")
(5) joining with a Column-type condition
If you want to specify the join fields flexibly, rather than through the using pattern, you can use the following form:
joinDF1.join(joinDF2, joinDF1("id") === joinDF2("t1_id"))
The results are omitted.
(6) specifying the join type together with a Column-type condition
As shown below:
joinDF1.join(joinDF2, joinDF1("id") === joinDF2("t1_id"), "inner")
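The Column-condition form accepts any of the join types listed earlier; a minimal sketch using left_outer:
// keep every row of joinDF1; fill nulls where joinDF2 has no match
joinDF1.join(joinDF2, joinDF1("id") === joinDF2("t1_id"), "left_outer").show(false)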
10. Getting statistics for specified fields
The stat method can compute statistics for a specified field, or between specified fields, such as variance, covariance, and so on. It returns an object of type DataFrameStatFunctions.
The following code shows the values in the c1 field whose frequency of occurrence is above 30%. In jdbcDF, the contents of field c1 are "a, b, a, c, d, b"; a and b each occur with frequency 2/6, which is greater than 0.3.
jdbcDF.stat.freqItems(Seq("c1"), 0.3).show()
The results are omitted.
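Since the section mentions variance and covariance, here is a minimal sketch of two other DataFrameStatFunctions methods (pairing c2 with c4 is illustrative):
val covariance = jdbcDF.stat.cov("c2", "c4")   // sample covariance of the two numeric columns
val correlation = jdbcDF.stat.corr("c2", "c4") // Pearson correlation coefficient
println(s"cov=$covariance, corr=$correlation")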
11. Getting the records shared by two DataFrames
The intersect method computes the records that appear in both DataFrames.
jdbcDF.intersect(jdbcDF.limit(1)).show(false)
The results are omitted.
12. Getting the records in one DataFrame that are not in another
The except method computes the records in the current DataFrame that do not appear in the other.
Example:
jdbcDF.except(jdbcDF.limit(1)).show(false)
The results are omitted.
13. Operating on field names
(1) withColumnRenamed: rename a specified field in the DataFrame
If the specified field name does not exist, nothing is done. The following example renames the field id in jdbcDF to idx.
jdbcDF.withColumnRenamed("id", "idx")
The results are omitted.
(2) withColumn: add a new column to the current DataFrame
The withColumn(colName: String, col: Column) method adds a new column to the DataFrame with the given colName; if colName already exists, the existing column is overwritten.
The following code adds a column named id2 to jdbcDF:
jdbcDF.withColumn("id2", jdbcDF("id")).show(false)
The results are omitted.
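The new column can also be computed from existing ones; a minimal sketch (the derived expression is illustrative):
// add a column derived from id; an existing id2 column would be overwritten
jdbcDF.withColumn("id2", jdbcDF("id") + 1).show(false)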
14. Row to columns
Sometimes you need to split the content of a field and generate multiple rows from it; the explode method can be used for this.
In the following code, the contents of the c3 field are split on spaces, and the split contents are stored in a new field c3_, as shown below:
jdbcDF.explode("c3", "c3_") { time: String => time.split(" ") }
The results are omitted.