DataFrame Operations in Spark SQL

Source: Internet
Author: User

A DataFrame in Spark SQL is similar to a relational database table: single-table queries and operations from a relational database can be implemented on a DataFrame through its API. See the DataFrame API provided for Scala for reference.

The code in this article was written against the Spark 1.6.2 documentation.

First, generating a DataFrame object

Spark SQL can generate DataFrame objects from other RDD objects, Parquet files, JSON files, Hive tables, and JDBC connections to relational databases. This article uses a MySQL database as the data source, generates the DataFrame objects, and then performs the DataFrame operations on them.
The DataFrames used in this article are generated as follows:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object DataFrameOperations {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("Spark SQL DataFrame Operations").setMaster("local[2]")
    val sparkContext = new SparkContext(sparkConf)
    val sqlContext = new SQLContext(sparkContext)

    val url = "jdbc:mysql://m000:3306/test"
    val jdbcDF = sqlContext.read.format("jdbc").options(
      Map("url" -> url, "user" -> "root", "password" -> "root", "dbtable" -> "spark_sql_test")).load()
    val joinDF1 = sqlContext.read.format("jdbc").options(
      Map("url" -> url, "user" -> "root", "password" -> "root", "dbtable" -> "spark_sql_join1")).load()
    val joinDF2 = sqlContext.read.format("jdbc").options(
      Map("url" -> url, "user" -> "root", "password" -> "root", "dbtable" -> "spark_sql_join2")).load()
    // ... ...
  }
}
Second, actions on the DataFrame object

1. show: display data

Displays the data in jdbcDF as a table, similar to the effect of select * from spark_sql_test.
The show method can be called in four ways, namely:
(1) show
Only the first 20 records are displayed.
Example:

jdbcDF.show

Results:
  

(2) show(numRows: Int)
Displays the first numRows records.
Example:

jdbcDF.show(3)

Results:
  

(3) show(truncate: Boolean)
Whether to truncate strings longer than 20 characters; the default is true (truncate).
Example:

jdbcDF.show(true)
jdbcDF.show(false)

Results:
  

(4) show(numRows: Int, truncate: Boolean)
Combines the two previous forms, specifying both the number of records displayed and how long strings are shown.
Example:

jdbcDF.show(3, false)

Results:
  

2. collect: Gets all the data to the array

Unlike the show method above, collect fetches all the data in jdbcDF and returns it as an Array object.

jdbcDF.collect()

In the result, the array contains every record of jdbcDF, each represented by a GenericRowWithSchema object that stores both field names and field values.
  

3. collectAsList: Get all data to list

Its function is similar to collect, except that the result is returned as a List object. It is used as follows:

jdbcDF.collectAsList()

The results are as follows
  

4. describe(cols: String*): Gets the statistics for the specified field

This method takes one or more field names of String type and returns a DataFrame containing statistics of the numeric fields, namely count, mean, stddev, min, and max.
It is used as follows; here the c1 field is of character type, c2 is an integer, and c4 is floating-point:

jdbcDF.describe("c1", "c2", "c4").show()

The results are as follows
  

5. first, head, take, takeAsList: Get several rows of records

The four methods listed here are similar:
(1) first gets the first row.
(2) head gets the first row; head(n: Int) gets the first n rows.
(3) take(n: Int) gets the first n rows.
(4) takeAsList(n: Int) gets the first n rows and returns them as a List.
They return a single Row or several rows as an Array[Row]; first and head behave identically. take and takeAsList return the fetched data to the driver, so pay attention to the amount of data involved when using them, to avoid an OutOfMemoryError on the driver.

Usage and results omitted.
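The difference between these calls can be sketched with plain Scala collections, without a cluster; the rows below are hypothetical stand-ins for jdbcDF's (id, c1) records, not the article's actual data:

```scala
// Hypothetical rows standing in for jdbcDF's (id, c1) records.
val rows = Seq((1, "a"), (2, "b"), (3, "a"), (4, "c"))

val firstRow = rows.head // like jdbcDF.first / jdbcDF.head: a single Row
val top2 = rows.take(2)  // like jdbcDF.take(2): the first two rows
// takeAsList(2) would return the same rows as a java.util.List instead.

println(firstRow) // (1,a)
println(top2)     // List((1,a), (2,b))
```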

Third, condition queries and joins on the DataFrame object

The methods below return a DataFrame, so the calls can be chained.

1. Where conditions are relevant

(1) where(conditionExpr: String): the condition that follows the WHERE keyword in SQL
Takes a filter-condition expression, in which and and or can be used, and returns a result of DataFrame type.
Example:

jdbcDF.where("id = 1 or c1 = 'b'").show()

Results
  

(2) filter : Filter by Field
Takes a filter-condition expression and returns a result of DataFrame type. Its usage is the same as where.
Example:

jdbcDF.filter("id = 1 or c1 = 'b'").show()

Results
  

2. Query the specified field

(1) select : Gets the specified field value
Gets the values of the specified fields according to the passed-in String field names, returned as a DataFrame.
Example:

jdbcDF.select("id", "c3").show(false)

Results:
  

There is also an overloaded select method that takes parameters of Column type instead of String. With it, logic such as select id, id + 1 from test can be implemented:

jdbcDF.select(jdbcDF("id"), jdbcDF("id") + 1).show(false)

Results:
  

A Column can be obtained with the apply method as well as the col method; in general, apply is more convenient.

(2) selectExpr: allows special handling of the specified fields
You can call UDF functions directly on the specified fields, specify aliases, and so on. It takes parameters of String type and returns a DataFrame object.
Example: query the id field, alias the c3 field as time, and round the c4 field:

jdbcDF.selectExpr("id", "c3 as time", "round(c4)").show(false)

Results
  

(3) col : Gets the specified field
Gets only one field; the returned object is of Column type.
Example:

val idCol = jdbcDF.col("id")

Results omitted.

(4) apply : Gets the specified field
Gets only one field, returning an object of Column type.
Example:

val idCol1 = jdbcDF.apply("id")
val idCol2 = jdbcDF("id")

Results slightly.

(5) drop : Remove the specified field and keep the other fields
Returns a new DataFrame object that no longer contains the removed field; only one field can be dropped at a time.
Example:

jdbcDF.drop("id")
jdbcDF.drop(jdbcDF("id"))

Results:
  

3. Limit

The limit method gets the first n rows of the DataFrame as a new DataFrame object. Unlike take and head, limit is not an action; it is evaluated lazily.

jdbcDF.limit(3).show( false)

Results
  

4. ORDER BY

(1) orderBy and sort: sort by the specified fields, ascending by default
Example 1: sort by the specified field. A leading - means descending order. sort is used the same way as orderBy.

jdbcDF.orderBy(- jdbcDF("c4")).show(false)
// or
jdbcDF.orderBy(jdbcDF("c4").desc).show(false)

Results
  

Example 2: sort ascending by a field-name string

jdbcDF.orderBy("c4").show(false)

Results
  

(2) sortWithinPartitions
Similar to the sort method above, except that the sortWithinPartitions method sorts within each partition, returning a DataFrame that is sorted per partition.
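The distinction can be sketched with plain Scala collections, treating nested vectors as hypothetical partitions of c4 values:

```scala
// Two hypothetical partitions of c4 values.
val partitions = Vector(Vector(3, 1, 2), Vector(6, 4, 5))

// sortWithinPartitions: each partition is sorted independently...
val perPartition = partitions.map(_.sorted)
// ...whereas orderBy / sort produce a single global ordering.
val global = partitions.flatten.sorted

println(perPartition) // Vector(Vector(1, 2, 3), Vector(4, 5, 6))
println(global)       // Vector(1, 2, 3, 4, 5, 6)
```

Because it never shuffles rows between partitions, sortWithinPartitions is cheaper than a global sort when a per-partition ordering is enough.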

5. GROUP BY

(1) groupBy: perform a group by on the specified fields
The groupBy method can be called in two ways: with field names of String type, or with objects of Column type.
Use the following method:

jdbcDF.groupBy("c1")
jdbcDF.groupBy(jdbcDF("c1"))

(2) cube and rollup: extensions of group by
Functionally similar to group by cube/rollup in SQL; details omitted.
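The grouping sets these two expansions produce can be sketched in plain Scala (no Spark needed; the column names c1 and c2 are placeholders): rollup over (c1, c2) groups by (c1, c2), (c1), and the empty grouping, while cube groups by every subset of the columns.

```scala
// rollup over cols: the prefixes, from the full column list down to the empty grouping.
def rollupSets(cols: List[String]): List[List[String]] =
  (cols.length to 0 by -1).map(cols.take).toList

// cube over cols: every subset of the columns.
def cubeSets(cols: List[String]): List[List[String]] =
  cols.foldRight(List(List.empty[String])) { (c, acc) => acc.map(c :: _) ++ acc }

println(rollupSets(List("c1", "c2"))) // List(List(c1, c2), List(c1), List())
println(cubeSets(List("c1", "c2")))   // 4 grouping sets, including List(c2)
```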

(3) The GroupedData object
The groupBy method returns a GroupedData object, whose API provides the following operations on the grouped data, for example:

    • max(colNames: String*): gets the maximum of the specified fields (or of all numeric fields) within each group; numeric fields only
    • min(colNames: String*): gets the minimum of the specified fields (or of all numeric fields) within each group; numeric fields only
    • mean(colNames: String*): gets the average of the specified fields (or of all numeric fields) within each group; numeric fields only
    • sum(colNames: String*): gets the sum of the specified fields (or of all numeric fields) within each group; numeric fields only
    • count(): gets the number of elements in each group

      Examples of running results:
        count
        max

      Aggregations more complex than these are handled by the following two methods:
      agg: described in the aggregation section below; it can perform aggregation operations on specified fields.
      pivot: pivots the grouped data on a column.

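The per-group statistics above can be sketched on plain Scala collections; the (c1, c4) pairs below are hypothetical stand-ins for jdbcDF rows:

```scala
// Hypothetical (c1, c4) pairs standing in for jdbcDF rows, grouped by c1.
val grouped = Seq(("a", 1.0), ("a", 3.0), ("b", 2.0)).groupBy(_._1)

// Like groupBy("c1") followed by count / max / sum / mean over c4.
val stats = grouped.map { case (key, grp) =>
  val vs = grp.map(_._2)
  key -> (vs.size, vs.max, vs.sum, vs.sum / vs.size)
}

println(stats("a")) // (2,3.0,4.0,2.0)
println(stats("b")) // (1,2.0,2.0,2.0)
```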
6. distinct

(1) distinct: returns a DataFrame without duplicate rows
Returns the non-duplicated rows of the current DataFrame. It is equivalent to calling the dropDuplicates() method below with no fields specified.
Example:

jdbcDF.distinct()

Results
  

(2) dropDuplicates: de-duplicate by the specified fields
Removes duplicates according to the specified fields, similar to a select distinct a, b operation.
Example:

jdbcDF.dropDuplicates(Seq("c1"))

Results:
  
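A plain-Scala sketch of the dropDuplicates semantics, over hypothetical (id, c1) rows; like Spark, dropDuplicates does not specify which duplicate survives, so for determinism this sketch keeps the row with the smallest id:

```scala
val dupRows = Seq((1, "a"), (2, "b"), (3, "a"))

// Like dropDuplicates(Seq("c1")): keep one row per distinct c1 value.
val deduped = dupRows.groupBy(_._2).map(_._2.minBy(_._1)).toSeq.sortBy(_._1)

println(deduped.size) // 2: (3,"a") is dropped because c1 = "a" already appeared
```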

7. Aggregation

Aggregation operations are performed with the agg method, which has several overloads and is generally used together with the groupBy method.
The simplest, most direct usage is shown below, taking the maximum of the id field and the sum of the c4 field:

jdbcDF.agg("id" -> "max", "c4" -> "sum")

Results:
  

8. Union

The unionAll method combines two DataFrames.
Similar to the UNION ALL operation in SQL.
Example:

jdbcDF.unionAll(jdbcDF.limit(1))

Results:
  

9. Join

Here comes the highlight: join operations are used heavily in SQL, and DataFrame provides a join function as well.
The join method has six overloads in DataFrame, introduced one by one below.
(1) Cartesian product

joinDF1.join(joinDF2)

(2) using one field
The following join is similar to a join b using column1 in SQL and requires the two DataFrames to have a column with the same name:

joinDF1.join(joinDF2, "id")

joinDF1 and joinDF2 are joined on the id field. The result is as follows; the using column is shown only once.
  

(3) using multiple fields
Besides using a single field as above, you can also join using multiple fields, as follows:

joinDF1.join(joinDF2, Seq("id", "name"))

(4) specifying the join type
A join of two DataFrames supports the types inner, outer, left_outer, right_outer, and leftsemi. In the multiple-using-fields case above, you can pass a third String parameter to specify the join type, as shown below:

joinDF1.join(joinDF2, Seq("id", "name"), "inner")

(5) using a Column join condition
If you want to specify the join fields flexibly rather than using the using pattern, you can use the following form:

joinDF1.join(joinDF2, joinDF1("id") === joinDF2("t1_id"))

The results are as follows
  

(6) specifying a Column join condition together with the join type
As shown below:

joinDF1.join(joinDF2 , joinDF1("id" ) === joinDF2( "t1_id"), "inner")
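The difference between the join types can be sketched with plain Scala collections; the rows below are hypothetical, with the right side keyed by t1_id as in the example above:

```scala
val leftRows  = Seq((1, "a"), (2, "b"), (3, "c")) // like joinDF1: (id, name)
val rightById = Map(1 -> "x", 2 -> "y")           // like joinDF2, keyed by t1_id

// inner join: only ids present on both sides survive.
val inner = leftRows.collect { case (id, n) if rightById.contains(id) => (id, n, rightById(id)) }
// left_outer join: every left row, with None where the right side has no match.
val leftOuter = leftRows.map { case (id, n) => (id, n, rightById.get(id)) }

println(inner)     // List((1,a,x), (2,b,y))
println(leftOuter) // List((1,a,Some(x)), (2,b,Some(y)), (3,c,None))
```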

10. Get statistics for the specified fields

The stat method computes statistics on a specified field or between specified fields, such as variance and covariance, and returns a DataFrameStatFunctions object.
The following code fetches the values of the c1 field whose frequency of occurrence exceeds 30%. In jdbcDF, the c1 field contains "a, b, a, c, d, b", where a and b each occur with frequency 2/6, which is greater than 0.3.

jdbcDF.stat.freqItems(Seq("c1"), 0.3).show()

The results are as follows:
  
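A plain-Scala sketch of the frequency filter that freqItems performs (note the real freqItems uses an approximate algorithm rather than an exact count):

```scala
// The c1 values from the example: "a" and "b" each appear 2/6 ≈ 0.33 > 0.3.
val c1 = Seq("a", "b", "a", "c", "d", "b")
val support = 0.3

val frequent = c1.groupBy(identity)
  .collect { case (v, hits) if hits.size.toDouble / c1.size > support => v }
  .toSet

println(frequent) // Set(a, b)
```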

11. Get the records common to two DataFrames

The intersect method computes the records that appear in both DataFrames.

jdbcDF.intersect(jdbcDF.limit(1)).show(false)

The results are as follows:
  

12. Get the records in one DataFrame that are not in another DataFrame

The except method computes the records present in one DataFrame but absent from the other.

Example:

jdbcDF.except(jdbcDF.limit(1)).show(false)

The results are as follows
  
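intersect and except correspond to set intersection and difference over rows, which plain Scala collections can illustrate directly:

```scala
val all = Seq((1, "a"), (2, "b"), (3, "c"))
val one = Seq((1, "a"))

println(all.intersect(one)) // like jdbcDF.intersect(...): List((1,a))
println(all.diff(one))      // like jdbcDF.except(...):    List((2,b), (3,c))
```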

13. Operating on field names

(1) withColumnRenamed: rename the specified field in the DataFrame
If the specified field name does not exist, nothing is changed. The following example renames the id field of jdbcDF to idx.

jdbcDF.withColumnRenamed( "id" , "idx" )

The results are as follows:
  

(2) withColumn: add a new column to the current DataFrame
The withColumn(colName: String, col: Column) method adds a new column named colName to the DataFrame; if colName already exists, the existing column is overwritten.
The following code adds a column named id2 to jdbcDF:

jdbcDF.withColumn("id2", jdbcDF("id")).show( false)

The results are as follows
  

14. Splitting one row into multiple rows

Sometimes you need to split the contents of a field into multiple rows; the explode method can be used for this.
In the following code, the contents of the c3 field are split on spaces, and the split pieces are stored in a new field c3_, as shown below:

jdbcDF.explode("c3", "c3_") { time: String => time.split(" ") }

The results are as follows
  
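explode is essentially a flatMap: each element returned by the function becomes its own output row. A plain-Scala sketch with hypothetical (id, c3) rows:

```scala
// Hypothetical (id, c3) rows; c3 holds space-separated tokens.
val timeRows = Seq((1, "2015 03"), (2, "2016 04"))

// Like jdbcDF.explode("c3", "c3_"): one output row per token, with c3_ appended.
val exploded = timeRows.flatMap { case (id, c3) =>
  c3.split(" ").map(tok => (id, c3, tok))
}

println(exploded.size) // 4: two input rows, each split into two
```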
