"Spark" dataframe common operations


Spark's DataFrame is derived from the RDD class, but it provides far more powerful data manipulation capabilities, most notably SQL-like operations.

A situation you will often meet in real work looks like this: two datasets need to be filtered, merged, and stored back.

The limit function takes only the first n rows of a dataset, so it is applied right after the dataset is loaded to extract a fixed number of rows.

Merging uses the unionAll function, and re-storage means calling registerTempTable to register the result as a temporary table and then writing it to Hive.

One has to marvel at the power of DataFrame.

Example: to obtain a class-balanced training set, we need the same number of training samples from each of two datasets, so these functions come into play.

scala> val fes = hiveContext.sql(sqlss)
fes: org.apache.spark.sql.DataFrame = [caller_num: string, is_sr: int, call_count: int, avg_talk_time: double, max_talk_time: int, min_talk_time: int, called_num_count: int, called_lsd: double, null_called_count: int]

scala> val fcount = fes.count()
fcount: Long = 4371029

scala> val zcfea = hiveContext.sql(sqls2)
zcfea: org.apache.spark.sql.DataFrame = [caller_num: string, is_sr: int, call_count: int, avg_talk_time: double, max_talk_time: int, min_talk_time: int, called_num_count: int, called_lsd: double, null_called_count: int]

scala> val zcount = zcfea.count()
zcount: Long = 14208117

scala> val f01 = fes.limit(25000)
f01: org.apache.spark.sql.DataFrame = [caller_num: string, is_sr: int, call_count: int, avg_talk_time: double, max_talk_time: int, min_talk_time: int, called_num_count: int, called_lsd: double, null_called_count: int]

scala> val f02 = zcfea.limit(25000)
f02: org.apache.spark.sql.DataFrame = [caller_num: string, is_sr: int, call_count: int, avg_talk_time: double, max_talk_time: int, min_talk_time: int, called_num_count: int, called_lsd: double, null_called_count: int]

scala> val ff = f01.unionAll(f02)
ff: org.apache.spark.sql.DataFrame = [caller_num: string, is_sr: int, call_count: int, avg_talk_time: double, max_talk_time: int, min_talk_time: int, called_num_count: int, called_lsd: double, null_called_count: int]

scala> ff.registerTempTable("ftable01")

scala> hiveContext.sql("create table shtrainfeature as select * from ftable01")
res1: org.apache.spark.sql.DataFrame = []
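The same workflow can also be packaged as a standalone Spark 1.x job instead of a REPL session. The following is a minimal sketch, assuming a HiveContext; the two source queries (sqlss and sqls2 above) are not shown in the post, so the select statements here are hypothetical placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object BalancedTrainingSet {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("BalancedTrainingSet"))
    val hiveContext = new HiveContext(sc)

    // Hypothetical stand-ins for the post's sqlss / sqls2 query strings.
    val fes   = hiveContext.sql("select * from positive_samples")
    val zcfea = hiveContext.sql("select * from negative_samples")

    // Take the same number of rows from each side so the classes are balanced.
    val n  = 25000
    val ff = fes.limit(n).unionAll(zcfea.limit(n))

    // Register the merged result and write it back to Hive.
    ff.registerTempTable("ftable01")
    hiveContext.sql("create table shtrainfeature as select * from ftable01")
  }
}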

Finally, here is a summary of common DataFrame operations and their usage:

Functions of the DataFrame
Actions
1. collect(): returns an Array[Row] containing all rows of the DataFrame
2. collectAsList(): returns a java.util.List[Row] containing all rows of the DataFrame
3. count(): returns the number of rows as a Long
4. describe(cols: String*): returns a table of summary statistics (count, mean, stddev, min, max) for the given columns; multiple column names can be passed, separated by commas; null values do not participate in the calculation, and only numeric fields are covered. For example: df.describe("age", "height").show()
5. first(): returns the first row, of type Row
6. head(): returns the first row, of type Row
7. head(n: Int): returns the first n rows, as an Array[Row]
8. show(): prints the first 20 rows of the DataFrame by default; the return type is Unit
9. show(n: Int): prints the first n rows; the return type is Unit
10. take(n: Int): returns the first n rows, as an Array[Row]
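A quick runnable sketch of these actions, assuming a Spark 1.x SQLContext running locally; the Person case class and its sample rows are made up for illustration:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ActionsDemo {
  case class Person(name: String, age: Int, height: Int)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ActionsDemo").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(Seq(
      Person("Alice", 30, 165),
      Person("Bob", 25, 180),
      Person("Carol", 35, 170))).toDF()

    val rows = df.collect()              // all rows as Array[Row]
    val list = df.collectAsList()        // all rows as java.util.List[Row]
    println(df.count())                  // 3
    df.describe("age", "height").show()  // count/mean/stddev/min/max of the numeric columns
    println(df.first())                  // first Row; same as df.head()
    df.show()                            // prints at most 20 rows
    df.show(2)                           // prints the first 2 rows
    val firstTwo = df.take(2)            // first 2 rows as Array[Row]
  }
}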

Basic operations of the DataFrame
1. cache(): caches the DataFrame in memory (same as persist() with the default storage level)
2. columns: returns an Array[String] containing the names of all columns
3. dtypes: returns an Array[(String, String)] pairing each column name with its type
4. explain(): prints the physical execution plan
5. explain(extended: Boolean): takes false or true and returns Unit; the default is false; passing true prints both the logical and the physical plans
6. isLocal: returns a Boolean; true if the DataFrame can be evaluated locally (collect and take run without executors), otherwise false
7. persist(newLevel: StorageLevel): returns DataFrame.this.type; persists the DataFrame with the given storage level
8. printSchema(): prints the field names and types as a tree
9. registerTempTable(tableName: String): returns Unit; registers the DataFrame as a temporary table; the table is deleted together with the object that created it
10. schema: returns a StructType describing each field's name and type
11. toDF(): returns a new DataFrame
12. toDF(colNames: String*): returns a new DataFrame with its columns renamed to the given names
13. unpersist(): returns DataFrame.this.type; removes the cached data
14. unpersist(blocking: Boolean): returns DataFrame.this.type; removes the cached data, blocking until all blocks are deleted when blocking is true
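A short sketch of the basic operations, reusing the df from the previous example (Spark 1.x):

import org.apache.spark.storage.StorageLevel

df.cache()                                 // same as persist() with the default storage level
println(df.columns.mkString(", "))         // name, age, height
df.dtypes.foreach { case (n, t) => println(s"$n: $t") }  // column name/type pairs
df.explain()                               // physical plan only
df.explain(true)                           // logical and physical plans
println(df.isLocal)
df.printSchema()                           // schema as a tree
df.registerTempTable("people")             // now queryable via sqlContext.sql(...)
println(df.schema)                         // the StructType
val renamed = df.toDF("name2", "age2", "height2")  // rename all columns
df.unpersist()                             // drop the cached data
df.persist(StorageLevel.MEMORY_AND_DISK)   // cache again with an explicit level
df.unpersist(true)                         // blocking removal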

Query operations:
1. agg(exprs: Column*): returns a DataFrame, aggregating with the given expressions
   df.agg(max("age"), avg("salary"))
   df.groupBy().agg(max("age"), avg("salary"))
2. agg(exprs: Map[String, String]): returns a DataFrame; the same aggregation expressed as a Map from column name to aggregate function
   df.agg(Map("age" -> "max", "salary" -> "avg"))
   df.groupBy().agg(Map("age" -> "max", "salary" -> "avg"))
3. agg(aggExpr: (String, String), aggExprs: (String, String)*): returns a DataFrame; the same aggregation expressed as (column, function) pairs
   df.agg("age" -> "max", "salary" -> "avg")
   df.groupBy().agg("age" -> "max", "salary" -> "avg")
4. apply(colName: String): returns the Column with the given name
5. as(alias: String): returns a new DataFrame that is the original with an alias attached
6. col(colName: String): returns the Column with the given name
7. cube(col1: String, cols: String*): returns a GroupedData, for multi-dimensional aggregation over the given fields
8. distinct: returns a DataFrame with duplicate rows removed
9. drop(col: Column): returns a DataFrame with the given column removed
10. dropDuplicates(colNames: Array[String]): returns a DataFrame with rows that duplicate the given columns removed
11. except(other: DataFrame): returns a DataFrame containing the rows that are in the current DataFrame but not in the other
12. explode[A, B](inputColumn: String, outputColumn: String)(f: A => TraversableOnce[B]): returns a DataFrame, splitting one field into multiple rows
    df.explode("name", "names") { name: String => name.split(" ") }.show()
    splits the name field on spaces and puts the pieces into the names column
13. filter(conditionExpr: String): selects the matching rows, returning a DataFrame; df.filter("age > 10").show(), df.filter(df("age") > 10).show() and df.where(df("age") > 10).show() all work
14. groupBy(col1: String, cols: String*): groups by the given fields, returning a GroupedData; df.groupBy("age").agg(Map("age" -> "count")).show(); df.groupBy("age").avg().show()
15. intersect(other: DataFrame): returns a DataFrame containing the rows that exist in both DataFrames
16. join(right: DataFrame, joinExprs: Column, joinType: String):
    the first argument is the DataFrame to join with, the second the join condition, and the third the join type: inner, outer, left_outer, right_outer, leftsemi
    df.join(ds, df("name") === ds("name") and df("age") === ds("age"), "outer").show()
17. limit(n: Int): returns a DataFrame containing the first n rows
18. na: returns a DataFrameNaFunctions for handling missing data; df.na.drop().show() deletes rows containing nulls
19. orderBy(sortExprs: Column*): sorts by the given expressions (an alias for sort)
20. select(cols: Column*): projects a set of columns; df.select($"colA", $"colB" + 1)
21. selectExpr(exprs: String*): projects using SQL expressions; df.selectExpr("name", "name as names", "upper(name)", "age + 1").show()
22. sort(sortExprs: Column*): sorts; df.sort(df("age").desc).show(); the default order is ascending
23. unionAll(other: DataFrame): merges two DataFrames; df.unionAll(ds).show()
24. withColumnRenamed(existingName: String, newName: String): renames a column; df.withColumnRenamed("name", "names").show()
25. withColumn(colName: String, col: Column): adds a column; df.withColumn("aa", df("name")).show()
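A combined sketch of the query operations above, again on toy data; the column names and values here are illustrative, and sc and sqlContext are assumed to be set up as in the earlier examples:

import org.apache.spark.sql.functions.{max, avg}
import sqlContext.implicits._

val df = sc.parallelize(Seq(("Alice", 30, 1000.0), ("Bob", 25, 800.0), ("Alice", 30, 1000.0)))
  .toDF("name", "age", "salary")
val ds = sc.parallelize(Seq(("Alice", 30), ("Dave", 40))).toDF("name", "age")

df.agg(max("age"), avg("salary")).show()                          // whole-frame aggregate
df.agg(Map("age" -> "max", "salary" -> "avg")).show()             // same, via a Map
df.groupBy("name").agg("age" -> "max", "salary" -> "avg").show()  // per-group, via pairs
df.distinct.show()                                                // removes the duplicate Alice row
df.drop(df("salary")).show()
df.filter("age > 26").show()                                      // same as df.filter(df("age") > 26)
df.join(ds, df("name") === ds("name") and df("age") === ds("age"), "outer").show()
df.limit(2).show()
df.na.drop().show()                                               // drop rows containing nulls
df.select(df("name"), df("age") + 1).show()
df.selectExpr("name", "name as names", "upper(name)", "age + 1").show()
df.sort(df("age").desc).show()                                    // default order is ascending
df.unionAll(ds.withColumn("salary", ds("age") * 1.0)).show()      // schemas must line up
df.withColumnRenamed("name", "names").show()
df.withColumn("aa", df("name")).show()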

This article is from the Sparkexpert CSDN blog; for the full text see CSDN article 51042970.
