org.apache.spark.sql.functions Summary


org.apache.spark.sql.functions is an object that provides about 200+ functions, most of them analogous to Hive's built-in functions. Apart from UDFs, they can be used directly in Spark SQL, and after import org.apache.spark.sql.functions._ they can also be used on DataFrames and Datasets. As of version 2.3.0, most functions that accept a column also accept the column name as a String. The return type of these functions is almost always Column. All of the functions are listed below, grouped by category.

Aggregate functions

- approx_count_distinct: approximate distinct count
- avg / mean: average value
- collect_list: aggregates the values of the specified field into a list
- collect_set: aggregates the values of the specified field into a set
- corr: computes the Pearson correlation coefficient of two columns
- count: count
- countDistinct: distinct count; SQL usage: SELECT COUNT(DISTINCT class)
- covar_pop: population covariance
- covar_samp: sample covariance
- first: first element of each group
- last: last element of each group
- grouping, grouping_id
- kurtosis: computes the kurtosis value
- skewness: computes the skewness value
- max: maximum value
- min: minimum value
- stddev: alias for stddev_samp
- stddev_samp: sample standard deviation
- stddev_pop: population standard deviation
- sum: sum
- sumDistinct: sum of distinct values; SQL usage: SELECT SUM(DISTINCT class)
- var_pop: population variance
- var_samp: sample (unbiased) variance
- variance: alias for var_samp

Collection functions

- array_contains(column, value): checks whether the array-type field contains the specified element
- explode: expands an array or map into multiple rows
- explode_outer: like explode, but yields null when the array or map is empty or null
- posexplode: like explode, with a positional index
- posexplode_outer: like explode_outer, with a positional index
- from_json: parses a JSON string into a StructType or ArrayType; there are several parameter forms, see the documentation
- to_json: converts to a JSON string; supports StructType, ArrayType of StructTypes, MapType, and ArrayType of MapTypes
- get_json_object(column, path): returns the JSON object string at the given JSON path, e.g. SELECT get_json_object('{"a": 1, "b": 2}', '$.a'); see [Introduction to JSON Path](http://blog.csdn.net/koflance/article/details/63262484)
- json_tuple(column, fields): returns the values of the specified fields in the JSON, e.g. SELECT json_tuple('{"a": 1, "b": 2}', 'a', 'b')
- map_keys: returns the keys of a map as an array
- map_values: returns the values of a map as an array
- size: length of an array or map
- sort_array(e: Column, asc: Boolean): sorts the elements of the array in natural order, ascending by default
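A minimal sketch of a few of these aggregate and collection functions in action (the data and column names are made up for illustration; assumes a SparkSession with spark.implicits._ in scope):

```scala
import org.apache.spark.sql.functions._

val df = Seq(("a", 1), ("a", 2), ("b", 2)).toDF("class", "score")

// Aggregates: distinct count, average, and collecting values into a list.
df.groupBy($"class")
  .agg(countDistinct($"score"), avg($"score"), collect_list($"score"))
  .show()

// explode: one output row per array element.
Seq(("x", Seq(1, 2, 3))).toDF("id", "nums")
  .select($"id", explode($"nums").as("num"))
  .show()

// get_json_object: extract a value by JSON path.
Seq("""{"a": 1, "b": 2}""").toDF("js")
  .select(get_json_object($"js", "$.a").as("a"))
  .show()
```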
Time functions

- add_months(startDate: Column, numMonths: Int): the date numMonths months after startDate
- date_add(start: Column, days: Int): the date days days after start, e.g. SELECT date_add('2018-01-01', 3)
- date_sub(start: Column, days: Int): the date days days before start
- datediff(end: Column, start: Column): the number of days between the two dates
- current_date(): the current date
- current_timestamp(): the current timestamp, of TimestampType
- date_format(dateExpr: Column, format: String): date formatting
- dayofmonth(e: Column): day of the month; supports date/timestamp/string
- dayofyear(e: Column): day of the year; supports date/timestamp/string
- weekofyear(e: Column): week of the year; supports date/timestamp/string
- from_unixtime(ut: Column, f: String): timestamp to formatted string
- from_utc_timestamp(ts: Column, tz: String): UTC timestamp to a timestamp in the given time zone
- to_utc_timestamp(ts: Column, tz: String): timestamp in the given time zone to a UTC timestamp
- hour(e: Column): extracts the hour
- minute(e: Column): extracts the minute
- month(e: Column): extracts the month
- quarter(e: Column): extracts the quarter
- second(e: Column): extracts the second
- year(e: Column): extracts the year
- last_day(e: Column): the last day of the month of the given date
- months_between(date1: Column, date2: Column): the number of months between two dates
- next_day(date: Column, dayOfWeek: String): the next Monday, Tuesday, etc. after the given date; dayOfWeek accepts "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"
- to_date(e: Column): converts the field to DateType
- trunc(date: Column, format: String): truncates the date to the given unit
- unix_timestamp(s: Column, p: String): time string in the given format to a timestamp
- unix_timestamp(s: Column): same, with the default format yyyy-MM-dd HH:mm:ss
- unix_timestamp(): the current timestamp in seconds; the underlying implementation is unix_timestamp(current_timestamp(), yyyy-MM-dd HH:mm:ss)
- window(timeColumn: Column, windowDuration: String, slideDuration: String, startTime: String): time window function; buckets the given time (TimestampType) into windows
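A minimal sketch exercising a few of the date helpers above (made-up dates; assumes a SparkSession with spark.implicits._ in scope):

```scala
import org.apache.spark.sql.functions._

val dates = Seq("2018-01-01", "2018-02-15").toDF("s")
  .select(to_date($"s").as("d"))

dates.select(
  $"d",
  date_add($"d", 3).as("plus3Days"),        // 3 days later
  add_months($"d", 1).as("plus1Month"),     // 1 month later
  last_day($"d").as("monthEnd"),            // last day of that month
  next_day($"d", "Mon").as("nextMonday"),   // next Monday after d
  datediff(current_date(), $"d").as("daysAgo")
).show()

// unix_timestamp round trip: string -> seconds since epoch -> reformatted string.
Seq("2018-01-01 12:00:00").toDF("t")
  .select(from_unixtime(unix_timestamp($"t"), "yyyy/MM/dd HH:mm").as("formatted"))
  .show()
```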
Math functions

- cos, sin, tan: cosine, sine, tangent of an angle
- cosh, sinh, tanh: hyperbolic cosine, sine, tangent
- acos, asin, atan, atan2: inverse trigonometric functions
- bin: converts a long to the string of its binary representation; for example, bin(12) returns "1100"
- bround: rounds using the HALF_EVEN decimal mode: a fractional part > 0.5 rounds up, < 0.5 rounds down, and exactly 0.5 rounds to the nearest even number
- round(e: Column, scale: Int): rounds using the HALF_UP mode to scale decimal places: a fractional part >= 0.5 rounds up, < 0.5 rounds down, i.e. ordinary rounding
- ceil: rounds up
- floor: rounds down
- cbrt: computes the cube root of the given value
- conv(num: Column, fromBase: Int, toBase: Int): converts a number (given as a string) from one base to another
- log(base: Double, a: Column): $\log_{base}(a)$
- log(a: Column): $\log_e(a)$
- log10(a: Column): $\log_{10}(a)$
- log2(a: Column): $\log_{2}(a)$
- log1p(a: Column): $\log_e(a+1)$
- pmod(dividend: Column, divisor: Column): returns the positive value of dividend mod divisor
- pow(l: Double, r: Column): $l^r$; note that r is the Column
- pow(l: Column, r: Double): $l^r$; note that l is the Column
- pow(l: Column, r: Column): $l^r$; note that both l and r are Columns
- radians(e: Column): converts an angle in degrees to radians
- rint(e: Column): returns the double value that is closest to the argument and equal to a mathematical integer
- shiftLeft(e: Column, numBits: Int): left shift
- shiftRight(e: Column, numBits: Int): right shift
- shiftRightUnsigned(e: Column, numBits: Int): right shift (unsigned)
- signum(e: Column): returns the sign of the number
- sqrt(e: Column): square root
- hex(column: Column): converts to a hexadecimal string
- unhex(column: Column): the inverse of hex

Misc functions

- crc32(e: Column): computes the CRC32 checksum, returns a bigint
- hash(cols: Column*): computes a hash code, returns an int
- md5(e: Column): computes the MD5 digest, returns a 32-character hex string
- sha1(e: Column): computes the SHA-1 digest, returns a 40-character hex string
- sha2(e: Column, numBits: Int): computes the SHA-2 digest, returns a numBits-bit digest as a hex string; numBits must be 224, 256, 384, or 512

Non-aggregate functions

- abs(e: Column): absolute value
- array(cols: Column*): merges multiple columns into an array; the columns must all be of the same type
- map(cols: Column*): organizes multiple columns into a map; the input columns must come in (key, value) pairs, and the key columns and value columns must each share one type
- bitwiseNOT(e: Column): computes the bitwise NOT
- broadcast[T](df: Dataset[T]): Dataset[T]: marks df for broadcast, used for broadcast joins, e.g. left.join(broadcast(right), "joinKey")
- coalesce(e: Column*): returns the first non-null value
- col(colName: String): returns the Column corresponding to colName
- column(colName: String): alias of the col function
- expr(expr: String): parses the expression and returns the result as a Column
- greatest(exprs: Column*): returns the maximum value across multiple columns, skipping nulls
- least(exprs: Column*): returns the minimum value across multiple columns, skipping nulls
- input_file_name(): returns the file name of the current task
- isnan(e: Column): checks whether the value is NaN (not a number)
- isnull(e: Column): checks whether the value is null
- lit(literal: Any): creates a Column from a literal value
- typedLit[T](literal: T)(implicit arg0: TypeTag[T]): creates a Column from a literal; supports Scala types such as List, Seq and Map
- monotonically_increasing_id(): returns a monotonically increasing unique ID as a 64-bit integer; the IDs of different partitions are not contiguous
- nanvl(col1: Column, col2: Column): returns col2 if col1 is NaN
- negate(e: Column): negation, same as df.select(-df("amount"))
- not(e: Column): logical negation, same as df.filter(!df("isActive"))
- rand(): random number in [0.0, 1.0)
- rand(seed: Long): random number in [0.0, 1.0) with the given seed
- randn(): random number drawn from the standard normal distribution
- randn(seed: Long): same, with the given seed
- spark_partition_id(): returns the partition ID
- struct(cols: Column*): combines multiple columns into a new struct column
- when(condition: Column, value: Any): returns value when the condition is true, e.g. people.select(when(people("gender") === "male", 0).when(people("gender") === "female", 1).otherwise(2)); if there is no otherwise and all conditions fail, it returns null
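A minimal sketch covering a few of the math, misc, and non-aggregate functions above (made-up data; assumes a SparkSession with spark.implicits._ in scope):

```scala
import org.apache.spark.sql.functions._

val nums = Seq(-2.5, 0.5, 9.0).toDF("x")

nums.select(
  $"x",
  pow($"x", 2.0).as("squared"),     // x^2
  sqrt(abs($"x")).as("sqrtAbs"),    // square root of |x|
  round(lit(2.5)).as("halfUp"),     // HALF_UP: 2.5 -> 3
  bround(lit(2.5)).as("halfEven"),  // HALF_EVEN: 2.5 -> 2
  when($"x" < 0, "negative").otherwise("non-negative").as("sign")
).show()

// Misc digests and a per-row unique ID.
Seq("spark").toDF("s").select(
  md5($"s"),                        // 32-character hex string
  sha2($"s", 256),                  // 64-character hex string
  crc32($"s"),
  monotonically_increasing_id().as("id")
).show(false)
```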
Sort functions

- asc(columnName: String): ascending order
- asc_nulls_first(columnName: String): ascending order, nulls first
- asc_nulls_last(columnName: String): ascending order, nulls last
- desc, desc_nulls_first, desc_nulls_last: the corresponding descending functions
- e.g. df.sort(asc("dept"), desc("age"))

String functions

- ascii(e: Column): the ASCII code of the first character
- base64(e: Column): Base64 encoding
- unbase64(e: Column): Base64 decoding
- concat(exprs: Column*): concatenates multiple string columns
- concat_ws(sep: String, exprs: Column*): concatenates multiple string columns using sep as the separator
- decode(value: Column, charset: String): decodes using the given charset
- encode(value: Column, charset: String): encodes using the given charset; the supported charsets are 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16'
- format_number(x: Column, d: Int): formats a number as a string of the form '#,###,###.##'
- format_string(format: String, arguments: Column*): formats the arguments according to format, printf-style
- initcap(e: Column): capitalizes the first letter of each word
- lower(e: Column): converts to lowercase
- upper(e: Column): converts to uppercase
- instr(str: Column, substring: String): position of the first occurrence of substring in str
- length(e: Column): string length
- levenshtein(l: Column, r: Column): computes the edit distance (Levenshtein distance) between two strings
- locate(substr: String, str: Column): position of the first occurrence of substr in str, counting from 1; 0 means not found
- locate(substr: String, str: Column, pos: Int): same as above, but searches from position pos
- lpad(str: Column, len: Int, pad: String): left padding, filling str to length len with the pad string; the corresponding rpad pads on the right
- ltrim(e: Column): strips whitespace from the left; rtrim is the right-hand counterpart
- ltrim(e: Column, trimString: String): strips the given characters from the left; rtrim is the right-hand counterpart
- trim(e: Column, trimString: String): strips the given characters from both sides
- trim(e: Column): strips whitespace from both sides
- regexp_extract(e: Column, exp: String, groupIdx: Int): extracts the matched regex group
- regexp_replace(e: Column, pattern: Column, replacement: Column): replaces the matched part; here the arguments are Columns
- regexp_replace(e: Column, pattern: String, replacement: String): replaces the matched part
- repeat(str: Column, n: Int): repeats str n times
- reverse(str: Column): reverses str
- soundex(e: Column): computes the Soundex code; used for indexing English names, so that words with the same pronunciation but different spellings map to the same code
- split(str: Column, pattern: String): splits str by pattern
- substring(str: Column, pos: Int, len: Int): the substring of length len starting at position pos
- substring_index(str: Column, delim: String, count: Int): returns the substring from str before count occurrences of the delimiter delim; if count is positive, everything to the left of the final delimiter (counting from the left) is returned; if count is negative, everything to the right of the final delimiter (counting from the right) is returned; substring_index performs a case-sensitive match when searching for delim
- translate(src: Column, matchingString: String, replaceString: String): replaces the characters of matchingString found in src with the corresponding characters of replaceString
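A minimal sketch of a few string and sort functions (made-up data; assumes a SparkSession with spark.implicits._ in scope):

```scala
import org.apache.spark.sql.functions._

val people = Seq(("alice", "New York"), ("bob", null)).toDF("name", "city")

people.select(
  initcap($"name").as("name"),            // capitalize the first letter
  lpad($"name", 8, "*").as("padded"),     // left-pad to 8 chars with '*'
  substring($"name", 1, 2).as("prefix"),  // first two characters
  concat_ws(", ", $"name", $"city").as("label"),
  levenshtein($"name", $"city").as("editDist")
).orderBy(asc_nulls_last("name")).show()
```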
UDF functions

User-defined functions.

- callUDF(udfName: String, cols: Column*): calls a registered UDF, e.g.

```scala
import org.apache.spark.sql._

// Assumes spark.implicits._ is in scope for the $ column syntax.
val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
val spark = df.sparkSession
spark.udf.register("simpleUDF", (v: Int) => v * v)
df.select($"id", callUDF("simpleUDF", $"value"))
```

- udf: defines a UDF

Window functions

- cume_dist(): the cumulative distribution of values within a window partition
- currentRow(): returns the special frame boundary that represents the current row in the window partition
- rank(): the rank of the data item within its group; ties leave a gap in the ranking: 1, 2, 2, 4
- dense_rank(): the rank of the data item within its group; ties leave no gap in the ranking: 1, 2, 2, 3
- row_number(): the row number, giving each record a number: 1, 2, 3, 4
- percent_rank(): the relative rank (i.e. percentile) of rows within a window partition
- lag(e: Column, offset: Int, defaultValue: Any): the value that is offset rows before the current row
- lead(e: Column, offset: Int, defaultValue: Any): the value that is offset rows after the current row
- ntile(n: Int): the ntile group ID (from 1 to n inclusive) in an ordered window partition
- unboundedFollowing(): returns the special frame boundary that represents the last row in the window partition
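A minimal sketch of the ranking window functions (made-up data; assumes a SparkSession with spark.implicits._ in scope):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val scores = Seq(
  ("sales", "Ann", 90), ("sales", "Bob", 90), ("sales", "Cid", 80),
  ("hr", "Dee", 70)
).toDF("dept", "name", "score")

// Partition by department, highest score first.
val w = Window.partitionBy($"dept").orderBy(desc("score"))

scores.select(
  $"dept", $"name", $"score",
  rank().over(w).as("rank"),              // ties leave gaps: 1, 1, 3
  dense_rank().over(w).as("denseRank"),   // ties leave no gaps: 1, 1, 2
  row_number().over(w).as("rowNum"),      // 1, 2, 3
  lag($"score", 1, 0).over(w).as("prev")  // previous row's score, default 0
).show()
```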

