Spark SQL operates on "tables", which come either from Hive or from temporary tables. If a table comes from Hive, its schema (column names, column types, and so on) was fixed when the table was created, so Spark SQL can normally read its data directly. If a table is a temporary table, two questions have to be answered: (1) where does the temporary table's data come from? (2) what is the temporary table's schema? From the official Spark documentation we can see that creating a temporary table requires two things: (1) an RDD holding the data, and (2) a schema describing that data. In other words, we apply the schema to the RDD that holds the data, and then register the result as a temporary table. Throughout this process the most important thing is the data type: it directly affects how Spark SQL computes and whether the results are correct. The data types currently supported by pyspark.sql.types are: NullType, StringType, BinaryType, BooleanType, DateType, TimestampType, DecimalType, DoubleType, FloatType, ByteType, IntegerType, LongType, ShortType, ArrayType, MapType, and StructType (with StructField). Among them, ArrayType, MapType, and StructType are called "composite types"; the rest are "basic types", and composite types are built out of basic types. The basic types correspond to Python data types as shown in the following table (a minimal sketch of these pairings in use follows the table):
Spark SQL type | Python type
NullType | None
StringType | basestring
BinaryType | bytearray
BooleanType | bool
DateType | datetime.date
TimestampType | datetime.datetime
DecimalType | decimal.Decimal
DoubleType | float (double precision floats)
FloatType | float (single precision floats)
ByteType | int (a signed integer)
IntegerType | int (a signed 32-bit integer)
LongType | long (a signed 64-bit integer)
ShortType | int (a signed 16-bit integer)
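The following is a minimal sketch (not from the original article) of how these pairings look in practice, assuming Spark 1.3+ where the types live in pyspark.sql.types; the application name, column names, and sample values are made up for illustration.

# -*- coding: utf-8 -*-
import datetime

from pyspark import SparkContext
from pyspark.sql import HiveContext
# On Spark 1.2 these types live in pyspark.sql instead of pyspark.sql.types.
from pyspark.sql.types import (StringType, BooleanType, DateType,
                               TimestampType, DoubleType, IntegerType,
                               StructType, StructField)

sc = SparkContext(appName="type_mapping_demo")
hc = HiveContext(sc)

# One row whose Python values match the mapping table, column by column.
row = ("spark", True,
       datetime.date(2015, 9, 22),                  # DateType
       datetime.datetime(2015, 9, 22, 9, 39, 45),   # TimestampType
       2.718,                                       # DoubleType
       42)                                          # IntegerType
rdd = sc.parallelize([row])

schema = StructType([
    StructField("c_string",    StringType(),    False),
    StructField("c_boolean",   BooleanType(),   False),
    StructField("c_date",      DateType(),      False),
    StructField("c_timestamp", TimestampType(), False),
    StructField("c_double",    DoubleType(),    False),
    StructField("c_int",       IntegerType(),   False),
])

# applySchema is the Spark 1.x API used in this article;
# on Spark 1.3+ createDataFrame(rdd, schema) does the same job.
table = hc.applySchema(rdd, schema)
table.registerTempTable("type_mapping_demo")
print hc.sql("SELECT * FROM type_mapping_demo").collect()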
We now describe how these data types are used in Spark SQL.

1. Numeric types (ByteType, ShortType, IntegerType, LongType, FloatType, DoubleType, DecimalType). Numeric types fall into two groups. The integer types are ByteType, ShortType, IntegerType, and LongType; when using them you need to pay attention to the range each one can represent. The floating-point types are FloatType, DoubleType, and DecimalType; with these you must watch both the representable range and the precision. We use the common type IntegerType to illustrate how a numeric type is used:

a. Simulate one row of two columns of data and convert it into an RDD named source with the parallelize method; this RDD holds the data.
b. Create a schema that specifies each column's name, type, and whether it may contain null values. The schema as a whole is represented by a StructType, and each column's properties (name, type, nullability) are represented by a StructField. The first column is named col1, has type IntegerType, and may not contain null values (False); the second column is named col2, has type IntegerType, and may not contain null values (False). (Note: in practice the columns do not all need to have the same type.)
c. Apply the schema to the RDD source with the applySchema method; this yields a SchemaRDD (an RDD with a schema) named table.
d. Register the SchemaRDD table as a table named temp_table.

We have now gone through the whole process of creating the RDD, creating the schema, and registering the table; from here on, this table (temp_table) can be analyzed with Spark (Hive) SQL. The other numeric types are used in the same way. (A minimal code sketch of steps a-d appears after the range table below.) Incidentally, the "one row, two columns" of data in this example are exactly the bounds of the IntegerType range, [-2147483648, 2147483647]; the ranges of the other numeric types are as follows:
Type | Range
ByteType | [-128, 127]
ShortType | [-32768, 32767]
IntegerType | [-2147483648, 2147483647]
LongType | [-9223372036854775808, 9223372036854775807]
FloatType | [1.4E-45, 3.4028235E38]
DoubleType | [4.9E-324, 1.7976931348623157E308]
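Below is a minimal sketch of steps a-d, assuming a Spark 1.x HiveContext; as in the text, the two sample values are the bounds of the IntegerType range and the table is registered as temp_table.

from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql.types import IntegerType, StructType, StructField

sc = SparkContext(appName="integertype_demo")
hc = HiveContext(sc)

# a. one row, two columns of data, turned into the RDD "source"
source = sc.parallelize([(-2147483648, 2147483647)])

# b. the schema: both columns are IntegerType and may not contain null
schema = StructType([
    StructField("col1", IntegerType(), False),
    StructField("col2", IntegerType(), False),
])

# c. apply the schema to the RDD (createDataFrame on Spark 1.3+)
table = hc.applySchema(source, schema)

# d. register the SchemaRDD as a temporary table
table.registerTempTable("temp_table")

print hc.sql("SELECT col1, col2 FROM temp_table").collect()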
As the range table above shows, even though we write our programs in Python, these data types have the same ranges as byte, short, int, long, float, and double in Java. This is because Spark is implemented in Scala and Scala runs on the Java virtual machine, so the Spark SQL types ByteType, ShortType, IntegerType, LongType, FloatType, and DoubleType are actually backed by Java's byte, short, int, long, float, and double (and DecimalType, as we will see, by Java's BigDecimal). When writing a Spark application in Python, keep this in mind and choose data types appropriate for the data being analyzed, so as to avoid anomalies caused by overflow of the input data. Choosing the right type, however, only prevents overflow in the input data; it cannot prevent overflow that may occur during computation. Suppose we change the sample data in the example above to (9223372036854775807, 9223372036854775807) and the column type to LongType; the sample values are now the maximum value LongType can represent. If we add the two columns together, will there be an overflow? The output shows that the result is exactly what we expected: since col1 and col2 are both LongType, col1 + col2 is also LongType (the reason is explained below), but the mathematical value of col1 + col2, 18446744073709551614, exceeds the range LongType can represent ([-9223372036854775808, 9223372036854775807]), so overflow is inevitable. Because we are using HiveContext (SQLContext is not currently recommended), a natural idea is to cast to Hive's bigint, but the output is unchanged. To understand why, we need to look at the ranges of Hive's numeric types: Hive's bigint has the same range as LongType. After all, Hive is implemented in Java, so we can guess that Hive's tinyint, smallint, int, bigint, float, and double correspond one to one with Java's byte, short, int, long, float, and double (this is only a conjecture; I have not verified it against the source code). Casting the LongType data to bigint therefore does not help, because the two have the same range. So how do we solve the overflow problem? Notice the last numeric type in Hive: DECIMAL, introduced in Hive 0.11.0, with user-defined "precision" and "scale" supported from Hive 0.13.0. DECIMAL is based on Java's BigDecimal, which represents immutable arbitrary-precision decimal numbers; it supports the usual mathematical operations (+, -, *, /) and UDFs (floor, ceil, round, and so on), and it can also be cast to other numeric types. When using decimal, pay attention to the choice of "precision" and "scale". The ranges of Java's BigDecimal (and BigInteger, mentioned later) are in theory limited only by (virtual) memory, and they are relatively memory-hungry, so we need to choose values appropriate to our actual needs, subject to: integer digits (precision - scale) + fractional digits (scale) = precision. The maximum number of digits LongType can represent is 19, which is what causes the overflow in our example, so we cast the values to decimal with precision 38 and scale 0, and then we get the correct result. Note that the type of the computed result also becomes decimal. (A sketch of the overflow and the decimal cast follows.)
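A sketch of this overflow scenario, reusing the hc context and imports from the previous sketch; the decimal(38, 0) cast assumes Hive 0.13+ where precision and scale can be specified.

from pyspark.sql.types import LongType, StructType, StructField

source = sc.parallelize([(9223372036854775807, 9223372036854775807)])
schema = StructType([
    StructField("col1", LongType(), False),
    StructField("col2", LongType(), False),
])
hc.applySchema(source, schema).registerTempTable("temp_table")

# col1 + col2 is still LongType, so the 64-bit sum wraps around (to -2)
print hc.sql("SELECT col1 + col2 FROM temp_table").collect()

# Casting to bigint does not help: bigint has the same 64-bit range.
# Casting to decimal(38, 0) gives the exact sum, and the result column
# itself is a decimal.
print hc.sql(
    "SELECT cast(col1 AS decimal(38, 0)) + cast(col2 AS decimal(38, 0)) "
    "FROM temp_table").collect()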
DecimalType (Python). When writing a Spark application in Python, PySpark also provides DecimalType. It is special in that its data is not a Python built-in type: you have to import the decimal module. There are two things to note when using DecimalType: (1) when creating the RDD, the data must be built with Decimal from the decimal module; (2) in a Spark 1.2.0 environment DecimalType raises an exception (java.lang.ClassCastException: java.math.BigDecimal cannot be cast to org.apache.spark.sql.catalyst.types.decimal.Decimal); it works normally in Spark 1.5.0, but the module name must be changed from pyspark.sql to pyspark.sql.types.

We explicitly specify the type of each column, so what determines the type of the result of the usual mathematical operations (+, -, *, /)? These operations are actually implemented as UDFs in Hive (org.apache.hadoop.hive.ql.exec.FunctionRegistry): (1) +, (2) -, (3) *, (4) /, (5) %. From their overloads we can see that "+", "-", "*", and "%" support the data types byte, short, int, long, float, double, and decimal, and the result type is the same as the input type; "/" only supports double and decimal. This means: (1) the operations "+", "-", "*", and "%" may trigger an implicit conversion (for example int + long becomes long + long); (2) the operation "/" always converts its inputs to double or decimal, so its result is always of type double or decimal.

2. Time types (DateType, TimestampType). DateType can be understood as year, month, and day, and TimestampType as year, month, day, hour, minute, and second; they correspond to Python's datetime.date and datetime.datetime respectively.

3. StringType, BooleanType, BinaryType, NullType. These types are used in much the same way and need no separate explanation; note only that BinaryType corresponds to Python's bytearray.

4. Composite data types (ArrayType, MapType, StructType). There are three composite types: array (ArrayType), dictionary (MapType), and struct (StructType). An array (ArrayType) requires all of its elements to have the same type; a dictionary (MapType) requires all keys to share one type and all values to share one type, although the key type and the value type may differ; the elements of a struct (StructType) may have different types. (1) ArrayType requires the element type to be specified. (2) MapType requires the key type and the value type to be specified. (3) StructType contains elements of different types, and for each element, in order, a name and a data type must be specified. (A combined sketch covering DecimalType, the time types, BinaryType, and the composite types appears after the summary below.)

In summary, Spark (Hive) SQL provides us with a rich set of data types. We need to choose the appropriate type (basic or composite) according to the actual data being analyzed, paying particular attention to each type's representable range (and precision) and to handling possible data overflow.
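To close, here is a combined sketch, assuming Spark 1.5 (where DecimalType is usable from Python and the types live in pyspark.sql.types): one row that exercises DecimalType built from decimal.Decimal, the time types, BinaryType, and the three composite types. All column names and values are made up for illustration.

# -*- coding: utf-8 -*-
import datetime
from decimal import Decimal

from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql.types import (DecimalType, DateType, TimestampType,
                               BinaryType, IntegerType, StringType,
                               ArrayType, MapType, StructType, StructField)

sc = SparkContext(appName="types_demo")
hc = HiveContext(sc)

row = (
    Decimal("18446744073709551614"),           # DecimalType: built with decimal.Decimal
    datetime.date(2015, 9, 22),                # DateType: year, month, day
    datetime.datetime(2015, 9, 22, 9, 39, 45), # TimestampType: down to the second
    bytearray("spark"),                        # BinaryType: Python bytearray
    [1, 2, 3],                                 # ArrayType: all elements share one type
    {"a": 1, "b": 2},                          # MapType: one key type, one value type
    ("nested", 7),                             # StructType: element types may differ
)
rdd = sc.parallelize([row])

schema = StructType([
    StructField("c_decimal",   DecimalType(38, 0),                   False),
    StructField("c_date",      DateType(),                           False),
    StructField("c_timestamp", TimestampType(),                      False),
    StructField("c_binary",    BinaryType(),                         False),
    StructField("c_array",     ArrayType(IntegerType()),             False),
    StructField("c_map",       MapType(StringType(), IntegerType()), False),
    StructField("c_struct",    StructType([
        StructField("name",  StringType(),  False),
        StructField("value", IntegerType(), False),
    ]), False),
])

table = hc.createDataFrame(rdd, schema)   # applySchema on older 1.x releases
table.registerTempTable("temp_table")

# "+" on a decimal column keeps the decimal type; array, map and struct
# fields can be addressed with [index], ['key'] and dot syntax in HiveQL.
print hc.sql("SELECT c_decimal + 1, c_array[0], c_map['a'], c_struct.name "
             "FROM temp_table").collect()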