By default, NULL is saved as \ n in the hive table, and you can view the table's source file (Hadoop fs-cat or Hadoop fs-text), where a large amount of \ n is stored in the file,
resulting in a lot of wasted space. And in Java, Python directly into the path to manipulate the source data, the resolution should also be noted. In
addition, in the source file of the hive table, the default column delimiter is \001 (SOH), the line delimiter is \ n (currently only \ n is supported, the other is not available, so you do not need to display the declaration when defining). The delimiter between the elements delimiter \002,map in key and value is \003.
For example, if a record in the source file is:
10000042soh77soh435soh16soh22soh1156120000soh\nsoh\nsoh\nsoh\nsoh\nsoh\nsoh\nsohyoukusoh85133.0soh111
You can see that storing null is a waste of space.
However, the null of hive is sometimes required:
1) The INSERT statement in hive must match the number of columns, does not support no write, and a column without a value must use a null placeholder.
2) separate columns by delimiter in the data file of the Hive table. An empty column holds null (\ n) to preserve the column position. However, if the columns are not sufficient when the external table loads some data, such as Table 13 columns, and the file data has only 2 columns, the remaining columns at the end of the table when the table is queried have no data corresponding to them and are automatically displayed as null.
therefore, null conversion to an empty string, you can save disk space, the implementation method has several
1) the direct designation when the table is built (two ways)
A, the phrase
ROW FORMAT SERDE ' Org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe ' with
serdeproperties (' serialization.null.format ' = ')
implementation, note that both must be used together, such as
CREATE TABLE hive_tb (ID int,name string)
partitioned by (' Day ' string, ' type ' tinyint COMMENT ' 0 as bid, 1 as win, 2 As CK ', ' hour ' tinyint)
ROW FORMAT SERDE ' org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe '
with Serdeproperties (
' field.delim ' = '/t ',
' escape.delim ' = '//',
' serialization.null.format ' = '
) STORED as Textfile;
b, or via row FORMAT delimited NULL DEFINED as '
CREATE TABLE hive_tb (ID int,name string)
partitioned by (' Day ' string, ' type ' tinyint COMMENT ' 0 as bid, 1 as win, 2 As CK ', ' hour ' tinyint)
ROW FORMAT delimited
NULL DEFINED as ' STORED as
textfile;
2) Modifying a table that already exists
ALTER TABLE HIVE_TB set Serdeproperties (' serialization.null.format ' = ');
The space-saving validation results are as follows: Hadoop fs-du/hivedata/warehouse/pmp.db/hive_tb/day=2016-05-14/type=1/hour=00/0* 1137 Hadoo P fs-du/hivedata/warehouse/pmp.db/hive_tb/day=2016-05-14/type=1/hour=01/0* 319753------------------------------ -----Hadoop fs-du/hivedata/warehouse/pmp.db/hive_tb/day=2016-05-14/type=1/hour=00/0* 885 Hadoop Fs-du/hiv edata/warehouse/pmp.db/hive_tb/day=2016-05-14/type=1/hour=01/0* 249529