I. Hive parameter settings
There are three ways to set parameters.
1. Configuration files
· User-defined configuration file: $HIVE_CONF_DIR/hive-site.xml
· Default configuration file: $HIVE_CONF_DIR/hive-default.xml
The user-defined configuration overrides the default configuration.
In addition, because Hive starts as a Hadoop client, it also reads the configuration found in the Hadoop configuration directory:
· $HADOOP_CONF_DIR/hive-site.xml
· $HADOOP_CONF_DIR/hive-default.xml
The Hive configuration overrides the Hadoop configuration.
Settings made in configuration files apply to every Hive process started on the local machine.
2. Command-line parameters
bin/hive -hiveconf hive.root.logger=INFO,console
This setting applies to the session started by this command (when Hive is started in server mode, it applies to the sessions of all requests).
3. Parameter declaration
set mapred.reduce.tasks=100;
This setting is also scoped to the session.
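The three mechanisms above can be read as a lookup chain. The following is a minimal Python sketch of that idea, not Hive's actual implementation. It assumes the precedence order session `set` > command-line `-hiveconf` > hive-site.xml > hive-default.xml > Hadoop configuration, and every value in it is hypothetical except the ones quoted in the text.

```python
from collections import ChainMap

# Hypothetical contents of each configuration layer (illustration only).
hadoop_conf = {"mapred.reduce.tasks": "1"}
hive_default = {"hive.exec.compress.output": "false"}
hive_site = {"hive.exec.compress.output": "true"}      # user-defined override
command_line = {"hive.root.logger": "INFO,console"}    # bin/hive -hiveconf ...
session = {"mapred.reduce.tasks": "100"}               # set mapred.reduce.tasks=100;

# ChainMap searches its maps from left to right, so earlier layers win.
effective = ChainMap(session, command_line, hive_site, hive_default, hadoop_conf)

for key in ("mapred.reduce.tasks", "hive.exec.compress.output", "hive.root.logger"):
    print(key, "=", effective[key])
```

In this model the session value of mapred.reduce.tasks hides the Hadoop value, and the hive-site.xml entry hides the one from hive-default.xml, which matches the override rules described above.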
II. Things to note when using Hive
1. Hive uses UTF-8 as its character set by default. Hive has no function for converting between character encodings.
The hive.exec.compress.output parameter defaults to false.
In practice, however, it often seems necessary to set it explicitly; otherwise the result may still be compressed. If the output file will later be processed directly in Hadoop, it must not be compressed.
2. Semantic differences in how joins handle null values
The special point here is that in a Hive join, when the join key fields are compared, null = null is meaningful and evaluates to true. Consider the following query:
SELECT u.uid, count(u.uid)
FROM t_weblog l JOIN t_user u ON (l.uid = u.uid) GROUP BY u.uid;
In this query, records in the t_weblog table whose uid is null are joined with records in the t_user table whose uid is null. In other words, l.uid = u.uid = null holds.
If the result needs to be consistent with standard semantics, the query has to be rewritten to filter out null values manually:
SELECT u.uid, count(u.uid)
FROM t_weblog l JOIN t_user u
ON (l.uid = u.uid AND l.uid IS NOT NULL AND u.uid IS NOT NULL)
GROUP BY u.uid;
In practice, this semantic difference is also a common cause of data skew.
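The effect on row counts can be sketched with a toy equi-join in Python. This models only the semantics described above and is not how Hive executes a join. The function name `equi_join`, its `nulls_match` flag, and the sample uid lists are all hypothetical.

```python
from collections import defaultdict

def equi_join(left, right, nulls_match):
    """Join two lists of uid values on equality.

    nulls_match=True models the behavior described above (null = null is true);
    nulls_match=False models the rewritten query that filters out null keys.
    """
    buckets = defaultdict(list)
    for uid in right:
        if uid is None and not nulls_match:
            continue  # drop null keys on the right side
        buckets[uid].append(uid)

    pairs = []
    for uid in left:
        if uid is None and not nulls_match:
            continue  # drop null keys on the left side
        for match in buckets.get(uid, []):
            pairs.append((uid, match))
    return pairs

# Hypothetical sample data: many weblog rows carry no uid.
weblog_uids = [1, 2, None, None, None, None]
user_uids = [1, 2, None, None]

with_nulls = equi_join(weblog_uids, user_uids, nulls_match=True)
without_nulls = equi_join(weblog_uids, user_uids, nulls_match=False)
print(len(with_nulls), len(without_nulls))  # prints "10 2"
```

With null keys matching, the four null weblog rows each pair with both null user rows, so 8 of the 10 output rows share the single key null. When one key carries most of the rows like this, the work for that key piles up in one place, which is the skew mentioned above.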