Hive configuration files and the handling of NULL values in joins

Source: Internet
Author: User

I. Setting Hive parameters

Hive parameters can be set in three ways.

1. Configuration files

· User-defined configuration file: $HIVE_CONF_DIR/hive-site.xml

· Default configuration file: $HIVE_CONF_DIR/hive-default.xml

The user-defined configuration overrides the default configuration.

In addition, Hive reads the Hadoop configuration, because Hive is started as a Hadoop client. The Hadoop configuration files include:

· $HADOOP_CONF_DIR/hive-site.xml

· $HADOOP_CONF_DIR/hive-default.xml

The Hive configuration overrides the Hadoop configuration.

Settings made in the configuration files apply to every Hive process started on the local machine.
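The override order described above can be pictured as a stack of lookup layers. The following is a minimal Python sketch of that idea, not Hive's actual loading code; the property values and the helper name `lookup` are assumptions chosen only for illustration.

```python
from collections import ChainMap

# Hypothetical property values, chosen only for illustration.
hadoop_conf = {"mapred.reduce.tasks": "1", "io.sort.mb": "100"}
hive_default = {"mapred.reduce.tasks": "-1", "hive.exec.compress.output": "false"}
hive_site = {"hive.exec.compress.output": "true"}

# ChainMap searches its mappings left to right, so the leftmost layer wins:
# user-defined file, then Hive defaults, then the Hadoop configuration.
effective = ChainMap(hive_site, hive_default, hadoop_conf)


def lookup(name):
    """Return the value from the highest-priority layer that defines name."""
    return effective[name]


print(lookup("hive.exec.compress.output"))  # taken from the user-defined layer
print(lookup("mapred.reduce.tasks"))        # taken from the Hive default layer
print(lookup("io.sort.mb"))                 # only the Hadoop layer defines it
```

A property is resolved by the first layer that defines it, which is why a value in hive-site.xml hides the same property in the default file.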

2. Command-line parameters

bin/hive -hiveconf hive.root.logger=INFO,console

This setting applies to the session started by this command (when Hive is started in server mode, it applies to the sessions of all requests).

3. Parameter declaration with the SET statement

SET mapred.reduce.tasks=100;

The scope of this setting is also the session.
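Session scope means that a value set in one session is not visible in another. The toy model below illustrates this under stated assumptions: the `Session` class and its methods are hypothetical and do not correspond to any Hive API, and the shared starting value is made up.

```python
class Session:
    """Toy model of a session: reads fall back to shared settings, SET stays local."""

    def __init__(self, shared):
        self._shared = shared   # settings common to all sessions (files, command line)
        self._overrides = {}    # values changed by SET inside this session only

    def set(self, name, value):
        self._overrides[name] = value

    def get(self, name):
        return self._overrides.get(name, self._shared.get(name))


shared = {"mapred.reduce.tasks": "-1"}   # hypothetical starting value
a, b = Session(shared), Session(shared)
a.set("mapred.reduce.tasks", "100")      # mirrors: SET mapred.reduce.tasks=100;

print(a.get("mapred.reduce.tasks"))      # session a sees its own override
print(b.get("mapred.reduce.tasks"))      # session b still sees the shared value
```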

II. Points to note when using Hive

1. The character set used by Hive is UTF-8 by default. Hive has no function for converting between character encodings.

The hive.exec.compress.output parameter defaults to false.

In many cases, however, it seems to need to be set explicitly anyway; otherwise the result is compressed. If the output file will later be processed directly in Hadoop, the result must not be compressed.

2. Semantic differences in handling NULL values in joins

Hive's join has special logic here: when the join key fields are compared, null = null is meaningful and evaluates to true. Consider the following query:

SELECT u.uid, COUNT(u.uid)
FROM t_weblog l JOIN t_user u ON (l.uid = u.uid)
GROUP BY u.uid;

In this query, a record in the t_weblog table whose uid is NULL is joined with a record in the t_user table whose uid is NULL. In other words, l.uid = u.uid holds when both sides are NULL.

If the result needs to be consistent with standard SQL semantics, the query must be rewritten to filter out NULL values manually:

SELECT u.uid, COUNT(u.uid)
FROM t_weblog l JOIN t_user u
ON (l.uid = u.uid AND l.uid IS NOT NULL AND u.uid IS NOT NULL)
GROUP BY u.uid;
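To see what the standard semantics look like, the two queries can be tried on an engine that follows them. The sketch below uses Python's built-in sqlite3 module rather than Hive, with made-up rows; SQLite's IS operator is used only to imitate the NULL-matching join described in the text and is not Hive syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t_weblog (uid INTEGER);
    CREATE TABLE t_user   (uid INTEGER);
    INSERT INTO t_weblog VALUES (1), (1), (2), (NULL), (NULL);
    INSERT INTO t_user   VALUES (1), (2), (NULL);
""")

# Standard semantics: NULL = NULL is not true, so NULL keys never match.
standard = conn.execute("""
    SELECT u.uid, COUNT(u.uid)
    FROM t_weblog l JOIN t_user u ON (l.uid = u.uid)
    GROUP BY u.uid ORDER BY u.uid
""").fetchall()

# The same join with the explicit NULL filters from the rewritten query.
filtered = conn.execute("""
    SELECT u.uid, COUNT(u.uid)
    FROM t_weblog l JOIN t_user u
      ON (l.uid = u.uid AND l.uid IS NOT NULL AND u.uid IS NOT NULL)
    GROUP BY u.uid ORDER BY u.uid
""").fetchall()

# Imitation of a NULL-matching join: in SQLite, NULL IS NULL is true.
null_matches = conn.execute("""
    SELECT COUNT(*)
    FROM t_weblog l JOIN t_user u ON (l.uid IS u.uid)
    WHERE l.uid IS NULL
""").fetchone()[0]

print(standard)      # joined groups under standard semantics
print(filtered)      # joined groups with explicit filters
print(null_matches)  # extra joined rows when NULL keys are allowed to match
```

Under standard semantics the explicit filters change nothing, which is the behavior the rewritten query is meant to guarantee on an engine that would otherwise match NULL keys.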

In practice, this semantic difference is also one of the common causes of data skew.
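One plausible way to picture the skew: in a shuffle-based join, rows with the same join key are sent to the same reducer, so a large number of NULL keys piles up on one of them. The sketch below is an illustration with invented numbers; the `partition` function is a hypothetical stand-in, not Hive's real partitioner.

```python
from collections import Counter


def partition(key, num_reducers):
    """Hypothetical partitioner: NULL keys go to reducer 0, others by modulo."""
    return 0 if key is None else key % num_reducers


# Illustrative join keys: 1,000 distinct user ids plus 5,000 rows with a NULL uid.
keys = list(range(1000)) + [None] * 5000

# Count how many rows each of the 10 reducers would receive.
load = Counter(partition(k, 10) for k in keys)
busiest, rows = load.most_common(1)[0]

print(busiest, rows)  # the reducer that receives the NULL keys, and its row count
print(load[1])        # a typical reducer's row count, for comparison
```

Filtering out NULL keys before the join, as in the rewritten query, removes those rows from the shuffle altogether.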
