Apache Hive cannot collect stats

Last Update:2015-05-14 Source: Internet

Author: User

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

Environment:
Hive: apache-hive-1.1.0
Hadoop: hadoop-2.5.0-cdh5.3.2
Hive metadata and stats are stored using mysql.
Hive stats parameters are as follows:
Hive. stats. autogather: automatically collects statistics when the insert overwrite command is run. The default value is true.
Hive. stats. dbclass: the database that stores the hive temporary statistics. The default value is jdbc: derby, and the value is jdbc: mysql.
Hive. stats. jdbcdriver: jdbc driver for temporarily storing hive statistics in the database; set to com. mysql. jdbc. driver
Hive. stats. dbconnectionstring: Temporary statistics database connection string. The default value is jdbc: derby: databaseName = TempStatsStore; create = true. It is set to jdbc: mysql: // [ip: port]/[dbname]? User = [username] & password = [password]
Hive. stats. defaults. publisher: If dbclass is not jdbc or hbase, use this as the default release. The StatsPublisher interface must be implemented. The default value is null. Retain the default value.
Hive. stats. defaults. aggregator: If dbclass is not jdbc or hbase, use this class for aggregation. The StatsIAggregator interface must be implemented. The default value is null. Retain the default value.
Hive. stats. jdbc. timeout: Specifies the jdbc connection timeout configuration. The default value is 30 seconds.
Hive. stats. retries. max: Maximum number of retries when an exception occurs during database update during publishing and merging. The default value is 0. No retries are made. Retain the default value.
Hive. stats. retries. wait: The waiting window between retries. The default value is 3000 milliseconds. Keep the default value.
Hive. client. stats. publishers: list of statistical release classes for count jobs, separated by commas (,). The default value is null. org. apache. hadoop. hive. ql. stats. clientStatsPublisher interface; Retain default
Symptom:

If insert overwrite table is executed, numRows and rawDataSize are not returned correctly. The result is similar to the following:
[NumFiles = 1, numRows = 0, totalSize = 59, rawDataSize = 0]
No related stats are inserted in the hive stats mysql database.
First, locate the problem because hive stats has a problem. The problem cannot be precisely located because the information printed by the console is too small. Therefore, Set
Hive -- hiveconf hive. root. logger = INFO, console; print the detailed logs and find the following information:
[Error 30001]: StatsPublisher cannot be initialized. There was a error in the initialization
Of StatsPublisher, and retrying might help. If you dont want the query to fail because accurate
Statistics cocould not be collected, set hive. stats. reliable = falseSpecified key was too long; max key length is 767 bytes
This problem is simple because hive1.1.0, ID column length is 4000 by default, and ID is set as the primary key, resulting in an error.
Org. apache. hadoop. hive. ql. stats. jdbc. JDBCStatsSetupConstants
12 // MySQL-65535, SQL Server-8000, Oracle-4000, Derby-32762, Postgres-large.
Public static final int ID_COLUMN_VARCHAR_SIZE = 4000;

Org. apache. hadoop. hive. ql. stats. jdbc. JDBCStatsPublisher: public boolean init (Configuration hconf)
If (colSize <JDBCStatsSetupConstants. ID_COLUMN_VARCHAR_SIZE ){
String alterTable = JDBCStatsUtils. getAlterIdColumn ();
Stmt.exe cuteUpdate (alterTable );
}

From this code, if the table ID column size is smaller than 4000, it will be automatically changed to 4000; therefore, only the source code will be modified to 4000-> 255 (mysql uses utf8 encoding, A utf8 occupies 3 bytes, so 255*3 = 765 <767); and 255 bytes are sufficient for the current cluster.

Public static final int ID_COLUMN_VARCHAR_SIZE = 255;

Recompile and push the package to the test environment. The problem still exists after testing.
[NumFiles = 1, numRows = 0, totalSize = 59, rawDataSize = 0]
Hive -- hiveconf hive. root. logger = INFO, console; print out detailed logs

No exception was found.
Set hive. stats. reliable = true;
Run the command again. This time an error is reported. Check the error message of the job.
Org. apache. hadoop. hive. ql. stats. jdbc. JDBCStatsAggregator
Try {
Class. forName (driver). newInstance ();
} Catch (Exception e ){
LOG. error ("Error during instantiating JDBC driver" + driver + ".", e );
Return false;
}

This is running on yarn and cannot find com. mysql. jdbc. the Driver class causes the mysql Driver package to be placed under the yarn/lib/directory, push the whole cluster, re-run the test script, and find the problem solved.

Hive programming guide PDF (Chinese Version)

Hadoop cluster-based Hive Installation

Differences between Hive internal tables and external tables

Hadoop + Hive + Map + reduce cluster installation and deployment

Install in Hive local standalone Mode

WordCount word statistics for Hive Learning

Hive operating architecture and configuration and deployment

Hive details: click here
Hive: click here

This article permanently updates the link address:

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More

Apache Hive cannot collect stats

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support

Apache Hive cannot collect stats

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

Trending Topic

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support