Apache Hive cannot collect stats

Source: Internet
Author: User

Apache Hive cannot collect stats

Environment:
Hive: apache-hive-1.1.0
Hadoop: hadoop-2.5.0-cdh5.3.2
Hive metadata and stats are stored using mysql.
Hive stats parameters are as follows:
Hive. stats. autogather: automatically collects statistics when the insert overwrite command is run. The default value is true.
Hive. stats. dbclass: the database that stores the hive temporary statistics. The default value is jdbc: derby, and the value is jdbc: mysql.
Hive. stats. jdbcdriver: jdbc driver for temporarily storing hive statistics in the database; set to com. mysql. jdbc. driver
Hive. stats. dbconnectionstring: Temporary statistics database connection string. The default value is jdbc: derby: databaseName = TempStatsStore; create = true. It is set to jdbc: mysql: // [ip: port]/[dbname]? User = [username] & password = [password]
Hive. stats. defaults. publisher: If dbclass is not jdbc or hbase, use this as the default release. The StatsPublisher interface must be implemented. The default value is null. Retain the default value.
Hive. stats. defaults. aggregator: If dbclass is not jdbc or hbase, use this class for aggregation. The StatsIAggregator interface must be implemented. The default value is null. Retain the default value.
Hive. stats. jdbc. timeout: Specifies the jdbc connection timeout configuration. The default value is 30 seconds.
Hive. stats. retries. max: Maximum number of retries when an exception occurs during database update during publishing and merging. The default value is 0. No retries are made. Retain the default value.
Hive. stats. retries. wait: The waiting window between retries. The default value is 3000 milliseconds. Keep the default value.
Hive. client. stats. publishers: list of statistical release classes for count jobs, separated by commas (,). The default value is null. org. apache. hadoop. hive. ql. stats. clientStatsPublisher interface; Retain default
Symptom:

If insert overwrite table is executed, numRows and rawDataSize are not returned correctly. The result is similar to the following:
[NumFiles = 1, numRows = 0, totalSize = 59, rawDataSize = 0]
No related stats are inserted in the hive stats mysql database.
First, locate the problem because hive stats has a problem. The problem cannot be precisely located because the information printed by the console is too small. Therefore, Set
Hive -- hiveconf hive. root. logger = INFO, console; print the detailed logs and find the following information:
[Error 30001]: StatsPublisher cannot be initialized. There was a error in the initialization
Of StatsPublisher, and retrying might help. If you dont want the query to fail because accurate
Statistics cocould not be collected, set hive. stats. reliable = falseSpecified key was too long; max key length is 767 bytes
This problem is simple because hive1.1.0, ID column length is 4000 by default, and ID is set as the primary key, resulting in an error.
Org. apache. hadoop. hive. ql. stats. jdbc. JDBCStatsSetupConstants
12 // MySQL-65535, SQL Server-8000, Oracle-4000, Derby-32762, Postgres-large.
Public static final int ID_COLUMN_VARCHAR_SIZE = 4000;


Org. apache. hadoop. hive. ql. stats. jdbc. JDBCStatsPublisher: public boolean init (Configuration hconf)
If (colSize <JDBCStatsSetupConstants. ID_COLUMN_VARCHAR_SIZE ){
String alterTable = JDBCStatsUtils. getAlterIdColumn ();
Stmt.exe cuteUpdate (alterTable );
}

From this code, if the table ID column size is smaller than 4000, it will be automatically changed to 4000; therefore, only the source code will be modified to 4000-> 255 (mysql uses utf8 encoding, A utf8 occupies 3 bytes, so 255*3 = 765 <767); and 255 bytes are sufficient for the current cluster.

Public static final int ID_COLUMN_VARCHAR_SIZE = 255;

Recompile and push the package to the test environment. The problem still exists after testing.
[NumFiles = 1, numRows = 0, totalSize = 59, rawDataSize = 0]
Hive -- hiveconf hive. root. logger = INFO, console; print out detailed logs

No exception was found.
Set hive. stats. reliable = true;
Run the command again. This time an error is reported. Check the error message of the job.
Org. apache. hadoop. hive. ql. stats. jdbc. JDBCStatsAggregator
Try {
Class. forName (driver). newInstance ();
} Catch (Exception e ){
LOG. error ("Error during instantiating JDBC driver" + driver + ".", e );
Return false;
}

This is running on yarn and cannot find com. mysql. jdbc. the Driver class causes the mysql Driver package to be placed under the yarn/lib/directory, push the whole cluster, re-run the test script, and find the problem solved.

Hive programming guide PDF (Chinese Version)

Hadoop cluster-based Hive Installation

Differences between Hive internal tables and external tables

Hadoop + Hive + Map + reduce cluster installation and deployment

Install in Hive local standalone Mode

WordCount word statistics for Hive Learning

Hive operating architecture and configuration and deployment

Hive details: click here
Hive: click here

This article permanently updates the link address:

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.