Hive 0.11 upgrade: the pitfalls we hit

Source: Internet
Author: User
Tags: coding standards

Last week our production environment officially went live with Hive 0.11 / Spark 0.8 / Shark 0.8. During early testing and regression we ran into quite a few pits; they are recorded here so that other teams making the same move can avoid some detours.

1. Hive 0.11 maintains a separate schema for each partition, whereas in 0.9 a partition's fields are deserialized using the table schema. If you add a column to a table and then create a partition, the new partition inherits the updated table schema, but partitions created before the change keep the old schema unless they are dropped and rebuilt. Even if data for the new column has been written into the underlying HDFS files, Hive reads it back as NULL for those old partitions. Our DW convention was to pre-create the next six months of partitions right after creating a table, which magnified the problem. The JIRA that introduced this behavior in 0.11 is https://issues.apache.org/jira/browse/HIVE-3833. One way out is to drop and rebuild all pre-created partitions and abandon the pre-creation convention, but that cost was too high for us. We took a second route: roll back the feature so that partition data is always read with the table schema rather than the partition schema.
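The behavior can be illustrated with a minimal sketch (table and column names here are hypothetical, not from our DW):

```sql
CREATE TABLE t (a STRING) PARTITIONED BY (dt STRING);
ALTER TABLE t ADD PARTITION (dt='2013-12-01');  -- partition snapshots schema (a)
ALTER TABLE t ADD COLUMNS (b STRING);           -- table schema is now (a, b)
ALTER TABLE t ADD PARTITION (dt='2013-12-02');  -- new partition sees (a, b)

-- Even if the HDFS files under dt='2013-12-01' now contain data for b,
-- 0.11 deserializes that partition with its stored schema (a), so b
-- reads back as NULL there, while dt='2013-12-02' reads it normally.
SELECT b FROM t WHERE dt='2013-12-01';
```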

Our Shark is branch-0.8 with the branch-hive0.11 branch (contributed by WANdisco) merged in. Shark calls the Hive API internally, so it needs API compatibility: the feature above has to be rolled back in Shark as well, otherwise it throws NoSuchMethodException.
2. A LEFT OUTER JOIN with an invalid filter condition in the ON clause (observable in the execution plan: no corresponding FilterOperator is produced) passes silently in 0.9, but 0.11 checks statements more strictly and reports an error.
Bad case:
dpods_hippo_tuangou_source_order_results s on o.orderid = s.order_id LEFT OUTER JOIN
dpstg_order_source_1_20131205 cp on o.orderid = cp.referid and cp.type = + LEFT OUTER JOIN
dpstg_order_ssource_1_20131205 cp2 on o.orderid = cp2.referid and cp.type = 37
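A version 0.11 accepts could look like this; since the fragment above is garbled, this is a sketch that assumes the stray `cp.type` filter in the last ON clause was meant to apply to cp2:

```sql
... LEFT OUTER JOIN dpods_hippo_tuangou_source_order_results s
      ON o.orderid = s.order_id
    LEFT OUTER JOIN dpstg_order_source_1_20131205 cp
      ON o.orderid = cp.referid
    LEFT OUTER JOIN dpstg_order_ssource_1_20131205 cp2
      ON o.orderid = cp2.referid AND cp2.type = 37
```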

3. Multiple statements not separated by semicolons: 0.9 does not complain, 0.11 reports an error.
Bad case:
set mapred.reduce.tasks=-1
set hive.exec.reducers.max=999;
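The fix is simply to terminate every statement:

```sql
set mapred.reduce.tasks=-1;
set hive.exec.reducers.max=999;
```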

4. Hive's built-in round() UDF returns LongWritable in 0.9 but DoubleWritable in 0.11, and the type mismatch makes downstream casts fail. Our solution was to add a 0.9-style round implementation to our own UDF jar to cover the 0.11 built-in round.
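A minimal illustration of the behavior change as described above (the result-type annotations restate the text's claim, not an independently verified output):

```sql
-- 0.9: round() yields a long (LongWritable)
-- 0.11: round() yields a double (DoubleWritable)
-- so a query that implicitly or explicitly casts the result to a
-- long-typed column now errors out under 0.11.
select round(score) from some_table;  -- hypothetical table/column names
```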


5. When a user adds a custom jar inside a Hive session (add jar xx.jar) and the query runs a local MapredLocalTask (e.g. for ExprNode conversion and filtering), it throws ClassNotFoundException. This looks like a Hive bug; we will fix it later. The current work-around is for the user to set hive.aux.jars.path=file:///tmp/xxx/xxx.jar in the session, which puts the jar on the local task's classpath.
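In session form the work-around looks like this (jar path as in the text):

```sql
add jar /tmp/xxx/xxx.jar;
-- Additionally needed so the local MapredLocalTask also sees the classes:
set hive.aux.jars.path=file:///tmp/xxx/xxx.jar;
```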


6. In 0.11, the value to the right of the equals sign in a set statement no longer accepts an arithmetic expression; you must write out the computed value.
Bad case:
set hive.exec.reducers.max=100*9;
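With expressions rejected, the value has to be precomputed (here 100*9 = 900):

```sql
set hive.exec.reducers.max=900;
```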

7. When you INSERT OVERWRITE into a partitioned table, the names and order of the partition columns in the partition specification must match the table schema.
Bad case:
insert overwrite table abc partition (b='xx', a='yy', c='zz') select bla bla ...
In the schema of table abc, the partition column order is a, b, c, so 0.11 reports "Partition columns in partition specification are not the same as that defined in the table schema." The names and order have to be exactly the same.
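A version that 0.11 accepts, with the partition columns in schema order:

```sql
insert overwrite table abc partition (a='yy', b='xx', c='zz')
select bla bla ...
```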

8. When joining two tables, if a selected column name exists in both tables you must qualify it explicitly as alias.column_name, otherwise 0.11 throws "Ambiguous column reference". Oddly, 0.9 did not complain here; presumably it defaulted to the left table's column?
Bad case:
select xx.day from (select * from tmptest a join tmptest b on a.day = b.day limit) xx;
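One way to disambiguate is to qualify the column inside the subquery; a sketch of the fix (the truncated LIMIT clause from the bad case is left out):

```sql
select xx.day
from (select a.day from tmptest a join tmptest b on a.day = b.day) xx;
```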

9. 0.11 adds a new optimization parameter, hive.auto.convert.join.noconditionaltask, aimed at n-way joins: it enumerates every combination of n-1 tables, sums their sizes, and if the total is below hive.auto.convert.join.noconditionaltask.size (default 10MB) it automatically converts the join to a map-side join. The hash tables built from those n-1 tables are packaged via DistributedCache, uploaded to HDFS, and distributed to the task nodes to join against the remaining large table, avoiding the reduce-side cost of an ordinary join. But this optimization is too optimistic: unlike the earlier map-join strategy (hive.auto.convert.join), after replacing the conditional task there is no corresponding common-join backup task. If the MapredLocalTask fails for any reason, the whole job fails, whereas in 0.9 the backup task would take over, so the job still completed correctly, only more slowly. To guarantee correct execution, we turn this optimization parameter off by default.
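Our resulting session defaults can be sketched as follows (the text only states explicitly that the noconditionaltask variant is disabled; keeping the older conditional conversion on is implied by the contrast it draws):

```sql
-- Older conversion path, which retains a common-join backup task:
set hive.auto.convert.join=true;
-- 0.11 variant without a backup task: disabled for safety.
set hive.auto.convert.join.noconditionaltask=false;
-- If it were enabled, this threshold (default 10MB) would decide conversion:
-- set hive.auto.convert.join.noconditionaltask.size=10000000;
```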

10. The MetadataOnlyOptimizer physical execution plan optimization reports an NPE (https://issues.apache.org/jira/browse/HIVE-4935); either applying the patch or setting hive.optimize.metadataonly=false solves it.

11. GROUP BY can produce wrong data, with the same group-by key yielding multiple rows. For example, the statement "select x, count(*) from (select x, y from abc group by x, y) a group by x;" compiles to only one MR job in 0.11, with the ReduceSinkOperator key being (x, y), so rows with the same x are scattered across different reduce tasks. The correct plan would launch a second MR job to redistribute by x. 0.12 fixes this problem (see https://issues.apache.org/jira/browse/HIVE-5237 and https://issues.apache.org/jira/browse/HIVE-5149); as a workaround you can turn the optimizer off with set hive.optimize.reducededuplication=false.
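A quick way to check whether a query is affected is to inspect its plan with the optimizer off (query taken from the example above):

```sql
set hive.optimize.reducededuplication=false;
-- With deduplication disabled, EXPLAIN should now show two MR stages:
-- one for the inner group by (x, y), one redistributing by x.
explain select x, count(*) from (select x, y from abc group by x, y) a group by x;
```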


12. The map join hint is ignored by default in 0.11 (https://issues.apache.org/jira/browse/HIVE-4042); map joins are now converted automatically via hive.auto.convert.join.
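Concretely, a hint that 0.9 honored is now silently skipped; per HIVE-4042 this is governed by hive.ignore.mapjoin.hint (default true; parameter name taken from that JIRA, so verify against your build):

```sql
-- The hint below is ignored by default in 0.11; whether the join becomes
-- a map join is decided automatically by hive.auto.convert.join instead.
select /*+ MAPJOIN(b) */ a.key, b.value
from a join b on a.key = b.key;
```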


13. The Hive run-command file (.hiverc) moves from $HIVE_HOME/bin to $HIVE_HOME/conf: "Putting the global hiverc in $HIVE_HOME/bin/.hiverc is deprecated. Please use $HIVE_CONF_DIR/.hiverc instead."


14. The GRANT syntax differs from before and no longer supports the dbname.tablename form. For example, "grant select,show_database on table bi.dprpt_dp_target_shop_mtd_summary to user ba_crm_online;" fails because it cannot find the table bi.dprpt_dp_target_shop_mtd_summary. You need to switch into the database first and grant on the bare table name: "use bi;" followed by "grant select,show_database on table dprpt_dp_target_shop_mtd_summary to user ba_crm_online;".


This time we jumped straight from 0.9 to 0.11, skipping 0.10. The Hive community is very active: the code changes frequently, new features keep landing, and Hive is gradually moving toward the SQL-92 standard, improving in every respect. Many of the pits above are really coding-standard problems in our own development: the stricter pre-checks at compile time exposed issues that used to stay hidden. That is a good thing. These fail-fast errors help us improve our Hive development standards, including how to build tables, how to write statements, and how to set optimization parameters, so that developers keep a consistent code style and future code maintenance costs go down.


Original article: http://blog.csdn.net/lalaguozhe/article/details/17504761. Please credit the source when reposting.
