Hive supports transactions and row-level updates starting with version 0.14, but this support is not enabled by default and requires additional configuration. To use row-level INSERT, UPDATE, and DELETE, you must configure Hive to support transactions.
First, usage scenarios for Hive transactions with ACID semantics
1. Streaming data ingestion.
Many users use tools such as Apache Flume, Apache Storm, or Apache Kafka to stream data into their Hadoop cluster. While these tools can write data at rates of hundreds of rows per second or more, Hive can typically only add a new partition every fifteen minutes to an hour, because adding partitions more frequently quickly makes the number of partitions in a table unmanageable. These tools could also write into existing partitions, but that causes dirty reads (readers seeing data written after their query started) and leaves a large number of small files in the partition directories, which puts pressure on the NameNode. In this scenario, transaction support gives readers a consistent view of the data while avoiding excessive file creation.
2. Slowly changing dimensions.
In a typical star-schema data warehouse, dimension tables change slowly over time. For example, a retailer opens a new store, which must be added to the stores table, or an existing store changes its floor area or some other tracked attribute. These changes lead to the insertion or modification of individual records. Starting with version 0.14, Hive supports row-level updates.
3. Data restatement.
Sometimes the collected data turns out to be wrong and needs to be corrected. Or the first instance of the data is only an approximation (for example, only 90% of the data, with the full data arriving later). Or a business rule may require restating certain transactions based on subsequent ones (for example, a customer buys some items and then purchases a membership, entitling them to discounted prices, including on the previously purchased items). Or a customer may, under the terms of their contract, request that their data be deleted when the relationship ends. Starting with Hive 0.14, these use cases are supported through INSERT, UPDATE, and DELETE.
Second, configuring Hive to support transactions (Hive version 2.0)
1. Add the following configuration items to the hive-site.xml file:
<property>
  <name>hive.support.concurrency</name>
  <value>true</value>
</property>
<property>
  <name>hive.exec.dynamic.partition.mode</name>
  <value>nonstrict</value>
</property>
<property>
  <name>hive.txn.manager</name>
  <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
<property>
  <name>hive.compactor.initiator.on</name>
  <value>true</value>
</property>
<property>
  <name>hive.compactor.worker.threads</name>
  <value>1</value>
</property>
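After restarting the Hive service, a quick sanity check from the Hive CLI confirms the settings took effect (SET with just a property name prints its current value):
-- each statement echoes the property's current value
SET hive.support.concurrency;
SET hive.txn.manager;
SET hive.compactor.initiator.on;
SET hive.compactor.worker.threads;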
2. Seed the Hive metastore tables (when MySQL is used as the metastore database)
INSERT INTO NEXT_LOCK_ID VALUES (1);
INSERT INTO NEXT_COMPACTION_QUEUE_ID VALUES (1);
INSERT INTO NEXT_TXN_ID VALUES (1);
COMMIT;
Note: these three tables are initially empty. If you do not seed them, the following error is reported:
org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
FAILED: Error in acquiring locks: Error communicating with the metastore
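To confirm the seed rows are in place, you can query the three tables directly in MySQL (a sketch; it assumes the metastore database is named hive, so adjust the name to your installation):
USE hive;
SELECT * FROM NEXT_TXN_ID;               -- each query should return a single row containing 1
SELECT * FROM NEXT_LOCK_ID;
SELECT * FROM NEXT_COMPACTION_QUEUE_ID;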
Third, testing
1. Start the Hadoop cluster and MySQL
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
~/mysql/bin/mysqld &
2. Create a test table
USE test;
CREATE TABLE t1 (id INT, name STRING)
CLUSTERED BY (id) INTO 8 BUCKETS
STORED AS ORC TBLPROPERTIES ('transactional'='true');
Note: the CREATE TABLE statement must include a CLUSTERED BY ... INTO ... BUCKETS clause and the STORED AS ORC TBLPROPERTIES ('transactional'='true') clause, and must not include a SORTED BY clause.
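To verify that the table really was created as transactional, DESCRIBE FORMATTED shows the table properties:
-- look for transactional=true under the Table Parameters section of the output
DESCRIBE FORMATTED t1;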
3. Test INSERT, UPDATE, and DELETE
INSERT INTO t1 VALUES (1, 'aaa');
INSERT INTO t1 VALUES (2, 'bbb');
UPDATE t1 SET name='ccc' WHERE id=1;
The results of executing these statements are shown in Figures 1-3, respectively.
Figure 1
Figure 2
Figure 3
Note: you cannot modify the value of a bucketing column; attempting to do so reports the following error:
FAILED: SemanticException [Error 10302]: Updating values of bucketing columns is not supported. Column id.
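Because UPDATE cannot change a bucketing column, the usual workaround (a sketch, not part of the original test) is to delete the old row and insert a replacement carrying the new key:
-- effectively change the id of row (1, 'ccc') to 9
DELETE FROM t1 WHERE id=1;
INSERT INTO t1 VALUES (9, 'ccc');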
4. Converting an existing table to a transactional ORC table
--Write the following 4 rows of data to the local file /home/grid/a.txt
1,John,US,CA
2,Dick,US,CB
3,Harry,CA,BB
4,Zhao Liu,CA,BC
--Create a non-partitioned table and load the data
-- t1 already exists from the previous test, so drop it first
DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (id INT, name STRING, cty STRING, st STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH '/home/grid/a.txt' INTO TABLE t1;
SELECT * FROM t1;
--Create an external partitioned transactional table and load the data
CREATE EXTERNAL TABLE t2 (id INT, name STRING)
PARTITIONED BY (country STRING, state STRING)
CLUSTERED BY (id) INTO 8 BUCKETS
STORED AS ORC TBLPROPERTIES ('transactional'='true');
INSERT INTO t2 PARTITION (country, state) SELECT * FROM t1;
SELECT * FROM t2;
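Note that this INSERT uses dynamic partitioning (which is why hive.exec.dynamic.partition.mode=nonstrict was set earlier): the trailing cty and st columns of the SELECT are mapped to the partition keys country and state. For comparison, a static-partition insert names the partition explicitly; a minimal sketch for one of the partitions above:
-- load only the US/CB rows into an explicitly named partition
INSERT INTO t2 PARTITION (country='US', state='CB')
SELECT id, name FROM t1 WHERE cty='US' AND st='CB';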
--Modify data
INSERT INTO TABLE t2 PARTITION (country, state) VALUES (5, 'Liu', 'DD', 'DD');
UPDATE t2 SET name='Zhang' WHERE id=1;
DELETE FROM t2 WHERE name='Dick';
The data before and after these modifications are shown in Figure 4 and Figure 5, respectively.
Figure 4
Figure 5
Note: you cannot update a partition column; attempting to do so reports the following error:
FAILED: SemanticException [Error 10292]: Updating values of partition columns is not supported
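Each INSERT, UPDATE, and DELETE on a transactional table writes delta files under the table or partition directory, which the compactor configured earlier merges in the background. The standard commands below let you inspect open transactions and queued compactions, or trigger a compaction by hand (the partition spec is taken from the INSERT above):
SHOW TRANSACTIONS;
SHOW COMPACTIONS;
-- force a major compaction instead of waiting for the initiator
ALTER TABLE t2 PARTITION (country='DD', state='DD') COMPACT 'major';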
Fourth, references
https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Delete
http://unmeshasreeveni.blogspot.in/2014/11/updatedeleteinsert-in-hive-0140.html
https://www.mapr.com/blog/hive-transaction-feature-hive-10