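Tags: hive, unique id, auto-increment id
Generating a unique id for every line of a large data file
Four main approaches:
1 Single-threaded processing
2 Plain multithreading
3 Hive
4 Hadoop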
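As a point of reference, approach 1 can be sketched in a few lines of Java. This is only an illustration under simple assumptions: the class and method names (LineNumberer, numberLines, numberText) are made up for this example, the id is a 1-based counter, and the id is written in front of each line separated by a tab.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;

// Hypothetical helper (not from the original post): numbers lines in one sequential pass.
class LineNumberer {
    // Copies every line from in to out, prefixed with a 1-based id and a tab.
    // Returns the number of lines written.
    static long numberLines(BufferedReader in, BufferedWriter out) throws IOException {
        long id = 0; // long, so files with more than 2^31 lines do not overflow
        String line;
        while ((line = in.readLine()) != null) {
            id++;
            out.write(id + "\t" + line + "\n");
        }
        out.flush();
        return id;
    }

    // Convenience wrapper for in-memory text, used only for the demo below.
    static String numberText(String text) {
        StringWriter sink = new StringWriter();
        try {
            numberLines(new BufferedReader(new StringReader(text)), new BufferedWriter(sink));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toString();
    }

    public static void main(String[] args) {
        System.out.print(numberText("alpha\nbeta\ngamma\n"));
    }
}
```

A single counter in a single pass is trivially unique, but the whole file goes through one reader, which is why the other three approaches are worth considering for large inputs.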
I found some reference material:
Notes on "Hadoop in Action" (《Hadoop實戰》), part 2: Hadoop input and output
https://book.douban.com/annotation/17068812/
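TextInputFormat: the key is the byte offset of the line within the file, and the value is the whole line.
However, this offset appears to be relative to a single file rather than global, so on its own it is not unique across multiple input files.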
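The per-file nature of the offset can be illustrated without Hadoop. The sketch below is plain Java with made-up names (OffsetKeys, lineOffsets, globalKeys) and made-up file names; it computes the byte offset at which each line starts and shows one possible workaround, which is to combine the file name with the offset. In a real mapper the file name would have to come from the input split (for example via FileSplit.getPath()); that wiring is not shown here.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration (not from the original post) of per-file line offsets.
class OffsetKeys {
    // Byte offset at which each line of the content starts, assuming '\n' line endings.
    static List<Long> lineOffsets(String content) {
        byte[] bytes = content.getBytes(StandardCharsets.UTF_8);
        List<Long> offsets = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < bytes.length; i++) {
            if (bytes[i] == '\n') {
                offsets.add((long) start);
                start = i + 1;
            }
        }
        if (start < bytes.length) {
            offsets.add((long) start); // last line without a trailing newline
        }
        return offsets;
    }

    // Assumed workaround: prefix the offset with the file name to get a key
    // that is unique across files.
    static List<String> globalKeys(String fileName, String content) {
        List<String> keys = new ArrayList<>();
        for (long offset : lineOffsets(content)) {
            keys.add(fileName + ":" + offset);
        }
        return keys;
    }

    public static void main(String[] args) {
        // Both files have a line starting at offset 0, so the bare offsets collide.
        System.out.println(lineOffsets("ab\ncde\n"));
        System.out.println(lineOffsets("xyz\n"));
        System.out.println(globalKeys("part-0", "ab\ncde\n"));
        System.out.println(globalKeys("part-1", "xyz\n"));
    }
}
```

The resulting keys are unique but are strings rather than consecutive integers, so this only helps when any unique id is acceptable.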
Generate Auto-increment Id in Map-reduce Job
http://shzhangji.com/blog/2013/10/31/generate-auto-increment-id-in-map-reduce-job/
Generate unique customer id / insert unique rows in hive
http://stackoverflow.com/questions/26855003/generate-unique-customer-id-insert-unique-rows-in-hive
Need to add auto increment column in a table using hive
http://stackoverflow.com/questions/23082763/need-to-add-auto-increment-column-in-a-table-using-hive
https://hadooptutorial.info/writing-custom-udf-in-hive-auto-increment-column-hive/
Here make sure that addition of annotation @UDFType(stateful = true) is required otherwise counter value will not get increment in the Hive column, it will just return value 1 for all the rows but not the actual row number.
In the end I went with writing a Hive UDF.
package hive.udf;

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;

/**
 * UDFRowSequence.
 */
@Description(name = "row_sequence",
    value = "_FUNC_() - Returns a generated row sequence number starting from 1")
@UDFType(deterministic = false, stateful = true) // the stateful parameter is required
public class UDFRowSequence extends UDF {
  // Per-instance counter; it starts at 0 and is incremented on every call.
  private int result;

  public UDFRowSequence() {
    result = 0;
  }

  public int evaluate() {
    result++;
    return result;
  }
}

// End UDFRowSequence.java
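One caveat worth spelling out: the counter in this UDF is instance state, so if the query runs as several parallel tasks, each task creates its own instance and its own sequence restarts at 1. The sketch below is plain Java with no Hive dependency and only illustrates that effect, together with one assumed way to make the ids globally unique by mixing in a task index. The names (TaskScopedIds, LocalSequence, globalId) and the formula are hypothetical and are not part of the UDF above.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration (not part of the UDF above).
class TaskScopedIds {
    // Stand-in for one UDF instance: the counter lives in the instance.
    static final class LocalSequence {
        private long next = 0;

        long nextValue() {
            return ++next;
        }
    }

    // Assumed scheme: interleave the ids of numTasks tasks.
    // taskIndex must be in [0, numTasks) and localCounter >= 1.
    static long globalId(int taskIndex, int numTasks, long localCounter) {
        return localCounter * numTasks + taskIndex;
    }

    public static void main(String[] args) {
        int numTasks = 3;
        Set<Long> seen = new HashSet<>();
        for (int task = 0; task < numTasks; task++) {
            LocalSequence seq = new LocalSequence(); // each task starts again at 1
            for (int row = 0; row < 4; row++) {
                long local = seq.nextValue();
                long global = globalId(task, numTasks, local);
                if (!seen.add(global)) {
                    throw new IllegalStateException("duplicate id " + global);
                }
                System.out.println("task " + task + " local " + local + " global " + global);
            }
        }
    }
}
```

With this scheme the ids are unique but not consecutive when tasks process different numbers of rows. If consecutive numbering is required, forcing the query through a single task keeps the plain counter correct at the cost of parallelism.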
Author: linger
Article link: http://blog.csdn.net/lingerlanlan/article/details/46430747