給大資料檔案的每一行產生唯一的id

來源:互聯網
上載者:User

標籤:hive   唯一id   自增id   

給大資料檔案的每一行產生唯一的id

4個主要思路:

1 單線程處理

2 普通多線程

3 hive

4 Hadoop

 

搜到一些參考資料


《Hadoop實戰》的筆記-2、Hadoop輸入與輸出

https://book.douban.com/annotation/17068812/

TextInputFormat:檔案位移量:整行資料

但是這個位移量,貌似是在一個檔案的位移,而不是全域。

 

Generate Auto-increment Id in Map-reduceJob

http://shzhangji.com/blog/2013/10/31/generate-auto-increment-id-in-map-reduce-job/

 

Generate unique customer id / insert uniquerows in hive

http://stackoverflow.com/questions/26855003/generate-unique-customer-id-insert-unique-rows-in-hive

 

Need to add auto increment column in atable using hive

http://stackoverflow.com/questions/23082763/need-to-add-auto-increment-column-in-a-table-using-hive

 

 

https://hadooptutorial.info/writing-custom-udf-in-hive-auto-increment-column-hive/

Here make sure that addition of annotation@UDFType(stateful = true) is required otherwisecounter value will not get increment in the Hive column, it will just returnvalue 1 for all the rows but not the actual row number.

 

最後我採取了用hive寫udf的方案。


package hive.udf;/** * Licensed to the Apache Software Foundation (ASF) under one * or more contributor license agreements.  See the NOTICE file * distributed with this work for additional information * regarding copyright ownership.  The ASF licenses this file * to you under the Apache License, Version 2.0 (the * "License"); you may not use this file except in compliance * with the License.  You may obtain a copy of the License at * *     http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */import org.apache.hadoop.hive.ql.exec.Description;import org.apache.hadoop.hive.ql.exec.UDF;import org.apache.hadoop.hive.ql.udf.UDFType;/** * UDFRowSequence. */@Description(name = "row_sequence",    value = "_FUNC_() - Returns a generated row sequence number starting from 1")@UDFType(deterministic = false, stateful = true)//stateful參數是必要的public class UDFRowSequence extends UDF{  private int result;  public UDFRowSequence() {    result=0;  }  public int evaluate() {  result++;    return result;  }}// End UDFRowSequence.java

 

本文linger

本文連結:http://blog.csdn.net/lingerlanlan/article/details/46430747



給大資料檔案的每一行產生唯一的id

相關文章

聯繫我們

該頁面正文內容均來源於網絡整理,並不代表阿里雲官方的觀點,該頁面所提到的產品和服務也與阿里云無關,如果該頁面內容對您造成了困擾,歡迎寫郵件給我們,收到郵件我們將在5個工作日內處理。

如果您發現本社區中有涉嫌抄襲的內容,歡迎發送郵件至: info-contact@alibabacloud.com 進行舉報並提供相關證據,工作人員會在 5 個工作天內聯絡您,一經查實,本站將立刻刪除涉嫌侵權內容。

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.