Generate a unique ID for each line of a big data file
Four main ideas:
1 Single-threaded processing
2 Ordinary multithreading
3 Hive
4 Hadoop
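As a baseline for idea 1, single-threaded processing can be sketched as a loop that reads the input line by line and prefixes each line with a counter. This is only a minimal sketch: the class name `LineNumberer`, the method `numberLines`, and the tab-separated output format are assumptions made for illustration, not something taken from the references below.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical helper: the names and the output format are illustrative assumptions.
final class LineNumberer {

    // Copies every input line to the output, prefixed with a sequential ID
    // starting at 1, and returns the number of lines processed.
    static long numberLines(Reader input, Writer output) {
        long id = 0; // long, because a big file may hold more lines than an int can count
        try (BufferedReader reader = new BufferedReader(input);
             BufferedWriter writer = new BufferedWriter(output)) {
            String line;
            while ((line = reader.readLine()) != null) {
                id++;
                writer.write(id + "\t" + line);
                writer.write('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return id;
    }

    public static void main(String[] args) {
        // A small in-memory input stands in for the big data file.
        StringWriter out = new StringWriter();
        long count = numberLines(new StringReader("alice\nbob\ncarol\n"), out);
        System.out.print(out);
        System.out.println(count + " lines numbered");
    }
}
```

The sketch streams the data instead of loading it into memory, so the file size is limited only by how long a single pass takes, which is what motivates the other three ideas.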
I searched for some references:
Hadoop in Action notes (2): Hadoop input and output
https://book.douban.com/annotation/17068812/
TextInputFormat: the key is the byte offset within the file, and the value is the entire line of data.
But this offset appears to be the offset within a single file, not a global one.
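To see why the offset alone cannot serve as a global ID, the following sketch imitates, in plain Java without any Hadoop dependency, how a byte offset is computed for each line of a file. Every file starts again at offset 0, so two files produce colliding keys. Pairing the offset with the file name is one possible workaround. The class `OffsetKeys`, its methods, and the `name:offset` format are assumptions for illustration, not Hadoop API.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Plain-Java imitation of per-file byte offsets; this is not Hadoop code.
final class OffsetKeys {

    // Returns the byte offset at which each line of the content starts.
    static List<Long> lineOffsets(String content) {
        byte[] bytes = content.getBytes(StandardCharsets.UTF_8);
        List<Long> offsets = new ArrayList<>();
        long lineStart = 0;
        for (int i = 0; i < bytes.length; i++) {
            if (bytes[i] == '\n') {
                offsets.add(lineStart);
                lineStart = i + 1; // the next line begins right after the newline
            }
        }
        if (lineStart < bytes.length) {
            offsets.add(lineStart); // last line without a trailing newline
        }
        return offsets;
    }

    // Assumed workaround: combine the file name with the offset to get a key
    // that stays unique across files.
    static List<String> globalKeys(String fileName, String content) {
        List<String> keys = new ArrayList<>();
        for (long offset : lineOffsets(content)) {
            keys.add(fileName + ":" + offset);
        }
        return keys;
    }

    public static void main(String[] args) {
        // Both files yield offset 0 for their first line, so offsets alone collide.
        System.out.println(lineOffsets("ab\ncde\nf\n"));
        System.out.println(lineOffsets("xy\nz\n"));
        System.out.println(globalKeys("part-0", "ab\ncde\nf\n"));
        System.out.println(globalKeys("part-1", "xy\nz\n"));
    }
}
```

Such a key is unique but neither compact nor sequential, which may be one reason to look at the counter-based approaches in the references that follow.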
Generate Auto-increment Id in Map-reduce Job
http://shzhangji.com/blog/2013/10/31/generate-auto-increment-id-in-map-reduce-job/
Generate unique customer id / insert unique rows in Hive
http://stackoverflow.com/questions/26855003/generate-unique-customer-id-insert-unique-rows-in-hive
Need to add auto increment column in a table using Hive
http://stackoverflow.com/questions/23082763/need-to-add-auto-increment-column-in-a-table-using-hive
https://hadooptutorial.info/writing-custom-udf-in-hive-auto-increment-column-hive/
Note that the annotation @UDFType(stateful = true) is required here; otherwise the counter value will not be incremented in the Hive column, and the function will just return the value 1 for all rows instead of the actual row number.
In the end I went with the approach of writing a UDF for Hive.
package hive.udf;

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;

/**
 * UDFRowSequence.
 */
@Description(name = "row_sequence",
    value = "_FUNC_() - Returns a generated row sequence number starting from 1")
@UDFType(deterministic = false, stateful = true) // the stateful parameter is required
public class UDFRowSequence extends UDF {
  // Counter kept in the UDF instance; it survives between calls to evaluate().
  private int result;

  public UDFRowSequence() {
    result = 0;
  }

  // Called once per row: increments the counter and returns the new value.
  public int evaluate() {
    result++;
    return result;
  }
}
// End UDFRowSequence.java
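One point worth checking before relying on this UDF: the counter lives in a field of the UDF object, so it is only sequential within one instance. If Hive runs the query as several tasks, each with its own instance (an assumption about the execution setup, not something the UDF controls), every task would start again at 1. The plain-Java sketch below imitates that situation and shows one possible workaround, prefixing a task number. `RowCounter`, `taskId`, and the `task-sequence` format are hypothetical names, not Hive API.

```java
// Plain-Java imitation of the UDF's counter; nothing here is Hive API.
final class RowCounter {
    private final int taskId;  // hypothetical identifier of the task that owns this counter
    private long sequence = 0; // per-instance state, like the "result" field in the UDF

    RowCounter(int taskId) {
        this.taskId = taskId;
    }

    // Same logic as evaluate() in the UDF: increment, then return the new value.
    long next() {
        return ++sequence;
    }

    // Assumed workaround: prefix the task id so IDs from different counters never collide.
    String nextGlobalId() {
        return taskId + "-" + next();
    }

    public static void main(String[] args) {
        RowCounter first = new RowCounter(0);
        RowCounter second = new RowCounter(1);
        // Each instance counts independently, so the plain sequence repeats.
        System.out.println(first.next() + " " + second.next());
        // The prefixed form stays distinct across instances.
        System.out.println(first.nextGlobalId() + " " + second.nextGlobalId());
    }
}
```

Whether this matters depends on how the query is executed; if all rows pass through a single task, the plain sequence is already unique.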
Author of this article: linger
This article link: http://blog.csdn.net/lingerlanlan/article/details/46430747