Generating a unique ID for each line of a large data file

Source: Internet
Author: User

Four main ideas:

1. Single-threaded processing

2. Ordinary multithreading

3. Hive

4. Hadoop
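As a point of comparison for the first idea, here is a minimal single-threaded sketch. It is an illustration only, not code from the original post: the class name `SequentialLineIds`, the method `addIds`, and the tab-separated output format are all assumptions. It streams the input line by line, so memory use stays constant, but throughput is limited to one core and one disk.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical helper: prefixes every input line with a 1-based sequential ID.
class SequentialLineIds {

    /** Copies each line from in to out as "id<TAB>line" and returns the line count. */
    static long addIds(BufferedReader in, Writer out) {
        long id = 0;
        try {
            String line;
            while ((line = in.readLine()) != null) {
                id++;
                out.write(id + "\t" + line + "\n");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return id;
    }

    public static void main(String[] args) {
        // In practice the reader and writer would wrap files; strings keep the demo self-contained.
        BufferedReader in = new BufferedReader(new StringReader("alice\nbob\ncarol\n"));
        StringWriter out = new StringWriter();
        long count = addIds(in, out);
        System.out.print(out);
        System.out.println(count + " lines processed");
    }
}
```

For a real file, the same method could be called with `Files.newBufferedReader` and `Files.newBufferedWriter` from `java.nio.file`.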

I searched for some references:


Hadoop in Action notes 2: Hadoop input and output

https://book.douban.com/annotation/17068812/

TextInputFormat: the key is the byte offset of the line within the file, and the value is the entire line.

However, this offset appears to be relative to a single file, not global across all input files.
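The problem with per-file offsets can be shown without Hadoop. The sketch below is plain Java and only imitates how line-start byte offsets are formed; `OffsetKeys`, `lineOffsets`, and `compositeId` are hypothetical names. The first line of every file has offset 0, so the offset alone is not a unique ID across files. One possible workaround, assuming each file can be given a small index and is smaller than 2^40 bytes, is to pack the file index and the offset into one `long`. The resulting IDs are unique but not consecutive.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of per-file line offsets and a composite ID built from them.
class OffsetKeys {

    /** Byte offset at which each '\n'-terminated line of the content starts. */
    static List<Long> lineOffsets(String content) {
        List<Long> offsets = new ArrayList<>();
        long pos = 0;
        for (String line : content.split("\n")) {
            offsets.add(pos);
            pos += line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for the newline
        }
        return offsets;
    }

    /** Assumed scheme: file index in the high bits, byte offset in the low 40 bits. */
    static long compositeId(int fileIndex, long offset) {
        return ((long) fileIndex << 40) | offset;
    }

    public static void main(String[] args) {
        String fileA = "ab\ncd\n";
        String fileB = "xyz\nw\n";
        System.out.println("file A offsets: " + lineOffsets(fileA));
        System.out.println("file B offsets: " + lineOffsets(fileB));
        // Both files start at offset 0, but the composite IDs differ.
        System.out.println(compositeId(0, 0) + " vs " + compositeId(1, 0));
    }
}
```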

Generate Auto-increment Id in Map-reduce Job

http://shzhangji.com/blog/2013/10/31/generate-auto-increment-id-in-map-reduce-job/

Generate unique customer Id / insert unique rows in Hive

http://stackoverflow.com/questions/26855003/generate-unique-customer-id-insert-unique-rows-in-hive

Need to add auto increment column in a table using Hive

http://stackoverflow.com/questions/23082763/need-to-add-auto-increment-column-in-a-table-using-hive

https://hadooptutorial.info/writing-custom-udf-in-hive-auto-increment-column-hive/

Note that the annotation @UDFType(stateful = true) is required here. Without it, the counter value does not increment in the Hive column, and the function returns 1 for every row instead of the actual row number.

In the end, I chose to write a UDF for Hive.


package hive.udf;

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;

/**
 * UDFRowSequence.
 */
@Description(name = "row_sequence",
    value = "_FUNC_() - Returns a generated row sequence number starting from 1")
@UDFType(deterministic = false, stateful = true) // the stateful parameter is required
public class UDFRowSequence extends UDF {
  // Counter kept in the UDF instance; incremented once per evaluated row.
  private int result;

  public UDFRowSequence() {
    result = 0;
  }

  public int evaluate() {
    result++;
    return result;
  }
}

// End UDFRowSequence.java
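One caveat worth keeping in mind: the counter lives in a single UDF instance, so, as far as I understand, the numbers are only guaranteed to be unique within one task. If the query runs as several parallel map tasks, each instance would start again from 1. The sketch below is plain Java, not Hive code, and not part of the original solution. It shows one common workaround, interleaving IDs by task index, under the assumption that each task knows its own index and the total number of tasks. `StridedSequence`, `taskIndex`, and `numTasks` are hypothetical names.

```java
// Hypothetical per-task ID generator: task k of n produces k+1, k+1+n, k+1+2n, ...
// IDs from different tasks never collide, although they are not consecutive within one task.
class StridedSequence {
    private final int numTasks;
    private long next;

    StridedSequence(int taskIndex, int numTasks) {
        this.numTasks = numTasks;
        this.next = taskIndex + 1L; // task 0 starts at 1, task 1 at 2, and so on
    }

    long nextId() {
        long id = next;
        next += numTasks;
        return id;
    }

    public static void main(String[] args) {
        StridedSequence task0 = new StridedSequence(0, 2);
        StridedSequence task1 = new StridedSequence(1, 2);
        // task 0 yields 1, 3, 5 and task 1 yields 2, 4, 6
        for (int i = 0; i < 3; i++) {
            System.out.println("task0: " + task0.nextId() + "  task1: " + task1.nextId());
        }
    }
}
```

With a single task (index 0, one task in total) this degenerates to the plain 1, 2, 3, ... sequence that the UDF above produces.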

Author of this article: linger

This article link: http://blog.csdn.net/lingerlanlan/article/details/46430747



Copyright notice: This is an original article by the blogger and may not be reproduced without consent.
