How does Pig load files with different numbers of fields? (F1 has 42 columns, F2 has 43 columns, and both are read into one relation)

Source: Internet
Author: User

As mentioned in my earlier article, loading only some of a file's columns works: if a file has two columns and you read only one, there is no problem.

But suppose there are two files, F1 with 42 columns and F2 with 43 columns, and both are loaded into the same relation at once. What happens?

A: The files load successfully. However, DESCRIBE then reports an unknown schema: Schema for origin_cleaned_data unknown.

This is similar to UNION: when two relations with different columns are merged, the result is a relation with an unknown schema.


Background: the old log has 42 columns, and the new log adds one column at position 20, so from column 20 onward the same position no longer holds the same field. The total number of user clicks across all logs is required, so the old and new logs are loaded together for unified statistics.

(If you know which log format each date uses, you can load the two formats separately, give each an explicit schema, and then combine them with UNION ONSCHEMA for the statistics. Unfortunately, I took over this project and am not sure on which day the change went live.)
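The alternative described in the note above can be sketched as follows. This is only an illustration under assumptions: the field names (user_id, android_ad_id, event_time) and the shortened two- and three-column schemas are hypothetical stand-ins for the real 42- and 43-column layouts, which the article does not list.

```pig
-- Hypothetical, shortened schemas; the real logs have 42 and 43 columns.
old_log = LOAD '/home/wizad/LMJ/log_without.txt' USING PigStorage(',')
          AS (user_id:chararray, event_time:chararray);
new_log = LOAD '/home/wizad/LMJ/log_with_android_ad_id.txt' USING PigStorage(',')
          AS (user_id:chararray, android_ad_id:chararray, event_time:chararray);

-- Columns are matched by name; android_ad_id is null for rows from old_log.
all_logs = UNION ONSCHEMA old_log, new_log;
DESCRIBE all_logs;

-- Count every row of the combined relation.
grouped = GROUP all_logs ALL;
total   = FOREACH grouped GENERATE COUNT_STAR(all_logs) AS row_count;
DUMP total;
```

COUNT_STAR is used rather than COUNT so that rows whose first field is null are still counted.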

Sample: Old log log_without.txt, new log log_with_android_ad_id.txt

The code is as follows:

REGISTER piggybank.jar;
DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();

%default cleanedlog /user/wizad/tmp/log_*

-- %default cleanedlog1 /home/wizad/LMJ/log_without.txt
-- %default cleanedlog2 /home/wizad/LMJ/log_with_android_ad_id.txt

origin_cleaned_data = LOAD '$cleanedlog' USING PigStorage(',');

DUMP origin_cleaned_data;

DESCRIBE origin_cleaned_data;


Display result:

(Null), clerk, 575356365101899146,2014-07-30 10:33:56, 10:33:56, 2, 151.87.202.1, 1,-1,-1, LMJ,-1, clerk, clerk, 02: 00: 00: 00: 00: 00,1940064625594046032, d70cc494, 25100,206, 7.1, 2, 42.833298, 12.833298, 120232210032202)
(Null), 11:15:05, 11:15:05, 2, 155.128.32.119, 33052513139, large, LMJ, small, 40: 0e: 85: 40: 0e: 1A,-7537294162085162169, 7626e397, 62713,206, 4.3, 3, 37.774902, 122.4194,-023010203333003)
(Null) 5, 74, e7a4afce-ffd9-4ecd-b916-39f9d793c218, 207640323432175503,2014-07-30 10:29:22, 10:29:22, 2, 111.200.142.163, 1,-1,-1,-1, LMJ,-1, small, e7a4afce-ffd9-4ecd-b916-39f9d793c218, 02: 00: 00: 00: 00: 00,1179719885645020154, d4eeab6e, 66104,101, 7.1, 2, 7, 39.928894, 116.388306, 132100103322203)
(Null), clerk, 575356365101899146,2014-07-30 10:33:56, 10:33:56, 2, 151.87.202.1, 1,-1,-1, -1, hour, hour, 02: 00: 00: 00: 00: 00,1940064625594046032, d70cc494, 25100,206, 7.1, 2, 42.833298, 12.833298, 120232210032202)
(Null),-30 10:07:57, 10:07:57, 2, 56.2.20.220, 1,-1,-1,-1, -1, expires, 302bd8f1-b974-4af5-8183-1f67d270000d6, 02: 00: 00: 00: 00,-488564527359896578, 103b14d3, 25100,206, 7.1, 2 ,,,, 37.774902,-122.4194, 023010203333003)
(Null) 5, 74, e7a4afce-ffd9-4ecd-b916-39f9d793c218, 207640323432175503,2014-07-30 10:29:22, 10:29:22, 2, 111.200.142.163, 1,-1,-1,-1, -1, small, e7a4afce-ffd9-4ecd-b916-39f9d793c218, 02: 00: 00: 00: 00: 00,1179719885645020154, d4eeab6e, 66104,101, 7.1, 2, 39.928894, 7, 116.388306, 132100103322203)
Schema for origin_cleaned_data unknown.


The rows from the new log carry the added column (the value LMJ). As you can see, the relation has no schema.
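Even without a schema, the fields of such a relation can still be referenced by position. The following is a minimal sketch, assuming that only columns located before the inserted one are needed, since those keep the same position in both log formats. The names first_col and second_col are hypothetical.

```pig
-- Load both log formats together; the resulting schema is unknown.
origin_cleaned_data = LOAD '/user/wizad/tmp/log_*' USING PigStorage(',');

-- Positional references ($0, $1, ...) work without a schema.
-- Fields are bytearray by default, so cast them and name them explicitly.
leading_cols = FOREACH origin_cleaned_data
               GENERATE (chararray)$0 AS first_col,
                        (chararray)$1 AS second_col;

DESCRIBE leading_cols;
```

Columns at or after the inserted position cannot be handled this way, because the same position means different fields in the old and new logs.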


UNION: merging relations with different column formats

(UNION does not remove duplicate rows.)

A = LOAD 'input1' AS (x:int, y:float);
B = LOAD 'input2' AS (x:int, y:chararray);
C = UNION A, B;
DESCRIBE C;
Display result:
Schema for C unknown

To union two relations with different column names, use ONSCHEMA.

Note: ONSCHEMA requires every input relation to have an explicit schema; otherwise an error occurs. This is because UNION ONSCHEMA matches columns by name and reconciles their types (a narrower type is converted automatically to a wider one, for example float to double).

After merging, columns that are missing from an input are filled with null.

A = LOAD 'input1' AS (w:chararray, x:int, y:float);
B = LOAD 'input2' AS (x:int, y:double, z:chararray);
C = UNION ONSCHEMA A, B;
DESCRIBE C;
Result:
C: {w: chararray, x: int, y: double, z: chararray}
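The null filling can be made visible with a filter. This is a small sketch that reuses the same hypothetical inputs 'input1' and 'input2'. The relation name missing_z is an assumption for illustration.

```pig
A = LOAD 'input1' AS (w:chararray, x:int, y:float);
B = LOAD 'input2' AS (x:int, y:double, z:chararray);
C = UNION ONSCHEMA A, B;

-- Rows that came from A have no z column, so z is null for them.
-- Rows from B whose z value is empty in the data also match this filter.
missing_z = FILTER C BY z IS NULL;
DUMP missing_z;
```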

Here is an example of code where the two inputs cannot be combined:

%default cleanedlog1 /home/wizad/LMJ/log_without.txt
%default cleanedlog2 /home/wizad/LMJ/log_with_android_ad_id.txt

origin1 = LOAD '$cleanedlog1' USING PigStorage(',');
origin2 = LOAD '$cleanedlog2' USING PigStorage(',');

DESCRIBE origin1;
DESCRIBE origin2;

origin = UNION origin1, origin2;

Result:

DESCRIBE reports an unknown schema for both origin1 and origin2 (for example, Schema for origin2 unknown).

Therefore, origin cannot be generated.
