In Pig, limit can be used to pull a small sample of the data, but it has a number of problems. For example, the limit apparently has to be at least 10 records; otherwise all records are returned.
I ran into another problem today:
limit has no effect on grouped data: applying limit to the output of group returns everything. My guess is that limit cannot handle the group structure (not verified).
sample works better than limit. I tried it, and it also works on grouped data.
The test code is as follows:
origin_cleaned_data = load '$cleanedlog' as (...); -- schema omitted here
store origin_cleaned_data into '/user/wizad/tmp/origin_cleaned_data' using PigStorage(',');
describe origin_cleaned_data;
test_data = foreach origin_cleaned_data generate $0, $1, $2, $3, $4;
ts_limit = limit origin_cleaned_data 100; -- 100 records are returned
store ts_limit into '/user/wizad/tmp/My' using PigStorage(',');
g_log = group test_data by ($2, $4);
describe g_log;
alldata = limit g_log 10;
dump alldata; -- all data is returned; the limit has no effect
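One way to double-check this observation, rather than judging from the dump output (a few groups with very large bags can look like the whole data set), is to count the groups that survive the limit. This is only a minimal sketch, assuming it is appended to the script above; limited_all and limited_cnt are hypothetical alias names that are not part of the original test.

```pig
-- Collapse the limited relation into a single group so it can be counted.
limited_all = group alldata all;
-- COUNT only counts tuples whose first field is non-null; here the first
-- field of each row is the group key tuple.
limited_cnt = foreach limited_all generate COUNT(alldata);
dump limited_cnt; -- number of groups left after the limit
```

If the limit took effect, the count should be at most 10; a larger number would confirm that the limit was ignored.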
The returned group structure is as follows:
origin_cleaned_data:
{
    wizad_ad_id: chararray,
    guid: chararray,
    android_id: chararray,
    IMEI: chararray,
    app_category_id: chararray
}
g_log: {
    group: (android_id: chararray, app_category_id: chararray),
    test_data: {
        wizad_ad_id: chararray,
        guid: chararray,
        android_id: chararray,
        IMEI: chararray,
        app_category_id: chararray
    }
}
Testing sample:
test_data = foreach origin_cleaned_data generate wizad_ad_id, guid, android_id, IMEI, app_category_id;
g_log = group test_data by ($2, $4);
describe g_log;
myts = sample g_log 0.0001;
store myts into '/user/wizad/tmp/my' using PigStorage(',');
dump myts;
describe myts;
The result is as follows:
g_log: {
    group: (android_id: chararray, app_category_id: chararray),
    test_data: {wizad_ad_id: chararray, guid: chararray, android_id: chararray, IMEI: chararray, app_category_id: chararray}
}
myts: {
    group: (android_id: chararray, app_category_id: chararray),
    test_data: {wizad_ad_id: chararray, guid: chararray, android_id: chararray, IMEI: chararray, app_category_id: chararray}
}
((-1,),{..., (75,e77c7cd1-c2ae-4fdb-bc07-992235b97b58,-1,-1,),(74,f2752e0b-b5e6-4310-849a-12c5dd10a0c5,-1,-1,),(74,a9072546-3cd9-4215-9d0c-b0632d201505,-1,-1,),(74,68b5a933-f452-4850-8b93-59acd60b0e9a,-1,-1,), ...})
((a50a17bde79ac018,),{(74,863010025134441,a50a17bde79ac018,863010025134441,)})
((a51779f736cd3f54,),{(74,862949029595753,a51779f736cd3f54,862949029595753,)})
((c7ae5867-3b77-4987-b082-ed3867b5c384,),{(74,353627055387065,c7ae5867-3b77-4987-b082-ed3867b5c384,353627055387065,)})
Conclusion: in Pig, limit has no effect on grouped data (all records are returned), while sample works.
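If the goal is a small, readable preview of grouped data, sample can be combined with a limit inside a nested foreach, so that no single bag dominates the output the way the (-1,) group does above. The following is a sketch under assumptions, not a verified fix: it continues from g_log in the test above, the aliases sampled_groups, few and preview are hypothetical, and it assumes your Pig version accepts limit inside a nested foreach block.

```pig
-- Keep roughly 0.01% of the groups. sample is probabilistic, so the number
-- of groups returned varies from run to run.
sampled_groups = sample g_log 0.0001;

-- Cap each group's bag at 5 records; test_data is the bag field that
-- group creates from the grouped relation.
preview = foreach sampled_groups {
    few = limit test_data 5;
    generate group, few;
};

dump preview;
```

Because the cap is applied per group, the total output is bounded by the number of sampled groups times 5 records.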