Using awk to process large volumes of data in shell scripts


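First, the requirements

The original file a is about 1.7 GB and has five columns per line: a1 a2 a3 a4 a5. There are also two auxiliary files, b and c, each with only two columns (b1 b2 and c1 c2). Column a1 holds the same values as b1, and column a2 the same values as c1. For every line of a, the output should be a1, b2, c2, a3, a4, a5.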
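To make the requirement concrete, here is a small sketch with made-up data. It assumes that a has five whitespace-separated columns and that b and c have two each, which is the layout the scripts later in this article read. The file contents and the temporary directory are purely illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical miniature inputs; every value here is invented for illustration.
workdir=$(mktemp -d)

printf '%s\n' 'k1 x1 10 20 30' 'k2 x2 1 2 3' > "$workdir/a.txt"   # a1 a2 a3 a4 a5
printf '%s\n' 'k1 beijing' 'k2 shanghai'     > "$workdir/b.txt"   # b1 b2
printf '%s\n' 'x1 red' 'x2 blue'             > "$workdir/c.txt"   # c1 c2

# Desired result for each line of a.txt:
# a1, b2 (looked up by a1), c2 (looked up by a2), a3, a4, a5
printf '%s\n' 'k1,beijing,red,10,20,30' 'k2,shanghai,blue,1,2,3' > "$workdir/expected.txt"
cat "$workdir/expected.txt"
```

Each approach discussed in this article is a different way of producing lines like those in expected.txt from the three input files.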

Second, the idea

2.1 Before turning to awk, I considered processing the data with a while read ... done < a.txt loop, but it ran too slowly and did not meet expectations.

2.2 I then switched to awk, but the problem to solve first was how to pass an external array into awk, or how to feed the data from all three files into awk.

Third, the solution

3.1 Processing the data with while read ... done < a.txt

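declare -A b_array c_array    # associative arrays, bash 4 or later

# Load the b1 -> b2 lookup table
while read -r b1 b2
do
    b_array[$b1]=$b2
done < b.txt

# Load the c1 -> c2 lookup table
while read -r c1 c2
do
    c_array[$c1]=$c2
done < c.txt

# Look up b2 and c2 for every line of a.txt and append the result to d.txt
while read -r a1 a2 a3 a4 a5
do
    b2_value=${b_array[$a1]}
    c2_value=${c_array[$a2]}
    echo "$a1,$b2_value,$c2_value,$a3,$a4,$a5" >> d.txt
done < a.txt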

# This first method is fine when the amount of data is small.

3.2 Passing an external array into awk

With awk multi-file processing I ran into a problem I could not solve: after reading the three files into awk, I could only print them directly and could not perform any other operations on them. So I gave up on that approach and passed external arrays into awk instead, using the following code that I found:

awk -v s1="${time[*]}" -v s2="${!time[*]}" '
BEGIN {
    # s1 holds the array values and s2 the keys, both space-separated
    split(s1, s3, " ")
    n = split(s2, s4, " ")
    for (i = 1; i <= n; i++)
        res[s4[i]] = s3[i]
}'

(Reference blog: http://sunlujing.iteye.com/blog/1918907)
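The trick relies on bash expanding "${name[*]}" to the values and "${!name[*]}" to the keys of an array, in the same order. The following is a minimal sketch of that idea on a tiny array; the array name lookup and its contents are hypothetical, and it assumes bash 4 or later and that no key or value contains whitespace.

```shell
#!/usr/bin/env bash
# Hypothetical sample array; associative arrays need bash 4 or later.
declare -A lookup=([cn]=China [us]=America)

# Values and keys are each joined by spaces and passed in as awk variables.
result=$(awk -v vals="${lookup[*]}" -v keys="${!lookup[*]}" '
BEGIN {
    n = split(keys, k, " ")
    split(vals, v, " ")
    # Pair the i-th key with the i-th value to rebuild the array inside awk
    for (i = 1; i <= n; i++)
        res[k[i]] = v[i]
    # Look entries up by key to show the mapping survived the trip
    print "cn=" res["cn"], "us=" res["us"]
}')
echo "$result"
```

Because the pairs are matched purely by position, a key or value with a space in it would shift everything after it, so this sketch only suits simple tokens.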

The code that finally processes the data is as follows:

declare -A b_array c_array    # associative arrays, bash 4 or later

while read -r b1 b2
do
    b_array[$b1]=$b2
done < b.txt

while read -r c1 c2
do
    c_array[$c1]=$c2
done < c.txt

# "${arr[*]}" passes the values and "${!arr[*]}" the keys, space-separated and in matching order
awk -v OFS="," -v s1="${b_array[*]}" -v s2="${!b_array[*]}" -v w1="${c_array[*]}" -v w2="${!c_array[*]}" '
BEGIN {
    # Rebuild the b lookup table inside awk
    split(s1, s3, " ")
    n = split(s2, s4, " ")
    for (i = 1; i <= n; i++)
        b_new_array[s4[i]] = s3[i]
    # Rebuild the c lookup table inside awk
    split(w1, w3, " ")
    n = split(w2, w4, " ")
    for (i = 1; i <= n; i++)
        c_new_array[w4[i]] = w3[i]
}
{
    # $1 to $5 are a1 to a5
    b2_value = b_new_array[$1]
    c2_value = c_new_array[$2]
    print $1, b2_value, c2_value, $3, $4, $5
}' a.txt >> d.txt

# Execution time is approximately 4 minutes.
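For comparison, awk can also load lookup files itself when several files are given on the command line, using the built-in FILENAME variable to tell them apart. This is not the method used in this article, only a hedged sketch of that alternative; the sample data, the temporary directory, and the variable names bfile and cfile are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical miniature inputs standing in for the real a.txt, b.txt and c.txt.
workdir=$(mktemp -d)
printf '%s\n' 'k1 beijing' 'k2 shanghai'     > "$workdir/b.txt"   # b1 b2
printf '%s\n' 'x1 red' 'x2 blue'             > "$workdir/c.txt"   # c1 c2
printf '%s\n' 'k1 x1 10 20 30' 'k2 x2 1 2 3' > "$workdir/a.txt"   # a1 a2 a3 a4 a5

# awk reads the files in the order given. FILENAME names the file the
# current line came from, so both lookup tables are filled before a.txt.
joined=$(awk -v OFS="," -v bfile="$workdir/b.txt" -v cfile="$workdir/c.txt" '
FILENAME == bfile { b[$1] = $2; next }    # load b1 -> b2
FILENAME == cfile { c[$1] = $2; next }    # load c1 -> c2
{ print $1, b[$1], c[$2], $3, $4, $5 }    # remaining lines come from a.txt
' "$workdir/b.txt" "$workdir/c.txt" "$workdir/a.txt")
echo "$joined"
```

With this layout the shell loops and the -v array passing are not needed, since the lookup tables never leave awk.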

3.3 Then a new requirement came up

The new requirement is to group the rows by a1, b2, and c2 and to sum a3, a4, and a5 within each group. In SQL this could be done directly with GROUP BY.

Reference blog: http://linuxguest.blog.51cto.com/195664/424496 (SQL-like data processing with awk)
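The GROUP BY idea maps onto awk arrays: the grouping columns are concatenated into one key, and one array per summed column accumulates totals under that key. Here is a minimal sketch on made-up rows that are assumed to be already joined into the form a1,b2,c2,a3,a4,a5; the data and the names rows, sum3, sum4, and sum5 are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical already-joined rows: a1,b2,c2,a3,a4,a5
rows='k1,beijing,red,10,20,30
k1,beijing,red,1,2,3
k2,shanghai,blue,5,5,5'

# The first three fields form the group key, like GROUP BY a1, b2, c2.
# Each array plays the role of one SUM() column.
summed=$(printf '%s\n' "$rows" | awk -F "," '
{
    key = $1 "," $2 "," $3
    sum3[key] += $4
    sum4[key] += $5
    sum5[key] += $6
}
END {
    for (key in sum3)
        print key "," sum3[key] "," sum4[key] "," sum5[key]
}' | sort)    # for-in order is unspecified, so sort for stable output
echo "$summed"
```

Memory use grows with the number of distinct groups rather than with the number of input lines, which is what makes this workable on a large file.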

The code then becomes the following:

declare -A b_array c_array    # associative arrays, bash 4 or later

while read -r b1 b2
do
    b_array[$b1]=$b2
done < b.txt

while read -r c1 c2
do
    c_array[$c1]=$c2
done < c.txt

awk -v s1="${b_array[*]}" -v s2="${!b_array[*]}" -v w1="${c_array[*]}" -v w2="${!c_array[*]}" '
BEGIN {
    # Rebuild the b lookup table inside awk
    split(s1, s3, " ")
    n = split(s2, s4, " ")
    for (i = 1; i <= n; i++)
        b_new_array[s4[i]] = s3[i]
    # Rebuild the c lookup table inside awk
    split(w1, w3, " ")
    n = split(w2, w4, " ")
    for (i = 1; i <= n; i++)
        c_new_array[w4[i]] = w3[i]
}
{
    # $1 to $5 are a1 to a5
    b2_value = b_new_array[$1]
    c2_value = c_new_array[$2]
    # Group key a1,b2,c2; each array sums one column per group
    key = $1 "," b2_value "," c2_value
    a3_array[key] += $3
    a4_array[key] += $4
    a5_array[key] += $5
}
END {
    for (key in a3_array)
        print key "," a3_array[key] "," a4_array[key] "," a5_array[key]
}' a.txt >> e.txt


This article is from the "three countries Cold jokes" blog; please be sure to keep this source attribution: http://hwj91.blog.51cto.com/9763975/1698470
