Document directory
- 1.1 questions
- 1.2 awk Introduction
- 1.3 magic moments
- 1.4 run
While passing the time on an airplane with the book sed & awk, I was reminded of a question a netizen once raised:
1 Sort by the length of the specified column
1.1 questions
This netizen has a dictionary file (in.txt) in which each line maps an input code to a Chinese phrase (the phrases appear here in English translation):
w=I
bm=standard
ceq=Chen
wm=We
nnyl=nu
wm,=We
djh=Hello
tdmd=They
tzm=comrades
tzm,=comrades
djhnv=Hello everyone
ppaa=Ping An an
tzdv=comrades
ppaa,=Ping An an
wrmfw=serving the people
gzrzf=work conscientiously and responsibly
He wants to sort the dictionary by the following rule: treating = as the column separator, sort first by the length of the second column, then by the length of the first column, then by the encoding of the second column, and finally by the encoding of the first column. For the example above, the sorted result is:
w=I
bm=standard
ceq=Chen
nnyl=nu
wm=We
wm,=We
tdmd=They
djh=Hello
tzm=comrades
tzm,=comrades
djhnv=Hello everyone
tzdv=comrades
ppaa=Ping An an
ppaa,=Ping An an
wrmfw=serving the people
gzrzf=work conscientiously and responsibly
My first suggestion was Excel, but netizens pointed out that Excel is too slow and cannot handle hundreds of thousands of phrases. Can we solve this problem with awk?
1.2 awk Introduction
Awk is a magical text processing tool. It processes input text line by line according to our commands. We write the commands in a script file. For example, in a file named test.awk, write a single command:
{print $2}
Then recite the spell and let awk execute our command:
awk -F= -f test.awk in.txt
We will get the following output:
I
standard
Chen
We
nu
We
Hello
They
comrades
comrades
Hello everyone
Ping An an
comrades
Ping An an
serving the people
work conscientiously and responsibly
Awk treats text as a table of data. Each line holds several data items separated by a delimiter. The default delimiter is a space or tab. The -F option on the command line specifies a different delimiter; for example, "-F=" tells awk to use "=" as the separator. The -f option specifies the script file to execute.
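As a small illustration of the two options, the sketch below runs the same one-line program once from a script file (-f) and once inline, and also shows what happens without -F=. The three-line sample file, the temporary directory, and the variable names are made up for this example and stand in for in.txt.

```shell
# Hypothetical three-line ASCII sample standing in for in.txt.
workdir=$(mktemp -d)
printf 'w=I\nbm=mark\nwm=we\n' > "$workdir/sample.txt"

# The same program as test.awk, stored in a file and loaded with -f.
printf '{print $2}\n' > "$workdir/test.awk"
from_file=$(awk -F= -f "$workdir/test.awk" "$workdir/sample.txt")

# The program can also be given inline on the command line.
inline=$(awk -F= '{print $2}' "$workdir/sample.txt")

# Without -F=, awk splits on spaces and tabs. These lines contain neither,
# so $1 holds the whole line.
default_fs=$(awk '{print $1}' "$workdir/sample.txt")

printf '%s\n' "$from_file"
rm -rf "$workdir"
```

Both invocations print the second column; only the place where the program text lives differs.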
Now look at the command itself. A command is placed inside {}. Before the {}, you can specify the scope of the command, that is, the lines on which it runs. If no scope is given, the command runs on every line. Awk uses $1 for the first data item of each line, $2 for the second, and so on; $0 refers to the entire line. "print $2" prints the second item, which gives the result above.
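To make the idea of a scope concrete, here is a minimal sketch with three kinds of patterns placed before the braces: a condition on a field, a line-number range using awk's built-in counter NR, and a regular expression. The sample data and variable names are invented for illustration.

```shell
# Hypothetical sample data, one code=phrase pair per line.
sample='w=I
bm=mark
wm=we
djh=hello'

# Condition on a field: act only on lines whose code is longer than one character.
long_codes=$(printf '%s\n' "$sample" | awk -F= 'length($1) > 1 {print $1}')

# Line-number range: act only on lines 2 through 3 and print the whole line ($0).
middle=$(printf '%s\n' "$sample" | awk 'NR >= 2 && NR <= 3 {print $0}')

# Regular expression: act only on lines that start with "w".
starts_w=$(printf '%s\n' "$sample" | awk '/^w/ {print $0}')

printf '%s\n' "$long_codes"
```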
Let's issue a new command by changing test.awk to:
{print length($2) "\t" $2}
Recite the spell again:
awk -F= -f test.awk in.txt
This time we got:
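1    I
1    standard
1    Chen
2    We
1    nu
2    We
3    Hello
2    They
3    comrades
3    comrades
3    Hello everyone
4    Ping An an
3    comrades
4    Ping An an
5    serving the people
6    work conscientiously and responsibly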
The length function returns the length of a string, so length($2) is the length of the second item. "\t" is a tab. Check for yourself that the number in the first column really is the length of the Chinese phrase.
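The sketch below repeats the length command on a made-up ASCII sample so the result does not depend on the environment. One caveat to keep in mind: how length counts multi-byte text such as Chinese depends on the awk implementation and the locale (some implementations count bytes rather than characters), so it is worth checking the numbers on your own system.

```shell
# Hypothetical ASCII sample; each phrase has an obvious length.
sample='w=a
wm=ab
djh=abc'

# "\t" inside an awk string is a tab character.
lengths=$(printf '%s\n' "$sample" | awk -F= '{print length($2) "\t" $2}')

printf '%s\n' "$lengths"
```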
Awk also supports a printf function similar to the one in C, which lets us control the output precisely. For example, change test.awk to:
{printf("%02d%02d\t%s\t%s\n", length($2), length($1), $2, $1)}
If you are familiar with the printf function, this command is easy to understand. The output is as follows:
0101    I    w
0102    standard    bm
0103    Chen    ceq
0202    We    wm
0104    nu    nnyl
0203    We    wm,
0303    Hello    djh
0204    They    tdmd
0303    comrades    tzm
0304    comrades    tzm,
0305    Hello everyone    djhnv
0404    Ping An an    ppaa
0305    comrades    tzdv
0405    Ping An an    ppaa,
0505    serving the people    wrmfw
0605    work conscientiously and responsibly    gzrzf
We have prepended a column to the original text. Each entry in this column has four digits: the first two are the length of the Chinese phrase, and the last two are the length of its code. The phrase and its code follow, in that order.
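The following sketch builds the same four-digit key on a made-up ASCII sample and then shows why the zero padding in %02d matters: text sorting compares character by character, so an unpadded 10 would sort before 9. The sample data and variable names are assumptions for illustration, and LC_ALL=C is used here only to make the comparison byte-based and repeatable.

```shell
# Hypothetical sample data.
sample='wm=ab
w=a
djh=abc
wm,=ab'

# Prefix each line with phrase length and code length, two digits each.
keyed=$(printf '%s\n' "$sample" |
  awk -F= '{printf("%02d%02d\t%s\t%s\n", length($2), length($1), $2, $1)}')

# Without padding, "10" sorts before "9" as text; with padding, "09" comes first.
unpadded_first=$(printf '9\n10\n' | LC_ALL=C sort | head -n 1)
padded_first=$(printf '09\n10\n' | LC_ALL=C sort | head -n 1)

printf '%s\n' "$keyed"
```

With two digits per length, the key stays correct for phrases and codes of up to 99 characters.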
1.3 magic moments
We prepare another command file, output.awk, with the following content:
{print $3"="$2}
Can you work out what this command does? Now we recite one long spell:
awk -F= -f test.awk in.txt|sort|awk -f output.awk>output.txt
The output file output.txt has the following content:
w=I
bm=standard
ceq=Chen
nnyl=nu
wm=We
wm,=We
tdmd=They
djh=Hello
tzm=comrades
tzm,=comrades
djhnv=Hello everyone
tzdv=comrades
ppaa=Ping An an
ppaa,=Ping An an
wrmfw=serving the people
gzrzf=work conscientiously and responsibly
This is exactly what we need. The meaning of "awk -F= -f test.awk in.txt" was explained earlier. Its output is passed to sort through the pipe "|". After sorting its input, sort passes the result through another pipe to "awk -f output.awk". The command print $3"="$2 prints the third and second columns of its input, separated by "=". Finally, ">" redirects the output of "awk -f output.awk" to output.txt, and we get the result above.
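The whole pipeline can be tried end to end on a small made-up sample, with both awk programs written inline instead of in test.awk and output.awk. This is a sketch under a few assumptions that differ from the article's command: the sample data is invented, LC_ALL=C forces sort to compare by byte value, and the last stage uses -F'\t' so that phrases containing spaces would still be treated as one column.

```shell
# Hypothetical sample data, deliberately out of order.
sample='wm=ab
w=a
tzm=xyz
djh=abc
wm,=ab
bm=b'

sorted=$(printf '%s\n' "$sample" |
  # Stage 1: prepend the four-digit length key, then phrase, then code.
  awk -F= '{printf("%02d%02d\t%s\t%s\n", length($2), length($1), $2, $1)}' |
  # Stage 2: a plain text sort orders by key, then phrase, then code.
  LC_ALL=C sort |
  # Stage 3: drop the key and restore the code=phrase layout.
  awk -F'\t' '{print $3 "=" $2}')

printf '%s\n' "$sorted"
```

Here djh=abc and tzm=xyz share the key 0303, so their order is decided by the phrase column, which is the third criterion in the rule.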
1.4 run
I have put the tools and examples used in this article on my homepage (Download). After decompressing the package, go to the sortit directory on the command line and run:
awk -F= -f test.awk in.txt|lsort|awk -f output.awk>output.txt
You will get the result described above. On Windows, to avoid a clash with the built-in Windows sort command, I renamed the sort program to lsort.
2 Conclusion
Awk is intermediate-level magic for text processing, and anyone who has studied a little magic can wield it. A book on this magic is called sed & awk. I read the first two chapters, and then solved this small text processing problem with the quick reference at the end of the book and a few attempts.
Besides awk, this magic book also introduces another intermediate-level magic called sed. Awk and sed are both tools for processing text streams: text flows through them and is reshaped according to our commands. Where awk is a programming language dedicated to processing tabular text, sed lets us write many editing commands in advance into a file called a script and then run that script against one or more files. Sed is good at bulk search and replace. With the help of sort, grep, and other basic magic, we can easily complete a great deal of text processing work.
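For comparison, here is a minimal sed sketch: two substitution commands are written in advance into a script file, and the script is then run against a text file. The file names, the sample data, and the replacement strings are invented for this example.

```shell
# Hypothetical sample file and sed script in a temporary directory.
workdir=$(mktemp -d)
printf 'wm=we\ntzm=comrades\n' > "$workdir/sample.txt"

# Each line of the script is one editing command; s/old/new/ replaces
# the first match of "old" on every input line.
printf 's/=/ -> /\ns/comrades/friends/\n' > "$workdir/edit.sed"

# -f tells sed to read its commands from the script file.
edited=$(sed -f "$workdir/edit.sed" "$workdir/sample.txt")

printf '%s\n' "$edited"
rm -rf "$workdir"
```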