Sed cultivation series (III): Implementing the sliding window technique, an advanced sed application
Contents:
1. What is the sliding window technique?
2. Implementing a sliding window
2.1 Sliding the window with the "s" command
2.2 Saving the window in the hold space
2.3 Replacing the window-maintenance command "s" with "D"
2.4 The real big move
2.5 Window maintenance methodology
3. Best partners: the "N", "P", and "D" commands
1. What is the sliding window technique?
A picture is worth a thousand words.
In the picture, the file manager pane is tall enough to show exactly 10 file names. To display the 11th line you have to drag the scroll bar down by one row. The 11th line then becomes visible, but the oldest line is pushed out of the visible window.
This is roughly what a sliding window means: maintain a window, and whenever new data enters it, remove old data so that the window keeps a fixed size. There are also dynamic windows whose size is not fixed. In that case other rules decide whether to remove old data and which old data to remove.
In advanced sed usage, window sliding plays a very important role, and it is an important reason why the "N", "D", and "P" commands exist. Sed uses the pattern space as its "main battlefield" and the hold space as its "secondary battlefield". The window can therefore only be maintained in the pattern space, because only there can sed decide whether to drop old data and which old data to drop. In many cases, however, the window is saved to the hold space after each operation and fetched back from the hold space for maintenance in the next cycle.
The following examples show how the window technique is used.
2. Implementing a sliding window
Suppose you want to output the first 10 lines of the file a.txt. This is simple:
sed '10q' a.txt
Because the "q" command prints the pattern space before exiting sed, the 10th line is output as well. With the "Q" command, which quits without printing, it should be:
sed 11Q a.txt
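A quick check of the two forms, as a minimal sketch that assumes GNU sed (the "Q" command is a GNU extension) and uses seq output in place of a.txt:

```shell
# Assumes GNU sed; seq stands in for the file a.txt.
first10_q=$(seq 1 25 | sed '10q')   # q prints line 10, then quits
first10_Q=$(seq 1 25 | sed '11Q')   # Q quits at line 11 without printing it
printf '%s\n' "$first10_q"
if [ "$first10_q" = "$first10_Q" ]; then
  echo "both forms print the same 10 lines"
fi
```

Both variables should hold lines 1 through 10.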
Now the problem arises: how can we output the last 10 lines? The last 15? The last 20? In other words, how can we implement the view-from-the-end function of the tail tool?
2.1 Sliding the window with the "s" command
Sed is a stream processor that never goes back: once part of the input stream has been read, it is never read again.
Furthermore, sed counts lines on the fly with a line-number counter that is incremented by 1 every time a line is read. While reading, sed therefore knows neither the total number of lines nor how far away the last line is. Only when the last line of the input stream has been read does sed mark it with "$", meaning "this is the last line". "$" is only a marker, not a line number. The line number is recorded only in the counter. So you cannot compute the last few lines from "$"; for example, "$-1" is invalid. This means sed cannot directly address lines counted from the end, and you must use a window to output them.
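The point about "$" can be tried out directly. The sketch below uses seq output as stand-in data, and the variable names are only illustrative:

```shell
# "$" addresses the last line; it is a marker, not a number.
last=$(seq 1 5 | sed -n '$p')    # prints the last line
count=$(seq 1 5 | sed -n '$=')   # "=" prints the current line number
# Arithmetic on "$" is not valid sed syntax, so this invocation fails.
if seq 1 5 | sed -n '$-1p' 2>/dev/null; then
  arith=accepted
else
  arith=rejected
fi
echo "last=$last count=$count arithmetic=$arith"
```

The total line count only becomes available at the last line, which is exactly why a window is needed.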
Since line numbers have come up, a side note: matching a line number is much more efficient than matching a regular expression. The line number is kept in memory, so sed only has to compare the number in the address with the value in the counter. A regular expression has to be compiled and then matched, and the matching efficiency of sed's regular expression engine is not as good as one might imagine, especially when ".*" is combined with other expressions. So when processing many files in batches, especially large ones, use line numbers wherever you can.
Back to the topic. To output the last 10 lines, the window keeps a fixed size of 10 lines. Each time a line is read, use the "$" address to check whether it is the last line. If it is, output the window; otherwise do not.
Problems like this usually use the hold space to store the window temporarily, but the pattern space alone can also maintain a window with a fixed number of lines, as follows:
#!/usr/bin/sed -nf

# read 8 lines with N; together with the automatically read line that makes 9
N;N;N;N;N;N;N;N

# read the next line; if it is not the last line, drop the first line and loop
:1
N
$!s/[^\n]*\n//
t1

# after the last line has been read, output the contents of the window
p
To keep a fixed-size window in the pattern space alone, everything has to happen within a single sed cycle, because the pattern space is cleared at the end of each cycle. That is why a label is needed to loop. In the script above, 8 lines are read with "N" in addition to the line sed reads automatically, so the pattern space holds 9 lines. The labeled loop then reads one more line, which fills the 10-line window, and checks whether it is the last line. If it is not, the "s" command removes the first line of the window, leaving 9 lines, and "t" jumps back to the label. When the last line has been read, the substitution is skipped, "t" does not jump, and the 10 lines in the final window are output.
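To try the loop without creating a script file, it can be wrapped in a shell function. This is only a sketch: it assumes GNU sed (for "\n" inside a bracket expression), the function name last10_pattern_only is made up for illustration, and the label is called "a" here.

```shell
# Hypothetical helper: print the last 10 lines of stdin using only the pattern space.
last10_pattern_only() {
  sed -n '
    # 8 reads plus the automatic one: 9 lines in the pattern space
    N;N;N;N;N;N;N;N
    :a
    # the 10th line of the window
    N
    # not the last line yet: drop the oldest line, then loop
    $!s/[^\n]*\n//
    ta
    # last line reached: print the window
    p
  '
}
out=$(seq 1 25 | last10_pattern_only)
printf '%s\n' "$out"
```

With 25 input lines this should print lines 16 through 25. Input shorter than 10 lines prints nothing, because "N" runs out of input and "-n" suppresses the automatic output.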
Since the window yields the last 10 lines, outputting only the 10th line from the end is just as easy: change the "p" above to an upper-case "P".
But what if I want the last 15, 20, or even 50 lines? Do I have to write it this way? Apart from having to write a lot of "N" commands, efficiency drops quickly when the window gets large. For example, to output the last 20 lines of a 1000-line file, every pass through the loop has to work on a pattern space of about 20 lines, which is roughly equivalent to processing 1000*20=20000 lines. Line-number addresses such as "$" only compare against the counter value, so they cause no such concern, but regular expression matching against the pattern space gets slower quickly.
This is where the hold space comes in. Although the hold space simplifies the processing logic, exchanging data between the two buffers costs some performance. In general, working in one buffer is more efficient than working in two, and the gap becomes obvious when many large files are processed and a lot of data is exchanged.
2.2 Saving the window in the hold space
The idea of the example above is to read an initial number of lines to fill the window and then, once the window is about to reach its target size, use a labeled loop to keep sliding a fixed-size window.
With the hold space, you can instead save the window after each slide and, when the next line is read, append that line to the saved window and process it again.
For example, the preceding example can be implemented with the hold space as follows:
#!/usr/bin/sed -nf
H
10,${g;s/[^\n]*\n//;h}
$p
With the hold space, the process no longer has to be completed within one sed cycle. Each line that sed reads automatically is appended to the hold space by "H", which fills the window. From line 10 on, the window is pulled from the hold space back into the pattern space with "g", slid with the "s" command, and put back into the hold space with "h". After the last line has been processed, the contents of the final window are output.
Note that the address above is 10 rather than 11. This is a consequence of "H": every time it runs, it appends a "\n" to the end of the hold space and then the pattern space, even when the hold space is still empty. So after the 10th line has been read and "H" executed, the hold space contains 11 lines, the first of which is empty.
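Both points can be checked with a short sketch, assuming GNU sed. The "l" command prints the pattern space unambiguously, so the leading "\n" that "H" adds becomes visible. The function name last10_hold is made up for illustration.

```shell
# Assumes GNU sed ("\n" inside a bracket expression is a GNU extension).
# 1) H appends "newline + pattern space", even to an empty hold space.
hold_view=$(printf 'a\nb\n' | sed -n 'H;${g;l;}')
printf '%s\n' "$hold_view"

# 2) The hold-space window as a function reading stdin (hypothetical name).
last10_hold() {
  sed -n '
    H
    10,${
      g
      s/[^\n]*\n//
      h
    }
    $p
  '
}
tail_hold=$(seq 1 25 | last10_hold)
printf '%s\n' "$tail_hold"
```

The first command should print \na\nb$, showing the empty first line in the hold space. The second should print lines 16 through 25.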
2.3 Replacing the window-maintenance command "s" with "D"
Consider what the "D" command does: it deletes the first line of the pattern space and restarts the script cycle without reading new input. Only when the pattern space contains no newline does "D" behave like "d" and start a normal new cycle. In effect, "D" keeps cycling through the script on whatever is left in the pattern space.
Two simple examples illustrate "D": squeezing consecutive empty lines into one, and collapsing adjacent duplicate lines into one.
echo -e "1\n2\n\n3\n4\n\n\n5" | sed '$!N;/^\n$/!P;D'
echo -e "1\n2\n3\n3\n4\n4\n4\n4\n4\n5" | sed -r '$!N;/^(.*)\n\1$/!P;D'
Both commands follow the window model (at least I think so: ever since I learned the concept, I have treated every command sequence involving "N", "D", and "P" as a window, which makes them easy to understand). "N" reads the next line into the window. If the two lines in the window are not duplicates, "P" outputs the first line. Either way "D" then removes the first line, which slides the window, and the next script cycle begins. While the window holds duplicate lines, the two-line window keeps sliding without output, until the last line of the run of duplicates is output.
This also shows that even a one- or two-line pattern space is a window, just one that is maintained flexibly.
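To actually see the two-line window slide, the "P" step can be replaced by "l", which prints the whole pattern space with newlines shown as "\n". A small sketch, assuming GNU sed:

```shell
# Print the window contents on every cycle instead of only its first line.
windows=$(seq 1 4 | sed -n '$!N;l;D')
printf '%s\n' "$windows"
```

This should print four lines: 1\n2$, 2\n3$, 3\n4$, and finally 4$. Each "D" drops the oldest line and restarts the script, and "N" pulls in the next one.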
For the second command, which removes duplicate lines, the process is roughly as follows, where "d" means the line is not output by "P" and is deleted by "D", and "p" means it is output by "P" and then deleted by "D":
1p2p3d3p4d4d4d4d4p5p
Back to the topic. The "D" command has a special conditional script loop built in, whereas the "s" command can only loop by means of a label and a branch, unless the hold space is used. Because "D" removes the oldest data in the window, it can also be used to slide the window.
Using "D" simplifies the script, but it cannot fully replace "s" for window maintenance. The loop range of "D" is fixed to the entire script (as in the duplicate-line example above, it always returns to the top of the script), while "s" with a label and a branch can loop over a range of any size. "D" is therefore less general than "s" for sliding a window.
Taking the output of the last 10 lines as the example again, the implementation with "D" is as follows:
#!/usr/bin/sed -nf
1h
2,10H
10g
$p
10,${N;D}
The first three commands only fill the pattern space with a 10-line window (there are many ways to fill the window, so it is enough to know what these three steps are for), and they have no effect from line 11 on. The part to focus on is 10,${N;D}.
This is an excellent combination of "N" and "D". With "N;D" it is easy to maintain a fixed-size window: read a line, delete a line. Nothing is output until "N" reads the last line, at which point "$p" prints the window. "$p" has to be placed before "N;D" because "D" always returns to the top of the script, so commands after it would never be executed.
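The "N;D" version can be exercised the same way. A sketch assuming GNU sed, with a made-up function name:

```shell
# Hypothetical helper: last 10 lines of stdin, window maintained with N;D.
last10_nd() {
  sed -n '
    1h
    2,10H
    10g
    $p
    10,${N;D}
  '
}
long_input=$(seq 1 25 | last10_nd)    # more than 10 lines
exact_input=$(seq 1 10 | last10_nd)   # exactly 10 lines
printf '%s\n' "$long_input"
```

The first call should print lines 16 through 25. The second prints all 10 lines: "$p" fires on line 10, and the "N" that follows finds no more input, so sed exits.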
2.4 The real big move
So the window is that easy to use? Don't let sed dazzle you. To be honest, meeting complicated requirements with sed alone is neither simple nor effortless. Combining it with other text processing tools is much simpler.
Outputting the last 10 lines, as in the preceding examples, becomes very simple with the help of a shell variable.
total=`wc -l <filename`
sed -n $((total-9))',$p' filename
This way it is very easy to output any number of lines from the end. The above is only an example: combined with other tools, sed can meet complicated requirements easily. That is the final "big move".
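The same idea generalizes to any count. The sketch below wraps it in a function. The name last_n, the clamping to line 1 for short files, and the temporary test file are my own additions for illustration, not part of the original example.

```shell
# Hypothetical helper: last_n FILE N prints the last N lines of FILE.
last_n() {
  file=$1
  n=$2
  total=$(wc -l < "$file")
  start=$((total - n + 1))
  # Clamp to line 1 so short files do not produce an invalid address.
  if [ "$start" -lt 1 ]; then
    start=1
  fi
  sed -n "${start},\$p" "$file"
}

tmp=$(mktemp)
seq 1 25 > "$tmp"
last10=$(last_n "$tmp" 10)
last50=$(last_n "$tmp" 50)   # more than the file holds: whole file
rm -f "$tmp"
printf '%s\n' "$last10"
```

Here sed only has to compare line numbers, so no window is needed at all.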
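The placement of the quotation marks in the example above looks strange. It is a common stumbling block when combining sed with the shell. Many people have run into it and good explanations are hard to find online, because it is not a sed problem but a feature of shell parsing: the arithmetic expansion $((total-9)) must be left outside the single quotes so that the shell expands it, while ',$p' is single-quoted so that the shell passes "$p" to sed untouched instead of treating it as a variable.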
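A tiny sketch makes the shell side of this visible by building the sed script as a string, without running sed at all (the variable names are only illustrative):

```shell
total=25
# Style used above: the arithmetic is unquoted, ',$p' is single-quoted.
script_a=$((total-9))',$p'
# Equivalent: one double-quoted word, with the "$" of the address escaped.
script_b="$((total-9)),\$p"
printf '%s\n' "$script_a" "$script_b"
```

Both variables should contain 16,$p, which is what sed finally receives as its script.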
2.5 Window maintenance methodology
Based on the examples above, window maintenance falls into two cases:
1. Once the window has been filled, it is maintained in the pattern space alone.
2. The window is saved to the hold space after every slide and fetched back in the next cycle.
The first case looks simpler, and it is: no data exchange between the two buffers is needed. It is also more efficient, because once the window has reached its size there is nothing to exchange with the hold space.
3. Best partners: the "N", "P", and "D" commands
The window sliding examples above show that "N" and "D" are perfect partners: together they slide the window neatly, and the logic is clearer than storing the window temporarily in the hold space.
What about the other combinations of "N", "D", and "P"? There is little to say about "N" with "P" alone, because the possible uses are too varied.
The combination of "P" and "D", however, has a fixed meaning: depending on a match, decide whether to output the first line of a multi-line pattern space, then delete that line and return to the top of the script. This usually amounts to a window two lines in size. The usual form is:
[Address]P;D
Combined with "N", the purpose is even clearer: maintain a window, decide whether to output its first line, and then slide the window. The usual form is:
[Address1]N;[Address2]P;D
In many cases the addresses are omitted. The address before "P" is the condition that decides whether the first line is output. If Address1 before "N" is present, it is most likely "$!", which means: once the last line has been read (whether by sed's automatic read, by the "n" command, or by the "N" command), skip this command. Without "$!", when there is no next line to read, "N" makes sed output the pattern space (unless "-n" is specified) and exit.
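The three behaviours of "N" on the last line can be compared side by side. This sketch assumes GNU sed running without POSIXLY_CORRECT set. It joins line pairs with "+" so that it is obvious which commands ran.

```shell
# Three input lines, so "N" runs out of input on line 3.
no_n=$(seq 1 3 | sed 'N;s/\n/+/')             # N fails: line 3 is auto-printed
with_n=$(seq 1 3 | sed -n 'N;s/\n/+/;p')      # N fails: -n swallows line 3
guarded=$(seq 1 3 | sed -n '$!N;s/\n/+/;p')   # N skipped: p still runs
printf 'no -n:   %s\n' "$no_n"
printf 'with -n: %s\n' "$with_n"
printf 'with $!: %s\n' "$guarded"
```

The first and third variants should output "1+2" and "3". The second should output only "1+2".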
Whether "$!" is added before "N" has a great impact on the result, yet it is hard to give a rule for when to add it. At the very least, you must be familiar with how and when sed outputs the pattern space. See the first article of this series, the getting-started article.
Finally, a word about "N" and odd or even line counts. When "N" cannot read a next line, outputs the pattern space, and exits sed, does it matter whether the last line is odd- or even-numbered? Many people (myself included) have wondered about this and been confused for a long time. The issue is even covered in the bug reporting section of the info sed manual.
In fact, the parity of the last line itself does not matter. Sed only remembers whether the last line has already been read. If it has, sed exits at the "N" command. So adding "$!" or not determines whether sed exits at "N" or goes on to execute the commands after it. There is one situation where parity must be considered: when "N" is combined with other operations that read lines (the "n" command or sed's automatic read), because every other read changes the parity of the lines that "N" reads.
For example, suppose you want to output the odd-numbered lines and the even-numbered lines of the input separately. This is easy to do with the window model.
seq 1 10 | sed 'N;P;d'            # outputs the odd-numbered lines
seq 1 10 | sed '1!{N;P};d'        # outputs the even-numbered lines
seq 1 10 | sed -n '1!{N;P;d}'     # outputs the even-numbered lines, but parity must be considered
seq 1 10 | sed -n '1!{$!N;P;d}'   # outputs the even-numbered lines regardless of parity
The first two commands are easy to understand, but the third has to take parity into account. Because the window is filled starting from the second line, the first line of the window is always even-numbered before it is deleted by "d": "(2,3)" is one window, "(4,5)" the next. When the last line is odd-numbered, it is always read by "N", which does not affect the result. But when the last line is even-numbered, it is read by sed's automatic read, so sed exits at the "N" command, "-n" suppresses the automatic output, and the "P" after it is never executed. In that case, add "$!" before "N", which gives the fourth command.
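The parity effect is easy to reproduce by feeding the "-n" variants inputs of even and odd length. A sketch assuming GNU sed; one_line is a made-up helper that joins the output onto a single line.

```shell
one_line() { paste -s -d ' ' -; }   # hypothetical helper: join lines with spaces

even_unguarded=$(seq 1 10 | sed -n '1!{N;P;d}' | one_line)
odd_unguarded=$(seq 1 9 | sed -n '1!{N;P;d}' | one_line)
even_guarded=$(seq 1 10 | sed -n '1!{$!N;P;d}' | one_line)
odd_guarded=$(seq 1 9 | sed -n '1!{$!N;P;d}' | one_line)

echo "N,   10 lines: $even_unguarded"
echo "N,    9 lines: $odd_unguarded"
echo "\$!N, 10 lines: $even_guarded"
echo "\$!N,  9 lines: $odd_guarded"
```

Only the unguarded "N" with 10 input lines should lose a line: it prints "2 4 6 8" instead of "2 4 6 8 10".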
Of course, the simpler method is as follows:
seq 1 10 | sed 'n;d'
seq 1 10 | sed '1!n;d'
As another example, suppose the second-to-last line of a file is to be deleted. With the window concept in mind, this is easy: maintain a two-line window ("N" and "D" are enough), and when the last line has been read, do not output the line before it. The implementation is:
sed 'N;$!P;D' filename
Note the role of "N" when the last line is processed: it outputs the last line and ends the sed program. That line is output automatically rather than by "P", so it is affected by "-n". If "N" is changed to "$!N", then after the last line has been read "D" is executed twice in a row with no output, so the last two lines are both deleted.
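Both variants can be compared on a five-line input. Again a sketch that assumes GNU sed, with a made-up one_line helper to keep the output compact:

```shell
one_line() { paste -s -d ' ' -; }   # hypothetical helper: join lines with spaces

drop_penultimate=$(seq 1 5 | sed 'N;$!P;D' | one_line)
drop_last_two=$(seq 1 5 | sed '$!N;$!P;D' | one_line)

echo "N:   $drop_penultimate"
echo "\$!N: $drop_last_two"
```

The first should print "1 2 3 5" and the second "1 2 3".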
Back to series article outline: http://www.cnblogs.com/f-ck-need-u/p/7048359.html