Introduction: Text processing involves much more than cutting and pasting, especially when no GUI is available. In this article, the author explains how to use the GNU text utilities to process text. After reading it, you will be able to handle text like an expert.
Overview: This article introduces the filters that can be combined into complex pipelines to process text. You will learn how to display text, sort it, count words and lines, translate characters, and more. You will also learn how to use the sed stream editor. Specifically, you will learn to:
- Send text files or output streams through text utility filters to modify the output;
- Use the command-line tools included in the GNU text utilities;
- Use sed scripts to make complex text changes
This article covers LPIC-1 exam objective 103.2, which has a weight of 3. Text filtering is the process of taking an input stream of text, transforming it in some way, and sending the result to an output stream. Although the input and output can come from files, in the Linux and UNIX world filtering is usually done by building command-line pipelines, where the output of one command is piped or redirected as the input of the next. Pipelines and redirection are covered in detail in another article; for now, let's look at the pipe and redirection operators | and >.
A stream is a sequence of bytes that can be read and written using library functions, which hide the details of the underlying device from the application. The same program can therefore use a stream to read from or write to a terminal, a file, or a network socket in a device-independent way. Modern programming environments and shells use three standard I/O streams:
- stdin, the standard input stream, which provides input to commands;
- stdout, the standard output stream, which displays command output;
- stderr, the standard error stream, which displays error output from commands
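As a small illustration of why stdout and stderr are kept separate (a sketch; nosuchfile is just a placeholder for a file that does not exist, and the > operator is explained just below):
# Redirect stdout to a file; the error message about nosuchfile is written to
# stderr, so it still appears on the terminal rather than in out.txt
ls /etc/passwd nosuchfile > out.txt
# Redirecting stderr uses a different operator (2>), so the two can be captured separately
ls /etc/passwd nosuchfile > out.txt 2> err.txt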
Command input can come from parameters you give on the command line, and command output goes to your terminal by default. Many text processing commands (filters) can take their input either from the standard input stream or from a file. To use the output of one command as the input of another, use the pipe operator |. The following example pipes the output of echo into sort:
[root@localhost ~]# echo -e "apple\npear\nbanana" | sort
apple
banana
pear
[root@localhost ~]#
Either command may have options or arguments. You can also use | to send the output of the second command to a third command, and so on. Building long pipelines of commands, each with a limited capability, is a common way of accomplishing tasks in Linux and UNIX. You will sometimes see a hyphen (-) used in place of a file name as an argument to a command, meaning that the input should come from the standard input stream rather than from a file.
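For example (a small sketch; /etc/hostname stands in for any small text file you have handy), the - argument tells cat to read standard input at that point in its list of files:
echo "line from standard input" | cat - /etc/hostname    # prints the piped line, then the file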
Output redirection. A pipeline of several commands displays its results on the terminal, but sometimes you want to save the output to a file instead. You do this with the output redirection operator >. The next examples use some small files, so create a directory named lpi103-2 and cd into it. Then use > to redirect the output of echo into a file called text1.
[root@localhost ~]# mkdir lpi103-2
[root@localhost ~]# cd lpi103-2/
[root@localhost lpi103-2]# echo -e "1 apple\n2 pear\n3 banana" > text1
Now that we are familiar with pipelines and redirection, let's look at some of the common UNIX and Linux text processing commands and filters. This section introduces them only briefly; for full usage details, see the man pages. cat, od, and split. We created the text1 file above; now let's see what is in it. The cat (concatenate) command displays the contents of a text file on standard output.
[root@localhost lpi103-2]# cat text1
1 apple
2 pear
3 banana
If cat is given no parameters, it reads from the standard input stream. Combined with output redirection, this lets you create another file, as shown below:
cat >text2
9	plum
3	banana
10	apple
cat keeps reading standard input until it reaches end-of-file; press Ctrl+D to signal the end of input to cat (Ctrl+D is also how you exit bash). In this example, the Tab key was used to separate the numbers from the fruit names. Note that cat is short for concatenate, because it can join several files together and output them as one. Next, we output text1 and text2 together:
[root@localhost lpi103-2]# cat text*
1 apple
2 pear
3 banana
9	plum
3	banana
10	apple
Notice that the two files do not line up the same way in the output. To find out why, you need to look at the control characters in the files. Control characters affect how output is displayed but are not themselves printable, so we need to dump the files in a format that reveals and interprets these special characters. The od (octal dump) command from the GNU text utilities does this. The -A option sets the radix used for file offsets: x (hexadecimal), d (decimal), o (octal), or n (no offsets displayed). The -t option sets the output format: c shows printable characters or backslash escapes, and a shows named characters.
[root@localhost lpi103-2]# od text2
0000000 004471 066160 066565 031412 061011 067141 067141 005141
0000020 030061 060411 070160 062554 000012
0000031
[root@localhost lpi103-2]# od -A d -t c text2
0000000   9  \t   p   l   u   m  \n   3  \t   b   a   n   a   n   a  \n
0000016   1   0  \t   a   p   p   l   e  \n
0000025
[root@localhost lpi103-2]# od -A n -t a text2
   9  ht   p   l   u   m  nl   3  ht   b   a   n   a   n   a  nl
   1   0  ht   a   p   p   l   e  nl
Our example files are very small, but sometimes you will run into a large file that needs to be split into smaller pieces, for example to break it into CD-sized chunks for burning to disc. The split command does this, and cat makes it easy to join the pieces back into the original file. By default, the output files produced by split are named xaa, xab, and so on. Command-line options let you change the prefix, the size of the pieces, and whether to split by lines or by bytes. See the following example:
[root@localhost lpi103-2]# split -l 2 text1    # split by lines, 2 lines per piece
[root@localhost lpi103-2]# split -b 17 text2 y    # split by size, 17 bytes per piece, with output prefix y
[root@localhost lpi103-2]# cat yaa
9	plum
3	banana
1[root@localhost lpi103-2]# cat y* x*
9	plum
3	banana
10	apple
1 apple
2 pear
3 banana
Note that the yaa file does not end with a newline, so after its contents are displayed our bash prompt appears at the end of the last line rather than on a new line.
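If you want to confirm that a file lacks a final newline, od (shown above) makes it visible; a quick sketch:
od -c yaa | tail -n 2    # if the last character listed is not \n, the file has no trailing newline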
wc, head, and tail. cat displays an entire file, which is fine for small files, but for a large file you might first want to check its size with wc (word count). The wc command displays the number of lines, words, and bytes in a file. You can also get the size in bytes from ls -l. For example:
[root@localhost lpi103-2]# ls -l text*
-rw-r--r-- 1 root root 24 May  7 14:29 text1
-rw-r--r-- 1 root root 25 May  7 14:48 text2
[root@localhost lpi103-2]# wc text*
 3  6 24 text1
 3  6 25 text2
 6 12 49 total
wc has other options to control which counts are shown or to report other information, such as the length of the longest line; see the man page for details.
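A few of those options, as a quick sketch (the -L option, which reports the longest line length, is a GNU extension):
wc -l text1    # count lines only
wc -w text1    # count words only
wc -c text1    # count bytes only
wc -L text1    # length of the longest line (GNU extension)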
The head and tail commands display the first or last part of a file. They can be used as filters, or they can take a file name as an argument. By default, both display 10 lines. Here is a combined example:
[root@localhost lpi103-2]# dmesg | tail -n 15 | head -n 6    # take the last 15 lines, then the first 6 of those
tg3.0: eth0: link is up at 100 Mbps, full duplex
tg3.0: eth0: flow control is on for Tx and on for Rx
tg3.0: eth0: link is down
tg3.0: eth0: link is up at 100 Mbps, full duplex
tg3.0: eth0: flow control is off for Tx and off for Rx
Bluetooth: Core ver 2.15
Another common use of tail is tail -f filename, which keeps following a file as another process appends to it; this is often used for watching log files (for example, tail -f /var/log/messages). expand, unexpand, and tr. When we created the text1 and text2 files, we used tab characters in text2. Sometimes you need to convert tabs to spaces or vice versa; the expand and unexpand commands do this, and both accept a -t option that sets the tab stops.
[root@localhost lpi103-2]# expand -t 1 text2    # replace each tab with a single space
9 plum
3 banana
10 apple
[root@localhost lpi103-2]# expand -t 8 text2 | unexpand -a -t 2 | expand -t 3    # expand tabs to 8-column stops, convert runs of spaces to tabs at 2-column stops, then expand those tabs at 3-column stops
9        plum
3        banana
10       apple
Unfortunately, you cannot use unexpand to convert the single spaces in text1 to tabs, because unexpand requires at least two blanks before a tab stop in order to convert them. However, the tr command, which translates characters in one set to the corresponding characters in another set, can do the job. Since tr is a pure filter, we use cat to supply its input.
[root@localhost lpi103-2]# cat text1 | tr ' ' '\t' | cat - text2
1	apple
2	pear
3	banana
9	plum
3	banana
10	apple
If you want to verify that the spaces really were replaced with tabs, check the output with od -t a.
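For example, a quick check along those lines (a sketch; your od output layout may differ slightly):
cat text1 | tr ' ' '\t' | od -A n -t a    # each former space now shows up as 'ht' (horizontal tab)
The ht entries in the dump confirm that the spaces were translated to tabs.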
pr, nl, and fmt. The pr (print) command formats files for printing. By default each page has a header containing the file name, the file's date and time, and a page number, followed by a trailer of blank lines at the end of the page. When pr operates on multiple files or on standard input, the date and time shown is the current time rather than that of the file. The nl (number lines) command numbers lines; cat -n can also number lines.
[root@localhost lpi103-2]# nl text2 | pr -m - text1 | head


2013-05-07 16:04                                                  Page 1


     1	9	plum			    1 apple
     2	3	banana			    2 pear
     3	10	apple			    3 banana
Another text formatting command is fmt, which reformats text to fit within given margins. It can join several short lines into longer ones, or split long lines into shorter ones. Example:
[ian@echidna lpi103-2]$ echo "This is a sentence. " !#:* !#:1->text3
echo "This is a sentence. " "This is a sentence. " "This is a sentence. " >text3
[ian@echidna lpi103-2]$ echo -e "This\nis\nanother\nsentence.">text4
[ian@echidna lpi103-2]$ cat -et text3 text4
This is a sentence.  This is a sentence.  This is a sentence. $
This$
is$
another$
sentence.$
[ian@echidna lpi103-2]$ fmt -w 60 text3 text4
This is a sentence.  This is a sentence.  This is a
sentence.
This is another sentence.
sort and uniq. The sort command sorts its input according to the collating sequence of the current locale and writes the result to standard output. It can also merge files that are already sorted and check whether a file is sorted. In the following examples, we replace the spaces in text1 with tabs and then sort the two files together.
[root@localhost lpi103-2]# cat text1 | tr ' ' '\t' | sort - text2
10	apple
1	apple
2	pear
3	banana
3	banana
9	plum
[root@localhost lpi103-2]# cat text1 | tr ' ' '\t' | sort -u -k1n -k2 - text2
1	apple
2	pear
3	banana
9	plum
10	apple
Notice that 10 comes before 1 in the first listing, because the default sort order treats values as character strings rather than numbers. Fortunately, sort can sort numerically as well as alphabetically, and each key can be sorted its own way. By default, fields are separated by blanks (spaces or tabs).
In the second example, the first field is sorted numerically and the second alphabetically, and -u removes duplicate lines. Notice that two apple lines remain, because the sort keys (1 versus 10 in the first field) differ for those two lines. Think about how you would modify the command, or add another step, to eliminate the second apple line; one approach is shown after the uniq example below. The uniq command provides a further way of controlling the removal of duplicates: it normally operates on sorted input and removes consecutive identical lines, and it can also ignore selected fields. As follows:
[root@localhost lpi103-2]# cat text1 | tr ' ' '\t' | sort -k2 - text2 | uniq -f1    # ignore the first field when comparing
10	apple
3	banana
2	pear
9	plum
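Returning to the earlier question about removing the second apple line: one approach (a sketch, not the only possibility) is to make the fruit name the only sort key, so that -u treats the two apple lines as duplicates; note that which of the two is kept is not guaranteed:
cat text1 | tr ' ' '\t' | sort -u -k2 - text2    # unique on the fruit-name key only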
cut, paste, and join. Now let's look at three commands that operate on fields within lines of text. They are particularly useful for tabular data. The first is cut, which extracts fields from a file. The default field delimiter is the tab character.
[root@localhost lpi103-2]# cut -f1-2 --output-delimiter=' ' text2    # extract the first two fields and separate them with a space on output
9 plum
3 banana
10 apple
The paste command joins lines with the same line number from two or more files side by side, a bit like the -m option of pr.
[root@localhost lpi103-2]# paste text1 text2
1 apple	9	plum
2 pear	3	banana
3 banana	10	apple
This is only the simplest use of paste; see the man page for more complex possibilities.
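As a taste of those other options (a sketch): the -s option pastes serially, turning each input file into a single line instead of pasting the files side by side:
paste -s text1 text2    # one tab-separated output line per input file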
The last field-manipulation command is join, which joins two files on a matching field (translator's note: much like an equijoin of two database tables). The files must be sorted on the join field.
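Note that the example below uses a file named text5 that is not created anywhere in this excerpt; presumably it is text1 sorted on its second (fruit-name) field. If you want to follow along, something like the following would create it (an assumption on our part, not part of the original example):
sort -k2 text1 > text5    # assumed: text1 sorted on the fruit-name field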
[root@localhost lpi103-2]# sort -k2 text2 | join -1 2 -2 2 text5 -
apple 1 10
banana 3 3
sed. sed is the stream editor. Many articles and whole books have been written about sed; it is extremely powerful and can be used for almost any text editing task. This is only a brief introduction to whet your appetite, not a full treatment. Like the text processing commands above, sed can take its input from a file or from a pipeline, and it writes to standard output. sed reads a line into its pattern space, edits the contents of the pattern space, and then writes the pattern space to standard output. sed may combine several lines in the pattern space, write to a file, write only selected output, or write nothing at all. sed uses regular expressions to select the lines to operate on and to perform search-and-replace operations. A hold buffer provides temporary storage for text; it can replace the contents of the pattern space, be appended to the pattern space, or be exchanged with it. sed has a limited set of commands, but combined with regular expressions and the hold buffer they give it enormous capability. A set of sed commands is usually called a sed script. Three simple sed scripts are shown below.
[root@localhost lpi103-2]# sed 's/a/A/' text1
1 Apple
2 peAr
3 bAnana
[root@localhost lpi103-2]# sed 's/a/A/g' text1
1 Apple
2 peAr
3 bAnAnA
[root@localhost lpi103-2]# sed '2d;$s/a/A/g' text1
1 apple
3 bAnAnA
In the first script, we use the s (substitute) command to replace the first a on each line with A. In the second, the g (global) flag makes it replace every a on each line. In the third, we add the d (delete) command to delete the second line and then replace every a with A on the last line only; the two commands are separated by a semicolon. By default sed operates on every line, but commands can be restricted to a range of lines given by line numbers, regular expressions, and so on; $ denotes the last line. Example:
[root@localhost lpi103-2]# sed -e '2,${' -e 's/a/A/g' -e '}' text1
1 apple
2 peAr
3 bAnAnA
[root@localhost lpi103-2]# sed -e '/pear/,/bana/{' -e 's/a/A/g' -e '}' text1
1 apple
2 peAr
3 bAnAnA
[root@localhost lpi103-2]# sed -e '/pear/,/bana/{s/a/A/g}' text1
1 apple
2 peAr
3 bAnAnA
The point of these examples is to replace a with A in the last two lines of text1; they also show how -e is used to string several commands together. sed scripts can be stored in files, and in practice you will probably want to save the scripts you use often. Earlier we used tr to change the spaces in text1 to tabs; now let's do the same thing with a sed script stored in a file.
[root@localhost lpi103-2]# echo -e "s/ /\t/g">sedtab
[root@localhost lpi103-2]# cat sedtab
s/ /	/g
[root@localhost lpi103-2]# sed -f sedtab text1
1	apple
2	pear
3	banana
The following is our last sed example:
[root@localhost lpi103-2]# sed '=' text2
1
9	plum
2
3	banana
3
10	apple
[root@localhost lpi103-2]# sed '=' text2 | sed 'N;s/\n//'
19	plum
23	banana
310	apple
Here we use the = command, then hand the output to a second sed invocation, to simulate numbering lines as nl does. The = command prints the line number on its own line; N reads the next input line into the pattern space, and the substitution then removes the newline between the two lines in the pattern space. The result is ugly because there is nothing separating the line number from the line contents. Here is an improved version:
[root@localhost lpi103-2]# cat text1 text2 text1 text2 >text6
[root@localhost lpi103-2]# cat text6
1 apple
2 pear
3 banana
9	plum
3	banana
10	apple
1 apple
2 pear
3 banana
9	plum
3	banana
10	apple
[root@localhost lpi103-2]# ht=$(echo -en "\t")
[root@localhost lpi103-2]# echo $ht

[root@localhost lpi103-2]# sed '=' text6 | sed "N
> s/^/      /
> s/^.*\(......\)\n/\1$ht/"
     1	1 apple
     2	2 pear
     3	3 banana
     4	9	plum
     5	3	banana
     6	10	apple
     7	1 apple
     8	2 pear
     9	3 banana
    10	9	plum
    11	3	banana
    12	10	apple
The steps are analyzed as follows:
- Use cat to create a multiline file.
- Because bash uses the Tab key for command completion, a literal tab is awkward to type, so create a bash variable holding a tab character and use it instead
- Use the = command to add a line number
- Read the next line into the pattern space (the N command)
- Prepend six spaces to the line number
- Replace everything up to the embedded newline with just the last six characters before it (so the line number is right-aligned in a six-character field), followed by a tab.
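For comparison (a sketch using commands covered earlier, not part of the sed exercise), you can get similar right-aligned numbering directly:
cat -n text6    # numbers each line, right-aligned in a six-character field followed by a tab
nl -b a text6   # nl numbers all lines when given -b a (by default it skips blank lines)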
Version 4 of sed includes documentation in info format with many excellent examples, none of which were available in the older 3.02 version. GNU sed accepts the --version option to display version information.
[root@localhost lpi103-2]# sed --version
GNU sed version 4.2.1
Copyright (C) 2009 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE,
to the extent permitted by law.

GNU sed home page: