Linux Shell Programming Notes: Basic Text Processing

Source: Internet
Author: User
Chapter 5: Text processing

Text processing is arguably the most important part of UNIX/Linux shell programming. In the UNIX/Linux design, everything is a file, and many programs in the system cooperate by exchanging text or text streams. Text processing and text streams are therefore a central part of the UNIX/Linux design.

Pipelines are an important invention of UNIX/Linux: they connect processing tools so that text flows from one to the next. Text processing tools in UNIX/Linux are often designed as filters, and connecting different filters through pipelines lets simple building blocks achieve the required function.

Sorting lines with the sort command

Many data (text) files are organized in a fixed format, which keeps the information readable and easy to retrieve and process. Such formatted text files can usually be sorted, and a sorted text file is easier to search.

There are many common sorting algorithms, such as bubble sort, merge sort, and quicksort, and their efficiency differs.

The sort tool provided in UNIX/Linux works efficiently; it is generally much more efficient than a sorting algorithm you write yourself. So even if you do not know how the sorting is implemented, you can use sort with confidence.

The sort command treats its input as a stream of records, each made up of fields. Records are delimited by newlines, so each line is one record. Fields are delimited by blank characters by default, and sort also provides an option for specifying the field delimiter.

If no option is given, sort orders the data according to the collating sequence defined by the current locale. For example, in the traditional C locale, sort uses ASCII (byte) order. To change the sorting rules, change the locale.

    [houchangren@ebsdi-23260-oozie data]$ cat fruits.txt             # view the file content
    [...]
    [houchangren@ebsdi-23260-oozie data]$ sort fruits.txt            # default sort
    [...]
    [houchangren@ebsdi-23260-oozie data]$ echo $LANG                 # view the current locale
    zh_CN.UTF-8
    [houchangren@ebsdi-23260-oozie data]$ LANG=EN_US sort fruits.txt # set LANG to EN_US for sorting
    %%banae
    Banana
    Presimmon
    apple
    apple
    banana
    orange
    presimmon
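
The effect of the locale can be reproduced on a few lines of made-up data. The following is a minimal sketch, not the original session: the word list is hypothetical, and LC_ALL=C is used to force byte order because LC_ALL overrides LANG. It also shows the -f option for comparison.

```shell
# Sample data (hypothetical, modeled on the fruit list above).
fruits='banana
Apple
apple
Banana'

# Byte order in the C locale: uppercase letters sort before lowercase ones.
c_sorted=$(printf '%s\n' "$fruits" | LC_ALL=C sort)

# Case-insensitive comparison with -f; lines whose keys compare equal are
# then ordered by a byte-wise comparison, so "Apple" still precedes "apple".
folded=$(printf '%s\n' "$fruits" | LC_ALL=C sort -f)

printf '%s\n' "$c_sorted"
echo ---
printf '%s\n' "$folded"
```

Setting the variable on the same line as the command changes the locale for that one command only, which is usually safer than changing it for the whole session.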

-d: sort in dictionary order, considering only blanks and alphanumeric characters

-f: ignore case when comparing (lowercase letters are folded to uppercase)

    [houchangren@ebsdi-23260-oozie data]$ sort -d -f fruits.txt
    apple
    apple
    %%banae
    banana
    Banana
    orange
    presimmon
    Presimmon

-u: remove duplicates, so that lines that compare equal are output only once

    [houchangren@ebsdi-23260-oozie data]$ sort -d -f -u fruits.txt
    apple
    %%banae
    banana
    orange
    Presimmon
Sorting by field with the sort command

The sort command can also sort by field. The -k option selects the field to use as the sort key, and the -t option selects the field separator; if -t is not given, fields are separated by blanks. Note that -k4 means a key that starts at field 4 and runs to the end of the line, while -k4,4 restricts the key to field 4.

    [houchangren@ebsdi-23260-oozie data]$ cat sort.txt
    keyData  XSZKCZY:XXXSCK:604834:AFS:3636351  d_20131220170748-1600299142  2
    keyData  XSZKCZY:XXXSCK:604836:AFS:3636353  d_20131220170748-1600299142  3
    keyData  XSZKCZY:XXXSCK:604838:AFS:3636355  d_20131220170748-1600299142  1
    keyData  XSZKCZY:XXXSCK:605180:AFS:3639304  d_20131220170748-1600299142  6
    keyData  XSZKCZY:XXXSCK:606728:AFS:3658757  d_20131220170748-1600299142  5
    keyData  XSZKCZY:XXXSCK:607072:AFS:3661194  d_20131220170748-1600299142  2
    keyData  XSZKCZY:XXXSCK:607188:AFS:3661453  d_20131220170748-1600299142  23
    keyData  XSZKCZY:XXXSCK:607195:AFS:3661460  d_20131220170748-1600299142  4
    keyData  XSZKCZY:XXXSCK:607197:AFS:3661462  d_20131220170748-1600299142  2
    keyData  XSZKCZY:XXXSCK:607199:AFS:3661464  d_20131220170748-1600299142  2
    [houchangren@ebsdi-23260-oozie data]$ sort -t $'\t' -k4 sort.txt
    keyData  XSZKCZY:XXXSCK:604838:AFS:3636355  d_20131220170748-1600299142  1
    keyData  XSZKCZY:XXXSCK:604834:AFS:3636351  d_20131220170748-1600299142  2
    keyData  XSZKCZY:XXXSCK:607072:AFS:3661194  d_20131220170748-1600299142  2
    keyData  XSZKCZY:XXXSCK:607197:AFS:3661462  d_20131220170748-1600299142  2
    keyData  XSZKCZY:XXXSCK:607199:AFS:3661464  d_20131220170748-1600299142  2
    keyData  XSZKCZY:XXXSCK:607188:AFS:3661453  d_20131220170748-1600299142  23
    keyData  XSZKCZY:XXXSCK:604836:AFS:3636353  d_20131220170748-1600299142  3
    keyData  XSZKCZY:XXXSCK:607195:AFS:3661460  d_20131220170748-1600299142  4
    keyData  XSZKCZY:XXXSCK:606728:AFS:3658757  d_20131220170748-1600299142  5
    keyData  XSZKCZY:XXXSCK:605180:AFS:3639304  d_20131220170748-1600299142  6
    [houchangren@ebsdi-23260-oozie data]$ sort -t $'\t' -k4 -n sort.txt
    keyData  XSZKCZY:XXXSCK:604838:AFS:3636355  d_20131220170748-1600299142  1
    keyData  XSZKCZY:XXXSCK:604834:AFS:3636351  d_20131220170748-1600299142  2
    keyData  XSZKCZY:XXXSCK:607072:AFS:3661194  d_20131220170748-1600299142  2
    keyData  XSZKCZY:XXXSCK:607197:AFS:3661462  d_20131220170748-1600299142  2
    keyData  XSZKCZY:XXXSCK:607199:AFS:3661464  d_20131220170748-1600299142  2
    keyData  XSZKCZY:XXXSCK:604836:AFS:3636353  d_20131220170748-1600299142  3
    keyData  XSZKCZY:XXXSCK:607195:AFS:3661460  d_20131220170748-1600299142  4
    keyData  XSZKCZY:XXXSCK:606728:AFS:3658757  d_20131220170748-1600299142  5
    keyData  XSZKCZY:XXXSCK:605180:AFS:3639304  d_20131220170748-1600299142  6
    keyData  XSZKCZY:XXXSCK:607188:AFS:3661453  d_20131220170748-1600299142  23

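The -n option makes sort compare the key numerically. Without it, the key is compared character by character as a string, which is why 23 lands in the middle of the first sorted output (between 2 and 3) and at the end of the second.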
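
The difference between string and numeric comparison can be checked on a smaller input. This is a sketch with three invented tab-separated records; the variable TAB is only a portable way to write a literal tab, and -k4,4 limits the key to the fourth field.

```shell
TAB=$(printf '\t')   # a literal tab, portable across POSIX shells

# Three hypothetical tab-separated records; the 4th field is a counter.
records=$(printf 'k\tid1\td1\t23\nk\tid2\td2\t3\nk\tid3\td3\t2\n')

# Character-by-character comparison of field 4: "23" lands between "2" and "3".
lexical=$(printf '%s\n' "$records" | LC_ALL=C sort -t "$TAB" -k4,4 | cut -f 4)

# Numeric comparison of field 4.
numeric=$(printf '%s\n' "$records" | LC_ALL=C sort -t "$TAB" -k4,4n | cut -f 4)

echo "lexical order: $(echo $lexical)"   # 2 23 3
echo "numeric order: $(echo $numeric)"   # 2 3 23
```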

Sort summary

sort is an important command that shows up in many UNIX/Linux shell scripts: wherever there is data that has to be searched, sort is usually involved. It could well rank among the ten most used UNIX/Linux commands.

The above is only the tip of the iceberg; sort is far more powerful than this. In addition, because it is designed as a filter, sort can be combined with other commands to implement powerful functions; for example, you can sort blocks of text.

Moreover, the efficiency of sort is commendable: since its introduction, many people have studied, optimized, and tuned it. It almost certainly works better than a sorting algorithm you write yourself. Trust it instead of reinventing the wheel by writing redundant, fragile sorting code.

However, note that sort is not stable by default. A sorting algorithm is stable if records that compare equal appear in the output in the same relative order as in the input.

With sort, records whose sort fields are equal may come out in a different order than they went in, so sort is not a stable sorting implementation by default. The sort command in the GNU coreutils package makes up for this: its --stable (-s) option keeps records with equal keys in their input order. Depending on the implementation, this option may cost some efficiency.

When sort is used as a filter in a pipeline, stability can become important. Decide whether to use --stable based on whether your commands depend on the input order of records with equal keys.
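
A minimal sketch of the difference, assuming GNU sort (the -s option is not required by POSIX) and using invented two-field records in which field 1 is the key:

```shell
# Hypothetical records: a key in field 1, an arrival order in field 2.
input='b 2
a 2
b 1
a 1'

# Default: records with equal keys are ordered by comparing the whole line.
default_out=$(printf '%s\n' "$input" | LC_ALL=C sort -k1,1)

# -s (--stable in GNU sort): records with equal keys keep their input order.
stable_out=$(printf '%s\n' "$input" | LC_ALL=C sort -s -k1,1)

printf '%s\n' "$default_out"
echo ---
printf '%s\n' "$stable_out"
```

In the stable output, "a 2" stays ahead of "a 1" because that is the order in which the two records arrived.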

Text deduplication

sort -u removes duplicates by comparing only the specified sort key. Records whose keys are equal may still differ in other fields, so the result is sometimes not what we want.

UNIX/Linux provides another command for removing duplicate records: uniq. The uniq command removes repeated adjacent records from the data stream and keeps only the first record of each run. It is often used in pipelines, for example to remove duplicates after sort.

The uniq command has three main options: -c prefixes each line with the number of times it occurred, -d shows only duplicated lines, and -u shows only lines that are not duplicated.

    [houchangren@ebsdi-23260-oozie data]$ cat fruit.txt
    %%banae
    banana
    apple
    Presimmon
    %%banae
    apple
    Banana
    orange
    presimmon
    [houchangren@ebsdi-23260-oozie data]$ sort fruit.txt | uniq -c      # apple is listed twice: one of the lines contains an extra space, so they count as two different records
          1 apple
          1 apple
          2 %%banae
          1 banana
          1 Banana
          1 orange
          1 presimmon
          1 Presimmon
    [houchangren@ebsdi-23260-oozie data]$ sort fruit.txt | uniq -c -d   # show duplicated lines only
          2 %%banae
    [houchangren@ebsdi-23260-oozie data]$ sort fruit.txt | uniq -c -u   # show non-duplicated lines only
          1 apple
          1 apple
          1 banana
          1 Banana
          1 orange
          1 presimmon
          1 Presimmon
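
The two deduplication approaches can be compared side by side. The sketch below uses invented records; it counts the surviving lines for sort -u with a key, for sort followed by uniq, and for uniq on unsorted input.

```shell
# Hypothetical records: the same name can appear with different values.
input='apple 1
banana 7
apple 2
apple 1'

# sort -u with a key keeps one record per key, whatever the other fields say.
by_key=$(printf '%s\n' "$input" | LC_ALL=C sort -u -k1,1 | wc -l | tr -d ' ')

# uniq compares whole lines, but only adjacent ones, so sort first.
whole_line=$(printf '%s\n' "$input" | LC_ALL=C sort | uniq | wc -l | tr -d ' ')

# Without sorting, the two "apple 1" lines are not adjacent and both survive.
unsorted=$(printf '%s\n' "$input" | uniq | wc -l | tr -d ' ')

echo "$by_key $whole_line $unsorted"   # 2 3 4
```
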
Count the number of lines, words, and characters

The wc command in UNIX/Linux counts the lines, words, and characters of a text. wc is part of the POSIX standard, so you can use it with confidence.

The -c option makes wc display the number of bytes, which equals the number of characters in single-byte encodings (-m counts characters).

The -w option makes wc display the number of words.

The -l option makes wc display the number of lines.

    [houchangren@ebsdi-23260-oozie data]$ wc /etc/passwd
      61   92 2867 /etc/passwd
    [houchangren@ebsdi-23260-oozie data]$ wc -c /etc/passwd
    2867 /etc/passwd
    [houchangren@ebsdi-23260-oozie data]$ wc -c -w /etc/passwd
      92 2867 /etc/passwd
    [houchangren@ebsdi-23260-oozie data]$ wc -c -w -l /etc/passwd
      61   92 2867 /etc/passwd
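
wc also reads standard input, which makes it easy to use at the end of a pipeline. A small sketch on made-up text follows; the trailing tr only removes the padding that some wc implementations put in front of the number.

```shell
text='one two
three'

lines=$(printf '%s\n' "$text" | wc -l | tr -d ' ')
words=$(printf '%s\n' "$text" | wc -w | tr -d ' ')
bytes=$(printf '%s\n' "$text" | wc -c | tr -d ' ')

echo "$lines lines, $words words, $bytes bytes"   # 2 lines, 3 words, 14 bytes
```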

Count the files under the parent directory whose names end with .sh:

    [houchangren@ebsdi-23260-oozie data]$ find ../ -iname "*.sh" | wc -l
    11

Count the lines of /etc/passwd that contain the string bash. In general, this is the number of users on the system whose login shell is bash.

    [houchangren@ebsdi-23260-oozie data]$ grep bash /etc/passwd | wc -l
    25

Use grep alone to achieve the same effect:

    [houchangren@ebsdi-23260-oozie data]$ grep -c bash /etc/passwd
    25

Run wc on all files ending with .sh in the current directory. It counts the lines, words, and characters of each file and prints the totals in the last row.

    [houchangren@ebsdi-23260-oozie shell]$ wc *.sh
       6    7   51 add.sh
      19   29  337 [...].sh
       7   16   85 checkUserIsExist.sh
       8   11   55 if.sh
      16   58  284 testalg.sh
       8   20  206 testcase.sh
       6    7   42 testfor.sh
      26   51  319 testwhile.sh
      11   27  132 user_login.sh
      15   26  193 whileexample2.sh
      17   38  253 whileexample.sh
     139  290 1957 total

Note

The behavior of wc varies with the locale settings. Different locales can change how wc interprets a byte sequence as characters and where it sees word separators.

Print and format the output

There are many tools for printing and formatting text in Linux, such as pr, fmt, and fold. They have different purposes.

Print files using pr

The pr command of UNIX/Linux converts text into a form suitable for printing. A basic use of the tool is to split a large file into pages and add a header to each page.

For example, pr can convert a 150-line text file into three pages of text, which the user can then print.

By default, each page is 66 lines long, including the header and trailer. You can change this with the -l option of pr.
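
The page arithmetic can be checked by counting output lines. The sketch below assumes the usual pr layout of a five-line header and a five-line trailer per page (as in GNU pr) and generates 25 numbered lines with awk as stand-in input.

```shell
# Generate 25 numbered lines without relying on seq.
numbers() { awk 'BEGIN { for (i = 1; i <= 25; i++) print i }'; }

# Default page length: 66 lines per page, so 25 lines fit on one page.
default_total=$(numbers | pr | wc -l | tr -d ' ')

# With -l 20 each page holds 10 body lines (20 minus header and trailer),
# so 25 input lines need 3 pages of 20 lines each.
short_total=$(numbers | pr -l 20 | wc -l | tr -d ' ')

echo "$default_total $short_total"
```

The last page is padded with blank lines up to the full page length, which is why the totals are multiples of the page length.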

Many options control the output. By default, the title of each page is the name of the file. You can also set the title yourself, for example:

    $ pr -h "My report" file.txt

Without the -h option, the printed pages use the file name, file.txt, as the title. With -h, the pages use the string given after the option, "My report", as the title.

You can also use pr to print text in multiple columns. This is useful for short lines; a line that is too long for its column is cut to fit the column width. For example, to print file.txt in two columns, run the following command:

    $ pr -2 -h "My report" file.txt

By default, pr separates pages by padding each one with newlines (blank lines). With the -f option, it separates pages with a form-feed character instead:

    $ pr -f file.txt

This is suitable if you only want to print the file. If you also want to save the output, the added form-feed characters make the file look messy.

Remember that pr writes to standard output, which can be sent directly to a printer. To save the result in a file, redirect the output, for example:

    $ pr file.txt > file.output

Test example:

    [houchangren@ebsdi-23260-oozie data]$ pr -f fruit.txt
    2014-01-18 10:56                    fruit.txt                     Page 1
    %%banae
    banana
    apple
    Presimmon
    %%banae
    apple
    Banana
    orange
    presimmon
    [houchangren@ebsdi-23260-oozie data]$ pr -f -c1 fruit.txt
    2014-01-18 10:56                    fruit.txt                     Page 1
    %%banae
    banana
    apple
    Presimmon
    %%banae
    apple
    Banana
    orange
    presimmon
    [houchangren@ebsdi-23260-oozie data]$ pr -f -c1 -h "test" fruit.txt
    2014-01-18 10:56                       test                       Page 1
    %%banae
    banana
    apple
    Presimmon
    %%banae
    apple
    Banana
    orange
    presimmon
    [houchangren@ebsdi-23260-oozie data]$ pr -f -c1 -t fruit.txt
    %%banae
    banana
    apple
    Presimmon
    %%banae
    apple
    Banana
    orange
    presimmon

In these commands, the -t option suppresses the page header. In GNU pr, the number of columns is set with -COLUMN (or --columns=COLUMN), while -c on its own only changes how control characters are shown, so -c1 is read as -c followed by -1, that is, one column.

NOTE

For the same reason, -10 and -c10 produce the same column layout: pr has a "-COLUMN" option, in which a hyphen directly followed by a number gives the number of output columns.

Format text using the fmt command

In addition to pr, UNIX/Linux has the fmt command, which reformats paragraphs so that the text does not extend beyond the visible width of the screen. fmt reads the content of the specified file, rearranges it according to the specified format, and writes it to standard output. If the file name is "-", fmt reads from standard input.

The -w option sets the maximum number of characters per line.

The -s option makes fmt only split lines that are longer than the maximum width, without joining lines that are shorter.

NOTE

Both pr and fmt behave differently on different versions of the system, so check the man page to confirm what the formatting tool does.

Use fold to limit the text width

The fold command of UNIX/Linux reads the content of the specified file, breaks every line that exceeds the specified width by inserting newline characters, and writes the result to standard output. If no file name is given, or the file name is "-", fold reads from standard input.

Note

The -w option of fold is not the same as the -w option of fmt. fold's -w breaks the line hard at the given width, whether or not a word is cut in two (unless -s is also given, which breaks at blanks). fmt's -w takes words into account: if a word does not fit on the line, fmt moves it to the next line, whereas fold cuts it.
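
The difference is easy to see on one short line. This is a sketch on an invented string of three four-letter words, with a width of 6 chosen so that no two words fit on one line.

```shell
line='aaaa bbbb cccc'

# fold cuts at exactly 6 columns, even in the middle of a word.
folded=$(printf '%s\n' "$line" | fold -w 6)

# fmt keeps words intact and moves a word that does not fit to the next line.
formatted=$(printf '%s\n' "$line" | fmt -w 6)

printf '%s\n' "$folded"
echo ---
printf '%s\n' "$formatted"
```

fold produces "aaaa b", "bbb cc", and "cc", while fmt puts each word on its own line.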

Extract the beginning and end of a text

Use head to view the beginning of a file

head -n: specify the number of lines to show

View the first 20 lines of the Tomcat log:

    [root@ebsdi-23260-oozie logs]# head -20 catalina.out

Use tail to view the end of a file

tail -n: specify the number of lines to show

    [root@ebsdi-23260-oozie logs]# tail -20 catalina.out

-f: follow the file and print new lines as they are appended

    [root@ebsdi-23260-oozie logs]# tail -f catalina.out
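
The same commands work on standard input. In the sketch below, the log function is a hypothetical stand-in for a log file, and the standard -n N spelling is used in place of the older -N form shown above.

```shell
# Stand-in for a log file: ten numbered lines (hypothetical data).
log() { awk 'BEGIN { for (i = 1; i <= 10; i++) print "line " i }'; }

first=$(log | head -n 3)    # line 1 .. line 3
last=$(log | tail -n 2)     # line 9, line 10

# tail -n +N starts at line N instead of counting from the end.
from_eighth=$(log | tail -n +8)

printf '%s\n' "$first" "$last" "$from_eighth"
```
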
Process fields with cut

The -d option specifies the field separator used by cut.

The -f option specifies which fields cut outputs; multiple fields are separated by commas.

    [houchangren@ebsdi-23260-oozie data]$ cut -d ':' -f 1,7 /etc/passwd | grep bash | head -10
    root:/bin/bash
    mapred:/bin/bash
    hdfs:/bin/bash
    wangjuntao:/bin/bash
    yangzhi:/bin/bash
    huanghu:/bin/bash
    zhangguochen:/bin/bash
    houchangren:/bin/bash
    hadoop:/bin/bash
    neil:/bin/bash
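
A self-contained variant of the same idea, using two invented passwd-style records instead of the real /etc/passwd. The grep pattern is anchored to the shell field, which is an addition to the original command.

```shell
# Two hypothetical passwd-style records (7 colon-separated fields each).
records='alice:x:1000:1000:Alice:/home/alice:/bin/bash
daemon:x:2:2:daemon:/sbin:/sbin/nologin'

# Keep field 1 (login name) and field 7 (login shell).
name_shell=$(printf '%s\n' "$records" | cut -d ':' -f 1,7)

# Match on the shell field only, so a user named "bash" is not counted.
bash_users=$(printf '%s\n' "$name_shell" | grep -c ':/bin/bash$')

printf '%s\n' "$name_shell"
echo "bash users: $bash_users"
```
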
Join fields

In Linux, the join command combines two files so that records with the same key value are joined together. Based on the specified field, it finds the lines of the two files whose field values match, merges them, and outputs them in the required format. Both files must be sorted on the join field. This command is helpful for comparing the content of two files.

    [houchangren@ebsdi-23260-oozie data]$ cat stuff.txt
    1      zhangsan
    2      lisi
    3      wangwu
    6      maliu
    [houchangren@ebsdi-23260-oozie data]$ cat salary.txt
    1      1000
    2      2000
    3      2350
    5      2400
    [houchangren@ebsdi-23260-oozie data]$ join stuff.txt salary.txt
    1 zhangsan 1000
    2 lisi 2000
    3 wangwu 2350
    [houchangren@ebsdi-23260-oozie data]$ join stuff.txt salary.txt -a1
    1 zhangsan 1000
    2 lisi 2000
    3 wangwu 2350
    6 maliu
    [houchangren@ebsdi-23260-oozie data]$ join stuff.txt salary.txt -a2
    1 zhangsan 1000
    2 lisi 2000
    3 wangwu 2350
    5 2400
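
The sketch below repeats the join in a temporary directory so that it can be run anywhere. The file names and records are made up for illustration; the last command additionally shows -e together with -o, which fills in a placeholder for the missing salary.

```shell
dir=$(mktemp -d)

# Hypothetical inputs, already sorted on the join field (field 1).
printf '1 zhangsan\n2 lisi\n6 maliu\n' > "$dir/names.txt"
printf '1 1000\n2 2000\n5 2400\n'     > "$dir/salary.txt"

# Only keys present in both files.
inner=$(join "$dir/names.txt" "$dir/salary.txt")

# -a 1 also prints the unpairable lines of the first file.
left=$(join -a 1 "$dir/names.txt" "$dir/salary.txt")

# -e fills missing fields, but only together with an explicit -o list.
filled=$(join -a 1 -e NONE -o 0,1.2,2.2 "$dir/names.txt" "$dir/salary.txt")

rm -r "$dir"
printf '%s\n' "$filled"
```
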
Other field processing methods

awk, which is covered in detail later.

Text Replacement

There are many ways to replace text in UNIX/Linux; for example, you can use the sed command or do it in a text editor. However, the simplest command is tr.

Use tr to replace characters

The tr command translates or deletes characters read from standard input and writes the result to standard output. It is very handy when a command needs small-scale text replacement.

tr command format

a.      tr str1 str2
b.      tr {-d|-s} str1

Operations performed by tr

A. Translate characters
If both str1 and str2 are specified and the -d flag is not, tr replaces each character contained in str1 with the character at the same position in str2.

B. Delete characters with the -d flag
If the -d flag is specified, tr deletes every character contained in str1 from the input.

C. Squeeze repeated characters with the -s flag

If the -s flag is specified, tr reduces each run of a repeated character to its first occurrence. When only str1 is given, this applies to the characters contained in str1; when str2 is also given, the characters are translated first and the runs of characters contained in str2 are then squeezed in the output.

    [houchangren@ebsdi-23260-oozie data]$ cat fruit.txt
    %%banae
    banana
    apple
    Presimmon
    %%banae
    apple
    Banana
    orange
    presimmon
    [houchangren@ebsdi-23260-oozie data]$ tr 'a-z' 'A-Z' < fruit.txt > fruit.txt.upper
    [houchangren@ebsdi-23260-oozie data]$ cat fruit.txt.upper
    %%BANAE
    BANANA
    APPLE
    PRESIMMON
    %%BANAE
    APPLE
    BANANA
    ORANGE
    PRESIMMON
    [houchangren@ebsdi-23260-oozie data]$ cat fruit.txt | tr -d 'a' > fruit.txt.rm
    [houchangren@ebsdi-23260-oozie data]$ cat fruit.txt.rm
    %%bne
    bnn
    pple
    Presimmon
    %%bne
    pple
    Bnn
    ornge
    presimmon
    [houchangren@ebsdi-23260-oozie data]$ cat fruit.txt.rm | tr -s '%%' '$$' > fruit.txt.rep
    [houchangren@ebsdi-23260-oozie data]$ cat fruit.txt.rep
    $bne
    bnn
    pple
    Presimmon
    $bne
    pple
    Bnn
    ornge
    presimmon
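
The three operations can be tried on a single made-up string without creating files. In this sketch the last command mirrors the '%%' to '$$' step above: characters are translated first, and runs of the resulting character are then squeezed.

```shell
text='aa  bb%%cc'

upper=$(printf '%s\n' "$text" | tr 'a-z' 'A-Z')     # translate
no_a=$(printf '%s\n' "$text" | tr -d 'a')           # delete
squeezed=$(printf '%s\n' "$text" | tr -s ' %')      # squeeze repeats
# Translate first, then squeeze the characters of the second set.
replaced=$(printf '%s\n' "$text" | tr -s '%' '$')

printf '[%s]\n' "$upper" "$no_a" "$squeezed" "$replaced"
```
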
Other examples

To convert braces into parentheses:

    tr '{}' '()' < textfile > newfile

To convert braces into square brackets:

    tr '{}' '\[]' < textfile > newfile

To create a list of the words in a file:

    tr -cs '[:lower:][:upper:]' '[\n*]' < textfile > newfile

To delete all NUL characters from a file:

    tr -d '\0' < textfile > newfile

To replace every sequence of one or more newlines with a single newline:

    tr -s '\n' < textfile > newfile

or

    tr -s '\012' < textfile > newfile

To replace every non-printable character (except valid control characters) with "?" (a question mark):

    tr -c '[:print:][:cntrl:]' '[?*]' < textfile > newfile

To replace every sequence of characters in the space character class with a single "#" character:

    tr -s '[:space:]' '[#*]'
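
The word-list example becomes more useful when combined with the sort and uniq filters from earlier in the chapter. The following is a sketch on an invented sentence; it uses the [:alpha:] class and a plain '\n' as the second set, which is a common spelling of the same idea.

```shell
text='The cat saw the other cat. The end!'

# 1. tr -cs: turn every run of non-letters into a single newline.
# 2. tr: fold uppercase to lowercase so "The" and "the" count together.
# 3. sort | uniq -c: count each distinct word.
# 4. sort -rn: largest count first.
top=$(printf '%s\n' "$text" |
  tr -cs '[:alpha:]' '\n' |
  tr '[:upper:]' '[:lower:]' |
  LC_ALL=C sort | uniq -c | LC_ALL=C sort -rn |
  head -n 1 | awk '{ print $1, $2 }')

echo "$top"   # 3 the
```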
      
     
    
   
  
 

In fact, tr only implements the simplest kind of text replacement. More complex requirements, such as conditional logic, call for more powerful and more complex tools, which can often be regarded as languages in their own right.

Common tools of this kind in Linux include the following.

Perl: Its powerful regular expressions have few rivals in the UNIX/Linux world, and text replacement is a piece of cake for it.

sed: The sed tool processes text streams and replaces text easily.

awk: The awk language supports conditional logic, loops, and other features. Its high degree of customizability is another reason for its popularity in text processing.

Python: As one of the most popular languages in the UNIX/Linux community, Python has quite powerful text processing modules and can certainly do the replacement you want. However, Python executes relatively slowly, and using a full language such as Python for simple text operations is overkill.

A slightly more complex example

Requirement: obtain the 100 users with the most visits in a day

Steps: cut -> tr -> cut -> sort -> uniq -> sort -> head

In the data, the first field is followed by a tab (\t) and the second field is followed by several blank characters, so we have to cut twice.

    [houchangren@ebsdi-23260-oozie data]$ cat ips.txt
    20:32  10.12.165.1     1238723
    20:31  10.12.165.7     123923
    20:32  10.12.165.18    128323
    20:12  10.12.165.20    1234623
    20:32  10.12.165.25    1232443
    20:32  10.12.165.26    1232433
    20:32  10.12.165.31    1234523
    20:32  10.12.165.32    123
    20:32  10.12.165.33    12345
    20:32  10.12.165.33    123245
    20:32  10.12.165.36    123333
    20:32  10.12.165.37    123423
    20:32  10.12.165.38    12423
    20:32  10.12.165.39    12623
    20:32  10.12.165.40    12523
    20:32  10.12.165.255   12423
    20:32  224.0.0.2       12123
    20:32  224.0.0.22      12223
    20:32  224.0.0.251     12322
    20:32  224.0.0.252     12322
    [houchangren@ebsdi-23260-oozie data]$ cut -d $'\t' -f 2 ips.txt | tr -s '[" "]' $'\t' | cut -d $'\t' -f1 | sort | uniq -c | sort -r | head -100
          2 10.12.165.33
          1 224.0.0.252
          1 224.0.0.251
          1 224.0.0.22
          1 224.0.0.2
          1 10.12.165.7
          1 10.12.165.40
          1 10.12.165.39
          1 10.12.165.38
          1 10.12.165.37
          1 10.12.165.36
          1 10.12.165.32
          1 10.12.165.31
          1 10.12.165.26
          1 10.12.165.255
          1 10.12.165.25
          1 10.12.165.20
          1 10.12.165.18
          1 10.12.165.1
          1
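
A possible variation on this pipeline, shown here as a sketch on three invented records rather than on ips.txt: awk splits on any run of blanks, so it can replace the cut, tr, cut sequence, and sort -rn compares the counts as numbers, which matters once a count reaches two digits.

```shell
# Hypothetical access records: time, client address, request id.
# The separators mix a tab and runs of spaces, as in ips.txt above.
records=$(printf '20:32\t10.0.0.1   11\n20:32\t10.0.0.2   12\n20:33\t10.0.0.1   13\n')

# awk splits on any run of blanks, so one step replaces cut | tr | cut.
# sort -rn compares the counts as numbers (plain -r would put 9 before 10).
top=$(printf '%s\n' "$records" |
  awk '{ print $2 }' |
  LC_ALL=C sort | uniq -c | LC_ALL=C sort -rn |
  head -n 100 | awk '{ print $1, $2 }')

printf '%s\n' "$top"
```

The final awk only strips the padding that uniq -c puts in front of each count.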
