The king of text sorting: the sort command

Source: Internet
Author: User


Contents:

1.1 option description

1.2 sort example

1.3 in-depth research on sort

sort is a sorting tool that embodies the Unix philosophy: "do one thing and do it well". Its sorting capability is powerful and complete. As long as the file is regular enough, sort can produce almost any ordering you want, which makes it a very high-quality tool.

Although sort is powerful, it has few options and is easy to use. Part of its success is that even complicated sorts use the same options as everyday ones. To implement complex sorts, however, you must understand how sort works.

In other words, you can get the job done without understanding the mechanism, but sooner or later a result will deviate from what you expect and leave you puzzled. Only by understanding the mechanism can you pin down the cause: either the result has no deviation, or you know why it does.

This article first explains the common options of the sort command, then gives simple examples as a first look at those options, and finally describes sort in depth. For the complete option list, see the translation of info sort (the Chinese manual for the sort command).

1.1 option description

sort reads each line of input and splits it into fields according to the specified separator. These fields are what sort sorts on. You can also choose the sorting rule: the collation rules of the current character set (the default), dictionary order, numerical value, month, or human-readable size (k < M < G). In addition, sort can remove duplicate lines and can sort in descending or ascending (the default) order.

The default rule is the character set collation rule. In general the order of common characters is: empty string < blank < digit < a < A < b < B < ... < z < Z. The same order applies to the dictionary rule.

Syntax format:

sort [OPTION]... [FILE]...

OPTION description:

-c: Check whether the specified FILE is already sorted. If it is not, print a diagnostic that names the first out-of-order line.
-C: Like "-c", but print no diagnostic. An exit status of 1 identifies a file that is not sorted.
-m: Merge multiple files that are already sorted. No sorting is performed during the merge.
-b: Ignore leading blank characters of a field. When the number of blanks varies, this option is almost mandatory. The "-n" option implies it.
-d: Sort in dictionary order. Only letters, digits and blanks are considered. Apart from special characters, this is generally equivalent to the default rule.
--debug: Show the sorting process and the fields and characters used for each comparison. Additional information is printed in the first few lines.
-f: Treat lowercase letters as their uppercase equivalents. For example, "b" and "B" compare as equal.
    When used with "-u", if the keys compare equal, the lowercase line is discarded.
-k: Specify the sort key. A key is made of fields in the format "POS1[,POS2]", where POS1 is the start position of the key and POS2 is the end position.
-n: Sort by numerical value. A key with no leading number, such as an empty string, is treated as 0. Apart from digits and the minus sign "-" (and, in GNU sort, a decimal point), no characters are recognized.
    When a numerically sorted key meets an unrecognized character, the comparison of that key ends immediately.
-M: Sort by month written as a string. The text is converted to uppercase and abbreviated before comparison. Rule: unknown < JAN < FEB < ... < NOV < DEC.
-o: Write the result to the specified file.
-r: The default order is ascending; use this option to get descending order.
    Note: "-r" does not change how keys are compared; it only reverses the result of the comparison.
-s: Disable the "last sort" (last-resort comparison) that sort otherwise performs.
-t: Specify the field separator.
    For special characters such as a tab, use -t $'\t' or -t 'ctrl+v,tab' (press ctrl+v, then the tab key).
-u: Output only the first of a run of duplicate lines. Combined with "-f", the duplicate lowercase lines are discarded.

1.2 sort example

This section gives simple usage examples, which are also the most common ones. If you only want to use sort and not dig into how it works, this section is sufficient.

Assume the file system.txt exists with the following content. Each gap between columns is a single tab.

[root@xuexi tmp]# cat system.txt
1       mac     2000    500
2       winxp   4000    300
3       bsd     1000    600
4       linux   1000    200
5       SUSE    4000    300
6       Debian  600     200
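
To follow along, the sample file can be recreated with real tab characters. This is a minimal sketch, assuming a POSIX shell and that writing system.txt into the current directory is acceptable; the tab count at the end is only a sanity check and is not part of the original article.

```shell
# Recreate the sample file. printf reuses its format for every group of
# four arguments, so each row gets three real tab separators.
printf '%s\t%s\t%s\t%s\n' \
    1 mac    2000 500 \
    2 winxp  4000 300 \
    3 bsd    1000 600 \
    4 linux  1000 200 \
    5 SUSE   4000 300 \
    6 Debian 600  200 > system.txt

# Sanity check: 6 rows with 3 separators each should give 18 tabs.
tab_count=$(tr -cd '\t' < system.txt | wc -c | tr -d ' ')
echo "rows: $(wc -l < system.txt | tr -d ' '), tabs: $tab_count"
```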

(1) With no options, whole lines are compared from the first character to the end of the line under the default character set collation rules and sorted in ascending order.

[root@xuexi tmp]# sort system.txt
1       mac     2000    500
2       winxp   4000    300
3       bsd     1000    600
4       linux   1000    200
5       SUSE    4000    300
6       Debian  600     200

Because the first characters of the lines satisfy 1 < 2 < 3 < 4 < 5 < 6, the result is as above.

(2) Sort on the third column. To split lines into fields, specify the field separator. A tab cannot be typed directly, so it is written as $'\t'.

[root@xuexi tmp]# sort -t $'\t' -k3 system.txt
4       linux   1000    200
3       bsd     1000    600
1       mac     2000    500
2       winxp   4000    300
5       SUSE    4000    300
6       Debian  600     200

The order 1000 < 2000 < 4000 is correct, but 600 comes last. The key is compared under the default character set collation rules, and the character 6 is greater than 4, so that line is placed last.

(3) Sort the third column by numerical value.

[root@xuexi tmp]# sort -t $'\t' -k3 -n system.txt
6       Debian  600     200
3       bsd     1000    600
4       linux   1000    200
1       mac     2000    500
2       winxp   4000    300
5       SUSE    4000    300

Now 600 is in the first line. Lines 2 and 3 both have 1000 in the third column. How is the order of these two lines decided?

(4) On top of the numeric sort on the 3rd column, use the 4th column as the tie-breaker, also sorted numerically.

[root@xuexi tmp]# sort -t $'\t' -k3 -k4 -n system.txt
6       Debian  600     200
4       linux   1000    200
3       bsd     1000    600
1       mac     2000    500
2       winxp   4000    300
5       SUSE    4000    300

What if the 2nd column should be the tie-breaker after the 3rd column is sorted numerically? Because the 2nd column contains letters rather than numbers, the following command is wrong, even though it happens to give the expected result.

[root@xuexi tmp]# sort -t $'\t' -k3 -k2 -n system.txt
6       Debian  600     200
3       bsd     1000    600
4       linux   1000    200
1       mac     2000    500
2       winxp   4000    300
5       SUSE    4000    300

The result comes out right because, by default, after all the comparisons specified on the command line are done, sort performs one last sort that compares the entire line under the plain default rules, that is, by character set collation in ascending order. For the two lines containing 1000, the first character 3 is less than 4, so the line starting with 3 comes first.

The command is wrong because the 2nd column starts with a letter rather than a digit. In numeric sorting, letters are unrecognized characters, and the comparison of a key ends as soon as an unrecognized character is met. You can use the "--debug" option to see how the comparison proceeds and which columns are used. Note that this option exists only in newer versions of sort, such as the one on CentOS 7.

[root@xuexi tmp]# sort --debug -t $'\t' -k3 -k2 -n system.txt
sort: using 'en_US.UTF-8' sorting rules
sort: key 1 is numeric and spans multiple fields
sort: key 2 is numeric and spans multiple fields
6>Debian>600>200
         ___                  # 1st comparison, key "-k3": the field used is the 3rd column
  ^ no match for key          # 2nd comparison, key "-k2": nothing in the key can be matched
________________              # by default sort always performs a last sort on the entire line
3>bsd>1000>600
      ____
  ^ no match for key
______________
4>linux>1000>200
        ____
  ^ no match for key
________________
1>mac>2000>500
      ____
  ^ no match for key
______________
2>winxp>4000>300
        ____
  ^ no match for key
________________
5>SUSE>4000>300
       ____
  ^ no match for key
_______________
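
The "no match for key" lines can also be checked without "--debug". The sketch below assumes GNU sort and uses a made-up four-line input; "-s" is added so that ties stay in input order, and LC_ALL=C only makes the run reproducible.

```shell
# Hypothetical input: two words and two numbers, one per line.
# -s disables the last-resort comparison, so ties keep their input order.
numeric_order=$(printf 'mac\nbsd\n-1\n5\n' | LC_ALL=C sort -s -n)
printf '%s\n' "$numeric_order"
# "mac" and "bsd" have no numeric prefix, so both compare as 0: they land
# between -1 and 5 and stay in input order (mac before bsd).
```

Without "-s", the two words would additionally be ordered by the last sort, which is what made the wrong command above look right.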

(5) On top of the numeric sort on the 3rd column, use the 2nd column as the tie-breaker, sorted in descending order under the default rules.

[root@xuexi tmp]# sort -t $'\t' -k3n -k2r system.txt
6       Debian  600     200
4       linux   1000    200
3       bsd     1000    600
1       mac     2000    500
2       winxp   4000    300
5       SUSE    4000    300

Because the 3rd column is sorted in ascending numeric order while the 2nd column is sorted in descending order under the default rules, the options have to be assigned to each field separately. Note that although "r" produces a descending result, it does not change how the key is compared; it only reverses the outcome. In other words, sort always compares in ascending terms, and "r" flips the result for that key afterwards.

An option attached to a field (such as the "n" in "-k3n" and the "r" in "-k2r") is called a private option. Options written on their own with a dash (such as "-n" and "-r") are global options. If a field has no private option, it inherits the global options. Only ordering options such as "-n" and "-r" can be inherited or attached to fields; options such as "-t" cannot.

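Therefore "-n -k3 -k4", "-n -k3n -k4" and "-k3n -k4n" are equivalent. In GNU sort a key that has private options of its own inherits none of the global ones, so "-r -k3n -k4" sorts its keys like "-k3n -k4r", not like "-k3nr -k4r".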
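
The equivalence of the global and private forms of "n" can be checked directly. This is a sketch, assuming GNU sort; the rows are the sample data from above, produced by a helper function named rows that is introduced here only for illustration, and LC_ALL=C is used so the run is reproducible.

```shell
tab=$(printf '\t')

# The sample rows, printed by a helper so that no file is needed.
rows() {
    printf '%s\t%s\t%s\t%s\n' \
        1 mac 2000 500  2 winxp 4000 300  3 bsd 1000 600 \
        4 linux 1000 200  5 SUSE 4000 300  6 Debian 600 200
}

global_n=$(rows | LC_ALL=C sort -t "$tab" -n -k3 -k4)   # both keys inherit -n
private_n=$(rows | LC_ALL=C sort -t "$tab" -k3n -k4n)   # n attached to each key

[ "$global_n" = "$private_n" ] && echo "same result"

# First column of the sorted output, joined on one line.
ids=$(printf '%s\n' "$private_n" | cut -f1 | paste -s -d ' ' -)
echo "$ids"
```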

Strictly speaking, the command above is not written rigorously. The proper form is:

sort -t $'\t' -k3n -k2,2r system.txt

"-k2,2" means the key runs from the 2nd field to the 2nd field, that is, only the 2nd field is sorted. The format is "POS1,POS2". If POS2 is omitted, the key extends to the end of the line, so here "-k2" is equivalent to "-k2,4": the 2nd through 4th columns are sorted as one key.

Note that "-k2" above inherits the default rule, comparing by character rather than by numerical value, which is why it behaves exactly like "-k2,4". With "-k2n" the key formally still spans "-k2,4n" (equally written "-k2n,4n" or "-k2n,4"), but numeric sorting recognizes only digits and the minus sign "-" and stops comparing the key at any other character, including the separator. In effect, "-k2n" therefore behaves like "-k2,2n", "-k2n,2" or "-k2n,2n".
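
The difference between an open-ended key and a closed one shows up as soon as two lines tie on the field itself. A minimal sketch with a made-up two-line input, assuming a sort that supports "-s" (GNU sort does); "-s" keeps ties in input order so that only the key decides.

```shell
# Hypothetical input, space separated. Both lines have "ab" in field 2.
input='1 ab 2
2 ab 1'

# -k2: the key runs from field 2 to the end of the line ("ab 2" vs "ab 1").
open_ended=$(printf '%s\n' "$input" | LC_ALL=C sort -s -t ' ' -k2)

# -k2,2: the key is field 2 only ("ab" vs "ab"), so the lines tie.
one_field=$(printf '%s\n' "$input" | LC_ALL=C sort -s -t ' ' -k2,2)

printf 'with -k2:\n%s\nwith -k2,2:\n%s\n' "$open_ended" "$one_field"
```

With "-k2" the third column silently takes part in the comparison and the lines swap; with "-k2,2" they keep their input order.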

For more on these theoretical points, see the in-depth material in the next section. This section does not explain the theory; it only shows how to use the command.

(6) On top of the numeric sort on the 3rd column, use the 2nd character of the 2nd column as the tie-breaker, sorted in ascending order under the default rules.

[root@xuexi tmp]# sort -t $'\t' -k3n -k2.2,2.2 system.txt
6       Debian  600     200
4       linux   1000    200
3       bsd     1000    600
1       mac     2000    500
2       winxp   4000    300
5       SUSE    4000    300

"-k2.2,2.2" means the key starts at the 2nd character of the 2nd field and ends at the 2nd character of the 2nd field, so it is strictly limited to that one character. To sort on that character in descending order, use "-k2.2,2.2r".

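A character-level key is easy to try on a tiny input. The sketch below uses three made-up lines and an explicit space separator, so the fields carry no leading blanks and character 2 of field 2 really is the second letter of the word.

```shell
# Hypothetical input. The second letters are a (cat), o (dog) and e (bee).
by_second_letter=$(printf '1 cat\n2 dog\n3 bee\n' |
    LC_ALL=C sort -t ' ' -k2.2,2.2)
printf '%s\n' "$by_second_letter"
```

The lines come out in the order cat, bee, dog, following a < e < o.
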
(7) Use "-u" to remove lines whose key is duplicated. For example, the 3rd column has two lines with 1000 and two with 4000; for each duplicated key only the first line is kept.

[root@xuexi tmp]# sort -t $'\t' -k3n -u system.txt
6       Debian  600     200
3       bsd     1000    600
1       mac     2000    500
2       winxp   4000    300

To remove lines with duplicate keys, the "last sort" is disabled when "-u" is used. To know which of the duplicate lines ends up first, you need to understand the overall working mechanism of sort, described in the next section.

"sort -u" and "sort | uniq" are equivalent, but not once other options are added. For example, "sort -n -u" checks uniqueness only on the numeric part of the key, whereas in "sort -n | uniq" sort first orders the lines by numerical value and uniq then checks the uniqueness of entire lines.

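The difference between the two forms is visible on a three-line input. This is a sketch with made-up data; the line counts are computed only to make the comparison obvious.

```shell
# Hypothetical input: two lines share the numeric value 1.
input='1 b
1 a
2 c'

# One line per distinct numeric key.
unique_keys=$(printf '%s\n' "$input" | LC_ALL=C sort -n -u)
# One line per distinct whole line.
unique_lines=$(printf '%s\n' "$input" | LC_ALL=C sort -n | uniq)

key_count=$(printf '%s\n' "$unique_keys" | wc -l | tr -d ' ')
line_count=$(printf '%s\n' "$unique_lines" | wc -l | tr -d ' ')
echo "sort -n -u: $key_count lines, sort -n | uniq: $line_count lines"
```
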
(8) Save the sorted result to a file. You can use redirection or the "-o" option, but redirection cannot write back to the input file, because the shell truncates that file before sort starts. "-o" has no such problem, because sort finishes reading its input before it opens the output file. Combining "-o" with "-m", however, is not safe either.

[root@xuexi tmp]# sort -t $'\t' -k3n -o system1.txt system.txt
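
The two ways of writing back to the input file can be compared side by side. A minimal sketch, assuming mktemp is available; the file names a.txt and b.txt are placeholders chosen for this example.

```shell
workdir=$(mktemp -d)
printf '3\n1\n2\n' > "$workdir/a.txt"
cp "$workdir/a.txt" "$workdir/b.txt"

# Safe: sort reads all of its input before it opens the -o target.
sort -n -o "$workdir/a.txt" "$workdir/a.txt"

# Unsafe: the shell truncates b.txt before sort ever reads it.
sort -n "$workdir/b.txt" > "$workdir/b.txt"

in_place=$(cat "$workdir/a.txt")
truncated_size=$(wc -c < "$workdir/b.txt" | tr -d ' ')
echo "a.txt: $(echo $in_place), b.txt size: $truncated_size bytes"
rm -r "$workdir"
```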

(9) Use "-c" or "-C" to check whether a file is sorted. If it is, nothing is printed and the exit status is 0. If it is not, the exit status is 1; "-c" also prints a diagnostic naming the first out-of-order line, while "-C" prints nothing.

[root@xuexi tmp]# sort -c -k3n system.txt ;echo $?
sort: system.txt:3: disorder: 3       bsd     1000    600
1

This shows that line 3 of system.txt is the first line out of order and that the exit status is 1.

[root@xuexi tmp]# sort -C -k3n system.txt ;echo $?
1
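
In scripts the exit status of "-C" is usually what matters. The sketch below feeds made-up input on stdin and captures the status without relying on "echo $?".

```shell
# Hypothetical inputs; -C prints nothing and only sets the exit status.
printf '1\n2\n3\n' | sort -C -n && sorted_rc=0 || sorted_rc=$?
printf '1\n3\n2\n' | sort -C -n && unsorted_rc=0 || unsorted_rc=$?
echo "sorted input: $sorted_rc, unsorted input: $unsorted_rc"

# Typical use: act only when the data is not yet in order.
if ! printf '1\n3\n2\n' | sort -C -n; then
    echo "input needs sorting"
fi
```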

1.3 in-depth research on sort

sort looks easy to use. Isn't it just "sort -t DELIMITER -k POS1,POS2 file"? Indeed, its man page is only about 100 lines, and even the info document, padding included, is little more than 500 lines. In fact the sort command is both very simple and very difficult. It is simple because complex and simple tasks alike use the same handful of options. It is difficult because, without understanding its working mechanism and details, you will sometimes get unexpected results and not know why.

This section mainly describes the theory and the working mechanism, with occasional examples. If anything is unclear, test it yourself; you are also welcome to leave a comment below the blog post. The "--debug" option (only in newer versions of sort, such as the one on CentOS 7) is a great help for troubleshooting, so make good use of it.

(1) By default, sort orders data by the character set collation rules. Specify "-d" for dictionary order, "-n" for numerical value, "-M" for months written as strings, and "-h" for human-readable sizes.

The character set rule and the dictionary rule generally give the same order for the characters both recognize. The order of common characters is: empty string < blank < digit < a < A < b < B < ... < z < Z.

Choosing a different rule changes not only the basis of comparison but also, indirectly, the behavior of the sort, because different rules recognize different kinds of characters. For the details, see (4).

(2) sort splits each line at the separator given by "-t" to obtain fields. The separator is not part of any field. The default separator is the empty string between a non-blank character and a blank character, not a space or a tab as many articles on the Internet claim.

For example, " foo bar" (with one leading space) has two fields by default, " foo" and " bar". With a space as the separator it has three fields: the first is empty, and the second and third are "foo" and "bar". The following three sort commands verify that the default separator is not a space.

[root@xuexi ~]# echo -e " 234 bar\n 123 car" | sort -t ' ' -b -k3
 234 bar
 123 car
[root@xuexi ~]# echo -e " 234 bar\n 123 car" | sort -b -k2
 234 bar
 123 car
[root@xuexi ~]# echo -e " 234 bar\n 123 car" | sort -b -k3    # the field given by -k3 is out of range, so the key is empty
 123 car
 234 bar
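
Because the default separator leaves the leading blanks inside each field, the number of blanks can change the order. A minimal sketch with a made-up input whose second line has two spaces between its columns; it assumes the byte order of LC_ALL=C, where a space sorts before any letter.

```shell
# Hypothetical input: one space in the first line, two in the second.
input='a b
a  z'

# Default separator, no -b: field 2 is " b" and "  z", blanks included.
with_blanks=$(printf '%s\n' "$input" | LC_ALL=C sort -s -k2,2)

# With -b the leading blanks are skipped and "b" is compared with "z".
ignoring_blanks=$(printf '%s\n' "$input" | LC_ALL=C sort -s -b -k2,2)

printf 'without -b:\n%s\nwith -b:\n%s\n' "$with_blanks" "$ignoring_blanks"
```

Without "-b" the line ending in z sorts first, because its key starts with two blanks; with "-b" the order follows the letters.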

(3). Use the "-k" option to specify the sort key. If no sort key is specified, the entire row becomes the sort key, that is, the entire row is sorted.

  • A key is made of fields in the format "POS1[,POS2]", which gives the start and end positions of the key within each line. The key is the object that is sorted.
  • The POS format is "F[.C][OPTS]", where F is the field number and C is the character position within the field. Both are counted from 1. A character position of 0 in POS2 means the last character of the POS2 field. If ".C" is omitted in POS1, it defaults to 1 (the first character of the field); if ".C" is omitted in POS2, it defaults to 0 (the last character of the field). When leading blanks are ignored with the "-b" option, C is counted from the first non-blank character. If F or C is out of range, the key is empty, for example when a line has only three fields but "-k4" is given, or the 2nd field has only three characters but "-k2.5" is given.
  • If POS2 is omitted, the key extends to the end of the line, as if written "POS1,line_end". If POS2 is given, the key may still span several fields. Either way, a key that spans fields keeps the separators between them.
  • OPTS gives the key's options, including but not limited to "bfnrhM", which do the same as the global options "-b", "-f", "-n", "-r", "-h" and "-M". If a key has no OPTS, it inherits the global options. Options given in the key are its private options and override the global ones. Except for "b", an option has the same effect whether attached to POS1 or POS2. "b" applies to POS1 when attached to POS1 and to POS2 when attached to POS2; an inherited global "-b" applies to both.
  • When the number of leading blanks before a field varies, the fields get confused, so it is strongly recommended to always ignore leading blanks. Numeric sorting (the "n" option) implies "b".
  • Several "-k" options define several keys, which are applied in the order given. The first is usually called the primary key. The second key is used among lines that tie on the first, the third among lines that tie on the second, and so on.

Some examples follow. The "n" option appears in them; its description here is not yet rigorous and is refined in (4).

  • "-k 2": no POS2 is given, so the key extends to the end of the line. It starts at the first character of the 2nd field and ends at the end of the line.
  • "-k 2,3": the key starts at the first character of the 2nd field and ends at the last character of the 3rd field.
  • "-k 2,2": the key is the 2nd field only.
  • "-k 2,3n", "-k 2n,3" and "-k 2n,3n": these three are equivalent, because apart from "b" an option has the same effect whether it is attached to POS1 or POS2.
  • "-k 2,3b", "-k 2b,3" and "-k 2b,3b": these three are not equivalent to one another.
  • "-k 2n": the key runs from the 2nd field to the end of the line and is sorted by numerical value.
  • "-k 2.2b,3.2n": the key starts at the 2nd non-blank character of the 2nd field and ends at the 2nd character of the 3rd field (counted with any leading blanks), and is sorted by numerical value. The "b" is actually redundant here, because "n" implies "b".
  • "-k 5b,5 -k 3,3n": two sort keys. The primary key is the 5th field and the secondary key is the 3rd field. The primary key is sorted under the default rules and the secondary key by numerical value. The secondary key is applied among lines that tie on the primary key.
  • "-k 5,5n -k 3b,6b": the primary key is the 5th field, sorted by numerical value. The secondary key runs from the 3rd field to the 6th field, ignores leading blanks, and is sorted under the default rules. The secondary key is applied among lines that tie on the primary key.

(4) When a sorting rule such as "n", "M" or "h" meets a character it does not recognize, the comparison of the current key ends immediately. The default rule, character set collation, recognizes practically every character, so the entire key is always compared. This is what decides whether a comparison crosses fields or keys.

For example, with the "n" option the key is sorted by numerical value. "n" recognizes only digits and the minus sign "-" (GNU sort also accepts a decimal point), so the comparison of the key ends as soon as any other character is met. Take the input "abc 123 456 abc" with a space as the separator and "-k 2,3n": the key contains "123 456", but "n" does not recognize the blank in the middle, so the comparison of the key ends right after the 2nd field, "123".

Because of this, the "n" option never compares across fields or keys. "-k 2,3n", "-k 2n", "-k 2,2n" and "-k 2,4n" therefore give the same result: only the 2nd field is sorted numerically. The default rule has no such limit, because it recognizes all characters, so "-k 2,3", "-k 2", "-k 2,2" and "-k 2,4" are not equivalent to one another.
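
The claim that a numeric key stops at the field boundary while a default key does not can be verified on two made-up lines. The sketch assumes GNU sort; "-s" keeps ties in input order, so any reordering comes from the key alone.

```shell
# Hypothetical input: both lines have 1 in field 2, but differ in field 3.
input='x 1 9
y 1 2'

numeric_two_fields=$(printf '%s\n' "$input" | LC_ALL=C sort -s -t ' ' -k2,3n)
numeric_one_field=$(printf '%s\n' "$input" | LC_ALL=C sort -s -t ' ' -k2,2n)
text_two_fields=$(printf '%s\n' "$input" | LC_ALL=C sort -s -t ' ' -k2,3)

# The numeric keys only ever see "1", so both numeric runs leave the
# input order untouched; the text key compares "1 9" with "1 2".
printf '%s\n---\n%s\n---\n%s\n' \
    "$numeric_two_fields" "$numeric_one_field" "$text_two_fields"
```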

The dictionary rule "-d" considers only letters, digits and blanks. According to the GNU documentation it ignores all other characters rather than ending the comparison at them. "-h" and "-M" have their own limits on the characters they recognize and stop in the same way as "n". For descriptions of the "-h" and "-M" options, see info sort.

In particular, "n" does not recognize the NUL character ("\0") and ends the comparison when it meets one. This can make a numeric sort look wrong. For example:

[root@xuexi ~]# echo -e "b 100:200 200\na 110 300" | tr ':' '\0' | sort -t ' ' -k2n
b 100200 200
a 110 300

For a line like "b 100\0200 200", "-k 2n" makes the key "100\0200". Although the output looks like 100200, only 100 is compared, which is smaller than 110. This creates the illusion that 100200 is smaller than 110.

(5) By default, sort performs a "last sort". The "-s" option disables it, and the "-u" option implies "-s".

Consider the following situation: the two rows have identical sorting results for all keys. How can we determine the order of the two rows?

For example:

[root@xuexi ~]# echo -e "b 100 200\na 100 300" | sort -t ' ' -k2n
a 100 300
b 100 200

The first line is "b 100 200" and the second is "a 100 300". Both have 100 in the 2nd field, so the two lines compare equal on the key. sort therefore falls back on a final comparison of the entire line under the default rules (character set collation, ascending). This is the "last sort" (info sort calls it the last-resort comparison). Because a < b in this final comparison, the second line "a 100 300" ends up before the first line "b 100 200".

With the "last sort" disabled, lines with equal keys keep the relative order in which they were read; the line read first comes first.
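
The effect of "-s" on the same two lines can be seen by running the command both ways. A small sketch, using printf instead of echo -e for portability.

```shell
input='b 100 200
a 100 300'

# Default: the tie on field 2 is broken by comparing the whole lines.
with_last_sort=$(printf '%s\n' "$input" | LC_ALL=C sort -t ' ' -k2n)

# -s: no last sort, so the tied lines stay in the order they were read.
stable=$(printf '%s\n' "$input" | LC_ALL=C sort -s -t ' ' -k2n)

printf 'default:\n%s\nwith -s:\n%s\n' "$with_last_sort" "$stable"
```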

What if, in the preceding example, the 2nd field is sorted under the default rule instead of numerically? As follows:

[root@xuexi ~]# echo -e "b 100 200\na 100 300" | sort -t ' ' -k2
b 100 200
a 100 300

The default rule, character set collation, recognizes all characters, so the entire key of "-k2" is compared, and that key extends over the 2nd and 3rd fields. Because the 2 of the third field is less than 3, the first line stays before the second. sort still performs the "last sort" here, but it does not affect the result.

If no sorting option is given at all, the sort is entirely default, a final comparison would add nothing, and no "last sort" is performed. If the global "-r" option is given, the "last sort" is affected too, because "-r" reverses its result as well.

(6). sort usage suggestions.

After working through the points above, do you feel that sort can meet almost any sorting need? As long as the file is regular enough, sort can control how any one or more columns are sorted and whether comparisons cross columns, characters and keys.

Here are a few usage suggestions as a final supplement.

  • To sort on a single field or character, always write POS2 with POS2 = POS1, so that the key is strictly limited to that field or character. For example, write "-k2,2" instead of "-k2".
  • To sort on several fields or characters, use several "-k" options and give each key its private options as needed. This avoids accidentally overlooking the extension to the end of the line or the exact range of a key. For example, to sort the 2nd and 3rd columns numerically, write "-k2n -k3n" rather than "-k2,3n".
  • Always use "-b" to ignore leading blanks and avoid confusion when lines are split into fields. "-n" implies "-b", so "-b" can be omitted for numeric sorts.
  • For large files, write out all the keys your requirement needs and then use "-s" to disable the "last sort". The "last sort" compares entire lines, which costs extra time.

 

Finally, a test question. Suppose the log files to be sorted contain lines in the following format:

4.150.156.3 - - [01/Apr/2004:06:31:51 +0000] message 1

211.24.3.231 - - [24/Apr/2004:20:17:39 +0000] message 2

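Can you understand the following two commands, which are intended to be equivalent? Note that the numeric comparison in GNU sort also accepts a decimal point, so with "." as the separator the open-ended keys of the second form may read "4.150" as a single number; the first form, with explicit "F,F" ranges, is the safer one.

sort -s -t ' ' -k 4.9n -k 4.5M -k 4.2n -k 4.14,4.21 file*.log | sort -s -t '.' -k 1,1n -k 2,2n -k 3,3n -k 4,4n
sort -s -t ' ' -k 4.9n -k 4.5M -k 4.2n -k 4.14,4.21 file*.log | sort -s -t '.' -n -k1 -k2 -k3 -k4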
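
One way to check your answer is to run the first command on a few lines. The sketch below assumes GNU sort (for "-M") and feeds the two log lines from above plus a third, made-up line dated in March through stdin instead of file*.log.

```shell
# The first two lines come from the question; the "message 3" line is a
# hypothetical addition so that one IP address appears twice.
logs='211.24.3.231 - - [24/Apr/2004:20:17:39 +0000] message 2
4.150.156.3 - - [01/Apr/2004:06:31:51 +0000] message 1
4.150.156.3 - - [01/Mar/2004:10:00:00 +0000] message 3'

# Stage 1 orders by year, month, day and time inside field 4.
# Stage 2 orders by the four parts of the IP address; -s keeps the
# date order from stage 1 for lines with the same address.
sorted_logs=$(printf '%s\n' "$logs" |
    LC_ALL=C sort -s -t ' ' -k 4.9n -k 4.5M -k 4.2n -k 4.14,4.21 |
    LC_ALL=C sort -s -t '.' -k 1,1n -k 2,2n -k 3,3n -k 4,4n)
printf '%s\n' "$sorted_logs"

message_order=$(printf '%s\n' "$sorted_logs" | awk '{print $NF}' | paste -s -d ' ' -)
echo "message order: $message_order"
```

The result is grouped by IP address in numeric order, and within one address the lines are in chronological order.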

 

Back to series article outline: http://www.cnblogs.com/f-ck-need-u/p/7048359.html
