The Linux sort command explained

Source: Internet
Author: User

sort is one of the most commonly used commands on Linux, and its job is sorting. Give it five minutes and you will have it down. Let's start!

1. How sort works

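sort treats each line of a file as a unit and compares lines with one another. Lines are compared character by character starting from the first character, by ASCII code value (strictly speaking this applies in the C locale; other locales use their own collation order), and the result is output in ascending order.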

[rocrocket@rocrocket programming]$ cat seq.txt
Banana
Apple
Pear
Orange
[rocrocket@rocrocket programming]$ sort seq.txt
Apple
Banana
Orange
Pear
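
Here is a small sketch you can paste into a shell to see the byte-value ordering for yourself. It assumes a POSIX shell; the lowercase entry apple and the variable name sorted are only for illustration. LC_ALL=C is set because the ASCII ordering described above applies in the C locale.

```shell
# Compare by byte (ASCII) value: uppercase letters sort before lowercase ones.
sorted=$(printf '%s\n' Banana apple Pear Orange | LC_ALL=C sort)
printf '%s\n' "$sorted"
```

In this locale apple should land after Pear, since a (97) has a larger code than any uppercase letter.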

2. The sort -u option

Its job is simple: it removes duplicate lines from the output.

[rocrocket@rocrocket programming]$ cat seq.txt
Banana
Apple
Pear
Orange
Pear
[rocrocket@rocrocket programming]$ sort seq.txt
Apple
Banana
Orange
Pear
Pear
[rocrocket@rocrocket programming]$ sort -u seq.txt
Apple
Banana
Orange
Pear

The duplicate Pear line has been removed by the -u option.
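
For whole-line comparisons like this one, sort -u should give the same lines as piping a plain sort through uniq. A quick sketch (the variable names are just for illustration):

```shell
# Deduplicate while sorting.
unique=$(printf '%s\n' Banana Apple Pear Orange Pear | LC_ALL=C sort -u)
# The classic two-step equivalent.
piped=$(printf '%s\n' Banana Apple Pear Orange Pear | LC_ALL=C sort | uniq)
printf '%s\n' "$unique"
```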

3. The sort -r option

By default sort sorts in ascending order. To sort in descending order instead, add -r.

[rocrocket@rocrocket programming]$ cat number.txt
1
3
5
2
4
[rocrocket@rocrocket programming]$ sort number.txt
1
2
3
4
5
[rocrocket@rocrocket programming]$ sort -r number.txt
5
4
3
2
1

4. The sort -o option

Because sort writes its result to standard output by default, you need redirection to write the result to a file, for example sort filename > newfile.

However, if you want to write the sorted result back to the original file, redirection will not work, because the shell empties the file before sort gets to read it.

[rocrocket@rocrocket programming]$ sort -r number.txt > number.txt
[rocrocket@rocrocket programming]$ cat number.txt
[rocrocket@rocrocket programming]$
See, number.txt has been emptied.

This is where the -o option comes in. It solves the problem and lets you write the result to the original file with confidence. This is perhaps the only advantage -o has over redirection.

[rocrocket@rocrocket programming]$ cat number.txt
1
3
5
2
4
[rocrocket@rocrocket programming]$ sort -r number.txt -o number.txt
[rocrocket@rocrocket programming]$ cat number.txt
5
4
3
2
1
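
A sketch of the safe in-place sort, run in a throwaway directory so that nothing real is overwritten. It assumes mktemp -d is available on your system; the names workdir and result are only for illustration.

```shell
# Work in a temporary directory (assumes mktemp -d exists on this system).
workdir=$(mktemp -d)
printf '%s\n' 1 3 5 2 4 > "$workdir/number.txt"

# sort reads all of its input before it writes the -o file,
# so the input file can safely be named as the output.
sort -r -o "$workdir/number.txt" "$workdir/number.txt"

result=$(cat "$workdir/number.txt")
printf '%s\n' "$result"
rm -r "$workdir"
```

Putting -o before the file name, as here, is the more portable spelling; the form used in the transcript above works with GNU sort.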

5. The sort -n option

Have you ever seen 10 sorted before 2? I have. It happens because sort compares these numbers as strings of characters: it first compares 1 with 2, finds that 1 is obviously smaller, and so puts 10 before 2. That is simply how sort behaves by default.

To change this, use the -n option to tell sort to "sort by numeric value"!

[rocrocket@rocrocket programming]$ cat number.txt
1
10
19
11
2
5
[rocrocket@rocrocket programming]$ sort number.txt
1
10
11
19
2
5
[rocrocket@rocrocket programming]$ sort -n number.txt
1
2
5
10
11
19
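
The same comparison as a paste-ready sketch, with -n and -r combined at the end to get a numeric descending sort. The variable names are only for illustration.

```shell
# Character-by-character comparison: 10 comes before 2.
by_char=$(printf '%s\n' 1 10 19 11 2 5 | LC_ALL=C sort)
# Numeric comparison.
by_value=$(printf '%s\n' 1 10 19 11 2 5 | LC_ALL=C sort -n)
# Numeric comparison, largest first.
descending=$(printf '%s\n' 1 10 19 11 2 5 | LC_ALL=C sort -n -r)
printf '%s\n' "$by_char" '---' "$by_value" '---' "$descending"
```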

6. The sort -t and -k options

If the content of a file is as follows:

[rocrocket@rocrocket programming]$ cat facebook.txt
Banana:30:5.5
Apple:10:2.5
Pear:90:2.3
Orange:20:3.4

This file has three columns separated by colons. The first column indicates the fruit type, the second column indicates the fruit quantity, and the third column indicates the fruit price.

Now I want to sort by fruit quantity, that is, by the second column. How can sort do this?

Fortunately, sort provides the -t option, which is followed by the delimiter to use. (Does it remind you of the -d option of cut and paste?)

Once the delimiter is specified, you can use -k to specify the column.

[rocrocket@rocrocket programming]$ sort -n -k 2 -t : facebook.txt
Apple:10:2.5
Orange:20:3.4
Banana:30:5.5
Pear:90:2.3

We used the colon as the delimiter and sorted numerically in ascending order on the second column. The result is just what we wanted.
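
A sketch that sorts the same data by quantity and then by price, assuming the file has no spaces around the colons. Here the key is written as -k 2,2 (and -k 3,3), which limits it to a single column; the variable names are only for illustration.

```shell
data='Banana:30:5.5
Apple:10:2.5
Pear:90:2.3
Orange:20:3.4'

# Sort numerically on the second column (quantity) only.
by_quantity=$(printf '%s\n' "$data" | LC_ALL=C sort -n -t : -k 2,2)
# Sort numerically on the third column (price) only; -n understands decimals.
by_price=$(printf '%s\n' "$data" | LC_ALL=C sort -n -t : -k 3,3)
printf '%s\n' "$by_quantity" '---' "$by_price"
```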

7. Other common sort options

-f folds lowercase letters to uppercase for comparison, that is, case is ignored.

-c checks whether the file is already sorted. If it is not, sort prints information about the first out-of-order line and returns 1.

-C checks whether the file is already sorted. If it is not, sort prints nothing and just returns 1.

-M sorts by month, for example Jan is less than Feb.

-b ignores all leading blanks on each line and starts comparing from the first visible character.
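
A few of these options in a short sketch. The sample words and variable names are only for illustration, and -M is assumed to be supported by your sort (it is in GNU sort).

```shell
# -c exits with status 1 when the input is not sorted (its message goes to stderr).
printf '%s\n' Banana Apple Pear | LC_ALL=C sort -c 2>/dev/null && unsorted_status=0 || unsorted_status=$?
printf '%s\n' Apple Banana Pear | LC_ALL=C sort -c && sorted_status=0 || sorted_status=$?

# -f compares as if lowercase letters were uppercase.
folded=$(printf '%s\n' banana Apple pear | LC_ALL=C sort -f)
# -M orders month abbreviations by their position in the calendar.
months=$(printf '%s\n' Mar Jan Feb | LC_ALL=C sort -M)

echo "unsorted input: $unsorted_status, sorted input: $sorted_status"
printf '%s\n' "$folded" "$months"
```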

Sometimes you will see sort followed by a bunch of things like -k1,2 or -k1.2 -k3.4, which can look baffling. Today, let's get the -k option sorted out!

1. Prepare materials

$ cat facebook.txt
Google 110 5000
Baidu 100 5000
Guge 50 3000
Sohu 100 4500

The first field is the company name, the second is the number of employees, and the third is the average employee salary. (Apart from the company names, don't trust the rest; it is all made up ^_^)

2. I want to sort this file alphabetically by company name, that is, by the first field (facebook.txt has three fields):

$ sort -t ' ' -k 1 facebook.txt
Baidu 100 5000
Google 110 5000
Guge 50 3000
Sohu 100 4500

Simply use -k 1. (Strictly speaking this is not precise, as you will see later.)

3. I want to sort facebook.txt by number of employees:

$ sort -n -t ' ' -k 2 facebook.txt
Guge 50 3000
Baidu 100 5000
Sohu 100 4500
Google 110 5000

I believe you can understand this.

There is one issue, though: Baidu and Sohu both have 100 employees. What happens then? By default, sort falls back to comparing the whole line in ascending order, starting with the first field, so Baidu is placed before Sohu.

4. I want to sort facebook.txt by number of employees and, where that is equal, by average salary in ascending order:

$ sort -n -t ' ' -k 2 -k 3 facebook.txt
Guge 50 3000
Sohu 100 4500
Baidu 100 5000
Google 110 5000

Adding -k 2 -k 3 solves the problem. sort supports setting a priority among fields this way: it sorts by the 2nd field first and, where that is equal, by the 3rd field. (You can keep adding keys like this to set as many priority levels as you like.)
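
The same two-level sort as a self-contained sketch. It writes the keys as -k 2,2 and -k 3,3 so that each key covers exactly one field; with this data the result should match the command above. The variable names are only for illustration.

```shell
data='Google 110 5000
Baidu 100 5000
Guge 50 3000
Sohu 100 4500'

# First by headcount (field 2), then by salary (field 3), both numeric.
result=$(printf '%s\n' "$data" | LC_ALL=C sort -n -t ' ' -k 2,2 -k 3,3)
printf '%s\n' "$result"
```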

5. I want to sort facebook.txt by salary in descending order and, where salaries are equal, by number of employees in ascending order (this one is a little harder):

$ sort -n -t ' ' -k 3r -k 2 facebook.txt
Baidu 100 5000
Google 110 5000
Sohu 100 4500
Guge 50 3000

There is a small trick here. Look closely: a lowercase r has been quietly added after -k 3. Think about it: can you work out the answer from what we covered earlier? The reveal: the r modifier has the same effect as the -r option, namely reverse order. Because sort is ascending by default, r is needed to sort the third field (average salary) in descending order. You can likewise add n to a key to sort that field numerically; note that a key carrying its own modifiers no longer inherits the global -n. For example:

$ sort -t ' ' -k 3nr -k 2n facebook.txt
Baidu 100 5000
Google 110 5000
Sohu 100 4500
Guge 50 3000

We dropped the leading -n option and added n to each -k option instead.
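
One detail worth checking for yourself: in GNU sort (and, as far as I understand it, under POSIX rules), a key that has its own modifiers does not pick up a global -n. The sketch below uses the headcount field, where the difference is visible; the variable names are only for illustration.

```shell
data='Google 110 5000
Baidu 100 5000
Guge 50 3000
Sohu 100 4500'

# The key has its own modifier (r), so the global -n is not applied to it:
# field 2 is compared as text, in reverse, and "50" beats "110".
as_text=$(printf '%s\n' "$data" | LC_ALL=C sort -n -t ' ' -k 2,2r)
# Putting n on the key itself gives a true numeric descending sort.
as_number=$(printf '%s\n' "$data" | LC_ALL=C sort -t ' ' -k 2,2nr)
printf '%s\n' "$as_text" '---' "$as_number"
```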

6. Syntax of the -k option

To go further, you need a little theory. Here is the syntax of the -k option:

[ FStart [ .CStart ] ] [ Modifier ] [ , [ FEnd [ .CEnd ] ] [ Modifier ] ]

The comma (,) divides the syntax into two parts: Start and End.

The first idea to get into your head is this: "if the End part is not given, End is taken to be the end of the line". This point is very important, yet it is often overlooked.

The Start part consists of three parts. The Modifier part is the option-like suffix, such as the n and r we saw earlier. Let's focus on FStart and .CStart.

.CStart can be omitted; if it is, the key starts at the beginning of the field. The earlier -k 2 and -k 3 are examples that omit .CStart.

In FStart.CStart, FStart is the field to use and CStart is the character within that field at which the sort key starts.

Similarly, in the End part you can set FEnd.CEnd. If you omit .CEnd, the key ends at the end of that field, that is, at its last character. Setting CEnd to 0 (zero) also means the end of the field.

7. On a whim, sort starting from the second letter of the company name:

$ sort -t ' ' -k 1.2 facebook.txt
Baidu 100 5000
Sohu 100 4500
Google 110 5000
Guge 50 3000

We used -k 1.2 to sort on the string that starts at the second character of the first field (since no End is given, the key strictly runs to the end of the line). Baidu comes first because its second letter is a. The second character of both Sohu and Google is o, but the h in Sohu comes before the second o in Google, so they rank second and third respectively. Guge can only come fourth.

8. On another whim, sort only on the second letter of the company name and, where that is equal, by salary in descending order:

$ sort -t ' ' -k 1.2,1.2 -k 3,3nr facebook.txt
Baidu 100 5000
Google 110 5000
Sohu 100 4500
Guge 50 3000

Because we sort only on the second letter, we write -k 1.2,1.2. (If you ask "why not just -k 1.2?", the answer is that omitting the End part means sorting on everything from the second letter onward, not on that letter alone.) For the salary we also used -k 3,3. This is the most precise way to write it: it says we sort on this field only. If you omit the trailing 3, it becomes "sort on everything from the start of the 3rd field to the end of the line".
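
The two commands from items 7 and 8 side by side, as a sketch you can run without creating facebook.txt. The variable names are only for illustration.

```shell
data='Google 110 5000
Baidu 100 5000
Guge 50 3000
Sohu 100 4500'

# Key starts at the 2nd character of field 1 and runs on from there.
from_second=$(printf '%s\n' "$data" | LC_ALL=C sort -t ' ' -k 1.2)
# Key is exactly the 2nd character of field 1; ties go to salary, descending.
only_second=$(printf '%s\n' "$data" | LC_ALL=C sort -t ' ' -k 1.2,1.2 -k 3,3nr)
printf '%s\n' "$from_second" '---' "$only_second"
```

With the longer key, Sohu beats Google on the third letter; with the one-character key, the two tie on o and the salary decides.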

9. Which options can be used in the Modifier part?

You can use b, d, f, i, n, or r.

You already know n and r.

b ignores leading blanks in the field.

d sorts the field in dictionary order (only blanks, letters, and digits are considered).

f ignores case when sorting the field.

i ignores non-printable characters and sorts on printable characters only. (Some ASCII characters are non-printable, for example \a is the alert (bell), \b is backspace, \n is line feed, and \r is carriage return.)

10. Examples of using -k together with -u:

$ cat facebook.txt
Google 110 5000
Baidu 100 5000
Guge 50 3000
Sohu 100 4500

This is the original facebook.txt file.

$ sort -n -k 2 facebook.txt
Guge 50 3000
Baidu 100 5000
Sohu 100 4500
Google 110 5000

$ sort -n -k 2 -u facebook.txt
Guge 50 3000
Baidu 100 5000
Google 110 5000

After sorting numerically on the employee field with -u added, the Sohu line is gone! It turns out that -u looks only at the key set with -k: when lines have the same key, every one after the first is dropped.

$ sort -k 1 -u facebook.txt
Baidu 100 5000
Google 110 5000
Guge 50 3000
Sohu 100 4500

$ sort -k 1.1,1.1 -u facebook.txt
Baidu 100 5000
Google 110 5000
Sohu 100 4500

In this example, Guge, which also starts with G, is not spared either.

$ sort -n -k 2 -k 3 -u facebook.txt
Guge 50 3000
Sohu 100 4500
Baidu 100 5000
Google 110 5000

Look at that! With two levels of sort priority set, -u deletes no lines. It turns out that -u weighs all the -k keys together and deletes a line only when every key is the same; a difference at any level is enough to keep the line :) (If you don't believe it, add a line Sina 100 4500 and try for yourself.)
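
Here is that Sina experiment as a sketch. The extra Sina line comes from the suggestion above; the variable names are only for illustration.

```shell
data='Google 110 5000
Baidu 100 5000
Guge 50 3000
Sohu 100 4500
Sina 100 4500'

# Sohu and Sina agree on both keys, so -u keeps only one of them.
deduped=$(printf '%s\n' "$data" | LC_ALL=C sort -n -k 2 -k 3 -u)
printf '%s\n' "$deduped"

# Count the surviving lines that have 100 employees and a 4500 salary.
survivors=$(printf '%s\n' "$deduped" | grep -c ' 100 4500$')
echo "lines with 100 and 4500: $survivors"
```

Baidu stays because its salary differs, while one of the two identical-key lines is dropped.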

11. The strangest sort of all:

$ sort -n -k 2.2,3.1 facebook.txt
Guge 50 3000
Baidu 100 5000
Sohu 100 4500
Google 110 5000

This asks sort to use the part from the second character of the second field through the first character of the third field.

You might expect the keys to be 0 3, 00 5, 00 4, and 10 5. However, when no -t is given, each field includes the blanks in front of it, so the second character of the second field is actually its first digit, and the keys are 50, 100, 100, and 110, each followed by a blank.

With -n, sort reads only the leading number of each key, so Guge (50) comes first and Google (110) comes last. But why is Baidu above Sohu when both keys are 100? (You can experiment and think about it yourself.)

Answer: the "cross-field setting" is an illusion. With -n the comparison stops at the end of the number, so the first character of the third field has no effect. When sort finds that 100 and 100 are equal, it falls back to comparing the whole line from the first field, so of course Baidu comes before Sohu. This can be confirmed with an example:

$ sort -n -k 2.2,3.1 -k 1,1r facebook.txt
Guge 50 3000
Sohu 100 4500
Baidu 100 5000
Google 110 5000
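
One more way to convince yourself that the part reaching into the third field makes no difference under -n: end the key at the second field instead and compare the results. This is a sketch assuming GNU sort; the variable names are only for illustration.

```shell
data='Google 110 5000
Baidu 100 5000
Guge 50 3000
Sohu 100 4500'

# Key nominally runs into the first character of field 3.
cross_field=$(printf '%s\n' "$data" | LC_ALL=C sort -n -k 2.2,3.1)
# Key stops at the end of field 2.
one_field=$(printf '%s\n' "$data" | LC_ALL=C sort -n -k 2.2,2)

printf '%s\n' "$cross_field"
[ "$cross_field" = "$one_field" ] && echo "same order"
```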

12. Sometimes you will see something like +1 -2 after the sort command. What is that?

The current sort manual explains this syntax:

On older systems, 'sort' supports an obsolete origin-zero syntax '+POS1 [-POS2]' for specifying sort keys. POSIX 1003.1-2001 (*note Standards conformance::) does not allow this; use '-k' instead.

So this old notation has been retired, and from now on you can look down on any script that still uses it!

(In case you run into such an old script, here is how the notation works: the plus sign marks the Start part and the minus sign marks the End part. Most importantly, this notation counts from 0: what we have been calling the first field is field 0, and what we called the 2nd character is character 1. Got it?)
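
If you need to modernize such a script, the translation can be sketched as follows. The old form is shown only as a comment, because a current sort would most likely treat +1 as a file name; the variable name modern is only for illustration.

```shell
data='Google 110 5000
Baidu 100 5000
Guge 50 3000
Sohu 100 4500'

# Obsolete, counting from 0 (not run here):  sort -n +1 -2
# Modern, counting from 1:                   sort -n -k 2,2
modern=$(printf '%s\n' "$data" | LC_ALL=C sort -n -k 2,2)
printf '%s\n' "$modern"
```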
