There is a lot of fun to be had on the Linux command line, and many tedious tasks can be done easily and elegantly. For example, calculating how often words and characters appear in a text file, which is what we intend to cover in this article.
The command that immediately comes to mind for counting words and characters in a text file is wc.
Before we can parse a text file with a script, we need a text file. To keep the results consistent, we will create a text file from the output of the man command, as described below.
The code is as follows:
$ man man > man.txt
The above command dumps the man page of the man command itself into the file man.txt.
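Since wc was the first command that came to mind, we can use it right away for a quick sanity check on the new file (the exact line, word, and byte counts will vary with your version of man):
The code is as follows:
$ wc man.txt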
To get the most common words, we run the following script against the newly created file.
The code is as follows:
$ cat man.txt | tr ' ' '\012' | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | grep -v '[^a-z]' | sort | uniq -c | sort -rn | head
Sample Output
7557
262 the
163 to
112 is
112 a
...
The script above outputs the 10 most frequently used words.
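Stage by stage, here is the same pipeline split over several lines, with a comment on what each step does (bash allows a newline after each |):
The code is as follows:
$ cat man.txt |                    # read the file
  tr ' ' '\012' |                  # turn each space into a newline (\012 is octal for newline): one word per line
  tr '[:upper:]' '[:lower:]' |     # fold everything to lowercase so 'The' and 'the' are counted together
  tr -d '[:punct:]' |              # strip punctuation stuck to the words
  grep -v '[^a-z]' |               # drop lines containing anything but lowercase letters (empty lines still pass, hence the large count on top)
  sort |                           # group identical words next to each other for uniq
  uniq -c |                        # collapse each group into 'count word'
  sort -rn |                       # order numerically, highest count first
  head                             # keep the top 10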
What about looking at single letters? Use the following command.
The code is as follows:
$ echo 'tecmint team' | fold -w1
Sample Output
t
e
c
m
i
n
t
t
e
a
m
Note: -w1 simply sets the column width to one character.
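If fold happens to be unavailable, GNU grep can produce a similar one-character-per-line split with its -o option (a near equivalent, not an exact one: unlike fold, it skips empty lines):
The code is as follows:
$ echo 'tecmint team' | grep -o '.'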
Now we'll break the text file down into single letters, then sort and count the results to get the 10 most common characters.
$ fold -w1 < man.txt | sort | uniq -c | sort -rn | head
Sample Output
8579
2413 e
1987 a
1875 t
1644 i
1553 n
1522 o
1514 s
1224 r
1021 l
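The large number on the first line is the whitespace bucket, counted just like any other character. If you would rather leave it out, here is one sketch of a variant that deletes blank and space-only lines with sed before counting:
The code is as follows:
$ fold -w1 < man.txt | sed '/^[[:space:]]*$/d' | sort | uniq -c | sort -rn | head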
What if we want to ignore case? Until now, uppercase and lowercase letters have been counted separately. So, use the following command.
$ fold -w1 < man.txt | sort | tr '[:lower:]' '[:upper:]' | uniq -c | sort -rn | head -20
Sample Output
11636
2504 E
2079 A
... T
1729 I
1645 N
1632 S
1580 O
1269 R
1055 L
836 H
791 P
766 D
753 C
725 M
690 U
605 F
504 G
352 Y
344 .
Check the output above: punctuation is included. Let's get rid of it with the tr command. Here we go:
The code is as follows:
$ fold -w1 < man.txt | tr '[:lower:]' '[:upper:]' | sort | tr -d '[:punct:]' | uniq -c | sort -rn | head -20
Sample Output
11636
2504 E
2079 A
... T
1729 I
1645 N
1632 S
1580 O
1550
1269 R
1055 L
836 H
791 P
766 D
753 C
725 M
690 U
605 F
504 G
352 Y
Now, if we have three text files, we can look at the results for all of them at once with the following command.
The code is as follows:
$ cat *.txt | fold -w1 | tr '[:lower:]' '[:upper:]' | sort | tr -d '[:punct:]' | uniq -c | sort -rn | head -8
Sample Output
11636
2504 E
2079 A
... T
1729 I
1645 N
1632 S
1580 O
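This assumes the current directory actually contains several .txt files. If man.txt is still the only one, more can be created the same way first; ls and cp here are just example pages picked for illustration:
The code is as follows:
$ man ls > ls.txt
$ man cp > cp.txt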
Next we'll dig out the rare words that are at least ten characters long. Here's a simple script:
The code is as follows:
$ cat man.txt | tr ' ' '\012' | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | tr -d '[0-9]' | sort | uniq -c | sort -n | grep -E '..........' | head
Sample Output
1 ──────────────────────────────────────────
1 a all
1 abc or all arguments within are optional
1 able setlocale for precise details
1 ab options delimited by cannot be used together
1 achieved by using the less environment variable
1 a child process returned a nonzero exit status
1 act as if this option is supplied using the name as a filename
1 activate local mode format and display local manual files
1 acute accent
Note: the ten dots in the grep pattern above match any ten consecutive characters; in fact, the regular expression '.{10}' achieves the same effect.
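For reference, here is the same script with the brace form substituted for the ten dots; -E switches grep to the extended regular expression syntax in which '.{10}' is valid:
The code is as follows:
$ cat man.txt | tr ' ' '\012' | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | tr -d '[0-9]' | sort | uniq -c | sort -n | grep -E '.{10}' | head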
These simple scripts let us discover the most frequently occurring words and characters in English text.
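If you find yourself running these pipelines often, they are easy to bundle into a small script. Below is a minimal sketch; the file name wordfreq.sh and its argument handling are our own additions, not part of the commands shown above:
The code is as follows:
#!/bin/sh
# wordfreq.sh -- print the top N words and characters of a text file
# usage: ./wordfreq.sh FILE [N]    e.g.  ./wordfreq.sh man.txt 5
file=${1:?usage: $0 FILE [N]}
n=${2:-10}

echo "== top $n words =="
tr ' ' '\012' < "$file" | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' \
    | grep -v '[^a-z]' | sort | uniq -c | sort -rn | head -n "$n"

echo "== top $n characters =="
fold -w1 < "$file" | sort | uniq -c | sort -rn | head -n "$n"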
That's all for now. Next time I'll be back with another interesting topic that you should enjoy reading. And don't forget to leave us your valuable feedback.