File processing tools on Linux systems
A computer holds a great many files, and those files contain a great deal of information. To work efficiently, we often need to extract just the information we need from that mass of data, so this skill is particularly important. Linux provides a variety of text processing tools for the purpose; here is a brief overview.
Viewing file contents: we can use the cat, less, and more commands, among others.
cat
cat [OPTION]... [FILE]...
-E: display $ at the end of each line
-n: number every output line
-A: show all control characters
-b: number non-empty lines only
-s: squeeze consecutive blank lines into a single blank line
The cat command prints the entire contents of the target file; for a very large file, we can use less or more to view it one page at a time.
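As a small illustration of the cat options listed above, here is a sketch that assumes GNU coreutils. The sample file content is made up for the example.

```shell
# Hypothetical sample file: two text lines separated by two blank lines.
tmp=$(mktemp)
printf 'first\n\n\nsecond\n' > "$tmp"

# -n numbers every line, blank ones included.
cat -n "$tmp"

# -b numbers only non-empty lines; -s squeezes the two blank lines into one.
squeezed=$(cat -b -s "$tmp")
printf '%s\n' "$squeezed"

# -E appends $ to the end of every line, which makes blank lines visible.
marked=$(cat -E "$tmp")
printf '%s\n' "$marked"

rm -f "$tmp"
```

Note that the options are case-sensitive: -E and -A are uppercase, while -n, -b, and -s are lowercase.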
less: view a file or stdin output one page at a time
Useful commands while viewing:
/keyword: search for keyword
n/N: jump to the next or previous match
The less command is actually the pager used by the man command.
more: page through files
more [OPTIONS...] FILE...
-d: display prompts explaining how to page forward and how to quit
If we want to display only the first or last n lines of a text:
head
head [OPTION]... [FILE]...
-c #: get the first # bytes
-n #: get the first # lines
-#: shorthand for -n #
tail
tail [OPTION]... [FILE]...
-c #: get the last # bytes
-n #: get the last # lines
-#: shorthand for -n #
-f: follow new content appended to the file; commonly used for log monitoring
If we want to do other work while using the -f option, we can add the & symbol at the end of the command to run it in the background.
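The head and tail options above can be tried on a generated file. This is a minimal sketch; the ten-line sample produced with seq is an assumption made for the example.

```shell
# Hypothetical sample: a file with the numbers 1 to 10, one per line.
tmp=$(mktemp)
seq 1 10 > "$tmp"

first3=$(head -n 3 "$tmp")               # lines 1, 2, 3
last2=$(tail -n 2 "$tmp")                # lines 9, 10
line5=$(head -n 5 "$tmp" | tail -n 1)    # combine both to pick out line 5
first4=$(head -c 4 "$tmp")               # first four bytes: "1\n2\n"

printf '%s\n' "$first3" "$last2" "$line5"

rm -f "$tmp"

# tail -f does not exit on its own, so it is only shown as a comment here.
# "$logfile" is a placeholder for whatever log you want to watch:
#   tail -f "$logfile" &
```

Combining head and tail in a pipeline is a common way to extract a single line or a range from the middle of a file.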
Extracting text by column
cut
cut [OPTION]... [FILE]...
-d DELIMITER: specify the field delimiter; the default is tab
-f FIELDS:
#: the #th field
#,#[,#]: multiple discrete fields, such as 1,3,6
#-#: multiple consecutive fields, such as 1-6
Mixed use: 1-3,7
-c: cut by character
--output-delimiter=STRING: specify the output delimiter
Display a specified column of a file or of stdin data:
cut -d: -f1 /etc/passwd
cat /etc/passwd | cut -d: -f7
cut -c2-5 /usr/share/dict/words
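To see the field selection without depending on the contents of /etc/passwd on a particular machine, here is a sketch that uses two made-up lines in the same colon-separated format. The --output-delimiter option assumes GNU cut.

```shell
# Hypothetical sample data in /etc/passwd format.
sample='root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin'

# Field 1 only: the user names.
names=$(printf '%s\n' "$sample" | cut -d: -f1)

# Mixed selection (field 1 plus fields 6 to 7), joined by a space on output.
mixed=$(printf '%s\n' "$sample" | cut -d: -f1,6-7 --output-delimiter=' ')

# Characters 1 to 4 of every line.
chars=$(printf '%s\n' "$sample" | cut -c1-4)

printf '%s\n' "$names" "$mixed" "$chars"
```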
paste: merge the lines with the same line number from two files into one line
paste [OPTION]... [FILE]...
-d DELIMITER: specify the delimiter; the default is tab
-s: merge all lines of each file into a single line
paste f1 f2
paste -s f1 f2
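The difference between the default mode and -s is easiest to see side by side. A minimal sketch follows; the file names f1 and f2 follow the text above, and their contents are made up.

```shell
dir=$(mktemp -d)
printf 'a\nb\nc\n' > "$dir/f1"
printf '1\n2\n3\n' > "$dir/f2"

# Default mode: line N of f1 is joined with line N of f2.
side=$(paste -d: "$dir/f1" "$dir/f2")

# -s: each file is collapsed into one line of its own.
serial=$(paste -s -d, "$dir/f1" "$dir/f2")

printf '%s\n' "$side" "$serial"

rm -rf "$dir"
```

In the first case the output has three lines (a:1, b:2, c:3); in the second it has two (a,b,c and 1,2,3).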
Sometimes we need to analyze the text
wc: collect text statistics
Counts the total number of lines, words, bytes, and characters
Can run on data in a file or from stdin
Example: $ wc story.txt
237 1901 story.txt
Output columns, in order: number of lines, number of words, number of bytes
Use -l to count only the number of lines
Use -w to count only the total number of words
Use -c to count only the total number of bytes
Use -m to count only the number of characters
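The individual wc counters can be checked on a short piece of text fed through stdin. This is a sketch; the two-line sample is made up for the example.

```shell
# Hypothetical sample: two lines, five words.
text='one two three
four five'

lines=$(printf '%s\n' "$text" | wc -l)   # 2 lines
words=$(printf '%s\n' "$text" | wc -w)   # 5 words
bytes=$(printf '%s\n' "$text" | wc -c)   # 24 bytes (14 + 10, newlines included)

echo "lines=$lines words=$words bytes=$bytes"
```

For plain ASCII text -c and -m report the same number; they differ only when the text contains multibyte characters.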
sort: sort text
Writes the sorted text to stdout without changing the original file
$ sort [options] file(s)
-r: sort in reverse (descending) order
-n: sort by numeric value
-f: ignore (fold) character case when comparing strings
-u: (unique) delete duplicate lines from the output
-t c: use c as the field delimiter
-k X: sort by field X, where fields are delimited by the character c given with -t; can be used multiple times
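The sort options above combine as in the following sketch. The name:number records are made up, and LC_ALL=C is set only to make the ordering independent of the local language settings.

```shell
# Hypothetical records in name:number form, with one duplicate line.
data='carol:30
alice:4
bob:25
alice:4'

# Plain sort with -u: alphabetical order, duplicate line removed.
by_name=$(printf '%s\n' "$data" | LC_ALL=C sort -u)

# Sort on field 2 (delimiter ":"), numerically, largest first.
by_num_desc=$(printf '%s\n' "$data" | LC_ALL=C sort -t: -k2,2 -n -r)

# Without -n, numbers are compared as text, so 10 sorts before 2.
as_text=$(printf '10\n9\n2\n' | LC_ALL=C sort)
as_number=$(printf '10\n9\n2\n' | LC_ALL=C sort -n)

printf '%s\n' "$by_name" "$by_num_desc" "$as_text" "$as_number"
```

Writing the key as -k2,2 limits it to field 2 alone; a bare -k2 would extend the key from field 2 to the end of the line.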
uniq: remove adjacent duplicate lines from the input
uniq [OPTION]... [FILE]...
-c: show the number of repetitions of each line;
-d: show only the lines that are repeated;
-u: show only the lines that are not repeated;
Only consecutive, exactly identical lines count as duplicates
Commonly used together with the sort command:
sort userlist.txt | uniq -c
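Because uniq only compares neighbouring lines, its result depends on whether the input is sorted first. A minimal sketch with a made-up user list:

```shell
# Hypothetical user list; "bob" appears three times, two of them adjacent.
users='bob
alice
bob
bob
carol'

# Unsorted input: only the adjacent pair of "bob" lines is collapsed.
adjacent=$(printf '%s\n' "$users" | uniq)

# Sorted input: every duplicate becomes adjacent, so counts are complete.
# awk is used only to normalise the padded "count line" output of uniq -c.
counts=$(printf '%s\n' "$users" | LC_ALL=C sort | uniq -c | awk '{print $2 "=" $1}')

repeated=$(printf '%s\n' "$users" | LC_ALL=C sort | uniq -d)   # lines seen more than once
single=$(printf '%s\n' "$users" | LC_ALL=C sort | uniq -u)     # lines seen exactly once

printf '%s\n' "$adjacent" "$counts" "$repeated" "$single"
```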
The three musketeers of text processing on Linux:
grep, egrep, fgrep: text filtering tools (filtering by pattern);
grep: basic regular expressions by default; also accepts -E and -F
egrep: extended regular expressions; also accepts -G and -F
fgrep: does not support regular expressions (the pattern is a fixed string)
sed: stream editor, a text editing tool;
awk: the implementation on Linux is gawk, a text report generator (it formats text);
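The practical difference between regular-expression matching and fixed-string matching shows up as soon as the pattern contains a metacharacter. A small sketch, using grep -F in place of fgrep and made-up input:

```shell
# Hypothetical input: one line with a literal dot, one without.
sample='a.c
abc'

# As a regular expression, "." matches any single character: both lines match.
as_regex=$(printf '%s\n' "$sample" | grep 'a.c')

# As a fixed string (-F), "." is just a dot: only the first line matches.
as_fixed=$(printf '%s\n' "$sample" | grep -F 'a.c')

printf '%s\n' "$as_regex" "$as_fixed"
```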
grep: Global search REgular expression and Print out the line.
Role: a text search tool that checks the target file line by line against a user-specified pattern and prints the matching lines.
Pattern: a filter condition written with regular expression characters and text characters
Regular expression (regexp): a pattern written with special characters and text characters, some of which do not stand for their literal meaning but act as control or wildcard functions.
Regular expressions are divided into two categories:
Basic Regular Expressions: BRE
Extended Regular expression: ERE
grep:
grep [OPTIONS] PATTERN [FILE ...]
OPTIONS:
--color=auto: highlight the matching text in color;
-i, --ignore-case: ignore the case of characters;
-o: display only the matched string itself;
-v, --invert-match: display the lines that do not match the pattern;
-E: support extended regular expression metacharacters;
-q, --quiet, --silent: silent mode; output nothing;
-A #: after; also display the # lines after each match
-B #: before; also display the # lines before each match
-C #: context; also display the # lines before and after each match
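The grep options above can be exercised on a few lines of made-up log text. This is a sketch, not taken from any real log; -c (count matching lines) is used only to keep the output short.

```shell
# Hypothetical log excerpt.
log='INFO start
error: disk full
INFO retry
Error: timeout
INFO done'

hits=$(printf '%s\n' "$log" | grep -i 'error')            # both error lines, any case
others=$(printf '%s\n' "$log" | grep -c -v -i 'error')    # number of non-matching lines
only=$(printf '%s\n' "$log" | grep -o 'INFO [a-z]*')      # just the matched text
after=$(printf '%s\n' "$log" | grep -A 1 'disk')          # the match plus one line after

# -q prints nothing; the result is read from the exit status instead.
if printf '%s\n' "$log" | grep -q 'timeout'; then found=yes; else found=no; fi

printf '%s\n' "$hits" "$others" "$only" "$after" "$found"
```

The -q form is the usual choice inside if statements in scripts, where only the yes/no answer matters.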
Speaking of which, we really need to talk about regular expressions.
Regular expression: a pattern written with special characters and text characters, in which some characters (metacharacters) do not stand for their literal meaning but instead act as control or wildcard functions.
Basic regular expression metacharacters:
Character matching:
.: matches any single character
[]: matches any single character within the specified range;
[^]: matches any single character outside the specified range;
[:digit:] [:lower:] [:upper:] [:alpha:] [:alnum:] [:punct:] [:space:]
Number of matches: a quantifier is placed after the character whose number of occurrences it limits; by default it works in greedy mode
*: matches the preceding character any number of times: 0, 1, or more;
.*: matches any characters of any length
\?: matches the preceding character 0 or 1 times; that is, the preceding character is optional;
\+: matches the preceding character 1 or more times; that is, the preceding character must appear at least once;
\{m\}: matches the preceding character m times;
\{0,n\}: at most n times
\{m,\}: at least m times
Position anchoring:
^: anchors at the beginning of the line; placed at the leftmost side of the pattern;
$: anchors at the end of the line; placed at the rightmost side of the pattern;
^PATTERN$: PATTERN must match the whole line;
^$: an empty line
^[[:space:]]*$: an empty line or a line containing only whitespace characters;
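The character classes, quantifiers, and line anchors above can be tried with grep in its default (basic) mode. This is a sketch with made-up input; note that \? and \+ are GNU grep extensions to basic regular expressions.

```shell
# Hypothetical word list for the quantifier examples.
words='color
colour
colouur'

optional=$(printf '%s\n' "$words" | grep 'colou\?r')      # u zero or one time
at_least=$(printf '%s\n' "$words" | grep 'colou\+r')      # u one or more times
exactly=$(printf '%s\n' "$words" | grep 'colou\{2\}r')    # u exactly twice

# Character class inside a bracket expression: lines containing a digit.
digits=$(printf 'a1\nb\n' | grep '[[:digit:]]')

# Line anchors: count lines that are empty or contain only whitespace.
blank=$(printf 'a\n\n  \nb\n' | grep -c '^[[:space:]]*$')

printf '%s\n' "$optional" "$at_least" "$exactly" "$digits" "$blank"
```

The class names such as [:digit:] are only valid inside a bracket expression, which is why they appear with double brackets.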
Word: a continuous string of non-special characters is called a word;
\< or \b: anchors at the beginning of a word; placed on the left side of the word pattern;
\> or \b: anchors at the end of a word; placed on the right side of the word pattern;
\<PATTERN\>: matches a complete word;
Grouping and referencing
\(\): binds one or more characters together so that they are treated as a whole;
Note: the content matched by the pattern inside grouping parentheses is automatically recorded by the regular expression engine in internal variables, which are:
\1: the characters matched by the pattern between the first opening parenthesis, counting from the left, and its matching closing parenthesis;
\2: the characters matched by the pattern between the second opening parenthesis, counting from the left, and its matching closing parenthesis;
\3
...
Example: \(string1\+\(string2\)*\)
\1: string1\+\(string2\)*
\2: string2
Back reference: refers to the characters matched by the pattern in an earlier group (not to the pattern itself)
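The point that a back reference repeats the matched characters, not the pattern, can be checked with four made-up sentences. A minimal sketch using grep in basic mode:

```shell
# Hypothetical sample sentences.
text='He loves his lover.
He likes his lover.
She likes her liker.
She loves her liker.'

# \1 must repeat exactly what the group matched: love...love or like...like.
same=$(printf '%s\n' "$text" | grep '\(l..e\).*\1')

# Writing the pattern twice instead accepts any combination, so all lines match.
any=$(printf '%s\n' "$text" | grep -c 'l..e.*l..e')

printf '%s\n' "$same" "$any"
```

Only the first and third sentences repeat the same four letters, so only they are printed by the first command.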
Extended regular expressions:
egrep = grep -E
egrep [OPTIONS] PATTERN [FILE ...]
Extended regular expression metacharacters:
Character Matching:
. Any single character
[] Specify the range of characters
[^] characters not in the specified range
Number of matches:
*: matches the preceding character any number of times
?: 0 or 1 times
+: 1 or more times
{m}: matches exactly m times
{m,n}: at least m and at most n times
Position anchoring:
^: beginning of the line
$: end of the line
\<, \b: beginning of a word
\>, \b: end of a word
Grouping:
()
Back reference: \1, \2, ...
Alternation (or):
a|b
C|cat: C or cat
(C|c)at: Cat or cat
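The alternation examples above, together with the unescaped quantifiers and the word anchors, can be sketched with grep -E. The input is made up, -x (match the whole line) is added so that each result shows exactly what the pattern accepts, and the word anchors assume GNU grep.

```shell
# Hypothetical input lines.
animals='Cat
cat
C
bat'

# (C|c)at: the alternation covers only the first letter.
first_letter=$(printf '%s\n' "$animals" | grep -E -x '(C|c)at')

# C|cat: without parentheses the alternation covers the whole pattern.
c_or_cat=$(printf '%s\n' "$animals" | grep -E -x 'C|cat')

# {m,n} needs no backslashes in extended mode: one or two a, then b.
bounded=$(printf 'ab\naab\naaaab\n' | grep -E -x 'a{1,2}b')

# Word anchors: "cat" as a complete word, not as part of "concatenate".
whole_word=$(printf 'cat\nconcatenate\n' | grep -E '\<cat\>')

printf '%s\n' "$first_letter" "$c_or_cat" "$bounded" "$whole_word"
```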
This article is from the "11798474" blog; please keep this source attribution: http://11808474.blog.51cto.com/11798474/1834614