Text-processing tools on Linux systems

Source: Internet
Author: User


Computers hold a great many files, and those files contain a great deal of information. To work efficiently we often need to pull just the information we want out of that mass of data, so text-extraction skills are particularly important. Linux provides a variety of text-processing tools for this; here is a brief overview.


Viewing file contents: we can use the cat, less, and more commands, among others.

cat

cat [OPTION]... [FILE]...

-E: display a $ at the end of each line

-n: number all output lines

-A: show all control characters

-b: number only non-empty lines

-s: squeeze consecutive blank lines into a single one
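The options above can be tried on a small throwaway file. This is a minimal sketch that assumes GNU coreutils (the -A option is not available in every cat implementation); the sample file and the variable names are made up for illustration.

```shell
# Build a sample file: two text lines separated by two blank lines.
sample=$(mktemp)
printf 'alpha\n\n\nbeta\tgamma\n' > "$sample"

# -s squeezes the run of blank lines into one; -n numbers every output line.
squeezed=$(cat -s -n "$sample")
printf '%s\n' "$squeezed"

# -A makes tabs (shown as ^I) and line ends (shown as $) visible.
visible=$(cat -A "$sample")
printf '%s\n' "$visible"

rm -f "$sample"
```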


The cat command prints the entire contents of the target file; for a very large file, less or more is a better way to view it.

less: view a file or stdin output one page at a time

Useful commands while viewing:

/keyword: search for keyword

n/N: jump to the next or previous match

less is also the pager that the man command uses on most Linux systems.


more: page through files

more [OPTIONS...] FILE...

-d: display a prompt explaining how to page forward and how to quit


To display only the first or last n lines of a text:

head

head [OPTION]... [FILE]...

-c #: print the first # bytes

-n #: print the first # lines

-#: shorthand for -n #
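As a quick illustration of the head options, here is a small sketch assuming GNU coreutils (seq and the -# shorthand); the temporary file and variable names are hypothetical.

```shell
# Generate ten numbered lines to work with.
data=$(mktemp)
seq 1 10 > "$data"

first_three=$(head -n 3 "$data")   # lines 1, 2 and 3
first_two=$(head -2 "$data")       # -2 is shorthand for -n 2
first_bytes=$(head -c 4 "$data")   # the 4 bytes "1", newline, "2", newline

printf '%s\n' "$first_three"
rm -f "$data"
```

Note that command substitution strips trailing newlines, so first_bytes ends up holding only "1", a newline, and "2".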


tail

tail [OPTION]... [FILE]...

-c #: print the last # bytes

-n #: print the last # lines

-#: shorthand for -n #

-f: follow the file and print new content as it is appended; commonly used for log monitoring

To do other work while tail -f is running, append & to the command to run it in the background.
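A short sketch of tail, again assuming GNU coreutils and a made-up temporary file. The tail -f line is left as a comment because it runs until it is stopped.

```shell
log=$(mktemp)
seq 1 10 > "$log"

last_two=$(tail -n 2 "$log")             # lines 9 and 10
last_bytes=$(tail -c 3 "$log")           # the final 3 bytes: "10" plus newline
middle=$(head -n 6 "$log" | tail -n 3)   # head and tail combined: lines 4 to 6

printf '%s\n' "$middle"

# Following a growing file in the background would look like this:
#   tail -f "$log" &
#   tail_pid=$!        # remember the background process ID
#   kill "$tail_pid"   # stop following when done

rm -f "$log"
```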



Extracting text by column

cut

cut [OPTION]... [FILE]...

-d DELIMITER: specify the field delimiter; the default is tab

-f FIELDS: select fields

#: the #th field

#,#[,#]: several discrete fields, such as 1,3,6

#-#: a range of consecutive fields, such as 1-6

Mixed: 1-3,7

-c: cut by character position

--output-delimiter=STRING: specify the output delimiter

Display specified columns of a file or of stdin data:

cut -d: -f1 /etc/passwd

cat /etc/passwd | cut -d: -f7

cut -c2-5 /usr/share/dict/words
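Because /etc/passwd differs from machine to machine, the sketch below uses two made-up records in the same colon-separated layout. It assumes GNU cut, since --output-delimiter is a GNU option; the user names are invented.

```shell
# Two hypothetical records in /etc/passwd format.
accounts=$(mktemp)
printf 'alice:x:1000:1000:Alice:/home/alice:/bin/bash\n' > "$accounts"
printf 'bob:x:1001:1001:Bob:/home/bob:/bin/sh\n' >> "$accounts"

names=$(cut -d: -f1 "$accounts")                                   # field 1 only
name_shell=$(cut -d: -f1,7 --output-delimiter=' -> ' "$accounts")  # fields 1 and 7
chars=$(cut -c2-5 "$accounts")                                     # characters 2 to 5

printf '%s\n' "$name_shell"
rm -f "$accounts"
```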


paste: merge the lines with the same line number from two files into a single line

paste [OPTION]... [FILE]...

-d DELIMITER: specify the delimiter; the default is tab

-s: join all lines of each file into a single line

paste f1 f2

paste -s f1 f2
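The difference between the default mode and -s is easiest to see side by side. A minimal sketch with two made-up three-line files:

```shell
f1=$(mktemp)
f2=$(mktemp)
printf 'a\nb\nc\n' > "$f1"
printf '1\n2\n3\n' > "$f2"

# Default mode: line N of f1 is joined with line N of f2.
side_by_side=$(paste -d: "$f1" "$f2")

# -s: each file is collapsed into one output line of its own.
serial=$(paste -s -d, "$f1" "$f2")

printf '%s\n' "$side_by_side" "$serial"
rm -f "$f1" "$f2"
```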


Sometimes we need to analyze text.

wc: collect text statistics

Counts the total number of lines, words, bytes, and characters

Works on data in a file or from stdin

Example: $ wc story.txt

237 1901 story.txt

By default wc prints the line count, word count, and byte count, followed by the file name.

Use -l to count only lines

Use -w to count only words

Use -c to count only bytes

Use -m to count only characters
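To check the individual counters, the sketch below runs wc on a tiny made-up file. Reading from stdin (with <) keeps the file name out of the output, which makes the numbers easy to capture.

```shell
story=$(mktemp)
printf 'one two three\nfour five\n' > "$story"

lines=$(wc -l < "$story")   # 2 lines
words=$(wc -w < "$story")   # 5 words
bytes=$(wc -c < "$story")   # 24 bytes (14 + 10, newlines included)
chars=$(wc -m < "$story")   # same as bytes here because the text is pure ASCII

echo "lines=$lines words=$words bytes=$bytes chars=$chars"
rm -f "$story"
```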


sort: sort text

Writes the sorted text to stdout without changing the original file

$ sort [options] file(s)

-r: sort in reverse (descending) order

-n: sort by numeric value

-f: fold case, ignoring the case of characters when comparing

-u: unique, delete duplicate lines from the output

-t c: use c as the field delimiter

-k X: sort by column X of the fields delimited by c; can be given multiple times
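The sketch below combines several of these options on made-up name:score records. It uses the key form -kX,Xnr, where the n and r modifiers apply to that key only, which is standard sort syntax but goes slightly beyond the plain -k X shown above.

```shell
# Without -n, numbers are compared as text, so 100 sorts before 9.
as_text=$(printf '10\n9\n100\n' | sort)
as_numbers=$(printf '10\n9\n100\n' | sort -n)

scores=$(mktemp)
printf 'carol:9\nalice:10\nbob:9\nalice:10\n' > "$scores"

# Field 2 numeric descending, ties broken by field 1; -u drops the repeated record.
by_score=$(sort -t: -k2,2nr -k1,1 -u "$scores")

printf '%s\n' "$by_score"
rm -f "$scores"
```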


uniq: remove adjacent duplicate lines from the input

uniq [OPTION]... [FILE]...

-c: show the number of times each line is repeated

-d: show only the lines that are repeated

-u: show only the lines that are not repeated

Only lines that are consecutive and exactly identical count as duplicates.

Commonly used together with the sort command:

sort userlist.txt | uniq -c
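The following sketch shows why uniq is usually paired with sort: on unsorted input the repeated names are not adjacent, so nothing is removed. The user list is made up for illustration.

```shell
userlist=$(mktemp)
printf 'root\nalice\nroot\nbob\nalice\nroot\n' > "$userlist"

# No two identical lines are adjacent, so uniq alone keeps all six lines.
unsorted_lines=$(uniq "$userlist" | wc -l)

counts=$(sort "$userlist" | uniq -c)    # occurrences per name
repeated=$(sort "$userlist" | uniq -d)  # names seen more than once
single=$(sort "$userlist" | uniq -u)    # names seen exactly once

printf '%s\n' "$counts"
rm -f "$userlist"
```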


The three musketeers of text processing on Linux:

grep, egrep, fgrep: text filtering tools (filter by pattern);

grep: basic regular expressions by default; -E, -F

egrep: extended regular expressions; -G, -F

fgrep: regular expressions are not supported (fixed strings only)

sed: stream editor, a text editing tool;

awk: the implementation on Linux is gawk, a text report generator (formatted text output);


grep: Global search REgular expression and Print out the line.

Purpose: a text search tool that checks the target file line by line against a user-specified pattern and prints the lines that match.

Pattern: a filter condition written with regular expression characters and literal text characters.

Regular expression (regexp): a pattern written with special characters and literal text characters, in which some characters do not stand for themselves but act as control or wildcard characters.


Regular expressions are divided into two categories:

Basic regular expressions: BRE

Extended regular expressions: ERE


grep:

grep [OPTIONS] PATTERN [FILE...]


OPTIONS:

--color=auto: highlight the matched text in color;

-i, --ignore-case: ignore the case of characters;

-o: display only the matched part of each line;

-v, --invert-match: display the lines that the pattern does not match;

-E: enable extended regular expression metacharacters;

-q, --quiet, --silent: quiet mode; print nothing (the result is available from the exit status);


-A #: after, also print the # lines after each match

-B #: before, also print the # lines before each match

-C #: context, also print the # lines before and after each match
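A small sketch of these options on a made-up log file; the messages and variable names are invented for illustration.

```shell
notes=$(mktemp)
printf 'Error: disk full\nwarning: low memory\nerror: timeout\nok\n' > "$notes"

ignore_case=$(grep -i 'error' "$notes")    # matches both "Error" and "error"
inverted=$(grep -v -i 'error' "$notes")    # lines without "error" in any case
only_match=$(grep -o -i 'error' "$notes")  # just the matched text, one per line
with_after=$(grep -A 1 'warning' "$notes") # the match plus the next line

# -q prints nothing; the exit status says whether anything matched.
if grep -q 'ok' "$notes"; then found=yes; else found=no; fi

printf '%s\n' "$with_after"
rm -f "$notes"
```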


With that, we really need to talk about regular expressions.

Regular expression: a pattern written with special characters and literal text characters, in which some characters (metacharacters) do not stand for themselves but act as control or wildcard characters.

Basic regular expression metacharacters:

Character matching

.: matches any single character

[]: matches any single character within the specified set;

[^]: matches any single character outside the specified set;

Character classes, used inside brackets (for example [[:digit:]]): [:digit:] [:lower:] [:upper:] [:alpha:] [:alnum:] [:punct:] [:space:]


Quantifiers: placed after a character to limit how many times that preceding character may occur; matching is greedy by default

*: matches the preceding character any number of times: 0, 1, or more;

.*: matches any characters of any length

\?: matches the preceding character 0 or 1 times; that is, the preceding character is optional;

\+: matches the preceding character 1 or more times; that is, it must appear at least once;

\{m\}: matches the preceding character exactly m times;

\{m,n\}: at least m and at most n times;

\{0,n\}: at most n times

\{m,\}: at least m times
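The quantifiers can be compared on a few made-up words that differ only in how many times the letter a occurs. This sketch assumes GNU grep, where \? and \+ are available in basic regular expressions as extensions; the ^ and $ anchors force the pattern to cover the whole line.

```shell
words=$(printf 'ct\ncat\ncaat\ncaaat\n')

star=$(printf '%s\n' "$words" | grep '^ca*t$')               # zero or more a
optional=$(printf '%s\n' "$words" | grep '^ca\?t$')          # zero or one a
plus=$(printf '%s\n' "$words" | grep '^ca\+t$')              # one or more a
exactly_two=$(printf '%s\n' "$words" | grep '^ca\{2\}t$')    # exactly two a
at_least_two=$(printf '%s\n' "$words" | grep '^ca\{2,\}t$')  # two or more a

printf '%s\n' "$plus"
```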


Position anchors:

^: start-of-line anchor; placed at the leftmost end of the pattern;

$: end-of-line anchor; placed at the rightmost end of the pattern;

^PATTERN$: PATTERN must match the whole line;

^$: an empty line

^[[:space:]]*$: an empty line or a line containing only whitespace characters;
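A common use of these anchors is finding or removing blank lines. The sketch below uses a made-up four-line file and grep's -c option, which prints the number of matching lines instead of the lines themselves.

```shell
text=$(mktemp)
# Line 2 is truly empty; line 3 contains only three spaces.
printf 'first\n\n   \nlast line\n' > "$text"

empty=$(grep -c '^$' "$text")               # truly empty lines only
blank=$(grep -c '^[[:space:]]*$' "$text")   # empty or whitespace-only lines
content=$(grep -v '^[[:space:]]*$' "$text") # everything that is not blank

printf '%s\n' "$content"
rm -f "$text"
```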


Word: a run of consecutive non-special characters (letters, digits, and underscores) is called a word;


\< or \b: start-of-word anchor; placed on the left side of the word pattern;

\> or \b: end-of-word anchor; placed on the right side of the word pattern;

\<PATTERN\>: matches a complete word;
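To see what the word anchors exclude, the sketch below searches three made-up lines for cat. It assumes GNU grep, where \< and \> are supported.

```shell
lines=$(printf 'the cat sat\nconcatenate\ncats\n')

anywhere=$(printf '%s\n' "$lines" | grep 'cat')        # substring match: all three lines
whole_word=$(printf '%s\n' "$lines" | grep '\<cat\>')  # cat as a complete word
word_start=$(printf '%s\n' "$lines" | grep '\<cat')    # words that begin with cat

printf '%s\n' "$whole_word"
```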


Grouping and back references

\(\): binds one or more characters together so they are treated as a unit;


Note: the text matched by the pattern inside each pair of grouping parentheses is recorded automatically by the regular expression engine in internal variables, which are named:

\1: the text matched by the pattern between the first opening parenthesis, counting from the left, and its matching closing parenthesis;

\2: the text matched by the pattern between the second opening parenthesis, counting from the left, and its matching closing parenthesis;

\3

...


Example: \(string1\+\(string2\)*\)

\1: string1\+\(string2\)*

\2: string2

Back reference: refers to the text matched by the pattern in an earlier group (not to the pattern itself)
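The point that a back reference repeats the matched text, not the pattern, can be checked with a small made-up sample. The group \(l..e\) can match either "like" or "love", but \1 then requires the same four characters again later on the line.

```shell
sample=$(printf 'he likes his liker\nshe loves her lover\nhe likes her lover\n')

# Lines 1 and 2 repeat the captured text (like...like, love...love).
# Line 3 has like followed by love, so \1 cannot be satisfied.
same_text=$(printf '%s\n' "$sample" | grep '\(l..e\).*\1')

# Repeating the pattern instead of the captured text matches all three lines.
same_pattern=$(printf '%s\n' "$sample" | grep 'l..e.*l..e')

printf '%s\n' "$same_text"
```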


Extended regular expressions:

egrep = grep -E

egrep [OPTIONS] PATTERN [FILE...]

Extended regular expression metacharacters:


Character matching:

.: any single character

[]: any single character within the specified set

[^]: any single character outside the specified set


Quantifiers:

*: matches the preceding character any number of times

?: 0 or 1 times

+: 1 or more times

{m}: exactly m times

{m,n}: at least m and at most n times


Position anchors:

^: start of line

$: end of line

\<, \b: start of word

\>, \b: end of word


Grouping:

()

Back references: \1, \2, ...


Alternation (or):

a|b

C|cat: C or cat

(C|c)at: Cat or cat
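The effect of the parentheses on alternation can be verified with grep -E on a few made-up lines. The patterns are anchored with ^ and $ so that each one must account for the whole line.

```shell
animals=$(printf 'Cat\ncat\nC\nbat\n')

# Without grouping around C|c, the alternatives are "C" and "cat".
either=$(printf '%s\n' "$animals" | grep -E '^(C|cat)$')

# With grouping, only the first letter varies: "Cat" or "cat".
grouped=$(printf '%s\n' "$animals" | grep -E '^(C|c)at$')

# Unanchored, C|cat matches any line containing "C" or "cat".
unanchored_count=$(printf '%s\n' "$animals" | grep -E -c 'C|cat')

printf '%s\n' "$grouped"
```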


This article is from the "11798474" blog, please be sure to keep this source http://11808474.blog.51cto.com/11798474/1834614
