Text-processing tools on Linux systems


File processing tools on Linux systems

Computers hold a great many files, and those files contain a great deal of information. To work efficiently we often need to extract just the information we need from that mass of data, so this skill is particularly important. Linux provides a variety of text-processing tools for the job; this article briefly introduces them.

Viewing file contents: we can use the cat, less, and more commands, among others.

cat

cat [OPTION]... [FILE]...

-E: display a $ at the end of each line

-n: number every displayed line

-A: show all control characters

-b: number only non-empty lines

-s: squeeze consecutive blank lines into a single blank line

The cat command prints the entire contents of the target file, and if we encounter a very large file, we can use less and more to view it.
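As a minimal sketch of the cat options above, assuming GNU coreutils and a small throwaway file created with mktemp (the file contents are made up for illustration):

```shell
# Build a small sample file in a temporary location (hypothetical data).
sample=$(mktemp)
printf 'first\n\n\n\nsecond\n' > "$sample"

# -s squeezes the run of three blank lines into a single blank line.
squeezed=$(cat -s "$sample")
printf '%s\n' "$squeezed"

# -b numbers only the non-empty lines; -n would number every line.
numbered=$(cat -b "$sample")
printf '%s\n' "$numbered"

# -E marks each line end with $, which makes trailing whitespace visible.
cat -E "$sample"

rm -f "$sample"
```

Comparing the output of -b with that of -n on the same file is a quick way to see the difference between the two numbering modes.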

less: view a file or stdin output one page at a time

Useful commands while viewing:

/keyword: search for keyword

n/N: jump to the next or previous match

On most systems, less is also the pager used by the man command.

more: page through files

more [OPTIONS...] FILE...

-d: show prompts for turning pages and quitting

To display only the first or last n lines of a text, use head and tail.

head

head [OPTION]... [FILE]...

-c #: get the first # bytes

-n #: get the first # lines

-#: specify the number of lines directly (same as -n #)

tail

tail [OPTION]... [FILE]...

-c #: get the last # bytes

-n #: get the last # lines

-#: specify the number of lines directly (same as -n #)

-f: follow the file and display newly appended content; commonly used for log monitoring

If we want to do other work in the same shell while using the -f option, we can add the & symbol at the end of the command to run it in the background.
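The head and tail options can be tried on a ten-line file. This is only a sketch: the file is a temporary one generated with seq, and the variable names are chosen for illustration.

```shell
log=$(mktemp)
seq 1 10 > "$log"                   # ten lines containing 1..10

first_three=$(head -n 3 "$log")     # lines 1, 2, 3
last_two=$(tail -n 2 "$log")        # lines 9, 10

# Chaining the two picks a range from the middle: lines 4 and 5.
middle=$(head -n 5 "$log" | tail -n 2)

printf '%s\n' "$first_three" "$last_two" "$middle"

# Following a growing file would look like this (left commented out
# because it runs until it is stopped):
#   tail -f "$log" &

rm -f "$log"
```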

Extracting text by column

cut

cut [OPTION]... [FILE]...

-d DELIMITER: specifies the field delimiter; the default is tab

-f FIELDS: selects the fields to print:

#: the #th field

#,#[,#]: several discrete fields, such as 1,3,6

#-#: several consecutive fields, such as 1-6

Mixed use: 1-3,7

-c: cut by character position

--output-delimiter=STRING: specifies the output delimiter

Display a specified column of a file or of stdin data:

cut -d: -f1 /etc/passwd

cat /etc/passwd | cut -d: -f7

cut -c2-5 /usr/share/dict/words
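The field selectors can be checked against a small passwd-style file. The two records below are invented for illustration, and --output-delimiter assumes GNU cut.

```shell
data=$(mktemp)
printf 'alice:x:1000:1000:Alice:/home/alice:/bin/bash\nbob:x:1001:1001:Bob:/home/bob:/bin/sh\n' > "$data"

names=$(cut -d: -f1 "$data")        # a single field
mixed=$(cut -d: -f1,6-7 "$data")    # a discrete field plus a range
arrows=$(cut -d: -f1,7 --output-delimiter=' -> ' "$data")
chars=$(cut -c1-3 "$data")          # characters 1 to 3 of every line

printf '%s\n' "$names" "$mixed" "$arrows" "$chars"
rm -f "$data"
```

Note that cut always prints the selected fields in their original order, regardless of the order in which they are listed after -f.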

paste: merges the lines with the same line number from two files into a single line

paste [OPTION]... [FILE]...

-d DELIMITER: specifies the delimiter; the default is tab

-s: joins all the lines of each file into a single line

paste f1 f2

paste -s f1 f2
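A short sketch of both paste modes, using two temporary three-line files with made-up contents:

```shell
f1=$(mktemp)
f2=$(mktemp)
printf 'a\nb\nc\n' > "$f1"
printf '1\n2\n3\n' > "$f2"

# Default mode: line N of f1 is joined with line N of f2.
side_by_side=$(paste -d: "$f1" "$f2")

# -s: each file is collapsed into one output line of its own.
serial=$(paste -s -d, "$f1" "$f2")

printf '%s\n' "$side_by_side" "$serial"
rm -f "$f1" "$f2"
```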

Sometimes we need to analyze the text.

wc: collects text statistics

Counts the total number of words, lines, bytes, and characters

Can run on data in a file or on stdin

Example: $ wc story.txt

237 1901 story.txt

By default wc prints the number of lines, the number of words, and the number of bytes, in that order, followed by the file name.

Use -l to count only the number of lines

Use -w to count only the total number of words

Use -c to count only the total number of bytes

Use -m to count only the number of characters
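The individual counters can be verified on a file whose size is easy to work out by hand. The sample text is hypothetical; reading from stdin keeps the file name out of the output.

```shell
story=$(mktemp)
printf 'one two three\nfour five\n' > "$story"

lines=$(wc -l < "$story")   # 2 lines
words=$(wc -w < "$story")   # 5 words
bytes=$(wc -c < "$story")   # 14 + 10 = 24 bytes, newlines included

echo "lines=$lines words=$words bytes=$bytes"
rm -f "$story"
```

For plain ASCII text -m and -c give the same number; they differ only when the text contains multibyte characters.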

sort: sorts text

Writes the sorted text to stdout without changing the original file

$ sort [options] file(s)

-r: sorts in reverse (descending) order

-n: sorts by numeric value

-f: folds case, that is, ignores the case of characters when comparing

-u: (unique) removes duplicate lines from the output

-t c: uses c as the field delimiter

-k X: sorts by field X, where fields are separated by the delimiter c; this option can be used multiple times
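A minimal sketch of these sort options on a colon-delimited file. The names and scores are invented, and the example assumes lowercase ASCII data so that the result does not depend on the locale.

```shell
scores=$(mktemp)
printf 'carol:9\nalice:10\nbob:9\nalice:10\n' > "$scores"

# Numeric sort on the second colon-separated field.
# Without -n, "10" would sort before "9" as text.
by_number=$(sort -t: -k2 -n "$scores")

# Plain reverse sort on the whole line.
reverse=$(sort -r "$scores")

# -u keeps one copy of the repeated alice:10 line.
unique=$(sort -u "$scores")

printf '%s\n' "$by_number" "$reverse" "$unique"
rm -f "$scores"
```

Be careful when combining -u with -k: lines are then considered duplicates if their keys compare equal, even when the rest of the line differs.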

uniq: removes adjacent duplicate lines from the input

uniq [OPTION]... [FILE]...

-c: shows the number of repetitions of each line;

-d: shows only the lines that are repeated;

-u: shows only the lines that are not repeated;

Only lines that are adjacent and exactly identical count as duplicates.

Commonly used together with the sort command:

sort userlist.txt | uniq -c
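The following sketch shows why uniq is usually preceded by sort. It uses a hypothetical user list in which the duplicates are not adjacent.

```shell
users=$(mktemp)
printf 'root\nalice\nroot\nalice\nbob\n' > "$users"

# No two identical lines are adjacent, so uniq alone removes nothing.
unsorted=$(uniq "$users")

# After sorting, identical lines sit next to each other.
counts=$(sort "$users" | uniq -c)    # count per distinct line
dups=$(sort "$users" | uniq -d)      # lines that occur more than once
singles=$(sort "$users" | uniq -u)   # lines that occur exactly once

printf '%s\n' "$counts" "$dups" "$singles"
rm -f "$users"
```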

The three musketeers of text processing on Linux:

grep, egrep, fgrep: text filtering tools (pattern-based);

grep: basic regular expressions; options -E, -F

egrep: extended regular expressions; options -G, -F

fgrep: regular expressions are not supported

sed: stream editor, a text editing tool;

awk: the implementation on Linux is gawk, a text report generator (formatted text output);

grep: Global search REgular expression and Print out the line.
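The difference between the three grep variants can be seen with a pattern that contains a metacharacter. This is a sketch assuming GNU grep, where grep -F and grep -E behave like fgrep and egrep; the sample lines are made up.

```shell
sample=$(mktemp)
printf 'a.c\nabc\na+c\n' > "$sample"

# Basic regular expression: the dot matches any single character.
regex_hits=$(grep 'a.c' "$sample")

# Fixed string (fgrep behaviour): the dot is an ordinary character.
literal_hits=$(grep -F 'a.c' "$sample")

# Extended regular expression (egrep behaviour): grouping and
# alternation work without backslashes; \+ is a literal plus sign here.
ere_hits=$(grep -E 'a(b|\+)c' "$sample")

printf '%s\n' "$regex_hits" "$literal_hits" "$ere_hits"
rm -f "$sample"
```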

Role: a text search tool that checks the target file line by line against a user-specified "pattern" and prints the matching lines.

Pattern: a filter condition written with regular expression characters and text characters

Regular expression (regexp): a pattern written with a class of special characters and text characters, in which some characters do not stand for their literal meaning but act as control characters or wildcards.

Regular expressions are divided into two categories:

Basic Regular Expressions: BRE

Extended Regular expression: ERE

grep:

grep [OPTIONS] PATTERN [FILE...]

OPTIONS:

--color=auto: highlights the matching text in color;

-i: ignore case, ignoring the case of characters;

-o: displays only the matching part of each line;

-v, --invert-match: displays the lines that do not match the pattern;

-E: supports extended regular expression metacharacters;

-q, --quiet, --silent: silent mode, outputs nothing;

-A #: after, also prints the # lines after each match

-B #: before, also prints the # lines before each match

-C #: context, also prints the # lines before and after each match
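A short sketch of the options listed above, run against an invented six-line log file:

```shell
log=$(mktemp)
printf 'start\nINFO ok\nerror: disk full\ncleanup\nError: retry\ndone\n' > "$log"

ci=$(grep -i 'error' "$log")            # matches error and Error
only=$(grep -o -i 'error' "$log")       # just the matched text
inverted=$(grep -v -i 'error' "$log")   # the lines without a match
after=$(grep -A1 'disk' "$log")         # the match plus one following line

# -q prints nothing; the exit status alone tells whether a match exists.
if grep -q 'done' "$log"; then found=yes; else found=no; fi

printf '%s\n' "$ci" "$only" "$inverted" "$after" "$found"
rm -f "$log"
```

The -q form is the usual choice in scripts, where only the yes/no answer matters.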

At this point we need to look at regular expressions in more detail.

Regular expression: a pattern written with a class of special characters and text characters, in which some characters (metacharacters) do not stand for their literal meaning but act as control characters or wildcards.

Basic regular expression metacharacters:

Character matching

.: matches any single character

[]: matches any single character within the specified range;

[^]: matches any single character outside the specified range;

[:digit:] [:lower:] [:upper:] [:alpha:] [:alnum:] [:punct:] [:space:]

Number of matches: placed after the character whose number of occurrences is to be limited; works in greedy mode by default

*: matches the preceding character any number of times: 0, 1, or more;

.*: matches any characters of any length

\?: matches the preceding character 0 or 1 times, that is, the preceding character is optional;

\+: matches the preceding character 1 or more times, that is, the preceding character must appear at least once;

\{m\}: matches the preceding character exactly m times;

\{m,n\}: at least m and at most n times

\{0,n\}: at most n times

\{m,\}: at least m times

Position anchoring:

^: anchors at the beginning of the line; used on the leftmost side of the pattern;

$: anchors at the end of the line; used on the rightmost side of the pattern;

^PATTERN$: makes PATTERN match the whole line;

^$: an empty line

^[[:space:]]*$: an empty line or a line containing only whitespace characters;

Word: a continuous string of non-special characters (letters, digits, and underscores) is called a word;

\< or \b: word-start anchor, used on the left side of the word pattern;

\> or \b: word-end anchor, used on the right side of the word pattern;

\<PATTERN\>: matches a complete word;
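The anchors and repetition operators of basic regular expressions can be tried out as follows. This is a sketch that assumes GNU grep, since \?, \< and \> are GNU extensions; the sample words are chosen for illustration.

```shell
text=$(mktemp)
printf 'cat\nconcatenate\n\n   \nthe cat sat\ncaat\nct\n' > "$text"

whole_line=$(grep '^cat$' "$text")            # the line is exactly "cat"
blankish=$(grep -c '^[[:space:]]*$' "$text")  # empty or whitespace-only lines
whole_word=$(grep '\<cat\>' "$text")          # cat as a word, not inside concatenate
repeats=$(grep '^ca\{1,2\}t$' "$text")        # one or two a's between c and t
optional=$(grep '^ca\?t$' "$text")            # the a may be present or absent

printf '%s\n' "$whole_line" "$blankish" "$whole_word" "$repeats" "$optional"
rm -f "$text"
```

The patterns are written in single quotes so that the shell passes the backslashes and the $ sign to grep unchanged.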

Grouping and referencing

\(\): binds one or more characters together so that they are treated as a whole;

Note: the content matched by the pattern inside the grouping parentheses is automatically recorded by the regular expression engine in internal variables, which are:

\1: the characters matched by the pattern between the first opening parenthesis, counting from the left, and its matching closing parenthesis;

\2: the characters matched by the pattern between the second opening parenthesis, counting from the left, and its matching closing parenthesis;

\3

...

Example: \(string1\+\(string2\)*\)

\1: string1\+\(string2\)*

\2: string2

Back reference: refers to the characters matched by the pattern in an earlier group (not to the pattern itself)
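The point that a back reference repeats the matched characters, not the pattern, can be illustrated with a short sketch (assuming GNU grep; the sample sentences are made up):

```shell
lines=$(mktemp)
printf 'he loves his lover\nhe likes his lover\nshe likes her liker\nabab\nabcd\n' > "$lines"

# \(l..e\) captures a four-letter string such as love or like;
# \1 then requires that same string to appear again later in the line.
same_stem=$(grep '\(l..e\).*\1' "$lines")

# A group followed directly by its back reference matches a doubled string.
doubled=$(grep '\(ab\)\1' "$lines")

printf '%s\n' "$same_stem" "$doubled"
rm -f "$lines"
```

The second sample line contains both like and love, each of which fits the pattern l..e, yet it is not printed, because \1 must repeat exactly what the group matched.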

Extended regular expressions:

egrep = grep -E

egrep [OPTIONS] PATTERN [FILE...]

Extended regular expression metacharacters:

Character matching:

.: any single character

[]: any single character within the specified range

[^]: any single character outside the specified range

Number of matches:

*: matches the preceding character any number of times

?: 0 or 1 times

+: 1 or more times

{m}: exactly m times

{m,n}: at least m and at most n times

Position anchoring:

^: beginning of the line

$: end of the line

\<, \b: start of a word

\>, \b: end of a word

Grouping:

()

Back reference: \1, \2, ...

Alternation (or):

a|b

C|cat: C or cat

(C|c)at: Cat or cat
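The difference between C|cat and (C|c)at, along with the unescaped repetition operators, can be checked with grep -E. This is a sketch using an invented word list; the ^ and $ anchors are added so that each pattern has to match the whole line.

```shell
animals=$(mktemp)
printf 'Cat\ncat\nC\nbat\ncaaat\n' > "$animals"

# Without inner grouping, the alternatives are the whole strings C and cat.
alt_loose=$(grep -E '^(C|cat)$' "$animals")

# With grouping, only the first letter varies: Cat or cat.
alt_grouped=$(grep -E '^(C|c)at$' "$animals")

# In extended syntax, + and {m,n} need no backslashes.
one_or_more=$(grep -E '^ca+t$' "$animals")
two_to_three=$(grep -E '^ca{2,3}t$' "$animals")

printf '%s\n' "$alt_loose" "$alt_grouped" "$one_or_more" "$two_to_three"
rm -f "$animals"
```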

This article is from the "11798474" blog, please be sure to keep this source http://11808474.blog.51cto.com/11798474/1834614


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
