DAY7: Text-processing tools and regular expressions

Last Update:2016-08-07 Source: Internet

Author: User

The main topics covered on August 4 are as follows:

I. Tools for extracting text: less, cat, head, tail, cut

II. Tools for analyzing text: wc, sort, diff, patch

III. grep and regular expressions

IV. egrep and extended regular expressions

I. Tools for extracting text

1) File viewing commands:

cat [OPTION]... [FILE]...
-E: Display a $ at the end of each line
-n: Number every displayed line
-A: Show all control characters
-b: Number only non-empty lines
-s: Squeeze consecutive blank lines into one
tac
Works like cat, but displays the lines in reverse order (last line first)
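
The cat options and tac can be tried on a small throwaway file. This is a minimal sketch that assumes GNU coreutils; demo.txt is a hypothetical sample file created only for the demonstration.

```shell
# Work in a temporary directory; demo.txt is a hypothetical sample file.
tmp=$(mktemp -d)
printf 'alpha\n\n\nbeta\n' > "$tmp/demo.txt"

# -n numbers every line, -b numbers only the non-empty ones.
cat -n "$tmp/demo.txt"
cat -b "$tmp/demo.txt"

# -s squeezes the two consecutive blank lines into one, leaving 3 lines.
squeezed=$(cat -s "$tmp/demo.txt" | wc -l)
echo "lines after -s: $squeezed"

# -E marks the end of each line with a $ sign.
marked=$(cat -E "$tmp/demo.txt" | head -n 1)
echo "$marked"

# tac prints the lines in reverse order, so the last line comes first.
first_reversed=$(tac "$tmp/demo.txt" | head -n 1)
echo "$first_reversed"
```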

2) Paging tools

more: Page through files
more [OPTIONS...] FILE...
-d: Show prompts for paging and quitting
less: View a file or stdin output one page at a time
Useful commands while viewing:
/text: search for text
n/N: jump to the next or previous match
less is the pager used by the man command

3) Displaying the beginning or end of a text

head
head [OPTION]... [FILE]...
-c #: Output the first # bytes
-n #: Output the first # lines
-#: Same as -n #
tail
tail [OPTION]... [FILE]...
-c #: Output the last # bytes
-n #: Output the last # lines
-#: Same as -n #
-f: Follow the file and display newly appended content; commonly used for log monitoring
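
As a rough illustration of head and tail, the sketch below works on a generated ten-line file (numbers.txt is a hypothetical name). Chaining head into tail to pick a range of lines is a common idiom and is not from the original notes.

```shell
# numbers.txt is a hypothetical sample file holding the numbers 1 to 10.
tmp=$(mktemp -d)
seq 1 10 > "$tmp/numbers.txt"

# First three lines and last two lines.
first_three=$(head -n 3 "$tmp/numbers.txt")
last_two=$(tail -n 2 "$tmp/numbers.txt")

# Lines 4 to 6: take the first six lines, then the last three of those.
middle=$(head -n 6 "$tmp/numbers.txt" | tail -n 3)

# Last three bytes of the file: "1", "0" and the trailing newline.
last_bytes=$(tail -c 3 "$tmp/numbers.txt")

echo "$first_three"
echo "$last_two"
echo "$middle"
echo "$last_bytes"

# tail -f "$tmp/numbers.txt" would keep running and print appended lines,
# so it is left out of this non-interactive sketch.
```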

4) Extracting text by column with cut and merging files with paste

cut [OPTION]... [FILE]...
-d DELIMITER: Specify the field delimiter, default tab (no space is needed between -d and the delimiter)
-f FIELDS:
#: Field number #
#,#[,#]: Several discrete fields, such as 1,3,6
#-#: A range of consecutive fields, such as 1-6
Mixed use: 1-3,7
-c: Cut by character position
--output-delimiter=STRING: Specify the output delimiter
cut -d: -f1 /etc/passwd
cat /etc/passwd | cut -d: -f7
cut -c2-5 /usr/share/dict/words
paste: Merge the lines with the same line number from two files into one line
-d DELIMITER: Specify the delimiter, default tab
-s: Join all lines of each file into a single line
paste f1 f2
paste -s f1 f2
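
To avoid depending on the contents of the real /etc/passwd, the following sketch runs cut and paste on small hypothetical files (users.txt, f1, f2). It assumes GNU cut, since --output-delimiter is a GNU option.

```shell
tmp=$(mktemp -d)

# users.txt is a hypothetical file in /etc/passwd format.
printf 'root:x:0:0:root:/root:/bin/bash\nalice:x:1000:1000:Alice:/home/alice:/bin/zsh\n' > "$tmp/users.txt"

# Field 1 (the user name), with ":" as the delimiter.
names=$(cut -d: -f1 "$tmp/users.txt")

# Fields 1 and 7, printed with a space instead of ":" between them.
name_shell=$(cut -d: -f1,7 --output-delimiter=' ' "$tmp/users.txt")

# Characters 2 to 4 of the first line: "oot".
chars=$(cut -c2-4 "$tmp/users.txt" | head -n 1)

printf 'a\nb\n' > "$tmp/f1"
printf '1\n2\n' > "$tmp/f2"

# Line N of f1 is joined with line N of f2: "a,1" and "b,2".
side_by_side=$(paste -d, "$tmp/f1" "$tmp/f2")

# With -s each file becomes one output line: "a,b" and "1,2".
serial=$(paste -s -d, "$tmp/f1" "$tmp/f2")

echo "$names"; echo "$name_shell"; echo "$chars"
echo "$side_by_side"; echo "$serial"
```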

II. Text analysis tools

1) Text statistics

wc: Counts lines, words, and bytes (or characters); works on files or on data from stdin

wc story.txt

237 1901 story.txt

wc prints the line count, word count, and byte count, in that order, followed by the file name

-l: count lines only

-w: count words only

-c: count bytes only

-m: count characters only
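
The individual counters can be checked on a file whose contents are known. In this sketch story.txt is a hypothetical two-line file; reading it through stdin keeps the file name out of the output.

```shell
# story.txt here is a hypothetical two-line sample, not the file from the notes.
tmp=$(mktemp -d)
printf 'one two\nthree\n' > "$tmp/story.txt"

lines=$(wc -l < "$tmp/story.txt")   # 2 lines
words=$(wc -w < "$tmp/story.txt")   # 3 words
bytes=$(wc -c < "$tmp/story.txt")   # 8 + 6 = 14 bytes, newlines included
chars=$(wc -m < "$tmp/story.txt")   # also 14, because the text is plain ASCII

echo "lines=$lines words=$words bytes=$bytes chars=$chars"
```

For text with multi-byte characters, -c and -m can differ, depending on the locale in use.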

2) Text sorting

sort: Writes the sorted text to stdout (by default sorted by character order); the original file is not changed

sort [OPTIONS] FILE(s)
-r: sort in reverse (descending) order

-n: sort by numeric value

-f: fold case, ignoring the case of characters when comparing

-u: (unique) drop duplicate lines from the output

-t c: use c as the field delimiter

-k X: sort by column X, with columns delimited by the -t character; can be given multiple times
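
The difference between character order and numeric order shows up as soon as a column holds numbers of different lengths. A small sketch, using a hypothetical scores.txt with name:score lines:

```shell
# scores.txt is a hypothetical sample with "name:score" lines.
tmp=$(mktemp -d)
printf 'carol:30\nalice:100\nbob:9\nalice:100\n' > "$tmp/scores.txt"

# Whole-line sort with duplicates removed.
by_name=$(sort -u "$tmp/scores.txt")

# Sorting column 2 as text puts "100" before "30" before "9".
lexical_last=$(sort -t: -k2 "$tmp/scores.txt" | tail -n 1)

# Sorting column 2 numerically puts 9 first.
numeric_first=$(sort -t: -k2 -n "$tmp/scores.txt" | head -n 1)

# Numeric, descending, one line per distinct score.
ranking=$(sort -t: -k2 -n -r -u "$tmp/scores.txt")

echo "$by_name"
echo "$lexical_last"
echo "$numeric_first"
echo "$ranking"
```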

3) Removing duplicates

uniq: Removes adjacent duplicate lines from the input

uniq [OPTION]... [FILE]...

-c: Prefix each line with the number of times it occurs

-d: Show only lines that are repeated

-u: Show only lines that are not repeated

Lines count as duplicates only when they are adjacent and exactly identical, so uniq is commonly used together with sort: sort userlist.txt | uniq -c
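
Why sort usually comes first can be seen with a list in which the same name appears in two separate places. A sketch with a hypothetical userlist.txt; awk is used only to trim the padding that uniq -c puts in front of the count.

```shell
# userlist.txt is a hypothetical sample; "bob" appears three times,
# but only two of the occurrences are adjacent.
tmp=$(mktemp -d)
printf 'bob\nalice\nbob\nbob\ncarol\n' > "$tmp/userlist.txt"

# Without sorting, only the adjacent pair collapses: bob, alice, bob, carol.
unsorted=$(uniq "$tmp/userlist.txt" | wc -l)

# After sorting, all duplicates are adjacent: alice, bob, carol.
distinct=$(sort "$tmp/userlist.txt" | uniq | wc -l)

repeated=$(sort "$tmp/userlist.txt" | uniq -d)   # names seen more than once
singles=$(sort "$tmp/userlist.txt" | uniq -u)    # names seen exactly once

# Most frequent name: count, sort the counts numerically, keep the top line.
top=$(sort "$tmp/userlist.txt" | uniq -c | sort -n -r | head -n 1 | awk '{print $1, $2}')

echo "unsorted=$unsorted distinct=$distinct"
echo "$repeated"; echo "$singles"; echo "$top"
```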

4) Comparing files

diff: Compares two files line by line

diff [OPTION]... OLDFILE NEWFILE
Shows the differences between OLDFILE and NEWFILE

diff foo.conf-broken foo.conf-works

5c5 (line 5 differs between the two files)

< use_widgets = no

---

> use_widgets = yes

-u: Use the unified format, which shows context around each changed line (3 lines by default); suitable for patch files

diff /path/to/oldfile /path/to/newfile > /path/to/patch_file

diff can also compare two directories, showing the differences between the files they contain

patch: Applies the changes recorded in a patch file to a file

patch -i /path/to/patch_file /path/to/oldfile

patch /path/to/oldfile < /path/to/patch_file

-b: automatically back up the files that are changed
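
A complete round trip, from two versions of a file to a patch and back, might look like the sketch below. The two-line configuration files are hypothetical stand-ins for foo.conf-broken and foo.conf-works, and the patch step is guarded because patch is a separate package that is not installed on every system.

```shell
tmp=$(mktemp -d)

# Hypothetical two-line versions of the configuration files.
printf 'name = demo\nuse_widgets = no\n' > "$tmp/foo.conf-broken"
printf 'name = demo\nuse_widgets = yes\n' > "$tmp/foo.conf-works"

# diff exits with status 1 when the files differ, which is not an error here.
diff -u "$tmp/foo.conf-broken" "$tmp/foo.conf-works" > "$tmp/foo.patch" || true
cat "$tmp/foo.patch"

if command -v patch >/dev/null 2>&1; then
    # -b keeps a backup of the original (GNU patch uses the .orig suffix by default).
    patch -b "$tmp/foo.conf-broken" < "$tmp/foo.patch"

    # After patching, the old file has the same content as the new one.
    cmp -s "$tmp/foo.conf-broken" "$tmp/foo.conf-works" && echo "patched file matches"
fi
```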

III. grep and regular expressions

1) The "three musketeers" of text processing on Linux

grep: Text filtering tool (filters by pattern)

grep, egrep (supports extended regular expressions), fgrep (does not support regular expression search)

sed: Stream editor, a text editing tool

awk: Text report generator; implemented on Linux as gawk

2) grep: Global search REgular expression and Print out the line

Function: A text search tool that checks the target text line by line against a user-specified pattern and prints the matching lines. Pattern: a filter condition written with regular expression characters and literal text characters

grep [OPTIONS] PATTERN [FILE...]

grep root /etc/passwd

Command options:

--color=auto: Highlight the matched text in color

-v: Show lines that do not match the pattern

-i: Ignore case

-n: Show the line numbers of matching lines

-c: Count the number of matching lines

-o: Show only the matched string

-q: Quiet mode, no output at all (check the result with echo $?; useful in scripts)

-A #: After; also show the # lines after each match

-B #: Before; also show the # lines before each match

-C #: Context; also show the # lines before and after each match

-e: Specify a pattern; repeat it to express a logical OR between multiple patterns

grep -e 'cat' -e 'dog' file

-w: Match whole words only

-E: Use extended regular expressions (ERE)
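
Most of these options can be exercised on a five-line file. The sketch below uses a hypothetical sample.txt rather than /etc/passwd so that the results do not depend on the machine it runs on.

```shell
# sample.txt is a hypothetical five-line file.
tmp=$(mktemp -d)
f="$tmp/sample.txt"
printf 'root:x:0:0\nRoot backup\nalice:x:1000\ncat and dog\ncatalog\n' > "$f"

count_ci=$(grep -c -i 'root' "$f")        # "root" and "Root": 2 lines
numbered=$(grep -n 'alice' "$f")          # prefixed with the line number 3
only=$(grep -o 'cat' "$f" | wc -l)        # one "cat" in each of two lines
either=$(grep -c -e 'cat' -e 'dog' "$f")  # lines containing cat OR dog
whole=$(grep -c -w 'cat' "$f")            # "catalog" is not the word "cat"
inverted=$(grep -c -v 'x' "$f")           # lines without an "x"
after=$(grep -A 1 'Root' "$f")            # the match plus the next line

# -q prints nothing; only the exit status says whether there was a match.
if grep -q 'alice' "$f"; then found=yes; else found=no; fi

echo "$count_ci $only $either $whole $inverted $found"
echo "$numbered"
echo "$after"
```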

3) Regular expressions

REGEXP: A pattern written with special characters and literal text characters, in which some characters (metacharacters) do not stand for themselves but express control or wildcard functions
Programs that support them: grep, vim, less, nginx, etc.
Two categories: basic regular expressions (BRE) and extended regular expressions (ERE)
Metacharacter classes: character matching, number of matches, position anchoring, grouping

4) Basic regular expressions

Character matching
.: Matches any single character
[]: Matches any single character within the specified set
[^]: Matches any single character outside the specified set
[:digit:], [:lower:], [:upper:], [:alpha:], [:alnum:], [:punct:], [:space:] (used inside brackets, for example [[:digit:]])
Number of matches (greedy by default: match as much as possible)
Placed after the character whose repetition is to be specified, to give the number of times that preceding character occurs
*: Matches the preceding character any number of times, including 0
.*: Any characters of any length
\?: Matches the preceding character 0 or 1 times
\+: Matches the preceding character at least 1 time
\{m\}: Matches the preceding character exactly m times
\{m,n\}: Matches the preceding character at least m and at most n times
\{0,n\}: Matches the preceding character at most n times
\{m,\}: Matches the preceding character at least m times
Position anchoring: pins down where the match occurs
^: Start-of-line anchor, placed at the leftmost side of the pattern (^root: lines that start with root)
$: End-of-line anchor, placed at the rightmost side of the pattern (root$: lines that end with root)
^PATTERN$: The pattern must match the entire line
^$: Empty line (contains no characters, not even whitespace)
^[[:space:]]*$: Blank line (empty, or containing only whitespace characters)
Word: a continuous string of non-special characters (letters, digits, underscore)
\< or \b: Start-of-word anchor, placed at the left side of the word pattern
\> or \b: End-of-word anchor, placed at the right side of the word pattern
\<PATTERN\>: Matches a whole word
Grouping and referencing
Grouping: \(\): Binds one or more characters together to be treated as a unit, such as \(root\)\+
Note: The text matched by the pattern inside each group is recorded by the regular expression engine in internal variables named \1, \2, \3, ...
\1: The text matched by the pattern between the first opening parenthesis, counting from the left, and its matching closing parenthesis
Example: \(string1\+\(string2\)*\)
\1: string1\+\(string2\)*
\2: string2
Back reference: Refers to the text matched by the pattern in an earlier group, not to the pattern itself
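
The anchors, repetition counts and back references can be compared side by side on a single test file. This sketch assumes GNU grep, because \+ and \< \> are GNU extensions to basic regular expressions; lines.txt is a hypothetical sample.

```shell
# lines.txt is a hypothetical sample: seven lines, one empty, one of spaces.
tmp=$(mktemp -d)
f="$tmp/lines.txt"
printf 'root\nroot login\n\n   \nhello hello world\nrooot\nuproot\n' > "$f"

starts=$(grep -c '^root' "$f")           # "root", "root login"
exact=$(grep -c '^root$' "$f")           # only the line that is exactly "root"
empty=$(grep -c '^$' "$f")               # the truly empty line
blank=$(grep -c '^[[:space:]]*$' "$f")   # the empty line and the spaces-only line
repeat=$(grep -c 'ro\{2,3\}t' "$f")      # root (3 lines) and rooot
word=$(grep -c '\<root\>' "$f")          # "uproot" and "rooot" do not count

# \1 must repeat the text matched by the group, so "hello hello" matches
# while "root login" does not.
doubled=$(grep '\([a-z]\+\) \1' "$f")

echo "$starts $exact $empty $blank $repeat $word"
echo "$doubled"
```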

IV. egrep and extended regular expressions

1) egrep

egrep = grep -E

egrep [OPTIONS] PATTERN [FILE...]

2) Extended regular expressions

Character matching (same as basic regular expressions)
Number of matches
*: Matches the preceding character any number of times
?: 0 or 1 times
+: 1 or more times
{m}: Exactly m times
{m,n}: At least m and at most n times
{0,n}: At most n times
{m,}: At least m times
Position anchoring (same as basic regular expressions)
Grouping
()
Back reference: \1, \2, ...
Alternation (or)
a|b
C|cat: C or cat
(C|c)at: Cat or cat
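
How grouping changes the reach of | is easiest to see by running both forms on the same words. A sketch using grep -E on a hypothetical words.txt; back references in extended patterns are assumed to be available, as they are in GNU grep.

```shell
# words.txt is a hypothetical sample list.
tmp=$(mktemp -d)
f="$tmp/words.txt"
printf 'cat\nCat\nC\nbat\nconcat\naa bb\nabab\n' > "$f"

# Alternation applies to everything on each side: "C" or "cat".
whole_alt=$(grep -E '^(C|cat)$' "$f")

# Inside a group it covers only the first letter: "Cat" or "cat".
first_letter=$(grep -E '^(C|c)at$' "$f")

# A group can be made optional as a unit: "cat" and "concat".
optional=$(grep -E -c '^(con)?cat$' "$f")

# A group repeated exactly twice: "abab".
twice=$(grep -E -c '^(ab){2}$' "$f")

# Back reference: the same letter twice in a row, as in "aa bb".
doubled=$(grep -E '(a|b)\1' "$f")

echo "$whole_alt"; echo "$first_letter"
echo "$optional $twice"; echo "$doubled"
```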

3) fgrep

Does not support regular expression metacharacters; when the pattern needs no metacharacters, fgrep is the better choice
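
The effect of treating the pattern as a fixed string can be shown with patterns that contain a dot. The sketch uses grep -F, which is equivalent to fgrep (recent GNU grep releases print a warning when the fgrep name is used); literal.txt is a hypothetical sample.

```shell
# literal.txt is a hypothetical sample file.
tmp=$(mktemp -d)
f="$tmp/literal.txt"
printf 'price is $5.00\nprice is 15300\na.b\naxb\n' > "$f"

# As a regular expression, "." matches any character,
# so "5.00" also matches the "5300" inside "15300".
regex_price=$(grep -c '5.00' "$f")
fixed_price=$(grep -F -c '5.00' "$f")

# Same effect with "a.b": the regex also matches "axb".
regex_dot=$(grep -c 'a.b' "$f")
fixed_dot=$(grep -F -c 'a.b' "$f")

echo "regex: $regex_price $regex_dot"
echo "fixed: $fixed_price $fixed_dot"
```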

This article is from the "Laugh Monkey" blog, please be sure to keep this source http://xiaomonky.blog.51cto.com/11869371/1835347
