Awk text processing summary (beginner, intermediate, advanced)


Source: http://blog.csdn.net/blackbillow/archive/2009/01/21/3847425.aspx

 

Awk text processing summary - Alex.Wang

In technical support work on projects, we often need to process text files. Whatever the database, its data can ultimately be exported to text, and once it is text we can process it, even without being familiar with the operations of every database.

We need two tools: the shell and awk. Awk is the tool best suited to handling text files, and combined with a few shell commands it gets us the desired result easily. Starting from simple examples, I will summarize the problems I think come up most often.

Awk Introduction

We start with awk basics, using a file example1.txt with the following contents.

user1 password1 username1 unit1 10
user2 password2 username2 unit2 20
user3 password3 username3 unit3 30

 

In the Unix environment, we can use the following command to print the first column.

[root@mail awk]# awk '{print $1}' example1.txt

The result is shown below. To explain: the awk statement is enclosed in single quotes and braces, '{ }', to separate it from the shell command, and $1 refers to the first column of the text file. Normally the awk command is followed by the -F option to specify the field delimiter; if the delimiter is a space or a tab, it can be omitted.

user1
user2
user3

[root@mail awk]# awk '{if ($5 > 20) {print $1}}' example1.txt

Compared with the previous command, this one adds "if ($5 > 20)". The result is:

user3

This if statement hardly needs explaining: for each row whose fifth column is greater than 20, the first column is printed.
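The same filter can also be written without an explicit if, using awk's pattern-action form, where the condition sits in front of the braces. The sketch below is only an illustration; sample_rows is a hypothetical helper function that prints rows in the layout of example1.txt so that the snippet runs on its own.

```shell
# Hypothetical helper: print rows in the layout of example1.txt.
sample_rows() {
    printf '%s\n' \
        'user1 password1 username1 unit1 10' \
        'user2 password2 username2 unit2 20' \
        'user3 password3 username3 unit3 30'
}

# Pattern-action form: the action runs only for rows matching the pattern.
big_users=$(sample_rows | awk '$5 > 20 { print $1 }')

echo "$big_users"
```

With these rows, only user3 is printed, the same result as the if version.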

[root@mail awk]# awk '{if ($5 > 20 || $5 == 10) {print $1}}' example1.txt

user1
user3

This time the condition "if ($5 > 20 || $5 == 10)" uses a logical operator. The three logical operators "||", "&&" and "!" (or, and, not) can be combined freely inside a condition. This statement means: if the fifth column is greater than 20 or equal to 10, print the first column of the row. Note that equality is tested with "==", because a single "=" is an assignment. In our work, a user may ask us to find all accounts whose storage space is larger than, or equal to, a given size and modify them in batch.
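As a small sketch of the other two operators, the commands below use "&&" and "!" on rows in the layout of example1.txt. The helper function sample_rows and the variable names are assumptions made for this illustration.

```shell
# Hypothetical helper: print rows in the layout of example1.txt.
sample_rows() {
    printf '%s\n' \
        'user1 password1 username1 unit1 10' \
        'user2 password2 username2 unit2 20' \
        'user3 password3 username3 unit3 30'
}

# "&&" (and): fifth column between 10 and 20 inclusive.
in_range=$(sample_rows | awk '$5 >= 10 && $5 <= 20 { print $1 }')

# "!" (not): every row whose fifth column is not equal to 20.
not_twenty=$(sample_rows | awk '!($5 == 20) { print $1 }')

echo "$in_range"
echo "$not_twenty"
```

The first command prints user1 and user2; the second prints user1 and user3.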

if is one of awk's control statements, and there are many others. man awk lists them:
Control Statements
The control statements are as follows:

if (condition) statement [ else statement ]
while (condition) statement
do statement while (condition)
for (expr1; expr2; expr3) statement
for (var in array) statement
break
continue
delete array[index]
delete array
exit [ expression ]
{ statements }
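To show a few of these statements working together, here is a minimal sketch that totals the fifth column per unit with an array and then walks the array with for (var in array) and if/else. The input rows are hypothetical (two users share unit1), and the threshold of 20 is chosen only for the example.

```shell
# Hypothetical rows in the layout of example1.txt; two users share unit1.
rows='user1 password1 username1 unit1 10
user2 password2 username2 unit2 20
user3 password3 username3 unit1 30'

# for (var in array) visits keys in no guaranteed order, so the output is sorted.
usage_by_unit=$(printf '%s\n' "$rows" | awk '
    { total[$4] += $5 }                  # add column 5 to the total of its unit
    END {
        for (unit in total) {            # for (var in array) statement
            if (total[unit] > 20)        # if (condition) statement else statement
                print unit, total[unit], "large"
            else
                print unit, total[unit], "small"
        }
    }' | sort)

echo "$usage_by_unit"
```

For these rows the result is "unit1 40 large" followed by "unit2 20 small".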

While learning awk, consult man awk often to see all the functions and how to use them.

Only by understanding what each symbol means can we use awk well. At first, it is enough to memorize a few commands and know what results they produce; the understanding will come later.

Awk intermediate

Now we will replace the spaces in example1.txt with ":". The vi command used here is:

:%s/ /:/g

This is one of the most widely used commands among vi users. We save the result as a new file named example2.txt.

user1:password1:username1:unit1:10
user2:password2:username2:unit2:20
user3:password3:username3:unit3:30
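The same conversion can be done in awk itself instead of vi. The sketch below is one possible way, not the method used above: it sets the output field separator OFS to ":" and reassigns $1 so that awk rebuilds the line. Unlike the vi substitution, it collapses runs of spaces or tabs into a single ":".

```shell
# One row in the space-separated layout of example1.txt.
row='user1 password1 username1 unit1 10'

# Assigning $1 to itself makes awk rebuild the line with OFS between fields.
converted=$(printf '%s\n' "$row" | awk 'BEGIN { OFS = ":" } { $1 = $1; print }')

echo "$converted"
```

Applied to a whole file, the awk part would be run with the file name as its argument and the output redirected to the new file.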

Now let's create an awk script; so far we have worked on the command line. In fact, everything can be done on the command line, but here we start with one of the most frequent tasks: adding users in batch.

script1.awk

#!/bin/awk -f
# When the file has executable permission, it can be run directly:
#   ./script1.awk example2.txt
# Without the first line that fails; alternatively, run
#   awk -f script1.awk example2.txt
# where the -f option names the script file.

BEGIN {                     # BEGIN runs once, before any input is read
    FS = ":"                # FS is awk's field separator
}

{                           # main block, run for every input line
    print "add {"           # print is awk's output statement
    print "uid=" $1
    print "userpassword=" $2
    print "domain=eyou.com"
    print "bookmark=1"
    print "voicemail=1"
    print "securemail=1"
    print "storage=" $5
    print "}"
    print "."
}                           # end of the main block
END {                       # END runs once, after the last line
    print "exit"
}

Execution result
[root@mail awk]# awk -f script1.awk example2.txt
add {
uid=user1
userpassword=password1
domain=eyou.com
bookmark=1
voicemail=1
securemail=1
storage=10
}
.
.
.
.
.
.
exit

Text operations become much more convenient this way.

The following two commands return the same result:
[root@mail awk]# awk -F: '{print $1 "@" $2}' example2.txt
[root@mail awk]# awk -F: '{printf "%s@%s\n", $1, $2}' example2.txt

user1@password1

The difference between print and printf is that printf gives more freedom over the format, so we can specify the output more precisely. print automatically appends a newline at the end of the line, while printf needs an explicit "\n". If you are interested, remove the "\n" and look at the result. %s stands for a string and %d for a number. %s can handle almost everything, because everything in a text file can be treated as a string; unlike languages such as C, awk does not make us distinguish numbers, characters, strings and so on.
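A short sketch of what the extra freedom of printf looks like in practice: field widths for aligned columns, and output that stays on one line until "\n" is given. The widths 8 and 5 and the "|" markers are arbitrary choices for this illustration.

```shell
# One row in the colon-separated layout of example2.txt.
row='user1:password1:username1:unit1:10'

# %-8s left-justifies a string in 8 columns; %5d right-justifies a number in 5.
aligned=$(printf '%s\n' "$row" | awk -F: '{ printf "%-8s|%5d|\n", $1, $5 }')

# printf adds no newline by itself, so these two calls share one output line.
joined=$(awk 'BEGIN { printf "a"; printf "b\n" }')

echo "$aligned"
echo "$joined"
```

The first line printed is "user1   |   10|" and the second is "ab".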

Awk also has some useful functions that are worth studying in detail.

This time I ran into a real problem. A customer had a list of about 2 million users and an interesting job to do: move each account's directory under a specific directory. For example, the 13910011234 directory should be placed in the 139/10/ directory. In other words, the first three digits of the mobile number are the level-2 directory name, and the 4th and 5th digits are the level-3 directory name. All I had was the user list, and I had found the rule, so let's look for a solution.

example3.txt

13910011234
15920312343
13922342134
15922334422
......

The first step is to find a way to split each phone number. At first you may wonder how awk can split a number that contains no delimiter at all. To be honest, I thought about it for more than 20 minutes before I remembered that Python, which I had learned earlier, has a split function, so I looked for something similar in awk. man awk turned up substr, the substring function:

[root@mail awk]# awk '{print substr($1, 1, 3)}' example3.txt

[root@mail awk]# awk '{printf "%s/%s\n", substr($1, 1, 3), substr($1, 4, 2)}' example3.txt

[root@mail awk]# awk '{printf "mv %s %s/%s\n", $1, substr($1, 1, 3), substr($1, 4, 2)}' example3.txt

After these steps, we get the expected result.

mv 13910011234 139/10
mv 15920312343 159/20
mv 13922342134 139/22
mv 15922334422 159/22

Copy the output to a shell script and execute it in the current data directory!
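If the target directories might not exist yet, the generated script could create them first. The following is a hedged variant rather than the commands used above: it assumes that emitting "mkdir -p" before each "mv" is acceptable, and it only prints the commands so they can be reviewed before being run.

```shell
# Two numbers taken from example3.txt.
numbers='13910011234
15920312343'

# Only print the commands; nothing is created or moved here.
commands=$(printf '%s\n' "$numbers" | awk '{
    dir = substr($1, 1, 3) "/" substr($1, 4, 2)
    printf "mkdir -p %s && mv %s %s/\n", dir, $1, dir
}')

echo "$commands"
```

For the first number this prints "mkdir -p 139/10 && mv 13910011234 139/10/".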

substr(s, i [, n])      Returns the at most n-character substring of s
                        starting at i. If n is omitted, the rest of s
                        is used.

In this description, s is the string to process, i is the position in the string at which to start (counting from 1), and n is the number of characters to take from that position. Reading the English man pages more often will improve your English too.
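The three arguments can be tried out on a single phone number. This is a small sketch; the variable name pieces is chosen only for the example.

```shell
# substr(s, i, n): n characters starting at position i (positions start at 1).
# substr(s, i): everything from position i to the end of the string.
pieces=$(echo 13910011234 | awk '{ print substr($1, 1, 3), substr($1, 4, 2), substr($1, 6) }')

echo "$pieces"
```

This prints "139 10 011234": the first three digits, the 4th and 5th digits, and the rest of the number from the 6th digit on.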

Awk has many other interesting functions; check man awk if you are interested.
String Functions. Some common ones:
length([s])             Returns the length of the string s, or the
                        length of $0 if s is not supplied.
This gets the length of a string and is very commonly used.
split(s, a [, r])       Splits the string s into the array a on the
                        regular expression r, and returns the number of
                        fields. If r is omitted, FS is used instead.
                        The array a is cleared first. Splitting
                        behaves identically to field splitting,
                        described above.

tolower(str)            Returns a copy of the string str, with all the
                        upper-case characters in str translated to
                        their corresponding lower-case counterparts.
                        Non-alphabetic characters are left unchanged.

toupper(str)            Returns a copy of the string str, with all the
                        lower-case characters in str translated to
                        their corresponding upper-case counterparts.
                        Non-alphabetic characters are left unchanged.
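A minimal sketch that exercises length, split and toupper on one line in the layout of example2.txt; the array name parts and the variable names are assumptions for this illustration.

```shell
# One row in the colon-separated layout of example2.txt.
row='user1:password1:username1:unit1:10'

info=$(printf '%s\n' "$row" | awk '{
    n = split($0, parts, ":")      # n is the number of pieces stored in parts
    print length($0), n, parts[3], toupper(parts[1])
}')

echo "$info"
```

This prints "34 5 username1 USER1": the line is 34 characters long, split returns 5 pieces, the third piece is username1, and toupper converts user1 to upper case.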
Time Functions. The most commonly used one converts a timestamp:

strftime([format [, timestamp]])
        Formats timestamp according to the specification in format.
        The timestamp should be of the same form as returned by
        systime(). If timestamp is missing, the current time of day is
        used. If format is missing, a default format equivalent to
        the output of date(1) is used. See the specification for the
        strftime() function in ANSI C for the format conversions that
        are guaranteed to be available. A public-domain version of
        strftime(3) and a man page for it come with gawk; if that
        version was used to build gawk, then all of the conversions
        described in that man page are available to gawk.


Here is an example of how the timestamp function is used.

[root@ent root]# date +%s | awk '{print strftime("%F %T", $0)}'
2008-02-19 15:59:19

We first use the date command to produce a timestamp and then convert it back into a readable time.
There are also some functions that we may not use often for now. For more information, see man awk.
Bit Manipulations Functions: bitwise operations
Internationalization Functions: support for internationalization

User-Defined Functions: users can also define their own functions. If you are interested, you can study them further.
 
For example:

function f(p, q,     a, b)   # a and b are local
{
    ...
}

/abc/   { ... ; f(1, 2) ; ... }
Dynamically Loading New Functions: new functions can be loaded dynamically, which is more advanced still.
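As a concrete sketch of a user-defined function, the example below wraps the directory rule from the phone number task in a function. The name prefix_dir is hypothetical; the extra parameter dir acts as a local variable because callers never pass it, which is the same convention as "a and b are local" above.

```shell
result=$(echo 13910011234 | awk '
    # prefix_dir builds a path such as 139/10 from a phone number.
    # dir is local: it is declared as a parameter that callers never pass.
    function prefix_dir(number,    dir) {
        dir = substr(number, 1, 3) "/" substr(number, 4, 2)
        return dir
    }
    { print prefix_dir($1) }')

echo "$result"
```

For the input 13910011234 this prints 139/10.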

Awk advanced

 
Whatever language we learn, what we are learning is a tool. The more tools we know, the more conveniently we can work. But a tool does not by itself guarantee a good product: algorithms matter just as much for scripting and programming. Knowing how to solve problems that others cannot is what sets you apart; the tool itself you can master simply by practicing steadily.
 

Next comes a problem that I consider more advanced. If you are interested, try to think of a better solution. We have a file exported from LDAP with one field per line.
Note that each user's data is separated from the next by a blank line. We must extract the matching uid and userpassword for each user.

Example: example4.txt

dn: uid=cailiying,domain=ccc.com.cn,o=mail.ccc.com.cn
uid: cailiying
userpassword: e21knx0zrel4veiwodbjdxzktnu3wffts3lrpt0=
letter: 300
quota: 100

dn: uid=caixiaoning,domain=ccc.com.cn,o=mail.ccc.com.cn
userpassword: e21knx1kejfxu0dozwprr2rnynv5ajjjrwl3pt0=
letter: 300
quota: 100
uid: chenzheng
domain: cqc.com.cn

dn: uid=caixiaoning,domain=ccc.com.cn,o=mail.ccc.com.cn
userpassword: e21knx1kejfxu0dozwprr2rnynv5ajjjrwl3pt0=
letter: 300
quota: 100

dn: uid=caixiaoning,domain=ccc.com.cn,o=mail.ccc.com.cn
userpassword: e21knx1kejfxu0dozwprr2rnynv5ajjjrwl3pt0=
letter: 300
quota: 100
uid: chenzheng
domain: cqc.com.cn
To process this text, we need to consider the following issues:
1. Not every entry contains both uid and userpassword.
2. Within an entry, uid and userpassword appear in no fixed order.
3. Some entries contain only a uid or only a userpassword.

From this analysis, we can see that two delimiters are involved: the blank line and the colon.
The colon is easy: awk -F: handles it. The blank line is harder to detect. On UNIX a blank line contains nothing but the newline character, so we can use the length() function: awk does not count the trailing newline, so a row whose length is 0 is a blank line.
That settles the delimiters. Blank lines have to be checked for as awk processes the file row by row.

The other problem is that the information in an entry may be incomplete, in which case the entry has to be dropped. How do we do that? We keep two marker values, "u" and "p", and test them at each blank line: only when both the uid and the password have been found is the entry printed. The following awk script processes the LDIF text along these lines.

# The purpose of this script is to make it easier to import data into other LDAP mail systems later.
# First export all the information with slapcat -l; after that we need to
# extract the uid and password. By default the output fields are separated by ":".
# For example, after slapcat -l user.ldif, to get a file with the uid and userpassword,
# set username = "dn"; password = "userpassword"; and run awk -f ldap2txt.awk user.ldif | grep uid | more to view the result (there may be several mail domains)
# To get the passwords for a domain, set username = "dn"; password = "userpassword"; and run awk -f ldap2txt.awk user.ldif | grep domain | more

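#!/bin/awk -f
# File name: ldap2txt.awk

BEGIN {
    FS = ":"
    username = "uid"
    password = "userpassword"
    name = "u"          # "u" means: no uid line seen in this entry yet
    pword = "p"         # "p" means: no password line seen in this entry yet
}

{
    if (length($0) == 0)                # blank line: the entry is finished
    {
        if (name != "u" && pword != "p")
            printf("%s:%s\n", name, pword)
        name = "u"                      # reset the markers for the next entry,
        pword = "p"                     # whether or not this one was printed
    }
    else
    {
        if ($1 == username)
            name = $0                   # remember the whole uid line
        else if ($1 == password)
            pword = $0                  # remember the whole password line
    }
}

END {
    # The last entry may not be followed by a blank line, so check it here.
    if (name != "u" && pword != "p")
        printf("%s:%s\n", name, pword)
}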
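As one possible alternative, and not the approach taken in the script above, awk's paragraph mode can do the blank-line handling itself: with RS set to the empty string, each block of lines separated by blank lines becomes one record, and with FS set to a newline each line becomes one field. The sketch below uses hypothetical LDIF-like data (the names aaa, bbb, ccc and the secret values are made up) and prints an entry only when both lines are present.

```shell
# Hypothetical entries: the second one has no uid line and must be skipped.
pairs=$(printf '%s\n' \
    'dn: uid=aaa,domain=example.com' \
    'uid: aaa' \
    'userpassword: secretA' \
    '' \
    'dn: uid=bbb,domain=example.com' \
    'userpassword: secretB' \
    '' \
    'userpassword: secretC' \
    'quota: 100' \
    'uid: ccc' |
awk 'BEGIN { RS = ""; FS = "\n" }       # paragraph mode: one record per entry
{
    uid = ""; pw = ""
    for (i = 1; i <= NF; i++) {         # each field is one line of the entry
        if ($i ~ /^uid:/)          uid = $i
        if ($i ~ /^userpassword:/) pw  = $i
    }
    if (uid != "" && pw != "") print uid ":" pw
}')

echo "$pairs"
```

This prints "uid: aaa:userpassword: secretA" and "uid: ccc:userpassword: secretC". Because each record is a complete entry, no marker variables are needed and the last entry is handled even without a trailing blank line.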

In fact, the first step in learning a language is to get familiar with its common functions, then try to solve problems that others have already solved, and then ask whether a better or faster solution exists. Most programmers spend their time reusing other people's good solutions and making those methods their own; in other words, practice, solve different problems, and keep looking for better methods!
