Oldboy Education Daily Question, March 31, 2017: awk Array Statistics

Source: Internet
Author: User


Process the following file contents: extract the domain names, count how many times each one occurs, and sort the result by count. (This was an interview question at Baidu and Sohu.)

http://www.etiantian.org/index.html
http://www.etiantian.org/1.html
http://post.etiantian.org/index.html
http://mp3.etiantian.org/index.html
http://www.etiantian.org/3.html
http://post.etiantian.org/2.html

Required results:

mp3.etiantian.org 1
post.etiantian.org 2
www.etiantian.org 3

Approach:

    1. Extract the domain name

      1. Use the slash as the field separator and take the second field (the domain name)

    2. Process it

      1. Create an array

      2. Use the second field (the domain name) as the array subscript

      3. Count occurrences with a form similar to i++

    3. Output the result once counting is done

Answer:

awk -F "/+" '{hotel[$2]++}END{for(pol in hotel) print pol,hotel[pol]}' url.txt|sort -rnk2

Demonstration:

[awkfile]# awk -F "/+" '{hotel[$2]++}END{for(pol in hotel) print pol,hotel[pol]}' url.txt|sort -rnk2
www.etiantian.org 3
post.etiantian.org 2
mp3.etiantian.org 1
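To try the answer without an existing url.txt, the sample input can be recreated first. This is only a sketch: the scratch directory and the variable names workdir and result are assumptions made for illustration, and the awk program is the same as in the answer.

```shell
# Recreate the sample input in a scratch directory (location is an assumption).
workdir=$(mktemp -d)
cat > "$workdir/url.txt" <<'EOF'
http://www.etiantian.org/index.html
http://www.etiantian.org/1.html
http://post.etiantian.org/index.html
http://mp3.etiantian.org/index.html
http://www.etiantian.org/3.html
http://post.etiantian.org/2.html
EOF

# Count each domain, then sort numerically on the count column, largest first.
result=$(awk -F '/+' '{hotel[$2]++} END {for (pol in hotel) print pol, hotel[pol]}' "$workdir/url.txt" | sort -rnk2)
printf '%s\n' "$result"

# Clean up the scratch directory.
rm -rf "$workdir"
```

The pipe through sort matters because awk does not promise any particular order when looping over an array with for (pol in hotel).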

Rather than analyzing the whole command at once, let's split it into several milestones and work through them one at a time.

    • First milestone - extract the desired content

The problem asks us to extract the domain name. With awk -F "/+", the + means that one or more consecutive slashes count as a single field separator, so the // after http: does not produce an empty field.

[awkfile]# awk -F "/+" '{print $2}' url.txt
www.etiantian.org
www.etiantian.org
post.etiantian.org
mp3.etiantian.org
www.etiantian.org
post.etiantian.org
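The effect of the + can be seen by splitting one URL both ways. A small sketch, where the variable names url, single and multi are chosen here for illustration:

```shell
url='http://www.etiantian.org/index.html'

# Separator "/": the empty string between the two slashes becomes $2,
# so the domain only shows up in $3.
single=$(printf '%s\n' "$url" | awk -F '/' '{print "[" $2 "]", "[" $3 "]"}')

# Separator "/+": the run "//" counts as one separator, so the domain is $2.
multi=$(printf '%s\n' "$url" | awk -F '/+' '{print "[" $2 "]"}')

printf '%s\n%s\n' "$single" "$multi"
```

With a plain "/" the same answer would still be reachable by counting on $3 instead; using "/+" simply keeps the domain in the second field.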
    • Second milestone - create an array

Next we create the array. The array is still named hotel, and each element is a room in it.

[awkfile]# awk -F "/+" '{hotel[$2]}' url.txt   ## create the array
[awkfile]# awk -F "/+" '{hotel[$2];print $2}' url.txt   ## create the array and print each element name (room number)
www.etiantian.org
www.etiantian.org
post.etiantian.org
mp3.etiantian.org
www.etiantian.org
post.etiantian.org
    • Third milestone - count the occurrences

[awkfile]# awk -F "/+" '{hotel[$2]++}' url.txt   ## create the array and count
[awkfile]# awk -F "/+" '{hotel[$2]++;print $2,hotel[$2]}' url.txt   ## create the array, then print each element name (room number) and the room's content
www.etiantian.org 1
www.etiantian.org 2
post.etiantian.org 1
mp3.etiantian.org 1
www.etiantian.org 3
post.etiantian.org 2

Here $2 is the second field of each line, so it is a variable whose value changes from line to line.

hotel[$2]++ has the same form as the familiar i++, except that the variable i is replaced with the array element hotel[$2]. It is like replacing a single room with a whole apartment building.
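The comparison between i++ and an array element can be sketched side by side. The input values a and b and the array name seen are made up for this illustration and are not part of the original exercise:

```shell
# i is a single counter (one room); seen[] keeps one counter per distinct value
# (one room per key in the building).
counts=$(printf '%s\n' a b a | awk '
    { i++; seen[$1]++ }
    END { print "i=" i, "a=" seen["a"], "b=" seen["b"] }')
printf '%s\n' "$counts"
```

The single counter ends at 3 because it counts every line, while the array keeps a separate total for each value.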

Here is a detailed walk-through of how awk counts the occurrences of www.etiantian.org.

We track only the count for www.etiantian.org.

    • Read the first line:

With "/+" (one or more consecutive slashes) as the separator, the second field is www.etiantian.org.

Used as an array subscript, this gives hotel["www.etiantian.org"].

The count is hotel["www.etiantian.org"] = hotel["www.etiantian.org"] + 1.

The www.etiantian.org room in the hotel starts out with nothing in it, which can be understood as empty (awk treats an empty value as 0 in arithmetic). So hotel["www.etiantian.org"] = empty + 1, and the room ends up holding the number 1.

    • Read the second line:

      It is www.etiantian.org.

The count is hotel["www.etiantian.org"] = hotel["www.etiantian.org"] + 1.

Because the number 1 is already in the hotel's www.etiantian.org room, we now have

hotel["www.etiantian.org"] = 1 + 1, so the www.etiantian.org room holds 2.

    • Read the third line:

      It is post.etiantian.org.

This is not the www.etiantian.org we are tracking, so the www.etiantian.org room still holds 2.

    • Read the fourth line:

      It is mp3.etiantian.org.

This is not the www.etiantian.org we are tracking, so the www.etiantian.org room still holds 2.

    • Read the fifth line:

      It is www.etiantian.org.

The count is hotel["www.etiantian.org"] = hotel["www.etiantian.org"] + 1.

Because the number 2 is already in the hotel's www.etiantian.org room, we now have

hotel["www.etiantian.org"] = 2 + 1, so the www.etiantian.org room holds 3.

    • Read the sixth line:

      It is post.etiantian.org.

This is not the www.etiantian.org we are tracking, so the www.etiantian.org room still holds 3.
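The first step of the walk-through relies on a room being empty before anything is stored in it. A minimal sketch of that behavior, using the shortened subscript "www" and the variable name demo purely for illustration:

```shell
demo=$(awk 'BEGIN {
    # The "in" test checks for the room without creating it.
    if (!("www" in hotel)) print "no such room yet"
    hotel["www"]++        # the empty value counts as 0, so this stores 1
    print hotel["www"]
}')
printf '%s\n' "$demo"
```

This is why no explicit initialization such as hotel[$2] = 0 is needed before the first ++.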

Detailed process table

Only the contents of the hotel["www"] room are tracked.

line number   content   hotel["www"] before   hotel["www"] = hotel["www"] + 1 step   hotel["www"] after
1             www       empty                 hotel["www"] = empty + 1               1
2             www       1                     hotel["www"] = 1 + 1                   2
3             post      2                     not www, so nothing is added           2
4             mp3       2                     not www, so nothing is added           2
5             www       2                     hotel["www"] = 2 + 1                   3
6             post      3                     not www, so nothing is added           3
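The table can be reproduced by letting awk print the state of the www room before and after each line. This is a sketch under simplifying assumptions: the input is the shortened names from the table rather than full URLs, and the variable names trace and before are chosen here for illustration.

```shell
trace=$(printf '%s\n' www www post mp3 www post | awk '
    {
        # Look at the room before counting; "in" does not create the element.
        before = ("www" in hotel) ? hotel["www"] : "empty"
        hotel[$1]++                       # same counting step as hotel[$2]++
        print NR, $1, before, hotel["www"]
    }')
printf '%s\n' "$trace"
```

Each output line holds the line number, the content, the value before, and the value after, in the same order as the table columns.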


    • Summary:

The final result is:

[awkfile]# awk -F "/+" '{hotel[$2]++}END{for(pol in hotel) print pol,hotel[pol]}' url.txt|sort -rnk2
www.etiantian.org 3
post.etiantian.org 2
mp3.etiantian.org 1

Tidied-up result:

[awkfile]# awk -F "/+" '{hotel[$2]++}END{for(pol in hotel) print pol,hotel[pol]}' url.txt|sort -rnk2|column -t
www.etiantian.org   3
post.etiantian.org  2
mp3.etiantian.org   1

Piping the output through the column command aligns the results neatly, and it is simpler than formatting them with awk's printf.
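For comparison, the printf route might look like the sketch below. The column width of 20 is an assumption picked to fit these three domains and has to be chosen by hand, which is the extra effort that column -t avoids; the variable name aligned is also illustrative.

```shell
aligned=$(printf '%s\n' 'www.etiantian.org 3' 'post.etiantian.org 2' 'mp3.etiantian.org 1' |
    awk '{printf "%-20s %s\n", $1, $2}')   # %-20s left-aligns the name in 20 characters
printf '%s\n' "$aligned"
```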


This article comes from the "Long Wing blog". Please retain this source when reposting: http://youjiu.blog.51cto.com/3388056/1912219
