Oldboy Education Daily Question, March 2017: awk Array Statistics
Process the following file contents: extract the domain name from each line, count how many times each domain appears, and sort the domains by that count. (A Baidu and Sohu interview question.)
    http://www.etiantian.org/index.html
    http://www.etiantian.org/1.html
    http://post.etiantian.org/index.html
    http://mp3.etiantian.org/index.html
    http://www.etiantian.org/3.html
    http://post.etiantian.org/2.html
Required results:
    mp3.etiantian.org 1
    post.etiantian.org 2
    www.etiantian.org 3
Ideas:
1. Extract the domain name
   - Use the slash as the knife (the field separator) and take the second column, which is the domain name
2. Process it
   - Create an array
   - Use the second column (the domain name) as the array subscript
   - Count occurrences in a form similar to i++
3. Output the result once the counting is done
Answer:
    awk -F "/+" '{hotel[$2]++}END{for(pol in hotel) print pol,hotel[pol]}' url.txt|sort -rnk2
Demonstration:
    awk -F "/+" '{hotel[$2]++}END{for(pol in hotel) print pol,hotel[pol]}' url.txt|sort -rnk2
    www.etiantian.org 3
    post.etiantian.org 2
    mp3.etiantian.org 1
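To reproduce the demonstration without an existing url.txt, the following is a minimal sketch. It assumes a POSIX shell with mktemp available; the temporary file and the variable names urls and result are illustrative and not part of the original answer.

```shell
# Recreate the sample input in a temporary file (the path is arbitrary).
urls=$(mktemp)
cat > "$urls" <<'EOF'
http://www.etiantian.org/index.html
http://www.etiantian.org/1.html
http://post.etiantian.org/index.html
http://mp3.etiantian.org/index.html
http://www.etiantian.org/3.html
http://post.etiantian.org/2.html
EOF

# Count each domain, then sort numerically on the count column, largest first.
result=$(awk -F "/+" '{hotel[$2]++} END {for (pol in hotel) print pol, hotel[pol]}' "$urls" | sort -rnk2)
printf '%s\n' "$result"

rm -f "$urls"
```

The sort step is needed because awk does not guarantee any particular order when looping over an array with for (pol in hotel).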
Let's not analyze this result as a whole just yet. Instead, we break it into stages and look at each one in turn.
First, the problem asks us to extract the domain name. With awk -F "/+", the + means that one or more consecutive slashes count as a single separator, so the "//" after "http:" does not produce an empty field.
    awk -F "/+" '{print $2}' url.txt
    www.etiantian.org
    www.etiantian.org
    post.etiantian.org
    mp3.etiantian.org
    www.etiantian.org
    post.etiantian.org
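To see why the + matters, here is a small sketch that splits one sample line with a single-slash separator and then with "/+". The variable names line, single and multi are illustrative only.

```shell
line='http://www.etiantian.org/index.html'

# Single slash: the "//" leaves an empty field between the two slashes,
# so the domain ends up in $3 instead of $2.
single=$(printf '%s\n' "$line" | awk -F "/" '{print NF ": [" $2 "] [" $3 "]"}')

# One or more slashes: the "//" counts as one separator, so the domain is $2.
multi=$(printf '%s\n' "$line" | awk -F "/+" '{print NF ": [" $2 "] [" $3 "]"}')

printf '%s\n%s\n' "$single" "$multi"
```

Taking $3 with -F "/" would also work for this input; "/+" simply keeps the domain in the second column, as the explanation assumes.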
Next we create the array. The array is named hotel, and each element is a room.
    awk -F "/+" '{hotel[$2]}' url.txt            ## create the array (prints nothing)
    awk -F "/+" '{hotel[$2];print $2}' url.txt   ## create the array and print each element name (room number)
    www.etiantian.org
    www.etiantian.org
    post.etiantian.org
    mp3.etiantian.org
    www.etiantian.org
    post.etiantian.org

    awk -F "/+" '{hotel[$2]++}' url.txt                       ## create the array and count (prints nothing)
    awk -F "/+" '{hotel[$2]++;print $2,hotel[$2]}' url.txt    ## print each element name (room number) and the room's content
    www.etiantian.org 1
    www.etiantian.org 2
    post.etiantian.org 1
    mp3.etiantian.org 1
    www.etiantian.org 3
    post.etiantian.org 2
Here $2 is the second column of each line, so it is a variable whose value changes from line to line.
hotel[$2]++ works like the familiar i++, except that the variable i is replaced by the array element hotel[$2]. It is like swapping a single room for a whole apartment building: every distinct domain gets its own room with its own counter.
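The comparison with i++ can be tried directly. This is a minimal sketch; the keys "a" and "b" and the shell variable counts are made up for illustration.

```shell
# An uninitialized variable or array element behaves as 0 in arithmetic,
# so the first ++ stores 1 without any explicit initialization.
counts=$(awk 'BEGIN {
    i++; i++                                   # one plain counter: a single room
    hotel["a"]++; hotel["b"]++; hotel["a"]++   # one counter per key: a building of rooms
    print i, hotel["a"], hotel["b"]
}')
printf '%s\n' "$counts"
```

Each subscript gets its own independent counter, which is what lets a single awk pass count every domain at once.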
Here is a detailed look at how awk counts the number of times www.etiantian.org appears.
We follow only the count for www.etiantian.org.
Read the first line:
Splitting on "/+" (one or more consecutive slashes) yields www.etiantian.org as the second column.
Used as an array subscript, it becomes hotel["www.etiantian.org"].
Counting means hotel["www.etiantian.org"] = hotel["www.etiantian.org"] + 1.
The www.etiantian.org room in the hotel starts out with nothing in it, which you can think of as empty, and awk treats empty as 0 in arithmetic. So hotel["www.etiantian.org"] = empty + 1, and the room ends up holding the number 1.
Read the second line:
It is www.etiantian.org.
Counting means hotel["www.etiantian.org"] = hotel["www.etiantian.org"] + 1.
The www.etiantian.org room already holds the number 1, so now
hotel["www.etiantian.org"] = 1 + 1, and the room's content becomes 2.
Read the third line:
It is post.etiantian.org.
This is not the www.etiantian.org we are following, so the www.etiantian.org room still holds 2.
Read the fourth line:
It is mp3.etiantian.org.
This is not www.etiantian.org either, so the www.etiantian.org room still holds 2.
Read the fifth line:
It is www.etiantian.org.
Counting means hotel["www.etiantian.org"] = hotel["www.etiantian.org"] + 1.
The www.etiantian.org room already holds the number 2, so now
hotel["www.etiantian.org"] = 2 + 1, and the room's content becomes 3.
Read the sixth line:
It is post.etiantian.org.
This is not www.etiantian.org, so the www.etiantian.org room still holds 3.
Detailed process table
(showing only the contents of the hotel["www"] room)

line number | content | hotel["www"] before | hotel["www"] = hotel["www"] + 1 | hotel["www"] after
----------- | ------- | ------------------- | ------------------------------- | ------------------
1           | www     | empty               | hotel["www"] = empty + 1        | 1
2           | www     | 1                   | hotel["www"] = 1 + 1            | 2
3           | post    | 2                   | not www, nothing added          | 2
4           | mp3     | 2                   | not www, nothing added          | 2
5           | www     | 2                   | hotel["www"] = 2 + 1            | 3
6           | post    | 3                   | not www, nothing added          | 3
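The same trace can be printed by awk itself. This is a sketch under a few assumptions: the sample data is recreated in a temporary file, and the variables before and trace are illustrative names that do not appear in the original answer.

```shell
urls=$(mktemp)
cat > "$urls" <<'EOF'
http://www.etiantian.org/index.html
http://www.etiantian.org/1.html
http://post.etiantian.org/index.html
http://mp3.etiantian.org/index.html
http://www.etiantian.org/3.html
http://post.etiantian.org/2.html
EOF

# For every input line, show the www room before and after the increment.
trace=$(awk -F "/+" '
{
    before = hotel["www.etiantian.org"]   # empty until the first increment
    hotel[$2]++                           # only the room named by $2 changes
    printf "%d %s [%s] -> [%s]\n", NR, $2, before, hotel["www.etiantian.org"]
}' "$urls")
printf '%s\n' "$trace"

rm -f "$urls"
```

Each output line gives the line number, the domain on that line, and the www room's content before and after, matching the rows of the table.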
The end result is:
    awk -F "/+" '{hotel[$2]++}END{for(pol in hotel) print pol,hotel[pol]}' url.txt|sort -rnk2
    www.etiantian.org 3
    post.etiantian.org 2
    mp3.etiantian.org 1
Optimized results:
    awk -F "/+" '{hotel[$2]++}END{for(pol in hotel) print pol,hotel[pol]}' url.txt|sort -rnk2|column -t
    www.etiantian.org   3
    post.etiantian.org  2
    mp3.etiantian.org   1
You can pipe the output through the column command to align it neatly, which is simpler than formatting with awk's printf.
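For comparison, here is a rough sketch of the printf approach. The column width of 20 is an assumption chosen by hand to fit these domain names, and the variable name formatted is illustrative.

```shell
# Left-justify the domain in a fixed 20-character column, then print the count.
formatted=$(printf '%s\n' 'www.etiantian.org 3' 'post.etiantian.org 2' 'mp3.etiantian.org 1' |
    awk '{printf "%-20s %s\n", $1, $2}')
printf '%s\n' "$formatted"
```

With printf the width has to be picked in advance, whereas column -t works it out from the data, which is why it is the more convenient choice here.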
This article is from the "Long Wing blog". Please keep this source attribution: http://youjiu.blog.51cto.com/3388056/1912219
Oldboy Education Daily Question, March 31, 2017: awk array statistics