Some time ago I needed to rank the domain names in our DNS logs by number of queries over a given period (top 100). There was no shortcut, so I had to parse the DNS logs myself. I had just started learning Python, so I took the chance to practice.
1. Raw Data analysis:
First, look at the raw data file, that is, the contents of the DNS log. Below are a few representative entries. In addresses such as 2x8.2x1.2x.1x5, each x stands for a digit that I masked.
13-08-30 03:11:34,226 info:queries:–|1x3.2x8.2x0.2x0|config.dengluqi.net| |config.34245.com.; 127.0.0.1;| | a|success|+|–g--QR Rd RA |1|
13-08-30 03:11:34,229 info:queries:–|1x3.2x8.2x.2x8|p19.qhimg.com|default|2x8.2x1.2x.1x5;|default;|a|success|+|- W-QR AA Rd RA |8061|
13-08-30 03:11:34,238 info:queries:–|1x3.2x8.x.9x|shu.taobao.com|default|2x8.2x1.2x.1x5;|default;|a|success|+|-w -QR AA Rd RA |59034|
13-08-30 03:11:34,238 Info:queries:–|1x3.2x8.2x7.1x2|cncjn.phn.live.baofeng.net|default|2x8.2x1.2x.17x;|default ; |a|success|+|-w-qr AA Rd RA |3004|
As you can see, the fields in each log line are separated by |. The domain name, such as shu.taobao.com, is the data we want. As for the query count, each log line for a domain counts as one query. So we can settle three points:
a) Use | as the separator.
b) After splitting, the domain name is the piece at index 2 (the timestamp prefix is index 0 and the client IP is index 1). It is the target data, and we use it as the dictionary key.
c) The value stored under each key is the number of queries for the corresponding domain name.
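The three points above can be tried on a single line. The following is a minimal sketch in Python 3 (the script later in this post is Python 2), using one of the sample log lines quoted above; the variable names fields, client_ip and counts are only for illustration.

```python
# One of the sample lines from the log excerpt above.
line = ("13-08-30 03:11:34,238 info:queries:–|1x3.2x8.x.9x|shu.taobao.com"
        "|default|2x8.2x1.2x.1x5;|default;|a|success|+|-w -QR AA Rd RA |59034|")

fields = line.split("|")   # a) "|" is the separator
client_ip = fields[1]      # masked client address
domain = fields[2]         # b) the domain name is at index 2

counts = {}                # c) domain name -> number of queries seen
counts[domain] = counts.get(domain, 0) + 1

print(client_ip, domain, counts[domain])
```

Using counts.get(domain, 0) avoids a separate check for whether the domain is already in the dictionary.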
2. Script idea:
a) Our DNS logs are rotated automatically at intervals and compressed into .gz files, so the script must first open the file with gzip.open, which requires importing the gzip module.
b) We want the domain ranking for a particular period, so the log lines must be filtered by time. I used a regular expression for this, so the re module is imported as well.
c) The results must be sorted before the top N can be printed. The counts are stored in a dictionary, and a dictionary is unordered, so we need a suitable way to sort it. Passing the dictionary's iteritems to sorted does the job.
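Steps a) and b), together with the counting, might look roughly like this in Python 3. This is only a sketch, not the script used in this post: the function name count_domains and its parameters are made up for illustration, and it assumes the log is UTF-8 text with the domain at index 2 after splitting on |.

```python
import gzip
import re


def count_domains(path, time_pattern):
    """Count queries per domain for log lines that match time_pattern."""
    pattern = re.compile(time_pattern)
    counts = {}
    # Mode "rt" reads the compressed file as text, so each line is a str.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as log_file:
        for line in log_file:
            if not pattern.search(line):
                continue  # outside the requested period
            fields = line.split("|")
            if len(fields) < 3:
                continue  # malformed line without a domain field
            domain = fields[2]
            counts[domain] = counts.get(domain, 0) + 1
    return counts
```

Calling count_domains on a log file with a pattern such as r"13-08-04 19:1[1-5]" would return a dictionary of per-domain counts for those five minutes. Iterating over the file object reads one line at a time, so the whole log never has to fit in memory.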
3. Script writing:
With the general idea clear, the script is easy to write.
The code is as follows:
# written by Siashero
import gzip
import re

# Raw string so the backslashes in the Windows path are taken literally.
log_file = gzip.open(r"e:\python_programs\queries.log.CBN-XA-1-3N3.20130803160052.gz")
domain_list = {}  # domain name -> number of queries

# The input is used as a regular expression, so the hint shows one.
print "Time format is 13-08-04 19:1[1-5]"
time = raw_input("Please enter a time for you want")
while True:
    line = log_file.readline()
    if not line:
        break
    # Keep only the lines whose text matches the requested time.
    if re.search(time, line):
        domain = line.split('|')[2]
        if domain in domain_list:
            domain_list[domain] += 1
        else:
            domain_list[domain] = 1
log_file.close()

count = 0
# Sort the (domain, count) pairs by count, largest first.
for v in sorted(domain_list.iteritems(), key=lambda x: x[1], reverse=True):
    print v[1], v[0]
    count += 1
    # print only the top 20 domains
    if count >= 20:
        break
raw_input("Enter a word to finish")
A few notes on the script. It reads one specific, hardcoded log file (for example queries.log.cmn-cq.20130830031330.gz). The main work is done with a dictionary: the domain field is the key, and the value stored under that key is the query count.
The dictionary's iteritems method is then called to produce an iterator, which is sorted by count, and finally the top domain names are printed (the top 20 as written; raise the limit to get the top 100).
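To see the sorting step in isolation, here is a small Python 3 sketch (in Python 3, items takes the place of iteritems). The counts are invented for illustration. It also shows collections.Counter.most_common from the standard library, which is an alternative to the approach used in the script, not part of it.

```python
from collections import Counter

# Hypothetical counts, made up for illustration only.
counts = {"a.example": 3, "b.example": 7, "c.example": 5}

# Same idea as the script: sort (domain, count) pairs by count, descending.
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)

# Counter.most_common(n) returns the n pairs with the highest counts.
top_two = Counter(counts).most_common(2)

for domain, hits in ranked[:2]:
    print(hits, domain)
```

Slicing the sorted list with ranked[:n] removes the need for a manual counter and break when limiting the output to the top n.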
The final raw_input("Enter a word to finish") is there because I tested on Win7, where by default the console window flashes by and closes as soon as the script ends. The line exists purely so that I can read the results; on Linux it can be deleted.
One slightly awkward point is that the time filter is a regular expression, so the input has to be written as a regular expression, which is a hassle.
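One possible way around the regular-expression input, sketched in Python 3, is to ask for a plain start and end time and compare them with the timestamp prefix of each line. This is not what the script above does. It assumes that every line starts with a zero-padded "yy-mm-dd HH:MM" timestamp, as in the samples, so that string order matches time order. The function name in_window and the shortened log lines are hypothetical.

```python
def in_window(line, start, end):
    """Return True if the line's timestamp lies between start and end.

    start and end are "yy-mm-dd HH:MM" strings, both inclusive.
    """
    # The first 14 characters of a log line are "yy-mm-dd HH:MM".
    return start <= line[:14] <= end


# Shortened, hypothetical log lines for illustration.
lines = [
    "13-08-04 19:10:59,100 info:queries:-|10.0.0.1|a.example|",
    "13-08-04 19:11:00,200 info:queries:-|10.0.0.2|b.example|",
    "13-08-04 19:15:59,300 info:queries:-|10.0.0.3|c.example|",
    "13-08-04 19:16:00,400 info:queries:-|10.0.0.4|d.example|",
]

selected = [entry.split("|")[2] for entry in lines
            if in_window(entry, "13-08-04 19:11", "13-08-04 19:15")]
print(selected)
```

A plain string comparison like this also avoids running a regular expression over every line.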
4. Running the script:
After all that talk, let's run it and see the result.
You can see that the top 20 domain names are printed as expected.
5. Summary:
The script basically meets the requirement, but many details are not handled well. For example, filtering the time period with a regular expression will hurt performance on large amounts of data. Finally, thanks to my colleague: I copied the dictionary sorting method from him.