Python script enables analysis of DNS logs and

Python script enables analysis of DNS logs and _python of visited domain names

Last Update:2017-01-19 Source: Internet

Author: User

Tags gz file in domain python script

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

Some time ago there is a need to check for a period of time DNS domain name access number ranking (top100), no way, had to slowly to parse DNS log bai, just learn python, take to practice practicing.

1. Raw Data analysis:

First look at the original data file, that is, the DNS log content, the following is extracted a few representative logs, 2x8.2x1.2x.1x5 this middle x is the corresponding number was erased by me.

Copy Code code as follows:

You can see that the middle of the log is used | Split, shu.taobao.com that is the data we want to the domain name, as for the number of domain name access statistics, each domain name of a record of a visit. So we can determine two points:

A) use | As a separator

b The second field domain is the target data, we use as the key value, namely the dictionary key

c) Domain[key] The number of accesses to the corresponding domain name

2. Script idea:

A) Our DNS log is a period of time automatically cut, compressed into GZ files, so you must first use Gzip.open to open the Gz file, where you need to import the GZ library.

b) To seek is a period of time domain name ranking, so must be filtered for a period of time, here I used the regular way to filter, so import re regular library.

c) Sorting, the results must be sorted, and then output TOPXX results, because the dictionary is saved, and the dictionary is confused, so there must be a suitable way to sort, the dictionary iteritems just apply.

3. Script writing:

Understand the general point, the script is easy to write.

The code is as follows:

Copy Code code as follows:

#write by Siashero
Import gzip
Import re
File = Gzip.open ("e:\python_programs\queries.log.CBN-XA-1-3N3.20130803160052.gz")
domain_list= {}
Print "Time format is 13-08-04 19:1{1,2,3,4,5}"
Time = raw_input ("Please enter a time for you want")
While True:
line = File.readline ()
If not line:
Break
If Re.search (time,line):
Domain = line.split (' | ') [2]
If domain in Domain_list:
Domain_list[domain] + + 1
Else
Domain_list[domain] = 1
Count = 0
For V in Sorted (Domain_list.iteritems (), Key =lambda x:x[1],reverse=true):
Print V[1],v[0]
#to print the only TOP20 domain
If Count > 20:
Break
Count + 1
Raw_input ("Enter a word to finish")
File.close

A little bit of the script content, queries.log.cmn-cq.20130830031330.gz as a specific target file, the script is mainly in the dictionary storage, in domain field as Key,domain[key] storage access times.

The Iteritems method of the dictionary is called later to sort by the production iterator, and the TOP100 domain name is finally entered.

The final raw_input ("Enter a word to finish") is because I tested under Win7, the default execution is a flash past, join this line of pure broken to observe the results, Linux can be deleted.

Here is slightly awkward is the time filter is used is the regular to filter, so require input must be a regular way, this trouble.

3. Implementation

Said a long time, or first run down to see the effect of it.

You can see the normal output of the TOP20 domain name.

4. Summary:

Generally achieved the corresponding requirements, but a lot of documents are not handled well. For example, the use of regular to filter the time period, in a large amount of data will have an impact on performance. At the same time thank colleagues, the final dictionary of the sorting method I copied him, thank you

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More