Log archiving and data mining


http://netkiller.github.io/journal/log.html
Mr. Neo Chen (Chen Jingfeng), netkiller, BG7NYT

China Guangdong province Shenzhen Khe Sanh Street, Longhua District, civil Administration
+86 13113668890
+86 755 29812080
<[email protected]>

Copyright © Netkiller. All rights reserved.

Copyright Notice

To reprint this article, please contact the author, and be sure to credit the original source, the author's information, and this statement.

Document Source:



2013-03-19 First Edition

2014-12-16 Second Edition

My series of documents
Netkiller Architect Codex Netkiller Developer Codex Netkiller PHP Codex Netkiller Python Codex Netkiller Testing Codex
Netkiller Cryptography Codex Netkiller Linux Codex Netkiller Debian Codex Netkiller CentOS Codex Netkiller FreeBSD Codex
Netkiller Shell Codex Netkiller Security Codex Netkiller Web Codex Netkiller Monitoring Codex Netkiller Storage Codex
Netkiller Mail Codex Netkiller Docbook Codex Netkiller Version Codex Netkiller Database Codex Netkiller PostgreSQL Codex
Netkiller MySQL Codex Netkiller NoSQL Codex Netkiller LDAP Codex Netkiller Network Codex Netkiller Cisco IOS Codex
Netkiller H3C Codex Netkiller Multimedia Codex Netkiller Perl Codex Netkiller Amateur Radio Codex Netkiller DevOps Codex
    • 1. What is log archiving
    • 2. Why archive logs
    • 3. When to archive logs
    • 4. Where to store archived logs
    • 5. Who does the log archiving
    • 6. How to archive logs
      • 6.1. Log format conversion
        • 6.1.1. Putting logs into the database
        • 6.1.2. Apache Pipe
        • 6.1.3. Log format
        • 6.1.4. Log import to MongoDB
      • 6.2. Log center solution
        • 6.2.1. Software Installation
        • 6.2.2. Node push end
        • 6.2.3. Log Collection End
        • 6.2.4. Log monitoring
1. What is log archiving

Archiving is the process of systematically collecting logs that are complete and worth preserving, and saving them on a log server.

2. Why archive logs
    • To look up historical logs at any time.
    • To mine the logs for valuable data.
    • To check the working status of applications.
3. When to archive logs

Log archiving should be a policy laid down by the enterprise (an "archiving policy"), and it should be taken into account from the very start of building a system. If your company has no such task or policy, I suggest you put one in place as soon as you finish reading this article.

4. Where to store archived logs

A simple setup can use a single-node server plus a backup scheme.

As log volume grows, you will eventually need a distributed file system, and possibly off-site disaster recovery as well.

5. Who does the log archiving

My answer: automate the archiving, and check it manually through inspection or sampling.

6. How to archive logs

There are several ways to gather the logs of all servers in one place.

Common methods for log archiving:
    • FTP download: suitable for small files and low log volume. Logs are downloaded to a designated server; the drawbacks are repeated transfers and poor real-time behavior.
    • rsyslog and similar programs: more general-purpose, but awkward to extend.
    • rsync synchronization: well suited to syncing files and better than FTP, but still poor in real-time terms.
6.1. Log format conversion

First, let me introduce a simple solution.

I wrote a program in D that parses web logs with a regular expression and passes the result through a pipe to a database program.

6.1.1. Putting logs into the database

Process the web server log through a pipe, then write the result to the database.

Source code of the processing program

    $ vim match.d
    import std.regex;
    import std.stdio;
    import std.string;
    import std.array;

    void main()
    {
        // nginx (ten capture groups)
        //auto r = regex(`^(\S+) (\S+) (\S+) \[(.+)\] "([^"]+)" ([0-9]{3}) ([0-9]+) "([^"]+)" "([^"]+)" "([^"]+)"`);

        // apache2 (nine capture groups)
        auto r = regex(`^(\S+) (\S+) (\S+) \[(.+)\] "([^"]+)" ([0-9]{3}) ([0-9]+) "([^"]+)" "([^"]+)"`);

        foreach (line; stdin.byLine)
        {
            foreach (m; match(line, r))
            {
                //writeln(m.hit);
                auto c = m.captures;
                c.popFront();   // drop the whole-match entry, keep only the groups
                //writeln(c);
                auto value = join(c, "\",\"");
                // Note: the column list names ten columns. With the nine-group
                // apache2 pattern, remove http_x_forwarded_for from the list.
                auto sql = format("insert into log(remote_addr,unknow,remote_user,time_local,request,status,body_bytes_sent,http_referer,http_user_agent,http_x_forwarded_for) value(\"%s\");", value);
                writeln(sql);
            }
        }
    }


Compile

    $ dmd match.d
    $ strip match
    $ ls
    match  match.d  match.o

Simple usage

    $ cat access.log | ./match

Advanced usage

    $ cat access.log | match | mysql -hlocalhost -ulog -p123456 logging
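
For readers who do not use D, the same filter can be sketched in Python. This is only an illustration, not the author's program: it reuses the nine-group Apache pattern and the column names from match.d, while the function name `to_insert` and the sample line (with a documentation IP address) are made up for this sketch.

```python
import re

# Same pattern as the apache2 variant in match.d: nine capture groups.
COMBINED = re.compile(
    r'^(\S+) (\S+) (\S+) \[(.+)\] "([^"]+)" ([0-9]{3}) ([0-9]+) "([^"]+)" "([^"]+)"'
)

# Nine columns, one per capture group (http_x_forwarded_for is left out).
COLUMNS = (
    "remote_addr", "unknow", "remote_user", "time_local", "request",
    "status", "body_bytes_sent", "http_referer", "http_user_agent",
)


def to_insert(line):
    """Return an INSERT statement for one log line, or None if it does not match."""
    m = COMBINED.match(line)
    if m is None:
        return None
    # Escape backslashes and double quotes before embedding the values in SQL.
    values = [v.replace("\\", "\\\\").replace('"', '\\"') for v in m.groups()]
    return 'insert into log({}) value("{}");'.format(
        ",".join(COLUMNS), '","'.join(values)
    )


# Hypothetical sample line, used only to exercise the function.
sample = (
    '192.0.2.10 - - [21/Mar/2013:16:11:00 +0800] "GET / HTTP/1.1" 304 208 '
    '"-" "Mozilla/5.0 (Windows NT 6.1; WOW64)"'
)
print(to_insert(sample))
```

In a pipeline the function would be applied to each line read from standard input, in the same position that ./match occupies above.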

To process logs in real time, first create a named pipe and have the log written to that pipe.

    $ cat <pipe name> | match | mysql -hlocalhost -ulog -p123456 logging

This allows for real-time log insertions.
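
The named-pipe approach can be tried out with a small Python sketch. It assumes a POSIX system (os.mkfifo is not available on Windows); the pipe name access.pipe and the writer thread that stands in for the web server are hypothetical.

```python
import os
import tempfile
import threading


def follow_pipe(path):
    """Yield lines from a named pipe until every writer has closed it."""
    with open(path, "r") as pipe:  # blocks until a writer opens the pipe
        for line in pipe:
            yield line.rstrip("\n")


# Demo setup: create the pipe in a scratch directory.
workdir = tempfile.mkdtemp()
fifo = os.path.join(workdir, "access.pipe")  # hypothetical pipe name
os.mkfifo(fifo)


def fake_server():
    """Stand-in for the web server writing its access log to the pipe."""
    with open(fifo, "w") as out:
        out.write("line one\nline two\n")


writer = threading.Thread(target=fake_server)
writer.start()
received = list(follow_pipe(fifo))  # returns once the writer closes the pipe
writer.join()

os.remove(fifo)
os.rmdir(workdir)
print(received)
```

A reader like this sees each line as soon as the writer flushes it, which is what makes the pipe suitable for real-time insertion.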


The program above can be adapted to produce versions for HBase or Hypertable.

6.1.2. Apache Pipe

Apache log pipe filter: CustomLog "| /srv/match >> /tmp/access.log" combined

    <VirtualHost *:80>
            ServerAdmin [email protected]

            #DocumentRoot /var/www
            DocumentRoot /www
            <Directory />
                    Options FollowSymLinks
                    AllowOverride None
            </Directory>
            #<Directory /var/www/>
            <Directory /www/>
                    Options Indexes FollowSymLinks MultiViews
                    AllowOverride None
                    Order allow,deny
                    allow from all
            </Directory>

            ScriptAlias /cgi-bin/ /usr/lib/cgi-bin/
            <Directory "/usr/lib/cgi-bin">
                    AllowOverride None
                    Options +ExecCGI -MultiViews +SymLinksIfOwnerMatch
                    Order allow,deny
                    Allow from all
            </Directory>

            ErrorLog ${APACHE_LOG_DIR}/error.log

            # Possible values include: debug, info, notice, warn, error, crit,
            # alert, emerg.
            LogLevel warn

            #CustomLog ${APACHE_LOG_DIR}/access.log combined
            CustomLog "| /srv/match >> /tmp/access.log" combined

            Alias /doc/ "/usr/share/doc/"
            <Directory "/usr/share/doc/">
                    Options Indexes MultiViews FollowSymLinks
                    AllowOverride None
                    Order deny,allow
                    Deny from all
                    Allow from 127.0.0.0/255.0.0.0 ::1/128
            </Directory>
    </VirtualHost>

Log output after conversion by the pipe

    $ tail /tmp/access.log
    insert into log(remote_addr,unknow,remote_user,time_local,request,status,body_bytes_sent,http_referer,http_user_agent,http_x_forwarded_for) value("","-","-","21/Mar/2013:16:11:00 +0800","GET / HTTP/1.1","304","208","-","Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.172 Safari/537.22");
    insert into log(remote_addr,unknow,remote_user,time_local,request,status,body_bytes_sent,http_referer,http_user_agent,http_x_forwarded_for) value("","-","-","21/Mar/2013:16:11:00 +0800","GET /favicon.ico HTTP/1.1","404","501","-","Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.172 Safari/537.22");
    insert into log(remote_addr,unknow,remote_user,time_local,request,status,body_bytes_sent,http_referer,http_user_agent,http_x_forwarded_for) value("","-","-","21/Mar/2013:16:11:00 +0800","GET / HTTP/1.1","304","208","-","Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.172 Safari/537.22");

6.1.3. Log format

By defining LogFormat, you can output logs directly in SQL form.

Apache's default formats:

    LogFormat "%v:%p %h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"" vhost_combined
    LogFormat "%h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"" combined
    LogFormat "%h %l %u %t \"%r\" %>s %O" common
    LogFormat "%{Referer}i -> %U" referer
    LogFormat "%{User-agent}i" agent

Nginx's default format:

    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';

However, SQL-shaped log lines get in the way when system administrators analyze logs with grep, awk, sed, sort, and uniq. So I still recommend keeping the standard format and breaking it apart with regular expressions.
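
As a small illustration of why the standard format stays convenient to analyze, the sketch below counts requests per HTTP status code with a regular expression, roughly the kind of tally an administrator would otherwise build with awk, sort, and uniq. The pattern, the function name `status_counts`, and the sample lines are assumptions made for this example.

```python
import re
from collections import Counter

# Client address, two ignored fields, [time], "request", then the status code.
LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" ([0-9]{3}) ')


def status_counts(lines):
    """Count how many log lines carry each HTTP status code."""
    counts = Counter()
    for line in lines:
        m = LINE.match(line)
        if m:
            counts[m.group(2)] += 1
    return counts


# Hypothetical sample lines in combined log format.
sample_lines = [
    '192.0.2.10 - - [21/Mar/2013:16:11:00 +0800] "GET / HTTP/1.1" 304 208 "-" "Mozilla/5.0"',
    '192.0.2.10 - - [21/Mar/2013:16:11:00 +0800] "GET /favicon.ico HTTP/1.1" 404 501 "-" "Mozilla/5.0"',
    '192.0.2.11 - - [21/Mar/2013:16:11:01 +0800] "GET / HTTP/1.1" 304 208 "-" "Mozilla/5.0"',
]
counts = status_counts(sample_lines)
for status, total in sorted(counts.items()):
    print(status, total)
```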

To generate a regular, comma-separated log format in Apache:

    LogFormat \
            "\"%h\",%{%Y%m%d%H%M%S}t,%>s,\"%b\",\"%{Content-Type}o\",  \
            \"%U\",\"%{Referer}i\",\"%{User-Agent}i\""

Import the access.log file into MySQL

    LOAD DATA INFILE '/local/access_log' INTO TABLE tbl_name
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' ESCAPED BY '\\'
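
A log written in this comma-separated format is also easy to read outside MySQL. The following Python sketch parses one such line with the standard csv module. The field names and the sample line are invented for the example, and the byte count is kept as text because Apache writes "-" when no body is sent.

```python
import csv
import io
from datetime import datetime

# Field names chosen for this sketch, in the order of the LogFormat directives.
FIELDS = ["remote_host", "time", "status", "bytes", "content_type",
          "url_path", "referer", "user_agent"]


def parse(text):
    """Parse comma-separated log text into a list of dictionaries."""
    reader = csv.reader(io.StringIO(text), escapechar="\\")
    rows = []
    for record in reader:
        row = dict(zip(FIELDS, record))
        row["time"] = datetime.strptime(row["time"], "%Y%m%d%H%M%S")
        row["status"] = int(row["status"])
        rows.append(row)
    return rows


# Hypothetical sample line; the quoted User-Agent contains a comma.
sample = ('"192.0.2.10",20130321161100,304,"208","text/html","/index.html",'
          '"-","Mozilla/5.0 (KHTML, like Gecko)"\n')
rows = parse(sample)
print(rows[0]["time"], rows[0]["status"], rows[0]["user_agent"])
```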
6.1.4. Log import to MongoDB
    # rpm -Uvh http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
    # yum install mongodb

Log processing program in D

    import std.regex;
    //import std.range;
    import std.stdio;
    import std.string;
    import std.array;

    void main()
    {
        // nginx (ten capture groups)
        auto r = regex(`^(\S+) (\S+) (\S+) \[(.+)\] "([^"]+)" ([0-9]{3}) ([0-9]+) "([^"]+)" "([^"]+)" "([^"]+)"`);
        // apache2 (nine capture groups)
        //auto r = regex(`^(\S+) (\S+) (\S+) \[(.+)\] "([^"]+)" ([0-9]{3}) ([0-9]+) "([^"]+)" "([^"]+)"`);
        foreach (line; stdin.byLine)
        {
            //writeln(line);
            //auto m = match(line, r);
            foreach (m; match(line, r))
            {
                //writeln(m.hit);
                auto c = m.captures;
                c.popFront();   // drop the whole-match entry, keep only the groups
                //writeln(c);
                /*
                SQL
                auto value = join(c, "\",\"");
                auto sql = format("insert into log(remote_addr,unknow,remote_user,time_local,request,status,body_bytes_sent,http_referer,http_user_agent,http_x_forwarded_for) value(\"%s\");", value);
                writeln(sql);
                */
                // MongoDB: c[1] (the unused second field) is skipped on purpose
                string bson = format("db.logging.access.save({'remote_addr': '%s', 'remote_user': '%s', 'time_local': '%s', 'request': '%s', 'status': '%s', 'body_bytes_sent': '%s', 'http_referer': '%s', 'http_user_agent': '%s', 'http_x_forwarded_for': '%s'})",
                    c[0], c[2], c[3], c[4], c[5], c[6], c[7], c[8], c[9]);
                writeln(bson);
            }
        }
    }
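
The D program pastes each value between single quotes, so a field that contains an apostrophe (for example in a User-Agent string) would produce an invalid statement. One way around this, sketched here in Python as an assumption rather than as the author's method, is to emit each record as a JSON document and let the JSON encoder handle quoting. The function name `to_document` and the sample line are hypothetical; the pattern and key names come from the D program.

```python
import json
import re

# nginx variant from the D program: ten capture groups.
NGINX = re.compile(
    r'^(\S+) (\S+) (\S+) \[(.+)\] "([^"]+)" ([0-9]{3}) ([0-9]+) '
    r'"([^"]+)" "([^"]+)" "([^"]+)"'
)
KEYS = ("remote_addr", "unknow", "remote_user", "time_local", "request",
        "status", "body_bytes_sent", "http_referer", "http_user_agent",
        "http_x_forwarded_for")


def to_document(line):
    """Return one log line as a JSON document string, or None if it does not match."""
    m = NGINX.match(line)
    if m is None:
        return None
    doc = dict(zip(KEYS, m.groups()))
    del doc["unknow"]  # the D program skips this field (c[1]) as well
    return json.dumps(doc)


# Hypothetical sample line whose User-Agent contains an apostrophe.
sample = ('192.0.2.10 - - [16/Dec/2014:15:06:23 +0800] '
          '"GET /journal/log.html HTTP/1.1" 304 0 "-" "Tester\'s Browser/1.0" "-"')
print(to_document(sample))
```

Output of this shape, one JSON document per line, is the format that mongoimport reads by default, with the target given through its --db and --collection options.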

Compiling a log handler



    $ cat /var/log/nginx/access.log | mlog | mongo

Processing older logs that have already been compressed

    # zcat /var/log/nginx/*.access.log-*.gz | /srv/mlog | mongo

Capturing logs in real time

    $ tail -f /var/log/nginx/access.log | mlog | mongo
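
What tail -f contributes to this pipeline, handing over only the lines appended after startup, can be sketched in Python as follows. This is a simplified illustration: the helper name `new_lines` is hypothetical, and a real follower would call it in a loop with a short sleep and would also need to handle log rotation.

```python
import os
import tempfile


def new_lines(handle):
    """Return the complete lines appended since the previous call."""
    lines = []
    while True:
        position = handle.tell()
        line = handle.readline()
        if not line.endswith("\n"):
            handle.seek(position)  # nothing new, or a partial line: retry later
            break
        lines.append(line.rstrip("\n"))
    return lines


# Demo with a scratch file standing in for access.log.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "access.log")
with open(path, "w") as log:
    log.write("old line\n")

reader = open(path, "r")
reader.seek(0, os.SEEK_END)  # like tail -f, ignore what is already there

with open(path, "a") as log:
    log.write("new one\nnew two\npart")  # the last line is not finished yet
first = new_lines(reader)

with open(path, "a") as log:
    log.write("ial\n")  # the writer completes the line
second = new_lines(reader)

reader.close()
os.remove(path)
os.rmdir(workdir)
print(first, second)
```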
6.2. Log center solution

The scheme above is simple, but it depends too heavily on the system administrator. Many servers have to be configured, and every application produces a different log format, so it quickly becomes complex. If a failure occurs anywhere along the way, some of the log is lost.

So I went back to the starting point: every log stays on its own server and is synchronized to the log server on a schedule, which takes care of archiving. In addition, logs are collected remotely and pushed over UDP to a central log center, which serves log monitoring, capture, and other tasks with strict real-time requirements.
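
The push-over-UDP idea can be illustrated with a few lines of Python. This is a minimal sketch on the loopback interface, not the code of the software described below; the function names `make_collector` and `push` are hypothetical, and the collector binds to port 0 so the operating system picks a free port.

```python
import socket


def make_collector(host="127.0.0.1", port=0):
    """Open a UDP socket that receives pushed log lines."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    sock.settimeout(2.0)  # do not wait forever if a datagram is lost
    return sock


def push(line, address):
    """Send one log line to the log center as a single datagram."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), address)


collector = make_collector()
address = collector.getsockname()  # (host, port) actually bound

push('"GET /journal/log.html HTTP/1.1" 304 0', address)
data, sender = collector.recvfrom(65535)
received = data.decode("utf-8")
collector.close()
print(received)
```

UDP keeps the sending side cheap and non-blocking, at the cost that datagrams can be dropped, which is one reason the scheduled file synchronization remains the archive of record in this design.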

For this I spent two or three days writing a piece of software: https://github.com/netkiller/logging

It is not the best solution, only the one that fits my situation, and the development took just two or three days. Later I plan to extend it and add the ability to deliver logs through a message queue.

6.2.1. Software Installation
    $ git clone https://github.com/netkiller/logging.git
    $ cd logging
    $ python3 setup.py sdist
    $ python3 setup.py install
6.2.2. Node push end

Install the startup script

    $ sudo cp init.d/ulog /etc/init.d/
    $ service ulog
    Usage: /etc/init.d/ulog {start|stop|status|restart}

To configure the script, open the /etc/init.d/ulog file.

Configure the IP address of the log center

Then configure the ports and the logs to collect

    done << EOF
    1213 /var/log/nginx/access.log
    1214 /tmp/test.log
    1215 /tmp/$(date +"%Y-%m-%d.%H:%M:%S").log
    EOF

The format is

    Port | Logfile
    ------------------------------
    1213 /var/log/nginx/access.log
    1214 /tmp/test.log
    1215 /tmp/$(date +"%Y-%m-%d.%H:%M:%S").log

1213 is the destination port number (the log center port); it is followed by the log file you want to monitor. If the application produces a new log file each day, write the path in a form like /tmp/$(date +"%Y-%m-%d.%H:%M:%S").log

Tip: when a new log file is generated daily, ulog must be restarted on a schedule. The command is /etc/init.d/ulog restart
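
To make the "Port | Logfile" table concrete, here is a hedged Python sketch of how such a table could be parsed. It is not how the ulog script itself works: the real script relies on the shell expanding $(date ...), whereas this sketch assumes the date codes are written directly in the path and expands them with strftime. The function name `parse_table` is hypothetical.

```python
from datetime import datetime


def parse_table(text, now):
    """Return (port, path) pairs; strftime codes in a path are expanded with `now`."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        port, path = line.split(None, 1)
        entries.append((int(port), now.strftime(path)))
    return entries


# The table from the text, with date codes in place of $(date ...).
table = """
1213 /var/log/nginx/access.log
1214 /tmp/test.log
1215 /tmp/%Y-%m-%d.%H:%M:%S.log
"""
entries = parse_table(table, datetime(2014, 12, 16, 9, 55, 15))
for port, path in entries:
    print(port, path)
```

Because the path is computed once from a fixed timestamp, the sketch also shows why a restart is needed: the file name only changes when the table is evaluated again.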

Start the push program once the configuration is complete

    # service ulog start

View status

    $ service ulog status
    13865 pts/16   S      0:00 /usr/bin/python3 /usr/local/bin/rlog -d -H 1213 /var/log/nginx/access.log

Stop pushing

    # service ulog stop
6.2.3. Log Collection End
    # cp logging/init.d/ucollection /etc/init.d
    # /etc/init.d/ucollection
    Usage: /etc/init.d/ucollection {start|stop|status|restart}

To configure the receiving ports and the files to save to, open the /etc/init.d/ucollection file and find the following section

    done << EOF
    1213 /tmp/nginx/access.log
    1214 /tmp/test/test.log
    1215 /tmp/app/$(date +"%Y-%m-%d.%H:%M:%S").log
    1216 /tmp/db/$(date +"%Y-%m-%d")/mysql.log
    1217 /tmp/cache/$(date +"%Y")/$(date +"%m")/$(date +"%d")/cache.log
    EOF

The format is as follows. This line means that data received on port 1213 is saved to the /tmp/nginx/access.log file.

    Port | Logfile
    1213 /tmp/nginx/access.log

To split the logs by date, configure an entry like this

    1217 /tmp/cache/$(date +"%Y")/$(date +"%m")/$(date +"%d")/cache.log

With the configuration above, log files are generated in the following directories

    $ find /tmp/cache/
    /tmp/cache/
    /tmp/cache/2014
    /tmp/cache/2014/12
    /tmp/cache/2014/12/16
    /tmp/cache/2014/12/16/cache.log

Note that splitting logs this way also requires restarting the collection-end program, so that the date in the path is evaluated again.
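
The year/month/day layout can be reproduced with a short Python sketch. The shell substitution $(date ...) is evaluated once, when the script starts, which is why a restart is needed; the hypothetical helper `split_path` below instead computes the directory from the date it is given, so a collector written this way could switch directories without restarting. The scratch directory stands in for /tmp/cache.

```python
import os
import tempfile
from datetime import date


def split_path(root, name, day):
    """Return root/YYYY/MM/DD/name, creating the directories if needed."""
    directory = os.path.join(
        root, day.strftime("%Y"), day.strftime("%m"), day.strftime("%d")
    )
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, name)


root = tempfile.mkdtemp()  # stand-in for /tmp/cache
path = split_path(root, "cache.log", date(2014, 12, 16))
with open(path, "a") as log:
    log.write("example entry\n")

relative = os.path.relpath(path, root)
print(relative)
```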

Start the collection end

    # service ucollection start

Stop the program

    # service ucollection stop

View status

    $ init.d/ucollection status
    12429 pts/16   S      0:00 /usr/bin/python3 /usr/local/bin/collection -d -p 1213 -l /tmp/nginx/access.log
    12432 pts/16   S      0:00 /usr/bin/python3 /usr/local/bin/collection -d -p 1214 -l /tmp/test/test.log
    12435 pts/16   S      0:00 /usr/bin/python3 /usr/local/bin/collection -d -p 1215 -l /tmp/app/2014-12-16.09:55:15.log
    12438 pts/16   S      0:00 /usr/bin/python3 /usr/local/bin/collection -d -p 1216 -l /tmp/db/2014-12-16/mysql.log
    12441 pts/16   S      0:00 /usr/bin/python3 /usr/local/bin/collection -d -p 1217 -l /tmp/cache/2014/12/16/cache.log
6.2.4. Log monitoring

Monitor the data arriving on port 1213

    $ collection -p 1213
    192.168.6.20 - - [16/Dec/2014:15:06:23 +0800] "GET /journal/log.html HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
    [16/Dec/2014:15:06:23 +0800] "GET /journal/docbook.css HTTP/1.1" 304 0 "" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
    [16/Dec/2014:15:06:23 +0800] "GET /journal/journal.css HTTP/1.1" 304 0 "" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
    [16/Dec/2014:15:06:23 +0800] "GET /images/by-nc-sa.png HTTP/1.1" 304 0 "" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
    [16/Dec/2014:15:06:23 +0800] "GET /js/q.js HTTP/1.1" 304 0 "" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"

Once started, the program delivers the latest log entries in real time.
