Log archiving and data mining
http://netkiller.github.io/journal/log.html
Mr. Neo Chen (Chen Jingfeng), Netkiller, BG7NYT
Xishan Meidi, Minzhi Subdistrict, Longhua New District, Shenzhen, Guangdong, China
518131
+86 13113668890
+86 755 29812080
<[email protected]>
Copyright © Netkiller. All rights reserved.
Copyright Notice
To reprint this article, please contact the author first, and be sure to indicate the original source, the author's information, and this statement.
Document source:
http://netkiller.github.io
http://netkiller.sourceforge.net
2014-12-16
Summary
2013-03-19 First Edition
2014-12-16 Second Edition
My series of documents
Netkiller Architect Codex
Netkiller Developer Codex
Netkiller PHP Codex
Netkiller Python Codex
Netkiller Testing Codex
Netkiller Cryptography Codex
Netkiller Linux Codex
Netkiller Debian Codex
Netkiller CentOS Codex
Netkiller FreeBSD Codex
Netkiller Shell Codex
Netkiller Security Codex
Netkiller Web Codex
Netkiller Monitoring Codex
Netkiller Storage Codex
Netkiller Mail Codex
Netkiller Docbook Codex
Netkiller Version Codex
Netkiller Database Codex
Netkiller PostgreSQL Codex
Netkiller MySQL Codex
Netkiller NoSQL Codex
Netkiller LDAP Codex
Netkiller Network Codex
Netkiller Cisco IOS Codex
Netkiller H3C Codex
Netkiller Multimedia Codex
Netkiller Perl Codex
Netkiller Amateur Radio Codex
Netkiller DevOps Codex
Table of Contents
- 1. What is log archiving
- 2. Why archive logs
- 3. When to archive logs
- 4. Where to store archived logs
- 5. Who should do the log archiving?
- 6. How to archive logs
- 6.1. Log format conversion
- 6.1.1. Putting logs into the database
- 6.1.2. Apache Pipe
- 6.1.3. Log format
- 6.1.4. Importing logs into MongoDB
- 6.2. Log center solution
- 6.2.1. Software Installation
- 6.2.2. Node push side
- 6.2.3. Log collection side
- 6.2.4. Log monitoring
1. What is log archiving
Log archiving refers to the process of taking log files that are no longer being written to but still have preservation value, organizing them, and handing them over to a log server for storage.
2. Why archive logs
- To be able to query historical logs at any time.
- To mine valuable data out of the logs.
- To review the application's working status.
3. When to archive logs
Log archiving should be a policy laid down by the enterprise (an "archiving policy"), and it should be taken into account from the very beginning of system construction. If your company has no such role or policy, I suggest you implement one immediately after reading this article.
4. Where to store archived logs
A simple approach is a single-node server plus backups.
As log volume grows, you will eventually have to adopt a distributed file system, possibly even remote geo-redundant disaster recovery.
5. Who should do the log archiving?
My answer: automate the archiving, and have people inspect or spot-check it.
6. How to archive logs
There are several ways to gather the logs of all servers in one place.
Common methods for log archiving:
- FTP download: scheduled downloads to a designated server. Suitable when files are small and log volume is low; the drawbacks are repeated transmission and poor real-time behavior.
- rsyslog and programs of its class: fairly universal, but inconvenient to extend.
- rsync synchronization: suited to file synchronization, better than FTP, but still weak in real time (a minimal sketch follows this list).
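For the rsync option, a cron-driven pull can look like the sketch below. The host web1 and the paths are illustrative assumptions, not part of the original setup:

# crontab on the log server: pull the web node's nginx logs nightly at 01:00
0 1 * * * rsync -az web1:/var/log/nginx/ /srv/logs/web1/nginx/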
6.1. Log format conversion
First, let me introduce a simple solution.
I wrote a program in D that decomposes web logs with a regular expression and then hands them to a database handler through a pipe.
6.1.1. Putting logs into the database
Process the web server log through a pipe and then write it into the database.
Source of the processing program:
$ vim match.d

import std.regex;
import std.stdio;
import std.string;
import std.array;

void main()
{
    // Nginx
    //auto r = regex(`^(\S+) (\S+) (\S+) \[(.+)\] "([^"]+)" ([0-9]{3}) ([0-9]+) "([^"]+)" "([^"]+)" "([^"]+)"`);
    // Apache2
    auto r = regex(`^(\S+) (\S+) (\S+) \[(.+)\] "([^"]+)" ([0-9]{3}) ([0-9]+) "([^"]+)" "([^"]+)"`);

    foreach (line; stdin.byLine)
    {
        foreach (m; match(line, r))
        {
            //writeln(m.hit);
            auto c = m.captures;
            c.popFront();
            //writeln(c);
            auto value = join(c, "\",\"");
            auto sql = format("insert into log(remote_addr,unknow,remote_user,time_local,request,status,body_bytes_sent,http_referer,http_user_agent,http_x_forwarded_for) value(\"%s\");", value);
            writeln(sql);
        }
    }
}
Compile:

$ dmd match.d
$ strip match
$ ls
match  match.d  match.o
Simple usage:

$ cat access.log | ./match
Advanced usage:

$ cat access.log | ./match | mysql -hlocalhost -ulog -p123456 logging
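For the pipeline above to work, the logging database needs a log table matching the generated INSERT statements. The original schema is not shown, so this is a minimal sketch with assumed column types:

$ mysql -hlocalhost -ulog -p123456 logging <<'EOF'
CREATE TABLE log (
    remote_addr          VARCHAR(45),   -- client address, sized for IPv6
    unknow               VARCHAR(8),    -- the identd "-" field
    remote_user          VARCHAR(64),
    time_local           VARCHAR(32),
    request              VARCHAR(2048),
    status               VARCHAR(3),
    body_bytes_sent      VARCHAR(16),
    http_referer         VARCHAR(2048),
    http_user_agent      VARCHAR(512),
    http_x_forwarded_for VARCHAR(256)
);
EOF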
To process logs in real time, first create a named pipe, then have the log file written to the pipe:

$ cat pipe_name | ./match | mysql -hlocalhost -ulog -p123456 logging

This achieves real-time log insertion.
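A minimal sketch of that setup, assuming the pipe lives at /tmp/access.pipe and nginx is the log producer:

# create the named pipe and attach the parser/loader as the consumer
$ mkfifo /tmp/access.pipe
$ cat /tmp/access.pipe | ./match | mysql -hlocalhost -ulog -p123456 logging &
# then point the web server's access log at the pipe, e.g. in nginx.conf:
#     access_log /tmp/access.pipe;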
Tip
The program above can be adapted to write to HBase or Hypertable instead.
6.1.2. Apache Pipe
Apache log pipeline filter:

CustomLog "|/srv/match >> /tmp/access.log" combined
<VirtualHost *:80>
    ServerAdmin [email protected]

    #DocumentRoot /var/www
    DocumentRoot /www
    <Directory />
        Options FollowSymLinks
        AllowOverride None
    </Directory>
    #<Directory /var/www/>
    <Directory /www/>
        Options Indexes FollowSymLinks MultiViews
        AllowOverride None
        Order allow,deny
        allow from all
    </Directory>

    ScriptAlias /cgi-bin/ /usr/lib/cgi-bin/
    <Directory "/usr/lib/cgi-bin">
        AllowOverride None
        Options +ExecCGI -MultiViews +SymLinksIfOwnerMatch
        Order allow,deny
        Allow from all
    </Directory>

    ErrorLog ${APACHE_LOG_DIR}/error.log

    # Possible values include: debug, info, notice, warn, error, crit,
    # alert, emerg.
    LogLevel warn

    #CustomLog ${APACHE_LOG_DIR}/access.log combined
    CustomLog "|/srv/match >> /tmp/access.log" combined

    Alias /doc/ "/usr/share/doc/"
    <Directory "/usr/share/doc/">
        Options Indexes MultiViews FollowSymLinks
        AllowOverride None
        Order deny,allow
        Deny from all
        Allow from 127.0.0.0/255.0.0.0 ::1/128
    </Directory>
</VirtualHost>
The log after pipeline conversion:

$ tail /tmp/access.log
insert into log(remote_addr,unknow,remote_user,time_local,request,status,body_bytes_sent,http_referer,http_user_agent,http_x_forwarded_for) value("192.168.6.30","-","-","21/Mar/2013:16:11:00 +0800","GET / HTTP/1.1","304","208","-","Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.172 Safari/537.22");
insert into log(remote_addr,unknow,remote_user,time_local,request,status,body_bytes_sent,http_referer,http_user_agent,http_x_forwarded_for) value("192.168.6.30","-","-","21/Mar/2013:16:11:00 +0800","GET /favicon.ico HTTP/1.1","404","501","-","Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.172 Safari/537.22");
insert into log(remote_addr,unknow,remote_user,time_local,request,status,body_bytes_sent,http_referer,http_user_agent,http_x_forwarded_for) value("192.168.6.30","-","-","21/Mar/2013:16:11:00 +0800","GET / HTTP/1.1","304","208","-","Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.172 Safari/537.22");
6.1.3. Log format
By defining LogFormat, you can have the web server output logs directly in SQL form.
Apache
LogFormat "%v:%p %h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"" vhost_combined
LogFormat "%h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"" combined
LogFormat "%h %l %u %t \"%r\" %>s %O" common
LogFormat "%{Referer}i -> %U" referer
LogFormat "%{User-agent}i" agent
Nginx
log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                '$status $body_bytes_sent "$http_referer" '
                '"$http_user_agent" "$http_x_forwarded_for"';
However, SQL-formatted logs make trouble for system administrators who analyze them with grep, awk, sed, sort, and uniq. So I suggest keeping a regular log format and decomposing it with regular expressions.
Alternatively, generate a comma-separated log format. Apache:

LogFormat "\"%h\",%{%Y%m%d%H%M%S}t,%>s,\"%b\",\"%{Content-Type}o\",\"%U\",\"%{Referer}i\",\"%{User-Agent}i\""
Import the access.log file into MySQL:

LOAD DATA INFILE '/local/access_log' INTO TABLE tbl_name
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' ESCAPED BY '\\'
6.1.4. Importing logs into MongoDB
# rpm -Uvh http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
# yum install mongodb
The D-language log processing program:
import std.regex;
//import std.range;
import std.stdio;
import std.string;
import std.array;

void main()
{
    // Nginx
    auto r = regex(`^(\S+) (\S+) (\S+) \[(.+)\] "([^"]+)" ([0-9]{3}) ([0-9]+) "([^"]+)" "([^"]+)" "([^"]+)"`);
    // Apache2
    //auto r = regex(`^(\S+) (\S+) (\S+) \[(.+)\] "([^"]+)" ([0-9]{3}) ([0-9]+) "([^"]+)" "([^"]+)"`);

    foreach (line; stdin.byLine)
    {
        //writeln(line);
        foreach (m; match(line, r))
        {
            //writeln(m.hit);
            auto c = m.captures;
            c.popFront();
            //writeln(c);
            /* SQL
            auto value = join(c, "\",\"");
            auto sql = format("insert into log(remote_addr,unknow,remote_user,time_local,request,status,body_bytes_sent,http_referer,http_user_agent,http_x_forwarded_for) value(\"%s\");", value);
            writeln(sql);
            */
            // MongoDB
            string bson = format("db.logging.access.save({'remote_addr': '%s', 'remote_user': '%s', 'time_local': '%s', 'request': '%s', 'status': '%s', 'body_bytes_sent': '%s', 'http_referer': '%s', 'http_user_agent': '%s', 'http_x_forwarded_for': '%s'})", c[0], c[2], c[3], c[4], c[5], c[6], c[7], c[8], c[9]);
            writeln(bson);
        }
    }
}
Compile the log handler:

$ dmd mlog.d
Usage:

$ cat /var/log/nginx/access.log | ./mlog | mongo 192.169.0.5/logging -uxxx -pxxx
Process previously missed (already rotated) logs:

# zcat /var/log/nginx/*.access.log-*.gz | /srv/mlog | mongo 192.168.6.1/logging -uneo -pchen
Capture logs in real time:

$ tail -f /var/log/nginx/access.log | ./mlog | mongo 192.169.0.5/logging -uxxx -pxxx
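Once documents are in MongoDB, mining can start from the mongo shell. An illustrative query, counting 404 responses (the query itself is my example, not from the original; the collection name follows the program above):

$ mongo 192.169.0.5/logging -uxxx -pxxx --eval "db.logging.access.find({status: '404'}).count()"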
6.2. Log center solution
Although the above solution is simple, it depends too heavily on the system administrator: many servers have to be configured, and every application produces a different log format, so it quickly becomes complex. And if anything fails along the way, log entries are lost.
So I went back to the drawing board: every server keeps its logs locally and synchronizes them to the log server on a schedule, which takes care of archiving. Logs collected from remote machines are pushed to the log center over UDP, which satisfies the high real-time demands of log monitoring, crawling, and similar tasks.
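The push mechanism itself is simple. The idea (just the concept, not the actual software) can be demonstrated with netcat, assuming the log center listens on UDP port 1213 and is reachable under the hypothetical hostname logserver:

# on the log center: listen on UDP 1213 and append whatever arrives
$ nc -ul 1213 >> /tmp/nginx/access.log

# on the node: follow the log and push each new line over UDP
$ tail -f /var/log/nginx/access.log | nc -u logserver 1213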
For this I spent two or three days writing a piece of software: https://github.com/netkiller/logging
It is not the best solution, just the one that fits my situation, and it took only two or three days to develop. Later I will extend it further and add the ability to deliver logs through a message queue.
6.2.1. Software Installation
$ git clone https://github.com/netkiller/logging.git
$ cd logging
$ python3 setup.py sdist
$ python3 setup.py install
6.2.2. Node push side
Install the startup script.

CentOS:

# cp logging/init.d/ulog /etc/init.d

Ubuntu:

$ sudo cp init.d/ulog /etc/init.d/
$ service ulog
Usage: /etc/init.d/ulog {start|stop|status|restart}
To configure the script, open the /etc/init.d/ulog file and set the IP address of the log center:

HOST=xxx.xxx.xxx.xxx
Then configure the ports and the logs to collect:

done << EOF
1213 /var/log/nginx/access.log
1214 /tmp/test.log
1215 /tmp/$(date +"%Y-%m-%d.%H:%M:%S").log
EOF
The format is:

Port | Logfile
------------------------------
1213   /var/log/nginx/access.log
1214   /tmp/test.log
1215   /tmp/$(date +"%Y-%m-%d.%H:%M:%S").log
Here 1213 is the destination port number (the log center's port), followed by the log you need to monitor. If the log produces a new file each day, use a pattern like /tmp/$(date +"%Y-%m-%d.%H:%M:%S").log.
Note: a log that produces a new file daily requires a scheduled restart of ulog; the method is /etc/init.d/ulog restart.
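A crontab sketch for that scheduled restart (the midnight schedule is my assumption; pick whatever matches your rollover):

# reopen the new day's log file just after midnight
0 0 * * * /etc/init.d/ulog restart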
When configuration is complete, start the push program:

# service ulog start
View status:

$ service ulog status
13865 pts/16   S   0:00 /usr/bin/python3 /usr/local/bin/rlog -d -h 127.0.0.1 -p 1213 /var/log/nginx/access.log
Stop the push program:

# service ulog stop
6.2.3. Log collection side

# cp logging/init.d/ucollection /etc/init.d
# /etc/init.d/ucollection
Usage: /etc/init.d/ucollection {start|stop|status|restart}
Configure the receiving ports and the files to save to: open the /etc/init.d/ucollection file and find the following section:

done << EOF
1213 /tmp/nginx/access.log
1214 /tmp/test/test.log
1215 /tmp/app/$(date +"%Y-%m-%d.%H:%M:%S").log
1216 /tmp/db/$(date +"%Y-%m-%d")/mysql.log
1217 /tmp/cache/$(date +"%Y")/$(date +"%m")/$(date +"%d")/cache.log
EOF
The format is as follows; this line receives data from port 1213 and saves it to the /tmp/nginx/access.log file:

Port | Logfile
1213   /tmp/nginx/access.log
If you want to split logs by date, configure it like this:

1217 /tmp/cache/$(date +"%Y")/$(date +"%m")/$(date +"%d")/cache.log
With the above configuration, the log files are generated in the following directories:

$ find /tmp/cache/
/tmp/cache/
/tmp/cache/2014
/tmp/cache/2014/12
/tmp/cache/2014/12/16
/tmp/cache/2014/12/16/cache.log
Likewise, date-split logs require a scheduled restart of the collection-side program.
Start the collection side:

# service ucollection start

Stop the program:

# service ucollection stop
View status:

$ /etc/init.d/ucollection status
12429 pts/16   S   0:00 /usr/bin/python3 /usr/local/bin/collection -d -p 1213 -l /tmp/nginx/access.log
12432 pts/16   S   0:00 /usr/bin/python3 /usr/local/bin/collection -d -p 1214 -l /tmp/test/test.log
12435 pts/16   S   0:00 /usr/bin/python3 /usr/local/bin/collection -d -p 1215 -l /tmp/app/2014-12-16.09:55:15.log
12438 pts/16   S   0:00 /usr/bin/python3 /usr/local/bin/collection -d -p 1216 -l /tmp/db/2014-12-16/mysql.log
12441 pts/16   S   0:00 /usr/bin/python3 /usr/local/bin/collection -d -p 1217 -l /tmp/cache/2014/12/16/cache.log
6.2.4. Log monitoring
Monitor the data arriving on port 1213:
$ collection -p 1213
192.168.6.20 - - [16/Dec/2014:15:06:23 +0800] "GET /journal/log.html HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
192.168.6.20 - - [16/Dec/2014:15:06:23 +0800] "GET /journal/docbook.css HTTP/1.1" 304 0 "http://192.168.6.2/journal/log.html" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
192.168.6.20 - - [16/Dec/2014:15:06:23 +0800] "GET /journal/journal.css HTTP/1.1" 304 0 "http://192.168.6.2/journal/log.html" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
192.168.6.20 - - [16/Dec/2014:15:06:23 +0800] "GET /images/by-nc-sa.png HTTP/1.1" 304 0 "http://192.168.6.2/journal/log.html" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
192.168.6.20 - - [16/Dec/2014:15:06:23 +0800] "GET /js/q.js HTTP/1.1" 304 0 "http://192.168.6.2/journal/log.html" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
Once started, the push side sends the latest log entries in real time.