Multi-server log merge statistics (1)

Source: Internet
Author: User
This article describes a method for merging Apache logs from multiple servers and analyzing them, using cronolog and Webalizer.

Summary: you do not need to read everything below patiently, because the conclusion comes down to two points:

1) Use cronolog to rotate Apache logs cleanly and safely by day;
2) Merge multiple logs in time order with sort -m.

Based on personal experience, this article:

1) first introduces the method for merging Apache logs;
2) then, starting from the problems this raises, explains why log rotation is necessary and how to use cronolog to rotate Apache logs.

Designing the log-merging process involved many tools and some failed attempts ...

I am sure there is more than one way to solve the problems above, and the approach below is certainly not the simplest or the cheapest. I hope to exchange ideas with you.

1. The need for merged multi-server log statistics

More and more large Web services use DNS round robin for load balancing. Using multiple servers in the same role as front-end Web servers makes it much easier to plan the distribution of a service and to scale it, but it also makes log analysis more cumbersome. If you use a log analysis tool such as Webalizer to compute statistics for each machine separately:

1) summarizing the data becomes troublesome; for example, to get the total number of visits you have to add up the figures for the given month from SERVER1, SERVER2, ...
2) statistics such as the number of unique visits and unique sites are seriously distorted, because these indicators are not simple arithmetic sums over the machines.

The benefits of unified log statistics are obvious, but how do you combine the statistics of all machines into a single report?

The first thing you might think of is: can multiple servers write their log records to the same remote file? We will not consider logging to a remote file system, because the trouble it brings far outweighs the convenience you gain ...

Therefore, the workflow for multi-server log statistics is: record logs separately on each server => periodically synchronize them to a back-end machine => merge => analyze with a log analysis tool.

First, why merge the logs at all? Because Webalizer cannot combine multiple logs from the same day. If you run, in sequence:

webalizer log1
webalizer log2
webalizer log3

the final result contains only the statistics for log3.

Can we simply concatenate log1, log2, and log3 into one file instead?

No, because a log analysis tool does not read the whole log before analyzing it. It reads the log as a stream and saves interim statistics at certain intervals. If the time gap is too large (for example, two log entries more than 5 minutes apart), the algorithms of some log statistics tools will "forget" the earlier results.

2. The log merge problem

Merged statistics for multiple servers therefore come down to this: sort the logs by time and merge them into a single file. A typical set of time fields from multiple log files looks like this:

log1      log2      log3
00:15:00  00:14:00  00:11:00
00:16:00  00:15:00  00:12:00
00:17:00  00:18:00  00:13:00
00:18:00  00:19:00  00:14:00
14:18:00  11:19:00  10:14:00
15:18:00  17:19:00  11:14:00
23:18:00  23:19:00  23:14:00

A log merge must interleave the entries of multiple logs by time. For the logs above, the merged log should begin:

00:11:00 from log3
00:12:00 from log3
00:13:00 from log3
00:14:00 from log2
00:14:00 from log3
00:15:00 from log1
00:15:00 from log2
00:16:00 from log1
....
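The difference between concatenating and interleaving can be checked mechanically. The sketch below is only an illustration: it writes the time columns from the table above into three temporary files (the file layout and temporary directory are assumptions), then uses `sort -c`, which reports whether its input is already sorted, to show that each log is ordered on its own while their concatenation is not.

```shell
#!/bin/sh
# Illustration only: each file holds just the time column of one log.
workdir=$(mktemp -d)

printf '%s\n' 00:15:00 00:16:00 00:17:00 00:18:00 14:18:00 15:18:00 23:18:00 > "$workdir/log1"
printf '%s\n' 00:14:00 00:15:00 00:18:00 00:19:00 11:19:00 17:19:00 23:19:00 > "$workdir/log2"
printf '%s\n' 00:11:00 00:12:00 00:13:00 00:14:00 10:14:00 11:14:00 23:14:00 > "$workdir/log3"

# Each individual log is already in time order: sort -c exits with status 0.
each_sorted=yes
for f in log1 log2 log3; do
    LC_ALL=C sort -c "$workdir/$f" 2>/dev/null || each_sorted=no
done

# Plain concatenation jumps back in time at every file boundary.
cat "$workdir/log1" "$workdir/log2" "$workdir/log3" > "$workdir/concat"
if LC_ALL=C sort -c "$workdir/concat" 2>/dev/null; then
    concat_sorted=yes
else
    concat_sorted=no
fi

echo "each log sorted: $each_sorted"
echo "concatenation sorted: $concat_sorted"
```

On this data the script reports that each log is sorted and the concatenation is not, which is why the entries have to be interleaved rather than appended.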

How do I combine multiple log files?

The following uses the standard CLF (Common Log Format) log of Apache as an example.

The Apache log format is:

%h %l %u %t "%r" %>s %b

A concrete example:

111.222.111.222 - - [03/Apr/2002:10:30:17 +0800] "GET /index.html HTTP/1.1" 200 419

The simplest idea is to read all the logs and then sort them by the time field:

cat log1 log2 log3 | sort -k 4 -t " " -o log_all

Comments:

-t " ": the field separator is a space
-k 4: sort on the key starting at the 4th field, which is the timestamp:
[03/Apr/2002:10:30:17 +0800]
-o log_all: write the output to the file log_all
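A minimal sketch of this full-sort approach on made-up data follows. The three small CLF files are invented for illustration, and `LC_ALL=C` is an added assumption that forces plain byte-wise comparison so the result does not depend on the locale. Note that the key is compared as text, so the order is chronological only while all entries share the same date and time zone offset, as they would in a single day's log.

```shell
#!/bin/sh
# Invented sample data: three tiny CLF logs from the same day.
workdir=$(mktemp -d)

cat > "$workdir/log1" <<'EOF'
10.0.0.1 - - [03/Apr/2002:00:15:00 +0800] "GET /a.html HTTP/1.1" 200 419
10.0.0.1 - - [03/Apr/2002:00:16:00 +0800] "GET /index.html HTTP/1.1" 200 419
EOF
cat > "$workdir/log2" <<'EOF'
10.0.0.2 - - [03/Apr/2002:00:14:00 +0800] "GET /index.html HTTP/1.1" 200 419
10.0.0.2 - - [03/Apr/2002:00:15:00 +0800] "GET /b.html HTTP/1.1" 200 419
EOF
cat > "$workdir/log3" <<'EOF'
10.0.0.3 - - [03/Apr/2002:00:11:00 +0800] "GET /index.html HTTP/1.1" 200 419
10.0.0.3 - - [03/Apr/2002:00:12:00 +0800] "GET /index.html HTTP/1.1" 200 419
EOF

# Read everything, then sort on the key that starts at field 4 (the timestamp).
cat "$workdir/log1" "$workdir/log2" "$workdir/log3" |
    LC_ALL=C sort -t ' ' -k 4 -o "$workdir/log_all"

# Show only the timestamp field and the server address of each merged line.
awk '{ print $4, $1 }' "$workdir/log_all"
```

The awk line at the end is only there to make the interleaving visible; the merged file itself keeps the full CLF lines.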

But this is inefficient. If a service already needs load balancing, a single machine's log usually holds a very large number of entries and is hundreds of megabytes in size. Fully sorting several logs of that size puts a load on the machine that is easy to imagine ...

There is an optimization: each individual log is already a file "sorted by time", and sort provides an optimized algorithm for merging such pre-sorted files, the -m (merge) option.

Therefore, a better way to merge three log files in this format, log1, log2, and log3, and write the result to log_all is:

sort -m -t " " -k 4 -o log_all log1 log2 log3

Comments:

-m: use the merge algorithm, which assumes every input file is already sorted
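Because `sort -m` trusts that every input is already sorted, it may be worth verifying that first. The following sketch checks each input with `sort -c` using the same key options, merges only if all checks pass, and then confirms that the merged file is itself in order. The file contents are invented, and the pre-check step is an addition that the article does not describe.

```shell
#!/bin/sh
# Invented sample logs, each already in time order.
workdir=$(mktemp -d)

cat > "$workdir/log1" <<'EOF'
10.0.0.1 - - [03/Apr/2002:00:15:00 +0800] "GET /a.html HTTP/1.1" 200 419
10.0.0.1 - - [03/Apr/2002:00:17:00 +0800] "GET /index.html HTTP/1.1" 200 419
EOF
cat > "$workdir/log2" <<'EOF'
10.0.0.2 - - [03/Apr/2002:00:14:00 +0800] "GET /index.html HTTP/1.1" 200 419
10.0.0.2 - - [03/Apr/2002:00:18:00 +0800] "GET /b.html HTTP/1.1" 200 419
EOF
cat > "$workdir/log3" <<'EOF'
10.0.0.3 - - [03/Apr/2002:00:11:00 +0800] "GET /index.html HTTP/1.1" 200 419
10.0.0.3 - - [03/Apr/2002:00:12:00 +0800] "GET /index.html HTTP/1.1" 200 419
EOF

# Pre-check: sort -c exits non-zero if a file is not sorted on the given key.
merge_ok=yes
for f in log1 log2 log3; do
    if ! LC_ALL=C sort -c -t ' ' -k 4 "$workdir/$f" 2>/dev/null; then
        echo "$f is not sorted by time; a full sort is needed instead" >&2
        merge_ok=no
    fi
done

if [ "$merge_ok" = yes ]; then
    # Merge the pre-sorted inputs without re-sorting them.
    LC_ALL=C sort -m -t ' ' -k 4 -o "$workdir/log_all" \
        "$workdir/log1" "$workdir/log2" "$workdir/log3"
    # The merged result should itself pass the same order check.
    LC_ALL=C sort -c -t ' ' -k 4 "$workdir/log_all" && echo "log_all is in time order"
fi
```

If an input fails the check, falling back to the full sort shown earlier still gives a correct result at a higher cost.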

Note: it is best to compress the merged log before handing it to Webalizer.

Some systems can handle files larger than 2 GB and some cannot, and the same goes for programs. Try to avoid files larger than 2 GB unless you are sure that every program and operating system involved can handle them. So if the output file would exceed 2 GB, gzip the log before passing it to Webalizer: errors during analysis are more likely with files over 2 GB, and gzip also greatly reduces I/O during analysis.
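One way to avoid ever creating a large uncompressed file is to pipe the merge straight into gzip. This is a sketch under assumptions: the sample logs and the temporary directory are invented, and the final Webalizer call is left as a comment because it depends on the local installation and on Webalizer having been built with gzip support.

```shell
#!/bin/sh
# Invented sample logs, each already in time order.
workdir=$(mktemp -d)

cat > "$workdir/log1" <<'EOF'
10.0.0.1 - - [03/Apr/2002:00:15:00 +0800] "GET /a.html HTTP/1.1" 200 419
10.0.0.1 - - [03/Apr/2002:00:17:00 +0800] "GET /index.html HTTP/1.1" 200 419
EOF
cat > "$workdir/log2" <<'EOF'
10.0.0.2 - - [03/Apr/2002:00:14:00 +0800] "GET /index.html HTTP/1.1" 200 419
10.0.0.2 - - [03/Apr/2002:00:18:00 +0800] "GET /b.html HTTP/1.1" 200 419
EOF

# Without -o, sort writes to standard output, so the merged stream is
# compressed on the fly and only log_all.gz is written to disk.
LC_ALL=C sort -m -t ' ' -k 4 "$workdir/log1" "$workdir/log2" |
    gzip -c > "$workdir/log_all.gz"

# Verify the archive and count the merged lines.
gzip -t "$workdir/log_all.gz" && echo "archive OK"
merged_lines=$(gzip -dc "$workdir/log_all.gz" | wc -l | tr -d ' ')
echo "merged lines: $merged_lines"

# Hypothetical next step (not run here):
# webalizer "$workdir/log_all.gz"
```

Compared with writing log_all first and compressing it afterwards, this keeps the uncompressed data out of the file system entirely, which sidesteps the 2 GB concern for the intermediate file.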

This completes the chronological merging of the logs.

3. The log rotation mechanism

Now let us look at the data source. Webalizer is in fact a monthly statistics tool that supports incremental processing, so for a large service I can merge the Apache logs once a day and feed them to Webalizer. But how do you cut the Web log by day (for example at 00:00:00 every midnight)?

Suppose you use crontab to archive the log as access_log_yesterday at exactly 0:00 every day:

mv /path/to/apache/log/access_log /path/to/apache/log/access_log_yesterday

Then you also need to restart Apache immediately. Otherwise Apache keeps its handle on the renamed file and goes on writing there instead of to a new access_log. Archiving this way means the service is affected by an Apache restart every midnight.
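The crontab approach can be sketched as a small script. Everything beyond the mv itself is an assumption for illustration: the function name rotate_log, the script path in the crontab comment, and the use of apachectl restart as the restart command, which varies between installations. The restart command is passed as arguments so that the sketch can be exercised without a running Apache.

```shell
#!/bin/sh
# rotate_log LOGFILE COMMAND [ARGS...]
# Rename LOGFILE to LOGFILE_yesterday, then run COMMAND so that the server
# reopens its log. Returns non-zero if either step fails.
rotate_log() {
    log=$1
    shift
    mv "$log" "${log}_yesterday" || return 1
    "$@"
}

# Hypothetical crontab entry running a script that calls rotate_log at 00:00:
#   0 0 * * * /path/to/rotate_access_log.sh
#
# Inside that script (the restart command is an assumption):
#   rotate_log /path/to/apache/log/access_log apachectl restart

# Dry run against a temporary file; "touch" stands in for the restart,
# which would normally make Apache create a fresh access_log.
workdir=$(mktemp -d)
echo "old entry" > "$workdir/access_log"
rotate_log "$workdir/access_log" touch "$workdir/access_log"
ls "$workdir"
```

Even when scripted this way, the approach still restarts the server once a day, and the restart itself briefly interrupts the service.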
