Multi-server log merge statistics

Author: snail Il
Abstract: If you have no patience to read all of the following, the conclusion boils down to two points:
1. Use cronolog to rotate apache logs cleanly and safely
2. Merge and sort multiple logs with sort -m

Based on personal experience, this article:
1. First introduces how to merge apache logs;
2. Then explains why log rotation is necessary, and how to rotate apache logs with cronolog.

Designing the log-merging process involved quite a few tool tricks and some failed attempts ......

I am sure there is more than one way to solve these problems; the solutions below are certainly not the simplest or the cheapest, and I hope to exchange more ideas with you.

{0} The necessity of multi-server log merge statistics:

More and more large WEB services use DNS round robin to achieve load balancing: using multiple servers in the same role as the front-end WEB service greatly simplifies service planning and scaling. However, distributing the service across machines makes log analysis and statistics a little troublesome. If you run webalizer or a similar log analysis tool separately on each machine:
1. Summarizing the data becomes a chore: the total traffic for a given month, for example, has to be added up across SERVER1, SERVER2, ...
2. Indicators such as unique visitors and unique sites are badly distorted, because these figures are not the algebraic sum of the per-machine values.

The benefits of unified log statistics are obvious. But how can we combine the statistics of all machines into a statistical result?

Your first thought may be: can multiple servers write their logs to the same remote file? Recording logs on a remote file system is not worth considering: the trouble far outweighs what you gain ......

So the plan is: each server to be counted records its own log, the logs are regularly synchronized to a back-end machine => merged => and then analyzed with the log analysis tool.

First, why must the logs be merged at all? Because webalizer has no way to combine several logs for the same day.

If you run, one after another:


webalizer log1
webalizer log2
webalizer log3


the final result contains only the statistics of log3.

Would feeding webalizer the logs in a different order help? No.

A log analysis tool does not read the entire log into memory before analyzing it; it reads the log as a stream and checkpoints its intermediate statistics at fixed intervals. If the time gap between two logs is too large (say, the logs are more than 5 minutes apart), the algorithms of some log statistics tools simply "forget" the earlier results. So running webalizer over log1, log2, log3 in turn does not work either.

{1} The log merging problem

Merging the statistics of multiple servers means merging their logs into a single file in chronological order.

The typical time fields of multiple log files are as follows:


Log1 log2 log3
00:15:00 00:14:00
00:16:00 00:15:00
00:17:00 00:18:00
00:18:00 00:19:00
14:18:00 11:19:00
15:18:00 17:19:00
23:18:00 23:19:00

 

Logs must be merged by time. The merged logs should be:


00:15:00 from log1
00:15:00 from log2
00:16:00 from log1
00:17:00 from log3
00:18:00 from log2
00:19:00 from log1
....

 

How to merge multiple log files?

The following uses the standard CLF-format log (apache) as an example.

The apache common log format is:


%h %l %u %t "%r" %>s %b

 

Example:


111.222.111.222 - - [03/Apr/2002:10:30:17 +0800] "GET /index.html HTTP/1.1" 200 419

 

The simplest idea is to concatenate the logs and sort the whole thing by the time field in the log:


cat log1 log2 log3 | sort -t " " -k 4 -o log_all

 

Note:
-t " ": the field delimiter in the log is a space.
-k 4: sort on the 4th field, that is, the
[03/Apr/2002:10:30:17 +0800] field.
-o log_all: write the output to the file log_all.

However, this is inefficient. If a service already needs load balancing, each of its servers usually logs more than 10 million entries, several hundred MB in size; sorting several hundred MB of logs from scratch puts a load on the server that you can imagine ......

In fact, there is a way to optimize this. Note that each individual log file is already sorted by time, and sort provides an optimized merge algorithm for exactly this case, combining files that are already sorted: the -m (merge) option.

So a better way to merge the three sorted log files log1, log2, and log3 into log_all in this format is:


sort -m -t " " -k 4 -o log_all log1 log2 log3

 

Note:
-m: use the merge algorithm; each input file must already be sorted.
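A quick way to convince yourself that -m merely interleaves pre-sorted inputs is a small demo with the simplified time-only lines used in the illustration above (the /tmp file names are made up for the demo):

```shell
# Two files that are each already sorted by time (stand-ins for real logs).
printf '00:15:00 from log1\n00:16:00 from log1\n14:18:00 from log1\n' > /tmp/log1
printf '00:14:00 from log2\n00:18:00 from log2\n11:19:00 from log2\n' > /tmp/log2

# -m does not re-sort: it only interleaves the already-sorted inputs,
# which takes linear time instead of O(n log n) and far less memory.
sort -m /tmp/log1 /tmp/log2 -o /tmp/log_all

cat /tmp/log_all
# 00:14:00 from log2
# 00:15:00 from log1
# 00:16:00 from log1
# 00:18:00 from log2
# 11:19:00 from log2
# 14:18:00 from log1
```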

Note: it is best to compress the merged output before handing it to webalizer.

Some systems and some programs can handle files larger than 2 GB, and some cannot; avoid such files unless you have confirmed that every program and operating system involved can process them. So if the merged output exceeds 2 GB, gzip the log before sending it to webalizer: analyzing a file larger than 2 GB raw is more likely to hit file-system or tool limits, and gzip also greatly reduces the I/O during analysis.
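As a sketch of that advice, the merged file can simply be compressed in place; most webalizer builds are compiled with zlib and read .gz input directly, though that is worth verifying on your system:

```shell
# Stand-in for the real merged log (the path is illustrative).
printf 'line1\nline2\n' > /tmp/log_all

# gzip typically shrinks CLF logs by 90% or more, which also cuts
# the I/O the analyzer performs.
gzip -f /tmp/log_all        # replaces /tmp/log_all with /tmp/log_all.gz

# webalizer can then be pointed at the compressed file, e.g.:
#   webalizer -o /path/to/stats /tmp/log_all.gz
gzip -dc /tmp/log_all.gz    # round-trip check: prints line1, line2
```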

This is how logs are merged in chronological order.

{2} log rotation mechanism:

Let's look at the data source first. webalizer is really a monthly statistics tool, and it supports incremental statistics: for a large service, I can merge the apache logs day by day and feed them to webalizer. But how do you truncate the WEB logs by day (for example, at 00:00:00 every night)?

If you use crontab to back up the log to access_log_yesterday at 00:00 every day:


mv /path/to/apache/log/access_log /path/to/apache/log/access_log_yesterday

 

you must also restart apache immediately afterwards; otherwise apache, still holding the old file handle, keeps writing to the renamed file and never starts the new log. Archiving this way therefore disturbs the apache service with a restart every night.

A simple method that does not affect the service is copy-then-truncate:


cp /path/to/apache/log/access_log /path/to/apache/log/access_log_yesterday
echo > /path/to/apache/log/access_log
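Wrapped in a crontab entry, the copy-then-truncate approach looks something like this (a sketch; the schedule and paths are illustrative):

```shell
# crontab entry: at 00:00 every night, archive then truncate the log.
# min hour dom mon dow  command
0 0 * * * cp /path/to/apache/log/access_log /path/to/apache/log/access_log_yesterday && echo > /path/to/apache/log/access_log
```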

 

A careful analyst will notice a problem: cp cannot strictly guarantee a zero-point cut. If the copy itself takes, say, 6 seconds, the lines logged between 00:00:00 and 00:00:06 are swept into the archived access_log_yesterday as well. Within a single day, a few hundred stray lines make no difference to the statistics; but on the first day of a new month, the day's logs to be merge-sorted contain entries from two different months:


[31/Mar/2002:23:59:59 +0800]
[31/Mar/2002:23:59:59 +0800]
[01/Apr/2002:00:00:00 +0800]
[01/Apr/2002:00:00:00 +0800]


  
The catch is that a field like [01/Apr/2002:00:00:00 cannot be sorted across days as plain text: the date format is dd/Mmm/yyyy with an English month name, so sorting by letter is likely to put the records in the wrong order:


[01/Apr/2002:00:00:00 +0800]
[01/Apr/2002:00:00:00 +0800]
[01/Apr/2002:00:00:00 +0800]
[01/Apr/2002:00:00:00 +0800]
[01/Apr/2002:00:00:00 +0800]
[01/Apr/2002:00:00:00 +0800]
[01/Apr/2002:00:00:00 +0800]
[31/Mar/2002:23:59:59 +0800]
[31/Mar/2002:23:59:59 +0800]
[31/Mar/2002:23:59:59 +0800]
[31/Mar/2002:23:59:59 +0800]
[31/Mar/2002:23:59:59 +0800]

 

To an analysis tool such as webalizer, this out-of-order data around the month boundary is like swallowing a bug: the run may lose all of the previous month's data! Such records therefore make processing the last day of each month risky.
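The misordering is easy to reproduce with sort alone: compared as plain strings, "[01/Apr..." sorts before "[31/Mar..." because the character "0" precedes "3", even though March comes before April:

```shell
# Two timestamps straddling midnight at the month boundary.
printf '[31/Mar/2002:23:59:59 +0800]\n[01/Apr/2002:00:00:00 +0800]\n' > /tmp/boundary

# A plain lexicographic sort puts the April line first -- chronologically wrong.
sort /tmp/boundary
# [01/Apr/2002:00:00:00 +0800]
# [31/Mar/2002:23:59:59 +0800]
```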

There are several ways to solve the problem:

1. Post-processing:

On the first day of each month, use grep -v to strip the other month's stray lines out of the archived log. For example, the log archived on the morning of April 1 holds March 31's data plus the leaked April lines, which can be removed with:


grep -v "01/Apr" access_log_04_01 > access_log_new

 

Another post-processing idea is to make SORT itself understand the dates. sort does have a dedicated option for date sorting, -M (note: uppercase M), which compares the specified field as a month name rather than by letter. But splitting the month out of the apache log's date field with sort's delimiters is troublesome (I tried using "/" as the delimiter and sorting on the "month" and "year:time" fields). It could be done with a PERL script, but I eventually gave up: it violates a system administrator's design principle, generality, and you should keep asking yourself: isn't there a simpler way? Changing the log format to a TIMESTAMP is yet another option (the SQUID log, for example, does not have this problem, because it natively records a numeric TIMESTAMP), but then I cannot guarantee that every log tool will recognize the non-standard date field.
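For the record, GNU sort's -M option really does order English month names chronologically; the trouble the author hit is not -M itself but isolating the month from apache's dd/Mmm/yyyy:hh:mm:ss field using sort's delimiter-based keys:

```shell
# -M compares three-letter English month abbreviations in calendar order.
printf 'Mar\nApr\nJan\n' | sort -M
# Jan
# Mar
# Apr

# With "/" as the delimiter, the month is field 2 of an apache timestamp,
# so an attempt along the author's lines would be (GNU sort assumed):
#   sort -t / -k 2,2M -k 3 log_all
# ...but the day then sits in field 1, mixed with the IP and the rest of
# the request line, which is exactly why this route was abandoned.
```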

2. Optimize the data source:

The best way is to fix the data source itself: guarantee that it is rotated exactly by day, so that each day's log contains only that day's data. Then no matter which tool you use to analyze the logs (commercial or free), no complicated log preprocessing is needed.

You may first think of controlling the truncation time more precisely, but cutting the log one minute before midnight is no better than one minute after: either way you cannot keep cross-day records out of a log, because you cannot predict how long the archiving process will take.

Therefore, you have to turn to a log rotation tool, and it must meet the following requirements:

1. It must not interrupt the WEB service: no stop apache => move logs => restart apache cycle.
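The text breaks off here, but the abstract's first conclusion names the tool that meets this requirement: cronolog. Run as a piped logger, it opens a new file per day on its own, with no restart and no copy window. A typical httpd.conf directive would be (the cronolog path and log path are illustrative):

```apache
# Pipe the access log through cronolog; at the first request after midnight
# it starts a new file, e.g. access_log.20020403, without touching apache.
CustomLog "|/usr/sbin/cronolog /path/to/apache/log/access_log.%Y%m%d" common
```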
