Web site server log management and analysis


Managing a Web site involves more than monitoring the speed of the server and the delivery of Web content. Beyond watching the server's daily throughput, an administrator also needs to understand how the site is accessed from outside and how each of its pages is used. Based on the click frequency of each page, the content and quality of pages can be improved, the readability of content enhanced, the steps involved in business transactions tracked, and the "behind the scenes" data of the Web site managed.

To provide better WWW service, monitoring the operation of the Web server and understanding the detailed access statistics of the site's content are increasingly important and urgent. These requirements can be met through statistics and analysis of the Web server's log files. This article discusses the principles and techniques of Web server log analysis.

The related tools used in this article are as follows:

Webalizer: http://www.mrunix.net/webalizer/

cronolog: http://www.cronolog.org/

Apache: http://www.apache.org/

The principle of Web log analysis

The Web server log records raw information such as the requests the Web server receives and run-time errors. Through statistics, analysis, and synthesis of the log, we can effectively grasp the server's operating status, discover and troubleshoot the causes of errors, understand the distribution of client accesses, and better support system maintenance and management.

The WWW service model is very simple:

1. The client (browser) establishes a TCP connection with the Web server and, once the connection is established, sends an access request (such as GET) to the server. According to the HTTP protocol, the request contains information such as the client's IP address, browser type, and the requested URL.

2. After the Web server receives the request, it returns the requested page content to the client; if an error occurs, an error code is returned instead.

3. The server side logs access information and error messages to the log file.

The following is the content of a request that a client sends to the Web server:

GET /engineer/ideal/list.htm HTTP/1.1
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-powerpoint, application/vnd.ms-excel, application/msword, */*
Referer: http://www.linuxaid.com.cn/engineer/ideal/
Accept-Language: zh-cn
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)
Host: www.linuxaid.com.cn
Connection: Keep-Alive

As you can see, the client's request contains a lot of useful information, such as the client type. The Web server then sends the requested page content back to the client.

At present, common Web servers include Apache, Netscape Enterprise Server, MS IIS, and so on. The Web server most commonly used on the Internet is Apache, so the discussion in this article is based on a Linux + Apache environment (other environments are similar). Apache supports multiple log file formats; the most common are the common and combined formats. The combined format adds two fields to the common format: Referer (where the request came from, for example a link from a Yahoo! search) and User-Agent (the client type, such as Mozilla or IE). The following is an example of combined-format log entries:

218.242.102.121 - - [06/Dec/2002:00:00:00 +0000] "GET /2/face/shnew/ad/via20020915logo.gif HTTP/1.1" 304 0 "http://www.mpsoft.net/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"

61.139.226.47 - - [06/Dec/2002:00:00:00 +0000] "GET /cgi-bin/guanggaotmp.cgi?1 HTTP/1.1" 200 178 "http://www3.beareyes.com.cn/1/index.php" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)"

218.75.41.11 - - [06/Dec/2002:00:00:00 +0000] "GET /2/face/shnew/ad/via20020915logo.gif HTTP/1.1" 304 0 "http://www.mpsoft.net/" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)"

61.187.207.104 - - [06/Dec/2002:00:00:00 +0000] "GET /images/logolun1.gif HTTP/1.1" 304 0 "http://www2.beareyes.com.cn/bbs/b.htm" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

211.150.229.228 - - [06/Dec/2002:00:00:00 +0000] "GET /2/face/pub/image_top_l.gif HTTP/1.1" 200 260 "http://www.beareyes.com/2/lib/200201/12/20020112004.htm" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"

As can be seen from the log file above, each record contains the client's IP address, the time of access, the requested page, the status code returned by the Web server, the size in bytes of the content returned to the client, the referring address of the request, the client browser type, and so on.
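As a quick illustration (a sketch added here, not part of the original article), the whitespace-separated fields of a combined-format entry can be pulled apart with standard shell tools; the log line below is the first sample entry above:

```shell
# Extract the IP address (field 1), status code (field 9) and byte
# count (field 10) from a combined-format log line, using awk's
# default whitespace field splitting.
line='218.242.102.121 - - [06/Dec/2002:00:00:00 +0000] "GET /2/face/shnew/ad/via20020915logo.gif HTTP/1.1" 304 0 "http://www.mpsoft.net/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"'
echo "$line" | awk '{ print "ip=" $1, "status=" $9, "bytes=" $10 }'
# prints: ip=218.242.102.121 status=304 bytes=0
```

The same one-liner, pointed at a whole access_log, is often the quickest way to spot-check what a statistics package will later report.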

Configuration and management of Apache logs

In this article, it is assumed that Apache runs two virtual hosts: www.secfocus.com and www.tomorrowtel.com. We need to perform access log analysis and statistics for each of these two virtual hosts.

In the Apache configuration file, there are two log-related directives to pay attention to:

CustomLog /www/logs/access_log common
ErrorLog /www/logs/error_log

CustomLog indicates where Apache's access log is stored (here /www/logs/access_log) and its format (here common); ErrorLog indicates where Apache's error log is stored.

For a server without virtual hosts, simply find and modify the CustomLog directive in httpd.conf. For a Web server with multiple virtual hosts, the access log of each virtual host needs to be separated in order to provide access statistics and analysis for each one. Therefore, a separate log configuration is required in each virtual host's configuration, as in the following example:

NameVirtualHost 75.8.18.19

<VirtualHost 75.8.18.19>
    ServerName www.secfocus.com
    ServerAdmin [email protected]
    DocumentRoot /www/htdocs/secfocus/
    CustomLog "/www/log/secfocus" combined
    Alias /usage/ "/www/log/secfocus/usage/"
</VirtualHost>

<VirtualHost 75.8.18.19>
    ServerName www.tomorrowtel.com
    ServerAdmin tomorrowtel@tomorrowtel.com
    DocumentRoot /www/htdocs/tomorrowtel
    CustomLog "/www/log/tomorrowtel" combined
    Alias /usage/ "/www/log/tomorrowtel/usage/"
</VirtualHost>

Note that each virtual host definition contains a CustomLog directive specifying the file that holds that virtual host's access log, and an Alias directive that makes the report generated by log analysis accessible at www.secfocus.com/usage/. With the configuration above, the log files are saved separately.

The next problem is log file rotation. The log keeps growing; if it is left unprocessed, the file becomes larger and larger, affecting the efficiency and speed of the Web server, and it may eventually exhaust the server's disk space, causing the server to malfunction. In addition, if a single log file grows beyond the operating system's file size limit, the Web service can be affected further. Unrotated log files are also inconvenient for log analysis: statistical analysis is usually run once per day, and a log spanning a long period makes the analysis program run very slowly. Therefore, the Web server's log files need to be rotated daily.

Web server log round robin

There are three ways to rotate Web server logs: the first is to use logrotate, the log rotation mechanism that comes with the Linux system; the second is to use rotatelogs, Apache's own log rotation program; the third is to use cronolog, a relatively mature log rotation tool recommended in the Apache FAQ.

Large Web services often use load-balancing technology to improve capacity, so multiple servers in the background provide the Web service, which greatly facilitates service planning and scalability. In such a distributed setup, the logs of the multiple servers need to be merged for unified statistical analysis. Therefore, to ensure the accuracy of the statistics, the logs must be cut strictly on daily boundaries.

Using Logrotate to implement log rotation

First, we discuss using logrotate, the log rotation mechanism that comes with the Linux system. logrotate is a program dedicated to rotating various system logs (syslogd, mail, and so on). It is run by the crond service daily at 4:02 am. In the /etc/cron.daily directory you can find the logrotate script, which reads as follows:

#!/bin/sh
/usr/sbin/logrotate /etc/logrotate.conf

Every morning crond runs the logrotate script in /etc/cron.daily to perform log rotation.

You can see the following in /etc/logrotate.conf:

# see "man logrotate" for details
# rotate log files weekly
weekly

# keep 4 weeks worth of backlogs
rotate 4

# create new (empty) log files after rotating old ones
create

# uncomment this if you want your log files compressed
#compress

# RPM packages drop log rotation information into this directory
include /etc/logrotate.d

# no packages own wtmp -- we'll rotate them here
/var/log/wtmp {
    monthly
    create 0664 root utmp
    rotate 1
}

# system-specific logs may also be configured here.

As can be seen from the logrotate configuration file, apart from wtmp, the configuration for logs that need rotating is kept in the /etc/logrotate.d directory. Therefore, you only need to create a configuration file named apache under this directory to instruct logrotate how to rotate the Web server's log files. Here is an example:

/www/log/secfocus {
    rotate 2
    daily
    missingok
    sharedscripts
    postrotate
        /usr/bin/killall -HUP httpd 2>/dev/null || true
    endscript
}

/www/log/tomorrowtel {
    rotate 2
    daily
    missingok
    sharedscripts
    postrotate
        /usr/bin/killall -HUP httpd 2>/dev/null || true
    endscript
}

Here "rotate 2" means that only two backups are kept in the rotation; that is, there will be three files: access_log, access_log.1, and access_log.2. This achieves rotation of the log files for the two virtual hosts. Later in this article, we discuss how to use log analysis software to process the log files.

The advantage of this approach is that log rotation is achieved without any third-party tools. However, it is not practical for heavily loaded servers or for Web servers that use load-balancing technology, because it sends a -HUP restart signal to the corresponding service process in order to truncate and archive the log, which can interrupt the continuity of the service.

Using Rotatelogs to implement log rotation

Apache provides the ability to send logs through a pipe to another program instead of writing them to a file, which greatly enhances log-processing capabilities. The program on the other end of the pipe can be any log-handling program, such as a log analyzer or a log compressor. To write logs to a pipe, simply replace the file name in the log configuration with "| program name", for example:

# compressed logs
CustomLog "|/usr/bin/gzip -c >> /var/log/access_log.gz" common

On this basis, Apache's own rotation tool rotatelogs can be used to rotate the log files. rotatelogs controls the log by time or by size:

CustomLog "|/www/bin/rotatelogs /www/logs/secfocus/access_log 86400" common

In the example above, the Apache access log is sent to the program rotatelogs, which writes the log to /www/logs/secfocus/access_log and rotates it every 86,400 seconds (one day). The rotated file is named /www/logs/secfocus/access_log.nnnn, where nnnn is the time at which logging started. Therefore, for the logs to align to day boundaries, the service needs to start a new log at 00:00, so that each day's rotated log contains exactly the full day's data, ready for the statistical analysis program to process. A log started at 00:00 produces a rotated file whose suffix is the timestamp of that moment.
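The nnnn suffix is the system time in seconds rounded down to a multiple of the rotation interval, which is why a log started exactly on a day boundary covers one whole day. A quick sketch of that arithmetic (an illustration added here, not part of the original article):

```shell
# rotatelogs names each rotated file access_log.<nnnn>, where <nnnn>
# is the current epoch time rounded down to a multiple of the
# rotation interval.
interval=86400
now=$(date +%s)
suffix=$(( now / interval * interval ))
echo "access_log.$suffix"
```

Because the suffix is always an exact multiple of 86,400, two servers rotating on the same day produce files with the same suffix, which simplifies merging later.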

Using Cronolog to implement log rotation

First you need to download and install cronolog; the latest version can be downloaded from http://www.cronolog.org. After the download is complete, unpack and install it as follows:

$ tar xvfz cronolog-1.6.2.tar.gz
$ cd cronolog-1.6.2
$ ./configure
$ make
$ make check
$ make install

This completes the configuration and installation of cronolog; by default it is installed as /usr/local/sbin/cronolog.

Then modify the Apache log configuration directive as follows:

CustomLog "|/usr/local/sbin/cronolog /www/logs/secfocus/%w/access_log" combined

Here %w means that logs are saved in different directories by day of the week, which keeps one week's worth of logs.

For log analysis, the log file needs to be copied (or moved, if you do not want to keep a week of logs) every day to a fixed location where the log analysis program can process it. This is done with a cron job (crontab -e). Add a scheduled task as follows:

5 0 * * * /bin/mv /www/logs/secfocus/`date -d yesterday +\%w`/access_log /www/logs/secfocus/access_log_yesterday

Then use the Log statistical analysis program to process the file access_log_yesterday.
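The backtick expression in the crontab entry expands to yesterday's day-of-week number, selecting the directory cronolog wrote to the previous day. Assuming GNU date (standard on Linux; this check is an illustration, not from the original article), the expansion can be verified by hand:

```shell
# %w prints the day of the week as 0..6 (Sunday = 0);
# "-d yesterday" shifts the reference time back one day, so this is
# the directory name cronolog used for yesterday's log.
today=$(date +%w)
yesterday=$(date -d yesterday +%w)
echo "today=$today yesterday=$yesterday"
```

Note that in a crontab the % character must be escaped as \%, which is why the entry above writes +\%w.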

For large sites that use load balancing, there is the additional problem of consolidating the access logs of multiple servers. In this case, the individual servers cannot all use the name access_log_yesterday when defining or moving their log files; the names should be distinguished by a server identifier (such as the server's IP address). You can run the mirroring and backup service rsyncd on each server, and then each day use rsync to download every server's log to the server dedicated to access statistics and analysis.

The method to merge the log files of multiple servers (for example log1, log2, log3) and output the result to log_all is:

$ sort -m -t " " -k 4 -o log_all log1 log2 log3

-m indicates using the merge optimization algorithm (the inputs are already sorted); -t " " sets the field separator to a space; -k 4 means sorting by the fourth field (the timestamp); -o stores the result in the specified file.
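As a small sketch (the file names and log lines are made up for illustration, not from the original article), merging two per-server fragments that are each already in time order interleaves them correctly without a full re-sort:

```shell
# Two per-server log fragments, each already sorted by time.
# -t " " splits fields on spaces; -k 4 compares from the 4th field,
# the [day/month/year:time] stamp; -m merges without re-sorting.
printf '%s\n' \
  'a - - [06/Dec/2002:00:00:01 +0000] "GET /x HTTP/1.1" 200 10' \
  'a - - [06/Dec/2002:00:00:05 +0000] "GET /y HTTP/1.1" 200 10' > log1
printf '%s\n' \
  'b - - [06/Dec/2002:00:00:03 +0000] "GET /z HTTP/1.1" 200 10' > log2
sort -m -t " " -k 4 -o log_all log1 log2
cut -d " " -f 4 log_all   # shows the three timestamps in merged order
```

The lexicographic comparison on field 4 works here because all the entries share the same day; timestamps from different months would not compare correctly this way, which is another reason to cut and merge logs per day.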

Installation and configuration of the Log Statistics Analyzer Webalizer

Webalizer is an efficient, free Web server log analyzer. Its analysis results are in HTML format, making it easy to browse them through a Web server. Many sites on the Internet use Webalizer for Web server log analysis. Webalizer has the following features:

It is written in C and is highly efficient. On a machine with a 200 MHz CPU, Webalizer can analyze about 10,000 records per second, so it takes only about 15 seconds to analyze a 40 MB log file.

Webalizer supports the standard Common Logfile Format. In addition, it supports several variants of the Combined Logfile Format, which can be used to produce statistics on client browsers and operating systems. Webalizer now also supports the wu-ftpd xferlog format and the Squid log file format.

It supports configuration on the command line as well as through configuration files.

It supports multiple languages, and you can do your own localization.

It supports multiple platforms, such as UNIX, Linux, Windows NT, OS/2, and Mac OS.

The access statistics report generated by Webalizer contains tables and bar charts of monthly visit statistics. Clicking on a month shows detailed daily statistics for that month.

1. Installation

Before installing, you need to make sure that the GD library is installed on the system. You can check with the following command:

# rpm -qa | grep gd
gd-devel-1.8.4-4
gdbm-devel-1.8.0-14
gdbm-1.8.0-14
sysklogd-1.4.1-8
gd-1.8.4-4

This confirms that the system has the gd and gd-devel RPM packages installed.

There are two ways to install Webalizer: download the source code and build it, or install directly from an RPM package.

Installing from RPM is very simple: find the Webalizer package on rpmfind.net, download it, and run the following command to install it:

$ rpm -ivh webalizer-2.01_10-1.i386.rpm

For the source code method, download it from http://www.mrunix.net/webalizer/ and then install. First unpack the source package:

$ tar xvzf webalizer-2.01-10-src.tgz

In the resulting directory there is a lang subdirectory, which holds the files for the various languages. There is only a Traditional Chinese version, which can be converted to Simplified Chinese or re-translated. Then enter the generated directory:

$ cd webalizer-2.01-10
$ ./configure --with-language=chinese
$ make
$ make install

After a successful compilation, the webalizer executable is installed in the /usr/local/bin/ directory.

2. Configuration and operation

Webalizer's run can be controlled either through a configuration file or by specifying parameters on the command line. Using a configuration file is simple and flexible, and suits an environment where Web server log analysis is automated.

Webalizer's default configuration file is /etc/webalizer.conf. When Webalizer is started without the "-f" option, it looks for /etc/webalizer.conf; alternatively, "-f" can be used to specify a configuration file (when the server has virtual hosts, several different Webalizer configuration files are needed, one per virtual host). The options that need to be modified in webalizer.conf are as follows:

LogFile /www/logs/secfocus/access_log

This indicates the path of the log file that Webalizer takes as input for statistical analysis.

OutputDir /www/htdocs/secfocus/usage

This indicates the directory in which the generated statistics report is saved; the Alias directive configured earlier lets users access the report at http://www.secfocus.com/usage/.

HostName www.secfocus.com

The line above indicates the host name, which is referenced in the statistics report.

Other options do not need to be modified. After modifying the configuration file, Webalizer needs to be run regularly to generate the daily statistical analysis.

Run crontab -e as root to enter the scheduled-task editor, and add the following tasks:

5 0 * * * /usr/local/bin/webalizer -f /etc/secfocus.webalizer.conf
15 0 * * * /usr/local/bin/webalizer -f /etc/tomorrowtel.webalizer.conf

This assumes that the system runs two virtual hosts and that separate log analysis configuration files, secfocus.webalizer.conf and tomorrowtel.webalizer.conf, have been defined for them. The entries schedule the statistical analysis of the secfocus log at 00:05 and of the tomorrowtel log at 00:15.

The next day, visit http://www.secfocus.com/usage and http://www.tomorrowtel.com/usage to view the respective log analysis reports.

Protecting the log statistics report from unauthorized access

We do not want our site's access statistics to be browsed by others, so the usage directory needs to be protected so that only legitimate users can access it. Apache's built-in Basic authentication mechanism can be used; after configuration, connecting to this address requires the user to provide a password to access the page (see step 3 below).

1. Conditions

In the configuration file, the settings for the directory should be:

DocumentRoot /www/htdocs/secfocus/
AccessFileName .htaccess
AllowOverride All

2. Requirements

The requirement is to restrict access to http://www.secfocus.com/usage/ and to require user authentication. Here the user is "admin" and the password is "12345678".

Use htpasswd to create the user file:

$ htpasswd -c /www/.htpasswd admin

The program will ask for user admin's password; enter "12345678" twice for it to take effect.
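The entry that htpasswd writes is "admin:" followed by a password hash. As a side illustration (assuming the openssl command-line tool is available; this example is not part of the original article), an Apache-compatible MD5 ("apr1") hash of the same password can be generated with openssl, producing a line that could be appended to /www/.htpasswd by hand:

```shell
# "openssl passwd -apr1" produces a hash in the same $apr1$... format
# that htpasswd has traditionally used by default on Linux.
hash=$(openssl passwd -apr1 12345678)
echo "admin:$hash"
```

This is handy when htpasswd is not installed on the machine where the password file is being prepared.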

3. Create a. htaccess file

Use vi to create a file named .htaccess in the /www/logs/secfocus/usage/ directory and write the following four lines into it:

AuthName admin-only
AuthType Basic
AuthUserFile /www/.htpasswd
Require user admin

4. Test

When you access http://www.secfocus.com/usage through a browser, a dialog box pops up requesting a user name and password; enter "admin" and "12345678" to access the log statistics report.
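Behind the popup, the browser simply resends the request with an Authorization header containing base64 of "user:password". As an illustration (not from the original article), the header value for this user can be reproduced in the shell:

```shell
# HTTP Basic authentication sends "Authorization: Basic <token>",
# where <token> is the base64 encoding of "user:password".
token=$(printf '%s' 'admin:12345678' | base64)
echo "Authorization: Basic $token"
# prints: Authorization: Basic YWRtaW46MTIzNDU2Nzg=
```

This also shows why Basic authentication only obscures, rather than encrypts, the credentials: base64 is trivially reversible, so the mechanism should be combined with access controls or an encrypted transport where the statistics are sensitive.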

