A Complete Solution for Web Server Log Statistical Analysis

Source: Internet
Author: User
Tags apache access log apache log rsync

Software discussed in this article:

Webalizer http://www.mrunix.net/webalizer/
Cronolog http://www.cronolog.org/
Apache http://www.apache.org/

I. Preface

As Web services on the Internet have developed, almost every government department, company, university and research institute has built or is building its own website. Each of them runs into a variety of problems while doing so. A detailed and comprehensive analysis of how the Web server is running and how it is being accessed shows how the site is performing and where it falls short, and its importance for the site's further development is self-evident.

Managing a website is not just a matter of monitoring Web speed and content transfer. It requires attention to the server's daily throughput, but also an understanding of how the site is accessed from outside and which pages are visited, so that the content and quality of each page can be improved according to how often it is requested, content can be made more readable, the steps involved in business transactions can be tracked, and the "behind-the-scenes" data of the site can be managed.
To provide a better WWW service, it is increasingly important to monitor how the Web server is running and to understand in detail how the site's content is visited. Both needs can be met by collecting statistics from, and analyzing, the Web server's log files.

II. Principles of Web Log Analysis

The Web server log records raw information about the requests the server receives and processes, and about run-time errors. By collecting statistics from the log and analyzing and summarizing them, we can keep track of the server's running state, find the causes of errors, understand how client access is distributed, and maintain and manage the system better.

The WWW service model is very simple (see Figure 1):

1. The client (browser) establishes a TCP connection to the Web server and then issues a request (for example, GET). Following the HTTP protocol, the request carries a series of details such as the browser type and the requested URL; the client's IP address is known to the server from the connection itself.

2. When the Web server receives the request, it returns the requested page content to the client. If an error occurs, it returns an error code.

3. The server writes the access information and any error information to its log files.

Figure 1 Web access mechanism

The following is the content of a request that a client sent to the Web server:

GET /engineer/ideal/list.htm HTTP/1.1
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-powerpoint, application/vnd.ms-excel, application/msword, */*
Referer: http://www.linuxaid.com.cn/engineer/ideal/
Accept-Language: zh-cn
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)
Host: www.linuxaid.com.cn
Connection: Keep-Alive

As you can see, the client's request contains a lot of useful information, such as the client type. The Web server then sends the requested page content back to the client.

The most common Web servers currently available are Apache, Netscape Enterprise Server and MS IIS. Apache is the most widely used Web server on the Internet, so the discussion here covers the Linux + Apache environment; other setups are similar. Apache supports a variety of log file formats, the most common being common and combined. Compared with common, the combined format adds the Referer (where the request came from, for example the Yahoo search engine) and the User-Agent (the client type, such as Mozilla or IE). Here we discuss the combined type. The following is an example of a log in the combined format:

218.242.102.121 - - [06/Dec/2002:00:00:00 +0000] "GET /2/face/shnew/ad/via20020915logo.gif HTTP/1.1" 304 0 "http://www.mpsoft.net/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows)"
61.139.226.47 - - [06/Dec/2002:00:00:00 +0000] "GET /cgi-bin/guanggaotmp.cgi?1 HTTP/1.1" 178 "http://www3.beareyes.com.cn/1/index.php" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)"
218.75.41.11 - - [06/Dec/2002:00:00:00 +0000] "GET /2/face/shnew/ad/via20020915logo.gif HTTP/1.1" 304 0 "http://www.mpsoft.net/" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)"
61.187.207.104 - - [06/Dec/2002:00:00:00 +0000] "GET /images/logolun1.gif HTTP/1.1" 304 0 "http://www2.beareyes.com.cn/bbs/b.htm" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
211.150.229.228 - - [06/Dec/2002:00:00:00 +0000] "GET /2/face/pub/image_top_l.gif HTTP/1.1" * "http://www.beareyes.com/2/lib/200201/12/20020112004.htm" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"

From the log above you can see that each entry records the client's IP address, the time of the access, the requested page, the status code the Web server returned for the request, the size of the content returned to the client (in bytes), the referring address of the request, and the client browser type.
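The fields listed above can be pulled out of a log line programmatically. The following is a minimal Python sketch, not part of the original article: the regular expression and the field names (host, ident, user and so on) are labels chosen for illustration, and the sample line is modeled on the first log entry above.

```python
import re
from datetime import datetime

# Regular expression for Apache's combined log format. The group names are
# illustrative labels chosen for this sketch, not Apache identifiers.
COMBINED_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_combined(line):
    """Return a dict of fields for one combined-format line, or None."""
    match = COMBINED_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # 06/Dec/2002:00:00:00 +0000 -> timezone-aware datetime
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    fields["status"] = int(fields["status"])
    # Apache writes "-" when no bytes were sent
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields

sample = (
    '218.242.102.121 - - [06/Dec/2002:00:00:00 +0000] '
    '"GET /2/face/shnew/ad/via20020915logo.gif HTTP/1.1" 304 0 '
    '"http://www.mpsoft.net/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows)"'
)
entry = parse_combined(sample)
print(entry["host"], entry["status"], entry["referer"])
```

Lines that do not match the pattern return None rather than raising an error, so a damaged entry can simply be skipped or counted.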

III. Configuration and Management of Apache Logs

In this article we assume that our Apache server runs two virtual hosts: www.secfocus.com and www.tomorrowtel.com. We need separate access log analysis and statistics for each of them.

In the Apache configuration file there are two log-related directives we need to be concerned with:

CustomLog /www/logs/access_log common
ErrorLog /www/logs/error_log

CustomLog specifies where Apache stores its access log (here /www/logs/access_log) and in which format (here common). ErrorLog specifies where Apache stores its error log.

For a server without virtual hosts, it is enough to find the CustomLog directive in httpd.conf and adjust it. For a Web server with multiple virtual hosts, the access logs of the virtual hosts have to be kept apart so that access statistics and analysis can be produced for each one. This requires a separate log configuration inside each virtual host definition, for example:

NameVirtualHost 75.8.18.19

<VirtualHost 75.8.18.19>
ServerName www.secfocus.com
ServerAdmin secfocus@secfocus.com
DocumentRoot /www/htdocs/secfocus/
CustomLog "/www/log/secfocus" combined
Alias /usage/ "/www/log/secfocus/usage/"
</VirtualHost>

<VirtualHost 75.8.18.19>
ServerName www.tomorrowtel.com
ServerAdmin tomorrowtel@tomorrowtel.com
DocumentRoot /www/htdocs/tomorrowtel
CustomLog "/www/log/tomorrowtel" combined
Alias /usage/ "/www/log/tomorrowtel/usage/"
</VirtualHost>

Note that each virtual host definition has its own CustomLog directive, which specifies the file that stores that host's access log. The Alias directive makes the report generated by the log analysis reachable at an address such as www.secfocus.com/usage/. With the configuration above, the log files are saved.

This leaves the problem of log file rotation. The log grows continuously, and if nothing is done the files become larger and larger. That can slow the Web server down and may use up the server's disk space until the server stops working properly. If a single log file grows beyond the operating system's file size limit, the Web service is affected further. Rotation also matters to the log analysis program: it produces statistics by day, so a log that spans a long period makes it run particularly slowly. The Web server log files therefore need to be rotated every day.

IV. Web Server Log Rotation

There are three ways to rotate Web server logs. The first is to use the Linux system's own log rotation mechanism, logrotate. The second is to use rotatelogs, the log rotation program that ships with Apache. The third is to use cronolog, a more mature log rotation tool recommended in the Apache FAQ.

Large Web services often use load balancing to increase capacity, with several back-end servers providing the same Web service. This makes it much easier to plan the distribution of the service and to extend it, but the logs spread across the servers have to be merged before they can be analyzed together. To keep the statistics accurate, log files therefore have to be generated automatically and strictly on daily boundaries.

4.1 Log rotation with logrotate

First we discuss the Linux system's own log rotation mechanism, logrotate. logrotate ships with the Linux system and is dedicated to rotating the various system logs (syslogd, mail). It is started by the crond service at 4:02 every day. You can see the logrotate file in the /etc/cron.daily directory, which reads as follows:

#!/bin/sh
/usr/sbin/logrotate /etc/logrotate.conf

So every morning crond runs the logrotate script in the /etc/cron.daily directory to rotate the logs.

The contents of /etc/logrotate.conf are as follows:

# see "man logrotate" for details
# rotate log files weekly
weekly
# keep 4 weeks worth of backlogs
rotate 4
# create new (empty) log files after rotating old ones
create
# uncomment this if you want your log files compressed
#compress
# RPM packages drop log rotation information into this directory
include /etc/logrotate.d
# no packages own wtmp -- we'll rotate them here
/var/log/wtmp {
    monthly
    create 0664 root utmp
    rotate 1
}

# system-specific logs may also be configured here.

The logrotate configuration file shows that, apart from wtmp, the settings for the logs to be rotated are kept in the /etc/logrotate.d directory. All we need to do is create a configuration file named apache in that directory to tell logrotate how to rotate the Web server's log files. Here is an example:

/www/log/secfocus {
    rotate 2
    daily
    missingok
    sharedscripts
    postrotate
        /usr/bin/killall -HUP httpd 2>/dev/null || true
    endscript
}
/www/log/tomorrowtel {
    rotate 2
    daily
    missingok
    sharedscripts
    postrotate
        /usr/bin/killall -HUP httpd 2>/dev/null || true
    endscript
}

Here "rotate 2" means that only two backup files are kept during rotation, so for a log named access_log there are at most three files: access_log, access_log.1 and access_log.2. This sets up rotation for both log files. How to process the log files with log analysis software is discussed later.
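The effect of "rotate 2" on the set of files can be illustrated with a small simulation. This is only a sketch of the renaming scheme described above, not logrotate's actual implementation; the function name rotate and its parameters are made up for this example.

```python
def rotate(files, base="access_log", keep=2):
    """Simulate one rotation pass over a set of file names.

    `files` is the set of names present before rotation; the return value
    is the set present afterwards. `keep` plays the role of "rotate N".
    """
    after = set()
    for n in range(keep, 0, -1):
        older = f"{base}.{n}"
        newer = base if n == 1 else f"{base}.{n - 1}"
        if newer in files:
            after.add(older)      # the newer file is renamed one step older
    after.add(base)               # a fresh live log takes the original name
    return after

files = {"access_log"}
for day in range(1, 5):
    files = rotate(files)
    print(day, sorted(files))
```

After the second pass the set stops growing: the oldest backup is overwritten on every later rotation, so no more than three files exist at any time.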

The advantage of this approach is that log rotation needs no third-party tools. However, it is not practical for heavily loaded servers or for Web servers behind load balancing: to make Apache switch to a new log file, a -HUP restart signal is sent to the httpd processes, and this interrupts the continuity of the service.

4.2 Log rotation with Apache's own rotatelogs

Apache can send its log directly to another program through a pipe instead of writing it to a file. This greatly extends what can be done with the log, because the program can be anything: a log analyzer, a compressor and so on. To write the log to a pipe, replace the log file part of the directive with "|program", for example:

# Compressed logs
CustomLog "|/usr/bin/gzip -c >> /var/log/access_log.gz" common

This makes it possible to rotate the log files with rotatelogs, the rotation tool that ships with Apache. rotatelogs can rotate logs by time or by size.

CustomLog "|/www/bin/rotatelogs /www/logs/secfocus/access_log 86400" common

In the example above, Apache sends the access log to rotatelogs, which writes it to /www/logs/secfocus/access_log and rotates it every 86,400 seconds (one day). Each rotated file is named /www/logs/secfocus/access_log.nnnn, where nnnn is the system time, in seconds, at which that log nominally starts. This time is always a multiple of the rotation period, so with a period of 86400 every file begins at midnight (UTC unless an offset is configured) and holds exactly one full day of entries, which is what the statistical analysis program needs as input.
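The file name for a given moment can be worked out in a few lines. The sketch below assumes, as the rotatelogs documentation describes, that the suffix is the start of the current rotation interval in seconds since the epoch, rounded down to a multiple of the rotation time; the helper name rotatelogs_name is made up for this example.

```python
import time

def rotatelogs_name(base, rotation_seconds, now=None):
    """Return the file name a rotatelogs-style scheme would use at `now`.

    The suffix is the start of the current rotation interval, expressed
    in seconds since the epoch and rounded down to a multiple of
    `rotation_seconds`.
    """
    if now is None:
        now = int(time.time())
    start = now - (now % rotation_seconds)
    return f"{base}.{start}"

# 1039170600 is 06 Dec 2002 10:30:00 UTC; the interval began at midnight UTC.
print(rotatelogs_name("/www/logs/secfocus/access_log", 86400, now=1039170600))
# -> /www/logs/secfocus/access_log.1039132800
```

Because the suffix is predictable, a cron job can compute the name of the previous day's file and hand it to the analysis program.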

4.3 Log rotation with cronolog

First download and install cronolog; the latest version is available from http://www.cronolog.org. After downloading, unpack and install it as follows:

[root@mail root]# tar xvfz cronolog-1.6.2.tar.gz
[root@mail root]# cd cronolog-1.6.2
[root@mail cronolog-1.6.2]# ./configure
[root@mail cronolog-1.6.2]# make
[root@mail cronolog-1.6.2]# make check
[root@mail cronolog-1.6.2]# make install

This completes the configuration and installation of cronolog. By default cronolog is installed under /usr/local/sbin.
The Apache log directive is then changed as follows:

CustomLog "|/usr/local/sbin/cronolog /www/logs/secfocus/%w/access_log" combined

Here %w means that logs are stored in separate directories by day of the week, so one week of logs is kept. For log analysis, each day's log file needs to be copied (or moved, if you do not want to keep a week of logs) to a fixed location where the log analysis program can process it. Use crontab -e to add the following scheduled task:

5 0 * * * /bin/mv /www/logs/secfocus/`date -v-1d +\%w`/access_log /www/logs/secfocus/access_log_yesterday

The log analysis program then processes the file access_log_yesterday. Note that date -v-1d is the BSD form; with GNU date on Linux the equivalent is date -d yesterday.
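The path that the scheduled task moves can also be computed without relying on a particular date implementation. The following sketch assumes the %w directory layout configured above; the function name yesterday_log_path is made up for this example.

```python
from datetime import date, timedelta

def yesterday_log_path(root, today=None):
    """Build the path of yesterday's log under a cronolog '%w' layout."""
    if today is None:
        today = date.today()
    yesterday = today - timedelta(days=1)
    # strftime('%w') gives the weekday as a digit, 0 = Sunday ... 6 = Saturday
    return f"{root}/{yesterday.strftime('%w')}/access_log"

# 6 December 2002 was a Friday, so yesterday was Thursday (weekday 4).
print(yesterday_log_path("/www/logs/secfocus", today=date(2002, 12, 6)))
# -> /www/logs/secfocus/4/access_log
```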

For large sites that use load balancing, the access logs of several servers have to be merged. In that case the servers cannot all use the name access_log_yesterday when defining or moving their log files; the name should include something that identifies the server, such as its number or IP address. Then run the rsync daemon (rsyncd) used for site mirroring and backup on each server (see the article "Implementing site mirroring and backup with Rsync", http://www.linuxaid.com.cn/engineer/ideal/article/rsync.htm), and use rsync to download each server's daily log file to a server dedicated to access statistics and analysis.

To merge the log files of several servers, for example log1, log2 and log3, into a single file log_all, use:

sort -m -t " " -k 4 -o log_all log1 log2 log3

-m merges files that are already sorted, -t " " sets a space as the field separator, -k 4 sorts on the fourth field (the timestamp), and -o writes the result to the specified file.
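The same merge can be sketched in Python. Unlike the text comparison performed by sort, this version parses the timestamp, so it also orders entries correctly across month boundaries. It is an illustration only: the sample lines and addresses are made up, and each input is assumed to be in time order already, as a single server's log is.

```python
import heapq
from datetime import datetime

def timestamp(line):
    """Extract the [dd/Mon/yyyy:HH:MM:SS +zzzz] field as a datetime."""
    start = line.index("[") + 1
    end = line.index("]", start)
    return datetime.strptime(line[start:end], "%d/%b/%Y:%H:%M:%S %z")

def merge_logs(*sources):
    """Lazily merge line iterables that are each already in time order."""
    return heapq.merge(*sources, key=timestamp)

# Made-up sample entries standing in for the logs of two servers.
log1 = [
    '10.0.0.1 - - [06/Dec/2002:00:00:01 +0000] "GET /a HTTP/1.1" 200 10',
    '10.0.0.1 - - [06/Dec/2002:00:00:05 +0000] "GET /c HTTP/1.1" 200 30',
]
log2 = [
    '10.0.0.2 - - [06/Dec/2002:00:00:03 +0000] "GET /b HTTP/1.1" 200 20',
]

merged = list(merge_logs(log1, log2))
for line in merged:
    print(line)
```

With real files, open file objects can be passed to merge_logs in place of the lists, since heapq.merge reads its inputs one line at a time.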

V. Installation and Configuration of the Log Analysis Program Webalizer

Webalizer is an efficient, free Web server log analyzer. Its output is in HTML, so the results are easy to browse through a Web server. Many sites on the Internet use Webalizer for Web server log analysis. Webalizer has the following features:

    1. It is written in C, so it runs efficiently. On a 200 MHz machine Webalizer can analyze 10,000 records per second, so parsing a 40 MB log file takes only about 15 seconds.
    2. It supports the standard Common Logfile Format as well as several variants of the Combined Logfile Format, which makes it possible to report on clients and their operating system types. Webalizer also supports the wu-ftpd xferlog format and the Squid log format.
    3. It can be configured from the command line and from configuration files.
    4. It supports multiple languages, and you can also localize it yourself.
    5. It runs on a variety of platforms, such as UNIX, Linux, NT, OS/2 and MacOS.

The first page of the access statistics report generated by Webalizer (shown in a figure in the original article) contains a table and a bar chart summarizing the average number of visits for each month. Clicking a month shows detailed statistics for each day of that month.

5.1 Installation

Before installing, make sure that the GD library is installed on the system. You can check with:

[root@mail root]# rpm -qa | grep gd
gd-devel-1.8.4-4
gdbm-devel-1.8.0-14
gdbm-1.8.0-14
sysklogd-1.4.1-8
gd-1.8.4-4

The output should confirm that both the gd-devel and gd RPM packages are installed.

There are two ways to install Webalizer: build it from the source code, or install it directly from an RPM package.

Installing from an RPM package is very simple. Find the Webalizer package on rpmfind.net, download it, and install it with:

rpm -ivh webalizer-2.01_10-1.i386.rpm

To install from source, first download the source code from http://www.mrunix.net/webalizer/ and unpack the source package:

tar xvzf webalizer-2.01-10-src.tgz

The unpacked directory contains a lang directory with language files for many languages. For Chinese there is only a Traditional Chinese version; you can convert it to Simplified Chinese or translate it yourself. Then enter the unpacked directory:

cd webalizer-2.01-10
./configure --with-language=chinese
make
make install

Once compilation succeeds, the webalizer executable is installed in the /usr/local/bin/ directory.

5.2 Configuring and Running

Webalizer can be controlled either through a configuration file or through parameters on the command line. A configuration file is simpler and more flexible, and suits environments where Web server log analysis runs automatically.

The default configuration file for Webalizer is /etc/webalizer.conf. When Webalizer is started without the "-c" option it looks for the default file /etc/webalizer.conf; use "-c" to specify a different configuration file (useful when one server hosts several virtual hosts, each of which needs its own Webalizer configuration file). The options that need to be changed in webalizer.conf are as follows:

LogFile /www/logs/secfocus/access_log

Specifies the path of the log file that Webalizer takes as input for its statistical analysis.

OutputDir /www/htdocs/secfocus/usage

Specifies the directory in which the generated reports are saved. We set up an alias for it earlier, so users can reach the statistics report at http://www.secfocus.com/usage/.

HostName

Specifies the host name, which is shown in the statistics report.

The other options do not need to be changed. Once the configuration file has been edited, Webalizer has to be run on a schedule so that it produces the day's statistics every day.

As root, run crontab -e to edit the scheduled tasks and add the following entries:

5 0 * * * /usr/local/bin/webalizer -c /etc/secfocus.webalizer.conf
15 0 * * * /usr/local/bin/webalizer -c /etc/tomorrowtel.webalizer.conf

Here we assume that two virtual hosts are running and that the log analysis configuration files secfocus.webalizer.conf and tomorrowtel.webalizer.conf have been defined for them respectively. The Secfocus log is then analyzed at 00:05 and the Tomorrowtel log at 00:15.

The next day, visit http://www.secfocus.com/usage/ and http://www.tomorrowtel.com/usage/ to see the respective log analysis reports.

VI. Protecting the Log Analysis Reports from Unauthorized Access

We certainly do not want our site's access statistics to be browsed by just anyone, so the usage directory needs to be protected and opened only to legitimate users. Apache's own Basic authentication can be used for this: once it is configured, anyone connecting to this address must supply a user name and password to reach the page.



1. Prerequisites

The directory "/" in the configuration file should be set up as follows:

DocumentRoot /www/htdocs/secfocus/
AccessFileName .htaccess
AllowOverride All

2. Requirements

Restrict access to http://www.secfocus.com/usage/ so that user authentication is required. In this example the user is "admin" and the password is "12345678".

3. Create the user file with htpasswd

htpasswd -c /www/.htpasswd admin
The program asks for the password of user "admin". Enter "12345678" twice for it to take effect.

4. Create the .htaccess file

Use vi to create a file named .htaccess in the /www/logs/secfocus/usage/ directory and write the following lines:
AuthName admin-only
AuthType Basic
AuthUserFile /www/.htpasswd
Require user admin

5. Testing

When the address is now opened in a browser, a dialog box pops up asking for a user name and password. After entering admin and 12345678, the access log analysis report can be viewed.
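To test the protected address from a script instead of a browser, the client has to send the credentials itself. With Basic authentication this is an Authorization header carrying the Base64 encoding of "user:password". The sketch below only builds that header value for the example account above; the function name basic_auth_header is made up for this example.

```python
import base64

def basic_auth_header(user, password):
    """Return the Authorization header value for HTTP Basic authentication."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"

print(basic_auth_header("admin", "12345678"))
# -> Basic YWRtaW46MTIzNDU2Nzg=
```

Base64 is an encoding, not encryption, so anyone who can observe the request can recover the password. The example password is for illustration only.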
