Overview of log analysis methods

Logs are a very broad concept in computer systems: any program may output logs, from the operating system kernel to various application servers. Log content, size, and usage also vary widely, which makes them hard to discuss in general terms.

The logs discussed in this article refer only to Web logs. There is no precise definition; they may include, but are not limited to, the user access logs generated by front-end Web servers such as Apache, lighttpd, and Tomcat, as well as the logs output by Web applications themselves.

In Web logs, each entry usually represents one user access. For example, the following is a typical Apache log line:

211.87.152.44 - - [18/Mar/2005:12:21:42 +0800] "GET / HTTP/1.1" 200 899 "http://www.baidu.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Maxthon)"

From this log we can extract a lot of useful information: the visitor's IP address, the access time, the target page, the referring address, and the UserAgent of the visitor's client. Anything beyond that has to be obtained by other means: to get the user's screen resolution, you generally need JavaScript code that sends a separate request; to record something like the title of the news article a user viewed, the Web application has to output it in its own code.

Why analyze logs?

Without a doubt, Web logs contain a large amount of information that people, mainly product analysts, will be interested in. At the simplest level, we can obtain the PV (PageView, page traffic) of each type of page, the number of unique IP addresses (that is, the count of IPs after deduplication), and so on. Slightly more complex, we can compute search-keyword rankings or the pages where users stay the longest. More complex still, we can build ad click models and analyze user behavior characteristics.

Since the data is so useful, there are already countless ready-made tools to help us analyze it, such as AWStats and Webalizer, free programs dedicated to the statistical analysis of Web server logs.

There is also another class of products that do not analyze logs directly, but instead collect statistics through JavaScript code embedded in the page; you can think of this as outputting the logs directly to their servers. The best-known representative is Google Analytics, with CNZZ and Baidu Statistics filling the same role in China.

Many people may ask: in that case, why analyze logs ourselves? Is it necessary? Of course it is. The needs of our users (product analysts) are endless, and powerful as these tools are, they obviously cannot meet every requirement.

Whether it is a local analysis tool or an online analysis service, although they provide rich statistical functions and allow some configuration, they are still quite limited. For slightly more complex analysis, or for log-based data mining, you still need to do it yourself.

In addition, the vast majority of log analysis tools run only on a single machine and cannot cope once the data volume grows. Meanwhile, online analysis services usually cap the traffic of a single site, which is easy to understand: they have their own server load to consider.

Therefore, you often have to rely on yourself.

How to perform log analysis

This is not a simple problem. Even if we restrict "logs" to Web logs, they still come in thousands of possible formats, and "analysis" is even harder to pin down: it may be a simple statistical computation or a complex data mining algorithm.

We do not intend to discuss those complex issues here, but only to sketch, in general terms, how to build a foundation for log analysis. With that foundation, simple log-based statistics become trivial, and complex analysis and mining become feasible.

Small amount of data

First, consider the simplest case: the data is relatively small, perhaps dozens of MB, a few hundred MB, or at most dozens of GB; in short, small enough that processing it on a single machine is tolerable. Then everything is easy. The ready-made Unix/Linux tools, awk, grep, sort, join, and so on, are all powerful instruments for log analysis: if you only want the PV of a page, a single wc + grep will do. For slightly more complex logic, almost any scripting language, especially Perl with its great regular expressions, can solve the problem.

For example, to obtain the top 100 IP addresses by access volume from the Apache log described above, the implementation is simple:

cat logfile | awk '{a[$1]++} END {for (b in a) print b "\t" a[b]}' | sort -k2 -nr | head -n 100

However, when we need to analyze logs frequently, this approach becomes a headache after a while: how do we maintain all the log files, analysis scripts, crontab entries, and so on? There also tends to be a lot of duplicated code for parsing and cleaning data formats. At that point you may want something more appropriate, such as a database.

Of course, using a database for log analysis has a cost of its own; the biggest issue is how to import all kinds of heterogeneous log files into the database, a process usually called ETL (Extraction-Transformation-Loading). Fortunately, there are various open-source, free tools that can help, and when there are not many log types, it is not hard to write a few simple scripts to do the job. For example, we could strip the unneeded fields from the log above and import the rest into a database table.

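As a rough illustration, the resulting table might look something like the sketch below; the field names and types are assumptions chosen to match the log fields mentioned above, not a prescribed schema.

-- Hypothetical schema for the imported Apache access log
-- (field names and types are illustrative assumptions)
CREATE TABLE apache_log (
    ip          VARCHAR(15)  NOT NULL,   -- visitor IP address
    access_time DATETIME     NOT NULL,   -- time of the request
    url         VARCHAR(255) NOT NULL,   -- requested page
    status      SMALLINT     NOT NULL,   -- HTTP status code
    bytes       INT          NOT NULL,   -- response size in bytes
    referer     VARCHAR(255),            -- source address
    user_agent  VARCHAR(255)             -- client UserAgent string
);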

Now we need to decide which database to store the data in. MySQL is the classic open-source choice; its traditional engines (MyISAM or InnoDB, both row-oriented) are not particularly well suited to log data, but for small volumes they are good enough. There are also better options in this space, such as the open-source, free Infobright and InfiniDB, storage engines optimized specifically for data-warehouse workloads: they use column-oriented storage with good data compression, and handling hundreds of GB of data is basically not a problem for them.

One advantage of using a database is that SQL lets us complete most statistical analysis with ease: PV only requires SELECT + COUNT, and ranking search keywords only requires SELECT + COUNT + GROUP BY + ORDER BY + LIMIT. In addition, the structured storage model of a database simplifies the management of log data and reduces operations and maintenance costs.
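For instance, here are two minimal sketches of such queries; the url column follows the hypothetical schema above, and the search_log table with its keyword column is likewise an assumption for illustration.

-- PV of a single page: SELECT + COUNT
SELECT COUNT(*) AS pv FROM apache_log WHERE url = '/index.html';

-- Search-keyword ranking, assuming a hypothetical search_log(keyword) table:
-- SELECT + COUNT + GROUP BY + ORDER BY + LIMIT
SELECT keyword, COUNT(*) AS cnt
FROM search_log
GROUP BY keyword
ORDER BY cnt DESC
LIMIT 20;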

For the same top-100-IP example above, a single SQL statement will do:

SELECT * FROM (SELECT ip, COUNT(*) AS ip_count FROM apache_log GROUP BY ip) a ORDER BY ip_count DESC LIMIT 100

As for performance, database indexes and various optimization mechanisms usually make our statistical work run faster, and the Infobright and InfiniDB mentioned above are specifically optimized for aggregations such as SUM and COUNT. It is not always faster, of course: a LIKE operation in the database, for example, is usually much slower than grepping a file.

Furthermore, with the data stored in a database, it also becomes easy to build OLAP (online analytical processing) applications and mine more value from the logs.

What about more data?

A good database seems to make things easy, but do not forget that everything above assumes a single-machine database. A single machine is inevitably limited in storage capacity and concurrency. One characteristic of log data is that it keeps growing over time, and many analyses require historical data. Short-term growth can be handled by database sharding, table sharding, or data compression, but these are clearly not long-term solutions.

To solve the problems of data growth once and for all, the natural thought is distributed technology. Combined with the conclusions above, a distributed database sounds like a good choice: everything would remain completely transparent to end users. That is indeed the ideal situation, but reality is often cruel.

First, implementing a well-rounded distributed database (subject to the CAP theorem) is a very complicated problem, so unlike the single-machine case there are not many good open-source options here, and not even many commercial ones. This is not absolute, of course; if you have the money, you can still consider Oracle RAC, Greenplum, and the like.

Second, the vast majority of distributed databases are NoSQL, so you can largely forget about the advantages of continuing to use SQL; they are replaced by simple but awkward interfaces. Seen from this angle, the value of using these databases drops considerably.

So let us be realistic and look at how to solve the log analysis problem at very large scale, rather than how to make it as simple as in the small-data case. Doing so, it turns out, is not too difficult nowadays, and there is even a free lunch to be had.

Hadoop is a distributed system under the Apache Foundation, comprising the Hadoop Distributed File System (HDFS), the MapReduce computing framework, HBase, and many other components; these are essentially clones of Google's GFS/MapReduce/BigTable.

After several years of development, Hadoop has become quite mature, especially the HDFS and MapReduce components. Clusters of hundreds of machines have been proven usable and can handle PB-scale data.

HBase, within the Hadoop project, is a column-oriented NoSQL distributed database. Its functionality and interfaces are simple and support only basic key-value queries, so it is not directly suitable for most log analysis applications. To use Hadoop for log analysis, therefore, the logs must first be stored in HDFS, and a log analysis program must then be written against its MapReduce API.

MapReduce is a distributed programming model that is not hard to learn, but the cost of using it to process logs is clearly much higher than a single-machine script or SQL. A simple term-frequency count that takes one line of SQL may require hundreds of lines of code, plus a complex environment for preparing and launching the job.

For the example above, the implementation is considerably more involved and usually requires two rounds of MapReduce. In the mapper of the first round, we accumulate partial access counts per IP and output the IP as the key:

// Traverse the input records and accumulate partial counts per IP
foreach (record in input) {
    ip = record.ip;
    dict[ip]++;
}

// Output with emit; the first parameter is the key, used to partition data to reducers
foreach ((ip, count) in dict) {
    emit(ip, count);
}

Then, in the reducer of the first round, we obtain the complete count for each IP; we can sort along the way and keep only the top 100 IPs:

count = 0;

// For each key (ip), iterate over all its values (partial counts) and accumulate
while (input.values.hasNext()) {
    count += input.values.next();
}

// Insert the (ip, count) pair into a heap that keeps at most 100 entries
heap_insert(input.key, count);

Output at the end of reduce:

// Output the 100 IPs with the highest counts seen by this reduce task
foreach ((ip, count) in heap) {
    emit(ip, count);
}

Since there are usually many reduce tasks, the outputs of all of them still have to be merged and sorted to obtain the final top 100 IP addresses and their access counts.

So using Hadoop for log analysis is clearly not a simple matter; it brings considerable extra learning and operations cost. But at least it makes log analysis at very large scale possible.

How to make it easier

Nothing is easy at very large scale, log analysis included, but that does not mean distributed log analysis has to mean writing MapReduce code. You can always add further abstraction to make things simpler for a specific class of applications.

Some may naturally think how nice it would be to operate on Hadoop data with SQL. You are not alone: many people have thought so, they implemented the idea, and the result is Hive.

Hive is now a subproject of Hadoop. It lets us run MapReduce through a SQL interface and even provides JDBC and ODBC interfaces, essentially packaging Hadoop as a database. Of course, Hive SQL is ultimately translated into MapReduce code for execution, so even the simplest query may take tens of seconds to run. Fortunately, that is usually acceptable for offline log analysis. More importantly, for the example above, we can once again finish the task with the same kind of SQL statement.
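A minimal sketch, assuming the logs have been loaded into a Hive table also named apache_log with an ip column; the statement is essentially the same as the single-machine SQL above.

-- Top 100 IPs by access count, executed on Hadoop via Hive
SELECT ip, COUNT(*) AS ip_count
FROM apache_log
GROUP BY ip
ORDER BY ip_count DESC
LIMIT 100;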

Of course, Hive is not fully compatible with standard SQL syntax, and it cannot completely hide the underlying details from users. In many cases, to get acceptable execution performance you still need some basic MapReduce knowledge and have to set a few parameters according to your workload; otherwise you may find a query running painfully slowly, or not running at all.
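As one example of such tuning, the number of reduce tasks is a knob commonly set from within Hive; the parameter name below is the classic pre-YARN one and depends on the Hive/Hadoop version, so treat it as an assumption.

-- Explicitly set the number of reduce tasks for subsequent queries
SET mapred.reduce.tasks = 32;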

In addition, Hive clearly cannot cover every need, so it still retains an interface for plugging in raw MapReduce code as an extension.

More questions

Even with something database-like such as Hive, there is still plenty left to do. For example, over time there will be more and more SQL statements that need to run routinely; some of them will be repetitive, and some will be extremely inefficient, with a single complex statement occupying all the computing resources. Such a system becomes ever harder to maintain, until one day the routine SQL simply fails to finish in time. End users do not care about any of this; they only care whether the queries they submit get an immediate response and how to obtain results as quickly as possible.

As a simple example, if we find that almost no query against apache_log ever uses the user_agent field, we could drop the field entirely, or split it out into a second table, reducing the I/O of most queries and improving execution efficiency.
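A minimal sketch of such a vertical split; the side-table name and the log_id row key are assumptions for illustration.

-- Move the rarely used user_agent into a side table (assumes a log_id row key),
-- so that the few queries that need it can still JOIN it back
CREATE TABLE apache_log_agent AS
SELECT log_id, user_agent FROM apache_log;

-- Then drop the column from the main table to cut I/O for most queries
ALTER TABLE apache_log DROP COLUMN user_agent;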

To solve these problems systematically, we probably need to introduce a scheduling mechanism for routine tasks; we may need to analyze all the SQL statements to find out which can be merged and which need performance tuning, and whether the tables they use should be split horizontally or vertically. Depending on the situation, this may be done by hand, or by writing a program that analyzes and adjusts automatically.

Furthermore, as log types and analysis requirements keep growing, more and more users will complain that it is hard to find the data they want in the logs, or that a query that used to run fine suddenly breaks because the log format changed. On top of that, the ETL process mentioned above grows more complex, and a simple conversion-and-import script will no longer solve the problem. At that point you may need to build a data management system, or simply consider building a so-called data warehouse.

In short, as log volume, log types, the number of users, and analysis requirements keep growing, more and more problems will surface, and log analysis may turn out not to be as simple as we first thought. But it will also become more and more valuable, and more and more challenging.
