Overview of log analysis methods

Logs are a very broad concept in computer systems, and almost any program can output them: operating system kernels, application servers of all kinds, and so on. Their content, size, and purpose vary so widely that it is hard to say anything general about them.

The logs discussed in this article are Web logs only. There is no precise definition here; the term may include, but is not limited to, the user access logs produced by front-end Web servers such as Apache, lighttpd, and Tomcat, as well as the logs that Web applications output themselves.

In a Web log, each entry usually represents a single access by one user. The following is a typical Apache log entry:

211.87.152.44 - - [18/Mar/2005:12:21:42 +0800] "GET / HTTP/1.1" 200 899 "http://www.baidu.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Maxthon)"

From this entry we can extract a lot of useful information: the visitor's IP, the time of the visit, the target page, the referring address, and the User-Agent of the client the visitor used. Information beyond this has to be obtained by other means: for example, getting the user's screen resolution generally requires JavaScript code that sends a separate request, and information such as the specific news headline a user viewed usually has to be output by the Web application itself in its own code.

Why analyze logs

There is no doubt that Web logs contain large amounts of information that people, mainly product analysts, will be interested in. At the simplest level, we can get the PV (page view) count for each type of page on the site, the number of unique IPs (i.e., the count of distinct IPs), and so on. Slightly more complex, we can compute a leaderboard of the keywords users searched for, or the pages on which users spend the most time. More complex still, we can build ad click models, analyze user behavior patterns, and so on.

Since the data is so useful, there are of course countless tools to help us analyze it, such as AWStats and Webalizer, free programs dedicated to statistical analysis of Web server logs.

There is also another kind of product that does not analyze logs directly but instead has users embed JavaScript code in their pages to report statistics, which we can think of as shipping the log straight to their servers. Typical representatives are the famous Google Analytics, plus the domestic CNZZ and Baidu Tongji.

Many people might ask: given all that, why analyze logs ourselves? Is it necessary? Of course it is. The needs of our users (product analysts) are endless, and these tools, however good and powerful, clearly cannot satisfy them all.

Whether they are local analysis tools or online analysis services, they can only be configured up to a point; despite their rich statistical capabilities, they are still limited. For slightly more complex analysis, or for log-based data mining, you still have to do it yourself.

In addition, the vast majority of log analysis tools only run on a single machine and cannot cope once the data volume grows a little too large. At the same time, services offering online analysis usually impose a maximum traffic limit per site, which is easy to understand: they also have to consider the load on their own servers.

So, most of the time, you have to rely on yourself.

How to perform log analysis

This is not a simple question. Even if we restrict "logs" to Web logs, they still come in thousands of possible formats and kinds of data, and "analysis" is even harder to pin down: it may be the calculation of simple statistics, or it may be a complex data mining algorithm.

The discussion below does not attempt to solve these complex problems; rather, it discusses in general terms how to build the foundation for log analysis work. With this foundation in place, simple log-based statistical analysis becomes genuinely simple, and complex analysis and mining become possible.

The small-data case

Consider the simplest case first: the data volume is small, perhaps tens of MB, hundreds of MB, or tens of GB; in short, small enough that processing on a single machine is still tolerable. Everything is easy: the ready-made Unix/Linux tools, awk, grep, sort, join, and so on, are the weapons of log analysis. If you only want the PV of a single page, a wc plus a grep will do. If the logic is slightly more complex, the various scripting languages, especially Perl, together with powerful regular expressions, can solve basically every problem.

For example, if we want the top 100 IP addresses by access count from the Apache log above, the implementation is simple:

cat logfile | awk '{a[$1]++} END {for (b in a) print b "\t" a[b]}' | sort -k2 -rn | head -n 100

However, once we need to analyze logs frequently, this approach starts to give us headaches after a while: how do we maintain the growing pile of log files, analysis scripts, crontab entries, and so on? There is also likely to be a lot of duplicated code for parsing and cleaning data formats. At that point something more suitable may be needed, such as a database.

Of course, using a database for log analysis has a cost, the main one being how to import the heterogeneous log files into the database, a process usually called ETL (Extraction-Transformation-Loading). Fortunately, there are all kinds of open source, free tools to help us with this, and when there are not too many log types it is not hard to write a few simple scripts ourselves. For example, we can strip the unneeded fields from the logs above and import the rest into a database table.
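As a minimal sketch, assuming a MySQL-style engine and the fields visible in the sample log line above (names and types are illustrative, not a fixed schema), such a table might look like this:

CREATE TABLE apache_log (
    ip          VARCHAR(15)  NOT NULL,   -- visitor IP address
    access_time DATETIME     NOT NULL,   -- time of the request
    url         VARCHAR(255) NOT NULL,   -- requested page
    status      SMALLINT     NOT NULL,   -- HTTP status code
    referer     VARCHAR(255),            -- where the visitor came from
    user_agent  VARCHAR(255)             -- client User-Agent string
);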

Now we need to consider which database to use to store this data. MySQL is a classic open source database, and although its traditional engines (MyISAM or InnoDB, row-oriented storage) may not be ideally suited to log data, they are sufficient while the data volume is small. There are also better choices in this area now, such as the open source and free Infobright and InfiniDB, data engines optimized specifically for data warehousing applications: column-oriented storage and good data compression mean that handling hundreds of GB of data is basically no problem.

One of the benefits of using a database is that the power of SQL can carry most of the statistical analysis work for us: PV needs only a SELECT with COUNT, and a search-keyword leaderboard needs only SELECT with COUNT, GROUP BY, ORDER BY, and LIMIT. In addition, the structured storage of the database itself makes managing the log data simpler and reduces operation and maintenance costs.
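As a minimal sketch against the apache_log table above: the PV case needs only a COUNT, and the keyword leaderboard assumes a hypothetical keyword column that the import script extracted from the referring search URL.

-- PV of one page: SELECT plus COUNT
SELECT COUNT(*) FROM apache_log WHERE url = '/index.html';

-- search keyword leaderboard: SELECT + COUNT + GROUP + ORDER + LIMIT
-- (keyword is a hypothetical column filled in during ETL)
SELECT keyword, COUNT(*) AS kw_count
FROM apache_log
GROUP BY keyword
ORDER BY kw_count DESC
LIMIT 20;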

For the same example as above, a simple SQL statement does the job:

SELECT * FROM (SELECT ip, COUNT(*) AS ip_count FROM apache_log GROUP BY ip) a ORDER BY ip_count DESC LIMIT 100;

As for performance, database indexes and various optimization mechanisms usually make our statistical analysis run faster, and the Infobright and InfiniDB mentioned above are specifically optimized for aggregate operations such as SUM and COUNT. It is not always faster, of course; LIKE operations in a database, for example, are usually much slower than grep.

Going further, database-backed storage makes it easy to build OLAP (online analytical processing) applications, and mining value from the logs becomes easier still.

What about more data?

A good database seems to make things a little easier, but don't forget that everything so far assumes a single-machine database. A single machine has undeniable limits on storage capacity and concurrency. A defining characteristic of log data is that it keeps growing over time, and many analyses require historical data. Short-term growth can be absorbed by splitting databases or tables, or by compressing data, but these are clearly not long-term solutions.

To solve the problem of growing data once and for all, it is natural to think of distributed technology. Combined with the conclusions above, a distributed database might seem the ideal choice: it would be completely transparent to end users. That is the ideal situation, but reality is often harsh.

First, implementing a reasonably complete distributed database (subject to the CAP theorem) is a very complex problem, so unlike in the single-machine world there are not many good open source options to choose from, and not even many commercial ones. There are exceptions, of course: if you have the money, you can still consider Oracle RAC, Greenplum, and the like.

Second, the vast majority of distributed databases are NoSQL, so do not expect to keep the advantages of SQL; they are replaced by simple, hard-to-use interfaces. From this point of view, the value of using such databases is greatly reduced.

So, realistically, let us step back and consider how to solve the problem of analyzing very large logs at all, rather than how to make it as simple as it is at small scale. Fortunately, just doing that is not too difficult today, and there is still a free lunch to be had.

Hadoop is a suite of distributed systems under the great Apache Foundation, including the distributed file system HDFS, the MapReduce computing framework, HBase, and many other components; these are essentially clones of Google's GFS, MapReduce, and BigTable.

Hadoop has matured over several years of development, especially the HDFS and MapReduce components. Clusters of hundreds of machines have been proven to work and can carry petabytes of data.

HBase, within the Hadoop project, is a column-oriented NoSQL distributed database. It provides simple key-value queries in both functionality and interface, so it is not directly suitable for most log analysis applications. The usual way to do log analysis with Hadoop is therefore to store the logs in HDFS first and then write the log analysis programs against the MapReduce API it provides.

MapReduce is a distributed programming model that is not hard to learn, but the cost of using it to process logs is clearly still much higher than that of a standalone script or SQL: a simple word-frequency count may require hundreds of lines of code where SQL needs one, plus complex environment preparation and startup scripts.

For our example, the implementation is much more complex and usually requires two rounds of MapReduce. First, in the mapper of the first round, count the number of accesses per IP and emit the result with the IP as the key:

// iterate over the input records and aggregate access counts per IP
foreach (record in input) {
    ip = record.ip;
    dict[ip]++;
}
// emit the results; the first parameter is the key, used to distribute work among the reducers
foreach (<ip, count> in dict) {
    emit(ip, count);
}

Then, in the reducer of the first round, we obtain the complete access count for each IP and, while we are at it, keep only the top 100:

// for each key (an IP), traverse all of its values (counts) and accumulate them
count = 0;
while (input.values.hasNext()) {
    count += input.values.next();
}
// insert into a heap that keeps only the 100 largest counts
heap_insert(input.key, count);

Then, at the end of the reduce phase:

// emit the 100 IPs with the highest counts seen by this reducer
foreach (<ip, count> in heap) {
    emit(ip, count);
}

Because there are generally many reducers, the outputs of all the reducers still have to be merged and re-sorted to obtain the final top 100 IPs and their access counts.

So, using Hadoop for log analysis is clearly not a simple matter; it brings considerable extra learning and operational cost. But at the very least, it makes log analysis at very large scale possible.

How to make it simpler

Nothing is easy on very large data, log analysis included, but that does not mean distributed log analysis must always mean writing MapReduce code. Further abstraction is always possible, and for specific applications things can be made easier.

It may occur to someone quite naturally: how nice it would be to manipulate the data on Hadoop with SQL. In fact, many people have thought exactly that, and they turned the idea into reality. That is how Hive came about.

Hive is now also a subproject of the Hadoop project. It lets us run MapReduce through an SQL interface and even provides JDBC and ODBC interfaces. With it, Hadoop is basically packaged up as a database. Of course, Hive's SQL is ultimately translated into MapReduce code for execution, so even the simplest query may take tens of seconds to run; fortunately, that is acceptable for typical offline log analysis. More importantly, for the example above, we can complete the same analysis task with essentially the same SQL.
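As a rough sketch rather than a definitive setup: assuming the logs have already been converted to tab-separated fields and stored under a hypothetical HDFS path /logs/apache/, the table definition and the same top-100-IP query might look like this in HiveQL:

CREATE EXTERNAL TABLE apache_log (
    ip          STRING,
    access_time STRING,
    url         STRING,
    status      INT,
    referer     STRING,
    user_agent  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/logs/apache/';

-- the same analysis as in the single-machine case, now executed as MapReduce jobs
SELECT ip, COUNT(*) AS ip_count
FROM apache_log
GROUP BY ip
ORDER BY ip_count DESC
LIMIT 100;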

Of course, Hive is not fully compatible with SQL syntax, nor does it completely hide the details from the user. Often, to get acceptable performance, users still need to understand some MapReduce basics and set a few parameters according to their own usage pattern; otherwise a query may run very slowly, or never finish at all.
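For illustration only (these are session settings from older Hive/Hadoop releases, and which knobs matter for a given workload is an assumption, not a prescription), such tuning might look like this:

-- fix the number of reduce tasks instead of letting Hive guess
SET mapred.reduce.tasks=32;
-- enable map-side partial aggregation to shrink the data shuffled to reducers
SET hive.map.aggr=true;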

Also, Hive clearly cannot cover every need, so it keeps an interface for plugging in hand-written MapReduce code as an extension mechanism.

More problems

Even with something database-like such as Hive, we still have plenty to do. For example, over time there may be more and more SQL that has to run routinely; some of it may repeat work done elsewhere, some may be inefficient, and a single complex query can saturate the computing resources. Such a system becomes harder and harder to maintain, until one day the routine SQL simply cannot finish running. End users, meanwhile, tend not to care about any of this; they only care that the queries they submit get a prompt response and that they get their results as quickly as possible.

As a simple example, if we find that almost none of the queries against apache_log use the user_agent field, we can either drop that field or split it out into a separate table, reducing the IO of most queries and improving execution efficiency.
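As a minimal generic-SQL sketch of the second option, vertical splitting (the exact DDL varies by engine, and request_id is a hypothetical column that uniquely identifies each entry):

-- move the rarely used column into its own table
CREATE TABLE apache_log_agent AS
SELECT request_id, user_agent FROM apache_log;

-- then remove it from the main table so that most queries read less data
ALTER TABLE apache_log DROP COLUMN user_agent;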

To solve these problems systematically, we may need to introduce a scheduling mechanism for routine tasks; we may need to analyze all of the SQL to find out what can be merged, what needs performance tuning, and whether the tables involved should be split horizontally or vertically, and so on. Depending on the situation, this can be done by hand or by writing programs that analyze and adjust automatically.

In addition, as log types and analysis needs keep growing, users will complain more and more that it is hard to find the data they want in this or that log, or that a query that used to work suddenly fails because the log format changed. Meanwhile, the ETL process mentioned above grows more complex, and simple conversion and import scripts are probably no longer enough. It may then be time to build a data management system, or simply to consider building a so-called data warehouse.

In short, as the volume of log data, the variety of log types, the number of users, and the analysis needs keep growing, more and more problems will surface. Log analysis may not be as simple as we first thought, but it will become more and more valuable, and more and more challenging.
