Logs are a very broad concept in computer systems: almost any program may output them, from the operating system kernel to all kinds of application servers. Their content, size, and uses also vary widely, making it hard to generalize about them.
The log processing discussed in this article is limited to Web logs. There is no precise definition of the term; it may include, but is not limited to, the user access logs produced by front-end Web servers such as Apache, lighttpd, and Tomcat, as well as the logs output by Web applications themselves.
In a Web log, each entry usually represents one user access. For example, the following is a typical Apache log entry:
    211.87.152.44 - - [18/Mar/2005:12:21:42 +0800] "GET / HTTP/1.1" 200 899 "http://www.baidu.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Maxthon)"
From this single entry we can already obtain a lot of useful information: the visitor's IP address, the access time, the target page, the referrer, and the UserAgent of the visitor's client. Information beyond this has to be collected by other means. For example, to obtain the resolution of the user's screen you generally need JavaScript code that sends a separate request; to record information such as the title of the specific news item a user viewed, the Web application has to output it in its own code.
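As an illustration, the fields of a combined-format entry like the one above can be pulled apart with a regular expression. The following is only a minimal sketch in Python; the pattern and the helper name parse_line are illustrative and would need adjusting to your server's actual LogFormat.

    import re

    # Regular expression for the Apache "combined" log format shown above;
    # adjust the pattern if your LogFormat directive differs.
    LOG_PATTERN = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
        r'(?P<status>\d{3}) (?P<size>\S+) '
        r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
    )

    def parse_line(line):
        """Return a dict of fields for one log line, or None if it does not match."""
        match = LOG_PATTERN.match(line)
        return match.groupdict() if match else None

    sample = ('211.87.152.44 - - [18/Mar/2005:12:21:42 +0800] "GET / HTTP/1.1" 200 899 '
              '"http://www.baidu.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Maxthon)"')
    print(parse_line(sample)['ip'])   # -> 211.87.152.44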
Why analyze logs?
Without a doubt, Web logs contain a large amount of information that people, mainly product analysts, will be interested in. The simplest metrics are the PV (page view) count of each type of page, the number of unique IP addresses (that is, the number of IPs after deduplication), and so on. Slightly more complex, we can compute keyword rankings or find the pages on which users stay the longest. More complex still, the logs can be used to build ad click models or to analyze user behavior patterns.
Since the data is so useful, there are of course plenty of ready-made tools to help us analyze it, such as AWStats and Webalizer, both free programs dedicated to the statistical analysis of Web server logs.
There is also another class of products that do not analyze the logs directly but gather statistics by embedding JavaScript code in the page; you can think of this as the page sending its logs straight to their servers. The best-known representative is Google Analytics, and in China there are also cnzz and Baidu Statistics.
Many people may ask: in that case, why do we still need to analyze logs ourselves? Is it really necessary? Of course it is. The needs of our users (the product analysts) are endless, and however powerful these tools are, they obviously cannot satisfy all of them.
Whether local analysis tools or online analysis services, they provide rich statistical functions and a certain amount of configurability, but they are still quite limited. For slightly more complex analysis, or for log-based data mining, you still have to do the work yourself.
In addition, the vast majority of log analysis tools can only run on a single machine and cannot cope once the data volume grows a bit. Meanwhile, online analysis services usually impose a maximum traffic limit per site, which is understandable, since they also have to consider their own server load.
Therefore, you often have to rely on yourself.
How to perform log analysis
This is not a simple problem. Even if we restrict "logs" to Web logs, they still come in thousands of possible formats and contents, and "analysis" is even harder to pin down: it may be the simple calculation of statistics, or it may be a complex data mining algorithm.
We do not intend to discuss those complex problems here, but only to discuss, in general terms, how to build a foundation for log analysis. With that foundation in place, simple log-based statistics become trivial, and complex analysis and mining become feasible.
Small amount of data
First, consider the simplest case: the data volume is relatively small, perhaps tens of MB, a few hundred MB, or a few dozen GB; in short, small enough that processing on a single machine is still tolerable. Then everything is easy. The ready-made Unix/Linux tools, awk, grep, sort, join, and so on, are all powerful weapons for log analysis. If you only want to know the PV of a page, a single wc + grep will do. For slightly more complex logic, the various scripting languages, especially Perl, combined with regular expressions, can solve almost any problem.
For example, to obtain the top 100 IP addresses with the highest access counts from the Apache log described above, the implementation is simple:
    cat logfile | awk '{a[$1]++} END {for (b in a) print b "\t" a[b]}' | sort -k2 -nr | head -n 100
However, once we need to analyze logs frequently, this approach may start to give us a headache: how do we maintain all the log files, the analysis scripts, the crontab entries, and so on? Moreover, a lot of code for parsing and cleaning the data formats tends to be duplicated. At this point you may need something more suitable, such as a database.
Of course, using a database for log analysis still takes some effort. The most important task is importing all kinds of heterogeneous log files into the database, a process usually called ETL (Extraction-Transformation-Loading). Fortunately, there are various open-source, free tools that can help with this, and when there are only a few kinds of logs, it is not hard to write a few simple scripts for the job. For example, we can strip the unnecessary fields from the Apache log above and import the rest into a database table, as sketched below:
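The following is a minimal ETL sketch rather than one of the ready-made tools just mentioned. It assumes a hypothetical table named apache_log with columns ip, access_time, url, status, and size, and it uses Python's built-in sqlite3 purely so the example is self-contained; in practice you would load the rows into MySQL or whichever engine you settle on.

    import re
    import sqlite3

    # Pattern for the fields we keep: ip, time, url, status, size.
    LOG_PATTERN = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
        r'"\S+ (?P<url>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
    )

    def load_logs(logfile, dbfile="weblog.db"):
        """Parse a log file and load the interesting fields into an apache_log table."""
        conn = sqlite3.connect(dbfile)
        conn.execute(
            "CREATE TABLE IF NOT EXISTS apache_log ("
            "ip TEXT, access_time TEXT, url TEXT, status INTEGER, size INTEGER)"
        )
        rows = []
        with open(logfile) as f:
            for line in f:
                m = LOG_PATTERN.match(line)
                if m is None:
                    continue          # skip lines that do not match the expected format
                size = 0 if m.group("size") == "-" else int(m.group("size"))
                rows.append((m.group("ip"), m.group("time"), m.group("url"),
                             int(m.group("status")), size))
        conn.executemany("INSERT INTO apache_log VALUES (?, ?, ?, ?, ?)", rows)
        conn.commit()
        conn.close()

    if __name__ == "__main__":
        load_logs("logfile")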
Now we need to decide which database to use to store the data. MySQL is the classic open-source choice; its traditional engines (MyISAM or InnoDB, both row-oriented) are not particularly well suited to log data, but they are sufficient for small volumes. There are also better options in this area, such as the open-source and free Infobright and InfiniDB, data engines optimized specifically for data-warehouse workloads: they use column-oriented storage with good data compression, and handling a few hundred GB of data is basically not a problem for them.
One of the advantages of using a database is that SQL lets us complete most statistical analysis very easily: a PV count needs only SELECT + COUNT, and a ranking of search keywords needs only SELECT + COUNT + GROUP BY + ORDER BY + LIMIT. In addition, the structured storage of a database simplifies the management of the log data and reduces operation and maintenance costs.
In the same example above, a simple SQL statement can be used:
    SELECT * FROM (SELECT ip, COUNT(*) AS ip_count FROM apache_log GROUP BY ip) a ORDER BY ip_count DESC LIMIT 100
As for performance, database indexes and various optimization mechanisms usually make our statistical queries run faster, and the Infobright and InfiniDB engines mentioned above are specially optimized for aggregate operations such as SUM and COUNT. Of course, a database is not always faster; a LIKE query, for instance, is usually much slower than grepping a file.
Furthermore, with database-based storage, you can easily perform OLAP (Online Analytical Processing) applications and mine more value from logs.
What about more data?
A good database seems to make things easy, but do not forget that everything above assumes a standalone database, and the storage capacity and concurrency of a single machine are inherently limited. One characteristic of log data is that it keeps growing over time, and many analyses also require historical data. Short-term growth can be handled by sharding databases, splitting tables, or compressing data, but these are clearly not long-term solutions.
To solve the problems brought by data growth once and for all, the natural thought is distributed technology. Combined with the conclusions above, a distributed database looks like a good choice, one that would remain completely transparent to the end users. That is indeed the ideal situation, but reality is often cruel.
First of all, implementing a perfect distributed database (subject to the CAP theorem) is a very complicated problem, so unlike with single-machine databases there are not many good open-source options here, and not even many commercial ones. Of course this is not absolute: if you have the money, you can still consider products such as Oracle RAC or Greenplum.
Secondly, the vast majority of distributed databases are NoSQL, so you can largely forget about the advantage of continuing to use SQL; it is replaced by simple interfaces that are awkward to use. Seen this way, the value of these databases is greatly reduced.
So let's be realistic and look at how to solve the log analysis problem at very large scale, rather than at how to make it as simple as it is with small data. Doing so, it turns out, is not particularly difficult nowadays, and there is even a free lunch to be had.
Hadoop is a distributed system under the Apache Foundation, comprising the Hadoop Distributed File System (HDFS), the MapReduce computing framework, HBase, and many other components; they are essentially clones of Google's GFS, MapReduce, and BigTable.
After several years of development, Hadoop has become very mature, especially its HDFS and MapReduce components. Clusters of hundreds of machines have been proven usable and can handle PB-scale data.
HBase, part of the Hadoop project, is a column-oriented NoSQL distributed database. It provides simple functions and interfaces and supports only simple key-value queries, so it is not directly suitable for most log analysis applications. To use Hadoop for log analysis, therefore, you first store the logs in HDFS and then write log analysis programs against its MapReduce API.
MapReduce is a distributed programming model that is not hard to learn, but it is clear that the cost of using it to process logs is still much higher than a single-machine script or SQL. A simple word-frequency count may take a hundred or more lines of code, where SQL needs only one line, plus a complex environment for preparing and launching the job.
For the running example above, the implementation is considerably more complicated and usually requires two rounds of MapReduce. In the first round, each mapper sums the access counts of the IP addresses in its portion of the input and emits the results with the IP as the key:
    // Traverse the input and aggregate the counts
    foreach (record in input) {
        ip = record.ip;
        dict[ip]++;
    }
    // Output with emit; the first parameter is the key, used to distribute the data to reducers
    foreach (<ip, count> in dict) {
        emit(ip, count);
    }
Then, in the first round of reduce, the partial counts for each IP can be summed to obtain its total access count.
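To make the first round concrete, here is a minimal sketch of how it might look with Hadoop Streaming and Python; the script name top_ip.py is illustrative only. It mirrors the pseudocode above: the mapper aggregates counts in a dictionary before emitting, and the reducer sums the partial counts for each IP. The second round, which sorts the totals to select the top 100, is omitted here.

    #!/usr/bin/env python
    # Minimal Hadoop Streaming job for round one: count accesses per IP.
    # Run the same script as mapper ("top_ip.py map") and reducer ("top_ip.py reduce").
    import sys
    from collections import defaultdict

    def map_phase():
        """Aggregate counts per IP inside the mapper, then emit ip<TAB>count."""
        counts = defaultdict(int)
        for line in sys.stdin:
            fields = line.split()
            if fields:
                counts[fields[0]] += 1    # the IP is the first field of each log line
        for ip, count in counts.items():
            print("%s\t%d" % (ip, count))

    def reduce_phase():
        """Sum the partial counts for each IP (input arrives grouped and sorted by key)."""
        current_ip, total = None, 0
        for line in sys.stdin:
            ip, count = line.rstrip("\n").split("\t")
            if current_ip is not None and ip != current_ip:
                print("%s\t%d" % (current_ip, total))
                total = 0
            current_ip = ip
            total += int(count)
        if current_ip is not None:
            print("%s\t%d" % (current_ip, total))

    if __name__ == "__main__":
        map_phase() if sys.argv[1] == "map" else reduce_phase()

The same script can also be tested locally, without a cluster, by piping: cat logfile | python top_ip.py map | sort | python top_ip.py reduce.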