Logs are a very broad concept in computing: almost any program may produce them, from the operating system kernel to all kinds of application servers. Their content, size, and purpose vary so much that it is hard to generalize about them.
The log processing discussed in this article covers web logs only. There is no precise definition here; it roughly includes, but is not limited to, the user access logs produced by front-end web servers such as Apache, lighttpd, and Tomcat, along with the logs that web applications write themselves.
In a web log, each entry usually represents one user access, as in this typical Apache log line:
211.87.152.44 - - [18/Mar/2005:12:21:42 +0800] "GET / HTTP/1.1" 200 899 "http://www.baidu.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Maxthon)"
From this line we can already extract plenty of useful information: the visitor's IP address, the time of the visit, the page visited, the referrer, and the User-Agent of the visitor's client. Anything beyond that requires other means. To obtain the user's screen resolution, for example, you generally need a piece of JS code that sends a separate request; and for application-specific details, such as which news headline the user viewed, the web application has to log them in its own code.
Why analyze logs
There is no doubt that web logs contain large amounts of information that people, product analysts above all, care about. At the simplest level, we can get the PV (page views) of each kind of page on the site, the number of unique IPs (i.e. the count after de-duplication), and so on. Slightly more involved, we can compute the ranking of the keywords users searched for, or the pages where users spend the most time. More complex still, we can build ad-click models, analyze user behavior patterns, and so on.
Since this data is so useful, there are of course plenty of ready-made tools to help us analyze it, for example AWStats and Webalizer, both free programs dedicated to statistical analysis of web server logs.
There is also another class of products that do not analyze logs directly. Instead, they have users embed a piece of JS code in the page to collect statistics; you could equally say that they have the logs sent straight to their own servers. Typical representatives are the famous Google Analytics and, in China, CNZZ and Baidu Tongji.
Many people will ask: given all this, why do we still need to analyze logs ourselves? Is it really necessary? Of course it is. The requirements of our users (product analysts) are endless, and these tools, good and powerful as they are, clearly cannot satisfy all of them.
Whether they are local analysis tools or online analytics services, they offer rich statistical features and allow some degree of configuration, but they remain quite limited. For slightly more complex analysis, or for data mining based on the logs, we still have to do the work ourselves.
Moreover, most log analysis tools only run on a single machine and struggle once the data volume grows a little. Meanwhile, online analytics services usually cap the traffic they accept per site, which is easy to understand: they have to watch their own server load as well.
So, much of the time, we still have to rely on ourselves.
How to analyze logs
This is not a simple question. Even if we restrict "logs" to web logs, they still come in thousands of possible formats and kinds of data, and "analysis" is even harder to pin down: it may be a simple statistical calculation or a complex data mining algorithm.
What follows does not try to tackle those complications. Instead it is a general discussion of how to build the foundation for log analysis work. With that foundation in place, simple log-based statistics become trivial, and complex analysis and mining become feasible.
A small amount of data
Consider the simplest case first: the data is small, perhaps tens or hundreds of MB, or a few tens of GB; in short, small enough for a single machine to handle comfortably. Then everything is easy. The standard Unix tools (awk, grep, sort, join, and so on) are sharp weapons for log analysis: if you just want the PV of one page, a grep plus wc will do. For slightly more complex logic, any scripting language, especially Perl with its powerful regular expressions, can solve basically every problem.
For example, to get the top 100 IPs by access count from the Apache log mentioned above, it is as simple as:
cat logfile | awk '{a[$1]++} END {for (b in a) print b"\t"a[b]}' | sort -k2 -rn | head -n 100
However, once we need to analyze logs frequently, the approach above starts to hurt after a while: maintaining the various log files, analysis scripts, crontab entries and so on becomes a chore, and a lot of code for parsing and cleaning data formats gets duplicated. At that point something more suitable is called for, such as a database.
Of course, using a database for log analysis has its costs, the main one being how to get all the heterogeneous log files into the database, a process commonly called ETL (Extraction, Transformation, Loading). Thankfully there are open-source, free tools that help with this, and when there are not too many log types, writing a few simple scripts to do the job is not hard either. For example, we can strip the unneeded fields from the log above and import the rest into a table like the following:
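As a minimal sketch (column names are illustrative, assuming we keep only the IP, access time, requested URL, status code, referrer, and user agent), the target table might look like this:

CREATE TABLE apache_log (
    log_id      BIGINT,          -- surrogate key, handy for later table splits
    ip          VARCHAR(15),     -- visitor IP address
    access_time DATETIME,        -- time of the request
    url         VARCHAR(255),    -- requested page
    status      SMALLINT,        -- HTTP status code
    referer     VARCHAR(255),    -- where the visitor came from
    user_agent  VARCHAR(255)     -- client information
);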
The next question is which database to use for storing the data. MySQL is a classic open-source database, and although its traditional engines (MyISAM or InnoDB, both row-oriented) are not particularly well suited to log data, they are good enough while the data is small. There are also better choices in this area now: Infobright and InfiniDB, for example, are open-source, free engines optimized specifically for data warehouse workloads. They use column-oriented storage with good compression, and handling hundreds of GB of data with them is basically not a problem.
One benefit of using a database is that the mighty SQL can do most of the statistical work for us: computing PV takes nothing more than SELECT plus COUNT, and ranking search keywords only needs SELECT, COUNT, GROUP BY, ORDER BY, and LIMIT. In addition, the database's structured storage makes the log data easier to manage and cheaper to operate.
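For instance, with the sketch table above (and assuming a hypothetical search_keyword column filled in during ETL), these two tasks are one statement each:

-- PV of a single page
SELECT COUNT(*) FROM apache_log WHERE url = '/index.html';

-- top 10 search keywords
SELECT search_keyword, COUNT(*) AS cnt
FROM apache_log
GROUP BY search_keyword
ORDER BY cnt DESC
LIMIT 10;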
For the same example as above, a single SQL statement does the job:
SELECT * FROM (SELECT ip, COUNT(*) AS ip_count FROM apache_log GROUP BY ip) a ORDER BY ip_count DESC LIMIT 100;
As for performance, database indexes and various optimization mechanisms usually make our statistical work faster, and Infobright and InfiniDB, mentioned above, are specifically optimized for aggregate queries such as SUM and COUNT. It is not absolutely faster, though: running a LIKE query in a database, for instance, is usually much slower than grep over a flat file.
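As a small illustration for a row-oriented engine such as MyISAM or InnoDB (column stores like Infobright rely on their own internal structures rather than user-defined indexes), an index on the ip column can speed up queries that group or filter by IP:

CREATE INDEX idx_ip ON apache_log (ip);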
Going a step further, database-backed storage makes it easy to build OLAP (Online Analytical Processing) applications on top, and mining value from the logs becomes simpler still.
What about more data?
A good database seems to make things simple, but remember that everything so far has assumed a single-machine database. A single machine is inevitably limited in storage capacity and concurrency. A defining characteristic of log data is that it keeps growing over time, and many analyses need historical data. Short-term growth can be absorbed by splitting databases or tables, or by compressing the data, but that is clearly not a long-term solution.
To cope with growing data volume, it is natural to turn to distributed technology. Combined with the conclusions above, a distributed database looks like a fine choice: everything would remain completely transparent to the end user. That really is the ideal situation, but reality is often brutal.
First, implementing a reasonably complete distributed database (subject to the CAP theorem) is a very hard problem, so unlike the single-machine case there are not many good open-source options to choose from, and not even that many commercial ones. There are exceptions, of course: if you have the money, you can still consider Oracle RAC, Greenplum, and the like.
Second, the vast majority of distributed databases are NoSQL, so any hope of keeping the advantages of SQL is largely gone, replaced by crude, hard-to-use interfaces. Seen from that angle, the value of using these databases drops considerably.
So let's be realistic and take a step back: think only about how to solve the problem of analyzing massive logs at all, rather than how to make it as simple as it is at small scale. Fortunately, that alone is not so hard nowadays, and there is still a free lunch to be had.
Hadoop is the Apache Foundation's great distributed system, comprising the distributed file system HDFS, the MapReduce computing framework, HBase, and many other components; these are essentially clones of Google's GFS, MapReduce, and BigTable.
After years of development Hadoop has become quite mature, especially its HDFS and MapReduce components. Clusters of hundreds of machines have been proven to work and can take on petabyte-scale data.
HBase, within the Hadoop project, is a column-oriented NoSQL distributed database. It provides simple key-value lookups and is not directly suited to most log analysis. So the usual way to analyze logs with Hadoop is to store the logs in HDFS first, and then write log analyzers against the MapReduce API it provides.
MapReduce is a distributed programming model that is not hard to learn, but the cost of processing logs with it is clearly much higher than with single-machine scripts or SQL. A simple word-count job may take a few hundred lines of code (versus one line of SQL), not to mention complicated environment setup and launch scripts.
Take the same example again: the implementation is noticeably more complex and usually requires two rounds of MapReduce. In the first round, the mapper counts accesses per IP and emits the result with the IP as the key:
// iterate over the input and aggregate the counts per IP
foreach (record in input) {
    ip = record.ip;
    dict[ip]++;
}
// emit the results; the first argument is the key used to distribute work to the reducers
foreach (<ip, count> in dict) {
    emit(ip, count);
}
Then, in the first round's reducer, we obtain the complete count for each IP and, while we are at it, sort and keep only the top 100:
count = 0;
// for each key (an IP), iterate over all of its values (partial counts) and add them up
while (input.values.hasNext()) {
    count += input.values.next();
}
// insert into a heap that keeps only the 100 largest entries
heap_insert(input.key, count);

At the end of the reduce phase, output the 100 IPs with the highest counts seen by this reducer:

// emit the heap contents
foreach (<ip, count> in heap) {
    emit(ip, count);
}
Because there are usually many reducers, the outputs of all the reducers still have to be merged and re-sorted to obtain the final 100 IPs and their access counts.
So, using Hadoop for log analysis is clearly not a simple matter. It brings a lot of extra learning and operational cost, but at least it makes the analysis of truly massive logs possible.
How to make it simpler
Nothing is easy at very large scale, log analysis included. That does not mean, however, that distributed log analysis must always mean writing MapReduce code; it is always possible to abstract further and make things simpler within a specific application domain.
Someone will naturally wonder: how nice it would be if we could manipulate the data on Hadoop with SQL. As it happens, many people have thought exactly that, and they turned the idea into reality. That is how Hive came about.
Hive is now a subproject under the Hadoop umbrella. It lets us run MapReduce through an SQL interface and even provides JDBC and ODBC interfaces. With Hive, Hadoop is basically packaged up as a database. Of course, Hive's SQL is ultimately translated into MapReduce code, so even the simplest query may take tens of seconds to execute. Fortunately that is usually acceptable for offline log analysis. More importantly, for the example above, we can again finish the task with essentially the same SQL:
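A minimal HiveQL sketch, assuming the logs have been loaded into a Hive table named apache_log with an ip column:

SELECT ip, COUNT(*) AS ip_count
FROM apache_log
GROUP BY ip
ORDER BY ip_count DESC
LIMIT 100;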
Of course, Hive is not fully compatible with standard SQL syntax, nor does it completely shield users from the underlying details. Often, to tune performance, users still need some basic knowledge of MapReduce and have to set certain parameters according to their own access patterns; otherwise a query may run very slowly, or not run at all.
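For example (a sketch using parameter names from older Hive releases; actual tuning depends on the workload), one might enable map-side aggregation or pin the number of reducers before a heavy GROUP BY:

SET hive.map.aggr = true;        -- do partial aggregation on the map side
SET mapred.reduce.tasks = 32;    -- explicitly set the number of reducers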
Also, Hive obviously cannot cover every need, so it retains interfaces for plugging in hand-written MapReduce code as an extension mechanism.
More questions
Even with something database-like such as Hive, there is still plenty of work to do. For example, over time more and more SQL will need to run routinely; some of it will be redundant, some will be terribly inefficient, and a single complex query can eat up all the computing resources. Such a system becomes harder and harder to maintain, until one day the routine SQL simply cannot finish running. And end users tend not to care about any of this; they only care that the queries they submit come back immediately and that they get their results as soon as possible.
Take a simple example: if we find that almost none of the queries against apache_log use the user_agent field, we could simply drop that field, or split the table in two, to cut the I/O of most queries and improve execution efficiency.
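A minimal sketch of the vertical split (reusing the illustrative log_id column from the earlier table sketch):

-- move the rarely used field into a table of its own
CREATE TABLE apache_log_agent (
    log_id     BIGINT,
    user_agent VARCHAR(255)
);
INSERT INTO apache_log_agent SELECT log_id, user_agent FROM apache_log;
ALTER TABLE apache_log DROP COLUMN user_agent;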
To solve these problems systematically, we may need to introduce a scheduling mechanism for routine tasks; we may need to analyze all the SQL to find out which queries can be merged, which need performance tuning, and whether the tables they use should be split horizontally or vertically, and so on. Depending on the actual situation, this can be done by hand or by writing programs that analyze and adjust automatically.
Furthermore, as log types and analysis requirements keep multiplying, users will complain more and more that it is hard to tell which log holds the data they need, or that a query that used to work fine suddenly breaks because the log format changed. On top of that, the ETL process mentioned earlier grows more complex, and simple transform-and-load scripts may no longer be enough. At that point it may be necessary to build a data management system, or simply to consider building a so-called data warehouse.
In short, as the volume of log data, the number of log types, the number of users, and the analysis requirements keep growing, more and more problems will surface, and log analysis will turn out not to be as simple as it first seemed, but ever more valuable and ever more challenging.