Basic introduction
Program developers often have to analyze the program log, including their own print logs and other software used to print the log, such as Php,nginx log, Linux environment analysis log has some built-in commands to use, such as Grep,sort,uniq,awk, the most powerful is awk, exists as a small text-processing language, but because it is a language and powerful, it is not so convenient to use at the command line, because AWK is compute-oriented rather than statistical-oriented. Awk is able to define variables that can be evaluated, and the command line is a language that includes an implicit for loop.
Awk assumes that for a very long time, some of its syntax is forgotten, it is necessary to analyze the online log to assume that SQL analysis is good, indeed, SQL (Structured Query language) is a truly statistical-oriented language, including Hive is also used, so recently developed a SQL-based Log Parser, The ability to parse the log with the class SQL syntax is called Myselect.
Myselect is a tool for simplifying log analysis, and it is believed that it has covered most of the log analysis functions that awk can complete, and of course, awk is needed in special cases. Myselect to analyze the log file as a database, the log line as a database record, so that the log data inside the statistical analysis. The following is a comparison of the use of other commands such as Myselect and awk.
To analyze the Ngnix log as an example, the following log is a log of our online Web machine
198.52.103.14--[29/jun/2014:00:17:11 +0800] "get/q/1403060495509100 http/1.1" 26788 1403060495509100 "" mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; trident/4.0;. NET CLR 2.0.50727) "221 0.532
The first field is IP, assuming that you want to know which source IP is the most, with other commands such as awk, such as the following
$ Awk ' {print '} ' Accesstest.log | Sort | uniq-c | SORT-K1NR | Less
14 111.13.65.251
13 10.141.88.248
12 10.141.88.239
10 10.141.88.250
9 121.226.135.115
8 10.141.88.241
8 10.141.88.249
8 222.74.246.190
7 211.149.165.150
6 119.138.167.213
Even using awk entirely can do the same, but there are other better commands that don't seem necessary.
Myselect How to achieve the above functions? Myselect The journal lines as multiple fields, with spaces separating the fields, and all the characters in the double-argument as a field, even if they contain spaces, which is different from awk's purely space-delimited, which makes it easier for us to process the logs. You can view the field values for a journal line by, for example, the following command:
$ Myselect-s ' 198.52.103.14--[29/jun/2014:00:17:11 +0800] "get/q/1403060495509100 http/1.1" 26788 O.com/q/1403060495509100 "" mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; trident/4.0;. NET CLR 2.0.50727) "221 0.532 '
**log fields**
$198.52.103.14
$-
$-
$4 [29/jun/2014:00:17:11
$ +0800]
$6 get/q/1403060495509100 http/1.1
$7 200
$8 26788
$9 http://wenda.so.com/q/1403060495509100
$ mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; trident/4.0;. NET CLR 2.0.50727)
$11 221
$0.532
The-S option prints the field values of the journal lines, which can then be calculated based on the corresponding fields, such as the following
$ Myselect ' SELECT COUNT ($), $ accesstest.log Group BY "Order by count ($) DESC LIMIT 10 '
14 111.13.65.251
13 10.141.88.248
12 10.141.88.239
10 10.141.88.250
9 121.226.135.115
8 10.141.88.241
8 10.141.88.249
8 222.74.246.190
7 211.149.165.150
6 61.174.51.174
The results are all the same
There is also a frequently used requirement to look at traffic per minute, and observe traffic anomalies, such as with awk commands such as the following:
$ awk" {print gensub (/.* : (. +):. *+0800].*/, "\\1", "G")} ' Access_wenda.qihoo.com_log | uniq-c | Grep-v Windows | less
1567 00:17
1597 00:17
933 00:18
3045 00:18
1605 00:19
294 00:19
2021 00:19
1315 00:20
666 00:20
1875 00:20
3679 00:21
1172 00:22
479 00:22
2094 00:22
1352 00:23
00:23
Notoginseng 00:23
Grep-v Windows is in order to filter out some garbled lines, in awk we need to get child fields through gensub, such as the minute values here, and in Myselect also have the same function regsub ($1,pattern,replace), Complete the same requirements with myselect such as the following:
$ Myselect ' Select Regsub ($4,/.*2014: (. +): \d{2}.*/,\1), COUNT ($) from Access_wenda.qihoo.com_log Group by Regsub ($4,/. *2014: (. +): \d{2}.*/,\1) Order by count ($) DESC LIMIT 10 '
Regsub ($4,/.*2014: (. +): \d{2}.*/,\1), we use a regular for the 4th field to get the minute value.
For example, we want to calculate the average time spent on network requests, can be achieved with awk, but the process is more complex, the need to define variables and calculations, and with Myselect only need to use the AVG function is relatively simple calculation, such as the following
$ Myselect ' Select AVG ($) from Access_wenda.qihoo.com_log '
From the above control, it is possible to find that Myselect is statistically based on the SQL method, not only to remember, but also to analyze the ideas more intuitively, unlike awk needs a bunch of commands to cooperate.
Maybe you would say put the log in the database again analysis is the same, but this process is too troublesome, rather than directly to the file with SQL analysis.
Myselect use
After installing the Myselect program, you can view the usage
$ myselect-h
Usage
Myselect ' SQL sentence '; Using SQL for statistical analysis
Myselect-s ' log line '; cut numbers by space on journal lines
Myselect-n ' log line ' sql sentence '; Parsing the journal lines with SQL
Myselect-p ' SQL sentence '; View SQL Syntax parsing results
Myselect-c ' SQL sentence '; View SQL calculation procedures
Statistical analysis of basic use such as the following
$ Myselect ' SQL statement '
SQL statement syntax is basically the same as the normal database query SELECT statement, do not distinguish between uppercase and lowercase (of course, the file name is differentiated between uppercase and lowercase), support free form, only a small part of the difference, we have reason to believe that the SQL language in statistical analysis must be the best language at the moment, Basically follow it to achieve it.
SQL statement = SELECT
select_expr [, select_expr ...]
[From file_name
[WHERE Where_condition]
[GROUP by {col_name | expr}
[Having where_condition]
[ORDER by {col_name | expr}
[ASC | DESC]]
[LIMIT {[Offset,] row_count}]
A brief description such as the following:
select_expr
Can contain field numbers such as $1,$2, fields are separated by spaces, can also contain functions, functions are divided into two categories
String handler function:
- Strsub ($1,2,3) Truncate substring
- Regsub ($1,/(.):(. +):(.) /I,\2) Replace substrings by regular
A string function can be used wherever a field can appear, and its parameters contain the field number
Aggregation functions:
Meaning is the same as normal SQL.
Where_condition
A relational expression that is concatenated with and is not currently supported or, for example, the following operator
=,!=,>,<,>=,<=,like,rlike
Like indicates whether the corresponding string is included, Rlike indicates a regular matching pattern
The original plan myselect with the Go language, and read the Go manual, but in our group technical journals before the deadline for a very short time to implement it in a new language, instead of first using PHP to implement a version number, and basically available, the current implementation of the PHP version number implementation of the basic SQL The SELECT statement syntax, like Askeyword and or logical operators, is not yet implemented, but this is not important. When the log file is very large, the PHP implementation of the version number in the performance and memory consumption is not very good to meet the requirements, but I believe that soon there will be the go language implementation of the high Availability version number.
For anyone unfamiliar with awk or who can't remember the details of the awk syntax, Myselect is well-versed in the need to analyze the logs when it's time to parse the log, and everyone in the SQL language should be very familiar with it.
This tool source code has been put to GitHub Https://github.com/micweaver/myselect
The basic implementation algorithm in the inside, next to translate into the go language implementation, go language can be very good to meet our performance and memory requirements, of course, great log you have to rely on hadoop,hive and other distributed computing tools