SQL-based Log analysis tool Myselect

Source: Internet
Author: User
Tags logical operators

Basic introduction

Program developers often have to analyze the program log, including their own print logs and other software used to print the log, such as Php,nginx log, Linux environment analysis log has some built-in commands to use, such as Grep,sort,uniq,awk, the most powerful is awk, exists as a small text-processing language, but because it is a language and powerful, it is not so convenient to use at the command line, because AWK is compute-oriented rather than statistical-oriented. Awk is able to define variables that can be evaluated, and the command line is a language that includes an implicit for loop.

Awk assumes that for a very long time, some of its syntax is forgotten, it is necessary to analyze the online log to assume that SQL analysis is good, indeed, SQL (Structured Query language) is a truly statistical-oriented language, including Hive is also used, so recently developed a SQL-based Log Parser, The ability to parse the log with the class SQL syntax is called Myselect.

Myselect is a tool for simplifying log analysis, and it is believed that it has covered most of the log analysis functions that awk can complete, and of course, awk is needed in special cases. Myselect to analyze the log file as a database, the log line as a database record, so that the log data inside the statistical analysis. The following is a comparison of the use of other commands such as Myselect and awk.

To analyze the Ngnix log as an example, the following log is a log of our online Web machine

198.52.103.14--[29/jun/2014:00:17:11 +0800] "get/q/1403060495509100 http/1.1" 26788 1403060495509100 "" mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; trident/4.0;. NET CLR 2.0.50727) "221 0.532

The first field is IP, assuming that you want to know which source IP is the most, with other commands such as awk, such as the following

$ Awk ' {print '} ' Accesstest.log | Sort | uniq-c | SORT-K1NR | Less
14 111.13.65.251
13 10.141.88.248
12 10.141.88.239
10 10.141.88.250
9 121.226.135.115
8 10.141.88.241
8 10.141.88.249
8 222.74.246.190
7 211.149.165.150
6 119.138.167.213

Even using awk entirely can do the same, but there are other better commands that don't seem necessary.

Myselect How to achieve the above functions? Myselect The journal lines as multiple fields, with spaces separating the fields, and all the characters in the double-argument as a field, even if they contain spaces, which is different from awk's purely space-delimited, which makes it easier for us to process the logs. You can view the field values for a journal line by, for example, the following command:

$ Myselect-s ' 198.52.103.14--[29/jun/2014:00:17:11 +0800] "get/q/1403060495509100 http/1.1" 26788 O.com/q/1403060495509100 "" mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; trident/4.0;. NET CLR 2.0.50727) "221 0.532 '


**log fields**
$198.52.103.14
$-
$-
$4 [29/jun/2014:00:17:11
$ +0800]
$6 get/q/1403060495509100 http/1.1
$7 200
$8 26788
$9 http://wenda.so.com/q/1403060495509100
$ mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; trident/4.0;. NET CLR 2.0.50727)
$11 221
$0.532

The-S option prints the field values of the journal lines, which can then be calculated based on the corresponding fields, such as the following

$ Myselect ' SELECT COUNT ($), $ accesstest.log Group BY "Order by count ($) DESC LIMIT 10 '
14 111.13.65.251
13 10.141.88.248
12 10.141.88.239
10 10.141.88.250
9 121.226.135.115
8 10.141.88.241
8 10.141.88.249
8 222.74.246.190
7 211.149.165.150
6 61.174.51.174

The results are all the same

There is also a frequently used requirement to look at traffic per minute, and observe traffic anomalies, such as with awk commands such as the following:

$ awk" {print gensub (/.* : (. +):. *+0800].*/, "\\1", "G")} ' Access_wenda.qihoo.com_log | uniq-c | Grep-v Windows | less 
1567 00:17 
1597 00:17 
933 00:18 
3045 00:18 
1605 00:19 
294 00:19 
2021 00:19 
1315 00:20 
666 00:20 
1875 00:20 
3679 00:21 
1172 00:22 
479 00:22 
2094 00:22 
1352 00:23 
00:23 
Notoginseng 00:23

Grep-v Windows is in order to filter out some garbled lines, in awk we need to get child fields through gensub, such as the minute values here, and in Myselect also have the same function regsub ($1,pattern,replace), Complete the same requirements with myselect such as the following:

$ Myselect ' Select Regsub ($4,/.*2014: (. +): \d{2}.*/,\1), COUNT ($) from Access_wenda.qihoo.com_log Group by Regsub ($4,/. *2014: (. +): \d{2}.*/,\1) Order by count ($) DESC LIMIT 10 '

Regsub ($4,/.*2014: (. +): \d{2}.*/,\1), we use a regular for the 4th field to get the minute value.

For example, we want to calculate the average time spent on network requests, can be achieved with awk, but the process is more complex, the need to define variables and calculations, and with Myselect only need to use the AVG function is relatively simple calculation, such as the following

$ Myselect ' Select AVG ($) from Access_wenda.qihoo.com_log '

From the above control, it is possible to find that Myselect is statistically based on the SQL method, not only to remember, but also to analyze the ideas more intuitively, unlike awk needs a bunch of commands to cooperate.

Maybe you would say put the log in the database again analysis is the same, but this process is too troublesome, rather than directly to the file with SQL analysis.

Myselect use

After installing the Myselect program, you can view the usage

$ myselect-h
Usage
Myselect ' SQL sentence '; Using SQL for statistical analysis
Myselect-s ' log line '; cut numbers by space on journal lines
Myselect-n ' log line ' sql sentence '; Parsing the journal lines with SQL
Myselect-p ' SQL sentence '; View SQL Syntax parsing results
Myselect-c ' SQL sentence '; View SQL calculation procedures

Statistical analysis of basic use such as the following

$ Myselect ' SQL statement '

SQL statement syntax is basically the same as the normal database query SELECT statement, do not distinguish between uppercase and lowercase (of course, the file name is differentiated between uppercase and lowercase), support free form, only a small part of the difference, we have reason to believe that the SQL language in statistical analysis must be the best language at the moment, Basically follow it to achieve it.

SQL statement = SELECT
select_expr [, select_expr ...]
[From file_name
[WHERE Where_condition]
[GROUP by {col_name | expr}
[Having where_condition]
[ORDER by {col_name | expr}
[ASC | DESC]]
[LIMIT {[Offset,] row_count}]

A brief description such as the following:

select_expr

Can contain field numbers such as $1,$2, fields are separated by spaces, can also contain functions, functions are divided into two categories
String handler function:

    • Strsub ($1,2,3) Truncate substring
    • Regsub ($1,/(.):(. +):(.) /I,\2) Replace substrings by regular

A string function can be used wherever a field can appear, and its parameters contain the field number

Aggregation functions:

    • Count
    • Sum
    • Agv
    • Max
    • Min

Meaning is the same as normal SQL.

Where_condition

A relational expression that is concatenated with and is not currently supported or, for example, the following operator
=,!=,>,<,>=,<=,like,rlike
Like indicates whether the corresponding string is included, Rlike indicates a regular matching pattern

The original plan myselect with the Go language, and read the Go manual, but in our group technical journals before the deadline for a very short time to implement it in a new language, instead of first using PHP to implement a version number, and basically available, the current implementation of the PHP version number implementation of the basic SQL The SELECT statement syntax, like Askeyword and or logical operators, is not yet implemented, but this is not important. When the log file is very large, the PHP implementation of the version number in the performance and memory consumption is not very good to meet the requirements, but I believe that soon there will be the go language implementation of the high Availability version number.

For anyone unfamiliar with awk or who can't remember the details of the awk syntax, Myselect is well-versed in the need to analyze the logs when it's time to parse the log, and everyone in the SQL language should be very familiar with it.

This tool source code has been put to GitHub Https://github.com/micweaver/myselect


The basic implementation algorithm in the inside, next to translate into the go language implementation, go language can be very good to meet our performance and memory requirements, of course, great log you have to rely on hadoop,hive and other distributed computing tools

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.