SQL-based Log analysis tool Myselect

Source: Internet
Author: User

Basic introduction

Program developers often have to analyze the program log, including their own print logs and other software used to print the log, such as Php,nginx logs, Linux environment analysis logs have some built-in commands to use, such as Grep,sort,uniq,awk, the most powerful of which is awk, exists as a small text-processing language, but because it is a language that is powerful, it is not so convenient to use at the command line because AWK is compute-oriented rather than statistical-oriented. Awk can define variables that can be evaluated, and the command line is a language that contains an implicit for loop.

If awk had not been used for a long time, it would have forgotten some of its syntax, and would like to analyze the online log if it would be nice to parse it with SQL, indeed, SQL (Structured Query language) is a truly statistical-oriented language, including Hive, which has recently developed a SQL-based Log Parser, You can parse the log with the class SQL syntax, which is called Myselect.

Myselect is a tool for simplifying log analysis, and it is believed to have covered most of the log analysis capabilities that awk can perform, and of course, awk is needed in special cases. Myselect to analyze the log file as a database, the log line of course database records, so that the log data inside the statistical analysis. Let's look at the comparison of the use of other commands such as Myselect and awk.

To analyze the Ngnix log as an example, the following log is a log of our online Web machine

198.52.103.14--[29/jun/2014:00:17:11 +0800] "get/q/1403060495509100 http/1.1" 26788 1403060495509100 "" mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; trident/4.0;. NET CLR 2.0.50727) "221 0.532

The first field is IP, if you want to know which source IP is the most, use other commands such as awk to implement the following

$ Awk ' {print '} ' Accesstest.log | Sort | uniq-c | SORT-K1NR | Less
14 111.13.65.251
13 10.141.88.248
12 10.141.88.239
10 10.141.88.250
9 121.226.135.115
8 10.141.88.241
8 10.141.88.249
8 222.74.246.190
7 211.149.165.150
6 119.138.167.213

Even using awk entirely can do this, but there are other better commands that don't seem necessary.

Myselect How to achieve the above functions? Myselect the journal line as multiple fields, with spaces separating the fields, and all the characters in the double quotation marks as a field, even if they include spaces, this is different from the fact that awk is purely space delimited, which makes it easier for us to process the logs. You can view the field values for a journal line by following this command:

$ Myselect-s ' 198.52.103.14--[29/jun/2014:00:17:11 +0800] "get/q/1403060495509100 http/1.1" 26788 O.com/q/1403060495509100 "" mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; trident/4.0;. NET CLR 2.0.50727) "221 0.532 '


**log fields**
$198.52.103.14
$-
$-
$4 [29/jun/2014:00:17:11
$ +0800]
$6 get/q/1403060495509100 http/1.1
$7 200
$8 26788
$9 http://wenda.so.com/q/1403060495509100
$ mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; trident/4.0;. NET CLR 2.0.50727)
$11 221
$0.532

The-S option prints the field values of the journal lines, which can then be calculated according to the corresponding fields, as follows

$ Myselect ' SELECT COUNT ($), $ accesstest.log Group BY "Order by count ($) DESC LIMIT 10 '
14 111.13.65.251
13 10.141.88.248
12 10.141.88.239
10 10.141.88.250
9 121.226.135.115
8 10.141.88.241
8 10.141.88.249
8 222.74.246.190
7 211.149.165.150
6 61.174.51.174

The result is exactly the same

Another common requirement is to look at the traffic per minute and observe the traffic anomalies, using awk and other commands like this:

$ awk" {print gensub (/.* : (. +):. *+0800].*/, "\\1", "G")} ' Access_wenda.qihoo.com_log | uniq-c | Grep-v Windows | less 
1567 00:17 
1597 00:17 
933 00:18 
3045 00:18 
1605 00:19 
294 00:19 
2021 00:19 
1315 00:20 
666 00:20 
1875 00:20 
3679 00:21 
1172 00:22 
479 00:22 
2094 00:22 
1352 00:23 
00:23 
Notoginseng 00:23

Grep-v Windows is in order to filter out some garbled lines, in awk we need to get child fields through gensub, such as the minute values here, and in Myselect also have the same function as function regsub ($1,pattern,replace), The same needs are accomplished with Myselect:

$ Myselect ' Select Regsub ($4,/.*2014: (. +): \d{2}.*/,\1), COUNT ($) from Access_wenda.qihoo.com_log Group by Regsub ($4,/. *2014: (. +): \d{2}.*/,\1) Order by count ($) DESC LIMIT 10 '

Regsub ($4,/.*2014: (. +): \d{2}.*/,\1), we use a regular for the 4th field to get the minute value.

For example, we want to calculate the average time spent on network requests, with AWK can be achieved, but the process is more complex, need to define variables and calculations, and with Myselect only need to use the AVG function is relatively simple calculation, as follows

$ Myselect ' Select AVG ($) from Access_wenda.qihoo.com_log '

From the above comparison, it can be found that myselect is to write SQL method for statistics, not only good remember, but also more intuitive analysis, unlike awk need a bunch of commands to cooperate.

Perhaps you would say to put the log in the database re-analysis is the same, but this process is too troublesome, rather than directly to the file with SQL analysis.

Myselect use

After installing the Myselect program, you can see how to use it

$ myselect-h
Usage
Myselect ' SQL sentence '; Using SQL for statistical analysis
Myselect-s ' log line '; split numbering by space on journal lines
Myselect-n ' log line ' sql sentence '; Parsing the journal lines with SQL
Myselect-p ' SQL sentence '; View SQL Syntax parsing results
Myselect-c ' SQL sentence '; View SQL calculation procedures

Statistical analysis is basically used as follows

$ Myselect ' SQL statement '

SQL statement syntax is basically the same as the normal database query SELECT statement, case-insensitive (of course, file names are case-sensitive), support free-form, only a small number of different, we have reason to believe that the SQL language in the statistical analysis of the current optimal language, basically follow it to achieve the line.

SQL statement = SELECT
select_expr [, select_expr ...]
[From file_name
[WHERE Where_condition]
[GROUP by {col_name | expr}
[Having where_condition]
[ORDER by {col_name | expr}
[ASC | DESC]]
[LIMIT {[Offset,] row_count}]

A brief description is as follows:

select_expr

You can include field numbers such as $1,$2, fields are separated by spaces, or functions can be included, and functions are divided into two categories
String handler function:

    • Strsub ($1,2,3) Truncate substring
    • Regsub ($1,/(.):(. +):(.) /I,\2) Replace substrings by regular

A string function can be used where any field can appear, and its parameters include the field number

Aggregation functions:

    • Count
    • Sum
    • Agv
    • Max
    • Min

Meaning is the same as normal SQL.

Where_condition

A relational expression that is concatenated with and is not currently supported or, and is supported by the following operators
=,!=,>,<,>=,<=,like,rlike
Like indicates whether the corresponding string is included, rlike means that regular matches the corresponding pattern

The original plan myselect in the Go language, and read the Go manual, but in our group technical journals before the deadline for a short period of time to implement it in a new language, instead of a version of PHP first implemented, and basically available, the current implementation of the PHP version implemented the main SQL The SELECT statement syntax, like the AS keyword and or logical operator, is not yet implemented, but it is not important. When the log file size is large, the PHP implementation of the version in performance and memory consumption can not reach a real practical state, but I believe that soon there will be the go language implementation of the high-availability version.

For people who are unfamiliar with awk or who can't remember the details of awk syntax, Myselect can implement our requirements well when it comes to parsing logs, and SQL language should be familiar to everyone.

This tool has been put into the source of GitHub Https://github.com/micweaver/myselect


Basic implementation algorithm in the inside, next to translate into the go language implementation, go language can be very good to meet our performance and memory requirements, of course, great log you have to rely on hadoop,hive and other distributed computing tools

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.