Recording the entire process of a data analysis


I had just come off my shift and was sitting around the office with nothing to do, when a colleague handed me some data and asked me to produce a statistical report, something like a cross table.

I figured it would take ten minutes at most. It ended up taking two hours, which surprised me enough that I recorded the entire process.

The source data is plain-text log output:

 
 
  2008/1/11  02:14:33:181  181  00001c68  SeqID  418370  ToBack()=TRUE   Len=154  MsgID=x00000202
  2008/1/11  02:14:33:181  181  00001c68  SeqID  418370  ToFront()=TRUE  Len=260  MsgID=x08000202  BEIP=192.168.1.162  BEPort=22049
  2008/1/11  03:05:42:330  330  00004110  SeqID  418370  ToBack()=TRUE   Len=154  MsgID=x00000202
  2008/1/11  03:05:42:346  346  00004110  SeqID  418370  ToFront()=TRUE  Len=261  MsgID=x08000202  BEIP=192.168.1.163  BEPort=22049

The goal is a count, per time period, of requests grouped by how many milliseconds they took.
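Roughly this shape, with hours down the side and latency buckets across the top (values elided):

  hour   <10ms  10-20  20-30  ...  >90ms
  02     …      …      …      ...  …
  03     …      …      …      ...  …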

In theory it is not complicated: find the pattern in the data and run grouped statistics. But there are two problems.

First, the statistics are context-dependent; a metric only emerges after computing across adjacent rows. In the sample above, the ToBack() at 03:05:42:330 and the ToFront() at 03:05:42:346 on thread 00004110 form one round trip, so its elapsed time is 346 - 330 = 16 ms.

Second, how do you establish that context? There is no unique identifier, so matching rows have to be recognized from a combination of fields.

The natural first step: import the data into Oracle.

Each record carries a date and a time, so the text dates have to be converted to Oracle's DATE type. But DATE carries no milliseconds, and there is the first problem: the context computation has to rely on the millisecond values already present in the log.
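For reference, here is a minimal sketch of what the staging table behind the queries below might look like. The single-letter columns mirror the names used in the SQL; the mapping to log fields and the types are my inference, not the author's actual DDL.

  -- Hypothetical staging table for the imported log lines.
  -- Column roles inferred from how the queries below use them.
  create table test6 (
    k  varchar2(16),  -- thread id, e.g. 00001c68
    a  number,        -- SeqID value, e.g. 418370
    b  date,          -- date + time (DATE keeps no milliseconds)
    c  number(4),     -- millisecond field from the log, 0-999
    d  varchar2(16),  -- MsgID, e.g. x00000202
    f  varchar2(8),   -- hour bucket taken from the time
    g  varchar2(32)   -- direction flag, e.g. 'ToFront()=TRUE'
  );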

Let's run the statistics first.

 
 
  select b.hours,
         case when overlap < 10 then '<10ms'
              when overlap < 20 then '10-20'
              when overlap < 30 then '20-30'
              when overlap < 40 then '30-40'
              when overlap < 50 then '40-50'
              when overlap < 60 then '50-60'
              when overlap < 70 then '60-70'
              when overlap < 80 then '70-80'
              when overlap < 90 then '80-90'
              else '>90ms'
         end tt,
         count(*)
    from (
          -- the original listing selected a.f, a.d here; the aliases are
          -- reconstructed so the outer hours/overlap references resolve
          select a.f hours, a.aa overlap
            from (
                  select k, a, b, f, d, g, c,
                         LAG(c, 1, 0) OVER (partition by f, d ORDER BY b, g) lastc,
                         LAG(b, 1, 0) OVER (partition by f, d ORDER BY b, g) lastb,
                         -- gap to the previous row's millisecond value; a
                         -- negative difference means a second boundary was
                         -- crossed, so wrap around by adding 1000
                         case when c - LAG(c, 1, 0) OVER (ORDER BY tt) >= 0
                              then c - LAG(c, 1, 0) OVER (ORDER BY tt)
                              else c - LAG(c, 1, 0) OVER (ORDER BY tt) + 1000
                         end aa
                    from test6 t
                 ) a
           where a.g = 'ToFront()=TRUE' and a.aa > 90
           -- note: with the aa > 90 filter everything lands in the '>90ms'
           -- bucket; drop it to fill the other buckets
           order by f, d, b, g
         ) b
   group by b.hours,
         case when overlap < 10 then '<10ms'
              when overlap < 20 then '10-20'
              when overlap < 30 then '20-30'
              when overlap < 40 then '30-40'
              when overlap < 50 then '40-50'
              when overlap < 60 then '50-60'
              when overlap < 70 then '60-70'
              when overlap < 80 then '70-80'
              when overlap < 90 then '80-90'
              else '>90ms'
         end

The results came back, but they were not what I expected. Comparing a few rows against the raw detail showed small discrepancies, though it was unclear where the problem was.

To pin it down, I added a row number to the data and compared again: the rows had moved; their order no longer matched the original log order.

To verify, I used rownum to copy the table data into another test table and checked whether the row numbers matched the log sequence. The insertion order and the row numbers did not match!
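The check itself is short; a minimal sketch, with test6_check as a hypothetical name for the second test table:

  -- Snapshot the rows together with their current row numbers,
  -- then compare rr against the expected log sequence.
  create table test6_check as
    select rownum rr, t.* from test6 t;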

I also asked my colleague how the business logic behaves. The answer: the ordering of context lines in the log is strictly guaranteed.

So the row-number problem had to be solved for good.

I also ran a test in Excel, where this is easy: pull in the previous record's millisecond value, sort, filter, then group and bucket, and finally build the crosstab as a pivot table.

With a large amount of data, though, Excel clearly struggles with all the formula-dragging, and the repeated functions, sorting, and copy-pasting make the whole thing time-consuming.

 
 
Back in Oracle, the fix was to stamp each row with its insertion order using a sequence and a before-insert trigger:

  create or replace trigger trigger_test6
    before insert on test6
    for each row
  declare
  begin
    -- stamp every inserted row with the next sequence value,
    -- preserving the original log order in the tt column
    select tt.nextval into :new.tt from dual;
  end trigger_test6;
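The trigger assumes a sequence named tt and a numeric tt column on test6; a minimal sketch of that setup:

  -- Sequence feeding the trigger; tt records insertion order.
  create sequence tt start with 1 increment by 1;
  alter table test6 add (tt number);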

Verifying the data order again: this time it matched.

With the order reliable, the business logic became much simpler. Only the core of the query needs to change, sorting the data by the row-number column:

 
 
  select rr, k, a, b, f, d, g, c,
         LAG(c, 1, 0) OVER (ORDER BY tt) lastc,
         LAG(b, 1, 0) OVER (ORDER BY tt) lastb
    from test6 t

Once the statistics were done, I copied the data into Excel, turned it into a pivot table, and then copied the result out and polished the presentation.

Two things are still not perfectly solved.

The first is the millisecond handling. In theory the two timestamps could simply be subtracted, but because Oracle's DATE type cannot carry the milliseconds, I can only take the millisecond field from the log as an offset, adding 1000 when the difference is negative. That is somewhat fragile: it silently assumes consecutive events are never more than a second apart.
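For what it's worth, Oracle's TIMESTAMP type does keep fractional seconds, so parsing the date, time, and millisecond fields together would let the subtraction work directly; a sketch using the two sample rows (FF3 reads the three millisecond digits):

  -- TIMESTAMP keeps fractional seconds, unlike DATE; the result
  -- here is an INTERVAL DAY TO SECOND of +00 00:00:00.016
  select to_timestamp('2008/1/11 03:05:42:346', 'YYYY/MM/DD HH24:MI:SS:FF3')
       - to_timestamp('2008/1/11 03:05:42:330', 'YYYY/MM/DD HH24:MI:SS:FF3') gap
    from dual;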

Second, there is the row-order change during the Oracle import. Oracle tables have no inherent order, and a query without an ORDER BY may return rows in any order; that is likely the root cause, but I have not pinned it down yet.
