--apache log analysis of user behavior analysis (II.)

Source: Internet
Author: User
Tags apache log

In the last article "User behavior analysis of the--apache log Analysis (a)" finally introduced to the Apache log information in the crawler, so why to introduce him, nothing more than to achieve the title of "User Behavior analysis" purposes, reptiles is not our site's real users, so to filter out his Before we filter him, we are not the first to know what kind of parents are not.

Given the ease of development, and the expertise of each language, Python is a great fit for this kind of thing, and the processing of text is to filter out the crawler information in the log, and then generate the XML file, which is the information that the program can use directly (then Xpath,xquery can come in handy).

I have been relatively lazy, following the three great advantages of the programmer "lazy"; found a parse code on the Internet.

Def parse (input): import re,string SB = "[EB =]" IP_SEPR = "--" output = {} Try: #clean empty spaces at the Beginni Ng. line = String.lstrip (input) TT = String.Split (LINE,IP_SEPR) [ip,rest] = TT output[' ip_address '] = String.strip (IP) #parse The date with the brackets included. S_bracket = String.index (REST,SB) E_bracket = String.index (rest,eb) date_str = String.strip (rest[s_bracket+1:e_bracket ] output[' date_time ' = date_str #parse request string to get method, request and protocol. Current_ind = e_bracket+1 Request_start =-1 request_end =-1 Magic_flag = 0 while Current_ind < Len (rest): if Request_ Start!= -1:magic_flag = 1 if rest[current_ind] = =/"" and request_start = = -1:request_start = Current_ind if Rest[curr Ent_ind] = = "/" "and Request_start!=-1 and Magic_flag = = 1:request_end = Current_ind if Request_start >= 0 and Reque St_end >= 0:break current_ind = current_ind +1 get_str = String.strip (Rest[request_start+1:request_end]) [method,requ Est,protocol] = sTring.split (Get_str, "") output[' method ']= method output[' request '] = Request output[' protocol ' = Protocol #parse return Code rest = String.strip (rest[request_end+1:]) Ret_code_e_ind = String.index (rest, "") Ret_code = Rest[:ret_code_e_ind] O utput[' return_code ' = Ret_code #parse byte sent rest = String.lstrip (rest[ret_code_e_ind+1:]) Byte_sent_e_ind = string.i Ndex (Rest, "") Byte_sent = Rest[:byte_sent_e_ind] output[' return_byte ' = byte_sent #parse refering URL after_byte_sent = Rest[byte_sent_e_ind+1:] S_quote_ref_url = String.index (After_byte_sent, "/") After_byte_sent = After_byte_sent[s_ Quote_ref_url+1:] E_quote_ref_url = String.index (After_byte_sent, "/") if E_quote_ref_url-s_quote_ref_url==1:output [' refering_url '] = "" else:output[' refering_url '] = After_byte_sent[:e_quote_ref_url] #parse user Agent After_ref_url = a Fter_byte_sent[e_quote_ref_url+1:] s_quote_user_agent = String.index (After_ref_url, "/") After_ref_url = After_ref_ Url[s_quote_user_agent+1:] E_quote_user_agent = String.index (After_ref_url, "/") if e_quote_user_agent-s_quote_user_agent==1:output[' user_agent '] = "" else:output[ ' user_agent '] = after_ref_url[:e_quote_user_agent] except Exception, e:print e return output if __name__ = ' __main__ ': I Nput = "" "10.29.101.5--[31/dec/2008:00:03:16 +0800]" get/blog_paper.php?paperid=72 http/1.0 "46030"-"" MOZILLA/5 .0 (compatible; Yahoo! Slurp; http://misc.yahoo.com.cn/help.html) "303 46379" "" Result = Parse (input) print result[' Return_code '] for x in Result.keys ( ): Print x, "--", result[x]

OK, so after parsing the parse, you can get to the object of the Python structure, then you can really processing, in fact, this is the book said "Data Cleansing", here in addition to cleaning out the data of the reptile also want to clean out some pictures or video information Ah, of course, to see the actual needs. After this cleaning, the back of the calculation can not calculate the irrelevant things.

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.