Offline Data Analysis in Practice

Structure diagram of the offline analysis system


The overall architecture of the offline analysis system is as follows: Flume collects log files from an FTP server and stores them on the Hadoop HDFS file system, Hadoop MapReduce then cleans the log files, and finally Hive builds the data warehouse used for offline analysis. Task scheduling is done with shell scripts, although you can also try automated task-scheduling tools such as Azkaban or Oozie. The clickstream logs analyzed here come mainly from the Nginx access.log files. Note that Flume is not used to pull the Nginx log files directly from the production environment; instead, an extra FTP layer is set up to buffer all the log files, and Flume then listens to the designated directories on the FTP server and pulls the log files from those directories to HDFS (the specific reasons are explained below). Pushing the log files from the production environment to the FTP server can be implemented with a shell script driven by a crontab timer.
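As a minimal sketch of that push step, the shell script below copies the day's Nginx access log to the FTP server under a timestamped, host-stamped file name, and could be driven by a crontab entry. The log path, the FTP host name and the use of scp for the transfer are assumptions for illustration; only the spooling directory /export/data/trivial/weblogs comes from the Flume configuration shown later.

    #!/bin/bash
    # push_access_log.sh -- run daily from crontab on each production web server.
    # Snapshots the current Nginx access.log, stamps it with host name and time so
    # that file names never collide (the Flume spooling directory source requires
    # unique names), and copies it into the directory watched by the FTP-side
    # Flume agent. Log path, host and transfer method are illustrative placeholders.

    LOG=/var/log/nginx/access.log
    FTP_HOST=ftp-server
    SPOOL_DIR=/export/data/trivial/weblogs
    STAMP=$(date +%Y%m%d%H%M%S)
    OUT=/tmp/access-$(hostname)-$STAMP.log

    cp "$LOG" "$OUT" && : > "$LOG"       # snapshot the log, then truncate the original
    scp "$OUT" "$FTP_HOST:$SPOOL_DIR/"   # an FTP client would work here just as well
    rm -f "$OUT"

    # Example crontab entry (runs at 00:30 every day):
    # 30 0 * * * /usr/local/bin/push_access_log.sh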

Website Clickstream Data


Image source: http://webdataanalysis.net/data-collection-and-preprocessing/weblog-to-clickstream/#comments

In a web system, a user's page visits, browsing and clicks are recorded in the logs, and each log record corresponds to one of the points in the image above. Clickstream data is what you get when all of these points are linked together into a complete record of the user's browsing behavior on the site, which can be regarded as one browsing session. For example, which external site the user came from, which pages of the current site the user browsed next, and which image links or buttons the user clicked: this whole chain of information is called the user's clickstream record. The offline analysis system designed in this article collects these logs generated by the web system, cleans the log content, stores it on the distributed HDFS file system, and then uses the offline analysis tool Hive to compute clickstream statistics for all users.
This system uses the Nginx access.log as the clickstream analysis log file. The format of the access.log file is as follows.

Sample record:

    124.42.13.230 - - [18/Sep/2013:06:57:50 +0000] "GET /shoppingmall?ver=1.2.1 HTTP/1.1" 200 7200 "http://www.baidu.com.cn" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; BTRS101170; InfoPath.2; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727)"

Format analysis:
1. Visitor IP address: 124.42.13.230
2. Visitor user information: - -
3. Request time: [18/Sep/2013:06:57:50 +0000]
4. Request method: GET
5. Requested URL: /shoppingmall?ver=1.2.1
6. Request protocol: HTTP/1.1
7. Response code: 200
8. Returned data traffic (bytes): 7200
9. Visitor's referer URL: http://www.baidu.com.cn
10. Visitor's browser (user agent): Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; BTRS101170; InfoPath.2; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727)
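As a quick, hedged illustration of where each field sits, the shell fragment below splits such a line on the double-quote character with awk. It assumes a local file named access.log in exactly the combined format shown above and is only meant to make the field positions concrete.

    # Split each access.log line on double quotes:
    #   $1 -> ip, user info and [time]     $2 -> request line (method, URL, protocol)
    #   $3 -> status code and byte count   $4 -> referer        $6 -> user agent
    awk -F'"' '{
        split($1, head, " ");   # head[1] = visitor IP
        split($3, resp, " ");   # resp[1] = status, resp[2] = bytes
        print head[1], $2, resp[1], resp[2], $4, $6
    }' access.log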

Collecting User Data
The site collects the user's browsing data through front-end JS code or the server-side code behind it and stores it on the web server. Operations teams usually deploy an FTP server between the offline analysis system and the real production environment, push the user data from the production environment to the FTP server on a daily basis, and let the offline analysis system collect the data from the FTP server so that the production environment is not affected.
There are many ways to collect the data. One is to write shell scripts or Java programs to collect it, but the workload is large and maintenance is inconvenient. The other is to use a third-party framework for log collection; in general, third-party frameworks handle robustness, fault tolerance and ease of use well and are easy to maintain. This article uses the third-party framework Flume for log collection. Flume is a distributed, efficient log collection system that can gather the massive log file data scattered across different servers into a centralized storage resource. Flume is an Apache top-level project and has good compatibility with Hadoop. Note, however, that Flume is not a highly available framework out of the box; that part is left to the user to handle and optimize.
The Flume agent runs in a JVM, so a JVM environment on each server is essential. One Flume agent is deployed per server. Flume collects the log data generated by the web server and packages it into events that are sent to the Flume agent's source; the source consumes these events and places them on the agent's channel; the sink then takes the data from the channel and either stores it on a local file system or forwards it as a consumable resource to the next Flume agent on another server in the distributed system for further processing. Flume provides point-to-point reliability: the data in a Flume agent's channel on one server is removed only after it is guaranteed to have been transferred to the channel of the Flume agent on the next server or correctly saved to a local file storage system.

In this system a Flume agent is deployed on each FTP server and on the Hadoop name node server. The FTP Flume agents collect the logs from the web servers and forward them to the Flume agent on the name node server, which finally sinks all the log data onto the distributed file storage system HDFS. Note that the Flume source chosen in this article is the spooling directory source rather than the exec source, because when the Flume service goes down the spooling directory source remembers the position it last read up to, while the exec source does not; the user has to handle that case, and if it is not handled well you get duplicate data after restarting the Flume server. Of course the spooling directory source also has drawbacks: it renames the files it has finished reading, which is another reason the extra FTP layer was designed in, so that Flume does not "pollute" the production environment. Another big drawback of the spooling directory source is that it cannot monitor new files added to subfolders of the watched folder. There are a number of ways around these problems, such as choosing another log collection tool, for example Logstash.
The Flume configuration file on the FTP server is as follows:

    agent.sources = origin
    agent.channels = memorychannel
    agent.sinks = target

    agent.sources.origin.type = spooldir
    agent.sources.origin.spoolDir = /export/data/trivial/weblogs
    agent.sources.origin.channels = memorychannel
    agent.sources.origin.deserializer.maxLineLength = 2048

    agent.sources.origin.interceptors = i2
    agent.sources.origin.interceptors.i2.type = host
    agent.sources.origin.interceptors.i2.hostHeader = hostname

    agent.sinks.loggersink.type = logger
    agent.sinks.loggersink.channel = memorychannel

    agent.channels.memorychannel.type = memory
    agent.channels.memorychannel.capacity = 10000

    agent.sinks.target.type = avro
    agent.sinks.target.channel = memorychannel
    agent.sinks.target.hostname = 172.16.124.130
    agent.sinks.target.port = 4545

A few parameters deserve explanation. The Flume source can limit the size of each event through the deserializer.maxLineLength property; by default each event is at most 2048 bytes. The byte capacity of the Flume memory channel defaults to 80% of the maximum memory available to the JVM on the local server, and can be tuned through the byteCapacityBufferPercentage and byteCapacity parameters.
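Assuming this file is saved somewhere like /opt/flume/conf/ftp-agent.conf, the agent can be started with the standard flume-ng launcher as sketched below; the paths are placeholders, and the agent name must match the property prefix used in the file.

    # Start the FTP-side agent; --name must match the "agent" prefix in the config.
    # Config directory and file path are illustrative placeholders.
    flume-ng agent --conf /opt/flume/conf \
                   --conf-file /opt/flume/conf/ftp-agent.conf \
                   --name agent \
                   -Dflume.root.logger=INFO,console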
Note that the log files placed in the folder Flume listens to on the FTP server must not share the same name, or Flume will report an error and stop working; the simplest solution is to put a timestamp in each log file name, as in the push script sketched earlier.

The configuration file on the Hadoop server is as follows:

    agent.sources = origin
    agent.channels = memorychannel
    agent.sinks = target

    agent.sources.origin.type = avro
    agent.sources.origin.channels = memorychannel
    agent.sources.origin.bind = 0.0.0.0
    agent.sources.origin.port = 4545

    # agent.sources.origin.interceptors = i1 i2
    # agent.sources.origin.interceptors.i1.type = timestamp
    # agent.sources.origin.interceptors.i2.type = host
    # agent.sources.origin.interceptors.i2.hostHeader = hostname

    agent.sinks.loggersink.type = logger
    agent.sinks.loggersink.channel = memorychannel

    agent.channels.memorychannel.type = memory
    agent.channels.memorychannel.capacity = 5000000
    agent.channels.memorychannel.transactionCapacity = 1000000

    agent.sinks.target.type = hdfs
    agent.sinks.target.channel = memorychannel
    agent.sinks.target.hdfs.path = /flume/events/%Y-%m-%d/%H%M%S
    agent.sinks.target.hdfs.filePrefix = data-%{hostname}
    agent.sinks.target.hdfs.rollInterval = 60
    agent.sinks.target.hdfs.rollSize = 1073741824
    agent.sinks.target.hdfs.rollCount = 1000000
    agent.sinks.target.hdfs.round = true
    agent.sinks.target.hdfs.roundValue = 10
    agent.sinks.target.hdfs.roundUnit = minute
    agent.sinks.target.hdfs.useLocalTimeStamp = true
    agent.sinks.target.hdfs.minBlockReplicas = 1
    agent.sinks.target.hdfs.writeFormat = Text
    agent.sinks.target.hdfs.fileType = DataStream
The round, roundValue and roundUnit parameters configure the HDFS sink to create a new folder in HDFS every 10 minutes for the data pulled down from the FTP server.
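As described at the start of the article, the files that land under /flume/events are then cleaned with MapReduce and analyzed with Hive. Purely as a hedged sketch of how Hive can be pointed at that output, the fragment below creates an external table over one of the time-bucketed directories; the table name, the single-column layout and the concrete directory are illustrative assumptions, not taken from the original article.

    #!/bin/bash
    # Illustrative only: expose one Flume output directory to Hive as an external
    # table with a single string column per raw log line, so that later cleaning
    # and analysis jobs have something to query. Table name, column layout and the
    # concrete time-bucket path are placeholders for this sketch.
    hive -e "
      CREATE EXTERNAL TABLE IF NOT EXISTS raw_access_log (line STRING)
      LOCATION '/flume/events/2013-09-18/070000/';
    "

In practice the MapReduce cleaning step would parse these raw lines into structured columns before Hive computes the clickstream statistics.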

Troubleshooting: when using Flume to pull files into HDFS you may find the data scattered across many small 1 KB-5 KB files. This is usually addressed with the HDFS sink settings used above: raising hdfs.rollInterval, hdfs.rollSize and hdfs.rollCount so that files are not rolled after every few events, and setting hdfs.minBlockReplicas = 1.
