02. Website Clickstream Data Analysis Project _ Module Development _ Data Collection

Source: Internet
Author: User
Tags: response code, Hadoop ecosystem

3 Module Development: Data Acquisition

3.1 Requirements

The requirements for data acquisition fall broadly into two parts.

1) Capturing the user's access behavior on the page. The specific development work:

1. Develop the JavaScript embedded in the page that captures user access behavior.

2. Develop the backend that accepts the requests sent by the page JS and writes them to the log (a minimal sketch follows below).

This part of the work can also be attributed to the "data source"; its development is usually the responsibility of the web development team.
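To make the second task concrete, below is a minimal sketch of such a backend, written as a Java servlet that accepts the beacon request from the page JS and appends one access-log line per hit. The endpoint path, log path, and field layout are assumptions for illustration, not the project's actual code.

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical endpoint that the page-embedded JS calls, e.g. /log.gif?url=/some/page
@WebServlet("/log.gif")
public class BeaconServlet extends HttpServlet {

    // Assumed log location; the project collects from /data/flumedata/access.log
    private static final String LOG_FILE = "/data/flumedata/access.log";
    private static final DateTimeFormatter TIME_FMT =
            DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z");

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        // Build one access-log line; the fields mirror the sample data in section 3.2
        String line = String.format("%s - - [%s] \"GET %s HTTP/1.1\" 200 0 \"%s\" \"%s\"",
                req.getRemoteAddr(),
                ZonedDateTime.now().format(TIME_FMT),
                req.getParameter("url"),      // page URL reported by the JS
                req.getHeader("Referer"),     // where the visitor came from
                req.getHeader("User-Agent")); // visitor's browser

        // Append the line; a production collector would buffer and rotate the file
        synchronized (BeaconServlet.class) {
            try (PrintWriter out = new PrintWriter(new FileWriter(LOG_FILE, true))) {
                out.println(line);
            }
        }
        resp.setStatus(HttpServletResponse.SC_NO_CONTENT); // empty reply, JS returns fast
    }
}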

2) Aggregating the logs from the web servers into HDFS. This is the collection step of the data analysis system, and it is the responsibility of the data analysis platform construction team.

There are several ways to implement this technically:

Shell script. Advantages: lightweight and easy to develop. Disadvantages: fault tolerance during log collection is hard to control.

Java capture program. Advantages: fine-grained control over the collection process (see the sketch after this list). Disadvantages: a large development effort.

Flume log capture framework. A mature open-source log collection system that is itself a member of the Hadoop ecosystem, with a natural affinity for the other framework components and strong scalability.
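For comparison, the "Java capture program" option might, in its most bare-bones form, look like the sketch below, which pushes a local log file into HDFS through the Hadoop FileSystem API. The NameNode address, user, and target naming are assumptions; the fault tolerance, retries, and scheduling that Flume provides would still have to be written by hand.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal "Java capture program": push one rotated log file into HDFS
public class LogUploader {
    public static void main(String[] args) throws Exception {
        // Assumed NameNode address and HDFS user; adjust to the actual cluster
        FileSystem fs = FileSystem.get(
                new URI("hdfs://namenode:9000"), new Configuration(), "hadoop");

        Path local = new Path("/data/flumedata/access.log");
        Path remote = new Path("/fensiweblog/events/access-"
                + System.currentTimeMillis() + ".log");

        // One call does the copy; retry logic, checksum handling, and scheduling
        // would all be hand-written here, which is the effort Flume saves
        fs.copyFromLocalFile(local, remote);
        fs.close();
    }
}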

3.2 Flume log collection system setup:

1. Data source information: the data comes from the traffic logs generated by the servers analyzed in this project, at /data/flumedata/access.log

2. Sample data content:

58.215.204.118 - - [18/Sep/2013:06:51:35 +0000] "GET /wp-includes/js/jquery/jquery.js?ver=1.10.2 HTTP/1.1" 304 0 "http://blog.fens.me/nodejs-socketio-chat/" "Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"

Field resolution:

1. Visitor IP address: 58.215.204.118
2. Visitor user information: - -
3. Request time: [18/Sep/2013:06:51:35 +0000]
4. Request method: GET
5. Requested URL: /wp-includes/js/jquery/jquery.js?ver=1.10.2
6. Request protocol: HTTP/1.1
7. Response code: 304
8. Data traffic returned: 0
9. Visitor's source URL: http://blog.fens.me/nodejs-socketio-chat/
10. Visitor's browser: Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0

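To make the field layout concrete, here is an illustrative sketch of parsing one such line with a regular expression; the class name and the regex are assumptions for illustration, not part of the project.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative parser for the combined-log-format sample above
public class AccessLogParser {

    // IP, user info, [time], "method url protocol", status, bytes, "referrer", "agent"
    private static final Pattern LINE = Pattern.compile(
            "^(\\S+) (\\S+ \\S+) \\[([^\\]]+)\\] \"(\\S+) (\\S+) (\\S+)\" "
            + "(\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"$");

    public static void main(String[] args) {
        String sample = "58.215.204.118 - - [18/Sep/2013:06:51:35 +0000] "
                + "\"GET /wp-includes/js/jquery/jquery.js?ver=1.10.2 HTTP/1.1\" 304 0 "
                + "\"http://blog.fens.me/nodejs-socketio-chat/\" "
                + "\"Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0\"";

        Matcher m = LINE.matcher(sample);
        if (m.matches()) {
            System.out.println("ip       = " + m.group(1));  // 1. visitor IP address
            System.out.println("user     = " + m.group(2));  // 2. visitor user information
            System.out.println("time     = " + m.group(3));  // 3. request time
            System.out.println("method   = " + m.group(4));  // 4. request method
            System.out.println("url      = " + m.group(5));  // 5. requested URL
            System.out.println("protocol = " + m.group(6));  // 6. request protocol
            System.out.println("status   = " + m.group(7));  // 7. response code
            System.out.println("bytes    = " + m.group(8));  // 8. data traffic returned
            System.out.println("referrer = " + m.group(9));  // 9. visitor's source URL
            System.out.println("agent    = " + m.group(10)); // 10. visitor's browser
        }
    }
}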
3. Flume collection implementation: configure the collection scheme (conf/fensi.conf):

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
# (alternative: an exec source that gets the data with the tail command and sinks it to HDFS)
#a1.sources.r1.type = exec
#a1.sources.r1.command = tail -F /home/hadoop/log/test.log
#a1.sources.r1.channels = c1

# Describe/configure the source
# collection directory; files placed here are sent to HDFS
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /data/flumedata
a1.sources.r1.fileHeader = false

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /fensiweblog/events/%y-%m-%d/
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
# roll the sink file every 30 seconds (hdfs.rollInterval is in seconds)
a1.sinks.k1.hdfs.rollInterval = 30
# roll the sink file when it reaches this size in bytes
a1.sinks.k1.hdfs.rollSize = 1024
# roll the sink file after this many records
a1.sinks.k1.hdfs.rollCount = 10000
a1.sinks.k1.hdfs.batchSize = 1
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# generated file type: the default is SequenceFile; DataStream writes plain text
a1.sinks.k1.hdfs.fileType = DataStream

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Once a file is placed into the directory /data/flumedata, Flume picks it up and writes it to HDFS.

Start the Flume agent:

bin/flume-ng agent -c conf -f conf/fensi.conf -n a1 -Dflume.root.logger=INFO,console

Note: the -n parameter in the start command must match the agent name configured in the configuration file (here, a1).
