3 Module Development: Data Acquisition
3.1 Requirements
The data acquisition requirement breaks down into two parts.
1) Capturing user access behavior on the page. The specific development work:
1. Develop the page-embedded JS that captures user access behavior
2. Develop the backend service that accepts the requests from the page JS and writes them to the log
This part of the work can also be attributed to the "data source"; its development is usually the responsibility of the web development team.
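The page-embedded tracker can be sketched roughly as follows. This is a minimal illustration under assumptions, not the project's actual script: the collector endpoint (`/log.gif`) and the field names are hypothetical. It builds a query string from the collected fields and fires it via an image request, the classic beacon technique (the backend logs the request; no response body is needed):

```javascript
// Minimal page-embedded tracker sketch (hypothetical endpoint and field names).
// buildBeacon turns the collected fields into a URL-encoded query string.
function buildBeacon(fields) {
  return Object.keys(fields)
    .map(function (k) {
      return encodeURIComponent(k) + '=' + encodeURIComponent(fields[k]);
    })
    .join('&');
}

// In the browser, the tracker would collect page context and fire the beacon:
//   var fields = { url: location.href, referrer: document.referrer, title: document.title };
//   new Image().src = 'http://collector.example.com/log.gif?' + buildBeacon(fields);
```

The image-request trick works on any page without CORS setup, which is why it is a common choice for access-behavior collection.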
2) Aggregating the logs from the web servers into HDFS. This is the collection step of the data analysis system, and it is the responsibility of the data analysis platform team.
Several technical implementations are possible:
Shell script. Pros: lightweight and easy to develop. Cons: fault-tolerant handling during log collection is inconvenient to control.
Java collection program. Pros: allows fine-grained control of the collection process. Cons: large development effort.
Flume log collection framework. A mature open-source log collection system that is itself a member of the Hadoop ecosystem, has a natural affinity with the other framework components, and is highly extensible.
3.2 Building the Flume Log Collection System
1. Data source: the traffic logs generated by the servers analyzed in this project: /data/flumedata/access.log
2. Sample data content:
58.215.204.118 - - [18/Sep/2013:06:51:35 +0000] "GET /wp-includes/js/jquery/jquery.js?ver=1.10.2 HTTP/1.1" 304 0 "http://blog.fens.me/nodejs-socketio-chat/" "Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"
Field breakdown:
1. Visitor IP address: 58.215.204.118
2. Visitor user information: - -
3. Request time: [18/Sep/2013:06:51:35 +0000]
4. Request method: GET
5. Requested URL: /wp-includes/js/jquery/jquery.js?ver=1.10.2
6. Request protocol: HTTP/1.1
7. Response code: 304
8. Bytes of data returned: 0
9. Referrer URL: http://blog.fens.me/nodejs-socketio-chat/
10. Visitor's browser: Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0
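The ten fields above can be pulled out of a raw line with a regular expression. A sketch, assuming the combined-log layout shown in the sample (the field names are my own, not from the project):

```javascript
// Parse one access.log line in the combined log format shown above.
// Group order follows the field breakdown: IP, user info, time,
// method/URL/protocol (split from the quoted request), status, bytes,
// referrer, user agent.
var LOG_RE = /^(\S+) (\S+ \S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d+) (\S+) "([^"]*)" "([^"]*)"$/;

function parseLogLine(line) {
  var m = LOG_RE.exec(line);
  if (!m) return null; // skip malformed lines
  return {
    ip: m[1], userInfo: m[2], time: m[3],
    method: m[4], url: m[5], protocol: m[6],
    status: m[7], bytes: m[8],
    referrer: m[9], userAgent: m[10]
  };
}
```

Returning null for malformed lines lets downstream processing filter out truncated records instead of failing on them.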
3. Flume collection implementation. Configure the collection scheme (conf/fensi.conf):
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
# (alternative exec source: get the data with the tail command and sink it to HDFS)
#a1.sources.r1.type = exec
#a1.sources.r1.command = tail -F /home/hadoop/log/test.log
#a1.sources.r1.channels = c1

# spooldir source: collect files dropped into this directory and sink them to HDFS
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /data/flumedata
a1.sources.r1.fileHeader = false

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /fensiweblog/events/%y-%m-%d/
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
# roll the sink file every 30 seconds (rollInterval is in seconds)
a1.sinks.k1.hdfs.rollInterval = 30
# roll the sink file when it reaches 1024 bytes
a1.sinks.k1.hdfs.rollSize = 1024
# roll the sink file after 10000 events
a1.sinks.k1.hdfs.rollCount = 10000
a1.sinks.k1.hdfs.batchSize = 1
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# generated file type: the default is SequenceFile; DataStream writes plain text
a1.sinks.k1.hdfs.fileType = DataStream

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
When a file is placed into the /data/flumedata directory, Flume picks it up and sinks it into HDFS.
Start the Flume agent:
bin/flume-ng agent -c conf -f conf/fensi.conf -n a1 -Dflume.root.logger=INFO,console
Note: the -n parameter in the start command must match the agent name configured in the configuration file (a1 here).