3 Module Development: Data Acquisition
3.1 Requirements
The data acquisition requirement breaks down into two parts.
1) Capturing user access behavior on the page. The specific development work:
1. Develop the page-embedded JS that captures user access behavior
2. Develop the backend service that accepts the requests from the page JS and writes them to the log
This part of the work can also be attributed to the "data source"; its development is usually the responsibility of the web development team.
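The page-embedded tracker can be sketched roughly as follows. This is a minimal illustration under assumptions, not the project's actual script: the collector endpoint (`/log.gif`) and the field names are hypothetical. It builds a query string from the collected fields and fires it via an image request, the classic beacon technique (the backend logs the request; no response body is needed):

```javascript
// Minimal page-embedded tracker sketch (hypothetical endpoint and field names).
// buildBeacon turns the collected fields into a URL-encoded query string.
function buildBeacon(fields) {
  return Object.keys(fields)
    .map(function (k) {
      return encodeURIComponent(k) + '=' + encodeURIComponent(fields[k]);
    })
    .join('&');
}

// In the browser, the tracker would collect page context and fire the beacon:
//   var fields = { url: location.href, referrer: document.referrer, title: document.title };
//   new Image().src = 'http://collector.example.com/log.gif?' + buildBeacon(fields);
```

The image-request trick works on any page without CORS setup, which is why it is a common choice for access-behavior collection.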
2) Aggregating the logs from the web servers into HDFS. This is the collection step of the data analysis system, and it is the responsibility of the data analysis platform team.
Several technical implementations are possible:
Shell script. Pros: lightweight and easy to develop. Cons: fault-tolerant handling during log collection is inconvenient to control.
Java collection program. Pros: allows fine-grained control of the collection process. Cons: large development effort.
Flume log collection framework. A mature open-source log collection system that is itself a member of the Hadoop ecosystem, has a natural affinity with the other framework components, and is highly extensible.
3.2 Building the Flume Log Collection System
1. Data source: the traffic logs generated by the servers analyzed in this project: /data/flumedata/access.log
2. Sample data content:
58.215.204.118 - - [18/Sep/2013:06:51:35 +0000] "GET /wp-includes/js/jquery/jquery.js?ver=1.10.2 HTTP/1.1" 304 0 "http://blog.fens.me/nodejs-socketio-chat/" "Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"
Field breakdown:
1. Visitor IP address: 58.215.204.118
2. Visitor user information: - -
3. Request time: [18/Sep/2013:06:51:35 +0000]
4. Request method: GET
5. Requested URL: /wp-includes/js/jquery/jquery.js?ver=1.10.2
6. Request protocol: HTTP/1.1
7. Response code: 304
8. Bytes of data returned: 0
9. Referrer URL: http://blog.fens.me/nodejs-socketio-chat/
10. Visitor's browser: Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0
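The ten fields above can be pulled out of a raw line with a regular expression. A sketch, assuming the combined-log layout shown in the sample (the field names are my own, not from the project):

```javascript
// Parse one access.log line in the combined log format shown above.
// Group order follows the field breakdown: IP, user info, time,
// method/URL/protocol (split from the quoted request), status, bytes,
// referrer, user agent.
var LOG_RE = /^(\S+) (\S+ \S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d+) (\S+) "([^"]*)" "([^"]*)"$/;

function parseLogLine(line) {
  var m = LOG_RE.exec(line);
  if (!m) return null; // skip malformed lines
  return {
    ip: m[1], userInfo: m[2], time: m[3],
    method: m[4], url: m[5], protocol: m[6],
    status: m[7], bytes: m[8],
    referrer: m[9], userAgent: m[10]
  };
}
```

Returning null for malformed lines lets downstream processing filter out truncated records instead of failing on them.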
3. Flume collection implementation. Configure the collection scheme (conf/fensi.conf):
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
# (alternative exec source: get the data with the tail command and sink it to HDFS)
#a1.sources.r1.type = exec
#a1.sources.r1.command = tail -F /home/hadoop/log/test.log
#a1.sources.r1.channels = c1

# spooldir source: collect files dropped into this directory and sink them to HDFS
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /data/flumedata
a1.sources.r1.fileHeader = false

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /fensiweblog/events/%y-%m-%d/
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
# roll the sink file every 30 seconds (rollInterval is in seconds)
a1.sinks.k1.hdfs.rollInterval = 30
# roll the sink file when it reaches 1024 bytes
a1.sinks.k1.hdfs.rollSize = 1024
# roll the sink file after 10000 events
a1.sinks.k1.hdfs.rollCount = 10000
a1.sinks.k1.hdfs.batchSize = 1
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# generated file type: the default is SequenceFile; DataStream writes plain text
a1.sinks.k1.hdfs.fileType = DataStream

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
When a file is placed into the /data/flumedata directory, Flume picks it up and sinks it into HDFS.
Start the Flume agent:
bin/flume-ng agent -c conf -f conf/fensi.conf -n a1 -Dflume.root.logger=INFO,console
Note: the -n parameter in the start command must match the agent name configured in the configuration file (a1 here).