First, PV statistics (page views)
(1) Basic concepts
Page views (PV) are usually the main indicator for measuring a news channel, a website, or even a single news article, and are one of the most commonly used metrics for evaluating website traffic. Monitoring a site's PV trend and analyzing the reasons for changes is routine work for many webmasters. A "page" here generally means an ordinary HTML page, but it also includes HTML content generated dynamically by PHP, JSP, and similar technologies. Each request for HTML content from a browser counts as one PV, and these requests accumulate into the PV total.
(2) Calculation method
Each time a user accesses a page on the website, one PV is recorded. Repeated visits by the same user to the same page are counted cumulatively.
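The counting rule above can be sketched in a few lines of Python. This is only an illustration, not the Hive job used later; the `(user, page)` tuples and the name `count_pv` are hypothetical.

```python
from collections import Counter


def count_pv(requests):
    """Count page views: every page request adds one, repeats included."""
    # The user is deliberately ignored: PV does not deduplicate visitors.
    pv_by_page = Counter(page for _user, page in requests)
    total_pv = sum(pv_by_page.values())
    return total_pv, pv_by_page


# Hypothetical sample: user_a opens /m/ twice, so both hits are counted.
sample_requests = [
    ("user_a", "/m/"),
    ("user_a", "/m/"),
    ("user_b", "/m/"),
    ("user_b", "/portal/ware"),
]

total_pv, pv_by_page = count_pv(sample_requests)
print(total_pv, dict(pv_by_page))
```

The same user hitting the same page twice contributes two views, which is the behavior that distinguishes PV from UV.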
(3) Statistical analysis
--1. Create the database
create database jfyun;
use jfyun;
--2. Create the user access record table as a partitioned external table
create external table data_collect (
accessdate string,
accesshour int,
requestmethod string,
referurl string,
requestprotocal string,
returnstatus string,
requesturl string,
referdomain string,
userorigin string,
originword string,
browser string,
browserversion string,
operatesystem string,
requestip string,
ipnumber int,
userprovince string,
screensize string,
screencolor string,
pagetitle string,
sitetype string,
userflag string,
visitflag string,
sflag string,
timeonpage int
) partitioned by (access_day string)
row format delimited fields terminated by '\t'
location '/user/hadoop/external/jfpc/output';
--3. Create partitions for the table (create the partition first, then load data into it)
alter table data_collect add partition (access_day='20150705');
alter table data_collect add partition (access_day='20150706');
--4. Run the MapReduce program to load data into the partitioned table
hadoop jar Jfyun.jar com.yun.job.AccessLogEnhanceImportHDFS external/jfpc/input/20150705/130/clickdata-2015070500.log external/jfpc/output/access_day=20150705
hadoop jar Jfyun.jar com.yun.job.AccessLogEnhanceImportHDFS external/jfpc/input/20150705/131/clickdata-2015070500.log external/jfpc/output/access_day=20150705
hadoop jar Jfyun.jar com.yun.job.AccessLogEnhanceImportHDFS external/jfpc/input/20150705/130/clickdata-2015070501.log external/jfpc/output/access_day=20150705
hadoop jar Jfyun.jar com.yun.job.AccessLogEnhanceImportHDFS external/jfpc/input/20150705/131/clickdata-2015070501.log external/jfpc/output/access_day=20150705
hadoop jar Jfyun.jar com.yun.job.AccessLogEnhanceImportHDFS external/jfpc/input/20150706 external/jfpc/output/access_day=20150706
--5. Show the table partitions
show partitions data_collect;
--6. View partition data by partition condition
select * from data_collect where access_day='20150705';
select * from data_collect where access_day='20150706';
--7. Analyze PV data with Hive
--7.1. PV by day
select substr(accessdate,1,8), count(1) from data_collect where access_day='20150706' group by substr(accessdate,1,8);
--7.2. PV by hour, to be inserted into a specified table
select accesshour, count(1) stacount from data_collect where access_day='20150706' group by accesshour;
--7.3. PV per day per province
select substr(accessdate,1,8), userprovince, count(1) from data_collect where access_day='20150706' group by substr(accessdate,1,8), userprovince;
--7.4. PV per day, per province, per hour
select substr(accessdate,1,8), userprovince, accesshour, count(1) from data_collect where access_day='20150706' group by substr(accessdate,1,8), userprovince, accesshour;
Second, UV statistics (unique visitors)
(1) Basic concepts
UV (unique visitors, also called independent IPs) refers to independent users or visitors: the number of distinct IP addresses from which people visit a site or click on a news item.
(2) Calculation method
Within one day (00:00-24:00), only the first visit from each independent visitor is counted. A cookie can be set on the first visit to mark the visitor as a new user; on later visits the same visitor is treated as a returning user.
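As a rough Python sketch of this rule (not the project's MapReduce or Hive code), the function below counts each visitor once per day and treats a visitor as new only on the first day it appears. The `(day, visitor_id)` records and the name `count_uv` are hypothetical, and the input is assumed to be in chronological order.

```python
from collections import defaultdict


def count_uv(visits):
    """Return (uv_by_day, new_users_by_day) for chronological (day, visitor_id) pairs."""
    seen_by_day = defaultdict(set)   # visitors already counted on each day
    known_visitors = set()           # stands in for the "has a cookie" check
    new_users_by_day = defaultdict(int)

    for day, visitor_id in visits:
        if visitor_id in seen_by_day[day]:
            continue  # only the first visit of the day counts toward UV
        seen_by_day[day].add(visitor_id)
        if visitor_id not in known_visitors:
            known_visitors.add(visitor_id)
            new_users_by_day[day] += 1

    uv_by_day = {day: len(ids) for day, ids in seen_by_day.items()}
    return uv_by_day, dict(new_users_by_day)


# Hypothetical sample: visitor "a" returns on the second day.
sample_visits = [
    ("20150705", "a"),
    ("20150705", "a"),
    ("20150705", "b"),
    ("20150706", "a"),
    ("20150706", "c"),
]

uv_by_day, new_users_by_day = count_uv(sample_visits)
print(uv_by_day, new_users_by_day)
```

In a real deployment the `known_visitors` set would be replaced by the cookie set on the first visit; the set is used here only to keep the sketch self-contained.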
(3) Statistical analysis
Project requirements:
(1) Users visit an e-commerce website. User behavior logs are collected by a JS tracking snippet embedded in the pages, a MapReduce program then loads the logs into HBase, and UV is calculated from the resulting table.
(2) The stored data is then analyzed statistically.
(3) User log format (simulated data, one record per line):
"06/jul/2015:00:01:04 +0800" "GET" "http%3a//jf.10086.cn/m/" "http/1.1" "$" "http://jf.10086.cn/m/subject/100000000000009_0.html" "mozilla/5.0 (Linux; U Android 4.4.2; ZH-CN; Lenovo a3800-d build/lenovoa3800-d) applewebkit/533.1 (khtml, like Gecko) version/4.0 mqqbrowser/5.4 tbs/025438 Mobile safari/533.1 micromessenger/6.2.0.70_r1180778.561 nettype/cmnet language/zh_cn" "10.139.198.176" "480x854" "24" "%u5927%u7c7b%u5217%u8868%u9875_%u4e2d%u56fd%u79fb%u52a8%u79ef%u5206%u5546%u57ce" "0" "3037487029517069460000" "3037487029517069460000" "" "1"
"06/jul/2015:01:01:04 +0800" "GET" "http%3a//jf.10086.cn/portal/ware/web/searchwareaction%3faction%3dsearchwareinfo%26pager.offset%3d144" "http/1.1" "" "http://jf.10086.cn/portal/ware/web/searchwareaction?action=searchwareinfo&pager.offset=156" "mozilla/5.0 (Linux; U Android 4.4.2; ZH-CN; HUAWEI mt2-l01 build/huaweimt2-l01) applewebkit/534.30 (khtml, like Gecko) version/4.0 ucbrowser/10.5.2.598 U3/0.8.0 Mobile safari/534.30" "223.73.104.224" "720x1208" "+" "%u641c%u7d22_%u4e2d%u56fd%u79fb%u52a8%u79ef%u5206%u5546%u57ce" "0" "3046252153674140570000" "3046252153674140570000" "1" "2699"
"06/jul/2015:02:01:04 +0800" "GET" "" "http/1.1" "" "http://jf.10086.cn/" "mozilla/5.0 (Linux; Android 4.4.4; Vivo y13l build/ktu84p) applewebkit/537.36 (khtml, like Gecko) version/4.0 chrome/33.0.0.0 Mobile safari/537.36 baiduboxapp/5.1 (Baidu; P1 4.4.4)" "10.154.210.240" "480x855" "+" "%u9996%u9875_%u4e2d%u56fd%u79fb%u52a8%u79ef%u5206%u5546%u57ce" "0" "3098781670304015290000" "3098781670304015290000" "0" "831"
"06/jul/2015:03:01:07 +0800" "GET" "http%3a//wx.10086.cn/wechat-website/wechatwebsite/accumulatepoints" "http/1.1" "" "http://jf.10086.cn/m/" "mozilla/5.0 (Linux; U Android 4.4.2; ZH-CN; Lenovo a3800-d build/lenovoa3800-d) applewebkit/533.1 (khtml, like Gecko) version/4.0 mqqbrowser/5.4 tbs/025438 Mobile safari/533.1 micromessenger/6.2.0.70_r1180778.561 nettype/cmnet language/zh_cn" "10.139.198.176" "480x854" "24" "%u9996%u9875_%u4e2d%u56fd%u79fb%u52a8%u79ef%u5206%u5546%u57ce" "0" "3037487029517069460000" "3037487029517069460000" "1" "135"
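Each log record is a sequence of double-quoted fields separated by spaces, so it can be split with a single regular expression. The Python sketch below shows the idea. The field names in `FIELD_NAMES` are assumptions inferred from the table columns used earlier, not names confirmed by the log producer, and `SAMPLE_LINE` is a shortened version of the last record (user agent and page title abbreviated).

```python
import re

# Assumed field order; these names are illustrative, not from the log spec.
FIELD_NAMES = [
    "access_time", "method", "request_url", "protocol", "status",
    "refer_url", "user_agent", "ip", "screen_size", "screen_color",
    "page_title", "site_type", "user_flag", "visit_flag", "sflag",
    "time_on_page",
]

# One field is everything between a pair of double quotes (may be empty).
QUOTED_FIELD = re.compile(r'"([^"]*)"')


def parse_log_line(line):
    """Split one quoted-field log record into a dict keyed by FIELD_NAMES."""
    values = QUOTED_FIELD.findall(line)
    if len(values) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} fields, got {len(values)}")
    return dict(zip(FIELD_NAMES, values))


# Shortened sample record (user agent and title abbreviated for readability).
SAMPLE_LINE = (
    '"06/jul/2015:03:01:07 +0800" "GET" '
    '"http%3a//wx.10086.cn/wechat-website/wechatwebsite/accumulatepoints" '
    '"http/1.1" "" "http://jf.10086.cn/m/" "mozilla/5.0 (Linux)" '
    '"10.139.198.176" "480x854" "24" "%u9996%u9875" "0" '
    '"3037487029517069460000" "3037487029517069460000" "1" "135"'
)

record = parse_log_line(SAMPLE_LINE)
print(record["ip"], record["user_flag"], record["time_on_page"])
```

Rejecting records with an unexpected field count keeps malformed lines from silently shifting values into the wrong columns.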
(4) Data source: the following request URL can be used as a reference
http://jf.10086.cn/analyzeVesopera.gif?screenSize=1366x768&screenColor=24&pageTitle=%u9996%u9875_%u4e2d%u56fd%u79fb%u52a8%u79ef%u5206%u5546%u57ce&referrerpage=&sitetype=0&uid=20523849176242946000&sid=56080848979763680000&sflag=1&countlog=1443006061700&onloadtotaltime=135
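To see what this tracking request carries, its query string can be unpacked with the Python standard library. This is a sketch under the assumption that the page title uses the non-standard `%uXXXX` escape (as produced by JavaScript's `escape()`), which `urllib.parse` leaves untouched, so a small helper decodes it. The names `decode_percent_u` and `parse_tracking_url` are hypothetical.

```python
import re
from urllib.parse import parse_qsl, urlsplit

PERCENT_U = re.compile(r"%u([0-9a-fA-F]{4})")


def decode_percent_u(text):
    """Decode %uXXXX escapes, which standard URL decoding does not handle."""
    return PERCENT_U.sub(lambda m: chr(int(m.group(1), 16)), text)


def parse_tracking_url(url):
    """Return the query parameters of a tracking request as a dict."""
    query = urlsplit(url).query
    # keep_blank_values=True preserves empty parameters such as referrerpage.
    pairs = parse_qsl(query, keep_blank_values=True)
    return {key: decode_percent_u(value) for key, value in pairs}


TRACKING_URL = (
    "http://jf.10086.cn/analyzeVesopera.gif?screenSize=1366x768&screenColor=24"
    "&pageTitle=%u9996%u9875_%u4e2d%u56fd%u79fb%u52a8%u79ef%u5206%u5546%u57ce"
    "&referrerpage=&sitetype=0&uid=20523849176242946000"
    "&sid=56080848979763680000&sflag=1&countlog=1443006061700"
    "&onloadtotaltime=135"
)

params = parse_tracking_url(TRACKING_URL)
print(params["screenSize"], params["uid"], params["onloadtotaltime"])
```

The parameters line up with log fields such as screen size, screen color, page title, and the time spent loading the page.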
Technical solution:
(1) Write a MapReduce job that reads each row of log data and saves it to HBase.
(2) Let Hive operate on the HBase table data.
(3) Use Hive to analyze the HBase table data and study visitor behavior.
1. Create the table
create 'uservisitinfo', {NAME => 'info'}
2. Import into HBase
hadoop jar Jfyun.jar com.yun.job.AccessLogImportHBase external/jfpc/input/20150705/130/clickdata-2015070500.log
hadoop jar Jfyun.jar com.yun.job.AccessLogImportHBase external/jfpc/input/20150705/131/clickdata-2015070500.log
hadoop jar Jfyun.jar com.yun.job.AccessLogImportHBase external/jfpc/input/20150705/130/clickdata-2015070501.log
hadoop jar Jfyun.jar com.yun.job.AccessLogImportHBase external/jfpc/input/20150705/131/clickdata-2015070501.log
hadoop jar Jfyun.jar com.yun.job.AccessLogImportHBase external/jfpc/input/20150706
3. View data in HBase
3.1 Full table view
scan 'uservisitinfo'
3.2 View by rowkey
hbase(main):012:0> get 'uservisitinfo', '20150706_3037487029517069460000'
COLUMN                  CELL
info:firstaccessurl     timestamp=1443000064923, value=/m/subject/100000000000009_0.html
info:browser            timestamp=1443000064923, value=safari
info:browserversion     timestamp=1443000064923, value=533.1
info:firstaccesstime    timestamp=1443000064923, value=20150706000104
info:operatesystem      timestamp=1443000064923, value=linux
info:recentaccesstime   timestamp=1443000065001, value=20150706030107
info:recentaccessurl    timestamp=14430000650, value=/m/
info:screencolor        timestamp=1443000064923, value=24
info:screensize         timestamp=1443000064923, value=480x854
info:sitetype           timestamp=1443000064923, value=0
info:userflag           timestamp=1443000064923, value=3037487029517069460000
info:userprovince       timestamp=1443000064923, value=999
info:uservisitid        timestamp=1443000064923, value=20150706_3037487029517069460000
info:visitcount         timestamp=1443000065001, value=2
info:visitday           timestamp=1443000064923, value=20150706
info:visitflag          timestamp=1443000064923, value=3037487029517069460000
info:visithour          timestamp=1443000064923, value=0
info:visitip            timestamp=1443000064923, value=10.139.198.176
info:visitkeeptime      timestamp=1443000065001, value=10803
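Two values in this row can be reproduced from the others: the rowkey looks like the visit day joined to the user flag with an underscore, and `visitkeeptime` equals the number of seconds between `firstaccesstime` and `recentaccesstime`. The Python sketch below checks that reading against the row. Whether the MapReduce job actually derives them this way is an assumption, and the function names are hypothetical.

```python
from datetime import datetime

# Timestamps in the row use the compact form yyyyMMddHHmmss.
TIME_FORMAT = "%Y%m%d%H%M%S"


def make_row_key(visit_day, user_flag):
    """Build a rowkey of the form <visitday>_<userflag> (assumed layout)."""
    return f"{visit_day}_{user_flag}"


def visit_keep_seconds(first_access_time, recent_access_time):
    """Seconds elapsed between the first and the most recent access."""
    first = datetime.strptime(first_access_time, TIME_FORMAT)
    recent = datetime.strptime(recent_access_time, TIME_FORMAT)
    return int((recent - first).total_seconds())


row_key = make_row_key("20150706", "3037487029517069460000")
keep_seconds = visit_keep_seconds("20150706000104", "20150706030107")
print(row_key, keep_seconds)
```

Prefixing the rowkey with the day keeps one row per visitor per day, which matches the daily UV definition and lets a day's visitors be read with a prefix scan.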
4. Analyze HBase table data with Hive
4.1 Create the HBase table and add data to it (the uservisitinfo table above)
4.2 Create a Hive table mapped to the HBase table
(1) Create the table
create external table user_visit_info (
uservisitid string,
firstaccessurl string,
browserversion string,
firstaccesstime string,
operatesystem string,
recentaccesstime string,
recentaccessurl string,
screencolor string,
screensize string,
sitetype string,
userflag string,
userprovince string,
visitcount string,
visitday string,
visitflag string,
visithour string,
visitip string,
visitkeeptime string
)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties ("hbase.columns.mapping" = ":key,info:firstaccessurl,info:browserversion,info:firstaccesstime,info:operatesystem,info:recentaccesstime,info:recentaccessurl,info:screencolor,info:screensize,info:sitetype,info:userflag,info:userprovince,info:visitcount,info:visitday,info:visitflag,info:visithour,info:visitip,info:visitkeeptime")
tblproperties ("hbase.table.name" = "uservisitinfo");
4.3 Statistical analysis with Hive