Use Spark for Sogou log analysis: count the number of searches per hour

Sogou log dataset: http://www.sogou.com/labs/resource/q.php

Each line of the log has the following fields, separated by tabs:

    access time (HH:mm:ss)    user ID    [query word]    rank of the clicked URL in the results    order of the click    clicked URL

Sample lines (query words translated from Chinese):

    00:00:00  2982199073774412   [360 safety defender]      8  3  download.it.com.cn/softweb/software/firewall/antivirus/20067/17938.html
    00:00:00  07594220010824798  [looting relief supplies]  1  1  news.21cn.com/social/daqian/2008/05/29/4777194_1.shtml
    00:00:00  5228056822071097   [75,810 teams]             5     www.greatoo.com/greatoo_cn/list.asp?link_id=276&title=%BE%DE%C2%D6%D0%C2%CE%C5
    00:00:00  6140463203615646   [rope]                           www.jd-cd.com/jd_opus/xx/200607/706.html

The full program:

    package sogolog

    import org.apache.spark.rdd.RDD
    import org.apache.spark.{SparkConf, SparkContext}

    /**
     * Count the number of searches per hour.
     */
    object CountByHours {
      def main(args: Array[String]): Unit = {
        // 1. Start the Spark context and read the log file
        val conf = new SparkConf().setAppName("Sogou count by hours").setMaster("local")
        val sc = new SparkContext(conf)
        val orgRdd = sc.textFile("c:\\users\\king\\desktop\\sogouq.reduced\\sogouq.reduced")
        println("Total number of rows: " + orgRdd.count())

        // 2. Map: extract the hour from each line and emit (hour, 1)
        val map: RDD[(String, Int)] = orgRdd.map(line => {
          // The first two characters of the access time are the hour
          val h: String = line.substring(0, 2)
          (h, 1)
        })

        // 3. Reduce: merge the map results by key, summing the counts
        val reduce: RDD[(String, Int)] = map.reduceByKey((x, y) => x + y)

        // Print the statistics sorted by hour
        reduce.sortByKey().collect().foreach(println)
      }
    }

Operation Result:
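The map/reduceByKey pattern at the heart of the program can be illustrated without a Spark cluster. Below is a minimal sketch in plain Scala, using a few hypothetical sample lines in the Sogou format (the object and method names are illustrative, not part of the original program): each line is mapped to its two-character hour prefix, then the hours are grouped and counted, which is exactly what `map` plus `reduceByKey` do in parallel on an RDD.

```scala
// A minimal, Spark-free sketch of the per-hour count logic.
object HourCountSketch {
  // Hypothetical sample lines in the Sogou tab-separated format
  val lines: Seq[String] = Seq(
    "00:00:00\t2982199073774412\t[360 safety defender]\t8\t3\tdownload.it.com.cn/...",
    "00:15:42\t0759422001082479\t[query]\t1\t1\tnews.21cn.com/...",
    "01:02:03\t5228056822071097\t[query2]\t5\t5\twww.greatoo.com/..."
  )

  def countByHour(lines: Seq[String]): Map[String, Int] =
    lines
      .map(_.substring(0, 2))                 // extract the hour, like the RDD map step
      .groupBy(identity)                      // group identical hours together
      .map { case (h, hs) => (h, hs.size) }   // count per hour, like reduceByKey(_ + _)
}
```

Calling `HourCountSketch.countByHour(HourCountSketch.lines)` yields `Map("00" -> 2, "01" -> 1)`; Spark applies the same logic, but the map step runs per partition and the reduce step shuffles and merges counts by key across partitions.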