Analysis of user behavior paths based on Spark

Research background

The Internet industry pays more and more attention to its customers' behavior preferences. Whether in e-commerce or finance, a great deal can be built on user behavior: an e-commerce company can summarize user preferences to recommend goods, and a financial company can use behavior as a signal for anti-fraud. This article introduces one important function: counting user behavior paths from behavior logs to give operations staff better support for decision making, similar to the user path analysis in mature products such as Adobe Analytics, with the final result displayed in an open-source big data visualization tool.

Behavior path data is very large, and the UV metric cannot be pre-computed because the time window is not known in advance: if five levels are displayed, a single starting page can expand into on the order of 10^5 paths, and with thousands of pages the total volume is far too large to pre-aggregate, so the paths can only be computed in real time. Spark is well suited to this kind of iterative computation, which makes it a natural choice.

Solution Process Description

When a user queries the behavior path details for a starting page, the front end sends an RPC request to the backend, the Spark program computes the result in real time, and the Java side parses the returned data and displays it.

Preparatory work

1. First, you need behavior data. The user behavior log must contain at least the following four fields: access time, device fingerprint, session ID, and page name. The page name can be self-defined and is used to identify a page or a class of pages; page names must not be duplicated.

2. Then perform a first-level cleaning of the behavior log (in Hive) to turn the data into the following format:

Device fingerprint | Session ID | Page path (ascending by time) | Time
fpid1              | sessionid1 | a_b_c_d_e_f_g                 | 2017-01-13

Here a, b, c, and so on are page names. The cleaning uses Hive's row_number and concat_ws functions, whose usage is easy to look up. The cleaned data lands in a Hive table that is used in the following steps, and the cleaning job runs on a T+1 schedule; a sketch of such a cleaning query is given after this list.

3. Make sure you are clear about how recursion works, since the computation below relies on it.
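The article does not include the cleaning job itself, so the following is only a minimal sketch of what it could look like, assuming a hypothetical raw log table specter.t_behavior_log(fpid, sessionid, page_id, access_time, day); the real target table used later (specter.t_pagename_path_sparksource) also carries src and path_type columns, which are omitted here for brevity. Because Hive's collect_list does not guarantee ordering, each page is prefixed with its zero-padded row_number, the collected array is sorted, and the prefix is stripped again.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Hypothetical sketch of the T+1 first-level cleaning described in step 2.
// Assumed raw log table: specter.t_behavior_log(fpid, sessionid, page_id, access_time, day).
object CleanPathSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("clean-path-sketch"))
    val hiveContext = new HiveContext(sc)
    val day = args(0) // partition to clean, e.g. "2017-01-13"

    // row_number orders the pages inside each session; because collect_list does not
    // guarantee order, each page is prefixed with its zero-padded row number, the
    // collected array is sorted, and the prefix is stripped again with regexp_replace.
    val cleaned = hiveContext.sql(
      s"""
         |SELECT fpid,
         |       sessionid,
         |       regexp_replace(
         |         concat_ws('_', sort_array(collect_list(
         |           concat(lpad(cast(rn AS string), 5, '0'), ':', page_id)))),
         |         '[0-9]{5}:', '') AS path,
         |       max(day)           AS time
         |FROM (
         |  SELECT fpid, sessionid, page_id, day,
         |         row_number() OVER (PARTITION BY fpid, sessionid
         |                            ORDER BY access_time ASC) AS rn
         |  FROM specter.t_behavior_log
         |  WHERE day = '$day'
         |) t
         |GROUP BY fpid, sessionid
       """.stripMargin)

    // In production this result would be written back to the Hive path table
    // (specter.t_pagename_path_sparksource), which also carries src and path_type columns.
    cleaned.show(10)
    sc.stop()
  }
}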

Spark processing

Process Overview:

1. Build a multi-way tree class. Its main attributes are name, the full page path such as a_b_c, and childList, the list of child nodes; the construction and recursive traversal of the tree are shown in the attached code.

2. Read the data produced in the previous step for the requested time range, recursively compute the metrics for each level of pages, and insert each node into the initialized root Node according to its page path.

3. Recursively traverse the root node built in the previous step and replace each node's page ID with the corresponding page name, looking the names up through a Spark DataFrame.

4. Convert the root object to JSON and return it to the front end.
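The attached program is fairly dense, so here is a stripped-down, hypothetical sketch (with no Spark involved and made-up metric values) of the idea behind steps 1 and 4: each node is keyed by its full path, a node is inserted under the parent whose path equals its own path minus the last segment, and Gson serializes the finished tree into the JSON returned to the front end.

import java.util
import com.google.gson.Gson

// Simplified tree node: full path (e.g. "a_b_c") plus metrics and children.
class PathNode(var name: String, var pv: Long, var uv: Long,
               val childList: util.ArrayList[PathNode] = new util.ArrayList[PathNode]()) {

  // Insert a node under the parent whose path equals the node's path minus its last segment.
  def insertNode(node: PathNode): Boolean = {
    val parentPath = node.name.stripSuffix("_" + node.name.split("_").last)
    if (parentPath == name) {
      childList.add(node)
      true
    } else {
      (0 until childList.size()).exists(i => childList.get(i).insertNode(node))
    }
  }
}

object PathTreeDemo {
  def main(args: Array[String]): Unit = {
    val root = new PathNode("a", pv = 100, uv = 40)   // illustrative numbers only
    root.insertNode(new PathNode("a_b", pv = 60, uv = 25))
    root.insertNode(new PathNode("a_c", pv = 30, uv = 12))
    root.insertNode(new PathNode("a_b_d", pv = 10, uv = 6))
    // Step 4: serialize the whole tree to JSON for the front end.
    println(new Gson().toJson(root))
  }
}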

The attached code is as follows.

import java.util

import com.google.gson.Gson
import org.apache.log4j.{Level, Logger => LG}
import org.apache.spark.SparkContext
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.hive.HiveContext

/**
 * User behavior path real-time calculation.
 * Created by Chouyarn on 2016/12/12.
 */

/**
 * Tree structure class.
 *
 * @param name      full page path, e.g. a_b_c
 * @param path      path identifier
 * @param visit     visit count
 * @param pv        PV
 * @param uv        UV
 * @param childList list of child nodes
 */
class Node(var name: String,
           var path: Any,
           var visit: Any,
           var pv: Any,
           var uv: Any,
           var childList: util.ArrayList[Node]) extends Serializable {

  /**
   * Add a child node.
   */
  def addNode(node: Node) = {
    childList.add(node)
  }

  /**
   * Traverse the tree depth first, trimming each node's name to its last path segment.
   */
  def traverse(): Unit = {
    if (childList.isEmpty) return
    val childNum = childList.size
    for (i <- 0 until childNum) {
      val child: Node = childList.get(i)
      child.name = child.name.split("_").last // drop the leading absolute path
      child.traverse()
    }
  }

  /**
   * Traverse the tree depth first, replacing each node's page ID with its page name
   * looked up from the page dictionary DataFrame.
   */
  def traverse(pages: DataFrame): Unit = {
    if (childList.isEmpty || childList.size() == 0) return
    val childNum = childList.size
    for (i <- 0 until childNum) {
      val child: Node = childList.get(i)
      child.name = child.name.split("_").last
      val id = pages.filter("page_id='" + child.name + "'")
        .select("page_name").first().getString(0)
      child.name = id // replace the ID with the page name
      child.traverse(pages)
    }
  }

  /**
   * Dynamically insert a node under the parent whose path is the node's path minus its last segment.
   */
  def insertNode(node: Node): Boolean = {
    val insertName = node.name
    if (insertName.stripSuffix("_" + insertName.split("_").last).equals(name)) {
      addNode(node)
      true
    } else {
      val childNum = childList.size
      for (i <- 0 until childNum) {
        if (childList.get(i).insertNode(node)) return true
      }
      false
    }
  }
}

/**
 * Processing class. CleanDataWithRdd and SparkUtil are the author's own utility
 * classes (a job base class and a factory for a SparkContext on YARN).
 */
class Path extends CleanDataWithRdd {
  LG.getRootLogger.setLevel(Level.ERROR) // control Spark log output level
  val sc: SparkContext = SparkUtil.createSparkContextYarn("path")
  val hiveContext = new HiveContext(sc)

  override def handleData(conf: Map[String, String]): Unit = {
    val num = conf.getOrElse("depth", 5)            // path depth
    val pageName = conf.getOrElse("pageName", "")   // starting page name
    // val pageName = "a_c"
    val src = conf.getOrElse("src", "")             // source flag: PC or WAP
    val pageType = conf.getOrElse("pageType", "")   // forward or backward path
    val startDate = conf.getOrElse("startDate", "") // start date
    val endDate = conf.getOrElse("endDate", "")     // end date

    // Cache the log for repeated use.
    val log = hiveContext.sql(
      s"select fpid,sessionid,path " +
        s"from specter.t_pagename_path_sparksource " +
        s"where day between '$startDate' and '$endDate' and path_type=$pageType and src='$src'")
      .map(s => s.apply(0) + "_" + s.apply(1) + "_" + s.apply(2))
      .repartition(10)
      .persist()
    // Cache the page dictionary table.
    val pages = hiveContext.sql("select page_id,page_name from specter.code_pagename").persist()
    // Local test data:
    // val log = sc.parallelize(Seq("fpid1_sessionid1_a_b",
    //   "fpid2_sessionid2_a_c_d_d_b_a_d_a_f_b",
    //   "fpid1_sessionid1_a_f_a_c_d_a_b_a_v_a_n"))

    var root: Node = null

    /**
     * Recursively compute the metrics for a node and insert it into the tree.
     *
     * @param pageName page path accumulated so far
     */
    def compute(pageName: String): Unit = {
      val currentRegex = pageName.r // regular expression for the current page path
      val containsRdd = log.filter(_.contains(pageName)).persist() // records containing the path, reused below
      val currentPv = containsRdd.map(s => currentRegex.findAllIn(s)) // compute PV
        .map(_.mkString(","))
        .flatMap(_.toString.split(","))
        .filter(_.size > 0)
        .count()
      val tempRdd = containsRdd.map(_.split("_")).persist()          // split records
      val currentUv = tempRdd.map(_.apply(0)).distinct().count()     // page UV (distinct device fingerprints)
      val currentVisit = tempRdd.map(_.apply(1)).distinct().count()  // page visits (distinct sessions)

      // Initialize the root node or insert a new node.
      if (root == null) {
        root = new Node(pageName, pageName.hashCode, currentVisit, currentPv, currentUv,
          new util.ArrayList[Node]())
      } else {
        root.insertNode(new Node(pageName, pageName.hashCode, currentVisit, currentPv, currentUv,
          new util.ArrayList[Node]()))
      }

      if (pageName.split("_").size == 5 || tempRdd.isEmpty()) { // recursion exit
        return
      } else {
        // Regular expression for next-level page names (page IDs are UUIDs).
        val nextRegex = s"""${pageName}_[0-9a-z]{8}-[0-9a-z]{4}-[0-9a-z]{4}-[0-9a-z]{4}-[0-9a-z]{12}""".r
        // Local testing:
        // val nextRegex = s"""${pageName}_[a-z]""".r
        val nextPvMap = containsRdd.map(s => nextRegex.findAllIn(s)) // top 9 next-level paths by PV
          .map(_.mkString(","))
          .flatMap(_.toString.split(","))
          .filter(_.size > 0)
          .map(s => (s.split("_").last, 1))
          .filter(!_._1.contains(pageName.split("_")(0))) // skip paths looping back to the starting page
          .reduceByKey(_ + _)
          .sortBy(_._2, false)
          .take(9)
          .toMap
        nextPvMap.keySet.foreach(key => {
          compute(pageName + "_" + key) // recursive calculation
        })
      }
    }

    // Trigger the calculation.
    compute(pageName)
    val gson: Gson = new Gson()
    root.traverse(pages)
    root.name = pages.filter("page_id='" + pageName + "'").select("page_name").first().getString(0)
    // Convert to JSON and print; Alibaba's fastjson could not handle this structure, so Google's Gson is used.
    println(gson.toJson(root))
  }

  override def stop(): Unit = {
    sc.stop()
  }
}

object Path {
  def main(args: Array[String]): Unit = {
    // println("ss".hashCode)
    var num = 5
    try {
      num = args(5).toInt
    } catch {
      case e: Exception =>
    }
    val map = Map("pageName" -> args(0),
      "pageType" -> args(1),
      "startDate" -> args(2),
      "endDate" -> args(3),
      "src" -> args(4),
      "depth" -> num.toString)
    val path = new Path()
    path.handleData(map)
  }
}
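A note on running the code: the main method expects its arguments in the order pageName, pageType, startDate, endDate, src, with an optional depth as the sixth argument; also note that although depth is read from the configuration, the recursion exit in compute is hard-coded to five path levels. CleanDataWithRdd and SparkUtil are not part of Spark but appear to be the author's own helpers (a job base class and a factory for a YARN-backed SparkContext), so they need to be replaced with your own bootstrap code if you reuse this example.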
Summary

Spark basically solves the problem of computing behavior paths in real time. The drawback is that latency is somewhat high: after a job is submitted it still has to request resources from the cluster, and resource allocation and startup take close to 30 seconds, which can be optimized later. Spark JobServer reportedly provides a RESTful interface and pre-started job containers; the author has not had time to study it, but those interested can look into it.

For converting complex objects, fastjson did not work as well as Google's Gson.

Use recursion cautiously and pay special attention to the exit condition; if the exit condition is unclear, you can easily end up with infinite recursion.
