Analysis of user behavior paths based on Spark

Research background

The Internet industry pays more and more attention to its customers' behavior preferences. Whether in e-commerce or finance, a great deal can be built on user behavior: an e-commerce company can summarize user preferences to recommend goods, and a financial company can use behavior as a signal for anti-fraud. This article introduces one important function: counting user behavior paths from behavior logs to give operations staff better support for decision making, similar to the user path analysis in mature products such as Adobe Analytics, with the final result displayed in an open-source big data visualization tool.

Behavior path data is very large, and the UV metric cannot be pre-computed because the time window is not known in advance: if five levels are displayed, a single starting page can expand into on the order of 10^5 paths, and with thousands of pages the total volume is far too large to pre-aggregate, so the paths can only be computed in real time. Spark is well suited to this kind of iterative computation, which makes it a natural choice.

Solution Process Description

When a user queries the behavior path details for a starting page, the front end sends an RPC request to the backend, the Spark program computes the result in real time, and the Java side parses the returned data and displays it.

Preparatory work

1. First, you need behavior data. The user behavior log must contain at least the following four fields: access time, device fingerprint, session ID, and page name. The page name can be self-defined and is used to identify a page or a class of pages; page names must not be duplicated.

2. Then perform a first-level cleaning of the behavior log (in Hive) to turn the data into the following format:

Device fingerprint | Session ID | Page path (ascending by time) | Time
fpid1              | sessionid1 | a_b_c_d_e_f_g                 | 2017-01-13

Here a, b, c, and so on are page names. The cleaning uses Hive's row_number and concat_ws functions, whose usage is easy to look up. The cleaned data lands in a Hive table that is used in the following steps, and the cleaning job runs on a T+1 schedule; a sketch of such a cleaning query is given after this list.

3. Make sure you are clear about how recursion works, since the computation below relies on it.
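The article does not include the cleaning job itself, so the following is only a minimal sketch of what it could look like, assuming a hypothetical raw log table specter.t_behavior_log(fpid, sessionid, page_id, access_time, day); the real target table used later (specter.t_pagename_path_sparksource) also carries src and path_type columns, which are omitted here for brevity. Because Hive's collect_list does not guarantee ordering, each page is prefixed with its zero-padded row_number, the collected array is sorted, and the prefix is stripped again.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Hypothetical sketch of the T+1 first-level cleaning described in step 2.
// Assumed raw log table: specter.t_behavior_log(fpid, sessionid, page_id, access_time, day).
object CleanPathSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("clean-path-sketch"))
    val hiveContext = new HiveContext(sc)
    val day = args(0) // partition to clean, e.g. "2017-01-13"

    // row_number orders the pages inside each session; because collect_list does not
    // guarantee order, each page is prefixed with its zero-padded row number, the
    // collected array is sorted, and the prefix is stripped again with regexp_replace.
    val cleaned = hiveContext.sql(
      s"""
         |SELECT fpid,
         |       sessionid,
         |       regexp_replace(
         |         concat_ws('_', sort_array(collect_list(
         |           concat(lpad(cast(rn AS string), 5, '0'), ':', page_id)))),
         |         '[0-9]{5}:', '') AS path,
         |       max(day)           AS time
         |FROM (
         |  SELECT fpid, sessionid, page_id, day,
         |         row_number() OVER (PARTITION BY fpid, sessionid
         |                            ORDER BY access_time ASC) AS rn
         |  FROM specter.t_behavior_log
         |  WHERE day = '$day'
         |) t
         |GROUP BY fpid, sessionid
       """.stripMargin)

    // In production this result would be written back to the Hive path table
    // (specter.t_pagename_path_sparksource), which also carries src and path_type columns.
    cleaned.show(10)
    sc.stop()
  }
}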

Spark processing

Process Overview:

1. Build a multi-way tree class. Its main attributes are name, the full page path such as a_b_c, and childList, the list of child nodes; the construction and recursive traversal of the tree are shown in the attached code.

2. Read the data produced in the previous step for the requested time range, recursively compute the metrics for each level of pages, and insert each node into the initialized root Node according to its page path.

3. Recursively traverse the root node built in the previous step and replace each node's page ID with the corresponding page name, looking the names up through a Spark DataFrame.

4. Convert the root object to JSON and return it to the front end.
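The attached program is fairly dense, so here is a stripped-down, hypothetical sketch (with no Spark involved and made-up metric values) of the idea behind steps 1 and 4: each node is keyed by its full path, a node is inserted under the parent whose path equals its own path minus the last segment, and Gson serializes the finished tree into the JSON returned to the front end.

import java.util
import com.google.gson.Gson

// Simplified tree node: full path (e.g. "a_b_c") plus metrics and children.
class PathNode(var name: String, var pv: Long, var uv: Long,
               val childList: util.ArrayList[PathNode] = new util.ArrayList[PathNode]()) {

  // Insert a node under the parent whose path equals the node's path minus its last segment.
  def insertNode(node: PathNode): Boolean = {
    val parentPath = node.name.stripSuffix("_" + node.name.split("_").last)
    if (parentPath == name) {
      childList.add(node)
      true
    } else {
      (0 until childList.size()).exists(i => childList.get(i).insertNode(node))
    }
  }
}

object PathTreeDemo {
  def main(args: Array[String]): Unit = {
    val root = new PathNode("a", pv = 100, uv = 40)   // illustrative numbers only
    root.insertNode(new PathNode("a_b", pv = 60, uv = 25))
    root.insertNode(new PathNode("a_c", pv = 30, uv = 12))
    root.insertNode(new PathNode("a_b_d", pv = 10, uv = 6))
    // Step 4: serialize the whole tree to JSON for the front end.
    println(new Gson().toJson(root))
  }
}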

The attached code is as follows.

import java.util

import com.google.gson.Gson
import org.apache.log4j.{Level, Logger => LG}
import org.apache.spark.SparkContext
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.hive.HiveContext

/**
 * User behavior path real-time calculation.
 * Created by Chouyarn on 2016/12/12.
 */

/**
 * Tree structure class.
 *
 * @param name      full page path, e.g. a_b_c
 * @param path      path identifier
 * @param visit     visit count
 * @param pv        PV
 * @param uv        UV
 * @param childList list of child nodes
 */
class Node(var name: String,
           var path: Any,
           var visit: Any,
           var pv: Any,
           var uv: Any,
           var childList: util.ArrayList[Node]) extends Serializable {

  /**
   * Add a child node.
   */
  def addNode(node: Node) = {
    childList.add(node)
  }

  /**
   * Traverse the tree depth first, trimming each node's name to its last path segment.
   */
  def traverse(): Unit = {
    if (childList.isEmpty) return
    val childNum = childList.size
    for (i <- 0 until childNum) {
      val child: Node = childList.get(i)
      child.name = child.name.split("_").last // drop the leading absolute path
      child.traverse()
    }
  }

  /**
   * Traverse the tree depth first, replacing each node's page ID with its page name
   * looked up from the page dictionary DataFrame.
   */
  def traverse(pages: DataFrame): Unit = {
    if (childList.isEmpty || childList.size() == 0) return
    val childNum = childList.size
    for (i <- 0 until childNum) {
      val child: Node = childList.get(i)
      child.name = child.name.split("_").last
      val id = pages.filter("page_id='" + child.name + "'")
        .select("page_name").first().getString(0)
      child.name = id // replace the ID with the page name
      child.traverse(pages)
    }
  }

  /**
   * Dynamically insert a node under the parent whose path is the node's path minus its last segment.
   */
  def insertNode(node: Node): Boolean = {
    val insertName = node.name
    if (insertName.stripSuffix("_" + insertName.split("_").last).equals(name)) {
      addNode(node)
      true
    } else {
      val childNum = childList.size
      for (i <- 0 until childNum) {
        if (childList.get(i).insertNode(node)) return true
      }
      false
    }
  }
}

/**
 * Processing class. CleanDataWithRdd and SparkUtil are the author's own utility
 * classes (a job base class and a factory for a SparkContext on YARN).
 */
class Path extends CleanDataWithRdd {
  LG.getRootLogger.setLevel(Level.ERROR) // control Spark log output level
  val sc: SparkContext = SparkUtil.createSparkContextYarn("path")
  val hiveContext = new HiveContext(sc)

  override def handleData(conf: Map[String, String]): Unit = {
    val num = conf.getOrElse("depth", 5)            // path depth
    val pageName = conf.getOrElse("pageName", "")   // starting page name
    // val pageName = "a_c"
    val src = conf.getOrElse("src", "")             // source flag: PC or WAP
    val pageType = conf.getOrElse("pageType", "")   // forward or backward path
    val startDate = conf.getOrElse("startDate", "") // start date
    val endDate = conf.getOrElse("endDate", "")     // end date

    // Cache the log for repeated use.
    val log = hiveContext.sql(
      s"select fpid,sessionid,path " +
        s"from specter.t_pagename_path_sparksource " +
        s"where day between '$startDate' and '$endDate' and path_type=$pageType and src='$src'")
      .map(s => s.apply(0) + "_" + s.apply(1) + "_" + s.apply(2))
      .repartition(10)
      .persist()
    // Cache the page dictionary table.
    val pages = hiveContext.sql("select page_id,page_name from specter.code_pagename").persist()
    // Local test data:
    // val log = sc.parallelize(Seq("fpid1_sessionid1_a_b",
    //   "fpid2_sessionid2_a_c_d_d_b_a_d_a_f_b",
    //   "fpid1_sessionid1_a_f_a_c_d_a_b_a_v_a_n"))

    var root: Node = null

    /**
     * Recursively compute the metrics for a node and insert it into the tree.
     *
     * @param pageName page path accumulated so far
     */
    def compute(pageName: String): Unit = {
      val currentRegex = pageName.r // regular expression for the current page path
      val containsRdd = log.filter(_.contains(pageName)).persist() // records containing the path, reused below
      val currentPv = containsRdd.map(s => currentRegex.findAllIn(s)) // compute PV
        .map(_.mkString(","))
        .flatMap(_.toString.split(","))
        .filter(_.size > 0)
        .count()
      val tempRdd = containsRdd.map(_.split("_")).persist()          // split records
      val currentUv = tempRdd.map(_.apply(0)).distinct().count()     // page UV (distinct device fingerprints)
      val currentVisit = tempRdd.map(_.apply(1)).distinct().count()  // page visits (distinct sessions)

      // Initialize the root node or insert a new node.
      if (root == null) {
        root = new Node(pageName, pageName.hashCode, currentVisit, currentPv, currentUv,
          new util.ArrayList[Node]())
      } else {
        root.insertNode(new Node(pageName, pageName.hashCode, currentVisit, currentPv, currentUv,
          new util.ArrayList[Node]()))
      }

      if (pageName.split("_").size == 5 || tempRdd.isEmpty()) { // recursion exit
        return
      } else {
        // Regular expression for next-level page names (page IDs are UUIDs).
        val nextRegex = s"""${pageName}_[0-9a-z]{8}-[0-9a-z]{4}-[0-9a-z]{4}-[0-9a-z]{4}-[0-9a-z]{12}""".r
        // Local testing:
        // val nextRegex = s"""${pageName}_[a-z]""".r
        val nextPvMap = containsRdd.map(s => nextRegex.findAllIn(s)) // top 9 next-level paths by PV
          .map(_.mkString(","))
          .flatMap(_.toString.split(","))
          .filter(_.size > 0)
          .map(s => (s.split("_").last, 1))
          .filter(!_._1.contains(pageName.split("_")(0))) // skip paths looping back to the starting page
          .reduceByKey(_ + _)
          .sortBy(_._2, false)
          .take(9)
          .toMap
        nextPvMap.keySet.foreach(key => {
          compute(pageName + "_" + key) // recursive calculation
        })
      }
    }

    // Trigger the calculation.
    compute(pageName)
    val gson: Gson = new Gson()
    root.traverse(pages)
    root.name = pages.filter("page_id='" + pageName + "'").select("page_name").first().getString(0)
    // Convert to JSON and print; Alibaba's fastjson could not handle this structure, so Google's Gson is used.
    println(gson.toJson(root))
  }

  override def stop(): Unit = {
    sc.stop()
  }
}

object Path {
  def main(args: Array[String]): Unit = {
    // println("ss".hashCode)
    var num = 5
    try {
      num = args(5).toInt
    } catch {
      case e: Exception =>
    }
    val map = Map("pageName" -> args(0),
      "pageType" -> args(1),
      "startDate" -> args(2),
      "endDate" -> args(3),
      "src" -> args(4),
      "depth" -> num.toString)
    val path = new Path()
    path.handleData(map)
  }
}
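A note on running the code: the main method expects its arguments in the order pageName, pageType, startDate, endDate, src, with an optional depth as the sixth argument; also note that although depth is read from the configuration, the recursion exit in compute is hard-coded to five path levels. CleanDataWithRdd and SparkUtil are not part of Spark but appear to be the author's own helpers (a job base class and a factory for a YARN-backed SparkContext), so they need to be replaced with your own bootstrap code if you reuse this example.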
Summary

Spark basically solves the problem of computing behavior paths in real time. The drawback is that latency is somewhat high: after a job is submitted it still has to request resources from the cluster, and resource allocation and startup take close to 30 seconds, which can be optimized later. Spark JobServer reportedly provides a RESTful interface and pre-started job containers; the author has not had time to study it, but those interested can look into it.

For converting complex objects, fastjson did not work as well as Google's Gson.

Use recursion cautiously and pay special attention to the exit condition; if the exit condition is unclear, you can easily end up with infinite recursion.
