[R Language] Association Rules, Part 1: Ignoring the Order in Which Items Are Purchased

Source: Internet
Author: User

This article on association rules is divided into two parts. The first part does not take the order of a user's purchases into account: each user has a "shopping basket", and the association rules are mined from those baskets. The second part takes the strict order of purchases into account in order to analyze users' item purchase paths and mine association rules from them. This article is the first part. (The code and data set required for this article can be downloaded here.)

The most commonly cited example of association rules is "beer and diapers": people who buy beer often buy diapers too. The "users who bought this product also bought ..." recommendations seen on e-commerce websites are another application of association rule algorithms.

This article focuses on implementing association rules in R and on visualizing them. It does not explain the theory behind association rules; for that, see the Baidu Encyclopedia entry on association rules and the Wikipedia entries on the Apriori algorithm and on association rule learning.
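
For readers who want a quick reminder of the three measures used later (support, confidence and lift), here is a minimal base R sketch. The basket matrix and the helper `rule_measures` are made up for illustration and are not part of the original code.

```r
# Toy 0-1 basket matrix: rows are users, columns are items (hypothetical data).
baskets <- matrix(
  c(1, 1, 0,
    1, 1, 1,
    1, 0, 0,
    0, 1, 1,
    1, 1, 0),
  ncol = 3, byrow = TRUE,
  dimnames = list(paste0("user", 1:5), c("beer", "diapers", "milk"))
)

# Measures for the rule {lhs} => {rhs}, computed directly from the definitions.
rule_measures <- function(m, lhs, rhs) {
  has_lhs <- m[, lhs] == 1
  has_rhs <- m[, rhs] == 1
  support    <- mean(has_lhs & has_rhs)     # share of baskets containing both
  confidence <- support / mean(has_lhs)     # share of lhs baskets that also contain rhs
  lift       <- confidence / mean(has_rhs)  # confidence relative to the base rate of rhs
  c(support = support, confidence = confidence, lift = lift)
}

measures <- rule_measures(baskets, "beer", "diapers")
print(measures)
```

A lift above 1 suggests the two items appear together more often than they would if purchased independently; a lift below 1 suggests the opposite.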

Table of Contents
0. Create a data set for purchase records
1. Convert the purchase record to 0-1 matrix
2. Convert the 0-1 matrix to "transactions" form
3. Delete redundant rules
4. Association rule visualization

0. Create a data set for purchase records

The following creates a data set of 10,000 purchase records, one row per record. The columns are: user ID uid, item name pname, paid amount amount, and time of purchase time.

The style of the data is as follows:

# Draw 10,000 values with replacement from 10000000 to 10002000 to use as user IDs
uid <- sample(10000000:10002000, 10000, replace = TRUE)

# Restrict the purchase time to 2016-04-01 10:01:01 through 2016-04-08 10:01:01,
# convert it to Unix timestamps, and again draw 10,000 values
start_time <- as.numeric(as.POSIXct("2016/04/01 10:01:01", format = "%Y/%m/%d %H:%M:%S"))
end_time <- as.numeric(as.POSIXct("2016/04/08 10:01:01", format = "%Y/%m/%d %H:%M:%S"))
time <- sample(start_time:end_time, 10000, replace = TRUE)

# Combine the two into a data frame named orders
orders <- data.frame(uid, time)
head(orders)

# Use P1 to P20 as the names of the purchased items
pname_list <- c(1:20)
for (i in 1:20) {
  pname_list[i] <- paste('P', i, sep = "")
}

# Randomly assign the item names across the 10,000 rows
orders$pname <- 'P1'
for (i in 1:20) {
  orders[sample(1:nrow(orders), 1000, replace = TRUE), 'pname'] <- pname_list[i]
}
orders$pname <- as.factor(orders$pname)
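
The loop above can overwrite rows that an earlier iteration already assigned, so the item counts do not come out exactly even. If evenly distributed items are preferred, one alternative (not the author's method) is a single `sample` call. The seed and the name `n_rows` below are assumptions added only for illustration.

```r
set.seed(42)  # assumed seed, only so that the sketch is reproducible

n_rows <- 10000
pname_list <- paste0("P", 1:20)  # "P1", "P2", ..., "P20"

# Draw one item name per row, uniformly at random with replacement.
pname <- factor(sample(pname_list, n_rows, replace = TRUE), levels = pname_list)

# Quick check of how many rows each item received.
print(table(pname))
```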

# Randomly assign the paid amount (1 to 50) across the 10,000 rows
orders$amount <- 10
for (i in 1:50) {
  orders[sample(1:nrow(orders), 1000, replace = TRUE), 'amount'] <- i
}

# Inspect the data set to check that the simulated data looks reasonable
head(orders)
summary(orders)

# Write the data set to a local file
write.table(orders, 'orders_test.txt', sep = '\t', row.names = FALSE, col.names = TRUE)
1. Convert purchase record to 0-1 matrix

The above was only the preliminary step of creating the data set. The next step converts the purchase records into a 0-1 matrix, in which each row represents a user, each column represents an item, and a 1 indicates that the user purchased the item.
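
To make the target shape concrete, here is a small base R sketch on made-up purchase records. It uses `table` rather than the reshape2 route taken in this article and ignores the paid amount, so treat it as an illustration of the 0-1 matrix rather than as the article's method.

```r
# Hypothetical purchase records: one row per purchase.
toy <- data.frame(
  uid    = c(1, 1, 2, 3, 3, 3),
  pname  = c("P1", "P2", "P2", "P1", "P1", "P3"),
  amount = c(5, 10, 3, 7, 2, 8)
)

# Count purchases per (user, item), then turn every count above zero into 1.
counts <- unclass(table(toy$uid, toy$pname))
basket <- (counts > 0) * 1
print(basket)
```

Each row of `basket` is one user's shopping basket: user 3 bought P1 twice, but the matrix only records that P1 was bought.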

# Read the data set
payer <- read.table("orders_test.txt", sep = '\t', header = TRUE)
head(payer)
dim(payer)

# Convert to cast1: rows are user IDs, columns are items, values are amounts
payer2 <- payer[, c('uid', 'pname', 'amount')]
head(payer2)
library(reshape2)
melt1 <- melt(payer2, id = c("uid", "pname"))
head(melt1)
cast1 <- dcast(melt1, uid ~ pname, sum)

# Look at the shape of cast1. The items (column names) are sorted alphabetically, each row represents a user and each column an item. Each value is the total amount the user paid for that item. All column names with a non-zero value in a row make up that user's "shopping basket"; the basket carries no purchase order.
head(cast1)

# Convert cast1 into the 0-1 matrix cast2, where 1 means the user has purchased the item
cast2 <- matrix(0, ncol = ncol(cast1), nrow = nrow(cast1))
for (j in 1:ncol(cast1)) {
  cast2[cast1[, j] > 0, j] <- 1
}

# Note that the first column, originally the user ID, is now all 1s, and the column names are no longer the item names

# Replace the column names of the 0-1 matrix cast2 with the column names of cast1 (the item names) and restore the user IDs
colnames(cast2) <- names(cast1)
cast2 <- as.data.frame(cast2)
cast2$uid <- cast1$uid
head(cast2)

2. Convert the 0-1 matrix to "transactions" form

# Convert the 0-1 matrix to the "transactions" form used by the Apriori algorithm
cast3 <- cast2[, -1]  # the first column, uid, must be removed first

# At this point the conversion to transactions does not work yet
library(arules)
arules <- as(cast3, "transactions")

This fails with the error: Error in asMethod(object): column(s) ... not logical or a factor. Use as.factor, as.logical or categorize first.

According to this error, the columns must be converted to factor or logical type before the 0-1 matrix can be converted to transactions. Be sure to convert the 0-1 matrix to logical and not to factor. With factor columns, the value 0 is also treated as an item, which makes the Apriori algorithm consume a great deal of memory.

# When the columns were first converted to factor, the apriori function was very slow to return and memory use kept climbing until the machine finally went down. A test on a small data set showed that the factor level 0 was also being treated as an item. With many rows and columns this produces a huge number of meaningless itemsets and fills the memory. The correct way is as follows:

for (j in 1:ncol(cast3)) {
  cast3[, j] <- as.logical(cast3[, j])  # Reminder: convert the 0-1 matrix to logical, not factor
}
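
The difference can be checked on a tiny table. The sketch below assumes the arules package is installed and uses made-up data. It compares the item labels produced by factor columns with those produced by logical columns.

```r
library(arules)  # assumed to be installed

# Hypothetical 0-1 basket table for three users and two items.
toy <- data.frame(P1 = c(1, 0, 1), P2 = c(0, 0, 1))

# Factor columns: every level becomes an item, including the "not purchased" level 0.
toy_factor <- as.data.frame(lapply(toy, as.factor))
labels_factor <- itemLabels(as(toy_factor, "transactions"))

# Logical columns: only TRUE values become items, named after the column.
toy_logical <- as.data.frame(lapply(toy, as.logical))
labels_logical <- itemLabels(as(toy_logical, "transactions"))

print(labels_factor)
print(labels_logical)
```

With factor columns the item list should contain entries of the form column=level for both levels, which doubles the number of items here and grows quickly as columns are added.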

# Convert to the transactions form that the Apriori algorithm can use
library(arules)
arules <- as(cast3, "transactions")

# The itemsets in it are the shopping baskets of the individual users

# Run the Apriori algorithm with the support threshold set to 0.01 and the confidence threshold set to 0.5; adjust the thresholds to your needs
rules <- apriori(arules, parameter = list(support = 0.01, confidence = 0.5))
inspect(rules)

# The rules can be sorted by lift
sorted_lift <- sort(rules, by = 'lift')
inspect(sorted_lift)

3. Delete redundant rules

# Step three: 152 rules meet the support and confidence thresholds, and many of them are redundant. A redundant rule is defined as follows: if rule 2 contains all of the items in the LHS and RHS of rule 1 (rule 2 is a super rule of rule 1) and the lift of rule 2 is less than or equal to that of rule 1, then rule 2 is called a redundant rule of rule 1. The redundant rules are removed below.

# Build the subset matrix of all rules. Rows and columns are both the rules, and the values are TRUE or FALSE: the entry in row i, column j is TRUE when rule i is a subset of rule j.
# Note: recent arules versions return a sparse matrix from is.subset() by default; if the assignment below fails, call is.subset(rules, rules, sparse = FALSE).
# Note: the matrix follows the order of the rules, so applying the lift condition above requires rules that are sorted by lift first (for example sorted_lift).
subset.matrix <- is.subset(rules, rules)

# Set the diagonal and everything below it to NA, keeping only the upper triangle
subset.matrix[lower.tri(subset.matrix, diag = TRUE)] <- NA

# R treats TRUE as 1 when summing each column (missing values are ignored). A column sum of 1 or more means that the rule in that column contains an earlier rule, so it should be deleted.
redundant <- colSums(subset.matrix, na.rm = TRUE) >= 1

# Remove the redundant rules
rules.pruned <- rules[!redundant]

# The original 152 rules are reduced to 4 rules

inspect(rules.pruned)
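
The upper-triangle trick is easier to follow on a hand-written matrix. The sketch below uses plain base R and a made-up subset matrix for three rules that are assumed to be listed best-first; it illustrates the bookkeeping only and does not call arules.

```r
# Hypothetical subset matrix for three rules, assumed to be ordered best-first:
# is_sub[i, j] is TRUE when rule i is contained in rule j.
is_sub <- matrix(
  c(TRUE,  TRUE,  FALSE,
    FALSE, TRUE,  FALSE,
    FALSE, FALSE, TRUE),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("rule", 1:3), paste0("rule", 1:3))
)

# Keep only the strict upper triangle, so that each rule is compared
# only with the rules listed before it (and never with itself).
is_sub[lower.tri(is_sub, diag = TRUE)] <- NA

# A column with at least one TRUE contains an earlier rule: flag it as redundant.
redundant <- colSums(is_sub, na.rm = TRUE) >= 1
kept <- names(redundant)[!redundant]
print(kept)
```

Here rule1 is contained in rule2 and comes first in the list, so rule2 is flagged and only rule1 and rule3 are kept.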

# Write the pruned rules to a local file

write(rules.pruned, "rules_pruned.txt", col.names = NA)
4. Association rule visualization

Calling plot directly draws a scatter plot (the plot methods for rules come from the arulesViz package):

library(arulesViz)
plot(rules)

The darker the color of a point, the greater its lift. The plot shows that the rules with high lift are concentrated at low support. Some also hold that the most interesting rules lie on the support/confidence border.

Setting interactive = TRUE makes the scatter plot interactive, so that you can select points and inspect the rules behind them:

plot(rules, interactive = TRUE)

There is also a presentation similar to a bubble chart: the color of a circle shows the lift and the size of a circle shows the support. The column labels show the number of LHS itemsets in the group and the most important (most frequent) items in it. Lift decreases gradually from the upper left corner to the lower right corner.

plot(rules, method = "grouped")

Association rules can also be drawn as a graph of arrows and circles, in which the vertices represent itemsets and the edges represent the relationships in the rules. The larger the circle, the greater the support; the darker the color, the greater the lift. With many rules the graph becomes cluttered and patterns are hard to find, so this kind of plot is usually used for only a few rules. The following visualizes the 10 rules with the highest lift.

plot(sorted_lift[1:10], method = "graph")

That covers implementing association rules in R and visualizing them. This article did not consider the order in which users bought items and mined the association rules from each user's "shopping basket" instead. The next article will take the strict order of purchases into account to analyze users' item purchase paths and mine association rules from them. The code and data set required for this article can be downloaded here.
