R-language implementation of fixed group aggregation

Source: Internet
Author: User

A group-and-summarize operation in which the group names and the number of groups are known in advance is called fixed group aggregation. The grouping basis comes from outside the dataset, for example a customer list passed in as a parameter, or a list of conditions. Such computations raise several questions: whether the grouping basis contains members that are not in the dataset, whether a catch-all group for leftover ("redundant") records is required, and whether the groups overlap. This article describes how to implement fixed group aggregation in R.

Example 1: The grouping basis does not exceed the dataset.

The data frame sales holds the order records, where the client column is the customer name and the amount column is the order amount. Group sales by the "potential customer list" and sum the amount column of each group. The potential customer list is [ARO, BON, CHO], which is a subset of the client column.

Note: sales can be loaded from a database or a file, for example, sales <- read.table("sales.txt", sep = "\t", header = TRUE). The first few rows of data are as follows:

[Image: r_fiexed_group_1.jpg, http://s3.51cto.com/wyfs02/M01/49/E5/wKioL1QeQCayHAD_AAE1pbxW6W0524.jpg]

Code:

byfac <- factor(sales$client, levels = c("ARO", "BON", "CHO"))

result <- aggregate(sales$amount, list(byfac), sum)

Calculation Result:

[Image: r_fiexed_group_2.jpg, http://s3.51cto.com/wyfs02/M02/49/E4/wKiom1QeQBvgIo_6AAB52TP-yv4811.jpg]

Code explanation:

1. The function factor generates the grouping basis (called a factor in R). The function aggregate then groups and summarizes by that basis. The overall code structure is very clear.

2. Note that the grouping basis is a factor, not a plain vector or array, so it cannot simply be written as byfac <- c("ARO", "BON", "CHO"). The factor cannot be passed to aggregate directly either; it must first be wrapped in a list. These points are hard for beginners to understand.

3. If the client column itself is used as the grouping basis (non-fixed grouping), one line of code is enough:

result <- aggregate(sales$amount, list(sales$client), sum)

Summary: This case can be easily implemented using aggregate.
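As a self-contained illustration of the steps above, the following sketch builds a small sales data frame and applies factor and aggregate to it. The rows, the extra client "DYD", and the amounts are invented for illustration and are not the article's data.

```r
# Hypothetical sample data; the real sales table would come from a file or database.
sales <- data.frame(
  client = c("ARO", "BON", "CHO", "ARO", "DYD", "BON"),
  amount = c(1200, 500, 3000, 800, 4500, 2500),
  stringsAsFactors = FALSE
)

# Fixed grouping basis: clients outside these levels become NA in the factor.
byfac <- factor(sales$client, levels = c("ARO", "BON", "CHO"))

# aggregate() expects the grouping variable(s) wrapped in a list;
# naming the list element gives the result a readable column name.
result <- aggregate(sales$amount, list(client = byfac), sum)
print(result)
```

Rows whose client is not in the list (here "DYD") get an NA group and are dropped by aggregate, so only the three listed customers appear in the result.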


Example 2: The grouping basis exceeds the dataset.

A grouping basis whose members all appear in the column data is a special case. Because the grouping basis comes from outside the dataset (for example, from external parameters), some of its members may not appear in the column data at all. This example addresses that situation.

Assume that the value of the "potential customer list" is [ARO, BON, CHO, ztoz]. Divide sales into four groups based on the "potential customer list" and sum the amount column of each group. Note that ztoz is not in the client column.

Code similar to Example 1:

byfac <- factor(sales$client, levels = c("ARO", "BON", "CHO", "ztoz"))

result <- aggregate(sales$amount, list(byfac), sum)

The calculation result of the above code is:

[Image: r_fiexed_group_3.jpg, http://s3.51cto.com/wyfs02/M00/49/E5/wKioL1QeQFnzHbbrAAB6JQTbLcc410.jpg]

The result contains only three groups instead of the four required: ztoz is missing. The code above therefore does not solve this case and needs to be improved.

Improved code:

byfac <- factor(sales$client, levels = c("ARO", "BON", "CHO", "ztoz"))

tapply(sales$amount, list(byfac), function(x) sum(x))

Calculation Result:

[Image: r_fiexed_group_4.jpg, http://s3.51cto.com/wyfs02/M00/49/E4/wKiom1QeQEeS8pXtAABXac5vf-Q013.jpg]

Code explanation:

1. The improved code matches the business logic better: all four groups appear in the result.

2. The code uses tapply for grouping and summarizing. This function is more general than aggregate, but its name is less intuitive, and most beginners find it confusing.

3. The summary value for ztoz is NA, which means ztoz does not appear in the client column at all. A sum of 0 would instead mean that ztoz appears in the client column but its order amounts total 0.

4. The result contains only the four listed groups; customers outside the list do not appear. These leftover customers can be called the "redundant group". The current code cannot simply be tweaked to compute the redundant group's total; a different function is needed:

filtered <- sales[!is.element(sales$client, byfac), ]

redundant <- sum(filtered$amount)

This code is not complex, but its approach clearly differs from the preceding code.

Summary: This case can be easily implemented using tapply.
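To make the difference concrete, here is a minimal sketch on invented data (the rows and the extra client "DYD" are hypothetical). It shows tapply keeping the empty ztoz group and then appends the redundant total to the same result. It uses %in% in place of is.element; the two are equivalent here.

```r
# Hypothetical sample data (invented for illustration).
sales <- data.frame(
  client = c("ARO", "BON", "CHO", "ARO", "DYD", "BON"),
  amount = c(1200, 500, 3000, 800, 4500, 2500),
  stringsAsFactors = FALSE
)
wanted <- c("ARO", "BON", "CHO", "ztoz")  # "ztoz" has no orders
byfac <- factor(sales$client, levels = wanted)

# tapply() returns one cell per factor level, including empty ones (NA).
totals <- tapply(sales$amount, byfac, sum)

# Orders whose client is outside the list form the "redundant" group.
redundant <- sum(sales$amount[!(sales$client %in% wanted)])

# Append the redundant total so everything sits in one named vector.
all_totals <- c(totals, redundant = redundant)
print(all_totals)
```

Keeping the redundant total in the same named vector is only one possible design; it avoids handling two separate results downstream.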


Example 3: The grouping conditions do not overlap.

A list of conditions can also serve as the grouping basis, which is another kind of fixed grouping. For example, divide the order amounts into four intervals at the boundaries 1000, 2000, and 4000, so that each interval holds one group of orders, and then sum the order amount of each group.

Code

# Right-closed intervals: (0,1000], (1000,2000], (2000,4000], (4000,Inf]
byfac <- cut(sales$amount, breaks = c(0, 1000, 2000, 4000, Inf))

result <- tapply(sales$amount, list(byfac), function(x) sum(x))

Calculation Result:

[Image: r_fiexed_group_5.jpg, http://s3.51cto.com/wyfs02/M01/49/E5/wKioL1QeQH_xekSaAACBbFCSGtk034.jpg]

Code explanation: The function cut assigns each order amount to one of four intervals. The function tapply then groups the amounts by interval and sums each group.

Summary: cut and tapply can be used together to easily implement the simplest conditional grouping.
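The following sketch runs the same idea on a handful of invented amounts. The data, the lower bound of 0, and the interval labels are assumptions made for illustration; the labels argument of cut is used only to make the group names easier to read.

```r
# Hypothetical order amounts (invented for illustration).
sales <- data.frame(amount = c(500, 1000, 1500, 2500, 4000, 6000))

# Intervals are right-closed by default, so 1000 falls in the first group
# and 4000 in the third. The lower bound of 0 is an assumption.
byfac <- cut(
  sales$amount,
  breaks = c(0, 1000, 2000, 4000, Inf),
  labels = c("0-1000", "1000-2000", "2000-4000", "over 4000")
)

# Sum the amounts within each interval.
result <- tapply(sales$amount, byfac, sum)
print(result)
```

Because cut returns a factor whose levels are the intervals, an interval with no orders would still appear in the tapply result, with the value NA.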


Example 4: The grouping conditions overlap and the results are repeated.

In the simplest conditional grouping the conditions do not overlap, but in practice overlap is very common. For example, group the order amounts by the following rules:

1000 to 4000: regular orders, R14

Less than 2000: non-key orders, R2

More than 3000: key orders, R3

The regular orders overlap with the other two groups, so it must be decided whether overlapping data is counted repeatedly. In this example it is counted in every group it matches.

Code:

R14 <- subset(sales, amount >= 1000 & amount <= 4000)

R2 <- subset(sales, amount < 2000)

R3 <- subset(sales, amount > 3000)

grouped <- list(R14 = R14, R2 = R2, R3 = R3)

result <- lapply(grouped, FUN = function(x) sum(x$amount))

Calculation Result:

[Image: r_fiexed_group_6.jpg, http://s3.51cto.com/wyfs02/M01/49/E4/wKiom1QeQG7xWd0fAACCELBtQ_w853.jpg]

Note: R2 and R3 each contain part of the R14 data.

Code explanation:

1. The code above solves this case, but it is cumbersome. The more complex the conditions, the longer the code becomes.

2. A new function, lapply, is used here. So far, implementing fixed grouping has required many functions: factor, aggregate, list, tapply, cut, subset, and lapply. Moreover, simply because the conditions overlap, conditional grouping has to be implemented with different functions and a different approach. Mastering all of these usages is quite difficult.

3. The result of the code above is a list, whereas earlier examples returned a data.frame or an array. These inconsistencies can cause problems in practice.

Summary: This case can be implemented, but the code is complicated and requires many functions.
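One way to keep the code from growing with the number of conditions is to store each condition as a named function and apply them all in one pass. This is only a sketch of an alternative, not the article's method; the data and the name conditions are invented for illustration.

```r
# Hypothetical order amounts (invented for illustration).
sales <- data.frame(amount = c(500, 1500, 2500, 3500, 5000))

# Each group is a named predicate on the amount column.
conditions <- list(
  R14 = function(a) a >= 1000 & a <= 4000,  # regular orders
  R2  = function(a) a < 2000,               # non-key orders
  R3  = function(a) a > 3000                # key orders
)

# Evaluate every condition independently, so an overlapping row is
# counted in each group it satisfies.
result <- sapply(conditions, function(cond) sum(sales$amount[cond(sales$amount)]))
print(result)
```

Here sapply returns a named numeric vector rather than a list, which is closer in shape to the tapply results of the earlier examples.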


Example 5: The grouping conditions overlap and the results are not repeated.

The previous example counts overlapping data repeatedly, but sometimes the results must not be repeated; that is, data already placed in an earlier group must not appear in a later one. Specifically, R2 should not contain data in R14, and R3 should not contain data in R2 or R14.

Code:

R14 <- subset(sales, amount >= 1000 & amount <= 4000)

R2 <- subset(sales, amount < 2000 & !(amount >= 1000 & amount <= 4000))

R3 <- subset(sales, amount > 3000 & !(amount >= 1000 & amount <= 4000) & !(amount < 2000))

grouped <- list(R14 = R14, R2 = R2, R3 = R3)

result <- lapply(grouped, FUN = function(x) sum(x$amount))

Calculation Result:

[Image: r_fiexed_group_7.jpg, http://s3.51cto.com/wyfs02/M02/49/E6/wKioL1QeQJvB6RtmAACEFor1cGY719.jpg]

Note: Because the data is not counted repeatedly, the values of R2 and R3 are smaller than those calculated in Example 4.

Code explanation: To avoid repeated counting, the code above adds more logical conditions, which further increases its complexity. When there are many groups and the grouping conditions are complex, the amount of code to write becomes quite large.

Summary: This case can be implemented, but the code is more complex.
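The repeated negations can be avoided by remembering which rows have already been assigned. The sketch below is one possible approach, not the article's code; the data and the helper names conditions and taken are invented for illustration.

```r
# Hypothetical order amounts (invented for illustration).
sales <- data.frame(amount = c(500, 1500, 2500, 3500, 5000))

# Conditions listed in priority order: earlier groups claim rows first.
conditions <- list(
  R14 = function(a) a >= 1000 & a <= 4000,
  R2  = function(a) a < 2000,
  R3  = function(a) a > 3000
)

# A row is assigned to the first group it matches and is then
# excluded from all later groups.
taken  <- rep(FALSE, nrow(sales))
result <- numeric(0)
for (name in names(conditions)) {
  hit <- conditions[[name]](sales$amount) & !taken
  result[name] <- sum(sales$amount[hit])
  taken <- taken | hit
}
print(result)
```

With this layout, adding a group means adding one entry to the list; the exclusion logic does not need to be rewritten for each new condition.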


Third-party solutions

The preceding examples can also be implemented in Python, esProc, Perl, and other languages. Like R, these languages can perform fixed group aggregation and structured data computing. The following describes the esProc (the "set calculator") solution.

Example 1:

byfac = ["ARO", "BON", "CHO"]

grouped = sales.align@a(byfac, client)

grouped.new(byfac(#), ~.sum(amount))

Calculation Result:

[Image: r_fiexed_group_8.jpg, http://s3.51cto.com/wyfs02/M02/49/E4/wKiom1QeQIugM-gvAABeCTHhQr4947.jpg]

Example 2:

The code is exactly the same as that in Example 1, which is omitted here.

Calculation Result:

[Image: r_fiexed_group_9.jpg, http://s3.51cto.com/wyfs02/M00/49/E6/wKioL1QeQLeSi9rKAABVDjx7OGU596.jpg]

If you want to calculate the summary value of the redundant group, you only need to slightly modify the code:

byfac = ["ARO", "BON", "CHO", "ztoz"]

grouped = sales.align@an(byfac, client)

grouped.new((byfac|"redundant")(#), ~.sum(amount))

The changes are the @n option and the appended "redundant" group name. @n indicates that an additional group is added to the result set. This way of writing is arguably easier to grasp than the R code.

Calculation Result:

[Image: r_fiexed_group_10.jpg, http://s3.51cto.com/wyfs02/M00/49/E4/wKiom1QeQKixe_yhAAB2XUNFEng573.jpg]


Example 3:

For simple conditional grouping, esProc only needs the previous align function replaced with enum; the rest remains unchanged.

byfac = ["?<=1000", "?>1000 && ?<=2000", "?>2000 && ?<=4000", "?>4000"]

grouped = sales.enum(byfac, amount)

grouped.new(byfac(#), ~.sum(amount))

Calculation Result:

[Image: r_fiexed_group_11.jpg, http://s3.51cto.com/wyfs02/M01/49/E6/wKioL1QeQNGgPo_7AAB4bpG7Eao564.jpg]


Example 4:

To count overlapping data repeatedly, just add the @r option to the previous code.

byfac = ["?>=1000 && ?<=4000", "?<2000", "?>3000"]

grouped = sales.enum@r(byfac, amount)

grouped.new(byfac(#), ~.sum(amount))

Calculation Result:

[Image: r_fiexed_group_12.jpg, http://s3.51cto.com/wyfs02/M01/49/E4/wKiom1QeQL_hESuRAABcwHstrSo435.jpg]

Example 5:

When overlapping data should not be counted repeatedly, remove the @r option; the code is then exactly the same as for simple conditional grouping.

byfac = ["?>=1000 && ?<=4000", "?<2000", "?>3000"]

grouped = sales.enum(byfac, amount)

grouped.new(byfac(#), ~.sum(amount))

Calculation Result:

[Image: r_fiexed_group_13.jpg, http://s3.51cto.com/wyfs02/M02/49/E6/wKioL1QeQOqxk6plAABh0SlbGPI369.jpg]

As you can see, esProc needs only the align and enum functions to implement every type of fixed group aggregation. The code structure is consistent and the solution is simple.


This article is from the "high performance report data computing" blog, please be sure to keep this source http://esproc.blog.51cto.com/8028595/1558052
