DemandIn the recommended system scenario, if the underlying behavior data is too small, or too sparse, the recommended algorithm is likely to not reach the required number. For example, if you want to recommend 20 item for each item or user, but only 8 by calculation, the remaining 12 will need to be complete. Welcome reprint, please specify Source: http://blog.csdn.net/u010967382/article/details/39674047
StrategyThe specific strategies for data completion are:
- completion time: After the mining calculation is finished, the mining results are imported into HBase (The final web system takes data from HBase) before, data completion, completion of data re-import hbase. (There is also an optional time, after receiving the request to achieve completion in the program, but such efficiency is certainly not directly from the HBase reading high, so space-time is a more reasonable strategy);
- Implementation technology: The completion process is based on the hive implementation;
- Complete data: The test process using the current browse item under the classification of the near period of time to browse the amount of topn;
- Test Scenario: This article is only for "look and see" for data completion experiments , other recommended requirements are similar.
Experimental Process
1. First Debug SQL under OracleThe commissioning process involves two sheets: (1) TEST_TOPN:
Each row in the table represents the amount of time an item is visited on a given day.
(2)test_x_and_x:
Each row in the table represents the viewed item and its traffic for each item.Our goal is to complete the table, for each current_item to have 5 look and look at the item. For example, for item 10,001th, you need to get TOP2 from the IT classification to fill the table.
The following SQL has been successfully implemented in Oracle for this purpose:Select * from (select row_number () over (partition&NBSP; by Current_item_category , current_item_id order by source , view_count desc ) no , current_item_id , Current_item_category, andx_item_id, source, View_count from(Select current_item_id, current_item_category, andx_item_id, 1 source, view_count From test_x_and_xUnion Select a.current_item_id, A.current_item_category,b.item_id,2,B.view _count from (select current_item_id,current_item_category from test_x_and_x Group by current_item_id,current_item_category) a, test_topn b where a.current_item_category = b.item_category)) where no<=5
Note: Where the source column is used to identify whether the data is from the original table or TOPN, all TOPN table data is queued after the original table data.
2. Porting SQL statements from Oracle to HiveSuccessful migration of Hive SQL:SELECT * from(select Rank () over (partition by c.current_item_category,c.current_item_id ORDER BY c.source,c.view_count Desc) No, c.current_item_id, C.current_item_category, c.andx_item_id, C.source, C.view_count from(select current_item_id,current_item_category,andx_item_id,1 source,view_countFrom test_x_and_xUnion AllSelect a.current_item_id current_item_id,a.current_item_category current_item_category,b.item_id andx_item_id, 2 Source,b.view_count View_count from(select Current_item_id,current_item_category from test_x_and_xGROUP by Current_item_id,current_item_category) A, test_topn bwhere a.current_item_category = b.item_category) C) d where d.no <= 5;
The execution results are exactly the same as in Oracle:
Some pits are encountered during the transplant and are hereby recorded:
- Hive supports only union all and does not support union;
- UNION ALL of the two tables, not only the corresponding field data type, field names (can use column aliases) must also be identical;
- The result set for each nested subquery must use a table alias!
"Gandalf." Recommend system data completion using hive SQL implementation