"Gandalf." Recommend system data completion using hive SQL implementation

Source: Internet
Author: User

DemandIn the recommended system scenario, if the underlying behavior data is too small, or too sparse, the recommended algorithm is likely to not reach the required number. For example, if you want to recommend 20 item for each item or user, but only 8 by calculation, the remaining 12 will need to be complete. Welcome reprint, please specify Source: http://blog.csdn.net/u010967382/article/details/39674047
StrategyThe specific strategies for data completion are:
    • completion time: After the mining calculation is finished, the mining results are imported into HBase (The final web system takes data from HBase) before, data completion, completion of data re-import hbase. (There is also an optional time, after receiving the request to achieve completion in the program, but such efficiency is certainly not directly from the HBase reading high, so space-time is a more reasonable strategy);
    • Implementation technology: The completion process is based on the hive implementation;
    • Complete data: The test process using the current browse item under the classification of the near period of time to browse the amount of topn;
    • Test Scenario: This article is only for "look and see" for data completion experiments , other recommended requirements are similar.
Experimental Process 1. First Debug SQL under OracleThe commissioning process involves two sheets: (1) TEST_TOPN:
Each row in the table represents the amount of time an item is visited on a given day.

(2)test_x_and_x:
Each row in the table represents the viewed item and its traffic for each item.Our goal is to complete the table, for each current_item to have 5 look and look at the item. For example, for item 10,001th, you need to get TOP2 from the IT classification to fill the table.
The following SQL has been successfully implemented in Oracle for this purpose:Select * from (select  row_number ()  over (partition&NBSP; by  Current_item_category , current_item_id order  by   source , view_count desc )   no , current_item_id , Current_item_category, andx_item_id, source, View_count from(Select current_item_id, current_item_category, andx_item_id, 1 source, view_count From test_x_and_xUnion Select a.current_item_id, A.current_item_category,b.item_id,2,B.view _count from (select current_item_id,current_item_category from test_x_and_x Group by current_item_id,current_item_category) a, test_topn b where a.current_item_category = b.item_category)) where no<=5
Note: Where the source column is used to identify whether the data is from the original table or TOPN, all TOPN table data is queued after the original table data. 2. Porting SQL statements from Oracle to HiveSuccessful migration of Hive SQL:SELECT * from(select Rank () over (partition by c.current_item_category,c.current_item_id ORDER BY c.source,c.view_count Desc) No, c.current_item_id, C.current_item_category, c.andx_item_id, C.source, C.view_count from(select current_item_id,current_item_category,andx_item_id,1 source,view_countFrom test_x_and_xUnion AllSelect a.current_item_id current_item_id,a.current_item_category current_item_category,b.item_id andx_item_id, 2 Source,b.view_count View_count from(select Current_item_id,current_item_category from test_x_and_xGROUP by Current_item_id,current_item_category) A, test_topn bwhere a.current_item_category = b.item_category) C) d where d.no <= 5;
The execution results are exactly the same as in Oracle:

Some pits are encountered during the transplant and are hereby recorded:
    1. Hive supports only union all and does not support union;
    2. UNION ALL of the two tables, not only the corresponding field data type, field names (can use column aliases) must also be identical;
    3. The result set for each nested subquery must use a table alias!

"Gandalf." Recommend system data completion using hive SQL implementation

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.