Business background
The schedule.sh script schedules the execution of the user-track (user trajectory) engineering scripts. An excerpt of the code follows:
#!/bin/bash
source /etc/profile;
export userTrackPathCollectHome=/home/pms/bigdataengine/analysis/script/usertrack/master/pathcollect

###############################
# Process A
###############################
# Verify that the relevant product data source exists for the machine collocation
lines=`hadoop fs -ls /user/pms/recsys/algorithm/schedule/warehouse/ruleengine/artificial/product/$yesterday | wc -l`
if [ $lines -le 0 ] ;then
    echo 'error! Artificial product is not exist'
    exit 1
else
    echo 'Artificial product is OK!!!!!!'
fi

# Verify that the relevant product data source exists for the machine collocation
lines=`hadoop fs -ls /user/pms/recsys/algorithm/schedule/warehouse/mix/artificial/product/$yesterday | wc -l`
if [ $lines -le 0 ] ;then
    echo 'error! mix product is not exist'
    exit 1
else
    echo 'mix product is OK!!!!!!'
fi

###############################
# Process B
###############################
# Generate the group-purchase information table; currently only two fields are extracted: group-purchase ID and product ID
sh $userTrackPathCollectHome/scripts/extract_groupon_info.sh
lines=`hadoop fs -ls /user/hive/pms/extract_groupon_info | wc -l`
if [ $lines -le 0 ] ;then
    echo 'error! Groupon info is not exist'
    exit 4
else
    echo 'Groupon info is OK!!!!!'
fi

# Generate the product series; total file size is about 320M
sh $userTrackPathCollectHome/scripts/extract_product_serial.sh
lines=`hadoop fs -ls /user/hive/pms/product_serial_id | wc -l`
if [ $lines -le 0 ] ;then
    echo 'error! product serial is not exist'
    exit 5
else
    echo 'product serial is OK!!!!!'
fi

# Preprocessing: generate the extract_trfc_page_kpi table, which summarizes each page's PV and UV counts by pageId
sh $userTrackPathCollectHome/scripts/extract_trfc_page_kpi.sh $date
lines=`hadoop fs -ls /user/hive/pms/extract_trfc_page_kpi/ds=$date | wc -l`
if [ $lines -le 0 ] ;then
    echo 'error! extract_trfc_page_kpi is not exist'
    exit 6
else
    echo 'extract_trfc_page_kpi is OK!!!!!!'
fi

# Synchronize term_category to Hive and convert front-end categories to back-end categories
sh $userTrackPathCollectHome/scripts/extract_term_category.sh
lines=`hadoop fs -ls /user/hive/pms/temp_term_category | wc -l`
if [ $lines -le 0 ] ;then
    echo 'error! temp_term_category is not exist'
    exit 7
else
    echo 'temp_term_category is OK!!!!!!'
fi

###############################
# Process C
###############################
# Generate the extract_track_info table
sh $userTrackPathCollectHome/scripts/extract_track_info.sh
lines=`hadoop fs -ls /user/hive/warehouse/extract_track_info | wc -l`
if [ $lines -le 0 ] ;then
    echo 'error! extract_track_info is not exist'
    exit 1
else
    echo 'extract_track_info is OK!!!!!'
fi
...
With the script as shown above, the entire preprocessing stage takes 55 minutes to run.
Optimization
The script's execution can be divided into three processes:
Process A -> Process B -> Process C
Because the subtasks in Process B do not depend on one another, they do not need to run sequentially. The optimization is to run those independent subtasks in Process B in parallel.
Strictly speaking, Linux has no dedicated command for concurrent execution. The "concurrent execution" described here simply means putting these subtasks in the background. The script is modified as follows:
#!/bin/bash
source /etc/profile;
export userTrackPathCollectHome=/home/pms/bigdataengine/analysis/script/usertrack/master/pathcollect

###############################
# Process A
###############################
# Verify that the relevant product data source exists for the machine collocation
lines=`hadoop fs -ls /user/pms/recsys/algorithm/schedule/warehouse/ruleengine/artificial/product/$yesterday | wc -l`
if [ $lines -le 0 ] ;then
    echo 'error! Artificial product is not exist'
    exit 1
else
    echo 'Artificial product is OK!!!!!!'
fi

# Verify that the relevant product data source exists for the machine collocation
lines=`hadoop fs -ls /user/pms/recsys/algorithm/schedule/warehouse/mix/artificial/product/$yesterday | wc -l`
if [ $lines -le 0 ] ;then
    echo 'error! mix product is not exist'
    exit 1
else
    echo 'mix product is OK!!!!!!'
fi

###############################
# Process B
###############################
# Background job: generate the group-purchase information table; currently only two fields are extracted: group-purchase ID and product ID
{
    sh $userTrackPathCollectHome/scripts/extract_groupon_info.sh
    lines=`hadoop fs -ls /user/hive/pms/extract_groupon_info | wc -l`
    if [ $lines -le 0 ] ;then
        echo 'error! Groupon info is not exist'
        exit 4
    else
        echo 'Groupon info is OK!!!!!'
    fi
}&

# Background job: generate the product series; total file size is about 320M
{
    sh $userTrackPathCollectHome/scripts/extract_product_serial.sh
    lines=`hadoop fs -ls /user/hive/pms/product_serial_id | wc -l`
    if [ $lines -le 0 ] ;then
        echo 'error! product serial is not exist'
        exit 5
    else
        echo 'product serial is OK!!!!!'
    fi
}&

# Background job: preprocessing generates the extract_trfc_page_kpi table, which summarizes each page's PV and UV counts by pageId
{
    sh $userTrackPathCollectHome/scripts/extract_trfc_page_kpi.sh $date
    lines=`hadoop fs -ls /user/hive/pms/extract_trfc_page_kpi/ds=$date | wc -l`
    if [ $lines -le 0 ] ;then
        echo 'error! extract_trfc_page_kpi is not exist'
        exit 6
    else
        echo 'extract_trfc_page_kpi is OK!!!!!!'
    fi
}&

# Background job: synchronize term_category to Hive and convert front-end categories to back-end categories
{
    sh $userTrackPathCollectHome/scripts/extract_term_category.sh
    lines=`hadoop fs -ls /user/hive/pms/temp_term_category | wc -l`
    if [ $lines -le 0 ] ;then
        echo 'error! temp_term_category is not exist'
        exit 7
    else
        echo 'temp_term_category is OK!!!!!!'
    fi
}&

###############################
# Process C
###############################
# Wait for all the background processes above to finish
wait
echo 'End of backend jobs above!!!!!!!!!!!!!!!!!!!!!!!!!!!!'

# Generate the extract_track_info table
sh $userTrackPathCollectHome/scripts/extract_track_info.sh
lines=`hadoop fs -ls /user/hive/warehouse/extract_track_info | wc -l`
if [ $lines -le 0 ] ;then
    echo 'error! extract_track_info is not exist'
    exit 1
else
    echo 'extract_track_info is OK!!!!!'
fi
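The background-and-wait pattern used for Process B can be tried in isolation. The following is a minimal sketch, not part of schedule.sh: the subtask function and its sleep durations are hypothetical stand-ins for the real extract scripts, assumed here only to make the timing visible.

```shell
#!/bin/bash
# Hypothetical stand-in for one Process B subtask: sleeps to simulate work.
subtask() {
    sleep "$2"
    echo "$1 finished"
}

start=$(date +%s)

# A trailing "&" runs each group in a background subshell,
# so the three subtasks overlap instead of running one after another.
{ subtask groupon 2; } &
{ subtask serial 2; } &
{ subtask kpi 2; } &

# Block until every background job started by this shell has exited.
wait

end=$(date +%s)
elapsed=$((end - start))
echo "all subtasks done in about ${elapsed}s"
```

Run sequentially, the three subtasks would take the sum of their durations. Run in the background, the total is roughly the duration of the longest one, which is where the speedup comes from.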
In the modified schedule.sh script, all of the independent subtasks in Process B run in the background, which gives the "concurrent execution". So as not to disrupt the script's execution order:
Process A -> Process B -> Process C
the following must be added before Process C runs:
# Wait for all the background processes above to finish
wait
The purpose is to wait for all background processes in Process B to complete before Process C runs.
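One point worth checking when moving the checks into the background: an "exit 4" inside "{ ... }&" ends only that background subshell, and a bare "wait" with no arguments returns 0, so a failed subtask no longer stops the main script by itself. The sketch below shows one possible way to keep the original fail-fast behavior by waiting on each PID individually. The ok_task and bad_task functions are hypothetical placeholders, not the real extract scripts.

```shell
#!/bin/bash
# Hypothetical subtasks standing in for the Process B checks.
ok_task()  { sleep 1; return 0; }
bad_task() { sleep 1; return 4; }

# Start both jobs in the background and remember their PIDs via $!.
{ ok_task; } &
pid_ok=$!
{ bad_task; } &
pid_bad=$!

# "wait PID" returns the exit status of that specific job,
# unlike a bare "wait", which always returns 0.
failed=0
for pid in "$pid_ok" "$pid_bad"; do
    status=0
    wait "$pid" || status=$?
    if [ "$status" -ne 0 ]; then
        echo "background job $pid failed with status $status"
        failed=$status
    fi
done

# In a real script, this is where one could run: exit "$failed"
if [ "$failed" -ne 0 ]; then
    echo "a Process B subtask failed; Process C should not run"
else
    echo "all Process B subtasks succeeded"
fi
```

Whether to abort before Process C or merely log the failure is a design choice; the sequential version of the script aborted, so propagating the status keeps the two versions equivalent.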
Conclusion
After the optimization, the script's execution time dropped from 55 minutes to 15 minutes, a significant improvement.