[Linux] Shell multi-process concurrency in detail
Business background
The schedule.sh script schedules the execution of the scripts in the user-track project. Part of the code is as follows:
#!/bin/bash
source /etc/profile;
export userTrackPathCollectHome=/home/pms/bigDataEngine/analysis/script/usertrack/master/pathCollect

################################ Process A ################################
# Verify that the artificial product data source exists
lines=$(hadoop fs -ls /user/pms/recsys/algorithm/schedule/warehouse/ruleengine/artificial/product/$yesterday | wc -l)
if [ "$lines" -le 0 ]; then
    echo 'error! artificial product is not exist'
    exit 1
else
    echo 'artificial product is OK!!!!!!'
fi

# Verify that the machine-matched (mix) related product data source exists
lines=$(hadoop fs -ls /user/pms/recsys/algorithm/schedule/warehouse/mix/artificial/product/$yesterday | wc -l)
if [ "$lines" -le 0 ]; then
    echo 'error! mix product is not exist'
    exit 1
else
    echo 'mix product is OK!!!!!!'
fi

################################ Process B ################################
# Generate the group buying information table; currently only group buying IDs and product IDs are captured
sh $userTrackPathCollectHome/scripts/extract_groupon_info.sh
lines=$(hadoop fs -ls /user/hive/pms/extract_groupon_info | wc -l)
if [ "$lines" -le 0 ]; then
    echo 'error! groupon info is not exist'
    exit 4
else
    echo 'groupon info is OK!!!!!'
fi

# Generate the product series; the total file size is around ... MB
sh $userTrackPathCollectHome/scripts/extract_product_serial.sh
lines=$(hadoop fs -ls /user/hive/pms/product_serial_id | wc -l)
if [ "$lines" -le 0 ]; then
    echo 'error! product serial is not exist'
    exit 5
else
    echo 'product serial is OK!!!!!'
fi

# Preprocess to generate the extract_trfc_page_kpi table, used to count page PV and UV by pageId
sh $userTrackPathCollectHome/scripts/extract_trfc_page_kpi.sh $date
lines=$(hadoop fs -ls /hive/pms/extract_trfc_page_kpi/ds=$date | wc -l)
if [ "$lines" -le 0 ]; then
    echo 'error! extract_trfc_page_kpi is not exist'
    exit 6
else
    echo 'extract_trfc_page_kpi is OK!!!!!!'
fi

# Synchronize term_category to hive and convert the foreground category to the background category
sh $userTrackPathCollectHome/scripts/extract_term_category.sh
lines=$(hadoop fs -ls /user/hive/pms/temp_term_category | wc -l)
if [ "$lines" -le 0 ]; then
    echo 'error! temp_term_category is not exist'
    exit 7
else
    echo 'temp_term_category is OK!!!!!!'
fi

################################ Process C ################################
# Generate the extract_track_info table
sh $userTrackPathCollectHome/scripts/extract_track_info.sh
lines=$(hadoop fs -ls /user/hive/warehouse/extract_track_info | wc -l)
if [ "$lines" -le 0 ]; then
    echo 'error! extract_track_info is not exist'
    exit 1
else
    echo 'extract_track_info is OK!!!!!'
fi
...
Executed in sequence as shown above, the entire preprocessing run takes 55 minutes.
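Every step in the script repeats the same check: list a path, count the lines, then print an error and exit with a step-specific code if nothing is there. As a rough sketch only, that pattern could be factored into a helper. The names check_exists and list_path are hypothetical and do not appear in the original script; list_path is kept separate so that it can be replaced (for example by plain ls) when trying the sketch on a machine without Hadoop.

```shell
#!/bin/bash
# Lists a path. Kept as its own function so it can be swapped out
# (for example with plain `ls`) when Hadoop is not available.
list_path() {
    hadoop fs -ls "$1"
}

# check_exists PATH NAME CODE
# Hypothetical helper, not part of the original script: prints an error
# and returns CODE when the listing of PATH is empty.
check_exists() {
    local path=$1 name=$2 code=$3 lines
    lines=$(list_path "$path" 2>/dev/null | wc -l)
    if [ "$lines" -le 0 ]; then
        echo "error! $name is not exist"
        return "$code"
    fi
    echo "$name is OK"
}

# Example use inside the script:
#   check_exists /user/hive/pms/extract_groupon_info 'groupon info' 4 || exit $?
```

Returning the code instead of calling exit inside the helper leaves the caller free to decide whether a failed check should stop the whole script.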
Optimization
The execution of the script above can be divided into three processes:
Process A -> Process B -> Process C
The subtasks in Process B do not affect each other, so there is no need to execute them in sequence. The idea of the optimization is to execute these independent Process B subtasks in parallel.
In fact, Linux has no dedicated command for concurrent execution. The "concurrent execution" described here is achieved by putting the subtasks in the background. The script is transformed as follows:
#!/bin/bash
source /etc/profile;
export userTrackPathCollectHome=/home/pms/bigDataEngine/analysis/script/usertrack/master/pathCollect

################################ Process A ################################
# Verify that the artificial product data source exists
lines=$(hadoop fs -ls /user/pms/recsys/algorithm/schedule/warehouse/ruleengine/artificial/product/$yesterday | wc -l)
if [ "$lines" -le 0 ]; then
    echo 'error! artificial product is not exist'
    exit 1
else
    echo 'artificial product is OK!!!!!!'
fi

# Verify that the machine-matched (mix) related product data source exists
lines=$(hadoop fs -ls /user/pms/recsys/algorithm/schedule/warehouse/mix/artificial/product/$yesterday | wc -l)
if [ "$lines" -le 0 ]; then
    echo 'error! mix product is not exist'
    exit 1
else
    echo 'mix product is OK!!!!!!'
fi

################################ Process B ################################
# Concurrent process: generate the group buying information table; currently only group buying IDs and product IDs are captured
{
    sh $userTrackPathCollectHome/scripts/extract_groupon_info.sh
    lines=$(hadoop fs -ls /user/hive/pms/extract_groupon_info | wc -l)
    if [ "$lines" -le 0 ]; then
        echo 'error! groupon info is not exist'
        exit 4
    else
        echo 'groupon info is OK!!!!!'
    fi
} &

# Concurrent process: generate the product series; the total file size is around ... MB
{
    sh $userTrackPathCollectHome/scripts/extract_product_serial.sh
    lines=$(hadoop fs -ls /user/hive/pms/product_serial_id | wc -l)
    if [ "$lines" -le 0 ]; then
        echo 'error! product serial is not exist'
        exit 5
    else
        echo 'product serial is OK!!!!!'
    fi
} &

# Concurrent process: preprocess to generate the extract_trfc_page_kpi table, used to count page PV and UV by pageId
{
    sh $userTrackPathCollectHome/scripts/extract_trfc_page_kpi.sh $date
    lines=$(hadoop fs -ls /pms/extract_trfc_page_kpi/ds=$date | wc -l)
    if [ "$lines" -le 0 ]; then
        echo 'error! extract_trfc_page_kpi is not exist'
        exit 6
    else
        echo 'extract_trfc_page_kpi is OK!!!!!!'
    fi
} &

# Concurrent process: synchronize term_category to hive and convert the foreground category to the background category
{
    sh $userTrackPathCollectHome/scripts/extract_term_category.sh
    lines=$(hadoop fs -ls /user/hive/pms/temp_term_category | wc -l)
    if [ "$lines" -le 0 ]; then
        echo 'error! temp_term_category is not exist'
        exit 7
    else
        echo 'temp_term_category is OK!!!!!!'
    fi
}

################################ Process C ################################
# Wait until all the preceding background processes finish running
wait
echo 'End of backend jobs above !!!!!!!!!!!!!!!!!!!!!!!!!!!!'

# Generate the extract_track_info table
sh $userTrackPathCollectHome/scripts/extract_track_info.sh
lines=$(hadoop fs -ls /user/hive/warehouse/extract_track_info | wc -l)
if [ "$lines" -le 0 ]; then
    echo 'error! extract_track_info is not exist'
    exit 1
else
    echo 'extract_track_info is OK!!!!!'
fi
In the script above, the Process B subtasks that do not affect each other are executed in the background, which achieves the "concurrent execution". To avoid disrupting the overall execution order of the script:
Process A -> Process B -> Process C
you need to add the following before Process C:
# Wait for all the background processes to finish
wait
The purpose is to start Process C only after all the background processes of Process B have finished.
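One caveat when adapting this pattern: in bash, a brace group started with & runs in a subshell, so an exit inside it ends only that background job, and a bare wait returns 0 no matter how the jobs ended. The following is a minimal sketch, not the author's method, of one way to notice failed background jobs: record each PID from $! and wait on it individually. The functions job_ok and job_fail are hypothetical stand-ins for the Process B subtasks.

```shell
#!/bin/bash
# Hypothetical stand-ins for the Process B subtasks.
job_ok()   { sleep 1; return 0; }
job_fail() { sleep 1; return 4; }

# Start each job in the background and remember its PID.
pids=()
job_ok   & pids+=($!)
job_fail & pids+=($!)
job_ok   & pids+=($!)

# `wait PID` returns the exit status of that particular job,
# so a failure in any of them can be detected here.
failed=0
for pid in "${pids[@]}"; do
    if ! wait "$pid"; then
        failed=1
    fi
done

if [ "$failed" -ne 0 ]; then
    echo 'error! at least one background job failed'
    # A real script would stop here (for example with `exit 1`)
    # instead of going on to Process C.
else
    echo 'all background jobs are OK'
fi
```

Waiting on each PID in turn still lets the jobs run concurrently; the loop only collects their results after they have been started.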
Conclusion
After optimization, the execution time of the script is reduced from 55 minutes to 15 minutes, with remarkable results.
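The effect can be reproduced on a small scale. The sketch below assumes three independent tasks, each simulated with sleep (the task function is hypothetical), starts them in the background, and measures the elapsed time with bash's SECONDS variable.

```shell
#!/bin/bash
# Hypothetical stand-in for an independent subtask; sleep simulates the work.
task() {
    sleep "$1"
    echo "task $2 done"
}

start=$SECONDS

# Start three independent tasks in the background.
task 2 a &
task 2 b &
task 2 c &

# Block until every background job has finished.
wait

elapsed=$((SECONDS - start))
echo "elapsed: ${elapsed}s"
```

Run in sequence, the three tasks would need about six seconds. Run in the background, the elapsed time is roughly that of the longest single task, which is the same reason the Process B stage of the real script gets shorter.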