[Linux] Shell Multi-Process Concurrency (Detailed Version)
Business Background
The schedule.sh script is responsible for scheduling the execution of the user-track project's scripts. An excerpt of the code follows:
```bash
#!/bin/bash
source /etc/profile;

export userTrackPathCollectHome=/home/pms/bigDataEngine/analysis/script/usertrack/master/pathCollect

###############################
# Process A
###############################
# Verify that the machine-matched related-product data source exists
lines=`hadoop fs -ls /user/pms/recsys/algorithm/schedule/warehouse/ruleengine/artificial/product/$yesterday | wc -l`
if [ $lines -le 0 ] ;then
    echo 'Error! artificial product is not exist'
    exit 1
else
    echo 'artificial product is ok!!!!!!'
fi

# Verify that the machine-matched related-product data source exists
lines=`hadoop fs -ls /user/pms/recsys/algorithm/schedule/warehouse/mix/artificial/product/$yesterday | wc -l`
if [ $lines -le 0 ] ;then
    echo 'Error! mix product is not exist'
    exit 1
else
    echo 'mix product is ok!!!!!!'
fi

###############################
# Process B
###############################
# Generate the group-buy info table; currently only the group-buy ID and product ID are captured
sh $userTrackPathCollectHome/scripts/extract_groupon_info.sh
lines=`hadoop fs -ls /user/hive/pms/extract_groupon_info | wc -l`
if [ $lines -le 0 ] ;then
    echo 'Error! groupon info is not exist'
    exit 4
else
    echo 'groupon info is ok!!!!!'
fi

# Generate the product-series data; total file size is around 320 MB
sh $userTrackPathCollectHome/scripts/extract_product_serial.sh
lines=`hadoop fs -ls /user/hive/pms/product_serial_id | wc -l`
if [ $lines -le 0 ] ;then
    echo 'Error! product serial is not exist'
    exit 5
else
    echo 'product serial is ok!!!!!'
fi

# Preprocess to generate the extract_trfc_page_kpi table, used to aggregate
# each page's PV and UV counts by pageId
sh $userTrackPathCollectHome/scripts/extract_trfc_page_kpi.sh $date
lines=`hadoop fs -ls /user/hive/pms/extract_trfc_page_kpi/ds=$date | wc -l`
if [ $lines -le 0 ] ;then
    echo 'Error! extract_trfc_page_kpi is not exist'
    exit 6
else
    echo 'extract_trfc_page_kpi is ok!!!!!!'
fi

# Sync term_category to Hive, converting front-end categories to back-end categories
sh $userTrackPathCollectHome/scripts/extract_term_category.sh
lines=`hadoop fs -ls /user/hive/pms/temp_term_category | wc -l`
if [ $lines -le 0 ] ;then
    echo 'Error! temp_term_category is not exist'
    exit 7
else
    echo 'temp_term_category is ok!!!!!!'
fi

###############################
# Process C
###############################
# Generate the extract_track_info table
sh $userTrackPathCollectHome/scripts/extract_track_info.sh
lines=`hadoop fs -ls /user/hive/warehouse/extract_track_info | wc -l`
if [ $lines -le 0 ] ;then
    echo 'Error! extract_track_info is not exist'
    exit 1
else
    echo 'extract_track_info is ok!!!!!'
fi
...
```
As shown above, running the entire preprocessing stage to completion takes 55 minutes.
Optimization
The execution of the script above can be divided into three stages:

Process A -> Process B -> Process C
Since the subtasks within Process B do not affect one another, there is no need to run them sequentially; the optimization is to run these independent subtasks concurrently.
In fact, Linux has no dedicated command for concurrent execution. The "concurrency" described here is achieved by sending the subtasks to run in the background, which produces the desired parallel execution. The reworked script is as follows:
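The pattern can be sketched in isolation with stand-in tasks (the `sleep` commands below are placeholders for the real extract scripts; the timing check shows that the two tasks overlap instead of running back to back):

```bash
#!/bin/bash
# Two independent tasks are sent to the background with `&`,
# then `wait` blocks until both have finished.
start=$(date +%s)

{ sleep 2; echo 'task_a is ok'; } &   # runs in the background
{ sleep 2; echo 'task_b is ok'; } &   # runs concurrently with task_a

wait    # block until all background children have exited

elapsed=$(( $(date +%s) - start ))
echo "elapsed: ${elapsed}s"   # roughly 2s rather than 4s: the tasks overlapped
```

Run sequentially, the two `sleep 2` tasks would take about four seconds; backgrounded, the total is bounded by the slowest task.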
```bash
#!/bin/bash
source /etc/profile;

export userTrackPathCollectHome=/home/pms/bigDataEngine/analysis/script/usertrack/master/pathCollect

###############################
# Process A
###############################
# Verify that the machine-matched related-product data source exists
lines=`hadoop fs -ls /user/pms/recsys/algorithm/schedule/warehouse/ruleengine/artificial/product/$yesterday | wc -l`
if [ $lines -le 0 ] ;then
    echo 'Error! artificial product is not exist'
    exit 1
else
    echo 'artificial product is ok!!!!!!'
fi

# Verify that the machine-matched related-product data source exists
lines=`hadoop fs -ls /user/pms/recsys/algorithm/schedule/warehouse/mix/artificial/product/$yesterday | wc -l`
if [ $lines -le 0 ] ;then
    echo 'Error! mix product is not exist'
    exit 1
else
    echo 'mix product is ok!!!!!!'
fi

###############################
# Process B
###############################
# Concurrent job: generate the group-buy info table; currently only the
# group-buy ID and product ID are captured
{
    sh $userTrackPathCollectHome/scripts/extract_groupon_info.sh
    lines=`hadoop fs -ls /user/hive/pms/extract_groupon_info | wc -l`
    if [ $lines -le 0 ] ;then
        echo 'Error! groupon info is not exist'
        exit 4
    else
        echo 'groupon info is ok!!!!!'
    fi
}&

# Concurrent job: generate the product-series data; total file size is around 320 MB
{
    sh $userTrackPathCollectHome/scripts/extract_product_serial.sh
    lines=`hadoop fs -ls /user/hive/pms/product_serial_id | wc -l`
    if [ $lines -le 0 ] ;then
        echo 'Error! product serial is not exist'
        exit 5
    else
        echo 'product serial is ok!!!!!'
    fi
}&

# Concurrent job: preprocess to generate the extract_trfc_page_kpi table,
# used to aggregate each page's PV and UV counts by pageId
{
    sh $userTrackPathCollectHome/scripts/extract_trfc_page_kpi.sh $date
    lines=`hadoop fs -ls /user/hive/pms/extract_trfc_page_kpi/ds=$date | wc -l`
    if [ $lines -le 0 ] ;then
        echo 'Error! extract_trfc_page_kpi is not exist'
        exit 6
    else
        echo 'extract_trfc_page_kpi is ok!!!!!!'
    fi
}&

# Concurrent job: sync term_category to Hive, converting front-end
# categories to back-end categories
{
    sh $userTrackPathCollectHome/scripts/extract_term_category.sh
    lines=`hadoop fs -ls /user/hive/pms/temp_term_category | wc -l`
    if [ $lines -le 0 ] ;then
        echo 'Error! temp_term_category is not exist'
        exit 7
    else
        echo 'temp_term_category is ok!!!!!!'
    fi
}&

###############################
# Process C
###############################
# Wait for all of the background jobs above to finish
wait
echo 'end of backend jobs above!!!!!!!!!!!!!!!!!!!!!!!!!!!!'

# Generate the extract_track_info table
sh $userTrackPathCollectHome/scripts/extract_track_info.sh
lines=`hadoop fs -ls /user/hive/warehouse/extract_track_info | wc -l`
if [ $lines -le 0 ] ;then
    echo 'Error! extract_track_info is not exist'
    exit 1
else
    echo 'extract_track_info is ok!!!!!'
fi
```
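One caveat with the rewritten script: an `exit 4` inside a backgrounded `{ … }` group terminates only that background subshell, not schedule.sh itself, so a failed Process B subtask no longer aborts the run the way it did in the sequential version. A minimal sketch of one way to surface such failures, recording each job's PID via `$!` and retrieving its exit status with `wait <pid>` (the `sleep`/`exit` bodies are stand-ins for the real subtasks):

```bash
#!/bin/bash
# Collect background-job exit codes so that a failed subtask
# can still abort the parent script.
pids=()

{ sleep 1; exit 0; } &    # stand-in for a subtask that succeeds
pids+=($!)                # $! is the PID of the most recent background job

{ sleep 1; exit 4; } &    # stand-in for a subtask that fails
pids+=($!)

failed=0
for pid in "${pids[@]}"; do
    if ! wait "$pid"; then   # wait <pid> returns that job's exit status
        failed=1             # remember that at least one subtask failed
    fi
done

if [ "$failed" -ne 0 ]; then
    echo 'Error! one of the concurrent subtasks failed'
    # the real script would `exit 1` here instead of continuing to Process C
fi
```

This keeps the concurrency win while restoring the fail-fast behavior of the original sequential script.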
In the script above, the independent subtasks of Process B are all sent to the background, achieving "concurrent execution". At the same time, to preserve the script's execution order:
Process A -> Process B -> Process C
the following must be added before Process C runs:
```bash
# Wait for all of the background jobs above to finish
wait
```
Its purpose is to wait until every background job in Process B has completed before Process C begins.
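A small demo of the `wait` semantics relied on here: called with no arguments, it blocks until every background child of the current shell has exited, and its own return status is 0 regardless of how those children exited:

```bash
#!/bin/bash
# `wait` with no arguments blocks for all background children,
# and returns 0 even if some of them failed.
false &     # a background job that exits non-zero
sleep 1 &   # a background job that succeeds
wait        # returns only after both have finished
status=$?
echo "wait returned $status"
```

This is why `wait` alone is enough to sequence Process B before Process C, but not enough to detect a Process B failure.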
Conclusion
After the optimization, the script's running time dropped from 55 minutes to 15 minutes, a significant improvement.