linux多線程下載工具 axel 下載中止的問題

來源:互聯網
上載者:User

linux多線程下載工具 axel 下載中止的問題

1 axel是什麼
axel是一個linux下的多線程下載工具, 官網 http://axel.alioth.debian.org/

2 我遇到的問題
    $> axel -a -n 10 -s 409600 "myurl"
    下載一段時間再無進度. 而且這種現象很難重現.
3 axel 源碼的邏輯
  

  main()    {        axel_new()        {            /* 發送HTTP GET 請求 1個位元組,擷取下載檔案位元組數 */        }        axel_open()        {            /* 分配每個串連下載檔案資料的起止位元組數, 並開啟用於儲存下載資料的檔案 */            axel->outfd = open();        }        axel_start()        {            /* 建立線程池 */            setup_thread()            {                /* 線程處理函數 僅僅是建立串連 之後就返回了.而且 connect() 未設定逾時 */                gethostbyname();                connect();                return;            }        }        /* 註冊 SIGINT 和 SIGTERM 訊號  */        while (下載未完成 且 未接收到訊號)        {            axel_do()            {                /*                   儲存下載狀態                    註冊 select()                    如果當前線程池未建立有效串連,即所有描述符(axel->conn[0-n].fd <= 0), 則回收線程或重新建立線程                    如果某線程 建立串連成功或未成功返回,則回收線程,並再次建立線程,執行串連操作.                    如果某線程 在設定時間內未返回,則 pthread_cancel() 掉該線程. 但是 pthread_cancel() 並不回立馬生效                    select()                   如果某串連可讀, 則讀取資料,並寫入檔案對應位置.                   如果某串連不可讀 且 逾時 45秒, 則關閉串連.下次迴圈時會建立線程重新串連.                 */            }        }    }

4 分析過程
    使用strace跟蹤當前 axel 進程:
    strace -f -tt -p PID 
    然後看單個線程的執行過程:

16457 14:28:54.194584 clone(child_stack=0xb5e35494, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb5e35bd8, {entry_number:6, base_addr:0xb5e35b70, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb5e35bd8) = 2342323423 14:28:54.194938 set_robust_list(0xb5e35be0, 0xc <unfinished ...>23423 14:28:54.195125 <... set_robust_list resumed> ) = 023423 14:28:54.195300 futex(0xb77bcde0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>16457 14:29:14.077986 tgkill(16457, 23423, SIGRTMIN <unfinished ...>23423 14:29:14.078204 <... futex resumed> ) = ? ERESTARTSYS (To be restarted)23423 14:29:14.078424 --- SIGRTMIN (Unknown signal 32) @ 0 (0) ---23423 14:29:14.078676 madvise(0xb5635000, 8372224, MADV_DONTNEED <unfinished ...>23423 14:29:14.078761 <... madvise resumed> ) = 023423 14:29:14.078916 _exit(0)          = ?16457 14:28:54.195374 <... clone resumed> child_stack=0xb5634494, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb5634bd8, {entry_number:6, base_addr:0xb5634b70, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb5634bd8) = 2342423424 14:28:54.195524 set_robust_list(0xb5634be0, 0xc <unfinished ...>23424 14:28:54.195666 <... set_robust_list resumed> ) = 023424 14:28:54.195877 futex(0xb77bcde0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>16457 14:29:14.078985 tgkill(16457, 23424, SIGRTMIN) = 023424 14:29:14.079124 <... futex resumed> ) = ? ERESTARTSYS (To be restarted)23424 14:29:14.079353 --- SIGRTMIN (Unknown signal 32) @ 0 (0) ---23424 14:29:14.079510 madvise(0xb4e34000, 8372224, MADV_DONTNEED <unfinished ...>23424 14:29:14.079719 <... madvise resumed> ) = 023424 14:29:14.079944 _exit(0)          = ?...

發現:

    -> 新分配的線程都停在了 futex(0xb77bcde0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>    |    v    -> 顯然是死結了.    |  為什麼死結?    v    -> 查看源碼, 並沒有使用鎖機制, 整個代碼中調用的線程函數有:    |   pthread_create()    |   pthread_join()    |   pthread_cancel()    |   pthread_setcancelstate()    |   pthread_setcanceltype()    |    v    -> 說明futex()不是代碼顯示調用的, 可能是某個函數內部調用了 futex()    | 究竟是哪個函數調用了 futex() 呢?    v    -> 跟蹤執行成功的axel的線程:    |   10363 16:28:41 <... clone resumed> child_stack=0xb6b4c494, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb6b4cbd8, {entry_number:6, base_addr:0xb6b4cb70, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb6b4cbd8) = 10370    |   10370 16:28:41 set_robust_list(0xb6b4cbe0, 0xc <unfinished ...>    |   10370 16:28:41 <... set_robust_list resumed> ) = 0    |   10370 16:28:41 futex(0x28ae68, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>    |   10370 16:28:41 <... futex resumed> )    = 0    |   10370 16:28:41 open("/etc/resolv.conf", O_RDONLY <unfinished ...>    |   10370 16:28:41 <... open resumed> )     = 4    |   10370 16:28:41 fstat64(4,  <unfinished ...>    |   10370 16:28:41 <... fstat64 resumed> {st_mode=S_IFREG|0644, st_size=52, ...}) = 0    |   10370 16:28:41 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7711000    |   10370 16:28:41 read(4, "nameserver 219.141.136.10\nnamese"..., 4096) = 52    |   10370 16:28:41 read(4, "", 4096)        = 0    |   10370 16:28:41 close(4)                 = 0    |   10370 16:28:41 munmap(0xb7711000, 4096) = 0    |   10370 16:28:41 uname({sys="Linux", node="201221021JM93X", ...}) = 0    |   10370 16:28:41 stat64("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=52, ...}) = 0    |   10370 16:28:41 open("/etc/hosts", O_RDONLY|O_CLOEXEC) = 4    |   10370 16:28:41 fstat64(4, {st_mode=S_IFREG|0644, st_size=3382, ...}) = 0    |   10370 16:28:41 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7711000    |   10370 16:28:41 read(4, "127.0.0.1\tlocalhost\n10.2.30.159\t"..., 4096) = 3382    |   10370 16:28:41 read(4, "", 4096)        = 0    |   10370 16:28:41 close(4)                 = 0    |   10370 16:28:41 munmap(0xb7711000, 4096) = 0    |   10370 16:28:41 stat64("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=52, ...}) = 0    |   10370 16:28:41 socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 4    |   10370 16:28:41 connect(4, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("219.141.136.10")}, 16) = 0    |   10370 16:28:41 gettimeofday({1345624121, 428914}, NULL) = 0    |   10370 16:28:41 poll([{fd=4, events=POLLOUT}], 1, 0) = 1 ([{fd=4, revents=POLLOUT}])    |   10370 16:28:41 send(4, "#\301\1\0\0\1\0\0\0\0\0\0\5cacti\6bokecc\3com\0\0\1"..., 34, MSG_NOSIGNAL) = 34    |   10370 16:28:41 poll([{fd=4, events=POLLIN}], 1, 5000 <unfinished ...>    |   10370 16:28:41 <... poll resumed> )     = 1 ([{fd=4, revents=POLLIN}])    |   10370 16:28:41 ioctl(4, FIONREAD, [188]) = 0    |   10370 16:28:41 recvfrom(4, "#\301\201\200\0\1\0\1\0\2\0\6\5cacti\6bokecc\3com\0\0\1"..., 1024, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("219.141.136.10")}, [16]) = 188    |   10370 16:28:41 close(4)                 = 0    |   10370 16:28:41 futex(0x28ae68, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>    |   10370 16:28:41 <... futex resumed> )    = 1    |   10370 16:28:41 socket(PF_INET, SOCK_STREAM, IPPROTO_IP <unfinished ...>    |   10370 16:28:41 <... socket resumed> )   = 6    |   10370 16:28:41 connect(6, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("114.113.152.135")}, 16 <unfinished ...>    |   10370 16:28:41 <... connect resumed> )  = 0    |   10370 16:28:41 gettimeofday( <unfinished ...>    |   10370 16:28:41 <... gettimeofday resumed> {1345624121, 435907}, NULL) = 0    |   10370 16:28:41 write(6, "GET /test/test.flv HTTP/1.0\r\nHos"..., 116 <unfinished ...>    v    -> 發現在進入線程執行的第二個系統調用就是 futex()    |   10370 16:28:41 set_robust_list(0xb6b4cbe0, 0xc <unfinished ...>    |   10370 16:28:41 <... set_robust_list resumed> ) = 0    |   10370 16:28:41 futex(0x28ae68, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>    |   10370 16:28:41 <... futex resumed> )    = 0    |   10370 16:28:41 open("/etc/resolv.conf", O_RDONLY <unfinished ...>    |   10370 16:28:41 <... open resumed> )     = 4    |   10370 16:28:41 fstat64(4,  <unfinished ...>    |   10370 16:28:41 <... fstat64 resumed> {st_mode=S_IFREG|0644, st_size=52, ...}) = 0    |   10370 16:28:41 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7711000    |   ...    |   10370 16:28:41 close(4)                 = 0    |   ...    |   10370 16:28:41 stat64("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=52, ...}) = 0    |   10370 16:28:41 open("/etc/hosts", O_RDONLY|O_CLOEXEC) = 4    |   ...    |   10370 16:28:41 close(4)                 = 0    |   ...    |   10370 16:28:41 close(4)                 = 0    |   10370 16:28:41 futex(0x28ae68, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>    |   10370 16:28:41 <... futex resumed> )    = 1    |   10370 16:28:41 socket(PF_INET, SOCK_STREAM, IPPROTO_IP <unfinished ...>    |   10370 16:28:41 <... socket resumed> )   = 6    |   10370 16:28:41 connect(6, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("114.113.152.135")}, 16 <unfinished ...>    v    -> 很明顯是 gethostbyname() 函數內部調用了 futex()     |    v    -> 我記得老師講課時說過 gethostbyname() 不是安全執行緒的. 會不會跟這有關係,試用 gethostbyname_r() 或 getaddrinfo() 替換再次測試    |   32559 16:39:39 clone(child_stack=0xb7357494, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7357bd8, {entry_number:6, base_addr:0xb7357b70, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7357bd8) = 32619    |   32619 16:39:39 set_robust_list(0xb7357be0, 0xc <unfinished ...>    |   32619 16:39:39 <... set_robust_list resumed> ) = 0    |   32619 16:39:39 open("/etc/resolv.conf", O_RDONLY <unfinished ...>    |   32619 16:39:39 <... open resumed> )     = 4    |   32619 16:39:39 fstat64(4,  <unfinished ...>    |   32619 16:39:39 <... fstat64 resumed> {st_mode=S_IFREG|0644, st_size=52, ...}) = 0    |   32619 16:39:39 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>    |   32619 16:39:39 <... mmap2 resumed> )    = 0xb6355000    |   32619 16:39:39 read(4,  <unfinished ...>    |   32619 16:39:39 <... read resumed> "nameserver 219.141.136.10\nnamese"..., 4096) = 52    |   32619 16:39:39 read(4,  <unfinished ...>    |   32619 16:39:39 <... read resumed> "", 4096) = 0    |   32619 16:39:39 close(4 <unfinished ...>    |   32619 16:39:39 <... close resumed> )    = 0    |   32619 16:39:39 munmap(0xb6355000, 4096 <unfinished ...>    |   32619 16:39:39 <... munmap resumed> )   = 0    |   32619 16:39:39 uname( <unfinished ...>    |   32619 16:39:39 <... uname resumed> {sys="Linux", node="201221021JM93X", ...}) = 0    |   32619 16:39:39 futex(0x98e1e8, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>    |   32619 16:39:39 <... futex resumed> )    = 0    |   32619 16:39:39 open("/etc/hosts", O_RDONLY|O_CLOEXEC) = 6    |   32619 16:39:39 fstat64(6, {st_mode=S_IFREG|0644, st_size=3382, ...}) = 0    |   32619 16:39:39 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7713000    |   32619 16:39:39 read(6, "127.0.0.1\tlocalhost\n10.2.30.159\t"..., 4096) = 3382    |   32619 16:39:39 read(6, "", 4096)        = 0    |   32619 16:39:39 close(6)                 = 0    |   32619 16:39:39 munmap(0xb7713000, 4096) = 0    |   32619 16:39:39 futex(0x98e1e8, FUTEX_WAKE_PRIVATE, 1) = 1    v    ->  發現gethostbyname_r也會調用futex()    |    | 可能不是 gethostbyname() 或 gethostbyname_r() 的問題, 驗證方法 <= 把 gethostbyname_r() 單獨寫一個程式然後strace跟蹤不會產生futex()調用     | 之所以會有futex() 調用是因為 編譯時間串連了 pthread 庫,但是死結肯定和這裡有關係, 也就是 線程執行到 gethostbyname() 內, 剛剛執行完    |   futex(0x98e1e8, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...> (加鎖)    | 還沒來得及解鎖就被 cancel 掉了, 其他線程執行到gethostbyname() 同樣要加鎖,由於沒人解鎖,所以一個個線程就阻塞到這兒直到20s逾時,被主線程cancel掉,    | 主線程會再次建立新的線程, 線程還是死結在那兒, 再次被cancel,這樣周而復始就產生了我們開頭描述的axel下載一段時間再無進度的情況.    v    -> 由此可以推斷導致開頭描述現象的原因是:    |   線程被cancel的不是時候,也就是線上程加鎖後和解鎖前不能被cancel, 顯然這是由於線程函數中將cancel模式    |       "pthread_setcanceltype( PTHREAD_CANCEL_DEFERRED, &oldstate );"    |   設定為了 非同步取消模式 導致的, 在<<UNIX環境進階編程>>第2版 12.7 取消選項(P333)一節中寫到:    |       "非同步取消和延遲取消不同, 試用非同步取消時, 線程可以在任意時候被取消,而不是非得遇到取消點才能被取消"    v    -> 作者為什麼這麼做呢?    |   我想是為了控制 線程處理中的 gethostbyname() 和 connect() 逾時時間. 我們直到 connect()可以設定逾時,但是 gethostbyname() 是沒辦法設定逾時的.    v    -> 應該做怎樣的修改呢    |   1 將 gethostbyname() 改為安全執行緒的函數 gethostbyname_r()    |   2 將 線程處理函數中設定cancel模式 由 非同步取消模式 改為 延遲取消模式    |   3 使用 非阻塞IO 和 select 實現 connect() 逾時 10s 的設定.    v     -> gethostbyname_r() 函數的逾時不用設定嗎?     |   dns協議有自己的預設逾時時間, 可通過 man 5 resolv.conf, 查看:     |       "the default is RES_TIMEOUT (currently 5, see <resolv.h>)"     |   所以這裡可以不考慮 gethostbyname_r() 的逾時.     v     結束

------

GS

聯繫我們

該頁面正文內容均來源於網絡整理,並不代表阿里雲官方的觀點,該頁面所提到的產品和服務也與阿里云無關,如果該頁面內容對您造成了困擾,歡迎寫郵件給我們,收到郵件我們將在5個工作日內處理。

如果您發現本社區中有涉嫌抄襲的內容,歡迎發送郵件至: info-contact@alibabacloud.com 進行舉報並提供相關證據,工作人員會在 5 個工作天內聯絡您,一經查實,本站將立刻刪除涉嫌侵權內容。

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.