linux多線程下載工具 axel 下載中止的問題
1 axel是什麼
axel是一個linux下的多線程下載工具, 官網 http://axel.alioth.debian.org/
2 我遇到的問題
$> axel -a -n 10 -s 409600 "myurl"
下載一段時間再無進度. 而且這種現象很難重現.
3 axel 源碼的邏輯
main() { axel_new() { /* 發送HTTP GET 請求 1個位元組,擷取下載檔案位元組數 */ } axel_open() { /* 分配每個串連下載檔案資料的起止位元組數, 並開啟用於儲存下載資料的檔案 */ axel->outfd = open(); } axel_start() { /* 建立線程池 */ setup_thread() { /* 線程處理函數 僅僅是建立串連 之後就返回了.而且 connect() 未設定逾時 */ gethostbyname(); connect(); return; } } /* 註冊 SIGINT 和 SIGTERM 訊號 */ while (下載未完成 且 未接收到訊號) { axel_do() { /* 儲存下載狀態 註冊 select() 如果當前線程池未建立有效串連,即所有描述符(axel->conn[0-n].fd <= 0), 則回收線程或重新建立線程 如果某線程 建立串連成功或未成功返回,則回收線程,並再次建立線程,執行串連操作. 如果某線程 在設定時間內未返回,則 pthread_cancel() 掉該線程. 但是 pthread_cancel() 並不回立馬生效 select() 如果某串連可讀, 則讀取資料,並寫入檔案對應位置. 如果某串連不可讀 且 逾時 45秒, 則關閉串連.下次迴圈時會建立線程重新串連. */ } } }
4 分析過程
使用strace跟蹤當前 axel 進程:
strace -f -tt -p PID
然後看單個線程的執行過程:
16457 14:28:54.194584 clone(child_stack=0xb5e35494, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb5e35bd8, {entry_number:6, base_addr:0xb5e35b70, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb5e35bd8) = 2342323423 14:28:54.194938 set_robust_list(0xb5e35be0, 0xc <unfinished ...>23423 14:28:54.195125 <... set_robust_list resumed> ) = 023423 14:28:54.195300 futex(0xb77bcde0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>16457 14:29:14.077986 tgkill(16457, 23423, SIGRTMIN <unfinished ...>23423 14:29:14.078204 <... futex resumed> ) = ? ERESTARTSYS (To be restarted)23423 14:29:14.078424 --- SIGRTMIN (Unknown signal 32) @ 0 (0) ---23423 14:29:14.078676 madvise(0xb5635000, 8372224, MADV_DONTNEED <unfinished ...>23423 14:29:14.078761 <... madvise resumed> ) = 023423 14:29:14.078916 _exit(0) = ?16457 14:28:54.195374 <... clone resumed> child_stack=0xb5634494, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb5634bd8, {entry_number:6, base_addr:0xb5634b70, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb5634bd8) = 2342423424 14:28:54.195524 set_robust_list(0xb5634be0, 0xc <unfinished ...>23424 14:28:54.195666 <... set_robust_list resumed> ) = 023424 14:28:54.195877 futex(0xb77bcde0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>16457 14:29:14.078985 tgkill(16457, 23424, SIGRTMIN) = 023424 14:29:14.079124 <... futex resumed> ) = ? ERESTARTSYS (To be restarted)23424 14:29:14.079353 --- SIGRTMIN (Unknown signal 32) @ 0 (0) ---23424 14:29:14.079510 madvise(0xb4e34000, 8372224, MADV_DONTNEED <unfinished ...>23424 14:29:14.079719 <... madvise resumed> ) = 023424 14:29:14.079944 _exit(0) = ?...
發現:
-> 新分配的線程都停在了 futex(0xb77bcde0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...> | v -> 顯然是死結了. | 為什麼死結? v -> 查看源碼, 並沒有使用鎖機制, 整個代碼中調用的線程函數有: | pthread_create() | pthread_join() | pthread_cancel() | pthread_setcancelstate() | pthread_setcanceltype() | v -> 說明futex()不是代碼顯示調用的, 可能是某個函數內部調用了 futex() | 究竟是哪個函數調用了 futex() 呢? v -> 跟蹤執行成功的axel的線程: | 10363 16:28:41 <... clone resumed> child_stack=0xb6b4c494, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb6b4cbd8, {entry_number:6, base_addr:0xb6b4cb70, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb6b4cbd8) = 10370 | 10370 16:28:41 set_robust_list(0xb6b4cbe0, 0xc <unfinished ...> | 10370 16:28:41 <... set_robust_list resumed> ) = 0 | 10370 16:28:41 futex(0x28ae68, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...> | 10370 16:28:41 <... futex resumed> ) = 0 | 10370 16:28:41 open("/etc/resolv.conf", O_RDONLY <unfinished ...> | 10370 16:28:41 <... open resumed> ) = 4 | 10370 16:28:41 fstat64(4, <unfinished ...> | 10370 16:28:41 <... fstat64 resumed> {st_mode=S_IFREG|0644, st_size=52, ...}) = 0 | 10370 16:28:41 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7711000 | 10370 16:28:41 read(4, "nameserver 219.141.136.10\nnamese"..., 4096) = 52 | 10370 16:28:41 read(4, "", 4096) = 0 | 10370 16:28:41 close(4) = 0 | 10370 16:28:41 munmap(0xb7711000, 4096) = 0 | 10370 16:28:41 uname({sys="Linux", node="201221021JM93X", ...}) = 0 | 10370 16:28:41 stat64("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=52, ...}) = 0 | 10370 16:28:41 open("/etc/hosts", O_RDONLY|O_CLOEXEC) = 4 | 10370 16:28:41 fstat64(4, {st_mode=S_IFREG|0644, st_size=3382, ...}) = 0 | 10370 16:28:41 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7711000 | 10370 16:28:41 read(4, "127.0.0.1\tlocalhost\n10.2.30.159\t"..., 4096) = 3382 | 10370 16:28:41 read(4, "", 4096) = 0 | 10370 16:28:41 close(4) = 0 | 10370 16:28:41 munmap(0xb7711000, 4096) = 0 | 10370 16:28:41 stat64("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=52, ...}) = 0 | 10370 16:28:41 socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 4 | 10370 16:28:41 connect(4, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("219.141.136.10")}, 16) = 0 | 10370 16:28:41 gettimeofday({1345624121, 428914}, NULL) = 0 | 10370 16:28:41 poll([{fd=4, events=POLLOUT}], 1, 0) = 1 ([{fd=4, revents=POLLOUT}]) | 10370 16:28:41 send(4, "#\301\1\0\0\1\0\0\0\0\0\0\5cacti\6bokecc\3com\0\0\1"..., 34, MSG_NOSIGNAL) = 34 | 10370 16:28:41 poll([{fd=4, events=POLLIN}], 1, 5000 <unfinished ...> | 10370 16:28:41 <... poll resumed> ) = 1 ([{fd=4, revents=POLLIN}]) | 10370 16:28:41 ioctl(4, FIONREAD, [188]) = 0 | 10370 16:28:41 recvfrom(4, "#\301\201\200\0\1\0\1\0\2\0\6\5cacti\6bokecc\3com\0\0\1"..., 1024, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("219.141.136.10")}, [16]) = 188 | 10370 16:28:41 close(4) = 0 | 10370 16:28:41 futex(0x28ae68, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> | 10370 16:28:41 <... futex resumed> ) = 1 | 10370 16:28:41 socket(PF_INET, SOCK_STREAM, IPPROTO_IP <unfinished ...> | 10370 16:28:41 <... socket resumed> ) = 6 | 10370 16:28:41 connect(6, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("114.113.152.135")}, 16 <unfinished ...> | 10370 16:28:41 <... connect resumed> ) = 0 | 10370 16:28:41 gettimeofday( <unfinished ...> | 10370 16:28:41 <... gettimeofday resumed> {1345624121, 435907}, NULL) = 0 | 10370 16:28:41 write(6, "GET /test/test.flv HTTP/1.0\r\nHos"..., 116 <unfinished ...> v -> 發現在進入線程執行的第二個系統調用就是 futex() | 10370 16:28:41 set_robust_list(0xb6b4cbe0, 0xc <unfinished ...> | 10370 16:28:41 <... set_robust_list resumed> ) = 0 | 10370 16:28:41 futex(0x28ae68, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...> | 10370 16:28:41 <... futex resumed> ) = 0 | 10370 16:28:41 open("/etc/resolv.conf", O_RDONLY <unfinished ...> | 10370 16:28:41 <... open resumed> ) = 4 | 10370 16:28:41 fstat64(4, <unfinished ...> | 10370 16:28:41 <... fstat64 resumed> {st_mode=S_IFREG|0644, st_size=52, ...}) = 0 | 10370 16:28:41 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7711000 | ... | 10370 16:28:41 close(4) = 0 | ... | 10370 16:28:41 stat64("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=52, ...}) = 0 | 10370 16:28:41 open("/etc/hosts", O_RDONLY|O_CLOEXEC) = 4 | ... | 10370 16:28:41 close(4) = 0 | ... | 10370 16:28:41 close(4) = 0 | 10370 16:28:41 futex(0x28ae68, FUTEX_WAKE_PRIVATE, 1 <unfinished ...> | 10370 16:28:41 <... futex resumed> ) = 1 | 10370 16:28:41 socket(PF_INET, SOCK_STREAM, IPPROTO_IP <unfinished ...> | 10370 16:28:41 <... socket resumed> ) = 6 | 10370 16:28:41 connect(6, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("114.113.152.135")}, 16 <unfinished ...> v -> 很明顯是 gethostbyname() 函數內部調用了 futex() | v -> 我記得老師講課時說過 gethostbyname() 不是安全執行緒的. 會不會跟這有關係,試用 gethostbyname_r() 或 getaddrinfo() 替換再次測試 | 32559 16:39:39 clone(child_stack=0xb7357494, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7357bd8, {entry_number:6, base_addr:0xb7357b70, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7357bd8) = 32619 | 32619 16:39:39 set_robust_list(0xb7357be0, 0xc <unfinished ...> | 32619 16:39:39 <... set_robust_list resumed> ) = 0 | 32619 16:39:39 open("/etc/resolv.conf", O_RDONLY <unfinished ...> | 32619 16:39:39 <... open resumed> ) = 4 | 32619 16:39:39 fstat64(4, <unfinished ...> | 32619 16:39:39 <... fstat64 resumed> {st_mode=S_IFREG|0644, st_size=52, ...}) = 0 | 32619 16:39:39 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...> | 32619 16:39:39 <... mmap2 resumed> ) = 0xb6355000 | 32619 16:39:39 read(4, <unfinished ...> | 32619 16:39:39 <... read resumed> "nameserver 219.141.136.10\nnamese"..., 4096) = 52 | 32619 16:39:39 read(4, <unfinished ...> | 32619 16:39:39 <... read resumed> "", 4096) = 0 | 32619 16:39:39 close(4 <unfinished ...> | 32619 16:39:39 <... close resumed> ) = 0 | 32619 16:39:39 munmap(0xb6355000, 4096 <unfinished ...> | 32619 16:39:39 <... munmap resumed> ) = 0 | 32619 16:39:39 uname( <unfinished ...> | 32619 16:39:39 <... uname resumed> {sys="Linux", node="201221021JM93X", ...}) = 0 | 32619 16:39:39 futex(0x98e1e8, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...> | 32619 16:39:39 <... futex resumed> ) = 0 | 32619 16:39:39 open("/etc/hosts", O_RDONLY|O_CLOEXEC) = 6 | 32619 16:39:39 fstat64(6, {st_mode=S_IFREG|0644, st_size=3382, ...}) = 0 | 32619 16:39:39 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7713000 | 32619 16:39:39 read(6, "127.0.0.1\tlocalhost\n10.2.30.159\t"..., 4096) = 3382 | 32619 16:39:39 read(6, "", 4096) = 0 | 32619 16:39:39 close(6) = 0 | 32619 16:39:39 munmap(0xb7713000, 4096) = 0 | 32619 16:39:39 futex(0x98e1e8, FUTEX_WAKE_PRIVATE, 1) = 1 v -> 發現gethostbyname_r也會調用futex() | | 可能不是 gethostbyname() 或 gethostbyname_r() 的問題, 驗證方法 <= 把 gethostbyname_r() 單獨寫一個程式然後strace跟蹤不會產生futex()調用 | 之所以會有futex() 調用是因為 編譯時間串連了 pthread 庫,但是死結肯定和這裡有關係, 也就是 線程執行到 gethostbyname() 內, 剛剛執行完 | futex(0x98e1e8, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...> (加鎖) | 還沒來得及解鎖就被 cancel 掉了, 其他線程執行到gethostbyname() 同樣要加鎖,由於沒人解鎖,所以一個個線程就阻塞到這兒直到20s逾時,被主線程cancel掉, | 主線程會再次建立新的線程, 線程還是死結在那兒, 再次被cancel,這樣周而復始就產生了我們開頭描述的axel下載一段時間再無進度的情況. v -> 由此可以推斷導致開頭描述現象的原因是: | 線程被cancel的不是時候,也就是線上程加鎖後和解鎖前不能被cancel, 顯然這是由於線程函數中將cancel模式 | "pthread_setcanceltype( PTHREAD_CANCEL_DEFERRED, &oldstate );" | 設定為了 非同步取消模式 導致的, 在<<UNIX環境進階編程>>第2版 12.7 取消選項(P333)一節中寫到: | "非同步取消和延遲取消不同, 試用非同步取消時, 線程可以在任意時候被取消,而不是非得遇到取消點才能被取消" v -> 作者為什麼這麼做呢? | 我想是為了控制 線程處理中的 gethostbyname() 和 connect() 逾時時間. 我們直到 connect()可以設定逾時,但是 gethostbyname() 是沒辦法設定逾時的. v -> 應該做怎樣的修改呢 | 1 將 gethostbyname() 改為安全執行緒的函數 gethostbyname_r() | 2 將 線程處理函數中設定cancel模式 由 非同步取消模式 改為 延遲取消模式 | 3 使用 非阻塞IO 和 select 實現 connect() 逾時 10s 的設定. v -> gethostbyname_r() 函數的逾時不用設定嗎? | dns協議有自己的預設逾時時間, 可通過 man 5 resolv.conf, 查看: | "the default is RES_TIMEOUT (currently 5, see <resolv.h>)" | 所以這裡可以不考慮 gethostbyname_r() 的逾時. v 結束
------
GS