Linux multi-threaded download tool axel: download interruption issue
1 What is axel?
Axel is a multi-threaded download tool for Linux. Official website: http://axel.alioth.debian.org/
2 The problem I encountered
$> axel -a -n 10 -s 409600 "Myurl"
After downloading for some time, the download stops making progress. The problem is hard to reproduce.
3 Axel Source Logic
main()
{
    axel_new()
    {
        /* Send an HTTP GET request for 1 byte to get the size in bytes of the file to download */
    }
    axel_open()
    {
        /* Assign each connection its byte range of the file, and open the file used to save the downloaded data */
        axel->outfd = open();
    }
    axel_start()
    {
        /* Create the thread pool */
        setup_thread()
        {
            /* Thread function: it returns only after the connection has been created, and connect() has no timeout set */
            gethostbyname();
            connect();
            return;
        }
    }
    /* Register the SIGINT and SIGTERM signal handlers */
    while (download incomplete and no signal received)
    {
        axel_do()
        {
            /* Save the download state.
               Register descriptors for select().
               If the thread pool has not created any valid connection, that is, all descriptors are invalid (axel->conn[0-n].fd <= 0), recycle the threads or create them again.
               If a thread returns, whether or not its connection was created successfully, recycle the thread, and create a new thread to connect again if the connection is still not established.
               If a thread does not return within the set time, pthread_cancel() it. But pthread_cancel() does not take effect immediately.
               select()
               If a connection is readable, read the data and write it to the corresponding position in the file.
               If a connection is not readable and has been idle for more than 45 seconds, close it. The next loop iteration creates a thread to reconnect. */
        }
    }
}
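The cancel-and-recycle step in the outline above can be sketched in C as follows. This is only an illustration, not axel's code: demo_lock, demo_worker() and demo_cancel_releases_lock() are names made up for the example, and the worker just sleeps instead of resolving and connecting. It assumes the deferred cancel type, under which a cancel request is acted on only when the thread reaches a cancellation point (here sleep()), which is one reason pthread_cancel() does not take effect immediately. A cleanup handler releases the lock the worker holds when it is cancelled.

```c
#include <pthread.h>
#include <semaphore.h>
#include <unistd.h>

/* Hypothetical lock standing in for a lock held inside a library call. */
static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
static sem_t demo_ready;

/* Cleanup handler: runs when the worker is cancelled while holding the lock. */
static void unlock_on_cancel(void *arg)
{
    pthread_mutex_unlock((pthread_mutex_t *)arg);
}

/* Hypothetical worker: takes the lock, then blocks at a cancellation point. */
static void *demo_worker(void *arg)
{
    int old;
    (void)arg;
    /* Deferred mode: cancellation happens only at cancellation points. */
    pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, &old);
    pthread_mutex_lock(&demo_lock);
    pthread_cleanup_push(unlock_on_cancel, &demo_lock);
    sem_post(&demo_ready);          /* tell the main thread we hold the lock */
    for (;;)
        sleep(1);                   /* sleep() is a cancellation point */
    pthread_cleanup_pop(1);         /* never reached; pairs with the push */
    return NULL;
}

/* Cancel the worker, recycle it with pthread_join(), and check that the
 * lock is free afterwards. Returns 0 on success, -1 otherwise. */
int demo_cancel_releases_lock(void)
{
    pthread_t tid;
    void *res = NULL;

    if (sem_init(&demo_ready, 0, 0) != 0)
        return -1;
    if (pthread_create(&tid, NULL, demo_worker, NULL) != 0)
        return -1;
    sem_wait(&demo_ready);
    pthread_cancel(tid);            /* only a request; acted on in sleep() */
    pthread_join(tid, &res);        /* wait until the thread is really gone */
    sem_destroy(&demo_ready);

    if (res != PTHREAD_CANCELED)
        return -1;
    if (pthread_mutex_trylock(&demo_lock) != 0)
        return -1;                  /* lock leaked by the cancelled thread */
    pthread_mutex_unlock(&demo_lock);
    return 0;
}
```

Calling demo_cancel_releases_lock() should return 0 when built with -pthread. The point of the design is that the main thread never assumes the worker is gone until pthread_join() returns.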
4 Analysis Process
Use strace to trace the running axel process:
strace -f -tt -p PID
Then look at the execution of a single thread:
16457 14:28:54.194584 clone(child_stack=0xb5e35494, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb5e35bd8, {entry_number:6, base_addr:0xb5e35b70, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb5e35bd8) = 23423
23423 14:28:54.194938 set_robust_list(0xb5e35be0, 0xc <unfinished ...>
23423 14:28:54.195125 <... set_robust_list resumed> ) = 0
23423 14:28:54.195300 futex(0xb77bcde0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
16457 14:29:14.077986 tgkill(16457, 23423, SIGRTMIN <unfinished ...>
23423 14:29:14.078204 <... futex resumed> ) = ? ERESTARTSYS (To be restarted)
23423 14:29:14.078424 --- SIGRTMIN (Unknown signal) @ 0 (0) ---
23423 14:29:14.078676 madvise(0xb5635000, 8372224, MADV_DONTNEED <unfinished ...>
23423 14:29:14.078761 <... madvise resumed> ) = 0
23423 14:29:14.078916 _exit(0) = ?
16457 14:28:54.195374 <... clone resumed> child_stack=0xb5634494, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb5634bd8, {entry_number:6, base_addr:0xb5634b70, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb5634bd8) = 23424
23424 14:28:54.195524 set_robust_list(0xb5634be0, 0xc <unfinished ...>
23424 14:28:54.195666 <... set_robust_list resumed> ) = 0
23424 14:28:54.195877 futex(0xb77bcde0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
16457 14:29:14.078985 tgkill(16457, 23424, SIGRTMIN) = 0
23424 14:29:14.079124 <... futex resumed> ) = ? ERESTARTSYS (To be restarted)
23424 14:29:14.079353 --- SIGRTMIN (Unknown signal) @ 0 (0) ---
23424 14:29:14.079510 madvise(0xb4e34000, 8372224, MADV_DONTNEED <unfinished ...>
23424 14:29:14.079719 <... madvise resumed> ) = 0
23424 14:29:14.079944 _exit(0) = ?
...
Found:
-> Newly created threads are all parked in futex(0xb77bcde0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
-> This is obviously a deadlock.
Why the deadlock?
-> Looking at the source, no locking mechanism is used. The only thread functions called anywhere in the code are:
   pthread_create()
   pthread_join()
   pthread_cancel()
   pthread_setcancelstate()
   pthread_setcanceltype()
-> So futex() is not called explicitly by the code; it is probably called inside some function that the code calls.
Which function calls futex()?
-> Trace a thread of an axel process that is downloading successfully:
   10363 16:28:41 <... clone resumed> child_stack=0xb6b4c494, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb6b4cbd8, {entry_number:6, base_addr:0xb6b4cb70, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb6b4cbd8) = 10370
   10370 16:28:41 set_robust_list(0xb6b4cbe0, 0xc <unfinished ...>
   10370 16:28:41 <... set_robust_list resumed> ) = 0
   10370 16:28:41 futex(0x28ae68, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
   10370 16:28:41 <... futex resumed> ) = 0
   10370 16:28:41 open("/etc/resolv.conf", O_RDONLY <unfinished ...>
   10370 16:28:41 <... open resumed> ) = 4
   10370 16:28:41 fstat64(4, <unfinished ...>
   10370 16:28:41 <... fstat64 resumed> {st_mode=S_IFREG|0644, st_size=52, ...}) = 0
   10370 16:28:41 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7711000
   10370 16:28:41 read(4, "nameserver 219.141.136.10\nnamese"..., 4096) = 52
   10370 16:28:41 read(4, "", 4096) = 0
   10370 16:28:41 close(4) = 0
   10370 16:28:41 munmap(0xb7711000, 4096) = 0
   10370 16:28:41 uname({sys="Linux", node="201221021jm93x", ...}) = 0
   10370 16:28:41 stat64("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=52, ...}) = 0
   10370 16:28:41 open("/etc/hosts", O_RDONLY|O_CLOEXEC) = 4
   10370 16:28:41 fstat64(4, {st_mode=S_IFREG|0644, st_size=3382, ...}) = 0
   10370 16:28:41 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7711000
   10370 16:28:41 read(4, "127.0.0.1\tlocalhost\n10.2.30.159\t"..., 4096) = 3382
   10370 16:28:41 read(4, "", 4096) = 0
   10370 16:28:41 close(4) = 0
   10370 16:28:41 munmap(0xb7711000, 4096) = 0
   10370 16:28:41 stat64("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=52, ...}) = 0
   10370 16:28:41 socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 4
   10370 16:28:41 connect(4, {sa_family=AF_INET, sin_port=htons(), sin_addr=inet_addr("219.141.136.10")}, 16) = 0
   10370 16:28:41 gettimeofday({1345624121, 428914}, NULL) = 0
   10370 16:28:41 poll([{fd=4, events=POLLOUT}], 1, 0) = 1 ([{fd=4, revents=POLLOUT}])
   10370 16:28:41 send(4, "#\301\1\0\0\1\0\0\0\0\0\0\5cacti\6bokecc\3com\0\0\1"..., MSG_NOSIGNAL) = 34
   10370 16:28:41 poll([{fd=4, events=POLLIN}], 1, 5000 <unfinished ...>
   10370 16:28:41 <... poll resumed> ) = 1 ([{fd=4, revents=POLLIN}])
   10370 16:28:41 ioctl(4, FIONREAD, [188]) = 0
   10370 16:28:41 recvfrom(4, "#\301\201\200\0\1\0\1\0\2\0\6\5cacti\6bokecc\3com\0\0\1"..., 1024, 0, {sa_family=AF_INET, sin_port=htons(), sin_addr=inet_addr("219.141.136.10")}, [16]) = 188
   10370 16:28:41 close(4) = 0
   10370 16:28:41 futex(0x28ae68, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
   10370 16:28:41 <... futex resumed> ) = 1
   10370 16:28:41 socket(PF_INET, SOCK_STREAM, IPPROTO_IP <unfinished ...>
   10370 16:28:41 <... socket resumed> ) = 6
   10370 16:28:41 connect(6, {sa_family=AF_INET, sin_port=htons(), sin_addr=inet_addr("114.113.152.135")}, <unfinished ...>
   10370 16:28:41 <... connect resumed> ) = 0
   10370 16:28:41 gettimeofday( <unfinished ...>
   10370 16:28:41 <... gettimeofday resumed> {1345624121, 435907}, NULL) = 0
   10370 16:28:41 write(6, "GET /test/test.flv HTTP/1.0\r\nHos"..., 116 <unfinished ...>
-> The second call the thread makes after it starts running is futex():
   10370 16:28:41 set_robust_list(0xb6b4cbe0, 0xc <unfinished ...>
   10370 16:28:41 <... set_robust_list resumed> ) = 0
   10370 16:28:41 futex(0x28ae68, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
   10370 16:28:41 <... futex resumed> ) = 0
   10370 16:28:41 open("/etc/resolv.conf", O_RDONLY <unfinished ...>
   10370 16:28:41 <... open resumed> ) = 4
   10370 16:28:41 fstat64(4, <unfinished ...>
   10370 16:28:41 <... fstat64 resumed> {st_mode=S_IFREG|0644, st_size=52, ...}) = 0
   10370 16:28:41 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7711000
   ...
   10370 16:28:41 close(4) = 0
   ...
   10370 16:28:41 stat64("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=52, ...}) = 0
   10370 16:28:41 open("/etc/hosts", O_RDONLY|O_CLOEXEC) = 4
   ...
   10370 16:28:41 close(4) = 0
   ...
   10370 16:28:41 close(4) = 0
   10370 16:28:41 futex(0x28ae68, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
   10370 16:28:41 <... futex resumed> ) = 1
   10370 16:28:41 socket(PF_INET, SOCK_STREAM, IPPROTO_IP <unfinished ...>
   10370 16:28:41 <... socket resumed> ) = 6
   10370 16:28:41 connect(6, {sa_family=AF_INET, sin_port=htons(), sin_addr=inet_addr("114.113.152.135")}, <unfinished ...>
-> futex() is obviously called inside the gethostbyname() function.
-> I remember a teacher saying that gethostbyname() is not thread-safe. Could that be related? Replace it with gethostbyname_r() or getaddrinfo() and test again:
   32559 16:39:39 clone(child_stack=0xb7357494, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7357bd8, {entry_number:6, base_addr:0xb7357b70, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7357bd8) = 32619
   32619 16:39:39 set_robust_list(0xb7357be0, 0xc <unfinished ...>
   32619 16:39:39 <... set_robust_list resumed> ) = 0
   32619 16:39:39 open("/etc/resolv.conf", O_RDONLY <unfinished ...>
   32619 16:39:39 <... open resumed> ) = 4
   32619 16:39:39 fstat64(4, <unfinished ...>
   32619 16:39:39 <... fstat64 resumed> {st_mode=S_IFREG|0644, st_size=52, ...}) = 0
   32619 16:39:39 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
   32619 16:39:39 <... mmap2 resumed> ) = 0xb6355000
   32619 16:39:39 read(4, <unfinished ...>
   32619 16:39:39 <... read resumed> "nameserver 219.141.136.10\nnamese"..., 4096) = 52
   32619 16:39:39 read(4, <unfinished ...>
   32619 16:39:39 <... read resumed> "", 4096) = 0
   32619 16:39:39 close(4 <unfinished ...>
   32619 16:39:39 <... close resumed> ) = 0
   32619 16:39:39 munmap(0xb6355000, 4096 <unfinished ...>
   32619 16:39:39 <... munmap resumed> ) = 0
   32619 16:39:39 uname( <unfinished ...>
   32619 16:39:39 <... uname resumed> {sys="Linux", node="201221021jm93x", ...}) = 0
   32619 16:39:39 futex(0x98e1e8, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
   32619 16:39:39 <... futex resumed> ) = 0
   32619 16:39:39 open("/etc/hosts", O_RDONLY|O_CLOEXEC) = 6
   32619 16:39:39 fstat64(6, {st_mode=S_IFREG|0644, st_size=3382, ...}) = 0
   32619 16:39:39 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7713000
   32619 16:39:39 read(6, "127.0.0.1\tlocalhost\n10.2.30.159\t"..., 4096) = 3382
   32619 16:39:39 read(6, "", 4096) = 0
   32619 16:39:39 close(6) = 0
   32619 16:39:39 munmap(0xb7713000, 4096) = 0
   32619 16:39:39 futex(0x98e1e8, FUTEX_WAKE_PRIVATE, 1) = 1
-> gethostbyname_r() also calls futex().
-> So the problem may not lie in gethostbyname() or gethostbyname_r() themselves. To verify: a separate program that only calls gethostbyname_r(), traced with strace, produces no futex() calls. The futex() calls appear here because the pthread library is linked in at compile time.
   The deadlock must still be related to it, though. A thread executing gethostbyname() had just completed
   futex(0x98e1e8, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...> (lock)
   and was cancelled before it could unlock. Other threads that reach gethostbyname() also try to take the lock; since nobody ever releases it, each one blocks there until the 20 s timeout expires and it is cancelled by the main thread.
   The main thread then creates new threads, which deadlock at the same place and are cancelled again. This cycle produces the symptom described at the beginning: after axel has been downloading for some time, progress stops.
-> From this, the cause of the problem described at the beginning can be inferred:
   A thread was cancelled at the wrong moment. It must not be cancelled before it has released the lock, but it was, because the thread function sets the cancel type with
   "pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS, &oldstate);"
   that is, asynchronous cancellation mode. Section 12.7, Cancel Options (p. 333), of Advanced Programming in the UNIX Environment, 2nd edition, says:
   "Asynchronous cancellation differs from deferred cancellation: with asynchronous cancellation the thread can be canceled at any time, rather than only when it reaches a cancellation point."
Why did the author do this?
-> He wanted to control the timeouts of gethostbyname() and connect() inside the thread function.
   connect() can be given a timeout, but gethostbyname() cannot.
How should it be changed?
-> 1 Replace gethostbyname() with the thread-safe function gethostbyname_r()
   2 Change the cancel type in the thread function from asynchronous cancellation to deferred cancellation
   3 Use non-blocking I/O and select() to implement a 10 s timeout for connect()
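Item 3 can be sketched roughly like this. It is a minimal illustration under my own assumptions, not the patch applied to axel: connect_with_timeout() is a hypothetical helper name, and a real version would also want to retry select() when it is interrupted by a signal (EINTR).

```c
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>   /* struct sockaddr_in, for callers */
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not part of axel): connect fd to addr, giving up
 * after timeout_sec seconds. Returns 0 on success, -1 on error or timeout
 * with errno set (ETIMEDOUT on timeout). */
int connect_with_timeout(int fd, const struct sockaddr *addr,
                         socklen_t len, int timeout_sec)
{
    int flags, rc, saved;

    /* Switch the socket to non-blocking mode so connect() returns at once. */
    flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        return -1;

    rc = connect(fd, addr, len);
    if (rc < 0 && errno == EINPROGRESS) {
        fd_set wfds;
        struct timeval tv = { timeout_sec, 0 };

        /* The socket becomes writable when the connect attempt finishes. */
        FD_ZERO(&wfds);
        FD_SET(fd, &wfds);
        rc = select(fd + 1, NULL, &wfds, NULL, &tv);
        if (rc == 0) {
            errno = ETIMEDOUT;          /* nothing happened in time */
            rc = -1;
        } else if (rc > 0) {
            int err = 0;
            socklen_t elen = sizeof err;

            /* SO_ERROR tells whether the connection really succeeded. */
            if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &elen) < 0) {
                rc = -1;
            } else if (err != 0) {
                errno = err;
                rc = -1;
            } else {
                rc = 0;
            }
        }
        /* rc < 0 here means select() itself failed; errno is already set. */
    }

    /* Restore the original (blocking) flags without clobbering errno. */
    saved = errno;
    fcntl(fd, F_SETFL, flags);
    errno = saved;
    return rc;
}
```

With a helper like this the thread function could call connect_with_timeout(fd, addr, len, 10) and return on its own after 10 seconds, so the main thread would no longer need asynchronous cancellation just to bound the connect step.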
Can't a timeout be set for gethostbyname_r()?
-> The DNS resolver has its own default timeout, which can be looked up with man 5 resolv.conf:
   "The default is RES_TIMEOUT (currently 5, see <resolv.h>)"
   So the timeout of gethostbyname_r() does not need to be handled here.
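To illustrate fix 1 together with the resolver default quoted above, here is a minimal sketch assuming glibc, where gethostbyname_r() is available as a GNU extension (it is not in POSIX). resolve_ipv4() and dns_default_timeout_sec() are hypothetical helper names for this example, not functions from axel.

```c
#define _DEFAULT_SOURCE         /* exposes gethostbyname_r() on glibc */
#include <netdb.h>
#include <netinet/in.h>
#include <resolv.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: resolve name to its first IPv4 address using the
 * reentrant gethostbyname_r(). All result data lives in caller-local
 * buffers, so concurrent threads do not share a static hostent.
 * Returns 0 on success, -1 on failure. */
int resolve_ipv4(const char *name, struct in_addr *out)
{
    struct hostent he;
    struct hostent *result = NULL;
    char buf[8192];             /* scratch space for the hostent strings */
    int herr = 0;

    if (gethostbyname_r(name, &he, buf, sizeof buf, &result, &herr) != 0)
        return -1;              /* lookup error, e.g. buffer too small */
    if (result == NULL)
        return -1;              /* host not found */
    if (result->h_addrtype != AF_INET || result->h_addr_list[0] == NULL)
        return -1;

    memcpy(out, result->h_addr_list[0], sizeof *out);
    return 0;
}

/* The per-query default timeout in seconds compiled into the resolver
 * headers; resolv.conf's "options timeout:n" can override it at run time. */
int dns_default_timeout_sec(void)
{
    return RES_TIMEOUT;
}
```

A numeric string such as "127.0.0.1" is converted directly without a DNS query; a real host name goes through the resolver and is bounded by the timeout described above.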
------
Gs