Value of one of the thread-argument variable has changed at some place, which is only detected when the program crashes.
(gdb) p *a$7 = { host = 0x0, port = 0, h_ent = 0x7fff5fbf6f10, request = 0x100005620 "UH??H??? ", reqlen = 140734799803600, timeout = 0x100004e56, s = 0x100281000, cnt = 1606381568, sargs = 0x1001000e0}
Look at the garbage.
So, watchpoint.
watch a
Isn' t helpful at all, since there is no write to the variable A. Have to use memory location watch, which is
(gdb) watch -location aHardware watchpoint 7: *(struct args **) 4297592504
(GDB shipped with OS X doesn' t recognize watch-l (as specified in the help message ).
Then the debugger kindly pauses my program
(gdb) cContinuing.Hardware watchpoint 6 deleted because the program has left the block in which its expression is valid.READ: 8192, TOTAL: 8192Hardware watchpoint 7: *(struct args **) 4297592504Old value = (struct args *) 0x7fff5fbf6f40New value = (struct args *) 0x7fff5fbf6f000x0000000100002641 in parse_response (buf=0x10027eeb8 "HTTP/1.1 200 OK\r\nVary: Accept-Encoding\r\nContent-Type: image/x-icon\r\nLast-Modified: Tue, 14 Aug 2012 15:19:23 GMT\r\nDate: Wed, 17 Apr 2013 14:07:54 GMT\r\nExpires: Thu, 25 Apr 2013 14:07:54 GMT\r\nX-Content"..., len=8192, a=0x7fff5fbf7000) at md.c:512512 nextproc[len] = 0;(gdb) bt#0 0x0000000100002641 in parse_response (buf=0x10027eeb8 "HTTP/1.1 200 OK\r\nVary: Accept-Encoding\r\nContent-Type: image/x-icon\r\nLast-Modified: Tue, 14 Aug 2012 15:19:23 GMT\r\nDate: Wed, 17 Apr 2013 14:07:54 GMT\r\nExpires: Thu, 25 Apr 2013 14:07:54 GMT\r\nX-Content"..., len=8192, a=0x7fff5fbf7000) at md.c:512#1 0x0000000100005c10 in recv_engine (p=0x7fff5fbf6f40) at engine.c:394#2 0x00007fff88c637a2 in _pthread_start ()#3 0x00007fff88c501e1 in thread_start ()(gdb) info locals readlen = 8192sfd = 3i = 0n = 3quit = 0_i = 0a = (struct args *) 0x7fff5fbf6f00buf = "HTTP/1.1 200 OK\r\nVary: Accept-Encoding\r\nContent-Type: image/x-icon\r\nLast-Modified: Tue, 14 Aug 2012 15:19:23 GMT\r\nDate: Wed, 17 Apr 2013 14:07:54 GMT\r\nExpires: Thu, 25 Apr 2013 14:07:54 GMT\r\nX-Content"...go_loop = 1total = 8192error_exit = 0out_rwx.86 = (int (*)[1]) 0x10027ee20
(gdb) l parse_response 502 503 return bytes_processed - bufsz_orig;504}505506int // return 0 for normal, 1 for all done, -1 for error507parse_response (char *buf, long len, struct session_args *a) {508 // responsible for parsing the status line 509 // and handle the rest to different parse sub routines510 char *nextinput = buf;511 char *nextproc = buf;(gdb) l512 nextproc[len] = 0;
(gdb) l engine.c:366361 || a->sargs[i].rtt_timer[0].tv_usec) {362 // we only pause for the first time363 timer_pause (a->sargs[i].rtt_timer);364 }365 //printf ("Read %d\n", sfd);366 int readlen = recv (sfd, buf, sizeof(buf) - 1, 0);367 if (readlen == 0) {368 dprintf ("READ 0 <%d>\n", a->s[i]);369 close (a->s[i]);370 if (a->sargs[i].repeat_recv > 0) {
Everything is just crystal clear now. I did an Recv using the whole buffer, at line 366, then passed to parse_response, which expected one more byte at end of buffer so I cocould append a' \ 0' for easier string operations (line 512), of which location overruns
Into & A, the variable that shouldn't have been changed.
It's weird that valgrind didn't detect this. Maybe there is a stack checking tool in experimental stage I remember but this definitely is not in the memcheck tool.
Anyway, looks like it's working again now.
A few more insights:
1. This bug didn't reveal until I cleared out some recursion code, tests the Code with multiple threads, on a line that has delays and drops.
2. GCC built binary on Linux didn't gave me trouble. llvm on OS X shows up problems 8/10 times.
3. Read the code, make asserts base on assumptions. I did this in the parsing module, not in the boundary of networking and parsing modules.
4. Watch-location is useful.
5. Still didn't figure out what is "-fstack-protector" for in gcc/clang
Update 0424:
Using valgrind exp-sgcheck
==25634== Invalid read of size 1==25634== at 0x4C2C008: index (h_intercepts.c:89)==25634== by 0x402BB6: parse_response (md.c:552)==25634== by 0x403A5F: test_parse (md.c:896)==25634== by 0x403CAF: main (md.c:951)==25634== Address 0x7ff000430 expected vs actual:==25634== Expected: stack array "buf" of size 8,192 in frame 2 back from here==25634== Actual: unknown==25634== Actual: is 0 after Expected
Valgrind sgcheck found one occurrence, which is a few accesses after the one bug I found using GDB as shown above. But it cocould still be useful somemaytimes be.