haproxy/include/common
Willy Tarreau 2b57cb8f30 MEDIUM: protocol: implement a "drain" function in protocol layers
Since commit cfd97c6f was merged into 1.5-dev14 (BUG/MEDIUM: checks:
prevent TIME_WAITs from appearing also on timeouts), some valid health
checks sometimes used to show some TCP resets. For example, this HTTP
health check sent to a local server :

  19:55:15.742818 IP 127.0.0.1.16568 > 127.0.0.1.8000: S 3355859679:3355859679(0) win 32792 <mss 16396,nop,nop,sackOK,nop,wscale 7>
  19:55:15.742841 IP 127.0.0.1.8000 > 127.0.0.1.16568: S 1060952566:1060952566(0) ack 3355859680 win 32792 <mss 16396,nop,nop,sackOK,nop,wscale 7>
  19:55:15.742863 IP 127.0.0.1.16568 > 127.0.0.1.8000: . ack 1 win 257
  19:55:15.745402 IP 127.0.0.1.16568 > 127.0.0.1.8000: P 1:23(22) ack 1 win 257
  19:55:15.745488 IP 127.0.0.1.8000 > 127.0.0.1.16568: FP 1:146(145) ack 23 win 257
  19:55:15.747109 IP 127.0.0.1.16568 > 127.0.0.1.8000: R 23:23(0) ack 147 win 257

After some discussion with Chris Huang-Leaver, it appeared clear that
what we want is to only send the RST when we have no other choice, which
means when the server has not closed. So we still keep SYN/SYN-ACK/RST
for pure TCP checks, but don't want to see an RST emitted as above when
the server has already sent the FIN.

The solution against this consists in implementing a "drain" function at
the protocol layer, which, when defined, causes as much as possible of
the input socket buffer to be flushed to make recv() return zero so that
we know that the server's FIN was received and ACKed. On Linux, we can make
use of MSG_TRUNC on TCP sockets, which has the benefit of draining everything
at once without even copying data. On other platforms, we read up to one
buffer of data before the close. If recv() manages to get the final zero,
we don't disable lingering. Same for hard errors. Otherwise we do.

In practice, on HTTP health checks we generally find that the close was
pending and is returned upon first recv() call. The network trace becomes
cleaner :

  19:55:23.650621 IP 127.0.0.1.16561 > 127.0.0.1.8000: S 3982804816:3982804816(0) win 32792 <mss 16396,nop,nop,sackOK,nop,wscale 7>
  19:55:23.650644 IP 127.0.0.1.8000 > 127.0.0.1.16561: S 4082139313:4082139313(0) ack 3982804817 win 32792 <mss 16396,nop,nop,sackOK,nop,wscale 7>
  19:55:23.650666 IP 127.0.0.1.16561 > 127.0.0.1.8000: . ack 1 win 257
  19:55:23.651615 IP 127.0.0.1.16561 > 127.0.0.1.8000: P 1:23(22) ack 1 win 257
  19:55:23.651696 IP 127.0.0.1.8000 > 127.0.0.1.16561: FP 1:146(145) ack 23 win 257
  19:55:23.652628 IP 127.0.0.1.16561 > 127.0.0.1.8000: F 23:23(0) ack 147 win 257
  19:55:23.652655 IP 127.0.0.1.8000 > 127.0.0.1.16561: . ack 24 win 257

This change should be backported to 1.4 which is where Chris encountered
this issue. The code is different, so probably the tcp_drain() function
will have to be put in the checks only.
2013-06-10 20:33:23 +02:00
..
accept4.h BUILD: accept4: move the socketcall declaration outside of accept4() 2012-10-10 17:42:39 +02:00
appsession.h [MINOR] Make appsess{,ion}_refresh static 2011-06-25 21:07:01 +02:00
base64.h [MINOR] add encode/decode function for 30-bit integers from/to base64 2010-10-30 19:04:33 +02:00
buffer.h OPTIM: buffer: remove one jump in buffer_count() 2013-04-02 01:25:57 +02:00
cfgparse.h MINOR: config: make str2listener() use memprintf() to report errors. 2012-09-24 10:53:16 +02:00
chunk.h MINOR: chunks: centralize the trash chunk allocation 2012-12-23 21:46:07 +01:00
compat.h MEDIUM: protocol: implement a "drain" function in protocol layers 2013-06-10 20:33:23 +02:00
compiler.h CLEANUP: ebtree: clarify licence and update to 6.0.6 2011-12-02 17:09:49 +01:00
config.h [BUG] definitely fix regparm issues between haproxy core and ebtree 2009-10-27 21:53:58 +01:00
debug.h [MINOR] term_trace: add better instrumentations to trace the code 2008-08-16 14:55:08 +02:00
defaults.h MINOR: defaults: allow REQURI_LEN and CAPTURE_LEN to be redefined 2013-06-10 10:30:09 +02:00
epoll.h MAJOR: polling: replace epoll with sepoll and remove sepoll 2012-11-11 20:53:30 +01:00
errors.h [MINOR] errors: provide new status codes for config parsing functions 2010-08-10 14:01:15 +02:00
memory.h MEDIUM: memory: add the ability to poison memory at run time 2012-05-08 21:28:16 +02:00
mini-clist.h [MINOR] prepare req_*/rsp_* to receive a condition 2010-01-28 18:10:50 +01:00
rbtree.h [MINOR] imported the rbtree function from Linux kernel 2007-01-07 02:12:57 +01:00
regex.h BUG/MINOR: acl: fix a double free during exit when using PCRE_JIT 2013-05-08 18:14:18 +02:00
sessionhash.h [MAJOR] remove files distributed under an obscure license 2007-09-09 21:56:53 +02:00
splice.h [REORG] build: move syscall redefinition to specific places 2011-08-23 00:11:25 +02:00
standard.h MEDIUM: stats: add proxy name filtering on the statistic page 2013-04-15 22:50:33 +02:00
syscall.h BUILD/MINOR: syscall: add definition of NR_accept4 for ARM 2013-03-04 07:38:08 +01:00
template.h [CLEANUP] included common/version.h everywhere 2006-06-29 18:54:54 +02:00
ticks.h [MEDIUM] scheduler: get rid of the 4 trees thanks and use ebtree v4.1 2009-03-21 10:25:14 +01:00
time.h BUG/MINOR: time: frequency counters are not totally accurate 2012-12-29 21:50:07 +01:00
tools.h [MINOR] tools: add two macros MID_RANGE and MAX_RANGE 2011-03-28 15:55:43 +02:00
uri_auth.h [REORG] http: move the http-request rules to proto_http 2011-03-13 22:00:24 +01:00
version.h [CLEANUP] reference product branch 1.5 2010-08-27 11:09:17 +02:00