haproxy/src/raw_sock.c

/*
* RAW transport layer over SOCK_STREAM sockets.
*
* Copyright 2000-2012 Willy Tarreau <w@1wt.eu>
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
* as published by the Free Software Foundation; either version
* 2 of the License, or (at your option) any later version.
*
*/
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <netinet/tcp.h>
#include <haproxy/api.h>
#include <haproxy/buf.h>
#include <haproxy/connection.h>
#include <haproxy/errors.h>
#include <haproxy/fd.h>
#include <haproxy/freq_ctr.h>
#include <haproxy/global.h>
#include <haproxy/pipe.h>
#include <haproxy/stream_interface.h>
#include <haproxy/ticks.h>
#include <haproxy/time.h>
#include <haproxy/tools.h>
#if defined(USE_LINUX_SPLICE)
/* A pipe contains 16 segments max, and it's common to see segments of 1448 bytes
* because of timestamps. Use this as a hint for not looping on splice().
*/
#define SPLICE_FULL_HINT 16*1448
/* how much data we attempt to splice at once when the buffer is configured for
 * infinite forwarding */
#define MAX_SPLICE_AT_ONCE (1<<30)
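
/* Illustrative sketch (editorial note, not part of the original file):
 * SPLICE_FULL_HINT is 16 * 1448 = 23168 bytes, i.e. roughly one full pipe
 * of timestamped TCP segments. A splice-in loop can use it as a stop hint:
 * once that many bytes have been collected, the pipe is probably full, so
 * stop reading and forward everything at once instead of looping on
 * splice() until EAGAIN. In pseudocode, using the names from
 * raw_sock_to_pipe() below:
 *
 *	while (count) {
 *		ret = splice(conn->handle.fd, NULL, pipe->prod, NULL, count,
 *		             SPLICE_F_MOVE|SPLICE_F_NONBLOCK);
 *		if (ret <= 0)
 *			break;                 -- EOF, EAGAIN or error
 *		pipe->data += ret;
 *		count      -= ret;
 *		if (pipe->data >= SPLICE_FULL_HINT)
 *			break;                 -- likely 16 segments: forward now
 *	}
 */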
/* Returns :
* -1 if splice() is not supported
* >= 0 to report the amount of spliced bytes.
* connection flags are updated (error, read0, wait_room, wait_data).
* The caller must have previously allocated the pipe.
*/
int raw_sock_to_pipe(struct connection *conn, void *xprt_ctx, struct pipe *pipe, unsigned int count)
{
	int ret;
	int retval = 0;
	if (!conn_ctrl_ready(conn))
		return 0;

	if (!fd_recv_ready(conn->handle.fd))
		return 0;

	conn->flags &= ~CO_FL_WAIT_ROOM;
	errno = 0;

	/* Under Linux, if FD_POLL_HUP is set, we have reached the end.
	 * Since older splice() implementations were buggy and returned
	 * EAGAIN on end of read, let's bypass the call to splice() now.
	 */
	if (unlikely(!(fdtab[conn->handle.fd].state & FD_POLL_IN))) {
		/* stop here if we reached the end of data */
		if ((fdtab[conn->handle.fd].state & (FD_POLL_ERR|FD_POLL_HUP)) == FD_POLL_HUP)
			goto out_read0;

		/* report error on POLL_ERR before connection establishment */
		if ((fdtab[conn->handle.fd].state & FD_POLL_ERR) && (conn->flags & CO_FL_WAIT_L4_CONN)) {
			conn->flags |= CO_FL_ERROR | CO_FL_SOCK_RD_SH | CO_FL_SOCK_WR_SH;
			errno = 0; /* let the caller do a getsockopt() if it wants it */
			goto leave;
		}
	}

	while (count) {
		if (count > MAX_SPLICE_AT_ONCE)
			count = MAX_SPLICE_AT_ONCE;

		ret = splice(conn->handle.fd, NULL, pipe->prod, NULL, count,
		             SPLICE_F_MOVE|SPLICE_F_NONBLOCK);

		if (ret <= 0) {
			if (ret == 0)
				goto out_read0;
if (errno == EAGAIN) {
/* there are two reasons for EAGAIN :
* - nothing in the socket buffer (standard)
* - pipe is full
* The difference between these two situations
* is problematic. Since we don't know if the
* pipe is full, we'll stop if the pipe is not
* empty. Anyway, we will almost always fill or
* empty the pipe.
*/
if (pipe->data) {
/* always stop reading until the pipe is flushed */
conn->flags |= CO_FL_WAIT_ROOM;
break;
}
/* socket buffer exhausted */
fd_cant_recv(conn->handle.fd);
break;
}
else if (errno == ENOSYS || errno == EINVAL || errno == EBADF) {
/* splice not supported on this end, disable it.
* We can safely return -1 since there is no
* chance that any data has been piped yet.
*/
retval = -1;
goto leave;
}
else if (errno == EINTR) {
/* try again */
continue;
}
/* here we have another error */
conn->flags |= CO_FL_ERROR;
break;
} /* ret <= 0 */
retval += ret;
pipe->data += ret;
count -= ret;
if (pipe->data >= SPLICE_FULL_HINT || ret >= global.tune.recv_enough) {
/* We've read enough of it for this time, let's stop before
* being asked to poll.
*/
conn->flags |= CO_FL_WAIT_ROOM;
break;
}
} /* while */
if (unlikely(conn->flags & CO_FL_WAIT_L4_CONN) && retval)
conn->flags &= ~CO_FL_WAIT_L4_CONN;
leave:
if (retval > 0) {
/* we count the total bytes sent, and the send rate for 32-byte
* blocks. The reason for the latter is that freq_ctr are
* limited to 4GB and that it's not enough per second.
*/
_HA_ATOMIC_ADD(&global.out_bytes, retval);
_HA_ATOMIC_ADD(&global.spliced_out_bytes, retval);
update_freq_ctr(&global.out_32bps, (retval + 16) / 32);
}
return retval;
out_read0:
conn_sock_read0(conn);
conn->flags &= ~CO_FL_WAIT_L4_CONN;
goto leave;
}
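A minimal standalone sketch (hypothetical, not HAProxy code) of the EAGAIN ambiguity the comment above describes: after a non-blocking splice() has moved data from a socket into a pipe, a second EAGAIN cannot distinguish "socket empty" from "pipe full", so the caller must fall back to checking how much data is already in the pipe. `demo_splice_eagain` is an illustrative name; the example assumes Linux with a kernel recent enough to splice from AF_UNIX stream sockets (>= 4.3).

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns the number of bytes moved by a first non-blocking splice()
 * from a socketpair into a pipe, but only if a second call then fails
 * with EAGAIN (socket drained) while the pipe still holds data -- the
 * exact ambiguity the code above resolves via CO_FL_WAIT_ROOM.
 * Returns -1 on any unexpected outcome.
 */
static long demo_splice_eagain(void)
{
    int sv[2], p[2];
    long first;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0 || pipe(p) < 0)
        return -1;

    /* splice() may still block on a blocking fd despite
     * SPLICE_F_NONBLOCK, so mark the receiving side non-blocking. */
    if (fcntl(sv[0], F_SETFL, O_NONBLOCK) < 0)
        return -1;

    if (write(sv[1], "hello", 5) != 5)          /* feed the "socket" */
        return -1;

    first = splice(sv[0], NULL, p[1], NULL, 4096,
                   SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
    if (first <= 0)
        return -1;

    /* socket now empty but the pipe is not: splice() reports EAGAIN
     * either way, so a real consumer must check pipe->data to decide
     * whether to stop reading or to stop polling the socket. */
    if (splice(sv[0], NULL, p[1], NULL, 4096,
               SPLICE_F_MOVE | SPLICE_F_NONBLOCK) != -1 || errno != EAGAIN)
        return -1;

    close(sv[0]); close(sv[1]); close(p[0]); close(p[1]);
    return first;
}
```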
/* Send as many bytes as possible from the pipe to the connection's socket.
*/
int raw_sock_from_pipe(struct connection *conn, void *xprt_ctx, struct pipe *pipe)
{
int ret, done;
if (!conn_ctrl_ready(conn))
return 0;
if (!fd_send_ready(conn->handle.fd))
return 0;
if (conn->flags & CO_FL_SOCK_WR_SH) {
/* it's already closed */
conn->flags |= CO_FL_ERROR | CO_FL_SOCK_RD_SH;
errno = EPIPE;
return 0;
}
done = 0;
while (pipe->data) {
ret = splice(pipe->cons, NULL, conn->handle.fd, NULL, pipe->data,
SPLICE_F_MOVE|SPLICE_F_NONBLOCK);
if (ret <= 0) {
if (ret == 0 || errno == EAGAIN) {
fd_cant_send(conn->handle.fd);
break;
}
else if (errno == EINTR)
continue;
/* here we have another error */
conn->flags |= CO_FL_ERROR;
break;
}
done += ret;
pipe->data -= ret;
}
if (unlikely(conn->flags & CO_FL_WAIT_L4_CONN) && done) {
conn->flags &= ~CO_FL_WAIT_L4_CONN;
}
return done;
}
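The send loop above can be sketched in isolation as follows (a hypothetical helper, not HAProxy code): drain a pipe's read end into a socket, retrying on EINTR and stopping on EAGAIN, shutdown, or error, exactly like the `while (pipe->data)` loop in `raw_sock_from_pipe`. Assumes Linux.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Forwards up to <len> bytes from pipe read-end <cons> into socket
 * <fd>. Retries on EINTR; stops on EAGAIN (socket buffer full) or
 * any other error. Returns the number of bytes forwarded.
 */
static long drain_pipe_to_sock(int cons, int fd, long len)
{
    long done = 0;

    while (len > 0) {
        long ret = splice(cons, NULL, fd, NULL, len,
                          SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
        if (ret <= 0) {
            if (ret < 0 && errno == EINTR)
                continue;       /* interrupted: try again */
            break;              /* EAGAIN, shutdown or error: stop */
        }
        done += ret;
        len  -= ret;
    }
    return done;
}
```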
#endif /* USE_LINUX_SPLICE */
/* Receive up to <count> bytes from connection <conn>'s socket and store them
* into buffer <buf>. Only one call to recv() is performed, unless the
* buffer wraps, in which case a second call may be performed. The connection's
* flags are updated with whatever special event is detected (error, read0,
* empty). The caller is responsible for taking care of those events and
* avoiding the call if inappropriate. The function does not call the
* connection's polling update function, so the caller is responsible for this.
* errno is cleared before starting so that the caller knows that if it spots an
* error without errno, it's pending and can be retrieved via getsockopt(SO_ERROR).
*/
static size_t raw_sock_to_buf(struct connection *conn, void *xprt_ctx, struct buffer *buf, size_t count, int flags)
{
	ssize_t ret;
	size_t try, done = 0;

	if (!conn_ctrl_ready(conn))
		return 0;

	if (!fd_recv_ready(conn->handle.fd))
		return 0;

	conn->flags &= ~CO_FL_WAIT_ROOM;
	errno = 0;

	if (unlikely(!(fdtab[conn->handle.fd].state & FD_POLL_IN))) {
		/* stop here if we reached the end of data */
		if ((fdtab[conn->handle.fd].state & (FD_POLL_ERR|FD_POLL_HUP)) == FD_POLL_HUP)
			goto read0;

		/* report error on POLL_ERR before connection establishment */
		if ((fdtab[conn->handle.fd].state & FD_POLL_ERR) && (conn->flags & CO_FL_WAIT_L4_CONN)) {
			conn->flags |= CO_FL_ERROR | CO_FL_SOCK_RD_SH | CO_FL_SOCK_WR_SH;
			goto leave;
		}
	}

	/* read the largest possible block. For this, we perform only one call
	 * to recv() unless the buffer wraps and we exactly fill the first hunk,
	 * in which case we accept to do it once again. A new attempt is made on
	 * EINTR too.
	 */
	while (count > 0) {
		try = b_contig_space(buf);
		if (!try)
			break;

		if (try > count)
			try = count;

		ret = recv(conn->handle.fd, b_tail(buf), try, 0);

		if (ret > 0) {
			b_add(buf, ret);
			done += ret;
			if (ret < try) {
				/* socket buffer exhausted */
				fd_cant_recv(conn->handle.fd);

				/* unfortunately, on level-triggered events, POLL_HUP
				 * is generally delivered AFTER the system buffer is
				 * empty, unless the poller supports POLL_RDHUP. If
				 * we know this is the case, we don't try to read more
				 * as we know there's no more available. Similarly, if
				 * there's no problem with lingering we don't even try
				 * to read an unlikely close from the client since we'll
				 * close first anyway.
				 */
				if (fdtab[conn->handle.fd].state & FD_POLL_HUP)
					goto read0;

				if (!(fdtab[conn->handle.fd].state & FD_LINGER_RISK) ||
				    (cur_poller.flags & HAP_POLL_F_RDHUP)) {
					break;
				}
			}
			count -= ret;

			if (flags & CO_RFL_READ_ONCE)
				break;
		}
		else if (ret == 0) {
			goto read0;
		}
		else if (errno == EAGAIN || errno == ENOTCONN) {
			/* socket buffer exhausted */
			fd_cant_recv(conn->handle.fd);
			break;
		}
		else if (errno != EINTR) {
			conn->flags |= CO_FL_ERROR | CO_FL_SOCK_RD_SH | CO_FL_SOCK_WR_SH;
			break;
		}
	}

	if (unlikely(conn->flags & CO_FL_WAIT_L4_CONN) && done)
		conn->flags &= ~CO_FL_WAIT_L4_CONN;

 leave:
	return done;

 read0:
	conn_sock_read0(conn);
	conn->flags &= ~CO_FL_WAIT_L4_CONN;

	/* Now a final check for a possible asynchronous low-level error
	 * report. This can happen when a connection receives a reset
	 * after a shutdown, both POLL_HUP and POLL_ERR are queued, and
	 * we might have come from there by just checking POLL_HUP instead
	 * of recv()'s return value 0, so we have no way to tell there was
	 * an error without checking.
	 */
	if (unlikely(fdtab[conn->handle.fd].state & FD_POLL_ERR))
		conn->flags |= CO_FL_ERROR | CO_FL_SOCK_RD_SH | CO_FL_SOCK_WR_SH;
	goto leave;
}
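To make the wrap handling above concrete, here is a standalone sketch of the two-hunk fill pattern: the caller keeps copying as long as a contiguous free hunk exists, so a read that wraps the buffer needs at most two copies per pass. This toy ring buffer is purely illustrative (it is NOT haproxy's `struct buffer` API; `ring_contig_space()` only mirrors the role of `b_contig_space()`).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy ring buffer used only to illustrate the wrap-aware fill loop. */
struct ring {
	char   area[8];
	size_t head;   /* index of the first pending byte */
	size_t data;   /* number of pending bytes */
};

/* contiguous free space starting at the tail (role of b_contig_space()) */
static size_t ring_contig_space(const struct ring *r)
{
	size_t tail = (r->head + r->data) % sizeof(r->area);
	size_t free_total = sizeof(r->area) - r->data;
	size_t to_end = sizeof(r->area) - tail;
	return free_total < to_end ? free_total : to_end;
}

/* append up to <len> bytes, wrapping like the recv() loop above: one copy
 * per contiguous hunk, a second one only when the first hunk filled exactly */
static size_t ring_put(struct ring *r, const char *src, size_t len)
{
	size_t done = 0;

	while (len) {
		size_t try = ring_contig_space(r);
		if (!try)
			break;
		if (try > len)
			try = len;
		memcpy(r->area + (r->head + r->data) % sizeof(r->area), src + done, try);
		r->data += try;
		done += try;
		len -= try;
	}
	return done;
}
```

With `head = 6` in an 8-byte area, the first hunk is 2 bytes at the end; a 4-byte write lands as two copies ("ab" at indexes 6-7, then "cd" at 0-1), exactly the second-call case the comment above describes.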

/* Send up to <count> pending bytes from buffer <buf> to connection <conn>'s
 * socket. <flags> may contain some CO_SFL_* flags to hint the system about
 * other pending data for example, but these flags are ignored at the moment.
 * Only one call to send() is performed, unless the buffer wraps, in which case
 * a second call may be performed. The connection's flags are updated with
 * whatever special event is detected (error, empty). The caller is responsible
 * for taking care of those events and avoiding the call if inappropriate. The
 * function does not call the connection's polling update function, so the caller
 * is responsible for this. It's up to the caller to update the buffer's contents
 * based on the return value.
 */
static size_t raw_sock_from_buf(struct connection *conn, void *xprt_ctx, const struct buffer *buf, size_t count, int flags)
{
ssize_t ret;
size_t try, done;
int send_flag;
[MAJOR] complete support for linux 2.6 kernel splicing This code provides support for linux 2.6 kernel splicing. This feature appeared in kernel 2.6.25, but initial implementations were awkward and buggy. A kernel >= 2.6.29-rc1 is recommended, as well as some optimization patches. Using pipes, this code is able to pass network data directly between sockets. The pipes are a bit annoying to manage (fd creation, release, ...) but finally work quite well. Preliminary tests show that on high bandwidths, there's a substantial gain (approx +50%, only +20% with kernel workarounds for corruption bugs). With 2000 concurrent connections, with Myricom NICs, haproxy now more easily achieves 4.5 Gbps for 1 process and 6 Gbps for two processes buffers. 8-9 Gbps are easily reached with smaller numbers of connections. We also try to splice out immediately after a splice in by making profit from the new ability for a data producer to notify the consumer that data are available. Doing this ensures that the data are immediately transferred between sockets without latency, and without having to re-poll. Performance on small packets has considerably increased due to this method. Earlier kernels return only one TCP segment at a time in non-blocking splice-in mode, while newer return as many segments as may fit in the pipe. To work around this limitation without hurting more recent kernels, we try to collect as much data as possible, but we stop when we believe we have read 16 segments, then we forward everything at once. It also ensures that even upon shutdown or EAGAIN the data will be forwarded. Some tricks were necessary because the splice() syscall does not make a difference between missing data and a pipe full, it always returns EAGAIN. The trick consists in stop polling in case of EAGAIN and a non empty pipe. The receiver waits for the buffer to be empty before using the pipe. This is in order to avoid confusion between buffer data and pipe data. 
The BF_EMPTY flag now covers the pipe too. Right now the code is disabled by default. It needs to be built with CONFIG_HAP_LINUX_SPLICE, and the instances intented to use splice() must have "option splice-response" (or option splice-request) enabled. It is probably desirable to keep a pool of pre-allocated pipes to avoid having to create them for every session. This will be worked on later. Preliminary tests show very good results, even with the kernel workaround causing one memcpy(). At 3000 connections, performance has moved from 3.2 Gbps to 4.7 Gbps.
2009-01-18 18:32:22 -05:00
if (!conn_ctrl_ready(conn))
return 0;
if (!fd_send_ready(conn->handle.fd))
MAJOR: connection: add two new flags to indicate readiness of control/transport Currently the control and transport layers of a connection are supposed to be initialized when their respective pointers are not NULL. This will not work anymore when we plan to reuse connections, because there is an asymmetry between the accept() side and the connect() side : - on accept() side, the fd is set first, then the ctrl layer then the transport layer ; upon error, they must be undone in the reverse order, then the FD must be closed. The FD must not be deleted if the control layer was not yet initialized ; - on the connect() side, the fd is set last and there is no reliable way to know if it has been initialized or not. In practice it's initialized to -1 first but this is hackish and supposes that local FDs only will be used forever. Also, there are even less solutions for keeping trace of the transport layer's state. Also it is possible to support delayed close() when something (eg: logs) tracks some information requiring the transport and/or control layers, making it even more difficult to clean them. So the proposed solution is to add two flags to the connection : - CO_FL_CTRL_READY is set when the control layer is initialized (fd_insert) and cleared after it's released (fd_delete). - CO_FL_XPRT_READY is set when the control layer is initialized (xprt->init) and cleared after it's released (xprt->close). The functions have been adapted to rely on this and not on the pointers anymore. conn_xprt_close() was unused and dangerous : it did not close the control layer (eg: the socket itself) but still marks the transport layer as closed, preventing any future call to conn_full_close() from finishing the job. The problem comes from conn_full_close() in fact. It needs to close the xprt and ctrl layers independantly. After that we're still having an issue : we don't know based on ->ctrl alone whether the fd was registered or not. 
For this we use the two new flags CO_FL_XPRT_READY and CO_FL_CTRL_READY. We now rely on this and not on conn->xprt nor conn->ctrl anymore to decide what remains to be done on the connection. In order not to miss some flag assignments, we introduce conn_ctrl_init() to initialize the control layer, register the fd using fd_insert() and set the flag, and conn_ctrl_close() which unregisters the fd and removes the flag, but only if the transport layer was closed. Similarly, at the transport layer, conn_xprt_init() calls ->init and sets the flag, while conn_xprt_close() checks the flag, calls ->close and clears the flag, regardless xprt_ctx or xprt_st. This also ensures that the ->init and the ->close functions are called only once each and in the correct order. Note that conn_xprt_close() does nothing if the transport layer is still tracked. conn_full_close() now simply calls conn_xprt_close() then conn_full_close() in turn, which do nothing if CO_FL_XPRT_TRACKED is set. In order to handle the error path, we also provide conn_force_close() which ignores CO_FL_XPRT_TRACKED and closes the transport and the control layers in turns. All relevant instances of fd_delete() have been replaced with conn_force_close(). Now we always know what state the connection is in and we can expect to split its initialization.
2013-10-21 10:30:56 -04:00
return 0;
if (conn->flags & CO_FL_SOCK_WR_SH) {
/* it's already closed */
conn->flags |= CO_FL_ERROR | CO_FL_SOCK_RD_SH;
errno = EPIPE;
return 0;
}
done = 0;
/* send the largest possible block. For this we perform only one call
* to send() unless the buffer wraps and we exactly fill the first hunk,
* in which case we accept to do it once again.
[MAJOR] complete support for linux 2.6 kernel splicing This code provides support for linux 2.6 kernel splicing. This feature appeared in kernel 2.6.25, but initial implementations were awkward and buggy. A kernel >= 2.6.29-rc1 is recommended, as well as some optimization patches. Using pipes, this code is able to pass network data directly between sockets. The pipes are a bit annoying to manage (fd creation, release, ...) but finally work quite well. Preliminary tests show that on high bandwidths, there's a substantial gain (approx +50%, only +20% with kernel workarounds for corruption bugs). With 2000 concurrent connections, with Myricom NICs, haproxy now more easily achieves 4.5 Gbps for 1 process and 6 Gbps for two processes buffers. 8-9 Gbps are easily reached with smaller numbers of connections. We also try to splice out immediately after a splice in by making profit from the new ability for a data producer to notify the consumer that data are available. Doing this ensures that the data are immediately transferred between sockets without latency, and without having to re-poll. Performance on small packets has considerably increased due to this method. Earlier kernels return only one TCP segment at a time in non-blocking splice-in mode, while newer return as many segments as may fit in the pipe. To work around this limitation without hurting more recent kernels, we try to collect as much data as possible, but we stop when we believe we have read 16 segments, then we forward everything at once. It also ensures that even upon shutdown or EAGAIN the data will be forwarded. Some tricks were necessary because the splice() syscall does not make a difference between missing data and a pipe full, it always returns EAGAIN. The trick consists in stop polling in case of EAGAIN and a non empty pipe. The receiver waits for the buffer to be empty before using the pipe. This is in order to avoid confusion between buffer data and pipe data. 
The BF_EMPTY flag now covers the pipe too. Right now the code is disabled by default. It needs to be built with CONFIG_HAP_LINUX_SPLICE, and the instances intented to use splice() must have "option splice-response" (or option splice-request) enabled. It is probably desirable to keep a pool of pre-allocated pipes to avoid having to create them for every session. This will be worked on later. Preliminary tests show very good results, even with the kernel workaround causing one memcpy(). At 3000 connections, performance has moved from 3.2 Gbps to 4.7 Gbps.
2009-01-18 18:32:22 -05:00
*/
while (count) {
try = b_contig_data(buf, done);
if (try > count)
try = count;
send_flag = MSG_DONTWAIT | MSG_NOSIGNAL;
if (try < count || flags & CO_SFL_MSG_MORE)
send_flag |= MSG_MORE;
ret = send(conn->handle.fd, b_peek(buf, done), try, send_flag);
if (ret > 0) {
count -= ret;
done += ret;
/* if the system buffer is full, don't insist */
if (ret < try) {
fd_cant_send(conn->handle.fd);
break;
}
if (!count)
fd_stop_send(conn->handle.fd);
}
else if (ret == 0 || errno == EAGAIN || errno == ENOTCONN || errno == EINPROGRESS) {
/* nothing written, we need to poll for write first */
fd_cant_send(conn->handle.fd);
break;
}
else if (errno != EINTR) {
conn->flags |= CO_FL_ERROR | CO_FL_SOCK_RD_SH | CO_FL_SOCK_WR_SH;
break;
}
}
MEDIUM: connection: enable reading only once the connection is confirmed In order to address the absurd polling sequence described in issue #253, let's make sure we disable receiving on a connection until it's established. Previously with bottom-top I/Os, we were almost certain that a connection was ready when the first I/O was confirmed. Now we can enter various functions, including process_stream(), which will attempt to read something, will fail, and will then subscribe. But we don't want them to try to receive if we know the connection didn't complete. The first prerequisite for this is to mark the connection as not ready for receiving until it's validated. But we don't want to mark it as not ready for sending because we know that attempting I/Os later is extremely likely to work without polling. Once the connection is confirmed we re-enable recv readiness. In order for this event to be taken into account, the call to tcp_connect_probe() was moved earlier, between the attempt to send() and the attempt to recv(). This way if tcp_connect_probe() enables reading, we have a chance to immediately fall back to this and read the possibly pending data. Now the trace looks like the following. 
It's far from being perfect but we've already saved one recvfrom() and one epollctl(): epoll_wait(3, [], 200, 0) = 0 socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 7 fcntl(7, F_SETFL, O_RDONLY|O_NONBLOCK) = 0 setsockopt(7, SOL_TCP, TCP_NODELAY, [1], 4) = 0 connect(7, {sa_family=AF_INET, sin_port=htons(8000), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress) epoll_ctl(3, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLOUT|EPOLLRDHUP, {u32=7, u64=7}}) = 0 epoll_wait(3, [{EPOLLOUT, {u32=7, u64=7}}], 200, 1000) = 1 connect(7, {sa_family=AF_INET, sin_port=htons(8000), sin_addr=inet_addr("127.0.0.1")}, 16) = 0 getsockopt(7, SOL_SOCKET, SO_ERROR, [0], [4]) = 0 sendto(7, "OPTIONS / HTTP/1.0\r\n\r\n", 22, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 22 epoll_ctl(3, EPOLL_CTL_MOD, 7, {EPOLLIN|EPOLLRDHUP, {u32=7, u64=7}}) = 0 epoll_wait(3, [{EPOLLIN|EPOLLRDHUP, {u32=7, u64=7}}], 200, 1000) = 1 getsockopt(7, SOL_SOCKET, SO_ERROR, [0], [4]) = 0 getsockopt(7, SOL_SOCKET, SO_ERROR, [0], [4]) = 0 recvfrom(7, "HTTP/1.0 200\r\nContent-length: 0\r\nX-req: size=22, time=0 ms\r\nX-rsp: id=dummy, code=200, cache=1, size=0, time=0 ms (0 real)\r\n\r\n", 16384, 0, NULL, NULL) = 126 close(7) = 0
2019-09-05 11:05:05 -04:00
if (unlikely(conn->flags & CO_FL_WAIT_L4_CONN) && done) {
conn->flags &= ~CO_FL_WAIT_L4_CONN;
}
if (done > 0) {
		/* we count the total bytes sent, and the send rate for 32-byte
		 * blocks. The reason for the latter is that freq_ctr values are
		 * limited to 4G per period, which is not enough to count
		 * individual bytes per second on fast links.
		 */
_HA_ATOMIC_ADD(&global.out_bytes, done);
update_freq_ctr(&global.out_32bps, (done + 16) / 32);
}
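	/* Example of the rounding above: with done = 100, (done + 16) / 32
	 * yields 3, i.e. the counter advances by three 32-byte blocks,
	 * rounded to the nearest block. Counting 32-byte blocks instead of
	 * bytes raises the measurable ceiling of the 32-bit freq_ctr from
	 * 4 GB/s to 128 GB/s.
	 */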
return done;
}

/* Called from the upper layer, to subscribe <es> to events <event_type>. The
* event subscriber <es> is not allowed to change from a previous call as long
* as at least one event is still subscribed. The <event_type> must only be a
* combination of SUB_RETRY_RECV and SUB_RETRY_SEND. It always returns 0.
*/
static int raw_sock_subscribe(struct connection *conn, void *xprt_ctx, int event_type, struct wait_event *es)
{
return conn_subscribe(conn, xprt_ctx, event_type, es);
}

/* Called from the upper layer, to unsubscribe <es> from events <event_type>.
* The <es> pointer is not allowed to differ from the one passed to the
* subscribe() call. It always returns zero.
*/
static int raw_sock_unsubscribe(struct connection *conn, void *xprt_ctx, int event_type, struct wait_event *es)
{
return conn_unsubscribe(conn, xprt_ctx, event_type, es);
}

static void raw_sock_close(struct connection *conn, void *xprt_ctx)
{
if (conn->subs != NULL) {
conn_unsubscribe(conn, NULL, conn->subs->events, conn->subs);
}
}

/* We can't have an underlying XPRT, so just return -1 to signify failure */
static int raw_sock_remove_xprt(struct connection *conn, void *xprt_ctx, void *toremove_ctx, const struct xprt_ops *newops, void *newctx)
{
	/* This is the lowest xprt we can have, so if we get here we didn't
	 * find the xprt we wanted to remove; that's a bug.
	 */
BUG_ON(1);
return -1;
}

/* transport-layer operations for RAW sockets */
static struct xprt_ops raw_sock = {
.snd_buf = raw_sock_from_buf,
.rcv_buf = raw_sock_to_buf,
.subscribe = raw_sock_subscribe,
.unsubscribe = raw_sock_unsubscribe,
.remove_xprt = raw_sock_remove_xprt,
#if defined(USE_LINUX_SPLICE)
.rcv_pipe = raw_sock_to_pipe,
.snd_pipe = raw_sock_from_pipe,
#endif
.shutr = NULL,
.shutw = NULL,
.close = raw_sock_close,
.name = "RAW",
};

__attribute__((constructor))
static void __raw_sock_init(void)
{
xprt_register(XPRT_RAW, &raw_sock);
}
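
/* Once registered this way, upper layers can look the ops back up by
 * id, e.g. (assuming xprt_get(), the lookup counterpart of
 * xprt_register()):
 *   const struct xprt_ops *ops = xprt_get(XPRT_RAW);
 */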
/*
* Local variables:
* c-indent-level: 8
* c-basic-offset: 8
* End:
*/