/*
 * RAW transport layer over SOCK_STREAM sockets.
 *
 * Copyright 2000-2012 Willy Tarreau <w@1wt.eu>
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License
 * as published by the Free Software Foundation; either version
 * 2 of the License, or (at your option) any later version.
 *
 */

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <netinet/tcp.h>

#include <haproxy/api.h>
#include <haproxy/buf.h>
#include <haproxy/connection.h>
#include <haproxy/errors.h>
#include <haproxy/fd.h>
#include <haproxy/freq_ctr.h>
#include <haproxy/global.h>
#include <haproxy/pipe.h>
#include <haproxy/stream_interface.h>
#include <haproxy/ticks.h>
#include <haproxy/time.h>
#include <haproxy/tools.h>


#if defined(USE_LINUX_SPLICE)

/* A pipe contains 16 segments max, and it's common to see segments of 1448 bytes
 * because of timestamps. Use this as a hint for not looping on splice().
 */
#define SPLICE_FULL_HINT	16*1448
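/* (16 segments * 1448 bytes = 23168 bytes; 1448 is the typical MSS on a
 * 1500-byte MTU link once the 12-byte TCP timestamp option is in use.)
 */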

/* how much data we attempt to splice at once when the buffer is configured for
 * infinite forwarding */
#define MAX_SPLICE_AT_ONCE	(1<<30)
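
/* Note: an "infinite" forward length (BUF_INFINITE_FORWARD, i.e. -1) passed
 * as-is makes splice() fail with EINVAL on 64-bit platforms; 32-bit ones
 * only got away with it because the value was truncated to 4GB. Bounding
 * the count by MAX_SPLICE_AT_ONCE below avoids this.
 */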

/* Returns :
 *   -1 if splice() is not supported
 *   >= 0 to report the amount of spliced bytes.
 * connection flags are updated (error, read0, wait_room, wait_data).
 * The caller must have previously allocated the pipe.
 */
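/* A minimal caller sketch (hypothetical; the NULL xprt_ctx and the variable
 * names are illustrative assumptions, not code from this file):
 *
 *	ret = raw_sock_to_pipe(conn, NULL, pipe, to_forward);
 *	if (ret == -1) {
 *		// splice() unusable on this connection, fall back to recv()
 *	}
 *	else if (ret > 0) {
 *		// ret bytes now sit in the pipe, waiting to be flushed to
 *		// the other side, e.g. with raw_sock_from_pipe()
 *	}
 */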
int raw_sock_to_pipe(struct connection *conn, void *xprt_ctx, struct pipe *pipe, unsigned int count)
{
	int ret;
	int retval = 0;

	if (!conn_ctrl_ready(conn))
		return 0;

	if (!fd_recv_ready(conn->handle.fd))
		return 0;

	conn->flags &= ~CO_FL_WAIT_ROOM;
	errno = 0;

	/* Under Linux, if FD_POLL_HUP is set, we have reached the end.
	 * Since older splice() implementations were buggy and returned
	 * EAGAIN on end of read, let's bypass the call to splice() now.
	 */
	if (unlikely(!(fdtab[conn->handle.fd].state & FD_POLL_IN))) {
		/* stop here if we reached the end of data */
		if ((fdtab[conn->handle.fd].state & (FD_POLL_ERR|FD_POLL_HUP)) == FD_POLL_HUP)
			goto out_read0;

		/* report error on POLL_ERR before connection establishment */
		if ((fdtab[conn->handle.fd].state & FD_POLL_ERR) && (conn->flags & CO_FL_WAIT_L4_CONN)) {
			conn->flags |= CO_FL_ERROR | CO_FL_SOCK_RD_SH | CO_FL_SOCK_WR_SH;
			errno = 0; /* let the caller do a getsockopt() if it wants it */
			goto leave;
		}
	}

	while (count) {
		if (count > MAX_SPLICE_AT_ONCE)
			count = MAX_SPLICE_AT_ONCE;

		ret = splice(conn->handle.fd, NULL, pipe->prod, NULL, count,
		             SPLICE_F_MOVE|SPLICE_F_NONBLOCK);

		if (ret <= 0) {
			if (ret == 0)
				goto out_read0;

			if (errno == EAGAIN) {
				/* there are two reasons for EAGAIN :
				 *   - nothing in the socket buffer (standard)
				 *   - pipe is full
				 * The difference between these two situations
				 * is problematic. Since we don't know if the
				 * pipe is full, we'll stop if the pipe is not
				 * empty. Anyway, we will almost always fill or
				 * empty the pipe.
				 */
				if (pipe->data) {
					/* always stop reading until the pipe is flushed */
					conn->flags |= CO_FL_WAIT_ROOM;
					break;
				}

				/* socket buffer exhausted */
				fd_cant_recv(conn->handle.fd);
				break;
			}
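			/* note: EBADF is matched below because some kernels
			 * were seen returning it even though all FDs were
			 * valid; the session then safely recovers via recv().
			 */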
			else if (errno == ENOSYS || errno == EINVAL || errno == EBADF) {
				/* splice not supported on this end, disable it.
				 * We can safely return -1 since there is no
				 * chance that any data has been piped yet.
				 */
				retval = -1;
				goto leave;
			}
			else if (errno == EINTR) {
				/* try again */
				continue;
			}

			/* here we have another error */
			conn->flags |= CO_FL_ERROR;
			break;
		} /* ret <= 0 */

		retval += ret;
		pipe->data += ret;
		count -= ret;

		if (pipe->data >= SPLICE_FULL_HINT || ret >= global.tune.recv_enough) {
			/* We've read enough of it for this time, let's stop before
			 * being asked to poll.
			 */
			conn->flags |= CO_FL_WAIT_ROOM;
			break;
		}
	} /* while */

	if (unlikely(conn->flags & CO_FL_WAIT_L4_CONN) && retval)
		conn->flags &= ~CO_FL_WAIT_L4_CONN;

 leave:
	if (retval > 0) {
		/* we count the total bytes sent, and the send rate for 32-byte
		 * blocks. The reason for the latter is that freq_ctr are
		 * limited to 4GB and that it's not enough per second.
		 */
		_HA_ATOMIC_ADD(&global.out_bytes, retval);
		_HA_ATOMIC_ADD(&global.spliced_out_bytes, retval);
		update_freq_ctr(&global.out_32bps, (retval + 16) / 32);
	}
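	/* Worked out, the 32-byte granularity above means: a freq_ctr counts
	 * at most ~4G events per second, so counting raw bytes would cap the
	 * measurable rate at ~4 GB/s; counting (retval + 16) / 32 blocks
	 * (rounded to the nearest 32-byte unit) raises that ceiling to
	 * roughly 128 GB/s.
	 */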
|
2012-08-23 18:46:52 -04:00
|
|
|
return retval;
|
2009-01-25 07:56:13 -05:00
|
|
|
|
2012-08-23 18:46:52 -04:00
|
|
|
out_read0:
|
|
|
|
|
conn_sock_read0(conn);
|
2012-10-04 14:20:46 -04:00
|
|
|
conn->flags &= ~CO_FL_WAIT_L4_CONN;
|
2017-10-25 03:30:13 -04:00
|
|
|
goto leave;
|
[MAJOR] complete support for linux 2.6 kernel splicing
This code provides support for linux 2.6 kernel splicing. This feature
appeared in kernel 2.6.25, but initial implementations were awkward and
buggy. A kernel >= 2.6.29-rc1 is recommended, as well as some optimization
patches.
Using pipes, this code is able to pass network data directly between
sockets. The pipes are a bit annoying to manage (fd creation, release,
...) but finally work quite well.
Preliminary tests show that on high bandwidths, there's a substantial
gain (approx +50%, only +20% with kernel workarounds for corruption
bugs). With 2000 concurrent connections, with Myricom NICs, haproxy
now more easily achieves 4.5 Gbps for 1 process and 6 Gbps for two
processes buffers. 8-9 Gbps are easily reached with smaller numbers
of connections.
We also try to splice out immediately after a splice in by making
profit from the new ability for a data producer to notify the
consumer that data are available. Doing this ensures that the
data are immediately transferred between sockets without latency,
and without having to re-poll. Performance on small packets has
considerably increased due to this method.
Earlier kernels return only one TCP segment at a time in non-blocking
splice-in mode, while newer return as many segments as may fit in the
pipe. To work around this limitation without hurting more recent kernels,
we try to collect as much data as possible, but we stop when we believe
we have read 16 segments, then we forward everything at once. It also
ensures that even upon shutdown or EAGAIN the data will be forwarded.
Some tricks were necessary because the splice() syscall does not make
a difference between missing data and a pipe full, it always returns
EAGAIN. The trick consists in stop polling in case of EAGAIN and a non
empty pipe.
The receiver waits for the buffer to be empty before using the pipe.
This is in order to avoid confusion between buffer data and pipe data.
The BF_EMPTY flag now covers the pipe too.
Right now the code is disabled by default. It needs to be built with
CONFIG_HAP_LINUX_SPLICE, and the instances intented to use splice()
must have "option splice-response" (or option splice-request) enabled.
It is probably desirable to keep a pool of pre-allocated pipes to
avoid having to create them for every session. This will be worked
on later.
Preliminary tests show very good results, even with the kernel
workaround causing one memcpy(). At 3000 connections, performance
has moved from 3.2 Gbps to 4.7 Gbps.
2009-01-18 18:32:22 -05:00
|
|
|
}
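
/* Illustrative sketch (not part of the transport layer): the 32-byte block
 * accounting above rounds to the nearest block, so (retval + 16) / 32 maps
 * 0..15 bytes to 0 blocks, 16..47 bytes to 1 block, and so on. A standalone
 * example, assuming nothing beyond standard C:
 */
#if 0
#include <stdio.h>

int main(void)
{
	unsigned int sizes[] = { 15, 16, 47, 48, 1500 };
	unsigned int i;

	for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
		/* same rounding as update_freq_ctr(&global.out_32bps, ...) */
		printf("%u bytes -> %u 32-byte blocks\n",
		       sizes[i], (sizes[i] + 16) / 32);
	}
	return 0;
}
#endif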

/* Send as many bytes as possible from the pipe to the connection's socket.
 */
int raw_sock_from_pipe(struct connection *conn, void *xprt_ctx, struct pipe *pipe)
{
	int ret, done;

	if (!conn_ctrl_ready(conn))
		return 0;

	if (!fd_send_ready(conn->handle.fd))
		return 0;

	if (conn->flags & CO_FL_SOCK_WR_SH) {
		/* it's already closed */
		conn->flags |= CO_FL_ERROR | CO_FL_SOCK_RD_SH;
		errno = EPIPE;
		return 0;
	}

	done = 0;
	while (pipe->data) {
		ret = splice(pipe->cons, NULL, conn->handle.fd, NULL, pipe->data,
			     SPLICE_F_MOVE|SPLICE_F_NONBLOCK);

		if (ret <= 0) {
			if (ret == 0 || errno == EAGAIN) {
				fd_cant_send(conn->handle.fd);
				break;
			}
			else if (errno == EINTR)
				continue;

			/* here we have another error */
			conn->flags |= CO_FL_ERROR;
			break;
		}

		done += ret;
		pipe->data -= ret;
	}

	if (unlikely(conn->flags & CO_FL_WAIT_L4_CONN) && done) {
		conn->flags &= ~CO_FL_WAIT_L4_CONN;
	}

	return done;
}

#endif /* USE_LINUX_SPLICE */
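
/* Illustrative sketch (not part of this file's API): the two pipe functions
 * above implement the classic zero-copy pattern of splicing a socket into a
 * pipe and the pipe into another socket. A minimal standalone version,
 * assuming two already-connected non-blocking sockets src_fd and dst_fd,
 * might look like the following; error handling is reduced to the minimum.
 */
#if 0
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* forward up to <budget> bytes from src_fd to dst_fd through a pipe,
 * return the number of bytes moved or -1 on error.
 */
static ssize_t splice_forward(int src_fd, int dst_fd, size_t budget)
{
	int p[2];
	ssize_t in, out, total = 0;

	if (pipe(p) < 0)
		return -1;

	/* socket -> pipe */
	in = splice(src_fd, NULL, p[1], NULL, budget,
		    SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
	if (in <= 0)
		goto done;

	/* pipe -> socket, looping until the pipe is drained */
	while (in > 0) {
		out = splice(p[0], NULL, dst_fd, NULL, in,
			     SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
		if (out <= 0)
			break;
		in -= out;
		total += out;
	}
done:
	close(p[0]);
	close(p[1]);
	return total;
}
#endif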


/* Receive up to <count> bytes from connection <conn>'s socket and store them
 * into buffer <buf>. Only one call to recv() is performed, unless the
 * buffer wraps, in which case a second call may be performed. The connection's
 * flags are updated with whatever special event is detected (error, read0,
 * empty). The caller is responsible for taking care of those events and
 * avoiding the call if inappropriate. The function does not call the
 * connection's polling update function, so the caller is responsible for this.
 * errno is cleared before starting so that the caller knows that if it spots an
 * error without errno, it's pending and can be retrieved via getsockopt(SO_ERROR).
 */
static size_t raw_sock_to_buf(struct connection *conn, void *xprt_ctx, struct buffer *buf, size_t count, int flags)
{
	ssize_t ret;
	size_t try, done = 0;

	if (!conn_ctrl_ready(conn))
		return 0;

	if (!fd_recv_ready(conn->handle.fd))
		return 0;

	conn->flags &= ~CO_FL_WAIT_ROOM;
	errno = 0;

	if (unlikely(!(fdtab[conn->handle.fd].state & FD_POLL_IN))) {
		/* stop here if we reached the end of data */
		if ((fdtab[conn->handle.fd].state & (FD_POLL_ERR|FD_POLL_HUP)) == FD_POLL_HUP)
			goto read0;

		/* report error on POLL_ERR before connection establishment */
		if ((fdtab[conn->handle.fd].state & FD_POLL_ERR) && (conn->flags & CO_FL_WAIT_L4_CONN)) {
			conn->flags |= CO_FL_ERROR | CO_FL_SOCK_RD_SH | CO_FL_SOCK_WR_SH;
			goto leave;
		}
	}

	/* read the largest possible block. For this, we perform only one call
	 * to recv() unless the buffer wraps and we exactly fill the first hunk,
	 * in which case we accept to do it once again. A new attempt is made on
	 * EINTR too.
	 */
	while (count > 0) {
		try = b_contig_space(buf);
		if (!try)
			break;

		if (try > count)
			try = count;

		ret = recv(conn->handle.fd, b_tail(buf), try, 0);

		if (ret > 0) {
			b_add(buf, ret);
			done += ret;
			if (ret < try) {
				/* socket buffer exhausted */
				fd_cant_recv(conn->handle.fd);

				/* unfortunately, on level-triggered events, POLL_HUP
				 * is generally delivered AFTER the system buffer is
				 * empty, unless the poller supports POLL_RDHUP. If
				 * we know this is the case, we don't try to read more
				 * as we know there's no more available. Similarly, if
				 * there's no problem with lingering we don't even try
				 * to read an unlikely close from the client since we'll
				 * close first anyway.
				 */
				if (fdtab[conn->handle.fd].state & FD_POLL_HUP)
					goto read0;

				if (!(fdtab[conn->handle.fd].state & FD_LINGER_RISK) ||
				    (cur_poller.flags & HAP_POLL_F_RDHUP)) {
					break;
				}
			}
			count -= ret;

			if (flags & CO_RFL_READ_ONCE)
				break;
		}
		else if (ret == 0) {
			goto read0;
		}
		else if (errno == EAGAIN || errno == ENOTCONN) {
			/* socket buffer exhausted */
			fd_cant_recv(conn->handle.fd);
			break;
		}
		else if (errno != EINTR) {
			conn->flags |= CO_FL_ERROR | CO_FL_SOCK_RD_SH | CO_FL_SOCK_WR_SH;
			break;
		}
	}

	if (unlikely(conn->flags & CO_FL_WAIT_L4_CONN) && done)
		conn->flags &= ~CO_FL_WAIT_L4_CONN;

 leave:
	return done;

 read0:
	conn_sock_read0(conn);
	conn->flags &= ~CO_FL_WAIT_L4_CONN;

	/* Now a final check for a possible asynchronous low-level error
	 * report. This can happen when a connection receives a reset
	 * after a shutdown, both POLL_HUP and POLL_ERR are queued, and
	 * we might have come from there by just checking POLL_HUP instead
	 * of recv()'s return value 0, so we have no way to tell there was
	 * an error without checking.
	 */
	if (unlikely(fdtab[conn->handle.fd].state & FD_POLL_ERR))
		conn->flags |= CO_FL_ERROR | CO_FL_SOCK_RD_SH | CO_FL_SOCK_WR_SH;
	goto leave;
}
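
/* Illustrative sketch (a toy, not HAProxy's struct buffer): the loop above
 * relies on b_contig_space() returning the length of the contiguous free
 * area ending at the buffer's tail. In a wrapping (ring) buffer the free
 * space is split into at most two chunks, which is why at most two recv()
 * calls are needed to fill it. A minimal model of that computation, assuming
 * a ring buffer described only by its size, head and data count:
 */
#if 0
#include <stddef.h>

struct toy_ring {
	size_t size;  /* total capacity */
	size_t head;  /* index of the first stored byte */
	size_t data;  /* number of bytes currently stored */
};

/* length of the contiguous free area starting right after the stored data */
static size_t toy_contig_space(const struct toy_ring *r)
{
	size_t tail = (r->head + r->data) % r->size;  /* write position */
	size_t free_total = r->size - r->data;
	size_t to_end = r->size - tail;               /* room before wrapping */

	return free_total < to_end ? free_total : to_end;
}
#endif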


/* Send up to <count> pending bytes from buffer <buf> to connection <conn>'s
 * socket. <flags> may contain some CO_SFL_* flags to hint the system about
 * other pending data, for example CO_SFL_MSG_MORE which is turned into
 * MSG_MORE below.
 * Only one call to send() is performed, unless the buffer wraps, in which case
 * a second call may be performed. The connection's flags are updated with
 * whatever special event is detected (error, empty). The caller is responsible
 * for taking care of those events and avoiding the call if inappropriate. The
 * function does not call the connection's polling update function, so the caller
 * is responsible for this. It's up to the caller to update the buffer's contents
 * based on the return value.
 */
static size_t raw_sock_from_buf(struct connection *conn, void *xprt_ctx, const struct buffer *buf, size_t count, int flags)
{
	ssize_t ret;
	size_t try, done;
	int send_flag;

	if (!conn_ctrl_ready(conn))
		return 0;

	if (!fd_send_ready(conn->handle.fd))
		return 0;

	if (conn->flags & CO_FL_SOCK_WR_SH) {
		/* it's already closed */
		conn->flags |= CO_FL_ERROR | CO_FL_SOCK_RD_SH;
		errno = EPIPE;
		return 0;
	}

	done = 0;
	/* send the largest possible block. For this we perform only one call
	 * to send() unless the buffer wraps and we exactly fill the first hunk,
	 * in which case we accept to do it once again.
	 */
	while (count) {
		try = b_contig_data(buf, done);
		if (try > count)
			try = count;

		send_flag = MSG_DONTWAIT | MSG_NOSIGNAL;
		if (try < count || flags & CO_SFL_MSG_MORE)
			send_flag |= MSG_MORE;

		ret = send(conn->handle.fd, b_peek(buf, done), try, send_flag);

		if (ret > 0) {
			count -= ret;
			done += ret;

			/* if the system buffer is full, don't insist */
			if (ret < try) {
				fd_cant_send(conn->handle.fd);
				break;
			}
			if (!count)
				fd_stop_send(conn->handle.fd);
		}
		else if (ret == 0 || errno == EAGAIN || errno == ENOTCONN || errno == EINPROGRESS) {
			/* nothing written, we need to poll for write first */
			fd_cant_send(conn->handle.fd);
			break;
		}
		else if (errno != EINTR) {
			conn->flags |= CO_FL_ERROR | CO_FL_SOCK_RD_SH | CO_FL_SOCK_WR_SH;
			break;
		}
	}

	if (unlikely(conn->flags & CO_FL_WAIT_L4_CONN) && done) {
		conn->flags &= ~CO_FL_WAIT_L4_CONN;
	}

	if (done > 0) {
		/* we count the total bytes sent, and the send rate for 32-byte
		 * blocks. The reason for the latter is that freq_ctr values are
		 * limited to 4GB, which is not enough per second.
		 */
		_HA_ATOMIC_ADD(&global.out_bytes, done);
		update_freq_ctr(&global.out_32bps, (done + 16) / 32);
	}
	return done;
}
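
/* Illustrative sketch (standalone, not this file's API): the send loop above
 * sets MSG_MORE whenever it already knows more data will follow (either the
 * buffer wraps or the caller passed CO_SFL_MSG_MORE), so the kernel can
 * coalesce the pieces into fewer TCP segments. A minimal standalone version
 * of that decision, assuming a payload split in two parts; short writes are
 * deliberately ignored to keep the sketch small.
 */
#if 0
#include <sys/socket.h>
#include <string.h>

/* send a header then a body on <fd>, hinting the kernel that the header is
 * not the end of the message; returns 0 on success, -1 on error.
 */
static int send_two_parts(int fd, const char *hdr, const char *body)
{
	/* MSG_MORE: more data follows, let the kernel coalesce */
	if (send(fd, hdr, strlen(hdr), MSG_DONTWAIT | MSG_NOSIGNAL | MSG_MORE) < 0)
		return -1;

	/* last part: no MSG_MORE, the segment may be flushed */
	if (send(fd, body, strlen(body), MSG_DONTWAIT | MSG_NOSIGNAL) < 0)
		return -1;

	return 0;
}
#endif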

/* Called from the upper layer, to subscribe <es> to events <event_type>. The
 * event subscriber <es> is not allowed to change from a previous call as long
 * as at least one event is still subscribed. The <event_type> must only be a
 * combination of SUB_RETRY_RECV and SUB_RETRY_SEND. It always returns 0.
 */
static int raw_sock_subscribe(struct connection *conn, void *xprt_ctx, int event_type, struct wait_event *es)
{
	return conn_subscribe(conn, xprt_ctx, event_type, es);
}

/* Called from the upper layer, to unsubscribe <es> from events <event_type>.
 * The <es> pointer is not allowed to differ from the one passed to the
 * subscribe() call. It always returns zero.
 */
static int raw_sock_unsubscribe(struct connection *conn, void *xprt_ctx, int event_type, struct wait_event *es)
{
	return conn_unsubscribe(conn, xprt_ctx, event_type, es);
}

static void raw_sock_close(struct connection *conn, void *xprt_ctx)
{
	if (conn->subs != NULL) {
		conn_unsubscribe(conn, NULL, conn->subs->events, conn->subs);
	}
}

/* We can't have an underlying XPRT, so just return -1 to signify failure */
static int raw_sock_remove_xprt(struct connection *conn, void *xprt_ctx, void *toremove_ctx, const struct xprt_ops *newops, void *newctx)
{
	/* This is the lowest xprt we can have, so if we get there we didn't
	 * find the xprt we wanted to remove, that's a bug
	 */
	BUG_ON(1);
	return -1;
}

/* transport-layer operations for RAW sockets */
static struct xprt_ops raw_sock = {
	.snd_buf     = raw_sock_from_buf,
	.rcv_buf     = raw_sock_to_buf,
	.subscribe   = raw_sock_subscribe,
	.unsubscribe = raw_sock_unsubscribe,
	.remove_xprt = raw_sock_remove_xprt,
#if defined(USE_LINUX_SPLICE)
	.rcv_pipe    = raw_sock_to_pipe,
	.snd_pipe    = raw_sock_from_pipe,
#endif
	.shutr       = NULL,
	.shutw       = NULL,
	.close       = raw_sock_close,
	.name        = "RAW",
};

__attribute__((constructor))
static void __raw_sock_init(void)
{
	xprt_register(XPRT_RAW, &raw_sock);
}
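
/* Illustrative sketch (hypothetical caller, not part of this file): once the
 * constructor above has registered raw_sock for XPRT_RAW, upper layers do not
 * call these functions directly; they go through the connection's transport
 * ops pointers, roughly as below (the helper name and its logic are made up
 * for illustration).
 */
#if 0
static void example_io(struct connection *conn, struct buffer *buf)
{
	size_t rcvd, sent;

	/* read from the socket into <buf> through the transport layer */
	rcvd = conn->xprt->rcv_buf(conn, conn->xprt_ctx, buf, b_room(buf), 0);

	/* push whatever is in <buf> back to the socket the same way */
	sent = conn->xprt->snd_buf(conn, conn->xprt_ctx, buf, b_data(buf), 0);

	(void)rcvd; (void)sent;
}
#endif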

/*
 * Local variables:
 *  c-indent-level: 8
 *  c-basic-offset: 8
 * End:
 */