By default we automatically wait for enough data to fill large
packets if buf->to_forward is not null. This causes a problem
with POST/Expect requests which have a data size but no data
immediately available. Instead of causing noticeable delays on
such requests, simply add a flag to disable waiting when sending
requests.
Many times we see a lot of short responses in HTTP (typically 304 on a
reload). It is a waste of network bandwidth to send that many small packets
when we know we can merge them. When we know that another HTTP request is
following a response, we set BF_EXPECT_MORE on the response buffer, which
will turn MSG_MORE on exactly once. That way, multiple short responses can
leave pipelined if their corresponding requests were also pipelined.
The body parser will be used in close and keep-alive modes. It follows
the stream to keep in sync with both the request and the response message.
Both chunked transfer-coding and content-length are supported according to
RFC2616.
The multipart/byterange encoding has not yet been implemented and if not
seconded by any of the two other ones, will be forwarded till the close,
as requested by the specification.
Both the request and the response analysers converge into an HTTP_MSG_DONE
state where it will be possible to force a close (option forceclose) or to
restart with a fresh new transaction and maintain keep-alive.
This change is important. All tests are OK but any possible behaviour
change with "option httpclose" might find its root here.
We used to apply a limit to each buffer's size in order to leave
some room to rewrite headers, then we used to remove this limit
once the session switched to a data state.
Proceeding that way becomes a problem with keepalive because we
have to know when to stop reading too much data into the buffer
so that we can leave some room again to process next requests.
The principle we adopt here consists in only relying on to_forward+send_max.
Indeed, both of those data define how many bytes will leave the buffer.
So as long as their sum is larger than maxrewrite, we can safely
fill the buffers. If they are smaller, then we refrain from filling
the buffer. This means that we won't risk to fill buffers when
reading last data chunk followed by a POST request and its contents.
The only impact identified so far is that we must ensure that the
BF_FULL flag is correctly dropped when starting to forward. Right
now this is OK because nobody inflates to_forward without using
buffer_forward().
The code part which waits for an HTTP response has been extracted
from the old function. We now have two analysers and the second one
may re-enable the first one when an 1xx response is encountered.
This has been tested and works.
The calls to stream_int_return() that were remaining in the wait
analyser have been converted to stream_int_retnclose().
When processing a GET or HEAD request in close mode, we know we don't
need to read anything anymore on the socket, so we can disable it.
Doing this can save up to 40% of the recv calls, and half of the
epoll_ctl calls.
For this we need a buffer flag indicating that we're not interesting in
reading anymore. Right now, this flag also disables both polled reads.
We might benefit from disabling only speculative reads, but we will need
at least this flag when we want to support keepalive anyway.
Currently we don't disable the flag on completion, but it does not
matter as we close ASAP when performing the shutw().
Recent "struct chunk rework" introduced a NULL pointer dereference
and now haproxy segfaults if auth is required for stats but not found.
The reason is that size_t cannot store negative values, but current
code assumes that "len < 0" == uninitialized.
This patch fixes it.
We can get rid of the stats analyser by moving all the stats code
to a stream interface applet. Above being cleaner, it provides new
advantages such as the ability to process requests and responses
from the same function and work only with simple state machines.
There's no need for any hijack hack anymore.
The direct advantage for the user are the interactive mode and the
ability to chain several commands delimited by a semi-colon. Now if
the user types "prompt", he gets a prompt from which he can send
as many requests as he wants. All outputs are terminated by a
blank line followed by a new prompt, so this can be used from
external tools too.
The code is not very clean, it needs some rework, but some part
of the dirty parts are due to the remnants of the hijack mode used
in the old functions we call.
The old AN_REQ_STATS_SOCK analyser flag is now unused and has been
removed.
In TCP, we don't want to forward chunks of data, we want to forward
indefinitely. This patch introduces a special value for the amount
of data to be forwarded. When buffer_forward() is called with
BUF_INFINITE_FORWARD, it configures the buffer to never stop
forwarding until the end.
The BF_EMPTY flag was once used to indicate an empty buffer. However,
it was used half the time as meaning the buffer is empty for the reader,
and half the time as meaning there is nothing left to send.
"nothing to send" is only indicated by "->send_max=0 && !pipe". Once
we fix this, we discover that the flag is not used anymore. So the
flags has been renamed BF_OUT_EMPTY and means exactly the condition
above, ie, there is nothing to send.
Doing so has allowed us to remove some unused tests for emptiness,
but also to uncover a certain amount of situations where the flag
was not correctly set or tested.
The BF_WRITE_ENA buffer flag became very complex to deal with, because
it was used to :
- enable automatic connection
- enable close forwarding
- enable data forwarding
The last point was not very true anymore since we introduced ->send_max,
but still the test remained everywhere. This was causing issues such as
impossibility to connect without forwarding data, impossibility to prevent
closing when data was forwarded, etc...
This patch clarifies the situation by getting rid of this multi-purpose
flag and replacing it with :
- data forwarding based only on ->send_max || ->pipe ;
- a new BF_AUTO_CONNECT flag to allow automatic connection and only
that ;
- ability to perform an automatic connection when ->send_max or ->pipe
indicate that data is waiting to leave the buffer ;
- a new BF_AUTO_CLOSE flag to let the producer automatically set the
BF_SHUTW_NOW flag when it gets a BF_SHUTR.
During this cleanup, it was discovered that some tests were performed
twice, or that the BF_HIJACK flag was still tested, which is not needed
anymore since ->send_max replcaed it. These places have been fixed too.
These cleanups have also revealed a few areas where the other flags
such as BF_EMPTY are not cleanly used. This will be an opportunity for
a second patch.
This flag was incorrectly used as meaning "close immediately",
while it needs to say "close ASAP". ASAP here means when unsent
data pending in the buffer are sent. This helps cleaning up some
dirty tricks where the buffer output was checking the BF_SHUTR
flag combined with EMPTY and other such things. Now we have a
clearly defined semantics :
- producer sets SHUTR and *may* set SHUTW_NOW if WRITE_ENA is
set, otherwise leave it to the session processor to set it.
- consumer only checks SHUTW_NOW to decide whether or not to
call shutw().
This also induced very minor changes at some locations which were
not protected against buffer changes while the SHUTW_NOW flag was
set. Now we prevent send_max from changing when the flag is set.
Several tests have been run without any unexpected behaviour detected.
Some more cleanups are needed, as it clearly appears that some tests
could be removed with stricter semantics.
The first step towards dynamic buffer size consists in removing
all static definitions of the buffer size. Instead, we store a
buffer's size in itself. Right now they're all preinitialized
to BUFSIZE, but we will change that.
The remains of the stats socket code has nothing to do in proto_uxst
anymore and must move to dumpstats. The code is much cleaner and more
structured. It was also an opportunity to rename AN_REQ_UNIX_STATS
as AN_REQ_STATS_SOCK as the stats socket is no longer unix-specific
either.
The last item refering to stats in proto_uxst is the setting of the
task's nice value which should in fact come from the listener.
The new statement "persist rdp-cookie" enables RDP cookie
persistence. The RDP cookie is then extracted from the RDP
protocol, and compared against available servers. If a server
matches the RDP cookie, then it gets the connection.
The HTTP processing has been splitted into 7 steps, one of which
is not anymore HTTP-specific (content-switching). That way, it
becomes possible to use "use_backend" rules in TCP mode. A new
"use_server" directive should follow soon.
We want to split several steps in HTTP processing so that
we can call individual analysers depending on what processing
we want to perform. The first step consists in splitting the
part that waits for a request from the rest.
When the reader does not expect to read lots of data, it can
set BF_READ_DONTWAIT on the request buffer. When it is set,
the stream_sock_read callback will not try to perform multiple
reads, it will return after only one, and clear the flag.
That way, we can immediately return when waiting for an HTTP
request without trying to read again.
On pure request/responses schemes such as monitor-uri or
redirects, this has completely eliminated the EAGAIN occurrences
and the epoll_ctl() calls, resulting in a performance increase of
about 10%. Similar effects should be observed once we support
HTTP keep-alive since we'll immediately disable reads once we
get a full request.
Using pipe pools makes pipe management a lot easier. It also allows to
remove quite a bunch of #ifdefs in areas which depended on the presence
or not of support for kernel splicing.
The buffer now holds a pointer to a pipe structure which is always NULL
except if there are still data in the pipe. When it needs to use that
pipe, it dynamically allocates it from the pipe pool. When the data is
consumed, the pipe is immediately released.
That way, there is no need anymore to care about pipe closure upon
session termination, nor about pipe creation when trying to use
splice().
Another immediate advantage of this method is that it considerably
reduces the number of pipes needed to use splice(). Tests have shown
that even with 0.2 pipe per connection, almost all sessions can use
splice(), because the same pipe may be used by several consecutive
calls to splice().
When CONFIG_HAP_LINUX_SPLICE is defined, the buffer structure will be
slightly enlarged to support information needed for kernel splicing
on Linux.
A first attempt consisted in putting this information into the stream
interface, but in the long term, it appeared really awkward. This
version puts the information into the buffer. The platform-dependant
part is conditionally added and will only enlarge the buffers when
compiled in.
One new flag has also been added to the buffers: BF_KERN_SPLICING.
It indicates that the application considers it is appropriate to
use splicing to forward remaining data.
In the buffers, the read limit used to leave some place for header
rewriting was set by a pointer to the end of the buffer. Not only
this required subtracts at every place in the code, but this will
also soon not be usable anymore when we want to support keepalive.
Let's replace this with a length limit, comparable to the buffer's
length. This has also sightly reduced the code size.
The way the buffers and stream interfaces handled ->to_forward was
really not handy for multiple reasons. Now we've moved its control
to the receive-side of the buffer, which is also responsible for
keeping send_max up to date. This makes more sense as it now becomes
possible to send some pre-formatted data followed by forwarded data.
The following explanation has also been added to buffer.h to clarify
the situation. Right now, tests show that the I/O is behaving extremely
well. Some work will have to be done to adapt existing splice code
though.
/* Note about the buffer structure
The buffer contains two length indicators, one to_forward counter and one
send_max limit. First, it must be understood that the buffer is in fact
split in two parts :
- the visible data (->data, for ->l bytes)
- the invisible data, typically in kernel buffers forwarded directly from
the source stream sock to the destination stream sock (->splice_len
bytes). Those are used only during forward.
In order not to mix data streams, the producer may only feed the invisible
data with data to forward, and only when the visible buffer is empty. The
consumer may not always be able to feed the invisible buffer due to platform
limitations (lack of kernel support).
Conversely, the consumer must always take data from the invisible data first
before ever considering visible data. There is no limit to the size of data
to consume from the invisible buffer, as platform-specific implementations
will rarely leave enough control on this. So any byte fed into the invisible
buffer is expected to reach the destination file descriptor, by any means.
However, it's the consumer's responsibility to ensure that the invisible
data has been entirely consumed before consuming visible data. This must be
reflected by ->splice_len. This is very important as this and only this can
ensure strict ordering of data between buffers.
The producer is responsible for decreasing ->to_forward and increasing
->send_max. The ->to_forward parameter indicates how many bytes may be fed
into either data buffer without waking the parent up. The ->send_max
parameter says how many bytes may be read from the visible buffer. Thus it
may never exceed ->l. This parameter is updated by any buffer_write() as
well as any data forwarded through the visible buffer.
The consumer is responsible for decreasing ->send_max when it sends data
from the visible buffer, and ->splice_len when it sends data from the
invisible buffer.
A real-world example consists in part in an HTTP response waiting in a
buffer to be forwarded. We know the header length (300) and the amount of
data to forward (content-length=9000). The buffer already contains 1000
bytes of data after the 300 bytes of headers. Thus the caller will set
->send_max to 300 indicating that it explicitly wants to send those data,
and set ->to_forward to 9000 (content-length). This value must be normalised
immediately after updating ->to_forward : since there are already 1300 bytes
in the buffer, 300 of which are already counted in ->send_max, and that size
is smaller than ->to_forward, we must update ->send_max to 1300 to flush the
whole buffer, and reduce ->to_forward to 8000. After that, the producer may
try to feed the additional data through the invisible buffer using a
platform-specific method such as splice().
*/
In preparation of splice support, let's add the splice_len member
to the buffer struct. An earlier implementation made it conditional,
which made the whole logics very complex due to a large number of
ifdefs.
Now BF_EMPTY is only set once both buf->l and buf->splice_len are
null. Splice_len is initialized to zero during buffer creation and
is currently not changed, so the whole logics remains unaffected.
When splice gets merged, splice_len will reflect the number of bytes
in flight out of the buffer but not yet sent, typically in a pipe for
the Linux case.
If an analyser sets buf->to_forward to a given value, that many
data will be forwarded between the two stream interfaces attached
to a buffer without waking the task up. The same applies once all
analysers have been released. This saves a large amount of calls
to process_session() and a number of task_dequeue/queue.
Sometimes we don't care about a read timeout, for instance, from the
client when waiting for the server, but we still want the client to
be able to read.
Till now it was done by articially forcing the read timeout to ETERNITY.
But this will cause trouble when we want the low level stream sock to
communicate without waking the session up. So we add a BF_READ_NOEXP
flag to indicate that when the read timeout is to be set, it might
have to be set to ETERNITY.
Since BF_READ_ENA was not used, we replaced this flag.
For keep-alive, line-mode protocols and splicing, we will need to
limit the sender to process a certain amount of bytes. The limit
is automatically set to the buffer size when analysers are detached
from the buffer.
Instead of calling a hard-coded function to produce data, let's
reference this function into the buffer and call it from there
when BF_HIJACK is set. This goes in the direction of more generic
session management code.
All the processing has now completely been split in layers. As of
now, everything is still in process_session() which is not the right
place, but the code sequence works. Timeouts, retries, errors, all
work.
The shutdown sequence has been strictly applied: BF_SHUTR/BF_SHUTW
are only assigned by lower layers. Upper layers can only indicate
their wish to close using BF_SHUTR_NOW and BF_SHUTW_NOW.
When a shutdown is performed on a stream interface, the buffer flags
are updated accordingly and re-checked by upper layers. A lot of care
has been taken to ensure that aborts during intermediate connection
setups are correctly handled and shutdowns correctly propagated to
both buffers.
A future evolution would consist in ensuring that BF_SHUT?_NOW may
be set at any time, and applies only when the buffer is empty. This
might help with error messages, but might complicate the processing
of data remaining in buffers.
Some useless buffer flag combinations have been removed.
Stat counters are still broken (eg: per-server total number of sessions).
Error messages should be delayed to the close instant and be produced by
protocol.
Many functions must now move to proper locations.
Two new functions are used instead : buffer_check_{shutr,shutw}.
It is indeed more adequate to check for new closures only when the
buffer reports them.
Several remaining unclosed connections were detected after a test,
even before this patch, so a bug remains. To reproduce, try the
following during 30 seconds :
inject30l4 -n 20000 -l -t 1000 -P 10 -o 4 -u 100 -s 100 -G 127.0.0.1:8000/
Now we have almost two distinct parts between tcp and http.
Only the connection establishment code still requires some
resynchronization, the rest does not.
The buffer flags became a big bazaar. Re-arrange them
so that their names are more explicit and so that they
are more easily readable in hex form. Some aggregates
have also been adjusted.
It was a waste to constantly update the file descriptor's status
and timeouts during a flags update. So stream_sock_process_data
has been slit in two parts :
stream_sock_data_update() => computes updated flags
stream_sock_data_finish() => computes timeouts
Only the first one is called during flag updates. The second one
is only called upon completion. The number of calls to fd_set/fd_clr
has now significantly dropped.
Also, it's useless to check for errors and timeouts in the
process_session() loop, it's enough to check for them at the
beginning.
The client side now relies on stream_sock_process_data(). One
part has not yet been re-implemented, it concerns the calls
to produce_content().
process_session() has been adjusted to correctly check for
changing bits in order not to call useless functions too many
times.
It already appears that stream_sock_process_data() should be
split so that the timeout computations are only performed at
the exit of process_session().
srv_state has been removed from HTTP state machines, and states
have been split in either TCP states or analyzers. For instance,
the TARPIT state has just become a simple analyzer.
New flags have been added to the struct buffer to compensate this.
The high-level stream processors sometimes need to force a disconnection
without touching a file-descriptor (eg: report an error). But if
they touched BF_SHUTW or BF_SHUTR, the file descriptor would not
be closed. Thus, the two SHUT?_NOW flags have been added so that
an application can request a forced close which the stream interface
will be forced to obey.
During this change, a new BF_HIJACK flag was added. It will
be used for data generation, eg during a stats dump. It
prevents the producer on a buffer from sending data into it.
BF_SHUTR_NOW /* the producer must shut down for reads ASAP */
BF_SHUTW_NOW /* the consumer must shut down for writes ASAP */
BF_HIJACK /* the producer is temporarily replaced */
BF_SHUTW_NOW has precedence over BF_HIJACK. BF_HIJACK has
precedence over BF_MAY_FORWARD (so that it does not need it).
New functions buffer_shutr_now(), buffer_shutw_now(), buffer_abort()
are provided to manipulate BF_SHUT* flags.
A new type "stream_interface" has been added to describe both
sides of a buffer. A stream interface has states and error
reporting. The session now has two stream interfaces (one per
side). Each buffer has stream_interface pointers to both
consumer and producer sides.
The server-side file descriptor has moved to its stream interface,
so that even the buffer has access to it.
process_srv() has been split into three parts :
- tcp_get_connection() obtains a connection to the server
- tcp_connection_failed() tests if a previously attempted
connection has succeeded or not.
- process_srv_data() only manages the data phase, and in
this sense should be roughly equivalent to process_cli.
Little code has been removed, and a lot of old code has been
left in comments for now.
It's a shame not to use buffer->wex for connection timeouts since by
definition it cannot be used till the connection is not established.
Using it instead of ->cex also makes the buffer processing more
symmetric.
It is not always convenient to run checks on req->l in functions to
check if a buffer is empty or full. Now the stream_sock functions
set flags BF_EMPTY and BF_FULL according to the buffer contents. Of
course, functions which touch the buffer contents adjust the flags
too.
BF_SHUTR_PENDING and BF_SHUTW_PENDING were poor ideas because
BF_SHUTR is the pending of BF_SHUTW_DONE and BF_SHUTW is the
pending of BF_SHUTR_DONE. Remove those two useless and confusing
"pending" versions and rename buffer_shut{r,w}_* functions.
It wasn't really wise to separate BF_MAY_CONNECT and BF_MAY_FORWARD,
as it caused trouble in TCP mode because the connection was allowed
but not the forwarding. Remove BF_MAY_CONNECT.
For the first time, HTTP and TCP are not merged anymore. All request
processing has moved to process_request while the TCP processing of
the frontend remains in process_cli. The code is a lot cleaner,
simpler, smaller (1%) and slightly faster (1% too).
Right now, the HTTP state machine cannot easily command the TCP
state machine, but it does not cause that many difficulties.
The response processing has not yet been extracted, and the unix-stream
state machines have to be broken down that way too.
The CL_STDATA, CL_STSHUTR and CL_STSHUTW states still exist and are
exactly the sames. They will have to be all merged into CL_STDATA
once the work has stabilized. It is also possible that this single
state will disappear in favor of just buffer flags.
A new buffer flag BF_MAY_FORWARD has been added so that the client
FSM can check whether it is allowed to forward the response to the
client. The client FSM does not have to monitor the server state
anymore.
A new buffer flag BF_MAY_CONNECT has been added so that the server
FSM can check whether it is allowed to establish a connection or
not. That way, the client FSM only has to move this flag and the
server side does not need to monitor client state anymore.
This is the first attempt at moving all internal parts from
using struct timeval to integer ticks. Those provides simpler
and faster code due to simplified operations, and this change
also saved about 64 bytes per session.
A new header file has been added : include/common/ticks.h.
It is possible that some functions should finally not be inlined
because they're used quite a lot (eg: tick_first, tick_add_ifset
and tick_is_expired). More measurements are required in order to
decide whether this is interesting or not.
Some function and variable names are still subject to change for
a better overall logics.