Commit graph

7904 commits

Author SHA1 Message Date
Randall Stewart
638b5ae1c7 HPTS actually has three states, not two, so the macro needs to account for that.
OK, let's fix up tcp_in_hpts() so that it also returns true if the
connection is in the race state (moving) and is scheduled to be put in.
This also requires changing the MPASS to use the old, non-inline
function version of tcp_in_hpts().

This change also adds a new inline macro so that a transport can obtain
a uint64_t timestamp (RACK will use this).

Reviewed by: glebius, tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D44157
2024-03-01 15:21:15 -05:00
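A minimal userland sketch of the two helpers described in 638b5ae1c7. The state names mirror the IHPTS_* values, but `struct conn`, the assumption that `hpts_slot == -1` means "no slot scheduled yet", and both function names are hypothetical. The kernel uses microuptime() rather than clock_gettime().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Three HPTS states; testing only for "on queue" misses the third. */
enum hpts_state { IHPTS_NONE = 0, IHPTS_ONQUEUE, IHPTS_MOVING };

/* Hypothetical stand-in for the relevant tcpcb fields. */
struct conn {
	enum hpts_state in_hpts;
	int32_t hpts_slot;	/* assumption: -1 means "no slot scheduled" */
};

/* True if queued, or if racing (moving) but already given a slot. */
static inline bool
conn_in_hpts(const struct conn *c)
{
	return (c->in_hpts == IHPTS_ONQUEUE ||
	    (c->in_hpts == IHPTS_MOVING && c->hpts_slot != -1));
}

/* Monotonic microsecond timestamp widened to 64 bits. */
static inline uint64_t
get_u64_usecs(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ((uint64_t)ts.tv_sec * 1000000u + (uint64_t)ts.tv_nsec / 1000u);
}
```

With only two states, a connection caught in IHPTS_MOVING would look "not in HPTS" even though it is about to be inserted, which is the race the commit closes.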
Gordon Bergling
6bce41a38e carp(4): Fix a typo in a source code comment
- s/successfull/successful/

MFC after:	3 days
2024-02-27 17:39:57 +01:00
Richard Scheffenegger
8917131e00 tcp: need default in switch statement for enum.
fix clang error after c9b6241e25

Reviewed By: imp
Differential Revision: https://reviews.freebsd.org/D44081
2024-02-25 08:24:13 +01:00
Richard Scheffenegger
c9b6241e25 tcp: address enum-int-mismatch
fix gcc13 error after f74352fbcf
2024-02-25 04:46:39 +01:00
Richard Scheffenegger
5e248c23d9 tcp: retain some CC signals outside of kernel scope
Summary: fix build error after f74352fbcf

Reviewers: #transport!

Subscribers: imp, melifaro, glebius

Differential Revision: https://reviews.freebsd.org/D44066
2024-02-24 21:01:54 +01:00
Michael Tuexen
644cffe67f sctp: improve sending of packets containing an INIT ACK chunk
If the peer announced support of zero checksums, do so when sending
packets containing an INIT ACK chunk.

MFC after:	1 week
2024-02-24 19:16:36 +01:00
Richard Scheffenegger
038699a8f1 tcp: cubic - restart epoch after RTO
This is a mitigation to avoid sudden extreme jumps in
cwnd, as t_epoch can be very out of date after an RTO.
Per RFC9438, sec 4.8, t_epoch is to be reset whenever
cwnd grows beyond ssthresh (CC phase transitions from
slow start to congestion avoidance), to be fixed with
the upcoming cc_cubic changes.

MFC after:		3 days
Reviewed By:		cc, #transport
Sponsored by:		NetApp, Inc
Differential Revision:	https://reviews.freebsd.org/D44023
2024-02-24 17:07:46 +01:00
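To see why a stale t_epoch causes a jump, here is a floating-point sketch of the RFC 9438 window curve W_cubic(t) = C * (t - K)^3 + W_max, with C = 0.4 and beta = 0.7. It assumes the epoch starts at cwnd = beta * W_max (so K = cbrt(W_max * (1 - beta) / C)); FreeBSD's cc_cubic uses fixed-point arithmetic, and the function names here are hypothetical.

```c
#include <assert.h>

#define CUBIC_C		0.4	/* RFC 9438 constant C */
#define CUBIC_BETA	0.7	/* RFC 9438 beta_cubic */

/* Cube root by Newton's method (x >= 0), avoiding a libm dependency. */
static double
cbrt_newton(double x)
{
	double y = x > 1.0 ? x : 1.0;

	if (x == 0.0)
		return (0.0);
	for (int i = 0; i < 100; i++)
		y = (2.0 * y + x / (y * y)) / 3.0;
	return (y);
}

/*
 * Window in segments, t seconds after the start of the current epoch.
 * Assumes the epoch began at cwnd = beta * W_max.
 */
static double
cubic_window(double t, double w_max)
{
	double k = cbrt_newton(w_max * (1.0 - CUBIC_BETA) / CUBIC_C);
	double d = t - k;

	return (CUBIC_C * d * d * d + w_max);
}
```

With W_max = 100 segments, the curve starts at 70 segments when t = 0, but if the epoch is left over from a minute before the RTO, t is about 60 s and the computed window is in the tens of thousands of segments. Restarting the epoch at the RTO keeps t small.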
Richard Scheffenegger
40fdc6d25f tcp: provide correct snd_fack on post_recovery
Ensure that snd_fack holds a valid value when doing
the post_recovery CC processing, for preparation of
the cc_cubic update, so that local pipe calculations
can correctly refer to snd_fack during and after CC events.

Reviewed By:		tuexen, #transport
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D43957
2024-02-24 16:55:31 +01:00
Richard Scheffenegger
f74352fbcf tcp: use enum for all congestion control signals
Facilitate easier troubleshooting by enumerating
all congestion control signals. Typecast the
enum to int, when a congestion control module uses
private signals.

No external change.

Reviewed By:		glebius, tuexen, #transport
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D43838
2024-02-24 16:41:48 +01:00
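A sketch of the pattern described in f74352fbcf and its two follow-ups: signals as an enum, module-private signals tested through an int cast, and a default case in every switch over the enum (what 8917131e00 added for clang). The enumerator values and CC_SIGPRIVMASK are written in the style of netinet/cc/cc.h but should be treated as illustrative; both helper functions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values; check netinet/cc/cc.h before relying on them. */
typedef enum {
	CC_ACK		= 0x00000001,
	CC_DUPACK	= 0x00000002,
	CC_ECN		= 0x01000000,
	CC_RTO		= 0x02000000,
	CC_RTO_ERR	= 0x04000000,
	CC_NDUPACK	= 0x08000000,
} cc_signal_t;

#define CC_SIGPRIVMASK	0x00FF0000	/* bits reserved for private signals */

/* Private signals are not enumerators, so test them as plain ints. */
static bool
cc_signal_is_private(cc_signal_t sig)
{
	return (((int)sig & CC_SIGPRIVMASK) != 0);
}

static bool
cc_signal_is_congestion(cc_signal_t sig)
{
	switch (sig) {
	case CC_ECN:
	case CC_RTO:
	case CC_NDUPACK:
		return (true);
	default:	/* ACK types, CC_RTO_ERR, private or unknown values */
		return (false);
	}
}
```

Because a private signal is an int value outside the enumerator list, the default label is what keeps such a switch well defined and warning-free.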
Richard Scheffenegger
38983d40c1 tcp: prevent div by zero in cc_htcp
Make sure the divisor is at least one. While cwnd should
never be smaller than t_maxseg, this can happen during
Path MTU Discovery, or when TCP options are considered
in other parts of the stack.

PR:			276674
MFC after:		3 days
Reviewed By:		tuexen, #transport
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D43797
2024-02-24 16:35:59 +01:00
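The guard in 38983d40c1 can be sketched as follows: cwnd in segments is cwnd / maxseg, which truncates to 0 whenever cwnd < maxseg, so it is clamped to 1 before being used as a divisor. The per-ACK increase formula and function name are hypothetical simplifications, not the actual cc_htcp code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical per-ACK additive increase: alpha segments per RTT,
 * spread over the (cwnd / maxseg) ACKs of one window.
 * maxseg is assumed to be non-zero.
 */
static uint32_t
htcp_ack_incr(uint32_t alpha, uint32_t cwnd, uint32_t maxseg)
{
	uint32_t segs = cwnd / maxseg;

	/* cwnd < maxseg (e.g. during PMTUD) would make segs 0. */
	if (segs < 1)
		segs = 1;
	return (alpha * maxseg / segs);
}
```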
Michael Tuexen
533faf21c1 sctp: improve consistency
MFC after:	1 week
2024-02-23 21:40:46 +01:00
Gordon Bergling
2fb174d18a sctp(4): Fix a typo in a source code comment
- s/anthing/anything/

MFC after:	3 days
2024-02-18 13:01:04 +01:00
Michael Tuexen
2f4e46dfdd RACK, BBR: handle EACCES like EPERM for IP output handling
The FreeBSD TCP base stack handles them also the same way.
In case of packet filters dropping packets in the output path,
this avoids retransmitting the dropped packet every 10ms or so.

Reviewed by:		rscheff
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D43773
2024-02-16 12:19:24 +01:00
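A minimal sketch of the classification in 2f4e46dfdd: both errors mean "a packet filter refused this segment", so they are handled identically instead of EACCES falling through to a path that retries quickly. The helper name is hypothetical; the RACK and BBR stacks handle this inline in their output routines.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical: did a packet filter drop the segment on output? */
static bool
output_error_is_filter_drop(int error)
{
	switch (error) {
	case EPERM:
	case EACCES:
		return (true);
	default:
		return (false);
	}
}
```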
Gleb Smirnoff
abe8379b4f sockets: repair wakeup of accept(2) by shutdown(2)
That was lost in transition from one-for-all soshutdown() to protocol
specific methods.  Only protocols that listen(2) were affected.  This is
not a documented or specified feature, but some software relies on it.  At
least the FreeSWITCH telephony software uses this behavior on
PF_INET/SOCK_STREAM.

Fixes:  5bba272807
2024-02-15 10:48:44 -08:00
Richard Scheffenegger
fcea1cc971 tcp: fix RTO ssthresh for non-6675 pipe calculation
Follow up on D43768 to properly deal with the non-default
pipe calculation. When CC_RTO is processed, the timeout
will have already pulled back snd_nxt. Further, snd_fack
is not pulled along with snd_una.

Reviewed By:		tuexen, #transport
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D43876
2024-02-14 14:51:53 +01:00
Richard Scheffenegger
57e27ff07a tcp: partially undo D43792
At the destruction of the tcpcb, no timers are supposed to
be running. However, it turns out that stopping them in the
close() / shutdown() call does not have the desired effect
under all circumstances.

This partially reverts 62d47d73b7 to reduce the nuisance
caused.

PR:			277009
Reported-by:		syzbot+9a9aa434a14a2b35c3ba@syzkaller.appspotmail.com
Reported-by:		syzbot+e82856782410e895bae7@syzkaller.appspotmail.com
Reviewed By:		glebius, tuexen, #transport
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D43855
2024-02-12 22:38:11 +01:00
Richard Scheffenegger
62d47d73b7 tcp: stop timers and clean scoreboard in tcp_close()
Stop timers in tcp_close() instead of in tcp_discardcb().
A connection in CLOSED state shall not need any timers. Assert that no
timer is rescheduled after that in tcp_timer_activate(), and verify that
this is also the expected state in tcp_discardcb().

PR:			276761
Reviewed By:		glebius, tuexen, #transport
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D43792
2024-02-10 10:30:00 +01:00
Richard Scheffenegger
a8e817cf5c tcp: stop doing superfluous work after sending RST
When sending a RST control segment in tcp_output() it
means we are in TCPS_CLOSED state, called from tcp_drop().
Once the RST is sent, don't call tcp_timer_activate() or
update anything in tcpcb, since that will go away shortly.

PR:			276761
Provided by:		glebius
Reviewed By:		glebius, tuexen, #transport
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D43808
2024-02-10 10:25:02 +01:00
Richard Scheffenegger
3eeb22cb81 tcp: clean scoreboard when releasing the socket buffer
The SACK scoreboard is conceptually an extension of the socket
buffer. Remove it when the socket buffer goes away with
soisdisconnected(). Verify that this is also the expected
state in tcp_discardcb().

PR:			276761
Reviewed by:		glebius, tuexen, #transport
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D43805
2024-02-10 10:20:00 +01:00
Richard Scheffenegger
23c4f23247 tcp: ensure tcp_sack_partialack does not inflate cwnd after RTO
The implicit assumption of snd_nxt always being larger than
snd_recover is not true after RTO. In that case, cwnd
would get inflated to ssthresh, which may be much larger
than the current pipe (data in flight).

Reviewed By:           tuexen, #transport
Sponsored by:          NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43653
2024-02-08 20:40:25 +01:00
Richard Scheffenegger
32a6df57df tcp: calculate ssthresh on RTO according to RFC5681
Per RFC5681, only adjust ssthresh on the initial
retransmission timeout. Since an RTO often happens
during loss recovery, when cwnd no longer tracks
all data in flight, calculate pipe properly.

Reviewed By:           tuexen, #transport
Sponsored by:          NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43768
2024-02-08 19:18:26 +01:00
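RFC 5681, equation (4), sets ssthresh = max(FlightSize / 2, 2 * SMSS) when a segment is first retransmitted by the timer. The sketch below uses the pipe (data actually in flight) as FlightSize, as 32a6df57df describes, and leaves ssthresh alone on later backoffs. The convention that rxtshift is 1 on the first RTO, and the function name, are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* RFC 5681 eq. (4), applied only on the first timeout. */
static uint32_t
rto_ssthresh(uint32_t pipe, uint32_t smss, int rxtshift, uint32_t ssthresh)
{
	uint32_t half = pipe / 2;
	uint32_t min_thresh = 2 * smss;

	if (rxtshift > 1)
		return (ssthresh);	/* already reduced on the first RTO */
	return (half > min_thresh ? half : min_thresh);
}
```

Using cwnd instead of pipe here would be wrong during loss recovery, where cwnd may be far from the amount of data actually outstanding.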
Richard Scheffenegger
1adab814e8 tcp: use tcp_fixed_maxseg instead of tcp_maxseg in cc modules
tcp_fixed_maxseg() is the streamlined calculation of typical
tcp options and more suitable for heavy use in the congestion
control modules on every received packet.

No external functional change.

Reviewed By:           tuexen, #transport
Sponsored by:          NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43779
2024-02-08 18:36:59 +01:00
Gleb Smirnoff
ce69e37369 Revert "sockets: retire sorflush()"
Provide a comment in sorflush() why the socket I/O sx(9) lock is actually
important.

This reverts commit 507f87a799.
2024-02-03 13:08:41 -08:00
Gleb Smirnoff
f79a8585bb sockets: garbage collect SS_ISCONFIRMING
Fixes:	8df32b19de
2024-01-30 10:38:33 -08:00
Michael Tuexen
f30c7d5654 TCP LRO: convert TCP header fields to host byte order earlier
This is a preparation for adding dtrace hooks in a follow-up commit;
they are missing in the code path where packets are directly queued
to the tcpcb. The dtrace hooks expect the fields to be in host byte
order. This only applies when TCP HPTS is used.
No functional change intended.

Reviewed by:		rscheff
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D43594
2024-01-29 18:52:17 +01:00
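The idea of converting once, early, can be sketched with the standard ntohs()/ntohl() helpers: after the conversion, every later consumer (such as a probe) reads host-order values without converting again. The `struct tcp_fields` layout is a hypothetical subset, not the real struct tcphdr.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified subset of TCP header fields. */
struct tcp_fields {
	uint16_t sport, dport;
	uint32_t seq, ack;
	uint16_t win;
};

/* Convert all fields from network to host byte order in one place. */
static struct tcp_fields
tcp_fields_to_host(struct tcp_fields n)
{
	struct tcp_fields h = {
		.sport = ntohs(n.sport), .dport = ntohs(n.dport),
		.seq = ntohl(n.seq), .ack = ntohl(n.ack),
		.win = ntohs(n.win),
	};

	return (h);
}
```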
Kristof Provost
ffeab76b68 pfil: PFIL_PASS never frees the mbuf
pfil hooks (i.e. firewalls) may pass, modify or free the mbuf passed
to them. (E.g. when rejecting a packet, or when gathering up packets
for reassembly).

If the hook returns PFIL_PASS the mbuf must still be present. Assert
this in pfil_mem_common() and ensure that ipfilter follows this
convention. pf and ipfw already did.
Similarly, if the hook returns PFIL_DROPPED or PFIL_CONSUMED the mbuf
must have been freed (or now be owned by the firewall for further
processing, like packet scheduling or reassembly).

This allows us to remove a few extraneous NULL checks.

Suggested by:	tuexen
Reviewed by:	tuexen, zlei
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D43617
2024-01-29 14:10:19 +01:00
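The ownership contract from ffeab76b68 can be modeled in userland: a hook that passes a packet must leave it in place, and a hook that drops it takes ownership (here it frees it). The result names mirror PFIL_PASS, PFIL_DROPPED and PFIL_CONSUMED, but the types, the hook, and the caller are hypothetical, and clearing the caller's pointer on drop is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Mirrors the pfil result names; the types here are hypothetical. */
enum hook_result { PFIL_PASS, PFIL_DROPPED, PFIL_CONSUMED };

struct pkt {
	size_t len;
};

typedef enum hook_result (*hook_fn)(struct pkt **);

/* Example hook: drops (and frees) empty packets, passes the rest. */
static enum hook_result
drop_empty(struct pkt **pp)
{
	if ((*pp)->len == 0) {
		free(*pp);
		*pp = NULL;
		return (PFIL_DROPPED);
	}
	return (PFIL_PASS);
}

/* Caller side: check the contract once, right after the hook runs. */
static enum hook_result
run_hook(hook_fn hook, struct pkt **pp)
{
	enum hook_result r = hook(pp);

	if (r == PFIL_PASS)
		assert(*pp != NULL);	/* later code needs no NULL check */
	/* On DROPPED/CONSUMED the packet belongs to the hook now. */
	return (r);
}
```

Asserting the invariant at the single call site is what lets the later, scattered NULL checks go away.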
Richard Scheffenegger
0b3f9e435f tcp: move cc_post_recovery past snd_una update
The RFC6675 pipe calculation (sack.revised, enabled
by default since D28702) uses outdated information,
while the previous default calculated it correctly
with up-to-date information from the incoming ACK.

This difference can become as large as the receive
window (not the congestion window previously),
potentially triggering a massive burst of new packets.

MFC after:             1 week
Reviewed By:           tuexen, #transport
Sponsored by:          NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43520
2024-01-28 00:18:51 +01:00
Mark Johnston
bbf86c65d0 netinet: Remove stale references to Giant from comments
MFC after:	1 week
2024-01-27 13:51:13 -05:00
Richard Scheffenegger
2d05a1c81b tcp: commonize check for more data to send, style changes
Use SEQ_SUB instead of a plain subtraction, for an implicit
type conversion and prevention of a possible overflow.
Use curly brackets in stacked if statements throughout.
Use the ? operator to enhance readability when clearing
the FIN flag in tcp_output().

None of the above change the function.

Reviewed By:           tuexen, cc, #transport
Sponsored by:          NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43539
2024-01-26 01:20:35 +01:00
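A sketch of why a signed, modulo-2^32 sequence difference is convenient for a "more data to send" check: the subtraction wraps correctly across the 32-bit sequence space, and the signed result compares safely against a byte count once both are widened. `seq_sub()` is an illustrative stand-in; the real SEQ_SUB in tcp_seq.h may be defined differently, and `more_data_to_send()` is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t tcp_seq;

/* Distance from b to a in sequence space, read as a signed value. */
static inline int32_t
seq_sub(tcp_seq a, tcp_seq b)
{
	return ((int32_t)(a - b));
}

/* Is there unsent data in the send buffer beyond snd_nxt? */
static bool
more_data_to_send(uint32_t sb_avail, tcp_seq snd_nxt, tcp_seq snd_una)
{
	return ((int64_t)sb_avail > (int64_t)seq_sub(snd_nxt, snd_una));
}
```

In the usage checked below, snd_una sits just before the wrap point and snd_nxt just after it; the outstanding data is still computed as 512 bytes.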
Richard Scheffenegger
fc262fd3dc tcp: AccECN access ACE field by shifting bits
Shifting bits is quicker than checking header flag bits
one by one. Also improve readability by the use of switch
statements.

No change in behaviour.

Reviewed By:           glebius, tuexen, #transport
Sponsored by:          NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43560
2024-01-26 00:16:22 +01:00
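The AccECN ACE field is the three adjacent header flags AE, CWR and ECE (AE most significant), so a single mask and shift extracts it as a 3-bit number that a switch can dispatch on. The flag values below are the usual TCP header bits with TH_AE at 0x100; the SYN classification function and enum are hypothetical illustrations of the "switch statement" style, not the FreeBSD code.

```c
#include <assert.h>
#include <stdint.h>

#define TH_ECE	0x040
#define TH_CWR	0x080
#define TH_AE	0x100

/* ACE = AE:CWR:ECE, three adjacent flag bits starting at bit 6. */
static inline unsigned
ace_field(uint16_t flags)
{
	return ((flags & (TH_AE | TH_CWR | TH_ECE)) >> 6);
}

/* Hypothetical classification of a SYN by its ACE value. */
enum syn_ecn { SYN_NOT_ECN, SYN_CLASSIC_ECN, SYN_ACCECN, SYN_OTHER };

static enum syn_ecn
classify_syn(uint16_t flags)
{
	switch (ace_field(flags)) {
	case 0:
		return (SYN_NOT_ECN);
	case 3:				/* CWR + ECE */
		return (SYN_CLASSIC_ECN);
	case 7:				/* AE + CWR + ECE */
		return (SYN_ACCECN);
	default:
		return (SYN_OTHER);
	}
}
```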
Richard Scheffenegger
0932fb565a tcp: fix TCPSTAT accounting for SACK
Account for SACK retransmitted bytes once the actual length
is known. This prevents a call to tcp_maxseg() and prepares
for TSO support when transmitting from the SACK scoreboard.

Reviewed By:           tuexen, #transport
Sponsored by:          NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43447
2024-01-25 22:58:33 +01:00
Richard Scheffenegger
c7c325d01d tcp: pass maxseg around instead of calculating locally
Improve slowpath processing (reordering, retransmissions)
slightly by calculating maxseg only once. This typically
saves one of two calls to tcp_maxseg().

Reviewed By:           glebius, tuexen, cc, #transport
Sponsored by:          NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43536
2024-01-24 16:43:29 +01:00
Gleb Smirnoff
90ad2dc287 tcp: remove 20+ year old disabled code from d912c694ee
2024-01-23 13:16:34 -08:00
Gleb Smirnoff
c809435b18 tcp: clear outdated comment mentioning T/TCP
2024-01-23 12:59:21 -08:00
Gleb Smirnoff
e21c668719 tcp: pass positive errno to tcp_drop()
Fixes:	446ccdd08e
2024-01-23 12:59:21 -08:00
Gordon Bergling
9b035689f1 tcp_fastopen: Fix a typo in a source code comment
- s/posession/possession/

MFC after:	3 days
2024-01-22 21:49:47 +01:00
Gleb Smirnoff
7f3184ba79 tcp: remove outdated comment
This paragraph should have been removed in 446ccdd08e.
2024-01-22 12:42:21 -08:00
Gordon Bergling
ef0ac0a1ad tcp_hpts: Fix a typo of a function name in a comment
- s/tcp_ouput/tcp_output/

MFC after:	3 days
2024-01-20 17:29:28 +01:00
Richard Scheffenegger
dfe30e4196 tcp: remove unused tcp_sack_output_debug() function
This debugging code has been lingering for years with
no known use.

No functional change.

Reviewed by:           tuexen, #transport
Sponsored by:          NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43511
2024-01-19 14:48:32 +01:00
Gleb Smirnoff
a079c891c0 sctp: restore missing inpcb lock
Fixes:	5bba272807
Reported-by: syzbot+b8636c973dc20fea4a9b@syzkaller.appspotmail.com
Reported-by: syzbot+d76a18ee8bbe6f7d3056@syzkaller.appspotmail.com
2024-01-16 23:11:27 -08:00
Xavier Beaudouin
80044c785c Add UDP encapsulation of ESP in IPv6
This patch provides UDP encapsulation of ESP packets over IPv6.
Ports the IPv4 code to IPv6 and adds support for IPv6 in udpencap.c
As required by the RFC and unlike in IPv4 encapsulation,
UDP checksums are calculated.

Co-authored-by:	Aurelien Cazuc <aurelien.cazuc.external@stormshield.eu>
Sponsored-by:	Stormshield
Sponsored-by:	Wiktel
Sponsored-by:	Klara, Inc.
2024-01-16 20:44:34 +00:00
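Unlike over IPv4, a UDP checksum over IPv6 is mandatory and covers an IPv6 pseudo-header (source, destination, upper-layer length, next header = 17). The sketch below is a generic RFC 8200 style computation, not the udpencap.c code; function names are hypothetical and datagrams are assumed to be at most 65535 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add big-endian 16-bit words of p[0..len); an odd byte is zero-padded. */
static uint32_t
sum16(const uint8_t *p, size_t len, uint32_t acc)
{
	while (len > 1) {
		acc += ((uint32_t)p[0] << 8) | p[1];
		p += 2;
		len -= 2;
	}
	if (len == 1)
		acc += (uint32_t)p[0] << 8;
	return (acc);
}

/* One's-complement checksum over the IPv6 pseudo-header and datagram. */
static uint16_t
udp6_cksum(const uint8_t src[16], const uint8_t dst[16],
    const uint8_t *udp, size_t len)
{
	uint32_t acc = 0;

	acc = sum16(src, 16, acc);
	acc = sum16(dst, 16, acc);
	acc += (uint32_t)(len >> 16) + (uint32_t)(len & 0xffff);
	acc += 17;			/* next header: UDP */
	acc = sum16(udp, len, acc);
	while (acc >> 16)
		acc = (acc & 0xffff) + (acc >> 16);
	return ((uint16_t)~acc);
}

/* Value to put on the wire: over IPv6 a computed 0 is sent as 0xffff. */
static uint16_t
udp6_tx_cksum(const uint8_t src[16], const uint8_t dst[16],
    const uint8_t *udp, size_t len)
{
	uint16_t c = udp6_cksum(src, dst, udp, len);

	return (c == 0 ? 0xffff : c);
}
```

A receiver runs the same sum over the datagram with the checksum field filled in; a result of 0 means the checksum is valid.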
Gleb Smirnoff
507f87a799 sockets: retire sorflush()
With removal of dom_dispose method the function boils down to two
meaningful function calls: socantrcvmore() and sbrelease().  The latter is
only relevant for protocols that use generic socket buffers.

The socket I/O sx(9) lock acquisition in sorflush() is not relevant for
shutdown(2) operation as it doesn't do any I/O that may interleave with
read(2) or write(2).  The socket buffer mutex acquisition inside
sbrelease() is what guarantees thread safety.  This sx(9) acquisition in
soshutdown() can be tracked down to 4.4BSD times, where it used to be
sblock(), and it was carried over through the years evolving together with
sockets with no reconsideration of why we carry it over.  I can't tell
if that sblock() made sense back then, but it doesn't make any today.

Reviewed by:		tuexen
Differential Revision:	https://reviews.freebsd.org/D43415
2024-01-16 10:30:49 -08:00
Gleb Smirnoff
5bba272807 sockets: make pr_shutdown fully protocol specific method
Disassemble a one-for-all soshutdown() into protocol specific methods.
This creates a small amount of copy & paste, but makes code a lot more
self-documenting, as each protocol-specific method executes only the code
that is relevant to that protocol and nothing else.  This also fixes a
couple recent regressions and reduces risk of future regressions.  The
extended KPI for the new pr_shutdown removes need for the extra pr_flush
which was added for the sake of SCTP which could not perform its shutdown
properly with the old one.  Particularly for SCTP this change streamlines
a lot of code.

Some notes on why certain parts of code were copied or were not to certain
protocols:
* The (SS_ISCONNECTED | SS_ISCONNECTING | SS_ISDISCONNECTING) check is
  needed only for those protocols that may be connected or disconnected.
* The above reduces into only SS_ISCONNECTED for those protocols that
  always connect instantly.
* The ENOTCONN and continue processing hack is left only for datagram
  protocols.
* The SOLISTENING(so) block is copied to those protocols that listen(2).
* sorflush() on SHUT_RD is copied to almost every protocol, but that
  will be refactored later.
* wakeup(&so->so_timeo) is copied to protocols that can make a non-instant
  connect(2), can SO_LINGER or can accept(2).

There are three protocols (netgraph(4), Bluetooth, SDP) that did not have
pr_shutdown, but old soshutdown() would still perform sorflush() on
SHUT_RD for them and also wakeup(9).  Those protocols partially supported
shutdown(2), returning EOPNOTSUPP for SHUT_WR/SHUT_RDWR; now they have fully lost
shutdown(2) support.  I'm pretty sure netgraph(4) and Bluetooth are okay
about that and SDP is almost abandoned anyway.

Reviewed by:		tuexen
Differential Revision:	https://reviews.freebsd.org/D43413
2024-01-16 10:30:37 -08:00
Gleb Smirnoff
d4033ebd05 divert: just return EOPNOTSUPP on shutdown(2)
Before this change we would always return ENOTCONN.  There is no
legitimate use of shutdown(2) on divert(4).
2024-01-12 02:04:04 -08:00
Michael Tuexen
13720136fb tcpsso: fix when used without -i option
Since fdb987bebd it is no longer possible to use the inp_next
iterator for bound but unconnected sockets. This applies
to TCP listening sockets. Therefore the mentioned commit broke
tcpsso on listening sockets if the -i option was not used.
Fix this by iterating through all endpoints instead of only
through the bound but unconnected ones.

Reviewed by:		markj
Fixes:			fdb987bebd ("inpcb: Split PCB hash tables")
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D43353
2024-01-10 08:33:09 +01:00
John Baldwin
8cb9b68f58 sys: Use mbufq_empty instead of comparing mbufq_len against 0
Reviewed by:	bz, emaste
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D43338
2024-01-09 11:00:46 -08:00
Richard Scheffenegger
429f14f83a tcp: clean PRR state after ECN congestion recovery.
PRR state was not properly reset on subsequent ECN CE
events. Clean up after local transmission failures too.

Reviewed by:           tuexen, cc, #transport
MFC after:             3 days
Sponsored by:          NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43170
2024-01-08 10:53:04 +01:00
Richard Scheffenegger
f4574e2dc5 tcp: prevent spurious empty segments and fix uncommon panic
Only try sending more data on pure ACKs when there is
more data available in the send buffer.

In the case of a retransmitted SYN not being sent due to
an internal error, the snd_una/snd_nxt accounting could
be off, leading to a panic. Pulling snd_nxt up to snd_una
prevents this from happening.

Reported by:           fengdreamer@126.com
Reviewed by:           cc, tuexen, #transport
MFC after:             1 week
Sponsored by:          NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43343
2024-01-08 10:52:49 +01:00
Richard Scheffenegger
30409ecdb6 tcp: do not purge SACK scoreboard on first RTO
Keeping the SACK scoreboard intact after the first RTO
and retransmitting all data anew only on subsequent RTOs
allows a more timely and efficient loss recovery under
many adverse circumstances.

Reviewed By:           tuexen, #transport
MFC after:             10 weeks
Sponsored by:          NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D42906
2024-01-06 20:25:38 +01:00
Richard Scheffenegger
893ed42eca tcp: Make use of enum for sack_changed
No functional change.

Reviewed By:           tuexen, #transport
MFC after:             3 days
Sponsored by:          NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43346
2024-01-06 20:23:52 +01:00