7946 commits

Author SHA1 Message Date
Michael Tuexen
1941914d3b tcp rack: improve BBR_LOG_CWND event
Fix a typo, which resulted in missing r_ctl.gate_to_fs in the BBLog
event.

Reported by:		Coverity Scan
CID:			1540024
Reviewed by:		rrs, rscheff
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D44648
2024-04-18 21:57:44 +02:00
Michael Tuexen
c9cd686bd4 tcp: drop data received after a FIN has been processed
RFC 9293 describes the handling of data in the CLOSE-WAIT, CLOSING,
LAST-ACK, and TIME-WAIT states:
This should not occur since a FIN has been received from the remote
side. Ignore the segment text.
Therefore, implement this handling.

Reviewed by:		rrs, rscheff
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D44746
2024-04-18 21:54:42 +02:00
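The RFC 9293 rule quoted in this entry can be sketched as a small predicate. This is only an illustration: the enum and the function name `ignore_segment_text()` are hypothetical and are not taken from the FreeBSD sources.

```c
#include <stdbool.h>

/* Connection states named after RFC 9293; the enum itself is illustrative. */
enum conn_state {
	ST_ESTABLISHED,
	ST_FIN_WAIT_1,
	ST_FIN_WAIT_2,
	ST_CLOSE_WAIT,
	ST_CLOSING,
	ST_LAST_ACK,
	ST_TIME_WAIT
};

/*
 * Return true when segment text must be ignored because a FIN from the
 * remote side has already been received in this state.
 */
bool
ignore_segment_text(enum conn_state state)
{
	switch (state) {
	case ST_CLOSE_WAIT:
	case ST_CLOSING:
	case ST_LAST_ACK:
	case ST_TIME_WAIT:
		return (true);
	default:
		return (false);
	}
}
```

A receive path would consult such a predicate before appending payload to the receive buffer; in the other states the data is processed as usual.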
Michael Tuexen
605a00660e tcp bbr: improve code consistency
Improve code consistency with the RACK stack.
Reviewed by:		gallatin, rscheff
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D44800
2024-04-15 23:52:08 +02:00
Mark Johnston
1d14e88e53 tcp: Make tcp_var.h more self-contained
struct tcpcb embeds a struct osd and a struct callout.  Rather than
forcing all consumers to pull in the same headers, include the headers
directly.

No functional change intended.

Reviewed by:	glebius
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D44685
2024-04-10 08:53:49 -04:00
Gleb Smirnoff
f7c4d12bcd icmp: correct the assertion that checks limit + jitter
Fixes:	4399e055ea
2024-04-08 16:54:19 -07:00
Kristof Provost
60d8dbbef0 netinet: add a probe point for IP, IP6, ICMP, ICMP6, UDP and TCP stats counters
When debugging network issues one common clue is an unexpectedly
incrementing error counter. This is helpful, in that it gives us an
idea of what might be going wrong, but often these counters may be
incremented in different functions.

Add a static probe point for them so that we can use dtrace to get
further information (e.g. a stack trace).

For example:
	dtrace -n 'mib:ip:count: { printf("%d", arg0); stack(); }'

This can be disabled by setting the following kernel option:
	options 	KDTRACE_NO_MIB_SDT

Reviewed by:	gallatin, tuexen (previous version), gnn (previous version)
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D43504
2024-04-08 17:29:59 +02:00
Michael Tuexen
e8c149ab85 tcp: add some debug output
Also log, when dropping text or FIN after having received a FIN.
This is the intended behavior described in RFC 9293.
A follow-up patch will enforce this behavior for the base stack
and the RACK stack.
Reviewed by:		rscheff
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D44669
2024-04-07 22:41:24 +02:00
Michael Tuexen
3e1c8a35f7 tcp: improve consistency
No functional change intended.

Reported by:		Coverity Scan
CID:			1523781
Reviewed by:		rscheff
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D44645
2024-04-06 10:02:06 +02:00
Michael Tuexen
d902c8f55b tcp rack: fix memory corruption
When in rack_output() jumping to the label out, don't write errno into
the log buffer, since the pointer is not initialized.

Reported by:		Coverity Scan
CID:			1523773
Reviewed by:		rscheff
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D44647
2024-04-06 09:55:46 +02:00
Michael Tuexen
7df0ef5f48 tcp rack: fix sending
In rack_output(), idle is used as a boolean variable. So don't use it
as an int and don't clear it afterwards.
This avoids setting idle to false, when it is not intended.

Reported by:		olivier
Reviewed by:		rrs, rscheff
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D44610
2024-04-05 17:47:03 +02:00
Michael Tuexen
60bc195745 tcp bblog: cleanup
Remove redundant checks and improve error checking.

Reported by:		Coverity Scan
CID:			1523780
Reviewed by:		rscheff
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D44616
2024-04-05 17:36:40 +02:00
Michael Tuexen
aaaa01c0c8 tcp hpts: initialize variable
Ensure that tv.tv_sec is zero in all code paths.

Reported by:		Coverity Scan
CID:			1527724
Reviewed by:		rscheff
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D44584
2024-04-05 17:30:31 +02:00
Michael Tuexen
6b454da6bb tcp: address a warning
t_state is an unsigned variable, so no need for testing that it is
non-negative.

Reported by:		Coverity Scan
CID:			1390885
Reviewed by:		glebius
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D44619
2024-04-04 00:14:59 +02:00
Michael Tuexen
e0bd180130 tcp: fix conversion of rttvar
A wrong variable and wrong scaling factors were used.

Reported by:		Coverity Scan
CID:			1508689
Reviewed by:		rscheff
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D44612
2024-04-03 18:39:31 +02:00
Michael Tuexen
5a268d8688 tcp: fix comment
Make the comment consistent with the code.

Reviewed by:		rscheff
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D44611
2024-04-03 18:26:04 +02:00
Michael Tuexen
b600644fdd tcp hpts: improve consistency
The target_slot argument of max_slots_available() can be NULL.
Therefore, check for this in all places.
Right now, all callers provide a non-NULL pointer.

Reported by:		Coverity Scan
CID:			1527732
Reviewed by:		rrs
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D44527
2024-04-01 21:51:59 +02:00
Gleb Smirnoff
1a8d176432 inpcb: fully retire inp_ppcb pointer
Before protocol-specific control blocks started to embed the inpcb
(see 0aa120d52f, e68b379244, 483fe96511), this pointer used to point
at the protocol control block.

Retain kf_sock_inpcb field in the struct kinfo_file in <sys/user.h>.  The
exp-run detected a minimal use of the field in ports:
  * sysutils/lsof - patched upstream
  * net-mgmt/netdata  - patch accepted upstream
  * emulators/qemu-user-static - upstream master branch no longer seems
    to use the field
We can keep the field around for some time, but eventually it may be
reused for something else.

PR:			277659 (exp-run)
Reviewed by:		tuexen
Differential Revision:	https://reviews.freebsd.org/D44491
2024-03-29 12:18:32 -07:00
Randall Stewart
b7b78c1c16 Optimize HPTS so that little work is done until we have a hpts thread that is over the connection threshold
HPTS inserts a softclock for system call return that optimizes performance. However, when
no HPTS threads need the help (i.e. when they have fewer than 100 or so connections),
there should be little work done, i.e. check the counter and return instead of running through
all the threads, getting locks, etc.

Reported by:    eduardo
Reviewed by:    gallatin, glebius, tuexen
Tested by:      gallatin
Differential Revision: https://reviews.freebsd.org/D44420
2024-03-28 08:12:37 -04:00
Michael Tuexen
ed505f893a tcp bblog: use correct length
The length of tldl_reason is TCP_LOG_REASON_LEN, not TCP_LOG_ID_LEN.
No functional change intended.
Reported by:		Coverity Scan
CID:			1418074
CID:			1418276
Reviewed by:		glebius, rscheff
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D44510
2024-03-27 14:31:48 +01:00
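The class of bug fixed here, bounding a copy by the wrong length constant, can be avoided by taking the size from the destination field itself. A minimal sketch follows; the struct, the constants `ID_LEN` and `REASON_LEN`, and `set_reason()` are hypothetical stand-ins, not the bblog definitions.

```c
#include <stdio.h>

#define	ID_LEN		64	/* hypothetical sizes, not taken from the source */
#define	REASON_LEN	32

struct dump_req {
	char	id[ID_LEN];
	char	reason[REASON_LEN];
};

/*
 * Copy src into the reason field, bounded by the size of that field
 * rather than by an unrelated constant.  Returns 0 on success and -1
 * if the string had to be truncated.
 */
int
set_reason(struct dump_req *r, const char *src)
{
	int n = snprintf(r->reason, sizeof(r->reason), "%s", src);

	return (n >= 0 && (size_t)n < sizeof(r->reason)) ? 0 : -1;
}
```

Using `sizeof(r->reason)` keeps the bound correct even if the two constants later diverge further.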
Gleb Smirnoff
4399e055ea icmp: allow zero value for ICMP limits
Zero means limit is disabled, so the value doesn't need to be checked
against jitter value.

Fixes:	ac44739fd8
Fixes:	a03aff88a1
2024-03-24 19:52:03 -07:00
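Taken together with the jitter check introduced in ac44739fd8, the validation rule can be sketched as follows. The helper name `icmplim_check()` and its signature are assumptions made for illustration; the actual sysctl handler is not reproduced here.

```c
#include <errno.h>

/*
 * Hypothetical validation helper: returns 0 when the pair is acceptable
 * and EINVAL otherwise.  A limit of zero means rate limiting is disabled,
 * so the jitter is not checked in that case.
 */
int
icmplim_check(unsigned int limit, unsigned int jitter)
{
	if (limit == 0)
		return (0);
	if (jitter >= limit)
		return (EINVAL);
	return (0);
}
```

Doing this check once, when a value is set, is what lets the fast path use the stored values without fixing them up.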
Gleb Smirnoff
4f96be33fe icmp6: move ICMPv6 related tunables to the files where they are used
Most of them can be declared as static after the move out of in6_proto.c.
Keeping sysctl(9) declarations with their text descriptions next to the
variable declaration creates self-documenting code.  There should be no
functional changes.

Differential Revision:	https://reviews.freebsd.org/D44481
2024-03-24 09:13:23 -07:00
Gleb Smirnoff
ac44739fd8 icmp: improve ICMP limit jitter
Instead of fixing up invalid values set by a user in badport_bandlim()
which is a fast path function, provide a sysctl handler
sysctl_icmplim_and_jitter(), that will check that jitter is less than the
limit.

Provide jitter initialization function icmplim_new_jitter() used at boot,
in the sysctl handler and when we actually hit the limit.  This also fixes
the absence of jitter on a freshly booted system until the first limit hit.

Instead of the CVE number, provide a link to the actual paper that explains
what we are doing here and why.  The CVE number isn't very informative, it
will just tell you what RedHat version you need to upgrade to.

Reviewed by:		kp, tuexen, zlei
Differential Revision:	https://reviews.freebsd.org/D44478
2024-03-24 09:13:23 -07:00
Gleb Smirnoff
b508545ce0 icmp: when logging ICMP ratelimiting message use correct jitter value
The limiting of the very last second has been done using a certain jitter
value.  We update the jitter for the next second.  But the logging should
report the jitter before the change.

Reviewed by:		kp, tuexen, zlei
Differential Revision:	https://reviews.freebsd.org/D44477
2024-03-24 09:13:23 -07:00
Gleb Smirnoff
9d7f17d746 icmp: hide icmp_bandlimit_uninit() under VIMAGE
The uninitialization may be executed only on a kernel with VIMAGE.

Reviewed by:		kp, tuexen, zlei
Differential Revision:	https://reviews.freebsd.org/D44476
2024-03-24 09:13:23 -07:00
Gleb Smirnoff
7142ab4790 icmp: do not store per-VNET identical array of strings
We need per-VNET struct counter_rate, but we don't need per-VNET set of
const char *.  Also, identical word "response" can go into the format
string instead of being stored 7 times.

Reviewed by:		kp, zlei, tuexen
Differential Revision:	https://reviews.freebsd.org/D44475
2024-03-24 09:13:23 -07:00
Michael Tuexen
af700f430f tcp: no data on SYN segments unless doing TFO
Ensure that there is no data on SYN segments unless doing TFO.
This check is already in RACK and BBR.

Reported by:		glebius
Reviewed by:		rscheff
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D44384
2024-03-22 11:12:56 +01:00
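One way to express the rule is a helper that clamps the payload length of a SYN segment to zero unless TCP Fast Open (TFO) is in use. The function `syn_payload_len()` is a hypothetical illustration and does not mirror the code in the base stack.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Return the number of payload bytes that may accompany a segment.
 * A SYN carries no data unless TFO is enabled for the connection.
 */
size_t
syn_payload_len(bool is_syn, bool tfo_enabled, size_t len)
{
	if (is_syn && !tfo_enabled)
		return (0);
	return (len);
}
```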
John Baldwin
3d0a736796 tcp: Add a new kernel-only TCP_USE_DDP socket option
This socket option can be used by in-kernel consumers (like NFS) to
request a NIC to use optimized receive of large buffers for a
connection.  The current use case is to support DDP by the TOE on
Chelsio NICs.

Reviewed by:	rscheff, tuexen, glebius
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D44000
2024-03-20 15:29:02 -07:00
Gleb Smirnoff
56f7860087 carp: check CARP status in in_localip_fib(), in6_localip_fib()
Don't report a BACKUP CARP address as local.  These two functions are used
only by source address validation for input packets, controlled by sysctls
net.inet.ip.source_address_validation and
net.inet6.ip6.source_address_validation.  For this purpose we definitely
want to treat BACKUP addresses as non local.

This change is conservative and doesn't modify compat in_localip() and
in6_localip().  They are used more widely than the FIB-aware versions.
The change would modify the notion of ipfw(4) 'me' keyword.  There might
be other consequences as in_localip() is used by various tunneling
protocols.

PR:			277349
2024-03-19 11:48:59 -07:00
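The decision described above can be sketched as a predicate used only for source address validation. All types and names below are hypothetical; the real functions walk the kernel's address lists, which this sketch does not attempt to model.

```c
#include <stdbool.h>

/* Hypothetical CARP status of an address; not the kernel's definitions. */
enum carp_status { CARP_NONE, CARP_BACKUP, CARP_MASTER };

struct local_addr {
	bool			configured;	/* address exists on this host */
	enum carp_status	carp;
};

/*
 * For source address validation, a configured address counts as local
 * unless it is a CARP address currently in the BACKUP state.
 */
bool
addr_is_local_for_sav(const struct local_addr *a)
{
	if (!a->configured)
		return (false);
	return (a->carp != CARP_BACKUP);
}
```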
Gleb Smirnoff
e34ea0196f tcp: clear all TCP timers in tcp_timer_stop() when in callout
When a TCP callout decides to disable itself, e.g. tcp_timer_2msl() calling
tcp_close(), we must also clear all other possible timers.  Otherwise,
upon return, the callout would be scheduled again in tcp_timer_enter().

Revert 57e27ff07a, which was a temporary partial revert of otherwise
correct 62d47d73b7, that exposed the problem being fixed now.  Add an
extra assertion in tcp_timer_enter() to check we aren't arming callout for
a closed connection.

Reviewed by:	rscheff
2024-03-18 13:57:00 -07:00
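Why every timer has to be cleared can be seen in a simplified model: if the code that runs after a handler re-arms the callout for the earliest remaining deadline, any deadline left behind causes a reschedule. The structure and function names below are hypothetical and much simpler than the real timer code.

```c
#include <stdint.h>

#define	NTIMERS		5		/* hypothetical number of timer slots */
#define	TIMER_NEVER	INT64_MAX	/* sentinel: timer not armed */

struct conn_timers {
	int64_t	deadline[NTIMERS];
};

/* Disarm every timer slot, as needed when a handler closes the connection. */
void
timers_clear_all(struct conn_timers *ct)
{
	for (int i = 0; i < NTIMERS; i++)
		ct->deadline[i] = TIMER_NEVER;
}

/* Return the index of the earliest armed timer, or -1 if none is armed. */
int
timers_next(const struct conn_timers *ct)
{
	int best = -1;

	for (int i = 0; i < NTIMERS; i++) {
		if (ct->deadline[i] == TIMER_NEVER)
			continue;
		if (best == -1 || ct->deadline[i] < ct->deadline[best])
			best = i;
	}
	return (best);
}
```

In this model, a caller that re-arms whenever `timers_next()` returns a valid index only stays idle after `timers_clear_all()` has run.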
Gleb Smirnoff
dd7b86e2a0 tcp: remove IS_FASTOPEN() macro
The macro is more obfuscating than helping as it just checks a single flag
of t_flags.  All other t_flags bits are checked without a macro.

A bigger problem was that declaration of the macro in tcp_var.h depended
on a kernel option.  It is a bad practice to create such definitions in
installable headers.

Reviewed by:		rscheff, tuexen, kib
Differential Revision:	https://reviews.freebsd.org/D44362
2024-03-18 08:56:17 -07:00
Gleb Smirnoff
d62c4607e8 sockets: remove unused KPIs to manipulate sockets
These KPIs were added in dd0e6c383a and through 15 years had zero use.
They slightly resemble what IfAPI does for struct ifnet.  But IfAPI does
that for the sake of a large collection of NIC drivers not being aware of
struct ifnet.  For sockets it is unclear what large collection of
externally written kernel modules would need to use sockets extensively
and not be aware of their internals at the same time.  This isolation of
structure knowledge requires a lot of work, and just throwing in a few
KPIs isn't helpful.

Reviewed by:		kib, olce, markj
Differential Revision:	https://reviews.freebsd.org/D44311
2024-03-18 08:50:30 -07:00
Gleb Smirnoff
027fda80fe inpcb: remove unused KPIs to manipulate inpcbs
These KPIs were added in 9d29c635da and through 15 years had zero use.
They slightly resemble what IfAPI does for struct ifnet.  But IfAPI does
that for the sake of a large collection of NIC drivers not being aware of
struct ifnet.  For the inpcb it is unclear what large collection of
externally written kernel modules would need to use inpcb extensively
and not be aware of its internals at the same time.  This isolation of
structure knowledge requires a lot of work, and just throwing in a few
KPIs isn't helpful.

Reviewed by:		kib, bz, markj
Differential Revision:	https://reviews.freebsd.org/D44310
2024-03-18 08:49:39 -07:00
Gleb Smirnoff
ab8f59ceaf rack: don't define TCPOUTFLAGS
as the code doesn't use tcp_outflags.  This should fix gcc builds.
2024-03-13 21:07:59 -07:00
Konstantin Belousov
220ee18f19 netinet/tcp_var.h: always define IS_FASTOPEN() for kernel compilation env
and drop the definition for userspace (which matched TCP_RFC7413) since
it depends on presence of the kernel option.

Reviewed by:	glebius, rscheff
Sponsored by:	NVIDIA networking
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D44349
2024-03-14 01:20:58 +02:00
Richard Scheffenegger
85df11a1de ktls: deep copy tls_enable struct for in-kernel tcp consumers
Doing a deep copy of the keys early allows users of the
tls_enable structure to assume kernel memory.
This enables the socket options to be set by kernel threads.

Reviewed By:	#transport, tuexen, jhb, rrs
Sponsored by:	NetApp, Inc.
X-NetApp-PR:	#79
Differential Revision:	https://reviews.freebsd.org/D44250
2024-03-13 13:23:13 +01:00
Gleb Smirnoff
e4315bbc85 tcp: move struct tcp_ifcap declaration under _KERNEL
Reviewed by:		rscheff, tuexen, kib
Differential Revision:	https://reviews.freebsd.org/D44340
2024-03-13 12:14:18 -07:00
Randall Stewart
e18b97bd63 Update to bring the rack stack with all its fixes in.
This brings the rack stack up to the current level used at NF. Many fixes
and improvements have been added. I also add in a fix to BBR to deal with
the changes that have been in hpts for a while i.e. only one call no matter
if mbuf queue or tcp_output.

It basically does little except BBlogs and is a placemark for future work on
doing path capacity measurements.

With a bit of a struggle with git I finally got rack_pcm.c into place (apologies
for not noticing this error). The LINT kernel is running on my box now .. sigh.

Reviewed by: tuexen, glebius
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D43986
2024-03-12 07:55:02 -04:00
Brooks Davis
c112243f6b Revert "Update to bring the rack stack with all its fixes in."
This commit was incomplete and breaks LINT kernels.  The tree has been
broken for 8+ hours.

This reverts commit f6d489f402.
2024-03-11 20:28:24 +00:00
Randall Stewart
f6d489f402 Update to bring the rack stack with all its fixes in.
This brings the rack stack up to the current level used at NF. Many fixes
and improvements have been added. I also add in a fix to BBR to deal with
the changes that have been in hpts for a while i.e. only one call no matter
if mbuf queue or tcp_output.

Note there is a new file that I can't figure out how to get in rack_pcm.c

It basically does little except BBlogs and is a placemark for future work on
doing path capacity measurements.

Reviewed by: tuexen, glebius
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D43986
2024-03-11 07:36:54 -04:00
Michael Tuexen
96ad640178 TCP LRO: add dtrace probe points
Add the IP, UDP, and TCP receive static probes to the code path,
which avoids if_input.

Reviewed by:		rrs, markj
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D43727
2024-03-08 10:21:09 +01:00
Michael Tuexen
d1ce01214a TCP LRO: disable mbuf queuing when packet filter hooks are in place
When doing mbuf queueing, the packet filter hooks in ether_demux(),
ip_input(), and ip6_input() are by-passed. This means that the packet
filters don't process incoming packets, which might result in
connection failures. For example bypassing the TCP sequence number
validation will result in dropping valid packets.
Please note that this patch is only disabling mbuf queueing, not LRO.

Reported by:		Herbert J. Skuhra
Reviewed by:		glebius, rrs, rscheff
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D43769
2024-03-08 10:03:43 +01:00
Kyle Evans
47ad4f2d45 ktrace: log genio events on failed write
Visibility into the contents of the buffer when a write(2) has failed
can be immensely useful in debugging IPC issues -- pushing this to
discuss the idea, or maybe an alternative where we can set a flag like
KTRFAC_ERRIO to enable it.

When a genio event is potentially raised after an error, currently we'll
just free the uio and return.  However, such data can be useful when
debugging communication between processes to, e.g., understand what the
remote side should have grabbed before closing a pipe.  Tap out the
entire buffer on failure rather than simply discarding it.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D43799
2024-03-04 23:44:09 -06:00
Randall Stewart
638b5ae1c7 HPTS actually has three states, not two, so the macro needs to account for that.
OK, let's fix up tcp_in_hpts() so that it also says yes if you
are in the race state (moving) and you are scheduled to be put in.
This also requires changing the MPASS to use the old, non-inline
function version of tcp_in_hpts().

This change also adds a new inline macro so that a uint64_t timestamp can be
obtained by a transport (aka Rack will use this).

Reviewed by: glebius, tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D44157
2024-03-01 15:21:15 -05:00
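The three-state check can be sketched as below. The enum values, the `slot` field, and `conn_in_hpts()` are hypothetical names chosen for illustration; they are assumptions, not the definitions used in the tree.

```c
#include <stdbool.h>

/* Hypothetical names for the three states a connection can be in. */
enum hpts_state { HPTS_NONE, HPTS_ONQUEUE, HPTS_MOVING };

struct conn {
	enum hpts_state	state;
	int		slot;	/* requested slot, or -1 if no insertion is scheduled */
};

/*
 * A connection counts as "in HPTS" when it is on a queue, or when it is
 * being moved and is scheduled to be inserted again.
 */
bool
conn_in_hpts(const struct conn *c)
{
	return (c->state == HPTS_ONQUEUE ||
	    (c->state == HPTS_MOVING && c->slot != -1));
}
```

A two-state test that only looks for the on-queue state would answer "no" during the move, which is the race the entry describes.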
Gordon Bergling
6bce41a38e carp(4): Fix a typo in a source code comment
- s/successfull/successful/

MFC after:	3 days
2024-02-27 17:39:57 +01:00
Richard Scheffenegger
8917131e00 tcp: need default in switch statement for enum.
fix clang error after c9b6241e25

Reviewed By: imp
Differential Revision: https://reviews.freebsd.org/D44081
2024-02-25 08:24:13 +01:00
Richard Scheffenegger
c9b6241e25 tcp: address enum-int-mismatch
fix gcc13 error after f74352fbcf
2024-02-25 04:46:39 +01:00
Richard Scheffenegger
5e248c23d9 tcp: retain some CC signals outside of kernel scope
Summary: fix build error after f74352fbcf

Reviewers: #transport!

Subscribers: imp, melifaro, glebius

Differential Revision: https://reviews.freebsd.org/D44066
2024-02-24 21:01:54 +01:00
Michael Tuexen
644cffe67f sctp: improve sending of packets containing an INIT ACK chunk
If the peer announced support of zero checksums, do so when sending
packets containing an INIT ACK chunk.

MFC after:	1 week
2024-02-24 19:16:36 +01:00
Richard Scheffenegger
038699a8f1 tcp: cubic - restart epoch after RTO
This is a mitigation to avoid sudden extreme jumps in
cwnd, as t_epoch can be very out of date after an RTO.
Per RFC9438, sec 4.8, t_epoch is to be reset whenever
cwnd grows beyond ssthresh (CC phase transitions from
slow start to congestion avoidance), to be fixed with
the upcoming cc_cubic changes.

MFC after:		3 days
Reviewed By:		cc, #transport
Sponsored by:		NetApp, Inc
Differential Revision:	https://reviews.freebsd.org/D44023
2024-02-24 17:07:46 +01:00
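The effect of a stale epoch can be illustrated with the CUBIC window function from RFC 9438, W_cubic(t) = C * (t - K)^3 + W_max, where t is the time since the start of the epoch. The helper below is a sketch: `cubic_window()` is a hypothetical name, K is passed in by the caller to avoid depending on libm, and the numbers in the note are example inputs rather than measurements.

```c
/*
 * Evaluate W_cubic(t) = C * (t - K)^3 + W_max.
 * c:     CUBIC scaling constant (0.4 in RFC 9438)
 * t:     seconds elapsed since the start of the epoch
 * k:     seconds needed to grow back to w_max
 * w_max: window size (in segments) before the last reduction
 */
double
cubic_window(double c, double t, double k, double w_max)
{
	double d = t - k;

	return (c * d * d * d + w_max);
}
```

With C = 0.4, K = 5 s and W_max = 100 segments, one second past K gives a target of about 100.4 segments, while an epoch that is 60 seconds past K gives roughly 86,500. Restarting the epoch after an RTO keeps t small and avoids that jump.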
Richard Scheffenegger
40fdc6d25f tcp: provide correct snd_fack on post_recovery
Ensure that snd_fack holds a valid value when doing
the post_recovery CC processing, for preparation of
the cc_cubic update, so that local pipe calculations
can correctly refer to snd_fack during and after CC events.

Reviewed By:		tuexen, #transport
Sponsored by:		NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D43957
2024-02-24 16:55:31 +01:00