Commit graph

8137 commits

Author SHA1 Message Date
Gleb Smirnoff
e9255dafa1 pf: netlink KPI use cleanup
- use nlf_p_empty instead of declaring own empty array
- don't declare _IN() macro when we don't parse a header

Reviewed by:		kp
Differential Revision:	https://reviews.freebsd.org/D48306
2025-01-03 14:25:10 -08:00
Michael Tuexen
88766e7af5 TCP BBR: fix integer overflow
Use 64-bit arithmetic.

Reviewed by:		rrs
CID:			1523806
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D48302
2025-01-03 19:44:43 +01:00
Michael Tuexen
4173a3a009 TCP BBR: simplify expression
rsm cannot be NULL, when calling bbr_update_bbr_info().
So no need to check partially for it. No functional change intended.

Reviewed by:		rrs
CID:			1523803
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D48293
2025-01-03 14:22:25 +01:00
Michael Tuexen
deb4252e9e TCP RACK: remove un-needed assignment
No functional change intended.

Reviewed by:		rrs
CID:			1523768
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D48292
2025-01-03 14:20:19 +01:00
Michael Tuexen
8471791eb6 TCP RACK: simplify condition
It is already known that rsm != NULL, so no need to check for it.

Reviewed by:		rrs
CID:			1523815
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D48282
2025-01-02 16:19:30 +01:00
Michael Tuexen
c7e81cc043 TCP BBR: do not log an uninitialized value
Reviewed by:		rrs
CID:			1523789
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D48281
2025-01-02 16:17:05 +01:00
Michael Tuexen
3b9da3dcd1 TCP RACK: avoid using uninitialized tot_idle variable
Reviewed by:		rrs
CID:			1540027
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D48277
2025-01-01 18:42:00 +01:00
Michael Tuexen
1781324db2 TCP BBR: remove code which is never executed
USEC_2_TICKS() returns at least 1.

Reviewed by:		rrs
CID:			1523775
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D4827
2025-01-01 18:39:23 +01:00
Michael Tuexen
5ec914e06c TCP BBR: fix condition when sending a tail loss probe
Reviewed by:		rrs
CID:			1523793
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D48274
2024-12-31 22:03:13 +01:00
Michael Tuexen
0ce13b1d58 TCP RACK: add comment
Indicate that the missing of the break is intentionally.

Reviewed by:		rrs
CID:			1523782
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D48273
2024-12-31 21:51:58 +01:00
Michael Tuexen
b47dcb4b1f TCP BBR: fix getsockopt() for TCP_BBR_USEDEL_RATE
Actually implement the IPPROTO_TCP-level socket option
TCP_BBR_USEDEL_RATE.

Reviewed by:		rrs
CID:			1523813
CID:			1523814
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D48261
2024-12-31 17:29:04 +01:00
Michael Tuexen
4f3a0c7197 TCP RACK: don't use an uninitialized variable
When storing the old beta values in rack_swap_beta_values(),
ensure that the newreno_flags field is initialized appropriately
instead of using an uninitialized value.
Since the stored newreno_flags aren't actually used, this fix
should not have any functional change.

Reviewed by:		rrs
CID:			1523796
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D48260
2024-12-31 17:26:04 +01:00
Michael Tuexen
4940584bbf TCP RACK, BBR: cleanup of ctf_process_inbound_raw()
Instead of dealing with ifp == NULL, which should never happen,
assume that this is not true. Use KASSERT to make this clear.
No functional change intended.

Reviewed by:		glebius, rrs
CID:			1523767
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D48258
2024-12-31 17:22:03 +01:00
Ed Maste
48ef7ed72a Clarify net.inet.ip.allow_net240 and allow_net0
The stack has never limited use of addresses in these ranges as an
endpoint.  The relatively recent sysctls control only forwarding of,
and ICMP response to, these addresses.

Reviewed by: bz
Fixes: efe58855f3 ("IPv4: experimental changes to allow net 0/8, 240/4, part of 127/8")
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D48262
2024-12-31 14:47:32 +00:00
Michael Tuexen
b5739c8b12 TCP RACK, BBR: ensure return value is always ininitialized
Do not return an uninitialized value from ctf_do_queued_segments()
in case no packets are actually processed (all are skipped).

Reviewed by:		rrs
CID:			1523774
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D48217
2024-12-30 16:02:43 +01:00
Michael Tuexen
16e8e99f1d TCP RACK: remove redundant check
No functional change intended.

Reviewed by:		rrs
CID:			1523811
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D48216
2024-12-30 16:00:11 +01:00
Michael Tuexen
895347fc10 TCP BBR: remove assignments without effect
No functional change intended.

Reviewed by:		rrs
CID:			1523772
CID:			1523777
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D48215
2024-12-30 15:57:11 +01:00
Gleb Smirnoff
053a988497 tcp: don't ever return ECONNRESET on close(2)
The SUS doesn't mention this error code as a possible one [1]. The FreeBSD
manual page specifies a possible ECONNRESET for close(2):

[ECONNRESET]	The underlying object was a stream socket that was
		shut down by the peer before all pending data was
		delivered.

In the past it had been EINVAL (see 21367f630d), and this EINVAL was
added as a safety measure in 623dce13c6.  After conversion to
ECONNRESET it had been documented in the manual page in 78e3a7fdd5, but
I bet wasn't ever tested to actually be ever returned, cause the
tcp-testsuite[2] didn't exist back then.  So documentation is incorrect
since 2006, if my bet wins.  Anyway, in the modern FreeBSD the condition
described above doesn't end up with ECONNRESET error code from close(2).
The error condition is reported via SO_ERROR socket option, though.  This
can be checked using the tcp-testsuite, temporarily disabling the
getsockopt(SO_ERROR) lines using sed command [3].  Most of these
getsockopt(2)s are followed by '+0.00 close(3) = 0', which will confirm
that close(2) doesn't return ECONNRESET even on a socket that has the
error stored, neither it is returned in the case described in the manual
page.  The latter case is covered by multiple tests residing in tcp-
testsuite/state-event-engine/rcv-rst-*.

However, the deleted block of code could be entered in a race condition
between close(2) and processing of incoming packet, when connection had
already been half-closed with shutdown(SHUT_WR) and sits in TCPS_LAST_ACK.
This was reported in the bug 146845.  With the block deleted, we will
continue into tcp_disconnect() which has proper handling of INP_DROPPED.

The race explanation follows.  The connection is in TCPS_LAST_ACK.  The
network input thread acquires the tcpcb lock first, sets INP_DROPPED,
acquires the socket lock in soisdisconnected() and clears SS_ISCONNECTED.
Meanwhile, the syscall thread goes through sodisconnect() which checks for
SS_ISCONNECTED locklessly(!).  The check passes and the thread blocks on
the tcpcb lock in tcp_usr_disconnect().  Once input thread releases the
lock, the syscall thread observes INP_DROPPED and returns ECONNRESET.

- Thread 1: tcp_do_segment()->tcp_close()->in_pcbdrop(),soisdisconnected()
- Thread 2: sys_close()...->soclose()->sodisconnect()->tcp_usr_disconnect()

Note that the lockless operation in sodisconnect() isn't correct, but
enforcing the socket lock there will not fix the problem.

[1] https://pubs.opengroup.org/onlinepubs/9799919799/
[2] https://github.com/freebsd-net/tcp-testsuite
[3] sed -i "" -Ee '/\+0\.00 getsockopt\(3, SOL_SOCKET, SO_ERROR, \[ECONNRESET\]/d' $(grep -lr ECONNRESET tcp-testsuite)

PR:			146845
Reviewed by:		tuexen, rrs, imp
Differential Revision:	https://reviews.freebsd.org/D48148
2024-12-23 10:35:49 -08:00
Mark Johnston
c9756953bd inpcb: Further restrict binding to a port owned by a different UID
See commit 4f02a7d739 for more background.

I cannot see a good reason to continue ignoring mismatching UIDs when
binding to INADDR_ANY.  Looking at the sdr.V2.4a7n sources (mentioned in
bugzilla PR 7713), there is a CANT_MCAST_BIND hack wherein the
application binds to INADDR_ANY instead of a multicast address, but
CANT_MCAST_BIND isn't defined for FreeBSD builds.

It seems unlikely that we still have a use-case for allowing sockets
from different UIDs to bind to the same port when binding to the
unspecified address.  And, as noted in D47832, applications like sdr
would have been broken by the inverted SO_REUSEPORT check removed in
that revision, apparently without any bug reports.  Let's break
compatibility and simply disallow this case outright.

Also, add some comments, remove a hack in a regression test which tests
this funtionality, and add a new regression test to exercise the
remaining checks that were added in commit 4658dc8325.

MFC after:	1 month
Sponsored by:	Klara, Inc.
Sponsored by:	Stormshield
Differential Revision:	https://reviews.freebsd.org/D47870
2024-12-23 15:41:06 +00:00
Gleb Smirnoff
c91dd7a054 tcp: remove unused variable from tcp_usr_disconnect() 2024-12-18 20:11:34 -08:00
Kristof Provost
e4e0f49742 in: add in_mask2len()
Similar to the existing in6_mask2len() function, but for IPv4. This will be used
by pf's nat64 code.

Obtained from:	OpenBSD
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D47785
2024-12-17 11:07:12 +01:00
Gleb Smirnoff
3604a050ee tcp_hpts: refactor the per tcpcb call to either input/output method
Either input or output return unlocked on failure.  Should be no
functional change.

Reviewed by:		rrs
Differential Revision:	https://reviews.freebsd.org/D47925
2024-12-16 06:52:06 -08:00
Michael Tuexen
c9febea3dc icmp: improve INVARIANTS check
Actually check the conditions that are enforced by the error checking
code instead of a condition which is
* checking a number to be non-negative instead of positive
* depending on a random number
Perform the checks consistently for ICMPv4 and ICMPv6.

Reviewed by:		glebius, rrs, cc
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D48001
2024-12-12 15:40:49 +01:00
Mark Johnston
4f02a7d739 inpcb: Remove bogus SO_REUSEPORT(_LB) checks in in_pcbbind()
This check for SO_REUSEPORT was added way back in commit 52b65dbe85.
Per the commit log, this commit restricted this port-stealing check to
unicast addresses, and then only if the existing socket does not have
SO_REUSEPORT set.  In other words, if there exists a socket bound to
INADDR_ANY, and we bind a socket to INADDR_ANY with the same port, then
the two sockets need not be owned by the same user if the existing
socket has SO_REUSEPORT set.

This is a surprising semantic; bugzilla PR 7713 gives some additional
context.  That PR makes a case for the behaviour described above when
binding to a multicast address.  But, the SO_REUSEPORT check is only
applied when binding to a non-multicast address, so it doesn't really
make sense.  In the PR the committer notes that "unicast applications
don't set SO_REUSEPORT", which makes some sense, but also refers to
"multicast applications that bind to INADDR_ANY", which sounds a bit
suspicious.

OpenBSD performs the multicast check, but not the SO_REUSEPORT check.
DragonflyBSD removed the SO_REUSEPORT (and INADDR_ANY) checks back in
2014 (commit 0323d5fde12a4).  NetBSD explicitly copied our logic and
still has it.

The plot thickens: 20 years later, SO_REUSEPORT_LB was ported from
DragonflyBSD: this option provides similar semantics to SO_REUSEPORT,
but for unicast addresses it causes incoming connections/datagrams to be
distributed among all sockets in the group.  This commit (1a43cff92a)
inverted the check for SO_REUSEPORT while adding one for
SO_REUSEPORT_LB; this appears to have been inadvertent.  However:
- apparently no one has noticed that the semantics were changed;
- sockets belonging to different users can now be bound to the same port
  so long as they belong to a single lbgroup bound to INADDR_ANY, which
  is not correct.

Simply remove the SO_REUSEPORT(_LB) checks, as their original
justification was dubious and their current implementation is wrong; add
some tests.

Reviewed by:	glebius
MFC after:	1 month
Sponsored by:	Klara, Inc.
Sponsored by:	Stormshield
Differential Revision:	https://reviews.freebsd.org/D47832
2024-12-12 14:25:15 +00:00
Mark Johnston
a600aabe9b inpcb: Close some SO_REUSEPORT_LB races
For a long time, the inpcb lookup path has been lockless in the common
case: we use net_epoch to synchronize lookups.  However, the routines
which update lbgroups were not careful to synchronize with unlocked
lookups.  I believe that in the worst case this can result in spurious
connection aborts (I have a regression test case to exercise this), but
it's hard to be certain.

Modify in_pcblbgroup* routines to synchronize with unlocked lookup:
- When removing inpcbs from an lbgroup, do not shrink the array.
  The maximum number of lbgroup entries is INPCBLBGROUP_SIZMAX (256),
  and it doesn't seem worth the complexity to shrink the array when a
  socket is removed.
- When resizing an lbgroup, do not insert it into the hash table until
  it is fully initialized; otherwise lookups may observe a partially
  constructed lbgroup.
- When adding an inpcb to the group, increment the counter after adding
  the array entry, using a release store.  Otherwise it's possible for
  lookups to observe a null array slot.
- When looking up an entry, use a corresponding acquire load.

Reviewed by:	ae, glebius
MFC after:	1 month
Sponsored by:	Klara, Inc.
Sponsored by:	Stormshield
Differential Revision:	https://reviews.freebsd.org/D48020
2024-12-12 14:02:12 +00:00
Zhenlei Huang
ac51711cab netinet: Use NULL for VNET_SYSINIT's last arg, which is a pointer type
MFC after:	3 days
2024-12-10 01:14:08 +08:00
Zhenlei Huang
ec6e7677a0 inpcb: Use NULL for VNET_SYSINIT's last arg, which is a pointer type
MFC after:	3 days
2024-12-10 01:14:08 +08:00
Mark Johnston
ffb3d384fc inpcb: Fix the GENERIC-NODEBUG build
Fixes:	01f8ce8324 ("inpcb: Factor out parts of in6_pcbbind() and in_pcbbind_setup()")
2024-12-05 16:45:26 +00:00
Damjan Jovanovic
61bf830cbb libalias: Add support for EIM NAT
Add support for endpoint-independent mapping ("full cone NAT") in
Libalias's UDP NAT.

This conforms to RFC 4787 requirements 1 and 3. All UDP packets sent out from a
particular internal address:port leave via the same NAT address:port,
regardless of their destination.

Add some libalias tests and supporting defines.

Reviewed by:    igoro, thj
Differential Revision:  https://reviews.freebsd.org/D46689D
2024-12-05 16:19:13 +00:00
Mark Johnston
01f8ce8324 inpcb: Factor out parts of in6_pcbbind() and in_pcbbind_setup()
A large portion of these functions just determines whether the inpcb can
bind to the address/port.  This portion has no side effects, so is a
good candidate to move into its own helper function.  This patch does
so, making the callers less complicated and reducing indentation.

While moving this code, also make some changes:
- Load socket options (SO_REUSEADDR etc.) only once.  There is nothing
  preventing another thread from toggling the socket options, so make
  this function easier to reason about by avoiding races.
- When checking whether the bind address is an interface address, make a
  separate sockaddr rather than temporarily modifying the one passed to
  in_pcbbind().

Reviewed by:	ae, glebius
MFC after:	1 month
Sponsored by:	Klara, Inc.
Sponsored by:	Stormshield
Differential Revision:	https://reviews.freebsd.org/D47590
2024-12-05 15:19:57 +00:00
Gleb Smirnoff
b2bde8a6d3 tcp_hpts: consistenly use macros to lock & unlock
The macros version of lock/unlock has already been used 23 times in this
file and the bare version was used 6 times only, so prefer the former.
No functional change.
2024-12-04 12:16:42 -08:00
Gleb Smirnoff
5cb73dbe48 tcp_hpts: use booleans for tcp_hptsi() local variables
No functional change.
2024-12-04 12:15:46 -08:00
Gleb Smirnoff
63446fd354 tcp_hpts: use boolean to tell is it callout or userret context
No functional change.
2024-12-04 12:14:44 -08:00
Gleb Smirnoff
29f6150256 netlink: use nitems() and roundup(2) from param.h
While here style nested includes (kernel ones go first).

Reviewed by:		melifaro
Differential Revision:	https://reviews.freebsd.org/D47557
2024-12-03 12:04:39 -08:00
Richard Scheffenegger
347dd0539f tcp: add TH_AE capabilities to ppp and pf
Add support for the AE Flag in the TCP header to pf and ppp.
Commonalize to the use of "E"(ECE), "W"(CWR) and "e"(AE)
for the TCP header flags, in line with tcpdump.

Reviewers: kp, cc, tuexen, cy, #transport!
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D47106
2024-11-29 10:04:31 +01:00
Richard Scheffenegger
0fc7bdc978 tcp: extend the use of the th_flags accessor function
Formally, there are 12 bits for TCP header flags.
Use the accessor functions in more (kernel) places.

No functional change.

Reviewed By: cc, #transport, cy, glebius, #iflib, kbowling
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D47063
2024-11-29 09:48:23 +01:00
Zhenlei Huang
949190c5af udp: Prefer memcpy() over bcopy()
The variable b[] is on the stack, thus cannot overlap with ipov, which
points to the heap area, so prefer memcpy() over memmove(), aka bcopy().

No functional change intended.

Reviewed by:	cc, rrs, cy, #transport, #network
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D47713
2024-11-28 18:04:23 +08:00
Gleb Smirnoff
3789810845 tcp: avoid bcopy() in tcp_mss_update() 2024-11-20 16:37:24 -08:00
Gleb Smirnoff
2944a888ea tcp: remove so != NULL check
In the modern FreeBSD network stack a socket outlives its tcpcb.
2024-11-20 16:37:18 -08:00
Gleb Smirnoff
b80c06cc0a tcp: use const argument in the TCP hostcache KPI
The hostcache can't modify tcpcb, inpcb or connection info.
2024-11-20 16:30:42 -08:00
Gleb Smirnoff
09000cc133 tcp: mechanically rename hostcache metrics structure fields
Use hc_ prefix instead of rmx_.  The latter stands for "route metrix" and
is an artifact from the 90-ies, when TCP caching was embedded into the
routing table.  The rename should have happened back in 97d8d152c2.

No functional change. Done with sed(1) command:

s/rmx_(mtu|ssthresh|rtt|rttvar|cwnd|sendpipe|recvpipe|granularity|expire|q|hits|updates)/hc_\1/g
2024-11-20 16:29:00 -08:00
Kristof Provost
e27970ae8f netinet: handle blackhole routes
If during ip_forward() we find a blackhole (or reject) route we should stop
processing and count this in the 'cantforward' counter, just like we already do
for IPv6.
Blackhole routes are set to use the loopback interface, so we don't actually
incorrectly forward traffic, but we do fail to count it as unroutable.

Test this, both for IPv4 and IPv6.

Reviewed by:	melifaro
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D47529
2024-11-20 16:52:41 +01:00
Kristof Provost
438ca68cef netinet: default mib counter probe points off
Disable the IP/IP6/ICMP/... counter probe points by default.
They are kept enabled in debug builds, and can be enabled with
'options KDTRACE_MIB_SDT'.

Requested by:	glebius
Reviewed by:	glebius
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D47657
2024-11-20 09:52:48 +01:00
Michael Tuexen
8caa2f5351 tcp: define tcp_lro_log() only when TCP_BLACKBOX is defined
Reviewed by:		rrs, Peter Lei
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D47401
2024-11-17 19:21:01 +01:00
Randall Stewart
12fc79619a Change the SOCKBUF_LOCK calls to use the more refined SOCK_XXXBUF_LOCK/UNLOCK.
The socket buffer locking used to be standard on SOCKBUF_LOCK/UNLOCK. But we are now
moving to a more elegant SOCK_SENDBUF_LOCK/UNLOCK and SOCK_RECVBUF_LOCK/UNLOCK.
Lets get BBR and Rack to use these updated macros.

Reviewed by:glebius, tuexen, rscheff
Differential Revision:https://reviews.freebsd.org/D47542
2024-11-15 12:37:05 -05:00
Gleb Smirnoff
0b4539ee54 inpcb: gc unused argument of in_pcbconnect() 2024-11-14 11:39:13 -08:00
Gleb Smirnoff
fb7c1ac5ac tcp: remove the looping on pcb count in tcp_destroy()
This was useful when TCP timers were not able to reliably stop. Note that
in_pcbinfo_destroy() called later asserts that V_tcbinfo.ipi_count is 0.

This reverts 806929d514, b54e08e11a.
2024-11-14 11:39:12 -08:00
Gleb Smirnoff
81f08f3038 siftr: remove pointless assertion
The assertion is correct, but isn't useful.  Also it contradicts
its own comment.
2024-11-14 11:39:12 -08:00
Richard Scheffenegger
22dcc81293 tcp: Use segment size excluding tcp options for all cwnd calculations
Avoid sending small segments by making sure that cwnd is usually
calculated in full (data) segment sizes. Especially during loss
recovery and retransmission scenarios.

Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D47474
2024-11-14 10:16:57 +01:00
Richard Scheffenegger
8f5a2e216f tcp: fix cwnd recalculation during limited transmit
Properly calculate the expected flight size (cwnd) during
limited transmit. Exclude the SACK scoreboard from
consideration when still in limited transmit.

PR: 282605
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D47541
2024-11-14 09:19:49 +01:00