8216 commits

Author SHA1 Message Date
Cheng Cui
a8d2bccb87 tcp cc: use tcp_compute_pipe() for pipe in xx_post_recovery() directly
This follows up on commit 67787d2004 and obsoletes the non-default
pipe calculation introduced by commit 46f5848237 nearly 25 years ago.

Reviewed by: rscheff
Differential Revision: https://reviews.freebsd.org/D49247
2025-03-17 09:00:50 -04:00
Gleb Smirnoff
c56e75390e inpcb: make sure we don't pass uninitialized faddr to in_pcbladdr()
This very theoretical edge case was discovered by Coverity.  It is unclear
whether it was introduced by 2af953b132 or was there before.

CID:			1593695
Fixes:			2af953b132
2025-03-13 09:53:40 -07:00
Gleb Smirnoff
c78a14a2b8 inpcb: in_pcb_lport_dest() doesn't use lportp as input argument
This assignment only created a false-positive analyzer report.

CID:			1593692
2025-03-13 09:53:40 -07:00
Konstantin Belousov
394605c057 ip_output(): style
Reviewed by:	glebius
Sponsored by:	NVidia networking
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D49305
2025-03-10 22:16:58 +02:00
Konstantin Belousov
edc1fba05e ip_output(): if mb_unmapped_to_ext() failed, return directly
Do not free the original mbuf; it has already been freed by
mb_unmapped_to_ext().

Reviewed by:	glebius
Sponsored by:	NVidia networking
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D49305
2025-03-10 22:16:58 +02:00
Gleb Smirnoff
136c5e17b6 inpcb: return ENOMEM if bind(2) fails to allocate lbgroup
SO_REUSEPORT_LB isn't a standard option, nor is ENOMEM a specified
return code for bind(2), but it is definitely more appropriate than
EAGAIN or the masked ENOBUFS.

Reviewed by:		markj
Differential Revision:	https://reviews.freebsd.org/D49153
2025-03-06 22:59:46 -08:00
Gleb Smirnoff
452187b611 inpcb: in_pcbinshash() now can't fail on connect(2)
Reviewed by:		markj
Differential Revision:	https://reviews.freebsd.org/D49152
2025-03-06 22:59:40 -08:00
Gleb Smirnoff
5f53917078 inpcb: retire two-level port hash database
This structure originates from the pre-FreeBSD times when system RAM was
measured in single digits of MB and Internet speeds were measured in Kb.
At first level the database hashes the port value only to calculate index
into array of pointers to lazily allocated headers that hold lists of
inpcbs with the same local port.  This design apparently was made to
preserve kernel memory.

In the modern kernel the size of the first level of the hash is derived from
maxsockets, which is derived from maxfiles, which in turn is derived
from the amount of physical memory.  Then the size of the hash is capped by
IPPORT_MAX, because it doesn't make any sense to have a hash table larger than
the set of possible values.  In practice this cap works even on my laptop.
I haven't done precise calculations or experiments, but my guess is that
any system with > 8 Gb of RAM will be autotuned to an IPPORT_MAX sized hash.
Apparently, this hash is a degenerate one: it never has more than one
entry in any slot.  You can check this with kgdb:

    set $i = 0
    while ($i <= tcbinfo->ipi_porthashmask)
        set $p = tcbinfo->ipi_porthashbase[$i].clh_first
        set $c = 0
        while ($p != 0)
            set $c = $c + 1
            set $p = $p->phd_hash.cle_next
        end
        if ($c > 1)
            printf "Slot %u count %u\n", $i, $c
        end
        set $i = $i + 1
    end

By retiring the two-level hash we remove a lot of complexity at the cost of
only one comparison 'inp->inp_lport != lport' in the lookup cycle, which
is going to be always false on most machines anyway.  This comparison
should definitely be cheaper than an extra pointer traversal.
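
As a rough user-space illustration of the single-level layout described
above, the sketch below hashes pcbs directly by local port and filters
collisions with one port comparison inside the lookup loop.  The toy_*
names, the table size and the hash function are hypothetical and are not
taken from in_pcb.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	TOY_HASH_SIZE	8u	/* hypothetical size, must be a power of two */

/* Hypothetical stand-in for an inpcb: only the local port and a chain link. */
struct toy_pcb {
	uint16_t	 lport;
	struct toy_pcb	*next;
};

struct toy_porthash {
	struct toy_pcb	*slot[TOY_HASH_SIZE];
};

static unsigned
toy_hash(uint16_t lport)
{
	return (lport & (TOY_HASH_SIZE - 1));
}

/* Link the pcb straight into its slot; no per-port header is allocated. */
static void
toy_insert(struct toy_porthash *h, struct toy_pcb *p)
{
	unsigned i;

	assert(h != NULL && p != NULL);
	i = toy_hash(p->lport);
	p->next = h->slot[i];
	h->slot[i] = p;
}

/* Walk one chain; the port comparison skips pcbs that merely collide. */
static struct toy_pcb *
toy_lookup(const struct toy_porthash *h, uint16_t lport)
{
	struct toy_pcb *p;

	for (p = h->slot[toy_hash(lport)]; p != NULL; p = p->next) {
		if (p->lport != lport)
			continue;
		return (p);
	}
	return (NULL);
}
```

With a table as large as the port space the chain holds at most one port,
so the comparison never skips anything; with the deliberately tiny table
above it is what keeps colliding ports apart.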

Another positive change to be singled out is that now we no longer need to
allocate memory in non-sleepable context in in_pcbinshash(), so a
potential ENOMEM on connect(2) is removed.

Reviewed by:		markj
Differential Revision:	https://reviews.freebsd.org/D49151
2025-03-06 22:58:35 -08:00
Gleb Smirnoff
79fb0d2474 inpcb: make inpcb hash insertion/removal functions private 2025-03-06 22:58:29 -08:00
Gleb Smirnoff
c7f803c71d inpcb: fix a panic with SO_REUSEPORT_LB + connect(2) misuse
This combination doesn't make any sense.  This socket option makes sense
only on a socket that is going to be a listening one.  There are two
options here: refuse connect(2) on a socket that has the option set
previously, or ignore (and clear) the option.  After some discussion on
phabricator, we have chosen the former, for safety and consistency
reasons.  Any programmer that runs this sequence is doing something wrong
and should be informed of that with appropriate error code.

Since connect(2) is a SUS API that has a defined set of error codes, none
of which corresponds to "a socket has non-standard incompatible socket
option set", we decided to return the same error that an already listening
socket would return.

Reviewed by:		markj
Differential Revision:	https://reviews.freebsd.org/D49150
2025-03-06 22:57:44 -08:00
Gleb Smirnoff
2af953b132 inpcb: inline in_pcbconnect_setup() into in_pcbconnect()
The separation had been done back in 5200e00e72 for the purposes of
removing a true temporary connect of an unconnected UDP socket that does
sendto(2) in 90162a4e87.  Now, with 69c05f4287 in place, the
separation is no longer needed.  There should be no functional change.

Reviewed by:		markj
Differential Revision:	https://reviews.freebsd.org/D49142
2025-03-06 22:57:29 -08:00
Gleb Smirnoff
e92a78ad7a tcp: return EOPNOTSUPP on attempt to connect(2) a listening socket
This is the error code specified by SUS.  Only TCP over IPv6 required
this fix.
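
A minimal user-space sketch of the misuse in question: call connect(2) on
a socket that is already listening and report the resulting errno.  The
helper name is hypothetical, and IPv4 loopback is used only for brevity
(the fix itself concerned TCP over IPv6).  The exact errno differs between
systems, so the example only checks that the call fails.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Returns the errno from connect(2) on a listening socket, 0 if connect(2)
 * unexpectedly succeeded, or -1 if the socket could not be set up.
 */
static int
connect_on_listener_errno(void)
{
	struct sockaddr_in sin;
	socklen_t len = sizeof(sin);
	int s, error = -1;

	s = socket(AF_INET, SOCK_STREAM, 0);
	if (s < 0)
		return (-1);
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);	/* port 0: kernel picks */
	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == 0 &&
	    listen(s, 1) == 0 &&
	    getsockname(s, (struct sockaddr *)&sin, &len) == 0) {
		/* The socket is now a listener; try to connect it to itself. */
		if (connect(s, (struct sockaddr *)&sin, sizeof(sin)) == -1)
			error = errno;
		else
			error = 0;
	}
	close(s);
	return (error);
}

/* Usage: the call is expected to fail with some positive errno. */
static void
connect_on_listener_example(void)
{
	assert(connect_on_listener_errno() > 0);
}
```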

Fixes:			bd4a39cc93
Reviewed by:		markj
Differential Revision:	https://reviews.freebsd.org/D49275
2025-03-06 22:57:11 -08:00
Zhenlei Huang
09de373103 tcp_ratelimit: Use static initializers
MFC after:	1 week
2025-03-06 12:51:45 +08:00
Zhenlei Huang
b7d5bda6f1 carp: Use static initializers
MFC after:	1 week
2025-03-06 12:51:44 +08:00
Zhenlei Huang
2472f4dbe9 udp: Do not recursively enter net epoch
The only caller udp_send() has already entered net epoch before invoking
udp_v4mapped_pktinfo().

No functional change intended.

This partially reverts commit d74b7baeb0 (ifnet_byindex() actually
requires network epoch).

Reviewed by:	ae, glebius
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D49227
2025-03-05 12:57:30 +08:00
SHENGYI HONG
8ee127efb0 vm_lowmem: Fix signature mismatches in vm_lowmem callbacks
This is required for kernel CFI.

Reviewed by:	rrs, jhb, glebius
Differential Revision:	https://reviews.freebsd.org/D49111
2025-03-04 20:18:52 -05:00
acazuc
70703aa922 netinet: allow per protocol random IP id control, single out IPSEC
Globally enabled random IP id generation may be useful in most IP
contexts, but it may be unnecessary in the case of IPsec encapsulated
packets because IPsec can be configured to use anti-replay windows.

This commit adds a new net.inet.ipsec.random_id sysctl to control whether
or not IPsec packets should use random IP id generation.

The rest of the protocols/modules are still controlled by the global
net.inet.ip.random_id, but can be easily augmented with a knob.

Reviewed by:		glebius
Sponsored by:		Stormshield
Differential Revision:	https://reviews.freebsd.org/D49164
2025-03-04 08:45:32 -08:00
Andrey V. Elsukov
4a77657cbc ipfw: migrate ipfw to 32-bit size rule numbers
This changes ABI due to the changed opcodes and includes the
following:
 * rule numbers and named object indexes converted to 32 bits
 * all hardcoded maximum rule numbers were replaced with the
   IPFW_DEFAULT_RULE macro
 * now it is possible to raise the maximum number of rules at
   build time
 * several opcodes converted to ipfw_insn_u32 to keep rulenum:
   O_CALL, O_SKIPTO
 * call stack modified to keep u32 rulenum. The behaviour of
   O_CALL opcode was changed to avoid possible packets looping.
   Now when call stack is overflowed or mbuf tag allocation
   failed, a packet will be dropped instead of skipping to next
   rule.
 * 'return' action now has two modes to specify the return point:
   'next-rulenum' and 'next-rule'
 * new lookup key added for O_IP_DST_LOOKUP opcode 'lookup rulenum'
 * several opcodes converted to keep u32 named object indexes
   in special structure ipfw_insn_kidx
 * tables related opcodes modified to use two structures:
   ipfw_insn_kidx and ipfw_insn_table
 * added ability for table value matching for specific value type
   in 'table(name,valtype=value)' opcode
 * dynamic states and eaction code converted to use u32 rulenum
   and named objects indexes
 * added insntod() and insntoc() macros to cast to specific
   ipfw instruction type
 * default sockopt version was changed to IP_FW3_OPVER=1
 * FreeBSD 7-11 rule format support was removed
 * added ability to generate special rtsock messages via log opcode
 * added IP_FW_SKIPTO_CACHE sockopt to enable/disable skipto cache.
   It helps to reduce overhead when many rules are modified in batch.
 * added ability to keep NAT64LSN states during sets swapping

Obtained from:	Yandex LLC
Relnotes:	yes
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D46183
2025-03-03 21:15:17 +03:00
Zhenlei Huang
3ae7c76354 netinet: Make in_canforward() return bool
No functional change intended.

MFC after:	5 days
2025-03-02 23:00:42 +08:00
Zhenlei Huang
f7174eb2b4 netinet: Do not forward or ICMP response to INADDR_ANY
Section 4 of the draft proposal [1] explicitly states that 0.0.0.0,
aka INADDR_ANY, retains its existing special meanings.

[1] https://datatracker.ietf.org/doc/draft-schoen-intarea-unicast-0

Reviewed by:	glebius
Fixes:	efe58855f3 IPv4: experimental changes to allow net 0/8, 240/4, part of 127/8
MFC after:	5 days
Differential Revision:	https://reviews.freebsd.org/D49157
2025-03-02 23:00:42 +08:00
Cheng Cui
67787d2004 tcp: make inflight data (pipe) calculation consistent
Reviewed by: glebius, rrs, tuexen
Differential Revision: https://reviews.freebsd.org/D49047
2025-02-28 15:53:12 -05:00
Zhenlei Huang
97309cec6f netinet: Make in_ifhasaddr() return bool
No functional change intended.

MFC after:	1 week
2025-02-27 23:58:20 +08:00
Zhenlei Huang
69beb16284 netinet: Make in_localaddr() return bool
It is used as a boolean function everywhere.

No functional change intended.

MFC after:	1 week
2025-02-27 23:58:20 +08:00
Peter Lei
6f27541d94 tcp rack: cleanup accounting conditional checks
No functional change intended.

Reviewed by:	tuexen
MFC after:	1 week
Sponsored by:	Netflix, Inc.
2025-02-25 21:45:40 +01:00
Peter Lei
0e58542fd2 tcp: remove unused field from struct tcpcb
Reviewed by:	tuexen
Sponsored by:	Netflix, Inc.
2025-02-25 21:37:48 +01:00
Peter Lei
163c30c793 tcp rack: remove dead code
Reviewed by:	tuexen
MFC after:	1 week
Sponsored by:	Netflix, Inc.
2025-02-25 21:33:32 +01:00
Gleb Smirnoff
f510c5b213 netinet: fix build
Fixes:	3b281d1421
2025-02-24 08:08:54 -08:00
Zhenlei Huang
a5e380e51c netinet: Update a comment for in_localip()
The function in_localip() was changed to return bool but the comment was
left unchanged.

Fixes:	c8ee75f231 Use network epoch to protect local IPv4 addresses hash
MFC after:	3 days
2025-02-24 18:14:39 +08:00
Mark Johnston
8b3d2c19d3 inpcb: Fix reuseport lbgroup array resizing
in_pcblisten() moves an inpcb from the per-group list into the array, at
which point it becomes visible to inpcb lookups in the datapath.  It
assumes that there is space in the array for this, but that's not
guaranteed, since in_pcbinslbgrouphash() doesn't reserve space in the
array if the inpcb isn't associated with a listening socket.

We could resize the array in in_pcblisten(), but that would introduce a
failure case where there currently is none.  Instead, keep track of the
number of pending inpcbs as well, and modify in_pcbinslbgrouphash() to
reserve space for each pending (i.e., not-yet-listening) inpcb.

Add a regression test.

Reviewed by:	glebius
Reported by:	netchild
Fixes:		7cbb6b6e28 ("inpcb: Close some SO_REUSEPORT_LB races, part 2")
Differential Revision:	https://reviews.freebsd.org/D49100
2025-02-23 16:20:12 +00:00
Zhenlei Huang
1776633438 carp: Fix checking IPv4 multicast address
An IPv4 address stored in `struct in_addr` is in network byte order but
`IN_MULTICAST` wants host order.
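
The byte-order point can be sketched in user space as follows.  The helper
names are hypothetical and this is not the carp code itself; it only shows
the conversion that has to happen before `IN_MULTICAST` is applied to the
contents of a `struct in_addr`.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdbool.h>

/* s_addr is in network byte order; IN_MULTICAST() expects host order. */
static bool
addr_is_multicast(struct in_addr addr)
{
	return (IN_MULTICAST(ntohl(addr.s_addr)));
}

/* Usage: a 224/4 address is multicast, a unicast address is not. */
static void
addr_is_multicast_example(void)
{
	struct in_addr a;

	a.s_addr = htonl(0xe0000012);	/* 224.0.0.18 */
	assert(addr_is_multicast(a));
	a.s_addr = htonl(0xc0000201);	/* 192.0.2.1 */
	assert(!addr_is_multicast(a));
}
```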

PR:		284872
Reported by:	Steven Perreau
Reported by:	Brett Merrick <brett.merrick@itcollective.nz>
Reviewed by:	Franco Fichtner <franco@opnsense.org>, ae, kp, glebius
Tested by:	Steven Perreau
Fixes:		137818006d carp: support unicast
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D49053
2025-02-23 03:26:33 +08:00
Gleb Smirnoff
69c05f4287 udp: make sendto(2) on unconnected UDP socket use public inpcb KPIs
UDP allows sendto(2) on an unconnected socket.  The original BSD design
was that such an action would create a temporary (for the duration of the
syscall) connection between our inpcb and the remote addr:port specified in
sockaddr 'to' of the syscall.  This design was broken in 2002 in
90162a4e87.  For more motivation on the removal of the temporary
connection see email [1].

Since the removal of the true temporary connection the sendto(2) on
unconnected socket has the following side effects:

1) After first sendto(2) the "unconnected" socket will receive datagrams
   destined to the selected port.
2) All subsequent sendto(2) calls will use the same source port.

Effectively, such sendto(2) acts like a bind(2) to INADDR_ANY:0.  Indeed,
if you do this:

	s1 = socket(PF_INET, SOCK_DGRAM, 0);
	s2 = socket(PF_INET, SOCK_DGRAM, 0);
	sendto(s1, ..., &somedestination, ...);
	bind(s2, &{ .sin_addr = INADDR_ANY, sin_port = 0 });

and then inspect the resulting inpcbs in kgdb, you will find them identical
in every respect except for being bound to different anonymous ports.
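
The implicit binding can also be observed from user space without kgdb.
The following is a minimal sketch with a hypothetical helper name; the
loopback destination and port 9 are arbitrary choices for illustration.
It sends one datagram on a fresh UDP socket and then asks getsockname(2)
which local port the socket ended up with.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Returns the local port (host order) of a UDP socket after one sendto(2),
 * or -1 if any step failed.
 */
static int
local_port_after_sendto(void)
{
	struct sockaddr_in dst, local;
	socklen_t len = sizeof(local);
	int s, port = -1;

	s = socket(AF_INET, SOCK_DGRAM, 0);
	if (s < 0)
		return (-1);
	memset(&dst, 0, sizeof(dst));
	dst.sin_family = AF_INET;
	dst.sin_port = htons(9);			/* arbitrary port */
	dst.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	/* No bind(2) and no connect(2) before this call. */
	if (sendto(s, "x", 1, 0, (struct sockaddr *)&dst, sizeof(dst)) == 1 &&
	    getsockname(s, (struct sockaddr *)&local, &len) == 0)
		port = ntohs(local.sin_port);
	close(s);
	return (port);
}

/* Usage: after one sendto(2) the socket should report a nonzero port. */
static void
implicit_bind_example(void)
{
	assert(local_port_after_sendto() > 0);
}
```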

What is even more interesting is that the Linux kernel has picked up the same
behavior, including that an "unconnected" socket will receive datagrams.  So
it seems that such behavior is now an undocumented standard, thus I
covered it in the recently added tests/sys/netinet/udp_bindings.

Now, with the above knowledge at hand, why are we using
in_pcbconnect_setup() and in_pcbinshash(), which are supposed to be
private to in_pcb.c, to achieve the binding?  Let's use public KPI
in_pcbbind() on the first sendto(2) and use in_pcbladdr() on all
sendto(2)s.  Apart from finally hiding these two should-be-private
functions, we no longer acquire the global INP_HASH_WLOCK() for every
sendto(2) on an unconnected socket, and we remove a couple of workarounds.

[1] https://mail-archive.FreeBSD.org/cgi/mid.cgi?200210141935.aa83883

Reviewed by:		markj
Differential Revision:	https://reviews.freebsd.org/D49043
2025-02-21 18:11:17 -08:00
Gleb Smirnoff
532106f7aa netinet: use in_broadcast() inline
There should be no functional change.

Reviewed by:		rrs, markj
Differential Revision:	https://reviews.freebsd.org/D49088
2025-02-21 18:11:06 -08:00
Gleb Smirnoff
3b281d1421 netinet: enforce broadcast mode for all-ones and all-zeroes destinations
When a socket has SO_BROADCAST set and the destination address is INADDR_ANY
or INADDR_BROADCAST, the kernel shall pick the first broadcast-capable
interface and broadcast the packet out of it.  Since this API is not
reliable on a machine with > 1 broadcast-capable interfaces, all practical
software seems to use IP_ONESBCAST or other mechanisms to send broadcasts.
This has been broken at least since FreeBSD 6.0, see bug 99558.  Back then
the problem was in the fact that in_broadcast() check was always done
against the gateway address, not the destination address.  Later, with
90cc51a1ab, a second problem piled on top - we aren't checking for
INADDR_ANY and INADDR_BROADCAST at all.

Better late than never, fix that by checking destination address.

PR:			99558
Reviewed by:		markj
Differential Revision:	https://reviews.freebsd.org/D49042
2025-02-21 18:11:00 -08:00
Gleb Smirnoff
197fc4cad0 netinet: rename in_broadcast() to in_ifnet_broadcast()
This aligns with the existing in_ifaddr_broadcast() and with other
simple functions or macros with a bare "in_" prefix that operate just on
struct in_addr and nothing else, e.g. in_nullhost().  No functional
change.

Reviewed by:		markj
Differential Revision:	https://reviews.freebsd.org/D49041
2025-02-21 18:10:53 -08:00
Cheng Cui
7f9ef5c75f cc_cubic: remove redundant code
While updating cc_cubic to RFC 9438, I found the following redundancies:

- W_est: we use the alternative stack local variable `W_est` in
	 `cubic_ack_received()`.
- cwnd_prior: it is used for Reno-Friendly Region in RFC9438 Section 4.3,
	 but we use the alternative cwnd from NewReno for Reno-Friendly as
	 in commit ee45061051.

No functional change intended.

Reviewed by: rscheff, tuexen
Differential Revision: https://reviews.freebsd.org/D49008
2025-02-20 11:00:41 -05:00
Gleb Smirnoff
dc9db1f6b3 netinet: make in_broadcast() and in_ifaddr_broadcast return bool
While here annotate deprecated condition with __predict_false() and
slightly refactor in_broadcast() removing leftovers from old address list
locking.  Should be no functional change.
2025-02-17 15:28:52 -08:00
Gleb Smirnoff
8f1d5cf5b5 ip_output: use bool for isbroadcast 2025-02-17 15:28:52 -08:00
Gleb Smirnoff
bafe022b1f inpcb: add const qualifiers on functions that select address/port
There are several functions that keep database locked and do address
and port selection before a caller commits the changes to the inpcb.
Mark the inpcb argument with a good documenting const.
2025-02-17 15:28:52 -08:00
Gleb Smirnoff
24e5c2ee2a inpcb: update inpcb multipath routing information only on success
The in_pcbconnect_setup() function is not supposed to modify inpcb.
It may be entered with read-only lock via UDP path.  Also at this
point we aren't yet sure that the binding is going to be successful.
Thus, update the multipath routing information only at the end of a
successful in_pcbconnect().

Fixes:	0c325f53f1
2025-02-17 15:28:52 -08:00
Cheng Cui
6156da866e cc_cubic: remove redundant calls of tcp_fixed_maxseg()
Summary: No functional change intended.

Reviewed by: rscheff, tuexen

Subscribers: imp, melifaro, glebius

Differential Revision: https://reviews.freebsd.org/D48967
2025-02-12 17:49:21 -05:00
Michael Tuexen
923c223f27 icmp: use per rate limit randomized jitter
Using the same random jitter for multiple rate limits allows an
attacker to use one rate limiter to figure out the current jitter
and then use this knowledge to de-randomize the other rate limiters.
This can be mitigated by using a separate randomized jitter for each
rate limiter.
This issue was reported as issue number 10 in Keyu Man et al.:
SCAD: Towards a Universal and Automated Network Side-Channel
Vulnerability Detection
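
The mitigation can be sketched as below: every limiter keeps its own
effective limit and is re-armed with its own random draw, so observing one
limiter reveals nothing about the others.  The struct, its field names and
the way the jitter is derived are hypothetical and are not the kernel's
implementation; the caller is assumed to pass a fresh random value for
each limiter and each interval.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-limiter state. */
struct toy_ratelimit {
	int	base_pps;	/* configured packets-per-second limit */
	int	span;		/* jitter is drawn from [-span, +span] */
	int	cur_limit;	/* effective limit for the current interval */
};

/* Re-arm one limiter using its own random value 'r'. */
static void
toy_rearm(struct toy_ratelimit *rl, uint32_t r)
{
	int jitter;

	assert(rl->span >= 0);
	/* Map r onto the 2 * span + 1 values in [-span, +span]. */
	jitter = (int)(r % (2u * (uint32_t)rl->span + 1u)) - rl->span;
	rl->cur_limit = rl->base_pps + jitter;
}
```

Sharing a single jitter value across limiters would correspond to calling
toy_rearm() on all of them with the same r, which is the pattern the commit
moves away from.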

Reviewed by:		rrs, Peter Lei, glebius
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D48804
2025-02-10 22:16:20 +01:00
Mateusz Guzik
0fd31cf690 mroute: fix a sysctl vs teardown race
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2025-02-10 14:45:36 +00:00
Mateusz Guzik
efd368784e mroute: serialize parallel teardown of the same vnet
Otherwise 2 threads calling here can crash the kernel.

Sponsored by:	Rubicon Communications, LLC ("Netgate")
2025-02-10 14:45:32 +00:00
Mark Johnston
ca94f92c23 inpcb: Move the definition of struct inpcblbgroup to in_pcb_var.h
It's only needed for in_pcb.c and in6_pcb.c, so can go to the private
header.

No functional change intended.

Reported by:	glebius
MFC after:	2 weeks
Sponsored by:	Klara, Inc.
Sponsored by:	Stormshield
2025-02-06 16:25:24 +00:00
Mark Johnston
4009a98fe8 rawip: Add a bind_all_fibs sysctl
As with net.inet.{tcp,udp}.bind_all_fibs, this causes raw sockets to
accept only packets from the same FIB.

Reviewed by:	glebius
Sponsored by:	Klara, Inc.
Sponsored by:	Stormshield
Differential Revision:	https://reviews.freebsd.org/D48707
2025-02-06 14:16:36 +00:00
Mark Johnston
caccbaef8e socket: Move SO_SETFIB handling to protocol layers
In particular, we store a FIB number in both struct socket and in struct
inpcb.  When updating the FIB number with setsockopt(SO_SETFIB), make
the update atomic.  This is required to support the new bind_all_fibs
mode, since in that mode changing the FIB of a bound socket is not
permitted.

This requires a bit more code, but avoids a layering violation in
sosetopt(), where we hard-code the list of protocol families that
implement SO_SETFIB.

Reviewed by:	glebius
MFC after:	2 weeks
Sponsored by:	Klara, Inc.
Sponsored by:	Stormshield
Differential Revision:	https://reviews.freebsd.org/D48666
2025-02-06 14:16:21 +00:00
Mark Johnston
08e638c089 udp: Add a sysctl to modify listening socket FIB inheritance
Introduce the net.inet.udp.bind_all_fibs tunable, set to 1 by default
for compatibility with current behaviour.  When set to 0, all received
datagrams will be dropped unless an inpcb bound to the same FIB exists.

No functional change intended, as the new behaviour is not enabled by
default.

Reviewed by:	glebius
MFC after:	2 weeks
Sponsored by:	Klara, Inc.
Sponsored by:	Stormshield
Differential Revision:	https://reviews.freebsd.org/D48664
2025-02-06 14:15:41 +00:00
Mark Johnston
5dc99e9bb9 tcp: Add a sysctl to modify listening socket FIB inheritance
Introduce the net.inet.tcp.bind_all_fibs tunable, set to 1 by default
for compatibility with current behaviour.  When set to 0, all TCP
listening sockets are private to their FIB.  Inbound connection requests
will only succeed if a matching inpcb is bound to the same FIB as the
request.

No functional change intended, as the new behaviour is not enabled by
default.

Reviewed by:	glebius
MFC after:	2 weeks
Sponsored by:	Klara, Inc.
Sponsored by:	Stormshield
Differential Revision:	https://reviews.freebsd.org/D48663
2025-02-06 14:14:49 +00:00
Mark Johnston
da806e8db6 inpcb: Add FIB-aware inpcb lookup
Allow protocol layers to look up an inpcb belonging to a particular FIB.
This is indicated by setting INPLOOKUP_FIB; if it is set, the FIB to be
used is obtained from the specified mbuf or ifnet.

No functional change intended.

Reviewed by:	glebius, melifaro
MFC after:	2 weeks
Sponsored by:	Klara, Inc.
Sponsored by:	Stormshield
Differential Revision:	https://reviews.freebsd.org/D48662
2025-02-06 14:14:39 +00:00
Mark Johnston
bbd0084baf inpcb: Add a flags parameter to in_pcbbind()
Add a flag, INPBIND_FIB, which means that the inpcb is local to its FIB
number.  When this flag is specified, duplicate bindings are permitted,
so long as each FIB contains at most one inpcb bound to the same
address/port.  If an inpcb is bound with this flag, it'll have the
INP_BOUNDFIB flag set.

No functional change intended.

Reviewed by:	glebius
MFC after:	2 weeks
Sponsored by:	Klara, Inc.
Sponsored by:	Stormshield
Differential Revision:	https://reviews.freebsd.org/D48661
2025-02-06 14:14:23 +00:00