in_pcblisten() moves an inpcb from the per-group list into the array, at
which point it becomes visible to inpcb lookups in the datapath. It
assumes that there is space in the array for this, but that's not
guaranteed, since in_pcbinslbgrouphash() doesn't reserve space in the
array if the inpcb isn't associated with a listening socket.
We could resize the array in in_pcblisten(), but that would introduce a
failure case where there currently is none. Instead, keep track of the
number of pending inpcbs as well, and modify in_pcbinslbgrouphash() to
reserve space for each pending (i.e., not-yet-listening) inpcb.
Add a regression test.
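The reservation scheme can be sketched in userspace as follows. This is
only an illustration, not the kernel code: struct lbgroup_sketch, its
fields, the fixed capacity and the helper names are all hypothetical, and
plain integers stand in for inpcbs. The point is that the only place that
can fail is insertion; the later listen step consumes a slot that is
already reserved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_CAP 4	/* hypothetical fixed array capacity */

struct lbgroup_sketch {
	int	listening[SKETCH_CAP];	/* members visible to lookups */
	size_t	nlistening;		/* used slots in the array */
	size_t	npending;		/* bound, listen() not yet called */
};

/* Insert: fail here if no slot can be reserved for the new member. */
static bool
sketch_insert(struct lbgroup_sketch *g, int id, bool is_listening)
{
	if (g->nlistening + g->npending >= SKETCH_CAP)
		return (false);
	if (is_listening)
		g->listening[g->nlistening++] = id;
	else
		g->npending++;		/* slot reserved, not yet used */
	return (true);
}

/* Listen: consume the reservation; this step cannot run out of space. */
static void
sketch_listen(struct lbgroup_sketch *g, int id)
{
	assert(g->npending > 0);
	assert(g->nlistening < SKETCH_CAP);
	g->npending--;
	g->listening[g->nlistening++] = id;
}
```

The sketch refuses the insert when the array is full, whereas the kernel
can grow the array at that point; either way the failure is confined to
the insertion path.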
Reviewed by: glebius
Reported by: netchild
Fixes: 7cbb6b6e28 ("inpcb: Close some SO_REUSEPORT_LB races, part 2")
Differential Revision: https://reviews.freebsd.org/D49100
UDP allows sendto(2) on an unconnected socket. The original BSD design
was that such an action would create a temporary (for the duration of the
syscall) connection between our inpcb and the remote addr:port specified in
the sockaddr 'to' of the syscall. This design was broken in 2002 in
90162a4e87. For more motivation on the removal of the temporary
connection see email [1].
Since the removal of the true temporary connection, sendto(2) on an
unconnected socket has the following side effects:
1) After the first sendto(2) the "unconnected" socket will receive
datagrams destined to the selected port.
2) All subsequent sendto(2) calls will use the same source port.
Effectively, such sendto(2) acts like a bind(2) to INADDR_ANY:0. Indeed,
if you do this:
s1 = socket(PF_INET, SOCK_DGRAM, 0);
s2 = socket(PF_INET, SOCK_DGRAM, 0);
sendto(s1, ..., &somedestination, ...);
bind(s2, &{ .sin_addr = INADDR_ANY, .sin_port = 0 });
And then look at the resulting inpcbs in kgdb, you would find them equal in
every respect, except for being bound to different anonymous ports.
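The implicit binding can also be observed from userspace with
getsockname(2), without kgdb. A minimal sketch, assuming a working
loopback interface; the helper name port_after_sendto() and the
destination 127.0.0.1:9 are arbitrary choices made for this illustration:

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>
#include <unistd.h>

/*
 * Return the local port (host byte order) of a fresh UDP socket after a
 * single sendto(2), or -1 on error.  The socket is never passed to
 * bind(2) or connect(2).
 */
static int
port_after_sendto(void)
{
	struct sockaddr_in dst, local;
	socklen_t len = sizeof(local);
	int port = -1, s;

	if ((s = socket(PF_INET, SOCK_DGRAM, 0)) == -1)
		return (-1);
	memset(&dst, 0, sizeof(dst));
	memset(&local, 0, sizeof(local));
	dst.sin_family = AF_INET;
	dst.sin_port = htons(9);	/* arbitrary destination port */
	dst.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	if (sendto(s, "x", 1, 0, (struct sockaddr *)&dst, sizeof(dst)) == 1 &&
	    getsockname(s, (struct sockaddr *)&local, &len) == 0)
		port = ntohs(local.sin_port);
	close(s);
	return (port);
}
```

A nonzero return value means the socket now owns an anonymous port, the
same as after a bind(2) to INADDR_ANY:0.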
What is even more interesting is that the Linux kernel has picked up the
same behavior, including that the "unconnected" socket will receive
datagrams. So it seems that such behavior is now an undocumented standard,
thus I covered it in the recently added tests/sys/netinet/udp_bindings.
Now, with the above knowledge at hand, why are we using
in_pcbconnect_setup() and in_pcbinshash(), which are supposed to be
private to in_pcb.c, to achieve the binding? Let's use the public KPI
in_pcbbind() on the first sendto(2) and use in_pcbladdr() on all
sendto(2)s. Apart from finally hiding these two functions that should be
private, we no longer acquire the global INP_HASH_WLOCK() for every
sendto(2) on an unconnected socket, and we remove a couple of workarounds.
[1] https://mail-archive.FreeBSD.org/cgi/mid.cgi?200210141935.aa83883
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D49043
When a socket has SO_BROADCAST set and the destination address is
INADDR_ANY or INADDR_BROADCAST, the kernel shall pick the first broadcast
capable interface and broadcast the packet out of it. Since this API is not
reliable on a machine with more than one broadcast capable interface, all
practical software seems to use IP_ONESBCAST or other mechanisms to send
broadcasts.
This has been broken at least since FreeBSD 6.0, see bug 99558. Back then
the problem was that the in_broadcast() check was always done against the
gateway address, not the destination address. Later, with 90cc51a1ab, a
second problem piled on top: we aren't checking for INADDR_ANY and
INADDR_BROADCAST at all.
Better late than never, fix that by checking the destination address.
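The condition that has to be evaluated against the destination, rather
than the gateway, can be sketched like this. The helper name
dst_wants_broadcast() is hypothetical and not part of the kernel; only
the two address constants come from the text above:

```c
#include <stdbool.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/*
 * True if the packet's destination (not the next-hop gateway) is one of
 * the two addresses that request "broadcast out of some interface".
 */
static bool
dst_wants_broadcast(struct in_addr dst)
{
	return (dst.s_addr == htonl(INADDR_ANY) ||
	    dst.s_addr == htonl(INADDR_BROADCAST));
}
```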
PR: 99558
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D49042
This aligns with the existing in_ifaddr_broadcast() and with other
simple functions or macros with a bare "in_" prefix that operate just on
struct in_addr and nothing else, e.g. in_nullhost(). No functional
change.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D49041
While updating cc_cubic to RFC 9438, I found the following redundancies:
- W_est: we use the alternative stack local variable `W_est` in
`cubic_ack_received()`.
- cwnd_prior: it is used for Reno-Friendly Region in RFC9438 Section 4.3,
but we use the alternative cwnd from NewReno for Reno-Friendly as
in commit ee45061051.
No functional change intended.
Reviewed by: rscheff, tuexen
Differential Revision: https://reviews.freebsd.org/D49008
While here annotate deprecated condition with __predict_false() and
slightly refactor in_broadcast() removing leftovers from old address list
locking. Should be no functional change.
There are several functions that keep database locked and do address
and port selection before a caller commits the changes to the inpcb.
Mark the inpcb argument with a good documenting const.
The in_pcbconnect_setup() function is not supposed to modify inpcb.
It may be entered with read-only lock via UDP path. Also at this
point we aren't yet sure that the binding is going to be successful.
Thus, update the multipath routing information only at the end of a
successful in_pcbconnect().
Fixes: 0c325f53f1
Using the same random jitter for multiple rate limits allows an
attacker to use one rate limiter to figure out the current jitter
and then use this knowledge to de-randomize the other rate limiters.
This can be mitigated by using a separate randomized jitter for each
rate limiter.
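As an illustration of the mitigation, each limiter can carry its own
jitter value that is drawn independently. This is a sketch only: struct
limiter_sketch and both helpers are hypothetical names, the caller is
assumed to pass in a fresh random number per limiter (e.g. from
arc4random()), and range is assumed to be small enough that
2 * range + 1 does not overflow.

```c
#include <stdint.h>

struct limiter_sketch {
	uint32_t base_pps;	/* configured limit */
	int32_t	 jitter;	/* this limiter's own random offset */
};

/* Draw a jitter in [-range, range] for this one limiter only. */
static void
limiter_rejitter(struct limiter_sketch *l, uint32_t range, uint32_t rnd)
{
	l->jitter = (int32_t)(rnd % (2 * range + 1)) - (int32_t)range;
}

/* Effective limit for the current period, never below 1. */
static uint32_t
limiter_effective(const struct limiter_sketch *l)
{
	int64_t v = (int64_t)l->base_pps + l->jitter;

	return (v < 1 ? 1 : (uint32_t)v);
}
```

Because every limiter is re-jittered from a separate random value,
learning one limiter's effective limit says nothing about another's.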
This issue was reported as issue number 10 in Keyu Man et al.:
SCAD: Towards a Universal and Automated Network Side-Channel
Vulnerability Detection
Reviewed by: rrs, Peter Lei, glebius
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D48804
It's only needed for in_pcb.c and in6_pcb.c, so can go to the private
header.
No functional change intended.
Reported by: glebius
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Sponsored by: Stormshield
As with net.inet.{tcp,udp}.bind_all_fibs, this causes raw sockets to
accept only packets from the same FIB.
Reviewed by: glebius
Sponsored by: Klara, Inc.
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D48707
In particular, we store a FIB number in both struct socket and in struct
inpcb. When updating the FIB number with setsockopt(SO_SETFIB), make
the update atomic. This is required to support the new bind_all_fibs
mode, since in that mode changing the FIB of a bound socket is not
permitted.
This requires a bit more code, but avoids a layering violation in
sosetopt(), where we hard-code the list of protocol families that
implement SO_SETFIB.
Reviewed by: glebius
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D48666
Introduce the net.inet.udp.bind_all_fibs tunable, set to 1 by default
for compatibility with current behaviour. When set to 0, all received
datagrams will be dropped unless an inpcb bound to the same FIB exists.
No functional change intended, as the new behaviour is not enabled by
default.
Reviewed by: glebius
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D48664
Introduce the net.inet.tcp.bind_all_fibs tunable, set to 1 by default
for compatibility with current behaviour. When set to 0, all TCP
listening sockets are private to their FIB. Inbound connection requests
will only succeed if a matching inpcb is bound to the same FIB as the
request.
No functional change intended, as the new behaviour is not enabled by
default.
Reviewed by: glebius
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D48663
Allow protocol layers to look up an inpcb belonging to a particular FIB.
This is indicated by setting INPLOOKUP_FIB; if it is set, the FIB to be
used is obtained from the specified mbuf or ifnet.
No functional change intended.
Reviewed by: glebius, melifaro
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D48662
Add a flag, INPBIND_FIB, which means that the inpcb is local to its FIB
number. When this flag is specified, duplicate bindings are permitted,
so long as each FIB contains at most one inpcb bound to the same
address/port. If an inpcb is bound with this flag, it'll have the
INP_BOUNDFIB flag set.
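The resulting conflict rule can be sketched as below. The structure and
function names are hypothetical, and the sketch assumes that a duplicate
address/port is tolerated only when both bindings are FIB-local and live
in different FIBs; how FIB-local and ordinary bindings interact in the
kernel is not spelled out here, so treat that part as an assumption.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct bind_sketch {
	uint32_t addr;
	uint16_t port;
	int	 fib;
	bool	 fib_local;	/* stands in for INP_BOUNDFIB */
};

/* Would the new binding nb conflict with any entry of tab[0..n-1]? */
static bool
bind_conflicts(const struct bind_sketch *tab, size_t n,
    const struct bind_sketch *nb)
{
	for (size_t i = 0; i < n; i++) {
		if (tab[i].addr != nb->addr || tab[i].port != nb->port)
			continue;
		if (tab[i].fib_local && nb->fib_local &&
		    tab[i].fib != nb->fib)
			continue;	/* duplicate, but in another FIB */
		return (true);
	}
	return (false);
}
```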
No functional change intended.
Reviewed by: glebius
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D48661
This is to enable a mode where duplicate inpcb bindings are permitted,
and we want to look up an inpcb with a particular FIB. Thus, add a
"fib" parameter to in_pcblookup() and related functions, and plumb it
through.
A fib value of RT_ALL_FIBS indicates that the lookup should ignore FIB
numbers when searching. Otherwise, it should refer to a valid FIB
number, and the returned inpcb should belong to the specific FIB. For
now, just add the fib parameter where needed, as there are several
layers to plumb through.
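The intended meaning of the new parameter can be sketched as follows.
All names are hypothetical; SKETCH_ALL_FIBS merely stands in for
RT_ALL_FIBS, and its value here is an assumption of the sketch. A real
lookup also matches addresses, which is omitted for brevity.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ALL_FIBS	(-1)	/* stand-in for RT_ALL_FIBS (assumed) */

struct pcb_sketch {
	uint16_t port;
	int	 fib;
};

/*
 * Return the first entry bound to port.  With SKETCH_ALL_FIBS the FIB is
 * ignored; otherwise the entry must belong to the requested FIB.
 */
static const struct pcb_sketch *
lookup_sketch(const struct pcb_sketch *tab, size_t n, uint16_t port, int fib)
{
	for (size_t i = 0; i < n; i++) {
		if (tab[i].port != port)
			continue;
		if (fib == SKETCH_ALL_FIBS || tab[i].fib == fib)
			return (&tab[i]);
	}
	return (NULL);
}
```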
No functional change intended.
Reviewed by: glebius
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D48660
Now that the family and group are completely private to netlink_generic.c,
provide a simple and robust KPI, that would require very simple guarantees
from both KPI and the module:
* Strings are used only for family and group registration, that return ID:
uint16_t genl_register_family(const char *name, ...
uint32_t genl_register_group(uint16_t family, const char *name, ...
* Once created, families and groups are guaranteed to not disappear and
be addressable by their ID.
* All subsequent calls, including deregistration shall use ID.
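The shape of such an ID-based interface can be illustrated with a toy
registry. This is not the netlink code: the sketch_* names, the fixed
table size and the convention that 0 means failure are all assumptions
made for the example, and the real registration functions take further
arguments that are not reproduced here.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_FAMILIES 8

static const char *sketch_families[SKETCH_MAX_FAMILIES];

/* The name is used once, at registration; the caller gets an ID back. */
static uint16_t
sketch_register_family(const char *name)
{
	for (uint16_t i = 0; i < SKETCH_MAX_FAMILIES; i++) {
		if (sketch_families[i] == NULL) {
			sketch_families[i] = name;
			return (i + 1);
		}
	}
	return (0);	/* table full */
}

/* Every later call, including deregistration, takes the ID only. */
static const char *
sketch_family_name(uint16_t id)
{
	if (id == 0 || id > SKETCH_MAX_FAMILIES)
		return (NULL);
	return (sketch_families[id - 1]);
}

static void
sketch_unregister_family(uint16_t id)
{
	if (id != 0 && id <= SKETCH_MAX_FAMILIES)
		sketch_families[id - 1] = NULL;
}
```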
Reviewed by: kp
Differential Revision: https://reviews.freebsd.org/D48845
Just like we already do for IPv6, set the PFIL_FWD flag when we're
forwarding IPv4 traffic. This allows firewalls to make more precise
decisions.
Reviewed by: glebius
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D48824
A tcpcb in TCPS_LISTEN always has its socket in SO_ACCEPTCONN. One block
above there is an assertion that proves that this never happens. We
stopped ever clearing SO_ACCEPTCONN back in 779f106aa1.
This reverts commit 982c1675ff.
Reviewed by: cc, markj
Differential Revision: https://reviews.freebsd.org/D48710
There was a long-standing problem that pr_listen is called on every
consecutive listen(2) syscall. Until today it only produced spurious TCP
state change events in tracing software and other harmless problems. But
with 7cbb6b6e28 we started to call LIST_REMOVE() twice on the same
entry.
This is a quite ugly, but quick and robust fix for the regression, which we
decided to put in the scope of the January stabilization week. A better
refactoring will happen later.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D48703
Fixes: 7cbb6b6e28
Previously this macro would always increment the first counter in the
histogram array.
PR: 279975
Fixes: 60d8dbbef0 ("netinet: add a probe point for IP, IP6, ICMP, ICMP6, UDP and TCP stats counters")
Sponsored by: Klara, Inc.
Suppose a thread adds a socket to an existing TCP lbgroup that is
actively accepting connections. It has to do the following operations:
1. set SO_REUSEPORT_LB on the socket
2. bind() the socket to the shared address/port
3. call listen()
Step 2 makes the inpcb visible to incoming connection requests.
However, at this point the inpcb cannot accept new connections. If
in_pcblookup() matches it, the remote end will see ECONNREFUSED even
when other listening sockets are present in the lbgroup. This means
that dynamically adding inpcbs to an lbgroup (e.g., by starting up new
workers) can trigger spurious connection failures for no good reason.
(A similar problem exists when removing inpcbs from an lbgroup, but that
is harder to fix and is not addressed by this patch; see the review for
a bit more commentary.)
Fix this by augmenting each lbgroup with a linked list of inpcbs that
are pending a listen() call. When adding an inpcb to an lbgroup, keep
the inpcb on this list if listen() hasn't been called, so it is not yet
visible to the lookup path. Then, add a new in_pcblisten() routine which
makes the inpcb visible within the lbgroup now that it's safe to let it
handle new connections.
Add a regression test which verifies that we don't get spurious
connection errors while adding sockets to an LB group.
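The effect on the lookup path can be sketched in userspace. This is a
simplification, not the kernel data structure: a single array with a
per-member flag stands in for the separate pending list, and all names
are hypothetical. What matters is that a member that has not called
listen() is never selected.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX 8

struct member_sketch {
	int	id;
	bool	listening;	/* false while pending a listen() call */
};

struct group_sketch {
	struct member_sketch m[SKETCH_MAX];
	size_t	n;
};

/*
 * Pick a member for an incoming connection based on a hash of the
 * connection.  Pending members are skipped.  Returns -1 if no member is
 * listening.
 */
static int
group_pick(const struct group_sketch *g, unsigned int hash)
{
	size_t i, k, nlisten = 0;

	for (i = 0; i < g->n; i++)
		if (g->m[i].listening)
			nlisten++;
	if (nlisten == 0)
		return (-1);
	k = hash % nlisten;
	for (i = 0; i < g->n; i++)
		if (g->m[i].listening && k-- == 0)
			return (g->m[i].id);
	return (-1);
}
```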
Reviewed by: glebius
MFC after: 1 month
Sponsored by: Klara, Inc.
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D48544
garp_rexmit() is a callback, so is not in net_epoch, which
arprequest_internal() expects.
Enter and exit the net_epoch.
PR: 284073
MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")
Turn garp_rexmit_count into a per-vnet variable.
The immediate use case is to enable easier testing.
Sponsored by: Rubicon Communications, LLC ("Netgate")
To comply with Common Criteria certification requirements, it may be
necessary to ensure that packets to 0.0.0.0/::0 are dropped and logged
by the system firewall. Currently, such packets are dropped by
ip_input() and ip6_input() before reaching pfil hooks; let's defer the
checks slightly to give firewalls a chance to drop the packets
themselves, as this gives better observability. Add some regression
tests for this with pf+pflog.
Note that prior to commit 713264f6b8, v4 packets to the unspecified
address were not dropped by the IP stack at all.
Note that ip_forward() and ip6_forward() ensure that such packets are
not forwarded; they are passed back unmodified.
Add a regression test which ensures that such packets are visible to
pflog.
Reviewed by: glebius
MFC after: 3 weeks
Sponsored by: Klara, Inc.
Sponsored by: OPNsense
Differential Revision: https://reviews.freebsd.org/D48163
Only map mbuf when a policy is looked up and indicates that IPSEC needs
to transform the packet. If IPSEC is inline offloaded, it is up to the
interface driver to request remap if needed.
Fetch the IP header using m_copydata() instead of using mtod() to select
policy/SA.
Reviewed by: markj
Sponsored by: NVidia networking
Differential revision: https://reviews.freebsd.org/D48265
but instead of tripping the assert in debug kernel, and silently falling
into UB for prod, skip IPSEC processing for KTLS framed packets when
mb_unmapped_to_ext() failed.
Reviewed by: markj
Sponsored by: NVidia networking
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D48265
When calculating length of data to be sent, we may do some congestion
calculations, but we shall never send a byte beyond (snd_una + snd_wnd).
In 7dc78150c7 we started to use tcp_compute_pipe() instead of (snd_wnd
- off). This function makes an estimate of how much data is in flight. It
can return a value smaller or larger than (snd_nxt - snd_una). It will
return a value larger when we have retransmitted some data from SACK
holes, and smaller once those retransmitted SACK holes are acked.
We may use tcp_compute_pipe() for length calculation, but always capped
by the send offset 'off', which is (snd_nxt - snd_una).
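A worked example may help to see why the bound matters. The sketch below
is one reading of the rule, not a copy of tcp_output(): the helper names
are hypothetical, the in-flight estimate is never allowed to drop below
'off', and a negative result is clamped to zero for simplicity.

```c
#include <stdint.h>

static int32_t
imin32(int32_t a, int32_t b)
{
	return (a < b ? a : b);
}

static int32_t
imax32(int32_t a, int32_t b)
{
	return (a > b ? a : b);
}

/*
 * avail:   bytes available in the send buffer
 * sendwin: send window
 * pipe:    estimate of the data in flight
 * off:     snd_nxt - snd_una
 *
 * Using max(pipe, off) guarantees off + len <= sendwin, i.e. nothing is
 * sent beyond (snd_una + snd_wnd), even when pipe underestimates.
 */
static int32_t
sendable_len(int32_t avail, int32_t sendwin, int32_t pipe, int32_t off)
{
	int32_t len = imin32(avail, sendwin) - imax32(pipe, off);

	return (len > 0 ? len : 0);
}
```

With sendwin = 10000 and off = 8000, a pipe estimate of 5000 would allow
5000 bytes if used on its own, which is 3000 bytes past the window; with
the bound applied the result is 2000.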
PR: 283649
Reviewed by: rscheff
Differential Revision: https://reviews.freebsd.org/D48237
Fixes: 7dc78150c7
When the SACK scoreboard collapses, properly clear all the counters.
The counters are used in tcp_compute_pipe(), which can be called
anytime later after the SACK recovery. The returned result can be
totally bogus, including both too large and too small values.
PR: 283649
Reviewed by: rscheff
Differential Revision: https://reviews.freebsd.org/D48236
Also make this variable initialization, as well as accompanying sackhole
pointer, slightly more readable. NFC.
Reviewed by: rscheff, tuexen, rrs
Differential Revision: https://reviews.freebsd.org/D48235
reduce is uninitialized, if the code path for logging is reached via
goto old_method;.
Reviewed by: rrs, Peter Lei
CID: 1557359
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D48346
Bring back the code, which was accidentally removed. While there,
indent a comment correctly.
Reviewed by: rrs
CID: 1540026
Fixes: e18b97bd63 ("Update to bring the rack stack with all its fixes in.")
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D48340
bw is unsigned and not zero. So it cannot be smaller than 1.
No functional change intended.
Reviewed by: rrs, cc
CID: 1523791
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D48323
Do not jump to a place in the code that requires several variables
(segsize, minseg, idle, len, sb_offset) to be set when they are not.
To avoid using these variables, start the HPTS timer explicitly.
This fix only applies to the client side using TCP fast open.
Approved by: rrs
CID: 1523766
CID: 1523770
CID: 1523786
CID: 1523801
CID: 1523809
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D48322
minslot is initialized to 0 and never changed. It is not clear to me
under which condition minslot should be set to which value.
Therefore, remove it and the code checking that it is not zero.
No functional change intended.
Reviewed by: rrs
CID: 1523812
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D48321
rc_bbr_substate is a 3-bit unsigned int, so it can't be larger than
or equal to 8. The wrap around already happens.
No functional change intended.
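The wrap-around of a 3-bit unsigned bit-field can be checked with a few
lines of standalone C. The struct and function names are hypothetical;
only the field width mirrors the description above.

```c
struct substate_sketch {
	unsigned int sub:3;	/* holds values 0..7 only */
};

/* Advance the substate; incrementing 7 stores 0 in the 3-bit field. */
static unsigned int
next_substate(unsigned int cur)
{
	struct substate_sketch s = { .sub = cur & 7 };

	s.sub++;
	return (s.sub);
}
```

Since the stored value can never reach 8, an explicit comparison against
8 after the increment is dead code.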
Reviewed by: rrs
CID: 1523795
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D48320
There is no need to check partially for bbr->r_ctl.crte being NULL,
since this can't be true in this path.
No functional change intended.
Reviewed by: rrs
CID: 1523810
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D48312