Based on a patch originally found in m0n0wall, expanded
to IPv6 and aligned with FreeBSD's IP input path.
The limit may not be correctly accounted for on the WAN
interface due to dummynet counting the packet again even
though it was already processed.
The problem here is that there's no proper way to reinject
the packet at the point where it was previously removed,
so we assume that IP input was already done (including pfil)
and move more or less directly to packet output processing.
While here, move the passin label up so that it covers the extra
check, avoiding a second label. Also remove the spurious tag
read for the forward check, since we don't use it and we should
really trust the mbuf flag.
When calling dxr_init(), the FIB_ALGO infrastructure may provide a
pointer to a previous dxr instance, which permits reuse of auxiliary
dxr structures, i.e. incremental lookup structure updates. For dxr this
is a crucial feature provided by FIB_ALGO, since dxr incremental updates
are typically several orders of magnitude faster than full lookup table
rebuilds.
However, the auxiliary dxr structure caches a pointer to struct fib_data and
relies upon it for performing incremental updates. Apparently, incremental
rebuild requests from FIB_ALGO, i.e. calls to dxr_init() with the old_data
pointer set, may (under not yet fully understood circumstances) be invoked
within a different fib_data context than the one cached in the previous
version of dxr auxiliary structures. In such (rare) events, we ignore the
offered old dxr context, and proceed with a full lookup structure rebuild
instead of attempting an incremental one using a fib_data context which
may no longer be valid, and which could thus lead to a system crash.
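
The guard described above can be sketched as follows. This is a minimal
illustration and not the actual dxr code: the struct layouts and the
helper name can_reuse_old_instance() are hypothetical stand-ins.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins; the real kernel structures carry far more state. */
struct fib_data {
	int fibnum;
};

struct dxr_aux_sketch {
	struct fib_data *fd;	/* fib_data context cached at build time */
};

struct dxr_sketch {
	struct dxr_aux_sketch *aux;
};

/*
 * Return true only if the old instance may be reused for an incremental
 * update, i.e. its cached fib_data matches the context of this request.
 * In every other case the caller should do a full rebuild.
 */
static bool
can_reuse_old_instance(const struct dxr_sketch *old, const struct fib_data *fd)
{
	if (old == NULL || old->aux == NULL)
		return (false);		/* nothing to reuse */
	return (old->aux->fd == fd);	/* context mismatch means full rebuild */
}
```

The point of the check is that a full rebuild is slow but always safe,
while an incremental update through a stale pointer is not.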
PR: 278422
MFC after: 1 week
Approved by: re (cperciva)
(cherry picked from commit 4ab122e8ef127d36d95f874e85600c36c87c8c22)
(cherry picked from commit d6e32525c7)
Previously it was possible for dxr_build() to return with da->fd
unset in case of range_tbl or x_tbl malloc() failures. This
may have led to NULL ptr dereferencing in dxr_change_rib_batch().
Approved by: re (cperciva)
MFC after: 1 week
PR: 278422
(cherry picked from commit 0418d7a090)
This is used by 802.3 Ethernet. (It is also used by 802.4 Token Bus and
802.5 Token Ring, but we don't support those.)
This was accidentally removed along with FDDI support in commit
0437c8e3b1, presumably because comments implied it was used only by
FDDI or Token Ring.
Fixes: 0437c8e3b1 ("Remove support for FDDI networks.")
Reviewed-by: emaste
Signed-off-by: Denny Page <dennypage@me.com>
Pull-request: https://github.com/freebsd/freebsd-src/pull/1166
(cherry picked from commit fcdf9a19893b9b5beb7a21407de507f0ae4c500b)
HPTS inserts a softclock call on system call return to optimize performance. However, when
no HPTS thread needs the help (i.e. when each has fewer than 100 or so connections),
little work should be done: check the counter and return instead of running through
all the threads, taking locks, etc. Optimize HPTS so that little work is done until an HPTS
thread is over the connection threshold.
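
A minimal sketch of the early-return idea, assuming a single global
counter of threads that are over the threshold. All names here
(threads_over_threshold, softclock_sketch, full_scans) are hypothetical
and do not come from the HPTS sources.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Number of HPTS threads currently over the connection threshold. */
static atomic_uint threads_over_threshold;

/* Counts how often the expensive path ran; for illustration only. */
static unsigned int full_scans;

/*
 * Called on system call return. Returns true if the expensive scan over
 * all threads was performed, false if the cheap counter check sufficed.
 */
static bool
softclock_sketch(void)
{
	if (atomic_load_explicit(&threads_over_threshold,
	    memory_order_relaxed) == 0)
		return (false);	/* nobody needs help: no locks taken */

	full_scans++;		/* stand-in for walking threads and locking */
	return (true);
}
```

The common case then costs one relaxed load instead of a pass over every
thread.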
Reported by: eduardo
Reviewed by: gallatin, glebius, tuexen
Tested by: gallatin
Differential Revision: https://reviews.freebsd.org/D44420
(cherry picked from commit b7b78c1c169dd2213b4cb3e14e19c045b2c5e5af)
Fix up tcp_in_hpts() so that it also says yes if the connection is
in the race state of being moved and is scheduled to be put in.
This also requires changing the MPASS to use the old, non-inline
version of tcp_in_hpts().
This change also adds a new inline macro so that a uint64_t timestamp can be
obtained by a transport (RACK will use this).
Reviewed by: glebius, tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D44157
(cherry picked from commit 638b5ae1c7858373344bc7b9dcb5a1e7fab80bd9)
When we call the fast_rsm retransmit path, we should always move
snd_nxt back up to snd_max. In fact, during ACK processing, if snd_nxt
falls behind it should be moved up there as well. Otherwise we can
end up with an incorrect mark on snd_nxt and miscalculate the
offset when we go through the front path (which is what syzkaller
was able to do); then, when we go to clean up the send, the offset
is all wrong and we crash.
Special thanks to Gleb for pointing out the problem and the email
that had the reproducer so I could find the issue.
Reported-by: syzbot+f5061a372f74f021ec02@syzkaller.appspotmail.com
Sponsored by: Netflix Inc
(cherry picked from commit 8818f0f1124ea3d0e8028f85d667237536eba10c)
struct tcpcb embeds a struct osd and a struct callout. Rather than
forcing all consumers to pull in the same headers, include the headers
directly.
No functional change intended.
Reviewed by: glebius
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D44685
(cherry picked from commit 1d14e88e5332cfddbec1893f6b5332f81d378d61)
In rack_output(), idle is used as a boolean variable. So don't use it
as an int and don't clear it afterwards.
This avoids setting idle to false, when it is not intended.
Reported by: olivier
Reviewed by: rrs, rscheff
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D44610
(cherry picked from commit 7df0ef5f48e1c67b3f1df7c7964bfa59bc56f4e4)
Also log when dropping text or FIN after having received a FIN.
This is the intended behavior described in RFC 9293.
A follow-up patch will enforce this behavior for the base stack
and the RACK stack.
Reviewed by: rscheff
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D44669
(cherry picked from commit e8c149ab85c7834f76325864f22ca89298e65f75)
When in rack_output() jumping to the label out, don't write errno into
the log buffer, since the pointer is not initialized.
Reported by: Coverity Scan
CID: 1523773
Reviewed by: rscheff
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D44647
(cherry picked from commit d902c8f55b8da6902ab45e67ed756cc99f5a9d5a)
Ensure that tv.tv_sec is zero in all code paths.
Reported by: Coverity Scan
CID: 1527724
Reviewed by: rscheff
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D44584
(cherry picked from commit aaaa01c0c858fd703194c6cbd515dd514574381f)
t_state is an unsigned variable, so no need for testing that it is
non-negative.
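
A small sketch of why the lower-bound test is redundant. The helper
state_is_valid() is hypothetical; TCP_NSTATES_SKETCH mirrors the value
of TCP_NSTATES (11) but is defined locally so the example stands alone.

```c
#include <stdbool.h>

#define TCP_NSTATES_SKETCH 11	/* local copy for illustration */

/*
 * For an unsigned value, "t_state >= 0" is always true, so only the
 * upper bound needs checking. A negative value converted to unsigned
 * wraps to a large number and is rejected by the same comparison.
 */
static bool
state_is_valid(unsigned int t_state)
{
	return (t_state < TCP_NSTATES_SKETCH);
}
```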
Reported by: Coverity Scan
CID: 1390885
Reviewed by: glebius
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D44619
(cherry picked from commit 6b454da6bbaa3327cf9b7185d198c96ffc1b88f4)
Make the comment consistent with the code.
Reviewed by: rscheff
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D44611
(cherry picked from commit 5a268d868890dbbfe96361906be20d01cc252b2f)
The target_slot argument of max_slots_available() can be NULL.
Therefore, check for this in all places.
Right now, all callers provide a non-NULL pointer.
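
The pattern is the usual one for an optional out-parameter. The sketch
below uses a hypothetical function, free_slots_sketch(), with made-up
slot arithmetic; it does not reproduce max_slots_available().

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return the number of free slots. If target_slot is non-NULL, also
 * report the index of the first free slot (hypothetical semantics).
 * Every write through target_slot is guarded, so NULL is always safe.
 */
static uint32_t
free_slots_sketch(uint32_t total, uint32_t used, uint32_t *target_slot)
{
	uint32_t avail = (used < total) ? total - used : 0;

	if (target_slot != NULL)
		*target_slot = used;
	return (avail);
}
```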
Reported by: Coverity Scan
CID: 1527732
Reviewed by: rrs
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D44527
(cherry picked from commit b600644fdd6cefb1b90d76fdd5aa595946611a7d)
Ensure that there is no data on SYN segments unless doing TFO.
This check is already in RACK and BBR.
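
As a sketch, the condition being tested looks roughly like this. The
helper name and parameters are hypothetical; only the TH_SYN flag value
(0x02) is taken from the TCP header definition. What the stack does
once the condition holds is not shown here.

```c
#include <stdbool.h>
#include <stdint.h>

#define TH_SYN_SKETCH 0x02	/* SYN bit in the TCP flags field */

/*
 * Return true if a segment is a SYN that carries payload although
 * TCP Fast Open is not in use for it.
 */
static bool
syn_has_unexpected_data(uint8_t th_flags, uint32_t tlen, bool tfo)
{
	if ((th_flags & TH_SYN_SKETCH) == 0)
		return (false);		/* not a SYN: rule does not apply */
	return (tlen > 0 && !tfo);
}
```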
Reported by: glebius
Reviewed by: rscheff
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D44384
(cherry picked from commit af700f430fd86ba3eae63e587985a12436db8f69)
Add the IP, UDP, and TCP receive static probes to the code path
that avoids if_input.
Reviewed by: rrs, markj
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D43727
(cherry picked from commit 96ad640178ea0a8a9d1772687659dce5be18fbd9)
When doing mbuf queueing, the packet filter hooks in ether_demux(),
ip_input(), and ip6_input() are bypassed. This means that the packet
filters don't process incoming packets, which might result in
connection failures. For example, bypassing the TCP sequence number
validation will result in dropping valid packets.
Please note that this patch is only disabling mbuf queueing, not LRO.
Reported by: Herbert J. Skuhra
Reviewed by: glebius, rrs, rscheff
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D43769
(cherry picked from commit d1ce01214a5540db8a7e09fdf46b7ea2d06ffc48)
If the peer announced support of zero checksums, do so when sending
packets containing an INIT ACK chunk.
(cherry picked from commit 644cffe67f61ad5b36b60d621d1c630ff2a50412)
The FreeBSD TCP base stack also handles them the same way.
In case of packet filters dropping packets in the output path,
this avoids retransmitting the dropped packet every 10ms or so.
Reviewed by: rscheff
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D43773
(cherry picked from commit 2f4e46dfdd710c6679f233480c9de430e6c4ef9b)
This is a preparation for adding dtrace hooks in a follow-up commit,
which are missing in the code path, where packets are directly queued
to the tcpcb. The dtrace hooks expect the fields to be in host byte
order. This only applies when TCP HPTS is used.
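
To illustrate what "host byte order" means for the header fields, here
is a minimal userland sketch. The struct and the helper
fields_to_host_sketch() are hypothetical and cover only a few fields;
ntohl() and ntohs() are the standard conversion functions.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* A cut-down stand-in for the multi-byte fields of a TCP header. */
struct tcp_fields_sketch {
	uint32_t seq;	/* sequence number */
	uint32_t ack;	/* acknowledgment number */
	uint16_t win;	/* advertised window */
	uint16_t urp;	/* urgent pointer */
};

/* Convert all multi-byte fields from network to host byte order. */
static void
fields_to_host_sketch(struct tcp_fields_sketch *f)
{
	f->seq = ntohl(f->seq);
	f->ack = ntohl(f->ack);
	f->win = ntohs(f->win);
	f->urp = ntohs(f->urp);
}
```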
No functional change intended.
Reviewed by: rscheff
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D43594
(cherry picked from commit f30c7d56546b9f36e42351fb385d96e37dbac1d5)
Don't define TCPOUTFLAGS to get the static definition from tcp_fsm.h.
tailq_hash.c doesn't reference tcpoutflag. Only files that reference this
should define TCPOUTFLAGS. clang is fine with it, but gcc12 complained.
Sponsored by: Netflix
(cherry picked from commit afd155c72bf65c056d19473569cc78c6e5807b3b)
This socket option can be used by in-kernel consumers (like NFS) to
request a NIC to use optimized receive of large buffers for a
connection. The current use case is to support DDP by the TOE on
Chelsio NICs.
Reviewed by: rscheff, tuexen, glebius
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D44000
(cherry picked from commit 3d0a736796a99fe70be9de97beec8f10970c6905)
Don't report a BACKUP CARP address as local. These two functions are used
only by source address validation for input packets, controlled by sysctls
net.inet.ip.source_address_validation and
net.inet6.ip6.source_address_validation. For this purpose we definitely
want to treat BACKUP addresses as non local.
This change is conservative and doesn't modify compat in_localip() and
in6_localip(). They are used more widely than the FIB-aware versions.
The change would modify the notion of ipfw(4) 'me' keyword. There might
be other consequences as in_localip() is used by various tunneling
protocols.
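
A rough sketch of the source address validation rule described above,
using a flat address table instead of the kernel's address hash. All
types and names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum carp_state_sketch {
	CARP_NONE_SKETCH,	/* plain address, no CARP */
	CARP_MASTER_SKETCH,
	CARP_BACKUP_SKETCH
};

struct ifaddr_sketch {
	uint32_t addr;			/* IPv4 address, host byte order */
	enum carp_state_sketch carp;
};

/*
 * Return true if addr is configured locally and usable right now.
 * An address held in CARP BACKUP state is skipped, so it is reported
 * as non-local.
 */
static bool
addr_is_local_sketch(const struct ifaddr_sketch *tbl, size_t n, uint32_t addr)
{
	for (size_t i = 0; i < n; i++) {
		if (tbl[i].addr != addr)
			continue;
		if (tbl[i].carp == CARP_BACKUP_SKETCH)
			continue;
		return (true);
	}
	return (false);
}
```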
PR: 277349
(cherry picked from commit 56f7860087eec14b4a65310b70bd704e79e1b48c)
Visibility into the contents of the buffer when a write(2) has failed
can be immensely useful in debugging IPC issues -- pushing this to
discuss the idea, or maybe an alternative where we can set a flag like
KTRFAC_ERRIO to enable it.
When a genio event is potentially raised after an error, currently we'll
just free the uio and return. However, such data can be useful when
debugging communication between processes to, e.g., understand what the
remote side should have grabbed before closing a pipe. Tap out the
entire buffer on failure rather than simply discarding it.
Reviewed by: kib, markj
(cherry picked from commit 47ad4f2d45e406c6316909bc12bc760b2fdd6afb)
This is a mitigation to avoid sudden extreme jumps in
cwnd, as t_epoch can be very out of date after an RTO.
Per RFC9438, sec 4.8, t_epoch is to be reset whenever
cwnd grows beyond ssthresh (CC phase transitions from
slow start to congestion avoidance), to be fixed with
the upcoming cc_cubic changes.
MFC after: 3 days
Reviewed By: cc, #transport
Sponsored by: NetApp, Inc
Differential Revision: https://reviews.freebsd.org/D44023
(cherry picked from commit 038699a8f18a0a651ee06b85fa1dbbee1eab56f1)
Make sure the dividend is at least one. While cwnd should
never be smaller than t_maxseg, this can happen during
Path MTU Discovery, or when TCP options are considered
in other parts of the stack.
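
A minimal sketch of the clamp, assuming the value in question is the
congestion window expressed in segments and that it is later used in a
division. The helper name cwnd_in_segments() is hypothetical.

```c
#include <stdint.h>

/*
 * Return cwnd expressed in whole segments, but never less than one.
 * Integer division yields 0 when cwnd < maxseg, which would be unsafe
 * to divide by later on.
 */
static uint32_t
cwnd_in_segments(uint32_t cwnd, uint32_t maxseg)
{
	uint32_t segs;

	if (maxseg == 0)
		return (1);	/* defensive: avoid dividing by zero here */
	segs = cwnd / maxseg;
	return (segs > 0 ? segs : 1);
}
```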
PR: 276674
MFC after: 3 days
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43797
(cherry picked from commit 38983d40c18ec5705dcba19ac320b86c5efe8e7e)
The RFC6675 pipe calculation (sack.revised, enabled
by default since D28702), uses outdated information,
while the previous default calculated it correctly
with up-to-date information from the incoming ACK.
This difference can become as large as the receive
window (not the congestion window, as previously),
potentially triggering a massive burst of new packets.
MFC after: 1 week
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43520
(cherry picked from commit 0b3f9e435f2bde9e5be27030d9f574a977a1ad47)