If the socket is configured such that the sender is expected to supply
the IP header, then we need to verify that it actually did so.
Reported by: syzkaller+KMSAN
Reviewed by: donner
Sponsored by: The FreeBSD Foundation
(cherry picked from commit ba21825202)
This completes PRR cwnd reduction in all circumstances
for the base TCP stack (SACK loss recovery, ECN window reduction,
non-SACK loss recovery), preventing the arriving ACKs to
clock out new data at the old, too high rate. This
reduces the chance to induce additional losses while
recovering from loss (during congested network conditions).
For non-SACK loss recovery, each ACK is assumed to have
one MSS delivered. In order to prevent ACK-split attacks,
only one window worth of ACKs is considered to actually
have delivered new data.
MFC after: 6 weeks
Reviewed By: rrs, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29441
(cherry picked from commit 74d7fc8753)
Import OpenBSD's syncookie support for pf. This feature help pf resist
TCP SYN floods by only creating states once the remote host completes
the TCP handshake rather than when the initial SYN packet is received.
This is accomplished by using the initial sequence numbers to encode a
cookie (hence the name) in the SYN+ACK response and verifying this on
receipt of the client ACK.
Reviewed by: kbowling
Obtained from: OpenBSD
MFC after: 1 week
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D31138
(cherry picked from commit 8e1864ed07)
Fix a bug in VNET handling, which occurs when using specific NICs.
PR: 257195
Reviewed by: rrs
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D31212
(cherry picked from commit a730d82378)
The packet_limit can fall to 0, leading to a divide by zero abort in
the "packets % packet_limit".
An possible solution would be to apply a lower limit of 1 after the
calculation of packet_limit, but since any number modulo 1 gives 0,
the more efficient solution is to skip the modulo operation for
packet_limit <= 1.
Reported by: Karl Denninger <karl@denninger.net>
(cherry picked from commit 58080fbca0)
When fixing another bug, I noticed that the alternate
TCP stacks do not build when various combinations of
ipv4 and ipv6 are disabled.
Reviewed by: rrs, tuexen
Differential Revision: https://reviews.freebsd.org/D31094
Sponsored by: Netflix
(cherry picked from commit b1e806c0ed)
This fixes the incorrect use of a sysctl add to u64. It
was for a useconds time, but on 32 bit platforms its
not a u64. Instead use the long directive.
Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D31107
(cherry picked from commit 7312e4e5cf)
HPTS drives both rack and bbr, and yet there have been many complaints
about performance. This bit of work restructures hpts to help reduce CPU
overhead. It does this by now instead of relying on the timer/callout to
drive it instead use user return from a system call as well as lro flushes
to drive hpts. The timer becomes a backstop that dynamically adjusts
based on how "late" we are.
Reviewed by: tuexen, glebius
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D31083
(cherry picked from commit d7955cc0ff)
There are several cases where we make a goodput measurement and we are running
out of data when we decide to make the measurement. In reality we should not make
such a measurement if there is no chance we can have "enough" data. There is also
some corner case TLP's that end up not registering as a TLP like they should, we
fix this by pushing the doing_tlp setup to the actual timeout that knows it did
a TLP. This makes it so we always have the appropriate flag on the sendmap
indicating a TLP being done as well as count correctly so we make no more
that two TLP's.
In addressing the goodput lets also add a "quality" metric that can be viewed via
blackbox logs so that a casual observer does not have to figure out how good
of a measurement it is. This is needed due to the fact that we may still make
a measurement that is of a poorer quality as we run out of data but still have
a minimal amount of data to make a measurement.
Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D31076
(cherry picked from commit e834f9a44a)
Hardware TLS is now supported in some interface cards and it works well. Except that
when we have connections that retransmit a lot we get into trouble with all the retransmits.
This prep step makes way for change that Drew will be making so that we can "kick out" a
session from hardware TLS.
Reviewed by: tuexen, gallatin
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30895
(cherry picked from commit 9e4d9e4c4d)
We need to enter the network epoch when calling into
tfb_tcp_fb_fini. I noticed this when I hit an assert
running the latest rack
Differential Revision: https://reviews.freebsd.org/D30407
Reviewed by: rrs, tuexen
Sponsored by: Netflix
(cherry picked from commit 086a35562f)
There were two bugs that prevented V4 sockets from connecting to
a rack server running a V4/V6 socket. As well as a bug that stops the
mapped v4 in V6 address from working.
Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30885
PR: 256657
(cherry picked from commit 66aec14a53)
Some TCP stacks negotiate TS support, but do not send TS at all
or not for keep-alive segments. Since this includes modern widely
deployed stacks, tolerate the violation of RFC 7323 per default.
Reviewed by: rgrimes, rrs, rscheff
Differential Revision: https://reviews.freebsd.org/D30740
Sponsored by: Netflix, Inc.
(cherry picked from commit 870af3f4dc)
Current data structure is using a hash of unordered lists. Those
unordered lists are quite efficient, because the least recently
inserted entries are most likely to be used again. In order to avoid
long search times in other cases, the lists are hashed into many
buckets. Unfortunatly a search for a miss needs an exhaustive
inspection and a careful definition of the hash.
Splay trees offer a similar feature - almost O(1) for access of the
least recently used entries), and amortized O(ln(n) - for almost all
other cases. Get rid of the hash.
Now the data structure should able to quickly react to external
packets without eating CPU cycles for breakfast, preventing a DoS.
PR: 192888
Discussed with: Dimitry Luhtionov
Differential Revision: https://reviews.freebsd.org/D30516
Differential Revision: https://reviews.freebsd.org/D30536
Differential Revision: https://reviews.freebsd.org/D30844
(cherry picked from commit 935fc93af1)
(cherry picked from commit d261e57dea)
(cherry picked from commit f70c98a2f5)
(cherry picked from commit 25392fac94)
(cherry picked from commit 2f4d91f9cb)
(cherry picked from commit 4060e77f49)
Summary:
- Use LibAliasTime as a real global variable for central timekeeping.
- Reduce number of syscalls in user space considerably.
- Dynamically adjust the packet counters to match the second resolution.
- Only check the first few packets after a time increase for expiry.
Discussed with: hselasky
Differential Revision: https://reviews.freebsd.org/D30566
(cherry picked from commit ef828d39be)
Stats counters are used as unsigned valued (i.e. printf("%u")) but are
defined as signed int. This causes trouble later, so fix it early.
Differential Revision: https://reviews.freebsd.org/D30587
(cherry picked from commit 3fd20a79e7)
Replace current expensive, but sparsly called housekeeping
by a single, repetive action.
This is part of a larger restructure of libalias in order to switch to
more efficient data structures. The whole restructure process is
split into 15 reviews to ease reviewing. All those steps will be
squashed into a single commit for MFC in order to hide the
intermediate states from production systems.
Reviewed by: hselasky
Differential Revision: https://reviews.freebsd.org/D30277
(cherry picked from commit 294799c6b0)
This makes it easier to change the socket locking protocols. No
functional change intended.
Sponsored by: The FreeBSD Foundation
(cherry picked from commit a100217489)
Some code was using it already, but in many places we were testing
SO_ACCEPTCONN directly. As a small step towards fixing some bugs
involving synchronization with listen(2), make the kernel consistently
use SOLISTENING(). No functional change intended.
Sponsored by: The FreeBSD Foundation
(cherry picked from commit f4bb1869dd)
DXR maintains compressed lookup structures with a trivial search
procedure. A two-stage trie is indexed by the more significant bits of
the search key (IPv4 address), while the remaining bits are used for
finding the next hop in a sorted array. The tradeoff between memory
footprint and search speed depends on the split between the trie and
the remaining binary search. The default of 20 bits of the key being
used for trie indexing yields good performance (see below) with
footprints of around 2.5 Bytes per prefix with current BGP snapshots.
Rebuilding lookup structures takes some time, which is compensated for by
batching several RIB change requests into a single FIB update, i.e. FIB
synchronization with the RIB may be delayed for a fraction of a second.
RIB to FIB synchronization, next-hop table housekeeping, and lockless
lookup capability is provided by the FIB_ALGO infrastructure.
DXR works well on modern CPUs with several MBytes of caches, especially
in VMs, where is outperforms other currently available IPv4 FIB
algorithms by a large margin.
Reviewed by: melifaro
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D29821
(cherry picked from commit 2aca58e16f)
The current implement of ip_input() reject packets destined for
169.254.0.0/16, but not those original from 169.254.0.0/16 link-local
addresses.
Fix to fully respect RFC 3927 section 2.7.
PR: 255388
Reviewed by: donner, rgrimes, karels
Differential Revision: https://reviews.freebsd.org/D29968
Reviewed by: rgrimes, donner, karels, marcus, emaste
Differential Revision: https://reviews.freebsd.org/D30374
(cherry picked from commit 3d846e4822)
(cherry picked from commit 03b0505b8f)
Recently (Nov) we added logic that protects against a peer negotiating a timestamp, and
then not including a timestamp. This involved in the input path doing a goto done_with_input
label. Now I suspect the code was cribbed from one in Rack that has to do with the SYN.
This had a bug, i.e. it should have a m_freem(m) before going to the label (bbr had this
missing m_freem() but rack did not). This then caused the missing m_freem to show
up in both BBR and Rack. Also looking at the code referencing m->m_pkthdr.lro_nsegs
later (after processing) is not a good idea, even though its only for logging. Best to
copy that off before any frees can take place.
Reviewed by: mtuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30727
(cherry picked from commit ba1b3e48f5)
When running at NF the current Rack and BBR changes with the recent
commits from Richard that cause the socket buffer lock to be held over
the ip_output() call and then finally culminating in a call to tcp_handle_wakeup()
we get a lot of leaked mbufs. I don't think that this leak is actually caused
by holding the lock or what Richard has done, but is exposing some other
bug that has probably been lying dormant for a long time. I will continue to
look (using his changes) at what is going on to try to root cause out the issue.
In the meantime I can't leave the leaks out for everyone else. So this commit
will revert all of Richards changes and move both Rack and BBR back to just
doing the old sorwakeup_locked() calls after messing with the so_rcv buffer.
We may want to look at adding back in Richards changes after I have pinpointed
the root cause of the mbuf leak and fixed it.
Reviewed by: mtuexen,rscheff
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30704
(cherry picked from commit 67e892819b)
Recently we had a rewrite to tcp_lro.c that was tested but one subtle change
was the move to a less precise timestamp. This causes all kinds of chaos
in tcp's that do pacing and needs to be fixed to use the more precise
time that was there before.
Reviewed by: mtuexen, gallatin, hselasky
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30695
(cherry picked from commit b45daaea95)
* Completely initialise the CC module specific data
* Use beta_ecn in case of an ECN event whenever ABE is enabled
or it is requested by the stack.
Reviewed by: rscheff, rrs
Sponsored by: Netflix, Inc.
(cherry picked from commit fa3746be42)
from main to stable/13: Add TCP LRO support for VLAN and VxLAN.
Make sure all counters are allocated.
This is a direct commit.
Reported by: Herbert J. Skuhra <herbert@gojira.at>
Sponsored by: Mellanox Technologies // NVIDIA Networking
While partially reverting D24237 with D29690, due to introducing some
unintended effects for in-kernel TCP consumers, the preexisting lock
on the socket send buffer was not considered properly.
Found by: markj
MFC after: 2 weeks
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D30390
(cherry picked from commit 3975688563)
r367492 would unlock the socket buffer before eventually calling the upcall.
This leads to problematic interaction with NFS kernel server/client components
(MP threads) accessing the socket buffer with potentially not correctly updated
state.
Reported by: rmacklem
Reviewed By: tuexen, #transport
Tested by: rmacklem, otis
MFC after: 2 weeks
Sponsored By: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29690
(cherry picked from commit 032bf749fd)
So it turns out that my fix before was not correct. It ended with us failing
some of the "improved" SYN tests, since we are not in the correct states.
With more digging I have figured out the root of the problem is that when
we receive a SYN|FIN the reassembly code made it so we create a segq entry
to hold the FIN. In the established state where we were not in order this
would be correct i.e. a 0 len with a FIN would need to be accepted. But
if you are in a front state we need to strip the FIN so we correctly handle
the ACK but ignore the FIN. This gets us into the proper states
and avoids the previous ack war.
I back out some of the previous changes but then add a new change
here in tcp_reass() that fixes the root cause of the issue. We still
leave the rack panic fixes in place however.
Reviewed by: mtuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30627
(cherry picked from commit 4747500dea)
The last set of commits fixed both a panic (in rack) and an ACK-war (in freebsd and bbr).
However there was a missing case, i.e. where we get an out-of-order FIN by itself.
In such a case we don't want to leave the FIN bit set, otherwise we will do the
wrong thing and ack the FIN incorrectly. Instead we need to go through the
tcp_reasm() code and that way the FIN will be stripped and all will be well.
Reviewed by: mtuexen,rscheff
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30497
(cherry picked from commit 8c69d988a8)
Timer_slop, in TCP, has been 200ms for a long time. This value dates back
a long time when delayed ack timers were longer and links were slower. A
200ms timer slop allows 1 MSS to be sent over a 60kbps link. Its possible that
lowering this value to something more in line with todays delayed ack values (40ms)
might improve TCP. This bit of code makes it so rack can, via a socket option,
adjust the timer slop.
Reviewed by: mtuexen
Sponsered by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30249
(cherry picked from commit 4f3addd94b)
Michaels testing with UDP tunneling found an issue with the push bit, which was only partly fixed
in the last commit. The problem is the left edge gets transmitted before the adjustments are done
to the send_map, this means that right edge bits must be considered to be added only if
the entire RSM is being retransmitted.
Now syzkaller also continued to find a crash, which Michael sent me the reproducer for. Turns
out that the reproducer on default (freebsd) stack made the stack get into an ack-war with itself.
After fixing the reference issues in rack the same ack-war was found in rack (and bbr). Basically
what happens is we go into the reassembly code and lose the FIN bit. The trick here is we
should not be going into the reassembly code if tlen == 0 i.e. the peer never sent you anything.
That then gets the proper action on the FIN bit but then you end up in LAST_ACK with no
timers running. This is because the usrclosed function gets called and the FIN's and such have
already been exchanged. So when we should be entering FIN_WAIT2 (or even FIN_WAIT1) we get
stuck in LAST_ACK. Fixing this means tweaking the usrclosed function so that we properly
recognize the condition and drop into FIN_WAIT2 where a timer will allow at least TP_MAXIDLE
before closing (to allow time for the peer to retransmit its FIN if the ack is lost). Setting the fast_finwait2
timer can speed this up in testing.
Reviewed by: mtuexen,rscheff
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30451
(cherry picked from commit 13c0e198ca)
The push bit itself was also not actually being properly moved to
the right edge. The FIN bit was incorrectly on the left edge. We
fix these two issues as well as plumb in the mtu_change for
alternate stacks.
Reviewed by: mtuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30413
(cherry picked from commit 631449d5d0)
Handle the case where during socket option processing, the user
switches a stack such that processing the stack specific socket
option does not make sense anymore. Return an error in this case.
Reviewed by: markj
Reported by: syzbot+a6e1d91f240ad5d72cd1@syzkaller.appspotmail.com
Sponsored by: Netflix, Inc.
Differential revision: https://reviews.freebsd.org/D30395
(cherry picked from commit 8923ce6304)
Skyzall found an interesting panic in rack. When a SYN and FIN are
both sent together a KASSERT gets tripped where it is validating that
a mbuf pointer is in the sendmap. But a SYN and FIN often will not
have a mbuf pointer. So the fix is two fold a) make sure that the
SYN and FIN split the right way when cloning an RSM SYN on left
edge and FIN on right. And also make sure the KASSERT properly
accounts for the case that we have a SYN or FIN so we don't
panic.
Reviewed by: mtuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D30241
(cherry picked from commit 02cffbc250)
When the TCP is in the front states, don't take the slop variable
into account. This improves consistency with the base stack.
Reviewed by: rrs@
Differential Revision: https://reviews.freebsd.org/D30230
Sponsored by: Netflix, Inc.
(cherry picked from commit 251842c639)
Rack now after the previous commit is very careful to translate any
value in the hostcache for srtt/rttvar into its proper format. However
there is a snafu here in that if tp->srtt is 0 is the only time that
the HC will actually restore the srtt. We need to then only convert
the srtt restored when it is actually restored. We do this by making
sure it was zero before the call to cc_conn_init and it is non-zero
afterwards.
Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30213
(cherry picked from commit 4b86a24a76)
The hostcache up to now as been updated in the discard callback
but without checking if we are all done (the race where there are
more than one calls and the counter has not yet reached zero). This
means that when the race occurs, we end up calling the hc_upate
more than once. Also alternate stacks can keep there srtt/rttvar
in different formats (example rack keeps its values in microseconds).
Since we call the hc_update *before* the stack fini() then the
values will be in the wrong format.
Rack on the other hand, needs to convert items pulled from the
hostcache into its internal format else it may end up with
very much incorrect values from the hostcache. In the process
lets commonize the update mechanism for srtt/rttvar since we
now have more than one place that needs to call it.
Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30172
(cherry picked from commit 9867224bab)
platforms that for whatever reason cannot include the RATELIMIT option
can still work with rack. It adds two dummy functions that rack will
call and find out that the highest hw supported b/w is 0 (which
kinda makes sense and rack is already prepared to handle).
Reviewed by: Michael Tuexen, Warner Losh
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30163
(cherry picked from commit 5a4333a537)
issues.
A) Not enough hdrlen was being calculated when a UDP tunnel is
in place.
and
B) Not enough memory is allocated in racks fsb. We need to
overbook the fsb to include a udphdr just in case.
Submitted by: Peter Lei
Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30157
(cherry picked from commit a16cee0218)