Use hc_ prefix instead of rmx_. The latter stands for "route metrics" and
is an artifact from the 1990s, when TCP caching was embedded into the
routing table. The rename should have happened back in 97d8d152c2.
No functional change. Done with the following sed(1) command:
s/rmx_(mtu|ssthresh|rtt|rttvar|cwnd|sendpipe|recvpipe|granularity|expire|q|hits|updates)/hc_\1/g
Properly calculate the expected flight size (cwnd) during
limited transmit. Exclude the SACK scoreboard from
consideration when still in limited transmit.
PR: 282605
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D47541
Change the older LOCK related macros over to the
dedicated send/recv buffer macros in the base tcp stack.
No functional change intended.
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D47567
Refactor cwnd handling and move the adjustment for SACKed data into
tcp_output(). With cwnd tracking the maximum extent starting at
snd_una, this allows both SACK loss recovery and SACK transmissions
after an RTO during slow start and, if allowed, the use of TSO while
in loss recovery.
Reviewed By: tuexen, cc, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43470
When snd_nxt doesn't track snd_max, partial SACK ACKs may elicit
unexpected duplicate retransmissions. This is usually masked by
LRO not necessarily ACKing every individual segment, and prior
to RFC6675 SACK loss recovery, harder to trigger even when an
RTO happens while SACK loss recovery is ongoing.
Address this by improving the logic for when to start SACK loss
recovery and how to deal with an RTO, and by improving the adjusted
congestion window during transmission selection.
Reviewed By: tuexen, cc, #transport
Sponsored by: NetApp, Inc.
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D43355
Implement ACK throttling of challenge ACKs as described in RFC 5961.
Reviewed by: Peter Lei, rscheff, cc
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D46066
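As a rough illustration of the ACK throttling mentioned above, the
following is a minimal sketch of a fixed-interval limiter. The struct,
its field names, and the limit and interval values are hypothetical and
chosen for illustration only; they are not taken from the FreeBSD
implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical throttle state; not the FreeBSD tcpcb layout. */
struct demo_challenge_ack_limiter {
	uint64_t epoch_start_ms;	/* start of the current interval */
	uint32_t sent_in_epoch;		/* challenge ACKs sent in this interval */
	uint32_t max_per_epoch;		/* assumed limit per interval */
	uint64_t epoch_len_ms;		/* assumed interval length */
};

/* Return true if a challenge ACK may be sent at time now_ms. */
static bool
demo_challenge_ack_allowed(struct demo_challenge_ack_limiter *l,
    uint64_t now_ms)
{
	assert(l != NULL);
	/* Start a new interval once the current one has elapsed. */
	if (now_ms - l->epoch_start_ms >= l->epoch_len_ms) {
		l->epoch_start_ms = now_ms;
		l->sent_in_epoch = 0;
	}
	if (l->sent_in_epoch >= l->max_per_epoch)
		return (false);
	l->sent_in_epoch++;
	return (true);
}
```

A caller would consult demo_challenge_ack_allowed() before emitting each
challenge ACK and silently drop the segment when it returns false.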
Implement the improved SEG.ACK validation described in RFC 5961.
In addition to that, also detect ghost ACKs, which are ACKs for data
that has never been sent.
The additional checks are enabled by default, but can be disabled
by setting the sysctl-variable net.inet.tcp.insecure_ack to a
non-zero value.
PR: 250357
Reviewed by: Peter Lei, rscheff (older version)
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D45894
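The SEG.ACK check from RFC 5961, Section 5.2 accepts an ACK only if
SND.UNA - MAX.SND.WND <= SEG.ACK <= SND.NXT. The sketch below expresses
that range in 32-bit sequence arithmetic. The function names are
hypothetical, and the ghost ACK test (an ACK below ISS + 1) is an
assumption about one plausible way to detect ACKs for data that was
never sent; it is only meaningful while ISS is still within 2^31 of
the ACK being checked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t tcp_seq;

/* Signed distance a - b in 32-bit sequence space. */
static int32_t
demo_seq_diff(tcp_seq a, tcp_seq b)
{
	return ((int32_t)(a - b));
}

/* RFC 5961, Section 5.2: SND.UNA - MAX.SND.WND <= SEG.ACK <= SND.NXT. */
static bool
demo_ack_in_range(tcp_seq snd_una, tcp_seq snd_nxt, uint32_t max_sndwnd,
    tcp_seq seg_ack)
{
	tcp_seq lower;

	assert(max_sndwnd < (UINT32_C(1) << 31));
	lower = snd_una - max_sndwnd;
	return (demo_seq_diff(seg_ack, lower) >= 0 &&
	    demo_seq_diff(snd_nxt, seg_ack) >= 0);
}

/* Assumption: an ACK below ISS + 1 acknowledges data that was never sent. */
static bool
demo_ack_is_ghost(tcp_seq iss, tcp_seq seg_ack)
{
	return (demo_seq_diff(seg_ack, iss + 1) < 0);
}
```

An ACK failing either check would be dropped, typically answered with a
(throttled) challenge ACK.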
Allow TSO to operate if the network interface supports IPsec inline
offload and supports TSO over it.
Reviewed by: tuexen
Sponsored by: NVIDIA networking
Differential revision: https://reviews.freebsd.org/D44222
When we first find an inp, we also set the tp. If a second
lookup is then necessary, the inp is recomputed. If this fails, the
tp is not cleared, which resulted in a failing KASSERT.
Therefore, clear the tp when starting the inp lookup procedure.
Reported by: Jenkins
Fixes: 02d15215ce ("tcp: improve blackhole support")
MFC after: 1 week
Sponsored by: Netflix, Inc.
There are two improvements to the TCP blackhole support:
(1) If net.inet.tcp.blackhole is set to 2, also send no RST whenever
a segment is received on an existing closed socket or if there is
a port mismatch when using UDP encapsulation.
(2) If net.inet.tcp.blackhole is set to 3, no RST segment is sent in
response to incoming segments on closed sockets or in response to
unexpected segments on listening sockets.
Thanks to gallatin@ for suggesting such an improvement.
Reviewed by: gallatin
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D45304
There is a type of attack that a TCP peer can launch on a connection. It certainly applies to RACK and BBR, and probably even to the default stack if it uses lists in SACK processing. The idea of the attack is to drive you to look at hundreds of SACK blocks that each update only 1 byte. For example, if you have bytes 1 - 10,000 outstanding, the attacker sends in something like:
ACK 0 SACK(1-512) SACK(1024 - 1536), SACK(2048-2536), SACK(4096 - 4608), SACK(8192-8704)
This first SACK looks fine, but then the attacker sends:
ACK 0 SACK(1-512) SACK(1025 - 1537), SACK(2049-2537), SACK(4097 - 4609), SACK(8193-8705)
ACK 0 SACK(1-512) SACK(1027 - 1539), SACK(2051-2539), SACK(4099 - 4611), SACK(8195-8707)
...
These blocks make you hunt across your linked list and split things up so that you have an entry for every other byte. As your list grows, you spend more and more CPU running through it. The attacker chooses entries as far apart as possible so that you have to run through the whole list. This example is small, but in theory, if the window is open to, say, 1 MB, you could end up with hundreds of thousands of linked-list entries.
To combat this we introduce three things:
- When the peer requests a very small MSS, we stop processing SACKs from them. This prevents a malicious peer from just using a small MSS to do the same thing.
- Any time we get a SACK block, we use the sack-filter to remove SACK blocks that are smaller than the smallest v4 MSS (minus 40 for max TCP options), unless the block ties up to snd_max (since that is legal). All other SACK blocks should in theory be at least an MSS. Against such an attacker, this means we basically start skipping all but MSS-sized SACK blocks.
- The sack filter used to throw away data when its bounds were exceeded. Instead, we now increase its size to 15 and then throw away SACKs if the filter gets overrun, to prevent a malicious attacker from overrunning the sack filter and thereby making us process things anyway.
The default stack will need to start using the sack-filter, which we have talked about in past conference calls, to take full advantage of the protections it offers (and to reduce CPU consumption when processing SACKs).
After this set of changes is in, RACK can drop its SAD detection completely.
Reviewed by: tuexen@, rscheff@
Differential Revision: https://reviews.freebsd.org/D44903
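The second countermeasure above (dropping undersized SACK blocks unless
they end at snd_max) can be sketched as follows. This is an
illustration only: the struct, the function name, and the min_len
parameter are hypothetical, and the real sack-filter keeps additional
state that is not modeled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t tcp_seq;

/* Hypothetical SACK block: covers sequence numbers [start, end). */
struct demo_sack_block {
	tcp_seq start;
	tcp_seq end;
};

/*
 * Compact blk[] in place, keeping only blocks that are at least
 * min_len bytes long or that end exactly at snd_max.
 * Returns the number of blocks kept.
 */
static size_t
demo_filter_small_sacks(struct demo_sack_block *blk, size_t n,
    tcp_seq snd_max, uint32_t min_len)
{
	size_t i, kept = 0;

	assert(blk != NULL || n == 0);
	for (i = 0; i < n; i++) {
		uint32_t len = blk[i].end - blk[i].start;

		if (len >= min_len || blk[i].end == snd_max)
			blk[kept++] = blk[i];
	}
	return (kept);
}
```

With this filter in front of the scoreboard, the one-byte blocks from
the attack pattern above never reach the list-splitting code.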
RFC 9293 describes the handling of data in the CLOSE-WAIT, CLOSING,
LAST-ACK, and TIME-WAIT states:
This should not occur since a FIN has been received from the remote
side. Ignore the segment text.
Therefore, implement this handling.
Reviewed by: rrs, rscheff
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D44746
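The rule quoted above amounts to a simple state test. The following
sketch uses a hypothetical enum rather than the kernel's own state
constants, and only models the decision, not the surrounding segment
processing.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state enum for illustration; not the kernel's values. */
enum demo_tcp_state {
	DEMO_ESTABLISHED,
	DEMO_FIN_WAIT_1,
	DEMO_FIN_WAIT_2,
	DEMO_CLOSE_WAIT,
	DEMO_CLOSING,
	DEMO_LAST_ACK,
	DEMO_TIME_WAIT,
	DEMO_STATE_COUNT
};

/* True if segment text must be ignored because a FIN was already received. */
static bool
demo_ignore_segment_text(enum demo_tcp_state state)
{
	assert(state < DEMO_STATE_COUNT);
	switch (state) {
	case DEMO_CLOSE_WAIT:
	case DEMO_CLOSING:
	case DEMO_LAST_ACK:
	case DEMO_TIME_WAIT:
		return (true);
	default:
		return (false);
	}
}
```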
Also log when dropping text or a FIN after having received a FIN.
This is the intended behavior described in RFC 9293.
A follow-up patch will enforce this behavior for the base stack
and the RACK stack.
Reviewed by: rscheff
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D44669
The macro is more obfuscating than helping as it just checks a single flag
of t_flags. All other t_flags bits are checked without a macro.
A bigger problem was that declaration of the macro in tcp_var.h depended
on a kernel option. It is a bad practice to create such definitions in
installable headers.
Reviewed by: rscheff, tuexen, kib
Differential Revision: https://reviews.freebsd.org/D44362
Ensure that snd_fack holds a valid value when doing
the post_recovery CC processing, for preparation of
the cc_cubic update, so that local pipe calculations
can correctly refer to snd_fack during and after CC events.
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43957
Follow up on D43768 to properly deal with the non-default
pipe calculation. When CC_RTO is processed, the timeout
will have already pulled back snd_nxt. Further, snd_fack
is not pulled along with snd_una.
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43876
The SACK scoreboard is conceptually an extension of the socket
buffer. Remove it when the socket buffer goes away with
soisdisconnected(). Verify that this is also the expected
state in tcp_discardcb().
PR: 276761
Reviewed by: glebius, tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43805
The RFC6675 pipe calculation (sack.revised, enabled
by default since D28702), uses outdated information,
while the previous default calculated it correctly
with up-to-date information from the incoming ACK.
This difference can become as large as the receive
window (not the congestion window previously),
potentially triggering a massive burst of new packets.
MFC after: 1 week
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43520
Use SEQ_SUB instead of a plain subtraction, for an implicit
type conversion and prevention of a possible overflow.
Use curly brackets in stacked if statements throughout.
Use the ? operator to enhance readability when clearing
the FIN flag in tcp_output().
None of the above changes functionality.
Reviewed By: tuexen, cc, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43539
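To illustrate why a sequence-space subtraction is preferable to a plain
one, the sketch below contrasts the two across a 32-bit wraparound. The
helper names are hypothetical, and demo_seq_sub() assumes that SEQ_SUB
yields the signed 32-bit difference of its arguments.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t tcp_seq;

/* Signed 32-bit distance a - b in sequence space (assumed semantics). */
static int32_t
demo_seq_sub(tcp_seq a, tcp_seq b)
{
	return ((int32_t)(a - b));
}

/* Plain subtraction in a wider type, shown only for contrast. */
static int64_t
demo_plain_sub(tcp_seq a, tcp_seq b)
{
	return ((int64_t)a - (int64_t)b);
}

/* Number of bytes in [start, end); end must not be before start. */
static uint32_t
demo_seq_range_len(tcp_seq start, tcp_seq end)
{
	assert(demo_seq_sub(end, start) >= 0);
	return ((uint32_t)demo_seq_sub(end, start));
}
```

For a range that starts just below the wrap point and ends just above
it, demo_seq_sub() reports a small positive length, while the widened
plain subtraction reports a large negative number.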
Improve slowpath processing (reordering, retransmissions)
slightly by calculating maxseg only once. This typically
saves one of two calls to tcp_maxseg().
Reviewed By: glebius, tuexen, cc, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43536
PRR state was not properly reset on subsequent ECN CE
events. Clean up after local transmission failures too.
Reviewed by: tuexen, cc, #transport
MFC after: 3 days
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43170
Only try sending more data on pure ACKs when there is
more data available in the send buffer.
In the case of a retransmitted SYN not being sent due to
an internal error, the snd_una/snd_nxt accounting could
be off, leading to a panic. Pulling snd_nxt up to snd_una
prevents this from happening.
Reported by: fengdreamer@126.com
Reviewed by: cc, tuexen, #transport
MFC after: 1 week
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D43343
Keeping the SACK scoreboard intact after the first RTO
and retransmitting all data anew only on subsequent RTOs
allows a more timely and efficient loss recovery under
many adverse circumstances.
Reviewed By: tuexen, #transport
MFC after: 10 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D42906
The tcp_tun_port field, which is used to pass the port value between
UDP and TCP in case of tunneling, is a generic field that is used to
pass data between network layers. It can be contaminated on entry, e.g.
by a VLAN tag set by a NIC driver. Explicitly set it, so that it
is zeroed out for normal, non-tunneled TCP. If it contains garbage,
tcp_twcheck() can later enter the wrong block of code and treat the
packet as an incorrectly tunneled one. On main and stable/14 that will
end up with sending incorrect responses, but on stable/13 with ipfw(8)
and pcb-matching rules it may end up in a panic.
This is a minimal conservative patch to be merged to stable branches.
Later we may redesign this.
PR: 275169
Reviewed by: tuexen
Differential Revision: https://reviews.freebsd.org/D43065
Don't let PRR pass up on the opportunity of clocking
out packets on arrival of ACKs - by pulling sends
forward by about half a packet. Prevents unexpectedly
long runs of incoming ACKs without eliciting a
packet transmission.
MFC after: 1 week
Reviewed By: #transport, tuexen
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D42918
Move the lrd sysctl to the tcp.sack branch, since LRD only works with SACK.
Remove the sockopt to programmatically control LRD per session.
Reviewed By: #transport, tuexen, rrs
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D42851
Lost Retransmission Detection was added as a
feature in May 2021, but disabled by default.
Enable the feature by default to reduce the
flow completion time by avoiding RTOs when
retransmissions get lost too.
Reviewed By: tuexen, #transport, zlei
MFC after: 10 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D42845
Remove ancient SCCS tags from the tree using automated scripting, with
two minor fixups to keep things compiling. All the common forms in the
tree were removed with a perl script.
Sponsored by: Netflix
Improve Proportional Rate Reduction (RFC6937) by using a
heuristic, which automatically chooses between
conservative CRB and more aggressive SSRB modes.
Only when snd_una advances (a partial ACK), SSRB may be
used. Also, that ACK must not have any indication of
ongoing loss - using the addition of new holes into the
scoreboard as proxy for such an event.
MFC after: 4 weeks
Reviewed By: #transport, kbowling, rrs
Sponsored By: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D28822
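The choice between the two reduction bounds can be sketched as below,
following the pseudocode structure of RFC 6937 for the case where pipe
is at or below ssthresh. All names are hypothetical, the two boolean
inputs stand in for "snd_una advanced" and "new holes were added to the
scoreboard", and clamping the result at zero is a choice made for this
sketch rather than something stated in the commit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Bytes that may be sent in response to one ACK while pipe <= ssthresh.
 * Uses PRR-SSRB only when snd_una advanced and the ACK showed no new
 * holes; otherwise falls back to the conservative PRR-CRB.
 */
static int64_t
demo_prr_bound_sndcnt(int64_t prr_delivered, int64_t prr_out,
    int64_t delivered_data, int64_t mss, int64_t ssthresh, int64_t pipe,
    bool una_advanced, bool new_holes)
{
	int64_t limit, room;

	assert(pipe <= ssthresh);	/* only this branch is modeled */
	limit = prr_delivered - prr_out;	/* PRR-CRB */
	if (una_advanced && !new_holes) {
		/* PRR-SSRB: MAX(prr_delivered - prr_out, DeliveredData) + MSS */
		if (delivered_data > limit)
			limit = delivered_data;
		limit += mss;
	}
	room = ssthresh - pipe;
	if (limit > room)
		limit = room;
	return (limit > 0 ? limit : 0);
}
```

With identical counters, the SSRB path permits one extra MSS per ACK
compared to CRB, which is why it is reserved for ACKs that show forward
progress without signs of further loss.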
Add more accounting while processing SACK data, to
keep track of when a packet is deemed lost using
the RFC6675 guidance.
Together with PRR (RFC6937) this allows a sender to
retransmit presumed lost packets faster, and loss
recovery to complete earlier.
Reviewed By: cc, rrs, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D39299
Patch base stack to correctly handle the RST bit independently
of other header flags per TCP RFC.
MFC after: 1 week
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D40982
Summary:
This brings some benefit of a tcp flow identification for some kernel
modules, such as siftr.
Reviewers: rrs, rscheff, tuexen, #transport!
Approved by: tuexen (mentor), rrs
Subscribers: imp, melifaro, glebius
Differential Revision: https://reviews.freebsd.org/D40061
The socket argument is superfluous, as a tcpcb always has one and
only one socket.
Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D39434
Use the same counter that ip_input()/ip6_input() use for bad destination
address. For IPv6 this is already heavily abused ip6s_badscope, which
needs to be split into several separate error counters.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D39234
The TCP stacks have long accessed t_logstate directly, but in order to do tracepoints and the new bbpoints
we need to move to using the new inline functions. This adds them and moves rack to now use
the tcp_tracepoints.
Reviewed by: tuexen, gallatin
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D38831
The assertions added in commit b0ccf53f24 ("inpcb: Assert against
wildcard addrs in in_pcblookup_hash_locked()") revealed that protocol
layers may pass the unspecified address to in_pcblookup().
Add some checks to filter out such packets before we attempt an inpcb
lookup:
- Disallow the use of an unspecified source address in in_pcbladdr() and
in6_pcbladdr().
- Disallow IP packets with an unspecified destination address.
- Disallow TCP packets with an unspecified source address, and add an
assertion to verify the comment claiming that the case of an
unspecified destination address is handled by the IP layer.
Reported by: syzbot+9ca890fb84e984e82df2@syzkaller.appspotmail.com
Reported by: syzbot+ae873c71d3c71d5f41cb@syzkaller.appspotmail.com
Reported by: syzbot+e3e689aba1d442905067@syzkaller.appspotmail.com
Reviewed by: glebius, melifaro
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D38570
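The kind of check listed above can be sketched in userland terms as
follows. The function names are hypothetical; the sketch relies only on
the standard INADDR_ANY constant and the IN6_IS_ADDR_UNSPECIFIED()
macro, and does not reproduce where in the kernel paths the commit
places its checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <arpa/inet.h>
#include <netinet/in.h>

/* Reject the IPv4 unspecified address (0.0.0.0) as a source. */
static bool
demo_ipv4_src_acceptable(const struct in_addr *src)
{
	assert(src != NULL);
	return (src->s_addr != htonl(INADDR_ANY));
}

/* Reject the IPv6 unspecified address (::) as a source. */
static bool
demo_ipv6_src_acceptable(const struct in6_addr *src)
{
	assert(src != NULL);
	return (!IN6_IS_ADDR_UNSPECIFIED(src));
}
```

Filtering such packets early means the lookup code never sees a
wildcard address where it expects a concrete one.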
During tcp session start, various mechanisms need to
track a few initial RTTs before becoming active.
Prevent overflows of the corresponding tracking counter
and reduce the size of tcpcb simultaneously.
Reviewed By: #transport, tuexen, guest-ccui
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D21117
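One common way to keep a small counter from overflowing is to let it
saturate at its maximum value. The sketch below shows that technique
with an 8-bit field; the struct and function names are hypothetical,
and whether the commit uses saturation or another scheme is not stated
above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tracking state with a deliberately small counter. */
struct demo_rtt_track {
	uint8_t rtt_seen;	/* number of RTT samples seen, saturating */
};

/* Record one RTT sample without ever wrapping back to zero. */
static void
demo_rtt_sample(struct demo_rtt_track *t)
{
	assert(t != NULL);
	if (t->rtt_seen < UINT8_MAX)
		t->rtt_seen++;
}

/* True once at least `needed` RTT samples have been seen. */
static bool
demo_rtt_warmed_up(const struct demo_rtt_track *t, uint8_t needed)
{
	assert(t != NULL);
	return (t->rtt_seen >= needed);
}
```

Because the counter sticks at its maximum, a long-lived session cannot
wrap around and make a mechanism believe it is back in its start-up
phase.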
The ipfw(4) feature of forwarding to local address without modifying
a packet was broken. The first lookup must always be a non-wildcard
one, because its goal is to find an already existing socket. Otherwise
a local wildcard listener with the same port number may match, resulting
in the connection being forwarded to the wrong port.
Reported by: Pavel Polyakov <bsd kobyla.org>
Fixes: d88eb4654f
This subsystem is superseded by modern debugging facilities,
e.g. DTrace probes and TCP black box logging.
We intentionally leave SO_DEBUG in place, as many utilities may
set it on a socket. Also the tcp::debug DTrace probes look at
this flag on a socket.
Reviewed by: gnn, tuexen
Discussed with: rscheff, rrs, jtl
Differential revision: https://reviews.freebsd.org/D37694
For the TCP protocol inpcb storage specify allocation size that would
provide space to most of the data a TCP connection needs, embedding
into struct tcpcb several structures, that previously were allocated
separately.
The most important one is the inpcb itself. With embedding we can provide
a strong guarantee that with a valid TCP inpcb the tcpcb is always valid
and vice versa. Also we reduce the number of allocs/frees per connection.
The embedded inpcb is placed in the beginning of the struct tcpcb,
since in_pcballoc() requires that. However, later we may want to move
it around for cache line efficiency, and this can be done with a little
effort. The new intotcpcb() macro is ready for such move.
The congestion algorithm data, the TCP timers and osd(9) data are
also embedded into tcpcb, and the temporary struct tcpcb_mem goes away.
There was no extra allocation here, but we went through extra pointer
every time we accessed this data.
One interesting side effect is that now TCP data is allocated from
SMR-protected zone. Potentially this allows the TCP stacks or other
TCP related modules to utilize that for their own synchronization.
Large part of the change was done with sed script:
s/tp->ccv->/tp->t_ccv./g
s/tp->ccv/\&tp->t_ccv/g
s/tp->cc_algo/tp->t_cc/g
s/tp->t_timers->tt_/tp->tt_/g
s/CCV\(ccv, osd\)/\&CCV(ccv, t_osd)/g
A dependency side effect is that code that needs to know struct tcpcb
should also know struct inpcb, which added several <netinet/in_pcb.h>
includes.
Differential revision: https://reviews.freebsd.org/D37127
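The embedding described above, and a conversion macro that keeps
working if the embedded member is later moved, can be sketched with an
offsetof-based "container of" pattern. The demo_ structs and macro are
hypothetical stand-ins; how the real intotcpcb() is defined is not
shown in this message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-ins for the kernel structures. */
struct demo_inpcb {
	int inp_flags;
};

struct demo_tcpcb {
	struct demo_inpcb t_inpcb;	/* embedded; currently first member */
	int t_state;
};

/* Recover the enclosing demo_tcpcb from a pointer to its embedded inpcb. */
#define	demo_intotcpcb(inp)						\
	((struct demo_tcpcb *)(void *)((char *)(inp) -			\
	    offsetof(struct demo_tcpcb, t_inpcb)))

/* The reverse direction is a plain member access. */
static struct demo_inpcb *
demo_tptoinpcb(struct demo_tcpcb *tp)
{
	assert(tp != NULL);
	return (&tp->t_inpcb);
}
```

Since the macro subtracts the member offset instead of casting the
pointer directly, moving t_inpcb elsewhere in the struct would not
require touching any caller.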
The inp_socket is cleared only in in_pcbdetach(), which for TCP is
always accompanied by in_pcbfree(). An inpcb that went through
in_pcbfree() shall never be returned by any kind of pcb lookup.
Reviewed by: tuexen
Differential revision: https://reviews.freebsd.org/D37062
The in_pcbdrop() KPI, which is used solely by TCP, allows removing a
pcb from the hash list and marking it as dropped. The comment suggests
that such a pcb won't be returned by lookups. Indeed, every call to
in_pcblookup*() is accompanied by a check for INP_DROPPED. Do what the
comment suggests: never return such pcbs and remove unnecessary checks.
Reviewed by: tuexen
Differential revision: https://reviews.freebsd.org/D37061
Keep all ECN related code in (mostly) one place.
No functional change.
Event: IETF 115 Hackathon
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D37285