This reverts commit 266f97b5e9, reversing
changes made to a10253cffe.
A mismerge of a merge to catch up to main resulted in files being
committed which should not have been.
Just trust the pcb database, that if we did in_pcbref(), no way
an inpcb can go away. And if we never put a dropped inpcb on
our queue, and tcp_discardcb() always removes an inpcb to be
dropped from the queue, then any inpcb on the queue is valid.
Now, to solve LOR between inpcb lock and HPTS queue lock do the
following trick. When we are about to process a certain time
slot, take the full queue of the head list into on stack list,
drop the HPTS lock and work on our queue. This of course opens
a race when an inpcb is being removed from the on stack queue,
which was already mentioned in comments. To address this race
introduce generation count into queues. If we want to remove
an inpcb with generation count mismatch, we can't do that, we
can only mark it with desired new time slot or -1 for remove.
Reviewed by: rrs
Differential revision: https://reviews.freebsd.org/D33026
The HPTS input queue is in reality used only for "delayed drops".
When a TCP stack decides to drop a connection on the output path
it can't do that due to locking protocol between main tcp_output()
and stacks. So, rack/bbr utilize HPTS to drop the connection in
a different context.
In the past the queue could also process input packets in context
of HPTS thread, but now no stack uses this, so remove this
functionality.
Reviewed by: rrs
Differential revision: https://reviews.freebsd.org/D33025
Also, make some of the functions also private to the module. Remove
unused functions discovered after that.
Reviewed by: rrs
Differential revision: https://reviews.freebsd.org/D33024
With introduction of epoch(9) synchronization to network stack the
inpcb database became protected by the network epoch together with
static network data (interfaces, addresses, etc). However, inpcb
aren't static in nature, they are created and destroyed all the
time, which creates some traffic on the epoch(9) garbage collector.
Fairly new feature of uma(9) - Safe Memory Reclamation allows to
safely free memory in page-sized batches, with virtually zero
overhead compared to uma_zfree(). However, unlike epoch(9), it
puts stricter requirement on the access to the protected memory,
needing the critical(9) section to access it. Details:
- The database is already build on CK lists, thanks to epoch(9).
- For write access nothing is changed.
- For a lookup in the database SMR section is now required.
Once the desired inpcb is found we need to transition from SMR
section to r/w lock on the inpcb itself, with a check that inpcb
isn't yet freed. This requires some compexity, since SMR section
itself is a critical(9) section. The complexity is hidden from
KPI users in inp_smr_lock().
- For a inpcb list traversal (a pcblist sysctl, or broadcast
notification) also a new KPI is provided, that hides internals of
the database - inp_next(struct inp_iterator *).
Reviewed by: rrs
Differential revision: https://reviews.freebsd.org/D33022
With upcoming changes to the inpcb synchronisation it is going to be
broken. Even its current status after the move of PCB synchronization
to the network epoch is very questionable.
This experimental feature was sponsored by Juniper but ended never to
be used in Juniper and doesn't exist in their source tree [sjg@, stevek@,
jtl@]. In the past (AFAIK, pre-epoch times) it was tried out at Netflix
[gallatin@, rrs@] with no positive result and at Yandex [ae@, melifaro@].
I'm up to resurrecting it back if there is any interest from anybody.
Reviewed by: rrs
Differential revision: https://reviews.freebsd.org/D33020
I just discovered that the return of the EBUSY error was incorrectly
rigged so that you could unload a CC module that was set to default.
Its supposed to be an EBUSY error. Make it so.
Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D33229
For socket options related to local and remote addresses providing
generic association ids does not make sense. Report EINVAL in this
case.
MFC after: 1 week
This allows it to work with unmapped mbufs. In particular,
in_cksum_skip() calls no longer need to be preceded by calls to
mb_unmapped_to_ext() to avoid a page fault.
PR: 259645
Reviewed by: gallatin, glebius, jhb
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33096
in_cksum() and related routines are implemented separately for each
platform, but only i386 and arm have optimized versions. Other
platforms' copies of in_cksum.c are identical except for style
differences and support for big-endian CPUs.
Deduplicate the implementations for the rest of the platforms. This
will make it easier to implement in_cksum() for unmapped mbufs. On arm
and i386, define HAVE_MD_IN_CKSUM to mean that the MI implementation is
not to be compiled.
No functional change intended.
Reviewed by: kp, glebius
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33095
It does not get compiled into the kernel. No functional change
inteneded.
Reviewed by: kp, glebius, cy
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33094
Until this change there were two places where we would free tcpcb -
tcp_discardcb() in case if all timers are drained and tcp_timer_discard()
otherwise. They were pretty much copy-n-paste, except that in the
default case we would run tcp_hc_update(). Merge this into single
function tcp_freecb() and move new short version of tcp_timer_discard()
to tcp_timer.c and make it static.
Reviewed by: rrs, hselasky
Differential revision: https://reviews.freebsd.org/D32965
In case we failed to uma_zalloc() and also failed to reuse with
tcp_tw_2msl_scan(), then just use on stack tcptw. This will allow
to run through tcp_twrespond() and standard tcpcb discard routine.
Reviewed by: rrs
Differential revision: https://reviews.freebsd.org/D32965
Previously we added ack-war prevention for misbehaving firewalls. This is
where the f/w or nat messes up its sequence numbers and causes an ack-war.
There is yet another type of ack war that we have found in the wild that is
like unto this. Basically the f/w or nat gets a ack (keep-alive probe or such)
and instead of turning the ack/seq around and adding a TH_RST it does something
real stupid and sends a new packet with seq=0. This of course triggers the challenge
ack in the reset processing which then sends in a challenge ack (if the seq=0 is within
the range of possible sequence numbers allowed by the challenge) and then we rinse-repeat.
This will add the needed tweaks (similar to the last ack-war prevention using the same sysctls and counters)
to prevent it and allow say 5 per second by default.
Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D32938
m_apply() works on unmapped mbufs, so this will let us elide
mb_unmapped_to_ext() calls preceding sctp_calculate_cksum() calls in
the network stack.
Modify sctp_calculate_cksum() to assume it's passed an mbuf header.
This assumption appears to be true in practice, and we need to know the
full length of the chain.
No functional change intended.
Reviewed by: tuexen, jhb
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32941
When no mask is supplied to the ioctl adding an Internet interface
address, revert to using the historical class mask rather than a
single default. Similarly for the NFS bootp code.
MFC after: 3 weeks
Reviewed by: melifaro glebius
Differential Revision: https://reviews.freebsd.org/D32951
The code was probably useful during the problem being chased down,
but for brevity makes sense just to return to the original KASSERT.
Reviewed by: rrs
Differential revision: https://reviews.freebsd.org/D32968
All timers keep inpcb locked through their execution. We need to
check these flags only once. Checking for INP_TIMEWAIT earlier is
is also safer, since such inpcbs point into tcptw rather than tcpcb,
and any dereferences of inp_ppcb as tcpcb are erroneous.
Reviewed by: rrs, hselasky
Differential revision: https://reviews.freebsd.org/D32967
This causes new vnets to inherit the cc algorithm from vnet0. This is a
temporary patch to fix vnet jail creation.
With encouragement from: glebius
Fixes: b8d60729de ("tcp: Congestion control cleanup.")
Differential Revision: https://reviews.freebsd.org/D32970
Define CC_NEWRENO in all the appropriate DEFAULTS and std.* config
files. It's the default congestion control algorithm. Add code to cc.c
so that CC_DEFAULT is "newreno" if it's not overriden in the config
file.
Sponsored by: Netflix
Fixes: b8d60729de ("tcp: Congestion control cleanup.")
Revired by: manu, hselasky, jhb, glebius, tuexen
Differential Revision: https://reviews.freebsd.org/D32964
Drop packets arriving from the network that have our source IP
address. If maliciously crafted they can create evil effects
like an RST exchange between two of our listening TCP ports.
Such packets just can't be legitimate. Enable the tunable
by default. Long time due for a modern Internet host.
Reviewed by: donner, melifaro
Differential revision: https://reviews.freebsd.org/D32914
Quick review confirms that they do not, also IPv6 doesn't expect
such a change in mbuf. In IPv4 this appeared in 0aade26e6d,
which doesn't seem to have a valid explanation why.
Reviewed by: donner, kp, melifaro
Differential revision: https://reviews.freebsd.org/D32913
This very questionable feature was enabled in FreeBSD for a very short
time. It was disabled very soon upon merging to RELENG_4 - 23d7f14119bf.
And in HEAD was also disabled pretty soon - 4bc37f9836.
The tunable has very vague name. Check interface for what? Given that
it was never documented and almost never enabled, I think it is fine
to rename it together with documenting it.
Also, count packets dropped by this tunable as ips_badaddr, otherwise
they fall down to ips_cantforward counter, which is misleading, as
packet was not supposed to be forwarded, it was destined locally.
Reviewed by: donner, kp
Differential revision: https://reviews.freebsd.org/D32912
While here rearrange the RVSP check to inspect proto first and avoid
evaluating V_rsvp in the common case to begin with (most notably avoid
the expensive read).
Reviewed by: glebius
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D32929
When a persists probe is lost, we will end up calculating a long
RTT based on the initial probe and when the response comes from the
second probe (or third etc). This means we have a minimum of a
confidence level of 3 on a incorrect probe. This commit will change it
so that we have one of two options
a) Just not count RTT of probes where we had a loss
<or>
b) Count them still but degrade the confidence to 0.
I have set in this the default being to just not measure them, but I am open
to having the default be otherwise.
Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D32897
NOTE: HEADS UP read the note below if your kernel config is not including GENERIC!!
This patch does a bit of cleanup on TCP congestion control modules. There were some rather
interesting surprises that one could get i.e. where you use a socket option to change
from one CC (say cc_cubic) to another CC (say cc_vegas) and you could in theory get
a memory failure and end up on cc_newreno. This is not what one would expect. The
new code fixes this by requiring a cc_data_sz() function so we can malloc with M_WAITOK
and pass in to the init function preallocated memory. The CC init is expected in this
case *not* to fail but if it does and a module does break the
"no fail with memory given" contract we do fall back to the CC that was in place at the time.
This also fixes up a set of common newreno utilities that can be shared amongst other
CC modules instead of the other CC modules reaching into newreno and executing
what they think is a "common and understood" function. Lets put these functions in
cc.c and that way we have a common place that is easily findable by future developers or
bug fixers. This also allows newreno to evolve and grow support for its features i.e. ABE
and HYSTART++ without having to dance through hoops for other CC modules, instead
both newreno and the other modules just call into the common functions if they desire
that behavior or roll there own if that makes more sense.
Note: This commit changes the kernel configuration!! If you are not using GENERIC in
some form you must add a CC module option (one of CC_NEWRENO, CC_VEGAS, CC_CUBIC,
CC_CDG, CC_CHD, CC_DCTCP, CC_HTCP, CC_HD). You can have more than one defined
as well if you desire. Note that if you create a kernel configuration that does not
define a congestion control module and includes INET or INET6 the kernel compile will
break. Also you need to define a default, generic adds 'options CC_DEFAULT=\"newreno\"
but you can specify any string that represents the name of the CC module (same names
that show up in the CC module list under net.inet.tcp.cc). If you fail to add the
options CC_DEFAULT in your kernel configuration the kernel build will also break.
Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc.
RELNOTES:YES
Differential Revision: https://reviews.freebsd.org/D32693
Previously, sorele() always required the socket lock and dropped the
lock if the released reference was not the last reference. Many
callers locked the socket lock just before calling sorele() resulting
in a wasted lock/unlock when not dropping the last reference.
Move the previous implementation of sorele() into a new
sorele_locked() function and use it instead of sorele() for various
places in uipc_socket.c that called sorele() while already holding the
socket lock.
The sorele() macro now uses refcount_release_if_not_last() try to drop
the socket reference without locking the socket. If that shortcut
fails, it locks the socket and calls sorele_locked().
Reviewed by: kib, markj
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D32741
Hide historical Class A/B/C macros unless IN_HISTORICAL_NETS is defined;
define it for user level. Define IN_MULTICAST separately from IN_CLASSD,
and use it in pf instead of IN_CLASSD. Stop using class for setting
default masks when not specified; instead, define new default mask
(24 bits). Warn when an Internet address is set without a mask.
MFC after: 1 month
Reviewed by: cy
Differential Revision: https://reviews.freebsd.org/D32708
There is a printf when a socket option down to the CC module fails, this really
should not be a printf. In fact this whole option needs to be re-thought in coordination
with some other changes in the CC modules (its just not right but its ok what it
does here if it fails since it will just use the ECN beta).
Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D32894