Hide historical Class A/B/C macros unless IN_HISTORICAL_NETS is defined;
define it for user level. Define IN_MULTICAST separately from IN_CLASSD,
and use it in pf instead of IN_CLASSD. Stop using class for setting
default masks when not specified; instead, define new default mask
(24 bits). Warn when an Internet address is set without a mask.
(cherry picked from commit 20d5940396)
Ensure that TTL and TOS values set on a listener get inheritet
to the accepted sockets.
PR: 260119
MFC after: 1 week
(cherry picked from commit d79676fb13)
When sending packets the stcb was used to access the inp and then
access the endpoint specific IPv6 level options. This fails when
there exists an inp, but no stcb yet. This is the case for sending
an INIT-ACK in response to an INIT when no association already
exists. Fix this by just providing the inp instead of the stcb.
PR: 260120
MFC after: 1 week
(cherry picked from commit f32357be53)
For socket options related to local and remote addresses providing
generic association ids does not make sense. Report EINVAL in this
case.
MFC after: 1 week
(cherry picked from commit 13c196a41e)
Current logic always selects an IFA of the same family from the
outgoing interfaces. In IPv4 over IPv6 setup there can be just
single non-127.0.0.1 ifa, attached to the loopback interface.
Create a separate rt_getifa_family() to handle entire ifa selection
for the IPv4 over IPv6.
Differential Revision: https://reviews.freebsd.org/D31868
MFC after: 1 week
(cherry picked from commit 4b631fc832)
With the new FIB_ALGO infrastructure, nearly all subsystems use
fib[46]_lookup() functions, which provides lockless lookups.
A number of places remains that uses old-style lookup functions, that
still requires RIB read lock to return the result. One of such places
is arp processing code.
FIB_ALGO implementation makes some tradeoffs, resulting in (relatively)
prolonged periods of holding RIB_WLOCK. If the lock is held and datapath
competes for it, the RX ring may get blocked, ending in traffic delays and losses.
As currently arp processing is performed directly in the interrupt handler,
handling ARP replies triggers the problem descibed above when the amount of
ARP replies is high.
To be more specific, prior to creating new ARP entry, routing lookup for the entry
address in interface fib is executed. The following conditions are the verified:
1. If lookup returns an empty result, or the resulting prefix is non-directly-reachable,
failure is returned. The only exception are host routes w/ gateway==address.
2. If the routing lookup returns different interface and non-host route,
we want to support the use case of having multiple interfaces with the same prefix.
In fact, the current code just checks if the returned prefix covers target address
(always true) and effectively allow allocating ARP entries for any directly-reachable prefix,
regardless of its interface.
Change the code to perform the following:
1) use fib4_lookup() to get the nexthop, instead of requesting exact prefix.
2) Rewrite first condition check using nexthop flags (1:1 match)
3) Rewrite second condition to check for interface addresses matching target address on
the input interface.
Differential Revision: https://reviews.freebsd.org/D31824
Reviewed by: ae
MFC after: 1 week
PR: 257965
(cherry picked from commit 936f4a42fa)
in_cksum_skip() now handles unmapped mbufs on platforms where they're
permitted.
Reviewed by: glebius, jhb
Sponsored by: The FreeBSD Foundation
(cherry picked from commit 44775b163b)
This allows it to work with unmapped mbufs. In particular,
in_cksum_skip() calls no longer need to be preceded by calls to
mb_unmapped_to_ext() to avoid a page fault.
PR: 259645
Reviewed by: gallatin, glebius, jhb
Sponsored by: The FreeBSD Foundation
(cherry picked from commit 0d9c3423f5)
in_cksum() and related routines are implemented separately for each
platform, but only i386 and arm have optimized versions. Other
platforms' copies of in_cksum.c are identical except for style
differences and support for big-endian CPUs.
Deduplicate the implementations for the rest of the platforms. This
will make it easier to implement in_cksum() for unmapped mbufs. On arm
and i386, define HAVE_MD_IN_CKSUM to mean that the MI implementation is
not to be compiled.
No functional change intended.
Reviewed by: kp, glebius
Sponsored by: The FreeBSD Foundation
(cherry picked from commit ecbbe83144)
It does not get compiled into the kernel. No functional change
inteneded.
Reviewed by: kp, glebius, cy
Sponsored by: The FreeBSD Foundation
(cherry picked from commit 5195bcc212)
sctp_delayed_checksum() now handles unmapped mbufs, thanks to m_apply().
No functional change intended.
Reviewed by: tuexen
Sponsored by: The FreeBSD Foundation
(cherry picked from commit 756bb50b6a)
m_apply() works on unmapped mbufs, so this will let us elide
mb_unmapped_to_ext() calls preceding sctp_calculate_cksum() calls in
the network stack.
Modify sctp_calculate_cksum() to assume it's passed an mbuf header.
This assumption appears to be true in practice, and we need to know the
full length of the chain.
No functional change intended.
Reviewed by: tuexen, jhb
Sponsored by: The FreeBSD Foundation
(cherry picked from commit b4d758a0cc)
Given nodes 1 and 2, where node 1 has an advskew of 0 and node 2 has an
advskew of 100, making them master and backup respectively.
If net.inet.carp.demotion is set to a negative value on node 1, node 2
might become master while node 1 still retains it master status. Wether
or not node 2 becomes master seems to depend on the nodes advskew and
what the demotion sysctl was set to on node 1.
The reason for node 2 becoming master seems to be that the calculated
advskew taking demotion into account is truncated to a single unsigned
byte when copied into the carp header for sending, and node 1 stays
master since it takes uses the whole non-truncated calculated advskew
when deciding wether to stay master.
PR: 259528
Reviewed by: donner, glebius
MFC after: 3 weeks
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D32759
(cherry picked from commit 1019354b54)
Previously in_pcbbind_setup returned EADDRNOTAVAIL for empty
V_in_ifaddrhead (i.e., no IPv4 addresses configured) and in6_pcbbind
did the same for empty V_in6_ifaddrhead (no IPv6 addresses).
An equivalent test has existed since 4.4-Lite. It was presumably done
to avoid extra work (assuming the address isn't going to be found
later).
In normal system operation *_ifaddrhead will not be empty: they will
at least have the loopback address(es). In practice no work will be
avoided.
Further, this case caused net/dhcpd to fail when run early in boot
before assignment of any addresses. It should be possible to bind the
unspecified address even if no addresses have been configured yet, so
just remove the tests.
The now-removed "XXX broken" comments were added in 59562606b9,
which converted the ifaddr lists to TAILQs. As far as I (emaste) can
tell the brokenness is the issue described above, not some aspect of
the TAILQ conversion.
PR: 253166
Reviewed by: ae, bz, donner, emaste, glebius
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D32563
(cherry picked from commit 5c5340108e)
Before passing an IPv6 packet to application apply delayed checksum
calculation. Mbuf flags will be lost when divert listener will return a
packet back, so we will not be able to do delayed checksum calculation
later. Also an application will get a packet with correct checksum.
Reviewed by: donner
Differential Revision: https://reviews.freebsd.org/D32807
(cherry picked from commit 4a9e95286c)
The address with a host part of all zeros was used as a broadcast long
ago, but the default has been all ones since 4.3BSD and RFC1122. Until
now, we would broadcast the host zero address as well as the configured
address. Change to not broadcasting that address by default, but add a
sysctl (net.inet.ip.broadcast_lowest) to re-enable it. Note that the
correct way to use the zero address for broadcast would be to configure
it as the broadcast address for the network.
See https:/datatracker.ietf.org/doc/draft-schoen-intarea-lowest-address/
and the discussion in https://reviews.freebsd.org/D19316. Note, Linux
now implements this.
Reviewed by: rgrimes, tuexen; melifaro (previous version)
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D31861
(cherry picked from commit fd0765933c)
Tracking the number of unused holes in the trie and the range table
was a bad metric based on which full trie and / or range rebuilds
were triggered, which would happen in vain by far too frequently,
particularly with live BGP feeds.
Instead, track the total unused space inside the trie and range table
structures, and trigger rebuilds if the percentage of unused space
exceeds a sysctl-tunable threshold.
MFC after: 3 days
PR: 257965
In preparation for moving sockbuf locks into the containing socket,
provide alternative macros for the sockbuf I/O locks:
SOCK_IO_SEND_(UN)LOCK() and SOCK_IO_RECV_(UN)LOCK(). These operate on a
socket rather than a socket buffer. Note that these locks are used only
to prevent concurrent readers and writters from interleaving I/O.
When locking for I/O, return an error if the socket is a listening
socket. Currently the check is racy since the sockbuf sx locks are
destroyed during the transition to a listening socket, but that will no
longer be true after some follow-up changes.
Modify a few places to check for errors from
sblock()/SOCK_IO_(SEND|RECV)_LOCK() where they were not before. In
particular, add checks to sendfile() and sorflush().
Reviewed by: tuexen, gallatin
Sponsored by: The FreeBSD Foundation
(cherry picked from commit f94acf52a4)
Traversing a single list of unused range chunks in search for a block
of optimal size was suboptimal.
The experience with real-world BGP workloads has shown that on average
unused range chunks are tiny, mostly in length from 1 to 4 or 5, when
DXR is configured with K = 20 which is the current default (D16X4R).
Therefore, introduce a limited amount of buckets to accomodate descriptors
of empty blocks of fixed (small) size, so that those can be found in O(1)
time. If no empty chunks of the requested size can be found in fixed-size
buckets, the search continues in an unsorted list of empty chunks of
variable lengths, which should only happen infrequently.
This change should permit us to manage significantly more empty range
chunks without sacrifying the speed of incremental range table updating.
MFC after: 3 days
There are two flags to request a non-blocking receive on a socket:
MSG_NBIO and MSG_DONTWAIT. They are handled a bit differently in that
soreceive_generic() and soreceive_stream() will block on the socket I/O
lock when MSG_NBIO is set, but not if MSG_DONTWAIT is set. In general,
MSG_NBIO seems to mean, "don't block if there is no data to receive" and
MSG_DONTWAIT means "don't go to sleep for any reason".
SCTP's soreceive implementation did not allow blocking on the I/O lock
if either flag is set, but this violates an assumption in
aio_process_sb(), which specifies MSG_NBIO but nonetheless
expects to make progress if data is available to read. Change
sctp_sorecvmsg() to block on the I/O lock only if MSG_DONTWAIT
is not set.
Reported by: syzbot+c7d22dbbb9aef509421d@syzkaller.appspotmail.com
Reviewed by: tuexen
Sponsored by: The FreeBSD Foundation
(cherry picked from commit e6c19aa94d)
A division by zero would occur if DXR would be activated on a vnet
with no IP addresses configured on any interfaces.
PR: 257965
MFC after: 3 days
Reported by: Raul Munoz
(cherry picked from commit eb3148cc4d)
Don't rebuild in vain trie parts unaffected by accumulated incremental
RIB updates.
PR: 257965
Tested by: Konrad Kreciwilk
MFC after: 3 days
(cherry picked from commit b51f8bae57)
Free memory before return from arprequest_internal(). In in_arpinput(),
if arp_fillheader() fails, it should use goto drop.
Reviewed by: melifaro, imp, markj
Pull Request: https://github.com/freebsd/freebsd-src/pull/534
(cherry picked from commit f5777c123a)
- The SCTP_PCB_FLAGS_SND_ITERATOR_UP check was racy, since two threads
could observe that the flag is not set and then both set it. I'm not
sure if this is actually a problem in practice, i.e., maybe there's no
problem having multiple sends for a single PCB in the iterator list?
- sctp_sendall() was modifying sctp_flags without the inp lock held.
The change simply acquires the PCB write lock before toggling the flag,
fixing both problems.
Reviewed by: tuexen
Sponsored by: The FreeBSD Foundation
(cherry picked from commit 173a7a4ee4)
This appears to be unused in usrsctp as well. No functional change
intended.
Reviewed by: tuexen
Sponsored by: The FreeBSD Foundation
(cherry picked from commit e8e23ec127)
sctp_close() and sctp_abort() disassociate the PCB from its socket.
As a part of this, they attempt to free the PCB, which may end up
lingering. Fix some bugs in this area:
- For some reason, sctp_close() and sctp_abort() set
SCTP_PCB_FLAGS_SOCKET_GONE using an atomic compare-and-set without the
PCB lock held. This is racy since sctp_flags is normally updated
without atomics, using the PCB lock to synchronize. So, the update
can be lost, which can cause all sort of races with other SCTP
components which look for the _GONE flag. Fix the problem simply by
acquiring the PCB lock in order to set the flag. Note that we have to
drop and re-acquire the lock again in sctp_inpcb_free(), but I don't
see a good way around that for now. If it's a real problem, the _GONE
flag could be split out of sctp_flags and into a dedicated sctp_inpcb
field.
- In sctp_inpcb_free(), load sctp_socket after acquiring the PCB lock,
to avoid possible races with parallel sctp_inpcb_free() calls.
- Add an assertion sctp_inpcb_free() to verify that _ALLGONE is not set.
Reviewed by: tuexen
Sponsored by: The FreeBSD Foundation
(cherry picked from commit c17b531bed)
It appears that maliciously crafted ifaliasreq can lead to NULL
pointer dereference in in_aifaddr_ioctl(). In order to replicate
that, one needs to
1. Ensure that carp(4) is not loaded
2. Issue SIOCAIFADDR call setting ifra_vhid field of the request
to a negative value.
A repro code would look like this.
int main() {
struct ifaliasreq req;
struct sockaddr_in sin, mask;
int fd, error;
bzero(&sin, sizeof(struct sockaddr_in));
bzero(&mask, sizeof(struct sockaddr_in));
sin.sin_len = sizeof(struct sockaddr_in);
sin.sin_family = AF_INET;
sin.sin_addr.s_addr = inet_addr("192.168.88.2");
mask.sin_len = sizeof(struct sockaddr_in);
mask.sin_family = AF_INET;
mask.sin_addr.s_addr = inet_addr("255.255.255.0");
fd = socket(AF_INET, SOCK_DGRAM, 0);
if (fd < 0)
return (-1);
memset(&req, 0, sizeof(struct ifaliasreq));
strlcpy(req.ifra_name, "lo0", sizeof(req.ifra_name));
memcpy(&req.ifra_addr, &sin, sin.sin_len);
memcpy(&req.ifra_mask, &mask, mask.sin_len);
req.ifra_vhid = -1;
return ioctl(fd, SIOCAIFADDR, (char *)&req);
}
To fix, discard both positive and negative vhid values in
in_aifaddr_ioctl, if carp(4) is not loaded. This prevents NULL pointer
dereference and kernel panic.
Reviewed by: imp@
Pull Request: https://github.com/freebsd/freebsd-src/pull/530
(cherry picked from commit 620cf65c2b)
This will be used by sctp_listen() to avoid dropping locks when
performing an implicit bind. No functional change intended.
Reviewed by: tuexen
Sponsored by: The FreeBSD Foundation
(cherry picked from commit 457abbb857)
Later in sctp_free_assoc(), when we clean up chunk lists,
sctp_free_spbufspace() is used to reset the byte count in the socket
send buffer. However, if the PCB is going away, the socket may already
have been detached from the PCB, in which case this becomes a use-after
free. Clear the socket reference from the association before detaching
it from the PCB, if the PCB has already lost its socket reference.
Reviewed by: tuexen
Sponsored by: The FreeBSD Foundation
(cherry picked from commit 65f30a39e1)
At this point we do not hold the inpcb lock, so the only thing holding
the socket reference live is the TCB lock, which needs to be acquired by
sctp_inpcb_free() in order to destroy associations. Defer the unlock to
until after we dereference the socket reference.
Reported by: syzbot+1d0f2c4675de76a4cf1e@syzkaller.appspotmail.com
Reported by: syzbot+fabee77954fe69d3a5ad@syzkaller.appspotmail.com
Reviewed by: tuexen
Sponsored by: The FreeBSD Foundation
(cherry picked from commit d35be50f57)
Implement kernel support for RFC 5549/8950.
* Relax control plane restrictions and allow specifying IPv6 gateways
for IPv4 routes. This behavior is controlled by the
net.route.rib_route_ipv6_nexthop sysctl (on by default).
* Always pass final destination in ro->ro_dst in ip_forward().
* Use ro->ro_dst to exract packet family inside if_output() routines.
Consistently use RO_GET_FAMILY() macro to handle ro=NULL case.
* Pass extracted family to nd6_resolve() to get the LLE with proper encap.
It leverages recent lltable changes committed in c541bd368f.
Presence of the functionality can be checked using ipv4_rfc5549_support feature(3).
Example usage:
route add -net 192.0.0.0/24 -inet6 fe80::5054:ff:fe14:e319%vtnet0
Differential Revision: https://reviews.freebsd.org/D30398
(cherry picked from commit 62e1a437f3)
Currently we use pre-calculated headers inside LLE entries as prepend data
for `if_output` functions. Using these headers allows saving some
CPU cycles/memory accesses on the fast path.
However, this approach makes adding L2 header for IPv4 traffic with IPv6
nexthops more complex, as it is not possible to store multiple
pre-calculated headers inside lle. Additionally, the solution space is
limited by the fact that PCB caching saves LLEs in addition to the nexthop.
Thus, add support for creating special "child" LLEs for the purpose of holding
custom family encaps and store mbufs pending resolution. To simplify handling
of those LLEs, store them in a linked-list inside a "parent" (e.g. normal) LLE.
Such LLEs are not visible when iterating LLE table. Their lifecycle is bound
to the "parent" LLE - it is not possible to delete "child" when parent is alive.
Furthermore, "child" LLEs are static (RTF_STATIC), avoding complex state
machine used by the standard LLEs.
nd6_lookup() and nd6_resolve() now accepts an additional argument, family,
allowing to return such child LLEs. This change uses `LLE_SF()` macro which
packs family and flags in a single int field. This is done to simplify merging
back to stable/. Once this code lands, most of the cases will be converted to
use a dedicated `family` parameter.
Differential Revision: https://reviews.freebsd.org/D31379
(cherry picked from commit c541bd368f)
Consistently use `nh` instead of always dereferencing
ro->ro_nh inside the if block.
Always use nexthop mtu, as it provides guarantee that mtu is accurate.
Pass `nh` pointer to rt_update_ro_flags() to allow upcoming uses
of updating ro flags based on different nexthop.
Differential Revision: https://reviews.freebsd.org/D31451
Reviewed by: kp
(cherry picked from commit 9748eb7427)
Use newly-create llentry_request_feedback(),
llentry_mark_used() and llentry_get_hittime() to
request datapatch usage check and fetch the results
in the same fashion both in IPv4 and IPv6.
While here, simplify llentry_provide_feedback() wrapper
by eliminating 1 condition check.
Differential Revision: https://reviews.freebsd.org/D31390
(cherry picked from commit f3a3b06121)
SCTP needs to avoid binding a given socket twice. The check used to
avoid this is racy since neither the inpcb lock nor the global info lock
is held. Fix it by synchronizing using the global info lock. In
particular, sctp_inpcb_bind() may drop the inpcb lock in some cases, but
the info lock is sufficient to prevent double insertion into PCB hash
tables.
Reported by: syzbot+548a8560d959669d0e12@syzkaller.appspotmail.com
Reviewed by: tuexen
Sponsored by: The FreeBSD Foundation
(cherry picked from commit 4a36122b1d)
Eliminate a flag variable and reduce indentation. No functional change
intended.
Reviewed by: tuexen
Sponsored by: The FreeBSD Foundation
(cherry picked from commit 2496d812a9)