mld_domifattach() does a memory allocation under the global MLD mutex
and so can fail, but no error handling prevents a null pointer
dereference in this case. The mutex is only needed when updating the
global softc list; the allocation and static initialization of the softc
do not require this mutex. So, reduce the scope of the mutex and use
M_WAITOK for the allocation.
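
The narrowed lock scope can be illustrated with a minimal userspace sketch. It is not the kernel code: `struct softc`, `softc_attach()`, `softc_count()` and the pthread mutex are hypothetical stand-ins, and calloc() can still fail here, whereas an M_WAITOK allocation sleeps until memory is available.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for a per-interface softc; not the kernel's struct. */
struct softc {
	struct softc *next;
	int ifindex;
};

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static struct softc *softc_list;	/* protected by list_lock */

/*
 * Allocate and initialize without the lock, then take the lock only to
 * publish the new entry on the shared list.
 */
struct softc *
softc_attach(int ifindex)
{
	struct softc *sc;

	assert(ifindex >= 0);
	/* Userspace calloc() can fail; the kernel's M_WAITOK sleeps instead. */
	sc = calloc(1, sizeof(*sc));
	if (sc == NULL)
		return (NULL);
	sc->ifindex = ifindex;

	pthread_mutex_lock(&list_lock);
	sc->next = softc_list;
	softc_list = sc;
	pthread_mutex_unlock(&list_lock);
	return (sc);
}

/* Count the entries on the list, holding the lock for the traversal. */
int
softc_count(void)
{
	const struct softc *sc;
	int n = 0;

	pthread_mutex_lock(&list_lock);
	for (sc = softc_list; sc != NULL; sc = sc->next)
		n++;
	pthread_mutex_unlock(&list_lock);
	return (n);
}
```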
PR: 261457
Sponsored by: The FreeBSD Foundation
(cherry picked from commit 5d691ab4f0)
For IPv4, use the dst pointer as the destination address in fib4_lookup().
It holds the destination address from the IPv4 header and can be changed
when the PACKET_TAG_IPFORWARD tag has been set by a packet filter.
For IPv6, override the destination address with the address from
dst_sa.sin6_addr, which was set from the PACKET_TAG_IPFORWARD tag.
Reviewed by: eugen
PR: 256828, 261697, 255705
Differential Revision: https://reviews.freebsd.org/D34732
(cherry picked from commit 7d98cc096b)
Also convert raw epoch_call() calls to lltable_free_entry() calls, no
functional change intended. There's no need to asynchronously free the
LLEs in that case to begin with, but we might as well use the lltable
interfaces consistently.
Noticed by code inspection; I believe lltable_calc_llheader() failures
do not generally happen in practice.
Reviewed by: bz
Sponsored by: The FreeBSD Foundation
(cherry picked from commit 990a6d18b0)
Historically, lltable_try_set_entry_addr() would release the LLE lock
upon failure. After some refactoring, it no longer does so, but
consumers were not adjusted accordingly.
Also fix a leak that can occur if lltable_calc_llheader() fails in the
ARP code, but I suspect that such a failure can only occur due to a code
bug.
Reviewed by: bz, melifaro
Reported by: pho
Fixes: 0b79b007eb ("[lltable] Restructure nd6 code.")
Sponsored by: The FreeBSD Foundation
(cherry picked from commit dd91d84486)
Introduce a new function, lltable_get(), to retrieve the lltable pointer
for the specified interface and family.
Use it to avoid all-iftable list traversal when adding or deleting
ARP/ND records.
Differential Revision: https://reviews.freebsd.org/D33660
MFC after: 2 weeks
(cherry picked from commit ff3a85d324)
Enter the net epoch before calling ip6_setpktopts
ip6_setpktopts() can look up ifnets via ifnet_by_index(), which
is only safe in the net epoch. Ensure that callers are in the net
epoch before calling this function.
Sponsored by: Dell EMC Isilon
MFC after: 4 weeks
Reviewed by: donner, kp
Differential Revision: https://reviews.freebsd.org/D30630
(cherry picked from commit 2290dfb40f)
Allow the resending of DATA chunks to be controlled by the caller,
which allows retiring sctp_mtu_size_reset() in a separate commit.
Also improve the computation of the overhead and use 32-bit integers
consistently.
Thanks to Timo Voelker for pointing me to the code.
(cherry picked from commit 2de2ae331b)
All callers of sctp_aloc_assoc() mark the PCB as connected after a
successful call (for one-to-one-style sockets). In all cases this is
done without the PCB lock, so the PCB's flags can be corrupted. We also
do not atomically check whether a one-to-one-style socket is a listening
socket, which violates various assumptions in solisten_proto().
We need to hold the PCB lock across all of sctp_aloc_assoc() to fix
this. In order to do that without introducing lock order reversals, we
have to hold the global info lock as well.
So:
- Convert sctp_aloc_assoc() so that the inp and info locks are
consistently held. It returns with the association lock held, as
before.
- Fix an apparent bug where we failed to remove an association from a
global hash if sctp_add_remote_addr() fails.
- sctp_select_a_tag() is called when initializing an association, and it
acquires the global info lock. To avoid lock recursion, push locking
into its callers.
- Introduce sctp_aloc_assoc_connected(), which atomically checks for a
listening socket and sets SCTP_PCB_FLAGS_CONNECTED.
There is still one edge case in sctp_process_cookie_new() where we do
not update PCB/socket state correctly.
Reviewed by: tuexen
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31908
(cherry picked from commit 2d5c48eccd)
Fix getsockopt() for the IPPROTO_IPV6 level socket options with the
following names: IPV6_HOPOPTS, IPV6_RTHDR, IPV6_RTHDRDSTOPTS,
IPV6_DSTOPTS, and IPV6_NEXTHOP.
Reviewed by: markj
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D31458
(cherry picked from commit a8d54fc903)
When ip_output_send() returns EAGAIN due to issues with send tags (route
change, lagg failover, etc), it must free the mbuf. This is because
ip_output_send() was written as a wrapper/replacement for a direct
call to if_output(), and the contract with if_output() has
historically been that it owns the mbufs once called. When
ip_output_send() failed to free mbufs, it violated this assumption
and led to leaked mbufs.
This was noticed when using NIC TLS in combination with hardware
rate-limited connections. When lots of NIC output drops
triggered ratelimit send tag changes, we noticed we were leaking
ktls_sessions, send tags and mbufs. This was due to ip_output_send()
leaking mbufs which held references to ktls_sessions, which in
turn held references to send tags.
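
The ownership contract described above can be sketched in userspace C. Everything here is a hypothetical stand-in: `struct buf` plays the role of an mbuf, `output_send()` the role of the wrapper, and the `tag_stale` parameter models a send tag that is no longer valid. The point is only that the wrapper consumes the buffer on the error path as well as on success.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for an mbuf. */
struct buf {
	int len;
};

static int live_bufs;	/* buffers allocated and not yet freed */

struct buf *
buf_alloc(int len)
{
	struct buf *b;

	b = malloc(sizeof(*b));
	if (b != NULL) {
		b->len = len;
		live_bufs++;
	}
	return (b);
}

void
buf_free(struct buf *b)
{
	assert(b != NULL);
	free(b);
	live_bufs--;
}

/* Hypothetical lower layer: it owns b once called and always frees it. */
static int
lower_output(struct buf *b)
{
	buf_free(b);	/* a real driver would transmit before freeing */
	return (0);
}

/*
 * Wrapper with the same contract as the layer it wraps: the caller gives
 * up b on every path, so the early error return must free it too.
 */
int
output_send(struct buf *b, int tag_stale)
{
	if (tag_stale) {
		buf_free(b);	/* without this the buffer would leak */
		return (EAGAIN);
	}
	return (lower_output(b));
}

int
live_buf_count(void)
{
	return (live_bufs);
}
```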
Many thanks to jhb, rrs, hselasky and markj for their help in
debugging this.
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D34054
Reviewed by: hselasky, jhb, rrs
MFC after: 2 weeks
(cherry picked from commit 9ba117960e)
When sending an NS, check if we are using an IPv6 CARP address
and if we are, then put the proper CARP link level address into the
ND_OPT_SOURCE_LINKADDR option and also put a PACKET_TAG_CARP tag
on the packet. The latter will enforce the CARP link level address
at the data link layer too, which might be necessary for broken
implementations.
The code really follows what the NA sending code has been doing since
the introduction of carp(4). While here, bring the whole block of code
in line with style(9).
PR: 193280
Differential revision: https://reviews.freebsd.org/D33858
(cherry picked from commit bc6abdd97e)
in_cksum_skip() now handles unmapped mbufs on platforms where they're
permitted.
Reviewed by: glebius, jhb
Sponsored by: The FreeBSD Foundation
(cherry picked from commit 44775b163b)
Previously in_pcbbind_setup returned EADDRNOTAVAIL for empty
V_in_ifaddrhead (i.e., no IPv4 addresses configured) and in6_pcbbind
did the same for empty V_in6_ifaddrhead (no IPv6 addresses).
An equivalent test has existed since 4.4-Lite. It was presumably done
to avoid extra work (assuming the address isn't going to be found
later).
In normal system operation *_ifaddrhead will not be empty: they will
at least have the loopback address(es). In practice no work will be
avoided.
Further, this case caused net/dhcpd to fail when run early in boot
before assignment of any addresses. It should be possible to bind the
unspecified address even if no addresses have been configured yet, so
just remove the tests.
The now-removed "XXX broken" comments were added in 59562606b9,
which converted the ifaddr lists to TAILQs. As far as I (emaste) can
tell the brokenness is the issue described above, not some aspect of
the TAILQ conversion.
PR: 253166
Reviewed by: ae, bz, donner, emaste, glebius
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D32563
(cherry picked from commit 5c5340108e)
Interface addresses with pending duplicate address detection (DAD) live
in a global queue. In this case, a callout is associated with each
entry. The callout transmits neighbour solicitations until the system
decides the address is no longer tentative, or until a duplicate address
is discovered. At this point the entry is dequeued and freed. DAD may
be manually stopped as well.
The callout currently runs (and potentially transmits packets) with
Giant held. Reorganize DAD queue locking to interlock properly with the
callout:
- Configure the callout to acquire the DAD queue lock before running.
The lock is dropped before transmitting any packets. Stop protecting
the callout with Giant.
- When looking up DAD queue entries for an incoming NS or NA, don't
bother fiddling with the DAD queue entry reference count.
- Split nd6_dad_starttimer() so that the caller is responsible for
transmitting an NS if it so desires.
- Remove the DAD entry from the queue before stopping the timer. Use a
temporary reference to make sure that the entry doesn't get freed by
the callout while we're draining.
Reported by: mav
Reviewed by: bz, hrs
Sponsored by: The FreeBSD Foundation
(cherry picked from commit 9a94097cd0)
- Protect the `expire_upcalls` callout with the MFC6 mutex. The callout
handler needs this mutex anyway.
- Convert the MROUTER6 mutex to a sleepable sx lock. It is only used
when configuring the global v6 multicast routing socket, so is only
used in system call paths where sleeping is safe. This lets us drain
the callout without having to drop the lock.
- For all locking macros in the file, convert to using a _LOCKPTR macro.
Reported by: mav
Sponsored by: The FreeBSD Foundation
(cherry picked from commit 353783964c)
Currently we use pre-calculated headers inside LLE entries as prepend data
for `if_output` functions. Using these headers allows saving some
CPU cycles/memory accesses on the fast path.
However, this approach makes adding an L2 header for IPv4 traffic with IPv6
nexthops more complex, as it is not possible to store multiple
pre-calculated headers inside an LLE. Additionally, the solution space is
limited by the fact that PCB caching saves LLEs in addition to the nexthop.
Thus, add support for creating special "child" LLEs for the purpose of holding
custom family encaps and storing mbufs pending resolution. To simplify handling
of those LLEs, store them in a linked list inside a "parent" (e.g. normal) LLE.
Such LLEs are not visible when iterating over the LLE table. Their lifecycle is
bound to the "parent" LLE: it is not possible to delete a "child" while its
parent is alive. Furthermore, "child" LLEs are static (RTF_STATIC), avoiding the
complex state machine used by the standard LLEs.
nd6_lookup() and nd6_resolve() now accept an additional argument, family,
allowing them to return such child LLEs. This change uses the `LLE_SF()` macro, which
packs family and flags in a single int field. This is done to simplify merging
back to stable/. Once this code lands, most of the cases will be converted to
use a dedicated `family` parameter.
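
Packing a family and flags into one int can be sketched as follows. The bit layout (flags in the low 16 bits, family above them) and the names `sf_pack()`, `sf_family()` and `sf_flags()` are assumptions made for illustration; they are not taken from the commit.

```c
#include <assert.h>

#define	SF_FLAGS_MASK	0xffffu	/* assumed: flags occupy the low 16 bits */
#define	SF_FAMILY_MASK	0xffu	/* assumed: family fits in 8 bits */

/* Combine an address family and a flags word into one value. */
unsigned int
sf_pack(unsigned int family, unsigned int flags)
{
	assert(family <= SF_FAMILY_MASK && flags <= SF_FLAGS_MASK);
	return ((family << 16) | flags);
}

/* Recover the family from a packed value. */
unsigned int
sf_family(unsigned int sf)
{
	return ((sf >> 16) & SF_FAMILY_MASK);
}

/* Recover the flags from a packed value. */
unsigned int
sf_flags(unsigned int sf)
{
	return (sf & SF_FLAGS_MASK);
}
```

With this layout a packed value whose family is 0 is numerically equal to the bare flags word, so a caller that passes only flags still produces a valid argument.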
Differential Revision: https://reviews.freebsd.org/D31379
(cherry picked from commit c541bd368f)
Factor out lltable locking logic from lltable_try_set_entry_addr()
into a separate lltable_acquire_wlock(), so the latter can be used
in other parts of the code w/o duplication.
Create nd6_try_set_entry_addr() to avoid code duplication in nd6.c
and nd6_nbr.c.
Move lle creation logic from nd6_resolve_slow() into a separate
nd6_get_llentry() to simplify the former.
These changes serve as a pre-requisite for implementing
RFC8950 (IPv4 prefixes with IPv6 nexthops).
Differential Revision: https://reviews.freebsd.org/D31432
(cherry picked from commit 0b79b007eb)
Use the newly created llentry_request_feedback(),
llentry_mark_used() and llentry_get_hittime() to
request a datapath usage check and fetch the results
in the same fashion in both IPv4 and IPv6.
While here, simplify the llentry_provide_feedback() wrapper
by eliminating one condition check.
Differential Revision: https://reviews.freebsd.org/D31390
(cherry picked from commit f3a3b06121)
The keyword adds nothing as all operations on the var are performed
through atomic_*
Reviewed by: kp
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D31528
(cherry picked from commit c17ae18080)
The use of Giant here is vestigial and does not provide any useful
synchronization. Furthermore, non-MPSAFE callouts can cause the
softclock threads to block waiting for long-running newbus operations to
complete.
Reported by: mav
Reviewed by: bz
Sponsored by: The FreeBSD Foundation
(cherry picked from commit 663428ea17)
Add a check that ifp supports IPv6 multicast in in6_getmulti.
This fixes a panic when a user application tries to join a multicast
group on an interface that doesn't support IPv6 multicast, like
IFT_PFLOG interfaces.
PR: 257302
Reviewed by: melifaro
Differential Revision: https://reviews.freebsd.org/D31420
(cherry picked from commit d477a7feed)
SO_RERROR indicates that receive buffer overflows should be handled as
errors. Historically receive buffer overflows have been ignored and
programs could not tell if they missed messages or messages had been
truncated because of overflows. Since programs historically do not
expect to get receive overflow errors, this behavior is not the
default.
This is really really important for programs that use route(4) to keep
in sync with the system. If we lose a message then we need to reload
the full system state, otherwise the behaviour from that point is
undefined and can lead to chasing bogus bug reports.
Reviewed by: philip (network), kbowling (transport), gbe (manpages)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D26652
(cherry picked from commit 7045b1603b)
The various protocol implementations are not very consistent about
freeing mbufs in error paths. In general, all protocols must free both
"m" and "control" upon an error, except if PRUS_NOTREADY is specified
(this is only implemented by TCP and unix(4) and requires further work
not handled in this diff), in which case "control" still must be freed.
This diff plugs various leaks in the pru_send implementations.
Reviewed by: tuexen
Sponsored by: The FreeBSD Foundation
(cherry picked from commit d8acd2681b)
Commit 81728a538 ("Split rtinit() into multiple functions.") removed
the initialization of sa6, but not one of its uses. This meant that we
were passing an uninitialized sockaddr as the address to
lltable_prefix_free(). Remove the variable outright to fix the problem.
The caller is expected to hold a reference on pr.
Fixes: 81728a538 ("Split rtinit() into multiple functions.")
Reported by: KMSAN
Reviewed by: donner
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30166
(cherry picked from commit c1dd4d642f)
Several protocol methods take a sockaddr as input. In some cases the
sockaddr lengths were not being validated, or were validated after some
out-of-bounds accesses could occur. Add requisite checking to various
protocol entry points, and convert some existing checks to assertions
where appropriate.
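
A minimal userspace sketch of this kind of check, for an IPv6 address. The function name `sockaddr_in6_check()` and the choice of errno values are assumptions, not the kernel's code. The length is validated before any field beyond the generic header is trusted.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* A sockaddr_in6 must be at least as large as the generic header. */
static_assert(sizeof(struct sockaddr_in6) >= sizeof(struct sockaddr),
    "sockaddr_in6 smaller than sockaddr");

/*
 * Validate a caller-supplied address before it is cast to
 * struct sockaddr_in6. Returns 0 if it is safe to use, else an errno.
 */
int
sockaddr_in6_check(const struct sockaddr *sa, socklen_t len)
{
	if (sa == NULL)
		return (EINVAL);
	/* Reject short buffers before reading any sockaddr_in6 field. */
	if (len < sizeof(struct sockaddr_in6))
		return (EINVAL);
	if (sa->sa_family != AF_INET6)
		return (EAFNOSUPPORT);
	return (0);
}
```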
Reported by: syzkaller+KASAN
Reviewed by: tuexen, melifaro
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29519
(cherry picked from commit f161d294b9)
Distinguish between truly invalid requests and those that fail because
we've already joined the group. Both cases fail, but differentiating
them allows userspace to make more informed decisions about what the
error means.
For example, radvd tries to join the all-routers group on every SIGHUP.
This fails, because it's already joined it, but this failure should be
ignored (rather than treated as a sign that the interface's multicast is
broken).
This puts us in line with OpenBSD, NetBSD and Linux.
Reviewed by: donner
MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D30111
(cherry picked from commit 2ef5d803e3)
Introduce convenience macros to retrieve the DSCP, ECN or traffic class
bits from an IPv6 header.
Use them where appropriate.
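
What such accessors compute can be sketched in userspace C. The function names below are hypothetical and are not the macros this change introduces. The input is the first 32-bit word of the IPv6 header in network byte order: version (4 bits), traffic class (8 bits), flow label (20 bits). This sketch returns the DSCP as a 6-bit codepoint, which is a choice made here for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <arpa/inet.h>

/* Traffic class: the 8 bits that follow the 4-bit version field. */
uint8_t
ip6_flow_tclass(uint32_t flow_be)
{
	uint32_t flow = ntohl(flow_be);

	assert((flow >> 28) == 6);	/* expects an IPv6 header word */
	return ((uint8_t)((flow >> 20) & 0xff));
}

/* DSCP: the upper 6 bits of the traffic class, as a 6-bit codepoint. */
uint8_t
ip6_flow_dscp(uint32_t flow_be)
{
	return (ip6_flow_tclass(flow_be) >> 2);
}

/* ECN: the lower 2 bits of the traffic class. */
uint8_t
ip6_flow_ecn(uint32_t flow_be)
{
	return (ip6_flow_tclass(flow_be) & 0x03);
}
```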
Reviewed by: ae (previous version), rscheff, tuexen, rgrimes
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D29056
(cherry picked from commit bb4a7d94b9)
Summary:
This fixes an rtentry leak for cloned interfaces created inside a
VNET.
Loopback teardown order is `SI_SUB_INIT_IF`, which happens after `SI_SUB_PROTO_DOMAIN` (route table teardown).
Thus, any route table operations are too late to schedule.
As the intent of the vnet teardown procedures is to minimise the amount of effort by doing global cleanups instead of per-interface ones, address this by adding a relatively light-weight routing table cleanup function, `rib_flush_routes()`.
It removes all remaining routes from the routing table and schedules the deletion, which will happen later, when `rtables_destroy()` waits for the current epoch to finish.
Test Plan:
```
set_skip:set_skip_group_lo -> passed [0.053s]
tail -n 200 /var/log/messages | grep rtentry
```
PR: 253998
Reported by: rashey at superbox.pl
Reviewed By: kp
Differential Revision: https://reviews.freebsd.org/D29116
(cherry picked from commit b1d63265ac)
A P2P ifa may require two routes: one is the loopback route, the other is
the "prefix" route towards its destination.
The current code marks the existence of loopback routes with IFA_RTSELF and
of "prefix" p2p routes with IFA_ROUTE.
For historic reasons, we fill in ifa_dstaddr for loopback interfaces.
To avoid installing the same route twice, we preemptively set
IFA_RTSELF when adding "prefix" route for loopback.
However, the teardown part doesn't have this hack, so we try to
remove the same route twice.
Fix this by checking if ifa_dstaddr is different from the ifa_addr
and moving this logic into a separate function.
Reviewed By: kp
Differential Revision: https://reviews.freebsd.org/D29121
(cherry picked from commit 7634919e15)
in6_selectsrc() may call fib6_lookup() in some cases, which requires
epoch. Wrap in6_selectsrc* calls into epoch inside its users.
Mark it as requiring epoch by adding NET_EPOCH_ASSERT().
Differential Revision: https://reviews.freebsd.org/D28647
(cherry picked from commit 605284b894)
The current preference numbers were copied from the IPv4 code,
assuming 500k routes to be the full view. Adjust to the current
reality (100k full view).
Reported by: Marek Zarychta <zarychtam at plan-b.pwste.edu.pl>
(cherry picked from commit d5be41beb7)
Currently ip6_input() calls in6ifa_ifwithaddr() for
every local packet, in order to check if the target ip
belongs to the local ifa in proper state and increase
its counters.
in6ifa_ifwithaddr() takes a reference on the ifa it finds.
With epoch changes, both `ip6_input()` and all other current callers
of `in6ifa_ifwithaddr()` do not need this reference
anymore, as epoch provides stability guarantee.
Given that, update `in6ifa_ifwithaddr()` to allow
it to return an ifa without referencing it, while preserving
the option of getting a referenced ifa if so desired.
Differential Revision: https://reviews.freebsd.org/D28648
(cherry picked from commit 8268d82cff)
The only place where in6_ifawithifp() is used is ip6_output(),
which uses the returned ifa to bump traffic counters.
Given that ifa stability guarantees are provided by epoch, do not refcount ifa.
This eliminates 2 atomic ops from IPv6 fast path.
Reviewed By: rstone
Differential Revision: https://reviews.freebsd.org/D28649
(cherry picked from commit 1bd44b11e5)
When tearing down vnet jails we can move an if_bridge out (as
part of the normal vnet_if_return()). This can, when it's clearing out
its list of member interfaces, change its link layer address.
That sends an iflladdr_event, but at that point we've already freed the
AF_INET/AF_INET6 if_afdata pointers.
In other words: when the iflladdr_event callbacks fire we can't assume
that ifp->if_afdata[AF_INET] will be set.
Reviewed by: donner@, melifaro@
MFC after: 1 week
Sponsored by: Orange Business Services
Differential Revision: https://reviews.freebsd.org/D28860
(cherry picked from commit c139b3c19b)
The lookup of the IPv6 multicast addresses corresponding to
the destination address in the datagram is protected by the
NET_EPOCH section. Access to each PCB is protected by INP_RLOCK
during the comparison, but access to the socket's so_options field is
not protected. In some cases it is possible that the PCB
pointer is still valid but inp_socket is not. The patch widens
the lock scope to protect access to inp_socket. It copies the locking
strategy from IPv4 UDP handling.
PR: 232192
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D28232
(cherry picked from commit 3c782d9c91)
We need to make sure that the m_nextpkt field is NULL,
else the lower layers may do unwanted things.
Reviewed By: gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D28377
(cherry picked from commit 24a8f6d369)
Originally IFCAP_NOMAP meant that the mbuf has an external storage pointer
that points to an unmapped address. Then, this was extended to an array of
such pointers. Then, such mbufs were augmented with a header/trailer.
Basically, extended mbufs are extended, and the set of features is subject
to change. The new name should be generic enough to avoid further
renaming.
(cherry picked from commit 3f43ada98c)