137359 commits

Author SHA1 Message Date
Mark Johnston
90ffac35b7 eli: Zero pad bytes that arise when certain auth algorithms are used
When authentication is configured, GELI ensures that the amount of data
per sector is a multiple of 16 bytes.  This is done in
eli_metadata_softc().  When the digest size is not a multiple of 16
bytes, this leaves some extra pad bytes at the end of every sector, and
they were not being zeroed before being written to disk.  In particular,
this happens with the HMAC/SHA1, HMAC/RIPEMD160 and HMAC/SHA384 data
authentication algorithms.

This change ensures that they are zeroed before being written to disk.

Reported by:	KMSAN
Reviewed by:	delphij, asomers
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 0fcafe8516)
2021-07-29 08:12:22 -04:00
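The padding issue described in 90ffac35b7 can be sketched in a few lines of C. This is only an illustration under stated assumptions: the names `sector_layout`, `compute_layout()` and `zero_pad()` are hypothetical and are not GELI's actual code, and the layout (digest, then 16-byte-aligned payload, then pad at the end of the sector) is a simplification of what the commit message describes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-sector layout; not GELI's actual structure. */
struct sector_layout {
	size_t data_len;	/* payload bytes, always a multiple of 16 */
	size_t pad_len;		/* leftover bytes at the end of the sector */
};

/* Split a sector into digest, 16-byte-aligned payload and trailing pad. */
static struct sector_layout
compute_layout(size_t sectorsize, size_t digest_len)
{
	struct sector_layout l;
	size_t avail;

	assert(digest_len < sectorsize);
	avail = sectorsize - digest_len;
	l.data_len = avail - (avail % 16);
	l.pad_len = avail - l.data_len;
	return (l);
}

/* Zero the trailing pad so that no stale memory reaches the disk. */
static void
zero_pad(unsigned char *sector, size_t sectorsize, size_t pad_len)
{
	assert(sector != NULL && pad_len <= sectorsize);
	memset(sector + sectorsize - pad_len, 0, pad_len);
}
```

Under these assumptions, a 512-byte sector with a 20-byte digest leaves 480 payload bytes and 12 pad bytes, while a 32-byte digest leaves no pad at all.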
Mark Johnston
822c62b7db Assert that valid PTEs are not overwritten when installing a new PTP
amd64 and 32-bit ARM already had assertions to this effect.  Add them to
other pmaps.

Reviewed by:	alc, kib
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit b092c58c00)
2021-07-29 08:12:12 -04:00
Mark Johnston
9f43633dd0 pf: Constify tag name and queue name helper functions
No functional change intended.

Reviewed by:	kp
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 81f95106b8)
2021-07-29 08:12:01 -04:00
Yang Zhong
50b26566c7 mmc: Drain the intrhook in mmc_detach()
Buggy SD card drivers may attach and detach a mmc(4) driver instance in
quick succession.  In this case mmc(4) must disestablish its intrhook
callback during detach.  Thus, this change adds a call to
config_intrhook_drain(), which blocks if the intrhook is running, does
nothing if it has already run (the SD card was plugged in), and
disestablishes the hook if it hasn't run yet (the SD card was not
plugged in).

PR:		254373
Reviewed by:	imp, manu, markj
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit d5341d72a1)
2021-07-29 08:11:50 -04:00
Mark Johnston
f9d3c6f4b5 nfsclient: Avoid copying uninitialized bytes into statfs
hst will be nul-terminated but the remaining space in the buffer is left
uninitialized.  Avoid copying the entire buffer to ensure that
uninitialized bytes are not leaked via statfs(2).

Reported by:	KMSAN
Reviewed by:	rmacklem
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 44de1834b5)
2021-07-29 08:11:16 -04:00
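The idea behind f9d3c6f4b5 (copy only the NUL-terminated string, not the whole source buffer) can be illustrated as follows. This is a minimal sketch, not the nfsclient code: `copy_terminated()` is a hypothetical helper, and zero-filling the destination is an assumption made here so that the result is fully initialized.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy at most the string portion of src (up to its first NUL) into dst,
 * and zero the rest of dst.  Bytes of src after the NUL are never read
 * into dst, so uninitialized data cannot leak.  Returns the string length
 * actually stored.
 */
static size_t
copy_terminated(char *dst, size_t dstsz, const char *src, size_t srcsz)
{
	const char *nul;
	size_t len;

	assert(dst != NULL && src != NULL && dstsz > 0);
	nul = memchr(src, '\0', srcsz);
	len = (nul != NULL) ? (size_t)(nul - src) : srcsz;
	if (len >= dstsz)
		len = dstsz - 1;	/* always leave room for the NUL */
	memset(dst, 0, dstsz);
	memcpy(dst, src, len);
	return (len);
}
```

A plain memcpy() of the full source buffer would instead carry whatever happened to follow the terminator into the destination.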
Mark Johnston
e504e98ab5 arm64: Print CPU features slightly earlier
In particular, print them before we release APs.  Otherwise they tend to
get mixed with other kernel messages.

Reviewed by:	andrew, manu
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit fa46a46a82)
2021-07-28 08:40:25 -04:00
Alan Somers
c7fe00ae40 fusefs: correctly set lock owner during FUSE_SETLK
During FUSE_SETLK, the owner field should uniquely identify the calling
process.  The fusefs module now sets it to the process's pid.
Previously, it expected the calling process to set it directly, which
was wrong.

libfuse also apparently expects the owner field to be set during
FUSE_GETLK, though I'm not sure why.

PR:		256005
Reported by:	Agata <chogata@moosefs.pro>
Reviewed by:	pfg
Differential Revision: https://reviews.freebsd.org/D30622

(cherry picked from commit 18b19f8c6e)
2021-07-27 11:44:28 -06:00
Kristof Provost
7d226e964a pf: clean up syncookie callout on vnet shutdown
Ensure that we cancel any outstanding callouts for syncookies when we
terminate the vnet.

MFC after:	1 week
Sponsored by:	Modirum MDPay

(cherry picked from commit 32271c4d38)
2021-07-27 09:45:41 +02:00
Kristof Provost
00b005036b pf: remove stray debug line
MFC after:	1 week
Sponsored by:	Modirum MDPay

(cherry picked from commit 84db87b8da)
2021-07-27 09:45:31 +02:00
Kristof Provost
eae9481e86 pf: fix LINT build
We failed to list the new pf_syncookies.c file in sys/conf/files. This
worked for the usual configurations, where pf is a module, but not for
LINT builds.

Reported by:	lwhsu
MFC after:	1 week
Sponsored by:	Modirum MDPay

(cherry picked from commit b972a7fa9e)
2021-07-27 09:45:23 +02:00
Kristof Provost
2987a3643b pf: syncookie ioctl interface
Kernel side implementation to allow switching between on and off modes,
and allow this configuration to be retrieved.

MFC after:	1 week
Sponsored by:	Modirum MDPay
Differential Revision:	https://reviews.freebsd.org/D31139

(cherry picked from commit 231e83d342)
2021-07-27 09:42:52 +02:00
Kristof Provost
c3d03672e1 pf: syncookie support
Import OpenBSD's syncookie support for pf. This feature helps pf resist
TCP SYN floods by only creating states once the remote host completes
the TCP handshake rather than when the initial SYN packet is received.

This is accomplished by using the initial sequence numbers to encode a
cookie (hence the name) in the SYN+ACK response and verifying this on
receipt of the client ACK.

Reviewed by:	kbowling
Obtained from:	OpenBSD
MFC after:	1 week
Sponsored by:	Modirum MDPay
Differential Revision:	https://reviews.freebsd.org/D31138

(cherry picked from commit 8e1864ed07)
2021-07-27 09:42:25 +02:00
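The cookie mechanism described in c3d03672e1 can be sketched as below. This is a toy illustration, not pf's implementation: `conn_tuple`, `toy_keyed_hash()`, `make_cookie()` and `check_cookie()` are hypothetical names, the mixing function is a stand-in for a real keyed hash and is not secure, and real syncookies typically also encode TCP options, which is omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct conn_tuple {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
};

/* Toy mixing function standing in for a real keyed hash; NOT secure. */
static uint32_t
toy_keyed_hash(uint32_t secret, const struct conn_tuple *t, uint32_t client_isn)
{
	uint32_t words[4];
	uint32_t h = secret ^ 0x9e3779b9u;
	size_t i;

	assert(t != NULL);
	words[0] = t->saddr;
	words[1] = t->daddr;
	words[2] = ((uint32_t)t->sport << 16) | t->dport;
	words[3] = client_isn;
	for (i = 0; i < 4; i++) {
		h ^= words[i];
		h *= 0x01000193u;
		h ^= h >> 15;
	}
	return (h);
}

/* Server ISN to place in the SYN+ACK; no per-connection state is stored. */
static uint32_t
make_cookie(uint32_t secret, const struct conn_tuple *t, uint32_t client_isn)
{
	return (toy_keyed_hash(secret, t, client_isn));
}

/*
 * Validate the client's final ACK, which carries seq = client ISN + 1 and
 * ack = server ISN + 1.  Unsigned arithmetic wraps like TCP sequence numbers.
 */
static bool
check_cookie(uint32_t secret, const struct conn_tuple *t, uint32_t seq,
    uint32_t ack)
{
	return (make_cookie(secret, t, seq - 1) == ack - 1);
}
```

Because the cookie can be recomputed from the ACK alone, the server only needs to create state once the handshake has completed.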
Kristof Provost
0df576d98e pf: factor out pf_synproxy()
MFC after:	1 week
Sponsored by:	Modirum MDPay
Differential Revision:	https://reviews.freebsd.org/D31137

(cherry picked from commit ee9c3d3803)
2021-07-27 09:42:13 +02:00
Mark Johnston
82b475c654 gmirror: Zero the metadata block before writing
The mirror metadata fields contain string buffers and pad bytes, neither
of which was being zeroed before metadata was written to disk.  Also, the
metadata structure is smaller than the sector size, and in one case
gmirror was failing to zero-fill the full buffer before writing.

Fix these problems by pre-zeroing the metadata structure and the sector
buffer.

Reported by:	KMSAN
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 7f053a44ae)
2021-07-26 21:47:45 -04:00
Mark Johnston
f065d0bb29 blist: Correct the node count computed in blist_create()
Commit bb4a27f927 added the ability to allocate a span of blocks
crossing a meta node boundary.  To ensure that blst_next_leaf_alloc()
does not walk past the end of the tree, an extra all-zero meta node
needs to be present at the end of the allocation, and
blst_next_leaf_alloc() is implemented such that the presence of this
node terminates the search.

blist_create() computes the number of nodes required.  It had two
problems:
1. When the size of the blist is a power of BLIST_RADIX, we would
   unnecessarily allocate an extra level in the tree.
2. When the size of the blist is a multiple of BLIST_RADIX, we would
   fail to allocate a terminator node.  In this case,
   blst_next_leaf_alloc() could scan beyond the bounds of the
   allocation.  This was found using KASAN.

Modify blist_create() to handle these cases correctly.

Reported by:	pho
Reviewed by:	dougm

(cherry picked from commit 2783335cae)
2021-07-26 21:47:20 -04:00
Mark Johnston
a938bfca7a graid3: Zero the metadata block before writing
Ensure that string buffers and pad bytes are zero-filled before writing
graid3 metadata.

Reported by:	KMSAN
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 39552dff7b)
2021-07-26 21:47:12 -04:00
Mark Johnston
b88a1996e2 fifo: Explicitly initialize generation numbers when opening
The fi_rgen and fi_wgen fields are generation numbers used when sleeping
waiting for the other end of the fifo to be opened.  The fields were not
explicitly initialized after allocation, but this was harmless.  To
avoid false positives from KMSAN, though, ensure that they get
initialized to zero.

Reported by:	KMSAN
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit b9ca419a21)
2021-07-26 21:46:45 -04:00
Mark Johnston
9788041842 uart: Fix an out-of-bounds read in ns8250_bus_probe()
The problem is that ns8250_bus_probe() accesses a field from the
ns8250_softc, which embeds the generic UART softc, but the ns8250_softc
hasn't yet been allocated because we're still probing.

This is a regression from commit 0aefb0a63c.  This fixed a problem
where one of the upper four IER bits, which are usually reserved, needs
to be set in order to get RX interrupts before the RX FIFO is full.  At
the same time, we avoid clearing those reserved bits (see commit
58957d8717, though other UART drivers I looked at do not bother with
this).

So, copy what ns8250_init() does to disable interrupts, since we don't
know what the "right" mask is at this point.

Reported by:	syzbot+f256beefd0df9eb796e7@syzkaller.appspotmail.com
Reviewed by:	imp
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 4a9a41650c)
2021-07-26 21:46:34 -04:00
Navdeep Parhar
6e405dd9e4 cxgbe(4): Remove some dead code.
(cherry picked from commit 3965469eaa)
2021-07-26 10:49:25 -07:00
Hans Petter Selasky
9661887500 Fix mismerge in OFED update
When OFED was upgraded to Linux v4.9, a bunch of Linux-specific
netlink changes were dropped.  Unfortunately, there was a mismerge
in this process and as a result ib_sa_cancel_query() would fail to
cancel an outstanding MAD.

This was causing rdma_destroy_id() to hang indefinitely waiting
for the MAD to complete and release the final reference.

Sponsored by: Dell Inc.
Differential Revision:	https://reviews.freebsd.org/D28421
Reviewed by: hselasky, kib

(cherry picked from commit 8a06ca2f73)
2021-07-26 18:12:35 +02:00
Hans Petter Selasky
5f23486df9 ipoib: Fix for accessing uninitialized pointers and freed memory during attach and detach.
Call infiniband_ifdetach() early to stop ifioctl(9) calls from user-space
during device removal. Also make sure that ifioctl(9) calls are blocked from
executing until the device is fully initialized. Ideally we would delay the
infiniband_ifattach() call, but because part of the initialization is to update
the link level address, that is not possible without more significant changes.

Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking

(cherry picked from commit cd2c05d323)
2021-07-26 18:04:33 +02:00
Hans Petter Selasky
5b803f6820 mlx5: Numa domain improvements.
Properly allocate all mlx5en(4) structures from correct numa domain.

While at it, clean up unused numa domain integers deriving from the
Linux version of mlx5en(4).

Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 7c3eff94bd)
2021-07-26 18:04:33 +02:00
Hans Petter Selasky
2b4db9bbc2 mlx5: Fix for uninitialized "uid" field.
Make sure the "uid" field gets properly set when destroying DCT and QP
objects by making a copy of the field when creating such objects.

Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking

(cherry picked from commit cbf6911e10)
2021-07-26 18:04:33 +02:00
Hans Petter Selasky
191a0e65a0 mlx4: Map core_clock page to user space only when allowed
Currently, when the hca_core_clock page is mapped to user space, other
vulnerable registers on the same page, including a semaphore, are
exposed as well. If a user reads the wrong offset, this can modify the
semaphore and hang the device.

Hence, map the hca_core_clock page to user space only when the user
specifically requests it.

After this patch, the mlx4 core_clock page is no longer mapped to user
space by default, as opposed to the current state, where it is always
mapped.

Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking

(cherry picked from commit c8301cbb0f)
2021-07-26 18:04:33 +02:00
Hans Petter Selasky
fdab56d1f6 mlx5en: Allow binding channels to CPUs when RSS is not enabled.
Submitted by:	Netflix
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking

(cherry picked from commit c8d16d1e08)
2021-07-26 18:04:33 +02:00
Hans Petter Selasky
9fe9a92ab6 ibcore: Add some functions and definitions for selecting and querying retryable ucontext cleanup.
Linux commit:
1c77483e4c50339b0306572167ccbff6b55d051b

Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking

(cherry picked from commit f60da09dbb)
2021-07-26 18:04:33 +02:00
Hans Petter Selasky
857966b357 mlx5en: Allocate per-channel doorbells.
To avoid congestion on the same PCI memory register space when
traffic consists mostly of small packets.

Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 9dfa21486e)
2021-07-26 18:04:33 +02:00
Hans Petter Selasky
0ebff67cab mlx5en: Wait for all TLS connections to terminate when unloading driver.
The driver expects all TLS tags to be returned to the driver before
it can free the UMA zone where the TLS tags reside.

Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 3a934ba7a3)
2021-07-26 18:04:33 +02:00
Hans Petter Selasky
cd08d4c537 mlx4ib and mlx5ib: Set slid to zero in Ethernet completion struct
IB spec says that a lid should be ignored when link layer is Ethernet,
for example when building or parsing a CM request message (CA17-34).
However, since ib_lid_be16() and ib_lid_cpu16() validate the slid,
not only when link layer is IB, we set the slid to zero to prevent
false warnings in the kernel log.

Linux commit:
65389322b28f81cc137b60a41044c2d958a7b950

Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 30416d4e82)
2021-07-26 18:04:33 +02:00
Hans Petter Selasky
1554e2673b mlx5en: Configure relaxed PCI read and write ordering for ethernet.
This may improve performance in some configurations.

Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking

(cherry picked from commit de2437f199)
2021-07-26 18:04:32 +02:00
Hans Petter Selasky
c41403c5ca mlx5en: Check for pci_channel_offline() when draining sendqueue.
This speeds up detach in hypervisor environments.

Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 4692d9808e)
2021-07-26 18:04:32 +02:00
Hans Petter Selasky
763d239db2 mlx5ib: Implement support for enabling and disabling RoCE ECN.
RoCE is short for Remote direct memory access over Converged Ethernet.
ECN is short for Explicit Congestion Notification.

Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 8abf5ac0e6)
2021-07-26 18:04:32 +02:00
Hans Petter Selasky
526c54f54b mlx5ib: Extend parameter macros so that more arguments may be added.
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 42f719d611)
2021-07-26 18:04:32 +02:00
Hans Petter Selasky
75b63f08d2 mlx5core: Don't query the PCI config space for offline during a firmware command.
Querying the PCI config space for offline for every firmware command blocks
the PCI bus and affects performance. Especially for packet pacing and TLS
when objects are frequently created and destroyed.

Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking

(cherry picked from commit e787b5acb1)
2021-07-26 18:04:32 +02:00
Hans Petter Selasky
f52bc34a06 Fix LINT kernel build issues after c3987b8ea7.
Fixes the IPOIB_CM and SDP kernel options.

Reported by:	lwhsu @
Sponsored by:	NVIDIA Networking

(cherry picked from commit 693ddf4dc4)
2021-07-26 18:04:32 +02:00
Hans Petter Selasky
47f7e8e423 ibcore: Declare ib_post_send() and ib_post_recv() arguments const
Since neither ib_post_send() nor ib_post_recv() modify the data structure
their second argument points at, declare that argument const. This change
makes it necessary to declare the 'bad_wr' argument const too and also to
modify all ULPs that call ib_post_send(), ib_post_recv() or
ib_post_srq_recv(). This patch does not change any functionality but makes
it possible for the compiler to verify whether the
ib_post_(send|recv|srq_recv) really do not modify the posted work request.

Linux commit:
f696bf6d64b195b83ca1bdb7cd33c999c9dcf514
7bb1fafc2f163ad03a2007295bb2f57cfdbfb630
d34ac5cd3a73aacd11009c4fc3ba15d7ea62c411

Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking

(cherry picked from commit c3987b8ea7)
2021-07-26 18:04:32 +02:00
Hans Petter Selasky
b383248ef6 mlx5: Set default timestamp format for mlx5en(4) and mlx5ib.
Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 4fb0a74e08)
2021-07-26 18:04:32 +02:00
Hans Petter Selasky
a7a80f1771 mlx5: Add new timestamp mode bits.
These fields declare which timestamp mode is supported
by the device per RQ/SQ/QP.

In addition, add the ts_format field to select the mode
for RQ/SQ/QP.

Linux commit:
a6a217dddcd544f6b75f0e2a60b6e84c1d494b7e

Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 915fc66cb5)
2021-07-26 18:04:32 +02:00
Hans Petter Selasky
ee9aeb312c ibcore: Implement ib_uverbs_get_ucontext_file().
Expose ib_ucontext from a given ib_uverbs_file. Drivers that use the ioctl(9)
API may have the ib_uverbs_file and need a way to get the related ib_ucontext
from it, this is enabled by this patch.

Downstream patches from this series will use it.

Linux commit:
7dc08dcfc8c86cb4457e383734ff6844ddaff876

Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 79b817084c)
2021-07-26 18:04:31 +02:00
Hans Petter Selasky
11f4280153 ibcore: Clean up INIT_UDATA() and INIT_UDATA_BUF_OR_NULL() macro usage.
We get a harmless warning about the fact that we use the result of a
multiplication as a condition in INIT_UDATA_BUF_OR_NULL():

uverbs_main.c: In function 'ib_uverbs_write':
error: '*' in boolean context, suggest '&&' instead [-Werror=int-in-bool-context]

This avoids the problem by using an inline function in place of
the macro.

After changing INIT_UDATA_BUF_OR_NULL() to an inline function,
do the same change to INIT_UDATA() for consistency.

Using an inline function gives us better type safety here among other
issues with macros. I'm using u64_to_user_ptr() to convert the user
pointer to simplify the logic rather than adding lots of new type casts.

Linux commit:
12f727721eee61b3d19dedb95cb893b2baa9fe41
40a203396cc1c239f2e71c47c66ed03097123d2c

Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 05f4691919)
2021-07-26 18:04:31 +02:00
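The macro-to-inline-function change in 11f4280153 can be illustrated with a simplified stand-in. The structure `udata_sketch` and the function `init_buf_or_null()` are hypothetical and much smaller than the real INIT_UDATA_BUF_OR_NULL(); the sketch only shows why a typed function with an explicit comparison avoids the "'*' in boolean context" diagnostic.

```c
#include <assert.h>
#include <stddef.h>

struct udata_sketch {
	const void *inbuf;
	size_t inlen;
};

/*
 * Macro form, for comparison only.  The length expression is used directly
 * as a condition, so an argument like "a * b" puts '*' in a boolean context:
 *
 *   #define INIT_BUF_OR_NULL(u, b, n) ((u)->inbuf = (n) ? (b) : NULL)
 */

/*
 * Inline-function form: the length is evaluated once into a typed
 * parameter and compared explicitly against zero.
 */
static inline void
init_buf_or_null(struct udata_sketch *u, const void *ibuf, size_t ilen)
{
	assert(u != NULL);
	u->inbuf = (ilen != 0) ? ibuf : NULL;
	u->inlen = ilen;
}
```

A call such as `init_buf_or_null(&u, buf, 3 * 4)` then compiles without the warning, and the compiler checks the argument types as well.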
Hans Petter Selasky
1c1c0acb31 ibcore: Simplify ib_modify_qp_is_ok().
All callers of ib_modify_qp_is_ok() provide an enum ib_qp_state, which
makes the checks for out-of-range states redundant. Remove them and
update the function signature to return a boolean result.

While at it remove unused "ll" parameter from ib_modify_qp_is_ok().

Linux commit:
19b1f54099b6ee334acbfbcfbdffd1d1f057216d
d31131bba5a1630304c55ea775c48cc84912ab59

Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking

(cherry picked from commit d92a9e5604)
2021-07-26 18:04:31 +02:00
Hans Petter Selasky
4c033941d0 ibcore: Support rate limit for packet pacing
Add new member rate_limit to ib_qp_attr which holds the packet pacing
rate in kbps, 0 means unlimited.

IB_QP_RATE_LIMIT is added to ib_attr_mask and could be used by RAW
QPs when changing QP state from RTR to RTS, RTS to RTS.

Linux commit:
528e5a1bd3f0e9b760cb3a1062fce7513712a15d

Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 0c13880ccc)
2021-07-26 18:04:31 +02:00
Hans Petter Selasky
d76e6346af ibcore: Add new IB rates.
Add the new rates that were added to Infiniband spec as part of
HDR and 2x support.

Linux commit:
a5a5d1993696419e7d5357fc3128e53d219d382e

Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 3d2fb36a9c)
2021-07-26 18:04:31 +02:00
Hans Petter Selasky
d857bf2037 ibcore: Don't allocate method table, if already present.
This commit aligns the code in question with upstream Linux.

Linux commit:
2468b82d69e3a53d024f28d79ba0fdb8bf43dfbf

Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 721b795b72)
2021-07-26 18:04:31 +02:00
Hans Petter Selasky
a0d74ac3f3 ibcore: Fix a use-after-free in ucma_resolve_ip().
There is a race condition between ucma_close() and ucma_resolve_ip():

CPU0                            CPU1
ucma_resolve_ip():              ucma_close():

ctx = ucma_get_ctx(file, cmd.id);

        list_for_each_entry_safe(ctx, tmp, &file->ctx_list, list) {
                mutex_lock(&mut);
                idr_remove(&ctx_idr, ctx->id);
                mutex_unlock(&mut);
                ...
                mutex_lock(&mut);
                if (!ctx->closing) {
                        mutex_unlock(&mut);
                        rdma_destroy_id(ctx->cm_id);
                ...
                ucma_free_ctx(ctx);
        }

ret = rdma_resolve_addr();
ucma_put_ctx(ctx);

Before idr_remove(), ucma_get_ctx() could still find the ctx
and after rdma_destroy_id(), rdma_resolve_addr() may still
access id_priv pointer. Also, ucma_put_ctx() may use ctx after
ucma_free_ctx() too.

ucma_close() should call ucma_put_ctx() too, which tests the
refcnt and waits for the last reference to be released. A similar
pattern is already used by ucma_destroy_id().

Linux commit:
5fe23f262e0548ca7f19fb79f89059a60d087d22

Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking

(cherry picked from commit c6ccb08686)
2021-07-26 18:04:31 +02:00
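The reference-counting discipline behind a0d74ac3f3 can be sketched in user-space C. Note that this swaps in a simpler technique than the one the commit message describes: instead of having close wait for the last reference, the sketch frees the context on the last put. All names (`ctx_sketch`, `ctx_get()`, `ctx_put()`, `ctx_close()`) are hypothetical, and the `closing` flag would need lock protection in real concurrent code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct ctx_sketch {
	atomic_int refcnt;
	bool closing;		/* in real code, protected by a lock */
	int *freed_counter;	/* test hook: incremented when the ctx is freed */
};

static struct ctx_sketch *
ctx_create(int *freed_counter)
{
	struct ctx_sketch *c = malloc(sizeof(*c));

	if (c == NULL)
		return (NULL);
	atomic_init(&c->refcnt, 1);	/* reference owned by the file */
	c->closing = false;
	c->freed_counter = freed_counter;
	return (c);
}

/* Take a reference unless the context is already being closed. */
static bool
ctx_get(struct ctx_sketch *c)
{
	assert(c != NULL);
	if (c->closing)
		return (false);
	atomic_fetch_add(&c->refcnt, 1);
	return (true);
}

/* Drop a reference; the last holder frees the context. */
static void
ctx_put(struct ctx_sketch *c)
{
	assert(c != NULL);
	if (atomic_fetch_sub(&c->refcnt, 1) == 1) {
		(*c->freed_counter)++;
		free(c);
	}
}

/* Close drops only the file's own reference instead of freeing directly. */
static void
ctx_close(struct ctx_sketch *c)
{
	assert(c != NULL);
	c->closing = true;
	ctx_put(c);
}
```

With this discipline, a caller that obtained the context before close started can still use it safely until its own ctx_put().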
Hans Petter Selasky
1753641a8b ibcore: Define option to set ack timeout.
Define new option in 'rdma_set_option' to override calculated QP timeout
when requested to provide QP attributes to modify a QP.

At the same time, pack tos_set to be bitfield.

Linux commit:
2c1619edef61a03cb516efaa81750784c3071d10

Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 20fea7ac64)
2021-07-26 18:04:31 +02:00
Hans Petter Selasky
7449dd9a12 ibcore: Do not overreact to SM LID change event.
When IPoIB receives an SM LID change event, it reacts by flushing its
path record cache and rejoining multicast groups. This is the same
behavior it performs when it receives a reregistration event. This
behavior is unnecessary as an SM may have database backup or
synchronization mechanisms which permit the SM location or LID to change
without loss of multicast membership and without impact to path records.

Both opensm and the OPA FM issue reregistration events if a new SM is
started (or restarted with a new config) or an SM event occurs which
results in loss of multicast membership records by the SM (such as
opensm failover) or the SM encounters new nodes with Active ports (such
as after joining 2 fabrics by connecting switches via ISLs). Hence this
event can be depended on as the trigger for IPoIB cache and multicast
flushing.

It appears that some drivers, such as qib and hfi1, issue
IB_EVENT_SM_CHANGE, but other drivers, such as mlx4 and mlx5, do not.
Empirical testing on Mellanox EDR using ibv_asyncwatch has confirmed
that Mellanox EDR HCAs do not generate SM change events and that opensm
does generate reregistration.

An SM LID change event is generated by the mentioned drivers to reflect
that sm_lid and/or sm_sl in the local port info has changed. The intent
of this event is to permit applications and ULPs which have a local copy
of this information (or an address handle using it) to update their
information.

The intent is that the reregistration event (caused by the SM via a bit
in Set(PortInfo)) be used to inform nodes that they need to rejoin
multicast groups, resubscribe for notices and potentially update path
records.

When an SM migrates or fails over, a SM LID change event can occur. In
response IPoIB discards path records and multicast membership and loses
connectivity until these records are restored via SA requests. In very
large fabrics, it may take minutes for the SM to be ready and for the SA
responses to be supplied.  This can result in undesirable and
unnecessary IPoIB connectivity impacts. It also can result in an
unnecessary storm of SA queries from all nodes in a cluster potentially
followed by yet another storm if the SM issues the reregistration
request.

The fact that Mellanox HCAs do not even generate this event is further
evidence that on modern IB fabrics there will be no ill side effects
from the proposed changes below to reduce the reaction by 3 kernel
components to this event. So these changes should be benign for Mellanox
IB fabrics and will benefit OPA fabrics while also making ib_core and
ULP behavior "correct" as intended by the IBTA spec and kernel RDMA event
APIs.

Address these issues by removing IB_EVENT_SM_CHANGE handling from ipoib.
IPoIB does not locally store sm_lid nor sm_sl, so it does not need to do
anything on SM LID change. IPoIB makes use of other ib_core components
to issue SA requests for it and those components correctly track SM LID
and SM LID changes.

Also in ib_core multicast handling, remove the test for
IB_EVENT_SM_CHANGE. This code is moving all multicast groups to the
error state, which will trigger rejoins. This code is used by IPoIB as
well as the connection manager and other clients of multicast groups.
This kernel module centralizes group membership status and joins since a
node can only join a given group once but multiple ULPs or applications
may want to join the same group. It makes use of the sa_query.c
component in ib_core, which correctly tracks SM LID and SL. This
component does not track SM LID nor SL itself and hence need not react
to their changes.

Similarly, in the ib_core cache code, remove the handling for
IB_EVENT_SM_CHANGE. The ib_cache_update function, which is ultimately
called, updates local copies of the pkey table, gid table and lmc. It
does not update or retain sm_lid or sm_sl. As such, it does not need to
be called on an SM LID change. It technically also does not need to be
called on a reregistration. The LID_CHANGE, PKEY_CHANGE, GID_CHANGE and
port state change events (PORT_ERR, PORT_ACTIVE) should be sufficient
triggers.

It is worth noting that the alternative of simply having the hfi1 and
qib drivers not generate the SM LID change event was explored. While
this would duplicate what Mellanox drivers do now, it is not the correct
behavior and removes the ability for an SM to migrate without requiring
reregistration. Since both opensm and OPA SM have mechanisms to backup
or synchronize registration information, it is desirable to let them
perform SM migrations (with LID or SL changes) without requiring
reregistration when they deem it appropriate.

Linux commit:
ba7d8117f3cca8eb70d579fde3f9ec8cd6a28f39

Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking

(cherry picked from commit df1df0c742)
2021-07-26 18:04:31 +02:00
Hans Petter Selasky
fe866f5890 ibcore: Remove debug prints after allocation failure.
The prints after [k|v][m|z|c]alloc() functions are not needed,
because in case of failure, the allocator will print its own
error messages anyway.

Linux commit:
2716243212241855cd9070883779f6e58967dec5

Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 26646ba5bc)
2021-07-26 18:04:31 +02:00
Hans Petter Selasky
36f60545aa ibcore: Fix use-after-free in IB mad completion handling.
We encountered a use-after-free bug when unloading the driver:

BUG: KASAN: use-after-free in ib_mad_post_receive_mads+0xddc/0xed0 [ib_core]
Read of size 4 at addr ffff8882ca5aa868 by task kworker/u13:2/23862

Workqueue: ib-comp-unb-wq ib_cq_poll_work [ib_core]
Call Trace:
dump_stack+0x9a/0xeb
print_address_description+0xe3/0x2e0
ib_mad_post_receive_mads+0xddc/0xed0 [ib_core]
__kasan_report+0x15c/0x1df
ib_mad_post_receive_mads+0xddc/0xed0 [ib_core]
kasan_report+0xe/0x20
ib_mad_post_receive_mads+0xddc/0xed0 [ib_core]
find_mad_agent+0xa00/0xa00 [ib_core]
qlist_free_all+0x51/0xb0
mlx4_ib_sqp_comp_worker+0x1970/0x1970 [mlx4_ib]
quarantine_reduce+0x1fa/0x270
kasan_unpoison_shadow+0x30/0x40
ib_mad_recv_done+0xdf6/0x3000 [ib_core]
_raw_spin_unlock_irqrestore+0x46/0x70
ib_mad_send_done+0x1810/0x1810 [ib_core]
mlx4_ib_destroy_cq+0x2a0/0x2a0 [mlx4_ib]
_raw_spin_unlock_irqrestore+0x46/0x70
debug_object_deactivate+0x2b9/0x4a0
__ib_process_cq+0xe2/0x1d0 [ib_core]
ib_cq_poll_work+0x45/0xf0 [ib_core]
process_one_work+0x90c/0x1860
pwq_dec_nr_in_flight+0x320/0x320
worker_thread+0x87/0xbb0
__kthread_parkme+0xb6/0x180
process_one_work+0x1860/0x1860
kthread+0x320/0x3e0
kthread_park+0x120/0x120
ret_from_fork+0x24/0x30
...
Freed by task 31682:
save_stack+0x19/0x80
__kasan_slab_free+0x11d/0x160
kfree+0xf5/0x2f0
ib_mad_port_close+0x200/0x380 [ib_core]
ib_mad_remove_device+0xf0/0x230 [ib_core]
remove_client_context+0xa6/0xe0 [ib_core]
disable_device+0x14e/0x260 [ib_core]
__ib_unregister_device+0x79/0x150 [ib_core]
ib_unregister_device+0x21/0x30 [ib_core]
mlx4_ib_remove+0x162/0x690 [mlx4_ib]
mlx4_remove_device+0x204/0x2c0 [mlx4_core]
mlx4_unregister_interface+0x49/0x1d0 [mlx4_core]
mlx4_ib_cleanup+0xc/0x1d [mlx4_ib]
__x64_sys_delete_module+0x2d2/0x400
do_syscall_64+0x95/0x470
entry_SYSCALL_64_after_hwframe+0x49/0xbe

The problem was that the MAD PD was deallocated before the MAD CQ.
There was completion work pending for the CQ when the PD got deallocated.
When the mad completion handling reached procedure
ib_mad_post_receive_mads(), we got a use-after-free bug in the following
line of code in that procedure:
sg_list.lkey = qp_info->port_priv->pd->local_dma_lkey;
(the pd pointer in the above line is no longer valid, because the
pd has been deallocated).

We fix this by allocating the PD before the CQ in procedure
ib_mad_port_open(), and deallocating the PD after freeing the CQ
in procedure ib_mad_port_close().

Since the CQ completion work queue is flushed during ib_free_cq(),
no completions will be pending for that CQ when the PD is later
deallocated.

Note that freeing the CQ before deallocating the PD is the practice
in the ULPs.

Linux commit:
770b7d96cfff6a8bf6c9f261ba6f135dc9edf484

Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 468a6b5055)
2021-07-26 18:04:30 +02:00
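The ordering rule from 36f60545aa (allocate the PD before the CQ, free the CQ before the PD) can be sketched as follows. This is a user-space illustration with hypothetical types (`pd_sketch`, `cq_sketch`, `port_sketch`) and a counter standing in for pending completion work; it is not the ib_core code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct pd_sketch {
	uint32_t lkey;
};

struct cq_sketch {
	const struct pd_sketch *pd;	/* completions dereference the PD */
	int pending;			/* completions not yet processed */
};

struct port_sketch {
	struct pd_sketch *pd;
	struct cq_sketch *cq;
};

/* Allocate the PD first, then the CQ that depends on it. */
static int
port_open(struct port_sketch *p, uint32_t lkey)
{
	assert(p != NULL);
	p->pd = malloc(sizeof(*p->pd));
	if (p->pd == NULL)
		return (-1);
	p->pd->lkey = lkey;
	p->cq = malloc(sizeof(*p->cq));
	if (p->cq == NULL) {
		free(p->pd);
		p->pd = NULL;
		return (-1);
	}
	p->cq->pd = p->pd;
	p->cq->pending = 0;
	return (0);
}

/* Flush pending completions, each of which reads the PD, then free the CQ. */
static uint32_t
cq_free(struct cq_sketch *cq)
{
	uint32_t seen = 0;

	assert(cq != NULL);
	while (cq->pending > 0) {
		seen = cq->pd->lkey;	/* PD must still be alive here */
		cq->pending--;
	}
	free(cq);
	return (seen);
}

/* Tear down in reverse order: CQ first, PD last. */
static uint32_t
port_close(struct port_sketch *p)
{
	uint32_t seen;

	assert(p != NULL);
	seen = cq_free(p->cq);
	p->cq = NULL;
	free(p->pd);
	p->pd = NULL;
	return (seen);
}
```

Freeing the PD before calling cq_free() would make the read of `cq->pd->lkey` a use-after-free, which is the failure mode the commit message reports.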
Hans Petter Selasky
7da85a0db9 ibcore: Fail early if unsupported QP is provided.
When requested QP type is not supported for a {device, port}, return the
error right away before validating all parameters during mad agent
registration time.

Linux commit:
798bba01b44b0ddf8cd6e542635b37cc9a9b739c

Reviewed by:	kib
Sponsored by:	Mellanox Technologies // NVIDIA Networking

(cherry picked from commit 507389a35a)
2021-07-26 18:04:30 +02:00