Commit graph

387 commits

Author SHA1 Message Date
Alexander Motin
7ab9e36fc5 nvme: Introduce longer timeouts for admin queue
KIOXIA CD8 SSDs routinely take ~25 seconds to delete a non-empty
namespace.  In some cases, such as hot-plug, it takes longer, triggering
a timeout and controller reset after just 30 seconds.  Linux has had a
separate 60-second timeout for the admin queue for many years.  This
patch does the same, which also keeps the two consistent.
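
As a rough illustration of a per-queue timeout, here is a minimal sketch. The 30 and 60 second values come from the message above; the macro and function names are hypothetical and are not taken from the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative values taken from the commit message, not from the driver. */
#define IO_TIMEOUT_SECONDS	30
#define ADMIN_TIMEOUT_SECONDS	60

static_assert(ADMIN_TIMEOUT_SECONDS > IO_TIMEOUT_SECONDS,
    "the admin queue is meant to get the longer timeout");

/* Hypothetical helper: pick the command timeout by queue type. */
static uint32_t
queue_timeout_seconds(bool is_admin_queue)
{
	return (is_admin_queue ? ADMIN_TIMEOUT_SECONDS : IO_TIMEOUT_SECONDS);
}

/* True if a command that has run for `elapsed` seconds counts as timed out. */
static bool
command_timed_out(bool is_admin_queue, uint32_t elapsed)
{
	return (elapsed >= queue_timeout_seconds(is_admin_queue));
}
```

With these values, an admin command that has been running for 35 seconds is still within its budget, while an I/O command of the same age is not.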

Sponsored by:	iXsystems, Inc.
Reviewed by:	imp
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D42454

(cherry picked from commit 8d6c0743e36e3cff9279c40468711a82db98df23)
2023-11-14 07:54:18 -05:00
David Sloan
510404f2f4 nvme: Fix memory leak in pt ioctl commands
When running nvme passthrough commands through the ioctl interface
memory is mapped with vmapbuf() but not unmapped. This results in leaked
memory whenever a process executes an nvme passthrough command with a
data buffer. This can be replicated with a simple C function (error
checks skipped for brevity):

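#include <sys/ioctl.h>
#include <stdint.h>
#include <stdlib.h>
#include <dev/nvme/nvme.h>

void leak_memory(int nvme_ns_fd, uint16_t nblocks) {
	struct nvme_pt_command pt = {
		.cmd = {
			.opc = NVME_OPC_READ,
			.cdw12 = nblocks - 1,
		},
		.len = nblocks * 512, // Assumes devices with 512 byte lba
		.is_read = 1, // Reads and writes should both trigger leak
	};
	void *buf;

	// posix_memalign() takes (memptr, alignment, size); the 4096-byte
	// alignment is an assumption, the original call omitted it
	posix_memalign(&buf, 4096, nblocks * 512);
	pt.buf = buf;
	ioctl(nvme_ns_fd, NVME_PASSTHROUGH_COMMAND, &pt);
	free(buf);
}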

Signed-off-by: David Sloan <david.sloan@eideticom.com>

PR:		273626
Reviewed by:	imp, markj
MFC after:	1 week

(cherry picked from commit 7ea866eb14f8ec869a525442c03228b6701e1dab)
2023-10-08 20:41:25 -04:00
Warner Losh
3cd49bc5b3 nvme: Suppress noise messages
When we're suspending, we get messages about waiting for the controller
to reset. These are in error: we're not waiting for it to reset. We enter
the recovery state as part of suspending, so we should suppress these
messages as false positives.

Also remove a stray debug that's left over from earlier versions of
the recovery code that no longer makes sense.

Sponsored by:		Netflix

(cherry picked from commit 1d6021cd72689f54093af4ed77066a2f8abde664)
2023-09-28 15:05:15 -06:00
Warner Losh
81b118e842 nvme: Fix locking protocol violation to fix suspend / resume
Currently, when we suspend, we need to tear down all the qpairs. We call
nvme_admin_qpair_abort_aers with the admin qpair lock held, but the
tracker it will call for the pending AER also locks it (recursively)
hitting an assert. This routine is called without the qpair lock held
when we destroy the device entirely in a number of places. Add an assert
to this effect and drop the qpair lock before calling it.
nvme_admin_qpair_abort_aers then locks the qpair lock to traverse the
list, dropping it around calls to nvme_qpair_complete_tracker, and
restarting the list scan after picking it back up.
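
The "drop the lock around the call, then restart the scan" pattern described above can be sketched in userland C. This is only an illustration under assumptions: it uses a pthread mutex and a hand-rolled list rather than the kernel's mtx(9) and queue(3) primitives, and every name in it (struct tracker, struct queue, complete_tracker, abort_aers) is hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct tracker {
	struct tracker *next;
	bool is_aer;
};

struct queue {
	pthread_mutex_t lock;
	struct tracker *head;	/* outstanding trackers */
	int completed;
};

/* Stand-in completion routine: it takes the queue lock itself. */
static void
complete_tracker(struct queue *q, struct tracker *victim)
{
	struct tracker **pp;

	pthread_mutex_lock(&q->lock);
	for (pp = &q->head; *pp != NULL; pp = &(*pp)->next) {
		if (*pp == victim) {
			*pp = victim->next;	/* unlink from the list */
			q->completed++;
			break;
		}
	}
	pthread_mutex_unlock(&q->lock);
}

/* The caller must NOT hold q->lock when calling this. */
static void
abort_aers(struct queue *q)
{
	struct tracker *tr;

	assert(q != NULL);
	pthread_mutex_lock(&q->lock);
restart:
	for (tr = q->head; tr != NULL; tr = tr->next) {
		if (!tr->is_aer)
			continue;
		/* Drop the lock so the completion routine can take it. */
		pthread_mutex_unlock(&q->lock);
		complete_tracker(q, tr);
		pthread_mutex_lock(&q->lock);
		/* The list may have changed while unlocked: rescan. */
		goto restart;
	}
	pthread_mutex_unlock(&q->lock);
}
```

Restarting from the head costs extra passes over the list, but it avoids following a pointer that may have gone stale while the lock was released.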

Note: If interrupts are still running, there's a tiny window for these
AERs: If one fires just an instant after we manually complete it, then
we'll be fine: we set the state of the queue to 'waiting' and we ignore
interrupts while 'waiting'. We know we'll destroy all the queue state
with these pending interrupts before looking at them again and we know
all the TRs will have been completed or rescheduled. So either way we're
covered.

Also, tidy up the failure case: failing a queue is a superset of
disabling it, so there is no need to call disable first. This solves some
locking issues with recursion since we don't need to recurse. Set the
qpair state of failed queues to RECOVERY_FAILED and stop scheduling the
watchdog. Assert we're not failed when we're enabling a qpair, since
failure currently is one-way. Make failure a little less verbose.

Next, kill the pre/post reset stuff. It's completely bogus: since we
disable the qpairs, we don't need to also hold the lock through the
reset, because disabling will cause the ISR to return early. This keeps
us from recursing on the recovery lock when resuming. We only need the
recovery lock to avoid a specific race between the timer and the ISR.

Finally, kill NVME_RESET_2X. It's been a major release since we put it
in and nobody has used it as far as I can tell. And it was a motivator
for the pre/post uglification.

These are all interrelated, so need to be done at the same time.

Sponsored by:		Netflix
Reviewed by:		jhb
Tested by:		jhb (made sure suspend / resume worked)
MFC After:		3 days
Differential Revision:	https://reviews.freebsd.org/D41866

(cherry picked from commit da8324a9258f1791cd10423103c1746646e33104)
2023-09-28 15:05:15 -06:00
Warner Losh
5e9b7d0e0e nvme: Give up when we've failed
Normally, we poll the device every so often to see if commands have
timed out. However, we'll go into the recovery state as part of failing
the drive. To account for all possibilities, if we're failed when we get
into the polling function, just stop polling: Party is over.

Sponsored by:		Netflix

(cherry picked from commit d95431624f934fe4740211738fc787808005b14e)
2023-09-28 15:05:14 -06:00
Warner Losh
c7cb2dcdf2 nvme: Add exclusion for ISR
Add a basically uncontended spinlock that we take out while the ISR is
running. This has two effects: First, when we get a timeout, we can
safely call the nvme_qpair_process_completions w/o racing any ISRs.
Second, we can use it to ensure that we don't reset the card while
the ISRs are active (right now we just sleep and hope for the best,
which usually is fine, but not always).
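
A minimal userland sketch of the idea, assuming a C11 atomic_flag as a stand-in for the kernel spinlock: the interrupt path and the timeout path take the same lock before touching completions, so they cannot race each other. All names here are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

struct qpair_sketch {
	atomic_flag isr_lock;	/* held while completions are processed */
	int pending;		/* completions waiting to be processed */
	int processed;		/* completions handled so far */
};

static void
spin_acquire(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		;	/* spin; the lock is expected to be uncontended */
}

static void
spin_release(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

/* Must be called with isr_lock held. */
static void
process_completions(struct qpair_sketch *q)
{
	assert(q->pending >= 0);
	q->processed += q->pending;
	q->pending = 0;
}

/* Interrupt path: normally the only taker of the lock. */
static void
isr(struct qpair_sketch *q)
{
	spin_acquire(&q->isr_lock);
	process_completions(q);
	spin_release(&q->isr_lock);
}

/* Timeout path: the same lock guarantees no ISR is mid-flight. */
static void
timeout_handler(struct qpair_sketch *q)
{
	spin_acquire(&q->isr_lock);
	process_completions(q);
	spin_release(&q->isr_lock);
}
```

A reset path could take the same lock before touching the controller, which is the second effect the message describes.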

Sponsored by:		Netflix
MFC After:		2 weeks
Reviewed by:		chuck, gallatin
Differential Revision:	https://reviews.freebsd.org/D41452

(cherry picked from commit 8052b01e7e4113fa8296ce43c354116b0a1774b7)
2023-09-28 15:05:14 -06:00
Warner Losh
5d627e0669 nvme: Greatly improve error recovery
Next phase of error recovery: Eliminate the RECOVERY_START phase, since
we don't need to wait to start recovery. Eliminate the RECOVERY_RESET
phase since it is transient; we now transition from RECOVERY_NORMAL into
RECOVERY_WAITING.

In normal mode, read the status of the controller. If it is in failed
state, or appears to be hot-plugged, jump directly to reset which will
sort out the proper things to do. This will cause all pending I/O to
complete with an abort status before the reset.

When in the NORMAL state, call the interrupt handler. This will complete
all pending transactions when interrupts are broken or temporarily
misbehaving. We then check all the pending completions for timeouts. If
we have abort enabled, then we'll send an abort. Otherwise we'll assume
the controller is wedged and needs a reset. By calling the interrupt
handler here, we'll avoid an issue with the current code where we
transitioned to RECOVERY_START which prevented any completions from
happening. Now completions happen. In addition, any follow-on I/O that is
scheduled in the completion routines will be submitted, rather than
queued, because the recovery state is correct. This also fixes a problem
where I/O would timeout, but never complete, leading to hung I/O.

Resetting remains the same as before; only when we choose to reset has
changed.

A nice side effect of these changes is that we now do I/O when
interrupts to the card are totally broken. Follow-on commits will improve
the error reporting and logging when this happens. Performance will be
awful, but the drive will at least be minimally functional.

There is a small race when we're checking the completions if interrupts
are working, but this is handled in a future commit.

Sponsored by:		Netflix
MFC After:		2 weeks
Differential Revision:	https://reviews.freebsd.org/D36922

(cherry picked from commit d4959bfcd110ea471222c7dd87775ba1f4e3d1d9)
2023-09-28 15:05:14 -06:00
Warner Losh
a276199fef nvme: Timeout expired transactions
When we went to having a shared timeout routine, failing the timed-out
transaction code was inadvertently dropped. Reinstate it.

Fixes:			502dc84a8b
Sponsored by:		Netflix
MFC After:		2 weeks
Reviewed by:		chuck, jhb
Differential Revision:	https://reviews.freebsd.org/D36921

(cherry picked from commit 2a6b7055a980f7e7543dfdbda4aa0c356133b77d)
2023-09-28 15:05:14 -06:00
Warner Losh
685dc743dc sys: Remove $FreeBSD$: one-line .c pattern
Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/
2023-08-16 11:54:36 -06:00
Warner Losh
95ee2897e9 sys: Remove $FreeBSD$: two-line .h pattern
Remove /^\s*\*\n \*\s+\$FreeBSD\$$\n/
2023-08-16 11:54:11 -06:00
John Baldwin
081c22db85 nvme.h: Fix a comment typo in admin opcode enum
Sponsored by:	Chelsio Communications
2023-08-15 11:06:58 -07:00
Warner Losh
33469f1011 nvme: use mtx_padalign instead of mtx + alignment attribute
The nvme driver predates mtx_padalign, it seems. Modernize.

Sponsored by:		Netflix
2023-08-14 16:33:26 -06:00
Warner Losh
09c20a2932 nvme: Move bools to fill hole
The two bools in nvme_request create a 6 byte hole today. Move them to
after retries to fill the 4 byte hole there and add a spare[2] to make
nvme_request 8 bytes smaller. spare[2] isn't strictly necessary, but
documents how many bytes we have left in that hole, as the number of
booleans will increase shortly.
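
The effect of this kind of reordering can be reproduced with a toy structure. The sketch below is not the real nvme_request layout: the field names are hypothetical, and the 8-byte saving assumes an LP64 ABI (8-byte pointers, 4-byte int32_t, 1-byte bool).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Before: the two bools sit between pointers and leave a 6-byte hole. */
struct req_before {
	void	*payload;
	int32_t	 retries;	/* followed by a 4-byte hole on LP64 */
	void	*cb_arg;
	bool	 flag_a;
	bool	 flag_b;	/* followed by a 6-byte hole on LP64 */
	void	*link;
};

/* After: the bools plus spare[2] exactly fill the hole after retries. */
struct req_after {
	void	*payload;
	int32_t	 retries;
	bool	 flag_a;
	bool	 flag_b;
	bool	 spare[2];	/* documents the bytes still free in the hole */
	void	*cb_arg;
	void	*link;
};

static_assert(offsetof(struct req_after, flag_a) ==
    offsetof(struct req_after, retries) + sizeof(int32_t),
    "flag_a should sit directly after retries");

/* Bytes saved by the reordering (8 on LP64 under the assumptions above). */
static size_t
bytes_saved(void)
{
	return (sizeof(struct req_before) - sizeof(struct req_after));
}
```

On an LP64 system the first layout is 40 bytes and the second is 32.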

Suggested by:		chuck
Sponsored by:		Netflix
2023-08-08 11:44:51 -06:00
Warner Losh
2ad9a815fd nvme: Directly lookup op code
Rather than have a table to walk through, use a sparse array.
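
A sparse array lookup of this kind might look like the sketch below. The table contents are a small illustrative subset of opcodes, and the identifiers (admin_opcode_names, io_opcode_names, opcode_name) are hypothetical rather than the driver's own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OPC_TABLE_SIZE	256	/* one slot per possible 8-bit opcode */

/* Sparse tables: slots that are not listed are NULL. */
static const char *const admin_opcode_names[OPC_TABLE_SIZE] = {
	[0x00] = "DELETE IO SQ",
	[0x01] = "CREATE IO SQ",
	[0x02] = "GET LOG PAGE",
	[0x06] = "IDENTIFY",
	[0x09] = "SET FEATURES",
};

static const char *const io_opcode_names[OPC_TABLE_SIZE] = {
	[0x00] = "FLUSH",
	[0x01] = "WRITE",
	[0x02] = "READ",
};

/* One helper serves both tables: index directly, no walk needed. */
static const char *
opcode_name(const char *const *table, uint8_t opc)
{
	const char *name;

	assert(table != NULL);
	name = table[opc];
	return (name != NULL ? name : "UNKNOWN");
}
```

The lookup is a single array index, at the cost of 256 pointers per table.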

Suggested by:		jhb
Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D41353
2023-08-07 16:44:32 -06:00
Warner Losh
63b0c00eb0 nvme: Update comment
Fix comment to note we should grab additional data from the error log
page, but don't currently (it's unclear if we should do that here
and other places in nvd that want it, or if we should let nvd / the
nda periph make the request).

Sponsored by:		Netflix
Reviewed by:		chuck, mav, jhb
Differential Revision:	https://reviews.freebsd.org/D41315
2023-08-07 16:44:31 -06:00
Warner Losh
95cd10f139 nvme: Add comments about other fields in status
When manually completing an I/O, we do so because we have no status back
from the card. Note M, CRD and P are all 0 because this is an artificial
event (and phase isn't checked when it's completed this way). There's no
MORE information in the error log page and there's no delayed retry
(CRD=0) and we don't currently request CRD to be set to anything other
than 0 and thus don't implement delayed retry.

Sponsored by:		Netflix
Reviewed by:		chuck, mav, jhb
Differential Revision:	https://reviews.freebsd.org/D41314
2023-08-07 16:44:31 -06:00
Warner Losh
a510dbc848 nvme: Be less verbose when cancelling I/O or admin commands
When we're resetting, and there's outstanding I/O that we're cancelling,
only report we're cancelling the I/O once rather than once per
I/O. Likewise when we reschedule the I/O. We don't need to say for each
one that we're cancelling/rescheduling something, and then report the
I/O that we're doing. Likewise with cancelling admin commands (we never
retry them here, so a similar change isn't needed).

Sponsored by:		Netflix
Reviewed by:		chuck, mav
Differential Revision:	https://reviews.freebsd.org/D41313
2023-08-07 16:44:31 -06:00
Warner Losh
ac8c866fda nvme: Add more NVME Base Spec 2.0 and NVME Command Set Spec 1.0a
Add admin commands capacity management, lockdown and fabrics commands.
Add I/O copy command.

Sponsored by:		Netflix
Reviewed by:		chuck, mav, jhb
Differential Revision:	https://reviews.freebsd.org/D41311
2023-08-07 16:44:31 -06:00
Warner Losh
edd23e4dc0 nvme: Eliminate redundant code
get_admin_opcode_string and get_io_opcode_string are identical, but
start with different tables. Use a helper routine that takes an argument
to implement these instead. A future commit will refine this further.

Sponsored by:		Netflix
Reviewed by:		chuck, mav, jhb
Differential Revision:	https://reviews.freebsd.org/D41310
2023-08-07 16:44:31 -06:00
Warner Losh
7be0b06885 nvme: Remove duplicate command printing routine
Both nvme_dump_command and nvme_qpair_print_command print nvme
commands. The latter is better. Recode the one call to
nvme_dump_command to use nvme_qpair_print_command and delete the
former. No sense having two nearly identical routines. A future commit
will convert to sbuf.

Sponsored by:		Netflix
Reviewed by:		chuck, mav, jhb
Differential Revision:	https://reviews.freebsd.org/D41309
2023-08-07 16:44:30 -06:00
Warner Losh
6f76d49386 nvme: Remove duplicate completion printing routine
Both nvme_dump_completion and nvme_qpair_print_completion print
completions. The latter is better. Recode the two instances of
nvme_dump_completion to use nvme_qpair_print_completion and delete the
former. No sense having two nearly identical routines. A future commit
will convert this to sbuf.

Sponsored by:		Netflix
Reviewed by:		chuck
Differential Revision:	https://reviews.freebsd.org/D41308
2023-08-07 16:44:30 -06:00
Vladimir Kondratyev
fc14525044 nvme(4): detect S3X NVMe controller in 2016-2017 MacBooks
Adds support for detection of the S3X NVMe controller found in the
13" MacBook Pro 2017 without Touch Bar (MacBook14,1).
It is known to be used in the following MacBooks:
- Retina MacBook 2016 (MacBook9,1)
- 13" MacBook Pro 2016 without Touch Bar (MacBook13,1)
- 13" MacBook Pro 2016 with Touch Bar (MacBook13,2)
2023-07-31 17:33:14 +03:00
John Baldwin
92103adbeb nvme: Use a memdesc for the request buffer instead of a bespoke union.
This avoids encoding CAM-specific knowledge in nvme_qpair.c.

Reviewed by:	chuck, imp, markj
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D41119
2023-07-24 10:32:58 -07:00
Warner Losh
774ab87cf2 cam: Add CAM_NVME_STATUS_ERROR error code
Add CAM_NVME_STATUS_ERROR error code. Flag all NVME commands that
completed with an error status as CAM_NVME_STATUS_ERROR (a new value)
instead of CAM_REQ_CMP_ERR. This indicates to the upper layers of CAM
that the 'cpl' field for nvmeio CCBs is valid and can be examined for
error recovery, if desired.

No functional change. nda will still see these as errors, call
ndaerror() to get the error recovery action, etc. cam_periph_error will
select the same case as before (even w/o the change, though the change
makes it explicit).

Sponsored by:		Netflix
Reviewed by:		chuck, mav, jhb
Differential Revision:	https://reviews.freebsd.org/D41085
2023-07-20 22:32:31 -06:00
John Baldwin
5ae4463498 nvme: Fix typo in "Command Aborted by Host" constant name.
Reviewed by:	chuck, imp
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D40763
2023-06-27 10:06:22 -07:00
John Baldwin
9c2203a691 nvme: Tidy up transfer rate settings in XPT_GET_TRAN_SETTINGS.
- Replace a magic number with CTS_NVME_VALID_SPEC.

- Set the transport and protocol versions the same as for XPT_PATH_INQ.

Probably we shouldn't bother with setting the version in the 'spec'
member of ccb_trans_settings_nvme at all and use the transport
and/or protocol version field instead.

Reviewed by:	chuck, imp
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D40616
2023-06-26 20:32:29 -07:00
Warner Losh
bdc81eeda0 nvme: Switch to nda by default
We already run nda by default on all the !x86 architectures. Switch the
default to nda. nda creates nvd compatibility links by default, so this
should be a nop. If this causes problems for your application, set
hw.nvme.use_nvd=1 in your loader.conf.

Sponsored by:		Netflix
2023-06-12 21:41:06 -06:00
Warner Losh
4d846d260e spdx: The BSD-2-Clause-FreeBSD identifier is obsolete, drop -FreeBSD
The SPDX folks have obsoleted the BSD-2-Clause-FreeBSD identifier. Catch
up to that fact and revert to their recommended match of BSD-2-Clause.

Discussed with:		pfg
MFC After:		3 days
Sponsored by:		Netflix
2023-05-12 10:44:03 -06:00
Alexander Motin
49ebbdb264 Add NAMESPACE MANAGEMENT into admin_opcode[].
MFC after: 1 week
2023-03-08 15:42:31 -05:00
Dag-Erling Smørgrav
9a5acf365d nvme: Clear the notify flag if the consumer rejects the controller.
While here, fix some type mismatch warnings.

Reviewed by:	imp
Sponsored by:	Netapp, Inc.
Sponsored by:	Klara, Inc.
MFC after:	1 week
2022-12-20 02:53:38 +01:00
Wanpeng Qian
8ab99dbea1 bhyve: abort and return FEATURE_NOT_SAVEABLE while set feature with a save flag for NVMe controller.
Currently bhyve's NVMe controller cannot save feature values across
reboots. It should return a FEATURE_NOT_SAVEABLE error when the command
specifies a save flag.

Quote from NVMe specification, page 205:

https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4-2019.06.10-Ratified.pdf

If the Feature Identifier specified in the Set Features command is not
saveable by the controller and the controller receives a Set Features
command with the Save bit set to one, then the command shall be aborted
with a status of Feature Identifier Not Saveable.
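
The rule quoted above can be sketched as a small check. The constants are assumptions based on my reading of the NVMe 1.4 specification (Save bit as bit 31 of CDW10, status code 0x0d for Feature Identifier Not Saveable) and should be verified against the spec. The names are hypothetical and this is not bhyve's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed from the NVMe 1.4 spec; check before relying on them. */
#define SET_FEATURES_SV_BIT	(UINT32_C(1) << 31)	/* Save bit in CDW10 */
#define SC_SUCCESS		0x00
#define SC_FEATURE_NOT_SAVEABLE	0x0d

static_assert(SET_FEATURES_SV_BIT == 0x80000000u,
    "the Save bit is assumed to be the top bit of CDW10");

/*
 * Decide the completion status for a Set Features command, given
 * whether the controller is able to persist the requested feature.
 */
static uint8_t
set_features_status(uint32_t cdw10, bool feature_is_saveable)
{
	if ((cdw10 & SET_FEATURES_SV_BIT) != 0 && !feature_is_saveable)
		return (SC_FEATURE_NOT_SAVEABLE);
	return (SC_SUCCESS);
}
```

A controller that cannot save any feature would pass false for every feature identifier, so any command with the Save bit set is aborted.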

Reviewed by:		chuck (older version)
Approved by:		manu (mentor)
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D32767
2022-11-15 07:48:24 +01:00
Alexander Motin
2a31a06bf1 Add random VMware device IDs.
Just to make dmesg look nicer there.

MFC after:	1 week
2022-10-20 10:19:24 -04:00
Warner Losh
4982884b99 nvme: Always set deadline to max
When a transaction is on the outstanding list, it needs to have a valid
timeout value, so set the deadline to infinity before placing it on the
list. Do so beforehand even though the list is protected by the qpair
lock.

Sponsored by:		Netflix
Reviewed by:		mav
Differential Revision:	https://reviews.freebsd.org/D36920
2022-10-11 12:51:32 -06:00
Alexander Motin
a69c096462 nvme: Print CRD, M and DNR status bits on errors.
It may help with debugging some issues.

MFC after:	1 week
2022-08-05 10:58:19 -04:00
Gordon Bergling
6e8ab6715d nvme(4): Fix a typo in a source code comment
- s/inaccessable/inaccessible/

MFC after:	3 days
2022-06-04 11:46:03 +02:00
John Baldwin
1093caa1bb nvme: Remove unused devclass arguments to DRIVER_MODULE.
2022-05-06 15:46:55 -07:00
John Baldwin
82496a256f nvme: Use devclass_find to lookup the nvme devclass.
Reviewed by:	imp
Differential Revision:	https://reviews.freebsd.org/D34995
2022-04-21 10:29:14 -07:00
Warner Losh
0fd4cd405b nvme: Use controller's page size instead of PAGE_SIZE to create qpair
When constructing qpair, use the controller's notion of page size rather
than the host's PAGE_SIZE. Currently, these are both 4k, but the arm 16k
page size support requires decoupling.

There's a "hidden" PAGE_SIZE in btoc, so we must change btoc(x) to
howmany(x, ctrlr->page_size) to properly count the number of pages (in
the drive's world view) that are needed for various calculations.
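
To see why the page size has to be passed explicitly, here is a small sketch. howmany() is defined locally with the usual round-up-division form so the block stands alone; pages_spanned is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>

#ifndef howmany
#define howmany(x, y)	(((x) + ((y) - 1)) / (y))	/* round-up division */
#endif

/* Number of pages of size page_size needed to cover nbytes. */
static size_t
pages_spanned(size_t nbytes, size_t page_size)
{
	assert(page_size != 0);
	return (howmany(nbytes, page_size));
}
```

A 64 KiB transfer covers 4 pages when counted in 16k host pages but 16 pages when counted in 4k controller pages, so a macro that silently uses the host page size would undercount on a 16k host.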

With these changes, the nvme driver operates at production level load
for both host 4k and host 16k page size.

Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D34873
2022-04-15 14:46:19 -06:00
Warner Losh
c5ed67dc90 nvme: Prefer nvme_printf to printf when reporting formatting error
Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D34872
2022-04-15 14:46:19 -06:00
Warner Losh
3740a8db13 nvme: Further refinements in Host Memory Buffer Sizing
Host Memory Buffer units are a mix. For those in the identify structure,
the size is in 4kiB chunks. For specifying the buffer description,
though, they are in terms of the drive's MPS. Add comments to this
effect and change PAGE_SIZE to ctrlr->page_size where needed, as well as
correct a mistaken use of NVME_HMB_UNITS in 214df80a9c as pointed out
by rpokala@ after the commit. No functional change is intended, as
page_size is still 4k which matches all current hosts' PAGE_SIZE, but to
support 16k pages on arm, we need to differentiate these two cases.
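
The two units can be kept apart with a pair of conversions, sketched below. HMB_UNITS is a local stand-in for the driver's 4096-byte define, and the function and parameter names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define HMB_UNITS	4096	/* identify-structure sizes are in 4 KiB chunks */

/* Convert a size from the identify structure (4 KiB chunks) to bytes. */
static uint64_t
hmb_bytes(uint32_t size_in_hmb_units)
{
	return ((uint64_t)size_in_hmb_units * HMB_UNITS);
}

/* Express the same buffer in units of the drive's memory page size. */
static uint64_t
hmb_pages(uint32_t size_in_hmb_units, uint32_t page_size)
{
	assert(page_size != 0);
	return (hmb_bytes(size_in_hmb_units) / page_size);
}
```

With a 4k memory page size the two counts coincide, which is why mixing them up caused no functional change; with a 16k page size they differ by a factor of four.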

Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D34871
2022-04-15 14:46:19 -06:00
Warner Losh
3086efe895 nvme: Remove NVME_MAX_XFER_SIZE, replace inline calculation
NVME_MAX_XFER_SIZE used to be a constant (back when MAXPHYS was a
constant) to denote the smaller of MAXPHYS or the largest PRP we could
encode with our preallocation scheme. However, it's no longer constant
since MAXPHYS varies at runtime. In addition, the actual maximum is now
based on the page_size the drive currently uses, which is also a runtime
expression. As such, remove the define and expand it inline in the one
place it's still used in the tree.

Sponsored by:		Netflix
Reviewed by:		chuck
Differential Revision:	https://reviews.freebsd.org/D34870
2022-04-15 14:46:18 -06:00
Warner Losh
3a468f2010 nvme: Use saved mps when initializing drive
Make sure we set the MPS we cached (currently the drive's minimum MPS) in
CC (Controller Configuration) when reinitializing the drive. It must
match the page_size that we're going to use. Also retire the less
specific NVME_PAGE_SHIFT since it's now unused.

Sponsored by:		Netflix
Reviewed by:		chuck
Differential Revision:	https://reviews.freebsd.org/D34869
2022-04-15 14:46:18 -06:00
Warner Losh
55412ef90a nvme: Rename min_page_size to page_size and save mps
The Memory Page Size sets the basic unit of operation for the drive. We
currently set this to the drive's minimum page size, but we could set it
to any page size the drive supports in the future. Replace min_page_size
(it's now unused for that purpose) with page_size to reflect this and
cache the MPS we want to use. Use NVME_MPS_SHIFT to compute page_size.

Sponsored by:		Netflix
Reviewed by:		chuck
Differential Revision:	https://reviews.freebsd.org/D34868
2022-04-15 14:46:18 -06:00
Warner Losh
6e3deec8ca nvme: Base maximum data transfer size directly on MPSMIN in cap_hi
Calculate the maximum transfer size based on the MPSMIN we have in our
cached copy of cap_hi rather than using min_page_size in the controller.

Sponsored by:		Netflix
Reviewed by:		chuck
Differential Revision:	https://reviews.freebsd.org/D34867
2022-04-15 14:46:18 -06:00
Warner Losh
a7218e7a6b nvme: Fix old intel alignment size
The Intel RAID stripe alignment parameter is based on CAP.MPSMIN, so use
that directly now that we have it available.

Sponsored by:		Netflix
Reviewed by:		chuck
Differential Revision:	https://reviews.freebsd.org/D34866
2022-04-15 14:46:18 -06:00
Warner Losh
e66c1b5185 nvme: Define NVME_MPS_SHIFT
The memory page size (MPS) is expressed as 2^(n + 12) bytes, where n is
the MPS field value, and other items in the system inherit this. Create
a define rather than sprinkling 12 everywhere.
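
As a worked example of the encoding, a minimal sketch follows. MPS_SHIFT mirrors the value 12 named in the message; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MPS_SHIFT	12	/* same value the commit gives NVME_MPS_SHIFT */

/* Decode an MPS field value into a page size in bytes. */
static uint32_t
mps_to_page_size(uint32_t mps)
{
	assert(mps <= 15);	/* MPS is a 4-bit field */
	return (UINT32_C(1) << (MPS_SHIFT + mps));
}
```

An MPS value of 0 decodes to 4096 bytes and a value of 2 decodes to 16384 bytes.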

Sponsored by:		Netflix
Reviewed by:		chuck
Differential Revision:	https://reviews.freebsd.org/D34865
2022-04-15 14:46:18 -06:00
Gordon Bergling
dfa01f4f98 nvme(4): Fix a typo in a source code comment
- s/is is/is/

MFC after:	3 days
2022-04-09 09:24:34 +02:00
Warner Losh
214df80a9c nvme: new define for size of host memory buffer sizes
The nvme spec defines the various fields that specify sizes for host
memory buffers in terms of 4096-byte chunks. So, rather than use a bare 4096
here, use NVME_HMB_UNITS. This is explicitly not the host page size of
4096, nor the default memory page size (mps) of the NVMe drive, but its
own thing and needs its own define.

No functional change is intended, only the logical spelling of 4k.

Sponsored by:		Netflix
2022-04-08 23:05:25 -06:00
Warner Losh
161fcf7994 nvme: Publish the drive's capabilities
Add cap_lo and cap_hi sysctl to each nvme drive. This publishes the raw
capabilities of the drive. Currently, these can only be discovered with
bootverbose.

Sponsored by:		Netflix
2022-03-31 21:13:16 -06:00
Warner Losh
6af6a52ee4 nvme: Save cap_lo and cap_hi
Save the capabilities for the drive.

Sponsored by:		Netflix
2022-03-31 21:12:38 -06:00