When it's an I/O failure, we can still send admin commands. Separate out
the admin failures and flag them as such so that we can still send admin
commands on half-failed drives.
Fixes: 9229b3105d (nvme: Fail passthrough commands right away in failed state)
Sponsored by: Netflix
While it is easy enough to bounce over to nvme.c from nvme_ctrlr.c to
find this out, I've had to do that several times, so a little bit of
context is quite helpful.
Sponsored by: Netflix
Remove some uses of PHOLD which were there only to prevent the process'
threads from being swapped out.
Tested by: pho
Reviewed by: imp, kib
Differential Revision: https://reviews.freebsd.org/D46118
Issue a warning if we have system interrupt issues. If you get this
warning, then we submitted a request, it timed out without an interrupt
being posted, but when we polled the card's completion, we found
completion events. This indicates that we're missing interrupts, and to
date all the times I've helped people track issues like this down it has
been a system issue, not an NVMe driver issue.
Sponsored by: Netflix
Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D46031
Optimize timeout code based on three observations.
(1) The tr queues are sorted in order of submission, so the first one
that could time out is the first "real" one on the list.
(2) Timeouts for a given queue are all the same length (well, except
at startup, where timeout doesn't matter, and when you change it
at runtime, where timeouts will still happen eventually and the
difference isn't worth optimizing for).
(3) Calling the ISR races the real ISR and we should avoid that better.
So now, after checking to see if the card is there and working, the
timeout routine scans the pending tracker list until it finds a non-AER
tracker. If the deadline hasn't passed, we return, doing nothing
further. Otherwise, we call poll completions and then process the list
looking for timed out items.
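
The early-return check described above can be sketched roughly as follows.
This is only an illustration: struct tracker, its fields, and
timeout_needs_poll() are hypothetical names, not the driver's actual
definitions, and the sketch assumes a deadline counts as passed once the
current time reaches it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified tracker; the real driver's structure differs. */
struct tracker {
	struct tracker *next;     /* next tracker, in submission order */
	bool            is_aer;   /* async event requests never time out */
	uint64_t        deadline; /* absolute time at which this one expires */
};

/*
 * Return true if the timeout handler needs to touch the hardware, i.e. the
 * oldest non-AER tracker has reached its deadline. Because the list is in
 * submission order and all timeouts have the same length, no later tracker
 * can expire before the first non-AER one, so only that one is examined.
 */
static bool
timeout_needs_poll(const struct tracker *head, uint64_t now)
{
	const struct tracker *tr;

	for (tr = head; tr != NULL; tr = tr->next) {
		if (tr->is_aer)
			continue;
		return (now >= tr->deadline);
	}
	return (false);	/* only AERs (or nothing at all) are pending */
}
```

Only when this returns true would the handler go on to poll completions and
walk the list for timed out items.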
This should move the timeout routine to touching hardware only when it's
really necessary. It thus avoids racing the normal ISR, while still
timing out stuck transactions quickly enough.
There was also some minor code motion to make all of the above flow more
nicely for the reader.
When interrupts aren't working at all, then this will increase latency
somewhat. But when interrupts aren't working at all, there are bigger
problems and we should poll quite often in that case. That will be
handled in future commits.
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D46026
When processing an abort completion command, we have to lock. But we
have to lock the qpair of the original transaction (not the abort we're
completing). We do this to avoid races with checking the completion id
to tr mapping array, as well as to manually complete it.
Note: we don't handle the completion status of 'Asked to abort too many
transactions at once.' That will be fixed on subsequent commits. Add a
note to that effect for now since it's a harder problem to solve.
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D46025
When we lose a race with the timeout code, shift towards waiting for
that timeout code to complete so we can acquire the lock. This way we
can make sure we're in 'normal' mode before processing I/O
completions. If we're not in 'normal' mode, then we're resetting and we
should avoid completions.
Sponsored by: Netflix
Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D46024
Make nvme_qpair_manual_complete_request take dnr as well as a
print_on_error action. Make the status word computation common between
it and nvme_qpair_manual_complete_tracker. And print the error when
we are cancelling the I/O on failure, but not when we're filtering
the I/O after we've failed. Make it private again to nvme_qpair.c.
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D46049
When the drive is failed, we can't send passthrough commands to the
card, so fail them right away. Rearrange the comments to reflect the
current failure paths in the driver.
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D46048
Add the NVME_IOCTL_ID, NVME_IOCTL_ADMIN_CMD, and NVME_IOCTL_IO_CMD Linux
compatible ioctls. These may be run on either an I/O (ns) dev or a nvme
(admin) dev. Linux allows both on either device, and programs use this
and aren't careful about having the right device open. Emulate this
feature, and implement these ioctls. The data is passed into the
kernel in host byte order (not converted to le). Results are returned in
host order.
The timeout field is ignored, and the metadata and metadata_len fields
must be zero.
The addr field can be null, even when the data_len is non zero (FreeBSD's
ioctl interface prohibits this, Linux's just ignores the inconsistency).
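
A minimal sketch of the argument checks described above. The structure
here is a hypothetical subset of a passthrough request, not the real
layout, and the EINVAL return is an assumption; the text above only says
that the metadata fields must be zero.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical subset of a passthrough request; not the real layout. */
struct ex_passthru {
	uint64_t metadata;     /* must be zero: metadata is unsupported */
	uint64_t addr;         /* may be 0 even when data_len is non-zero */
	uint32_t metadata_len; /* must be zero */
	uint32_t data_len;
	uint32_t timeout;      /* accepted, but never examined */
};

/* Return 0 if the request is acceptable, or an errno value if not. */
static int
ex_passthru_check(const struct ex_passthru *p)
{
	if (p->metadata != 0 || p->metadata_len != 0)
		return (EINVAL);	/* assumed error code */
	/*
	 * A null addr with a non-zero data_len is tolerated here, and
	 * the timeout field is deliberately not checked.
	 */
	return (0);
}
```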
Only the cdw10 is returned from the command: the status is not returned
in 'result' field. XXX need to verify that this is what Linux does on an
error signaled from the drive.
No external include file is yet available for this: most programs that
call this interface either use a linux-specific path <linux/nvme.h> or
have their own private copy of the data. It's unclear what the best
thing to do is.
Also, create a /dev/nvmeXnY as an alias for /dev/nvmeXnsY.
These changes allow a native build of nvme-cli to work for everything
that doesn't depend on sysfs entries in /sys, calls that use metadata,
send / receive drive data and sed functionality not in our nvme driver.
Sponsored by: Netflix
Co-Authored-by: Chuck Tuffli <chuck@freebsd.org>
Reviewed by: chuck
Differential Revision: https://reviews.freebsd.org/D45415
Changes the device name for NVMe and NVMe-oF namespaces from using "ns"
to "n" to be more compatible with other operating systems. For example,
a device which was previously /dev/nvme0ns1 is now /dev/nvme0n1.
Preserves the existing functionality by creating an alias from nvmeXnY to
nvmeXnsY.
Reviewed by: imp
MFC after: 1 month
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D45414
When possible, we split up I/Os to NVMe drives that advertise a
preferred alignment. Add a counter for this.
Sponsored by: Netflix
Reviewed by: chuck, mav
Differential Revision: https://reviews.freebsd.org/D45311
It's easy to overlook the chain of events that lead to tr->deadline
being updated. Add a comment here to explain what otherwise looks like
an oversight w/o careful study.
Sponsored by: Netflix
We don't need to dereference qpair to get the ctrlr pointer each time,
so use the cached value. It's not going to change. No change intended.
Sponsored by: Netflix
nvme_qpair_complete_tracker and nvme_qpair_manual_complete_tracker have
to be called without the qpair lock, so assert it's unowned.
Sponsored by: Netflix
Add definition for page types 7 and 8 for host initiated telemetry and
controller initiated telemetry (they differ by one byte: the byte
that's defined in the host version is reserved in the controller
version).
Sponsored by: Netflix
This was already true for most architectures due to uint64_t structure
members. However, i386 is special in that it only requires 4 byte
alignment for uint64_t. As a result, casts from struct nvme_command
to struct nvmf_fabric_cmd were raising a "cast increases alignment"
warning on i386. Explicitly aligning struct nvme_command pacifies
this warning on i386.
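
As an illustration of the effect only, using C11 alignas rather than the
kernel's own alignment attribute, and with hypothetical structure names:

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical command with only 32-bit members: natural alignment is 4. */
struct ex_cmd_natural {
	uint32_t cdw0;
	uint32_t nsid;
	uint32_t cdw2;
};

/*
 * Same members, but the first one is explicitly aligned to 8 bytes, which
 * raises the alignment of the whole structure to 8 on every architecture.
 */
struct ex_cmd_aligned {
	alignas(8) uint32_t cdw0;
	uint32_t nsid;
	uint32_t cdw2;
};

static_assert(alignof(struct ex_cmd_aligned) == 8,
    "explicit alignment applies to the whole structure");
static_assert(sizeof(struct ex_cmd_aligned) % 8 == 0,
    "size is padded to a multiple of the alignment");
```

With the alignment stated explicitly, a cast to another 8-byte aligned
structure no longer increases the required alignment.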
Reported by: rscheff
Sponsored by: Chelsio Communications
This ensures that embedded uint64_t values used for statistics
counters are aligned when allocating a structure on the stack or as
part of a containing structure. In particular this quiets
-Waddress-of-packed-member warnings from GCC when compiling the code
in nvmfd to update the stats.
Reported by: GCC
This defines structures, ioctl commands, and related constants used
for both the Fabrics host and controller.
Reviewed by: imp
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D44706
We can't post an AER for this page, so there's no need to be able to swap
it to host byte order. It's not one of the standard defined pages that
can post via AER, and the vendor's public docs for this temperature page
don't suggest it's possible to get over or under event changes. Since
nvmecontrol no longer needs the swap routine, remove it since it's
now unused.
Sponsored by: Netflix
Reviewed by: chuck
Differential Revision: https://reviews.freebsd.org/D44659
Add sys/errno.h, sys/malloc.h, sys/queue.h, and vm/uma.h as needed.
sys/sysproto.h currently includes sys/acl.h which currently includes
sys/param.h, sys/queue.h, and vm/uma.h which in turn bring in
sys/errno.h and sys/malloc.h.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D44465
Add all the bits from the NVMe 2.0 base specification: CMD_EFFECTS to
indicate the commands and effects log page is supported, TELEMETRY to
indicate that the telemetry log pages and protocols are supported,
PERSISTENT_EVENTS to indicate the persistent event log is supported,
LOG_PAGES_PAGE to indicate that various log pages related to log page
and command support are supported: L0, L5, L12, and L13; and
DA4_TELEMETRY to indicate that the DA4 area is supported for telemetry
data.
Sponsored by: Netflix
This is used in NVMe over Fabrics to enumerate a list of available
controllers.
Reviewed by: imp
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D44446
nvme(4) doesn't check this flag, but Fabrics implementations may need
to set this flag in the log page attributes cdata field.
Reviewed by: imp
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D44444
This is not used in nvme(4) but is used in NVMe over Fabrics
transports which use SGLs to describe buffers instead of PRPs.
While here, adjust the shift value for the FUSE field to be relative
to the 'fuse' member of 'struct nvme_command'.
Reviewed by: imp
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D44443
Fabrics capsules use an SGL structure instead of prp1/2 addresses to
describe the data buffer used for a command. The SGL structure is
added to a union with the existing prp1/2 fields.
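
A rough sketch of the shape this takes. The type and member names are
hypothetical; the 16-byte descriptor layout (8-byte address, 4-byte
length, 3 reserved bytes, 1 type byte) is that of an SGL data block
descriptor.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name for a 16-byte SGL data block descriptor. */
struct ex_sgl_descriptor {
	uint64_t address;
	uint32_t length;
	uint8_t  reserved[3];
	uint8_t  type;
};

/* Data pointer: either a pair of PRP entries or a single SGL descriptor. */
struct ex_dptr {
	union {
		struct {
			uint64_t prp1;
			uint64_t prp2;
		};
		struct ex_sgl_descriptor sgl;
	};
};

/* Both views must occupy the same 16 bytes of the command. */
static_assert(sizeof(struct ex_sgl_descriptor) == 16, "SGL is 16 bytes");
static_assert(sizeof(struct ex_dptr) == 16, "union does not grow the command");
```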
Reviewed by: imp
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D44442
These are useful for NVMe over Fabrics.
Reviewed by: imp
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D44441
We only see a request with a failed controller while we're in the
process of failing the controller. Add a comment to that effect.
Sponsored by: Netflix
We're logging when we start a reset, but not when we complete it, nor
the result. Now log a success or timed_out event for the reset.
Currently, the only detectable error we have from reset is 'failure to
become ready in time,' though the code looks like it might be more
generic. Log this and if we ever have other failure modes, change the
logging to devd when that happens.
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D44211
Change the devctl events slightly for the controller. SMART errors will
log the changed bits in the NVME SMART Critical Warning State as its
event.
Reset will now emit 'event=start'. Soon more.
Sponsored by: Netflix
Reviewed by: mav
Differential Revision: https://reviews.freebsd.org/D44210
Split the devctl aspect of things out to its own function in
nvme_ctrlr_devctl_log. In preparing to document this, and based on
actual use, we want something different for the SMART errors, so this
will facilitate that.
Sponsored by: Netflix
Reviewed by: chuck, mav
Differential Revision: https://reviews.freebsd.org/D44209
In particular, don't try to byteswap the values as 64-bit integers and
always print a non-empty version as a string.
Reviewed by: chuck, imp
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D44121
This macro accepts a field name and a value for the field and
constructs the shifted field value.
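
A minimal sketch of such a macro, assuming each field has matching
FOO_SHIFT and FOO_MASK constants with the mask given unshifted. The
EX_ names and values below are made up for illustration and are not the
driver's definitions.

```c
#include <stdint.h>

/* Hypothetical field definitions following a FOO_SHIFT / FOO_MASK convention. */
#define EX_CMD_FUSE_SHIFT 0
#define EX_CMD_FUSE_MASK  0x3
#define EX_CMD_PSDT_SHIFT 6
#define EX_CMD_PSDT_MASK  0x3

/* Build a field's shifted value: mask the value, then shift it into place. */
#define EX_FIELD(field, value) \
	(((uint32_t)(value) & field##_MASK) << field##_SHIFT)
```

A caller would then write something like EX_FIELD(EX_CMD_PSDT, 1) instead
of open-coding the mask and shift at each use.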
Reviewed by: chuck
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D43604
A few of these omitted a shift of 0, but this is more consistent.
Reviewed by: chuck
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D43602