nvmf_submit_request() handles races with concurrent queue pair
destruction (or the queue pair being destroyed between
nvmf_allocate_request and nvmf_submit_request), so the lock is not
needed here. This avoids holding the lock across transport-specific
logic such as queueing mbufs for PDUs to a socket buffer, etc.
Holding the lock across nvmf_allocate_request() ensures that the queue
pair pointers in the softc are still valid as shutdown attempts will
block on the lock before destroying the queue pairs.
Sponsored by: Chelsio Communications
The last reference on a pending I/O request might be held by an mbuf
in the socket buffer. When this mbuf is freed, the I/O request is
completed which triggers completion of the CCB. However, this can
occur with locks held (e.g. with so_snd locked when the mbuf is freed
by sbdrop()) raising a LOR between so_snd and the CAM device lock.
Instead, defer CCB completion processing to a thread where locks are
not held.
Sponsored by: Chelsio Communications
Some block devices may request datamove operations from an ithread
context while holding locks. Queue datamove operations to a taskqueue
backed by a thread pool to safely permit blocking allocations, etc. in
datamove handling.
Reviewed by: asomers
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D46551
The previous version of tcp_send_controller_data avoided sending a
chain of multiple mbufs that exceeded the limit, but if an individual
mbuf was larger than the limit it was sent as a single, over-sized
PDU. Fix by using m_split() to split individual mbufs larger than the
limit.
Note that this is not a protocol error, per se, as there is no limit
on C2H_DATA PDU lengths (unlike the MAXH2CDATA parameter). This fix
just honors the administrative limit more faithfully. This case is
also very unlikely with the default limit of 256k.
Sponsored by: Chelsio Communications
The increment of 1 was intended to convert qp->maxr2t from 0's based
to 1 based before multiplying by the queue length.
Sponsored by: Chelsio Communications
If a queue pair is destroyed (e.g. due to the TCP connection dropping)
while a host to controller data transfer is in progress, the
pending_r2ts counter can be non-zero. This can later trigger an
assertion failure when the capsule is freed. To fix, update the
relevant R2T accounting stats when aborting active command buffers
during queue pair destruction.
Sponsored by: Chelsio Communications
This sysctl sets a cap on the maximum payload of transmitted data PDUs
including both C2H_DATA and H2C_DATA PDUs, not just C2H_DATA PDUs.
Sponsored by: Chelsio Communications
If a PDU (such as a Command Capsule PDU) on a connection that has
enabled data digests does not have a data section, it will not have
the the PDU data digest flag set. The previous check was requiring
this flag to be present on all PDU types that support data sections
even if no data was included in the PDU.
Sponsored by: Chelsio Communications
If an association is disconnected during a clean shutdown, abort all
pending and future I/O requests with an error to avoid hangs either due
to filesystem unmounts or a stuck GEOM event.
If an association is connected during a clean shutdown, gracefully
disconnect from the remote controller and close the open queues.
Reviewed by: imp
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D45462
Add a kern.nvmf.fail_on_disconnection sysctl similar to the
kern.iscsi.fail_on_disconnection sysctl. This causes pending I/O
requests to fail with an error if an association is disconnected
instead of requeueing to be retried once the association is
reconnected. As with iSCSI, the default is to queue and retry
operations.
Reviewed by: imp
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D45308
While a host was disconnected from a remote controller, namespaces
might have been added, removed, or altered properties. Rescan the
namespaces after reconnecting to detect any such changes.
Reviewed by: imp
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D45461
Previously this just punted with a warning message.
Reviewed by: imp
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D45460
This function accepts a namespace ID and associated namespace data
from IDENTIFY and takes care of updating nvmeXnY and ndaZ.
Reviewed by: imp
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D45459
Rename to nvmf_scan_active_namespaces and accept an additional
callback function and callback argument. The callback is invoked on
each active namespace enumerated by the active namespace list from the
IDENTIFY command.
Reviewed by: imp
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D45458
Changes the device name for NVMe and NVMe-oF namespaces from using "ns"
to "n" to be more compatible with other operating systems. For example,
a device which was previously /dev/nvme0ns1 is now /dev/nvme0n1.
Preserves the existing functionality by creating alias from nvmeXnY to
nvmeXnsY.
Reviewed by: imp
MFC after: 1 month
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D45414
This is leftover from an earlier iteration of the code where 'nt' was
not dynamically allocated but was the passed in 'ops' pointer so was
always alive.
Reported by: Coverity Scan
CID: 1545042
Sponsored by: Chelsio Communications
This avoids -Wcast-align warnings from clang when upcasting from
struct nvmf_fabric_cmd to struct nvmf_fabric_prop_set_cmd.
Reported by: bapt
Sponsored by: Chelsio Communications
The protocol structures do not need explicit packing and static
assertions verify the size of all the structures as well as the
offsets of several key fields. The pragma triggers warnings when
building with GCC.
Sponsored by: Chelsio Communications
This is the server (target in SCSI terms) for NVMe over Fabrics.
Userland is responsible for accepting a new queue pair and receiving
the initial Connect command before handing the queue pair off via an
ioctl to this CTL frontend.
This frontend exposes CTL LUNs as NVMe namespaces to remote hosts.
Users can ask LUNS to CTL that can be shared via either iSCSI or
NVMeoF.
Reviewed by: imp
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D44726
This is the client (initiator in SCSI terms) for NVMe over Fabrics.
Userland is responsible for creating a set of queue pairs and then
handing them off via an ioctl to this driver, e.g. via the 'connect'
command from nvmecontrol(8). An nvmeX new-bus device is created
at the top-level to represent the remote controller similar to PCI
nvmeX devices for PCI-express controllers.
As with nvme(4), namespace devices named /dev/nvmeXnsY are created and
pass through commands can be submitted to either the namespace devices
or the controller device. For example, 'nvmecontrol identify nvmeX'
works for a remote Fabrics controller the same as for a PCI-express
controller.
nvmf exports remote namespaces via nda(4) devices using the new NVMF
CAM transport. nvmf does not support nvd(4), only nda(4).
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D44714
Structurally this is very similar to the TCP transport for iSCSI
(icl_soft.c). One key difference is that NVMeoF transports use a more
abstract interface working with NVMe commands rather than transport
PDUs. Thus, the data transfer for a given command is managed entirely
in the transport backend.
Similar to icl_soft.c, separate kthreads are used to handle transmit
and receive for each queue pair. On the transmit side, when a capsule
is transmitted by an upper layer, it is placed on a queue for
processing by the transmit thread. The transmit thread converts
command response capsules into suitable TCP PDUs where each PDU is
described by an mbuf chain that is then queued to the backing socket's
send buffer. Command capsules can embed data along with the NVMe
command.
On the receive side, a socket upcall notifies the receive kthread when
more data arrives. Once enough data has arrived for a PDU, the PDU is
handled synchronously in the kthread. PDUs such as R2T or data
related PDUs are handled internally, with callbacks invoked if a data
transfer encounters an error, or once the data transfer has completed.
Received capsule PDUs invoke the upper layer's capsule_received
callback.
struct nvmf_tcp_command_buffer manages a TCP command buffer for data
transfers that do not use in-capsule-data as described in the NVMeoF
spec. Data related PDUs such as R2T, C2H, and H2C are associated with
a command buffer except in the case of the send_controller_data
transport method which simply constructs one or more C2H PDUs from the
caller's mbuf chain.
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D44712
nvmf_transport.ko provides routines for managing NVMeoF queue pairs
and capsules. It provides a glue layer between transports (such as
TCP or RDMA) and an NVMeoF host (initiator) and controller (target).
Unlike the synchronous API exposed to the host and controller by
libnvmf, the kernel's transport layer uses an asynchronous API built
on callbacks. Upper layers provide callbacks on queue pairs that are
invoked for transport errors (error_cb) or anytime a capsule is
received (receive_cb).
Data transfers for a command are usually associated with a callback
that is invoked once a transfer has finished either due to an error
or successful completion.
For an upper layer that is a host, command capsules are allocated and
populated with an NVMe SQE by calling nvmf_allocate_command. A data
buffer (described by a struct memdesc) can be associated with a
command capsule before it is transmitted via nvmf_capsule_append_data.
This function accepts a direction (send vs receive) as well as the
data transfer callback. The host then transmits the command via
nvmf_transmit_capsule. The host must ensure that the data buffer
described by the 'struct memdesc' remains valid until the data
transfer callback is called. The queue pair's receive_cb callback
should match received response capsules up with previously transmitted
commands.
For the controller, incoming commands are received via the queue
pair's receive_cb callback. nvmf_receive_controller_data is used to
retrieve any data from a command (e.g. the data for a WRITE command).
It can be called multiple times to split the data transfer into
smaller sizes. This function accepts an I/O completion callback that
is invoked once the data transfer has completed.
nvmf_send_controller_data is used to send data to a remote host in
response to a command. In this case a callback function is not used
but the status is returned synchronously. Finally, the controller can
allocate a response capsule via nvmf_allocate_response populated with
a supplied CQE and send the response via nvmf_transmit_capsule.
Reviewed by: imp
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D44711
This includes functions to validate NVMe Qualified Names, compute an
initial value of the CAP property, validate changes to the CC
property, and populate the Identify Controller data structure for an
I/O controller.
Reviewed by: imp
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D44709
- Helper macros for specific SGL types used with the TCP transport
- An inline function which validates various fields in TCP PDUs
Reviewed by: imp
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D44708
This defines structures, ioctl commands, and related constants used
for both the Fabrics host and controller.
Reviewed by: imp
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D44706
- Add opcode, command structure, and new error code for Disconnect
fabrics opcode.
- Add a generic struct nvmf_fabric_command.
- Add constants for special controller ID values.
- Add constants for the cattr field in the Connect command and the
default value for the kato field in the Connect command.
- Add constants for the offset of controller properties (Fabrics
version of controller registers).
Reviewed by: imp
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D44705
- Replace SPDK_STATIC_ASSERT with _Static_assert.
- Remove SPDK_ and spdk_ prefixes from types and constants.
- Switch to using FreeBSD headers, e.g. <dev/nvme/nvme.h> in place of
"spdk/nvme_spec.h".
- Add a definition of NVME_NQN_FIELD_SIZE (from SPDK's nvme_spec.h).
- Remove constant for the fabrics opcode as this is already present in
<dev/nvme/nvme.h>.
- Use types from <dev/nvme/nvme.h> for NVMe structures including
struct nvme_sgl_descriptor, struct nvme_command, and
struct nvme_completion.
- Use plain uint16_t in place of struct spdk_nvme_status.
Reviewed by: imp
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D44704
This is a copy of spdk/include/spdk/nvmf_spec.h as of commit
470e851852bb948334a272c9f8de495020fa082f from Intel's SPDK.
Subsequent commits will modify it to be suitable header for the
kernel, but importing the stock file first makes it easier to see
how the resulting header is derived from the original.
Reviewed by: imp
Obtained from: SPDK (https://github.com/spdk/spdk.git)
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D44703