- Add icl_pdu_append_bio and icl_pdu_get_bio methods.
- Add new page pod routines for allocating and writing page pods for
unmapped bio requests. Use these new routines for setting up DDP
for iSCSI tasks with a SCSI I/O CCB which uses CAM_DATA_BIO.
- When ICL_NOCOPY is used to append data from an unmapped I/O request
to a PDU, construct unmapped mbufs from the relevant pages backing
the struct bio. This also requires changes in the t4_push_pdus path
to support unmapped mbufs.
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D34383
Otherwise we end up copying one uninitialized byte into the socket
buffer.
Reported by: KMSAN
Reviewed by: jhb
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33953
The advanced TCP stacks (bbr, rack) may decide to drop a TCP connection
when they do output on it. The default stack never does this, thus
existing framework expects tcp_output() always to return locked and
valid tcpcb.
Provide KPI extension to satisfy demands of advanced stacks. If the
output method returns negative error code, it means that caller must
call tcp_drop().
In tcp_var() provide three inline methods to call tcp_output():
- tcp_output() is a drop-in replacement for the default stack, so that
default stack can continue using it internally without modifications.
For advanced stacks it would perform tcp_drop() and unlock and report
that with negative error code.
- tcp_output_unlock() handles the negative code and always converts
it to positive and always unlocks.
- tcp_output_nodrop() just calls the method and leaves the responsibility
to drop on the caller.
Sweep over the advanced stacks and use new KPI instead of using HPTS
delayed drop queue for that.
Reviewed by: rrs, tuexen
Differential revision: https://reviews.freebsd.org/D33370
Previously, sorele() always required the socket lock and dropped the
lock if the released reference was not the last reference. Many
callers locked the socket lock just before calling sorele() resulting
in a wasted lock/unlock when not dropping the last reference.
Move the previous implementation of sorele() into a new
sorele_locked() function and use it instead of sorele() for various
places in uipc_socket.c that called sorele() while already holding the
socket lock.
The sorele() macro now uses refcount_release_if_not_last() try to drop
the socket reference without locking the socket. If that shortcut
fails, it locks the socket and calls sorele_locked().
Reviewed by: kib, markj
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D32741
For TCP DDP, handle_ddp_close() needs to see the pre-FIN rcv_nxt to
determine how much data was placed in the local buffer before the FIN
was received. The changes in d59f1c49e2 broke this by updating
rcv_nxt before calling handle_ddp_close().
Fixes: d59f1c49e2 cxgbe tom: Permit rcv_nxt mismatches on FIN for iSCSI connections on T6.
Sponsored by: Chelsio Communications
This is similar to the fixes in 141fe2dcee. One difference is that
TOE sockets do not change states (listen vs non-listen) once created,
so no lock is needed for SOLISTENING().
Sponsored by: Chelsio Communications
In preparation for moving sockbuf locks into the containing socket,
provide alternative macros for the sockbuf I/O locks:
SOCK_IO_SEND_(UN)LOCK() and SOCK_IO_RECV_(UN)LOCK(). These operate on a
socket rather than a socket buffer. Note that these locks are used only
to prevent concurrent readers and writters from interleaving I/O.
When locking for I/O, return an error if the socket is a listening
socket. Currently the check is racy since the sockbuf sx locks are
destroyed during the transition to a listening socket, but that will no
longer be true after some follow-up changes.
Modify a few places to check for errors from
sblock()/SOCK_IO_(SEND|RECV)_LOCK() where they were not before. In
particular, add checks to sendfile() and sorflush().
Reviewed by: tuexen, gallatin
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31657
Implement kernel support for RFC 5549/8950.
* Relax control plane restrictions and allow specifying IPv6 gateways
for IPv4 routes. This behavior is controlled by the
net.route.rib_route_ipv6_nexthop sysctl (on by default).
* Always pass final destination in ro->ro_dst in ip_forward().
* Use ro->ro_dst to exract packet family inside if_output() routines.
Consistently use RO_GET_FAMILY() macro to handle ro=NULL case.
* Pass extracted family to nd6_resolve() to get the LLE with proper encap.
It leverages recent lltable changes committed in c541bd368f.
Presence of the functionality can be checked using ipv4_rfc5549_support feature(3).
Example usage:
route add -net 192.0.0.0/24 -inet6 fe80::5054:ff:fe14:e319%vtnet0
Differential Revision: https://reviews.freebsd.org/D30398
MFC after: 2 weeks
A socket in the FIN_WAIT_1 state is marked disconnected by
do_close_con_rpl() even though there might still receive data pending.
This is because the socket at that point has set SBS_CANTRCVMORE which
causes the protocol layer to discard any data received before the FIN.
However, icl_cxgbei_conn_close needs to wait until all the data has
been discarded. Replace the wait for SS_ISDISCONNECTED with instead
waiting for final_cpl_received() to be called.
Reported by: Jithesh Arakkan @ Chelsio
Sponsored by: Chelsio Communications
ISO can be disabled before establishing a connection by setting
dev.tNnex.N.toe.iso to 0.
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D31223
The remote peer might send a FIN in the middle of a burst of data
PDUs. In the case of T6 with data PDU completion moderation, the
driver would not have seen these PDUs since the final PDU in the burst
was never received resulting in a stale rcv_nxt when the FIN is
received.
While here, invert the logic in the condition to be more readable and
always set tp->rcv_nxt from the sequence number in the CPL. This sets
the proper value of rcv_nxt for FINs on connections with data received
but not reported via a CPL (e.g. a partial iSCSI PDU burst interrupted
by a FIN).
Reported by: Jithesh Arakkan @ Chelsio
Reviewed by: np
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D30871
The driver used to configure all available classes with some default
parameters on attach and the rest of t4_sched.c was written with the
assumption that all traffic classes are always valid in the hardware.
But this resulted in a lot of informational messages being logged in the
firmware's circular log, crowding out other more useful messages.
This change leaves the tx scheduler alone during attach to reduce the
spam in the devlog. The state of every class is now tracked separately
from its flags and there is support for an 'uninitialized' state.
MFC after: 2 weeks
Sponsored by: Chelsio Communications
Recent firmwares are able to utilize the traffic classes of tx channels
that were previously unused. This effectively doubles the number of
traffic classes available per port for 2 port cards. Stop using the raw
per-channel value in the driver and ask the firmware for the number of
usable traffic classes instead.
MFC after: 2 weeks
Sponsored by: Chelsio Communications
The NIC TLS and TOE TLS modes in cxgbe(4) both work with TLS key
contexts. Previously, TOE TLS supported TLS key contexts created by
two different methods, and NIC TLS had a separate bit of code copied
from NIC TLS but specific to KTLS. Now that TOE TLS only supports
KTLS, pull common code for creating TLS key contexts and programming
them into on-card memory into t4_keyctx.c.
Sponsored by: Chelsio Communications
TOE TLS offload was first supported via a customized OpenSSL developed
by Chelsio with proprietary socket options prior to KTLS being present
either in FreeBSD or upstream OpenSSL. With the addition of KTLS in
both places, cxgbe's TOE driver was extended to support TLS offload
via KTLS as well. This change removes the older interface leaving
only the KTLS bindings for TOE TLS.
Since KTLS was added to TOE TLS second, it was somehat shoe-horned
into the existing code. In addition to removing the non-KTLS TLS
offload, refactor and simplify the code to assume KTLS, e.g. not
copying keys into a helper structure that mimic'ed the non-KTLS mode,
but using the KTLS session object directly when constructing key
contexts.
This also removes some unused code to send TX keys inline in work
requests for TOE TLS. This code was never enabled, and was arguably
sending the wrong thing (it was not sending the raw key context as we
do for NIC TLS when using inline keys).
Sponsored by: Chelsio Communications
If an iSCSI connection is shutdown abruptly (e.g. by a RST from the
peer), pending iSCSI PDUs and page pod work requests can be in the
ulp_pduq when the final CPL is received indicating the death of the
connection.
Reported by: Jithesh Arakkan @ Chelsio
In 4427ac3675, the TOM driver stopped sending work requests to
program iSCSI page pods directly and instead queued them to be written
asynchronously with iSCSI PDUs. The queue of mbufs to send is
protected by the inp lock. However, the inp cannot be safely obtained
from the toep since a RST from the remote peer might have cleared
toep->inp asynchronously in an ithread. To fix, obtain the inp from
the socket as is already done in icl_cxgbei_conn_pdu_queue_cb() and
fail the new transfer setup with ECONNRESET if the connection has been
reset.
To avoid passing sockets or inps into the page pod routines, pull the
mbufq out of the two relevant page pod routines such that the routines
queue new work request mbufs to a caller-supplied mbufq.
Reported by: Jithesh Arakkan @ Chelsio
Fixes: 4427ac3675
- Process the list of local IPs once instead of once per adapter. Add
addresses from all VNETs to the driver's list but leave hardware
updates for later when the global VNET/IFADDR list locks have been
released.
- Add address to the hardware table synchronously when a CLIP entry is
requested for an address that's not already in there.
- Provide ioctls that allow userspace tools to manage addresses in the
CLIP table.
- Add a knob (hw.cxgbe.clip_db_auto) that controls whether local IPs are
automatically added to the CLIP table or not.
MFC after: 2 weeks
Sponsored by: Chelsio Communications
A CAM target layer I/O CCB can use a S/G list of virtual address ranges
to describe its data buffer. This change adds zero-copy receive support
for such requests.
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D29908
As a result, CPL_FW4_ACK now returns credits for these work requests.
To support this, page pod work requests are now constructed in special
mbufs similar to "raw" mbufs used for NIC TLS in plain TX queues.
These special mbufs are stored in the ulp_pduq and dispatched in order
with PDU work requests.
Sponsored by: Chelsio Communications
Discussed with: np
Differential Revision: https://reviews.freebsd.org/D29904
The driver uses both software resources (locks, callouts, memory for
descriptors and for bookkeeping, sysctls, etc.) and hardware resources
(VIs, DMA queues, TCAM entries, etc.) to operate the NIC. This commit
splits the single *_ALLOCATED flag used to track all these resources
into separate *_SW_ALLOCATED and *_HW_ALLOCATED flags.
This is the simplified pseudocode that now applies to most queues (foo
can be ctrlq/txq/rxq/ofld_txq/ofld_rxq):
/* Idempotent */
alloc_foo
{
if (!SW_ALLOCATED)
init_iq/init_eq/init_fl no-fail sw init
alloc_iq_fl/alloc_eq/alloc_wrq may-fail sw alloc
add_foo_sysctls, etc. no-fail post-alloc items
if (!HW_ALLOCATED)
alloc_iq_fl_hwq/alloc_eq_hwq hw resource allocation
}
/* Idempotent */
free_foo
{
if (!HW_ALLOCATED)
free_iq_fl_hwq/free_eq_hwq release hw resources
if (!SW_ALLOCATED)
free_iq_fl/free_eq/free_wrq release sw resources
}
The routines that take the driver to FULL_INIT_DONE and VI_INIT_DONE and
back are now all idempotent. The quiesce routines pay attention to the
HW_ALLOCATED flag and will not wait on the hardware for pidx/cidx
updates and other completions if this flag is not set.
MFC after: 1 month
Sponsored by: Chelsio Communications
The mbuf allocated could be a chain and must be freed with m_freem.
Reviewed by: jhb@
MFC after: 1 week
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D29579
This fixes a panic due to stale so->so_proto if t4_tom is unloaded and
one or more connections that were previously offloaded are still around
in TIME_WAIT state.
Reviewed by: jhb@
MFC after: 1 week
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D29503
This avoids some atomics by using counter_u64 for TX and relying on
existing single-threading (single ithread per rxq) for RX.
Reviewed by: np
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D29383
This type mirrors struct sge_ofld_rxq and holds state for TCP offload
transmit queues. Currently it only holds a work queue but will
include additional state in future changes.
Reviewed by: np
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D29382
The hw.cxgbe.kern_tls tunable was used for this in the past and if it
was set then all T6 adapters would be configured for NIC TLS operation
and could not be reconfigured for TOE without a reload. With this
change ifconfig can be used to manipulate toe and txtls caps like any
other caps. hw.cxgbe.kern_tls continues to work as usual but its
effects are not permanent any more.
* Enable nic_ktls_ofld in the default configuration file and use the
firmware instead of direct register manipulation to apply/rollback
NIC TLS configuration. This allows the driver to switch the hardware
between TOE and NIC TLS mode in a safe manner. Note that the
configuration is adapter-wide and not per-port.
* Remove the kern_tls config file as it works with 100G T6 cards only
and leads to firmware crashes with 25G cards. The configurations
included with the driver (with the exception of the FPGA configs) are
supposed to work with all adapters.
Reported by: Veeresh U.K. at Chelsio
MFC after: 2 weeks
Sponsored by: Chelsio Communications
Reviewed by: jhb@
Differential Revision: https://reviews.freebsd.org/D29291
This avoids mixing the use of two different enums which modern C
compilers warn about.
Reviewed by: np
MFC after: 2 weeks
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D29301
1. Query the firmware for filter mode, mask, and related ingress config
instead of trying to figure them out from hardware registers. Read
configuration from the registers only when the firmware does not
support this query.
2. Use the firmware to set the filter mode. This is the correct way to
do it and is more flexible as well. The filter mode (and associated
ingress config) can now be changed any time it is safe to do so.
The user can specify a subset of a valid mode and the driver will
enable enough bits to make sure that the mode is maxed out -- that
is, it is not possible to set another bit without exceeding the
total width for optional filter fields. This is a hardware
requirement that was not enforced by the driver previously.
MFC after: 2 weeks
Sponsored by: Chelsio Communications
These errors do not clear so to NULL, so the existing check was
treating these failures as success. The rest of do_pass_establish()
then tried to use the listen socket as if it was a connection socket
newly created by syncache_expand().
In addition, for negative return values, do not send a RST to the
peer.
Reported by: Sony Arpita Das @ Chelsio
Reviewed by: np
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D28243
The handshake timer can race with another thread sending a FIN or RST
to close a TOE TLS socket. Just bail from the timer without
rescheduling if the connection is closed when the timer fires.
Reported by: Sony Arpita Das @ Chelsio QA
Reviewed by: np
Differential Revision: https://reviews.freebsd.org/D27583
By default, if a TOE TLS socket stops receiving data for more than 5
seconds, revert the connection back to plain TOE mode. This provides
a fallback if the userland SSL library does not support KTLS. In
addition, for client TLS 1.3 sockets using connect(), the TOE socket
blocks before the handshake has completed since the socket option is
only invoked for the final handshake.
The timeout defaults to 5 seconds, but can be changed at boot via the
hw.cxgbe.toe.tls_rx_timeout tunable or for an individual interface via
the dev.<nexus>.toe.tls_rx_timeout sysctl.
Reviewed by: np
MFC after: 2 weeks
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D27470
This includes mbufs waiting for data from sendfile() I/O requests, or
mbufs awaiting encryption for KTLS.
Reviewed by: np
MFC after: 2 weeks
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D27469
If TOE TLS is requested for an unsupported cipher suite or TLS
version, disable TLS processing and fall back to plain TOE. In
addition, if an error occurs when saving the decryption keys in the
card's memory, disable TLS processing and fall back to plain TOE.
Reviewed by: np
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D27468
If a TOE TLS socket ends up using an unsupported TLS version or
ciphersuite, it must be downgraded to a "plain" TOE socket with TLS
encryption/decryption performed on the host. The previous
implementation of this fallback was incomplete and resulted in hung
connections.
Reviewed by: np
MFC after: 2 weeks
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D27467