Commit graph

338 commits

Author SHA1 Message Date
Gordon Bergling
a270ec979f nvmw(4): Fix a typo in a source code comment
- s/inaccessable/inaccessible/

(cherry picked from commit 6e8ab6715d)
2022-06-10 14:31:11 +02:00
Gordon Bergling
72692ac18c nvme(4): Fix a typo in a source code comment
- s/is is/is/

(cherry picked from commit dfa01f4f98)
2022-04-14 08:11:16 +02:00
Chuck Tuffli
7de3a3e919 nvme: fix spelling of Namespace
Fix spelling of a macro definition.

PR:		262141

(cherry picked from commit c2318cf80a)
2022-02-27 17:59:31 -08:00
Chuck Tuffli
c55ba19b70 nvme: Add OAES bit-field definitions
Create definitions for the Optional Asynchronous Events Supported (OAES)
values. Also adds a helper macro for the common use case of "mask and
shift". E.g.
    value = NVME_CTRLR_DATA_OAES_NS_ATTR_MASK << NVME_CTRLR_DATA_OAES_NS_ATTR_SHIFT;
becomes
    value = NVMEB(NVME_CTRLR_DATA_OAES_NS_ATTR);

(cherry picked from commit e71afa1202)
2022-02-27 17:59:31 -08:00
Alexander Motin
fb56e14ce6 nvme: Do not rearm timeout for commands without one.
Admin queues almost always have several ASYNC_EVENT_REQUEST outstanding.
They have no timeouts, but their presence in qpair->outstanding_tr caused
useless timeout callout rearming twice a second.

While there, relax timeout callout period from 0.5s to 0.5-1s to improve
aggregation.  Command timeouts are measured in seconds, so we don't need
to be precise here.

Reviewed by:	imp
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D33781

(cherry picked from commit b3c9b6060f)
2022-01-20 21:07:31 -05:00
Warner Losh
f102ae7159 nvme_sim: Only report PCI related stats when we can
For AHCI attached devices, we report the location and identification
information of the AHCI controller that we're attached to. We also
don't reprot link speed in that case, since we can't get to the PCIe
config space registers to find that out.

Sponsored by:		Netflix
Reviewed by:		mav
Differential Revision:	https://reviews.freebsd.org/D33287

(cherry picked from commit 8f07932272)
2022-01-20 21:07:31 -05:00
Warner Losh
cc20d75705 nvme_ahci: Mark AHCI devices as such in the controller
Add a quirk to flag AHCI attachment to the controller. This is for any
of the strategies for attaching nvme devices as children of the AHCI
device for Intel's RAID devices. This also has a side effect of cleaning
up resource allocation from failed nvme_attach calls now.

Sponsored by:		Netflix
Reviewed by:		mav
Differential Revision:	https://reviews.freebsd.org/D33285

(cherry picked from commit 7cf8d63c88)
2022-01-20 21:07:31 -05:00
Warner Losh
2b2925d1e8 nvme: Move to a quirk for the Intel alignment data
Prior to NVMe 1.3, Intel produced a series of drives that had
performance alignment data in the vendor specific space since no
standard had been defined. Move testing the versions to a quick so the
NVMe NS code doesn't know about PCI device info.

Sponsored by:		Netflix
Reviewed by:		mav
Differential Revision:	https://reviews.freebsd.org/D33284

(cherry picked from commit 053f8ed6eb)
2022-01-20 21:07:31 -05:00
Warner Losh
f022d47f2f nvme: Reduce traffic to the doorbell register
Reduce traffic to doorbell register when processing multiple completion
events at once. Only write it at the end of the loop after we've
processed everything (assuming we found at least one completion,
even if that completion wasn't valid).

Sponsored by:		Netflix
Reviewed by:		mav
Differential Revision:	https://reviews.freebsd.org/D32470

(cherry picked from commit 2ec165e3f0)
2022-01-20 21:07:31 -05:00
Warner Losh
13b711e8c8 nvme: Restore hotplug warning
Restore hotplug warning in recovery state machine. No functional change
other than what message gets printed.

Sponsored by:		Netflix

(cherry picked from commit 18dc12bfd2)
2022-01-20 21:07:31 -05:00
Warner Losh
86721e606c nvme: Use adaptive spinning when polling for completion or state change
We only use nvme_completion_poll in the initialization path. The
commands they queue and wait for finish quickly as they involve no I/O
to the drive's media. These command take about 20-200 microsecnds
each. Set the wait time to 1us and then increase it by 1.5 each
successive iteration (max 1ms). This reduces initialization time by
80ms in cpervica's tests.

Use this same technique waiting for RDY state transitions. This saves
another 20ms. In total we're down from ~330ms to ~2ms.

Tested by:		cperciva
Sponsored by:		Netflix
Reviewed by:		mav
Differential Review:	https://reviews.freebsd.org/D32259

(cherry picked from commit 83581511d9)
2022-01-20 21:07:31 -05:00
Warner Losh
50b3d57e71 nvme: Only reset once on attach.
The FreeBSD nvme driver has reset the nvme controller twice on attach to
address a theoretical issue assuring the hardware is in a known
state. However, exierence has shown the second reset is unnecessary and
increases the time to boot. Eliminate the second reset. Should there be
a situation when you need a second reset (for buggy or at least somewhat
out of the mainstream hardware), the hardware option NVME_2X_RESET will
restore the old behavior. Document this in nvme(4).

If there's any trouble at all with this, I'll add a sysctl tunable to
control it.

Sponsored by:		Netflix
Reviewed by:		cperciva, mav
Differential Revision:	https://reviews.freebsd.org/D32241

(cherry picked from commit 4b3da659bf)
2022-01-20 21:07:31 -05:00
Warner Losh
932df0a258 nvme: Remove pause while resetting
After some study of the code and the standard, I think we can just drop
the pause(), unconditionally.  If we're not initialized, then there's
nothing to wait for from a software perspective.  If we are initialized,
then there might be outstanding I/O. If so, then the qpair 'recovery
state' will transition to WAITING in nvme_ctrlr_disable_qpairs, which
will ignore any interrupts for items that complete before we complete
the reset by setting cc.en=0.

If we go on to fail the controller, we'll cancel the outstanding I/O
transactions.  If we reset the controller, the hardware throws away
pending transactions and we retry all the pending I/O transactions. Any
transactions that happend to complete before cc.en=0 will have the same
effect in the end (doing the same transaction twice is just inefficient,
it won't affect the state of the device any differently than having done
it once).

The standard imposes no wait times here, so it isn't needed from that
perspective.

Unanswered Question: Do we may need to disable interrupts while we
disable in legacy mode since those are level-sensitive.

Sponsored by:		Netflix
Reviewed by:		mav
Differential Revision:	https://reviews.freebsd.org/D32248

(cherry picked from commit e5e26e4a24)
2022-01-20 21:07:30 -05:00
Warner Losh
0c3d88b9ab nvme: Explain a workaround a little better
The don't touch the mmio of the drive after we do a EN 1->0 transition
is only for a tiny number of dirves that have this unforunate issue.

Sponsored by:		Netflix

(cherry picked from commit 77054a897f)
2022-01-20 21:07:30 -05:00
Warner Losh
02325a44b5 nvme_ctrlr_enable: Small style nits
Rewrite the nested if's using the preferred FreeBSD style for branches
of ifs that return. NFC. Minor tweaks to the comments to better fit new
code layout.

Sponsored by:		Netflix
Reviewed by:		mav, chuck (prior rev, but comments rolled in)
Differential Revision:	https://reviews.freebsd.org/D32245

(cherry picked from commit a245627a4e)
2022-01-20 21:07:30 -05:00
Warner Losh
21de49b06c nvme: Use MS_2_TICKS rather than rolling our own
Sponsored by:		Netflix
Reviewed by:		mav
Differential Revision:	https://reviews.freebsd.org/D32246

(cherry picked from commit 26259f6ab9)
2022-01-20 21:07:30 -05:00
Warner Losh
a8ce655445 nvme_ctrlr_enable: Remove unnecessary 5ms delays
Remove the 5ms delays after writing the administrative queue
registers. These delays are from the very earliest days of the driver
(they are in the first commit) and were most likely vestiges of the
Chatham NVMe prototype card that was used to create this driver. Many of
the workarounds necessary for it aren't necessary for standards
compliant cards. The original driver had other areas marked for Chatham,
but these were not. They are unneeded. There's three lines of supporting
evidence.

First, the NVMe standards make no mention of a delay time after these
registers are written. Second, the Linux driver doesn't have them, even
as an option. Third, all my nvme cards work w/o them.

To be safe, add a write barrier between setting up the admin queue and
enabling the controller.

Sponsored by:		Netflix
Reviewed by:		mav
Differential Revision:	https://reviews.freebsd.org/D32247

(cherry picked from commit d5fca1dc1d)
2022-01-20 21:07:30 -05:00
Warner Losh
1397c8ebb7 nvme: Sanity check completion id
Make sure the completion ID is in the range of [0..num_trackers) since
the values past the end of the act_tr array are never going to be valid
trackers and will lead to pain and suffering if we try to dereference
them to get the tracker or to set the tracker back to NULL as we
complete the I/O.

Sponsored by:		Netflix
Reviewed by:		mav, chs, chuck
Differential Revision:	https://reviews.freebsd.org/D32088

(cherry picked from commit 36a87d0c6f)
2022-01-20 21:07:30 -05:00
Warner Losh
86990decd7 nvme: count number of ignored interrupts
Count the number of times we're asked to process completions, but that
we ignore because the state of the qpair isn't in RECOVERY_NONE.

Sponsored by:		Netflix
Reviewed by:		mav, chuck
Differential Revision:	https://reviews.freebsd.org/D32212

(cherry picked from commit 587aa25525)
2022-01-20 21:07:30 -05:00
Warner Losh
7144882ae1 nvme: Add sanity check for phase on startup.
The proper phase for the qpiar right after reset in the first interrupt
is 1. For it, make sure that we're not still in phase 0. This is an
illegal state to be processing interrupts and indicates that we've
failed to properly protect against a race between initializing our state
and processing interrupts. Modify stat resetting code so it resets the
number of interrpts to 1 instead of 0 so we don't trigger a false
positive panic.

Sponsored by:		Netflix
Reviewed by:		cperciva, mav (prior version)
Differential Revision:	https://reviews.freebsd.org/D32211

(cherry picked from commit 7d5eebe0f4)
2022-01-20 21:07:30 -05:00
Warner Losh
e8f693131c nvme: start qpair in state RECOVERY_WAITING
An interrupt happens on the admin queue right away after the reset, so
as soon as we enable interrupts, we'll get a call to our interrupt
handler. It is safe to ignore this interrupt if we're not yet
initialized, or	to process it if we are. If we are initialized,	we'll
see there's no completion records and return. If we're not, we'll
process	no completion records and return. Either way, nothing is
processed and nothing is lost.

Until we've completely setup the qpair, we need to avoid processing
completion records. Start the qpair in the waiting recovery state so we
return immediately when we try to process completions. The code already
sets it to 'NONE' when we're initialization is complete. It's safe to
defer completion processing here because we don't send any commands
before the initialization of the software state of the qpair is
complete. And even if we were to somehow send a command prior to that
completing, the completion record for that command would be processed
when we send commands to the admin qpair after we've setup the software
state. There's no good central point to add an assert for this last
condition.

This fixes an KASSERT "received completion for unknown cmd" panic on
boot.

Fixes:			502dc84a8b
Sponsored by:		Netflix
Reviewed by:		mav, cperciva, gallatin
Differential Revision:	https://reviews.freebsd.org/D32210

(cherry picked from commit fa81f3731d)
2022-01-20 21:07:30 -05:00
Warner Losh
9bbd0a7ca9 nvme: Use shared timeout rather than timeout per transaction
Keep track of the approximate time commands are 'due' and the next
deadline for a command. twice a second, wake up to see if any commands
have entered timeout. If so, quiessce and then enter a recovery mode
half the timeout further in the future to allow the ISR to
complete. Once we exit recovery mode, we go back to operations as
normal.

Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D28583

(cherry picked from commit 502dc84a8b)
2022-01-20 21:07:30 -05:00
Warner Losh
24d2d1813e nvme/nda: Fail all nvme I/Os after controller fails
Once the controller has failed, fail all I/O w/o sending it to the
device. The reset of the nvme driver won't schedule any I/O to the
failed device, and the controller is in an indeterminate state and can't
accept I/O. Fail both at the top end of the sim and the bottom
end. Don't bother queueing up the I/O for failure in a different task.

Reviewed by:		chuck
Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D31341

(cherry picked from commit 4b977e6dda)
2022-01-20 21:07:30 -05:00
Colin Percival
eb4d2eab07 Add some nvme initialization routines to TSLOG
About 335 ms of EC2 instance boot time is being spent here.

(cherry picked from commit bad42df9bf)
2022-01-20 21:07:30 -05:00
Gordon Bergling
022814e08b nvme(4): Correct a typo in a sysctl description
- s/printting/printing/

(cherry picked from commit 5f8ccf6515)
2021-12-03 16:51:46 +01:00
Alexander Motin
0e6969a0f4 nvme(4): Add MSI and single MSI-X support.
If we can't allocate more MSI-X vectors, accept using single shared.
If we can't allocate any MSI-X, try to allocate 2 MSI vectors, but
accept single shared.  If still no luck, fall back to shared INTx.

This provides maximal flexibility in some limited scenarios.  For
example, vmd(4) does not support INTx and can handle only limited
number of MSI/MSI-X vectors without sharing.

MFC after:	1 week

(cherry picked from commit e3bdf3da76)
2021-09-06 21:24:54 -04:00
Alexander Motin
d9cacf0b66 nvme(4): Do not panic on admin queue construct error.
MFC after:	1 week

(cherry picked from commit 31111372e6)
2021-09-06 21:24:53 -04:00
Alexander Motin
2daa57cafb Mark some sysctls as CTLFLAG_MPSAFE.
MFC after:	2 weeks

(cherry picked from commit b776de6796)
2021-08-24 21:53:18 -04:00
Warner Losh
aebc14fcf5 nvme: Enable interrupts after qpair fully constructed
To guard against the ill effects of a spurious interrupt during
construction (or one that was bogusly pending), enable interrupts after
the qpair is completely constructed. Otherwise, we can die with null
pointer dereferences in nvme_qpair_process_completions. This has been
observed in at least one pre-release NVMe drive where the MSIX interrupt
fired while the queue was being created, before we'd started the NVMe
controller card.

The alternative of only turning on the interrupts after the rest was
tried, but was insufficient to work around this bug and made the code
more complicated w/o benefit.

Reviewed by:		mav, chuck
Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D31182

(cherry picked from commit fc9a084023)
2021-07-21 10:13:11 -06:00
Alexander Motin
a469b89bf2 nvme(4): Report NPWA before NPWG as stripesize.
New Samsung 980 SSDs report Namespace Preferred Write Alignment of
8 (4KB) and Namespace Preferred Write Granularity of 32 (16KB).
My quick tests show that 16KB is a minimal sequential write size
when the SSD reaches peak IOPS, so writing much less is very slow.
But writing slightly less or slightly more does not change much,
so it seems not so much a size granularity as minimum I/O size.

Thinking about different stripesize consumers:
 - Partition alignment should be based on NPWA by definition.
 - ZFS ashift in part of forcing alignment of all I/Os should also
be based on NPWA.  In part of forcing size granularity, if really
needed, it may be set to NPWG, but too big value can make ZFS too
space-inefficient, and the 16KB is actually the biggest supported
value there now.
 - ZFS recordsize/volblocksize could potentially be tuned up toward
NPWG to work as I/O size granularity, but enabled compression makes
it too fuzzy.  And those are normally user-configurable things.
 - ZFS I/O aggregation code could definitely use Optimal Write Size
value and may be NPWG, but we don't have fields in GEOM now to report
the minimal and optimal I/O sizes, and even maximal is not reported
outside GEOM DISK to be used by ZFS.

MFC after:	1 week

(cherry picked from commit e3bcd07d83)
2021-07-12 21:54:44 -04:00
Warner Losh
ffb294bd31 nvme: coherently read status of completion records
Coherently read the phase bit of the status completion record. We loop
over the completion record array, looking for all the transactions in
the same phase that have been completed. In doing that, we have to be
careful to read the status field first, and if it indicates a complete
record, we need to read and process that record. Otherwise, the host
might be overtaken by device when reading this completion record,
leading to a mistaken belief that the record is in phase. This leads to
the code using old values and looking at an already completed entry, which
has no current tracker.

To work around this problem, we read the status and make sure it is in
phase, we then re-read the entire completion record guaranteeing it's
complete, valid, and consistent . In addition we resync the dmatag to
reflect changes since the prior loop for the bouncing dma case.

Reviewed by:		jrtc27@, chuck@
Found by:		jrtc27 (this fix is based in part on her D30995 fix)
Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D31002

(cherry picked from commit aa0ab681ae)
2021-07-12 13:39:58 -06:00
Warner Losh
8c66f8811d nvme: Fix alignment on nvme structures
Remove __packed from nvme_command, nvme_completion and
nvme_dsm_trim. Add super-alignment to nvme_completion since it's always
at least that aligned in hardware (and in our existing uses of it
embedded in structures). It generates better code in
nvme_qpair_process_completions on riscv64 because otherwise the ABI
assumes a 4-byte alignment, and the same on all other platforms.

Reviewed by:		jrtc27@, mav@, chuck@
Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D31001

(cherry picked from commit fea3cf1d6d)
2021-07-12 13:30:56 -06:00
Warner Losh
1a3011fe0b nvme: style nit
Put the { on the same line as the struct nvme_foo when we define these
structures. It's FreeBSD standard and these were inconsistent.

Sponsored by:		Netflix

(cherry picked from commit 80a75155e1)
2021-07-12 13:30:55 -06:00
Warner Losh
1109ce0da2 nvme: fix a race between failing the controller and failing requests
Part of the nvme recovery process for errors is to reset the
card. Sometimes, this results in failing the entire controller. When nda
is in use, we free the sim, which will sleep until all the I/O has
completed. However, with only one thread, the request fail task never
runs once the reset thread sleeps here. Create two threads to allow I/O
to fail until it's all processed and the reset task can proceed.

This is a temporary kludge until I can work out questions that arose
during the review, not least is what was the race that queueing to a
failure task solved. The original commit is vague and other error paths
in the same context do a direct failure. I'll investigate that more
completely before committing changing that to a direct failure. mav@
raised this issue during the review, but didn't otherwise object.

Multiple threads, though, solve the problem in the mean time until other
such means can be perfected.

Reviewed by:		jhb@
Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D30366

(cherry picked from commit f0f4712165)
2021-07-12 13:30:55 -06:00
Warner Losh
c7b2c5da00 nvme: use config_intrhook_drain to avoid removable card races
nvme drives are configured early in boot. However, a number of the configuration
steps takes which take a while, so we defer those to a config intrhook that runs
before the root filesystem is mounted. At the same time, the PCI hot plug wakes
up and tests the status of the card. It may decide that the card has gone away
and deletes the child. As part of that process nvme_detach is called. If this
call happens after the config_intrhook starts to run, but before it is finished,
there's a race where we can tear down the device's soft state while the
config_intrhook is still using it. Use the new config_intrhook_drain to
disestablish the hook. Either it will be removed w/o running, or the routine
will wait for it to finish. This closes the race and allows safe hotplug at any
time, even very early in boot.

Sponsored by:		Netflix, Inc
Reviewed by:		jhb, mav
Differential Revision:	https://reviews.freebsd.org/D29006

(cherry picked from commit 8423f5d4c1)
2021-07-12 13:30:55 -06:00
Dmitry Chagin
02816ecfb9 Partially revert r248770.
Under geom(4) nvme_ns_bio_process() is on the path where sleep
is prohibited as g_io_shedule_down() calls THREAD_NO_SLEEPNG()
before geom->start().

Reviewed By:		imp
Differential Revision:	https://reviews.freebsd.org/D29539

(cherry picked from commit a78109d5db)
2021-04-16 11:33:32 +03:00
Alexander Motin
ed407c92e4 nvme: Replace potentially long DELAY() with pause().
In some cases like broken hardware nvme(4) may wait minutes for
controller response before timeout.  Doing so in a tight spin loop
made whole system unresponsive.

Reviewed by:	imp
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D29309
Sponsored by:	iXsystems, Inc.

(cherry picked from commit 4fbbe52365)
2021-03-23 21:26:00 -04:00
Warner Losh
0b90164992 nvme: Make nvme_ctrlr_hw_reset static
nvme_ctrlr_hw_reset is no longer used outside of nvme_ctrlr.c, so
make it static. If we need to change this in the future we can.

(cherry picked from commit dd2516fc07)
2021-02-24 11:10:51 -07:00
Warner Losh
4f4f44f70f nvme: use NVME_GONE rather than hard-coded 0xffffffff
Make it clearer that the value 0xfffffff is being used to detect the device is
gone. We use it other places in the driver for other meanings.

(cherry picked from commit 9600aa31aa)
2021-02-24 11:10:45 -07:00
Chuck Tuffli
e83fdf8bb3 fix big-endian platforms after 6733401935
The NVMe byte-swap routines for big-endian platforms used memcpy() to
move the unaligned 64-bit value into a temp register to byte swap it.
Instead of introducing a dependency, manually byte-swap the values in
place.

Point hat:	me
2021-01-08 14:41:45 -08:00
Chuck Tuffli
6733401935 nvmecontrol: add device self-test op and log page
Add decoding of the Device Self-test log page and the ability to start
or abort a test.

Reviewed by:	imp, mav
Tested by:	Muhammad Ahmad <muhammad.ahmad@seagate.com>
MFC after:	2 weeks
Differential Revision: https://reviews.freebsd.org/D27517
2021-01-08 09:27:56 -08:00
Warner Losh
082905cad1 nvme: Remove a wmb() that's not necessary.
bus_dmamap_sync() ensures that memory that's prepared for PREWRITE can
be DMA'd immediately after it returns. The details differ, but this
mirrors atomic thread release semantics, at least for the buffers
synced.

For non-x86 platforms, bus_dmamap_sync() has the right syncing and
fences. So in the past, wmb() had been omitted for them.

For x86 platforms, the memory ordering is already strong enough to
ensure DMA to the device sees the current contents. As such, we don't
need the wmb() here. It translates to an sfence which is only needed
for writes to regions that have the write combining attribute set or
when some exotic opcodes are used. The nvme driver does neither of
these. Since bus_dmamap_sync() includes atomic_thread_fence_rel, we
can be assured any optimizer won't reorder the bus_dmamap_sync and the
bus_space_write operations. The wmb() was a vestiage of the pre-busdma
version initially committed to the tree.

Reviewed by: kib@, gallatin@, chuck@, mav@
Differential Revision: https://reviews.freebsd.org/D27448
2020-12-04 21:34:48 +00:00
Michal Meloun
8f9d5a8dbf NVME: Multiple busdma related fixes.
- in nvme_qpair_process_completions() do dma sync before completion buffer
  is used.
- in nvme_qpair_submit_tracker(), don't do explicit wmb() also for arm
  and arm64. Bus_dmamap_sync() on these architectures is sufficient to ensure
  that all CPU stores are visible to external (including DMA) observers.
- Allocate completion buffer as BUS_DMA_COHERENT. On not-DMA coherent systems,
  buffers continuously owned (and accessed) by DMA must be allocated with this
  flag. Note that BUS_DMA_COHERENT flag is no-op on DMA coherent systems
  (or coherent buses in mixed systems).

MFC after:	4 weeks
Reviewed by:	mav, imp
Differential Revision: https://reviews.freebsd.org/D27446
2020-12-02 16:54:24 +00:00
Chuck Tuffli
8d08cdc721 nvme: Fix typo in definition
Change occurrences of "selt test" to "self tests in the NVMe header
file.

Reviewed by:	imp, mav
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D27439
2020-12-02 15:59:08 +00:00
Michal Meloun
cf7c062932 Always use the __unused attribute even for potentially unused parameters.
Requested by:	ian, imp
MFC with:	r368167
2020-12-01 08:52:13 +00:00
Michal Meloun
b2e9e573a3 Unbreak r368167 in userland. Decorate unused arguments.
Reported by:	kp, tuexen, jenkins, and many others
MFC with:	r368167
2020-11-30 14:51:48 +00:00
Michal Meloun
52a832072d NVME: Don't try to swap data on little endian machines.
These swapping functions violate BUSDMA contract - we cannot write
to armed (by bus_dmamap_sync(PRE_..)) buffers. Remove them at least
from little endian machines until a better solution will be developed.

Reviewed by:	imp
MFC after:	3 weeks
2020-11-30 07:01:12 +00:00
Alexander Motin
1770bae5f8 Remove aligment requirements for passthrough buffer.
After r368124 vmapbuf() should happily map misaligned maxphys-sized buffers
thanks to extra page added to pbuf_zone.
2020-11-29 00:57:19 +00:00
Alexander Motin
ac90f70d1e Increase nvme(4) maximum transfer size from 1MB to 2MB.
With 4KB page size the 2MB is the maximum we can address with one page PRP.
Going further would require chaining, that would add some more complexity.

On the other side, to reduce memory consumption, allocate the PRP memory
respecting maximum transfer size reported in the controller identify data.
Many of NVMe devices support much smaller values, starting from 128KB.
To do that we have to change the initialization sequence to pull the data
earlier, before setting up the I/O queue pairs.  The admin queue pair is
still allocated for full MIN(maxphys, 2MB) size, but it is not a big deal,
since there is only one such queue with only 16 trackers.

Reviewed by:	imp
MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2020-11-29 00:20:31 +00:00
Konstantin Belousov
cd85379104 Make MAXPHYS tunable. Bump MAXPHYS to 1M.
Replace MAXPHYS by runtime variable maxphys. It is initialized from
MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys.

Make b_pages[] array in struct buf flexible.  Size b_pages[] for buffer
cache buffers exactly to atop(maxbcachebuf) (currently it is sized to
atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1.
The +1 for pbufs allow several pbuf consumers, among them vmapbuf(),
to use unaligned buffers still sized to maxphys, esp. when such
buffers come from userspace (*).  Overall, we save significant amount
of otherwise wasted memory in b_pages[] for buffer cache buffers,
while bumping MAXPHYS to desired high value.

Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except a place which initialize maxphys.  Some random (and
arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted
straight.  Some drivers, which use MAXPHYS to size embeded structures,
get private MAXPHYS-like constant; their convertion is out of scope
for this work.

Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs,
dev/siis, where either submitted by, or based on changes by mav.

Suggested by: mav (*)
Reviewed by:	imp, mav, imp, mckusick, scottl (intermediate versions)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D27225
2020-11-28 12:12:51 +00:00