in mlx5core. The EEPROM information is not only a property of the
mlx5en(4) driver.
Submitted by: slavash@
MFC after: 3 days
Sponsored by: Mellanox Technologies
The following sysctls are added:
dev.mce.N.conf.qos.cable_length
dev.mce.N.conf.qos.buffers_size
dev.mce.N.conf.qos.buffers_prio
Submitted by: kib@
MFC after: 3 days
Sponsored by: Mellanox Technologies
All prints in mlx5core should use on of the macros:
mlx5_core_err/dbg/warn
Submitted by: slavash@
MFC after: 3 days
Sponsored by: Mellanox Technologies
In case of health counter fails to increment it indicates a bad device health.
In case when the syndrome indicated by firmware is 0x0, this indicates that
firmware is unable to respond to initialization segment reads.
Add proper print in this case.
Submitted by: slavash@
MFC after: 3 days
Sponsored by: Mellanox Technologies
MPFS is a logical switch in the Mellanox device which forward packets
based on a hardware driven L2 address table, to one or more physical-
or virtual- functions. The physical- or virtual- function is required
to tell the MPFS by using the MPFS firmware commands, which unicast
MAC addresses it is requesting from the physical port's traffic.
Broadcast and multicast traffic however, is copied to all listening
physical- and virtual- functions and does not need a rule in the MPFS
switching table.
Linux commit: eeb66cdb682678bfd1f02a4547e3649b38ffea7e
MFC after: 3 days
Sponsored by: Mellanox Technologies
Add the 512 bytes limit of RDMA READ and the size of remote address to the max
SGE calculation.
Submitted by: slavash@
Linux commit: 288c01b746aa
MFC after: 3 days
Sponsored by: Mellanox Technologies
Currently only suspend requests are acknowledged by writing an empty
string back to the xenstore control node, but poweroff or reboot
requests are not acknowledged and FreeBSD simply proceeds to perform
the desired action.
Fix this by acknowledging all requests, and remove the suspend specific
ack done in the handler.
Sponsored by: Citrix Systems R&D
MFC after: 3 days
As of r347221 the iflib legacy interrupt mode setup assumes that drivers
perform both receive and transmit processing from the interrupt handler.
This assumption is invalid in the vmxnet3 driver, so introduce the
IFLIB_SINGLE_IRQ_RX_ONLY flag to make iflib avoid tx processing in the
interrupt handler.
PR: 239118
Reported and tested by: Juraj Lutter <otis@sk.freebsd.org>
Obtained from: marius
Reviewed by: gallatin
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D21831
r347183 bumped GEOM classes to SI_ORDER_SECOND to resolve a race between
them and the initialization of devsoftc.mtx in devinit, but missed this
dependency on g_flashmap that may now lose the race against GEOM
classes/g_init.
There's a great comment that describes the situation that has also been
updated with the new ordering of GEOM classes.
Reported by: bdragon
MFC after: 4 days
On rockchip board it seems that the value in the DTS
are not enough for reseting the chip, I don't know if
the value are really incorrect or if DELAY is not precise
enough or if the rockchip gpio driver have some "lag" of some
kind or not.
For now just add more delay.
No functional change intended.
The intent is to add a "legacy" e820 pmem newbus bus for nvdimm device in a
subsequent revision, and it's a little more clear if the parent buses get
independent source files.
Quite a lot of ACPI-specific logic is left in nvdimm.c; disentangling that
is a much larger change (and probably not especially useful).
Reviewed by: kib
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D21813
Those functions are used by kernel, and we can't check all possible argument
errors in production kernel. Plus according to docs many of those errors
are checked by hardware. Assertions should just help with code debugging.
MFC after: 2 weeks
The TUNABLE_INT_FETCH is macro around getenv_int() and we will get
return value 0 or 1 for failure or success, we can use it to decide
which background color to use.
Convert all remaining references to that field to "ref_count" and update
comments accordingly. No functional change intended.
Reviewed by: alc, kib
Sponsored by: Intel, Netflix
Differential Revision: https://reviews.freebsd.org/D21768
Instead of predicting the MSI-X bar index based on the device's MAC
type, read it from the device's PCI configuration instead.
PR: 239704
Submitted by: Piotr Pietruszewski <piotr.pietruszewski@intel.com>
Reviewed by: erj@
MFC after: 3 days
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D21547
- For each queue pair precalculate CPU and domain it is bound to.
If queue pairs are not per-CPU, then use the domain of the device.
- Allocate most of queue pair memory from the domain it is bound to.
- Bind callouts to the same CPUs as queue pair to avoid migrations.
- Do not assign queue pairs to each SMT thread. It just wasted
resources and increased lock congestions.
- Remove fixed multiplier of CPUs per queue pair, spread them even.
This allows to use more queue pairs in some hardware configurations.
- If queue pair serves multiple CPUs, bind different NVMe devices to
different CPUs.
MFC after: 1 month
Sponsored by: iXsystems, Inc.
above 1Kbyte. It might look like some XHCI(4) controllers do not
support when the USB control transfer is split using a link TRB. The
next NORMAL TRB after the link TRB is simply failing with XHCI error
code 4. The quirk ensures we allocate a 64Kbyte buffer so that the
data stage TRB is not broken with a link TRB.
Found at: EuroBSDcon 2019
MFC after: 1 week
Sponsored by: Mellanox Technologies
libusb. This is useful for speeding up large data transfers while reducing
the interrupt rate.
Found at: EuroBSDcon 2019
MFC after: 1 week
Sponsored by: Mellanox Technologies
Allocate ioat->ring memory from the device domain.
Schedule ioat->poll_timer to the first CPU of the device domain.
According to pcm-numa tool from intel-pcm port, this reduces number of
remote DRAM accesses while copying data by 75%. And unless it is a noise,
I've noticed some speed improvement when copying data to other domain.
MFC after: 1 week
Sponsored by: iXsystems, Inc.
If there is an attempt to switch from a process-owned VT to a closed VT,
then vt(4) first requests the process to release its VT and only then
realizes that the target VT is closed and, so, the switch is not
possible. So, the driver does not actually do any switch, but at the
same time the owning process is not notified about that and it does not
re-acquire the VT.
This change adds an early check for the target VT state, so that the
switch can be refused before the process coordination dance.
On top of that, the code now checks for a failure of vt_window_switch()
and calls vt_window_postswitch() for the current VT if it is in the
process mode.
Test Plan:
- configure VT1 - VT8 (ttyv0 - ttyv7) to be text consoles (run getty)
- configure VT9 (ttyv8) to rn X server
- make sure that the X server configuration allows VT switching
- leave VT10 - VT12 unconfigured
- while in the X server press Ctrl+Alt+F10
- without the patch, observe strange screen content and problems with
keyboard input
- with the patch, observe that nothing happens
The problem has been observed and the fix has been tested with an nVidia
graphics card and the proprietary nvidia driver.
Not sure if that matters.
Reviewed by: ray
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D21704
BERI stands for Bluespec Extensible RISC Implementation, based on MIPS.
BERI has not implemented standard MIPS perfomance monitoring counters,
instead it provides statistical counters.
BERI statcounters have a several limitations:
- They can't be written
- They don't support start/stop operation
- None of hardware interrupt is provided on a counter overflow.
So make it separate to hwpmc_mips module and support process/system
counting mode only.
Sponsored by: DARPA, AFRL
- Remove a dead variable from the amd64 pmap_extract_and_hold().
- Fix grammar in the vm_page_wire man page.
Reported by: alc
Reviewed by: alc, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21639
Since TX interrupt is generated when THRE is set, wait for TEMT set means
wait for full character transmission time. At low speeds that may take
awhile, burning CPU time while holding sc_hwmtx lock, also congested.
This is partial revert of r317659.
PR: 240121
MFC after: 2 weeks
Errors are communicated between the i2c controller layer and upper layers
(iicbus and slave device drivers) using a set of IIC_Exxxxxx constants which
effectively define a private number space separate from (and having values
that conflict with) the system errno number space. Sometimes it is necessary
to report a plain old system error (especially EINTR) from the controller or
bus layer and have that value make it back across the syscall interface
intact.
I initially considered replicating a few "crucial" errno values with similar
names and new numbers, e.g., IIC_EINTR, IIC_ERESTART, etc. It seemed like
that had the potential to grow over time until many of the errno names were
duplicated into the IIC_Exxxxx space.
So instead, this defines a mechanism to "encode" an errno into the IIC_Exxxx
space by setting the high bit and putting the errno into the lower-order
bits; a new errno2iic() function does this. The existing iic2errno()
recognizes the encoded values and extracts the original errno out of the
encoded value. An interesting wrinkle occurs with the pseudo-error values
such as ERESTART -- they aleady have the high bit set, and turning it off
would be the wrong thing to do. Instead, iic2errno() recognizes that lots of
high bits are on (i.e., it's a negative number near to zero) and just
returns that value as-is.
Thus, existing drivers continue to work without needing any changes, and
there is now a way to return errno values from the lower layers. The first
use of that is in iicbus_poll() which does mtx_sleep() with the PCATCH flag,
and needs to return the errno from that up the call chain.
Differential Revision: https://reviews.freebsd.org/D20975
PSCI code to use it.
This interface will also be used by Intel Stratix 10 platform.
This was not tested on arm due to lack of PSCI-enabled arm hardware
lying around.
Reviewed by: andrew
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D21439
Execution of "Soft reset" command (IG4_REG_RESETS_SKL) at controller init
stage sets SDA_HOLD register value to 0x0001 which is often too low for
normal operation.
Set SDA_HOLD back to 28 after reset to restore controller functionality.
PR: 240339
Reported by: imp, GregV, et al.
MFC after: 3 days
- VM_ALLOC_NOCREAT will grab without creating a page.
- vm_page_grab_valid() will grab and page in if necessary.
- vm_page_busy_acquire() automates some busy acquire loops.
Discussed with: alc, kib, markj
Tested by: pho (part of larger branch)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21546
races with page busy state. The object lock is still used as an interlock
to ensure that the identity stays valid. Most callers should use
vm_page_sleep_if_busy() to handle the locking particulars.
Reviewed by: alc, kib, markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21255
There are several mechanisms by which a vm_page reference is held,
preventing the page from being freed back to the page allocator. In
particular, holding the page's object lock is sufficient to prevent the
page from being freed; holding the busy lock or a wiring is sufficent as
well. These references are protected by the page lock, which must
therefore be acquired for many per-page operations. This results in
false sharing since the page locks are external to the vm_page
structures themselves and each lock protects multiple structures.
Transition to using an atomically updated per-page reference counter.
The object's reference is counted using a flag bit in the counter. A
second flag bit is used to atomically block new references via
pmap_extract_and_hold() while removing managed mappings of a page.
Thus, the reference count of a page is guaranteed not to increase if the
page is unbusied, unmapped, and the object's write lock is held. As
a consequence of this, the page lock no longer protects a page's
identity; operations which move pages between objects are now
synchronized solely by the objects' locks.
The vm_page_wire() and vm_page_unwire() KPIs are changed. The former
requires that either the object lock or the busy lock is held. The
latter no longer has a return value and may free the page if it releases
the last reference to that page. vm_page_unwire_noq() behaves the same
as before; the caller is responsible for checking its return value and
freeing or enqueuing the page as appropriate. vm_page_wire_mapped() is
introduced for use in pmap_extract_and_hold(). It fails if the page is
concurrently being unmapped, typically triggering a fallback to the
fault handler. vm_page_wire() no longer requires the page lock and
vm_page_unwire() now internally acquires the page lock when releasing
the last wiring of a page (since the page lock still protects a page's
queue state). In particular, synchronization details are no longer
leaked into the caller.
The change excises the page lock from several frequently executed code
paths. In particular, vm_object_terminate() no longer bounces between
page locks as it releases an object's pages, and direct I/O and
sendfile(SF_NOCACHE) completions no longer require the page lock. In
these latter cases we now get linear scalability in the common scenario
where different threads are operating on different files.
__FreeBSD_version is bumped. The DRM ports have been updated to
accomodate the KPI changes.
Reviewed by: jeff (earlier version)
Tested by: gallatin (earlier version), pho
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20486
this to 2k to prevent them from being truncated and ignored. It
appears to be a sanity check only, but bumping it to 2k allows both of
my iic hid devices to be parsed and the second one to work...
The iicdev_writeto() function basically does scatter-gather IO by filling
in a pair of iic_msg structs to write the register address then the data
from different locations but with a single bus START/xfer/STOP sequence.
It turns out several low-level i2c controller drivers do not honor the
IIC_NOSTART flag, so the second piece of the write gets a new START on
the bus, and that confuses the ads111x chips which expect a continuous
write of 3 bytes to set a register.
A proper fix for this is to track down all the misbehaving controllers
drivers and fix them. For now this change makes this driver work again.
Also, disable the comparator by default; it's not used for anything.
The previous logic would start a measurement, and then pause_sbt() for the
averaging time currently configured in the chip. After waiting that long,
the code would blindly read the measurement register and return its value.
The problem is that the chip's idea of averaging time is based on its
internal free-running 1MHz oscillator, which may be running at a wildly
different rate than the kernel clock. If the chip's internal timer was
running slower than the kernel clock, we'd end up grabbing a stale result
from an old measurement.
The driver now still uses pause_sbt() to yield the cpu while waiting for
the measurement to complete, but after sleeping it checks the chip's status
register to ensure the measurement engine is idle. If it's not, the driver
uses a retry loop to wait a bit (5% of the original wait time) then check
again for completion.
The NVMe standard (1.4) states
>>> 8.6 Doorbell Stride for Software Emulation
>>> The doorbell stride,...is useful in software emulation of an NVM
>>> Express controller. ... For hardware implementations of the NVM
>>> Express interface, the expected doorbell stride value is 0h.
However, hardware in the wild exists with a doorbell stride of 1
(meaning 8 byte separation). This change supports that hardware, as
well as software emulators as envisioned in Section 8.6. Since this is
the fast path, care has been taken to make this computation
efficient. The bit of math to compute an offset for each is replaced
by a memory load from cache of a pre-computed value.
MFC After: 3 days
Reviewed by: scottl@
Differential Revision: https://reviews.freebsd.org/D21514
When we suspend, we need to properly shutdown the NVME controller. The
controller may go into D3 state (or may have the power removed), and
to properly flush the metadata to non-volatile RAM, we must complete a
normal shutdown. This consists of deleting the I/O queues and setting
the shutodown bit. We have to do some extra stuff to make sure we
reset the software state of the queues as well.
On resume, we have to reset the card twice, for reasons described in
the attach funcion. Once we've done that, we can restart the card. If
any of this fails, we'll fail the NVMe card, just like we do when a
reset fails.
Set is_resetting for the duration of the suspend / resume. This keeps
the reset taskqueue from running a concurrent reset, and also is
needed to prevent any hw completions from queueing more I/O to the
card. Pass resetting flag to nvme_ctrlr_start. It doesn't need to get
that from the global state of the ctrlr. Wait for any pending reset to
finish. All queued I/O will get sent to the hardware as part of
nvme_ctrlr_start(), though the upper layers shouldn't send any
down. Disabling the qpairs is the other failsafe to ensure all I/O is
queued.
Rename nvme_ctrlr_destory_qpairs to nvme_ctrlr_delete_qpairs to avoid
confusion with all the other destroy functions. It just removes the
queues in hardware, while the other _destroy_ functions tear down
driver data structures.
Split parts of the hardware reset function up so that I can
do part of the reset in suspsend. Split out the software disabling
of the qpairs into nvme_ctrlr_disable_qpairs.
Finally, fix a couple of spelling errors in comments related to
this.
Relnotes: Yes
MFC After: 1 week
Reviewed by: scottl@ (prior version)
Differential Revision: https://reviews.freebsd.org/D21493
polling within a second. Panic if we don't. All the commands that use this
interface should typically complete within a few tens to hundreds of
microseconds. Panic rather than return ETIMEDOUT because if the command somehow
does later complete, it will randomly corrupt memory. Also, it helps to get a
traceback from where the unexpected failure happens, rather than an infinite
loop.
dump support code, move the while loop into an inline function. These aren't
done in the fast path, so if the compiler choses to not inline, any performance
hit is tiny.
polled interface. Normally this would have the potential to corrupt stack memory
because the completion routines would run after we return. In this case,
however, we're doing a dump so it's safe for reasons explained in the comment.
The returned error number may be EINTR or ERESTART depending on
whether or not the signal is supposed to interrupt the system call.
Reported and tested by: pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
This fixes a "timed sleep before timers are working" panic seen
while attaching jedec_dimm(4) instances too early in the boot.
Submitted by: ian
Reviewed by: hselasky
Differential Revision: https://reviews.freebsd.org/D21452
Remove now-redundant items from toepcb and synq_entry and the code to
support them.
Let the driver calculate tx_align, rx_coalesce, and sndbuf by default.
Reviewed by: jhb@
MFC after: 1 week
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21387
descriptor. The per-tid tx credits are in demand during active Tx and
it's best not to use too many just for payload.
Sponsored by: Chelsio Communications
According to ACPI 6.3 specification:
The OS sets this bit to 1 if it supports PCI Segment Groups as defined
by the _SEG object, and access to the configuration space of devices
in PCI Segment Groups as described by this specification. Otherwise,
the OS sets this bit to 0.
As far as I see we support both of those as PCI domains for quite a while.
MFC after: 2 months
According to my tests and errata to several generations of Intel CPUs,
PCIe hot-plug command completion reporting is not very reliable thing.
At least on my Supermicro X11DPi-NT board I never saw it reported.
Before this change timeout code detached devices and tried to disable
the slot, that in my case resulted in hot-plugged device being detached
just a second after it was successfully detected and attached. This
change removes that, so in case of timeout it just prints the error and
continue operation. Linux does the same.
MFC after: 1 week
Sponsored by: iXsystems, Inc.
The netmap_pt.c module has become obsolete after
the refactoring that added netmap_kloop.c.
Remove it and unlink it from the build system.
MFC after: 1 week
While it worked with the kenrel, it wasn't working with the loader.
It failed to handle dependencies correctly. The reason for that is
that we never created a nvme module with the DRIVER_MODULE, but
instead a nvme_pci and nvme_ahci module. Create a real nvme module
that nvd can be dependent on so it can import the nvme symbols it
needs from there.
Arguably, nvd should just be a simple child of nvme, but transitioning
to that (and winning that argument given why it was done this way) is
beyond the scope of this change.
Reviewed by: jhb@
Differential Revision: https://reviews.freebsd.org/D21382
queues, don't try to tear them down in the ctrlr_destroy
path. Otherwise, we dereference queue structures that are NULL and we
trap.
This fix is incomplete: we leak IRQ and MSI resources when this
happens. That's preferable to a crash but still should be fixed.
load and people who pull in nvme/nvd from modules can't load nvd.ko
since it depends on nvme, not nvme_foo. The duplicate doesn't matter
since kldxref properly handles that case.
Turn off bus master after we detach the device (to match the prior
order). Release MSI after we're done detaching and have turned off
all the interrupts. Otherwise this may cause problems as other threads
race nvme_detach. This more closely matches the old order.
Reviewed by: mav@
After r351243 when ALTQ was enabled in the kernel, the inline functions
in ifq.h would not have full type information as if_var.h was not
included.
Given usb_ethernet.h already includes all the various headers (which)
is the cause of the problem here, add if_var.h to it. This fixes the
builds again.
Reported by: CI system, e.g. FreeBSD-head-aarch64-LINT
Intel has created RST and many laptops from vendors like Lenovo and Asus. It's a
mechanism for creating multiple boot devices under windows. It effectively hides
the nvme drive inside of the ahci controller. The details are supposed to be a
trade secret. However, there's a reverse engineered Linux driver, and this
implements similar operations to allow nvme drives to attach. The ahci driver
attaches nvme children that proxy the remapped resources to the child. nvme_ahci
is just like nvme_pci, except it doesn't do the PCI specific things. That's
moved into ahci where appropriate.
When the nvme drive is remapped, MSI-x interrupts aren't forwarded (the linux
driver doesn't know how to use this either). INTx interrupts are used
instead. This is suboptimal, but usually sufficient for the laptops these parts
are in.
This is based loosely on https://www.spinics.net/lists/linux-ide/msg53364.html
submitted, but not accepted by, Linux. It was written by Dan Williams. These
changes were written from scratch by Olivier Houchard.
Submitted by: cognet@ (Olivier Houchard)
Nvme drives can be attached in a number of different ways. Separate out the PCI
attachment so that we can have other attachment types, like ahci and various
types of NVMeoF.
Submitted by: cognet@
If device is unplugged from the system (CSTS register reads return
0xffffffff), it makes no sense to send any more recovery requests or
expect any responses back. If there is a detach call in such state,
just stop all activity and free resources. If there is no detach
call (hot-plug is not supported), rely on normal timeout handling,
but when it trigger controller reset, do not wait for impossible and
quickly report failure.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
This fixes possible double call of fail_fn, for example on hot removal.
It also allows ctrlr_fn to safely return NULL cookie in case of failure
and not get useless ns_fn or fail_fn call with NULL cookie later.
MFC after: 2 weeks
Otherwise the mutex needs to be dropped when copying out the midistat
sbuf, leading to a race which allows one to read kernel memory beyond
the end of the sbuf buffer.
Reported and tested by: pho
Security: CVE-2019-5612
order to have struct mii_data available. However, it only really needs
a forward declaration of struct mii_data for use in pointer form for
the return type of a function prototype.
Custom kernel configuration that have usb and fdt enabled, but no miibus,
end up with compilation failures because miibus_if.h will not get
generated.
Due to the above, the following changes have been made to usb_ethernet.h:
* remove the inclusion of mii headers
* forward-declare struct mii_data
* include net/ifq.h to satify the need for complete struct ifqueue
Reviewed by: ian
Obtained from: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D21293
Return an empty string when the location is unknown instead of the
string "unknown". This ensures that all location entries are of
the form key=val.
Suggested by: imp
Approved by: jhb (mentor)
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D21326
Move fast entropy source registration to the earlier
SI_SUB_RANDOM:SI_ORDER_FOURTH and move random_harvestq_prime after that.
Relocate the registration routines out of the much later randomdev module
and into random_harvestq.
This is necessary for the fast random sources to actually register before we
perform random_harvestq_prime() early in the kernel boot.
No functional change.
Reviewed by: delphij, markjm
Approved by: secteam(delphij)
Differential Revision: https://reviews.freebsd.org/D21308
If simple multifuction device also provides syscon interface, its
childern should be able to consume it. Due to this:
- declare coresponding method in syscon interface
- implement it in simple multifunction device driver
MFC after: 1 week
NTB Tool driver is meant for testing NTB hardware driver functionalities,
such as doorbell interrupts, link events, scratchpad registers and memory
windows. This is a port of ntb_tool driver from Linux. It has been
verified on top of AMD and PLX NTB HW drivers.
Submitted by: Arpan Palit <arpan.palit@amd.com>
Cleaned up by: mav
MFC after: 2 weeks
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D18819
It is unused, the ABI was broken in r322969, and it is broken by design
(more than MDNPAD md devices can exist and there is no way to retreive
them with this interface).
mdconfig(8) was converted to use libgeom to obtain this information
in r157160 and any other consumers of MDIOCLIST should likewise be
converted.
Reviewed by: emaste
Relnotes: yes
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D18936
This adds safety net for the case of misconfigured NTB with too big
memory window, for which we may be unable to allocate a memory buffer,
which does not make much sense for the network interface. While there,
fix the code to really work with asymmetric window sizes setup.
This makes driver just print warning message on boot instead of hanging
if too large memory window is configured.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
This restores parity with AMD NTB driver. Though without any drivers
supporting more then one peer and respective KPI modification to pass
peer index to most of the calls this addition is pretty useless now.
MFC after: 2 weeks
No functional change.
Add a verbose comment giving an example side-by-side comparison between the
prior and Concurrent modes of Fortuna, and why one should believe they
produce the same result.
The intent is to flip this on by default prior to 13.0, so testing is
encouraged. To enable, add the following to loader.conf:
kern.random.fortuna.concurrent_read="1"
The intent is also to flip the default blockcipher to the faster Chacha-20
prior to 13.0, so testing of that mode of operation is also appreciated.
To enable, add the following to loader.conf:
kern.random.use_chacha20_cipher="1"
Approved by: secteam(implicit)
Namespace Optimal I/O Boundary field added in NVMe 1.3 and Namespace
Preferred Write Granularity added in 1.4 allow upper layers to align
I/Os for improved SSD performance and endurance.
I don't have hardware reportig those yet, but NPWG could probably be
reported by bhyve.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
With support for multiple namespaces and multiple ports in NVMe there is
now a need for reliable unique namespace identification alike to SCSI.
MFC after: 1 weeks
Sponsored by: iXsystems, Inc.
We used the aw_clk_nm clock for clock with only one divider factor
and used a fake multiplier factor. This cannot work properly as we
end up writing the "fake" factor to the register (and so always set
the LSB to 1).
Create a new clock for those.
The reason for not using the clk_div clock is because those clocks are
a bit special. Since they are (almost) all related to video we also need
to set the parent clock (the main PLL) to a frequency that they can support.
As the main PLL have some minimal frequency that they can support we need to
be able to set the main PLL to a multiple of the desired frequency.
Let say you want to have a 71Mhz pixel clock (typical for a 1280x800 display)
and the main PLL cannot go under 192Mhz, you need to set it to 3 times the
desired frequency and set the divider to 3 on the hdmi clock.
So this also introduce the CLK_SET_ROUND_MULTIPLE flag that allow for this kind
of scenario.
Similar to what was done for device_printfs in r347229.
Convert g_print_bio() to a thin shim around g_format_bio(), which acts on an
sbuf; documented in g_bio.9.
Reviewed by: markj
Discussed with: rlibby
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D21165
with Communication Device Class Ethernet Emulation Model (CDC EEM).
The driver supports both the device, and host side operation; there
is a new USB template (#11) for the former.
This enables communication with virtual USB NIC provided by iLO 5,
as found in new HPE Proliant servers.
Reviewed by: hselasky
MFC after: 2 weeks
Relnotes: yes
Sponsored by: Hewlett Packard Enterprise
Some hardware needs more than 32, bump this value.
We cannot use the _alloc for of getencprop as this function is called
too early in the boot before pmap is initialized and we only have
2k of stack when cninit is called.
Discussed with: ian
- Check for an invalid device (vendor is invalid) before reading the
header type register when examining function 0 of a possible device.
- When iterating over functions of a device, reject any device whose
16-bit vendor is invalid rather than requiring the full 32-bit
vendor+device to be all 1's. In practice the latter check is
probably fine, but checking the vendor is what the PCI spec
recommends.
Reviewed by: imp
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D21147
RT2860_WCID_MAX is supposed to describe the max STA index for wcid2ni, and
was instead being used as the size -- off-by-one.
rt2860_drain_stats_fifo was range-checking wcid only after accessing
out-of-bounds potentially.
Submitted by: Augustin Cavalier <waddlesplash@gmail.com> (basically)
Obtained from: Haiku (58d16d9fe2d5a209cf22823359a8407d138e1a87)
Differential Revision: 3 days
NVMe reservations are quite alike to SCSI persistent reservations and
can be used in clustered setups with shared multiport storage.
MFC after: 10 days
Relnotes: yes
Sponsored by: iXsystems, Inc.
Instances of the device can be configured using hints or FDT data.
Interfaces to reconfigure the chip and extract voltage measurements from
it are available via sysctl(8).
PowerPC, and possibly other architectures, use different address ranges for
PCI space vs physical address space, which is only mapped at resource
activation time, when the BAR gets written. The DRM kernel modules do not
activate the rman resources, soas not to waste KVA, instead only mapping
parts of the PCI memory at a time. This introduces a
BUS_TRANSLATE_RESOURCE() method, implemented in the Open Firmware/FDT PCI
driver, to perform this necessary translation without activating the
resource.
In addition to system KPI changes, LinuxKPI is updated to handle a
big-endian host, by adding proper endian swaps to the I/O functions.
Submitted by: mmacy
Reported by: hselasky
Differential Revision: https://reviews.freebsd.org/D21096
In particular: Changed Namespace List, Commands Supported and Effects,
Reservation Notification, Sanitize Status.
Add few new arguments to `nvmecontrol log` subcommand.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
While very useful by itself, it also makes `nvmecontrol` not depend on
hardcoded device names parsing, that in its turn makes simple to take
nvdX (and potentially any other) device names as arguments.
Also added IOCTL bypass from nvdX to respective nvmeYnsZ makes them
interchangeable for management purposes.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
an updated rack depend on having access to the new
ratelimit api in this commit.
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D20953
with an eventual goal to convert all legacl zlib callers to the new zlib
version:
* Move generic zlib shims that are not specific to zlib 1.0.4 to
sys/dev/zlib.
* Connect new zlib (1.2.11) to the zlib kernel module, currently built
with Z_SOLO.
* Prefix the legacy zlib (1.0.4) with 'zlib104_' namespace.
* Convert sys/opencrypto/cryptodeflate.c to use new zlib.
* Remove bundled zlib 1.2.3 from ZFS and adapt it to new zlib and make
it depend on the zlib module.
* Fix Z_SOLO build of new zlib.
PR: 229763
Submitted by: Yoshihiro Ota <ota j email ne jp>
Reviewed by: markm (sys/dev/zlib/zlib_kmod.c)
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D19706
Terasic DE10-Pro (an Intel Stratix 10 GX/SX FPGA Development Kit).
The Altera EMAC is an instance of Synopsys DesignWare Gigabit MAC.
This driver sets correct clock range for MDIO interface on Intel Stratix 10
platform.
This is required due to lack of support for clock manager device for
this platform that could tell us the clock frequency value for ethernet
clock domain.
Sponsored by: DARPA, AFRL
Substitute driver-defined IS_P2ALIGNED() with EFX_IS_P2ALIGNED()
defined in libefx.
Add type argument and cast value and alignment to one specified type.
Reported by: Andrea Valsania <andrea.valsania at answervad.it>
Reviewed by: philip
Sponsored by: Solarflare Communications, Inc.
MFC after: 2 days
Differential Revision: https://reviews.freebsd.org/D21076
Substitute driver-defined P2ALIGN() with EFX_P2ALIGN() defined in
libefx.
Cast value and alignment to one specified type to guarantee result
correctness.
Reported by: Andrea Valsania <andrea.valsania at answervad.it>
Reviewed by: philip
Sponsored by: Solarflare Communications, Inc.
MFC after: 2 days
Differential Revision: https://reviews.freebsd.org/D21075
Substitute driver-defined P2ROUNDUP() h with EFX_P2ROUNDUP()
defined in libefx.
Cast value and alignment to one specified type to guarantee result
correctness.
Reported by: Andrea Valsania <andrea.valsania at answervad.it>
Reviewed by: philip
Sponsored by: Solarflare Communications, Inc.
MFC after: 2 days
Differential Revision: https://reviews.freebsd.org/D21074
We want to allocate a contiguous memory block anywhere in memory, but
expressed this as having to be between 0 and 0xffffffff. This limits us
on 64-bit machines, and outright breaks on machines where memory is
mapped above that address range.
Allow the full address range to be used for this allocation.
Sponsored by: Axiado
The timeout field in the CAPS register is defined to be 8 bits, so its type was
uint8_t. We recently started adding 1 to it to cope with rogue devices that
listed 0 timeout time (which is impossible). However, in so doing, other devices
that list 0xff (for a 2 minute timeout) were broken when adding 1
overflowed. Widen the type to be uint32_t like its source register to avoid the
issue.
Reported by: bapt@
- Wrong order of casting and bit shift caused that enabling and disabling
queues didn't work properly for queues number larger than 32. Use literals
with right suffix instead.
- TX ring tail address was not updated during reinitiailzation of TX
structures. It could block sending traffic.
- Also remove unused variables 'eims' and 'active_queues'.
Submitted by: Krzysztof Galazka <krzysztof.galazka@intel.com>
Reviewed by: erj@
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D20826
o Add an experimental IOMMU support to xDMA framework
The BERI IOMMU device is the part of CHERI device-model project [1]. It
translates memory addresses for various BERI peripherals modelled in
software. It accepts FreeBSD/mips64 page directories format and manages
BERI TLB.
1. https://github.com/CTSRD-CHERI/device-model
Sponsored by: DARPA, AFRL
features offered by the chips.
For 2127 and 2129 chips, fix the detection of when chip-init is needed. The
chip config needs to be reset whenever power was lost, but the logic was
wrong for 212x chips (it only worked for 8523). Now the "oscillator
stopped" bit rather than the power manager mode is used to detect startup
after powerfail.
For all chips, disable the clock output pin.
For chips that have a timestamp/tamper-monitor feature, turn off monitoring
of the timestamp trigger pin.
The 8523, 2127, and 2129 chips have a "power manager" feature that offers
several options. We've been using the default mode which enables
everything. Now the code sets the power manager options to
- direct-switch (when Vdd < Vbat, without extra threshold check)
- no battery monitor
- no external powerfail monitor
This reduces the current draw while running on battery from 1930nA to 880nA,
which should roughly double the lifespan of the battery under load.
Because battery checking is a nice thing to have, the code now does a check
at startup, and then once a day after that, instead of checking continuously
(but only actually reporting at startup). The battery check is now done by
setting the power manager back to default mode, sleeping briefly while it
makes a voltage measurement, then switching back to power-saving mode.
While we print failure messages on the console, sometimes logs are lost or
overwhelmed. Keeping a count of how many times we've failed retriable commands
helps get a magnitude of the problem.
Retried commands can indicate a performance degredation of an nvme drive. Keep
track of the number of retries and report it out via sysctl, just like number of
commands an interrupts.
Also convert it to a bool. While the rest of the driver isn't yet bool clean,
this will help.
Reviewed by: cem@
Differential Revision: https://reviews.freebsd.org/D20988
The nvme drive dumps only the most relevant details about a command when it
fails. However, there are times this is not sufficient (such as debugging weird
issues for a new drive with a vendor). Setting hw.nvme.verbose_cmd_dump=1
in loader.conf will enable more complete debugging information about each
command that fails.
Reviewed by: rpokala
Sponsored by: Netflix
Differential Version: https://reviews.freebsd.org/D20988
These macros make places where we extract these easier to read. The shift and
mask stuff is also a bit tedious and error prone. Start with the CAP_LO and
CAP_HI registers since their scope is somewhat constrained. This is style
chagne only, no functional changes.
Reviewed by: chuck
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20979
This affects the detection of 24-hour vs AM/PM mode... the ampm bit is in a
different location on 2127 and 2129 chips compared to other nxp rtc chips.
I noticed the 2127 case wasn't being handled correctly when I accidentally
misconfiged my system by claiming my PCF2129 was a 2127.
with various laptops using hdaa(4) sound devices. We don't seem to know
the "correct" configurations for these devices and the defaults are far
superiour, e.g. they work if you don't nuke the default configs.
PR: 200526
Differential Revision: https://reviews.freebsd.org/D17772
Neither the 1.3 or 1.4 standards say this number is 1's based, but adding 1
costs little and copes with those NVMe drives that report '0' in this field
cheaply. This is consistent with what the Linux driver does as well.
This fixes the following panic on powerpc:
pci_get_vendor failed for pcib1 on bus ofwbus0, error = 2
PR: 238730
Reported by: Dennis Clarke <dclarke@blastwave.org>
Tested by: Dennis Clarke <dclarke@blastwave.org>
MFC after: 2 weeks
on PCx2129 chips too.
The datasheet for the PCx2129 chips says that there is only a watchdog
timer, no countdown timer. It turns out the countdown timer hardware is
there and works just the same as it does on a PCx2127 chip, except that you
can't use it to trigger an interrupt or toggle an output pin. We don't need
interrupts or output pins, we only need to read the timer register to get
sub-second resolution. So start treating the 2129 chips the same as 2127.
An obscure footnote in the datasheets for the PCx2127, PCx2129, and
PCF8523 rtc chips states that the chips do not support i2c repeat-start
operations. When the driver was originally written and tested, the i2c
bus on that system also didn't support repeat-start and just quietly
turned repeat-start operations into a stop-then-start, making it appear
that the nxprtc driver was working properly.
The repeat-start situation only comes up on reads, so instead of using
the standard iicdev_readfrom(), use a local nxprtc_readfrom(), which is
just a cut-and-pasted copy of iicdev_readfrom(), modified to send two
separate start-data-stop sequences instead of using repeat-start.
The driver used to log any non-zero cause and when running with a single
line interrupt it would spam the console/logs with reports of interrupts
that are of no interest to anyone.
MFC after: 1 week
Sponsored by: Chelsio Communications
Enable this for the NovAtel OEMv2 GPS receiver.
Not fixed: The receiver shows up as "<Interface 0>" in the device
tree, because that is literally what the descriptor-string is.
Reviewed by: hselasky@
When a command is finished running, we must transition it from INQUEUE
to busy state. We were failing to do that, so we hit a panic when the
commands were freed. This only affects mpr, mps already did simmilar
things. Now both the polling and interrupt paths properly set BUSY as
appropriate.
This device cannot cross a 4GB boundary with DMA. Removing the
boundary in r346386 resulted in low frequency memory corruption on
machines with isci(4) controllers.
Submitted by: gallatin@
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20910
Add VMBus protocol version 4.0. and 5.0 to support Windows 10 and newer HyperV hosts.
For VMBus 4.0 and newer HyperV, the netvsc gpadl teardown must be done after vmbus close.
Submitted by: whu
MFC after: 2 weeks
Sponsored by: Microsoft
r348164 added code to iicbus_request_bus/iicbus_release_bus to automatically
call device_busy()/device_unbusy() as part of aquiring exclusive use of the
bus (so modules can't be unloaded while the bus is exclusively owned and/or
IO is in progress). That broke the ability to do i2c IO from a slave device
probe method, because the slave isn't attached yet, so calling device_busy()
triggers a sanity-check panic for trying to busy a non-attached device.
Now we check whether the device status is < DS_ATTACHING, and if so we busy
the iicbus rather than the slave device. I think this leaves a small window
where a module could be unloaded while probing is in progress. But I think
that's true of all devices, and probably should be fixed by introducing a
DS_PROBING state for devices, and handling that at various points in the
newbus code.
Eliminate the TIMEDOUT state. This state really conveyed two different
concepts: I timed out during recovery (and my command got put on the
recovery queue), and I timed out diring discovery (which doesn't).
Separate those two concepts into two flags. Use the TIMEDOUT flag to
fail requests as timed out. Use the on queue flag to remove them from
the queue.
In mps_intr_locked for MPI2_RPY_DESCRIPT_FLAGS_ADDRESS_REPLY message
type, when completing commands, ignore the ones that are not in state
INQUEUE. They were already completed as part of the recovery
process. When we complete them twice, we wind up with entries on the
free queue that are marked as busy, trigging asserts.
Reviewed by: scottl (earlier version, just for mpr)
Differential Revision: https://reviews.freebsd.org/D20785
The hold_count and wire_count fields of struct vm_page are separate
reference counters with similar semantics. The remaining essential
differences are that holds are not counted as a reference with respect
to LRU, and holds have an implicit free-on-last unhold semantic whereas
vm_page_unwire() callers must explicitly determine whether to free the
page once the last reference to the page is released.
This change removes the KPIs which directly manipulate hold_count.
Functions such as vm_fault_quick_hold_pages() now return wired pages
instead. Since r328977 the overhead of maintaining LRU for wired pages
is lower, and in many cases vm_fault_quick_hold_pages() callers would
swap holds for wirings on the returned pages anyway, so with this change
we remove a number of page lock acquisitions.
No functional change is intended. __FreeBSD_version is bumped.
Reviewed by: alc, kib
Discussed with: jeff
Discussed with: jhb, np (cxgbe)
Tested by: pho (previous version)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19247
table VCTRL registers.
Unconditionally program the MSI-X vector control Mask field for MSI-X
table entries without regarud for Mask's previous value. Some devices
return all zeros on reads of the VCTRL registers, which would cause us
to skip disabling interrupts. This fixes the Samsung SM961/PM961 SSDs
which are return zero starting from offset 0x3084 within the memory
region specified by BAR0, even when they are active MSI-X vectors.
The Illumos kernel writes these unconditionally to 0 or 1. However,
section 6.8.2.9 of the PCI Local Bus 3.0 spec (dated Feb 3, 2004)
states for bits 31::01:
After reset, the state of these bits must be 0. However, for
potential future use, software must preserve the value of
these reserved bits when modifying the value of other Vector
Control bits. If software modifies the value of these reserved
bits, the result is undefined."
so we always set or clear the Mask bit, but otherwise preserves the
old value.
PR: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211713
Reviewed By: imp, jhb
Submitted by: Ka Ho Ng
MFC After: 1 week
Differential Revision: https://reviews.freebsd.org/D20873
While at it fix an invalid memory access issue when attaching external
USB HUBs, which are not mapped by ACPI, due to missing status check
when calling AcpiGetObjectInfo() from acpi_usb_hub_port_probe_cb().
Sponsored by: Mellanox Technologies
When the system has no graphical console, such as bhyve in common
configurations, ignore kern.vt.splash_cpu, instead of panicking
on INVARIANTS kernels.
Reviewed by: cem dumbbell
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D20877
Print the adapter name rather than the address of the adapter
to avoid kernel address leakage.
PR: Bug 238642
Submitted by: Fuqian Huang <huangfq.daxian@gmail.com>
Reviewed by: vmaffione
MFC after: 1 week
All MMCBR bridges have to implement all the MMCBR variables. This
implements them for everybody that currently doesn't.
A common routine for this should be written.
XCHAN_CAP_BOUNCE.
The only application that uses bounce buffering for now is the Government
Furnished Equipment (GFE) P2's dma core (AXIDMA) with its own dedicated
cacheless bounce buffer.
Sponsored by: DARPA, AFRL
Otherwise there is a window where they may be rescheduled. This
typically manifested as a page fault shortly after unloading if_iwm.ko.
Close the race by draining callouts after calling iwm_stop_device(),
which is also what Dragonfly does.
Change whitespace to reduce gratuitous diffs with Dragonfly.
Reported and tested by: seanc
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Previously the TOE code used its own custom unmapped mbufs via
EXT_FLAG_VENDOR1. The old version always wired the entire AIO request
buffer first for the duration of the AIO operation and constructed
multiple mbufs which used the wired buffer as an external buffer.
The new version determines how much room is available in the socket
buffer and only wires the pages needed for the available room building
chains of M_NOMAP mbufs. This means that a large AIO write will now
limit the amount of wired memory it uses to the size of the socket
buffer.
Reviewed by: gallatin, np
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D20839
that node is also compatible with syscon. For instance,
Rockchip RK3399's GRF (General Register Files) is compatible
with simple-mfd as well as syscon and has devices like
usb2-phy, emmc-phy and pcie-phy etc. under it.
Reviewed by: manu
This patch is the driver for NTB hardware in AMD SoCs (ported from Linux)
and enables the NTB infrastructure like Doorbells, Scratchpads and Memory
window in AMD SoC. This driver has been validated using ntb_transport and
if_ntb driver already available in FreeBSD.
Submitted by: Rajesh Kumar <rajesh1.kumar@amd.com>
MFC after: 1 month
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D18774
otherwise we panic.
dwmmc don't handle VCCQ (voltage for the IO line of the SD/eMMC) or
TIMING.
Add the needed accessor in the {read,write}_ivar functions.
Reviewed by: imp (previous version)
This patch fixes 2 panics. The first one is due to the current VNET not
being set in the emulated adapter transmission path. The second one
is caused by the M_PKTHDR flag not being set when preallocated mbufs
are recycled in the transmit path.
Submitted by: aleksandr.fedorov@itglobal.com
Reviewed by: vmaffione
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D20824
The goal of this driver is consolidate information about SuperIO chips
and to provide for peaceful coexistence of drivers that need to access
SuperIO configuration registers.
While SuperIO chips can host various functions most of them are
discoverable and accessible without any knowledge of the SuperIO.
Examples are: keyboard and mouse controllers, UARTs, floppy disk
controllers. SuperIO-s also provide non-standard functions such as
GPIO, watchdog timers and hardware monitoring. Such functions do
require drivers with a knowledge of a specific SuperIO.
At this time the driver supports a number of ITE and Nuvoton (fka
Winbond) SuperIO chips.
There is a single driver for all devices. So, I have not done the usual
split between the hardware driver and the bus functionality. Although,
superio does act as a bus for devices that represent known non-standard
functions of a SuperIO chip. The bus provides enumeration of child
devices based on the hardcoded knowledge of such functions. The
knowledge as extracted from datasheets and other drivers.
As there is a single driver, I have not defined a kobj interface for it.
So, its interface is currently made of simple functions.
I think that we can the flexibility (and complications) when we actually
need it.
I am planning to convert nctgpio and wbwd to superio bus very soon.
Also, I am working on itwd driver (watchdog in ITE SuperIO-s).
Additionally, there is ithwm driver based on the reverted sensors
import, but I am not sure how to integrate it given that we still lack
any sensors interface.
Discussed with: imp, jhb
MFC after: 7 weeks
Differential Revision: https://reviews.freebsd.org/D8175
That is, instead of the current GPIO00 - GPIO15 the names will be GPIO00
- GPIO07, GPIO10 - GPIO17. The first digit is a GPIO "bank" / group
number and the second one is a pin number within the bank. Alternative
view is that the pin names are changed from decimal numbering scheme to
octal one (as there are 8 pins per bank).
Discussed with: cem, gonzo
MFC after: 2 weeks
With more ports, some of the registers are shifted a bit to accommodate.
This switch also adds two high speed Serdes/SGMII interfaces (2.5 Gb/s).
Sponsored by: Rubicon Communications, LLC (Netgate)
Since cxgbe(4) uses sglist instead of bus_dma, this required updates
to the code that generates scatter/gather lists for packets. Also,
unmapped mbufs are always sent via DMA and never as immediate data in
the payload of a work request.
Submitted by: gallatin (earlier version)
Reviewed by: gallatin, hselasky, rrs
Discussed with: np
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20616
address before returning it to the user. Some of the least significant
bits have special meaning and should be masked away.
Discussed with: kib@
MFC after: 3 days
Sponsored by: Mellanox Technologies