Hardware with more than 256 CPU cores is currently available and will
become increasingly common over FreeBSD 14's lifetime. Increase MAXCPU
in the amd64 GENERIC kernel configuration to 1024.
Earlier commits increased some related limits. These prerequisite
commits include at least:
- d7ed40243769 Increase MAX_APIC_ID safeguard to 0x800
- d1639e43c5 cpuset: increase userland maximum size to 1024
Global and allocated arrays sized by MAXCPU result in excessive bloat
on systems with lower core counts. In addition, some code used u_char
(8 bits) to hold a CPU index, which is not valid if MAXCPU is greater
than 256.
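A hedged sketch of both patterns, with hypothetical names rather than
code from the tree:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/kernel.h>
    #include <sys/malloc.h>
    #include <sys/smp.h>

    struct foo_pcpu { uint64_t fp_count; };     /* hypothetical type */

    /* Bloat: MAXCPU (now 1024) entries exist even on a 4-core machine. */
    static struct foo_pcpu foo_stats_static[MAXCPU];

    /* Preferred: a pointer sized at boot by the CPUs actually present,
     * indexed with u_int (a u_char index overflows once MAXCPU > 256). */
    static struct foo_pcpu *foo_stats;

    static void
    foo_init(void *arg __unused)
    {
            foo_stats = mallocarray(mp_maxid + 1, sizeof(*foo_stats),
                M_DEVBUF, M_WAITOK | M_ZERO);
    }
    SYSINIT(foo_stats_init, SI_SUB_CPU, SI_ORDER_ANY, foo_init, NULL);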
A number of recent commits addressed these sorts of issues, including
at least:
- 133935d26f pf: atomically increment state ids
- 74ac712f72 vmm: Dynamically allocate a couple of per-CPU state save areas
- 78cfa762eb callout: Move per-CPU callout state into the dpcpu region
- 42f722e721 amd64: store pcids pmap data in pcpu zone
- 9801e7c275 smp_topo: dynamically allocate group array
- 9fb6718d1b smp: Dynamically allocate the stoppcbs array
- 2bb16c6352 x86: retire use of intr_bind
There are some additional allocations still to be converted and
more scalability work is required to make effective use of very high
core count systems, but this change allows us to boot on these systems
and provides a Kernel Binary Interface (KBI) for the FreeBSD 14 release
that supports these configurations.
Special thanks to AMD for providing hardware to test these changes.
PR: 269572
Reviewed by: des
Relnotes: Yes
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D36838
This makes use of the alternate address space support in both GCC and
clang to access per-CPU data as accesses relative to GS:. The
original motivation for this is that it quiets verbose warnings from
GCC 12. However, this version is also much easier to read and
allows the compiler to generate better code (e.g. the compiler can
use a GS: memory operand directly in other instructions such as IMUL
and CMP rather than always MOVing to a temporary register).
The one caveat is that the current approach is very inefficient at -O0
since the compiler expects to load the 0 base offset from a global
variable instead of assuming it is 0 (even with the const).
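A minimal sketch of the idea (illustrative names, not the actual
machine/pcpu.h macros), assuming a compiler that defines __SEG_GS:

    #ifdef __SEG_GS
    struct pcpu_demo {
            unsigned int    pc_cpuid;
    };

    /* A GS-relative pointer with a constant zero base: member accesses
     * compile to single %gs:offset memory operands. */
    static __seg_gs struct pcpu_demo *const pc =
        (__seg_gs struct pcpu_demo *)0;

    #define PCPU_GET_DEMO(member)           (pc->member)
    #define PCPU_SET_DEMO(member, val)      (pc->member = (val))
    #endif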
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D40647
Use of compiler builtin ffs/ctz functions will result in optimized
instruction sequences when possible, and fall back to calling a function
provided by the compiler run-time library. We have slowly shifted our
platforms to take advantage of these builtins in 60645781d6 (arm64),
1c76d3a9fb (arm), 9e319462a0 (powerpc, partial).
Some platforms still rely on the implementations of these functions
provided by libkern, namely riscv, powerpc (ffs*, flsll), and
i386 (ffsll and flsll). These routines are slow, as they perform a
linear search for the bit in question. Even on platforms lacking
dedicated bit-search instructions, such as riscv, the compiler library
will provide better-optimized routines, e.g. by using binary search.
Consolidate all definitions of these functions (whether currently using
builtins or not) to libkern.h. This should result in equivalent or
better performing routines in all cases.
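A sketch of the consolidated style of definition (illustrative; the
exact libkern.h wording may differ):

    #include <sys/cdefs.h>

    static __inline __pure2 int
    ffsl_demo(long mask)
    {
            return (__builtin_ffsl(mask));
    }

    static __inline __pure2 int
    flsl_demo(long mask)
    {
            if (mask == 0)
                    return (0);
            return (8 * (int)sizeof(mask) -
                __builtin_clzl((unsigned long)mask));
    }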
One wart in all of this is the existing HAVE_INLINE_F*** macros, which
we use in a few places to conditionally avoid the slow libkern routines.
These aren't easily removed in one commit. For now, provide these
defines unconditionally, but mark them for removal after subsequent
cleanup.
Removal of the now unused libkern routines will follow in the next
commit.
Reviewed by: dougm, imp (previous version)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D40698
Stop requiring all of the PTEs to have the accessed bit set for superpage
promotion to occur. Given that change, add support for promotion to
pmap_enter_quick(), which does not set the accessed bit in the PTE that
it creates.
Since the final mapping within a superpage-aligned and sized region of a
memory-mapped file is typically created by a call to pmap_enter_quick(),
we now achieve promotions in circumstances where they did not occur
before, for example, the X server's read-only mapping of libLLVM-15.so.
See also https://www.usenix.org/system/files/atc20-zhu-weixi_0.pdf
Reviewed by: kib, markj
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D40478
This avoids bloating the kernel image when MAXCPU is large.
A follow-up patch for kgdb and other kernel debuggers is needed since
the stoppcbs symbol is now a pointer. Bump __FreeBSD_version so that
debuggers can use osreldate to figure out how to handle stoppcbs.
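For debuggers the visible difference is the type of the symbol; roughly:

    /* Before: the symbol named a statically sized array. */
    extern struct pcb stoppcbs[MAXCPU];

    /* After: the symbol is a pointer to a dynamically allocated array,
     * so kgdb must read the pointer before indexing the PCBs. */
    extern struct pcb *stoppcbs;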
PR: 269572
MFC after: never
Reviewed by: mjg, emaste
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D39806
Commit 0bda8d3e9f ("vmm: permit some IPIs to be handled by userspace")
embedded cpuset_t into the vmm(4) ioctl ABI. This was a mistake since
we otherwise have some leeway to change the cpuset_t for the whole
system, but we want to keep the vmm ioctl ABI stable.
Rework IPI reporting to avoid this problem. Along the way, make VM_RUN
a bit more efficient:
- Split vmexit metadata out of the main VM_RUN structure. This data is
only written by the kernel.
- Have userspace pass a cpuset_t pointer and cpusetsize in the VM_RUN
structure, as is done for cpuset syscalls.
- Have the destination CPU mask for VM_EXITCODE_IPIs live outside the
vmexit info structure, and make VM_RUN copy it out separately. Zero
out any extra bytes in the CPU mask, like cpuset syscalls do.
- Modify the vmexit handler prototype to take a full VM_RUN structure.
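A rough sketch of the reworked VM_RUN argument described above (field
names are illustrative; the real layout lives in the vmm ioctl headers):

    #include <sys/types.h>
    #include <sys/cpuset.h>

    struct vm_exit;                     /* exit metadata, kernel-written */

    struct vm_run_demo {
            int             cpuid;
            cpuset_t        *cpuset;    /* dst CPUs for IPI exits, copied
                                           out by the kernel; extra bytes
                                           are zeroed */
            size_t          cpusetsize; /* size of the userspace buffer */
            struct vm_exit  *vm_exit;   /* split out of the main struct */
    };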
PR: 271330
Reviewed by: corvink, jhb (previous versions)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D40113
The SPDX folks have obsoleted the BSD-2-Clause-FreeBSD identifier. Catch
up to that fact and revert to their recommended match of BSD-2-Clause.
Discussed with: pfg
MFC after: 3 days
Sponsored by: Netflix
pmap_get_pcid() calls zpcpu_get(), which is defined in pcpu.h.
It is unclear why we do not include that header but, like the guard
right above this change, add another guard around pmap_get_pcid().
This allows some LinuxKPI headers to compile again.
Suggested by: markj
MFC after: 10 days
This patch fixes virtual machine single stepping on VMX hosts.
Currently, when using bhyve's gdb stub, each attempt at single-stepping
a vCPU lands in a timer interrupt. The current single-stepping mechanism
uses the Monitor Trap Flag feature to cause VMEXIT after a single
instruction is executed. Unfortunately, the SDM states that MTF causes
a VMEXIT for the next instruction that gets executed; with an interrupt
pending, that is the first instruction of the interrupt handler, which
is often not what the person using the debugger expects. [1]
This patch adds a new VM capability that masks interrupts on a vCPU by
blocking interrupt injection and modifies the gdb stub to use the newly
added capability while single-stepping a vCPU.
[1] Intel SDM 26.5.2 Vol. 3C
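Roughly how the gdb stub can use it around a step (the capability name
here is an assumption, not necessarily the identifier added by this
change):

    /* Keep interrupts from being injected so the MTF-induced exit lands
     * on the next guest instruction rather than in an interrupt handler. */
    vm_set_capability(vcpu, VM_CAP_MASK_HWINTR, 1);
    vm_set_capability(vcpu, VM_CAP_MTRAP_EXIT, 1); /* single step via MTF */
    /* ... resume the vCPU for one instruction ... */
    vm_set_capability(vcpu, VM_CAP_MTRAP_EXIT, 0);
    vm_set_capability(vcpu, VM_CAP_MASK_HWINTR, 0);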
Reviewed by: corvink, jhb
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D39949
This existing helper function is preferable to the hand-rolled
calculation of the kstack bounds.
Make some small style improvements while here. Notably, rename every
instance of "r", the return address, to "ra". Tidy the includes in the
affected files.
Reviewed by: jkoshy
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D39909
This change eliminates the struct pmap_pcid array embedded into struct
pmap and sized by MAXCPU, which would bloat with a MAXCPU increase. It
also removes false sharing of cache lines, since the array elements are
mostly locally accessed by corresponding CPUs.
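A hedged before/after sketch (field names abridged):

    /* Before: embedded in struct pmap and sized at compile time. */
    struct pmap_old {
            /* ... */
            struct pmap_pcid pm_pcids[MAXCPU];  /* grows with MAXCPU */
    };

    /* After: a single allocation from a pcpu UMA zone; each CPU reaches
     * its own element through zpcpu_get(), so the hot data stays in
     * CPU-local cache lines. */
    struct pmap_new {
            /* ... */
            struct pmap_pcid *pm_pcidp;         /* from a pcpu UMA zone */
    };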
Suggested by: mjg
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D39890
and rename the structure to pmap_pcid.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D39890
When vm_map_remove() is called from vm_swapout_map_deactivate_pages()
due to swapout, PKRU attributes for the removed range must be kept
intact. Provide a variant of pmap_remove(), pmap_map_delete(), to
allow pmap to distinguish between real removes of the UVA mappings
and any other internal removes, e.g. swapout.
For non-amd64, pmap_map_delete() is #defined to pmap_remove().
Reported by: andrew
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D39556
Move the xenisrc structure which needs to be shared between the core Xen
interrupt code and architecture-dependent code into a separate header. A
similar situation exists for the NR_EVENT_CHANNELS constant.
Turn xi_intsrc into a type definition named xi_arch to reflect the new
purpose of being an architectural variable for the interrupt source.
This was originally implemented by Julien Grall, but has been heavily
modified. The core side was renamed "intr-internal.h" and is #include'd
by "arch-intr.h" instead of the other way around. This allows the
architecture to add function definitions which use struct xenisrc.
The original version only moved xi_intsrc into xen_arch_isrc_t. Moving
xi_vector was done by the submitter.
The submitter had also moved xi_activehi and xi_edgetrigger into
xen_arch_isrc_t. Those disappeared with the removal of PVHv1 support.
Copyright note. The current xenisrc structure was introduced at
76acc41fb7 by Justin T. Gibbs. Traces remain, but the strength of
Copyright claims from before 2013 seems pretty weak.
Reviewed by: royger
Submitted by: Elliott Mitchell <ehem+freebsd@m5p.com>, 2021-03-17 19:09:01
Original implementation: Julien Grall <julien@xen.org>, 2015-10-20 09:14:56
Differential Revision: https://reviews.freebsd.org/D30648
[royger]
- Adjust some line lengths
- Fix comment about NR_EVENT_CHANNELS after movement.
- Use #include instead of symlinks.
This overlaps the purpose of __XEN_INTERFACE_VERSION__. Remove Xen 3.0.2
compatibility. __XEN_INTERFACE_VERSION__ has compatibility with Xen 3.2.8
enabled. As Xen 3.3 was released almost 15 years ago, it seems unlikely
anyone hasn't updated.
Reviewed by: royger
Now that the atomic macros are always genuinely atomic on x86, they can
be used for synchronization with Xen. A single-core VM isn't too
unusual, but actual single-core hardware is uncommon.
Replace an open-coding of evtchn_clear_port() with the inline.
Substantially inspired by work done by Julien Grall <julien@xen.org>,
2014-01-13 17:40:58.
Reviewed by: royger
MFC after: 1 week
This is a userland-only pointer that isn't relevant to the kernel and
doesn't belong in the ioctl structure shared between userland and the
kernel. For the kernel, the old structure for the ioctl is still
supported under COMPAT_FREEBSD13.
This changes vm_snapshot_req() in libvmmapi to accept an explicit
vmctx argument.
It also changes vm_snapshot_guest2host_addr to take an explicit vmctx
argument. As part of this change, move the declaration for this
function and its wrapper macro from vmm_snapshot.h to snapshot.h as it
is a userland-only API.
Reviewed by: corvink, markj
Differential Revision: https://reviews.freebsd.org/D38125
This replaces the 'struct vm, int vcpuid' tuple passed to most API
calls and is similar to the changes recently made in vmm(4) in the
kernel.
struct vcpu is an opaque type managed by libvmmapi. For now it stores
a pointer to the VM context and an integer id.
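Roughly what the handle holds (illustrative; the real definition is
private to libvmmapi):

    struct vcpu {
            struct vmctx    *ctx;       /* owning VM context */
            int             vcpuid;
    };

Callers obtain a handle with vm_vcpu_open(ctx, vcpuid) and release it
with vm_vcpu_close(vcpu) instead of passing (ctx, vcpuid) pairs to each
call (helper names as I understand the new API).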
As an immediate effect this removes the divergence between the kernel
and userland for the instruction emulation code introduced by the
recent vmm(4) changes.
Since this is a major change to the vmmapi API, bump VMMAPI_VERSION to
0x200 (2.0) and the shared library major version.
While here (and since the major version is bumped), remove unused
vcpu argument from vm_setup_pptdev_msi*().
Add new functions vm_suspend_all_cpus() and vm_resume_all_cpus() for
use by the debug server. The underlying ioctl (which uses a vcpuid of
-1) remains unchanged, but the userlevel API now uses separate
functions for global CPU suspend/resume.
Reviewed by: corvink, markj
Differential Revision: https://reviews.freebsd.org/D38124
vmx_snapshot() and svm_snapshot() do not save any data, and an error occurs at
resume:
Restoring kernel structs...
vm_restore_kern_struct: Kernel struct size was 0 for: vmx
Failed to restore kernel structs.
Reviewed by: corvink, markj
Fixes: 39ec056e6d ("vmm: Rework snapshotting of CPU-specific per-vCPU data.")
MFC after: 2 weeks
Sponsored by: vStack
Differential Revision: https://reviews.freebsd.org/D38476
A vcpu only checks whether a rendezvous is in progress to decide if it
should handle it. This could lead to a spurious rendezvous, where a
vcpu tries to handle a rendezvous it isn't part of. This situation is
handled properly by vm_handle_rendezvous, but it could potentially
degrade performance. Avoid it with an early check of whether the vcpu
is part of the rendezvous.
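A minimal sketch of the check (illustrative; field names follow the
style of struct vm in vmm.c):

    /* Skip the rendezvous path unless this vcpu is actually a target. */
    if (vm->rendezvous_func != NULL &&
        CPU_ISSET(vcpuid, &vm->rendezvous_req_cpus))
            vm_handle_rendezvous(vcpu);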
At the moment, rendezvous are only used to spin up application
processors and to send ioapic interrupts. Spinning up application
processors is done in the guest boot phase by sending INIT SIPI
sequences to single vcpus. This is known to cause spurious rendezvous
and only occurs in the boot phase. Sending ioapic interrupts is rare
because modern guests will use MSI, and the rendezvous is always sent to
all vcpus.
Reviewed by: jhb
MFC after: 1 week
Sponsored by: Beckhoff Automation GmbH & Co. KG
Differential Revision: https://reviews.freebsd.org/D37390
Remove the platform-specific definitions of VM_BATCHQUEUE_SIZE
for amd64 and powerpc64, and instead treat all 64-bit platforms
identically. This has the effect of increasing the arm64
and riscv VM_BATCHQUEUE_SIZE to match that of other platforms.
Reviewed by: jhb, markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D37707
Guard pmap_invlpg() definition with checks that only provide it when
both sys/pcpu.h and machine/cpufunc.h were already included.
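A sketch of the guard, assuming the standard include guards of the two
headers:

    /* pmap_invlpg() needs both the pcpu accessors and the cpufunc
     * intrinsics, so only define it once both headers are present. */
    #if defined(_SYS_PCPU_H_) && defined(_MACHINE_CPUFUNC_H_)
    static __inline void
    pmap_invlpg(pmap_t pmap, vm_offset_t va)
    {
            /* ... */
    }
    #endif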
Requested by: Elliott Mitchell
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
In particular, do not enable the workaround if INVPCID is not supported
by the core.
Reported by: "Chen, Alvin W" <Weike.Chen@Dell.com>
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D37940
A hypothetical CPU bug seems to make invalidation of global PTEs using
INVLPG in PCID mode unreliable. The workaround is applied for all
CPUs with small cores, since we do not know the scope of the issue, nor
the right fix.
Reviewed by: alc (previous version)
Discussed with: emaste, markj
Tested by: karels
PR: 261169, 266145
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D37770
Rather than waiting until the batchqueue is full to acquire the lock &
process the queue, we now start trying to acquire the lock using trylocks
when the batchqueue is 1/2 full. This removes almost all contention on the
vm pagequeue mutex for our busy sendfile() based web workload.
It also greatly reduces the amount of time a network driver ithread
remains blocked on a mutex, and eliminates some packet drops under
heavy load.
So that the system does not lose the benefit of processing large
batchqueues, I've doubled the size of the batchqueues. This way, when
there is no contention, we process the same batch size as before.
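A hedged sketch of the flush policy (not the actual vm_pagequeue code;
names are illustrative):

    /* Try to flush once half full, but only block for the pagequeue lock
     * when the (now doubled) batchqueue is completely full. */
    if (bq->bq_cnt >= bq_size / 2) {
            if (vm_pagequeue_trylock(pq) == 0) {
                    if (bq->bq_cnt < bq_size)
                            return;         /* keep batching for now */
                    vm_pagequeue_lock(pq);  /* full: we have to wait */
            }
            /* ... drain the batch into the page queue ... */
            vm_pagequeue_unlock(pq);
    }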
This has been run for several months on a busy Netflix server, as well
as on my personal desktop.
Reviewed by: markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D37305
These ioctls manipulate state in a ppt(4) device, none of which is
vCPU-specific. Mark the vcpu fields in the relevant ioctl structures
as unused, but don't remove them for now.
Reviewed by: corvink, markj
Differential Revision: https://reviews.freebsd.org/D37639
Support for rendezvous outside of a vcpu context (vcpuid of -1) was
removed in commit 949f0f47a4, and the vm, vcpuid argument pair was
replaced by a single struct vcpu pointer in commit d8be3d523d.
Reported by: andrew
The default is now the number of physical CPUs in the system rather
than 16.
Reviewed by: corvink, markj
Differential Revision: https://reviews.freebsd.org/D37175
Convert the vcpu[] array in struct vm to an array of pointers and
allocate vCPUs on first use. This avoids always allocating VM_MAXCPU
vCPUs for each VM, but instead only allocates the vCPUs in use. A new
per-VM sx lock is added to serialize attempts to allocate vCPUs on
first use. However, a given vCPU is never freed while the VM is
active, so the pointer is read via an unlocked read first to avoid the
need for the lock in the common case once the vCPU has been created.
Some ioctls need to lock all vCPUs. To prevent races with ioctls that
want to allocate a new vCPU, these ioctls also lock the sx lock that
protects vCPU creation.
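A sketch of the lookup-or-create path (helper and field names are
illustrative):

    struct vcpu *
    vm_vcpu_fetch(struct vm *vm, int vcpuid)    /* hypothetical helper */
    {
            struct vcpu *vcpu;

            /* Unlocked fast path: a vCPU is never freed while the VM is
             * active, so a non-NULL pointer can be used directly. */
            vcpu = (struct vcpu *)atomic_load_acq_ptr(
                (uintptr_t *)&vm->vcpu[vcpuid]);
            if (vcpu != NULL)
                    return (vcpu);

            sx_xlock(&vm->vcpus_init_lock);
            vcpu = vm->vcpu[vcpuid];
            if (vcpu == NULL) {
                    vcpu = vcpu_alloc(vm, vcpuid);  /* hypothetical */
                    atomic_store_rel_ptr((uintptr_t *)&vm->vcpu[vcpuid],
                        (uintptr_t)vcpu);
            }
            sx_xunlock(&vm->vcpus_init_lock);
            return (vcpu);
    }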
Reviewed by: corvink, markj
Differential Revision: https://reviews.freebsd.org/D37174
Retire the boot_state member of struct vlapic and instead use a cpuset
in the VM to track vCPUs waiting for STARTUP IPIs. INIT IPIs add
vCPUs to this set, and STARTUP IPIs remove vCPUs from the set.
STARTUP IPIs are only reported to userland for vCPUs that were removed
from the set.
In particular, this permits a subsequent change to allocate vCPUs on
demand when the vCPU may not be allocated until after a STARTUP IPI is
reported to userland.
Reviewed by: corvink, markj
Differential Revision: https://reviews.freebsd.org/D37173
Previously bhyve obtained a "read lock" on the memory map for ioctls
needing to read the map by locking the last vCPU. This is now
replaced by a new per-VM sx lock. Modifying the map requires
exclusively locking the sx lock as well as locking all existing vCPUs.
Reading the map requires either locking one vCPU or the sx lock.
This permits safely modifying or querying the memory map while some
vCPUs do not exist, which will be true in a future commit.
Reviewed by: corvink, markj
Differential Revision: https://reviews.freebsd.org/D37172
Centralize mapping vCPU IDs to struct vcpu objects in vmmdev_ioctl and
pass vcpu pointers to the routines in vmm.c. For operations that want
to perform an action on all vCPUs or on a single vCPU, pass pointers
to both the VM and the vCPU using a NULL vCPU pointer to request
global actions.
Reviewed by: corvink, markj
Differential Revision: https://reviews.freebsd.org/D37168
This passes struct vcpu down in place of struct vm and an integer
vcpu index through the in-kernel instruction emulation code. To
minimize userland disruption, helper macros are used for the vCPU
arguments passed into and through the shared instruction emulation
code.
A few other APIs used by the instruction emulation code have also been
updated to accept struct vcpu in the kernel, including
vm_get/set_register and vm_inject_fault.
Reviewed by: corvink, markj
Differential Revision: https://reviews.freebsd.org/D37161
This handles the case that guest pages are being held not on behalf of
a virtual CPU but globally. Previously this was handled by passing a
vcpuid of -1 to vm_gpa_hold, but that will not work in the future when
vm_gpa_hold is changed to accept a struct vcpu pointer.
Reviewed by: corvink, markj
Differential Revision: https://reviews.freebsd.org/D37160
The arguments identifying the VM and vCPU are only needed for
vm_copy_setup.
Reviewed by: corvink, markj
Differential Revision: https://reviews.freebsd.org/D37158