When running VM on ARM64 Hyper-V, we have seen netvsc/hn driver hit
assert on reading duplicated network completion packets over vmbus
channel or one of the tx channels stalls completely. This seems to
caused by processor reordering the instructions when vmbus driver
reading or updating its channel ring buffer indexes.
Fix this by using load acquire and store release instructions to
enforce the order of these memory accesses.
PR: 271764
Reported by: Souradeep Chakrabarti <schakrabarti@microsoft.com>
Reviewed by: Souradeep Chakrabarti <schakrabarti@microsoft.com>
Tested by: whu
Sponsored by: Microsoft
Vmbus_synic_setup() is invoked via vmbus_intrhook -> vmbus_doattach
-> smp_rendezvous. On !EARLY_AP_STARTUP (e.g., aarch64), SMP isn't
functional in intrhooks and smp_rendezvous() will just call
vmbus_synic_setup() on the boot processor. There's nothing that will
initialize the pcpu data on every other AP.
To fix it we need to use SI_SUB_SMP for vmbus_doattach(). With this
patch the vmbus interrupt should work on all arm64 cpus on HyperV.
Reported by: kevans
Reviewed by: kevans, whu
Tested by: Souradeep Chakrabarti <schakrabarti@microsoft.com>
Obtained from: Souradeep Chakrabarti <schakrabarti@microsoft.com>
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D40279
In the Hyper-V drivers we need to allocate buffers shared between the
host and guest. This memory has been allocated with bus_dma, however
it doesn't use this correctly, e.g. it is missing calls to
bus_dmamap_sync. Along with this on arm64 we need this memory to be
mapped with the correct memory type that bus_dma may not use.
Switch to contigmalloc to allocate this memory as this will correctly
allocate cacheable memory.
Reviewed by: Souradeep Chakrabarti <schakrabarti@microsoft.com>
Sponsored by: Arm Ltd
Differential Revision: https://reviews.freebsd.org/D40227
The SPDX folks have obsoleted the BSD-2-Clause-FreeBSD identifier. Catch
up to that fact and revert to their recommended match of BSD-2-Clause.
Discussed with: pfg
MFC After: 3 days
Sponsored by: Netflix
In non-Hyper-V systems during Hyper-V initialization, system
initialization was getting hung, as hyperv_identify(),
was returning successful irrespective of the type of the platform.
Reviewed by: andrew, whu
Fixes: 9729f076e4
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D37219
This is the last part for ARM64 Hyper-V enablement. This includes
commone files and make file changes to enable the ARM64 FreeBSD
guest on Hyper-V. With this patch, it should be able to build
the ARM64 image and install it on Hyper-V.
Reviewed by: emaste, andrew, whu
Tested by: Souradeep Chakrabarti <schakrabarti@microsoft.com>
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D36744
Refactor the code to put split the MSR values for x86 and arm64
Hyper-V. Code not yet built. This is one of several patches for
the arm64 Hyper-V enablement.
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D37103
This is the second part of the ARM64 Hyper-V enablement.
These changes here are mostly with Make, release changes and also
changes required in vmbus.c hyperv.c and common files in hyperv.
Reviewed by: whu
Tested by: Souradeep Chakrabarti <schakrabarti@microsoft.com>
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D36467
In ARM64 gen2 Hyper-V, use IRQ resource from vmbus_res, which is owning
the IRQ for current device tree. It allows the MMIO resource to be
successfully allocated for vmbus from parent acpi_syscontainer.
Reviewed by: whu
Tested by: Souradeep Chakrabarti <schakrabarti@microsoft.com>
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D37064
The changes are to refactor the code of vmbus.c and hyperv.c to keep minimal
arch specific codes there and have them in separate files in x86/ arm64/ .
x86 is a new directory, which contains codes for x86 / x86_64. Instead of
repeating the same codes in existing amd64/ and i386/, this approach reduced
the repetition. This is first of three patches for Hyper-V enablement.
Reviewed by: whu
Tested by: Souradeep Chakrabarti <schakrabarti@microsoft.com>
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D36466
The scanning code uses Giant to coordinate its accesses to newbus as
well as to synchronize a little state within hyperv's vmbus. Switch to
the new bus_topo_* functions instead of referring to Giant explicitly.
Sponsored by: Netflix
Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D31840
This reverts commit 9ef7df022a ("hyperv: Register hyperv_timecounter
later during boot") and adds a comment explaining why the timecounter
needs to be registered as early as it is.
PR: 259878
Fixes: 9ef7df022a ("hyperv: Register hyperv_timecounter later during boot")
Reviewed by: kib
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33014
Previously the MSR-based timecounter was registered during
SI_SUB_HYPERVISOR, i.e., very early during boot, and before SI_SUB_LOCK.
After commit 621fd9dcb2 this triggers a panic since the timecounter
list lock is not yet initialized.
The hyperv timecounter does not need to be registered so early, so defer
that to SI_SUB_DRIVERS, at the same time the hyperv TSC timecounter is
registered.
Reported by: whu
Approved by: whu
Fixes: 621fd9dcb2 ("timecounter: Lock the timecounter list")
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Add vDSO support for timekeeping devices that support the KVM/XEN
paravirtual clock API.
Also, expose, in the userspace-accessible '<machine/pvclock.h>',
definitions that will be needed by 'libc' to support
'VDSO_TH_ALGO_X86_PVCLK'.
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D31418
Now that the upper layers all go through a layer to tie into these
information functions that translates an sbuf into char * and len. The
current interface suffers issues of what to do in cases of truncation,
etc. Instead, migrate all these functions to using struct sbuf and these
issues go away. The caller is also in charge of any memory allocation
and/or expansion that's needed during this process.
Create a bus_generic_child_{pnpinfo,location} and make it default. It
just returns success. This is for those busses that have no information
for these items. Migrate the now-empty routines to using this as
appropriate.
Document these new interfaces with man pages, and oversight from before.
Reviewed by: jhb, bcr
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D29937
The vmbus ISR needs to live in a trampoline. Dynamically allocating a
trampoline at driver initialization time poses some difficulties due to
the fact that the KENTER macro assumes that the offset relative to
tramp_idleptd is fixed at static link time. Another problem is that
native_lapic_ipi_alloc() uses setidt(), which assumes a fixed trampoline
offset.
Rather than fight this, move the Hyper-V ISR to i386/exception.s. Add a
new HYPERV kernel option to make this optional, and configure it by
default on i386. This is sufficient to make use of vmbus(4) after the
4/4 split. Note that vmbus cannot be loaded dynamically and both the
HYPERV option and device must be configured together. I think this is
not too onerous a requirement, since vmbus(4) was previously
non-functional.
Reported by: Harry Schmalzbauer <freebsd@omnilan.de>
Tested by: Harry Schmalzbauer <freebsd@omnilan.de>
Reviewed by: whu, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30577
Normally raw interrupt handler is provided by the kernel text. But
vmbus module registers its own handler that needs to be mapped into
userspace mapping on PTI kernels.
Reported and reviewed by: whu
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30310
Do not assume that VBE framebuffer metadata can be used. Like with the
EFI fb metadata, it may be null, so we should take care not to
dereference the null vbefb pointer. This avoids a panic when booting
-CURRENT on a gen1 VM in Azure.
Approved by: tsoome
Sponsored by: Miles AS
Differential Revision: https://reviews.freebsd.org/D27533
Implement vt_vbefb to support Vesa Bios Extensions (VBE) framebuffer with VT.
vt_vbefb is built based on vt_efifb and is assuming similar data for
initialization, use MODINFOMD_VBE_FB to identify the structure vbe_fb
in kernel metadata.
struct vbe_fb, is populated by boot loader, and is passed to kernel via
metadata payload.
Differential Revision: https://reviews.freebsd.org/D27373
On Gen2 VMs, Hyper-V provides mmio space for framebuffer.
This mmio address range is not useable for other PCI devices.
Currently only efifb driver is using this range without reserving
it from system.
Therefore, vmbus driver reserves it before any other PCI device
drivers start to request mmio addresses.
PR: 222996
Submitted by: weh@microsoft.com
Reported by: dmitry_kuleshov@ukr.net
Reviewed by: decui@microsoft.com
Sponsored by: Microsoft
This change adds Hyper-V socket feature in FreeBSD. New socket address
family AF_HYPERV and its kernel support are added.
Submitted by: Wei Hu <weh@microsoft.com>
Reviewed by: Dexuan Cui <decui@microsoft.com>
Relnotes: yes
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D24061
Add VMBus protocol version 4.0. and 5.0 to support Windows 10 and newer HyperV hosts.
For VMBus 4.0 and newer HyperV, the netvsc gpadl teardown must be done after vmbus close.
Submitted by: whu
MFC after: 2 weeks
Sponsored by: Microsoft
error in the function hypercall_memfree(), where the wrong arena was being
passed to kmem_free().
Introduce a per-page flag, VPO_KMEM_EXEC, to mark physical pages that are
mapped in kmem with execute permissions. Use this flag to determine which
arena the kmem virtual addresses are returned to.
Eliminate UMA_SLAB_KRWX. The introduction of VPO_KMEM_EXEC makes it
redundant.
Update the nearby comment for UMA_SLAB_KERNEL.
Reviewed by: kib, markj
Discussed with: jeff
Approved by: re (marius)
Differential Revision: https://reviews.freebsd.org/D16845
became unused in FreeBSD 12.x as a side-effect of the NUMA-related
changes.)
Reviewed by: kib, markj
Discussed with: jeff, re@
Differential Revision: https://reviews.freebsd.org/D16825
Ifuncs selectors dispatch copyin(9) family to the suitable variant, to
set rflags.AC around userspace access. Rflags.AC bit is cleared in
all kernel entry points unconditionally even on machines not
supporting SMAP.
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D13838
FreeBSD VM can't boot up on Hyper-V after the recent malloc change in
r335068: Make UMA and malloc(9) return non-executable memory in most cases.
The hypercall page here must be executable.
Fix the boot-up issue by adding M_EXEC.
PR: 229167
Sponsored by: Microsoft
The change makes the user and kernel address spaces on i386
independent, giving each almost the full 4G of usable virtual addresses
except for one PDE at top used for trampoline and per-CPU trampoline
stacks, and system structures that must be always mapped, namely IDT,
GDT, common TSS and LDT, and process-private TSS and LDT if allocated.
By using 1:1 mapping for the kernel text and data, it appeared
possible to eliminate assembler part of the locore.S which bootstraps
initial page table and KPTmap. The code is rewritten in C and moved
into the pmap_cold(). The comment in vmparam.h explains the KVA
layout.
There is no PCID mechanism available in protected mode, so each
kernel/user switch forth and back completely flushes the TLB, except
for the trampoline PTD region. The TLB invalidations for userspace
becomes trivial, because IPI handlers switch page tables. On the other
hand, context switches no longer need to reload %cr3.
copyout(9) was rewritten to use vm_fault_quick_hold(). An issue for
new copyout(9) is compatibility with wiring user buffers around sysctl
handlers. This explains two kind of locks for copyout ptes and
accounting of the vslock() calls. The vm_fault_quick_hold() AKA slow
path, is only tried after the 'fast path' failed, which temporary
changes mapping to the userspace and copies the data to/from small
per-cpu buffer in the trampoline. If a page fault occurs during the
copy, it is short-circuit by exception.s to not even reach C code.
The change was motivated by the need to implement the Meltdown
mitigation, but instead of KPTI the full split is done. The i386
architecture already shows the sizing problems, in particular, it is
impossible to link clang and lld with debugging. I expect that the
issues due to the virtual address space limits would only exaggerate
and the split gives more liveness to the platform.
Tested by: pho
Discussed with: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D14633
assym is only to be included by other .s files, and should never
actually be assembled by itself.
Reviewed by: imp, bdrewery (earlier)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D14180
The implementation of the Kernel Page Table Isolation (KPTI) for
amd64, first version. It provides a workaround for the 'meltdown'
vulnerability. PTI is turned off by default for now, enable with the
loader tunable vm.pmap.pti=1.
The pmap page table is split into kernel-mode table and user-mode
table. Kernel-mode table is identical to the non-PTI table, while
usermode table is obtained from kernel table by leaving userspace
mappings intact, but only leaving the following parts of the kernel
mapped:
kernel text (but not modules text)
PCPU
GDT/IDT/user LDT/task structures
IST stacks for NMI and doublefault handlers.
Kernel switches to user page table before returning to usermode, and
restores full kernel page table on the entry. Initial kernel-mode
stack for PTI trampoline is allocated in PCPU, it is only 16
qwords. Kernel entry trampoline switches page tables. then the
hardware trap frame is copied to the normal kstack, and execution
continues.
IST stacks are kept mapped and no trampoline is needed for
NMI/doublefault, but of course page table switch is performed.
On return to usermode, the trampoline is used again, iret frame is
copied to the trampoline stack, page tables are switched and iretq is
executed. The case of iretq faulting due to the invalid usermode
context is tricky, since the frame for fault is appended to the
trampoline frame. Besides copying the fault frame and original
(corrupted) frame to kstack, the fault frame must be patched to make
it look as if the fault occured on the kstack, see the comment in
doret_iret detection code in trap().
Currently kernel pages which are mapped during trampoline operation
are identical for all pmaps. They are registered using
pmap_pti_add_kva(). Besides initial registrations done during boot,
LDT and non-common TSS segments are registered if user requested their
use. In principle, they can be installed into kernel page table per
pmap with some work. Similarly, PCPU can be hidden from userspace
mapping using trampoline PCPU page, but again I do not see much
benefits besides complexity.
PDPE pages for the kernel half of the user page tables are
pre-allocated during boot because we need to know pml4 entries which
are copied to the top-level paging structure page, in advance on a new
pmap creation. I enforce this to avoid iterating over the all
existing pmaps if a new PDPE page is needed for PTI kernel mappings.
The iteration is a known problematic operation on i386.
The need to flush hidden kernel translations on the switch to user
mode make global tables (PG_G) meaningless and even harming, so PG_G
use is disabled for PTI case. Our existing use of PCID is
incompatible with PTI and is automatically disabled if PTI is
enabled. PCID can be forced on only for developer's benefit.
MCE is known to be broken, it requires IST stack to operate completely
correctly even for non-PTI case, and absolutely needs dedicated IST
stack because MCE delivery while trampoline did not switched from PTI
stack is fatal. The fix is pending.
Reviewed by: markj (partially)
Tested by: pho (previous version)
Discussed with: jeff, jhb
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Event tasks are pinned to their respective CPU by default, in the same
fashion as they were.
Unpin the event tasks by setting hw.vmbus.pin_evttask to 0, if certain
CPUs serve special purpose.
MFC after: 3 days
Sponsored by: Microsoft
For GEN1 Hyper-V, vmbus is attached to pcib0, which contains the
resources for PCI passthrough and SR-IOV. There is no
acpi_syscontainer0 on GEN1 Hyper-V.
For GEN2 Hyper-V, vmbus is attached to acpi_syscontainer0, which
contains the resources for PCI passthrough and SR-IOV. There is
no pcib0 on GEN2 Hyper-V.
The ACPI VMBUS device now only holds its _CRS, which is empty as
of this commit; its existence is mainly for upward compatibility.
Device tree structure is suggested by jhb@.
Tested-by: dexuan@
Collabrated-wth: dexuan@
MFC after: 1 week
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D10565