Subsequent to r240317, kmem_free() was replaced with kva_free() (r254025).
kva_free() releases the KVA allocation for the mapped region, but no longer
clears the pmap (pagetable) entries.
An affected pmap_unmapdev operation would leave the still-pmap'd VA space
free for allocation by other KVA consumers. However, this bug easily
avoided notice for ~7 years because most devices (1) never call
pmap_unmapdev and (2) on amd64, mostly fit within the DMAP and do not need
KVA allocations. Other affected arch are less popular: i386, MIPS, and
PowerPC. Arm64, arm32, and riscv are not affected.
Reported by: Don Morris <dgmorris AT earthlink.net>
Submitted by: Don Morris (amd64 part)
Reviewed by: kib, markj, Don (!amd64 parts)
MFC after: I don't intend to, but you might want to
Sponsored by: Dell Isilon
Differential Revision: https://reviews.freebsd.org/D25689
Use PVO_PADDR macro to get the physical address from a PVO, instead of
explicitly ANDing pvo_pte.pa with LPTE_RPGN where it is needed. Besides
improving readability, this is needed to support superpages (D25237), where
the steps to get the PA from a PVO are different.
Reviewed by: markj
Sponsored by: Eldorado Research Institute (eldorado.org.br)
Differential Revision: https://reviews.freebsd.org/D25654
Summary:
Radix on AIM, and all of Book-E (currently), can do direct addressing of
user space, instead of needing to map user addresses into kernel space.
Take advantage of this to optimize the copy(9) functions for this
behavior, and avoid effectively NOP translations.
Test Plan: Tested on powerpcspe, powerpc64/booke, powerpc64/AIM
Reviewed by: bdragon
Differential Revision: https://reviews.freebsd.org/D25129
Summary:
The point of this addition is to cache CPU behavior 'features', to avoid
having to recompute based on CPU, etc.
The first such use case is to avoid the unnecessary manipulation of the
SLBs (Segment Lookaside Buffers) when using the Radix pmap on POWER9.
Since we already get the PCPU pointer wherever we swap the SLB entries,
we can use a cached flag to check if it's necessary to perform the
operation anyway, and skip it when not.
Reviewed by: bdragon
Differential Revision: https://reviews.freebsd.org/D24908
Found by running libc tests with radix enabled.
Detect unsigned integer wrapping with a postcondition.
Note: Radix MMU is not enabled by default yet.
Sponsored by: Tag1 Consulting, Inc.
With IFUNC support in the kernel, we can finally get rid of our poor-man's
ifunc for pmap, utilizing kobj. Since moea64 uses a second tier kobj as
well, for its own private methods, this adds a second pmap install function
(pmap_mmu_init()) to perform pmap 'post-install pre-bootstrap'
initialization, before the IFUNCs get initialized.
Reviewed by: bdragon
In this context, 0 actually means 0 (i.e. this is a li instruction).
While most assemblers will ignore this, I did have a compile failure at one
point when using an external toolchain.
In the future, we should use the li syntax to make this clearer.
Sponsored by: Tag1 Consulting, Inc.
Recent changes have caused the vmspace objects to start coming from KVA
instead of direct-mapped memory on powerpc. As far as I can tell, this is
not actually a problem, so we should stop arbitrarily asserting that it is.
I do not know why this was not being triggered before.
Approved by: jhibbits
Sponsored by: Tag1 Consulting, Inc.
Instead of crashing the user process when a D-ERAT multihit is detected, try
to flush the ERAT, and continue. This machine check indicates a likely PMAP
invalidation shortcoming that will need to be addressed, but it's
recoverable, so just recover. The recovery is pmap-specific to flush the
ERAT, so add a pmap function to do so, currently only implemented by the
POWER9 radix pmap.
x86 needs delayed TLB invalidation because invalidation requires an
expensive IPI. PowerPC has had a TLB invalidation instruction since the
POWER1 in 1990, so there's no need to delay anything.
A page (even physmem) can be marked as cache-inhibited. Attempting to use
'dcbz' to zero a page mapped cache-inhibited triggers an alignment
exception, which is fatal in kernel. This was seen when testing hardware
acceleration with X on POWER9.
At some point in the future, this should be changed to a more straight
forward zero loop instead of bzero(), and a similar change be made to the
other pmaps.
Reported by: pkubaj@
TRAP_ENTRY(0) should be TRAP_GENTRAP(0) here.
However, in practice, it doesn't matter, as the only time TRAP_ENTRY and
TRAP_GENTRAP can differ is when bridge mode is active, which is impossible
on the 64 bit kernel.
Fix it anyway in case we ever need to add a trap preamble on PPC64.
Summary:
POWER9 supports two MMU formats: traditional hashed page tables, and Radix
page tables, similar to what's presesnt on most other architectures. The
PowerISA also specifies a process table -- a table of page table pointers--
which on the POWER9 is only available with the Radix MMU, so we can take
advantage of it with the Radix MMU driver.
Written by Matt Macy.
Differential Revision: https://reviews.freebsd.org/D19516
Summary:
Some machine checks are process-recoverable, others are not. Let a
CPU-specific handler decide what to do.
This works around a machine check error hit while building www/firefox
and mail/thunderbird, which would otherwise cause the build to fail.
More work is needed to handle all possible machine check conditions, but
this is sufficient to unblock some ports building.
Differential Revision: https://reviews.freebsd.org/D23731
This is a general cleanup of the relocatable kernel support on powerpc,
needed to enable kernel ifuncs.
* Fix some relocatable issues in the kernel linker, and change to using
a RELOCATABLE_KERNEL #define instead of #ifdef __powerpc__ for parts that
other platforms can use in the future if they wish to have ET_DYN kernels.
* Get rid of the DB_STOFFS hack now that the kernel is relocated to the DMAP
properly across the board on powerpc64.
* Add powerpc64 and powerpc32 ifunc functionality.
* Allow AIM64 virtual mode OF kernels to run from the DMAP like other AIM64
by implementing a virtual mode restart. This fixes the runtime address on
PowerMac G5.
* Fix symbol relocation problems on post-relocation kernels by relocating
the symbol table.
* Add an undocumented method for supplying kernel symbols on powernv and
other powerpc machines using linux-style kernel/initrd loading -- If
you pass the kernel in as the initrd as well, the copy resident in initrd
will be used as a source for symbols when initializing the debugger.
This method is subject to removal once we have a better way of doing this.
Approved by: jhibbits
Relnotes: yes
Sponsored by: Tag1 Consulting, Inc.
Differential Revision: https://reviews.freebsd.org/D23156
Default moea64_bpvo_pool_size 327680 was insufficient for initial
memory mapping at boot time on systems with, for example, 64G and
no huge pages enabled.
Submitted by: Andre Silva <afscoelho@gmail.com>
Reviewed by: jhibbits, alfredo
Approved by: jhibbits (mentor)
Sponsored by: Eldorado Research Institute (eldorado.org.br)
Differential Revision: https://reviews.freebsd.org/D24102
PR: 244118
Reported by: Francis Little <oggy at farscape.co.uk>
Tested by: Francis Little, Mark Millard <marklmi at yahoo.com>
Reviewed by: markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23729
Remove mbuf_jumbo_alloc and let large mbuf zones use the new uma default
contig allocator (a copy of mbuf_jumbo_alloc). Tag other zones which
require contiguous objects, even if they don't use the new default
contig allocator, so that uma knows about their constraints.
Reviewed by: jeff, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D23238
In r356767, memcpy/memmove/bcopy optimizations were added to libc to
improve performance.
This exposed an existing kernel issue in VSX handling. The PSL_VSX flag was
not being excluded from the psl_userstatic set, which meant that any thread
that used these and then called swapcontext(3) would get an EINVAL error.
Fixing this exposed a second issue - in r344123, the FPU was being forced
off in set_mcontext(). However, this was neglecting to ensure VSX was turned
off at the same time.
While here, add some code comments to explain what's going on.
Reviewed by: jhibbits, luporl (earlier rev), pkubaj (earlier rev)
Sponsored by: Tag1 Consulting, Inc.
Differential Revision: https://reviews.freebsd.org/D23497
Summary:
This fixes kernel crashing when tunable "machdep.moea64_bpvo_pool_size" is
set to a value higher then 327680 (default value). Function
moea64_mid_bootstrap() relies on moea64_bpvo_pool_size, but at time of the
use the variable wan't yet updated with the new value provided by user.
Problem was detected after trying to use a VM with 64GB of RAM, and default
moea64_bpvo_pool_size is insufficient (kernel boot used more than 470000) .
I think default value must be discussed to address this use case, or find a
way to calculate pool size automatically based on amount of memory detected.
Test Plan: Tested on QEMU VM with 64GB of RAM using "set
machdep.moea64_bpvo_pool_size=655360" on loader prompt
Submitted by: Alfredo Dal'Ava Júnior (alfredo.junior_eldorado.org.br)
Differential Revision: https://reviews.freebsd.org/D23233
In rS354701, I replaced text relocations with offsets from &generictrap.
Unfortunately, the magic variable I was using doesn't actually mean the
address of &generictrap, in bridge mode it actually means &generictrap64.
So, for bridge mode to work, it is necessary to differentiate between
"where do we need to branch to to handle a trap" and "where is &generictrap
for purposes of doing relative math".
Introduce a new TRAP_ENTRY and use it instead of TRAP_GENTRAP for doing
actual calls to the generic trap handler.
Reported by: Mark Millard <marklmi@yahoo.com>
Reviewed by: jhibbits
Sponsored by: Tag1 Consulting, Inc.
Differential Revision: https://reviews.freebsd.org/D23057
Summary:
This makes the interface described in the definition file act like a
pseudo-IFUNC service, by caching the found method locally.
Applying this to the PowerPC MMU definitions, it yields a significant
(15-20%) performance improvement, seen in both a 'make buildworld' and a
parallel build of LLVM, on a POWER9 system.
Reviewed By: imp
Differential Revision: https://reviews.freebsd.org/D23245
Due to ppc32 building mmu_oea64.c (for use when in bridge mode on a G5), we
need to guard the new moea64_page_array_startup code behind __powerpc64__
to avoid a compile error, since vm_offset_t is not 64-bit on ppc32.
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D22782
This is a 32-bit structure embedded in each vm_page, consisting mostly
of page queue state. The use of a structure makes it easy to store a
snapshot of a page's queue state in a stack variable and use cmpset
loops to update that state without requiring the page lock.
This change merely adds the structure and updates references to atomic
state fields. No functional change intended.
Reviewed by: alc, jeff, kib
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D22650
Summary:
moea64_pte_sync_native() and moea64_pte_unset_native() don't need the
full PTE created, they only need to check that the PVO has a matching
PTE to the PTE in the page table. Don't waste time creating the full
PTE in this case.
Reviewed by: luporl
Differential Revision: https://reviews.freebsd.org/D22341
Summary:
This matches r351198 from amd64. This only applies to AIM64 and Book-E.
On AIM64 it short-circuits with one domain, to behave similar to
existing. Otherwise it will allocate 16MB huge pages to hold the page
array, across all NUMA domains. On the first domain it will shift the
page array base up, to "upper-align" the page array in that domain, so
as to reduce the number of pages from the next domain appearing in this
domain. After the first domain, subsequent domains will be allocated in
full 16MB pages, until the final domain, which can be short. This means
some inner domains may have pages accounted in earlier domains.
On Book-E the page array is setup at MMU bootstrap time so that it's
always mapped in TLB1, on both 32-bit and 64-bit. This reduces the TLB0
overhead for touching the vm_page_array, which reduces up to one TLB
miss per array access.
Since page_range (vm_page_startup()) is no longer used on Book-E but is on
32-bit AIM, mark the variable as potentially unused, rather than using a
nasty #if defined() list.
Reviewed by: luporl
Differential Revision: https://reviews.freebsd.org/D21449
ENOENT is leftover from mmu_oea.c's moea_pvo_enter(), where it's used to
syncicache() on the first new mapping of a page. This sync is done
differently in OEA64.
Fix wrong section ordering that was causing a ".got is not contiguous with
other relro sections" lld error. This also brings ldscript.powerpc and
ldscript.powerpcspe closer to ldscript.powerpc64.
Also, remove unnecessary text relocs from the ppc32 AIM trap code.
Approved by: jhibbits (mentor)
Differential Revision: https://reviews.freebsd.org/D22349
PowerISA 3.0 eliminated the 64-bit bridge mode which allowed 32-bit kernels
to run on 64-bit AIM/Book-S hardware. Since therefore only a 64-bit kernel
can run on this hardware, and 64-bit native always has the direct map, there
is no need to guard it.
In some scenarios, the 4K trapstk may overflow, corrupting tmpstk.
This was observed during remote debugging, with the following steps:
At remote host (R):
- enter kdb during boot
- switch to gdb backend
At local host (L):
- attach gdb to R
- try to read an invalid memory position
At R:
- a DSI trap occurs and kdb restarts (all this occurs on trapstk)
- while printing the stacktrace, trapstk overflows and corrupts tmpstk
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D22200
moea_pvo_remove() might remove the last mapping of a page, in which case
it is clearly no longer writeable. This can happen via pmap_remove(),
or when a CoW fault removes the last mapping of the old page.
Reported and tested by: bdragon
Reviewed by: alc, bdragon, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22044
The VM_PAGE_OBJECT_BUSY_ASSERT() in pmap_enter() implementation should
be only asserted when the code is executed as result of pmap_enter(),
not when the same code is entered from e.g. pmap_enter_quick(). This
is relevant for all PowerPC pmap variants, because mmu_*_enter() is
used as the backend, and assert is located there.
Add a PowerPC private pmap_enter() PMAP_ENTER_QUICK_LOCKED flag to
indicate that the call is not from pmap_enter(). For non-quick-locked
calls, assert that the object is locked.
Reported and tested by: bdragon
Reviewed by: alc, bdragon, markj
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D22041
callers hold it.
This simplifies pmap code and removes a dependency on the object lock.
Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21596
busy acquires while held.
This allows code that would need to acquire and release a very large number
of page busy locks to use the old mechanism where busy is only checked and
not held. This comes at the cost of false positives but never false
negatives which the single consumer, vm_fault_soft_fast(), handles.
Reviewed by: kib
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21592
Based on POWER9BSD implementation, with all POWER9 specific code removed and
addition of new methods in PPC64 MMU interface, to isolate platform specific
code. Currently, the new methods are implemented on pseries and PowerNV
(D21643).
Reviewed by: jhibbits
Differential Revision: https://reviews.freebsd.org/D21551
As pointed out by mjg, without the parentheses the calculations done against
these macros are incorrect, resulting in only 1/3 of locks being used.
Reported by: mjg
- Remove a dead variable from the amd64 pmap_extract_and_hold().
- Fix grammar in the vm_page_wire man page.
Reported by: alc
Reviewed by: alc, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21639
There are several mechanisms by which a vm_page reference is held,
preventing the page from being freed back to the page allocator. In
particular, holding the page's object lock is sufficient to prevent the
page from being freed; holding the busy lock or a wiring is sufficent as
well. These references are protected by the page lock, which must
therefore be acquired for many per-page operations. This results in
false sharing since the page locks are external to the vm_page
structures themselves and each lock protects multiple structures.
Transition to using an atomically updated per-page reference counter.
The object's reference is counted using a flag bit in the counter. A
second flag bit is used to atomically block new references via
pmap_extract_and_hold() while removing managed mappings of a page.
Thus, the reference count of a page is guaranteed not to increase if the
page is unbusied, unmapped, and the object's write lock is held. As
a consequence of this, the page lock no longer protects a page's
identity; operations which move pages between objects are now
synchronized solely by the objects' locks.
The vm_page_wire() and vm_page_unwire() KPIs are changed. The former
requires that either the object lock or the busy lock is held. The
latter no longer has a return value and may free the page if it releases
the last reference to that page. vm_page_unwire_noq() behaves the same
as before; the caller is responsible for checking its return value and
freeing or enqueuing the page as appropriate. vm_page_wire_mapped() is
introduced for use in pmap_extract_and_hold(). It fails if the page is
concurrently being unmapped, typically triggering a fallback to the
fault handler. vm_page_wire() no longer requires the page lock and
vm_page_unwire() now internally acquires the page lock when releasing
the last wiring of a page (since the page lock still protects a page's
queue state). In particular, synchronization details are no longer
leaked into the caller.
The change excises the page lock from several frequently executed code
paths. In particular, vm_object_terminate() no longer bounces between
page locks as it releases an object's pages, and direct I/O and
sendfile(SF_NOCACHE) completions no longer require the page lock. In
these latter cases we now get linear scalability in the common scenario
where different threads are operating on different files.
__FreeBSD_version is bumped. The DRM ports have been updated to
accomodate the KPI changes.
Reviewed by: jeff (earlier version)
Tested by: gallatin (earlier version), pho
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20486
We only call alloc_pvo_entry() with M_WAITOK from one location. However,
this can be called while holding nonsleepable locks. Rather than passing
M_WAITOK down, use vm_wait() and loop.
Summary:
MOEA64_PTE_REPLACE() is called often with the pmap lock held, and
sometimes with the page pv lock held. The less work done while holding
a lock, the better. Since we are intending to replace the same PTE
(same hash index), we don't need to recalculate anything, just flat
replace the PTE. This cuts more than 200 instructions off the
invalidating code path. In addition, we don't need to replace a PTE
that's not occupied by this PVO.
Reviewed by: luporl
Differential Revision: https://reviews.freebsd.org/D21515