Remove the platform-specific definitions of VM_BATCHQUEUE_SIZE
for amd64 and powerpc64, and instead treat all 64-bit platforms
identically. This has the effect of increasing the arm64
and riscv VM_BATCHQUEUE_SIZE to match that of other platforms.
Reviewed by: jhb, markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D37707
Rather than waiting until the batchqueue is full to acquire the lock &
process the queue, we now start trying to acquire the lock using trylocks
when the batchqueue is 1/2 full. This removes almost all contention on the
vm pagequeue mutex for for our busy sendfile() based web workload.
It also greadly reduces the amount of time a network driver ithread
remains blocked on a mutex, and eliminates some packet drops under
heavy load.
So that the system does not loose the benefit of processing large
batchqueues, I've doubled the size of the batchqueues. This way, when
there is no contention, we process the same batch size as before.
This has been run for several months on a busy Netflix server, as well
as on my personal desktop.
Reviewed by: markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D37305
This change adds support for transparent superpages for PowerPC64
systems using Hashed Page Tables (HPT). All pmap operations are
supported.
The changes were inspired by RISC-V implementation of superpages,
by @markj (r344106), but heavily adapted to fit PPC64 HPT architecture
and existing MMU OEA64 code.
While these changes are not better tested, superpages support is disabled by
default. To enable it, use vm.pmap.superpages_enabled=1.
In this initial implementation, when superpages are disabled, system
performance stays at the same level as without these changes. When
superpages are enabled, buildworld time increases a bit (~2%). However,
for workloads that put a heavy pressure on the TLB the performance boost
is much bigger (see HPC Challenge and pgbench on D25237).
Reviewed by: jhibbits
Sponsored by: Eldorado Research Institute (eldorado.org.br)
Differential Revision: https://reviews.freebsd.org/D25237
These definitions were repeated by all architectures, with small
variations. Consolidate the common definitons in machine
independent code and use bitset(9) macros for manipulation. Many
opportunities for deduplication remain in the machine dependent
minidump logic. The only intended functional change is increasing
the bit index type to vm_pindex_t, allowing the indexing of pages
with address of 8 TiB and greater.
Reviewed by: kib, markj
Approved by: scottl (implicit)
MFC after: 1 week
Sponsored by: Ampere Computing, Inc.
Differential Revision: https://reviews.freebsd.org/D26129
Summary:
POWER9 supports two MMU formats: traditional hashed page tables, and Radix
page tables, similar to what's presesnt on most other architectures. The
PowerISA also specifies a process table -- a table of page table pointers--
which on the POWER9 is only available with the Radix MMU, so we can take
advantage of it with the Radix MMU driver.
Written by Matt Macy.
Differential Revision: https://reviews.freebsd.org/D19516
Summary:
The existing page table is fraught with errors, since it creates a hole
in the address space bits. Fix this by taking a cue from the POWER9
radix pmap, and make the page table 4 levels, 52 bits.
Reviewed by: bdragon
Differential Revision: https://reviews.freebsd.org/D24220
Summary:
This matches r351198 from amd64. This only applies to AIM64 and Book-E.
On AIM64 it short-circuits with one domain, to behave similar to
existing. Otherwise it will allocate 16MB huge pages to hold the page
array, across all NUMA domains. On the first domain it will shift the
page array base up, to "upper-align" the page array in that domain, so
as to reduce the number of pages from the next domain appearing in this
domain. After the first domain, subsequent domains will be allocated in
full 16MB pages, until the final domain, which can be short. This means
some inner domains may have pages accounted in earlier domains.
On Book-E the page array is setup at MMU bootstrap time so that it's
always mapped in TLB1, on both 32-bit and 64-bit. This reduces the TLB0
overhead for touching the vm_page_array, which reduces up to one TLB
miss per array access.
Since page_range (vm_page_startup()) is no longer used on Book-E but is on
32-bit AIM, mark the variable as potentially unused, rather than using a
nasty #if defined() list.
Reviewed by: luporl
Differential Revision: https://reviews.freebsd.org/D21449
Summary:
Reduce the diff between AIM and Book-E even more. This also cleans up
vmparam.h significantly.
Reviewed by: luporl
Differential Revision: https://reviews.freebsd.org/D21301
doing so adds more flexibility with less redundant code.
Reviewed by: jhb, markj, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21250
The only thing blocking UMA_MD_SMALL_ALLOC from working on 64-bit booke
powerpc was a missing check in pmap_kextract(). Adding DMAP handling into
pmap_kextract(), we can now use UMA_MD_SMALL_ALLOC. This should improve
performance and stability a bit, since DMAP is always mapped in TLB1, so
this relieves pressure on TLB0.
MFC after: 3 weeks
Previous commits have made VM_MIN_KERNEL_ADDRESS its own separate entity,
and rebased the kernel around that address instead of KERNBASE. This commit
pulls the trigger to rebase KERNBASE to a physical load address. The
eventual goal is to align the address with the AIM KERNBASE, but at this
time that's not an option.
Currently a Book-E kernel must be loaded on a 64MB boundary, due to size
issues. The common load address is at the 64MB mark (0x04000000), so simply
make that the default KERNBASE.
As of this commit, Book-E kernels can be loaded and booted with ubldr.
MFC after: 3 weeks
A related future change, which changes KERNBASE for Book-E for some reason
causes a "KERNBASE redefined" error with assym.inc, even though it only changed
the value of KERNBASE and nothing else. Since machine/vmparam.h is already
included in booke/locore.S, and the requisite guards are already in place for
properly handling KERNBASE in vmparam.h, just remove it from genassym, and
include vmparam.h in the AIM locore files.
This will let us use much more KVA for ZFS ARC where needed. This may be
incresed in the future if memory requirements increase.
Discussed with: nwhitehorn
As with AIM64, map the DMAP at the beginning of the fourth "quadrant" of
memory, and move the KERNBASE to the the start of KVA.
Eventually we may run the kernel out of the DMAP, but for now, continue
booting as it has been.
accomplishes a few things:
- Makes NULL an invalid address in the kernel, which is useful for catching
bugs.
- Lays groundwork for radix-tree translation on POWER9, which requires the
direct map be at high memory.
- Similarly lays groundwork for a direct map on 64-bit Book-E.
The new base address is chosen as the base of the fourth radix quadrant
(the minimum kernel address in this translation mode) and because all
supported CPUs ignore at least the first two bits of addresses in real
mode, allowing direct-map addresses to be used in real-mode handlers.
This is required by Linux and is part of the architecture standard
starting in POWER ISA 3, so can be relied upon.
Reviewed by: jhibbits, Breno Leitao
Differential Revision: D14499
Commits r326203 and r326978 broke 64-bit booke kernels by introducing a 1MB
zero-pad between the ELF header and the start of the kernel. This didn't
cause a build failure, but caused kernels to need to be loaded into memory
1MB lower, which could easily break scripts expecting previous behavior.
This change matches the similar change made to AIM in r327358.
kernel by PHYS_TO_DMAP() as previously present on amd64, arm64, riscv, and
powerpc64. This introduces a new MI macro (PMAP_HAS_DMAP) that can be
evaluated at runtime to determine if the architecture has a direct map;
if it does not (or does) unconditionally and PMAP_HAS_DMAP is either 0 or
1, the compiler can remove the conditional logic.
As part of this, implement PHYS_TO_DMAP() on sparc64 and mips64, which had
similar things but spelled differently. 32-bit MIPS has a partial direct-map
that maps poorly to this concept and is unchanged.
Reviewed by: kib
Suggestions from: marius, alc, kib
Runtime tested on: amd64, powerpc64, powerpc, mips64
using a new macro PHYS_TO_DMAP, which deliberately has the same name as the
equivalent macro on amd64. This also sets the stage for moving the direct
map to another base address.
The data segement was too big.
Add a fix-up function like on ia32 for MAXDSIZ.
While here, bring also the MAXSSIZ closer to amd64 and add an equal fix-up
function for MAXSSIZ.
Reviewed by: jhibbits@
Obtained from: jhibbits@
Differential Revision: https://reviews.freebsd.org/D13753
is used as the bootloader on a number of PPC64 platforms. This involves the
following pieces:
- Making the first instruction a valid kernel entry point, since kexec
ignores the ELF entry value. This requires a separate section and linker
magic to prevent the linker from filling the beginning of the section
with stubs.
- Adding an entry point at 0x60 past the first instruction for systems
lacking firmware CPU shutdown support (notably PS3).
- Linker script changes to support the above.
MFC after: 1 month
Mainly focus on files that use BSD 3-Clause license.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
Extend the Book-E pmap to support 64-bit operation. Much of this was taken from
Juniper's Junos FreeBSD port. It uses a 3-level page table (page directory
list -- PP2D, page directory, page table), but has gaps in the page directory
list where regions will repeat, due to the design of the PP2D hash (a 20-bit gap
between the two parts of the index). In practice this may not be a problem
given the expanded address space. However, an alternative to this would be to
use a 4-level page table, like Linux, and possibly reduce the available address
space; Linux appears to use a 46-bit address space. Alternatively, a cache of
page directory pointers could be used to keep the overall design as-is, but
remove the gaps in the address space.
This includes a new kernel config for 64-bit QorIQ SoCs, based on MPC85XX, with
the following notes:
* The DPAA driver has not yet been ported to 64-bit so is not included in the
kernel config.
* This has been tested on the AmigaOne X5000, using a MD_ROOT compiled in
(total size kernel+mdroot must be under 64MB).
* This can run both 32-bit and 64-bit processes, and has even been tested to run
a 32-bit init with 64-bit children.
Many thanks to stevek and marcel for getting Juniper's FreeBSD patches open
sourced to be used here, and to stevek for reviewing, and providing some
historical contexts on quirks of the code.
Reviewed by: stevek
Obtained from: Juniper (in part)
MFC after: 2 months
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D9433
There are places where checks are made against VM_MAX_KERNEL_ADDRESS, or
virtual_end (set to VM_MAX_KERNEL_ADDRESS). With 32-bit checks, an address will
always be less than or equal to 0xffffffff. Drop a page, so those checks can
terminate loops safely.
Summary:
There is currently a 1GB hole between user and kernel address spaces
into which direct (1:1 PA:VA) device mappings go. This appears to go largely
unused, leaving all devices to contend with the 128MB block at the end of the
32-bit space (0xf8000000-0xffffffff). This easily fills up, and needs to be
densely packed. However, dense packing wastes precious TLB1 space, of which
there are only 16 (e500v2) or 64(e5500) entries available.
Change this by using the 1GB space for all device mappings, and allow the kernel
to use the entire upper 1GB for KVA. This also allows us to use sparse device
mappings, freeing up TLB entries.
Test Plan: Boot tested on p5020.
Differential Revision: https://reviews.freebsd.org/D5832
VM_MAX_KERNEL_ADDERESS is the maximum KVA address. 0xf8000000 is the start of
device mapping space. Since several conditional checks use '<=' against
VM_MAX_KERNEL_ADDRESS, bad things could feasibly happen.
This allows executing static clang built with -O0.
The value is configurable by a sysctl, so if one needs to clamp it down, they
still can.
Discussed with: nwhitehorn,emaste
physical address of the page to direct map address, in case
SFBUF_OPTIONAL_DIRECT_MAP returns true. The case of PowerPC AIM
64bit, where the page physical address is identical to the direct map
address, is accidental.
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
The MD allocators were very common, however there were some minor
differencies. These differencies were all consolidated in the MI allocator,
under ifdefs. The defines from machine/vmparam.h turn on features required
for a particular machine. For details look in the comment in sys/sf_buf.h.
As result no MD code left in sys/*/*/vm_machdep.c. Some arches still have
machine/sf_buf.h, which is usually quite small.
Tested by: glebius (i386), tuexen (arm32), kevlo (arm32)
Reviewed by: kib
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
words, every architecture is now auto-sizing the kmem arena. This revision
changes kmeminit() so that the definition of VM_KMEM_SIZE_SCALE becomes
mandatory and the definition of VM_KMEM_SIZE becomes optional.
Replace or eliminate all existing definitions of VM_KMEM_SIZE. With
auto-sizing enabled, VM_KMEM_SIZE effectively became an alternate spelling
for VM_KMEM_SIZE_MIN on most architectures. Use VM_KMEM_SIZE_MIN for
clarity.
Change kmeminit() so that the effect of defining VM_KMEM_SIZE is similar to
that of setting the tunable vm.kmem_size. Whereas the macros
VM_KMEM_SIZE_{MAX,MIN,SCALE} have had the same effect as the tunables
vm.kmem_size_{max,min,scale}, the effects of VM_KMEM_SIZE and vm.kmem_size
have been distinct. In particular, whereas VM_KMEM_SIZE was overridden by
VM_KMEM_SIZE_{MAX,MIN,SCALE} and vm.kmem_size_{max,min,scale}, vm.kmem_size
was not. Remedy this inconsistency. Now, VM_KMEM_SIZE can be used to set
the size of the kmem arena at compile-time without that value being
overridden by auto-sizing.
Update the nearby comments to reflect the kmem submap being replaced by the
kmem arena. Stop duplicating the auto-sizing formula in every machine-
dependent vmparam.h and place it in kmeminit() where auto-sizing takes
place.
Reviewed by: kib (an earlier version)
Sponsored by: EMC / Isilon Storage Division
- Remove explicit requirement that the SOC registers be found except as an
optimization (although the MPC85XX LAW drivers still require they be found
externally, which should change).
- Remove magic CCSRBAR_VA value.
- Allow bus_machdep.c's early-boot code to handle non 1:1 mappings and
systems not in real-mode or global 1:1 maps in early boot.
- Allow pmap_mapdev() on Book-E to reissue previous addresses if the
area is already mapped. Additionally have it check all mappings, not
just the CCSR area.
This allows the console on e500 systems to actually work on systems where
the boot loader was not kind enough to set up a 1:1 mapping before starting
the kernel.
order to match the MAXCPU concept. The change should also be useful
for consolidation and consistency.
Sponsored by: EMC / Isilon storage division
Obtained from: jeff
Reviewed by: alc
implementation specific vs. the common architecture definition.
Bring PPC4XX defines (PSL, SPR, TLB). Note the new definitions under
BOOKE_PPC4XX are not used in the code yet.
This change set is not supposed to affect existing E500 support, it's just
another reorg step before bringing support for E500mc, E5500 and PPC465.
Obtained from: AppliedMicro, Freescale, Semihalf
AIM systems to 4 GB on 32-bit systems and 2^64 bytes on 64-bit systems.
VM_MAXUSER_ADDRESS remains at 2 GB on pending Book-E, pending review of
an increase to 3 GB by those more familiar with Book-E.
architectures (i386, for example) the virtual memory space may be
constrained enough that 2MB is a large chunk. Use 64K for arches
other than amd64 and ia64, with special handling for sparc64 due to
differing hardware.
Also commit the comment changes to kmem_init_zero_region() that I
missed due to not saving the file. (Darn the unfamiliar development
environment).
Arch maintainers, please feel free to adjust ZERO_REGION_SIZE as you
see fit.
Requested by: alc
MFC after: 1 week
MFC with: r221853
now it uses a very dumb first-touch allocation policy. This will change in
the future.
- Each architecture indicates the maximum number of supported memory domains
via a new VM_NDOMAIN parameter in <machine/vmparam.h>.
- Each cpu now has a PCPU_GET(domain) member to indicate the memory domain
a CPU belongs to. Domain values are dense and numbered from 0.
- When a platform supports multiple domains, the default freelist
(VM_FREELIST_DEFAULT) is split up into N freelists, one for each domain.
The MD code is required to populate an array of mem_affinity structures.
Each entry in the array defines a range of memory (start and end) and a
domain for the range. Multiple entries may be present for a single
domain. The list is terminated by an entry where all fields are zero.
This array of structures is used to split up phys_avail[] regions that
fall in VM_FREELIST_DEFAULT into per-domain freelists.
- Each memory domain has a separate lookup-array of freelists that is
used when fulfulling a physical memory allocation. Right now the
per-domain freelists are listed in a round-robin order for each domain.
In the future a table such as the ACPI SLIT table may be used to order
the per-domain lookup lists based on the penalty for each memory domain
relative to a specific domain. The lookup lists may be examined via a
new vm.phys.lookup_lists sysctl.
- The first-touch policy is implemented by using PCPU_GET(domain) to
pick a lookup list when allocating memory.
Reviewed by: alc
Kernel sources for 64-bit PowerPC, along with build-system changes to keep
32-bit kernels compiling (build system changes for 64-bit kernels are
coming later). Existing 32-bit PowerPC kernel configurations must be
updated after this change to specify their architecture.
UMA segments at their physical addresses instead of into KVA. This emulates
the direct mapping behavior of OEA32 in an ad-hoc way. To make this work
properly required sharing the entire kernel PMAP with Open Firmware, so
ofw_pmap is transformed into a stub on 64-bit CPUs.
Also implement some more tweaks to get more mileage out of our limited
amount of KVA, principally by extending KVA into segment 16 until the
beginning of the first OFW mapping.
Reported by: linimon