It is supposed to be used for ref_count manipulations when the pages
are owned by an object, but ref_count is used for something other than
the wiring, e.g. PTE population count on the page table page.
Reviewed by: markj
Sponsored by: Advanced Micro Devices (AMD)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D45910
Update pagesizes[] to include the L3 ATTR_CONTIGUOUS (L3C) page size,
which is 64KB when the base page size is 4KB and 2MB when the base page
size is 16KB.
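
As a rough illustration of the sizes above: the L3C page size is the base
page size multiplied by the number of L3 entries covered by the contiguous
hint (16 entries with 4KB base pages, 128 with 16KB base pages). The
helper names below are hypothetical and are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: number of L3 entries spanned by ATTR_CONTIGUOUS
 * for a given base page size.  Returns 0 for sizes not covered here.
 */
static size_t
ex_l3c_entries(size_t base_page_size)
{
	switch (base_page_size) {
	case 4096:
		return (16);
	case 16384:
		return (128);
	default:
		return (0);
	}
}

/* L3C page size in bytes: base page size times the entry count. */
static size_t
ex_l3c_page_size(size_t base_page_size)
{
	return (base_page_size * ex_l3c_entries(base_page_size));
}
```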
Add support for L3C pages to shm_create_largepage().
Add support for creating L3C page mappings to pmap_enter(psind=1).
Add support for reporting L3C page mappings to mincore(2) and
procstat(8).
Update vm_fault_soft_fast() and vm_fault_populate() to handle multiple
superpage sizes.
Declare arm64 as supporting two superpage reservation sizes, and
simulate two superpage reservation sizes, updating the vm_page's psind
field to reflect the correct page size from pagesizes[]. (The next
patch in this series will replace this simulation. This patch is
already big enough.)
Co-authored-by: Eliot Solomon <ehs3@rice.edu>
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D45766
Add a parameter to swp_pager_meta_build, for the benefit of
swp_pager_meta_transfer.
swp_pager_meta_transfer calls swp_pager_xfer_source, which may look up
the same trie entry twice - first, by calling swp_pager_meta_lookup,
and then as the first step in swp_pager_meta_build. A boolean
parameter to swp_pager_meta_build tells that function not to replace a
previously assigned swapblk with a new one, and setting it in this
call makes the first meta_lookup call unnecessary.
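
The "do not replace" behavior can be sketched as follows. This is a
simplified model over a plain array, not the real trie code; the names
ex_meta_build and EX_NOBLK are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define EX_NOBLK (-1L)	/* hypothetical "no swap block assigned" marker */

/*
 * Store blk at slots[idx] and return the previous value.  With noreplace
 * set, an existing assignment is kept and returned, so the caller does
 * not need a separate lookup beforehand.
 */
static long
ex_meta_build(long *slots, size_t idx, long blk, bool noreplace)
{
	long prev = slots[idx];

	if (noreplace && prev != EX_NOBLK)
		return (prev);
	slots[idx] = blk;
	return (prev);
}
```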
swp_pager_meta_transfer calls swp_pager_xfer_source, which may release
and reacquire the source object write lock, because the call to
swp_pager_meta_build may acquire and then release the destination
object write lock. But it probably doesn't, so fiddling with the
source object write lock was probably unnecessary. This boolean
parameter to swp_pager_meta_build tells it to return immediately if
memory allocation problems are about to require a lock
release/reacquisition, so that the caller can release/reacquire the
source object write lock only if truly necessary, around a second call
to swp_pager_meta_build with that boolean parameter not set. This
should make manipulation of the source object write lock rarer.
Reviewed by: alc, kib (previous version)
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D45781
Function swap_pager_swapoff_object calls vm_pager_unswapped (via
swp_pager_force_dirty) for every page that must be unswapped. That
means that there's an unneeded check for lock ownership (the caller
always owns it), a needless PCTRIE_LOOKUP (the caller has already
found it), a call to free one page of swap space only, and a check to
see if all blocks are empty, when the caller usually knows that the
check is useless.
Isolate the essential part, needed however swap_pager_unswapped is
invoked, into a smaller function swap_pager_unswapped_acct. From
swapoff_object, invoke swp_pager_update_freerange for each appropriate
page, so that there are potentially fewer calls to
swp_pager_freeswapspace. Consider freeing a set of blocks (a struct
swblk) only after having invalidated all those blocks.
Replace the doubly-nested loops with a single loop, and refetch and
rescan a swblk only when the object write lock has been released and
reacquired.
After getting a page from swap, dirty it immediately to address a race
condition observed by @kib.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D45668
vm_phys_enq_chunk() inserts a run of pages into the buddy queues. When
lazy initialization is enabled, only the first page of each run is
initialized; vm_phys_enq_chunk() thus initializes the page following the
just-inserted run.
This fails to account for the possibility that the page following the
run doesn't belong to the segment. Handle that in vm_phys_enq_chunk().
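
A minimal sketch of the bounds check described above, using a toy segment
of page indexes rather than the real vm_phys structures. All names here
are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct ex_seg {
	size_t start;	/* first page index in the segment */
	size_t end;	/* one past the last page index */
};

static bool ex_inited[32];	/* toy "page is initialized" flags */

/*
 * After inserting the run [first, first + npages), initialize the page
 * that follows it, but only if that page belongs to the segment.
 */
static void
ex_enq_chunk(const struct ex_seg *seg, size_t first, size_t npages)
{
	size_t next = first + npages;

	assert(first >= seg->start && next <= seg->end);
	if (next < seg->end)
		ex_inited[next] = true;
}
```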
Reported by: KASAN
Reported by: syzbot+1097ef4cee8dfb240e31@syzkaller.appspotmail.com
Fixes: b16b4c22d2 ("vm_page: Implement lazy page initialization")
Have PCTRIE_RECLAIM_CALLBACK typecast one function pointer type to
another, to relieve the writer of the callback function from having
to cast its first argument from void* to member type.
Reviewed by: rlibby
Differential Revision: https://reviews.freebsd.org/D45586
vm_phys_seg_paddr_to_vm_page() expects a PA that's in bounds, but
vm_phys_find_range() purposefully returns a pointer to the end of the
last page in a segment.
Fixes: 69cbb18746 ("vm_phys: Add a vm_phys_seg_paddr_to_vm_page() helper")
FreeBSD's boot times have decreased to the point where vm_page array
initialization represents a significant fraction of the total boot time.
For example, when booting FreeBSD in Firecracker (a VMM designed to
support lightweight VMs) with 128MB and 1GB of RAM, vm_page
initialization consumes 9% (3ms) and 37% (21.5ms) of the kernel boot
time, respectively. This is generally relevant in cloud environments,
where one wants to be able to spin up VMs as quickly as possible.
This patch implements lazy initialization of (most) page structures,
following a suggestion from cperciva@. The idea is to introduce a new
free pool, VM_FREEPOOL_LAZYINIT, into which all vm_page structures are
initially placed. For this to work, we need only initialize the first
free page of each chunk placed into the buddy allocator. Then, early
page allocations draw from the lazy init pool and initialize vm_page
chunks (up to 16MB, 4096 pages) on demand. Once APs are started, an
idle-priority thread drains the lazy init pool in the background to
avoid introducing extra latency in the allocator. With this scheme,
almost all of the initialization work is moved out of the critical path.
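
The on-demand part of this scheme can be sketched with a toy page array.
This is only an illustration under simplified assumptions (a single run,
no buddy queues, no locking); EX_CHUNK stands in for the 4096-page chunk
and every name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define EX_NPAGES 64
#define EX_CHUNK 16	/* stands in for the 4096-page chunk */

struct ex_page {
	bool initialized;
};

static struct ex_page ex_pages[EX_NPAGES];
static size_t ex_next_lazy;	/* first page still in the lazy pool */

/* At boot, only the first page of the single free run is initialized. */
static void
ex_boot(void)
{
	size_t i;

	for (i = 0; i < EX_NPAGES; i++)
		ex_pages[i].initialized = false;
	ex_pages[0].initialized = true;
	ex_next_lazy = 0;
}

/*
 * Initialize up to one chunk on demand and return how many pages became
 * usable.  The head of the remaining run is initialized as well, so the
 * "first page of each run is initialized" invariant still holds.
 */
static size_t
ex_lazy_refill(void)
{
	size_t i, n;

	n = EX_NPAGES - ex_next_lazy;
	if (n > EX_CHUNK)
		n = EX_CHUNK;
	for (i = 0; i < n; i++)
		ex_pages[ex_next_lazy + i].initialized = true;
	ex_next_lazy += n;
	if (ex_next_lazy < EX_NPAGES)
		ex_pages[ex_next_lazy].initialized = true;
	return (n);
}
```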
A couple of vm_phys operations require the pool to be drained before
they can run: vm_phys_find_range() and vm_phys_unfree_page(). However,
these are rare operations. I believe that
vm_phys_find_freelist_contig() does not require any special treatment,
as it only ever accesses the first page in a power-of-2-sized free page
chunk, which is always initialized.
For now the new pool is only used on amd64 and arm64, since that's where
I can easily test and those platforms would get the most benefit.
Reviewed by: alc, kib
Differential Revision: https://reviews.freebsd.org/D40403
A subsequent patch will make this factoring more worthwhile.
No functional change intended.
Reviewed by: dougm, alc, kib, emaste
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D40400
This is useful for a subsequent patch which implements lazy
initialization of vm_page structures using a dedicated vm_phys free page
pool.
No functional change intended.
Reviewed by: alc, kib, emaste
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D40399
jemalloc performs two types of virtual memory allocations: (1) large
chunks of virtual memory, where the chunk size is a multiple of a
superpage and explicitly aligned, and (2) small allocations, mostly
128KB, where no alignment is requested. Typically, it starts with a
small allocation, and over time it makes both types of allocation.
With anon_loc being updated on every allocation, we wind up with a
repeating pattern of a small allocation, a large gap, and a large,
aligned allocation. (As an aside, we wind up allocating a reservation
for these small allocations, but it will never fill because the next
large, aligned allocation updates anon_loc, leaving a gap that will
never be filled with other small allocations.)
With this change, anon_loc isn't updated on every allocation. So, the
small allocations will be clustered together, the large allocations will
be clustered together, and there will be fewer gaps between the
anonymous memory allocations. In addition, I see a small reduction in
reservations allocated (e.g., 1.6% during buildworld), fewer partially
populated reservations, and a small increase in 64KB page promotions on
arm64.
Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D39845
Replace the lookup-remove loop in swp_pager_meta_free_all with a call
to SWAP_PCTRIE_RECLAIM_CALLBACK, to eliminate repeated trie searches.
Reviewed by: rlibby
Differential Revision: https://reviews.freebsd.org/D45583
Define a page_range struct to pair up the two values passed to
freerange functions. Have swp_pager_freeswapspace also take a
page_range argument rather than a pair of arguments.
In swp_pager_meta_free_all, drop a needless test and use a new
helper function to do the cleanup for each swap block.
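
A sketch of how pairing the two values in one struct lets contiguous
blocks be coalesced before they are freed. The field names and the ex_
functions are assumptions made for illustration, not the actual
swap pager code.

```c
#include <assert.h>

struct ex_page_range {
	long start;	/* first block of the pending range */
	long num;	/* number of blocks in the range; 0 means empty */
};

static int ex_free_calls;	/* how often the "free" routine ran */
static long ex_blocks_freed;

/* Stand-in for the routine that frees swap space for a whole range. */
static void
ex_freeswapspace(const struct ex_page_range *range)
{
	if (range->num == 0)
		return;
	ex_free_calls++;
	ex_blocks_freed += range->num;
}

/* Free whatever is pending and reset the range to empty. */
static void
ex_range_flush(struct ex_page_range *range)
{
	ex_freeswapspace(range);
	range->num = 0;
}

/* Extend the range if blk is adjacent; otherwise flush and restart. */
static void
ex_range_update(struct ex_page_range *range, long blk)
{
	if (range->num != 0 && range->start + range->num == blk) {
		range->num++;
		return;
	}
	ex_range_flush(range);
	range->start = blk;
	range->num = 1;
}
```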
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D45562
Drop an unneeded test, a branch and a needless computation to save a
few instructions.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D45558
This reduces work done under vm_page_insert for large objects.
Reviewed by: alc, dougm, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D45486
Use the new pctrie combined lookup/insert. This is an easy application
of the new facility. There are other places where we do this for pages
that may need more plumbing to use combined lookup/insert.
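
The idea of a combined lookup/insert can be illustrated on a plain binary
search tree: one descent either finds the existing node or links the new
one at the point where the search ended. This is not the pctrie
interface; the names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct ex_node {
	unsigned long key;
	struct ex_node *left, *right;
};

/*
 * Search for n->key.  If it is present, return the existing node and set
 * *found.  Otherwise link n where the search ended and return n.  A
 * separate lookup followed by an insert would walk the tree twice.
 */
static struct ex_node *
ex_lookup_or_insert(struct ex_node **root, struct ex_node *n, bool *found)
{
	struct ex_node **p = root;

	while (*p != NULL) {
		if (n->key == (*p)->key) {
			*found = true;
			return (*p);
		}
		p = n->key < (*p)->key ? &(*p)->left : &(*p)->right;
	}
	n->left = n->right = NULL;
	*p = n;
	*found = false;
	return (n);
}
```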
Reviewed by: kib (previous version), dougm, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D45396
When vm_object_collapse() was changed in commit 98087a0 to call
vm_object_terminate(), rather than destroying the object directly, its
call to vm_reserv_break_all() should have been removed, as
vm_object_terminate() calls vm_reserv_break_all().
Reviewed by: kib, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D45495
Make the TLB shootdown function a pointer. By default, it still
points to the system function smp_targeted_tlb_shootdown(). This allows
other implementations to override it in the future.
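
The pattern is roughly the following. This is a userland sketch with
hypothetical names and a simplified signature; the real function takes
different arguments.

```c
#include <assert.h>

typedef void (*ex_shootdown_fn)(unsigned int cpu_mask);

static unsigned int ex_default_calls, ex_override_calls;

/* Stand-in for the default system implementation. */
static void
ex_default_shootdown(unsigned int cpu_mask)
{
	(void)cpu_mask;
	ex_default_calls++;
}

/* Stand-in for an alternative, e.g. hypervisor-assisted, implementation. */
static void
ex_alt_shootdown(unsigned int cpu_mask)
{
	(void)cpu_mask;
	ex_override_calls++;
}

/* The pointer starts out at the default and may be reassigned later. */
static ex_shootdown_fn ex_shootdown = ex_default_shootdown;

/* Callers always go through the pointer. */
static void
ex_invalidate(unsigned int cpu_mask)
{
	ex_shootdown(cpu_mask);
}
```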
Reviewed by: kib
Tested by: whu
Authored-by: Souradeep Chakrabarti <schakrabarti@microsoft.com>
Co-Authored-by: Erni Sri Satya Vennela <ernis@microsoft.com>
MFC after: 1 week
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D45174
One of these changes saves two instructions on an amd64
GENERIC-NODEBUG build. The rest are entirely cosmetic, because the
compiler can deduce that x is nonzero, and avoid the needless test.
Reviewed by: alc
Differential Revision: https://reviews.freebsd.org/D45331
Implement a simple heuristic to skip pointless promotion attempts by
pmap_enter_quick_locked() and moea64_enter(). Specifically, when
vm_fault() calls pmap_enter_quick() to map neighboring pages at the end
of a copy-on-write fault, there is no point in attempting promotion in
pmap_enter_quick_locked() and moea64_enter(). Promotion will fail
because the base pages have differing protection.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D45431
MFC after: 1 week
In three instances where fls(x)-1 is used, the compiler does not know
that x is nonzero and so adds needless zero checks. Using ilog(x)
instead saves, in each instance, about 4 instructions, including a
conditional, and 16 or so bytes, on an amd64 build.
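
To see where the extra check comes from, compare a fls-style helper, which
must handle zero, with a log2-style helper that may assume a nonzero
argument. The sketch below uses the GCC/Clang builtin __builtin_clz and
hypothetical ex_ names; it is not the kernel's implementation.

```c
#include <assert.h>
#include <limits.h>

/* fls-style: 1-based index of the highest set bit, 0 when x is 0. */
static int
ex_fls(unsigned int x)
{
	return (x == 0 ? 0 :
	    (int)(sizeof(x) * CHAR_BIT) - __builtin_clz(x));
}

/*
 * log2-style: 0-based index of the highest set bit.  The caller
 * guarantees x != 0, so no zero test is needed in the result path.
 */
static int
ex_ilog2(unsigned int x)
{
	assert(x != 0);
	return ((int)(sizeof(x) * CHAR_BIT) - 1 - __builtin_clz(x));
}
```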
Reviewed by: alc
Differential Revision: https://reviews.freebsd.org/D45330
I cannot find a time where the function was not named this.
Reviewed by: kib, markj
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D45383
The LK_NOWAIT was added to suppress a witness warning, but LK_NOWITNESS
is more what we mean. This makes pbuf_ctor() more consistent with
buf_alloc(), although, unlike buf_alloc(), for pbuf there should not be
any danger of a wild locker relying on the type stability of the buf to
attempt a lock. That is, this is essentially cosmetic.
Relevant history:
- 531f8cfea0 Use dedicated lock name for pbufs
- 5875b94c74 buf_alloc(): lock the buffer with LK_NOWAIT
- c9e023541a pbuf_ctor(): lock the buffer with LK_NOWAIT
- 1fb00c8f10 buf_alloc(): Stop using LK_NOWAIT, use LK_NOWITNESS
Reviewed by: rew, kib
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D45360
UMA_MD_SMALL_ALLOC was recently replaced by UMA_USE_DMAP, but
da76d349b6 missed some improper uses of the old symbol.
This change makes sure that UMA_USE_DMAP is used properly in
code that selects uma_small_alloc.
Fixes: da76d349b6
Reported by: eduardo, rlibby
Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D45368
This commit introduces the MINIDUMP_STARTUP_PAGE_TRACKING symbol and
uses it to simplify several instances of a complex preprocessor conditional
for adding pages allocated when bootstrapping the kernel to minidumps.
Reviewed by: markj, mhorne
Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D45085
This commit refactors the UMA small alloc code and
removes most UMA machine-dependent code.
The existing machine-dependent uma_small_alloc code is almost identical
across all architectures, except for powerpc where using the direct
map addresses involved extra steps in some cases.
The MI/MD split was replaced by a default uma_small_alloc
implementation that can be overridden by architecture-specific code by
defining the UMA_MD_SMALL_ALLOC symbol. Furthermore, UMA_USE_DMAP was
introduced to replace most UMA_MD_SMALL_ALLOC uses.
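
The "default implementation unless a symbol is defined" arrangement can be
sketched with the preprocessor as below. EX_MD_SMALL_ALLOC and
ex_small_alloc_impl are hypothetical stand-ins, used here to avoid
implying anything about the real UMA code.

```c
#include <assert.h>
#include <string.h>

#ifndef EX_MD_SMALL_ALLOC
/* Default, machine-independent implementation. */
static const char *
ex_small_alloc_impl(void)
{
	return ("default");
}
#else
/* An architecture that defines the symbol supplies its own version. */
const char *ex_small_alloc_impl(void);
#endif
```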
Reviewed by: markj, kib
Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D45084
In vm_pageout_scan_inactive, release the object lock when we go to
refill the scan batch queue so that someone else has a chance to acquire
it. This improves access latency to the object when the pagedaemon is
processing many consecutive pages from a single object, and also in any
case avoids a hiccup during refill for the last touched object.
Reviewed by: alc, markj (previous version)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D45288
Whenever a file is created, the vnode_create_vobject() function will
try to determine its size by calling vn_getsize_locked(), as size 0
is ambiguous: it means either the file size is 0 or the file size
is unknown.
Introduce a special value for the size argument: VNODE_NO_SIZE.
Only when it is given will vnode_create_vobject() try to obtain the
file's size on its own.
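
A sketch of the sentinel idea: a dedicated "no size" value keeps 0
available as a real size, and the size query runs only when the caller
asks for it. The value chosen for EX_NO_SIZE and the ex_ function names
are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define EX_NO_SIZE ((int64_t)-1)	/* hypothetical "size unknown" marker */

static int ex_size_queries;	/* counts calls to the size query */

/* Stand-in for asking the filesystem for the file size. */
static int64_t
ex_query_size(void)
{
	ex_size_queries++;
	return (4096);
}

/* Query the size only when the caller passed the sentinel. */
static int64_t
ex_object_size(int64_t size)
{
	if (size == EX_NO_SIZE)
		size = ex_query_size();
	return (size);
}
```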
Introduce a dedicated vnode_disk_create_vobject() for use by
g_vfs_open(), so we don't have to call vn_isdisk() in the common case
(for regular files).
Handle the case of mediasize==0 in g_vfs_open().
Reviewed by: alc, kib, markj, olce
Approved by: oshogbo (mentor), allanjude (mentor)
Differential Revision: https://reviews.freebsd.org/D45244
per allocated vm_object. Otherwise, since constructors are not
idempotent, we leak, for example, a device reference in the case of a non-managed pager.
PR: 278826
Reported by: Austin Zhang <austin.zhang@dell.com>
Reviewed by: alc, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D45113
vm_object_page_remove() wants to busy the page, but that won't work
here. (Kernel stack pages are always busy.)
Make the error handling path look more like vm_thread_stack_dispose().
Reported by: pho
Reviewed by: kib, bnovkov
Fixes: 7a79d06697 ("vm: improve kstack_object pindex calculation to avoid pindex holes")
Differential Revision: https://reviews.freebsd.org/D45019
fork() may allocate a new thread in one of two ways: from UMA, or cached
in a freed proc that was just allocated from UMA. In either case, KASAN
and KMSAN need to initialize some state; in particular they need to
initialize the shadow mapping of the new thread's stack.
This is done differently between KASAN and KMSAN, which is confusing.
This patch improves things a bit:
- Add a new thread_recycle() function, which moves all kernel stack
handling out of kern_fork.c, since it doesn't really belong there.
- Then, thread_alloc_stack() has only one local caller, so just inline
it.
- Avoid redundant shadow stack initialization: thread_alloc()
initializes the KMSAN shadow stack (via kmsan_thread_alloc()) even
though vm_thread_new() already did that.
- Add kasan_thread_alloc(), for consistency with kmsan_thread_alloc().
No functional change intended.
Reviewed by: khng
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D44891
This commit replaces the linear transformation of kernel virtual
addresses to kstack_object pindex values with a non-linear
scheme that circumvents physical memory fragmentation caused by
kernel stack guard pages. The new mapping scheme is used to
effectively "skip" guard pages and assign pindices for
non-guard pages in a contiguous fashion.
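
One way to picture such a mapping, assuming each stack slot in the arena
consists of gpages guard pages followed by kpages stack pages. The layout
and the function name are assumptions for illustration, not the formula
used by the commit.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Map the linear page index of a stack page within the arena to an
 * object pindex, skipping guard pages so that pindices stay contiguous.
 */
static size_t
ex_kstack_pindex(size_t linear_idx, size_t kpages, size_t gpages)
{
	size_t slot = linear_idx / (kpages + gpages);
	size_t off = linear_idx % (kpages + gpages);

	assert(off >= gpages);	/* guard pages have no pindex */
	return (slot * kpages + (off - gpages));
}
```

With kpages = 4 and gpages = 1, the first stack occupies pindices 0 to 3
and the second starts at pindex 4, whereas a linear mapping would leave a
hole for every guard page.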
The new allocation scheme requires that all default-sized kstack KVAs
come from a separate, specially aligned region of the KVA space.
For this to work, this commit introduces a dedicated per-domain
kstack KVA arena used to allocate kernel stacks of default size.
The behaviour on 32-bit platforms remains unchanged due to a
significantly smaller KVA space.
Aside from fulfilling the requirements imposed by the new scheme, a
separate kstack KVA arena facilitates superpage promotion in the rest
of kernel and causes most kstacks to have guard pages at both ends.
Reviewed by: alc, kib, markj
Tested by: markj
Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D38852
This fixes compiler warnings when -Wunused-arguments is enabled and
not quieted.
Reviewed by: kib, markj
Obtained from: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D44623
The swap pager itself allocates readahead pages, so should take care to
unbusy them after a read error, just as it does in the non-error case.
PR: 277538
Reviewed by: olce, dougm, alc, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D44646
Add a function to check whether an aligned block of vm pages is
allocated, for use with impending changes to arm64 superpage
management.
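
Such a check might look like the sketch below, which models allocation
state as a simple array of flags. The function name and the
representation are hypothetical; the real code works on reservation
data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true if every page in the npages-aligned block containing
 * index is allocated.  npages must be a power of two.
 */
static bool
ex_block_is_allocated(const bool *allocated, size_t index, size_t npages)
{
	size_t i, start;

	assert(npages != 0 && (npages & (npages - 1)) == 0);
	start = index & ~(npages - 1);	/* round down to block start */
	for (i = 0; i < npages; i++)
		if (!allocated[start + i])
			return (false);
	return (true);
}
```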
Reviewed by: alc
Differential Revision: http://reviews.freebsd.org/D44575
Remove sys/cdefs.h and sys/socket.h includes.
Order sys/ includes alphabetically.
Do not check for NULL before free().
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D43444
The 'skip clean blocks' loop, which checks for clean blocks in the dirty
pages, might end up setting in_hole to true when exactly at EOF in
the middle of the block, without advancing the prev_offset value. Then
the next block is not dirty, and next_offset is clipped back to poffset
+ maxsize, equal to prev_offset, failing the assertion.
Instead of asserting prev_offset < next_offset, we must skip the write.
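
In outline, the change swaps an assertion for a skip of empty ranges. The
sketch below counts the writes that would be issued for a list of
candidate ranges; the struct and function are hypothetical and leave out
the actual page scanning.

```c
#include <assert.h>
#include <stddef.h>

struct ex_range {
	long prev_offset;	/* start of the candidate write */
	long next_offset;	/* end of the candidate write */
};

/* Count writes, skipping ranges where nothing lies between the offsets. */
static int
ex_count_writes(const struct ex_range *ranges, size_t n)
{
	size_t i;
	int writes = 0;

	for (i = 0; i < n; i++) {
		/* Previously asserted prev_offset < next_offset here. */
		if (ranges[i].next_offset <= ranges[i].prev_offset)
			continue;
		writes++;
	}
	return (writes);
}
```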
Reported by: asomers
PR: 276191
Reviewed by: alc, markj
Tested by: asomers
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D43358