Commit graph

4747 commits

Author SHA1 Message Date
Doug Moore
8df38859d0 radix_trie: replace node count with popmap
Replace the 'count' field in a trie node with a bitmap that
identifies non-NULL children. Drop the 'last' field, and use the
last bit set in the bitmap instead.  In lookup_le, lookup_ge,
remove, and reclaim_all, use the bitmap to find the
previous/next/only/every non-null child in constant time by
examining the bitmask instead of looping across array elements
and null-checking them one-by-one.

A buildworld test suggests that this reduces the cycle count on
those functions that eliminate some null-checks by 4.9%, 1.5%,
0.0% and 13.3%.
Reviewed by:	alc
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D40775
2023-07-07 11:09:36 -05:00
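The popmap idea in this commit can be illustrated with a small sketch. It is not the vm_radix.c code: `struct node`, `NODE_COUNT`, `node_set`, `next_child`, and `prev_child` are hypothetical names, and the GCC/Clang builtins `__builtin_ctz` and `__builtin_clz` stand in for whatever bit-scan helpers the kernel uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NODE_COUNT 16	/* hypothetical fan-out; must be < 32 for this sketch */

struct node {
	uint32_t popmap;		/* bit i set <=> child[i] != NULL */
	void *child[NODE_COUNT];
};

/* Store a child in a slot and mark the slot as populated. */
static void
node_set(struct node *n, int slot, void *p)
{
	assert(slot >= 0 && slot < NODE_COUNT && p != NULL);
	n->child[slot] = p;
	n->popmap |= (uint32_t)1 << slot;
}

/* Smallest populated slot >= slot, or -1 if none; no loop over child[]. */
static int
next_child(const struct node *n, int slot)
{
	uint32_t m;

	assert(slot >= 0 && slot < NODE_COUNT);
	m = n->popmap & ~(((uint32_t)1 << slot) - 1);	/* keep bits >= slot */
	return (m == 0 ? -1 : __builtin_ctz(m));
}

/* Largest populated slot <= slot, or -1 if none. */
static int
prev_child(const struct node *n, int slot)
{
	uint32_t m;

	assert(slot >= 0 && slot < NODE_COUNT);
	m = n->popmap & ((((uint32_t)1 << slot) << 1) - 1);	/* keep bits <= slot */
	return (m == 0 ? -1 : 31 - __builtin_clz(m));
}
```

With children in slots 3 and 9, `next_child(&n, 4)` yields 9 and `prev_child(&n, 8)` yields 3 without touching the child array.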
Konstantin Belousov
ef747607ea vm_fault: move FAULT_* return codes out of range for Mach errors
This way a possible clash between FAULT_* and KERN_* numbering is
avoided, and panic checks for fault_status confusion become more
efficient.

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D40771
2023-06-28 00:03:14 +03:00
Doug Moore
da72505f9c radix_trie: pass fewer params to node_get
Let node_get calculate its own owner value. Don't pass the count
parameter, since it's always 2. Save 16 bytes in insert(). Move,
without modifying, slot and trimkey to handle a use-before-declaration
problem.
Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D40723
2023-06-27 12:21:11 -05:00
Doug Moore
9cfed089ac radix_trie: clean up overlong lines
This is purely a cosmetic change. vm_radix.c has lines that reach past
column 80 and this change cleans that up. The associated changes to
subr_pctrie.c are just to keep mirroring vm_radix.c.
Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D40764
2023-06-27 12:01:33 -05:00
Doug Moore
72c3a43b16 radix_trie: skip compare in lookup_le, lookup_ge
In _lookup_ge, where a loop "looks for an available edge or val within
the current bisection node" (to quote the code comment), the value of
index has already been modified to guarantee that it is the least
value that can be found in the non-NULL child node being
examined. Therefore, if the non-NULL child is a leaf, there's no need
to compare 'index' to anything, and the value can just be returned.

The same is true for _lookup_le with 'most' replacing 'least'.
Reviewed by:	alc
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D40746
2023-06-27 00:42:41 -05:00
Alan Cox
d8e6f4946c vm: Fix anonymous memory clustering under ASLR
By default, our ASLR implementation is supposed to cluster anonymous
memory allocations, unless the application's mmap(..., MAP_ANON, ...)
call included a non-zero address hint.  Unfortunately, clustering
never occurred because kern_mmap() always replaced the given address
hint when it was zero.  So, the ASLR implementation always believed
that a non-zero hint had been provided and randomized the mapping's
location in the address space.  To fix this problem, I'm pushing down
the point at which we convert a hint of zero to the minimum allocatable
address from kern_mmap() to vm_map_find_min().

Reviewed by:	kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D40743
2023-06-26 23:42:48 -05:00
Doug Moore
a42d8fe001 radix_trie: simplify trimkey functions
Replacing a branch and two shifts with a single masking operation saves
64 bytes in the pair of functions lookup_le and lookup_ge on amd64.
Refresh the associated comments.
Reviewed by:	alc
Differential Revision:	https://reviews.freebsd.org/D40722
2023-06-25 12:49:15 -05:00
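A rough sketch of the transformation this commit describes, under assumptions: `WIDTH` (bits per trie level), `trimkey_shift`, and `trimkey_mask` are hypothetical names, and the shift form is only a guess at the shape of the code being replaced. Both functions clear the low `level * WIDTH` bits of a key.

```c
#include <assert.h>
#include <stdint.h>

#define WIDTH 4		/* hypothetical number of key bits consumed per level */

/* Branch-and-two-shifts form: clear the low level * WIDTH bits. */
static uint64_t
trimkey_shift(uint64_t index, unsigned level)
{
	assert(level < 64 / WIDTH);
	if (level > 0) {
		index >>= level * WIDTH;
		index <<= level * WIDTH;
	}
	return (index);
}

/*
 * Masking form: negating (1 << k) in unsigned arithmetic gives a value
 * with every bit from k upward set, so one AND does the same job.
 */
static uint64_t
trimkey_mask(uint64_t index, unsigned level)
{
	assert(level < 64 / WIDTH);
	return (index & -((uint64_t)1 << (level * WIDTH)));
}
```

The two forms agree for every level in range; the masking form simply has no branch for the compiler to emit.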
Doug Moore
e8efee297c radix_trie: avoid reloading radix node
In the vm_radix:remove loop that searches for the last child, load
that child once, without loading it again after the search is over.
Change KASSERTS from index check to NULL node check.
Reviewed by:	alc
Differential Revision:	https://reviews.freebsd.org/D40721
2023-06-23 18:47:23 -05:00
Doug Moore
1efa7dbc07 vm_radix: drop unused function; use bool.
Replace boolean_t with bool in vm_radix.c. Drop the function
vm_radix_is_singleton, which is unused and has no corresponding
function in subr_pctrie.c.
Reviewed by:	alc
Differential Revision:	https://reviews.freebsd.org/D40586
2023-06-20 23:52:27 -05:00
Doug Moore
05963ea4d1 radix_trie: eliminate iteration in keydiff
Use flsll(), instead of a loop, to find where two keys differ, and
then arithmetic to transform that to a trie level.
Approved by:	alc, markj
Differential Revision:	https://reviews.freebsd.org/D40585
2023-06-20 11:30:29 -05:00
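The loop-versus-bit-scan idea can be sketched as follows. This is an illustration, not the committed code: `WIDTH`, `keydiff_loop`, and `keydiff_fls` are hypothetical names, and because `flsll()` is not part of standard C, the GCC/Clang builtin `__builtin_clzll` is used to compute the same quantity (`flsll(x) - 1` equals `63 - clzll(x)` for nonzero `x`).

```c
#include <assert.h>
#include <stdint.h>

#define WIDTH 4		/* hypothetical number of key bits consumed per level */

/* Loop form: scan down from the top level until a differing digit is found. */
static int
keydiff_loop(uint64_t a, uint64_t b)
{
	int level;

	assert(a != b);
	for (level = 64 / WIDTH - 1; level > 0; level--)
		if (((a ^ b) >> (level * WIDTH)) != 0)
			break;
	return (level);
}

/*
 * Bit-scan form: find the highest bit where the keys differ, then divide
 * by WIDTH to turn the bit position into a trie level.
 */
static int
keydiff_fls(uint64_t a, uint64_t b)
{
	assert(a != b);
	return ((63 - __builtin_clzll(a ^ b)) / WIDTH);
}
```

For keys 0x10 and 0x00 the highest differing bit is bit 4, so both forms return level 1.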
Alan Cox
58d4271721 vm_phys: Fix typo in 9e81742892
2023-06-16 03:12:42 -05:00
Doug Moore
9e81742892 vm_phys: add binary segment search
Replace several sequential searches for a segment that contains a
physical address with a call to a function that does it by binary
search.  In vm_page_reclaim_contig_domain_ext, find the first segment
to reclaim from, and reclaim from each subsequent appropriate segment.
Eliminate vm_phys_scan_contig.

Reviewed by:	alc, markj
Differential Revision:	https://reviews.freebsd.org/D40058
2023-06-16 01:43:45 -05:00
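A minimal sketch of a binary search for the segment containing an address, assuming the segments are sorted by start address and do not overlap. `struct seg` and `seg_find` are hypothetical names for illustration and are not the vm_phys.c interfaces.

```c
#include <assert.h>
#include <stdint.h>

struct seg {
	uint64_t start;		/* first address in the segment */
	uint64_t end;		/* one past the last address: [start, end) */
};

/*
 * Return the index of the segment containing pa, or -1 if pa falls in a
 * hole or outside all segments.  segs[] must be sorted and non-overlapping.
 */
static int
seg_find(const struct seg *segs, int nsegs, uint64_t pa)
{
	int lo, hi, mid;

	assert(nsegs >= 0);
	lo = 0;
	hi = nsegs;
	while (lo < hi) {
		mid = lo + (hi - lo) / 2;
		if (pa < segs[mid].start)
			hi = mid;		/* pa lies below this segment */
		else if (pa >= segs[mid].end)
			lo = mid + 1;		/* pa lies above this segment */
		else
			return (mid);
	}
	return (-1);
}
```

Each probe halves the remaining range, so the lookup takes a logarithmic rather than linear number of comparisons in the segment count.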
Mark Johnston
6062d9faf2 vm_phys: Change the return type of vm_phys_unfree_page() to bool
This is in keeping with the trend of removing uses of boolean_t, and the
sole caller was implicitly converting it to a "bool".

No functional change intended.

Reviewed by:	dougm, alc, imp, kib
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D40401
2023-06-05 12:22:11 -04:00
Colin Percival
45cc8519f5 tslog: Annotate parts of SYSINIT cpu
Booting an amd64 kernel on Firecracker with 1 CPU and 128 MB of RAM,
SYSINIT cpu takes roughly 2770 us:
* 2280 us in vm_ksubmap_init
  * 535 us in kmem_malloc
    * 450 us in pmap_zero_page
  * 1720 us in pmap_growkernel
    * 1620 us in pmap_zero_page
* 80 us in bufinit
* 480 us in cpu_setregs
  * 430 us in cpu_setregs calling load_cr0

Much of this is hypervisor overhead: load_cr0 is slow because it traps
to the hypervisor, and 99% of the time in pmap_zero_page is spent when
we first touch the page, presumably due to the host Linux kernel
faulting in backing pages one by one.

Sponsored by:	https://www.patreon.com/cperciva
Differential Revision:	https://reviews.freebsd.org/D40327
2023-06-04 10:16:35 -07:00
Warner Losh
4d846d260e spdx: The BSD-2-Clause-FreeBSD identifier is obsolete, drop -FreeBSD
The SPDX folks have obsoleted the BSD-2-Clause-FreeBSD identifier. Catch
up to that fact and revert to their recommended match of BSD-2-Clause.

Discussed with:		pfg
MFC After:		3 days
Sponsored by:		Netflix
2023-05-12 10:44:03 -06:00
Andrew Gallatin
8b0dafdb2f vm: implement vm_page_reclaim_contig_domain_ext()
Implement vm_page_reclaim_contig_domain_ext() to reclaim multiple
contiguous regions at once.  This makes reclamation more efficient
for users that need multiple contiguous regions.

This is needed because callers like ktls may need to reclaim many
contiguous regions, and each scan of physical memory can take
multiple seconds on a large memory machine (order of 100GB of
RAM).  Rather than modifying the core algorithm, I extended
vm_page_reclaim_contig_domain() to take a "desired_runs" argument to
allow the caller to request that it reclaim more than just a single
run. There is no functional change intended for all existing
callers.

The first user for this interface is the ktls code
(https://reviews.freebsd.org/D39421). By reclaiming multiple runs,
ktls goes from consuming hours of CPU to refill its buffer zone to
just seconds or minutes.

Differential Revision: https://reviews.freebsd.org/D39739
Sponsored by:	Netflix
Reviewed by:	alc, jhb, markj
2023-05-09 13:09:34 -04:00
Dimitry Andric
f74be55e30 vm: fix a number of functions to match the expected prototypes
Noticed while attempting to make boolean_t unsigned: some vm-related
function declarations and definitions were using boolean_t where they
should have used int, and vice versa.

MFC after:	1 week
Reviewed by:	jhb
Differential Revision: https://reviews.freebsd.org/D39753
2023-04-25 19:58:18 +02:00
Konstantin Belousov
1e0e335b0f amd64: fix PKRU and swapout interaction
When vm_map_remove() is called from vm_swapout_map_deactivate_pages()
due to swapout, PKRU attributes for the removed range must be kept
intact.  Provide a variant of pmap_remove(), pmap_map_delete(), to
allow pmap to distinguish between real removes of the UVA mappings
and any other internal removes, e.g. swapout.

For non-amd64, pmap_map_delete() is stubbed by define to pmap_remove().

Reported by:	andrew
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D39556
2023-04-15 02:53:59 +03:00
Konstantin Belousov
28f957b8b3 vnode_pager_input: return runningbufspace back
Both vnode_pager_input_smlfs() and vnode_pager_generic_getpages()
increment runningbufspace, but also both delegate io completion handling
on the pbuf to either plain bdone() or filesystem-specific strategy
routine. Accidentally, for e.g. UFS it is g_vfs_strategy()/g_vfs_done().
The latter calls bufdone() which handles runningbufspace reclamation.

For plain bdone() io done handler, nothing would return
accounted b_runningbufspace back. Do it in the new
helper vnode_pager_input_bdone(), as well as in
vnode_pager_generic_getpages_done() explicitly.

Note that potential multiple calls to runningbufwakeup() for the same
pbuf or buf completion are safe. runningbufwakeup() clears accounting
for the buffer, so second and later calls are nop.

The problem was found due to tarfs using small vnode pager input but not
g_vfs_strategy().

Reported by:	des
Reviewed by:	markj, sjg
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D39263
2023-03-26 00:55:29 +02:00
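The note about repeated runningbufwakeup() calls being safe can be illustrated with a simplified, single-threaded sketch. `struct fake_buf`, `total_running`, `fake_start_io`, and `fake_runningbufwakeup` are hypothetical stand-ins; the real kernel code uses atomics and wakeups that are omitted here.

```c
#include <assert.h>

/* Hypothetical stand-in for the per-buffer accounting field. */
struct fake_buf {
	long b_runningbufspace;
};

/* Hypothetical stand-in for the global in-flight byte counter. */
static long total_running;

/* Charge a buffer's size to the global counter when I/O starts. */
static void
fake_start_io(struct fake_buf *bp, long size)
{
	assert(size > 0 && bp->b_runningbufspace == 0);
	bp->b_runningbufspace = size;
	total_running += size;
}

/*
 * Return the buffer's accounted space.  The per-buffer field is cleared
 * on the first call, so a second call for the same buffer is a no-op.
 */
static void
fake_runningbufwakeup(struct fake_buf *bp)
{
	long space;

	space = bp->b_runningbufspace;
	if (space == 0)
		return;
	bp->b_runningbufspace = 0;
	total_running -= space;
}
```

Because the accounting lives in the buffer and is zeroed when returned, calling the release helper from more than one completion path cannot subtract the same space twice.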
Mateusz Guzik
0e71f4f77c vm: add unlocked page lookup before trying vm_fault_soft_fast
Shaves a read lock + tryupgrade trip most of the time.

Stats from doing a kernel build (counters not present in the tree):
vm.fault_soft_fast_ok: 262653
vm.fault_soft_fast_failed_other: 41
vm.fault_soft_fast_failed_no_page: 39595772
vm.fault_soft_fast_failed_page_busy: 1929
vm.fault_soft_fast_failed_page_invalid: 22183

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D39268
2023-03-25 22:14:59 +00:00
Mateusz Guzik
0a310c94ee vm: consistently prefix fault helpers with vm_fault_
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D39029
2023-03-13 11:00:28 +00:00
Mateusz Guzik
3c3a434f8e vm: avoid lock upgrade if possible in vm_fault_next
In my tests during buildkernel fs->m was always NULL at that stage.

Note the change has no impact on vm obj contention during said workload.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D39027
2023-03-11 21:52:01 +00:00
Mateusz Guzik
fdb1dbb1cc vm: read-locked fault handling for backing objects
This is almost the simplest patch which manages to avoid write locking
for backing objects, as a result mostly fixing vm object contention
problems.

What is not fixed:
1. cacheline ping pong due to read-locks
2. cacheline ping pong due to pip
3. cacheline ping pong due to object busying
4. write locking on first object

On top of it the use of VM_OBJECT_UNLOCK instead of explicitly tracking
the state is slower multithreaded than it needs to be, done for
simplicity for the time being.

Sample lock profiling results doing -j 104 buildkernel on tmpfs:
before:
71446200 (rw:vmobject)
14689706 (sx:vm map (user))
4166251 (rw:pmap pv list)
2799924 (spin mutex:turnstile chain)

after:
19940411 (rw:vmobject)
8166012 (rw:pmap pv list)
6017608 (sx:vm map (user))
1151416 (sleep mutex:pipe mutex)

Reviewed by:	kib
Reviewed by:	markj
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D38964
2023-03-11 11:08:21 +00:00
Mateusz Guzik
bdfd1adc99 vm: add VM_OBJECT_UNLOCK
Reviewed by:	kib
Reviewed by:	markj
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D38964
2023-03-11 11:08:21 +00:00
Mateusz Guzik
73b951cd39 vm: move up object lock asserts in fault functions
No functional changes.

Reviewed by:	kib
Reviewed by:	markj
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D38964
2023-03-11 11:08:21 +00:00
Mark Johnston
e08302f649 vm_fault: Update a comment to reflect the removal of the default pager
Fixes:	5d32157d4e ("vm_object: Modify vm_object_allocate_anon() to return OBJT_SWAP objects")
Reviewed by:	alc, kib
Differential Revision:	https://reviews.freebsd.org/D38985
2023-03-09 11:15:49 -05:00
Ed Maste
c3821149f4 Drop space in "vm object" lock name to improve wchan
Lock names are shown in top as a `*` followed by the first five
characters of the name.  `*vmobj` is a little more obvious and easier to
search for than `*vm ob`.

Differential Revision:	https://reviews.freebsd.org/D36264
2023-02-15 08:31:17 -05:00
Mark Johnston
d099194818 vm_fault: Fix a race in vm_fault_soft_fast()
When vm_fault_soft_fast() creates a mapping, it releases the VM map lock
before unbusying the top-level object.  Without the map lock, however,
nothing prevents the VM object from being deallocated while still busy.

Fix the problem by unbusying the object before releasing the VM map
lock.  If vm_fault_soft_fast() fails to create a mapping, the VM map
lock is not released, so those cases don't need to change.

Reported by:	syzkaller
Reviewed by:	kib (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D38527
2023-02-13 16:35:47 -05:00
Mateusz Guzik
bbb6228eae vm: ansify
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2023-02-13 18:23:21 +00:00
Andrew Gallatin
9cb6ba29cb vm: centralize VM_BATCHQUEUE_SIZE definition
Remove the platform-specific definitions of VM_BATCHQUEUE_SIZE
for amd64 and powerpc64, and instead treat all 64-bit platforms
identically.  This has the effect of increasing the arm64
and riscv VM_BATCHQUEUE_SIZE to match that of other platforms.

Reviewed by: jhb, markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D37707
2023-01-21 14:30:00 -05:00
Konstantin Belousov
6189672e60 Handle ERELOOKUP from VOP_FSYNC() in several other places
We need to repeat the operation if the vnode was relocked.

Reported and reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D38114
2023-01-20 03:54:56 +02:00
Konstantin Belousov
70e1b11216 vm_object.c: minor style
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2023-01-20 03:54:42 +02:00
Mark Johnston
b050ee6c97 vm_object: Fix a kernel memory disclosure via the vm_object list sysctl
Reported by:	Chris J-D <chris@accessvector.net>
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2023-01-16 11:27:54 -05:00
Mateusz Guzik
f45feecfb2 vfs: add vn_getsize
getattr is very expensive and in important cases only gets called to get
the size. This can be optimized with a dedicated routine which obtains
that statistic.

As a step towards that goal make size-only consumers use a dedicated
routine.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D37885
2022-12-28 22:43:49 +00:00
Konstantin Belousov
3249449190 vm_page_grab_valid(): clear *mp in case of pager denying page allocation
Same as it is done in other error return cases.  Callers depend on error
case returning NULL, e.g. vm_imgact_hold_page().

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D37719
2022-12-17 19:01:43 +02:00
Andrew Gallatin
1cac76c93f vm: reduce lock contention when processing vm batchqueues
Rather than waiting until the batchqueue is full to acquire the lock &
process the queue, we now start trying to acquire the lock using trylocks
when the batchqueue is 1/2 full. This removes almost all contention on the
vm pagequeue mutex for our busy sendfile() based web workload.
It also greatly reduces the amount of time a network driver ithread
remains blocked on a mutex, and eliminates some packet drops under
heavy load.

So that the system does not lose the benefit of processing large
batchqueues, I've doubled the size of the batchqueues. This way, when
there is no contention, we process the same batch size as before.

This has been run for several months on a busy Netflix server, as well
as on my personal desktop.

Reviewed by: markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D37305
2022-12-14 14:34:07 -05:00
Konstantin Belousov
645510e62e Provide consistent prototype for swp_pager_meta_free()
This should fix 32bit build breakage.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2022-12-09 17:23:09 +02:00
Konstantin Belousov
cd086696c2 vm_pager_allocate(): override resulting object type
For dynamically allocated pager type, which inherits the parent's alloc
method, type of the returned object is set to the parent's type
otherwise.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D37097
2022-12-09 14:17:03 +02:00
Konstantin Belousov
ec201dddfb vm_pager: add method to veto page allocation
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D37097
2022-12-09 14:15:37 +02:00
Konstantin Belousov
d537d1f12e vm_pager: add methods for page insertion and removal notifications
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D37097
2022-12-09 14:15:37 +02:00
Konstantin Belousov
d9dc64f158 tmpfs: make vm_object point to the tmpfs node instead of vnode
The vnode could be reclaimed and allocated again during the lifecycle of
the node, but the node cannot.  Also, referencing the node would allow
to reach it and tmpfs mount data from the object, regardless of the
state of the possibly absent vnode.

Still use swp_tmpfs for back-pointer, instead of using handle. Use of
named swap objects would incur taking the sw_alloc_sx on node allocation
and deallocation.

swp_tmpfs is renamed to swp_priv to remove the last bit of tmpfs in vm/.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D37097
2022-12-09 14:15:37 +02:00
Konstantin Belousov
baa1ccceef Make swap_pager_freespace() global
also make it return the count of the swap pages freed, which are not
simultaneously resident in the object.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D37097
2022-12-09 14:15:37 +02:00
Mitchell Horne
03d6764b38 ddb: don't limit pindex output in 'show vmopag'
This command already prints a tremendous amount of output, and properly
obeys the pager. It no longer makes sense to arbitrarily limit the pages
that are printed, as the reader will not be aware that this has
happened.

Reviewed by:	markj
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D37361
2022-11-11 14:25:39 -04:00
Anton Rang
cfbf1da0de vm_page_unswappable: remove wrong assertion
markj says:

    ...the assertion is incorrect and should simply be removed.
    It has been racy since we removed the use of the page hash
    lock to synchronize wiring of pages.

PR:		267621
Reviewed by:	markj, Anton Rang <rang@acm.org>
MFC after:	1 week
Sponsored by:	Dell Inc.
Differential Revision:	https://reviews.freebsd.org/D37320
2022-11-09 14:28:03 -06:00
Mark Johnston
2dba2288aa uma: Never pass cache zones to memguard
Items allocated from cache zones cannot usefully be protected by
memguard.

PR:		267151
Reported and tested by:	pho
MFC after:	1 week
2022-10-19 14:36:36 -04:00
Konstantin Belousov
934bfc128e Add vm_page_any_valid()
Use it and several other vm_page_*_valid() functions in more places.

Suggested and reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D37024
2022-10-19 20:24:07 +03:00
Konstantin Belousov
5bd45b2ba3 swap_pager_find_least(): assert that the function is called on the right object type
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D37024
2022-10-19 20:24:07 +03:00
Mark Johnston
2c9dc2384f vm_page: Fix a logic error in the handling of PQ_ACTIVE operations
As an optimization, vm_page_activate() avoids requeuing a page that's
already in the active queue.  A page's location in the active queue is
mostly unimportant.

When a page is unwired and placed back in the page queues,
vm_page_unwire() avoids moving pages out of PQ_ACTIVE to honour the
request, the idea being that they're likely mapped and so will simply
get bounced back in to PQ_ACTIVE during a queue scan.

In both cases, if the page was logically in PQ_ACTIVE but had not yet
been physically enqueued (i.e., the page is in a per-CPU batch), we
would end up clearing PGA_REQUEUE from the page.  Then, batch processing
would ignore the page, so it would end up unwired and not in any queues.
This can arise, for example, when a page is allocated and then
vm_page_activate() is called multiple times in quick succession.  The
result is that the page is hidden from the page daemon, so while it will
be freed when its VM object is destroyed, it cannot be reclaimed under
memory pressure.

Fix the bug: when checking if a page is in PQ_ACTIVE, only perform the
optimization if the page is physically enqueued.

PR:		256507
Fixes:		f3f38e2580 ("Start implementing queue state updates using fcmpset loops.")
Reviewed by:	alc, kib
MFC after:	1 week
Sponsored by:	E-CARD Ltd.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D36839
2022-10-05 15:12:46 -04:00
John Baldwin
a9fca3b987 Fix various places which cast a pointer to a vm_paddr_t or vice versa.
GCC warns about the mismatched sizes on i386 where vm_paddr_t is 64
bits.

Reviewed by:	imp, markj
Differential Revision:	https://reviews.freebsd.org/D36750
2022-10-03 16:10:41 -07:00
John Baldwin
f49fd63a6a kmem_malloc/free: Use void * instead of vm_offset_t for kernel pointers.
Reviewed by:	kib, markj
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D36549
2022-09-22 15:09:19 -07:00