Commit graph

20267 commits

Author SHA1 Message Date
Konstantin Belousov
af96ccc6a5 uifree(9): report non-zero values for all shared resources
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2024-09-21 00:08:51 +03:00
Konstantin Belousov
a52b30ff98 sys_pipe: consistently use cr_ruidinfo for accounting of pipebuf
Tested by:	yasu
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2024-09-21 00:08:51 +03:00
Konstantin Belousov
40769168a5 pipespace_new(): decrease uidinfo pipebuf usage if reservation check failed
Submitted by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2024-09-20 17:03:45 +03:00
Konstantin Belousov
d6074f73af pipe: use pipe subsystem KVA counter instead of pipe_map size
to calculate the superuser-reserved amount of the pipe space

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2024-09-20 17:03:45 +03:00
Mark Johnston
283bf3b4b1 socket: Only log splice structs to ktrace if KTR_STRUCT is configured
Fixes:	a1da7dc1cd ("socket: Implement SO_SPLICE")
2024-09-20 11:40:31 +00:00
Siva Mahadevan
75cd1e534c socket: wrap ktrsplice call with KTRACE ifdef
This fixes a build error when the kernel is built without KTRACE
support.

Reviewed by:	emaste, markj
Fixes:		a1da7dc1cd ("socket: Implement SO_SPLICE")
Pull Request:	https://github.com/freebsd/freebsd-src/pull/1426
2024-09-20 11:34:04 +00:00
Konstantin Belousov
7672cbef2c pipes: reserve configured percentage of buffers zone to superuser
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D46619
2024-09-20 09:46:07 +03:00
Konstantin Belousov
3458bbd397 kernel: add RLIMIT_PIPEBUF
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D46619
2024-09-20 09:46:06 +03:00
Simon J. Gerraty
4a5fa10861 procfs require PRIV_PROC_MEM_WRITE to write mem
Add a priv_check for PRIV_PROC_MEM_WRITE which will be blocked
by mac_veriexec if being enforced, unless the process has a maclabel
to grant priv.

Reviewed by:	stevek
Sponsored by:	Juniper Networks, Inc.
Differential Revision:	https://reviews.freebsd.org/D46692
2024-09-19 13:10:27 -07:00
Doug Moore
fd1d666289 pctrie: create iterator
Define a pctrie iterator type. A pctrie iterator is a wrapper around a
pctrie that remembers a position in the trie where the last search
left off, and where a new search can resume. When the next search is
for an item very near in the trie to where the last search left off,
iter-based search is faster because instead of starting from the root,
the search usually only has to back up one or two steps up the
root-to-last-search path to find the branch that leads to the new
search target.

Every kind of lookup (plain, lookup_ge, lookup_le) that can begin with
the trie root can begin with an iterator instead. An iterator can also
do a relative search ("look for the item 4 greater than the last item
I found") because it remembers where that last search ended. It can
also search within limits ("look for the item bigger than this one,
but it has to be less than 100"), which can save time when the next
item is beyond the limits and that is known before we actually know what
that item is. An iterator can also be used to remove an item that
has already been found, without having to search for it again.
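The iterator idea can be sketched on a toy structure. The following is a hypothetical two-level, 16-way radix tree over 8-bit keys, not the pctrie(9) API: `struct trie`, `struct trie_iter`, `iter_lookup()`, `iter_stride()` and `iter_lookup_ge_limit()` are all invented for illustration. The iterator caches the leaf where the last search ended, so a search for a nearby key skips the descent from the root. `iter_lookup_ge_limit()` is a naive linear scan that only demonstrates the "stop at the limit" semantics.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical toy radix tree: 16 leaves of 16 slots, 8-bit keys. */
struct leaf {
	void *slot[16];			/* indexed by the low nibble */
};

struct trie {
	struct leaf *child[16];		/* indexed by the high nibble */
};

/* Hypothetical iterator: remembers where the last search ended. */
struct trie_iter {
	const struct trie *trie;
	const struct leaf *leaf;	/* cached leaf, or NULL */
	unsigned hi;			/* high nibble covered by 'leaf' */
	unsigned index;			/* key of the last search */
	unsigned root_descents;		/* searches that restarted at the root */
};

static int
trie_insert(struct trie *t, uint8_t key, void *val)
{
	unsigned hi = key >> 4;

	if (t->child[hi] == NULL &&
	    (t->child[hi] = calloc(1, sizeof(struct leaf))) == NULL)
		return (-1);
	t->child[hi]->slot[key & 0xf] = val;
	return (0);
}

static void
iter_init(struct trie_iter *it, const struct trie *t)
{
	it->trie = t;
	it->leaf = NULL;
	it->hi = 0;
	it->index = 0;
	it->root_descents = 0;
}

/* Plain lookup: reuse the cached leaf when it covers 'key'. */
static void *
iter_lookup(struct trie_iter *it, uint8_t key)
{
	unsigned hi = key >> 4;

	it->index = key;
	if (it->leaf == NULL || it->hi != hi) {
		/* Not on the remembered path: back up to the root. */
		it->leaf = it->trie->child[hi];
		it->hi = hi;
		it->root_descents++;
		if (it->leaf == NULL)
			return (NULL);
	}
	return (it->leaf->slot[key & 0xf]);
}

/* Relative search: "the item 'delta' greater than the last one". */
static void *
iter_stride(struct trie_iter *it, unsigned delta)
{
	unsigned key = it->index + delta;

	return (key > UINT8_MAX ? NULL : iter_lookup(it, (uint8_t)key));
}

/* Bounded lookup_ge: first item with key in [key, limit), else NULL. */
static void *
iter_lookup_ge_limit(struct trie_iter *it, unsigned key, unsigned limit)
{
	void *v;

	for (; key < limit && key <= UINT8_MAX; key++) {
		if ((v = iter_lookup(it, (uint8_t)key)) != NULL)
			return (v);
	}
	return (NULL);
}
```

For example, after `iter_lookup(&it, 0x10)`, a call to `iter_stride(&it, 4)` finds key 0x14 in the cached leaf without restarting at the root, and a search bounded by a limit gives up as soon as the limit is reached.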

Iterators are vulnerable to unsynchronized data changes. If the
iterator is created with a lock held, and that lock is released and
acquired again, there's no guarantee that the iterator path remains
valid.

Reviewed by:	markj
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D45627
2024-09-13 10:36:54 -05:00
Kristof Provost
299175f2e5 Revert "Assert that mbufs are writable if we write to them"
This reverts commit f08247fd88.

This assertion is triggered by
ktls_test:ktls_transmit_aes128_cbc_1_0_sha1_control. Remove the assertion until
we fully understand why.

Sponsored by:	Rubicon Communications, LLC ("Netgate")
2024-09-11 17:04:35 +02:00
Kristof Provost
f08247fd88 Assert that mbufs are writable if we write to them
m_copyback() modifies the mbuf, so it must be a writable mbuf.

Reviewed by:	glebius
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D46627
2024-09-11 13:17:48 +02:00
Andrew Turner
d29771a722 arm: Assume __ARM_ARCH == 7
The only supported 32-bit Arm architecture is Armv7. Remove old checks
for earlier architecture revisions.

Sponsored by:	Arm Ltd
Differential Revision:	https://reviews.freebsd.org/D45957
2024-09-11 10:40:13 +00:00
Mark Johnston
a1da7dc1cd socket: Implement SO_SPLICE
This is a feature which allows one to splice two TCP sockets together
such that data which arrives on one socket is automatically pushed into
the send buffer of the spliced socket.  This can be used to make TCP
proxying more efficient as it eliminates the need to copy data into and
out of userspace.
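For comparison, here is a minimal sketch of the userspace copy loop that a TCP proxy runs without SO_SPLICE; every byte is copied out of the kernel by read(2) and back in by write(2). The function `relay()` is hypothetical, and its `max` argument (0 meaning "no limit") only loosely mirrors the `max` field of the splice struct described below.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>	/* socketpair(2), handy for trying it out */
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical relay: copy up to 'max' bytes (0 = unlimited) from 'src'
 * to 'dst'.  Returns the number of bytes relayed, or -1 on error.
 */
static off_t
relay(int src, int dst, off_t max)
{
	char buf[4096];
	off_t total = 0;

	while (max == 0 || total < max) {
		size_t want = sizeof(buf);
		ssize_t n, off;

		if (max != 0 && (off_t)want > max - total)
			want = (size_t)(max - total);
		n = read(src, buf, want);	/* copy out of the kernel */
		if (n == 0)
			break;			/* EOF from the peer */
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return (-1);
		}
		for (off = 0; off < n;) {	/* copy back into the kernel */
			ssize_t w = write(dst, buf + off, (size_t)(n - off));

			if (w < 0) {
				if (errno == EINTR)
					continue;
				return (-1);
			}
			off += w;
		}
		total += n;
	}
	return (total);
}
```

With a spliced socket, the kernel's worker thread performs the equivalent transfer directly between socket buffers, so neither copy through `buf` happens.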

The interface is copied from OpenBSD, and this implementation aims to be
compatible.  Splicing is enabled by setting the SO_SPLICE socket option.
When spliced, data that arrives on the receive buffer is automatically
forwarded to the other socket.  In particular, splicing is a
unidirectional operation; to splice a socket pair in both directions,
SO_SPLICE needs to be applied to both sockets.  More concretely, when
setting the option one passes the following struct:

    struct splice {
	    int fd;
	    off_t max;
	    struct timeval idle;
    };

where "fd" refers to the socket to which the first socket is to be
spliced, and two setsockopt(SO_SPLICE) calls are required to set up a
bi-directional splice.

select(), poll() and kevent() do not return when data arrives in the
receive buffer of a spliced socket, as such data is expected to be
removed automatically once space is available in the corresponding send
buffer.  Userspace can perform I/O on spliced sockets, but it will be
unpredictably interleaved with splice I/O.

A splice can be configured to unsplice once a certain number of bytes
have been transmitted, or after a given time period.  Once unspliced,
the socket behaves normally from userspace's perspective.  The number of
bytes transmitted via the splice can be retrieved using
getsockopt(SO_SPLICE); this works after unsplicing as well, up until the
socket is closed or spliced again.  Userspace can also manually trigger
unsplicing by splicing to -1.

Splicing work is handled by dedicated threads, similar to KTLS.  A
worker thread is assigned at splice creation time.  At some point it
would be nice to have a direct dispatch mode, wherein the thread which
places data into a receive buffer is also responsible for pushing it
into the sink, but this requires tighter integration with the protocol
stack in order to avoid reentrancy problems.

Currently, sowakeup() and related functions will signal the worker
thread assigned to a spliced socket.  so_splice_xfer() does the hard
work of moving data between socket buffers.

Co-authored by:	gallatin
Reviewed by:	brooks (interface bits)
MFC after:	3 months
Sponsored by:	Klara, Inc.
Sponsored by:	Stormshield
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D46411
2024-09-10 16:51:37 +00:00
Maxim Sobolev
a43fb3653b mbuf: improve KASSERT(9) failure messages in the m_apply()
- Make less ambiguous;
- extend to provide more context for post-mortem.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D43776
MFC after:	2 weeks
2024-09-09 19:30:28 -07:00
Doug Moore
8aa2cd9d13 rangeset: speed up range traversal
For rangeset-next search, use exact search rather than greater-than search.

Move a bit of the testing logic from the pmap code to the common rangeset code.

Reviewed by:	kib (previous version)
Tested by:	pho (previous version)
Differential Revision:	https://reviews.freebsd.org/D46314
2024-09-09 16:50:14 -05:00
Sebastian Huber
66145c3829 ntptime: Use time_t for tv_sec related variables
The struct timespec tv_sec member is of type time_t.  Make sure that all
variables related to this member are of the type time_t.  This is important for
targets where long is a 32-bit type and time_t a 64-bit type.
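A small illustration of why the type matters, assuming a target with a 64-bit time_t: the helper `tv_sec_fits_int32()` (hypothetical) checks whether a tv_sec value survives being stored in a 32-bit signed variable such as `long` on ILP32. The first value that does not fit is 2147483648, i.e. 2038-01-19 03:14:08 UTC.

```c
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/*
 * Hypothetical check: does 'sec' fit a 32-bit signed integer (the width
 * of 'long' on ILP32 targets)?  Values outside this range would be
 * mangled if stored in such a variable instead of a time_t.
 */
static bool
tv_sec_fits_int32(time_t sec)
{
	return (sec >= INT32_MIN && sec <= INT32_MAX);
}
```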

Reviewed by: imp
Pull Request: https://github.com/freebsd/freebsd-src/pull/1373
2024-09-06 12:34:32 -06:00
Sebastian Huber
07d90ee0a6 kvprintf(): Fix '+' conversion handling
For example, printf("%+i", 1) prints "+1".  However, kvprintf()
printed just "1" for this example.  According to PRINTF(3):

  A sign must always be placed before a number produced by a signed
  conversion.

For "%+r" radix conversions, keep the "+" handling as it is, since this
is a non-standard conversion.  For "%+p" pointer conversions, continue
to ignore the sign modifier to be in line with libc.
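The rule for signed conversions can be sketched as follows. This is not kvprintf()'s code; `sign_prefix()` and `fmt_signed()` are hypothetical helpers that pick the sign the way printf(3) does with and without the '+' flag, and the result can be compared against libc's "%+i".

```c
#include <stdio.h>

/* Sign to print for a signed conversion, following the printf(3) rule. */
static const char *
sign_prefix(long long v, int plus_flag)
{
	if (v < 0)
		return ("-");
	return (plus_flag ? "+" : "");
}

/* Hypothetical "%d"-style formatter that honors the '+' flag. */
static int
fmt_signed(char *buf, size_t len, long long v, int plus_flag)
{
	/* Negate in unsigned arithmetic so LLONG_MIN does not overflow. */
	unsigned long long mag = v < 0 ? 0ULL - (unsigned long long)v :
	    (unsigned long long)v;

	return (snprintf(buf, len, "%s%llu", sign_prefix(v, plus_flag), mag));
}
```

With the flag, zero and positive values get a leading "+" ("+0", "+1"); without it, only negative values carry a sign.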

This change makes it possible to support the ' conversion modifier in the future.

Reviewed by: imp
Pull Request: https://github.com/freebsd/freebsd-src/pull/1310
2024-09-06 12:34:30 -06:00
Konstantin Belousov
79eba754be vop_stdadvise(): restore correct handling of length == 0
Switch to unsigned arithmetic to handle overflow without relying on
-fwrapv, and specially treat the case of length == 0 from posix_fadvise(),
which passes OFF_MAX as the end to the VOP.  There, roundup() overflows and
-fwrapv causes bend and endn to become negative.  Using uintmax_t gives
roundup() room to not wrap.
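The overflow can be reproduced in isolation. This is not the vop_stdadvise() code: `end_block()` is a hypothetical helper, and `ROUNDUP()` copies the shape of the roundup() macro from FreeBSD's <sys/param.h>. OFF_MAX equals INT64_MAX for FreeBSD's 64-bit off_t, so rounding it up in signed arithmetic overflows, while widening to uintmax_t first leaves room for the result (2^63 for a 4096-byte block size). The sketch assumes `end >= 0`.

```c
#include <stdint.h>

/* Same shape as roundup() in FreeBSD's <sys/param.h>. */
#define	ROUNDUP(x, y)	((((x) + ((y) - 1)) / (y)) * (y))

/*
 * Hypothetical helper: index of the first block boundary at or past 'end'.
 * The cast to uintmax_t happens before ROUNDUP(), so end == INT64_MAX
 * does not overflow.
 */
static uintmax_t
end_block(int64_t end, unsigned bsize)
{
	return (ROUNDUP((uintmax_t)end, bsize) / bsize);
}
```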

Also remove locals with single use, and move calculations out from under
the bo lock.

Reported by:	tmunro
Reviewed by:	markj, tmunro
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D46518
2024-09-05 03:40:14 +03:00
Konstantin Belousov
e28ee29d2d vfs_default.c: trim whitespace
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2024-09-05 00:58:51 +03:00
Olivier Certner
c75a18905e umtx: shm: 'ushm_refcnt > 0' => 'ushm_refcnt != 0'
'ushm_refcnt' is unsigned.  Don't leave the impression it isn't.

No functional change (intended).

Reviewed by:    kib
Approved by:    emaste (mentor)
Sponsored by:   The FreeBSD Foundation
Differential Revision:  https://reviews.freebsd.org/D46126
2024-09-04 14:38:12 +00:00
Olivier Certner
c3e6dfe55c umtx: shm: Prevent reference counting overflow
This hardens against provoked use-after-free occurrences should there be
reference counting leaks in the future (which is currently not the
case).

At the deepest level, umtx_shm_find_reg_unlocked() now returns EOVERFLOW
when it cannot grant an additional reference to the registry object, and
so will umtx_shm_find_reg().  umtx_shm_create_reg() will fail if calling
umtx_shm_find_reg() returns EOVERFLOW (meaning a SHM object for the
passed key already exists, but we can't acquire another reference on
it), avoiding the creation of a duplicate registry entry for a given key
(this wouldn't pose a problem for the rest of the code in its current
form, but is expressly avoided for intelligibility and hardening
purposes).

Since umtx_shm_find_reg*(), and consequently the whole _umtx_op() system
call, can only return EOVERFLOW on such a bug manifesting, we don't
document that return value.
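The hardening pattern can be sketched as a saturating acquire. `struct shm_reg` and `reg_ref()` are hypothetical stand-ins, not the umtx code; the sketch assumes, as in the real code, that the unsigned counter is only touched with a lock held (umtx_shm_lock), so a plain compare-and-increment is sufficient.

```c
#include <errno.h>
#include <limits.h>

/* Hypothetical registry entry; 'refcnt' is protected by an external lock. */
struct shm_reg {
	unsigned int refcnt;
};

/* Grant one more reference, or fail with EOVERFLOW instead of wrapping. */
static int
reg_ref(struct shm_reg *reg)
{
	if (reg->refcnt == UINT_MAX)
		return (EOVERFLOW);
	reg->refcnt++;
	return (0);
}
```

Refusing the reference keeps a leaked count from wrapping to a small value, which is what would otherwise allow an early free followed by use-after-free.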

Reviewed by:    kib, emaste
Approved by:    emaste (mentor)
Sponsored by:   The FreeBSD Foundation
Differential Revision:  https://reviews.freebsd.org/D46126
2024-09-04 14:38:12 +00:00
Olivier Certner
62f40433ab umtx: shm: Fix use-after-free due to multiple drops of the registry reference
umtx_shm_unref_reg_locked() would unconditionally drop the "registry"
reference, tied to USHMF_LINKED.

This is not a problem for caller umtx_shm_object_terminated(), which
operates under the 'umtx_shm_lock' lock end-to-end, but it is for
indirect caller umtx_shm(), which drops the lock between
umtx_shm_find_reg() and the call to umtx_shm_unref_reg(true) that
deregisters the umtx shared region (from 'umtx_shm_registry';
umtx_shm_find_reg() only finds registered shared mutexes).

Thus, two concurrent user-space callers of _umtx_op() with UMTX_OP_SHM
and flags UMTX_SHM_DESTROY, both progressing past umtx_shm_find_reg()
but before umtx_shm_unref_reg(true), would then decrease the reference
count twice for the single reference standing for the shared mutex's
registration.
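A simplified, single-threaded model of the problem: the registry reference must be dropped only by the caller that still finds the entry linked. `struct reg` and `reg_unref()` are hypothetical and are not the actual umtx_shm_unref_reg_locked(); in the kernel, the flag check and both decrements would happen under umtx_shm_lock.

```c
#include <stdbool.h>

/* Hypothetical model: one "registry" reference tied to the 'linked' flag. */
struct reg {
	unsigned int refcnt;
	bool linked;
};

/*
 * Drop the caller's reference and, if 'unlink' is set, the registry
 * reference.  Only the caller that still finds the entry linked drops the
 * registry reference, so two destroyers cannot both drop it.
 * Returns true when the last reference is gone.
 */
static bool
reg_unref(struct reg *r, bool unlink)
{
	if (unlink && r->linked) {
		r->linked = false;
		r->refcnt--;		/* the registry's reference */
	}
	r->refcnt--;			/* the caller's own reference */
	return (r->refcnt == 0);
}
```

With one registry reference plus two callers (refcnt 3), two `reg_unref(r, true)` calls bring the count to exactly 0; dropping the registry reference unconditionally would instead decrement it one time too many.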

Reported by:    Synacktiv
Reviewed by:    kib
Approved by:    emaste (mentor)
Security:	FreeBSD-SA-24:14.umtx
Security:	CVE-2024-43102
Security:       CAP-01
Sponsored by:   The Alpha-Omega Project
Sponsored by:	The FreeBSD Foundation
Differential Revision:  https://reviews.freebsd.org/D46126
2024-09-04 14:38:12 +00:00
Olivier Certner
dd83da532c umtx: shm: Collapse USHMF_REG_LINKED and USHMF_OBJ_LINKED flags
...into a single USHMF_LINKED flag, as they are always set or unset together.

This is both to stop giving the impression that they can be set/unset
independently, which they can't with the current code, and to make it
clearer that an upcoming reference counting fix is correct.

Reviewed by:    kib
Approved by:    emaste (mentor)
Sponsored by:   The FreeBSD Foundation
Differential Revision:  https://reviews.freebsd.org/D46126
2024-09-04 14:38:12 +00:00
Zhenlei Huang
99e3bb555c subr_bus: Stop checking for failures from malloc(M_WAITOK)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D45852
2024-09-03 18:25:17 +08:00
Zhenlei Huang
f444db950e boottrace: Stop checking for failures from realloc(M_WAITOK)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D45852
2024-09-03 18:25:17 +08:00
Zhenlei Huang
6a2a385507 kern_fail: Stop checking for failures from fp_malloc(M_WAITOK)
`fp_malloc` is defined as a macro that redirects to `malloc`.

MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D45852
2024-09-03 18:25:16 +08:00
Zhenlei Huang
356be1348d kernel: Make some compile time constant variables const
These variables are not changed at runtime. Make them const to avoid
potential overwriting. This will also help spot accidental shadowing of
global variables, since names such as `version` are short and commonly
used.

This change was inspired by reviewing khng's work D44760.

No functional change intended.

MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D45227
2024-08-30 18:26:30 +08:00
Konstantin Belousov
7e49f04c88 rangelocks: stop caching per-thread rl_q_entry
This should reduce the frequency of smr_synchronize() calls, which
otherwise occur on almost every rangelock unlock.

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D46482
2024-08-30 00:32:48 +03:00
Kevin Bowling
f622dc5dae x86: Detect NVMM hypervisor
MFC after:	1 week
2024-08-28 13:39:07 -07:00
Konstantin Belousov
e1f4d62377 rangelocks: remove unneeded cast of the atomic_load_ptr() result
Noted and reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D46465
2024-08-28 17:35:06 +03:00
Konstantin Belousov
5378962154 rangelocks: re-enable cheat mode
Tested by:	lwhsu
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D46465
2024-08-28 17:34:54 +03:00
Konstantin Belousov
4e1f29b92d kern_copy_file_range(): handle rangelock recursion
PR:	281073
Reviewed by:	markj
Tested by:	lwhsu
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D46465
2024-08-28 17:34:40 +03:00
Konstantin Belousov
0b6b1c2859 Add rangelock_may_recurse(9)
Reviewed by:	markj
Tested by:	lwhsu
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D46465
2024-08-28 17:33:58 +03:00
Konstantin Belousov
75447afca8 rangelocks: extract the cheat mode drain code
Reviewed by:	markj
Tested by:	lwhsu
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D46465
2024-08-28 17:33:46 +03:00
Mark Johnston
fe66e4caf4 rangelock: Disable cheat mode by default
Cheat mode is incompatible with code which locks multiple ranges in the
same vnode, with at least one range being write-locked.  This can arise
in kern_copy_file_range().  Until that's handled somehow, avoid the
problem to make the fusefs tests stable.

PR:		281073
Fixes:		9ef425e560 ("rangelocks: add fast cheating mode")
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D46457
2024-08-27 20:36:31 +00:00
Mark Johnston
e6651546c2 rangelock: Fix an off-by-one error
A rangelock entry covers the range [start, end), so entries e1 and e2
with e1->end == e2->start do not overlap.
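The overlap test for half-open ranges can be sketched as below. `ranges_overlap()` is a hypothetical helper, not the rangelock code: two ranges [s1, e1) and [s2, e2) overlap only if each starts strictly before the other ends. A `<=` comparison would wrongly report ranges that merely touch (e1 == s2) as overlapping.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: do half-open ranges [s1, e1) and [s2, e2) overlap? */
static bool
ranges_overlap(uint64_t s1, uint64_t e1, uint64_t s2, uint64_t e2)
{
	return (s1 < e2 && s2 < e1);
}
```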

PR:		281073
Fixes:		5badbeeaf0 ("Re-implement rangelocks part 2")
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D46458
2024-08-27 20:35:08 +00:00
Mariusz Zaborski
24b1cf7a8a sysent: regen after d0675399
2024-08-27 17:24:54 +02:00
Edward Tomasz Napierala
d0675399d0 capsicum: allow subset of wait4(2) functionality
The usual way of handling process exit in capsicum(4) mode is
by using process descriptors (pdfork(2)) instead of the traditional
fork(2)/wait4(2) API. But most apps haven't been converted this way,
and many cannot because the wait is hidden behind library APIs that
revolve around PID numbers and not descriptors; GLib's
g_spawn_check_wait_status(3) is one example.

Thus, provide backwards compatibility by allowing the wait(2) family
of functions in Capsicum mode, except for child processes created by
pdfork(2).

Reviewed by:	brooks, oshogbo
Sponsored by:	Innovate UK
Differential Revision:	https://reviews.freebsd.org/D44372
2024-08-27 17:22:12 +02:00
Zhenlei Huang
0f64fc6a34 kern: Align the declaration of kernconfstring with its definition
It is defined as const char[] in config.c which is auto generated by
usr.sbin/config/kernconf.tmpl .

While here, prefer the SYSCTL_CONST_STRING macro to avoid casting.

MFC after:	1 week
2024-08-22 18:00:34 +08:00
Konstantin Belousov
40bffb7d21 rangelocks: fix typo in rl_w_validate
The freed elements should be threaded using rl_q_free pointer.

Reported by:	dougm, markj
Tested by:	markj
Sponsored by:	The FreeBSD Foundation
2024-08-21 18:20:28 +03:00
Konstantin Belousov
c4d8b2462e rangelocks: recheck that entry is not marked after sleepq is locked in rl_w_validate()
otherwise we might lose the wakeup.

Reported and tested by:	markj
Sponsored by:	The FreeBSD Foundation
2024-08-21 18:19:57 +03:00
Konstantin Belousov
a725d61825 rangelock: if CAS for removal failed, restart list iteration
Our next pointer is invalid and cannot be followed.

Tested by:	markj, pho
Sponsored by:	The FreeBSD Foundation
2024-08-21 18:19:36 +03:00
Konstantin Belousov
9467c1a69b rangelock: assert that we never insert or remove our entry after a logically deleted one
Tested by:	markj, pho
Sponsored by:	The FreeBSD Foundation
2024-08-21 18:19:15 +03:00
Konstantin Belousov
e228961d6e rangelock_destroy(): poison lock->head to trip fault on lock attempt
Tested by:	markj, pho
Sponsored by:	The FreeBSD Foundation
2024-08-21 18:18:56 +03:00
Konstantin Belousov
8a5b2db3d8 rangelock_destroy(): do not remove lock entries from under live lock acquirer
Tested by:	markj, pho
Sponsored by:	The FreeBSD Foundation
2024-08-21 18:18:39 +03:00
Konstantin Belousov
a3f10d0882 rangelocks: add rangelock_free_free() helper to free free list
Tested by:	markj, pho
Sponsored by:	The FreeBSD Foundation
2024-08-21 18:18:16 +03:00
Zhenlei Huang
7412517f29 init_main: Sprinkle const qualifiers where appropriate
No functional change intended.

MFC after:	1 week
2024-08-21 18:01:30 +08:00
Mark Johnston
66aed7e348 socket: Set lock flags properly
Fixes:	fb901935f2 ("socket: Split up sosend_generic()")
Reported by:	cy
Sponsored by:	Klara, Inc.
Sponsored by:	Stormshield
2024-08-20 15:17:14 +00:00
Mark Johnston
6982be38cb socket: Microoptimize soreceive_stream_locked()
There is no need to hold the sockbuf lock while checking uio_resid.
No functional change intended.

MFC after:	2 weeks
Sponsored by:	Klara, Inc.
Sponsored by:	Stormshield
2024-08-19 14:52:39 +00:00