In the case of mmap(), add a HISTORY section. Mention that mmap() and
mprotect()'s documentation predates an implementation. The
implementation first saw wide use in 4.3-Reno, but there seems to be no
easy way to express that in mdoc so stick with 4.4BSD.
Reviewed by: emaste
Requested by: cem
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D20713
A new macro PROT_MAX() alters a protection value so it can be OR'd with
a regular protection value to specify the maximum permissions. If
present, these flags specify the maximum permissions.
While these flags are non-portable, they can be used in portable code
with simple ifdefs to expand PROT_MAX() to 0.
This change allows (e.g.) a region that must be writable during run-time
linking or JIT code generation to be made permanently read+execute after
writes are complete. This complements W^X protections allowing more
precise control by the programmer.
This change alters mprotect argument checking and returns an error when
unhandled protection flags are set. This differs from POSIX (in that
POSIX only specifies an error), but is the documented behavior on Linux
and more closely matches historical mmap behavior.
In addition to explicit setting of the maximum permissions, an
experimental sysctl vm.imply_prot_max causes mmap to assume that the
initial permissions requested should be the maximum when the sysctl is
set to 1. PROT_NONE mappings are excluded from this for compatibility
with rtld and other consumers that use such mappings to reserve
address space before mapping contents into part of the reservation. A
final version this is expected to provide per-binary and per-process
opt-in/out options and this sysctl will go away in its current form.
As such it is undocumented.
Reviewed by: emaste, kib (prior version), markj
Additional suggestions from: alc
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D18880
The man page claims that with O_FSYNC (aka O_SYNC) the kernel will not cache
written data. However, that's not true. Nor does POSIX require it.
Perhaps it was true when that section of the man page was written in r69336
(I haven't checked). But it's not true now. Now the effect is simply that
writes are sent to disk immediately and synchronously, but they're still
cached.
See also: https://pubs.opengroup.org/onlinepubs/9699919799/
See also: ffs_write in sys/ufs/ffs/ffs_vnops.c
Reviewed by: cem
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20641
Historically we have not distinguished between kernel wirings and user
wirings for accounting purposes. User wirings (via mlock(2)) were
subject to a global limit on the number of wired pages, so if large
swaths of physical memory were wired by the kernel, as happens with
the ZFS ARC among other things, the limit could be exceeded, causing
user wirings to fail.
The change adds a new counter, v_user_wire_count, which counts the
number of virtual pages wired by user processes via mlock(2) and
mlockall(2). Only user-wired pages are subject to the system-wide
limit which helps provide some safety against deadlocks. In
particular, while sources of kernel wirings typically support some
backpressure mechanism, there is no way to reclaim user-wired pages
shorting of killing the wiring process. The limit is exported as
vm.max_user_wired, renamed from vm.max_wired, and changed from u_int
to u_long.
The choice to count virtual user-wired pages rather than physical
pages was done for simplicity. There are mechanisms that can cause
user-wired mappings to be destroyed while maintaining a wiring of
the backing physical page; these make it difficult to accurately
track user wirings at the physical page layer.
The change also closes some holes which allowed user wirings to succeed
even when they would cause the system limit to be exceeded. For
instance, mmap() may now fail with ENOMEM in a process that has called
mlockall(MCL_FUTURE) if the new mapping would cause the user wiring
limit to be exceeded.
Note that bhyve -S is subject to the user wiring limit, which defaults
to 1/3 of physical RAM. Users that wish to exceed the limit must tune
vm.max_user_wired.
Reviewed by: kib, ngie (mlock() test changes)
Tested by: pho (earlier version)
MFC after: 45 days
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19908
the file associated with the given file descriptor.
Reviewed by: kib, asomers
Reviewed by: cem, jilles, brooks (they reviewed previous version)
Discussed with: pjd, and many others
Differential Revision: https://reviews.freebsd.org/D14567
If dso uses initial exec TLS mode, rtld tries to allocate TLS in
static space. If there is no space left, the dlopen(3) fails. If space
if allocated, initial content from PT_TLS segment is distributed to
all threads' pcbs, which was missed and caused un-initialized TLS
segment for such dso after dlopen(3).
The mode is auto-detected either due to the relocation used, or if the
DF_STATIC_TLS dynamic flag is set. In the later case, the TLS segment
is tried to allocate earlier, which increases chance of the dlopen(3)
to succeed. LLD was recently fixed to properly emit the flag, ld.bdf
did it always.
Initial test by: dumbbell
Tested by: emaste (amd64), ian (arm)
Tested by: Gerald Aryeetey <aryeeteygerald_rogers.com> (arm64)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D19072
According to 0mp, macros are not expanded in the argument provided to
-width. Use plain identifiers for width specification.
Noted and reviewed by: 0mp
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Differential revision: https://reviews.freebsd.org/D19308
back to the lever before r343030. For 64-bit machines reduce it slightly,
too. Together with r343030 I bumped the limit up to the value we use at
Netflix to serve 100 Gbit/s of sendfile traffic, and it probably isn't a
good default.
Provide a loader tunable to change vnode pager pbufs count. Document it.
This is meant to clarify the fact that the system call will not fail
with -1/EFAULT, as one might expect, when reading the sendfile(2)
manpage today.
While here, pet the mandoc linter, when dealing with the section that
describes valid values for `flags`.
PR: 232210
MFC after: 2 weeks
Approved by: emaste (mentor)
Reviewed by: glebius, 0mp
Differential Revision: https://reviews.freebsd.org/D18949
An integrity check such as a check-hash or a cross-correlation failed.
The integrity error falls between EINVAL that identifies errors in
parameters to a system call and EIO that identifies errors with the
underlying storage media. EINTEGRITY is typically raised by intermediate
kernel layers such as a filesystem or an in-kernel GEOM subsystem when
they detect inconsistencies. Uses include allowing the mount(8) command
to return a different exit value to automate the running of fsck(8)
during a system boot.
These changes make no use of the new error, they just add it. Later
commits will be made for the use of the new error number and it will
be added to additional manual pages as appropriate.
Reviewed by: gnn, dim, brueffer, imp
Discussed with: kib, cem, emaste, ed, jilles
Differential Revision: https://reviews.freebsd.org/D18765
Based on the description in Linux man page.
Reviewed by: markj, ngie (previous version)
Sponsored by: Mellanox Technologies
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D18837
from the local mapping.
Enable the setting by default.
The article behind the change: https://arxiv.org/abs/1901.01161
Reviewed by: markj
Discussed with: emaste
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D18764
As it does for recv*(2), MSG_DONTWAIT indicates that the call should
not block, returning EAGAIN instead. Linux and OpenBSD both implement
this, so the change makes porting easier, especially since we do not
return EINVAL or so when unrecognized flags are specified.
Submitted by: Greg V <greg@unrelenting.technology>
Reviewed by: tuexen
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D18728
There is no reason for it to behave differently from openat(fd, NULL).
Also the handling did not worked because the substituted path was from
the system address space, causing EFAULT.
Submitted by: Jack Halford <jack@gandi.net>
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D18501
The list of syscalls that modify st_atim, st_mtim, and st_ctim was quite out
of date and probably not accurate to begin with. Update it, and make it
clear that the list is open-ended.
Reviewed by: kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D18410
The d_off field has been added to the dirent structure recently.
Currently filesystems don't support this feature. Support has been
added and tested for zfs, ufs, ext2fs, fdescfs, msdosfs and unionfs.
A stub implementation is available for cd9660, nandfs, udf and
pseudofs but hasn't been tested.
Motivation for this feature: our usecase is for a userspace nfs server
(nfs-ganesha) with zfs. At the moment we cache direntry offsets by
calling lseek once per entry, with this patch we can get the offset
directly from getdirentries(2) calls which provides a significant
speedup.
Submitted by: Jack Halford <jack@gandi.net>
Reviewed by: mckusick, pfg, rmacklem (previous versions)
Sponsored by: Gandi.net
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D17917
paths.
It was decided that committing the code and drafting of the man page
update is better than allowing the code to rot until wordsmithing
happens.
Reviewed by: jilles (previous version)
Discussed with: brooks, jilles, emaste
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D17714
The comment isn't stale. The check is bogus in the sense that poll(2)
does not require pollfd entries to be unique in fd space, so there is no
reason there cannot be more pollfd entries than open or even allowed
fds. The check is mostly a seatbelt against accidental misuse or
abuse. FD_SETSIZE, while usually unrelated to poll, is used as an
arbitrary floor for systems with very low kern.maxfilesperproc.
Additionally, document this possible EINVAL condition in the poll.2
manual.
No functional change.
Reviewed by: markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D17671
timezone offset. These values are generally zero.
While one still theoreticall could set these values, that's almost
never done. Users wishing to have an offset between the time of day
clock hardware and UTC use adjkerntz(8) instead.
localtime(3) should be used to find these values for the current
timezone.
Flags prevent open(2) and *at(2) vfs syscalls name lookup from
escaping the starting directory. Supposedly the interface is similar
to the same proposed Linux flags.
Reviewed by: jilles (code, previous version of manpages), 0mp (manpages)
Discussed with: allanjude, emaste, jonathan
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D17547
There are multiple ways to wait for any child process to return a
status (e.g., waitpid(-1, ...), waitid(P_ALL, ...)), so don't be so
specific.
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
the domain of a socket.
This is helpful when testing and Solaris and Linux have the same
socket option using the same name.
Reviewed by: bcr@, rrs@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D16791
jails since FreeBSD 7.
Along with the system call, put the various security.jail.allow_foo and
security.jail.foo_allowed sysctls partly under COMPAT_FREEBSD11 (or
BURN_BRIDGES). These sysctls had two disparate uses: on the system side,
they were global permissions for jails created via jail(2) which lacked
fine-grained permission controls; inside a jail, they're read-only
descriptions of what the current jail is allowed to do. The first use
is obsolete along with jail(2), but keep them for the second-read-only use.
Differential Revision: D14791
Contrary to the removed comment, the kernel does appear to use the timezone
argument of settimeofday. The comment dates to the BSD4.4 import; I assume it
is just stale.
If a timer is updated (re-added) with a different time period
(specified in the .data field of the kevent), the new time period has
no effect; the timer will not expire until the original time has
elapsed. This violates the documented behavior as the kqueue(2) man
page says (in part) "Re-adding an existing event will modify the
parameters of the original event, and not result in a duplicate
entry."
This modification, adapted from a patch submitted by cem@ to PR214987,
fixes the kqueue system to allow updating a timer entry. The
kevent timer behavior is changed to:
* When a timer is re-added, update the timer parameters to and
re-start the timer using the new parameters.
* Allow updating both active and already expired timers.
* When the timer has already expired, dequeue any undelivered events
and clear the count of expirations.
All of these changes address the original PR and also bring the
FreeBSD and macOS kevent timer behaviors into agreement.
A few other changes were made along the way:
* Update the kqueue(2) man page to reflect the new timer behavior.
* Fix man page style issues in kqueue(2) diagnosed by igor.
* Update the timer libkqueue system test to test for the updated
timer behavior.
* Fix the (test) libkqueue common.h file so that it includes
config.h which defines various HAVE_* feature defines, before the
#if tests for such variables in common.h. This enables the use of
the actual err(3) family of functions.
* Fix the usages of the err(3) functions in the tests for incorrect
type of variables. Those were formerly undiagnosed due to the
disablement of the err(3) functions (see previous bullet point).
PR: 214987
Reported by: Brian Wellington <bwelling@xbill.org>
Reviewed by: kib
MFC after: 1 week
Relnotes: yes
Sponsored by: Dell EMC
Differential Revision: https://reviews.freebsd.org/D15778
Remove numactl(1), edit numa(4) to bring it some closer to reality,
provide libc ABI shims for old NUMA syscalls.
Noted and reviewed by: brooks (previous version)
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D16142
No valid FreeBSD binary very called them (they would call lchown and
msync directly) and we haven't supported NetBSD binaries in ages.
This is a respin of r335983 with a workaround for the ancient BFD linker
in the libc stubs.
Reviewed by: kib
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D16193
RB_ASKNAME is no longer instructions to the boot loader to request a
prompt for which kernel to boot. Instead, it asks for what the root
file system to use. RB_INITNAME is unused, and never has been in
FreeBSD as far as I can tell. Remove it from the documentation and fix
comment. RB_SELFTEST and RB_MINIROOT likewise (though they were
completely undocumented). These last three constants can likely just
be deleted as nothing references them (even to set useless bits).
RB_ASKNAME doesn't actually survive reboot, however, so needs to be
communicated to the bootloader via other means. If the bootloader sets
it, though, it will be honored.
No valid FreeBSD binary ever called them (they would call lchown and
msync directly) and we haven't supported NetBSD binaries in ages.
Reviewed by: kib
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D15814
Add vertical space between struct definition and function prototype.
Use "NULL" to describe zero pointers, instead of "zero."
Remove perhaps unclear "can not" and replace. Tag struct member names used
with appropriate tags.