one path. When the list is empty (contain only a NULL pointer), return
EINVAL instead of pretending to succeed, which will cause a NULL pointer
deference in a later fts_read() call.
Noticed by: Christoph Mallon (via rdivacky@)
MFC after: 2 weeks
- Tolerate applications that pass a NULL pointer for the buffer and
claim that the capacity of the buffer is nonzero.
- If an application passes in a non-NULL buffer pointer and claims the
buffer has zero capacity, we should free (well, realloc) it
anyway. It could have been obtained from malloc(0), so failing to
free it would be a small memory leak.
MFC After: 2 weeks
Reported by: naddy
PR: ports/138320
the type argument. This is known to fix some pthread_mutexattr_settype()
invocations, especially when it comes to pulseaudio.
Approved by: kib
deischen (threads)
MFC after: 3 days
- F_READAHEAD: specify the amount for sequential access. The amount is
specified in bytes and is rounded up to nearest block size.
- F_RDAHEAD: Darwin compatible version that use 128KB as the sequential
access size.
A third argument of zero disables the read-ahead behavior.
Please note that the read-ahead amount is also constrainted by sysctl
variable, vfs.read_max, which may need to be raised in order to better
utilize this feature.
Thanks Igor Sysoev for proposing the feature and submitting the original
version, and kib@ for his valuable comments.
Submitted by: Igor Sysoev <is rambler-co ru>
Reviewed by: kib@
MFC after: 1 month
a large page size that is greater than malloc(3)'s default chunk size but
less than or equal to 4 MB, then increase the chunk size to match the large
page size.
Most often, using a chunk size that is less than the large page size is not
a problem. However, consider a long-running application that allocates and
frees significant amounts of memory. In particular, it frees enough memory
at times that some of that memory is munmap()ed. Up until the first
munmap(), a 1MB chunk size is just fine; it's not a problem for the virtual
memory system. Two adjacent 1MB chunks that are aligned on a 2MB boundary
will be promoted automatically to a superpage even though they were
allocated at different times. The trouble begins with the munmap(),
releasing a 1MB chunk will trigger the demotion of the containing superpage,
leaving behind a half-used 2MB reservation. Now comes the real problem.
Unfortunately, when the application needs to allocate more memory, and it
recycles the previously munmap()ed address range, the implementation of
mmap() won't be able to reuse the reservation. Basically, the coalescing
rules in the virtual memory system don't allow this new range to combine
with its neighbor. The effect being that superpage promotion will not
reoccur for this range of addresses until both 1MB chunks are freed at some
point in the future.
Reviewed by: jasone
MFC after: 3 weeks
needed to satisfy static libraries that are compiled with -fpic
and linked into static binary afterwards. Several libraries in
gcc are examples of such static libs.
EV_RECEIPT is useful to disambiguating error conditions when multiple
events structures are passed to kevent(2). The error code is returned
in the data field and EV_ERROR is set.
Approved by: rwatson (co-mentor)
When the EV_DISPATCH flag is used the event source will be disabled
immediately after the delivery of an event. This is similar to the
EV_ONESHOT flag but it doesn't delete the event.
Approved by: rwatson (co-mentor)
Add user events support to kernel events which are not associated with any
kernel mechanism but are triggered by user level code. This is useful for
adding user level events to an event handler that may also be monitoring
kernel events.
Approved by: rwatson (co-mentor)
An additional one coming from http://www.research.att.com/~gsf/testregex/
was not added; at some point the entire AT&T regression test harness
should be imported here.
But that would also mean commitment to fix the uncovered errors.
PR: 130504
Submitted by: Chris Kuklewicz
When I wrote the pseudo-terminal driver for the MPSAFE TTY code, Robert
Watson and I agreed the best way to implement this, would be to let
posix_openpt() create a pseudo-terminal with proper permissions in place
and let grantpt() and unlockpt() be no-ops.
This isn't valid behaviour when looking at the spec. Because I thought
it was an elegant solution, I filed a bug report at the Austin Group
about this. In their last teleconference, they agreed on this subject.
This means that future revisions of POSIX may allow grantpt() and
unlockpt() to be no-ops if an open() on /dev/ptmx (if the implementation
has such a device) and posix_openpt() already do the right thing.
I'd rather put this in the manpage, because simply mentioning we don't
comply to any standard makes it look worse than it is. Right now we
don't, but at least we took care of it.
Approved by: re (kib)
MFC after: 3 days
other than the current system-wide size (32-bits) has been updated so
for now just cautiously turn the check off. While here fix the check
for IDs being too large which doesn't work due to type mis-matches.
Reviewed by: jhb (previous version)
Approved by: re (kib)
MFC after: 1 month (type mis-match fixes only)
compiled with stack protector.
Use libssp_nonshared library to pull __stack_chk_fail_local symbol into
each library that needs it instead of pulling it from libc. GCC
generates local calls to this function which result in absolute
relocations put into position-independent code segment, making dynamic
loader do extra work every time given shared library is being relocated
and making affected text pages non-shareable.
Reviewed by: kib
Approved by: re (kib)
behavior is mandated by POSIX.
- Do not fail requests that pass a length greater than SSIZE_MAX
(such as > 2GB on 32-bit platforms). The 'len' parameter is actually
an unsigned 'size_t' so negative values don't really make sense.
Submitted by: Alexander Best alexbestms at math.uni-muenster.de
Reviewed by: alc
Approved by: re (kib)
MFC after: 1 week
Right now nmemb is returned when size is 0. In newer versions of the
standards, it is explicitly required that fwrite() should return 0.
Submitted by: Christoph Mallon
Approved by: re (kib)
if the new file mode is the same as it was before; however, this
optimization must be disabled for filesystems that support NFSv4 ACLs.
Chmod uses pathconf(2) to determine whether this is the case - however,
pathconf(2) always follows symbolic links, while the 'chmod -h' doesn't.
This change adds lpathconf(3) to make it possible to solve that problem
in a clean way.
Reviewed by: rwatson (earlier version)
Approved by: re (kib)
Use libssp_nonshared library to pull __stack_chk_fail_local symbol into
each library that needs it instead of pulling it from libc. GCC generates
local calls to this function which result in absolute relocations put into
position-independent code segment, making dynamic loader do extra work everys
time given shared library is being relocated and making affected text pages
non-shareable.
Reviewed by: kib
Approved by: re (kensmith)
This adds the following functions to the acl(3) API: acl_add_flag_np,
acl_clear_flags_np, acl_create_entry_np, acl_delete_entry_np,
acl_delete_flag_np, acl_get_extended_np, acl_get_flag_np, acl_get_flagset_np,
acl_set_extended_np, acl_set_flagset_np, acl_to_text_np, acl_is_trivial_np,
acl_strip_np, acl_get_brand_np. Most of them are similar to what Darwin
does. There are no backward-incompatible changes.
Approved by: rwatson@
part of libc is still not thread safe but this would at least
reduce the problems we have.
PR: threads/118544
Submitted by: Changming Sun <snnn119 gmail com>
MFC after: 2 weeks
- The uid/cuid members of struct ipc_perm are now uid_t instead of unsigned
short.
- The gid/cgid members of struct ipc_perm are now gid_t instead of unsigned
short.
- The mode member of struct ipc_perm is now mode_t instead of unsigned short
(this is merely a style bug).
- The rather dubious padding fields for ABI compat with SV/I386 have been
removed from struct msqid_ds and struct semid_ds.
- The shm_segsz member of struct shmid_ds is now a size_t instead of an
int. This removes the need for the shm_bsegsz member in struct
shmid_kernel and should allow for complete support of SYSV SHM regions
>= 2GB.
- The shm_nattch member of struct shmid_ds is now an int instead of a
short.
- The shm_internal member of struct shmid_ds is now gone. The internal
VM object pointer for SHM regions has been moved into struct
shmid_kernel.
- The existing __semctl(), msgctl(), and shmctl() system call entries are
now marked COMPAT7 and new versions of those system calls which support
the new ABI are now present.
- The new system calls are assigned to the FBSD-1.1 version in libc. The
FBSD-1.0 symbols in libc now refer to the old COMPAT7 system calls.
- A simplistic framework for tagging system calls with compatibility
symbol versions has been added to libc. Version tags are added to
system calls by adding an appropriate __sym_compat() entry to
src/lib/libc/incldue/compat.h. [1]
PR: kern/16195 kern/113218 bin/129855
Reviewed by: arch@, rwatson
Discussed with: kan, kib [1]
- update for getrlimit(2) manpage;
- support for setting RLIMIT_SWAP in login class;
- addition to the limits(1) and sh and csh limit-setting builtins;
- tuning(7) documentation on the sysctls controlling overcommit.
In collaboration with: pho
Reviewed by: alc
Approved by: re (kensmith)
It's not necessary to add stdlib directories for each architecture, even
if the architecture doesn't implement any files of its own.
Submitted by: Christoph Mallon
main cycle only if the len passed is equal to 0. If end address
overflows use last possible address as the end address.
Based on: discussion on arm@
MFC after: 1 month
NGROUPS_MAX, eliminate ABI dependencies on them, and raise the to 1024
and 1023 respectively. (Previously they were equal, but under a close
reading of POSIX, NGROUPS_MAX was defined to be too large by 1 since it
is the number of supplemental groups, not total number of groups.)
The bulk of the change consists of converting the struct ucred member
cr_groups from a static array to a pointer. Do the equivalent in
kinfo_proc.
Introduce new interfaces crcopysafe() and crsetgroups() for duplicating
a process credential before modifying it and for setting group lists
respectively. Both interfaces take care for the details of allocating
groups array. crsetgroups() takes care of truncating the group list
to the current maximum (NGROUPS) if necessary. In the future,
crsetgroups() may be responsible for insuring invariants such as sorting
the supplemental groups to allow groupmember() to be implemented as a
binary search.
Because we can not change struct xucred without breaking application
ABIs, we leave it alone and introduce a new XU_NGROUPS value which is
always 16 and is to be used or NGRPS as appropriate for things such as
NFS which need to use no more than 16 groups. When feasible, truncate
the group list rather than generating an error.
Minor changes:
- Reduce the number of hand rolled versions of groupmember().
- Do not assign to both cr_gid and cr_groups[0].
- Modify ipfw to cache ucreds instead of part of their contents since
they are immutable once referenced by more than one entity.
Submitted by: Isilon Systems (initial implementation)
X-MFC after: never
PR: bin/113398 kern/133867
system callers of getgroups(), getgrouplist(), and setgroups() to
allocate buffers dynamically. Specifically, allocate a buffer of size
sysconf(_SC_NGROUPS_MAX)+1 (+2 in a few cases to allow for overflow).
This (or similar gymnastics) is required for the code to actually follow
the POSIX.1-2008 specification where {NGROUPS_MAX} may differ at runtime
and where getgroups may return {NGROUPS_MAX}+1 results on systems like
FreeBSD which include the primary group.
In id(1), don't pointlessly add the primary group to the list of all
groups, it is always the first result from getgroups(). In principle
the old code was more portable, but this was only done in one of the two
places where getgroups() was called to the overall effect was pointless.
Document the actual POSIX requirements in the getgroups(2) and
setgroups(2) manpages. We do not yet support a dynamic NGROUPS, but we
may in the future.
MFC after: 2 weeks
dace for UPDv4 sockets bound to INADDR_ANY. Move the code to set
IP_RECVDSTADDR/IP_SENDSRCADDR into svc_dg.c, so that both TLI and non-TLI
users will be using it.
Back out my previous commit to mountd. Turns out the problem was affecting
more than one binary so it needs to me addressed in generic rpc code in
libc in order to fix them all.
Reported by: lstewart
Tested by: lstewart
While hacking on TTY code, I often miss a small utility to revoke my own
(pseudo-)terminals. This small utility is just a small wrapper around
the revoke(2) call, so you can destroy your very own login sessions.
Approved by: re
any open file descriptors >= 'lowfd'. It is largely identical to the same
function on other operating systems such as Solaris, DFly, NetBSD, and
OpenBSD. One difference from other *BSD is that this closefrom() does not
fail with any errors. In practice, while the manpages for NetBSD and
OpenBSD claim that they return EINTR, they ignore internal errors from
close() and never return EINTR. DFly does return EINTR, but for the common
use case (closing fd's prior to execve()), the caller really wants all
fd's closed and returning EINTR just forces callers to call closefrom() in
a loop until it stops failing.
Note that this implementation of closefrom(2) does not make any effort to
resolve userland races with open(2) in other threads. As such, it is not
multithread safe.
Submitted by: rwatson (initial version)
Reviewed by: rwatson
MFC after: 2 weeks
Last year I added SLIST_REMOVE_NEXT and STAILQ_REMOVE_NEXT, to remove
entries behind an element in the list, using O(1) time. I recently
discovered NetBSD also has a similar macro, called SLIST_REMOVE_AFTER.
In my opinion this approach is a lot better:
- It doesn't have the unused first argument of the list pointer. I added
this, mainly because OpenBSD also had it.
- The _AFTER suffix makes a lot more sense, because it is related to
SLIST_INSERT_AFTER. _NEXT is only used to iterate through the list.
The reason why I want to rename this now, is to make sure we don't
release a major version with the badly named macros.
by creating a child jail, which is visible to that jail and to any
parent jails. Child jails may be restricted more than their parents,
but never less. Jail names reflect this hierarchy, being MIB-style
dot-separated strings.
Every thread now points to a jail, the default being prison0, which
contains information about the physical system. Prison0's root
directory is the same as rootvnode; its hostname is the same as the
global hostname, and its securelevel replaces the global securelevel.
Note that the variable "securelevel" has actually gone away, which
should not cause any problems for code that properly uses
securelevel_gt() and securelevel_ge().
Some jail-related permissions that were kept in global variables and
set via sysctls are now per-jail settings. The sysctls still exist for
backward compatibility, used only by the now-deprecated jail(2) system
call.
Approved by: bz (mentor)
Upgrade of the tzcode from 2004a to 2009e.
Changes are numerous, but include...
- New format of the output of zic, which supports both 32 and 64
bit time_t formats.
- zdump on 64 bit platforms will actually produce some output instead
of doing nothing for a looooooooong time.
- linux_base-fX, with X >= at least 8, will work without problems related
to the local time again.
The original patch, based on the 2008e, has been running for a long
time on both my laptop and desktop machine and have been tested by
other people.
After the installation of this code and the running of zic(8), you
need to run tzsetup(8) again to install the new datafile.
Approved by: wollman@ for usr.sbin/zic
MFC after: 1 month
the length by evaluating the value from the copy, cbuf instead. This
fixes a crash caused by previous commit (use-after-free)
Submitted by: Dimitry Andric <dimitry andric com>
Pointy hat to: delphij
The entire world seems to use the non-standard TIOCSCTTY ioctl to make a
TTY a controlling terminal of a session. Even though tcsetsid(3) is also
non-standard, I think it's a lot better to use in our own source code,
mainly because it's similar to tcsetpgrp(), tcgetpgrp() and tcgetsid().
I stole the idea from QNX. They do it the other way around; their
TIOCSCTTY is just a wrapper around tcsetsid(). tcsetsid() then calls
into an IPC framework.
interface as nmount(2). Three new system calls are added:
* jail_set, to create jails and change the parameters of existing jails.
This replaces jail(2).
* jail_get, to read the parameters of existing jails. This replaces the
security.jail.list sysctl.
* jail_remove to kill off a jail's processes and remove the jail.
Most jail parameters may now be changed after creation, and jails may be
set to exist without any attached processes. The current jail(2) system
call still exists, though it is now a stub to jail_set(2).
Approved by: bz (mentor)
the kernel will return in msfr_nsrcs the number of source filters
in-mode for a given multicast group.
However, the filters themselves were never copied out, as the libc
function clobbers this field with zero, causing the kernel to assume
the provided vector of struct sockaddr_storage has zero length.
This bug would only affect users of SSM multicast, which is shimmed
in 7.x.
Picked up during mtest(8) refactoring.
MFC after: 1 day
to if (len == 0).
The length is supposed to be unsigned, so len - 1 < 0 won't happen except
if len == 0 anyway, and it would return 0 when it shouldn't, if len was
> INT_MAX.
Spotted out by: Channa <channa kad gmail com>
because it means getdelim() returns -1 for both error and EOF, and
never returns 0. However, this is what the original GNU implementation
does, and POSIX inherited the bug.
Reported by: marcus@
dlfunc() called dlsym() to do the work, and dlsym() determines the dso
that originating the call by the return address. Due to this, dlfunc()
operated as if the caller is always the libc.
To fix this, move the dlfunc() to rtld, where it can call the internal
implementation of dlsym, and still correctly fetch return address.
Provide usual weak stub for the symbol from libc for static binaries.
dlfunc is put to FBSD_1.0 symver namespace in the ld.so export to
override dlfunc@FBSD_1.0 weak symbol, exported by libc.
Reported, analyzed and tested by: Tijl Coosemans <tijl ulyssis org>
PR: standards/133339
Reviewed by: kan
these functions were moved into the kernel:
- Move the version entries from gen/ to sys/. Since the ABI of the actual
routines did not change, I'm still exporting them as FBSD 1.0 on purpose.
- Add FBSD-private versions for the _ and __sys_ variants.
This does not include the new hash routines since they will cause problems
when reading old hash files.
Since mpool(3) has been changed, provide a compatibility shim for older
binaries.
Obtained from: OpenBSD
open(). The previous logic only initializes the database when O_CREAT is
set, but as long as we can open and write the database, and the database
is empty, we should initialize it anyway.
Obtained from: OpenBSD
an invariant (actually, an ugly hack) to fail, and all Hell would break
loose.
When deleting a big key, the offset of an empty page should be bsize, not
bsize-1; otherwise an insertion into the empty page will cause the new key to
be elongated by 1 byte.
Make the packing more dense in a couple of cases.
- fix NULL dereference exposed on big bsize values;
Obtained from: NetBSD via OpenBSD
if the result is truncated.
db/hash/hash_page.c: use the same way to create temporary file as
bt_open.c; check snprintf() return value.
Obtained from: OpenBSD
all; before freeing memory, zero out them before we release it as free
heap. This will eliminate some potential information leak issue.
While there, remove the PURIFY option. There is a slight difference between
the new behavior and the old -DPURIFY behavior, with the latter initializes
memory with 0xff's. The difference between old and new approach does not
generate observable difference.
Obtained from: OpenBSD (partly).
Now, getaddrinfo(3) returns two SOCK_STREAMs, IPPROTO_TCP and
IPPROTO_SCTP. It confuses some programs. If getaddrinfo(3) returns
IPPROTO_SCTP when SOCK_STREAM is specified by hints.ai_socktype, at
least Apache doesn't work. So, I made getaddrinfo(3) to return
IPPROTO_SCTP with SOCK_STREAM only when IPPROTO_SCTP is specified
explicitly by hints.ai_protocol.
PR: bin/128167
Submitted by: Bruce Cran <bruce__at__cran.org.uk> (partly)
MFC after: 2 week
to possible breakages in the catalog handling code. Since then, that
code has been replaced by the secure code from NetBSD but NLS in libc
remained turned off. Tests have shown that the feature is stable and
working so we can now turn it on again.
- Add several new catalog files:
- ca_ES.ISO8859-1
- de_DE.ISO8859-1
- el_GR.ISO8859-7 (by manolis@ and keramida@)
- es_ES.ISO8859-1 (kern/123179, by carvay@)
- fi_FI.ISO8859-1
- fr_FR.ISO8859-1 (kern/78756, by thierry@)
- hu_HU.ISO8859-2 (by gabor@)
- it_IT.ISO8859-15
- nl_NL.ISO8859-1 (corrections by rene@)
- no_NO.ISO8859-1
- mn_MN.UTF-8 (by ganbold@)
- sk_SK.ISO8859-2
- sv_SE.ISO8859-1
(The catalogs without explicit source has been obtained from NetBSD.)
Approved by: attilio
dprintf() is a simple wrapper around another function, so we may as
well implement it. But also like getline(), we can't prototype it by
default right now because it would break too many ports.
(This is part of a larger changeset which is intended to reduce diff only,
thus some prototypes were left intact since they will be changed in the
future).
Verified with: md5(1)
memory from int to size_t. Implement a workaround for current ABI not
allowing to properly save size for and report more then 2Gb sized segment
of shared memory.
This makes it possible to use > 2 Gb shared memory segments on 64bit
architectures. Please note the new BUGS section in shmctl(2) and
UPDATING note for limitations of this temporal solution.
Reviewed by: csjp
Tested by: Nikolay Dzham <i levsha org ua>
MFC after: 2 weeks
On FreeBSD, this is the default behaviour. According to the spec, we may
give this flag a value of zero, but I'd rather not do this. If we define
it to a non-zero value, we can always change default behaviour without
changing the ABI. This is very unlikely to happen, though.
wcscasecmp(), and wcsncasecmp().
- Make some previously non-standard extensions visible
if POSIX_VISIBLE >= 200809.
- Use restrict qualifiers in stpcpy().
- Declare off_t and size_t in stdio.h.
- Bump __FreeBSD_version in case the new symbols (particularly
getline()) cause issues with ports.
Reviewed by: standards@
When calling setttyent() after calling endttyent(), pts_valid will never
be set to 1, because the readdir()-loop will likely never vind a pts
that has a higher number than before.
Simplify the code by removing pts_valid. We'll just set maxpts to -1
when we don't have a valid count yet.
It seems ttyslot() calls rindex(), to strip the device name to the last
slash, but this is obviously invalid. /dev/pts/0 should be stripped
until pts/0. Because /etc/ttys only supports TTY names in /dev/, just
strip this piece of the pathname.
A more elegant way of obtaining a name of a character device by its file
descriptor on FreeBSD, is to use the FIODGNAME ioctl. Because a valid
file descriptor implies a file descriptor is visible in /dev, it will
always resolve a valid device name.
I'm adding a more friendly wrapper for this ioctl, called fdevname(). It
is a lot easier to use than devname() and also has better error
handling. When a device name cannot be resolved, it will just return
NULL instead of a generated device name that makes no sense.
Discussed with: kib
stating that in FreeBSD the atol() and atoll() functions affect
errno in the same way as strtol() and stroll().
PR: docs/126487
Submitted by: edwin
Reviewed by: trhodes, gabor
MFC after: 1 week
return zero on success and an error code otherwise. The possible errors
are EADDRNOTAVAIL if an address being checked for doesn't match the
prison, and EAFNOSUPPORT if the prison doesn't have any addresses in
that address family. For most callers of these functions, use the
returned error code instead of e.g. a hard-coded EADDRNOTAVAIL or
EINVAL.
Always include a jailed() check in these functions, where a non-jailed
cred always returns success (and makes no changes). Remove the explicit
jailed() checks that preceded many of the function calls.
Approved by: bz (mentor)
reducing branches and doing word-sized operation.
The idea is taken from J.T. Conklin's x86_64 optimized version of strlen(3)
for NetBSD, and reimplemented in C by me.
Discussed on: -arch@
The integer thousands' separator code is rewritten in order to
avoid having to preallocate a buffer for the largest possible
digit string with the most possible instances of the longest
possible multibyte thousands' separator. The new version inserts
thousands' separators for integers using the same code as floating point.
sets up a fake buffered FILE and then effectively calls itself
recursively. Unfortunately, gcc doesn't know how to do tail call
elimination in this case, and actually makes things worse by
inlining __sbprintf(). This means that f[w]printf() to stderr was
allocating about 5k of stack on 64-bit platforms, much of which was
never used.
I've reorganized things to eliminate the waste. In addition to saving
some stack space, this improves performance in my tests by anywhere
from 5% to 17% (depending on the test) when -fstack-protector is
enabled. I found no statistically significant performance difference
when stack protection is turned off. (The tests redirected stderr to
/dev/null.)
to get rid of restrict qualifier discarding. This lets libc compile
cleanly in gnu99 mode.
Suggested by: kib, christoph.mallon at gmx.de
Approved by: kib (mentor)
slightly less evil inline functions, and move the buffering state into
a struct. This will make it possible for helper routines to produce
output for printf() directly, making it possible to untangle the code
somewhat.
In wprintf(), use the same buffering mechanism to reduce diffs to
printf(). This has the side-effect of causing wprintf() to catch write
errors that it previously ignored.
FPA floating-point format is identical to the VFP format,
but is always stored in big-endian.
Introduce _IEEE_WORD_ORDER to describe the byte-order of
the FP representation.
Obtained from: Juniper Networks, Inc
It includes the following fix:
2426. [bug] libbind: inet_net_pton() can sometimes return the
wrong value if excessively large netmasks are
supplied. [RT #18512]
Reported by: Maksymilian Arciemowicz <cxib__at__securityreason.com>
Copyright attribution is kept the same as in original NetBSD source.
Submitted by: Florian Smeets <flo kasimir com>
Obtained from: NetBSD
MFC after: 2 weeks
Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.
Bump __FreeBSD_version for the afore mentioned and in kernel changes.
Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.
Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible
Threading library calls _pre before the fork, allowing the rtld to
lock itself to ensure that other threads of the process are out of
dynamic linker. _post releases the locks.
This allows the rtld to have consistent state in the child. Although
child may legitimately call only async-safe functions, the call may
need plt relocation resolution, and this requires working rtld.
Reported and debugging help by: rink
Reviewed by: kan, davidxu
MFC after: 1 month (anyway, not before 7.1 is out)
This bring huge amount of changes, I'll enumerate only user-visible changes:
- Delegated Administration
Allows regular users to perform ZFS operations, like file system
creation, snapshot creation, etc.
- L2ARC
Level 2 cache for ZFS - allows to use additional disks for cache.
Huge performance improvements mostly for random read of mostly
static content.
- slog
Allow to use additional disks for ZFS Intent Log to speed up
operations like fsync(2).
- vfs.zfs.super_owner
Allows regular users to perform privileged operations on files stored
on ZFS file systems owned by him. Very careful with this one.
- chflags(2)
Not all the flags are supported. This still needs work.
- ZFSBoot
Support to boot off of ZFS pool. Not finished, AFAIK.
Submitted by: dfr
- Snapshot properties
- New failure modes
Before if write requested failed, system paniced. Now one
can select from one of three failure modes:
- panic - panic on write error
- wait - wait for disk to reappear
- continue - serve read requests if possible, block write requests
- Refquota, refreservation properties
Just quota and reservation properties, but don't count space consumed
by children file systems, clones and snapshots.
- Sparse volumes
ZVOLs that don't reserve space in the pool.
- External attributes
Compatible with extattr(2).
- NFSv4-ACLs
Not sure about the status, might not be complete yet.
Submitted by: trasz
- Creation-time properties
- Regression tests for zpool(8) command.
Obtained from: OpenSolaris
- Use `fildes[2]' instead of `*fildes' to make more clear that pipe(2)
fills an array with two descriptors.
- Remove EFAULT from the manual page. Because of the current calling
convention, pipe(2) raises a segmentation fault when an invalid
address is passed.
- Introduce kern_pipe() to make it easier for binary emulations to
implement pipe(2).
- Make Linux binary emulation use kern_pipe(), which means we don't have
to recover td_retval after calling the FreeBSD system call.
Approved by: rdivacky
Discussed on: arch
Looking at our source code history, it seems the uname(),
getdomainname() and setdomainname() system calls got deprecated
somewhere after FreeBSD 1.1, but they have never been phased out
properly. Because we don't have a COMPAT_FREEBSD1, just use
COMPAT_FREEBSD4.
Also fix the Linuxolator to build without the setdomainname() routine by
just making it call userland_sysctl on kern.domainname. Also replace the
setdomainname()'s implementation to use this approach, because we're
duplicating code with sysctl_domainname().
I wasn't able to keep these three routines working in our
COMPAT_FREEBSD32, because that would require yet another keyword for
syscalls.master (COMPAT4+NOPROTO). Because this routine is probably
unused already, this won't be a problem in practice. If it turns out to
be a problem, we'll just restore this functionality.
Reviewed by: rdivacky, kib
and server. This replaces the RPC implementation of the NFS client and
server with the newer RPC implementation originally developed
(actually ported from the userland sunrpc code) to support the NFS
Lock Manager. I have tested this code extensively and I believe it is
stable and that performance is at least equal to the legacy RPC
implementation.
The NFS code currently contains support for both the new RPC
implementation and the older legacy implementation inherited from the
original NFS codebase. The default is to use the new implementation -
add the NFS_LEGACYRPC option to fall back to the old code. When I
merge this support back to RELENG_7, I will probably change this so
that users have to 'opt in' to get the new code.
To use RPCSEC_GSS on either client or server, you must build a kernel
which includes the KGSSAPI option and the crypto device. On the
userland side, you must build at least a new libc, mountd, mount_nfs
and gssd. You must install new versions of /etc/rc.d/gssd and
/etc/rc.d/nfsd and add 'gssd_enable=YES' to /etc/rc.conf.
As long as gssd is running, you should be able to mount an NFS
filesystem from a server that requires RPCSEC_GSS authentication. The
mount itself can happen without any kerberos credentials but all
access to the filesystem will be denied unless the accessing user has
a valid ticket file in the standard place (/tmp/krb5cc_<uid>). There
is currently no support for situations where the ticket file is in a
different place, such as when the user logged in via SSH and has
delegated credentials from that login. This restriction is also
present in Solaris and Linux. In theory, we could improve this in
future, possibly using Brooks Davis' implementation of variant
symlinks.
Supporting RPCSEC_GSS on a server is nearly as simple. You must create
service creds for the server in the form 'nfs/<fqdn>@<REALM>' and
install them in /etc/krb5.keytab. The standard heimdal utility ktutil
makes this fairly easy. After the service creds have been created, you
can add a '-sec=krb5' option to /etc/exports and restart both mountd
and nfsd.
The only other difference an administrator should notice is that nfsd
doesn't fork to create service threads any more. In normal operation,
there will be two nfsd processes, one in userland waiting for TCP
connections and one in the kernel handling requests. The latter
process will create as many kthreads as required - these should be
visible via 'top -H'. The code has some support for varying the number
of service threads according to load but initially at least, nfsd uses
a fixed number of threads according to the value supplied to its '-n'
option.
Sponsored by: Isilon Systems
MFC after: 1 month
on long long arguments.
Reviewed by: bde (previous version, that included asm implementation
for all ffs and fls functions on i386 and amd64)
MFC after: 2 weeks
is used to set the ELF size attribute for functions. It isn't normally
critical but some things can make use of it (gdb for stack traces).
Valgrind needs it so I'm adding it in. The problem is present on all
branches and on both i386 and amd64.
from the description but not the errors section. This revision removes it
from the errors statement.
Add a statement about the non-portability of non-page-aligned offsets.
I believe this is not a valid C99 construct. Use directly calculated
offsets into the supplied buffer, using specified members length,
to fill appropriate structure.
Either use sysctl, or copy the value of the UNAME_x environment
variable, instead of unconditionally doing sysctl, and then
overriding a returned value with user-specified one.
Noted and tested by: rdivacky
Update referenced example to include unistd.h per manpage.
Update example to be more style(9)-ish, silence warnings and add
FreeBSD id to the source file.
review by secteam@ for the reasons mentioned below.
1) Rename /dev/urandom to /dev/random since urandom marked as
XXX Deprecated
alias in /sys/dev/random/randomdev.c
(this is our naming convention and no review by secteam@ required)
2) Set rs_stired flag after forced initialization to prevent
double stearing.
(this is already in OpenBSD, i.e. they don't have double stearing.
It means that this change matches their code path and no additional
secteam@ review required)
Submitted by: Thorsten Glaser <tg@mirbsd.de> (2)
that belong in a character class, and (2) one for matching all
the characters *not* in a character class.
Submitted by: Mark B, mkbucc at gmail.com
MFC after: 3 days
This caching allows for completely lock-free allocation/deallocation in the
steady state, at the expense of likely increased memory use and
fragmentation.
Reduce the default number of arenas to 2*ncpus, since thread-specific
caching typically reduces arena contention.
Modify size class spacing to include ranges of 2^n-spaced, quantum-spaced,
cacheline-spaced, and subpage-spaced size classes. The advantages are:
fewer size classes, reduced false cacheline sharing, and reduced internal
fragmentation for allocations that are slightly over 512, 1024, etc.
Increase RUN_MAX_SMALL, in order to limit fragmentation for the
subpage-spaced size classes.
Add a size-->bin lookup table for small sizes to simplify translating sizes
to size classes. Include a hard-coded constant table that is used unless
custom size class spacing is specified at run time.
Add the ability to disable tiny size classes at compile time via
MALLOC_TINY.
is returned shall be kept in the waitable state.
Add WSTOPPED as an alias for WUNTRACED.
Submitted by: Jukka Ukkonen <jau at iki fi>
PR: standards/116221
MFC after: 2 weeks
executed by fexecve(2), imgp->args->fname is NULL. Moreover, there is
no way to recover the path to the script being executed.
Do what some other U*ixes do unconditionally, namely supply /dev/fd/n
as the script path when called from fexecve(). Document requirement of
having fdescfs mounted as caveat.
The routines in grantpt.c have been moved to ptsname.c in the MPSAFE TTY
layer, because grantpt() is now effectively a no-op. I forgot to remove
the corresponding source file from libc.
The last half year I've been working on a replacement TTY layer for the
FreeBSD kernel. The new TTY layer was designed to improve the following:
- Improved driver model:
The old TTY layer has a driver model that is not abstract enough to
make it friendly to use. A good example is the output path, where the
device drivers directly access the output buffers. This means that an
in-kernel PPP implementation must always convert network buffers into
TTY buffers.
If a PPP implementation would be built on top of the new TTY layer
(still needs a hooks layer, though), it would allow the PPP
implementation to directly hand the data to the TTY driver.
- Improved hotplugging:
With the old TTY layer, it isn't entirely safe to destroy TTY's from
the system. This implementation has a two-step destructing design,
where the driver first abandons the TTY. After all threads have left
the TTY, the TTY layer calls a routine in the driver, which can be
used to free resources (unit numbers, etc).
The pts(4) driver also implements this feature, which means
posix_openpt() will now return PTY's that are created on the fly.
- Improved performance:
One of the major improvements is the per-TTY mutex, which is expected
to improve scalability when compared to the old Giant locking.
Another change is the unbuffered copying to userspace, which is both
used on TTY device nodes and PTY masters.
Upgrading should be quite straightforward. Unlike previous versions,
existing kernel configuration files do not need to be changed, except
when they reference device drivers that are listed in UPDATING.
Obtained from: //depot/projects/mpsafetty/...
Approved by: philip (ex-mentor)
Discussed: on the lists, at BSDCan, at the DevSummit
Sponsored by: Snow B.V., the Netherlands
dcons(4) fixed by: kan
uuid_dec_be() functions. These routines are not part of the
DCE RPC API. They are provided for convenience.
Reviewed by: marcel
Obtained from: NetBSD
MFC after: 1 week
detect whether the integer division table is large enough to handle the
divisor. Before this change, the last two table elements were never used,
thus causing the slow path to be used for those divisors.
is based on an old implementation from the University of Michigan with lots of
changes and fixes by me and the addition of a Solaris-compatible API.
Sponsored by: Isilon Systems
Reviewed by: alfred
to this commit, "env BLOCKSIZE=4X df" prints not only "4X: unknown
blocksize" as expected, but sometimes also "maximum blocksize is 1G"
and "minimum blocksize is 512" depending on what happened to be on
the stack.
Found by: LLVM/Clang Static Checker
environ[0] to be more obvious that environ is not NULL before environ[0]
is tested. Although I believe the previous code worked, this change
improves code maintainability.
Reviewed by: ache
MFC after: 3 days
assumed to be reviewd by them):
Stir directly from the kernel PRNG, without taking less random pid & time
bytes too (when it is possible).
The difference with OpenBSD code is that they have KERN_ARND sysctl for
that task, while we need to read /dev/random
wide string arguments.
Also simplify the code that handles length modifiers and make it
more conservative. For instance, be explicit about the modifiers
allowed for %d, rather than assuming that anything other than L,
q, t, or z implies an int argument.
the first value (environ[0]) to NULL. This is in addition to the
current detection of environ being replaced, which includes being set to
NULL. Without this fix, the environment is not truly wiped, but appears
to be by getenv() until an *env() call is made to alter the enviroment.
This change is necessary to support those applications that use this
method for clearing environ such as Dovecot and Postfix. Applications
such as Sendmail and the base system's env replace environ (already
detected). While neither of these methods are defined by SUSv3, it is
best to support them due to historic reasons and in lieu of a clean,
defined method.
Add extra units tests for clearing environ using four different methods:
1. Set environ to NULL pointer.
2. Set environ[0] to NULL pointer.
3. Set environ to calloc()'d NULL-terminated array.
4. Set environ to static NULL-terminated array.
Noticed by: Timo Sirainen
MFC after: 3 days
I guess the original author of the popen() code didn't want to use our
<sys/queue.h> macro's, because the single linked list macro's didn't
offer O(1) deletion. Because I introduced SLIST_REMOVE_NEXT() some time
ago, we can now use the macro's here.
By converting the code to an SLIST, it is more consistent with other
parts of the C library and the operating system.
Reviewed by: csjp
Approved by: philip (mentor, implicit)
mkstemps(), and mkdtemp().
- Add proper range checking for the 'slen' parameter passed to mkstemps().
- Try all possible permutations of a template if a collision is encountered.
Previously, once a single template character reached 'z', it would not wrap
around to '0' and keep going until it encountered the original starting
letter. In the edge case that the randomly generated starting name used
all 'z' characters, only that single name would be tried before giving up.
PR: standards/66531
Submitted by: Jim Luther
Obtained from: Apple
MFC after: 1 week
It seems I made a small bug when writing some of the posix_spawn(3)
manpages. Remove the redundant "Ed Schouten", which broke the AUTHORS
section.
Approved by: philip (mentor, implicit)
"If you don't get a review within a day or two, I would firmly recommend
backing out the changes"
back out all my changes, i.e. not comes from merging from OpenBSD as
unreviewed by secteam@ yet.
(OpenBSD changes stays in assumption they are reviewd by OpenBSD)
Yes, it means some old bugs returned, like not setted rs_stired = 1 in
arc4random_stir(3) causing double stirring.
1) Unindent and sort variables.
2) Indent struct members.
3) Remove _packed, use guaranteed >128 bytes size and only first 128
bytes from the structure.
4) Reword comment.
Obtained from: bde
2) Use gettimeofday() and getpid() only if reading from /dev/urandom
fails or impossible.
3) Discard N bytes on very first initialization only (i.e. don't
discard on re-stir).
4) Reduce N from 1024 to 512 as really suggested in the
"(Not So) Random Shuffles of RC4" paper:
http://research.microsoft.com/users/mironov/papers/rc4full.pdf
2) Eliminate "struct arc4_stream *as" arg since only single arg is
possible.
3) Set rs.j = rs.i after arc4random key schedule to be more like arc4
stream cipher.
Obtained from: OpenBSD
the chunk map instead of red-black trees where possible. Remove the
red-black trees and node objects that are obsoleted by this change. The
net result is a ~1-2% memory savings, and a substantial allocation speed
improvement.
This change removes the requirement that an ACL contain no ACL_USER
entries with a uid the same as those of a file, or ACL_GROUP entries
with a gid the same as those of a file. This requirement is not in the
specification, and not enforced by the kernel's ACL implementation.
Reported by: Iustin Pop <iusty at k1024 dot org>
MFC after: 1 week
statement. Add the one from the current NetBSD version.
- Also bump a date to reflect my content changes I have done in previous
revision
Approved by: imp
MFC after: 3 days
by moving the positional argument handling code to a new file,
printf-pos.c, and moving common definitions to printflocal.h.
No functional change intended.
In particular, encapsulate the state of the type table in a struct,
and add inline functions to initialize, free, and manipulate that
state. This replaces some ugly macros that made proper error handling
impossible.
While here, remove an unneeded test for NULL and a variable that is
initialized (many times!) but never used. The compiler didn't catch
these because of rampant use of the same variable to mean different
things in different places.
This commit should not cause any changes in functionality.
1. Save and restore the control part of the MXCSR in addition to the
i387 control word to ensure that the two are consistent.
Note that standards don't require longjmp to restore either control
word, and none of Linux, MacOS X 10.3 and earlier, NetBSD, OpenBSD,
or Solaris do it. However, it is historical FreeBSD behavior, and
bde points out that it is needed to make longjmping out of a signal
handler work properly, given the way FreeBSD clobbers the FPU state
on signal handler entry.
2. Don't clobber the FPU exception flags in longjmp. C99 requires them
to remain unchanged.
- It is opt-out for now so as to give it maximum testing, but it may be
turned opt-in for stable branches depending on the consensus. You
can turn it off with WITHOUT_SSP.
- WITHOUT_SSP was previously used to disable the build of GNU libssp.
It is harmless to steal the knob as SSP symbols have been provided
by libc for a long time, GNU libssp should not have been much used.
- SSP is disabled in a few corners such as system bootstrap programs
(sys/boot), process bootstrap code (rtld, csu) and SSP symbols themselves.
- It should be safe to use -fstack-protector-all to build world, however
libc will be automatically downgraded to -fstack-protector because it
breaks rtld otherwise.
- This option is unavailable on ia64.
Enable GCC stack protection (aka Propolice) for kernel:
- It is opt-out for now so as to give it maximum testing.
- Do not compile your kernel with -fstack-protector-all, it won't work.
Submitted by: Jeremie Le Hen <jeremie@le-hen.org>
Adding exevpe() has caused some ports to break. Even though execvpe() is
a useful routine, it does not conform to any standards.
This patch is a little bit different from the patch sent to the mailing
list. I forgot to remove execvpe from the Symbol.map (which does not
seem to miscompile libc, though).
Reviewed by: davidxu
Approved by: philip
The __use_pts() routine was once probably used by libutil to determine
if we are using BSD or UNIX98 style PTY device names. It doesn't seem to
be used outside grantpt.c, which means we can make it static and remove
it from the Symbol.map.
Reviewed by: cognet, kib
Approved by: philip (mentor)
can be used as replacements for exec/fork in a lot of cases. This
change also added execvpe() which allows environment variable
PATH to be used for searching executable file, it is used for
implementing posix_spawnp().
PR: standards/122051
SO_LISTENQLEN SO_LISTENINCQLEN to the manual page.
Till now those were only present in sys/socket.h file.
Reviewed by: rwatson, gnn, keramida (with mdoc hat)
This particular implementation is designed to be fully backwards compatible
and to be MFC-able to 7.x (and 6.x)
Currently the only protocol that can make use of the multiple tables is IPv4
Similar functionality exists in OpenBSD and Linux.
From my notes:
-----
One thing where FreeBSD has been falling behind, and which by chance I
have some time to work on is "policy based routing", which allows
different
packet streams to be routed by more than just the destination address.
Constraints:
------------
I want to make some form of this available in the 6.x tree
(and by extension 7.x) , but FreeBSD in general needs it so I might as
well do it in -current and back port the portions I need.
One of the ways that this can be done is to have the ability to
instantiate multiple kernel routing tables (which I will now
refer to as "Forwarding Information Bases" or "FIBs" for political
correctness reasons). Which FIB a particular packet uses to make
the next hop decision can be decided by a number of mechanisms.
The policies these mechanisms implement are the "Policies" referred
to in "Policy based routing".
One of the constraints I have if I try to back port this work to
6.x is that it must be implemented as a EXTENSION to the existing
ABIs in 6.x so that third party applications do not need to be
recompiled in timespan of the branch.
This first version will not have some of the bells and whistles that
will come with later versions. It will, for example, be limited to 16
tables in the first commit.
Implementation method, Compatible version. (part 1)
-------------------------------
For this reason I have implemented a "sufficient subset" of a
multiple routing table solution in Perforce, and back-ported it
to 6.x. (also in Perforce though not always caught up with what I
have done in -current/P4). The subset allows a number of FIBs
to be defined at compile time (8 is sufficient for my purposes in 6.x)
and implements the changes needed to allow IPV4 to use them. I have not
done the changes for ipv6 simply because I do not need it, and I do not
have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.
Other protocol families are left untouched and should there be
users with proprietary protocol families, they should continue to work
and be oblivious to the existence of the extra FIBs.
To understand how this is done, one must know that the current FIB
code starts everything off with a single dimensional array of
pointers to FIB head structures (One per protocol family), each of
which in turn points to the trie of routes available to that family.
The basic change in the ABI compatible version of the change is to
extent that array to be a 2 dimensional array, so that
instead of protocol family X looking at rt_tables[X] for the
table it needs, it looks at rt_tables[Y][X] when for all
protocol families except ipv4 Y is always 0.
Code that is unaware of the change always just sees the first row
of the table, which of course looks just like the one dimensional
array that existed before.
The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
are all maintained, but refer only to the first row of the array,
so that existing callers in proprietary protocols can continue to
do the "right thing".
Some new entry points are added, for the exclusive use of ipv4 code
called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
which have an extra argument which refers the code to the correct row.
In addition, there are some new entry points (currently called
rtalloc_fib() and friends) that check the Address family being
looked up and call either rtalloc() (and friends) if the protocol
is not IPv4 forcing the action to row 0 or to the appropriate row
if it IS IPv4 (and that info is available). These are for calling
from code that is not specific to any particular protocol. The way
these are implemented would change in the non ABI preserving code
to be added later.
One feature of the first version of the code is that for ipv4,
the interface routes show up automatically on all the FIBs, so
that no matter what FIB you select you always have the basic
direct attached hosts available to you. (rtinit() does this
automatically).
You CAN delete an interface route from one FIB should you want
to but by default it's there. ARP information is also available
in each FIB. It's assumed that the same machine would have the
same MAC address, regardless of which FIB you are using to get
to it.
This brings us as to how the correct FIB is selected for an outgoing
IPV4 packet.
Firstly, all packets have a FIB associated with them. if nothing
has been done to change it, it will be FIB 0. The FIB is changed
in the following ways.
Packets fall into one of a number of classes.
1/ locally generated packets, coming from a socket/PCB.
Such packets select a FIB from a number associated with the
socket/PCB. This in turn is inherited from the process,
but can be changed by a socket option. The process in turn
inherits it on fork. I have written a utility call setfib
that acts a bit like nice..
setfib -3 ping target.example.com # will use fib 3 for ping.
It is an obvious extension to make it a property of a jail
but I have not done so. It can be achieved by combining the setfib and
jail commands.
2/ packets received on an interface for forwarding.
By default these packets would use table 0,
(or possibly a number settable in a sysctl(not yet)).
but prior to routing the firewall can inspect them (see below).
(possibly in the future you may be able to associate a FIB
with packets received on an interface.. An ifconfig arg, but not yet.)
3/ packets inspected by a packet classifier, which can arbitrarily
associate a fib with it on a packet by packet basis.
A fib assigned to a packet by a packet classifier
(such as ipfw) would over-ride a fib associated by
a more default source. (such as cases 1 or 2).
4/ a tcp listen socket associated with a fib will generate
accept sockets that are associated with that same fib.
5/ Packets generated in response to some other packet (e.g. reset
or icmp packets). These should use the FIB associated with the
packet being reponded to.
6/ Packets generated during encapsulation.
gif, tun and other tunnel interfaces will encapsulate using the FIB
that was in effect withthe proces that set up the tunnel.
thus setfib 1 ifconfig gif0 [tunnel instructions]
will set the fib for the tunnel to use to be fib 1.
Routing messages would be associated with their
process, and thus select one FIB or another.
messages from the kernel would be associated with the fib they
refer to and would only be received by a routing socket associated
with that fib. (not yet implemented)
In addition Netstat has been edited to be able to cope with the
fact that the array is now 2 dimensional. (It looks in system
memory using libkvm (!)). Old versions of netstat see only the first FIB.
In addition two sysctls are added to give:
a) the number of FIBs compiled in (active)
b) the default FIB of the calling process.
Early testing experience:
-------------------------
Basically our (IronPort's) appliance does this functionality already
using ipfw fwd but that method has some drawbacks.
For example,
It can't fully simulate a routing table because it can't influence the
socket's choice of local address when a connect() is done.
Testing during the generating of these changes has been
remarkably smooth so far. Multiple tables have co-existed
with no notable side effects, and packets have been routes
accordingly.
ipfw has grown 2 new keywords:
setfib N ip from anay to any
count ip from any to any fib N
In pf there seems to be a requirement to be able to give symbolic names to the
fibs but I do not have that capacity. I am not sure if it is required.
SCTP has interestingly enough built in support for this, called VRFs
in Cisco parlance. it will be interesting to see how that handles it
when it suddenly actually does something.
Where to next:
--------------------
After committing the ABI compatible version and MFCing it, I'd
like to proceed in a forward direction in -current. this will
result in some roto-tilling in the routing code.
Firstly: the current code's idea of having a separate tree per
protocol family, all of the same format, and pointed to by the
1 dimensional array is a bit silly. Especially when one considers that
there is code that makes assumptions about every protocol having the
same internal structures there. Some protocols don't WANT that
sort of structure. (for example the whole idea of a netmask is foreign
to appletalk). This needs to be made opaque to the external code.
My suggested first change is to add routing method pointers to the
'domain' structure, along with information pointing the data.
instead of having an array of pointers to uniform structures,
there would be an array pointing to the 'domain' structures
for each protocol address domain (protocol family),
and the methods this reached would be called. The methods would have
an argument that gives FIB number, but the protocol would be free
to ignore it.
When the ABI can be changed it raises the possibilty of the
addition of a fib entry into the "struct route". Currently,
the structure contains the sockaddr of the desination, and the resulting
fib entry. To make this work fully, one could add a fib number
so that given an address and a fib, one can find the third element, the
fib entry.
Interaction with the ARP layer/ LL layer would need to be
revisited as well. Qing Li has been working on this already.
This work was sponsored by Ironport Systems/Cisco
PR:
Reviewed by: several including rwatson, bz and mlair (parts each)
Approved by:
Obtained from: Ironport systems/Cisco
MFC after:
Security:
PR:
Submitted by:
Reviewed by:
Approved by:
Obtained from:
MFC after:
Security:
files after a seekdir().
The seekdir shall set the position for the next readdir operation.
When the _readdir_unlocked() encounters deleted entry, dd_loc is
already advanced. Continuing the loop leads to premature read of
the target entry.
Submitted by: Marc Balmer <mbalmer at openbsd org>
Obtained from: OpenBSD
MFC after: 2 weeks
accessor functions for its benefit now thaat FILE is opaque.
I'm sure there's a better way. I leave that for people to work
on in a src tree that isn't broken.
Pointy hat: jhb
move the definition of the type backing FILE (struct __sFILE) into an
internal header.
- Remove macros to inline certain operations from stdio.h. Applications
will now always call the functions instead.
- Move the various foo_unlocked() functions from unlocked.c into foo.c.
This lets some of the inlining macros (e.g. __sfeof()) move into
foo.c.
- Update a few comments.
- struct __sFILE can now go back to using mbstate_t, pthread_t, and
pthread_mutex_t instead of knowing about their private, backing types.
MFC after: 1 month
Reviewed by: kan
This substantially improves worst case allocation performance, since
O(lg n) tree search can be used instead of O(n) tree iteration.
Use rb_wrap() instead of directly calling rb_*() macros.
macros.
Add rb_foreach_next() and rb_foreach_reverse_prev(), which make it
possible to re-synchronize tree iteration after the tree has been
modified.
Rename rb_tree_new() to rb_new().
color bit in the least significant bit of the right child pointer, in
order to reduce red-black tree linkage overhead by ~2X as compared to
sys/tree.h.
Use the new red-black tree implementation in malloc, which drops
memory usage by ~0.5 or ~1%, for 32- and 64-bit systems, respectively.
There were no checks for left and right precisions at all, and
a check for field width had integer overflow bug.
Reported by: Maksymilian Arciemowicz
Security: http://securityreason.com/achievement_securityalert/53
Submitted by: Maxim Dounin <mdounin@mdounin.ru>
MFC after: 3 days
__sFILE. This was supposed to be done in 6.0. Some notes:
- Where possible I restored the various lines to their pre-__sFILEX state.
- Retire INITEXTRA() and just initialize the wchar bits (orientation and
mbstate) explicitly instead. The various places that used INITEXTRA
didn't need the locking fields or _up initialized. (Some places needed
_up to exist and not be off the end of a NULL or garbage pointer, but
they didn't require it to be initialized to a specific value.)
- For now, stdio.h "knows" that pthread_t is a 'struct pthread *' to
avoid namespace pollution of including all the pthread types in stdio.h.
Once we remove all the inlines and make __sFILE private it can go back
to using pthread_t, etc.
- This does not remove any of the inlines currently and does not change
any of the public ABI of 'FILE'.
MFC after: 1 month
Reviewed by: peter
deals with the usual __opendir2() calls, and the rest part with an interface
translator to expose fdopendir(3) functionality. Manual page was obtained from
kib@'s work for *at(2) system calls.
1. Previously, printing the number 1.0 could produce 0x1p+0, 0x2p-1,
0x4p-2, or 0x8p-3, depending on what happened to be convenient. This
meant that printing a value as a double and printing the same value
as a long double could produce different (but equivalent) results.
The change is to always make the leading digit a 1, unless the
number is 0. This solves the aforementioned problem and has
several other advantages.
2. Use the FPU to do rounding. This is far simpler and more portable
than manipulating the bits, and it fixes an obsure round-to-even
bug. It also raises the exceptions now required by IEEE 754R.
The drawbacks are that it is usually slightly slower, and it makes
printf less effective as a debugging tool when the FPU is hosed
(e.g., due to a buggy softfloat implementation).
3. On i386, twiddle the rounding precision so that (2) works properly
for long doubles.
4. Make several simplifications that are now possible due to (2).
5. Split __hldtoa() into a separate file.
Thanks to remko for access to a sparc64 box for testing.
flags appropriately. The next step is to make it raise a SIGFPE if
any exceptions are unmasked.
Thanks to remko for access to a sparc64 box for testing.
__xdrrec_getrec has returned TRUE, then we have a complete request in
the buffer - calling xdrrec_skiprecord is not necessary. In particular,
if there is another record already buffered on the stream,
xdrrec_skiprecord will discard both this request and the next
one, causing the call to xdr_callmsg to fail and the stream to be
closed.
Sponsored by: Isilon Systems
live in libm, while modf() lives in libc due to historical
mistakes. I'm claiming in the manpage that they all live in libm,
since programmers should not rely on the mistake.
user-mode lock manager, build a kernel with the NFSLOCKD option and
add '-k' to 'rpc_lockd_flags' in rc.conf.
Highlights include:
* Thread-safe kernel RPC client - many threads can use the same RPC
client handle safely with replies being de-multiplexed at the socket
upcall (typically driven directly by the NIC interrupt) and handed
off to whichever thread matches the reply. For UDP sockets, many RPC
clients can share the same socket. This allows the use of a single
privileged UDP port number to talk to an arbitrary number of remote
hosts.
* Single-threaded kernel RPC server. Adding support for multi-threaded
server would be relatively straightforward and would follow
approximately the Solaris KPI. A single thread should be sufficient
for the NLM since it should rarely block in normal operation.
* Kernel mode NLM server supporting cancel requests and granted
callbacks. I've tested the NLM server reasonably extensively - it
passes both my own tests and the NFS Connectathon locking tests
running on Solaris, Mac OS X and Ubuntu Linux.
* Userland NLM client supported. While the NLM server doesn't have
support for the local NFS client's locking needs, it does have to
field async replies and granted callbacks from remote NLMs that the
local client has contacted. We relay these replies to the userland
rpc.lockd over a local domain RPC socket.
* Robust deadlock detection for the local lock manager. In particular
it will detect deadlocks caused by a lock request that covers more
than one blocking request. As required by the NLM protocol, all
deadlock detection happens synchronously - a user is guaranteed that
if a lock request isn't rejected immediately, the lock will
eventually be granted. The old system allowed for a 'deferred
deadlock' condition where a blocked lock request could wake up and
find that some other deadlock-causing lock owner had beaten them to
the lock.
* Since both local and remote locks are managed by the same kernel
locking code, local and remote processes can safely use file locks
for mutual exclusion. Local processes have no fairness advantage
compared to remote processes when contending to lock a region that
has just been unlocked - the local lock manager enforces a strict
first-come first-served model for both local and remote lockers.
Sponsored by: Isilon Systems
PR: 95247 107555 115524 116679
MFC after: 2 weeks
of the array length needed to store all the directory entries.
Although BSD has historically guaranteed that st_size is the size
of the directory file, POSIX does not, and more to the point, some
recent filesystems such as ZFS use st_size to mean something else.
The fix is to not stat the directory at all, set the initial
array size to 32 entries, and realloc it in powers of 2 if that
proves insufficient.
PR: 113668
Solaris and AIX.
fcntl(fd, F_DUP2FD, arg) and dup2(fd, arg) are functionnaly equivalent.
Document it.
Add some regression tests (identical to the dup2(2) regression tests).
PR: 120233
Submitted by: Jukka Ukkonen
Approved by: rwaston (mentor)
MFC after: 1 month
and assignment.
- Add a reference to a struct cpuset in each thread that is inherited from
the thread that created it.
- Release the reference when the thread is destroyed.
- Add prototypes for syscalls and macros for manipulating cpusets in
sys/cpuset.h
- Add syscalls to create, get, and set new numbered cpusets:
cpuset(), cpuset_{get,set}id()
- Add syscalls for getting and setting affinity masks for cpusets or
individual threads: cpuid_{get,set}affinity()
- Add types for the 'level' and 'which' parameters for the cpuset. This
will permit expansion of the api to cover cpu masks for other objects
identifiable with an id_t integer. For example, IRQs and Jails may be
coming soon.
- The root set 0 contains all valid cpus. All thread initially belong to
cpuset 1. This permits migrating all threads off of certain cpus to
reserve them for special applications.
Sponsored by: Nokia
Discussed with: arch, rwatson, brooks, davidxu, deischen
Reviewed by: antoine
This reduces the size of a statically-linked binary by approximately 100KB
in a trivial "return (0)" test application. readelf -S was used to verify
that the .text section was reduced and that using strlen() saved a few
more bytes over using sizeof(). Since the section of code is only called
when environ is corrupt (program bug), I went with fewer bytes over fewer
cycles.
I made minor edits to the submitted patch to make the output resemble
warnx().
Submitted by: kib bz
Approved by: wes (mentor)
MFC after: 5 days
them. Thus, any fd whose value is greater than SHRT_MAX is handled
incorrectly (the short value is sign-extended when converted to an int).
An unpleasant side effect is that if fopen() opens a file and gets a
backing fd that is greater than SHRT_MAX, fclose() will fail and the file
descriptor will be leaked. Better handle this by fixing fopen(), fdopen(),
and freopen() to fail attempts to use a fd greater than SHRT_MAX with
EMFILE.
At some point in the future we should look at expanding the file descriptor
in FILE to an int, but that is a bit complicated due to ABI issues.
MFC after: 1 week
Discussed on: arch
Reviewed by: wollman
{SHRT_MAX}, so {STREAM_MAX} should be no greater than that. (This
does not exactly meet the letter of POSIX but comes reasonably close
to it in spirit.)
MFC after: 14 days
variations (e500 currently), this provides a gcc-level FPU emulation and is an
alternative approach to the recently introduced kernel-level emulation
(FPU_EMU).
Approved by: cognet (mentor)
MFp4: e500
reallocation, when junk filling is enabled. Junk filling must occur
prior to shrinking, since any deallocated trailing pages are immediately
available for use by other threads.
Reported by: Mats Palmgren <mats.palmgren@bredband.net>
allocation patterns, number of CPUs, and MALLOC_OPTIONS settings indicate
that lazy deallocation has the potential to worsen throughput dramatically.
Performance degradation occurs when multiple threads try to clear the lazy
free cache simultaneously. Various experiments to avoid this bottleneck
failed to completely solve this problem, while adding yet more complexity.
is a violation of RFC 1034 [STD 13], it is accepted by certain name servers
as well as other popular operating systems' resolver library.
Bugs are mine.
Obtained from: ume
MFC after: 2 weeks
arena_dalloc_lazy_hard() was split out of arena_dalloc_lazy() in revision
1.162.
Reduce thundering herd problems in lazy deallocation by randomly varying
how many probes a thread does before taking the slow path.
assumptions about whether bits are set at various times. This makes
adding other flags safe.
Reorganize functions in order to inline i{m,c,p,s,re}alloc(). This
allows the entire fast-path call chains for malloc() and free() to be
inlined. [1]
Suggested by: [1] Stuart Parmenter <stuart@mozilla.com>
threshold, according to the 'F' MALLOC_OPTIONS flag. This obsoletes the
'H' flag.
Try to realloc() large objects in place. This substantially speeds up
incremental large reallocations in the common case.
Fix a bug in arena_ralloc() that caused relocation of sub-page objects
even if the old and new sizes were in the same size class.
Maintain trees of runs and simplify the per-chunk page map. This allows
logarithmic-time searching for sufficiently large runs in
arena_run_alloc(), whereas the previous algorithm required linear time
in the worst case.
Break various large functions into smaller sub-functions, and inline
only the functions that are in the fast path for small object
allocation/deallocation.
Remove an unnecessary check in base_pages_alloc_mmap().
Avoid integer division in choose_arena() for the NO_TLS case on
single-CPU systems.
referencing the files VM pages are returned from the network stack,
making changes to the file safe.
This flag does not guarantee that the data has been transmitted to the
other end.
fields in FTS and FTSENT structs being too narrow. In addition,
the narrow types creep from there into fts.c. As a result, fts(3)
consumers, e.g., find(1) or rm(1), can't handle file trees an ordinary
user can create, which can have security implications.
To fix the historic implementation of fts(3), OpenBSD and NetBSD
have already changed <fts.h> in somewhat incompatible ways, so we
are free to do so, too. This change is a superset of changes from
the other BSDs with a few more improvements. It doesn't touch
fts(3) functionality; it just extends integer types used by it to
match modern reality and the C standard.
Here are its points:
o For C object sizes, use size_t unless it's 100% certain that
the object will be really small. (Note that fts(3) can construct
pathnames _much_ longer than PATH_MAX for its consumers.)
o Avoid the short types because on modern platforms using them
results in larger and slower code. Change shorts to ints as
follows:
- For variables than count simple, limited things like states,
use plain vanilla `int' as it's the type of choice in C.
- For a limited number of bit flags use `unsigned' because signed
bit-wise operations are implementation-defined, i.e., unportable,
in C.
o For things that should be at least 64 bits wide, use long long
and not int64_t, as the latter is an optional type. See
FTSENT.fts_number aka FTS.fts_bignum. Extending fts_number `to
satisfy future needs' is pointless because there is fts_pointer,
which can be used to link to arbitrary data from an FTSENT.
However, there already are fts(3) consumers that require fts_number,
or fts_bignum, have at least 64 bits in it, so we must allow for them.
o For the tree depth, use `long'. This is a trade-off between making
this field too wide and allowing for 64-bit inode numbers and/or
chain-mounted filesystems. On the one hand, `long' is almost
enough for 32-bit filesystems on a 32-bit platform (our ino_t is
uint32_t now). On the other hand, platforms with a 64-bit (or
wider) `long' will be ready for 64-bit inode numbers, as well as
for several 32-bit filesystems mounted one under another. Note
that fts_level has to be signed because -1 is a magic value for it,
FTS_ROOTPARENTLEVEL.
o For the `nlinks' local var in fts_build(), use `long'. The logic
in fts_build() requires that `nlinks' be signed, but our nlink_t
currently is uint16_t. Therefore let's make the signed var wide
enough to be able to represent 2^16-1 in pure C99, and even 2^32-1
on a 64-bit platform. Perhaps the logic should be changed just
to use nlink_t, but it can be done later w/o breaking fts(3) ABI
any more because `nlinks' is just a local var.
This commit also inludes supporting stuff for the fts change:
o Preserve the old versions of fts(3) functions through libc symbol
versioning because the old versions appeared in all our former releases.
o Bump __FreeBSD_version just in case. There is a small chance that
some ill-written 3-rd party apps may fail to build or work correctly
if compiled after this change.
o Update the fts(3) manpage accordingly. In particular, remove
references to fts_bignum, which was a FreeBSD-specific hack to work
around the too narrow types of FTSENT members. Now fts_number is
at least 64 bits wide (long long) and fts_bignum is an undocumented
alias for fts_number kept around for compatibility reasons. According
to Google Code Search, the only big consumers of fts_bignum are in
our own source tree, so they can be fixed easily to use fts_number.
o Mention the change in src/UPDATING.
PR: bin/104458
Approved by: re (quite a while ago)
Discussed with: deischen (the symbol versioning part)
Reviewed by: -arch (mostly silence); das (generally OK, but we didn't
agree on some types used; assuming that no objections on
-arch let me to stick to my opinion)
instead of 32+32+15+1) on all arches that have such long doubles (amd64,
ia64 and i386). Large objects should be be accessed in large units,
and the 32+32+15+1[+padding] decomposition asks for almost the opposite
of that, sometimes resulting in very slow accesses depending on how
well the compiler ignores what we ask for and converts to the best
units for the given machine. E.g., on Athlons, there is a 10-20 cycle
penalty for accessing the middle 32-bit word immediately after an
80-bit store.
Whether actually using the alternative view is better is very machine-
dependent. A 32+32+16 view is probably best with old 32-bit systems
and gcc through 4.2.1. The compiler should mostly avoid the view and
generate best accesses, but gcc-4.2.1 is far from doing that. I think
64+16 is best for now. Similarly for doubles -- they should be using
64+0 especially on 64-bit machines, but fdlibm uses 32+32 extensively
for them. Fortunately, in 64-bit mode for doubles, gcc already ignores
the 32+32-bit view and generates best accesses in many cases.
into slowsort for some sequences because different parts of the
code used 'r' to store two different things, one of which was
signed. Clean things up by splitting 'r' into two variables, and
use a more meaningful name.
implement shm_open(2) and shm_unlink(2) in the kernel:
- Each shared memory file descriptor is associated with a swap-backed vm
object which provides the backing store. Each descriptor starts off with
a size of zero, but the size can be altered via ftruncate(2). The shared
memory file descriptors also support fstat(2). read(2), write(2),
ioctl(2), select(2), poll(2), and kevent(2) are not supported on shared
memory file descriptors.
- shm_open(2) and shm_unlink(2) are now implemented as system calls that
manage shared memory file descriptors. The virtual namespace that maps
pathnames to shared memory file descriptors is implemented as a hash
table where the hash key is generated via the 32-bit Fowler/Noll/Vo hash
of the pathname.
- As an extension, the constant 'SHM_ANON' may be specified in place of the
path argument to shm_open(2). In this case, an unnamed shared memory
file descriptor will be created similar to the IPC_PRIVATE key for
shmget(2). Note that the shared memory object can still be shared among
processes by sharing the file descriptor via fork(2) or sendmsg(2), but
it is unnamed. This effectively serves to implement the getmemfd() idea
bandied about the lists several times over the years.
- The backing store for shared memory file descriptors are garbage
collected when they are not referenced by any open file descriptors or
the shm_open(2) virtual namespace.
Submitted by: dillon, peter (previous versions)
Submitted by: rwatson (I based this on his version)
Reviewed by: alc (suggested converting getmemfd() to shm_open())
default. This has the disadvantage of rendering the datasize resource
limit irrelevant, but without this change, legitimate uses of more
memory than will fit in the data segment are thwarted by default.
Fix chunk_alloc_mmap() to work correctly if initial mapping is not
chunk-aligned and mapping extension fails.
Clean up DSS-related locking and protect all pertinent variables with
dss_mtx (remove dss_chunks_mtx). This fixes race conditions that could
cause chunk leaks.
Reported by: [1] kris
This is a long-standing bug, but until recent changes it was difficult
to trigger, and even then its impact was non-catastrophic, with the
exception of revision 1.157.
Optimize chunk_alloc_mmap() to avoid the need for unmapping pages in the
common case. Thanks go to Kris Kennaway for a patch that inspired this
change.
Do not maintain a record of previously mmap'ed chunk address ranges.
The original intent was to avoid the extra system call overhead in
chunk_alloc_mmap(), which is no longer a concern. This also allows some
simplifications for the tree of unused DSS chunks.
Introduce huge_mtx and dss_chunks_mtx to replace chunks_mtx. There was
no compelling reason to use the same mutex for these disjoint purposes.
Avoid memset() for huge allocations when possible.
Maintain two trees instead of one for tracking unused DSS address
ranges. This allows scalable allocation of multi-chunk huge objects in
the DSS. Previously, multi-chunk huge allocation requests failed if the
DSS could not be extended.
order to support re-use of multi-chunk unused regions within the DSS for
huge allocations. This generalization is important to correct function
when mmap-based allocation is disabled.
Avoid zeroing re-used memory in the DSS unless it really needs to be
zeroed.
memory is acquired from the system via sbrk(2) and/or mmap(2). By default,
use sbrk(2) only, in order to support traditional use of resource limits.
Additionally, when both options are enabled, prefer the data segment to
anonymous mappings, in order to coexist better with large file mappings
in applications on 32-bit platforms. This change has the potential to
increase memory fragmentation due to the linear nature of the data
segment, but from a performance perspective this is mitigated by the use
of madvise(2). [1]
Add the ability to interpret integer prefixes in MALLOC_OPTIONS
processing. For example, MALLOC_OPTIONS=lllllllll can now be specified as
MALLOC_OPTIONS=9l.
Reported by: [1] rwatson
Design review: [1] alc, peter, rwatson
- Use PTY* for all pty(4) related constants.
- Use PTMX* for all pts(4) related constants.
- Consistently use _PATH_DEV PTMX rather than "/dev/ptmx".
- Revert 1.7 and properly fix it by using the correct prefix string for
pts(4) masters.
MFC after: 3 days
my original implementation made both use the same code. Unfortunately,
this meant libm depended on a vendor header at compile time and previously-
unexposed vendor bits in libc at runtime.
Hence, I just wrote my own version of the relevant vendor routine. As it
turns out, mine has a factor of 8 fewer of lines of code, and is a bit more
readable anyway. The strtod() and *scanf() routines still use vendor code.
Reviewed by: bde
calculating run sizes. Use of the floating point unit was a potential
pessimization to context switching for applications that do not otherwise
use floating point math. [1]
Reformat cpp macro-related comments to improve consistency.
Submitted by: das
deallocation and dynamic load balancing via the MALLOC_LAZY_FREE and
MALLOC_BALANCE knobs. This is a non-functional change, since these
features are still enabled when possible.
Clean up a few things that more pedantic compiler settings would cause
complaints over.
adds two new directories in msun: ld80 and ld128. These are for
long double functions specific to the 80-bit long double format
used on x86-derived architectures, and the 128-bit format used on
sparc64, respectively.
when particular function can't be found in nsswitch-module. For
example, getgrouplist(3) will use module-supplied 'getgroupmembership'
function (which can work in an optimal way for such source as LDAP) and
will fall back to the stanard iterate-through-all-groups implementation
otherwise.
PR: ports/114655
Submitted by: Michael Hanselmann <freebsd AT hansmi DOT ch>
Reviewed by: brooks (mentor)