- add support to power off the system [2]
- check the linux magic values [3]
Submitted by: Marcin Cieslak <saper@SYSTEM.PL> [1,2]
Modelled after: linux man page of the reboot() syscall [3]
Found by: LTP testcase "reboot02" [1]
Tested with: LTP testcase "reboot02" [1,3]
MFC after: 1 week
handling for amd64 in the common code. The MD parts for amd64 are still
outstanding, but at least this fixes some panics on amd64.
Sponsored by: Google SoC 2006
Submitted by: rdivacky
Tested by: bsam
- protect td->td_proc->p_pid with the proc lock in linux_getpid
in the amd64 (= non i386) case [1]
Sponsored by: Google SoC 2006
Submitted by: rdivacky
Noticed by: netchild [1]
- TLS - complete
- pid/tid mangling - complete
- thread area - complete
- futexes - complete with issues
- clone() extension - complete with some possible minor issues
- mq*/timer*/clock* stuff - complete but untested and the mq* stuff is
disabled when not build as part of the kernel with native FreeBSD mq*
support (module support for this will come later)
Tested with:
- linux-firefox - works, tested
- linux-opera - works, tested
- linux-realplay - doesnt work, issue with futexes
- linux-skype - doesnt work, issue with futexes
- linux-rt2-demo - works, tested
- linux-acroread - doesnt work, unknown reason (coredump) and sometimes
issue with futexes
- various unix utilities in linux-base-gentoo3 and linux-base-fc4:
everything tried worked
On amd64 not everything is supported like on i386, the catchup is planned for
later when the remaining bugs in the new functions are fixed.
To test this new stuff, you have to run
sysctl compat.linux.osrelease=2.6.16
to switch back use
sysctl compat.linux.osrelease=2.4.2
Don't switch while running a linux program, strange things may or may not
happen.
Sponsored by: Google SoC 2006
Submitted by: rdivacky
Some suggestions/help by: jhb, kib, manu@NetBSD.org, netchild
Giant VFS locking in that function.
- Remove bogus code to handle the case where namei() returns success but a
NULL vnode pointer.
- Note that this code duplicates exec_check_permissions() and annotate
where it differs.
- Hold the vnode lock longer to protect the write to set VV_TEXT in
v_vflag.
- Mark linux_uselib() MPSAFE.
Reviewed by: rwatson
and don't panic.
This fix is different from the patch submitted as it not only prevents
a NULL-pointer dereference, but also skips some work in this case.
Noticed by: Dmitry Ganenko <dima@apk-inform.com>
Reviewed by: rdivacky (the original version as in emulation@)
MFC after: 1 week
Security: This is a RELENG_x_y candidate (local DoS).
Go ahead by: secteam (cperciva)
changes in MD code are trivial, before this change, trapsignal and
sendsig use discrete parameters, now they uses member fields of
ksiginfo_t structure. For sendsig, this change allows us to pass
POSIX realtime signal value to user code.
2. Remove cpu_thread_siginfo, it is no longer needed because we now always
generate ksiginfo_t data and feed it to libpthread.
3. Add p_sigqueue to proc structure to hold shared signals which were
blocked by all threads in the proc.
4. Add td_sigqueue to thread structure to hold all signals delivered to
thread.
5. i386 and amd64 now return POSIX standard si_code, other arches will
be fixed.
6. In this sigqueue implementation, pending signal set is kept as before,
an extra siginfo list holds additional siginfo_t data for signals.
kernel code uses psignal() still behavior as before, it won't be failed
even under memory pressure, only exception is when deleting a signal,
we should call sigqueue_delete to remove signal from sigqueue but
not SIGDELSET. Current there is no kernel code will deliver a signal
with additional data, so kernel should be as stable as before,
a ksiginfo can carry more information, for example, allow signal to
be delivered but throw away siginfo data if memory is not enough.
SIGKILL and SIGSTOP have fast path in sigqueue_add, because they can
not be caught or masked.
The sigqueue() syscall allows user code to queue a signal to target
process, if resource is unavailable, EAGAIN will be returned as
specification said.
Just before thread exits, signal queue memory will be freed by
sigqueue_flush.
Current, all signals are allowed to be queued, not only realtime signals.
Earlier patch reviewed by: jhb, deischen
Tested on: i386, amd64
the type of object represented by the handle argument.
- Allow vm_mmap() to map device memory via cdev objects in addition to
vnodes and anonymous memory. Note that mmaping a cdev directly does not
currently perform any MAC checks like mapping a vnode does.
- Unbreak the DRM getbufs ioctl by having it call vm_mmap() directly on the
cdev the ioctl is acting on rather than trying to find a suitable vnode
to map from.
Reviewed by: alc, arch@
pops data from the userland and pushes results back and the second which does
actual processing. Use the latter to eliminate stackgap in the linux wrappers
of those syscalls.
MFC after: 2 weeks
the raw values including for child process statistics and only compute the
system and user timevals on demand.
- Fix the various kern_wait() syscall wrappers to only pass in a rusage
pointer if they are going to use the result.
- Add a kern_getrusage() function for the ABI syscalls to use so that they
don't have to play stackgap games to call getrusage().
- Fix the svr4_sys_times() syscall to just call calcru() to calculate the
times it needs rather than calling getrusage() twice with associated
stackgap, etc.
- Add a new rusage_ext structure to store raw time stats such as tick counts
for user, system, and interrupt time as well as a bintime of the total
runtime. A new p_rux field in struct proc replaces the same inline fields
from struct proc (i.e. p_[isu]ticks, p_[isu]u, and p_runtime). A new p_crux
field in struct proc contains the "raw" child time usage statistics.
ruadd() has been changed to handle adding the associated rusage_ext
structures as well as the values in rusage. Effectively, the values in
rusage_ext replace the ru_utime and ru_stime values in struct rusage. These
two fields in struct rusage are no longer used in the kernel.
- calcru() has been split into a static worker function calcru1() that
calculates appropriate timevals for user and system time as well as updating
the rux_[isu]u fields of a passed in rusage_ext structure. calcru() uses a
copy of the process' p_rux structure to compute the timevals after updating
the runtime appropriately if any of the threads in that process are
currently executing. It also now only locks sched_lock internally while
doing the rux_runtime fixup. calcru() now only requires the caller to
hold the proc lock and calcru1() only requires the proc lock internally.
calcru() also no longer allows callers to ask for an interrupt timeval
since none of them actually did.
- calcru() now correctly handles threads executing on other CPUs.
- A new calccru() function computes the child system and user timevals by
calling calcru1() on p_crux. Note that this means that any code that wants
child times must now call this function rather than reading from p_cru
directly. This function also requires the proc lock.
- This finishes the locking for rusage and friends so some of the Giant locks
in exit1() and kern_wait() are now gone.
- The locking in ttyinfo() has been tweaked so that a shared lock of the
proctree lock is used to protect the process group rather than the process
group lock. By holding this lock until the end of the function we now
ensure that the process/thread that we pick to dump info about will no
longer vanish while we are trying to output its info to the console.
Submitted by: bde (mostly)
MFC after: 1 month
on AMD64, and the general case where the emulated platform has different
size pointers than we use natively:
- declare certain structure members as l_uintptr_t and use the new PTRIN
and PTROUT macros to convert to and from native pointers.
- declare some structures __packed on amd64 when the layout would differ
from that used on i386.
- include <machine/../linux32/linux.h> instead of <machine/../linux/linux.h>
if compiling with COMPAT_LINUX32. This will need to be revisited before
32-bit and 64-bit Linux emulation support can coexist in the same kernel.
- other small scattered changes.
This should be a no-op on i386 and Alpha.
somewhat clearer, but more importantly allows for a consistent naming
scheme for suser_cred flags.
The old name is still defined, but will be removed in a few days (unless I
hear any complaints...)
Discussed with: rwatson, scottl
Requested by: jhb
options, status pointer and rusage pointer as arguments. It is up to
the caller to copyout the status and rusage to userland if needed. This
lets us axe the 'compat' argument and hide all that functionality in
owait(), by the way. This also cleans up some locking in kern_wait()
since it no longer has to drop locks around copyout() since all the
copyout()'s are deferred.
- Convert owait(), wait4(), and the various ABI compat wait() syscalls to
use kern_wait() rather than wait1() or wait4(). This removes a bit
more stackgap usage.
Tested on: i386
Compiled on: i386, alpha, amd64
- struct plimit includes a mutex to protect a reference count. The plimit
structure is treated similarly to struct ucred in that is is always copy
on write, so having a reference to a structure is sufficient to read from
it without needing a further lock.
- The proc lock protects the p_limit pointer and must be held while reading
limits from a process to keep the limit structure from changing out from
under you while reading from it.
- Various global limits that are ints are not protected by a lock since
int writes are atomic on all the archs we support and thus a lock
wouldn't buy us anything.
- All accesses to individual resource limits from a process are abstracted
behind a simple lim_rlimit(), lim_max(), and lim_cur() API that return
either an rlimit, or the current or max individual limit of the specified
resource from a process.
- dosetrlimit() was renamed to kern_setrlimit() to match existing style of
other similar syscall helper functions.
- The alpha OSF/1 compat layer no longer calls getrlimit() and setrlimit()
(it didn't used the stackgap when it should have) but uses lim_rlimit()
and kern_setrlimit() instead.
- The svr4 compat no longer uses the stackgap for resource limits calls,
but uses lim_rlimit() and kern_setrlimit() instead.
- The ibcs2 compat no longer uses the stackgap for resource limits. It
also no longer uses the stackgap for accessing sysctl's for the
ibcs2_sysconf() syscall but uses kernel_sysctl() instead. As a result,
ibcs2_sysconf() no longer needs Giant.
- The p_rlimit macro no longer exists.
Submitted by: mtm (mostly, I only did a few cleanups and catchups)
Tested on: i386
Compiled on: alpha, amd64
- improve sysinfo(2) syscall;
- add dummy fadvise64(2) syscall;
- add dummy *xattr(2) family of syscalls;
- add protos for the syscalls 222-225, 238-249 and 253-267;
- add exit_group(2) syscall, which is currently just wired to exit(2).
Obtained from: OpenBSD
MFC after: 2 weeks
with 64-bit longs again. This was fixed in rev.1.42 but the fix
rotted non-fatally in rev.1.105 and fatally in rev.1.137.
Many more non-egregrious casts are strictly required for conversions
from semi-opaque types to pointers, but we avoid most of them by using
types that are almost certain to be compatible with uintptr_t for
representing pointers (e.g., vm_offset_t). Here we don't really want
the u_longs, but we have them because a.out.h and its support code
doesn't use typedefs (it uses unsigned in V7 and unsigned long in
FreeBSD) and is too obsolete to fix now.
- cut the version string at the newline, suppressing information about
who built the kernel and in what directory. Most of this information
was already lost to truncation.
- on i386, return the precise CPU class (if known) rather than just
"i386". Linux software which uses this information to select
which binary to run often does not know what to make of "i386".
contain the filedescriptor number on opens from userland.
The index is used rather than a "struct file *" since it conveys a bit
more information, which may be useful to in particular fdescfs and /dev/fd/*
For now pass -1 all over the place.
paging space and how much of it is in use (in pages).
Use this interface from the Linuxolator instead of groping around in the
internals of the swap_pager.
take a thread instead of a proc for their first argument.
- Add a mutex to protect the system-wide Linux osname, osrelease, and
oss_version variables.
- Change linux_get_prison() to take a thread instead of a proc for its
first argument and to use td_ucred rather than p_ucred. This is ok
because a thread's prison does not change even though it's ucred might.
- Also, change linux_get_prison() to return a struct prison * instead of
a struct linux_prison * since it returns with the struct prison locked
and this makes it easier to safely unlock the prison when we are done
messing with it.
sched_lock around accesses to p_stats->p_timer[] to avoid a potential
race with hardclock. getitimer(), setitimer() and the realitexpire()
callout are now Giant-free.
gets signals operating based on a TailQ, and is good enough to run X11,
GNOME, and do job control. There are some intricate parts which could be
more refined to match the sigset_t versions, but those require further
evaluation of directions in which our signal system can expand and contract
to fit our needs.
After this has been in the tree for a while, I will make in kernel API
changes, most notably to trapsignal(9) and sendsig(9), to use ksiginfo
more robustly, such that we can actually pass information with our
(queued) signals to the userland. That will also result in using a
struct ksiginfo pointer, rather than a signal number, in a lot of
kern_sig.c, to refer to an individual pending signal queue member, but
right now there is no defined behaviour for such.
CODAFS is unfinished in this regard because the logic is unclear in
some places.
Sponsored by: New Gold Technology
Reviewed by: bde, tjr, jake [an older version, logic similar]
- v_vflag is protected by the vnode lock and is used when synchronization
with VOP calls is needed.
- v_iflag is protected by interlock and is used for dealing with vnode
management issues. These flags include X/O LOCK, FREE, DOOMED, etc.
- All accesses to v_iflag and v_vflag have either been locked or marked with
mp_fixme's.
- Many ASSERT_VOP_LOCKED calls have been added where the locking was not
clear.
- Many functions in vfs_subr.c were restructured to provide for stronger
locking.
Idea stolen from: BSD/OS
kernel access control.
Invoke appropriate MAC entry points for a number of VFS-related
operations in the Linux ABI module. In particular, handle uselib
in a manner similar to open() (more work is probably needed here),
as well as handle statfs(), and linux readdir()-like calls.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
VOP_OPEN() and doing lots of manual checking. This would further
centralize use of the name functions, and once the MAC code is integrated,
meaning few extraneous MAC checks scattered all over the place. I don't
have time to fix this now, but want to make sure it doesn't get
forgotten. Anyone interested in fixing this should feel free.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
general cleanup of the API. The entire API now consists of two functions
similar to the pre-KSE API. The suser() function takes a thread pointer
as its only argument. The td_ucred member of this thread must be valid
so the only valid thread pointers are curthread and a few kernel threads
such as thread0. The suser_cred() function takes a pointer to a struct
ucred as its first argument and an integer flag as its second argument.
The flag is currently only used for the PRISON_ROOT flag.
Discussed on: smp@
getpid, getuid, getgid and pipe, since they bootstrapped from
OSF/1 and never cleaned up. Switch to the native syscalls
on alpha so that the above functions work
MFC after: 7 days
mutable contents of struct prison (hostname, securelevel, refcount,
pr_linux, ...)
o Generally introduce mtx_lock()/mtx_unlock() calls throughout kern/
so as to enforce these protections, in particular, in kern_mib.c
protection sysctl access to the hostname and securelevel, as well as
kern_prot.c access to the securelevel for access control purposes.
o Rewrite linux emulator abstractions for accessing per-jail linux
mib entries (osname, osrelease, osversion) so that they don't return
a pointer to the text in the struct linux_prison, rather, a copy
to an array passed into the calls. Likewise, update linprocfs to
use these primitives.
o Update in_pcb.c to always use prison_getip() rather than directly
accessing struct prison.
Reviewed by: jhb
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.
Sorry john! (your next MFC will be a doosie!)
Reviewed by: peter@freebsd.org, dillon@freebsd.org
X-MFC after: ha ha ha ha
o Introduce private types for use in linux syscalls for two reasons:
1. establish type independence for ease in porting and,
2. provide a visual queue as to which syscalls have proper
prototypes to further cleanup the i386/alpha split.
Linuxulator types are prefixed by 'l_'. void and char have not
been "virtualized".
o Provide dummy functions for all syscalls and remove dummy functions
or implementations of truely obsolete syscalls.
o Sanitize the shm*, sem* and msg* syscalls.
o Make a first attempt to implement the linux_sysctl syscall. At this
time it only returns one MIB (KERN_VERSION), but most importantly,
it tells us when we need to add additional sysctls :-)
o Bump the kenel version up to 2.4.2 (this is not the same as the
KERN_VERSION MIB, BTW).
o Implement new syscalls, of which most are specific to i386. Our
syscall table is now up to date with Linux 2.4.2. Some highlights:
- Implement the 32-bit uid_t and gid_t bases syscalls.
- Implement a couple of 64-bit file size/offset bases syscalls.
o Fix or improve numerous syscalls and prototypes.
o Reduce style(9) violations while I'm here. Especially indentation
inconsistencies within the same file are addressed. Re-indenting
did not obfuscate actual changes to the extend that it could not
be combined.
NOTE: I spend some time testing these changes and found that if there
were regressions, they were not caused by these changes AFAICT.
It was observed that installing a RH 7.1 runtime environment
did make matters worse. Hangs and/or reboots have been observed
with and without these changes, so when it failed to make life
better in cases it doesn't look like it made it worse.
real uid, saved uid, real gid, and saved gid to ucred, as well as the
pcred->pc_uidinfo, which was associated with the real uid, only rename
it to cr_ruidinfo so as not to conflict with cr_uidinfo, which
corresponds to the effective uid.
o Remove p_cred from struct proc; add p_ucred to struct proc, replacing
original macro that pointed.
p->p_ucred to p->p_cred->pc_ucred.
o Universally update code so that it makes use of ucred instead of pcred,
p->p_ucred instead of p->p_pcred, cr_ruidinfo instead of p_uidinfo,
cr_{r,sv}{u,g}id instead of p_*, etc.
o Remove pcred0 and its initialization from init_main.c; initialize
cr_ruidinfo there.
o Restruction many credential modification chunks to always crdup while
we figure out locking and optimizations; generally speaking, this
means moving to a structure like this:
newcred = crdup(oldcred);
...
p->p_ucred = newcred;
crfree(oldcred);
It's not race-free, but better than nothing. There are also races
in sys_process.c, all inter-process authorization, fork, exec, and
exit.
o Remove sigio->sio_ruid since sigio->sio_ucred now contains the ruid;
remove comments indicating that the old arrangement was a problem.
o Restructure exec1() a little to use newcred/oldcred arrangement, and
use improved uid management primitives.
o Clean up exit1() so as to do less work in credential cleanup due to
pcred removal.
o Clean up fork1() so as to do less work in credential cleanup and
allocation.
o Clean up ktrcanset() to take into account changes, and move to using
suser_xxx() instead of performing a direct uid==0 comparision.
o Improve commenting in various kern_prot.c credential modification
calls to better document current behavior. In a couple of places,
current behavior is a little questionable and we need to check
POSIX.1 to make sure it's "right". More commenting work still
remains to be done.
o Update credential management calls, such as crfree(), to take into
account new ruidinfo reference.
o Modify or add the following uid and gid helper routines:
change_euid()
change_egid()
change_ruid()
change_rgid()
change_svuid()
change_svgid()
In each case, the call now acts on a credential not a process, and as
such no longer requires more complicated process locking/etc. They
now assume the caller will do any necessary allocation of an
exclusive credential reference. Each is commented to document its
reference requirements.
o CANSIGIO() is simplified to require only credentials, not processes
and pcreds.
o Remove lots of (p_pcred==NULL) checks.
o Add an XXX to authorization code in nfs_lock.c, since it's
questionable, and needs to be considered carefully.
o Simplify posix4 authorization code to require only credentials, not
processes and pcreds. Note that this authorization, as well as
CANSIGIO(), needs to be updated to use the p_cansignal() and
p_cansched() centralized authorization routines, as they currently
do not take into account some desirable restrictions that are handled
by the centralized routines, as well as being inconsistent with other
similar authorization instances.
o Update libkvm to take these changes into account.
Obtained from: TrustedBSD Project
Reviewed by: green, bde, jhb, freebsd-arch, freebsd-audit
other "system" header files.
Also help the deprecation of lockmgr.h by making it a sub-include of
sys/lock.h and removing sys/lockmgr.h form kernel .c files.
Sort sys/*.h includes where possible in affected files.
OK'ed by: bde (with reservations)
linuxulator so as to allow privileged processes within a jail() to
invoke the Linux initgroups() system call. This allows the Linux
"su" to work properly (better) when running a complete Linux
environment under jail(). This problem was reported by Attila
Nagy <bra@fsn.hu>.
Reviewed by: marcel
mtx_enter(lock, type) becomes:
mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)
similarily, for releasing a lock, we now have:
mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN.
We change the caller interface for the two different types of locks
because the semantics are entirely different for each case, and this
makes it explicitly clear and, at the same time, it rids us of the
extra `type' argument.
The enter->lock and exit->unlock change has been made with the idea
that we're "locking data" and not "entering locked code" in mind.
Further, remove all additional "flags" previously passed to the
lock acquire/release routines with the exception of two:
MTX_QUIET and MTX_NOSWITCH
The functionality of these flags is preserved and they can be passed
to the lock/unlock routines by calling the corresponding wrappers:
mtx_{lock, unlock}_flags(lock, flag(s)) and
mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN
locks, respectively.
Re-inline some lock acq/rel code; in the sleep lock case, we only
inline the _obtain_lock()s in order to ensure that the inlined code
fits into a cache line. In the spin lock case, we inline recursion and
actually only perform a function call if we need to spin. This change
has been made with the idea that we generally tend to avoid spin locks
and that also the spin locks that we do have and are heavily used
(i.e. sched_lock) do recurse, and therefore in an effort to reduce
function call overhead for some architectures (such as alpha), we
inline recursion for this case.
Create a new malloc type for the witness code and retire from using
the M_DEV type. The new type is called M_WITNESS and is only declared
if WITNESS is enabled.
Begin cleaning up some machdep/mutex.h code - specifically updated the
"optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN
and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently
need those.
Finally, caught up to the interface changes in all sys code.
Contributors: jake, jhb, jasone (in no particular order)