not behaving correctly. Fix by attaching to the correct socket.
Also call so{rw}wakeup in addition to the fifo wakeup, so that any
kqfilters attached to the socket buffer get poked.
Only tun0 -> tun32767 may now be opened as struct ifnet's if_unit
is a short.
It's now possible to open /dev/tun and get a handle back for an available
tun device (use devname to find out what you got).
The implementation uses rman by popular demand (and against my judgement)
to track opened devices and uses the new dev_depends() to ensure that
all make_dev()d devices go away before the module is unloaded.
Reviewed by: phk
dev_t. The dev_depends(dev_t, dev_t) function is for tying them
to each other.
When destroy_dev() is called on a dev_t, all dev_t's depending
on it will also be destroyed (depth first order).
Rewrite the make_dev_alias() to use this dependency facility.
kern/subr_disk.c:
Make the disk mini-layer use dependencies to make sure all
relevant dev_t's are removed when the disk disappears.
Make the disk mini-layer precreate some magic sub devices
which the disk/slice/label code expects to be there.
kern/subr_disklabel.c:
Remove some now unneeded variables.
kern/subr_diskmbr.c:
Remove some ancient, commented out code.
kern/subr_diskslice.c:
Minor cleanup. Use name from dev_t instead of dsname()
real uid, saved uid, real gid, and saved gid to ucred, as well as the
pcred->pc_uidinfo, which was associated with the real uid, only rename
it to cr_ruidinfo so as not to conflict with cr_uidinfo, which
corresponds to the effective uid.
o Remove p_cred from struct proc; add p_ucred to struct proc, replacing
original macro that pointed.
p->p_ucred to p->p_cred->pc_ucred.
o Universally update code so that it makes use of ucred instead of pcred,
p->p_ucred instead of p->p_pcred, cr_ruidinfo instead of p_uidinfo,
cr_{r,sv}{u,g}id instead of p_*, etc.
o Remove pcred0 and its initialization from init_main.c; initialize
cr_ruidinfo there.
o Restruction many credential modification chunks to always crdup while
we figure out locking and optimizations; generally speaking, this
means moving to a structure like this:
newcred = crdup(oldcred);
...
p->p_ucred = newcred;
crfree(oldcred);
It's not race-free, but better than nothing. There are also races
in sys_process.c, all inter-process authorization, fork, exec, and
exit.
o Remove sigio->sio_ruid since sigio->sio_ucred now contains the ruid;
remove comments indicating that the old arrangement was a problem.
o Restructure exec1() a little to use newcred/oldcred arrangement, and
use improved uid management primitives.
o Clean up exit1() so as to do less work in credential cleanup due to
pcred removal.
o Clean up fork1() so as to do less work in credential cleanup and
allocation.
o Clean up ktrcanset() to take into account changes, and move to using
suser_xxx() instead of performing a direct uid==0 comparision.
o Improve commenting in various kern_prot.c credential modification
calls to better document current behavior. In a couple of places,
current behavior is a little questionable and we need to check
POSIX.1 to make sure it's "right". More commenting work still
remains to be done.
o Update credential management calls, such as crfree(), to take into
account new ruidinfo reference.
o Modify or add the following uid and gid helper routines:
change_euid()
change_egid()
change_ruid()
change_rgid()
change_svuid()
change_svgid()
In each case, the call now acts on a credential not a process, and as
such no longer requires more complicated process locking/etc. They
now assume the caller will do any necessary allocation of an
exclusive credential reference. Each is commented to document its
reference requirements.
o CANSIGIO() is simplified to require only credentials, not processes
and pcreds.
o Remove lots of (p_pcred==NULL) checks.
o Add an XXX to authorization code in nfs_lock.c, since it's
questionable, and needs to be considered carefully.
o Simplify posix4 authorization code to require only credentials, not
processes and pcreds. Note that this authorization, as well as
CANSIGIO(), needs to be updated to use the p_cansignal() and
p_cansched() centralized authorization routines, as they currently
do not take into account some desirable restrictions that are handled
by the centralized routines, as well as being inconsistent with other
similar authorization instances.
o Update libkvm to take these changes into account.
Obtained from: TrustedBSD Project
Reviewed by: green, bde, jhb, freebsd-arch, freebsd-audit
This fixes a number of warnings relating to removed cloned devices.
It also makes it possible to recreate deleted devices with
mknod(2). The major/minor arguments are ignored.
systems were repo-copied from sys/miscfs to sys/fs.
- Renamed the following file systems and their modules:
fdesc -> fdescfs, portal -> portalfs, union -> unionfs.
- Renamed corresponding kernel options:
FDESC -> FDESCFS, PORTAL -> PORTALFS, UNION -> UNIONFS.
- Install header files for the above file systems.
- Removed bogus -I${.CURDIR}/../../sys CFLAGS from userland
Makefiles.
vm_mtx does not recurse and is required for most low level
vm operations.
faults can not be taken without holding Giant.
Memory subsystems can now call the base page allocators safely.
Almost all atomic ops were removed as they are covered under the
vm mutex.
Alpha and ia64 now need to catch up to i386's trap handlers.
FFS and NFS have been tested, other filesystems will need minor
changes (grabbing the vm lock when twiddling page properties).
Reviewed (partially) by: jake, jhb
vn_start_write() on the given vnode will be successful. VOP_LEASE() may
help to solve this problem, but its return value ignored nearly everywhere.
For now just assume that the missing upper layer on write means insufficient
access rights (which is correct for most cases).
the number of references on the filesystem root vnode to be both
expected and released. Many filesystems hold an extra reference on
the filesystem root vnode, which must be accounted for when
determining if the filesystem is busy and then released if it isn't
busy. The old `skipvp' approach required individual filesystem
xxx_unmount functions to re-implement much of vflush()'s logic to
deal with the root vnode.
All 9 filesystems that hold an extra reference on the root vnode
got the logic wrong in the case of forced unmounts, so `umount -f'
would always fail if there were any extra root vnode references.
Fix this issue centrally in vflush(), now that we can.
This commit also fixes a vnode reference leak in devfs, which could
result in idle devfs filesystems that refuse to unmount.
Reviewed by: phk, bp
If for some reason DEVFS is undesired, the "NODEVFS" option is
needed now.
Pending any significant issues, DEVFS will be made mandatory in
-current on july 1st so that we can start reaping the full
benefits of having it.
other "system" header files.
Also help the deprecation of lockmgr.h by making it a sub-include of
sys/lock.h and removing sys/lockmgr.h form kernel .c files.
Sort sys/*.h includes where possible in affected files.
OK'ed by: bde (with reservations)
not to mention a compile-time warning about the critical function
becoming unused, by replacing spec_bmap() with vop_stdbmap().
ntfs seems to have the same bug.
The factor for converting specfs block numbers to physical block
numbers is 1, but vop_stdbmap() uses the bogus factor
btodb(ap->a_vp->v_mount->mnt_stat.f_iosize), which is 16 for ffs with
the default block size of 8K. This factor is bogus even for vop_stdbmap()
-- the correct factor is related to the filesystem blocksize which is not
necessarily the same to the optimal i/o size. vop_stdbmap() was apparently
cloned from nfs where these sizes happen to be the same.
There may also be a problem with a_vp->v_mount being null. spec_bmap()
still checks for this, but I think the checks in specfs are dead code
which used to support block devices.
Make 7 filesystems which don't really know about VOP_BMAP rely
on the default vector, rather than more or less complete local
vop_nopbmap() implementations.
to struct mount.
This makes the "struct netexport *" paramter to the vfs_export
and vfs_checkexport interface unneeded.
Consequently that all non-stacking filesystems can use
vfs_stdcheckexp().
At the same time, make it a pointer to a struct netexport
in struct mount, so that we can remove the bogus AF_MAX
and #include <net/radix.h> from <sys/mount.h>
VOP_BWRITE() was a hack which made it possible for NFS client
side to use struct buf with non-bio backing.
This patch takes a more general approach and adds a bp->b_op
vector where more methods can be added.
The success of this patch depends on bp->b_op being initialized
all relevant places for some value of "relevant" which is not
easy to determine. For now the buffers have grown a b_magic
element which will make such issues a tiny bit easier to debug.
It's not finished yet (I still have to find a way to implement process-
dependent nodes without consuming too much memory, and the permission
system needs tightening up), but it's becoming hard to work on without
a repo (I've accidentally almost nuked it once already), and it works
(except for the lack of process-dependent nodes, that is).
I was supposed to commit this a week ago, but timed out waiting for jkh
to reply to some questions I had. Pass him a spoonful of bad karma :)
introduce a new argument, "namespace", rather than relying on a first-
character namespace indicator. This is in line with more recent
thinking on EA interfaces on various mailing lists, including the
posix1e, Linux acl-devel, and trustedbsd-discuss forums. Two namespaces
are defined by default, EXTATTR_NAMESPACE_SYSTEM and
EXTATTR_NAMESPACE_USER, where the primary distinction lies in the
access control model: user EAs are accessible based on the normal
MAC and DAC file/directory protections, and system attributes are
limited to kernel-originated or appropriately privileged userland
requests.
o These API changes occur at several levels: the namespace argument is
introduced in the extattr_{get,set}_file() system call interfaces,
at the vnode operation level in the vop_{get,set}extattr() interfaces,
and in the UFS extended attribute implementation. Changes are also
introduced in the VFS extattrctl() interface (system call, VFS,
and UFS implementation), where the arguments are modified to include
a namespace field, as well as modified to advoid direct access to
userspace variables from below the VFS layer (in the style of recent
changes to mount by adrian@FreeBSD.org). This required some cleanup
and bug fixing regarding VFS locks and the VFS interface, as a vnode
pointer may now be optionally submitted to the VFS_EXTATTRCTL()
call. Updated documentation for the VFS interface will be committed
shortly.
o In the near future, the auto-starting feature will be updated to
search two sub-directories to the ".attribute" directory in appropriate
file systems: "user" and "system" to locate attributes intended for
those namespaces, as the single filename is no longer sufficient
to indicate what namespace the attribute is intended for. Until this
is committed, all attributes auto-started by UFS will be placed in
the EXTATTR_NAMESPACE_SYSTEM namespace.
o The default POSIX.1e attribute names for ACLs and Capabilities have
been updated to no longer include the '$' in their filename. As such,
if you're using these features, you'll need to rename the attribute
backing files to the same names without '$' symbols in front.
o Note that these changes will require changes in userland, which will
be committed shortly. These include modifications to the extended
attribute utilities, as well as to libutil for new namespace
string conversion routines. Once the matching userland changes are
committed, a buildworld is recommended to update all the necessary
include files and verify that the kernel and userland environments
are in sync. Note: If you do not use extended attributes (most people
won't), upgrading is not imperative although since the system call
API has changed, the new userland extended attribute code will no longer
compile with old include files.
o Couple of minor cleanups while I'm there: make more code compilation
conditional on FFS_EXTATTR, which should recover a bit of space on
kernels running without EA's, as well as update copyright dates.
Obtained from: TrustedBSD Project
structure rather than assuming that the device vnode would reside
in the FFS filesystem (which is obviously a broken assumption with
the device filesystem).
proctree lock and the process lock are held when updating p_pptr and
p_oppid. When we are just reaading p_pptr we only need the proc lock and
not a proctree lock as well.
user space. It has already been copied in and mp->mnt_stat.f_mntonname has
already been initialised by the caller.
This fixes a panic on the alpha caused by the fact that the variable
'size' wasn't initialised because the call to copyinstr() bailed out with
an EFAULT error.
An initial tidyup of the mount() syscall and VFS mount code.
This code replaces the earlier work done by jlemon in an attempt to
make linux_mount() work.
* the guts of the mount work has been moved into vfs_mount().
* move `type', `path' and `flags' from being userland variables into being
kernel variables in vfs_mount(). `data' remains a pointer into
userspace.
* Attempt to verify the `type' and `path' strings passed to vfs_mount()
aren't too long.
* rework mount() and linux_mount() to take the userland parameters
(besides data, as mentioned) and pass kernel variables to vfs_mount().
(linux_mount() already did this, I've just tidied it up a little more.)
* remove the copyin*() stuff for `path'. `data' still requires copyin*()
since its a pointer into userland.
* set `mount->mnt_statf_mntonname' in vfs_mount() rather than in each
filesystem. This variable is generally initialised with `path', and
each filesystem can override it if they want to.
* NOTE: f_mntonname is intiailised with "/" in the case of a root mount.
credential structure, ucred (cr->cr_prison).
o Allow jail inheritence to be a function of credential inheritence.
o Abstract prison structure reference counting behind pr_hold() and
pr_free(), invoked by the similarly named credential reference
management functions, removing this code from per-ABI fork/exit code.
o Modify various jail() functions to use struct ucred arguments instead
of struct proc arguments.
o Introduce jailed() function to determine if a credential is jailed,
rather than directly checking pointers all over the place.
o Convert PRISON_CHECK() macro to prison_check() function.
o Move jail() function prototypes to jail.h.
o Emulate the P_JAILED flag in fill_kinfo_proc() and no longer set the
flag in the process flags field itself.
o Eliminate that "const" qualifier from suser/p_can/etc to reflect
mutex use.
Notes:
o Some further cleanup of the linux/jail code is still required.
o It's now possible to consider resolving some of the process vs
credential based permission checking confusion in the socket code.
o Mutex protection of struct prison is still not present, and is
required to protect the reference count plus some fields in the
structure.
Reviewed by: freebsd-arch
Obtained from: TrustedBSD Project
run-time. This is temporary solution until proper kernel Unicode interfaces
are in place and as such was purposely designed to be as tiny as possible
(3 lines of the code not counting comments). The port with conversion routines
for the most popular single-byte languages will be added later today
Reviewed by: bp, "Michael C . Wu" <keichii@iteration.net>
Approved by: bp
mtx_enter(lock, type) becomes:
mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)
similarily, for releasing a lock, we now have:
mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN.
We change the caller interface for the two different types of locks
because the semantics are entirely different for each case, and this
makes it explicitly clear and, at the same time, it rids us of the
extra `type' argument.
The enter->lock and exit->unlock change has been made with the idea
that we're "locking data" and not "entering locked code" in mind.
Further, remove all additional "flags" previously passed to the
lock acquire/release routines with the exception of two:
MTX_QUIET and MTX_NOSWITCH
The functionality of these flags is preserved and they can be passed
to the lock/unlock routines by calling the corresponding wrappers:
mtx_{lock, unlock}_flags(lock, flag(s)) and
mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN
locks, respectively.
Re-inline some lock acq/rel code; in the sleep lock case, we only
inline the _obtain_lock()s in order to ensure that the inlined code
fits into a cache line. In the spin lock case, we inline recursion and
actually only perform a function call if we need to spin. This change
has been made with the idea that we generally tend to avoid spin locks
and that also the spin locks that we do have and are heavily used
(i.e. sched_lock) do recurse, and therefore in an effort to reduce
function call overhead for some architectures (such as alpha), we
inline recursion for this case.
Create a new malloc type for the witness code and retire from using
the M_DEV type. The new type is called M_WITNESS and is only declared
if WITNESS is enabled.
Begin cleaning up some machdep/mutex.h code - specifically updated the
"optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN
and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently
need those.
Finally, caught up to the interface changes in all sys code.
Contributors: jake, jhb, jasone (in no particular order)
fsyncs, which typically occur during unmounting, will drain all dirty
buffers even if it takes multiple passes to do so. The guarentee was
mangled by the last patch which solved a problem due to -current disabling
interrupts while holding giant (which caused an infinite spin loop waiting for
I/O to complete). -stable does not have either patch, but has a similar
bug in the original spec_fsync() code which is triggered by a bug in the
softupdates umount code, a fix for which will be committed to -current
as soon as Kirk stamps it. Then both solutions will be MFC'd to -stable.
-stable currently suffers from a combination of the softupdates bug and
a small window of opportunity in the original spec_fsync() code, and -stable
also suffers from the spin-loop bug but since interrupts are enabled the
spin resolves itself in a few milliseconds.
in 4.2-REL which I ripped out in -stable and -current when implementing the
low-memory handling solution. However, maxlaunder turns out to be the saving
grace in certain very heavily loaded systems (e.g. newsreader box). The new
algorithm limits the number of pages laundered in the first pageout daemon
pass. If that is not sufficient then suceessive will be run without any
limit.
Write I/O is now pipelined using two sysctls, vfs.lorunningspace and
vfs.hirunningspace. This prevents excessive buffered writes in the
disk queues which cause long (multi-second) delays for reads. It leads
to more stable (less jerky) and generally faster I/O streaming to disk
by allowing required read ops (e.g. for indirect blocks and such) to occur
without interrupting the write stream, amoung other things.
NOTE: eventually, filesystem write I/O pipelining needs to be done on a
per-device basis. At the moment it is globalized.
of explicit calls to lockmgr. Also provides macros for the flags
pased to specify shared, exclusive or release which map to the
lockmgr flags. This is so that the use of lockmgr can be easily
replaced with optimized reader-writer locks.
- Add some locking that I missed the first time.
Don't check for a null pointer if malloc called with M_WAITOK.
Submitted by: josh@zipperup.org
Submitted by: Robert Drehmel <robd@gmx.net>
Approved by: bp
<sys/proc.h> to <sys/systm.h>.
Correctly document the #includes needed in the manpage.
Add one now needed #include of <sys/systm.h>.
Remove the consequent 48 unused #includes of <sys/proc.h>.
the offending inline function (BUF_KERNPROC) on it being #included
already.
I'm not sure BUF_KERNPROC() is even the right thing to do or in the
right place or implemented the right way (inline vs normal function).
Remove consequently unneeded #includes of <sys/proc.h>
file types to requiring all file types to properly implement fo_stat.
This makes any new file type additions much easier as this code no
longer has to be modified to accomodate it.
o Instead of using curproc in fdesc_allocvp, pass a `struct proc' pointer as
a new fifth parameter.
cheap to setup that it doesn't really matter that we recycle device
vnodes at kleenex speed.
Implement first cut try at killing cloned devices when they are
not needed anymore. For now only the bpf driver is involved in
this experiment. Cloned devices can set the SI_CHEAPCLONE flag
which allows us to destroy_dev() it when the vcount() drops to zero
and the vnode is reclaimed. For now it's a requirement that the
driver doesn't keep persistent state from close to (re)open.
Some whitespace changes.
-- don't depend on garbage in <sys/mount.h>. mbufs aren't actually
used here either. They should have been completely removed from filesystem
interfaces when they were removed from the interfaces to convert between
file handles and vnodes.
Add lockdestroy() and appropriate invocations, which corresponds to
lockinit() and must be called to clean up after a lockmgr lock is no
longer needed.
Replace shared lock on vnode with exclusive one. It shouldn't impact
perfomance as NCP protocol doesn't support outstanding requests.
Do not hold simple lock on vnode for long period of time.
Add functionality to the nwfs_print() routine.
Add correct support for v_object management, so mmap() operation should
work properly.
Add support for extattrctl() routine (submitted by semenu).
At this point nullfs can be considered as functional and much more stable.
In fact, it should behave as a "hard" "symlink" to underlying filesystem.
Reviewed in general by: mckusick, dillon
Parts of logic obtained from: NetBSD
include:
* Mutual exclusion is used instead of spl*(). See mutex(9). (Note: The
alpha port is still in transition and currently uses both.)
* Per-CPU idle processes.
* Interrupts are run in their own separate kernel threads and can be
preempted (i386 only).
Partially contributed by: BSDi (BSD/OS)
Submissions by (at least): cp, dfr, dillon, grog, jake, jhb, sheldonh
to recycle inodes after a destroy_dev() but not until all mounts
have picked up the change.
Add support for an overflow table for DEVFS inodes. The static
table defaults to 1024 inodes, if that fills, an overflow table
of 32k inodes is allocated. Both numbers can be changed at
compile time, the size of the overflow table also with the
sysctl vfs.devfs.noverflow.
Use atomic instructions to barrier between make_dev()/destroy_dev()
and the mounts.
Add lockmgr() locking of directories for operations accessing or
modifying the directory TAILQs.
Various nitpicking here and there.
at this point):
Replace all '#ifdef DEBUG' with '#ifdef NULLFS_DEBUG' and add NULLFSDEBUG
macro.
Protect nullfs hash table with lockmgr.
Use proper order of operations when freeing mnt_data.
Return correct fsid in the null_getattr().
Add null_open() function to catch MNT_NODEV (obtained from NetBSD).
Add null_rename() to catch cross-fs rename operations (submitted by
Ustimenko Semen <semen@iclub.nsu.ru>)
Remove duplicate $FreeBSD$ tags.
cloning infrastructure standard in kern_conf. Modules are now
the same with or without devfs support.
If you need to detect if devfs is present, in modules or elsewhere,
check the integer variable "devfs_present".
This happily removes an ugly hack from kern/vfs_conf.c.
This forces a rename of the eventhandler and the standard clone
helper function.
Include <sys/eventhandler.h> in <sys/conf.h>: it's a helper #include
like <sys/queue.h>
Remove all #includes of opt_devfs.h they no longer matter.
rather than implementing its own {uid,gid,other} checks against vnode
mode. Similar change to linprocfs currently under review.
Obtained from: TrustedBSD Project
int p_can(p1, p2, operation, privused)
which allows specification of subject process, object process,
inter-process operation, and an optional call-by-reference privused
flag, allowing the caller to determine if privilege was required
for the call to succeed. This allows jail, kern.ps_showallprocs and
regular credential-based interaction checks to occur in one block of
code. Possible operations are P_CAN_SEE, P_CAN_SCHED, P_CAN_KILL,
and P_CAN_DEBUG. p_can currently breaks out as a wrapper to a
series of static function checks in kern_prot, which should not
be invoked directly.
o Commented out capabilities entries are included for some checks.
o Update most inter-process authorization to make use of p_can() instead
of manual checks, PRISON_CHECK(), P_TRESPASS(), and
kern.ps_showallprocs.
o Modify suser{,_xxx} to use const arguments, as it no longer modifies
process flags due to the disabling of ASU.
o Modify some checks/errors in procfs so that ENOENT is returned instead
of ESRCH, further improving concealment of processes that should not
be visible to other processes. Also introduce new access checks to
improve hiding of processes for procfs_lookup(), procfs_getattr(),
procfs_readdir(). Correct a bug reported by bp concerning not
handling the CREATE case in procfs_lookup(). Remove volatile flag in
procfs that caused apparently spurious qualifier warnigns (approved by
bde).
o Add comment noting that ktrace() has not been updated, as its access
control checks are different from ptrace(), whereas they should
probably be the same. Further discussion should happen on this topic.
Reviewed by: bde, green, phk, freebsd-security, others
Approved by: bde
Obtained from: TrustedBSD Project
object before falling back on privilege. Make vaccess() accept an
additional optional argument, privused, to determine whether
privilege was required for vaccess() to return 0. Add commented
out capability checks for reference. Rename some variables to make
it more clear which modes/uids/etc are associated with the object,
and which with the access mode.
o Update file system use of vaccess() to pass NULL as the optional
privused argument. Once additional patches are applied, suser()
will no longer set ASU, so privused will permit passing of
privilege information up the stack to the caller.
Reviewed by: bde, green, phk, -security, others
Obtained from: TrustedBSD Project
longs larger than 32 bits or strict alignment requirements.
pm_fatmask had type u_long, but it must have a type that has precisely
32 bits and this type must be no smaller than int, so that ~pmp->pm_fatmask
has no bits above the 31st set. Otherwise, comparisons between (cn
| ~pmp->pm_fatmask) and magic 32-bit "cluster" numbers always fail.
The correct fix is to use the C99 type uint_least32_t and mask with
0xffffffff. The quick fix is to use u_int32_t and assume that ints
have
msdosfs metadata is riddled with unaligned fields, and on alphas,
unaligned_fixup() apparently has problems fixing up the unaligned
accesses caused by this. The quick fix is to not comment out the
NetBSD code that sort of handles this, and define UNALIGNED_ACCESS on
i386's so that the code doesn't change on i386's. The correct fix
would define UNALIGNED_ACCESS in a central machine-dependent header
and maybe add some extra cases to unaligned_fixup(). UNALIGNED_ACCESS
is also tested in isofs.
Submitted by: parts by Mark Abene <phiber@radicalmedia.com>
PR: 19086
Implement subdirs.
Build the full "devicename" for cloning functions.
Fix panic when deleted device goes away.
Collaps devfs_dir and devfs_dirent structures.
Add proper cloning to the /dev/fd* "device-"driver.
Fix a bug in make_dev_alias() handling which made aliases appear
multiple times.
Use devfs_clone to implement getdiskbyname()
Make specfs maintain the stat(2) timestamps per dev_t
Remove old DEVFS support fields from dev_t.
Make uid, gid & mode members of dev_t and set them in make_dev().
Use correct uid, gid & mode in make_dev in disk minilayer.
Add support for registering alias names for a dev_t using the
new function make_dev_alias(). These will show up as symlinks
in DEVFS.
Use makedev() rather than make_dev() for MFSs magic devices to prevent
DEVFS from noticing this abuse.
Add a field for DEVFS inode number in dev_t.
Add new DEVFS in fs/devfs.
Add devfs cloning to:
disk minilayer (ie: ad(4), sd(4), cd(4) etc etc)
md(4), tun(4), bpf(4), fd(4)
If DEVFS add -d flag to /sbin/inits args to make it mount devfs.
Add commented out DEVFS to GENERIC
with the new snapshot code.
Update addaliasu to correctly implement the semantics of the old
checkalias function. When a device vnode first comes into existence,
check to see if an anonymous vnode for the same device was created
at boot time by bdevvp(). If so, adopt the bdevvp vnode rather than
creating a new vnode for the device. This corrects a problem which
caused the kernel to panic when taking a snapshot of the root
filesystem.
Change the calling convention of vn_write_suspend_wait() to be the
same as vn_start_write().
Split out softdep_flushworklist() from softdep_flushfiles() so that
it can be used to clear the work queue when suspending filesystem
operations.
Access to buffers becomes recursive so that snapshots can recursively
traverse their indirect blocks using ffs_copyonwrite() when checking
for the need for copy on write when flushing one of their own indirect
blocks. This eliminates a deadlock between the syncer daemon and a
process taking a snapshot.
Ensure that softdep_process_worklist() can never block because of a
snapshot being taken. This eliminates a problem with buffer starvation.
Cleanup change in ffs_sync() which did not synchronously wait when
MNT_WAIT was specified. The result was an unclean filesystem panic
when doing forcible unmount with heavy filesystem I/O in progress.
Return a zero'ed block when reading a block that was not in use at
the time that a snapshot was taken. Normally, these blocks should
never be read. However, the readahead code will occationally read
them which can cause unexpected behavior.
Clean up the debugging code that ensures that no blocks be written
on a filesystem while it is suspended. Snapshots must explicitly
label the blocks that they are writing during the suspension so that
they do not cause a `write on suspended filesystem' panic.
Reorganize ffs_copyonwrite() to eliminate a deadlock and also to
prevent a race condition that would permit the same block to be
copied twice. This change eliminates an unexpected soft updates
inconsistency in fsck caused by the double allocation.
Use bqrelse rather than brelse for buffers that will be needed
soon again by the snapshot code. This improves snapshot performance.
the gating of system calls that cause modifications to the underlying
filesystem. The gating can be enabled by any filesystem that needs
to consistently suspend operations by adding the vop_stdgetwritemount
to their set of vnops. Once gating is enabled, the function
vfs_write_suspend stops all new write operations to a filesystem,
allows any filesystem modifying system calls already in progress
to complete, then sync's the filesystem to disk and returns. The
function vfs_write_resume allows the suspended write operations to
begin again. Gating is not added by default for all filesystems as
for SMP systems it adds two extra locks to such critical kernel
paths as the write system call. Thus, gating should only be added
as needed.
Details on the use and current status of snapshots in FFS can be
found in /sys/ufs/ffs/README.snapshot so for brevity and timelyness
is not included here. Unless and until you create a snapshot file,
these changes should have no effect on your system (famous last words).
Don't fake any file types, just set vap->va_type to IFTOVT(stb.st_mode).
If something does not report its mode, vap->va_type is set to VNON
accordingly.
using decimal major and minor numbers. "ls -l" reports
disk partitions using decimal major numbers and hex
minor numbers.
make specfs use decimal major numbers and hex minor numbers,
just like "ls -l"
problems when fetch(1) was passed `-o -'. The rationale of this change
is that applications attempting to change underlying vnodes for /dev/fd
nodes are improperly written and the use of this interface should not
ever have been encouraged. Proper alternatives are fchmod, fchown and
others.
PR: 18952
- Remove stale, unused fdescnode->fd_link structure member.
stderr nodes. More specific items of this patch:
o Removed support for symbolic links, and the need for
fdesc_readlink().
o Put all the code from fdesc_attr() into fdesc_getattr() and removed
fdesc_attr(). This also made it easier to properly give all nodes
unique inode numbers.
o The removal of all non-fd nodes allowed the removal of the fdesc_read(),
fdesc_write(), and fdesc_ioctl() nodes, since we no longer have nodes
that get special handling.
o Correct the component name validity-checking in fdesc_lookup(). It
previously detected the end of the string by checking for a terminating
NUL, now it uses cnp->cn_namelen.
o Handle kqueue files as FIFOs. This is probably the closest file type
to represent this type of file there is, and it is unfortunately not
very representative of a kqueue. Creation time is not supported by
kqueue, so ctime, mtime and atime are all set to the current time when
getattr() was called.
o Also set st_[mca]time to the current time since there's no data in
socket structures that can be used to fill this in (FIFOs).
o Simplify fdesc_readdir() since it only has to report the numbered
fd nodes. Add `.' and `..' directory links as well.
o Remove read bits from directories as they tend to confuse programs
like tar(1).
Reviewed by: phk
Discussed with: bde (earlier on, not quite review)
<sys/bio.h>.
<sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall
not be made a nested include according to bdes teachings on the
subject of nested includes.
Diskdrivers and similar stuff below specfs::strategy() should no
longer need to include <sys/buf.> unless they need caching of data.
Still a few bogus uses of struct buf to track down.
Repocopy by: peter
ioccom.h defines only implementation detail, and should therefore
only be included from the #include which defines the ioctl tags,
in other words: never include it from *.c
There's no excuse to have code in synthetic filestores that allows direct
references to the textvp anymore.
Feature requested by: msmith
Feature agreed to by: warner
Move requested by: phk
Move agreed to by: bde
Exceptions:
Vinum untouched. This means that it cannot be compiled.
Greg Lehey is on the case.
CCD not converted yet, casts to struct buf (still safe)
atapi-cd casts to struct buf to examine B_PHYS
(name, value) pairs to be associated with inodes. This support is
used for ACLs, MAC labels, and Capabilities in the TrustedBSD
security extensions, which are currently under development.
In this implementation, attributes are backed to data vnodes in the
style of the quota support in FFS. Support for FFS extended
attributes may be enabled using the FFS_EXTATTR kernel option
(disabled by default). Userland utilities and man pages will be
committed in the next batch. VFS interfaces and man pages have
been in the repo since 4.0-RELEASE and are unchanged.
o ufs/ufs/extattr.h: UFS-specific extattr defines
o ufs/ufs/ufs_extattr.c: bulk of support routines
o ufs/{ufs,ffs,mfs}/*.[ch]: hooks and extattr.h includes
o contrib/softupdates/ffs_softdep.c: extattr.h includes
o conf/options, conf/files, i386/conf/LINT: added FFS_EXTATTR
o coda/coda_vfsops.c: XXX required extattr.h due to ufsmount.h
(This should not be the case, and will be fixed in a future commit)
Currently attributes are not supported in MFS. This will be fixed.
Reviewed by: adrian, bp, freebsd-fs, other unthanked souls
Obtained from: TrustedBSD Project
(Much of this done by script)
Move B_ORDERED flag to b_ioflags and call it BIO_ORDERED.
Move b_pblkno and b_iodone_chain to struct bio while we transition, they
will be obsoleted once bio structs chain/stack.
Add bio_queue field for struct bio aware disksort.
Address a lot of stylistic issues brought up by bde.
fragmentation problem due to geteblk() reserving too much space for the
buffer and imposes a larger granularity (16K) on KVA reservations for
the buffer cache to avoid fragmentation issues. The buffer cache size
calculations have been redone to simplify them (fewer defines, better
comments, less chance of running out of KVA).
The geteblk() fix solves a performance problem that DG was able reproduce.
This patch does not completely fix the KVA fragmentation problems, but
it goes a long way
Mostly Reviewed by: bde and others
Approved by: jkh
substitute BUF_WRITE(foo) for VOP_BWRITE(foo->b_vp, foo)
substitute BUF_STRATEGY(foo) for VOP_STRATEGY(foo->b_vp, foo)
This patch is machine generated except for the ccd.c and buf.h parts.
field in struct buf: b_iocmd. The b_iocmd is enforced to have
exactly one bit set.
B_WRITE was bogusly defined as zero giving rise to obvious coding
mistakes.
Also eliminate the redundant struct buf flag B_CALL, it can just
as efficiently be done by comparing b_iodone to NULL.
Should you get a panic or drop into the debugger, complaining about
"b_iocmd", don't continue. It is likely to write on your disk
where it should have been reading.
This change is a step in the direction towards a stackable BIO capability.
A lot of this patch were machine generated (Thanks to style(9) compliance!)
Vinum users: Greg has not had time to test this yet, be careful.
Fix potential bug with directory reading.
Explicitly limit file size to 4GB (msdos can't handle larger files).
Slightly reorganize msdosfs_read() to reduce number of 'if's.
is an application space macro and the applications are supposed to be free
to use it as they please (but cannot). This is consistant with the other
BSD's who made this change quite some time ago. More commits to come.
the creation time for files to the uninitialized value:
vap->va_ctime = vap->va_ctime;
Changed to what was intended, assigning it to the modification time (thus
making all three values of access time, modification time and creation time
the same thing).
Reviewed by: grog
* lockstatus() and VOP_ISLOCKED() gets a new process argument and a new
return value: LK_EXCLOTHER, when the lock is held exclusively by another
process.
* The ASSERT_VOP_(UN)LOCKED family is extended to use what this gives them
* Extend the vnode_if.src format to allow more exact specification than
locked/unlocked.
This commit should not do any semantic changes unless you are using
DEBUG_VFS_LOCKS.
Discussed with: grog, mch, peter, phk
Reviewed by: peter
maps onto the upages. We used to use this extensively, particularly
for ps and gdb. Both of these have been "fixed". ps gets the p_stats
via eproc along with all the other stats, and gdb uses the regs, fpregs
etc files.
Once apon a time the UPAGES were mapped here, but that changed back
in January '96. This essentially kills my revisions 1.16 and 1.17.
The 2-page "hole" above the stack can be reclaimed now.
1. ntfs_read*attr*() functions now accept
uio structure to eliminate one data copying.
2. found and removed deadlock caused
by 6 concurent ls -lR.
3. started implementation of nromal
Unicode<->unix recodeing.
Obtained from: NetBSD
drops the counting in bwrite and puts it all in spec_strategy.
I did some tests and verified that the counts collected for writes
in spec_strategy is identical to the counts that we previously
collected in bwrite. We now also get read counts (async reads
come from requests for read-ahead blocks). Note that you need
to compile a new version of mount to get the read counts printed
out. The old mount binary is completely compatible, the only
reason to install a new mount is to get the read counts printed.
Submitted by: Craig A Soules <soules+@andrew.cmu.edu>
Reviewed by: Kirk McKusick <mckusick@mckusick.com>
p_trespass(struct proc *p1, struct proc *p2)
which returns zero or an errno depending on the legality of p1 trespassing
on p2.
Replace kern_sig.c:CANSIGNAL() with call to p_trespass() and one
extra signal related check.
Replace procfs.h:CHECKIO() macros with calls to p_trespass().
Only show command lines to process which can trespass on the target
process.
Note: Previous commit to these files (except coda_vnops and devfs_vnops)
that claimed to remove WILLRELE from VOP_RENAME actually removed it from
VOP_MKNOD.
Correctly lock vnodes when calling VOP_OPEN() from filesystem mount code.
Unify spec_open() for bdev and cdev cases.
Remove the disabled bdev specific read/write code.
file object. Also explain some possible directions to re-implement it --
I'm not sure it should be, given the minimal application use. (Other
than having the debugger automatically access the symbols for a process,
the main use I'd found was with some minor accounting ability, but _that_
depends on it being in the filesystem space; an ioctl access method would
be useless in that case.)
This is a code-less change; only a comment has been added.
continue doing it despite objections by me (the principal author).
Note that this doesn't fix the real problem -- the real problem is generally
bad setup by ignorant users, and education is the right way to fix it.
So while this doesn't actually solve the prolem mentioned in the complaint
(since it's still possible to do it via other methods, although they mostly
involve a bit more complicity), and there are better methods to do this,
nobody was willing or able to provide me with a real world example that
couldn't be worked around using the existing permissions and group
mechanism. And therefore, security by removing features is the method of
the day.
I only had three applications that used it, in any event. One of them would
have made debugging easier, but I still haven't finished it, and won't
now, so it doesn't really matter.
Merge the contents (less some trivial bordering the silly comments)
of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>. This puts
the #defines for the vm_inherit_t and vm_prot_t types next to their
typedefs.
This paves the road for the commit to follow shortly: change
useracc() to use VM_PROT_{READ|WRITE} rather than B_{READ|WRITE}
as argument.
to remove 'b'lock devices. The agreement is, essentially, that
block devices will be collapsed into character devices as a first
step (though I don't particularly agree), and raw device names 'rxxx'
will become simply 'xxx' in devfs in the second step (i.e. no 'rxxx'
names will exist). The renaming will not effect the original /dev
and the expectation is that devfs will eventually (but not immediately)
become the standard way to access devices in the system.
If it is determined that a reimplementation of block device access
characteristics is beneficial, a number of alternatives will
be possible that do not involve resurrecting the 'b'lock device class.
For example, an ioctl() that might be made on an open character device
descriptor or a generic buffered overlay device.
This commit removes the blockdev disablement sysctl which does not
apply to the solution that was reached.
This means that access to block devices nodes will act the
same as char device nodes for disk-like devices.
If you encounter problems after this, where programs accessing
disks directly fail to operate, please use the following command
to revert to previous behaviour:
sysctl -w vfs.bdev_buffered=1
And verify that this was indeed the cause of your trouble.
See the mail-archives of the arch@FreeBSD.org list for background.
two new functions spec_buf{read|write}.
Add sysctl vfs.bdev_buffered which defaults to 1 == true. This
sysctl can be used to experimentally turn buffered behaviour for
bdevs off. I should not be changed while any blockdevices are
open. Remove the misplaced sysctl vfs.enable_userblk_io.
No other changes in behaviour.
-----------------------------
The core of the signalling code has been rewritten to operate
on the new sigset_t. No methodological changes have been made.
Most references to a sigset_t object are through macros (see
signalvar.h) to create a level of abstraction and to provide
a basis for further improvements.
The NSIG constant has not been changed to reflect the maximum
number of signals possible. The reason is that it breaks
programs (especially shells) which assume that all signals
have a non-null name in sys_signame. See src/bin/sh/trap.c
for an example. Instead _SIG_MAXSIG has been introduced to
hold the maximum signal possible with the new sigset_t.
struct sigprop has been moved from signalvar.h to kern_sig.c
because a) it is only used there, and b) access must be done
though function sigprop(). The latter because the table doesn't
holds properties for all signals, but only for the first NSIG
signals.
signal.h has been reorganized to make reading easier and to
add the new and/or modified structures. The "old" structures
are moved to signalvar.h to prevent namespace polution.
Especially the coda filesystem suffers from the change, because
it contained lines like (p->p_sigmask == SIGIO), which is easy
to do for integral types, but not for compound types.
NOTE: kdump (and port linux_kdump) must be recompiled.
Thanks to Garrett Wollman and Daniel Eischen for pressing the
importance of changing sigreturn as well.
a lower layer to an upper layer. I'm not sure how necessary this is
for reading.
Fix bug in union_lookup() (note: there are probably still several bugs
in union_lookup()). This one set lerror as a side effect without
setting lowervp, causing copyup code further on down to crash on a null
lowervp pointer. Changed the side effect to use a temporary variable
instead.
fixed (many due to changing semantics in other parts of the kernel and not
the original author's fault), including one critical one: unionfs could
cause UFS corruption in the fronting store due to calling VOP_OPEN for
writing without turning on vmio for the UFS vnode.
Most of the bugs were related to semantics changes in VOP calls, lock
ordering problems (causing deadlocks), improper handling of a read-only
backing store (such as an NFS mount), improper referencing and locking
of vnodes, not using real struct locks for vnode locking, not using
recursive locks when accessing the fronting store, and things like that.
New functionality has been added: unionfs now has mmap() support, but
only partially tested, and rename has been enhanced considerably.
There are still some things that unionfs cannot do. You cannot
rename a directory without confusing unionfs, and there are issues
with softlinks, hardlinks, and special files. unionfs mostly doesn't
understand them (and never did).
There are probably still panic situations, but hopefully no where near
as many as before this commit.
The unionfs in this commit has been tested overlayed on /usr/src
(backing /usr/src being a read-only NFS mount, fronting /usr/src being
a local filesystem). kernel builds have been tested, buildworld is
undergoing testing. More testing is necessary.
have been there in the first place. A GENERIC kernel shrinks almost 1k.
Add a slightly different safetybelt under nostop for tty drivers.
Add some missing FreeBSD tags
fields in struct cdevsw:
d_stop moved to struct tty.
d_reset already unused.
d_devtotty linkage now provided by dev_t->si_tty.
These fields will be removed from struct cdevsw together with
d_params and d_maxio Real Soon Now.
The changes in this patch consist of:
initialize dev->si_tty in *_open()
initialize tty->t_stop
remove devtotty functions
rename ttpoll to ttypoll
a few adjustments to these changes in the generic code
a bump of __FreeBSD_version
add a couple of FreeBSD tags
d_maxio is replaced by the dev->si_iosize_max field which the driver
should be set in all calls to cdevsw->d_open if it has a better
idea than the system wide default.
The field is a generic dev_t field (ie: not disk specific) so that
tapes and other devices can use physio as well.
heuristic to detect sequential operation.
VM-related forced clustering code removed from ufs in preparation for a
commit to vm/vm_fault.c that does it more generally.
Reviewed by: David Greenman <dg@root.com>, Alan Cox <alc@cs.rice.edu>
transfer size calculation was incorrect resulting in the last read being
potentially larger then the actual extent of the device.
EOF and write handling has not yet been fixed.
Reviewed by: Tor.Egge@fast.no
Rename dev->si_bsize_max to si_iosize_max and set it in spec_open
if the device didn't.
Set vp->v_maxio from dev->si_bsize_max in spec_open rather than
in ufs_bmap.c
to buffered block devices are allowed. The default is to be backwards
compatible, i.e. reads and writes are allowed.
The idea is for a larger crowd to start running with this disabled and
see what problems, if any, crop up, and then to change the default to
off and see if any problems crop up in the next 6 months prior to
potentially removing support entirely. There are still a few people,
Julian and myself included, who believe the buffered block device
access from usermode to be useful.
Remove use of vnode->v_lastr from buffered block device I/O in
preparation for removal of vnode->v_lastr field, replacing it with
the already existing seqcount metric to detect sequential operation.
Reviewed by: Alan Cox <alc@cs.rice.edu>, David Greenman <dg@root.com>
reasonable defaults.
This avoids confusing and ugly casting to eopnotsupp or making dummy functions.
Bogus casting of filesystem sysctls to eopnotsupp() have been removed.
This should make *_vfsops.c more readable and reduce bloat.
Reviewed by: msmith, eivind
Approved by: phk
Tested by: Jeroen Ruigrok/Asmodai <asmodai@wxs.nl>
the other XXXFS_DIAGNOSTIC options (not very) and mostly controlled
tracing of normal operation. Use `#ifdef DEBUG' for non-diagnostics
and `#ifdef DIAGNOSTIC' for diagnostics.
a quick think and discussion among various people some form of some of
these changes will probably be recommitted.
The reversion requested was requested by dg while discussions proceed.
PHK has indicated that he can live with this, and it has been agreed
that some form of some of these changes may return shortly after further
discussion.
the highly non-recommended option ALLOW_BDEV_ACCESS is used.
(bdev access is evil because you don't get write errors reported.)
Kill si_bsize_best before it kills Matt :-)
Use the specfs routines rather having cloned copies in devfs.
format errors exposed by this. It has nothing to do with diagnostics
since it does little more than control tracing of normal operation.
Actual diagnostics for the union file system are still controlled by
the DIAGNOSTIC option.
or DDB and fixed printf format errors exposed by this. The options had
little to do with diagnostics; they mostly controlled tracing of normal
operation.
Make the alias list a SLIST.
Drop the "fast recycling" optimization of vnodes (including
the returning of a prexisting but stale vnode from checkalias).
It doesn't buy us anything now that we don't hardlimit
vnodes anymore.
Rename checkalias2() and checkalias() to addalias() and
addaliasu() - which takes dev_t and udev_t arg respectively.
Make the revoke syscalls use vcount() instead of VALIASED.
Remove VALIASED flag, we don't need it now and it is faster
to traverse the much shorter lists than to maintain the
flag.
vfs_mountedon() can check the dev_t directly, all the vnodes
point to the same one.
Print the devicename in specfs/vprint().
Remove a couple of stale LFS vnode flags.
Remove unimplemented/unused LK_DRAINED;
Diskslice/label code not yet handled.
Vinum, i4b, alpha, pc98 not dealt with (left to respective Maintainers)
Add the correct hook for devfs to kern_conf.c
The net result of this excercise is that a lot less files depends on DEVFS,
and devtoname() gets more sensible output in many cases.
A few drivers had minor additional cleanups performed relating to cdevsw
registration.
A few drivers don't register a cdevsw{} anymore, but only use make_dev().
Introduce BUF_STRATEGY(struct buf *, int flag) macro, and use it throughout.
please see comment in sys/conf.h about the flag argument.
Remove strategy argument from all the diskslice/label/bad144
implementations, it should be found from the dev_t.
Remove bogus and unused strategy1 routines.
Remove open/close arguments from dssize(). Pick them up from dev_t.
Remove unused and unfinished setgeom support from diskslice/label/bad144 code.
operations. This allows a device driver better insight into
what is going on that the current:
proc1: open /dev/foo R/O
devsw->open( R/O, proc1 )
proc2: open /dev/foo R/W
devsw->open( R/W, proc2 )
proc2: close
/* nothing, but device is
really only R/O open */
proc1: close
devsw->close( R/O, proc1 )
- %q -> %ll; don't assume that the promotion of off_t is quad_t; only
assume that off_t's are representable as long longs.
- printing of dev_t's was completely broken.
Fixed nearby printf format errors not reported by gcc -Wformat on i386's:
- printing of ino_t's and pointers was sloppy.
Translated from: a similar fix in ufs_readwrite.c rev.1.61.
Don't forget to set DE_ACCESS for short reads.
Check for invalid (negative) offsets before checking for reads of
0 bytes, as in ufs, although checking for invalid offsets at all
is probably a bug.
vnodes referencing this device.
Details:
cdevsw->d_parms has been removed, the specinfo is available
now (== dev_t) and the driver should modify it directly
when applicable, and the only driver doing so, does so:
vn.c. I am not sure the logic in checking for "<" was right
before, and it looks even less so now.
An intial pool of 50 struct specinfo are depleted during
early boot, after that malloc had better work. It is
likely that fewer than 50 would do.
Hashing is done from udev_t to dev_t with a prime number
remainder hash, experiments show no better hash available
for decent cost (MD5 is only marginally better) The prime
number used should not be close to a power of two, we use
83 for now.
Add new checkalias2() to get around the loss of info from
dev2udev() in bdevvp();
The aliased vnodes are hung on a list straight of the dev_t,
and speclisth[SPECSZ] is unused. The sharing of struct
specinfo means that the v_specnext moves into the vnode
which grows by 4 bytes.
Don't use a VBLK dev_t which doesn't make sense in MFS, now
we hang a dummy cdevsw on B/Cmaj 253 so that things look sane.
Storage overhead from all of this is O(50k).
Bump __FreeBSD_version to 400009
The next step will add the stuff needed so device-drivers can start to
hang things from struct specinfo
lockmgr locks. This commit should be functionally equivalent to the old
semantics. That is, all buffer locking is done with LK_EXCLUSIVE
requests. Changes to take advantage of LK_SHARED and LK_RECURSIVE will
be done in future commits.
The cdevsw_add() function now finds the major number(s) in the
struct cdevsw passed to it. cdevsw_add_generic() is no longer
needed, cdevsw_add() does the same thing.
cdevsw_add() will print an message if the d_maj field looks bogus.
Remove nblkdev and nchrdev variables. Most places they were used
bogusly. Instead check a dev_t for validity by seeing if devsw()
or bdevsw() returns NULL.
Move bdevsw() and devsw() functions to kern/kern_conf.c
Bump __FreeBSD_version to 400006
This commit removes:
72 bogus makedev() calls
26 bogus SYSINIT functions
if_xe.c bogusly accessed cdevsw[], author/maintainer please fix.
I4b and vinum not changed. Patches emailed to authors. LINT
probably broken until they catch up.
Reformat and initialize correctly all "struct cdevsw".
Initialize the d_maj and d_bmaj fields.
The d_reset field was not removed, although it is never used.
I used a program to do most of this, so all the files now use the
same consistent format. Please keep it that way.
Vinum and i4b not modified, patches emailed to respective authors.
udev_t in the kernel but still called dev_t in userland.
Provide functions to manipulate both types:
major() umajor()
minor() uminor()
makedev() umakedev()
dev2udev() udev2dev()
For now they're functions, they will become in-line functions
after one of the next two steps in this process.
Return major/minor/makedev to macro-hood for userland.
Register a name in cdevsw[] for the "filedescriptor" driver.
In the kernel the udev_t appears in places where we have the
major/minor number combination, (ie: a potential device: we
may not have the driver nor the device), like in inodes, vattr,
cdevsw registration and so on, whereas the dev_t appears where
we carry around a reference to a actual device.
In the future the cdevsw and the aliased-from vnode will be hung
directly from the dev_t, along with up to two softc pointers for
the device driver and a few houskeeping bits. This will essentially
replace the current "alias" check code (same buck, bigger bang).
A little stunt has been provided to try to catch places where the
wrong type is being used (dev_t vs udev_t), if you see something
not working, #undef DEVT_FASCIST in kern/kern_conf.c and see if
it makes a difference. If it does, please try to track it down
(many hands make light work) or at least try to reproduce it
as simply as possible, and describe how to do that.
Without DEVT_FASCIST I belive this patch is a no-op.
Stylistic/posixoid comments about the userland view of the <sys/*.h>
files welcome now, from userland they now contain the end result.
Next planned step: make all dev_t's refer to the same devsw[] which
means convert BLK's to CHR's at the perimeter of the vnodes and
other places where they enter the game (bootdev, mknod, sysctl).
Made a new (inline) function devsw(dev_t dev) and substituted it.
Changed to the BDEV variant to this format as well: bdevsw(dev_t dev)
DEVFS will eventually benefit from this change too.
Virtualize bdevsw[] from cdevsw. bdevsw() is now an (inline)
function.
Join CDEV_MODULE and BDEV_MODULE to DEV_MODULE (please pay attention
to the order of the cmaj/bmaj arguments!)
Join CDEV_DRIVER_MODULE and BDEV_DRIVER_MODULE to DEV_DRIVER_MODULE
(ditto!)
(Next step will be to convert all bdev dev_t's to cdev dev_t's
before they get to do any damage^H^H^H^H^H^Hwork in the kernel.)
This is a seriously beefed up chroot kind of thing. The process
is jailed along the same lines as a chroot does it, but with
additional tough restrictions imposed on what the superuser can do.
For all I know, it is safe to hand over the root bit inside a
prison to the customer living in that prison, this is what
it was developed for in fact: "real virtual servers".
Each prison has an ip number associated with it, which all IP
communications will be coerced to use and each prison has its own
hostname.
Needless to say, you need more RAM this way, but the advantage is
that each customer can run their own particular version of apache
and not stomp on the toes of their neighbors.
It generally does what one would expect, but setting up a jail
still takes a little knowledge.
A few notes:
I have no scripts for setting up a jail, don't ask me for them.
The IP number should be an alias on one of the interfaces.
mount a /proc in each jail, it will make ps more useable.
/proc/<pid>/status tells the hostname of the prison for
jailed processes.
Quotas are only sensible if you have a mountpoint per prison.
There are no privisions for stopping resource-hogging.
Some "#ifdef INET" and similar may be missing (send patches!)
If somebody wants to take it from here and develop it into
more of a "virtual machine" they should be most welcome!
Tools, comments, patches & documentation most welcome.
Have fun...
Sponsored by: http://www.rndassociates.com/
Run for almost a year by: http://www.servetheweb.com/
1:
s/suser/suser_xxx/
2:
Add new function: suser(struct proc *), prototyped in <sys/proc.h>.
3:
s/suser_xxx(\([a-zA-Z0-9_]*\)->p_ucred, \&\1->p_acflag)/suser(\1)/
The remaining suser_xxx() calls will be scrutinized and dealt with
later.
There may be some unneeded #include <sys/cred.h>, but they are left
as an exercise for Bruce.
More changes to the suser() API will come along with the "jail" code.
before mounting (should help to do not mount extended partitions:-).
Fixed problem with hanging while unmounting busy fs.
And (the most important) added some locks to prevent
simulaneous access to kernel structures!
cannot yet be closed, though.
I hope I got all credits right, and that the multiple submitted by lines
do not break anyone's scripts...
PR: kern/5038, kern/5567
Submitted by: Keith Jang <keith@email.gcn.net.tw>
Submitted by: Joachim Kuebart <joki@kuebart.stuttgart.netsurf.de>
Submitted by: Byung Yang <byung@wam.umd.edu>
Submitted by: Motomichi Matsuzaki <mzaki@e-mail.ne.jp>
unallocated parts of the last page when the file ended on a frag
but not a page boundary.
Delimitted by tags PRE_MATT_MMAP_EOF and POST_MATT_MMAP_EOF,
in files alpha/alpha/pmap.c i386/i386/pmap.c nfs/nfs_bio.c vm/pmap.h
vm/vm_page.c vm/vm_page.h vm/vnode_pager.c miscfs/specfs/spec_vnops.c
ufs/ufs/ufs_readwrite.c kern/vfs_bio.c
Submitted by: Matt Dillon <dillon@freebsd.org>
Reviewed by: Alan Cox <alc@freebsd.org>
This can have bad security implications, but the impact on FreeBSD
systems is minimal because this fs isn't in the default kernels and it
is unknown if it even works.
Submitted by: Manuel Bouyer <bouyer@antioche.eu.org> and
Artur Grabowski <art@stacken.kth.se>
Add d_parms() to {c,b}devsw[]. If non-NULL this function points to
a device routine that will properly fill in the specinfo structure.
vfs_subr.c's checkalias() supplies appropriate defaults. This change
should be fully backwards compatible with existing devices.
is the preparation step for moving pmap storage out of vmspace proper.
Reviewed by: Alan Cox <alc@cs.rice.edu>
Matthew Dillion <dillon@apollo.backplane.com>
attempt to optimize forks but were essentially given-up on due to
problems and replaced with an explicit dup of the vm_map_entry structure.
Prior to the removal, they were entirely unused.
c_caddr_t with extreme prejudice. Here we want to convert from
`const char *' to `const char *'. Casting through c_caddr_t is
not the way to do this. The original cast to caddr_t was apparently
to break warnings about const mismatches in other versions of BSD
(in 4.4BSDLite2, the conversion is from `const char *path' to
plain caddr_t).
changes to the VM system to support the new swapper, VM bug
fixes, several VM optimizations, and some additional revamping of the
VM code. The specific bug fixes will be documented with additional
forced commits. This commit is somewhat rough in regards to code
cleanup issues.
Reviewed by: "John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>
lock, and add some macros and function parameters to make sure that
the information get to the point where it can be put in the lock
structure.
While I'm here, add DEBUG_VFS_LOCKS to LINT.
is enough to satisfy things like StarOffice. This is a hack, but doing
it properly would be a LOT of work, and would require extensive grovelling
around in the user address space to find the argv[].
Obtained from: Mostly from Andrzej Bialecki <abial@nask.pl>.
Noticed by: Carl Mascott <cmascott@world.std.com>
Don't write access time of a file more than once per day. (Its precision is
1 day anyway). Don't try to write access and creation time in nonwin95 case.
Suggested by: bde (long time ago).
for possible buffer overflow problems. Replaced most sprintf()'s
with snprintf(); for others cases, added terminating NUL bytes where
appropriate, replaced constants like "16" with sizeof(), etc.
These changes include several bug fixes, but most changes are for
maintainability's sake. Any instance where it wasn't "immediately
obvious" that a buffer overflow could not occur was made safer.
Reviewed by: Bruce Evans <bde@zeta.org.au>
Reviewed by: Matthew Dillon <dillon@apollo.backplane.com>
Reviewed by: Mike Spengler <mks@networkcs.com>
client programs are allowed to finish up (coda_call is
forced to complete) and release their locks. Thus there
is a reasonable chance that the vflush implicit in the
unmount will not get hung on held locks.
references to them.
The change a couple of days ago to ignore these numbers in statically
configured vfsconf structs was slightly premature because the cd9660,
cfs, devfs, ext2fs, nfs vfs's still used MOUNT_* instead of the number
in their vfsconf struct.
device drivers about sectors no longer in use.
Device-drivers receive the call through d_strategy, if they have
D_CANFREE in d_flags.
This allows flash based devices to erase the sectors and avoid
pointlessly carrying them around in compactions.
Reviewed by: Kirk Mckusick, bde
Sponsored by: M-Systems (www.m-sys.com)
respectively. Most of the longs should probably have been
u_longs, but this changes is just to prevent warnings about
casts between pointers and integers of different sizes, not
to fix poorly chosen types.
bits. We used a private, wrong, version of `struct dirent' to help
break getdirentries(), and we use a silly check that the size of this
struct is a power of 2 to help break mount() if getdirentries() would
not work. This fix just changes the struct to match `struct dirent'
(except for the name length).
There is only cdevsw (which should be renamed in a later edit to deventry
or something). cdevsw contains the union of what were in both bdevsw an
cdevsw entries. The bdevsw[] table stiff exists and is a second pointer
to the cdevsw entry of the device. it's major is in d_bmaj rather than
d_maj. some cleanup still to happen (e.g. dsopen now gets two pointers
to the same cdevsw struct instead of one to a bdevsw and one to a cdevsw).
rawread()/rawwrite() went away as part of this though it's not strictly
the same patch, just that it involves all the same lines in the drivers.
cdroms no longer have write() entries (they did have rawwrite (?)).
tapes no longer have support for bdev operations.
Reviewed by: Eivind Eklund and Mike Smith
Changes suggested by eivind.
as the value in b_vp is often not really what you want.
(and needs to be frobbed). more cleanups will follow this.
Reviewed by: Bruce Evans <bde@freebsd.org>
emulators. The emulators assume that filesystem may just ignore cookies, and
handle this case correctly. So we just ignore cookies.
Also sync *_readdir "prototypes" with reality.
Check args using the same expression as in fdesc and kernfs. The check
was actually already correct, modulo overflow. It could be tightened
up to either allow huge (aligned) offsets, treating them as EOF, or
disallow all offsets beyond EOF.
Didn't fix invalid address calculation &foo[i] where i may be out of
bounds.
Didn't fix shooting of foot using a private unportable dirent struct.
and missing arg checking.
Panic instead of returning bogus error codes or forgetting to check
all cases if fdesc_readdir() gets called for a non-directory. This
can't happen.
FreeBSD/alpha. The most significant item is to change the command
argument to ioctl functions from int to u_long. This change brings us
inline with various other BSD versions. Driver writers may like to
use (__FreeBSD_version == 300003) to detect this change.
The prototype FreeBSD/alpha machdep will follow in a couple of days
time.
unexpectedly do not complete writes even with sync I/O requests.
This should help the behavior of mmaped files when using
softupdates (and perhaps in other circumstances also.)
---------
Make callers of namei() responsible for releasing references or locks
instead of having the underlying filesystems do it. This eliminates
redundancy in all terminal filesystems and makes it possible for stacked
transport layers such as umapfs or nullfs to operate correctly.
Quality testing was done with testvn, and lat_fs from the lmbench suite.
Some NFS client testing courtesy of Patrik Kudo.
vop_mknod and vop_symlink still release the returned vpp. vop_rename
still releases 4 vnode arguments before it returns. These remaining cases
will be corrected in the next set of patches.
---------
Submitted by: Michael Hancock <michaelh@cet.co.jp>
Reverse the VFS_VRELE patch. Reference counting of vnodes does not need
to be done per-fs. I noticed this while fixing vfs layering violations.
Doing reference counting in generic code is also the preference cited by
John Heidemann in recent discussions with him.
The implementation of alternative vnode management per-fs is still a valid
requirement for some filesystems but will be revisited sometime later,
most likely using a different framework.
Submitted by: Michael Hancock <michaelh@cet.co.jp>
In msdosfs_sync: spelling fix, formatting changes; fix MNT_LAZY (sync
modified denodes, don't sync device)
Mostly submitted by (and with hints from): bde
Increase limit for maximum disk size: as far as I can see previous limit was
gratuitously too low.
deallocation cycles. This should provide a measurable improvement
on swap and memory allocation on loaded systems. It is unlikely a
complete solution. Also, provide more map info with procfs.
Chuck Cranor spurred on this improvement.
This code will be turned on with the TWO options
DEVFS and SLICE. (see LINT)
Two labels PRE_DEVFS_SLICE and POST_DEVFS_SLICE will deliniate these changes.
/dev will be automatically mounted by init (thanks phk)
on bootup. See /sys/dev/slice/slice.4 for more info.
All code should act the same without these options enabled.
Mike Smith, Poul Henning Kamp, Soeren, and a few dozen others
This code does not support the following:
bad144 handling.
Persistance. (My head is still hurting from the last time we discussed this)
ATAPI flopies are not handled by the SLICE code yet.
When this code is running, all major numbers are arbitrary and COULD
be dynamically assigned. (this is not done, for POLA only)
Minor numbers for disk slices ARE arbitray and dynamically assigned.
* Figure out UTC relative to boottime. Four new functions provide
time relative to boottime.
* move "runtime" into struct proc. This helps fix the calcru()
problem in SMP.
* kill mono_time.
* add timespec{add|sub|cmp} macros to time.h. (XXX: These may change!)
* nanosleep, select & poll takes long sleeps one day at a time
Reviewed by: bde
Tested by: ache and others
They are atomic, but return in essence what is in the "time" variable.
gettime() is now a macro front for getmicrotime().
Various patches to use the two new functions instead of the various
hacks used in their absence.
Some puntuation and grammer patches from Bruce.
A couple of XXX comments.
has been some bitrot and incorrect assumptions in the vfs_bio code. These
problems have manifest themselves worse on NFS type filesystems, but can
still affect local filesystems under certain circumstances. Most of
the problems have involved mmap consistancy, and as a side-effect broke
the vfs.ioopt code. This code might have been committed seperately, but
almost everything is interrelated.
1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that
are fully valid.
2) Rather than deactivating erroneously read initial (header) pages in
kern_exec, we now free them.
3) Fix the rundown of non-VMIO buffers that are in an inconsistent
(missing vp) state.
4) Fix the disassociation of pages from buffers in brelse. The previous
code had rotted and was faulty in a couple of important circumstances.
5) Remove a gratuitious buffer wakeup in vfs_vmio_release.
6) Remove a crufty and currently unused cluster mechanism for VBLK
files in vfs_bio_awrite. When the code is functional, I'll add back
a cleaner version.
7) The page busy count wakeups assocated with the buffer cache usage were
incorrectly cleaned up in a previous commit by me. Revert to the
original, correct version, but with a cleaner implementation.
8) The cluster read code now tries to keep data associated with buffers
more aggressively (without breaking the heuristics) when it is presumed
that the read data (buffers) will be soon needed.
9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The
delay loop waiting is not useful for filesystem locks, due to the
length of the time intervals.
10) Correct and clean-up spec_getpages.
11) Implement a fully functional nfs_getpages, nfs_putpages.
12) Fix nfs_write so that modifications are coherent with the NFS data on
the server disk (at least as well as NFS seems to allow.)
13) Properly support MS_INVALIDATE on NFS.
14) Properly pass down MS_INVALIDATE to lower levels of the VM code from
vm_map_clean.
15) Better support the notion of pages being busy but valid, so that
fewer in-transit waits occur. (use p->busy more for pageouts instead
of PG_BUSY.) Since the page is fully valid, it is still usable for
reads.
16) It is possible (in error) for cached pages to be busy. Make the
page allocation code handle that case correctly. (It should probably
be a printf or panic, but I want the system to handle coding errors
robustly. I'll probably add a printf.)
17) Correct the design and usage of vm_page_sleep. It didn't handle
consistancy problems very well, so make the design a little less
lofty. After vm_page_sleep, if it ever blocked, it is still important
to relookup the page (if the object generation count changed), and
verify it's status (always.)
18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up.
19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush.
20) Fix vm_pager_put_pages and it's descendents to support an int flag
instead of a boolean, so that we can pass down the invalidate bit.
a complement to all ops that return a vpp, VFS_VRELE. This is
initially only for file systems that implement the following ops
that do a WILLRELE:
vop_create, vop_whiteout, vop_mknod, vop_remove, vop_link,
vop_rename, vop_mkdir, vop_rmdir, vop_symlink
This is initial DNA that doesn't do anything yet. VFS_VRELE is
implemented but not called.
A default vfs_vrele was created for fs implementations that use the
standard vnode management routines.
VFS_VRELE implementations were made for the following file systems:
Standard (vfs_vrele)
ffs mfs nfs msdosfs devfs ext2fs
Custom
union umapfs
Just EOPNOTSUPP
fdesc procfs kernfs portal cd9660
These implementations may change as VOP changes are implemented.
In the next phase, in the vop implementations calls to vrele and the vrele
part of vput will be moved to the top layer vfs_vnops and made visible
to all layers. vput will be replaced by unlock in these cases. Unlocking
will still be done in the per fs layer but the refcount decrement will be
triggered at the top because it doesn't hurt to hold a vnode reference a
little longer. This will have minimal impact on the structure of the
existing code.
This will only be done for vnode arguments that are released by the various
fs vop implementations.
Wider use of VFS_VRELE will likely require restructuring of the code.
Reviewed by: phk, dyson, terry et. al.
Submitted by: Michael Hancock <michaelh@cet.co.jp>
|In the process of evaluating the getpages/putpages issues I discovered
|that mmap on MSDOSFS does not work. This is because I blindly merged
|NetBSD changes in msdosfs_bmap and msdosfs_strategy. Apparently, their
|blocksize is always DEV_BSIZE (even in files), while in FreeBSD
|blocksize in files is v_mount->mnt_stat.f_iosize (i.e. clustersize in
|MSDOSFS case). The patch is below.
Submitted by: Dmitrij Tejblum <dima@tejblum.dnttm.rssi.ru>
- 'mv longnamedfile1 longnamedfile2' would cause longnamedfile2 to lose its
long name.
- Long names have trailing spaces/dots stripped for lookup as well as
assignment.
- A lockup when the mdsosfs was accessed from within the Linux emulator is fixed.
- A bug whereby long filenames were recognised by Microsoft operating systems but
not FreeBSD is fixed.
Submitted by: Dmitrij Tejblum <dima@tejblum.dnttm.rssi.ru>
only win->unix part is implemented at this time with 256-byte
table defaulted to KOI8-R (will be loadable in future).
Since back mapping not supported yet, you'll get "No such file or directory"
on each Cyrillic name with 'ls -l', only 'echo *' work at this moment.
Teach current code to understand Unicode a bit.
FAT32 partitions. Unfortunately, we looked around here at
Walnut Creek CDROM for any newer FAT32-supporting versions
of Win95 and we were unsuccessful; only the older stuff here.
So this is untested beyond simply making sure it compiles and
someone with access to an actual FAT32 fs will have
to let us know how well it actually works.
Submitted by: Dmitrij Tejblum <dima@tejblum.dnttm.rssi.ru>
Obtained from: NetBSD
of the various ad-hoc schemes.
2) When bringing in UPAGES, the pmap code needs to do another vm_page_lookup.
3) When appropriate, set the PG_A or PG_M bits a-priori to both avoid some
processor errata, and to minimize redundant processor updating of page
tables.
4) Modify pmap_protect so that it can only remove permissions (as it
originally supported.) The additional capability is not needed.
5) Streamline read-only to read-write page mappings.
6) For pmap_copy_page, don't enable write mapping for source page.
7) Correct and clean-up pmap_incore.
8) Cluster initial kern_exec pagin.
9) Removal of some minor lint from kern_malloc.
10) Correct some ioopt code.
11) Remove some dead code from the MI swapout routine.
12) Correct vm_object_deallocate (to remove backing_object ref.)
13) Fix dead object handling, that had problems under heavy memory load.
14) Add minor vm_page_lookup improvements.
15) Some pages are not in objects, and make sure that the vm_page.c can
properly support such pages.
16) Add some more page deficit handling.
17) Some minor code readability improvements.
1) Start using TSM.
Struct procs continue to point to upages structure, after being freed.
Struct vmspace continues to point to pte object and kva space for kstack.
u_map is now superfluous.
2) vm_map's don't need to be reference counted. They always exist either
in the kernel or in a vmspace. The vmspaces are managed by reference
counts.
3) Remove the "wired" vm_map nonsense.
4) No need to keep a cache of kernel stack kva's.
5) Get rid of strange looking ++var, and change to var++.
6) Change more data structures to use our "zone" allocator. Added
struct proc, struct vmspace and struct vnode. This saves a significant
amount of kva space and physical memory. Additionally, this enables
TSM for the zone managed memory.
7) Keep ioopt disabled for now.
8) Remove the now bogus "single use" map concept.
9) Use generation counts or id's for data structures residing in TSM, where
it allows us to avoid unneeded restart overhead during traversals, where
blocking might occur.
10) Account better for memory deficits, so the pageout daemon will be able
to make enough memory available (experimental.)
11) Fix some vnode locking problems. (From Tor, I think.)
12) Add a check in ufs_lookup, to avoid lots of unneeded calls to bcmp.
(experimental.)
13) Significantly shrink, cleanup, and make slightly faster the vm_fault.c
code. Use generation counts, get rid of unneded collpase operations,
and clean up the cluster code.
14) Make vm_zone more suitable for TSM.
This commit is partially as a result of discussions and contributions from
other people, including DG, Tor Egge, PHK, and probably others that I
have forgotten to attribute (so let me know, if I forgot.)
This is not the infamous, final cleanup of the vnode stuff, but a necessary
step. Vnode mgmt should be correct, but things might still change, and
there is still some missing stuff (like ioopt, and physical backing of
non-merged cache files, debugging of layering concepts.)
- Set UN_ULOCK in union_lock() when UN_KLOCK is set. Caller expects
that vnode is locked correctly, and may call another function which
expects locked vnode and may unlock the vnode.
- Do not assume the behavior of inside functions in FreeBSD's
vfs_suber.c is same as 4.4BSD-Lite2. Vnode may be locked in
vget() even though flag is zero. (Locked vnode is, of course,
unlocked before returning from vget.)
original BSD code. The association between the vnode and the vm_object
no longer includes reference counts. The major difference is that
vm_object's are no longer freed gratuitiously from the vnode, and so
once an object is created for the vnode, it will last as long as the
vnode does.
When a vnode object reference count is incremented, then the underlying
vnode reference count is incremented also. The two "objects" are now
more intimately related, and so the interactions are now much less
complex.
When vnodes are now normally placed onto the free queue with an object still
attached. The rundown of the object happens at vnode rundown time, and
happens with exactly the same filesystem semantics of the original VFS
code. There is absolutely no need for vnode_pager_uncache and other
travesties like that anymore.
A side-effect of these changes is that SMP locking should be much simpler,
the I/O copyin/copyout optimizations work, NFS should be more ponderable,
and further work on layered filesystems should be less frustrating, because
of the totally coherent management of the vnode objects and vnodes.
Please be careful with your system while running this code, but I would
greatly appreciate feedback as soon a reasonably possible.
of vnodes and objects. There are some metadata performance improvements
that come along with this. There are also a few prototypes added when
the need is noticed. Changes include:
1) Cleaning up vref, vget.
2) Removal of the object cache.
3) Nuke vnode_pager_uncache and friends, because they aren't needed anymore.
4) Correct some missing LK_RETRY's in vn_lock.
5) Correct the page range in the code for msync.
Be gentle, and please give me feedback asap.
half the way down. Otherwise, further attempts to mount the device
will be rejected with BUSY.
IMHO, this flag can completely go away for cd9660. There's no reason
you need to prevent CDs from being mounted multiple times, and in case
of multisession CDs it can even make sense to mount two different
sessions by the same time (to different mount points, otherwise it
would be pointless ;).
flag is set in the p_pfsflags field. This, essentially, prevents an SUID
proram from hanging after being traced. (E.g., "truss /usr/bin/rlogin" would
fail, but leave rlogin in a stopevent state.) Yet another case where procctl
is (hopefully ;)) no longer needed in the general case.
Reviewed by: bde (thanks bruce :))
if one of the new poll types is requested; hopefully this will not break
any existing code. (This is done so that programs have a dependable
way of determining whether a filesystem supports the extended poll types
or not.)
The new poll types added are:
POLLWRITE - file contents may have been modified
POLLNLINK - file was linked, unlinked, or renamed
POLLATTRIB - file's attributes may have been changed
POLLEXTEND - file was extended
Note that the internal operation of poll() means that it is impossible
for two processes to reliably poll for the same event (this could
be fixed but may not be worth it), so it is not possible to rewrite
`tail -f' to use poll at this time.
1. SS_CANTRCVMORE was initially set on the wrong socket, so reads
when there has never been a writer on the socket did not return 0.
Note that such reads are only possible if the fifo was opened in
(O_RDONLY | O_NONBLOCK) mode.
2. SS_CANTSENDMORE was initially set on the wrong socket, but this
was harmless because the wrong socket is never sent from and there
is no need to set the flag initially on the right socket (since open
in (O_WRONLY | O_NONBLOCK) mode fails if there is no reader...).
3. SS_CANTRCVMORE was cleared when read() returns. This broke the
case where read() returns 0 - subsequent reads are supposed to
return 0 until a writer appears. There is no need to clear the
flag when read() returns, since it is cleared correctly when a
writer appears.
general to be of much use. Using it here weakened the _PC_MAX_CANON,
_PC_MAX_INPUT and _PC_VDISABLE cases.
fifo_pathconf() is not quite correct either. _PC_CHOWN_RESTRICTED
and _PC_LINK_MAX should be handled by the host file system. For
directories, the host file system should let us handle _PC_PIPE_BUF.
change from
ioctl(fd, PIOC<foo>, &i);
to
ioctl(fd, PIOC<foo>, i);
This is going from the _IOW to _IO ioctl macro. The kernel, procctl, and
truss must be in synch for it all to work (not doing so will get errors about
inappropriate ioctl's, fortunately). Hopefully I didn't forget anything :).
nodes; this also apparantly caused a panic in some circumstances.
Also, since procfs_exit() is getting rid of the nodes when a process
exits, don't bother checking for the process' existance in procfs_inactive().
what is teh root cause -- but, sometimes, a procfs vnode in pfshead is
apparantly corrupt (or a UFS vnode instead). Without this patch, I can
get it to panic by doing (in csh)
while (1)
ps auxwww
end
and it will panic when the PID's wrap. With it, it does not panic.
Yes -- I know that this is NOT the right way to fix it. But I haven't
been able to get it to panic yet (which confuses me). I am going to
be looking into the vgone() code now, as that may be a part of it.
me; unfortunately, also makes it hard ot check for errors); second, I had
managed to forget a change to PIOCSFL (it should be _IOW, not _IOR) I had
in my local copy, and Bruce called me on it.
Submitted by: bde
Note that an unload facility should be used to call rm_at_exit() (if
procfs is being loaded as an LKM and is subsequently removed), but it
was non-obvious how to do this in the VFS framework.
Reviewed by: Julian Elischer
procfs/mem file. While this doesn't prevent an unkillable process, it
means that a broken truss prorgam won't do it accidently now (well,
there's a small window of opportunity). Note that this requires the
change to truss I am about to commit.
Ever since I first say the way the mount flags were used I've hated the
fact that modes, and events, internal and exported, and short-term
and long term flags are all thrown together. Finally it's annoyed me enough..
This patch to the entire FreeBSD tree adds a second mount flag word
to the mount struct. it is not exported to userspace. I have moved
some of the non exported flags over to this word. this means that we now
have 8 free bits in the mount flags. There are another two that might
well move over, but which I'm not sure about.
The only user visible change would have been in pstat -v, except
that davidg has disabled it anyhow.
I'd still like to move the state flags and the 'command' flags
apart from each other.. e.g. MNT_FORCE really doesn't have the
same semantics as MNT_RDONLY, but that's left for another day.
it in struct proc instead.
This fixes a boatload of compiler warning, and removes a lot of cruft
from the sources.
I have not removed the /*ARGSUSED*/, they will require some looking at.
libkvm, ps and other userland struct proc frobbing programs will need
recompiled.
in a file. There was a (harmless, I think) off-by-1 error. This
was fixed in ufs long ago (rev.1.21 of ufs_readwrite.c) but not
in cd9660.
cd9660_read() has stagnated in many other ways. It is closer to
the Net/2 ufs_read() (which is was cloned from) than ufs_read()
itself is.
Rename vn_default_error to vop_defaultop all over the place.
Move vn_bwrite from vfs_bio.c to vfs_default.c and call it vop_stdbwrite.
Use vop_null instead of nullop.
Move vop_nopoll from vfs_subr.c to vfs_default.c
Move vop_sharedlock from vfs_subr.c to vfs_default.c
Move vop_nolock from vfs_subr.c to vfs_default.c
Move vop_nounlock from vfs_subr.c to vfs_default.c
Move vop_noislocked from vfs_subr.c to vfs_default.c
Use vop_ebadf instead of *_ebadf.
Add vop_defaultop for getpages on master vnode in MFS.
1. Add defaults for more VOPs
VOP_LOCK vop_nolock
VOP_ISLOCKED vop_noislocked
VOP_UNLOCK vop_nounlock
and remove direct reference in filesystems.
2. Rename the nfsv2 vnop tables to improve sorting order.
1. Remove VOP_UPDATE, it is (also) an UFS/{FFS,LFS,EXT2FS,MFS}
intereface function, and now lives in the ufsmount structure.
2. Remove VOP_SEEK, it was unused.
3. Add mode default vops:
VOP_ADVLOCK vop_einval
VOP_CLOSE vop_null
VOP_FSYNC vop_null
VOP_IOCTL vop_enotty
VOP_MMAP vop_einval
VOP_OPEN vop_null
VOP_PATHCONF vop_einval
VOP_READLINK vop_einval
VOP_REALLOCBLKS vop_eopnotsupp
And remove identical functionality from filesystems
4. Add vop_stdpathconf, which returns the canonical stuff. Use
it in the filesystems. (XXX: It's probably wrong that specfs
and fifofs sets this vop, shouldn't it come from the "host"
filesystem, for instance ufs or cd9660 ?)
5. Try to make system wide VOP functions have vop_* names.
6. Initialize the um_* vectors in LFS.
(Recompile your LKMS!!!)
1. Add new file "sys/kern/vfs_default.c" where default actions for
VOPs go. Implement proper defaults for ABORTOP, BWRITE, LEASE,
POLL, REVOKE and STRATEGY. Various stuff spread over the entire
tree belongs here.
2. Change VOP_BLKATOFF to a normal function in cd9660.
3. Kill VOP_BLKATOFF, VOP_TRUNCATE, VOP_VFREE, VOP_VALLOC. These
are private interface functions between UFS and the underlying
storage manager layer (FFS/LFS/MFS/EXT2FS). The functions now
live in struct ufsmount instead.
4. Remove a kludge of VOP_ functions in all filesystems, that did
nothing but obscure the simplicity and break the expandability.
If a filesystem doesn't implement VOP_FOO, it shouldn't have an
entry for it in its vnops table. The system will try to DTRT
if it is not implemented. There are still some cruft left, but
the bulk of it is done.
5. Fix another VCALL in vfs_cache.c (thanks Bruce!)
1. Use the default function to access all the specfs operations.
2. Use the default function to access all the fifofs operations.
3. Use the default function to access all the ufs operations.
4. Fix VCALL usage in vfs_cache.c
5. Use VOCALL to access specfs functions in devfs_vnops.c
6. Staticize most of the spec and fifofs vnops functions.
7. Make UFS panic if it lacks bits of the underlying storage handling.
1. Remove comment stating the blatantly obvious.
2. Align in two columns.
3. Sort all but the default element alphabetically.
4. Remove XXX comments pointing out entries not needed.
Distribute all but the most fundamental malloc types. This time I also
remembered the trick to making things static: Put "static" in front of
them.
A couple of finer points by: bde
1. Clustered I/O is switched by the MNT_NOCLUSTERR and MNT_NOCLUSTERW
bits of the mnt_flag. The sysctl variables, vfs.foo.doclusterread
and vfs.foo.doclusterwrite are deleted. Only mount option can
control clustered I/O from userland.
2. When foofs_mount mounts block device, foofs_mount checks D_CLUSTERR
and D_CLUSTERW bits of the d_flags member in the block device switch
table. If D_NOCLUSTERR / D_NOCLUSTERW are set, MNT_NOCLUSTERR /
MNT_NOCLUSTERW bits will be set. In this case, MNT_NOCLUSTERR and
MNT_NOCLUSTERW cannot be cleared from userland.
3. Vnode driver disables both clustered read and write.
4. Union filesystem disables clutered write.
Reviewed by: bde
plus the previous changes to use the zone allocator decrease the useage
of malloc by half. The Zone allocator will be upgradeable to be able
to use per CPU-pools, and has more intelligent usage of SPLs. Additionally,
it has reasonable stats gathering capabilities, while making most calls
inline.
This unifies several times in theory indentical 50 lines of code.
The filesystems have a new method: vop_cachedlookup, which is the
meat of the lookup, and use vfs_cache_lookup() for their vop_lookup
method. vfs_cache_lookup() will check the namecache and pass on
to the vop_cachedlookup method in case of a miss.
It's still the task of the individual filesystems to populate the
namecache with cache_enter().
Filesystems that do not use the namecache will just provide the
vop_lookup method as usual.
free list problem. Also, the vnode age flag is no longer used by the
vnode pager. (It is actually incorrect to use then.) Constructive
feedback welcome -- just be kind.
socket addresses in mbufs. (Socket buffers are the one exception.) A number
of kernel APIs needed to get fixed in order to make this happen. Also,
fix three protocol families which kept PCBs in mbufs to not malloc them
instead. Delete some old compatibility cruft while we're at it, and add
some new routines in the in_cksum family.
uerror == 0 && lerror == EACCES, lowervp == NULLVP and union_allocvp
doesn't find existing union node and new union node is created.
Sicne it is dificult to cover all the case, union_lookup always
returns when union_lookup1() returns EACCES.
Submitted by: Naofumi Honda <honda@Kururu.math.sci.hokudai.ac.jp>
Obtained from: NetBSD/pc98