A UFS/FFS snapshot file is identified with the SF_SNAPSHOT
flag to identify it as a snapshot. This flag needs to be
set before setting some of its block pointers to the special
values BLK_SNAP and BLK_NOCOPY. If the snapshot creation fails
and we call VOP_REMOVE(), the SF_SNAPSHOT flag will let the
remove routine know that the special block pointer values need
to be rolled back before attempting deletion of the file.
Also ensure that an fsck is required after setting superblock
values in the ffs_checkcgintegrity() routine.
Reported-by: Peter Holm
Tested-by: Peter Holm
MFC-after: 1 week
Sponsored-by: The FreeBSD Foundation
If a UFS/FFS filesystem develops a broken cylinder group (which is
usually detected when its check hash fails), that cylinder group
will not be usable until the filesystem has been unmounted and fsck
has been run to repair it. On the first attempt to to allocate
resources from the broken cylinder group, its available resources
are set to zero in the superblock summary information. Since it
will appear to have no resources available, no further calls will
be made to allocate resources from it. When resources are freed to
the broken cylinder group, the resource free routines will find the
cylinder group unusable so the resource will simply be discarded
and thus will not show up in the superblock summary information
until they are recovered by fsck.
Reported-by: Peter Holm
Tested-by: Peter Holm
MFC-after: 1 week
Sponsored-by: The FreeBSD Foundation
Rename to ffs_checkfreeblk() to better describe that it is checking
to find out if a block or fragment is free. Clarify its implementation.
No functional change intended.
MFC-after: 1 week
Sponsored-by: The FreeBSD Foundation
When a large file is deleted or a large number of files are deleted,
even a single cylinder group with a bad check hash can generate
thousands of check-hash warnings. As with other filesystem messages
such as out-of-space, print a maximum of one check-hash error per
second. Note the limit is per filesystem. If two filesystems have
cylinder group(s) with a bad check hash, each will print a maximum
of one check-hash error message per second.
MFC-after: 1 week
Sponsored-by: The FreeBSD Foundation
When a file is deleted, its blocks need to be put back in the free
block list and its inode needs to be put back in the inode free list.
These lists reside in cylinder-group maps. If either some of its blocks
or its inode reside in a cylinder-group map with a bad check hash
it is not possible to free the associated resource. Since the cylinder
group cannot be repaired until the filesystem is unmounted these
resources cannot be freed. They simply accumulate in memory. And
any attempt to unmount the filesystem loops forever trying to flush them.
With this change, the resource update claims to succeed so that the
file deletion can successfully complete. The filesystem is marked as
requiring an fsck so that before the next time that the filesystem is
mounted, the offending cylinder groups are reconstructed causing the
lost resources to be reclaimed.
A better solution would be to downgrade the filesystem to read-only,
but that capability is not currently implemented.
Reported-by: Peter Holm
Tested-by: Peter Holm
MFC-after: 1 week
Sponsored-by: The FreeBSD Foundation
As per https://lists.freebsd.org/archives/freebsd-scsi/2023-July/000257.html
move to the modern uintXX_t. While here also migrate u_char to uint8_t.
Where other kernel interfaces allow, migrate u_long to uint64_t.
No functional changes intended.
MFC-after: 1 week
Sponsored-by: The FreeBSD Foundation
Commit fe5e6e2 improved FFS directory placement when creating new
directories. It is done by keeping track of the depth of directories
in the filesystem and placing those lower in the tree closer together
while spreading out those higher in the tree.
Fsck_ffs(8) checks these depths and if incorrect adjusts them to
their correct value. When running in background fsck_ffs(8) needs
to be able to make an adjustment to the depth. This commit adds
the sysctl to make such an adjustment and adds the code to fsck_ffs(8)
to use the new sysctl.
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
The SPDX folks have obsoleted the BSD-2-Clause-FreeBSD identifier. Catch
up to that fact and revert to their recommended match of BSD-2-Clause.
Discussed with: pfg
MFC After: 3 days
Sponsored by: Netflix
The algorithm for laying out new directories was devised in the 1980s
and markedly improved the performance of the filesystem. In those days
large disks had at most 100 cylinder groups and often as few as 10-20.
Modern multi-terrabyte disks have thousands of cylinder groups. The
original algorithm does not handle these large sizes well. This change
attempts to expand the scope of the original algorithm to work well
with these much larger disks while still retaining the properties
of the original algorithm for small disks.
The filesystem implementation is divided into policy routines and
implementation routines. The policy routines can be changed in any
way desired without risk of corrupting the filesystem. The policy
requests are handled by the implementation layer. If the policy
asks for an available resource, it is granted. But if it asks for
an already in-use resource, then the implementation will provide
an available one nearby the request. Thus it is impossible for a
policy to double allocate. This change is limited to the policy
implementation.
This change updates the ffs_dirpref() routine which is responsible
for selecting the cylinder group into which a new directory should
be placed. If we are near the root of the filesystem we aim to
spread them out as much as possible. As we descend deeper from the
root we cluster them closer together around their parent as we
expect them to be more closely interactive. Higher-level directories
like usr/src/sys and usr/src/bin should be separated while the
directories in these areas are more likely to be accessed together
so should be closer. And directories within commands or kernel
subsystems should be closer still.
We pick a range of cylinder groups around the cylinder group of the
directory in which we are being created. The size of the range for
our search is based on our depth from the root of our filesystem.
We then probe that range based on how many directories are already
present. The first new directory is at 1/2 (middle) of the range;
the second is in the first 1/4 of the range, then at 3/4, 1/8, 3/8,
5/8, 7/8, 1/16, 3/16, 5/16, etc.
It is desirable to store the depth of a directory in its on-disk
inode so that it is available when we need it. We add a new field
di_dirdepth to track the depth of each directory. Because there are
few spare fields left in the inode, we choose to share an existing
field in the inode rather than having one of our own. Specifically
we create a union with the di_freelink field. The di_freelink field
is used to track inodes that have been unlinked but remain referenced.
It is not needed until a rmdir(2) operation has been done on a
directory. At that point, the directory has no contents and even
if it is kept active as a current directory is no longer able to
have any new directories or files created in it. Thus the use of
di_dirdepth and di_freelink will never coincide.
Reported by: Timo Voelker
Reviewed by: kib
Tested by: Peter Holm
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D39246
The function ffs_vgetf() is used to find or load UFS inodes into a
vnode. It first looks up the inode and if found in the cache its
vnode is returned. If it is not already in the cache, a new vnode
is allocated and its associated inode read in from the disk. The
read is done even for inodes that are being initially created.
The contents for the inode on the disk are assumed to be empty. If
the on-disk contents had been corrupted either due to a hardware
glitch or an agent deliberately trying to exploit the system, the
UFS code could panic from the unexpected partially-allocated inode.
Rather then having fsck_ffs(8) verify that all unallocated inodes
are properly empty, it is easier and quicker to add a flag to
ffs_vgetf() to indicate that the request is for a newly allocated
inode. When set, the disk read is skipped and the inode is set to
its expected empty (zero'ed out) initial state. As a side benefit,
an unneeded disk I/O is avoided.
Reported by: Peter Holm
Sponsored by: The FreeBSD Foundation
The K&R style in UFS and other places in the tree's days are numbered
as this syntax is removed in C2x proposal N2432:
https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2432.pdf
Though running to nearly 6000 lines of diffs this update should cause
no functional change to the code.
Requested by: Warner Losh
MFC after: 2 weeks
The fsckpid mount option was introduced in 927a12ae16 along with a
couple sysctl's to support SU+J with snapshots. However, those sysctl's
were never used and eventually removed in f2620e9ceb.
There are no in-tree consumers of this mount option.
Reviewed by: mckusick, kib
Differential Revision: https://reviews.freebsd.org/D32015
Make sure we return an error if no input was specified, since
SYSCTL_IN() will report success in that case.
Reported by: KMSAN
Reviewed by: mckusick
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30586
instead of DOINGSOFTDEP(). The softdep_prealloc() function does nothing
in SU case.
Note that the call should be safe with regard to the vnode relock,
because it is called with MNT_NOWAIT, which does not descend into fsync.
Reviewed by: mckusick
Tested by: pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
This count is memoized together with the lookup metadata in directory
inode, and we assert that accesses to lookup metadata are done under
the same lock generation as they were stored. Enabled under DIAGNOSTICS.
UFS saves additional data for parent dirent when doing lookup
(i_offset, i_count, i_endoff), and this data is used later by VOPs
operating on dirents. If parent vnode exclusive lock is dropped and
re-acquired between lookup and the VOP call, we corrupt directories.
Framework asserts that corruption cannot occur that way, by tracking
vnode lock generation counter. Updates to inode dirent members also
save the counter, while users compare current and saved counters
values.
Also, fix a case in ufs_lookup_ino() where i_offset and i_count could
be updated under shared lock. It is not a bug on its own since dvp
i_offset results from such lookup cannot be used, but it causes false
positive in the checker.
In collaboration with: pho
Reviewed by: mckusick (previous version), markj
Tested by: markj (syzkaller), pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D26136
The pointer to vnode is already stored into f_vnode, so f_data can be
reused. Fix all found users of f_data for DTYPE_VNODE.
Provide finit_vnode() helper to initialize file of DTYPE_VNODE type.
Reviewed by: markj (previous version)
Discussed with: freqlabs (openzfs chunk)
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D26346
requires that new data on growing files be accessible. Thus, the
the fsyncdata() system call must update the on-disk inode when the
size of the file has changed.
This commit adds another inode update flag, IN_SIZEMOD, that gets
set any time that the file size changes. If either the IN_IBLKDATA
or the IN_SIZEMOD flag is set when fdatasync() is called, the
associated inode is synchronously written to disk. We could have
overloaded the IN_IBLKDATA flag to also track size changes since
the only (current) use case for these flags are for fsyncdata(),
but it does seem useful for possible future uses to separately
track the file size changes and the inode block pointer changes.
Reviewed by: kib
MFC with: -r361785
Differential revision: https://reviews.freebsd.org/D25072
the underlying media fails or becomes inaccessible. For example
when a USB flash memory card hosting a UFS filesystem is unplugged.
The strategy for handling disk I/O errors when soft updates are
enabled is to stop writing to the disk of the affected file system
but continue to accept I/O requests and report that all future
writes by the file system to that disk actually succeed. Then
initiate an asynchronous forced unmount of the affected file system.
There are two cases for disk I/O errors:
- ENXIO, which means that this disk is gone and the lower layers
of the storage stack already guarantee that no future I/O to
this disk will succeed.
- EIO (or most other errors), which means that this particular
I/O request has failed but subsequent I/O requests to this
disk might still succeed.
For ENXIO, we can just clear the error and continue, because we
know that the file system cannot affect the on-disk state after we
see this error. For EIO or other errors, we arrange for the geom_vfs
layer to reject all future I/O requests with ENXIO just like is
done when the geom_vfs is orphaned. In both cases, the file system
code can just clear the error and proceed with the forcible unmount.
This new treatment of I/O errors is needed for writes of any buffer
that is involved in a dependency. Most dependencies are described
by a structure attached to the buffer's b_dep field. But some are
created and processed as a result of the completion of the dependencies
attached to the buffer.
Clearing of some dependencies require a read. For example if there
is a dependency that requires an inode to be written, the disk block
containing that inode must be read, the updated inode copied into
place in that buffer, and the buffer then written back to disk.
Often the needed buffer is already in memory and can be used. But
if it needs to be read from the disk, the read will fail, so we
fabricate a buffer full of zeroes and pretend that the read succeeded.
This zero'ed buffer can be updated and written back to disk.
The only case where a buffer full of zeros causes the code to do
the wrong thing is when reading an inode buffer containing an inode
that still has an inode dependency in memory that will reinitialize
the effective link count (i_effnlink) based on the actual link count
(i_nlink) that we read. To handle this case we now store the i_nlink
value that we wrote in the inode dependency so that it can be
restored into the zero'ed buffer thus keeping the tracking of the
inode link count consistent.
Because applications depend on knowing when an attempt to write
their data to stable storage has failed, the fsync(2) and msync(2)
system calls need to return errors if data fails to be written to
stable storage. So these operations return ENXIO for every call
made on files in a file system where we have otherwise been ignoring
I/O errors.
Coauthered by: mckusick
Reviewed by: kib
Tested by: Peter Holm
Approved by: mckusick (mentor)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D24088
These two sysctls were added to support UFS softupdates journalling
with snapshots. However, the changes to fsck to use them were never
committed and there have never been any in-tree uses of these sysctls.
More details from Kirk:
When journalling got added to soft updates, its journal rollback freed
blocks that it thought were no longer in use. But it does not take
snapshots into account (i.e., if a snapshot is still using it, then it
cannot be freed). So I added the needed logic to fsck by having the
free go through the kernel's blkfree code so it could grab blocks that
were still needed by snapshots. That is done using the setbufoutput
hack. I never got that code working reliably, so it is still sitting
in my work directory. Which also explains why you still cannot take
snapshots on filesystems running with journalling...
In looking over my use of this feature, and in particular the troubles
I was having with it, I conclude that it may be better to extract the
code from the kernel that handles freeing blocks claimed by snapshots
and putting it into fsck directly. My original intent was that it is
complex and at the time changing, so only having to maintain it in one
place was appealing. But at this point it has not changed in years and
the hacks like setinode and setbufoutput to be able to use the kernel
code is sufficiently ugly, that I am leaning towards just extracting
it.
Reviewed by: mckusick
MFC after: 1 week
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D24484
This has a side effect of eliminating filedesc slock/sunlock during path
lookup, which in turn removes contention vs concurrent modifications to the fd
table.
Reviewed by: markj, kib
Differential Revision: https://reviews.freebsd.org/D23889
file systems to safely access their disk devices, and adapt FFS to use it.
Also add a new BO_NOBUFS flag to allow enforcing that file systems using
mntfs vnodes do not accidentally use the original devfs vnode to create buffers.
Reviewed by: kib, mckusick
Approved by: imp (mentor)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D23787
The new structure is copy-on-write. With the assumption that path lookups are
significantly more frequent than chdirs and chrooting this is a win.
This provides stable root and jail root vnodes without the need to reference
them on lookup, which in turn means less work on globally shared structures.
Note this also happens to fix a bug where jail vnode was never referenced,
meaning subsequent access on lookup could run into use-after-free.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23884
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.
This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.
Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT
Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718
This will be used later to add vnodes to the lazy list.
Reviewed by: kib (previous version), jeff
Tested by: pho (in a larger patch)
Differential Revision: https://reviews.freebsd.org/D22994
Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D21427
the vnode, logical block number, and size of data block that is
being requested. They then use the VOP_BMAP function to calculate
the mapping from logical block number to physical block number from
which to access the data. This change expands the interface to also
pass the physical block number in cases where the VOP_MAP function
may no longer work, for example when a file is being truncated.
No functional change.
Reviewed by: kib
Tested by: Peter Holm
Sponsored by: Netflix
the cg rather than reusuing "ino" for this purpose. This reduces the diff
for an upcoming change that improves handling of I/O errors.
No functional change.
Reviewed by: mckusick
Approved by: mckusick (mentor)
Sponsored by: Netflix
In ffs_valloc(), force reclaim existing vnode on inode reuse, instead
of trying to re-initialize the same vnode for new purposes. This is
done in preparation of changes to the vp->v_object lifecycle handling.
A new FFSV_REPLACE flag to ffs_vgetf() directs the function to
vgone(9) the vnode if found in vfs hash, instead of returning it.
Reviewed by: markj, mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D21412
filesystems that have block pointers that are out-of-range for their
filesystem. These out-of-range block pointers are corrected by
fsck(8) so are only encountered when an unchecked filesystem is
mounted.
A new "untrusted" flag has been added to the generic mount interface
that can be set when mounting media of unknown provenance or integrity.
For example, a daemon that automounts a filesystem on a flash drive
when it is plugged into a system.
This commit adds a test to UFS/FFS that validates all block numbers
before using them. Because checking for out-of-range blocks adds
unnecessary overhead to normal operation, the tests are only done
when the filesystem is mounted as an "untrusted" filesystem.
Reported by: Christopher Krah, Thomas Barabosch, and Jan-Niclas Hilgert of Fraunhofer FKIE
Reported as: FS-14-UFS-3: Out of bounds read in write-2 (ffs_alloccg)
Reviewed by: kib
Sponsored by: Netflix
filesystem full message is sent to the offending process or the
kernel log if the offending process cannot be identified.
To prevent an explotion of messages, the kernel ppsratecheck()
function is used to limit the messages to one per second. This
revision changes the variable that tracks the rate of these messages
from a systemwide limit to a per-filesystem limit by moving it from
a global variable to a variable in the ufsmount structure.
Suggested by: kib
Reviewed by: kib
Sponsored by: Netflix
rename the source to gsb_crc32.c.
This is a prerequisite of unifying kernel zlib instances.
PR: 229763
Submitted by: Yoshihiro Ota <ota at j.email.ne.jp>
Differential Revision: https://reviews.freebsd.org/D20193
the file associated with the given file descriptor.
Reviewed by: kib, asomers
Reviewed by: cem, jilles, brooks (they reviewed previous version)
Discussed with: pjd, and many others
Differential Revision: https://reviews.freebsd.org/D14567