Commit graph

16926 commits

Author SHA1 Message Date
Gleb Smirnoff
1a49612526 Mechanically convert INP_INFO_RLOCK() to NET_EPOCH_ENTER().
Remove few outdated comments and extraneous assertions.  No
functional change here.
2019-11-07 00:08:34 +00:00
Gleb Smirnoff
b8c923032f If vm_pager_get_pages_async() returns an error synchronously we leak wired
and busy pages.  Add code that would carefully cleanups the state in case
of synchronous error return.  Cover a case when a first I/O went on
asynchronously, but second or N-th returned error synchronously.

In collaboration with:	chs
Reviewed by:		jtl, kib
2019-11-06 23:45:43 +00:00
Bjoern A. Zeeb
28d7601989 m_pulldown(): Change an if () panic() into a KASSERT().
If we pass in a NULL mbuf to m_pulldown() we are in a bad situation
already.  There is no point in doing that check for production code.
Change the if () panic() into a KASSERT.

MFC after:	3 weeks
Sponsored by:	Netflix
2019-11-06 22:40:19 +00:00
Brooks Davis
89f34d4611 libstats: Improve ABI assertion.
On platforms where pointers are larger than 64-bits, struct statsblob
may be harmlessly padded out such that opaque[] always has some included
space.  Make the assertion more general by comparing to the offset of
opaque rather than the size of struct statsblob.

Discussed with:	jhb, James Clarke
Reviewed by:	trasz, lstewart
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D22188
2019-11-06 19:44:44 +00:00
Alexander Motin
3db35ffa2a Some more taskqueue optimizations.
- Optimize enqueue for two task priority values by adding new tq_hint
field, pointing to the last task inserted into the middle of the list.
In case of more then two priority values it should halve average search.
 - Move tq_active insert/remove out of the taskqueue_run_locked loop.
Instead of dirtying few shared cache lines per task introduce different
mechanism to drain active tasks, based on task sequence number counter,
that uses only cache lines already present in cache.  Since the new
mechanism does not need ordering, switch tq_active from TAILQ to LIST.
 - Move static and dynamic struct taskqueue fields into different cache
lines.  Move lock into its own cache line, so that heavy lock spinning
by multiple waiting threads would not affect the running thread.
 - While there, correct some TQ_SLEEP() wait messages.

This change fixes certain ZFS write workloads, causing huge congestion
on taskqueue lock.  Those workloads combine some large block writes to
saturate the pool and trigger allocation throttling, which uses higher
priority tasks to requeue the delayed I/Os, with many small blocks to
generate deep queue of small tasks for taskqueue to sort.

MFC after:	1 week
Sponsored by:	iXsystems, Inc.
2019-11-01 22:49:44 +00:00
Ed Maste
2e5f9189bb avoid kernel stack data leak in core dump thrmisc note
bzero the entire thrmisc struct, not just the padding.  Other core dump
notes are already done this way.

Reported by:	Ilja Van Sprundel <ivansprundel@ioactive.com>
Reviewed by:	markj
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
2019-10-31 20:42:36 +00:00
Jeff Roberson
67d0e29304 Replace OBJ_MIGHTBEDIRTY with a system using atomics. Remove the TMPFS_DIRTY
flag and use the same system.

This enables further fault locking improvements by allowing more faults to
proceed with a shared lock.

Reviewed by:	kib
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D22116
2019-10-29 21:06:34 +00:00
Jeff Roberson
6ee653cfeb Drop the object lock in vfs_bio and cluster where it is now safe to do so.
Recent changes to busy/valid/dirty have enabled page based synchronization
and the object lock is no longer required in many cases.

Reviewed by:	kib
Sponsored by:	Netflix, Intel
Differential Revision:	https://reviews.freebsd.org/D21597
2019-10-29 20:37:59 +00:00
Gleb Smirnoff
5757b59f3e Merge td_epochnest with td_no_sleeping.
Epoch itself doesn't rely on the counter and it is provided
merely for sleeping subsystems to check it.

- In functions that sleep use THREAD_CAN_SLEEP() to assert
  correctness.  With EPOCH_TRACE compiled print epoch info.
- _sleep() was a wrong place to put the assertion for epoch,
  right place is sleepq_add(), as there ways to call the
  latter bypassing _sleep().
- Do not increase td_no_sleeping in non-preemptible epochs.
  The critical section would trigger all possible safeguards,
  no sleeping counter is extraneous.

Reviewed by:	kib
2019-10-29 17:28:25 +00:00
Konstantin Belousov
5e921ff49e amd64: move pcb out of kstack to struct thread.
This saves 320 bytes of the precious stack space.

The only negative aspect of the change I can think of is that the
struct thread increased by 320 bytes obviously, and that 320 bytes are
not swapped out anymore. I believe the freed stack space is much more
important than that.  Also, current struct thread size is 1392 bytes
on amd64, so UMA will allocate two thread structures per (4KB) slab,
which leaves a space for pcb without increasing zone memory use.

Reviewed by:	alc, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D22138
2019-10-25 20:09:42 +00:00
Gleb Smirnoff
ed9d69b5e8 Use THREAD_CAN_SLEEP() macro to check if thread can sleep. There is no
functional change.

Discussed with:	kib
2019-10-24 21:55:19 +00:00
John Baldwin
7d29eb9a91 Use a counter with a random base for explicit IVs in GCM.
This permits constructing the entire TLS header in ktls_frame() rather
than ktls_seq().  This also matches the approach used by OpenSSL which
uses an incrementing nonce as the explicit IV rather than the sequence
number.

Reviewed by:	gallatin
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D22117
2019-10-24 18:13:26 +00:00
Konstantin Belousov
c92f130498 Fix undefined behavior.
Create a sequence point by ending a full expression for call to
vspace() and use of the globals which are modified by vspace().

Reported and reviewed by:	imp
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D22126
2019-10-23 16:06:47 +00:00
Konstantin Belousov
8076c4e7d1 vn_printf(): Decode VI_TEXT_REF.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2019-10-23 15:51:26 +00:00
Gleb Smirnoff
080e9496b8 Allow epoch tracker to use the very last byte of the stack. Not sure
this will help to avoid panic in this function, since it will also use
some stack, but makes code more strict.

Submitted by:	hselasky
2019-10-22 18:05:15 +00:00
Gleb Smirnoff
77d70e515f Assert that any epoch tracker belongs to the thread stack.
Reviewed by:	kib
2019-10-21 23:12:14 +00:00
Gleb Smirnoff
279b9aabe3 Remove epoch tracker from struct thread. It was an ugly crutch to emulate
locking semantics for if_addr_rlock() and if_maddr_rlock().
2019-10-21 18:19:32 +00:00
Andriy Gapon
3ad1ce46d3 debug,kassert.warnings is a statistic, not a tunable
MFC after:	1 week
2019-10-21 12:21:56 +00:00
Mark Johnston
f822c9e287 Apply mapping protections to preloaded kernel modules on amd64.
With an upcoming change the amd64 kernel will map preloaded files RW
instead of RWX, so the kernel linker must adjust protections
appropriately using pmap_change_prot().

Reviewed by:	kib
MFC after:	1 month
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21860
2019-10-18 13:56:45 +00:00
Mark Johnston
1d9eae9fb2 Apply mapping protections to .o kernel modules.
Use the section flags to derive mapping protections.  When multiple
sections overlap within a page, the union of their protections must be
applied.  With r353701 the .text and .rodata sections are padded to
ensure that this does not happen on amd64.

Reviewed by:	kib
MFC after:	1 month
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21896
2019-10-18 13:53:14 +00:00
Conrad Meyer
dda17b3672 Implement NetGDB(4)
NetGDB(4) is a component of a system using a panic-time network stack to
remotely debug crashed FreeBSD kernels over the network, instead of
traditional serial interfaces.

There are three pieces in the complete NetGDB system.

First, a dedicated proxy server must be running to accept connections from
both NetGDB and gdb(1), and pass bidirectional traffic between the two
protocols.

Second, the NetGDB client is activated much like ordinary 'gdb' and
similarly to 'netdump' in ddb(4) after a panic.  Like other debugnet(4)
clients (netdump(4)), the network interface on the route to the proxy server
must be online and support debugnet(4).

Finally, the remote (k)gdb(1) uses 'target remote <proxy>:<port>' (like any
other TCP remote) to connect to the proxy server.

The NetGDB v1 protocol speaks the literal GDB remote serial protocol, and
uses a 1:1 relationship between GDB packets and sequences of debugnet
packets (fragmented by MTU).  There is no encryption utilized to keep
debugging sessions private, so this is only appropriate for local
segments or trusted networks.

Submitted by:	John Reimer <john.reimer AT emc.com> (earlier version)
Discussed some with:	emaste, markj
Relnotes:	sure
Differential Revision:	https://reviews.freebsd.org/D21568
2019-10-17 21:33:01 +00:00
Mark Johnston
092bacb2c4 Clean up some nits in link_elf_(un)load_file().
- Remove a redundant assignment of ef->address.
- Don't return a Mach error number to the caller if vm_map_find() fails.
- Use ptoa() and fix style.

MFC after:	2 weeks
Sponsored by:	Netflix
2019-10-17 21:25:50 +00:00
Conrad Meyer
addccb8c51 Add a very limited DDB dumpon(8)-alike to MI dumper code
This allows ddb(4) commands to construct a static dumperinfo during
panic/debug and invoke doadump(false) using the provided dumper
configuration (always inserted first in the list).

The intended usecase is a ddb(4)-time netdump(4) command.

Reviewed by:	markj (earlier version)
Differential Revision:	https://reviews.freebsd.org/D21448
2019-10-17 18:29:44 +00:00
Conrad Meyer
7790c8c199 Split out a more generic debugnet(4) from netdump(4)
Debugnet is a simplistic and specialized panic- or debug-time reliable
datagram transport.  It can drive a single connection at a time and is
currently unidirectional (debug/panic machine transmit to remote server
only).

It is mostly a verbatim code lift from netdump(4).  Netdump(4) remains
the only consumer (until the rest of this patch series lands).

The INET-specific logic has been extracted somewhat more thoroughly than
previously in netdump(4), into debugnet_inet.c.  UDP-layer logic and up, as
much as possible as is protocol-independent, remains in debugnet.c.  The
separation is not perfect and future improvement is welcome.  Supporting
INET6 is a long-term goal.

Much of the diff is "gratuitous" renaming from 'netdump_' or 'nd_' to
'debugnet_' or 'dn_' -- sorry.  I thought keeping the netdump name on the
generic module would be more confusing than the refactoring.

The only functional change here is the mbuf allocation / tracking.  Instead
of initiating solely on netdump-configured interface(s) at dumpon(8)
configuration time, we watch for any debugnet-enabled NIC for link
activation and query it for mbuf parameters at that time.  If they exceed
the existing high-water mark allocation, we re-allocate and track the new
high-water mark.  Otherwise, we leave the pre-panic mbuf allocation alone.
In a future patch in this series, this will allow initiating netdump from
panic ddb(4) without pre-panic configuration.

No other functional change intended.

Reviewed by:	markj (earlier version)
Some discussion with:	emaste, jhb
Objection from:	marius
Differential Revision:	https://reviews.freebsd.org/D21421
2019-10-17 16:23:03 +00:00
Andriy Gapon
5fdc2c044e provide a way to assign taskqueue threads to a kernel process
This can be used to group all threads belonging to a single logical
entity under a common kernel process.
I am planning to use the new interface for ZFS threads.

MFC after:	4 weeks
2019-10-17 06:32:34 +00:00
Mark Johnston
6d775f0ba1 Use KOBJMETHOD_END in the kernel linker.
MFC after:	1 week
2019-10-16 22:06:19 +00:00
Mark Johnston
01cef4caa7 Remove page locking from pmap_mincore().
After r352110 the page lock no longer protects a page's identity, so
there is no purpose in locking the page in pmap_mincore().  Instead,
if vm.mincore_mapped is set to the non-default value of 0, re-lookup
the page after acquiring its object lock, which holds the page's
identity stable.

The change removes the last callers of vm_page_pa_tryrelock(), so
remove it.

Reviewed by:	kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21823
2019-10-16 22:03:27 +00:00
Andrew Turner
9bb37c03fb Stop leaking information from the kernel through timespec
The timespec struct holds a seconds value in a time_t and a nanoseconds
value in a long. On most architectures these are the same size, however
on 32-bit architectures other than i386 time_t is 8 bytes and long is
4 bytes.

Most ABIs will then pad a struct holding an 8 byte and 4 byte value to
16 bytes with 4 bytes of padding. When copying one of these structs the
compiler is free to copy the padding if it wishes.

In this case the padding may contain kernel data that is then leaked to
userspace. Fix this by copying the timespec elements rather than the
entire struct.

This doesn't affect Tier-1 architectures so no SA is expected.

admbugs:	651
MFC after:	1 week
Sponsored by:	DARPA, AFRL
2019-10-16 13:21:01 +00:00
Kristof Provost
1d95443818 Generalize ARM specific comments in devmap
The comments in devmap are very ARM specific, this generalizes them for other
architectures.

Submitted by:	Nicholas O'Brien <nickisobrien_gmail.com>
Reviewed by:	manu, philip
Sponsored by:	Axiado
Differential Revision:	https://reviews.freebsd.org/D22035
2019-10-15 23:21:52 +00:00
Gleb Smirnoff
4b25d1f2e3 Missing from r353596. 2019-10-15 21:32:38 +00:00
Gleb Smirnoff
bac060388f When assertion for a thread not being in an epoch fails also print all
entered epochs. Works with EPOCH_TRACE only.

Reviewed by:	hselasky
Differential Revision:	https://reviews.freebsd.org/D22017
2019-10-15 21:24:25 +00:00
Gleb Smirnoff
237c1f932b Remove pfctlinput2(). It came from KAME and had never ever been in use. 2019-10-15 15:40:03 +00:00
Jeff Roberson
0012f373e4 (4/6) Protect page valid with the busy lock.
Atomics are used for page busy and valid state when the shared busy is
held.  The details of the locking protocol and valid and dirty
synchronization are in the updated vm_page.h comments.

Reviewed by:    kib, markj
Tested by:      pho
Sponsored by:   Netflix, Intel
Differential Revision:        https://reviews.freebsd.org/D21594
2019-10-15 03:45:41 +00:00
Jeff Roberson
63e9755548 (1/6) Replace busy checks with acquires where it is trival to do so.
This is the first in a series of patches that promotes the page busy field
to a first class lock that no longer requires the object lock for
consistency.

Reviewed by:	kib, markj
Tested by:	pho
Sponsored by:	Netflix, Intel
Differential Revision:	https://reviews.freebsd.org/D21548
2019-10-15 03:35:11 +00:00
Leandro Lupori
0ecc478b74 [PPC64] Initial kernel minidump implementation
Based on POWER9BSD implementation, with all POWER9 specific code removed and
addition of new methods in PPC64 MMU interface, to isolate platform specific
code. Currently, the new methods are implemented on pseries and PowerNV
(D21643).

Reviewed by:	jhibbits
Differential Revision:	https://reviews.freebsd.org/D21551
2019-10-14 13:04:04 +00:00
Gleb Smirnoff
f6eccf96a0 Since EPOCH_TRACE had been moved to opt_global.h, we don't need to waste
extra space in struct thread.
2019-10-14 04:17:56 +00:00
Mateusz Guzik
d1cbf3eeea vfs: add MNTK_NOMSYNC
On many filesystems the traversal is effectively a no-op. Add a way to avoid
the overhead.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22009
2019-10-13 15:40:34 +00:00
Mateusz Guzik
737241cd51 vfs: return free vnode batches in sync instead of vfs_msync
It is a more natural fit. vfs_msync only deals with active vnodes.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22008
2019-10-13 15:39:11 +00:00
Alexander Motin
a89a562b60 Allocate device softc from the device domain.
Since we are trying to bind device interrupt threads to the device domain,
it should have sense to make memory often accessed by them local. If domain
is not known, fall back to round-robin.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2019-10-12 19:03:07 +00:00
Kristof Provost
85d1151f96 mountroot: run statfs after mounting devfs
The usual flow for mounting a file system is to VFS_MOUNT() and then
immediately VFS_STATFS().

That's not done in vfs_mountroot_devfs(), which means the
mp->mnt_stat.f_iosize field is not correctly populated, which in turn
causes us to mark valid aio operations as unsafe (because the io size is
set to 0), ultimately causing the aio_test:md_waitcomplete test to fail.

Reviewed by:	mckusick
MFC after:	1 week
Sponsored by:	Axiado
Differential Revision:	https://reviews.freebsd.org/D21897
2019-10-11 17:04:38 +00:00
Conrad Meyer
46d70077be ddb: Add CSV option, sorting to 'show (malloc|uma)'
Add /i option for machine-parseable CSV output.  This allows ready copy/
pasting into more sophisticated tooling outside of DDB.

Add total zone size ("Memory Use") as a new column for UMA.

For both, sort the displayed list on size (print the largest zones/types
first).  This is handy for quickly diagnosing "where has my memory gone?" at
a high level.

Submitted by:	Emily Pettigrew <Emily.Pettigrew AT isilon.com> (earlier version)
Sponsored by:	Dell EMC Isilon
2019-10-11 01:31:31 +00:00
John Baldwin
97ecf6efa0 Don't free the cursor boundary tag during vmem_destroy().
The cursor boundary tag is statically allocated in the vmem instead of
from the vmem_bt_zone.  Explicitly remove it from the vmem's segment
list in vmem_destroy before freeing all the segments from the vmem.

Reviewed by:	markj
MFC after:	1 week
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D21953
2019-10-09 21:20:39 +00:00
Gleb Smirnoff
975b8f8462 Cleanup unneeded includes that crept in with r353292. 2019-10-09 16:59:42 +00:00
Gleb Smirnoff
ff3cfc330e Enter network epoch in domain callouts. 2019-10-09 16:21:05 +00:00
Mark Johnston
4013d72684 Fix handling of empty SCM_RIGHTS messages.
As unp_internalize() processes the input control messages, it builds
an output mbuf chain containing the internalized representations of
those messages.  In one special case, that of an empty SCM_RIGHTS
message, the message is simply discarded.  However, the loop which
appends mbufs to the output chain assumed that each iteration would
produce an mbuf, resulting in a null pointer dereference if an empty
SCM_RIGHTS message was followed by a non-empty message.

Fix this by advancing the output mbuf chain tail pointer only if an
internalized control message was produced.

Reported by:	syzbot+1b5cced0f7fad26ae382@syzkaller.appspotmail.com
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2019-10-08 23:34:48 +00:00
John Baldwin
9e14430d46 Add a TOE KTLS mode and a TOE hook for allocating TLS sessions.
This adds the glue to allocate TLS sessions and invokes it from
the TLS enable socket option handler.  This also adds some counters
for active TOE sessions.

The TOE KTLS mode is returned by getsockopt(TLSTX_TLS_MODE) when
TOE KTLS is in use on a socket, but cannot be set via setsockopt().

To simplify various checks, a TLS session now includes an explicit
'mode' member set to the value returned by TLSTX_TLS_MODE.  Various
places that used to check 'sw_encrypt' against NULL to determine
software vs ifnet (NIC) TLS now check 'mode' instead.

Reviewed by:	np, gallatin
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D21891
2019-10-08 21:34:06 +00:00
Doug Moore
2288078c5e Define macro VM_MAP_ENTRY_FOREACH for enumerating the entries in a vm_map.
In case the implementation ever changes from using a chain of next pointers,
then changing the macro definition will be necessary, but changing all the
files that iterate over vm_map entries will not.

Drop a counter in vm_object.c that would have an effect only if the
vm_map entry count was wrong.

Discussed with: alc
Reviewed by: markj
Tested by: pho (earlier version)
Differential Revision:	https://reviews.freebsd.org/D21882
2019-10-08 07:14:21 +00:00
Gleb Smirnoff
b8a6e03fac Widen NET_EPOCH coverage.
When epoch(9) was introduced to network stack, it was basically
dropped in place of existing locking, which was mutexes and
rwlocks. For the sake of performance mutex covered areas were
as small as possible, so became epoch covered areas.

However, epoch doesn't introduce any contention, it just delays
memory reclaim. So, there is no point to minimise epoch covered
areas in sense of performance. Meanwhile entering/exiting epoch
also has non-zero CPU usage, so doing this less often is a win.

Not the least is also code maintainability. In the new paradigm
we can assume that at any stage of processing a packet, we are
inside network epoch. This makes coding both input and output
path way easier.

On output path we already enter epoch quite early - in the
ip_output(), in the ip6_output().

This patch does the same for the input path. All ISR processing,
network related callouts, other ways of packet injection to the
network stack shall be performed in net_epoch. Any leaf function
that walks network configuration now asserts epoch.

Tricky part is configuration code paths - ioctls, sysctls. They
also call into leaf functions, so some need to be changed.

This patch would introduce more epoch recursions (see EPOCH_TRACE)
than we had before. They will be cleaned up separately, as several
of them aren't trivial. Note, that unlike a lock recursion the
epoch recursion is safe and just wastes a bit of resources.

Reviewed by:	gallatin, hselasky, cy, adrian, kristof
Differential Revision:	https://reviews.freebsd.org/D19111
2019-10-07 22:40:05 +00:00
Edward Tomasz Napierala
1a13f2e6b4 Introduce stats(3), a flexible statistics gathering API.
This provides a framework to define a template describing
a set of "variables of interest" and the intended way for
the framework to maintain them (for example the maximum, sum,
t-digest, or a combination thereof).  Afterwards the user
code feeds in the raw data, and the framework maintains
these variables inside a user-provided, opaque stats blobs.
The framework also provides a way to selectively extract the
stats from the blobs.  The stats(3) framework can be used in
both userspace and the kernel.

See the stats(3) manual page for details.

This will be used by the upcoming TCP statistics gathering code,
https://reviews.freebsd.org/D20655.

The stats(3) framework is disabled by default for now, except
in the NOTES kernel (for QA); it is expected to be enabled
in amd64 GENERIC after a cool down period.

Reviewed by:	sef (earlier version)
Obtained from:	Netflix
Relnotes:	yes
Sponsored by:	Klara Inc, Netflix
Differential Revision:	https://reviews.freebsd.org/D20477
2019-10-07 19:05:05 +00:00
Mateusz Guzik
dc20b834ca vfs: add optional root vnode caching
Root vnodes looekd up all the time, e.g. when crossing a mount point.
Currently used routines always perform a costly lookup which can be
trivially avoided.

Reviewed by:	jeff (previous version), kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21646
2019-10-06 22:14:32 +00:00