- Provide tunable vm.memguard.desc, so one can specify memory type without
changing the code and recompiling the kernel.
- Allow to use memguard for kernel modules by providing sysctl
vm.memguard.desc, which can be changed to short description of memory
type before module is loaded.
- Move as much memguard code as possible to memguard.c.
- Add sysctl node vm.memguard. and move memguard-specific sysctl there.
- Add malloc_desc2type() function for finding memory type based on its
short description (ks_shortdesc field).
- Memory type can be changed (via vm.memguard.desc sysctl) only if it
doesn't exist (will be loaded later) or when no memory is allocated yet.
If there is allocated memory for the given memory type, return EBUSY.
- Implement two ways of memory types comparsion and make safer/slower the
default.
statistics via a binary structure stream:
- Add structure 'malloc_type_stream_header', which defines a stream
version, definition of MAXCPUS used in the stream, and a number of
malloc_type records in the stream.
- Add structure 'malloc_type_header', which defines the name of the
malloc type being reported on.
- When the sysctl is queried, return a stream header, followed by a
series of type descriptions, each consisting of a type header
followed by a series of MAXCPUS malloc_type_stats structures holding
per-CPU allocation information. Typical values of MAXCPUS will be 1
(UP compiled kernel) and 16 (SMP compiled kernel).
This query mechanism allows user space monitoring tools to extract
memory allocation statistics in a machine-readable form, and to do so
at a per-CPU granularity, allowing monitoring of allocation patterns
across CPUs in order to better understand the distribution of work and
memory flow over multiple CPUs.
While here:
- Bump statistics width to uint64_t, and hard code using fixed-width
type in order to be more sure about structure layout in the stream.
We allocate and free a lot of memory.
- Add kmemcount, a counter of the number of registered malloc types,
in order to avoid excessive manual counting of types. Export via a
new sysctl to allow user-space code to better size buffers.
- De-XXX comment on no longer maintaining the high watermark in old
sysctl monitoring code.
A follow-up commit of libmemstat(3), a library to monitor kernel memory
allocation, will occur in the next few days. Likewise, similar changes
to UMA.
a nested include of param.h is required so that MAXCPU is visible to all
consumers of sys/malloc.h. In an earlier version of the patch, the
malloc_type_internal structure was only conditionally visible.
Pointed out by: delphij
allocators: a set of power-of-two UMA zones for small allocations, and the
VM page allocator for large allocations. In order to maintain unified
statistics for specific malloc types, kernel malloc maintains a separate
per-type statistics pool, which can be monitored using vmstat -m. Prior
to this commit, each pool of per-type statistics was protected using a
per-type mutex associated with the malloc type.
This change modifies kernel malloc to maintain per-CPU statistics pools
for each malloc type, and protects writing those statistics using critical
sections. It also moves to unsynchronized reads of per-CPU statistics
when generating coalesced statistics. To do this, several changes are
implemented:
- In the previous world order, the statistics memory was allocated by
the owner of the malloc type structure, allocated statically using
MALLOC_DEFINE(). This embedded the definition of the malloc_type
structure into all kernel modules. Move to a model in which a pointer
within struct malloc_type points at a UMA-allocated
malloc_type_internal data structure owned and maintained by
kern_malloc.c, and not part of the exported ABI/API to the rest of
the kernel. For the purposes of easing a possible MFC, re-use an
existing pointer in 'struct malloc_type', and maintain the current
malloc_type structure size, as well as layout with respect to the
fields reused outside of the malloc subsystem (such as ks_shortdesc).
There are several unused fields as a result of no longer requiring
the mutex in malloc_type.
- Struct malloc_type_internal contains an array of malloc_type_stats,
of size MAXCPU. The structure defined above avoids hard-coding a
kernel compile-time value of MAXCPU into kernel modules that interact
with malloc.
- When accessing per-cpu statistics for a malloc type, surround read -
modify - update requests with critical_enter()/critical_exit() in
order to avoid races during write. The per-CPU fields are written
only from the CPU that owns them.
- Per-CPU stats now maintained "allocated" and "freed" counters for
number of allocations/frees and bytes allocated/freed, since there is
no longer a coherent global notion of the totals. When coalescing
malloc stats, accept a slight race between reading stats across CPUs,
and avoid showing the user a negative allocation count for the type
in the event of a race. The global high watermark is no longer
maintained for a malloc type, as there is no global notion of the
number of allocations.
- While tearing up the sysctl() path, also switch to using sbufs. The
current "export as text" sysctl format is retained with the same
syntax. We may want to change this in the future to export more
per-CPU information, such as how allocations and frees are balanced
across CPUs.
This change results in a substantial speedup of kernel malloc and free
paths on SMP, as critical sections (where usable) out-perform mutexes
due to avoiding atomic/bus-locked operations. There is also a minor
improvement on UP due to the slightly lower cost of critical sections
there. The cost of the change to this approach is the loss of a
continuous notion of total allocations that can be exploited to track
per-type high watermarks, as well as increased complexity when
monitoring statistics.
Due to carefully avoiding changing the ABI, as well as hardening the ABI
against future changes, it is not necessary to recompile kernel modules
for this change. However, MFC'ing this change to RELENG_5 will require
also MFC'ing optimizations for soft critical sections, which may modify
exposed kernel ABIs. The internal malloc API is changed, and
modifications to vmstat in order to restore "vmstat -m" on core dumps will
follow shortly.
Several improvements from: bde
Statistics approach discussed with: ups
Tested by: scottl, others
improved chance of working despite pressure from running programs.
Instead of trying to throw a bunch of pages out to swap and hope for
the best, only a range that can potentially fulfill contigmalloc(9)'s
request will have its contents paged out (potentially, not forcibly)
at a time.
The new contigmalloc operation still operates in three passes, but it
could potentially be tuned to more or less. The first pass only looks
at pages in the cache and free pages, so they would be thrown out
without having to block. If this is not enough, the subsequent passes
page out any unwired memory. To combat memory pressure refragmenting
the section of memory being laundered, each page is removed from the
systems' free memory queue once it has been freed so that blocking
later doesn't cause the memory laundered so far to get reallocated.
The page-out operations are now blocking, as it would make little sense
to try to push out a page, then get its status immediately afterward
to remove it from the available free pages queue, if it's unlikely to
have been freed. Another change is that if KVA allocation fails, the
allocated memory segment will be freed and not leaked.
There is a sysctl/tunable, defaulting to on, which causes the old
contigmalloc() algorithm to be used. Nonetheless, I have been using
vm.old_contigmalloc=0 for over a month. It is safe to switch at
run-time to see the difference it makes.
A new interface has been used which does not require mapping the
allocated pages into KVA: vm_page.h functions vm_page_alloc_contig()
and vm_page_release_contig(). These are what vm.old_contigmalloc=0
uses internally, so the sysctl/tunable does not affect their operation.
When using the contigmalloc(9) and contigfree(9) interfaces, memory
is now tracked with malloc(9) stats. Several functions have been
exported from kern_malloc.c to allow other subsystems to use these
statistics, as well. This invalidates the BUGS section of the
contigmalloc(9) manpage.
where physical addresses larger than virtual addresses, such as i386s
with PAE.
- Use this to represent physical addresses in the MI vm system and in the
i386 pmap code. This also changes the paddr parameter to d_mmap_t.
- Fix printf formats to handle physical addresses >4G in the i386 memory
detection code, and due to kvtop returning vm_paddr_t instead of u_long.
Note that this is a name change only; vm_paddr_t is still the same as
vm_offset_t on all currently supported platforms.
Sponsored by: DARPA, Network Associates Laboratories
Discussed with: re, phk (cdevsw change)
type M_STRING, now defined in malloc.h. Useful when string parsing
must occur using the kernel strsep() and we want to avoid toasting
the source string.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
malloc(9) failed last time. This is intended to help code adjust
memory usage to the current circumstances.
A typical use could be:
if (malloc_last_fail() < 60)
reduce_cache_by_one();
allocated slabs and bucket caches for free items. It will not go ask the vm
for pages. This differs from M_NOWAIT in that it not only doesn't block, it
doesn't even ask.
- Add a new zcreate option ZONE_VM, that sets the BUCKETCACHE zflag. This
tells uma that it should only allocate buckets out of the bucket cache, and
not from the VM. It does this by using the M_NOVM option to zalloc when
getting a new bucket. This is so that the VM doesn't recursively enter
itself while trying to allocate buckets for vm_map_entry zones. If there
are already allocated buckets when we get here we'll still use them but
otherwise we'll skip it.
- Use the ZONE_VM flag on vm map entries and pv entries on x86.
mallochash. Mallochash is going to go away as soon as I introduce the
kfree/kmalloc api and partially overhaul the malloc wrapper. This can't happen
until all users of the malloc api that expect memory to be aligned on the size
of the allocation are fixed.
Updated the kmemzones logic such that the ks_size bitmap can be used as an
index into it to report the size of the zone used.
Create the kern.malloc sysctl which replaces the kvm mechanism to report
similar data. This will provide an easy place for statistics aggregation if
malloc_type statistics become per cpu data.
Add some code ifdef'd under MALLOC_PROFILING to facilitate a tool for sizing
the malloc buckets.
not removing tabs before "__P((", and not outdenting continuation lines
to preserve non-KNF lining up of code with parentheses. Switch to KNF
formatting and/or rewrap the whole prototype in some cases.
- Callers of asleep() and await() have been converted to calling tsleep().
The only caller outside of M_ASLEEP was the ata driver, which called both
asleep() and await() with spl-raised, so there was no need for the
asleep() and await() pair. M_ASLEEP was unused.
Reviewed by: jasone, peter
With this flag set malloc() will panic if memory allocation failed.
This usable only in critical places where failed allocation is fatal.
Reviewed by: peter
Instead of:
foo = malloc(sizeof(foo), M_WAIT);
bzero(foo, sizeof(foo));
You can now (and please do) use:
foo = malloc(sizeof(foo), M_WAIT | M_ZERO);
In the future this will enable us to do idle-time pre-zeroing of
malloc-space.
Order the SYSINIT() for MALLOC_DEFINE() correctly so that malloc()
doesn't have to waste time initializing itself. The
(SI_SUB_KMEM, SI_ORDER_ANY) order was shared with syscons' SYSINIT()
for scmeminit(), and scmeminit() calls malloc(), so malloc()
initialization was not always complete on the first call to malloc().
kern/kern_malloc.c:
- Removed self-initialization in malloc().
- Removed half-baked sanity check in free(). Trust MALLOC_DEFINE().
in the dysfunctional !KMEMSTATS case. This hasn't compiled since
rev.1.31 of kern_malloc.c quietly removed the core of the support
for the !KMEMSTATS case. I fixed it to see if it was worth saving
and found that (as usual) inlining just wasted space and increased
complexity without significantly affecting time, at least for the
lmbench2 micro-benchmark on a Celeron. The space bloat was
surprisingly large - the text size increased from 1700K to 1840K
for a version with the entire malloc() family inlined.
Removed even older garbage (kmemxtob() and btokmemx() macros).
Attempt to deprecate MALLOC() and FREE(). Given current compilers
(gcc-2.x or C99), they don't do anything that (safe) function-like
macros or inline functions named malloc() and free() couldn't do.
Fixed missing casts of macro args in MALLOC() and FREE().
is an application space macro and the applications are supposed to be free
to use it as they please (but cannot). This is consistant with the other
BSD's who made this change quite some time ago. More commits to come.
of the number of times a particular type has been used. It's rather easy
to overflow. One site I'm looking at seems to do it in a matter of days.
On the Alpha this is a no-op since 'long' is 64 bit already. The sole
user of this interface seems to be vmstat -m and friends which will need
a recompile. The overheads of using a 64 bit int should be pretty light
as the kernel just does "calls++" type operations and that's it.
changes to the VM system to support the new swapper, VM bug
fixes, several VM optimizations, and some additional revamping of the
VM code. The specific bug fixes will be documented with additional
forced commits. This commit is somewhat rough in regards to code
cleanup issues.
Reviewed by: "John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>
removed at module unload (if in a module of course).
However; this introduces a new dependency on <sys/kernel.h> for things
that use MALLOC_DECLARE(). Bruce told me it is better to add sys/kernel.h
to the handful of files that need it rather than add an extra include to
sys/malloc.h for kernel compiles. Updates to follow in subsequent commits.