These sys/cdefs.h are not needed. Purge them. They are mostly left-over
from the $FreeBSD$ removal. A few in libc are still required for macros
that cdefs.h defines. Keep those.
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D42385
(cherry picked from commit 559a218c9b257775fb249b67945fe4a05b7a6b9f)
When memcmp(a, b, len) (or equally, bcmp) is called with a phony length
such that a + len < a, the code would malfunction and not compare the
two buffers correctly. While such arguments are illegal (buffers do not
wrap around the end of the address space), it is neverthless conceivable
that people try things like memcmp(a, b, SIZE_MAX) to compare a and b
until the first mismatch, in the knowledge that such a mismatch exists,
expecting memcmp() to stop comparing somewhere around the mismatch.
While memcmp() is usually written to confirm to this assumption, no
version of ISO/IEC 9899 guarantees this behaviour (in contrast to
memchr() for which it is).
Neverthless it appears sensible to at least not grossly misbehave on
phony lengths. This change hardens memcmp() against this case by
comparing at least until the end of the address space if a + len
overflows a 64 bit integer.
Sponsored by: The FreeBSD Foundation
Approved by: mjg (blanket, via IRC)
See also: b2618b651b28fd29e62a4e285f5be09ea30a85d4
MFC after: 1 week
(cherry picked from commit 953b93cf24d8871c62416c9bcfca935f1f1853b6)
Now that we have an optimised memchr(3), we can use it to implement
strnlen(3) with better perofrmance.
Sponsored by: The FreeBSD Foundation
Approved by: mjg
MFC after: 1 week
MFC to: stable/14
Differential Revision: https://reviews.freebsd.org/D41598
(cherry picked from commit 331737281c1929c29e679e48783055351ac4fbd9)
This is conceptually similar to strchr(3), but there are
slight changes to account for the buffer having an explicit
buffer length.
this includes the bug fix from b2618b6.
Sponsored by: The FreeBSD Foundation
Reported by: yuri, des
Tested by: des
Approved by: mjg
MFC after: 1 week
MFC to: stable/14
PR: 273652
Differential Revision: https://reviews.freebsd.org/D41598
(cherry picked from commit de12a689fad271f5a2ba7c188b0b5fb5cabf48e7)
(cherry picked from commit b2618b651b28fd29e62a4e285f5be09ea30a85d4)
This is conceptually very similar to the strcspn(3) implementations
from D41557, but we can't do the fast paths the same way.
Sponsored by: The FreeBSD Foundation
Approved by: mjg
MFC after: 1 week
MFC to: stable/14
Differential Revision: https://reviews.freebsd.org/D41567
(cherry picked from commit 7084133cde6a58412d86bae9f8a55b86141fb304)
This changeset adds both a scalar and an x86-64-v2 implementation
of the strcspn(3) function to libc. A baseline implementation does not
appear to be feasible given the requirements of the function.
The scalar implementation is similar to the generic libc implementation,
but expands the bit set into a byte set to reduce latency, improving
performance. This approach could probably be backported to the generic
C version to benefit other platforms.
The x86-64-v2 implementation is built around the infamous pcmpistri
instruction. An alternative implementation based on the Muła/Langdale
algorithm [1] was prototyped, but performed worse than the pcmpistri
approach except for sets of more than 16 characters with long input
strings.
All implementations provide special cases for the empty set (reduces to
strlen as well as single-character sets (reduces to strchr). The
x86-64-v2 kernel falls back to the scalar implementation for sets of
more than 32 characters. This limit could be raised by additional
multiples of 16 through the use of additional pcmpistri code paths, but
I consider this case to be too rare to be of importance.
This includes the bug fix from 52d4a4d.
[1]: http://0x80.pl/articles/simd-byte-lookup.html
Sponsored by: The FreeBSD Foundation
Approved by: mjg
MFC after: 1 week
MFC to: stable/14
Differential Revision: https://reviews.freebsd.org/D41557
(cherry picked from commit 474408bb7933f0383a0da2b01e717bfe683ae77c)
(cherry picked from commit 52d4a4d4e0dedc72bc33082a3f84c2d0fd6f2cbb)
When the buffer is immediately preceeded by the character we
are looking for and begins with one higher than that character,
and the buffer is misaligned, a match was errorneously detected
in the first character. Fix this by changing the way we prevent
matches before the buffer from being detected: instead of
removing the corresponding bit from the 0x80..80 mask, set the
LSB of bytes before the buffer after xoring with the character we
look for.
The bug only affects amd64 with ARCHLEVEL=scalar (cf. simd(7)).
The change comes at a 2% performance impact for short strings
if ARCHLEVEL is set to scalar. The default configuration is not
affected.
os: FreeBSD
arch: amd64
cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
│ strchrnul.scalar.0.out │ strchrnul.scalar.2.out │
│ sec/op │ sec/op vs base │
Short 57.89µ ± 2% 59.08µ ± 1% +2.07% (p=0.030 n=20)
Mid 19.24µ ± 0% 19.73µ ± 0% +2.53% (p=0.000 n=20)
Long 11.03µ ± 0% 11.03µ ± 0% ~ (p=0.547 n=20)
geomean 23.07µ 23.43µ +1.53%
│ strchrnul.scalar.0.out │ strchrnul.scalar.2.out │
│ B/s │ B/s vs base │
Short 2.011Gi ± 2% 1.970Gi ± 1% -2.02% (p=0.030 n=20)
Mid 6.049Gi ± 0% 5.900Gi ± 0% -2.47% (p=0.000 n=20)
Long 10.56Gi ± 0% 10.56Gi ± 0% ~ (p=0.547 n=20)
geomean 5.045Gi 4.969Gi -1.50%
MFC to: stable/14
MFC after: 3 days
Approved by: mjg (blanket, via IRC), re (gjb)
Sponsored by: The FreeBSD Foundation
(cherry picked from commit 3d8ef251aa9dceabd57f7821a0e6749d35317db3)
This changeset adds a baseline implementation of memcmp and bcmp
for amd64. The same code is used for both functions with conditional
code were the behaviour differs (we need more precise output for the
memcmp case).
FreeBSD documents that memcmp returns the difference between the
mismatching characters. Slightly faster code would be possible could
we relax this requirement to the ISO/IEC 9899:1999 requirement of
merely returning a negative/positive integer or zero.
Performance is better than bionic and glibc, except for long strings
were the two are 13% faster. This could be because they use SSE4
ptest which we cannot use in a baseline kernel.
Sponsored by: The FreeBSD Foundation
Approved by: mjg
Differential Revision: https://reviews.freebsd.org/D41442
This commit adds a baseline implementation of stpcpy(3) for amd64.
It performs quite well in comparison to the previous scalar implementation
as well as agains bionic and glibc (though glibc is faster for very long
strings). Fiddle with the Makefile to also have strcpy(3) call into the
optimised stpcpy(3) code, fixing an oversight from D9841.
Sponsored by: The FreeBSD Foundation
Reviewed by: imp ngie emaste
Approved by: mjg kib
Fixes: D9841
Differential Revision: https://reviews.freebsd.org/D41349
This performs very well. x86-64-v3 and x86-64-v4 kernels were written,
too, but performed worse than the baseline kernel on short strings.
These may be added at a future point in time if the performance issues
can be fixed.
os: FreeBSD
arch: amd64
cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
│ strlen_scalar.out │ strlen_baseline.out │
│ B/s │ B/s vs base │
Short 1.667Gi ± 1% 2.676Gi ± 1% +60.55% (p=0.000 n=20)
Mid 5.459Gi ± 1% 8.756Gi ± 1% +60.39% (p=0.000 n=20)
Long 15.34Gi ± 0% 52.27Gi ± 0% +240.64% (p=0.000 n=20)
geomean 5.188Gi 10.70Gi +106.24%
Sponsored by: The FreeBSD Foundation
Approved by: kib
Reviewed by: mjg jrtc27
Differential Revision: https://reviews.freebsd.org/D40693
Add a framework for selecting from one of multiple implementations
of a function based on amd64 architecture level (cf. amd64 SysV
ABI supplement).
Sponsored by: The FreeBSD Foundation
Approved by: kib
Reviewed by: jrtc27
Differential Revision: https://reviews.freebsd.org/D40693
The SPDX folks have obsoleted the BSD-2-Clause-FreeBSD identifier. Catch
up to that fact and revert to their recommended match of BSD-2-Clause.
Discussed with: pfg
MFC After: 3 days
Sponsored by: Netflix
Enhanced REP MOVSB feature of CPUs starting from Ivy Bridge makes
REP MOVSB the fastest way to copy memory in most of cases. However
Intel Optimization Reference Manual says: "setting the DF to force
REP MOVSB to copy bytes from high towards low addresses will expe-
rience significant performance degradation". Measurements on Intel
Cascade Lake and Alder Lake, same as on AMD Zen3 show that it can
drop throughput to as low as 2.5-3.5GB/s, comparing to ~10-30GB/s
of REP MOVSQ or hand-rolled loop, used for non-ERMS CPUs.
This patch keeps ERMS use for forward ordered memory copies, but
removes it for backward overlapped moves where it does not work.
This is just a cosmetic sync with kernel, since libc does not use
ERMS at this time.
Reviewed by: mjg
MFC after: 2 weeks
Turns out clang converts "memcmp(foo, bar, len) == 0" and similar to
bcmp calls.
Reviewed by: emaste (previous version), jhb (previous version)
Differential Revision: https://reviews.freebsd.org/D34673
Preferably bcmp would just alias memcmp but there is build magic which
makes this problematic.
Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D28846
This is a tradeoff which saves jumps for smaller sizes while making
the 8-16 range slower (roughly in line with the other cases).
Tested with glibc test suite.
For example size 3 (most common with vfs namecache) (ops/s):
before: 407086026
after: 461391995
The regressed range of 8-16 (with 8 as example):
before: 540850489
after: 461671032
See the review for sample test results.
Reviewed by: kib (kernel part)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18401
Handling sizes of > 32 backwards will be updated later.
Reviewed by: kib (kernel part)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18387
For non-ERMS case the code used handle possible trailing bytes with
movsb first and then followed it up with movsq. This also happened
to alter how calculations were done for other cases.
Handle the tail with regular movs, just like when copying forward.
Use leaq to calculate the right offset from the get go, instead of
doing separate add and sub.
This adjusts the offset for non-rep cases so that they can be used
to handle the tail.
The routine is still a work in progress.
Sponsored by: The FreeBSD Foundation
Instead of jumping to locations which store the exact number of bytes,
use displacement to move the destination.
In particular the following clears an area between 8-16 (inclusive)
branch-free:
movq %r10,(%rdi)
movq %r10,-8(%rdi,%rcx)
For instance for rcx of 10 the second line is rdi + 10 - 8 = rdi + 2.
Writing 8 bytes starting at that offset overlaps with 6 bytes written
previously and writes 2 new, giving 10 in total.
Provides a nice win for smaller stores. Other ones are erratic depending
on the microarchitecture.
General idea taken from NetBSD (restricted use of the trick) and bionic
string functions (use for various ranges like in this patch).
Reviewed by: kib (previous version)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17660
- tidy up memset to have rax set earlier for small sizes
- finish the tail in memset with an overlapping store
- align memset buffers to 16 bytes before using rep stos
Sponsored by: The FreeBSD Foundation
The function is of limited use and is an almost a direct clone of
memmove/memcpy (with arguments swapped). Introduction of ERMS variants
of string routines would mean avoidable growth of libc.
bcopy will get redefined to a __builtin_memmove later on with this
symbol only left for compatibility.
Reviewed by: kib
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17539
bcopy is left alone as it is expected to be converted to a C func.
Due to header mess ALIGN_TEXT is temporarily defined explicitly in memmove.S
Reviewed by: kib
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17538
See r339205 for details.
An unused ERMS support is retained in the macro. It will be activated
after ifunc support lands.
Reviewed by: kib
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17405
This is a depessimization, see r334537 for an explanation. Routines
remain significantly slower than they have to be.
bzero was removed from the kernel but remains in libc. Macroify to
accommodate differences to memset (no return value, always setting to 0).
The bzero.S file is left in place due to libc build magic which pulls in
a C variant if a matching .S file is missing.
Reviewed by: kib
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17355
Both are significantly slower than hand-coded loops. See r338963 for
kernel commit.
bcmp differs from memcmp by always returning 1 when a difference is
found, as opposed to going for a value bigger or lower than 0
depending on what it is. This means it can do less work. For now the
code is duplicated and modified. This will get deduplicated after
another round of optimization when memcmp will get a longer-term form.
Both tested with the glibc suite. While the suite does not have a test
for bcmp, I created a wrapper routine which verified that values match
(0 vs 0, 1 vs non-zero).
Reviewed by: kib
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17336
The change resembles what was done in r334537 for kernel routines.
While here take care of i386 variants. Note that primitives remain
suboptimal.
Reviewed by: kib (previous version)
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17167
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using mis-identified many licenses so this was mostly a manual - error
prone - task.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
- Remove .c files which duplicate entries in MISRCS.
- Use the same, less merge conflict prone style in all cases.
- Use MDSRCS for mips (.c and .S files both ended up in SRCS).
- Remove pointless sparc64 Makefile.inc.
- Remove uninformative foreign VCS ID entries.
Reviewed by: emaste, imp, jhb
MFC after: 1 week
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D9841