If dso uses initial exec TLS mode, rtld tries to allocate TLS in
static space. If there is no space left, the dlopen(3) fails. If space
if allocated, initial content from PT_TLS segment is distributed to
all threads' pcbs, which was missed and caused un-initialized TLS
segment for such dso after dlopen(3).
The mode is auto-detected either due to the relocation used, or if the
DF_STATIC_TLS dynamic flag is set. In the later case, the TLS segment
is tried to allocate earlier, which increases chance of the dlopen(3)
to succeed. LLD was recently fixed to properly emit the flag, ld.bdf
did it always.
Initial test by: dumbbell
Tested by: emaste (amd64), ian (arm)
Tested by: Gerald Aryeetey <aryeeteygerald_rogers.com> (arm64)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D19072
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using mis-identified many licenses so this was mostly a manual - error
prone - task.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
Clang emits SSE instructions on amd64 in the common path of
pthread_mutex_unlock. If the thread does not otherwise use SSE,
this usage incurs a context-switch of the FPU/SSE state, which
reduces the performance of multiple real-world applications by a
non-trivial amount (3-5% in one application).
Instead of this change, I experimented with eagerly switching the
FPU state at context-switch time. This did not help. Most of the
cost seems to be in the read/write of memory--as kib@ stated--and
not in the #NM handling. I tested on machines with and without
XSAVEOPT.
One counter-argument to this change is that most applications already
use SIMD, and the number of applications and amount of SIMD usage
are only increasing. This is absolutely true. I agree that--in
general and in principle--this change is in the wrong direction.
However, there are applications that do not use enough SSE to offset
the extra context-switch cost. SSE does not provide a clear benefit
in the current libthr code with the current compiler, but it does
provide a clear loss in some cases. Therefore, disabling SSE in
libthr is a non-loss for most, and a gain for some.
I refrained from disabling SSE in libc--as was suggested--because
I can't make the above argument for libc. It provides a wide variety
of code; each case should be analyzed separately.
https://lists.freebsd.org/pipermail/freebsd-current/2015-March/055193.html
Suggestions from: dim, jmg, rpaulo
Approved by: kib (mentor)
MFC after: 2 weeks
Sponsored by: Dell Inc.
The amd64, i386, and sparc64 versions were identical, with the one
difference where the former two used inline asm instead of _tcb_get. I
have compared the function before and after replacing the asm with _tcb_get
and found the object files to be identical.
The arm, mips, and powerpc versions were almost identical. The only
difference was the powerpc version used an alignment of 1 where arm and
mips used 16. As this is an increase in alignment is will be safe.
Along with this arm, mips, and powerpc all passed, when initial was true,
the value returned from _tcb_get as the first argument to
_rtld_allocate_tls. This would then return this pointer back to the caller.
We can remove these extra calls by checking if initial is set and setting
the thread control block directly. As this is what the sparc64 code does
we can use it directly.
As after these observations all the architectures can now have identical
code we can merge them into a common file.
Differential Revision: https://reviews.freebsd.org/D1556
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
versions of pthread_md.h have a special case of dereferencing a null
pointer. Clang warns about this with:
In file included from lib/libthr/arch/i386/i386/pthread_md.c:36:
lib/libthr/arch/i386/include/pthread_md.h:96:10: error: indirection of non-volatile null pointer will be deleted, not trap [-Werror,-Wnull-dereference]
return (TCB_GET32(tcb_self));
^~~~~~~~~~~~~~~~~~~
lib/libthr/arch/i386/include/pthread_md.h:73:13: note: expanded from:
: "m" (*(u_int *)(__tcb_offset(name)))); \
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
lib/libthr/arch/i386/include/pthread_md.h:96:10: note: consider using __builtin_trap() or qualifying pointer with 'volatile'
Since this indirection is done relative to the fs or gs segment, to
retrieve thread-specific data, it is an exception to the rule.
Therefore, add a volatile qualifier to tell the compiler we really want
to dereference a zero address.
MFC after: 1 week
For all libthr contexts, use ${MACHINE_CPUARCH}
for all libc contexts, use ${MACHINE_ARCH} if it exists, otherwise use
${MACHINE_CPUARCH}
Move some common code up a layer (the .PATH statement was the same in
all the arch submakefiles).
# Hope she hasn't busted powerpc64 with this...
returns errno, because errno can be mucked by user's signal handler and
most of pthread api heavily depends on errno to be correct, this change
should improve stability of the thread library.
1. fast simple type mutex.
2. __thread tls works.
3. asynchronous cancellation works ( using signal ).
4. thread synchronization is fully based on umtx, mainly, condition
variable and other synchronization objects were rewritten by using
umtx directly. those objects can be shared between processes via
shared memory, it has to change ABI which does not happen yet.
5. default stack size is increased to 1M on 32 bits platform, 2M for
64 bits platform.
As the result, some mysql super-smack benchmarks show performance is
improved massivly.
Okayed by: jeff, mtm, rwatson, scottl