Even with old_snapshot_threshold = -1 (which disables the "snapshot
too old" feature), performance regressions were seen at moderate to
high concurrency. For example, a one-socket, four-core system
running 200 connections at saturation could see up to a 2.3%
regression, with larger regressions possible on NUMA machines.
By inlining the early (smaller, faster) tests in the
TestForOldSnapshot() function, the i7 case dropped to a 0.2%
regression, which could easily just be noise, and is clearly an
improvement. Further testing will show whether more is needed.
Commit 36a35c550a turned the interface between ginPlaceToPage and
its subroutines in gindatapage.c and ginentrypage.c into a royal mess:
page-update critical sections were started in one place and finished in
another place not even in the same file, and the very same subroutine
might return having started a critical section or not. Subsequent patches
band-aided over some of the problems with this design by making things
even messier.
One user-visible resulting problem is memory leaks caused by the need for
the subroutines to allocate storage that would survive until ginPlaceToPage
calls XLogInsert (as reported by Julien Rouhaud). This would not typically
be noticeable during retail index updates. It could be visible in a GIN
index build, in the form of memory consumption swelling to several times
the commanded maintenance_work_mem.
Another rather nasty problem is that in the internal-page-splitting code
path, we would clear the child page's GIN_INCOMPLETE_SPLIT flag well before
entering the critical section that it's supposed to be cleared in; a
failure in between would leave the index in a corrupt state. There were
also assorted coding-rule violations with little immediate consequence but
possible long-term hazards, such as beginning an XLogInsert sequence before
entering a critical section, or calling elog(DEBUG) inside a critical
section.
To fix, redefine the API between ginPlaceToPage() and its subroutines
by splitting the subroutines into two parts. The "beginPlaceToPage"
subroutine does what can be done outside a critical section, including
full computation of the result pages into temporary storage when we're
going to split the target page. The "execPlaceToPage" subroutine is called
within a critical section established by ginPlaceToPage(), and it handles
the actual page update in the non-split code path. The critical section,
as well as the XLOG insertion call sequence, are both now always started
and finished in ginPlaceToPage(). Also, make ginPlaceToPage() create and
work in a short-lived memory context to eliminate the leakage problem.
(Since a short-lived memory context had been getting created in the most
common code path in the subroutines, this shouldn't cause any noticeable
performance penalty; we're just moving the overhead up one call level.)
In passing, fix a bunch of comments that had gone unmaintained throughout
all this klugery.
Report: <571276DD.5050303@dalibo.com>
The reverted changes were intended to force a choice of whether any
newly-added BufferGetPage() calls needed to be accompanied by a
test of the snapshot age, to support the "snapshot too old"
feature. Such an accompanying test is needed in about 7% of the
cases, where the page is being used as part of a scan rather than
positioning for other purposes (such as DML or vacuuming). The
additional effort required for back-patching, and the doubt whether
the intended benefit would really be there, have indicated it is
best just to rely on developers to do the right thing based on
comments and existing usage, as we do with many other conventions.
This change should have little or no effect on generated executable
code.
Motivated by the back-patching pain of Tom Lane and Robert Haas
Per discussion, there doesn't seem to be much value in having
NUM_SPINLOCK_SEMAPHORES set to 1024: under any scenario where you are
running more than a few backends concurrently, you really had better have a
real spinlock implementation if you want tolerable performance. And 1024
semaphores is a sizable fraction of the system-wide SysV semaphore limit
on many platforms. Therefore, reduce this setting's default value to 128
to make it less likely to cause out-of-semaphores problems.
The previous display was sort of confusing, because it didn't
distinguish between the number of workers that we planned to launch
and the number that actually got launched. This has already confused
several people, so display both numbers and label them clearly.
Julien Rouhaud, reviewed by me.
pg_xlogdump includes bufmgr.h. With a compiler that emits code for
static inline functions even when they're unreferenced, that leads
to unresolved external references in the new static-inline version
of BufferGetPage(). So hide it with #ifndef FRONTEND, as we've done
for similar issues elsewhere. Per buildfarm member pademelon.
My previous attempt at doing so, in 80abbeba23, was not sufficient. While that
fixed the problem for bufmgr.c and lwlock.c , s_lock.c still has non-constant
expressions in the struct initializer, because the file/line/function
information comes from the caller of s_lock().
Give up on using a macro, and use a static inline instead.
Discussion: 4369.1460435533@sss.pgh.pa.us
Commit 314cbfc5da redefined the signature of this hook as
typedef int (*walrcv_receive_type) (char **buffer, int *wait_fd);
But in fact the type of the "wait_fd" variable ought to be pgsocket,
which is what WaitLatchOrSocket expects, and which is necessary if
we want to be able to assign PGINVALID_SOCKET to it on Windows.
So fix that.
Logical messages, added in 3fe3511d05, during decoding failed to filter
messages emitted in other databases and messages emitted "under" a
replication origin the output plugin isn't interested in.
Add tests to verify that both types of filtering actually work. While
touching message.sql remove hunk obsoleted by d25379e.
Bump XLOG_PAGE_MAGIC because xl_logical_message changed and because
3fe3511d05 had omitted doing so. 3fe3511d05 additionally didn't bump
catversion, but 7a542700d has done so since.
Author: Petr Jelinek
Reported-By: Andres Freund
Discussion: 20160406142513.wotqy3ba3kanr423@alap3.anarazel.de
The current definition of init_spin_delay (introduced recently in
48354581a) wasn't C89 compliant. It's not legal to refer to refer to
non-constant expressions, and the ptr argument was one. This, as
reported by Tom, lead to a failure on buildfarm animal pademelon.
The pointer, especially on system systems with ASLR, isn't super helpful
anyway, though. So instead of making init_spin_delay into an inline
function, make s_lock_stuck() report the function name in addition to
file:line and change init_spin_delay() accordingly. While not a direct
replacement, the function name is likely more useful anyway (line
numbers are often hard to interpret in third party reports).
This also fixes what file/line number is reported for waits via
s_lock().
As PG_FUNCNAME_MACRO is now used outside of elog.h, move it to c.h.
Reported-By: Tom Lane
Discussion: 4369.1460435533@sss.pgh.pa.us
The recent patch to make Pin/UnpinBuffer lockfree in the hot
path (48354581a), accidentally used pg_atomic_fetch_or_u32() in
MarkLocalBufferDirty(). Other code operating on local buffers was
careful to only use pg_atomic_read/write_u32 which just read/write from
memory; to avoid unnecessary overhead.
On its own that'd just make MarkLocalBufferDirty() slightly less
efficient, but in addition InitLocalBuffers() doesn't call
pg_atomic_init_u32() - thus the spinlock fallback for the atomic
operations isn't initialized. That in turn caused, as reported by Tom,
buildfarm animal gaur to fail. As those errors are actually useful
against this type of error, continue to omit - intentionally this time -
initialization of the atomic variable.
In addition, add an explicit note about only using pg_atomic_read/write
on local buffers's state to BufferDesc's description.
Reported-By: Tom Lane
Discussion: 1881.1460431476@sss.pgh.pa.us
It's silly to define these counts as narrower than they might someday
need to be. Also, I believe that the BLCKSZ * nflush calculation in
mdwriteback was capable of overflowing an int.
I've seen one too many "could not bind IPv4 socket: No error" log entries
from the Windows buildfarm members. Per previous discussion, this is
likely caused by the fact that we're doing nothing to translate
WSAGetLastError() to errno. Put in a wrapper layer to do that.
If this works as expected, it should get back-patched, but let's see what
happens in the buildfarm first.
Discussion: <4065.1452450340@sss.pgh.pa.us>
The original patch kind of ignored the fact that we were doing something
different from a costing point of view, but nobody noticed. This patch
fixes that oversight.
David Rowley
Per discussion, this gives potential users of the hook more flexibility,
because they can build custom Paths that implement only one stage of
upper processing atop core-provided Paths for earlier stages.
Rename this function to GenericXLogRegisterBuffer() to make it clearer
what it does, and leave room for other sorts of "register" actions in
future. Also, replace its "bool isNew" argument with an integer flags
argument, so as to allow adding more flags in future without an API
break.
Alexander Korotkov, adjusted slightly by me
I was initially concerned that the some of the hundreds of
references to BufferGetPage() where the literal
BGP_NO_SNAPSHOT_TEST were passed might not optimize as well as a
macro, leading to some hard-to-find performance regressions in
corner cases. Inspection of disassembled code has shown identical
code at all inspected locations, and the size difference doesn't
amount to even one byte per such call. So make it readable.
Per gripes from Álvaro Herrera and Tom Lane
Previously we used a spinlock, in adition to the atomically manipulated
->state field, to protect the wait queue. But it's pretty simple to
instead perform the locking using a flag in state.
Due to 6150a1b0 BufferDescs, on platforms (like PPC) with > 1 byte
spinlocks, increased their size above 64byte. As 64 bytes are the size
we pad allocated BufferDescs to, this can increase false sharing;
causing performance problems in turn. Together with the previous commit
this reduces the size to <= 64 bytes on all common platforms.
Author: Andres Freund
Discussion: CAA4eK1+ZeB8PMwwktf+3bRS0Pt4Ux6Rs6Aom0uip8c6shJWmyg@mail.gmail.com20160327121858.zrmrjegmji2ymnvr@alap3.anarazel.de
Pinning/Unpinning a buffer is a very frequent operation; especially in
read-mostly cache resident workloads. Benchmarking shows that in various
scenarios the spinlock protecting a buffer header's state becomes a
significant bottleneck. The problem can be reproduced with pgbench -S on
larger machines, but can be considerably worse for queries which touch
the same buffers over and over at a high frequency (e.g. nested loops
over a small inner table).
To allow atomic operations to be used, cram BufferDesc's flags,
usage_count, buf_hdr_lock, refcount into a single 32bit atomic variable;
that allows to manipulate them together using 32bit compare-and-swap
operations. This requires reducing MAX_BACKENDS to 2^18-1 (which could
be lifted by using a 64bit field, but it's not a realistic configuration
atm).
As not all operations can easily implemented in a lockfree manner,
implement the previous buf_hdr_lock via a flag bit in the atomic
variable. That way we can continue to lock the header in places where
it's needed, but can get away without acquiring it in the more frequent
hot-paths. There's some additional operations which can be done without
the lock, but aren't in this patch; but the most important places are
covered.
As bufmgr.c now essentially re-implements spinlocks, abstract the delay
logic from s_lock.c into something more generic. It now has already two
users, and more are coming up; there's a follupw patch for lwlock.c at
least.
This patch is based on a proof-of-concept written by me, which Alexander
Korotkov made into a fully working patch; the committed version is again
revised by me. Benchmarking and testing has, amongst others, been
provided by Dilip Kumar, Alexander Korotkov, Robert Haas.
On a large x86 system improvements for readonly pgbench, with a high
client count, of a factor of 8 have been observed.
Author: Alexander Korotkov and Andres Freund
Discussion: 2400449.GjM57CE0Yg@dinodell
This routine is unsafe as implemented, because it invalidates the page
image pointers returned by previous GenericXLogRegister() calls.
Rather than complicate the API or the implementation to avoid that,
let's just get rid of it; the use-case for having it seems much
too thin to justify a lot of work here.
While at it, do some wordsmithing on the SGML docs for generic WAL.
\crosstabview is a completely different way to display results from a
query: instead of a vertical display of rows, the data values are placed
in a grid where the column and row headers come from the data itself,
similar to a spreadsheet.
The sort order of the horizontal header can be specified by using
another column in the query, and the vertical header determines its
ordering from the order in which they appear in the query.
This only allows displaying a single value in each cell. If more than
one value correspond to the same cell, an error is thrown. Merging of
values can be done in the query itself, if necessary. This may be
revisited in the future.
Author: Daniel Verité
Reviewed-by: Pavel Stehule, Dean Rasheed
Previously bcac23d exposed a subset of support functions, namely the
ones Kaigai found useful. In
20160304193704.elq773pyg5fyl3mi@alap3.anarazel.de I mentioned that
there's some functions missing to use the facility in an external
project.
To avoid having to add functions piecemeal, add all the functions which
are used to define READ_* and WRITE_* macros; users of the extensible
node functionality are likely to need these. Additionally expose
outDatum(), which doesn't have it's own WRITE_ macro, as it needs
information from the embedding struct.
Discussion: 20160304193704.elq773pyg5fyl3mi@alap3.anarazel.de
This creates an initial set of default roles which administrators may
use to grant access to, historically, superuser-only functions. Using
these roles instead of granting superuser access reduces the number of
superuser roles required for a system. Documention for each of the
default roles has been added to user-manag.sgml.
Bump catversion to 201604082, as we had a commit that bumped it to
201604081 and another that set it back to 201604071...
Reviews by José Luis Tallón and Robert Haas
This will prevent users from creating roles which begin with "pg_" and
will check for those roles before allowing an upgrade using pg_upgrade.
This will allow for default roles to be provided at initdb time.
Reviews by José Luis Tallón and Robert Haas
This feature is controlled by a new old_snapshot_threshold GUC. A
value of -1 disables the feature, and that is the default. The
value of 0 is just intended for testing. Above that it is the
number of minutes a snapshot can reach before pruning and vacuum
are allowed to remove dead tuples which the snapshot would
otherwise protect. The xmin associated with a transaction ID does
still protect dead tuples. A connection which is using an "old"
snapshot does not get an error unless it accesses a page modified
recently enough that it might not be able to produce accurate
results.
This is similar to the Oracle feature, and we use the same SQLSTATE
and error message for compatibility.
This patch is a no-op patch which is intended to reduce the chances
of failures of omission once the functional part of the "snapshot
too old" patch goes in. It adds parameters for snapshot, relation,
and an enum to specify whether the snapshot age check needs to be
done for the page at this point. This initial patch passes NULL
for the first two new parameters and BGP_NO_SNAPSHOT_TEST for the
third. The follow-on patch will change the places where the test
needs to be made.
These parameters are available for SSPI authentication only, to make
it possible to make it behave more like "normal gssapi", while
making it possible to maintain compatibility.
compat_realm is on by default, but can be turned off to make the
authentication use the full Kerberos realm instead of the NetBIOS name.
upn_username is off by default, and can be turned on to return the users
Kerberos UPN rather than the SAM-compatible name (a user in Active
Directory can have both a legacy SAM-compatible username and a new
Kerberos one. Normally they are the same, but not always)
Author: Christian Ullrich
Reviewed by: Robbie Harwood, Alvaro Herrera, me
Create a "bsd" auth method that works the same as "password" so far as
clients are concerned, but calls the BSD Authentication service to
check the password. This is currently only available on OpenBSD.
Marisa Emerson, reviewed by Thomas Munro
This allows parallel aggregation to use them. It may seem surprising
that we use float8_combine for both float4_accum and float8_accum
transition functions, but that's because those functions differ only
in the type of the non-transition-state argument.
Haribabu Kommi, reviewed by David Rowley and Tomas Vondra
As noticed by Tom Lane changing operation's number in commit
bb140506df causes on-disk format incompatibility.
Revert to previous numbering, that is reason to add special array to store
priorities of operation. Also it reverts order of tsquery to previous.
Author: Dmitry Ivanov
Now indexes (but only B-tree for now) can contain "extra" column(s) which
doesn't participate in index structure, they are just stored in leaf
tuples. It allows to use index only scan by using single index instead
of two or more indexes.
Author: Anastasia Lubennikova with minor editorializing by me
Reviewers: David Rowley, Peter Geoghegan, Jeff Janes
The code that estimates what parallel degree should be uesd for the
scan of a relation is currently rather stupid, so add a parallel_degree
reloption that can be used to override the planner's rather limited
judgement.
Julien Rouhaud, reviewed by David Rowley, James Sewell, Amit Kapila,
and me. Some further hacking by me.
The PAM_RHOST item is set to the remote IP address or host name and can
be used by PAM modules. A pg_hba.conf option is provided to choose
between IP address and resolved host name.
From: Grzegorz Sampolski <grzsmp@gmail.com>
Reviewed-by: Haribabu Kommi <kommi.haribabu@gmail.com>
We still use replacement selection for the first run of the sort only
and only when the number of tuples is relatively small. Otherwise,
the first run, and subsequent runs in all cases, are produced using
quicksort. This tends to be faster except perhaps for very small
amounts of working memory.
Peter Geoghegan, reviewed by Tomas Vondra, Jeff Janes, Mithun Cy,
Greg Stark, and me.
Contention on the relation extension lock can become quite fierce when
multiple processes are inserting data into the same relation at the same
time at a high rate. Experimentation shows the extending the relation
multiple blocks at a time improves scalability.
Dilip Kumar, reviewed by Petr Jelinek, Amit Kapila, and me.
In cases where joins use multiple columns we currently assess each join
separately causing gross mis-estimates for join cardinality.
This patch adds use of FK information for the first time into the
planner. When FKs are present and we have multi-column join information,
plan estimates will be drastically improved. Cases with multiple FKs
are handled, though partial matches are ignored currently.
Net effect is substantial performance improvements for joins in many
common cases. Additional planning time is isolated to cases that are
currently performing poorly, measured at 0.08 - 0.15 ms.
Please watch for planner performance regressions; circumstances seem
unlikely but the law of unintended consequences may apply somewhen.
Additional complex tests welcome to prove this before release.
Tests can be performed using SET enable_fkey_estimates = on | off
using scripts provided during Hackers discussions, message id:
552335D9.3090707@2ndquadrant.com
Authors: Tomas Vondra and David Rowley
Reviewed and tested by Simon Riggs, adding comments only
Patch introduces new text search operator (<-> or <DISTANCE>) into tsquery.
On-disk and binary in/out format of tsquery are backward compatible.
It has two side effect:
- change order for tsquery, so, users, who has a btree index over tsquery,
should reindex it
- less number of parenthesis in tsquery output, and tsquery becomes more
readable
Authors: Teodor Sigaev, Oleg Bartunov, Dmitry Ivanov
Reviewers: Alexander Korotkov, Artur Zakirov
Now that all of the infrastructure exists, add in the ability to
dump out the ACLs of the objects inside of pg_catalog or the ACLs
for objects which are members of extensions, but only if they have
been changed from their original values.
The original values are tracked in pg_init_privs. When pg_dump'ing
9.6-and-above databases, we will dump out the ACLs for all objects
in pg_catalog and the ACLs for all extension members, where the ACL
has been changed from the original value which was set during either
initdb or CREATE EXTENSION.
This should not change dumps against pre-9.6 databases.
Reviews by Alexander Korotkov, Jose Luis Tallon
This new catalog holds the privileges which the system was
initialized with at initdb time, along with any permissions set
by extensions at CREATE EXTENSION time. This allows pg_dump
(and any other similar use-cases) to detect when the privileges
set on initdb-created or extension-created objects have been
changed from what they were set to at initdb/extension-creation
time and handle those changes appropriately.
Reviews by Alexander Korotkov, Jose Luis Tallon
It inserts a new value into an jsonb array at arbitrary position or
a new key to jsonb object.
Author: Dmitry Dolgov
Reviewers: Petr Jelinek, Vitaly Burovoy, Andrew Dunstan