System calls mixed up in error code paths are causing two issues which
several code paths have not correctly handled:
1) For write() calls, sometimes the system may return less bytes than
what has been written without errno being set. Some paths were careful
enough to consider that case, and assumed that errno should be set to
ENOSPC, other calls missed that.
2) errno generated by a system call is overwritten by other system calls
which may succeed once an error code path is taken, causing what is
reported to the user to be incorrect.
This patch uses the brute-force approach of correcting all those code
paths. Some refactoring could happen in the future, but this is let as
future work, which is not targeted for back-branches anyway.
Author: Michael Paquier
Reviewed-by: Ashutosh Sharma
Discussion: https://postgr.es/m/20180622061535.GD5215@paquier.xyz
If a walsender exits leaving data in reorderbuffers, the next walsender
that tries to decode the same transaction would append its decoded data
in the same spill files without truncating it first, which effectively
duplicate the data. Avoid that by removing any leftover reorderbuffer
spill files when a walsender starts.
Backpatch to 9.4; this bug has been there from the very beginning of
logical decoding.
Author: Craig Ringer, revised by me
Reviewed by: Álvaro Herrera, Petr Jelínek, Masahiko Sawada
Logical decoding's reorderbuffer.c may spill transaction files to disk
when transactions are large. These are supposed to be removed when they
become "too old" by xid; but file removal requires the boundary LSNs of
the transaction to be known. The final_lsn is only set when we see the
commit or abort record for the transaction, but nothing sets the value
for transactions that crash, so the removal code misbehaves -- in
assertion-enabled builds, it crashes by a failed assertion.
To fix, modify the final_lsn of transactions that don't have a value
set, to the LSN of the very latest change in the transaction. This
causes the spilled files to be removed appropriately.
Author: Atsushi Torikoshi
Reviewed-by: Kyotaro HORIGUCHI, Craig Ringer, Masahiko Sawada
Discussion: https://postgr.es/m/54e4e488-186b-a056-6628-50628e4e4ebc@lab.ntt.co.jp
The logical slots have a fast code path for sending data so as not to
impose too high a per message overhead. The fast path skips checks for
interrupts and timeouts. However, the existing coding failed to consider
the fact that a transaction with a large number of changes may take a
very long time to be processed and sent to the client. This causes the
walsender to ignore interrupts for potentially a long time and more
importantly it will result in the walsender being killed due to
timeout at the end of such a transaction.
This commit changes the fast path to also check for interrupts and only
allows calling the fast path when the last keepalive check happened less
than half the walsender timeout ago. Otherwise the slower code path will
be taken.
Backpatched to 9.4
Petr Jelinek, reviewed by Kyotaro HORIGUCHI, Yura Sokolov, Craig
Ringer and Robert Haas.
Discussion: https://postgr.es/m/e082a56a-fd95-a250-3bae-0fff93832510@2ndquadrant.com
1. Since commit b1a9bad9e7 we had pstrdup() inside a
spinlock-protected critical section; reported by Andreas Seltenreich.
Turn those into strlcpy() to stack-allocated variables instead.
Backpatch to 9.6.
2. Since commit 9ed551e0a4 we had a pfree() uselessly inside a
spinlock-protected critical section. Tom Lane noticed in code review.
Move down. Backpatch to 9.6.
3. Since commit 64233902d2 we had GetCurrentTimestamp() (a kernel
call) inside a spinlock-protected critical section. Tom Lane noticed in
code review. Move it up. Backpatch to 9.2.
4. Since commit 1bb2558046 we did elog(PANIC) while holding spinlock.
Tom Lane noticed in code review. Release spinlock before dying.
Backpatch to 9.2.
Discussion: https://postgr.es/m/87h8vhtgj2.fsf@ansel.ydns.eu
The callers for GetOldestSafeDecodingTransactionId() all inverted the
argument for the argument introduced in 2bef06d516. Luckily this
appears to be inconsequential for the moment, as we wait for
concurrent in-progress transaction when assembling a
snapshot. Additionally this could only make a difference when adding a
second logical slot, because only a pre-existing slot could cause an
issue by lowering the returned xid dangerously much.
Reported-By: Antonin Houska
Discussion: https://postgr.es/m/32704.1496993134@localhost
Backport: 9.4-, where 2bef06d516 was backpatched to.
Commit 14e8803f1 removed the locking in SyncRepWaitForLSN, but that
introduced a race condition, where SyncRepWaitForLSN might see
syncRepState already set to SYNC_REP_WAIT_COMPLETE, but the process was
not yet removed from the queue. That tripped the assertion, that the
process should no longer be in the uqeue. Reorder the operations in
SyncRepWakeQueue to remove the process from the queue first, and update
syncRepState only after that, and add a memory barrier in between to make
sure the operations are made visible to other processes in that order.
Fixes bug #14721 reported by Const Zhang. Analysis and fix by Thomas Munro.
Backpatch down to 9.5, where the locking was removed.
Discussion: https://www.postgresql.org/message-id/20170629023623.1480.26508%40wrigleys.postgresql.org
It's possible for WalSndWaitForWal to be asked to wait for WAL that doesn't
exist yet. That's fine, in fact it's the normal situation if we're caught
up; but when the client requests shutdown we should not keep waiting.
The previous coding could wait indefinitely if the source server was idle.
In passing, improve the rather weak comments in this area, and slightly
rearrange some related code for better readability.
Back-patch to 9.4 where this code was introduced.
Discussion: https://postgr.es/m/14154.1498781234@sss.pgh.pa.us
When, during logical decoding, a transaction gets too big, it's
contents get spilled to disk. Not just the top-transaction gets
spilled, but *also* all of its subtransactions, even if they're not
that large themselves. Unfortunately we didn't clean up
such small spilled subtransactions from disk.
Fix that, by keeping better track of whether a transaction has been
spilled to disk.
Author: Andres Freund
Reported-By: Dmitriy Sarafannikov, Fabrízio de Royes Mello
Discussion:
https://postgr.es/m/1457621358.355011041@f382.i.mail.ruhttps://postgr.es/m/CAFcNs+qNMhNYii4nxpO6gqsndiyxNDYV0S=JNq0v_sEE+9PHXg@mail.gmail.com
Backpatch: 9.4-, where logical decoding was introduced
We had three occurrences of essentially the same coding pattern
wherein we tried to retrieve a query result from a libpq connection
without blocking. In the case where PQconsumeInput failed (typically
indicating a lost connection), all three loops simply gave up and
returned, forgetting to clear any previously-collected PGresult
object. Since those are malloc'd not palloc'd, the oversight results
in a process-lifespan memory leak.
One instance, in libpqwalreceiver, is of little significance because
the walreceiver process would just quit anyway if its connection fails.
But we might as well fix it.
The other two instances, in postgres_fdw, are somewhat more worrisome
because at least in principle the scenario could be repeated, allowing
the amount of memory leaked to build up to something worth worrying
about. Moreover, in these cases the loops contain CHECK_FOR_INTERRUPTS
calls, as well as other calls that could potentially elog(ERROR),
providing another way to exit without having cleared the PGresult.
Here we need to add PG_TRY logic similar to what exists in quite a
few other places in postgres_fdw.
Coverity noted the libpqwalreceiver bug; I found the other two cases
by checking all calls of PQconsumeInput.
Back-patch to all supported versions as appropriate (9.2 lacks
postgres_fdw, so this is really quite unexciting for that branch).
Discussion: https://postgr.es/m/22620.1497486981@sss.pgh.pa.us
Because walsender and normal backends share the same main loop it's
problematic to have two different flag variables, set in signal
handlers, indicating a pending configuration reload. Only certain
walsender commands reach code paths checking for the
variable (START_[LOGICAL_]REPLICATION, CREATE_REPLICATION_SLOT
... LOGICAL, notably not base backups).
This is a bug present since the introduction of walsender, but has
gotten worse in releases since then which allow walsender to do more.
A later patch, not slated for v10, will similarly unify SIGHUP
handling in other types of processes as well.
Author: Petr Jelinek, Andres Freund
Reviewed-By: Michael Paquier
Discussion: https://postgr.es/m/20170423235941.qosiuoyqprq4nu7v@alap3.anarazel.de
Backpatch: 9.2-, bug is present since 9.0
When the checkpointer writes the shutdown checkpoint, it checks
afterwards whether any WAL has been written since it started and
throws a PANIC if so. At that point, only walsenders are still
active, so one might think this could not happen, but walsenders can
also generate WAL, for instance in BASE_BACKUP and logical decoding
related commands (e.g. via hint bits). So they can trigger this panic
if such a command is run while the shutdown checkpoint is being
written.
To fix this, divide the walsender shutdown into two phases. First,
checkpointer, itself triggered by postmaster, sends a
PROCSIG_WALSND_INIT_STOPPING signal to all walsenders. If the backend
is idle or runs an SQL query this causes the backend to shutdown, if
logical replication is in progress all existing WAL records are
processed followed by a shutdown. Otherwise this causes the walsender
to switch to the "stopping" state. In this state, the walsender will
reject any further replication commands. The checkpointer begins the
shutdown checkpoint once all walsenders are confirmed as
stopping. When the shutdown checkpoint finishes, the postmaster sends
us SIGUSR2. This instructs walsender to send any outstanding WAL,
including the shutdown checkpoint record, wait for it to be replicated
to the standby, and then exit.
Author: Andres Freund, based on an earlier patch by Michael Paquier
Reported-By: Fujii Masao, Andres Freund
Reviewed-By: Michael Paquier
Discussion: https://postgr.es/m/20170602002912.tqlwn4gymzlxpvs2@alap3.anarazel.de
Backpatch: 9.4, where logical decoding was introduced
The non-participation in procsignal was a problem for both changes in
master, e.g. parallelism not working for normal statements run in
walsender backends, and older branches, e.g. recovery conflicts and
catchup interrupts not working for logical decoding walsenders.
This commit thus replaces the previous WalSndXLogSendHandler with
procsignal_sigusr1_handler. In branches since db0f6cad48 that can
lead to additional SetLatch calls, but that only rarely seems to make
a difference.
Author: Andres Freund
Reviewed-By: Michael Paquier
Discussion: https://postgr.es/m/20170421014030.fdzvvvbrz4nckrow@alap3.anarazel.de
Backpatch: 9.4, earlier commits don't seem to benefit sufficiently
Before 955a684e04 logical decoding snapshot maintenance needed to
cope with transactions it might not have seen in their entirety. For
such transactions we'd to assume they modified the catalog (could have
happened before we were watching), and thus a new snapshot had to be
built, and distributed to concurrently running transactions.
That's problematic because building a new snapshot isn't that cheap ,
especially as the the array of committed transactions needs to be
sorted. When creating a slot on a server with a lot of transactions,
this could make logical slot creation infeasibly expensive.
After 955a684e04 there's no need to deal with transaction that
aren't guaranteed to be fully observable. That allows to avoid
building snapshots for transactions that haven't modified catalog,
even before reaching consistency.
While this isn't necessarily a bugfix, slot creation being impossible
in some production workloads, is severe enough to warrant
backpatching.
Author: Andres Freund, based on a quite different patch from Petr Jelinek
Analyzed-By: Petr Jelinek
Reviewed-By: Petr Jelinek
Discussion: https://postgr.es/m/f37e975c-908f-858e-707f-058d3b1eb214@2ndquadrant.com
Backpatch: 9.4-, where logical decoding has been introduced
The snapshot assembly during the creation of logical slots relied
waiting for transactions in xl_running_xacts to end, by checking for
their commit/abort records. Unfortunately, despite locking, it is
possible to see an xl_running_xact record listing transactions as
ready, that have already WAL-logged an commit/abort record, as the
locking just prevents the ProcArray to be adjusted, and the commit
record has to be logged first.
That lead to either delayed or hanging snapshot creation, because
snapbuild.c would wait "forever" to see commit/abort records for some
transactions. That hang resolved only if a xl_running_xacts record
without any running transactions happened to be logged, far from
certain on a busy server.
It's impractical to prevent that via more heavyweight locking, the
likelihood of deadlocks and significantly increased contention would
be too big.
Instead change the initial snapshot creation to be solely based on
tracking the oldest running transaction via
xl_running_xacts->oldestRunningXid - that actually ends up
significantly simplifying the code. That has two disadvantages:
1) Because we cannot fully "trust" the contents of xl_running_xacts,
we cannot use it to build the initial snapshot. Instead we have to
wait twice for all running transactions to finish.
2) Previously a slot, unless the race occurred, could be created when
the all transaction perceived as running based on commit/abort
records, now we have to wait for the next xl_running_xacts record.
To address that, trigger logging new xl_running_xacts record from
within snapbuild.c exactly when necessary.
Unfortunately snabuild.c's SnapBuild is stored on disk, one of the
stupider ideas of a certain Mr Freund, so we can't change it in a
minor release. As this is going to be backpatched, we have to hack
around a bit to keep on-disk compatibility. A later commit will
rejigger that on master.
Author: Andres Freund, based on a quite different patch from Petr Jelinek
Analyzed-By: Petr Jelinek
Reviewed-By: Petr Jelinek
Discussion: https://postgr.es/m/f37e975c-908f-858e-707f-058d3b1eb214@2ndquadrant.com
Backpatch: 9.4-, where logical decoding has been introduced
Logical decoding stores historical snapshots on disk, so that logical
decoding can restart without having to reconstruct a snapshot from
scratch (for which the resources are not guaranteed to be present
anymore). These serialized snapshots were also used when creating a
new slot via the walsender interface, which can export a "full"
snapshot (i.e. one that can read all tables, not just catalog ones).
The problem is that the serialized snapshots are only useful for
catalogs and not for normal user tables. Thus the use of such a
serialized snapshot could result in an inconsistent snapshot being
exported, which could lead to queries returning wrong data. This
would only happen if logical slots are created while another logical
slot already exists.
Author: Petr Jelinek
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/f37e975c-908f-858e-707f-058d3b1eb214@2ndquadrant.com
Backport: 9.4, where logical decoding was introduced.
The logical decoding machinery already preserved all the required
catalog tuples, which is sufficient in the course of normal logical
decoding, but did not guarantee that non-catalog tuples were preserved
during computation of the initial snapshot when creating a slot over
the replication protocol.
This could cause a corrupted initial snapshot being exported. The
time window for issues is usually not terribly large, but on a busy
server it's perfectly possible to it hit it. Ongoing decoding is not
affected by this bug.
To avoid increased overhead for the SQL API, only retain additional
tuples when a logical slot is being created over the replication
protocol. To do so this commit changes the signature of
CreateInitDecodingContext(), but it seems unlikely that it's being
used in an extension, so that's probably ok.
In a drive-by fix, fix handling of
ReplicationSlotsComputeRequiredXmin's already_locked argument, which
should only apply to ProcArrayLock, not ReplicationSlotControlLock.
Reported-By: Erik Rijkers
Analyzed-By: Petr Jelinek
Author: Petr Jelinek, heavily editorialized by Andres Freund
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/9a897b86-46e1-9915-ee4c-da02e4ff6a95@2ndquadrant.com
Backport: 9.4, where logical decoding was introduced.
This seems to be largely cosmetic, avoiding valgrind bleats and the
like. The uninitialized padding influences the CRC of the on-disk
entry, but because it's also used when verifying the CRC, that doesn't
cause spurious failures. Backpatch nonetheless.
It's a bit unfortunate that contrib/test_decoding/sql/replorigin.sql
doesn't exercise the checkpoint path, but checkpoints are fairly
expensive on weaker machines, and we'd have to stop/start for that to
be meaningful.
Author: Andres Freund
Discussion: https://postgr.es/m/20170422183123.w2jgiuxtts7qrqaq@alap3.anarazel.de
Backpatch: 9.5, where replication origins were introduced
Walsender uses the local buffers for each outgoing and incoming message.
Previously when creating replication slot, walsender forgot to initialize
one of them and which can cause the segmentation fault error. To fix this
issue, this commit changes walsender so that it always initialize them
before it executes the requested replication command.
Back-patch to 9.4 where replication slot was introduced.
Problem report and initial patch by Stas Kelvich, modified by me.
Report: https://www.postgresql.org/message-id/A1E9CB90-1FAC-4CAD-8DBA-9AA62A6E97C5@postgrespro.ru
Hot_standby_feedback could be reset by reload and worked correctly, but if
the server was restarted rather than reloaded the xmin was not reset.
Force reset always if hot_standby_feedback is enabled at startup.
Ants Aasma, Craig Ringer
Reported-by: Ants Aasma
This was already fixed in HEAD as part of 6ad8ac60 but was not
backpatched.
Also change the way pg_xlog is handled to be the same as the other
directories.
Patch from me with pg_xlog addition from Michael Paquier, test updates
from David Steele.
The need for this was previously obscured even on picky platforms
by the hack we used to support direct cross-module references in
the transforms contrib modules. Now that that hack is gone, the
undefined symbol is exposed, as reported by Robert Haas.
Back-patch to 9.5 where we started to use -Wl,-undefined,dynamic_lookup.
I'm a bit surprised that the older branches don't seem to contain
any gettext references in this module, but since they don't fail
at build time, they must not. (We might be able to get away with
leaving this alone in 9.5/9.6, but I think it's cleaner if the
reference gets resolved at link time.)
Report: <CA+TgmoaHJKU5kcWZcYduATYVT7Mnx+8jUnycaYYL7OtCwCigug@mail.gmail.com>
Before initializing iteration over a subtransaction's changes, the last
few changes were not spilled to disk. That's correct if the transaction
didn't spill to disk, but otherwise... This bug can lead to missed or
misorderd subtransaction contents when they were spilled to disk.
Move spilling of the remaining in-memory changes to
ReorderBufferIterTXNInit(), where it can easily be applied to the top
transaction and, if present, subtransactions.
Since this code had too many bugs already, noticeably increase test
coverage.
Fixes: #14319
Reported-By: Huan Ruan
Discussion: <20160909012610.20024.58169@wrigleys.postgresql.org>
Backport: 9,4-, where logical decoding was added
Clobbering errno during cleanup after an error is an oft-repeated, easy
to make mistake. Deal with it here as everywhere else, by saving it
aside and restoring after cleanup, before ereport'ing.
In passing, add a missing errcode declaration in another ereport() call
in the same file, which I noticed while skimming the file looking for
similar problems.
Backpatch to 9.4, where this code was introduced.
Slot creation did not clear all fields upon creation. After start the
memory is zeroed, but when a physical replication slot was created in
the shared memory of a previously existing logical slot, catalog_xmin
would not be cleared. That in turn would prevent vacuum from doing its
duties.
To fix initialize all the fields. To make similar future bugs less
likely, zero all of ReplicationSlotPersistentData, and re-order the
rest of the initialization to be in struct member order.
Analysis: Andrew Gierth
Reported-By: md@chewy.com
Author: Michael Paquier
Discussion: <20160705173502.1398.70934@wrigleys.postgresql.org>
Backpatch: 9.4, where replication slots were introduced
Commit 14e8803f1 removed LWLocks when accessing MyProc->syncRepState
but didn't clean up the surrounding code and comments.
Cleanup and backpatch to 9.5, to keep code similar.
Julien Rouhaud, improved by suggestion from Michael Paquier,
implemented trivially by myself.
This oversight could cause logical decoding to fail to decode an outer
transaction containing changes, if a subtransaction had an XID but no
actual changes. Per bug #14279 from Marko Tiikkaja. Patch by Marko
based on analysis by Andrew Gierth.
Discussion: <20160804191757.1430.39011@wrigleys.postgresql.org>
This looks like it would cause changes from subtransactions to be missed
by the iterator being constructed, if those changes had been spilled to
disk previously. This implies that large subtransactions might be lost
(in whole or in part) by logical replication. Found and fixed by
Petru-Florin Mihancea, per bug #14208.
Report: <20160622144830.5791.22512@wrigleys.postgresql.org>
Since 5a991ef8 we're explicitly asking for feedback from the receiving
side when shutting down walsender, if there's not yet replicated
data.
Unfortunately we didn't remember (i.e. set waiting_for_ping_response to
true) having asked for feedback, leading to scenarios in which replies
were requested at a high frequency.
I can't reproduce this problem on my laptop, I think that's because the
problem requires a significant TCP window to manifest due to the
!pq_is_send_pending() condition. But since this clearly is a bug, let's
fix it. There's quite possibly more wrong than just this though.
While fiddling with WalSndDone(), I rewrote a hard to understand comment
about looking at the flush vs. the write position.
Reported-By: Nick Cleaton, Magnus Hagander
Author: Nick Cleaton
Discussion: CAFgz3kus=rC_avEgBV=+hRK5HYJ8vXskJRh8yEAbahJGTzF2VQ@mail.gmail.comCABUevExsjROqDcD0A2rnJ6HK6FuKGyewJr3PL12pw85BHFGS2Q@mail.gmail.com
Backpatch: 9.4, were 5a991ef8 introduced the use of feedback messages
during shutdown.
When re-reading an update involving both an old tuple and a new tuple from
disk, reorderbuffer.c was careless about whether the new tuple is suitably
aligned for direct access --- in general, it isn't. We'd missed seeing
this in the buildfarm because the contrib/test_decoding tests exercise this
code path only a few times, and by chance all of those cases have old
tuples with length a multiple of 4, which is usually enough to make the
access to the new tuple's t_len safe. For some still-not-entirely-clear
reason, however, Debian's sparc build gets a bus error, as reported by
Christoph Berg; perhaps it's assuming 8-byte alignment of the pointer?
The lack of previous field reports is probably because you need all of
these conditions to trigger a crash: an alignment-picky platform (not
Intel), a transaction large enough to spill to disk, an update within
that xact that changes a primary-key field and has an odd-length old tuple,
and of course logical decoding tracing the transaction.
Avoid the alignment assumption by using memcpy instead of fetching t_len
directly, and add a test case that exposes the crash on picky platforms.
Back-patch to 9.4 where the bug was introduced.
Discussion: <20160413094117.GC21485@msg.credativ.de>
It was declared as "pid_t", which would be fine except that none of
the places that printed it in error messages took any thought for the
possibility that it's not equivalent to "int". This leads to warnings
on some buildfarm members, and could possibly lead to actually wrong
error messages on those platforms. There doesn't seem to be any very
good reason not to just make it "int"; it's only ever assigned from
MyProcPid, which is int. If we want to cope with PIDs that are wider
than int, this is not the place to start.
Also, fix the comment, which seems to perhaps be a leftover from a time
when the field was only a bool?
Per buildfarm. Back-patch to 9.5 which has same issue.
Renaming a file using rename(2) is not guaranteed to be durable in face
of crashes. Use the previously added durable_rename()/durable_link_or_rename()
in various places where we previously just renamed files.
Most of the changed call sites are arguably not critical, but it seems
better to err on the side of too much durability. The most prominent
known case where the previously missing fsyncs could cause data loss is
crashes at the end of a checkpoint. After the actual checkpoint has been
performed, old WAL files are recycled. When they're filled, their
contents are fdatasynced, but we did not fsync the containing
directory. An OS/hardware crash in an unfortunate moment could then end
up leaving that file with its old name, but new content; WAL replay
would thus not replay it.
Reported-By: Tomas Vondra
Author: Michael Paquier, Tomas Vondra, Andres Freund
Discussion: 56583BDD.9060302@2ndquadrant.com
Backpatch: All supported branches
Renaming a file using rename(2) is not guaranteed to be durable in face
of crashes; especially on filesystems like xfs and ext4 when mounted
with data=writeback. To be certain that a rename() atomically replaces
the previous file contents in the face of crashes and different
filesystems, one has to fsync the old filename, rename the file, fsync
the new filename, fsync the containing directory. This sequence is not
generally adhered to currently; which exposes us to data loss risks. To
avoid having to repeat this arduous sequence, introduce
durable_rename(), which wraps all that.
Also add durable_link_or_rename(). Several places use link() (with a
fallback to rename()) to rename a file, trying to avoid replacing the
target file out of paranoia. Some of those rename sequences need to be
durable as well. There seems little reason extend several copies of the
same logic, so centralize the link() callers.
This commit does not yet make use of the new functions; they're used in
a followup commit.
Author: Michael Paquier, Andres Freund
Discussion: 56583BDD.9060302@2ndquadrant.com
Backpatch: All supported branches
Coverity and inspection for the issue addressed in fd45d16f found some
questionable code.
Specifically coverity noticed that the wrong length was added in
ReorderBufferSerializeChange() - without immediate negative consequences
as the variable isn't used afterwards. During code-review and testing I
noticed that a bit of space was wasted when allocating tuple bufs in
several places. Thirdly, the debug memset()s in
ReorderBufferGetTupleBuf() reduce the error checking valgrind can do.
Backpatch: 9.4, like c8f621c43.
In c8f621c43 I forgot to account for MAXALIGN when allocating a new
tuplebuf in ReorderBufferGetTupleBuf(). That happens to currently not
cause active problems on a number of platforms because the affected
pointer is already aligned, but others, like ppc and hppa, trigger this
in the regression test, due to a debug memset clearing memory.
Fix that.
Backpatch: 9.4, like the previous commit.
When decoding the old version of an UPDATE or DELETE change, and if that
tuple was bigger than MaxHeapTupleSize, we either Assert'ed out, or
failed in more subtle ways in non-assert builds. Normally individual
tuples aren't bigger than MaxHeapTupleSize, with big datums toasted.
But that's not the case for the old version of a tuple for logical
decoding; the replica identity is logged as one piece. With the default
replica identity btree limits that to small tuples, but that's not the
case for FULL.
Change the tuple buffer infrastructure to separate allocate over-large
tuples, instead of always going through the slab cache.
This unfortunately requires changing the ReorderBufferTupleBuf
definition, we need to store the allocated size someplace. To avoid
requiring output plugins to recompile, don't store HeapTupleHeaderData
directly after HeapTupleData, but point to it via t_data; that leaves
rooms for the allocated size. As there's no reason for an output plugin
to look at ReorderBufferTupleBuf->t_data.header, remove the field. It
was just a minor convenience having it directly accessible.
Reported-By: Adam Dratwiński
Discussion: CAKg6ypLd7773AOX4DiOGRwQk1TVOQKhNwjYiVjJnpq8Wo+i62Q@mail.gmail.com
Somehow I managed to flip the order of restoring old & new tuples when
de-spooling a change in a large transaction from disk. This happens to
only take effect when a change is spooled to disk which has old/new
versions of the tuple. That only is the case for UPDATEs where he
primary key changed or where replica identity is changed to FULL.
The tests didn't catch this because either spooled updates, or updates
that changed primary keys, were tested; not both at the same time.
Found while adding tests for the following commit.
Backpatch: 9.4, where logical decoding was added
Logical decoding's reorderbuffer keeps transactions in an LSN ordered
list for efficiency. To make that's efficiently possible upper-level
xids are forced to be logged before nested subtransaction xids. That
only works though if these records are all looked at: Unfortunately we
didn't do so for e.g. row level locks, which are otherwise uninteresting
for logical decoding.
This could lead to errors like:
"ERROR: subxact logged without previous toplevel record".
It's not sufficient to just look at row locking records, the xid could
appear first due to a lot of other types of records (which will trigger
the transaction to be marked logged with MarkCurrentTransactionIdLoggedIfAny).
So invent infrastructure to tell reorderbuffer about xids seen, when
they'd otherwise not pass through reorderbuffer.c.
Reported-By: Jarred Ward
Bug: #13844
Discussion: 20160105033249.1087.66040@wrigleys.postgresql.org
Backpatch: 9.4, where logical decoding was added
When adding replication origins in 5aa235042, I somehow managed to set
the timestamp of decoded transactions to InvalidXLogRecptr when decoding
one made without a replication origin. Fix that, and the wrong type of
the new commit_time variable.
This didn't trigger a regression test failure because we explicitly
don't show commit timestamps in the regression tests, as they obviously
are variable. Add a test that checks that a decoded commit's timestamp
is within minutes of NOW() from before the commit.
Reported-By: Weiping Qu
Diagnosed-By: Artur Zakirov
Discussion: 56D4197E.9050706@informatik.uni-kl.de,
56D42918.1010108@postgrespro.ru
Backpatch: 9.5, where 5aa235042 originates.
A scan for missed proisstrict markings in the core code turned up
these functions:
brin_summarize_new_values
pg_stat_reset_single_table_counters
pg_stat_reset_single_function_counters
pg_create_logical_replication_slot
pg_create_physical_replication_slot
pg_drop_replication_slot
The first three of these take OID, so a null argument will normally look
like a zero to them, resulting in "ERROR: could not open relation with OID
0" for brin_summarize_new_values, and no action for the pg_stat_reset_XXX
functions. The other three will dump core on a null argument, though this
is mitigated by the fact that they won't do so until after checking that
the caller is superuser or has rolreplication privilege.
In addition, the pg_logical_slot_get/peek[_binary]_changes family was
intentionally marked nonstrict, but failed to make nullness checks on all
the arguments; so again a null-pointer-dereference crash is possible but
only for superusers and rolreplication users.
Add the missing ARGISNULL checks to the latter functions, and mark the
former functions as strict in pg_proc. Make that change in the back
branches too, even though we can't force initdb there, just so that
installations initdb'd in future won't have the issue. Since none of these
bugs rise to the level of security issues (and indeed the pg_stat_reset_XXX
functions hardly misbehave at all), it seems sufficient to do this.
In addition, fix some order-of-operations oddities in the slot_get_changes
family, mostly cosmetic, but not the part that moves the function's last
few operations into the PG_TRY block. As it stood, there was significant
risk for an error to exit without clearing historical information from
the system caches.
The slot_get_changes bugs go back to 9.4 where that code was introduced.
Back-patch appropriate subsets of the pg_proc changes into all active
branches, as well.
Previously the "sent" field would be set to 0 and all other xlog
pointers be set to NULL if there were no valid values (such as when
in a backup sending walsender).
These would leak random xlog positions if a walsender used for backup would
a walsender slot previously used by a replication walsender.
In passing also fix a couple of cases where the xlog pointer is directly
compared to zero instead of using XLogRecPtrIsInvalid, noted by
Michael Paquier.