Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
backend, as far as the postmaster shutdown logic is concerned. That means,
fast shutdown will wait for WAL sender processes to exit before signaling
bgwriter to finish. This avoids race conditions between a base backup stopping
or starting, and bgwriter writing the shutdown checkpoint WAL record. We don't
want e.g the end-of-backup WAL record to be written after the shutdown
checkpoint.
The contents of an unlogged table are WAL-logged; thus, they are not
available on standby servers and are truncated whenever the database
system enters recovery. Indexes on unlogged tables are also unlogged.
Unlogged GiST indexes are not currently supported.
First, avoid scanning the whole ProcArray once we know there
are at least commit_siblings active; second, skip the check
altogether if commit_siblings = 0.
Greg Smith
an old transaction running in the master, and a lot of transactions have
started and finished since, and a WAL-record is written in the gap between
the creating the running-xacts snapshot and WAL-logging it, recovery will fail
with "too many KnownAssignedXids" error. This bug was reported by
Joachim Wieland on Nov 19th.
In the same scenario, when fewer transactions have started so that all the
xids fit in KnownAssignedXids despite the first bug, a more serious bug
arises. We incorrectly initialize the clog code with the oldest still running
transaction, and when we see the WAL record belonging to a transaction with
an XID larger than one that committed already before the checkpoint we're
recovering from, we zero the clog page containing the already committed
transaction, leading to data loss.
In hindsight, trying to track xids in the known-assigned-xids array before
seeing the running-xacts record was too complicated. To fix that, hold
XidGenLock while the running-xacts snapshot is taken and WAL-logged. That
ensures that no transaction can begin or end in that gap, so that in recvoery
we know that the snapshot contains all transactions running at that point in
WAL.
removing an infrequently occurring race condition in Hot Standby.
An xid must be assigned before a lock appears in shared memory,
rather than immediately after, else GetRunningTransactionLocks()
may see InvalidTransactionId, causing assertion failures during
lock processing on standby.
Bug report and diagnosis by Fujii Masao, fix by me.
Having this in src/include/port.h makes no sense, now that copydir.c lives
in src/backend/strorage rather than src/port. Along the way, remove an
obsolete comment from contrib/pg_upgrade that makes reference to the old
location.
dynamic pool of event handles, we can permanently assign one for each
shared latch. Thanks to that, we no longer need a separate shared memory
block for latches, and we don't need to know in advance how many shared
latches there is, so you no longer need to remember to update
NumSharedLatches when you introduce a new latch to the system.
wait until it is set. Latches can be used to reliably wait until a signal
arrives, which is hard otherwise because signals don't interrupt select()
on some platforms, and even when they do, there's race conditions.
On Unix, latches use the so called self-pipe trick under the covers to
implement the sleep until the latch is set, without race conditions. On
Windows, Windows events are used.
Use the new latch abstraction to sleep in walsender, so that as soon as
a transaction finishes, walsender is woken up to immediately send the WAL
to the standby. This reduces the latency between master and standby, which
is good.
Preliminary work by Fujii Masao. The latch implementation is by me, with
helpful comments from many people.
There is no reason that proc.c should have to get involved in this dirty hack
for letting the postmaster know which children are walsenders. Revert that
file to the way it was, and confine the kluge to pmsignal.c and postmaster.c.
This allows us to reliably remove all leftover temporary relation
files on cluster startup without reference to system catalogs or WAL;
therefore, we no longer include temporary relations in XLOG_XACT_COMMIT
and XLOG_XACT_ABORT WAL records.
Since these changes require including a backend ID in each
SharedInvalSmgrMsg, the size of the SharedInvalidationMessage.id
field has been reduced from two bytes to one, and the maximum number
of connections has been reduced from INT_MAX / 4 to 2^23-1. It would
be possible to remove these restrictions by increasing the size of
SharedInvalidationMessage by 4 bytes, but right now that doesn't seem
like a good trade-off.
Review by Jaime Casanova and Tom Lane.
max_standby_streaming_delay, and revise the implementation to avoid assuming
that timestamps found in WAL records can meaningfully be compared to clock
time on the standby server. Instead, the delay limits are compared to the
elapsed time since we last obtained a new WAL segment from archive or since
we were last "caught up" to WAL data arriving via streaming replication.
This avoids problems with clock skew between primary and standby, as well
as other corner cases that the original coding would misbehave in, such
as the primary server having significant idle time between transactions.
Per my complaint some time ago and considerable ensuing discussion.
Do some desultory editing on the hot standby documentation, too.
During Hot Standby we need to check for buffer pin deadlocks when the
Startup process begins to wait, in case it never wakes up again. We
previously made the deadlock check immediately on the basis it was
cheap, though clearer thinking and prima facie evidence shows that
was too simple. Refactor existing code to make it easy to add in
deferral of deadlock check until deadlock_timeout allowing a good
reduction in deadlock checks since far few buffer pins are held for
that duration. It's worth doing anyway, though major goal is to
prevent further reports of context switching with high numbers of
users on occasional tests.
of requirements and documentation on LogStandbySnapshot(). Fixes
two minor bugs reported by Tom Lane that would lead to an incorrect
snapshot after transaction wraparound. Also fix two other problems
discovered that would give incorrect snapshots in certain cases.
ProcArrayApplyRecoveryInfo() substantially rewritten. Some minor
refactoring of xact_redo_apply() and ExpireTreeKnownAssignedTransactionIds().
In addition, add support for a "payload" string to be passed along with
each notify event.
This implementation should be significantly more efficient than the old one,
and is also more compatible with Hot Standby usage. There is not yet any
facility for HS slaves to receive notifications generated on the master,
although such a thing is possible in future.
Joachim Wieland, reviewed by Jeff Davis; also hacked on by me.
all the data and using posix_fadvise to nudge the OS into flushing it
earlier. This also hopefully makes CREATE DATABASE avoid spamming the
cache.
Tests show a big speedup on Linux at least on some filesystems.
Idea and patch from Andres Freund.
where a database has a non-default tablespaceid. Pass thru MyDatabaseId
and MyDatabaseTableSpace to allow file path to be re-created in
standby and correct invalidation to take place in all cases.
Update and rework xact_commit_desc() debug messages.
Bug report from Tom by code inspection. Fix by me.
process. If startup waits on a buffer pin we send a request to all
backends to cancel themselves if they are holding the buffer pin
required and they are also waiting on a lock. If not, startup waits
until max_standby_delay before cancelling any backend waiting for
the requested buffer pin.
Move rd_targblock, rd_fsm_nblocks, and rd_vm_nblocks from relcache to the smgr
relation entries, so that they will get reset to InvalidBlockNumber whenever
an smgr-level flush happens. Because we now send smgr invalidation messages
immediately (not at end of transaction) when a relation truncation occurs,
this ensures that other backends will reset their values before they next
access the relation. We no longer need the unreliable assumption that a
VACUUM that's doing a truncation will hold its AccessExclusive lock until
commit --- in fact, we can intentionally release that lock as soon as we've
completed the truncation. This patch therefore reverts (most of) Alvaro's
patch of 2009-11-10, as well as my marginal hacking on it yesterday. We can
also get rid of assorted no-longer-needed relcache flushes, which are far more
expensive than an smgr flush because they kill a lot more state.
In passing this patch fixes smgr_redo's failure to perform visibility-map
truncation, and cleans up some rather dubious assumptions in freespace.c and
visibilitymap.c about when rd_fsm_nblocks and rd_vm_nblocks can be out of
date.
of shared or nailed system catalogs. This has two key benefits:
* The new CLUSTER-based VACUUM FULL can be applied safely to all catalogs.
* We no longer have to use an unsafe reindex-in-place approach for reindexing
shared catalogs.
CLUSTER on nailed catalogs now works too, although I left it disabled on
shared catalogs because the resulting pg_index.indisclustered update would
only be visible in one database.
Since reindexing shared system catalogs is now fully transactional and
crash-safe, the former special cases in REINDEX behavior have been removed;
shared catalogs are treated the same as non-shared.
This commit does not do anything about the recently-discussed problem of
deadlocks between VACUUM FULL/CLUSTER on a system catalog and other
concurrent queries; will address that in a separate patch. As a stopgap,
parallel_schedule has been tweaked to run vacuum.sql by itself, to avoid
such failures during the regression tests.
stage of required deadlock detection to allow re-enabling max_standby_delay
setting of -1, which is now essential in the absence of improved relation-
specific conflict resoluton. Requested by Greg Stark et al.
records for heap and btree. Minor change, mostly API changes to
pass through the required values. This is a simple change though
also provides the refactoring required for further enhancements
to conflict processing using the relOid. Changes only have effect
during Hot Standby.
restore_command, if the connection to the primary server is lost. This
ensures that the standby can recover automatically, if the connection is
lost for a long time and standby falls behind so much that the required
WAL segments have been archived and deleted in the master.
This also makes standby_mode useful without streaming replication; the
server will keep retrying restore_command every few seconds until the
trigger file is found. That's the same basic functionality pg_standby
offers, but without the bells and whistles.
To implement that, refactor the ReadRecord/FetchRecord functions. The
FetchRecord() function introduced in the original streaming replication
patch is removed, and all the retry logic is now in a new function called
XLogReadPage(). XLogReadPage() is now responsible for executing
restore_command, launching walreceiver, and waiting for new WAL to arrive
from primary, as required.
This also changes the life cycle of walreceiver. When launched, it now only
tries to connect to the master once, and exits if the connection fails, or
is lost during streaming for any reason. The startup process detects the
death, and re-launches walreceiver if necessary.
woken by alarm we send SIGUSR1 to all backends requesting that they
check to see if they are blocking Startup process. If so, they throw
ERROR/FATAL as for other conflict resolutions. Deadlock stop gap
removed. max_standby_delay = -1 option removed to prevent deadlock.
binary, revert PGDLLIMPORT decoration of global variables. I'm not sure
if there's any real harm from unnecessary PGDLLIMPORTs, but these are all
internal variables that external modules really shouldn't be messing
with. ThisTimeLineID still needs PGDLLIMPORT.
Conflict reason is passed through directly to the backend, so we can
take decisions about the effect of the conflict based upon the local
state. No specific changes, as yet, though this prepares for later work.
CancelVirtualTransaction() sends signals while holding ProcArrayLock.
Introduce errdetail_abort() to give message detail explaining that the
abort was caused by conflict processing. Remove CONFLICT_MODE states
in favour of using PROCSIG_RECOVERY_CONFLICT states directly, for clarity.
This includes two new kinds of postmaster processes, walsenders and
walreceiver. Walreceiver is responsible for connecting to the primary server
and streaming WAL to disk, while walsender runs in the primary server and
streams WAL from disk to the client.
Documentation still needs work, but the basics are there. We will probably
pull the replication section to a new chapter later on, as well as the
sections describing file-based replication. But let's do that as a separate
patch, so that it's easier to see what has been added/changed. This patch
also adds a new section to the chapter about FE/BE protocol, documenting the
protocol used by walsender/walreceivxer.
Bump catalog version because of two new functions,
pg_last_xlog_receive_location() and pg_last_xlog_replay_location(), for
monitoring the progress of replication.
Fujii Masao, with additional hacking by me
of this are to centralise the conflict code to allow further change,
as well as to allow passing through the full reason for the conflict
through to the conflicting backends. Backend state alters how we
can handle different types of conflict so this is now required.
As originally suggested by Heikki, no longer optional.
Previously we only cancelled sessions that were in-transaction.
Simple fix is to just cancel all sessions without waiting. Doing
it this way avoids complicating common code paths, which would
not be worth the trouble to cover this rare case.
Problem report and fix by Andres Freund, edited somewhat by me
deletion, so that we attempt to unlink the correct filepath. unlink()
errors are ignorable there, so lack of a DatabasePath initialization step
did not cause visible problems until a related bug showed up on Solaris.
Code refactored from xact_redo_commit() to
ProcessCommittedInvalidationMessages() in inval.c. Recovery may replay
shared invalidation messages for many databases, so we cannot
SetDatabasePath() once as we do in normal backends. Read the databaseid
from the shared invalidation messages, then set DatabasePath
temporarily before calling RelationCacheInitFileInvalidate().
Problem report by Robert Treat, analysis and fix by me.
This is more in keeping with modern practice, and is a first step towards
porting to Win64 (which has sizeof(pointer) > sizeof(long)).
Tsutomu Yamada, Magnus Hagander, Tom Lane
Enabled by recovery_connections = on (default) and forcing archive recovery using a recovery.conf. Recovery processing now emulates the original transactions as they are replayed, providing full locking and MVCC behaviour for read only queries. Recovery must enter consistent state before connections are allowed, so there is a delay, typically short, before connections succeed. Replay of recovering transactions can conflict and in some cases deadlock with queries during recovery; these result in query cancellation after max_standby_delay seconds have expired. Infrastructure changes have minor effects on normal running, though introduce four new types of WAL record.
New test mode "make standbycheck" allows regression tests of static command behaviour on a standby server while in recovery. Typical and extreme dynamic behaviours have been checked via code inspection and manual testing. Few port specific behaviours have been utilised, though primary testing has been on Linux only so far.
This commit is the basic patch. Additional changes will follow in this release to enhance some aspects of behaviour, notably improved handling of conflicts, deadlock detection and query cancellation. Changes to VACUUM FULL are also required.
Simon Riggs, with significant and lengthy review by Heikki Linnakangas, including streamlined redesign of snapshot creation and two-phase commit.
Important contributions from Florian Pflug, Mark Kirkwood, Merlin Moncure, Greg Stark, Gianni Ciolli, Gabriele Bartolini, Hannu Krosing, Robert Haas, Tatsuo Ishii, Hiroyuki Yamada plus support and feedback from many other community members.
This patch also removes buffer-usage statistics from the track_counts
output, since this (or the global server statistics) is deemed to be a better
interface to this information.
Itagaki Takahiro, reviewed by Euler Taveira de Oliveira.
via the "flat files" facility. This requires making it enough like a backend
to be able to run transactions; it's no longer an "auxiliary process" but
more like the autovacuum worker processes. Also, its signal handling has
to be brought into line with backends/workers. In particular, since it
now has to handle procsignal.c processing, the special autovac-launcher-only
signal conditions are moved to SIGUSR2.
Alvaro, with some cleanup from Tom
(That flat file is now completely useless, but removal will come later.)
To do this, postpone client authentication into the startup transaction
that's run by InitPostgres. We still collect the startup packet and do
SSL initialization (if needed) at the same time we did before. The
AuthenticationTimeout is applied separately to startup packet collection
and the actual authentication cycle. (This is a bit annoying, since it
means a couple extra syscalls; but the signal handling requirements inside
and outside a transaction are sufficiently different that it seems best
to treat the timeouts as completely independent.)
A small security disadvantage is that if the given database name is invalid,
this will be reported to the client before any authentication happens.
We could work around that by connecting to database "postgres" instead,
but consensus seems to be that it's not worth introducing such surprising
behavior.
Processing of all command-line switches and GUC options received from the
client is now postponed until after authentication. This means that
PostAuthDelay is much less useful than it used to be --- if you need to
investigate problems during InitPostgres you'll have to set PreAuthDelay
instead. However, allowing an unauthenticated user to set any GUC options
whatever seems a bit too risky, so we'll live with that.