Commit graph

43 commits

Author SHA1 Message Date
Amit Kapila
f2dbc83501 Fix use-after-free issue in slot synchronization.
Author: Shlok Kyal <shlok.kyal.oss@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Backpatch-through: 18, where it was introduced
Discussion: https://postgr.es/m/CANhcyEXMrcEdzj-RNGJam0nJHM4y+ttdWsgUCFmXciM7BNKc7A@mail.gmail.com
2025-09-03 06:31:05 +00:00
Nathan Bossart
6fbd7b93c6 Remove unused parameter from ProcessSlotSyncInterrupts().
Oversight in commit 93db6cbda0.

Author: ChangAo Chen <cca5507@qq.com>
Discussion: https://postgr.es/m/tencent_7B42BBE8D0A5C28DDAB91436192CBCCB8307%40qq.com
2025-08-29 10:56:10 -05:00
Álvaro Herrera
325fc0ab14
Avoid including commands/dbcommands.h in so many places
This has been done historically because of get_database_name (which
since commit cb98e6fb8f belongs in lsyscache.c/h, so let's move it
there) and get_database_oid (which is in the right place, but whose
declaration should appear in pg_database.h rather than dbcommands.h).
Clean this up.

Also, xlogreader.h and stringinfo.h are no longer needed by dbcommands.h
since commit f1fd515b39, so remove them.

Author: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/202508191031.5ipojyuaswzt@alvherre.pgsql
2025-08-28 12:39:04 +02:00
Fujii Masao
4614d53d4e Avoid unexpected shutdown when sync_replication_slots is enabled.
Previously, enabling sync_replication_slots while wal_level was not set
to logical could cause the server to shut down. This was because
the postmaster performed a configuration check before launching
the slot synchronization worker and raised an ERROR if the settings
were incompatible. Since ERROR is treated as FATAL in the postmaster,
this resulted in the entire server shutting down unexpectedly.

This commit changes the postmaster to log that message with a LOG-level
instead of raising an ERROR, allowing the server to continue running
even with the misconfiguration.

Back-patch to v17, where slot synchronization was introduced.

Reported-by: Hugo DUBOIS <hdubois@scaleway.com>
Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Hugo DUBOIS <hdubois@scaleway.com>
Reviewed-by: Shveta Malik <shveta.malik@gmail.com>
Discussion: https://postgr.es/m/CAH0PTU_pc3oHi__XESF9ZigCyzai1Mo3LsOdFyQA4aUDkm01RA@mail.gmail.com
Backpatch-through: 17
2025-08-04 20:51:42 +09:00
Álvaro Herrera
2633dae2e4
Standardize LSN formatting by zero padding
This commit standardizes the output format for LSNs to ensure consistent
representation across various tools and messages.  Previously, LSNs were
inconsistently printed as `%X/%X` in some contexts, while others used
zero-padding.  This often led to confusion when comparing.

To address this, the LSN format is now uniformly set to `%X/%08X`,
ensuring the lower 32-bit part is always zero-padded to eight
hexadecimal digits.

Author: Japin Li <japinli@hotmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/ME0P300MB0445CA53CA0E4B8C1879AF84B641A@ME0P300MB0445.AUSP300.PROD.OUTLOOK.COM
2025-07-07 13:57:43 +02:00
Peter Eisentraut
50fd428b2b Message style improvements 2025-06-28 19:18:06 +02:00
Amit Kapila
1546e17f9d Improve log messages and docs for slot synchronization.
Improve the clarity of LOG messages when a failover logical slot
synchronization fails, making the reasons more explicit for easier
debugging.

Update the documentation to outline scenarios where slot synchronization
can fail, especially during the initial sync, and emphasize that
pg_sync_replication_slot() is primarily intended for testing and
debugging purposes.

We also discussed improving the functionality of
pg_sync_replication_slot() so that it can be used reliably, but we would
take up that work for next version after some more discussion and review.

Reported-by: Suraj Kharage <suraj.kharage@enterprisedb.com>
Author: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Backpatch-through: 17, where it was introduced
Discussion: https://postgr.es/m/CAF1DzPWTcg+m+x+oVVB=y4q9=PYYsL_mujVp7uJr-_oUtWNGbA@mail.gmail.com
2025-06-19 09:48:08 +05:30
Amit Kapila
3ff2a1f0c9 Fix assertion failure during decoding from synced slots.
The slot synchronization skips updating the confirmed_flush LSN of the
local slot if the local slot has a newer catalog_xmin or restart_lsn, but
still allows updating the two_phase and two_phase_at fields of the slot.
This opens up a window for the prepared transactions between old
confirmed_flush LSN and two_phase_at to unexpectedly get decoded and sent
to the downstream after promotion. Then, while decoding the commit
prepared the assert will fail, which expects that the prepare hasn't been
sent to the downstream.

The fix is to skip updating the other slot fields when we are skipping to
update the confirmed_flush LSN of the slot.

We didn't backpatch this commit as two_phase_at was not synced in back
branches, which means prepared transactions won't be unexpectedly sent to
downstream.

We discovered this problem while analyzing BF failure reported in the
discussion link.

Reliably reproducing this issue without a debugger is difficult. Given
its rarity, adding specific injection point to test it doesn't seem
worthwhile, so we won't be adding a dedicated test case.

Author: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/OS0PR01MB5716B44052000EB91EFAE60E94BC2@OS0PR01MB5716.jpnprd01.prod.outlook.com
2025-04-29 12:52:05 +05:30
Amit Kapila
4868c96bc8 Fix slot synchronization for two_phase enabled slots.
The issue is that the transactions prepared before two-phase decoding is
enabled can fail to replicate to the subscriber after being committed on a
promoted standby following a failover. This is because the two_phase_at
field of a slot, which tracks the LSN from which two-phase decoding
starts, is not synchronized to standby servers. Without two_phase_at, the
logical decoding might incorrectly identify prepared transaction as
already replicated to the subscriber after promotion of standby server,
causing them to be skipped.

To address the issue on HEAD, the two_phase_at field of the slot is
exposed by the pg_replication_slots view and allows the slot
synchronization to copy this value to the corresponding synced slot on the
standby server.

This bug is likely to occur if the user toggles the two_phase option to
true after initial slot creation. Given that altering the two_phase option
of a replication slot is not allowed in PostgreSQL 17, this bug is less
likely to occur. We can't change the view/function definition in
backbranch so we can't push the same fix but we are brainstorming an
appropriate solution for PG17.

Author: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/TYAPR01MB5724CC7C288535BBCEEE65DA94A72@TYAPR01MB5724.jpnprd01.prod.outlook.com
2025-04-03 12:26:54 +05:30
Peter Eisentraut
7202d72787 backend launchers void * arguments for binary data
Change backend launcher functions to take void * for binary data
instead of char *.  This removes the need for numerous casts.

Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
Discussion: https://www.postgresql.org/message-id/flat/fd1fcedb-3492-4fc8-9e3e-74b97f2db6c7%40eisentraut.org
2025-02-21 08:03:33 +01:00
Amit Kapila
0ec3c295e7 Avoid updating inactive_since for invalid replication slots.
It is possible for the inactive_since value of an invalid replication slot
to be updated multiple times, which is unexpected behavior like during the
release of the slot or at the time of restart. This is harmless because
invalid slots are not allowed to be accessed but it is not prudent to
update invalid slots. We are planning to invalidate slots due to other
reasons like idle time and it will look odd that the slot's inactive_since
displays the recent time in this field after invalidated due to idle time.
So, this patch ensures that the inactive_since field of slots is not
updated for invalid slots.

In the passing, ensure to use the same inactive_since time for all the
slots at restart while restoring them from the disk.

Author: Nisha Moond <nisha.moond412@gmail.com>
Author: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Vignesh C <vignesh21@gmail.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Hou Zhijie <houzj.fnst@fujitsu.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/CABdArM7QdifQ_MHmMA=Cc4v8+MeckkwKncm2Nn6tX9wSCQ-+iw@mail.gmail.com
2025-02-05 08:56:14 +05:30
Amit Kapila
f41d8468dd Raise an error while trying to acquire an invalid slot.
Once a replication slot is invalidated, it cannot be altered or used to
fetch changes. However, a process could still acquire an invalid slot and
fail later.

For example, if a process acquires a logical slot that was invalidated due
to wal_removed, it will eventually fail in CreateDecodingContext() when
attempting to access the removed WAL. Similarly, for physical replication
slots, even if the slot is invalidated and invalidation_reason is set to
wal_removed, the walsender does not currently check for invalidation when
starting physical replication. Instead, replication starts, and an error
is only reported later while trying to access WAL. Similarly, we prohibit
modifying slot properties for invalid slots but give the error for the
same after acquiring the slot.

This patch improves error handling by detecting invalid slots earlier at
the time of slot acquisition which is the first step. This also helped in
unifying different ERROR messages at different places and gave a
consistent message for invalid slots. This means that the message for
invalid slots will change to a generic message.

This will also be helpful for future patches where we are planning to
invalidate slots due to more reasons like idle_timeout because we don't
have to modify multiple places in such cases and avoid the chances of
missing out on a particular place.

Author: Nisha Moond <nisha.moond412@gmail.com>
Author: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Vignesh C <vignesh21@gmail.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/CABdArM6pBL5hPnSQ+5nEVMANcF4FCH7LQmgskXyiLY75TMnKpw@mail.gmail.com
2025-01-31 10:27:35 +05:30
Bruce Momjian
50e6eb731d Update copyright for 2025
Backpatch-through: 13
2025-01-01 11:21:55 -05:00
Amit Kapila
d05a387d9d Doc: Clarify the inactive_since field description.
Updated to specify that it represents the exact time a slot became
inactive, rather than the period of inactivity.

Reported-by: Peter Smith
Author: Bruce Momjian, Nisha Moond
Reviewed-by: Amit Kapila, Peter Smith
Backpatch-through: 17
Discussion: https://postgr.es/m/CAHut+PuvsyA5v8y7rYoY9mkDQzUhwaESM05yCByTMaDoRh30tA@mail.gmail.com
2024-11-25 11:12:32 +05:30
Peter Eisentraut
e18512c000 Remove unused #include's from backend .c files
as determined by IWYU

These are mostly issues that are new since commit dbbca2cf29.

Discussion: https://www.postgresql.org/message-id/flat/0df1d5b1-8ca8-4f84-93be-121081bde049%40eisentraut.org
2024-10-27 08:26:50 +01:00
Michael Paquier
b4db64270e Apply more quoting to GUC names in messages
This is a continuation of 17974ec259.  More quotes are applied to
GUC names in error messages and hints, taking care of what seems to be
all the remaining holes currently in the tree for the GUCs.

Author: Peter Smith
Discussion: https://postgr.es/m/CAHut+Pv-kSN8SkxSdoHano_wPubqcg5789ejhCDZAcLFceBR-w@mail.gmail.com
2024-09-04 13:50:44 +09:00
Michael Paquier
4236825197 Fix typos and grammar in code comments and docs
Author: Alexander Lakhin
Discussion: https://postgr.es/m/f7e514cf-2446-21f1-a5d2-8c089a6e2168@gmail.com
2024-09-03 14:49:04 +09:00
Peter Eisentraut
edee0c621d Message style improvements 2024-08-29 14:43:34 +02:00
Peter Eisentraut
dc26ff2f22 Fix misplaced translator comments
They did not immediately precede the code they were applying to.
2024-08-27 16:15:28 +02:00
Tom Lane
0d8bd0a72e Improve logical replication connection-failure messages.
These messages mostly said "could not connect to the publisher: %s"
which is lacking context.  Add some verbiage to indicate which
subscription or worker process is failing.

Nisha Moond

Discussion: https://postgr.es/m/CABdArM7q1=zqL++cYd0hVMg3u_tc0S=0Of=Um-KvDhLony0cSg@mail.gmail.com
2024-07-11 13:21:13 -04:00
Heikki Linnakangas
eb21f5bc67 Remove redundant SetProcessingMode(InitProcessing) calls
After several refactoring iterations, auxiliary processes are no
longer initialized from the bootstrapper. Using the InitProcessing
mode for initializing auxiliary processes is more appropriate. Since
the global variable Mode is initialized to InitProcessing, we can just
remove the redundant calls of SetProcessingMode(InitProcessing).

Author: Xing Guo <higuoxing@gmail.com>
Discussion: https://www.postgresql.org/message-id/CACpMh%2BDBHVT4xPGimzvex%3DwMdMLQEu9PYhT%2BkwwD2x2nu9dU_Q%40mail.gmail.com
2024-07-02 20:14:40 +03:00
Peter Eisentraut
720b0eaae9 Convert some extern variables to static
These probably should have been static all along, it was only
forgotten out of sloppiness.

Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/flat/e0a62134-83da-4ba4-8cdb-ceb0111c95ce@eisentraut.org
2024-07-02 07:26:22 +02:00
Amit Kapila
2357c9223b Rename standby_slot_names to synchronized_standby_slots.
The standby_slot_names GUC allows the specification of physical standby
slots that must be synchronized before the logical walsenders associated
with logical failover slots. However, for this purpose, the GUC name is
too generic.

Author: Hou Zhijie
Reviewed-by: Bertrand Drouvot, Masahiko Sawada
Backpatch-through: 17
Discussion: https://postgr.es/m/ZnWeUgdHong93fQN@momjian.us
2024-07-01 11:36:56 +05:30
Peter Eisentraut
e9b7aee272 A few follow-up fixes for GUC name quoting
Fixups for 17974ec259: Some messages were missed (and some were new
since the patch was originally proposed), and there was a typo
introduced.
2024-05-17 13:48:31 +02:00
Peter Eisentraut
6d716adf85 Fix incorrect format placeholder 2024-05-08 08:37:46 +02:00
David Rowley
a42fc1c903 Fix an assortment of typos
Author: Alexander Lakhin
Discussion: https://postgr.es/m/ae9f2fcb-4b24-5bb0-4240-efbbbd944ca1@gmail.com
2024-05-04 02:33:25 +12:00
David Rowley
310cd8ab38 Fix duplicated consecutive words in comments
Also, fix a comment incorrectly referencing the "streaming read API".
This was renamed to "read stream" shortly before being committed.

Discussion: https://postgr.es/m/CAApHDvq-2Zdqytm_Hf3RmVf0qg5PS9jTFAJ5QTc9xH9pwvwDTA@mail.gmail.com
2024-04-28 20:03:34 +12:00
Amit Kapila
db08e8c6fa Post-commit review fixes for slot synchronization.
Allow pg_sync_replication_slots() to error out during promotion of standby.
This makes the behavior of the SQL function consistent with the slot sync
worker. We also ensured that pg_sync_replication_slots() cannot be
executed if sync_replication_slots is enabled and the slotsync worker is
already running to perform the synchronization of slots. Previously, it
would have succeeded in cases when the worker is idle and failed when it
is performing sync which could confuse users.

This patch fixes another issue in the slot sync worker where
SignalHandlerForShutdownRequest() needs to be registered *before* setting
SlotSyncCtx->pid, otherwise, the slotsync worker could miss handling
SIGINT sent by the startup process(ShutDownSlotSync) if it is sent before
worker could register SignalHandlerForShutdownRequest(). To be consistent,
all signal handlers' registration is moved to a prior location before we
set the worker's pid.

Ensure that we clean up synced temp slots at the end of
pg_sync_replication_slots() to avoid such slots being left over after
promotion.

Ensure that ShutDownSlotSync() captures SlotSyncCtx->pid under spinlock to
avoid accessing invalid value as it can be reset by concurrent slot sync
exit due to an error.

Author: Shveta Malik
Reviewed-by: Hou Zhijie, Bertrand Drouvot, Amit Kapila, Masahiko Sawada
Discussion: https://postgr.es/m/CAJpy0uBefXUS_TSz=oxmYKHdg-fhxUT0qfjASW3nmqnzVC3p6A@mail.gmail.com
2024-04-25 14:01:44 +05:30
Daniel Gustafsson
950d4a2cb1 Fix typos and duplicate words
This fixes various typos, duplicated words, and tiny bits of whitespace
mainly in code comments but also in docs.

Author: Daniel Gustafsson <daniel@yesql.se>
Author: Heikki Linnakangas <hlinnaka@iki.fi>
Author: Alexander Lakhin <exclusion@gmail.com>
Author: David Rowley <dgrowleyml@gmail.com>
Author: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/3F577953-A29E-4722-98AD-2DA9EFF2CBB8@yesql.se
2024-04-18 21:28:07 +02:00
Amit Kapila
3741f2a09d Fix the review comments and a bug in the slot sync code.
Ensure that when updating the catalog_xmin of the synced slots, it is
first written to disk before changing the in-memory value
(effective_catalog_xmin). This is to prevent a scenario where the
in-memory value change triggers a vacuum to remove catalog tuples before
the catalog_xmin is written to disk. In the event of a crash before the
catalog_xmin is persisted, we would not know that some required catalog
tuples have been removed and the synced slot would be invalidated.

Change the sanity check to ensure that remote_slot's confirmed_flush LSN
can't precede the local/synced slot during slot sync. Note that the
restart_lsn of the synced/local slot can be ahead of remote_slot. This can
happen when slot advancing machinery finds a running xacts record after
reaching the consistent state at a later point than the primary where it
serializes the snapshot and updates the restart_lsn.

Make the check to sync slots robust by allowing to sync only when the
confirmed_lsn, restart_lsn, or catalog_xmin of the remote slot is ahead of
the synced/local slot.

Reported-by: Amit Kapila and Shveta Malik
Author: Hou Zhijie, Shveta Malik
Reviewed-by: Amit Kapila, Bertrand Drouvot
Discussion: https://postgr.es/m/OS0PR01MB57162B67D3CB01B2756FBA6D94062@OS0PR01MB5716.jpnprd01.prod.outlook.com
Discussion: https://postgr.es/m/CAJpy0uCSS5zmdyUXhvw41HSdTbRqX1hbYqkOfHNj7qQ+2zn0AQ@mail.gmail.com
2024-04-12 15:10:41 +05:30
David Rowley
8461424fd7 Fixup various StringInfo function usages
This adjusts various appendStringInfo* function calls to use a more
appropriate and efficient function with the same behavior.  For example,
use appendStringInfoChar() when appending a single character rather than
appendStringInfo() and appendStringInfoString() when no formatting is
required rather than using appendStringInfo().

All adjustments made here are in code that's new to v17, so it makes
sense to fix these now rather than wait a few years and make
backpatching harder.

Discussion: https://postgr.es/m/CAApHDvojY2UvMiO+9_55ArTj10P1LBNJyyoGB+C65BLDNT0GsQ@mail.gmail.com
Reviewed-by: Nathan Bossart, Tom Lane
2024-04-10 11:53:32 +12:00
Amit Kapila
6f132ed693 Allow synced slots to have their inactive_since.
This commit does two things:
1) Maintains inactive_since for sync slots whenever the slot is released
just like any other regular slot.

2) Ensures the value is set to the current timestamp during the promotion
of standby to help correctly interpret the time after promotion. We don't
want the slots to appear inactive for a long time after promotion if they
haven't been synchronized recently. This would also avoid the invalidation
of such slots immediately after promotion if tomorrow we have a feature
that invalidates slots based on their inactivity time. Whoever acquires
the slot i.e. makes the slot active will reset it to NULL.

Author: Bharath Rupireddy
Reviewed-by: Bertrand Drouvot, Amit Kapila, Shveta Malik, Masahiko Sawada
Discussion: https://postgr.es/m/CAA4eK1KrPGwfZV9LYGidjxHeW+rxJ=E2ThjXvwRGLO=iLNuo=Q@mail.gmail.com
Discussion: https://postgr.es/m/CALj2ACW4aUe-_uFQOjdWCEN-xXoLGhmvRFnL8SNw_TZ5nJe+aw@mail.gmail.com
Discussion: https://postgr.es/m/CA+Tgmob_Ta-t2ty8QrKHBGnNLrf4ZYcwhGHGFsuUoFrAEDw4sA@mail.gmail.com
2024-04-05 09:48:49 +05:30
Amit Kapila
2ec005b4e2 Ensure that the sync slots reach a consistent state after promotion without losing data.
We were directly copying the LSN locations while syncing the slots on the
standby. Now, it is possible that at some particular restart_lsn there are
some running xacts, which means if we start reading the WAL from that
location after promotion, we won't reach a consistent snapshot state at
that point. However, on the primary, we would have already been in a
consistent snapshot state at that restart_lsn so we would have just
serialized the existing snapshot.

To avoid this problem we will use the advance_slot functionality unless
the snapshot already exists at the synced restart_lsn location. This will
help us to ensure that snapbuilder/slot statuses are updated properly
without generating any changes. Note that the synced slot will remain as
RS_TEMPORARY till the decoding from corresponding restart_lsn can reach a
consistent snapshot state after which they will be marked as
RS_PERSISTENT.

Per buildfarm

Author: Hou Zhijie
Reviewed-by: Bertrand Drouvot, Shveta Malik, Bharath Rupireddy, Amit Kapila
Discussion: https://postgr.es/m/OS0PR01MB5716B3942AE49F3F725ACA92943B2@OS0PR01MB5716.jpnprd01.prod.outlook.com
2024-04-03 14:04:59 +05:30
Amit Kapila
6ae701b437 Track invalidation_reason in pg_replication_slots.
Till now, the reason for replication slot invalidation is not tracked
directly in pg_replication_slots. A recent commit 007693f2a3 added
'conflict_reason' to show the reasons for slot conflict/invalidation, but
only for logical slots.

This commit adds a new column 'invalidation_reason' to show invalidation
reasons for both physical and logical slots. And, this commit also turns
'conflict_reason' text column to 'conflicting' boolean column (effectively
reverting commit 007693f2a3). The 'conflicting' column is true for
invalidation reasons 'rows_removed' and 'wal_level_insufficient' because
those make the slot conflict with recovery. When 'conflicting' is true,
one can now look at the new 'invalidation_reason' column for the reason
for the logical slot's conflict with recovery.

The new 'invalidation_reason' column will also be useful to track other
invalidation reasons in the future commit.

Author: Bharath Rupireddy
Reviewed-by: Bertrand Drouvot, Amit Kapila, Shveta Malik
Discussion: https://www.postgresql.org/message-id/ZfR7HuzFEswakt/a%40ip-10-97-1-34.eu-west-3.compute.internal
Discussion: https://www.postgresql.org/message-id/CALj2ACW4aUe-_uFQOjdWCEN-xXoLGhmvRFnL8SNw_TZ5nJe+aw@mail.gmail.com
2024-03-22 13:52:05 +05:30
Heikki Linnakangas
aafc05de1b Refactor postmaster child process launching
Introduce new postmaster_child_launch() function that deals with the
differences in EXEC_BACKEND mode.

Refactor the mechanism of passing information from the parent to child
process. Instead of using different command-line arguments when
launching the child process in EXEC_BACKEND mode, pass a
variable-length blob of startup data along with all the global
variables. The contents of that blob depend on the kind of child
process being launched. In !EXEC_BACKEND mode, we use the same blob,
but it's simply inherited from the parent to child process.

Reviewed-by: Tristan Partin, Andres Freund
Discussion: https://www.postgresql.org/message-id/7a59b073-5b5b-151e-7ed3-8b01ff7ce9ef@iki.fi
2024-03-18 11:35:30 +02:00
Amit Kapila
bf279ddd1c Introduce a new GUC 'standby_slot_names'.
This patch provides a way to ensure that physical standbys that are
potential failover candidates have received and flushed changes before
the primary server making them visible to subscribers. Doing so guarantees
that the promoted standby server is not lagging behind the subscribers
when a failover is necessary.

The logical walsender now guarantees that all local changes are sent and
flushed to the standby servers corresponding to the replication slots
specified in 'standby_slot_names' before sending those changes to the
subscriber.

Additionally, the SQL functions pg_logical_slot_get_changes,
pg_logical_slot_peek_changes and pg_replication_slot_advance are modified
to ensure that they process changes for failover slots only after physical
slots specified in 'standby_slot_names' have confirmed WAL receipt for those.

Author: Hou Zhijie and Shveta Malik
Reviewed-by: Masahiko Sawada, Peter Smith, Bertrand Drouvot, Ajin Cherian, Nisha Moond, Amit Kapila
Discussion: https://postgr.es/m/514f6f2f-6833-4539-39f1-96cd1e011f23@enterprisedb.com
2024-03-08 08:10:45 +05:30
Heikki Linnakangas
393b5599e5 Use MyBackendType in more places to check what process this is
Remove IsBackgroundWorker, IsAutoVacuumLauncherProcess(),
IsAutoVacuumWorkerProcess(), and IsLogicalSlotSyncWorker() in favor of
new Am*Process() macros that use MyBackendType. For consistency with
the existing Am*Process() macros.

Reviewed-by: Andres Freund
Discussion: https://www.postgresql.org/message-id/f3ecd4cb-85ee-4e54-8278-5fabfb3a4ed0@iki.fi
2024-03-04 10:25:12 +02:00
Amit Kapila
b3f6b14cf4 Fixups for commit 93db6cbda0.
Ensure to set always-secure search path for both local and remote
connections during slot synchronization, so that malicious users can't
redirect user code (e.g. operators).

In the passing, improve the name of define, remove spurious return
statement, and a minor change in one of the comments.

Author: Bertrand Drouvot and Shveta Malik
Reviewed-by: Amit Kapila, Peter Smith
Discussion: https://postgr.es/m/514f6f2f-6833-4539-39f1-96cd1e011f23@enterprisedb.com
Discussion: https://postgr.es/m/ZdcejBDCr+wlVGnO@ip-10-97-1-34.eu-west-3.compute.internal
Discussion: https://postgr.es/m/CAJpy0uBNP=nrkNJkJSfF=jSocEh8vU2Owa8Rtpi=63fG=SvfVQ@mail.gmail.com
2024-02-29 09:45:20 +05:30
Amit Kapila
93db6cbda0 Add a new slot sync worker to synchronize logical slots.
By enabling slot synchronization, all the failover logical replication
slots on the primary (assuming configurations are appropriate) are
automatically created on the physical standbys and are synced
periodically. The slot sync worker on the standby server pings the primary
server at regular intervals to get the necessary failover logical slots
information and create/update the slots locally. The slots that no longer
require synchronization are automatically dropped by the worker.

The nap time of the worker is tuned according to the activity on the
primary. The slot sync worker waits for some time before the next
synchronization, with the duration varying based on whether any slots were
updated during the last cycle.

A new parameter sync_replication_slots enables or disables this new
process.

On promotion, the slot sync worker is shut down by the startup process to
drop any temporary slots acquired by the slot sync worker and to prevent
the worker from trying to fetch the failover slots.

A functionality to allow logical walsenders to wait for the physical will
be done in a subsequent commit.

Author: Shveta Malik, Hou Zhijie based on design inputs by Masahiko Sawada and Amit Kapila
Reviewed-by: Masahiko Sawada, Bertrand Drouvot, Peter Smith, Dilip Kumar, Ajin Cherian, Nisha Moond, Kuroda Hayato, Amit Kapila
Discussion: https://postgr.es/m/514f6f2f-6833-4539-39f1-96cd1e011f23@enterprisedb.com
2024-02-22 15:25:15 +05:30
Amit Kapila
801792e528 Improve ERROR/LOG messages added by commits ddd5f4f54a and 7a424ece48.
Additionally, in slotsync.c, replace one StringInfoData variable usage
with a constant string to avoid palloc/pfree. Also, replace the inclusion
of "logical.h" with "slot.h" to prevent the exposure of unnecessary
implementation details.

Reported-by: Kyotaro Horiguchi, Masahiko Sawada
Author: Shveta Malik based on suggestions by Robert Haas and Amit Kapila
Reviewed-by: Kyotaro Horiguchi, Amit Kapila
Discussion: https://postgr.es/m/20240214.162652.773291409747353211.horikyota.ntt@gmail.com
Discussion: https://postgr.es/m/20240219.134015.1888940527023074780.horikyota.ntt@gmail.com
Discussion: https://postgr.es/m/CAD21AoCYXhDYOQDAS-rhGasC2T+tYbV=8Y18o94sB=5AxcW+yA@mail.gmail.com
2024-02-22 11:17:00 +05:30
Amit Kapila
b987be39c3 Fix the incorrect format specifier used in commit 7a424ece48.
Author: Hou Zhijie
Discussion: https://postgr.es/m/514f6f2f-6833-4539-39f1-96cd1e011f23@enterprisedb.com
Discussion: https://postgr.es/m/OS0PR01MB5716CB015BAD807B29BC55BE944C2@OS0PR01MB5716.jpnprd01.prod.outlook.com
2024-02-16 11:34:11 +05:30
Amit Kapila
7a424ece48 Add more LOG and DEBUG messages for slot synchronization.
This provides more information about remote slots during synchronization
which helps in debugging bugs and BF failures due to test case issues. We
might later want to change the LOG message added by this patch to DEBUG1.

Author: Hou Zhijie
Reviewed-by: Amit Kapila, Bertrand Drouvot
Discussion: https://postgr.es/m/514f6f2f-6833-4539-39f1-96cd1e011f23@enterprisedb.com
Discussion: https://postgr.es/m/OS0PR01MB571633C23B2A4CAC5FB0371A944C2@OS0PR01MB5716.jpnprd01.prod.outlook.com
2024-02-16 09:02:54 +05:30
Amit Kapila
ddd5f4f54a Add a slot synchronization function.
This commit introduces a new SQL function pg_sync_replication_slots()
which is used to synchronize the logical replication slots from the
primary server to the physical standby so that logical replication can be
resumed after a failover or planned switchover.

A new 'synced' flag is introduced in pg_replication_slots view, indicating
whether the slot has been synchronized from the primary server. On a
standby, synced slots cannot be dropped or consumed, and any attempt to
perform logical decoding on them will result in an error.

The logical replication slots on the primary can be synchronized to the
hot standby by using the 'failover' parameter of
pg-create-logical-replication-slot(), or by using the 'failover' option of
CREATE SUBSCRIPTION during slot creation, and then calling
pg_sync_replication_slots() on standby. For the synchronization to work,
it is mandatory to have a physical replication slot between the primary
and the standby aka 'primary_slot_name' should be configured on the
standby, and 'hot_standby_feedback' must be enabled on the standby. It is
also necessary to specify a valid 'dbname' in the 'primary_conninfo'.

If a logical slot is invalidated on the primary, then that slot on the
standby is also invalidated.

If a logical slot on the primary is valid but is invalidated on the
standby, then that slot is dropped but will be recreated on the standby in
the next pg_sync_replication_slots() call provided the slot still exists
on the primary server. It is okay to recreate such slots as long as these
are not consumable on standby (which is the case currently). This
situation may occur due to the following reasons:
- The 'max_slot_wal_keep_size' on the standby is insufficient to retain
WAL records from the restart_lsn of the slot.
- 'primary_slot_name' is temporarily reset to null and the physical slot
is removed.

The slot synchronization status on the standby can be monitored using the
'synced' column of pg_replication_slots view.

A functionality to automatically synchronize slots by a background worker
and allow logical walsenders to wait for the physical will be done in
subsequent commits.

Author: Hou Zhijie, Shveta Malik, Ajin Cherian based on an earlier version by Peter Eisentraut
Reviewed-by: Masahiko Sawada, Bertrand Drouvot, Peter Smith, Dilip Kumar, Nisha Moond, Kuroda Hayato, Amit Kapila
Discussion: https://postgr.es/m/514f6f2f-6833-4539-39f1-96cd1e011f23@enterprisedb.com
2024-02-14 09:45:36 +05:30