From ea4744782b3f9d3dd69be20f31f970bee60f5f09 Mon Sep 17 00:00:00 2001
From: Michael Paquier
Date: Thu, 5 Mar 2026 10:05:44 +0900
Subject: [PATCH] Fix rare instability in recovery TAP test 004_timeline_switch

This fixes a problem similar to ad8c86d22cbd.  In this case, the test
could fail under the following circumstances:
- The primary is stopped with teardown_node(), meaning that it may not
be able to send all its WAL records to standby_1 and standby_2.
- If standby_2 receives more records than standby_1, attempting to
reconnect standby_2 to the promoted standby_1 would fail because of a
timeline fork.

This race condition is fixed with a simple trick: instead of tearing
down the primary, it is stopped cleanly so that all the WAL records of
the primary are received and flushed by both standby_1 and standby_2.
Once we do that, there is no need for a wait_for_catchup() before
stopping the node.  The test wants to check that a timeline jump can be
achieved when reconnecting a standby to a promoted standby in the same
cluster, hence an immediate stop of the primary is not required.

This failure is harder to reach than the previous instability of
009_twophase, but the buildfarm has still been able to detect it at
least once.  I have tried Alexander Lakhin's test trick with the
bgwriter and very aggressive standby snapshots, but I could not
reproduce it directly.  It is reachable, as the buildfarm has proved.

Backpatch down to all supported branches, as this problem can lead to
spurious failures in the buildfarm.

Discussion: https://postgr.es/m/493401a8-063f-436a-8287-a235d9e065fc@gmail.com
Backpatch-through: 14
---
 src/test/recovery/t/004_timeline_switch.pl | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/src/test/recovery/t/004_timeline_switch.pl b/src/test/recovery/t/004_timeline_switch.pl
index 989f5fc763a..5afd2f44466 100644
--- a/src/test/recovery/t/004_timeline_switch.pl
+++ b/src/test/recovery/t/004_timeline_switch.pl
@@ -34,11 +34,10 @@ $node_standby_2->start;
 $node_primary->safe_psql('postgres',
 	"CREATE TABLE tab_int AS SELECT generate_series(1,1000) AS a");
 
-# Wait until standby has replayed enough data on standby 1
-$node_primary->wait_for_catchup($node_standby_1);
-
-# Stop and remove primary
-$node_primary->teardown_node;
+# Cleanly stop and remove primary. A clean stop is required so as all
+# the records generated on the primary are received and flushed by the two
+# standbys.
+$node_primary->stop;
 
 # promote standby 1 using "pg_promote", switching it to a new timeline
 my $psql_out = '';