From ea4744782b3f9d3dd69be20f31f970bee60f5f09 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Thu, 5 Mar 2026 10:05:44 +0900
Subject: [PATCH] Fix rare instability in recovery TAP test 004_timeline_switch

This fixes a problem similar to ad8c86d22cbd.  In this case, the test
could fail under the following circumstances:
- The primary is stopped with teardown_node(), meaning that it may not
be able to send all its WAL records to standby_1 and standby_2.
- If standby_2 receives more records than standby_1, attempting to
reconnect standby_2 to the promoted standby_1 would fail because of a
timeline fork.

This race condition is fixed with a simple trick: instead of tearing
down the primary, it is stopped cleanly so as all the WAL records of the
primary are received and flushed by both standby_1 and standby_2.  Once
we do that, there is no need for a wait_for_catchup() before stopping
the node.  The test wants to check that a timeline jump can be achieved
when reconnecting a standby to a promoted standby in the same cluster,
hence an immediate stop of the primary is not required.

This failure is harder to reach than the previous instability of
009_twophase, still the buildfarm has been able to detect this failure
at least once.  I have tried Alexander Lakhin's test trick with the
bgwriter and very aggressive standby snapshots, but I could not
reproduce it directly.  It is reachable, as the buildfarm has proved.

Backpatch down to all supported branches, and this problem can lead to
spurious failures in the buildfarm.

Discussion: https://postgr.es/m/493401a8-063f-436a-8287-a235d9e065fc@gmail.com
Backpatch-through: 14
---
 src/test/recovery/t/004_timeline_switch.pl | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/src/test/recovery/t/004_timeline_switch.pl b/src/test/recovery/t/004_timeline_switch.pl
index 989f5fc763a..5afd2f44466 100644
--- a/src/test/recovery/t/004_timeline_switch.pl
+++ b/src/test/recovery/t/004_timeline_switch.pl
@@ -34,11 +34,10 @@ $node_standby_2->start;
 $node_primary->safe_psql('postgres',
 	"CREATE TABLE tab_int AS SELECT generate_series(1,1000) AS a");
 
-# Wait until standby has replayed enough data on standby 1
-$node_primary->wait_for_catchup($node_standby_1);
-
-# Stop and remove primary
-$node_primary->teardown_node;
+# Cleanly stop and remove primary.  A clean stop is required so as all
+# the records generated on the primary are received and flushed by the two
+# standbys.
+$node_primary->stop;
 
 # promote standby 1 using "pg_promote", switching it to a new timeline
 my $psql_out = '';