Fix orphaned processes when startup process fails during PM_STARTUP

When the startup process exists with a FATAL error during PM_STARTUP,
the postmaster called ExitPostmaster() directly, assuming that no other
processes are running at this stage.  Since 7ff23c6d27, this
assumption is not true, as the checkpointer, the background writer, the
IO workers and bgworkers kicking in early would be around.

This commit removes the startup-specific shortcut happening in
process_pm_child_exit() for a failing startup process during PM_STARTUP,
falling down to the existing exit() flow to signal all the started
children with SIGQUIT, so as we have no risk of creating orphaned
processes.

This required an extra change in HandleFatalError() for v18 and newer
versions, as an assertion could be triggered for PM_STARTUP.  It is now
incorrect.  In v17 and older versions, HandleChildCrash() needs to be
changed to handle PM_STARTUP so as children can be waited on.

While on it, fix a comment at the top of postmaster.c.  It was claiming
that the checkpointer and the background writer were started after
PM_RECOVERY.  That is not the case.

Author: Ayush Tiwari <ayushtiwari.slg01@gmail.com>
Discussion: https://postgr.es/m/CAJTYsWVoD3V9yhhqSae1_wqcnTdpFY-hDT7dPm5005ZFsL_bpA@mail.gmail.com
Backpatch-through: 15
This commit is contained in:
Michael Paquier 2026-04-21 09:40:07 +09:00
parent 954e468bba
commit 23cebf672e

View file

@ -292,12 +292,13 @@ static bool FatalError = false; /* T if recovering from backend crash */
*
* When the startup process is ready to start archive recovery, it signals the
* postmaster, and we switch to PM_RECOVERY state. The background writer and
* checkpointer are launched, while the startup process continues applying WAL.
* If Hot Standby is enabled, then, after reaching a consistent point in WAL
* redo, startup process signals us again, and we switch to PM_HOT_STANDBY
* state and begin accepting connections to perform read-only queries. When
* archive recovery is finished, the startup process exits with exit code 0
* and we switch to PM_RUN state.
* checkpointer are already running (as these are launched during PM_STARTUP),
* and the startup process continues applying WAL. If Hot Standby is enabled,
* then, after reaching a consistent point in WAL redo, startup process
* signals us again, and we switch to PM_HOT_STANDBY state and begin accepting
* connections to perform read-only queries. When archive recovery is
* finished, the startup process exits with exit code 0 and we switch to
* PM_RUN state.
*
* Normal child backends can only be launched when we are in PM_RUN or
* PM_HOT_STANDBY state. (connsAllowed can also restrict launching.)
@ -3037,29 +3038,13 @@ reaper(SIGNAL_ARGS)
}
/*
* Unexpected exit of startup process (including FATAL exit)
* during PM_STARTUP is treated as catastrophic. There are no
* other processes running yet, so we can just exit.
*/
if (pmState == PM_STARTUP &&
StartupStatus != STARTUP_SIGNALED &&
!EXIT_STATUS_0(exitstatus))
{
LogChildExit(LOG, _("startup process"),
pid, exitstatus);
ereport(LOG,
(errmsg("aborting startup due to startup process failure")));
ExitPostmaster(1);
}
/*
* After PM_STARTUP, any unexpected exit (including FATAL exit) of
* the startup process is catastrophic, so kill other children,
* and set StartupStatus so we don't try to reinitialize after
* they're gone. Exception: if StartupStatus is STARTUP_SIGNALED,
* then we previously sent the startup process a SIGQUIT; so
* that's probably the reason it died, and we do want to try to
* restart in that case.
* Any unexpected exit (including FATAL exit) of the startup
* process is catastrophic, so kill other children, and set
* StartupStatus so we don't try to reinitialize after they're
* gone. Exception: if StartupStatus is STARTUP_SIGNALED, then we
* previously sent the startup process a SIGQUIT; so that's
* probably the reason it died, and we do want to try to restart
* in that case.
*
* This stanza also handles the case where we sent a SIGQUIT
* during PM_STARTUP due to some dead_end child crashing: in that
@ -3706,7 +3691,8 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
FatalError = true;
/* We now transit into a state of waiting for children to die */
if (pmState == PM_RECOVERY ||
if (pmState == PM_STARTUP ||
pmState == PM_RECOVERY ||
pmState == PM_HOT_STANDBY ||
pmState == PM_RUN ||
pmState == PM_STOP_BACKENDS ||