When cloning extended statistics via CREATE TABLE ... LIKE ... INCLUDING
STATISTICS, stxkeys holds attribute numbers from the source (parent)
table, but get_attname() was being called with the child relation's
OID. If the parent has dropped columns, the child's attribute numbers
are renumbered sequentially and no longer match, so the lookup either
returns the wrong column name (silent corruption) or errors out when
the attnum does not exist in the child.
Fix it by remapping the parent attnum through attmap before the lookup,
consistent with how expression statistics are already handled a few
lines below.
Add a regression test covering both manifestations: a 3-column parent
where the stale attnum refers to no child column (cache-lookup error),
and a 4-column parent where the stale attnum silently refers to the
wrong child column.
Author: Julien Tachoires <julmon@gmail.com>
Reviewed-by: Srinath Reddy Sadipiralla <srinath2133@gmail.com>
Discussion: https://postgr.es/m/20260415105718.tomuncfbmlt67oel@poseidon.home.virt
Backpatch-through: 14
After a recent macOS update, building Postgres produces warnings
that look like this:
ranlib: warning: 'libpgport_shlib.a(pg_cpu_x86.c.o)' has no symbols
ranlib: warning: 'libpgport_shlib.a(pg_popcount_x86.c.o)' has no symbols
To fix, add a dummy symbol to files that may otherwise have none.
Per project policy, this is a candidate for back-patching into
out-of-support branches: it suppresses annoying compiler warnings
but changes no behavior.
Reported-by: Zhang Mingli <zmlpostgres@gmail.com>
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/229aaaf3-f529-44ed-8e50-00cb6909af21%40Spark
Backpatch-through: 13
British Columbia (America/Vancouver) moved to permanent UTC-07 on
2026-03-09, which will affect their clocks beginning on 2026-11-01.
For lack of any clarity on the point, assume their TZ abbreviation
will be MST from that time forward.
Moldova (Europe/Chisinau) has followed EU DST transition times since
2022.
Backpatch-through: 14
ExecEvalHashedScalarArrayOp(), when using a strict equality function,
performs a short-circuit when looking up NULL values. When the function
is non-strict, the code incorrectly looked up the hash table for a
zero-valued Datum, which could have resulted in an accidental true
return if the hash table contained zero valued Datum, or could result
in a crash for non-byval types.
Here we fix this by adding an extra step when we build the hash table to
check what the result of a NULL lookup would be. This requires looping
over the array and checking what the non-hashed version of the code
would do. We cache the results of that in the expression so that we can
reuse the result any time we're asked to search for a NULL value.
It's important to note that non-strict equality functions are free to
treat any NULL value as equal to any non-NULL value. For example,
someone may wish to design a type that treats an empty string and NULL
as equal.
All built-in types have strict equality functions, so this could affect
custom / user-defined types.
Author: Chengpeng Yan <chengpeng_yan@outlook.com>
Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: ChangAo Chen <cca5507@qq.com>
Discussion: https://postgr.es/m/A16187AE-2359-4265-9F5E-71D015EC2B2D@outlook.com
Backpatch-through: 14
pg_test_timing reports timing differences in nanoseconds in master, and
in microseconds in v14 through v18, but previously the backward-clock
warning incorrectly labeled the value as milliseconds.
This commit fixes the warning message to use "ns" in master and
"us" in v14 through v18, matching the actual unit being reported.
Backpatch to all supported versions.
Author: Chao Li <lic@highgo.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Xiaopeng Wang <wxp_728@163.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/F780CEEB-A237-4302-9F55-60E9D8B6533D@gmail.com
Backpatch-through: 14
If CheckAttributeType() is called with InvalidOid, it performs a bunch
of pointless, futile syscache lookups with InvalidOid, but ultimately
tolerates it and has no effect. We were calling it with InvalidOid on
dropped columns, but it seems accidental that it works, so let's stop
doing it.
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/93ce56cd-02a6-4db1-8224-c8999372facc@iki.fi
Backpatch-through: 14
CheckAttributeType() checks that a composite type is not made a member
of itself with ALTER TABLE ADD COLUMN or ALTER TYPE ADD ATTRIBUTE,
even indirectly via a domain, array, another composite type or a range
type. But it missed checking for multiranges. That was a simple
oversight when multiranges were added.
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/93ce56cd-02a6-4db1-8224-c8999372facc@iki.fi
Backpatch-through: 14
to_char() allocates its output buffer with 8 bytes per formatting
code in the pattern. If the locale's currency symbol, thousands
separator, or decimal or sign symbol is more than 8 bytes long,
in principle we could overrun the output buffer. No such locales
exist in the real world, so it seems sufficient to truncate the
symbol if we do see it's too long.
Reported-by: Xint Code
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/638232.1776790821@sss.pgh.pa.us
Backpatch-through: 14
parse_affentry() and addCompoundAffixFlagValue() each collect fields
from an affix file into working buffers of size BUFSIZ. They failed
to defend against overlength fields, so that a malicious affix file
could cause a stack smash. BUFSIZ (typically 8K) is certainly way
longer than any reasonable affix field, but let's fix this while
we're closing holes in this area.
I chose to do this by silently truncating the input before it can
overrun the buffer, using logic comparable to the existing logic in
get_nextfield(). Certainly there's at least as good an argument for
raising an error, but for now let's follow the existing precedent.
Reported-by: Igor Stepansky <igor.stepansky@orca.security>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Discussion: https://postgr.es/m/864123.1776810909@sss.pgh.pa.us
Backpatch-through: 14
This function writes into a caller-supplied buffer of length
2 * MAXNORMLEN, which should be plenty in real-world cases.
However a malicious affix file could supply an affix long
enough to overrun that. Defend by just rejecting the match
if it would overrun the buffer. I also inserted a check of
the input word length against Affix->replen, just to be sure
we won't index off the buffer, though it would be caller error
for that not to be true.
Also make the actual copying steps a bit more readable, and remove
an unnecessary requirement for the whole input word to fit into the
output buffer (even though it always will with the current caller).
The lack of documentation in this code makes my head hurt, so
I also reverse-engineered a basic header comment for CheckAffix.
Reported-by: Xint Code
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Discussion: https://postgr.es/m/641711.1776792744@sss.pgh.pa.us
Backpatch-through: 14
This commit tweaks ALTER INDEX .. ATTACH PARTITION to attempt a
validation of a parent index in the case where an index is already
attached but the parent is not yet valid. This occurs in cases where a
parent index was created invalid such as with CREATE INDEX ONLY, but was
left invalid after an invalid child index was attached (partitioned
indexes set indisvalid to false if at least one partition is
!indisvalid, indisvalid is true in a partitioned table iff all
partitions are indisvalid). This could leave a partition tree in a
situation where a user could not bring the parent index back to valid
after fixing the child index, as there is no built-in mechanism to do
so. This commit relies on the fact that repeated ATTACH PARTITION
commands on the same index silently succeed.
An invalid parent index is more than just a passive issue. It causes
for example ON CONFLICT on a partitioned table if the invalid parent
index is used to enforce a unique constraint.
Some test cases are added to track some of problematic patterns, using a
set of partition trees with combinations of invalid indexes and ATTACH
PARTITION.
Reported-by: Mohamed Ali <moali.pg@gmail.com>
Author: Sami Imseih <sanmimseih@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Haibo Yan <tristan.yim@gmail.com>
Discussion: http://postgr.es/m/CAGnOmWqi1D9ycBgUeOGf6mOCd2Dcf=6sKhbf4sHLs5xAcKVCMQ@mail.gmail.com
Backpatch-through: 14
We were using "select count(*) into x from generate_series(1,
1_000_000_000_000)" to waste one second waiting for a statement
timeout trap. Aside from consuming CPU to little purpose, this could
easily eat several hundred MB of temporary file space, which has been
observed to cause out-of-disk-space errors in the buildfarm.
Let's just use "pg_sleep(10)", which is far less resource-intensive.
Also update the "when others" exception handler so that if it does
ever again trap an error, it will tell us what error. The cause of
these intermittent buildfarm failures had been obscure for awhile.
Discussion: https://postgr.es/m/557992.1776779694@sss.pgh.pa.us
Backpatch-through: 14
When a rule action or rule qualification references NEW.col where col
is a generated column (stored or virtual), the rewriter produces
incorrect results.
rewriteTargetListIU removes generated columns from the query's target
list, since stored generated columns are recomputed by the executor
and virtual ones store nothing. However, ReplaceVarsFromTargetList
then cannot find these columns when resolving NEW references during
rule rewriting. For UPDATE, the REPLACEVARS_CHANGE_VARNO fallback
redirects NEW.col to the original target relation, making it read the
pre-update value (same as OLD.col). For INSERT,
REPLACEVARS_SUBSTITUTE_NULL replaces it with NULL. Both are wrong
when the generated column depends on columns being modified.
Fix by building target list entries for generated columns from their
generation expressions, pre-resolving the NEW.attribute references
within those expressions against the query's targetlist, and passing
them together with the query's targetlist to ReplaceVarsFromTargetList.
Back-patch to all supported branches. Virtual generated columns were
added in v18, so the back-patches in pre-v18 branches only handle
stored generated columns.
Reported-by: SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com>
Author: Richard Guo <guofenglinux@gmail.com>
Author: Dean Rasheed <dean.a.rasheed@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAHg+QDexGTmCZzx=73gXkY2ZADS6LRhpnU+-8Y_QmrdTS6yUhA@mail.gmail.com
Backpatch-through: 14
When the startup process exists with a FATAL error during PM_STARTUP,
the postmaster called ExitPostmaster() directly, assuming that no other
processes are running at this stage. Since 7ff23c6d27, this
assumption is not true, as the checkpointer, the background writer, the
IO workers and bgworkers kicking in early would be around.
This commit removes the startup-specific shortcut happening in
process_pm_child_exit() for a failing startup process during PM_STARTUP,
falling down to the existing exit() flow to signal all the started
children with SIGQUIT, so as we have no risk of creating orphaned
processes.
This required an extra change in HandleFatalError() for v18 and newer
versions, as an assertion could be triggered for PM_STARTUP. It is now
incorrect. In v17 and older versions, HandleChildCrash() needs to be
changed to handle PM_STARTUP so as children can be waited on.
While on it, fix a comment at the top of postmaster.c. It was claiming
that the checkpointer and the background writer were started after
PM_RECOVERY. That is not the case.
Author: Ayush Tiwari <ayushtiwari.slg01@gmail.com>
Discussion: https://postgr.es/m/CAJTYsWVoD3V9yhhqSae1_wqcnTdpFY-hDT7dPm5005ZFsL_bpA@mail.gmail.com
Backpatch-through: 15
The documentation for jit_debugging_support and jit_profiling_support
previously stated that these parameters can only be set at server start.
However, both parameters use the PGC_SU_BACKEND context, meaning they
can be set at session start by superusers or users granted the appropriate
SET privilege, but cannot be changed within an active session.
This commit updates the documentation to reflect the actual behavior.
Backpatch to all supported versions.
Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://postgr.es/m/CAHGQGwEpMDpB-K8SSUVRRHg6L6z3pLAkekd9aviOS=ns0EC=+Q@mail.gmail.com
Backpatch-through: 14
The documentation stated only that the log file created by pg_ctl -l is
inaccessible to other users by default. However, since commit c37b3d0,
the actual behavior is that only the cluster owner has access by default,
but users in the same group as the cluster owner may also read the file
if group access is enabled in the cluster.
This commit updates the documentation to describe this behavior
more clearly.
Backpatch to all supported versions.
Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Andreas Karlsson <andreas@proxel.se>
Reviewed-by: Xiaopeng Wang <wxp_728@163.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/OS9PR01MB1214959BE987B4839E3046050F54BA@OS9PR01MB12149.jpnprd01.prod.outlook.com
Backpatch-through: 14
* JOHAB: replace the incorrect "simplified Chinese" description with
a correct one that identifies it as the Korean combining (Johab)
encoding standardized in KS X 1001 annex 3.
* EUC_KR: drop a stray space before the comma in the existing
comment, and note that the encoding covers the KS X 1001
precomposed (Wansung) form.
* UHC: spell out "Unified Hangul Code", clarify that it is
Microsoft Windows CodePage 949, and describe its relationship to
EUC-KR (superset covering all 11,172 precomposed Hangul syllables).
Backpatch-through: 14
Author: Henson Choi <assam258@gmail.com>
Discussion: https://postgr.es/m/CAAAe_zAFz1v-3b7Je4L%2B%3DwZM3UGAczXV47YVZfZi9wbJxspxeA%40mail.gmail.com
The goal of this module is to provide an entry point for the coverage of
the low-level compression and decompression PGLZ routines. The new test
is moved to a new parallel group, with all the existing
compression-related tests added to it.
This includes tests for the cases detected by fuzzing that emulate
corrupted compressed data, as fixed by 2b5ba2a0a1:
- Set control bit with read of a match tag, where no data follows.
- Set control bit with read of a match tag, where 1 byte follows.
- Set control bit with match tag where length nibble is 3 bytes
(extended case).
While on it, some tests are added for compress/decompress roundtrips,
and for check_complete=false/true. Like 2b5ba2a0a1, backpatch to all
the stable branches.
Discussion: https://postgr.es/m/adw647wuGjh1oU6p@paquier.xyz
Backpatch-through: 14
Three routines in pgstat_database.c incorrectly ignore the database OID
provided by their caller, using MyDatabaseId instead:
- pgstat_report_connect()
- pgstat_report_disconnect()
- pgstat_reset_database_timestamp()
The first two functions, for connection and disconnection, each have a
single caller that already passes MyDatabaseId. This was harmless,
still incorrect.
The timestamp reset function also has a single caller, but in this case
the issue has a real impact: it fails to reset the timestamp for the
shared-database entry (datid=0) when operating on shared objects. This
situation can occur, for example, when resetting counters for shared
relations via pg_stat_reset_single_table_counters().
There is currently one test in the tree that checks the reset of a
shared relation, for pg_shdescription, we rely on it to check what is
stored in pg_stat_database. As stats_reset may be NULL, two resets are
done to provide a baseline for comparison.
Author: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Dapeng Wang <wangdp20191008@gmail.com>
Discussion: https://postgr.es/m/ABBD5026-506F-4006-A569-28F72C188693@gmail.com
Backpatch-through: 15
When decoding a match tag, pglz_decompress() reads 2 bytes (or 3
for extended-length matches) from the source buffer before checking
whether enough data remains. The existing bounds check (sp > srcend)
occurs after the reads, so truncated compressed data that ends
mid-tag causes a read past the allocated buffer.
Fix by validating that sufficient source bytes are available before
reading each part of the match tag. The post-read sp > srcend
check is no longer needed and is removed.
Found by fuzz testing with libFuzzer and AddressSanitizer.
Backpatch-through: 14
make_ctags did not include field members of structs since the commit
964d01ae90.
For example, in the following field of RestrictInfo:
Selectivity norm_selec pg_node_attr(equal_ignore);
pg_node_attr was mistakenly interpreted to be the name of the field.
To fix this, add -I option to ctags command if the command is
Exuberant ctags or Universal ctags (for plain old ctags, struct
members are not included in the tags file anyway).
Also add "-e" and "-n" options to make_ctags. The -e option invokes
ctags command with -e option, which produces TAGS file for emacs. This
allows to eliminate duplicate codes in make_etags so that make_etags
just exec make_ctags with -e option.
The -n option allows not to produce symbolic links in each
sub directory (the default is producing symbolic links).
This includes the follow-up fixes: 87f21d2c68 and ae66716bf3. This
change is applied to v15 and v14, v16 and nwer versions already
including these improvements. One reason why I am doing this backpatch
is that this can be really useful for backpatching purposes, especially
the -n option that limits the number of TAGS/tags files created in the
tree.
Author: Yugo Nagata
Reviewers: Alvaro Herrera, Tatsuo Ishii
Discussion: https://postgr.es/m/flat/20221007154442.76233afc7c5b255c4de6528a%40sraoss.co.jp
Discussion: https://postgr.es/m/adcKr7fob5ZvjhlH@paquier.xyz
Backpatch-through: 14
In nodeWindowAgg.c, the calculations for frame start and end positions
in ROWS and GROUPS modes were performed using simple integer addition.
If a user-supplied offset was sufficiently large (close to INT64_MAX),
adding it to the current row or group index could cause a signed
integer overflow, wrapping the result to a negative number.
This led to incorrect behavior where frame boundaries that should have
extended indefinitely (or beyond the partition end) were treated as
falling at the first row, or where valid rows were incorrectly marked
as out-of-frame. Depending on the specific query and data, these
overflows can result in incorrect query results, execution errors, or
assertion failures.
To fix, use overflow-aware integer addition (ie, pg_add_s64_overflow)
to check for overflows during these additions. If an overflow is
detected, the boundary is now clamped to INT64_MAX. This ensures the
logic correctly treats the boundary as extending to the end of the
partition.
Bug: #19405
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Discussion: https://postgr.es/m/19405-1ecf025dda171555@postgresql.org
Backpatch-through: 14
Commit aa606b931 installed a test that would reference a nonexistent
TupleDesc array entry if a system column is used in COPY FROM WHERE.
Typically this would be harmless, but with bad luck it could result
in a phony "generated columns are not supported in COPY FROM WHERE
conditions" error, and at least in principle it could cause SIGSEGV.
(Compare 570e2fcc0 which fixed the identical problem in another
place.) Also, since c98ad086a it throws an Assert instead.
In the back branches, just guard the test to make it a safe no-op for
system columns. Commit 21c69dc73 installed a more aggressive answer
in master.
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/6f435023-8ab6-47c2-ba07-035d0c4212f9@gmail.com
Backpatch-through: 14-18
LLVM 22 has the fix that we copied into our tree in commit 9044fc1d and
a new function to reach it[1][2], so we only need to use our copy for
Aarch64 + LLVM < 22. The only change to the final version that our copy
didn't get is a new LLVM_ABI macro, but that isn't appropriate for us.
Our copy is hopefully now frozen and would only need maintenance if bugs
are found in the upstream code.
Non-Aarch64 systems now also use the new API with LLVM 22. It allocates
all sections with one contiguous mmap() instead of one per
section. We could have done that earlier, but commit 9044fc1d wanted to
limit the blast radius to the affected systems. We might as well
benefit from that small improvement everywhere now that it is available
out of the box.
We can't delete our copy until LLVM 22 is our minimum supported version,
or we switch to the newer JITLink API for at least Aarch64.
[1] https://github.com/llvm/llvm-project/pull/71968
[2] https://github.com/llvm/llvm-project/pull/174307
Backpatch-through: 14
Discussion: https://postgr.es/m/CA%2BhUKGJTumad75o8Zao-LFseEbt%3DenbUFCM7LZVV%3Dc8yg2i7dg%40mail.gmail.com
Several places in tuplestore.c would leave the tuplestore data
structure effectively corrupt if some subroutine were to throw
an error. Notably, if WRITETUP() failed after some number of
successful calls within dumptuples(), the tuplestore would
contain some memtuples pointers that were apparently live
entries but in fact pointed to pfree'd chunks.
In most cases this sort of thing is fine because transaction
abort cleanup is not too picky about the contents of memory that
it's going to throw away anyway. There's at least one exception
though: if a Portal has a holdStore, we're going to call
tuplestore_end() on that, even during transaction abort.
So it's not cool if that tuplestore is corrupt, and that means
tuplestore.c has to be more careful.
This oversight demonstrably leads to crashes in v15 and before,
if a holdable cursor fails to persist its data due to an undersized
temp_file_limit setting. Very possibly the same thing can happen in
v16 and v17 as well, though the specific test case submitted failed
to fail there (cf. 095555daf). The failure is accidentally dodged
as of v18 because 590b045c3 got rid of tuplestore_end's retail tuple
deletion loop. Still, it seems unwise to permit tuplestores to become
internally inconsistent in any branch, so I've applied the same fix
across the board.
Since the known test case for this is rather expensive and doesn't
fail in recent branches, I've omitted it.
Bug: #19438
Reported-by: Dmitriy Kuzmin <kuzmin.db4@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/19438-9d37b179c56d43aa@postgresql.org
Backpatch-through: 14
Functions such as hash_numeric() are not careful to use the correct
PG_RETURN_*() macro according to the return type of that function as
defined in pg_proc. Because that function is meant to return int32,
when the hashed value exceeds 2^31, the 64-bit Datum value won't wrap to
a negative number, which means the Datum won't have the same value as it
would have had it been cast to int32 on a two's complement machine. This
isn't harmless as both datum_image_eq() and datum_image_hash() may receive
a Datum that's been formed and deformed from a tuple in some cases, and
not in other cases. When formed into a tuple, the Datum value will be
coerced into an integer according to the attlen as specified by the
TupleDesc. This can result in two Datums that should be equal being
classed as not equal, which could result in (but not limited to) an error
such as:
ERROR: could not find memoization table entry
Here we fix this by ensuring we cast the Datum value to a signed integer
according to the typLen specified in the datum_image_eq/datum_image_hash
function call before comparing or hashing.
Author: David Rowley <dgrowleyml@gmail.com>
Reported-by: Tender Wang <tndrwang@gmail.com>
Backpatch-through: 14
Discussion: https://postgr.es/m/CAHewXNmcXVFdB9_WwA8Ez0P+m_TQy_KzYk5Ri5dvg+fuwjD_yw@mail.gmail.com
astreamer_tar_parser_content() sent the wrong data pointer when
forwarding MEMBER_TRAILER padding to the next streamer. After
astreamer_buffer_until() buffers the padding bytes, the 'data'
pointer has been advanced past them, but the code passed 'data'
instead of bbs_buffer.data. This caused the downstream consumer
to receive bytes from after the padding rather than the padding
itself, and could read past the end of the input buffer.
astreamer_gzip_decompressor_content() only checked for
Z_STREAM_ERROR from inflate(), silently ignoring Z_DATA_ERROR
(corrupted data) and Z_MEM_ERROR (out of memory). Fix by
treating any return other than Z_OK, Z_STREAM_END, and
Z_BUF_ERROR as fatal.
astreamer_gzip_decompressor_free() missed calling inflateEnd() to
release zlib's internal decompression state.
astreamer_tar_parser_free() neglected to pfree() the streamer
struct itself, leaking it.
astreamer_extractor_content() did not check the return value of
fclose() when closing an extracted file. A deferred write error
(e.g., disk full on buffered I/O) would be silently lost.
Discussion: https://postgr.es/m/results/98c6b630-acbb-44a7-97fa-1692ce2b827c@dunslane.net
Reviewed-By: Tom Lane <tgl@sss.pgh.pa.us>
Backpatch-through: 15
pg_stat_replication is documented to keep the last measured lag values for
a short time after the standby catches up, and then set them to NULL when
there is no WAL activity. However, previously lag values could become NULL
prematurely even while WAL activity was ongoing, especially in logical
replication.
This happened because the code cleared lag when two consecutive reply messages
indicated that the apply location had caught up with the send location.
It did not verify that the reported positions were unchanged, so lag could be
cleared even when positions had advanced between messages. In logical
replication, where the apply location often quickly catches up, this issue was
more likely to occur.
This commit fixes the issue by clearing lag only when the standby reports that
it has fully replayed WAL (i.e., both flush and apply locations have caught up
with the send location) and the write/flush/apply positions remain unchanged
across two consecutive reply messages.
The second message with unchanged positions typically results from
wal_receiver_status_interval, so lag values are cleared after that interval
when there is no activity. This avoids showing stale lag data while preventing
premature NULL values.
Even with this fix, lag may rarely become NULL during activity if identical
position reports are sent repeatedly. Eliminating such duplicate messages
would address this fully, but that change is considered too invasive for stable
branches and will be handled in master only later.
Backpatch to all supported branches.
Author: Shinya Kato <shinya11.kato@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAOzEurTzcUrEzrH97DD7+Yz=HGPU81kzWQonKZvqBwYhx2G9_A@mail.gmail.com
Backpatch-through: 14
The check for a mismatch on the second decoded item pointer
was an exact copy of the first item pointer check, comparing
orig_itemptrs[0] with decoded_itemptrs[0] instead of orig_itemptrs[1]
with decoded_itemptrs[1]. The error message also reported (0, 1) as
the expected value instead of (blk, off). As a result, any decoding
error in the second item pointer (where the varbyte delta encoding
is exercised) would go undetected.
This has been wrong since commit bde7493d1, so backpatch to all
supported versions.
Author: Jianghua Yang <yjhjstz@gmail.com>
Discussion: https://postgr.es/m/CAAZLFmSOD8R7tZjRLZsmpKtJLoqjgawAaM-Pne1j8B_Q2aQK8w@mail.gmail.com
Backpatch-through: 14
If a CHECKPOINT record with nextMulti N is written to the WAL before
the CREATE_ID record for N, and N happens to be the first multixid on
an offset page, the backwards compatibility logic to tolerate WAL
generated by older minor versions (before commit 789d65364c) failed to
compensate for the missing XLOG_MULTIXACT_ZERO_OFF_PAGE record. In
that case, the latest_page_number was initialized at the start of WAL
replay to the page for nextMulti from the CHECKPOINT record, even if
we had not seen the CREATE_ID record for that multixid yet, which
fooled the backwards compatibility logic to think that the page was
already initialized.
To fix, track the last XLOG_MULTIXACT_ZERO_OFF_PAGE that we've seen
separately from latest_page_number. If we haven't seen any
XLOG_MULTIXACT_ZERO_OFF_PAGE records yet, use
SimpleLruDoesPhysicalPageExist() to check if the page needs to be
initialized.
Reported-by: duankunren.dkr <duankunren.dkr@alibaba-inc.com>
Analyzed-by: duankunren.dkr <duankunren.dkr@alibaba-inc.com>
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://www.postgresql.org/message-id/c4ef1737-8cba-458e-b6fd-4e2d6011e985.duankunren.dkr@alibaba-inc.com
Backpatch-through: 14-18
Send the correct amount of data to the next astreamer, not the
whole allocated buffer size. This bug escaped detection because
in present uses the next astreamer is always a tar-file parser
which is insensitive to trailing garbage. But that may not
be true in future uses.
Author: Andrew Dunstan <andrew@dunslane.net>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/2178517.1774064942@sss.pgh.pa.us
Backpatch-through: 15
Commit 6eedb2a5fd made the logical walsender call
XLogFlush(GetXLogInsertRecPtr()) to ensure that all pending WAL is flushed,
fixing a publisher shutdown hang. However, if the last WAL record ends at
a page boundary, GetXLogInsertRecPtr() can return an LSN pointing past
the page header, which can cause XLogFlush() to report an error.
A similar issue previously existed in the GiST code. Commit b1f14c9672
introduced GetXLogInsertEndRecPtr(), which returns a safe WAL insertion end
location (returning the start of the page when the last record ends at a page
boundary), and updated the GiST code to use it with XLogFlush().
This commit fixes the issue by making the logical walsender use
XLogFlush(GetXLogInsertEndRecPtr()) when flushing pending WAL during shutdown.
Backpatch to all supported versions.
Reported-by: Andres Freund <andres@anarazel.de>
Author: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/vzguaguldbcyfbyuq76qj7hx5qdr5kmh67gqkncyb2yhsygrdt@dfhcpteqifux
Backpatch-through: 14
The comment about ParallelWorkerNumbr in parallel.c says:
In parallel workers, it will be set to a value >= 0 and < the number
of workers before any user code is invoked; each parallel worker will
get a different parallel worker number.
However asserts in various places collecting instrumentation allowed
(ParallelWorkerNumber == num_workers). That would be a bug, as the value
is used as index into an array with num_workers entries.
Fixed by adjusting the asserts accordingly. Backpatch to all supported
versions.
Discussion: https://postgr.es/m/5db067a1-2cdf-4afb-a577-a04f30b69167@vondra.me
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Backpatch-through: 14
The function used GetXLogInsertRecPtr() to generate the fake LSN. Most
of the time this is the same as what XLogInsert() would return, and so
it works fine with the XLogFlush() call. But if the last record ends at
a page boundary, GetXLogInsertRecPtr() returns LSN pointing after the
page header. In such case XLogFlush() fails with errors like this:
ERROR: xlog flush request 0/01BD2018 is not satisfied --- flushed only to 0/01BD2000
Such failures are very hard to trigger, particularly outside aggressive
test scenarios.
Fixed by introducing GetXLogInsertEndRecPtr(), returning the correct LSN
without skipping the header. This is the same as GetXLogInsertRecPtr(),
except that it calls XLogBytePosToEndRecPtr().
Initial investigation by me, root cause identified by Andres Freund.
This is a long-standing bug in gistGetFakeLSN(), probably introduced by
c6b92041d3 in PG13. Backpatch to all supported versions.
Reported-by: Peter Geoghegan <pg@bowt.ie>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/vf4hbwrotvhbgcnknrqmfbqlu75oyjkmausvy66ic7x7vuhafx@e4rvwavtjswo
Backpatch-through: 14
The logic of xslt_process() has never considered the fact that
xsltSaveResultToString() would return NULL for an empty string (the
upstream code has always done so, with a string length of 0). This
would cause memcpy() to be called with a NULL pointer, something
forbidden by POSIX.
Like 46ab07ffda and similar fixes, this is backpatched down to all the
supported branches, with a test case to cover this scenario. An empty
string has been always returned in xml2 in this case, based on the
history of the module, so this is an old issue.
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/c516a0d9-4406-47e3-9087-5ca5176ebcf9@gmail.com
Backpatch-through: 14
Commit 2cd40adb85 added the IF NOT EXISTS option to ALTER TABLE ADD COLUMN.
This also enabled IF NOT EXISTS for ALTER FOREIGN TABLE ADD COLUMN,
but the ALTER FOREIGN TABLE documentation was not updated to mention it.
This commit updates the documentation to describe the IF NOT EXISTS option for
ALTER FOREIGN TABLE ADD COLUMN.
While updating that section, also this commit clarifies that the COLUMN keyword
is optional in ALTER FOREIGN TABLE ADD/DROP COLUMN. Previously, part of
the documentation could be read as if COLUMN were required.
This commit adds regression tests covering these ALTER FOREIGN TABLE syntaxes.
Backpatch to all supported versions.
Suggested-by: Fujii Masao <masao.fujii@gmail.com>
Author: Chao Li <lic@highgo.com>
Reviewed-by: Robert Treat <rob@xzilla.net>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAHGQGwFk=rrhrwGwPtQxBesbT4DzSZ86Q3ftcwCu3AR5bOiXLw@mail.gmail.com
Backpatch-through: 14
When make_new_segment() creates an odd-sized segment, the pagemap was
only sized based on a number of usable_pages entries, forgetting that a
segment also contains metadata pages, and that the FreePageManager uses
absolute page indices that cover the entire segment. This
miscalculation could cause accesses to pagemap entries to be out of
bounds. During subsequent reuse of the allocated segment, allocations
landing on pages with indices higher than usable_pages could cause
out-of-bounds pagemap reads and/or writes. On write, 'span' pointers
are stored into the data area, corrupting the allocated objects. On
read (aka during a dsa_free), garbage is interpreted as a span pointer,
typically crashing the server in dsa_get_address().
The normal geometric path correctly sizes the pagemap for all pages in
the segment. The odd-sized path needs to do the same, but it works
forward from usable_pages rather than backward from total_size.
This commit fixes the sizing of the odd-sized case by adding pagemap
entries for the metadata pages after the initial metadata_bytes
calculation, using an integer ceiling division to compute the exact
number of additional entries needed in one go, avoiding any iteration in
the calculation.
An assertion is added in the code path for odd-sized segments, ensuring
that the pagemap includes the metadata area, and that the result is
appropriately sized.
This problem would show up depending on the size requested for the
allocation of a DSA segment. The reporter has noticed this issue when a
parallel hash join makes a DSA allocation large enough to trigger the
odd-sized segment path, but it could happen for anything that does a DSA
allocation.
A regression test is added to test_dsa, down to v17 where the test
module has been introduced. This adds a set of cheap tests to check the
problem, the new assertion being useful for this purpose. Sami has
proposed a test that took a longer time than what I have done here; the
test committed is faster and good enough to check the odd-sized
allocation path.
Author: Paul Bunn <paul.bunn@icloud.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/044401dcabac$fe432490$fac96db0$@icloud.com
Backpatch-through: 14
Previously, when logical replication was running, shutting down
the publisher could cause the logical walsender to enter a busy loop
and prevent the publisher from completing shutdown.
During shutdown, the logical walsender waits for all pending WAL
to be written out. However, some WAL records could remain unflushed,
causing the walsender to wait indefinitely.
The issue occurred because the walsender used XLogBackgroundFlush() to
flush pending WAL. This function does not guarantee that all WAL is written.
For example, WAL generated by a transaction without an assigned
transaction ID that aborts might not be flushed.
This commit fixes the bug by making the logical walsender call XLogFlush()
instead, ensuring that all pending WAL is written and preventing
the busy loop during shutdown.
Backpatch to all supported versions.
Author: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAO6_Xqo3co3BuUVEVzkaBVw9LidBgeeQ_2hfxeLMQcXwovB3GQ@mail.gmail.com
Backpatch-through: 14
It looks like whoever wrote the astreamer (nee bbstreamer) code
thought that pg_log_error() is equivalent to elog(ERROR), but
it's not; it just prints a message. So all these places tried to
continue on after a compression or decompression error return,
with the inevitable result being garbage output and possibly
cascading error messages. We should use pg_fatal() instead.
These error conditions are probably pretty unlikely in practice,
which no doubt accounts for the lack of field complaints.
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/1531718.1772644615@sss.pgh.pa.us
Backpatch-through: 15
This branch missed the IsolationUsesXactSnapshot() check. That led to EPQ on
repeatable read and serializable isolation levels. This commit fixes the
issue and provides a simple isolation check for that. Backpatch through v15
where MERGE statement was introduced.
Reported-by: Tender Wang <tndrwang@gmail.com>
Discussion: https://postgr.es/m/CAPpHfdvzZSaNYdj5ac-tYRi6MuuZnYHiUkZ3D-AoY-ny8v%2BS%2Bw%40mail.gmail.com
Author: Tender Wang <tndrwang@gmail.com>
Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com>
Backpatch-through: 15
In ALTER TABLE ... ADD/DROP COLUMN, the COLUMN keyword is optional. However,
part of the documentation could be read as if COLUMN were required, which may
mislead users about the command syntax.
This commit updates the ALTER TABLE documentation to clearly state that
COLUMN is optional for ADD and DROP.
Also this commit adds regression tests covering ALTER TABLE ... ADD/DROP
without the COLUMN keyword.
Backpatch to all supported versions.
Author: Chao Li <lic@highgo.com>
Reviewed-by: Robert Treat <rob@xzilla.net>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAEoWx2n6ShLMOnjOtf63TjjgGbgiTVT5OMsSOFmbjGb6Xue1Bw@mail.gmail.com
Backpatch-through: 14
This fixes a problem similar to ad8c86d22c. In this case, the test
could fail under the following circumstances:
- The primary is stopped with teardown_node(), meaning that it may not
be able to send all its WAL records to standby_1 and standby_2.
- If standby_2 receives more records than standby_1, attempting to
reconnect standby_2 to the promoted standby_1 would fail because of a
timeline fork.
This race condition is fixed with a simple trick: instead of tearing
down the primary, it is stopped cleanly so as all the WAL records of the
primary are received and flushed by both standby_1 and standby_2. Once
we do that, there is no need for a wait_for_catchup() before stopping
the node. The test wants to check that a timeline jump can be achieved
when reconnecting a standby to a promoted standby in the same cluster,
hence an immediate stop of the primary is not required.
This failure is harder to reach than the previous instability of
009_twophase, still the buildfarm has been able to detect this failure
at least once. I have tried Alexander Lakhin's test trick with the
bgwriter and very aggressive standby snapshots, but I could not
reproduce it directly. It is reachable, as the buildfarm has proved.
Backpatch down to all supported branches, and this problem can lead to
spurious failures in the buildfarm.
Discussion: https://postgr.es/m/493401a8-063f-436a-8287-a235d9e065fc@gmail.com
Backpatch-through: 14
The code path in astreamer_lz4_decompressor_content() that updated
the output pointers when the output buffer isn't full was wrong.
It advanced next_out by bytes_written, which could include previous
decompression output not just that of the current cycle. The
correct amount to advance is out_size. While at it, make the
output pointer updates look more like the input pointer updates.
This bug is pretty hard to reach, as it requires consecutive
compression frames that are too small to fill the output buffer.
pg_dump could have produced such data before 66ec01dc4, but
I'm unsure whether any files we use astreamer with would be
likely to contain problematic data.
Author: Chao Li <lic@highgo.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/0594CC79-1544-45DD-8AA4-26270DE777A7@gmail.com
Backpatch-through: 15