Commit graph

12771 commits

Author SHA1 Message Date
David Rowley
c456e39113 Optimize tuple deformation
This commit includes various optimizations to improve the performance of
tuple deformation.

We now precalculate CompactAttribute's attcacheoff, which allows us to
remove the code from the deform routines which was setting the
attcacheoff.  Setting the attcacheoff is now handled by
TupleDescFinalize(), which must be called before the TupleDesc is used for
anything.  Having TupleDescFinalize() means we can store the first
attribute in the TupleDesc which does not have an offset cached.  That
allows us to add a dedicated deforming loop to deform all attributes up
to the final one with an attcacheoff set, or up to the first NULL
attribute, whichever comes first.

Here we also improve tuple deformation performance of tuples with NULLs.
Previously, if the HEAP_HASNULL bit was set in the tuple's t_infomask,
deforming would, one-by-one, check each and every bit in the NULL bitmap
to see if it was zero.  Now, we process the NULL bitmap 1 byte at a time
rather than 1 bit at a time to find the attnum with the first NULL.  We
can now deform the tuple without checking for NULLs up to just before that
attribute.

We also record the maximum attribute number which is guaranteed to exist
in the tuple, that is, has a NOT NULL constraint and isn't an
atthasmissing attribute.  When deforming only attributes prior to the
guaranteed attnum, we've no need to access the tuple's natt count.  As an
additional optimization, we only count fixed-width columns when
calculating the maximum guaranteed column, as this eliminates the need to
emit code to fetch byref types in the deformation loop for guaranteed
attributes.

Some locations in the code deform tuples that have yet to go through NOT
NULL constraint validation.  We're unable to perform the guaranteed
attribute optimization when that's the case.  This optimization is opt-in
via the TupleTableSlot using the TTS_FLAG_OBEYS_NOT_NULL_CONSTRAINTS
flag.

This commit also adds a more efficient way of populating the isnull
array by using a bit-wise SWAR trick which performs multiplication on the
inverse of the tuple's bitmap byte and masking out all but the lower bit
of each of the boolean's byte.  This results in much more optimal code
when compared to determining the NULLness via att_isnull().  8 isnull
elements are processed at once using this method, which means we need to
round the tts_isnull array size up to the next 8 bytes.  The palloc code
does this anyway, but the round-up needed to be formalized so as not to
overwrite the sentinel byte in MEMORY_CONTEXT_CHECKING builds.  Doing
this also allows the NULL-checking deforming loop to more efficiently
check the isnull array, rather than doing the bit-wise processing for each
attribute that att_isnull() does.

The level of performance improvement from these changes seems to vary
depending on the CPU architecture.  Apple's M chips seem particularly
fond of the changes, with some of the tested deform-heavy queries going
over twice as fast as before.  With x86-64, the speedups aren't quite as
large.  With tables containing only a small number of columns, the
speedups will be less.

Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Discussion: https://postgr.es/m/CAApHDvpoFjaj3%2Bw_jD5uPnGazaw41A71tVJokLDJg2zfcigpMQ%40mail.gmail.com
2026-03-16 11:46:00 +13:00
David Rowley
503620311e Add all required calls to TupleDescFinalize()
As of this commit all TupleDescs must have TupleDescFinalize() called on
them once the TupleDesc is set up and before BlessTupleDesc() is called.

In this commit, TupleDescFinalize() does nothing. This change has only
been separated out from the commit that properly implements this function
to make the change more obvious.  Any extension which makes its own
TupleDesc will need to be modified to call the new function.

The follow-up commit which properly implements TupleDescFinalize() will
cause any code which forgets to do this to fail in assert-enabled builds in
BlessTupleDesc().  It may still be worth mentioning this change in the
release notes so that extension authors update their code.

Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Discussion: https://postgr.es/m/CAApHDvpoFjaj3%2Bw_jD5uPnGazaw41A71tVJokLDJg2zfcigpMQ%40mail.gmail.com
2026-03-16 11:45:49 +13:00
Melanie Plageman
99bf1f8aa6 Save vmbuffer in heap-specific scan descriptors for on-access pruning
Future commits will use the visibility map in on-access pruning to fix
VM corruption and set the VM if the page is all-visible.

Saving the vmbuffer in the scan descriptor reduces the number of times
it would need to be pinned and unpinned, making the overhead of doing so
negligible.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/C3AB3F5B-626E-4AAA-9529-23E9A20C727F%40gmail.com
2026-03-15 11:09:10 -04:00
Peter Eisentraut
cd083b54bd Make typeof and typeof_unqual fallback definitions work on C++11
These macros were unintentionally using C++14 features. This replaces
them with valid C++11 code.

Tested locally by compiling with -std=c++11 (which reproduced the
original issue).

Author: Jelte Fennema-Nio <postgres@jeltef.nl>
Discussion: https://www.postgresql.org/message-id/flat/92f9750f-c7f6-42d8-9a4a-85a3cbe808f3%40eisentraut.org
2026-03-15 07:36:27 +01:00
David Rowley
4deecb52af Allow sibling call optimization in slot_getsomeattrs_int()
This changes the TupleTableSlotOps contract to make it so the
getsomeattrs() function is in charge of calling
slot_getmissingattrs().

Since this removes all code from slot_getsomeattrs_int() aside from the
getsomeattrs() call itself, we may as well adjust slot_getsomeattrs() so
that it calls getsomeattrs() directly.  We leave slot_getsomeattrs_int()
intact as this is still called from the JIT code.

Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Discussion: https://postgr.es/m/CAApHDvodSVBj3ypOYbYUCJX%2BNWL%3DVZs63RNBQ_FxB_F%2B6QXF-A%40mail.gmail.com
2026-03-14 13:52:09 +13:00
Peter Geoghegan
d774072f00 Move fake LSN infrastructure out of GiST.
Move utility functions used by GiST to generate fake LSNs into xlog.c
and xloginsert.c, so that other index AMs can also generate fake LSNs.

Preparation for an upcoming commit that will add support for fake LSNs
to nbtree, allowing its dropPin optimization to be used during scans of
unlogged relations.  That commit is itself preparation for another
upcoming commit that will add a new amgetbatch/btgetbatch interface to
enable I/O prefetching.

Bump XLOG_PAGE_MAGIC due to XLOG_GIST_ASSIGN_LSN becoming
XLOG_ASSIGN_LSN.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAH2-WzkehuhxyuA8quc7rRN3EtNXpiKsjPfO8mhb+0Dr2K0Dtg@mail.gmail.com
2026-03-13 19:38:17 -04:00
Tomas Vondra
b1f14c9672 Use GetXLogInsertEndRecPtr in gistGetFakeLSN
The function used GetXLogInsertRecPtr() to generate the fake LSN. Most
of the time this is the same as what XLogInsert() would return, and so
it works fine with the XLogFlush() call. But if the last record ends at
a page boundary, GetXLogInsertRecPtr() returns LSN pointing after the
page header. In such case XLogFlush() fails with errors like this:

  ERROR: xlog flush request 0/01BD2018 is not satisfied --- flushed only to 0/01BD2000

Such failures are very hard to trigger, particularly outside aggressive
test scenarios.

Fixed by introducing GetXLogInsertEndRecPtr(), returning the correct LSN
without skipping the header. This is the same as GetXLogInsertRecPtr(),
except that it calls XLogBytePosToEndRecPtr().

Initial investigation by me, root cause identified by Andres Freund.

This is a long-standing bug in gistGetFakeLSN(), probably introduced by
c6b92041d3 in PG13. Backpatch to all supported versions.

Reported-by: Peter Geoghegan <pg@bowt.ie>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/vf4hbwrotvhbgcnknrqmfbqlu75oyjkmausvy66ic7x7vuhafx@e4rvwavtjswo
Backpatch-through: 14
2026-03-13 23:25:24 +01:00
Andres Freund
ce5d489166 Fix bug due to confusion about what IsMVCCSnapshot means
In 0b96e734c5 I (Andres) relied on page_collect_tuples() being called only
with an MVCC snapshot, and added assertions to that end, but did not realize
that IsMVCCSnapshot() allows both proper MVCC snapshots and historical
snapshots, which behave quite similarly to MVCC snapshots.

Unfortunately that can lead to incorrect visibility results during logical
decoding, as a historical snapshot is interpreted as a plain MVCC
snapshot. The only reason this wasn't noticed earlier is that it's hard to
reach as most of the time there are no sequential scans during logical
decoding.

To fix the bug and avoid issues like this in the future, split
IsMVCCSnapshot() into IsMVCCSnapshot() and IsMVCCLikeSnapshot(), where now
only the latter includes historic snapshots.

One effect of this is that during logical decoding no page-at-a-time snapshots
are used, as otherwise runtime branches to handle historic snapshots would be
needed in some performance critical paths. Given how uncommon sequential scans
are during logical decoding, that seems acceptable.

Author: Antonin Houska <ah@cybertec.at>
Reported-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/61812.1770637345@localhost
2026-03-13 13:53:19 -04:00
Nathan Bossart
e0a3a3fd53 Optimize COPY FROM (FORMAT {text,csv}) using SIMD.
Presently, such commands scan the input buffer one byte at a time
looking for special characters.  This commit adds a new path that
uses SIMD instructions to skip over chunks of data without any
special characters.  This can be much faster.

To avoid regressions, SIMD processing is disabled for the remainder
of the COPY FROM command as soon as we encounter a short line or a
special character (except for end-of-line characters, else we'd
always disable it after the first line).  This is perhaps too
conservative, but it could probably be made more lenient in the
future via fine-tuned heuristics.

Author: Nazir Bilal Yavuz <byavuz81@gmail.com>
Co-authored-by: Shinya Kato <shinya11.kato@gmail.com>
Reviewed-by: Ayoub Kazar <ma_kazar@esi.dz>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Reviewed-by: Neil Conway <neil.conway@gmail.com>
Reviewed-by: Greg Burd <greg@burd.me>
Tested-by: Manni Wood <manni.wood@enterprisedb.com>
Tested-by: Mark Wong <markwkm@gmail.com>
Discussion: https://postgr.es/m/CAOzEurSW8cNr6TPKsjrstnPfhf4QyQqB4tnPXGGe8N4e_v7Jig%40mail.gmail.com
2026-03-13 11:07:32 -05:00
Heikki Linnakangas
f9de9bf302 Add callback for I/O error messages in SLRUs
Historically, all SLRUs were addressed by transaction IDs, but that
hasn't been true for a long time. However, the error message on I/O
error still always talked about accessing a transaction ID.

This commit adds a callback that allows subsystems to construct their
own error messages, which can then correctly refer to a transaction
ID, multixid or whatever else is used to address the particular SLRU.

Author: Maxim Orlov <orlovmg@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://www.postgresql.org/message-id/CACG=ezZZfurhYV+66ceubxQAyWqv9vaUi0yoO4-t48OE5xc0DQ@mail.gmail.com
2026-03-13 16:21:06 +02:00
Fujii Masao
723619eaa3 Add stats_reset column to pg_stat_database_conflicts.
This commit adds a stats_reset column to pg_stat_database_conflicts,
allowing users to see when the statistics in this view were last reset.
This makes the view consistent with pg_stat_database and other statistics
views.

Catalog version bumped.

Author: Shihao Zhong <zhong950419@gmail.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAGRkXqS98OebEWjax99_LVAECsxCB8i=BfsdAL34i-5QHfwyOQ@mail.gmail.com
2026-03-13 22:17:14 +09:00
Peter Eisentraut
59292f7aac Change copyObject() to use typeof_unqual
Currently, when the argument of copyObject() is const-qualified, the
return type is also, because the use of typeof carries over all the
qualifiers.  This is incorrect, since the point of copyObject() is to
make a copy to mutate.  But apparently no code ran into it.

The new implementation uses typeof_unqual, which drops the qualifiers,
making this work correctly.

typeof_unqual is standardized in C23, but all recent versions of all
the usual compilers support it even in non-C23 mode, at least as
__typeof_unqual__.  We add a configure/meson test for typeof_unqual
and __typeof_unqual__ and use it if it's available, else we use the
existing fallback of just returning void *.

This is the second attempt, after the first attempt in commit
4cfce4e62c was reverted.  The following two points address problems
with the earlier version:

We test the underscore variant first so that there is a higher chance
that clang used for bitcode also supports it, since we don't test that
separately.

Unlike the typeof test, the typeof_unqual test also tests with a void
pointer similar to how copyObject() would use it, because that is not
handled by MSVC, so we want the test to fail there.

Reviewed-by: David Geier <geidav.pg@gmail.com>
Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl>
Discussion: https://www.postgresql.org/message-id/flat/92f9750f-c7f6-42d8-9a4a-85a3cbe808f3%40eisentraut.org
2026-03-13 07:06:57 +01:00
Andrew Dunstan
a0b6ef29a5 Enable fast default for domains with non-volatile constraints
Previously, ALTER TABLE ADD COLUMN always forced a table rewrite when
the column type was a domain with constraints (CHECK or NOT NULL), even
if the default value satisfied those constraints.  This was because
contain_volatile_functions() considers CoerceToDomain immutable, so
the code conservatively assumed any constrained domain might fail.

Improve this by using soft error handling (ErrorSaveContext) to evaluate
the CoerceToDomain expression at ALTER TABLE time.  If the default value
passes the domain's constraints, the value is stored as a "missing"
attribute default and no table rewrite is needed.  If the constraint
check fails, we fall back to a table rewrite, preserving the historical
behavior that constraint violations are only raised when the table
actually contains rows.

Domains with volatile constraint expressions always require a table
rewrite since the constraint result could differ per evaluation and
cannot be cached.

Author: Jian He <jian.universality@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Reviewed-by: Viktor Holmberg <viktor.holmberg@aiven.io>
Discussion: https://postgr.es/m/CACJufxE_+iZBR1i49k_AHigppPwLTJi6km8NOsC7FWvKdEmmXg@mail.gmail.com
2026-03-12 18:05:01 -04:00
Andrew Dunstan
487cf2cbd2 Extend DomainHasConstraints() to optionally check constraint volatility
Add an optional bool *has_volatile output parameter to
DomainHasConstraints().  When non-NULL, the function checks whether any
CHECK constraint contains a volatile expression.  Callers that don't
need this information pass NULL and get the same behavior as before.

This is needed by a subsequent commit that enables the fast default
optimization for domains with non-volatile constraints: we can safely
evaluate such constraints once at ALTER TABLE time, but volatile
constraints require a full table rewrite.

Author: Jian He <jian.universality@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Reviewed-by: Viktor Holmberg <viktor.holmberg@aiven.io>
Discussion: https://postgr.es/m/CACJufxE_+iZBR1i49k_AHigppPwLTJi6km8NOsC7FWvKdEmmXg@mail.gmail.com
2026-03-12 18:04:16 -04:00
Peter Geoghegan
d071e1cfec nbtree: Avoid allocating _bt_search stack.
Avoid allocating memory for an nbtree descent stack during index scans.
We only require a descent stack during inserts, when it is used to
determine where to insert a new pivot tuple/downlink into the target
leaf page's parent page in the event of a page split.  (Page deletion's
first phase also performs a _bt_search that requires a descent stack.)

This optimization improves performance by minimizing palloc churn.  It
speeds up index scans that call _bt_search frequently/descend the index
many times, especially when the cost of scanning the index dominates
(e.g., with index-only skip scans).  Testing has shown that the
underlying issue causes performance problems for an upcoming patch that
will replace btgettuple with a new btgetbatch interface to enable I/O
prefetching.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAH2-Wzmy7NMba9k8m_VZ-XNDZJEUQBU8TeLEeL960-rAKb-+tQ@mail.gmail.com
2026-03-12 13:22:36 -04:00
Richard Guo
383eb21ebf Convert NOT IN sublinks to anti-joins when safe
The planner has historically been unable to convert "x NOT IN (SELECT
y ...)" sublinks into anti-joins.  This is because standard SQL
semantics for NOT IN require that if the comparison "x = y" returns
NULL, the "NOT IN" expression evaluates to NULL (effectively false),
causing the row to be discarded.  In contrast, an anti-join preserves
the row if no match is found.  Due to this semantic mismatch regarding
NULL handling, the conversion was previously considered unsafe.

However, if we can prove that neither side of the comparison can yield
NULL values, and further that the operator itself cannot return NULL
for non-null inputs, the behavior of NOT IN and anti-join becomes
identical.  Enabling this conversion allows the planner to treat the
sublink as a first-class relation rather than an opaque SubPlan
filter.  This unlocks global join ordering optimization and permits
the selection of the most efficient join algorithm based on cost,
often yielding significant performance improvements for large
datasets.

This patch verifies that neither side of the comparison can be NULL
and that the operator is safe regarding NULL results before performing
the conversion.

To verify operator safety, we require that the operator be a member of
a B-tree or Hash operator family.  This serves as a proxy for standard
boolean behavior, ensuring the operator does not return NULL on valid
non-null inputs, as doing so would break index integrity.

For operand non-nullability, this patch makes use of several existing
mechanisms.  It leverages the outer-join-aware-Var infrastructure to
verify that a Var does not come from the nullable side of an outer
join, and consults the NOT-NULL-attnums hash table to efficiently
verify schema-level NOT NULL constraints.  Additionally, it employs
find_nonnullable_vars to identify Vars forced non-nullable by qual
clauses, and expr_is_nonnullable to deduce non-nullability for other
expression types.

The logic for verifying the non-nullability of the subquery outputs
was adapted from prior work by David Rowley and Tom Lane.

Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com>
Reviewed-by: Zhang Mingli <zmlpostgres@gmail.com>
Reviewed-by: Japin Li <japinli@hotmail.com>
Discussion: https://postgr.es/m/CAMbWs495eF=-fSa5CwJS6B-BaEi3ARp0UNb4Lt3EkgUGZJwkAQ@mail.gmail.com
2026-03-12 09:45:18 +09:00
Andres Freund
b0f4ff3c92 bufmgr: Remove the, now obsolete, BM_JUST_DIRTIED
Due to the recent changes to use a share-exclusive mode for setting hint bits
and for flushing pages - instead of using share mode as before - a buffer
cannot be dirtied while the flush is ongoing.  The reason we needed
JUST_DIRTIED was to handle the case where the buffer was dirtied while IO was
ongoing - which is not possible anymore.

Discussion: https://postgr.es/m/5ubipyssiju5twkb7zgqwdr7q2vhpkpmuelxfpanetlk6ofnop@hvxb4g2amb2d
2026-03-11 14:58:29 -04:00
Tomas Vondra
943e881733 Do not lock in BufferGetLSNAtomic() on archs with 8 byte atomic reads
On platforms where we can read or write the whole LSN atomically, we do
not need to lock the buffer header to prevent torn LSNs. We can do this
only on platforms with PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY, and when the
pd_lsn field is properly aligned.

For historical reasons the PageXLogRecPtr was defined as a struct with
two uint32 fields. This replaces it with a single uint64 value, to make
the intent clearer. To prevent issues with weak typedefs the value is
still wrapped in a struct.

This also adjusts heapfuncs() in pageinspect, to ensure proper alignment
when reading the LSN from a page on alignment-sensitive hardware.

Idea by Andres Freund. Initial patch by Andreas Karlsson, improved by
Peter Geoghegan. Minor tweaks by me.

Author: Andreas Karlsson <andreas@proxel.se>
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/b6610c3b-3f59-465a-bdbb-8e9259f0abc4@proxel.se
2026-03-11 19:46:08 +01:00
Peter Eisentraut
9c05f152b5 Fixes for C++ typeof implementation
This fixes two bugs in commit 1887d822f1.

First, if we are using the fallback C++ implementation of typeof, then
we need to include the C++ header <type_traits> for
std::remove_reference_t.  This header is also likely to be used for
other C++ implementations of type tricks, so we'll put it into the
global includes.

Second, for the case that the C compiler supports typeof in a spelling
that is not "typeof" (for example, __typeof__), then we need to #undef
typeof in the C++ section to avoid warnings about duplicate macro
definitions.

Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl>
Discussion: https://www.postgresql.org/message-id/flat/92f9750f-c7f6-42d8-9a4a-85a3cbe808f3%40eisentraut.org
2026-03-11 11:54:10 +01:00
Peter Eisentraut
d4a080b8a1 Remove Int8GetDatum function
We have no uses of Int8GetDatum in our tree and did not have for a
long time (or never), and the inverse does not exist either.

Author: Kirill Reshke <reshkekirill@gmail.com>
Suggested-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/CALdSSPhFyb9qLSHee73XtZm1CBWJNo9+JzFNf-zUEWCRW5yEiQ@mail.gmail.com
2026-03-11 10:46:08 +01:00
Andres Freund
82467f627b Require share-exclusive lock to set hint bits and to flush
At the moment hint bits can be set with just a share lock on a page (and,
until 45f658dacb, in one case even without any lock). Because of this we need
to copy pages while writing them out, as otherwise the checksum could be
corrupted.

The need to copy the page is problematic to implement AIO writes:

1) Instead of just needing a single buffer for a copied page we need one for
   each page that's potentially undergoing I/O
2) To be able to use the "worker" AIO implementation the copied page needs to
   reside in shared memory

It also causes problems for using unbuffered/direct-IO, independent of AIO:
Some filesystems, raid implementations, ... do not tolerate the data being
written out to change during the write. E.g. they may compute internal
checksums that can be invalidated by concurrent modifications, leading e.g. to
filesystem errors (as the case with btrfs).

It also just is plain odd to allow modifications of buffers that are just
share locked.

To address these issues, this commit changes the rules so that modifications
to pages are not allowed anymore while holding a share lock. Instead the new
share-exclusive lock (introduced in fcb9c977aa) allows at most one backend to
modify a buffer while other backends have the same page share locked. An
existing share-lock can be upgraded to a share-exclusive lock, if there are no
conflicting locks. For that BufferBeginSetHintBits()/BufferFinishSetHintBits()
and BufferSetHintBits16() have been introduced.

To prevent hint bits from being set while the buffer is being written out,
writing out buffers now requires a share-exclusive lock.

The use of share-exclusive to gate setting hint bits means that from now on
only one backend can set hint bits at a time. To allow multiple backends to
set hint bits would require more complicated locking: For setting hint bits
we'd need to store the count of backends currently setting hint bits and we
would need another lock-level for I/O conflicting with the lock-level to set
hint bits. Given that the share-exclusive lock for setting hint bits is only
held for a short time, that backends would often just set the same hint bits
and that the cost of occasionally not setting hint bits in hotly accessed
pages is fairly low, this seems like an acceptable tradeoff.

The biggest change to adapt to this is in heapam. To avoid performance
regressions for sequential scans that need to set a lot of hint bits, we need
to amortize the cost of BufferBeginSetHintBits() for cases where hint bits are
set at a high frequency. To that end HeapTupleSatisfiesMVCCBatch() uses the
new SetHintBitsExt(), which defers BufferFinishSetHintBits() until all hint
bits on a page have been set.  Conversely, to avoid regressions in cases where
we can't set hint bits in bulk (because we're looking only at individual
tuples), use BufferSetHintBits16() when setting hint bits without batching.

Several other places also need to be adapted, but those changes are
comparatively simpler.

After this we do not need to copy buffers to write them out anymore. That
change is done separately however.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf%40gcnactj4z56m
2026-03-10 19:32:13 -04:00
Melanie Plageman
c2a23dcf9e Use the newest to-be-frozen xid as the conflict horizon for freezing
Previously WAL records that froze tuples used OldestXmin as the snapshot
conflict horizon, or the visibility cutoff if the page would become
all-frozen. Both are newer than (or equal to) the newst XID actually
frozen on the page.

Track the newest XID that will be frozen and use that as the snapshot
conflict horizon instead. This yields an older horizon resulting in
fewer query cancellations on standbys.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAAKRu_bbaUV8OUjAfVa_iALgKnTSfB4gO3jnkfpcFgrxEpSGJQ%40mail.gmail.com
2026-03-10 15:24:39 -04:00
Álvaro Herrera
ac58465e06
Introduce the REPACK command
REPACK absorbs the functionality of VACUUM FULL and CLUSTER in a single
command.  Because this functionality is completely different from
regular VACUUM, having it separate from VACUUM makes it easier for users
to understand; as for CLUSTER, the term is heavily overloaded in the
IT world and even in Postgres itself, so it's good that we can avoid it.

We retain those older commands, but de-emphasize them in the
documentation, in favor of REPACK; the difference between VACUUM FULL
and CLUSTER (namely, the fact that tuples are written in a specific
ordering) is neatly absorbed as two different modes of REPACK.

This allows us to introduce further functionality in the future that
works regardless of whether an ordering is being applied, such as (and
especially) a concurrent mode.

Author: Antonin Houska <ah@cybertec.at>
Reviewed-by: Mihail Nikalayeu <mihailnikalayeu@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Robert Treat <rob@xzilla.net>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Reviewed-by: jian he <jian.universality@gmail.com>
Discussion: https://postgr.es/m/82651.1720540558@antos
Discussion: https://postgr.es/m/202507262156.sb455angijk6@alvherre.pgsql
2026-03-10 19:56:39 +01:00
Robert Haas
0fbfd37cef Allow extensions to mark an individual index as disabled.
Up until now, the only way for a loadable module to disable the use of a
particular index was to use build_simple_rel_hook (or, previous to
yesterday's commit, get_relation_info_hook) to remove it from the index
list. While that works, it has some disadvantages. First, the index
becomes invisible for all purposes, and can no longer be used for
optimizations such as self-join elimination or left join removal, which
can severely degrade the resulting plan.

Second, if the module attempts to compel the use of a certain index
by removing all other indexes from the index list and disabling
other scan types, but the planner is unable to use the chosen index
for some reason, it will fall back to a sequential scan, because that
is only disabled, whereas the other indexes are, from the planner's
point of view, completely gone. While this situation ideally shouldn't
occur, it's hard for a loadable module to be completely sure whether
the planner will view a certain index as usable for a certain query.
If it isn't, it may be better to fall back to a scan using a disabled
index rather than falling back to an also-disabled sequential scan.

Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com>
Discussion: http://postgr.es/m/CA%2BTgmoYS4ZCVAF2jTce%3DbMP0Oq_db_srocR4cZyO0OBp9oUoGg%40mail.gmail.com
2026-03-10 08:33:55 -04:00
Robert Haas
91f33a2ae9 Replace get_relation_info_hook with build_simple_rel_hook.
For a long time, PostgreSQL has had a get_relation_info_hook which
plugins can use to editorialize on the information that
get_relation_info obtains from the catalogs. However, this hook is
only called for baserels of type RTE_RELATION, and there is
potential utility in a similar call back for other types of
RTEs. This might have had utility even before commit
4020b370f2 added pgs_mask to
RelOptInfo, but it certainly has utility now.

So, move the callback up one level, deleting get_relation_info_hook and
adding build_simple_rel_hook instead. The new callback is called just
slightly later than before and with slightly different arguments, but it
should be fairly straightforward to adjust existing code that currently
uses get_relation_info_hook: the values previously available as
relationObjectId and inhparent are now available via rte->relid and
rte->inh, and calls where rte->rtekind != RTE_RELATION can be ignored if
desired.

Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com>
Discussion: http://postgr.es/m/CA%2BTgmoYg8uUWyco7Pb3HYLMBRQoO6Zh9hwgm27V39Pb6Pdf%3Dug%40mail.gmail.com
2026-03-09 09:48:26 -04:00
Robert Haas
8300d3ad4a Consider startup cost as a figure of merit for partial paths.
Previously, the comments stated that there was no purpose to considering
startup cost for partial paths, but this is not the case: it's perfectly
reasonable to want a fast-start path for a plan that involves a LIMIT
(perhaps over an aggregate, so that there is enough data being processed
to justify parallel query but yet we don't want all the result rows).

Accordingly, rewrite add_partial_path and add_partial_path_precheck to
consider startup costs. This also fixes an independent bug in
add_partial_path_precheck: commit e222534679
failed to update it to do anything with the new disabled_nodes field.
That bug fix is formally separate from the rest of this patch and could
be committed separately, but I think it makes more sense to fix both
issues together, because then we can (as this commit does) just make
add_partial_path_precheck do the cost comparisons in the same way as
compare_path_costs_fuzzily, which hopefully reduces the chances of
ending up with something that's still incorrect.

This patch is based on earlier work on this topic by Tomas Vondra,
but I have rewritten a great deal of it.

Co-authored-by: Robert Haas <rhaas@postgresql.org>
Co-authored-by: Tomas Vondra <tomas@vondra.me>
Discussion: http://postgr.es/m/CA+TgmobRufbUSksBoxytGJS1P+mQY4rWctCk-d0iAUO6-k9Wrg@mail.gmail.com
2026-03-09 08:16:30 -04:00
Robert Haas
ffc226ab64 Prevent restore of incremental backup from bloating VM fork.
When I (rhaas) wrote the WAL summarizer code, I incorrectly believed
that XLOG_SMGR_TRUNCATE truncates all forks to the same length.  In
fact, what other parts of the code do is compute the truncation length
for the FSM and VM forks from the truncation length used for the main
fork. But, because I was confused, I coded the WAL summarizer to set the
limit block for the VM fork to the same value as for the main fork.
(Incremental backup always copies FSM forks in full, so there is no
similar issue in that case.)

Doing that doesn't directly cause any data corruption, as far as I can
see. However, it does create a serious risk of consuming a large amount
of extra disk space, because pg_combinebackup's reconstruct.c believes
that the reconstructed file should always be at least as long as the
limit block value. We might want to be smarter about that at some point
in the future, because it's always safe to omit all-zeroes blocks at the
end of the last segment of a relation, and doing so could save disk
space, but the current algorithm will rarely waste enough disk space to
worry about unless we believe that a relation has been truncated to a
length much longer than its actual length on disk, which is exactly what
happens as a result of the problem mentioned in the previous paragraph.

To fix, create a new visibilitymap helper function and use it to include
the right limit block in the summary files. Incremental backups taken
with existing summary files will still have this issue, but this should
improve the situation going forward.

Diagnosed-by: Oleg Tkachenko <oatkachenko@gmail.com>
Diagnosed-by: Amul Sul <sulamul@gmail.com>
Discussion: http://postgr.es/m/CAAJ_b97PqG89hvPNJ8cGwmk94gJ9KOf_pLsowUyQGZgJY32o9g@mail.gmail.com
Discussion: http://postgr.es/m/6897DAF7-B699-41BF-A6FB-B818FCFFD585%40gmail.com
Backpatch-through: 17
2026-03-09 06:45:32 -04:00
Nathan Bossart
b2898baaf7 pg_dumpall: Fix handling of conflicting options.
pg_dumpall is missing checks for some conflicting options,
including those passed through to pg_dump.  To fix, introduce a
new function that checks whether mutually exclusive options are
set, and use that in pg_dumpall.  A similar change could likely be
made for pg_dump and pg_restore, but that is left as a future
exercise.

This is arguably a bug fix, but since this might break existing
scripts, no back-patch for now.

Author: Jian He <jian.universality@gmail.com>
Co-authored-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Wang Peng <215722532@qq.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CACJufxFf5%3DwSv2MsuO8iZOvpLZQ1-meAMwhw7JX5gNvWo5PDug%40mail.gmail.com
2026-03-06 14:00:04 -06:00
Tom Lane
415100aa62 Support grouping-expression references and GROUPING() in subqueries.
Until now, substitute_grouped_columns and its predecessor
check_ungrouped_columns intentionally did not cope with references
to GROUP BY expressions (anything more complex than a Var) within
subqueries of the query having GROUP BY.  Because they didn't try to
match subexpressions of subqueries to the GROUP BY list, they'd drill
down to raw Vars of the grouping level and then fail with "subquery
uses ungrouped column from outer query".  There have been remarkably
few complaints about this deficiency, so nobody ever did anything
about it.

The reason for not wanting to deal with it is that within a subquery,
Vars will have varlevelsup different from zero and will thus not be
equal() to the expressions seen in the outer query.  We recognized
this at least as far back as 96ca8ffeb, although I think the comment
I added about it then was just documenting a pre-existing deficiency.
It looks like at the time, the solutions I considered were
(1) write a version of equal() that permits an offset in varlevelsup,
or (2) dynamically apply IncrementVarSublevelsUp at each
subexpression.  (1) would require an amount of new code that seems
rather out of proportion to the benefit, while (2) would add an
exponential amount of cost to the matching process.  But rethinking
it now, what seems attractive is (3) apply IncrementVarSublevelsUp to
the groupingClause list not the subexpressions, and do so only once
per subquery depth level.  Then we can still use plain equal() to
check for matches, and we're not incurring cost proportional to some
power of the subquery's complexity.

This patch continues to use the old logic when the GROUP BY list is
all Vars.  We could discard the special comparison logic for that and
always do it the more general way, but that would be a good deal
slower.  (Micro-benchmarking just parse analysis suggests it's about
50% slower than the Vars-only path.  But we've not heard complaints
about the speed of matching within the main query, so I doubt that
applying the same matching logic within subqueries will be a problem.)
The lack of complaints suggests strongly that this is a very minority
use-case, so I don't want to make the typical case slower to fix it.

While testing that, I was surprised to discover a nearby bug:
GROUPING() within a subquery fails to match GROUP BY Vars that are
join alias Vars.  It tries to apply flatten_join_alias_vars to make
such cases work, but that fails to work inside a subquery because
varlevelsup is wrong.  Therefore, this patch invents a new entry point
flatten_join_alias_for_parser() that allows specification of a
sublevels_up offset.  (It seems cleaner to give the parser its own
entry point rather than abuse the planner's conventions even further.)

While this is pretty clearly a bug fix, I'm hesitant to take the risk
of back-patching, seeing that the existing behavior has stood for so
long with so few complaints.  Maybe we can reconsider once this patch
has baked awhile in master.

Reported-by: PALAYRET Jacques <jacques.palayret@meteo.fr>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/531183.1772058731@sss.pgh.pa.us
2026-03-06 13:40:55 -05:00
Jeff Davis
8185bb5347 CREATE SUBSCRIPTION ... SERVER.
Allow CREATE SUBSCRIPTION to accept a foreign server using the SERVER
clause instead of a raw connection string using the CONNECTION clause.

  * Enables a user with sufficient privileges to create a subscription
    using a foreign server by name without specifying the connection
    details.

  * Integrates with user mappings (and other FDW infrastructure) using
    the subscription owner.

  * Provides a layer of indirection to manage multiple subscriptions
    to the same remote server more easily.

Also add CREATE FOREIGN DATA WRAPPER ... CONNECTION clause to specify
a connection_function. To be eligible for a subscription, the foreign
server's foreign data wrapper must specify a connection_function.

Add connection_function support to postgres_fdw, and bump postgres_fdw
version to 1.3.

Bump catversion.

Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Shlok Kyal <shlok.kyal.oss@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/61831790a0a937038f78ce09f8dd4cef7de7456a.camel@j-davis.com
2026-03-06 08:27:56 -08:00
Álvaro Herrera
868825aaeb
Don't include wait_event.h in pgstat.h
wait_event.h itself includes wait_event_types.h, which is a generated
file, so it's nice that we can avoid compiling >10% of the tree just
because that file is regenerated.

To avoid breaking too many third-party modules, we now #include
utils/wait_classes.h in storage/latch.h.  Then, the very common case
of doing
	WaitLatch(..., PG_WAIT_EXTENSION)
continues to work by including just storage/latch.h.  (I didn't try to
determine how many modules would actually break if we don't do this, but
this seems a convenient and low-impact measure.)

Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/202602181214.gcmhx2vhlxzp@alvherre.pgsql
2026-03-06 16:24:58 +01:00
Peter Eisentraut
258248d0bd Make unconstify and unvolatize use StaticAssertVariableIsOfTypeMacro
The unconstify and unvolatize macros had an almost identical assertion
as was already defined in StaticAssertVariableIsOfTypeMacro, only it
had a less useful error message and didn't have a sizeof fallback.

Author: Jelte Fennema-Nio <postgres@jeltef.nl>
Discussion: https://www.postgresql.org/message-id/flat/CAGECzQR21OnnKiZO_1rLWO0-16kg1JBxnVq-wymYW0-_1cUNtg@mail.gmail.com
2026-03-06 10:14:32 +01:00
Peter Eisentraut
e2308350c9 Use typeof everywhere instead of compiler specific spellings
We define typeof ourselves as __typeof__ if it does not exist.  So
let's actually use that for consistency.  The meson/autoconf checks
for __builtin_types_compatible_p still use __typeof__ though, because
there we have not redefined it.

Author: Jelte Fennema-Nio <postgres@jeltef.nl>
Discussion: https://www.postgresql.org/message-id/flat/CAGECzQR21OnnKiZO_1rLWO0-16kg1JBxnVq-wymYW0-_1cUNtg@mail.gmail.com
2026-03-06 10:14:32 +01:00
Peter Eisentraut
aa7c868523 Portable StaticAssertExpr
Use a different way to write StaticAssertExpr() that does not require
the GCC extension statement expressions.

For C, we put the static_assert into a struct.  This appears to be a
common approach.

We still need to keep the fallback implementation to support buggy
MSVC < 19.33.

For C++, we put it into a lambda expression.  (The C approach doesn't
work; it's not permitted to define a new type inside sizeof.)

Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl>
Discussion: https://www.postgresql.org/message-id/flat/5fa3a9f5-eb9a-4408-9baf-403d281f8b10%40eisentraut.org
2026-03-06 09:27:54 +01:00
Michael Paquier
01d485b142 Add system view pg_stat_recovery
This commit introduces pg_stat_recovery, that exposes at SQL level the
state of recovery as tracked by XLogRecoveryCtlData in shared memory,
maintained by the startup process.  This new view includes the following
fields, that are useful for monitoring purposes on a standby, once it
has reached a consistent state (making the execution of the SQL function
possible):
- Last-successfully replayed WAL record LSN boundaries and its timeline.
- Currently replaying WAL record end LSN and its timeline.
- Current WAL chunk start time.
- Promotion trigger state.
- Timestamp of latest processed commit/abort.
- Recovery pause state.

Some of this data can already be recovered from different system
functions, but not all of it.  See pg_get_wal_replay_pause_state or
pg_last_xact_replay_timestamp.  This new view offers a stronger
consistency guarantee, by grabbing the recovery state for all fields
through one spinlock acquisition.

The system view relies on a new function, called pg_stat_get_recovery().
Querying this data requires the pg_read_all_stats privilege.  The view
returns no rows if the node is not in recovery.

This feature originates from a suggestion I have made while discussion
the addition of a CONNECTING state to the WAL receiver's shared memory
state, because we lacked access to some of the state data.  The author
has taken the time to implement it, so thanks for that.

Bump catalog version.

Author: Xuneng Zhou <xunengzhou@gmail.com>
Discussion: https://postgr.es/m/CABPTF7W+Nody-+P9y4PNk37-QWuLpfUrEonHuEhrX+Vx9Kq+Kw@mail.gmail.com
Discussion: https://postgr.es/m/aW13GJn_RfTJIFCa@paquier.xyz
2026-03-06 12:37:40 +09:00
Tom Lane
f95d73ed43 Simplify creation of built-in functions with non-default ACLs.
Up to now, to create such a function, one had to make a pg_proc.dat
entry and then modify it with GRANT/REVOKE commands, which we put in
system_functions.sql.  That seems a little ugly though, because it
violates the idea of having a single source of truth about the initial
contents of pg_proc, and it results in leaving dead rows in the
initial contents of pg_proc.

This patch improves matters by allowing aclitemin to work during early
bootstrap, before pg_authid has been loaded.  On the same principle
that we use for early access to pg_type details, put a table of known
built-in role names into bootstrap.c, and use that in bootstrap mode.

To create a built-in function with a non-default ACL, one should write
the desired ACL list in its pg_proc.dat entry, using a simplified
version of aclitemout's notation: omit the grantor (if it is the
bootstrap superuser, which it pretty much always should be) and spell
the bootstrap superuser's name as POSTGRES, similarly to the notation
used elsewhere in src/include/catalog.  This results in entries like

  proacl => '{POSTGRES=X,pg_monitor=X}'

which shows that we've revoked public execute permissions and instead
granted that to pg_monitor.

In addition to fixing up pg_proc.dat entries, I got rid of some
role grants that had been stuck into system_functions.sql,
and instead put them into a new file pg_auth_members.dat;
that seems like a far less random place to put the information.

The correctness of the data changes can be verified by comparing the
initial contents of pg_proc and pg_auth_members before and after.
pg_proc should match exactly, but the OID column of pg_auth_members
will probably be different because those OIDs now get assigned a
little earlier in bootstrap.  (I forced a catversion bump out of
caution, but it wasn't really necessary.)

Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/183292bb-4891-4c96-a3ca-e78b5e0e1358@dunslane.net
2026-03-05 17:43:09 -05:00
Melanie Plageman
34cb4254bd Prefix PruneState->all_{visible,frozen} with set_
The PruneState had members called "all_visible" and "all_frozen" which
reflect not the current state of the page but the state it could be in
once pruning and freezing have been executed. These are then saved in
the PruneFreezeResult so the caller can set the VM accordingly.

Prefix the PruneState members as well as the corresponsding
PruneFreezeResult members with "set_" to clarify that they represent the
proposed state of the all-visible and all-frozen bits for a heap page in
the visibility map, not the current state.

Author: Melanie Plageman <melanieplageman@gmail.com>
Suggested-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/bqc4kh5midfn44gnjiqez3bjqv4zogydguvdn446riw45jcf3y%404ez66il7ebvk
2026-03-05 16:55:00 -05:00
Melanie Plageman
68c2dcb913 Add PageGetPruneXid() helper
This is similar to the other page accessors in bufpage.h. It improves
readability and avoids long lines.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/BD8B69E7-26D8-4706-9164-597C6AE57812%40gmail.com
2026-03-05 16:22:57 -05:00
Michael Paquier
5f8124a0cf Move definition of XLogRecoveryCtlData to xlogrecovery.h
XLogRecoveryCtlData is the structure that stores the shared-memory state
of WAL recovery, including information such as promotion requests, the
timeline ID (TLI), and the LSNs of replayed records.

This refactoring is independently useful because it allows code outside
of core to access the recovery state in live.  It will be used by an
upcoming patch that introduces a SQL function for querying this
information, that can be accessed on a standby once a consistent state
has been reached.  This only moves code around, changing nothing
functionally.

Author: Xuneng Zhou <xunengzhou@gmail.com>
Discussion: https://postgr.es/m/CABPTF7W+Nody-+P9y4PNk37-QWuLpfUrEonHuEhrX+Vx9Kq+Kw@mail.gmail.com
2026-03-05 12:17:47 +09:00
Michael Paquier
34dfca2934 Change default value of default_toast_compression to "lz4", take two
The default value for default_toast_compression was "pglz".  The main
reason for this choice is that this option is always available, pglz
code being embedded in Postgres.  However, it is known that LZ4 is more
efficient than pglz: less CPU required, more compression on average.  As
of this commit, the default value of default_toast_compression becomes
"lz4", if available.  By switching to LZ4 as the default, users should
see natural speedups on TOAST data reads and/or writes.

Support for LZ4 in TOAST compression was added in Postgres v14, or 5
releases ago.  This should be long enough to consider this feature as
stable.

While at it, quotes are removed from default_toast_compression in
postgresql.conf.sample.  Quotes are not required in this case.  The
in-place value replacement done by initdb if the build supports LZ4
would not use them in the postgresql.conf file added to a
freshly-initialized cluster.

Note that this is a version lighter than 7c1849311e, that included a
replacement of --with-lz4 by --without-lz4 in configure builds, forcing
a requirement for LZ4 in all environments.  The buildfarm did not like
it, at all.  This commit switches default_toast_compression to lz4 as
default only when --with-lz4 is defined, which should keep the buildfarm
at bay while still allowing users to benefit from LZ4 compression in
TOAST as long as the code is compiled with it.

Author: Euler Taveira <euler@eulerto.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Aleksander Alekseev <aleksander@tigerdata.com>
Discussion: https://posgr.es/m/435df33a-129e-4f0c-a803-f3935c5a5ecb@eisentraut.org
2026-03-05 09:24:35 +09:00
Michael Paquier
4f0b3afab4 Revert "Change default value of default_toast_compression to "lz4""
This reverts commit 7c1849311e, due to the fact that more than 60% of
the buildfarm members do not have lz4 installed.  As we are in the last
commit fest of the development cycle, and that it could take a couple
of weeks to stabilize things, this change is reverted for now.

This commit will be reworked in a lighter version, as
default_toast_compression's default can be changed to "lz4" without the
switch from --with-lz4 to --without-lz4.  This approach will keep the
buildfarm at bay, and still allow builds to take advantage of LZ4 in
TOAST by default, as long as the code is compiled with LZ4 support.

A harder requirement based on LZ4 should be achievable at some point,
but it is going to require some work from the buildfarm owners first.
Perhaps this part could be revisited at the beginning of the next
development cycle.

Discussion: https://postgr.es/m/CAOYmi+meTT0NbLbnVqOJD5OKwCtHL86PQ+RZZTrn6umfmHyWaw@mail.gmail.com
2026-03-05 08:25:35 +09:00
Amit Kapila
fd366065e0 Allow table exclusions in publications via EXCEPT TABLE.
Extend CREATE PUBLICATION ... FOR ALL TABLES to support the EXCEPT TABLE
syntax. This allows one or more tables to be excluded. The publisher will
not send the data of excluded tables to the subscriber.

To support this, pg_publication_rel now includes a prexcept column to flag
excluded relations. For partitioned tables, the exclusion is applied at
the root level; specifying a root table excludes all current and future
partitions in that tree.

Follow-up work will implement ALTER PUBLICATION support for managing these
exclusions.

Author: vignesh C <vignesh21@gmail.com>
Author: Shlok Kyal <shlok.kyal.oss@gmail.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Nisha Moond <nisha.moond412@gmail.com>
Reviewed-by: David G. Johnston <david.g.johnston@gmail.com>
Reviewed-by: Ashutosh Sharma <ashu.coek88@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Andrei Lepikhov <lepihov@gmail.com>
Discussion: https://postgr.es/m/CALDaNm3=JrucjhiiwsYQw5-PGtBHFONa6F7hhWCXMsGvh=tamA@mail.gmail.com
2026-03-04 15:56:48 +05:30
Michael Paquier
7c1849311e Change default value of default_toast_compression to "lz4", when available
The default value for default_toast_compression was "pglz".  The main
reason for this choice is that this option is always available, pglz
code being embedded in Postgres.  However, it is known that LZ4 is more
efficient than pglz: less CPU required, more compression on average.  As
of this commit, the default value of default_toast_compression becomes
"lz4", if available.  By switching to LZ4 as the default, users should
see natural speedups on TOAST data reads and/or writes.

Support for LZ4 in TOAST compression was added in Postgres v14, or 5
releases ago.  This should be long enough to consider this feature as
stable.

--with-lz4 is removed, replaced by a --without-lz4 to disable LZ4 in the
builds on an option-basis, following a practice similar to readline or
ICU.  References to --with-lz4 are removed from the documentation.

While at it, quotes are removed from default_toast_compression in
postgresql.conf.sample.  Quotes are not required in this case.  The
in-place value replacement done by initdb if the build supports LZ4
would not use them in the postgresql.conf file added to a
freshly-initialized cluster.

For the reference, a similar switch has been done with ICU in
fcb21b3acd.  Some of the changes done in this commit are consistent
with that.

Note: this is going to create some disturbance in the buildfarm, in
environments where lz4 is not installed.

Author: Euler Taveira <euler@eulerto.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Aleksander Alekseev <aleksander@tigerdata.com>
Discussion: https://posgr.es/m/435df33a-129e-4f0c-a803-f3935c5a5ecb@eisentraut.org
2026-03-04 13:05:31 +09:00
Melanie Plageman
38229cb905 Add read_stream_{pause,resume}()
Read stream users can now pause lookahead when no blocks are currently
available. After resuming, subsequent read_stream_next_buffer() calls
continue lookahead with the previous lookahead distance.

This is especially useful for read stream users with self-referential
access patterns (where consuming already-read buffers can produce
additional block numbers).

Author: Thomas Munro <thomas.munro@gmail.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGJLT2JvWLEiBXMbkSSc5so_Y7%3DN%2BS2ce7npjLw8QL3d5w%40mail.gmail.com
2026-03-03 16:03:09 -05:00
Peter Eisentraut
2a525cc97e Add COPY (on_error set_null) option
If ON_ERROR SET_NULL is specified during COPY FROM, any data type
conversion errors will result in the affected column being set to a
null value.  A column's not-null constraints are still enforced, and
attempting to set a null value in such columns will raise a constraint
violation error.  This applies to a column whose data type is a domain
with a NOT NULL constraint.

Author: Jian He <jian.universality@gmail.com>
Author: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Reviewed-by: Jim Jones <jim.jones@uni-muenster.de>
Reviewed-by: "David G. Johnston" <david.g.johnston@gmail.com>
Reviewed-by: Yugo NAGATA <nagata@sraoss.co.jp>
Reviewed-by: torikoshia <torikoshia@oss.nttdata.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Atsushi Torikoshi <torikoshia@oss.nttdata.com>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/flat/CAKFQuwawy1e6YR4S%3Dj%2By7pXqg_Dw1WBVrgvf%3DBP3d1_aSfe_%2BQ%40mail.gmail.com
2026-03-03 07:37:12 +01:00
Heikki Linnakangas
ccae90abdb Fix OldestMemberMXactId and OldestVisibleMXactId array usage
Commit ab355e3a88 changed how the OldestMemberMXactId array is
indexed. It's no longer indexed by synthetic dummyBackendId, but with
ProcNumber. The PGPROC entries for prepared xacts come after auxiliary
processes in the allProcs array, which rendered the calculation for
MaxOldestSlot and the indexes into the array incorrect.  (The
OldestVisibleMXactId array is not used for prepared xacts, and thus
never accessed with ProcNumber's greater than MaxBackends, so this
only affects the OldestMemberMXactId array.)

As a result, a prepared xact would store its value past the end of the
OldestMemberMXactId array, overflowing into the OldestVisibleMXactId
array. That could cause a transaction's row lock to appear invisible
to other backends, or other such visibility issues. With a very small
max_connections setting, the store could even go beyond the
OldestVisibleMXactId array, stomping over the first element in the
BufferDescriptor array.

To fix, calculate the array sizes more precisely, and introduce helper
functions to calculate the array indexes correctly.

Author: Yura Sokolov <y.sokolov@postgrespro.ru>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/7acc94b0-ea82-4657-b1b0-77842cb7a60c@postgrespro.ru
Backpatch-through: 17
2026-03-02 19:19:22 +02:00
Peter Eisentraut
1887d822f1 Support using copyObject in standard C++
Calling copyObject in C++ without GNU extensions (e.g. when using
-std=c++11 instead of -std=gnu++11) fails with an error like this:

error: use of undeclared identifier 'typeof'; did you mean 'typeid'

This is due to the C compiler used to compile PostgreSQL supporting
typeof, but that function actually not being present in the C++
compiler.  This fixes that by explicitely checking for typeof support
in C++, and then either use that or define typeof ourselves as:

    std::remove_reference_t<decltype(x)>

According to the paper that led to adding typeof to the C standard,
that's the C++ equivalent of the C typeof:
https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2927.htm#existing-decltype

Author: Author: Jelte Fennema-Nio <postgres@jeltef.nl>
Discussion: https://www.postgresql.org/message-id/flat/DGPW5WCFY7WY.1IHCDNIVVT300%2540jeltef.nl
2026-03-02 11:48:13 +01:00
Peter Eisentraut
386ca3908d Check for memset_explicit() and explicit_memset()
We can use either of these to implement a missing explicit_bzero().

explicit_memset() is supported on NetBSD.  NetBSD hitherto didn't have
a way to implement explicit_bzero() other than the fallback variant.

memset_explicit() is the C23 standard, so we use it as first
preference.  It is currently supported on:

- NetBSD 11
- FreeBSD 15
- glibc 2.43

It doesn't provide additional coverage, but as it's the new standard,
its availability will presumably grow.

Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/c4701776-8d99-41da-938d-88528a3adc15%40eisentraut.org
2026-03-02 07:51:19 +01:00
Michael Paquier
f68d7e7483 Remove WAL page header flag XLP_BKP_REMOVABLE
There are no known users of this flag.  The last supposed user was
pglesslog, which is the reason why this flag has been introduced in
core, based on an historical search pointing at a8d539f124.

I have mentioned that we may want to remove this flag back in 2018, due
to zero users of it in core.  More recently, Noah has pointed out that
this flag is not safe to use: XLP_BKP_REMOVABLE can be set by the WAL
writer in a lock-free fashion with runningBackups > 0, meaning that some
full-page images could be required but not logged, ultimately corrupting
backups.

Bump XLOG_PAGE_MAGIC.

Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/20250705001628.c3.nmisch@google.com
Discussion: https://postgr.es/m/CAEze2WhiwKSoAvfUggjDeoeY0-rz9cTpfrHcqvBMmJxv-K_5DA@mail.gmail.com
2026-03-02 14:13:05 +09:00
Peter Eisentraut
3f98862980 Fix some -Wcast-qual warnings
This fixes some warnings from -Wcast-qual that are easy to fix,
without using unconstify or the like.

Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://www.postgresql.org/message-id/990c9117-b013-4026-aaf5-261fe2832c3d%40eisentraut.org
2026-02-27 21:57:33 +01:00