postgresql

mirror of https://github.com/postgres/postgres.git synced 2026-03-27 04:44:15 -04:00

Author	SHA1	Message	Date
David Rowley	c456e39113	Optimize tuple deformation This commit includes various optimizations to improve the performance of tuple deformation. We now precalculate CompactAttribute's attcacheoff, which allows us to remove the code from the deform routines which was setting the attcacheoff. Setting the attcacheoff is now handled by TupleDescFinalize(), which must be called before the TupleDesc is used for anything. Having TupleDescFinalize() means we can store the first attribute in the TupleDesc which does not have an offset cached. That allows us to add a dedicated deforming loop to deform all attributes up to the final one with an attcacheoff set, or up to the first NULL attribute, whichever comes first. Here we also improve tuple deformation performance of tuples with NULLs. Previously, if the HEAP_HASNULL bit was set in the tuple's t_infomask, deforming would, one-by-one, check each and every bit in the NULL bitmap to see if it was zero. Now, we process the NULL bitmap 1 byte at a time rather than 1 bit at a time to find the attnum with the first NULL. We can now deform the tuple without checking for NULLs up to just before that attribute. We also record the maximum attribute number which is guaranteed to exist in the tuple, that is, has a NOT NULL constraint and isn't an atthasmissing attribute. When deforming only attributes prior to the guaranteed attnum, we've no need to access the tuple's natt count. As an additional optimization, we only count fixed-width columns when calculating the maximum guaranteed column, as this eliminates the need to emit code to fetch byref types in the deformation loop for guaranteed attributes. Some locations in the code deform tuples that have yet to go through NOT NULL constraint validation. We're unable to perform the guaranteed attribute optimization when that's the case. This optimization is opt-in via the TupleTableSlot using the TTS_FLAG_OBEYS_NOT_NULL_CONSTRAINTS flag. 
This commit also adds a more efficient way of populating the isnull array: a bit-wise SWAR trick that performs a multiplication on the inverse of the tuple's bitmap byte and masks out all but the lowest bit of each boolean's byte. This produces much more efficient code than determining NULLness via att_isnull(). 8 isnull elements are processed at once using this method, which means we need to round the tts_isnull array size up to the next multiple of 8 bytes. The palloc code does this anyway, but the round-up needed to be formalized so as not to overwrite the sentinel byte in MEMORY_CONTEXT_CHECKING builds. Doing this also allows the NULL-checking deforming loop to more efficiently check the isnull array, rather than doing the bit-wise processing for each attribute that att_isnull() does. The level of performance improvement from these changes seems to vary depending on the CPU architecture. Apple's M chips seem particularly fond of the changes, with some of the tested deform-heavy queries running more than twice as fast as before. With x86-64, the speedups aren't quite as large. With tables containing only a small number of columns, the speedups will be smaller. Author: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: John Naylor <johncnaylorls@gmail.com> Reviewed-by: Amit Langote <amitlangote09@gmail.com> Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Junwang Zhao <zhjwpku@gmail.com> Discussion: https://postgr.es/m/CAApHDvpoFjaj3%2Bw_jD5uPnGazaw41A71tVJokLDJg2zfcigpMQ%40mail.gmail.com	2026-03-16 11:46:00 +13:00
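The commit above describes two techniques: a byte-at-a-time search for the first NULL and a SWAR expansion of the isnull array. Both can be sketched in plain C, but this is a minimal illustration under stated assumptions, not the code from the commit. The function names are hypothetical, and bitmap bits follow the heap convention (set = NOT NULL). The multiply constant is applied to one nibble at a time, so the shifted copies never overlap and no carries can occur. Each call stores 8 bools, so the isnull array must be sized to a multiple of 8. That is the round-up the commit message mentions.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* The 8-byte stores below assume a one-byte bool. */
_Static_assert(sizeof(bool) == 1, "bool must be one byte");

/*
 * 1 + (1 << 7) + (1 << 14) + (1 << 21): multiplying a 4-bit value by this
 * moves bit k to bit 8*k (the low bit of byte k) without any carries.
 */
#define SPREAD4_MULT 0x00204081U
#define LOWBITS_MASK 0x01010101U

/* Hypothetical helper: expand one bitmap byte (set = NOT NULL) into 8 isnull flags. */
static void
expand_null_byte(uint8_t bitmap_byte, bool *isnull)
{
    uint32_t inv = (uint8_t) ~bitmap_byte;      /* now bit set = NULL */
    uint32_t lo = ((inv & 0x0F) * SPREAD4_MULT) & LOWBITS_MASK;
    uint32_t hi = ((inv >> 4) * SPREAD4_MULT) & LOWBITS_MASK;

#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    /* Byte k must land in isnull[k]; fix the store order on big-endian hosts. */
    lo = __builtin_bswap32(lo);
    hi = __builtin_bswap32(hi);
#endif
    /* Every byte of lo/hi is 0 or 1, a valid bool value. */
    memcpy(isnull, &lo, 4);
    memcpy(isnull + 4, &hi, 4);
}

/* Hypothetical helper: index of the first NULL attribute, or natts if none. */
static int
first_null_attnum(const uint8_t *bits, int natts)
{
    for (int i = 0; i < (natts + 7) / 8; i++)
    {
        unsigned int inv = (uint8_t) ~bits[i];

        if (inv != 0)               /* this byte contains at least one NULL */
        {
            int attnum = i * 8;

            while ((inv & 1) == 0)
            {
                inv >>= 1;
                attnum++;
            }
            /* Unused padding bits past natts read as NULL here, so clamp. */
            return attnum < natts ? attnum : natts;
        }
    }
    return natts;
}
```

Attributes before the index returned by `first_null_attnum` can be deformed without consulting the bitmap, which is the split between the two deforming loops described in the commit.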
David Rowley	503620311e	Add all required calls to TupleDescFinalize() As of this commit all TupleDescs must have TupleDescFinalize() called on them once the TupleDesc is set up and before BlessTupleDesc() is called. In this commit, TupleDescFinalize() does nothing. This change has only been separated out from the commit that properly implements this function to make the change more obvious. Any extension which makes its own TupleDesc will need to be modified to call the new function. The follow-up commit which properly implements TupleDescFinalize() will cause any code which forgets to do this to fail in assert-enabled builds in BlessTupleDesc(). It may still be worth mentioning this change in the release notes so that extension authors update their code. Author: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: John Naylor <johncnaylorls@gmail.com> Reviewed-by: Amit Langote <amitlangote09@gmail.com> Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Junwang Zhao <zhjwpku@gmail.com> Discussion: https://postgr.es/m/CAApHDvpoFjaj3%2Bw_jD5uPnGazaw41A71tVJokLDJg2zfcigpMQ%40mail.gmail.com	2026-03-16 11:45:49 +13:00
Melanie Plageman	99bf1f8aa6	Save vmbuffer in heap-specific scan descriptors for on-access pruning Future commits will use the visibility map in on-access pruning to fix VM corruption and set the VM if the page is all-visible. Saving the vmbuffer in the scan descriptor reduces the number of times it would need to be pinned and unpinned, making the overhead of doing so negligible. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/C3AB3F5B-626E-4AAA-9529-23E9A20C727F%40gmail.com	2026-03-15 11:09:10 -04:00
Peter Eisentraut	cd083b54bd	Make typeof and typeof_unqual fallback definitions work on C++11 These macros were unintentionally using C++14 features. This replaces them with valid C++11 code. Tested locally by compiling with -std=c++11 (which reproduced the original issue). Author: Jelte Fennema-Nio <postgres@jeltef.nl> Discussion: https://www.postgresql.org/message-id/flat/92f9750f-c7f6-42d8-9a4a-85a3cbe808f3%40eisentraut.org	2026-03-15 07:36:27 +01:00
David Rowley	4deecb52af	Allow sibling call optimization in slot_getsomeattrs_int() This changes the TupleTableSlotOps contract to make it so the getsomeattrs() function is in charge of calling slot_getmissingattrs(). Since this removes all code from slot_getsomeattrs_int() aside from the getsomeattrs() call itself, we may as well adjust slot_getsomeattrs() so that it calls getsomeattrs() directly. We leave slot_getsomeattrs_int() intact as this is still called from the JIT code. Author: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com> Discussion: https://postgr.es/m/CAApHDvodSVBj3ypOYbYUCJX%2BNWL%3DVZs63RNBQ_FxB_F%2B6QXF-A%40mail.gmail.com	2026-03-14 13:52:09 +13:00
Peter Geoghegan	d774072f00	Move fake LSN infrastructure out of GiST. Move utility functions used by GiST to generate fake LSNs into xlog.c and xloginsert.c, so that other index AMs can also generate fake LSNs. Preparation for an upcoming commit that will add support for fake LSNs to nbtree, allowing its dropPin optimization to be used during scans of unlogged relations. That commit is itself preparation for another upcoming commit that will add a new amgetbatch/btgetbatch interface to enable I/O prefetching. Bump XLOG_PAGE_MAGIC due to XLOG_GIST_ASSIGN_LSN becoming XLOG_ASSIGN_LSN. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Andres Freund <andres@anarazel.de> Reviewed-By: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/CAH2-WzkehuhxyuA8quc7rRN3EtNXpiKsjPfO8mhb+0Dr2K0Dtg@mail.gmail.com	2026-03-13 19:38:17 -04:00
Tomas Vondra	b1f14c9672	Use GetXLogInsertEndRecPtr in gistGetFakeLSN The function used GetXLogInsertRecPtr() to generate the fake LSN. Most of the time this is the same as what XLogInsert() would return, and so it works fine with the XLogFlush() call. But if the last record ends at a page boundary, GetXLogInsertRecPtr() returns an LSN pointing after the page header. In such a case, XLogFlush() fails with errors like this: ERROR: xlog flush request 0/01BD2018 is not satisfied --- flushed only to 0/01BD2000 Such failures are very hard to trigger, particularly outside aggressive test scenarios. Fixed by introducing GetXLogInsertEndRecPtr(), returning the correct LSN without skipping the header. This is the same as GetXLogInsertRecPtr(), except that it calls XLogBytePosToEndRecPtr(). Initial investigation by me, root cause identified by Andres Freund. This is a long-standing bug in gistGetFakeLSN(), probably introduced by `c6b92041d3` in PG13. Backpatch to all supported versions. Reported-by: Peter Geoghegan <pg@bowt.ie> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/vf4hbwrotvhbgcnknrqmfbqlu75oyjkmausvy66ic7x7vuhafx@e4rvwavtjswo Backpatch-through: 14	2026-03-13 23:25:24 +01:00
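A simplified, hypothetical model of converting a byte position to an LSN illustrates the difference between the two functions at a page boundary. The model makes several assumptions. WAL pages are 8192 bytes and all carry a 24-byte header, which is consistent with the 0x18 gap in the error message above. The longer header at the start of each segment is ignored. The function names are illustrative, not PostgreSQL's.

```c
#include <stdint.h>

/* Simplified model: uniform 24-byte page headers, no long segment headers. */
#define WAL_PAGE_SIZE   8192u
#define WAL_PAGE_HDR    24u
#define USABLE_PER_PAGE (WAL_PAGE_SIZE - WAL_PAGE_HDR)

/* Where the next record would start: a position at a page boundary skips the header. */
static uint64_t
bytepos_to_start_ptr(uint64_t bytepos)
{
    uint64_t page = bytepos / USABLE_PER_PAGE;
    uint64_t off = bytepos % USABLE_PER_PAGE;

    return page * WAL_PAGE_SIZE + WAL_PAGE_HDR + off;
}

/* Where the previous record ends: at a page boundary, stop before the header. */
static uint64_t
bytepos_to_end_ptr(uint64_t bytepos)
{
    uint64_t page = bytepos / USABLE_PER_PAGE;
    uint64_t off = bytepos % USABLE_PER_PAGE;

    if (off == 0 && page > 0)
        return page * WAL_PAGE_SIZE;    /* end of the previous page */
    return page * WAL_PAGE_SIZE + WAL_PAGE_HDR + off;
}
```

When the insert position sits exactly at a page boundary, the start-style pointer is 24 bytes past anything that has actually been written. Flushing up to it can therefore fail, while the end-style pointer stays within written WAL.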
Andres Freund	ce5d489166	Fix bug due to confusion about what IsMVCCSnapshot means In `0b96e734c5` I (Andres) relied on page_collect_tuples() being called only with an MVCC snapshot, and added assertions to that end, but did not realize that IsMVCCSnapshot() allows both proper MVCC snapshots and historical snapshots, which behave quite similarly to MVCC snapshots. Unfortunately that can lead to incorrect visibility results during logical decoding, as a historical snapshot is interpreted as a plain MVCC snapshot. The only reason this wasn't noticed earlier is that it's hard to reach as most of the time there are no sequential scans during logical decoding. To fix the bug and avoid issues like this in the future, split IsMVCCSnapshot() into IsMVCCSnapshot() and IsMVCCLikeSnapshot(), where now only the latter includes historic snapshots. One effect of this is that during logical decoding no page-at-a-time snapshots are used, as otherwise runtime branches to handle historic snapshots would be needed in some performance critical paths. Given how uncommon sequential scans are during logical decoding, that seems acceptable. Author: Antonin Houska <ah@cybertec.at> Reported-by: Antonin Houska <ah@cybertec.at> Discussion: https://postgr.es/m/61812.1770637345@localhost	2026-03-13 13:53:19 -04:00
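A minimal sketch of the split follows. It uses a hypothetical, trimmed-down snapshot type rather than PostgreSQL's SnapshotData and SnapshotType. The strict check accepts only a plain MVCC snapshot. The MVCC-like check also accepts historic snapshots. A fast path gated on the strict check is therefore never taken during logical decoding.

```c
#include <stdbool.h>

/* Hypothetical snapshot kinds, loosely modelled on PostgreSQL's. */
typedef enum
{
    SNAP_MVCC,
    SNAP_HISTORIC_MVCC,     /* the kind used by logical decoding */
    SNAP_SELF,
    SNAP_ANY
} SnapKind;

typedef struct
{
    SnapKind kind;
} Snap;

/* Strict check: only a "real" MVCC snapshot. */
#define IsMVCCSnap(s)     ((s)->kind == SNAP_MVCC)

/* Looser check: historic snapshots behave similarly, so they also qualify. */
#define IsMVCCLikeSnap(s) ((s)->kind == SNAP_MVCC || (s)->kind == SNAP_HISTORIC_MVCC)

/* A page-at-a-time fast path that assumes plain MVCC semantics uses the strict check. */
static bool
can_use_pagemode(const Snap *s)
{
    return IsMVCCSnap(s);
}
```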
Nathan Bossart	e0a3a3fd53	Optimize COPY FROM (FORMAT {text,csv}) using SIMD. Presently, such commands scan the input buffer one byte at a time looking for special characters. This commit adds a new path that uses SIMD instructions to skip over chunks of data without any special characters. This can be much faster. To avoid regressions, SIMD processing is disabled for the remainder of the COPY FROM command as soon as we encounter a short line or a special character (except for end-of-line characters, else we'd always disable it after the first line). This is perhaps too conservative, but it could probably be made more lenient in the future via fine-tuned heuristics. Author: Nazir Bilal Yavuz <byavuz81@gmail.com> Co-authored-by: Shinya Kato <shinya11.kato@gmail.com> Reviewed-by: Ayoub Kazar <ma_kazar@esi.dz> Reviewed-by: Andrew Dunstan <andrew@dunslane.net> Reviewed-by: Neil Conway <neil.conway@gmail.com> Reviewed-by: Greg Burd <greg@burd.me> Tested-by: Manni Wood <manni.wood@enterprisedb.com> Tested-by: Mark Wong <markwkm@gmail.com> Discussion: https://postgr.es/m/CAOzEurSW8cNr6TPKsjrstnPfhf4QyQqB4tnPXGGe8N4e_v7Jig%40mail.gmail.com	2026-03-13 11:07:32 -05:00
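The idea of skipping chunks that contain no special characters can be sketched without SIMD intrinsics. The sketch uses 64-bit SWAR (the classic "has zero byte" bit trick) as a portable stand-in for real vector instructions. It is an illustration only. The function names are hypothetical, and the set of special characters assumes text format (backslash, newline, carriage return). The sketch also omits the commit's heuristic of disabling the fast path after a short line or a special character.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ONES  0x0101010101010101ULL
#define HIGHS 0x8080808080808080ULL

/* Nonzero iff some byte of v is zero (exact as a yes/no answer). */
static inline uint64_t
has_zero_byte(uint64_t v)
{
    return (v - ONES) & ~v & HIGHS;
}

/* Nonzero iff some byte of v equals c. */
static inline uint64_t
has_byte(uint64_t v, unsigned char c)
{
    return has_zero_byte(v ^ (ONES * c));
}

/*
 * Return the offset of the first byte in buf[0..len) that is special for
 * text-format line reading, skipping clean 8-byte chunks in one step each.
 */
static size_t
skip_plain_chunks(const char *buf, size_t len)
{
    size_t i = 0;

    for (; i + 8 <= len; i += 8)
    {
        uint64_t v;

        memcpy(&v, buf + i, 8);             /* alignment-safe 8-byte load */
        if (has_byte(v, '\\') | has_byte(v, '\n') | has_byte(v, '\r'))
            break;                          /* special char somewhere in this chunk */
    }
    /* Finish byte at a time, as a scalar loop would. */
    for (; i < len; i++)
        if (buf[i] == '\\' || buf[i] == '\n' || buf[i] == '\r')
            break;
    return i;
}
```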
Heikki Linnakangas	f9de9bf302	Add callback for I/O error messages in SLRUs Historically, all SLRUs were addressed by transaction IDs, but that hasn't been true for a long time. However, the error message on I/O error still always talked about accessing a transaction ID. This commit adds a callback that allows subsystems to construct their own error messages, which can then correctly refer to a transaction ID, multixid or whatever else is used to address the particular SLRU. Author: Maxim Orlov <orlovmg@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Discussion: https://www.postgresql.org/message-id/CACG=ezZZfurhYV+66ceubxQAyWqv9vaUi0yoO4-t48OE5xc0DQ@mail.gmail.com	2026-03-13 16:21:06 +02:00
Fujii Masao	723619eaa3	Add stats_reset column to pg_stat_database_conflicts. This commit adds a stats_reset column to pg_stat_database_conflicts, allowing users to see when the statistics in this view were last reset. This makes the view consistent with pg_stat_database and other statistics views. Catalog version bumped. Author: Shihao Zhong <zhong950419@gmail.com> Reviewed-by: Sami Imseih <samimseih@gmail.com> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Fujii Masao <masao.fujii@gmail.com> Discussion: https://postgr.es/m/CAGRkXqS98OebEWjax99_LVAECsxCB8i=BfsdAL34i-5QHfwyOQ@mail.gmail.com	2026-03-13 22:17:14 +09:00
Peter Eisentraut	59292f7aac	Change copyObject() to use typeof_unqual Currently, when the argument of copyObject() is const-qualified, the return type is also, because the use of typeof carries over all the qualifiers. This is incorrect, since the point of copyObject() is to make a copy to mutate. But apparently no code ran into it. The new implementation uses typeof_unqual, which drops the qualifiers, making this work correctly. typeof_unqual is standardized in C23, but all recent versions of all the usual compilers support it even in non-C23 mode, at least as __typeof_unqual__. We add a configure/meson test for typeof_unqual and __typeof_unqual__ and use it if it's available, else we use the existing fallback of just returning void *. This is the second attempt, after the first attempt in commit `4cfce4e62c` was reverted. The following two points address problems with the earlier version: We test the underscore variant first so that there is a higher chance that clang used for bitcode also supports it, since we don't test that separately. Unlike the typeof test, the typeof_unqual test also tests with a void pointer similar to how copyObject() would use it, because that is not handled by MSVC, so we want the test to fail there. Reviewed-by: David Geier <geidav.pg@gmail.com> Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl> Discussion: https://www.postgresql.org/message-id/flat/92f9750f-c7f6-42d8-9a4a-85a3cbe808f3%40eisentraut.org	2026-03-13 07:06:57 +01:00
Andrew Dunstan	a0b6ef29a5	Enable fast default for domains with non-volatile constraints Previously, ALTER TABLE ADD COLUMN always forced a table rewrite when the column type was a domain with constraints (CHECK or NOT NULL), even if the default value satisfied those constraints. This was because contain_volatile_functions() considers CoerceToDomain immutable, so the code conservatively assumed any constrained domain might fail. Improve this by using soft error handling (ErrorSaveContext) to evaluate the CoerceToDomain expression at ALTER TABLE time. If the default value passes the domain's constraints, the value is stored as a "missing" attribute default and no table rewrite is needed. If the constraint check fails, we fall back to a table rewrite, preserving the historical behavior that constraint violations are only raised when the table actually contains rows. Domains with volatile constraint expressions always require a table rewrite since the constraint result could differ per evaluation and cannot be cached. Author: Jian He <jian.universality@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Andrew Dunstan <andrew@dunslane.net> Reviewed-by: Viktor Holmberg <viktor.holmberg@aiven.io> Discussion: https://postgr.es/m/CACJufxE_+iZBR1i49k_AHigppPwLTJi6km8NOsC7FWvKdEmmXg@mail.gmail.com	2026-03-12 18:05:01 -04:00
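The decision this commit describes can be summarized as a small, hypothetical decision function. The struct fields stand in for information that PostgreSQL computes elsewhere. For example, default_passes_check represents the outcome of evaluating CoerceToDomain with soft error handling. The volatile-default rule reflects PostgreSQL's pre-existing behavior rather than something this commit adds.

```c
#include <stdbool.h>

/* Hypothetical summary of what ALTER TABLE ADD COLUMN knows about the new column. */
typedef struct
{
    bool domain_has_constraints;
    bool constraints_volatile;
    bool default_volatile;
    bool default_passes_check;      /* result of the soft-error evaluation */
} NewColumnInfo;

typedef enum { USE_FAST_DEFAULT, REWRITE_TABLE } AddColumnStrategy;

static AddColumnStrategy
choose_add_column_strategy(const NewColumnInfo *c)
{
    if (c->default_volatile)
        return REWRITE_TABLE;       /* pre-existing rule: each row may differ */
    if (!c->domain_has_constraints)
        return USE_FAST_DEFAULT;
    if (c->constraints_volatile)
        return REWRITE_TABLE;       /* the check result cannot be cached */
    /* Evaluated once, with errors saved instead of thrown. */
    return c->default_passes_check ? USE_FAST_DEFAULT : REWRITE_TABLE;
}
```

Falling back to a rewrite when the check fails means that any constraint violation is still reported only if the table actually has rows.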
Andrew Dunstan	487cf2cbd2	Extend DomainHasConstraints() to optionally check constraint volatility Add an optional bool *has_volatile output parameter to DomainHasConstraints(). When non-NULL, the function checks whether any CHECK constraint contains a volatile expression. Callers that don't need this information pass NULL and get the same behavior as before. This is needed by a subsequent commit that enables the fast default optimization for domains with non-volatile constraints: we can safely evaluate such constraints once at ALTER TABLE time, but volatile constraints require a full table rewrite. Author: Jian He <jian.universality@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Andrew Dunstan <andrew@dunslane.net> Reviewed-by: Viktor Holmberg <viktor.holmberg@aiven.io> Discussion: https://postgr.es/m/CACJufxE_+iZBR1i49k_AHigppPwLTJi6km8NOsC7FWvKdEmmXg@mail.gmail.com	2026-03-12 18:04:16 -04:00
Peter Geoghegan	d071e1cfec	nbtree: Avoid allocating _bt_search stack. Avoid allocating memory for an nbtree descent stack during index scans. We only require a descent stack during inserts, when it is used to determine where to insert a new pivot tuple/downlink into the target leaf page's parent page in the event of a page split. (Page deletion's first phase also performs a _bt_search that requires a descent stack.) This optimization improves performance by minimizing palloc churn. It speeds up index scans that call _bt_search frequently/descend the index many times, especially when the cost of scanning the index dominates (e.g., with index-only skip scans). Testing has shown that the underlying issue causes performance problems for an upcoming patch that will replace btgettuple with a new btgetbatch interface to enable I/O prefetching. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/CAH2-Wzmy7NMba9k8m_VZ-XNDZJEUQBU8TeLEeL960-rAKb-+tQ@mail.gmail.com	2026-03-12 13:22:36 -04:00
Richard Guo	383eb21ebf	Convert NOT IN sublinks to anti-joins when safe The planner has historically been unable to convert "x NOT IN (SELECT y ...)" sublinks into anti-joins. This is because standard SQL semantics for NOT IN require that if the comparison "x = y" returns NULL, the "NOT IN" expression evaluates to NULL (effectively false), causing the row to be discarded. In contrast, an anti-join preserves the row if no match is found. Due to this semantic mismatch regarding NULL handling, the conversion was previously considered unsafe. However, if we can prove that neither side of the comparison can yield NULL values, and further that the operator itself cannot return NULL for non-null inputs, the behavior of NOT IN and anti-join becomes identical. Enabling this conversion allows the planner to treat the sublink as a first-class relation rather than an opaque SubPlan filter. This unlocks global join ordering optimization and permits the selection of the most efficient join algorithm based on cost, often yielding significant performance improvements for large datasets. This patch verifies that neither side of the comparison can be NULL and that the operator is safe regarding NULL results before performing the conversion. To verify operator safety, we require that the operator be a member of a B-tree or Hash operator family. This serves as a proxy for standard boolean behavior, ensuring the operator does not return NULL on valid non-null inputs, as doing so would break index integrity. For operand non-nullability, this patch makes use of several existing mechanisms. It leverages the outer-join-aware-Var infrastructure to verify that a Var does not come from the nullable side of an outer join, and consults the NOT-NULL-attnums hash table to efficiently verify schema-level NOT NULL constraints. 
Additionally, it employs find_nonnullable_vars to identify Vars forced non-nullable by qual clauses, and expr_is_nonnullable to deduce non-nullability for other expression types. The logic for verifying the non-nullability of the subquery outputs was adapted from prior work by David Rowley and Tom Lane. Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com> Reviewed-by: Zhang Mingli <zmlpostgres@gmail.com> Reviewed-by: Japin Li <japinli@hotmail.com> Discussion: https://postgr.es/m/CAMbWs495eF=-fSa5CwJS6B-BaEi3ARp0UNb4Lt3EkgUGZJwkAQ@mail.gmail.com	2026-03-12 09:45:18 +09:00
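A small model of SQL three-valued logic demonstrates the NULL mismatch that made this conversion unsafe. The types and function names are hypothetical. A WHERE filter keeps a row only when NOT IN yields TRUE, while the anti-join keeps it whenever no inner row compares TRUE. The two disagree only when a NULL appears on either side, which is exactly the case the planner must rule out before converting.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { TV_FALSE, TV_TRUE, TV_NULL } TriBool;

/* A nullable int; is_null marks SQL NULL. */
typedef struct { int value; bool is_null; } NInt;

static TriBool
sql_eq(NInt a, NInt b)
{
    if (a.is_null || b.is_null)
        return TV_NULL;
    return a.value == b.value ? TV_TRUE : TV_FALSE;
}

/* x NOT IN (ys): FALSE on any match, else NULL if any comparison was NULL, else TRUE. */
static TriBool
sql_not_in(NInt x, const NInt *ys, size_t n)
{
    bool saw_null = false;

    for (size_t i = 0; i < n; i++)
    {
        TriBool r = sql_eq(x, ys[i]);

        if (r == TV_TRUE)
            return TV_FALSE;
        if (r == TV_NULL)
            saw_null = true;
    }
    return saw_null ? TV_NULL : TV_TRUE;
}

/* An anti-join keeps the outer row iff no inner row compares TRUE. */
static bool
anti_join_keeps(NInt x, const NInt *ys, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (sql_eq(x, ys[i]) == TV_TRUE)
            return false;
    return true;
}
```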
Andres Freund	b0f4ff3c92	bufmgr: Remove the, now obsolete, BM_JUST_DIRTIED Due to the recent changes to use a share-exclusive mode for setting hint bits and for flushing pages - instead of using share mode as before - a buffer cannot be dirtied while the flush is ongoing. The reason we needed JUST_DIRTIED was to handle the case where the buffer was dirtied while IO was ongoing - which is not possible anymore. Discussion: https://postgr.es/m/5ubipyssiju5twkb7zgqwdr7q2vhpkpmuelxfpanetlk6ofnop@hvxb4g2amb2d	2026-03-11 14:58:29 -04:00
Tomas Vondra	943e881733	Do not lock in BufferGetLSNAtomic() on archs with 8 byte atomic reads On platforms where we can read or write the whole LSN atomically, we do not need to lock the buffer header to prevent torn LSNs. We can do this only on platforms with PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY, and when the pd_lsn field is properly aligned. For historical reasons the PageXLogRecPtr was defined as a struct with two uint32 fields. This replaces it with a single uint64 value, to make the intent clearer. To prevent issues with weak typedefs the value is still wrapped in a struct. This also adjusts heapfuncs() in pageinspect, to ensure proper alignment when reading the LSN from a page on alignment-sensitive hardware. Idea by Andres Freund. Initial patch by Andreas Karlsson, improved by Peter Geoghegan. Minor tweaks by me. Author: Andreas Karlsson <andreas@proxel.se> Author: Peter Geoghegan <pg@bowt.ie> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/b6610c3b-3f59-465a-bdbb-8e9259f0abc4@proxel.se	2026-03-11 19:46:08 +01:00
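The layout change and the alignment condition can be sketched as follows. This illustration uses hypothetical names. OldPageLSN and NewPageLSN mirror the before and after shapes described in the commit. LSN_ATOMIC_READS_ASSUMED is a stand-in for the PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY platform check.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the platform capability check. */
#define LSN_ATOMIC_READS_ASSUMED 1

/* Old shape: two 32-bit halves, which a lock-free reader could see torn. */
typedef struct
{
    uint32_t xlogid;    /* high 32 bits */
    uint32_t xrecoff;   /* low 32 bits */
} OldPageLSN;

/* New shape: one 64-bit value, still wrapped in a struct for type safety. */
typedef struct
{
    uint64_t lsn;
} NewPageLSN;

static uint64_t
old_lsn_value(OldPageLSN p)
{
    return ((uint64_t) p.xlogid << 32) | p.xrecoff;
}

/* The header lock can be skipped only if the platform allows it and the field is 8-byte aligned. */
static bool
can_read_lsn_without_lock(const void *pd_lsn_addr)
{
    return LSN_ATOMIC_READS_ASSUMED &&
        ((uintptr_t) pd_lsn_addr % sizeof(uint64_t)) == 0;
}
```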
Peter Eisentraut	9c05f152b5	Fixes for C++ typeof implementation This fixes two bugs in commit `1887d822f1`. First, if we are using the fallback C++ implementation of typeof, then we need to include the C++ header <type_traits> for std::remove_reference_t. This header is also likely to be used for other C++ implementations of type tricks, so we'll put it into the global includes. Second, for the case that the C compiler supports typeof in a spelling that is not "typeof" (for example, __typeof__), then we need to #undef typeof in the C++ section to avoid warnings about duplicate macro definitions. Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl> Discussion: https://www.postgresql.org/message-id/flat/92f9750f-c7f6-42d8-9a4a-85a3cbe808f3%40eisentraut.org	2026-03-11 11:54:10 +01:00
Peter Eisentraut	d4a080b8a1	Remove Int8GetDatum function We have no uses of Int8GetDatum in our tree and have not had any for a long time (if ever), and the inverse does not exist either. Author: Kirill Reshke <reshkekirill@gmail.com> Suggested-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://www.postgresql.org/message-id/flat/CALdSSPhFyb9qLSHee73XtZm1CBWJNo9+JzFNf-zUEWCRW5yEiQ@mail.gmail.com	2026-03-11 10:46:08 +01:00
Andres Freund	82467f627b	Require share-exclusive lock to set hint bits and to flush At the moment hint bits can be set with just a share lock on a page (and, until `45f658dacb`, in one case even without any lock). Because of this we need to copy pages while writing them out, as otherwise the checksum could be corrupted. The need to copy the page is problematic for implementing AIO writes: 1) Instead of just needing a single buffer for a copied page we need one for each page that's potentially undergoing I/O. 2) To be able to use the "worker" AIO implementation the copied page needs to reside in shared memory. It also causes problems for using unbuffered/direct-IO, independent of AIO: Some filesystems, RAID implementations, ... do not tolerate the data being written out to change during the write. E.g. they may compute internal checksums that can be invalidated by concurrent modifications, leading e.g. to filesystem errors (as is the case with btrfs). It is also just plain odd to allow modifications of buffers that are only share locked. To address these issues, this commit changes the rules so that modifications to pages are not allowed anymore while holding a share lock. Instead, the new share-exclusive lock (introduced in `fcb9c977aa`) allows at most one backend to modify a buffer while other backends have the same page share locked. An existing share lock can be upgraded to a share-exclusive lock if there are no conflicting locks. For that, BufferBeginSetHintBits()/BufferFinishSetHintBits() and BufferSetHintBits16() have been introduced. To prevent hint bits from being set while the buffer is being written out, writing out buffers now requires a share-exclusive lock. The use of share-exclusive to gate setting hint bits means that from now on only one backend can set hint bits at a time.
To allow multiple backends to set hint bits would require more complicated locking: For setting hint bits we'd need to store the count of backends currently setting hint bits and we would need another lock-level for I/O conflicting with the lock-level to set hint bits. Given that the share-exclusive lock for setting hint bits is only held for a short time, that backends would often just set the same hint bits and that the cost of occasionally not setting hint bits in hotly accessed pages is fairly low, this seems like an acceptable tradeoff. The biggest change to adapt to this is in heapam. To avoid performance regressions for sequential scans that need to set a lot of hint bits, we need to amortize the cost of BufferBeginSetHintBits() for cases where hint bits are set at a high frequency. To that end HeapTupleSatisfiesMVCCBatch() uses the new SetHintBitsExt(), which defers BufferFinishSetHintBits() until all hint bits on a page have been set. Conversely, to avoid regressions in cases where we can't set hint bits in bulk (because we're looking only at individual tuples), use BufferSetHintBits16() when setting hint bits without batching. Several other places also need to be adapted, but those changes are comparatively simpler. After this we do not need to copy buffers to write them out anymore. That change is done separately however. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf%40gcnactj4z56m	2026-03-10 19:32:13 -04:00
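The lock compatibility rules described in this commit can be modelled with a small, hypothetical table. PostgreSQL's real buffer content locks are implemented differently. The key property is that share-exclusive coexists with share but not with another share-exclusive. Under the new rules, hint-bit setting and buffer write-out both need share-exclusive, so they exclude each other while plain readers proceed.

```c
#include <stdbool.h>

/* Hypothetical lock modes, weakest to strongest. */
typedef enum { LK_SHARE, LK_SHARE_EXCLUSIVE, LK_EXCLUSIVE } BufLockMode;

/* Can 'want' be granted while another backend holds 'held'? */
static bool
buf_lock_compatible(BufLockMode held, BufLockMode want)
{
    if (held == LK_EXCLUSIVE || want == LK_EXCLUSIVE)
        return false;
    /* share/share and share/share-exclusive coexist; two share-exclusive do not */
    return !(held == LK_SHARE_EXCLUSIVE && want == LK_SHARE_EXCLUSIVE);
}
```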
Melanie Plageman	c2a23dcf9e	Use the newest to-be-frozen xid as the conflict horizon for freezing Previously, WAL records that froze tuples used OldestXmin as the snapshot conflict horizon, or the visibility cutoff if the page would become all-frozen. Both are newer than (or equal to) the newest XID actually frozen on the page. Track the newest XID that will be frozen and use that as the snapshot conflict horizon instead. This yields an older horizon, resulting in fewer query cancellations on standbys. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/CAAKRu_bbaUV8OUjAfVa_iALgKnTSfB4gO3jnkfpcFgrxEpSGJQ%40mail.gmail.com	2026-03-10 15:24:39 -04:00
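Tracking the newest to-be-frozen XID amounts to a running maximum under modulo-2^32 XID comparison. This minimal sketch assumes normal (non-special) XIDs only. The helper names are hypothetical, though the signed-difference comparison matches how PostgreSQL orders normal transaction IDs.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Xid;
#define INVALID_XID ((Xid) 0)

/* Wraparound-aware ordering for normal XIDs: is a newer than b? */
static bool
xid_follows(Xid a, Xid b)
{
    return (int32_t) (a - b) > 0;
}

/* Hypothetical: the newest XID among the tuples being frozen, used as the conflict horizon. */
static Xid
newest_frozen_xid(const Xid *xids, int n)
{
    Xid newest = INVALID_XID;

    for (int i = 0; i < n; i++)
        if (newest == INVALID_XID || xid_follows(xids[i], newest))
            newest = xids[i];
    return newest;
}
```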
Álvaro Herrera	ac58465e06	Introduce the REPACK command REPACK absorbs the functionality of VACUUM FULL and CLUSTER in a single command. Because this functionality is completely different from regular VACUUM, having it separate from VACUUM makes it easier for users to understand; as for CLUSTER, the term is heavily overloaded in the IT world and even in Postgres itself, so it's good that we can avoid it. We retain those older commands, but de-emphasize them in the documentation, in favor of REPACK; the difference between VACUUM FULL and CLUSTER (namely, the fact that tuples are written in a specific ordering) is neatly absorbed as two different modes of REPACK. This allows us to introduce further functionality in the future that works regardless of whether an ordering is being applied, such as (and especially) a concurrent mode. Author: Antonin Houska <ah@cybertec.at> Reviewed-by: Mihail Nikalayeu <mihailnikalayeu@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Robert Treat <rob@xzilla.net> Reviewed-by: Euler Taveira <euler@eulerto.com> Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com> Reviewed-by: Junwang Zhao <zhjwpku@gmail.com> Reviewed-by: jian he <jian.universality@gmail.com> Discussion: https://postgr.es/m/82651.1720540558@antos Discussion: https://postgr.es/m/202507262156.sb455angijk6@alvherre.pgsql	2026-03-10 19:56:39 +01:00
Robert Haas	0fbfd37cef	Allow extensions to mark an individual index as disabled. Up until now, the only way for a loadable module to disable the use of a particular index was to use build_simple_rel_hook (or, previous to yesterday's commit, get_relation_info_hook) to remove it from the index list. While that works, it has some disadvantages. First, the index becomes invisible for all purposes, and can no longer be used for optimizations such as self-join elimination or left join removal, which can severely degrade the resulting plan. Second, if the module attempts to compel the use of a certain index by removing all other indexes from the index list and disabling other scan types, but the planner is unable to use the chosen index for some reason, it will fall back to a sequential scan, because that is only disabled, whereas the other indexes are, from the planner's point of view, completely gone. While this situation ideally shouldn't occur, it's hard for a loadable module to be completely sure whether the planner will view a certain index as usable for a certain query. If it isn't, it may be better to fall back to a scan using a disabled index rather than falling back to an also-disabled sequential scan. Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com> Discussion: http://postgr.es/m/CA%2BTgmoYS4ZCVAF2jTce%3DbMP0Oq_db_srocR4cZyO0OBp9oUoGg%40mail.gmail.com	2026-03-10 08:33:55 -04:00
Robert Haas	91f33a2ae9	Replace get_relation_info_hook with build_simple_rel_hook. For a long time, PostgreSQL has had a get_relation_info_hook which plugins can use to editorialize on the information that get_relation_info obtains from the catalogs. However, this hook is only called for baserels of type RTE_RELATION, and there is potential utility in a similar callback for other types of RTEs. This might have had utility even before commit `4020b370f2` added pgs_mask to RelOptInfo, but it certainly has utility now. So, move the callback up one level, deleting get_relation_info_hook and adding build_simple_rel_hook instead. The new callback is called just slightly later than before and with slightly different arguments, but it should be fairly straightforward to adjust existing code that currently uses get_relation_info_hook: the values previously available as relationObjectId and inhparent are now available via rte->relid and rte->inh, and calls where rte->rtekind != RTE_RELATION can be ignored if desired. Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com> Discussion: http://postgr.es/m/CA%2BTgmoYg8uUWyco7Pb3HYLMBRQoO6Zh9hwgm27V39Pb6Pdf%3Dug%40mail.gmail.com	2026-03-09 09:48:26 -04:00
Robert Haas	8300d3ad4a	Consider startup cost as a figure of merit for partial paths. Previously, the comments stated that there was no purpose to considering startup cost for partial paths, but this is not the case: it's perfectly reasonable to want a fast-start path for a plan that involves a LIMIT (perhaps over an aggregate, so that there is enough data being processed to justify parallel query but yet we don't want all the result rows). Accordingly, rewrite add_partial_path and add_partial_path_precheck to consider startup costs. This also fixes an independent bug in add_partial_path_precheck: commit `e222534679` failed to update it to do anything with the new disabled_nodes field. That bug fix is formally separate from the rest of this patch and could be committed separately, but I think it makes more sense to fix both issues together, because then we can (as this commit does) just make add_partial_path_precheck do the cost comparisons in the same way as compare_path_costs_fuzzily, which hopefully reduces the chances of ending up with something that's still incorrect. This patch is based on earlier work on this topic by Tomas Vondra, but I have rewritten a great deal of it. Co-authored-by: Robert Haas <rhaas@postgresql.org> Co-authored-by: Tomas Vondra <tomas@vondra.me> Discussion: http://postgr.es/m/CA+TgmobRufbUSksBoxytGJS1P+mQY4rWctCk-d0iAUO6-k9Wrg@mail.gmail.com	2026-03-09 08:16:30 -04:00
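A simplified, hypothetical dominance test in the spirit of compare_path_costs_fuzzily shows why considering startup cost matters. With this test, a path that has a lower startup cost but a fuzzily equal or higher total cost is no longer discarded. The 1% fuzz factor and the struct are assumptions for illustration. The real function returns a four-way comparison result, and the add_path family also weighs other criteria such as pathkeys.

```c
#include <stdbool.h>

/* Hypothetical, trimmed-down path cost record. */
typedef struct
{
    int    disabled_nodes;
    double startup_cost;
    double total_cost;
} PathCost;

#define FUZZ 1.01   /* costs within 1% are treated as equal */

/*
 * Does 'a' dominate 'b'?  Fewer disabled nodes wins outright; otherwise 'a'
 * must be no worse (within FUZZ) on both costs and clearly better on one.
 */
static bool
path_dominates(const PathCost *a, const PathCost *b)
{
    if (a->disabled_nodes != b->disabled_nodes)
        return a->disabled_nodes < b->disabled_nodes;

    bool total_ok = a->total_cost <= b->total_cost * FUZZ;
    bool startup_ok = a->startup_cost <= b->startup_cost * FUZZ;
    bool total_better = a->total_cost * FUZZ < b->total_cost;
    bool startup_better = a->startup_cost * FUZZ < b->startup_cost;

    return total_ok && startup_ok && (total_better || startup_better);
}
```

If one path has the cheaper total and the other the cheaper startup, neither dominates, so both would be kept. This lets a LIMIT above a parallel plan choose the fast-start one.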
Robert Haas	ffc226ab64	Prevent restore of incremental backup from bloating VM fork. When I (rhaas) wrote the WAL summarizer code, I incorrectly believed that XLOG_SMGR_TRUNCATE truncates all forks to the same length. In fact, what other parts of the code do is compute the truncation length for the FSM and VM forks from the truncation length used for the main fork. But, because I was confused, I coded the WAL summarizer to set the limit block for the VM fork to the same value as for the main fork. (Incremental backup always copies FSM forks in full, so there is no similar issue in that case.) Doing that doesn't directly cause any data corruption, as far as I can see. However, it does create a serious risk of consuming a large amount of extra disk space, because pg_combinebackup's reconstruct.c believes that the reconstructed file should always be at least as long as the limit block value. We might want to be smarter about that at some point in the future, because it's always safe to omit all-zeroes blocks at the end of the last segment of a relation, and doing so could save disk space, but the current algorithm will rarely waste enough disk space to worry about unless we believe that a relation has been truncated to a length much longer than its actual length on disk, which is exactly what happens as a result of the problem mentioned in the previous paragraph. To fix, create a new visibilitymap helper function and use it to include the right limit block in the summary files. Incremental backups taken with existing summary files will still have this issue, but this should improve the situation going forward. Diagnosed-by: Oleg Tkachenko <oatkachenko@gmail.com> Diagnosed-by: Amul Sul <sulamul@gmail.com> Discussion: http://postgr.es/m/CAAJ_b97PqG89hvPNJ8cGwmk94gJ9KOf_pLsowUyQGZgJY32o9g@mail.gmail.com Discussion: http://postgr.es/m/6897DAF7-B699-41BF-A6FB-B818FCFFD585%40gmail.com Backpatch-through: 17	2026-03-09 06:45:32 -04:00
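The gap between main-fork and VM-fork lengths is easy to quantify. Each visibility map page stores 2 bits per heap block; assuming the default 8 kB BLCKSZ and a 24-byte page header, one VM page covers 8168 × 4 = 32672 heap blocks. The helper below (a hypothetical name, not the function this commit adds) computes how many VM blocks a heap of a given length needs, rounding up because a partially used trailing VM page must be kept.

```c
#include <stdint.h>

/* Assumptions: BLCKSZ = 8192 and a 24-byte page header. */
#define TOY_BLCKSZ 8192
#define TOY_PAGE_HEADER_SIZE 24
#define TOY_MAPSIZE (TOY_BLCKSZ - TOY_PAGE_HEADER_SIZE)	/* usable bytes */
#define TOY_BITS_PER_HEAPBLOCK 2	/* all-visible + all-frozen bits */
#define TOY_HEAPBLOCKS_PER_PAGE (TOY_MAPSIZE * 8 / TOY_BITS_PER_HEAPBLOCK)

/* Hypothetical helper: VM blocks needed to cover heap_nblocks heap blocks. */
static uint32_t
toy_vm_nblocks(uint32_t heap_nblocks)
{
	/* Ceiling division, done in 64 bits to avoid overflow. */
	return (uint32_t) (((uint64_t) heap_nblocks + TOY_HEAPBLOCKS_PER_PAGE - 1)
					   / TOY_HEAPBLOCKS_PER_PAGE);
}
```

Under these assumptions, a main fork truncated to 1,000,000 blocks needs only 31 VM blocks. Recording 1,000,000 as the VM fork's limit block instead would lead pg_combinebackup to treat the reconstructed VM file as being at least that long.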
Nathan Bossart	b2898baaf7	pg_dumpall: Fix handling of conflicting options. pg_dumpall is missing checks for some conflicting options, including those passed through to pg_dump. To fix, introduce a new function that checks whether mutually exclusive options are set, and use that in pg_dumpall. A similar change could likely be made for pg_dump and pg_restore, but that is left as a future exercise. This is arguably a bug fix, but since this might break existing scripts, no back-patch for now. Author: Jian He <jian.universality@gmail.com> Co-authored-by: Nathan Bossart <nathandbossart@gmail.com> Reviewed-by: Wang Peng <215722532@qq.com> Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/CACJufxFf5%3DwSv2MsuO8iZOvpLZQ1-meAMwhw7JX5gNvWo5PDug%40mail.gmail.com	2026-03-06 14:00:04 -06:00
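A check for mutually exclusive options can be sketched as a small helper that scans a list of (is-set, name) pairs and reports the first conflicting pair. The type `ExclusiveOption`, the function `check_exclusive_options`, and its interface are hypothetical; the commit message does not show the real function's signature. The option names used in the example are real pg_dumpall options.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical description of one option in a mutually exclusive group. */
typedef struct ExclusiveOption
{
	bool		is_set;			/* given on the command line? */
	const char *name;			/* option name for the error message */
} ExclusiveOption;

/*
 * Return true if at most one of the options is set.  Otherwise print an
 * error naming the first two conflicting options and return false.
 */
static bool
check_exclusive_options(const ExclusiveOption *opts, size_t nopts)
{
	const char *first = NULL;

	for (size_t i = 0; i < nopts; i++)
	{
		if (!opts[i].is_set)
			continue;
		if (first != NULL)
		{
			fprintf(stderr, "options %s and %s cannot be used together\n",
					first, opts[i].name);
			return false;
		}
		first = opts[i].name;
	}
	return true;
}
```

Called with `-g/--globals-only` and `-t/--tablespaces-only` both set, this prints "options -g/--globals-only and -t/--tablespaces-only cannot be used together" and returns false. Keeping the group in one table makes it straightforward to add further conflicting pass-through options.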
Tom Lane	415100aa62	Support grouping-expression references and GROUPING() in subqueries. Until now, substitute_grouped_columns and its predecessor check_ungrouped_columns intentionally did not cope with references to GROUP BY expressions (anything more complex than a Var) within subqueries of the query having GROUP BY. Because they didn't try to match subexpressions of subqueries to the GROUP BY list, they'd drill down to raw Vars of the grouping level and then fail with "subquery uses ungrouped column from outer query". There have been remarkably few complaints about this deficiency, so nobody ever did anything about it. The reason for not wanting to deal with it is that within a subquery, Vars will have varlevelsup different from zero and will thus not be equal() to the expressions seen in the outer query. We recognized this at least as far back as `96ca8ffeb`, although I think the comment I added about it then was just documenting a pre-existing deficiency. It looks like at the time, the solutions I considered were (1) write a version of equal() that permits an offset in varlevelsup, or (2) dynamically apply IncrementVarSublevelsUp at each subexpression. (1) would require an amount of new code that seems rather out of proportion to the benefit, while (2) would add an exponential amount of cost to the matching process. But rethinking it now, what seems attractive is (3) apply IncrementVarSublevelsUp to the groupingClause list not the subexpressions, and do so only once per subquery depth level. Then we can still use plain equal() to check for matches, and we're not incurring cost proportional to some power of the subquery's complexity. This patch continues to use the old logic when the GROUP BY list is all Vars. We could discard the special comparison logic for that and always do it the more general way, but that would be a good deal slower. (Micro-benchmarking just parse analysis suggests it's about 50% slower than the Vars-only path. 
But we've not heard complaints about the speed of matching within the main query, so I doubt that applying the same matching logic within subqueries will be a problem.) The lack of complaints suggests strongly that this is a very minority use-case, so I don't want to make the typical case slower to fix it. While testing that, I was surprised to discover a nearby bug: GROUPING() within a subquery fails to match GROUP BY Vars that are join alias Vars. It tries to apply flatten_join_alias_vars to make such cases work, but that fails to work inside a subquery because varlevelsup is wrong. Therefore, this patch invents a new entry point flatten_join_alias_for_parser() that allows specification of a sublevels_up offset. (It seems cleaner to give the parser its own entry point rather than abuse the planner's conventions even further.) While this is pretty clearly a bug fix, I'm hesitant to take the risk of back-patching, seeing that the existing behavior has stood for so long with so few complaints. Maybe we can reconsider once this patch has baked awhile in master. Reported-by: PALAYRET Jacques <jacques.palayret@meteo.fr> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/531183.1772058731@sss.pgh.pa.us	2026-03-06 13:40:55 -05:00
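Approach (3) can be illustrated with a toy expression tree. Instead of shifting every subexpression met inside the subquery, the grouping list is copied once with every Var's level offset raised by the subquery depth, and candidates are then compared with plain structural equality. The types and functions (`Expr`, `expr_equal`, `expr_copy_shifted`, `find_grouping_match`) are hypothetical simplifications standing in for Var, equal(), and IncrementVarSublevelsUp; this is not the parser's code.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Toy expression tree: a column reference (Var) or a binary operator. */
typedef enum { T_VAR, T_OP } ExprTag;

typedef struct Expr
{
	ExprTag		tag;
	int			varno;			/* T_VAR: range-table index */
	int			varattno;		/* T_VAR: column number */
	int			levelsup;		/* T_VAR: query levels up (like varlevelsup) */
	char		op;				/* T_OP: operator symbol */
	struct Expr *left;			/* T_OP: operands */
	struct Expr *right;
} Expr;

/* Plain structural equality; levelsup must match exactly, as in equal(). */
static bool
expr_equal(const Expr *a, const Expr *b)
{
	if (a->tag != b->tag)
		return false;
	if (a->tag == T_VAR)
		return a->varno == b->varno && a->varattno == b->varattno &&
			a->levelsup == b->levelsup;
	return a->op == b->op &&
		expr_equal(a->left, b->left) && expr_equal(a->right, b->right);
}

/* Deep-copy e, adding delta to every Var's levelsup. */
static Expr *
expr_copy_shifted(const Expr *e, int delta)
{
	Expr	   *c = malloc(sizeof(Expr));

	if (c == NULL)
		abort();				/* toy error handling */
	*c = *e;
	if (e->tag == T_VAR)
		c->levelsup += delta;
	else
	{
		c->left = expr_copy_shifted(e->left, delta);
		c->right = expr_copy_shifted(e->right, delta);
	}
	return c;
}

static void
expr_free(Expr *e)
{
	if (e->tag == T_OP)
	{
		expr_free(e->left);
		expr_free(e->right);
	}
	free(e);
}

/* Index of the grouping expression equal to e, or -1.  `shifted` must
 * already be adjusted for the subquery depth at which e appears. */
static int
find_grouping_match(const Expr *e, Expr *const *shifted, int n)
{
	for (int i = 0; i < n; i++)
	{
		if (expr_equal(e, shifted[i]))
			return i;
	}
	return -1;
}
```

In this model the shifted list would be built once per subquery depth and reused for every subexpression at that depth, so the extra cost is proportional to the grouping list rather than to the subquery's complexity.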
Jeff Davis	8185bb5347	CREATE SUBSCRIPTION ... SERVER. Allow CREATE SUBSCRIPTION to accept a foreign server using the SERVER clause instead of a raw connection string using the CONNECTION clause. * Enables a user with sufficient privileges to create a subscription using a foreign server by name without specifying the connection details. * Integrates with user mappings (and other FDW infrastructure) using the subscription owner. * Provides a layer of indirection to manage multiple subscriptions to the same remote server more easily. Also add CREATE FOREIGN DATA WRAPPER ... CONNECTION clause to specify a connection_function. To be eligible for a subscription, the foreign server's foreign data wrapper must specify a connection_function. Add connection_function support to postgres_fdw, and bump postgres_fdw version to 1.3. Bump catversion. Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Shlok Kyal <shlok.kyal.oss@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://postgr.es/m/61831790a0a937038f78ce09f8dd4cef7de7456a.camel@j-davis.com	2026-03-06 08:27:56 -08:00
Álvaro Herrera	868825aaeb	Don't include wait_event.h in pgstat.h wait_event.h itself includes wait_event_types.h, which is a generated file, so it's nice that we can avoid compiling >10% of the tree just because that file is regenerated. To avoid breaking too many third-party modules, we now #include utils/wait_classes.h in storage/latch.h. Then, the very common case of doing WaitLatch(..., PG_WAIT_EXTENSION) continues to work by including just storage/latch.h. (I didn't try to determine how many modules would actually break if we don't do this, but this seems a convenient and low-impact measure.) Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/202602181214.gcmhx2vhlxzp@alvherre.pgsql	2026-03-06 16:24:58 +01:00
Peter Eisentraut	258248d0bd	Make unconstify and unvolatize use StaticAssertVariableIsOfTypeMacro The unconstify and unvolatize macros had an almost identical assertion as was already defined in StaticAssertVariableIsOfTypeMacro, only it had a less useful error message and didn't have a sizeof fallback. Author: Jelte Fennema-Nio <postgres@jeltef.nl> Discussion: https://www.postgresql.org/message-id/flat/CAGECzQR21OnnKiZO_1rLWO0-16kg1JBxnVq-wymYW0-_1cUNtg@mail.gmail.com	2026-03-06 10:14:32 +01:00
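The pattern behind this change can be sketched for GCC and Clang: a compile-time check that an expression has exactly the type `const T` before the const is cast away. The macro names `MY_STATIC_ASSERT_EXPR`, `MY_ASSERT_VAR_IS_OF_TYPE`, and `my_unconstify`, and the caller `legacy_strlen`, are hypothetical stand-ins; PostgreSQL's own StaticAssertExpr, StaticAssertVariableIsOfTypeMacro, and unconstify differ in detail. `__builtin_types_compatible_p` and `__typeof__` are compiler extensions, which is why such checks need a fallback elsewhere.

```c
#include <stddef.h>
#include <string.h>

/* Compile-time assertion usable as an expression (struct-in-sizeof form;
 * the dummy member keeps the struct from having no named members). */
#define MY_STATIC_ASSERT_EXPR(cond, msg) \
	((void) sizeof(struct { _Static_assert(cond, msg); char dummy; }))

/* Fails to compile unless var has exactly the given type (GCC/Clang). */
#define MY_ASSERT_VAR_IS_OF_TYPE(var, type) \
	MY_STATIC_ASSERT_EXPR(__builtin_types_compatible_p(__typeof__(var), type), \
						  #var " does not have type " #type)

/* Cast away const, but only from a genuinely "const underlying_type". */
#define my_unconstify(underlying_type, expr) \
	(MY_ASSERT_VAR_IS_OF_TYPE(expr, const underlying_type), \
	 (underlying_type) (expr))

/* Example caller: an API that takes char * but does not modify it. */
static size_t
legacy_strlen(char *s)
{
	return strlen(s);
}
```

Passing a `const char *` through `my_unconstify(char *, ...)` compiles; passing a plain `char *` fails the static assertion with a message naming the variable and the expected type, which is the more useful error the commit refers to.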
Peter Eisentraut	e2308350c9	Use typeof everywhere instead of compiler specific spellings We define typeof ourselves as __typeof__ if it does not exist. So let's actually use that for consistency. The meson/autoconf checks for __builtin_types_compatible_p still use __typeof__ though, because there we have not redefined it. Author: Jelte Fennema-Nio <postgres@jeltef.nl> Discussion: https://www.postgresql.org/message-id/flat/CAGECzQR21OnnKiZO_1rLWO0-16kg1JBxnVq-wymYW0-_1cUNtg@mail.gmail.com	2026-03-06 10:14:32 +01:00
Peter Eisentraut	aa7c868523	Portable StaticAssertExpr Use a different way to write StaticAssertExpr() that does not require the GCC extension statement expressions. For C, we put the static_assert into a struct. This appears to be a common approach. We still need to keep the fallback implementation to support buggy MSVC < 19.33. For C++, we put it into a lambda expression. (The C approach doesn't work; it's not permitted to define a new type inside sizeof.) Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl> Discussion: https://www.postgresql.org/message-id/flat/5fa3a9f5-eb9a-4408-9baf-403d281f8b10%40eisentraut.org	2026-03-06 09:27:54 +01:00
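The C variant described above can be sketched in standard C11: the static assertion is a member declaration of an unnamed struct inside `sizeof`, so no statement-expression extension is needed. The names `MY_STATIC_ASSERT_EXPR` and `CHECKED_BIT` are hypothetical, and `unsigned int` is assumed to be 32 bits. As the commit notes, C++ does not permit defining a type inside `sizeof`, so this form is C-only.

```c
/* Compile-time assertion usable wherever an expression is allowed.  The
 * dummy member keeps the struct from having no named members. */
#define MY_STATIC_ASSERT_EXPR(cond, msg) \
	((void) sizeof(struct { _Static_assert(cond, msg); char dummy; }))

/* Example use in expression context: a single-bit mask whose bit number
 * is checked at compile time (assumes 32-bit unsigned int). */
#define CHECKED_BIT(n) \
	(MY_STATIC_ASSERT_EXPR((n) >= 0 && (n) < 32, "bit number out of range"), \
	 (1u << (n)))
```

`CHECKED_BIT(3) | CHECKED_BIT(0)` evaluates to 9, while `CHECKED_BIT(40)` is rejected at compile time. Using the `_Static_assert` keyword rather than the `static_assert` macro avoids depending on `<assert.h>` before C23.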
Michael Paquier	01d485b142	Add system view pg_stat_recovery This commit introduces pg_stat_recovery, which exposes at the SQL level the state of recovery as tracked by XLogRecoveryCtlData in shared memory, maintained by the startup process. This new view includes the following fields, which are useful for monitoring purposes on a standby once it has reached a consistent state (making the execution of the SQL function possible): - Last successfully replayed WAL record LSN boundaries and its timeline. - Currently replaying WAL record end LSN and its timeline. - Current WAL chunk start time. - Promotion trigger state. - Timestamp of latest processed commit/abort. - Recovery pause state. Some of this data can already be retrieved from different system functions, but not all of it. See pg_get_wal_replay_pause_state or pg_last_xact_replay_timestamp. This new view offers a stronger consistency guarantee, by grabbing the recovery state for all fields through one spinlock acquisition. The system view relies on a new function, called pg_stat_get_recovery(). Querying this data requires the pg_read_all_stats privilege. The view returns no rows if the node is not in recovery. This feature originates from a suggestion I made while discussing the addition of a CONNECTING state to the WAL receiver's shared memory state, because we lacked access to some of the state data. The author has taken the time to implement it, so thanks for that. Bump catalog version. Author: Xuneng Zhou <xunengzhou@gmail.com> Discussion: https://postgr.es/m/CABPTF7W+Nody-+P9y4PNk37-QWuLpfUrEonHuEhrX+Vx9Kq+Kw@mail.gmail.com Discussion: https://postgr.es/m/aW13GJn_RfTJIFCa@paquier.xyz	2026-03-06 12:37:40 +09:00
Tom Lane	f95d73ed43	Simplify creation of built-in functions with non-default ACLs. Up to now, to create such a function, one had to make a pg_proc.dat entry and then modify it with GRANT/REVOKE commands, which we put in system_functions.sql. That seems a little ugly though, because it violates the idea of having a single source of truth about the initial contents of pg_proc, and it results in leaving dead rows in the initial contents of pg_proc. This patch improves matters by allowing aclitemin to work during early bootstrap, before pg_authid has been loaded. On the same principle that we use for early access to pg_type details, put a table of known built-in role names into bootstrap.c, and use that in bootstrap mode. To create a built-in function with a non-default ACL, one should write the desired ACL list in its pg_proc.dat entry, using a simplified version of aclitemout's notation: omit the grantor (if it is the bootstrap superuser, which it pretty much always should be) and spell the bootstrap superuser's name as POSTGRES, similarly to the notation used elsewhere in src/include/catalog. This results in entries like proacl => '{POSTGRES=X,pg_monitor=X}' which shows that we've revoked public execute permissions and instead granted that to pg_monitor. In addition to fixing up pg_proc.dat entries, I got rid of some role grants that had been stuck into system_functions.sql, and instead put them into a new file pg_auth_members.dat; that seems like a far less random place to put the information. The correctness of the data changes can be verified by comparing the initial contents of pg_proc and pg_auth_members before and after. pg_proc should match exactly, but the OID column of pg_auth_members will probably be different because those OIDs now get assigned a little earlier in bootstrap. (I forced a catversion bump out of caution, but it wasn't really necessary.) 
Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Discussion: https://postgr.es/m/183292bb-4891-4c96-a3ca-e78b5e0e1358@dunslane.net	2026-03-05 17:43:09 -05:00
Melanie Plageman	34cb4254bd	Prefix PruneState->all_{visible,frozen} with set_ The PruneState had members called "all_visible" and "all_frozen" which reflect not the current state of the page but the state it could be in once pruning and freezing have been executed. These are then saved in the PruneFreezeResult so the caller can set the VM accordingly. Prefix the PruneState members as well as the corresponding PruneFreezeResult members with "set_" to clarify that they represent the proposed state of the all-visible and all-frozen bits for a heap page in the visibility map, not the current state. Author: Melanie Plageman <melanieplageman@gmail.com> Suggested-by: Andres Freund <andres@anarazel.de> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/bqc4kh5midfn44gnjiqez3bjqv4zogydguvdn446riw45jcf3y%404ez66il7ebvk	2026-03-05 16:55:00 -05:00
Melanie Plageman	68c2dcb913	Add PageGetPruneXid() helper This is similar to the other page accessors in bufpage.h. It improves readability and avoids long lines. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/BD8B69E7-26D8-4706-9164-597C6AE57812%40gmail.com	2026-03-05 16:22:57 -05:00
Michael Paquier	5f8124a0cf	Move definition of XLogRecoveryCtlData to xlogrecovery.h XLogRecoveryCtlData is the structure that stores the shared-memory state of WAL recovery, including information such as promotion requests, the timeline ID (TLI), and the LSNs of replayed records. This refactoring is independently useful because it allows code outside of core to access the live recovery state. It will be used by an upcoming patch that introduces a SQL function for querying this information, which can be accessed on a standby once a consistent state has been reached. This only moves code around, changing nothing functionally. Author: Xuneng Zhou <xunengzhou@gmail.com> Discussion: https://postgr.es/m/CABPTF7W+Nody-+P9y4PNk37-QWuLpfUrEonHuEhrX+Vx9Kq+Kw@mail.gmail.com	2026-03-05 12:17:47 +09:00
Michael Paquier	34dfca2934	Change default value of default_toast_compression to "lz4", take two The default value for default_toast_compression was "pglz". The main reason for this choice is that this option is always available, pglz code being embedded in Postgres. However, it is known that LZ4 is more efficient than pglz: less CPU required, more compression on average. As of this commit, the default value of default_toast_compression becomes "lz4", if available. By switching to LZ4 as the default, users should see natural speedups on TOAST data reads and/or writes. Support for LZ4 in TOAST compression was added in Postgres v14, or 5 releases ago. This should be long enough to consider this feature as stable. While at it, quotes are removed from default_toast_compression in postgresql.conf.sample. Quotes are not required in this case. The in-place value replacement done by initdb if the build supports LZ4 would not use them in the postgresql.conf file added to a freshly-initialized cluster. Note that this is a lighter version of `7c1849311e`, which included a replacement of --with-lz4 by --without-lz4 in configure builds, forcing a requirement for LZ4 in all environments. The buildfarm did not like it at all. This commit switches default_toast_compression to lz4 as default only when --with-lz4 is specified, which should keep the buildfarm at bay while still allowing users to benefit from LZ4 compression in TOAST as long as the code is compiled with it. Author: Euler Taveira <euler@eulerto.com> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Reviewed-by: Aleksander Alekseev <aleksander@tigerdata.com> Discussion: https://posgr.es/m/435df33a-129e-4f0c-a803-f3935c5a5ecb@eisentraut.org	2026-03-05 09:24:35 +09:00
Michael Paquier	4f0b3afab4	Revert "Change default value of default_toast_compression to "lz4"" This reverts commit `7c1849311e`, because more than 60% of the buildfarm members do not have lz4 installed. As we are in the last commitfest of the development cycle, and it could take a couple of weeks to stabilize things, this change is reverted for now. This commit will be reworked in a lighter version, as default_toast_compression's default can be changed to "lz4" without the switch from --with-lz4 to --without-lz4. This approach will keep the buildfarm at bay, and still allow builds to take advantage of LZ4 in TOAST by default, as long as the code is compiled with LZ4 support. A harder requirement based on LZ4 should be achievable at some point, but it is going to require some work from the buildfarm owners first. Perhaps this part could be revisited at the beginning of the next development cycle. Discussion: https://postgr.es/m/CAOYmi+meTT0NbLbnVqOJD5OKwCtHL86PQ+RZZTrn6umfmHyWaw@mail.gmail.com	2026-03-05 08:25:35 +09:00
Amit Kapila	fd366065e0	Allow table exclusions in publications via EXCEPT TABLE. Extend CREATE PUBLICATION ... FOR ALL TABLES to support the EXCEPT TABLE syntax. This allows one or more tables to be excluded. The publisher will not send the data of excluded tables to the subscriber. To support this, pg_publication_rel now includes a prexcept column to flag excluded relations. For partitioned tables, the exclusion is applied at the root level; specifying a root table excludes all current and future partitions in that tree. Follow-up work will implement ALTER PUBLICATION support for managing these exclusions. Author: vignesh C <vignesh21@gmail.com> Author: Shlok Kyal <shlok.kyal.oss@gmail.com> Reviewed-by: shveta malik <shveta.malik@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Peter Smith <smithpb2250@gmail.com> Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com> Reviewed-by: Zhijie Hou <houzj.fnst@fujitsu.com> Reviewed-by: Nisha Moond <nisha.moond412@gmail.com> Reviewed-by: David G. Johnston <david.g.johnston@gmail.com> Reviewed-by: Ashutosh Sharma <ashu.coek88@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Reviewed-by: Andrei Lepikhov <lepihov@gmail.com> Discussion: https://postgr.es/m/CALDaNm3=JrucjhiiwsYQw5-PGtBHFONa6F7hhWCXMsGvh=tamA@mail.gmail.com	2026-03-04 15:56:48 +05:30
Michael Paquier	7c1849311e	Change default value of default_toast_compression to "lz4", when available The default value for default_toast_compression was "pglz". The main reason for this choice is that this option is always available, pglz code being embedded in Postgres. However, it is known that LZ4 is more efficient than pglz: less CPU required, more compression on average. As of this commit, the default value of default_toast_compression becomes "lz4", if available. By switching to LZ4 as the default, users should see natural speedups on TOAST data reads and/or writes. Support for LZ4 in TOAST compression was added in Postgres v14, or 5 releases ago. This should be long enough to consider this feature as stable. --with-lz4 is removed, replaced by a --without-lz4 option to disable LZ4 in builds, following a practice similar to readline or ICU. References to --with-lz4 are removed from the documentation. While at it, quotes are removed from default_toast_compression in postgresql.conf.sample. Quotes are not required in this case. The in-place value replacement done by initdb if the build supports LZ4 would not use them in the postgresql.conf file added to a freshly-initialized cluster. For reference, a similar switch has been done with ICU in `fcb21b3acd`. Some of the changes done in this commit are consistent with that. Note: this is going to create some disturbance in the buildfarm, in environments where lz4 is not installed. Author: Euler Taveira <euler@eulerto.com> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Reviewed-by: Aleksander Alekseev <aleksander@tigerdata.com> Discussion: https://posgr.es/m/435df33a-129e-4f0c-a803-f3935c5a5ecb@eisentraut.org	2026-03-04 13:05:31 +09:00
Melanie Plageman	38229cb905	Add read_stream_{pause,resume}() Read stream users can now pause lookahead when no blocks are currently available. After resuming, subsequent read_stream_next_buffer() calls continue lookahead with the previous lookahead distance. This is especially useful for read stream users with self-referential access patterns (where consuming already-read buffers can produce additional block numbers). Author: Thomas Munro <thomas.munro@gmail.com> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Discussion: https://postgr.es/m/CA%2BhUKGJLT2JvWLEiBXMbkSSc5so_Y7%3DN%2BS2ce7npjLw8QL3d5w%40mail.gmail.com	2026-03-03 16:03:09 -05:00
Peter Eisentraut	2a525cc97e	Add COPY (on_error set_null) option If ON_ERROR SET_NULL is specified during COPY FROM, any data type conversion errors will result in the affected column being set to a null value. A column's not-null constraints are still enforced, and attempting to set a null value in such columns will raise a constraint violation error. This applies to a column whose data type is a domain with a NOT NULL constraint. Author: Jian He <jian.universality@gmail.com> Author: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com> Reviewed-by: Jim Jones <jim.jones@uni-muenster.de> Reviewed-by: "David G. Johnston" <david.g.johnston@gmail.com> Reviewed-by: Yugo NAGATA <nagata@sraoss.co.jp> Reviewed-by: torikoshia <torikoshia@oss.nttdata.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Atsushi Torikoshi <torikoshia@oss.nttdata.com> Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://www.postgresql.org/message-id/flat/CAKFQuwawy1e6YR4S%3Dj%2By7pXqg_Dw1WBVrgvf%3DBP3d1_aSfe_%2BQ%40mail.gmail.com	2026-03-03 07:37:12 +01:00
Heikki Linnakangas	ccae90abdb	Fix OldestMemberMXactId and OldestVisibleMXactId array usage Commit `ab355e3a88` changed how the OldestMemberMXactId array is indexed. It's no longer indexed by synthetic dummyBackendId, but by ProcNumber. The PGPROC entries for prepared xacts come after auxiliary processes in the allProcs array, which rendered the calculation for MaxOldestSlot and the indexes into the array incorrect. (The OldestVisibleMXactId array is not used for prepared xacts, and thus never accessed with ProcNumbers greater than MaxBackends, so this only affects the OldestMemberMXactId array.) As a result, a prepared xact would store its value past the end of the OldestMemberMXactId array, overflowing into the OldestVisibleMXactId array. That could cause a transaction's row lock to appear invisible to other backends, or other such visibility issues. With a very small max_connections setting, the store could even go beyond the OldestVisibleMXactId array, stomping over the first element in the BufferDescriptor array. To fix, calculate the array sizes more precisely, and introduce helper functions to calculate the array indexes correctly. Author: Yura Sokolov <y.sokolov@postgrespro.ru> Reviewed-by: Sami Imseih <samimseih@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://www.postgresql.org/message-id/7acc94b0-ea82-4657-b1b0-77842cb7a60c@postgrespro.ru Backpatch-through: 17	2026-03-02 19:19:22 +02:00
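The indexing problem can be illustrated with a toy model of the PGPROC layout the commit describes: regular backends first, then auxiliary processes, then prepared transactions. In this model auxiliary processes never store a value, so a dense array needs only max_backends + max_prepared_xacts slots, and prepared-transaction ProcNumbers must be shifted down past the auxiliary range. `ProcLayout`, `member_array_size`, and `member_array_index` are hypothetical names; the commit's real helper functions may be structured differently.

```c
/* Toy layout parameters; hypothetical, fixed at startup in this model. */
typedef struct ProcLayout
{
	int			max_backends;		/* ProcNumbers [0, max_backends) */
	int			num_aux;			/* auxiliary processes follow */
	int			max_prepared_xacts;	/* prepared xacts come last */
} ProcLayout;

/* Slots needed when auxiliary processes are excluded. */
static int
member_array_size(const ProcLayout *l)
{
	return l->max_backends + l->max_prepared_xacts;
}

/* Map a ProcNumber to a dense array index, or -1 for auxiliary
 * processes, which never store a value in this model. */
static int
member_array_index(const ProcLayout *l, int procno)
{
	if (procno < l->max_backends)
		return procno;
	if (procno < l->max_backends + l->num_aux)
		return -1;
	return procno - l->num_aux;
}
```

With 10 backends, 5 auxiliary processes, and 3 prepared transactions, the array has 13 slots and the last prepared transaction (ProcNumber 17) maps to index 12. Indexing directly by ProcNumber would instead write to index 17 of a 13-element array, which is the kind of overflow described above.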
Peter Eisentraut	1887d822f1	Support using copyObject in standard C++ Calling copyObject in C++ without GNU extensions (e.g. when using -std=c++11 instead of -std=gnu++11) fails with an error like this: "error: use of undeclared identifier 'typeof'; did you mean 'typeid'". This is because the C compiler used to compile PostgreSQL supports typeof, but the C++ compiler does not actually provide it. This fixes that by explicitly checking for typeof support in C++, and then either using that or defining typeof ourselves as: std::remove_reference_t<decltype(x)> According to the paper that led to adding typeof to the C standard, that's the C++ equivalent of the C typeof: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2927.htm#existing-decltype Author: Jelte Fennema-Nio <postgres@jeltef.nl> Discussion: https://www.postgresql.org/message-id/flat/DGPW5WCFY7WY.1IHCDNIVVT300%2540jeltef.nl	2026-03-02 11:48:13 +01:00
Peter Eisentraut	386ca3908d	Check for memset_explicit() and explicit_memset() We can use either of these to implement a missing explicit_bzero(). explicit_memset() is supported on NetBSD. NetBSD hitherto didn't have a way to implement explicit_bzero() other than the fallback variant. memset_explicit() is standardized in C23, so we use it as first preference. It is currently supported on: - NetBSD 11 - FreeBSD 15 - glibc 2.43 It doesn't currently cover any additional platforms, but as it's the new standard, its availability will presumably grow. Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/c4701776-8d99-41da-938d-88528a3adc15%40eisentraut.org	2026-03-02 07:51:19 +01:00
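The preference order could look like the following sketch. The macro names `HAVE_MEMSET_EXPLICIT` and `HAVE_EXPLICIT_MEMSET` follow the usual configure convention but are assumptions here, as is the function name `my_explicit_bzero`, chosen to avoid clashing with a system explicit_bzero(). The last branch is a common fallback idiom, not necessarily PostgreSQL's exact code: calling memset() through a volatile function pointer so the compiler cannot easily treat the store as dead and remove it.

```c
#include <stddef.h>
#include <string.h>

#if defined(HAVE_MEMSET_EXPLICIT)
/* C23: memset_explicit() must not be optimized away. */
static void
my_explicit_bzero(void *buf, size_t len)
{
	memset_explicit(buf, 0, len);
}
#elif defined(HAVE_EXPLICIT_MEMSET)
/* NetBSD's explicit_memset(). */
static void
my_explicit_bzero(void *buf, size_t len)
{
	explicit_memset(buf, 0, len);
}
#else
/* Fallback: call memset() through a volatile function pointer, so the
 * compiler cannot assume which function runs and drop the store. */
static void *(*volatile memset_ptr) (void *, int, size_t) = memset;

static void
my_explicit_bzero(void *buf, size_t len)
{
	memset_ptr(buf, 0, len);
}
#endif
```

A typical use is wiping a password or key buffer right before it goes out of scope, which is exactly the case where a plain memset() may be eliminated as a dead store.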
Michael Paquier	f68d7e7483	Remove WAL page header flag XLP_BKP_REMOVABLE There are no known users of this flag. The last supposed user was pglesslog, which is the reason why this flag was introduced in core, based on a historical search pointing at `a8d539f124`. I mentioned back in 2018 that we may want to remove this flag, as it had zero users in core. More recently, Noah has pointed out that this flag is not safe to use: XLP_BKP_REMOVABLE can be set by the WAL writer in a lock-free fashion with runningBackups > 0, meaning that some full-page images could be required but not logged, ultimately corrupting backups. Bump XLOG_PAGE_MAGIC. Author: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/20250705001628.c3.nmisch@google.com Discussion: https://postgr.es/m/CAEze2WhiwKSoAvfUggjDeoeY0-rz9cTpfrHcqvBMmJxv-K_5DA@mail.gmail.com	2026-03-02 14:13:05 +09:00
Peter Eisentraut	3f98862980	Fix some -Wcast-qual warnings This fixes some warnings from -Wcast-qual that are easy to fix, without using unconstify or the like. Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://www.postgresql.org/message-id/990c9117-b013-4026-aaf5-261fe2832c3d%40eisentraut.org	2026-02-27 21:57:33 +01:00

12771 commits