Commit graph

2558 commits

Author SHA1 Message Date
Amit Langote
b7b27eb41a Optimize fast-path FK checks with batched index probes
Instead of probing the PK index on each trigger invocation, buffer
FK rows in a new per-constraint cache entry (RI_FastPathEntry) and
flush them as a batch.

On each trigger invocation, the new ri_FastPathBatchAdd() buffers
the FK row in RI_FastPathEntry.  When the buffer fills (64 rows)
or the trigger-firing cycle ends, the new ri_FastPathBatchFlush()
probes the index for all buffered rows, sharing a single
CommandCounterIncrement, snapshot, permission check, and security
context switch across the batch, rather than repeating each per row
as the SPI path does.  Per-flush CCI is safe because all AFTER
triggers for the buffered rows have already fired by flush time.

For single-column foreign keys, the new ri_FastPathFlushArray()
builds an ArrayType from the buffered FK values (casting to the
PK-side type if needed) and constructs a scan key with the
SK_SEARCHARRAY flag.  The index AM sorts and deduplicates the array
internally, then walks matching leaf pages in one ordered traversal
instead of descending from the root once per row.  A matched[] bitmap
tracks which batch items were satisfied; the first unmatched item is
reported as a violation.  Multi-column foreign keys fall back to
per-row probing via the new ri_FastPathFlushLoop().

The fast path introduced in the previous commit (2da86c1ef9) yields
~1.8x speedup.  This commit adds ~1.6x on top of that, for a combined
~2.9x speedup over the unpatched code (int PK / int FK, 1M rows, PK
table and index cached in memory).

FK tuples are materialized via ExecCopySlotHeapTuple() into a new
purpose-specific memory context (flush_cxt), child of
TopTransactionContext, which is also used for per-flush transient
work: cast results, the search array, and index scan allocations.
It is reset after each flush and deleted in teardown.

The PK relation, index, tuple slots, and fast-path metadata are
cached in RI_FastPathEntry across trigger invocations within a
trigger-firing batch, avoiding repeated open/close overhead.  The
snapshot and IndexScanDesc are taken fresh per flush.  The entry is
not subject to cache invalidation: cached relations are held with
locks for the transaction duration, and the entry's lifetime is
bounded by the trigger-firing cycle.

Lifecycle management for RI_FastPathEntry relies on three new
mechanisms:

  - AfterTriggerBatchCallback: A new general-purpose callback
    mechanism in trigger.c.  Callbacks registered via
    RegisterAfterTriggerBatchCallback() fire at the end of each
    trigger-firing batch (AfterTriggerEndQuery for immediate
    constraints, AfterTriggerFireDeferred at COMMIT, and
    AfterTriggerSetState for SET CONSTRAINTS IMMEDIATE).  The RI
    code registers ri_FastPathEndBatch as a batch callback.

  - Batch callbacks only fire at the outermost query level
    (checked inside FireAfterTriggerBatchCallbacks), so nested
    queries from SPI inside other AFTER triggers do not tear down
    the cache mid-batch.

  - XactCallback: ri_FastPathXactCallback NULLs the static cache
    pointer at transaction end, handling the abort path where the
    batch callback never fired.

  - SubXactCallback: ri_FastPathSubXactCallback NULLs the static
    cache pointer on subtransaction abort, preventing the batch
    callback from accessing already-released resources.

  - AfterTriggerBatchIsActive(): A new exported accessor that
    returns true when afterTriggers.query_depth >= 0.  During
    ALTER TABLE ... ADD FOREIGN KEY validation, RI triggers are
    called directly outside the after-trigger framework, so batch
    callbacks would never fire.  The fast-path code uses this to
    fall back to the non-cached per-invocation path in that
    context.

ri_FastPathEndBatch() flushes any partial batch before tearing
down cached resources.  Since the FK relation may already be
closed by flush time (e.g. for deferred constraints at COMMIT),
it reopens the relation using entry->fk_relid if needed.

The existing ALTER TABLE validation path bypasses batching and
continues to call ri_FastPathCheck() directly per row, because
RI triggers are called outside the after-trigger framework there
and batch callbacks would never fire to flush the buffer.

Suggested-by: David Rowley <dgrowleyml@gmail.com>
Author: Amit Langote <amitlangote09@gmail.com>
Co-authored-by: Junwang Zhao <zhjwpku@gmail.com>
Reviewed-by: Haibo Yan <tristan.yim@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Tested-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CA+HiwqF4C0ws3cO+z5cLkPuvwnAwkSp7sfvgGj3yQ=Li6KNMqA@mail.gmail.com
2026-04-03 14:33:53 +09:00
Peter Eisentraut
8e72d914c5 Add UPDATE/DELETE FOR PORTION OF
This is an extension of the UPDATE and DELETE commands to do a
"temporal update/delete" based on a range or multirange column.  The
user can say UPDATE t FOR PORTION OF valid_at FROM '2001-01-01' TO
'2002-01-01' SET ... (or likewise with DELETE) where valid_at is a
range or multirange column.

The command is automatically limited to rows overlapping the targeted
portion, and only history within those bounds is changed.  If a row
represents history partly inside and partly outside the bounds, then
the command truncates the row's application time to fit within the
targeted portion, then it inserts one or more "temporal leftovers":
new rows containing all the original values, except with the
application-time column changed to only represent the untouched part
of history.

To compute the temporal leftovers that are required, we use the *_minus_multi
set-returning functions defined in 5eed8ce50c.

- Added bison support for FOR PORTION OF syntax.  The bounds must be
  constant, so we forbid column references, subqueries, etc. We do
  accept functions like NOW().
- Added logic to executor to insert new rows for the "temporal
  leftover" part of a record touched by a FOR PORTION OF query.
- Documented FOR PORTION OF.
- Added tests.

Author: Paul A. Jungwirth <pj@illuminatedcomputing.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/flat/ec498c3d-5f2b-48ec-b989-5561c8aa2024%40illuminatedcomputing.com
2026-04-01 19:06:03 +02:00
Nathan Bossart
771fe0948c Avoid including vacuum.h in tableam.h and heapam.h.
Commit 2252fcd427 modified some function prototypes in tableam.h
and heapam.h to take a VacuumParams argument instead of a pointer,
which required including vacuum.h in those headers.  vacuum.h has a
reasonably large dependency tree, and headers like tableam.h are
widely included, so this is not ideal.  To fix, change the
functions in question to accept a "const VacuumParams *" argument
instead.  That allows us to use a forward declaration for
VacuumParams and avoid including vacuum.h.  Since vacuum_rel()
needs to scribble on the params argument, we still pass it by value
to that function so that the original struct is not modified.

Reported-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/rzxpxod4c4la62yvutyrvgoyilrl2fx55djaf2suidy7np5m6c%403l2ln476eadh
2026-03-31 12:43:52 -05:00
Daniel Gustafsson
097ab69d17 Formalize WAL record for XLOG_CHECKPOINT_REDO
XLOG_CHECKPOINT_REDO only contains the wal_level copied straight in
without an encapsulating record structure. While it works, it makes
future uses of XLOG_CHECKPOINT_REDO hard as there is nowhere to put
new data items.  This fix this was inspired by the online checksums
patch which adds data to this record,  but this change has value on
its own.

Author: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/c92b5d8b-bc03-47bc-b209-2e4a719eee32@iki.fi
2026-03-31 09:38:01 +02:00
Amit Langote
2da86c1ef9 Add fast path for foreign key constraint checks
Add a fast-path optimization for foreign key checks that bypasses SPI
by directly probing the unique index on the referenced table.
Benchmarking shows ~1.8x speedup for bulk FK inserts (int PK/int FK,
1M rows, where PK table and index are cached).

The fast path applies when the referenced table is not partitioned and
the constraint does not involve temporal semantics.  Otherwise, the
existing SPI path is used.

This optimization covers only the referential check trigger
(RI_FKey_check).  The action triggers (CASCADE, SET NULL, SET DEFAULT,
RESTRICT, NO ACTION) must find rows on the FK side to modify, which
requires a table scan with no guaranteed index available, and then
execute DML against those rows through the full executor path including
any triggered actions.  Replicating that without substantial code
duplication is not feasible, so those triggers remain on the SPI path.
Extending the fast path to action triggers remains possible as future
work if the necessary infrastructure is built.

The new ri_FastPathCheck() function extracts the FK values, builds scan
keys, performs an index scan, and locks the matching tuple with
LockTupleKeyShare via ri_LockPKTuple(), which handles the RI-specific
subset of table_tuple_lock() results.

If the locked tuple was reached by chasing an update chain
(tmfd.traversed), recheck_matched_pk_tuple() verifies that the key
is still the same, emulating EvalPlanQual.

The scan uses GetTransactionSnapshot(), matching what the SPI path
uses (via _SPI_execute_plan pushing GetTransactionSnapshot() as the
active snapshot).  Under READ COMMITTED this is a fresh snapshot;
under REPEATABLE READ / SERIALIZABLE it is the frozen transaction-
start snapshot, so PK rows committed after the transaction started
are not visible.

The ri_CheckPermissions() function performs schema USAGE and table
SELECT checks, matching what the SPI path gets implicitly through
the executor's permission checks.  The fast path also switches to
the PK table owner's security context (with SECURITY_NOFORCE_RLS)
before the index probe, matching the SPI path where the query runs
as the table owner.

ri_HashCompareOp() is adjusted to handle cross-type equality operators
(e.g. int48eq for int4 PK / int8 FK) which can appear in conpfeqop.
The existing code asserted same-type operators only, which was correct
for its existing callers (ri_KeysEqual compares same-type FK column
values via ff_eq_oprs), but the fast path is the first caller to pass
pf_eq_oprs, which can be cross-type.

Per-key metadata (compare entries, operator procedures, strategy
numbers) is cached in RI_ConstraintInfo via
ri_populate_fastpath_metadata() on first use, eliminating repeated
calls to ri_HashCompareOp() and get_op_opfamily_properties().
conindid and pk_is_partitioned are also cached at constraint load
time, avoiding per-invocation syscache lookups and the need to open
pk_rel before deciding whether the fast path applies.

New regression tests cover RLS bypass and ACL enforcement for the
fast-path permission checks.  New isolation tests exercise concurrent
PK updates under both READ COMMITTED and REPEATABLE READ.

Author: Junwang Zhao <zhjwpku@gmail.com>
Co-authored-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Haibo Yan <tristan.yim@gmail.com>
Tested-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CA+HiwqF4C0ws3cO+z5cLkPuvwnAwkSp7sfvgGj3yQ=Li6KNMqA@mail.gmail.com
2026-03-31 13:49:21 +09:00
Nathan Bossart
bab2f27eaa Remove bits* typedefs.
In addition to removing the bits8, bits16, and bits32 typedefs,
this commit replaces all uses with uint8, uint16, or uint32.  bits*
provided little benefit beyond establishing the intent of the
variable, and they were inconsistently used for that purpose.
Third-party code should instead use the corresponding uint*
typedef.

Suggested-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
Discussion: https://postgr.es/m/absbX33E4eaA0Ity%40nathan
2026-03-30 16:12:08 -05:00
Heikki Linnakangas
fd8e3f7cee Invent a variant of getopt(3) that is thread-safe
The standard getopt(3) function is not re-entrant nor thread-safe.
That's OK for current usage, but it's one more little thing we need to
change in order to make the server multi-threaded.

There's no standard getopt_r() function on any platform, I presume
because command line arguments are usually parsed early when you start
a program, before launching any threads, so there isn't much need for
it. However, we call it at backend startup to parse options from the
startup packet. Because there's no standard, we're free to define our
own.

The pg_getopt_start/next() implementation is based on the old getopt
implementation, I just gathered all the state variables to a struct.
The non-re-entrant getopt() function is now a wrapper around the
re-entrant variant, on platforms that don't have getopt(3).
getopt_long() is not used in the server, so we don't need to provide a
re-entrant variant of that.

Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/d1da5f0e-0d68-47c9-a882-eb22f462752f@iki.fi
2026-03-30 20:47:13 +03:00
Peter Eisentraut
c546f008cd headerscheck: Avoid mutual inclusion of pg_config.h and c.h
Headers that c.h includes early should not have another header
included before them in the headerscheck test file, especially not
c.h.

A particular instance of a problem is that pg_config.h defines some
symbols that c.h later undefines in some cases, such as in the code
added by commit cd083b54bd, but there were also some before that.
This only works correctly if pg_config.h is included first.

pg_config_manual.h and pg_config_os.h are closely related to
pg_config.h and should be treated the same way.

postgres_ext.h is meant to be usable standalone, so testing it with
c.h included first defeats the point.

c.h also includes port.h, but this commit leaves that alone, since
port.h does need some of c.h to be processed first.  (But because of
header guards, testing port.h separately is probably ineffective.)

Discussion: https://www.postgresql.org/message-id/flat/579116be-5016-4dbc-aed0-c06f8d9f5bbb%40eisentraut.org
2026-03-30 10:14:41 +02:00
Andres Freund
74eafeab1a bufmgr: Improve StartBufferIO interface
Until now StartBufferIO() had a few weaknesses:

- As it did not submit staged IOs, it was not safe to call StartBufferIO()
  where there was a potential for unsubmitted IO, which required
  AsyncReadBuffers() to use a wrapper (ReadBuffersCanStartIO()) around
  StartBufferIO().

- With nowait = true, the boolean return value did not allow to distinguish
  between no IO being necessary and having to wait, which would lead
  ReadBuffersCanStartIO() to unnecessarily submit staged IO.

- Several callers needed to handle both local and shared buffers, requiring
  the caller to differentiate between StartBufferIO() and StartLocalBufferIO()

- In a future commit some callers of StartBufferIO() want the BufferDesc's
  io_wref to be returned, to asynchronously wait for in-progress IO

- Indicating whether to wait with the nowait parameter was somewhat confusing
  compared to a wait parameter

Address these issues as follows:

- StartBufferIO() is renamed to StartSharedBufferIO()

- A new StartBufferIO() is introduced that supports both shared and local
  buffers

- The boolean return value has been replaced with an enum, indicating whether
  the IO is already done, already in progress or that the buffer has been
  readied for IO

- A new PgAioWaitRef * argument allows the caller to get the wait reference is
  desired.  All current callers pass NULL, a user of this will be introduced
  subsequently

- Instead of the nowait argument there now is wait

  This probably would not have been worthwhile on its own, but since all these
  lines needed to be touched anyway...

Author: Andres Freund <andres@anarazel.de>
Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw@3p3zu522yykv
2026-03-27 19:08:12 -04:00
Andres Freund
1f6f200cab test_aio: Add read_stream test infrastructure & tests
While we have a lot of indirect coverage of read streams, there are corner
cases that are hard to test when only indirectly controlling and observing the
read stream.  This commit adds an SQL callable SRF interface for a read stream
and uses that in a few tests.

To make some of the tests possible, the injection point infrastructure in
test_aio had to be expanded to allow blocking IO completion.

While at it, fix a wrong debug message in inj_io_short_read_hook().

Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw@3p3zu522yykv
2026-03-27 18:52:43 -04:00
Tom Lane
79ac82125e pgindent: ensure all C files end with a newline.
Not only is this good style, but it dodges some obscure bugs within
pg_bsd_indent.  We could try to fix said bugs, but the amount of
effort required seems far out of proportion to the benefit.

Reported-by: Akshay Joshi <akshay.joshi@enterprisedb.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Discussion: https://postgr.es/m/CANxoLDfca8O5SkeDxB_j6SVNXd+pNKaDmVmEW+2yyicdU8fy0w@mail.gmail.com
2026-03-27 15:38:48 -04:00
Nathan Bossart
d7965d65fc Add rudimentary table prioritization to autovacuum.
Autovacuum workers scan pg_class twice to collect the set of tables
to process.  The first pass is for plain relations and materialized
views, and the second is for TOAST tables.  When the worker finds a
table to process, it adds it to the end of a list.  Later on, it
processes the tables in the same order as the list.  This simple
strategy has worked surprisingly well for a long time, but there
have been many discussions over the years about trying to improve
it.

This commit introduces a scoring system that is used to sort the
aforementioned list of tables to process.  The idea is to have
autovacuum workers prioritize tables that are furthest beyond their
thresholds (e.g., a table nearing transaction ID wraparound should
be vacuumed first).  This prioritization scheme is certainly far
from perfect; there are simply too many possibilities for any
scoring technique to work across all workloads, and the situation
might change significantly between the time we calculate the score
and the time that autovacuum processes it.  However, we have
attemped to develop something that is expected to work for a large
portion of workloads with reasonable parameter settings.

The score is calculated as the maximum of the ratios of each of the
table's relevant values to its threshold.  For example, if the
number of inserted tuples is 100, and the insert threshold for the
table is 80, the insert score is 1.25.  If all other scores are
below that value, the table's score will be 1.25.  The other
criteria considered for the score are the table ages (both
relfrozenxid and relminmxid) compared to the corresponding
freeze-max-age setting, the number of update/deleted tuples
compared to the vacuum threshold, and the number of
inserted/updated/deleted tuples compared to the analyze threshold.

Once exception to the previous paragraph is for tables nearing
wraparound, i.e., those that have surpassed the effective failsafe
ages.  In that case, the relfrozenxid/relminmxid-based score is
scaled aggressively so that the table has a decent chance of
sorting to the front of the list.

To adjust how strongly each component contributes to the score, the
following parameters can be adjusted from their default of 1.0 to
anywhere between 0.0 and 10.0 (inclusive).  Setting all of these to
0.0 restores pre-v19 prioritization behavior:

	autovacuum_freeze_score_weight
	autovacuum_multixact_freeze_score_weight
	autovacuum_vacuum_score_weight
	autovacuum_vacuum_insert_score_weight
	autovacuum_analyze_score_weight

This is intended to be a baby step towards smarter autovacuum
workers.  Possible future improvements include, but are not limited
to, periodic reprioritization, automatic cost limit adjustments,
and better observability (e.g., a system view that shows current
scores).  While we do not expect this commit to produce any
earth-shattering improvements, it is arguably a prerequisite for
the aforementioned follow-up changes.

Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com>
Reviewed-by: Greg Burd <greg@burd.me>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://postgr.es/m/aOaAuXREwnPZVISO%40nathan
2026-03-27 10:17:05 -05:00
Peter Eisentraut
6857947db5 pgindent: Always clean up .BAK files from pg_bsd_indent
The previous commit let pgindent clean up File::Temp files on SIGINT.
This extends that to also cleaning up the .BAK files, created by
pg_bsd_indent.

Author: Jelte Fennema-Nio <postgres@jeltef.nl>
Discussion: https://www.postgresql.org/message-id/flat/DFCDD5H4J7VX.3GJKRBBDCKQ86@jeltef.nl
2026-03-27 14:26:43 +01:00
Peter Eisentraut
801de0bd44 pgindent: Clean up temp files created by File::Temp on SIGINT
When pressing Ctrl+C while running pgindent, it would often leave around
files like pgtypedefAXUEEA. This slightly changes SIGINT handling so
those files are cleaned up.

Author: Jelte Fennema-Nio <postgres@jeltef.nl>
Discussion: https://www.postgresql.org/message-id/flat/DFCDD5H4J7VX.3GJKRBBDCKQ86@jeltef.nl
2026-03-27 14:26:43 +01:00
Heikki Linnakangas
d6eba30a24 Refactor how user-defined LWLock tranches are stored in shmem
Merge the LWLockTranches and NamedLWLockTrancheRequest data structures
in shared memory into one array of user-defined tranches. The
NamedLWLockTrancheRequest list is now only used in postmaster, to hold
the requests until shared memory is initialized.

Introduce a C struct, LWLockTranches, to hold all the different fields
kept in shared memory. This gives an easier overview of what are all
the things kept in shared memory. Previously, we had separate pointers
for LWLockTrancheNames, LWLockCounter and the (shared memory copy of)
NamedLWLockTrancheRequestArray.

Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://www.postgresql.org/message-id/47aaf57e-1b7b-4e12-bda2-0316081ff50e@iki.fi
2026-03-26 23:47:22 +02:00
Robert Haas
5dcb15e89a pg_plan_advice: Refactor to invent pgpa_planner_info
pg_plan_advice tracks two pieces of per-PlannerInfo data: (1) for each
RTI, the corresponding relation identifier, for purposes of
cross-checking those calculations against the final plan; and (2) the
set of semijoins seen during planning for which the strategy of making
one side unique was considered. The former is tracked using a hash
table that uses <plan_name, RTI> as the key, and the latter is
tracked using a List of <plan_name, relids>.

It seems better to track both of these things in the same way and
to try to reuse some code instead of having everything be completely
separate, so invent pgpa_planner_info; we'll create one every time we
see a new PlannerInfo and need to associate some data with it, and
we'll use the plan_name field to distinguish between PlannerInfo
objects, as it should always be unique. Then, refactor the two
systems mentioned above to use this new infrastructure.

(Note that the adjustment in pgpa_plan_walker is necessary in order
to avoid spuriously triggering the sanity check in that function,
in the case where a pgpa_planner_info is created for a purpose not
related to sj_unique_rels.)

Discussion: https://postgr.es/m/CA+TgmoaK=4w7-qknUo3QhUJ53pXZq=c=KgZmRyD+k7ytqfmgSg@mail.gmail.com
Reviewed-by: Lukas Fittl <lukas@fittl.com>
2026-03-26 11:57:33 -04:00
Melanie Plageman
a881cc9c7e Remove XLOG_HEAP2_VISIBLE entirely
There are no remaining users that emit XLOG_HEAP2_VISIBLE records, so it
can be removed. This includes deleting the xl_heap_visible struct and
all functions responsible for emitting or replaying XLOG_HEAP2_VISIBLE
records.

Bumps XLOG_PAGE_MAGIC because we removed a WAL record type.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com
2026-03-24 17:58:12 -04:00
Michael Paquier
4019f725f5 Add support for lock statistics in pgstats
This commit adds a new stats kind, called PGSTAT_KIND_LOCK, implementing
statistics for lock tags, as reported by pg_locks.  The implementation
is fixed-sized, as the data is caped based on the number of lock tags in
LockTagType.

The new statistics kind records the following fields, providing insight
regarding lock behavior, while avoiding impact on performance-critical
code paths (such as fast-path lock acquisition):
- waits and wait_time: respectively track the number of times a lock
required waiting and the total time spent acquiring it.  These metrics
are only collected once a lock is successfully acquired and after
deadlock_timeout has been exceeded.
fastpath_exceeded: counts how often a lock could not be acquired via
the fast path due to the max_locks_per_transaction slot limits.

A new view called pg_stat_lock can be used to access this data, coupled
with a SQL function called pg_stat_get_lock().

Bump stat file format PGSTAT_FILE_FORMAT_ID.
Bump catalog version.

Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/aIyNxBWFCybgBZBS%40ip-10-97-1-34.eu-west-3.compute.internal
2026-03-24 15:32:09 +09:00
Peter Eisentraut
0f17d1dbfa headerscheck: Get CXXFLAGS from Makefile.global
headerscheck in C++ mode (cpluspluscheck) previously hardcoded
CXXFLAGS and documented that you might need to override them manually
from the environment.  Now that we have better C++ support in the
build system, we can just get CXXFLAGS from Makefile.global, like we
do for other variables.

Furthermore, this is necessary in some configurations to make
cpluspluscheck work under meson, because under meson, some -I options
end up in CXXFLAGS where under make they would be in CPPFLAGS.
Therefore, getting the correct CXXFLAGS is required in those cases.

Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAMSWrt-PoQt4sHryWrB1ViuGBJF_PpbjoSGrWR2Ry47bHNLDqg%40mail.gmail.com
2026-03-23 07:53:43 +01:00
Melanie Plageman
4f7ecca84d Detect and fix visibility map corruption in more cases
Move VM corruption detection and repair into heap page pruning. This
allows VM repair during on-access pruning, not only during vacuum.

Also, expand corruption detection to cover pages marked all-visible that
contain dead tuples and tuples inserted or deleted by in-progress
transactions, rather than only all-visible pages with LP_DEAD items.

Pinning the correct VM page before on-access pruning is cheap when
compared to the cost of actually pruning. The vmbuffer is saved in the
scan descriptor, so a query should only need to pin each VM page once,
and a single VM page covers a large number of heap pages.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/bqc4kh5midfn44gnjiqez3bjqv4zogydguvdn446riw45jcf3y%404ez66il7ebvk
2026-03-22 11:52:40 -04:00
Andrew Dunstan
b15c151398 pg_waldump: Add support for reading WAL from tar archives
pg_waldump can now accept the path to a tar archive (optionally
compressed with gzip, lz4, or zstd) containing WAL files and decode
them.  This was added primarily for pg_verifybackup, which previously
had to skip WAL parsing for tar-format backups.

The implementation uses the existing archive streamer infrastructure
with a hash table to track WAL segments read from the archive.  If WAL
files within the archive are not in sequential order, out-of-order
segments are written to a temporary directory (created via mkdtemp under
$TMPDIR or the archive's directory) and read back when needed.  An
atexit callback ensures the temporary directory is cleaned up.

The --follow option is not supported when reading from a tar archive.

Author: Amul Sul <sulamul@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
discussion: https://postgr.es/m/CAAJ_b94bqdWN3h2J-PzzzQ2Npbwct5ZQHggn_QoYGhC2rn-=WQ@mail.gmail.com
2026-03-20 15:31:35 -04:00
Andrew Dunstan
a2145605ee introduce CopyFormat, refactor CopyFormatOptions
Currently, the COPY command format is determined by two boolean fields
(binary, csv_mode) in CopyFormatOptions.  This approach, while
functional, isn't ideal for implementing other formats in the future.

To simplify adding new formats, introduce a CopyFormat enum.  This makes
the code cleaner and more maintainable, allowing for easier integration
of additional formats down the line.

Author: Joel Jacobson <joel@compiler.org>
Author: jian he <jian.universality@gmail.com>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Discussion: https://postgr.es/m/CALvfUkBxTYy5uWPFVwpk_7ii2zgT07t3d-yR_cy4sfrrLU%3Dkcg%40mail.gmail.com
Discussion: https://postgr.es/m/6a04628d-0d53-41d9-9e35-5a8dc302c34c@joeconway.com
2026-03-20 08:21:57 -04:00
Masahiko Sawada
adcdbe9386 Add parallel vacuum worker usage to VACUUM (VERBOSE) and autovacuum logs.
This commit adds both the number of parallel workers planned and the
number of parallel workers actually launched to the output of
VACUUM (VERBOSE) and autovacuum logs.

Previously, this information was only reported as an INFO message
during VACUUM (VERBOSE), which meant it was not included in autovacuum
logs in practice. Although autovacuum does not yet support parallel
vacuum, a subsequent patch will enable it and utilize these logs in
its regression tests. This change also improves observability by
making it easier to verify if parallel vacuum is utilizing the
expected number of workers.

Author: Daniil Davydov <3danissimo@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Discussion: https://postgr.es/m/CACG=ezZOrNsuLoETLD1gAswZMuH2nGGq7Ogcc0QOE5hhWaw=cw@mail.gmail.com
2026-03-19 15:01:47 -07:00
Michael Paquier
015d32016d test_saslprep: Apply proper indentation
Noticed before koel has the idea to complain.  Rebase thinko from commit
aa73838a5c.
2026-03-19 13:42:24 +09:00
Tom Lane
8df3c7a85e Exclude contrib/pg_plan_advice/pgpa_parser.h from headerscheck.
Like other Bison-written headers, it's not worth the trouble to
make this compilable standalone.  (We might revisit this someday,
if we ever move up our minimum required Bison version.)
2026-03-18 13:10:22 -04:00
Peter Eisentraut
9b406a9e48 Update RELEASE_CHANGES
The existing instructions did not cover meson.  Point to
src/common/unicode/README instead, where there is more information.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/2a668979-ed92-49a3-abf9-a3ec2d460ec2%40eisentraut.org
2026-03-18 13:42:06 +01:00
Daniel Gustafsson
4f433025f6 ssl: Serverside SNI support for libpq
Support for SNI was added to clientside libpq in 5c55dc8b47 with the
sslsni parameter, but there was no support for utilizing it serverside.
This adds support for serverside SNI such that certificate/key handling
is available per host.  A new config file, $datadir/pg_hosts.conf, is
used for configuring which certificate and key should be used for which
hostname.  In order to use SNI the ssl_sni GUC must be set to on, when
it is off the ssl configuration works just like before.  If ssl_sni is
enabled and pg_hosts.conf is non-empty it will take precedence over
the regular SSL GUCs, if it is empty or missing the regular GUCs will
be used just as before this commit with no hostname specific handling.
The TLS init hook is not compatible with ssl_sni since it operates on
a single TLS configuration and SNI break that assumption.  If the init
hook and ssl_sni are both enabled, a WARNING will be issued.

Host configuration can either be for a literal hostname to match, non-
SNI connections using the no_sni keyword or a default fallback matching
all connections.  By omitting no_sni and the fallback a strict mode
can be achieved where only connections using sslsni=1 and a specified
hostname are allowed.

CRL file(s) are applied from postgresql.conf to all configured hostnames.

Serverside SNI requires OpenSSL, currently LibreSSL does not support
the required infrastructure to update the SSL context during the TLS
handshake.

Author: Daniel Gustafsson <daniel@yesql.se>
Co-authored-by: Jacob Champion <jacob.champion@enterprisedb.com>
Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Dewei Dai <daidewei1970@163.com>
Reviewed-by: Cary Huang <cary.huang@highgo.ca>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/1C81CD0D-407E-44F9-833A-DD0331C202E5@yesql.se
2026-03-18 12:37:11 +01:00
Peter Eisentraut
e82fc27e09 meson: Add headerscheck and cpluspluscheck targets
Author: Miłosz Bieniek <bieniek.milosz0@gmail.com>
Co-authored-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAMSWrt-PoQt4sHryWrB1ViuGBJF_PpbjoSGrWR2Ry47bHNLDqg%40mail.gmail.com
2026-03-18 11:27:43 +01:00
Peter Eisentraut
2f094e7ac6 SQL Property Graph Queries (SQL/PGQ)
Implementation of SQL property graph queries, according to SQL/PGQ
standard (ISO/IEC 9075-16:2023).

This adds:

- GRAPH_TABLE table function for graph pattern matching
- DDL commands CREATE/ALTER/DROP PROPERTY GRAPH
- several new system catalogs and information schema views
- psql \dG command
- pg_get_propgraphdef() function for pg_dump and psql

A property graph is a relation with a new relkind RELKIND_PROPGRAPH.
It acts like a view in many ways.  It is rewritten to a standard
relational query in the rewriter.  Access privileges act similar to a
security invoker view.  (The security definer variant is not currently
implemented.)

Starting documentation can be found in doc/src/sgml/ddl.sgml and
doc/src/sgml/queries.sgml.

Author: Peter Eisentraut <peter@eisentraut.org>
Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Reviewed-by: Ajay Pal <ajay.pal.k@gmail.com>
Reviewed-by: Henson Choi <assam258@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/a855795d-e697-4fa5-8698-d20122126567@eisentraut.org
2026-03-16 10:14:18 +01:00
Michael Paquier
bfa3c4f106 Optimize hash index bulk-deletion with streaming read
This commit refactors hashbulkdelete() to use streaming reads, improving
the efficiency of the operation by prefetching upcoming buckets while
processing a current bucket.  There are some specific changes required
to make sure that the cleanup work happens in accordance to the data
pushed to the stream read callback.  When the cached metadata page is
refreshed to be able to process the next set of buckets, the stream is
reset and the data fed to the stream read callback has to be updated.
The reset needs to happen in two code paths, when _hash_getcachedmetap()
is called.

The author has seen better performance numbers than myself on this one
(with tweaks similar to 6c228755ad).  The numbers are good enough for
both of us that this change is worth doing, in terms of IO and runtime.

Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/CABPTF7VrqfbcDXqGrdLQ2xaQ=K0RzExNuw6U_GGqzSJu32wfdQ@mail.gmail.com
2026-03-16 09:22:09 +09:00
Michael Paquier
ae58189a4d pgstattuple: Optimize pgstattuple_approx() with streaming read
This commit plugs into pgstattuple_approx(), the SQL function faster
than pgstattuple() that returns approximate results, the streaming read
APIs.  A callback is used to be able to skip all-visible pages via VM
lookup, to match with the logic prior to this commit.

Under test conditions similar to 6c228755ad (some dm_delay and
debug_io_direct=data), this can substantially improve the execution time
of the function, particularly for large relations.

Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/CABPTF7VrqfbcDXqGrdLQ2xaQ=K0RzExNuw6U_GGqzSJu32wfdQ@mail.gmail.com
2026-03-14 15:06:13 +09:00
Jacob Champion
6225403f27 libpq-oauth: Use the PGoauthBearerRequestV2 API
Switch the private libpq-oauth ABI to a public one, based on the new
PGoauthBearerRequestV2 API. A huge amount of glue code can be removed as
part of this, and several code paths can be deduplicated. Additionally,
the shared library no longer needs to change its name for every major
release; it's now just "libpq-oauth.so".

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Discussion: https://postgr.es/m/CAOYmi%2BmrGg%2Bn_X2MOLgeWcj3v_M00gR8uz_D7mM8z%3DdX1JYVbg%40mail.gmail.com
2026-03-13 09:37:59 -07:00
Heikki Linnakangas
f9de9bf302 Add callback for I/O error messages in SLRUs
Historically, all SLRUs were addressed by transaction IDs, but that
hasn't been true for a long time. However, the error message on I/O
error still always talked about accessing a transaction ID.

This commit adds a callback that allows subsystems to construct their
own error messages, which can then correctly refer to a transaction
ID, multixid or whatever else is used to address the particular SLRU.

Author: Maxim Orlov <orlovmg@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://www.postgresql.org/message-id/CACG=ezZZfurhYV+66ceubxQAyWqv9vaUi0yoO4-t48OE5xc0DQ@mail.gmail.com
2026-03-13 16:21:06 +02:00
Peter Geoghegan
a367c433ad Use simplehash for backend-private buffer pin refcounts.
Replace dynahash with simplehash for the per-backend PrivateRefCountHash
overflow table.  Simplehash generates inlined, open-addressed lookup
code, avoiding the per-call overhead of dynahash that becomes noticeable
when many buffers are pinned with a CPU-bound workload.

Motivated by testing of the index prefetching patch, which pins many
more buffers concurrently than typical index scans.

Author: Peter Geoghegan <pg@bowt.ie>
Suggested-by: Andres Freund <andres@anarazel.de>
Reviewed-By: Tomas Vondra <tomas@vondra.me>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-Wz=g=JTSyDB4UtB5su2ZcvsS7VbP+ZMvvaG6ABoCb+s8Lw@mail.gmail.com
2026-03-12 13:26:16 -04:00
Robert Haas
5883ff30b0 Add pg_plan_advice contrib module.
Provide a facility that (1) can be used to stabilize certain plan choices
so that the planner cannot reverse course without authorization and
(2) can be used by knowledgeable users to insist on plan choices contrary
to what the planner believes best. In both cases, terrible outcomes are
possible: users should think twice and perhaps three times before
constraining the planner's ability to do as it thinks best; nevertheless,
there are problems that are much more easily solved with these facilities
than without them.

This patch takes the approach of analyzing a finished plan to produce
textual output, which we call "plan advice", that describes key
decisions made during plan; if that plan advice is provided during
future planning cycles, it will force those key decisions to be made in
the same way.  Not all planner decisions can be controlled using advice;
for example, decisions about how to perform aggregation are currently
out of scope, as is choice of sort order. Plan advice can also be edited
by the user, or even written from scratch in simple cases, making it
possible to generate outcomes that the planner would not have produced.
Partial advice can be provided to control some planner outcomes but not
others.

Currently, plan advice is focused only on specific outcomes, such as
the choice to use a sequential scan for a particular relation, and not
on estimates that might contribute to those outcomes, such as a
possibly-incorrect selectivity estimate. While it would be useful to
users to be able to provide plan advice that affects selectivity
estimates or other aspects of costing, that is out of scope for this
commit.

Reviewed-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Greg Burd <greg@burd.me>
Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com>
Reviewed-by: Haibo Yan <tristan.yim@gmail.com>
Reviewed-by: Dian Fay <di@nmfay.com>
Reviewed-by: Ajay Pal <ajay.pal.k@gmail.com>
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com>
Discussion: http://postgr.es/m/CA+TgmoZ-Jh1T6QyWoCODMVQdhTUPYkaZjWztzP1En4=ZHoKPzw@mail.gmail.com
2026-03-12 13:00:43 -04:00
Richard Guo
383eb21ebf Convert NOT IN sublinks to anti-joins when safe
The planner has historically been unable to convert "x NOT IN (SELECT
y ...)" sublinks into anti-joins.  This is because standard SQL
semantics for NOT IN require that if the comparison "x = y" returns
NULL, the "NOT IN" expression evaluates to NULL (effectively false),
causing the row to be discarded.  In contrast, an anti-join preserves
the row if no match is found.  Due to this semantic mismatch regarding
NULL handling, the conversion was previously considered unsafe.

However, if we can prove that neither side of the comparison can yield
NULL values, and further that the operator itself cannot return NULL
for non-null inputs, the behavior of NOT IN and anti-join becomes
identical.  Enabling this conversion allows the planner to treat the
sublink as a first-class relation rather than an opaque SubPlan
filter.  This unlocks global join ordering optimization and permits
the selection of the most efficient join algorithm based on cost,
often yielding significant performance improvements for large
datasets.

This patch verifies that neither side of the comparison can be NULL
and that the operator is safe regarding NULL results before performing
the conversion.

To verify operator safety, we require that the operator be a member of
a B-tree or Hash operator family.  This serves as a proxy for standard
boolean behavior, ensuring the operator does not return NULL on valid
non-null inputs, as doing so would break index integrity.

For operand non-nullability, this patch makes use of several existing
mechanisms.  It leverages the outer-join-aware-Var infrastructure to
verify that a Var does not come from the nullable side of an outer
join, and consults the NOT-NULL-attnums hash table to efficiently
verify schema-level NOT NULL constraints.  Additionally, it employs
find_nonnullable_vars to identify Vars forced non-nullable by qual
clauses, and expr_is_nonnullable to deduce non-nullability for other
expression types.

The logic for verifying the non-nullability of the subquery outputs
was adapted from prior work by David Rowley and Tom Lane.

Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com>
Reviewed-by: Zhang Mingli <zmlpostgres@gmail.com>
Reviewed-by: Japin Li <japinli@hotmail.com>
Discussion: https://postgr.es/m/CAMbWs495eF=-fSa5CwJS6B-BaEi3ARp0UNb4Lt3EkgUGZJwkAQ@mail.gmail.com
2026-03-12 09:45:18 +09:00
Andres Freund
82467f627b Require share-exclusive lock to set hint bits and to flush
At the moment hint bits can be set with just a share lock on a page (and,
until 45f658dacb, in one case even without any lock). Because of this we need
to copy pages while writing them out, as otherwise the checksum could be
corrupted.

The need to copy the page is problematic to implement AIO writes:

1) Instead of just needing a single buffer for a copied page we need one for
   each page that's potentially undergoing I/O
2) To be able to use the "worker" AIO implementation the copied page needs to
   reside in shared memory

It also causes problems for using unbuffered/direct-IO, independent of AIO:
Some filesystems, raid implementations, ... do not tolerate the data being
written out to change during the write. E.g. they may compute internal
checksums that can be invalidated by concurrent modifications, leading e.g. to
filesystem errors (as the case with btrfs).

It also just is plain odd to allow modifications of buffers that are just
share locked.

To address these issues, this commit changes the rules so that modifications
to pages are not allowed anymore while holding a share lock. Instead the new
share-exclusive lock (introduced in fcb9c977aa) allows at most one backend to
modify a buffer while other backends have the same page share locked. An
existing share-lock can be upgraded to a share-exclusive lock, if there are no
conflicting locks. For that BufferBeginSetHintBits()/BufferFinishSetHintBits()
and BufferSetHintBits16() have been introduced.

To prevent hint bits from being set while the buffer is being written out,
writing out buffers now requires a share-exclusive lock.

The use of share-exclusive to gate setting hint bits means that from now on
only one backend can set hint bits at a time. To allow multiple backends to
set hint bits would require more complicated locking: For setting hint bits
we'd need to store the count of backends currently setting hint bits and we
would need another lock-level for I/O conflicting with the lock-level to set
hint bits. Given that the share-exclusive lock for setting hint bits is only
held for a short time, that backends would often just set the same hint bits
and that the cost of occasionally not setting hint bits in hotly accessed
pages is fairly low, this seems like an acceptable tradeoff.

The biggest change to adapt to this is in heapam. To avoid performance
regressions for sequential scans that need to set a lot of hint bits, we need
to amortize the cost of BufferBeginSetHintBits() for cases where hint bits are
set at a high frequency. To that end HeapTupleSatisfiesMVCCBatch() uses the
new SetHintBitsExt(), which defers BufferFinishSetHintBits() until all hint
bits on a page have been set.  Conversely, to avoid regressions in cases where
we can't set hint bits in bulk (because we're looking only at individual
tuples), use BufferSetHintBits16() when setting hint bits without batching.

Several other places also need to be adapted, but those changes are
comparatively simpler.

After this we do not need to copy buffers to write them out anymore. That
change is done separately however.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf%40gcnactj4z56m
2026-03-10 19:32:13 -04:00
Álvaro Herrera
ac58465e06
Introduce the REPACK command
REPACK absorbs the functionality of VACUUM FULL and CLUSTER in a single
command.  Because this functionality is completely different from
regular VACUUM, having it separate from VACUUM makes it easier for users
to understand; as for CLUSTER, the term is heavily overloaded in the
IT world and even in Postgres itself, so it's good that we can avoid it.

We retain those older commands, but de-emphasize them in the
documentation, in favor of REPACK; the difference between VACUUM FULL
and CLUSTER (namely, the fact that tuples are written in a specific
ordering) is neatly absorbed as two different modes of REPACK.

This allows us to introduce further functionality in the future that
works regardless of whether an ordering is being applied, such as (and
especially) a concurrent mode.

Author: Antonin Houska <ah@cybertec.at>
Reviewed-by: Mihail Nikalayeu <mihailnikalayeu@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Robert Treat <rob@xzilla.net>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Reviewed-by: jian he <jian.universality@gmail.com>
Discussion: https://postgr.es/m/82651.1720540558@antos
Discussion: https://postgr.es/m/202507262156.sb455angijk6@alvherre.pgsql
2026-03-10 19:56:39 +01:00
Jacob Champion
e982331b52 libpq: Introduce PQAUTHDATA_OAUTH_BEARER_TOKEN_V2
For the libpq-oauth module to eventually make use of the
PGoauthBearerRequest API, it needs some additional functionality: the
derived Issuer ID for the authorization server needs to be provided, and
error messages need to be built without relying on PGconn internals.
These features seem useful for application hooks, too, so that they
don't each have to reinvent the wheel.

The original plan was for additions to PGoauthBearerRequest to be made
without a version bump to the PGauthData type. Applications would simply
check a LIBPQ_HAS_* macro at compile time to decide whether they could
use the new features. That theoretically works for applications linked
against libpq, since it's not safe to downgrade libpq from the version
you've compiled against.

We've since found that this strategy won't work for plugins, due to a
complication first noticed during the libpq-oauth module split: it's
normal for a plugin on disk to be *newer* than the libpq that's loading
it, because you might have upgraded your installation while an
application was running. (In other words, a plugin architecture causes
the compile-time and run-time dependency arrows to point in opposite
directions, so plugins won't be able to rely on the LIBPQ_HAS_* macros
to determine what APIs are available to them.)

Instead, extend the original PGoauthBearerRequest (now retroactively
referred to as "v1" in the code) with a v2 subclass-style struct. When
an application implements and accepts PQAUTHDATA_OAUTH_BEARER_TOKEN_V2,
it may safely cast the base request pointer it receives in its callbacks
to v2 in order to make use of the new functionality. libpq will query
the application for a v2 hook first, then v1 to maintain backwards
compatibility, before giving up and using the builtin flow.

Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Discussion: https://postgr.es/m/CAOYmi%2BmrGg%2Bn_X2MOLgeWcj3v_M00gR8uz_D7mM8z%3DdX1JYVbg%40mail.gmail.com
2026-03-06 12:05:51 -08:00
John Naylor
16743db061 Centralize detection of x86 CPU features
We now maintain an array of booleans that indicate which features were
detected at runtime. When code wants to check for a given feature,
the array is automatically checked if it has been initialized and if
not, a single function checks all features at once.

Move all x86 feature detection to pg_cpu_x86.c, and move the CRC
function choosing logic to the file where the hardware-specific
functions are defined, consistent with more recent hardware-specific
files in src/port.

Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Discussion: https://postgr.es/m/CANWCAZbgEUFw7LuYSVeJ=Tj98R5HoOB1Ffeqk3aLvbw5rU5NTw@mail.gmail.com
2026-02-27 20:30:41 +07:00
Andrew Dunstan
763aaa06f0 Add non-text output formats to pg_dumpall
pg_dumpall can now produce output in custom, directory, or tar formats
in addition to plain text SQL scripts. When using non-text formats,
pg_dumpall creates a directory containing:
- toc.glo: global data (roles and tablespaces) in custom format
- map.dat: mapping between database OIDs and names
- databases/: subdirectory with per-database archives named by OID

pg_restore is extended to handle these pg_dumpall archives, restoring
globals and then each database. The --globals-only option can be used
to restore only the global objects.

This enables parallel restore of pg_dumpall output and selective
restoration of individual databases from a cluster-wide backup.

Author: Mahendra Singh Thalor <mahi6run@gmail.com>
Co-Author: Andrew Dunstan <andrew@dunslane.net>
Reviewed-By: Tushar Ahuja <tushar.ahuja@enterprisedb.com>
Reviewed-By: Jian He <jian.universality@gmail.com>
Reviewed-By: Vaibhav Dalvi <vaibhav.dalvi@enterprisedb.com>
Reviewed-By: Srinath Reddy <srinath2133@gmail.com>

Discussion: https://postgr.es/m/cb103623-8ee6-4ba5-a2c9-f32e3a4933fa@dunslane.net
2026-02-26 08:29:56 -05:00
Tom Lane
4a1b05caa5 Restore AIX support.
The concerns that led us to remove AIX support in commit 0b16bb877
have now been alleviated:

1. IBM has stepped forward to provide support, including buildfarm
animal(s).
2. AIX 7.2 and later seem to be fine with large pg_attribute_aligned
requirements.  Since 7.1 is now EOL anyway, we can just cease to
support it.
3. Tossing xlc support overboard seems okay as well.  It's a bit
sad to drop one of the few remaining non-gcc-alike compilers, but
working around xlc's bugs and idiosyncrasies doesn't seem justified
by the theoretical portability benefits.
4. Likewise, we can stop supporting 32-bit AIX builds.  This is
not so much about whether we could build such executables as that
they're too much of a pain to manage in the field, due to limited
address space available for dynamic library loading.
5. We hit on a way to manage catalog column alignment that doesn't
require continuing developer effort (see commit ecae09725).

Hence, this commit reverts 0b16bb877 and some follow-on commits
such as e6bb491bf, except for not putting back XLC support nor
the changes related to catalog column alignment.

Some other notable changes from the way things were in v16:

Prefer unnamed POSIX semaphores on AIX, rather than the default
choice of SysV semaphores.

Include /opt/freeware/lib in -Wl,-blibpath, even when it is not
mentioned anywhere in LDFLAGS.

Remove platform-specific adjustment of MEMSET_LOOP_LIMIT; maybe
that's still the right thing, but it really ought to be re-tested.

Silence compiler warnings related to getpeereid(), wcstombs_l(),
and PAM conversation procs.

Accept "libpythonXXX.a" as an okay name for the Python shared
library (but only on AIX!).

Author: Aditya Kamath <Aditya.Kamath1@ibm.com>
Author: Srirama Kucherlapati <sriram.rk@in.ibm.com>
Co-authored-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CY5PR11MB63928CC05906F27FB10D74D0FD322@CY5PR11MB6392.namprd11.prod.outlook.com
2026-02-23 13:34:22 -05:00
Peter Eisentraut
3a63b76571 Fix additional fallthrough warnings from clang
Clang warns if falling through to a case or default label that is
immediately followed by break, but GCC does
not (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91432).  (MSVC also
warns about the equivalent code in C++.)

This is in preparation for enabling fallthrough warnings on Clang.

Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl>
Discussion: https://www.postgresql.org/message-id/flat/76a8efcd-925a-4eaf-bdd1-d972cd1a32ff%40eisentraut.org
2026-02-23 07:40:19 +01:00
Peter Eisentraut
8354b9d6b6 Use fallthrough attribute instead of comment
Instead of using comments to mark fallthrough switch cases, use the
fallthrough attribute.  This will (in the future, not here) allow
supporting other compilers besides gcc.  The commenting convention is
only supported by gcc, the attribute is supported by clang, and in the
fullness of time the C23 standard attribute would allow supporting
other compilers as well.

Right now, we package the attribute into a macro called
pg_fallthrough.  This commit defines that macro and replaces the
existing comments with that macro invocation.

We also raise the level of the gcc -Wimplicit-fallthrough= option from
3 to 5 to enforce the use of the attribute.

Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl>
Discussion: https://www.postgresql.org/message-id/flat/76a8efcd-925a-4eaf-bdd1-d972cd1a32ff%40eisentraut.org
2026-02-19 08:51:12 +01:00
John Naylor
ef3c3cf6d0 Perform radix sort on SortTuples with pass-by-value Datums
Radix sort can be much faster than quicksort, but for our purposes it
is limited to sequences of unsigned bytes. To make tuples with other
types amenable to this technique, several features of tuple comparison
must be accounted for, i.e. the sort key must be "normalized":

1. Signedness -- It's possible to modify a signed integer such that
it can be compared as unsigned. For example, a signed char has range
-128 to 127. If we cast that to unsigned char and add 128, the range
of values becomes 0 to 255 while preserving order.

2. Direction -- SQL allows specification of ASC or DESC. The
descending case is easily handled by taking the complement of the
unsigned representation.

3. NULL values -- NULLS FIRST and NULLS LAST must work correctly.

This commmit only handles the case where datum1 is pass-by-value
Datum (possibly abbreviated) that compares like an ordinary
integer. (Abbreviations of values of type "numeric" are a convenient
counterexample.) First, tuples are partitioned by nullness in the
correct NULL ordering. Then the NOT NULL tuples are sorted with radix
sort on datum1. For tiebreaks on subsequent sortkeys (including the
first sort key if abbreviated), we divert to the usual qsort.

ORDER BY queries on pre-warmed buffers are up to 2x faster on high
cardinality inputs with radix sort than the sort specializations added
by commit 697492434, so get rid of them. It's sufficient to fall back
to qsort_tuple() for small arrays. Moderately low cardinality inputs
show more modest improvents. Our qsort is strongly optimized for very
low cardinality inputs, but radix sort is usually equal or very close
in those cases.

The changes to the regression tests are caused by under-specified sort
orders, e.g. "SELECT a, b from mytable order by a;". For unstable
sorts, such as our qsort and this in-place radix sort, there is no
guarantee of the order of "b" within each group of "a".

The implementation is taken from ska_byte_sort() (Boost licensed),
which is similar to American flag sort (an in-place radix sort) with
modifications to make it better suited for modern pipelined CPUs.

The technique of normalization described above can also be extended
to the case of multiple keys. That is left for future work (Thanks
to Peter Geoghegan for the suggestion to look into this area).

Reviewed-by: Chengpeng Yan <chengpeng_yan@outlook.com>
Reviewed-by: zengman <zengman@halodbtech.com>
Reviewed-by: ChangAo Chen <cca5507@qq.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Chao Li <li.evan.chao@gmail.com> (earlier version)
Discussion: https://postgr.es/m/CANWCAZYzx7a7E9AY16Jt_U3+GVKDADfgApZ-42SYNiig8dTnFA@mail.gmail.com
2026-02-14 13:50:06 +07:00
Dean Rasheed
88327092ff Add support for INSERT ... ON CONFLICT DO SELECT.
This adds a new ON CONFLICT action DO SELECT [FOR UPDATE/SHARE], which
returns the pre-existing rows when conflicts are detected. The INSERT
statement must have a RETURNING clause, when DO SELECT is specified.

The optional FOR UPDATE/SHARE clause allows the rows to be locked
before they are are returned. As with a DO UPDATE conflict action, an
optional WHERE clause may be used to prevent rows from being selected
for return (but as with a DO UPDATE action, rows filtered out by the
WHERE clause are still locked).

Bumps catversion as stored rules change.

Author: Andreas Karlsson <andreas@proxel.se>
Author: Marko Tiikkaja <marko@joh.to>
Author: Viktor Holmberg <v@viktorh.net>
Reviewed-by: Joel Jacobson <joel@compiler.org>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com>
Reviewed-by: Jian He <jian.universality@gmail.com>
Discussion: https://postgr.es/m/d631b406-13b7-433e-8c0b-c6040c4b4663@Spark
Discussion: https://postgr.es/m/5fca222d-62ae-4a2f-9fcb-0eca56277094@Spark
Discussion: https://postgr.es/m/2b5db2e6-8ece-44d0-9890-f256fdca9f7e@proxel.se
Discussion: https://postgr.es/m/CAL9smLCdV-v3KgOJX3mU19FYK82N7yzqJj2HAwWX70E=P98kgQ@mail.gmail.com
2026-02-12 09:57:04 +00:00
Robert Haas
7358abcc60 Store information about Append node consolidation in the final plan.
An extension (or core code) might want to reconstruct the planner's
decisions about whether and where to perform partitionwise joins from
the final plan. To do so, it must be possible to find all of the RTIs
of partitioned tables appearing in the plan. But when an AppendPath
or MergeAppendPath pulls up child paths from a subordinate AppendPath
or MergeAppendPath, the RTIs of the subordinate path do not appear
in the final plan, making this kind of reconstruction impossible.

To avoid this, propagate the RTI sets that would have been present
in the 'apprelids' field of the subordinate Append or MergeAppend
nodes that would have been created into the surviving Append or
MergeAppend node, using a new 'child_append_relid_sets' field for
that purpose. The value of this field is a list of Bitmapsets,
because each relation whose append-list was pulled up had its own
set of RTIs: just one, if it was a partitionwise scan, or more than
one, if it was a partitionwise join. Since our goal is to see where
partitionwise joins were done, it is essential to avoid losing the
information about how the RTIs were grouped in the pulled-up
relations.

This commit also updates pg_overexplain so that EXPLAIN (RANGE_TABLE)
will display the saved RTI sets.

Co-authored-by: Robert Haas <rhaas@postgresql.org>
Co-authored-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Greg Burd <greg@burd.me>
Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Haibo Yan <tristan.yim@gmail.com>
Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com>
Discussion: http://postgr.es/m/CA+TgmoZ-Jh1T6QyWoCODMVQdhTUPYkaZjWztzP1En4=ZHoKPzw@mail.gmail.com
2026-02-10 17:55:59 -05:00
Michael Paquier
9181c870ba Improve type handling of varlena structures
This commit changes the definition of varlena to a typedef, so as it
becomes possible to remove "struct" markers from various declarations in
the code base.  Historically, "struct" markers are not the project style
for variable declarations, so this update simplifies the code and makes
it more consistent across the board.

This change has an impact on the following structures, simplifying
declarations using them:
- varlena
- varatt_indirect
- varatt_external

This cleanup has come up in a different path set that played with
TOAST and varatt.h, independently worth doing on its own.

Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Andreas Karlsson <andreas@proxel.se>
Reviewed-by: Shinya Kato <shinya11.kato@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/aW8xvVbovdhyI4yo@paquier.xyz
2026-02-11 07:33:24 +09:00
Robert Haas
0d4391b265 Store information about elided nodes in the final plan.
An extension (or core code) might want to reconstruct the planner's
choice of join order from the final plan. To do so, it must be possible
to find all of the RTIs that were part of the join problem in that plan.
Commit adbad833f3, together with the
earlier work in 8c49a484e8, is enough to
let us match up RTIs we see in the final plan with RTIs that we see
during the planning cycle, but we still have a problem if the planner
decides to drop some RTIs out of the final plan altogether.

To fix that, when setrefs.c removes a SubqueryScan, single-child Append,
or single-child MergeAppend from the final Plan tree, record the type of
the removed node and the RTIs that the removed node would have scanned
in the final plan tree. It would be natural to record this information
on the child of the removed plan node, but that would require adding an
additional pointer field to type Plan, which seems undesirable.  So,
instead, store the information in a separate list that the executor need
never consult, and use the plan_node_id to identify the plan node with
which the removed node is logically associated.

Also, update pg_overexplain to display these details.

Reviewed-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Greg Burd <greg@burd.me>
Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Haibo Yan <tristan.yim@gmail.com>
Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com>
Discussion: http://postgr.es/m/CA+TgmoZ-Jh1T6QyWoCODMVQdhTUPYkaZjWztzP1En4=ZHoKPzw@mail.gmail.com
2026-02-10 16:46:05 -05:00
Robert Haas
adbad833f3 Store information about range-table flattening in the final plan.
Suppose that we're currently planning a query and, when that same
query was previously planned and executed, we learned something about
how a certain table within that query should be planned. We want to
take note when that same table is being planned during the current
planning cycle, but this is difficult to do, because the RTI of the
table from the previous plan won't necessarily be equal to the RTI
that we see during the current planning cycle. This is because each
subquery has a separate range table during planning, but these are
flattened into one range table when constructing the final plan,
changing RTIs.

Commit 8c49a484e8 allows us to match up
subqueries seen in the previous planning cycles with the subqueries
currently being planned just by comparing textual names, but that's
not quite enough to let us deduce anything about individual tables,
because we don't know where each subquery's range table appears in
the final, flattened range table.

To fix that, store a list of SubPlanRTInfo objects in the final
planned statement, each including the name of the subplan, the offset
at which it begins in the flattened range table, and whether or not
it was a dummy subplan -- if it was, some RTIs may have been dropped
from the final range table, but also there's no need to control how
a dummy subquery gets planned. The toplevel subquery has no name and
always begins at rtoffset 0, so we make no entry for it.

This commit teaches pg_overexplain's RANGE_TABLE option to make use
of this new data to display the subquery name for each range table
entry.

Reviewed-by: Lukas Fittl <lukas@fittl.com>
Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Greg Burd <greg@burd.me>
Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Haibo Yan <tristan.yim@gmail.com>
Reviewed-by: Alexandra Wang <alexandra.wang.oss@gmail.com>
Discussion: http://postgr.es/m/CA+TgmoZ-Jh1T6QyWoCODMVQdhTUPYkaZjWztzP1En4=ZHoKPzw@mail.gmail.com
2026-02-10 15:33:39 -05:00