postgresql

mirror of https://github.com/postgres/postgres.git synced 2026-02-17 17:55:04 -05:00

Author	SHA1	Message	Date
Álvaro Herrera	1d5800019f	Improve comment about snapshot macros The comment mistakenly had "the others" for "the other", but this commit also reorders the comment so it matches the macros below. Now we describe the levels in increasing strictness. In addition, it seems easier to follow if we introduce one level at a time, rather than describing two, followed by "the other" (and then jumping back to one of the first two). Finally, reword the sentence about the purpose of the macros, which was slightly off-point. Author: Paul Jungwirth <pj@illuminatedcomputing.com> Reviewed-by: Rustam ALLAKOV <rustamallakov@gmail.com> Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Discussion: https://postgr.es/m/CA+renyUp=xja80rBaB6NpY3RRdi750y046x28bo_xg29zKY72Q@mail.gmail.com	2025-09-11 19:49:57 +02:00
Michael Paquier	26eadf4d2b	Fix description of WAL record blocks in hash_xlog.h hash_xlog.h included descriptions for the blocks used in WAL records that were was not completely consistent with how the records are generated, with one block missing for SQUEEZE_PAGE, and inconsistent descriptions used for block 0 in VACUUM_ONE_PAGE and MOVE_PAGE_CONTENTS. This information was incorrect since `c11453ce0a`, cross-checking the logic for the record generation. Author: Kirill Reshke <reshkekirill@gmail.com> Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru> Discussion: https://postgr.es/m/CALdSSPj1j=a1d1hVA3oabRFz0hSU3KKrYtZPijw4UPUM7LY9zw@mail.gmail.com Backpatch-through: 13	2025-09-11 17:17:04 +09:00
Melanie Plageman	4b5f206de2	Remove unused xl_heap_prune member, reason `f83d709760` refactored xl_heap_prune and added an unused member, reason. While PruneReason is used when constructing this WAL record to set the WAL record definition, it doesn't need to be stored in a separate field in the record. Remove it. We won't backport this, since modifying an exposed struct definition to remove an unused field would do more harm than good. Author: Melanie Plageman <melanieplageman@gmail.com> Reported-by: Andres Freund <andres@anarazel.de> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Discussion: https://postgr.es/m/tvvtfoxz5ykpsctxjbzxg3nldnzfc7geplrt2z2s54pmgto27y%40hbijsndifu45	2025-09-08 14:25:10 -04:00
Amit Kapila	1f7e9ba3ac	Post-commit review fixes for `228c370868`. This commit fixes three issues: 1) When a disabled subscription is created with retain_dead_tuples set to true, the launcher is not woken up immediately, which may lead to delays in creating the conflict detection slot. Creating the conflict detection slot is essential even when the subscription is not enabled. This ensures that dead tuples are retained, which is necessary for accurately identifying the type of conflict during replication. 2) Conflict-related data was unnecessarily retained when the subscription does not have a table. 3) Conflict-relevant data could be prematurely removed before applying prepared transactions on the publisher that are in the commit critical section. This issue occurred because the backend executing COMMIT PREPARED was not accounted for during the computation of oldestXid in the commit phase on the publisher. As a result, the subscriber could advance the conflict slot's xmin without waiting for such COMMIT PREPARED transactions to complete. We fixed this issue by identifying prepared transactions that are in the commit critical section during computation of oldestXid in commit phase. Author: Zhijie Hou <houzj.fnst@fujitsu.com> Reviewed-by: shveta malik <shveta.malik@gmail.com> Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com> Reviewed-by: Nisha Moond <nisha.moond412@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://postgr.es/m/OS9PR01MB16913DACB64E5721872AA5C02943BA@OS9PR01MB16913.jpnprd01.prod.outlook.com Discussion: https://postgr.es/m/OS9PR01MB16913F67856B0DA2A909788129400A@OS9PR01MB16913.jpnprd01.prod.outlook.com	2025-09-08 06:10:15 +00:00
Dean Rasheed	6ede13d1b5	Fix concurrent update issue with MERGE. When executing a MERGE UPDATE action, if there is more than one concurrent update of the target row, the lock-and-retry code would sometimes incorrectly identify the latest version of the target tuple, leading to incorrect results. This was caused by using the ctid field from the TM_FailureData returned by table_tuple_lock() in a case where the result was TM_Ok, which is unsafe because the TM_FailureData struct is not guaranteed to be fully populated in that case. Instead, it should use the tupleid passed to (and updated by) table_tuple_lock(). To reduce the chances of similar errors in the future, improve the commentary for table_tuple_lock() and TM_FailureData to make it clearer that table_tuple_lock() updates the tid passed to it, and most fields of TM_FailureData should not be relied on in non-failure cases. An exception to this is the "traversed" field, which is set in both success and failure cases. Reported-by: Dmitry <dsy.075@yandex.ru> Author: Yugo Nagata <nagata@sraoss.co.jp> Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com> Reviewed-by: Chao Li <li.evan.chao@gmail.com> Discussion: https://postgr.es/m/1570d30e-2b95-4239-b9c3-f7bf2f2f8556@yandex.ru Backpatch-through: 15	2025-09-05 08:18:18 +01:00
Michael Paquier	c6ea528b47	Update outdated references to the SLRU ControlLock SLRU bank locks are referred as "bank locks" or "SLRU bank locks" in the code comments. The comments updated in this commit use the latter term. Oversight in `53c2a97a92`, that has replaced the single ControlLock by the bank control locks. Author: Julien Rouhaud <julien.rouhaud@free.fr> Discussion: https://postgr.es/m/aLUT2UO8RjJOzZNq@jrouhaud Backpatch-through: 17	2025-09-03 10:20:28 +09:00
Michael Paquier	a977e419ee	Refactor ReadMultiXactCounts() into GetMultiXactInfo() This provides a single entry point to access some information about the state of MultiXacts, able to return some data about multixacts offsets and counts. Originally this function was only able to return some information about the number of multixacts and multixact members, extended here to provide some data about the oldest multixact ID in use and the oldest offset, if known. This change has been proposed in a patch that aims at providing more monitoring capabilities for multixacts, and it is useful on its own. GetMultiXactInfo() is added to multixact.h, becoming available for out-of-core code. Extracted from a larger patch by the same author. Author: Naga Appani <nagnrik@gmail.com> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CA+QeY+AAsYK6WvBW4qYzHz4bahHycDAY_q5ECmHkEV_eB9ckzg@mail.gmail.com	2025-08-19 14:04:09 +09:00
Michael Paquier	df9133fa63	Move SQL-callable code related to multixacts into its own file A patch is under discussion to add more SQL capabilities related to multixacts, and this move avoids bloating the file more than necessary. This affects pg_get_multixact_members(). A side effect of this move is the requirement to add mxstatus_to_string() to multixact.h. Extracted from a larger patch by the same author, tweaked by me. Author: Naga Appani <nagnrik@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Discussion: https://postgr.es/m/CA+QeY+AAsYK6WvBW4qYzHz4bahHycDAY_q5ECmHkEV_eB9ckzg@mail.gmail.com	2025-08-18 14:57:55 +09:00
Álvaro Herrera	d0e7e04ede	Avoid including tableam.h and xlogreader.h in nbtree.h Doing that seems rather random and unnecessary. This commit removes those and fixes fallout, which is pretty minimal. We do need to add a forward declaration of struct TM_IndexDeleteOp (whose full definition appears in tableam.h) so that _bt_delitems_delete_check()'s declaration can use it. Author: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/202508051109.lzk3lcuzsaxo@alvherre.pgsql	2025-08-14 17:48:46 +02:00
Fujii Masao	12f3639ee7	Fix incorrect LSN format in comment. The comment previously used %X/08X, which is wrong. Updated it to the standardized format %X/%08X. Author: Japin Li <japinli@hotmail.com> Discussion: https://postgr.es/m/ME0P300MB0445A37908EFCCD15E6D749DB62BA@ME0P300MB0445.AUSP300.PROD.OUTLOOK.COM	2025-08-14 11:12:03 +09:00
Tom Lane	6aebedc384	Grab the low-hanging fruit from forcing sizeof(Datum) to 8. Remove conditionally-compiled code for smaller Datum widths, and simplify comments that describe cases no longer of interest. I also fixed up a few more places that were not using DatumGetIntXX where they should, and made some cosmetic adjustments such as using sizeof(int64) not sizeof(Datum) in places where that fit better with the surrounding code. One thing I remembered while preparing this part is that SP-GiST stores pass-by-value prefix keys as Datums, so that the on-disk representation depends on sizeof(Datum). That's even more unfortunate than the existing commentary makes it out to be, because now there is a hazard that the change of sizeof(Datum) will break SP-GiST indexes on 32-bit machines. It appears that there are no existing SP-GiST opclasses that are actually affected; and if there are some that I didn't find, the number of installations that are using them on 32-bit machines is doubtless tiny. So I'm proceeding on the assumption that we can get away with this, but it's something to worry about. (gininsert.c looks like it has a similar problem, but it's okay because the "tuples" it's constructing are just transient data within the tuplesort step. That's pretty poorly documented though, so I added some comments.) Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://postgr.es/m/1749799.1752797397@sss.pgh.pa.us	2025-08-13 17:18:22 -04:00
Tom Lane	665c3dbba4	Mop-up for Datum conversion cleanups. Fix a couple more places where an explicit Datum conversion is needed (not clear how we missed these in `ff89e182d` and previous commits). Replace the minority usage "(Datum) NULL" with "(Datum) 0". The former depends on the assumption that Datum is the same width as Pointer, the latter doesn't. Anyway consistency is a good thing. This is, I believe, the last of the notational mop-up needed before we can consider changing Datum to uint64 everywhere. It's also important cleanup for more aggressive ideas such as making Datum a struct. Discussion: https://postgr.es/m/1749799.1752797397@sss.pgh.pa.us Discussion: https://postgr.es/m/8246d7ff-f4b7-4363-913e-827dadfeb145@eisentraut.org	2025-08-08 18:44:57 -04:00
Nathan Bossart	35baa60cc7	Rename transformRelOptions()'s "namspace" parameter to "nameSpace". The name "namspace" looks like a typo, but it was presumably meant to avoid using the "namespace" C++ keyword. This commit renames the parameter to "nameSpace" to prevent future confusion while still avoiding the keyword. Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Discussion: https://postgr.es/m/aJJxpfsDfiQ1VbJ5%40nathan	2025-08-06 12:08:07 -05:00
Álvaro Herrera	196063d676	Move enum RecoveryTargetAction to xlogrecovery.h Commit `70e81861fa` split out xlogrecovery.c/h and moved some enums related to recovery targets to xlogrecovery.h. However, it seems that the enum RecoveryTargetAction was inadvertently left out by that commit. This commit moves it to xlogrecovery.h for consistency. Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Discussion: https://postgr.es/m/20240904.173013.1132986940678039655.horikyota.ntt@gmail.com	2025-07-23 11:02:13 +02:00
Michael Paquier	19179dbffc	doc: Inform about aminsertcleanup optional NULLness This index AM callback has been introduced in `c1ec02be1d` and it is optional, currently only being used by BRIN. Optional callbacks are documented with NULL as possible value in amapi.h and indexam.sgml, but this callback has missed this part of the description. Reported-by: Peter Smith <smithpb2250@gmail.com> Reviewed-by: Japin Li <japinli@hotmail.com> Discussion: https://postgr.es/m/CAHut+PvgYcPmPDi1YdHMJY5upnyGRpc0N8pk1xNB11xDSBwNog@mail.gmail.com Backpatch-through: 17	2025-07-22 14:34:15 +09:00
Nathan Bossart	bb938e2c3c	Rename CHECKPOINT_IMMEDIATE to CHECKPOINT_FAST. The new name more accurately reflects the effects of this flag on a requested checkpoint. Checkpoint-related log messages (i.e., those controlled by the log_checkpoints configuration parameter) will now say "fast" instead of "immediate", too. Likewise, references to "immediate" checkpoints in the documentation have been updated to say "fast". This is preparatory work for a follow-up commit that will add a MODE option to the CHECKPOINT command. Author: Christoph Berg <myon@debian.org> Discussion: https://postgr.es/m/aDnaKTEf-0dLiEfz%40msg.df7cb.de	2025-07-11 11:51:25 -05:00
Nathan Bossart	cd8324cc89	Rename CHECKPOINT_FLUSH_ALL to CHECKPOINT_FLUSH_UNLOGGED. The new name more accurately relects the effects of this flag on a requested checkpoint. Checkpoint-related log messages (i.e., those controlled by the log_checkpoints configuration parameter) will now say "flush-unlogged" instead of "flush-all", too. This is preparatory work for a follow-up commit that will add a FLUSH_UNLOGGED option to the CHECKPOINT command. Author: Christoph Berg <myon@debian.org> Discussion: https://postgr.es/m/aDnaKTEf-0dLiEfz%40msg.df7cb.de	2025-07-11 11:51:25 -05:00
Álvaro Herrera	c616785516	Refactor some repetitive SLRU code Functions to bootstrap and zero pages in various SLRU callers were fairly duplicative. We can slash almost two hundred lines with a couple of simple helpers: - SimpleLruZeroAndWritePage: Does the equivalent of SimpleLruZeroPage followed by flushing the page to disk - XLogSimpleInsertInt64: Does a XLogBeginInsert followed by XLogInsert of a trivial record whose data is just an int64. Author: Evgeny Voropaev <evgeny.voropaev@tantorlabs.com> Reviewed by: Álvaro Herrera <alvherre@kurilemu.de> Reviewed by: Andrey Borodin <x4mmm@yandex-team.ru> Reviewed by: Aleksander Alekseev <aleksander@timescale.com> Discussion: https://www.postgresql.org/message-id/flat/97820ce8-a1cd-407f-a02b-47368fadb14b%40tantorlabs.com	2025-07-07 16:49:19 +02:00
Álvaro Herrera	2633dae2e4	Standardize LSN formatting by zero padding This commit standardizes the output format for LSNs to ensure consistent representation across various tools and messages. Previously, LSNs were inconsistently printed as `%X/%X` in some contexts, while others used zero-padding. This often led to confusion when comparing. To address this, the LSN format is now uniformly set to `%X/%08X`, ensuring the lower 32-bit part is always zero-padded to eight hexadecimal digits. Author: Japin Li <japinli@hotmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de> Discussion: https://postgr.es/m/ME0P300MB0445CA53CA0E4B8C1879AF84B641A@ME0P300MB0445.AUSP300.PROD.OUTLOOK.COM	2025-07-07 13:57:43 +02:00
Michael Paquier	62a17a9283	Integrate FullTransactionIds deeper into two-phase code This refactoring is a follow-up of the work done in `5a1dfde833`, that has switched 2PC file names to use FullTransactionIds when written on disk. This will help with the integration of a follow-up solution related to the handling of two-phase files during recovery, to address older defects while reading these from disk after a crash. This change is useful in itself as it reduces the need to build the file names from epoch numbers and TransactionIds, because we can use directly FullTransactionIds from which the 2PC file names are guessed. So this avoids a lot of back-and-forth between the FullTransactionIds retrieved from the file names and how these are passed around in the internal 2PC logic. Note that the core of the change is the use of a FullTransactionId instead of a TransactionId in GlobalTransactionData, that tracks 2PC file information in shared memory. The change in TwoPhaseCallback makes this commit unfit for stable branches. Noah has contributed a good chunk of this patch. I have spent some time on it as well while working on the issues with two-phase state files and recovery. Author: Noah Misch <noah@leadboat.com> Co-Authored-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/Z5sd5O9JO7NYNK-C@paquier.xyz Discussion: https://postgr.es/m/20250116205254.65.nmisch@google.com	2025-07-07 12:50:40 +09:00
Fujii Masao	c84698ceae	Remove leftover dead code from commit_ts.h. Commit `08aa89b326` removed the COMMIT_TS_SETTS WAL record, leaving xl_commit_ts_set and SizeOfCommitTsSet unused. However, it missed removing these definitions. This commit cleans up the leftover code. Since this is a cleanup rather than a bug fix, it is applied only to the master branch. Author: Andy Fan <zhihuifan1213@163.com> Reviewed-by: Fujii Masao <masao.fujii@gmail.com> Discussion: https://postgr.es/m/87ecuzmkqf.fsf@163.com	2025-07-03 23:39:45 +09:00
Michael Paquier	2252fcd427	Rationalize handling of VacuumParams This commit refactors the vacuum routines that rely on VacuumParams, adding const markers where necessary to force a new policy in the code. This structure should not use a pointer as it may be used across multiple relations, and its contents should never be updated. vacuum_rel() stands as an exception as it touches the "index_cleanup" and "truncate" options. VacuumParams has been introduced in `0d83138974`, and `661643deda` has fixed a bug impacting VACUUM operating on multiple relations. The changes done in tableam.h break ABI compatibility, so this commit can only happen on HEAD. Author: Shihao Zhong <zhong950419@gmail.com> Co-authored-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> Reviewed-by: Junwang Zhao <zhjwpku@gmail.com> Discussion: https://postgr.es/m/CAGRkXqTo+aK=GTy5pSc-9cy8H2F2TJvcrZ-zXEiNJj93np1UUw@mail.gmail.com	2025-06-30 15:42:50 +09:00
Tom Lane	b27644bade	Sync typedefs.list with the buildfarm. Our maintenance of typedefs.list has been a little haphazard (and apparently we can't alphabetize worth a darn). Replace the file with the authoritative list from our buildfarm, and run pgindent using that. I also updated the additions/exclusions lists in pgindent where necessary to keep pgindent from messing things up significantly. Notably, now that regex_t and some related names are macros not real typedefs, we have to whitelist them explicitly. The exclusions list has also drifted noticeably, presumably due to changes of system headers on the buildfarm animals that contribute to the list. Unlike in prior years, I've not manually added typedef names that are missing from the buildfarm's list because they are not used to declare any variables or fields. So there are a few places where the typedef declaration itself is formatted worse than before, e.g. typedef enum IoMethod. I could preserve the names that were manually added to the list previously, but I'd really prefer to find a less manual way of dealing with these cases. A quick grep finds about 75 such symbols, most of which have never gotten any special treatment. Per discussion among pgsql-release, doing this now seems appropriate even though we're still a week or two away from making the v18 branch.	2025-06-15 13:04:24 -04:00
Etsuro Fujita	7d4667c620	Revert "postgres_fdw: Inherit the local transaction's access/deferrable modes." We concluded that commit `e5a3c9d9b` is a feature rather than a fix; since it was added after feature freeze, revert it. Reported-by: Fujii Masao <masao.fujii@oss.nttdata.com> Reported-by: Michael Paquier <michael@paquier.xyz> Reported-by: Robert Haas <robertmhaas@gmail.com> Discussion: https://postgr.es/m/ed2296f1-1a6b-4932-b870-5bb18c2591ae%40oss.nttdata.com	2025-06-08 17:30:00 +09:00
Peter Geoghegan	e6eed40e44	Avoid BufferGetLSNAtomic() calls during nbtree scans. Delay calling BufferGetLSNAtomic() until we finish reading a page that actually contains items that btgettuple will return to the executor. This reduces the number of calls during plain index scans (we'll only call BufferGetLSNAtomic() when _bt_readpage returns true), and totally eliminates calls during index-only scans, bitmap index scans, and plain index scans of an unlogged relation. Currently, when checksums (or wal_log_hints) are enabled, acquiring a page's LSN in BufferGetLSNAtomic() involves locking the buffer header (which involves the use of spinlocks). Testing has shown that enabling page-level checksums causes large regressions with certain workloads, especially on larger multi-socket systems. The regression isn't tied to any Postgres 18 commit. However, Postgres 18 commit `04bec894` made initdb use checksums by default, so it seems prudent to address the problem now. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/941f0190-e3c6-4622-9ac7-c04e936e5fdb@vondra.me Discussion: https://postgr.es/m/CAH2-Wzk-Dg5XWs_jDuiHt4_7ryrSY+n=vxmHY51EVqPDFsKXmg@mail.gmail.com	2025-06-06 10:19:44 -04:00
Peter Eisentraut	32edf732e8	Rename gist stratnum support function Commit `7406ab623f` added a gist support function that we internally refer to by the symbol GIST_STRATNUM_PROC. This translated from "well-known" strategy numbers to opfamily-specific strategy numbers. However, we later (commit `630f9a43ce`) changed this to fit into index-AM-level compare type mapping, so this function actually now maps from compare type to opfamily-specific strategy numbers. So this name is no longer fitting. Moreover, the index AM level also supports the opposite, a function to map from strategy number to compare type. This is currently not supported in gist, but one might wonder what this function is supposed to be called when it is added. This patch changes the naming of the gist-level functionality to be more in line with the index-AM-level functionality. This makes sense because these are essentially the same thing on different levels. This also changes the names of the externally visible functions that are provided for use as such a support function. Reviewed-by: Paul A Jungwirth <pj@illuminatedcomputing.com> Discussion: https://www.postgresql.org/message-id/37ebb1d9-9036-485f-a215-e55435689917%40eisentraut.org	2025-06-02 08:41:27 +02:00
Etsuro Fujita	e5a3c9d9b5	postgres_fdw: Inherit the local transaction's access/deferrable modes. Previously, postgres_fdw always 1) opened a remote transaction in READ WRITE mode even when the local transaction was READ ONLY, causing a READ ONLY transaction using it that references a foreign table mapped to a remote view executing a volatile function to write in the remote side, and 2) opened the remote transaction in NOT DEFERRABLE mode even when the local transaction was DEFERRABLE, causing a SERIALIZABLE READ ONLY DEFERRABLE transaction using it to abort due to a serialization failure in the remote side. To avoid these, modify postgres_fdw to open a remote transaction in the same access/deferrable modes as the local transaction. This commit also modifies it to open a remote subtransaction in the same access mode as the local subtransaction. Although these issues exist since the introduction of postgres_fdw, there have been no reports from the field. So it seems fine to just fix them in master only. Author: Etsuro Fujita <etsuro.fujita@gmail.com> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CAPmGK16n_hcUUWuOdmeUS%2Bw4Q6dZvTEDHb%3DOP%3D5JBzo-M3QmpQ%40mail.gmail.com	2025-06-01 17:30:00 +09:00
Michael Paquier	2c6469d4cd	Fix incorrect year in some copyright notices A couple of new files have been added in the tree with a copyright year of 2024 while we were already in 2025. These should be marked with 2025, so let's fix them. Reported-by: Shaik Mohammad Mujeeb <mujeeb.sk.dev@gmail.com> Discussion: https://postgr.es/m/CALa6HA4_Wu7-2PV0xv-Q84cT8eG7rTx6bdjUV0Pc=McAwkNMfQ@mail.gmail.com	2025-05-19 09:46:52 +09:00
Michael Paquier	88e947136b	Fix typos and grammar in the code The large majority of these have been introduced by recent commits done in the v18 development cycle. Author: Alexander Lakhin <exclusion@gmail.com> Discussion: https://postgr.es/m/9a7763ab-5252-429d-a943-b28941e0e28b@gmail.com	2025-04-19 19:17:42 +09:00
Álvaro Herrera	a379061a22	Allow NOT NULL constraints to be added as NOT VALID This allows them to be added without scanning the table, and validating them afterwards without holding access exclusive lock on the table after any violating rows have been deleted or fixed. Doing ALTER TABLE ... SET NOT NULL for a column that has an invalid not-null constraint validates that constraint. ALTER TABLE .. VALIDATE CONSTRAINT is also supported. There are various checks on whether an invalid constraint is allowed in a child table when the parent table has a valid constraint; this should match what we do for enforced/not enforced constraints. pg_attribute.attnotnull is now only an indicator for whether a not-null constraint exists for the column; whether it's valid or invalid must be queried in pg_constraint. Applications can continue to query pg_attribute.attnotnull as before, but now it's possible that NULL rows are present in the column even when that's set to true. For backend internal purposes, we cache the nullability status in CompactAttribute->attnullability that each tuple descriptor carries (replacing CompactAttribute.attnotnull, which was a mirror of Form_pg_attribute.attnotnull). During the initial tuple descriptor creation, based on the pg_attribute scan, we set this to UNRESTRICTED if pg_attribute.attnotnull is false, or to UNKNOWN if it's true; then we update the latter to VALID or INVALID depending on the pg_constraint scan. This flag is also copied when tupledescs are copied. Comparing tuple descs for equality must also compare the CompactAttribute.attnullability flag and return false in case of a mismatch. pg_dump deals with these constraints by storing the OIDs of invalid not-null constraints in a separate array, and running a query to obtain their properties. The regular table creation SQL omits them entirely. They are then dealt with in the same way as "separate" CHECK constraints, and dumped after the data has been loaded. Because no additional pg_dump infrastructure was required, we don't bump its version number. I decided not to bump catversion either, because the old catalog state works perfectly in the new world. (Trying to run with new catalog state and the old server version would likely run into issues, however.) System catalogs do not support invalid not-null constraints (because commit `14e87ffa5c` didn't allow them to have pg_constraint rows anyway.) Author: Rushabh Lathia <rushabh.lathia@gmail.com> Author: Jian He <jian.universality@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org> Tested-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Discussion: https://postgr.es/m/CAGPqQf0KitkNack4F5CFkFi-9Dqvp29Ro=EpcWt=4_hs-Rt+bQ@mail.gmail.com	2025-04-07 19:19:50 +02:00
Peter Geoghegan	21a152b37f	Improve nbtree skip scan primitive scan scheduling. Don't allow nbtree scans with skip arrays to end any primitive scan on its first leaf page without giving some consideration to how many times the scan's arrays advanced while changing at least one skip array (though continue not caring about the number of array advancements that only affected SAOP arrays, even during skip scans with SAOP arrays). Now when a scan performs more than 3 such array advancements in the course of reading a single leaf page, it is taken as a signal that the next page is unlikely to be skippable. We'll therefore continue the ongoing primitive index scan, at least until we can perform a recheck against the next page's finaltup. Testing has shown that this new heuristic occasionally makes all the difference with skip scans that were expected to rely on the "passed first page" heuristic added by commit `9a2e2a28`. Without it, there is a remaining risk that certain kinds of skip scans will never quite manage to clear the initial hurdle of performing a primitive scan that lasts beyond its first leaf page (or that such a skip scan will only clear that initial hurdle when it has already wasted noticeably-many cycles due to inefficient primitive scan scheduling). Follow-up to commits `92fe23d9` and `9a2e2a28`. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/CAH2-Wz=RVdG3zWytFWBsyW7fWH7zveFvTHed5JKEsuTT0RCO_A@mail.gmail.com	2025-04-04 13:58:05 -04:00
Peter Geoghegan	8a510275dd	Further optimize nbtree search scan key comparisons. Postgres 17 commit `e0b1ee17` added two complementary optimizations to nbtree: the "prechecked" and "firstmatch" optimizations. _bt_readpage was made to avoid needlessly evaluating keys that are guaranteed to be satisfied by applying page-level context. "prechecked" did this for keys required in the current scan direction, while "firstmatch" did it for keys required in the opposite-to-scan direction only. The "prechecked" design had a number of notable issues. It didn't account for the fact that an = array scan key's sk_argument field might need to advance at the point of the page precheck (it didn't check the precheck tuple against the key's array, only the key's sk_argument, which needlessly made it ineffective in cases involving stepping to a page having advanced the scan's arrays using a truncated high key). "prechecked" was also completely ineffective when only one scan key wasn't guaranteed to be satisfied by every tuple (it didn't recognize that it was still safe to avoid evaluating other, earlier keys). The "firstmatch" optimization had similar limitations. It could only be applied after _bt_readpage found its first matching tuple, regardless of why any earlier tuples failed to satisfy the scan's index quals. This allowed unsatisfied non-required scan keys to impede the optimization. Replace both optimizations with a new optimization, without any of these limitations: the "startikey" optimization. Affected _bt_readpage calls generate a page-level key offset ("startikey"), that their _bt_checkkeys calls can then start at. This is an offset to the first key that isn't known to be satisfied by every tuple on the page. Although this is independently useful work, its main goal is to avoid performance regressions with index scans that use skip arrays, but still never manage to skip over irrelevant leaf pages. We must avoid wasting CPU cycles on overly granular skip array maintenance in these cases. The new "startikey" optimization helps with this by selectively disabling array maintenance for the duration of a _bt_readpage call. This has no lasting consequences for the scan's array keys (they'll still reliably track the scan's progress through the index's key space whenever the scan is "between pages"). Skip scan adds skip arrays during preprocessing using simple, static rules, and decides how best to navigate/apply the scan's skip arrays dynamically, at runtime. The "startikey" optimization enables this approach. As a result of all this, the planner doesn't need to generate distinct, competing index paths (one path for skip scan, another for an equivalent traditional full index scan). The overall effect is to make scan runtime close to optimal, even when the planner works off an incorrect cardinality estimate. Scans will also perform well given a skipped column with data skew: individual groups of pages with many distinct values (in respect of a skipped column) can be read about as efficiently as before -- without the scan being forced to give up on skipping over other groups of pages that are provably irrelevant. Many scans that cannot possibly skip will still benefit from the use of skip arrays, since they'll allow the "startikey" optimization to be as effective as possible (by allowing preprocessing to mark all the scan's keys as required). A scan that uses a skip array on "a" for a qual "WHERE a BETWEEN 0 AND 1_000_000 AND b = 42" is often much faster now, even when every tuple read by the scan has its own distinct "a" value. However, there are still some remaining regressions, affecting certain trickier cases. Scans whose index quals have several range skip arrays, each on some high cardinality column, can still be slower than they were before the introduction of skip scan -- even with the new "startikey" optimization. There are also known regressions affecting very selective index scans that use a skip array. The underlying issue with such selective scans is that they never get as far as reading a second leaf page, and so will never get a chance to consider applying the "startikey" optimization. In principle, all regressions could be avoided by teaching preprocessing to not add skip arrays whenever they aren't expected to help, but it seems best to err on the side of robust performance. Follow-up to commit `92fe23d9`, which added nbtree skip scan. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Heikki Linnakangas <heikki.linnakangas@iki.fi> Reviewed-By: Masahiro Ikeda <ikedamsh@oss.nttdata.com> Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/CAH2-Wz=Y93jf5WjoOsN=xvqpMjRy-bxCE037bVFi-EasrpeUJA@mail.gmail.com Discussion: https://postgr.es/m/CAH2-WznWDK45JfNPNvDxh6RQy-TaCwULaM5u5ALMXbjLBMcugQ@mail.gmail.com	2025-04-04 12:27:52 -04:00
Peter Geoghegan	92fe23d93a	Add nbtree skip scan optimization. Teach nbtree multi-column index scans to opportunistically skip over irrelevant sections of the index given a query with no "=" conditions on one or more prefix index columns. When nbtree is passed input scan keys derived from a predicate "WHERE b = 5", new nbtree preprocessing steps output "WHERE a = ANY(<every possible 'a' value>) AND b = 5" scan keys. That is, preprocessing generates a "skip array" (and an output scan key) for the omitted prefix column "a", which makes it safe to mark the scan key on "b" as required to continue the scan. The scan is therefore able to repeatedly reposition itself by applying both the "a" and "b" keys. A skip array has "elements" that are generated procedurally and on demand, but otherwise works just like a regular ScalarArrayOp array. Preprocessing can freely add a skip array before or after any input ScalarArrayOp arrays. Index scans with a skip array decide when and where to reposition the scan using the same approach as any other scan with array keys. This design builds on the design for array advancement and primitive scan scheduling added to Postgres 17 by commit `5bf748b8`. Testing has shown that skip scans of an index with a low cardinality skipped prefix column can be multiple orders of magnitude faster than an equivalent full index scan (or sequential scan). In general, the cardinality of the scan's skipped column(s) limits the number of leaf pages that can be skipped over. The core B-Tree operator classes on most discrete types generate their array elements with the help of their own custom skip support routine. This infrastructure gives nbtree a way to generate the next required array element by incrementing (or decrementing) the current array value. It can reduce the number of index descents in cases where the next possible indexable value frequently turns out to be the next value stored in the index. Opclasses that lack a skip support routine fall back on having nbtree "increment" (or "decrement") a skip array's current element by setting the NEXT (or PRIOR) scan key flag, without directly changing the scan key's sk_argument. These sentinel values behave just like any other value from an array -- though they can never locate equal index tuples (they can only locate the next group of index tuples containing the next set of non-sentinel values that the scan's arrays need to advance to). A skip array's range is constrained by "contradictory" inequality keys. For example, a skip array on "x" will only generate the values 1 and 2 given a qual such as "WHERE x BETWEEN 1 AND 2 AND y = 66". Such a skip array qual usually has near-identical performance characteristics to a comparable SAOP qual "WHERE x = ANY('{1, 2}') AND y = 66". However, improved performance isn't guaranteed. Much depends on physical index characteristics. B-Tree preprocessing is optimistic about skipping working out: it applies static, generic rules when determining where to generate skip arrays, which assumes that the runtime overhead of maintaining skip arrays will pay for itself -- or lead to only a modest performance loss. As things stand, these assumptions are much too optimistic: skip array maintenance will lead to unacceptable regressions with unsympathetic queries (queries whose scan can't skip over many irrelevant leaf pages). An upcoming commit will address the problems in this area by enhancing _bt_readpage's approach to saving cycles on scan key evaluation, making it work in a way that directly considers the needs of = array keys (particularly = skip array keys). Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Masahiro Ikeda <masahiro.ikeda@nttdata.com> Reviewed-By: Heikki Linnakangas <heikki.linnakangas@iki.fi> Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-By: Tomas Vondra <tomas@vondra.me> Reviewed-By: Aleksander Alekseev <aleksander@timescale.com> Reviewed-By: Alena Rybakina <a.rybakina@postgrespro.ru> Discussion: https://postgr.es/m/CAH2-Wzmn1YsLzOGgjAQZdn1STSG_y8qP__vggTaPAYXJP+G4bw@mail.gmail.com	2025-04-04 12:27:04 -04:00
Andres Freund	459e7bf8e2	Remove HeapBitmapScan's skip_fetch optimization The optimization does not take the removal of TIDs by a concurrent vacuum into account. The concurrent vacuum can remove dead TIDs and make pages ALL_VISIBLE while those dead TIDs are referenced in the bitmap. This can lead to a skip_fetch scan returning too many tuples. It likely would be possible to implement this optimization safely, but we don't have the necessary infrastructure in place. Nor is it clear that it's worth building that infrastructure, given how limited the skip_fetch optimization is. In the backbranches we just disable the optimization by always passing need_tuples=true to table_beginscan_bm(). We can't perform API/ABI changes in the backbranches and we want to make the change as minimal as possible. Author: Matthias van de Meent <boekewurm+postgres@gmail.com> Reported-By: Konstantin Knizhnik <knizhnik@garret.ru> Discussion: https://postgr.es/m/CAEze2Wg3gXXZTr6_rwC+s4-o2ZVFB5F985uUSgJTsECx6AmGcQ@mail.gmail.com Backpatch-through: 13	2025-04-02 14:54:20 -04:00
Jeff Davis	a0942f441e	Add ExecCopySlotMinimalTupleExtra(). Allows an "extra" argument that allocates extra memory at the end of the MinimalTuple. This is important for callers that need to store additional data, but do not want to perform an additional allocation. Suggested-by: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/CAApHDvppeqw2pNM-+ahBOJwq2QmC0hOAGsmCpC89QVmEoOvsdg@mail.gmail.com	2025-03-24 22:05:53 -07:00
Nathan Bossart	7d559c8580	Expand comment for isset_offset. This field was added in commit `0164a0f9ee` to provide a way to determine whether a storage parameter was explicitly set for the relation or if it just picked up the default value. In most cases, this can be accomplished by giving the storage parameter a special out-of-range default value (e.g., the autovacuum_vacuum_insert_threshold storage parameter defaults to -2), but this approach doesn't work in all cases. For example, a Boolean storage parameter cannot be given an out-of-range default, so we need another way to discover the source of its value. Reported-by: "David G. Johnston" <david.g.johnston@gmail.com> Reviewed-by: "David G. Johnston" <david.g.johnston@gmail.com> Discussion: https://postgr.es/m/CAKFQuwYKtEUYKS%2B18gRs-xPhn0qOJgM2KGyyWVCODHuVn9F-XQ%40mail.gmail.com	2025-03-24 15:47:02 -05:00
Peter Geoghegan	9a2e2a285a	Improve nbtree array primitive scan scheduling. Add a new scheduling heuristic: don't end the ongoing primitive index scan immediately (at the point where _bt_advance_array_keys notices that the next set of matching tuples must be on a later page) if the primscan already managed to step right/left from its first leaf page. Schedule a recheck against the next sibling leaf page's finaltup instead. The new heuristic tends to avoid scenarios where the top-level scan repeatedly starts and ends primitive index scans that each read only one leaf page from a group of neighboring leaf pages. Affected top-level scans will now tend to step forward (or backward) through the index instead, without wasting cycles on descending the index anew. The recheck mechanism isn't exactly new. But up until now it has only been used to deal with edge cases involving high key finaltups with one or more truncated -inf attributes that _bt_advance_array_keys deemed "provisionally satisfied" (satisfied for the purposes of allowing the scan to step onto the next page, subject to recheck once on that page). The mechanism was added by commit `5bf748b8`, which invented the general concept of primitive scan scheduling. It was later enhanced by commit `79fa7b3b`, which taught it about cases involving -inf attributes that satisfy inequality scan keys required in the opposite-to-scan direction only (arguably, they should have been covered by the earliest version). Now the recheck mechanism can be applied based on scan-level heuristics, which have nothing to do with truncated high keys. Now rechecks might be performed by _bt_readpage when scanning in _either_ scan direction. The theory behind the new heuristic is that any primitive scan that makes it past its first leaf page is one that is already likely to have arrays whose key values match index tuples that are closely clustered together in the index. The rules that determine whether we ever get past the first page are still conservative (that'll still only happen when pstate.finaltup strongly suggests that it's the right thing to do). Surviving past the first leaf page is a strong signal in itself. Preparation for an upcoming patch that will add skip scan optimizations to nbtree. That'll work by adding skip arrays, which behave similarly to SAOP arrays, but generate their elements procedurally and on-demand. Note that this commit isn't specifically concerned with skip arrays; the scheduling logic doesn't (and won't) condition anything on whether the scan uses skip arrays, SAOP arrays, or some combination of the two (which seems like a good general principle for _bt_advance_array_keys). While the problems that this commit ameliorates are more likely with skip arrays (at least in practice), SAOP arrays (or those with very dense, contiguous array elements) are also affected. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/CAH2-Wzkz0wPe6+02kr+hC+JJNKfGtjGTzpG3CFVTQmKwWNrXNw@mail.gmail.com	2025-03-22 13:02:18 -04:00
Nathan Bossart	0164a0f9ee	Add vacuum_truncate configuration parameter. This new parameter works just like the storage parameter of the same name: if set to true (which is the default), autovacuum and VACUUM attempt to truncate any empty pages at the end of the table. It is primarily intended to help users avoid locking issues on hot standbys. The setting can be overridden with the storage parameter or VACUUM's TRUNCATE option. Since there's presently no way to determine whether a Boolean storage parameter is explicitly set or has just picked up the default value, this commit also introduces an isset_offset member to relopt_parse_elt. Suggested-by: Will Storey <will@summercat.com> Author: Nathan Bossart <nathandbossart@gmail.com> Co-authored-by: Gurjeet Singh <gurjeet@singh.im> Reviewed-by: Laurenz Albe <laurenz.albe@cybertec.at> Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com> Reviewed-by: Robert Treat <rob@xzilla.net> Discussion: https://postgr.es/m/Z2DE4lDX4tHqNGZt%40dev.null	2025-03-20 10:16:50 -05:00
Melanie Plageman	c3953226a0	Remove table AM callback scan_bitmap_next_block After pushing the bitmap iterator into table-AM specific code (as part of making bitmap heap scan use the read stream API in `2b73a8cd33`), scan_bitmap_next_block() no longer returns the current block number. Since scan_bitmap_next_block() isn't returning any relevant information to bitmap table scan code, it makes more sense to get rid of it. Now, bitmap table scan code only calls table_scan_bitmap_next_tuple(), and the heap AM implementation of scan_bitmap_next_block() is a local helper in heapam_handler.c. Reviewed-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/flat/CAAKRu_ZwCwWFeL_H3ia26bP2e7HiKLWt0ZmGXPVwPO6uXq0vaA%40mail.gmail.com	2025-03-15 10:37:46 -04:00
Melanie Plageman	2b73a8cd33	BitmapHeapScan uses the read stream API Make Bitmap Heap Scan use the read stream API instead of invoking ReadBuffer() for each block indicated by the bitmap. The read stream API handles prefetching, so remove all of the explicit prefetching from bitmap heap scan code. Now, heap table AM implements a read stream callback which uses the bitmap iterator to return the next required block to the read stream code. Tomas Vondra conducted extensive regression testing of this feature. Andres Freund, Thomas Munro, and I analyzed regressions and Thomas Munro patched the read stream API. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Tomas Vondra <tomas@vondra.me> Tested-by: Tomas Vondra <tomas@vondra.me> Tested-by: Andres Freund <andres@anarazel.de> Tested-by: Thomas Munro <thomas.munro@gmail.com> Tested-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Discussion: https://postgr.es/m/flat/CAAKRu_ZwCwWFeL_H3ia26bP2e7HiKLWt0ZmGXPVwPO6uXq0vaA%40mail.gmail.com	2025-03-15 10:34:42 -04:00
Melanie Plageman	944e81bf99	Separate TBM[Shared\|Private]Iterator and TBMIterateResult Remove the TBMIterateResult member from the TBMPrivateIterator and TBMSharedIterator and make tbm_[shared\|private_]iterate() take a TBMIterateResult as a parameter. This allows tidbitmap API users to manage multiple TBMIterateResults per scan. This is required for bitmap heap scan to use the read stream API, with which there may be multiple I/Os in flight at once, each one with a TBMIterateResult. Reviewed-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/d4bb26c9-fe07-439e-ac53-c0e244387e01%40vondra.me	2025-03-15 10:11:19 -04:00
Peter Geoghegan	426ea61117	nbtree: Make BTMaxItemSize into object-like macro. Make nbtree's "1/3 of a page limit" BTMaxItemSize function-like macro (which accepts a "page" argument) into an object-like macro that can be used from code that doesn't have convenient access to an nbtree page. Preparation for an upcoming patch that adds skip scan to nbtree. Parallel index scans that use skip scan will serialize datums (not just SAOP array subscripts) when scheduling primitive scans. BTMaxItemSize will be used by btestimateparallelscan to determine how much DSM to request. Author: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/CAH2-Wz=H_RG5weNGeUG_TkK87tRBnH9mGCQj6WpM4V4FNWKv2g@mail.gmail.com	2025-03-11 10:35:56 -04:00
Peter Geoghegan	0fbceae841	Show index search count in EXPLAIN ANALYZE, take 2. Expose the count of index searches/index descents in EXPLAIN ANALYZE's output for index scan/index-only scan/bitmap index scan nodes. This information is particularly useful with scans that use ScalarArrayOp quals, where the number of index searches can be unpredictable due to implementation details that interact with physical index characteristics (at least with nbtree SAOP scans, since Postgres 17 commit `5bf748b8`). The information shown also provides useful context when EXPLAIN ANALYZE runs a plan with an index scan node that successfully applied the skip scan optimization (set to be added to nbtree by an upcoming patch). The instrumentation works by teaching all index AMs to increment a new nsearches counter whenever a new index search begins. The counter is incremented at exactly the same point that index AMs already increment the pg_stat_*_indexes.idx_scan counter (we're counting the same event, but at the scan level rather than the relation level). Parallel queries have workers copy their local counter struct into shared memory when an index scan node ends -- even when it isn't a parallel aware scan node. An earlier version of this patch that only worked with parallel aware scans became commit `5ead85fb` (though that was quickly reverted by commit `d00107cd` following "debug_parallel_query=regress" buildfarm failures). Our approach doesn't match the approach used when tracking other index scan related costs (e.g., "Rows Removed by Filter:"). It is comparable to the approach used in similar cases involving costs that are only readily accessible inside an access method, not from the executor proper (e.g., "Heap Blocks:" output for a Bitmap Heap Scan, which was recently enhanced to show per-worker costs by commit `5a1e6df3`, using essentially the same scheme as the one used here). It is necessary for index AMs to have direct responsibility for maintaining the new counter, since the counter might need to be incremented multiple times per amgettuple call (or per amgetbitmap call). But it is also necessary for the executor proper to manage the shared memory now used to transfer each worker's counter struct to the leader. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Robert Haas <robertmhaas@gmail.com> Reviewed-By: Tomas Vondra <tomas@vondra.me> Reviewed-By: Masahiro Ikeda <ikedamsh@oss.nttdata.com> Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/CAH2-WzkRqvaqR2CTNqTZP0z6FuL4-3ED6eQB0yx38XBNj1v-4Q@mail.gmail.com Discussion: https://postgr.es/m/CAH2-Wz=PKR6rB7qbx+Vnd7eqeB5VTcrW=iJvAsTsKbdG+kW_UA@mail.gmail.com	2025-03-11 09:20:50 -04:00
Peter Eisentraut	af4002b381	Rename amcancrosscompare After more discussion about commit `ce62f2f2a0`, rename the index AM property amcancrosscompare to two separate properties amconsistentequality and amconsistentordering. Also improve the documentation and update some comments that were previously missed. Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://www.postgresql.org/message-id/flat/E1tngY6-0000UL-2n%40gemulon.postgresql.org	2025-03-07 11:46:33 +01:00
Peter Geoghegan	d00107cd63	Revert "Show index search count in EXPLAIN ANALYZE." This reverts commit `5ead85fbc8`. This commit shows test failures with debug_parallel_query=regress. The underlying issue needs to be debugged, so revert for now.	2025-03-05 10:27:31 -05:00
Peter Geoghegan	5ead85fbc8	Show index search count in EXPLAIN ANALYZE. Expose the count of index searches/index descents in EXPLAIN ANALYZE's output for index scan nodes. This information is particularly useful with scans that use ScalarArrayOp quals, where the number of index scans isn't predictable in advance (at least not with optimizations like the one added to nbtree by Postgres 17 commit `5bf748b8`). It will also be useful when EXPLAIN ANALYZE shows details of an nbtree index scan that uses skip scan optimizations set to be introduced by an upcoming patch. The instrumentation works by teaching index AMs to increment a new nsearches counter whenever a new index search begins. The counter is incremented at exactly the same point that index AMs must already increment the index's pg_stat_*_indexes.idx_scan counter (we're counting the same event, but at the scan level rather than the relation level). The new counter is stored in the scan descriptor (IndexScanDescData), which explain.c reaches by going through the scan node's PlanState. This approach doesn't match the approach used when tracking other index scan specific costs (e.g., "Rows Removed by Filter:"). It is similar to the approach used in other cases where we must track costs that are only readily accessible inside an access method, and not from the executor (e.g., "Heap Blocks:" output for a Bitmap Heap Scan). It is inherently necessary to maintain a counter that can be incremented multiple times during a single amgettuple call (or amgetbitmap call), and directly exposing PlanState.instrument to index access methods seems unappealing. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Tomas Vondra <tomas@vondra.me> Reviewed-By: Robert Haas <robertmhaas@gmail.com> Reviewed-By: Masahiro Ikeda <ikedamsh@oss.nttdata.com> Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/CAH2-Wz=PKR6rB7qbx+Vnd7eqeB5VTcrW=iJvAsTsKbdG+kW_UA@mail.gmail.com Discussion: https://postgr.es/m/CAH2-WzkRqvaqR2CTNqTZP0z6FuL4-3ED6eQB0yx38XBNj1v-4Q@mail.gmail.com	2025-03-05 09:36:48 -05:00
Heikki Linnakangas	635f580120	Rename some signal and interrupt handling functions for consistency The usual pattern for handling a signal is that the signal handler sets a flag and calls SetLatch(MyLatch), and CHECK_FOR_INTERRUPTS() or other code that is part of a wait loop calls another function to deal with it. The naming of the functions involved was a bit inconsistent, however. CHECK_FOR_INTERRUPTS() calls ProcessInterrupts() to do the heavy-lifting, but the analogous functions in aux processes were called HandleMainLoopInterrupts(), HandleStartupProcInterrupts(), etc. Similarly, most subroutines of ProcessInterrupts() were called Process(), but some were called Handle(). To make things less confusing, rename all the functions that are part of the overall signal/interrupt handling system but are not executed in a signal handler to e.g. ProcessSomething(), rather than HandleSomething(). The "Process" prefix is now consistently used in the non-signal-handler functions, and the "Handle" prefix in functions that are part of signal handlers, except for some completely unrelated functions that clearly have nothing to do with signal or interrupt handling. Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> Discussion: https://www.postgresql.org/message-id/8a384b26-1499-41f6-be33-64b801fb98b8@iki.fi	2025-03-05 16:22:26 +02:00
Tomas Vondra	8492feb98f	Allow parallel CREATE INDEX for GIN indexes Allow using parallel workers to build a GIN index, similarly to BTREE and BRIN. For large tables this may result in significant speedup when the build is CPU-bound. The work is divided so that each worker builds index entries on a subset of the table, determined by the regular parallel scan used to read the data. Each worker uses a local tuplesort to sort and merge the entries for the same key. The TID lists do not overlap (for a given key), which means the merge sort simply concatenates the two lists. The merged entries are written into a shared tuplesort for the leader. The leader needs to merge the sorted entries again, before writing them into the index. But this way a significant part of the work happens in the workers, and the leader is left with merging fewer large entries, which is more efficient. Most of the parallelism infrastructure is a simplified copy of the code used by BTREE indexes, omitting the parts irrelevant for GIN indexes (e.g. uniqueness checks). Original patch by me, with reviews and substantial improvements by Matthias van de Meent, certainly enough to make him a co-author. Author: Tomas Vondra, Matthias van de Meent Reviewed-by: Matthias van de Meent, Andy Fan, Kirill Reshke Discussion: https://postgr.es/m/6ab4003f-a8b8-4d75-a67f-f25ad98582dc%40enterprisedb.com	2025-03-03 16:53:06 +01:00
Peter Eisentraut	ce62f2f2a0	Generalize hash and ordering support in amapi Stop comparing access method OID values against HASH_AM_OID and BTREE_AM_OID, and instead check the IndexAmRoutine for an index to see if it advertises its ability to perform the necessary ordering, hashing, or cross-type comparing functionality. A field amcanorder already existed, this uses it more widely. Fields amcanhash and amcancrosscompare are added for the other purposes. Author: Mark Dilger <mark.dilger@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com	2025-02-27 17:03:31 +01:00
Michael Paquier	6c349d83b6	Re-add GUC track_wal_io_timing This commit is a rework of `2421e9a51d`, about which Andres Freund has raised some concerns as it is valuable to have both track_io_timing and track_wal_io_timing in some cases, as the WAL write and fsync paths can be a major bottleneck for some workloads. Hence, it can be relevant to not calculate the WAL timings in environments where pg_test_timing performs poorly while capturing some IO data under track_io_timing for the non-WAL IO paths. The opposite can be also true: it should be possible to disable the non-WAL timings and enable the WAL timings (the previous GUC setups allowed this possibility). track_wal_io_timing is added back in this commit, controlling if WAL timings should be calculated in pg_stat_io for the read, fsync and write paths, as done previously with pg_stat_wal. pg_stat_wal previously tracked only the sync and write parts (now removed), read stats is new data tracked in pg_stat_io, all three are aggregated if track_wal_io_timing is enabled. The read part matters during recovery or if a XLogReader is used. Extra note: more control over if the types of timings calculated in pg_stat_io could be done with a GUC that lists pairs of (IOObject,IOOp). Reported-by: Andres Freund <andres@anarazel.de> Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Co-authored-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/3opf2wh2oljco6ldyqf7ukabw3jijnnhno6fjb4mlu6civ5h24@fcwmhsgmlmzu	2025-02-26 09:49:59 +09:00

1 2 3 4 5 ...

1915 commits