postgresql

mirror of https://github.com/postgres/postgres.git synced 2026-03-13 14:14:09 -04:00

Author	SHA1	Message	Date
Andres Freund	ca3067cc57	aio: Change prefix of PgAioResultStatus values to PGAIO_RS_ The previous prefix wasn't consistent with the naming of other AIO related enum values. It seems best to rename it before the users are introduced. Reported-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/CAAKRu_Yb+JzQpNsgUxCB0gBi+sE-mi_HmcJF6ALnmO4W+UgwpA@mail.gmail.com	2025-03-22 17:30:44 -04:00
Peter Geoghegan	9a2e2a285a	Improve nbtree array primitive scan scheduling. Add a new scheduling heuristic: don't end the ongoing primitive index scan immediately (at the point where _bt_advance_array_keys notices that the next set of matching tuples must be on a later page) if the primscan already managed to step right/left from its first leaf page. Schedule a recheck against the next sibling leaf page's finaltup instead. The new heuristic tends to avoid scenarios where the top-level scan repeatedly starts and ends primitive index scans that each read only one leaf page from a group of neighboring leaf pages. Affected top-level scans will now tend to step forward (or backward) through the index instead, without wasting cycles on descending the index anew. The recheck mechanism isn't exactly new. But up until now it has only been used to deal with edge cases involving high key finaltups with one or more truncated -inf attributes that _bt_advance_array_keys deemed "provisionally satisfied" (satisfied for the purposes of allowing the scan to step onto the next page, subject to recheck once on that page). The mechanism was added by commit `5bf748b8`, which invented the general concept of primitive scan scheduling. It was later enhanced by commit `79fa7b3b`, which taught it about cases involving -inf attributes that satisfy inequality scan keys required in the opposite-to-scan direction only (arguably, they should have been covered by the earliest version). Now the recheck mechanism can be applied based on scan-level heuristics, which have nothing to do with truncated high keys. Now rechecks might be performed by _bt_readpage when scanning in _either_ scan direction. The theory behind the new heuristic is that any primitive scan that makes it past its first leaf page is one that is already likely to have arrays whose key values match index tuples that are closely clustered together in the index. The rules that determine whether we ever get past the first page are still conservative (that'll still only happen when pstate.finaltup strongly suggests that it's the right thing to do). Surviving past the first leaf page is a strong signal in itself. Preparation for an upcoming patch that will add skip scan optimizations to nbtree. That'll work by adding skip arrays, which behave similarly to SAOP arrays, but generate their elements procedurally and on-demand. Note that this commit isn't specifically concerned with skip arrays; the scheduling logic doesn't (and won't) condition anything on whether the scan uses skip arrays, SAOP arrays, or some combination of the two (which seems like a good general principle for _bt_advance_array_keys). While the problems that this commit ameliorates are more likely with skip arrays (at least in practice), SAOP arrays (or those with very dense, contiguous array elements) are also affected. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/CAH2-Wzkz0wPe6+02kr+hC+JJNKfGtjGTzpG3CFVTQmKwWNrXNw@mail.gmail.com	2025-03-22 13:02:18 -04:00
Masahiko Sawada	04ff636cbc	Add GUC option to control maximum active replication origins. This commit introduces a new GUC option max_active_replication_origins to control the maximum number of active replication origins. Previously, this was controlled by 'max_replication_slots'. Having a separate GUC option provides better flexibility for setting up subscribers, as they may not require replication slots (for cascading replication) but always require replication origins. Author: Euler Taveira <euler@eulerto.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Reviewed-by: vignesh C <vignesh21@gmail.com> Discussion: https://postgr.es/m/b81db436-8262-4575-b7c4-bc0c1551000b@app.fastmail.com	2025-03-21 12:20:15 -07:00
Tom Lane	0e032a2240	Place "extern" declaration in the right part of pg_class.h. errdetail_relkind_not_supported() was declared within EXPOSE_TO_CLIENT_CODE, which is mistaken since that function isn't available client-side. While relatively harmless, this isn't good precedent. Discussion: https://postgr.es/m/1134562.1742507765@sss.pgh.pa.us	2025-03-21 15:14:15 -04:00
Thomas Munro	ce1a75c4fe	Support buffer forwarding in StartReadBuffers(). StartReadBuffers() reports a short read when it finds a cached block that ends a range needing I/O by updating the caller's nblocks. It doesn't want to have to unpin the trailing hit that it knows the caller wants, so the v17 version used sleight of hand in the name of simplicity: it included it in nblocks as if it were part of the I/O, but internally tracked the shorter real I/O size in io_buffers_len (now removed). This API change "forwards" the delimiting buffer to the next call. It's still pinned, and still stored in the caller's array, but *nblocks no longer includes stray buffers that are not really part of the operation. The expectation is that the caller still wants the rest of the blocks and will call again starting from that point, and now it can pass the already pinned buffer back in (or choose not to and release it). The change is needed for the coming asynchronous I/O version's larger version of the problem: by definition it must move BM_IO_IN_PROGRESS negotiation from WaitReadBuffers() to StartReadBuffers(), but it might already have many buffers pinned before it discovers a need to split an I/O. (The current synchronous I/O version hides that detail from callers by looping over smaller reads if required to make all covered buffers valid in WaitReadBuffers(), so it looks like one operation but it might occasionally be several under the covers.) Aside from avoiding unnecessary pin traffic, this will also be important for later work on out-of-order streams: you can't prioritize data that is already available right now if that fact is hidden from you. The new API is natural for read_stream.c (see `ed0b87ca`). After a short read it leaves forwarded buffers where they fell in its circular queue for the continuing call to pick up. Single-block StartReadBuffer() and traditional ReadBuffer() share code but are not affected by the change. They don't do multi-block I/O. Reviewed-by: Andres Freund <andres@anarazel.de> (earlier versions) Discussion: https://postgr.es/m/CA%2BhUKGK_%3D4CVmMHvsHjOVrK6t4F%3DLBpFzsrr3R%2BaJYN8kcTfWg%40mail.gmail.com	2025-03-21 20:43:59 +13:00
Robert Haas	50ba65e733	Add an additional hook for EXPLAIN option validation. Commit `c65bc2e1d1` made it possible for loadable modules to add EXPLAIN options. Normally, any necessary validation can be performed by the hook function passed to RegisterExtensionExplainOption, but if a loadable module wants to sanity check options against each other, that needs to be done after the entire options list has been processed. So, add an additional hook for that purpose. Author: Sami Imseih <samimseih@gmail.com> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Andrei Lepikhov <lepihov@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: http://postgr.es/m/CAA5RZ0vOcJF91O2e5AQN+V6guMNLMhJx83dxALf-iUZ-hLGO_Q@mail.gmail.com	2025-03-20 13:47:55 -04:00
Nathan Bossart	0164a0f9ee	Add vacuum_truncate configuration parameter. This new parameter works just like the storage parameter of the same name: if set to true (which is the default), autovacuum and VACUUM attempt to truncate any empty pages at the end of the table. It is primarily intended to help users avoid locking issues on hot standbys. The setting can be overridden with the storage parameter or VACUUM's TRUNCATE option. Since there's presently no way to determine whether a Boolean storage parameter is explicitly set or has just picked up the default value, this commit also introduces an isset_offset member to relopt_parse_elt. Suggested-by: Will Storey <will@summercat.com> Author: Nathan Bossart <nathandbossart@gmail.com> Co-authored-by: Gurjeet Singh <gurjeet@singh.im> Reviewed-by: Laurenz Albe <laurenz.albe@cybertec.at> Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com> Reviewed-by: Robert Treat <rob@xzilla.net> Discussion: https://postgr.es/m/Z2DE4lDX4tHqNGZt%40dev.null	2025-03-20 10:16:50 -05:00
Peter Eisentraut	4f7f7b0375	extension_control_path The new GUC extension_control_path specifies a path to look for extension control files. The default value is $system, which looks in the compiled-in location, as before. The path search uses the same code and works in the same way as dynamic_library_path. Some use cases of this are: (1) testing extensions during package builds, (2) installing extensions outside security-restricted containers like Python.app (on macOS), (3) adding extensions to PostgreSQL running in a Kubernetes environment using operators such as CloudNativePG without having to rebuild the base image for each new extension. There is also a tweak in Makefile.global so that it is possible to install extensions using PGXS into an different directory than the default, using 'make install prefix=/else/where'. This previously only worked when specifying the subdirectories, like 'make install datadir=/else/where/share pkglibdir=/else/where/lib', for purely implementation reasons. (Of course, without the path feature, installing elsewhere was rarely useful.) Author: Peter Eisentraut <peter@eisentraut.org> Co-authored-by: Matheus Alcantara <matheusssilv97@gmail.com> Reviewed-by: David E. Wheeler <david@justatheory.com> Reviewed-by: Gabriele Bartolini <gabriele.bartolini@enterprisedb.com> Reviewed-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com> Reviewed-by: Niccolò Fei <niccolo.fei@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/flat/E7C7BFFB-8857-48D4-A71F-88B359FADCFD@justatheory.com	2025-03-19 07:03:20 +01:00
Amit Langote	28317de723	Ensure first ModifyTable rel initialized if all are pruned Commit `cbc127917e` introduced tracking of unpruned relids to avoid processing pruned relations, and changed ExecInitModifyTable() to initialize only unpruned result relations. As a result, MERGE statements that prune all target partitions can now lead to crashes or incorrect behavior during execution. The crash occurs because some executor code paths rely on ModifyTableState.resultRelInfo[0] being present and initialized, even when no result relations remain after pruning. For example, ExecMerge() and ExecMergeNotMatched() use the first resultRelInfo to determine the appropriate action. Similarly, ExecInitPartitionInfo() assumes that at least one result relation exists. To preserve these assumptions, ExecInitModifyTable() now includes the first result relation in the initialized result relation list if all result relations for that ModifyTable were pruned. To enable that, ExecDoInitialPruning() ensures the first relation is locked if it was pruned and locking is necessary. To support this exception to the pruning logic, PlannedStmt now includes a list of RT indexes identifying the first result relation of each ModifyTable node in the plan. This allows ExecDoInitialPruning() to check whether each such relation was pruned and, if so, lock it if necessary. Bug: #18830 Reported-by: Robins Tharakan <tharakan@gmail.com> Diagnozed-by: Tender Wang <tndrwang@gmail.com> Diagnozed-by: Dean Rasheed <dean.a.rasheed@gmail.com> Co-authored-by: Dean Rasheed <dean.a.rasheed@gmail.com> Reviewed-by: Tender Wang <tndrwang@gmail.com> Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com> Discussion: https://postgr.es/m/18830-1f31ea1dc930d444%40postgresql.org	2025-03-19 12:14:24 +09:00
Thomas Munro	06fb5612c9	Increase io_combine_limit range to 1MB. The default of 128kB is unchanged, but the upper limit is changed from 32 blocks to 128 blocks, unless the operating system's IOV_MAX is too low. Some other RDBMSes seem to cap their multi-block buffer pool I/O around this number, and it seems useful to allow experimentation. The concrete change is to our definition of PG_IOV_MAX, which provides the maximum for io_combine_limit and io_max_combine_limit. It also affects a couple of other places that work with arrays of struct iovec or smaller objects on the stack, so we still don't want to use the system IOV_MAX directly without a clamp: it is not under our control and likely to be 1024. 128 seems acceptable for our current usage. For Windows, we can't use real scatter/gather yet, so we continue to define our own IOV_MAX value of 16 and emulate preadv()/pwritev() with loops. Someone would need to research the trade-offs of raising that number. NB if trying to see this working: you might temporarily need to hack BAS_BULKREAD to be bigger, since otherwise the obvious way of "a very big SELECT" is limited by that for now. Suggested-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/CA%2BhUKG%2B2T9p-%2BzM6Eeou-RAJjTML6eit1qn26f9twznX59qtCA%40mail.gmail.com	2025-03-19 15:40:35 +13:00
Thomas Munro	10f6646847	Introduce io_max_combine_limit. The existing io_combine_limit can be changed by users. The new io_max_combine_limit is fixed at server startup time, and functions as a silent clamp on the user setting. That in itself is probably quite useful, but the primary motivation is: aio_init.c allocates shared memory for all asynchronous IOs including some per-block data, and we didn't want to waste memory you'd never used by assuming they could be up to PG_IOV_MAX. This commit already halves the size of 'AioHandleIov' and 'AioHandleData'. A follow-up commit can now expand PG_IOV_MAX without affecting that. Since our GUC system doesn't support dependencies or cross-checks between GUCs, the user-settable one now assigns a "raw" value to io_combine_limit_guc, and the lower of io_combine_limit_guc and io_max_combine_limit is maintained in io_combine_limit. Reviewed-by: Andres Freund <andres@anarazel.de> (earlier version) Discussion: https://postgr.es/m/CA%2BhUKG%2B2T9p-%2BzM6Eeou-RAJjTML6eit1qn26f9twznX59qtCA%40mail.gmail.com	2025-03-19 15:23:54 +13:00
Álvaro Herrera	62d712ecfd	Introduce squashing of constant lists in query jumbling pg_stat_statements produces multiple entries for queries like SELECT something FROM table WHERE col IN (1, 2, 3, ...) depending on the number of parameters, because every element of ArrayExpr is individually jumbled. Most of the time that's undesirable, especially if the list becomes too large. Fix this by introducing a new GUC query_id_squash_values which modifies the node jumbling code to only consider the first and last element of a list of constants, rather than each list element individually. This affects both the query_id generated by query jumbling, as well as pg_stat_statements query normalization so that it suppresses printing of the individual elements of such a list. The default value is off, meaning the previous behavior is maintained. Author: Dmitry Dolgov <9erthalion6@gmail.com> Reviewed-by: Sergey Dudoladov (mysterious, off-list) Reviewed-by: David Geier <geidav.pg@gmail.com> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org> Reviewed-by: Sami Imseih <samimseih@gmail.com> Reviewed-by: Sutou Kouhei <kou@clear-code.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Marcos Pegoraro <marcos@f10.com.br> Reviewed-by: Julien Rouhaud <rjuju123@gmail.com> Reviewed-by: Zhihong Yu <zyu@yugabyte.com> Tested-by: Yasuo Honda <yasuo.honda@gmail.com> Tested-by: Sergei Kornilov <sk@zsrv.org> Tested-by: Maciek Sakrejda <m.sakrejda@gmail.com> Tested-by: Chengxi Sun <sunchengxi@highgo.com> Tested-by: Jakub Wartak <jakub.wartak@enterprisedb.com> Discussion: https://postgr.es/m/CA+q6zcWtUbT_Sxj0V6HY6EZ89uv5wuG5aefpe_9n0Jr3VwntFg@mail.gmail.com	2025-03-18 18:56:11 +01:00
Andres Freund	247ce06b88	aio: Add io_method=worker The previous commit introduced the infrastructure to start io_workers. This commit actually makes the workers execute IOs. IO workers consume IOs from a shared memory submission queue, run traditional synchronous system calls, and perform the shared completion handling immediately. Client code submits most requests by pushing IOs into the submission queue, and waits (if necessary) using condition variables. Some IOs cannot be performed in another process due to lack of infrastructure for reopening the file, and must processed synchronously by the client code when submitted. For now the default io_method is changed to "worker". We should re-evaluate that around beta1, we might want to be careful and set the default to "sync" for 18. Reviewed-by: Noah Misch <noah@leadboat.com> Co-authored-by: Thomas Munro <thomas.munro@gmail.com> Co-authored-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt Discussion: https://postgr.es/m/20210223100344.llw5an2aklengrmn@alap3.anarazel.de Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m	2025-03-18 11:54:01 -04:00
Andres Freund	55b454d0e1	aio: Infrastructure for io_method=worker This commit contains the basic, system-wide, infrastructure for io_method=worker. It does not yet actually execute IO, this commit just provides the infrastructure for running IO workers, kept separate for easier review. The number of IO workers can be adjusted with a PGC_SIGHUP GUC. Eventually we'd like to make the number of workers dynamically scale up/down based on the current "IO load". To allow the number of IO workers to be increased without a restart, we need to reserve PGPROC entries for the workers unconditionally. This has been judged to be worth the cost. If it turns out to be problematic, we can introduce a PGC_POSTMASTER GUC to control the maximum number. As io workers might be needed during shutdown, e.g. for AIO during the shutdown checkpoint, a new PMState phase is added. IO workers are shut down after the shutdown checkpoint has been performed and walsender/archiver have shut down, but before the checkpointer itself shuts down. See also `87a6690cc6`. Updates PGSTAT_FILE_FORMAT_ID due to the addition of a new BackendType. Reviewed-by: Noah Misch <noah@leadboat.com> Co-authored-by: Thomas Munro <thomas.munro@gmail.com> Co-authored-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt Discussion: https://postgr.es/m/20210223100344.llw5an2aklengrmn@alap3.anarazel.de Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m	2025-03-18 11:54:01 -04:00
Jeff Davis	549ea06e42	Fix headerscheck warning. Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/93731.1742310701@sss.pgh.pa.us	2025-03-18 08:37:07 -07:00
Robert Haas	4fd02bf7cf	Add some new hooks so extensions can add details to EXPLAIN. Specifically, add a per-node hook that is called after the per-node information has been displayed but before we display children, and a per-query hook that is called after existing query-level information is printed. This assumes that extension-added information should always go at the end rather than the beginning or the middle, but that seems like an acceptable limitation for simplicity. It also assumes that extensions will only want to add information, not remove or reformat existing details; those also seem like acceptable restrictions, at least for now. If multiple EXPLAIN extensions are used, the order in which any additional details are printed is likely to depend on the order in which the modules are loaded. That seems OK, since the user may have opinions about the order in which output should appear, and the extension author can't really know whether their stuff is more or less important to a particular user than some other extension. Discussion: http://postgr.es/m/CA+TgmoYSzg58hPuBmei46o8D3SKX+SZoO4K_aGQGwiRzvRApLg@mail.gmail.com Reviewed-by: Srinath Reddy <srinath2133@gmail.com> Reviewed-by: Andrei Lepikhov <lepihov@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Sami Imseih <samimseih@gmail.com>	2025-03-18 09:28:01 -04:00
Melanie Plageman	cc6be07ebd	Increase default maintenance_io_concurrency to 16 Since its introduction in `fc34b0d9de`, the default maintenance_io_concurrency has been larger than the default effective_io_concurrency. maintenance_io_concurrency primarily controlled prefetching done on behalf of the whole system, for operations like recovery. Therefore it makes sense for it to have a value equal to or greater than effective_io_concurrency, which controls I/O concurrency for reading a relation in a bitmap heap scan. `ff79b5b2ab` increased effective_io_concurrency to 16, so we'll increase maintenance_io_concurrency as well. For now, though, we'll keep the defaults of effective_io_concurrency and maintenance_io_concurrency equal to one another (16). On fast, high IOPs systems, significantly higher values of maintenance_io_concurrency are observably beneficial [1]. However, such values would flood low IOPs systems and increase overall system I/O latency. It is worth mentioning that since `9256822608` and `c3e775e608`, maintenance_io_concurrency also controls the I/O concurrency of each vacuum worker. Since many autovacuum workers may be simultaneously issuing I/Os, we want to keep maintenance_io_concurrency appropriately conservative. [1] https://postgr.es/m/c5d52837-6256-0556-ac8c-d6d3d558820a%40enterprisedb.com Suggested-by: Jakub Wartak <jakub.wartak@enterprisedb.com> Discussion: https://postgr.es/m/CAKZiRmxdHQaU%2B2Zpe6d%3Dx%3D0vigJ1sfWwwVYLJAf%3Dud_wQ_VcUw%40mail.gmail.com	2025-03-18 09:08:10 -04:00
Robert Haas	796bdda484	Fix indentation again. Because somehow I manage to keep forgetting this.	2025-03-18 09:02:36 -04:00
Robert Haas	c65bc2e1d1	Make it possible for loadable modules to add EXPLAIN options. Modules can use RegisterExtensionExplainOption to register new EXPLAIN options, and GetExplainExtensionId, GetExplainExtensionState, and SetExplainExtensionState to store related state inside the ExplainState object. Since this substantially increases the amount of code that needs to handle ExplainState-related tasks, move a few bits of existing code to a new file explain_state.c and add the rest of this infrastructure there. See the comments at the top of explain_state.c for further explanation of how this mechanism works. This does not yet provide a way for such such options to do anything useful. The intention is that we'll add hooks for that purpose in a separate commit. Discussion: http://postgr.es/m/CA+TgmoYSzg58hPuBmei46o8D3SKX+SZoO4K_aGQGwiRzvRApLg@mail.gmail.com Reviewed-by: Srinath Reddy <srinath2133@gmail.com> Reviewed-by: Andrei Lepikhov <lepihov@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Sami Imseih <samimseih@gmail.com>	2025-03-18 08:41:12 -04:00
Peter Eisentraut	7317e64126	Add some opfamily support functions to lsyscache.c Add get_opfamily_method() and get_opfamily_member_for_cmptype() in lsyscache.c. No callers yet, but we'll add some soon. This is part of generalizing some parts of the code away from having btree hardcoded and use CompareType instead. Author: Mark Dilger <mark.dilger@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com	2025-03-18 11:17:43 +01:00
Amit Kapila	01e27aab05	Use correct variable name in publicationcmds.c. subid was used at few places for publicationid in publicationcmds.c/.h. Author: vignesh C <vignesh21@gmail.com> Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> Discussion: https://postgr.es/m/CALDaNm1KqJ0VFfDJRPbfYi9Shz6LHFEE-Ckn+eqsePfKhebv9w@mail.gmail.com	2025-03-18 14:06:51 +05:30
Andres Freund	da7226993f	aio: Add core asynchronous I/O infrastructure The main motivations to use AIO in PostgreSQL are: a) Reduce the time spent waiting for IO by issuing IO sufficiently early. In a few places we have approximated this using posix_fadvise() based prefetching, but that is fairly limited (no completion feedback, double the syscalls, only works with buffered IO, only works on some OSs). b) Allow to use Direct-I/O (DIO). DIO can offload most of the work for IO to hardware and thus increase throughput / decrease CPU utilization, as well as reduce latency. While we have gained the ability to configure DIO in `d4e71df6`, it is not yet usable for real world workloads, as every IO is executed synchronously. For portability, the new AIO infrastructure allows to implement AIO using different methods. The choice of the AIO method is controlled by the new io_method GUC. As of this commit, the only implemented method is "sync", i.e. AIO is not actually executed asynchronously. The "sync" method exists to allow to bypass most of the new code initially. Subsequent commits will introduce additional IO methods, including a cross-platform method implemented using worker processes and a linux specific method using io_uring. To allow different parts of postgres to use AIO, the core AIO infrastructure does not need to know what kind of files it is operating on. The necessary behavioral differences for different files are abstracted as "AIO Targets". One example target would be smgr. For boring portability reasons, all targets currently need to be added to an array in aio_target.c. This commit does not implement any AIO targets, just the infrastructure for them. The smgr target will be added in a later commit. Completion (and other events) of IOs for one type of file (i.e. one AIO target) need to be reacted to differently, based on the IO operation and the callsite. This is made possible by callbacks that can be registered on IOs. E.g. an smgr read into a local buffer does not need to update the corresponding BufferDesc (as there is none), but a read into shared buffers does. This commit does not contain any callbacks, they will be added in subsequent commits. For now the AIO infrastructure only understands READV and WRITEV operations, but it is expected that more operations will be added. E.g. fsync/fdatasync, flush_range and network operations like send/recv. As of this commit, nothing uses the AIO infrastructure. Later commits will add an smgr target, md.c and bufmgr.c callbacks and then finally use AIO for read_stream.c IO, which, in one fell swoop, will convert all read stream users to AIO. The goal is to use AIO in many more places. There are patches to use AIO for checkpointer and bgwriter that are reasonably close to being ready. There also are prototypes to use it for WAL, relation extension, backend writes and many more. Those prototypes were important to ensure the design of the AIO subsystem is not too limiting (e.g. WAL writes need to happen in critical sections, which influenced a lot of the design). A future commit will add an AIO README explaining the AIO architecture and how to use the AIO subsystem. The README is added later, as it references details only added in later commits. Many many more people than the folks named below have contributed with feedback, work on semi-independent patches etc. E.g. various folks have contributed patches to use the read stream infrastructure (added by Thomas in `b5a9b18cd0`) in more places. Similarly, a lot of folks have contributed to the CI infrastructure, which I had started to work on to make adding AIO feasible. Some of the work by contributors has gone into the "v1" prototype of AIO, which heavily influenced the current design of the AIO subsystem. None of the code from that directly survives, but without the prototype, the current version of the AIO infrastructure would not exist. Similarly, the reviewers below have not necessarily looked at the current design or the whole infrastructure, but have provided very valuable input. I am to blame for problems, not they. Author: Andres Freund <andres@anarazel.de> Co-authored-by: Thomas Munro <thomas.munro@gmail.com> Co-authored-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Co-authored-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Noah Misch <noah@leadboat.com> Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Dmitry Dolgov <9erthalion6@gmail.com> Reviewed-by: Antonin Houska <ah@cybertec.at> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt Discussion: https://postgr.es/m/20210223100344.llw5an2aklengrmn@alap3.anarazel.de Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m	2025-03-17 18:51:33 -04:00
Andres Freund	02844012b3	aio: Basic subsystem initialization This commit just does the minimal wiring up of the AIO subsystem, added in the next commit, to the rest of the system. The next commit contains more details about motivation and architecture. This commit is kept separate to make it easier to review, separating the changes across the tree, from the implementation of the new subsystem. We discussed squashing this commit with the main commit before merging AIO, but there has been a mild preference for keeping it separate. Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt	2025-03-17 18:51:33 -04:00
Michael Paquier	5721e5453e	Revert "Add redo LSN to pgstats files" This reverts commit `b860848232`, that was added as a prerequisite for the support of pgstats data flush across checkpoints, linking a pgstats file to a specific checkpoint redo LSN. As reported, this is proving to be currently problematic when going through a pg_upgrade, that does direct manipulations of the control file in the new cluster. The LSN stored in the pgstats file is not able to cope with any changes done in the control file by pg_upgrade yet, causing the pgstats file to be discarded when starting the new cluster after overriding its redo LSN (one is a `pg_resetwal -l` where the new cluster's start LSN is bumped by a hardcoded value of 8 segments, see copy_xact_xlog_xid). The least painful path going forward is likely going to be a refactor of the pgstats code so as it is possible to read and write some of its data with some routines in src/common/, so as pg_upgrade or pg_resetwal are able to update its data. The main point is that we are going to need a LSN in the stats file should we make it written at checkpoint time and not only as part of a shutdown sequence. It is too late to dive into these details for v18, so let's revert the change, and let's try to figure out all the details in the next release cycle. The pgstats file is currently only written as part of a shutdown sequence, and its contents are still lost on crash, same as older releases. Bump PGSTAT_FILE_FORMAT_ID. Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/2563883.1741826489@sss.pgh.pa.us	2025-03-17 08:35:12 +09:00
Andres Freund	771ba90298	localbuf: Introduce StartLocalBufferIO() To initiate IO on a shared buffer we have StartBufferIO(). For temporary table buffers no similar function exists - likely because the code for that currently is very simple due to the lack of concurrency. However, the upcoming AIO support will make it possible to re-encounter a local buffer, while the buffer already is the target of IO. In that case we need to wait for already in-progress IO to complete. This commit makes it easier to add the necessary code, by introducing StartLocalBufferIO(). Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/CAAKRu_b9anbWzEs5AAF9WCvcEVmgz-1AkHSQ-CLLy-p7WHzvFw@mail.gmail.com	2025-03-15 22:07:48 -04:00
Andres Freund	4b4d33b9ea	localbuf: Introduce FlushLocalBuffer() Previously we had two paths implementing writing out temporary table buffers. For shared buffers, the logic for that is centralized in FlushBuffer(). Introduce FlushLocalBuffer() to do the same for local buffers. Besides being a nice cleanup on its own, it also makes an upcoming change slightly easier. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/CAAKRu_b9anbWzEs5AAF9WCvcEVmgz-1AkHSQ-CLLy-p7WHzvFw@mail.gmail.com	2025-03-15 22:07:48 -04:00
Andres Freund	dd6f2618f6	localbuf: Introduce TerminateLocalBufferIO() Previously TerminateLocalBufferIO() was open-coded in multiple places, which doesn't seem like a great idea. While TerminateLocalBufferIO() currently is rather simple, an upcoming patch requires additional code to be added to TerminateLocalBufferIO(), making this modification particularly worthwhile. For some reason FlushRelationBuffers() previously cleared BM_JUST_DIRTIED, even though that's never set for temporary buffers. This is not carried over as part of this change. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/CAAKRu_b9anbWzEs5AAF9WCvcEVmgz-1AkHSQ-CLLy-p7WHzvFw@mail.gmail.com	2025-03-15 22:07:48 -04:00
Jeff Davis	27bdec0684	Optimization for lower(), upper(), casefold() functions. Improve performance and reduce table sizes for case mapping. The main case mapping table stores only 16-bit offsets, which can be used to look up the mapped code point in any of the case tables (fold, lower, upper, or title case). Simple case pairs point to the same offsets. Generate a function in generate-unicode_case_table.pl that consists of a nested branches to test for specific codepoint ranges that determine the offset in the main table. Other approaches were considered, such as representing these ranges as another structure (rather than branches in a generated function), or a different approach such as a radix tree, or perfect hashing. The author implemented and tested these alternatives and settled on the generated branches. Author: Alexander Borisov <lex.borisov@gmail.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Discussion: https://postgr.es/m/7cac7e66-9a3b-4e3f-a997-42aa0c401f80%40gmail.com	2025-03-15 13:00:50 -07:00
Melanie Plageman	c3953226a0	Remove table AM callback scan_bitmap_next_block After pushing the bitmap iterator into table-AM specific code (as part of making bitmap heap scan use the read stream API in `2b73a8cd33`), scan_bitmap_next_block() no longer returns the current block number. Since scan_bitmap_next_block() isn't returning any relevant information to bitmap table scan code, it makes more sense to get rid of it. Now, bitmap table scan code only calls table_scan_bitmap_next_tuple(), and the heap AM implementation of scan_bitmap_next_block() is a local helper in heapam_handler.c. Reviewed-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/flat/CAAKRu_ZwCwWFeL_H3ia26bP2e7HiKLWt0ZmGXPVwPO6uXq0vaA%40mail.gmail.com	2025-03-15 10:37:46 -04:00
Melanie Plageman	2b73a8cd33	BitmapHeapScan uses the read stream API Make Bitmap Heap Scan use the read stream API instead of invoking ReadBuffer() for each block indicated by the bitmap. The read stream API handles prefetching, so remove all of the explicit prefetching from bitmap heap scan code. Now, heap table AM implements a read stream callback which uses the bitmap iterator to return the next required block to the read stream code. Tomas Vondra conducted extensive regression testing of this feature. Andres Freund, Thomas Munro, and I analyzed regressions and Thomas Munro patched the read stream API. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Tomas Vondra <tomas@vondra.me> Tested-by: Tomas Vondra <tomas@vondra.me> Tested-by: Andres Freund <andres@anarazel.de> Tested-by: Thomas Munro <thomas.munro@gmail.com> Tested-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Discussion: https://postgr.es/m/flat/CAAKRu_ZwCwWFeL_H3ia26bP2e7HiKLWt0ZmGXPVwPO6uXq0vaA%40mail.gmail.com	2025-03-15 10:34:42 -04:00
Melanie Plageman	944e81bf99	Separate TBM[Shared\|Private]Iterator and TBMIterateResult Remove the TBMIterateResult member from the TBMPrivateIterator and TBMSharedIterator and make tbm_[shared\|private_]iterate() take a TBMIterateResult as a parameter. This allows tidbitmap API users to manage multiple TBMIterateResults per scan. This is required for bitmap heap scan to use the read stream API, with which there may be multiple I/Os in flight at once, each one with a TBMIterateResult. Reviewed-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/d4bb26c9-fe07-439e-ac53-c0e244387e01%40vondra.me	2025-03-15 10:11:19 -04:00
Tom Lane	1c9242b2cd	Doc: remove obsolete comment. This para should have been removed by `2f9661311`, which made it both false and irrelevant. Noted while looking at SQL function plancache patch.	2025-03-14 14:08:47 -04:00
Fujii Masao	6d376c3b0d	Add GUC option to log lock acquisition failures. This commit introduces a new GUC, log_lock_failure, which controls whether a detailed log message is produced when a lock acquisition fails. Currently, it only supports logging lock failures caused by SELECT ... NOWAIT. The log message includes information about all processes holding or waiting for the lock that couldn't be acquired, helping users analyze and diagnose the causes of lock failures. Currently, this option does not log failures from SELECT ... SKIP LOCKED, as that could generate excessive log messages if many locks are skipped, causing unnecessary noise. This mechanism can be extended in the future to support for logging lock failures from other commands, such as LOCK TABLE ... NOWAIT. Author: Yuki Seino <seinoyu@oss.nttdata.com> Co-authored-by: Fujii Masao <masao.fujii@gmail.com> Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl> Discussion: https://postgr.es/m/411280a186cc26ef7034e0f2dfe54131@oss.nttdata.com	2025-03-14 23:14:12 +09:00
Peter Eisentraut	a359d37019	Simplify and generalize PrepareSortSupportFromIndexRel() PrepareSortSupportFromIndexRel() was accepting btree strategy numbers purely for the purpose of comparing it later against btree strategies to determine if the sort direction was forward or reverse. Change that. Instead, pass a bool directly, to indicate the same without an unfortunate assumption that a strategy number refers specifically to a btree strategy. (This is similar in spirit to commits `0d2aa4d493` and c594f1ad2ba.) (This could arguably be simplfied further by having the callers fill in ssup_reverse directly. But this way, it preserves consistency by having all PrepareSortSupport*() variants be responsible for filling in ssup_reverse.) Moreover, remove the hardcoded check against BTREE_AM_OID, and check against amcanorder instead, which is the actual requirement. Co-authored-by: Mark Dilger <mark.dilger@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com	2025-03-14 10:34:08 +01:00
Peter Eisentraut	05cbd6cb22	Swap order of extern/static and pg_nodiscard When pg_nodiscard was first added, the C standard draft had it as a function specifier, and so the code comment about placement was written with that in mind. The final C23 standard has it as an attribute and the placement rules are a bit different for that. Specifically, it needs to be before extern or static. (Or at least both current clang and gcc require that.) So just swap these. (To be clear: The current implementation with gcc attributes doesn't care. This change is just for maximum forward compatibility for non-gcc compilers.) This also keeps the order consistent with the previously introduced pg_noreturn. Also update the code comment to reflect the mentioned developments since its introduction. Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://www.postgresql.org/message-id/flat/pxr5b3z7jmkpenssra5zroxi7qzzp6eswuggokw64axmdixpnk@zbwxuq7gbbcw	2025-03-14 07:18:07 +01:00
Thomas Munro	01261fb078	Improve buffer manager API for backend pin limits. Previously the support functions assumed that the caller needed one pin to make progress, and could optionally use some more, allowing enough for every connection to do the same. Add a couple more functions for callers that want to know: * what the maximum possible number could be, irrespective of currently held pins, for space planning purposes * how many additional pins they could acquire right now, without the special case allowing one pin, for callers that already hold pins and could already make progress even if no extra pins are available The pin limit logic began in commit `31966b15`. This refactoring is better suited to read_stream.c, which will be adjusted to respect the remaining limit as it changes over time in a follow-up commit. It also computes MaxProportionalPins up front, to avoid performing divisions whenever a caller needs to check the balance. Reviewed-by: Andres Freund <andres@anarazel.de> (earlier versions) Discussion: https://postgr.es/m/CA%2BhUKGK_%3D4CVmMHvsHjOVrK6t4F%3DLBpFzsrr3R%2BaJYN8kcTfWg%40mail.gmail.com	2025-03-14 17:13:09 +13:00
Nathan Bossart	0697b23906	Add reverse(bytea). This commit introduces a function for reversing the order of the bytes in binary strings. Bumps catversion. Author: Aleksander Alekseev <aleksander@timescale.com> Discussion: https://postgr.es/m/CAJ7c6TMe0QVRuNssUArbMi0bJJK32%2BzNA3at5m3osrBQ25MHuw%40mail.gmail.com	2025-03-13 11:20:53 -05:00
Peter Eisentraut	3691edfab9	pg_noreturn to replace pg_attribute_noreturn() We want to support a "noreturn" decoration on more compilers besides just GCC-compatible ones, but for that we need to move the decoration in front of the function declaration instead of either behind it or wherever, which is the current style afforded by GCC-style attributes. Also rename the macro to "pg_noreturn" to be similar to the C11 standard "noreturn". pg_noreturn is now supported on all compilers that support C11 (using _Noreturn), as well as GCC-compatible ones (using __attribute__, as before), as well as MSVC (using __declspec). (When PostgreSQL requires C11, the latter two variants can be dropped.) Now, all supported compilers effectively support pg_noreturn, so the extra code for !HAVE_PG_ATTRIBUTE_NORETURN can be dropped. This also fixes a possible problem if third-party code includes stdnoreturn.h, because then the current definition of #define pg_attribute_noreturn() __attribute__((noreturn)) would cause an error. Note that the C standard does not support a noreturn attribute on function pointer types. So we have to drop these here. There are only two instances at this time, so it's not a big loss. In one case, we can make up for it by adding the pg_noreturn to a wrapper function and adding a pg_unreachable(), in the other case, the latter was already done before. Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://www.postgresql.org/message-id/flat/pxr5b3z7jmkpenssra5zroxi7qzzp6eswuggokw64axmdixpnk@zbwxuq7gbbcw	2025-03-13 12:37:26 +01:00
Amit Kapila	3abe9dc188	Avoid invalidating all RelationSyncCache entries on publication rename. On Publication rename, we need to only invalidate the RelationSyncCache entries corresponding to relations that are part of the publication being renamed. As part of this patch, we introduce a new invalidation message to invalidate the cache maintained by the logical decoding output plugin. We can't use existing relcache invalidation for this purpose, as that would unnecessarily cause relcache invalidations in other backends. This will improve performance by building fewer relation cache entries during logical replication. Author: Hayato Kuroda <kuroda.hayato@fujitsu.com> Author: Shlok Kyal <shlok.kyal.oss@gmail.com> Reviewed-by: Hou Zhijie <houzj.fnst@fujitsu.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://postgr.es/m/OSCPR01MB14966C09AA201EFFA706576A7F5C92@OSCPR01MB14966.jpnprd01.prod.outlook.com	2025-03-13 09:16:33 +05:30
Masahiko Sawada	4ecdd4110d	pg_rewind: Add dbname to primary_conninfo when using --write-recovery-conf. This commit enhances pg_rewind's --write-recovery-conf option to include the dbname in the generated primary_conninfo value when specified in the --source-server option. With this modification, the rewound server can connect to the primary server without manual configuration file modifications when sync_replication_slots is enabled. Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Peter Smith <smithpb2250@gmail.com> Discussion: https://postgr.es/m/CAD21AoAkW=Ht0k9dVoBTCcqLiiZ2MXhVr+d=j2T_EZMerGrLWQ@mail.gmail.com	2025-03-12 16:56:04 -07:00
Heikki Linnakangas	ac4494646d	Rename alloc/free functions in reorderbuffer.c There used to be bespoken pools for these structs to reduce the palloc/pfree overhead, but that was ripped out a long time ago and replaced with the generic, cheaper generational memory allocator (commit `a4ccc1cef5`). The Get/Return terminology made sense with the pools, as you "got" an object from the pool and "returned" it later, but now it just looks weird. Rename to Alloc/Free. Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://www.postgresql.org/message-id/c9e43d2d-8e83-444f-b111-430377368989@iki.fi	2025-03-12 22:03:39 +02:00
Melanie Plageman	ff79b5b2ab	Increase default effective_io_concurrency to 16 The default effective_io_concurrency has been 1 since it was introduced in `b7b8f0b609`. Referencing the associated discussion [1], it seems 1 was chosen as a conservative value that seemed unlikely to cause regressions. Experimentation on high latency cloud storage as well as fast, local nvme storage (see Discussion link) shows that even slightly higher values improve query timings substantially. 1 actually performs worse than 0 [2]. With effective_io_concurrency 1, we are not prefetching enough to avoid I/O stalls, but we are issuing extra syscalls. The new default is 16, which should be more appropriate for common hardware while still avoiding flooding low IOPs devices with I/O requests. [1] https://www.postgresql.org/message-id/flat/FDDBA24E-FF4D-4654-BA75-692B3BA71B97%40enterprisedb.com [2] https://www.postgresql.org/message-id/CAAKRu_Zv08Cic%3DqdCfzrQabpEXGrd9Z9UOW5svEVkCM6%3DFXA9g%40mail.gmail.com Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CAAKRu_Z%2BJa-mwXebOoOERMMUMvJeRhzTjad4dSThxG0JLXESxw%40mail.gmail.com	2025-03-12 15:57:44 -04:00
Melanie Plageman	18cd15e706	Add connection establishment duration logging Add log_connections option 'setup_durations' which logs durations of several key parts of connection establishment and backend setup. For an incoming connection, starting from when the postmaster gets a socket from accept() and ending when the forked child backend is first ready for query, there are multiple steps that could each take longer than expected due to external factors. This logging provides visibility into authentication and fork duration as well as the end-to-end connection establishment and backend initialization time. To make this portable, the timings captured in the postmaster (socket creation time, fork initiation time) are passed through the BackendStartupData. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com> Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl> Reviewed-by: Guillaume Lelarge <guillaume.lelarge@dalibo.com> Discussion: https://postgr.es/m/flat/CAAKRu_b_smAHK0ZjrnL5GRxnAVWujEXQWpLXYzGbmpcZd3nLYw%40mail.gmail.com	2025-03-12 11:35:27 -04:00
Melanie Plageman	9219093cab	Modularize log_connections output Convert the boolean log_connections GUC into a list GUC comprised of the connection aspects to log. This gives users more control over the volume and kind of connection logging. The current log_connections options are 'receipt', 'authentication', and 'authorization'. The empty string disables all connection logging. 'all' enables all available connection logging. For backwards compatibility, the most common values for the log_connections boolean are still supported (on, off, 1, 0, true, false, yes, no). Note that previously supported substrings of on, off, true, false, yes, and no are no longer supported. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://postgr.es/m/flat/CAAKRu_b_smAHK0ZjrnL5GRxnAVWujEXQWpLXYzGbmpcZd3nLYw%40mail.gmail.com	2025-03-12 11:35:21 -04:00
Masahiko Sawada	bd65cb3cd4	pg_logicalinspect: Fix possible crash when passing a directory path. Previously, pg_logicalinspect functions were too trusting of their input and blindly passed it to SnapBuildRestoreSnapshot(). If the input pointed to a directory, the server could a PANIC error while attempting to fsync_fname() with isdir=false on a directory. This commit adds validation checks for input filenames and passes the LSN extracted from the filename to SnapBuildRestoreSnapshot() instead of the filename itself. It also adds regression tests for various input patterns and permission checks. Bug: #18828 Reported-by: Robins Tharakan <tharakan@gmail.com> Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Co-authored-by: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://postgr.es/m/18828-0f4701c635064211@postgresql.org	2025-03-11 09:56:40 -07:00
Tom Lane	8b1b342544	Improve EXPLAIN's display of window functions. Up to now we just punted on showing the window definitions used in a plan, with window function calls represented as "OVER (?)". To improve that, show the window definition implemented by each WindowAgg plan node, and reference their window names in OVER. For nameless window clauses generated by "OVER (...)", assign unique names w1, w2, etc. In passing, re-order the properties shown for a WindowAgg node so that the Run Condition (if any) appears after the Window property and before the Filter (if any). This seems more sensible since the Run Condition is associated with the Window and acts before the Filter. Thanks to David G. Johnston and Álvaro Herrera for design suggestions. Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/144530.1741469955@sss.pgh.pa.us	2025-03-11 11:19:54 -04:00
Peter Geoghegan	426ea61117	nbtree: Make BTMaxItemSize into object-like macro. Make nbtree's "1/3 of a page limit" BTMaxItemSize function-like macro (which accepts a "page" argument) into an object-like macro that can be used from code that doesn't have convenient access to an nbtree page. Preparation for an upcoming patch that adds skip scan to nbtree. Parallel index scans that use skip scan will serialize datums (not just SAOP array subscripts) when scheduling primitive scans. BTMaxItemSize will be used by btestimateparallelscan to determine how much DSM to request. Author: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/CAH2-Wz=H_RG5weNGeUG_TkK87tRBnH9mGCQj6WpM4V4FNWKv2g@mail.gmail.com	2025-03-11 10:35:56 -04:00
Peter Geoghegan	0fbceae841	Show index search count in EXPLAIN ANALYZE, take 2. Expose the count of index searches/index descents in EXPLAIN ANALYZE's output for index scan/index-only scan/bitmap index scan nodes. This information is particularly useful with scans that use ScalarArrayOp quals, where the number of index searches can be unpredictable due to implementation details that interact with physical index characteristics (at least with nbtree SAOP scans, since Postgres 17 commit `5bf748b8`). The information shown also provides useful context when EXPLAIN ANALYZE runs a plan with an index scan node that successfully applied the skip scan optimization (set to be added to nbtree by an upcoming patch). The instrumentation works by teaching all index AMs to increment a new nsearches counter whenever a new index search begins. The counter is incremented at exactly the same point that index AMs already increment the pg_stat_*_indexes.idx_scan counter (we're counting the same event, but at the scan level rather than the relation level). Parallel queries have workers copy their local counter struct into shared memory when an index scan node ends -- even when it isn't a parallel aware scan node. An earlier version of this patch that only worked with parallel aware scans became commit `5ead85fb` (though that was quickly reverted by commit `d00107cd` following "debug_parallel_query=regress" buildfarm failures). Our approach doesn't match the approach used when tracking other index scan related costs (e.g., "Rows Removed by Filter:"). It is comparable to the approach used in similar cases involving costs that are only readily accessible inside an access method, not from the executor proper (e.g., "Heap Blocks:" output for a Bitmap Heap Scan, which was recently enhanced to show per-worker costs by commit `5a1e6df3`, using essentially the same scheme as the one used here). It is necessary for index AMs to have direct responsibility for maintaining the new counter, since the counter might need to be incremented multiple times per amgettuple call (or per amgetbitmap call). But it is also necessary for the executor proper to manage the shared memory now used to transfer each worker's counter struct to the leader. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Robert Haas <robertmhaas@gmail.com> Reviewed-By: Tomas Vondra <tomas@vondra.me> Reviewed-By: Masahiro Ikeda <ikedamsh@oss.nttdata.com> Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/CAH2-WzkRqvaqR2CTNqTZP0z6FuL4-3ED6eQB0yx38XBNj1v-4Q@mail.gmail.com Discussion: https://postgr.es/m/CAH2-Wz=PKR6rB7qbx+Vnd7eqeB5VTcrW=iJvAsTsKbdG+kW_UA@mail.gmail.com	2025-03-11 09:20:50 -04:00
Daniel Gustafsson	d35d32d711	Add special case fast-paths for strict functions Many STRICT function calls will have one or two arguments, in which case we can speed up checking for NULL input by avoiding setting up a loop over the arguments. This adds EEOP_FUNCEXPR_STRICT_1 and the corresponding EEOP_FUNCEXPR_STRICT_2 for functions with one and two arguments respectively. Author: Andres Freund <andres@anarazel.de> Co-authored-by: Daniel Gustafsson <daniel@yesql.se> Reviewed-by: Andreas Karlsson <andreas@proxel.se> Discussion: https://postgr.es/m/415721CE-7D2E-4B74-B5D9-1950083BA03E@yesql.se Discussion: https://postgr.es/m/20191023163849.sosqbfs5yenocez3@alap3.anarazel.de	2025-03-11 12:02:42 +01:00
Daniel Gustafsson	8dd7c7cd0a	Replace EEOP_DONE with special steps for return/no return Knowing when the side-effects of an expression is the intended result of the execution, rather than the returnvalue, is important for being able generate more efficient JITed code. This replaces EEOP_DONE with two new steps: EEOP_DONE_RETURN and EEOP_DONE_NO_RETURN. Expressions which return a value should use the former step; expressions used for their side-effects which don't return value should use the latter. Author: Andres Freund <andres@anarazel.de> Co-authored-by: Daniel Gustafsson <daniel@yesql.se> Reviewed-by: Andreas Karlsson <andreas@proxel.se> Discussion: https://postgr.es/m/415721CE-7D2E-4B74-B5D9-1950083BA03E@yesql.se Discussion: https://postgr.es/m/20191023163849.sosqbfs5yenocez3@alap3.anarazel.de	2025-03-11 12:02:38 +01:00
Michael Paquier	76def4cdd7	Add WAL data to backend statistics This commit adds per-backend WAL statistics, providing the same information as pg_stat_wal, except that it is now possible to know how much WAL activity is happening in each backend rather than an overall aggregate of all the activity. Like pg_stat_wal, the implementation relies on pgWalUsage, tracking the difference of activity between two reports to pgstats. This data can be retrieved with a new system function called pg_stat_get_backend_wal(), that returns one tuple based on the PID provided in input. Like pg_stat_get_backend_io(), this is useful when joined with pg_stat_activity to get a live picture of the WAL generated for each running backend, showing how the activity is [un]balanced. pgstat_flush_backend() gains a new flag value, able to control the flush of the WAL stats. This commit relies mostly on the infrastructure provided by `9aea73fc61`, that has introduced backend statistics. Bump catalog version. A bump of PGSTAT_FILE_FORMAT_ID is not required, as backend stats do not persist on disk. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com> Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com> Discussion: https://postgr.es/m/Z3zqc4o09dM/Ezyz@ip-10-97-1-34.eu-west-3.compute.internal	2025-03-11 09:04:11 +09:00
Alexander Korotkov	6bb6a62f3c	Use extended stats for precise estimation of bucket size in hash join Recognizing the real-life complexity where columns in the table often have functional dependencies, PostgreSQL's estimation of the number of distinct values over a set of columns can be underestimated (or much rarely, overestimated) when dealing with multi-clause JOIN. In the case of hash join, it can end up with a small number of predicted hash buckets and, as a result, picking non-optimal merge join. To improve the situation, we introduce one additional stage of bucket size estimation - having two or more join clauses estimator lookup for extended statistics and use it for multicolumn estimation. Clauses are grouped into lists, each containing expressions referencing the same relation. The result of the multicolumn estimation made over such a list is combined with others according to the caller's logic. Clauses that are not estimated are returned to the caller for further estimation. Discussion: https://postgr.es/m/52257607-57f6-850d-399a-ec33a654457b%40postgrespro.ru Author: Andrei Lepikhov <lepihov@gmail.com> Reviewed-by: Andy Fan <zhihui.fan1213@gmail.com> Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Reviewed-by: Alena Rybakina <lena.ribackina@yandex.ru> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>	2025-03-10 13:42:01 +02:00
David Rowley	e033696596	Fix incorrect #endif comment Noticed while reading code in this area.	2025-03-10 13:36:04 +13:00
Peter Geoghegan	67fc4c9fd7	Make parallel nbtree index scans use an LWLock. Teach parallel nbtree index scans to use an LWLock (not a spinlock) to protect the scan's shared descriptor state. Preparation for an upcoming patch that will add skip scan optimizations to nbtree. That patch will create the need to occasionally allocate memory while the scan descriptor is locked, while copying datums that were serialized by another backend. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/CAH2-Wz=PKR6rB7qbx+Vnd7eqeB5VTcrW=iJvAsTsKbdG+kW_UA@mail.gmail.com	2025-03-08 11:10:14 -05:00
Peter Eisentraut	af4002b381	Rename amcancrosscompare After more discussion about commit `ce62f2f2a0`, rename the index AM property amcancrosscompare to two separate properties amconsistentequality and amconsistentordering. Also improve the documentation and update some comments that were previously missed. Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://www.postgresql.org/message-id/flat/E1tngY6-0000UL-2n%40gemulon.postgresql.org	2025-03-07 11:46:33 +01:00
Dean Rasheed	6da469bada	Allow casting between bytea and integer types. This allows smallint, integer, and bigint values to be cast to and from bytea. The bytea value is the two's complement representation of the integer, with the most significant byte first. For example: 1234::bytea -> \x000004d2 (-1234)::bytea -> \xfffffb2e Author: Aleksander Alekseev <aleksander@timescale.com> Reviewed-by: Joel Jacobson <joel@compiler.org> Reviewed-by: Yugo Nagata <nagata@sraoss.co.jp> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com> Discussion: https://postgr.es/m/CAJ7c6TPtOp6%2BkFX5QX3fH1SVr7v65uHr-7yEJ%3DGMGQi5uhGtcA%40mail.gmail.com	2025-03-07 09:31:18 +00:00
Heikki Linnakangas	39de4f157d	Fix compiler warnings about typedef redefinitions Clang with -Wtypedef-redefinition produced warnings: src/include/storage/latch.h:122:3: error: redefinition of typedef 'Latch' is a C11 feature [-Werror,-Wtypedef-redefinition] Per buildfarm	2025-03-06 03:10:22 +02:00
Heikki Linnakangas	393e0d2314	Split WaitEventSet functions to separate source file latch.c now only contains the Latch related functions, which build on the WaitEventSet abstraction. Most of the platform-dependent stuff is now in waiteventset.c. Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://www.postgresql.org/message-id/8a507fb6-df28-49d3-81a5-ede180d7f0fb@iki.fi	2025-03-06 01:26:16 +02:00
Heikki Linnakangas	a98e4dee63	Remove unused ShutdownLatchSupport() function The only caller was removed in commit `80a8f95b3b`. I don't foresee needing it any time soon, and I'm working on some big changes in this area, so let's remove it out of the way. Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://www.postgresql.org/message-id/8a507fb6-df28-49d3-81a5-ede180d7f0fb@iki.fi	2025-03-05 23:52:04 +02:00
Peter Geoghegan	d00107cd63	Revert "Show index search count in EXPLAIN ANALYZE." This reverts commit `5ead85fbc8`. This commit shows test failures with debug_parallel_query=regress. The underlying issue needs to be debugged, so revert for now.	2025-03-05 10:27:31 -05:00
Andrew Dunstan	4603903d29	Allow json{b}_strip_nulls to remove null array elements An additional paramater ("strip_in_arrays") is added to these functions. It defaults to false. If true, then null array elements are removed as well as null valued object fields. JSON that just consists of a single null is not affected. Author: Florents Tselai <florents.tselai@gmail.com> Discussion: https://postgr.es/m/4BCECCD5-4F40-4313-9E98-9E16BEB0B01D@gmail.com	2025-03-05 10:04:02 -05:00
Peter Geoghegan	5ead85fbc8	Show index search count in EXPLAIN ANALYZE. Expose the count of index searches/index descents in EXPLAIN ANALYZE's output for index scan nodes. This information is particularly useful with scans that use ScalarArrayOp quals, where the number of index scans isn't predictable in advance (at least not with optimizations like the one added to nbtree by Postgres 17 commit `5bf748b8`). It will also be useful when EXPLAIN ANALYZE shows details of an nbtree index scan that uses skip scan optimizations set to be introduced by an upcoming patch. The instrumentation works by teaching index AMs to increment a new nsearches counter whenever a new index search begins. The counter is incremented at exactly the same point that index AMs must already increment the index's pg_stat_*_indexes.idx_scan counter (we're counting the same event, but at the scan level rather than the relation level). The new counter is stored in the scan descriptor (IndexScanDescData), which explain.c reaches by going through the scan node's PlanState. This approach doesn't match the approach used when tracking other index scan specific costs (e.g., "Rows Removed by Filter:"). It is similar to the approach used in other cases where we must track costs that are only readily accessible inside an access method, and not from the executor (e.g., "Heap Blocks:" output for a Bitmap Heap Scan). It is inherently necessary to maintain a counter that can be incremented multiple times during a single amgettuple call (or amgetbitmap call), and directly exposing PlanState.instrument to index access methods seems unappealing. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Tomas Vondra <tomas@vondra.me> Reviewed-By: Robert Haas <robertmhaas@gmail.com> Reviewed-By: Masahiro Ikeda <ikedamsh@oss.nttdata.com> Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/CAH2-Wz=PKR6rB7qbx+Vnd7eqeB5VTcrW=iJvAsTsKbdG+kW_UA@mail.gmail.com Discussion: https://postgr.es/m/CAH2-WzkRqvaqR2CTNqTZP0z6FuL4-3ED6eQB0yx38XBNj1v-4Q@mail.gmail.com	2025-03-05 09:36:48 -05:00
Heikki Linnakangas	635f580120	Rename some signal and interrupt handling functions for consistency The usual pattern for handling a signal is that the signal handler sets a flag and calls SetLatch(MyLatch), and CHECK_FOR_INTERRUPTS() or other code that is part of a wait loop calls another function to deal with it. The naming of the functions involved was a bit inconsistent, however. CHECK_FOR_INTERRUPTS() calls ProcessInterrupts() to do the heavy-lifting, but the analogous functions in aux processes were called HandleMainLoopInterrupts(), HandleStartupProcInterrupts(), etc. Similarly, most subroutines of ProcessInterrupts() were called Process(), but some were called Handle(). To make things less confusing, rename all the functions that are part of the overall signal/interrupt handling system but are not executed in a signal handler to e.g. ProcessSomething(), rather than HandleSomething(). The "Process" prefix is now consistently used in the non-signal-handler functions, and the "Handle" prefix in functions that are part of signal handlers, except for some completely unrelated functions that clearly have nothing to do with signal or interrupt handling. Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> Discussion: https://www.postgresql.org/message-id/8a384b26-1499-41f6-be33-64b801fb98b8@iki.fi	2025-03-05 16:22:26 +02:00
Álvaro Herrera	f4e53e10b6	Add ALTER TABLE ... ALTER CONSTRAINT ... SET [NO] INHERIT This allows to redefine an existing non-inheritable constraint to be inheritable, which allows to straighten up situations with NO INHERIT constraints so that thay can become normal constraints without having to re-verify existing data. For existing inheritance children this may require creating additional constraints, if they don't exist already. It also allows to do the opposite, if only for symmetry. Author: Suraj Kharage <suraj.kharage@enterprisedb.com> Reviewed-by: jian he <jian.universality@gmail.com> Discussion: https://postgr.es/m/CAF1DzPVfOW6Kk=7SSh7LbneQDJWh=PbJrEC_Wkzc24tHOyQWGg@mail.gmail.com	2025-03-05 13:50:22 +01:00
Tomas Vondra	c878de1db4	Make FP_LOCK_SLOTS_PER_BACKEND look like a function The FP_LOCK_SLOTS_PER_BACKEND macro looks like a constant, but it depends on the max_locks_per_transaction GUC, and thus can change. This is non-obvious and confusing, so make it look more like a function by renaming it to FastPathLockSlotsPerBackend(). While at it, use the macro when initializing fast-path shared memory, instead of using the formula. Reported-by: Andres Freund Discussion: https://postgr.es/m/ffiwtzc6vedo6wb4gbwelon5nefqg675t5c7an2ta7pcz646cg%40qwmkdb3l4ett	2025-03-04 18:33:12 +01:00
Michael Paquier	c76db55c90	Split pgstat_bestart() into three different routines pgstat_bestart(), used post-authentication to set up a backend entry in the PgBackendStatus array, so as its data becomes visible in pg_stat_activity and related catalogs, has its logic divided into three routines with this commit, called in order at different steps of the backend initialization: * pgstat_bestart_initial() sets up the backend entry with a minimal amount of information, reporting it with a new BackendState called STATE_STARTING while waiting for backend initialization and client authentication to complete. The main benefit that this offers is observability, so as it is possible to monitor the backend activity during authentication. This step happens earlier than in the logic prior to this commit. pgstat_beinit() happens earlier as well, before authentication. * pgstat_bestart_security() reports the SSL/GSS status of the connection, once authentication completes. Auxiliary processes, for example, do not need to call this step, hence it is optional. This step is called after performing authentication, same as previously. * pgstat_bestart_final() reports the user and database IDs, takes the entry out of STATE_STARTING, and reports its application_name. This is called as the last step of the three, once authentication completes. An injection point is added, with a test checking that the "starting" phase of a backend entry is visible in pg_stat_activity. Some follow-up patches are planned to take advantage of this refactoring with more information provided in backend entries during authentication (LDAP hanging was a problem for the author, initially). Author: Jacob Champion <jacob.champion@enterprisedb.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CAOYmi+=60deN20WDyCoHCiecgivJxr=98s7s7-C8SkXwrCfHXg@mail.gmail.com	2025-03-04 14:09:44 +09:00
Tom Lane	4528768d98	Remove now-dead code in StoreAttrDefault(). StoreAttrDefault() is no longer responsible for filling attmissingval, so remove the code for that. Get rid of RawColumnDefault.missingMode, too, as we no longer need that to pass information around. While here, clean up some sloppy coding in StoreAttrDefault(), such as failure to use XXXGetDatum macros. These aren't bugs but they're not good code either. Reported-by: jian he <jian.universality@gmail.com> Author: jian he <jian.universality@gmail.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CACJufxHFssPvkP1we7WMhPD_1kwgbG52o=kQgL+TnVoX5LOyCQ@mail.gmail.com	2025-03-03 13:09:20 -05:00
Tom Lane	95f650674d	Fix broken handling of domains in atthasmissing logic. If a domain type has a default, adding a column of that type (without any explicit DEFAULT clause) failed to install the domain's default value in existing rows, instead leaving the new column null. This is unexpected, and it used to work correctly before v11. The cause is confusion in the atthasmissing mechanism about which default value to install: we'd only consider installing an explicitly-specified default, and then we'd decide that no table rewrite is needed. To fix, take the responsibility for filling attmissingval out of StoreAttrDefault, and instead put it into ATExecAddColumn's existing logic that derives the correct value to fill the new column with. Also, centralize the logic that determines the need for default-related table rewriting there, instead of spreading it over four or five places. In the back branches, we'll leave the attmissingval-filling code in StoreAttrDefault even though it's now dead, for fear that some extension may be depending on that functionality to exist there. A separate HEAD-only patch will clean up the now-useless code. Reported-by: jian he <jian.universality@gmail.com> Author: jian he <jian.universality@gmail.com> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CACJufxHFssPvkP1we7WMhPD_1kwgbG52o=kQgL+TnVoX5LOyCQ@mail.gmail.com Backpatch-through: 13	2025-03-03 12:43:44 -05:00
Melanie Plageman	99f8f3fbbc	Add relallfrozen to pg_class Add relallfrozen, an estimate of the number of pages marked all-frozen in the visibility map. pg_class already has relallvisible, an estimate of the number of pages in the relation marked all-visible in the visibility map. This is used primarily for planning. relallfrozen, together with relallvisible, is useful for estimating the outstanding number of all-visible but not all-frozen pages in the relation for the purposes of scheduling manual VACUUMs and tuning vacuum freeze parameters. A future commit will use relallfrozen to trigger more frequent vacuums on insert-focused workloads with significant volume of frozen data. Bump catalog version Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> Reviewed-by: Robert Treat <rob@xzilla.net> Reviewed-by: Corey Huinker <corey.huinker@gmail.com> Reviewed-by: Greg Sabino Mullane <htamfids@gmail.com> Discussion: https://postgr.es/m/flat/CAAKRu_aj-P7YyBz_cPNwztz6ohP%2BvWis%3Diz3YcomkB3NpYA--w%40mail.gmail.com	2025-03-03 11:18:05 -05:00
Tomas Vondra	8492feb98f	Allow parallel CREATE INDEX for GIN indexes Allow using parallel workers to build a GIN index, similarly to BTREE and BRIN. For large tables this may result in significant speedup when the build is CPU-bound. The work is divided so that each worker builds index entries on a subset of the table, determined by the regular parallel scan used to read the data. Each worker uses a local tuplesort to sort and merge the entries for the same key. The TID lists do not overlap (for a given key), which means the merge sort simply concatenates the two lists. The merged entries are written into a shared tuplesort for the leader. The leader needs to merge the sorted entries again, before writing them into the index. But this way a significant part of the work happens in the workers, and the leader is left with merging fewer large entries, which is more efficient. Most of the parallelism infrastructure is a simplified copy of the code used by BTREE indexes, omitting the parts irrelevant for GIN indexes (e.g. uniqueness checks). Original patch by me, with reviews and substantial improvements by Matthias van de Meent, certainly enough to make him a co-author. Author: Tomas Vondra, Matthias van de Meent Reviewed-by: Matthias van de Meent, Andy Fan, Kirill Reshke Discussion: https://postgr.es/m/6ab4003f-a8b8-4d75-a67f-f25ad98582dc%40enterprisedb.com	2025-03-03 16:53:06 +01:00
Masahiko Sawada	8a1012b35d	Re-export NextCopyFromRawFields() to copy.h. Commit `7717f63006` removed NextCopyFromRawFields() from copy.h. While it was hoped that NextCopyFrom() could serve as an alternative, certain use cases still require NextCopyFromRawFields(). For instance, extensions like file_text_array_fdw, which process source data with an unknown number of columns, rely on this function. Per buildfarm member crake. Reported-by: Andrew Dunstan <andrew@dunslane.net> Reviewed-by: Andrew Dunstan <andrew@dunslane.net> Reviewed-by: Sutou Kouhei <kou@clear-code.com> Discussion: https://postgr.es/m/5c7e1ac8-5083-4c08-af19-cb9ade2f16ce@dunslane.net	2025-02-28 15:11:41 -08:00
Masahiko Sawada	7717f63006	Refactor COPY FROM to use format callback functions. This commit introduces a new CopyFromRoutine struct, which is a set of callback routines to read tuples in a specific format. It also makes COPY FROM with the existing formats (text, CSV, and binary) utilize these format callbacks. This change is a preliminary step towards making the COPY FROM command extensible in terms of input formats. Similar to `2e4127b6d2`, this refactoring contributes to a performance improvement by reducing the number of "if" branches that need to be checked on a per-row basis when sending field representations in text or CSV mode. The performance benchmark results showed ~5% performance gain in text or CSV mode. Author: Sutou Kouhei <kou@clear-code.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Reviewed-by: Junwang Zhao <zhjwpku@gmail.com> Discussion: https://postgr.es/m/20231204.153548.2126325458835528809.kou@clear-code.com	2025-02-28 10:29:36 -08:00
Robert Haas	77cb08be51	Avoid including explain.h in explain_format.h and explain_dr.h As per a suggestion from Tom Lane, we do this by declaring "struct ExplainState" here and refer to that rather than "ExplainState". Also per Tom, CreateExplainSerializeDestReceiver was still defined in explain.h in addition to explain_dr.h. Remove leftover prototype. Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: http://postgr.es/m/CA+TgmoYtaad3i21V0jqua-fbr+CR0ix6uBvEX8_s6BG96abd=g@mail.gmail.com	2025-02-28 13:17:29 -05:00
Michael Paquier	c2a50ac678	Invent pgstat_fetch_stat_backend_by_pid() This code is extracted from pg_stat_get_backend_io() in pgstatfuncs.c, so as it can be shared with other areas that need backend pgstats entries while having the benefits of the various sanity checks refactored here. As per its name, this retrieves backend statistics based on a PID, with the option of retrieving a BackendType if given in input. Currently, this is used for the backend-level IO statistics. The next move would be to reuse that for the backend-level WAL statistics. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/Z3zqc4o09dM/Ezyz@ip-10-97-1-34.eu-west-3.compute.internal	2025-02-28 11:20:31 +09:00
Masahiko Sawada	2e4127b6d2	Refactor COPY TO to use format callback functions. This commit introduces a new CopyToRoutine struct, which is a set of callback routines to copy tuples in a specific format. It also makes the existing formats (text, CSV, and binary) utilize these format callbacks. This change is a preliminary step towards making the COPY TO command extensible in terms of output formats. Additionally, this refactoring contributes to a performance improvement by reducing the number of "if" branches that need to be checked on a per-row basis when sending field representations in text or CSV mode. The performance benchmark results showed ~5% performance gain in text or CSV mode. Author: Sutou Kouhei <kou@clear-code.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Reviewed-by: Junwang Zhao <zhjwpku@gmail.com> Discussion: https://postgr.es/m/20231204.153548.2126325458835528809.kou@clear-code.com	2025-02-27 15:03:52 -08:00
Robert Haas	555960a0fb	Create explain_dr.c and move DestReceiver-related code there. explain.c has grown rather large, and the code that deals with the DestReceiver that supports the SERIALIZE option is pretty easily severable from the rest of explain.c; hence, move it to a separate file. Reviewed-by: Peter Geoghegan <pg@bowt.ie> Discussion: http://postgr.es/m/CA+TgmoYutMw1Jgo8BWUmB3TqnOhsEAJiYO=rOQufF4gPLWmkLQ@mail.gmail.com	2025-02-27 13:14:16 -05:00
Robert Haas	9173e8b604	Create explain_format.c and move relevant code there. explain.c has grown rather large, so move various functions that are principally concerned with output generation to a new source file, explain_format.c, instead of lumping them in with everything else that is part of explain.c Reviewed-by: Peter Geoghegan <pg@bowt.ie> Discussion: http://postgr.es/m/CA+TgmoYutMw1Jgo8BWUmB3TqnOhsEAJiYO=rOQufF4gPLWmkLQ@mail.gmail.com	2025-02-27 12:37:10 -05:00
Peter Eisentraut	ce62f2f2a0	Generalize hash and ordering support in amapi Stop comparing access method OID values against HASH_AM_OID and BTREE_AM_OID, and instead check the IndexAmRoutine for an index to see if it advertises its ability to perform the necessary ordering, hashing, or cross-type comparing functionality. A field amcanorder already existed, this uses it more widely. Fields amcanhash and amcancrosscompare are added for the other purposes. Author: Mark Dilger <mark.dilger@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com	2025-02-27 17:03:31 +01:00
Tom Lane	c8c74ad7e1	Get rid of O(N^2) script-parsing overhead in pgbench. pgbench wants to record the starting line number of each command in its scripts. It was computing that by scanning from the script start and counting newlines, so that O(N^2) work had to be done for an N-command script. In a script with 50K lines, this adds up to about 10 seconds on my machine. To add insult to injury, the results were subtly wrong, because expr_scanner_offset() scanned to find the NUL that flex inserts at the end of the current token --- and before the first yylex call, no such NUL has been inserted. So we ended by computing the script's last line number not its first one. This was visible only in case of \gset at the start of a script, which perhaps accounts for the lack of complaints. To fix, steal an idea from plpgsql and track the current lexer ending position and line count as we advance through the script. (It's a bit simpler than plpgsql since we can't need to back up.) Also adjust a couple of other places that were invoking scans from script start when they didn't really need to. I made a new psqlscan function psql_scan_get_location() that replaces both expr_scanner_offset() and expr_scanner_get_lineno(), since in practice expr_scanner_get_lineno() was only being invoked to find the line number of the current lexer end position. Reported-by: Daniel Vérité <daniel@manitou-mail.org> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/84a8a89e-adb8-47a9-9d34-c13f7150ee45@manitou-mail.org	2025-02-27 10:53:38 -05:00
Jeff Davis	15df9d7b51	Remove stray diff introduced by `a5cbdeb98a`. Reported-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/Z77IkjmmfbFfNh3f@paquier.xyz	2025-02-26 13:37:14 -08:00
Michael Paquier	0e42d31b0b	Adding new PgStat_WalCounters structure in pgstat.h This new structure contains the counters and the data related to the WAL activity statistics gathered from WalUsage, separated into its own structure so as it can be shared across more than one Stats structure in pg_stat.h. This refactoring will be used by an upcoming patch introducing backend-level WAL statistics. Bump PGSTAT_FILE_FORMAT_ID. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/Z3zqc4o09dM/Ezyz@ip-10-97-1-34.eu-west-3.compute.internal	2025-02-26 16:48:54 +09:00
Michael Paquier	d7cbeaf261	Remove pgstat_flush_wal() All the processes that generate WAL should call pgstat_report_wal() to report all their statistics related to WAL, and this is already what happens in the tree. Keeping pgstat_report_wal() is confusing while the other routine is encouraged. This routine is not required since `fc415edf8c`, where it was lastly used in pgstat_report_stat() before an equivalent callback existed. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/Z71oPkJJICrRB5Ws@paquier.xyz	2025-02-26 15:37:28 +09:00
Michael Paquier	6c349d83b6	Re-add GUC track_wal_io_timing This commit is a rework of `2421e9a51d`, about which Andres Freund has raised some concerns as it is valuable to have both track_io_timing and track_wal_io_timing in some cases, as the WAL write and fsync paths can be a major bottleneck for some workloads. Hence, it can be relevant to not calculate the WAL timings in environments where pg_test_timing performs poorly while capturing some IO data under track_io_timing for the non-WAL IO paths. The opposite can be also true: it should be possible to disable the non-WAL timings and enable the WAL timings (the previous GUC setups allowed this possibility). track_wal_io_timing is added back in this commit, controlling if WAL timings should be calculated in pg_stat_io for the read, fsync and write paths, as done previously with pg_stat_wal. pg_stat_wal previously tracked only the sync and write parts (now removed), read stats is new data tracked in pg_stat_io, all three are aggregated if track_wal_io_timing is enabled. The read part matters during recovery or if a XLogReader is used. Extra note: more control over if the types of timings calculated in pg_stat_io could be done with a GUC that lists pairs of (IOObject,IOOp). Reported-by: Andres Freund <andres@anarazel.de> Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Co-authored-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/3opf2wh2oljco6ldyqf7ukabw3jijnnhno6fjb4mlu6civ5h24@fcwmhsgmlmzu	2025-02-26 09:49:59 +09:00
Jeff Davis	a5cbdeb98a	Remove redundant pg_set__stats() variants. After commit `f3dae2ae58`, the primary purpose of separating the pg_set__stats() from the pg_restore_*_stats() variants was eliminated. Leave pg_restore_relation_stats() and pg_restore_attribute_stats(), which satisfy both purposes, and remove pg_set_relation_stats() and pg_set_attribute_stats(). Reviewed-by: Corey Huinker <corey.huinker@gmail.com> Discussion: https://postgr.es/m/1457469.1740419458@sss.pgh.pa.us	2025-02-25 16:15:47 -08:00
Andres Freund	37c87e63f9	Change relpath() et al to return path by value For AIO, and also some other recent patches, we need the ability to call relpath() in a critical section. Until now that was not feasible, as it allocated memory. The fact that relpath() allocated memory also made it awkward to use in log messages because we had to take care to free the memory afterwards. Which we e.g. didn't do for when zeroing out an invalid buffer. We discussed other solutions, e.g. filling a pre-allocated buffer that's passed to relpath(), but they all came with plenty downsides or were larger projects. The easiest fix seems to be to make relpath() return the path by value. To be able to return the path by value we need to determine the maximum length of a relation path. This patch adds a long #define that computes the exact maximum, which is verified to be correct in a regression test. As this change the signature of relpath(), extensions using it will need to adapt their code. We discussed leaving a backward-compat shim in place, but decided it's not worth it given the use of relpath() doesn't seem widespread. Discussion: https://postgr.es/m/xeri5mla4b5syjd5a25nok5iez2kr3bm26j2qn4u7okzof2bmf@kwdh2vf7npra	2025-02-25 09:02:07 -05:00
Richard Guo	363a6e8c6f	Eliminate code duplication in replace_rte_variables callbacks The callback functions ReplaceVarsFromTargetList_callback and pullup_replace_vars_callback are both used to replace Vars in an expression tree that reference a particular RTE with items from a targetlist, and they both need to expand whole-tuple references and deal with OLD/NEW RETURNING list Vars. As a result, currently there is significant code duplication between these two functions. This patch introduces a new function, ReplaceVarFromTargetList, to perform the replacement and calls it from both callback functions, thereby eliminating code duplication. Author: Dean Rasheed <dean.a.rasheed@gmail.com> Author: Richard Guo <guofenglinux@gmail.com> Reviewed-by: Jian He <jian.universality@gmail.com> Discussion: https://postgr.es/m/CAEZATCWhr=FM4X5kCPvVs-g2XEk+ceLsNtBK_zZMkqFn9vUjsw@mail.gmail.com	2025-02-25 16:11:34 +09:00
Richard Guo	1e4351af32	Expand virtual generated columns in the planner Commit `83ea6c540` added support for virtual generated columns that are computed on read. All Var nodes in the query that reference virtual generated columns must be replaced with the corresponding generation expressions. Currently, this replacement occurs in the rewriter. However, this approach has several issues. If a Var referencing a virtual generated column has any varnullingrels, those varnullingrels need to be propagated into the generation expression. Failing to do so can lead to "wrong varnullingrels" errors and improper outer-join removal. Additionally, if such a Var comes from the nullable side of an outer join, we may need to wrap the generation expression in a PlaceHolderVar to ensure that it is evaluated at the right place and hence is forced to null when the outer join should do so. In certain cases, such as when the query uses grouping sets, we also need a PlaceHolderVar for anything that is not a simple Var to isolate subexpressions. Failure to do so can result in incorrect results. To fix these issues, this patch expands the virtual generated columns in the planner rather than in the rewriter, and leverages the pullup_replace_vars architecture to avoid code duplication. The generation expressions will be correctly marked with nullingrel bits and wrapped in PlaceHolderVars when needed by the pullup_replace_vars callback function. This requires handling the OLD/NEW RETURNING list Vars in pullup_replace_vars_callback, as it may now deal with Vars referencing the result relation instead of a subquery. The "wrong varnullingrels" error was reported by Alexander Lakhin. The incorrect result issue and the improper outer-join removal issue were reported by Richard Guo. Author: Richard Guo <guofenglinux@gmail.com> Author: Dean Rasheed <dean.a.rasheed@gmail.com> Reviewed-by: Jian He <jian.universality@gmail.com> Discussion: https://postgr.es/m/75eb1a6f-d59f-42e6-8a78-124ee808cda7@gmail.com	2025-02-25 16:10:25 +09:00
Melanie Plageman	bfe56cdf9a	Delay extraction of TIDBitmap per page offsets Pages from the bitmap created by the TIDBitmap API can be exact or lossy. The TIDBitmap API extracts the tuple offsets from exact pages into an array for the convenience of the caller. This was done in tbm_private\|shared_iterate() right after advancing the iterator. However, as long as tbm_private\|shared_iterate() set a reference to the PagetableEntry in the TBMIterateResult, the offset extraction can be done later. Waiting to extract the tuple offsets has a few benefits. For the shared iterator case, it allows us to extract the offsets after dropping the shared iterator state lock, reducing time spent holding a contended lock. Separating the iteration step and extracting the offsets later also allows us to avoid extracting the offsets for prefetched blocks. Those offsets were never used, so the overhead of extracting and storing them was wasted. The real motivation for this change, however, is that future commits will make bitmap heap scan use the read stream API. This requires a TBMIterateResult per issued block. By removing the array of tuple offsets from the TBMIterateResult and only extracting the offsets when they are used, we reduce the memory required for per buffer data substantially. Suggested-by: Thomas Munro <thomas.munro@gmail.com> Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Discussion: https://postgr.es/m/CA%2BhUKGLHbKP3jwJ6_%2BhnGi37Pw3BD5j2amjV3oSk7j-KyCnY7Q%40mail.gmail.com	2025-02-24 16:10:19 -05:00
Melanie Plageman	b8778c4cd8	Add lossy indicator to TBMIterateResult TBMIterateResult->ntuples is -1 when the page in the bitmap is lossy. Add an explicit lossy indicator so that we can move ntuples out of the TBMIterateResult in a future commit. Discussion: https://postgr.es/m/CA%2BhUKGLHbKP3jwJ6_%2BhnGi37Pw3BD5j2amjV3oSk7j-KyCnY7Q%40mail.gmail.com	2025-02-24 16:10:13 -05:00
Nathan Bossart	c56e8af75e	Fix comment for MAX_BACKENDS. This comment mentions that we check that the configured number of backends does not exceed MAX_BACKENDS in RegisterBackgroundWorker() and relevant GUC check hooks, neither of which has those checks anymore. To fix, adjust this comment to say that we do the check in InitializeMaxBackends(). Oversights in commits `6bc8ef0b7f` and `0b1fe1413e`. Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/Z7zOEzz8lNjaU9yf%40nathan	2025-02-24 15:02:09 -06:00
Andres Freund	5ee75e32fa	Add static asserts for MAX_BACKENDS limiting factors So far the various dependencies were documented in the comment above MAX_BACKENDS, but not checked. Discussion: https://postgr.es/m/CA+COZaBO_s3LfALq=b+HcBHFSOEGiApVjrRacCe4VP9m7CJsNQ@mail.gmail.com	2025-02-24 06:23:41 -05:00
Andres Freund	418451bfe1	bufmgr: Make it easier to change number of buffer state bits In an upcoming commit I'd like to change the number of bits for the usage count (the current max is 5, fitting in three bits, but we reserve four bits). Until now that required adjusting a bunch of magic constants, now the constants are defined based on the number of bits reserved. Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Discussion: https://postgr.es/m/lxzj26ga6ippdeunz6kuncectr5gfuugmm2ry22qu6hcx6oid6@lzx3sjsqhmt6 Discussion: https://postgr.es/m/riivolmg6uzfvpzfn6wjo3ghwt42rcec43ok6mv4oenfg654y7@x7dbposbskwd	2025-02-24 06:23:41 -05:00
Andres Freund	6394a3a61c	Move MAX_BACKENDS to procnumber.h MAX_BACKENDS influences many things besides postmaster. I e.g. noticed that we don't have static assertions ensuring BUF_REFCOUNT_MASK is big enough for MAX_BACKENDS, adding them would require including postmaster.h in buf_internals.h which doesn't seem right. While at that, add MAX_BACKENDS_BITS, as that's useful in various places for static assertions (to be added in subsequent commits). Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Discussion: https://postgr.es/m/wptizm4qt6yikgm2pt52xzyv6ycmqiutloyvypvmagn7xvqkce@d4xuv3mylpg4	2025-02-24 06:23:41 -05:00
John Naylor	0600d276d4	Silence warning in older versions of Valgrind Due to misunderstanding on my part, commit `235328ee4` did not go far enough to silence older versions of Valgrind. For those, it was the bit scan that was problematic, not the subsequent bit-masking operation. To fix, use the unaligned path for the trailing bytes. Since we don't have a bit scan here anymore, also remove some comments and endian-specific coding around that. Reported-by: Anton A. Melnikov <a.melnikov@postgrespro.ru> Discussion: https://postgr.es/m/f3aa2d45-3b28-41c5-9499-a1bc30e0f8ec@postgrespro.ru Backpatch-through: 17	2025-02-24 18:03:29 +07:00
Michael Paquier	2421e9a51d	Remove read/sync fields from pg_stat_wal and GUC track_wal_io_timing The four following attributes are removed from pg_stat_wal: * wal_write * wal_sync * wal_write_time * wal_sync_time `a051e71e28` has added an equivalent of this information in pg_stat_io with more granularity as this now spreads across the backend types, IO context and IO objects. So, keeping the same information in pg_stat_wal has little benefits. Another benefit of this commit is the removal of PendingWalStats, simplifying an upcoming patch to add per-backend WAL statistics, which already support IO statistics and which have access to the write/sync stats data of WAL. The GUC track_wal_io_timing, that was used to enable or disable the aggregation of the write and sync timings for WAL, is also removed. pgstat_prepare_io_time() is simplified. Bump catalog version. Bump PGSTAT_FILE_FORMAT_ID, due to the update of PgStat_WalStats. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/Z7RkQ0EfYaqqjgz/@ip-10-97-1-34.eu-west-3.compute.internal	2025-02-24 09:51:56 +09:00
Peter Eisentraut	454c182f85	backend libpq void * argument for binary data Change some backend libpq functions to take void * for binary data instead of char *. This removes the need for numerous casts. Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> Discussion: https://www.postgresql.org/message-id/flat/fd1fcedb-3492-4fc8-9e3e-74b97f2db6c7%40eisentraut.org	2025-02-23 14:27:02 +01:00
Andres Freund	f8d7f29b3e	Allow lwlocks to be disowned To implement AIO writes, the backend initiating writes needs to transfer the lock ownership to the AIO subsystem, so the lock held during the write can be released in another backend. Other backends need to be able to "complete" an asynchronously started IO to avoid deadlocks (consider e.g. one backend starting IO for a buffer and then waiting for a heavyweight lock held by another relation followed by the current holder of the heavyweight lock waiting for the IO to complete). To that end, this commit adds LWLockDisown() and LWLockReleaseDisowned(). If code uses LWLockDisown() it's the code's responsibility to ensure that the lock is released in case of errors. Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Discussion: https://postgr.es/m/1f6b50a7-38ef-4d87-8246-786d39f46ab9@iki.fi	2025-02-21 20:55:23 -05:00
Masahiko Sawada	44fe30fdab	Add default_char_signedness field to ControlFileData. The signedness of the 'char' type in C is implementation-dependent. For instance, 'signed char' is used by default on x86 CPUs, while 'unsigned char' is used on aarch CPUs. Previously, we accidentally let C implementation signedness affect persistent data. This led to inconsistent results when comparing char data across different platforms. This commit introduces a new 'default_char_signedness' field in ControlFileData to store the signedness of the 'char' type. While this change does not encourage the use of 'char' without explicitly specifying its signedness, this field can be used as a hint to ensure consistent behavior for pre-v18 data files that store data sorted by the 'char' type on disk (e.g., GIN and GiST indexes), especially in cross-platform replication scenarios. Newly created database clusters unconditionally set the default char signedness to true. pg_upgrade (with an upcoming commit) changes this flag for clusters if the source database cluster has signedness=false. As a result, signedness=false setting will become rare over time. If we had known about the problem during the last development cycle that forced initdb (v8.3), we would have made all clusters signed or all clusters unsigned. Making pg_upgrade the only source of signedness=false will cause the population of database clusters to converge toward that retrospective ideal. Bump catalog version (for the catalog changes) and PG_CONTROL_VERSION (for the additions in ControlFileData). Reviewed-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/CB11ADBC-0C3F-4FE0-A678-666EE80CBB07%40amazon.com	2025-02-21 10:12:08 -08:00
Peter Eisentraut	7d6d2c4bbd	Drop opcintype from index AM strategy translation API The type argument wasn't actually really necessary. It was a remnant of converting the API of the gist strategy translation from using opclass to using opfamily+opcintype (commits `c09e5a6a01`, `622f678c10`). For looking up the gist translation function, we used the convention "amproclefttype = amprocrighttype = opclass's opcintype" (see pg_amproc.h). But each operator family should only have one translation function, and getting the right type for the lookup is sometimes cumbersome and fragile, so this is all unnecessarily complicated. To simplify this, change the gist stategy support procedure to take "any", "any" as argument. (This is arbitrary but seems intuitive. The alternative of using InvalidOid as argument(s) upsets various DDL commands, so it's not practical.) Then we don't need opcintype for the lookup, and we can remove it from all the API layers introduced by commit `c09e5a6a01`. This also adds some more documentation about the correct signature of the gist support function and adds more checks in gistvalidate(). This was previously underspecified. (It relied implicitly on convention mentioned above.) Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com	2025-02-21 09:07:16 +01:00
Peter Eisentraut	7202d72787	backend launchers void * arguments for binary data Change backend launcher functions to take void * for binary data instead of char *. This removes the need for numerous casts. Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> Discussion: https://www.postgresql.org/message-id/flat/fd1fcedb-3492-4fc8-9e3e-74b97f2db6c7%40eisentraut.org	2025-02-21 08:03:33 +01:00
Daniel Gustafsson	b3f0be788a	Add support for OAUTHBEARER SASL mechanism This commit implements OAUTHBEARER, RFC 7628, and OAuth 2.0 Device Authorization Grants, RFC 8628. In order to use this there is a new pg_hba auth method called oauth. When speaking to a OAuth- enabled server, it looks a bit like this: $ psql 'host=example.org oauth_issuer=... oauth_client_id=...' Visit https://oauth.example.org/login and enter the code: FPQ2-M4BG Device authorization is currently the only supported flow so the OAuth issuer must support that in order for users to authenticate. Third-party clients may however extend this and provide their own flows. The built-in device authorization flow is currently not supported on Windows. In order for validation to happen server side a new framework for plugging in OAuth validation modules is added. As validation is implementation specific, with no default specified in the standard, PostgreSQL does not ship with one built-in. Each pg_hba entry can specify a specific validator or be left blank for the validator installed as default. This adds a requirement on libcurl for the client side support, which is optional to build, but the server side has no additional build requirements. In order to run the tests, Python is required as this adds a https server written in Python. Tests are gated behind PG_TEST_EXTRA as they open ports. This patch has been a multi-year project with many contributors involved with reviews and in-depth discussions: Michael Paquier, Heikki Linnakangas, Zhihong Yu, Mahendrakar Srinivasarao, Andrey Chudnovsky and Stephen Frost to name a few. While Jacob Champion is the main author there have been some levels of hacking by others. Daniel Gustafsson contributed the validation module and various bits and pieces; Thomas Munro wrote the client side support for kqueue. Author: Jacob Champion <jacob.champion@enterprisedb.com> Co-authored-by: Daniel Gustafsson <daniel@yesql.se> Co-authored-by: Thomas Munro <thomas.munro@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Reviewed-by: Antonin Houska <ah@cybertec.at> Reviewed-by: Kashif Zeeshan <kashi.zeeshan@gmail.com> Discussion: https://postgr.es/m/d1b467a78e0e36ed85a09adf979d04cf124a9d4b.camel@vmware.com	2025-02-20 16:25:17 +01:00
Amit Langote	525392d572	Don't lock partitions pruned by initial pruning Before executing a cached generic plan, AcquireExecutorLocks() in plancache.c locks all relations in a plan's range table to ensure the plan is safe for execution. However, this locks runtime-prunable relations that will later be pruned during "initial" runtime pruning, introducing unnecessary overhead. This commit defers locking for such relations to executor startup and ensures that if the CachedPlan is invalidated due to concurrent DDL during this window, replanning is triggered. Deferring these locks avoids unnecessary locking overhead for pruned partitions, resulting in significant speedup, particularly when many partitions are pruned during initial runtime pruning. * Changes to locking when executing generic plans: AcquireExecutorLocks() now locks only unprunable relations, that is, those found in PlannedStmt.unprunableRelids (introduced in commit `cbc127917e`), to avoid locking runtime-prunable partitions unnecessarily. The remaining locks are taken by ExecDoInitialPruning(), which acquires them only for partitions that survive pruning. This deferral does not affect the locks required for permission checking in InitPlan(), which takes place before initial pruning. ExecCheckPermissions() now includes an Assert to verify that all relations undergoing permission checks, none of which can be in the set of runtime-prunable relations, are properly locked. * Plan invalidation handling: Deferring locks introduces a window where prunable relations may be altered by concurrent DDL, invalidating the plan. A new function, ExecutorStartCachedPlan(), wraps ExecutorStart() to detect and handle invalidation caused by deferred locking. If invalidation occurs, ExecutorStartCachedPlan() updates CachedPlan using the new UpdateCachedPlan() function and retries execution with the updated plan. To ensure all code paths that may be affected by this handle invalidation properly, all callers of ExecutorStart that may execute a PlannedStmt from a CachedPlan have been updated to use ExecutorStartCachedPlan() instead. UpdateCachedPlan() replaces stale plans in CachedPlan.stmt_list. A new CachedPlan.stmt_context, created as a child of CachedPlan.context, allows freeing old PlannedStmts while preserving the CachedPlan structure and its statement list. This ensures that loops over statements in upstream callers of ExecutorStartCachedPlan() remain intact. ExecutorStart() and ExecutorStart_hook implementations now return a boolean value indicating whether plan initialization succeeded with a valid PlanState tree in QueryDesc.planstate, or false otherwise, in which case QueryDesc.planstate is NULL. Hook implementations are required to call standard_ExecutorStart() at the beginning, and if it returns false, they should do the same without proceeding. * Testing: To verify these changes, the delay_execution module tests scenarios where cached plans become invalid due to changes in prunable relations after deferred locks. * Note to extension authors: ExecutorStart_hook implementations must verify plan validity after calling standard_ExecutorStart(), as explained earlier. For example: if (prev_ExecutorStart) plan_valid = prev_ExecutorStart(queryDesc, eflags); else plan_valid = standard_ExecutorStart(queryDesc, eflags); if (!plan_valid) return false; <extension-code> return true; Extensions accessing child relations, especially prunable partitions, via ExecGetRangeTableRelation() must now ensure their RT indexes are present in es_unpruned_relids (introduced in commit `cbc127917e`), or they will encounter an error. This is a strict requirement after this change, as only relations in that set are locked. The idea of deferring some locks to executor startup, allowing locks for prunable partitions to be skipped, was first proposed by Tom Lane. Reviewed-by: Robert Haas <robertmhaas@gmail.com> (earlier versions) Reviewed-by: David Rowley <dgrowleyml@gmail.com> (earlier versions) Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> (earlier versions) Reviewed-by: Tomas Vondra <tomas@vondra.me> Reviewed-by: Junwang Zhao <zhjwpku@gmail.com> Discussion: https://postgr.es/m/CA+HiwqFGkMSge6TgC9KQzde0ohpAycLQuV7ooitEEpbKB0O_mg@mail.gmail.com	2025-02-20 17:09:48 +09:00
Álvaro Herrera	80d7f99049	Add ATAlterConstraint struct for ALTER .. CONSTRAINT Replace the use of Constraint with a new ATAlterConstraint struct, which allows us to pass additional information. No functionality is added by this commit. This is necessary for future work that allows altering constraints in other ways. I (Álvaro) took the liberty of restructuring the code for ALTER CONSTRAINT beyond what Amul did. The original coding before Amul's patch was unnecessarily baroque, and this change makes things simpler by removing one level of subroutine. Also, partly remove the assumption that only partitioned tables are relevant (by passing sensible 'recurse' arguments) and no longer ignore whether ONLY was specified. I say 'partly' because the current coding only walks down via the 'conparentid' relationship, which is only used for partitioned tables; but future patches could handle ONLY or not for other types of constraint changes for legacy inheritance trees too. Author: Amul Sul <sulamul@gmail.com> Author: Álvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/CAAJ_b94bfgPV-8Mw_HwSBeheVwaK9=5s+7+KbBj_NpwXQFgDGg@mail.gmail.com	2025-02-19 13:06:13 +01:00
Alexander Korotkov	e983ee9380	Improve statistics estimation for single-column GROUP BY in sub-queries This commit follows the idea of the `4767bc8ff2`. If sub-query has only one GROUP BY column, we can consider its output variable as being unique. We can employ this fact in the statistics to make more precise estimations in the upper query block. Author: Andrei Lepikhov <lepihov@gmail.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>	2025-02-19 11:59:30 +02:00
Amit Kapila	ac0e33136a	Invalidate inactive replication slots. This commit introduces idle_replication_slot_timeout GUC that allows inactive slots to be invalidated at the time of checkpoint. Because checkpoints happen checkpoint_timeout intervals, there can be some lag between when the idle_replication_slot_timeout was exceeded and when the slot invalidation is triggered at the next checkpoint. To avoid such lags, users can force a checkpoint to promptly invalidate inactive slots. Note that the idle timeout invalidation mechanism is not applicable for slots that do not reserve WAL or for slots on the standby server that are synced from the primary server (i.e., standby slots having 'synced' field 'true'). Synced slots are always considered to be inactive because they don't perform logical decoding to produce changes. The slots can become inactive for a long period if a subscriber is down due to a system error or inaccessible because of network issues. If such a situation persists, it might be more practical to recreate the subscriber rather than attempt to recover the node and wait for it to catch up which could be time-consuming. Then, external tools could create replication slots (e.g., for migrations or upgrades) that may fail to remove them if an error occurs, leaving behind unused slots that take up space and resources. Manually cleaning them up can be tedious and error-prone, and without intervention, these lingering slots can cause unnecessary WAL retention and system bloat. As the duration of idle_replication_slot_timeout is in minutes, any test using that would be time-consuming. We are planning to commit a follow up patch for tests by using the injection point framework. Author: Nisha Moond <nisha.moond412@gmail.com> Author: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Reviewed-by: Peter Smith <smithpb2250@gmail.com> Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Vignesh C <vignesh21@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Hou Zhijie <houzj.fnst@fujitsu.com> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/CALj2ACW4aUe-_uFQOjdWCEN-xXoLGhmvRFnL8SNw_TZ5nJe+aw@mail.gmail.com Discussion: https://postgr.es/m/OS0PR01MB5716C131A7D80DAE8CB9E88794FC2@OS0PR01MB5716.jpnprd01.prod.outlook.com	2025-02-19 09:29:50 +05:30
Tom Lane	b464e51ab3	Update to latest Snowball sources. It's been some time since we did this, partly because the upstream snowball project hasn't formally tagged a new release since 2021. The main motivation for doing it now is to absorb a bug fix (their commit e322673a841d9abd69994ae8cd20e191090b6ef4), which prevents a null pointer dereference crash if SN_create_env() gets a malloc failure at just the wrong point. We'll patch the back branches with only that change, but we might as well do the full sync dance on HEAD. Aside from a bunch of mostly-minor tweaks to existing stemmers, this update adds a new stemmer for Estonian. It also removes the existing stemmer for Romanian using ISO-8859-2 encoding. Upstream apparently concluded that ISO-8859-2 doesn't provide an adequate representation of some Romanian characters, and the UTF-8 implementation should be used instead. While at it, update the README's instructions for doing a sync, which have not been adjusted during the addition of meson tooling. Thanks to Maksim Korotkov for discovering the null-pointer bug and submitting the fix to upstream snowball. Reported-by: Maksim Korotkov <m.korotkov@postgrespro.ru> Discussion: https://postgr.es/m/1d1a46-67ab1000-21-80c451@83151435	2025-02-18 21:13:54 -05:00
Daniel Gustafsson	95ef3d9029	Add PGErrorVerbosity to typedefs.list PGErrorVerbosity was missing which resulted in incorrect whitespace alignment going back all the way to `e3860ffa4d`. No backpatch for this though since we don't pgindent backbranches. Author: Jelte Fennema-Nio <postgres@jeltef.nl> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://postgr.es/m/CAGECzQTVi8n-HW4Q27je-b9ckQk7zf6bS_it42gNvQu+DX0NCQ@mail.gmail.com	2025-02-18 13:23:13 +01:00
Alexander Korotkov	fc069a3a63	Implement Self-Join Elimination The Self-Join Elimination (SJE) feature removes an inner join of a plain table to itself in the query tree if it is proven that the join can be replaced with a scan without impacting the query result. Self-join and inner relation get replaced with the outer in query, equivalence classes, and planner info structures. Also, the inner restrictlist moves to the outer one with the removal of duplicated clauses. Thus, this optimization reduces the length of the range table list (this especially makes sense for partitioned relations), reduces the number of restriction clauses and, in turn, selectivity estimations, and potentially improves total planner prediction for the query. This feature is dedicated to avoiding redundancy, which can appear after pull-up transformations or the creation of an EquivalenceClass-derived clause like the below. SELECT * FROM t1 WHERE x IN (SELECT t3.x FROM t1 t3); SELECT * FROM t1 WHERE EXISTS (SELECT t3.x FROM t1 t3 WHERE t3.x = t1.x); SELECT * FROM t1,t2, t1 t3 WHERE t1.x = t2.x AND t2.x = t3.x; In the future, we could also reduce redundancy caused by subquery pull-up after unnecessary outer join removal in cases like the one below. SELECT * FROM t1 WHERE x IN (SELECT t3.x FROM t1 t3 LEFT JOIN t2 ON t2.x = t1.x); Also, it can drastically help to join partitioned tables, removing entries even before their expansion. The SJE proof is based on innerrel_is_unique() machinery. We can remove a self-join when for each outer row: 1. At most, one inner row matches the join clause; 2. Each matched inner row must be (physically) the same as the outer one; 3. Inner and outer rows have the same row mark. In this patch, we use the next approach to identify a self-join: 1. Collect all merge-joinable join quals which look like a.x = b.x; 2. Add to the list above the baseretrictinfo of the inner table; 3. Check innerrel_is_unique() for the qual list. If it returns false, skip this pair of joining tables; 4. Check uniqueness, proved by the baserestrictinfo clauses. To prove the possibility of self-join elimination, the inner and outer clauses must match exactly. The relation replacement procedure is not trivial and is partly combined with the one used to remove useless left joins. Tests covering this feature were added to join.sql. Some of the existing regression tests changed due to self-join removal logic. Discussion: https://postgr.es/m/flat/64486b0b-0404-e39e-322d-0801154901f3%40postgrespro.ru Author: Andrey Lepikhov <a.lepikhov@postgrespro.ru> Author: Alexander Kuzmenkov <a.kuzmenkov@postgrespro.ru> Co-authored-by: Alexander Korotkov <aekorotkov@gmail.com> Co-authored-by: Alena Rybakina <lena.ribackina@yandex.ru> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Simon Riggs <simon@2ndquadrant.com> Reviewed-by: Jonathan S. Katz <jkatz@postgresql.org> Reviewed-by: David Rowley <david.rowley@2ndquadrant.com> Reviewed-by: Thomas Munro <thomas.munro@enterprisedb.com> Reviewed-by: Konstantin Knizhnik <k.knizhnik@postgrespro.ru> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Hywel Carver <hywel@skillerwhale.com> Reviewed-by: Laurenz Albe <laurenz.albe@cybertec.at> Reviewed-by: Ronan Dunklau <ronan.dunklau@aiven.io> Reviewed-by: vignesh C <vignesh21@gmail.com> Reviewed-by: Zhihong Yu <zyu@yugabyte.com> Reviewed-by: Greg Stark <stark@mit.edu> Reviewed-by: Jaime Casanova <jcasanov@systemguards.com.ec> Reviewed-by: Michał Kłeczek <michal@kleczek.org> Reviewed-by: Alena Rybakina <lena.ribackina@yandex.ru> Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>	2025-02-17 12:44:12 +02:00
Alexander Korotkov	3fb58625d1	Revert: Get rid of WALBufMappingLock This commit reverts `6a2275b895`. Buildfarm failure on batta spots some concurrency issue, which requires further investigation.	2025-02-17 12:35:28 +02:00
Amit Langote	75dfde1363	Fix an oversight in `cbc127917` to handle MERGE correctly ExecInitModifyTable() forgot to trim MERGE-related lists to exclude entries for result relations pruned during initial pruning, so fix that. While at it, make the function's use of the pruned resultRelations list, rather than ModifyTable.resultRelations, more consistent. Reported-by: Alexander Lakhin <exclusion@gmail.com> (via sqlsmith) Reviewed-by: Junwang Zhao <zhjwpku@gmail.com> Discussion: https://postgr.es/m/e72c94d9-e5f9-4753-9bc1-69d72bd54b8a@gmail.com	2025-02-17 16:12:03 +09:00
Michael Paquier	eaf502747b	Move wal_buffers_full from PgStat_PendingWalStats to WalUsage wal_buffers_full has been introduced in pg_stat_wal in `8d9a935965`, as some information providing metrics for the tuning of the GUC wal_buffers. WalUsage has been introduced before that in `df3b181499`. Moving this field is proving to be beneficial for several reasons: - This information can now be made available in more layers, providing more granularity than just pg_stat_wal, on a per-query basis: EXPLAIN, pgss and VACUUM/ANALYZE logs. - A patch is under discussion to provide statistics for WAL at backend level, and this move simplifies a bit the handling of pending statistics. The remaining data in PgStat_PendingWalStats now relates to write/sync counters and times, with equivalents present in pg_stat_io, that backend statistics are able to already track. So this should cut all the dependencies between PgStat_PendingWalStats and WAL stats at backend level. As of this change, wal_buffers_full only shows in pg_stat_wal. Author: Bertrand Drouvot Reviewed-by: Ilia Evdokimov Discussion: https://postgr.es/m/Z6SOha5YFFgvpwQY@ip-10-97-1-34.eu-west-3.compute.internal	2025-02-17 13:14:28 +09:00
Alexander Korotkov	6a2275b895	Get rid of WALBufMappingLock Allow multiple backends to initialize WAL buffers concurrently. This way `MemSet((char ) NewPage, 0, XLOG_BLCKSZ);` can run in parallel without taking a single LWLock in exclusive mode. The new algorithm works as follows: reserve a page for initialization using XLogCtl->InitializeReserved, * ensure the page is written out, * once the page is initialized, try to advance XLogCtl->InitializedUpTo and signal to waiters using XLogCtl->InitializedUpToCondVar condition variable, * repeat previous steps until we reserve initialization up to the target WAL position, * wait until concurrent initialization finishes using a XLogCtl->InitializedUpToCondVar. Now, multiple backends can, in parallel, concurrently reserve pages, initialize them, and advance XLogCtl->InitializedUpTo to point to the latest initialized page. Author: Yura Sokolov <y.sokolov@postgrespro.ru> Co-authored-by: Alexander Korotkov <aekorotkov@gmail.com> Reviewed-by: Pavel Borisov <pashkin.elfe@gmail.com>	2025-02-17 04:25:29 +02:00
Nathan Bossart	977d865c36	Describe special values in GUC descriptions more consistently. Many GUCs accept special values like -1 or an empty string to disable the feature, use a system default, etc. While the documentation consistently lists these special values, the GUC descriptions do not. Many such descriptions fail to mention the special values, and those that do vary in phrasing and placement. This commit aims to bring some consistency to this area by applying the following rules: * Special values should be listed at the end of the long description. * Descriptions should use numerals (e.g., "0") instead of words (e.g., "zero"). * Special value mentions should be concise and direct (e.g., "0 disables the timeout.", "An empty string means use the operating system setting."). * Multiple special values should be listed in ascending order. Of course, there are exceptions, such as max_pred_locks_per_relation and search_path, whose special values are too complex to include. And there are cases like listen_addresses, where the meaning of an empty string is arguably too obvious to include. In those cases, I've refrained from adding special value information to the GUC description. Reviewed-by: Peter Smith <smithpb2250@gmail.com> Reviewed-by: "David G. Johnston" <david.g.johnston@gmail.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Discussion: https://postgr.es/m/Z6aIy4aywxUZHAo6%40nathan	2025-02-14 10:44:30 -06:00
Peter Eisentraut	cdaeff9b39	XLogRegisterData, XLogRegisterBufData void * argument for binary data Change XLogRegisterData() and XLogRegisterBufData() functions to take void * for binary data instead of char *. This will remove the need for numerous casts (done in a separate commit for clarity). Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> Discussion: https://www.postgresql.org/message-id/flat/fd1fcedb-3492-4fc8-9e3e-74b97f2db6c7%40eisentraut.org	2025-02-13 10:33:14 +01:00
Masahiko Sawada	abfb29648f	Rename RBTXN_PREPARE to RBTXN_IS_PREPARE for better clarification. RBTXN_PREPARE flag and rbtxn_prepared macro could be misinterpreted as either indicating the transaction type (e.g. a prepared transaction or a normal transaction) or its currentstate (e.g. skipped or its prepare message is sent), especially after commit `072ee847ad` introduced the RBTXN_SENT_PREPARE flag and the rbtxn_sent_prepare macro. The RBTXN_PREPARE flag (and its corresponding macro) have been renamed to RBTXN_IS_PREPARE to explicitly indicate the transaction type. Therefore, this commit also adds the RBTXN_IS_PREPARE flag to the transaction that is a prepared transaction and has been skipped, which previously had only the RBTXN_SKIPPED_PREPARE flag. Reviewed-by: Amit Kapila, Peter Smith Discussion: https://postgr.es/m/CAA4eK1KgNmBsG%3D155E7QQ6TX9RoWnM4z5Z20SvsbwxSe_QXYsg%40mail.gmail.com	2025-02-12 16:55:00 -08:00
Masahiko Sawada	072ee847ad	Skip logical decoding of already-aborted transactions. Previously, transaction aborts were detected concurrently only during system catalog scans while replaying a transaction in streaming mode. This commit adds an additional CLOG lookup to check the transaction status, allowing the logical decoding to skip changes also when it doesn't touch system catalogs, if the transaction is already aborted. This optimization enhances logical decoding performance, especially for large transactions that have already been rolled back, as it avoids unnecessary disk or network I/O. To avoid potential slowdowns caused by frequent CLOG lookups for small transactions (most of which commit), the CLOG lookup is performed only for large transactions before eviction. The performance benchmark results showed there is not noticeable performance regression due to CLOG lookups. Reviewed-by: Amit Kapila, Peter Smith, Vignesh C, Ajin Cherian Reviewed-by: Dilip Kumar, Andres Freund Discussion: https://postgr.es/m/CAD21AoDht9Pz_DFv_R2LqBTBbO4eGrpa9Vojmt5z5sEx3XwD7A@mail.gmail.com	2025-02-12 16:31:34 -08:00
Peter Eisentraut	1b5841d461	Remove unnecessary (char ) casts [checksum] Remove some (char ) casts related to uses of the pg_checksum_page() function. These casts are useless, because everything involved already has the right type. Moreover, these casts actually silently discarded a const qualifier. The declaration of a higher-level function needs to be adjusted to fix that. Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> Discussion: https://www.postgresql.org/message-id/flat/fd1fcedb-3492-4fc8-9e3e-74b97f2db6c7%40eisentraut.org	2025-02-12 08:59:48 +01:00
Nathan Bossart	bb8dff9995	Add cost-based vacuum delay time to progress views. This commit adds the amount of time spent sleeping due to cost-based delay to the pg_stat_progress_vacuum and pg_stat_progress_analyze system views. A new configuration parameter named track_cost_delay_timing, which is off by default, controls whether this information is gathered. For vacuum, the reported value includes the sleep time of any associated parallel workers. However, parallel workers only report their sleep time once per second to avoid overloading the leader process. Bumps catversion. Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Co-authored-by: Nathan Bossart <nathandbossart@gmail.com> Reviewed-by: Sami Imseih <samimseih@gmail.com> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Masahiro Ikeda <ikedamsh@oss.nttdata.com> Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com> Reviewed-by: Sergei Kornilov <sk@zsrv.org> Discussion: https://postgr.es/m/ZmaXmWDL829fzAVX%40ip-10-97-1-34.eu-west-3.compute.internal	2025-02-11 16:38:14 -06:00
Nathan Bossart	e5b0b0ce15	Add is_analyze parameter to vacuum_delay_point(). This function is used in both vacuum and analyze code paths, and a follow-up commit will require distinguishing between the two. This commit forces callers to specify whether they are in a vacuum or analyze path, but it does not use that information for anything yet. Author: Nathan Bossart <nathandbossart@gmail.com> Co-authored-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/ZmaXmWDL829fzAVX%40ip-10-97-1-34.eu-west-3.compute.internal	2025-02-11 16:38:14 -06:00
Melanie Plageman	052026c9b9	Eagerly scan all-visible pages to amortize aggressive vacuum Aggressive vacuums must scan every unfrozen tuple in order to advance the relfrozenxid/relminmxid. Because data is often vacuumed before it is old enough to require freezing, relations may build up a large backlog of pages that are set all-visible but not all-frozen in the visibility map. When an aggressive vacuum is triggered, all of these pages must be scanned. These pages have often been evicted from shared buffers and even from the kernel buffer cache. Thus, aggressive vacuums often incur large amounts of extra I/O at the expense of foreground workloads. To amortize the cost of aggressive vacuums, eagerly scan some all-visible but not all-frozen pages during normal vacuums. All-visible pages that are eagerly scanned and set all-frozen in the visibility map are counted as successful eager freezes and those not frozen are counted as failed eager freezes. If too many eager scans fail in a row, eager scanning is temporarily suspended until a later portion of the relation. The number of failures tolerated is configurable globally and per table. To effectively amortize aggressive vacuums, we cap the number of successes as well. Capping eager freeze successes also limits the amount of potentially wasted work if these pages are modified again before the next aggressive vacuum. Once we reach the maximum number of blocks successfully eager frozen, eager scanning is disabled for the remainder of the vacuum of the relation. Original design idea from Robert Haas, with enhancements from Andres Freund, Tomas Vondra, and me Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Robert Treat <rob@xzilla.net> Reviewed-by: Bilal Yavuz <byavuz81@gmail.com> Discussion: https://postgr.es/m/flat/CAAKRu_ZF_KCzZuOrPrOqjGVe8iRVWEAJSpzMgRQs%3D5-v84cXUg%40mail.gmail.com	2025-02-11 13:53:48 -05:00
Andres Freund	4dd09a1d41	config: Rename "Asynchronous Behavior" to "I/O" "I/O" seems more descriptive than "Asynchronous Behavior", given that some of the GUCs in the section don't relate to anything asynchronous. Most other abbreviations in the config sections are un-abbreviated, but "Input/Output" seems less likely to be helpful than just IO or I/O. Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/x3tlw2jk5gm3r3mv47hwrshffyw7halpczkfbk3peksxds7bvc@lguk43z3bsyq	2025-02-11 12:53:40 -05:00
Andres Freund	740766d37c	config: Split "Worker Processes" out of "Asynchronous Behavior" Having all the worker related GUCs in the same section as IO controlling GUCs doesn't really make sense. Create a separate section for them. Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/x3tlw2jk5gm3r3mv47hwrshffyw7halpczkfbk3peksxds7bvc@lguk43z3bsyq	2025-02-11 12:53:40 -05:00
Tom Lane	c366d2bdba	Allow extension functions to participate in in-place updates. Commit `1dc5ebc90` allowed PL/pgSQL to perform in-place updates of expanded-object variables that are being updated with assignments like "x := f(x, ...)". However this was allowed only for a hard-wired list of functions f(), since we need to be sure that f() will not modify the variable if it fails. It was always envisioned that we should make that extensible, but at the time we didn't have a good way to do so. Since then we've invented the idea of "support functions" to allow attaching specialized optimization knowledge to functions, and that is a perfect mechanism for doing this. Hence, adjust PL/pgSQL to use a support function request instead of hard-wired logic to decide if in-place update is safe. Preserve the previous optimizations by creating support functions for the three functions that were previously hard-wired. Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru> Reviewed-by: Pavel Borisov <pashkin.elfe@gmail.com> Discussion: https://postgr.es/m/CACxu=vJaKFNsYxooSnW1wEgsAO5u_v1XYBacfVJ14wgJV_PYeg@mail.gmail.com	2025-02-11 12:49:34 -05:00
Tom Lane	6c7251db0c	Implement new optimization rule for updates of expanded variables. If a read/write expanded variable is declared locally to the assignment statement that is updating it, and it is referenced exactly once in the assignment RHS, then we can optimize the operation as a direct update of the expanded value, whether or not the function(s) operating on it can be trusted not to modify the value before throwing an error. This works because if an error does get thrown, we no longer care what value the variable has. In cases where that doesn't work, fall back to the previous rule that checks for safety of the top-level function. In any case, postpone determination of whether these optimizations are feasible until we are executing a Param referencing the target variable and that variable holds a R/W expanded object. While the previous incarnation of exec_check_rw_parameter was pretty cheap, this is a bit less so, and our plan to invoke support functions will make it even less so. So avoiding the check for variables where it couldn't be useful should be a win. Author: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru> Reviewed-by: Pavel Borisov <pashkin.elfe@gmail.com> Discussion: https://postgr.es/m/CACxu=vJaKFNsYxooSnW1wEgsAO5u_v1XYBacfVJ14wgJV_PYeg@mail.gmail.com	2025-02-11 12:34:59 -05:00
Peter Eisentraut	6998db59c2	Replace AssertMacro() with Assert() when not in macro This was forgotten to be changed in commit `9c727360bc`.	2025-02-11 11:12:05 +01:00
Michael Paquier	c9238ad853	Fix indentation of comment in plannodes.h Oversight in commit `3d17d7d7fb`. Worth noting that pgindent was fine as-is. Author: Sami Imseih Discussion: https://postgr.es/m/CAA5RZ0t80hP2aTv97QtEJy39GkxKmDBVDiTBApfiuTa4O=TEWQ@mail.gmail.com	2025-02-11 07:40:03 +09:00
Andres Freund	3e98c8ce50	Specify the encoding of input to fmtId() This commit adds fmtIdEnc() and fmtQualifiedIdEnc(), which allow to specify the encoding as an explicit argument. Additionally setFmtEncoding() is provided, which defines the encoding when no explicit encoding is provided, to avoid breaking all code using fmtId(). All users of fmtId()/fmtQualifiedId() are either converted to the explicit version or a call to setFmtEncoding() has been added. This commit does not yet utilize the now well-defined encoding, that will happen in a subsequent commit. Reviewed-by: Noah Misch <noah@leadboat.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Backpatch-through: 13 Security: CVE-2025-1094	2025-02-10 10:03:37 -05:00
Andres Freund	4dc2896353	Add pg_encoding_set_invalid() There are cases where we cannot / do not want to error out for invalidly encoded input. In such cases it can be useful to replace e.g. an incomplete multi-byte characters with bytes that will trigger an error when getting validated as part of a larger string. Unfortunately, until now, for some encoding no such sequence existed. For those encodings this commit removes one previously accepted input combination - we consider that to be ok, as the chosen bytes are outside of the valid ranges for the encodings, we just previously failed to detect that. As we cannot add a new field to pg_wchar_table without breaking ABI, this is implemented "in-line" in the newly added function. Author: Noah Misch <noah@leadboat.com> Reviewed-by: Andres Freund <andres@anarazel.de> Backpatch-through: 13 Security: CVE-2025-1094	2025-02-10 10:03:37 -05:00
Michael Paquier	3d17d7d7fb	Reformat node comments in plannodes.h This is similar to `d575051b9a` but this time for the comments in plannodes.h to avoid long lines, which is useful if adding per-field annotations with pg_node_attr() to these planner structures. Some patches are under discussion to add such properties to planner fields, which is something that may or may not happen, and this change makes future proposals easier to work on and review, which being more consistent in style with the parse nodes. Author: Sami Imseih Discussion: https://postgr.es/m/Z5xTb5iBHVGns35R@paquier.xyz	2025-02-10 09:58:25 +09:00
Peter Eisentraut	a9258629ed	Make TLS write functions' buffer arguments pointers const This also makes it match the equivalent APIs in libpq. Author: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> Discussion: https://www.postgresql.org/message-id/flat/fd1fcedb-3492-4fc8-9e3e-74b97f2db6c7%40eisentraut.org	2025-02-09 12:43:30 +01:00
Peter Eisentraut	83ea6c5402	Virtual generated columns This adds a new variant of generated columns that are computed on read (like a view, unlike the existing stored generated columns, which are computed on write, like a materialized view). The syntax for the column definition is ... GENERATED ALWAYS AS (...) VIRTUAL and VIRTUAL is also optional. VIRTUAL is the default rather than STORED to match various other SQL products. (The SQL standard makes no specification about this, but it also doesn't know about VIRTUAL or STORED.) (Also, virtual views are the default, rather than materialized views.) Virtual generated columns are stored in tuples as null values. (A very early version of this patch had the ambition to not store them at all. But so much stuff breaks or gets confused if you have tuples where a column in the middle is completely missing. This is a compromise, and it still saves space over being forced to use stored generated columns. If we ever find a way to improve this, a bit of pg_upgrade cleverness could allow for upgrades to a newer scheme.) The capabilities and restrictions of virtual generated columns are mostly the same as for stored generated columns. In some cases, this patch keeps virtual generated columns more restricted than they might technically need to be, to keep the two kinds consistent. Some of that could maybe be relaxed later after separate careful considerations. Some functionality that is currently not supported, but could possibly be added as incremental features, some easier than others: - index on or using a virtual column - hence also no unique constraints on virtual columns - extended statistics on virtual columns - foreign-key constraints on virtual columns - not-null constraints on virtual columns (check constraints are supported) - ALTER TABLE / DROP EXPRESSION - virtual column cannot have domain type - virtual columns are not supported in logical replication The tests in generated_virtual.sql have been copied over from generated_stored.sql with the keyword replaced. This way we can make sure the behavior is mostly aligned, and the differences can be visible. Some tests for currently not supported features are currently commented out. Reviewed-by: Jian He <jian.universality@gmail.com> Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com> Tested-by: Shlok Kyal <shlok.kyal.oss@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/a368248e-69e4-40be-9c07-6c3b5880b0a6@eisentraut.org	2025-02-07 09:46:59 +01:00
Amit Langote	cbc127917e	Track unpruned relids to avoid processing pruned relations This commit introduces changes to track unpruned relations explicitly, making it possible for top-level plan nodes, such as ModifyTable and LockRows, to avoid processing partitions pruned during initial pruning. Scan-level nodes, such as Append and MergeAppend, already avoid the unnecessary processing by accessing partition pruning results directly via part_prune_index. In contrast, top-level nodes cannot access pruning results directly and need to determine which partitions remain unpruned. To address this, this commit introduces a new bitmapset field, es_unpruned_relids, which the executor uses to track the set of unpruned relations. This field is referenced during plan initialization to skip initializing certain nodes for pruned partitions. It is initialized with PlannedStmt.unprunableRelids, a new field that the planner populates with RT indexes of relations that cannot be pruned during runtime pruning. These include relations not subject to partition pruning and those required for execution regardless of pruning. PlannedStmt.unprunableRelids is computed during set_plan_refs() by removing the RT indexes of runtime-prunable relations, identified from PartitionPruneInfos, from the full set of relation RT indexes. ExecDoInitialPruning() then updates es_unpruned_relids by adding partitions that survive initial pruning. To support this, PartitionedRelPruneInfo and PartitionedRelPruningData now include a leafpart_rti_map[] array that maps partition indexes to their corresponding RT indexes. The former is used in set_plan_refs() when constructing unprunableRelids, while the latter is used in ExecDoInitialPruning() to convert partition indexes returned by get_matching_partitions() into RT indexes, which are then added to es_unpruned_relids. These changes make it possible for ModifyTable and LockRows nodes to process only relations that remain unpruned after initial pruning. ExecInitModifyTable() trims lists, such as resultRelations, withCheckOptionLists, returningLists, and updateColnosLists, to consider only unpruned partitions. It also creates ResultRelInfo structs only for these partitions. Similarly, child RowMarks for pruned relations are skipped. By avoiding unnecessary initialization of structures for pruned partitions, these changes improve the performance of updates and deletes on partitioned tables during initial runtime pruning. Due to ExecInitModifyTable() changes as described above, EXPLAIN on a plan for UPDATE and DELETE that uses runtime initial pruning no longer lists partitions pruned during initial pruning. Reviewed-by: Robert Haas <robertmhaas@gmail.com> (earlier versions) Reviewed-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/CA+HiwqFGkMSge6TgC9KQzde0ohpAycLQuV7ooitEEpbKB0O_mg@mail.gmail.com	2025-02-07 17:15:09 +09:00
Nathan Bossart	306dc520b9	Introduce autovacuum_vacuum_max_threshold. One way autovacuum chooses tables to vacuum is by comparing the number of updated or deleted tuples with a value calculated using autovacuum_vacuum_threshold and autovacuum_vacuum_scale_factor. The threshold specifies the base value for comparison, and the scale factor specifies the fraction of the table size to add to it. This strategy ensures that smaller tables are vacuumed after fewer updates/deletes than larger tables, which is reasonable in many cases but can result in infrequent vacuums on very large tables. This is undesirable for a couple of reasons, such as very large tables incurring a huge amount of bloat between vacuums. This new parameter provides a way to set a limit on the value calculated with autovacuum_vacuum_threshold and autovacuum_vacuum_scale_factor so that very large tables are vacuumed more frequently. By default, it is set to 100,000,000 tuples, but it can be disabled by setting it to -1. It can also be adjusted for individual tables by changing storage parameters. Author: Nathan Bossart <nathandbossart@gmail.com> Co-authored-by: Frédéric Yhuel <frederic.yhuel@dalibo.com> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Laurenz Albe <laurenz.albe@cybertec.at> Reviewed-by: Michael Banck <mbanck@gmx.net> Reviewed-by: Joe Conway <mail@joeconway.com> Reviewed-by: Sami Imseih <samimseih@gmail.com> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com> Reviewed-by: Vinícius Abrahão <vinnix.bsd@gmail.com> Reviewed-by: Robert Treat <rob@xzilla.net> Reviewed-by: Alena Rybakina <a.rybakina@postgrespro.ru> Discussion: https://postgr.es/m/956435f8-3b2f-47a6-8756-8c54ded61802%40dalibo.com	2025-02-05 15:48:18 -06:00
Tom Lane	a14707da56	Show more-intuitive titles for psql commands \dt, \di, etc. If exactly one relation type is requested in a command of the \dtisv family, say "tables", "indexes", etc instead of "relations". This should cover the majority of actual uses, without creating a huge number of new translatable strings. The error messages for no matching relations are adjusted as well. In passing, invent "pg_log_error_internal()" to be used for frontend error messages that don't seem to need translation, analogously to errmsg_internal() in the backend. The implementation is a bit cheesy, being just a macro to prevent xgettext from recognizing a trigger keyword. This won't avoid a useless gettext lookup cycle at runtime --- but surely we don't care about an extra microsecond or two in what's supposed to be a can't-happen case. I (tgl) also made "pg_fatal_internal()", though it's not used in this patch. Author: Greg Sabino Mullane <htamfids@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CAKAnmm+7o93fQV-RFkGaN1QnP-0D4d3JTykD+cLueqjDMKdfag@mail.gmail.com	2025-02-05 12:45:58 -05:00
Amit Kapila	0ec3c295e7	Avoid updating inactive_since for invalid replication slots. It is possible for the inactive_since value of an invalid replication slot to be updated multiple times, which is unexpected behavior like during the release of the slot or at the time of restart. This is harmless because invalid slots are not allowed to be accessed but it is not prudent to update invalid slots. We are planning to invalidate slots due to other reasons like idle time and it will look odd that the slot's inactive_since displays the recent time in this field after invalidated due to idle time. So, this patch ensures that the inactive_since field of slots is not updated for invalid slots. In the passing, ensure to use the same inactive_since time for all the slots at restart while restoring them from the disk. Author: Nisha Moond <nisha.moond412@gmail.com> Author: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Reviewed-by: Vignesh C <vignesh21@gmail.com> Reviewed-by: Peter Smith <smithpb2250@gmail.com> Reviewed-by: Hou Zhijie <houzj.fnst@fujitsu.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://postgr.es/m/CABdArM7QdifQ_MHmMA=Cc4v8+MeckkwKncm2Nn6tX9wSCQ-+iw@mail.gmail.com	2025-02-05 08:56:14 +05:30
Michael Paquier	a051e71e28	Add data for WAL in pg_stat_io and backend statistics This commit adds WAL IO stats to both pg_stat_io view and per-backend IO statistics (pg_stat_get_backend_io()). This change is possible since `f92c854cf4`, as WAL IO is not counted in blocks in some code paths where its stats data is measured (like WAL read in xlogreader.c). IOContext gains IOCONTEXT_INIT and IOObject IOOBJECT_WAL, with the following combinations allowed: - IOOBJECT_WAL/IOCONTEXT_NORMAL is used to track I/O operations done on already-created WAL segments. - IOOBJECT_WAL/IOCONTEXT_INIT is used for tracking I/O operations done when initializing WAL segments. The core changes are done in pg_stat_io.c, backend statistics inherit them. Backend statistics and pg_stat_io are now available for the WAL writer, the WAL receiver and the WAL summarizer processes. I/O timing data is controlled by the GUC track_io_timing, like the existing data of pg_stat_io for consistency. The timings related to IOOBJECT_WAL show up if the GUC is enabled (disabled by default). Bump pgstats file version, due to the additions in IOObject and IOContext, impacting the amount of data written for the fixed-numbered IO stats kind in the pgstats file. Author: Nazir Bilal Yavuz Reviewed-by: Bertrand Drouvot, Nitin Jadhav, Amit Kapila, Michael Paquier, Melanie Plageman, Bharath Rupireddy Discussion: https://postgr.es/m/CAN55FZ3AiQ+ZMxUuXnBpd0Rrh1YhwJ5FudkHg=JU0P+-W8T4Vg@mail.gmail.com	2025-02-04 16:50:00 +09:00
Peter Eisentraut	622f678c10	Integrate GistTranslateCompareType() into IndexAmTranslateCompareType() This turns GistTranslateCompareType() into a callback function of the gist index AM instead of a standalone function. The existing callers are changed to use IndexAmTranslateCompareType(). This then makes that code not hardcoded toward gist. This means in particular that the temporal keys code is now independent of gist. Also, this generalizes commit `74edabce7a`, so other index access methods other than the previously hardcoded ones could now work as REPLICA IDENTITY in a logical replication subscriber. Author: Mark Dilger <mark.dilger@enterprisedb.com> Co-authored-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com	2025-02-03 10:53:18 +01:00
Peter Eisentraut	c09e5a6a01	Convert strategies to and from compare types For each Index AM, provide a mapping between operator strategies and the system-wide generic concept of a comparison type. For example, for btree, BTLessStrategyNumber maps to and from COMPARE_LT. Numerous places in the planner and executor think directly in terms of btree strategy numbers (and a few in terms of hash strategy numbers.) These should be converted over subsequent commits to think in terms of CompareType instead. (This commit doesn't make any use of this API yet.) Author: Mark Dilger <mark.dilger@enterprisedb.com> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com	2025-02-02 10:26:04 +01:00
Peter Eisentraut	119fc30dd5	Move CompareType to separate header file We'll want to make use of it in more places, and we'd prefer to not have to include all of primnodes.h everywhere. Author: Mark Dilger <mark.dilger@enterprisedb.com> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com	2025-02-02 08:11:57 +01:00
Peter Eisentraut	43493cceda	Add get_opfamily_name() function This refactors and simplifies various existing code to make use of the new function. Reviewed-by: Mark Dilger <mark.dilger@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com	2025-02-01 10:42:58 +01:00
Peter Eisentraut	a5709b5bb2	Rename GistTranslateStratnum() to GistTranslateCompareType() Follow up to commit `630f9a43ce`. The previous name had become confusing, because it doesn't actually translate a strategy number but a CompareType into a strategy number. We might add the inverse at some point, which would then probably be called something like GistTranslateStratnum. Reviewed-by: Mark Dilger <mark.dilger@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com	2025-02-01 10:18:46 +01:00
Tom Lane	041e8b95b8	Get rid of our dependency on type "long" for memory size calculations. Consistently use "Size" (or size_t, or in some places int64 or double) as the type for variables holding memory allocation sizes. In most places variables' data types were fine already, but we had an ancient habit of computing bytes from kilobytes-units GUCs with code like "work_mem * 1024L". That risks overflow on Win64 where they did not make "long" as wide as "size_t". We worked around that by restricting such GUCs' ranges, so you couldn't set work_mem et al higher than 2GB on Win64. This patch removes that restriction, after replacing such calculations with "work_mem * (Size) 1024" or variants of that. It should be noted that this patch was constructed by searching outwards from the GUCs that have MAX_KILOBYTES as upper limit. So I can't positively guarantee there are no other places doing memory-size arithmetic in int or long variables. I do however feel pretty confident that increasing MAX_KILOBYTES on Win64 is safe now. Also, nothing in our code should be dealing in multiple-gigabyte allocations without authorization from a relevant GUC, so it seems pretty likely that this search caught everything that could be at risk of overflow. Author: Vladlen Popolitov <v.popolitov@postgrespro.ru> Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/1a01f0-66ec2d80-3b-68487680@27595217	2025-01-31 13:52:40 -05:00
Daniel Gustafsson	e21d6f2971	Move PG_MAX_AUTH_TOKEN_LENGTH to libpq/auth.h Future SASL mechanism, like OAUTHBEARER, will use this as a limit on token messages coming from the client, so promote it to the header file to make it available. This patch is extracted from a larger body of work aimed at adding support for OAUTHBEARER in libpq. Author: Jacob Champion <jacob.champion@enterprisedb.com> Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Discussion: https://postgr.es/m/CAOYmi+kJqzo6XsR9TEhvVfeVNQ-TyFM5LATypm9yoQVYk=4Wrw@mail.gmail.com	2025-01-31 15:39:35 +01:00
Amit Langote	d47cbf474e	Perform runtime initial pruning outside ExecInitNode() This commit builds on the prior change that moved PartitionPruneInfos out of individual plan nodes into a list in PlannedStmt, making it possible to initialize PartitionPruneStates without traversing the plan tree and perform runtime initial pruning before ExecInitNode() initializes the plan trees. These tasks are now handled in a new routine, ExecDoInitialPruning(), which is called by InitPlan() before calling ExecInitNode() on various plan trees. ExecDoInitialPruning() performs the initial pruning and saves the result -- a Bitmapset of indexes for surviving child subnodes -- in es_part_prune_results, a list in EState. PartitionPruneStates created for initial pruning are stored in es_part_prune_states, another list in EState, for later use during exec pruning. Both lists are parallel to es_part_prune_infos, which holds the PartitionPruneInfos from PlannedStmt, enabling shared indexing. PartitionPruneStates initialized in ExecDoInitialPruning() now include only the PartitionPruneContexts for initial pruning steps. Exec pruning contexts are initialized later in ExecInitPartitionExecPruning() when the parent plan node is initialized, as the exec pruning step expressions depend on the parent node's PlanState. The existing function PartitionPruneFixSubPlanMap() has been repurposed for this initialization to avoid duplicating a similar loop structure for finding PartitionedRelPruningData to initialize exec pruning contexts for. It has been renamed to InitExecPruningContexts() to reflect its new primary responsibility. The original logic to "fix subplan maps" remains intact but is now encapsulated within the renamed function. This commit removes two obsolete Asserts in partkey_datum_from_expr(). The ExprContext used for pruning expression evaluation is now independent of the parent PlanState, making these Asserts unnecessary. By centralizing pruning logic and decoupling it from the plan initialization step (ExecInitNode()), this change sets the stage for future patches that will use the result of initial pruning to save the overhead of redundant processing for pruned partitions. Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/CA+HiwqFGkMSge6TgC9KQzde0ohpAycLQuV7ooitEEpbKB0O_mg@mail.gmail.com	2025-01-31 15:47:15 +09:00
Amit Kapila	f41d8468dd	Raise an error while trying to acquire an invalid slot. Once a replication slot is invalidated, it cannot be altered or used to fetch changes. However, a process could still acquire an invalid slot and fail later. For example, if a process acquires a logical slot that was invalidated due to wal_removed, it will eventually fail in CreateDecodingContext() when attempting to access the removed WAL. Similarly, for physical replication slots, even if the slot is invalidated and invalidation_reason is set to wal_removed, the walsender does not currently check for invalidation when starting physical replication. Instead, replication starts, and an error is only reported later while trying to access WAL. Similarly, we prohibit modifying slot properties for invalid slots but give the error for the same after acquiring the slot. This patch improves error handling by detecting invalid slots earlier at the time of slot acquisition which is the first step. This also helped in unifying different ERROR messages at different places and gave a consistent message for invalid slots. This means that the message for invalid slots will change to a generic message. This will also be helpful for future patches where we are planning to invalidate slots due to more reasons like idle_timeout because we don't have to modify multiple places in such cases and avoid the chances of missing out on a particular place. Author: Nisha Moond <nisha.moond412@gmail.com> Author: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Reviewed-by: Vignesh C <vignesh21@gmail.com> Reviewed-by: Peter Smith <smithpb2250@gmail.com> Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://postgr.es/m/CABdArM6pBL5hPnSQ+5nEVMANcF4FCH7LQmgskXyiLY75TMnKpw@mail.gmail.com	2025-01-31 10:27:35 +05:30
Michael Paquier	ce5c620fb6	Add pgstat_drop_matching_entries() to pgstats This allows users of the cumulative statistics to drop entries in the shared hash stats table, deleting as well local references. Callers of this function can optionally define a callback able to filter which entries to drop, similarly to pgstat_reset_matching_entries() with its callback do_reset(). pgstat_drop_all_entries() is refactored so as it uses this new function. Author: Lukas Fitti Discussion: https://postgr.es/m/CAP53PkwuFbo3NkwZgxwNRMjMfqPEqidD-SggaoQ4ijotBVLJAA@mail.gmail.com	2025-01-31 12:27:19 +09:00
Tom Lane	b9d232b9de	Use "ssize_t" not "long" in max_stack_depth-related code. This change adapts these functions to the machine's address width without depending on "long" to be the right size. (It isn't on Win64, for example.) While it seems unlikely anyone would care to run with a stack depth limit exceeding 2GB, this is part of a general push to avoid using type "long" to represent memory sizes. It's convenient to use ssize_t rather than the perhaps-more-obvious choice of size_t/Size, because the code involved depends on working with a signed data type. Our MAX_KILOBYTES limit already ensures that ssize_t will be sufficient to represent the maximum value of max_stack_depth. Extracted from a larger patch by Vladlen, plus additional hackery by me. Author: Vladlen Popolitov <v.popolitov@postgrespro.ru> Author: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/1a01f0-66ec2d80-3b-68487680@27595217	2025-01-30 16:44:47 -05:00
Tom Lane	115a365519	Simplify executor's handling of CaseTestExpr & CoerceToDomainValue. Instead of deciding at runtime whether to read from casetest.value or caseValue_datum, split EEOP_CASE_TESTVAL into two opcodes and make the decision during expression compilation. Similarly for EEOP_DOMAIN_TESTVAL. This actually results in net less code, mainly because llvmjit_expr.c's code for handling these opcodes gets shorter. The performance gain is doubtless negligible, but this seems worth changing anyway on grounds of simplicity and understandability. Author: Andreas Karlsson <andreas@proxel.se> Co-authored-by: Xing Guo <higuoxing@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CACpMh+AiBYAWn+D1aU7Rsy-V1tox06Cbc0H3qA7rwL5zdJ=anQ@mail.gmail.com	2025-01-30 13:21:42 -05:00
Amit Langote	bb3ec16e14	Move PartitionPruneInfo out of plan nodes into PlannedStmt This moves PartitionPruneInfo from plan nodes to PlannedStmt, simplifying traversal by centralizing all PartitionPruneInfo structures in a single list in it, which holds all instances for the main query and its subqueries. Instead of plan nodes (Append or MergeAppend) storing PartitionPruneInfo pointers, they now reference an index in this list. A bitmapset field is added to PartitionPruneInfo to store the RT indexes corresponding to the apprelids field in Append or MergeAppend. This allows execution pruning logic to verify that it operates on the correct plan node, mainly to facilitate debugging. Duplicated code in set_append_references() and set_mergeappend_references() is refactored into a new function, register_pruneinfo(). This updates RT indexes by applying rtoffet and adds PartitionPruneInfo to the global list in PlannerGlobal. By allowing pruning to be performed without traversing the plan tree, this change lays the groundwork for runtime initial pruning to occur independently of plan tree initialization. Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org> (earlier version) Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/CA+HiwqFGkMSge6TgC9KQzde0ohpAycLQuV7ooitEEpbKB0O_mg@mail.gmail.com	2025-01-30 11:57:32 +09:00
Tom Lane	0da39aa766	Handle default NULL insertion a little better. If a column is omitted in an INSERT, and there's no column default, the code in preptlist.c generates a NULL Const to be inserted. Furthermore, if the column is of a domain type, we wrap the Const in CoerceToDomain, so as to throw a run-time error if the domain has a NOT NULL constraint. That's fine as far as it goes, but there are two problems: 1. We're being sloppy about the type/typmod that the Const is labeled with. It really should have the domain's base type/typmod, since it's the input to CoerceToDomain not the output. This can result in coerce_to_domain inserting a useless length-coercion function (useless because it's being applied to a null). The coercion would typically get const-folded away later, but it'd be better not to create it in the first place. 2. We're not applying expression preprocessing (specifically, eval_const_expressions) to the resulting expression tree. The planner's primary expression-preprocessing pass already happened, so that means the length coercion step and CoerceToDomain node miss preprocessing altogether. This is at the least inefficient, since it means the length coercion and CoerceToDomain will actually be executed for each inserted row, though they could be const-folded away in most cases. Worse, it seems possible that missing preprocessing for the length coercion could result in an invalid plan (for example, due to failing to perform default-function-argument insertion). I'm not aware of any live bug of that sort with core datatypes, and it might be unreachable for extension types as well because of restrictions of CREATE CAST, but I'm not entirely convinced that it's unreachable. Hence, it seems worth back-patching the fix (although I only went back to v14, as the patch doesn't apply cleanly at all in v13). There are several places in the rewriter that are building null domain constants the same way as preptlist.c. While those are before the planner and hence don't have any reachable bug, they're still applying a length coercion that will be const-folded away later, uselessly wasting cycles. Hence, make a utility routine that all of these places can call to do it right. Making this code more careful about the typmod assigned to the generated NULL constant has visible but cosmetic effects on some of the plans shown in contrib/postgres_fdw's regression tests. Discussion: https://postgr.es/m/1865579.1738113656@sss.pgh.pa.us Backpatch-through: 14	2025-01-29 15:31:55 -05:00

1 2 3 4 5 ...

12197 commits