Two near-identical copies of clause_sides_match_join() existed in
joinpath.c and analyzejoins.c. Deduplicate this by moving the function
into restrictinfo.h.
It isn't quite clear that keeping the inline property of this function
is worthwhile, but this commit is just an exercise in code
deduplication. More effort would be required to determine if the inline
property is worth keeping.
Author: James Hunter <james.hunter.pg@gmail.com>
Discussion: https://postgr.es/m/CAJVSvF7Nm_9kgMLOch4c-5fbh3MYg%3D9BdnDx3Dv7Fcb64zr64Q%40mail.gmail.com
This module provides SQL functions that allow to inspect logical
decoding components.
It currently allows to inspect the contents of serialized logical
snapshots of a running database cluster, which is useful for debugging
or educational purposes.
Author: Bertrand Drouvot
Reviewed-by: Amit Kapila, Shveta Malik, Peter Smith, Peter Eisentraut
Reviewed-by: David G. Johnston
Discussion: https://postgr.es/m/ZscuZ92uGh3wm4tW%40ip-10-97-1-34.eu-west-3.compute.internal
This commit moves the definitions of the SnapBuild and SnapBuildOnDisk
structs, related to logical snapshots, to the snapshot_internal.h
file. This change allows external tools, such as
pg_logicalinspect (with an upcoming patch), to access and utilize the
contents of logical snapshots.
Author: Bertrand Drouvot
Reviewed-by: Amit Kapila, Shveta Malik, Peter Smith
Discussion: https://postgr.es/m/ZscuZ92uGh3wm4tW%40ip-10-97-1-34.eu-west-3.compute.internal
The MergeJoin struct was tracking "mergeStrategies", which were an
array of btree strategy numbers, purely for the purpose of comparing
it later against btree strategies to determine if the scan direction
was forward or reverse. Change that. Instead, track
"mergeReversals", an array of bool, to indicate the same without an
unfortunate assumption that a strategy number refers specifically to a
btree strategy.
Author: Mark Dilger <mark.dilger@enterprisedb.com>
Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com
Functions make_pathkey_from_sortop() and transformWindowDefinitions(),
which receive a SortGroupClause, were determining the sort order
(ascending vs. descending) by comparing that structure's operator
strategy to BTLessStrategyNumber, but could just as easily have gotten
it from the SortGroupClause object, if it had such a field, so add
one. This reduces the number of places that hardcode the assumption
that the strategy refers specifically to a btree strategy, rather than
some other index AM's operators.
Author: Mark Dilger <mark.dilger@enterprisedb.com>
Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com
Commit 15abc7788e tolerated namespace pollution from BeOS system
headers. Commit 44f902122 de-supported BeOS. Since that stuff didn't
make it into the Meson build system, synchronize by removing from
configure.
Author: Thomas Munro <thomas.munro@gmail.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Japin Li <japinli@hotmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> (the idea, not the patch)
Discussion: https://postgr.es/m/ME3P282MB3166F9D1F71F787929C0C7E7B6312%40ME3P282MB3166.AUSP282.PROD.OUTLOOK.COM
PostgreSQL has for a long time mixed two BIO implementations, which can
lead to subtle bugs and inconsistencies. This cleans up our BIO by just
just setting up the methods we need. This patch does not introduce any
functionality changes.
The following methods are no longer defined due to not being needed:
- gets: Not used by libssl
- puts: Not used by libssl
- create: Sets up state not used by libpq
- destroy: Not used since libpq use BIO_NOCLOSE, if it was used it close
the socket from underneath libpq
- callback_ctrl: Not implemented by sockets
The following methods are defined for our BIO:
- read: Used for reading arbitrary length data from the BIO. No change
in functionality from the previous implementation.
- write: Used for writing arbitrary length data to the BIO. No change
in functionality from the previous implementation.
- ctrl: Used for processing ctrl messages in the BIO (similar to ioctl).
The only ctrl message which matters is BIO_CTRL_FLUSH used for
writing out buffered data (or signal EOF and that no more data
will be written). BIO_CTRL_FLUSH is mandatory to implement and
is implemented as a no-op since there is no intermediate buffer
to flush.
BIO_CTRL_EOF is the out-of-band method for signalling EOF to
read_ex based BIO's. Our BIO is not read_ex based but someone
could accidentally call BIO_CTRL_EOF on us so implement mainly
for completeness sake.
As the implementation is no longer related to BIO_s_socket or calling
SSL_set_fd, methods have been renamed to reference the PGconn and Port
types instead.
This also reverts back to using BIO_set_data, with our fallback, as a small
optimization as BIO_set_app_data require the ex_data mechanism in OpenSSL.
Author: David Benjamin <davidben@google.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://postgr.es/m/CAF8qwaCZ97AZWXtg_y359SpOHe+HdJ+p0poLCpJYSUxL-8Eo8A@mail.gmail.com
This function returns the name, size, and last modification time of
each regular file in pg_wal/summaries. This allows administrators
to grant privileges to view the contents of this directory without
granting privileges on pg_ls_dir(), which allows listing the
contents of many other directories. This commit also gives the
pg_monitor predefined role EXECUTE privileges on the new
pg_ls_summariesdir() function.
Bumps catversion.
Author: Yushi Ogiwara
Reviewed-by: Michael Paquier, Fujii Masao
Discussion: https://postgr.es/m/a0a3af15a9b9daa107739eb45aa9a9bc%40oss.nttdata.com
Commit 90189eefc1 narrowed pg_attribute.attinhcount and
pg_constraint.coninhcount from 32 to 16 bits, but kept other related
structs with 32-bit wide fields: ColumnDef and CookedConstraint contain
an int 'inhcount' field which is itself checked for overflow on
increments, but there's no check that the values aren't above INT16_MAX
before assigning to the catalog columns. This means that a creative
user can get a inconsistent table definition and override some
protections.
Fix it by changing those other structs to also use int16.
Also, modernize style by using pg_add_s16_overflow for overflow testing
instead of checking for negative values.
We also have Constraint.inhcount, which is here removed completely.
This was added by commit b0e96f3119 and not removed by its revert at
6f8bb7c1e9. It is not needed by the upcoming not-null constraints
patch.
This is mostly academic, so we agreed not to backpatch to avoid ABI
problems.
Bump catversion because of the changes to parse nodes.
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Co-authored-by: 何建 (jian he) <jian.universality@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/202410081611.up4iyofb5ie7@alvherre.pgsql
Previously, the descriptions of pg_stat_get_checkpointer_num_requested(),
pg_stat_get_checkpointer_restartpoints_requested(),
and pg_stat_get_checkpointer_restartpoints_performed() in pg_proc.dat
referred to "backend". This was misleading because these functions report
the number of checkpoints or restartpoints requested or performed
by other than backends as well.
This commit removes "backend" from these descriptions to avoid confusion.
Bump catalog version.
Idea from Anton A. Melnikov
Author: Fujii Masao
Reviewed-by: Anton A. Melnikov
Discussion: https://postgr.es/m/8e5f353f-8b31-4a8e-9cfa-c037f22b4aee@postgrespro.ru
These fields can be set by executor nodes to record how many parallel
workers were planned to be launched and how many of them have been
actually launched within the number initially planned. This data is
able to give an approximation of the parallel worker draught a system
is facing, making easier the tuning of related configuration parameters.
These fields will be used by some follow-up patches to populate other
parts of the system with their data.
Author: Guillaume Lelarge, Benoit Lobréau
Discussion: https://postgr.es/m/783bc7f7-659a-42fa-99dd-ee0565644e25@dalibo.com
Discussion: https://postgr.es/m/CAECtzeWtTGOK0UgKXdDGpfTVSa5bd_VbUt6K6xn8P7X+_dZqKw@mail.gmail.com
AIO will need a resource owner to do IO. Right now we create a resowner
on-demand during basebackup, and we could do the same for AIO. But it seems
easier to just always create an aux process resowner.
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/1f6b50a7-38ef-4d87-8246-786d39f46ab9@iki.fi
With real AIO it doesn't make sense to cross segment boundaries with one
IO. Add smgrmaxcombine() to allow upper layers to query which buffers can be
merged.
We could continue to cross segment boundaries when not using AIO, but it
doesn't really make sense, because md.c will never be able to perform the read
across the segment boundary in one system call. Which means we'll mark more
buffers as undergoing IO than really makes sense - if another backend desires
to read the same blocks, it'll be blocked longer than necessary. So it seems
better to just never cross the boundary.
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/1f6b50a7-38ef-4d87-8246-786d39f46ab9@iki.fi
This seems nicer than having to duplicate the logic between
InitProcess() and ProcKill() for which child processes have a
PMChildFlags slot.
Move the MarkPostmasterChildActive() call earlier in InitProcess(),
out of the section protected by the spinlock.
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/a102f15f-eac4-4ff2-af02-f9ff209ec66f@iki.fi
Previously, when ON_ERROR was set to 'ignore', the COPY command
would skip all rows with data type conversion errors, with no way to
limit the number of skipped rows before failing.
This commit introduces the REJECT_LIMIT option, allowing users to
specify the maximum number of erroneous rows that can be skipped.
If more rows encounter data type conversion errors than allowed by
REJECT_LIMIT, the COPY command will fail with an error, even when
ON_ERROR = 'ignore'.
Author: Atsushi Torikoshi
Reviewed-by: Junwang Zhao, Kirill Reshke, jian he, Fujii Masao
Discussion: https://postgr.es/m/63f99327aa6b404cc951217fa3e61fe4@oss.nttdata.com
Commit 6aa44060a3 removed pg_authid's TOAST table because the only
varlena column is rolpassword, which cannot be de-TOASTed during
authentication because we haven't selected a database yet and
cannot read pg_class. Since that change, attempts to set password
hashes that require out-of-line storage will fail with a "row is
too big" error. This error message might be confusing to users.
This commit places a limit on the length of password hashes so that
attempts to set long password hashes will fail with a more
user-friendly error. The chosen limit of 512 bytes should be
sufficient to avoid "row is too big" errors independent of BLCKSZ,
but it should also be lenient enough for all reasonable use-cases
(or at least all the use-cases we could imagine).
Reviewed-by: Tom Lane, Jonathan Katz, Michael Paquier, Jacob Champion
Discussion: https://postgr.es/m/89e8649c-eb74-db25-7945-6d6b23992394%40gmail.com
Previously, when the on_error option was set to ignore, the COPY command
would always log NOTICE messages for input rows discarded due to
data type incompatibility. Users had no way to suppress these messages.
This commit introduces a new log_verbosity setting, 'silent',
which prevents the COPY command from emitting NOTICE messages
when on_error = 'ignore' is used, even if rows are discarded.
This feature is particularly useful when processing malformed files
frequently, where a flood of NOTICE messages can be undesirable.
For example, when frequently loading malformed files via the COPY command
or querying foreign tables using file_fdw (with an upcoming patch to
add on_error support for file_fdw), users may prefer to suppress
these messages to reduce log noise and improve clarity.
Author: Atsushi Torikoshi
Reviewed-by: Masahiko Sawada, Fujii Masao
Discussion: https://postgr.es/m/ab59dad10490ea3734cf022b16c24cfd@oss.nttdata.com
Previously, the pg_stat_checkpointer view and the checkpoint completion
log message could show different numbers for buffers written
during checkpoints. The view only counted shared buffers,
while the log message included both shared and SLRU buffers,
causing inconsistencies.
This commit resolves the issue by updating both the view and the log message
to separately report shared and SLRU buffers written during checkpoints.
A new slru_written column is added to the pg_stat_checkpointer view
to track SLRU buffers, while the existing buffers_written column now
tracks only shared buffers. This change would help users distinguish
between the two types of buffers, in the pg_stat_checkpointer view and
the checkpoint complete log message, respectively.
Bump catalog version.
Author: Nitin Jadhav
Reviewed-by: Bharath Rupireddy, Michael Paquier, Kyotaro Horiguchi, Robert Haas
Reviewed-by: Andres Freund, vignesh C, Fujii Masao
Discussion: https://postgr.es/m/CAMm1aWb18EpT0whJrjG+-nyhNouXET6ZUw0pNYYAe+NezpvsAA@mail.gmail.com
Refactoring in the interest of code consistency, a follow-up to 2e068db56e.
The argument against inserting a special enum value at the end of the enum
definition is that a switch statement might generate a compiler warning unless
it has a default clause.
Aleksander Alekseev, reviewed by Michael Paquier, Dean Rasheed, Peter Eisentraut
Discussion: https://postgr.es/m/CAJ7c6TMsiaV5urU_Pq6zJ2tXPDwk69-NKVh4AMN5XrRiM7N%2BGA%40mail.gmail.com
Instead of XXX_IN_XLOCALE_H for several features XXX, let's just
include <xlocale.h> if HAVE_XLOCALE_H. The reason for the extra
complication was apparently that some old glibc systems also had an
<xlocale.h>, and you weren't supposed to include it directly, but it's
gone now (as far as I can tell it was harmless to do so anyway).
Author: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/CWZBBRR6YA8D.8EHMDRGLCKCD%40neon.tech
LLVM's opaque pointer change began in LLVM 14, but remained optional
until LLVM 16. When commit 37d5babb added opaque pointer support, we
didn't turn it on for LLVM 14 and 15 yet because we didn't want to risk
weird bitcode incompatibility problems in released branches of
PostgreSQL. (That might have been overly cautious, I don't know.)
Now that PostgreSQL 18 has dropped support for LLVM versions < 14, and
since it hasn't been released yet and no extensions or bitcode have been
built against it in the wild yet, we can be more aggressive. We can rip
out the support code and build system clutter that made opaque pointer
use optional.
Author: Thomas Munro <thomas.munro@gmail.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussions: https://postgr.es/m/CA%2BhUKGLhNs5geZaVNj2EJ79Dx9W8fyWUU3HxcpZy55sMGcY%3DiA%40mail.gmail.com
This is a continuation of work like 11c34b342b, done to reduce the
bloat of pg_stat_statements by applying more normalization to query
entries. This commit is able to detect and normalize values in
VariableSetStmt, resulting in:
SET conf_param = $1
Compared to other parse nodes, VariableSetStmt is embedded in much more
places in the parser, impacting many query patterns in
pg_stat_statements. A custom jumble function is used, with an extra
field in the node to decide if arguments should be included in the
jumbling or not, a location field being not enough for this purpose.
This approach allows for a finer tuning.
Clauses relying on one or more keywords are not normalized, for example:
* DEFAULT
* FROM CURRENT
* List of keywords. SET SESSION CHARACTERISTICS AS TRANSACTION,
where it is critical to differentiate different sets of options, is a
good example of why normalization should not happen.
Some queries use VariableSetStmt for some subclauses with SET, that also
have their values normalized:
- ALTER DATABASE
- ALTER ROLE
- ALTER SYSTEM
- CREATE/ALTER FUNCTION
ba90eac7a9 has added test coverage for most of the existing SET
patterns. The expected output of these tests shows the difference this
commit creates. Normalization could be perhaps applied to more portions
of the grammar but what is done here is conservative, and good enough as
a starting point.
Author: Greg Sabino Mullane, Michael Paquier
Discussion: https://postgr.es/m/36e5bffe-e989-194f-85c8-06e7bc88e6f7@amazon.com
Discussion: https://postgr.es/m/B44FA29D-EBD0-4DD9-ABC2-16F1CB087074@amazon.com
Discussion: https://postgr.es/m/CAKAnmmJtJY2jzQN91=2QAD2eAJAA-Per61eyO48-TyxEg-q0Rg@mail.gmail.com
Checkpoints can be skipped when the server is idle. The existing num_timed and
num_requested counters in pg_stat_checkpointer track both completed and
skipped checkpoints, but there was no way to count only the completed ones.
This commit introduces the num_done counter, which tracks only completed
checkpoints, making it easier to see how many were actually performed.
Bump catalog version.
Author: Anton A. Melnikov
Reviewed-by: Fujii Masao
Discussion: https://postgr.es/m/9ea77f40-818d-4841-9dee-158ac8f6e690@oss.nttdata.com
Up to now, remove_rel_from_query() has done a pretty shoddy job
of updating our where-needed bitmaps (per-Var attr_needed and
per-PlaceHolderVar ph_needed relid sets). It removed direct mentions
of the to-be-removed baserel and outer join, which is the minimum
amount of effort needed to keep the data structures self-consistent.
But it didn't account for the fact that the removed join ON clause
probably mentioned Vars of other relations, and those Vars might now
not be needed as high up in the join tree as before. It's easy to
show cases where this results in failing to remove a lower outer join
that could also have been removed.
To fix, recalculate the where-needed bitmaps from scratch after
each successful join removal. This sounds expensive, but it seems
to add only negligible planner runtime. (We cheat a little bit
by preserving "relation 0" entries in the bitmaps, allowing us to
skip re-scanning the targetlist and HAVING qual.)
The submitted test case drew attention because we had successfully
optimized away the lower join prior to v16. I suspect that that's
somewhat accidental and there are related cases that were never
optimized before and now can be. I've not tried to come up with
one, though.
Perhaps we should back-patch this into v16 and v17 to repair the
performance regression. However, since it took a year for anyone
to notice the problem, it can't be affecting too many people. Let's
let the patch bake awhile in HEAD, and see if we get more complaints.
Per bug #18627 from Mikaël Gourlaouen. No back-patch for now.
Discussion: https://postgr.es/m/18627-44f950eb6a8416c2@postgresql.org
This also works for compressed tar-format backups. However, -n must be
used, because we use pg_waldump to verify WAL, and it doesn't yet know
how to verify WAL that is stored inside of a tarfile.
Amul Sul, reviewed by Sravan Kumar and by me, and revised by me.
This commit improves the catalog data in pg_proc for the three functions
for has_largeobject_privilege(), introduced in 4eada203a5:
- Fix their descriptions (typos and consistency).
- Reallocate OIDs to be within the 8000-9999 range as required by
a6417078c4.
Bump catalog version.
Reviewed-by: Fujii Masao
Discussion: https://postgr.es/m/ZvUYR0V0dzWaLnsV@paquier.xyz
The previous commit fixed some ways of losing an inplace update. It
remained possible to lose one when a backend working toward a
heap_update() copied a tuple into memory just before inplace update of
that tuple. In catalogs eligible for inplace update, use LOCKTAG_TUPLE
to govern admission to the steps of copying an old tuple, modifying it,
and issuing heap_update(). This includes MERGE commands. To avoid
changing most of the pg_class DDL, don't require LOCKTAG_TUPLE when
holding a relation lock sufficient to exclude inplace updaters.
Back-patch to v12 (all supported versions). In v13 and v12, "UPDATE
pg_class" or "UPDATE pg_database" can still lose an inplace update. The
v14+ UPDATE fix needs commit 86dc90056d,
and it wasn't worth reimplementing that fix without such infrastructure.
Reviewed by Nitin Motiani and (in earlier versions) Heikki Linnakangas.
Discussion: https://postgr.es/m/20231027214946.79.nmisch@google.com
As previously-added tests demonstrated, heap_inplace_update() could
instead update an unrelated tuple of the same catalog. It could lose
the update. Losing relhasindex=t was a source of index corruption.
Inplace-updating commands like VACUUM will now wait for heap_update()
commands like GRANT TABLE and GRANT DATABASE. That isn't ideal, but a
long-running GRANT already hurts VACUUM progress more just by keeping an
XID running. The VACUUM will behave like a DELETE or UPDATE waiting for
the uncommitted change.
For implementation details, start at the systable_inplace_update_begin()
header comment and README.tuplock. Back-patch to v12 (all supported
versions). In back branches, retain a deprecated heap_inplace_update(),
for extensions.
Reported by Smolkin Grigory. Reviewed by Nitin Motiani, (in earlier
versions) Heikki Linnakangas, and (in earlier versions) Alexander
Lakhin.
Discussion: https://postgr.es/m/CAMp+ueZQz3yDk7qg42hk6-9gxniYbp-=bG2mgqecErqR5gGGOA@mail.gmail.com
Ensure that error paths within these functions do not leak a collator,
and return the result rather than using an out parameter. (Error paths
in the caller may still result in a leaked collator, which will be
addressed separately.)
In make_libc_collator(), if the first newlocale() succeeds and the
second one fails, close the first locale_t object.
The function make_icu_collator() doesn't have any external callers, so
change it to be static.
Discussion: https://postgr.es/m/54d20e812bd6c3e44c10eddcd757ec494ebf1803.camel@j-davis.com
pg_authid's only varlena column is rolpassword, which unfortunately
cannot be de-TOASTed during authentication because we haven't
selected a database yet and cannot read pg_class. By removing
pg_authid's TOAST table, attempts to set password hashes that
require out-of-line storage will fail with a "row is too big"
error instead. We may want to provide a more user-friendly error
in the future, but for now let's just remove the useless TOAST
table.
Bumps catversion.
Reported-by: Alexander Lakhin
Reviewed-by: Tom Lane, Michael Paquier
Discussion: https://postgr.es/m/89e8649c-eb74-db25-7945-6d6b23992394%40gmail.com
Replace the fixed-size array of fast-path locks with arrays, sized on
startup based on max_locks_per_transaction. This allows using fast-path
locking for workloads that need more locks.
The fast-path locking introduced in 9.2 allowed each backend to acquire
a small number (16) of weak relation locks cheaply. If a backend needs
to hold more locks, it has to insert them into the shared lock table.
This is considerably more expensive, and may be subject to contention
(especially on many-core systems).
The limit of 16 fast-path locks was always rather low, because we have
to lock all relations - not just tables, but also indexes, views, etc.
For planning we need to lock all relations that might be used in the
plan, not just those that actually get used in the final plan. So even
with rather simple queries and schemas, we often need significantly more
than 16 locks.
As partitioning gets used more widely, and the number of partitions
increases, this limit is trivial to hit. Complex queries may easily use
hundreds or even thousands of locks. For workloads doing a lot of I/O
this is not noticeable, but for workloads accessing only data in RAM,
the access to the shared lock table may be a serious issue.
This commit removes the hard-coded limit of the number of fast-path
locks. Instead, the size of the fast-path arrays is calculated at
startup, and can be set much higher than the original 16-lock limit.
The overall fast-path locking protocol remains unchanged.
The variable-sized fast-path arrays can no longer be part of PGPROC, but
are allocated as a separate chunk of shared memory and then references
from the PGPROC entries.
The fast-path slots are organized as a 16-way set associative cache. You
can imagine it as a hash table of 16-slot "groups". Each relation is
mapped to exactly one group using hash(relid), and the group is then
processed using linear search, just like the original fast-path cache.
With only 16 entries this is cheap, with good locality.
Treating this as a simple hash table with open addressing would not be
efficient, especially once the hash table gets almost full. The usual
remedy is to grow the table, but we can't do that here easily. The
access would also be more random, with worse locality.
The fast-path arrays are sized using the max_locks_per_transaction GUC.
We try to have enough capacity for the number of locks specified in the
GUC, using the traditional 2^n formula, with an upper limit of 1024 lock
groups (i.e. 16k locks). The default value of max_locks_per_transaction
is 64, which means those instances will have 64 fast-path slots.
The main purpose of the max_locks_per_transaction GUC is to size the
shared lock table. It is often set to the "average" number of locks
needed by backends, with some backends using significantly more locks.
This should not be a major issue, however. Some backens may have to
insert locks into the shared lock table, but there can't be too many of
them, limiting the contention.
The only solution is to increase the GUC, even if the shared lock table
already has sufficient capacity. That is not free, especially in terms
of memory usage (the shared lock table entries are fairly large). It
should only happen on machines with plenty of memory, though.
In the future we may consider a separate GUC for the number of fast-path
slots, but let's try without one first.
Reviewed-by: Robert Haas, Jakub Wartak
Discussion: https://postgr.es/m/510b887e-c0ce-4a0c-a17a-2c6abb8d9a5c@enterprisedb.com
table_beginscan_parallel and index_beginscan_parallel contain
Asserts checking that the relation a worker will use in
a parallel scan is the same one the leader intended. However,
they were checking for relation OID match, which was not strong
enough to detect the mismatch problem fixed in 126ec0bc7.
What would be strong enough is to compare relfilenodes instead.
Arguably, that's a saner definition anyway, since a scan surely
operates on a physical relation not a logical one. Hence,
store and compare RelFileLocators not relation OIDs. Also
ensure that index_beginscan_parallel checks the index identity
not just the table identity.
Discussion: https://postgr.es/m/2127254.1726789524@sss.pgh.pa.us
This commit moves pg_wal_replay_wait() procedure to be a neighbor of
WAL-related functions in xlogfuncs.c. The implementation of LSN waiting
continues to reside in the same place.
By proposal from Michael Paquier.
Reported-by: Peter Eisentraut
Discussion: https://postgr.es/m/18c0fa64-0475-415e-a1bd-665d922c5201%40eisentraut.org
This change allows pg_index rows to use out-of-line storage for the
"indexprs" and "indpred" columns, which enables use-cases such as
very large index expressions.
This system catalog was previously not given a TOAST table due to a
fear of circularity issues (see commit 96cdeae07f). Testing has
not revealed any such problems, and it seems unlikely that the
entries for system indexes could ever need out-of-line storage. In
any case, it is still early in the v18 development cycle, so
committing this now will hopefully increase the chances of finding
any unexpected problems prior to release.
Bumps catversion.
Reported-by: Jonathan Katz
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/b611015f-b423-458c-aa2d-be0e655cc1b4%40postgresql.org
This opens the possibility to define keys for more types of statistics
kinds in PgStat_HashKey, the first case being 8-byte query IDs for
statistics like pg_stat_statements.
This increases the size of PgStat_HashKey from 12 to 16 bytes, while
PgStatShared_HashEntry, entry stored in the dshash for pgstats, keeps
the same size due to alignment.
xl_xact_stats_item, that tracks the stats items to drop in commit WAL
records, is increased from 12 to 16 bytes. Note that individual chunks
in commit WAL records should be multiples of sizeof(int), hence 8-byte
object IDs are stored as two uint32, based on a suggestion from Heikki
Linnakangas.
While on it, the field of PgStat_HashKey is renamed from "objoid" to
"objid", as for some stats kinds this field does not refer to OIDs but
just IDs, like for replication slot stats.
This commit bumps the following format variables:
- PGSTAT_FILE_FORMAT_ID, as PgStat_HashKey is written to the stats file
for non-serialized stats kinds in the dshash table.
- XLOG_PAGE_MAGIC for the changes in xl_xact_stats_item.
- Catalog version, for the SQL function pg_stat_have_stats().
Reviewed-by: Bertrand Drouvot
Discussion: https://postgr.es/m/ZsvTS9EW79Up8I62@paquier.xyz
Commits 041b9680 and 6377e12a changed the interface of
scan_analyze_next_block() to take a ReadStream instead of a BlockNumber
and a BufferAccessStrategy, and to return a value to indicate when the
stream has run out of blocks.
This caused integration problems for at least one known extension that
uses specially encoded BlockNumber values that map to different
underlying storage, because acquire_sample_rows() sets up the stream so
that read_stream_next_buffer() reads blocks from the main fork of the
relation's SMgrRelation.
Provide read_stream_next_block(), as a way for such an extension to
access the stream of raw BlockNumbers directly and forward them to its
own ReadBuffer() calls after decoding, as it could in earlier releases.
The new function returns the BlockNumber and BufferAccessStrategy that
were previously passed directly to scan_analyze_next_block().
Alternatively, an extension could wrap the stream of BlockNumbers in
another ReadStream with a callback that performs any decoding required
to arrive at real storage manager BlockNumber values, so that it could
benefit from the I/O combining and concurrency provided by
read_stream.c.
Another class of table access method that does nothing in
scan_analyze_next_block() because it is not block-oriented could use
this function to control the number of block sampling loops. It could
match the previous behavior with "return read_stream_next_block(stream,
&bas) != InvalidBlockNumber".
Ongoing work is expected to provide better ANALYZE support for table
access methods that don't behave like heapam with respect to storage
blocks, but that will be for future releases.
Back-patch to 17.
Reported-by: Mats Kindahl <mats@timescale.com>
Reviewed-by: Mats Kindahl <mats@timescale.com>
Discussion: https://postgr.es/m/CA%2B14425%2BCcm07ocG97Fp%2BFrD9xUXqmBKFvecp0p%2BgV2YYR258Q%40mail.gmail.com
Add PERIOD clause to foreign key constraint definitions. This is
supported for range and multirange types. Temporal foreign keys check
for range containment instead of equality.
This feature matches the behavior of the SQL standard temporal foreign
keys, but it works on PostgreSQL's native ranges instead of SQL's
"periods", which don't exist in PostgreSQL (yet).
Reference actions ON {UPDATE,DELETE} {CASCADE,SET NULL,SET DEFAULT}
are not supported yet.
(previously committed as 34768ee361, reverted by 8aee330af55; this is
essentially unchanged from those)
Author: Paul A. Jungwirth <pj@illuminatedcomputing.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: jian he <jian.universality@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CA+renyUApHgSZF9-nd-a0+OPGharLQLO=mDHcY4_qQ0+noCUVg@mail.gmail.com
Add WITHOUT OVERLAPS clause to PRIMARY KEY and UNIQUE constraints.
These are backed by GiST indexes instead of B-tree indexes, since they
are essentially exclusion constraints with = for the scalar parts of
the key and && for the temporal part.
(previously committed as 46a0cd4cef, reverted by 46a0cd4cefb; the new
part is this:)
Because 'empty' && 'empty' is false, the temporal PK/UQ constraint
allowed duplicates, which is confusing to users and breaks internal
expectations. For instance, when GROUP BY checks functional
dependencies on the PK, it allows selecting other columns from the
table, but in the presence of duplicate keys you could get the value
from any of their rows. So we need to forbid empties.
This all means that at the moment we can only support ranges and
multiranges for temporal PK/UQs, unlike the original patch (above).
Documentation and tests for this are added. But this could
conceivably be extended by introducing some more general support for
the notion of "empty" for other types.
Author: Paul A. Jungwirth <pj@illuminatedcomputing.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: jian he <jian.universality@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CA+renyUApHgSZF9-nd-a0+OPGharLQLO=mDHcY4_qQ0+noCUVg@mail.gmail.com
This is support function 12 for the GiST AM and translates
"well-known" RT*StrategyNumber values into whatever strategy number is
used by the opclass (since no particular numbers are actually
required). We will use this to support temporal PRIMARY
KEY/UNIQUE/FOREIGN KEY/FOR PORTION OF functionality.
This commit adds two implementations, one for internal GiST opclasses
(just an identity function) and another for btree_gist opclasses. It
updates btree_gist from 1.7 to 1.8, adding the support function for
all its opclasses.
(previously committed as 6db4598fcb, reverted by 8aee330af55; this is
essentially unchanged from those)
Author: Paul A. Jungwirth <pj@illuminatedcomputing.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: jian he <jian.universality@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CA+renyUApHgSZF9-nd-a0+OPGharLQLO=mDHcY4_qQ0+noCUVg@mail.gmail.com
Remove redundant checks for locale->collate_is_c now that we always
have a valid pg_locale_t.
Also, remove pg_locale_deterministic() wrapper, which is no longer
useful after commit e9931bfb75. Just check the field directly,
consistent with other fields in pg_locale_t.
Author: Andreas Karlsson
Discussion: https://postgr.es/m/60929555-4709-40a7-b136-bcb44cff5a3c@proxel.se
This function checks whether a user has specific privileges on a large object,
identified by OID. The user can be provided by name, OID,
or default to the current user. If the specified large object doesn't exist,
the function returns NULL. It raises an error for a non-existent user name.
This behavior is basically consistent with other privilege inquiry functions
like has_table_privilege.
Bump catalog version.
Author: Yugo Nagata
Reviewed-by: Fujii Masao
Discussion: https://postgr.es/m/20240702163444.ab586f6075e502eb84f11b1a@sranhm.sraoss.co.jp
myLargeObjectExists() and LargeObjectExists() had nearly identical code,
except for handling snapshots. This commit renames myLargeObjectExists()
to LargeObjectExistsWithSnapshot() and refactors LargeObjectExists()
to call it internally, reducing duplication.
Author: Yugo Nagata
Reviewed-by: Fujii Masao
Discussion: https://postgr.es/m/20240702163444.ab586f6075e502eb84f11b1a@sranhm.sraoss.co.jp
hashvalidate(), which validates the signatures of support functions
for the hash AM, contained several hardcoded exceptions. For example,
hash/date_ops support function 1 was hashint4(), which would
ordinarily fail validation because the function argument is int4, not
date. But this works internally because int4 and date are of the same
size. There are several more exceptions like this that happen to work
and were allowed historically but would now fail the function
signature validation.
This patch removes those exceptions by providing new support functions
that have the proper declared signatures. They internally share most
of the code with the "wrong" functions they replace, so the behavior
is still the same.
With the exceptions gone, hashvalidate() is now simplified and relies
fully on check_amproc_signature().
hashvarlena() and hashvarlenaextended() are kept in pg_proc.dat
because some extensions currently use them to build hash functions for
their own types, and we need to keep exposing these functions as
"LANGUAGE internal" functions for that to continue to work.
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/29c3b746-69e7-482a-b37c-dbbf7e5b009b@eisentraut.org
This brings more clarity to heapam.c, by cleanly separating all the
logic related to WAL replay and the rest of Heap and Heap2, similarly
to other RMGRs like hash, btree, etc.
The header reorganization is also nice in heapam.c, cutting half of the
headers required.
Author: Li Yong
Reviewed-by: Sutou Kouhei, Michael Paquier
Discussion: https://postgr.es/m/EFE55E65-D7BD-4C6A-B630-91F43FD0771B@ebay.com
1eff8279d added an API to tuplestore.c to allow callers to obtain
storage telemetry data. That API wasn't quite good enough for callers
that perform tuplestore_clear() as the telemetry functions only
accounted for the current state of the tuplestore, not the maximums
before tuplestore_clear() was called.
There's a pending patch that would like to add tuplestore telemetry
output to EXPLAIN ANALYZE for WindowAgg. That node type uses
tuplestore_clear() before moving to the next window partition and we
want to show the maximum space used, not the space used for the final
partition.
Reviewed-by: Tatsuo Ishii, Ashutosh Bapat
Discussion: https://postgres/m/CAApHDvoY8cibGcicLV0fNh=9JVx9PANcWvhkdjBnDCc9Quqytg@mail.gmail.com
Based on a patch by Michael Paquier.
For libpq, use PQExpBuffer instead of StringInfo. This requires us to
track allocation failures so that we can return JSON_OUT_OF_MEMORY as
needed rather than exit()ing.
Author: Jacob Champion <jacob.champion@enterprisedb.com>
Co-authored-by: Michael Paquier <michael@paquier.xyz>
Co-authored-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/flat/d1b467a78e0e36ed85a09adf979d04cf124a9d4b.camel@vmware.com
The only current implementation is for btree where it calls
_bt_getrootheight(). Other index types can now also use this to pass
information to their amcostestimate routine. Previously, btree was
hardcoded and other index types could not hook into the optimizer at
this point.
Author: Mark Dilger <mark.dilger@enterprisedb.com>
Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com
When generating window_pathkeys, distinct_pathkeys, or sort_pathkeys,
we failed to realize that the grouping/ordering expressions might be
nullable by grouping sets. As a result, we may incorrectly deem that
the PathKeys are redundant by EquivalenceClass processing and thus
remove them from the pathkeys list. That would lead to wrong results
in some cases.
To fix this issue, we mark the grouping expressions nullable by
grouping sets if that is the case. If the grouping expression is a
Var or PlaceHolderVar or constructed from those, we can just add the
RT index of the RTE_GROUP RTE to the existing nullingrels field(s);
otherwise we have to add a PlaceHolderVar to carry on the nullingrel
bit.
However, we have to manually remove this nullingrel bit from
expressions in various cases where these expressions are logically
below the grouping step, such as when we generate groupClause pathkeys
for grouping sets, or when we generate PathTarget for initial input to
grouping nodes.
Furthermore, in set_upper_references, the targetlist and quals of an
Agg node should have nullingrels that include the effects of the
grouping step, ie they will have nullingrels equal to the input
Vars/PHVs' nullingrels plus the nullingrel bit that references the
grouping RTE. In order to perform exact nullingrels matches, we also
need to manually remove this nullingrel bit.
Bump catversion because this changes the querytree produced by the
parser.
Thanks to Tom Lane for the idea to invent a new kind of RTE.
Per reports from Geoff Winkless, Tobias Wendorff, Richard Guo from
various threads.
Author: Richard Guo
Reviewed-by: Ashutosh Bapat, Sutou Kouhei
Discussion: https://postgr.es/m/CAMbWs4_dp7e7oTwaiZeBX8+P1rXw4ThkZxh1QG81rhu9Z47VsQ@mail.gmail.com
If there are subqueries in the grouping expressions, each of these
subqueries in the targetlist and HAVING clause is expanded into
distinct SubPlan nodes. As a result, only one of these SubPlan nodes
would be converted to reference to the grouping key column output by
the Agg node; others would have to get evaluated afresh. This is not
efficient, and with grouping sets this can cause wrong results issues
in cases where they should go to NULL because they are from the wrong
grouping set. Furthermore, during re-evaluation, these SubPlan nodes
might use nulled column values from grouping sets, which is not
correct.
This issue is not limited to subqueries. For other types of
expressions that are part of grouping items, if they are transformed
into another form during preprocessing, they may fail to match lower
target items. This can also lead to wrong results with grouping sets.
To fix this issue, we introduce a new kind of RTE representing the
output of the grouping step, with columns that are the Vars or
expressions being grouped on. In the parser, we replace the grouping
expressions in the targetlist and HAVING clause with Vars referencing
this new RTE, so that the output of the parser directly expresses the
semantic requirement that the grouping expressions be gotten from the
grouping output rather than computed some other way. In the planner,
we first preprocess all the columns of this new RTE and then replace
any Vars in the targetlist and HAVING clause that reference this new
RTE with the underlying grouping expressions, so that we will have
only one instance of a SubPlan node for each subquery contained in the
grouping expressions.
Bump catversion because this changes the querytree produced by the
parser.
Thanks to Tom Lane for the idea to invent a new kind of RTE.
Per reports from Geoff Winkless, Tobias Wendorff, Richard Guo from
various threads.
Author: Richard Guo
Reviewed-by: Ashutosh Bapat, Sutou Kouhei
Discussion: https://postgr.es/m/CAMbWs4_dp7e7oTwaiZeBX8+P1rXw4ThkZxh1QG81rhu9Z47VsQ@mail.gmail.com
The existing function PQprotocolVersion() does not include the minor
version of the protocol. In preparation for pending work that will
bump that number for the first time, add a new function to provide it
to clients that may care, using the (major * 10000 + minor)
convention already used by PQserverVersion().
Jacob Champion based on earlier work by Jelte Fennema-Nio
Discussion: http://postgr.es/m/CAOYmi+mM8+6Swt1k7XsLcichJv8xdhPnuNv7-02zJWsezuDL+g@mail.gmail.com
This commit adds two callbacks in pgstats to have a better control of
the flush timing of pgstat_report_stat(), whose operation depends on the
three PGSTAT_*_INTERVAL variables:
- have_fixed_pending_cb(), to check if a stats kind has any pending
data waiting for a flush. This is used as a fast path if there are no
pending statistics to flush, and this check is done for fixed-numbered
statistics only if there are no variable-numbered statistics to flush.
A flush will need to happen if at least one callback reports any pending
data.
- flush_fixed_cb(), to do the actual flush.
These callbacks are currently used by the SLRU, WAL and IO statistics,
generalizing the concept for all stats kinds (builtin and custom).
The SLRU and IO stats relied each on one global variable to determine
whether a flush should happen; these are now local to pgstat_slru.c and
pgstat_io.c, cleaning up a bit how the pending flush states are tracked
in pgstat.c.
pgstat_flush_io() and pgstat_flush_wal() are still required, but we do
not need to check their return result anymore.
Reviewed-by: Bertrand Drouvot, Kyotaro Horiguchi
Discussion: https://postgr.es/m/ZtaVO0N-aTwiAk3w@paquier.xyz
Instead always fetch the locale and look at the ctype_is_c field.
hba.c relies on regexes working for the C locale without needing
catalog access, which worked before due to a special case for
C_COLLATION_OID in lc_ctype_is_c(). Move the special case to
pg_set_regex_collation() now that lc_ctype_is_c() is gone.
Author: Andreas Karlsson
Discussion: https://postgr.es/m/60929555-4709-40a7-b136-bcb44cff5a3c@proxel.se
pgstat_initialize() is currently used by the WAL stats as a code path to
take some custom actions when a backend starts. A callback is added to
generalize the concept so as all stats kinds can do the same, for
builtin and custom kinds, if set.
Reviewed-by: Bertrand Drouvot, Kyotaro Horiguchi
Discussion: https://postgr.es/m/ZtZr1K4PLdeWclXY@paquier.xyz
When WindowAgg finished one partition of a PARTITION BY, it previously
would call tuplestore_end() to purge all the stored tuples before again
calling tuplestore_begin_heap() and carefully setting up all of the
tuplestore read pointers exactly as required for the given frameOptions.
Since the frameOptions don't change between partitions, this part does
not make much sense. For queries that had very few rows per partition,
the overhead of this was very large.
It seems much better to create the tuplestore and the read pointers once
and simply call tuplestore_clear() at the end of each partition.
tuplestore_clear() moves all of the read pointers back to the start
position and deletes all the previously stored tuples.
A simple test query with 1 million partitions and 1 tuple per partition
has been shown to run around 40% faster than without this change. The
additional effort seems to have mostly been spent in malloc/free.
Making this work required adding a new bool field to WindowAggState
which had the unfortunate effect of being the 9th bool field in a group
resulting in the struct being enlarged. Here we shuffle the fields
around a little so that the two bool fields for runcondition relating
stuff fit into existing padding. Also, move the "runcondition" field to
be near those. This frees up enough space with the other bool fields so
that the newly added one fits into the padding bytes. This was done to
address a very small but apparent performance regression with queries
containing a large number of rows per partition.
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Tatsuo Ishii <ishii@postgresql.org>
Discussion: https://postgr.es/m/CAHoyFK9n-QCXKTUWT_xxtXninSMEv%2BgbJN66-y6prM3f4WkEHw%40mail.gmail.com
This commit adds columns in view pg_stat_subscription_stats to show the
number of times a particular conflict type has occurred during the
application of logical replication changes. The following columns are
added:
confl_insert_exists:
Number of times a row insertion violated a NOT DEFERRABLE unique
constraint.
confl_update_origin_differs:
Number of times an update was performed on a row that was
previously modified by another origin.
confl_update_exists:
Number of times that the updated value of a row violates a
NOT DEFERRABLE unique constraint.
confl_update_missing:
Number of times that the tuple to be updated is missing.
confl_delete_origin_differs:
Number of times a delete was performed on a row that was
previously modified by another origin.
confl_delete_missing:
Number of times that the tuple to be deleted is missing.
The update_origin_differs and delete_origin_differs conflicts can be
detected only when track_commit_timestamp is enabled.
Author: Hou Zhijie
Reviewed-by: Shveta Malik, Peter Smith, Anit Kapila
Discussion: https://postgr.es/m/OS0PR01MB57160A07BD575773045FC214948F2@OS0PR01MB5716.jpnprd01.prod.outlook.com
Similarly to 2065ddf5e3, this introduces a define for "pg_tblspc".
This makes the style more consistent with the existing PG_STAT_TMP_DIR,
for example.
There is a difference with the other cases with the introduction of
PG_TBLSPC_DIR_SLASH, required in two places for recovery and backups.
Author: Bertrand Drouvot
Reviewed-by: Ashutosh Bapat, Álvaro Herrera, Yugo Nagata, Michael
Paquier
Discussion: https://postgr.es/m/ZryVvjqS9SnV1GPP@ip-10-97-1-34.eu-west-3.compute.internal
OpenSSL 1.0.2 has been EOL from the upstream OpenSSL project for
some time, and is no longer the default OpenSSL version with any
vendor which package PostgreSQL. By retiring support for OpenSSL
1.0.2 we can remove a lot of no longer required complexity for
managing state within libcrypto which is now handled by OpenSSL.
Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/ZG3JNursG69dz1lr@paquier.xyz
Discussion: https://postgr.es/m/CA+hUKGKh7QrYzu=8yWEUJvXtMVm_CNWH1L_TLWCbZMwbi1XP2Q@mail.gmail.com
Remove src/port/user.c, call getpwuid_r() directly. This reduces some
complexity and allows better control of the error behavior. For
example, the old code would in some circumstances silently truncate
the result string, or produce error message strings that the caller
wouldn't use.
src/port/user.c used to be called src/port/thread.c and contained
various portability complications to support thread-safety. These are
all obsolete, and all but the user-lookup functions have already been
removed. This patch completes this by also removing the user-lookup
functions.
Also convert src/backend/libpq/auth.c to use getpwuid_r() for
thread-safety.
Originally, I tried to be overly correct by using
sysconf(_SC_GETPW_R_SIZE_MAX) to get the buffer size for getpwuid_r(),
but that doesn't work on FreeBSD. All the OS where I could find the
source code internally use 1024 as the suggested buffer size, so I
just ended up hardcoding that. The previous code used BUFSIZ, which
is an unrelated constant from stdio.h, so its use seemed
inappropriate.
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://www.postgresql.org/message-id/flat/5f293da9-ceb4-4937-8e52-82c25db8e4d3%40eisentraut.org
This commit replaces most of the hardcoded values of "pg_replslot" by a
new PG_REPLSLOT_DIR #define. This makes the style more consistent with
the existing PG_STAT_TMP_DIR, for example. More places will follow a
similar change.
Author: Bertrand Drouvot
Reviewed-by: Ashutosh Bapat, Yugo Nagata, Michael Paquier
Discussion: https://postgr.es/m/ZryVvjqS9SnV1GPP@ip-10-97-1-34.eu-west-3.compute.internal
This commit removes log_cnt from the tuple returned by the SQL function.
This field is an internal counter that tracks when a WAL record should
be generated for a sequence, and it is reset each time the sequence is
restored or recovered. It is not necessary to rebuild the sequence DDL
commands for pg_dump and pg_upgrade where this function is used. The
field can still be queried with a scan of the "table" created
under-the-hood for a sequence.
Issue noticed while hacking on a feature that can rely on this new
function rather than pg_sequence_last_value(), aimed at making sequence
computation more easily pluggable.
Bump catalog version.
Reviewed-by: Nathan Bossart
Discussion: https://postgr.es/m/Zsvka3r-y2ZoXAdH@paquier.xyz
To make them follow the usual naming convention where
FoobarShmemSize() calculates the amount of shared memory needed by
Foobar subsystem, and FoobarShmemInit() performs the initialization.
I didn't rename CreateLWLocks() and InitShmmeIndex(), because they are
a little special. They need to be called before any of the other
ShmemInit() functions, because they set up the shared memory
bookkeeping itself. I also didn't rename InitProcGlobal(), because
unlike other Shmeminit functions, it's not called by individual
backends.
Reviewed-by: Andreas Karlsson
Discussion: https://www.postgresql.org/message-id/c09694ff-2453-47e5-b26c-32a16cd75ce6@iki.fi
Split the shared and local initialization to separate functions, and
follow the common naming conventions. With this, we no longer create
the LockMethodLocalHash hash table in the postmaster process, which
was always pointless.
Reviewed-by: Andreas Karlsson
Discussion: https://www.postgresql.org/message-id/c09694ff-2453-47e5-b26c-32a16cd75ce6@iki.fi
The conflict types 'update_differ' and 'delete_differ' indicate that a row
to be modified was previously altered by another origin. Rename those to
'update_origin_differs' and 'delete_origin_differs' to clarify their
meaning.
Author: Hou Zhijie
Reviewed-by: Shveta Malik, Peter Smith
Discussion: https://postgr.es/m/CAA4eK1+HEKwG_UYt4Zvwh5o_HoCKCjEGesRjJX38xAH3OxuuYA@mail.gmail.com
Use gmtime_r() and localtime_r() instead of gmtime() and localtime(),
for thread-safety.
There are a few affected calls in libpq and ecpg's libpgtypes, which
are probably effectively bugs, because those libraries already claim
to be thread-safe.
There is one affected call in the backend. Most of the backend
otherwise uses the custom functions pg_gmtime() and pg_localtime(),
which are implemented differently.
While we're here, change the call in the backend to gmtime*() instead
of localtime*(), since for that use time zone behavior is irrelevant,
and this side-steps any questions about when time zones are
initialized by localtime_r() vs localtime().
Portability: gmtime_r() and localtime_r() are in POSIX but are not
available on Windows. Windows has functions gmtime_s() and
localtime_s() that can fulfill the same purpose, so we add some small
wrappers around them. (Note that these *_s() functions are also
different from the *_s() functions in the bounds-checking extension of
C11. We are not using those here.)
On MinGW, you can get the POSIX-style *_r() functions by defining
_POSIX_C_SOURCE appropriately before including <time.h>. This leads
to a conflict at least in plpython because apparently _POSIX_C_SOURCE
gets defined in some header there, and then our replacement
definitions conflict with the system definitions. To avoid that sort
of thing, we now always define _POSIX_C_SOURCE on MinGW and use the
POSIX-style functions here.
Reviewed-by: Stepan Neretin <sncfmgg@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/eba1dc75-298e-4c46-8869-48ba8aad7d70@eisentraut.org
Currently, createPartitionTable() opens newly created table using its name.
This approach is prone to privilege escalation attack, because we might end
up opening another table than we just created.
This commit address the issue above by opening newly created table by its
OID. It appears to be tricky to get a relation OID out of ProcessUtility().
We have to extend TableLikeClause with new newRelationOid field, which is
filled within ProcessUtility() to be further accessed by caller.
Security: CVE-2014-0062
Reported-by: Noah Misch
Discussion: https://postgr.es/m/20240808171351.a9.nmisch%40google.com
Reviewed-by: Pavel Borisov, Dmitry Koval
Two syscache identifiers are added for extension names and OIDs.
Shared libraries of extensions might want to invalidate or update their
own caches whenever a CREATE, ALTER or DROP EXTENSION command is run for
their extension (in any backend). Right now this is non-trivial to do
correctly and efficiently, but, if an extension catalog is part of a
syscache, this could simply be done by registering an callback using
CacheRegisterSyscacheCallback for the relevant syscache.
Another case where this is useful is a loaded library where some of its
code paths rely on some objects of the extension to exist; it can be
simpler and more efficient to do an existence check directly on the
extension through the syscache.
Author: Jelte Fennema-Nio
Reviewed-by: Alexander Korotkov, Pavel Stehule
Discussion: https://postgr.es/m/CAGECzQTWm9sex719Hptbq4j56hBGUti7J9OWjeMobQ1ccRok9w@mail.gmail.com
Now that disable_cost is not included in the cost estimate, there's
no visible sign in EXPLAIN output of which plan nodes are disabled.
Fix that by propagating the number of disabled nodes from Path to
Plan, and then showing it in the EXPLAIN output.
There is some question about whether this is a desirable change.
While I personally believe that it is, it seems best to make it a
separate commit, in case we decide to back out just this part, or
rework it.
Reviewed by Andres Freund, Heikki Linnakangas, and David Rowley.
Discussion: http://postgr.es/m/CA+TgmoZ_+MS+o6NeGK2xyBv-xM+w1AfFVuHE4f_aq6ekHv7YSQ@mail.gmail.com
Previously, when a path type was disabled by e.g. enable_seqscan=false,
we either avoided generating that path type in the first place, or
more commonly, we added a large constant, called disable_cost, to the
estimated startup cost of that path. This latter approach can distort
planning. For instance, an extremely expensive non-disabled path
could seem to be worse than a disabled path, especially if the full
cost of that path node need not be paid (e.g. due to a Limit).
Or, as in the regression test whose expected output changes with this
commit, the addition of disable_cost can make two paths that would
normally be distinguishible in cost seem to have fuzzily the same cost.
To fix that, we now count the number of disabled path nodes and
consider that a high-order component of both the startup cost and the
total cost. Hence, the path list is now sorted by disabled_nodes and
then by total_cost, instead of just by the latter, and likewise for
the partial path list. It is important that this number is a count
and not simply a Boolean; else, as soon as we're unable to respect
disabled path types in all portions of the path, we stop trying to
avoid them where we can.
Because the path list is now sorted by the number of disabled nodes,
the join prechecks must compute the count of disabled nodes during
the initial cost phase instead of postponing it to final cost time.
Counts of disabled nodes do not cross subquery levels; at present,
there is no reason for them to do so, since the we do not postpone
path selection across subquery boundaries (see make_subplan).
Reviewed by Andres Freund, Heikki Linnakangas, and David Rowley.
Discussion: http://postgr.es/m/CA+TgmoZ_+MS+o6NeGK2xyBv-xM+w1AfFVuHE4f_aq6ekHv7YSQ@mail.gmail.com
We advance origin progress during abort on successful streaming and
application of ROLLBACK in parallel streaming mode. But the origin
shouldn't be advanced during an error or unsuccessful apply due to
shutdown. Otherwise, it will result in a transaction loss as such a
transaction won't be sent again by the server.
Reported-by: Hou Zhijie
Author: Hayato Kuroda and Shveta Malik
Reviewed-by: Amit Kapila
Backpatch-through: 16
Discussion: https://postgr.es/m/TYAPR01MB5692FAC23BE40C69DA8ED4AFF5B92@TYAPR01MB5692.jpnprd01.prod.outlook.com
ab02d702ef has removed from the backend the code able to support the
unloading of modules, because this has never worked. This removes the
last references to _PG_fini(), that could be used as a callback for
modules to manipulate the stack when unloading a library.
The test module ldap_password_func had the idea to declare it, doing
nothing. The function declaration in fmgr.h is gone.
It was left around in 2022 to avoid breaking extension code, but at this
stage there are also benefits in letting extension developers know that
keeping the unloading code is pointless and this move leads to less
maintenance.
Reviewed-by: Tom Lane, Heikki Linnakangas
Discussion: https://postgr.es/m/ZsQfi0AUJoMF6NSd@paquier.xyz
This patch provides the additional logging information in the following
conflict scenarios while applying changes:
insert_exists: Inserting a row that violates a NOT DEFERRABLE unique constraint.
update_differ: Updating a row that was previously modified by another origin.
update_exists: The updated row value violates a NOT DEFERRABLE unique constraint.
update_missing: The tuple to be updated is missing.
delete_differ: Deleting a row that was previously modified by another origin.
delete_missing: The tuple to be deleted is missing.
For insert_exists and update_exists conflicts, the log can include the origin
and commit timestamp details of the conflicting key with track_commit_timestamp
enabled.
update_differ and delete_differ conflicts can only be detected when
track_commit_timestamp is enabled on the subscriber.
We do not offer additional logging for exclusion constraint violations because
these constraints can specify rules that are more complex than simple equality
checks. Resolving such conflicts won't be straightforward. This area can be
further enhanced if required.
Author: Hou Zhijie
Reviewed-by: Shveta Malik, Amit Kapila, Nisha Moond, Hayato Kuroda, Dilip Kumar
Discussion: https://postgr.es/m/OS0PR01MB5716352552DFADB8E9AD1D8994C92@OS0PR01MB5716.jpnprd01.prod.outlook.com
Here we add ExprState support for obtaining a 32-bit hash value from a
list of expressions. This allows both faster hashing and also JIT
compilation of these expressions. This is especially useful when hash
joins have multiple join keys as the previous code called ExecEvalExpr on
each hash join key individually and that was inefficient as tuple
deformation would have only taken into account one key at a time, which
could lead to walking the tuple once for each join key. With the new
code, we'll determine the maximum attribute required and deform the tuple
to that point only once.
Some performance tests done with this change have shown up to a 20%
performance increase of a query containing a Hash Join without JIT
compilation and up to a 26% performance increase when JIT is enabled and
optimization and inlining were performed by the JIT compiler. The
performance increase with 1 join column was less with a 14% increase
with and without JIT. This test was done using a fairly small hash
table and a large number of hash probes. The increase will likely be
less with large tables, especially ones larger than L3 cache as memory
pressure is more likely to be the limiting factor there.
This commit only addresses Hash Joins, but lays expression evaluation
and JIT compilation infrastructure for other hashing needs such as Hash
Aggregate.
Author: David Rowley
Reviewed-by: Alexey Dvoichenkov <alexey@hyperplane.net>
Reviewed-by: Tels <nospam-pg-abuse@bloodgate.com>
Discussion: https://postgr.es/m/CAApHDvoexAxgQFNQD_GRkr2O_eJUD1-wUGm%3Dm0L%2BGc%3DT%3DkEa4g%40mail.gmail.com
This commit attempts to update a few places, such as the money,
numeric, and timestamp types, to no longer rely on signed integer
wrapping for correctness. This is intended to move us closer
towards removing -fwrapv, which may enable some compiler
optimizations. However, there is presently no plan to actually
remove that compiler option in the near future.
Besides using some of the existing overflow-aware routines in
int.h, this commit introduces and makes use of some new ones.
Specifically, it adds functions that accept a signed integer and
return its absolute value as an unsigned integer with the same
width (e.g., pg_abs_s64()). It also adds functions that accept an
unsigned integer, store the result of negating that integer in a
signed integer with the same width, and return whether the negation
overflowed (e.g., pg_neg_u64_overflow()).
Finally, this commit adds a couple of tests for timestamps near
POSTGRES_EPOCH_JDATE.
Author: Joseph Koshakow
Reviewed-by: Tom Lane, Heikki Linnakangas, Jian He
Discussion: https://postgr.es/m/CAAvxfHdBPOyEGS7s%2Bxf4iaW0-cgiq25jpYdWBqQqvLtLe_t6tw%40mail.gmail.com
Attempting to add a system column for a table to an existing publication
would result in the not very intuitive error message of:
ERROR: negative bitmapset member not allowed
Here we improve that to have it display the same error message as a user
would see if they tried adding a system column for a table when adding
it to the publication in the first place.
Doing this requires making the function which validates the list of
columns an extern function. The signature of the static function wasn't
an ideal external API as it made the code more complex than it needed to be.
Here we adjust the function to have it populate a Bitmapset of attribute
numbers. Doing it this way allows code simplification.
There was no particular bug here other than the weird error message, so
no backpatch.
Bug: #18558
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Author: Peter Smith, David Rowley
Discussion: https://postgr.es/m/18558-411bc81b03592125@postgresql.org
According to the commit message in 8ec569479, we must have all variables
in header files marked with PGDLLIMPORT. In commit d3cc5ffe81 some
variables were moved from launch_backend.c file to several header files.
This adds PGDLLIMPORT to moved variables.
Author: Sofia Kopikova <s.kopikova@postgrespro.ru>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/e0b17014-5319-4dd6-91cd-93d9c8fc9539%40postgrespro.ru
The TRACE_SORT macro guarded the availability of the trace_sort GUC
setting. But it has been enabled by default ever since it was
introduced in PostgreSQL 8.1, and there have been no reports that
someone wanted to disable it. So just remove the macro to simplify
things. (For the avoidance of doubt: The trace_sort GUC is still
there. This only removes the rarely-used macro guarding it.)
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://www.postgresql.org/message-id/flat/be5f7162-7c1d-44e3-9a78-74dcaa6529f2%40eisentraut.org
Previously, (auto)analyze used global variables VacuumPageHit,
VacuumPageMiss, and VacuumPageDirty to track buffer usage. However,
pgBufferUsage provides a more generic way to track buffer usage with
support functions.
This change replaces those global variables with pgBufferUsage in
analyze. Since analyze was the sole user of those variables, it
removes their declarations. Vacuum previously used those variables but
replaced them with pgBufferUsage as part of a bug fix, commit
5cd72cc0c.
Additionally, it adjusts the buffer usage message in both vacuum and
analyze for better consistency.
Author: Anthonin Bonnefoy
Reviewed-by: Masahiko Sawada, Michael Paquier
Discussion: https://postgr.es/m/CAO6_Xqr__kTTCLkftqS0qSCm-J7_xbRG3Ge2rWhucxQJMJhcRA%40mail.gmail.com
We've had code for CRC-32 and CRC-32C for some time (for WAL
records, etc.), but there was no way for users to call it, despite
apparent popular demand. The new crc32() and crc32c() functions
accept bytea input and return bigint (to avoid returning negative
values).
Bumps catversion.
Author: Aleksander Alekseev
Reviewed-by: Peter Eisentraut, Tom Lane
Discussion: https://postgr.es/m/CAJ7c6TNMTGnqnG%3DyXXUQh9E88JDckmR45H2Q%2B%3DucaCLMOW1QQw%40mail.gmail.com
32d3ed816 added the "path" column to pg_backend_memory_contexts to allow
a stable method of obtaining the parent MemoryContext of a given row in
the view. Using the "path" column is now the preferred method of
obtaining the parent row.
Previously, any queries which were self-joining to this view using the
"name" and "parent" columns could get incorrect results due to the fact
that names are not unique. Here we aim to explicitly break such queries
so that they can be corrected and use the "path" column instead.
It is possible that there are more innocent users of the parent column
that just need an indication of the parent and having to write out a
self-joining CTE may be an unnecessary hassle for those cases. Let's
remove the column for now and see if anyone comes back with any
complaints. This does seem like a good time to attempt to get rid of
the column as we still have around 1 year to revert this if someone comes
back with a valid complaint. Plus this view is new to v14 and is quite
niche, so perhaps not many people will be affected.
Author: Melih Mutlu <m.melihmutlu@gmail.com>
Discussion: https://postgr.es/m/CAGPVpCT7NOe4fZXRL8XaoxHpSXYTu6GTpULT_3E-HT9hzjoFRA@mail.gmail.com
The code intends to allow GUCs to be set within parallel workers
via function SET clauses, but not otherwise. However, doing so fails
for "session_authorization" and "role", because the assign hooks for
those attempt to set the subsidiary "is_superuser" GUC, and that call
falls foul of the "not otherwise" prohibition. We can't switch to
using GUC_ACTION_SAVE for this, so instead add a new GUC variable
flag GUC_ALLOW_IN_PARALLEL to mark is_superuser as being safe to set
anyway. (This is okay because is_superuser has context PGC_INTERNAL
and thus only hard-wired calls can change it. We'd need more thought
before applying the flag to other GUCs; but maybe there are other
use-cases.) This isn't the prettiest fix perhaps, but other
alternatives we thought of would be much more invasive.
While here, correct a thinko in commit 059de3ca4: when rejecting
a GUC setting within a parallel worker, we should return 0 not -1
if the ereport doesn't longjmp. (This seems to have no consequences
right now because no caller cares, but it's inconsistent.) Improve
the comments to try to forestall future confusion of the same kind.
Despite the lack of field complaints, this seems worth back-patching.
Thanks to Nathan Bossart for the idea to invent a new flag,
and for review.
Discussion: https://postgr.es/m/2833457.1723229039@sss.pgh.pa.us
Currently, when a child process exits, the postmaster first scans
through BackgroundWorkerList, to see if it the child process was a
background worker. If not found, then it scans through BackendList to
see if it was a regular backend. That leads to some duplication
between the bgworker and regular backend cleanup code, as both have an
entry in the BackendList that needs to be cleaned up in the same way.
Refactor that so that we scan just the BackendList to find the child
process, and if it was a background worker, do the additional
bgworker-specific cleanup in addition to the normal Backend cleanup.
Change HandleChildCrash so that it doesn't try to handle the cleanup
of the process that already exited, only the signaling of all the
other processes. When called for any of the aux processes, the caller
had already cleared the *PID global variable, so the code in
HandleChildCrash() to do that was unused.
On Windows, if a child process exits with ERROR_WAIT_NO_CHILDREN, it's
now logged with that exit code, instead of 0. Also, if a bgworker
exits with ERROR_WAIT_NO_CHILDREN, it's now treated as crashed and is
restarted. Previously it was treated as a normal exit.
If a child process is not found in the BackendList, the log message
now calls it "untracked child process" rather than "server process".
Arguably that should be a PANIC, because we do track all the child
processes in the list, so failing to find a child process is highly
unexpected. But if we want to change that, let's discuss and do that
as a separate commit.
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://www.postgresql.org/message-id/835232c0-a5f7-4f20-b95b-5b56ba57d741@iki.fi
This allows ForgetBackgroundWorker() and ReportBackgroundWorkerExit()
to take a RegisteredBgWorker pointer as argument, rather than a list
iterator. That feels a little more natural. But more importantly, this
paves the way for more refactoring in the next commit.
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://www.postgresql.org/message-id/835232c0-a5f7-4f20-b95b-5b56ba57d741@iki.fi
This used to be part of CREATE OPERATOR CLASS and ALTER OPERATOR
FAMILY, but it has done nothing (except issue a NOTICE) since
PostgreSQL 8.4. Commit 30e7c175b8 removed support for dumping from
pre-9.2 servers, so this no longer serves any need.
This now removes it completely, and you'd get a normal parse error if
you used it.
Reviewed-by: Aleksander Alekseev <aleksander@timescale.com>
Discussion: https://www.postgresql.org/message-id/flat/113ef2d2-3657-4353-be97-f28fceddbca1%40eisentraut.org
Make it clear that "astreamer" stands for "archive streamer".
Generalize comments that still believe this code can only be used
by pg_basebackup. Add some comments explaining the asymmetry
between the gzip, lz4, and zstd astreamers, in the hopes of making
life easier for anyone who hacks on this code in the future.
Robert Haas, reviewed by Amul Sul.
Discussion: http://postgr.es/m/CAAJ_b97O2kkKVTWxt8MxDN1o-cDfbgokqtiN2yqFf48=gXpcxQ@mail.gmail.com
This new function iterates hash entries with given hash values. This function
is designed to avoid full sequential hash search in the syscache invalidation
callbacks.
Discussion: https://postgr.es/m/5812a6e5-68ae-4d84-9d85-b443176966a1%40sigaev.ru
Author: Teodor Sigaev
Reviewed-by: Aleksander Alekseev, Tom Lane, Michael Paquier, Roman Zharkov
Reviewed-by: Andrei Lepikhov
I started by marking VoidString as const, and fixing the fallout by
marking more fields and function arguments as const. It proliferated
quite a lot, but all within spell.c and spell.h.
A more narrow patch to get rid of the static VoidString buffer would
be to replace it with '#define VoidString ""', as C99 allows assigning
"" to a non-const pointer, even though you're not allowed to modify
it. But it seems like good hygiene to mark all these as const. In the
structs, the pointers can point to the constant VoidString, or a
buffer allocated with palloc(), or with compact_palloc(), so you
should not modify them.
Reviewed-by: Andres Freund
Discussion: https://www.postgresql.org/message-id/54c29fb0-edf2-48ea-9814-44e918bbd6e8@iki.fi
There was no need for these to be static buffers, local variables work
just as well. I think they were marked as 'static' to imply that they
are read-only, but 'const' is more appropriate for that, so change
them to const.
To make it possible to mark the variables as 'const', also add 'const'
decorations to the transformRelOptions() signature.
Reviewed-by: Andres Freund
Discussion: https://www.postgresql.org/message-id/54c29fb0-edf2-48ea-9814-44e918bbd6e8@iki.fi
When pg_dump retrieves the list of database objects and performs the
data dump, there was possibility that objects are replaced with others
of the same name, such as views, and access them. This vulnerability
could result in code execution with superuser privileges during the
pg_dump process.
This issue can arise when dumping data of sequences, foreign
tables (only 13 or later), or tables registered with a WHERE clause in
the extension configuration table.
To address this, pg_dump now utilizes the newly introduced
restrict_nonsystem_relation_kind GUC parameter to restrict the
accesses to non-system views and foreign tables during the dump
process. This new GUC parameter is added to back branches too, but
these changes do not require cluster recreation.
Back-patch to all supported branches.
Reviewed-by: Noah Misch
Security: CVE-2024-7348
Backpatch-through: 12
This is useful for extensions to get snapshot and shmem data for custom
cumulative statistics when these have a fixed number of objects, so as
these do not need to know about the snapshot internals, aka pgStatLocal.
An upcoming commit introducing an example template for custom cumulative
stats with fixed-numbered objects will make use of these. I have
noticed that this is useful for extension developers while hacking my
own example, actually.
Author: Michael Paquier
Reviewed-by: Dmitry Dolgov, Bertrand Drouvot
Discussion: https://postgr.es/m/Zmqm9j5EO0I4W8dx@paquier.xyz
This commit adds support in the backend for $subject, allowing
out-of-core extensions to plug their own custom kinds of cumulative
statistics. This feature has come up a few times into the lists, and
the first, original, suggestion came from Andres Freund, about
pg_stat_statements to use the cumulative statistics APIs in shared
memory rather than its own less efficient internals. The advantage of
this implementation is that this can be extended to any kind of
statistics.
The stats kinds are divided into two parts:
- The in-core "builtin" stats kinds, with designated initializers, able
to use IDs up to 128.
- The "custom" stats kinds, able to use a range of IDs from 128 to 256
(128 slots available as of this patch), with information saved in
TopMemoryContext. This can be made larger, if necessary.
There are two types of cumulative statistics in the backend:
- For fixed-numbered objects (like WAL, archiver, etc.). These are
attached to the snapshot and pgstats shmem control structures for
efficiency, and built-in stats kinds still do that to avoid any
redirection penalty. The data of custom kinds is stored in a first
array in snapshot structure and a second array in the shmem control
structure, both indexed by their ID, acting as an equivalent of the
builtin stats.
- For variable-numbered objects (like tables, functions, etc.). These
are stored in a dshash using the stats kind ID in the hash lookup key.
Internally, the handling of the builtin stats is unchanged, and both
fixed and variabled-numbered objects are supported. Structure
definitions for builtin stats kinds are renamed to reflect better the
differences with custom kinds.
Like custom RMGRs, custom cumulative statistics can only be loaded with
shared_preload_libraries at startup, and must allocate a unique ID
shared across all the PostgreSQL extension ecosystem with the following
wiki page to avoid conflicts:
https://wiki.postgresql.org/wiki/CustomCumulativeStats
This makes the detection of the stats kinds and their handling when
reading and writing stats much easier than, say, allocating IDs for
stats kinds from a shared memory counter, that may change the ID used by
a stats kind across restarts. When under development, extensions can
use PGSTAT_KIND_EXPERIMENTAL.
Two examples that can be used as templates for fixed-numbered and
variable-numbered stats kinds will be added in some follow-up commits,
with tests to provide coverage.
Some documentation is added to explain how to use this plugin facility.
Author: Michael Paquier
Reviewed-by: Dmitry Dolgov, Bertrand Drouvot
Discussion: https://postgr.es/m/Zmqm9j5EO0I4W8dx@paquier.xyz
pg_wal_replay_wait() is to be used on standby and specifies waiting for
the specific WAL location to be replayed. This option is useful when
the user makes some data changes on primary and needs a guarantee to see
these changes are on standby.
The queue of waiters is stored in the shared memory as an LSN-ordered pairing
heap, where the waiter with the nearest LSN stays on the top. During
the replay of WAL, waiters whose LSNs have already been replayed are deleted
from the shared memory pairing heap and woken up by setting their latches.
pg_wal_replay_wait() needs to wait without any snapshot held. Otherwise,
the snapshot could prevent the replay of WAL records, implying a kind of
self-deadlock. This is why it is only possible to implement
pg_wal_replay_wait() as a procedure working without an active snapshot,
not a function.
Catversion is bumped.
Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan, Alexander Korotkov
Reviewed-by: Michael Paquier, Peter Eisentraut, Dilip Kumar, Amit Kapila
Reviewed-by: Alexander Lakhin, Bharath Rupireddy, Euler Taveira
Reviewed-by: Heikki Linnakangas, Kyotaro Horiguchi
A follow-up patch is planned to make cumulative statistics pluggable,
and using a type is useful in the internal routines used by pgstats as
PgStat_Kind may have a value that was not originally in the enum removed
here, once made pluggable.
While on it, this commit switches pgstat_is_kind_valid() to use
PgStat_Kind rather than an int, to be more consistent with its existing
callers. Some loops based on the stats kind IDs are switched to use
PgStat_Kind rather than int, for consistency with the new time.
Author: Michael Paquier
Reviewed-by: Dmitry Dolgov, Bertrand Drouvot
Discussion: https://postgr.es/m/Zmqm9j5EO0I4W8dx@paquier.xyz
This is used in the startup process to check that the pgstats file we
are reading includes the redo LSN referring to the shutdown checkpoint
where it has been written. The redo LSN in the pgstats file needs to
match with what the control file has.
This is intended to be used for an upcoming change that will extend the
write of the stats file to happen during checkpoints, rather than only
shutdown sequences.
Bump PGSTAT_FILE_FORMAT_ID.
Reviewed-by: Bertrand Drouvot
Discussion: https://postgr.es/m/Zp8o6_cl0KSgsnvS@paquier.xyz
This converts
COPY_PARSE_PLAN_TREES
WRITE_READ_PARSE_PLAN_TREES
RAW_EXPRESSION_COVERAGE_TEST
into run-time parameters
debug_copy_parse_plan_trees
debug_write_read_parse_plan_trees
debug_raw_expression_coverage_test
They can be activated for tests using PG_TEST_INITDB_EXTRA_OPTS.
The compile-time symbols are kept for build farm compatibility, but
they now just determine the default value of the run-time settings.
Furthermore, support for these settings is not compiled in at all
unless assertions are enabled, or the new symbol
DEBUG_NODE_TESTS_ENABLED is defined at compile time, or any of the
legacy compile-time setting symbols are defined. So there is no
run-time overhead in production builds. (This is similar to the
handling of DISCARD_CACHES_ENABLED.)
Discussion: https://www.postgresql.org/message-id/flat/30747bd8-f51e-4e0c-a310-a6e2c37ec8aa%40eisentraut.org
Until now we generated an ExprState for each parameter to a SubPlan and
evaluated them one-by-one ExecScanSubPlan. That's sub-optimal as creating lots
of small ExprStates
a) makes JIT compilation more expensive
b) wastes memory
c) is a bit slower to execute
This commit arranges to evaluate parameters to a SubPlan as part of the
ExprState referencing a SubPlan, using the new EEOP_PARAM_SET expression
step. We emit one EEOP_PARAM_SET for each argument to a subplan, just before
the EEOP_SUBPLAN step.
It likely is worth using EEOP_PARAM_SET in other places as well, e.g. for
SubPlan outputs, nestloop parameters and - more ambitiously - to get rid of
ExprContext->domainValue/caseValue/ecxt_agg*. But that's for later.
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Alena Rybakina <lena.ribackina@yandex.ru>
Discussion: https://postgr.es/m/20230225214401.346ancgjqc3zmvek@awork3.anarazel.de
RefreshMatviewByOid is used for both REFRESH and CREATE MATERIALIZED
VIEW. This flag is currently just used for handling internal error
messages, but also aimed to improve code-readability.
Author: Yugo Nagata
Discussion: https://postgr.es/m/20240726122630.70e889f63a4d7e26f8549de8@sraoss.co.jp
This new function returns the data for the given sequence, i.e.,
the values within the sequence tuple. Since this function is a
substitute for SELECT from the sequence, the SELECT privilege is
required on the sequence in question. It returns all NULLs for
sequences for which we lack privileges, other sessions' temporary
sequences, and unlogged sequences on standbys.
This function is primarily intended for use by pg_dump in a
follow-up commit that will use it to optimize dumpSequenceData().
Like pg_sequence_last_value(), which is a support function for the
pg_sequences system view, pg_sequence_read_tuple() is left
undocumented.
Bumps catversion.
Reviewed-by: Michael Paquier, Tom Lane
Discussion: https://postgr.es/m/20240503025140.GA1227404%40nathanxps13
Commit 9d9b9d46f3 removed the function (or rather, moved it to a
different source file and renamed it to SendCancelRequest), but forgot
the declaration in the header file.
Previously we had a fallback implementation that made a harmless system
call, based on the assumption that system calls must contain a memory
barrier. That shouldn't be reached on any current system, and it seems
highly likely that we can easily find out how to request explicit memory
barriers, if we've already had to find out how to do atomics on a
hypothetical new system.
Removed comments and a function name referred to a spinlock used for
fallback memory barriers, but that changed in 1b468a13, which left some
misleading words behind in a few places.
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Suggested-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/721bf39a-ed8a-44b0-8b8e-be3bd81db748%40technowledgy.de
Discussion: https://postgr.es/m/3351991.1697728588%40sss.pgh.pa.us
Previously we had a fallback implementation of pg_compiler_barrier()
that called an empty function across a translation unit boundary so the
compiler couldn't see what it did. That shouldn't be needed on any
current systems, and might not even work with a link time optimizer.
Since we now require compiler-specific knowledge of how to implement
atomics, we should also know how to implement compiler barriers on a
hypothetical new system.
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Suggested-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/721bf39a-ed8a-44b0-8b8e-be3bd81db748%40technowledgy.de
Discussion: https://postgr.es/m/3351991.1697728588%40sss.pgh.pa.us
Modern versions of all relevant architectures and tool chains have
atomics support. Since edadeb07, there is no remaining reason to carry
code that simulates atomic flags and uint32 imperfectly with spinlocks.
64 bit atomics are still emulated with spinlocks, if needed, for now.
Any modern compiler capable of implementing C11 <stdatomic.h> must have
the underlying operations we need, though we don't require C11 yet. We
detect certain compilers and architectures, so hypothetical new systems
might need adjustments here.
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> (concept, not the patch)
Reviewed-by: Andres Freund <andres@anarazel.de> (concept, not the patch)
Discussion: https://postgr.es/m/3351991.1697728588%40sss.pgh.pa.us
A later change will require atomic support, so it wouldn't make sense
for a hypothetical new system not to be able to implement spinlocks.
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> (concept, not the patch)
Reviewed-by: Andres Freund <andres@anarazel.de> (concept, not the patch)
Discussion: https://postgr.es/m/3351991.1697728588%40sss.pgh.pa.us
Now that the result of pg_newlocale_from_collation() is always
non-NULL, then we can move the collate_is_c and ctype_is_c flags into
pg_locale_t. That simplifies the logic in lc_collate_is_c() and
lc_ctype_is_c(), removing the dependence on setlocale().
This commit also eliminates the multi-stage initialization of the
collation cache.
As long as we have catalog access, then it's now safe to call
pg_newlocale_from_collation() without checking lc_collate_is_c()
first.
Discussion: https://postgr.es/m/cfd9eb85-c52a-4ec9-a90e-a5e4de56e57d@eisentraut.org
Reviewed-by: Peter Eisentraut, Andreas Karlsson
To determine if the two relations being joined can use partitionwise
join, we need to verify the existence of equi-join conditions
involving pairs of matching partition keys for all partition keys.
Currently we do that by looking through the join's restriction
clauses. However, it has been discovered that this approach is
insufficient, because there might be partition keys known equal by a
specific EC, but they do not form a join clause because it happens
that other members of the EC than the partition keys are constrained
to become a join clause.
To address this issue, in addition to examining the join's restriction
clauses, we also check if any partition keys are known equal by ECs,
by leveraging function exprs_known_equal(). To accomplish this, we
enhance exprs_known_equal() to check equality per the semantics of the
opfamily, if provided.
It could be argued that exprs_known_equal() could be called O(N^2)
times, where N is the number of partition key expressions, resulting
in noticeable performance costs if there are a lot of partition key
expressions. But I think this is not a problem. The number of a
joinrel's partition key expressions would only be equal to the join
degree, since each base relation within the join contributes only one
partition key expression. That is to say, it does not scale with the
number of partitions. A benchmark with a query involving 5-way joins
of partitioned tables, each with 3 partition keys and 1000 partitions,
shows that the planning time is not significantly affected by this
patch (within the margin of error), particularly when compared to the
impact caused by partitionwise join.
Thanks to Tom Lane for the idea of leveraging exprs_known_equal() to
check if partition keys are known equal by ECs.
Author: Richard Guo, Tom Lane
Reviewed-by: Tom Lane, Ashutosh Bapat, Robert Haas
Discussion: https://postgr.es/m/CAN_9JTzo_2F5dKLqXVtDX5V6dwqB0Xk+ihstpKEt3a1LT6X78A@mail.gmail.com
The current method of coercing the boolean result value of
JsonPathExists() to the target type specified for an EXISTS column,
which is to call the type's input function via json_populate_type(),
leads to an error when the target type is integer, because the
integer input function doesn't recognize boolean literal values as
valid.
Instead use the boolean-to-integer cast function for coercion in that
case so that using integer or domains thereof as type for EXISTS
columns works. Note that coercion for ON ERROR values TRUE and FALSE
already works like that because the parser creates a cast expression
including the cast function, but the coercion of the actual result
value is not handled by the parser.
Tests by Jian He.
Reported-by: Jian He <jian.universality@gmail.com>
Author: Jian He <jian.universality@gmail.com>
Author: Amit Langote <amitlangote09@gmail.com>
Discussion: https://postgr.es/m/CACJufxEo4sUjKCYtda0_qt9tazqqKPmF1cqhW9KBOUeJFqQd2g@mail.gmail.com
Backpatch-through: 17
Move responsibility of generating the cancel key to the backend
process. The cancel key is now generated after forking, and the
backend advertises it in the ProcSignal array. When a cancel request
arrives, the backend handling it scans the ProcSignal array to find
the target pid and cancel key. This is similar to how this previously
worked in the EXEC_BACKEND case with the ShmemBackendArray, just
reusing the ProcSignal array.
One notable change is that we no longer generate cancellation keys for
non-backend processes. We generated them before just to prevent a
malicious user from canceling them; the keys for non-backend processes
were never actually given to anyone. There is now an explicit flag
indicating whether a process has a valid key or not.
I wrote this originally in preparation for supporting longer cancel
keys, but it's a nice cleanup on its own.
Reviewed-by: Jelte Fennema-Nio
Discussion: https://www.postgresql.org/message-id/508d0505-8b7a-4864-a681-e7e5edfe32aa@iki.fi
In try_partitionwise_join, we aim to break down the join between two
partitioned relations into joins between matching partitions. To
achieve this, we iterate through each pair of partitions from the two
joining relations and create child-join relations for them. With
potentially thousands of partitions, the local objects allocated in
each iteration can accumulate significant memory usage. Therefore, we
opt to eagerly free these local objects at the end of each iteration.
In line with this approach, this patch frees the bitmap set that
represents the relids of child-join relations at the end of each
iteration. Additionally, it modifies build_child_join_rel() to reuse
the AppendRelInfo structures generated within each iteration.
Author: Ashutosh Bapat
Reviewed-by: David Christensen, Richard Guo
Discussion: https://postgr.es/m/CAExHW5s4EqY43oB=ne6B2=-xLgrs9ZGeTr1NXwkGFt2j-OmaQQ@mail.gmail.com
There were quite a few places where we either had a non-NUL-terminated
string or a text Datum which we needed to call escape_json() on. Many of
these places required that a temporary string was created due to the fact
that escape_json() needs a NUL-terminated cstring. For text types, those
first had to be converted to cstring before calling escape_json() on them.
Here we introduce two new functions to make escaping JSON more optimal:
escape_json_text() can be given a text Datum to append onto the given
buffer. This is more optimal as it foregoes the need to convert the text
Datum into a cstring. A temporary allocation is only required if the text
Datum needs to be detoasted.
escape_json_with_len() can be used when the length of the cstring is
already known or the given string isn't NUL-terminated. Having this
allows various places which were creating a temporary NUL-terminated
string to just call escape_json_with_len() without any temporary memory
allocations.
Discussion: https://postgr.es/m/CAApHDvpLXwMZvbCKcdGfU9XQjGCDm7tFpRdTXuB9PVgpNUYfEQ@mail.gmail.com
Reviewed-by: Melih Mutlu, Heikki Linnakangas
When a standby is promoted, CleanupAfterArchiveRecovery() may decide
to rename the final WAL file from the old timeline by adding ".partial"
to the name. If WAL summarization is enabled and this file is renamed
before its partial contents are summarized, WAL summarization breaks:
the summarizer gets stuck at that point in the WAL stream and just
errors out.
To fix that, first make the startup process wait for WAL summarization
to catch up before renaming the file. Generally, this should be quick,
and if it's not, the user can shut off summarize_wal and try again.
To make this fix work, also teach the WAL summarizer that after a
promotion has occurred, no more WAL can appear on the previous
timeline: previously, the WAL summarizer wouldn't switch to the new
timeline until we actually started writing WAL there, but that meant
that when the startup process was waiting for the WAL summarizer, it
was waiting for an action that the summarizer wasn't yet prepared to
take.
In the process of fixing these bugs, I realized that the logic to wait
for WAL summarization to catch up was spread out in a way that made
it difficult to reuse properly, so this code refactors things to make
it easier.
Finally, add a test case that would have caught this bug and the
previously-fixed bug that WAL summarization sometimes needs to back up
when the timeline changes.
Discussion: https://postgr.es/m/CA+TgmoZGEsZodXC4f=XZNkAeyuDmWTSkpkjCEOcF19Am0mt_OA@mail.gmail.com
Commit 274bbced85 accidentally placed the pg_config.h.in
for SSL_CTX_set_num_tickets on the wrong line wrt where autoheader
places it. Fix by re-arranging and backpatch to the same level as
the original commit.
Reported-by: Marina Polyakova <m.polyakova@postgrespro.ru>
Discussion: https://postgr.es/m/48cebe8c3eaf308bae253b1dbf4e4a75@postgrespro.ru
Backpatch-through: v12
The new test tests the libpq fallback behavior on an early error,
which was fixed in the previous commit.
This adds an IS_INJECTION_POINT_ATTACHED() macro, to allow writing
injected test code alongside the normal source code. In principle, the
new test could've been implemented by an extra test module with a
callback that sets the FrontendProtocol global variable, but I think
it's more clear to have the test code right where the injection point
is, because it has pretty intimate knowledge of the surrounding
context it runs in.
Reviewed-by: Michael Paquier
Discussion: https://www.postgresql.org/message-id/CAOYmi%2Bnwvu21mJ4DYKUa98HdfM_KZJi7B1MhyXtnsyOO-PB6Ww%40mail.gmail.com
Commit 86db52a506 changed the locking of injection points to use only
atomic ops and spinlocks, to make it possible to define injection
points in processes that don't have a PGPROC entry (yet). However, it
didn't work in EXEC_BACKEND mode, because the pointer to shared memory
area was not initialized until the process "attaches" to all the
shared memory structs. To fix, pass the pointer to the child process
along with other global variables that need to be set up early.
Backpatch-through: 17
OpenSSL supports two types of session tickets for TLSv1.3, stateless
and stateful. The option we've used only turns off stateless tickets
leaving stateful tickets active. Use the new API introduced in 1.1.1
to disable all types of tickets.
Backpatch to all supported versions.
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reported-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/20240617173803.6alnafnxpiqvlh3g@awork3.anarazel.de
Backpatch-through: v12
This change allows these functions to be called using named-argument
notation, which can be helpful for readability, particularly for
the ones with many arguments.
There was considerable debate about exactly which names to use,
but in the end we settled on the names already shown in our
documentation table 9.10.
The citext extension provides citext-aware versions of some of
these functions, so add argument names to those too.
In passing, fix table 9.10's syntax synopses for regexp_match,
which were slightly wrong about which combinations of arguments
are allowed.
Jian He, reviewed by Dian Fay and others
Discussion: https://postgr.es/m/CACJufxG3NFKKsh6x4fRLv8h3V-HvN4W5dA=zNKMxsNcDwOKang@mail.gmail.com
"path" provides a reliable method of determining the parent/child
relationships between memory contexts. Previously this could be done in
a non-reliable way by writing a recursive query and joining the "parent"
and "name" columns. This wasn't reliable as the names were not unique,
which could result in joining to the wrong parent.
To make this reliable, "path" stores an array of numerical identifiers
starting with the identifier for TopLevelMemoryContext. It contains an
element for each intermediate parent between that and the current context.
Incompatibility: Here we also adjust the "level" column to make it
1-based rather than 0-based. A 1-based level provides a convenient way
to access elements in the "path" array. e.g. path[level] gives the
identifier for the current context.
Identifiers are not stable across multiple evaluations of the view. In
an attempt to make these more stable for ad-hoc queries, the identifiers
are assigned breadth-first. Contexts closer to TopLevelMemoryContext
are less likely to change between queries and during queries.
Author: Melih Mutlu <m.melihmutlu@gmail.com>
Discussion: https://postgr.es/m/CAGPVpCThLyOsj3e_gYEvLoHkr5w=tadDiN_=z2OwsK3VJppeBA@mail.gmail.com
Reviewed-by: Andres Freund, Stephen Frost, Atsushi Torikoshi,
Reviewed-by: Michael Paquier, Robert Haas, David Rowley
Previously, TidStoreIterateNext() would expand the set of offsets for
each block into an internal buffer that it overwrote each time. In
order to be able to collect the offsets for multiple blocks before
working with them, change the contract. Now, the offsets are obtained
by a separate call to TidStoreGetBlockOffsets(), which can be called at
a later time. TidStoreIteratorResult objects are safe to copy and store
in a queue.
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/CAAKRu_bbkmwAzSBgnezancgJeXrQZXy4G4kBTd+5=cr86H5yew@mail.gmail.com
The two_phase option is controlled by both the publisher (as a slot
option) and the subscriber (as a subscription option), so the slot option
must also be modified.
Changing the 'two_phase' option for a subscription from 'true' to 'false'
is permitted only when there are no pending prepared transactions
corresponding to that subscription. Otherwise, the changes of already
prepared transactions can be replicated again along with their corresponding
commit leading to duplicate data or errors.
To avoid data loss, the 'two_phase' option for a subscription can only be
changed from 'false' to 'true' once the initial data synchronization is
completed. Therefore this is performed later by the logical replication worker.
Author: Hayato Kuroda, Ajin Cherian, Amit Kapila
Reviewed-by: Peter Smith, Hou Zhijie, Amit Kapila, Vitaly Davydov, Vignesh C
Discussion: https://postgr.es/m/8fab8-65d74c80-1-2f28e880@39088166
Add extern declarations in appropriate header files for global
variables related to GUC. In many cases, this was handled quite
inconsistently before, with some GUC variables declared in a header
file and some only pulled in via ad-hoc extern declarations in various
.c files.
Also add PGDLLIMPORT qualifications to those variables. These were
previously missing because src/tools/mark_pgdllimport.pl has only been
used with header files.
This also fixes -Wmissing-variable-declarations warnings for GUC
variables (not yet part of the standard warning options).
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/flat/e0a62134-83da-4ba4-8cdb-ceb0111c95ce@eisentraut.org
This fixes warnings from -Wmissing-variable-declarations (not yet part
of the standard warning options) under EXEC_BACKEND. The
NON_EXEC_STATIC variables need a suitable declaration in a header file
under EXEC_BACKEND.
Also fix the inconsistent application of the volatile qualifier for
PMSignalState, which was revealed by this change.
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/flat/e0a62134-83da-4ba4-8cdb-ceb0111c95ce@eisentraut.org
slru.h described incorrectly how SLRU segment names are formatted
depending on the segment number and if long or short segment names are
used. This commit closes the gap with a better description, fitting
with the reality.
Reported-by: Noah Misch
Author: Aleksander Alekseev
Discussion: https://postgr.es/m/20240626002747.dc.nmisch@google.com
Backpatch-through: 17
In the case of a parallel plan, when computing the number of tuples
processed per worker, we divide the total number of tuples by the
parallel_divisor obtained from get_parallel_divisor(), which accounts
for the leader's contribution in addition to the number of workers.
Accordingly, when estimating the number of tuples for gather (merge)
nodes, we should multiply the number of tuples per worker by the same
parallel_divisor to reverse the division. However, currently we use
parallel_workers rather than parallel_divisor for the multiplication.
This could result in an underestimation of the number of tuples for
gather (merge) nodes, especially when there are fewer than four
workers.
This patch fixes this issue by using the same parallel_divisor for the
multiplication. There is one ensuing plan change in the regression
tests, but it looks reasonable and does not compromise its original
purpose of testing parallel-aware hash join.
In passing, this patch removes an unnecessary assignment for path.rows
in create_gather_merge_path, and fixes an uninitialized-variable issue
in generate_useful_gather_paths.
No backpatch as this could result in plan changes.
Author: Anthonin Bonnefoy
Reviewed-by: Rafia Sabih, Richard Guo
Discussion: https://postgr.es/m/CAO6_Xqr9+51NxgO=XospEkUeAg-p=EjAWmtpdcZwjRgGKJ53iA@mail.gmail.com
Previously, the code charged disable_cost for CurrentOfExpr, and then
subtracted disable_cost from the cost of a TID path that used
CurrentOfExpr as the TID qual, effectively disabling all paths except
that one. Now, we instead suppress generation of the disabled paths
entirely, and generate only the one that the executor will actually
understand.
With this approach, we do not need to rely on disable_cost being
large enough to prevent the wrong path from being chosen, and we
save some CPU cycle by avoiding generating paths that we can't
actually use. In my opinion, the code is also easier to understand
like this.
Patch by me. Review by Heikki Linnakangas.
Discussion: http://postgr.es/m/591b3596-2ea0-4b8e-99c6-fad0ef2801f5@iki.fi
To do this, we must include the wal_level in the first WAL record
covered by each summary file; so add wal_level to struct Checkpoint
and the payload of XLOG_CHECKPOINT_REDO and XLOG_END_OF_RECOVERY.
This, in turn, requires bumping XLOG_PAGE_MAGIC and, since the
Checkpoint is also stored in the control file, also
PG_CONTROL_VERSION. It's not great to do that so late in the release
cycle, but the alternative seems to ship v17 without robust
protections against this scenario, which could result in corrupted
incremental backups.
A side effect of this patch is that, when a server with
wal_level=replica is started with summarize_wal=on for the first time,
summarization will no longer begin with the oldest WAL that still
exists in pg_wal, but rather from the first checkpoint after that.
This change should be harmless, because a WAL summary for a partial
checkpoint cycle can never make an incremental backup possible when
it would otherwise not have been.
Report by Fujii Masao. Patch by me. Review and/or testing by Jakub
Wartak and Fujii Masao.
Discussion: http://postgr.es/m/6e30082e-041b-4e31-9633-95a66de76f5d@oss.nttdata.com
This new macro is able to perform a direct lookup from the local cache
of injection points (refreshed each time a point is loaded or run),
without touching the shared memory state of injection points at all.
This works in combination with INJECTION_POINT_LOAD(), and it is better
than INJECTION_POINT() in a critical section due to the fact that it
would avoid all memory allocations should a concurrent detach happen
since a LOAD(), as it retrieves a callback from the backend-private
memory.
The documentation is updated to describe in more details how to use this
new macro with a load. Some tests are added to the module
injection_points based on a new SQL function that acts as a wrapper of
INJECTION_POINT_CACHED().
Based on a suggestion from Heikki Linnakangas.
Author: Heikki Linnakangas, Michael Paquier
Discussion: https://postgr.es/m/58d588d0-e63f-432f-9181-bed29313dece@iki.fi
Commit f4b54e1ed9, which introduced macros for protocol characters,
missed updating a few places. It also did not introduce macros for
messages sent from parallel workers to their leader processes.
This commit adds a new section in protocol.h for those.
Author: Aleksander Alekseev
Discussion: https://postgr.es/m/CAJ7c6TNTd09AZq8tGaHS3LDyH_CCnpv0oOz2wN1dGe8zekxrdQ%40mail.gmail.com
Backpatch-through: 17
Previously, CREATE MATERIALIZED VIEW ... WITH DATA populated the MV
the same way as CREATE TABLE ... AS.
Instead, reuse the REFRESH logic, which locks down security-restricted
operations and restricts the search_path. This reduces the chance that
a subsequent refresh will fail.
Reported-by: Noah Misch
Backpatch-through: 17
Discussion: https://postgr.es/m/20240630222344.db.nmisch@google.com
When creating and initializing a logical slot, the restart_lsn is set
to the latest WAL insertion point (or the latest replay point on
standbys). Subsequently, WAL records are decoded from that point to
find the start point for extracting changes in the
DecodingContextFindStartpoint() function. Since the initial
restart_lsn could be in the middle of a transaction, the start point
must be a consistent point where we won't see the data for partial
transactions.
Previously, when not building a full snapshot, serialized snapshots
were restored, and the SnapBuild jumps to the consistent state even
while finding the start point. Consequently, the slot's restart_lsn
and confirmed_flush could be set to the middle of a transaction. This
could lead to various unexpected consequences. Specifically, there
were reports of logical decoding decoding partial transactions, and
assertion failures occurred because only subtransactions were decoded
without decoding their top-level transaction until decoding the commit
record.
To resolve this issue, the changes prevent restoring the serialized
snapshot and jumping to the consistent state while finding the start
point.
On v17 and HEAD, a flag indicating whether snapshot restores should be
skipped has been added to the SnapBuild struct, and SNAPBUILD_VERSION
has been bumpded.
On backbranches, the flag is stored in the LogicalDecodingContext
instead, preserving on-disk compatibility.
Backpatch to all supported versions.
Reported-by: Drew Callahan
Reviewed-by: Amit Kapila, Hayato Kuroda
Discussion: https://postgr.es/m/2444AA15-D21B-4CCE-8052-52C7C2DAFE5C%40amazon.com
Backpatch-through: 12
This new entry type is used for all the fixed-numbered statistics,
making possible support for custom pluggable stats. In short, we need
to be able to detect more easily if a stats kind exists or not when
reading back its data from the pgstats file without a dependency on the
order of the entries read. The kind ID of the stats is added to the
data written.
The data is written in the same fashion as previously, with the
fixed-numbered stats first and the dshash entries next. The read part
becomes more flexible, loading fixed-numbered stats into shared memory
based on the new entry type found.
Bump PGSTAT_FILE_FORMAT_ID.
Reviewed-by: Bertrand Drouvot
Discussion: https://postgr.es/m/Zot5bxoPYdS7yaoy@paquier.xyz
This new callback gives fixed-numbered stats the possibility to take
actions based on the area of shared memory allocated for them.
This removes from pgstat_shmem.c any knowledge specific to the types
of fixed-numbered stats, and the initializations happen in their own
files. Like b68b29bc8f, this change is useful to make this area of
the code more pluggable, so as custom fixed-numbered stats can take
actions after their shared memory area is initialized.
Reviewed-by: Bertrand Drouvot
Discussion: https://postgr.es/m/Zot5bxoPYdS7yaoy@paquier.xyz
This patch modifies the pg_get_acl() function to accept a third argument
called "objsubid", bringing it on par with similar functions in this
area like pg_describe_object(). This enables the retrieval of ACLs for
relation attributes when scanning dependencies.
Bump catalog version.
Author: Joel Jacobson
Discussion: https://postgr.es/m/f2539bff-64be-47f0-9f0b-df85d3cc0432@app.fastmail.com
Since commit 3a9b18b309, roles with privileges of pg_signal_backend
cannot signal autovacuum workers. Many users treated the ability
to signal autovacuum workers as a feature instead of a bug, so we
are reintroducing it via a new predefined role. Having privileges
of this new role, named pg_signal_autovacuum_worker, only permits
signaling autovacuum workers. It does not permit signaling other
types of superuser backends.
Bumps catversion.
Author: Kirill Reshke
Reviewed-by: Anthony Leung, Michael Paquier, Andrey Borodin
Discussion: https://postgr.es/m/CALdSSPhC4GGmbnugHfB9G0%3DfAxjCSug_-rmL9oUh0LTxsyBfsg%40mail.gmail.com
This is similar to 9004abf620, but this time for the write part of the
stats file. The code is changed so as, rather than referring to
individual members of PgStat_Snapshot in an order based on their
PgStat_Kind value, a loop based on pgstat_kind_infos is used to retrieve
the contents to write from the snapshot structure, for a size of
PgStat_KindInfo's shared_data_len.
This requires the addition to PgStat_KindInfo of an offset to track the
location of each fixed-numbered stats in PgStat_Snapshot. This change
is useful to make this area of the code more easily pluggable, and
reduces the knowledge of specific fixed-numbered kinds in pgstat.c.
Reviewed-by: Bertrand Drouvot
Discussion: https://postgr.es/m/Zot5bxoPYdS7yaoy@paquier.xyz
Nodes like Memoize report the cache stats for each parallel worker, so it
makes sense to show the exact and lossy pages in Parallel Bitmap Heap Scan
in a similar way. Likewise, Sort shows the method and memory used for
each worker.
There was some discussion on whether the leader stats should include the
totals for each parallel worker or not. I did some analysis on this to
see what other parallel node types do and it seems only Parallel Hash does
anything like this. All the rest, per what's supported by
ExecParallelRetrieveInstrumentation() are consistent with each other.
Author: David Geier <geidav.pg@gmail.com>
Author: Heikki Linnakangas <hlinnaka@iki.fi>
Author: Donghang Lin <donghanglin@gmail.com>
Author: Alena Rybakina <lena.ribackina@yandex.ru>
Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Dmitry Dolgov <9erthalion6@gmail.com>
Reviewed-by: Michael Christofides <michael@pgmustard.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Donghang Lin <donghanglin@gmail.com>
Reviewed-by: Masahiro Ikeda <Masahiro.Ikeda@nttdata.com>
Discussion: https://postgr.es/m/b3d80961-c2e5-38cc-6a32-61886cdf766d%40gmail.com
This provides the planner with row estimates for
generate_series(TIMESTAMP, TIMESTAMP, INTERVAL),
generate_series(TIMESTAMPTZ, TIMESTAMPTZ, INTERVAL) and
generate_series(TIMESTAMPTZ, TIMESTAMPTZ, INTERVAL, TEXT) when the input
parameter values can be estimated during planning.
Author: David Rowley
Reviewed-by: jian he <jian.universality@gmail.com>
Discussion: https://postgr.es/m/CAApHDvrBE%3D%2BASo_sGYmQJ3GvO8GPvX5yxXhRS%3Dt_ybd4odFkhQ%40mail.gmail.com
a6417078c4 has introduced as project policy that new features
committed during the development cycle should use new OIDs in the
[8000,9999] range.
4564f1cebd did not respect that rule, so let's renumber pg_get_acl()
to use an OID in the correct range.
Bump catalog version.
Both of these counters were using the "long" data type. On MSVC that's
a 32-bit type. On modern hardware, I was able to demonstrate that we can
wrap those counters with a query that only takes 15 minutes to run.
This issue may manifest itself either by not showing the values of the
counters because they've wrapped and are less than zero, resulting in
them being filtered by the > 0 checks in show_tidbitmap_info(), or bogus
numbers being displayed which are modulus 2^32 of the actual number.
Widen these counters to uint64.
Discussion: https://postgr.es/m/CAApHDvpS_97TU+jWPc=T83WPp7vJa1dTw3mojEtAVEZOWh9bjQ@mail.gmail.com
macOS 15's SDK pulls in headers related to <regex.h> when we include
<xlocale.h>. This causes our own regex_t implementation to clash with
the OS's regex_t implementation. Luckily our function names already had
pg_ prefixes, but the macros and typenames did not.
Include <regex.h> explicitly on all POSIX systems, and fix everything
that breaks. Then we can prove that we are capable of fully hiding and
replacing the system regex API with our own.
1. Deal with standard-clobbering macros by undefining them all first.
POSIX says they are "symbolic constants". If they are macros, this
allows us to redefine them. If they are enums or variables, our macros
will hide them.
2. Deal with standard-clobbering types by giving our types pg_
prefixes, and then using macros to redirect xxx_t -> pg_xxx_t.
After including our "regex/regex.h", the system <regex.h> is hidden,
because we've replaced all the standard names. The PostgreSQL source
tree and extensions can continue to use standard prefix-less type and
macro names, but reach our implementation, if they included our
"regex/regex.h" header.
Back-patch to all supported branches, so that macOS 15's tool chain can
build them.
Reported-by: Stan Hu <stanhu@gmail.com>
Suggested-by: Tom Lane <tgl@sss.pgh.pa.us>
Tested-by: Aleksander Alekseev <aleksander@timescale.com>
Discussion: https://postgr.es/m/CAMBWrQnEwEJtgOv7EUNsXmFw2Ub4p5P%2B5QTBEgYwiyjy7rAsEQ%40mail.gmail.com
Each of max_connections, max_worker_processes,
autovacuum_max_workers, and max_wal_senders has a GUC check hook
that verifies the sum of those GUCs does not exceed a hard-coded
limit (see the comment for MAX_BACKENDS in postmaster.h). In
general, the hooks effectively guard against egregious
misconfigurations.
However, this approach has some problems. Since these check hooks
are called as each GUC is assigned its user-specified value, only
one of the hooks will be called with all the relevant GUCs set. If
one or more of the user-specified values are less than the initial
values of the GUCs' underlying variables, false positives can
occur.
Furthermore, the error message emitted when one of the check hooks
fails is not tremendously helpful. For example, the command
$ pg_ctl -D . start -o "-c max_connections=262100 -c max_wal_senders=10000"
fails with the following error:
FATAL: invalid value for parameter "max_wal_senders": 10000
Fortunately, there is an extra copy of this check in
InitializeMaxBackends() that we can rely on, so this commit removes
the aforementioned GUC check hooks in favor of that one. It also
enhances the error message to clearly show the values of the
relevant GUCs and the hard-coded limit their sum may not exceed.
The downside of this change is that server startup progresses
further before failing due to such misconfigurations (thus taking
longer), but these failures are expected to be rare, so we don't
anticipate any real harm in practice.
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/ZnMr2k-Nk5vj7T7H%40nathan
This can be used to load an injection point and prewarm the
backend-level cache before running it, to avoid issues if the point
cannot be loaded due to restrictions in the code path where it would be
run, like a critical section where no memory allocation can happen
(load_external_function() can do allocations when expanding a library
name).
Tests can use a macro called INJECTION_POINT_LOAD() to load an injection
point. The test module injection_points gains some tests, and a SQL
function able to load an injection point.
Based on a request from Andrey Borodin, who has implemented a test for
multixacts requiring this facility.
Reviewed-by: Andrey Borodin
Discussion: https://postgr.es/m/ZkrBE1e2q2wGvsoN@paquier.xyz
Since commit 5764f611e1, we've been using the ilist.h functions for
handling the linked list. There's no need for 'links' to be the first
element of the struct anymore, except for one call in InitProcess
where we used a straight cast from the 'dlist_node *' to PGPROC *,
without the dlist_container() macro. That was just an oversight in
commit 5764f611e1, fix it.
There no imminent need to move 'links' from being the first field, but
let's be tidy.
Reviewed-by: Aleksander Alekseev, Andres Freund
Discussion: https://www.postgresql.org/message-id/22aa749e-cc1a-424a-b455-21325473a794@iki.fi
Up until now, there was no ability to easily determine if a Material
node caused the underlying tuplestore to spill to disk or even see how
much memory the tuplestore used if it didn't.
Here we add some new functions to tuplestore.c to query this information
and add some additional output in EXPLAIN ANALYZE to display this
information for the Material node.
There are a few other executor node types that use tuplestores, so we
could also consider adding these details to the EXPLAIN ANALYZE for
those nodes too. Let's consider those independently from this. Having
the tuplestore.c infrastructure in to allow that is step 1.
Author: David Rowley
Reviewed-by: Matthias van de Meent, Dmitry Dolgov
Discussion: https://postgr.es/m/CAApHDvp5Py9g4Rjq7_inL3-MCK1Co2CRt_YWFwTU2zfQix0p4A@mail.gmail.com
Hash joins can support semijoin with the LHS input on the right, using
the existing logic for inner join, combined with the assurance that only
the first match for each inner tuple is considered, which can be
achieved by leveraging the HEAP_TUPLE_HAS_MATCH flag. This can be very
useful in some cases since we may now have the option to hash the
smaller table instead of the larger.
Merge join could likely support "Right Semi Join" too. However, the
benefit of swapping inputs tends to be small here, so we do not address
that in this patch.
Note that this patch also modifies a test query in join.sql to ensure it
continues testing as intended. With this patch the original query would
result in a right-semi-join rather than semi-join, compromising its
original purpose of testing the fix for neqjoinsel's behavior for
semi-joins.
Author: Richard Guo
Reviewed-by: wenhui qiu, Alena Rybakina, Japin Li
Discussion: https://postgr.es/m/CAMbWs4_X1mN=ic+SxcyymUqFx9bB8pqSLTGJ-F=MHy4PW3eRXw@mail.gmail.com
This code wanted to ensure that the 'exchange' variable passed to
pg_atomic_compare_exchange_u64 has correct alignment, but apparently
platforms don't actually require anything that doesn't come naturally.
While messing with pg_atomic_monotonic_advance_u64: instead of using
Max() to determine the value to return, just use
pg_atomic_compare_exchange_u64()'s return value to decide; also, use
pg_atomic_compare_exchange_u64 instead of the _impl version; also remove
the unnecessary underscore at the end of variable name "target".
Backpatch to 17, where this code was introduced by commit bf3ff7bf83bc.
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/36796438-a718-cf9b-2071-b2c1b947c1b5@gmail.com
This function returns the ACL for a database object, specified by
catalog OID and object OID. This is useful to be able to
retrieve the ACL associated to an object specified with a
(class_id,objid) couple, similarly to the other functions for object
identification, when joined with pg_depend or pg_shdepend.
Original idea by Álvaro Herrera.
Bump catalog version.
Author: Joel Jacobson
Reviewed-by: Isaac Morland, Michael Paquier, Ranier Vilela
Discussion: https://postgr.es/m/80b16434-b9b1-4c3d-8f28-569f21c2c102@app.fastmail.com
We have in launch_backend.c:
/*
* The following need to be available to the save/restore_backend_variables
* functions. They are marked NON_EXEC_STATIC in their home modules.
*/
extern slock_t *ShmemLock;
extern slock_t *ProcStructLock;
extern PGPROC *AuxiliaryProcs;
extern PMSignalData *PMSignalState;
extern pg_time_t first_syslogger_file_time;
extern struct bkend *ShmemBackendArray;
extern bool redirection_done;
That comment is not completely true: ShmemLock, ShmemBackendArray, and
redirection_done are not in fact NON_EXEC_STATIC. ShmemLock once was,
but was then needed elsewhere. ShmemBackendArray was static inside
postmaster.c before launch_backend.c was created. redirection_done
was never static.
This patch moves the declaration of ShmemLock and redirection_done to
a header file.
ShmemBackendArray gets a NON_EXEC_STATIC. This doesn't make a
difference, since it only exists if EXEC_BACKEND anyway, but it makes
it consistent.
After that, the comment is now correct.
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/flat/e0a62134-83da-4ba4-8cdb-ceb0111c95ce@eisentraut.org
This old CPU architecture hasn't been produced in decades, and
whatever instances might still survive are surely too underpowered
for anyone to consider running Postgres on in production. We'd
nonetheless continued to carry code support for it (largely at my
insistence), because its unique implementation of spinlocks seemed
like a good edge case for our spinlock infrastructure. However,
our last buildfarm animal of this type was retired last year, and
it seems quite unlikely that another will emerge. Without the ability
to run tests, the argument that this is useful test code fails to
hold water. Furthermore, carrying code support for an untestable
architecture has costs not to be ignored. So, remove HPPA-specific
code, in the same vein as commits 718aa43a4 and 92d70b77e.
Discussion: https://postgr.es/m/3351991.1697728588@sss.pgh.pa.us
This commit adjusts pg_sequence_last_value() to return NULL instead
of ERROR-ing for sequences for which the current user lacks
privileges. This allows us to remove the call to
has_sequence_privilege() in the definition of the pg_sequences
system view.
Bumps catversion.
Suggested-by: Michael Paquier
Reviewed-by: Michael Paquier, Tom Lane
Discussion: https://postgr.es/m/20240501005730.GA594666%40nathanxps13
The standby_slot_names GUC allows the specification of physical standby
slots that must be synchronized before the logical walsenders associated
with logical failover slots. However, for this purpose, the GUC name is
too generic.
Author: Hou Zhijie
Reviewed-by: Bertrand Drouvot, Masahiko Sawada
Backpatch-through: 17
Discussion: https://postgr.es/m/ZnWeUgdHong93fQN@momjian.us
Shared statistics with a fixed number of objects are read from the stats
file in pgstat_read_statsfile() using members of PgStat_ShmemControl and
following an order based on their PgStat_Kind value.
Instead of being explicit, this commit changes the stats read to iterate
over the pgstat_kind_infos array to find the memory locations to read
into, based on a new shared_ctl_off in PgStat_KindInfo that can be used
to define the position of this stats kind in shared memory. This makes
the read logic simpler, and eases the introduction of future
improvements aimed at making this area more pluggable for external
modules.
Original idea suggested by Andres Freund.
Author: Tristan Partin
Reviewed-by: Andres Freund, Michael Paquier
Discussion: https://postgr.es/m/D12SQ7OYCD85.20BUVF3DWU5K7@neon.tech
This field is used to track if a stats kind can use a custom format
representation on disk when reading or writing its stats case. On HEAD,
this exists for replication slots stats, that need a mapping between an
internal index ID and the slot names.
named_on_disk is currently used nowhere and the callbacks
to_serialized_name and from_serialized_name are in charge of checking if
the serialization of the stats data should apply, so let's remove it.
Reviewed-by: Andres Freund
Discussion: https://postgr.es/m/ZmKVlSX_T5YvIOsd@paquier.xyz
Instead of looking up casts at parse time for converting the result
of JsonPath* query functions to the specified or the default
RETURNING type, always perform the conversion at runtime using either
the target type's input function or the function
json_populate_type().
There are two motivations for this change:
1. json_populate_type() coerces to types with typmod such that any
string values that exceed length limit cause an error instead of
silent truncation, which is necessary to be standard-conforming.
2. It was possible to end up with a cast expression that doesn't
support soft handling of errors causing bugs in the of handling
ON ERROR clause.
JsonExpr.coercion_expr which would store the cast expression is no
longer necessary, so remove.
Bump catversion because stored rules change because of the above
removal.
Reported-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Jian He <jian.universality@gmail.com>
Discussion: Discussion: https://postgr.es/m/202405271326.5a5rprki64aw%40alvherre.pgsql
Before this commit, when the WAL summarizer started up or recovered
from an error, it would resume summarization from wherever it left
off. That was OK normally, but wrong if summarize_wal=off had been
turned off temporary, allowing some WAL to be removed, and then turned
back on again. In such cases, the WAL summarizer would simply hang
forever. This commit changes the reinitialization sequence for WAL
summarizer to rederive the starting position in the way we were
already doing at initial startup, fixing the problem.
Per report from Israel Barth Rubio. Reviewed by Tom Lane.
Discussion: http://postgr.es/m/CA+TgmoYN6x=YS+FoFOS6=nr6=qkXZFWhdiL7k0oatGwug2hcuA@mail.gmail.com
This extends ad98fb1422 to invals of
inplace updates. Trouble requires an inplace update of a catalog having
a TOAST table, so only pg_database was at risk. (The other catalog on
which core code performs inplace updates, pg_class, has no TOAST table.)
Trouble would require something like the inplace-inval.spec test.
Consider GRANT ... ON DATABASE fetching a stale row from cache and
discarding a datfrozenxid update that vac_truncate_clog() has already
relied upon. Back-patch to v12 (all supported versions).
Reviewed (in an earlier version) by Robert Haas.
Discussion: https://postgr.es/m/20240114201411.d0@rfd.leadboat.com
Discussion: https://postgr.es/m/20240512232923.aa.nmisch@google.com
Commit 5b562644fe added a comment that
SetRelationHasSubclass() callers must hold this lock. When commit
17f206fbc8 extended use of this column to
partitioned indexes, it didn't take the lock. As the latter commit
message mentioned, we currently never reset a partitioned index to
relhassubclass=f. That largely avoids harm from the lock omission. The
cause for fixing this now is to unblock introducing a rule about locks
required to heap_update() a pg_class row. This might cause more
deadlocks. It gives minor user-visible benefits:
- If an ALTER INDEX SET TABLESPACE runs concurrently with ALTER TABLE
ATTACH PARTITION or CREATE PARTITION OF, one transaction blocks
instead of failing with "tuple concurrently updated". (Many cases of
DDL concurrency still fail that way.)
- Match ALTER INDEX ATTACH PARTITION in choosing to lock the index.
While not user-visible today, we'll need this if we ever make something
set the flag to false for a partitioned index, like ANALYZE does today
for tables. Back-patch to v12 (all supported versions), the plan for
the commit relying on the new rule. In back branches, add
LockOrStrongerHeldByMe() instead of adding a LockHeldByMe() parameter.
Reviewed (in an earlier version) by Robert Haas.
Discussion: https://postgr.es/m/20240611024525.9f.nmisch@google.com
We did not recover the subtransaction IDs of prepared transactions
when starting a hot standby from a shutdown checkpoint. As a result,
such subtransactions were considered as aborted, rather than
in-progress. That would lead to hint bits being set incorrectly, and
the subtransactions suddenly becoming visible to old snapshots when
the prepared transaction was committed.
To fix, update pg_subtrans with prepared transactions's subxids when
starting hot standby from a shutdown checkpoint. The snapshots taken
from that state need to be marked as "suboverflowed", so that we also
check the pg_subtrans.
Backport to all supported versions.
Discussion: https://www.postgresql.org/message-id/6b852e98-2d49-4ca1-9e95-db419a2696e0@iki.fi
RT_NODE_16_SEARCH_EQ() performs comparisions using vector registers
on x64-64 and aarch64. We apply a mask to the resulting bitfield
to eliminate irrelevant bits that may be set. This ensures correct
behavior, but Valgrind complains of the partially-uninitialised
values. So far the warnings have only occurred on aarch64, which
explains why this hasn't been seen earlier.
To fix this warning, initialize the whole fixed-sized part of the nodes
upon allocation, rather than just do the minimum initialization to
function correctly. The initialization for node48 is a bit different
in that the 256-byte slot index array must be populated with "invalid
index" rather than zero. Experimentation has shown that compilers
tend to emit code that uselessly memsets that array twice. To avoid
pessimizing this path, swap the order of the slot_idxs[] and isset[]
arrays so we can initialize with two non-overlapping memset calls.
Reported by Tomas Vondra
Analysis and patch by Tom Lane, reviewed by Masahiko Sawada. I
investigated the behavior of memset calls to overlapping regions,
leading to the above tweaks to node48 as discussed in the thread.
Discussion: https://postgr.es/m/120c63ad-3d12-415f-a7bf-3da451c31bf6%40enterprisedb.com
Apply const qualifiers to char * arguments and fields throughout the
jsonapi. This allows the top-level APIs such as
pg_parse_json_incremental() to declare their input argument as const.
It also reduces the number of unconstify() calls.
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Discussion: https://www.postgresql.org/message-id/flat/f732b014-f614-4600-a437-dba5a2c3738b%40eisentraut.org
Previously, GetJsonPathVar() allowed a jsonpath expression to
reference any prefix of a PASSING variable's name. For example, the
following query would incorrectly work:
SELECT JSON_QUERY(context_item, jsonpath '$xy' PASSING val AS xyz);
The fix ensures that the length of the variable name mentioned in a
jsonpath expression matches exactly with the name of the PASSING
variable before comparing the strings using strncmp().
Reported-by: Alvaro Herrera (off-list)
Discussion: https://postgr.es/m/CA+HiwqFGkLWMvELBH6E4SQ45qUHthgcRH6gCJL20OsYDRtFx_w@mail.gmail.com
Reconstruction of an SP-GiST index by REINDEX CONCURRENTLY may
insert some REDIRECT tuples. This will typically happen in
a transaction that lacks an XID, which leads either to assertion
failure in spgFormDeadTuple or to insertion of a REDIRECT tuple
with zero xid. The latter's not good either, since eventually
VACUUM will apply GlobalVisTestIsRemovableXid() to the zero xid,
resulting in either an assertion failure or a garbage answer.
In practice, since REINDEX CONCURRENTLY locks out index scans
till it's done, it doesn't matter whether it inserts REDIRECTs
or PLACEHOLDERs; and likewise it doesn't matter how soon VACUUM
reduces such a REDIRECT to a PLACEHOLDER. So in non-assert builds
there's no observable problem here, other than perhaps a little
index bloat. But it's not behaving as intended.
To fix, remove the failing Assert in spgFormDeadTuple, acknowledging
that we might sometimes insert a zero XID; and guard VACUUM's
GlobalVisTestIsRemovableXid() call with a test for valid XID,
ensuring that we'll reduce such a REDIRECT the first time VACUUM
sees it. (Versions before v14 use TransactionIdPrecedes here,
which won't fail on zero xid, so they really have no bug at all
in non-assert builds.)
Another solution could be to not create REDIRECTs at all during
REINDEX CONCURRENTLY, making the relevant code paths treat that
case like index build (which likewise knows that no concurrent
index scans can be happening). That would allow restoring the
Assert in spgFormDeadTuple, but we'd still need the VACUUM change
because redirection tuples with zero xid may be out there already.
But there doesn't seem to be a nice way for spginsert() to tell that
it's being called in REINDEX CONCURRENTLY without some API changes,
so we'll leave that as a possible future improvement.
In HEAD, also rename the SpGistState.myXid field to redirectXid,
which seems less misleading (since it might not in fact be our
transaction's XID) and is certainly less uninformatively generic.
Per bug #18499 from Alexander Lakhin. Back-patch to all supported
branches.
Discussion: https://postgr.es/m/18499-8a519c280f956480@postgresql.org
Commit 534287403 invented SHARED_DEPENDENCY_INITACL entries in
pg_shdepend, but installed them only for non-owner roles mentioned
in a pg_init_privs entry. This turns out to be the wrong thing,
because there is nothing to cue REASSIGN OWNED to go and update
pg_init_privs entries when the object's ownership is reassigned.
That leads to leaving dangling entries in pg_init_privs, as
reported by Hannu Krosing. Instead, install INITACL entries for
all roles mentioned in pg_init_privs entries (except pinned roles),
and change ALTER OWNER to not touch them, just as it doesn't
touch pg_init_privs entries.
REASSIGN OWNED will now substitute the new owner OID for the old
in pg_init_privs entries. This feels like perhaps not quite the
right thing, since pg_init_privs ought to be a historical record
of the state of affairs just after CREATE EXTENSION. However,
it's hard to see what else to do, if we don't want to disallow
dropping the object's original owner. In any case this is
better than the previous do-nothing behavior, and we're unlikely
to come up with a superior solution in time for v17.
While here, tighten up some coding rules about how ACLs in
pg_init_privs should never be null or empty. There's not any
obvious reason to allow that, and perhaps asserting that it's
not so will catch some bugs. (We were previously inconsistent
on the point, with some code paths taking care not to store
empty ACLs and others not.)
This leaves recordExtensionInitPrivWorker not doing anything
with its ownerId argument, but we'll deal with that separately.
catversion bump forced because of change of expected contents
of pg_shdepend when pg_init_privs entries exist.
Discussion: https://postgr.es/m/CAMT0RQSVgv48G5GArUvOVhottWqZLrvC5wBzBa4HrUdXe9VRXw@mail.gmail.com
Commit 667e65aac3 changed both num_dead_tuples and max_dead_tuples
columns to dead_tuple_bytes and max_dead_tuple_bytes columns,
respectively. But as per discussion, the number of dead tuples
collected still provides meaningful insights for users.
This commit reintroduces the column for the count of dead tuples,
renamed as num_dead_item_ids. It avoids confusion with the number of
dead tuples removed by VACUUM, which includes dead heap-only tuples
but excludes any pre-existing LP_DEAD items left behind by
opportunistic pruning.
Bump catalog version.
Reviewed-by: Peter Geoghegan, Álvaro Herrera, Andrey Borodin
Discussion: https://postgr.es/m/CAD21AoBL5sJE9TRWPyv%2Bw7k5Ee5QAJqDJEDJBUdAaCzGWAdvZw%40mail.gmail.com
Make sure that function declarations use names that exactly match the
corresponding names from function definitions in a few places. These
inconsistencies were all introduced during Postgres 17 development.
pg_bsd_indent still has a couple of similar inconsistencies, which I
(pgeoghegan) have left untouched for now.
This commit was written with help from clang-tidy, by mechanically
applying the same rules as similar clean-up commits (the earliest such
commit was commit 035ce1fe).
Meson uses warning/debug/optimize flags such as "-Wall", "-g", and
"-O2" automatically based on "--warnlevel" and "--buildtype" options.
And we use "--warning_level=1" and "--buildtype=debugoptimized" by
default.
But we need these flags for Makefile.global (for extensions) and
pg_config, so we need to compute them manually based on the
higher-level options.
Without this change, extensions building using pgxs wouldn't get -Wall
or optimization options.
Author: Sutou Kouhei <kou@clear-code.com>
Discussion: https://www.postgresql.org/message-id/flat/20240122.141139.931086145628347157.kou%40clear-code.com
0452b461bc made optimizer explore alternative orderings of group-by pathkeys.
It eliminated preprocess_groupclause(), which was intended to match items
between GROUP BY and ORDER BY. Instead, get_useful_group_keys_orderings()
function generates orderings of GROUP BY elements at the time of grouping
paths generation. The get_useful_group_keys_orderings() function takes into
account 3 orderings of GROUP BY pathkeys and clauses: original order as written
in GROUP BY, matching ORDER BY clauses as much as possible, and matching the
input path as much as possible. Given that even before 0452b461b,
preprocess_groupclause() could change the original order of GROUP BY clauses
we don't need to consider it apart from ordering matching ORDER BY clauses.
This commit restores preprocess_groupclause() to provide an ordering of
GROUP BY elements matching ORDER BY before generation of paths. The new
version of preprocess_groupclause() takes into account an incremental sort.
The get_useful_group_keys_orderings() function now takes into 2 orderings of
GROUP BY elements: the order generated preprocess_groupclause() and the order
matching the input path as much as possible.
Discussion: https://postgr.es/m/CAPpHfdvyWLMGwvxaf%3D7KAp-z-4mxbSH8ti2f6mNOQv5metZFzg%40mail.gmail.com
Author: Alexander Korotkov
Reviewed-by: Andrei Lepikhov, Pavel Borisov
0452b461bc made optimizer explore alternative orderings of group-by pathkeys.
The PathKeyInfo data structure was used to store the particular ordering of
group-by pathkeys and corresponding clauses. It turns out that PathKeyInfo
is not the best name for that purpose. This commit renames this data structure
to GroupByOrdering, and revises its comment.
Discussion: https://postgr.es/m/db0fc3a4-966c-4cec-a136-94024d39212d%40postgrespro.ru
Reported-by: Tom Lane
Author: Andrei Lepikhov
Reviewed-by: Alexander Korotkov, Pavel Borisov
0452b461bc made get_eclass_for_sort_expr() always set
EquivalenceClass.ec_sortref if it's not done yet. This leads to an asymmetric
situation when whoever first looks for the EquivalenceClass sets the
ec_sortref. It is also counterintuitive that get_eclass_for_sort_expr()
performs modification of data structures.
This commit makes make_pathkeys_for_sortclauses_extended() responsible for
setting EquivalenceClass.ec_sortref. Now we set the
EquivalenceClass.ec_sortref's needed to explore alternative GROUP BY ordering
specifically during building pathkeys by the list of grouping clauses.
Discussion: https://postgr.es/m/17037754-f187-4138-8285-0e2bfebd0dea%40postgrespro.ru
Reported-by: Tom Lane
Author: Andrei Lepikhov
Reviewed-by: Alexander Korotkov, Pavel Borisov
This reverts commit 7204f35919,
thus restoring 66c0185a3 (Allow planner to use Merge Append to
efficiently implement UNION) as well as the follow-on commits
d5d2205c8, 3b1a7eb28, 7487044d6.
Per further discussion on pgsql-release, we wish to ship beta1 with
this feature, and patch the bug that was found just before wrap,
rather than shipping beta1 with the feature reverted.
This reverts 66c0185a3 (Allow planner to use Merge Append to
efficiently implement UNION) as well as the follow-on commits
d5d2205c8, 3b1a7eb28, 7487044d6. In addition to those, 07746a8ef
had to be removed then re-applied in a different place, because
66c0185a3 moved the relevant code.
The reason for this last-minute thrashing is that depesz found a
case in which the patched code creates a completely wrong plan
that silently gives incorrect query results. It's unclear what
the cause is or how many cases are affected, but with beta1 wrap
staring us in the face, there's no time for closer investigation.
After we figure that out, we can decide whether to un-revert this
for beta2 or hold it for v18.
Discussion: https://postgr.es/m/Zktzf926vslR35Fv@depesz.com
(also some private discussion among pgsql-release)
There were a few typedefs that were never used to define a variable or
field. This had the effect that the buildfarm's typedefs.list would
not include them, and so they would need to be re-added manually to
keep the overall pgindent result perfectly clean.
We can easily get rid of these typedefs to avoid the issue. In a few
cases, we just let the enum or struct stand on its own without a
typedef around it. In one case, an enum was abused to define flag
bits; that's better done with macro definitions.
This fixes all the remaining issues with missing entries in the
buildfarm's typedefs.list.
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/1919000.1715815925@sss.pgh.pa.us
This enum was used to determine the first ID to use when assigning a
custom wait event for extensions, which is always 1. It was kept so
as it would be possible to add new in-core wait events in the category
"Extension". There is no such thing currently, so let's remove this
enum until a case justifying it pops up. This makes the code simpler
and easier to understand.
This has as effect to switch back autoprewarm.c to use PG_WAIT_EXTENSION
rather than WAIT_EVENT_EXTENSION, on par with v16 and older stable
branches.
Thinko in c9af054653.
Reported-by: Peter Eisentraut
Discussion: https://postgr.es/m/195c6c45-abce-4331-be6a-e87724e1d060@eisentraut.org
This feature set did not handle empty ranges correctly, and it's now
too late for PostgreSQL 17 to fix it.
The following commits are reverted:
6db4598fcb Add stratnum GiST support function
46a0cd4cef Add temporal PRIMARY KEY and UNIQUE constraints
86232a49a4 Fix comment on gist_stratnum_btree
030e10ff1a Rename pg_constraint.conwithoutoverlaps to conperiod
a88c800deb Use daterange and YMD in without_overlaps tests instead of tsrange.
5577a71fb0 Use half-open interval notation in without_overlaps tests
34768ee361 Add temporal FOREIGN KEY contraints
482e108cd3 Add test for REPLICA IDENTITY with a temporal key
c3db1f30cb doc: clarify PERIOD and WITHOUT OVERLAPS in CREATE TABLE
144c2ce0cc Fix ON CONFLICT DO NOTHING/UPDATE for temporal indexes
Discussion: https://www.postgresql.org/message-id/d0b64a7a-dfe4-4b84-a906-c7dedfa40a3e@eisentraut.org
Run renumber_oids.pl to move high-numbered OIDs down, as per pre-beta
tasks specified by RELEASE_CHANGES. For reference, the command was
./renumber_oids.pl --first-mapped-oid 8000 --target-oid 6300
Run pgindent, pgperltidy, and reformat-dat-files.
The pgindent part of this is pretty small, consisting mainly of
fixing up self-inflicted formatting damage from patches that
hadn't bothered to add their new typedefs to typedefs.list.
In order to keep it from making anything worse, I manually added
a dozen or so typedefs that appeared in the existing typedefs.list
but not in the buildfarm's list. Perhaps we should formalize that,
or better find a way to get those typedefs into the automatic list.
pgperltidy is as opinionated as always, and reformat-dat-files too.
COMMAND_TAG_NEXTTAG isn't really a valid command tag. Declaring it
as if it were one prompts complaints from Coverity and perhaps other
static analyzers. Our only use of it is in an entirely-unnecessary
array sizing declaration, so let's just drop it.
Ranier Vilela
Discussion: https://postgr.es/m/CAEudQAoY0xrKuTAX7W10zsjjUpKBPFRtdCyScb3Z0FB2v6HNmQ@mail.gmail.com
There are some problems with the new way to handle these constraints
that were detected at the last minute, and require fixes that appear too
invasive to be doing this late in the cycle. Revert this (again) for
now, we'll try again with these problems fixed.
The following commits are reverted:
b0e96f3119 Catalog not-null constraints
9b581c5341 Disallow changing NO INHERIT status of a not-null constraint
d0ec2ddbe0 Fix not-null constraint test
ac22a9545c Move privilege check to the right place
b0f7dd915b Check stack depth in new recursive functions
3af7217942 Update information_schema definition for not-null constraints
c3709100be Fix propagating attnotnull in multiple inheritance
d9f686a72e Fix restore of not-null constraints with inheritance
d72d32f52d Don't try to assign smart names to constraints
0cd711271d Better handle indirect constraint drops
13daa33fa5 Disallow NO INHERIT not-null constraints on partitioned tables
d45597f72f Disallow direct change of NO INHERIT of not-null constraints
21ac38f498 Fix inconsistencies in error messages
Discussion: https://postgr.es/m/202405110940.joxlqcx4dogd@alvherre.pgsql
This commit extends the backend-side infrastructure of injection points
so as it becomes possible to register some input data when attaching a
point. This private data can be registered with the function name and
the library name of the callback when attaching a point, then it is
given as input argument to the callback. This gives the possibility for
modules to pass down custom data at runtime when attaching a point
without managing that internally, in a manner consistent with the
callback entry retrieved from the hash shmem table storing the injection
point data.
InjectionPointAttach() gains two arguments, to be able to define the
private data contents and its size.
A follow-up commit will rely on this infrastructure to close a race
condition with the injection point detach in the module
injection_points.
While on it, this changes InjectionPointDetach() to return a boolean,
returning false if a point cannot be detached. This has been mentioned
by Noah as useful when it comes to implement more complex tests with
concurrent point detach, solid with the automatic detach done for local
points in the test module.
Documentation is adjusted in consequence.
Per discussion with Noah Misch.
Reviewed-by: Noah Misch
Discussion: https://postgr.es/m/20240509031553.47@rfd.leadboat.com
It turns out that we broke this in commit e5bc9454e, because
the code was assuming that no dependent types would appear
among the extension's direct dependencies, and now they do.
This isn't terribly hard to fix: just skip dependent types,
expecting that we will recurse to them when we process the parent
object (which should also be among the direct dependencies).
But a little bit of refactoring is needed so that we can avoid
duplicating logic about what is a dependent type.
Although there is some testing of ALTER EXTENSION SET SCHEMA,
it failed to cover interesting cases, so add more tests.
Discussion: https://postgr.es/m/930191.1715205151@sss.pgh.pa.us
When changing the data type of a column of a partitioned table, craft
the ALTER SEQUENCE command only once. Partitions do not have identity
sequences of their own and thus do not need a ALTER SEQUENCE command
for each partition.
Fix getIdentitySequence() to fetch the identity sequence associated
with the top-level partitioned table when a Relation of a partition is
passed to it. While doing so, translate the attribute number of the
partition into the attribute number of the partitioned table.
Author: Ashutosh Bapat <ashutosh.bapat@enterprisedb.com>
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Dmitry Dolgov <9erthalion6@gmail.com>
Discussion: https://www.postgresql.org/message-id/3b8a9dc1-bbc7-0ef5-6863-c432afac7d59@gmail.com
The catalog view pg_stats_ext fails to consider privileges for
expression statistics. The catalog view pg_stats_ext_exprs fails
to consider privileges and row-level security policies. To fix,
restrict the data in these views to table owners or roles that
inherit privileges of the table owner. It may be possible to apply
less restrictive privilege checks in some cases, but that is left
as a future exercise. Furthermore, for pg_stats_ext_exprs, do not
return data for tables with row-level security enabled, as is
already done for pg_stats_ext.
On the back-branches, a fix-CVE-2024-4317.sql script is provided
that will install into the "share" directory. This file can be
used to apply the fix to existing clusters.
Bumps catversion on 'master' branch only.
Reported-by: Lukas Fittl
Reviewed-by: Noah Misch, Tomas Vondra, Tom Lane
Security: CVE-2024-4317
Backpatch-through: 14
94985c210 added code to detect when WindowFuncs were monotonic and
allowed additional quals to be "pushed down" into the subquery to be
used as WindowClause runConditions in order to short-circuit execution
in nodeWindowAgg.c.
The Node representation of runConditions wasn't well selected and
because we do qual pushdown before planning the subquery, the planning
of the subquery could perform subquery pull-up of nested subqueries.
For WindowFuncs with args, the arguments could be changed after pushing
the qual down to the subquery.
This was made more difficult by the fact that the code duplicated the
WindowFunc inside an OpExpr to include in the WindowClauses runCondition
field. This could result in duplication of subqueries and a pull-up of
such a subquery could result in another initplan parameter being issued
for the 2nd version of the subplan. This could result in errors such as:
ERROR: WindowFunc not found in subplan target lists
To fix this, we change the node representation of these run conditions
and instead of storing an OpExpr containing the WindowFunc in a list
inside WindowClause, we now store a new node type named
WindowFuncRunCondition within a new field in the WindowFunc. These get
transformed into OpExprs later in planning once subquery pull-up has been
performed.
This problem did exist in v15 and v16, but that was fixed by 9d36b883b
and e5d20bbd.
Cat version bump due to new node type and modifying WindowFunc struct.
Bug: #18305
Reported-by: Zuming Jiang
Discussion: https://postgr.es/m/18305-33c49b4c830b37b3%40postgresql.org
We support changing NO INHERIT constraint to INHERIT for constraints in
child relations when adding a constraint to some ancestor relation, and
also during pg_upgrade's schema restore; but other than those special
cases, command ALTER TABLE ADD CONSTRAINT should not be allowed to
change an existing constraint from NO INHERIT to INHERIT, as that would
require to process child relations so that they also acquire an
appropriate constraint, which we may not be in a position to do. (It'd
also be surprising behavior.)
It is conceivable that we want to allow ALTER TABLE SET NOT NULL to make
such a change; but in that case some more code is needed to implement it
correctly, so for now I've made that throw the same error message.
Also, during the prep phase of ALTER TABLE ADD CONSTRAINT, acquire locks
on all descendant tables; otherwise we might operate on child tables on
which no locks are held, particularly in the mode where a primary key
causes not-null constraints to be created on children.
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/7d923a66-55f0-3395-cd40-81c142b5448b@gmail.com
As an optimization, we store "name" columns as cstrings in btree
indexes.
Here we modify it so that Index Only Scans convert these cstrings back
to names with NAMEDATALEN bytes rather than storing the cstring in the
tuple slot, as was happening previously.
Bug: #17855
Reported-by: Alexander Lakhin
Reviewed-by: Alexander Lakhin, Tom Lane
Discussion: https://postgr.es/m/17855-5f523e0f9769a566@postgresql.org
Backpatch-through: 12, all supported versions
If an ACL recorded in pg_init_privs mentions a non-pinned role,
that reference must also be noted in pg_shdepend so that we know
that the role can't go away without removing the ACL reference.
Otherwise, DROP ROLE could succeed and leave dangling entries
behind, which is what's causing the recent upgrade-check failures
on buildfarm member copperhead.
This has been wrong since pg_init_privs was introduced, but it's
escaped notice because typical pg_init_privs entries would only
mention the bootstrap superuser (pinned) or at worst the owner
of the extension (who can't go away before the extension does).
We lack even a representation of such a role reference for
pg_shdepend. My first thought for a solution was entries listing
pg_init_privs in classid, but that doesn't work because then there's
noplace to put the granted-on object's classid. Rather than adding
a new column to pg_shdepend, let's add a new deptype code
SHARED_DEPENDENCY_INITACL. Much of the associated boilerplate
code can be cribbed from code for SHARED_DEPENDENCY_ACL.
A lot of the bulk of this patch just stems from the new need to pass
the object's owner ID to recordExtensionInitPriv, so that we can
consult it while updating pg_shdepend. While many callers have that
at hand already, a few places now need to fetch the owner ID of an
arbitrary privilege-bearing object. For that, we assume that there
is a catcache on the relevant catalog's OID column, which is an
assumption already made in ExecGrant_common so it seems okay here.
We do need an entirely new routine RemoveRoleFromInitPriv to perform
cleanup of pg_init_privs ACLs during DROP OWNED BY. It's analogous
to RemoveRoleFromObjectACL, but we can't share logic because that
function operates by building a command parsetree and invoking
existing GRANT/REVOKE infrastructure. There is of course no SQL
command that would update pg_init_privs entries when we're not in
process of creating their extension, so we need a routine that can
do the updates directly.
catversion bump because this changes the expected contents of
pg_shdepend. For the same reason, there's no hope of back-patching
this, even though it fixes a longstanding bug. Fortunately, the
case where it's a problem seems to be near nonexistent in the field.
If it weren't for the buildfarm breakage, I'd have been content to
leave this for v18.
Patch by me; thanks to Daniel Gustafsson for review and discussion.
Discussion: https://postgr.es/m/1745535.1712358659@sss.pgh.pa.us
- Bring memory context names in line with other naming
- Fix typos, reported off-list by Alexander Lakhin
- Remove copy-paste errors from comments
- Remove duplicate #undef
Also, fix a memory leak when updating from non-embeddable to
embeddable. Both were unreachable without adding C code.
Reported-by: Noah Misch
Author: Noah Misch
Reviewed-by: Masahiko Sawada, John Naylor
Discussion: https://postgr.es/m/20240424210319.4c.nmisch%40google.com
Allow pg_sync_replication_slots() to error out during promotion of standby.
This makes the behavior of the SQL function consistent with the slot sync
worker. We also ensured that pg_sync_replication_slots() cannot be
executed if sync_replication_slots is enabled and the slotsync worker is
already running to perform the synchronization of slots. Previously, it
would have succeeded in cases when the worker is idle and failed when it
is performing sync which could confuse users.
This patch fixes another issue in the slot sync worker where
SignalHandlerForShutdownRequest() needs to be registered *before* setting
SlotSyncCtx->pid, otherwise, the slotsync worker could miss handling
SIGINT sent by the startup process(ShutDownSlotSync) if it is sent before
worker could register SignalHandlerForShutdownRequest(). To be consistent,
all signal handlers' registration is moved to a prior location before we
set the worker's pid.
Ensure that we clean up synced temp slots at the end of
pg_sync_replication_slots() to avoid such slots being left over after
promotion.
Ensure that ShutDownSlotSync() captures SlotSyncCtx->pid under spinlock to
avoid accessing invalid value as it can be reset by concurrent slot sync
exit due to an error.
Author: Shveta Malik
Reviewed-by: Hou Zhijie, Bertrand Drouvot, Amit Kapila, Masahiko Sawada
Discussion: https://postgr.es/m/CAJpy0uBefXUS_TSz=oxmYKHdg-fhxUT0qfjASW3nmqnzVC3p6A@mail.gmail.com
This field is not used directly in the code, but it is important for
query jumbling to be able to make a difference between a named
DEALLOCATE and DEALLOCATE ALL (see bb45156f34). This behavior is
tracked in the regression tests of pg_stat_statements, but the reason
why this field is important can be easily missed, as a recent discussion
has proved, so let's improve its comment to document the reason why it
needs to be around.
Wording has been suggested by Tom Lane
Discussion: https://postgr.es/m/Zih1ATt37YFda8_p@paquier.xyz
JsonExprState.input_finfo is only assigned to, never read, and
it's really fairly useless since the value can be gotten out of
the adjacent input_fcinfo field. Let's remove it before someone
starts to depend on it.
While here, also remove TidScanState.tss_htup and AggState.combinedproj,
which are referenced nowhere. Those should have been removed by the
commits that caused them to become disused, but were not.
I don't think a catversion bump is necessary here, since plan trees
are never stored on disk.
Matthias van de Meent
Discussion: https://postgr.es/m/CAEze2WjsY4d0TBymLNGK4zpttUcg_YZaTjyWz2VfDUV6YH8wXQ@mail.gmail.com
The optimization for inserts into BRIN indexes added by c1ec02be1d
relies on a cache that needs to be explicitly released after calling
index_insert(). The commit however failed to invoke the cleanup in
validate_index(), which calls index_insert() indirectly through
table_index_validate_scan().
After inspecting index_insert() callers, it seems unique_key_recheck()
is missing the call too.
Fixed by adding the two missing index_insert_cleanup() calls.
The commit does two additional improvements. The aminsertcleanup()
signature is modified to have the index as the first argument, to make
it more like the other AM callbacks. And the aminsertcleanup() callback
is invoked even if the ii_AmCache is NULL, so that it can decide if the
cleanup is necessary.
Author: Alvaro Herrera, Tomas Vondra
Reported-by: Alexander Lakhin
Discussion: https://postgr.es/m/202401091043.e3nrqiad6gb7@alvherre.pgsql
This fixes various typos, duplicated words, and tiny bits of whitespace
mainly in code comments but also in docs.
Author: Daniel Gustafsson <daniel@yesql.se>
Author: Heikki Linnakangas <hlinnaka@iki.fi>
Author: Alexander Lakhin <exclusion@gmail.com>
Author: David Rowley <dgrowleyml@gmail.com>
Author: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/3F577953-A29E-4722-98AD-2DA9EFF2CBB8@yesql.se
In tables with primary keys, pg_dump creates tables with primary keys by
initially dumping them with throw-away not-null constraints (marked "no
inherit" so that they don't create problems elsewhere), to later drop
them once the primary key is restored. Because of a unrelated
consideration, on tables with children we add not-null constraints to
all columns of the primary key when it is created.
If both a table and its child have primary keys, and pg_dump happens to
emit the child table first (and its throw-away not-null) and later its
parent table, the creation of the parent's PK will fail because the
throw-away not-null constraint collides with the permanent not-null
constraint that the PK wants to add, so the dump fails to restore.
We can work around this problem by letting the primary key "take over"
the child's not-null. This requires no changes to pg_dump, just two
changes to ALTER TABLE: first, the ability to convert a no-inherit
not-null constraint into a regular inheritable one (including recursing
down to children, if there are any); second, the ability to "drop" a
constraint that is defined both directly in the table and inherited from
a parent (which simply means to mark it as no longer having a local
definition).
Secondarily, change ATPrepAddPrimaryKey() to acquire locks all the way
down the inheritance hierarchy, in case we need to recurse when
propagating constraints.
These two changes allow pg_dump to reproduce more cases involving
inheritance from versions 16 and older.
Lastly, make two changes to pg_dump: 1) do not try to drop a not-null
constraint that's marked as inherited; this allows a dump to restore
with no errors if a table with a PK inherits from another which also has
a PK; 2) avoid giving inherited constraints throwaway names, for the
rare cases where such a constraint survives after the restore.
Reported-by: Andrew Bille <andrewbille@gmail.com>
Reported-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/CAJnzarwkfRu76_yi3dqVF_WL-MpvT54zMwAxFwJceXdHB76bOA@mail.gmail.com
Discussion: https://postgr.es/m/Zh0aAH7tbZb-9HbC@pryzbyj2023
This addresses some post-commit review comments for commits 6185c973,
de3600452, and 9425c596a0, with the following changes:
* Fix JSON_TABLE() syntax documentation to use the term
"path_expression" for JSON path expressions instead of
"json_path_specification" to be consistent with the other SQL/JSON
functions.
* Fix a typo in the example code in JSON_TABLE() documentation.
* Rewrite some newly added comments in jsonpath.h.
* In JsonPathQuery(), add missing cast to int before printing an enum
value.
Reported-by: Jian He <jian.universality@gmail.com>
Discussion: https://postgr.es/m/CACJufxG_e0QLCgaELrr2ZNz7AxPeGCNKAORe3fHtFCQLsH4J4Q@mail.gmail.com
This improves some error messages emitted by SQL/JSON query functions
by mentioning column name when available, such as when they are
invoked as part of evaluating JSON_TABLE() columns. To do so, a new
field column_name is added to both JsonFuncExpr and JsonExpr that is
only populated when creating those nodes for transformed JSON_TABLE()
columns.
While at it, relevant error messages are reworded for clarity.
Reported-by: Jian He <jian.universality@gmail.com>
Suggested-by: Jian He <jian.universality@gmail.com>
Discussion: https://postgr.es/m/CACJufxG_e0QLCgaELrr2ZNz7AxPeGCNKAORe3fHtFCQLsH4J4Q@mail.gmail.com
GlobalVisTestNonRemovableHorizon() was only used for the implementation of
snapshot_too_old, which was removed in f691f5b80a. As using
GlobalVisTestNonRemovableHorizon() is not particularly efficient, no new uses
for it should be added. Therefore remove.
Discussion: https://postgr.es/m/20240415185720.q4dg4dlcyvvrabz4@awork3.anarazel.de
All backend-side variables should be marked with PGDLLIMPORT, as per
policy introduced in 8ec569479f. aafc05de1b has forgotten
MyClientSocket, and 05c3980e7f LoadedSSL.
These can be spotted with a command like this one (be careful of not
switching __pg_log_level):
src/tools/mark_pgdllimport.pl $(git ls-files src/include/)
Reported-by: Peter Eisentraut
Discussion: https://postgr.es/m/ZhzkRzrkKhbeQMRm@paquier.xyz
When matching constraints in AttachPartitionEnsureIndexes() we weren't
testing the constraint type, which could make a UNIQUE key lacking a
not-null constraint incorrectly satisfy a primary key requirement. Fix
this by testing that the constraint types match. (Other possible
mismatches are verified by comparing index properties.)
Discussion: https://postgr.es/m/202402051447.wimb4xmtiiyb@alvherre.pgsql
Coverity complained about not freeing some memory associated with
incrementally parsing backup manifests. To fix that, provide and use a new
shutdown function for the JsonManifestParseIncrementalState object, in
line with a suggestion from Tom Lane.
While analysing the problem, I noticed a buglet in freeing memory for
incremental json lexers. To fix that remove a bogus condition on
freeing the memory allocated for them.
b262ad440 added code to have the planner remove redundant IS NOT NULL
quals and eliminate needless scans for IS NULL quals on tables where the
qual's column has a NOT NULL constraint.
That commit failed to consider that an inheritance parent table could
have differing NOT NULL constraints between the parent and the child.
This caused issues as if we eliminated a qual on the parent, when
applying the quals to child tables in apply_child_basequals(), the qual
might not have been added to the parent's baserestrictinfo.
Here we fix this by not applying the optimization to remove redundant
quals to RelOptInfos belonging to inheritance parents and applying the
optimization again in apply_child_basequals(). Effectively, this means
that the parent and child are considered independently as the parent has
both an inh=true and inh=false RTE and we still apply the optimization
to the RelOptInfo corresponding to the inh=false RTE.
We're able to still apply the optimization in add_base_clause_to_rel()
for partitioned tables as the NULLability of partitions must match that
of their parent. And, if we ever expand restriction_is_always_false()
and restriction_is_always_true() to handle partition constraints then we
can apply the same logic as, even in multi-level partitioned tables,
there's no way to route values to a partition when the qual does not
match the partition qual of the partitioned table's parent partition.
The same is true for CHECK constraints as those must also match between
arent partitioned tables and their partitions.
Author: Richard Guo, David Rowley
Discussion: https://postgr.es/m/CAMbWs4930gQSZmjR7aANzEapdy61gCg6z8dT-kAEYD0sYWKPdQ@mail.gmail.com
A pairing heap can perform the same operations as the binary heap +
index, with as good or better algorithmic complexity, and that's an
existing data structure so that we don't need to invent anything new
compared to v16. This commit makes the new binaryheap functionality
that was added in commits b840508644 and bcb14f4abc unnecessary, but
they will be reverted separately.
Remove the optimization to only build and maintain the heap when the
amount of memory used is close to the limit, becuase the bookkeeping
overhead with the pairing heap seems to be small enough that it
doesn't matter in practice.
Reported-by: Jeff Davis
Author: Heikki Linnakangas
Reviewed-by: Michael Paquier, Hayato Kuroda, Masahiko Sawada
Discussion: https://postgr.es/m/12747c15811d94efcc5cda72d6b35c80d7bf3443.camel%40j-davis.com
Previously, the decision to store values in leaves or within the child
pointer was made at compile time, with variable length values using
leaves by necessity. This commit allows introspecting the length of
variable length values at runtime for that decision. This requires
the ability to tell whether the last-level child pointer is actually
a value, so we use a pointer tag in the lowest level bit.
Use this in TID store. This entails adding a byte to the header to
reserve space for the tag. Commit f35bd9bf3 stores up to three offsets
within the header with no bitmap, and now the header can be embedded
as above. This reduces worst-case memory usage when TIDs are sparse.
Reviewed (in an earlier version) by Masahiko Sawada
Discussion: https://postgr.es/m/CANWCAZYw+_KAaUNruhJfE=h6WgtBKeDG32St8vBJBEY82bGVRQ@mail.gmail.com
Discussion: https://postgr.es/m/CAD21AoBci3Hujzijubomo1tdwH3XtQ9F89cTNQ4bsQijOmqnEw@mail.gmail.com
While keeping API the same, this commit provides a way for block-level table
AMs to re-use existing acquire_sample_rows() by providing custom callbacks
for getting the next block and the next tuple.
Reported-by: Andres Freund
Discussion: https://postgr.es/m/20240407214001.jgpg5q3yv33ve6y3%40awork3.anarazel.de
Reviewed-by: Pavel Borisov
Let table AM define custom reloptions for its tables. This allows specifying
AM-specific parameters by the WITH clause when creating a table.
The reloptions, which could be used outside of table AM, are now extracted
into the CommonRdOptions data structure. These options could be by decision
of table AM directly specified by a user or calculated in some way.
The new test module test_tam_options evaluates the ability to set up custom
reloptions and calculate fields of CommonRdOptions on their base.
The code may use some parts from prior work by Hao Wu.
Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com
Discussion: https://postgr.es/m/AMUA1wBBBxfc3tKRLLdU64rb.1.1683276279979.Hmail.wuhao%40hashdata.cn
Reviewed-by: Reviewed-by: Pavel Borisov, Matthias van de Meent, Jess Davis
A NESTED path allows to extract data from nested levels of JSON
objects given by the parent path expression, which are projected as
columns specified using a nested COLUMNS clause, just like the parent
COLUMNS clause. Rows comprised from a NESTED columns are "joined"
to the row comprised from the parent columns. If a particular NESTED
path evaluates to 0 rows, then the nested COLUMNS will emit NULLs,
making it an OUTER join.
NESTED columns themselves may include NESTED paths to allow
extracting data from arbitrary nesting levels, which are likewise
joined against the rows at the parent level.
Multiple NESTED paths at a given level are called "sibling" paths
and their rows are combined by UNIONing them, that is, after being
joined against the parent row as described above.
Author: Nikita Glukhov <n.gluhov@postgrespro.ru>
Author: Teodor Sigaev <teodor@sigaev.ru>
Author: Oleg Bartunov <obartunov@gmail.com>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Andrew Dunstan <andrew@dunslane.net>
Author: Amit Langote <amitlangote09@gmail.com>
Author: Jian He <jian.universality@gmail.com>
Reviewers have included (in no particular order):
Andres Freund, Alexander Korotkov, Pavel Stehule, Andrew Alsup,
Erik Rijkers, Zihong Yu, Himanshu Upadhyaya, Daniel Gustafsson,
Justin Pryzby, Álvaro Herrera, Jian He
Discussion: https://postgr.es/m/cd0bb935-0158-78a7-08b5-904886deac4b@postgrespro.ru
Discussion: https://postgr.es/m/20220616233130.rparivafipt6doj3@alap3.anarazel.de
Discussion: https://postgr.es/m/abd9b83b-aa66-f230-3d6d-734817f0995d%40postgresql.org
Discussion: https://postgr.es/m/CA+HiwqE4XTdfb1nW=Ojoy_tQSRhYt-q_kb6i5d4xcKyrLC1Nbg@mail.gmail.com
When testing buffer pool logic, it is useful to be able to evict
arbitrary blocks. This function can be used in SQL queries over the
pg_buffercache view to set up a wide range of buffer pool states. Of
course, buffer mappings might change concurrently so you might evict a
block other than the one you had in mind, and another session might
bring it back in at any time. That's OK for the intended purpose of
setting up developer testing scenarios, and more complicated interlocking
schemes to give stronger guararantees about that would likely be less
flexible for actual testing work anyway. Superuser-only.
Author: Palak Chaturvedi <chaturvedipalak1911@gmail.com>
Author: Thomas Munro <thomas.munro@gmail.com> (docs, small tweaks)
Reviewed-by: Nitin Jadhav <nitinjadhavpostgres@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Cary Huang <cary.huang@highgo.ca>
Reviewed-by: Cédric Villemain <cedric.villemain+pgsql@abcsql.com>
Reviewed-by: Jim Nasby <jim.nasby@gmail.com>
Reviewed-by: Maxim Orlov <orlovmg@gmail.com>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CALfch19pW48ZwWzUoRSpsaV9hqt0UPyaBPC4bOZ4W+c7FF566A@mail.gmail.com
While SH_STAT() is only used for debugging, the allocated array can be large,
and therefore should be freed.
It's unclear why coverity started warning now.
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Reported-by: Coverity
Discussion: https://postgr.es/m/3005248.1712538233@sss.pgh.pa.us
Backpatch: 12-
libpq now always tries to send ALPN. With the traditional negotiated
SSL connections, the server accepts the ALPN, and refuses the
connection if it's not what we expect, but connecting without ALPN is
still OK. With the new direct SSL connections, ALPN is mandatory.
NOTE: This uses "TBD-pgsql" as the protocol ID. We must register a
proper one with IANA before the release!
Author: Greg Stark, Heikki Linnakangas
Reviewed-by: Matthias van de Meent, Jacob Champion
By skipping SSLRequest, you can eliminate one round-trip when
establishing a TLS connection. It is also more friendly to generic TLS
proxies that don't understand the PostgreSQL protocol.
This is disabled by default in libpq, because the direct TLS handshake
will fail with old server versions. It can be enabled with the
sslnegotation=direct option. It will still fall back to the negotiated
TLS handshake if the server rejects the direct attempt, either because
it is an older version or the server doesn't support TLS at all, but
the fallback can be disabled with the sslnegotiation=requiredirect
option.
Author: Greg Stark, Heikki Linnakangas
Reviewed-by: Matthias van de Meent, Jacob Champion
The ANALYZE command prefetches and reads sample blocks chosen by a
BlockSampler algorithm. Instead of calling [Prefetch|Read]Buffer() for
each block, ANALYZE now uses the streaming API introduced in b5a9b18cd0.
Author: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com
Replace (expr op C1) OR (expr op C2) ... with expr op ANY(ARRAY[C1, C2, ...])
on the preliminary stage of optimization when we are still working with the
expression tree.
Here Cn is a n-th constant expression, 'expr' is non-constant expression, 'op'
is an operator which returns boolean result and has a commuter (for the case
of reverse order of constant and non-constant parts of the expression,
like 'Cn op expr').
Sometimes it can lead to not optimal plan. This is why there is a
or_to_any_transform_limit GUC. It specifies a threshold value of length of
arguments in an OR expression that triggers the OR-to-ANY transformation.
Generally, more groupable OR arguments mean that transformation will be more
likely to win than to lose.
Discussion: https://postgr.es/m/567ED6CA.2040504%40sigaev.ru
Author: Alena Rybakina <lena.ribackina@yandex.ru>
Author: Andrey Lepikhov <a.lepikhov@postgrespro.ru>
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Reviewed-by: Ranier Vilela <ranier.vf@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Jian He <jian.universality@gmail.com>
29f6a959c added a bump allocator type for efficient compact allocations.
Here we make use of this for non-bounded tuplesorts to store tuples.
This is very space efficient when storing narrow tuples due to bump.c
not having chunk headers. This means we can fit more tuples in work_mem
before spilling to disk, or perform an in-memory sort touching fewer
cacheline.
Author: David Rowley
Reviewed-by: Nathan Bossart
Reviewed-by: Matthias van de Meent
Reviewed-by: Tomas Vondra
Reviewed-by: John Naylor
Discussion: https://postgr.es/m/CAApHDvqGSpCU95TmM=Bp=6xjL_nLys4zdZOpfNyWBk97Xrdj2w@mail.gmail.com
This tracks the position of WAL that's been fully copied into WAL
buffers by all processes emitting WAL. (For some reason we call that
"WAL insertion"). This is updated using atomic monotonic advance during
WaitXLogInsertionsToFinish, which is not when the insertions actually
occur, but it's the only place where we know where have all the
insertions have completed.
This value is useful in WALReadFromBuffers, which can verify that
callers don't try to read past what has been inserted. (However, more
infrastructure is needed in order to actually use WAL after the flush
point, since it could be lost.)
The value is also useful in WaitXLogInsertionsToFinish() itself, since
we can now exit quickly when all WAL has been already inserted, without
even having to take any locks.
This introduces a bump MemoryContext type. The bump context is best
suited for short-lived memory contexts which require only allocations
of memory and never a pfree or repalloc, which are unsupported.
Memory palloc'd into a bump context has no chunk header. This makes
bump a useful context type when lots of small allocations need to be
done without any need to pfree those allocations. Allocation sizes are
rounded up to the next MAXALIGN boundary, so with this and no chunk
header, allocations are very compact indeed.
Allocations are also very fast as bump does not check any freelists to
try and make use of previously free'd chunks. It just checks if there
is enough room on the current block, and if so it bumps the freeptr
beyond this chunk and returns the value that the freeptr was previously
pointing to. Simple and fast. A new block is malloc'd when there's not
enough space in the current block.
Code using the bump allocator must take care never to call any functions
which could try to call realloc() (or variants), pfree(),
GetMemoryChunkContext() or GetMemoryChunkSpace() on a bump allocated
chunk. Due to lack of chunk headers, these operations are unsupported.
To increase the chances of catching such issues, when compiled with
MEMORY_CONTEXT_CHECKING, bump allocated chunks are given a header and
any attempt to perform an unsupported operation will result in an ERROR.
Without MEMORY_CONTEXT_CHECKING, code attempting an unsupported
operation could result in a segfault.
A follow-on commit will implement the first user of bump.
Author: David Rowley
Reviewed-by: Nathan Bossart
Reviewed-by: Matthias van de Meent
Reviewed-by: Tomas Vondra
Reviewed-by: John Naylor
Discussion: https://postgr.es/m/CAApHDvqGSpCU95TmM=Bp=6xjL_nLys4zdZOpfNyWBk97Xrdj2w@mail.gmail.com
Reserve 4 bits for MemoryContextMethodID rather than 3. 3 bits did
technically allow a maximum of 8 memory context types, however, we've
opted to reserve some bit patterns which left us with only 4 slots, all
of which were used.
Here we add another bit which frees up 8 slots for future memory context
types.
In passing, adjust the enum names in MemoryContextMethodID to make it
more clear which ones can be used and which ones are reserved.
Author: Matthias van de Meent, David Rowley
Discussion: https://postgr.es/m/CAApHDvqGSpCU95TmM=Bp=6xjL_nLys4zdZOpfNyWBk97Xrdj2w@mail.gmail.com
Commit 792752af4e added infrastructure for using AVX-512 intrinsic
functions, and this commit uses that infrastructure to optimize
visibilitymap_count(). Specificially, a new pg_popcount_masked()
function is introduced that applies a bitmask to every byte in the
buffer prior to calculating the population count, which is used to
filter out the all-visible or all-frozen bits as needed. Platforms
without AVX-512 support should also see a nice speedup due to the
reduced number of calls to a function pointer.
Co-authored-by: Ants Aasma
Discussion: https://postgr.es/m/BL1PR11MB5304097DF7EA81D04C33F3D1DCA6A%40BL1PR11MB5304.namprd11.prod.outlook.com
Presently, pg_popcount() processes data in 32-bit or 64-bit chunks
when possible. Newer hardware that supports AVX-512 instructions
can use 512-bit chunks, which provides a nice speedup, especially
for larger buffers. This commit introduces the infrastructure
required to detect compiler and CPU support for the required
AVX-512 intrinsic functions, and it adds a new pg_popcount()
implementation that uses these functions. If CPU support for this
optimized implementation is detected at runtime, a function pointer
is updated so that it is used by subsequent calls to pg_popcount().
Most of the existing in-tree calls to pg_popcount() should benefit
from these instructions, and calls with smaller buffers should at
least not regress compared to v16. The new infrastructure
introduced by this commit can also be used to optimize
visibilitymap_count(), but that is left for a follow-up commit.
Co-authored-by: Paul Amonson, Ants Aasma
Reviewed-by: Matthias van de Meent, Tom Lane, Noah Misch, Akash Shankaran, Alvaro Herrera, Andres Freund, David Rowley
Discussion: https://postgr.es/m/BL1PR11MB5304097DF7EA81D04C33F3D1DCA6A%40BL1PR11MB5304.namprd11.prod.outlook.com
Commit 7c70996ebf introduced an optimization to allow bitmap
scans to operate like index-only scans by not fetching a block from the
heap if none of the underlying data is needed and the block is marked
all visible in the visibility map.
With the introduction of table AMs, a FIXME was added to this code
indicating that the skip_fetch logic should be pushed into the table
AM-specific code, as not all table AMs may use a visibility map in the
same way.
This commit resolves this FIXME for the current block. The layering
violation is still present in BitmapHeapScans's prefetching code, which
uses the visibility map to decide whether or not to prefetch a block.
However, this can be addressed independently.
Author: Melanie Plageman
Reviewed-by: Andres Freund, Heikki Linnakangas, Tomas Vondra, Mark Dilger
Discussion: https://postgr.es/m/CAAKRu_ZwCwWFeL_H3ia26bP2e7HiKLWt0ZmGXPVwPO6uXq0vaA%40mail.gmail.com
This new DDL command splits a single partition into several parititions.
Just like ALTER TABLE ... MERGE PARTITIONS ... command, new patitions are
created using createPartitionTable() function with parent partition as the
template.
This commit comprises quite naive implementation which works in single process
and holds the ACCESS EXCLUSIVE LOCK on the parent table during all the
operations including the tuple routing. This is why this new DDL command
can't be recommended for large partitioned tables under a high load. However,
this implementation come in handy in certain cases even as is.
Also, it could be used as a foundation for future implementations with lesser
locking and possibly parallel.
Discussion: https://postgr.es/m/c73a1746-0cd0-6bdd-6b23-3ae0b7c0c582%40postgrespro.ru
Author: Dmitry Koval
Reviewed-by: Matthias van de Meent, Laurenz Albe, Zhihong Yu, Justin Pryzby
Reviewed-by: Alvaro Herrera, Robert Haas, Stephane Tachoires
This new DDL command merges several partitions into the one partition of the
target table. The target partition is created using new
createPartitionTable() function with parent partition as the template.
This commit comprises quite naive implementation which works in single process
and holds the ACCESS EXCLUSIVE LOCK on the parent table during all the
operations including the tuple routing. This is why this new DDL command
can't be recommended for large partitioned tables under a high load. However,
this implementation come in handy in certain cases even as is.
Also, it could be used as a foundation for future implementations with lesser
locking and possibly parallel.
Discussion: https://postgr.es/m/c73a1746-0cd0-6bdd-6b23-3ae0b7c0c582%40postgrespro.ru
Author: Dmitry Koval
Reviewed-by: Matthias van de Meent, Laurenz Albe, Zhihong Yu, Justin Pryzby
Reviewed-by: Alvaro Herrera, Robert Haas, Stephane Tachoires
Not just WaitLSNState.waitersHeap, but also WaitLSNState.procInfos and
updating of WaitLSNState.minWaitedLSN is protected by WaitLSNLock. There
is one now documented exclusion on fast-path checking of
WaitLSNProcInfo.inHeap flag.
Discussion: https://postgr.es/m/202404030658.hhj3vfxeyhft%40alvherre.pgsql
Commit 9e8da0f7 taught nbtree to handle ScalarArrayOpExpr quals
natively. This works by pushing down the full context (the array keys)
to the nbtree index AM, enabling it to execute multiple primitive index
scans that the planner treats as one continuous index scan/index path.
This earlier enhancement enabled nbtree ScalarArrayOp index-only scans.
It also allowed scans with ScalarArrayOp quals to return ordered results
(with some notable restrictions, described further down).
Take this general approach a lot further: teach nbtree SAOP index scans
to decide how to execute ScalarArrayOp scans (when and where to start
the next primitive index scan) based on physical index characteristics.
This can be far more efficient. All SAOP scans will now reliably avoid
duplicative leaf page accesses (just like any other nbtree index scan).
SAOP scans whose array keys are naturally clustered together now require
far fewer index descents, since we'll reliably avoid starting a new
primitive scan just to get to a later offset from the same leaf page.
The scan's arrays now advance using binary searches for the array
element that best matches the next tuple's attribute value. Required
scan key arrays (i.e. arrays from scan keys that can terminate the scan)
ratchet forward in lockstep with the index scan. Non-required arrays
(i.e. arrays from scan keys that can only exclude non-matching tuples)
"advance" without the process ever rolling over to a higher-order array.
Naturally, only required SAOP scan keys trigger skipping over leaf pages
(non-required arrays cannot safely end or start primitive index scans).
Consequently, even index scans of a composite index with a high-order
inequality scan key (which we'll mark required) and a low-order SAOP
scan key (which we won't mark required) now avoid repeating leaf page
accesses -- that benefit isn't limited to simpler equality-only cases.
In general, all nbtree index scans now output tuples as if they were one
continuous index scan -- even scans that mix a high-order inequality
with lower-order SAOP equalities reliably output tuples in index order.
This allows us to remove a couple of special cases that were applied
when building index paths with SAOP clauses during planning.
Bugfix commit 807a40c5 taught the planner to avoid generating unsafe
path keys: path keys on a multicolumn index path, with a SAOP clause on
any attribute beyond the first/most significant attribute. These cases
are now all safe, so we go back to generating path keys without regard
for the presence of SAOP clauses (just like with any other clause type).
Affected queries can now exploit scan output order in all the usual ways
(e.g., certain "ORDER BY ... LIMIT n" queries can now terminate early).
Also undo changes from follow-up bugfix commit a4523c5a, which taught
the planner to produce alternative index paths, with path keys, but
without low-order SAOP index quals (filter quals were used instead).
We'll no longer generate these alternative paths, since they can no
longer offer any meaningful advantages over standard index qual paths.
Affected queries thereby avoid all of the disadvantages that come from
using filter quals within index scan nodes. They can avoid extra heap
page accesses from using filter quals to exclude non-matching tuples
(index quals will never have that problem). They can also skip over
irrelevant sections of the index in more cases (though only when nbtree
determines that starting another primitive scan actually makes sense).
There is a theoretical risk that removing restrictions on SAOP index
paths from the planner will break compatibility with amcanorder-based
index AMs maintained as extensions. Such an index AM could have the
same limitations around ordered SAOP scans as nbtree had up until now.
Adding a pro forma incompatibility item about the issue to the Postgres
17 release notes seems like a good idea.
Author: Peter Geoghegan <pg@bowt.ie>
Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-By: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-By: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: https://postgr.es/m/CAH2-Wz=ksvN_sjcnD1+Bt-WtifRA5ok48aDYnq3pkKhxgMQpcw@mail.gmail.com
After encountering the NUL terminator, the word-at-a-time loop exits
and we must hash the remaining bytes. Previously we calculated
the terminator's position and re-loaded the remaining bytes from
the input string. This was slower than the unaligned case for very
short strings. We already have all the data we need in a register,
so let's just mask off the bytes we need and hash them immediately.
In addition to endianness issues, the previous attempt upset valgrind
in the way it computed the mask. Whether by accident or by wisdom,
the author's proposed method passes locally with valgrind 3.22.
Ants Aasma, with cosmetic adjustments by me
Discussion: https://postgr.es/m/CANwKhkP7pCiW_5fAswLhs71-JKGEz1c1%2BPC0a_w1fwY4iGMqUA%40mail.gmail.com
This function previously used a mix of word-wise loads and bytewise
loads. The bytewise loads happened to be little-endian regardless of
platform. This in itself is not a problem. However, a future commit
will require the same result whether A) the input is loaded as a
word with the relevent bytes masked-off, or B) the input is loaded
one byte at a time.
While at it, improve debuggability of the internal hash state.
Discussion: https://postgr.es/m/CANWCAZZpuV1mES1mtSpAq8tWJewbrv4gEz6R_k4gzNG8GZ5gag%40mail.gmail.com
While pinning extra buffers to look ahead, users of strategies are in
danger of using too many buffers. For some strategies, that means
"escaping" from the ring, and in others it means forcing dirty data to
disk very frequently with associated WAL flushing. Since external code
has no insight into any of that, allow individual strategy types to
expose a clamp that should be applied when deciding how many buffers to
pin at once.
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CAAKRu_aJXnqsyZt6HwFLnxYEBgE17oypkxbKbT1t1geE_wvH2Q%40mail.gmail.com
Remove duplicate hash_string_pointer() function definitions by creating
a new inline function hash_string() for this purpose.
This has the added advantage of avoiding strlen() calls when doing hash
lookup. It's not clear how many of these are perfomance-sensitive
enough to benefit from that, but the simplification is worth it on
its own.
Reviewed by Jeff Davis
Discussion: https://postgr.es/m/CANWCAZbg_XeSeY0a_PqWmWqeRATvzTzUNYRLeT%2Bbzs%2BYQdC92g%40mail.gmail.com
fasthash_accum_cstring_aligned() uses a technique, found in various
strlen() implementations, to detect a string's NUL terminator by
reading a word at at time. That triggers failures when testing with
"-fsanitize=address", at least with frontend code. To enable using
this function anywhere, add a function attribute macro to disable
such testing.
Reviewed by Jeff Davis
Discussion: https://postgr.es/m/CANWCAZbwvp7oUEkbw-xP4L0_S_WNKq-J-ucP4RCNDPJnrakUPw%40mail.gmail.com
Align blocks stored in incremental files to BLCKSZ, so that the
incremental backups work well with CoW filesystems.
The header of the incremental file is padded with \0 to a multiple of
BLCKSZ, so that the block data (also BLCKSZ) is aligned to BLCKSZ. The
padding is added only to files containing block data, so files with just
the header remain small. This adds a bit of extra space, but as the
number of blocks increases the overhead gets negligible very quickly.
And as the padding is \0 bytes, it does compress extremely well.
The alignment is important for CoW filesystems that usually require the
blocks to be aligned to filesystem page size for features like block
sharing, deduplication etc. to work well. With the variable sized header
the blocks in the increments were not aligned at all, negating the
benefits of the CoW filesystems.
This matters even for non-CoW filesystems, for example when placed on a
RAID array. If the block is not aligned, it may easily span multiple
devices, causing read and write amplification.
It might be better to align the blocks to the filesystem page, not
BLCKSZ, but we have no good way to determine that. Even if we determine
the page size at the time of taking the backup, the backup may move. For
now the BLCKSZ seems sufficient - the filesystem page is usually 4K, so
the default BLCKSZ (8K by default) is aligned to that.
Author: Tomas Vondra
Reviewed-by: Robert Haas, Jakub Wartak
Discussion: https://postgr.es/m/3024283a-7491-4240-80d0-421575f6bb23%40enterprisedb.com
JSON_TABLE() allows JSON data to be converted into a relational view
and thus used, for example, in a FROM clause, like other tabular
data. Data to show in the view is selected from a source JSON object
using a JSON path expression to get a sequence of JSON objects that's
called a "row pattern", which becomes the source to compute the
SQL/JSON values that populate the view's output columns. Column
values themselves are computed using JSON path expressions applied to
each of the JSON objects comprising the "row pattern", for which the
SQL/JSON query functions added in 6185c9737c are used.
To implement JSON_TABLE() as a table function, this augments the
TableFunc and TableFuncScanState nodes that are currently used to
support XMLTABLE() with some JSON_TABLE()-specific fields.
Note that the JSON_TABLE() spec includes NESTED COLUMNS and PLAN
clauses, which are required to provide more flexibility to extract
data out of nested JSON objects, but they are not implemented here
to keep this commit of manageable size.
Author: Nikita Glukhov <n.gluhov@postgrespro.ru>
Author: Teodor Sigaev <teodor@sigaev.ru>
Author: Oleg Bartunov <obartunov@gmail.com>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Andrew Dunstan <andrew@dunslane.net>
Author: Amit Langote <amitlangote09@gmail.com>
Author: Jian He <jian.universality@gmail.com>
Reviewers have included (in no particular order):
Andres Freund, Alexander Korotkov, Pavel Stehule, Andrew Alsup,
Erik Rijkers, Zihong Yu, Himanshu Upadhyaya, Daniel Gustafsson,
Justin Pryzby, Álvaro Herrera, Jian He
Discussion: https://postgr.es/m/cd0bb935-0158-78a7-08b5-904886deac4b@postgrespro.ru
Discussion: https://postgr.es/m/20220616233130.rparivafipt6doj3@alap3.anarazel.de
Discussion: https://postgr.es/m/abd9b83b-aa66-f230-3d6d-734817f0995d%40postgresql.org
Discussion: https://postgr.es/m/CA+HiwqE4XTdfb1nW=Ojoy_tQSRhYt-q_kb6i5d4xcKyrLC1Nbg@mail.gmail.com
This adds the infrastructure for using the new non-recursive JSON parser
in processing manifests. It's important that callers make sure that the
last piece of json handed to the incremental manifest parser contains
the entire last few lines of the manifest, including the checksum.
Author: Andrew Dunstan
Reviewed-By: Jacob Champion
Discussion: https://postgr.es/m/7b0a51d6-0d9d-7366-3a1a-f74397a02f55@dunslane.net
This parser uses an explicit prediction stack, unlike the present
recursive descent parser where the parser state is represented on the
call stack. This difference makes the new parser suitable for use in
incremental parsing of huge JSON documents that cannot be conveniently
handled piece-wise by the recursive descent parser. One potential use
for this will be in parsing large backup manifests associated with
incremental backups.
Because this parser is somewhat slower than the recursive descent
parser, it is not replacing that parser, but is an additional parser
available to callers.
For testing purposes, if the build is done with -DFORCE_JSON_PSTACK, all
JSON parsing is done with the non-recursive parser, in which case only
trivial regression differences in error messages should be observed.
Author: Andrew Dunstan
Reviewed-By: Jacob Champion
Discussion: https://postgr.es/m/7b0a51d6-0d9d-7366-3a1a-f74397a02f55@dunslane.net
To allow the use of the read stream API added in b5a9b18cd for
sequential scans on heap tables, here we make some adjustments to make
that change less invasive and perhaps make the code easier to follow in
the process.
Here heapgetpage() gets broken into two functions:
1) The part which reads the block has now been moved into a function
named heapfetchbuf().
2) The part which performed pruning and populated the scan's
rs_vistuples[] array is now moved into a new function named
heap_prepare_pagescan().
The functionality provided by heap_prepare_pagescan() was only ever
required by SO_ALLOW_PAGEMODE scans, so the branching that was
previously done in heapgetpage() is no longer needed as we simply just
don't call heap_prepare_pagescan() from heapgettup() in the refactored
code.
Author: Melanie Plageman
Discussion: https://postgr.es/m/CAAKRu_YtXJiYKQvb5JsA2SkwrsizYLugs4sSOZh3EAjKUg=gEQ@mail.gmail.com
EXPLAIN (ANALYZE, SERIALIZE) allows collection of statistics about
the volume of data emitted by a query, as well as the time taken
to convert the data to the on-the-wire format. Previously there
was no way to investigate this without actually sending the data
to the client, in which case network transmission costs might
swamp what you wanted to see. In particular this feature allows
investigating the costs of de-TOASTing compressed or out-of-line
data during formatting.
Stepan Rutz and Matthias van de Meent,
reviewed by Tomas Vondra and myself
Discussion: https://postgr.es/m/ca0adb0e-fa4e-c37e-1cd7-91170b18cae1@gmx.de
If there aren't many bytes to process, the function call overhead
of the optimized implementation isn't worth taking, so instead we
inline a loop that consults pg_number_of_ones in that case. If
there are many bytes to process, we accept the function call
overhead because the optimized versions are likely to be faster.
The threshold at which we use the optimized implementation is set
to the smallest amount of data required to use special popcount
instructions.
Reviewed-by: Alvaro Herrera, Tom Lane
Discussion: https://postgr.es/m/20240402155301.GA2750455%40nathanxps13
Execute both freezing and pruning of tuples in the same
heap_page_prune() function, now called heap_page_prune_and_freeze(),
and emit a single WAL record containing all changes. That reduces the
overall amount of WAL generated.
This moves the freezing logic from vacuumlazy.c to the
heap_page_prune_and_freeze() function. The main difference in the
coding is that in vacuumlazy.c, we looked at the tuples after the
pruning had already happened, but in heap_page_prune_and_freeze() we
operate on the tuples before pruning. The heap_prepare_freeze_tuple()
function is now invoked after we have determined that a tuple is not
going to be pruned away.
VACUUM no longer needs to loop through the items on the page after
pruning. heap_page_prune_and_freeze() does all the work. It now
returns the list of dead offsets, including existing LP_DEAD items, to
the caller. Similarly it's now responsible for tracking 'all_visible',
'all_frozen', and 'hastup' on the caller's behalf.
Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://www.postgresql.org/message-id/20240330055710.kqg6ii2cdojsxgje@liskov
06c418e163 introduced pg_wal_replay_wait() procedure allowing to wait for
the particular LSN to be replayed on standby. The waiters were stored in
the flat array. Even though scanning small arrays is fast, that might be a
problem at scale (a lot of waiting processes).
This commit replaces the flat shared memory array with the pairing heap,
which holds the waiter with the least LSN at the top. This gives us O(log N)
complexity for both inserting and removing waiters.
Reported-by: Alvaro Herrera
Discussion: https://postgr.es/m/202404030658.hhj3vfxeyhft%40alvherre.pgsql
We were directly copying the LSN locations while syncing the slots on the
standby. Now, it is possible that at some particular restart_lsn there are
some running xacts, which means if we start reading the WAL from that
location after promotion, we won't reach a consistent snapshot state at
that point. However, on the primary, we would have already been in a
consistent snapshot state at that restart_lsn so we would have just
serialized the existing snapshot.
To avoid this problem we will use the advance_slot functionality unless
the snapshot already exists at the synced restart_lsn location. This will
help us to ensure that snapbuilder/slot statuses are updated properly
without generating any changes. Note that the synced slot will remain as
RS_TEMPORARY till the decoding from corresponding restart_lsn can reach a
consistent snapshot state after which they will be marked as
RS_PERSISTENT.
Per buildfarm
Author: Hou Zhijie
Reviewed-by: Bertrand Drouvot, Shveta Malik, Bharath Rupireddy, Amit Kapila
Discussion: https://postgr.es/m/OS0PR01MB5716B3942AE49F3F725ACA92943B2@OS0PR01MB5716.jpnprd01.prod.outlook.com
Previously, when selecting the transaction to evict during logical
decoding, we check all transactions to find the largest
transaction. This could lead to a significant replication lag
especially in the case where there are many subtransactions.
This commit improves the eviction algorithm in ReorderBuffer using the
max-heap with transaction size as the key to efficiently find the
largest transaction.
The max-heap starts with empty. While the max-heap is empty, we don't
do anything for the max-heap when updating the memory
counter. Therefore, we get the largest transaction in O(N) time, where
N is the number of transactions including top-level transactions and
subtransactions.
We build the max-heap just before selecting the largest transactions
if the number of transactions being decoded is higher than the
threshold, MAX_HEAP_TXN_COUNT_THRESHOLD. After building the max-heap,
we also update the max-heap when updating the memory counter. The
intention is to efficiently find the largest transaction in O(1) time
instead of incurring the cost of memory counter updates (O(log
N)). Once the number of transactions got lower than the threshold, we
reset the max-heap.
The performance benchmark results showed significant speed up (more
than x30 speed up on my machine) in decoding a transaction with 100k
subtransactions, whereas there is no visible overhead in other cases.
Reviewed-by: Amit Kapila, Hayato Kuroda, Vignesh C, Ajin Cherian,
Tomas Vondra, Shubham Khanna, Peter Smith, Álvaro Herrera,
Euler Taveira
Discussion: https://postgr.es/m/CAD21AoAfKTgrBrLq96GcTv9d6k97zaQcDM-rxfKEt4GSe0qnaQ%40mail.gmail.com
Previously, binaryheap didn't support updating a key and removing a
node in an efficient way. For example, in order to remove a node from
the binaryheap, the caller had to pass the node's position within the
array that the binaryheap internally has. Removing a node from the
binaryheap is done in O(log n) but searching for the key's position is
done in O(n).
This commit adds a hash table to binaryheap in order to track the
position of each nodes in the binaryheap. That way, by using newly
added functions such as binaryheap_update_up() etc., both updating a
key and removing a node can be done in O(1) on an average and O(log n)
in worst case. This is known as the indexed binary heap. The caller
can specify to use the indexed binaryheap by passing indexed = true.
The current code does not use the new indexing logic, but it will be
used by an upcoming patch.
Reviewed-by: Vignesh C, Peter Smith, Hayato Kuroda, Ajin Cherian,
Tomas Vondra, Shubham Khanna
Discussion: https://postgr.es/m/CAD21AoDffo37RC-eUuyHJKVEr017V2YYDLyn1xF_00ofptWbkg%40mail.gmail.com
pg_wal_replay_wait() is to be used on standby and specifies waiting for
the specific WAL location to be replayed before starting the transaction.
This option is useful when the user makes some data changes on primary and
needs a guarantee to see these changes on standby.
The queue of waiters is stored in the shared memory array sorted by LSN.
During replay of WAL waiters whose LSNs are already replayed are deleted from
the shared memory array and woken up by setting of their latches.
pg_wal_replay_wait() needs to wait without any snapshot held. Otherwise,
the snapshot could prevent the replay of WAL records implying a kind of
self-deadlock. This is why it is only possible to implement
pg_wal_replay_wait() as a procedure working in a non-atomic context,
not a function.
Catversion is bumped.
Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan, Alexander Korotkov
Reviewed-by: Michael Paquier, Peter Eisentraut, Dilip Kumar, Amit Kapila
Reviewed-by: Alexander Lakhin, Bharath Rupireddy, Euler Taveira
If temp tables have dependencies (such as sequences) then it's
possible for autovacuum's cleanup of orphan temp tables to deadlock
against an incoming backend that's trying to clean out the temp
namespace for its own use. That can happen because RemoveTempRelations'
performDeletion call can visit objects within the namespace in
an order different from the order in which a per-table deletion
will visit them.
To fix, observe that performDeletion will begin by taking an exclusive
lock on the temp namespace (even though it won't actually delete it).
So, if we can get a shared lock on the namespace, we can be sure we're
not running concurrently with RemoveTempRelations, while also not
conflicting with ordinary use of the namespace. This requires
introducing a conditional version of LockDatabaseObject, but that's no
big deal. (It's surprising we've got along without that this long.)
Report and patch by Mikhail Zhilin. Back-patch to all supported
branches.
Discussion: https://postgr.es/m/c43ce028-2bc2-4865-9b89-3f706246eed5@postgrespro.ru
Introduce an abstraction allowing relation data to be accessed as a
stream of buffers, with an implementation that is more efficient than
the equivalent sequence of ReadBuffer() calls.
Client code supplies a callback that can say which block number it wants
next, and then consumes individual buffers one at a time from the
stream. This division puts read_stream.c in control of how far ahead it
can see and allows it to read clusters of neighboring blocks with
StartReadBuffers(). It also issues POSIX_FADV_WILLNEED advice ahead of
time when random access is detected.
Other variants of I/O stream will be proposed in future work (for
example to support recovery, whose LsnReadQueue device in
xlogprefetcher.c is a distant cousin of this code and should eventually
be replaced by this), but this basic API is sufficient for many common
executor usage patterns involving predictable access to a single fork of
a single relation.
Several patches using this API are proposed separately.
This stream concept is loosely based on ideas from Andres Freund on how
we should pave the way for later work on asynchronous I/O.
Author: Thomas Munro <thomas.munro@gmail.com>
Author: Heikki Linnakangas <hlinnaka@iki.fi> (contributions)
Author: Melanie Plageman <melanieplageman@gmail.com> (contributions)
Suggested-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Tested-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
Break ReadBuffer() up into two steps. StartReadBuffers() and
WaitReadBuffers() give us two main advantages:
1. Multiple consecutive blocks can be read with one system call.
2. Advice (hints of future reads) can optionally be issued to the
kernel ahead of time.
The traditional ReadBuffer() function is now implemented in terms of
those functions, to avoid duplication.
A new GUC io_combine_limit is defined, and the functions for limiting
per-backend pin counts are made into public APIs. Those are provided
for use by callers of StartReadBuffers(), when deciding how many buffers
to read at once. The following commit will add a higher level mechanism
for doing that automatically with a practical interface.
With some more infrastructure in later work, StartReadBuffers() could
be extended to start real asynchronous I/O instead of just issuing
advice and leaving WaitReadBuffers() to do the work synchronously.
Author: Thomas Munro <thomas.munro@gmail.com>
Author: Andres Freund <andres@anarazel.de> (some optimization tweaks)
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Tested-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
Previously, we used a simple array for storing dead tuple IDs during
lazy vacuum, which had a number of problems:
* The array used a single allocation and so was limited to 1GB.
* The allocation was pessimistically sized according to table size.
* Lookup with binary search was slow because of poor CPU cache and
branch prediction behavior.
This commit replaces that array with the TID store from commit
30e144287a.
Since the backing radix tree makes small allocations as needed, the
1GB limit is now gone. Further, the total memory used is now often
smaller by an order of magnitude or more, depending on the
distribution of blocks and offsets. These two features should make
multiple rounds of heap scanning and index cleanup an extremely rare
event. TID lookup during index cleanup is also several times faster,
even more so when index order is correlated with heap tuple order.
Since there is no longer a predictable relationship between the number
of dead tuples vacuumed and the space taken up by their TIDs, the
number of tuples no longer provides any meaningful insights for users,
nor is the maximum number predictable. For that reason this commit
also changes to byte-based progress reporting, with the relevant
columns of pg_stat_progress_vacuum renamed accordingly to
max_dead_tuple_bytes and dead_tuple_bytes.
For parallel vacuum, both the TID store and supplemental information
specific to vacuum are shared among the parallel vacuum workers. As
with the previous array, we don't take any locks on TidStore during
parallel vacuum since writes are still only done by the leader
process.
Bump catalog version.
Reviewed-by: John Naylor, (in an earlier version) Dilip Kumar
Discussion: https://postgr.es/m/CAD21AoAfOZvmfR0j8VmZorZjL7RhTiQdVttNuC4W-Shdc2a-AA%40mail.gmail.com
66c0185a3 adjusted the UNION planner to request that union child queries
produce Paths correctly ordered to implement the UNION by way of
MergeAppend followed by Unique. The code there made a bad assumption
that if the root->parent_root->parse had setOperations set that the
query must be the child subquery of a set operation. That's not true
when it comes to planning a non-inlined CTE which is parented by a set
operation. This causes issues as the CTE's targetlist has no
requirement to match up to the SetOperationStmt's groupClauses
Fix this by adding a new parameter to both subquery_planner() and
grouping_planner() to explicitly pass the SetOperationStmt only when
planning set operation child subqueries.
Thank you to Tom Lane for helping to rationalize the decision on the
best function signature for subquery_planner().
Reported-by: Alexander Lakhin
Discussion: https://postgr.es/m/242fc7c6-a8aa-2daf-ac4c-0a231e2619c1@gmail.com
Currently there is only one option, HEAP_PAGE_PRUNE_MARK_UNUSED_NOW
which replaces the old boolean argument, but upcoming patches will
introduce at least one more. Having a lot of boolean arguments makes
it hard to see at the call sites what the arguments mean, so prefer a
bitmask of options with human-readable names.
Author: Melanie Plageman <melanieplageman@gmail.com>
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Discussion: https://www.postgresql.org/message-id/20240401172219.fngjosaqdgqqvg4e@liskov
This commit adds a new COPY option LOG_VERBOSITY, which controls the
amount of messages emitted during processing. Valid values are
'default' and 'verbose'.
This is currently used in COPY FROM when ON_ERROR option is set to
ignore. If 'verbose' is specified, a NOTICE message is emitted for
each discarded row, providing additional information such as line
number, column name, and the malformed value. This helps users to
identify problematic rows that failed to load.
Author: Bharath Rupireddy
Reviewed-by: Michael Paquier, Atsushi Torikoshi, Masahiko Sawada
Discussion: https://www.postgresql.org/message-id/CALj2ACUk700cYhx1ATRQyRw-fBM%2BaRo6auRAitKGff7XNmYfqQ%40mail.gmail.com
After encountering the NUL terminator, the word-at-a-time loop exits
and we must hash the remaining bytes. Previously we calculated the
terminator's position and re-loaded the remaining bytes from the input
string. We already have all the data we need in a register, so let's
just mask off the bytes we need and hash them immediately. The mask can
be cheaply computed without knowing the terminator's position. We still
need that position for the length calculation, but the CPU can now
do that in parallel with other work, shortening the dependency chain.
Ants Aasma and John Naylor
Discussion: https://postgr.es/m/CANwKhkP7pCiW_5fAswLhs71-JKGEz1c1%2BPC0a_w1fwY4iGMqUA%40mail.gmail.com
Previously, the executor did index insert unconditionally after calling
table AM interface methods tuple_insert() and multi_insert(). This commit
introduces the new parameter insert_indexes for these two methods. Setting
'*insert_indexes' to true saves the current logic. Setting it to false
indicates that table AM cares about index inserts itself and doesn't want the
caller to do that.
Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com
Reviewed-by: Pavel Borisov, Matthias van de Meent, Mark Dilger
Currently, there is just one algorithm for sampling tuples from a table written
in acquire_sample_rows(). Custom table AM can just redefine the way to get the
next block/tuple by implementing scan_analyze_next_block() and
scan_analyze_next_tuple() API functions.
This approach doesn't seem general enough. For instance, it's unclear how to
sample this way index-organized tables. This commit allows table AM to
encapsulate the whole sampling algorithm (currently implemented in
acquire_sample_rows()) into the relation_analyze() API function.
Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com
Reviewed-by: Pavel Borisov, Matthias van de Meent
This SQL-callable function behaves much like our internal utility
function getBaseType(), except it returns NULL rather than failing for
an invalid type OID. (That behavior is modeled on our experience with
other catalog-inquiry functions such as the ACL checking functions.)
The key advantage over doing a join to pg_type is that it will loop
as needed to find the bottom base type of a nest of domains.
Steve Chavez, reviewed by jian he and others
Discussion: https://postgr.es/m/CAGRrpzZSX8j=MQcbCSEisFA=ic=K3bknVfnFjAv1diVJxFHJvg@mail.gmail.com
This allows MERGE commands to include WHEN NOT MATCHED BY SOURCE
actions, which operate on rows that exist in the target relation, but
not in the data source. These actions can execute UPDATE, DELETE, or
DO NOTHING sub-commands.
This is in contrast to already-supported WHEN NOT MATCHED actions,
which operate on rows that exist in the data source, but not in the
target relation. To make this distinction clearer, such actions may
now be written as WHEN NOT MATCHED BY TARGET.
Writing WHEN NOT MATCHED without specifying BY SOURCE or BY TARGET is
equivalent to writing WHEN NOT MATCHED BY TARGET.
Dean Rasheed, reviewed by Alvaro Herrera, Ted Yu and Vik Fearing.
Discussion: https://postgr.es/m/CAEZATCWqnKGc57Y_JanUBHQXNKcXd7r=0R4NEZUVwP+syRkWbA@mail.gmail.com
This brings the titlecasing implementation for the builtin provider
out of formatting.c and into unicode_case.c, along with
unicode_strlower() and unicode_strupper(). Accepts an arbitrary word
boundary callback.
Simple for now, but can be extended to support the Unicode Default
Case Conversion algorithm with full case mapping.
Discussion: https://postgr.es/m/3bc653b5d562ae9e2838b11cb696816c328a489a.camel@j-davis.com
Reviewed-by: Peter Eisentraut
This is marked PGC_SIGHUP, so it can only be set in a configuration
file, not anywhere else; and it is also marked GUC_DISALLOW_IN_AUTO_FILE,
so it can't be set using ALTER SYSTEM. When set to false, the
ALTER SYSTEM command is disallowed.
There was considerable concern that this would be misinterpreted as
a security feature, which it is not, because a determined superuser
has various ways of bypassing it. Hence, a lot of work has gone into
wordsmithing the documentation, in the hopes of avoiding any such
confusion.
Jelte Fennemia-Nio and Gabriele Bartolini, with wording suggestions
for the documentation from many others.
Discussion: http://postgr.es/m/CA%2BVUV5rEKt2%2BCdC_KUaPoihMu%2Bi5ChT4WVNTr4CD5-xXZUfuQw%40mail.gmail.com
Apparently these markers cause the modules to not link correctly in some
platforms, at least per buildfarm member indri; moreover, this code is
only used in modules that don't have a translation. If we someday add
i18n support to contrib/ it might be worth revisiting this.
Commit 61461a300c introduced new functions to libpq for cancelling
queries. This commit introduces a helper function that backend-side
libraries and extensions can use to invoke those. This function takes a
timeout and can itself be interrupted while it is waiting for a cancel
request to be sent and processed, instead of being blocked.
This replaces the usage of the old functions in postgres_fdw and dblink.
Finally, it also adds some test coverage for the cancel support in
postgres_fdw.
Author: Jelte Fennema-Nio <postgres@jeltef.nl>
Discussion: https://postgr.es/m/CAGECzQT_VgOWWENUqvUV9xQmbaCyXjtRRAYO8W07oqashk_N+g@mail.gmail.com
Previously, the behavior of TidStoreCreate() was inconsistent between
local and shared TidStore instances in terms of memory limitation. For
local TidStore, a memory context was created with initial and maximum
memory block sizes, as well as a minimum memory context size, based on
the specified max_bytes values. However, for shared TidStore, the
provided DSA area was used for TID storage. Although commit bb952c8c8b
allowed specifying the initial and maximum DSA segment sizes, callers
would have needed to clamp their own limits, which was not consistent
and user-friendly.
With this commit, when creating a shared TidStore, a dedicated DSA
area is created for TID storage instead of using a provided DSA
area. The initial and maximum DSA segment sizes are chosen based on
the specified max_bytes. Other processes can attach to the shared
TidStore using the handle of the created DSA returned by the new
TidStoreGetDSA() function and the DSA pointer returned by
TidStoreGetHandle(). The created DSA has the same lifetime as the
shared TidStore and is deleted when all processes detach from it.
To improve clarity, the TidStoreCreate() function has been divided
into two separate functions: TidStoreCreateLocal() and
TidStoreCreateShared().
Reviewed-by: John Naylor
Discussion: https://postgr.es/m/CAD21AoAyc1j%3DBCdUqZfk6qbdjZ68UgRx1Gkpk0oah4K7S0Ri9g%40mail.gmail.com
The user-facing name is "Other Platforms and Clients", but the
internal name seems too focused on clients specifically, especially
given the plan to add a new setting to this session that is about
platform or deployment model compatibility rather than client
compatibility.
Jelte Fennema-Nio
Discussion: http://postgr.es/m/CAGECzQTfMbDiM6W3av+3weSnHxJvPmuTEcjxVvSt91sQBdOxuQ@mail.gmail.com
This adds 3 new variants of the random() function:
random(min integer, max integer) returns integer
random(min bigint, max bigint) returns bigint
random(min numeric, max numeric) returns numeric
Each returns a random number x in the range min <= x <= max.
For the numeric function, the number of digits after the decimal point
is equal to the number of digits that "min" or "max" has after the
decimal point, whichever has more.
The main entry points for these functions are in a new C source file.
The existing random(), random_normal(), and setseed() functions are
moved there too, so that they can all share the same PRNG state, which
is kept private to that file.
Dean Rasheed, reviewed by Jian He, David Zhang, Aleksander Alekseev,
and Tomas Vondra.
Discussion: https://postgr.es/m/CAEZATCV89Vxuq93xQdmc0t-0Y2zeeNQTdsjbmV7dyFBPykbV4Q@mail.gmail.com
Previously, the DSA segment size always started with 1MB and grew up
to DSA_MAX_SEGMENT_SIZE. It was inconvenient in certain scenarios,
such as when the caller desired a soft constraint on the total DSA
segment size, limiting it to less than 1MB.
This commit introduces the capability to specify the initial and
maximum DSA segment sizes when creating a DSA area, providing more
flexibility and control over memory usage.
Reviewed-by: John Naylor, Tomas Vondra
Discussion: https://postgr.es/m/CAD21AoAYGGC1ePjVX0H%2Bpp9rH%3D9vuPK19nNOiu12NprdV5TVJA%40mail.gmail.com
The newly-introduced "one_by_one" label produces -Wunused-label
warnings when building without SIMD support. To fix, move the
label into the SIMD section of this function.
Oversight in commit 7644a7340c.
Reported-by: Tom Lane
Discussion: https://postgr.es/m/3189995.1711495704%40sss.pgh.pa.us
This commit improves the performance of pg_lfind32() in many cases
by modifying it to process the remaining "tail" of elements with
SIMD instructions instead of processing them one-by-one. Since the
SIMD code processes a large block of elements, this means that we
will process a subset of elements more than once, but that won't
affect the correctness of the result, and testing has shown that
this helps more cases than it regresses. With this change, the
standard one-by-one linear search code is only used for small
arrays and for platforms without SIMD support.
Suggested-by: John Naylor
Reviewed-by: John Naylor
Discussion: https://postgr.es/m/20231129171526.GA857928%40nathanxps13
If we know the sort order of a CTE's output, and it is relevant
to the outer query, label the CTE's outer-query access path using
those pathkeys. This may enable optimizations such as avoiding
a sort in the outer query.
The code for hoisting pathkeys into the outer query already exists
for regular RTE_SUBQUERY subqueries, but it wasn't getting used for
CTEs, possibly out of concern for maintaining an optimization fence
between the CTE and the outer query. However, on the same arguments
used for commit f7816aec2, there seems no harm in letting the outer
query know what the inner query decided to do.
In support of this, we now remember the best Path as well as Plan
for each subquery for the rest of the planner run. There may be
future applications for having that at hand, and it surely costs
little to build one more List.
Richard Guo (minor mods by me)
Discussion: https://postgr.es/m/CAMbWs49xYd3f8CrE8-WW3--dV1zH_sDSDn-vs2DzHj81Wcnsew@mail.gmail.com
ObjectClass is an enum whose values correspond to catalog OIDs. But
the extra layer of redirection, which is used only in small parts of
the code, and the similarity to ObjectType, are confusing and
cumbersome.
One advantage has been that some switches processing the OCLASS enum
don't have "default:" cases. This is so that the compiler tells us
when we fail to add support for some new object class. But you can
also handle that with some assertions and proper test coverage. It's
not even clear how strong this benefit is. For example, in
AlterObjectNamespace_oid(), you could still put a new OCLASS into the
"ignore object types that don't have schema-qualified names" case, and
it might or might not be wrong. Also, there are already various
OCLASS switches that do have a default case, so it's not even clear
what the preferred coding style should be.
Reviewed-by: jian he <jian.universality@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://www.postgresql.org/message-id/flat/CAGECzQT3caUbcCcszNewCCmMbCuyP7XNAm60J3ybd6PN5kH2Dw%40mail.gmail.com
Currently, in read committed transaction isolation mode (default), we have the
following sequence of actions when tuple_update()/tuple_delete() finds
the tuple updated by the concurrent transaction.
1. Attempt to update/delete tuple with tuple_update()/tuple_delete(), which
returns TM_Updated.
2. Lock tuple with tuple_lock().
3. Re-evaluate plan qual (recheck if we still need to update/delete and
calculate the new tuple for update).
4. Second attempt to update/delete tuple with tuple_update()/tuple_delete().
This attempt should be successful, since the tuple was previously locked.
This commit eliminates step 2 by taking the lock during the first
tuple_update()/tuple_delete() call. The heap table access method saves some
effort by checking the updated tuple once instead of twice. Future
undo-based table access methods, which will start from the latest row version,
can immediately place a lock there.
Also, this commit makes tuple_update()/tuple_delete() optionally save the old
tuple into the dedicated slot. That saves efforts on re-fetching tuples in
certain cases.
The code in nodeModifyTable.c is simplified by removing the nested switch/case.
Discussion: https://postgr.es/m/CAPpHfdua-YFw3XTprfutzGp28xXLigFtzNbuFY8yPhqeq6X5kg%40mail.gmail.com
Reviewed-by: Aleksander Alekseev, Pavel Borisov, Vignesh C, Mason Sharp
Reviewed-by: Andres Freund, Chris Travers
It's now possible to specify a table access method via
CREATE TABLE ... USING for a partitioned table, as well change it with
ALTER TABLE ... SET ACCESS METHOD. Specifying an AM for a partitioned
table lets the value be used for all future partitions created under it,
closely mirroring the behavior of the TABLESPACE option for partitioned
tables. Existing partitions are not modified.
For a partitioned table with no AM specified, any new partitions are
created with the default_table_access_method.
Also add ALTER TABLE ... SET ACCESS METHOD DEFAULT, which reverts to the
original state of using the default for new partitions.
The relcache of partitioned tables is not changed: rd_tableam is not
set, even if a partitioned table has a relam set.
Author: Justin Pryzby <pryzby@telsasoft.com>
Author: Soumyadeep Chakraborty <soumyadeep2007@gmail.com>
Author: Michaël Paquier <michael@paquier.xyz>
Reviewed-by: The authors themselves
Discussion: https://postgr.es/m/CAE-ML+9zM4wJCGCBGv01k96qQ3gFv4WFcFy=zqPHKeaEFwwv6A@mail.gmail.com
Discussion: https://postgr.es/m/20210308010707.GA29832%40telsasoft.com
The new combined WAL record is now used for pruning, freezing and 2nd
pass of vacuum. This is in preparation for changing VACUUM to write a
combined prune+freeze record per page, instead of separate two
records. The new WAL record format now supports that, but the code
still always writes separate records for pruning and freezing.
This reserves separate XLOG_HEAP2_* info codes for when the pruning
record is emitted for on-access pruning or VACUUM, per Peter
Geoghegan's suggestion. The record format is identical, but having
separate info codes makes it easier analyze pruning and vacuuming with
pg_waldump.
The function to emit the new WAL record, log_heap_prune_and_freeze(),
is in pruneheap.c. The existing heap_log_freeze_plan() and its
subroutines are moved to pruneheap.c without changes, to keep them
together with log_heap_prune_and_freeze().
Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAAKRu_azf-zH%3DDgVbquZ3tFWjMY1w5pO8m-TXJaMdri8z3933g@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/CAAKRu_b2oE4GL%3Dq4g9mcByS9yT7wTQvEH9OLpabj28e%2BWKFi2A@mail.gmail.com
This commit adds a new property called last_inactive_time for slots. It is
set to 0 whenever a slot is made active/acquired and set to the current
timestamp whenever the slot is inactive/released or restored from the disk.
Note that we don't set the last_inactive_time for the slots currently being
synced from the primary to the standby because such slots are typically
inactive as decoding is not allowed on those.
The 'last_inactive_time' will be useful on production servers to debug and
analyze inactive replication slots. It will also help to know the lifetime
of a replication slot - one can know how long a streaming standby, logical
subscriber, or replication slot consumer is down.
The 'last_inactive_time' will also be useful to implement inactive
timeout-based replication slot invalidation in a future commit.
Author: Bharath Rupireddy
Reviewed-by: Bertrand Drouvot, Amit Kapila, Shveta Malik
Discussion: https://www.postgresql.org/message-id/CALj2ACW4aUe-_uFQOjdWCEN-xXoLGhmvRFnL8SNw_TZ5nJe+aw@mail.gmail.com
This teaches build_child_join_sjinfo() to create the dummy
SpecialJoinInfos (those created for inner joins) directly for a given
child join, skipping the unnecessary overhead of translating the
parent joinrel's SpecialJoinInfo.
To that end, this commit moves the code to initialize the dummy
SpecialJoinInfos to a new function named init_dummy_sjinfo() and
changes the few existing sites that have this code and
build_child_join_sjinfo() to call this new function.
Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Andrey Lepikhov <a.lepikhov@postgrespro.ru>
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: https://postgr.es/m/CAExHW5tHqEf3ASVqvFFcghYGPfpy7o3xnvhHwBGbJFMRH8KjNw@mail.gmail.com
Specifically, this commit reduces the memory consumed by the
SpecialJoinInfos that are allocated for child joins in
try_partitionwise_join() by freeing them at the end of creating paths
for each child join.
A SpecialJoinInfo allocated for a given child join is a copy of the
parent join's SpecialJoinInfo, which contains the translated copies
of the various Relids bitmapsets and semi_rhs_exprs, which is a List
of Nodes. The newly added freeing step frees the struct itself and
the various bitmapsets, but not semi_rhs_exprs, because there's no
handy function to free the memory of Node trees.
Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Andrey Lepikhov <a.lepikhov@postgrespro.ru>
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: https://postgr.es/m/CAExHW5tHqEf3ASVqvFFcghYGPfpy7o3xnvhHwBGbJFMRH8KjNw@mail.gmail.com
Coverity complained about the integer handling issue; if we start with
an arbitrary non-negative shift value, the loop may decrement it down
to something less than zero before exiting. This commit adds an
assertion to make sure the 'shift' is always 0 after the loop, and
uses 0 as the shift to get the key chunk in the following operation.
Introduced by ee1b30f12.
Reported-by: Tom Lane as per coverity
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/2089517.1711299216%40sss.pgh.pa.us
Until now, UNION queries have often been suboptimal as the planner has
only ever considered using an Append node and making the results unique
by either using a Hash Aggregate, or by Sorting the entire Append result
and running it through the Unique operator. Both of these methods
always require reading all rows from the union subqueries.
Here we adjust the union planner so that it can request that each subquery
produce results in target list order so that these can be Merge Appended
together and made unique with a Unique node. This can improve performance
significantly as the union child can make use of the likes of btree
indexes and/or Merge Joins to provide the top-level UNION with presorted
input. This is especially good if the top-level UNION contains a LIMIT
node that limits the output rows to a small subset of the unioned rows as
cheap startup plans can be used.
Author: David Rowley
Reviewed-by: Richard Guo, Andy Fan
Discussion: https://postgr.es/m/CAApHDvpb_63XQodmxKUF8vb9M7CxyUyT4sWvEgqeQU-GB7QFoQ@mail.gmail.com
Add PERIOD clause to foreign key constraint definitions. This is
supported for range and multirange types. Temporal foreign keys check
for range containment instead of equality.
This feature matches the behavior of the SQL standard temporal foreign
keys, but it works on PostgreSQL's native ranges instead of SQL's
"periods", which don't exist in PostgreSQL (yet).
Reference actions ON {UPDATE,DELETE} {CASCADE,SET NULL,SET DEFAULT}
are not supported yet.
Author: Paul A. Jungwirth <pj@illuminatedcomputing.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: jian he <jian.universality@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CA+renyUApHgSZF9-nd-a0+OPGharLQLO=mDHcY4_qQ0+noCUVg@mail.gmail.com
Up to now, all of the "catcache list" objects within a catalog cache
were just chained together on a single dlist, requiring O(N) time to
search. Remarkably, we've not had serious performance problems with
that so far; but we got a complaint of a bad performance regression
from v15 in a case with a large number of roles in the system, which
traced down to O(N^2) total time when we probed N catcache lists.
Replace that data structure with a hashtable having an enlargeable
number of dlists, in an exactly parallel way to the data structure
we've used for years for the plain CatCTup cache members. The extra
cost of maintaining a hash table seems negligible, since we were
already computing a hash value for list searches.
Normally this'd be HEAD-only material, but in view of the performance
regression it seems advisable to back-patch into v16. In the v16
version of the patch, leave the dead cc_lists field where it is and
add the new fields at the end of struct catcache, to avoid possible
ABI breakage in case any external code is looking at these structs.
(We assume no external code is actually allocating new catcache
structs.)
Per report from alex work.
Discussion: https://postgr.es/m/CAGvXd3OSMbJQwOSc-Tq-Ro1CAz=vggErdSG7pv2s6vmmTOLJSg@mail.gmail.com
This adds the X509 attributes notBefore and notAfter to sslinfo
as well as pg_stat_ssl to allow verifying and identifying the
validity period of the current client certificate. OpenSSL has
APIs for extracting notAfter and notBefore, but they are only
supported in recent versions so we have to calculate the dates
by hand in order to make this work for the older versions of
OpenSSL that we still support.
Original patch by Cary Huang with additional hacking by Jacob
and myself.
Author: Cary Huang <cary.huang@highgo.ca>
Co-author: Jacob Champion <jacob.champion@enterprisedb.com>
Co-author: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://postgr.es/m/182b8565486.10af1a86f158715.2387262617218380588@highgo.ca
This changes nodeToString() to not output the actual value of location
fields in nodes, but instead it writes -1. This mirrors the fact that
stringToNode() also does not read location field values but always
stores -1.
For most uses of nodeToString(), which is to store nodes in catalog
fields, this is more useful. We don't store original query texts in
catalogs, so any lingering query location values are not meaningful.
For debugging purposes, there is a new nodeToStringWithLocations(),
which mirrors the existing stringToNodeWithLocations(). This is used
for WRITE_READ_PARSE_PLAN_TREES and nodes/print.c functions, which
covers all the debugging uses.
Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAEze2WgrCiR3JZmWyB0YTc8HV7ewRdx13j0CqD6mVkYAW+SFGQ@mail.gmail.com
Till now, the reason for replication slot invalidation is not tracked
directly in pg_replication_slots. A recent commit 007693f2a3 added
'conflict_reason' to show the reasons for slot conflict/invalidation, but
only for logical slots.
This commit adds a new column 'invalidation_reason' to show invalidation
reasons for both physical and logical slots. And, this commit also turns
'conflict_reason' text column to 'conflicting' boolean column (effectively
reverting commit 007693f2a3). The 'conflicting' column is true for
invalidation reasons 'rows_removed' and 'wal_level_insufficient' because
those make the slot conflict with recovery. When 'conflicting' is true,
one can now look at the new 'invalidation_reason' column for the reason
for the logical slot's conflict with recovery.
The new 'invalidation_reason' column will also be useful to track other
invalidation reasons in the future commit.
Author: Bharath Rupireddy
Reviewed-by: Bertrand Drouvot, Amit Kapila, Shveta Malik
Discussion: https://www.postgresql.org/message-id/ZfR7HuzFEswakt/a%40ip-10-97-1-34.eu-west-3.compute.internal
Discussion: https://www.postgresql.org/message-id/CALj2ACW4aUe-_uFQOjdWCEN-xXoLGhmvRFnL8SNw_TZ5nJe+aw@mail.gmail.com
Put the fields alias and eref earlier in the struct, so that it
matches the order in _outRangeTblEntry()/_readRangeTblEntry(). This
helps if we ever want to fully automate out/read of RangeTblEntry.
Also, it makes dumps in the debugger easier to read in the same way.
Internally, this makes no difference.
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Discussion: https://www.postgresql.org/message-id/flat/4b27fc50-8cd6-46f5-ab20-88dbaadca645@eisentraut.org
This is part of an effort to reduce the number of special cases in the
automatically generated node support functions.
This patch removes _jumbleRangeTblEntry() and instead adds per-field
query_jumble_ignore annotations to match the behavior of the previous
custom code. The pg_stat_statements test suite has some coverage of
this. It gets rid of the switch on rtekind; this should be
technically correct, since we do the equal and copy functions like
this also.
The list of fields to jumble has been checked and is considered
correct as of 8b29a119fd.
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Discussion: https://www.postgresql.org/message-id/flat/4b27fc50-8cd6-46f5-ab20-88dbaadca645@eisentraut.org
This allows us to abstract how/whether table AM uses transaction identifiers.
A custom table AM can use a custom slot, which may not store xmin directly,
but determine the tuple belonging to the current transaction in the other way.
Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com
Reviewed-by: Matthias van de Meent, Mark Dilger, Pavel Borisov
Reviewed-by: Nikita Malakhov, Japin Li
This allows table AM to return a native tuple slot even if
VirtualTupleTableSlot is given as an input. Native tuple slots have knowledge
about system attributes, which could be accessed in the future.
table_multi_insert() method already can modify the input 'slots' array.
Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com
Reviewed-by: Matthias van de Meent, Mark Dilger, Pavel Borisov
Reviewed-by: Nikita Malakhov, Japin Li
The new table AM method free_rd_amcache is responsible for freeing all the
memory related to rd_amcache and setting free_rd_amcache to NULL. If the new
method is not specified, we still assume rd_amcache to be a single chunk of
memory, which could be just pfree'd.
Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com
Reviewed-by: Matthias van de Meent, Mark Dilger, Pavel Borisov
Reviewed-by: Nikita Malakhov, Japin Li
This introduces the following SQL/JSON functions for querying JSON
data using jsonpath expressions:
JSON_EXISTS(), which can be used to apply a jsonpath expression to a
JSON value to check if it yields any values.
JSON_QUERY(), which can be used to to apply a jsonpath expression to
a JSON value to get a JSON object, an array, or a string. There are
various options to control whether multi-value result uses array
wrappers and whether the singleton scalar strings are quoted or not.
JSON_VALUE(), which can be used to apply a jsonpath expression to a
JSON value to return a single scalar value, producing an error if it
multiple values are matched.
Both JSON_VALUE() and JSON_QUERY() functions have options for
handling EMPTY and ERROR conditions, which can be used to specify
the behavior when no values are matched and when an error occurs
during jsonpath evaluation, respectively.
Author: Nikita Glukhov <n.gluhov@postgrespro.ru>
Author: Teodor Sigaev <teodor@sigaev.ru>
Author: Oleg Bartunov <obartunov@gmail.com>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Andrew Dunstan <andrew@dunslane.net>
Author: Amit Langote <amitlangote09@gmail.com>
Author: Peter Eisentraut <peter@eisentraut.org>
Author: Jian He <jian.universality@gmail.com>
Reviewers have included (in no particular order):
Andres Freund, Alexander Korotkov, Pavel Stehule, Andrew Alsup,
Erik Rijkers, Zihong Yu, Himanshu Upadhyaya, Daniel Gustafsson,
Justin Pryzby, Álvaro Herrera, Jian He, Anton A. Melnikov,
Nikita Malakhov, Peter Eisentraut, Tomas Vondra
Discussion: https://postgr.es/m/cd0bb935-0158-78a7-08b5-904886deac4b@postgrespro.ru
Discussion: https://postgr.es/m/20220616233130.rparivafipt6doj3@alap3.anarazel.de
Discussion: https://postgr.es/m/abd9b83b-aa66-f230-3d6d-734817f0995d%40postgresql.org
Discussion: https://postgr.es/m/CA+HiwqHROpf9e644D8BRqYvaAPmgBZVup-xKMDPk-nd4EpgzHw@mail.gmail.com
Discussion: https://postgr.es/m/CA+HiwqE4XTdfb1nW=Ojoy_tQSRhYt-q_kb6i5d4xcKyrLC1Nbg@mail.gmail.com
Commit cca97ce6a6 allowed dbname in pg_basebackup connstring and in this
commit we allow it to be written in postgresql.auto.conf when -R option is
used. The database name in the connection string will be used by the
logical replication slot synchronization on standby.
The dbname will be recorded only if specified explicitly in the connection
string or environment variable.
Masahiko Sawada hasn't reviewed the code in detail but endorsed the idea.
Author: Vignesh C, Kuroda Hayato
Reviewed-by: Amit Kapila
Discussion: https://postgr.es/m/CAB8KJ=hdKdg+UeXhReeHpHA6N6v3e0qFF+ZsPFHk9_ThWKf=2A@mail.gmail.com
TIDStore is a data structure designed to efficiently store large sets
of TIDs. For TID storage, it employs a radix tree, where the key is
a block number, and the value is a bitmap representing offset
numbers. The TIDStore can be created on a DSA area and used by
multiple backend processes simultaneously.
There are potential future users such as tidbitmap.c, though it's very
likely the interface will need to evolve as we come to understand the
needs of different kinds of users. For example, we can support
updating the offset bitmap of existing values.
Currently, the TIDStore is not used for anything yet, aside from the
test code. But an upcoming patch will use it.
This includes a unit test module, in src/test/modules/test_tidstore.
Co-authored-by: John Naylor
Discussion: https://postgr.es/m/CAD21AoAfOZvmfR0j8VmZorZjL7RhTiQdVttNuC4W-Shdc2a-AA%40mail.gmail.com
In combination with to_regtype, this allows converting a string to
the "canonicalized" form emitted by format_type. That usage requires
parsing the string twice, which is slightly annoying but not really
too expensive. We considered alternatives such as returning a record
type, but that way was notationally uglier than this, and possibly
less flexible.
Like to_regtype(), we'd rather that this return NULL for any bad
input, but the underlying type-parsing logic isn't yet capable of
not throwing syntax errors. Adjust the documentation for both
functions to point that out.
In passing, fix up a couple of nearby entries in the System Catalog
Information Functions table that had not gotten the word about our
since-v13 convention for displaying function usage examples.
David Wheeler and Erik Wienhold, reviewed by Pavel Stehule, Jim Jones,
and others.
Discussion: https://postgr.es/m/DF2324CA-2673-4ABE-B382-26B5770B6AA3@justatheory.com
This way, we can fold the list of lock names to occur in
BuiltinTrancheNames instead of having its own separate array. This
saves two lines of code in GetLWTrancheName and some space in
BuiltinTrancheNames, as foreseen in commit 74a7306310, as well as
removing the need for a separate lwlocknames.c file.
We still have to build lwlocknames.h using Perl code, which initially I
wanted to avoid, but it gives us the chance to cross-check
wait_event_names.txt.
Discussion: https://postgr.es/m/202401231025.gbv4nnte5fmm@alvherre.pgsql
To avoid the compiler warnings:
launch_backend.c:211:39: warning: comparison of constant 16 with expression of type 'BackendType' (aka 'enum BackendType') is always true [-Wtautological-constant-out-of-range-compare]
launch_backend.c:233:39: warning: comparison of constant 16 with expression of type 'BackendType' (aka 'enum BackendType') is always true [-Wtautological-constant-out-of-range-compare]
The point of the assertions was to fail more explicitly if someone
adds a new BackendType to the end of the enum, but forgets to add it
to the child_process_kinds array. It was a pretty weak assertion to
begin with, because it wouldn't catch if you added a new BackendType
in the middle of the enum. So let's just remove it.
Per buildfarm member ayu and a few others, spotted by Tom Lane.
Discussion: https://www.postgresql.org/message-id/4119680.1710913067@sss.pgh.pa.us
The builtin C.UTF-8 locale has similar semantics to the libc locale of
the same name. That is, code point sort order (fast, memcmp-based)
combined with Unicode semantics for character operations such as
pattern matching, regular expressions, and
LOWER()/INITCAP()/UPPER(). The character semantics are based on
Unicode simple case mappings.
The builtin provider's C.UTF-8 offers several important advantages
over libc:
* faster sorting -- benefits from additional optimizations such as
abbreviated keys and varstrfastcmp_c
* faster case conversion, e.g. LOWER(), at least compared with some
libc implementations
* available on all platforms with identical semantics, and the
semantics are stable, testable, and documentable within a given
Postgres major version
Being based on memcmp, the builtin C.UTF-8 locale does not offer
natural language sort order. But it is an improvement for most use
cases that might otherwise use libc's "C.UTF-8" locale, as well as
many use cases that use libc's "C" locale.
Discussion: https://postgr.es/m/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.camel%40j-davis.com
Reviewed-by: Daniel Vérité, Peter Eisentraut, Jeremy Schneider
On some systems, calls to pg_popcount{32,64} are indirected through
a function pointer. This commit converts pg_popcount() to a
function pointer on those systems so that we can inline the
appropriate pg_popcount{32,64} implementations into each of the
pg_popcount() implementations. Since pg_popcount() may call
pg_popcount{32,64} several times, this can significantly improve
its performance.
Suggested-by: David Rowley
Reviewed-by: David Rowley
Discussion: https://postgr.es/m/CAApHDvrb7MJRB6JuKLDEY2x_LKdFHwVbogpjZBCX547i5%2BrXOQ%40mail.gmail.com
When considering nestloop paths for individual partitions within
a partitionwise join, if the inner path is parameterized, it is
parameterized by the topmost parent of the outer rel, not the
corresponding outer rel itself. Therefore, we need to translate the
parameterization so that the inner path is parameterized by the
corresponding outer rel.
Up to now, we did this while generating join paths. However, that's
problematic because we must also translate some expressions that are
shared across all paths for a relation, such as restriction clauses
(kept in the RelOptInfo and/or IndexOptInfo) and TableSampleClauses
(kept in the RangeTblEntry). The existing code fails to translate
these at all, leading to wrong answers, odd failures such as
"variable not found in subplan target list", or executor crashes.
But we can't modify them during path generation, because that would
break things if we end up choosing some non-partitioned-join path.
So this patch postpones reparameterization of the inner path until
createplan.c, where it is safe to modify the referenced RangeTblEntry,
RelOptInfo or IndexOptInfo, because we have made a final choice of which
Path to use. We do still have to check during path generation that
the reparameterization will be possible. So we introduce a new
function path_is_reparameterizable_by_child() to detect that.
The duplication between path_is_reparameterizable_by_child() and
reparameterize_path_by_child() is a bit annoying, but there seems
no other good answer. A small benefit is that we can avoid building
useless reparameterized trees in cases where a non-partitioned join
is ultimately chosen. Also, reparameterize_path_by_child() can now
be allowed to scribble on the input paths, saving a few cycles.
This fix repairs the same problems previously addressed in the
back branches by commits 62f120203 et al.
Richard Guo, reviewed at various times by Ashutosh Bapat, Andrei
Lepikhov, Alena Rybakina, Robert Haas, and myself
Discussion: https://postgr.es/m/CAMbWs496+N=UAjOc=rcD3P7B6oJe4rZw08e_TZRUsWbPxZW3Tw@mail.gmail.com
Instead of the rather ugly type=int + name ~= location$, we now have a
marker type for offset pointers or sizes that are only relevant when a
query text is included, which decreases the complexity required in
gen_node_support.pl for handling these values.
Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAEze2WgrCiR3JZmWyB0YTc8HV7ewRdx13j0CqD6mVkYAW+SFGQ@mail.gmail.com
Based on comments from Peter Eisentraut.
* Document CREATE DATABASE ... BUILTIN_LOCALE.
* Determine required encoding based on locale name for CREATE
COLLATION. Use -1 for "C" (requires catversion bump).
* initdb output fixups.
* Make ctype_is_c a constant true for now.
* Fixups to ICU 010_create_database.pl test.
Discussion: https://postgr.es/m/4135cf11-206d-40ed-96c0-9363c1232379@eisentraut.org
Introduce new postmaster_child_launch() function that deals with the
differences in EXEC_BACKEND mode.
Refactor the mechanism of passing information from the parent to child
process. Instead of using different command-line arguments when
launching the child process in EXEC_BACKEND mode, pass a
variable-length blob of startup data along with all the global
variables. The contents of that blob depend on the kind of child
process being launched. In !EXEC_BACKEND mode, we use the same blob,
but it's simply inherited from the parent to child process.
Reviewed-by: Tristan Partin, Andres Freund
Discussion: https://www.postgresql.org/message-id/7a59b073-5b5b-151e-7ed3-8b01ff7ce9ef@iki.fi
This just moves the functions, with no other changes, to make the next
commits smaller and easier to review. The moved functions are related
to launching postmaster child processes in EXEC_BACKEND mode.
Reviewed-by: Tristan Partin, Andres Freund
Discussion: https://www.postgresql.org/message-id/7a59b073-5b5b-151e-7ed3-8b01ff7ce9ef@iki.fi
Allocate memory for the error message inside memory owned by the
JsonLexContext and move responsibility away from the caller for
freeing it. This means that we can partially revert b44669b2ca
as this is now safe to use in FRONTEND code. The motivation for
this comes from the OAuth and incremental JSON patchsets but it
also adds value on its own.
Author: Jacob Champion <jacob.champion@enterprisedb.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CAOYmi+mWdTd6ujtyF7MsvXvk7ToLRVG_tYAcaGbQLvf=N4KrQw@mail.gmail.com
This allows a RETURNING clause to be appended to a MERGE query, to
return values based on each row inserted, updated, or deleted. As with
plain INSERT, UPDATE, and DELETE commands, the returned values are
based on the new contents of the target table for INSERT and UPDATE
actions, and on its old contents for DELETE actions. Values from the
source relation may also be returned.
As with INSERT/UPDATE/DELETE, the output of MERGE ... RETURNING may be
used as the source relation for other operations such as WITH queries
and COPY commands.
Additionally, a special function merge_action() is provided, which
returns 'INSERT', 'UPDATE', or 'DELETE', depending on the action
executed for each row. The merge_action() function can be used
anywhere in the RETURNING list, including in arbitrary expressions and
subqueries, but it is an error to use it anywhere outside of a MERGE
query's RETURNING list.
Dean Rasheed, reviewed by Isaac Morland, Vik Fearing, Alvaro Herrera,
Gurjeet Singh, Jian He, Jeff Davis, Merlin Moncure, Peter Eisentraut,
and Wolfgang Walther.
Discussion: http://postgr.es/m/CAEZATCWePEGQR5LBn-vD6SfeLZafzEm2Qy_L_Oky2=qw2w3Pzg@mail.gmail.com
This allows setting attstattarget when a relation is created.
We make use of this by having index_concurrently_create_copy() copy
over the attstattarget values when the new index is created, instead
of having index_concurrently_swap() fix it up later.
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: https://www.postgresql.org/message-id/flat/4da8d211-d54d-44b9-9847-f2a9f1184c76@eisentraut.org
DDL code uses tuple descriptors to pass around pg_attribute values
during table and index creation. But tuple descriptors don't include
the variable-length/nullable columns of pg_attribute, so they have to
be handled separately. Right now, the attoptions field is handled in
a one-off way with a separate argument passed to
InsertPgAttributeTuples(). The other affected fields of pg_attribute
are right now not needed at relation creation time.
The goal of this patch is to generalize this to allow handling
additional variable-length/nullable columns of pg_attribute in a
similar manner. For that, create a new struct
FormExtraData_pg_attribute, which is to be passed around in parallel
to the tuple descriptor and optionally supplies the additional
columns. Right now, this struct only contains one field for
attoptions, so no functionality is actually changed by this.
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: https://www.postgresql.org/message-id/flat/4da8d211-d54d-44b9-9847-f2a9f1184c76@eisentraut.org
This introduces a new function equalRowTypes() that is effectively a
subset of equalTupleDescs() but only compares the number of attributes
and attribute name, type, typmod, and collation. This is enough for
most existing uses of equalTupleDescs(), which are changed to use the
new function. The only remaining callers of equalTupleDescs() are
those that really want to check the full tuple descriptor as such,
without concern about record or row or record type semantics.
The existing function hashTupleDesc() is renamed to hashRowType(),
because it now corresponds more to equalRowTypes().
The purpose of this change is to be clearer about the semantics of the
equality asked for by each caller. (At least one caller had a comment
that questioned whether equalTupleDescs() was too restrictive.) For
example, 4f622503d6 removed attstattarget from the tuple descriptor
structure. It was not fully clear at the time how this should affect
equalTupleDescs(). Now the answer is clear: By their own definitions,
equalRowTypes() does not care, and equalTupleDescs() just compares
whatever is in the tuple descriptor but does not care why it is in
there.
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Discussion: https://www.postgresql.org/message-id/flat/f656d6d9-6660-4518-a006-2f65cafbebd1%40eisentraut.org
destroyStringInfo() is a counterpart to makeStringInfo(), freeing a
palloc'd StringInfo and its data. This is a convenience function to
align the StringInfo API with the PQExpBuffer API. Originally added
in the OAuth patchset, it was extracted and committed separately in
order to aid upcoming JSON work.
Author: Daniel Gustafsson <daniel@yesql.se>
Author: Jacob Champion <jacob.champion@enterprisedb.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CAOYmi+mWdTd6ujtyF7MsvXvk7ToLRVG_tYAcaGbQLvf=N4KrQw@mail.gmail.com
This function returns the chunk_id of an on-disk TOASTed value. If
the value is un-TOASTed or not on-disk, it returns NULL. This is
useful for identifying which values are actually TOASTed and for
investigating "unexpected chunk number" errors.
Bumps catversion.
Author: Yugo Nagata
Reviewed-by: Jian He
Discussion: https://postgr.es/m/20230329105507.d764497456eeac1ca491b5bd%40sraoss.co.jp
The parallel query infrastructure copies the leader backend's active
snapshot to the worker processes. But BitmapHeapScan node also had
bespoken code to pass the snapshot from leader to the worker. That was
redundant, so remove it.
The removed code was analogous to the snapshot serialization in
table_parallelscan_initialize(), but that was the wrong role model. A
parallel bitmap heap scan is more like an independent non-parallel
bitmap heap scan in each parallel worker as far as the table AM is
concerned, because the coordination is done in nodeBitmapHeapscan.c,
and the table AM doesn't need to know anything about it.
This relies on the assumption that es_snapshot ==
GetActiveSnapshot(). That's not a new assumption, things would get
weird if you used the QueryDesc's snapshot for visibility checks in
the scans, but the active snapshot for evaluating quals, for
example. This could use some refactoring and cleanup, but for now,
just add some assertions.
Reviewed-by: Dilip Kumar, Robert Haas
Discussion: https://www.postgresql.org/message-id/5f3b9d59-0f43-419d-80ca-6d04c07cf61a@iki.fi
We don't determine the position at which a process waiting for a lock
should insert itself into the wait queue until we reach ProcSleep(),
and we may at that point discover that we must insert ourselves ahead
of everyone who wants a conflicting lock, in which case we obtain the
lock immediately. Up until now, a no-wait lock acquisition would fail
in such cases, erroneously claiming that the lock couldn't be obtained
immediately. Fix that by trying ProcSleep even in the no-wait case.
No back-patch for now, because I'm treating this as an improvement to
the existing no-wait feature. It could instead be argued that it's a
bug fix, on the theory that there should never be any case whatsoever
where no-wait fails to obtain a lock that would have been obtained
immediately without no-wait, but I'm reluctant to interpret the
semantics of no-wait that strictly.
Robert Haas and Jingxian Li
Discussion: http://postgr.es/m/CA+TgmobCH-kMXGVpb0BB-iNMdtcNkTvcZ4JBxDJows3kYM+GDg@mail.gmail.com
Currently, pg_visibility computes its xid horizon using the
GetOldestNonRemovableTransactionId(). The problem is that this horizon can
sometimes go backward. That can lead to reporting false errors.
In order to fix that, this commit implements a new function
GetStrictOldestNonRemovableTransactionId(). This function computes the xid
horizon, which would be guaranteed to be newer or equal to any xid horizon
computed before.
We have to do the following to achieve this.
1. Ignore processes xmin's, because they consider connection to other databases
that were ignored before.
2. Ignore KnownAssignedXids, because they are not database-aware. At the same
time, the primary could compute its horizons database-aware.
3. Ignore walsender xmin, because it could go backward if some replication
connections don't use replication slots.
As a result, we're using only currently running xids to compute the horizon.
Surely these would significantly sacrifice accuracy. But we have to do so to
avoid reporting false errors.
Inspired by earlier patch by Daniel Shelepanov and the following discussion
with Robert Haas and Tom Lane.
Discussion: https://postgr.es/m/1649062270.289865713%40f403.i.mail.ru
Reviewed-by: Alexander Lakhin, Dmitry Koval
New provider for collations, like "libc" or "icu", but without any
external dependency.
Initially, the only locale supported by the builtin provider is "C",
which is identical to the libc provider's "C" locale. The libc
provider's "C" locale has always been treated as a special case that
uses an internal implementation, without using libc at all -- so the
new builtin provider uses the same implementation.
The builtin provider's locale is independent of the server environment
variables LC_COLLATE and LC_CTYPE. Using the builtin provider, the
database collation locale can be "C" while LC_COLLATE and LC_CTYPE are
set to "en_US", which is impossible with the libc provider.
By offering a new builtin provider, it clarifies that the semantics of
a collation using this provider will never depend on libc, and makes
it easier to document the behavior.
Discussion: https://postgr.es/m/ab925f69-5f9d-f85e-b87c-bd2a44798659@joeconway.com
Discussion: https://postgr.es/m/dd9261f4-7a98-4565-93ec-336c1c110d90@manitou-mail.org
Discussion: https://postgr.es/m/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.camel%40j-davis.com
Reviewed-by: Daniel Vérité, Peter Eisentraut, Jeremy Schneider
With the makefile rules, the output of genbki.pl was written to
src/backend/catalog/, and then the header files were linked to
src/include/catalog/.
This changes it so that the output files are written directly to
src/include/catalog/. This makes the logic simpler, and it also makes
the behavior consistent with the meson build system. Also, the list
of catalog files is now kept in parallel in
src/include/catalog/{meson.build,Makefile}, while before the makefiles
had it in src/backend/catalog/Makefile.
Reviewed-by: Andreas Karlsson <andreas@proxel.se>
Discussion: https://www.postgresql.org/message-id/flat/21b74bdc-183d-4dd5-9c27-9378d178f459@eisentraut.org
pg_stat_checkpointer contains statistics for checkpoints and restartpoints.
Before 12915a58ee documentation said only about checkpoints implying that
restartpoint is the variation of checkpoint. 12915a58ee introduced
new separate statistics fields for restartpoints. This commit explicitly
documents fields that are relevant for both checkpoints and restartpoints.
Reported-by: Magnus Hagander
Discussion: https://postgr.es/m/CABUevExav5-SR0x%2BG9kBUMV0G8XsvSUfuyyqmYBBJi6VHns6sw%40mail.gmail.com
Reviewed-by: Anton A. Melnikov
Roles with MAINTAIN on a relation may run VACUUM, ANALYZE, REINDEX,
REFRESH MATERIALIZE VIEW, CLUSTER, and LOCK TABLE on the relation.
Roles with privileges of pg_maintain may run those same commands on
all relations.
This was previously committed for v16, but it was reverted in
commit 151c22deee due to concerns about search_path tricks that
could be used to escalate privileges to the table owner. Commits
2af07e2f74, 59825d1639, and c7ea3f4229 resolved these concerns by
restricting search_path when running maintenance commands.
Bumps catversion.
Reviewed-by: Jeff Davis
Discussion: https://postgr.es/m/20240305161235.GA3478007%40nathanxps13
Before this patch, if you took a full backup on server A and then
tried to use the backup manifest to take an incremental backup on
server B, it wouldn't know that the manifest was from a different
server and so the incremental backup operation could potentially
complete without error. When you later tried to run pg_combinebackup,
you'd find out that your incremental backup was and always had been
invalid. That's poor timing, because nobody likes finding out about
backup problems only at restore time.
With this patch, you'll get an error when trying to take the (invalid)
incremental backup, which seems a lot nicer.
Amul Sul, revised by me. Review by Michael Paquier.
Discussion: http://postgr.es/m/CA+TgmoYLZzbSAMM3cAjV4Y+iCRZn-bR9H2+Mdz7NdaJFU1Zb5w@mail.gmail.com
This works just like get_controlfile(), but expects the path to the
control file rather than the path to the data directory that contains
the control file. This makes more sense in cases where the caller
has already constructed the path to the control file itself.
Amul Sul and Robert Haas, reviewed by Michael Paquier
There is a very ancient hack in check_sql_fn_retval that allows a
single SELECT targetlist entry of composite type to be taken as
supplying all the output columns of a function returning composite.
(This is grotty and fundamentally ambiguous, but it's really hard
to do nested composite-returning functions without it.)
As far as I know, that doesn't cause any problems in ordinary
functions. It's disastrous for procedures however. All procedures
that have any output parameters are labeled with prorettype RECORD,
and the CALL code expects it will get back a record with one column
per output parameter, regardless of whether any of those parameters
is composite. Doing something else leads to an assertion failure
or core dump.
This is simple enough to fix: we just need to not apply that rule
when considering procedures. However, that requires adding another
argument to check_sql_fn_retval, which at least in principle might be
getting called by external callers. Therefore, in the back branches
convert check_sql_fn_retval into an ABI-preserving wrapper around a
new function check_sql_fn_retval_ext.
Per report from Yahor Yuzefovich. This has been broken since we
implemented procedures, so back-patch to all supported branches.
Discussion: https://postgr.es/m/CABz5gWHSjj2df6uG0NRiDhZ_Uz=Y8t0FJP-_SVSsRsnrQT76Gg@mail.gmail.com
In postmaster, use a more lightweight ClientSocket struct that
encapsulates just the socket itself and the remote endpoint's address
that you get from accept() call. ClientSocket is passed to the child
process, which initializes the bigger Port struct. This makes it more
clear what information postmaster initializes, and what is left to the
child process.
Rename the StreamServerPort and StreamConnection functions to make it
more clear what they do. Remove StreamClose, replacing it with plain
closesocket() calls.
Reviewed-by: Tristan Partin, Andres Freund
Discussion: https://www.postgresql.org/message-id/7a59b073-5b5b-151e-7ed3-8b01ff7ce9ef@iki.fi
We used to smuggle it to the child process in the Port struct, but it
seems better to pass it down as a separate argument. This paves the
way for the next commit, which moves the initialization of the Port
struct to the backend process, after forking.
Reviewed-by: Tristan Partin, Andres Freund
Discussion: https://www.postgresql.org/message-id/7a59b073-5b5b-151e-7ed3-8b01ff7ce9ef@iki.fi
There is a hook called ExplainOneQuery_hook that gives modules the
possibility to plug into this code path, but, like utility.c for utility
statement execution, there is no corresponding "standard" routine in
the case of EXPLAIN executed for one Query.
This commit adds a new standard_ExplainOneQuery() in explain.c, which is
able to run explain on a non-utility Query without calling its hook.
Per the feedback received from a couple of hackers, this change gives
the possibility to cut a few hundred lines of code in some of the
popular out-of-core modules as these maintained a copy of
ExplainOneQuery(), adding custom extra information at the beginning or
the end of the EXPLAIN output.
Author: Mats Kindahl
Reviewed-by: Aleksander Alekseev, Jelte Fennema-Nio, Andrei Lepikhov
Discussion: https://postgr.es/m/CA+14427V_B4EAoC_o-iYYucRdMSOTfpuH9k-QbexffY1HYJBiA@mail.gmail.com
Rename pg_collation.colliculocale to colllocale, and
pg_database.daticulocale to datlocale. These names reflects that the
fields will be useful for the upcoming builtin provider as well, not
just for ICU.
This is purely a rename; no changes to the meaning of the fields.
Discussion: https://postgr.es/m/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.camel%40j-davis.com
Reviewed-by: Peter Eisentraut
... and in particular don't return them as replica identity.
The motivation for this change is letting the primary keys be seen by
code that derives NOT NULL constraints from them, when creating
inheritance children; before this change, if you had a deferrable PK,
pg_dump would not recreate the attnotnull marking properly, because the
column would not be considered as having anything to back said marking
after dropping the throwaway NOT NULL constraint.
The reason we don't want these PKs as replica identities is that
replication can corrupt data, if the uniqueness constraint is
transiently broken.
Reported-by: Amul Sul <sulamul@gmail.com>
Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com>
Discussion: https://postgr.es/m/CAAJ_b94QonkgsbDXofakHDnORQNgafd1y3Oa5QXfpQNJyXyQ7A@mail.gmail.com
You might run out of stack space with recursion, which is not nice in
functions that might be used e.g. at cleanup after transaction
abort. MemoryContext contains pointer to parent and siblings, so we
can traverse a tree of contexts iteratively, without using
stack. Refactor the functions to do that.
MemoryContextStats() still recurses, but it now has a limit to how
deep it recurses. Once the limit is reached, it prints just a summary
of the rest of the hierarchy, similar to how it summarizes contexts
with lots of children. That seems good anyway, because a context dump
with hundreds of nested contexts isn't very readable.
Report by Egor Chindyaskin and Alexander Lakhin.
Discussion: https://postgr.es/m/1672760457.940462079%40f306.i.mail.ru
Author: Heikki Linnakangas
Reviewed-by: Robert Haas, Andres Freund, Alexander Korotkov, Tom Lane
This patch provides a way to ensure that physical standbys that are
potential failover candidates have received and flushed changes before
the primary server making them visible to subscribers. Doing so guarantees
that the promoted standby server is not lagging behind the subscribers
when a failover is necessary.
The logical walsender now guarantees that all local changes are sent and
flushed to the standby servers corresponding to the replication slots
specified in 'standby_slot_names' before sending those changes to the
subscriber.
Additionally, the SQL functions pg_logical_slot_get_changes,
pg_logical_slot_peek_changes and pg_replication_slot_advance are modified
to ensure that they process changes for failover slots only after physical
slots specified in 'standby_slot_names' have confirmed WAL receipt for those.
Author: Hou Zhijie and Shveta Malik
Reviewed-by: Masahiko Sawada, Peter Smith, Bertrand Drouvot, Ajin Cherian, Nisha Moond, Amit Kapila
Discussion: https://postgr.es/m/514f6f2f-6833-4539-39f1-96cd1e011f23@enterprisedb.com
Since b0483263dd, this field can be used to store an access method
name for ALTER TABLE, but access methods were not mentioned in the
field's description.
Issue noticed while working on the area.
Discussion: https://postgr.es/m/ZeWKgCtk6xiAsDsc@paquier.xyz
Implements Unicode simple case mapping, in which all code points map
to exactly one other code point unconditionally.
These tables are generated from UnicodeData.txt, which is already
being used by other infrastructure in src/common/unicode. The tables
are checked into the source tree, so they only need to be regenerated
when we update the Unicode version.
In preparation for the builtin collation provider, and possibly useful
for other callers.
Discussion: https://postgr.es/m/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.camel%40j-davis.com
Reviewed-by: Peter Eisentraut, Daniel Verite, Jeremy Schneider
The inh field of RangeTblEntry was doubly confusingly documented.
Some parts of the code insisted that it was only valid for
RTE_RELATION entries, other parts said the field was valid for all
entries. Neither was quite correct. More correctly, the field is
valid for RTE_RELATION entries but is also used in the planner for
RTE_SUBQUERY entries. So it makes more sense to group it with other
fields that are primarily for RTE_RELATION but borrowed by
RTE_SUBQUERY. (The exact position was chosen so that it is next to
relkind for better struct packing, and next to relid, since relid and
inh are sort of the input fields and the others are filled in later.)
Also add documentation for the planner's use at the struct definition.
Discussion: https://www.postgresql.org/message-id/6c1fbccc-85c8-40d3-b08b-4f47f2093711@eisentraut.org
This implements a radix tree data structure based on the design in
"The Adaptive Radix Tree: ARTful Indexing for Main-Memory Databases"
by Viktor Leis, Alfons Kemper, and ThomasNeumann, 2013. The main
technique that makes it adaptive is using several different node types,
each with a different capacity of elements, and a different algorithm
for accessing them. The nodes start small and grow/shrink as needed.
The main advantage over hash tables is efficient sorted iteration and
better memory locality when successive keys are lexicographically
close together. The implementation currently assumes 64-bit integer
keys, and traversing the tree is in general slower than a linear
probing hash table, so this is not a general-purpose associative array.
The paper describes two other techniques not implemented here,
namely "path compression" and "lazy expansion". These can further
reduce memory usage and speed up traversal, but the former would add
significant complexity and the latter requires storing the full key
with the value. We do trivially compress the path when leading bytes
of the key are zeros, however.
For value storage, we use "combined pointer/value slots", as
recommended in the paper. Values of size equal or smaller than the the
platform's pointer type are stored in the array of child pointers in
the last level node, while larger values are each stored in a separate
allocation. This is for now fixed at compile time, but it would be
fairly trivial to allow determining at runtime how variable-length
values are stored.
One innovation in our implementation compared to the ART paper is
decoupling the notion of node "size class" from "kind". The size
classes within a given node kind have the same underlying type, but
a variable capacity for children, so we can introduce additional node
sizes with little additional code.
To enable different use cases to specialize for different value types
and for shared/local memory, we use macro-templatized code generation
in the same manner as simplehash.h and sort_template.h.
Future commits will use this infrastructure for storing TIDs.
Patch by Masahiko Sawada and John Naylor, but a substantial amount of
credit is due to Andres Freund, whose proof-of-concept was a valuable
source of coding idioms and awareness of performance pitfalls, and
who reviewed earlier versions.
Discussion: https://postgr.es/m/CAD21AoAfOZvmfR0j8VmZorZjL7RhTiQdVttNuC4W-Shdc2a-AA%40mail.gmail.com
This reverts commit eae7be600b, following a discussion with Tom Lane,
due to concerns that this impacts the decisions made by the planner for
the number of workers spawned based on the inlining and const-folding of
index expressions and predicate for cases that would have worked until
this commit.
Discussion: https://postgr.es/m/162802.1709746091@sss.pgh.pa.us
Backpatch-through: 12
Provide functions to test for Unicode properties, such as Alphabetic
or Cased. These functions use tables derived from Unicode data files,
similar to the tables for Unicode normalization or general category,
and those tables can be updated with the 'update-unicode' build
target.
Use Unicode properties to provide functions to test for regex
character classes, like 'punct' or 'alnum'.
Infrastructure in preparation for a builtin collation provider, and
may also be useful for other callers.
Discussion: https://postgr.es/m/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.camel%40j-davis.com
Reviewed-by: Daniel Verite, Peter Eisentraut, Jeremy Schneider
The first argument of vshrq_n_s8 needs to be a signed vector type,
but it was passed unsigned. Clang is more lax with conversion, but
gcc needs a cast.
Fix by me, tested by Masahiko Sawada
Per buildfarm members splitfin, batta, widowbird, snakefly, parula,
massasauga
Discussion: https://postgr.es/m/20240306074106.mg6w4koohdlworbs%40alap3.anarazel.de
As coded, the planner logic that calculates the number of parallel
workers to use for a parallel index build uses expressions and
predicates from the relcache, which are flattened for the planner by
eval_const_expressions().
As reported in the bug, an immutable parallel-unsafe function flattened
in the relcache would become a Const, which would be considered as
parallel-safe, even if the predicate or the expressions including the
function are not safe in parallel workers. Depending on the expressions
or predicate used, this could cause the parallel build to fail.
Tests are included that check parallel index builds with parallel-unsafe
predicate and expressions. Two routines are added to lsyscache.h to be
able to retrieve expressions and predicate of an index from its pg_index
data.
Reported-by: Alexander Lakhin
Author: Tender Wang
Reviewed-by: Jian He, Michael Paquier
Discussion: https://postgr.es/m/CAHewXN=UaAaNn9ruHDH3Os8kxLVmtWqbssnf=dZN_s9=evHUFA@mail.gmail.com
Backpatch-through: 12
The copy_file_range() system call is available on at least Linux and
FreeBSD, and asks the kernel to use efficient ways to copy ranges of a
file. Options available to the kernel include sharing block ranges
(similar to --clone mode), and pushing down block copies to the storage
layer.
For automated testing, see PG_TEST_PG_UPGRADE_MODE. (Perhaps in a later
commit we could consider setting this mode for one of the CI targets.)
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/CA%2BhUKGKe7Hb0-UNih8VD5UNZy5-ojxFb3Pr3xSBBL8qj2M2%3DdQ%40mail.gmail.com
When perminfoindex was added, it was just added at the end of the
block. It would make sense to keep it closer to more related fields.
In passing, also add an inline comment, like the other fields have.
(Other field reorderings and documentation improvements in
RangeTblEntry are being discussed, but it's better not to mix them
together.)
Discussion: https://www.postgresql.org/message-id/flat/6c1fbccc-85c8-40d3-b08b-4f47f2093711%40eisentraut.org
pg_constraint.conwithoutoverlaps was recently added to support primary
keys and unique constraints with the WITHOUT OVERLAPS clause. An
upcoming patch provides the foreign-key side of this functionality,
but the syntax there is different and uses the keyword PERIOD. It
would make sense to use the same pg_constraint field for both of
these, but then we should pick a more general name that conveys "this
constraint has a temporal/period-related feature". conperiod works
for that and is nicely compact. Changing this now avoids possibly
having to introduce versioning into clients. Note there are still
some "without overlaps" variables left, which deal specifically with
the parsing of the primary key/unique constraint feature.
Author: Paul A. Jungwirth <pj@illuminatedcomputing.com>
Discussion: https://www.postgresql.org/message-id/flat/CA+renyUApHgSZF9-nd-a0+OPGharLQLO=mDHcY4_qQ0+noCUVg@mail.gmail.com
Use GUC_ACTION_SAVE rather than GUC_ACTION_SET, necessary for working
with parallel query.
Now that the call requires more arguments, wrap the call in a new
function to avoid code duplication and offer a place for a comment.
Discussion: https://postgr.es/m/E1rhJpO-0027Wf-9L@gemulon.postgresql.org
Presently, if an archive module's check_configured_cb callback
returns false, a generic WARNING message is emitted, which
unfortunately provides no actionable details about the reason why
the module is not configured. This commit introduces a macro that
archive module authors can use to add a DETAIL line to this WARNING
message.
Co-authored-by: Tung Nguyen
Reviewed-by: Daniel Gustafsson, Álvaro Herrera
Discussion: https://postgr.es/m/4109578306242a7cd5661171647e11b2%40oss.nttdata.com
Auto-generated array types, multirange types, and relation rowtypes
are treated as dependent objects: they can't be dropped separately
from the base object, nor can they have their own ownership or
permissions. We previously felt that, for objects that are in an
extension, only the base object needs to be listed as an extension
member in pg_depend. While that's sufficient to prevent inappropriate
drops, it results in undesirable answers if someone asks whether a
dependent type belongs to the extension. It looks like the dependent
type is just some random separately-created object that happens to
depend on the base object. Notably, this results in postgres_fdw
concluding that expressions involving an array type are not shippable
to the remote server, even when the defining extension has been
whitelisted.
To fix, cause GenerateTypeDependencies to make extension dependencies
for dependent types as well as their base objects, and adjust
ExecAlterExtensionContentsStmt so that object addition and removal
operations recurse to dependent types. The latter change means that
pg_upgrade of a type-defining extension will end with the dependent
type(s) now also listed as extension members, even if they were
not that way in the source database. Normally we want pg_upgrade
to precisely reproduce the source extension's state, but it seems
desirable to make an exception here.
This is arguably a bug fix, but we can't back-patch it since it
causes changes in the expected contents of pg_depend. (Because
it does, I've bumped catversion, even though there's no change
in the immediate post-initdb catalog contents.)
Tom Lane and David Geier
Discussion: https://postgr.es/m/4a847c55-489f-4e8d-a664-fc6b1cbe306f@gmail.com
The adminpack extension was only used to support pgAdmin III, which
in turn was declared EOL many years ago. Removing the extension also
allows us to remove functions from core as well which were only used
to support old version of adminpack.
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://postgr.es/m/CALj2ACUmL5TraYBUBqDZBi1C+Re8_=SekqGYqYprj_W8wygQ8w@mail.gmail.com
The pid was originally used in error context of messages propagated
from parallel workers, but commit 292794f82b removed that. If the need
arises in the future, you can also get the pid with
"shm_mq_get_sender(pcxt->worker[i].error_mqh)->pid".
Remove IsBackgroundWorker, IsAutoVacuumLauncherProcess(),
IsAutoVacuumWorkerProcess(), and IsLogicalSlotSyncWorker() in favor of
new Am*Process() macros that use MyBackendType. For consistency with
the existing Am*Process() macros.
Reviewed-by: Andres Freund
Discussion: https://www.postgresql.org/message-id/f3ecd4cb-85ee-4e54-8278-5fabfb3a4ed0@iki.fi
Now that BackendId was just another index into the proc array, it was
redundant with the 0-based proc numbers used in other places. Replace
all usage of backend IDs with proc numbers.
The only place where the term "backend id" remains is in a few pgstat
functions that expose backend IDs at the SQL level. Those IDs are now
in fact 0-based ProcNumbers too, but the documentation still calls
them "backend ids". That term still seems appropriate to describe what
the numbers are, so I let it be.
One user-visible effect is that pg_temp_0 is now a valid temp schema
name, for backend with ProcNumber 0.
Reviewed-by: Andres Freund
Discussion: https://www.postgresql.org/message-id/8171f1aa-496f-46a6-afc3-c46fe7a9b407@iki.fi
Previously, backend ID was an index into the ProcState array, in the
shared cache invalidation manager (sinvaladt.c). The entry in the
ProcState array was reserved at backend startup by scanning the array
for a free entry, and that was also when the backend got its backend
ID. Things become slightly simpler if we redefine backend ID to be the
index into the PGPROC array, and directly use it also as an index to
the ProcState array. This uses a little more memory, as we reserve a
few extra slots in the ProcState array for aux processes that don't
need them, but the simplicity is worth it.
Aux processes now also have a backend ID. This simplifies the
reservation of BackendStatusArray and ProcSignal slots.
You can now convert a backend ID into an index into the PGPROC array
simply by subtracting 1. We still use 0-based "pgprocnos" in various
places, for indexes into the PGPROC array, but the only difference now
is that backend IDs start at 1 while pgprocnos start at 0. (The next
commmit will get rid of the term "backend ID" altogether and make
everything 0-based.)
There is still a 'backendId' field in PGPROC, now part of 'vxid' which
encapsulates the backend ID and local transaction ID together. It's
needed for prepared xacts. For regular backends, the backendId is
always equal to pgprocno + 1, but for prepared xact PGPROC entries,
it's the ID of the original backend that processed the transaction.
Reviewed-by: Andres Freund, Reid Thompson
Discussion: https://www.postgresql.org/message-id/8171f1aa-496f-46a6-afc3-c46fe7a9b407@iki.fi
In the past, FileRead() and FileWrite() used types based on the Unix
read() and write() functions from before C and POSIX standardization,
though not exactly (we had int for amount instead of unsigned). In
commit 2d4f1ba6 we changed to the appropriate standard C types, just
like the modern POSIX functions they wrap, but again not exactly: the
return type stayed as int. In theory, a ssize_t value could be returned
by the underlying call that is too large for an int.
That wasn't really a live bug, because we don't expect PostgreSQL code
to perform reads or writes of gigabytes, and OSes probably apply
internal caps smaller than that anyway. This change is done on the
principle that the return might as well follow the standard interfaces
consistently.
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/1672202.1703441340%40sss.pgh.pa.us
This commit switches pg_enc2gettext_tbl[] in encnames.c to use a
C99-designated initializer syntax.
pg_bind_textdomain_codeset() is simplified so as it is possible to do
a direct lookup at the gettext() array with a value of the enum pg_enc
rather than doing a loop through all its elements, as long as the
encoding value provided by GetDatabaseEncoding() is in the correct range
of supported encoding values. Note that PG_MULE_INTERNAL gains a value
in the array, pointing to NULL.
Author: Jelte Fennema-Nio
Discussion: https://postgr.es/m/CAGECzQT3caUbcCcszNewCCmMbCuyP7XNAm60J3ybd6PN5kH2Dw@mail.gmail.com
Writing correct code using atomic variables is often difficult due
to the memory barrier semantics (or lack thereof) of the underlying
operations. This commit introduces atomic read/write functions
with full barrier semantics to ease this cognitive load. For
example, some spinlocks protect a single value, and these new
functions make it easy to convert the value to an atomic variable
(thus eliminating the need for the spinlock) without modifying the
barrier semantics previously provided by the spinlock. Since these
functions may be less performant than the other atomic reads and
writes, they are not suitable for every use-case. However, using a
single atomic operation with full barrier semantics may be more
performant in cases where a separate explicit barrier would
otherwise be required.
The base implementations for these new functions are atomic
exchanges (for writes) and atomic fetch/adds with 0 (for reads).
These implementations can be overwritten with better architecture-
specific versions as they are discovered.
This commit leaves converting existing code to use these new
functions as a future exercise.
Reviewed-by: Andres Freund, Yong Li, Jeff Davis
Discussion: https://postgr.es/m/20231110205128.GB1315705%40nathanxps13
This allows the target relation of MERGE to be an auto-updatable or
trigger-updatable view, and includes support for WITH CHECK OPTION,
security barrier views, and security invoker views.
A trigger-updatable view must have INSTEAD OF triggers for every type
of action (INSERT, UPDATE, and DELETE) mentioned in the MERGE command.
An auto-updatable view must not have any INSTEAD OF triggers. Mixing
auto-update and trigger-update actions (i.e., having a partial set of
INSTEAD OF triggers) is not supported.
Rule-updatable views are also not supported, since there is no
rewriter support for non-SELECT rules with MERGE operations.
Dean Rasheed, reviewed by Jian He and Alvaro Herrera.
Discussion: https://postgr.es/m/CAEZATCVcB1g0nmxuEc-A+gGB0HnfcGQNGYH7gS=7rq0u0zOBXA@mail.gmail.com
This updates the following lookup arrays to use C99-designated
initializer syntax, indexed based on the enum pg_enc:
pg_enc2icu_tbl[]
pg_enc2name_tbl[]
pg_wchar_table[]
This is more readable, and removes problems with ordering mistakes as
this removes dependencies between the arrays and their lookup index in
the enum pg_enc. So, adding new encodings becomes easier, even if this
does not happen often.
Author: Jelte Fennema-Nio
Reviewed-by: Jian He, Japin Li
Discussion: https://postgr.es/m/CAGECzQT3caUbcCcszNewCCmMbCuyP7XNAm60J3ybd6PN5kH2Dw@mail.gmail.com
More precisely, what we do here is make the SLRU cache sizes
configurable with new GUCs, so that sites with high concurrency and big
ranges of transactions in flight (resp. multixacts/subtransactions) can
benefit from bigger caches. In order for this to work with good
performance, two additional changes are made:
1. the cache is divided in "banks" (to borrow terminology from CPU
caches), and algorithms such as eviction buffer search only affect
one specific bank. This forestalls the problem that linear searching
for a specific buffer across the whole cache takes too long: we only
have to search the specific bank, whose size is small. This work is
authored by Andrey Borodin.
2. Change the locking regime for the SLRU banks, so that each bank uses
a separate LWLock. This allows for increased scalability. This work
is authored by Dilip Kumar. (A part of this was previously committed as
d172b717c6f4.)
Special care is taken so that the algorithms that can potentially
traverse more than one bank release one bank's lock before acquiring the
next. This should happen rarely, but particularly clog.c's group commit
feature needed code adjustment to cope with this. I (Álvaro) also added
lots of comments to make sure the design is sound.
The new GUCs match the names introduced by bcdfa5f2e2 in the
pg_stat_slru view.
The default values for these parameters are similar to the previous
sizes of each SLRU. commit_ts, clog and subtrans accept value 0, which
means to adjust by dividing shared_buffers by 512 (so 2MB for every 1GB
of shared_buffers), with a cap of 8MB. (A new slru.c function
SimpleLruAutotuneBuffers() was added to support this.) The cap was
previously 1MB for clog, so for sites with more than 512MB of shared
memory the total memory used increases, which is likely a good tradeoff.
However, other SLRUs (notably multixact ones) retain smaller sizes and
don't support a configured value of 0. These values based on
shared_buffers may need to be revisited, but that's an easy change.
There was some resistance to adding these new GUCs: it would be better
to adjust to memory pressure automatically somehow, for example by
stealing memory from shared_buffers (where the caches can grow and
shrink naturally). However, doing that seems to be a much larger
project and one which has made virtually no progress in several years,
and because this is such a pain point for so many users, here we take
the pragmatic approach.
Author: Andrey Borodin <x4mmm@yandex-team.ru>
Author: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Amul Sul, Gilles Darold, Anastasia Lubennikova,
Ivan Lazarev, Robert Haas, Thomas Munro, Tomas Vondra,
Yura Sokolov, Васильев Дмитрий (Dmitry Vasiliev).
Discussion: https://postgr.es/m/2BEC2B3F-9B61-4C1D-9FB5-5FAB0F05EF86@yandex-team.ru
Discussion: https://postgr.es/m/CAFiTN-vzDvNz=ExGXz6gdyjtzGixKSqs0mKHMmaQ8sOSEFZ33A@mail.gmail.com
There isn't a lot of user demand for AIX support, we have a bunch of
hacks to work around AIX-specific compiler bugs and idiosyncrasies,
and no one has stepped up to the plate to properly maintain it.
Remove support for AIX to get rid of that maintenance overhead. It's
still supported for stable versions.
The acute issue that triggered this decision was that after commit
8af2565248, the AIX buildfarm members have been hitting this
assertion:
TRAP: failed Assert("(uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer)"), File: "md.c", Line: 472, PID: 2949728
Apperently the "pg_attribute_aligned(a)" attribute doesn't work on AIX
for values larger than PG_IO_ALIGN_SIZE, for a static const variable.
That could be worked around, but we decided to just drop the AIX support
instead.
Discussion: https://www.postgresql.org/message-id/20240224172345.32@rfd.leadboat.com
Reviewed-by: Andres Freund, Noah Misch, Thomas Munro
The new names are intended to match those in an upcoming patch that adds
a few GUCs to configure the SLRU buffer sizes.
Backwards compatibility concern: this changes the accepted names for
function pg_stat_slru_rest(). Since this function recognizes "any other
string" as a request to reset the entry for "other", this means that
calling it with the old names would silently reset "other" instead of
doing nothing or throwing an error.
Reviewed-by: Andrey M. Borodin <x4mmm@yandex-team.ru>
Discussion: https://postgr.es/m/202402261616.dlriae7b6emv@alvherre.pgsql
A recent commit added a copy_function member to the
dshash_parameters struct, but it missed updating a couple of
comments that refer to the function pointer members of this struct.
One of those comments also refers to a tranche_name member and non-
arg variants of the function pointer members, all of which were
either removed during development or removed shortly after dshash
table support was committed.
Oversights in commits 8c0d7bafad, d7694fc148, and 42a1de3013.
Discussion: https://postgr.es/m/20240227045213.GA2329190%40nathanxps13
object_classes[] provided unnecessary indirection between catalog OIDs
and the enum ObjectClass when calling add_object_address(). This array
has been originally introduced in 30ec31604d and was useful because not
all relation OIDs were compile-time constants back then, which has not
been the case for a long time now for all the elements of ObjectClass.
This commit removes object_classes[], switching to the catalog OIDs
when calling add_object_address(). This shaves some code while saving
in maintenance because it was necessary to maintain the enum ObjectClass
and the array in sync when adding new object types.
Reported-by: Jeff Davis
Author: Jelte Fennema-Nio
Reviewed-by: Jian He, Michael Paquier
Discussion: https://postgr.es/m/CAGECzQT3caUbcCcszNewCCmMbCuyP7XNAm60J3ybd6PN5kH2Dw@mail.gmail.com
Many modern compilers are able to optimize function calls to functions
where the parameters of the called function match a leading subset of
the calling function's parameters. If there are no instructions in the
calling function after the function is called, then the compiler is free
to avoid any stack frame setup and implement the function call as a
"jmp" rather than a "call". This is called sibling call optimization.
Here we adjust the memory allocation functions in mcxt.c to allow this
optimization. This requires moving some responsibility into the memory
context implementations themselves. It's now the responsibility of the
MemoryContext to check for malloc failures. This is good as it both
allows the sibling call optimization, but also because most small and
medium allocations won't call malloc and just allocate memory to an
existing block. That can't fail, so checking for NULLs in that case
isn't required.
Also, traditionally it's been the responsibility of palloc and the other
allocation functions in mcxt.c to check for invalid allocation size
requests. Here we also move the responsibility of checking that into the
MemoryContext. This isn't to allow the sibling call optimization, but
more because most of our allocators handle large allocations separately
and we can just add the size check when doing large allocations. We no
longer check this for non-large allocations at all.
To make checking the allocation request sizes and ERROR handling easier,
add some helper functions to mcxt.c for the allocators to use.
Author: Andres Freund
Reviewed-by: David Rowley
Discussion: https://postgr.es/m/20210719195950.gavgs6ujzmjfaiig@alap3.anarazel.de
Presently, string keys are not well-supported for dshash tables.
The dshash code always copies key_size bytes into new entries'
keys, and dshash.h only provides compare and hash functions that
forward to memcmp() and tag_hash(), both of which do not stop at
the first NUL. This means that callers must pad string keys so
that the data beyond the first NUL does not adversely affect the
results of copying, comparing, and hashing the keys.
To better support string keys in dshash tables, this commit does
a couple things:
* A new copy_function field is added to the dshash_parameters
struct. This function pointer specifies how the key should be
copied into new table entries. For example, we only want to copy
up to the first NUL byte for string keys. A dshash_memcpy()
helper function is provided and used for all existing in-tree
dshash tables without string keys.
* A set of helper functions for string keys are provided. These
helper functions forward to strcmp(), strcpy(), and
string_hash(), all of which ignore data beyond the first NUL.
This commit also adjusts the DSM registry's dshash table to use the
new helper functions for string keys.
Reviewed-by: Andy Fan
Discussion: https://postgr.es/m/20240119215941.GA1322079%40nathanxps13
Similarly to tables and indexes, these functions are able to open
relations with a sequence relkind, which is useful to make a distinction
with the other relation kinds. Previously, commands/sequence.c used a
mix of table_{close,open}() and relation_{close,open}() routines when
manipulating sequence relations, so this clarifies the code.
A direct effect of this change is to align the error messages produced
when attempting DDLs for sequences on relations with an unexpected
relkind, like a table or an index with ALTER SEQUENCE, providing an
extra error detail about the relkind of the relation used in the DDL
query.
Author: Michael Paquier
Reviewed-by: Tomas Vondra
Discussion: https://postgr.es/m/ZWlohtKAs0uVVpZ3@paquier.xyz
The new facility makes it easier to optimize bulk loading, as the
logic for buffering, WAL-logging, and syncing the relation only needs
to be implemented once. It's also less error-prone: We have had a
number of bugs in how a relation is fsync'd - or not - at the end of a
bulk loading operation. By centralizing that logic to one place, we
only need to write it correctly once.
The new facility is faster for small relations: Instead of of calling
smgrimmedsync(), we register the fsync to happen at next checkpoint,
which avoids the fsync latency. That can make a big difference if you
are e.g. restoring a schema-only dump with lots of relations.
It is also slightly more efficient with large relations, as the WAL
logging is performed multiple pages at a time. That avoids some WAL
header overhead. The sorted GiST index build did that already, this
moves the buffering to the new facility.
The changes to pageinspect GiST test needs an explanation: Before this
patch, the sorted GiST index build set the LSN on every page to the
special GistBuildLSN value, not the LSN of the WAL record, even though
they were WAL-logged. There was no particular need for it, it just
happened naturally when we wrote out the pages before WAL-logging
them. Now we WAL-log the pages first, like in B-tree build, so the
pages are stamped with the record's real LSN. When the build is not
WAL-logged, we still use GistBuildLSN. To make the test output
predictable, use an unlogged index.
Reviewed-by: Andres Freund
Discussion: https://www.postgresql.org/message-id/30e8f366-58b3-b239-c521-422122dd5150%40iki.fi
By enabling slot synchronization, all the failover logical replication
slots on the primary (assuming configurations are appropriate) are
automatically created on the physical standbys and are synced
periodically. The slot sync worker on the standby server pings the primary
server at regular intervals to get the necessary failover logical slots
information and create/update the slots locally. The slots that no longer
require synchronization are automatically dropped by the worker.
The nap time of the worker is tuned according to the activity on the
primary. The slot sync worker waits for some time before the next
synchronization, with the duration varying based on whether any slots were
updated during the last cycle.
A new parameter sync_replication_slots enables or disables this new
process.
On promotion, the slot sync worker is shut down by the startup process to
drop any temporary slots acquired by the slot sync worker and to prevent
the worker from trying to fetch the failover slots.
A functionality to allow logical walsenders to wait for the physical will
be done in a subsequent commit.
Author: Shveta Malik, Hou Zhijie based on design inputs by Masahiko Sawada and Amit Kapila
Reviewed-by: Masahiko Sawada, Bertrand Drouvot, Peter Smith, Dilip Kumar, Ajin Cherian, Nisha Moond, Kuroda Hayato, Amit Kapila
Discussion: https://postgr.es/m/514f6f2f-6833-4539-39f1-96cd1e011f23@enterprisedb.com
This is part of an effort to reduce the number of special cases in the
automatically generated node support functions.
Allegedly, only certain fields of the Constraint node are valid based
on contype. But this has historically not been kept up to date in the
read/write functions. The Constraint node is only used for debugging
DDL statements, so there are no strong requirements for its output,
and there is no enforcement for its correctness. (There was no read
support before a6bc3301925.) Commits e7a552f303 and abf46ad9c7 are
examples of where omissions were fixed.
This patch just removes the custom read/write implementations for the
Constraint node type. Now we just output all the fields, which is a
bit more than before, but at least we don't have to maintain these
functions anymore. Also, we lose the string representation of the
contype field, but for this marginal use case that seems tolerable.
This patch also changes the documentation of the Constraint struct to
put less emphasis on grouping fields by constraint type but rather
document for each field how it's used.
Reviewed-by: Paul Jungwirth <pj@illuminatedcomputing.com>
Discussion: https://www.postgresql.org/message-id/flat/4b27fc50-8cd6-46f5-ab20-88dbaadca645@eisentraut.org
This commit switches the handling of the conflict cause strings for
replication slots to use a table rather than being explicitly listed,
using a C99-designated initializer syntax for the array elements. This
makes the whole more readable while easing future maintenance with less
areas to update when adding a new conflict reason.
This is similar to 74a7306310, but the scale of the change is smaller
as there are less conflict causes than LWLock builtin tranche names.
Author: Bharath Rupireddy
Reviewed-by: Jelte Fennema-Nio
Discussion: https://postgr.es/m/CALj2ACUxSLA91QGFrJsWNKs58KXb1C03mbuwKmzqqmoAKLwJaw@mail.gmail.com
Commit 6b80394781 introduced integer comparison functions designed
to be as efficient as possible while avoiding overflow. This
commit makes use of these functions in many of the in-tree qsort()
comparators to help ensure transitivity. Many of these comparator
functions should also see a small performance boost.
Author: Mats Kindahl
Reviewed-by: Andres Freund, Fabrízio de Royes Mello
Discussion: https://postgr.es/m/CA%2B14426g2Wa9QuUpmakwPxXFWG_1FaY0AsApkvcTBy-YfS6uaw%40mail.gmail.com
This commit adds integer comparison functions that are designed to
be as efficient as possible while avoiding overflow. A follow-up
commit will make use of these functions in many of the in-tree
qsort() comparators. The new functions are not better in all cases
(e.g., when the comparator function is inlined), so it is important
to consider the context before using them.
Author: Mats Kindahl
Reviewed-by: Tom Lane, Heikki Linnakangas, Andres Freund, Thomas Munro, Andrey Borodin, Fabrízio de Royes Mello
Discussion: https://postgr.es/m/CA%2B14426g2Wa9QuUpmakwPxXFWG_1FaY0AsApkvcTBy-YfS6uaw%40mail.gmail.com
A child table can specify a compression or storage method different
from its parents. This was previously an error. (But this was
inconsistently enforced because for example the settings could be
changed later using ALTER TABLE.) This now also allows an explicit
override if multiple parents have different compression or storage
settings, which was previously an error that could not be overridden.
The compression and storage properties remains unchanged in a child
inheriting from parent(s) after its creation, i.e., when using ALTER
TABLE ... INHERIT. (This is not changed.)
Before this change, the error detail would mention the first pair of
conflicting parent compression or storage methods. But with this
change it waits till the child specification is considered by which
time we may have encountered many such conflicting pairs. Hence the
error detail after this change does not include the conflicting
compression/storage methods. Those can be obtained from parent
definitions if necessary. The code to maintain list of all
conflicting methods or even the first conflicting pair does not seem
worth the convenience it offers. This change is inline with what we
do with conflicting default values.
Before this commit, the specified storage method could be stored in
ColumnDef::storage (CREATE TABLE ... LIKE) or ColumnDef::storage_name
(CREATE TABLE ...). This caused the MergeChildAttribute() and
MergeInheritedAttribute() to ignore a storage method specified in the
child definition since it looked only at ColumnDef::storage. This
commit removes ColumnDef::storage and instead uses
ColumnDef::storage_name to save any storage method specification. This
is similar to how compression method specification is handled.
Author: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/24656cec-d6ef-4d15-8b5b-e8dfc9c833a7@eisentraut.org
This commit adds timeout that is expected to be used as a prevention
of long-running queries. Any session within the transaction will be
terminated after spanning longer than this timeout.
However, this timeout is not applied to prepared transactions.
Only transactions with user connections are affected.
Discussion: https://postgr.es/m/CAAhFRxiQsRs2Eq5kCo9nXE3HTugsAAJdSQSmxncivebAxdmBjQ%40mail.gmail.com
Author: Andrey Borodin <amborodin@acm.org>
Author: Japin Li <japinli@hotmail.com>
Author: Junwang Zhao <zhjwpku@gmail.com>
Reviewed-by: Nikolay Samokhvalov <samokhvalov@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Reviewed-by: bt23nguyent <bt23nguyent@oss.nttdata.com>
Reviewed-by: Yuhang Qiu <iamqyh@gmail.com>
Thanks to commit 3b00fdba9f, this check in the SIGTERM handler for
the startup process is now obsolete and can be removed. Instead of
leaving around the dead function write_stderr_signal_safe(), I've
opted to just remove it for now.
This partially reverts commit 97550c0711.
Reviewed-by: Andres Freund, Noah Misch
Discussion: https://postgr.es/m/20231121212008.GA3742740%40nathanxps13
This commit introduces a new SQL function pg_sync_replication_slots()
which is used to synchronize the logical replication slots from the
primary server to the physical standby so that logical replication can be
resumed after a failover or planned switchover.
A new 'synced' flag is introduced in pg_replication_slots view, indicating
whether the slot has been synchronized from the primary server. On a
standby, synced slots cannot be dropped or consumed, and any attempt to
perform logical decoding on them will result in an error.
The logical replication slots on the primary can be synchronized to the
hot standby by using the 'failover' parameter of
pg-create-logical-replication-slot(), or by using the 'failover' option of
CREATE SUBSCRIPTION during slot creation, and then calling
pg_sync_replication_slots() on standby. For the synchronization to work,
it is mandatory to have a physical replication slot between the primary
and the standby aka 'primary_slot_name' should be configured on the
standby, and 'hot_standby_feedback' must be enabled on the standby. It is
also necessary to specify a valid 'dbname' in the 'primary_conninfo'.
If a logical slot is invalidated on the primary, then that slot on the
standby is also invalidated.
If a logical slot on the primary is valid but is invalidated on the
standby, then that slot is dropped but will be recreated on the standby in
the next pg_sync_replication_slots() call provided the slot still exists
on the primary server. It is okay to recreate such slots as long as these
are not consumable on standby (which is the case currently). This
situation may occur due to the following reasons:
- The 'max_slot_wal_keep_size' on the standby is insufficient to retain
WAL records from the restart_lsn of the slot.
- 'primary_slot_name' is temporarily reset to null and the physical slot
is removed.
The slot synchronization status on the standby can be monitored using the
'synced' column of pg_replication_slots view.
A functionality to automatically synchronize slots by a background worker
and allow logical walsenders to wait for the physical will be done in
subsequent commits.
Author: Hou Zhijie, Shveta Malik, Ajin Cherian based on an earlier version by Peter Eisentraut
Reviewed-by: Masahiko Sawada, Bertrand Drouvot, Peter Smith, Dilip Kumar, Nisha Moond, Kuroda Hayato, Amit Kapila
Discussion: https://postgr.es/m/514f6f2f-6833-4539-39f1-96cd1e011f23@enterprisedb.com
This reverts commit 95fb5b4902, for reasons similar to what led to
1aa8324b81. In this case, the callback was called once per row, which
is less worse than the previous callback introduced for COPY TO called
once per argument for each row, still the patch set discussed to plug in
custom routines to the COPY paths would be able to know which subroutine
to use depending on its CopyFromState, so this led to a suboptimal
approach at the end.
For now, this part is reverted to consider better which approach to use.
Discussion: https://postgr.es/m/20240206014125.qofww7ew3dx3v3uk@awork3.anarazel.de
If available, read directly from WAL buffers, avoiding the need to go
through the filesystem. Only for physical replication for now, but can
be expanded to other callers.
In preparation for replicating unflushed WAL data.
Author: Bharath Rupireddy
Discussion: https://postgr.es/m/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com
Reviewed-by: Andres Freund, Alvaro Herrera, Nathan Bossart, Dilip Kumar, Nitin Jadhav, Melih Mutlu, Kyotaro Horiguchi
Commit 5579388d removed code that supplied a fallback implementation of
getaddrinfo(), which was dead code on modern systems. One tiny piece of
the removed code was still doing something useful on Windows, though:
that OS's own gai_strerror()/gai_strerrorA() function returns a pointer
to a static buffer that it overwrites each time, so it's not
thread-safe. In rare circumstances, a multi-threaded client program
could get an incorrect or corrupted error message.
Restore the replacement gai_strerror() function, though now that it's
only for Windows we can put it into a win32-specific file and cut it
down to the errors that Windows documents. The error messages here are
taken from FreeBSD, because Windows' own messages seemed too verbose.
Back-patch to 16.
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGKz%2BF9d2PTiXwfYV7qJw%2BWg2jzACgSDgPizUw7UG%3Di58A%40mail.gmail.com
Commit 5b2f4afffe refactored find_other_exec() and in the process
created pipe_read_line() into a static routine for reading a single
line of output, aimed at reading version numbers. Commit a7e8ece41
later exposed it externally in order to read a postgresql.conf GUC
using "postgres -C ..". Further, f06b1c598 also made use of it for
reading a version string much like find_other_exec(). The internal
variable remained "pgver", even when used for other purposes.
Since the function requires passing a buffer and its size, and at
most size - 1 bytes will be read via fgets(), there is a truncation
risk when using this for reading GUCs (like how pg_rewind does,
though the risk in this case is marginal).
To keep this as generic functionality for reading a line from a pipe,
this refactors pipe_read_line() into returning an allocated buffer
containing all of the line to remove the risk of silent truncation.
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Discussion: https://postgr.es/m/DEDF73CE-D528-49A3-9089-B3592FD671A9@yesql.se
Various int variables were compared to macros that are of type size_t,
which caused -Wsign-compare warnings in cpluspluscheck. Change those
to size_t, which also better describes their purpose.
Per report from Peter Eisentraut
Discussion: https://postgr.es/m/486847dc-6de5-464a-938e-bac98ec2438b%40eisentraut.org
The new concurrency model proposed for slru.c to improve performance
does not include any single lock that would coordinate processes
doing concurrent reads/writes on SlruShared->latest_page_number.
We can instead use atomic reads and writes for that variable.
Author: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Andrey M. Borodin <x4mmm@yandex-team.ru>
Discussion: https://postgr.es/m/CAFiTN-vzDvNz=ExGXz6gdyjtzGixKSqs0mKHMmaQ8sOSEFZ33A@mail.gmail.com
The standalone functions fasthash{32,64} use length for two purposes:
how many bytes to hash, and how to perturb the internal seed.
Developers using the incremental interface may not know the length
ahead of time (e.g. for C strings). In this case, it's advised to
pass length to the finalizer, but initialization still needed some
length up front, in the form of a placeholder macro.
Separate the concerns by having the standalone functions perturb the
internal seed themselves from their own length parameter, allowing
to remove "len" from fasthash_init(), as well as the placeholder macro.
Discussion: https://postgr.es/m/CANWCAZbTUk2LOyhsFo33gjLyLAHZ7ucXCi5K9u%3D%2BPtnTShDKtw%40mail.gmail.com
This patch provides support for regular (non-replication) connections in
libpqrcv_connect(). This can be used to execute SQL statements on the
primary server without starting a walsender.
A new API libpqrcv_get_dbname_from_conninfo() is also added to extract the
database name from the given connection-info.
Note that this patch doesn't change any existing functionality but later
patches implementing the slot synchronization will use this functionality
to connect to the primary server to fetch required slot information.
Author: Shveta Malik, Hou Zhijie, Ajin Cherian
Reviewed-by: Peter Smith, Bertrand Drouvot, Dilip Kumar, Masahiko Sawada, Nisha Moond, Kuroda Hayato, Amit Kapila
Discussion: https://postgr.es/m/514f6f2f-6833-4539-39f1-96cd1e011f23@enterprisedb.com
CopyReadAttributes{CSV,Text}() are used to parse lines for text and CSV
format. This reduces the number of "if" branches that need to be
checked when parsing fields in CSV and text mode when dealing with a
COPY FROM, something that can become more noticeable with more
attributes and more lines to process.
Extracted from a larger patch by the same author.
Author: Sutou Kouhei
Discussion: https://postgr.es/m/20231204.153548.2126325458835528809.kou@clear-code.com
After calling smgropen(), it was not clear how long you could continue
to use the result, because various code paths including cache
invalidation could call smgrclose(), which freed the memory.
Guarantee that the object won't be destroyed until the end of the
current transaction, or in recovery, the commit/abort record that
destroys the underlying storage.
smgrclose() is now just an alias for smgrrelease(). It closes files
and forgets all state except the rlocator, but keeps the SMgrRelation
object valid.
A new smgrdestroy() function is used by rare places that know there
should be no other references to the SMgrRelation.
The short version:
* smgrclose() is now just an alias for smgrrelease(). It releases
resources, but doesn't destroy until EOX
* smgrdestroy() now frees memory, and should rarely be used.
Existing code should be unaffected, but it is now possible for code that
has an SMgrRelation object to use it repeatedly during a transaction as
long as the storage hasn't been physically dropped. Such code would
normally hold a lock on the relation.
This also replaces the "ownership" mechanism of SMgrRelations with a
pin counter. An SMgrRelation can now be "pinned", which prevents it
from being destroyed at end of transaction. There can be multiple pins
on the same SMgrRelation. In practice, the pin mechanism is only used
by the relcache, so there cannot be more than one pin on the same
SMgrRelation. Except with swap_relation_files XXX
Author: Thomas Munro, Heikki Linnakangas
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://www.postgresql.org/message-id/CA%2BhUKGJ8NTvqLHz6dqbQnt2c8XCki4r2QvXjBQcXpVwxTY_pvA@mail.gmail.com
This commit introduces a new subscription option named 'failover', which
provides users with the ability to set the failover property of the
replication slot on the publisher when creating or altering a
subscription.
This uses the replication commands introduced by commit 7329240437 to
enable the failover option for a logical replication slot.
If the failover option is set to true, the associated replication slots
(i.e. the main slot and the table sync slots) in the upstream database are
enabled to be synchronized to the standbys. Note that the capability to
sync the replication slots will be added in subsequent commits.
Thanks to Masahiko Sawada for the design inputs.
Author: Shveta Malik, Hou Zhijie, Ajin Cherian
Reviewed-by: Peter Smith, Bertrand Drouvot, Dilip Kumar, Masahiko Sawada, Nisha Moond, Kuroda Hayato, Amit Kapila
Discussion: https://postgr.es/m/514f6f2f-6833-4539-39f1-96cd1e011f23@enterprisedb.com
This function requires simd.h, which is a rather large dependency
for a widely-used header file like pg_wchar.h. Furthermore, there
is a report of a third-party tool that is struggling to use
pg_wchar.h due to its dependence on simd.h (presumably because
simd.h uses several intrinsics). Moving the function to the much
less popular ascii.h resolves these issues for now.
This commit is back-patched for the benefit of the aforementioned
third-party tool. The simd.h dependency was only added in v16,
but we've opted to back-patch to v15 so that is_valid_ascii() lives
in the same file for all versions where it exists. This could
break existing third-party code that uses the function, but we
couldn't find any examples of such code. It should be possible to
fix any code that this commit breaks by including ascii.h in the
file that uses is_valid_ascii().
Author: Jubilee Young
Reviewed-by: Tom Lane, John Naylor, Andres Freund, Eric Ridge
Discussion: https://postgr.es/m/CAPNHn3oKJJxMsYq%2BqLYzVJOFrUcOr4OF1EC-KtFT-qh8nOOOtQ%40mail.gmail.com
Backpatch-through: 15
This adds a new "Memory:" line under the "Planning:" group (which
currently only has "Buffers:") when the MEMORY option is specified.
In order to make the reporting reasonably accurate, we create a separate
memory context for planner activities, to be used only when this option
is given. The total amount of memory allocated by that context is
reported as "allocated"; we subtract memory in the context's freelists
from that and report that result as "used". We use
MemoryContextStatsInternal() to obtain the quantities.
The code structure to show buffer usage during planning was not in
amazing shape, so I (Álvaro) modified the patch a bit to clean that up
in passing.
Author: Ashutosh Bapat
Reviewed-by: David Rowley, Andrey Lepikhov, Jian He, Andy Fan
Discussion: https://www.postgresql.org/message-id/CAExHW5sZA=5LJ_ZPpRO-w09ck8z9p7eaYAqq3Ks9GDfhrxeWBw@mail.gmail.com
This commit implements a new replication command called
ALTER_REPLICATION_SLOT and a corresponding walreceiver API function named
walrcv_alter_slot. Additionally, the CREATE_REPLICATION_SLOT command has
been extended to support the failover option.
These new additions allow the modification of the failover property of a
replication slot on the publisher. A subsequent commit will make use of
these commands in subscription commands and will add the tests as well to
cover the functionality added/changed by this commit.
Author: Hou Zhijie, Shveta Malik
Reviewed-by: Peter Smith, Bertrand Drouvot, Dilip Kumar, Masahiko Sawada, Nisha Moond, Kuroda, Hayato, Amit Kapila
Discussion: https://postgr.es/m/514f6f2f-6833-4539-39f1-96cd1e011f23@enterprisedb.com
Since commit a4ccc1cef, the 'node' and 'alloc_tuple_size' fields of
the ReorderBufferTupleBuf structure are no longer used. This leaves
only the 'tuple' field in the structure. Since keeping a single-field
structure makes little sense, the ReorderBufferTupleBuf is removed
entirely. The code is refactored accordingly.
No back-patching since these are ABI changes in an exposed structure
and functions, and there would be some risk of breaking extensions.
Author: Aleksander Alekseev
Reviewed-by: Amit Kapila, Masahiko Sawada, Reid Thompson
Discussion: https://postgr.es/m/CAD21AoCvnuxiXXfRecp7g9+CeC35POQfhuQeJFr7_9u_Q5jc_Q@mail.gmail.com
Formerly, these were only supported in to_char(), but there seems
little reason for that restriction. We should at least have enough
support to permit round-tripping the output of to_char().
In that spirit, TZ accepts either zone abbreviations or numeric
(HH or HH:MM) offsets, which are the cases that to_char() can output.
In an ideal world we'd make it take full zone names too, but
that seems like it'd introduce an unreasonable amount of ambiguity,
since the rules for POSIX-spec zone names are so lax.
OF is a subset of this, accepting only HH or HH:MM.
One small benefit of this improvement is that we can simplify
jsonpath's executeDateTimeMethod function, which no longer needs
to consider the HH and HH:MM cases separately. Moreover, letting
it accept zone abbreviations means it will accept "Z" to mean UTC,
which is emitted by JSON.stringify() for example.
Patch by me, reviewed by Aleksander Alekseev and Daniel Gustafsson
Discussion: https://postgr.es/m/1681086.1686673242@sss.pgh.pa.us
This commit implements ithe jsonpath .bigint(), .boolean(),
.date(), .decimal([precision [, scale]]), .integer(), .number(),
.string(), .time(), .time_tz(), .timestamp(), and .timestamp_tz()
methods.
.bigint() converts the given JSON string or a numeric value to
the bigint type representation.
.boolean() converts the given JSON string, numeric, or boolean
value to the boolean type representation. In the numeric case, only
integers are allowed. We use the parse_bool() backend function
to convert a string to a bool.
.decimal([precision [, scale]]) converts the given JSON string
or a numeric value to the numeric type representation. If precision
and scale are provided for .decimal(), then it is converted to the
equivalent numeric typmod and applied to the numeric number.
.integer() and .number() convert the given JSON string or a
numeric value to the int4 and numeric type representation.
.string() uses the datatype's output function to convert numeric
and various date/time types to the string representation.
The JSON string representing a valid date/time is converted to the
specific date or time type representation using jsonpath .date(),
.time(), .time_tz(), .timestamp(), .timestamp_tz() methods. The
changes use the infrastructure of the .datetime() method and perform
the datatype conversion as appropriate. Unlike the .datetime()
method, none of these methods accept a format template and use ISO
DateTime format instead. However, except for .date(), the
date/time related methods take an optional precision to adjust the
fractional seconds.
Jeevan Chalke, reviewed by Peter Eisentraut and Andrew Dunstan.
This commit adds the failover property to the replication slot. The
failover property indicates whether the slot will be synced to the standby
servers, enabling the resumption of corresponding logical replication
after failover. But note that this commit does not yet include the
capability to sync the replication slot; the subsequent commits will add
that capability.
A new optional parameter 'failover' is added to the
pg_create_logical_replication_slot() function. We will also enable to set
'failover' option for slots via the subscription commands in the
subsequent commits.
The value of the 'failover' flag is displayed as part of
pg_replication_slots view.
Author: Hou Zhijie, Shveta Malik, Ajin Cherian
Reviewed-by: Peter Smith, Bertrand Drouvot, Dilip Kumar, Masahiko Sawada, Nisha Moond, Kuroda, Hayato, Amit Kapila
Discussion: https://postgr.es/m/514f6f2f-6833-4539-39f1-96cd1e011f23@enterprisedb.com
9e2d870119 enabled the COPY command to skip malformed data, however
there was no visibility into how many tuples were actually skipped
during the COPY FROM.
This commit adds a new "tuples_skipped" column to
pg_stat_progress_copy view to report the number of tuples that were
skipped because they contain malformed data.
Bump catalog version.
Author: Atsushi Torikoshi
Reviewed-by: Masahiko Sawada
Discussion: https://postgr.es/m/d12fd8c99adcae2744212cb23feff6ed%40oss.nttdata.com
Add WITHOUT OVERLAPS clause to PRIMARY KEY and UNIQUE constraints.
These are backed by GiST indexes instead of B-tree indexes, since they
are essentially exclusion constraints with = for the scalar parts of
the key and && for the temporal part.
Author: Paul A. Jungwirth <pj@illuminatedcomputing.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: jian he <jian.universality@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CA+renyUApHgSZF9-nd-a0+OPGharLQLO=mDHcY4_qQ0+noCUVg@mail.gmail.com
This adds a Node *escontext parameter to it and a bunch of functions
downstream to it, replacing any ereport()s in that path by either
errsave() or ereturn() as appropriate. This also adds code to those
functions where necessary to return early upon encountering a soft
error.
The changes here are mainly intended to suppress errors in the
functions of jsonfuncs.c. Functions in any external modules, such as
arrayfuncs.c, that those functions may in turn call are not changed
here based on the assumption that the various checks in jsonfuncs.c
functions should ensure that only values that are structurally valid
get passed to the functions in those external modules. An exception
is made for domain_check() to allow handling domain constraint
violation errors softly.
For testing, this adds a function jsonb_populate_record_valid(),
which returns true if jsonb_populate_record() would finish without
causing an error for the provided JSON object, false otherwise. Note
that jsonb_populate_record() internally calls populate_record(),
which in turn uses populate_record_field().
Extracted from a much larger patch to add SQL/JSON query functions.
Author: Nikita Glukhov <n.gluhov@postgrespro.ru>
Author: Teodor Sigaev <teodor@sigaev.ru>
Author: Oleg Bartunov <obartunov@gmail.com>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Andrew Dunstan <andrew@dunslane.net>
Author: Amit Langote <amitlangote09@gmail.com>
Reviewers have included (in no particular order) Andres Freund,
Alexander Korotkov, Pavel Stehule, Andrew Alsup, Erik Rijkers,
Zihong Yu, Himanshu Upadhyaya, Daniel Gustafsson, Justin Pryzby,
Álvaro Herrera, Jian He, Peter Eisentraut
Discussion: https://postgr.es/m/cd0bb935-0158-78a7-08b5-904886deac4b@postgrespro.ru
Discussion: https://postgr.es/m/20220616233130.rparivafipt6doj3@alap3.anarazel.de
Discussion: https://postgr.es/m/abd9b83b-aa66-f230-3d6d-734817f0995d%40postgresql.org
Discussion: https://postgr.es/m/CA+HiwqHROpf9e644D8BRqYvaAPmgBZVup-xKMDPk-nd4EpgzHw@mail.gmail.com
Discussion: https://postgr.es/m/CA+HiwqE4XTdfb1nW=Ojoy_tQSRhYt-q_kb6i5d4xcKyrLC1Nbg@mail.gmail.com
This adjusts the code for CoerceViaIO and CoerceToDomain expression
nodes to handle errors softly.
For CoerceViaIo, this adds a new ExprEvalStep opcode
EEOP_IOCOERCE_SAFE, which is implemented in the new accompanying
function ExecEvalCoerceViaIOSafe(). The only difference from
EEOP_IOCOERCE's inline implementation is that the input function
receives an ErrorSaveContext via the function's
FunctionCallInfo.context, which it can use to handle errors softly.
For CoerceToDomain, this simply entails replacing the ereport() in
ExecEvalConstraintNotNull() and ExecEvalConstraintCheck() by
errsave() passing it the ErrorSaveContext passed in the expression's
ExprEvalStep.
In both cases, the ErrorSaveContext to be used is passed by setting
ExprState.escontext to point to it before calling ExecInitExprRec()
on the expression tree whose errors are to be handled softly.
Note that there's no functional change as of this commit as no call
site of ExecInitExprRec() has been changed. This is intended for
implementing new SQL/JSON expression nodes in future commits.
Extracted from a much larger patch to add SQL/JSON query functions.
Author: Nikita Glukhov <n.gluhov@postgrespro.ru>
Author: Teodor Sigaev <teodor@sigaev.ru>
Author: Oleg Bartunov <obartunov@gmail.com>
Author: Alexander Korotkov <aekorotkov@gmail.com>
Author: Andrew Dunstan <andrew@dunslane.net>
Author: Amit Langote <amitlangote09@gmail.com>
Reviewers have included (in no particular order) Andres Freund,
Alexander Korotkov, Pavel Stehule, Andrew Alsup, Erik Rijkers,
Zihong Yu, Himanshu Upadhyaya, Daniel Gustafsson, Justin Pryzby,
Álvaro Herrera, Jian He, Peter Eisentraut
Discussion: https://postgr.es/m/cd0bb935-0158-78a7-08b5-904886deac4b@postgrespro.ru
Discussion: https://postgr.es/m/20220616233130.rparivafipt6doj3@alap3.anarazel.de
Discussion: https://postgr.es/m/abd9b83b-aa66-f230-3d6d-734817f0995d%40postgresql.org
Discussion: https://postgr.es/m/CA+HiwqHROpf9e644D8BRqYvaAPmgBZVup-xKMDPk-nd4EpgzHw@mail.gmail.com
Discussion: https://postgr.es/m/CA+HiwqE4XTdfb1nW=Ojoy_tQSRhYt-q_kb6i5d4xcKyrLC1Nbg@mail.gmail.com