This adjusts cast functions of the geometry types to support soft
errors. This requires refactoring of various helper functions to
support error contexts. Also make the float8 to float4 cast error
safe. It requires some of the same helper functions.
This is in preparation for a future feature where conversion errors in
casts can be caught.
(The function casting type circle to type polygon is not yet made error
safe, because it is an SQL language function.)
Author: jian he <jian.universality@gmail.com>
Reviewed-by: Amul Sul <sulamul@gmail.com>
Reviewed-by: Corey Huinker <corey.huinker@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CADkLM%3Dfv1JfY4Ufa-jcwwNbjQixNViskQ8jZu3Tz_p656i_4hQ%40mail.gmail.com
When a backend attempts to start a read IO and finds the first buffer already
has I/O in progress, previously it waited for that I/O to complete before
initiating reads for any of the subsequent buffers.
Although it must wait for the I/O to finish when acquiring the buffer, there's
no reason for it to wait when setting up the read operation. Waiting at this
point prevents starting I/O on subsequent buffers and can significantly reduce
concurrency.
This matters in two workloads:
1) When multiple backends scan the same relation concurrently.
2) When a single backend requests the same block multiple times within the
readahead distance.
Waiting each time an in-progress read is encountered effectively degenerates
the access pattern into synchronous I/O.
To fix this, when encountering an already in-progress IO for the head buffer,
the wait reference is now recorded and waiting is deferred until
WaitReadBuffers(), when the buffer actually needs to be acquired.
In rare cases, a backend may still need to wait synchronously at IO
start time: If another backend has set BM_IO_IN_PROGRESS on the buffer
but has not yet set the wait reference. Such windows should be brief and
uncommon.
Author: Melanie Plageman <melanieplageman@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/flat/zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw%403p3zu522yykv
Until now StartBufferIO() had a few weaknesses:
- As it did not submit staged IOs, it was not safe to call StartBufferIO()
where there was a potential for unsubmitted IO, which required
AsyncReadBuffers() to use a wrapper (ReadBuffersCanStartIO()) around
StartBufferIO().
- With nowait = true, the boolean return value did not allow to distinguish
between no IO being necessary and having to wait, which would lead
ReadBuffersCanStartIO() to unnecessarily submit staged IO.
- Several callers needed to handle both local and shared buffers, requiring
the caller to differentiate between StartBufferIO() and StartLocalBufferIO()
- In a future commit some callers of StartBufferIO() want the BufferDesc's
io_wref to be returned, to asynchronously wait for in-progress IO
- Indicating whether to wait with the nowait parameter was somewhat confusing
compared to a wait parameter
Address these issues as follows:
- StartBufferIO() is renamed to StartSharedBufferIO()
- A new StartBufferIO() is introduced that supports both shared and local
buffers
- The boolean return value has been replaced with an enum, indicating whether
the IO is already done, already in progress or that the buffer has been
readied for IO
- A new PgAioWaitRef * argument allows the caller to get the wait reference is
desired. All current callers pass NULL, a user of this will be introduced
subsequently
- Instead of the nowait argument there now is wait
This probably would not have been worthwhile on its own, but since all these
lines needed to be touched anyway...
Author: Andres Freund <andres@anarazel.de>
Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/zljergweqti7x67lg5ije2rzjusie37nslsnkjkkby4laqqbfw@3p3zu522yykv
After the series of preceding commits introducing and using
BufferBeginSetHintBits()/BufferSetHintBits16(), hint bits are not set anymore
while IO is going on. Therefore we do not need to copy pages while they are
being written out anymore.
For the same reason XLogSaveBufferForHint() now does not need to operate on a
copy of the page anymore, but can instead use the normal XLogRegisterBuffer()
mechanism. For that the assertions and comments to XLogRegisterBuffer() had to
be updated to allow share-exclusive locked buffers to be registered.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/5ubipyssiju5twkb7zgqwdr7q2vhpkpmuelxfpanetlk6ofnop@hvxb4g2amb2d
Autovacuum workers scan pg_class twice to collect the set of tables
to process. The first pass is for plain relations and materialized
views, and the second is for TOAST tables. When the worker finds a
table to process, it adds it to the end of a list. Later on, it
processes the tables in the same order as the list. This simple
strategy has worked surprisingly well for a long time, but there
have been many discussions over the years about trying to improve
it.
This commit introduces a scoring system that is used to sort the
aforementioned list of tables to process. The idea is to have
autovacuum workers prioritize tables that are furthest beyond their
thresholds (e.g., a table nearing transaction ID wraparound should
be vacuumed first). This prioritization scheme is certainly far
from perfect; there are simply too many possibilities for any
scoring technique to work across all workloads, and the situation
might change significantly between the time we calculate the score
and the time that autovacuum processes it. However, we have
attemped to develop something that is expected to work for a large
portion of workloads with reasonable parameter settings.
The score is calculated as the maximum of the ratios of each of the
table's relevant values to its threshold. For example, if the
number of inserted tuples is 100, and the insert threshold for the
table is 80, the insert score is 1.25. If all other scores are
below that value, the table's score will be 1.25. The other
criteria considered for the score are the table ages (both
relfrozenxid and relminmxid) compared to the corresponding
freeze-max-age setting, the number of update/deleted tuples
compared to the vacuum threshold, and the number of
inserted/updated/deleted tuples compared to the analyze threshold.
Once exception to the previous paragraph is for tables nearing
wraparound, i.e., those that have surpassed the effective failsafe
ages. In that case, the relfrozenxid/relminmxid-based score is
scaled aggressively so that the table has a decent chance of
sorting to the front of the list.
To adjust how strongly each component contributes to the score, the
following parameters can be adjusted from their default of 1.0 to
anywhere between 0.0 and 10.0 (inclusive). Setting all of these to
0.0 restores pre-v19 prioritization behavior:
autovacuum_freeze_score_weight
autovacuum_multixact_freeze_score_weight
autovacuum_vacuum_score_weight
autovacuum_vacuum_insert_score_weight
autovacuum_analyze_score_weight
This is intended to be a baby step towards smarter autovacuum
workers. Possible future improvements include, but are not limited
to, periodic reprioritization, automatic cost limit adjustments,
and better observability (e.g., a system view that shows current
scores). While we do not expect this commit to produce any
earth-shattering improvements, it is arguably a prerequisite for
the aforementioned follow-up changes.
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: wenhui qiu <qiuwenhuifx@gmail.com>
Reviewed-by: Greg Burd <greg@burd.me>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://postgr.es/m/aOaAuXREwnPZVISO%40nathan
Previously we reused the shmem allocator's ShmemLock to also protect
lwlock.c's shared memory structures. Introduce a separate spinlock for
lwlock.c for the sake of modularity. Now that lwlock.c has its own
shared memory struct (LWLockTranches), this is easy to do.
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://www.postgresql.org/message-id/47aaf57e-1b7b-4e12-bda2-0316081ff50e@iki.fi
Merge the LWLockTranches and NamedLWLockTrancheRequest data structures
in shared memory into one array of user-defined tranches. The
NamedLWLockTrancheRequest list is now only used in postmaster, to hold
the requests until shared memory is initialized.
Introduce a C struct, LWLockTranches, to hold all the different fields
kept in shared memory. This gives an easier overview of what are all
the things kept in shared memory. Previously, we had separate pointers
for LWLockTrancheNames, LWLockCounter and the (shared memory copy of)
NamedLWLockTrancheRequestArray.
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Discussion: https://www.postgresql.org/message-id/47aaf57e-1b7b-4e12-bda2-0316081ff50e@iki.fi
Typically, we have only one PlannerInfo for any given subquery, but
when we are considering a MinMaxAggPath or a hashed subplan, we end
up creating a second PlannerInfo for the same portion of the query,
with a clone of the original range table. In fact, in the MinMaxAggPath
case, we might end up creating several clones, one per aggregate.
At present, there's no easy way for a plugin, such as pg_plan_advice,
to understand the relationships between the original range table and
the copies of it that are created in these cases. To fix, add an
alternative_plan_name field to PlannerInfo. For a hashed subplan, this
is the plan name for the non-hashed alternative; for minmax aggregates,
this is the plan_name from the parent PlannerInfo; otherwise, it's the
same as plan_name.
Discussion: http://postgr.es/m/CA+TgmoYuWmN-00Ec5pY7zAcpSFQUQLbgAdVWGR9kOR-HM-fHrA@mail.gmail.com
Reviewed-by: Lukas Fittl <lukas@fittl.com>
When my commit e222534679 added the
concept of disabled_nodes, it failed to add a disabled_nodes field
to SubPlan. This is a regression: before that commit, when
fix_alternative_subplan compared the costs of two plans, the number
of disabled nodes affected the result, because it was just a
component of the total cost. After that commit, it no longer did,
making it possible for a disabled path to win on cost over one that
is not disabled. Fix that.
As usual for planner fixes that might destabilize plan choices,
no back-patch.
Discussion: https://postgr.es/m/CA+TgmoaK=4w7-qknUo3QhUJ53pXZq=c=KgZmRyD+k7ytqfmgSg@mail.gmail.com
Reviewed-by: Lukas Fittl <lukas@fittl.com>
Previously, when the startup process applied WAL and requested walreceiver
to send an apply notification to the primary, walreceiver sent a status reply
unconditionally, even if the WAL locations had not advanced since
the previous update.
As a result, the standby could send two consecutive status reply messages
with identical WAL locations even though wal_receiver_status_interval had
not yet elapsed. This could unexpectedly reset the reported replication lag,
making it difficult for users to monitor lag. The second message was also
unnecessary because it reported no progress.
This commit updates walreceiver to send a reply only when the apply location
has advanced since the last status update, even when the startup process
requests a notification.
Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Shinya Kato <shinya11.kato@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAOzEurTzcUrEzrH97DD7+Yz=HGPU81kzWQonKZvqBwYhx2G9_A@mail.gmail.com
Compound literals, as used in pg_list.h for list_makeN(), are not a
C++ feature. MSVC doesn't accept these. (GCC and Clang accept them,
but they would warn in -pedantic mode.) Replace with equivalent
inline functions. (These are the only instances of compound literals
used in PostgreSQL header files.)
Author: Jelte Fennema-Nio <postgres@jeltef.nl>
Discussion: https://www.postgresql.org/message-id/flat/CAGECzQR21OnnKiZO_1rLWO0-16kg1JBxnVq-wymYW0-_1cUNtg%40mail.gmail.com
Constructing a Subcription object uses a number of small or temporary
allocations. Use a per-object memory context for easy cleanup.
Get rid of FreeSubscription() which did not free all the allocations
anyway. Also get rid of the PG_TRY()/PG_CATCH() logic in
ForeignServerConnectionString() which were used to avoid leaks during
GetSubscription().
Co-authored-by: Álvaro Herrera <alvherre@kurilemu.de>
Suggested-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/xvdjrdqnpap3uq7owbaox3r7p5gf7sv62aaqf2ju3vb6yglatr%40kvvwhoudrlxq
Discussion: https://postgr.es/m/CAA4eK1K=WjZ1maBCmj=5ZdO66AwPORK5ZBxVKedS0xdCcb621A@mail.gmail.com
There are no remaining users that emit XLOG_HEAP2_VISIBLE records, so it
can be removed. This includes deleting the xl_heap_visible struct and
all functions responsible for emitting or replaying XLOG_HEAP2_VISIBLE
records.
Bumps XLOG_PAGE_MAGIC because we removed a WAL record type.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com
Vacuum no longer emits a separate WAL record for each page set
all-visible or all-frozen during phase I. Instead, visibility map
updates are now included in the XLOG_HEAP2_PRUNE_VACUUM_SCAN record that
is already emitted for pruning and freezing.
Previously, heap_page_prune_and_freeze() determined whether a page was
all-visible, but the corresponding VM bits were only set later in
lazy_scan_prune(). Now the VM is updated immediately in
heap_page_prune_and_freeze(), at the same time as the heap
modifications. This reduces WAL volume produced by vacuum.
For now, vacuum is still the only user of heap_page_prune_and_freeze()
allowed to set the VM. On-access pruning is not yet able to set the VM.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Earlier version Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com
During vacuum's first and third phases, we examine tuples' visibility to
determine if we can set the page all-visible in the visibility map.
Previously, this check compared tuple xmins against a single XID chosen
at the start of vacuum (OldestXmin). We now use GlobalVisState, which
enables future work to set the VM during on-access pruning, since
ordinary queries have access to GlobalVisState but not OldestXmin.
This also benefits vacuum: in some cases, GlobalVisState may advance
during a vacuum, allowing more pages to become considered all-visible.
And, in the future, we could easily add a heuristic to update
GlobalVisState more frequently during vacuums of large tables.
OldestXmin is still used for freezing and as a backstop to ensure we
don't freeze a dead tuple that wasn't yet prunable according to
GlobalVisState in the rare occurrences where GlobalVisState moves
backwards.
Because comparing a transaction ID against GlobalVisState is more
expensive than comparing against a single XID, we defer this check until
after scanning all tuples on the page. Therefore, we perform the
GlobalVisState check only once per page. This is safe because
visibility_cutoff_xid records the newest live xmin on the page; if it is
globally visible, then the entire page is all-visible.
Using GlobalVisState means on-access pruning can also maintain
visibility_cutoff_xid, which is required to set the visibility map
on-access in the future.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/flat/bqc4kh5midfn44gnjiqez3bjqv4zogydguvdn446riw45jcf3y%404ez66il7ebvk#c755ef151507aba58471ffaca607e493
The number of .c files that must include access/clog.h can currently be
counted on one's fingers and miss only one (assuming one has the usual
number of hands). However, due to indirect inclusion via proc.h,
there's a lot of files that are pointlessly including it. This is easy
to avoid with the easy trick implemented by this commit.
Author: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/202603221856.iwlhitt6dxxx@alvherre.pgsql
Since storage/locktags.h was added by commit 322bab7974, many headers
can be made leaner by depending on that instead of on storage/lock.h,
which has many other dependencies.
(In fact, some of these changes were possible even before that.)
Author: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/abvrRZo52Yx9ZzWQ@ip-10-97-1-34.eu-west-3.compute.internal
Previously, XLogFindNextRecord() did not return detailed error information
when it failed to find a valid WAL record. As a result, callers such as
the WAL summarizer, pg_waldump, and pg_walinspect could only report generic
errors (e.g., "could not find a valid record after ..."), making
troubleshooting difficult.
This commit fix the issue by extending XLogFindNextRecord() to return
detailed error information on failure, and updating its callers to include
those details in their error messages.
For example, when pg_waldump is run on a WAL file with an invalid magic number,
it now reports not only the generic error but also the specific cause
(e.g., "invalid magic number").
Author: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>
Reviewed-by: Mircea Cadariu <cadariu.mircea@gmail.com>
Reviewed-by: Japin Li <japinli@hotmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAO6_XqoxJXddcT4wkd9Xd+cD6Sz-fyspRGuV4Bq-wbXG4pVNzA@mail.gmail.com
The second argument to TupleDescAttr should always be at least zero
and less than natts; otherwise, we index outside of the attribute
array. Assert that this is the case.
Various violations, or possible violations, of this rule that are
currently in the tree are actually harmless, because while
we do call TupleDescAttr() before verifying that the argument is
within range, we don't actually dereference it unless the argument
was within range all along. Nonetheless, the Assert means we
should be more careful, so tidy up accordingly.
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: http://postgr.es/m/CA+TgmoacixUZVvi00hOjk_d9B4iYKswWP1gNqQ8Vfray-AcOCA@mail.gmail.com
This adjusts many C functions underlying casts to support soft errors.
This is in preparation for a future feature where conversion errors in
casts can be caught.
This patch covers cast functions that can be adjusted easily by
changing ereport to ereturn or making other light changes. The
underlying helper functions were already changed to support soft
errors some time ago as part of soft error support in type input
functions.
Other casts and types will require some more work and are being kept
as separate patches.
Author: jian he <jian.universality@gmail.com>
Reviewed-by: Amul Sul <sulamul@gmail.com>
Reviewed-by: Corey Huinker <corey.huinker@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CADkLM%3Dfv1JfY4Ufa-jcwwNbjQixNViskQ8jZu3Tz_p656i_4hQ%40mail.gmail.com
This commit adds a new stats kind, called PGSTAT_KIND_LOCK, implementing
statistics for lock tags, as reported by pg_locks. The implementation
is fixed-sized, as the data is caped based on the number of lock tags in
LockTagType.
The new statistics kind records the following fields, providing insight
regarding lock behavior, while avoiding impact on performance-critical
code paths (such as fast-path lock acquisition):
- waits and wait_time: respectively track the number of times a lock
required waiting and the total time spent acquiring it. These metrics
are only collected once a lock is successfully acquired and after
deadlock_timeout has been exceeded.
fastpath_exceeded: counts how often a lock could not be acquired via
the fast path due to the max_locks_per_transaction slot limits.
A new view called pg_stat_lock can be used to access this data, coupled
with a SQL function called pg_stat_get_lock().
Bump stat file format PGSTAT_FILE_FORMAT_ID.
Bump catalog version.
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/aIyNxBWFCybgBZBS%40ip-10-97-1-34.eu-west-3.compute.internal
Because of the SKIP_PAGES_THRESHOLD optimization or a stale prune XID,
heap_page_prune_and_freeze() can be invoked for pages with no pruning or
freezing work to do. To avoid this, if a page is already all-frozen or
it is all-visible and no freezing will be attempted, exit early. We
can't exit early if vacuum passed DISABLE_PAGE_SKIPPING, though.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/bqc4kh5midfn44gnjiqez3bjqv4zogydguvdn446riw45jcf3y%404ez66il7ebvk
Change the IndexScanInstrumentation fields in IndexScanState,
IndexOnlyScanState, and BitmapIndexScanState from inline structs to
pointers. This avoids additional space overhead whenever new fields are
added to IndexScanInstrumentation in the future, at least in the common
case where the instrumentation isn't used (i.e. when the executor node
isn't being run through an EXPLAIN ANALYZE).
Preparation for an upcoming patch series that will add index
prefetching. The new slot-based interface that will enable index
prefetching necessitates that we add at least one more field to
IndexScanInstrumentation (to count heap fetches during index-only
scans).
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAH2-Wz=g=JTSyDB4UtB5su2ZcvsS7VbP+ZMvvaG6ABoCb+s8Lw@mail.gmail.com
Move VM corruption detection and repair into heap page pruning. This
allows VM repair during on-access pruning, not only during vacuum.
Also, expand corruption detection to cover pages marked all-visible that
contain dead tuples and tuples inserted or deleted by in-progress
transactions, rather than only all-visible pages with LP_DEAD items.
Pinning the correct VM page before on-access pruning is cheap when
compared to the cost of actually pruning. The vmbuffer is saved in the
scan descriptor, so a query should only need to pin each VM page once,
and a single VM page covers a large number of heap pages.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/bqc4kh5midfn44gnjiqez3bjqv4zogydguvdn446riw45jcf3y%404ez66il7ebvk
This commit moves all the declarations related to locktags from lock.h
to a new header called locktag.h. This header is useful so as code
paths that care about locktags but not the lock hashtable can know about
these without having to include lock.h and all its set of dependencies.
This move includes the basic locktag structures and the set of macros to
fill in the locktag fields before attempting to acquire a lock.
Based on a suggestion from me, suggestion done while discussing a
different feature.
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/abufUya2oK-_PJ3E@paquier.xyz
Consolidate tar archive identification and compression-type detection
logic into a shared location. Currently used by pg_basebackup and
pg_verifybackup, this functionality is also required for upcoming
pg_waldump enhancements.
This change promotes code reuse and simplifies maintenance across
frontend tools.
Author: Amul Sul <sulamul@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
discussion: https://postgr.es/m/CAAJ_b94bqdWN3h2J-PzzzQ2Npbwct5ZQHggn_QoYGhC2rn-=WQ@mail.gmail.com
genericcostestimate() estimates the number of index leaf pages to
be visited as a pro-rata fraction of the total number of leaf pages.
Or at least that was the intention. What it actually used in the
calculation was the total number of index pages, so that non-leaf
pages were also counted. In a decent-sized index the error is
probably small, since we expect upper page fanout to be high.
But in a small index that's not true; in the worst case with one
data-bearing page plus a metapage, we had 100% relative error.
This led to surprising planning choices such as not using a small
partial index.
To fix, ask genericcostestimate's caller to supply an estimate of
the number of non-leaf pages, and subtract that. For the built-in
index AMs, it seems sufficient to count the index metapage (if the
AM uses one) as non-leaf. Per the above argument, counting upper
index pages shouldn't change the estimate much, and in most cases
we don't have any easy way of estimating the number of upper pages.
This might be an area for further research in future.
Any external genericcostestimate callers that do not set the new field
GenericCosts.numNonLeafPages will see the same behavior as before,
assuming they followed the advice to zero out that whole struct.
Unsurprisingly, this change affects a number of plans seen in the
core regression tests. I hacked up the existing tests to keep the
tests' plans the same, since in each case it appeared that the
test's intent was to test exactly that plan. Also add one new
test case demonstrating that a better index choice is now made.
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Henson Choi <assam258@gmail.com>
Discussion: https://postgr.es/m/870521.1745860752@sss.pgh.pa.us
This adds the force_array option, which is available exclusively
when using COPY TO with the JSON format.
When enabled, this option wraps the output in a top-level JSON array
(enclosed in square brackets with comma-separated elements), making the
entire result a valid single JSON value. Without this option, the
default behavior is to output a stream of independent JSON objects.
Attempting to use this option with COPY FROM or with formats other than
JSON will raise an error.
Author: Joe Conway <mail@joeconway.com>
Author: jian he <jian.universality@gmail.com>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Florents Tselai <florents.tselai@gmail.com>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Discussion: https://postgr.es/m/CALvfUkBxTYy5uWPFVwpk_7ii2zgT07t3d-yR_cy4sfrrLU%3Dkcg%40mail.gmail.com
Discussion: https://postgr.es/m/6a04628d-0d53-41d9-9e35-5a8dc302c34c@joeconway.com
This introduces the JSON format option for the COPY TO command, allowing
users to export query results or table data directly as a stream of JSON
objects (one per line, NDJSON style).
The JSON format is currently supported only for COPY TO operations; it
is not available for COPY FROM.
JSON format is incompatible with some standard text/CSV formatting
options, including HEADER, DEFAULT, NULL, DELIMITER, FORCE QUOTE,
FORCE NOT NULL, and FORCE NULL.
Column list support is included: when a column list is specified, only
the named columns are emitted in each JSON object.
Regression tests covering valid JSON exports and error handling for
incompatible options have been added to src/test/regress/sql/copy.sql.
Author: Joe Conway <mail@joeconway.com>
Author: jian he <jian.universality@gmail.com>
Co-Authored-By: Andrew Dunstan <andrew@dunslane.net>
Reviewed-by: Andrey M. Borodin <x4mmm@yandex-team.ru>
Reviewed-by: Dean Rasheed <dean.a.rasheed@gmail.com>
Reviewed-by: Daniel Verite <daniel@manitou-mail.org>
Reviewed-by: Davin Shearer <davin@apache.org>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Discussion: https://postgr.es/m/CALvfUkBxTYy5uWPFVwpk_7ii2zgT07t3d-yR_cy4sfrrLU%3Dkcg%40mail.gmail.com
Discussion: https://postgr.es/m/6a04628d-0d53-41d9-9e35-5a8dc302c34c@joeconway.com
Currently, the COPY command format is determined by two boolean fields
(binary, csv_mode) in CopyFormatOptions. This approach, while
functional, isn't ideal for implementing other formats in the future.
To simplify adding new formats, introduce a CopyFormat enum. This makes
the code cleaner and more maintainable, allowing for easier integration
of additional formats down the line.
Author: Joel Jacobson <joel@compiler.org>
Author: jian he <jian.universality@gmail.com>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Discussion: https://postgr.es/m/CALvfUkBxTYy5uWPFVwpk_7ii2zgT07t3d-yR_cy4sfrrLU%3Dkcg%40mail.gmail.com
Discussion: https://postgr.es/m/6a04628d-0d53-41d9-9e35-5a8dc302c34c@joeconway.com
Following commit fd366065e0, which added EXCEPT TABLE support to
CREATE PUBLICATION, this commit extends ALTER PUBLICATION to allow
modifying the exclusion list.
New Syntax:
ALTER PUBLICATION name SET publication_all_object [, ... ]
where publication_all_object is one of:
ALL TABLES [ EXCEPT TABLE ( except_table_object [, ... ] ) ]
ALL SEQUENCES
If the EXCEPT clause is provided, the existing exclusion list in
pg_publication_rel is replaced with the specified relations. If the
EXCEPT clause is omitted, any existing exclusions for the publication
are cleared. Similarly, SET ALL SEQUENCES updates
Note that because this is a SET command, specifying only one object
type (e.g., SET ALL SEQUENCES) will reset the other unspecified flags
(e.g., setting puballtables to false).
Consistent with CREATE PUBLICATION, only root partitioned tables or
standard tables can be specified in the EXCEPT list. Specifying a
partition child will result in an error.
Author: vignesh C <vignesh21@gmail.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Nisha Moond <nisha.moond412@gmail.com>
Discussion: https://postgr.es/m/CALDaNm3=JrucjhiiwsYQw5-PGtBHFONa6F7hhWCXMsGvh=tamA@mail.gmail.com
This commit adds both the number of parallel workers planned and the
number of parallel workers actually launched to the output of
VACUUM (VERBOSE) and autovacuum logs.
Previously, this information was only reported as an INFO message
during VACUUM (VERBOSE), which meant it was not included in autovacuum
logs in practice. Although autovacuum does not yet support parallel
vacuum, a subsequent patch will enable it and utilize these logs in
its regression tests. This change also improves observability by
making it easier to verify if parallel vacuum is utilizing the
expected number of workers.
Author: Daniil Davydov <3danissimo@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Discussion: https://postgr.es/m/CACG=ezZOrNsuLoETLD1gAswZMuH2nGGq7Ogcc0QOE5hhWaw=cw@mail.gmail.com
This enables the use of functions such as encode() and decode() with
UUID values, allowing them to be converted to and from alternative
formats like base64 or hex.
The cast maps the 16-byte internal representation of a UUID directly
to a bytea datum. This is more efficient than going through a text
forepresentation.
Bump catalog version.
Author: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
Co-authored-by: Aleksander Alekseev <aleksander@tigerdata.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl>
Discussion: https://postgr.es/m/CAJ7c6TOramr1UTLcyB128LWMqita1Y7%3Darq3KHaU%3Dqikf5yKOQ%40mail.gmail.com
In a plain join, we can just summarily discard an input tuple
with null join key(s), since it cannot match anything from
the other side of the join (assuming a strict join operator).
However, if the tuple comes from the outer side of an outer join
then we have to emit it with null-extension of the other side.
Up to now, hash joins did that by inserting the tuple into the hash
table as though it were a normal tuple. This is unnecessarily
inefficient though, since the required processing is far simpler than
for a potentially-matchable tuple. Worse, if there are a lot of such
tuples they will bloat the hash bucket they go into, possibly causing
useless repeated attempts to split that bucket or increase the number
of batches. We have a report of a large join vainly creating many
thousands of batches when faced with such input.
This patch improves the situation by keeping such tuples out of the
hash table altogether, instead pushing them into a separate tuplestore
from which we return them later. (One might consider trying to return
them immediately; but that would require substantial refactoring, and
it doesn't work anyway for cases where we rescan an unmodified hash
table.) This works even in parallel hash joins, because whichever
worker reads a null-keyed tuple can just return it; there's no need
for consultation with other workers. Thus the tuplestores are local
storage even in a parallel join.
A pre-existing buglet that I noticed while analyzing the code's
behavior is that ExecHashRemoveNextSkewBucket fails to decrement
hashtable->skewTuples for tuples moved into the main hash table
from the skew hash table. This invalidates ExecHashTableInsert's
calculation of the number of main-hash-table tuples, though probably
not by a lot since we expect the skew table to be small relative
to the main one. Nonetheless, let's fix that too while we're here.
Bug: #18909
Reported-by: Sergey Koposov <Sergey.Koposov@ed.ac.uk>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/3061845.1746486714@sss.pgh.pa.us
Except for GRANT and REVOKE on roles, the GRANTED BY clause
currently only accepts the current role to match the SQL standard.
And even if an acceptable grantor (i.e., the current role) is
specified, Postgres ignores it and chooses the "best" grantor for
the command. Allowing the user to select a specific grantor would
allow better control over the precise behavior of GRANT/REVOKE
statements. This commit adds that ability. For consistency with
select_best_grantor(), we only permit choosing grantor roles for
which the current role inherits privileges.
Author: Nathan Bossart <nathandbossart@gmail.com>
Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/aRYLkTpazxKhnS_w%40nathan
Introduce dshash_find_or_insert_extended, which is just like
dshash_find_or_insert except that it takes a flags argument.
Currently, the only supported flag is DSHASH_INSERT_NO_OOM, but
I have chosen to use an integer rather than a boolean in case we
end up with more flags in the future.
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Discussion: http://postgr.es/m/CA+TgmoaJwUukUZGu7_yL74oMTQQz2=zqucMhF9+9xBmSC5us1w@mail.gmail.com
The recent commit to change copyObject() to use typeof_unqual allows
cleaning up some APIs to take advantage of this improved qualifier
handling. EventTriggerCollectSimpleCommand() is a good example: It
takes a node tree and makes a copy that it keeps around for its
internal purposes, but it can't communicate via its function signature
that it promises not scribble on the passed node tree. That is now
fixed.
Reviewed-by: David Geier <geidav.pg@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/92f9750f-c7f6-42d8-9a4a-85a3cbe808f3%40eisentraut.org
Support for SNI was added to clientside libpq in 5c55dc8b47 with the
sslsni parameter, but there was no support for utilizing it serverside.
This adds support for serverside SNI such that certificate/key handling
is available per host. A new config file, $datadir/pg_hosts.conf, is
used for configuring which certificate and key should be used for which
hostname. In order to use SNI the ssl_sni GUC must be set to on, when
it is off the ssl configuration works just like before. If ssl_sni is
enabled and pg_hosts.conf is non-empty it will take precedence over
the regular SSL GUCs, if it is empty or missing the regular GUCs will
be used just as before this commit with no hostname specific handling.
The TLS init hook is not compatible with ssl_sni since it operates on
a single TLS configuration and SNI break that assumption. If the init
hook and ssl_sni are both enabled, a WARNING will be issued.
Host configuration can either be for a literal hostname to match, non-
SNI connections using the no_sni keyword or a default fallback matching
all connections. By omitting no_sni and the fallback a strict mode
can be achieved where only connections using sslsni=1 and a specified
hostname are allowed.
CRL file(s) are applied from postgresql.conf to all configured hostnames.
Serverside SNI requires OpenSSL, currently LibreSSL does not support
the required infrastructure to update the SSL context during the TLS
handshake.
Author: Daniel Gustafsson <daniel@yesql.se>
Co-authored-by: Jacob Champion <jacob.champion@enterprisedb.com>
Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com>
Reviewed-by: Zsolt Parragi <zsolt.parragi@percona.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Dewei Dai <daidewei1970@163.com>
Reviewed-by: Cary Huang <cary.huang@highgo.ca>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/1C81CD0D-407E-44F9-833A-DD0331C202E5@yesql.se
The IS JSON predicate only accepted the base types text, json, jsonb, and
bytea. Extend it to also accept domain types over those base types by
resolving through getBaseType() during parse analysis.
The base type OID is stored in the JsonIsPredicate node (as exprBaseType)
so the executor can dispatch to the correct validation path without
repeating the domain lookup at runtime.
When a non-supported type (or domain over a non-supported type) is used,
the error message displays the original type name as written by the user,
rather than the resolved base type.
Author: jian he <jian.universality@gmail.com>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/CACJufxEk34DnJFG72CRsPPT4tsJL9arobX0tNPsn7yH28J=zQg@mail.gmail.com
A new attempt was made in 63275ce84d to make typeof_unqual work on all
configurations of CC and CLANG. This re-introduced an old problem
though, where CLANG would only support __typeof_unqual__ but the
configure check for CC detected support for typeof_unqual.
This fixes that by always defining typeof_unqual as __typeof_unqual__
under clang.
Author: Jelte Fennema-Nio <postgres@jeltef.nl>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/92f9750f-c7f6-42d8-9a4a-85a3cbe808f3%40eisentraut.org
Complete the TODOs in to_json_is_immutable() and to_jsonb_is_immutable()
by recursing into container types (arrays, composites, ranges, multiranges,
domains) to check element/sub-type mutability, rather than conservatively
returning "mutable" for all arrays and composites.
The shared logic is factored into a single json_check_mutability() function
in jsonfuncs.c, with the existing exported functions as thin wrappers.
Composite type inspection uses lookup_rowtype_tupdesc() (typcache) instead
of relation_open() to avoid unnecessary lock acquisition in the optimizer.
Range and multirange types are now also checked recursively: if the
subtype's conversion is immutable, the range is considered immutable
for JSON purposes, even though range_out is generically marked STABLE.
This is a behavioral change: range types with immutable subtypes (e.g.,
int4range) can now appear in expression indexes via JSON_ARRAY/JSON_OBJECT,
whereas previously they were conservatively rejected.
Add regression tests for JSON_ARRAY and JSON_OBJECT mutability with
expression indexes and generated columns, covering arrays, composites,
domains, ranges, multiranges and combinations thereof.
Author: Jian He <jian.universality@gmail.com>
Reviewed-by: Andrew Dunstan <andrew@dunslane.net>
Discussion: https://postgr.es/m/CACJufxFz=OsXQdsMJ-cqoqspD9aJrwntsQP-U2A-UaV_M+-S9g@mail.gmail.com
Commitfest: https://commitfest.postgresql.org/patch/5759
This commit adds table OID and attribute number columns to
pg_stats, and it adds table OID and statistics object OID columns
to pg_stats_ext and pg_stats_ext_exprs. A proposed follow-up
commit would use pg_stats.tableid to simplify a query in pg_dump.
The others have no immediate purpose but may be useful later.
Bumps catversion.
Author: Corey Huinker <corey.huinker@gmail.com>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CADkLM%3DcoCVy92QkVUUTLdo5eO2bMDtwMrzRn_8miAhX%2BuPaqXg%40mail.gmail.com
Previously, this was 16 bytes. With the use of some bitflags and by
reducing the attcacheoff field size to a 16-bit type, we can halve the
size of the struct.
It's unlikely that caching the offsets for offsets larger than what will
fit in a 16-bit int will help much as the tuple is very likely to have
some non-fixed-width types anyway, the offsets of which we cannot cache.
Shrinking this down to 8 bytes helps by accessing fewer cachelines when
performing tuple deformation. The fields used there are all fully
fledged fields, which don't require any bitmasking to extract the value
of. It also helps to more efficiently calculate the address of a
compact_attrs[] element in TupleDesc as the x86 LEA instruction can work
with 8 byte offsets, which allows the element address to be calculated
from the TupleDesc's address in a single instruction using LEA's
concurrent shift and add.
Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/CAApHDvodSVBj3ypOYbYUCJX%2BNWL%3DVZs63RNBQ_FxB_F%2B6QXF-A%40mail.gmail.com