This reverts commit eae7be600b, following a discussion with Tom Lane,
due to concerns that this impacts the decisions made by the planner for
the number of workers spawned based on the inlining and const-folding of
index expressions and predicate for cases that would have worked until
this commit.
Discussion: https://postgr.es/m/162802.1709746091@sss.pgh.pa.us
Backpatch-through: 12
As coded, the planner logic that calculates the number of parallel
workers to use for a parallel index build uses expressions and
predicates from the relcache, which are flattened for the planner by
eval_const_expressions().
As reported in the bug, an immutable parallel-unsafe function flattened
in the relcache would become a Const, which would be considered as
parallel-safe, even if the predicate or the expressions including the
function are not safe in parallel workers. Depending on the expressions
or predicate used, this could cause the parallel build to fail.
Tests are included that check parallel index builds with parallel-unsafe
predicate and expressions. Two routines are added to lsyscache.h to be
able to retrieve expressions and predicate of an index from its pg_index
data.
Reported-by: Alexander Lakhin
Author: Tender Wang
Reviewed-by: Jian He, Michael Paquier
Discussion: https://postgr.es/m/CAHewXN=UaAaNn9ruHDH3Os8kxLVmtWqbssnf=dZN_s9=evHUFA@mail.gmail.com
Backpatch-through: 12
array_set_element() and related functions allow an array to be
enlarged by assigning to subscripts outside the current array bounds.
While these places were careful to check that the new bounds are
allowable, they neglected to consider the risk of integer overflow
in computing the new bounds. In edge cases, we could compute new
bounds that are invalid but get past the subsequent checks,
allowing bad things to happen. Memory stomps that are potentially
exploitable for arbitrary code execution are possible, and so is
disclosure of server memory.
To fix, perform the hazardous computations using overflow-detecting
arithmetic routines, which fortunately exist in all still-supported
branches.
The test cases added for this generate (after patching) errors that
mention the value of MaxArraySize, which is platform-dependent.
Rather than introduce multiple expected-files, use psql's VERBOSITY
parameter to suppress the printing of the message text. v11 psql
lacks that parameter, so omit the tests in that branch.
Our thanks to Pedro Gallegos for reporting this problem.
Security: CVE-2023-5869
get_explain_guc_options() crashed if a string GUC marked GUC_EXPLAIN
has a NULL boot_val. Nosing around found a couple of other places
that seemed insufficiently cautious about NULL string values, although
those are likely unreachable in practice. Add some commentary
defining the expectations for NULL values of string variables,
in hopes of forestalling future additions of more such bugs.
Xing Guo, Aleksander Alekseev, Tom Lane
Discussion: https://postgr.es/m/CACpMh+AyDx5YUpPaAgzVwC1d8zfOL4JoD-uyFDnNSa1z0EsDQQ@mail.gmail.com
The SIGTERM handler for the startup process immediately calls
proc_exit() for the duration of the restore_command, i.e., a call
to system(). This system() call forks a new process to execute the
shell command, and this child process inherits the parent's signal
handlers. If both the parent and child processes receive SIGTERM,
both will attempt to call proc_exit(). This can end badly. For
example, both processes will try to remove themselves from the
PGPROC shared array.
To fix this problem, this commit adds a check in
StartupProcShutdownHandler() to see whether MyProcPid == getpid().
If they match, this is the parent process, and we can proc_exit()
like before. If they do not match, this is a child process, and we
just emit a message to STDERR (in a signal safe manner) and
_exit(), thereby skipping any problematic exit callbacks.
This commit also adds checks in proc_exit(), ProcKill(), and
AuxiliaryProcKill() that verify they are not being called within
such child processes.
Suggested-by: Andres Freund
Reviewed-by: Thomas Munro, Andres Freund
Discussion: https://postgr.es/m/Y9nGDSgIm83FHcad%40paquier.xyz
Discussion: https://postgr.es/m/20230223231503.GA743455%40nathanxps13
Backpatch-through: 11
Add "#ifndef FRONTEND" where necessary to make pg_waldump build
on compilers that don't elide unused static-inline functions.
This back-patches relevant parts of commit 3e9ca5260, fixing build
breakage from dc7420c2c and back-patching of f10f0ae42.
Per recently-resurrected buildfarm member castoroides. We aren't
expecting castoroides to build anything newer than v11, but we
might as well clean up the intermediate branches while at it.
This is a back-patch of the v15-era commit f10f0ae42 into older
supported branches. The idea is to design out bugs in which an
ill-timed relcache flush clears rel->rd_smgr partway through
some code sequence that wasn't expecting that. We had another
report today of a corner case that reliably crashes v14 under
debug_discard_caches (nee CLOBBER_CACHE_ALWAYS), and therefore
would crash once in a blue moon in the field. We're unlikely
to get rid of all such code paths unless we adopt the more
rigorous coding rules instituted by f10f0ae42. Therefore,
even though this is a bit invasive, it's time to back-patch.
Some comfort can be taken in the fact that f10f0ae42 has been
in v15 for 16 months without problems.
I left the RelationOpenSmgr macro present in the back branches,
even though no core code should use it anymore, in order to not break
third-party extensions in minor releases. Such extensions might opt
to start using RelationGetSmgr instead, to reduce their code
differential between v15 and earlier branches. This carries a hazard
of failing to compile against headers from existing minor releases.
However, once compiled the extension should work fine even with such
releases, because RelationGetSmgr is a "static inline" function so
it creates no link-time dependency. So depending on distribution
practices, that might be an OK tradeoff.
Per report from Spyridon Dimitrios Agathos. Original patch
by Amul Sul.
Discussion: https://postgr.es/m/CAFM5RaqdgyusQvmWkyPYaWMwoK5gigdtW-7HcgHgOeAw7mqJ_Q@mail.gmail.com
Discussion: https://postgr.es/m/CANiYTQsU7yMFpQYnv=BrcRVqK_3U3mtAzAsJCaqtzsDHfsUbdQ@mail.gmail.com
This adds additional variants of palloc, pg_malloc, etc. that
encapsulate common usage patterns and provide more type safety.
Specifically, this adds palloc_object(), palloc_array(), and
repalloc_array(), which take the type name of the object to be
allocated as its first argument and cast the return as a pointer to
that type. There are also palloc0_object() and palloc0_array()
variants for initializing with zero, and pg_malloc_*() variants of all
of the above.
Inspired by the talloc library.
This is backpatched from master so that future backpatchable code can
make use of these APIs. This patch by itself does not contain any
users of these APIs.
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/bb755632-2a43-d523-36f8-a1e7a389a907@enterprisedb.com
clang 13 with -Wextra warns that "performing pointer subtraction with
a null pointer has undefined behavior" in the places where freepage.c
tries to set a relptr variable to constant NULL. This appears to be
a compiler bug, but it's unlikely to get fixed instantly. Fortunately,
we can work around it by introducing an inline support function, which
seems like a good change anyway because it removes the macro's existing
double-evaluation hazard.
Backpatch to v10 where this code was introduced.
Patch by me, based on an idea of Andres Freund's.
Discussion: https://postgr.es/m/48826.1648310694@sss.pgh.pa.us
Commit 8431e296ea reworked ProcArrayApplyRecoveryInfo to sort XIDs
before adding them to KnownAssignedXids. But the XIDs are sorted using
xidComparator, which compares the XIDs simply as uint32 values, not
logically. KnownAssignedXidsAdd() however expects XIDs in logical order,
and calls TransactionIdFollowsOrEquals() to enforce that. If there are
XIDs for which the two orderings disagree, an error is raised and the
recovery fails/restarts.
Hitting this issue is fairly easy - you just need two transactions, one
started before the 4B limit (e.g. XID 4294967290), the other sometime
after it (e.g. XID 1000). Logically (4294967290 <= 1000) but when
compared using xidComparator we try to add them in the opposite order.
Which makes KnownAssignedXidsAdd() fail with an error like this:
ERROR: out-of-order XID insertion in KnownAssignedXids
This only happens during replica startup, while processing RUNNING_XACTS
records to build the snapshot. Once we reach STANDBY_SNAPSHOT_READY, we
skip these records. So this does not affect already running replicas,
but if you restart (or create) a replica while there are transactions
with XIDs for which the two orderings disagree, you may hit this.
Long-running transactions and frequent replica restarts increase the
likelihood of hitting this issue. Once the replica gets into this state,
it can't be started (even if the old transactions are terminated).
Fixed by sorting the XIDs logically - this is fine because we're dealing
with normal XIDs (because it's XIDs assigned to backends) and from the
same wraparound epoch (otherwise the backends could not be running at
the same time on the primary node). So there are no problems with the
triangle inequality, which is why xidComparator compares raw values.
Investigation and root cause analysis by Abhijit Menon-Sen. Patch by me.
This issue is present in all releases since 9.4, however releases up to
9.6 are EOL already so backpatch to 10 only.
Reviewed-by: Abhijit Menon-Sen
Reviewed-by: Alvaro Herrera
Backpatch-through: 10
Discussion: https://postgr.es/m/36b8a501-5d73-277c-4972-f58a4dce088a%40enterprisedb.com
CIC and REINDEX CONCURRENTLY assume backends see their catalog changes
no later than each backend's next transaction start. That failed to
hold when a backend absorbed a relevant invalidation in the middle of
running RelationBuildDesc() on the CIC index. Queries that use the
resulting index can silently fail to find rows. Fix this for future
index builds by making RelationBuildDesc() loop until it finishes
without accepting a relevant invalidation. It may be necessary to
reindex to recover from past occurrences; REINDEX CONCURRENTLY suffices.
Back-patch to 9.6 (all supported versions).
Noah Misch and Andrey Borodin, reviewed (in earlier versions) by Andres
Freund.
Discussion: https://postgr.es/m/20210730022548.GA1940096@gust.leadboat.com
Commit 84f5c2908 forgot to consider the possibility that
EnsurePortalSnapshotExists could run inside a subtransaction with
lifespan shorter than the Portal's. In that case, the new active
snapshot would be popped at the end of the subtransaction, leaving
a dangling pointer in the Portal, with mayhem ensuing.
To fix, make sure the ActiveSnapshot stack entry is marked with
the same subtransaction nesting level as the associated Portal.
It's certainly safe to do so since we won't be here at all unless
the stack is empty; hence we can't create an out-of-order stack.
Let's also apply this logic in the case where PortalRunUtility
sets portalSnapshot, just to be sure that path can't cause similar
problems. It's slightly less clear that that path can't create
an out-of-order stack, so add an assertion guarding it.
Report and patch by Bertrand Drouvot (with kibitzing by me).
Back-patch to v11, like the previous commit.
Discussion: https://postgr.es/m/ff82b8c5-77f4-3fe7-6028-fcf3303e82dd@amazon.com
That allows to include just visibilitymapdefs.h from file.c, and in turn,
remove include of postgres.h from relcache.h.
Reported-by: Andres Freund
Discussion: https://postgr.es/m/20210913232614.czafiubr435l6egi%40alap3.anarazel.de
Author: Alexander Korotkov
Reviewed-by: Andres Freund, Tom Lane, Alvaro Herrera
Backpatch-through: 13
COMMIT/ROLLBACK necessarily destroys all snapshots within the session.
The original implementation of intra-procedure transactions just
cavalierly did that, ignoring the fact that this left us executing in
a rather different environment than normal. In particular, it turns
out that handling of toasted datums depends rather critically on there
being an outer ActiveSnapshot: otherwise, when SPI or the core
executor pop whatever snapshot they used and return, it's unsafe to
dereference any toasted datums that may appear in the query result.
It's possible to demonstrate "no known snapshots" and "missing chunk
number N for toast value" errors as a result of this oversight.
Historically this outer snapshot has been held by the Portal code,
and that seems like a good plan to preserve. So add infrastructure
to pquery.c to allow re-establishing the Portal-owned snapshot if it's
not there anymore, and add enough bookkeeping support that we can tell
whether it is or not.
We can't, however, just re-establish the Portal snapshot as part of
COMMIT/ROLLBACK. As in normal transaction start, acquiring the first
snapshot should wait until after SET and LOCK commands. Hence, teach
spi.c about doing this at the right time. (Note that this patch
doesn't fix the problem for any PLs that try to run intra-procedure
transactions without using SPI to execute SQL commands.)
This makes SPI's no_snapshots parameter rather a misnomer, so in HEAD,
rename that to allow_nonatomic.
replication/logical/worker.c also needs some fixes, because it wasn't
careful to hold a snapshot open around AFTER trigger execution.
That code doesn't use a Portal, which I suspect someday we're gonna
have to fix. But for now, just rearrange the order of operations.
This includes back-patching the recent addition of finish_estate()
to centralize the cleanup logic there.
This also back-patches commit 2ecfeda3e into v13, to improve the
test coverage for worker.c (it was that test that exposed that
worker.c's snapshot management is wrong).
Per bug #15990 from Andreas Wicht. Back-patch to v11 where
intra-procedure COMMIT was added.
Discussion: https://postgr.es/m/15990-eee2ac466b11293d@postgresql.org
While we were (mostly) careful about ensuring that the dimensions of
arrays aren't large enough to cause integer overflow, the lower bound
values were generally not checked. This allows situations where
lower_bound + dimension overflows an integer. It seems that that's
harmless so far as array reading is concerned, except that array
elements with subscripts notionally exceeding INT_MAX are inaccessible.
However, it confuses various array-assignment logic, resulting in a
potential for memory stomps.
Fix by adding checks that array lower bounds aren't large enough to
cause lower_bound + dimension to overflow. (Note: this results in
disallowing cases where the last subscript position would be exactly
INT_MAX. In principle we could probably allow that, but there's a lot
of code that computes lower_bound + dimension and would need adjustment.
It seems doubtful that it's worth the trouble/risk to allow it.)
Somewhat independently of that, array_set_element() was careless
about possible overflow when checking the subscript of a fixed-length
array, creating a different route to memory stomps. Fix that too.
Security: CVE-2021-32027
For an index, attstattarget can be updated using ALTER INDEX SET
STATISTICS. This data was lost on the new index after REINDEX
CONCURRENTLY.
The update of this field is done when the old and new indexes are
swapped to make the fix back-patchable. Another approach we could look
after in the long-term is to change index_create() to pass the wanted
values of attstattarget when creating the new relation, but, as this
would cause an ABI breakage this can be done only on HEAD.
Reported-by: Ronan Dunklau
Author: Michael Paquier
Reviewed-by: Ronan Dunklau, Tomas Vondra
Discussion: https://postgr.es/m/16628084.uLZWGnKmhe@laptop-ronand
Backpatch-through: 12
Given a permanent relation rewritten in the current transaction, the
old_snapshot_threshold mechanism assumed the relation had never been
subject to early pruning. Hence, a query could fail to report "snapshot
too old" when the rewrite followed an early truncation. ALTER TABLE SET
TABLESPACE is probably the only rewrite mechanism capable of exposing
this bug. REINDEX sets indcheckxmin, avoiding the problem. CLUSTER has
zeroed page LSNs since before old_snapshot_threshold existed, so
old_snapshot_threshold has never cooperated with it. ALTER TABLE
... SET DATA TYPE makes the table look empty to every past snapshot,
which is strictly worse. Back-patch to v13, where commit
c6b92041d3 broke this.
Kyotaro Horiguchi and Noah Misch
Discussion: https://postgr.es/m/20210113.160705.2225256954956139776.horikyota.ntt@gmail.com
Introduce TimestampDifferenceMilliseconds() to simplify callers
that would rather have the difference in milliseconds, instead of
the select()-oriented seconds-and-microseconds format. This gets
rid of at least one integer division per call, and it eliminates
some apparently-easy-to-mess-up arithmetic.
Two of these call sites were in fact wrong:
* pg_prewarm's autoprewarm_main() forgot to multiply the seconds
by 1000, thus ending up with a delay 1000X shorter than intended.
That doesn't quite make it a busy-wait, but close.
* postgres_fdw's pgfdw_get_cleanup_result() thought it needed to compute
microseconds not milliseconds, thus ending up with a delay 1000X longer
than intended. Somebody along the way had noticed this problem but
misdiagnosed the cause, and imposed an ad-hoc 60-second limit rather
than fixing the units. This was relatively harmless in context, because
we don't care that much about exactly how long this delay is; still,
it's wrong.
There are a few more callers of TimestampDifference() that don't
have a direct need for seconds-and-microseconds, but can't use
TimestampDifferenceMilliseconds() either because they do need
microsecond precision or because they might possibly deal with
intervals long enough to overflow 32-bit milliseconds. It might be
worth inventing another API to improve that, but that seems outside
the scope of this patch; so those callers are untouched here.
Given the fact that we are fixing some bugs, and the likelihood
that future patches might want to back-patch code that uses this
new API, back-patch to all supported branches.
Alexey Kondratov and Tom Lane
Discussion: https://postgr.es/m/3b1c053a21c07c1ed5e00be3b2b855ef@postgrespro.ru
The date-vs-timestamp, date-vs-timestamptz, and timestamp-vs-timestamptz
comparators all worked by promoting the first type to the second and
then doing a simple same-type comparison. This works fine, except
when the conversion result is out of range, in which case we throw an
entirely avoidable error. The sources of such failures are
(a) type date can represent dates much farther in the future than
the timestamp types can;
(b) timezone rotation might cause a just-in-range timestamp value to
become a just-out-of-range timestamptz value.
Up to now we just ignored these corner-case issues, but now we have
an actual user complaint (bug #16657 from Huss EL-Sheikh), so let's
do something about it.
It turns out that commit 52ad1e659 already built all the necessary
infrastructure to support error-free comparisons, but neglected to
actually use it in the main-line code paths. Fix that, do a little
bit of code style review, and remove the now-duplicate logic in
jsonpath_exec.c.
Back-patch to v13 where 52ad1e659 came in. We could take this back
further by back-patching said infrastructure, but given the small
number of complaints so far, I don't feel a great need to.
Discussion: https://postgr.es/m/16657-cde2f876d8cc7971@postgresql.org
In passing, also remove a few surplus empty lines from pg_ltoa and
pg_ulltoa_n in numutils.c
Reported-by: Andrew Gierth
Discussion: https://postgr.es/m/87y2ou3xuh.fsf@news-spur.riddles.org.uk
Backpatch-through: 13, where these changes were introduced
ineq_histogram_selectivity() can be invoked in situations where the
ordering we care about is not that of the column's histogram. We could
be considering some other collation, or even more drastically, the
query operator might not agree at all with what was used to construct
the histogram. (We'll get here for anything using scalarineqsel-based
estimators, so that's quite likely to happen for extension operators.)
Up to now we just ignored this issue and assumed we were dealing with
an operator/collation whose sort order exactly matches the histogram,
possibly resulting in junk estimates if the binary search gets confused.
It's past time to improve that, since the use of nondefault collations
is increasing. What we can do is verify that the given operator and
collation match what's recorded in pg_statistic, and use the existing
code only if so. When they don't match, instead execute the operator
against each histogram entry, and take the fraction of successes as our
selectivity estimate. This gives an estimate that is probably good to
about 1/histogram_size, with no assumptions about ordering. (The quality
of the estimate is likely to degrade near the ends of the value range,
since the two orderings probably don't agree on what is an extremal value;
but this is surely going to be more reliable than what we did before.)
At some point we might further improve matters by storing more than one
histogram calculated according to different orderings. But this code
would still be good fallback logic when no matches exist, so that is
not an argument for not doing this.
While here, also improve get_variable_range() to deal more honestly
with non-default collations.
This isn't back-patchable, because it requires adding another argument
to ineq_histogram_selectivity, and because it might have significant
impact on the estimation results for extension operators relying on
scalarineqsel --- mostly for the better, one hopes, but in any case
destabilizing plan choices in back branches is best avoided.
Per investigation of a report from James Lucas.
Discussion: https://postgr.es/m/CAAFmbbOvfi=wMM=3qRsPunBSLb8BFREno2oOzSBS=mzfLPKABw@mail.gmail.com
Commit 5e0928005 changed the planner so that, instead of blindly using
DEFAULT_COLLATION_OID when invoking operators for selectivity estimation,
it would use the collation of the column whose statistics we're
considering. This was recognized as still being not quite the right
thing, but it seemed like a good incremental improvement. However,
shortly thereafter we introduced nondeterministic collations, and that
creates cases where operators can fail if they're passed the wrong
collation. We don't want planning to fail in cases where the query itself
would work, so this means that we *must* use the query's collation when
invoking operators for estimation purposes.
The only real problem this creates is in ineq_histogram_selectivity, where
the binary search might produce a garbage answer if we perform comparisons
using a different collation than the column's histogram is ordered with.
However, when the query's collation is significantly different from the
column's default collation, the estimate we previously generated would be
pretty irrelevant anyway; so it's not clear that this will result in
noticeably worse estimates in practice. (A follow-on patch will improve
this situation in HEAD, but it seems too invasive for back-patch.)
The patch requires changing the signatures of mcv_selectivity and allied
functions, which are exported and very possibly are used by extensions.
In HEAD, I just did that, but an API/ABI break of this sort isn't
acceptable in stable branches. Therefore, in v12 the patch introduces
"mcv_selectivity_ext" and so on, with signatures matching HEAD, and makes
the old functions into wrappers that assume DEFAULT_COLLATION_OID should
be used. That does not match the prior behavior, but it should avoid risk
of failure in most cases. (In practice, I think most extension datatypes
aren't collation-aware, so the change probably doesn't matter to them.)
Per report from James Lucas. Back-patch to v12 where the problem was
introduced.
Discussion: https://postgr.es/m/CAAFmbbOvfi=wMM=3qRsPunBSLb8BFREno2oOzSBS=mzfLPKABw@mail.gmail.com
It's intentional that we don't allow values greater than 24 hours,
while we do allow "24:00:00" as well as "23:59:60" as inputs.
However, the range check was miscoded in such a way that it would
accept "23:59:60.nnn" with a nonzero fraction. For time or timetz,
the stored result would then be greater than "24:00:00" which would
fail dump/reload, not to mention possibly confusing other operations.
Fix by explicitly calculating the result and making sure it does not
exceed 24 hours. (This calculation is redundant with what will happen
later in tm2time or tm2timetz. Maybe someday somebody will find that
annoying enough to justify refactoring to avoid the duplication; but
that seems too invasive for a back-patched bug fix, and the cost is
probably unmeasurable anyway.)
Note that this change also rejects such input as the time portion
of a timestamp(tz) value.
Back-patch to v10. The bug is far older, but to change this pre-v10
we'd need to ensure that the logic behaves sanely with float timestamps,
which is possibly nontrivial due to roundoff considerations.
Doesn't really seem worth troubling with.
Per report from Christoph Berg.
Discussion: https://postgr.es/m/20200520125807.GB296739@msg.df7cb.de
Includes some manual cleanup of places that pgindent messed up,
most of which weren't per project style anyway.
Notably, it seems some people didn't absorb the style rules of
commit c9d297751, because there were a bunch of new occurrences
of function calls with a newline just after the left paren, all
with faulty expectations about how the rest of the call would get
indented.
0f5ca02f53 introduces 3 new keywords. It appears to be too much for relatively
small feature. Given now we past feature freeze, it's already late for
discussion of the new syntax. So, revert.
Discussion: https://postgr.es/m/28209.1586294824%40sss.pgh.pa.us
This commit adds following optional clause to BEGIN and START TRANSACTION
commands.
WAIT FOR LSN lsn [ TIMEOUT timeout ]
New clause pospones transaction start till given lsn is applied on standby.
This clause allows user be sure, that changes previously made on primary would
be visible on standby.
New shared memory struct is used to track awaited lsn per backend. Recovery
process wakes up backend once required lsn is applied.
Author: Ivan Kartyshov, Anna Akenteva
Reviewed-by: Craig Ringer, Thomas Munro, Robert Haas, Kyotaro Horiguchi
Reviewed-by: Masahiko Sawada, Ants Aasma, Dmitry Ivanov, Simon Riggs
Reviewed-by: Amit Kapila, Alexander Korotkov
Discussion: https://postgr.es/m/0240c26c-9f84-30ea-fca9-93ab2df5f305%40postgrespro.ru
Since the existing bit number argument can't exceed INT32_MAX, it's
not possible for these functions to manipulate bits beyond the first
256MB of a bytea value. Lift that restriction by redeclaring the
bit number arguments as int8 (which requires a catversion bump,
hence is not back-patchable).
The similarly-named functions for bit/varbit don't really have a
problem because we restrict those types to at most VARBITMAXLEN bits;
hence leave them alone.
While here, extend the encode/decode functions in utils/adt/encode.c
to allow dealing with values wider than 1GB. This is not a live bug
or restriction in current usage, because no input could be more than
1GB, and since none of the encoders can expand a string more than 4X,
the result size couldn't overflow uint32. But it might be desirable
to support more in future, so make the input length values size_t
and the potential-output-length values uint64.
Also add some test cases to improve the miserable code coverage
of these functions.
Movead Li, editorialized some by me; also reviewed by Ashutosh Bapat
Discussion: https://postgr.es/m/20200312115135445367128@highgo.ca
It turns out that the code did indeed rely on a zeroed
TuplesortInstrumentation.sortMethod field to indicate
"this worker never did anything", although it seems the
issue only comes up during certain race-condition-y cases.
Hence, rearrange the TuplesortMethod enum to restore
SORT_TYPE_STILL_IN_PROGRESS to having the value zero,
and add some comments reinforcing that that isn't optional.
Also future-proof a loop over the possible values of the enum.
sizeof(bits32) happened to be the correct limit value,
but only by purest coincidence.
Per buildfarm and local investigation.
Discussion: https://postgr.es/m/12222.1586223974@sss.pgh.pa.us
Similar to xid, but 64 bits wide. This new type is suitable for use in
various system views and administration functions.
Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Reviewed-by: Takao Fujii <btfujiitkp@oss.nttdata.com>
Reviewed-by: Yoshikazu Imai <imai.yoshikazu@fujitsu.com>
Reviewed-by: Mark Dilger <mark.dilger@enterprisedb.com>
Discussion: https://postgr.es/m/20190725000636.666m5mad25wfbrri%40alap3.anarazel.de
Incremental Sort is an optimized variant of multikey sort for cases when
the input is already sorted by a prefix of the requested sort keys. For
example when the relation is already sorted by (key1, key2) and we need
to sort it by (key1, key2, key3) we can simply split the input rows into
groups having equal values in (key1, key2), and only sort/compare the
remaining column key3.
This has a number of benefits:
- Reduced memory consumption, because only a single group (determined by
values in the sorted prefix) needs to be kept in memory. This may also
eliminate the need to spill to disk.
- Lower startup cost, because Incremental Sort produce results after each
prefix group, which is beneficial for plans where startup cost matters
(like for example queries with LIMIT clause).
We consider both Sort and Incremental Sort, and decide based on costing.
The implemented algorithm operates in two different modes:
- Fetching a minimum number of tuples without check of equality on the
prefix keys, and sorting on all columns when safe.
- Fetching all tuples for a single prefix group and then sorting by
comparing only the remaining (non-prefix) keys.
We always start in the first mode, and employ a heuristic to switch into
the second mode if we believe it's beneficial - the goal is to minimize
the number of unnecessary comparions while keeping memory consumption
below work_mem.
This is a very old patch series. The idea was originally proposed by
Alexander Korotkov back in 2013, and then revived in 2017. In 2018 the
patch was taken over by James Coleman, who wrote and rewrote most of the
current code.
There were many reviewers/contributors since 2013 - I've done my best to
pick the most active ones, and listed them in this commit message.
Author: James Coleman, Alexander Korotkov
Reviewed-by: Tomas Vondra, Andreas Karlsson, Marti Raudsepp, Peter Geoghegan, Robert Haas, Thomas Munro, Antonin Houska, Andres Freund, Alexander Kuzmenkov
Discussion: https://postgr.es/m/CAPpHfdscOX5an71nHd8WSUH6GNOCf=V7wgDaTXdDd9=goN-gfA@mail.gmail.com
Discussion: https://postgr.es/m/CAPpHfds1waRZ=NOmueYq0sx1ZSCnt+5QJvizT8ndT2=etZEeAQ@mail.gmail.com
A table rewritten by ALTER TABLE would lose tracking of an index usable
for CLUSTER. This setting is tracked by pg_index.indisclustered and is
controlled by ALTER TABLE, so some extra work was needed to restore it
properly. Note that ALTER TABLE only marks the index that can be used
for clustering, and does not do the actual operation.
Author: Amit Langote, Justin Pryzby
Reviewed-by: Ibrar Ahmed, Michael Paquier
Discussion: https://postgr.es/m/20200202161718.GI13621@telsasoft.com
Backpatch-through: 9.5
Until now, only selected bulk operations (e.g. COPY) did this. If a
given relfilenode received both a WAL-skipping COPY and a WAL-logged
operation (e.g. INSERT), recovery could lose tuples from the COPY. See
src/backend/access/transam/README section "Skipping WAL for New
RelFileNode" for the new coding rules. Maintainers of table access
methods should examine that section.
To maintain data durability, just before commit, we choose between an
fsync of the relfilenode and copying its contents to WAL. A new GUC,
wal_skip_threshold, guides that choice. If this change slows a workload
that creates small, permanent relfilenodes under wal_level=minimal, try
adjusting wal_skip_threshold. Users setting a timeout on COMMIT may
need to adjust that timeout, and log_min_duration_statement analysis
will reflect time consumption moving to COMMIT from commands like COPY.
Internally, this requires a reliable determination of whether
RollbackAndReleaseCurrentSubTransaction() would unlink a relation's
current relfilenode. Introduce rd_firstRelfilenodeSubid. Amend the
specification of rd_createSubid such that the field is zero when a new
rel has an old rd_node. Make relcache.c retain entries for certain
dropped relations until end of transaction.
Bump XLOG_PAGE_MAGIC, since this introduces XLOG_GIST_ASSIGN_LSN.
Future servers accept older WAL, so this bump is discretionary.
Kyotaro Horiguchi, reviewed (in earlier, similar versions) by Robert
Haas. Heikki Linnakangas and Michael Paquier implemented earlier
designs that materially clarified the problem. Reviewed, in earlier
designs, by Andrew Dunstan, Andres Freund, Alvaro Herrera, Tom Lane,
Fujii Masao, and Simon Riggs. Reported by Martijn van Oosterhout.
Discussion: https://postgr.es/m/20150702220524.GA9392@svana.org
This patch replaces the boolean GUC log_parameters_on_error introduced
by commit ba79cb5dc with an integer log_parameter_max_length_on_error,
adding the ability to specify how many bytes to trim each logged
parameter value to. (The previous coding hard-wired that choice at
64 bytes.)
In addition, add a new parameter log_parameter_max_length that provides
similar control over truncation of query parameters that are logged in
response to statement-logging options, as opposed to errors. Previous
releases always logged such parameters in full, possibly causing log
bloat.
For backwards compatibility with prior releases,
log_parameter_max_length defaults to -1 (log in full), while
log_parameter_max_length_on_error defaults to 0 (no logging).
Per discussion, log_parameter_max_length is SUSET since the DBA should
control routine logging behavior, but log_parameter_max_length_on_error
is USERSET because it also affects errcontext data sent back to the
client.
Alexey Bashtanov, editorialized a little by me
Discussion: https://postgr.es/m/b10493cc-a399-a03a-67c7-068f2791ee50@imap.cc
Quite a few matching operators such as JSONB's @> used "contsel" and
"contjoinsel" as their selectivity estimators. That was a bad idea,
because (a) contsel is only a stub, yielding a fixed default estimate,
and (b) that default is 0.001, meaning we estimate these operators as
five times more selective than equality, which is surely pretty silly.
There's a good model for improving this in ltree's ltreeparentsel():
for any "var OP constant" query, we can try applying the operator
to all of the column's MCV and histogram values, taking the latter
as being a random sample of the non-MCV values. That code is
actually 100% generic, except for the question of exactly what
default selectivity ought to be plugged in when we don't have stats.
Hence, migrate the guts of ltreeparentsel() into the core code, provide
wrappers "matchingsel" and "matchingjoinsel" with a more-appropriate
default estimate, and use those for the non-geometric operators that
formerly used contsel (mostly JSONB containment operators and tsquery
matching).
Also apply this code to some match-like operators in hstore, ltree, and
pg_trgm, including the former users of ltreeparentsel as well as ones
that improperly used contsel. Since commit 911e70207 just created new
versions of those extensions that we haven't released yet, we can sneak
this change into those new versions instead of having to create an
additional generation of update scripts.
Patch by me, reviewed by Alexey Bashtanov
Discussion: https://postgr.es/m/12237.1582833074@sss.pgh.pa.us
PostgreSQL provides set of template index access methods, where opclasses have
much freedom in the semantics of indexing. These index AMs are GiST, GIN,
SP-GiST and BRIN. There opclasses define representation of keys, operations on
them and supported search strategies. So, it's natural that opclasses may be
faced some tradeoffs, which require user-side decision. This commit implements
opclass parameters allowing users to set some values, which tell opclass how to
index the particular dataset.
This commit doesn't introduce new storage in system catalog. Instead it uses
pg_attribute.attoptions, which is used for table column storage options but
unused for index attributes.
In order to evade changing signature of each opclass support function, we
implement unified way to pass options to opclass support functions. Options
are set to fn_expr as the constant bytea expression. It's possible due to the
fact that opclass support functions are executed outside of expressions, so
fn_expr is unused for them.
This commit comes with some examples of opclass options usage. We parametrize
signature length in GiST. That applies to multiple opclasses: tsvector_ops,
gist__intbig_ops, gist_ltree_ops, gist__ltree_ops, gist_trgm_ops and
gist_hstore_ops. Also we parametrize maximum number of integer ranges for
gist__int_ops. However, the main future usage of this feature is expected
to be json, where users would be able to specify which way to index particular
json parts.
Catversion is bumped.
Discussion: https://postgr.es/m/d22c3a18-31c7-1879-fc11-4c1ce2f5e5af%40postgrespro.ru
Author: Nikita Glukhov, revised by me
Reviwed-by: Nikolay Shaplov, Robert Haas, Tom Lane, Tomas Vondra, Alvaro Herrera
Traditionally autovacuum has only ever invoked a worker based on the
estimated number of dead tuples in a table and for anti-wraparound
purposes. For the latter, with certain classes of tables such as
insert-only tables, anti-wraparound vacuums could be the first vacuum that
the table ever receives. This could often lead to autovacuum workers being
busy for extended periods of time due to having to potentially freeze
every page in the table. This could be particularly bad for very large
tables. New clusters, or recently pg_restored clusters could suffer even
more as many large tables may have the same relfrozenxid, which could
result in large numbers of tables requiring an anti-wraparound vacuum all
at once.
Here we aim to reduce the work required by anti-wraparound and aggressive
vacuums in general, by triggering autovacuum when the table has received
enough INSERTs. This is controlled by adding two new GUCs and reloptions;
autovacuum_vacuum_insert_threshold and
autovacuum_vacuum_insert_scale_factor. These work exactly the same as the
existing scale factor and threshold controls, only base themselves off the
number of inserts since the last vacuum, rather than the number of dead
tuples. New controls were added rather than reusing the existing
controls, to allow these new vacuums to be tuned independently and perhaps
even completely disabled altogether, which can be done by setting
autovacuum_vacuum_insert_threshold to -1.
We make no attempt to skip index cleanup operations on these vacuums as
they may trigger for an insert-mostly table which continually doesn't have
enough dead tuples to trigger an autovacuum for the purpose of removing
those dead tuples. If we were to skip cleaning the indexes in this case,
then it is possible for the index(es) to become bloated over time.
There are additional benefits to triggering autovacuums based on inserts,
as tables which never contain enough dead tuples to trigger an autovacuum
are now more likely to receive a vacuum, which can mark more of the table
as "allvisible" and encourage the query planner to make use of Index Only
Scans.
Currently, we still obey vacuum_freeze_min_age when triggering these new
autovacuums based on INSERTs. For large insert-only tables, it may be
beneficial to lower the table's autovacuum_freeze_min_age so that tuples
are eligible to be frozen sooner. Here we've opted not to zero that for
these types of vacuums, since the table may just be insert-mostly and we
may otherwise freeze tuples that are still destined to be updated or
removed in the near future.
There was some debate to what exactly the new scale factor and threshold
should default to. For now, these are set to 0.2 and 1000, respectively.
There may be some motivation to adjust these before the release.
Author: Laurenz Albe, Darafei Praliaskouski
Reviewed-by: Alvaro Herrera, Masahiko Sawada, Chris Travers, Andres Freund, Justin Pryzby
Discussion: https://postgr.es/m/CAC8Q8t%2Bj36G_bLF%3D%2B0iMo6jGNWnLnWb1tujXuJr-%2Bx8ZCCTqoQ%40mail.gmail.com
Buildfarm experience shows what probably should've occurred to me before:
if a cache flush occurs partway through building a generic plan, then
the plansource may have is_valid = false even though the plan is valid.
We need to accept this case, use the generated plan, and then try to
replan the next time. We can't try to replan immediately, because that
would produce an infinite loop in CLOBBER_CACHE_ALWAYS builds; moreover
it's really overkill. (We can assume that the plan is valid, it's just
possibly a bit stale. Note that the pre-existing code behaved this way,
and the non-simple-expression code paths do too.) Conversely, not using
the generated plan would drop us into the not-a-simple-expression code
path, which is bad for performance and would also cause regression-test
failures due to visibly different error-reporting behavior.
Hence, refactor the validity-check functions so that the initial check
and recheck cases can react differently to plansource->is_valid.
This makes their usage a bit simpler, too.
Discussion: https://postgr.es/m/7072.1585332104@sss.pgh.pa.us
For relatively simple expressions (say, "x + 1" or "x > 0"), plpgsql's
management overhead exceeds the cost of evaluating the expression.
This patch substantially improves that situation, providing roughly
2X speedup for such trivial expressions.
First, add infrastructure in the plancache to allow fast re-validation
of cached plans that contain no table access, and hence need no locks.
Teach plpgsql to use this infrastructure for expressions that it's
already deemed "simple" (which in particular will never contain table
references).
The fast path still requires checking that search_path hasn't changed,
so provide a fast path for OverrideSearchPathMatchesCurrent by
counting changes that have occurred to the active search path in the
current session. This is simplistic but seems enough for now, seeing
that PushOverrideSearchPath is not used in any performance-critical
cases.
Second, manage the refcounts on simple expressions' cached plans using
a transaction-lifespan resource owner, so that we only need to take
and release an expression's refcount once per transaction not once per
expression evaluation. The management of this resource owner exactly
parallels the existing management of plpgsql's simple-expression EState.
Add some regression tests covering this area, in particular verifying
that expression caching doesn't break semantics for search_path changes.
Patch by me, but it owes something to previous work by Amit Langote,
who recognized that getting rid of plancache-related overhead would
be a useful thing to do here. Also thanks to Andres Freund for review.
Discussion: https://postgr.es/m/CAFj8pRDRVfLdAxsWeVLzCAbkLFZhW549K+67tpOc-faC8uH8zw@mail.gmail.com
This reverts the parts of commit 17a28b0364
that changed ereport's auxiliary functions from returning dummy integer
values to returning void. It turns out that a minority of compilers
complain (not entirely unreasonably) about constructs such as
(condition) ? errdetail(...) : 0
if errdetail() returns void rather than int. We could update those
call sites to say "(void) 0" perhaps, but the expectation for this
patch set was that ereport callers would not have to change anything.
And this aspect of the patch set was already the most invasive and
least compelling part of it, so let's just drop it.
Per buildfarm.
Discussion: https://postgr.es/m/CA+fd4k6N8EjNvZpM8nme+y+05mz-SM8Z_BgkixzkA34R+ej0Kw@mail.gmail.com
Change all the auxiliary error-reporting routines to return void,
now that we no longer need to pretend they are passing something
useful to errfinish(). While this probably doesn't save anything
significant at the machine-code level, it allows detection of some
additional types of mistakes.
Pass the error location details (__FILE__, __LINE__, PG_FUNCNAME_MACRO)
to errfinish not errstart. This shaves a few cycles off the case where
errstart decides we're not going to emit anything.
Re-implement elog() as a trivial wrapper around ereport(), removing
the separate support infrastructure it used to have. Aside from
getting rid of some now-surplus code, this means that elog() now
really does have exactly the same semantics as ereport(), in particular
that it can skip evaluation work if the message is not to be emitted.
Andres Freund and Tom Lane
Discussion: https://postgr.es/m/CA+fd4k6N8EjNvZpM8nme+y+05mz-SM8Z_BgkixzkA34R+ej0Kw@mail.gmail.com
Now that we require C99, we can depend on __VA_ARGS__ to work, and
revising ereport() to use it has several significant benefits:
* The extra parentheses around the auxiliary function calls are now
optional. Aside from being a bit less ugly, this removes a common
gotcha for new contributors, because in some cases the compiler errors
you got from forgetting them were unintelligible.
* The auxiliary function calls are now evaluated as a comma expression
list rather than as extra arguments to errfinish(). This means that
compilers can be expected to warn about no-op expressions in the list,
allowing detection of several other common mistakes such as forgetting
to add errmsg(...) when converting an elog() call to ereport().
* Unlike the situation with extra function arguments, comma expressions
are guaranteed to be evaluated left-to-right, so this removes platform
dependency in the order of the auxiliary function calls. While that
dependency hasn't caused us big problems in the past, this change does
allow dropping some rather shaky assumptions around errcontext() domain
handling.
There's no intention to make wholesale changes of existing ereport
calls, but as proof-of-concept this patch removes the extra parens
from a couple of calls in postgres.c.
While new code can be written either way, code intended to be
back-patched will need to use extra parens for awhile yet. It seems
worth back-patching this change into v12, so as to reduce the window
where we have to be careful about that by one year. Hence, this patch
is careful to preserve ABI compatibility; a followup HEAD-only patch
will make some additional simplifications.
Andres Freund and Tom Lane
Discussion: https://postgr.es/m/CA+fd4k6N8EjNvZpM8nme+y+05mz-SM8Z_BgkixzkA34R+ej0Kw@mail.gmail.com
headerscheck and cpluspluscheck should skip the recently-added
cmdtaglist.h header, since (like kwlist.h and some other similarly-
designed headers) it's not meant to be included standalone.
evtcache.h was missing an #include to support its usage of Bitmapset.
typecmds.h was missing an #include to support its usage of ParseState.
The first two of these were evidently oversights in commit 2f9661311.
I didn't track down exactly which change broke typecmds.h, but it
must have been some rearrangement in one of its existing inclusions,
because it's referenced ParseState for quite a long time and there
were not complaints from these checking programs before.
Until now, only selected bulk operations (e.g. COPY) did this. If a
given relfilenode received both a WAL-skipping COPY and a WAL-logged
operation (e.g. INSERT), recovery could lose tuples from the COPY. See
src/backend/access/transam/README section "Skipping WAL for New
RelFileNode" for the new coding rules. Maintainers of table access
methods should examine that section.
To maintain data durability, just before commit, we choose between an
fsync of the relfilenode and copying its contents to WAL. A new GUC,
wal_skip_threshold, guides that choice. If this change slows a workload
that creates small, permanent relfilenodes under wal_level=minimal, try
adjusting wal_skip_threshold. Users setting a timeout on COMMIT may
need to adjust that timeout, and log_min_duration_statement analysis
will reflect time consumption moving to COMMIT from commands like COPY.
Internally, this requires a reliable determination of whether
RollbackAndReleaseCurrentSubTransaction() would unlink a relation's
current relfilenode. Introduce rd_firstRelfilenodeSubid. Amend the
specification of rd_createSubid such that the field is zero when a new
rel has an old rd_node. Make relcache.c retain entries for certain
dropped relations until end of transaction.
Back-patch to 9.5 (all supported versions). This introduces a new WAL
record type, XLOG_GIST_ASSIGN_LSN, without bumping XLOG_PAGE_MAGIC. As
always, update standby systems before master systems. This changes
sizeof(RelationData) and sizeof(IndexStmt), breaking binary
compatibility for affected extensions. (The most recent commit to
affect the same class of extensions was
089e4d405d0f3b94c74a2c6a54357a84a681754b.)
Kyotaro Horiguchi, reviewed (in earlier, similar versions) by Robert
Haas. Heikki Linnakangas and Michael Paquier implemented earlier
designs that materially clarified the problem. Reviewed, in earlier
designs, by Andrew Dunstan, Andres Freund, Alvaro Herrera, Tom Lane,
Fujii Masao, and Simon Riggs. Reported by Martijn van Oosterhout.
Discussion: https://postgr.es/m/20150702220524.GA9392@svana.org
Remove an obsolete comment from AtEOXact_cleanup(). Restore formatting
of a comment in struct RelationData, mangled by the pgindent run in
commit 9af4159fce. Back-patch to 9.5 (all
supported versions), because another fix stacks on this.