Commit graph

1643 commits

Author SHA1 Message Date
Alvaro Herrera
731d514ae5
Fix ENABLE/DISABLE TRIGGER to handle recursion correctly
Using ATSimpleRecursion() in ATPrepCmd() to do so as bbb927b4db did is
not correct, because ATPrepCmd() can't distinguish between triggers that
may be cloned and those that may not, so would wrongly try to recurse
for the latter category of triggers.

So this commit restores the code in EnableDisableTrigger() that
86f575948c had added to do the recursion, which would do it only for
triggers that may be cloned, that is, row-level triggers.  This also
changes tablecmds.c such that ATExecCmd() is able to pass the value of
ONLY flag down to EnableDisableTrigger() using its new 'recurse'
parameter.

This also fixes what seems like an oversight of 86f575948c that the
recursion to partition triggers would only occur if EnableDisableTrigger()
had actually changed the trigger.  It is more apt to recurse to inspect
partition triggers even if the parent's trigger didn't need to be
changed: only then can we be certain that all descendants share the same
state afterwards.

Backpatch all the way back to 11, like bbb927b4db.  Care is taken not
to break ABI compatibility (and that no catversion bump is needed.)

Co-authored-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Dmitry Koval <d.koval@postgrespro.ru>
Discussion: https://postgr.es/m/CA+HiwqG-cZT3XzGAnEgZQLoQbyfJApVwOTQaCaas1mhpf+4V5A@mail.gmail.com
2022-08-05 09:47:11 +02:00
David Rowley
3f712ea6dc Fix incorrect comments for Memoize struct
Reported-by: Peter Eisentraut
Discussion: https://postgr.es/m/0635f5aa-4973-8dc2-4e4e-df9fd5778a65@enterprisedb.com
Backpatch-through: 14, where Memoize was added
2022-05-19 17:14:56 +12:00
Tom Lane
da22ef388a Remove inadequate assertion check in CTE inlining.
inline_cte() expected to find exactly as many references to the
target CTE as its cterefcount indicates.  While that should be
accurate for the tree as emitted by the parser, there are some
optimizations that occur upstream of here that could falsify it,
notably removal of unused subquery output expressions.

Trying to make the accounting 100% accurate seems expensive and
doomed to future breakage.  It's not really worth it, because
all this code is protecting is downstream assumptions that every
referenced CTE has a plan.  Let's convert those assertions to
regular test-and-elog just in case there's some actual problem,
and then drop the failing assertion.

Per report from Tomas Vondra (thanks also to Richard Guo for
analysis).  Back-patch to v12 where the faulty code came in.

Discussion: https://postgr.es/m/29196a1e-ed47-c7ca-9be2-b1c636816183@enterprisedb.com
2022-04-21 17:58:52 -04:00
Tom Lane
d228af79d0 Fix index-only scan plans, take 2.
Commit 4ace45677 failed to fix the problem fully, because the
same issue of attempting to fetch a non-returnable index column
can occur when rechecking the indexqual after using a lossy index
operator.  Moreover, it broke EXPLAIN for such indexquals (which
indicates a gap in our test cases :-().

Revert the code changes of 4ace45677 in favor of adding a new field
to struct IndexOnlyScan, containing a version of the indexqual that
can be executed against the index-returned tuple without using any
non-returnable columns.  (The restrictions imposed by check_index_only
guarantee this is possible, although we may have to recompute indexed
expressions.)  Support construction of that during setrefs.c
processing by marking IndexOnlyScan.indextlist entries as resjunk
if they can't be returned, rather than removing them entirely.
(We could alternatively require setrefs.c to look up the IndexOptInfo
again, but abusing resjunk this way seems like a reasonably safe way
to avoid needing to do that.)

This solution isn't great from an API-stability standpoint: if there
are any extensions out there that build IndexOnlyScan structs directly,
they'll be broken in the next minor releases.  However, only a very
invasive extension would be likely to do such a thing.  There's no
change in the Path representation, so typical planner extensions
shouldn't have a problem.

As before, back-patch to all supported branches.

Discussion: https://postgr.es/m/3179992.1641150853@sss.pgh.pa.us
Discussion: https://postgr.es/m/17350-b5bdcf476e5badbb@postgresql.org
2022-01-03 15:42:27 -05:00
Tom Lane
cabea571d1 Fix index-only scan plans when not all index columns can be returned.
If an index has both returnable and non-returnable columns, and one of
the non-returnable columns is an expression using a Var that is in a
returnable column, then a query returning that expression could result
in an index-only scan plan that attempts to read the non-returnable
column, instead of recomputing the expression from the returnable
column as intended.

To fix, redefine the "indextlist" list of an IndexOnlyScan plan node
as containing null Consts in place of any non-returnable columns.
This solves the problem by preventing setrefs.c from falsely matching
to such entries.  The executor is happy since it only cares about the
exposed types of the entries, and ruleutils.c doesn't care because a
correct plan won't reference those entries.  I considered some other
ways to prevent setrefs.c from doing the wrong thing, but this way
seems good since (a) it allows a very localized fix, (b) it makes
the indextlist structure more compact in many cases, and (c) the
indextlist is now a more faithful representation of what the index AM
will actually produce, viz. nulls for any non-returnable columns.

This is easier to hit since we introduced included columns, but it's
possible to construct failing examples without that, as per the
added regression test.  Hence, back-patch to all supported branches.

Per bug #17350 from Louis Jachiet.

Discussion: https://postgr.es/m/17350-b5bdcf476e5badbb@postgresql.org
2022-01-01 16:12:03 -05:00
David Rowley
c2dc7b9e15 Flush Memoize cache when non-key parameters change, take 2
It's possible that a subplan below a Memoize node contains a parameter
from above the Memoize node.  If this parameter changes then cache entries
may become out-dated due to the new parameter value.

Previously Memoize was mistakenly not aware of this.  We fix this here by
flushing the cache whenever a parameter that's not part of the cache
key changes.

Bug: #17213
Reported by: Elvis Pranskevichus
Author: David Rowley
Discussion: https://postgr.es/m/17213-988ed34b225a2862@postgresql.org
Backpatch-through: 14, where Memoize was added
2021-11-24 23:29:56 +13:00
David Rowley
7933bc0d13 Revert "Flush Memoize cache when non-key parameters change"
This reverts commit f94edb06ab.
2021-11-24 15:28:34 +13:00
David Rowley
f94edb06ab Flush Memoize cache when non-key parameters change
It's possible that a subplan below a Memoize node contains a parameter
from above the Memoize node.  If this parameter changes then cache entries
may become out-dated due to the new parameter value.

Previously Memoize was mistakenly not aware of this.  We fix this here by
flushing the cache whenever a parameter that's not part of the cache
key changes.

Bug: #17213
Reported by: Elvis Pranskevichus
Author: David Rowley
Discussion: https://postgr.es/m/17213-988ed34b225a2862@postgresql.org
Backpatch-through: 14, where Memoize was added
2021-11-24 14:57:07 +13:00
David Rowley
6c32c09777 Allow Memoize to operate in binary comparison mode
Memoize would always use the hash equality operator for the cache key
types to determine if the current set of parameters were the same as some
previously cached set.  Certain types such as floating points where -0.0
and +0.0 differ in their binary representation but are classed as equal by
the hash equality operator may cause problems as unless the join uses the
same operator it's possible that whichever join operator is being used
would be able to distinguish the two values.  In which case we may
accidentally return in the incorrect rows out of the cache.

To fix this here we add a binary mode to Memoize to allow it to the
current set of parameters to previously cached values by comparing
bit-by-bit rather than logically using the hash equality operator.  This
binary mode is always used for LATERAL joins and it's used for normal
joins when any of the join operators are not hashable.

Reported-by: Tom Lane
Author: David Rowley
Discussion: https://postgr.es/m/3004308.1632952496@sss.pgh.pa.us
Backpatch-through: 14, where Memoize was added
2021-11-24 10:07:38 +13:00
Tom Lane
08cfa5981e Avoid O(N^2) behavior in SyncPostCheckpoint().
As in commits 6301c3ada and e9d9ba2a4, avoid doing repetitive
list_delete_first() operations, since that would be expensive when
there are many files waiting to be unlinked.  This is a slightly
larger change than in those cases.  We have to keep the list state
valid for calls to AbsorbSyncRequests(), so it's necessary to invent a
"canceled" field instead of immediately deleting PendingUnlinkEntry
entries.  Also, because we might not be able to process all the
entries, we need a new list primitive list_delete_first_n().

list_delete_first_n() is almost list_copy_tail(), but it modifies the
input List instead of making a new copy.  I found a couple of existing
uses of the latter that could profitably use the new function.  (There
might be more, but the other callers look like they probably shouldn't
overwrite the input List.)

As before, back-patch to v13.

Discussion: https://postgr.es/m/CD2F0E7F-9822-45EC-A411-AE56F14DEA9F@amazon.com
2021-11-02 11:31:54 -04:00
Tom Lane
29aa0ce361 Fix planner error with multiple copies of an AlternativeSubPlan.
It's possible for us to copy an AlternativeSubPlan expression node
into multiple places, for example the scan quals of several
partition children.  Then it's possible that we choose a different
one of the alternatives as optimal in each place.  Commit 41efb8340
failed to consider this scenario, so its attempt to remove "unused"
subplans could remove subplans that were still used elsewhere.

Fix by delaying the removal logic until we've examined all the
AlternativeSubPlans in a given query level.  (This does assume that
AlternativeSubPlans couldn't get copied to other query levels, but
for the foreseeable future that's fine; cf qual_is_pushdown_safe.)

Per report from Rajkumar Raghuwanshi.  Back-patch to v14
where the faulty logic came in.

Discussion: https://postgr.es/m/CAKcux6==O3NNZC3bZ2prRYv3cjm3_Zw1GfzmOjEVqYN4jub2+Q@mail.gmail.com
2021-09-14 15:11:21 -04:00
Tom Lane
4649003933 Avoid trying to lock OLD/NEW in a rule with FOR UPDATE.
transformLockingClause neglected to exclude the pseudo-RTEs for
OLD/NEW when processing a rule's query.  This led to odd errors
or even crashes later on.  This bug is very ancient, but it's
not terribly surprising that nobody noticed, since the use-case
for SELECT FOR UPDATE in a non-view rule is somewhere between
thin and non-existent.  Still, crashing is not OK.

Per bug #17151 from Zhiyong Wu.  Thanks to Masahiko Sawada
for analysis of the problem.

Discussion: https://postgr.es/m/17151-c03a3e6e4ec9aadb@postgresql.org
2021-08-19 12:12:35 -04:00
David Rowley
47ca483644 Change the name of the Result Cache node to Memoize
"Result Cache" was never a great name for this node, but nobody managed
to come up with another name that anyone liked enough.  That was until
David Johnston mentioned "Node Memoization", which Tom Lane revised to
just "Memoize".  People seem to like "Memoize", so let's do the rename.

Reviewed-by: Justin Pryzby
Discussion: https://postgr.es/m/20210708165145.GG1176@momjian.us
Backpatch-through: 14, where Result Cache was introduced
2021-07-14 12:45:00 +12:00
Andrew Dunstan
e1c1c30f63
Pre branch pgindent / pgperltidy run
Along the way make a slight adjustment to
src/include/utils/queryjumble.h to avoid an unused typedef.
2021-06-28 11:05:54 -04:00
Tomas Vondra
b676ac443b Optimize creation of slots for FDW bulk inserts
Commit b663a41363 introduced bulk inserts for FDW, but the handling of
tuple slots turned out to be problematic for two reasons. Firstly, the
slots were re-created for each individual batch. Secondly, all slots
referenced the same tuple descriptor - with reasonably small batches
this is not an issue, but with large batches this triggers O(N^2)
behavior in the resource owner code.

These two issues work against each other - to reduce the number of times
a slot has to be created/dropped, larger batches are needed. However,
the larger the batch, the more expensive the resource owner gets. For
practical batch sizes (100 - 1000) this would not be a big problem, as
the benefits (latency savings) greatly exceed the resource owner costs.
But for extremely large batches it might be much worse, possibly even
losing with non-batching mode.

Fixed by initializing tuple slots only once (and reusing them across
batches) and by using a new tuple descriptor copy for each slot.

Discussion: https://postgr.es/m/ebbbcc7d-4286-8c28-0272-61b4753af761%40enterprisedb.com
2021-06-11 20:23:33 +02:00
Noah Misch
13a1ca160d Change position of field "transformed" in struct CreateStatsStmt.
Resolve the disagreement with nodes/*funcs.c field order in favor of the
latter, which is better-aligned with the IndexStmt field order.  This
field is new in v14.

Discussion: https://postgr.es/m/20210611045546.GA573364@rfd.leadboat.com
2021-06-10 21:56:14 -07:00
Tom Lane
e56bce5d43 Reconsider the handling of procedure OUT parameters.
Commit 2453ea142 redefined pg_proc.proargtypes to include the types of
OUT parameters, for procedures only.  While that had some advantages
for implementing the SQL-spec behavior of DROP PROCEDURE, it was pretty
disastrous from a number of other perspectives.  Notably, since the
primary key of pg_proc is name + proargtypes, this made it possible to
have multiple procedures with identical names + input arguments and
differing output argument types.  That would make it impossible to call
any one of the procedures by writing just NULL (or "?", or any other
data-type-free notation) for the output argument(s).  The change also
seems likely to cause grave confusion for client applications that
examine pg_proc and expect the traditional definition of proargtypes.

Hence, revert the definition of proargtypes to what it was, and
undo a number of complications that had been added to support that.

To support the SQL-spec behavior of DROP PROCEDURE, when there are
no argmode markers in the command's parameter list, we perform the
lookup both ways (that is, matching against both proargtypes and
proallargtypes), succeeding if we get just one unique match.
In principle this could result in ambiguous-function failures
that would not happen when using only one of the two rules.
However, overloading of procedure names is thought to be a pretty
rare usage, so this shouldn't cause many problems in practice.
Postgres-specific code such as pg_dump can defend against any
possibility of such failures by being careful to specify argmodes
for all procedure arguments.

This also fixes a few other bugs in the area of CALL statements
with named parameters, and improves the documentation a little.

catversion bump forced because the representation of procedures
with OUT arguments changes.

Discussion: https://postgr.es/m/3742981.1621533210@sss.pgh.pa.us
2021-06-10 17:11:36 -04:00
Amit Kapila
be57f21650 Remove two_phase variable from CreateReplicationSlotCmd struct.
Commit 19890a064e added the option to enable two_phase commits via
pg_create_logical_replication_slot but didn't extend the support of same
in replication protocol. However, by mistake, it added the two_phase
variable in CreateReplicationSlotCmd which is required only when we extend
the replication protocol.

Reported-by: Jeff Davis
Author: Ajin Cherian
Reviewed-by: Amit Kapila
Discussion: https://postgr.es/m/64b9f783c6e125f18f88fbc0c0234e34e71d8639.camel@j-davis.com
2021-06-07 09:32:06 +05:30
Tom Lane
79c50ca578 Update plannodes.h's comments about PlanRowMark.
The reference here to different physical column numbers in inherited
UPDATE/DELETE plans is obsolete as of 86dc90056; remove it.  Also
rework the text about inheritance cases to make it clearer.
2021-06-02 11:52:35 -04:00
Tom Lane
def5b065ff Initial pgindent and pgperltidy run for v14.
Also "make reformat-dat-files".

The only change worthy of note is that pgindent messed up the formatting
of launcher.c's struct LogicalRepWorkerId, which led me to notice that
that struct wasn't used at all anymore, so I just took it out.
2021-05-12 13:14:10 -04:00
Etsuro Fujita
a363bc6da9 Fix EXPLAIN ANALYZE for async-capable nodes.
EXPLAIN ANALYZE for an async-capable ForeignScan node associated with
postgres_fdw is done just by using instrumentation for ExecProcNode()
called from the node's callbacks, causing the following problems:

1) If the remote table to scan is empty, the node is incorrectly
   considered as "never executed" by the command even if the node is
   executed, as ExecProcNode() isn't called from the node's callbacks at
   all in that case.
2) The command fails to collect timings for things other than
   ExecProcNode() done in the node, such as creating a cursor for the
   node's remote query.

To fix these problems, add instrumentation for async-capable nodes, and
modify postgres_fdw accordingly.

My oversight in commit 27e1f1456.

While at it, update a comment for the AsyncRequest struct in execnodes.h
and the documentation for the ForeignAsyncRequest API in fdwhandler.sgml
to match the code in ExecAsyncAppendResponse() in nodeAppend.c, and fix
typos in comments in nodeAppend.c.

Per report from Andrey Lepikhov, though I didn't use his patch.

Reviewed-by: Andrey Lepikhov
Discussion: https://postgr.es/m/2eb662bb-105d-fc20-7412-2f027cc3ca72%40postgrespro.ru
2021-05-12 14:00:00 +09:00
Tom Lane
049e1e2edb Fix mishandling of resjunk columns in ON CONFLICT ... UPDATE tlists.
It's unusual to have any resjunk columns in an ON CONFLICT ... UPDATE
list, but it can happen when MULTIEXPR_SUBLINK SubPlans are present.
If it happens, the ON CONFLICT UPDATE code path would end up storing
tuples that include the values of the extra resjunk columns.  That's
fairly harmless in the short run, but if new columns are added to
the table then the values would become accessible, possibly leading
to malfunctions if they don't match the datatypes of the new columns.

This had escaped notice through a confluence of missing sanity checks,
including

* There's no cross-check that a tuple presented to heap_insert or
heap_update matches the table rowtype.  While it's difficult to
check that fully at reasonable cost, we can easily add assertions
that there aren't too many columns.

* The output-column-assignment cases in execExprInterp.c lacked
any sanity checks on the output column numbers, which seems like
an oversight considering there are plenty of assertion checks on
input column numbers.  Add assertions there too.

* We failed to apply nodeModifyTable's ExecCheckPlanOutput() to
the ON CONFLICT UPDATE tlist.  That wouldn't have caught this
specific error, since that function is chartered to ignore resjunk
columns; but it sure seems like a bad omission now that we've seen
this bug.

In HEAD, the right way to fix this is to make the processing of
ON CONFLICT UPDATE tlists work the same as regular UPDATE tlists
now do, that is don't add "SET x = x" entries, and use
ExecBuildUpdateProjection to evaluate the tlist and combine it with
old values of the not-set columns.  This adds a little complication
to ExecBuildUpdateProjection, but allows removal of a comparable
amount of now-dead code from the planner.

In the back branches, the most expedient solution seems to be to
(a) use an output slot for the ON CONFLICT UPDATE projection that
actually matches the target table, and then (b) invent a variant of
ExecBuildProjectionInfo that can be told to not store values resulting
from resjunk columns, so it doesn't try to store into nonexistent
columns of the output slot.  (We can't simply ignore the resjunk columns
altogether; they have to be evaluated for MULTIEXPR_SUBLINK to work.)
This works back to v10.  In 9.6, projections work much differently and
we can't cheaply give them such an option.  The 9.6 version of this
patch works by inserting a JunkFilter when it's necessary to get rid
of resjunk columns.

In addition, v11 and up have the reverse problem when trying to
perform ON CONFLICT UPDATE on a partitioned table.  Through a
further oversight, adjust_partition_tlist() discarded resjunk columns
when re-ordering the ON CONFLICT UPDATE tlist to match a partition.
This accidentally prevented the storing-bogus-tuples problem, but
at the cost that MULTIEXPR_SUBLINK cases didn't work, typically
crashing if more than one row has to be updated.  Fix by preserving
resjunk columns in that routine.  (I failed to resist the temptation
to add more assertions there too, and to do some minor code
beautification.)

Per report from Andres Freund.  Back-patch to all supported branches.

Security: CVE-2021-32028
2021-05-10 11:02:29 -04:00
Thomas Munro
ec48314708 Revert per-index collation version tracking feature.
Design problems were discovered in the handling of composite types and
record types that would cause some relevant versions not to be recorded.
Misgivings were also expressed about the use of the pg_depend catalog
for this purpose.  We're out of time for this release so we'll revert
and try again.

Commits reverted:

1bf946bd: Doc: Document known problem with Windows collation versions.
cf002008: Remove no-longer-relevant test case.
ef387bed: Fix bogus collation-version-recording logic.
0fb0a050: Hide internal error for pg_collation_actual_version(<bad OID>).
ff942057: Suppress "warning: variable 'collcollate' set but not used".
d50e3b1f: Fix assertion in collation version lookup.
f24b1569: Rethink extraction of collation dependencies.
257836a7: Track collation versions for indexes.
cd6f479e: Add pg_depend.refobjversion.
7d1297df: Remove pg_collation.collversion.

Discussion: https://postgr.es/m/CA%2BhUKGLhj5t1fcjqAu8iD9B3ixJtsTNqyCCD4V0aTO9kAKAjjA%40mail.gmail.com
2021-05-07 21:10:11 +12:00
Fujii Masao
3f20d5f370 Reorder COMPRESSION option in gram.y and parsenodes.h into alphabetical order.
Commit bbe0a81db6 introduced "INCLUDING COMPRESSION" option
in CREATE TABLE command, but previously TableLikeOption in gram.y and
parsenodes.h didn't classify this new option in alphabetical order
with the rest.

Author: Fujii Masao
Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/YHerAixOhfR1ryXa@paquier.xyz
2021-04-23 19:10:24 +09:00
David Rowley
50e17ad281 Speedup ScalarArrayOpExpr evaluation
ScalarArrayOpExprs with "useOr=true" and a set of Consts on the righthand
side have traditionally been evaluated by using a linear search over the
array.  When these arrays contain large numbers of elements then this
linear search could become a significant part of execution time.

Here we add a new method of evaluating ScalarArrayOpExpr expressions to
allow them to be evaluated by first building a hash table containing each
element, then on subsequent evaluations, we just probe that hash table to
determine if there is a match.

The planner is in charge of determining when this optimization is possible
and it enables it by setting hashfuncid in the ScalarArrayOpExpr.  The
executor will only perform the hash table evaluation when the hashfuncid
is set.

This means that not all cases are optimized. For example CHECK constraints
containing an IN clause won't go through the planner, so won't get the
hashfuncid set.  We could maybe do something about that at some later
date.  The reason we're not doing it now is from fear that we may slow
down cases where the expression is evaluated only once.  Those cases can
be common, for example, a single row INSERT to a table with a CHECK
constraint containing an IN clause.

In the planner, we enable this when there are suitable hash functions for
the ScalarArrayOpExpr's operator and only when there is at least
MIN_ARRAY_SIZE_FOR_HASHED_SAOP elements in the array.  The threshold is
currently set to 9.

Author: James Coleman, David Rowley
Reviewed-by: David Rowley, Tomas Vondra, Heikki Linnakangas
Discussion: https://postgr.es/m/CAAaqYe8x62+=wn0zvNKCj55tPpg-JBHzhZFFc6ANovdqFw7-dA@mail.gmail.com
2021-04-08 23:51:22 +12:00
David Rowley
5ac9c43073 Cleanup partition pruning step generation
There was some code in gen_prune_steps_from_opexps that needlessly
checked a list was not empty when it clearly had to contain at least one
item. This prompted a further cleanup operation in partprune.c.

Additionally, the previous code could end up adding additional needless
INTERSECT steps. However, those do not appear to be able to cause any
misbehavior.

gen_prune_steps_from_opexps is now no longer in charge of generating
combine pruning steps. Instead, gen_partprune_steps_internal, which
already does some combine step creation has been given the sole
responsibility of generating all combine steps. This means that when
we recursively call gen_partprune_steps_internal, since it always now adds
a combine step when it produces multiple steps, we can just pay attention
to the final step returned.

In passing, do quite a bit of work on the comments to try to more clearly
explain the role of both gen_partprune_steps_internal and
gen_prune_steps_from_opexps. This is fairly complex code so some extra
effort to give any new readers an overview of how things work seems like
a good idea.

Author: Amit Langote
Reported-by: Andy Fan
Reviewed-by: Kyotaro Horiguchi, Andy Fan, Ryan Lambert, David Rowley
Discussion: https://postgr.es/m/CAKU4AWqWoVii+bRTeBQmeVW+PznkdO8DfbwqNsu9Gj4ubt9A6w@mail.gmail.com
2021-04-08 22:35:48 +12:00
Peter Eisentraut
e717a9a18b SQL-standard function body
This adds support for writing CREATE FUNCTION and CREATE PROCEDURE
statements for language SQL with a function body that conforms to the
SQL standard and is portable to other implementations.

Instead of the PostgreSQL-specific AS $$ string literal $$ syntax,
this allows writing out the SQL statements making up the body
unquoted, either as a single statement:

    CREATE FUNCTION add(a integer, b integer) RETURNS integer
        LANGUAGE SQL
        RETURN a + b;

or as a block

    CREATE PROCEDURE insert_data(a integer, b integer)
    LANGUAGE SQL
    BEGIN ATOMIC
      INSERT INTO tbl VALUES (a);
      INSERT INTO tbl VALUES (b);
    END;

The function body is parsed at function definition time and stored as
expression nodes in a new pg_proc column prosqlbody.  So at run time,
no further parsing is required.

However, this form does not support polymorphic arguments, because
there is no more parse analysis done at call time.

Dependencies between the function and the objects it uses are fully
tracked.

A new RETURN statement is introduced.  This can only be used inside
function bodies.  Internally, it is treated much like a SELECT
statement.

psql needs some new intelligence to keep track of function body
boundaries so that it doesn't send off statements when it sees
semicolons that are inside a function body.

Tested-by: Jaime Casanova <jcasanov@systemguards.com.ec>
Reviewed-by: Julien Rouhaud <rjuju123@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/1c11f1eb-f00c-43b7-799d-2d44132c02d7@2ndquadrant.com
2021-04-07 21:47:55 +02:00
Peter Eisentraut
dd13ad9d39 Fix use of cursor sensitivity terminology
Documentation and comments in code and tests have been using the terms
sensitive/insensitive cursor incorrectly relative to the SQL standard.
(Cursor sensitivity is only relevant for changes made in the same
transaction as the cursor, not for concurrent changes in other
sessions.)  Moreover, some of the behavior of PostgreSQL is incorrect
according to the SQL standard, confusing the issue further.  (WHERE
CURRENT OF changes are not visible in insensitive cursors, but they
should be.)

This change corrects the terminology and removes the claim that
sensitive cursors are supported.  It also adds a test case that checks
the insensitive behavior in a "correct" way, using a change command
not using WHERE CURRENT OF.  Finally, it adds the ASENSITIVE cursor
option to select the default asensitive behavior, per SQL standard.

There are no changes to cursor behavior in this patch.

Discussion: https://www.postgresql.org/message-id/flat/96ee8b30-9889-9e1b-b053-90e10c050e85%40enterprisedb.com
2021-04-07 08:05:55 +02:00
Tom Lane
a1115fa078 Postpone some more stuff out of ExecInitModifyTable.
Delay creation of the projections for INSERT and UPDATE tuples
until they're needed.  This saves a pretty fair amount of work
when only some of the partitions are actually touched.

The logic associated with identifying junk columns in UPDATE/DELETE
is moved to another loop, allowing removal of one loop over the
target relations; but it didn't actually change at all.

Extracted from a larger patch, which seemed to me to be too messy
to push in one commit.

Amit Langote, reviewed at different times by Heikki Linnakangas and
myself

Discussion: https://postgr.es/m/CA+HiwqG7ZruBmmih3wPsBZ4s0H2EhywrnXEduckY5Hr3fWzPWA@mail.gmail.com
2021-04-06 18:13:17 -04:00
Tom Lane
c5b7ba4e67 Postpone some stuff out of ExecInitModifyTable.
Arrange to do some things on-demand, rather than immediately during
executor startup, because there's a fair chance of never having to do
them at all:

* Don't open result relations' indexes until needed.

* Don't initialize partition tuple routing, nor the child-to-root
tuple conversion map, until needed.

This wins in UPDATEs on partitioned tables when only some of the
partitions will actually receive updates; with larger partition
counts the savings is quite noticeable.  Also, we can remove some
sketchy heuristics in ExecInitModifyTable about whether to set up
tuple routing.

Also, remove execPartition.c's private hash table tracking which
partitions were already opened by the ModifyTable node.  Instead
use the hash added to ModifyTable itself by commit 86dc90056.

To allow lazy computation of the conversion maps, we now set
ri_RootResultRelInfo in all child ResultRelInfos.  We formerly set it
only in some, not terribly well-defined, cases.  This has user-visible
side effects in that now more error messages refer to the root
relation instead of some partition (and provide error data in the
root's column order, too).  It looks to me like this is a strict
improvement in consistency, so I don't have a problem with the
output changes visible in this commit.

Extracted from a larger patch, which seemed to me to be too messy
to push in one commit.

Amit Langote, reviewed at different times by Heikki Linnakangas and
myself

Discussion: https://postgr.es/m/CA+HiwqG7ZruBmmih3wPsBZ4s0H2EhywrnXEduckY5Hr3fWzPWA@mail.gmail.com
2021-04-06 15:57:11 -04:00
Peter Eisentraut
82ed7748b7 ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION
At present, if we want to update publications in a subscription, we
can use SET PUBLICATION.  However, it requires supplying all
publications that exists and the new publications.  If we want to add
new publications, it's inconvenient.  The new syntax only supplies the
new publications.  When the refresh is true, it only refreshes the new
publications.

Author: Japin Li <japinli@hotmail.com>
Author: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/MEYP282MB166939D0D6C480B7FBE7EFFBB6BC0@MEYP282MB1669.AUSP282.PROD.OUTLOOK.COM
2021-04-06 11:49:51 +02:00
Fujii Masao
43620e3286 Add function to log the memory contexts of specified backend process.
Commit 3e98c0bafb added pg_backend_memory_contexts view to display
the memory contexts of the backend process. However its target process
is limited to the backend that is accessing to the view. So this is
not so convenient when investigating the local memory bloat of other
backend process. To improve this situation, this commit adds
pg_log_backend_memory_contexts() function that requests to log
the memory contexts of the specified backend process.

This information can be also collected by calling
MemoryContextStats(TopMemoryContext) via a debugger. But
this technique cannot be used in some environments because no debugger
is available there. So, pg_log_backend_memory_contexts() allows us to
see the memory contexts of specified backend more easily.

Only superusers are allowed to request to log the memory contexts
because allowing any users to issue this request at an unbounded rate
would cause lots of log messages and which can lead to denial of service.

On receipt of the request, at the next CHECK_FOR_INTERRUPTS(),
the target backend logs its memory contexts at LOG_SERVER_ONLY level,
so that these memory contexts will appear in the server log but not
be sent to the client. It logs one message per memory context.
Because if it buffers all memory contexts into StringInfo to log them
as one message, which may require the buffer to be enlarged very much
and lead to OOM error since there can be a large number of memory
contexts in a backend.

When a backend process is consuming huge memory, logging all its
memory contexts might overrun available disk space. To prevent this,
now this patch limits the number of child contexts to log per parent
to 100. As with MemoryContextStats(), it supposes that practical cases
where the log gets long will typically be huge numbers of siblings
under the same parent context; while the additional debugging value
from seeing details about individual siblings beyond 100 will not be large.

There was another proposed patch to add the function to return
the memory contexts of specified backend as the result sets,
instead of logging them, in the discussion. However that patch is
not included in this commit because it had several issues to address.

Thanks to Tatsuhito Kasahara, Andres Freund, Tom Lane, Tomas Vondra,
Michael Paquier, Kyotaro Horiguchi and Zhihong Yu for the discussion.

Bump catalog version.

Author: Atsushi Torikoshi
Reviewed-by: Kyotaro Horiguchi, Zhihong Yu, Fujii Masao
Discussion: https://postgr.es/m/0271f440ac77f2a4180e0e56ebd944d1@oss.nttdata.com
2021-04-06 13:44:15 +09:00
Peter Eisentraut
a63dd8afe2 Renumber cursor option flags
Move the planner-control flags up so that there is more room for parse
options.  Some pending patches need some room there, so do this
renumbering separately so that there is less potential for conflicts.
2021-04-05 09:10:27 +02:00
David Rowley
9eacee2e62 Add Result Cache executor node (take 2)
Here we add a new executor node type named "Result Cache".  The planner
can include this node type in the plan to have the executor cache the
results from the inner side of parameterized nested loop joins.  This
allows caching of tuples for sets of parameters so that in the event that
the node sees the same parameter values again, it can just return the
cached tuples instead of rescanning the inner side of the join all over
again.  Internally, result cache uses a hash table in order to quickly
find tuples that have been previously cached.

For certain data sets, this can significantly improve the performance of
joins.  The best cases for using this new node type are for join problems
where a large portion of the tuples from the inner side of the join have
no join partner on the outer side of the join.  In such cases, hash join
would have to hash values that are never looked up, thus bloating the hash
table and possibly causing it to multi-batch.  Merge joins would have to
skip over all of the unmatched rows.  If we use a nested loop join with a
result cache, then we only cache tuples that have at least one join
partner on the outer side of the join.  The benefits of using a
parameterized nested loop with a result cache increase when there are
fewer distinct values being looked up and the number of lookups of each
value is large.  Also, hash probes to lookup the cache can be much faster
than the hash probe in a hash join as it's common that the result cache's
hash table is much smaller than the hash join's due to result cache only
caching useful tuples rather than all tuples from the inner side of the
join.  This variation in hash probe performance is more significant when
the hash join's hash table no longer fits into the CPU's L3 cache, but the
result cache's hash table does.  The apparent "random" access of hash
buckets with each hash probe can cause a poor L3 cache hit ratio for large
hash tables.  Smaller hash tables generally perform better.

The hash table used for the cache limits itself to not exceeding work_mem
* hash_mem_multiplier in size.  We maintain a dlist of keys for this cache
and when we're adding new tuples and realize we've exceeded the memory
budget, we evict cache entries starting with the least recently used ones
until we have enough memory to add the new tuples to the cache.

For parameterized nested loop joins, we now consider using one of these
result cache nodes in between the nested loop node and its inner node.  We
determine when this might be useful based on cost, which is primarily
driven off of what the expected cache hit ratio will be.  Estimating the
cache hit ratio relies on having good distinct estimates on the nested
loop's parameters.

For now, the planner will only consider using a result cache for
parameterized nested loop joins.  This works for both normal joins and
also for LATERAL type joins to subqueries.  It is possible to use this new
node for other uses in the future.  For example, to cache results from
correlated subqueries.  However, that's not done here due to some
difficulties obtaining a distinct estimation on the outer plan to
calculate the estimated cache hit ratio.  Currently we plan the inner plan
before planning the outer plan so there is no good way to know if a result
cache would be useful or not since we can't estimate the number of times
the subplan will be called until the outer plan is generated.

The functionality being added here is newly introducing a dependency on
the return value of estimate_num_groups() during the join search.
Previously, during the join search, we only ever needed to perform
selectivity estimations.  With this commit, we need to use
estimate_num_groups() in order to estimate what the hit ratio on the
result cache will be.   In simple terms, if we expect 10 distinct values
and we expect 1000 outer rows, then we'll estimate the hit ratio to be
99%.  Since cache hits are very cheap compared to scanning the underlying
nodes on the inner side of the nested loop join, then this will
significantly reduce the planner's cost for the join.   However, it's
fairly easy to see here that things will go bad when estimate_num_groups()
incorrectly returns a value that's significantly lower than the actual
number of distinct values.  If this happens then that may cause us to make
use of a nested loop join with a result cache instead of some other join
type, such as a merge or hash join.  Our distinct estimations have been
known to be a source of trouble in the past, so the extra reliance on them
here could cause the planner to choose slower plans than it did previous
to having this feature.  Distinct estimations are also fairly hard to
estimate accurately when several tables have been joined already or when a
WHERE clause filters out a set of values that are correlated to the
expressions we're estimating the number of distinct value for.

For now, the costing we perform during query planning for result caches
does put quite a bit of faith in the distinct estimations being accurate.
When these are accurate then we should generally see faster execution
times for plans containing a result cache.  However, in the real world, we
may find that we need to either change the costings to put less trust in
the distinct estimations being accurate or perhaps even disable this
feature by default.  There's always an element of risk when we teach the
query planner to do new tricks that it decides to use that new trick at
the wrong time and causes a regression.  Users may opt to get the old
behavior by turning the feature off using the enable_resultcache GUC.
Currently, this is enabled by default.  It remains to be seen if we'll
maintain that setting for the release.

Additionally, the name "Result Cache" is the best name I could think of
for this new node at the time I started writing the patch.  Nobody seems
to strongly dislike the name. A few people did suggest other names but no
other name seemed to dominate in the brief discussion that there was about
names. Let's allow the beta period to see if the current name pleases
enough people.  If there's some consensus on a better name, then we can
change it before the release.  Please see the 2nd discussion link below
for the discussion on the "Result Cache" name.

Author: David Rowley
Reviewed-by: Andy Fan, Justin Pryzby, Zhihong Yu, Hou Zhijie
Tested-By: Konstantin Knizhnik
Discussion: https://postgr.es/m/CAApHDvrPcQyQdWERGYWx8J%2B2DLUNgXu%2BfOSbQ1UscxrunyXyrQ%40mail.gmail.com
Discussion: https://postgr.es/m/CAApHDvq=yQXr5kqhRviT2RhNKwToaWr9JAN5t+5_PzhuRJ3wvg@mail.gmail.com
2021-04-02 14:10:56 +13:00
David Rowley
28b3e3905c Revert b6002a796
This removes "Add Result Cache executor node".  It seems that something
weird is going on with the tracking of cache hits and misses as
highlighted by many buildfarm animals.  It's not yet clear what the
problem is as other parts of the plan indicate that the cache did work
correctly, it's just the hits and misses that were being reported as 0.

This is especially a bad time to have the buildfarm so broken, so
reverting before too many more animals go red.

Discussion: https://postgr.es/m/CAApHDvq_hydhfovm4=izgWs+C5HqEeRScjMbOgbpC-jRAeK3Yw@mail.gmail.com
2021-04-01 13:33:23 +13:00
David Rowley
b6002a796d Add Result Cache executor node
Here we add a new executor node type named "Result Cache".  The planner
can include this node type in the plan to have the executor cache the
results from the inner side of parameterized nested loop joins.  This
allows caching of tuples for sets of parameters so that in the event that
the node sees the same parameter values again, it can just return the
cached tuples instead of rescanning the inner side of the join all over
again.  Internally, result cache uses a hash table in order to quickly
find tuples that have been previously cached.

For certain data sets, this can significantly improve the performance of
joins.  The best cases for using this new node type are for join problems
where a large portion of the tuples from the inner side of the join have
no join partner on the outer side of the join.  In such cases, hash join
would have to hash values that are never looked up, thus bloating the hash
table and possibly causing it to multi-batch.  Merge joins would have to
skip over all of the unmatched rows.  If we use a nested loop join with a
result cache, then we only cache tuples that have at least one join
partner on the outer side of the join.  The benefits of using a
parameterized nested loop with a result cache increase when there are
fewer distinct values being looked up and the number of lookups of each
value is large.  Also, hash probes to lookup the cache can be much faster
than the hash probe in a hash join as it's common that the result cache's
hash table is much smaller than the hash join's due to result cache only
caching useful tuples rather than all tuples from the inner side of the
join.  This variation in hash probe performance is more significant when
the hash join's hash table no longer fits into the CPU's L3 cache, but the
result cache's hash table does.  The apparent "random" access of hash
buckets with each hash probe can cause a poor L3 cache hit ratio for large
hash tables.  Smaller hash tables generally perform better.

The hash table used for the cache limits itself to not exceeding work_mem
* hash_mem_multiplier in size.  We maintain a dlist of keys for this cache
and when we're adding new tuples and realize we've exceeded the memory
budget, we evict cache entries starting with the least recently used ones
until we have enough memory to add the new tuples to the cache.

For parameterized nested loop joins, we now consider using one of these
result cache nodes in between the nested loop node and its inner node.  We
determine when this might be useful based on cost, which is primarily
driven off of what the expected cache hit ratio will be.  Estimating the
cache hit ratio relies on having good distinct estimates on the nested
loop's parameters.

For now, the planner will only consider using a result cache for
parameterized nested loop joins.  This works for both normal joins and
also for LATERAL type joins to subqueries.  It is possible to use this new
node for other uses in the future.  For example, to cache results from
correlated subqueries.  However, that's not done here due to some
difficulties obtaining a distinct estimation on the outer plan to
calculate the estimated cache hit ratio.  Currently we plan the inner plan
before planning the outer plan so there is no good way to know if a result
cache would be useful or not since we can't estimate the number of times
the subplan will be called until the outer plan is generated.

The functionality being added here is newly introducing a dependency on
the return value of estimate_num_groups() during the join search.
Previously, during the join search, we only ever needed to perform
selectivity estimations.  With this commit, we need to use
estimate_num_groups() in order to estimate what the hit ratio on the
result cache will be.   In simple terms, if we expect 10 distinct values
and we expect 1000 outer rows, then we'll estimate the hit ratio to be
99%.  Since cache hits are very cheap compared to scanning the underlying
nodes on the inner side of the nested loop join, then this will
significantly reduce the planner's cost for the join.   However, it's
fairly easy to see here that things will go bad when estimate_num_groups()
incorrectly returns a value that's significantly lower than the actual
number of distinct values.  If this happens then that may cause us to make
use of a nested loop join with a result cache instead of some other join
type, such as a merge or hash join.  Our distinct estimations have been
known to be a source of trouble in the past, so the extra reliance on them
here could cause the planner to choose slower plans than it did previous
to having this feature.  Distinct estimations are also fairly hard to
estimate accurately when several tables have been joined already or when a
WHERE clause filters out a set of values that are correlated to the
expressions we're estimating the number of distinct value for.

For now, the costing we perform during query planning for result caches
does put quite a bit of faith in the distinct estimations being accurate.
When these are accurate then we should generally see faster execution
times for plans containing a result cache.  However, in the real world, we
may find that we need to either change the costings to put less trust in
the distinct estimations being accurate or perhaps even disable this
feature by default.  There's always an element of risk when we teach the
query planner to do new tricks that it decides to use that new trick at
the wrong time and causes a regression.  Users may opt to get the old
behavior by turning the feature off using the enable_resultcache GUC.
Currently, this is enabled by default.  It remains to be seen if we'll
maintain that setting for the release.

Additionally, the name "Result Cache" is the best name I could think of
for this new node at the time I started writing the patch.  Nobody seems
to strongly dislike the name. A few people did suggest other names but no
other name seemed to dominate in the brief discussion that there was about
names. Let's allow the beta period to see if the current name pleases
enough people.  If there's some consensus on a better name, then we can
change it before the release.  Please see the 2nd discussion link below
for the discussion on the "Result Cache" name.

Author: David Rowley
Reviewed-by: Andy Fan, Justin Pryzby, Zhihong Yu
Tested-By: Konstantin Knizhnik
Discussion: https://postgr.es/m/CAApHDvrPcQyQdWERGYWx8J%2B2DLUNgXu%2BfOSbQ1UscxrunyXyrQ%40mail.gmail.com
Discussion: https://postgr.es/m/CAApHDvq=yQXr5kqhRviT2RhNKwToaWr9JAN5t+5_PzhuRJ3wvg@mail.gmail.com
2021-04-01 12:32:22 +13:00
Tom Lane
86dc90056d Rework planning and execution of UPDATE and DELETE.
This patch makes two closely related sets of changes:

1. For UPDATE, the subplan of the ModifyTable node now only delivers
the new values of the changed columns (i.e., the expressions computed
in the query's SET clause) plus row identity information such as CTID.
ModifyTable must re-fetch the original tuple to merge in the old
values of any unchanged columns.  The core advantage of this is that
the changed columns are uniform across all tables of an inherited or
partitioned target relation, whereas the other columns might not be.
A secondary advantage, when the UPDATE involves joins, is that less
data needs to pass through the plan tree.  The disadvantage of course
is an extra fetch of each tuple to be updated.  However, that seems to
be very nearly free in context; even worst-case tests don't show it to
add more than a couple percent to the total query cost.  At some point
it might be interesting to combine the re-fetch with the tuple access
that ModifyTable must do anyway to mark the old tuple dead; but that
would require a good deal of refactoring and it seems it wouldn't buy
all that much, so this patch doesn't attempt it.

2. For inherited UPDATE/DELETE, instead of generating a separate
subplan for each target relation, we now generate a single subplan
that is just exactly like a SELECT's plan, then stick ModifyTable
on top of that.  To let ModifyTable know which target relation a
given incoming row refers to, a tableoid junk column is added to
the row identity information.  This gets rid of the horrid hack
that was inheritance_planner(), eliminating O(N^2) planning cost
and memory consumption in cases where there were many unprunable
target relations.

Point 2 of course requires point 1, so that there is a uniform
definition of the non-junk columns to be returned by the subplan.
We can't insist on uniform definition of the row identity junk
columns however, if we want to keep the ability to have both
plain and foreign tables in a partitioning hierarchy.  Since
it wouldn't scale very far to have every child table have its
own row identity column, this patch includes provisions to merge
similar row identity columns into one column of the subplan result.
In particular, we can merge the whole-row Vars typically used as
row identity by FDWs into one column by pretending they are type
RECORD.  (It's still okay for the actual composite Datums to be
labeled with the table's rowtype OID, though.)

There is more that can be done to file down residual inefficiencies
in this patch, but it seems to be committable now.

FDW authors should note several API changes:

* The argument list for AddForeignUpdateTargets() has changed, and so
has the method it must use for adding junk columns to the query.  Call
add_row_identity_var() instead of manipulating the parse tree directly.
You might want to reconsider exactly what you're adding, too.

* PlanDirectModify() must now work a little harder to find the
ForeignScan plan node; if the foreign table is part of a partitioning
hierarchy then the ForeignScan might not be the direct child of
ModifyTable.  See postgres_fdw for sample code.

* To check whether a relation is a target relation, it's no
longer sufficient to compare its relid to root->parse->resultRelation.
Instead, check it against all_result_relids or leaf_result_relids,
as appropriate.

Amit Langote and Tom Lane

Discussion: https://postgr.es/m/CA+HiwqHpHdqdDn48yCEhynnniahH78rwcrv1rEX65-fsZGBOLQ@mail.gmail.com
2021-03-31 11:52:37 -04:00
Peter Eisentraut
055fee7eb4 Allow an alias to be attached to a JOIN ... USING
This allows something like

    SELECT ... FROM t1 JOIN t2 USING (a, b, c) AS x

where x has the columns a, b, c and unlike a regular alias it does not
hide the range variables of the tables being joined t1 and t2.

Per SQL:2016 feature F404 "Range variable for common column names".

Reviewed-by: Vik Fearing <vik.fearing@2ndquadrant.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/454638cf-d563-ab76-a585-2564428062af@2ndquadrant.com
2021-03-31 17:10:50 +02:00
Etsuro Fujita
27e1f14563 Add support for asynchronous execution.
This implements asynchronous execution, which runs multiple parts of a
non-parallel-aware Append concurrently rather than serially to improve
performance when possible.  Currently, the only node type that can be
run concurrently is a ForeignScan that is an immediate child of such an
Append.  In the case where such ForeignScans access data on different
remote servers, this would run those ForeignScans concurrently, and
overlap the remote operations to be performed simultaneously, so it'll
improve the performance especially when the operations involve
time-consuming ones such as remote join and remote aggregation.

We may extend this to other node types such as joins or aggregates over
ForeignScans in the future.

This also adds the support for postgres_fdw, which is enabled by the
table-level/server-level option "async_capable".  The default is false.

Robert Haas, Kyotaro Horiguchi, Thomas Munro, and myself.  This commit
is mostly based on the patch proposed by Robert Haas, but also uses
stuff from the patch proposed by Kyotaro Horiguchi and from the patch
proposed by Thomas Munro.  Reviewed by Kyotaro Horiguchi, Konstantin
Knizhnik, Andrey Lepikhov, Movead Li, Thomas Munro, Justin Pryzby, and
others.

Discussion: https://postgr.es/m/CA%2BTgmoaXQEt4tZ03FtQhnzeDEMzBck%2BLrni0UWHVVgOTnA6C1w%40mail.gmail.com
Discussion: https://postgr.es/m/CA%2BhUKGLBRyu0rHrDCMC4%3DRn3252gogyp1SjOgG8SEKKZv%3DFwfQ%40mail.gmail.com
Discussion: https://postgr.es/m/20200228.170650.667613673625155850.horikyota.ntt%40gmail.com
2021-03-31 18:45:00 +09:00
Michael Paquier
7ef64e7e72 Fix comment in parsenodes.h
CreateStmt->inhRelations is a list of RangeVars, but a comment was
incorrect about that.

Author: Julien Rouhaud
Discussion: https://postgr.es/m/20210330123015.yzekhz5sweqbgxdr@nol
2021-03-31 09:35:58 +09:00
David Rowley
f58b230ed0 Cache if PathTarget and RestrictInfos contain volatile functions
Here we aim to reduce duplicate work done by contain_volatile_functions()
by caching whether PathTargets and RestrictInfos contain any volatile
functions the first time contain_volatile_functions() is called for them.
Any future calls for these nodes just use the cached value rather than
going to the trouble of recursively checking the sub-node all over again.
Thanks to Tom Lane for the idea.

Any locations in the code which make changes to a PathTarget or
RestrictInfo which could change the outcome of the volatility check must
change the cached value back to VOLATILITY_UNKNOWN again.
contain_volatile_functions() is the only code in charge of setting the
cache value to either VOLATILITY_VOLATILE or VOLATILITY_NOVOLATILE.

Some existing code does benefit from this additional caching, however,
this change is mainly aimed at an upcoming patch that must check for
volatility during the join search.  Repeated volatility checks in that
case can become very expensive when the join search contains more than a
few relations.

Author: David Rowley
Discussion: https://postgr.es/m/3795226.1614059027@sss.pgh.pa.us
2021-03-29 14:55:26 +13:00
Tomas Vondra
a4d75c86bf Extended statistics on expressions
Allow defining extended statistics on expressions, not just just on
simple column references.  With this commit, expressions are supported
by all existing extended statistics kinds, improving the same types of
estimates. A simple example may look like this:

  CREATE TABLE t (a int);
  CREATE STATISTICS s ON mod(a,10), mod(a,20) FROM t;
  ANALYZE t;

The collected statistics are useful e.g. to estimate queries with those
expressions in WHERE or GROUP BY clauses:

  SELECT * FROM t WHERE mod(a,10) = 0 AND mod(a,20) = 0;

  SELECT 1 FROM t GROUP BY mod(a,10), mod(a,20);

This introduces new internal statistics kind 'e' (expressions) which is
built automatically when the statistics object definition includes any
expressions. This represents single-expression statistics, as if there
was an expression index (but without the index maintenance overhead).
The statistics is stored in pg_statistics_ext_data as an array of
composite types, which is possible thanks to 79f6a942bd.

CREATE STATISTICS allows building statistics on a single expression, in
which case in which case it's not possible to specify statistics kinds.

A new system view pg_stats_ext_exprs can be used to display expression
statistics, similarly to pg_stats and pg_stats_ext views.

ALTER TABLE ... ALTER COLUMN ... TYPE now treats indexes the same way it
treats indexes, i.e. it drops and recreates the statistics. This means
all statistics are reset, and we no longer try to preserve at least the
functional dependencies. This should not be a major issue in practice,
as the functional dependencies actually rely on per-column statistics,
which were always reset anyway.

Author: Tomas Vondra
Reviewed-by: Justin Pryzby, Dean Rasheed, Zhihong Yu
Discussion: https://postgr.es/m/ad7891d2-e90c-b446-9fe2-7419143847d7%40enterprisedb.com
2021-03-27 00:01:11 +01:00
Alvaro Herrera
71f4c8c6f7
ALTER TABLE ... DETACH PARTITION ... CONCURRENTLY
Allow a partition be detached from its partitioned table without
blocking concurrent queries, by running in two transactions and only
requiring ShareUpdateExclusive in the partitioned table.

Because it runs in two transactions, it cannot be used in a transaction
block.  This is the main reason to use dedicated syntax: so that users
can choose to use the original mode if they need it.  But also, it
doesn't work when a default partition exists (because an exclusive lock
would still need to be obtained on it, in order to change its partition
constraint.)

In case the second transaction is cancelled or a crash occurs, there's
ALTER TABLE .. DETACH PARTITION .. FINALIZE, which executes the final
steps.

The main trick to make this work is the addition of column
pg_inherits.inhdetachpending, initially false; can only be set true in
the first part of this command.  Once that is committed, concurrent
transactions that use a PartitionDirectory will include or ignore
partitions so marked: in optimizer they are ignored if the row is marked
committed for the snapshot; in executor they are always included.  As a
result, and because of the way PartitionDirectory caches partition
descriptors, queries that were planned before the detach will see the
rows in the detached partition and queries that are planned after the
detach, won't.

A CHECK constraint is created that duplicates the partition constraint.
This is probably not strictly necessary, and some users will prefer to
remove it afterwards, but if the partition is re-attached to a
partitioned table, the constraint needn't be rechecked.

Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/20200803234854.GA24158@alvherre.pgsql
2021-03-25 18:00:28 -03:00
Amit Kapila
26acb54a13 Revert "Enable parallel SELECT for "INSERT INTO ... SELECT ..."."
To allow inserts in parallel-mode this feature has to ensure that all the
constraints, triggers, etc. are parallel-safe for the partition hierarchy
which is costly and we need to find a better way to do that. Additionally,
we could have used existing cached information in some cases like indexes,
domains, etc. to determine the parallel-safety.

List of commits reverted, in reverse chronological order:

ed62d3737c Doc: Update description for parallel insert reloption.
c8f78b6161 Add a new GUC and a reloption to enable inserts in parallel-mode.
c5be48f092 Improve FK trigger parallel-safety check added by 05c8482f7f.
e2cda3c20a Fix use of relcache TriggerDesc field introduced by commit 05c8482f7f.
e4e87a32cc Fix valgrind issue in commit 05c8482f7f.
05c8482f7f Enable parallel SELECT for "INSERT INTO ... SELECT ...".

Discussion: https://postgr.es/m/E1lMiB9-0001c3-SY@gemulon.postgresql.org
2021-03-24 11:29:15 +05:30
Tomas Vondra
a5f002ad9a Use correct spelling of statistics kind
A couple error messages and comments used 'statistic kind', not the
correct 'statistics kind'. Fix and backpatch all the way back to 10,
where extended statistics were introduced.

Backpatch-through: 10
2021-03-23 05:01:35 +01:00
Robert Haas
bbe0a81db6 Allow configurable LZ4 TOAST compression.
There is now a per-column COMPRESSION option which can be set to pglz
(the default, and the only option in up until now) or lz4. Or, if you
like, you can set the new default_toast_compression GUC to lz4, and
then that will be the default for new table columns for which no value
is specified. We don't have lz4 support in the PostgreSQL code, so
to use lz4 compression, PostgreSQL must be built --with-lz4.

In general, TOAST compression means compression of individual column
values, not the whole tuple, and those values can either be compressed
inline within the tuple or compressed and then stored externally in
the TOAST table, so those properties also apply to this feature.

Prior to this commit, a TOAST pointer has two unused bits as part of
the va_extsize field, and a compessed datum has two unused bits as
part of the va_rawsize field. These bits are unused because the length
of a varlena is limited to 1GB; we now use them to indicate the
compression type that was used. This means we only have bit space for
2 more built-in compresison types, but we could work around that
problem, if necessary, by introducing a new vartag_external value for
any further types we end up wanting to add. Hopefully, it won't be
too important to offer a wide selection of algorithms here, since
each one we add not only takes more coding but also adds a build
dependency for every packager. Nevertheless, it seems worth doing
at least this much, because LZ4 gets better compression than PGLZ
with less CPU usage.

It's possible for LZ4-compressed datums to leak into composite type
values stored on disk, just as it is for PGLZ. It's also possible for
LZ4-compressed attributes to be copied into a different table via SQL
commands such as CREATE TABLE AS or INSERT .. SELECT.  It would be
expensive to force such values to be decompressed, so PostgreSQL has
never done so. For the same reasons, we also don't force recompression
of already-compressed values even if the target table prefers a
different compression method than was used for the source data.  These
architectural decisions are perhaps arguable but revisiting them is
well beyond the scope of what seemed possible to do as part of this
project.  However, it's relatively cheap to recompress as part of
VACUUM FULL or CLUSTER, so this commit adjusts those commands to do
so, if the configured compression method of the table happens not to
match what was used for some column value stored therein.

Dilip Kumar. The original patches on which this work was based were
written by Ildus Kurbangaliev, and those were patches were based on
even earlier work by Nikita Glukhov, but the design has since changed
very substantially, since allow a potentially large number of
compression methods that could be added and dropped on a running
system proved too problematic given some of the architectural issues
mentioned above; the choice of which specific compression method to
add first is now different; and a lot of the code has been heavily
refactored.  More recently, Justin Przyby helped quite a bit with
testing and reviewing and this version also includes some code
contributions from him. Other design input and review from Tomas
Vondra, Álvaro Herrera, Andres Freund, Oleg Bartunov, Alexander
Korotkov, and me.

Discussion: http://postgr.es/m/20170907194236.4cefce96%40wp.localdomain
Discussion: http://postgr.es/m/CAFiTN-uUpX3ck%3DK0mLEk-G_kUQY%3DSNOTeqdaNRR9FMdQrHKebw%40mail.gmail.com
2021-03-19 15:10:38 -04:00
Tomas Vondra
be45be9c33 Implement GROUP BY DISTINCT
With grouping sets, it's possible that some of the grouping sets are
duplicate.  This is especially common with CUBE and ROLLUP clauses. For
example GROUP BY CUBE (a,b), CUBE (b,c) is equivalent to

  GROUP BY GROUPING SETS (
    (a, b, c),
    (a, b, c),
    (a, b, c),
    (a, b),
    (a, b),
    (a, b),
    (a),
    (a),
    (a),
    (c, a),
    (c, a),
    (c, a),
    (c),
    (b, c),
    (b),
    ()
  )

Some of the grouping sets are calculated multiple times, which is mostly
unnecessary.  This commit implements a new GROUP BY DISTINCT feature, as
defined in the SQL standard, which eliminates the duplicate sets.

Author: Vik Fearing
Reviewed-by: Erik Rijkers, Georgios Kokolatos, Tomas Vondra
Discussion: https://postgr.es/m/bf3805a8-d7d1-ae61-fece-761b7ff41ecc@postgresfriends.org
2021-03-18 18:22:18 +01:00
Amit Kapila
05c8482f7f Enable parallel SELECT for "INSERT INTO ... SELECT ...".
Parallel SELECT can't be utilized for INSERT in the following cases:
- INSERT statement uses the ON CONFLICT DO UPDATE clause
- Target table has a parallel-unsafe: trigger, index expression or
  predicate, column default expression or check constraint
- Target table has a parallel-unsafe domain constraint on any column
- Target table is a partitioned table with a parallel-unsafe partition key
  expression or support function

The planner is updated to perform additional parallel-safety checks for
the cases listed above, for determining whether it is safe to run INSERT
in parallel-mode with an underlying parallel SELECT. The planner will
consider using parallel SELECT for "INSERT INTO ... SELECT ...", provided
nothing unsafe is found from the additional parallel-safety checks, or
from the existing parallel-safety checks for SELECT.

While checking parallel-safety, we need to check it for all the partitions
on the table which can be costly especially when we decide not to use a
parallel plan. So, in a separate patch, we will introduce a GUC and or a
reloption to enable/disable parallelism for Insert statements.

Prior to entering parallel-mode for the execution of INSERT with parallel
SELECT, a TransactionId is acquired and assigned to the current
transaction state. This is necessary to prevent the INSERT from attempting
to assign the TransactionId whilst in parallel-mode, which is not allowed.
This approach has a disadvantage in that if the underlying SELECT does not
return any rows, then the TransactionId is not used, however that
shouldn't happen in practice in many cases.

Author: Greg Nancarrow, Amit Langote, Amit Kapila
Reviewed-by: Amit Langote, Hou Zhijie, Takayuki Tsunakawa, Antonin Houska, Bharath Rupireddy, Dilip Kumar, Vignesh C, Zhihong Yu, Amit Kapila
Tested-by: Tang, Haiying
Discussion: https://postgr.es/m/CAJcOf-cXnB5cnMKqWEp2E2z7Mvcd04iLVmV=qpFJrR3AcrTS3g@mail.gmail.com
Discussion: https://postgr.es/m/CAJcOf-fAdj=nDKMsRhQzndm-O13NY4dL6xGcEvdX5Xvbbi0V7g@mail.gmail.com
2021-03-10 07:38:58 +05:30
Amit Kapila
19890a064e Add option to enable two_phase commits via pg_create_logical_replication_slot.
Commit 0aa8a01d04 extends the output plugin API to allow decoding of
prepared xacts and allowed the user to enable/disable the two-phase option
via pg_logical_slot_get_changes(). This can lead to a problem such that
the first time when it gets changes via pg_logical_slot_get_changes()
without two_phase option enabled it will not get the prepared even though
prepare is after consistent snapshot. Now next time during getting changes,
if the two_phase option is enabled it can skip prepare because by that
time start decoding point has been moved. So the user will only get commit
prepared.

Allow to enable/disable this option at the create slot time and default
will be false. It will break the existing slots which is fine in a major
release.

Author: Ajin Cherian
Reviewed-by: Amit Kapila and Vignesh C
Discussion: https://postgr.es/m/d0f60d60-133d-bf8d-bd70-47784d8fabf3@enterprisedb.com
2021-03-03 07:34:11 +05:30
David Rowley
bb437f995d Add TID Range Scans to support efficient scanning ranges of TIDs
This adds a new executor node named TID Range Scan.  The query planner
will generate paths for TID Range scans when quals are discovered on base
relations which search for ranges on the table's ctid column.  These
ranges may be open at either end. For example, WHERE ctid >= '(10,0)';
will return all tuples on page 10 and over.

To support this, two new optional callback functions have been added to
table AM.  scan_set_tidrange is used to set the scan range to just the
given range of TIDs.  scan_getnextslot_tidrange fetches the next tuple
in the given range.

For AMs were scanning ranges of TIDs would not make sense, these functions
can be set to NULL in the TableAmRoutine.  The query planner won't
generate TID Range Scan Paths in that case.

Author: Edmund Horner, David Rowley
Reviewed-by: David Rowley, Tomas Vondra, Tom Lane, Andres Freund, Zhihong Yu
Discussion: https://postgr.es/m/CAMyN-kB-nFTkF=VA_JPwFNo08S0d-Yk0F741S2B7LDmYAi8eyA@mail.gmail.com
2021-02-27 22:59:36 +13:00