This case didn't work because columns merged by FULL JOIN USING are
represented in the parse tree by COALESCE expressions, and the logic
for recognizing a partitionable join failed to match upper-level join
clauses to such expressions. To fix, synthesize suitable COALESCE
expressions and add them to the nullable_partexprs lists. This is
pretty ugly and brute-force, but it gets the job done. (I have
ambitions of rethinking the way outer-join output Vars are
represented, so maybe that will provide a cleaner solution someday.
For now, do this.)
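For illustration, a minimal sketch of the affected query shape (t1, t2,
t3 are hypothetical partitioned tables sharing a partition key column
"a"); the merged column "a" of the inner FULL JOIN is internally a
COALESCE over both inputs, which the upper join clause must be matched
against:

    SELECT *
      FROM (t1 FULL JOIN t2 USING (a)) AS s
      FULL JOIN t3 USING (a);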
Amit Langote, reviewed by Justin Pryzby, Richard Guo, and myself
Discussion: https://postgr.es/m/CA+HiwqG2WVUGmLJqtR0tPFhniO=H=9qQ+Z3L_ZC+Y3-EVQHFGg@mail.gmail.com
Previously, the partitionwise join technique only allowed partitionwise
join when input partitioned tables had exactly the same partition
bounds. This commit extends the technique to some cases when the tables
have different partition bounds, by using an advanced partition-matching
algorithm introduced by this commit. The algorithm checks whether each
partition of one input partitioned table matches at most one partition
of the other, and vice versa. In such a case the join between the
tables can be broken down into joins between the matching partitions,
so the algorithm produces the pairs of matching partitions, plus the
partition bounds for the join relation, allowing partitionwise join to
compute the join. Currently, the algorithm
works for list-partitioned and range-partitioned tables, but not
hash-partitioned tables. See comments in partition_bounds_merge().
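As a hedged sketch (names and bounds are hypothetical), these two
range-partitioned tables have different bounds but still satisfy the
at-most-one-match condition:

    CREATE TABLE t1 (a int, b text) PARTITION BY RANGE (a);
    CREATE TABLE t1_p1 PARTITION OF t1 FOR VALUES FROM (0) TO (100);
    CREATE TABLE t1_p2 PARTITION OF t1 FOR VALUES FROM (100) TO (200);
    CREATE TABLE t2 (a int, c text) PARTITION BY RANGE (a);
    CREATE TABLE t2_p1 PARTITION OF t2 FOR VALUES FROM (0) TO (100);
    CREATE TABLE t2_p2 PARTITION OF t2 FOR VALUES FROM (100) TO (300);
    -- with enable_partitionwise_join = on, t1 JOIN t2 USING (a) can now
    -- be broken down into t1_p1/t2_p1 and t1_p2/t2_p2 joins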
Ashutosh Bapat and Etsuro Fujita, most of regression tests by Rajkumar
Raghuwanshi, some of the tests by Mark Dilger and Amul Sul, reviewed by
Dmitry Dolgov and Amul Sul, with additional review at various points by
Ashutosh Bapat, Mark Dilger, Robert Haas, Antonin Houska, Amit Langote,
Justin Pryzby, and Tomas Vondra
Discussion: https://postgr.es/m/CAFjFpRdjQvaUEV5DJX3TW6pU5eq54NCkadtxHX2JiJG_GvbrCA@mail.gmail.com
The goal of separating hotly accessed per-backend data from PGPROC
into PGXACT is to make accesses fast (GetSnapshotData() in
particular). But delayChkpt is not actually accessed frequently; only
when starting a checkpoint. As it is frequently modified (multiple
times in the course of a single transaction), storing it in the same
cacheline as hotly accessed data unnecessarily dirties a contended
cacheline.
Therefore move delayChkpt to PGPROC.
This is part of a larger series of patches intending to improve
GetSnapshotData() scalability. It is committed and pushed separately,
as it is independently beneficial (small but measurable win, limited
by the other frequent modifications of PGXACT).
Author: Andres Freund
Reviewed-By: Robert Haas, Thomas Munro, David Rowley
Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de
Since heap TID is supposed to be just another key attribute to the
implementation, it doesn't make much sense to have separate
BTreeTupleSetNAtts() and BTreeTupleSetAltHeapTID() functions. Merge the
two functions together. This slightly simplifies _bt_truncate().
Replication slots are useful to retain data that may be needed by a
replication system. But experience has shown that allowing them to
retain excessive data can lead to the primary failing because of running
out of space. This new feature allows the user to configure a maximum
amount of space to be reserved using the new option
max_slot_wal_keep_size. Slots that overrun that space are invalidated
at checkpoint time, enabling the storage to be released.
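A minimal configuration sketch (the size is illustrative, not a
recommendation):

    ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
    SELECT pg_reload_conf();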
Author: Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/20170228.122736.123383594.horiguchi.kyotaro@lab.ntt.co.jp
This commit adds the following optional clause to the BEGIN and START
TRANSACTION commands.
WAIT FOR LSN lsn [ TIMEOUT timeout ]
The new clause postpones transaction start until the given LSN has been
applied on the standby. This lets the user be sure that changes
previously made on the primary will be visible on the standby.
A new shared memory struct is used to track the awaited LSN per backend.
The recovery process wakes a backend up once the required LSN has been
applied.
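A usage sketch (the LSN value is hypothetical, and the timeout is
assumed to be in milliseconds):

    -- on the standby: don't start until this LSN has been applied
    BEGIN WAIT FOR LSN '0/306EE20' TIMEOUT 10000;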
Author: Ivan Kartyshov, Anna Akenteva
Reviewed-by: Craig Ringer, Thomas Munro, Robert Haas, Kyotaro Horiguchi
Reviewed-by: Masahiko Sawada, Ants Aasma, Dmitry Ivanov, Simon Riggs
Reviewed-by: Amit Kapila, Alexander Korotkov
Discussion: https://postgr.es/m/0240c26c-9f84-30ea-fca9-93ab2df5f305%40postgrespro.ru
WITH TIES is an option to the FETCH FIRST N ROWS clause (the SQL
standard's spelling of LIMIT), where you additionally get rows that
compare equal to the last of those N rows by the columns in the
mandatory ORDER BY clause.
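For example (table and column names hypothetical):

    -- returns at least 3 rows, plus any further rows whose "points"
    -- value ties with the third row's
    SELECT * FROM scores ORDER BY points DESC
    FETCH FIRST 3 ROWS WITH TIES;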
There was a proposal by Andrew Gierth to implement this functionality in
a more powerful way that would yield more features, but the other patch
had not been finished at this time, so we decided to use this one for
now in the spirit of incremental development.
Author: Surafel Temesgen <surafel3000@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Discussion: https://postgr.es/m/CALAY4q9ky7rD_A4vf=FVQvCGngm3LOes-ky0J6euMrg=_Se+ag@mail.gmail.com
Discussion: https://postgr.es/m/87o8wvz253.fsf@news-spur.riddles.org.uk
Since the existing bit number argument can't exceed INT32_MAX, it's
not possible for these functions to manipulate bits beyond the first
256MB of a bytea value. Lift that restriction by redeclaring the
bit number arguments as int8 (which requires a catversion bump,
hence is not back-patchable).
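A small sketch of the functions in question (offset values are
illustrative; the point is that the bit-number argument is now int8):

    SELECT set_bit('\x00'::bytea, 0, 1);   -- \x01
    SELECT get_bit('\x01'::bytea, 0);      -- 1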
The similarly-named functions for bit/varbit don't really have a
problem because we restrict those types to at most VARBITMAXLEN bits;
hence leave them alone.
While here, extend the encode/decode functions in utils/adt/encode.c
to allow dealing with values wider than 1GB. This is not a live bug
or restriction in current usage, because no input could be more than
1GB, and since none of the encoders can expand a string more than 4X,
the result size couldn't overflow uint32. But it might be desirable
to support more in future, so make the input length values size_t
and the potential-output-length values uint64.
Also add some test cases to improve the miserable code coverage
of these functions.
Movead Li, editorialized some by me; also reviewed by Ashutosh Bapat
Discussion: https://postgr.es/m/20200312115135445367128@highgo.ca
Commit d2d8a229bc introduced Incremental Sort, but it was considered
only in create_ordered_paths() as an alternative to regular Sort. There
are many other places that require sorted input and might benefit from
considering Incremental Sort too.
This patch modifies a number of those places, but not all. The concern
is that just adding Incremental Sort to any place that already adds
Sort may increase the number of paths considered, negatively affecting
planning time, without any benefit. So we've taken a more conservative
approach, based on an analysis of which places affect a set of queries
that seemed practical. This means some less common queries may not
benefit from Incremental Sort yet.
Author: Tomas Vondra
Reviewed-by: James Coleman
Discussion: https://postgr.es/m/CAPpHfds1waRZ=NOmueYq0sx1ZSCnt+5QJvizT8ndT2=etZEeAQ@mail.gmail.com
It turns out that the code did indeed rely on a zeroed
TuplesortInstrumentation.sortMethod field to indicate
"this worker never did anything", although it seems the
issue only comes up during certain race-condition-y cases.
Hence, rearrange the TuplesortMethod enum to restore
SORT_TYPE_STILL_IN_PROGRESS to having the value zero,
and add some comments reinforcing that that isn't optional.
Also future-proof a loop over the possible values of the enum.
sizeof(bits32) happened to be the correct limit value,
but only by purest coincidence.
Per buildfarm and local investigation.
Discussion: https://postgr.es/m/12222.1586223974@sss.pgh.pa.us
The txid_XXX family of fmgr functions exposes 64 bit transaction IDs to
users as int8. Now that we have an SQL type xid8 for FullTransactionId,
define a new set of functions including pg_current_xact_id() and
pg_current_snapshot() based on that. Keep the old functions around too,
for now.
It's a bit sneaky to use the same C functions for both, but since the
binary representation is identical except for the signedness of the
type, and since older functions are the ones using the wrong signedness,
and since we'll presumably drop the older ones after a reasonable period
of time, it seems reasonable to switch to FullTransactionId internally
and share the code for both.
Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Reviewed-by: Takao Fujii <btfujiitkp@oss.nttdata.com>
Reviewed-by: Yoshikazu Imai <imai.yoshikazu@fujitsu.com>
Reviewed-by: Mark Dilger <mark.dilger@enterprisedb.com>
Discussion: https://postgr.es/m/20190725000636.666m5mad25wfbrri%40alap3.anarazel.de
Similar to xid, but 64 bits wide. This new type is suitable for use in
various system views and administration functions.
Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Reviewed-by: Takao Fujii <btfujiitkp@oss.nttdata.com>
Reviewed-by: Yoshikazu Imai <imai.yoshikazu@fujitsu.com>
Reviewed-by: Mark Dilger <mark.dilger@enterprisedb.com>
Discussion: https://postgr.es/m/20190725000636.666m5mad25wfbrri%40alap3.anarazel.de
Incremental Sort is an optimized variant of multikey sort for cases when
the input is already sorted by a prefix of the requested sort keys. For
example when the relation is already sorted by (key1, key2) and we need
to sort it by (key1, key2, key3) we can simply split the input rows into
groups having equal values in (key1, key2), and only sort/compare the
remaining column key3.
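A minimal sketch, assuming a hypothetical table t with an index on
(key1, key2):

    CREATE INDEX ON t (key1, key2);
    EXPLAIN SELECT * FROM t ORDER BY key1, key2, key3;
    -- the plan may now use an Incremental Sort node, with
    -- "Presorted Key: key1, key2" shown in the EXPLAIN output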
This has a number of benefits:
- Reduced memory consumption, because only a single group (determined by
values in the sorted prefix) needs to be kept in memory. This may also
eliminate the need to spill to disk.
- Lower startup cost, because Incremental Sort produces results after each
prefix group, which is beneficial for plans where startup cost matters
(for example, queries with a LIMIT clause).
We consider both Sort and Incremental Sort, and decide based on costing.
The implemented algorithm operates in two different modes:
- Fetching a minimum number of tuples without checking equality on the
prefix keys, and sorting on all columns when safe.
- Fetching all tuples for a single prefix group and then sorting by
comparing only the remaining (non-prefix) keys.
We always start in the first mode, and employ a heuristic to switch into
the second mode if we believe it's beneficial - the goal is to minimize
the number of unnecessary comparisons while keeping memory consumption
below work_mem.
This is a very old patch series. The idea was originally proposed by
Alexander Korotkov back in 2013, and then revived in 2017. In 2018 the
patch was taken over by James Coleman, who wrote and rewrote most of the
current code.
There were many reviewers/contributors since 2013 - I've done my best to
pick the most active ones, and listed them in this commit message.
Author: James Coleman, Alexander Korotkov
Reviewed-by: Tomas Vondra, Andreas Karlsson, Marti Raudsepp, Peter Geoghegan, Robert Haas, Thomas Munro, Antonin Houska, Andres Freund, Alexander Kuzmenkov
Discussion: https://postgr.es/m/CAPpHfdscOX5an71nHd8WSUH6GNOCf=V7wgDaTXdDd9=goN-gfA@mail.gmail.com
Discussion: https://postgr.es/m/CAPpHfds1waRZ=NOmueYq0sx1ZSCnt+5QJvizT8ndT2=etZEeAQ@mail.gmail.com
Mainly, this adds support code in logical/worker.c for applying
replicated operations whose target is a partitioned table to its
relevant partitions.
Author: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Rafia Sabih <rafia.pghackers@gmail.com>
Reviewed-by: Peter Eisentraut <peter.eisentraut@2ndquadrant.com>
Reviewed-by: Petr Jelinek <petr@2ndquadrant.com>
Discussion: https://www.postgresql.org/message-id/flat/CA+HiwqH=Y85vRK3mOdjEkqFK+E=ST=eQiHdpj43L=_eJMOOznQ@mail.gmail.com
This commit adds a new option WAL, similar to the existing option BUFFERS,
to the EXPLAIN command. This option allows information on WAL record
generation (added by commit df3b181499) to be included in EXPLAIN output.
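A usage sketch (the statement is hypothetical; the WAL option is used
together with ANALYZE):

    EXPLAIN (ANALYZE, WAL) INSERT INTO t VALUES (1);
    -- the output gains a "WAL:" line with record and byte counters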
This also allows the WAL usage information to be displayed via
the auto_explain module. A new parameter auto_explain.log_wal controls
whether WAL usage statistics are printed when an execution plan is logged.
This parameter has no effect unless auto_explain.log_analyze is enabled.
Author: Julien Rouhaud
Reviewed-by: Dilip Kumar and Amit Kapila
Discussion: https://postgr.es/m/CAB-hujrP8ZfUkvL5OYETipQwA=e3n7oqHFU=4ZLxWS_Cza3kQQ@mail.gmail.com
A table rewritten by ALTER TABLE would lose tracking of an index usable
for CLUSTER. This setting is tracked by pg_index.indisclustered and is
controlled by ALTER TABLE, so some extra work was needed to restore it
properly. Note that ALTER TABLE only marks the index that can be used
for clustering, and does not do the actual operation.
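A sketch of the behavior being fixed (names hypothetical):

    CLUSTER t USING t_pkey;                    -- marks t_pkey in indisclustered
    ALTER TABLE t ALTER COLUMN c TYPE bigint;  -- forces a table rewrite
    CLUSTER t;                                 -- still knows to use t_pkey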
Author: Amit Langote, Justin Pryzby
Reviewed-by: Ibrar Ahmed, Michael Paquier
Discussion: https://postgr.es/m/20200202161718.GI13621@telsasoft.com
Backpatch-through: 9.5
Until now, only selected bulk operations (e.g. COPY) did this. If a
given relfilenode received both a WAL-skipping COPY and a WAL-logged
operation (e.g. INSERT), recovery could lose tuples from the COPY. See
src/backend/access/transam/README section "Skipping WAL for New
RelFileNode" for the new coding rules. Maintainers of table access
methods should examine that section.
To maintain data durability, just before commit, we choose between an
fsync of the relfilenode and copying its contents to WAL. A new GUC,
wal_skip_threshold, guides that choice. If this change slows a workload
that creates small, permanent relfilenodes under wal_level=minimal, try
adjusting wal_skip_threshold. Users setting a timeout on COMMIT may
need to adjust that timeout, and log_min_duration_statement analysis
will reflect time consumption moving to COMMIT from commands like COPY.
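A tuning sketch (the threshold value is illustrative only):

    ALTER SYSTEM SET wal_skip_threshold = '4MB';
    SELECT pg_reload_conf();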
Internally, this requires a reliable determination of whether
RollbackAndReleaseCurrentSubTransaction() would unlink a relation's
current relfilenode. Introduce rd_firstRelfilenodeSubid. Amend the
specification of rd_createSubid such that the field is zero when a new
rel has an old rd_node. Make relcache.c retain entries for certain
dropped relations until end of transaction.
Bump XLOG_PAGE_MAGIC, since this introduces XLOG_GIST_ASSIGN_LSN.
Future servers accept older WAL, so this bump is discretionary.
Kyotaro Horiguchi, reviewed (in earlier, similar versions) by Robert
Haas. Heikki Linnakangas and Michael Paquier implemented earlier
designs that materially clarified the problem. Reviewed, in earlier
designs, by Andrew Dunstan, Andres Freund, Alvaro Herrera, Tom Lane,
Fujii Masao, and Simon Riggs. Reported by Martijn van Oosterhout.
Discussion: https://postgr.es/m/20150702220524.GA9392@svana.org
This allows gathering the WAL generation statistics for each statement
execution. The three statistics that we collect are the number of WAL
records, the number of full page writes and the amount of WAL bytes
generated.
This helps users with write-intensive workloads to see the impact
of I/O due to WAL. This further enables us to see approximately what
percentage of overall WAL is due to full page writes.
In the future, we can extend this functionality to allow us to compute
the exact amount of WAL data due to full page writes.
This patch in itself is just an infrastructure to compute WAL usage data.
The upcoming patches will expose this data via explain, auto_explain,
pg_stat_statements and verbose (auto)vacuum output.
Author: Kirill Bychik, Julien Rouhaud
Reviewed-by: Dilip Kumar, Fujii Masao and Amit Kapila
Discussion: https://postgr.es/m/CAB-hujrP8ZfUkvL5OYETipQwA=e3n7oqHFU=4ZLxWS_Cza3kQQ@mail.gmail.com
Don't try to be precise about it, just use a constant 16 bytes of
chunk overhead. Being smarter would require knowing the memory context
where the chunk will be allocated, which is not known by all callers.
Discussion: https://postgr.es/m/20200325220936.il3ni2fj2j2b45y5@alap3.anarazel.de
Move have_partkey_equi_join and match_expr_to_partition_keys to
relnode.c, since they're used only there. Refactor
build_joinrel_partition_info to split out the code that fills the
joinrel's partition key lists; this doesn't have any non-cosmetic
impact, but it seems like a useful separation of concerns.
Improve assorted nearby comments.
Amit Langote, with a little further editorialization by me
Discussion: https://postgr.es/m/CA+HiwqG2WVUGmLJqtR0tPFhniO=H=9qQ+Z3L_ZC+Y3-EVQHFGg@mail.gmail.com
A manifest is a JSON document which includes (1) the file name, size,
last modification time, and an optional checksum for each file backed
up, (2) timelines and LSNs for whatever WAL will need to be replayed
to make the backup consistent, and (3) a checksum for the manifest
itself. By default, we use CRC-32C when checksumming data files,
because we are trying to detect corruption and user error, not foil an
adversary. However, pg_basebackup and the server-side BASE_BACKUP
command now have options to select a different algorithm, so users
wanting a cryptographic hash function can select SHA-224, SHA-256,
SHA-384, or SHA-512. Users not wanting file checksums at all can
disable them, or disable generating of the backup manifest altogether.
Using a cryptographic hash function in place of CRC-32C consumes
significantly more CPU cycles, which may slow down backups in some
cases.
A new tool called pg_validatebackup can validate a backup against the
manifest. If no checksums are present, it can still check that the
right files exist and that they have the expected sizes. If checksums
are present, it can also verify that each file has the expected
checksum. Additionally, it calls pg_waldump to verify that the
expected WAL files are present and parseable. Only plain format
backups can be validated directly, but tar format backups can be
validated after extracting them.
Robert Haas, with help, ideas, review, and testing from David Steele,
Stephen Frost, Andrew Dunstan, Rushabh Lathia, Suraj Kharage, Tushar
Ahuja, Rajkumar Raghuwanshi, Mark Dilger, Davinder Singh, Jeevan
Chalke, Amit Kapila, Andres Freund, and Noah Misch.
Discussion: http://postgr.es/m/CA+TgmoZV8dw1H2bzZ9xkKwdrk8+XYa+DC9H=F7heO2zna5T6qg@mail.gmail.com
When the BUFFERS option is enabled, EXPLAIN output includes information
on buffer usage for each plan node. In addition, this commit makes
EXPLAIN also include information on buffer usage during the planning
phase. This makes it easier to discern cases where lots of buffer
accesses happen during planning.
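A usage sketch (the query is hypothetical):

    EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM t;
    -- the output now also reports buffer usage incurred while planning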
This commit revives the original commit ed7a509571 that was reverted by
commit 19db23bcbd. The original commit had to be reverted because
it caused the regression test failure on the buildfarm members prion and
dory. But since commit c0885c4c30 got rid of the cause of the test failure,
the original commit can be safely introduced again.
Author: Julien Rouhaud, slightly revised by Fujii Masao
Reviewed-by: Justin Pryzby
Discussion: https://postgr.es/m/16109-26a1a88651e90608@postgresql.org
This commit introduces new wait events RecoveryConflictSnapshot and
RecoveryConflictTablespace. The former is reported while waiting for
recovery conflict resolution on a vacuum cleanup. The latter is reported
while waiting for recovery conflict resolution on dropping a tablespace.
Also this commit changes the code so that the wait event Lock is reported
while waiting in ResolveRecoveryConflictWithVirtualXIDs() for recovery
conflict resolution on a lock. Such a wait is normally reported as the
wait event Lock, but previously it was not reported when the wait
happened inside ResolveRecoveryConflictWithVirtualXIDs().
Author: Masahiko Sawada
Reviewed-by: Fujii Masao
Discussion: https://postgr.es/m/CA+fd4k4mXWTwfQLS3RPwGr4xnfAEs1ysFfgYHvmmoUgv6Zxvmg@mail.gmail.com
When the BUFFERS option is enabled, EXPLAIN output includes information
on buffer usage for each plan node. In addition, this commit makes
EXPLAIN also include information on buffer usage during the planning
phase. This makes it easier to discern cases where lots of buffer
accesses happen during planning.
Author: Julien Rouhaud, slightly revised by Fujii Masao
Reviewed-by: Justin Pryzby
Discussion: https://postgr.es/m/16109-26a1a88651e90608@postgresql.org
This patch replaces the boolean GUC log_parameters_on_error introduced
by commit ba79cb5dc with an integer log_parameter_max_length_on_error,
adding the ability to specify how many bytes to trim each logged
parameter value to. (The previous coding hard-wired that choice at
64 bytes.)
In addition, add a new parameter log_parameter_max_length that provides
similar control over truncation of query parameters that are logged in
response to statement-logging options, as opposed to errors. Previous
releases always logged such parameters in full, possibly causing log
bloat.
For backwards compatibility with prior releases,
log_parameter_max_length defaults to -1 (log in full), while
log_parameter_max_length_on_error defaults to 0 (no logging).
Per discussion, log_parameter_max_length is SUSET since the DBA should
control routine logging behavior, but log_parameter_max_length_on_error
is USERSET because it also affects errcontext data sent back to the
client.
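A configuration sketch (values illustrative; the first SET requires
superuser):

    SET log_parameter_max_length = 512;          -- SUSET
    SET log_parameter_max_length_on_error = 64;  -- USERSET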
Alexey Bashtanov, editorialized a little by me
Discussion: https://postgr.es/m/b10493cc-a399-a03a-67c7-068f2791ee50@imap.cc
This adds SQL expressions NORMALIZE() and IS NORMALIZED to convert and
check Unicode normal forms, per SQL standard.
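For example:

    SELECT normalize(U&'\0061\0308', NFC);   -- composes to U&'\00E4'
    SELECT U&'\00E4' IS NFC NORMALIZED;      -- true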
To support fast IS NORMALIZED tests, we pull in a new data file
DerivedNormalizationProps.txt from Unicode and build a lookup table
from that, using techniques similar to ones already used for other
Unicode data. make update-unicode will keep it up to date. We only
build and use these tables for the NFC and NFKC forms, because they
are too big for NFD and NFKD and the improvement is not significant
enough there.
Reviewed-by: Daniel Verite <daniel@manitou-mail.org>
Reviewed-by: Andreas Karlsson <andreas@proxel.se>
Discussion: https://www.postgresql.org/message-id/flat/c1909f27-c269-2ed9-12f8-3ab72c8caf7a@2ndquadrant.com
There are a number of SLRU caches used to access important data like clog,
commit timestamps, multixact, asynchronous notifications, etc. Until now
we had no easy way to monitor these shared caches, compute hit ratios,
number of reads/writes etc.
This commit extends the statistics collector to track this information
for a predefined list of SLRUs, and also introduces a new system view
pg_stat_slru displaying the data.
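A usage sketch, e.g. to compute hit ratios per SLRU:

    SELECT name, blks_hit, blks_read,
           blks_hit::float8 / nullif(blks_hit + blks_read, 0) AS hit_ratio
      FROM pg_stat_slru;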
The list of built-in SLRUs is fixed, but additional SLRUs may be defined
in extensions. Unfortunately, there's no suitable registry of SLRUs, so
this patch simply defines a fixed list of SLRUs with entries for the
built-in ones and one entry for all additional SLRUs. Extensions adding
their own SLRU are fairly rare, so this seems acceptable.
This patch only allows monitoring of SLRUs, not tuning. The SLRU sizes
are still fixed (hard-coded in the code) and it's not entirely clear
which of the SLRUs might need a GUC to tune size. In a way, allowing us
to determine that is one of the goals of this patch.
Bump catversion as the patch introduces new functions and a system view.
Author: Tomas Vondra
Reviewed-by: Alvaro Herrera
Discussion: https://www.postgresql.org/message-id/flat/20200119143707.gyinppnigokesjok@development
Quite a few matching operators such as JSONB's @> used "contsel" and
"contjoinsel" as their selectivity estimators. That was a bad idea,
because (a) contsel is only a stub, yielding a fixed default estimate,
and (b) that default is 0.001, meaning we estimate these operators as
five times more selective than equality, which is surely pretty silly.
There's a good model for improving this in ltree's ltreeparentsel():
for any "var OP constant" query, we can try applying the operator
to all of the column's MCV and histogram values, taking the latter
as being a random sample of the non-MCV values. That code is
actually 100% generic, except for the question of exactly what
default selectivity ought to be plugged in when we don't have stats.
Hence, migrate the guts of ltreeparentsel() into the core code, provide
wrappers "matchingsel" and "matchingjoinsel" with a more-appropriate
default estimate, and use those for the non-geometric operators that
formerly used contsel (mostly JSONB containment operators and tsquery
matching).
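A hedged sketch of how an extension might attach the new estimators to a
match-like operator (the operator name and its implementation function
here are placeholders, not part of this commit):

    CREATE OPERATOR ~~~ (
        LEFTARG = text, RIGHTARG = text,
        FUNCTION = texteq,          -- placeholder boolean function
        RESTRICT = matchingsel,
        JOIN = matchingjoinsel
    );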
Also apply this code to some match-like operators in hstore, ltree, and
pg_trgm, including the former users of ltreeparentsel as well as ones
that improperly used contsel. Since commit 911e70207 just created new
versions of those extensions that we haven't released yet, we can sneak
this change into those new versions instead of having to create an
additional generation of update scripts.
Patch by me, reviewed by Alexey Bashtanov
Discussion: https://postgr.es/m/12237.1582833074@sss.pgh.pa.us
pg_rewind needs to copy from the source cluster to the target cluster a
set of relation blocks changed from the previous checkpoint where WAL
forked up to the end of WAL on the target. Building this list of
relation blocks requires a range of WAL segments that may not be present
anymore on the target's pg_wal, causing pg_rewind to fail. It is
possible to work around this issue by manually copying the needed WAL
segments, but this may lead to extra and actually useless work.
This commit introduces a new option allowing pg_rewind to use a
restore_command while doing the rewind by grabbing the parameter value
of restore_command from the target cluster configuration. This allows
the rewind operation to be more reliable, since only the WAL segments
needed by the rewind are restored from the archives.
In order to be able to do that, a new routine is added to src/common/ to
allow frontend tools to restore files from archives using an
already-built restore command. This version is simpler than the backend
equivalent as there is no need to handle the non-recovery case.
Author: Alexey Kondratov
Reviewed-by: Andrey Borodin, Andres Freund, Alvaro Herrera, Alexander
Korotkov, Michael Paquier
Discussion: https://postgr.es/m/a3acff50-5a0d-9a2c-b3b2-ee36168955c1@postgrespro.ru
The definitions of the routines defined in xlogarchive.c have been part
of xlog_internal.h which is included by several frontend tools, but all
those routines are only called by the backend. More cleanup could be
done within xlog_internal.h, but that's already a nice cut.
This will help a follow-up patch for pg_rewind where handling of
restore_command is added for frontends.
Author: Alexey Kondratov, Michael Paquier
Reviewed-by: Álvaro Herrera, Alexander Korotkov
Discussion: https://postgr.es/m/a3acff50-5a0d-9a2c-b3b2-ee36168955c1@postgrespro.ru
PostgreSQL provides a set of template index access methods, where opclasses
have much freedom in the semantics of indexing. These index AMs are GiST,
GIN, SP-GiST and BRIN. In these, opclasses define the representation of
keys, operations on them and the supported search strategies. So, it's
natural that opclasses may face tradeoffs that require user-side decisions.
This commit implements opclass parameters, allowing users to set some
values which tell the opclass how to index the particular dataset.
This commit doesn't introduce new storage in the system catalog. Instead
it uses pg_attribute.attoptions, which is used for table column storage
options but was unused for index attributes.
In order to avoid changing the signature of each opclass support function,
we implement a unified way to pass options to opclass support functions.
Options are set in fn_expr as a constant bytea expression. This is possible
because opclass support functions are executed outside of expressions, so
fn_expr is unused for them.
This commit comes with some examples of opclass options usage. We parametrize
signature length in GiST. That applies to multiple opclasses: tsvector_ops,
gist__intbig_ops, gist_ltree_ops, gist__ltree_ops, gist_trgm_ops and
gist_hstore_ops. Also we parametrize the maximum number of integer ranges for
gist__int_ops. However, the main future usage of this feature is expected
to be json, where users would be able to specify which way to index particular
json parts.
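For example, with the parametrized signature length in GiST (table and
column names hypothetical):

    CREATE INDEX ON t USING gist (v tsvector_ops (siglen = 128));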
Catversion is bumped.
Discussion: https://postgr.es/m/d22c3a18-31c7-1879-fc11-4c1ce2f5e5af%40postgrespro.ru
Author: Nikita Glukhov, revised by me
Reviewed-by: Nikolay Shaplov, Robert Haas, Tom Lane, Tomas Vondra, Alvaro Herrera
When certain parameters are changed on a physical replication primary,
this is communicated to standbys using the XLOG_PARAMETER_CHANGE WAL
record. The standby then checks whether its own settings are at least
as big as the ones on the primary. If not, the standby shuts down
with a fatal error.
The correspondence of settings between primary and standby is required
because those settings influence certain shared memory sizings that
are required for processing WAL records that the primary might send.
For example, if the primary sends a prepared transaction, the standby
must have had max_prepared_transactions set appropriately or it won't
be able to process those WAL records.
However, fatally shutting down the standby immediately upon receipt of
the parameter change record might be a bit of an overreaction. The
resources related to those settings are not required immediately at
that point, and might never be required if the activity on the primary
does not exhaust all those resources. If we just let the standby roll
on with recovery, it will eventually produce an appropriate error when
those resources are used.
So this patch relaxes this a bit. Upon receipt of
XLOG_PARAMETER_CHANGE, we still check the settings but only issue a
warning and set a global flag if there is a problem. Then when we
actually hit the resource issue and the flag was set, we issue another
warning message with relevant information. At that point we pause
recovery, so a hot standby remains usable. We also repeat the last
warning message once a minute so it is harder to miss or ignore.
Reviewed-by: Sergei Kornilov <sk@zsrv.org>
Reviewed-by: Masahiko Sawada <masahiko.sawada@2ndquadrant.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/4ad69a4c-cc9b-0dfe-0352-8b1b0cd36c7b@2ndquadrant.com
This commit adds a query_string argument to the planner-related functions
and the planner hook, and allows us to pass the query string to them.
Currently there is no user of the passed query string. But the upcoming patch
for the planning counters will add the planning hook function into
pg_stat_statements and the function will need the query string. So this change
will be necessary for that patch.
Also this change is useful for some extensions that want to use the query
string in their planner hook function.
Author: Pascal Legrand, Julien Rouhaud
Reviewed-by: Yoshikazu Imai, Tom Lane, Fujii Masao
Discussion: https://postgr.es/m/CAOBaU_bU1m3_XF5qKYtSj1ua4dxd=FWDyh2SH4rSJAUUfsGmAQ@mail.gmail.com
Discussion: https://postgr.es/m/1583789487074-0.post@n3.nabble.com
Previously pg_stat_statements calculated the difference of buffer counters
with its own code even though BufferUsageAccumDiff() implemented the same
logic. This commit exposes BufferUsageAccumDiff() and makes
pg_stat_statements use it for the calculation, in order to simplify the code.
This change will also be useful for the upcoming patch for the planning
counters in pg_stat_statements, because that patch will add one more place
where the difference of buffer counters is calculated, and that can easily
be done by using BufferUsageAccumDiff().
Author: Julien Rouhaud
Reviewed-by: Fujii Masao
Discussion: https://postgr.es/m/bdfee4e0-a304-2498-8da5-3cb52c0a193e@oss.nttdata.com
As of Windows 10 version 1803, Unix-domain sockets are supported on
Windows. But it's not automatically detected by configure because it
looks for struct sockaddr_un and Windows doesn't define that. So we
just make our own definition on Windows and override the configure
result.
Set DEFAULT_PGSOCKET_DIR to empty on Windows so by default no
Unix-domain socket is used, because there is no good standard
location.
In pg_upgrade, we have to do some extra tweaking to preserve the
existing behavior of not using Unix-domain sockets on Windows. Adding
support would be desirable, but it needs further work, in particular a
way to select whether to use Unix-domain sockets from the command-line
or with a run-time test.
The pg_upgrade test script needs a fix. The previous code passed
"localhost" to postgres -k, which only happened to work because
Windows used to ignore the -k argument value altogether. We instead
need to pass an empty string to get the desired effect.
The test suites will continue to not use Unix-domain sockets on
Windows. This requires a small tweak in pg_regress.c. The TAP tests
don't need to be changed because they decide based on the operating
system rather than on HAVE_UNIX_SOCKETS.
Reviewed-by: Andrew Dunstan <andrew.dunstan@2ndquadrant.com>
Discussion: https://www.postgresql.org/message-id/flat/54bde68c-d134-4eb8-5bd3-8af33b72a010@2ndquadrant.com
Traditionally autovacuum has only ever invoked a worker based on the
estimated number of dead tuples in a table and for anti-wraparound
purposes. For the latter, with certain classes of tables such as
insert-only tables, anti-wraparound vacuums could be the first vacuum that
the table ever receives. This could often lead to autovacuum workers being
busy for extended periods of time due to having to potentially freeze
every page in the table. This could be particularly bad for very large
tables. New clusters, or recently pg_restored clusters could suffer even
more as many large tables may have the same relfrozenxid, which could
result in large numbers of tables requiring an anti-wraparound vacuum all
at once.
Here we aim to reduce the work required by anti-wraparound and aggressive
vacuums in general, by triggering autovacuum when the table has received
enough INSERTs. This is controlled by adding two new GUCs and reloptions;
autovacuum_vacuum_insert_threshold and
autovacuum_vacuum_insert_scale_factor. These work exactly the same as the
existing scale factor and threshold controls, except that they base
themselves on the number of inserts since the last vacuum rather than the
number of dead tuples. New controls were added rather than reusing the
existing ones, to allow these new vacuums to be tuned independently, or
even disabled altogether by setting autovacuum_vacuum_insert_threshold
to -1.
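A per-table tuning sketch (values illustrative only):

    ALTER TABLE t SET (autovacuum_vacuum_insert_threshold = 5000,
                       autovacuum_vacuum_insert_scale_factor = 0.1);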
We make no attempt to skip index cleanup operations on these vacuums as
they may trigger for an insert-mostly table which continually doesn't have
enough dead tuples to trigger an autovacuum for the purpose of removing
those dead tuples. If we were to skip cleaning the indexes in this case,
then it is possible for the index(es) to become bloated over time.
There are additional benefits to triggering autovacuums based on inserts,
as tables which never contain enough dead tuples to trigger an autovacuum
are now more likely to receive a vacuum, which can mark more of the table
as "allvisible" and encourage the query planner to make use of Index Only
Scans.
Currently, we still obey vacuum_freeze_min_age when triggering these new
autovacuums based on INSERTs. For large insert-only tables, it may be
beneficial to lower the table's autovacuum_freeze_min_age so that tuples
are eligible to be frozen sooner. Here we've opted not to zero that for
these types of vacuums, since the table may just be insert-mostly and we
may otherwise freeze tuples that are still destined to be updated or
removed in the near future.
There was some debate about what exactly the new scale factor and threshold
should default to. For now, these are set to 0.2 and 1000, respectively.
There may be some motivation to adjust these before the release.
Author: Laurenz Albe, Darafei Praliaskouski
Reviewed-by: Alvaro Herrera, Masahiko Sawada, Chris Travers, Andres Freund, Justin Pryzby
Discussion: https://postgr.es/m/CAC8Q8t%2Bj36G_bLF%3D%2B0iMo6jGNWnLnWb1tujXuJr-%2Bx8ZCCTqoQ%40mail.gmail.com
The parameters primary_conninfo, primary_slot_name and
wal_receiver_create_temp_slot can now be changed with a simple "reload"
signal, no longer requiring a server restart. This is achieved by
signalling the walreceiver process to terminate and having it start
again with the new values.
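A usage sketch (the connection string is illustrative):

    ALTER SYSTEM SET primary_conninfo = 'host=newprimary port=5432';
    SELECT pg_reload_conf();   -- walreceiver restarts with the new value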
Thanks to Andres Freund, Kyotaro Horiguchi, Fujii Masao for discussion.
Author: Sergei Kornilov <sk@zsrv.org>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/19513901543181143@sas1-19a94364928d.qloud-c.yandex.net
Commit 3297308278 gave walreceiver the ability to create and use a
temporary replication slot, and made it controllable by a GUC (enabled
by default) that can be changed with SIGHUP. That's useful but has two
problems: one, it's possible to cause the origin server to fill its disk
if the slot doesn't advance in time; and also there's a disconnect
between state passed down via the startup process and GUCs that
walreceiver reads directly.
We handle the first problem by setting the option to disabled by
default. If the user enables it, it's on their head to make sure that
disk doesn't fill up.
We handle the second problem by passing the flag via startup rather than
having walreceiver acquire it directly, and making it PGC_POSTMASTER
(which ensures a walreceiver always has the fresh value). A future
commit can relax this (to PGC_SIGHUP again) by having the startup
process signal walreceiver to shutdown whenever the value changes.
Author: Sergei Kornilov <sk@zsrv.org>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/20200122055510.GH174860@paquier.xyz
Buildfarm experience shows what probably should've occurred to me before:
if a cache flush occurs partway through building a generic plan, then
the plansource may have is_valid = false even though the plan is valid.
We need to accept this case, use the generated plan, and then try to
replan the next time. We can't try to replan immediately, because that
would produce an infinite loop in CLOBBER_CACHE_ALWAYS builds; moreover
it's really overkill. (We can assume that the plan is valid, it's just
possibly a bit stale. Note that the pre-existing code behaved this way,
and the non-simple-expression code paths do too.) Conversely, not using
the generated plan would drop us into the not-a-simple-expression code
path, which is bad for performance and would also cause regression-test
failures due to visibly different error-reporting behavior.
Hence, refactor the validity-check functions so that the initial check
and recheck cases can react differently to plansource->is_valid.
This makes their usage a bit simpler, too.
Discussion: https://postgr.es/m/7072.1585332104@sss.pgh.pa.us
For relatively simple expressions (say, "x + 1" or "x > 0"), plpgsql's
management overhead exceeds the cost of evaluating the expression.
This patch substantially improves that situation, providing roughly
2X speedup for such trivial expressions.
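A sketch of the kind of trivial expression that benefits (the function
itself is hypothetical):

    CREATE FUNCTION inc(x int) RETURNS int LANGUAGE plpgsql AS $$
    BEGIN
      RETURN x + 1;   -- a "simple" expression: no table access needed
    END $$;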
First, add infrastructure in the plancache to allow fast re-validation
of cached plans that contain no table access, and hence need no locks.
Teach plpgsql to use this infrastructure for expressions that it's
already deemed "simple" (which in particular will never contain table
references).
The fast path still requires checking that search_path hasn't changed,
so provide a fast path for OverrideSearchPathMatchesCurrent by
counting changes that have occurred to the active search path in the
current session. This is simplistic but seems enough for now, seeing
that PushOverrideSearchPath is not used in any performance-critical
cases.
Second, manage the refcounts on simple expressions' cached plans using
a transaction-lifespan resource owner, so that we only need to take
and release an expression's refcount once per transaction not once per
expression evaluation. The management of this resource owner exactly
parallels the existing management of plpgsql's simple-expression EState.
Add some regression tests covering this area, in particular verifying
that expression caching doesn't break semantics for search_path changes.
Patch by me, but it owes something to previous work by Amit Langote,
who recognized that getting rid of plancache-related overhead would
be a useful thing to do here. Also thanks to Andres Freund for review.
Discussion: https://postgr.es/m/CAFj8pRDRVfLdAxsWeVLzCAbkLFZhW549K+67tpOc-faC8uH8zw@mail.gmail.com
Some platforms require libssl to be linked explicitly in the new
SSL test module. Borrow contrib/sslinfo's code for that.
Since src/test/modules/Makefile now has a variable SUBDIRS list,
it needs to follow the ALWAYS_SUBDIRS protocol for that (cf.
comments in Makefile.global.in).
Blindly try to fix MSVC build failures by adding PGDLLIMPORT.
The default hook function sets the default password callback function.
In order to allow preloaded libraries to have an opportunity to override
the default, TLS initialization is now delayed slightly until after
shared preloaded libraries have been loaded.
A test module is provided which contains a trivial example that decodes
an obfuscated password for an SSL certificate.
Author: Andrew Dunstan
Reviewed By: Andreas Karlsson, Asaba Takanori
Discussion: https://postgr.es/m/04116472-818b-5859-1d74-3d995aab2252@2ndQuadrant.com
This reverts the parts of commit 17a28b0364
that changed ereport's auxiliary functions from returning dummy integer
values to returning void. It turns out that a minority of compilers
complain (not entirely unreasonably) about constructs such as
(condition) ? errdetail(...) : 0
if errdetail() returns void rather than int. We could update those
call sites to say "(void) 0" perhaps, but the expectation for this
patch set was that ereport callers would not have to change anything.
And this aspect of the patch set was already the most invasive and
least compelling part of it, so let's just drop it.
Per buildfarm.
Discussion: https://postgr.es/m/CA+fd4k6N8EjNvZpM8nme+y+05mz-SM8Z_BgkixzkA34R+ej0Kw@mail.gmail.com
It was for unclear reasons defined in a separate location, which makes
it more cumbersome to override for testing, and it also did not have
any prominent documentation. Move to pg_config_manual.h, where
similar things are already collected.
The previous definition on the command-line had the effect of defining
it to the value 1, but now that we don't need that anymore we just
define it to empty, to simplify manual editing a bit.
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/b7053ba8-b008-5335-31de-2fe4fe41ef0f%402ndquadrant.com
Change all the auxiliary error-reporting routines to return void,
now that we no longer need to pretend they are passing something
useful to errfinish(). While this probably doesn't save anything
significant at the machine-code level, it allows detection of some
additional types of mistakes.
Pass the error location details (__FILE__, __LINE__, PG_FUNCNAME_MACRO)
to errfinish not errstart. This shaves a few cycles off the case where
errstart decides we're not going to emit anything.
Re-implement elog() as a trivial wrapper around ereport(), removing
the separate support infrastructure it used to have. Aside from
getting rid of some now-surplus code, this means that elog() now
really does have exactly the same semantics as ereport(), in particular
that it can skip evaluation work if the message is not to be emitted.
Andres Freund and Tom Lane
Discussion: https://postgr.es/m/CA+fd4k6N8EjNvZpM8nme+y+05mz-SM8Z_BgkixzkA34R+ej0Kw@mail.gmail.com
Now that we require C99, we can depend on __VA_ARGS__ to work, and
revising ereport() to use it has several significant benefits:
* The extra parentheses around the auxiliary function calls are now
optional. Aside from being a bit less ugly, this removes a common
gotcha for new contributors, because in some cases the compiler errors
you got from forgetting them were unintelligible.
* The auxiliary function calls are now evaluated as a comma expression
list rather than as extra arguments to errfinish(). This means that
compilers can be expected to warn about no-op expressions in the list,
allowing detection of several other common mistakes such as forgetting
to add errmsg(...) when converting an elog() call to ereport().
* Unlike the situation with extra function arguments, comma expressions
are guaranteed to be evaluated left-to-right, so this removes platform
dependency in the order of the auxiliary function calls. While that
dependency hasn't caused us big problems in the past, this change does
allow dropping some rather shaky assumptions around errcontext() domain
handling.
There's no intention to make wholesale changes of existing ereport
calls, but as proof-of-concept this patch removes the extra parens
from a couple of calls in postgres.c.
While new code can be written either way, code intended to be
back-patched will need to use extra parens for awhile yet. It seems
worth back-patching this change into v12, so as to reduce the window
where we have to be careful about that by one year. Hence, this patch
is careful to preserve ABI compatibility; a followup HEAD-only patch
will make some additional simplifications.
Andres Freund and Tom Lane
Discussion: https://postgr.es/m/CA+fd4k6N8EjNvZpM8nme+y+05mz-SM8Z_BgkixzkA34R+ej0Kw@mail.gmail.com
It previously only supported NFKC, for use by SASLprep. This expands
the API to offer the choice of all four normalization forms. Right
now, there are no internal users of the forms other than NFKC.
Reviewed-by: Daniel Verite <daniel@manitou-mail.org>
Reviewed-by: Andreas Karlsson <andreas@proxel.se>
Discussion: https://www.postgresql.org/message-id/flat/c1909f27-c269-2ed9-12f8-3ab72c8caf7a@2ndquadrant.com
Previously if a promotion was triggered while recovery was paused,
the paused state continued. Also recovery could be paused by executing
pg_wal_replay_pause() even while a promotion was ongoing. That is,
recovery pause took priority over a standby promotion.
But this behavior was not desirable because most users basically wanted
the recovery to complete as soon as possible and the server to become
the master when they requested a promotion.
This commit changes recovery so that it prefers a promotion over
recovery pause. That is, if a promotion is triggered while recovery
is paused, the paused state ends and a promotion continues. Also
this commit makes recovery pause functions like pg_wal_replay_pause()
throw an error if they are executed while a promotion is ongoing.
Internally, this commit adds a new function PromoteIsTriggered() that
returns true if a promotion is triggered. Since the name of this function
and that of the existing function IsPromoteTriggered() are confusingly
similar, the commit renames IsPromoteTriggered() to IsPromoteSignaled(),
which is a more appropriate name.
Author: Fujii Masao
Reviewed-by: Atsushi Torikoshi, Sergei Kornilov
Discussion: https://postgr.es/m/00c194b2-dbbb-2e8a-5b39-13f14048ef0a@oss.nttdata.com
restore_command has only been used until now by the backend, but there
is a pending patch for pg_rewind to make use of that in the frontend.
Author: Alexey Kondratov
Reviewed-by: Andrey Borodin, Andres Freund, Alvaro Herrera, Alexander
Korotkov, Michael Paquier
Discussion: https://postgr.es/m/a3acff50-5a0d-9a2c-b3b2-ee36168955c1@postgrespro.ru
This commit introduces new wait events BackupWaitWalArchive and
RecoveryPause. The former is reported while waiting for the WAL files
required for the backup to be successfully archived. The latter is
reported while waiting for recovery in pause state to be resumed.
Author: Fujii Masao
Reviewed-by: Michael Paquier, Atsushi Torikoshi, Robert Haas
Discussion: https://postgr.es/m/f0651f8c-9c96-9f29-0ff9-80414a15308a@oss.nttdata.com
Previously 0 was reported in pg_stat_progress_basebackup.backup_total
if the total backup size was not estimated. Per discussion, our consensus
is that NULL is a better choice as the value of backup_total in that case.
So this commit makes the pg_stat_progress_basebackup view report NULL
in the backup_total column if the estimation is disabled.
Bump catversion.
Author: Fujii Masao
Reviewed-by: Amit Langote, Magnus Hagander, Alvaro Herrera
Discussion: https://postgr.es/m/CABUevExnhOD89zBDuPvfAAh243RzNpwCPEWNLtMYpKHMB8gbAQ@mail.gmail.com
This reverts commit b7f64c6, which broke the fallback implementation for
C++. We have discussed a couple of alternatives to reduce the number of
implementations for those asserts, but nothing that allows reducing the
number of implementations to three instead of four, so there is no
benefit in keeping this patch.
Thanks to Tom Lane for the discussion.
Discussion: https://postgr.es/m/20200313115033.GA183471@paquier.xyz
headerscheck and cpluspluscheck should skip the recently-added
cmdtaglist.h header, since (like kwlist.h and some other similarly-
designed headers) it's not meant to be included standalone.
evtcache.h was missing an #include to support its usage of Bitmapset.
typecmds.h was missing an #include to support its usage of ParseState.
The first two of these were evidently oversights in commit 2f9661311.
I didn't track down exactly which change broke typecmds.h, but it
must have been some rearrangement in one of its existing inclusions,
because it's referenced ParseState for quite a long time and there
were not complaints from these checking programs before.
Until now, only selected bulk operations (e.g. COPY) did this. If a
given relfilenode received both a WAL-skipping COPY and a WAL-logged
operation (e.g. INSERT), recovery could lose tuples from the COPY. See
src/backend/access/transam/README section "Skipping WAL for New
RelFileNode" for the new coding rules. Maintainers of table access
methods should examine that section.
To maintain data durability, just before commit, we choose between an
fsync of the relfilenode and copying its contents to WAL. A new GUC,
wal_skip_threshold, guides that choice. If this change slows a workload
that creates small, permanent relfilenodes under wal_level=minimal, try
adjusting wal_skip_threshold. Users setting a timeout on COMMIT may
need to adjust that timeout, and log_min_duration_statement analysis
will reflect time consumption moving to COMMIT from commands like COPY.
Internally, this requires a reliable determination of whether
RollbackAndReleaseCurrentSubTransaction() would unlink a relation's
current relfilenode. Introduce rd_firstRelfilenodeSubid. Amend the
specification of rd_createSubid such that the field is zero when a new
rel has an old rd_node. Make relcache.c retain entries for certain
dropped relations until end of transaction.
Back-patch to 9.5 (all supported versions). This introduces a new WAL
record type, XLOG_GIST_ASSIGN_LSN, without bumping XLOG_PAGE_MAGIC. As
always, update standby systems before master systems. This changes
sizeof(RelationData) and sizeof(IndexStmt), breaking binary
compatibility for affected extensions. (The most recent commit to
affect the same class of extensions was
089e4d405d0f3b94c74a2c6a54357a84a681754b.)
Kyotaro Horiguchi, reviewed (in earlier, similar versions) by Robert
Haas. Heikki Linnakangas and Michael Paquier implemented earlier
designs that materially clarified the problem. Reviewed, in earlier
designs, by Andrew Dunstan, Andres Freund, Alvaro Herrera, Tom Lane,
Fujii Masao, and Simon Riggs. Reported by Martijn van Oosterhout.
Discussion: https://postgr.es/m/20150702220524.GA9392@svana.org
Remove an obsolete comment from AtEOXact_cleanup(). Restore formatting
of a comment in struct RelationData, mangled by the pgindent run in
commit 9af4159fce. Back-patch to 9.5 (all
supported versions), because another fix stacks on this.
This is required as it is no safer for two related processes to extend the
same relation at a time than for unrelated processes to do the same. We
don't acquire a heavyweight lock on any other object after a relation
extension lock, which means such a lock can never participate in a
deadlock cycle. So, avoid checking wait edges from this lock.
This provides an infrastructure to allow parallel operations like insert,
copy, etc. which were earlier not possible as parallel group members won't
conflict for relation extension lock.
Author: Dilip Kumar, Amit Kapila
Reviewed-by: Amit Kapila, Kuntal Ghosh and Sawada Masahiko
Discussion: https://postgr.es/m/CAD21AoCmT3cFQUN4aVvzy5chw7DuzXrJCbrjTU05B+Ss=Gn1LA@mail.gmail.com
This patch adds the pseudo-types anycompatible, anycompatiblearray,
anycompatiblenonarray, and anycompatiblerange. They work much like
anyelement, anyarray, anynonarray, and anyrange respectively, except
that the actual input values need not match precisely in type.
Instead, if we can find a common supertype (using the same rules
as for UNION/CASE type resolution), then the parser automatically
promotes the input values to that type. For example,
"myfunc(anycompatible, anycompatible)" can match a call with one
integer and one bigint argument, with the integer automatically
promoted to bigint. With anyelement in the definition, the user
would have had to cast the integer explicitly.
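A minimal sketch (the function body is illustrative):

    CREATE FUNCTION mymax(anycompatible, anycompatible)
      RETURNS anycompatible LANGUAGE sql
      AS 'SELECT greatest($1, $2)';
    SELECT mymax(1, 2::bigint);  -- int argument promoted; returns bigint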
The new types also provide a second, independent set of type variables
for function matching; thus with "myfunc(anyelement, anyelement,
anycompatible) returns anycompatible" the first two arguments are
constrained to be the same type, but the third can be some other
type, and the result has the type of the third argument. The need
for more than one set of type variables was foreseen back when we
first invented the polymorphic types, but we never did anything
about it.
Pavel Stehule, revised a bit by me
Discussion: https://postgr.es/m/CAFj8pRDna7VqNi8gR+Tt2Ktmz0cq5G93guc3Sbn_NVPLdXAkqA@mail.gmail.com
This by itself doesn't change any functionality but prepares the way
for having relations other than base tables in publications.
Make arrangements for COPY handling the initial table sync. For
non-tables we have to use COPY (SELECT ...) instead of directly
copying from the table, but then we have to take care to omit
generated columns from the column list.
Also, remove a hardcoded reference to relkind = 'r' and rely on the
publisher to send only what it can actually publish, which will be
correct even in future cross-version scenarios.
Reviewed-by: Amit Langote <amitlangote09@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CA+HiwqH=Y85vRK3mOdjEkqFK+E=ST=eQiHdpj43L=_eJMOOznQ@mail.gmail.com
This commit renames RecoveryWalAll and RecoveryWalStream wait events to
RecoveryWalStream and RecoveryRetrieveRetryInterval, respectively,
in order to make the names more consistent with what they actually report.
For example, previously RecoveryWalAll was reported as a wait event while
recovery was waiting for WAL from a stream, which was confusing because the
name was very different from the situation in which the wait could actually
happen.
The names of macro variables for those wait events also are renamed
accordingly.
This commit also changes the category of RecoveryRetrieveRetryInterval to
Timeout from Activity because the wait event is reported while waiting based
on wal_retrieve_retry_interval.
Author: Fujii Masao
Reviewed-by: Kyotaro Horiguchi, Atsushi Torikoshi
Discussion: https://postgr.es/m/124997ee-096a-5d09-d8da-2c7a57d0816e@oss.nttdata.com
While performing hash aggregation, track memory usage when adding new
groups to a hash table. If the memory usage exceeds work_mem, enter
"spill mode".
In spill mode, new groups are not created in the hash table(s), but
existing groups continue to be advanced if input tuples match. Tuples
that would cause a new group to be created are instead spilled to a
logical tape to be processed later.
The tuples are spilled in a partitioned fashion. When all tuples from
the outer plan are processed (either by advancing the group or
spilling the tuple), finalize and emit the groups from the hash
table. Then, create new batches of work from the spilled partitions,
and select one of the saved batches and process it (possibly spilling
recursively).
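A sketch that can exhibit spilling (sizes illustrative; whether the
planner picks a hash aggregate still depends on costing):

    SET work_mem = '1MB';
    EXPLAIN ANALYZE
      SELECT g % 100000 AS k, count(*)
        FROM generate_series(1, 5000000) g
       GROUP BY k;
    -- a HashAggregate that exceeds work_mem now reports disk usage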
Author: Jeff Davis
Reviewed-by: Tomas Vondra, Adam Lee, Justin Pryzby, Taylor Vesely, Melanie Plageman
Discussion: https://postgr.es/m/507ac540ec7c20136364b5272acbcd4574aa76ef.camel@j-davis.com
An AllocSet doubles the size of allocated blocks (up to maxBlockSize),
which means that the current block can represent half of the total
allocated space for the memory context. But the free space in the
current block may never have been touched, so don't count the
untouched memory as allocated for the purposes of
MemoryContextMemAllocated().
Discussion: https://postgr.es/m/ec63d70b668818255486a83ffadc3aec492c1f57.camel@j-davis.com
Add an assertion to ensure that we never acquire another heavyweight
lock while holding a relation extension lock.
The only exception to the rule is that we can try to acquire the same
relation extension lock more than once. This is allowed as we are not
creating any new lock for this case. This restriction implies that the
relation extension lock won't ever participate in the deadlock cycle
because we can never wait for any other heavyweight lock after acquiring
this lock.
Such a restriction is okay for relation extension locks as unlike other
heavyweight locks these are not held till the transaction end. These are
taken for a short duration to extend a particular relation and then
released.
Author: Dilip Kumar, with a few changes by Amit Kapila
Reviewed-by: Amit Kapila, Kuntal Ghosh and Sawada Masahiko
Discussion: https://postgr.es/m/CAD21AoCmT3cFQUN4aVvzy5chw7DuzXrJCbrjTU05B+Ss=Gn1LA@mail.gmail.com
pg_proc.c and pg_aggregate.c had near-duplicate copies of the logic
to decide whether a function or aggregate's signature is legal.
This seems like a bad thing even without the problem that the
upcoming "anycompatible" patch would roughly double the complexity
of that logic. Hence, refactor so that the rules are localized
in new subroutines supplied by parse_coerce.c. (One could quibble
about just where to add that code, but putting it beside
enforce_generic_type_consistency seems not totally unreasonable.)
The fact that the rules are about to change would mandate some
changes in the wording of the associated error messages in any case.
I ended up spelling things out in a fairly literal fashion in the
errdetail messages, eg "A result of type anyelement requires at
least one input of type anyelement, anyarray, anynonarray, anyenum,
or anyrange." Perhaps this is overkill, but once there's more than
one subgroup of polymorphic types, people might get confused by
more-abstract messages.
Discussion: https://postgr.es/m/24137.1584139352@sss.pgh.pa.us
It devolved into a content-less wrapper over read_local_xlog_page, with
nothing to add, plus it's easily confused with walsender's
logical_read_xlog_page. There doesn't seem to be any reason for it to
stay.
src/include/replication/logicalfuncs.h becomes empty, so remove it too.
The prototypes it initially had were absorbed by generated fmgrprotos.h.
Discussion: https://postgr.es/m/20191115214102.GA15616@alvherre.pgsql
This extends the fixes made in commit 085b6b667 to other SRFs with the
same bug, namely pg_logdir_ls(), pgrowlocks(), pg_timezone_names(),
pg_ls_dir(), and pg_tablespace_databases().
Also adjust various comments and documentation to warn against
expecting to clean up resources during a ValuePerCall SRF's final
call.
Back-patch to all supported branches, since these functions were
all born broken.
Justin Pryzby, with cosmetic tweaks by me
Discussion: https://postgr.es/m/20200308173103.GC1357@telsasoft.com
Introduce a GUC and a tablespace option to control I/O prefetching, much
like effective_io_concurrency, but for work that is done on behalf of
many client sessions.
Use the new setting in heapam.c instead of the hard-coded formula
effective_io_concurrency + 10 introduced by commit 558a9165e0. Go with
a default value of 10 for now, because it's a round number pretty close
to the value used for that existing case.
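Assuming the new setting is named maintenance_io_concurrency, with a
matching tablespace option, usage looks like this (the tablespace name
is hypothetical):
    SET maintenance_io_concurrency = 100;
    ALTER TABLESPACE fast_ssd SET (maintenance_io_concurrency = 100);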
Discussion: https://postgr.es/m/CA%2BhUKGJUw08dPs_3EUcdO6M90GnjofPYrWp4YSLaBkgYwS-AqA%40mail.gmail.com
The effective_io_concurrency GUC and equivalent tablespace option were
previously passed through a formula based on a theory about RAID
spindles and probabilities, to arrive at the number of pages to prefetch
in bitmap heap scans. Tomas Vondra, Andres Freund and others argued
that it was anachronistic and hard to justify, and commit 558a9165e0
already started down the path of bypassing it in new code. We agreed to
drop that logic and use the value directly.
For the default setting of 1, there is no change in effect. Higher
settings can be converted from the old meaning to the new with:
select round(sum(OLD / n::float)) from generate_series(1, OLD) s(n);
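For instance, an old setting of 10 converts to 29:
    select round(sum(10 / n::float)) from generate_series(1, 10) s(n);  -- 29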
We might want to consider renaming the GUC before the next release given
the change in meaning, but it's not clear that many users had set it
very carefully anyway. That decision is deferred for now.
Discussion: https://postgr.es/m/CA%2BhUKGJUw08dPs_3EUcdO6M90GnjofPYrWp4YSLaBkgYwS-AqA%40mail.gmail.com
resolve_polymorphic_tupdesc() and resolve_polymorphic_argtypes() failed to
cover the case of having to resolve anyarray given only an anyrange input.
The bug was masked if anyelement was also used (as either input or
output), which probably helps account for our not having noticed.
While looking at this I noticed that resolve_generic_type() would produce
the wrong answer if asked to make that same resolution. ISTM that
resolve_generic_type() is confusingly defined and overly complex, so
rather than fix it, let's just make funcapi.c do the specific lookups
it requires for itself.
With this change, resolve_generic_type() is not used anywhere, so remove
it in HEAD. In the back branches, leave it alone (complete with bug)
just in case any external code is using it.
While we're here, make some other refactoring adjustments in funcapi.c
with an eye to upcoming future expansion of the set of polymorphic types:
* Simplify quick-exit tests by adding an overall have_polymorphic_result
flag. This is about a wash now but will be a win when there are more
flags.
* Reduce duplication of code between resolve_polymorphic_tupdesc() and
resolve_polymorphic_argtypes().
* Don't bother to validate correct matching of anynonarray or anyenum;
the parser should have done that, and even if it didn't, just doing
"return false" here would lead to a very confusing, off-point error
message. (Really, "return false" in these two functions should only
occur if the call_expr isn't supplied or we can't obtain data type
info from it.)
* For the same reason, throw an elog rather than "return false" if
we fail to resolve a polymorphic type.
The bug's been there since we added anyrange, so back-patch to
all supported branches.
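The previously-failing case can be exercised with a function that has
an anyrange input and only an anyarray output, e.g. (names
hypothetical):
    CREATE FUNCTION range_bounds(r anyrange) RETURNS anyarray
    LANGUAGE plpgsql AS $$
    BEGIN
      RETURN ARRAY[lower(r), upper(r)];
    END $$;
    SELECT range_bounds(int4range(1, 10));  -- {1,10}, resolved as integer[]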
Discussion: https://postgr.es/m/6093.1584202130@sss.pgh.pa.us
Commit 8f321bd16c added support for estimating ScalarArrayOpExpr
(IN/ANY) clauses using functional dependencies. There's no good reason
not to support estimation of these clauses using multi-variate MCV
lists too, so this commit implements that. That makes the behavior
consistent, and MCV lists can estimate all variants (ANY/ALL,
inequalities, ...).
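For instance (table and statistics names hypothetical), an IN clause
on correlated columns can now be estimated from a multi-column MCV
list:
    CREATE TABLE t (a int, b int);
    INSERT INTO t SELECT mod(i, 10), mod(i, 10)
      FROM generate_series(1, 100000) i;
    CREATE STATISTICS s (mcv) ON a, b FROM t;
    ANALYZE t;
    EXPLAIN SELECT * FROM t WHERE a IN (1, 2, 3) AND b = 1;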
Author: Tomas Vondra
Review: Dean Rasheed
Discussion: https://www.postgresql.org/message-id/flat/13902317.Eha0YfKkKy%40pierred-pdoc
Use the new MyBackendType instead. More similar changes for other "am
something" variables are possible. This one was just particularly
simple.
Reviewed-by: Julien Rouhaud <rjuju123@gmail.com>
Reviewed-by: Kuntal Ghosh <kuntalghosh.2007@gmail.com>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Discussion: https://www.postgresql.org/message-id/flat/c65e5196-4f04-4ead-9353-6088c19615a3@2ndquadrant.com
Add a new global variable MyBackendType that uses the same BackendType
enum that was previously only used by the stats collector. That way
several duplicate ways of checking what type a particular process is
can be simplified. Since it's no longer just for stats, move to
miscinit.c and rename existing functions to match the expanded
purpose.
Reviewed-by: Julien Rouhaud <rjuju123@gmail.com>
Reviewed-by: Kuntal Ghosh <kuntalghosh.2007@gmail.com>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Discussion: https://www.postgresql.org/message-id/flat/c65e5196-4f04-4ead-9353-6088c19615a3@2ndquadrant.com
If an index was explicitly set as replica identity index, this setting
was lost when a table was rewritten by ALTER TABLE. Because this
setting is part of pg_index but actually controlled by ALTER
TABLE (not part of CREATE INDEX, say), we have to do some extra work
to restore it.
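A sketch of the previously-broken sequence (names hypothetical):
    CREATE TABLE t (id int NOT NULL);
    CREATE UNIQUE INDEX t_id_idx ON t (id);
    ALTER TABLE t REPLICA IDENTITY USING INDEX t_id_idx;
    ALTER TABLE t ALTER COLUMN id TYPE bigint;  -- forces a table rewrite
    SELECT indisreplident FROM pg_index
      WHERE indexrelid = 't_id_idx'::regclass;  -- now stays true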
Based-on-patch-by: Quan Zongliang <quanzongliang@gmail.com>
Reviewed-by: Euler Taveira <euler.taveira@2ndquadrant.com>
Discussion: https://www.postgresql.org/message-id/flat/c70fcab2-4866-0d9f-1d01-e75e189db342@gmail.com
This commit refactors and simplifies the definitions of StaticAssertStmt,
StaticAssertExpr and StaticAssertDecl. By unifying the C and C++
fallback implementations, this reduces the number of different
implementations from four to three.
Author: Michael Paquier
Reviewed-by: Georgios Kokolatos, Tom Lane
Discussion: https://postgr.es/m/20200204081503.GF2287@paquier.xyz
The init_ps_display() arguments were mostly lies by now, so to match
typical usage, just use one argument and let the caller assemble it
from multiple sources if necessary. The only user of the additional
arguments is BackendInitialize(), which was already doing string
assembly on the caller side anyway.
Remove the second argument of set_ps_display() ("force") and just
handle that in init_ps_display() internally.
BackendInitialize() also used to set the initial status as
"authentication", but that was very far from where authentication
actually happened. So now it's set to "initializing" and then
"authentication" just before the actual call to
ClientAuthentication().
Reviewed-by: Julien Rouhaud <rjuju123@gmail.com>
Reviewed-by: Kuntal Ghosh <kuntalghosh.2007@gmail.com>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Discussion: https://www.postgresql.org/message-id/flat/c65e5196-4f04-4ead-9353-6088c19615a3@2ndquadrant.com
If the command is attempted for an extension that the object already
depends on, silently do nothing.
In particular, this means that if a database containing multiple such
entries is dumped, the restore will silently do the right thing and
record just the first one. (At least, in a world where pg_dump does
dump such entries -- which it doesn't currently, but it will.)
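For example (object and extension names hypothetical), repeating the
command is now a no-op rather than recording a duplicate dependency:
    ALTER INDEX my_idx DEPENDS ON EXTENSION my_ext;
    ALTER INDEX my_idx DEPENDS ON EXTENSION my_ext;  -- silently does nothing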
Backpatch to 9.6, where this kind of dependency was introduced.
Reviewed-by: Ibrar Ahmed, Tom Lane (offlist)
Discussion: https://postgr.es/m/20200217225333.GA30974@alvherre.pgsql
The code around InitPostmasterChild() from commit 31c453165b somehow
ended up in the middle of a block of code related to "User ID state".
Move it into its own block instead.
Previously, hard links were not used on Windows and Cygwin, but they
support them just fine in currently supported OS versions, so we can
use them there as well.
Since all supported platforms now support hard links, we can remove
the alternative code paths.
Rename durable_link_or_rename() to durable_rename_excl() to make the
purpose more clear without referencing the implementation details.
Discussion: https://www.postgresql.org/message-id/flat/72fff73f-dc9c-4ef4-83e8-d2e60c98df48%402ndquadrant.com
This catalog-handling code was previously together with the rest of
CastCreate() in src/backend/commands/functioncmds.c. A future patch
will need a way to add casts internally, so this will be useful to have
separate.
Also, move the nearby get_cast_oid() function from functioncmds.c to
lsyscache.c, which seems a more natural place for it.
Author: Paul Jungwirth, minor edits by Álvaro
Discussion: https://postgr.es/m/20200309210003.GA19992@alvherre.pgsql
This removes another relic from the old nmake-based Windows build.
version_stamp.pl put version number information into win32ver.rc. But
win32ver.rc already gets other version number information from the
preprocessor at build time, so it makes more sense to handle all
version number information the same way instead of having two places
that do it.
What we need for this is having the major version number and the minor
version number as separate integer symbols. Both configure and
Solution.pm already have that logic, because they compute
PG_VERSION_NUM. So we just keep all the logic there now. Put the
minor version number into a new symbol PG_MINORVERSION_NUM. Also, add
a symbol PG_MAJORVERSION_NUM, which is a number, alongside the
existing PG_MAJORVERSION, which is a string.
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/1ee46ac4-a9b2-4531-bf54-5ec2e374634d@2ndquadrant.com
The need for this was removed by
8b9e9644dc.
A number of files now need to include utils/acl.h or
parser/parse_node.h explicitly where they previously got it indirectly
somehow.
Since parser/parse_node.h already includes nodes/parsenodes.h, the
latter is then removed where the former was added. Also, remove
nodes/pg_list.h from objectaddress.h, since that's included via
nodes/parsenodes.h.
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Discussion: https://www.postgresql.org/message-id/flat/7601e258-26b2-8481-36d0-dc9dca6f28f1%402ndquadrant.com
When a partitioned table is added to a publication, changes of all of
its partitions (current or future) are published via that publication.
This change only affects which tables a publication considers as its
members. The receiving side still sees the data coming from the
individual leaf partitions. So existing restrictions that partition
hierarchies can only be replicated one-to-one are not changed by this.
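For example (names hypothetical):
    CREATE TABLE measurements (ts timestamptz, v numeric)
      PARTITION BY RANGE (ts);
    CREATE TABLE measurements_2020 PARTITION OF measurements
      FOR VALUES FROM ('2020-01-01') TO ('2021-01-01');
    -- publishes all current and future partitions of measurements:
    CREATE PUBLICATION pub FOR TABLE measurements;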
Author: Amit Langote <amitlangote09@gmail.com>
Reviewed-by: Rafia Sabih <rafia.pghackers@gmail.com>
Reviewed-by: Peter Eisentraut <peter.eisentraut@2ndquadrant.com>
Discussion: https://www.postgresql.org/message-id/flat/CA+HiwqH=Y85vRK3mOdjEkqFK+E=ST=eQiHdpj43L=_eJMOOznQ@mail.gmail.com
Such indexes can only be duplicated leftovers of a previously failed
REINDEX CONCURRENTLY command, and a valid equivalent is guaranteed to
exist. As toast indexes can only be dropped if invalid, reindexing
these would lead to useless duplicated indexes that can't be dropped
anymore, except if the parent relation is dropped.
Thanks to Justin Pryzby for reminding us that this problem was
reported long ago during the review of the original REINDEX
CONCURRENTLY patch, but was never addressed.
Reported-by: Sergei Kornilov, Justin Pryzby
Author: Julien Rouhaud
Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/36712441546604286%40sas1-890ba5c2334a.qloud-c.yandex.net
Discussion: https://postgr.es/m/20200216190835.GA21832@telsasoft.com
Backpatch-through: 12
Increases the number of tapes in a logical tape set. This will be
important for disk-based hash aggregation, because the maximum number
of tapes is not known ahead of time.
While discussing this change, it was observed to regress the
performance of Sort for at least one test case. The regression occurs
because some versions of GCC switch to an inlined version of memcpy()
in LogicalTapeWrite() after this change. No regression was observed
with clang.
Because the regression is due to an arbitrary decision by the
compiler, I decided it shouldn't hold up this change. If it needs to
be fixed, we can find a workaround.
Author: Adam Lee, Jeff Davis
Discussion: https://postgr.es/m/e54bfec11c59689890f277722aaaabd05f78e22c.camel%40j-davis.com
SQL includes provisions for numeric Unicode escapes in string
literals and identifiers. Previously we only accepted those
if they represented ASCII characters or the server encoding
was UTF-8, making the conversion to internal form trivial.
This patch adjusts things so that we'll call the appropriate
encoding conversion function in less-trivial cases, allowing
the escape sequence to be accepted so long as it corresponds
to some character available in the server encoding.
This also applies to processing of Unicode escapes in JSONB.
However, the old restriction still applies to client-side
JSON processing, since that hasn't got access to the server's
encoding conversion infrastructure.
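For example, with a server encoding such as WIN1251 that includes
Cyrillic, an escape like this is now accepted instead of rejected:
    SELECT U&'\0441\043B\043E\043D';  -- the Cyrillic word 'слон'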
This patch includes some lexer infrastructure that simplifies
throwing errors with error cursors pointing into the middle of
a string (or other complex token). For the moment I only used
it for errors relating to Unicode escapes, but we might later
expand the usage to some other cases.
Patch by me, reviewed by John Naylor.
Discussion: https://postgr.es/m/2393.1578958316@sss.pgh.pa.us
Specifically, this patch allows ALTER TYPE to:
* Change the default TOAST strategy for a toastable base type;
* Promote a non-toastable type to toastable;
* Add/remove binary I/O functions for a type;
* Add/remove typmod I/O functions for a type;
* Add/remove a custom ANALYZE statistics function for a type.
The first of these can be done by the type's owner; all the others
require superuser privilege since misuse could cause problems.
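For illustration (type and function names hypothetical), the new
forms look like:
    -- change the default TOAST strategy (the type's owner may do this):
    ALTER TYPE mytype SET (storage = external);
    -- add binary I/O functions (superuser only):
    ALTER TYPE mytype SET (receive = mytype_recv, send = mytype_send);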
The main motivation for this patch is to allow extensions to
upgrade the feature sets of their data types, so the set of
alterable properties is biased towards that use-case. However
it's also true that changing some other properties would be
a lot harder, as they get baked into physical storage and/or
stored expressions that depend on the type.
Along the way, refactor GenerateTypeDependencies() to make it easier
to call, refactor DefineType's volatility checks so they can be shared
by AlterType, and teach typcache.c that it might have to reload data
from the type's pg_type row, a scenario it never handled before.
Also rearrange alter_type.sgml a bit for clarity (put the
composite-type operations together).
Tomas Vondra and Tom Lane
Discussion: https://postgr.es/m/20200228004440.b23ein4qvmxnlpht@development
A long time ago, it was necessary to declare datatype I/O functions,
triggers, and language handler support functions in a very type-unsafe
way involving a single pseudo-type "opaque". We got rid of those
conventions in 7.3, but there was still support in various places to
automatically convert such functions to the modern declaration style,
to be able to transparently re-load dumps from pre-7.3 servers.
It seems unnecessary to continue to support that anymore, so take out
the hacks; whereupon the "opaque" pseudo-type itself is no longer
needed and can be dropped.
This is part of a group of patches removing various server-side kluges
for transparently upgrading pre-8.0 dump files. Since we've had few
complaints about dropping pg_dump's support for dumping from pre-8.0
servers (commit 64f3524e2), it seems okay to now remove these kluges.
Discussion: https://postgr.es/m/4110.1583255415@sss.pgh.pa.us
This does not matter much when compiling Postgres proper as many
warnings exist when enabling this compilation flag, but it can be
annoying for external modules willing to use both.
Author: David Steele
Discussion: https://postgr.es/m/91d86c8a-11fc-7b88-43eb-5ca3f6fb8bd3@pgmasters.net
Optionally push a step to check for a NULL pointer to the pergroup
state.
This will be important for disk-based hash aggregation in combination
with grouping sets. When memory limits are reached, a given tuple may
find its per-group state for some grouping sets but not others. For
the former, it advances the per-group state as normal; for the latter,
it skips evaluation and the calling code will have to spill the tuple
and reprocess it in a later batch.
Add the NULL check as a separate expression step because in some
common cases it's not needed.
Discussion: https://postgr.es/m/20200221202212.ssb2qpmdgrnx52sj%40alap3.anarazel.de
Our usual practice for "poor man's enum" catalog columns is to define
macros for the possible values and use those, not literal constants,
in C code. But for some reason lost in the mists of time, this was
never done for typalign/attalign or typstorage/attstorage. It's never
too late to make it better though, so let's do that.
The reason I got interested in this right now is the need to duplicate
some uses of the TYPSTORAGE constants in an upcoming ALTER TYPE patch.
But in general, this sort of change aids greppability and readability,
so it's a good idea even without any specific motivation.
I may have missed a few places that could be converted, and it's even
more likely that pending patches will re-introduce some hard-coded
references. But that's not fatal --- there's no expectation that
we'd actually change any of these values. We can clean up stragglers
over time.
Discussion: https://postgr.es/m/16457.1583189537@sss.pgh.pa.us
to_char() has long allowed the TM (translation mode) prefix to
specify output of translated month or day names; but that prefix
had no effect in input format strings. Now it does. to_date()
and to_timestamp() will now recognize the same month or day names
that to_char() would output for the same format code. Matching is
case-insensitive (per the active collation's notion of what that
means), just as it has always been for English month/day names
without the TM prefix.
(As per the discussion thread, there are lots of cases that this
feature will not handle, such as alternate day names. But being
able to accept what to_char() will output seems useful enough.)
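For example, assuming a Spanish locale is installed and selected via
lc_time (the locale name is system-dependent):
    SET lc_time = 'es_ES.utf8';
    SELECT to_date('14 Enero 2020', 'DD TMMonth YYYY');  -- 2020-01-14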
In passing, fix some shaky English and violations of message
style guidelines in jsonpath errors for the .datetime() method,
which depends on this code.
Juan José Santamaría Flecha, reviewed and modified by me,
with other commentary from Alvaro Herrera, Tomas Vondra,
Arthur Zakirov, Peter Eisentraut, Mark Dilger.
Discussion: https://postgr.es/m/CAC+AXB3u1jTngJcoC1nAHBf=M3v-jrEfo86UFtCqCjzbWS9QhA@mail.gmail.com
This commit adds pg_stat_progress_basebackup view that reports
the progress while an application like pg_basebackup is taking
a base backup. This uses the progress reporting infrastructure
added by c16dc1aca5, adding support for streaming base backup.
Bump catversion.
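While a base backup is running, progress can be watched with
something like:
    SELECT pid, phase, backup_streamed, backup_total
    FROM pg_stat_progress_basebackup;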
Author: Fujii Masao
Reviewed-by: Kyotaro Horiguchi, Amit Langote, Sergei Kornilov
Discussion: https://postgr.es/m/9ed8b801-8215-1f3d-62d7-65bff53f6e94@oss.nttdata.com
The backend was using strings to represent command tags and doing string
comparisons in multiple places, but that's slow and unhelpful. Create a
new command list with a supporting structure to use instead; this is
stored in a tag-list-file that can be tailored to specific purposes with
a caller-definable C macro, similar to what we do for WAL resource
managers. The first such uses are a new CommandTag enum and a
CommandTagBehavior struct.
Replace numerous occurrences of char *completionTag with a
QueryCompletion struct so that the code no longer stores information
about completed queries in a cstring. Only at the last moment, in
EndCommand(), does this get converted to a string.
EventTriggerCacheItem no longer holds an array of palloc'd tag strings
in sorted order, but rather just a Bitmapset over the CommandTags.
Author: Mark Dilger, with unsolicited help from Álvaro Herrera
Reviewed-by: John Naylor, Tom Lane
Discussion: https://postgr.es/m/981A9DB4-3F0C-4DA5-88AD-CB9CFF4D6CAD@enterprisedb.com
Add a cast to size_t to silence "comparison between signed and unsigned
integer expressions" cpluspluscheck warning.
Reported-By: Tom Lane
Discussion: https://postgr.es/m/7971.1583171266@sss.pgh.pa.us
Such an access became possible when commit 246a6c8f7 added more
aggressive cleanup of orphaned temp relations by autovacuum.
Since autovacuum's snapshot might be slightly stale, it could
attempt to access an already-dropped temp namespace, resulting in
an assertion failure or null-pointer dereference. (In practice,
since we don't drop temp namespaces automatically but merely
recycle them, this situation could only arise if a superuser does
a manual drop of a temp namespace. Still, that should be allowed.)
The core of the bug, IMO, is that isTempNamespaceInUse and its callers
failed to think hard about whether to treat "temp namespace isn't there"
differently from "temp namespace isn't in use". In hopes of forestalling
future mistakes of the same ilk, replace that function with a new one
checkTempNamespaceStatus, which makes the same tests but returns a
three-way enum rather than just a bool. isTempNamespaceInUse is gone
entirely in HEAD; but just in case some external code is relying on it,
keep it in the back branches, as a bug-compatible wrapper around the
new function.
Per report originally from Prabhat Kumar Sahu, investigated by Mahendra
Singh and Michael Paquier; the final form of the patch is my fault.
This replaces the failed fix attempt in a052f6cbb.
Backpatch as far as v11, as 246a6c8f7 was.
Discussion: https://postgr.es/m/CAKYtNAr9Zq=1-ww4etHo-VCC-k120YxZy5OS01VkaLPaDbv2tg@mail.gmail.com
This also involves renaming src/include/utils/hashutils.h, which
becomes src/include/common/hashfn.h. Perhaps an argument can be
made for keeping the hashutils.h name, but it seemed more
consistent to make it match the name of the file, and also more
descriptive of what is actually going on here.
Patch by me, reviewed by Suraj Kharage and Mark Dilger. Off-list
advice on how not to break the Windows build from Davinder Singh
and Amit Kapila.
Discussion: http://postgr.es/m/CA+TgmoaRiG4TXND8QuM6JXFRkM_1wL2ZNhzaUKsuec9-4yrkgw@mail.gmail.com
Deduplication reduces the storage overhead of duplicates in indexes that
use the standard nbtree index access method. The deduplication process
is applied lazily, after the point where opportunistic deletion of
LP_DEAD-marked index tuples occurs. Deduplication is only applied at
the point where a leaf page split would otherwise be required. New
posting list tuples are formed by merging together existing duplicate
tuples. The physical representation of the items on an nbtree leaf page
is made more space efficient by deduplication, but the logical contents
of the page are not changed. Even unique indexes make use of
deduplication as a way of controlling bloat from duplicates whose TIDs
point to different versions of the same logical table row.
The lazy approach taken by nbtree has significant advantages over a GIN
style eager approach. Most individual inserts of index tuples have
exactly the same overhead as before. The extra overhead of
deduplication is amortized across insertions, just like the overhead of
page splits. The key space of indexes works in the same way as it has
since commit dd299df8 (the commit that made heap TID a tiebreaker
column).
Testing has shown that nbtree deduplication can generally make indexes
with about 10 or 15 tuples for each distinct key value about 2.5X - 4X
smaller, even with single column integer indexes (e.g., an index on a
referencing column that accompanies a foreign key). The final size of
single column nbtree indexes comes close to the final size of a similar
contrib/btree_gin index, at least in cases where GIN's posting list
compression isn't very effective. This can significantly improve
transaction throughput, and significantly reduce the cost of vacuuming
indexes.
A new index storage parameter (deduplicate_items) controls the use of
deduplication. The default setting is 'on', so all new B-Tree indexes
automatically use deduplication where possible. This decision will be
reviewed at the end of the Postgres 13 beta period.
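The parameter can be set per index (names hypothetical):
    CREATE INDEX t_a_idx ON t (a) WITH (deduplicate_items = on);
    ALTER INDEX t_a_idx SET (deduplicate_items = off);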
There is a regression of approximately 2% of transaction throughput with
synthetic workloads that consist of append-only inserts into a table
with several non-unique indexes, where all indexes have few or no
repeated values. The underlying issue is that cycles are wasted on
unsuccessful attempts at deduplicating items in non-unique indexes.
There doesn't seem to be a way around it short of disabling
deduplication entirely. Note that deduplication of items in unique
indexes is fairly well targeted in general, which avoids the problem
there (we can use a special heuristic to trigger deduplication passes in
unique indexes, since we're specifically targeting "version bloat").
Bump XLOG_PAGE_MAGIC because xl_btree_vacuum changed.
No bump in BTREE_VERSION, since the representation of posting list
tuples works in a way that's backwards compatible with version 4 indexes
(i.e. indexes built on PostgreSQL 12). However, users must still
REINDEX a pg_upgrade'd index to use deduplication, regardless of the
Postgres version they've upgraded from. This is the only way to set the
new nbtree metapage flag indicating that deduplication is generally
safe.
Author: Anastasia Lubennikova, Peter Geoghegan
Reviewed-By: Peter Geoghegan, Heikki Linnakangas
Discussion: https://postgr.es/m/55E4051B.7020209@postgrespro.ru
Discussion: https://postgr.es/m/4ab6e2db-bcee-f4cf-0916-3a06e6ccbb55@postgrespro.ru
Invent the concept of a B-Tree equalimage ("equality implies image
equality") support function, registered as support function 4. This
indicates whether it is safe (or not safe) to apply optimizations that
assume that any two datums considered equal by an operator class's order
method must be interchangeable without any loss of semantic information.
This is static information about an operator class and a collation.
Register an equalimage routine for almost all of the existing B-Tree
opclasses. We only need two trivial routines for all of the opclasses
that are included with the core distribution. There is one routine for
opclasses that index non-collatable types (which returns 'true'
unconditionally), plus another routine for collatable types (which
returns 'true' when the collation is a deterministic collation).
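In an opclass definition this appears as support function 4; a sketch
for a custom btree opclass (all names except btequalimage are
hypothetical, and the type's comparison operators are assumed to
exist):
    CREATE OPERATOR CLASS my_ops FOR TYPE mytype USING btree AS
        OPERATOR 1 <, OPERATOR 2 <=, OPERATOR 3 =,
        OPERATOR 4 >=, OPERATOR 5 >,
        FUNCTION 1 my_cmp(mytype, mytype),
        FUNCTION 4 btequalimage(oid);  -- equality implies image equality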
This patch is infrastructure for an upcoming patch that adds B-Tree
deduplication.
Author: Peter Geoghegan, Anastasia Lubennikova
Discussion: https://postgr.es/m/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
Do so by combining the various steps that are part of aggregate
transition function invocation into one larger step. As some of the
current steps are only necessary for some aggregates, have one variant
of the aggregate transition step for each possible combination.
To avoid further manual copies of code in the different transition
step implementations, move most of the code into helper functions
marked as "always inline".
The benefit of this change is an increase in performance when
aggregating lots of rows. This comes in part due to the reduced number
of indirect jumps due to the reduced number of steps, and in part by
reducing redundant setup code across steps. This mainly benefits
interpreted execution, but the code generated by JIT is also improved
a bit.
As a nice side-effect it also ends up making the code a bit simpler.
A small additional optimization is removing the need to set
aggstate->curaggcontext before calling ExecAggInitGroup, instead
passing curaggcontext as an argument. In contrast to other
aggregate-related functions, it was only needed to fetch a memory
context to copy the transition value into.
Author: Andres Freund
Discussion: https://postgr.es/m/20191023163849.sosqbfs5yenocez3@alap3.anarazel.de
Discussion: https://postgr.es/m/5c371df7cee903e8cd4c685f90c6c72086d3a2dc.camel@j-davis.com
The comments in fd.c have long claimed that all file allocations should
go through that module, but in reality that's not always practical.
fd.c doesn't supply APIs for invoking some FD-producing syscalls like
pipe() or epoll_create(); and the APIs it does supply for non-virtual
FDs are mostly insistent on releasing those FDs at transaction end;
and in some cases the actual open() call is in code that can't be made
to use fd.c, such as libpq.
This has led to a situation where, in a modern server, there are likely
to be seven or so long-lived FDs per backend process that are not known
to fd.c. Since NUM_RESERVED_FDS is only 10, that meant we had *very*
few spare FDs if max_files_per_process is >= the system ulimit and
fd.c had opened all the files it thought it safely could. The
contrib/postgres_fdw regression test, in particular, could easily be
made to fall over by running it under a restrictive ulimit.
To improve matters, invent functions Acquire/Reserve/ReleaseExternalFD
that allow outside callers to tell fd.c that they have or want to allocate
a FD that's not directly managed by fd.c. Add calls to track all the
fixed FDs in a standard backend session, so that we are honestly
guaranteeing that NUM_RESERVED_FDS FDs remain unused below the EMFILE
limit in a backend's idle state. The coding rules for these functions say
that there's no need to call them in code that just allocates one FD over
a fairly short interval; we can dip into NUM_RESERVED_FDS for such cases.
That means that there aren't all that many places where we need to worry.
But postgres_fdw and dblink must use this facility to account for
long-lived FDs consumed by libpq connections. There may be other places
where it's worth doing such accounting, too, but this seems like enough
to solve the immediate problem.
Internally to fd.c, "external" FDs are limited to max_safe_fds/3 FDs.
(Callers can choose to ignore this limit, but of course it's unwise
to do so except for fixed file allocations.) I also reduced the limit
on "allocated" files to max_safe_fds/3 FDs (it had been max_safe_fds/2).
Conceivably a smarter rule could be used here --- but in practice,
on reasonable systems, max_safe_fds should be large enough that this
isn't much of an issue, so KISS for now. To avoid possible regression
in the number of external or allocated files that can be opened,
increase FD_MINFREE and the lower limit on max_files_per_process a
little bit; we now insist that the effective "ulimit -n" be at least 64.
This seems like pretty clearly a bug fix, but in view of the lack of
field complaints, I'll refrain from risking a back-patch.
Discussion: https://postgr.es/m/E1izCmM-0005pV-Co@gemulon.postgresql.org
hash_any() and its various variants are defined to return Datum,
which is a backend-only concept, but the underlying functions
actually want to return uint32 and uint64, and only return Datum
because it's convenient for callers who are using them to
implement a hash function for some SQL datatype.
However, changing these functions to return uint32 and uint64
seems like it might lead to programming errors or back-patching
difficulties, both because they are widely used and because
failure to use UInt{32,64}GetDatum() might not provoke a
compilation error. Instead, rename the existing functions as
well as changing the return type, and add static inline wrappers
for those callers that need the previous behavior.
Although this commit adapts hashutils.h and hashfn.c so that they
can be compiled as frontend code, it does not actually do
anything that would cause them to be so compiled. That is left
for another commit.
Patch by me, reviewed by Suraj Kharage and Mark Dilger.
Discussion: http://postgr.es/m/CA+TgmoaRiG4TXND8QuM6JXFRkM_1wL2ZNhzaUKsuec9-4yrkgw@mail.gmail.com
The closely-related function bms_hash_value is already defined in that
file, and this change means that hashfn.c no longer needs to depend on
nodes/bitmapset.h. That gets us closer to allowing use of the hash
functions in hashfn.c in frontend code.
Patch by me, reviewed by Suraj Kharage and Mark Dilger.
Discussion: http://postgr.es/m/CA+TgmoaRiG4TXND8QuM6JXFRkM_1wL2ZNhzaUKsuec9-4yrkgw@mail.gmail.com
These compiler features are required by C99, so remove the configure
probes for them.
This is part of a series of commits to get rid of no-longer-relevant
configure checks and dead src/port/ code. I'm committing them separately
to make it easier to back out individual changes if they prove less
portable than I expect.
Discussion: https://postgr.es/m/15379.1582221614@sss.pgh.pa.us
Windows has this, and so do all other live platforms according to the
buildfarm; it's been required by POSIX since SUSv2. So remove the
configure probe and tests of HAVE_WCHAR_H.
This is part of a series of commits to get rid of no-longer-relevant
configure checks and dead src/port/ code. I'm committing them separately
to make it easier to back out individual changes if they prove less
portable than I expect.
Discussion: https://postgr.es/m/15379.1582221614@sss.pgh.pa.us
These are required by POSIX since SUSv2, and no live platforms fail
to provide them. On Windows, utime() exists and we bring our own
<utime.h>, so we're good there too. So remove the configure probes
and ad-hoc substitute code. We don't need to check for utimes()
anymore either, since that was only used as a substitute.
In passing, make the Windows build include <sys/utime.h> only where
we need it, not everywhere.
This is part of a series of commits to get rid of no-longer-relevant
configure checks and dead src/port/ code. I'm committing them separately
to make it easier to back out individual changes if they prove less
portable than I expect.
Discussion: https://postgr.es/m/15379.1582221614@sss.pgh.pa.us
Windows has this since _MSC_VER >= 1200, and so do all other live
platforms according to the buildfarm, so remove the configure probe
and src/port/ substitution.
This is part of a series of commits to get rid of no-longer-relevant
configure checks and dead src/port/ code. I'm committing them separately
to make it easier to back out individual changes if they prove less
portable than I expect.
Discussion: https://postgr.es/m/15379.1582221614@sss.pgh.pa.us
Windows has this, and so do all other live platforms according to the
buildfarm, so remove the configure probe and c.h's substitute code.
This is part of a series of commits to get rid of no-longer-relevant
configure checks and dead src/port/ code. I'm committing them separately
to make it easier to back out individual changes if they prove less
portable than I expect.
Discussion: https://postgr.es/m/15379.1582221614@sss.pgh.pa.us
Windows has this, and so do all other live platforms according to the
buildfarm, so remove the configure probe and float.c's substitute code.
This is part of a series of commits to get rid of no-longer-relevant
configure checks and dead src/port/ code. I'm committing them separately
to make it easier to back out individual changes if they prove less
portable than I expect.
Discussion: https://postgr.es/m/15379.1582221614@sss.pgh.pa.us
Windows has this, and so do all other live platforms according to the
buildfarm, so remove the configure probe and src/port/ substitution.
This also lets us get rid of some configure probes that existed only
to support src/port/isinf.c. I kept the port.h hack to force using
__builtin_isinf() on clang, though.
This is part of a series of commits to get rid of no-longer-relevant
configure checks and dead src/port/ code. I'm committing them separately
to make it easier to back out individual changes if they prove less
portable than I expect.
Discussion: https://postgr.es/m/15379.1582221614@sss.pgh.pa.us
Windows has this, and so do all other live platforms according to the
buildfarm, so remove the configure probe and src/port/ substitution.
Keep the probe that detects whether _LARGEFILE_SOURCE has to be
defined to get that, though ... that seems to be still relevant in
some places.
This is part of a series of commits to get rid of no-longer-relevant
configure checks and dead src/port/ code. I'm committing them separately
to make it easier to back out individual changes if they prove less
portable than I expect.
Discussion: https://postgr.es/m/15379.1582221614@sss.pgh.pa.us
stdint.h belongs to the compiler (as opposed to inttypes.h), so by
requiring a C99 compiler we can also require stdint.h
unconditionally. Remove configure checks and other workarounds for
it.
This also removes a few steps in the required portability adjustments
to the imported time zone code, which can be applied on the next
import.
When using GCC on a platform that is otherwise pre-C99, this will now
require at least GCC 4.5, which is the first release that supplied a
standard-conforming stdint.h if the native platform didn't have it.
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://www.postgresql.org/message-id/flat/5d398bbb-262a-5fed-d839-d0e5cff3c0d7%402ndquadrant.com
When updating a table row with generated columns, only recompute those
generated columns whose base columns have changed in this update and
keep the rest unchanged. This can result in a significant performance
benefit. The required information was already kept in
RangeTblEntry.extraUpdatedCols; we just have to make use of it.
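For example (names hypothetical):
    CREATE TABLE t (a int, b int,
                    c int GENERATED ALWAYS AS (a * 2) STORED);
    UPDATE t SET b = 1;  -- c is not recomputed; its base column a is unchanged
    UPDATE t SET a = 3;  -- c is recomputed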
Reviewed-by: Pavel Stehule <pavel.stehule@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/b05e781a-fa16-6b52-6738-761181204567@2ndquadrant.com
The extraUpdatedCols field of the target RTE records which generated
columns are affected by an update. This is used in a variety of
places, including per-column triggers and foreign data wrappers. When
an update was initiated by a logical replication subscription, this
field was not filled in, so such an update would not affect generated
columns in a way that is consistent with normal updates. To fix,
factor out some code from analyze.c to fill in extraUpdatedCols in the
logical replication worker as well.
Reviewed-by: Pavel Stehule <pavel.stehule@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/b05e781a-fa16-6b52-6738-761181204567@2ndquadrant.com
This commit also puts the wait event enum into alphabetical order.
Previously the enum entry for GSSOpenServer was added out of order.
Back-patch to v12 where commit b0b39f72b9 introduced the GSSOpenServer
wait event. In v12, this commit doesn't include the enum reordering,
so as not to break ABI.
Author: Fujii Masao
Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/949931aa-4ed4-d867-a7b5-de9c02b2292b@oss.nttdata.com
Practically everybody who's ever added a column to one of the bootstrap
catalogs has been burnt by the need to update the relnatts field in the
initial pg_class data to match. Now that we use Perl scripts to
generate postgres.bki, we can have the machines take care of that,
by filling the field during genbki.pl.
While at it, use the BKI_DEFAULTS mechanism to eliminate repetitive
specifications of other column values in pg_class.dat, too. They
weren't particularly a maintenance problem, but this way is prettier
(certainly the spotty previous usage of BKI_DEFAULTS wasn't pretty).
No catversion bump needed, since this doesn't actually change the
contents of postgres.bki.
Per gripe from Justin Pryzby, though this is quite different from
his originally proposed solution.
Amit Langote, John Naylor, Tom Lane
Discussion: https://postgr.es/m/20200212182337.GZ1412@telsasoft.com
Commit 6bf0bc842 replaced float.c's CHECKFLOATVAL() macro with static
inline subroutines, but that wasn't too well thought out. In the original
coding, the unlikely condition (isinf(result) or result == 0) was checked
first, and the inf_is_valid or zero_is_valid condition only afterwards.
The inline-subroutine coding caused that to be swapped around, which is
pretty horrid for performance because (a) in common cases the is_valid
condition is twice as expensive to evaluate (e.g., requiring two isinf()
calls not one) and (b) in common cases the is_valid condition is false,
requiring us to perform the unlikely-condition check anyway. Net result
is that one isinf() call becomes two or three, resulting in visible
performance loss as reported by Keisuke Kuroda.
The original fix proposal was to revert the replacement of the macro,
but on second thought, that macro was just a bad idea from the beginning:
if anything it's a net negative for readability of the code. So instead,
let's just open-code all the overflow/underflow tests, being careful to
test the unlikely condition first (and mark it unlikely() to help the
compiler get the point).
Also, rather than having N copies of the actual ereport() calls, collapse
those into out-of-line error subroutines to save some code space. This
does mean that the error file/line numbers won't be very helpful for
figuring out where the issue really is --- but we'd already burned that
bridge by putting the ereports into static inlines.
In HEAD, check_float[48]_val() are gone altogether. In v12, leave them
present in float.h but unused in the core code, just in case some
extension is depending on them.
Emre Hasegeli, with some kibitzing from me and Andres Freund
Discussion: https://postgr.es/m/CANDwggLe1Gc1OrRqvPfGE=kM9K0FSfia0hbeFCEmwabhLz95AA@mail.gmail.com
The previous system had configure put the value into the makefiles and
then have the makefiles pass them to the build of pg_config. That was
put in place when pg_config was a shell script. We can simplify that
by having configure put the value into pg_config.h directly. This
also makes the standard build system match how the MSVC build system
already does it.
Discussion: https://www.postgresql.org/message-id/flat/6e457870-cef5-5f1d-b57c-fc89cfb8a788%402ndquadrant.com
Commit 4eaea3db introduced TupleHashTableHash(), but the signature
didn't match the other exposed functions. Separate it into internal
and external versions. The external version hides the details behind
an API more consistent with the other external functions, and the
internal version is still suitable for simplehash.
Commit 147e3722f7 changed Tid scan so that it calls table_beginscan()
and uses the scan option for seq scan. This change caused two issues.
(1) The change caused Tid scan to take a predicate lock on the entire
    relation in a serializable transaction even when a relation-level
    lock is not necessary. This could lead to an unexpected
    serialization error.
(2) The change caused Tid scan to increment the seq_scan counter in
    the pg_stat_*_tables views even though it's not a seq scan. This
    could confuse users.
This commit adds the scan option for Tid scan and makes Tid scan
use it, to avoid those issues.
Back-patch to v12, where the bug was introduced.
Author: Tatsuhito Kasahara
Reviewed-by: Kyotaro Horiguchi, Masahiko Sawada, Fujii Masao
Discussion: https://postgr.es/m/CAP0=ZVKy+gTbFmB6X_UW0pP3WaeJ-fkUWHoD-pExS=at3CY76g@mail.gmail.com
The main benefit of doing so is that this allows llvm to ensure that
types match - previously that'd only be detected by a crash within the
called function. There were a number of cases where we passed a
superfluous parameter...
To avoid needing to add all the functions to llvmjit.{c,h}, instead
get them from the llvm module for llvmjit_types.c. Also use that for
the functions from llvmjit_types already in llvmjit.h.
Author: Soumyadeep Chakraborty and Andres Freund
Discussion: https://postgr.es/m/CADwEdooww3wZv-sXSfatzFRwMuwa186LyTwkBfwEW6NjtooBPA@mail.gmail.com
Expose two new entry points: one for only calculating the hash value
of a tuple, and another for looking up a hash entry when the hash
value is already known. This will be useful for disk-based Hash
Aggregation to avoid recomputing the hash value for the same tuple
after saving and restoring it from disk.
Discussion: https://postgr.es/m/37091115219dd522fd9ed67333ee8ed1b7e09443.camel%40j-davis.com
It's already tracked via ExprState->parent, so we don't need to also
include it in ExprEvalStep. When that code originally was written
ExprState->parent didn't exist, but it since has been introduced in
6719b238e8.
Author: Andres Freund
Discussion: https://postgr.es/m/20191023163849.sosqbfs5yenocez3@alap3.anarazel.de
This new field tracks the PID of the group leader used with parallel
query. For parallel workers and the leader, the value is set to the
PID of the group leader. So, for the group leader, the value is the
same as its own PID. Note that this reflects what PGPROC stores in
shared memory, so leader_pid is NULL if a backend has never been
involved in parallel query. If the backend is using parallel query or
has used it at least once, the value is set until the backend exits.
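Assuming this is exposed as the leader_pid column of pg_stat_activity,
parallel workers can be matched to their leader with, e.g.:
    SELECT leader_pid, pid, backend_type, query
    FROM pg_stat_activity
    WHERE leader_pid IS NOT NULL;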
Author: Julien Rouhaud
Reviewed-by: Sergei Kornilov, Guillaume Lelarge, Michael Paquier, Tomas
Vondra
Discussion: https://postgr.es/m/CAOBaU_Yy5bt0vTPZ2_LUM6cUcGeqmYNoJ8-Rgto+c2+w3defYA@mail.gmail.com
Using 32 bit counters means they can now realistically wrap around when
vacuuming extremely large tables. Because they're signed integers,
stats printed by vacuum look very odd when they do.
We'd love to backpatch this, but refrain because the variables are
exported and could cause third-party code to break.
Reviewed-by: Julien Rouhaud, Tom Lane, Michael Paquier
Discussion: https://postgr.es/m/20200131205926.GA16367@alvherre.pgsql
Use kevent(2) to wait for events on the BSD family of operating
systems and macOS. This is similar to the epoll(2) support added
for Linux by commit 98a64d0bd.
Author: Thomas Munro
Reviewed-by: Andres Freund, Marko Tiikkaja, Tom Lane
Tested-by: Mateusz Guzik, Matteo Beccati, Keith Fiske, Heikki Linnakangas, Michael Paquier, Peter Eisentraut, Rui DeSousa, Tom Lane, Mark Wong
Discussion: https://postgr.es/m/CAEepm%3D37oF84-iXDTQ9MrGjENwVGds%2B5zTr38ca73kWR7ez_tA%40mail.gmail.com
Those new assertions can be used at file scope, outside of any function
for compilation checks. This commit provides implementations for C and
C++, and fallback implementations.
Author: Peter Smith
Reviewed-by: Andres Freund, Kyotaro Horiguchi, Dagfinn Ilmari Mannsåker,
Michael Paquier
Discussion: https://postgr.es/m/201DD0641B056142AC8C6645EC1B5F62014B8E8030@SYD1217
Using a lookup table of digit pairs reduces the number of divisions
needed, and calculating the length upfront saves some work; these
ideas are taken from the code previously committed for floats.
David Fetter, reviewed by Kyotaro Horiguchi, Tels, and me.
Discussion: https://postgr.es/m/20190924052620.GP31596%40fetter.org