Pre-C99 platforms may lack <stdint.h> and thereby SIZE_MAX. We have
a couple of places using the hack "(size_t) -1" as a fallback, but
it wasn't universally available; which means the code added in commit
2e70d6b5e fails to compile everywhere. Move that hack to c.h so that
we can rely on having SIZE_MAX everywhere.
Per discussion, it'd be a good idea to make the macro's value safe
for use in #if-tests, but that will take a bit more work. This is
just a quick expedient to get the buildfarm green again.
Back-patch to all supported branches, like the previous commit.
Discussion: https://postgr.es/m/15883.1504278595@sss.pgh.pa.us
2cd7084524 / c6293249d change the way individual attributes in a
TupleDesc are stored / accessed. To reduce the effort of making
extensions compatible with postgresql 11, and to ease future
backpatching, backpatch introduction of TupleDescAttr() to all
releases. Do not backpatch change in storage, as that'd be a breaking
change for existing and working extensions.
Author: Andres Freund
Discussion: https://postgr.es/m/20170820181723.tdswdinzptbcwhrr@alap3.anarazel.de
Backpatch: 9.2-
The sole useful effect of this function, to check that no catcache
entries have positive refcounts at transaction end, has really been
obsolete since we introduced ResourceOwners in PG 8.1. We reduced the
checks to assertions years ago, so that the function was a complete
no-op in production builds. There have been previous discussions about
removing it entirely, but consensus up to now was that it had some small
value as a cross-check for bugs in the ResourceOwner logic.
However, it now emerges that it's possible to trigger these assertions
if you hit an assert-enabled backend with SIGTERM during a call to
SearchCatCacheList, because that function temporarily increases the
refcounts of entries it's intending to add to a catcache list construct.
In a normal ERROR scenario, the extra refcounts are cleaned up by
SearchCatCacheList's PG_CATCH block; but in a FATAL exit we do a
transaction abort and exit without ever executing PG_CATCH handlers.
There's a case to be made that this is a generic hazard and we should
consider restructuring elog(FATAL) handling so that pending PG_CATCH
handlers do get run. That's pretty scary though: it could easily create
more problems than it solves. Preliminary stress testing by Andreas
Seltenreich suggests that there are not many live problems of this ilk,
so we rejected that idea.
There are more-localized ways to fix the problem; the most principled
one would be to use PG_ENSURE_ERROR_CLEANUP instead of plain PG_TRY.
But adding cycles to SearchCatCacheList isn't very appealing. We could
also weaken the assertions in AtEOXact_CatCache in some more or less
ad-hoc way, but that just makes its raison d'etre even less compelling.
In the end, the most reasonable solution seems to be to just remove
AtEOXact_CatCache altogether, on the grounds that it's not worth trying
to fix it. It hasn't found any bugs for us in many years.
Per report from Jeevan Chalke. Back-patch to all supported branches.
Discussion: https://postgr.es/m/CAM2+6=VEE30YtRQCZX7_sCFsEpoUkFBV1gZazL70fqLn8rcvBA@mail.gmail.com
XLByteToSeg and XLByteToPrevSeg calculate only a segment number. The
definition of these macros were modified by commit
dfda6ebaec but the comment remain
unchanged.
Patch by Yugo Nagata. Back patched to 9.3 and beyond.
When, during logical decoding, a transaction gets too big, it's
contents get spilled to disk. Not just the top-transaction gets
spilled, but *also* all of its subtransactions, even if they're not
that large themselves. Unfortunately we didn't clean up
such small spilled subtransactions from disk.
Fix that, by keeping better track of whether a transaction has been
spilled to disk.
Author: Andres Freund
Reported-By: Dmitriy Sarafannikov, Fabrízio de Royes Mello
Discussion:
https://postgr.es/m/1457621358.355011041@f382.i.mail.ruhttps://postgr.es/m/CAFcNs+qNMhNYii4nxpO6gqsndiyxNDYV0S=JNq0v_sEE+9PHXg@mail.gmail.com
Backpatch: 9.4-, where logical decoding was introduced
Because walsender and normal backends share the same main loop it's
problematic to have two different flag variables, set in signal
handlers, indicating a pending configuration reload. Only certain
walsender commands reach code paths checking for the
variable (START_[LOGICAL_]REPLICATION, CREATE_REPLICATION_SLOT
... LOGICAL, notably not base backups).
This is a bug present since the introduction of walsender, but has
gotten worse in releases since then which allow walsender to do more.
A later patch, not slated for v10, will similarly unify SIGHUP
handling in other types of processes as well.
Author: Petr Jelinek, Andres Freund
Reviewed-By: Michael Paquier
Discussion: https://postgr.es/m/20170423235941.qosiuoyqprq4nu7v@alap3.anarazel.de
Backpatch: 9.2-, bug is present since 9.0
When the checkpointer writes the shutdown checkpoint, it checks
afterwards whether any WAL has been written since it started and
throws a PANIC if so. At that point, only walsenders are still
active, so one might think this could not happen, but walsenders can
also generate WAL, for instance in BASE_BACKUP and logical decoding
related commands (e.g. via hint bits). So they can trigger this panic
if such a command is run while the shutdown checkpoint is being
written.
To fix this, divide the walsender shutdown into two phases. First,
checkpointer, itself triggered by postmaster, sends a
PROCSIG_WALSND_INIT_STOPPING signal to all walsenders. If the backend
is idle or runs an SQL query this causes the backend to shutdown, if
logical replication is in progress all existing WAL records are
processed followed by a shutdown. Otherwise this causes the walsender
to switch to the "stopping" state. In this state, the walsender will
reject any further replication commands. The checkpointer begins the
shutdown checkpoint once all walsenders are confirmed as
stopping. When the shutdown checkpoint finishes, the postmaster sends
us SIGUSR2. This instructs walsender to send any outstanding WAL,
including the shutdown checkpoint record, wait for it to be replicated
to the standby, and then exit.
Author: Andres Freund, based on an earlier patch by Michael Paquier
Reported-By: Fujii Masao, Andres Freund
Reviewed-By: Michael Paquier
Discussion: https://postgr.es/m/20170602002912.tqlwn4gymzlxpvs2@alap3.anarazel.de
Backpatch: 9.4, where logical decoding was introduced
The snapshot assembly during the creation of logical slots relied
waiting for transactions in xl_running_xacts to end, by checking for
their commit/abort records. Unfortunately, despite locking, it is
possible to see an xl_running_xact record listing transactions as
ready, that have already WAL-logged an commit/abort record, as the
locking just prevents the ProcArray to be adjusted, and the commit
record has to be logged first.
That lead to either delayed or hanging snapshot creation, because
snapbuild.c would wait "forever" to see commit/abort records for some
transactions. That hang resolved only if a xl_running_xacts record
without any running transactions happened to be logged, far from
certain on a busy server.
It's impractical to prevent that via more heavyweight locking, the
likelihood of deadlocks and significantly increased contention would
be too big.
Instead change the initial snapshot creation to be solely based on
tracking the oldest running transaction via
xl_running_xacts->oldestRunningXid - that actually ends up
significantly simplifying the code. That has two disadvantages:
1) Because we cannot fully "trust" the contents of xl_running_xacts,
we cannot use it to build the initial snapshot. Instead we have to
wait twice for all running transactions to finish.
2) Previously a slot, unless the race occurred, could be created when
the all transaction perceived as running based on commit/abort
records, now we have to wait for the next xl_running_xacts record.
To address that, trigger logging new xl_running_xacts record from
within snapbuild.c exactly when necessary.
Unfortunately snabuild.c's SnapBuild is stored on disk, one of the
stupider ideas of a certain Mr Freund, so we can't change it in a
minor release. As this is going to be backpatched, we have to hack
around a bit to keep on-disk compatibility. A later commit will
rejigger that on master.
Author: Andres Freund, based on a quite different patch from Petr Jelinek
Analyzed-By: Petr Jelinek
Reviewed-By: Petr Jelinek
Discussion: https://postgr.es/m/f37e975c-908f-858e-707f-058d3b1eb214@2ndquadrant.com
Backpatch: 9.4-, where logical decoding has been introduced
We have now grown enough registerable syscache-invalidation callback
functions that the original assumption that there would be few of them
is causing performance problems. In particular, let's fix things so that
CallSyscacheCallbacks doesn't have to search the whole array to find
which callback(s) to invoke for a given cache ID. Preserve the original
behavior that callbacks are called in order of registration, just in
case there's someplace that depends on that (which I doubt).
In support of this, export the number of syscaches from syscache.h.
People could have found that out anyway from the enum, but adding a
#define makes that much safer.
This provides a useful additional speedup in Mathieu Fenniak's
logical-decoding test case, although we're reaching the point of
diminishing returns there. I think any further improvement will have
to come from reducing the number of cache invalidations that are
triggered in the first place. Still, we can hope that this change
gives some incremental benefit for all invalidation scenarios.
Back-patch to 9.4 where logical decoding was introduced.
Discussion: https://postgr.es/m/CAHoiPjzea6N0zuCi=+f9v_j94nfsy6y8SU7-=bp4=7qw6_i=Rg@mail.gmail.com
A test case provided by Mathieu Fenniak shows that the initial search for
the target catcache in CatalogCacheIdInvalidate consumes a very significant
amount of overhead in cases where cache invalidation is triggered but has
little useful work to do. There is no good reason for that search to exist
at all, as the index array maintained by syscache.c allows direct lookup of
the catcache from its ID. We just need a frontend function in syscache.c,
matching the division of labor for most other cache-accessing operations.
While there's more that can be done in this area, this patch alone reduces
the runtime of Mathieu's example by 2X. We can hope that it offers some
useful benefit in other cases too, although usually cache invalidation
overhead is not such a striking fraction of the total runtime.
Back-patch to 9.4 where logical decoding was introduced. It might be
worth going further back, but presently the only case we know of where
cache invalidation is really a significant burden is in logical decoding.
Also, older branches have fewer catcaches, reducing the possible benefit.
(Note: although this nominally changes catcache's API, we have always
documented CatalogCacheIdInvalidate as a private function, so I would
have little sympathy for an external module calling it directly. So
backpatching should be fine.)
Discussion: https://postgr.es/m/CAHoiPjzea6N0zuCi=+f9v_j94nfsy6y8SU7-=bp4=7qw6_i=Rg@mail.gmail.com
Values in a STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM slot are float8,
not of the type of the column the statistics are for.
This bug is at least partly the fault of sloppy specification comments
for get_attstatsslot()/free_attstatsslot(): the type OID they want is that
of the stavalues entries, not of the underlying column. (I double-checked
other callers and they seem to get this right.) Adjust the comments to be
more correct.
Per buildfarm.
Security: CVE-2017-7484
Some selectivity estimation functions run user-supplied operators over
data obtained from pg_statistic without security checks, which allows
those operators to leak pg_statistic data without having privileges on
the underlying tables. Fix by checking that one of the following is
satisfied: (1) the user has table or column privileges on the table
underlying the pg_statistic data, or (2) the function implementing the
user-supplied operator is leak-proof. If neither is satisfied, planning
will proceed as if there are no statistics available.
At least one of these is satisfied in most cases in practice. The only
situations that are negatively impacted are user-defined or
not-leak-proof operators on a security-barrier view.
Reported-by: Robert Haas <robertmhaas@gmail.com>
Author: Peter Eisentraut <peter_e@gmx.net>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Security: CVE-2017-7484
Logical decoding stores historical snapshots on disk, so that logical
decoding can restart without having to reconstruct a snapshot from
scratch (for which the resources are not guaranteed to be present
anymore). These serialized snapshots were also used when creating a
new slot via the walsender interface, which can export a "full"
snapshot (i.e. one that can read all tables, not just catalog ones).
The problem is that the serialized snapshots are only useful for
catalogs and not for normal user tables. Thus the use of such a
serialized snapshot could result in an inconsistent snapshot being
exported, which could lead to queries returning wrong data. This
would only happen if logical slots are created while another logical
slot already exists.
Author: Petr Jelinek
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/f37e975c-908f-858e-707f-058d3b1eb214@2ndquadrant.com
Backport: 9.4, where logical decoding was introduced.
The logical decoding machinery already preserved all the required
catalog tuples, which is sufficient in the course of normal logical
decoding, but did not guarantee that non-catalog tuples were preserved
during computation of the initial snapshot when creating a slot over
the replication protocol.
This could cause a corrupted initial snapshot being exported. The
time window for issues is usually not terribly large, but on a busy
server it's perfectly possible to it hit it. Ongoing decoding is not
affected by this bug.
To avoid increased overhead for the SQL API, only retain additional
tuples when a logical slot is being created over the replication
protocol. To do so this commit changes the signature of
CreateInitDecodingContext(), but it seems unlikely that it's being
used in an extension, so that's probably ok.
In a drive-by fix, fix handling of
ReplicationSlotsComputeRequiredXmin's already_locked argument, which
should only apply to ProcArrayLock, not ReplicationSlotControlLock.
Reported-By: Erik Rijkers
Analyzed-By: Petr Jelinek
Author: Petr Jelinek, heavily editorialized by Andres Freund
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/9a897b86-46e1-9915-ee4c-da02e4ff6a95@2ndquadrant.com
Backport: 9.4, where logical decoding was introduced.
This extends the castNode() notation introduced by commit 5bcab1114 to
provide, in one step, extraction of a list cell's pointer and coercion to
a concrete node type. For example, "lfirst_node(Foo, lc)" is the same
as "castNode(Foo, lfirst(lc))". Almost half of the uses of castNode
that have appeared so far include a list extraction call, so this is
pretty widely useful, and it saves a few more keystrokes compared to the
old way.
As with the previous patch, back-patch the addition of these macros to
pg_list.h, so that the notation will be available when back-patching.
Patch by me, after an idea of Andrew Gierth's.
Discussion: https://postgr.es/m/14197.1491841216@sss.pgh.pa.us
HandleFunctionRequest() is no longer responsible for reading the protocol
message from the client, since commit 2b3a8b20c2. Fix the outdated
comments.
HandleFunctionRequest() now always returns 0, because the code that used
to return EOF was moved in 2b3a8b20c2. Therefore, the caller no longer
needs to check the return value.
Reported by Andres Freund. Backpatch to all supported versions, even though
this doesn't have any user-visible effect, to make backporting future
patches in this area easier.
Discussion: https://www.postgresql.org/message-id/20170405010525.rt5azbya5fkbhvrx@alap3.anarazel.de
When using integer timestamps, the interval-comparison functions tried
to compute the overall magnitude of an interval as an int64 number of
microseconds. As reported by Frazer McLean, this overflows for intervals
exceeding about 296000 years, which is bad since we nominally allow
intervals many times larger than that. That results in wrong comparison
results, and possibly in corrupted btree indexes for columns containing
such large interval values.
To fix, compute the magnitude as int128 instead. Although some compilers
have native support for int128 calculations, many don't, so create our
own support functions that can do 128-bit addition and multiplication
if the compiler support isn't there. These support functions are designed
with an eye to allowing the int128 code paths in numeric.c to be rewritten
for use on all platforms, although this patch doesn't do that, or even
provide all the int128 primitives that will be needed for it.
Back-patch as far as 9.4. Earlier releases did not guard against overflow
of interval values at all (commit 146604ec4 fixed that), so it seems not
very exciting to worry about overly-large intervals for them.
Before 9.6, we did not assume that unreferenced "static inline" functions
would not draw compiler warnings, so omit functions not directly referenced
by timestamp.c, the only present consumer of int128.h. (We could have
omitted these functions in HEAD too, but since they were written and
debugged on the way to the present patch, and they look likely to be needed
by numeric.c, let's keep them in HEAD.) I did not bother to try to prevent
such warnings in a --disable-integer-datetimes build, though.
Before 9.5, configure will never define HAVE_INT128, so the part of
int128.h that exploits a native int128 implementation is dead code in the
9.4 branch. I didn't bother to remove it, thinking that keeping the file
looking similar in different branches is more useful.
In HEAD only, add a simple test harness for int128.h in src/tools/.
In back branches, this does not change the float-timestamps code path.
That's not subject to the same kind of overflow risk, since it computes
the interval magnitude as float8. (No doubt, when this code was originally
written, overflow was disregarded for exactly that reason.) There is a
precision hazard instead :-(, but we'll avert our eyes from that question,
since no complaints have been reported and that code's deprecated anyway.
Kyotaro Horiguchi and Tom Lane
Discussion: https://postgr.es/m/1490104629.422698.918452336.26FA96B7@webmail.messagingengine.com
This could result in corruption of the init fork of an unlogged index
if the ambuildempty routine for that index used shared buffers to
create the init fork, which was true for gin, gist, and hash indexes.
Patch by me, based on an earlier patch by Michael Paquier, who also
reviewed this one. This also incorporates an idea from Artur
Zakirov.
Discussion: http://postgr.es/m/CACYUyc8yccE4xfxhqxfh_Mh38j7dRFuxfaK1p6dSNAEUakxUyQ@mail.gmail.com
Make the typedefs for output plugins consistent with project style;
they were previously not even consistent with each other as to layout
or inclusion of parameter names. Make the documentation look the same,
and fix errors therein (missing and misdescribed parameters).
Back-patch because of the documentation bugs.
The new function allows to cast from one NodeTag based type to
another, while asserting that the conversion is valid. This replaces
the common pattern of doing a cast and a Assert(IsA(ptr, type))
close-by.
As this seems likely to be used pervasively, we decided to backpatch
this change the addition of this macro. Otherwise backpatched fixes
are more likely not to work on back-branches.
On branches before 9.6, where we do not yet rely on inline functions
being available, the type assertion is only performed if PG_USE_INLINE
support is detected. The cast obviously is performed regardless.
For the benefit of verifying the macro compiles in the back-branches,
this commit contains a single use of the new macro. On master, a
somewhat larger conversion will be committed separately.
Author: Peter Eisentraut and Andres Freund
Reviewed-By: Tom Lane
Discussion: https://postgr.es/m/c5d387d9-3440-f5e0-f9d4-71d53b9fbe52@2ndquadrant.com
Backpatch: 9.2-
These macros work fine when they are used directly in an "if" test or
similar, but as soon as the return values are assigned to boolean
variables (or passed as boolean arguments to some function), they become
bugs, hopefully caught by compiler warnings. To avoid future problems,
fix the definitions so that they return actual booleans.
To further minimize the risk that somebody uses them in back-patched
fixes that only work correctly in branches starting from the current
master and not in old ones, back-patch the change to supported branches
as appropriate.
See also commit af4472bcb8, and the long
discussion (and larger patch) in the thread mentioned in its commit
message.
Discussion: https://postgr.es/m/18672.1483022414@sss.pgh.pa.us
Replay of XLOG_BTREE_VACUUM during Hot Standby was previously thought to
require complex interlocking that matched the requirements on the
master. This required an O(N) operation that became a significant
problem with large indexes, causing replication delays of seconds or in
some cases minutes while the XLOG_BTREE_VACUUM was replayed.
This commit skips the “pin scan” that was previously required, by
observing in detail when and how it is safe to do so, with full
documentation. The pin scan is skipped only in replay; the VACUUM code
path on master is not touched here.
No tests included. Manual tests using an additional patch to view WAL records
and their timing have shown the change in WAL records and their handling has
successfully reduced replication delay.
This is a back-patch of commits 687f2cd7a0, 3e4b7d8798, b602842613
by Simon Riggs, to branches 9.4 and 9.5. No further backpatch is
possible because this depends on catalog scans being MVCC. I (Álvaro)
additionally updated a slight problem in the README, which explains why
this touches the 9.6 and master branches.
The CatalogSnapshot was not plugged into SnapshotResetXmin()'s accounting
for whether MyPgXact->xmin could be cleared or advanced. In normal
transactions this was masked by the fact that the transaction snapshot
would be older, but during backend startup and certain utility commands
it was possible to re-use the CatalogSnapshot after MyPgXact->xmin had
been cleared, meaning that recently-deleted rows could be pruned even
though this snapshot could still see them, causing unexpected catalog
lookup failures. This effect appears to be the explanation for a recent
failure on buildfarm member piculet.
To fix, add the CatalogSnapshot to the RegisteredSnapshots heap whenever
it is valid.
In the previous logic, it was possible for the CatalogSnapshot to remain
valid across waits for client input, but with this change that would mean
it delays advance of global xmin in cases where it did not before. To
avoid possibly causing new table-bloat problems with clients that sit idle
for long intervals, add code to invalidate the CatalogSnapshot before
waiting for client input. (When the backend is busy, it's unlikely that
the CatalogSnapshot would be the oldest snap for very long, so we don't
worry about forcing early invalidation of it otherwise.)
In passing, remove the CatalogSnapshotStale flag in favor of using
"CatalogSnapshot != NULL" to represent validity, as we do for the other
special snapshots in snapmgr.c. And improve some obsolete comments.
No regression test because I don't know a deterministic way to cause this
failure. But the stress test shown in the original discussion provokes
"cache lookup failed for relation 1255" within a few dozen seconds for me.
Back-patch to 9.4 where MVCC catalog scans were introduced. (Note: it's
quite easy to produce similar failures with the same test case in branches
before 9.4. But MVCC catalog scans were supposed to fix that.)
Discussion: <16447.1478818294@sss.pgh.pa.us>
This turns out not to be as harmless as I thought: MSVC will complain
if it sees an "extern" declaration without PGDLLEXPORT and then one with.
(Seems fairly silly, given that this can be changed after the fact by the
linker, but there you have it.) Therefore, contrib modules that have
extern's for V1 functions in header files are falling over in the
buildfarm, since none of those externs are marked PGDLLEXPORT.
We might or might not conclude that we're willing to plaster those
declarations with PGDLLEXPORT in HEAD, but in any case there's no way we're
going to ship this change in the back branches. Third-party authors would
not thank us for breaking their code in a minor release. Hence, revert
the addition of PGDLLEXPORT (but let's keep the extra info in the comment).
If we do the other changes we can revert this commit in HEAD.
Per buildfarm.
This isn't really necessary for our own code, because we use a .DEF file
in MSVC builds (see gendef.pl), or --export-all-symbols in MinGW and
Cygwin builds, to ensure that all global symbols in loadable modules
will be exported on Windows. However, third-party authors might use
different build processes that need this marker, and it's harmless
enough for our own builds.
To some extent, this is an oversight in commit e7128e8db, so back-patch
to 9.4 where that was added.
Laurenz Albe
Discussion: <A737B7A37273E048B164557ADEF4A58B539300BD@ntex2010a.host.magwien.gv.at>
Commits 470d886c3 et al intended to fix the problem that the postmaster
selected the same "random" DSM control segment ID on every start. But
using PostmasterRandom() for that destroys the intended property that the
delay between random_start_time and random_stop_time will be unpredictable.
(Said delay is probably already more predictable than we could wish, but
that doesn't mean that reducing it by a couple orders of magnitude is OK.)
Revert the previous patch and add a comment warning against misuse of
PostmasterRandom. Fix the original problem by calling srandom() early in
PostmasterMain, using a low-security seed that will later be overwritten
by PostmasterRandom.
Discussion: <20789.1474390434@sss.pgh.pa.us>
Some experimentation with an older version of gcc showed that it is able
to determine whether "if (elevel_ >= ERROR)" is compile-time constant
if elevel_ is declared "const", but otherwise not so much. We had
accounted for that in ereport() but were too miserly with braces to
make it so in elog(). I don't know how many currently-interesting
compilers have the same quirk, but in case it will save some code
space, let's make sure that elog() is on the same footing as ereport()
for this purpose.
Back-patch to 9.3 where we introduced pg_unreachable() calls into
elog/ereport.
ExecReScanAgg's check for whether it could re-use a previously calculated
hashtable neglected the possibility that the Agg node might reference
PARAM_EXEC Params that are not referenced by its input plan node. That's
okay if the Params are in upper tlist or qual expressions; but if one
appears in aggregate input expressions, then the hashtable contents need
to be recomputed when the Param's value changes.
To avoid unnecessary performance degradation in the case of a Param that
isn't within an aggregate input, add logic to the planner to determine
which Params are within aggregate inputs. This requires a new field in
struct Agg, but fortunately we never write plans to disk, so this isn't
an initdb-forcing change.
Per report from Jeevan Chalke. This has been broken since forever,
so back-patch to all supported branches.
Andrew Gierth, with minor adjustments by me
Report: <CAM2+6=VY8ykfLT5Q8vb9B6EbeBk-NGuLbT6seaQ+Fq4zXvrDcA@mail.gmail.com>
NUMERIC_MAX_PRECISION is a purely arbitrary constraint on the precision
and scale you can write in a numeric typmod. It might once have had
something to do with the allowed range of a typmod-less numeric value,
but at least since 9.1 we've allowed, and documented that we allowed,
any value that would physically fit in the numeric storage format;
which is something over 100000 decimal digits, not 1000.
Hence, get rid of numeric_in()'s use of NUMERIC_MAX_PRECISION as a limit
on the allowed range of the exponent in scientific-format input. That was
especially silly in view of the fact that you can enter larger numbers as
long as you don't use 'e' to do it. Just constrain the value enough to
avoid localized overflow, and let make_result be the final arbiter of what
is too large. Likewise adjust ecpg's equivalent of this code.
Also get rid of numeric_recv()'s use of NUMERIC_MAX_PRECISION to limit the
number of base-NBASE digits it would accept. That created a dump/restore
hazard for binary COPY without doing anything useful; the wire-format
limit on number of digits (65535) is about as tight as we would want.
In HEAD, also get rid of pg_size_bytes()'s unnecessary intimacy with what
the numeric range limit is. That code doesn't exist in the back branches.
Per gripe from Aravind Kumar. Back-patch to all supported branches,
since they all contain the documentation claim about allowed range of
NUMERIC (cf commit cabf5d84b).
Discussion: <2895.1471195721@sss.pgh.pa.us>
If ANALYZE found no repeated non-null entries in its sample, it set the
column's stadistinct value to -1.0, intending to indicate that the entries
are all distinct. But what this value actually means is that the number
of distinct values is 100% of the table's rowcount, and thus it was
overestimating the number of distinct values by however many nulls there
are. This could lead to very poor selectivity estimates, as for example
in a recent report from Andreas Joseph Krogh. We should discount the
stadistinct value by whatever we've estimated the nulls fraction to be.
(That is what will happen if we choose to use a negative stadistinct for
a column that does have repeated entries, so this code path was just
inconsistent.)
In addition to fixing the stadistinct entries stored by several different
ANALYZE code paths, adjust the logic where get_variable_numdistinct()
forces an "all distinct" estimate on the basis of finding a relevant unique
index. Unique indexes don't reject nulls, so there's no reason to assume
that the null fraction doesn't apply.
Back-patch to all supported branches. Back-patching is a bit of a judgment
call, but this problem seems to affect only a few users (else we'd have
identified it long ago), and it's bad enough when it does happen that
destabilizing plan choices in a worse direction seems unlikely.
Patch by me, with documentation wording suggested by Dean Rasheed
Report: <VisenaEmail.26.df42f82acae38a58.156463942b8@tc7-visena>
Discussion: <16143.1470350371@sss.pgh.pa.us>
This coding pattern creates a race condition, because if an interesting
interrupt happens after we've checked InterruptPending but before we reset
our latch, the latch-setting done by the signal handler would get lost,
and then we might block at WaitLatch in the next iteration without ever
noticing the interrupt condition. You can put the CHECK_FOR_INTERRUPTS
before WaitLatch or after ResetLatch, but not between them.
Aside from fixing the bugs, add some explanatory comments to latch.h
to perhaps forestall the next person from making the same mistake.
In HEAD, also replace gather_readnext's direct call of
HandleParallelMessages with CHECK_FOR_INTERRUPTS. It does not seem clean
or useful for this one caller to bypass ProcessInterrupts and go straight
to HandleParallelMessages; not least because that fails to consider the
InterruptPending flag, resulting in useless work both here
(if InterruptPending isn't set) and in the next CHECK_FOR_INTERRUPTS call
(if it is).
This thinko seems to have been introduced in the initial coding of
storage/ipc/shm_mq.c (commit ec9037df2), and then blindly copied into all
the subsequent parallel-query support logic. Back-patch relevant hunks
to 9.4 to extirpate the error everywhere.
Discussion: <1661.1469996911@sss.pgh.pa.us>
Commits 4452000f3 et al established semantics for NullTest.argisrow that
are a bit different from its initial conception: rather than being merely
a cache of whether we've determined the input to have composite type,
the flag now has the further meaning that we should apply field-by-field
testing as per the standard's definition of IS [NOT] NULL. If argisrow
is false and yet the input has composite type, the construct instead has
the semantics of IS [NOT] DISTINCT FROM NULL. Update the comments in
primnodes.h to clarify this, and fix ruleutils.c and deparse.c to print
such cases correctly. In the case of ruleutils.c, this merely results in
cosmetic changes in EXPLAIN output, since the case can't currently arise
in stored rules. However, it represents a live bug for deparse.c, which
would formerly have sent a remote query that had semantics different
from the local behavior. (From the user's standpoint, this means that
testing a remote nested-composite column for null-ness could have had
unexpected recursive behavior much like that fixed in 4452000f3.)
In a related but somewhat independent fix, make plancat.c set argisrow
to false in all NullTest expressions constructed to represent "attnotnull"
constructs. Since attnotnull is actually enforced as a simple null-value
check, this is a more accurate representation of the semantics; we were
previously overpromising what it meant for composite columns, which might
possibly lead to incorrect planner optimizations. (It seems that what the
SQL spec expects a NOT NULL constraint to mean is an IS NOT NULL test, so
arguably we are violating the spec and should fix attnotnull to do the
other thing. If we ever do, this part should get reverted.)
Back-patch, same as the previous commit.
Discussion: <10682.1469566308@sss.pgh.pa.us>
GiST index build could go into an infinite loop when presented with boxes
(or points, circles or polygons) containing NaN component values. This
happened essentially because the code assumed that x == x is true for any
"double" value x; but it's not true for NaNs. The looping behavior was not
the only problem though: we also attempted to sort the items using simple
double comparisons. Since NaNs violate the trichotomy law, qsort could
(in principle at least) get arbitrarily confused and mess up the sorting of
ordinary values as well as NaNs. And we based splitting choices on box size
calculations that could produce NaNs, again resulting in undesirable
behavior.
To fix, replace all comparisons of doubles in this logic with
float8_cmp_internal, which is NaN-aware and is careful to sort NaNs
consistently, higher than any non-NaN. Also rearrange the box size
calculation to not produce NaNs; instead it should produce an infinity
for a box with NaN on one side and not-NaN on the other.
I don't by any means claim that this solves all problems with NaNs in
geometric values, but it should at least make GiST index insertion work
reliably with such data. It's likely that the index search side of things
still needs some work, and probably regular geometric operations too.
But with this patch we're laying down a convention for how such cases
ought to behave.
Per bug #14238 from Guang-Dih Lei. Back-patch to 9.2; the code used before
commit 7f3bd86843 is quite different and doesn't lock up on my simple
test case, nor on the submitter's dataset.
Report: <20160708151747.1426.60150@wrigleys.postgresql.org>
Discussion: <28685.1468246504@sss.pgh.pa.us>
Previously, these commands always planned the given query and went through
executor startup before deciding not to actually run the query if WITH NO
DATA is specified. This behavior is problematic for pg_dump because it
may cause errors to be raised that we would rather not see before a
REFRESH MATERIALIZED VIEW command is issued. See for example bug #13907
from Marian Krucina. This change is not sufficient to fix that particular
bug, because we also need to tweak pg_dump to issue the REFRESH later,
but it's a necessary step on the way.
A user-visible side effect of doing things this way is that the returned
command tag for WITH NO DATA cases will now be "CREATE MATERIALIZED VIEW"
or "CREATE TABLE AS", not "SELECT 0". We could preserve the old behavior
but it would take more code, and arguably that was just an implementation
artifact not intended behavior anyhow.
In 9.5 and HEAD, also get rid of the static variable CreateAsReladdr, which
was trouble waiting to happen; there is not any prohibition on nested
CREATE commands.
Back-patch to 9.3 where CREATE MATERIALIZED VIEW was introduced.
Michael Paquier and Tom Lane
Report: <20160202161407.2778.24659@wrigleys.postgresql.org>
After pg_upgrade, it is possible that some tuples' Xmax have multixacts
corresponding to the old installation; such multixacts cannot have
running members anymore. In many code sites we already know not to read
them and clobber them silently, but at least when VACUUM tries to freeze
a multixact or determine whether one needs freezing, there's an attempt
to resolve it to its member transactions by calling GetMultiXactIdMembers,
and if the multixact value is "in the future" with regards to the
current valid multixact range, an error like this is raised:
ERROR: MultiXactId 123 has not been created yet -- apparent wraparound
and vacuuming fails. Per discussion with Andrew Gierth, it is completely
bogus to try to resolve multixacts coming from before a pg_upgrade,
regardless of where they stand with regards to the current valid
multixact range.
It's possible to get from under this problem by doing SELECT FOR UPDATE
of the problem tuples, but if tables are large, this is slow and
tedious, so a more thorough solution is desirable.
To fix, we realize that multixacts in xmax created in 9.2 and previous
have a specific bit pattern that is never used in 9.3 and later (we
already knew this, per comments and infomask tests sprinkled in various
places, but we weren't leveraging this knowledge appropriately).
Whenever the infomask of the tuple matches that bit pattern, we just
ignore the multixact completely as if Xmax wasn't set; or, in the case
of tuple freezing, we act as if an unwanted value is set and clobber it
without decoding. This guarantees that no errors will be raised, and
that the values will be progressively removed until all tables are
clean. Most callers of GetMultiXactIdMembers are patched to recognize
directly that the value is a removable "empty" multixact and avoid
calling GetMultiXactIdMembers altogether.
To avoid changing the signature of GetMultiXactIdMembers() in back
branches, we keep the "allow_old" boolean flag but rename it to
"from_pgupgrade"; if the flag is true, we always return an empty set
instead of looking up the multixact. (I suppose we could remove the
argument in the master branch, but I chose not to do so in this commit).
This was broken all along, but the error-facing message appeared first
because of commit 8e9a16ab8f and was partially fixed in a25c2b7c4d.
This fix, backpatched all the way back to 9.3, goes approximately in the
same direction as a25c2b7c4d but should cover all cases.
Bug analysis by Andrew Gierth and Álvaro Herrera.
A number of public reports match this bug:
https://www.postgresql.org/message-id/20140330040029.GY4582@tamriel.snowman.nethttps://www.postgresql.org/message-id/538F3D70.6080902@publicrelay.comhttps://www.postgresql.org/message-id/556439CF.7070109@pscs.co.ukhttps://www.postgresql.org/message-id/SG2PR06MB0760098A111C88E31BD4D96FB3540@SG2PR06MB0760.apcprd06.prod.outlook.comhttps://www.postgresql.org/message-id/20160615203829.5798.4594@wrigleys.postgresql.org
Document these as "nearest integer >= argument" and "nearest integer <=
argument", which will hopefully be less confusing than the old formulation.
New wording is from Matlab via Dean Rasheed.
I changed the pg_description entries as well as the SGML docs. In the
back branches, this will only affect installations initdb'd in the future,
but it should be harmless otherwise.
Discussion: <CAEZATCW3yzJo-NMSiQs5jXNFbTsCEftZS-Og8=FvFdiU+kYuSA@mail.gmail.com>
If we ANALYZE only selected columns of a table, we should not postpone
auto-analyze because of that; other columns may well still need stats
updates. As committed, the counter is left alone if a column list is
given, whether or not it includes all analyzable columns of the table.
Per complaint from Tomasz Ostrowski.
It's been like this a long time, so back-patch to all supported branches.
Report: <ef99c1bd-ff60-5f32-2733-c7b504eb960c@ato.waw.pl>
The original intent in the stats collector was that we should not write out
stats data oftener than every PGSTAT_STAT_INTERVAL msec. Backends will not
make requests at all if they see the existing data is newer than that, and
the stats collector is supposed to disregard requests having a cutoff_time
older than its most recently written data, so that close-together requests
don't result in multiple writes. But the latter part of that got broken
in commit 187492b6c2, so that if two backends concurrently decide
the existing stats are too old, the collector would write the data twice.
(In principle the collector's logic would still merge requests as long as
the second one arrives before we've actually written data ... but since
the message collection loop would write data immediately after processing
a single inquiry message, that never happened in practice, and in any case
the window in which it might work would be much shorter than
PGSTAT_STAT_INTERVAL.)
To fix, improve pgstat_recv_inquiry so that it checks whether the cutoff
time is too old, and doesn't add a request to the queue if so. This means
that we do not need DBWriteRequest.request_time, because the decision is
taken before making a queue entry. And that means that we don't really
need the DBWriteRequest data structure at all; an OID list of database
OIDs will serve and allow removal of some rather verbose and crufty code.
In passing, improve the comments in this area, which have been rather
neglected. Also change backend_read_statsfile so that it's not silently
relying on MyDatabaseId to have some particular value in the autovacuum
launcher process. It accidentally worked as desired because MyDatabaseId
is zero in that process; but that does not seem like a dependency we want,
especially with no documentation about it.
Although this patch is mine, it turns out I'd rediscovered a known bug,
for which Tomas Vondra had already submitted a patch that's functionally
equivalent to the non-cosmetic aspects of this patch. Thanks to Tomas
for reviewing this version.
Back-patch to 9.3 where the bug was introduced.
Prior-Discussion: <1718942738eb65c8407fcd864883f4c8@fuzzy.cz>
Patch: <4625.1464202586@sss.pgh.pa.us>