Commit graph

6575 commits

Author SHA1 Message Date
Andres Freund
55d7027d58 Unify SIGHUP handling between normal and walsender backends.
Because walsender and normal backends share the same main loop it's
problematic to have two different flag variables, set in signal
handlers, indicating a pending configuration reload.  Only certain
walsender commands reach code paths checking for the
variable (START_[LOGICAL_]REPLICATION, CREATE_REPLICATION_SLOT
... LOGICAL, notably not base backups).

This is a bug present since the introduction of walsender, but has
gotten worse in releases since then which allow walsender to do more.

A later patch, not slated for v10, will similarly unify SIGHUP
handling in other types of processes as well.

Author: Petr Jelinek, Andres Freund
Reviewed-By: Michael Paquier
Discussion: https://postgr.es/m/20170423235941.qosiuoyqprq4nu7v@alap3.anarazel.de
Backpatch: 9.2-, bug is present since 9.0
2017-06-05 19:18:16 -07:00
Andres Freund
1cdc0ab9c1 Prevent possibility of panics during shutdown checkpoint.
When the checkpointer writes the shutdown checkpoint, it checks
afterwards whether any WAL has been written since it started and
throws a PANIC if so.  At that point, only walsenders are still
active, so one might think this could not happen, but walsenders can
also generate WAL, for instance in BASE_BACKUP and logical decoding
related commands (e.g. via hint bits).  So they can trigger this panic
if such a command is run while the shutdown checkpoint is being
written.

To fix this, divide the walsender shutdown into two phases.  First,
checkpointer, itself triggered by postmaster, sends a
PROCSIG_WALSND_INIT_STOPPING signal to all walsenders.  If the backend
is idle or runs an SQL query this causes the backend to shutdown, if
logical replication is in progress all existing WAL records are
processed followed by a shutdown.  Otherwise this causes the walsender
to switch to the "stopping" state. In this state, the walsender will
reject any further replication commands. The checkpointer begins the
shutdown checkpoint once all walsenders are confirmed as
stopping. When the shutdown checkpoint finishes, the postmaster sends
us SIGUSR2. This instructs walsender to send any outstanding WAL,
including the shutdown checkpoint record, wait for it to be replicated
to the standby, and then exit.

Author: Andres Freund, based on an earlier patch by Michael Paquier
Reported-By: Fujii Masao, Andres Freund
Reviewed-By: Michael Paquier
Discussion: https://postgr.es/m/20170602002912.tqlwn4gymzlxpvs2@alap3.anarazel.de
Backpatch: 9.4, where logical decoding was introduced
2017-06-05 19:18:16 -07:00
Andres Freund
79abd23db1 Fix race condition leading to hanging logical slot creation.
The snapshot assembly during the creation of logical slots relied
waiting for transactions in xl_running_xacts to end, by checking for
their commit/abort records.  Unfortunately, despite locking, it is
possible to see an xl_running_xact record listing transactions as
ready, that have already WAL-logged an commit/abort record, as the
locking just prevents the ProcArray to be adjusted, and the commit
record has to be logged first.

That lead to either delayed or hanging snapshot creation, because
snapbuild.c would wait "forever" to see commit/abort records for some
transactions.  That hang resolved only if a xl_running_xacts record
without any running transactions happened to be logged, far from
certain on a busy server.

It's impractical to prevent that via more heavyweight locking, the
likelihood of deadlocks and significantly increased contention would
be too big.

Instead change the initial snapshot creation to be solely based on
tracking the oldest running transaction via
xl_running_xacts->oldestRunningXid - that actually ends up
significantly simplifying the code.  That has two disadvantages:
1) Because we cannot fully "trust" the contents of xl_running_xacts,
   we cannot use it to build the initial snapshot.  Instead we have to
   wait twice for all running transactions to finish.
2) Previously a slot, unless the race occurred, could be created when
   the all transaction perceived as running based on commit/abort
   records, now we have to wait for the next xl_running_xacts record.
To address that, trigger logging new xl_running_xacts record from
within snapbuild.c exactly when necessary.

Unfortunately snabuild.c's SnapBuild is stored on disk, one of the
stupider ideas of a certain Mr Freund, so we can't change it in a
minor release.  As this is going to be backpatched, we have to hack
around a bit to keep on-disk compatibility.  A later commit will
rejigger that on master.

Author: Andres Freund, based on a quite different patch from Petr Jelinek
Analyzed-By: Petr Jelinek
Reviewed-By: Petr Jelinek
Discussion: https://postgr.es/m/f37e975c-908f-858e-707f-058d3b1eb214@2ndquadrant.com
Backpatch: 9.4-, where logical decoding has been introduced
2017-05-13 14:21:00 -07:00
Tom Lane
32900700f4 Avoid searching for callback functions in CallSyscacheCallbacks().
We have now grown enough registerable syscache-invalidation callback
functions that the original assumption that there would be few of them
is causing performance problems.  In particular, let's fix things so that
CallSyscacheCallbacks doesn't have to search the whole array to find
which callback(s) to invoke for a given cache ID.  Preserve the original
behavior that callbacks are called in order of registration, just in
case there's someplace that depends on that (which I doubt).

In support of this, export the number of syscaches from syscache.h.
People could have found that out anyway from the enum, but adding a
#define makes that much safer.

This provides a useful additional speedup in Mathieu Fenniak's
logical-decoding test case, although we're reaching the point of
diminishing returns there.  I think any further improvement will have
to come from reducing the number of cache invalidations that are
triggered in the first place.  Still, we can hope that this change
gives some incremental benefit for all invalidation scenarios.

Back-patch to 9.4 where logical decoding was introduced.

Discussion: https://postgr.es/m/CAHoiPjzea6N0zuCi=+f9v_j94nfsy6y8SU7-=bp4=7qw6_i=Rg@mail.gmail.com
2017-05-12 19:05:30 -04:00
Tom Lane
64417f8d35 Avoid searching for the target catcache in CatalogCacheIdInvalidate.
A test case provided by Mathieu Fenniak shows that the initial search for
the target catcache in CatalogCacheIdInvalidate consumes a very significant
amount of overhead in cases where cache invalidation is triggered but has
little useful work to do.  There is no good reason for that search to exist
at all, as the index array maintained by syscache.c allows direct lookup of
the catcache from its ID.  We just need a frontend function in syscache.c,
matching the division of labor for most other cache-accessing operations.

While there's more that can be done in this area, this patch alone reduces
the runtime of Mathieu's example by 2X.  We can hope that it offers some
useful benefit in other cases too, although usually cache invalidation
overhead is not such a striking fraction of the total runtime.

Back-patch to 9.4 where logical decoding was introduced.  It might be
worth going further back, but presently the only case we know of where
cache invalidation is really a significant burden is in logical decoding.
Also, older branches have fewer catcaches, reducing the possible benefit.

(Note: although this nominally changes catcache's API, we have always
documented CatalogCacheIdInvalidate as a private function, so I would
have little sympathy for an external module calling it directly.  So
backpatching should be fine.)

Discussion: https://postgr.es/m/CAHoiPjzea6N0zuCi=+f9v_j94nfsy6y8SU7-=bp4=7qw6_i=Rg@mail.gmail.com
2017-05-12 18:17:29 -04:00
Tom Lane
34af9129e6 Stamp 9.4.12. 2017-05-08 17:19:04 -04:00
Tom Lane
f793effdc7 Further patch rangetypes_selfuncs.c's statistics slot management.
Values in a STATISTIC_KIND_RANGE_LENGTH_HISTOGRAM slot are float8,
not of the type of the column the statistics are for.

This bug is at least partly the fault of sloppy specification comments
for get_attstatsslot()/free_attstatsslot(): the type OID they want is that
of the stavalues entries, not of the underlying column.  (I double-checked
other callers and they seem to get this right.)  Adjust the comments to be
more correct.

Per buildfarm.

Security: CVE-2017-7484
2017-05-08 15:02:58 -04:00
Peter Eisentraut
3e5ea1f9b2 Add security checks to selectivity estimation functions
Some selectivity estimation functions run user-supplied operators over
data obtained from pg_statistic without security checks, which allows
those operators to leak pg_statistic data without having privileges on
the underlying tables.  Fix by checking that one of the following is
satisfied: (1) the user has table or column privileges on the table
underlying the pg_statistic data, or (2) the function implementing the
user-supplied operator is leak-proof.  If neither is satisfied, planning
will proceed as if there are no statistics available.

At least one of these is satisfied in most cases in practice.  The only
situations that are negatively impacted are user-defined or
not-leak-proof operators on a security-barrier view.

Reported-by: Robert Haas <robertmhaas@gmail.com>
Author: Peter Eisentraut <peter_e@gmx.net>
Author: Tom Lane <tgl@sss.pgh.pa.us>

Security: CVE-2017-7484
2017-05-08 09:19:15 -04:00
Heikki Linnakangas
96d0f988b1 Give nicer error message when connecting to a v10 server requiring SCRAM.
This is just to give the user a hint that they need to upgrade, if they try
to connect to a v10 server that uses SCRAM authentication, with an older
client.

Commit to all stable branches, but not master.

Discussion: https://www.postgresql.org/message-id/bbf45d92-3896-eeb7-7399-2111d517261b@pivotal.io
2017-05-05 11:24:29 +03:00
Andres Freund
b6ecf26ccc Don't use on-disk snapshots for exported logical decoding snapshot.
Logical decoding stores historical snapshots on disk, so that logical
decoding can restart without having to reconstruct a snapshot from
scratch (for which the resources are not guaranteed to be present
anymore).  These serialized snapshots were also used when creating a
new slot via the walsender interface, which can export a "full"
snapshot (i.e. one that can read all tables, not just catalog ones).

The problem is that the serialized snapshots are only useful for
catalogs and not for normal user tables.  Thus the use of such a
serialized snapshot could result in an inconsistent snapshot being
exported, which could lead to queries returning wrong data.  This
would only happen if logical slots are created while another logical
slot already exists.

Author: Petr Jelinek
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/f37e975c-908f-858e-707f-058d3b1eb214@2ndquadrant.com
Backport: 9.4, where logical decoding was introduced.
2017-04-27 15:29:52 -07:00
Andres Freund
5da6461387 Preserve required !catalog tuples while computing initial decoding snapshot.
The logical decoding machinery already preserved all the required
catalog tuples, which is sufficient in the course of normal logical
decoding, but did not guarantee that non-catalog tuples were preserved
during computation of the initial snapshot when creating a slot over
the replication protocol.

This could cause a corrupted initial snapshot being exported.  The
time window for issues is usually not terribly large, but on a busy
server it's perfectly possible to it hit it.  Ongoing decoding is not
affected by this bug.

To avoid increased overhead for the SQL API, only retain additional
tuples when a logical slot is being created over the replication
protocol.  To do so this commit changes the signature of
CreateInitDecodingContext(), but it seems unlikely that it's being
used in an extension, so that's probably ok.

In a drive-by fix, fix handling of
ReplicationSlotsComputeRequiredXmin's already_locked argument, which
should only apply to ProcArrayLock, not ReplicationSlotControlLock.

Reported-By: Erik Rijkers
Analyzed-By: Petr Jelinek
Author: Petr Jelinek, heavily editorialized by Andres Freund
Reviewed-By: Andres Freund
Discussion: https://postgr.es/m/9a897b86-46e1-9915-ee4c-da02e4ff6a95@2ndquadrant.com
Backport: 9.4, where logical decoding was introduced.
2017-04-27 13:13:37 -07:00
Tom Lane
bb132cddf8 Support OpenSSL 1.1.0 in 9.4 branch.
This commit back-patches the equivalent of the 9.5-branch commits
e2838c580 and 48e5ba61e, so that we can work with OpenSSL 1.1.0 in 9.4.

(Going further back would be a good thing but will take more work;
meanwhile let's see what the buildfarm makes of this.)

Original patches by Andreas Karlsson and Heikki Linnakangas,
back-patching work by Andreas Karlsson.

Patch: https://postgr.es/m/0c817abb-3f7d-20fb-583a-58f7593a0bea@proxel.se
Discussion: https://postgr.es/m/5129.1492293840@sss.pgh.pa.us
2017-04-15 20:16:03 -04:00
Tom Lane
89a41a1b6b Improve castNode notation by introducing list-extraction-specific variants.
This extends the castNode() notation introduced by commit 5bcab1114 to
provide, in one step, extraction of a list cell's pointer and coercion to
a concrete node type.  For example, "lfirst_node(Foo, lc)" is the same
as "castNode(Foo, lfirst(lc))".  Almost half of the uses of castNode
that have appeared so far include a list extraction call, so this is
pretty widely useful, and it saves a few more keystrokes compared to the
old way.

As with the previous patch, back-patch the addition of these macros to
pg_list.h, so that the notation will be available when back-patching.

Patch by me, after an idea of Andrew Gierth's.

Discussion: https://postgr.es/m/14197.1491841216@sss.pgh.pa.us
2017-04-10 13:51:29 -04:00
Heikki Linnakangas
88101abe70 Remove dead code and fix comments in fast-path function handling.
HandleFunctionRequest() is no longer responsible for reading the protocol
message from the client, since commit 2b3a8b20c2. Fix the outdated
comments.

HandleFunctionRequest() now always returns 0, because the code that used
to return EOF was moved in 2b3a8b20c2. Therefore, the caller no longer
needs to check the return value.

Reported by Andres Freund. Backpatch to all supported versions, even though
this doesn't have any user-visible effect, to make backporting future
patches in this area easier.

Discussion: https://www.postgresql.org/message-id/20170405010525.rt5azbya5fkbhvrx@alap3.anarazel.de
2017-04-06 09:11:22 +03:00
Tom Lane
8851bcf881 Fix integer-overflow problems in interval comparison.
When using integer timestamps, the interval-comparison functions tried
to compute the overall magnitude of an interval as an int64 number of
microseconds.  As reported by Frazer McLean, this overflows for intervals
exceeding about 296000 years, which is bad since we nominally allow
intervals many times larger than that.  That results in wrong comparison
results, and possibly in corrupted btree indexes for columns containing
such large interval values.

To fix, compute the magnitude as int128 instead.  Although some compilers
have native support for int128 calculations, many don't, so create our
own support functions that can do 128-bit addition and multiplication
if the compiler support isn't there.  These support functions are designed
with an eye to allowing the int128 code paths in numeric.c to be rewritten
for use on all platforms, although this patch doesn't do that, or even
provide all the int128 primitives that will be needed for it.

Back-patch as far as 9.4.  Earlier releases did not guard against overflow
of interval values at all (commit 146604ec4 fixed that), so it seems not
very exciting to worry about overly-large intervals for them.

Before 9.6, we did not assume that unreferenced "static inline" functions
would not draw compiler warnings, so omit functions not directly referenced
by timestamp.c, the only present consumer of int128.h.  (We could have
omitted these functions in HEAD too, but since they were written and
debugged on the way to the present patch, and they look likely to be needed
by numeric.c, let's keep them in HEAD.)  I did not bother to try to prevent
such warnings in a --disable-integer-datetimes build, though.

Before 9.5, configure will never define HAVE_INT128, so the part of
int128.h that exploits a native int128 implementation is dead code in the
9.4 branch.  I didn't bother to remove it, thinking that keeping the file
looking similar in different branches is more useful.

In HEAD only, add a simple test harness for int128.h in src/tools/.

In back branches, this does not change the float-timestamps code path.
That's not subject to the same kind of overflow risk, since it computes
the interval magnitude as float8.  (No doubt, when this code was originally
written, overflow was disregarded for exactly that reason.)  There is a
precision hazard instead :-(, but we'll avert our eyes from that question,
since no complaints have been reported and that code's deprecated anyway.

Kyotaro Horiguchi and Tom Lane

Discussion: https://postgr.es/m/1490104629.422698.918452336.26FA96B7@webmail.messagingengine.com
2017-04-05 23:51:28 -04:00
Robert Haas
bbd5e600ff Fix failure to mark init buffers as BM_PERMANENT.
This could result in corruption of the init fork of an unlogged index
if the ambuildempty routine for that index used shared buffers to
create the init fork, which was true for gin, gist, and hash indexes.

Patch by me, based on an earlier patch by Michael Paquier, who also
reviewed this one.  This also incorporates an idea from Artur
Zakirov.

Discussion: http://postgr.es/m/CACYUyc8yccE4xfxhqxfh_Mh38j7dRFuxfaK1p6dSNAEUakxUyQ@mail.gmail.com
2017-03-14 12:06:36 -04:00
Tom Lane
c3b5cfe33c Formatting and docs corrections for logical decoding output plugins.
Make the typedefs for output plugins consistent with project style;
they were previously not even consistent with each other as to layout
or inclusion of parameter names.  Make the documentation look the same,
and fix errors therein (missing and misdescribed parameters).

Back-patch because of the documentation bugs.
2017-02-15 18:15:47 -05:00
Tom Lane
01306452b1 Stamp 9.4.11. 2017-02-06 16:49:02 -05:00
Heikki Linnakangas
1dd06ede17 Fix typos in comments.
Backpatch to all supported versions, where applicable, to make backpatching
of future fixes go more smoothly.

Josh Soref

Discussion: https://www.postgresql.org/message-id/CACZqfqCf+5qRztLPgmmosr-B0Ye4srWzzw_mo4c_8_B_mtjmJQ@mail.gmail.com
2017-02-06 11:34:24 +02:00
Tom Lane
d02f038c35 Orthography fixes for new castNode() macro.
Clean up hastily-composed comment.  Normalize whitespace.

Erik Rijkers and myself
2017-01-27 08:33:58 -05:00
Andres Freund
cf8c86af95 Add castNode(type, ptr) for safe casting between NodeTag based types.
The new function allows to cast from one NodeTag based type to
another, while asserting that the conversion is valid.  This replaces
the common pattern of doing a cast and a Assert(IsA(ptr, type))
close-by.

As this seems likely to be used pervasively, we decided to backpatch
this change the addition of this macro. Otherwise backpatched fixes
are more likely not to work on back-branches.

On branches before 9.6, where we do not yet rely on inline functions
being available, the type assertion is only performed if PG_USE_INLINE
support is detected. The cast obviously is performed regardless.

For the benefit of verifying the macro compiles in the back-branches,
this commit contains a single use of the new macro. On master, a
somewhat larger conversion will be committed separately.

Author: Peter Eisentraut and Andres Freund
Reviewed-By: Tom Lane
Discussion: https://postgr.es/m/c5d387d9-3440-f5e0-f9d4-71d53b9fbe52@2ndquadrant.com
Backpatch: 9.2-
2017-01-26 16:47:03 -08:00
Alvaro Herrera
75c155f65b Change some test macros to return true booleans
These macros work fine when they are used directly in an "if" test or
similar, but as soon as the return values are assigned to boolean
variables (or passed as boolean arguments to some function), they become
bugs, hopefully caught by compiler warnings.  To avoid future problems,
fix the definitions so that they return actual booleans.

To further minimize the risk that somebody uses them in back-patched
fixes that only work correctly in branches starting from the current
master and not in old ones, back-patch the change to supported branches
as appropriate.

See also commit af4472bcb8, and the long
discussion (and larger patch) in the thread mentioned in its commit
message.

Discussion: https://postgr.es/m/18672.1483022414@sss.pgh.pa.us
2017-01-18 18:06:13 -03:00
Alvaro Herrera
30e3cb307d Avoid pin scan for replay of XLOG_BTREE_VACUUM in all cases
Replay of XLOG_BTREE_VACUUM during Hot Standby was previously thought to
require complex interlocking that matched the requirements on the
master. This required an O(N) operation that became a significant
problem with large indexes, causing replication delays of seconds or in
some cases minutes while the XLOG_BTREE_VACUUM was replayed.

This commit skips the “pin scan” that was previously required, by
observing in detail when and how it is safe to do so, with full
documentation. The pin scan is skipped only in replay; the VACUUM code
path on master is not touched here.

No tests included. Manual tests using an additional patch to view WAL records
and their timing have shown the change in WAL records and their handling has
successfully reduced replication delay.

This is a back-patch of commits 687f2cd7a0, 3e4b7d8798, b602842613
by Simon Riggs, to branches 9.4 and 9.5.  No further backpatch is
possible because this depends on catalog scans being MVCC.  I (Álvaro)
additionally updated a slight problem in the README, which explains why
this touches the 9.6 and master branches.
2016-11-17 13:31:30 -03:00
Tom Lane
3e844a34b8 Account for catalog snapshot in PGXACT->xmin updates.
The CatalogSnapshot was not plugged into SnapshotResetXmin()'s accounting
for whether MyPgXact->xmin could be cleared or advanced.  In normal
transactions this was masked by the fact that the transaction snapshot
would be older, but during backend startup and certain utility commands
it was possible to re-use the CatalogSnapshot after MyPgXact->xmin had
been cleared, meaning that recently-deleted rows could be pruned even
though this snapshot could still see them, causing unexpected catalog
lookup failures.  This effect appears to be the explanation for a recent
failure on buildfarm member piculet.

To fix, add the CatalogSnapshot to the RegisteredSnapshots heap whenever
it is valid.

In the previous logic, it was possible for the CatalogSnapshot to remain
valid across waits for client input, but with this change that would mean
it delays advance of global xmin in cases where it did not before.  To
avoid possibly causing new table-bloat problems with clients that sit idle
for long intervals, add code to invalidate the CatalogSnapshot before
waiting for client input.  (When the backend is busy, it's unlikely that
the CatalogSnapshot would be the oldest snap for very long, so we don't
worry about forcing early invalidation of it otherwise.)

In passing, remove the CatalogSnapshotStale flag in favor of using
"CatalogSnapshot != NULL" to represent validity, as we do for the other
special snapshots in snapmgr.c.  And improve some obsolete comments.

No regression test because I don't know a deterministic way to cause this
failure.  But the stress test shown in the original discussion provokes
"cache lookup failed for relation 1255" within a few dozen seconds for me.

Back-patch to 9.4 where MVCC catalog scans were introduced.  (Note: it's
quite easy to produce similar failures with the same test case in branches
before 9.4.  But MVCC catalog scans were supposed to fix that.)

Discussion: <16447.1478818294@sss.pgh.pa.us>
2016-11-15 15:55:36 -05:00
Tom Lane
c7681b2b9a Stamp 9.4.10. 2016-10-24 16:12:53 -04:00
Tom Lane
80296813cc Revert addition of PGDLLEXPORT in PG_FUNCTION_INFO_V1 macro.
This turns out not to be as harmless as I thought: MSVC will complain
if it sees an "extern" declaration without PGDLLEXPORT and then one with.
(Seems fairly silly, given that this can be changed after the fact by the
linker, but there you have it.)  Therefore, contrib modules that have
extern's for V1 functions in header files are falling over in the
buildfarm, since none of those externs are marked PGDLLEXPORT.

We might or might not conclude that we're willing to plaster those
declarations with PGDLLEXPORT in HEAD, but in any case there's no way we're
going to ship this change in the back branches.  Third-party authors would
not thank us for breaking their code in a minor release.  Hence, revert
the addition of PGDLLEXPORT (but let's keep the extra info in the comment).
If we do the other changes we can revert this commit in HEAD.

Per buildfarm.
2016-10-12 18:02:12 -04:00
Tom Lane
ee6810fb46 Provide DLLEXPORT markers for C functions via PG_FUNCTION_INFO_V1 macro.
This isn't really necessary for our own code, because we use a .DEF file
in MSVC builds (see gendef.pl), or --export-all-symbols in MinGW and
Cygwin builds, to ensure that all global symbols in loadable modules
will be exported on Windows.  However, third-party authors might use
different build processes that need this marker, and it's harmless
enough for our own builds.

To some extent, this is an oversight in commit e7128e8db, so back-patch
to 9.4 where that was added.

Laurenz Albe

Discussion: <A737B7A37273E048B164557ADEF4A58B539300BD@ntex2010a.host.magwien.gv.at>
2016-10-12 12:46:16 -04:00
Heikki Linnakangas
69ade952e9 Fix copy-pasto in comment.
Amit Langote
2016-10-12 12:08:59 +03:00
Tom Lane
32cdf680f1 Avoid using PostmasterRandom() for DSM control segment ID.
Commits 470d886c3 et al intended to fix the problem that the postmaster
selected the same "random" DSM control segment ID on every start.  But
using PostmasterRandom() for that destroys the intended property that the
delay between random_start_time and random_stop_time will be unpredictable.
(Said delay is probably already more predictable than we could wish, but
that doesn't mean that reducing it by a couple orders of magnitude is OK.)
Revert the previous patch and add a comment warning against misuse of
PostmasterRandom.  Fix the original problem by calling srandom() early in
PostmasterMain, using a low-security seed that will later be overwritten
by PostmasterRandom.

Discussion: <20789.1474390434@sss.pgh.pa.us>
2016-09-23 09:54:11 -04:00
Robert Haas
626312d1b3 Use PostmasterRandom(), not random(), for DSM control segment ID.
Otherwise, every startup gets the same "random" value, which is
definitely not what was intended.
2016-09-20 12:30:42 -04:00
Tom Lane
ebd971f74d Improve unreachability recognition in elog() macro.
Some experimentation with an older version of gcc showed that it is able
to determine whether "if (elevel_ >= ERROR)" is compile-time constant
if elevel_ is declared "const", but otherwise not so much.  We had
accounted for that in ereport() but were too miserly with braces to
make it so in elog().  I don't know how many currently-interesting
compilers have the same quirk, but in case it will save some code
space, let's make sure that elog() is on the same footing as ereport()
for this purpose.

Back-patch to 9.3 where we introduced pg_unreachable() calls into
elog/ereport.
2016-09-10 17:54:23 -04:00
Tom Lane
08a823e53b Fix improper repetition of previous results from a hashed aggregate.
ExecReScanAgg's check for whether it could re-use a previously calculated
hashtable neglected the possibility that the Agg node might reference
PARAM_EXEC Params that are not referenced by its input plan node.  That's
okay if the Params are in upper tlist or qual expressions; but if one
appears in aggregate input expressions, then the hashtable contents need
to be recomputed when the Param's value changes.

To avoid unnecessary performance degradation in the case of a Param that
isn't within an aggregate input, add logic to the planner to determine
which Params are within aggregate inputs.  This requires a new field in
struct Agg, but fortunately we never write plans to disk, so this isn't
an initdb-forcing change.

Per report from Jeevan Chalke.  This has been broken since forever,
so back-patch to all supported branches.

Andrew Gierth, with minor adjustments by me

Report: <CAM2+6=VY8ykfLT5Q8vb9B6EbeBk-NGuLbT6seaQ+Fq4zXvrDcA@mail.gmail.com>
2016-08-24 14:37:51 -04:00
Tom Lane
4ac3d77947 Remove bogus dependencies on NUMERIC_MAX_PRECISION.
NUMERIC_MAX_PRECISION is a purely arbitrary constraint on the precision
and scale you can write in a numeric typmod.  It might once have had
something to do with the allowed range of a typmod-less numeric value,
but at least since 9.1 we've allowed, and documented that we allowed,
any value that would physically fit in the numeric storage format;
which is something over 100000 decimal digits, not 1000.

Hence, get rid of numeric_in()'s use of NUMERIC_MAX_PRECISION as a limit
on the allowed range of the exponent in scientific-format input.  That was
especially silly in view of the fact that you can enter larger numbers as
long as you don't use 'e' to do it.  Just constrain the value enough to
avoid localized overflow, and let make_result be the final arbiter of what
is too large.  Likewise adjust ecpg's equivalent of this code.

Also get rid of numeric_recv()'s use of NUMERIC_MAX_PRECISION to limit the
number of base-NBASE digits it would accept.  That created a dump/restore
hazard for binary COPY without doing anything useful; the wire-format
limit on number of digits (65535) is about as tight as we would want.

In HEAD, also get rid of pg_size_bytes()'s unnecessary intimacy with what
the numeric range limit is.  That code doesn't exist in the back branches.

Per gripe from Aravind Kumar.  Back-patch to all supported branches,
since they all contain the documentation claim about allowed range of
NUMERIC (cf commit cabf5d84b).

Discussion: <2895.1471195721@sss.pgh.pa.us>
2016-08-14 15:06:01 -04:00
Tom Lane
860f8fb1aa Stamp 9.4.9. 2016-08-08 16:29:39 -04:00
Tom Lane
8f180a6cca Fix misestimation of n_distinct for a nearly-unique column with many nulls.
If ANALYZE found no repeated non-null entries in its sample, it set the
column's stadistinct value to -1.0, intending to indicate that the entries
are all distinct.  But what this value actually means is that the number
of distinct values is 100% of the table's rowcount, and thus it was
overestimating the number of distinct values by however many nulls there
are.  This could lead to very poor selectivity estimates, as for example
in a recent report from Andreas Joseph Krogh.  We should discount the
stadistinct value by whatever we've estimated the nulls fraction to be.
(That is what will happen if we choose to use a negative stadistinct for
a column that does have repeated entries, so this code path was just
inconsistent.)

In addition to fixing the stadistinct entries stored by several different
ANALYZE code paths, adjust the logic where get_variable_numdistinct()
forces an "all distinct" estimate on the basis of finding a relevant unique
index.  Unique indexes don't reject nulls, so there's no reason to assume
that the null fraction doesn't apply.

Back-patch to all supported branches.  Back-patching is a bit of a judgment
call, but this problem seems to affect only a few users (else we'd have
identified it long ago), and it's bad enough when it does happen that
destabilizing plan choices in a worse direction seems unlikely.

Patch by me, with documentation wording suggested by Dean Rasheed

Report: <VisenaEmail.26.df42f82acae38a58.156463942b8@tc7-visena>
Discussion: <16143.1470350371@sss.pgh.pa.us>
2016-08-07 18:52:02 -04:00
Tom Lane
45e5496042 Don't CHECK_FOR_INTERRUPTS between WaitLatch and ResetLatch.
This coding pattern creates a race condition, because if an interesting
interrupt happens after we've checked InterruptPending but before we reset
our latch, the latch-setting done by the signal handler would get lost,
and then we might block at WaitLatch in the next iteration without ever
noticing the interrupt condition.  You can put the CHECK_FOR_INTERRUPTS
before WaitLatch or after ResetLatch, but not between them.

Aside from fixing the bugs, add some explanatory comments to latch.h
to perhaps forestall the next person from making the same mistake.

In HEAD, also replace gather_readnext's direct call of
HandleParallelMessages with CHECK_FOR_INTERRUPTS.  It does not seem clean
or useful for this one caller to bypass ProcessInterrupts and go straight
to HandleParallelMessages; not least because that fails to consider the
InterruptPending flag, resulting in useless work both here
(if InterruptPending isn't set) and in the next CHECK_FOR_INTERRUPTS call
(if it is).

This thinko seems to have been introduced in the initial coding of
storage/ipc/shm_mq.c (commit ec9037df2), and then blindly copied into all
the subsequent parallel-query support logic.  Back-patch relevant hunks
to 9.4 to extirpate the error everywhere.

Discussion: <1661.1469996911@sss.pgh.pa.us>
2016-08-01 15:13:53 -04:00
Tom Lane
66f7e40812 Fix assorted fallout from IS [NOT] NULL patch.
Commits 4452000f3 et al established semantics for NullTest.argisrow that
are a bit different from its initial conception: rather than being merely
a cache of whether we've determined the input to have composite type,
the flag now has the further meaning that we should apply field-by-field
testing as per the standard's definition of IS [NOT] NULL.  If argisrow
is false and yet the input has composite type, the construct instead has
the semantics of IS [NOT] DISTINCT FROM NULL.  Update the comments in
primnodes.h to clarify this, and fix ruleutils.c and deparse.c to print
such cases correctly.  In the case of ruleutils.c, this merely results in
cosmetic changes in EXPLAIN output, since the case can't currently arise
in stored rules.  However, it represents a live bug for deparse.c, which
would formerly have sent a remote query that had semantics different
from the local behavior.  (From the user's standpoint, this means that
testing a remote nested-composite column for null-ness could have had
unexpected recursive behavior much like that fixed in 4452000f3.)

In a related but somewhat independent fix, make plancat.c set argisrow
to false in all NullTest expressions constructed to represent "attnotnull"
constructs.  Since attnotnull is actually enforced as a simple null-value
check, this is a more accurate representation of the semantics; we were
previously overpromising what it meant for composite columns, which might
possibly lead to incorrect planner optimizations.  (It seems that what the
SQL spec expects a NOT NULL constraint to mean is an IS NOT NULL test, so
arguably we are violating the spec and should fix attnotnull to do the
other thing.  If we ever do, this part should get reverted.)

Back-patch, same as the previous commit.

Discussion: <10682.1469566308@sss.pgh.pa.us>
2016-07-28 16:09:15 -04:00
Tom Lane
054052b3a4 Fix GiST index build for NaN values in geometric types.
GiST index build could go into an infinite loop when presented with boxes
(or points, circles or polygons) containing NaN component values.  This
happened essentially because the code assumed that x == x is true for any
"double" value x; but it's not true for NaNs.  The looping behavior was not
the only problem though: we also attempted to sort the items using simple
double comparisons.  Since NaNs violate the trichotomy law, qsort could
(in principle at least) get arbitrarily confused and mess up the sorting of
ordinary values as well as NaNs.  And we based splitting choices on box size
calculations that could produce NaNs, again resulting in undesirable
behavior.

To fix, replace all comparisons of doubles in this logic with
float8_cmp_internal, which is NaN-aware and is careful to sort NaNs
consistently, higher than any non-NaN.  Also rearrange the box size
calculation to not produce NaNs; instead it should produce an infinity
for a box with NaN on one side and not-NaN on the other.

I don't by any means claim that this solves all problems with NaNs in
geometric values, but it should at least make GiST index insertion work
reliably with such data.  It's likely that the index search side of things
still needs some work, and probably regular geometric operations too.
But with this patch we're laying down a convention for how such cases
ought to behave.

Per bug #14238 from Guang-Dih Lei.  Back-patch to 9.2; the code used before
commit 7f3bd86843 is quite different and doesn't lock up on my simple
test case, nor on the submitter's dataset.

Report: <20160708151747.1426.60150@wrigleys.postgresql.org>
Discussion: <28685.1468246504@sss.pgh.pa.us>
2016-07-14 18:46:00 -04:00
Tom Lane
dc9e03bf47 Fix CREATE MATVIEW/CREATE TABLE AS ... WITH NO DATA to not plan the query.
Previously, these commands always planned the given query and went through
executor startup before deciding not to actually run the query if WITH NO
DATA is specified.  This behavior is problematic for pg_dump because it
may cause errors to be raised that we would rather not see before a
REFRESH MATERIALIZED VIEW command is issued.  See for example bug #13907
from Marian Krucina.  This change is not sufficient to fix that particular
bug, because we also need to tweak pg_dump to issue the REFRESH later,
but it's a necessary step on the way.

A user-visible side effect of doing things this way is that the returned
command tag for WITH NO DATA cases will now be "CREATE MATERIALIZED VIEW"
or "CREATE TABLE AS", not "SELECT 0".  We could preserve the old behavior
but it would take more code, and arguably that was just an implementation
artifact not intended behavior anyhow.

In 9.5 and HEAD, also get rid of the static variable CreateAsReladdr, which
was trouble waiting to happen; there is not any prohibition on nested
CREATE commands.

Back-patch to 9.3 where CREATE MATERIALIZED VIEW was introduced.

Michael Paquier and Tom Lane

Report: <20160202161407.2778.24659@wrigleys.postgresql.org>
2016-06-27 15:57:21 -04:00
Alvaro Herrera
61b24fef89 Fix handling of multixacts predating pg_upgrade
After pg_upgrade, it is possible that some tuples' Xmax have multixacts
corresponding to the old installation; such multixacts cannot have
running members anymore.  In many code sites we already know not to read
them and clobber them silently, but at least when VACUUM tries to freeze
a multixact or determine whether one needs freezing, there's an attempt
to resolve it to its member transactions by calling GetMultiXactIdMembers,
and if the multixact value is "in the future" with regards to the
current valid multixact range, an error like this is raised:
    ERROR:  MultiXactId 123 has not been created yet -- apparent wraparound
and vacuuming fails.  Per discussion with Andrew Gierth, it is completely
bogus to try to resolve multixacts coming from before a pg_upgrade,
regardless of where they stand with regards to the current valid
multixact range.

It's possible to get from under this problem by doing SELECT FOR UPDATE
of the problem tuples, but if tables are large, this is slow and
tedious, so a more thorough solution is desirable.

To fix, we realize that multixacts in xmax created in 9.2 and previous
have a specific bit pattern that is never used in 9.3 and later (we
already knew this, per comments and infomask tests sprinkled in various
places, but we weren't leveraging this knowledge appropriately).
Whenever the infomask of the tuple matches that bit pattern, we just
ignore the multixact completely as if Xmax wasn't set; or, in the case
of tuple freezing, we act as if an unwanted value is set and clobber it
without decoding.  This guarantees that no errors will be raised, and
that the values will be progressively removed until all tables are
clean.  Most callers of GetMultiXactIdMembers are patched to recognize
directly that the value is a removable "empty" multixact and avoid
calling GetMultiXactIdMembers altogether.

To avoid changing the signature of GetMultiXactIdMembers() in back
branches, we keep the "allow_old" boolean flag but rename it to
"from_pgupgrade"; if the flag is true, we always return an empty set
instead of looking up the multixact.  (I suppose we could remove the
argument in the master branch, but I chose not to do so in this commit).

This was broken all along, but the error-facing message appeared first
because of commit 8e9a16ab8f and was partially fixed in a25c2b7c4d.
This fix, backpatched all the way back to 9.3, goes approximately in the
same direction as a25c2b7c4d but should cover all cases.

Bug analysis by Andrew Gierth and Álvaro Herrera.

A number of public reports match this bug:
  https://www.postgresql.org/message-id/20140330040029.GY4582@tamriel.snowman.net
  https://www.postgresql.org/message-id/538F3D70.6080902@publicrelay.com
  https://www.postgresql.org/message-id/556439CF.7070109@pscs.co.uk
  https://www.postgresql.org/message-id/SG2PR06MB0760098A111C88E31BD4D96FB3540@SG2PR06MB0760.apcprd06.prod.outlook.com
  https://www.postgresql.org/message-id/20160615203829.5798.4594@wrigleys.postgresql.org
2016-06-24 18:29:28 -04:00
Tom Lane
d9ce54c0b8 Clarify documentation of ceil/ceiling/floor functions.
Document these as "nearest integer >= argument" and "nearest integer <=
argument", which will hopefully be less confusing than the old formulation.
New wording is from Matlab via Dean Rasheed.

I changed the pg_description entries as well as the SGML docs.  In the
back branches, this will only affect installations initdb'd in the future,
but it should be harmless otherwise.

Discussion: <CAEZATCW3yzJo-NMSiQs5jXNFbTsCEftZS-Og8=FvFdiU+kYuSA@mail.gmail.com>
2016-06-09 11:58:00 -04:00
Tom Lane
a4400c251f Don't reset changes_since_analyze after a selective-columns ANALYZE.
If we ANALYZE only selected columns of a table, we should not postpone
auto-analyze because of that; other columns may well still need stats
updates.  As committed, the counter is left alone if a column list is
given, whether or not it includes all analyzable columns of the table.
Per complaint from Tomasz Ostrowski.

It's been like this a long time, so back-patch to all supported branches.

Report: <ef99c1bd-ff60-5f32-2733-c7b504eb960c@ato.waw.pl>
2016-06-06 17:44:17 -04:00
Tom Lane
3ffff72577 Avoid useless closely-spaced writes of statistics files.
The original intent in the stats collector was that we should not write out
stats data oftener than every PGSTAT_STAT_INTERVAL msec.  Backends will not
make requests at all if they see the existing data is newer than that, and
the stats collector is supposed to disregard requests having a cutoff_time
older than its most recently written data, so that close-together requests
don't result in multiple writes.  But the latter part of that got broken
in commit 187492b6c2, so that if two backends concurrently decide
the existing stats are too old, the collector would write the data twice.
(In principle the collector's logic would still merge requests as long as
the second one arrives before we've actually written data ... but since
the message collection loop would write data immediately after processing
a single inquiry message, that never happened in practice, and in any case
the window in which it might work would be much shorter than
PGSTAT_STAT_INTERVAL.)

To fix, improve pgstat_recv_inquiry so that it checks whether the cutoff
time is too old, and doesn't add a request to the queue if so.  This means
that we do not need DBWriteRequest.request_time, because the decision is
taken before making a queue entry.  And that means that we don't really
need the DBWriteRequest data structure at all; an OID list of database
OIDs will serve and allow removal of some rather verbose and crufty code.

In passing, improve the comments in this area, which have been rather
neglected.  Also change backend_read_statsfile so that it's not silently
relying on MyDatabaseId to have some particular value in the autovacuum
launcher process.  It accidentally worked as desired because MyDatabaseId
is zero in that process; but that does not seem like a dependency we want,
especially with no documentation about it.

Although this patch is mine, it turns out I'd rediscovered a known bug,
for which Tomas Vondra had already submitted a patch that's functionally
equivalent to the non-cosmetic aspects of this patch.  Thanks to Tomas
for reviewing this version.

Back-patch to 9.3 where the bug was introduced.

Prior-Discussion: <1718942738eb65c8407fcd864883f4c8@fuzzy.cz>
Patch: <4625.1464202586@sss.pgh.pa.us>
2016-05-31 15:54:46 -04:00
Tom Lane
c2651cd249 Be more predictable about reporting "lock timeout" vs "statement timeout".
If both timeout indicators are set when we arrive at ProcessInterrupts,
we've historically just reported "lock timeout".  However, some buildfarm
members have been observed to fail isolationtester's timeouts test by
reporting "lock timeout" when the statement timeout was expected to fire
first.  The cause seems to be that the process is allowed to sleep longer
than expected (probably due to heavy machine load) so that the lock
timeout happens before we reach the point of reporting the error, and
then this arbitrary tiebreak rule does the wrong thing.  We can improve
matters by comparing the scheduled timeout times to decide which error
to report.

I had originally proposed greatly reducing the 1-second window between
the two timeouts in the test cases.  On reflection that is a bad idea,
at least for the case where the lock timeout is expected to fire first,
because that would assume that it takes negligible time to get from
statement start to the beginning of the lock wait.  Thus, this patch
doesn't completely remove the risk of test failures on slow machines.
Empirically, however, the case this handles is the one we are seeing
in the buildfarm.  The explanation may be that the other case requires
the scheduler to take the CPU away from a busy process, whereas the
case fixed here only requires the scheduler to not give the CPU back
right away to a process that has been woken from a multi-second sleep
(and, perhaps, has been swapped out meanwhile).

Back-patch to 9.3 where the isolationtester timeouts test was added.

Discussion: <8693.1464314819@sss.pgh.pa.us>
2016-05-27 10:40:20 -04:00
Tom Lane
a0cc89a281 Fix latent crash in do_text_output_multiline().
do_text_output_multiline() would fail (typically with a null pointer
dereference crash) if its input string did not end with a newline.  Such
cases do not arise in our current sources; but it certainly could happen
in future, or in extension code's usage of the function, so we should fix
it.  To fix, replace "eol += len" with "eol = text + len".

While at it, make two cosmetic improvements: mark the input string const,
and rename the argument from "text" to "txt" to dodge pgindent strangeness
(since "text" is a typedef name).

Even though this problem is only latent at present, it seems like a good
idea to back-patch the fix, since it's a very simple/safe patch and it's
not out of the realm of possibility that we might in future back-patch
something that expects sane behavior from do_text_output_multiline().

Per report from Hao Lee.

Report: <CAGoxFiFPAGyPAJLcFxTB5cGhTW2yOVBDYeqDugYwV4dEd1L_Ag@mail.gmail.com>
2016-05-23 14:16:41 -04:00
Tom Lane
d130536e93 Stamp 9.4.8. 2016-05-09 16:52:03 -04:00
Tom Lane
e1aecebc04 Fix pg_upgrade to not fail when new-cluster TOAST rules differ from old.
This patch essentially reverts commit 4c6780fd17, in favor of a much
simpler solution for the case where the new cluster would choose to create
a TOAST table but the old cluster doesn't have one: just don't create a
TOAST table.

The existing code failed in at least two different ways if the situation
arose: (1) ALTER TABLE RESET didn't grab an exclusive lock, so that the
lock sanity check in create_toast_table failed; (2) pg_upgrade did not
provide a pg_type OID for the new toast table, so that the crosscheck in
TypeCreate failed.  While both these problems were introduced by later
patches, they show that the hack being used to cause TOAST table creation
is overwhelmingly fragile (and untested).  I also note that before the
TypeCreate crosscheck was added, the code would have resulted in assigning
an indeterminate pg_type OID to the toast table, possibly causing a later
OID conflict in that catalog; so that it didn't really work even when
committed.

If we simply don't create a TOAST table, there will only be a problem if
the code tries to store a tuple that's wider than a page, and field
compression isn't sufficient to get it under a page.  Given that the TOAST
creation threshold is intended to be about a quarter of a page, it's very
hard to believe that cross-version differences in the do-we-need-a-toast-
table heuristic could result in an observable problem.  So let's just
follow the old version's conclusion about whether a TOAST table is needed.

(If we ever do change needs_toast_table() so much that this conclusion
doesn't apply, we can devise a solution at that time, and hopefully do
it in a less klugy way than 4c6780fd17 did.)

Back-patch to 9.3, like the previous patch.

Discussion: <8110.1462291671@sss.pgh.pa.us>
2016-05-06 22:05:51 -04:00
Tom Lane
72edc8ffeb Fix mishandling of equivalence-class tests in parameterized plans.
Given a three-or-more-way equivalence class, such as X.Y = Y.Y = Z.Z,
it was possible for the planner to omit one of the quals needed to
enforce that all members of the equivalence class are actually equal.
This only happened in the case of a parameterized join node for two
of the relations, that is a plan tree like

	Nested Loop
	  ->  Scan X
	  ->  Nested Loop
	    ->  Scan Y
	    ->  Scan Z
	          Filter: Z.Z = X.X

The eclass machinery normally expects to apply X.X = Y.Y when those
two relations are joined, but in this shape of plan tree they aren't
joined until the top node --- and, if the lower nested loop is marked
as parameterized by X, the top node will assume that the relevant eclass
condition(s) got pushed down into the lower node.  On the other hand,
the scan of Z assumes that it's only responsible for constraining Z.Z
to match any one of the other eclass members.  So one or another of
the required quals sometimes fell between the cracks, depending on
whether consideration of the eclass in get_joinrel_parampathinfo()
for the lower nested loop chanced to generate X.X = Y.Y or X.X = Z.Z
as the appropriate constraint there.  If it generated the latter,
it'd erroneously suppose that the Z scan would take care of matters.
To fix, force X.X = Y.Y to be generated and applied at that join node
when this case occurs.

This is *extremely* hard to hit in practice, because various planner
behaviors conspire to mask the problem; starting with the fact that the
planner doesn't really like to generate a parameterized plan of the
above shape.  (It might have been impossible to hit it before we
tweaked things to allow this plan shape for star-schema cases.)  Many
thanks to Alexander Kirkouski for submitting a reproducible test case.

The bug can be demonstrated in all branches back to 9.2 where parameterized
paths were introduced, so back-patch that far.
2016-04-29 20:19:38 -04:00
Tom Lane
65c2eeb003 Adjust DatumGetBool macro, this time for sure.
Commit 23a41573c attempted to fix the DatumGetBool macro to ignore bits
in a Datum that are to the left of the actual bool value.  But it did that
by casting the Datum to bool; and on compilers that use C99 semantics for
bool, that ends up being a whole-word test, not a 1-byte test.  This seems
to be the true explanation for contrib/seg failing in VS2015.  To fix, use
GET_1_BYTE() explicitly.  I think in the previous patch, I'd had some idea
of not having to commit to bool being exactly 1 byte wide, but regardless
of what the compiler's bool is, boolean columns and Datums are certainly
1 byte wide.

The previous fix was (eventually) back-patched into all active versions,
so do likewise with this one.
2016-04-28 11:51:13 -04:00
Tom Lane
56dee70d90 Improve TranslateSocketError() to handle more Windows error codes.
The coverage was rather lean for cases that bind() or listen() might
return.  Add entries for everything that there's a direct equivalent
for in the set of Unix errnos that elog.c has heard of.
2016-04-21 16:59:04 -04:00