Commit graph

6504 commits

Author SHA1 Message Date
Tom Lane
0901d68bab Fix another oversight in checking if a join with LATERAL refs is legal.
It was possible for the planner to decide to join a LATERAL subquery to
the outer side of an outer join before the outer join itself is completed.
Normally that's fine because of the associativity rules, but it doesn't
work if the subquery contains a lateral reference to the inner side of the
outer join.  In such a situation the outer join *must* be done first.
join_is_legal() missed this consideration and would allow the join to be
attempted, but the actual path-building code correctly decided that no
valid join path could be made, sometimes leading to planner errors such as
"failed to build any N-way joins".

Per report from Andreas Seltenreich.  Back-patch to 9.3 where LATERAL
support was added.
2015-12-07 17:41:45 -05:00
Tom Lane
7acad95463 Adopt the GNU convention for handling tar-archive members exceeding 8GB.
The POSIX standard for tar headers requires archive member sizes to be
printed in octal with at most 11 digits, limiting the representable file
size to 8GB.  However, GNU tar and apparently most other modern tars
support a convention in which oversized values can be stored in base-256,
allowing any practical file to be a tar member.  Adopt this convention
to remove two limitations:
* pg_dump with -Ft output format failed if the contents of any one table
exceeded 8GB.
* pg_basebackup failed if the data directory contained any file exceeding
8GB.  (This would be a fatal problem for installations configured with a
table segment size of 8GB or more, and it has also been seen to fail when
large core dump files exist in the data directory.)

File sizes under 8GB are still printed in octal, so that no compatibility
issues are created except in cases that would have failed entirely before.

In addition, this patch fixes several bugs in the same area:

* In 9.3 and later, we'd defined tarCreateHeader's file-size argument as
size_t, which meant that on 32-bit machines it would write a corrupt tar
header for file sizes between 4GB and 8GB, even though no error was raised.
This broke both "pg_dump -Ft" and pg_basebackup for such cases.

* pg_restore from a tar archive would fail on tables of size between 4GB
and 8GB, on machines where either "size_t" or "unsigned long" is 32 bits.
This happened even with an archive file not affected by the previous bug.

* pg_basebackup would fail if there were files of size between 4GB and 8GB,
even on 64-bit machines.

* In 9.3 and later, "pg_basebackup -Ft" failed entirely, for any file size,
on 64-bit big-endian machines.

In view of these potential data-loss bugs, back-patch to all supported
branches, even though removal of the documented 8GB limit might otherwise
be considered a new feature rather than a bug fix.
2015-11-21 20:22:39 -05:00
Tom Lane
4f33572ee6 Fix incorrect translation of minus-infinity datetimes for json/jsonb.
Commit bda76c1c8c caused both plus and
minus infinity to be rendered as "infinity", which is not only wrong
but inconsistent with the pre-9.4 behavior of to_json().  Fix that by
duplicating the coding in date_out/timestamp_out/timestamptz_out more
closely.  Per bug #13687 from Stepan Perlov.  Back-patch to 9.4, like
the previous commit.

In passing, also re-pgindent json.c, since it had gotten a bit messed up by
recent patches (and I was already annoyed by indentation-related problems
in back-patching this fix ...)
2015-10-20 11:06:24 -07:00
Tom Lane
0ecf4a9e55 Improve memory-usage accounting in regular-expression compiler.
This code previously counted the number of NFA states it created, and
complained if a limit was exceeded, so as to prevent bizarre regex patterns
from consuming unreasonable time or memory.  That's fine as far as it went,
but the code paid no attention to how many arcs linked those states.  Since
regexes can be contrived that have O(N) states but will need O(N^2) arcs
after fixempties() processing, it was still possible to blow out memory,
and take a long time doing it too.  To fix, modify the bookkeeping to count
space used by both states and arcs.

I did not bother with including the "color map" in the accounting; it
can only grow to a few megabytes, which is not a lot in comparison to
what we're allowing for states+arcs (about 150MB on 64-bit machines
or half that on 32-bit machines).

Looking at some of the larger real-world regexes captured in the Tcl
regression test suite suggests that the most that is likely to be needed
for regexes found in the wild is under 10MB, so I believe that the current
limit has enough headroom to make it okay to keep it as a hard-wired limit.

In connection with this, redefine REG_ETOOBIG as meaning "regular
expression is too complex"; the previous wording of "nfa has too many
states" was already somewhat inapropos because of the error code's use
for stack depth overrun, and it was not very user-friendly either.

Back-patch to all supported branches.
2015-10-16 15:36:17 -04:00
Tom Lane
bdde29e1ce Fix O(N^2) performance problems in regular-expression compiler.
Change the singly-linked in-arc and out-arc lists to be doubly-linked,
so that arc deletion is constant time rather than having worst-case time
proportional to the number of other arcs on the connected states.

Modify the bulk arc transfer operations copyins(), copyouts(), moveins(),
moveouts() so that they use a sort-and-merge algorithm whenever there's
more than a small number of arcs to be copied or moved.  The previous
method is O(N^2) in the number of arcs involved, because it performs
duplicate checking independently for each copied arc.  The new method may
change the ordering of existing arcs for the destination state, but nothing
really cares about that.

Provide another bulk arc copying method mergeins(), which is unused as
of this commit but is needed for the next one.  It basically is like
copyins(), but the source arcs might not all come from the same state.

Replace the O(N^2) bubble-sort algorithm used in carcsort() with a qsort()
call.

These changes greatly improve the performance of regex compilation for
large or complex regexes, at the cost of extra space for arc storage during
compilation.  The original tradeoff was probably fine when it was made, but
now we care more about speed and less about memory consumption.

Back-patch to all supported branches.
2015-10-16 14:43:18 -04:00
Tom Lane
44a6e24fbc On Windows, ensure shared memory handle gets closed if not being used.
Postmaster child processes that aren't supposed to be attached to shared
memory were not bothering to close the shared memory mapping handle they
inherit from the postmaster process.  That's mostly harmless, since the
handle vanishes anyway when the child process exits -- but the syslogger
process, if used, doesn't get killed and restarted during recovery from a
backend crash.  That meant that Windows doesn't see the shared memory
mapping as becoming free, so it doesn't delete it and the postmaster is
unable to create a new one, resulting in failure to recover from crashes
whenever logging_collector is turned on.

Per report from Dmitry Vasilyev.  It's a bit astonishing that we'd not
figured this out long ago, since it's been broken from the very beginnings
of out native Windows support; probably some previously-unexplained trouble
reports trace to this.

A secondary problem is that on Cygwin (perhaps only in older versions?),
exec() may not detach from the shared memory segment after all, in which
case these child processes did remain attached to shared memory, posing
the risk of an unexpected shared memory clobber if they went off the rails
somehow.  That may be a long-gone bug, but we can deal with it now if it's
still live, by detaching within the infrastructure introduced here to deal
with closing the handle.

Back-patch to all supported branches.

Tom Lane and Amit Kapila
2015-10-13 11:21:33 -04:00
Tom Lane
3d701277f8 Perform an immediate shutdown if the postmaster.pid file is removed.
The postmaster now checks every minute or so (worst case, at most two
minutes) that postmaster.pid is still there and still contains its own PID.
If not, it performs an immediate shutdown, as though it had received
SIGQUIT.

The original goal behind this change was to ensure that failed buildfarm
runs would get fully cleaned up, even if the test scripts had left a
postmaster running, which is not an infrequent occurrence.  When the
buildfarm script removes a test postmaster's $PGDATA directory, its next
check on postmaster.pid will fail and cause it to exit.  Previously, manual
intervention was often needed to get rid of such orphaned postmasters,
since they'd block new test postmasters from obtaining the expected socket
address.

However, by checking postmaster.pid and not something else, we can provide
additional robustness: manual removal of postmaster.pid is a frequent DBA
mistake, and now we can at least limit the damage that will ensue if a new
postmaster is started while the old one is still alive.

Back-patch to all supported branches, since we won't get the desired
improvement in buildfarm reliability otherwise.
2015-10-06 17:15:27 -04:00
Tom Lane
d25c7d70ff Stamp 9.4.5. 2015-10-05 15:12:06 -04:00
Tom Lane
c5e38b93c6 Add recursion depth protections to regular expression matching.
Some of the functions in regex compilation and execution recurse, and
therefore could in principle be driven to stack overflow.  The Tcl crew
has seen this happen in practice in duptraverse(), though their fix was
to put in a hard-wired limit on the number of recursive levels, which is
not too appetizing --- fortunately, we have enough infrastructure to check
the actually available stack.  Greg Stark has also seen it in other places
while fuzz testing on a machine with limited stack space.  Let's put guards
in to prevent crashes in all these places.

Since the regex code would leak memory if we simply threw elog(ERROR),
we have to introduce an API that checks for stack depth without throwing
such an error.  Fortunately that's not difficult.
2015-10-02 14:51:58 -04:00
Tom Lane
f7d896ab91 Fix low-probability memory leak in regex execution.
After an internal failure in shortest() or longest() while pinning down the
exact location of a match, find() forgot to free the DFA structure before
returning.  This is pretty unlikely to occur, since we just successfully
ran the "search" variant of the DFA; but it could happen, and it would
result in a session-lifespan memory leak since this code uses malloc()
directly.  Problem seems to have been aboriginal in Spencer's library,
so back-patch all the way.

In passing, correct a thinko in a comment I added awhile back about the
meaning of the "ntree" field.

I happened across these issues while comparing our code to Tcl's version
of the library.
2015-09-18 13:55:17 -04:00
Fujii Masao
2244c0652d Remove files signaling a standby promotion request at postmaster startup
This commit makes postmaster forcibly remove the files signaling
a standby promotion request. Otherwise, the existence of those files
can trigger a promotion too early, whether a user wants that or not.

This removal of files is usually unnecessary because they can exist
only during a few moments during a standby promotion. However
there is a race condition: if pg_ctl promote is executed and creates
the files during a promotion, the files can stay around even after
the server is brought up to new master. Then, if new standby starts
by using the backup taken from that master, the files can exist
at the server startup and should be removed in order to avoid
an unexpected promotion.

Back-patch to 9.1 where promote signal file was introduced.

Problem reported by Feike Steenbergen.
Original patch by Michael Paquier, modified by me.

Discussion: 20150528100705.4686.91426@wrigleys.postgresql.org
2015-09-09 22:52:35 +09:00
Tom Lane
37d10c524c Fix subtransaction cleanup after an outer-subtransaction portal fails.
Formerly, we treated only portals created in the current subtransaction as
having failed during subtransaction abort.  However, if the error occurred
while running a portal created in an outer subtransaction (ie, a cursor
declared before the last savepoint), that has to be considered broken too.

To allow reliable detection of which ones those are, add a bookkeeping
field to struct Portal that tracks the innermost subtransaction in which
each portal has actually been executed.  (Without this, we'd end up
failing portals containing functions that had called the subtransaction,
thereby breaking plpgsql exception blocks completely.)

In addition, when we fail an outer-subtransaction Portal, transfer its
resources into the subtransaction's resource owner, so that they're
released early in cleanup of the subxact.  This fixes a problem reported by
Jim Nasby in which a function executed in an outer-subtransaction cursor
could cause an Assert failure or crash by referencing a relation created
within the inner subtransaction.

The proximate cause of the Assert failure is that AtEOSubXact_RelationCache
assumed it could blow away a relcache entry without first checking that the
entry had zero refcount.  That was a bad idea on its own terms, so add such
a check there, and to the similar coding in AtEOXact_RelationCache.  This
provides an independent safety measure in case there are still ways to
provoke the situation despite the Portal-level changes.

This has been broken since subtransactions were invented, so back-patch
to all supported branches.

Tom Lane and Michael Paquier
2015-09-04 13:37:17 -04:00
Tom Lane
747ca66977 Use "mb" not the nonexistent "rmb" for pg_read_barrier() on Alpha.
It's only necessary to fix this in 9.4; later versions don't have this
code (because we ripped out Alpha support entirely), while earlier ones
aren't actually using pg_read_barrier() anywhere.

Per rather belated report from Christoph Berg.
2015-08-29 16:34:30 -04:00
Tom Lane
3da9c060fc Fix s_lock.h PPC assembly code to be compatible with native AIX assembler.
On recent AIX it's necessary to configure gcc to use the native assembler
(because the GNU assembler hasn't been updated to handle AIX 6+).  This
caused PG builds to fail with assembler syntax errors, because we'd try
to compile s_lock.h's gcc asm fragment for PPC, and that assembly code
relied on GNU-style local labels.  We can't substitute normal labels
because it would fail in any file containing more than one inlined use of
tas().  Fortunately, that code is stable enough, and the PPC ISA is simple
enough, that it doesn't seem like too much of a maintenance burden to just
hand-code the branch offsets, removing the need for any labels.

Note that the AIX assembler only accepts "$" for the location counter
pseudo-symbol.  The usual GNU convention is "."; but it appears that all
versions of gas for PPC also accept "$", so in theory this patch will not
break any other PPC platforms.

This has been reported by a few people, but Steve Underwood gets the credit
for being the first to pursue the problem far enough to understand why it
was failing.  Thanks also to Noah Misch for additional testing.
2015-08-29 16:09:25 -04:00
Tom Lane
7e0add3861 Accept alternate spellings of __sparcv7 and __sparcv8.
Apparently some versions of gcc prefer __sparc_v7__ and __sparc_v8__.
Per report from Waldemar Brodkorb.
2015-08-10 17:34:51 -04:00
Tom Lane
d31e79415b Fix eclass_useful_for_merging to give valid results for appendrel children.
Formerly, this function would always return "true" for an appendrel child
relation, because it would think that the appendrel parent was a potential
join target for the child.  In principle that should only lead to some
inefficiency in planning, but fuzz testing by Andreas Seltenreich disclosed
that it could lead to "could not find pathkey item to sort" planner errors
in odd corner cases.  Specifically, we would think that all columns of a
child table's multicolumn index were interesting pathkeys, causing us to
generate a MergeAppend path that sorts by all the columns.  However, if any
of those columns weren't actually used above the level of the appendrel,
they would not get added to that rel's targetlist, which would result in
being unable to resolve the MergeAppend's sort keys against its targetlist
during createplan.c.

Backpatch to 9.3.  In older versions, columns of an appendrel get added
to its targetlist even if they're not mentioned above the scan level,
so that the failure doesn't occur.  It might be worth back-patching this
fix to older versions anyway, but I'll refrain for the moment.
2015-08-06 20:14:37 -04:00
Tom Lane
118c9bb8d1 Fix bogus "out of memory" reports in tuplestore.c.
The tuplesort/tuplestore memory management logic assumed that the chunk
allocation overhead for its memtuples array could not increase when
increasing the array size.  This is and always was true for tuplesort,
but we (I, I think) blindly copied that logic into tuplestore.c without
noticing that the assumption failed to hold for the much smaller array
elements used by tuplestore.  Given rather small work_mem, this could
result in an improper complaint about "unexpected out-of-memory situation",
as reported by Brent DeSpain in bug #13530.

The easiest way to fix this is just to increase tuplestore's initial
array size so that the assumption holds.  Rather than relying on magic
constants, though, let's export a #define from aset.c that represents
the safe allocation threshold, and make tuplestore's calculation depend
on that.

Do the same in tuplesort.c to keep the logic looking parallel, even though
tuplesort.c isn't actually at risk at present.  This will keep us from
breaking it if we ever muck with the allocation parameters in aset.c.

Back-patch to all supported versions.  The error message doesn't occur
pre-9.3, not so much because the problem can't happen as because the
pre-9.3 tuplestore code neglected to check for it.  (The chance of
trouble is a great deal larger as of 9.3, though, due to changes in the
array-size-increasing strategy.)  However, allowing LACKMEM() to become
true unexpectedly could still result in less-than-desirable behavior,
so let's patch it all the way back.
2015-08-04 18:18:46 -04:00
Tom Lane
c6d901292f Fix incorrect order of lock file removal and failure to close() sockets.
Commit c9b0cbe98b accidentally broke the
order of operations during postmaster shutdown: it resulted in removing
the per-socket lockfiles after, not before, postmaster.pid.  This creates
a race-condition hazard for a new postmaster that's started immediately
after observing that postmaster.pid has disappeared; if it sees the
socket lockfile still present, it will quite properly refuse to start.
This error appears to be the explanation for at least some of the
intermittent buildfarm failures we've seen in the pg_upgrade test.

Another problem, which has been there all along, is that the postmaster
has never bothered to close() its listen sockets, but has just allowed them
to close at process death.  This creates a different race condition for an
incoming postmaster: it might be unable to bind to the desired listen
address because the old postmaster is still incumbent.  This might explain
some odd failures we've seen in the past, too.  (Note: this is not related
to the fact that individual backends don't close their client communication
sockets.  That behavior is intentional and is not changed by this patch.)

Fix by adding an on_proc_exit function that closes the postmaster's ports
explicitly, and (in 9.3 and up) reshuffling the responsibility for where
to unlink the Unix socket files.  Lock file unlinking can stay where it
is, but teach it to unlink the lock files in reverse order of creation.
2015-08-02 14:55:05 -04:00
Heikki Linnakangas
579b9f97ce Fix handling of all-zero pages in SP-GiST vacuum.
SP-GiST initialized an all-zeros page at vacuum, but that was not
WAL-logged, which is not safe. You might get a torn page write, when it gets
flushed to disk, and end-up with a half-initialized index page. To fix,
leave it in the all-zeros state, and add it to the FSM. It will be
initialized when reused. Also don't set the page-deleted flag when recycling
an empty page. That was also not WAL-logged, and a torn write of that would
cause the page to have an invalid checksum.

Backpatch to 9.2, where SP-GiST indexes were added.
2015-07-27 12:32:08 +03:00
Tom Lane
1afc1fe9c7 Back-patch some minor bug fixes in GUC code.
In 9.4, fix a 9.4.1 regression that allowed multiple entries for a
PGC_POSTMASTER variable to cause bogus complaints in the postmaster log.
(The issue here was that commit bf007a27ac unintentionally reverted
3e3f65973a, which suppressed any duplicate entries within
ParseConfigFp.  Back-patch the reimplementation just made in HEAD, which
makes use of an "ignore" field to prevent application of superseded items.)

Add missed failure check in AlterSystemSetConfigFile().  We don't really
expect ParseConfigFp() to fail, but that's not an excuse for not checking.

In both 9.3 and 9.4, remove mistaken assignment to ConfigFileLineno that
caused line counting after an include_dir directive to be completely wrong.
2015-06-28 18:38:06 -04:00
Kevin Grittner
524e1e4031 Add opaque declaration of HTAB to tqual.h.
Commit b89e151054 added the
ResolveCminCmaxDuringDecoding declaration to tqual.h, which uses an
HTAB parameter, without declaring HTAB.  It accidentally fails to
fail to build with current sources because a declaration happens to
be included, directly or indirectly, in all source files that
currently use tqual.h before tqual.h is first included, but we
shouldn't count on that.  Since an opaque declaration is enough
here, just use that, as was done in snapmgr.h.

Backpatch to 9.4, where the HTAB reference was added to tqual.h.
2015-06-27 09:55:08 -05:00
Tom Lane
e118555cf9 Fix the logic for putting relations into the relcache init file.
Commit f3b5565dd4 was a couple of bricks shy
of a load; specifically, it missed putting pg_trigger_tgrelid_tgname_index
into the relcache init file, because that index is not used by any
syscache.  However, we have historically nailed that index into cache for
performance reasons.  The upshot was that load_relcache_init_file always
decided that the init file was busted and silently ignored it, resulting
in a significant hit to backend startup speed.

To fix, reinstantiate RelationIdIsInInitFile() as a wrapper around
RelationSupportsSysCache(), which can know about additional relations
that should be in the init file despite being unknown to syscache.c.

Also install some guards against future mistakes of this type: make
write_relcache_init_file Assert that all nailed relations get written to
the init file, and make load_relcache_init_file emit a WARNING if it takes
the "wrong number of nailed relations" exit path.  Now that we remove the
init files during postmaster startup, that case should never occur in the
field, even if we are starting a minor-version update that added or removed
rels from the nailed set.  So the warning shouldn't ever be seen by end
users, but it will show up in the regression tests if somebody breaks this
logic.

Back-patch to all supported branches, like the previous commit.
2015-06-25 14:39:05 -04:00
Kevin Grittner
e9eebbaebf Fix typo in comment.
Backpatch to 9.4 to minimize possible conflicts.
2015-06-10 17:04:06 -05:00
Tom Lane
7c055f3ec3 Stamp 9.4.4. 2015-06-09 15:29:38 -04:00
Tom Lane
be25a08a91 Use a safer method for determining whether relcache init file is stale.
When we invalidate the relcache entry for a system catalog or index, we
must also delete the relcache "init file" if the init file contains a copy
of that rel's entry.  The old way of doing this relied on a specially
maintained list of the OIDs of relations present in the init file: we made
the list either when reading the file in, or when writing the file out.
The problem is that when writing the file out, we included only rels
present in our local relcache, which might have already suffered some
deletions due to relcache inval events.  In such cases we correctly decided
not to overwrite the real init file with incomplete data --- but we still
used the incomplete initFileRelationIds list for the rest of the current
session.  This could result in wrong decisions about whether the session's
own actions require deletion of the init file, potentially allowing an init
file created by some other concurrent session to be left around even though
it's been made stale.

Since we don't support changing the schema of a system catalog at runtime,
the only likely scenario in which this would cause a problem in the field
involves a "vacuum full" on a catalog concurrently with other activity, and
even then it's far from easy to provoke.  Remarkably, this has been broken
since 2002 (in commit 7863404417), but we had
never seen a reproducible test case until recently.  If it did happen in
the field, the symptoms would probably involve unexpected "cache lookup
failed" errors to begin with, then "could not open file" failures after the
next checkpoint, as all accesses to the affected catalog stopped working.
Recovery would require manually removing the stale "pg_internal.init" file.

To fix, get rid of the initFileRelationIds list, and instead consult
syscache.c's list of relations used in catalog caches to decide whether a
relation is included in the init file.  This should be a tad more efficient
anyway, since we're replacing linear search of a list with ~100 entries
with a binary search.  It's a bit ugly that the init file contents are now
so directly tied to the catalog caches, but in practice that won't make
much difference.

Back-patch to all supported branches.
2015-06-07 15:32:09 -04:00
Tom Lane
f0a8515c44 Fix planner's cost estimation for SEMI/ANTI joins with inner indexscans.
When the inner side of a nestloop SEMI or ANTI join is an indexscan that
uses all the join clauses as indexquals, it can be presumed that both
matched and unmatched outer rows will be processed very quickly: for
matched rows, we'll stop after fetching one row from the indexscan, while
for unmatched rows we'll have an indexscan that finds no matching index
entries, which should also be quick.  The planner already knew about this,
but it was nonetheless charging for at least one full run of the inner
indexscan, as a consequence of concerns about the behavior of materialized
inner scans --- but those concerns don't apply in the fast case.  If the
inner side has low cardinality (many matching rows) this could make an
indexscan plan look far more expensive than it actually is.  To fix,
rearrange the work in initial_cost_nestloop/final_cost_nestloop so that we
don't add the inner scan cost until we've inspected the indexquals, and
then we can add either the full-run cost or just the first tuple's cost as
appropriate.

Experimentation with this fix uncovered another problem: add_path and
friends were coded to disregard cheap startup cost when considering
parameterized paths.  That's usually okay (and desirable, because it thins
the path herd faster); but in this fast case for SEMI/ANTI joins, it could
result in throwing away the desired plain indexscan path in favor of a
bitmap scan path before we ever get to the join costing logic.  In the
many-matching-rows cases of interest here, a bitmap scan will do a lot more
work than required, so this is a problem.  To fix, add a per-relation flag
consider_param_startup that works like the existing consider_startup flag,
but applies to parameterized paths, and set it for relations that are the
inside of a SEMI or ANTI join.

To make this patch reasonably safe to back-patch, care has been taken to
avoid changing the planner's behavior except in the very narrow case of
SEMI/ANTI joins with inner indexscans.  There are places in
compare_path_costs_fuzzily and add_path_precheck that are not terribly
consistent with the new approach, but changing them will affect planner
decisions at the margins in other cases, so we'll leave that for a
HEAD-only fix.

Back-patch to 9.3; before that, the consider_startup flag didn't exist,
meaning that the second aspect of the patch would be too invasive.

Per a complaint from Peter Holzer and analysis by Tomas Vondra.
2015-06-03 11:58:47 -04:00
Tom Lane
de17fe43fa Stamp 9.4.3. 2015-06-01 15:05:57 -04:00
Tom Lane
a3ae3db438 Fix fsync-at-startup code to not treat errors as fatal.
Commit 2ce439f337 introduced a rather serious
regression, namely that if its scan of the data directory came across any
un-fsync-able files, it would fail and thereby prevent database startup.
Worse yet, symlinks to such files also caused the problem, which meant that
crash restart was guaranteed to fail on certain common installations such
as older Debian.

After discussion, we agreed that (1) failure to start is worse than any
consequence of not fsync'ing is likely to be, therefore treat all errors
in this code as nonfatal; (2) we should not chase symlinks other than
those that are expected to exist, namely pg_xlog/ and tablespace links
under pg_tblspc/.  The latter restriction avoids possibly fsync'ing a
much larger part of the filesystem than intended, if the user has left
random symlinks hanging about in the data directory.

This commit takes care of that and also does some code beautification,
mainly moving the relevant code into fd.c, which seems a much better place
for it than xlog.c, and making sure that the conditional compilation for
the pre_sync_fname pass has something to do with whether pg_flush_data
works.

I also relocated the call site in xlog.c down a few lines; it seems a
bit silly to be doing this before ValidateXLOGDirectoryStructure().

The similar logic in initdb.c ought to be made to match this, but that
change is noncritical and will be dealt with separately.

Back-patch to all active branches, like the prior commit.

Abhijit Menon-Sen and Tom Lane
2015-05-28 17:33:03 -04:00
Andrew Dunstan
9b74f32cdb Unpack jbvBinary objects passed to pushJsonbValue
pushJsonbValue was accepting jbvBinary objects passed as WJB_ELEM or
WJB_VALUE data. While this succeeded, when those objects were later
encountered in attempting to convert the result to Jsonb, errors
occurred. With this change we ghuarantee that a JSonbValue constructed
from calls to pushJsonbValue does not contain any jbvBinary objects.
This cures a problem observed with jsonb_delete.

This means callers of pushJsonbValue no longer need to perform this
unpacking themselves. A subsequent patch will perform some cleanup in
that area.

The error was not triggered by any 9.4 code, but this is a publicly
visible routine, and so the error could be exercised by third party
code, therefore backpatch to 9.4.

Bug report from Peter Geoghegan, fix by me.
2015-05-22 10:31:29 -04:00
Tom Lane
2eb2fcd56b Revert error-throwing wrappers for the printf family of functions.
This reverts commit 16304a0134, except
for its changes in src/port/snprintf.c; as well as commit
cac18a76bb which is no longer needed.

Fujii Masao reported that the previous commit caused failures in psql on
OS X, since if one exits the pager program early while viewing a query
result, psql sees an EPIPE error from fprintf --- and the wrapper function
thought that was reason to panic.  (It's a bit surprising that the same
does not happen on Linux.)  Further discussion among the security list
concluded that the risk of other such failures was far too great, and
that the one-size-fits-all approach to error handling embodied in the
previous patch is unlikely to be workable.

This leaves us again exposed to the possibility of the type of failure
envisioned in CVE-2015-3166.  However, that failure mode is strictly
hypothetical at this point: there is no concrete reason to believe that
an attacker could trigger information disclosure through the supposed
mechanism.  In the first place, the attack surface is fairly limited,
since so much of what the backend does with format strings goes through
stringinfo.c or psprintf(), and those already had adequate defenses.
In the second place, even granting that an unprivileged attacker could
control the occurrence of ENOMEM with some precision, it's a stretch to
believe that he could induce it just where the target buffer contains some
valuable information.  So we concluded that the risk of non-hypothetical
problems induced by the patch greatly outweighs the security risks.
We will therefore revert, and instead undertake closer analysis to
identify specific calls that may need hardening, rather than attempt a
universal solution.

We have kept the portion of the previous patch that improved snprintf.c's
handling of errors when it calls the platform's sprintf().  That seems to
be an unalloyed improvement.

Security: CVE-2015-3166
2015-05-19 18:16:19 -04:00
Tom Lane
7aeba23ee2 Stamp 9.4.2. 2015-05-18 14:29:04 -04:00
Noah Misch
2e3bd06654 Add error-throwing wrappers for the printf family of functions.
All known standard library implementations of these functions can fail
with ENOMEM.  A caller neglecting to check for failure would experience
missing output, information exposure, or a crash.  Check return values
within wrappers and code, currently just snprintf.c, that bypasses the
wrappers.  The wrappers do not return after an error, so their callers
need not check.  Back-patch to 9.0 (all supported versions).

Popular free software standard library implementations do take pains to
bypass malloc() in simple cases, but they risk ENOMEM for floating point
numbers, positional arguments, large field widths, and large precisions.
No specification demands such caution, so this commit regards every call
to a printf family function as a potential threat.

Injecting the wrappers implicitly is a compromise between patch scope
and design goals.  I would prefer to edit each call site to name a
wrapper explicitly.  libpq and the ECPG libraries would, ideally, convey
errors to the caller rather than abort().  All that would be painfully
invasive for a back-patched security fix, hence this compromise.

Security: CVE-2015-3166
2015-05-18 10:02:35 -04:00
Noah Misch
f7c4fe7d95 Permit use of vsprintf() in PostgreSQL code.
The next commit needs it.  Back-patch to 9.0 (all supported versions).
2015-05-18 10:02:35 -04:00
Robert Haas
3ecab37d97 Teach autovacuum about multixact member wraparound.
The logic introduced in commit b69bf30b9b
and repaired in commits 669c7d20e6 and
7be47c56af helps to ensure that we don't
overwrite old multixact member information while it is still needed,
but a user who creates many large multixacts can still exhaust the
member space (and thus start getting errors) while autovacuum stands
idly by.

To fix this, progressively ramp down the effective value (but not the
actual contents) of autovacuum_multixact_freeze_max_age as member space
utilization increases.  This makes autovacuum more aggressive and also
reduces the threshold for a manual VACUUM to perform a full-table scan.

This patch leaves unsolved the problem of ensuring that emergency
autovacuums are triggered even when autovacuum=off.  We'll need to fix
that via a separate patch.

Thomas Munro and Robert Haas
2015-05-08 12:53:30 -04:00
Robert Haas
d8ac77ab17 Recursively fsync() the data directory after a crash.
Otherwise, if there's another crash, some writes from after the first
crash might make it to disk while writes from before the crash fail
to make it to disk.  This could lead to data corruption.

Back-patch to all supported versions.

Abhijit Menon-Sen, reviewed by Andres Freund and slightly revised
by me.
2015-05-04 14:19:32 -04:00
Heikki Linnakangas
d72792d021 Don't archive bogus recycled or preallocated files after timeline switch.
After a timeline switch, we would leave behind recycled WAL segments that
are in the future, but on the old timeline. After promotion, and after they
become old enough to be recycled again, we would notice that they don't have
a .ready or .done file, create a .ready file for them, and archive them.
That's bogus, because the files contain garbage, recycled from an older
timeline (or prealloced as zeros). We shouldn't archive such files.

This could happen when we're following a timeline switch during replay, or
when we switch to new timeline at end-of-recovery.

To fix, whenever we switch to a new timeline, scan the data directory for
WAL segments on the old timeline, but with a higher segment number, and
remove them. Those don't belong to our timeline history, and are most
likely bogus recycled or preallocated files. They could also be valid files
that we streamed from the primary ahead of time, but in any case, they're
not needed to recover to the new timeline.
2015-04-13 17:22:21 +03:00
Tatsuo Ishii
3740bb8347 Make SyncRepWakeQueue to a static function
It is only used in src/backend/replication/syncrep.c.

Back-patch to all supported branches except 9.1 which declares the
function as static.
2015-03-26 10:38:11 +09:00
Tom Lane
c415c13b7e Build src/port/dirmod.c only on Windows.
Since commit ba7c5975ad, port/dirmod.c
has contained only Windows-specific functions.  Most platforms don't
seem to mind uselessly building an empty file, but OS X for one issues
warnings.  Hence, treat dirmod.c as a Windows-specific file selected
by configure rather than one that's always built.  We can revert this
change if dirmod.c ever gains any non-Windows functionality again.

Back-patch to 9.4 where the mentioned commit appeared.
2015-03-14 14:08:45 -04:00
Tom Lane
a271c9260f Remove code to match IPv4 pg_hba.conf entries to IPv4-in-IPv6 addresses.
In investigating yesterday's crash report from Hugo Osvaldo Barrera, I only
looked back as far as commit f3aec2c7f5 where the breakage occurred
(which is why I thought the IPv4-in-IPv6 business was undocumented).  But
actually the logic dates back to commit 3c9bb8886d and was simply
broken by erroneous refactoring in the later commit.  A bit of archives
excavation shows that we added the whole business in response to a report
that some 2003-era Linux kernels would report IPv4 connections as having
IPv4-in-IPv6 addresses.  The fact that we've had no complaints since 9.0
seems to be sufficient confirmation that no modern kernels do that, so
let's just rip it all out rather than trying to fix it.

Do this in the back branches too, thus essentially deciding that our
effective behavior since 9.0 is correct.  If there are any platforms on
which the kernel reports IPv4-in-IPv6 addresses as such, yesterday's fix
would have made for a subtle and potentially security-sensitive change in
the effective meaning of IPv4 pg_hba.conf entries, which does not seem like
a good thing to do in minor releases.  So let's let the post-9.0 behavior
stand, and change the documentation to match it.

In passing, I failed to resist the temptation to wordsmith the description
of pg_hba.conf IPv4 and IPv6 address entries a bit.  A lot of this text
hasn't been touched since we were IPv4-only.
2015-02-17 12:49:18 -05:00
Heikki Linnakangas
56a23a83fd Fix broken #ifdef for __sparcv8
Rob Rowan. Backpatch to all supported versions, like the patch that added
the broken #ifdef.
2015-02-13 23:56:57 +02:00
Tom Lane
ccf655c446 Minor cleanup/code review for "indirect toast" stuff.
Fix some issues I noticed while fooling with an extension to allow an
additional kind of toast pointer.  Much of this is just comment
improvement, but there are a couple of actual bugs, which might or might
not be reachable today depending on what can happen during logical
decoding.  An example is that toast_flatten_tuple() failed to cover the
possibility of an indirection pointer in its input.  Back-patch to 9.4
just in case that is reachable now.

In HEAD, also correct some really minor issues with recent compression
reorganization, such as dangerously underparenthesized macros.
2015-02-09 12:30:55 -05:00
Tom Lane
d0f83327d3 Stamp 9.4.1. 2015-02-02 15:42:55 -05:00
Heikki Linnakangas
57ec87c6b8 Be more careful to not lose sync in the FE/BE protocol.
If any error occurred while we were in the middle of reading a protocol
message from the client, we could lose sync, and incorrectly try to
interpret a part of another message as a new protocol message. That will
usually lead to an "invalid frontend message" error that terminates the
connection. However, this is a security issue because an attacker might
be able to deliberately cause an error, inject a Query message in what's
supposed to be just user data, and have the server execute it.

We were quite careful to not have CHECK_FOR_INTERRUPTS() calls or other
operations that could ereport(ERROR) in the middle of processing a message,
but a query cancel interrupt or statement timeout could nevertheless cause
it to happen. Also, the V2 fastpath and COPY handling were not so careful.
It's very difficult to recover in the V2 COPY protocol, so we will just
terminate the connection on error. In practice, that's what happened
previously anyway, as we lost protocol sync.

To fix, add a new variable in pqcomm.c, PqCommReadingMsg, that is set
whenever we're in the middle of reading a message. When it's set, we cannot
safely ERROR out and continue running, because we might've read only part
of a message. PqCommReadingMsg acts somewhat similarly to critical sections
in that if an error occurs while it's set, the error handler will force the
connection to be terminated, as if the error was FATAL. It's not
implemented by promoting ERROR to FATAL in elog.c, like ERROR is promoted
to PANIC in critical sections, because we want to be able to use
PG_TRY/CATCH to recover and regain protocol sync. pq_getmessage() takes
advantage of that to prevent an OOM error from terminating the connection.

To prevent unnecessary connection terminations, add a holdoff mechanism
similar to HOLD/RESUME_INTERRUPTS() that can be used hold off query cancel
interrupts, but still allow die interrupts. The rules on which interrupts
are processed when are now a bit more complicated, so refactor
ProcessInterrupts() and the calls to it in signal handlers so that the
signal handlers always call it if ImmediateInterruptOK is set, and
ProcessInterrupts() can decide to not do anything if the other conditions
are not met.

Reported by Emil Lenngren. Patch reviewed by Noah Misch and Andres Freund.
Backpatch to all supported versions.

Security: CVE-2015-0244
2015-02-02 17:09:46 +02:00
Tom Lane
b6a164e5cb Fix assorted oversights in range selectivity estimation.
calc_rangesel() failed outright when comparing range variables to empty
constant ranges with < or >=, as a result of missing cases in a switch.
It also produced a bogus estimate for > comparison to an empty range.

On top of that, the >= and > cases were mislabeled throughout.  For
nonempty constant ranges, they managed to produce the right answers
anyway as a result of counterbalancing typos.

Also, default_range_selectivity() omitted cases for elem <@ range,
range &< range, and range &> range, so that rather dubious defaults
were applied for these operators.

In passing, rearrange the code in rangesel() so that the elem <@ range
case is handled in a less opaque fashion.

Report and patch by Emre Hasegeli, some additional work by me
2015-01-30 12:31:08 -05:00
Heikki Linnakangas
dc40ca6963 Fix query-duration memory leak with GIN rescans.
The requiredEntries / additionalEntries arrays were not freed in
freeScanKeys() like other per-key stuff.

It's not obvious, but startScanKey() was only ever called after the keys
have been initialized with ginNewScanKey(). That's why it doesn't need to
worry about freeing existing arrays. The ginIsNewKey() test in gingetbitmap
was never true, because ginrescan free's the existing keys, and it's not OK
to call gingetbitmap twice in a row without calling ginrescan in between.
To make that clear, remove the unnecessary ginIsNewKey(). And just to be
extra sure that nothing funny happens if there is an existing key after all,
call freeScanKeys() to free it if it exists. This makes the code more
straightforward.

(I'm seeing other similar leaks in testing a query that rescans an GIN index
scan, but that's a different issue. This just fixes the obvious leak with
those two arrays.)

Backpatch to 9.4, where GIN fast scan was added.
2015-01-30 17:59:17 +01:00
Tom Lane
d25192892d Improve performance of EXPLAIN with large range tables.
As of 9.3, ruleutils.c goes to some lengths to ensure that table and column
aliases used in its output are unique.  Of course this takes more time than
was required before, which in itself isn't fatal.  However, EXPLAIN was set
up so that recalculation of the unique aliases was repeated for each
subexpression printed in a plan.  That results in O(N^2) time and memory
consumption for large plan trees, which did not happen in older branches.

Fortunately, the expensive work is the same across a whole plan tree,
so there is no need to repeat it; we can do most of the initialization
just once per query and re-use it for each subexpression.  This buys
back most (not all) of the performance loss since 9.2.

We need an extra ExplainState field to hold the precalculated deparse
context.  That's no problem in HEAD, but in the back branches, expanding
sizeof(ExplainState) seems risky because third-party extensions might
have local variables of that struct type.  So, in 9.4 and 9.3, introduce
an auxiliary struct to keep sizeof(ExplainState) the same.  We should
refactor the APIs to avoid such local variables in future, but that's
material for a separate HEAD-only commit.

Per gripe from Alexey Bashtanov.  Back-patch to 9.3 where the issue
was introduced.
2015-01-15 13:18:16 -05:00
Tom Lane
733728ff37 Fix libpq's behavior when /etc/passwd isn't readable.
Some users run their applications in chroot environments that lack an
/etc/passwd file.  This means that the current UID's user name and home
directory are not obtainable.  libpq used to be all right with that,
so long as the database role name to use was specified explicitly.
But commit a4c8f14364 broke such cases by
causing any failure of pg_fe_getauthname() to be treated as a hard error.
In any case it did little to advance its nominal goal of causing errors
in pg_fe_getauthname() to be reported better.  So revert that and instead
put some real error-reporting code in place.  This requires changes to the
APIs of pg_fe_getauthname() and pqGetpwuid(), since the latter had
departed from the POSIX-specified API of getpwuid_r() in a way that made
it impossible to distinguish actual lookup errors from "no such user".

To allow such failures to be reported, while not failing if the caller
supplies a role name, add a second call of pg_fe_getauthname() in
connectOptions2().  This is a tad ugly, and could perhaps be avoided with
some refactoring of PQsetdbLogin(), but I'll leave that idea for later.
(Note that the complained-of misbehavior only occurs in PQsetdbLogin,
not when using the PQconnect functions, because in the latter we will
never bother to call pg_fe_getauthname() if the user gives a role name.)

In passing also clean up the Windows-side usage of GetUserName(): the
recommended buffer size is 257 bytes, the passed buffer length should
be the buffer size not buffer size less 1, and any error is reported
by GetLastError() not errno.

Per report from Christoph Berg.  Back-patch to 9.4 where the chroot
failure case was introduced.  The generally poor reporting of errors
here is of very long standing, of course, but given the lack of field
complaints about it we won't risk changing these APIs further back
(even though they're theoretically internal to libpq).
2015-01-11 12:35:47 -05:00
Noah Misch
83fb1ca5cf On Darwin, detect and report a multithreaded postmaster.
Darwin --enable-nls builds use a substitute setlocale() that may start a
thread.  Buildfarm member orangutan experienced BackendList corruption
on account of different postmaster threads executing signal handlers
simultaneously.  Furthermore, a multithreaded postmaster risks undefined
behavior from sigprocmask() and fork().  Emit LOG messages about the
problem and its workaround.  Back-patch to 9.0 (all supported versions).
2015-01-07 22:36:35 -05:00
Andres Freund
70e36adb0d Add pg_string_endswith as the start of a string helper library in src/common.
Backpatch to 9.3 where src/common was introduce, because a bugfix that
needs to be backpatched, requires the function. Earlier branches will
have to duplicate the code.
2015-01-03 20:54:13 +01:00
Heikki Linnakangas
4e241f7cdf Revert the GinMaxItemSize calculation so that we fit 3 tuples per page.
Commit 36a35c55 changed the divisor from 3 to 6, for no apparent reason.
Reducing GinMaxItemSize like that created a dump/reload hazard: loading a
9.3 database to 9.4 might fail with "index row size XXX exceeds maximum 1352
for index ..." error. Revert the change.

While we're at it, make the calculation slightly more accurate. It used to
divide the available space on page by three, then subtract
sizeof(ItemIdData), and finally round down. That's not totally accurate; the
item pointers for the three items are packed tight right after the page
header, but there is alignment padding after the item pointers. Change the
calculation to reflect that, like BTMaxItemSize does. I tested this with
different block sizes on systems with 4- and 8-byte alignment, and the value
after the final MAXALIGN_DOWN was the same with both methods on all
configurations. So this does not make any difference currently, but let's be
tidy.

Also add a comment explaining what the macro does.

This fixes bug #12292 reported by Robert Thaler. Backpatch to 9.4, where the
bug was introduced.
2014-12-30 14:53:03 +02:00