Fix a longstanding thinko in replay of NEXTOID and checkpoint records: we
tried to advance nextOid only if it was behind the value in the WAL record,
but the comparison would draw the wrong conclusion if OID wraparound had
occurred since the previous value. Better to just unconditionally assign
the new value, since OID assignment shouldn't be happening during replay
anyway.
The consequences of a failure to update nextOid would be pretty minimal,
since we have long had the code set up to obtain another OID and try again
if the generated value is already in use. But in the worst case there
could be significant performance glitches while such loops iterate through
many already-used OIDs before finding a free one.
The odds of a wraparound happening during WAL replay would be small in a
crash-recovery scenario, and the length of any ensuing OID-assignment stall
quite limited anyway. But neither of these statements hold true for a
replication slave that follows a WAL stream for a long period; its behavior
upon going live could be almost unboundedly bad. Hence it seems worth
back-patching this fix into all supported branches.
Already fixed in HEAD in commit c6d76d7c82.
When default_text_search_config, default_tablespace, or temp_tablespaces
setting is set per-user or per-database, with an "ALTER USER/DATABASE SET
..." statement, don't throw an error if the text search configuration or
tablespace does not exist. In case of text search configuration, even if
it doesn't exist in the current database, it might exist in another
database, where the setting is intended to have its effect. This behavior
is now the same as search_path's.
Tablespaces are cluster-wide, so the same argument doesn't hold for
tablespaces, but there's a problem with pg_dumpall: it dumps "ALTER USER
SET ..." statements before the "CREATE TABLESPACE" statements. Arguably
that's pg_dumpall's fault - it should dump the statements in such an order
that the tablespace is created first and then the "ALTER USER SET
default_tablespace ..." statements after that - but it seems better to be
consistent with search_path and default_text_search_config anyway. Besides,
you could still create a dump that throws an error, by creating the
tablespace, running "ALTER USER SET default_tablespace", then dropping the
tablespace and running pg_dumpall on that.
Backpatch to all supported versions.
As previously coded, the QueryDesc's dest pointer was left dangling
(pointing at an already-freed receiver object) after ExecutorEnd. It's a
bit astonishing that it took us this long to notice, and I'm not sure that
the known problem case with SQL functions is the only one. Fix it by
saving and restoring the original receiver pointer, which seems the most
bulletproof way of ensuring any related bugs are also covered.
Per bug #6379 from Paul Ramsey. Back-patch to 8.4 where the current
handling of SELECT INTO was introduced.
We forgot to modify column ACLs, so privileges were still shown as having
been granted by the old owner. This meant that neither the new owner nor
a superuser could revoke the now-untraceable-to-table-owner permissions.
Per bug #6350 from Marc Balmer.
This has been wrong since column ACLs were added, so back-patch to 8.4.
smgrdounlink takes care to not throw an ERROR if it fails to unlink
something, but that caution was rendered useless by commit
3396000684, which put an smgrexists call in
front of it; smgrexists *does* throw error if anything looks funny, such
as getting a permissions error from trying to open the file. If that
happens post-commit, you get a PANIC, and what's worse the same logic
appears in the WAL replay code, so the database even fails to restart.
Restore the intended behavior by removing the smgrexists call --- it isn't
accomplishing anything that we can't do better by adjusting mdunlink's
ideas of whether it ought to warn about ENOENT or not.
Per report from Joseph Shraibman of unrecoverable crash after trying to
drop a table whose FSM fork had somehow gotten chmod'd to 000 permissions.
Backpatch to 8.4, where the bogus coding was introduced.
I forgot to change the functions to use the PG_GETARG_INET_PP() macro,
when I changed DatumGetInetP() to unpack the datum, like Datum*P macros
usually do. Also, I screwed up the definition of the PG_GETARG_INET_PP()
macro, and didn't notice because it wasn't used.
This fixes the memory leak when sorting inet values, as reported
by Jochen Erwied and debugged by Andres Freund. Backpatch to 8.3, like
the previous patch that broke it.
Since record[] uses array_in, it needs to have its element type passed
as typioparam. In HEAD and 9.1, this fix essentially reverts commit
9bc933b212, which was a hack that is no
longer needed since domains don't set their typelem anymore. Before
that, adjust the logic so that only domains are excluded from being
treated like arrays, rather than assuming that only base types should
be included. Add a regression test to demonstrate the need for this.
Per report from Maxim Boguk.
Back-patch to 8.4, where type record[] was added.
On a platform that isn't supplying __FILE__, previous coding would either
crash or give a stale result for the filename string. Not sure how likely
that is, but the original code catered for it, so let's keep doing so.
In vpath builds, the __FILE__ macro that is used in verbose error
reports contains the full absolute file name, which makes the error
messages excessively verbose. So keep only the base name, thus
matching the behavior of non-vpath builds.
A simple thinko in ginRedoUpdateMetapage, namely failing to increment a
loop counter, led to inserting records into the last pending-list page in
the wrong order (the opposite of that intended). So far as I can tell,
this would not upset the code that eventually flushes pending items into
the main part of the GIN index. But it did break the code that searched
the pending list for matches, resulting in transient failure to find
matching entries during index lookups, as illustrated in bug #6307 from
Maksym Boguk.
Back-patch to 8.4 where the incorrect code was introduced.
When the system is idle for awhile after activity, the "smoothed_alloc"
state variable in BgBufferSync converges slowly to zero. With standard
IEEE float arithmetic this results in several iterations with denormalized
values, which causes kernel traps and annoying log messages on some
poorly-designed platforms. There's no real need to track such small values
of smoothed_alloc, so we can prevent the kernel traps by forcing it to zero
as soon as it's too small to be interesting for our purposes. This issue
is purely cosmetic, since the iterations don't happen fast enough for the
kernel traps to pose any meaningful performance problem, but still it seems
worth shutting up the log messages.
The kernel log messages were previously reported by a number of people,
but kudos to Greg Matthews for tracking down exactly where they were coming
from.
a new macro, DatumGetInetPP(), that does not. This brings these macros
in line with other DatumGet*P() macros.
Backpatch to 8.3, where 1-byte header varlenas were introduced.
This assumption can be wrong when the toaster is passed a raw on-disk
tuple, because the tuple might pre-date an ALTER TABLE ADD COLUMN operation
that added columns without rewriting the table. In such a case the tuple's
natts value is smaller than what we expect from the tuple descriptor, and
so its t_hoff value could be smaller too. In fact, the tuple might not
have a null bitmap at all, and yet our current opinion of it is that it
contains some trailing nulls.
In such a situation, toast_insert_or_update did the wrong thing, because
to save a few lines of code it would use the old t_hoff value as the offset
where heap_fill_tuple should start filling data. This did not leave enough
room for the new nulls bitmap, with the result that the first few bytes of
data could be overwritten with null flag bits, as in a recent report from
Hubert Depesz Lubaczewski.
The particular case reported requires ALTER TABLE ADD COLUMN followed by
CREATE TABLE AS SELECT * FROM ... or INSERT ... SELECT * FROM ..., and
further requires that there be some out-of-line toasted fields in one of
the tuples to be copied; else we'll not reach the troublesome code.
The problem can only manifest in this form in 8.4 and later, because
before commit a77eaa6a95, CREATE TABLE AS or
INSERT/SELECT wouldn't result in raw disk tuples getting passed directly
to heap_insert --- there would always have been at least a junkfilter in
between, and that would reconstitute the tuple header with an up-to-date
t_natts and hence t_hoff. But I'm backpatching the tuptoaster change all
the way anyway, because I'm not convinced there are no older code paths
that present a similar risk.
inline_set_returning_function failed to distinguish functions returning
generic RECORD (which require a column list in the RTE, as well as run-time
type checking) from those with multiple OUT parameters (which do not).
This prevented inlining from happening. Per complaint from Jay Levitt.
Back-patch to 8.4 where this capability was introduced.
This reverts commit d23165bbc3.
As pointed out by Naoya Anzai, we need to do more work to make that
idea handle end-of-index cases, and it is looking like too much risk
for a back-patch. So bug #6278 is only going to be fixed in HEAD.
If a tuple in a syscache contains an out-of-line toasted field, and we
try to fetch that field shortly after some other transaction has committed
an update or deletion of the tuple, there is a race condition: vacuum
could come along and remove the toast tuples before we can fetch them.
This leads to transient failures like "missing chunk number 0 for toast
value NNNNN in pg_toast_2619", as seen in recent reports from Andrew
Hammond and Tim Uckun.
The design idea of syscache is that access to stale syscache entries
should be prevented by relation-level locks, but that fails for at least
two cases where toasted fields are possible: ANALYZE updates pg_statistic
rows without locking out sessions that might want to plan queries on the
same table, and CREATE OR REPLACE FUNCTION updates pg_proc rows without
any meaningful lock at all.
The least risky fix seems to be an idea that Heikki suggested when we
were dealing with a related problem back in August: forcibly detoast any
out-of-line fields before putting a tuple into syscache in the first place.
This avoids the problem because at the time we fetch the parent tuple from
the catalog, we should be holding an MVCC snapshot that will prevent
removal of the toast tuples, even if the parent tuple is outdated
immediately after we fetch it. (Note: I'm not convinced that this
statement holds true at every instant where we could be fetching a syscache
entry at all, but it does appear to hold true at the times where we could
fetch an entry that could have a toasted field. We will need to be a bit
wary of adding toast tables to low-level catalogs that don't have them
already.) An additional benefit is that subsequent uses of the syscache
entry should be faster, since they won't have to detoast the field.
Back-patch to all supported versions. The problem is significantly harder
to reproduce in pre-9.0 releases, because of their willingness to flush
every entry in a syscache whenever the underlying catalog is vacuumed
(cf CatalogCacheFlushRelation); but there is still a window for trouble.
The existing scan-direction-sensitive tests were overly complex, and
failed to stop the scan in cases where it's perfectly legitimate to do so.
Per bug #6278 from Maksym Boguk.
Back-patch to 8.3, which is as far back as the patch applies easily.
Doesn't seem worth sweating over a relatively minor performance issue in
8.2 at this late date. (But note that this was a performance regression
from 8.1 and before, so 8.2 is being left as an outlier.)
cash_out failed to handle multiple-byte thousands separators, as per bug
#6277 from Alexander Law. In addition, cash_in didn't handle that either,
nor could it handle multiple-byte positive_sign. Both routines failed to
support multiple-byte mon_decimal_point, which I did not think was worth
changing, but at least now they check for the possibility and fall back to
using '.' rather than emitting invalid output. Also, make cash_in handle
trailing negative signs, which formerly it would reject. Since cash_out
generates trailing negative signs whenever the locale tells it to, this
last omission represents a fail-to-reload-dumped-data bug. IMO that
justifies patching this all the way back.
When a foreign-key constraint references another column of the same table,
row updates will queue both the PK's ON UPDATE action and the FK's CHECK
action in the same event. The ON UPDATE action must execute first, else
the CHECK will check a non-final state of the row and possibly throw an
inappropriate error, as seen in bug #6268 from Roman Lytovchenko.
Now, the firing order of multiple triggers for the same event is determined
by the sort order of their pg_trigger.tgnames, and the auto-generated names
we use for FK triggers are "RI_ConstraintTrigger_NNNN" where NNNN is the
trigger OID. So most of the time the firing order is the same as creation
order, and so rearranging the creation order fixes it.
This patch will fail to fix the problem if the OID counter wraps around or
adds a decimal digit (eg, from 99999 to 100000) while we are creating the
triggers for an FK constraint. Given the small odds of that, and the low
usage of self-referential FKs, we'll live with that solution in the back
branches. A better fix is to change the auto-generated names for FK
triggers, but it seems unwise to do that in stable branches because there
may be client code that depends on the naming convention. We'll fix it
that way in HEAD in a separate patch.
Back-patch to all supported branches, since this bug has existed for a long
time.
This view was being insufficiently careful about matching the FK constraint
to the depended-on primary or unique key constraint. That could result in
failure to show an FK constraint at all, or showing it multiple times, or
claiming that it depended on a different constraint than the one it really
does. Fix by joining via pg_depend to ensure that we find only the correct
dependency.
Back-patch, but don't bump catversion because we can't force initdb in back
branches. The next minor-version release notes should explain that if you
need to fix this in an existing installation, you can drop the
information_schema schema then re-create it by sourcing
$SHAREDIR/information_schema.sql in each database (as a superuser of
course).
transform_null_equals is only supposed to affect "foo = NULL" expressions
given directly by the user, not the internal "foo = NULL" expression
generated from CASE-WHEN.
This fixes bug #6242, reported by Sergey. Backpatch to all supported
branches.
This oversight meant that on Windows, the pg_settings view would not
display source file or line number information for values coming from
postgresql.conf, unless the backend had received a SIGHUP since starting.
In passing, also make the error detection in read_nondefault_variables a
tad more thorough, and fix it to not lose precision on float GUCs (these
changes are already in HEAD as of my previous commit).
Thus, an object referenced in a default expression could be dropped while
the function remained present. This was unaccountably missed in the
original patch to add default parameters for functions. Reported by
Pavel Stehule.
This oversight led to a massive memory leak --- upwards of 10KB per tuple
--- during creation-time verification of an exclusion constraint based on a
GIST index. In most other scenarios it'd just be a leak of 10KB that would
be recovered at end of query, so not too significant; though perhaps the
leak would be noticeable in a situation where a GIST index was being used
in a nestloop inner indexscan. In any case, it's a real leak of long
standing, so patch all supported branches. Per report from Harald Fuchs.
Trailing-zero stripping applied by the FM specifier could strip zeroes
to the left of the decimal point, for a format with no digit positions
after the decimal point (such as "FM999.").
Reported and diagnosed by Marti Raudsepp, though I didn't use his patch.
The code in shift_jis_20042euc_jis_2004() would fetch two bytes even when
only one remained in the string. Since conversion functions aren't
supposed to assume null-terminated input, this poses a small risk of
fetching past the end of memory and incurring SIGSEGV. No such crash has
been identified in the field, but we've certainly seen the equivalent
happen in other code paths, so patch this one all the way back.
Report and patch by Noah Misch.
Since the last couple of columns of pg_type are often NULL,
sizeof(FormData_pg_type) can be an overestimate of the actual size of the
tuple data part. Therefore memcpy'ing that much out of the catalog cache,
as analyze.c was doing, poses a small risk of copying past the end of
memory and incurring SIGSEGV. No such crash has been identified in the
field, but we've certainly seen the equivalent happen in other code paths,
so patch this one all the way back.
Per valgrind testing by Noah Misch, though this is not his proposed patch.
I chose to use SearchSysCacheCopy1 rather than inventing special-purpose
infrastructure for copying only the minimal part of a pg_type tuple.
Examination of examples provided by Mark Kirkwood and others has convinced
me that actually commit 7f3eba30c9 was quite
a few bricks shy of a load. The useful part of that patch was clamping
ndistinct for the inner side of a semi or anti join, and the reason why
that's needed is that it's the only way that restriction clauses
eliminating rows from the inner relation can affect the estimated size of
the join result. I had not clearly understood why the clamping was
appropriate, and so mis-extrapolated to conclude that we should clamp
ndistinct for the outer side too, as well as for both sides of regular
joins. These latter actions were all wrong, and are reverted with this
patch. In addition, the clamping logic is now made to affect the behavior
of both paths in eqjoinsel_semi, with or without MCV lists to compare.
When we have MCVs, we suppose that the most common values are the ones
that are most likely to survive the decimation resulting from a lower
restriction clause, so we think of the clamping as eliminating non-MCV
values, or potentially even the least-common MCVs for the inner relation.
Back-patch to 8.4, same as previous fixes in this area.
This patch fixes an oversight in my commit
7f3eba30c9 of 2008-10-23. That patch
accounted for baserel restriction clauses that reduced the number of rows
coming out of a table (and hence the number of possibly-distinct values of
a join variable), but not for join restriction clauses that might have been
applied at a lower level of join. To account for the latter, look up the
sizes of the min_lefthand and min_righthand inputs of the current join,
and clamp with those in the same way as for the base relations.
Noted while investigating a complaint from Ben Chobot, although this in
itself doesn't seem to explain his report.
Back-patch to 8.4; previous versions used different estimation methods
for which this heuristic isn't relevant.
It is possible for VACUUM to scan no pages at all, if the visibility map
shows that all pages are all-visible. In this situation VACUUM has no new
information to report about the relation's tuple density, so it wasn't
changing pg_class.reltuples ... but it updated pg_class.relpages anyway.
That's wrong in general, since there is no evidence to justify changing the
density ratio reltuples/relpages, but it's particularly bad if the previous
state was relpages=reltuples=0, which means "unknown tuple density".
We just replaced "unknown" with "zero". ANALYZE would eventually recover
from this, but it could take a lot of repetitions of ANALYZE to do so if
the relation size is much larger than the maximum number of pages ANALYZE
will scan, because of the moving-average behavior introduced by commit
b4b6923e03.
The only known situation where we could have relpages=reltuples=0 and yet
the visibility map asserts everything's visible is immediately following
a pg_upgrade. It might be advisable for pg_upgrade to try to preserve the
relpages/reltuples statistics; but in any case this code is wrong on its
own terms, so fix it. Per report from Sergey Koposov.
Back-patch to 8.4, where the visibility map was introduced, same as the
previous change.
tsvector_concat() allocated its result workspace using the "conservative"
estimate of the sum of the two input tsvectors' sizes. Unfortunately that
wasn't so conservative as all that, because it supposed that the number of
pad bytes required could not grow. Which it can, as per test case from
Jesper Krogh, if there's a mix of lexemes with positions and lexemes
without them in the input data. The fix is to assume that we might add
a not-previously-present pad byte for each and every lexeme in the two
inputs; which really is conservative, but it doesn't seem worthwhile to
try to be more precise.
This is an aboriginal bug in tsvector_concat, so back-patch to all
versions containing it.
As pointed out by Sergey Koposov, repeated invocations of tbm_lossify can
make building a large tidbitmap into an O(N^2) operation. To fix, make
sure we remove more than the minimum amount of information per call, and
add a fallback path to behave sanely if we're unable to fit the bitmap
within the requested amount of memory.
This has been wrong since the tidbitmap code was written, so back-patch
to all supported branches.
The previous code tried to synchronize by unlinking the init file twice,
but that doesn't actually work: it leaves a window wherein a third process
could read the already-stale init file but miss the SI messages that would
tell it the data is stale. The result would be bizarre failures in catalog
accesses, typically "could not read block 0 in file ..." later during
startup.
Instead, hold RelCacheInitLock across both the unlink and the sending of
the SI messages. This is more straightforward, and might even be a bit
faster since only one unlink call is needed.
This has been wrong since it was put in (in 2002!), so back-patch to all
supported releases.
When updating or deleting a system catalog tuple, it's necessary to acquire
RowExclusiveLock on the catalog before looking up the tuple; otherwise a
concurrent VACUUM FULL on the catalog might move the tuple to a different
TID before we can apply the update. Coding patterns that find the tuple
via a table scan aren't at risk here, but when obtaining the tuple from a
catalog cache, correct ordering is important; and several routines in
foreigncmds.c got it wrong. Noted while running the regression tests in
parallel with VACUUM FULL of assorted system catalogs.
For consistency I moved all the heap_open calls to the starts of their
functions, including a couple for which there was no actual bug.
Back-patch to 8.4 where foreigncmds.c was added.
A PlaceHolderVar's expression might contain another, lower-level
PlaceHolderVar. If the outer PlaceHolderVar is used, the inner one
certainly will be also, and so we have to make sure that both of them get
into the placeholder_list with correct ph_may_need values during the
initial pre-scan of the query (before deconstruct_jointree starts).
We did this correctly for PlaceHolderVars appearing in the query quals,
but overlooked the issue for those appearing in the top-level targetlist;
with the result that nested placeholders referenced only in the targetlist
did not work correctly, as illustrated in bug #6154.
While at it, add some error checking to find_placeholder_info to ensure
that we don't try to create new placeholders after it's too late to do so;
they have to all be created before deconstruct_jointree starts.
Back-patch to 8.4 where the PlaceHolderVar mechanism was introduced.
This disables an entirely unnecessary "sanity check" that causes failures
in nonblocking mode, because OpenSSL complains if we move or compact the
write buffer. The only actual requirement is that we not modify pending
data once we've attempted to send it, which we don't. Per testing and
research by Martin Pihlak, though this fix is a lot simpler than his patch.
I put the same change into the backend, although it's less clear whether
it's necessary there. We do use nonblock mode in some situations in
streaming replication, so seems best to keep the same behavior in the
backend as in libpq.
Back-patch to all supported releases.
This fixes SSPI login failures showing "The function
requested is not supported", often showing up when connecting
to localhost. The reason was not properly updating the SSPI
handle when multiple roundtrips were required to complete the
authentication sequence.
Report and analysis by Ahmed Shinwari, patch by Magnus Hagander
First, when following a right-link, we incorrectly marked the current page
as the parent of the right sibling. In reality, the parent of the right page
is the same as the parent of the current page (or some page to the right of
it, gistFindCorrectParent() will sort that out).
Secondly, when we follow a right-link, we must prepend, not append, the right
page to our list of pages to visit. That's because we assume that once we
hit a leaf page in the list, all the rest are leaf pages too, and give up.
To hit these bugs, you need concurrent actions and several unlucky accidents.
Another backend must split the root page, while you're in process of
splitting a lower-level page. Furthermore, while you scan the internal nodes
to re-find the parent, another backend needs to again split some more internal
pages. Even then, the bugs don't necessarily manifest as user-visible errors
or index corruption.
While we're at it, make the error reporting a bit better if gistFindPath()
fails to re-find the parent. It used to be an assertion, but an elog() seems
more appropriate.
Backpatch to all supported branches.
It is possible for a NestLoop plan node to pass an OUTER Var into an
"inner indexscan" that is an Append construct (derived from an inheritance
tree or UNION ALL subquery). The OUTER tuple is then passed down at
runtime to the leaf indexscan node(s) where it will actually be used.
EXPLAIN has to likewise pass the information about the nestloop's outer
subplan down through the Append node, else it will fail to print the
outer-reference Vars (with complaints like "bogus varno: 65001").
However, there was a case missed in all this: we could also have gating
Result nodes that were inserted into the appendrel plan tree to deal with
pseudoconstant qual conditions. So EXPLAIN has to pass down the outer plan
node to a Result's subplan, too. Per example from Jon Nelson.
The problem is gone in 9.1 because we replaced the nestloop outer-tuple
kluge with a Param-based data transfer mechanism. Also, so far as I can
tell, the case can't happen before 8.4 because of restrictions on what
sorts of appendrel members could be pulled up into the parent query.
So this patch is only needed for 8.4 and 9.0.
When recursing after an optimization in pull_up_sublinks_qual_recurse, the
available_rels value passed down must include only the relations that are
in the righthand side of the new SEMI or ANTI join; it's incorrect to pull
up a sub-select that refers to other relations, as seen in the added test
case. Per report from BangarRaju Vadapalli.
While at it, rethink the idea of recursing below a NOT EXISTS. That is
essentially the same situation as pulling up ANY/EXISTS sub-selects that
are in the ON clause of an outer join, and it has the same disadvantage:
we'd force the two joins to be evaluated according to the syntactic nesting
order, because the lower join will most likely not be able to commute with
the ANTI join. That could result in having to form a rather large join
product, whereas the handling of a correlated subselect is not quite that
dumb. So until we can handle those cases better, #ifdef NOT_USED that
case. (I think it's okay to pull up in the EXISTS/ANY cases, because SEMI
joins aren't so inflexible about ordering.)
Back-patch to 8.4, same as for previous patch in this area. Fortunately
that patch hadn't made it into any shipped releases yet.
I mis-simplified the test where ANALYZE decided if it could get away
without doing anything: under the new regime, that's never allowed. Per
bug #6068 from Jeff Janes. Back-patch to 8.4, just like previous patch.
The previous code went into an infinite loop after overflow. In fact,
an overflow is not really an error; it just means that the current
value is the last one we need to return. So, just arrange to stop
immediately when overflow is detected.
Back-patch all the way.
This oversight could result in a tuplestore using much more than the
intended amount of memory. It would only happen in a code path that loaded
a tuplestore via tuplestore_putvalues(), and many of those won't emit huge
amounts of data; but cases such as holdable cursors and plpgsql's RETURN
NEXT command could have the problem. The fix ensures that the tuplestore
will switch to write-to-disk mode when it overruns work_mem.
The potential overrun was finite, because we would still count the space
used by the tuple pointer array, so the tuplestore code would eventually
flip into write-to-disk mode anyway. When storing wide tuples we would
go far past the expected work_mem usage before that happened; but this
may account for the lack of prior reports.
Back-patch to 8.4, where tuplestore_putvalues was introduced.
Per bug #6061 from Yann Delorme.