">" should be ">>". This typo results in failure to use all of the bits
of the provided seed.
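For illustration, a minimal sketch of the kind of seeding code involved (names
and constants are illustrative, not the verbatim source):

    /* Splitting a seed into 16-bit chunks of the rand48 state.  Writing
     * "seed > 16" instead of "seed >> 16" evaluates to 0 or 1, so the high
     * bits of the seed are thrown away. */
    static unsigned short seed_state[3];

    void
    example_srand48(long seed)
    {
        seed_state[0] = 0x330e;                        /* conventional fixed low word */
        seed_state[1] = (unsigned short) seed;         /* low 16 bits of the seed */
        seed_state[2] = (unsigned short) (seed >> 16); /* high bits; the ">>" here is the fix */
    }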
This might rise to the level of a security bug if we were relying on
srand48 for any security-critical purposes, but we are not --- in fact,
it's not used at all unless the platform lacks srandom(), which is
improbable. Even on such a platform the exposure seems minimal.
Reported privately by Andres Freund.
Formerly, set_subquery_pathlist and other creators of plans for subqueries
saved only the rangetable and rowMarks lists from the lower-level
PlannerInfo. But there's no reason not to remember the whole PlannerInfo,
and indeed this turns out to simplify matters in a number of places.
The immediate reason for doing this was so that the subroot will still be
accessible when we're trying to extract column statistics out of an
already-planned subquery. But now that I've done it, it seems like a good
code-beautification effort in its own right.
I also chose to get rid of the transient subrtable and subrowmark fields in
SubqueryScan nodes, in favor of having setrefs.c look up the subquery's
RelOptInfo. That required changing all the APIs in setrefs.c to pass
PlannerInfo not PlannerGlobal, which was a large but quite mechanical
transformation.
One side-effect not foreseen at the beginning is that this finally broke
inheritance_planner's assumption that replanning the same subquery RTE N
times would necessarily give interchangeable results each time. That
assumption was always pretty risky, but now we really have to make a
separate RTE for each instance so that there's a place to carry the
separate subroots.
In the past, relhassubclass always remained true if a relation had ever had
child relations, even if the last subclass was long gone. While this had
only marginal performance implications in most cases, it was annoying, and
I'm now considering some planner changes that would raise the cost of a
false positive. It was previously impractical to fix this because of race
condition concerns. However, given the recent change that made tablecmds.c
take ShareUpdateExclusiveLock on relations that are gaining a child (commit
fbcf4b92aa), we can now allow ANALYZE to
clear the flag when it's no longer relevant. There is no additional
locking cost to do so, since ANALYZE takes ShareUpdateExclusiveLock anyway.
1. Use new dropdb --if-exists option, to avoid alarming the user if
the database being dropped doesn't already exist.
2. Bail out if createdb fails.
3. exit 1 if the checks fail.
4. Make it executable.
Josh Kupershmidt, with some kibitzing by me.
on Windows. ecpglib doesn't link with libpgport, but picks and compiles
the .c files it needs individually. To cope with that, move the setlocale()
wrapper from chklocale.c to a separate setlocale.c file, and include that
in ecpglib.
dots. I previously worked around this in initdb, mapping the known
problematic locale names to aliases that work, but Hiroshi Inoue pointed
out that that's not enough because even if you use one of the aliases, like
"Chinese_HKG", setlocale(LC_CTYPE, NULL) returns back the long form, ie.
"Chinese_Hong Kong S.A.R.". When we try to restore an old locale value by
passing that value back to setlocale(), it fails. Note that you are affected
by this bug also if you use one of those short-form names manually, so just
reverting the hack in initdb won't fix it.
To work around that, move the locale name mapping from initdb to a wrapper
around setlocale(), so that the mapping is invoked on every setlocale() call.
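A minimal sketch of what such a wrapper can look like (the mapping table and
function name are illustrative; the real wrapper handles more names and
variants):

    #include <locale.h>
    #include <string.h>

    struct locale_alias
    {
        const char *longform;   /* name setlocale() reports but won't accept */
        const char *alias;      /* equivalent name that setlocale() accepts */
    };

    static const struct locale_alias alias_map[] = {
        {"Chinese_Hong Kong S.A.R.", "Chinese_HKG"},
        {NULL, NULL}
    };

    char *
    wrapped_setlocale(int category, const char *locale)
    {
        int i;

        if (locale != NULL)
        {
            for (i = 0; alias_map[i].longform != NULL; i++)
            {
                if (strcmp(locale, alias_map[i].longform) == 0)
                {
                    locale = alias_map[i].alias;   /* substitute the accepted alias */
                    break;
                }
            }
        }
        return setlocale(category, locale);
    }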
Also, add a few checks for failed setlocale() calls in the backend. These
calls shouldn't fail, and if they do there isn't much we can do about it,
but at least you'll get a warning.
Backpatch to 9.1, where the initdb hack was introduced. The Windows bug
affects older versions too if you set locale manually to one of the aliases,
but given the lack of complaints from the field, I'm hesitant to backpatch.
ifdef block. It has nothing to do with whether the replacement snprintf
function is used. It caused no live bug, because the replacement snprintf
function is always used on Win32, but it was nevertheless misplaced.
Examination of examples provided by Mark Kirkwood and others has convinced
me that actually commit 7f3eba30c9 was quite
a few bricks shy of a load. The useful part of that patch was clamping
ndistinct for the inner side of a semi or anti join, and the reason why
that's needed is that it's the only way that restriction clauses
eliminating rows from the inner relation can affect the estimated size of
the join result. I had not clearly understood why the clamping was
appropriate, and so mis-extrapolated to conclude that we should clamp
ndistinct for the outer side too, as well as for both sides of regular
joins. These latter actions were all wrong, and are reverted with this
patch. In addition, the clamping logic is now made to affect the behavior
of both paths in eqjoinsel_semi, with or without MCV lists to compare.
When we have MCVs, we suppose that the most common values are the ones
that are most likely to survive the decimation resulting from a lower
restriction clause, so we think of the clamping as eliminating non-MCV
values, or potentially even the least-common MCVs for the inner relation.
Back-patch to 8.4, same as previous fixes in this area.
This patch fixes an oversight in my commit
7f3eba30c9 of 2008-10-23. That patch
accounted for baserel restriction clauses that reduced the number of rows
coming out of a table (and hence the number of possibly-distinct values of
a join variable), but not for join restriction clauses that might have been
applied at a lower level of join. To account for the latter, look up the
sizes of the min_lefthand and min_righthand inputs of the current join,
and clamp with those in the same way as for the base relations.
Noted while investigating a complaint from Ben Chobot, although this in
itself doesn't seem to explain his report.
Back-patch to 8.4; previous versions used different estimation methods
for which this heuristic isn't relevant.
It is possible for VACUUM to scan no pages at all, if the visibility map
shows that all pages are all-visible. In this situation VACUUM has no new
information to report about the relation's tuple density, so it wasn't
changing pg_class.reltuples ... but it updated pg_class.relpages anyway.
That's wrong in general, since there is no evidence to justify changing the
density ratio reltuples/relpages, but it's particularly bad if the previous
state was relpages=reltuples=0, which means "unknown tuple density".
We just replaced "unknown" with "zero". ANALYZE would eventually recover
from this, but it could take a lot of repetitions of ANALYZE to do so if
the relation size is much larger than the maximum number of pages ANALYZE
will scan, because of the moving-average behavior introduced by commit
b4b6923e03.
The only known situation where we could have relpages=reltuples=0 and yet
the visibility map asserts everything's visible is immediately following
a pg_upgrade. It might be advisable for pg_upgrade to try to preserve the
relpages/reltuples statistics; but in any case this code is wrong on its
own terms, so fix it. Per report from Sergey Koposov.
Back-patch to 8.4, where the visibility map was introduced, same as the
previous change.
Previously, 'yesterday 04:00:00'::timestamp didn't do the same thing as
'04:00:00 yesterday'::timestamp, and the return value from the latter
was midnight rather than the specified time.
Dean Rasheed, with some stylistic changes
Per my testing, this works just as well with gcc as it does with HP's
compiler; and there is no reason to think that the effect doesn't occur
with icc, either.
Also, rewrite the header comment about enforcing sequencing around spinlock
operations, per Robert's gripe that it was misleading.
At least on this architecture, it's very important to spin on a
non-atomic instruction and only retry the atomic once it appears
that it will succeed. To fix this, split TAS() into two macros:
TAS(), for trying to grab the lock the first time, and TAS_SPIN(),
for spinning until we get it. TAS_SPIN() defaults to same as TAS(),
but we can override it when we know there's a better way.
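A rough sketch of the two-macro arrangement (the stand-in atomic here is a GCC
builtin; the real s_lock.h uses per-architecture assembly, and TAS_SPIN()
defaults to plain TAS() unless overridden):

    typedef volatile int slock_t;

    /* Atomic test-and-set: returns nonzero if the lock was already held. */
    #define TAS(lock)       __sync_lock_test_and_set((lock), 1)

    /* Override used inside the spin loop where it helps: peek with an ordinary
     * load first, and only retry the atomic instruction once the lock appears
     * to be free. */
    #define TAS_SPIN(lock)  (*(lock) ? 1 : TAS(lock))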
It's likely that some of the other cases in s_lock.h require
similar treatment, but this is the only one we've got conclusive
evidence for at present.
On closer inspection, whining in restore_toc_entries_parallel is really
much too late for any user-facing error case. The right place to do it
is at the start of RestoreArchive(), before we've done anything interesting
(such as trying to DROP all the targets ...)
Back-patch to 8.4, where parallel restore was introduced.
If we are unable to do a parallel restore because the input file is stdin
or is otherwise unseekable, we should complain and fail immediately, not
after having done some of the restore. Complaining once per thread isn't
so cool either, and the messages should be worded to make it clear this is
an unsupported case not some weird race-condition bug. Per complaint from
Lonni Friedman.
Back-patch to 8.4, where parallel restore was introduced.
their own.
Avoid compile problems with defines being redefined after the removal of
the #if blocks.
Change script to use shell functions for simplicity.
These days, such a response is far more likely to signify a server-side
problem, such as fork failure. Reporting "server does not support SSL"
(in sslmode=require) could be quite misleading. But the results could
be even worse in sslmode=prefer: if the problem was transient and the
next connection attempt succeeds, we'll have silently fallen back to
protocol version 2.0, possibly disabling features the user needs.
Hence, it seems best to just eliminate the assumption that backing off
to non-SSL/2.0 protocol is the way to recover from an "E" response, and
instead treat the server error the same as we would in non-SSL cases.
I tested this change against a pre-7.0 server, and found that there
was a second logic bug in the "prefer" path: the test to decide whether
to make a fallback connection attempt assumed that we must have opened
conn->ssl, which in fact does not happen given an "E" response. After
fixing that, the code does indeed connect successfully to pre-7.0,
as long as you didn't set sslmode=require. (If you did, you get
"Unsupported frontend protocol", which isn't completely off base
given the server certainly doesn't support SSL.)
Since there seems no reason to believe that pre-7.0 servers exist anymore
in the wild, back-patch to all supported branches.
There are assorted situations wherein PQconnectPoll() will abandon a
connection attempt and try again with different parameters (eg, SSL versus
not SSL). However, the code forgot to discard any pending data in libpq's
I/O buffers when doing this. In at least one case (server returns E
message during SSL negotiation), there is unread input data which bollixes
the next connection attempt. I have not checked to see whether this is
possible in the other cases where we close the socket and retry, but it
seems like a matter of good defensive programming to add explicit
buffer-flushing code to all of them.
This is one of several issues exposed by Daniel Farina's report of
misbehavior after a server-side fork failure.
This has been wrong since forever, so back-patch to all supported branches.
tsvector_concat() allocated its result workspace using the "conservative"
estimate of the sum of the two input tsvectors' sizes. Unfortunately that
wasn't so conservative as all that, because it supposed that the number of
pad bytes required could not grow. Which it can, as per test case from
Jesper Krogh, if there's a mix of lexemes with positions and lexemes
without them in the input data. The fix is to assume that we might add
a not-previously-present pad byte for each and every lexeme in the two
inputs; which really is conservative, but it doesn't seem worthwhile to
try to be more precise.
This is an aboriginal bug in tsvector_concat, so back-patch to all
versions containing it.
These changes allow backtick command evaluation and psql variable
interpolation to happen on substrings of a single meta-command argument.
Formerly, no such evaluations happened at all if the backtick or colon
wasn't the first character of the argument, and we considered an argument
completed as soon as we'd processed one backtick, variable reference, or
quoted substring. A string like 'FOO'BAR was thus taken as two arguments
rather than one, which is not what one would expect. In the new coding, an argument
is considered terminated only by unquoted whitespace or backslash.
Also, clean up a bunch of omissions, infelicities and outright errors in
the psql documentation of variables and metacommand argument syntax.
This was deemed unnecessary initially but in later discussion it was
agreed otherwise.
Original file from Kevin Grittner, allegedly from Dan Ports.
I had to clean up whitespace a bit per changes from Heikki.
Per previous experimentation, backtracking slows down lexing performance
significantly (by about a third). It's usually pretty easy to avoid; you just
need rules that accept an incomplete construct and do whatever the
lexer would have done otherwise.
The backtracking was introduced by the patch that added quoted variable
substitution. Back-patch to 9.0 where that was added.
Rather than dumping out the raw array as PostgreSQL represents it
internally, we now print it out in a format similar to the one in
which the user input it, which seems a lot more user friendly.
Shigeru Hanada
The previous coding resulted in contrib modules unintentionally overriding
the use of CONTRIB_TESTDB. There seems no particularly good reason to
allow that (after all, the makefile can set CONTRIB_TESTDB if that's really
what it intends).
In passing, document REGRESS_OPTS where the other pgxs.mk options are
documented.
Back-patch to 9.1 --- in prior versions, there were no cases of contrib
modules setting REGRESS_OPTS without including the --dbname switch, so
while the coding was fragile there was no actual bug.
When we implemented extensions, we made findDependentObjects() treat
EXTENSION dependency links similarly to INTERNAL links. However, that
logic contained an implicit assumption that an object could have at most
one INTERNAL dependency, so it did not work correctly for objects having
both INTERNAL and DEPENDENCY links. This led to failure to drop some
extension member objects when dropping the extension. Furthermore, we'd
never actually exercised the case of recursing to an internally-referenced
(owning) object from anything other than a NORMAL dependency, and it turns
out that passing the incoming dependency's flags to the owning object is
the Wrong Thing. This led to sometimes dropping a whole extension silently
when we should have rejected the drop command for lack of CASCADE.
Since we obviously were under-testing extension drop scenarios, add some
regression test cases. Unfortunately, such test cases require some
extensions (duh), so we can't test for problems in the core regression
tests. I chose to add them to the earthdistance contrib module, which is
a good test case because it has a dependency on the cube contrib module.
Back-patch to 9.1. Arguably these are pre-existing bugs in INTERNAL
dependency handling, but since it appears that the cases can never arise
pre-9.1, I'll refrain from back-patching the logic changes further than
that.
When creating a new schema for a non-relocatable extension, we neglected
to check whether the calling user has permission to create schemas.
That didn't matter in the original coding, since we had already checked
superuserness, but in the new dispensation where users need not be
superusers, we should check it. Use CreateSchemaCommand() rather than
calling NamespaceCreate() directly, so that we also enforce the rules
about reserved schema names.
Per complaint from KaiGai Kohei, though this isn't the same as his patch.
set_append_rel_pathlist supposed that, while computing per-column width
estimates for the appendrel, it could ignore child rels for which the
translated reltargetlist entry wasn't a Var. This gave rise to completely
silly estimates in some common cases, such as constant outputs from some or
all of the arms of a UNION ALL. Instead, fall back on get_typavgwidth to
estimate from the value's datatype; which might be a poor estimate but at
least it's not completely wacko.
That problem was exposed by an Assert in set_subquery_size_estimates, which
unfortunately was still overoptimistic even with that fix, since we don't
compute attr_widths estimates for appendrels that are entirely excluded by
constraints. So remove the Assert; we'll just fall back on get_typavgwidth
in such cases.
Also, since set_subquery_size_estimates calls set_baserel_size_estimates
which calls set_rel_width, there's no need for set_subquery_size_estimates
to call get_typavgwidth; set_rel_width will handle it for us if we just
leave the estimate set to zero. Remove the unnecessary code.
Per report from Erik Rijkers and subsequent investigation.
The previous coding would result in deleting and not re-creating the
extension membership pg_depend rows, since there was no
CommandCounterIncrement that would allow recordDependencyOnCurrentExtension
to see that the deletion had happened. Make it work like the shell type
case, ie, keep the existing entries (and then throw an error if they're for
the wrong extension).
Per bug #6172 from Hitoshi Harada. Investigation and fix by Dimitri
Fontaine.
Due to tuple-slot mismanagement, evaluation of WHEN conditions for AFTER
ROW UPDATE triggers could crash if there had been a BEFORE ROW trigger
fired for the same update. Fix by not trying to overload the use of
estate->es_trig_tuple_slot. Per report from Yoran Heling.
Back-patch to 9.0, when trigger WHEN conditions were introduced.
As pointed out by Sergey Koposov, repeated invocations of tbm_lossify can
make building a large tidbitmap into an O(N^2) operation. To fix, make
sure we remove more than the minimum amount of information per call, and
add a fallback path to behave sanely if we're unable to fit the bitmap
within the requested amount of memory.
This has been wrong since the tidbitmap code was written, so back-patch
to all supported branches.
Now that we have a test that requires nondefault settings to pass, it seems
like we'd better mention that detail in the directions about how to run the
tests.
Also do some very minor copy-editing.
order of begin, prepare, and commit of three concurrent transactions that
have conflicts between them.
The test runs for a quite long time, and the expected output file is huge,
but this test caught some serious bugs during development, so seems
worthwhile to keep. The test uses prepared transactions, so it fails if the
server has max_prepared_transactions=0. Because of that, it's marked as
"ignore" in the schedule file.
Dan Ports
Perhaps we ought to add some other kind of documentation here instead,
but for now let's get rid of this woefully obsolete description of the
sinval machinery.
Module initialization functions in Python 3 must have external
linkage, because PyMODINIT_FUNC does dllexport on Windows-like
platforms. Without this change, the build with Python 3 fails on
Windows.
Dropped columns within a composite type were not handled correctly.
Also, we did not check for whether a composite result type had changed
since we cached the information about it.
Jan Urbański, per a bug report from Jean-Baptiste Quenot
This requires adjusting the API for syscache callback functions: they now
get a hash value, not a TID, to identify the target tuple. Most of them
weren't paying any attention to that argument anyway, but plancache did
require a small amount of fixing.
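For reference, a hedged sketch of the API change (see inval.h for the
authoritative declaration; Datum and uint32 are the usual PostgreSQL typedefs):

    /* before:  void callback(Datum arg, int cacheid, ItemPointer tuplePtr);
     * after: */
    typedef void (*SyscacheCallbackFunction) (Datum arg, int cacheid, uint32 hashvalue);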
Also, improve performance a trifle by avoiding sending duplicate inval
messages when a heap_update isn't changing the catcache lookup columns.
The TID isn't stable enough: we might queue an sinval event before a VACUUM
FULL, and then process it afterwards, when the target tuple no longer has
the same TID. So we must invalidate entries on the basis of hash value
only. The old coding can be shown to result in various bizarre,
hard-to-reproduce errors in the presence of concurrent VACUUM FULLs on
system catalogs, and could easily result in permanent catalog corruption,
up to and including complete loss of tables.
This commit is just a minimal fix that removes the unsafe comparison.
We should remove transmission of the tuple TID from sinval messages
altogether, and then arrange to suppress the extra message in the common
case of a heap_update that doesn't change the key hashvalue. But that's
going to be much more invasive, and will only produce a probably-marginal
performance gain, so it doesn't seem like material for a back-patch.
Back-patch to 9.0. Before that, VACUUM FULL refused to do any tuple moving
if it found any INSERT_IN_PROGRESS or DELETE_IN_PROGRESS tuples (and
CLUSTER would give up altogether), so there was no risk of moving a tuple
that might be the subject of an unsent sinval message.
We have to be sure that we have revalidated each nailed-in-cache relcache
entry before we try to use it to load data for some other relcache entry.
The introduction of "mapped relations" in 9.0 broke this, because although
we updated the state kept in relmapper.c early enough, we failed to
propagate that information into relcache entries soon enough; in
particular, we could try to fetch pg_class rows out of pg_class before
we'd updated its relcache entry's rd_node.relNode value from the map.
This bug accounts for Dave Gould's report of failures after "vacuum full
pg_class", and I believe that there is risk for other system catalogs
as well.
The core part of the fix is to copy relmapper data into the relcache
entries during "phase 1" in RelationCacheInvalidate(), before they'll be
used in "phase 2". To try to future-proof the code against other similar
bugs, I also rearranged the order in which nailed relations are visited
during phase 2: now it's pg_class first, then pg_class_oid_index, then
other nailed relations. This should ensure that RelationClearRelation can
apply RelationReloadIndexInfo to all nailed indexes without risking use
of not-yet-revalidated relcache entries.
Back-patch to 9.0 where the relation mapper was introduced.
This works around the problem that a catalog cache entry might contain a
toast pointer that we try to dereference just as a VACUUM FULL completes
on that catalog. We will see the sinval message on the cache entry when
we acquire lock on the toast table, but by that point we've already told
tuptoaster.c "here's the pointer to fetch", so it's difficult from a code
structural standpoint to update the pointer before we use it. Much less
painful to ensure that toast pointers are not invalidated in the first
place. We have to add a bit of code to deal with the case that a value
that previously wasn't toasted becomes so; but that should be a
seldom-exercised corner case, so the inefficiency shouldn't be significant.
Back-patch to 9.0. In prior versions, we didn't allow CLUSTER on system
catalogs, and VACUUM FULL didn't result in reassignment of toast OIDs, so
there was no problem.
The previous code tried to synchronize by unlinking the init file twice,
but that doesn't actually work: it leaves a window wherein a third process
could read the already-stale init file but miss the SI messages that would
tell it the data is stale. The result would be bizarre failures in catalog
accesses, typically "could not read block 0 in file ..." later during
startup.
Instead, hold RelCacheInitLock across both the unlink and the sending of
the SI messages. This is more straightforward, and might even be a bit
faster since only one unlink call is needed.
This has been wrong since it was put in (in 2002!), so back-patch to all
supported releases.
When streaming a backup that includes WAL, the size estimate will always be incorrect,
since we don't know how much WAL is included. To make sure the output doesn't
look completely unreasonable, this patch increases the total size whenever we
go past the estimate, to make sure we never go above 100%.
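A minimal sketch of the clamping arithmetic (variable names illustrative):

    #include <stdint.h>

    /* If the bytes actually received exceed the original size estimate, grow
     * the estimate so the reported percentage never goes above 100%. */
    static int
    progress_percent(int64_t done, int64_t *estimated_total)
    {
        if (done > *estimated_total)
            *estimated_total = done;
        if (*estimated_total <= 0)
            return 0;
        return (int) ((done * 100) / *estimated_total);
    }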
This makes it clearer that the error message is perhaps not supposed
to be understood by users, and it also makes it somewhat clearer that
it was not accidentally omitted from translation.
Idea from Heikki Linnakangas, except that we don't mark "Reason code"
for translation at this point, because that would make the
implementation too cumbersome.
When updating or deleting a system catalog tuple, it's necessary to acquire
RowExclusiveLock on the catalog before looking up the tuple; otherwise a
concurrent VACUUM FULL on the catalog might move the tuple to a different
TID before we can apply the update. Coding patterns that find the tuple
via a table scan aren't at risk here, but when obtaining the tuple from a
catalog cache, correct ordering is important; and several routines in
foreigncmds.c got it wrong. Noted while running the regression tests in
parallel with VACUUM FULL of assorted system catalogs.
For consistency I moved all the heap_open calls to the starts of their
functions, including a couple for which there was no actual bug.
Back-patch to 8.4 where foreigncmds.c was added.
The statement start timestamp was not set before initiating the transaction
that is used to look up client authentication information in pg_authid.
In consequence, enable_sig_alarm computed a wrong value (far in the past)
for statement_fin_time. That didn't have any immediate effect, because the
timeout alarm was set without reference to statement_fin_time; but if we
subsequently blocked on a lock for a short time, CheckStatementTimeout
would consult the bogus value when we cancelled the lock timeout wait,
and then conclude we'd timed out, leading to immediate failure of the
connection attempt. Thus an innocent "vacuum full pg_authid" would cause
failures of concurrent connection attempts. Noted while testing other,
more serious consequences of vacuum full on system catalogs.
We should set the statement timestamp before StartTransactionCommand(),
so that the transaction start timestamp is also valid. I'm not sure if
there are any non-cosmetic effects of it not being valid, but the xact
timestamp is at least sent to the statistics machinery.
Back-patch to 9.0. Before that, the client authentication timeout was done
outside any transaction and did not depend on this state to be valid.
poll() is preferred over select() on platforms where both are available,
because it tends to be a bit faster and it doesn't have an arbitrary limit
on the range of FD numbers that can be accessed. The FD range limit does
not appear to be a risk factor for any 9.1 usages, so this doesn't need to
be back-patched, but we need to have it in place if we keep on expanding
the uses of WaitLatch.
Instead of displaying comments on an arbitrary subset of the object
types which support them, make \dd display comments on exactly those
object types which don't have their own backlash commands. We now
regard the display of comments as properly the job of the relevant
backslash command (though many of them do so only in verbose mode)
rather than something that \dd should be responsible for. However,
a handful of object types have no backslash command, so make \dd
give information about those.
Josh Kupershmidt
The latch infrastructure is now capable of detecting all cases where the
walsender loop needs to wake up, so there is no reason to have an arbitrary
timeout.
Also, modify the walsender loop logic to follow the standard pattern of
ResetLatch, test for work to do, WaitLatch. The previous coding was both
hard to follow and buggy: it would sometimes busy-loop despite having
nothing available to do, eg between receipt of a signal and the next time
it was caught up with new WAL, and it also had interesting choices like
deciding to update to WALSNDSTATE_STREAMING on the strength of information
known to be obsolete.
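The standard pattern referred to above, in sketch form (the helper names are
stand-ins for the real ResetLatch()/WaitLatch() calls and work checks, which
are more involved in walsender.c):

    #include <stdbool.h>

    static volatile bool shutdown_requested = false;

    static void reset_my_latch(void)   { /* clear any pending wakeup */ }
    static void wait_on_my_latch(void) { /* block until the latch is set */ }
    static bool have_wal_to_send(void) { return false; }
    static void send_some_wal(void)    { }

    static void
    sender_main_loop(void)
    {
        for (;;)
        {
            /* 1. reset the latch BEFORE checking for work, so a wakeup that
             *    arrives during the check is not lost */
            reset_my_latch();

            /* 2. test for work */
            if (shutdown_requested)
                break;
            if (have_wal_to_send())
            {
                send_some_wal();
                continue;       /* more work may be pending; don't sleep yet */
            }

            /* 3. nothing to do: sleep until someone sets the latch */
            wait_on_my_latch();
        }
    }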
This is in hopes of learning more about what causes "pgstat wait timeout"
warnings in the buildfarm. This patch should probably be reverted once
we've learned what we can. As coded, it will result in regression test
"failures" at half the delay that the existing code does, so I expect
to see a few more than before.
In pursuit of this (and with the expectation that WaitLatch will be needed
in more places), convert the latch field that was already added to PGPROC
for sync rep into a generic latch that is activated for all PGPROC-owning
processes, and change many of the standard backend signal handlers to set
that latch when a signal happens. This will allow WaitLatch callers to be
wakened properly by these signals.
In passing, fix a whole bunch of signal handlers that had been hacked to do
things that might change errno, without adding the necessary save/restore
logic for errno. Also make some minor fixes in unix_latch.c, and clean
up the bizarre and unsafe scheme for disowning the process's latch. Much of
this has to be back-patched into 9.1.
Peter Geoghegan, with additional work by Tom
streamed backup, throw an error and refuse to start up. The restore has not
finished correctly in that case and the data directory is possibly corrupt.
We already errored out in case of archive recovery, but could not during
crash recovery because we couldn't distinguish between the case that
pg_start_backup() was called and the database then crashed (must not error,
data is OK), and the case that we're restoring from a backup and not all
the needed WAL was replayed (data can be corrupt).
To distinguish those cases, add a line to backup_label to indicate
whether the backup was taken with pg_start/stop_backup(), or by streaming
(ie. pg_basebackup).
This requires re-initdb, because of a new field added to the control file.
The original definition had the problem that timeouts exceeding about 2100
seconds couldn't be specified on 32-bit machines (the timeout was given in
microseconds, and 2^31 microseconds is only about 2147 seconds). Milliseconds seem like
sufficient resolution, and finer grain than that would be fantasy anyway
on many platforms.
Back-patch to 9.1 so that this aspect of the latch API won't change between
9.1 and later releases.
Peter Geoghegan
Improve the documentation around weak-memory-ordering risks, and do a pass
of general editorialization on the comments in the latch code. Make the
Windows latch code more like the Unix latch code where feasible; in
particular provide the same Assert checks in both implementations.
Fix poorly-placed WaitLatch call in syncrep.c.
This patch resolves, for the moment, concerns around weak-memory-ordering
bugs in latch-related code: we have documented the restrictions and checked
that existing calls meet them. In 9.2 I hope that we will install suitable
memory barrier instructions in SetLatch/ResetLatch, so that their callers
don't need to be quite so careful.
Such a construction is useless since the lower PlaceHolderVar is already
nullable; no need to make it more so. Noted while pursuing bug #6154.
This is just a minor planner efficiency improvement, since the final plan
will come out the same anyway after PHVs are flattened. So not worth the
risk of back-patching.
Don't try to allocate the default value for a string relopt in the same
palloc chunk as the relopt_string struct. That didn't work too well if you
added a built-in string relopt in the stringRelOpts array, as it's not
possible to have an initializer for a variable length struct in C. This
makes the code slightly simpler too.
While we're at it, move the call to validator function in
add_string_reloption to before the allocation, so that if someone does pass
a bogus default value, we don't leak memory.
A PlaceHolderVar's expression might contain another, lower-level
PlaceHolderVar. If the outer PlaceHolderVar is used, the inner one
certainly will be also, and so we have to make sure that both of them get
into the placeholder_list with correct ph_may_need values during the
initial pre-scan of the query (before deconstruct_jointree starts).
We did this correctly for PlaceHolderVars appearing in the query quals,
but overlooked the issue for those appearing in the top-level targetlist;
with the result that nested placeholders referenced only in the targetlist
did not work correctly, as illustrated in bug #6154.
While at it, add some error checking to find_placeholder_info to ensure
that we don't try to create new placeholders after it's too late to do so;
they have to all be created before deconstruct_jointree starts.
Back-patch to 8.4 where the PlaceHolderVar mechanism was introduced.
The relevant backslash commands already exist, so we're just adding an
additional column. With this commit, all objects that have psql backslash
commands and accept comments should now display those comments at least
in verbose mode.
Josh Kupershmidt, with doc additions by me.
\dc and \dD now accept a "+" option, which will cause the comments to
be displayed. Along the way, correct a few oversights in the previous
commit in this area, 3b17efdfdd - namely,
(1) when \dL+ is used, make description still be the last column, for
consistency with what we've done elsewhere; and (2) document the
difference between \dC and \dC+.
Josh Kupershmidt, with a couple of doc changes by me.
This lie has been harmless until now, but has been exposed by the
change to include postgres.h before the python headers, which
in some versions include inttypes.h if HAVE_INTTYPES_H is set.
Somebody thought it'd be cute to invent a set of Node tag numbers that were
defined independently of, and indeed conflicting with, the main tag-number
list. While this accidentally failed to fail so far, it would certainly
lead to trouble as soon as anyone wanted to, say, apply copyObject to these
node types. Clang was already complaining about the use of makeNode on
these tags, and I think quite rightly so. Fix by pushing these node
definitions into the mainstream, including putting replnodes.h where it
belongs.
The previous limit of 1024 was set on the assumption that all modern syslog
implementations have line length limits of 2KB or so. However, this is
false, as at least Solaris and sysklogd truncate at only 1KB. 900 seems
to leave enough room for the max likely length of the tacked-on prefixes,
so let's go with that.
As with the previous change, it doesn't seem wise to back-patch this into
already-released branches; but it should be OK to sneak it into 9.1.
Noah Misch
The previous test for status < 0 is in fact testing nothing if the
compiler considers an enum to be an unsigned data type. clang doesn't
like tautologies, so do this instead.
Report by Peter Geoghegan, fix as suggested by Tom Lane.
To avoid having the python headers hijack various definitions,
we now include them after all the system headers we want, having
first undefined some of the things they want to define. After that's
done we restore the things they scribbled on that matter, namely our
snprintf and vsnprintf macros, if we're using them.
Instead of entering them on transaction startup, we materialize them
only when someone wants to wait, which will occur only during CREATE
INDEX CONCURRENTLY. In Hot Standby mode, the startup process must also
be able to probe for conflicting VXID locks, but the lock need never be
fully materialized, because the startup process does not use the normal
lock wait mechanism. Since most VXID locks never need to touch the
lock manager partition locks, this can significantly reduce blocking
contention on read-heavy workloads.
Patch by me. Review by Jeff Davis.
The output of \dL (list languages) is fairly narrow, so we just always
display the comment. \dC (list casts) can get fairly wide, so we only
display comments if the new \dC+ option is specified.
Josh Kupershmidt
glibc renders random() thread-safe by wrapping a futex lock around it;
testing reveals that this limits the performance of pgbench on machines
with many CPU cores. Rather than switching to random_r(), which is
only available on GNU systems and crashes unless you use undocumented
alchemy to initialize the random state properly, switch to our built-in
implementation of erand48(), which is both thread-safe and concurrent.
Since the list of reasons not to use the operating system's erand48()
is getting rather long, rename ours to pg_erand48() (and similarly
for our implementations of lrand48() and srand48()) and just always
use those. We were already doing this on Cygwin anyway, and the
glibc implementation is not quite thread-safe, so pgbench wouldn't
be able to use that either.
Per discussion with Tom Lane.
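A minimal sketch of the per-thread usage this enables (the getrand-style
helper is illustrative; pg_erand48() takes its 48-bit state as an explicit
argument, so each thread keeps its own state and no locking is needed):

    #include <stdint.h>

    extern double pg_erand48(unsigned short xseed[3]);  /* from src/port/erand48.c */

    typedef struct
    {
        unsigned short random_state[3];   /* per-thread PRNG state */
    } ThreadState;

    /* uniformly distributed integer in [min, max] */
    static int64_t
    thread_getrand(ThreadState *thread, int64_t min, int64_t max)
    {
        return min + (int64_t) ((max - min + 1) * pg_erand48(thread->random_state));
    }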
This kluge was inserted in a spot apparently chosen at random: the lock
manager's state is not yet fully set up for the wait, and in particular
LockWaitCancel hasn't been armed by setting lockAwaited, so the ProcLock
will not get cleaned up if the ereport is thrown. This seems to not cause
any observable problem in trivial test cases, because LockReleaseAll will
silently clean up the debris; but I was able to cause failures with tests
involving subtransactions.
Fixes breakage induced by commit c85c941470.
Back-patch to all affected branches.
It was initialized in the wrong place and to the wrong value. With bad
luck this could result in incorrect query-cancellation failures in hot
standby sessions, should a HS backend be holding pin on buffer number 1
while trying to acquire a lock.
Testing shows that the overhead of acquiring and releasing
SInvalReadLock and msgNumLock on high-core count boxes can waste a lot
of CPU time and hurt performance. This patch adds a per-backend flag
that allows us to skip all that locking in most cases. Further
testing shows that this improves performance even when sinval traffic
is very high.
Patch by me. Review and testing by Noah Misch.
pg_backup_db.c contained a mini SQL lexer with which it tried to identify
boundaries between SQL commands, but that code was not designed to cope
with standard_conforming_strings, and would get the wrong answer if a
backslash immediately precedes a closing single quote in such a string,
as per report from Julian Mehnle. The bug only affects direct-to-database
restores from archive files made with standard_conforming_strings = on.
Rather than complicating the code some more to try to fix that, let's just
rip it all out. The only reason it was needed was to cope with COPY data
embedded into ordinary archive entries, which was a layout that was used
only for about the first three weeks of the archive format's existence,
and never in any production release of pg_dump. Instead, just rely on the
archive file layout to tell us whether we're printing COPY data or not.
This bug represents a data corruption hazard in all releases in which
standard_conforming_strings can be turned on, ie 8.2 and later, so
back-patch to all supported branches.
It turns out to be possible to link against a libxml2.so that does this
differently than the version we configured and built against, so we need
a runtime check to avoid bizarre behavior. Per report from Bernd Helmle.
Patch by Florian Pflug.
On balance, the need to cover this case changes my mind in favor of pushing
all error-message generation duties into the two fe-secure.c routines.
So do it that way.
In many cases, pqsecure_read/pqsecure_write set up useful error messages,
which were then overwritten with useless ones by their callers. Fix this
by defining the responsibility to set an error message to be entirely that
of the lower-level function when using SSL.
Back-patch to 8.3; the code is too different in 8.2 to be worth the
trouble.
This disables an entirely unnecessary "sanity check" that causes failures
in nonblocking mode, because OpenSSL complains if we move or compact the
write buffer. The only actual requirement is that we not modify pending
data once we've attempted to send it, which we don't. Per testing and
research by Martin Pihlak, though this fix is a lot simpler than his patch.
I put the same change into the backend, although it's less clear whether
it's necessary there. We do use nonblock mode in some situations in
streaming replication, so seems best to keep the same behavior in the
backend as in libpq.
Back-patch to all supported releases.
Also change "switch" to "arg" because "switch" is a bit of a sloppy
term. So the environment variable is called
PSQL_EDITOR_LINENUMBER_ARG. Set "+" as hardcoded default value on
Unix (since "vi" is the hardcoded default editor), so many users won't
have to configure this at all. Move the documentation around a bit to
centralize the editor configuration under environment variables,
rather than repeating bits of it under every backslash command that
invokes an editor.
The original implementation simply did nothing when replacing an existing
object during CREATE EXTENSION. The folly of this was exposed by a report
from Marc Munro: if the existing object belongs to another extension, we
are left in an inconsistent state. We should insist that the object does
not belong to another extension, and then add it to the current extension
if not already a member.
I broke this in commit 5da79169d3, which
was obviously insufficiently well tested. Add some regression tests
in the hope of making future slip-ups more likely to be noticed.
PQsetvalue unnecessarily duplicated the logic in pqAddTuple, and didn't
duplicate it exactly either --- pqAddTuple does not care what is in the
tuple-pointer array positions beyond the last valid entry, whereas the
code in PQsetvalue assumed such positions would contain NULL. This led
to possible crashes if PQsetvalue was applied to a PGresult that had
previously been enlarged with pqAddTuple, for instance one built from a
server query. Fix by relying on pqAddTuple instead of duplicating logic,
and not assuming anything about the contents of res->tuples[res->ntups].
Back-patch to 8.4, where PQsetvalue was introduced.
Andrew Chernow
Previously, xpath() simply returned an empty array if the expression did
not yield a node set. This is useless for expressions that return scalars,
such as one with name() at the top level. Arrange to return the scalar
value as a single-element xml array, instead. (String values will be
suitably escaped.)
This change will also cause xpath_exists() to return true, not false,
for such expressions.
Florian Pflug, reviewed by Radoslaw Smogura
Without this it's possible for the output to not be legal XML, as
illustrated by the added regression test cases.
NB: this change will need to be called out as an incompatibility in the
9.2 release notes, since it's possible somebody was relying on the old
behavior, even though it's clearly wrong.
Florian Pflug, reviewed by Radoslaw Smogura
This requires a new shared catalog, pg_shseclabel.
Along the way, fix the security_label regression tests so that they
don't monkey with the labels of any pre-existing objects. This is
unlikely to matter in practice, since only the label for the "dummy"
provider was being manipulated. But this way still seems cleaner.
KaiGai Kohei, with fairly extensive hacking by me.
libxml reports some errors (like invalid xmlns attributes) via the error
handler hook, but still returns a success indicator to the library caller.
This causes us to miss some errors that are important to report. Since the
"generic" error handler hook doesn't know whether the message it's getting
is for an error, warning, or notice, stop using that and instead start
using the "structured" error handler hook, which gets enough information
to be useful.
While at it, arrange to save and restore the error handler hook setting in
each libxml-using function, rather than assuming we can set and forget the
hook. This should improve the odds of working nicely with third-party
libraries that also use libxml.
In passing, volatile-ize some local variables that get modified within
PG_TRY blocks. I noticed this while testing with an older gcc version
than I'd previously tried to compile xml.c with.
Florian Pflug and Tom Lane, with extensive review/testing by Noah Misch
Noah Misch diagnosed the buildfarm problems in the isolation tests
partly as failure to differentiate backends properly; the old code was
using backend IDs, which is not good enough because a new backend might
use an already used ID. Use PIDs instead.
Also, the code was purposely careless about other concurrent activity,
because it isn't expected; and in fact, it doesn't affect the vast
majority of the time. However, it can be observed that autovacuum can
block tables for long enough to cause sporadic failures. The new code
accounts for that by ignoring locks held by processes not explicitly
declared in our spec file.
Author: Noah Misch
The previous value of 20ms is dangerously close to the time actually
spent just waiting for the deadlock to happen, so on occasion it causes
the test to fail simply because the other session didn't get to run
early enough, not managing to cause the deadlock that needs to be
detected. With this new value, it's expected that most machines on
normal load will be able to pass the test.
Author: Noah Misch
These new files allow the new FK tests on isolationtester to pass on the
serializable and repeatable read isolation levels (which are untested
by the buildfarm).
Author: Kevin Grittner
Reviewed by Noah Misch
Subtransaction locks are now released en masse at main commit, rather than by
repeatedly re-scanning for locks as we ascend the nested transaction tree.
Split transaction state TBLOCK_SUBEND into two states, TBLOCK_SUBCOMMIT
and TBLOCK_SUBRELEASE to allow the commit path to be optimised using
the existing code in ResourceOwnerRelease() which appears to have been
intended for this usage, judging from comments therein.
1. In GetLockStatusData, avoid initializing instance before we've ensured
that the array is large enough. Otherwise, if repalloc moves the block
around, we're hosed.
2. Add the word "Relation" to the name of some identifiers, to avoid
assuming that the fast-path mechanism will only ever apply to relations
(though these particular parts certainly will). Some of the macros
could possibly use similar treatment, but the names are getting awfully
long already.
3. Add a missing word to comment in AtPrepare_Locks().
Standby servers can now have WALSender processes, which can work with
either WALReceiver or archive_commands to pass data. Fully updated
docs, including new conceptual terms of sending server, upstream and
downstream servers. WALSenders are terminated when the standby is promoted to master.
Fujii Masao, review, rework and doc rewrite by Simon Riggs
This is more SQL-spec-compliant, more easily extensible, and better
performing than the old method of inventing special variables.
Pavel Stehule, reviewed by Shigeru Hanada and David Wheeler
When an AccessShareLock, RowShareLock, or RowExclusiveLock is requested
on an unshared database relation, and we can verify that no conflicting
locks can possibly be present, record the lock in a per-backend queue,
stored within the PGPROC, rather than in the primary lock table. This
eliminates a great deal of contention on the lock manager LWLocks.
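A hedged sketch of the per-backend bookkeeping involved (field names follow
PGPROC, but the slot count and bit layout here are approximate; see proc.h and
lock.c for the real definitions):

    #include <stdint.h>

    typedef uint32_t Oid;               /* stand-in for the PostgreSQL typedef */

    #define FP_LOCK_SLOTS_PER_BACKEND 16

    typedef struct FastPathLockArea
    {
        uint64_t fpLockBits;                         /* held lock modes, a few bits per slot */
        Oid      fpRelId[FP_LOCK_SLOTS_PER_BACKEND]; /* relation OID for each slot */
    } FastPathLockArea;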
This patch also refactors the interface between GetLockStatusData() and
pg_lock_status() to be a bit more abstract, so that we don't rely so
heavily on the lock manager's internal representation details. The new
fast path lock structures don't have a LOCK or PROCLOCK structure to
return, so we mustn't depend on that for purposes of listing outstanding
locks.
Review by Jeff Davis.
We already have similar functions for many other object types, including
operator classes, so it seems like we should have this one, too.
Extracted from a larger patch by Josh Kupershmidt
Move FileClose's decrement of temporary_files_size up, so that it will be
executed even if elog() throws an error. This is reasonable since if the
unlink() fails, the fact the file is still there is not our fault, and we
are going to forget about it anyhow. So we won't count it against
temp_file_limit anymore.
Update fileSize and temporary_files_size correctly in FileTruncate.
We probably don't have any places that truncate temp files, but fd.c
surely should not assume that.
If a Var was used only in a GROUP BY expression, the previous
implementation would include the Var by itself (as well as the expression)
in the generated targetlist. This wouldn't affect the efficiency of the
scan/join part of the plan at all, but it could result in passing
unnecessarily-wide rows through sorting and grouping steps. It turns out
to take only a little more code, and not noticeably more time, to generate
a tlist without such redundancy, so let's do that. Per a recent gripe from
HarmeekSingh Bedi.
There may be some other places where we should use errdetail_internal,
but they'll have to be evaluated case-by-case. This commit just hits
a bunch of places where invoking gettext is obviously a waste of cycles.
This function supports untranslated detail messages, in the same way that
errmsg_internal supports untranslated primary messages. We've needed this
for some time IMO, but discussion of some cases in the SSI code provided
the impetus to actually add it.
Kevin Grittner, with minor adjustments by me
This fixes SSPI login failures showing "The function
requested is not supported", often showing up when connecting
to localhost. The reason was not properly updating the SSPI
handle when multiple roundtrips were required to complete the
authentication sequence.
Report and analysis by Ahmed Shinwari, patch by Magnus Hagander
The commit action of temporary tables is currently not cataloged, so
we can't easily show it. The previous value was outdated from before
we had different commit actions.
GISTInsertStack.childoffnum used to mean "offset of the downlink in this
node, pointing to the child node in the stack". It's now replaced with
downlinkoffnum, which means "offset of the downlink in the parent of this
node". gistFindPath() already used childoffnum with this new meaning, and
had an extra step at the end to pull all the childoffnum values down one
node in the stack, to adjust the stack for the meaning that childoffnum had
elsewhere. That's no longer required.
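For reference, a hedged sketch of the stack entry after the change (abridged;
the types are the usual PostgreSQL ones and some fields, such as the page LSN,
are omitted):

    typedef struct GISTInsertStack
    {
        BlockNumber  blkno;            /* this page's block number */
        Buffer       buffer;           /* buffer holding the page */
        Page         page;             /* pointer to the page contents */
        OffsetNumber downlinkoffnum;   /* offset of the downlink in the parent
                                        * page that points to this page */
        struct GISTInsertStack *parent;
    } GISTInsertStack;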
The reason to do this now is this new representation is more convenient for
the GiST fast build patch that Alexander Korotkov is working on.
While we're at it, replace the linked list used in gistFindPath with a
standard List, and make gistFindPath() static.
Alexander Korotkov, with some changes by me.
First, when following a right-link, we incorrectly marked the current page
as the parent of the right sibling. In reality, the parent of the right page
is the same as the parent of the current page (or some page to the right of
it, gistFindCorrectParent() will sort that out).
Secondly, when we follow a right-link, we must prepend, not append, the right
page to our list of pages to visit. That's because we assume that once we
hit a leaf page in the list, all the rest are leaf pages too, and give up.
To hit these bugs, you need concurrent actions and several unlucky accidents.
Another backend must split the root page, while you're in process of
splitting a lower-level page. Furthermore, while you scan the internal nodes
to re-find the parent, another backend needs to again split some more internal
pages. Even then, the bugs don't necessarily manifest as user-visible errors
or index corruption.
While we're at it, make the error reporting a bit better if gistFindPath()
fails to re-find the parent. It used to be an assertion, but an elog() seems
more appropriate.
Backpatch to all supported branches.
There's a heuristic in estimate_rel_size() to clamp the minimum size
estimate for a table to 10 pages, unless we can see that vacuum or analyze
has been run (and set relpages to something nonzero, so this will always
happen for a table that's actually empty). However, it would be better
not to do this for inheritance parent tables, which very commonly are
really empty and can be expected to stay that way. Per discussion of a
recent pgsql-performance report from Anish Kejariwal. Also prevent it
from happening for indexes (although this is more in the nature of
documentation, since CREATE INDEX normally initializes relpages to
something nonzero anyway).
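A minimal sketch of the adjusted heuristic (parameter names are illustrative;
the real test lives in estimate_rel_size() in plancat.c):

    #include <stdbool.h>

    static double
    clamp_min_pages(double curpages, bool ever_vacuumed_or_analyzed,
                    bool is_inheritance_parent, bool is_index)
    {
        /* Apply the 10-page floor only to plain tables that have never been
         * vacuumed or analyzed and are not inheritance parents. */
        if (curpages < 10 &&
            !ever_vacuumed_or_analyzed &&
            !is_inheritance_parent &&
            !is_index)
            return 10;
        return curpages;
    }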
Back-patch to 9.0, because the ability to collect statistics across a
whole inheritance tree has improved the planner's estimates to the point
where this relatively small error makes a significant difference. In the
referenced report, merge or hash joins were incorrectly estimated as
cheaper than a nestloop with inner indexscan on the inherited table.
That was less likely before 9.0 because the lack of inherited stats would
have resulted in a default (and rather pessimistic) estimate of the cost
of a merge or hash join.