Programmers discovered decades ago that it was useful to have a simple
interface for allocating and freeing memory, which is why malloc() and
free() were invented. Unfortunately, those handy tools don't work
with dynamic shared memory segments because those are specific to
PostgreSQL and are not necessarily mapped at the same address in every
cooperating process. So invent our own allocator instead. This makes
it possible for processes cooperating as part of parallel query
execution to allocate and free chunks of memory without having to
reserve them prior to the start of execution. It could also be used
for longer lived objects; for example, we could consider storing data
for pg_stat_statements or the stats collector in shared memory using
these interfaces, rather than writing them to files. Basically,
anything that needs shared memory but can't predict in advance how
much it's going to need might find this useful.
Thomas Munro and Robert Haas. The original code (of mine) on which
Thomas based his work was actually designed to be a new backend-local
memory allocator for PostgreSQL, but that hasn't gone anywhere - or
not yet, anyway. Thomas took that work and performed major
refactoring and extensive modifications to make it work with dynamic
shared memory, including the addition of appropriate locking.
Discussion: CA+TgmobkeWptGwiNa+SGFWsTLzTzD-CeLz0KcE-y6LFgoUus4A@mail.gmail.com
Discussion: CAEepm=1z5WLuNoJ80PaCvz6EtG9dN0j-KuHcHtU6QEfcPP5-qA@mail.gmail.com
This is intended as infrastructure for a full-fledged allocator for
dynamic shared memory. The interface looks a bit like a real
allocator, but only supports allocating and freeing memory in
multiples of the 4kB page size. Further, to free memory, you must
know the size of the span you wish to free, in pages. While these are
make it unsuitable as an allocator in and of itself, it still serves
as very useful scaffolding for a full-fledged allocator.
Robert Haas and Thomas Munro. This code is mostly the same as my 2014
submission, but Thomas fixed quite a few bugs and made some changes to
the interface.
Discussion: CA+TgmobkeWptGwiNa+SGFWsTLzTzD-CeLz0KcE-y6LFgoUus4A@mail.gmail.com
Discussion: CAEepm=1z5WLuNoJ80PaCvz6EtG9dN0j-KuHcHtU6QEfcPP5-qA@mail.gmail.com
C doesn't have any sort of built-in understanding of a pointer
relative to some arbitrary base address, but dynamic shared memory
segments can be mapped at different addresses in different processes,
so any sort of shared data structure stored within a dynamic shared
memory segment can't use absolute pointers. We could use something
like Size to represent a relative pointer, but then the compiler
provides no type-checking. Use stupid macro tricks to get some
type-checking.
Patch originally by me. Concept suggested by Andres Freund. Recently
resubmitted as part of Thomas Munro's work on dynamic shared memory
allocation.
Discussion: 20131205144434.GG12398@alap2.anarazel.de
Discussion: CAEepm=1z5WLuNoJ80PaCvz6EtG9dN0j-KuHcHtU6QEfcPP5-qA@mail.gmail.com
The CatalogSnapshot was not plugged into SnapshotResetXmin()'s accounting
for whether MyPgXact->xmin could be cleared or advanced. In normal
transactions this was masked by the fact that the transaction snapshot
would be older, but during backend startup and certain utility commands
it was possible to re-use the CatalogSnapshot after MyPgXact->xmin had
been cleared, meaning that recently-deleted rows could be pruned even
though this snapshot could still see them, causing unexpected catalog
lookup failures. This effect appears to be the explanation for a recent
failure on buildfarm member piculet.
To fix, add the CatalogSnapshot to the RegisteredSnapshots heap whenever
it is valid.
In the previous logic, it was possible for the CatalogSnapshot to remain
valid across waits for client input, but with this change that would mean
it delays advance of global xmin in cases where it did not before. To
avoid possibly causing new table-bloat problems with clients that sit idle
for long intervals, add code to invalidate the CatalogSnapshot before
waiting for client input. (When the backend is busy, it's unlikely that
the CatalogSnapshot would be the oldest snap for very long, so we don't
worry about forcing early invalidation of it otherwise.)
In passing, remove the CatalogSnapshotStale flag in favor of using
"CatalogSnapshot != NULL" to represent validity, as we do for the other
special snapshots in snapmgr.c. And improve some obsolete comments.
No regression test because I don't know a deterministic way to cause this
failure. But the stress test shown in the original discussion provokes
"cache lookup failed for relation 1255" within a few dozen seconds for me.
Back-patch to 9.4 where MVCC catalog scans were introduced. (Note: it's
quite easy to produce similar failures with the same test case in branches
before 9.4. But MVCC catalog scans were supposed to fix that.)
Discussion: <16447.1478818294@sss.pgh.pa.us>
The reloptions stuff allows this option to be set on a matview.
While it's questionable whether that is useful or was really intended,
it does work, and we shouldn't change that in minor releases. Commit
e3e66d8a9 disabled the option since I didn't realize that it was
possible for it to be set on a matview. Tweak the test to re-allow it.
Discussion: <19749.1478711862@sss.pgh.pa.us>
We really ought to make StdRdOptions and the other decoded forms of
reloptions self-identifying, but for the moment, assume that only plain
relations could possibly be user_catalog_tables. Fixes problem with bogus
"ON CONFLICT is not supported on table ... used as a catalog table" error
when target is a view with cascade option.
Discussion: <26681.1477940227@sss.pgh.pa.us>
This is infrastructure for the complete SQL standard feature. No
support is included at this point for execution nodes or PLs. The
intent is to add that soon.
As this patch leaves things, standard syntax can create tuplestores
to contain old and/or new versions of rows affected by a statement.
References to these tuplestores are in the TriggerData structure.
C triggers can access the tuplestores directly, so they are usable,
but they cannot yet be referenced within a SQL statement.
These functions were originally added in commit d8cedf67a to support
use of int2vector columns as catcache lookup keys. However, there are
no catcaches that use such columns. (Indeed I now think it must always
have been dead code: a catcache with such a key column would need an
underlying unique index on the column, but we've never had an int2vector
btree opclass.)
Getting rid of the int2vector-specific operator and function does not
lose any functionality, because operations on int2vectors will now fall
back to the generic anyarray support. This avoids a wart that a btree
index on an int2vector column (made using anyarray_ops) would fail to
match equality searches, because int2vectoreq wasn't a member of the
opclass. We don't really care much about that, since int2vector is not
meant as a type for users to use, but it's silly to have extra code and
less functionality.
If we ever do want a catcache to be indexed by an int2vector column,
we'd need to put back full btree and hash opclasses for int2vector,
comparable to the support for oidvector. (The anyarray code can't be
used at such a low level, because it needs to do catcache lookups.)
But we'll deal with that if/when the need arises.
Also worth noting is that removal of the hash int2vector_ops opclass will
break any user-created hash indexes on int2vector columns. While hash
anyarray_ops would serve the same purpose, it would probably not compute
the same hash values and thus wouldn't be on-disk-compatible. Given that
int2vector isn't a user-facing type and we're planning other incompatible
changes in hash indexes for v10 anyway, this doesn't seem like something
to worry about, but it's probably worth mentioning here.
Amit Langote
Discussion: <d9bb74f8-b194-7307-9ebd-90645d377e45@lab.ntt.co.jp>
Pass the buffer size as argument to LogicalTapeRewindForRead, rather than
setting it earlier with the separate LogicTapeAssignReadBufferSize call.
This way, the buffer size is set closer to where it's actually used, which
makes the code easier to understand.
This makes the calculation for how much memory to use for the buffers less
precise. We now use the same amount of memory for every tape, rounded down
to the nearest BLCKSZ boundary, instead of using one more block for some
tapes, to get the total up to exact amount of memory available. That should
be OK, merging isn't too sensitive to the exact amount of memory used.
Reviewed by Peter Geoghegan
Discussion: <0f607c4b-df23-353e-bf56-c0389d28495f@iki.fi>
Commit 88e982302 invented GUC_UNIT_XSEGS for min_wal_size and max_wal_size,
but neglected to make it display sensibly in pg_settings.unit (by adding a
case to the switch in GetConfigOptionByNum). Fix that, and adjust said
switch to throw a run-time error the next time somebody forgets.
In passing, avoid using a static buffer for the output string --- the rest
of this function pstrdup's from a local buffer, and I see no very good
reason why the units code should do it differently and less safely.
Per report from Otar Shavadze. Back-patch to 9.5 where the new unit type
was added.
Report: <CAG-jOyA=iNFhN+yB4vfvqh688B7Tr5SArbYcFUAjZi=0Exp-Lg@mail.gmail.com>
Don't pre-read tuples into SortTuple slots during merge. Instead, use the
memory for larger read buffers in logtape.c. We're doing the same number
of READTUP() calls either way, but managing the pre-read SortTuple slots
is much more complicated. Also, the on-tape representation is more compact
than SortTuples, so we can fit more pre-read tuples into the same amount
of memory this way. And we have better cache-locality, when we use just a
small number of SortTuple slots.
Now that we only hold one tuple from each tape in the SortTuple slots, we
can greatly simplify the "batch memory" management. We now maintain a
small set of fixed-sized slots, to hold the tuples, and fall back to
palloc() for larger tuples. We use this method during all merge phases,
not just the final merge, and also when randomAccess is requested, and
also in the TSS_SORTEDONTAPE case. In other words, it's used whenever we
do an external sort.
Reviewed by Peter Geoghegan and Claudio Freire.
Discussion: <CAM3SWZTpaORV=yQGVCG8Q4axcZ3MvF-05xe39ZvORdU9JcD6hQ@mail.gmail.com>
Some experimentation with an older version of gcc showed that it is able
to determine whether "if (elevel_ >= ERROR)" is compile-time constant
if elevel_ is declared "const", but otherwise not so much. We had
accounted for that in ereport() but were too miserly with braces to
make it so in elog(). I don't know how many currently-interesting
compilers have the same quirk, but in case it will save some code
space, let's make sure that elog() is on the same footing as ereport()
for this purpose.
Back-patch to 9.3 where we introduced pg_unreachable() calls into
elog/ereport.
Add a location field to the DefElem struct, used to parse many utility
commands. Update various error messages to supply error position
information.
To propogate the error position information in a more systematic way,
create a ParseState in standard_ProcessUtility() and pass that to
interested functions implementing the utility commands. This seems
better than passing the query string and then reassembling a parse state
ad hoc, which violates the encapsulation of the ParseState type.
Reviewed-by: Pavel Stehule <pavel.stehule@gmail.com>
I found that half a dozen (nearly 5%) of our AllocSetContextCreate calls
had typos in the context-sizing parameters. While none of these led to
especially significant problems, they did create minor inefficiencies,
and it's now clear that expecting people to copy-and-paste those calls
accurately is not a great idea. Let's reduce the risk of future errors
by introducing single macros that encapsulate the common use-cases.
Three such macros are enough to cover all but two special-purpose contexts;
those two calls can be left as-is, I think.
While this patch doesn't in itself improve matters for third-party
extensions, it doesn't break anything for them either, and they can
gradually adopt the simplified notation over time.
In passing, change TopMemoryContext to use the default allocation
parameters. Formerly it could only be extended 8K at a time. That was
probably reasonable when this code was written; but nowadays we create
many more contexts than we did then, so that it's not unusual to have a
couple hundred K in TopMemoryContext, even without considering various
dubious code that sticks other things there. There seems no good reason
not to let it use growing blocks like most other contexts.
Back-patch to 9.6, mostly because that's still close enough to HEAD that
it's easy to do so, and keeping the branches in sync can be expected to
avoid some future back-patching pain. The bugs fixed by these changes
don't seem to be significant enough to justify fixing them further back.
Discussion: <21072.1472321324@sss.pgh.pa.us>
This seems to offer significantly better search performance than the
existing GiST opclass for inet/cidr, at least on data with a wide mix
of network mask lengths. (That may suggest that the data splitting
heuristics in the GiST opclass could be improved.)
Emre Hasegeli, with mostly-cosmetic adjustments by me
Discussion: <CAE2gYzxtth9qatW_OAqdOjykS0bxq7AYHLuyAQLPgT7H9ZU0Cw@mail.gmail.com>
Add a variant of txid_current() that returns NULL if no transaction ID
is assigned. This version can be used even on a standby server,
although it will always return NULL since no transaction IDs can be
assigned during recovery.
Craig Ringer, per suggestion from Jim Nasby. Reviewed by Petr Jelinek
and by me.
Merge several copies of "copy an inet value and adjust the mask length"
code to create a single, conveniently C-callable function. This function
is exported for future use by inet SPGiST support, but it's good cleanup
anyway since we had three slightly-different-for-no-good-reason copies.
(Extracted from a larger patch, to separate new code from refactoring
of old code)
Emre Hasegeli
These types are storage-compatible with real arrays, but they don't support
toasting, so of course they can't support expansion either.
Per bug #14289 from Michael Overmeyer. Back-patch to 9.5 where expanded
arrays were introduced.
Report: <20160818174414.1529.37913@wrigleys.postgresql.org>
regexp_match() is like regexp_matches(), but it disallows the 'g' flag
and in consequence does not need to return a set. Instead, it returns
a simple text array value, or NULL if there's no match. Previously people
usually got that behavior with a sub-select, but this way is considerably
more efficient.
Documentation adjusted so that regexp_match() is presented first and then
regexp_matches() is introduced as a more complicated version. This is
a bit historically revisionist but seems pedagogically better.
Still TODO: extend contrib/citext to support this function.
Emre Hasegeli, reviewed by David Johnston
Discussion: <CAE2gYzy42sna2ME_e3y1KLQ-4UBrB-eVF0SWn8QG39sQSeVhEw@mail.gmail.com>
We implement a dozen or so parameterless functions that the SQL standard
defines special syntax for. Up to now, that was done by converting them
into more or less ad-hoc constructs such as "'now'::text::date". That's
messy for multiple reasons: it exposes what should be implementation
details to users, and performance is worse than it needs to be in several
cases. To improve matters, invent a new expression node type
SQLValueFunction that can represent any of these parameterless functions.
Bump catversion because this changes stored parsetrees for rules.
Discussion: <30058.1463091294@sss.pgh.pa.us>
NUMERIC_MAX_PRECISION is a purely arbitrary constraint on the precision
and scale you can write in a numeric typmod. It might once have had
something to do with the allowed range of a typmod-less numeric value,
but at least since 9.1 we've allowed, and documented that we allowed,
any value that would physically fit in the numeric storage format;
which is something over 100000 decimal digits, not 1000.
Hence, get rid of numeric_in()'s use of NUMERIC_MAX_PRECISION as a limit
on the allowed range of the exponent in scientific-format input. That was
especially silly in view of the fact that you can enter larger numbers as
long as you don't use 'e' to do it. Just constrain the value enough to
avoid localized overflow, and let make_result be the final arbiter of what
is too large. Likewise adjust ecpg's equivalent of this code.
Also get rid of numeric_recv()'s use of NUMERIC_MAX_PRECISION to limit the
number of base-NBASE digits it would accept. That created a dump/restore
hazard for binary COPY without doing anything useful; the wire-format
limit on number of digits (65535) is about as tight as we would want.
In HEAD, also get rid of pg_size_bytes()'s unnecessary intimacy with what
the numeric range limit is. That code doesn't exist in the back branches.
Per gripe from Aravind Kumar. Back-patch to all supported branches,
since they all contain the documentation claim about allowed range of
NUMERIC (cf commit cabf5d84b).
Discussion: <2895.1471195721@sss.pgh.pa.us>
Per discussion, we should provide such functions to replace the lost
ability to discover AM properties by inspecting pg_am (cf commit
65c5fcd35). The added functionality is also meant to displace any code
that was looking directly at pg_index.indoption, since we'd rather not
believe that the bit meanings in that field are part of any client API
contract.
As future-proofing, define the SQL API to not assume that properties that
are currently AM-wide or index-wide will remain so unless they logically
must be; instead, expose them only when inquiring about a specific index
or even specific index column. Also provide the ability for an index
AM to override the behavior.
In passing, document pg_am.amtype, overlooked in commit 473b93287.
Andrew Gierth, with kibitzing by me and others
Discussion: <87mvl5on7n.fsf@news-spur.riddles.org.uk>
Discussion of commit 3e2f3c2e4 exposed a problem that is of longer
standing: since we don't detoast data while sticking it into a portal's
holdStore for PORTAL_ONE_RETURNING and PORTAL_UTIL_SELECT queries, and we
release the query's snapshot as soon as we're done loading the holdStore,
later readout of the holdStore can do TOAST fetches against data that can
no longer be seen by any of the session's live snapshots. This means that
a concurrent VACUUM could remove the TOAST data before we can fetch it.
Commit 3e2f3c2e4 exposed the problem by showing that sometimes we had *no*
live snapshots while fetching TOAST data, but we'd be at risk anyway.
I believe this code was all right when written, because our management of a
session's exposed xmin was such that the TOAST references were safe until
end of transaction. But that's no longer true now that we can advance or
clear our PGXACT.xmin intra-transaction.
To fix, copy the query's snapshot during FillPortalStore() and save it in
the Portal; release it only when the portal is dropped. This essentially
implements a policy that we must hold a relevant snapshot whenever we
access potentially-toasted data. We had already come to that conclusion
in other places, cf commits 08e261cbc9 and ec543db77b.
I'd have liked to add a regression test case for this, but I didn't see
a way to make one that's not unreasonably bloated; it seems to require
returning a toasted value to the client, and those will be big.
In passing, improve PortalRunUtility() so that it positively verifies
that its ending PopActiveSnapshot() call will pop the expected snapshot,
removing a rather shaky assumption about which utility commands might
do their own PopActiveSnapshot(). There's no known bug here, but now
that we're actively referencing the snapshot it's almost free to make
this code a bit more bulletproof.
We might want to consider back-patching something like this into older
branches, but it would be prudent to let it prove itself more in HEAD
beforehand.
Discussion: <87vazemeda.fsf@credativ.de>
tqual.h is included in some front-end compiles, and a static inline
breaks on buildfarm member castoroides. Since the macro is never
referenced, it should dodge that problem, although this doesn't
seem like the cleanest way of hiding things from front-end compiles.
Report and review by Tom Lane; patch by me.
Previously, we tested for MVCC snapshots to see whether they were too
old, but not TOAST snapshots, which can lead to complaints about missing
TOAST chunks if those chunks are subject to early pruning. Ideally,
the threshold lsn and timestamp for a TOAST snapshot would be that of
the corresponding MVCC snapshot, but since we have no way of deciding
which MVCC snapshot was used to fetch the TOAST pointer, use the oldest
active or registered snapshot instead.
Reported by Andres Freund, who also sketched out what the fix should
look like. Patch by me, reviewed by Amit Kapila.
We must not push down a foreign join when the foreign tables involved
should be accessed under different user mappings. Previously we tried
to enforce that rule literally during planning, but that meant that the
resulting plans were dependent on the current contents of the
pg_user_mapping catalog, and we had to blow away all cached plans
containing any remote join when anything at all changed in pg_user_mapping.
This could have been improved somewhat, but the fact that a syscache inval
callback has very limited info about what changed made it hard to do better
within that design. Instead, let's change the planner to not consider user
mappings per se, but to allow a foreign join if both RTEs have the same
checkAsUser value. If they do, then they necessarily will use the same
user mapping at runtime, and we don't need to know specifically which one
that is. Post-plan-time changes in pg_user_mapping no longer require any
plan invalidation.
This rule does give up some optimization ability, to wit where two foreign
table references come from views with different owners or one's from a view
and one's directly in the query, but nonetheless the same user mapping
would have applied. We'll sacrifice the first case, but to not regress
more than we have to in the second case, allow a foreign join involving
both zero and nonzero checkAsUser values if the nonzero one is the same as
the prevailing effective userID. In that case, mark the plan as only
runnable by that userID.
The plancache code already had a notion of plans being userID-specific,
in order to support RLS. It was a little confused though, in particular
lacking clarity of thought as to whether it was the rewritten query or just
the finished plan that's dependent on the userID. Rearrange that code so
that it's clearer what depends on which, and so that the same logic applies
to both RLS-injected role dependency and foreign-join-injected role
dependency.
Note that this patch doesn't remove the other issue mentioned in the
original complaint, which is that while we'll reliably stop using a foreign
join if it's disallowed in a new context, we might fail to start using a
foreign join if it's now allowed, but we previously created a generic
cached plan that didn't use one. It was agreed that the chance of winning
that way was not high enough to justify the much larger number of plan
invalidations that would have to occur if we tried to cause it to happen.
In passing, clean up randomly-varying spelling of EXPLAIN commands in
postgres_fdw.sql, and fix a COSTS ON example that had been allowed to
leak into the committed tests.
This reverts most of commits fbe5a3fb7 and 5d4171d1c, which were the
previous attempt at ensuring we wouldn't push down foreign joins that
span permissions contexts.
Etsuro Fujita and Tom Lane
Discussion: <d49c1e5b-f059-20f4-c132-e9752ee0113e@lab.ntt.co.jp>
GiST index build could go into an infinite loop when presented with boxes
(or points, circles or polygons) containing NaN component values. This
happened essentially because the code assumed that x == x is true for any
"double" value x; but it's not true for NaNs. The looping behavior was not
the only problem though: we also attempted to sort the items using simple
double comparisons. Since NaNs violate the trichotomy law, qsort could
(in principle at least) get arbitrarily confused and mess up the sorting of
ordinary values as well as NaNs. And we based splitting choices on box size
calculations that could produce NaNs, again resulting in undesirable
behavior.
To fix, replace all comparisons of doubles in this logic with
float8_cmp_internal, which is NaN-aware and is careful to sort NaNs
consistently, higher than any non-NaN. Also rearrange the box size
calculation to not produce NaNs; instead it should produce an infinity
for a box with NaN on one side and not-NaN on the other.
I don't by any means claim that this solves all problems with NaNs in
geometric values, but it should at least make GiST index insertion work
reliably with such data. It's likely that the index search side of things
still needs some work, and probably regular geometric operations too.
But with this patch we're laying down a convention for how such cases
ought to behave.
Per bug #14238 from Guang-Dih Lei. Back-patch to 9.2; the code used before
commit 7f3bd86843 is quite different and doesn't lock up on my simple
test case, nor on the submitter's dataset.
Report: <20160708151747.1426.60150@wrigleys.postgresql.org>
Discussion: <28685.1468246504@sss.pgh.pa.us>
This patch provides a new implementation of the logic added by commit
137805f89 and later removed by 77ba61080. It differs from the original
primarily in expending much less effort per joinrel in large queries,
which it accomplishes by doing most of the matching work once per query not
once per joinrel. Hopefully, it's also less buggy and better commented.
The never-documented enable_fkey_estimates GUC remains gone.
There remains work to be done to make the selectivity estimates account
for nulls in FK referencing columns; but that was true of the original
patch as well. We may be able to address this point later in beta.
In the meantime, any error should be in the direction of overestimating
rather than underestimating joinrel sizes, which seems like the direction
we want to err in.
Tomas Vondra and Tom Lane
Discussion: <31041.1465069446@sss.pgh.pa.us>
Since indexes are created without valid LSNs, an index created
while a snapshot older than old_snapshot_threshold existed could
cause queries to return incorrect results when those old snapshots
were used, if any relevant rows had been subject to early pruning
before the index was built. Prevent usage of a newly created index
until all such snapshots are released, for relations where this can
happen.
Questions about the interaction of "snapshot too old" with index
creation were initially raised by Andres Freund.
Reviewed by Robert Haas.
Fix a couple of overlooked uses of "degree" terminology. Make the parallel
worker count selection logic in create_plain_partial_paths more robust (in
particular, it failed with max_parallel_workers_per_gather set to zero).
This terminology provoked widespread complaints. So, instead, rename
the GUC max_parallel_degree to max_parallel_workers_per_gather
(leaving room for a possible future GUC max_parallel_workers that acts
as a system-wide limit), and rename the parallel_degree reloption to
parallel_workers. Rename structure members to match.
These changes create a dump/restore hazard for users of PostgreSQL
9.6beta1 who have set the reloption (or applied the GUC using ALTER
USER or ALTER DATABASE).
This commit reverts 137805f89 as well as the associated commits 015e88942,
5306df283, and 68d704edb. We found multiple bugs in this feature, and
there was concern about possible planner slowdown (though to be fair,
exhibiting a very large slowdown proved difficult). The way forward
requires a considerable rewrite, which may or may not be possible to
accomplish in time for beta2. In my judgment reviewing the rewrite will
be easier to accomplish starting from a clean slate, so let's temporarily
revert what's there now. This also leaves us in a safe state if it turns
out to be necessary to postpone the rewrite to the next development cycle.
Discussion: <20160429102531.GA13701@huehner.biz>
This attempts to buy back some of whatever performance we lost from fixing
bug #14174 by inlining the initial checks in MakeExpandedObjectReadOnly()
into the callers. We can do that in a macro without creating multiple-
evaluation hazards, so it's pretty much free notationally; and the amount
of code added to callers should be minimal as well. (Testing a value can't
take many more instructions than passing it to a subroutine.)
Might as well inline DatumIsReadWriteExpandedObject() while we're at it.
This is an ABI break for callers, so it doesn't seem safe to put into 9.5,
but I see no reason not to do it in HEAD.
If both timeout indicators are set when we arrive at ProcessInterrupts,
we've historically just reported "lock timeout". However, some buildfarm
members have been observed to fail isolationtester's timeouts test by
reporting "lock timeout" when the statement timeout was expected to fire
first. The cause seems to be that the process is allowed to sleep longer
than expected (probably due to heavy machine load) so that the lock
timeout happens before we reach the point of reporting the error, and
then this arbitrary tiebreak rule does the wrong thing. We can improve
matters by comparing the scheduled timeout times to decide which error
to report.
I had originally proposed greatly reducing the 1-second window between
the two timeouts in the test cases. On reflection that is a bad idea,
at least for the case where the lock timeout is expected to fire first,
because that would assume that it takes negligible time to get from
statement start to the beginning of the lock wait. Thus, this patch
doesn't completely remove the risk of test failures on slow machines.
Empirically, however, the case this handles is the one we are seeing
in the buildfarm. The explanation may be that the other case requires
the scheduler to take the CPU away from a busy process, whereas the
case fixed here only requires the scheduler to not give the CPU back
right away to a process that has been woken from a multi-second sleep
(and, perhaps, has been swapped out meanwhile).
Back-patch to 9.3 where the isolationtester timeouts test was added.
Discussion: <8693.1464314819@sss.pgh.pa.us>
Hash indexes are not WAL-logged, and so do not maintain the LSN of
index pages. Since the "snapshot too old" feature counts on
detecting error conditions using the LSN of a table and all indexes
on it, this makes it impossible to safely do early vacuuming on any
table with a hash index, so add this to the tests for whether the
xid used to vacuum a table can be adjusted based on
old_snapshot_threshold.
While at it, add a paragraph to the docs for old_snapshot_threshold
which specifically mentions this and other aspects of the feature
which may otherwise surprise users.
Problem reported and patch reviewed by Amit Kapila
Without a few entries beyond old_snapshot_threshold, the lookup
would often fail, resulting in the more aggressive pruning or
vacuum being skipped often enough to matter. This was very clearly
shown by a python test script posted by Ants Aasma, and was likely
a factor in an earlier but somewhat less clear-cut test case posted
by Jeff Janes.
This patch makes no change to the logic, per se -- it just makes
the array of mapping entries big enough to make lookup misses based
on timing much less likely. An occasional miss is still possible
if a thread stalls for more than 10 minutes, but that does not
create any problem with correctness of behavior. Besides, if
things are so busy that a thread is stalling for more than 10
minutes, it is probably OK to skip the more aggressive cleanup at
that particular point in time.
Revert commit 7cb1db1d95, which represented
a misunderstanding of the problem (if snapmgr.h weren't already included
in bufmgr.h, things wouldn't compile anywhere). Instead install what
I think is the real fix.
The current definition of init_spin_delay (introduced recently in
48354581a) wasn't C89 compliant. It's not legal to refer to refer to
non-constant expressions, and the ptr argument was one. This, as
reported by Tom, lead to a failure on buildfarm animal pademelon.
The pointer, especially on system systems with ASLR, isn't super helpful
anyway, though. So instead of making init_spin_delay into an inline
function, make s_lock_stuck() report the function name in addition to
file:line and change init_spin_delay() accordingly. While not a direct
replacement, the function name is likely more useful anyway (line
numbers are often hard to interpret in third party reports).
This also fixes what file/line number is reported for waits via
s_lock().
As PG_FUNCNAME_MACRO is now used outside of elog.h, move it to c.h.
Reported-By: Tom Lane
Discussion: 4369.1460435533@sss.pgh.pa.us
This will prevent users from creating roles which begin with "pg_" and
will check for those roles before allowing an upgrade using pg_upgrade.
This will allow for default roles to be provided at initdb time.
Reviews by José Luis Tallón and Robert Haas
This feature is controlled by a new old_snapshot_threshold GUC. A
value of -1 disables the feature, and that is the default. The
value of 0 is just intended for testing. Above that it is the
number of minutes a snapshot can reach before pruning and vacuum
are allowed to remove dead tuples which the snapshot would
otherwise protect. The xmin associated with a transaction ID does
still protect dead tuples. A connection which is using an "old"
snapshot does not get an error unless it accesses a page modified
recently enough that it might not be able to produce accurate
results.
This is similar to the Oracle feature, and we use the same SQLSTATE
and error message for compatibility.
This allows parallel aggregation to use them. It may seem surprising
that we use float8_combine for both float4_accum and float8_accum
transition functions, but that's because those functions differ only
in the type of the non-transition-state argument.
Haribabu Kommi, reviewed by David Rowley and Tomas Vondra
Now indexes (but only B-tree for now) can contain "extra" column(s) which
doesn't participate in index structure, they are just stored in leaf
tuples. It allows to use index only scan by using single index instead
of two or more indexes.
Author: Anastasia Lubennikova with minor editorializing by me
Reviewers: David Rowley, Peter Geoghegan, Jeff Janes
The code that estimates what parallel degree should be uesd for the
scan of a relation is currently rather stupid, so add a parallel_degree
reloption that can be used to override the planner's rather limited
judgement.
Julien Rouhaud, reviewed by David Rowley, James Sewell, Amit Kapila,
and me. Some further hacking by me.