database to connect to. This is necessary for the walsender code to work
properly (it was previously using an untenable assumption that template1 would
always be available to connect to). This also gets rid of a small security
shortcoming that was introduced in the original patch to eliminate the flat
authentication files: before, you could find out whether or not the requested
database existed even if you couldn't pass the authentication checks.
The changes needed to support this are mainly just to treat pg_authid and
pg_auth_members as nailed relations, so that we can read them without having
to be able to locate real pg_class entries for them. This mechanism was
already debugged for pg_database, but we hadn't recognized the value of
applying it to those catalogs too.
Since the current code doesn't have support for accessing toast tables before
we've brought up all of the relcache, remove pg_authid's toast table to ensure
that no one can store an out-of-line toasted value of rolpassword. The case
seems quite unlikely to occur in practice, and was effectively unsupported
anyway in the old "flatfiles" implementation.
Update genbki.pl to actually implement the same rules as bootstrap.c does for
not-nullability of catalog columns. The previous coding was a bit cheesy but
worked all right for the previous set of bootstrap catalogs. It does not work
for pg_authid, where rolvaliduntil needs to be nullable.
Initdb forced due to minor catalog changes (mainly the toast table removal).
relcache reload works. In the patched code, a relcache entry in process of
being rebuilt doesn't get unhooked from the relcache hash table; which means
that if a cache flush occurs due to sinval queue overrun while we're
rebuilding it, the entry could get blown away by RelationCacheInvalidate,
resulting in crash or misbehavior. Fix by ensuring that an entry being
rebuilt has positive refcount, so it won't be seen as a target for removal
if a cache flush occurs. (This will mean that the entry gets rebuilt twice
in such a scenario, but that's okay.) It appears that the problem can only
arise within a transaction that has previously reassigned the relfilenode of
a pre-existing table, via TRUNCATE or a similar operation. Per bug #5412
from Rusty Conover.
Back-patch to 8.2, same as the patch that introduced the problem.
I think that the failure can't actually occur in 8.2, since it lacks the
rd_newRelfilenodeSubid optimization, but let's make it work like the later
branches anyway.
Patch by Heikki, slightly editorialized on by me.
The purpose of this change is to eliminate the need for every caller
of SearchSysCache, SearchSysCacheCopy, SearchSysCacheExists,
GetSysCacheOid, and SearchSysCacheList to know the maximum number
of allowable keys for a syscache entry (currently 4). This will
make it far easier to increase the maximum number of keys in a
future release should we choose to do so, and it makes the code
shorter, too.
Design and review by Tom Lane.
where a database has a non-default tablespaceid. Pass thru MyDatabaseId
and MyDatabaseTableSpace to allow file path to be re-created in
standby and correct invalidation to take place in all cases.
Update and rework xact_commit_desc() debug messages.
Bug report from Tom by code inspection. Fix by me.
Move rd_targblock, rd_fsm_nblocks, and rd_vm_nblocks from relcache to the smgr
relation entries, so that they will get reset to InvalidBlockNumber whenever
an smgr-level flush happens. Because we now send smgr invalidation messages
immediately (not at end of transaction) when a relation truncation occurs,
this ensures that other backends will reset their values before they next
access the relation. We no longer need the unreliable assumption that a
VACUUM that's doing a truncation will hold its AccessExclusive lock until
commit --- in fact, we can intentionally release that lock as soon as we've
completed the truncation. This patch therefore reverts (most of) Alvaro's
patch of 2009-11-10, as well as my marginal hacking on it yesterday. We can
also get rid of assorted no-longer-needed relcache flushes, which are far more
expensive than an smgr flush because they kill a lot more state.
In passing this patch fixes smgr_redo's failure to perform visibility-map
truncation, and cleans up some rather dubious assumptions in freespace.c and
visibilitymap.c about when rd_fsm_nblocks and rd_vm_nblocks can be out of
date.
needed by nothing else.
The restructuring I just finished doing on cache management exposed to me how
silly this routine was. Its function was to go into the catcache and blow
away all entries related to a given relation when there was a relcache flush
on that relation. However, there is no point in removing a catcache entry
if the catalog row it represents is still valid --- and if it isn't valid,
there must have been a catcache entry flush on it, because that's triggered
directly by heap_update or heap_delete on the catalog row. So this routine
accomplished nothing except to blow away valid cache entries that we'd very
likely be wanting in the near future to help reconstruct the relcache entry.
Dumb.
On top of which, it required a subtle and easy-to-get-wrong attribute in
syscache definitions, ie, the column containing the OID of the related
relation if any. Removing that is a very useful maintenance simplification.
VACUUM FULL INPLACE), along with a boatload of subsidiary code and complexity.
Per discussion, the use case for this method of vacuuming is no longer large
enough to justify maintaining it; not to mention that we don't wish to invest
the work that would be needed to make it play nicely with Hot Standby.
Aside from the code directly related to old-style VACUUM FULL, this commit
removes support for certain WAL record types that could only be generated
within VACUUM FULL, redirect-pointer removal in heap_page_prune, and
nontransactional generation of cache invalidation sinval messages (the last
being the sticking point for Hot Standby).
We still have to retain all code that copes with finding HEAP_MOVED_OFF and
HEAP_MOVED_IN flag bits on existing tuples. This can't be removed as long
as we want to support in-place update from pre-9.0 databases.
of shared or nailed system catalogs. This has two key benefits:
* The new CLUSTER-based VACUUM FULL can be applied safely to all catalogs.
* We no longer have to use an unsafe reindex-in-place approach for reindexing
shared catalogs.
CLUSTER on nailed catalogs now works too, although I left it disabled on
shared catalogs because the resulting pg_index.indisclustered update would
only be visible in one database.
Since reindexing shared system catalogs is now fully transactional and
crash-safe, the former special cases in REINDEX behavior have been removed;
shared catalogs are treated the same as non-shared.
This commit does not do anything about the recently-discussed problem of
deadlocks between VACUUM FULL/CLUSTER on a system catalog and other
concurrent queries; will address that in a separate patch. As a stopgap,
parallel_schedule has been tweaked to run vacuum.sql by itself, to avoid
such failures during the regression tests.
of old and new toast tables can be done either at the logical level (by
swapping the heaps' reltoastrelid links) or at the physical level (by swapping
the relfilenodes of the toast tables and their indexes). This is necessary
infrastructure for upcoming changes to support CLUSTER/VAC FULL on shared
system catalogs, where we cannot change reltoastrelid. The physical swap
saves a few catalog updates too.
We unfortunately have to keep the logical-level swap logic because in some
cases we will be adding or deleting a toast table, so there's no possibility
of a physical swap. However, that only happens as a consequence of schema
changes in the table, which we do not need to support for system catalogs,
so such cases aren't an obstacle for that.
In passing, refactor the cluster support functions a little bit to eliminate
unnecessarily-duplicated code; and fix the problem that while CLUSTER had
been taught to rename the final toast table at need, ALTER TABLE had not.
the relfilenode of currently-not-relocatable system catalogs.
1. Get rid of inval.c's dependency on relfilenode, by not having it emit
smgr invalidations as a result of relcache flushes. Instead, smgr sinval
messages are sent directly from smgr.c when an actual relation delete or
truncate is done. This makes considerably more structural sense and allows
elimination of a large number of useless smgr inval messages that were
formerly sent even in cases where nothing was changing at the
physical-relation level. Note that this reintroduces the concept of
nontransactional inval messages, but that's okay --- because the messages
are sent by smgr.c, they will be sent in Hot Standby slaves, just from a
lower logical level than before.
2. Move setNewRelfilenode out of catalog/index.c, where it never logically
belonged, into relcache.c; which is a somewhat debatable choice as well but
better than before. (I considered catalog/storage.c, but that seemed too
low level.) Rename to RelationSetNewRelfilenode.
3. Cosmetic cleanups of some other relfilenode manipulations.
Attributes can now have options, just as relations and tablespaces do, and
the reloptions code is used to parse, validate, and store them. For
simplicity and because these options are not performance critical, we store
them in a separate cache rather than the main relcache.
Thanks to Alex Hunsaker for the review.
parse analysis phase, rather than at execution time. This makes parameter
handling work the same as it does in ordinary plannable queries, and in
particular fixes the incompatibility that Pavel pointed out with plpgsql's
new handling of variable references. plancache.c gets a little bit
grottier, but the alternatives seem worse.
underlying catalog not only the index itself. Otherwise, if the cache
load process touches the catalog (which will happen for many though not
all of these indexes), we are locking index before parent table, which can
result in a deadlock against processes that are trying to lock them in the
normal order. Per today's failure on buildfarm member gothic_moth; it's
surprising the problem hadn't been identified before.
Back-patch to 8.2. Earlier releases didn't have the issue because they
didn't try to lock these indexes during load (instead assuming that they
couldn't change schema at all during multiuser operation).
especially not ROLLBACK. ROLLBACK might need to be executed in an already
aborted transaction, when there is no safe way to revalidate the plan. But
in general there's no point in marking utility statements invalid, since
they have no plans in the normal sense of the word; so we might as well
work a bit harder here to avoid future revalidation cycles.
Back-patch to 8.4, where the bug was introduced.
occurring during a reload, such as query-cancel. Instead of zeroing out
an existing relcache entry and rebuilding it in place, build a new relcache
entry, then swap its contents with the old one, then free the new entry.
This avoids problems with code believing that a previously obtained pointer
to a cache entry must still reference a valid entry, as seen in recent
failures on buildfarm member jaguar. (jaguar is using CLOBBER_CACHE_ALWAYS
which raises the probability of failure substantially, but the problem
could occur in the field without that.) The previous design was okay
when it was made, but subtransactions and the ResourceOwner mechanism
make it unsafe now.
Also, make more use of the already existing rd_isvalid flag, so that we
remember that the entry requires rebuilding even if the first attempt fails.
Back-patch as far as 8.2. Prior versions have enough issues around relcache
reload anyway (due to inadequate locking) that fixing this one doesn't seem
worthwhile.
can upgrade clusters without renaming the tablespace directories. New
directory structure format is, e.g.:
$PGDATA/pg_tblspc/20981/PG_8.5_201001061/719849/83292814
deletion, so that we attempt to unlink the correct filepath. unlink()
errors are ignorable there, so lack of a DatabasePath initialization step
did not cause visible problems until a related bug showed up on Solaris.
Code refactored from xact_redo_commit() to
ProcessCommittedInvalidationMessages() in inval.c. Recovery may replay
shared invalidation messages for many databases, so we cannot
SetDatabasePath() once as we do in normal backends. Read the databaseid
from the shared invalidation messages, then set DatabasePath
temporarily before calling RelationCacheInitFileInvalidate().
Problem report by Robert Treat, analysis and fix by me.
Add missing varlena header to TableSpaceOpts structure. And, per
Tom Lane, instead of calling tablespace_reloptions in CacheMemoryContext,
call it in the caller's memory context and copy the value over
afterwards, to reduce the chances of a session-lifetime memory leak.
This patch only supports seq_page_cost and random_page_cost as parameters,
but it provides the infrastructure to scalably support many more.
In particular, we may want to add support for effective_io_concurrency,
but I'm leaving that as future work for now.
Thanks to Tom Lane for design help and Alvaro Herrera for the review.
pg_attribute, by having genbki.pl derive the information from the various
catalog header files. This greatly simplifies modification of the
"bootstrapped" catalogs.
This patch finally kills genbki.sh and Gen_fmgrtab.sh; we now rely entirely on
Perl scripts for those build steps. To avoid creating a Perl build dependency
where there was not one before, the output files generated by these scripts
are now treated as distprep targets, ie, they will be built and shipped in
tarballs. But you will need a reasonably modern Perl (probably at least
5.6) if you want to build from a CVS pull.
The changes to the MSVC build process are untested, and may well break ---
we'll soon find out from the buildfarm.
John Naylor, based on ideas from Robert Haas and others
"column < constant", and the comparison value is in the first or last
histogram bin or outside the histogram entirely, try to fetch the actual
column min or max value using an index scan (if there is an index on the
column). If successful, replace the lower or upper histogram bound with
that value before carrying on with the estimate. This limits the
estimation error caused by moving min/max values when the comparison
value is close to the min or max. Per a complaint from Josh Berkus.
It is tempting to consider using this mechanism for mergejoinscansel as well,
but that would inject index fetches into main-line join estimation not just
endpoint cases. I'm refraining from that until we can get a better handle
on the costs of doing this type of lookup.
and teach ANALYZE to compute such stats for tables that have subclasses.
Per my proposal of yesterday.
autovacuum still needs to be taught about running ANALYZE on parent tables
when their subclasses change, but the feature is useful even without that.
probably got there via blind copy-and-paste from one of the legitimate
callers, so rearrange and comment that code a bit to make it clearer that
this isn't a necessary prerequisite to hash_create. Per observation
from Robert Haas.
Enabled by recovery_connections = on (default) and forcing archive recovery using a recovery.conf. Recovery processing now emulates the original transactions as they are replayed, providing full locking and MVCC behaviour for read only queries. Recovery must enter consistent state before connections are allowed, so there is a delay, typically short, before connections succeed. Replay of recovering transactions can conflict and in some cases deadlock with queries during recovery; these result in query cancellation after max_standby_delay seconds have expired. Infrastructure changes have minor effects on normal running, though introduce four new types of WAL record.
New test mode "make standbycheck" allows regression tests of static command behaviour on a standby server while in recovery. Typical and extreme dynamic behaviours have been checked via code inspection and manual testing. Few port specific behaviours have been utilised, though primary testing has been on Linux only so far.
This commit is the basic patch. Additional changes will follow in this release to enhance some aspects of behaviour, notably improved handling of conflicts, deadlock detection and query cancellation. Changes to VACUUM FULL are also required.
Simon Riggs, with significant and lengthy review by Heikki Linnakangas, including streamlined redesign of snapshot creation and two-phase commit.
Important contributions from Florian Pflug, Mark Kirkwood, Merlin Moncure, Greg Stark, Gianni Ciolli, Gabriele Bartolini, Hannu Krosing, Robert Haas, Tatsuo Ishii, Hiroyuki Yamada plus support and feedback from many other community members.
support any indexable commutative operator, not just equality. Two rows
violate the exclusion constraint if "row1.col OP row2.col" is TRUE for
each of the columns in the constraint.
Jeff Davis, reviewed by Robert Haas
As proof of concept, modify plpgsql to use the hooks. plpgsql is still
inserting $n symbols textually, but the "back end" of the parsing process now
goes through the ParamRef hook instead of using a fixed parameter-type array,
and then execution only fetches actually-referenced parameters, using a hook
added to ParamListInfo.
Although there's a lot left to be done in plpgsql, this already cures the
"if (TG_OP = 'INSERT' and NEW.foo ...)" problem, as illustrated by the
changed regression test.
a lot of strange behaviors that occurred in join cases. We now identify the
"current" row for every joined relation in UPDATE, DELETE, and SELECT FOR
UPDATE/SHARE queries. If an EvalPlanQual recheck is necessary, we jam the
appropriate row into each scan node in the rechecking plan, forcing it to emit
only that one row. The former behavior could rescan the whole of each joined
relation for each recheck, which was terrible for performance, and what's much
worse could result in duplicated output tuples.
Also, the original implementation of EvalPlanQual could not re-use the recheck
execution tree --- it had to go through a full executor init and shutdown for
every row to be tested. To avoid this overhead, I've associated a special
runtime Param with each LockRows or ModifyTable plan node, and arranged to
make every scan node below such a node depend on that Param. Thus, by
signaling a change in that Param, the EPQ machinery can just rescan the
already-built test plan.
This patch also adds a prohibition on set-returning functions in the
targetlist of SELECT FOR UPDATE/SHARE. This is needed to avoid the
duplicate-output-tuple problem. It seems fairly reasonable since the
other restrictions on SELECT FOR UPDATE are meant to ensure that there
is a unique correspondence between source tuples and result tuples,
which an output SRF destroys as much as anything else does.
They are now handled by a new plan node type called ModifyTable, which is
placed at the top of the plan tree. In itself this change doesn't do much,
except perhaps make the handling of RETURNING lists and inherited UPDATEs a
tad less klugy. But it is necessary preparation for the intended extension of
allowing RETURNING queries inside WITH.
Marko Tiikkaja
the privileges that will be applied to subsequently-created objects.
Such adjustments are always per owning role, and can be restricted to objects
created in particular schemas too. A notable benefit is that users can
override the traditional default privilege settings, eg, the PUBLIC EXECUTE
privilege traditionally granted by default for functions.
Petr Jelinek
relation rowtype OID into the relcache entries it builds. This ensures
that catcache copies of the relation tupdescs will be fully correct.
While the deficiency doesn't seem to have any effect in the current
sources, we have been bitten by not-quite-right catcache tupdescs before,
so it seems like a good idea to maintain the rule that they should be right.
possibility of shared-inval messages causing a relcache flush while it tries
to fill in missing data in preloaded relcache entries. There are actually
two distinct failure modes here:
1. The flush could delete the next-to-be-processed cache entry, causing
the subsequent hash_seq_search calls to go off into the weeds. This is
the problem reported by Michael Brown, and I believe it also accounts
for bug #5074. The simplest fix is to restart the hashtable scan after
we've read any new data from the catalogs. It appears that pre-8.4
branches have not suffered from this failure, because by chance there were
no other catalogs sharing the same hash chains with the catalogs that
RelationCacheInitializePhase2 had work to do for. However that's obviously
pretty fragile, and it seems possible that derivative versions with
additional system catalogs might be vulnerable, so I'm back-patching this
part of the fix anyway.
2. The flush could delete the *current* cache entry, in which case the
pointer to the newly-loaded data would end up being stored into an
already-deleted Relation struct. As long as it was still deleted, the only
consequence would be some leaked space in CacheMemoryContext. But it seems
possible that the Relation struct could already have been recycled, in
which case this represents a hard-to-reproduce clobber of cached data
structures, with unforeseeable consequences. The fix here is to pin the
entry while we work on it.
In passing, also change RelationCacheInitializePhase2 to Assert that
formrdesc() set up the relation's cached TupleDesc (rd_att) with the
correct type OID and hasoids values. This is more appropriate than
silently updating the values, because the original tupdesc might already
have been copied into the catcache. However this part of the patch is
not in HEAD because it fails due to some questionable recent changes in
formrdesc :-(. That will be cleaned up in a subsequent patch.
To make this work in the base case, pg_database now has a nailed-in-cache
relation descriptor that is initialized using hardwired knowledge in
relcache.c. This means pg_database is added to the set of relations that
need to have a Schema_pg_xxx macro maintained in pg_attribute.h. When this
path is taken, we'll have to do a seqscan of pg_database to find the row
we need.
In the normal case, we are able to do an indexscan to find the database's row
by name. This is made possible by storing a global relcache init file that
describes only the shared catalogs and their indexes (and therefore is usable
by all backends in any database). A new backend loads this cache file,
finds its database OID after an indexscan on pg_database, and then loads
the local relcache init file for that database.
This change should effectively eliminate number of databases as a factor
in backend startup time, even with large numbers of databases. However,
the real reason for doing it is as a first step towards getting rid of
the flat files altogether. There are still several other sub-projects
to be tackled before that can happen.
The current implementation fires an AFTER ROW trigger for each tuple that
looks like it might be non-unique according to the index contents at the
time of insertion. This works well as long as there aren't many conflicts,
but won't scale to massive unique-key reassignments. Improving that case
is a TODO item.
Dean Rasheed
RevalidateCachedPlan. This is to avoid a "SPI_ERROR_CONNECT" failure when
the planner calls a SPI-using function and we are already inside one.
The alternative fix is to expect callers of RevalidateCachedPlan to do this,
which seems likely to result in additional hard-to-detect bugs of omission.
Per reports from Frank van Vugt and Marek Lewczuk.
Back-patch to 8.3. It's much harder to trigger the bug in 8.3, due to a
smaller set of cases in which plans can be invalidated, but it could happen.
(I think perhaps only a SI reset event could make 8.3 fail here, but that's
certainly within the realm of possibility.)
temp relations; this is no more expensive than before, now that we have
pg_class.relistemp. Insert tests into bufmgr.c to prevent attempting
to fetch pages from nonlocal temp relations. This provides a low-level
defense against bugs-of-omission allowing temp pages to be loaded into shared
buffers, as in the contrib/pgstattuple problem reported by Stuart Bishop.
While at it, tweak a bunch of places to use new relcache tests (instead of
expensive probes into pg_namespace) to detect local or nonlocal temp tables.
relations (including a temp table's indexes and toast table/index), and
false for normal relations. For ease of checking, this commit just adds
the column and fills it correctly --- revising the relation access machinery
to use it will come separately.
refactor the relcache code that used to do that. This allows other callers
(particularly autovacuum) to do the same without necessarily having to open
and lock a table.
field needs to be included in equalRuleLocks() comparisons, else updates
will fail to propagate into relcache entries when they have positive
reference count (ie someone is using the relcache entry).
Per report from Alex Hunsaker.
This doesn't do any remote or external things yet, but it gives modules
like plproxy and dblink a standardized and future-proof system for
managing their connection information.
Martin Pihlak and Peter Eisentraut
when they are invoked by the parser. We had been setting up a snapshot at
plan time but really it needs to be done earlier, before parse analysis.
Per report from Dmitry Koterov.
Also fix two related problems discovered while poking at this one:
exec_bind_message called datatype input functions without establishing a
snapshot, and SET CONSTRAINTS IMMEDIATE could call trigger functions without
establishing a snapshot.
Backpatch to 8.2. The underlying problem goes much further back, but it is
masked in 8.1 and before because we didn't attempt to invoke domain check
constraints within datatype input. It would only be exposed if a C-language
datatype input function used the snapshot; which evidently none do, or we'd
have heard complaints sooner. Since this code has changed a lot over time,
a back-patch is hardly risk-free, and so I'm disinclined to patch further
than absolutely necessary.
heap page, where a set bit indicates that all tuples on the page are
visible to all transactions, and the page therefore doesn't need
vacuuming. It is stored in a new relation fork.
Lazy vacuum uses the visibility map to skip pages that don't need
vacuuming. Vacuum is also responsible for setting the bits in the map.
In the future, this can hopefully be used to implement index-only-scans,
but we can't currently guarantee that the visibility map is always 100%
up-to-date.
In addition to the visibility map, there's a new PD_ALL_VISIBLE flag on
each heap page, also indicating that all tuples on the page are visible to
all transactions. It's important that this flag is kept up-to-date. It
is also used to skip visibility tests in sequential scans, which gives a
small performance gain on seqscans.
("there might be triggers") rather than an exact count. This is necessary
catalog infrastructure for the upcoming patch to reduce the strength of
locking needed for trigger addition/removal. Split out and committed
separately for ease of reviewing/testing.
In passing, also get rid of the unused pg_class columns relukeys, relfkeys,
and relrefs, which haven't been maintained in many years and now have no
chance of ever being maintained (because of wishing to avoid locking).
Simon Riggs
and heap_deformtuple in favor of the newer functions heap_form_tuple et al
(which do the same things but use bool control flags instead of arbitrary
char values). Eliminate the former duplicate coding of these functions,
reducing the deprecated functions to mere wrappers around the newer ones.
We can't get rid of them entirely because add-on modules probably still
contain many instances of the old coding style.
Kris Jurka
There are some unimplemented aspects: recursive queries must use UNION ALL
(should allow UNION too), and we don't have SEARCH or CYCLE clauses.
These might or might not get done for 8.4, but even without them it's a
pretty useful feature.
There are also a couple of small loose ends and definitional quibbles,
which I'll send a memo about to pgsql-hackers shortly. But let's land
the patch now so we can get on with other development.
Yoshiyuki Asaba, with lots of help from Tatsuo Ishii and Tom Lane
free space information is stored in a dedicated FSM relation fork, with each
relation (except for hash indexes; they don't use FSM).
This eliminates the max_fsm_relations and max_fsm_pages GUC options; remove any
trace of them from the backend, initdb, and documentation.
Rewrite contrib/pg_freespacemap to match the new FSM implementation. Also
introduce a new variant of the get_raw_page(regclass, int4, int4) function in
contrib/pageinspect that let's you to return pages from any relation fork, and
a new fsm_page_contents() function to inspect the new FSM pages.
we regenerate the SQL query text not merely the plan derived from it. This
is needed to handle contingencies such as renaming of a table or column
used in an FK. Pre-8.3, such cases worked despite the lack of replanning
(because the cached plan needn't actually change), so this is a regression.
Per bug #4417 from Benjamin Bihler.
when user-defined functions used in a plan are modified. Also invalidate
plans when schemas, operators, or operator classes are modified; but for these
cases we just invalidate everything rather than tracking exact dependencies,
since these types of objects seldom change in a production database.
Tom Lane; loosely based on a patch by Martin Pihlak.
into nodes/nodeFuncs, so as to reduce wanton cross-subsystem #includes inside
the backend. There's probably more that should be done along this line,
but this is a start anyway.
REINDEX DATABASE including same) is done before a session has done any other
update on pg_class, the pg_class relcache entry was left with an incorrect
setting of rd_indexattr, because the indexed-attributes set would be first
demanded at a time when we'd forced a partial list of indexes into the
pg_class entry, and it would remain cached after that. This could result
in incorrect decisions about HOT-update safety later in the same session.
In practice, since only pg_class_relname_nsp_index would be missed out,
only ALTER TABLE RENAME and ALTER TABLE SET SCHEMA could trigger a problem.
Per report and test case from Ondrej Jirman.
as per my recent proposal:
1. Fold SortClause and GroupClause into a single node type SortGroupClause.
We were already relying on them to be struct-equivalent, so using two node
tags wasn't accomplishing much except to get in the way of comparing items
with equal().
2. Add an "eqop" field to SortGroupClause to carry the associated equality
operator. This is cheap for the parser to get at the same time it's looking
up the sort operator, and storing it eliminates the need for repeated
not-so-cheap lookups during planning. In future this will also let us
represent GROUP/DISTINCT operations on datatypes that have hash opclasses
but no btree opclasses (ie, they have equality but no natural sort order).
The previous representation simply didn't work for that, since its only
indicator of comparison semantics was a sort operator.
3. Add a hasDistinctOn boolean to struct Query to explicitly record whether
the distinctClause came from DISTINCT or DISTINCT ON. This allows removing
some complicated and not 100% bulletproof code that attempted to figure
that out from the distinctClause alone.
This patch doesn't in itself create any new capability, but it's necessary
infrastructure for future attempts to use hash-based grouping for DISTINCT
and UNION/INTERSECT/EXCEPT.
with system catalog lookups, as was foreseen to be necessary almost since
their creation. Instead put the information into two new pg_type columns,
typcategory and typispreferred. Add support for setting these when
creating a user-defined base type.
The category column is just a "char" (i.e. a poor man's enum), allowing
a crude form of user extensibility of the category list: just use an
otherwise-unused character. This seems sufficient for foreseen uses,
but we could upgrade to having an actual category catalog someday, if
there proves to be a huge demand for custom type categories.
In this patch I have attempted to hew exactly to the behavior of the
previous hardwired logic, except for introducing new type categories for
arrays, composites, and enums. In particular the default preferred state
for user-defined types remains TRUE. That seems worth revisiting, but it
should be done as a separate patch from introducing the infrastructure.
Likewise, any adjustment of the standard set of categories should be done
separately.
a portal are never NULL, but reliably provide the source text of the query.
It turns out that there was only one place that was really taking a short-cut,
which was the 'EXECUTE' utility statement. That doesn't seem like a
sufficiently critical performance hotspot to justify not offering a guarantee
of validity of the portal source text. Fix it to copy the source text over
from the cached plan. Add Asserts in the places that set up cached plans and
portals to reject null source strings, and simplify a bunch of places that
formerly needed to guard against nulls.
There may be a few places that cons up statements for execution without
having any source text at all; I found one such in ConvertTriggerToFK().
It seems sufficient to inject a phony source string in such a case,
for instance
ProcessUtility((Node *) atstmt,
"(generated ALTER TABLE ADD FOREIGN KEY command)",
NULL, false, None_Receiver, NULL);
We should take a second look at the usage of debug_query_string,
particularly the recently added current_query() SQL function.
ITAGAKI Takahiro and Tom Lane
unnecessary cache resets. The major changes are:
* When the queue overflows, we only issue a cache reset to the specific
backend or backends that still haven't read the oldest message, rather
than resetting everyone as in the original coding.
* When we observe backend(s) falling well behind, we signal SIGUSR1
to only one backend, the one that is furthest behind and doesn't already
have a signal outstanding for it. When it finishes catching up, it will
in turn signal SIGUSR1 to the next-furthest-back guy, if there is one that
is far enough behind to justify a signal. The PMSIGNAL_WAKEN_CHILDREN
mechanism is removed.
* We don't attempt to clean out dead messages after every message-receipt
operation; rather, we do it on the insertion side, and only when the queue
fullness passes certain thresholds.
* Split SInvalLock into SInvalReadLock and SInvalWriteLock so that readers
don't block writers nor vice versa (except during the infrequent queue
cleanout operations).
* Transfer multiple sinval messages for each acquisition of a read or
write lock.
corresponding struct definitions. This allows other headers to avoid including
certain highly-loaded headers such as rel.h and relscan.h, instead using just
relcache.h, heapam.h or genam.h, which are more lightweight and thus cause less
unnecessary dependencies.
There are two ways to track a snapshot: there's the "registered" list, which
is used for arbitrary long-lived snapshots; and there's the "active stack",
which is used for the snapshot that is considered "active" at any time.
This also allows users of snapshots to stop worrying about snapshot memory
allocation and freeing, and about using PG_TRY blocks around ActiveSnapshot
assignment. This is all done automatically now.
As a consequence, this allows us to reset MyProc->xmin when there are no
more snapshots registered in the current backend, reducing the impact that
long-running transactions have on VACUUM.
unnecessary #include lines in it. Also, move some tuple routine prototypes and
macros to htup.h, which allows removal of heapam.h inclusion from some .c
files.
For this to work, a new header file access/sysattr.h needed to be created,
initially containing attribute numbers of system columns, for pg_dump usage.
While at it, make contrib ltree, intarray and hstore header files more
consistent with our header style.
it is trying to build a relcache entry for. This is an oversight in my 8.2
patch that tried to ensure we always took a lock on a relation before trying
to build its relcache entry. The implication is that if someone committed a
reindex of a critical system index at about the same time that some other
backend were starting up without a valid pg_internal.init file, the second one
might PANIC due to not seeing any valid version of the index's pg_class row.
Improbable case, but definitely not impossible.
no particular need to do get_op_opfamily_properties() while building an
indexscan plan. Postpone that lookup until executor start. This simplifies
createplan.c a lot more than it complicates nodeIndexscan.c, and makes things
more uniform since we already had to do it that way for RowCompare
expressions. Should be a bit faster too, at least for plans that aren't
re-used many times, since we avoid palloc'ing and perhaps copying the
intermediate list data structure.
systable_endscan_ordered that have API similar to systable_beginscan etc
(in particular, the passed-in scankeys have heap not index attnums),
but guarantee ordered output, unlike the existing functions. For the moment
these are just very thin wrappers around index_beginscan/index_getnext/etc.
Someday they might need to get smarter; but for now this is just a code
refactoring exercise to reduce the number of direct callers of index_getnext,
in preparation for changing that function's API.
In passing, remove index_getnext_indexitem, which has been dead code for
quite some time, and will have even less use than that in the presence
of run-time-lossy indexes.
eval_const_expressions needs to be passed the PlannerInfo ("root") structure,
because in some cases we want it to substitute values for Param nodes.
(So "constant" is not so constant as all that ...) This mistake partially
disabled optimization of unnamed extended-Query statements in 8.3: in
particular the LIKE-to-indexscan optimization would never be applied if the
LIKE pattern was passed as a parameter, and constraint exclusion depending
on a parameter value didn't work either.
snapmgmt.c file for the former. The header files have also been reorganized
in three parts: the most basic snapshot definitions are now in a new file
snapshot.h, and the also new snapmgmt.h keeps the definitions for snapmgmt.c.
tqual.h has been reduced to the bare minimum.
This patch is just a first step towards managing live snapshots within a
transaction; there is no functionality change.
Per my proposal to pgsql-patches on 20080318191940.GB27458@alvh.no-ip.org and
subsequent discussion.
strings. This patch introduces four support functions cstring_to_text,
cstring_to_text_with_len, text_to_cstring, and text_to_cstring_buffer, and
two macros CStringGetTextDatum and TextDatumGetCString. A number of
existing macros that provided variants on these themes were removed.
Most of the places that need to make such conversions now require just one
function or macro call, in place of the multiple notational layers that used
to be needed. There are no longer any direct calls of textout or textin,
and we got most of the places that were using handmade conversions via
memcpy (there may be a few still lurking, though).
This commit doesn't make any serious effort to eliminate transient memory
leaks caused by detoasting toasted text objects before they reach
text_to_cstring. We changed PG_GETARG_TEXT_P to PG_GETARG_TEXT_PP in a few
places where it was easy, but much more could be done.
Brendan Jurd and Tom Lane
messages if the calling transaction aborts later on. Collapsing out line
pointer redirects is a done deal as soon as we complete the page update,
so syscache *must* be notified even if the VACUUM FULL as a whole doesn't
complete. To fix, add some functionality to inval.c to allow the pending
inval messages to be sent immediately while heap_page_prune is still
running. The implementation is a bit chintzy: it will only work in the
context of VACUUM FULL. But that's all we need now, and it can always be
extended later if needed. Per my trouble report of a week ago.
caches that we don't actually need to touch. This saves some trivial
number of cycles and avoids certain cases of deadlock when doing concurrent
VACUUM FULL on system catalogs. Per report from Gavin Roy.
Backpatch to 8.2. In earlier versions, CatalogCacheInitializeCache didn't
lock the relation so there's no deadlock risk (though that certainly had
plenty of risks of its own).
but no database changes have been made since the last CommandCounterIncrement.
This should result in a significant improvement in the number of "commands"
that can typically be performed within a transaction before hitting the 2^32
CommandId size limit. In particular this buys back (and more) the possible
adverse consequences of my previous patch to fix plan caching behavior.
The implementation requires tracking whether the current CommandCounter
value has been "used" to mark any tuples. CommandCounter values stored into
snapshots are presumed not to be used for this purpose. This requires some
small executor changes, since the executor used to conflate the curcid of
the snapshot it was using with the command ID to mark output tuples with.
Separating these concepts allows some small simplifications in executor APIs.
Something for the TODO list: look into having CommandCounterIncrement not do
AcceptInvalidationMessages. It seems fairly bogus to be doing it there,
but exactly where to do it instead isn't clear, and I'm disinclined to mess
with asynchronous behavior during late beta.
reloading of operator class information on each use of LookupOpclassInfo.
Had this been in place a year ago, it would have helped me find a bug
in the then-new 'operator family' code. Now that we have a build farm
member testing CLOBBER_CACHE_ALWAYS on a regular basis, it seems worth
expending a little bit of effort here.
it affects. The original coding neglected tablespace entirely (causing
the indexes to move to the database's default tablespace) and for an index
belonging to a UNIQUE or PRIMARY KEY constraint, it would actually try to
assign the parent table's reloptions to the index :-(. Per bug #3672 and
subsequent investigation.
8.0 and 8.1 did not have reloptions, but the tablespace bug is present.
a relation as a reason to invalidate a plan when the relation changes. This
handles scenarios such as dropping/recreating a sequence that is referenced by
nextval('seq') in a cached plan. Rather than teach plancache.c all about
digging through plan trees to find regclass Consts, we charge the planner's
setrefs.c with making a list of the relation OIDs on which each plan depends.
That way the list can be built cheaply during a plan tree traversal that has
to happen anyway. Per bug #3662 and subsequent discussion.
columns, and the new version can be stored on the same heap page, we no longer
generate extra index entries for the new version. Instead, index searches
follow the HOT-chain links to ensure they find the correct tuple version.
In addition, this patch introduces the ability to "prune" dead tuples on a
per-page basis, without having to do a complete VACUUM pass to recover space.
VACUUM is still needed to clean up dead index entries, however.
Pavan Deolasee, with help from a bunch of other people.
and in passing, fix some bogosities dating from the custom_variable_classes
patch. Fix guc-file.l to correctly check changes in custom_variable_classes
that are attempted concurrently with additions/removals of custom variables,
and don't allow the new setting to be applied in advance of checking it.
Clean up messy and undocumented situation for string variables with NULL
boot_val. Fix DefineCustomVariable functions to initialize boot_val
correctly. Prevent find_option from inserting bogus placeholders for custom
variables that are simply inquired about rather than being set.
init options of the template as top-level options in the syntax. This also
makes ALTER a bit easier to use, since options can be replaced individually.
I also made these statements verify that the tmplinit method will accept
the new settings before they get stored; in the original coding you didn't
find out about mistakes until the dictionary got invoked.
Under the hood, init methods now get options as a List of DefElem instead
of a raw text string --- that lets tsearch use existing options-pushing code
instead of duplicating functionality.
Oleg Bartunov and Teodor Sigaev, but I did a lot of editorializing,
so anything that's broken is probably my fault.
Documentation is nonexistent as yet, but let's land the patch so we can
get some portability testing done.
named pg_toast_temp_nnn, alongside the pg_temp_nnn schemas used for the temp
tables themselves. This allows low-level code such as the relcache to
recognize that these tables are indeed temporary, which enables various
optimizations such as not WAL-logging changes and using local rather than
shared buffers for access. Aside from obvious performance benefits, this
provides a solution to bug #3483, in which other backends unexpectedly held
open file references to temporary tables. The scheme preserves the property
that TOAST tables are not in any schema that's normally in the search path,
so they don't conflict with user table names.
initdb forced because of changes in system view definitions.