Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.
To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
A predicate lock doesn't conflict with any regular locks or with another
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
for other transactions.
Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded amount of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.
We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.
Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.
Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen
If the foreign table's rowtype is being used as the type of a column in
another table, we can't just up and change its data type. This was
already checked for composite types and ordinary tables, but we
previously failed to enforce it for foreign tables.
This can be used to build 64 bit Windows binaries, not only on 64 bit
Windows but on supported cross-compiling hosts including 32 bit Windows,
Cygwin, Darwin and Linux.
This reverts commit a06e41deeb of 2011-01-26.
Per discussion, this behavior is not wanted, as it would need to change if
we ever made composite types support DEFAULT.
The previous coding prevented ALTER TABLE .. ADD COLUMN from being used
with a non-NULL default in situations where the table's rowtype was being
used elsewhere. But this is a completely arbitrary restriction since
you could do the same operation in multiple steps (add the column, add
the default, update the table).
Inspired by a patch from Noah Misch, though I didn't use his code.
This feature allows a unique or pkey constraint to be created using an
already-existing unique index. While the constraint isn't very
functionally different from the bare index, it's nice to be able to do that
for documentation purposes. The main advantage over just issuing a plain
ALTER TABLE ADD UNIQUE/PRIMARY KEY is that the index can be created with
CREATE INDEX CONCURRENTLY, so that there is not a long interval where the
table is locked against updates.
On the way, refactor some of the code in DefineIndex() and index_create()
so that we don't have to pass through those functions in order to create
the index constraint's catalog entries. Also, in parse_utilcmd.c, pass
around the ParseState pointer in struct CreateStmtContext to save on
notation, and add error location pointers to some error reports that didn't
have one before.
Gurjeet Singh, reviewed by Steve Singer and Tom Lane
This is still pretty rough - among other things, the documentation
needs work, and the messages need a visit from the style police -
but this gets the basic framework in place.
KaiGai Kohei
As in commit fb4c5d2798 on 2011-01-21,
this avoids spurious debug messages and allows idempotent changes at
any time. Along the way, make assign_XactIsoLevel allow idempotent
changes even when not within a subtransaction, to be consistent with
the new coding of assign_transaction_read_only and because there's
no compelling reason to do otherwise.
Kevin Grittner, with some adjustments.
Failure to do so can lead to constraint violations. This was broken by
commit 1ddc2703a9 on 2010-02-07, so
back-patch to 9.0.
Noah Misch. Regression test by me.
Instead, run them in the encoding that the locale selects, which is
more representative of real use.
Also document how locale and encoding for regression test runs can be
selected.
Per my recent proposal(s). Null key datums can now be returned by
extractValue and extractQuery functions, and will be stored in the index.
Also, placeholder entries are made for indexable items that are NULL or
contain no keys according to extractValue. This means that the index is
now always complete, having at least one entry for every indexed heap TID,
and so we can get rid of the prohibition on full-index scans. A full-index
scan is implemented much the same way as partial-match scans were already:
we build a bitmap representing all the TIDs found in the index, and then
drive the results off that.
Also, introduce a concept of a "search mode" that can be requested by
extractQuery when the operator requires matching to empty items (this is
just as cheap as matching to a single key) or requires a full index scan
(which is not so cheap, but it sure beats failing or giving wrong answers).
The behavior remains backward compatible for opclasses that don't return
any null keys or request a non-default search mode.
Using these features, we can now make the GIN index opclass for anyarray
behave in a way that matches the actual anyarray operators for &&, <@, @>,
and = ... which it failed to do before in assorted corner cases.
This commit fixes the core GIN code and ginarrayprocs.c, updates the
documentation, and adds some simple regression test cases for the new
behaviors using the array operators. The tsearch and contrib GIN opclass
support functions still need to be looked over and probably fixed.
Another thing I intend to fix separately is that this is pretty inefficient
for cases where more than one scan condition needs a full-index search:
we'll run duplicate GinScanEntrys, each one of which builds a large bitmap.
There is some existing logic to merge duplicate GinScanEntrys but it needs
refactoring to make it work for entries belonging to different scan keys.
Note that most of gin.h has been split out into a new file gin_private.h,
so that gin.h doesn't export anything that's not supposed to be used by GIN
opclasses or the rest of the backend. I did quite a bit of other code
beautification work as well, mostly fixing comments and choosing more
appropriate names for things.
Add new function pg_sequence_parameters that returns a sequence's start,
minimum, maximum, increment, and cycle values, and use that in the view.
(bug #5662; design suggestion by Tom Lane)
Also slightly adjust the view's column order and permissions after review of
SQL standard.
Foreign tables are a core component of SQL/MED. This commit does
not provide a working SQL/MED infrastructure, because foreign tables
cannot yet be queried. Support for foreign table scans will need to
be added in a future patch. However, this patch creates the necessary
system catalog structure, syntax support, and support for ancillary
operations such as COMMENT and SECURITY LABEL.
Shigeru Hanada, heavily revised by Robert Haas
This privilege is required to do Streaming Replication, instead of
superuser, making it possible to set up a SR slave that doesn't
have write permissions on the master.
Superuser privileges do NOT override this check, so in order to
use the default superuser account for replication it must be
explicitly granted the REPLICATION permissions. This is backwards
incompatible change, in the interest of higher default security.
This commit represents a rather heavily editorialized version of
Teodor's builtin_knngist_itself-0.8.2 and builtin_knngist_proc-0.8.1
patches. I redid the opclass API to add a separate Distance method
instead of turning the Consistent method into an illogical mess,
fixed some bit-rot in the rbtree interfaces, and generally worked over
the code style and comments.
There's still no non-code documentation to speak of, but I'll work on
that separately. Some contrib-module changes are also yet to come
(right now, point <-> point is the only KNN-ified operator).
Teodor Sigaev and Tom Lane
There were corner cases in which the planner would attempt to inline such
a function, which would result in a failure at runtime due to loss of
information about exactly what the result record type is. Fix by disabling
inlining when the function's recorded result type is RECORD. There might
be some sub-cases where inlining could still be allowed, but this is a
simple and backpatchable fix, so leave refinements for another day.
Per bug #5777 from Nate Carson.
Back-patch to all supported branches. 8.1 happens to avoid a core-dump
here, but it still does the wrong thing.
This adds support for changing the schema of a conversion, operator,
operator class, operator family, text search configuration, text search
dictionary, text search parser, or text search template.
Dimitri Fontaine, with assorted corrections and other kibitzing.
This commit adds columns amoppurpose and amopsortfamily to pg_amop, and
column amcanorderbyop to pg_am. For the moment all the entries in
amcanorderbyop are "false", since the underlying support isn't there yet.
Also, extend the CREATE OPERATOR CLASS/ALTER OPERATOR FAMILY commands with
[ FOR SEARCH | FOR ORDER BY sort_operator_family ] clauses to allow the new
columns of pg_amop to be populated, and create pg_dump support for dumping
that information.
I also added some documentation, although it's perhaps a bit premature
given that the feature doesn't do anything useful yet.
Teodor Sigaev, Robert Haas, Tom Lane
Currently, three conversion format specifiers are supported: %s for a
string, %L for an SQL literal, and %I for an SQL identifier. The latter
two are deliberately designed not to overlap with what sprintf() already
supports, in case we want to add more of sprintf()'s functionality here
later.
Patch by Pavel Stehule, heavily revised by me. Reviewed by Jeff Janes
and, in earlier versions, by Itagaki Takahiro and Tom Lane.
Avoid depending on LL notation, which is likely to not work in pre-C99
compilers; don't pointlessly use INT32_MIN/INT64_MIN in code that has
the numerical value hard-wired into it anyway; remove some gratuitous
style inconsistencies between pg_ltoa and pg_lltoa; fix int2 test case
so it actually tests int2.
This eliminates the need for inefficient implementions of this
functionality in both contrib/dblink and contrib/tablefunc, so remove
them. The upcoming patch implementing an in-core format() function
will also require this functionality.
In passing, add some regression tests.
Use INT_MIN rather than INT32_MIN as we do elsewhere in the code, and
try to work around nonexistence of INT64_MIN if necessary. Adjust the
new regression tests to something hopefully saner, per observation by
Tom Lane.
A hand-coded implementation turns out to be much faster than calling
printf(). In passing, add a few more regresion tests.
Andres Freund, with assorted, mostly cosmetic changes.
In the previous coding, we simply issued ALTER SEQUENCE RESTART commands,
which do not roll back on error. This meant that an error between
truncating and committing left the sequences out of sync with the table
contents, with potentially bad consequences as were noted in a Warning on
the TRUNCATE man page.
To fix, create a new storage file (relfilenode) for a sequence that is to
be reset due to RESTART IDENTITY. If the transaction aborts, we'll
automatically revert to the old storage file. This acts just like a
rewriting ALTER TABLE operation. A penalty is that we have to take
exclusive lock on the sequence, but since we've already got exclusive lock
on its owning table, that seems unlikely to be much of a problem.
The interaction of this with usual nontransactional behaviors of sequence
operations is a bit weird, but it's hard to see what would be completely
consistent. Our choice is to discard cached-but-unissued sequence values
both when the RESTART is executed, and at rollback if any; but to not touch
the currval() state either time.
In passing, move the sequence reset operations to happen before not after
any AFTER TRUNCATE triggers are fired. The previous ordering was not
logically sensible, but was forced by the need to minimize inconsistency
if the triggers caused an error. Transactional rollback is a much better
solution to that.
Patch by Steve Singer, rather heavily adjusted by me.
This new field counts the number of times that a backend which writes a
buffer out to the OS must also fsync() it. This happens when the
bgwriter fsync request queue is full, and is generally detrimental to
performance, so it's good to know when it's happening. Along the way,
log a new message at level DEBUG1 whenever we fail to hand off an fsync,
so that the problem can also be seen in examination of log files
(if the logging level is cranked up high enough).
Greg Smith, with minor tweaks by me.
Replace for loops in makefiles with proper dependencies. Parallel
make can now span across directories. Also, make -k and make -q work
properly.
GNU make 3.80 or newer is now required.
PG 8.4 added a built-in feature for casting pretty much any data type to
string types (text, varchar, etc). We allowed this to work in any of the
historically-allowed syntaxes: CAST(x AS text), x::text, text(x), or
x.text. However, multiple complaints have shown that it's too easy to
invoke such casts unintentionally in the latter two styles, particularly
field selection. To cure the problem with the narrowest possible change
of behavior, disallow use of I/O conversion casts from composite types to
string types via functional/attribute syntax. The new functionality is
still available via cast syntax.
In passing, document the equivalence of functional and attribute syntax
in a more visible place.
Rather than considering this result as meaning "unknown", report LONG_MAX.
This won't change what superusers can set max_stack_depth to, but it will
cause InitializeGUCOptions() to set the built-in default to 2MB not 100kB.
The latter seems like a fairly unreasonable interpretation of "infinity".
Per my investigation of odd buildfarm results as well as an old complaint
from Heikki.
Since this should persuade all the buildfarm animals to use a reasonable
stack depth setting during "make check", revert previous patch that dumbed
down a recursive regression test to only 5 levels.
Per my recent proposal, get rid of all the direct inspection of indexes
and manual generation of paths in planagg.c. Instead, set up
EquivalenceClasses for the aggregate argument expressions, and let the
regular path generation logic deal with creating paths that can satisfy
those sort orders. This makes planagg.c a bit more visible to the rest
of the planner than it was originally, but the approach is basically a lot
cleaner than before. A major advantage of doing it this way is that we get
MIN/MAX optimization on inheritance trees (using MergeAppend of indexscans)
practically for free, whereas in the old way we'd have had to add a whole
lot more duplicative logic.
One small disadvantage of this approach is that MIN/MAX aggregates can no
longer exploit partial indexes having an "x IS NOT NULL" predicate, unless
that restriction or something that implies it is specified in the query.
The previous implementation was able to use the added "x IS NOT NULL"
condition as an extra predicate proof condition, but in this version we
rely entirely on indexes that are considered usable by the main planning
process. That seems a fair tradeoff for the simplicity and functionality
gained.
Some buildfarm members fail the test with the original depth of 10 levels,
apparently because they are running at the minimum max_stack_depth setting
of 100kB and using ~ 10k per recursion level. While it might be
interesting to try to figure out why they're eating so much stack, it isn't
likely that any fix for that would be back-patchable. So just change the
test to recurse only 5 levels. The extra levels don't prove anything
correctness-wise anyway.
In general, expression execution state trees aren't re-entrantly usable,
since functions can store private state information in them.
For efficiency reasons, plpgsql tries to cache and reuse state trees for
"simple" expressions. It can get away with that most of the time, but it
can fail if the state tree is dirty from a previous failed execution (as
in an example from Alvaro) or is being used recursively (as noted by me).
Fix by tracking whether a state tree is in use, and falling back to the
"non-simple" code path if so. This results in a pretty considerable speed
hit when the non-simple path is taken, but the available alternatives seem
even more unpleasant because they add overhead in the simple path. Per
idea from Heikki.
Back-patch to all supported branches.
After much expenditure of effort, we've got this to the point where the
performance penalty is pretty minimal in typical cases.
Andrew Dunstan, reviewed by Brendan Jurd, Dean Rasheed, and Tom Lane
This patch eliminates various bizarre behaviors caused by sloppy thinking
about the difference between a domain type and its underlying array type.
In particular, the operation of updating one element of such an array
has to be considered as yielding a value of the underlying array type,
*not* a value of the domain, because there's no assurance that the
domain's CHECK constraints are still satisfied. If we're intending to
store the result back into a domain column, we have to re-cast to the
domain type so that constraints are re-checked.
For similar reasons, such a domain can't be blindly matched to an ANYARRAY
polymorphic parameter, because the polymorphic function is likely to apply
array-ish operations that could invalidate the domain constraints. For the
moment, we just forbid such matching. We might later wish to insert an
automatic downcast to the underlying array type, but such a change should
also change matching of domains to ANYELEMENT for consistency.
To ensure that all such logic is rechecked, this patch removes the original
hack of setting a domain's pg_type.typelem field to match its base type;
the typelem will always be zero instead. In those places where it's really
okay to look through the domain type with no other logic changes, use the
newly added get_base_element_type function in place of get_element_type.
catversion bumped due to change in pg_type contents.
Per bug #5717 from Richard Huxton and subsequent discussion.
A couple of places in the planner need to generate whole-row Vars, and were
cutting corners by setting vartype = RECORDOID in the Vars, even in cases
where there's an identifiable named composite type for the RTE being
referenced. While we mostly got away with this, it failed when there was
also a parser-generated whole-row reference to the same RTE, because the
two Vars weren't equal() due to the difference in vartype. Fix by
providing a subroutine the planner can call to generate whole-row Vars
the same way the parser does.
Per bug #5716 from Andrew Tipton. Back-patch to 9.0 where one of the bogus
calls was introduced (the other one is new in HEAD).