Commit graph

5827 commits

Author SHA1 Message Date
Heikki Linnakangas
20ba5ca64c Move WAL continuation record information to WAL page header.
The continuation record only contained one field, xl_rem_len, so it makes
things simpler to just include it in the WAL page header. This wastes four
bytes on pages that don't begin with a continuation from previos page, plus
four bytes on every page, because of padding.

The motivation of this is to make it easier to calculate how much space a
WAL record needs. Before this patch, it depended on how many page boundaries
the record crosses. The motivation of that, in turn, is to separate the
allocation of space in the WAL from the copying of the record data to the
allocated space. Keeping the calculation of space required simple helps to
keep the critical section of allocating the space from WAL short. But that's
not included in this patch yet.

Bump WAL version number again, as this is an incompatible change.
2012-06-24 18:35:30 +03:00
Heikki Linnakangas
dfda6ebaec Don't waste the last segment of each 4GB logical log file.
The comments claimed that wasting the last segment made it easier to do
calculations with XLogRecPtrs, because you don't have problems representing
last-byte-position-plus-1 that way. In my experience, however, it only made
things more complicated, because the there was two ways to represent the
boundary at the beginning of a logical log file: logid = n+1 and xrecoff = 0,
or as xlogid = n and xrecoff = 4GB - XLOG_SEG_SIZE. Some functions were
picky about which representation was used.

Also, use a 64-bit segment number instead of the log/seg combination, to
point to a certain WAL segment. We assume that all platforms have a working
64-bit integer type nowadays.

This is an incompatible change in WAL format, so bumping WAL version number.
2012-06-24 18:35:29 +03:00
Tom Lane
d14241c2cf Fix memory leak in ARRAY(SELECT ...) subqueries.
Repeated execution of an uncorrelated ARRAY_SUBLINK sub-select (which
I think can only happen if the sub-select is embedded in a larger,
correlated subquery) would leak memory for the duration of the query,
due to not reclaiming the array generated in the previous execution.
Per bug #6698 from Armando Miraglia.  Diagnosis and fix idea by Heikki,
patch itself by me.

This has been like this all along, so back-patch to all supported versions.
2012-06-21 17:27:19 -04:00
Heikki Linnakangas
eeb6f37d89 Add a small cache of locks owned by a resource owner in ResourceOwner.
This speeds up reassigning locks to the parent owner, when the transaction
holds a lot of locks, but only a few of them belong to the current resource
owner. This is particularly helps pg_dump when dumping a large number of
objects.

The cache can hold up to 15 locks in each resource owner. After that, the
cache is marked as overflowed, and we fall back to the old method of
scanning the whole local lock table. The tradeoff here is that the cache has
to be scanned whenever a lock is released, so if the cache is too large,
lock release becomes more expensive. 15 seems enough to cover pg_dump, and
doesn't have much impact on lock release.

Jeff Janes, reviewed by Amit Kapila and Heikki Linnakangas.
2012-06-21 15:30:26 +03:00
Tom Lane
cfa0f4255b Improve tests for whether we can skip queueing RI enforcement triggers.
During an update of a PK row, we can skip firing the RI trigger if any old
key value is NULL, because then the row could not have had any matching
rows in the FK table.  Conversely, during an update of an FK row, the
outcome is determined if any new key value is NULL.  In either case it
becomes unnecessary to compare individual key values.

This patch was inspired by discussion of Vik Reykja's patch to use IS NOT
DISTINCT semantics for the key comparisons.  In the event there is no need
for that and so this patch looks nothing like his, but he should still get
credit for having re-opened consideration of the trigger skip logic.
2012-06-19 20:07:33 -04:00
Tom Lane
f5297bdfe4 Refer to the default foreign key match style as MATCH SIMPLE internally.
Previously we followed the SQL92 wording, "MATCH <unspecified>", but since
SQL99 there's been a less awkward way to refer to the default style.

In addition to the code changes, pg_constraint.confmatchtype now stores
this match style as 's' (SIMPLE) rather than 'u' (UNSPECIFIED).  This
doesn't affect pg_dump or psql because they use pg_get_constraintdef()
to reconstruct foreign key definitions.  But other client-side code might
examine that column directly, so this change will have to be marked as
an incompatibility in the 9.3 release notes.
2012-06-17 20:16:44 -04:00
Tom Lane
9e18eacbdf Fix stats collector to recover nicely when system clock goes backwards.
Formerly, if the system clock went backwards, the stats collector would
fail to update the stats file any more until the clock reading again
exceeds whatever timestamp was last written into the stats file.  Such
glitches in the clock's behavior are not terribly unlikely on machines
not using NTP.  Such a scenario has been observed to cause regression test
failures in the buildfarm, and it could have bad effects on the behavior
of autovacuum, so it seems prudent to install some defenses.

We could directly detect the clock going backwards by adding
GetCurrentTimestamp calls in the stats collector's main loop, but that
would hurt performance on platforms where GetCurrentTimestamp is expensive.
To minimize the performance hit in normal cases, adopt a more complicated
scheme wherein backends check for clock skew when reading the stats file,
and if they see it, signal the stats collector by sending an extra stats
inquiry message.  The stats collector does an extra GetCurrentTimestamp
only when it receives an inquiry with an apparently out-of-order
timestamp.

To avoid unnecessary GetCurrentTimestamp calls, expand the inquiry messages
to carry the backend's current clock reading as well as its stats cutoff
time.  The latter, being intentionally slightly in-the-past, would trigger
more clock rechecks than we need if it were used for this purpose.

We might want to backpatch this change at some point, but let's let it
shake out in the buildfarm for awhile first.
2012-06-17 17:11:49 -04:00
Peter Eisentraut
15b1918e7d Improve reporting of permission errors for array types
Because permissions are assigned to element types, not array types,
complaining about permission denied on an array type would be
misleading to users.  So adjust the reporting to refer to the element
type instead.

In order not to duplicate the required logic in two dozen places,
refactor the permission denied reporting for types a bit.

pointed out by Yeb Havinga during the review of the type privilege
feature
2012-06-15 22:55:03 +03:00
Robert Haas
68de499bda New SQL functons pg_backup_in_progress() and pg_backup_start_time()
Darold Gilles, reviewed by Gabriele Bartolini and others, rebased by
Marco Nenciarini.  Stylistic cleanup and OID fixes by me.
2012-06-14 13:25:43 -04:00
Robert Haas
6cd015bea3 Add new function log_newpage_buffer.
When I implemented the ginbuildempty() function as part of
implementing unlogged tables, I falsified the note in the header
comment for log_newpage.  Although we could fix that up by changing
the comment, it seems cleaner to add a new function which is
specifically intended to handle this case.  So do that.
2012-06-14 10:11:16 -04:00
Robert Haas
a475c60367 Remove misplaced sanity check from heap_create().
Even when allow_system_table_mods is not set, we allow creation of any
type of SQL object in pg_catalog, except for relations.  And you can
get relations into pg_catalog, too, by initially creating them in some
other schema and then moving them with ALTER .. SET SCHEMA.  So this
restriction, which prevents relations (only) from being created in
pg_catalog directly, is fairly pointless.  If we need a safety mechanism
for this, it should be placed further upstream, so that it affects all
SQL objects uniformly, and picks up both CREATE and SET SCHEMA.

For now, just rip it out, per discussion with Tom Lane.
2012-06-14 09:58:53 -04:00
Robert Haas
d2c86a1ccd Remove RELKIND_UNCATALOGED.
This may have been important at some point in the past, but it no
longer does anything useful.

Review by Tom Lane.
2012-06-14 09:47:30 -04:00
Tom Lane
bed88fceac Stamp HEAD as 9.3devel.
Let the hacking begin ...
2012-06-13 20:03:02 -04:00
Bruce Momjian
927d61eeff Run pgindent on 9.2 source tree in preparation for first 9.3
commit-fest.
2012-06-10 15:20:04 -04:00
Peter Eisentraut
8570114dc1 Make include files work without having to include other ones first 2012-06-10 12:46:14 +03:00
Tom Lane
ece01aae47 Scan the buffer pool just once, not once per fork, during relation drop.
This provides a speedup of about 4X when NBuffers is large enough.
There is also a useful reduction in sinval traffic, since we
only do CacheInvalidateSmgr() once not once per fork.

Simon Riggs, reviewed and somewhat revised by Tom Lane
2012-06-07 17:43:11 -04:00
Simon Riggs
055c352abb After any checkpoint, close all smgr files handles in bgwriter 2012-06-01 09:24:53 +01:00
Tom Lane
4bec93ac0f Stamp 9.2beta2. 2012-05-31 19:16:55 -04:00
Tom Lane
ad0009e7be Force PL and range-type support functions to be owned by a superuser.
We allow non-superusers to create procedural languages (with restrictions)
and range datatypes.  Previously, the automatically-created support
functions for these objects ended up owned by the creating user.  This
represents a rather considerable security hazard, because the owning user
might be able to alter a support function's definition in such a way as to
crash the server, inject trojan-horse SQL code, or even execute arbitrary
C code directly.  It appears that right now the only actually exploitable
problem is the infinite-recursion bug fixed in the previous patch for
CVE-2012-2655.  However, it's not hard to imagine that future additions of
more ALTER FUNCTION capability might unintentionally open up new hazards.
To forestall future problems, cause these support functions to be owned by
the bootstrap superuser, not the user creating the parent object.
2012-05-30 23:47:57 -04:00
Tom Lane
cd0ff9c0f4 Expand the allowed range of timezone offsets to +/-15:59:59 from Greenwich.
We used to only allow offsets less than +/-13 hours, then it was +/14,
then it was +/-15.  That's still not good enough though, as per today's bug
report from Patric Bechtel.  This time I actually looked through the Olson
timezone database to find the largest offsets used anywhere.  The winners
are Asia/Manila, at -15:56:00 until 1844, and America/Metlakatla, at
+15:13:42 until 1867.  So we'd better allow offsets less than +/-16 hours.

Given the history, we are way overdue to have some greppable #define
symbols controlling this, so make some ... and also remove an obsolete
comment that didn't get fixed the last time.

Back-patch to all supported branches.
2012-05-30 19:58:35 -04:00
Heikki Linnakangas
d1996ed5e8 Change the way parent pages are tracked during buffered GiST build.
We used to mimic the way a stack is constructed when descending the tree
during normal GiST inserts, but that was quite complicated during a buffered
build. It was also wrong: in GiST, the left-to-right relationships on
different levels might not match each other, so that when you know the
parent of a child page, you won't necessarily find the parent of the page to
the right of the child page by following the rightlinks at the parent level.
This sometimes led to "could not re-find parent" errors while building a
GiST index.

We now use a simple hash table to track the parent of every internal page.
Whenever a page is split, and downlinks are moved from one page to another,
we update the hash table accordingly. This is also better for performance
than the old method, as we never need to move right to re-find the parent
page, which could take a significant amount of time for buffers that were
created much earlier in the index build.
2012-05-30 12:05:57 +03:00
Heikki Linnakangas
1d27dcf578 Fix bug in gistRelocateBuildBuffersOnSplit().
When we create a temporary copy of the old node buffer, in stack, we mustn't
leak that into any of the long-lived data structures. Before this patch,
when we called gistPopItupFromNodeBuffer(), it got added to the array of
"loaded buffers". After gistRelocateBuildBuffersOnSplit() exits, the
pointer added to the loaded buffers array points to garbage. Often that goes
unnotied, because when we go through the array of loaded buffers to unload
them, buffers with a NULL pageBuffer are ignored, which can often happen by
accident even if the pointer points to garbage.

This patch fixes that by marking the temporary copy in stack explicitly as
temporary, and refrain from adding buffers marked as temporary to the array
of loaded buffers.

While we're at it, initialize nodeBuffer->pageBlocknum to InvalidBlockNumber
and improve comments a bit. This isn't strictly necessary, but makes
debugging easier.
2012-05-18 19:38:32 +03:00
Peter Eisentraut
be6d1c88a4 Change COLLATION keyword category
It was changed from unreserved to reserved as part of the COLLATION
FOR syntax, but it turns out that type_func_name_keyword is
sufficient.
2012-05-16 20:19:44 +03:00
Tom Lane
f667747b6d Put back AC_REQUIRE([AC_STRUCT_TM]).
The BSD-ish members of the buildfarm all seem to think removing this
was a bad idea.  It looks to me like it resulted in omitting the system
header inclusion necessary to detect the fields of struct tm correctly.
2012-05-14 23:06:48 -04:00
Peter Eisentraut
ff4628f37a Remove unused AC_DEFINE symbols
ENABLE_DTRACE            unused as of a7b7b07af3
HAVE_ERR_SET_MARK        unused as of 4ed4b6c54e
HAVE_FCVT                unused as of 4553e1d80f
HAVE_STRUCT_SOCKADDR_UN  unused as of b4cea00a1f
HAVE_SYSCONF             unused as of f83356c7f5
TM_IN_SYS_TIME           never used, obsolescent per Autoconf documentation
2012-05-14 22:51:21 +03:00
Heikki Linnakangas
9e4637bf89 Update comments that became out-of-date with the PGXACT struct.
When the "hot" members of PGPROC were split off to separate PGXACT structs,
many PGPROC fields referred to in comments were moved to PGXACT, but the
comments were neglected in the commit. Mostly this is just a search/replace
of PGPROC with PGXACT, but the way the dummy PGPROC entries are created for
prepared transactions changed more, making some of the comments totally
bogus.

Noah Misch
2012-05-14 10:28:55 +03:00
Peter Eisentraut
64f09ca386 Remove leftovers of BeOS port
These should have been removed when the BeOS port was removed in
44f9021223.
2012-05-14 04:50:39 +03:00
Robert Haas
1331cc6c1a Prevent loss of init fork when truncating an unlogged table.
Fixes bug #6635, reported by Akira Kurosawa.
2012-05-11 09:48:56 -04:00
Simon Riggs
b06679e012 Ensure age() returns a stable value rather than the latest value 2012-05-11 14:36:24 +01:00
Bruce Momjian
ee24de4001 Revert catalog bump; was post-beta1, and unnecessary. 2012-05-10 18:44:47 -04:00
Bruce Momjian
d2fe836cd2 Update comment for 'name' data type to say 63 "bytes".
Catalog version bump so everyone has the same comment for beta1.
2012-05-10 18:40:40 -04:00
Tom Lane
f70fa835e0 Stamp 9.2beta1. 2012-05-10 18:35:09 -04:00
Heikki Linnakangas
60a3dffb72 Fix outdated comment.
Multi-insert records observe XLOG_HEAP_INIT_PAGE flag too, as Andres Freund
pointed out.
2012-05-10 09:55:48 +03:00
Tom Lane
6308ba05a7 Improve control logic for bgwriter hibernation mode.
Commit 6d90eaaa89 added a hibernation mode
to the bgwriter to reduce the server's idle-power consumption.  However,
its interaction with the detailed behavior of BgBufferSync's feedback
control loop wasn't very well thought out.  That control loop depends
primarily on the rate of buffer allocation, not the rate of buffer
dirtying, so the hibernation mode has to be designed to operate only when
no new buffer allocations are happening.  Also, the check for whether the
system is effectively idle was not quite right and would fail to detect
a constant low level of activity, thus allowing the bgwriter to go into
hibernation mode in a way that would let the cycle time vary quite a bit,
possibly further confusing the feedback loop.  To fix, move the wakeup
support from MarkBufferDirty and SetBufferCommitInfoNeedsSave into
StrategyGetBuffer, and prevent the bgwriter from entering hibernation mode
unless no buffer allocations have happened recently.

In addition, fix the delaying logic to remove the problem of possibly not
responding to signals promptly, which was basically caused by trying to use
the process latch's is_set flag for multiple purposes.  I can't prove it
but I'm suspicious that that hack was responsible for the intermittent
"postmaster does not shut down" failures we've been seeing in the buildfarm
lately.  In any case it did nothing to improve the readability or
robustness of the code.

In passing, express the hibernation sleep time as a multiplier on
BgWriterDelay, not a constant.  I'm not sure whether there's any value in
exposing the longer sleep time as an independently configurable setting,
but we can at least make it act like this for little extra code.
2012-05-09 23:37:10 -04:00
Simon Riggs
8f28789bff Rename BgWriterShmem/Request to CheckpointerShmem/Request 2012-05-09 14:23:45 +01:00
Simon Riggs
bbd3ec9dce Rename BgWriterCommLock to CheckpointerCommLock 2012-05-09 14:11:48 +01:00
Tom Lane
acd4c7d58b Fix an issue in recent walwriter hibernation patch.
Users of asynchronous-commit mode expect there to be a guaranteed maximum
delay before an async commit's WAL records get flushed to disk.  The
original version of the walwriter hibernation patch broke that.  Add an
extra shared-memory flag to allow async commits to kick the walwriter out
of hibernation mode, without adding any noticeable overhead in cases where
no action is needed.
2012-05-08 23:06:40 -04:00
Tom Lane
5461564a9d Reduce idle power consumption of walwriter and checkpointer processes.
This patch modifies the walwriter process so that, when it has not found
anything useful to do for many consecutive wakeup cycles, it extends its
sleep time to reduce the server's idle power consumption.  It reverts to
normal as soon as it's done any successful flushes.  It's still true that
during any async commit, backends check for completed, unflushed pages of
WAL and signal the walwriter if there are any; so that in practice the
walwriter can get awakened and returned to normal operation sooner than the
sleep time might suggest.

Also, improve the checkpointer so that it uses a latch and a computed delay
time to not wake up at all except when it has something to do, replacing a
previous hardcoded 0.5 sec wakeup cycle.  This also is primarily useful for
reducing the server's power consumption when idle.

In passing, get rid of the dedicated latch for signaling the walwriter in
favor of using its procLatch, since that comports better with possible
generic signal handlers using that latch.  Also, fix a pre-existing bug
with failure to save/restore errno in walwriter's signal handlers.

Peter Geoghegan, somewhat simplified by Tom
2012-05-08 20:03:26 -04:00
Peter Eisentraut
3284e03d5d Remove strdup, strtol, strtoul from libpgport
These should not be needed anymore, at least after the recent port
removals.  So let's see whether we can do without them.
2012-05-07 23:10:28 +03:00
Tom Lane
71b9549d05 Overdue code review for transaction-level advisory locks patch.
Commit 62c7bd31c8 had assorted problems, most
visibly that it broke PREPARE TRANSACTION in the presence of session-level
advisory locks (which should be ignored by PREPARE), as per a recent
complaint from Stephen Rees.  More abstractly, the patch made the
LockMethodData.transactional flag not merely useless but outright
dangerous, because in point of fact that flag no longer tells you anything
at all about whether a lock is held transactionally.  This fix therefore
removes that flag altogether.  We now rely entirely on the convention
already in use in lock.c that transactional lock holds must be owned by
some ResourceOwner, while session holds are never so owned.  Setting the
locallock struct's owner link to NULL thus denotes a session hold, and
there is no redundant marker for that.

PREPARE TRANSACTION now works again when there are session-level advisory
locks, and it is also able to transfer transactional advisory locks to the
prepared transaction, but for implementation reasons it throws an error if
we hold both types of lock on a single lockable object.  Perhaps it will be
worth improving that someday.

Assorted other minor cleanup and documentation editing, as well.

Back-patch to 9.1, except that in the 9.1 branch I did not remove the
LockMethodData.transactional flag for fear of causing an ABI break for
any external code that might be examining those structs.
2012-05-04 17:44:31 -04:00
Bruce Momjian
ebcaa5fcde Remove BSD/OS (BSDi) port. There are no known users upgrading to
Postgres 9.2, and perhaps no existing users either.
2012-05-03 10:58:44 -04:00
Robert Haas
8e0c5195df Add missing parenthesis in comment. 2012-05-02 14:30:58 -04:00
Robert Haas
0038110421 Avoid repeated CLOG access from heap_hot_search_buffer.
At the time we check whether the tuple is dead to all running
transactions, we've already verified that it isn't visible to our
scan, setting hint bits if appropriate.  So there's no need to
recheck CLOG for the all-dead test we do just a moment later.
So, add HeapTupleIsSurelyDead() to test the appropriate condition
under the assumption that all relevant hit bits are already set.

Review by Tom Lane.
2012-05-02 12:40:07 -04:00
Robert Haas
e01e66f808 More duplicate word removal. 2012-05-02 09:28:16 -04:00
Heikki Linnakangas
f291ccd43e Remove duplicate words in comments.
Found these with grep -r "for for ".
2012-05-02 10:20:27 +03:00
Peter Eisentraut
f2f9439fbf Remove dead ports
Remove the following ports:

- dgux
- nextstep
- sunos4
- svr4
- ultrix4
- univel

These are obsolete and not worth rescuing.  In most cases, there is
circumstantial evidence that they wouldn't work anymore anyway.
2012-05-01 22:11:12 +03:00
Tom Lane
809e7e21af Converge all SQL-level statistics timing values to float8 milliseconds.
This patch adjusts the core statistics views to match the decision already
taken for pg_stat_statements, that values representing elapsed time should
be represented as float8 and measured in milliseconds.  By using float8,
we are no longer tied to a specific maximum precision of timing data.
(Internally, it's still microseconds, but we could now change that without
needing changes at the SQL level.)

The columns affected are
pg_stat_bgwriter.checkpoint_write_time
pg_stat_bgwriter.checkpoint_sync_time
pg_stat_database.blk_read_time
pg_stat_database.blk_write_time
pg_stat_user_functions.total_time
pg_stat_user_functions.self_time
pg_stat_xact_user_functions.total_time
pg_stat_xact_user_functions.self_time

The first four of these are new in 9.2, so there is no compatibility issue
from changing them.  The others require a release note comment that they
are now double precision (and can show a fractional part) rather than
bigint as before; also their underlying statistics functions now match
the column definitions, instead of returning bigint microseconds.
2012-04-30 14:03:33 -04:00
Peter Eisentraut
26471a51fc Mark ReThrowError() with attribute noreturn
All related functions were already so marked.
2012-04-30 20:23:41 +03:00
Tom Lane
1dd89eadcd Rename I/O timing statistics columns to blk_read_time and blk_write_time.
This seems more consistent with the pre-existing choices for names of
other statistics columns.  Rename assorted internal identifiers to match.
2012-04-29 18:13:33 -04:00
Tom Lane
309c64745e Rename track_iotiming GUC to track_io_timing.
This spelling seems significantly more readable to me.
2012-04-29 16:23:54 -04:00