Commit graph

7870 commits

Author SHA1 Message Date
Andres Freund
e8898e9169 Minor ON CONFLICT related comments and doc fixes.
Geoff Winkless, Stephen Frost, Peter Geoghegan and me.
2015-05-08 19:24:14 +02:00
Robert Haas
53bb309d2d Teach autovacuum about multixact member wraparound.
The logic introduced in commit b69bf30b9b
and repaired in commits 669c7d20e6 and
7be47c56af helps to ensure that we don't
overwrite old multixact member information while it is still needed,
but a user who creates many large multixacts can still exhaust the
member space (and thus start getting errors) while autovacuum stands
idly by.

To fix this, progressively ramp down the effective value (but not the
actual contents) of autovacuum_multixact_freeze_max_age as member space
utilization increases.  This makes autovacuum more aggressive and also
reduces the threshold for a manual VACUUM to perform a full-table scan.

This patch leaves unsolved the problem of ensuring that emergency
autovacuums are triggered even when autovacuum=off.  We'll need to fix
that via a separate patch.

Thomas Munro and Robert Haas
2015-05-08 12:53:00 -04:00
Andres Freund
168d5805e4 Add support for INSERT ... ON CONFLICT DO NOTHING/UPDATE.
The newly added ON CONFLICT clause allows to specify an alternative to
raising a unique or exclusion constraint violation error when inserting.
ON CONFLICT refers to constraints that can either be specified using a
inference clause (by specifying the columns of a unique constraint) or
by naming a unique or exclusion constraint.  DO NOTHING avoids the
constraint violation, without touching the pre-existing row.  DO UPDATE
SET ... [WHERE ...] updates the pre-existing tuple, and has access to
both the tuple proposed for insertion and the existing tuple; the
optional WHERE clause can be used to prevent an update from being
executed.  The UPDATE SET and WHERE clauses have access to the tuple
proposed for insertion using the "magic" EXCLUDED alias, and to the
pre-existing tuple using the table name or its alias.

This feature is often referred to as upsert.

This is implemented using a new infrastructure called "speculative
insertion". It is an optimistic variant of regular insertion that first
does a pre-check for existing tuples and then attempts an insert.  If a
violating tuple was inserted concurrently, the speculatively inserted
tuple is deleted and a new attempt is made.  If the pre-check finds a
matching tuple the alternative DO NOTHING or DO UPDATE action is taken.
If the insertion succeeds without detecting a conflict, the tuple is
deemed inserted.

To handle the possible ambiguity between the excluded alias and a table
named excluded, and for convenience with long relation names, INSERT
INTO now can alias its target table.

Bumps catversion as stored rules change.

Author: Peter Geoghegan, with significant contributions from Heikki
    Linnakangas and Andres Freund. Testing infrastructure by Jeff Janes.
Reviewed-By: Heikki Linnakangas, Andres Freund, Robert Haas, Simon Riggs,
    Dean Rasheed, Stephen Frost and many others.
2015-05-08 05:43:10 +02:00
Andres Freund
2c8f4836db Represent columns requiring insert and update privileges indentently.
Previously, relation range table entries used a single Bitmapset field
representing which columns required either UPDATE or INSERT privileges,
despite the fact that INSERT and UPDATE privileges are separately
cataloged, and may be independently held.  As statements so far required
either insert or update privileges but never both, that was
sufficient. The required permission could be inferred from the top level
statement run.

The upcoming INSERT ... ON CONFLICT UPDATE feature needs to
independently check for both privileges in one statement though, so that
is not sufficient anymore.

Bumps catversion as stored rules change.

Author: Peter Geoghegan
Reviewed-By: Andres Freund
2015-05-08 00:20:46 +02:00
Alvaro Herrera
db5f98ab4f Improve BRIN infra, minmax opclass and regression test
The minmax opclass was using the wrong support functions when
cross-datatypes queries were run.  Instead of trying to fix the
pg_amproc definitions (which apparently is not possible), use the
already correct pg_amop entries instead.  This requires jumping through
more hoops (read: extra syscache lookups) to obtain the underlying
functions to execute, but it is necessary for correctness.

Author: Emre Hasegeli, tweaked by Álvaro
Review: Andreas Karlsson

Also change BrinOpcInfo to record each stored type's typecache entry
instead of just the OID.  Turns out that the full type cache is
necessary in brin_deform_tuple: the original code used the indexed
type's byval and typlen properties to extract the stored tuple, which is
correct in Minmax; but in other implementations that want to store
something different, that's wrong.  The realization that this is a bug
comes from Emre also, but I did not use his patch.

I also adopted Emre's regression test code (with smallish changes),
which is more complete.
2015-05-07 13:02:22 -03:00
Robert Haas
1998261034 Avoid using a C++ keyword as a structure member name.
Per request from Peter Eisentraut.
2015-05-05 22:41:03 -04:00
Alvaro Herrera
3b6db1f445 Add geometry/range functions to support BRIN inclusion
This commit adds the following functions:
    box(point) -> box
    bound_box(box, box) -> box
    inet_same_family(inet, inet) -> bool
    inet_merge(inet, inet) -> cidr
    range_merge(anyrange, anyrange) -> anyrange

The first of these is also used to implement a new assignment cast from
point to box.

These functions are the first part of a base to implement an "inclusion"
operator class for BRIN, for multidimensional data types.

Author: Emre Hasegeli
Reviewed by: Andreas Karlsson
2015-05-05 15:22:24 -03:00
Tom Lane
2503982be4 Improve procost estimates for some text search functions.
The text search functions that involve parsing raw text into lexemes are
remarkably CPU-intensive, so estimating them at the same cost as most other
built-in functions seems like a mistake; moreover, doing so turns out to
discourage the optimizer from using functional indexes on these functions.
After some debate, we've agreed to raise procost from 1 to 100 for
to_tsvector(), plainto_tsvector(), to_tsquery(), ts_headline(),
ts_match_tt(), and ts_match_tq(), which are all the text search functions
that parse raw text.

Also increase procost for the 2-argument form of ts_rewrite()
(tsquery_rewrite_query); while this function doesn't do text parsing,
it does execute a user-supplied SQL query, so its previous procost of 1 is
clearly a drastic underestimate.  It seems reasonable to assign it the same
cost we assign to PL functions by default, so 100 is the number here too.

I did not bother bumping catversion for this change, since it does not
break catalog compatibility with the server executable nor result in
any regression test changes.

Per complaint from Andrew Gierth and subsequent discussion.
2015-05-04 15:38:57 -04:00
Robert Haas
2ce439f337 Recursively fsync() the data directory after a crash.
Otherwise, if there's another crash, some writes from after the first
crash might make it to disk while writes from before the crash fail
to make it to disk.  This could lead to data corruption.

Back-patch to all supported versions.

Abhijit Menon-Sen, reviewed by Andres Freund and slightly revised
by me.
2015-05-04 14:13:53 -04:00
Robert Haas
e7cb7ee145 Allow FDWs and custom scan providers to replace joins with scans.
Foreign data wrappers can use this capability for so-called "join
pushdown"; that is, instead of executing two separate foreign scans
and then joining the results locally, they can generate a path which
performs the join on the remote server and then is scanned locally.
This commit does not extend postgres_fdw to take advantage of this
capability; it just provides the infrastructure.

Custom scan providers can use this in a similar way.  Previously,
it was only possible for a custom scan provider to scan a single
relation.  Now, it can scan an entire join tree, provided of course
that it knows how to produce the same results that the join would
have produced if executed normally.

KaiGai Kohei, reviewed by Shigeru Hanada, Ashutosh Bapat, and me.
2015-05-01 08:50:35 -04:00
Andres Freund
2b22795b32 Copy editing of the replication origins patch.
Michael Paquier and myself.
2015-05-01 12:22:13 +02:00
Andres Freund
1db12da85b Fix unaligned memory access in xlog parsing due to replication origin patch.
ParseCommitRecord() accessed xl_xact_origin directly. But the chunks in
the commit record's data only have 4 byte alignment, whereas
xl_xact_origin's members require 8 byte alignment on some
platforms. Update comments to make not of that and copy the record to
stack local storage before reading.

With help from Stefan Kaltenbrunner in pinning down the buildfarm and
verifying the fix.
2015-05-01 11:36:14 +02:00
Robert Haas
924bcf4f16 Create an infrastructure for parallel computation in PostgreSQL.
This does four basic things.  First, it provides convenience routines
to coordinate the startup and shutdown of parallel workers.  Second,
it synchronizes various pieces of state (e.g. GUCs, combo CID
mappings, transaction snapshot) from the parallel group leader to the
worker processes.  Third, it prohibits various operations that would
result in unsafe changes to that state while parallelism is active.
Finally, it propagates events that would result in an ErrorResponse,
NoticeResponse, or NotifyResponse message being sent to the client
from the parallel workers back to the master, from which they can then
be sent on to the client.

Robert Haas, Amit Kapila, Noah Misch, Rushabh Lathia, Jeevan Chalke.
Suggestions and review from Andres Freund, Heikki Linnakangas, Noah
Misch, Simon Riggs, Euler Taveira, and Jim Nasby.
2015-04-30 15:02:14 -04:00
Andres Freund
e0f26fc765 Correct replication origin's use of UINT16_MAX to PG_UINT16_MAX.
We can't rely on UINT16_MAX being present, which is why we introduced
PG_UINT16_MAX...

Buildfarm animal bowerbird via Andrew Gierth.
2015-04-30 00:19:36 +02:00
Andres Freund
5aa2350426 Introduce replication progress tracking infrastructure.
When implementing a replication solution ontop of logical decoding, two
related problems exist:
* How to safely keep track of replication progress
* How to change replication behavior, based on the origin of a row;
  e.g. to avoid loops in bi-directional replication setups

The solution to these problems, as implemented here, consist out of
three parts:

1) 'replication origins', which identify nodes in a replication setup.
2) 'replication progress tracking', which remembers, for each
   replication origin, how far replay has progressed in a efficient and
   crash safe manner.
3) The ability to filter out changes performed on the behest of a
   replication origin during logical decoding; this allows complex
   replication topologies. E.g. by filtering all replayed changes out.

Most of this could also be implemented in "userspace", e.g. by inserting
additional rows contain origin information, but that ends up being much
less efficient and more complicated.  We don't want to require various
replication solutions to reimplement logic for this independently. The
infrastructure is intended to be generic enough to be reusable.

This infrastructure also replaces the 'nodeid' infrastructure of commit
timestamps. It is intended to provide all the former capabilities,
except that there's only 2^16 different origins; but now they integrate
with logical decoding. Additionally more functionality is accessible via
SQL.  Since the commit timestamp infrastructure has also been introduced
in 9.5 (commit 73c986add) changing the API is not a problem.

For now the number of origins for which the replication progress can be
tracked simultaneously is determined by the max_replication_slots
GUC. That GUC is not a perfect match to configure this, but there
doesn't seem to be sufficient reason to introduce a separate new one.

Bumps both catversion and wal page magic.

Author: Andres Freund, with contributions from Petr Jelinek and Craig Ringer
Reviewed-By: Heikki Linnakangas, Petr Jelinek, Robert Haas, Steve Singer
Discussion: 20150216002155.GI15326@awork2.anarazel.de,
    20140923182422.GA15776@alap3.anarazel.de,
    20131114172632.GE7522@alap2.anarazel.de
2015-04-29 19:30:53 +02:00
Stephen Frost
dcbf5948e1 Improve qual pushdown for RLS and SB views
The original security barrier view implementation, on which RLS is
built, prevented all non-leakproof functions from being pushed down to
below the view, even when the function was not receiving any data from
the view.  This optimization improves on that situation by, instead of
checking strictly for non-leakproof functions, it checks for Vars being
passed to non-leakproof functions and allows functions which do not
accept arguments or whose arguments are not from the current query level
(eg: constants can be particularly useful) to be pushed down.

As discussed, this does mean that a function which is pushed down might
gain some idea that there are rows meeting a certain criteria based on
the number of times the function is called, but this isn't a
particularly new issue and the documentation in rules.sgml already
addressed similar covert-channel risks.  That documentation is updated
to reflect that non-leakproof functions may be pushed down now, if
they meet the above-described criteria.

Author: Dean Rasheed, with a bit of rework to make things clearer,
along with comment and documentation updates from me.
2015-04-27 12:29:42 -04:00
Andres Freund
6aab1f45ac Fix various typos and grammar errors in comments.
Author: Dmitriy Olshevskiy
Discussion: 553D00A6.4090205@bk.ru
2015-04-26 18:42:31 +02:00
Peter Eisentraut
cac7658205 Add transforms feature
This provides a mechanism for specifying conversions between SQL data
types and procedural languages.  As examples, there are transforms
for hstore and ltree for PL/Perl and PL/Python.

reviews by Pavel Stěhule and Andres Freund
2015-04-26 10:33:14 -04:00
Stephen Frost
e89bd02f58 Perform RLS WITH CHECK before constraints, etc
The RLS capability is built on top of the WITH CHECK OPTION
system which was added for auto-updatable views, however, unlike
WCOs on views (which are mandated by the SQL spec to not fire until
after all other constraints and checks are done), it makes much more
sense for RLS checks to happen earlier than constraint and uniqueness
checks.

This patch reworks the structure which holds the WCOs a bit to be
explicitly either VIEW or RLS checks and the RLS-related checks are
done prior to the constraint and uniqueness checks.  This also allows
better error reporting as we are now reporting when a violation is due
to a WITH CHECK OPTION and when it's due to an RLS policy violation,
which was independently noted by Craig Ringer as being confusing.

The documentation is also updated to include a paragraph about when RLS
WITH CHECK handling is performed, as there have been a number of
questions regarding that and the documentation was previously silent on
the matter.

Author: Dean Rasheed, with some kabitzing and comment changes by me.
2015-04-24 20:34:26 -04:00
Heikki Linnakangas
62420ae7d6 Move functions related to index maintenance to separate source file.
There is enough code here to deserve a file of their own, not be buried
in the middle of execUtils.c.
2015-04-24 09:33:23 +03:00
Stephen Frost
0bf22e0c8b RLS fixes, new hooks, and new test module
In prepend_row_security_policies(), defaultDeny was always true, so if
there were any hook policies, the RLS policies on the table would just
get discarded.  Fixed to start off with defaultDeny as false and then
properly set later if we detect that only the default deny policy exists
for the internal policies.

The infinite recursion detection in fireRIRrules() didn't properly
manage the activeRIRs list in the case of WCOs, so it would incorrectly
report infinite recusion if the same relation with RLS appeared more
than once in the rtable, for example "UPDATE t ... FROM t ...".

Further, the RLS expansion code in fireRIRrules() was handling RLS in
the main loop through the rtable, which lead to RTEs being visited twice
if they contained sublink subqueries, which
prepend_row_security_policies() attempted to handle by exiting early if
the RTE already had securityQuals.  That doesn't work, however, since
if the query involved a security barrier view on top of a table with
RLS, the RTE would already have securityQuals (from the view) by the
time fireRIRrules() was invoked, and so the table's RLS policies would
be ignored.  This is fixed in fireRIRrules() by handling RLS in a
separate loop at the end, after dealing with any other sublink
subqueries, thus ensuring that each RTE is only visited once for RLS
expansion.

The inheritance planner code didn't correctly handle non-target
relations with RLS, which would get turned into subqueries during
planning. Thus an update of the form "UPDATE t1 ... FROM t2 ..." where
t1 has inheritance and t2 has RLS quals would fail.  Fix by making sure
to copy in and update the securityQuals when they exist for non-target
relations.

process_policies() was adding WCOs to non-target relations, which is
unnecessary, and could lead to a lot of wasted time in the rewriter and
the planner. Fix by only adding WCO policies when working on the result
relation.  Also in process_policies, we should be copying the USING
policies to the WITH CHECK policies on a per-policy basis, fix by moving
the copying up into the per-policy loop.

Lastly, as noted by Dean, we were simply adding policies returned by the
hook provided to the list of quals being AND'd, meaning that they would
actually restrict records returned and there was no option to have
internal policies and hook-based policies work together permissively (as
all internal policies currently work).  Instead, explicitly add support
for both permissive and restrictive policies by having a hook for each
and combining the results appropriately.  To ensure this is all done
correctly, add a new test module (test_rls_hooks) to test the various
combinations of internal, permissive, and restrictive hook policies.

Largely from Dean Rasheed (thanks!):

CAEZATCVmFUfUOwwhnBTcgi6AquyjQ0-1fyKd0T3xBWJvn+xsFA@mail.gmail.com

Author: Dean Rasheed, though I added the new hooks and test module.
2015-04-22 12:01:06 -04:00
Andres Freund
cef939c347 Rename pg_replication_slot's new active_in to active_pid.
In d811c037ce active_in was added but discussion since showed that
active_pid is preferred as a name.

Discussion: CAMsr+YFKgZca5_7_ouaMWxA5PneJC9LNViPzpDHusaPhU9pA7g@mail.gmail.com
2015-04-22 09:43:40 +02:00
Andres Freund
d811c037ce Add 'active_in' column to pg_replication_slots.
Right now it is visible whether a replication slot is active in any
session, but not in which.  Adding the active_in column, containing the
pid of the backend having acquired the slot, makes it much easier to
associate pg_replication_slots entries with the corresponding
pg_stat_replication/pg_stat_activity row.

This should have been done from the start, but I (Andres) dropped the
ball there somehow.

Author: Craig Ringer, revised by me Discussion:
CAMsr+YFKgZca5_7_ouaMWxA5PneJC9LNViPzpDHusaPhU9pA7g@mail.gmail.com
2015-04-21 11:51:06 +02:00
Bruce Momjian
f92fc4c95d pg_upgrade: binary_upgrade_create_empty_extension() is strict
Was broken by commit 30982be4e5.

Patch by Jeff Janes
2015-04-17 20:08:42 -04:00
Peter Eisentraut
30982be4e5 Integrate pg_upgrade_support module into backend
Previously, these functions were created in a schema "binary_upgrade",
which was deleted after pg_upgrade was finished.  Because we don't want
to keep that schema around permanently, move them to pg_catalog but
rename them with a binary_upgrade_... prefix.

The provided functions are only small wrappers around global variables
that were added specifically for pg_upgrade use, so keeping the module
separate does not create any modularity.

The functions still check that they are only called in binary upgrade
mode, so it is not possible to call these during normal operation.

Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
2015-04-14 19:26:37 -04:00
Heikki Linnakangas
b73e7a0716 Oops, fix misspelled #endif
I hope this fixes the Windows builfarm failures.
2015-04-14 22:00:52 +03:00
Heikki Linnakangas
3dc2d62d04 Use Intel SSE 4.2 CRC instructions where available.
Modern x86 and x86-64 processors with SSE 4.2 support have special
instructions, crc32b and crc32q, for calculating CRC-32C. They greatly
speed up CRC calculation.

Whether the instructions can be used or not depends on the compiler and the
target architecture. If generation of SSE 4.2 instructions is allowed for
the target (-msse4.2 flag on gcc and clang), use them. If they are not
allowed by default, but the compiler supports the -msse4.2 flag to enable
them, compile just the CRC-32C function with -msse4.2 flag, and check at
runtime whether the processor we're running on supports it. If it doesn't,
fall back to the slicing-by-8 algorithm. (With the common defaults on
current operating systems, the runtime-check variant is what you get in
practice.)

Abhijit Menon-Sen, heavily modified by me, reviewed by Andres Freund.
2015-04-14 17:05:03 +03:00
Heikki Linnakangas
4f700bcd20 Reorganize our CRC source files again.
Now that we use CRC-32C in WAL and the control file, the "traditional" and
"legacy" CRC-32 variants are not used in any frontend programs anymore.
Move the code for those back from src/common to src/backend/utils/hash.

Also move the slicing-by-8 implementation (back) to src/port. This is in
preparation for next patch that will add another implementation that uses
Intel SSE 4.2 instructions to calculate CRC-32C, where available.
2015-04-14 17:03:42 +03:00
Heikki Linnakangas
b2a5545bd6 Don't archive bogus recycled or preallocated files after timeline switch.
After a timeline switch, we would leave behind recycled WAL segments that
are in the future, but on the old timeline. After promotion, and after they
become old enough to be recycled again, we would notice that they don't have
a .ready or .done file, create a .ready file for them, and archive them.
That's bogus, because the files contain garbage, recycled from an older
timeline (or prealloced as zeros). We shouldn't archive such files.

This could happen when we're following a timeline switch during replay, or
when we switch to new timeline at end-of-recovery.

To fix, whenever we switch to a new timeline, scan the data directory for
WAL segments on the old timeline, but with a higher segment number, and
remove them. Those don't belong to our timeline history, and are most
likely bogus recycled or preallocated files. They could also be valid files
that we streamed from the primary ahead of time, but in any case, they're
not needed to recover to the new timeline.
2015-04-13 16:53:49 +03:00
Magnus Hagander
9029f4b374 Add system view pg_stat_ssl
This view shows information about all connections, such as if the
connection is using SSL, which cipher is used, and which client
certificate (if any) is used.

Reviews by Alex Shulgin, Heikki Linnakangas, Andres Freund & Michael Paquier
2015-04-12 19:07:46 +02:00
Alvaro Herrera
27846f02c1 Optimize locking a tuple already locked by another subxact
Locking and updating the same tuple repeatedly led to some strange
multixacts being created which had several subtransactions of the same
parent transaction holding locks of the same strength.  However,
once a subxact of the current transaction holds a lock of a given
strength, it's not necessary to acquire the same lock again.  This made
some coding patterns much slower than required.

The fix is twofold.  First we change HeapTupleSatisfiesUpdate to return
HeapTupleBeingUpdated for the case where the current transaction is
already a single-xid locker for the given tuple; it used to return
HeapTupleMayBeUpdated for that case.  The new logic is simpler, and the
change to pgrowlocks is a testament to that: previously we needed to
check for the single-xid locker separately in a very ugly way.  That
test is simpler now.

As fallout from the HTSU change, some of its callers need to be amended
so that tuple-locked-by-own-transaction is taken into account in the
BeingUpdated case rather than the MayBeUpdated case.  For many of them
there is no difference; but heap_delete() and heap_update now check
explicitely and do not grab tuple lock in that case.

The HTSU change also means that routine MultiXactHasRunningRemoteMembers
introduced in commit 11ac4c73cb is no longer necessary and can be
removed; the case that used to require it is now handled naturally as
result of the changes to heap_delete and heap_update.

The second part of the fix to the performance issue is to adjust
heap_lock_tuple to avoid the slowness:

1. Previously we checked for the case that our own transaction already
held a strong enough lock and returned MayBeUpdated, but only in the
multixact case.  Now we do it for the plain Xid case as well, which
saves having to LockTuple.

2. If the current transaction is the only locker of the tuple (but with
a lock not as strong as what we need; otherwise it would have been
caught in the check mentioned above), we can skip sleeping on the
multixact, and instead go straight to create an updated multixact with
the additional lock strength.

3. Most importantly, make sure that both the single-xid-locker case and
the multixact-locker case optimization are applied always.  We do this
by checking both in a single place, rather than them appearing in two
separate portions of the routine -- something that is made possible by
the HeapTupleSatisfiesUpdate API change.  Previously we would only check
for the single-xid case when HTSU returned MayBeUpdated, and only
checked for the multixact case when HTSU returned BeingUpdated.  This
was at odds with what HTSU actually returned in one case: if our own
transaction was locker in a multixact, it returned MayBeUpdated, so the
optimization never applied.  This is what led to the large multixacts in
the first place.

Per bug report #8470 by Oskari Saarenmaa.
2015-04-10 13:47:15 -03:00
Alvaro Herrera
e9a077cad3 pg_event_trigger_dropped_objects: add is_temp column
It now also reports temporary objects dropped that are local to the
backend.  Previously we weren't reporting any temp objects because it
was deemed unnecessary; but as it turns out, it is necessary if we want
to keep close track of DDL command execution inside one session.  Temp
objects are reported as living in schema pg_temp, which works because
such a schema-qualification always refers to the temp objects of the
current session.
2015-04-06 11:40:55 -03:00
Alvaro Herrera
4ff695b17d Add log_min_autovacuum_duration per-table option
This is useful to control autovacuum log volume, for situations where
monitoring only a set of tables is necessary.

Author: Michael Paquier
Reviewed by: A team led by Naoya Anzai (also including Akira Kurosawa,
Taiki Kondo, Huong Dangminh), Fujii Masao.
2015-04-03 11:55:50 -03:00
Fujii Masao
8c8a886268 Add palloc_extended for frontend and backend.
This commit also adds pg_malloc_extended for frontend. These interfaces
can be used to control at a lower level memory allocation using an interface
similar to MemoryContextAllocExtended. For example, the callers can specify
MCXT_ALLOC_NO_OOM if they want to suppress the "out of memory" error while
allocating the memory and handle a NULL return value.

Michael Paquier, reviewed by me.
2015-04-03 17:36:12 +09:00
Robert Haas
abd94bcac4 Use abbreviated keys for faster sorting of numeric datums.
Andrew Gierth, reviewed by Peter Geoghegan, with further tweaks by me.
2015-04-02 14:04:26 -04:00
Andres Freund
62e2a8dc2c Define integer limits independently from the system definitions.
In 83ff1618 we defined integer limits iff they're not provided by the
system. That turns out not to be the greatest idea because there's
different ways some datatypes can be represented. E.g. on OSX PG's 64bit
datatype will be a 'long int', but OSX unconditionally uses 'long
long'. That disparity then can lead to warnings, e.g. around printf
formats.

One way to fix that would be to back int64 using stdint.h's
int64_t. While a good idea it's not that easy to implement. We would
e.g. need to include stdint.h in our external headers, which we don't
today. Also computing the correct int64 printf formats in that case is
nontrivial.

Instead simply prefix the integer limits with PG_ and define them
unconditionally. I've adjusted all the references to them in code, but
not the ones in comments; the latter seems unnecessary to me.

Discussion: 20150331141423.GK4878@alap3.anarazel.de
2015-04-02 17:43:35 +02:00
Robert Haas
4cd639baf4 Revert "psql: fix \connect with URIs and conninfo strings"
This reverts commit fcef161729, about
which both the buildfarm and my local machine are very unhappy.
2015-04-02 10:10:22 -04:00
Alvaro Herrera
fcef161729 psql: fix \connect with URIs and conninfo strings
psql was already accepting conninfo strings as the first parameter in
\connect, but the way it worked wasn't sane; some of the other
parameters would get the previous connection's values, causing it to
connect to a completely unexpected server or, more likely, not finding
any server at all because of completely wrong combinations of
parameters.

Fix by explicitely checking for a conninfo-looking parameter in the
dbname position; if one is found, use its complete specification rather
than mix with the other arguments.  Also, change tab-completion to not
try to complete conninfo/URI-looking "dbnames" and document that
conninfos are accepted as first argument.

There was a weak consensus to backpatch this, because while the behavior
of using the dbname as a conninfo is nowhere documented for \connect, it
is reasonable to expect that it works because it does work in many other
contexts.  Therefore this is backpatched all the way back to 9.0.

To implement this, routines previously private to libpq have been
duplicated so that psql can decide what looks like a conninfo/URI
string.  In back branches, just duplicate the same code all the way back
to 9.2, where URIs where introduced; 9.0 and 9.1 have a simpler version.
In master, the routines are moved to src/common and renamed.

Author: David Fetter, Andrew Dunstan.  Some editorialization by me
(probably earning a Gierth's "Sloppy" badge in the process.)
Reviewers: Andrew Gierth, Erik Rijkers, Pavel Stěhule, Stephen Frost,
Robert Haas, Andrew Dunstan.
2015-04-01 20:00:07 -03:00
Heikki Linnakangas
f770870d9e Move inet/cidr GiST opclass functions to correct place in header file.
They were accidentally placed under the GIN heading.

Andreas Karlsson
2015-04-01 19:20:45 +03:00
Andrew Dunstan
fa1e5afa8a Run pg_upgrade and pg_resetxlog with restricted token on Windows
As with initdb these programs need to run with a restricted token, and
if they don't pg_upgrade will fail when run as a user with Adminstrator
privileges.

Backpatch to all live branches. On the development branch the code is
reorganized so that the restricted token code is now in a single
location. On the stable bramches a less invasive change is made by
simply copying the relevant code to pg_upgrade.c and pg_resetxlog.c.

Patches and bug report from Muhammad Asif Naeem, reviewed by Michael
Paquier, slightly edited by me.
2015-03-30 17:07:52 -04:00
Alvaro Herrera
97690ea6e8 Change array_offset to return subscripts, not offsets
... and rename it and its sibling array_offsets to array_position and
array_positions, to account for the changed behavior.

Having the functions return subscripts better matches existing practice,
and is better suited to using the result value as a subscript into the
array directly.  For one-based arrays, the new definition is identical
to what was originally committed.

(We use the term "subscript" in the documentation, which is what we use
whenever we talk about arrays; but the functions themselves are named
using the word "position" to match the standard-defined POSITION()
functions.)

Author: Pavel Stěhule
Behavioral problem noted by Dean Rasheed.
2015-03-30 16:13:21 -03:00
Heikki Linnakangas
0633a60f4d Add index-only scan support to range type GiST opclass.
Andreas Karlsson
2015-03-30 13:22:38 +03:00
Heikki Linnakangas
3a20b0e7b6 Add index-only scan support to inet GiST opclass.
Andreas Karlsson
2015-03-28 15:11:53 +02:00
Heikki Linnakangas
55b59eda13 Fix GiST index-only scans for opclasses with different storage type.
We cannot use the index's tuple descriptor directly to describe the index
tuples returned in an index-only scan. That's because the index might use
a different datatype for the values stored on disk than the type originally
indexed. As long as they were both pass-by-ref, it worked, but will not work
for pass-by-value types of different sizes. I noticed this as a crash when I
started hacking a patch to add fetch methods to btree_gist.
2015-03-26 23:07:52 +02:00
Tom Lane
785941cdc3 Tweak __attribute__-wrapping macros for better pgindent results.
This improves on commit bbfd7edae5 by
making two simple changes:

* pg_attribute_noreturn now takes parentheses, ie pg_attribute_noreturn().
Likewise pg_attribute_unused(), pg_attribute_packed().  This reduces
pgindent's tendency to misformat declarations involving them.

* attributes are now always attached to function declarations, not
definitions.  Previously some places were taking creative shortcuts,
which were not merely candidates for bad misformatting by pgindent
but often were outright wrong anyway.  (It does little good to put a
noreturn annotation where callers can't see it.)  In any case, if
we would like to believe that these macros can be used with non-gcc
compilers, we should avoid gratuitous variance in usage patterns.

I also went through and manually improved the formatting of a lot of
declarations, and got rid of excessively repetitive (and now obsolete
anyway) comments informing the reader what pg_attribute_printf is for.
2015-03-26 14:03:25 -04:00
Heikki Linnakangas
d04c8ed904 Add support for index-only scans in GiST.
This adds a new GiST opclass method, 'fetch', which is used to reconstruct
the original Datum from the value stored in the index. Also, the 'canreturn'
index AM interface function gains a new 'attno' argument. That makes it
possible to use index-only scans on a multi-column index where some of the
opclasses support index-only scans but some do not.

This patch adds support in the box and point opclasses. Other opclasses
can added later as follow-on patches (btree_gist would be particularly
interesting).

Anastasia Lubennikova, with additional fixes and modifications by me.
2015-03-26 19:12:00 +02:00
Heikki Linnakangas
8fa393a6d7 Minor cleanup of GiST code, for readability.
Remove the gistcentryinit function, inlining the relevant part of it into
the only caller.
2015-03-26 19:11:54 +02:00
Tatsuo Ishii
656ea810e5 Make SyncRepWakeQueue to a static function
It is only used in src/backend/replication/syncrep.c.

Back-patch to all supported branches except 9.1 which declares the
function as static.
2015-03-26 10:34:08 +09:00
Andres Freund
83ff1618bc Centralize definition of integer limits.
Several submitted and even committed patches have run into the problem
that C89, our baseline, does not provide minimum/maximum values for
various integer datatypes. C99's stdint.h does, but we can't rely on
it.

Several parts of the code defined limits locally, so instead centralize
the definitions to c.h.

This patch also changes the more obvious usages of literal limit values;
there's more places that could be changed, but it's less clear whether
it's beneficial to change those.

Author: Andrew Gierth
Discussion: 87619tc5wc.fsf@news-spur.riddles.org.uk
2015-03-25 22:39:42 +01:00
Alvaro Herrera
bdc3d7fa23 Return ObjectAddress in many ALTER TABLE sub-routines
Since commit a2e35b53c3, most CREATE and ALTER commands return the
ObjectAddress of the affected object.  This is useful for event triggers
to try to figure out exactly what happened.  This patch extends this
idea a bit further to cover ALTER TABLE as well: an auxiliary
ObjectAddress is returned for each of several subcommands of ALTER
TABLE.  This makes it possible to decode with precision what happened
during execution of any ALTER TABLE command; for instance, which
constraint was added by ALTER TABLE ADD CONSTRAINT, or which parent got
dropped from the parents list by ALTER TABLE NO INHERIT.

As with the previous patch, there is no immediate user-visible change
here.

This is all really just continuing what c504513f83 started.

Reviewed by Stephen Frost.
2015-03-25 17:17:56 -03:00
Kevin Grittner
2ed5b87f96 Reduce pinning and buffer content locking for btree scans.
Even though the main benefit of the Lehman and Yao algorithm for
btrees is that no locks need be held between page reads in an
index search, we were holding a buffer pin on each leaf page after
it was read until we were ready to read the next one.  The reason
was so that we could treat this as a weak lock to create an
"interlock" with vacuum's deletion of heap line pointers, even
though our README file pointed out that this was not necessary for
a scan using an MVCC snapshot.

The main goal of this patch is to reduce the blocking of vacuum
processes by in-progress btree index scans (including a cursor
which is idle), but the code rearrangement also allows for one
less buffer content lock to be taken when a forward scan steps from
one page to the next, which results in a small but consistent
performance improvement in many workloads.

This patch leaves behavior unchanged for some cases, which can be
addressed separately so that each case can be evaluated on its own
merits.  These unchanged cases are when a scan uses a non-MVCC
snapshot, an index-only scan, and a scan of a btree index for which
modifications are not WAL-logged.  If later patches allow  all of
these cases to drop the buffer pin after reading a leaf page, then
the btree vacuum process can be simplified; it will no longer need
the "super-exclusive" lock to delete tuples from a page.

Reviewed by Heikki Linnakangas and Kyotaro Horiguchi
2015-03-25 14:24:43 -05:00
Alvaro Herrera
8217fb1441 Add OID output argument to DefineTSConfiguration
... which is set to the OID of a copied text search config, whenever the
COPY clause is used.

This is in the spirit of commit a2e35b53c3.
2015-03-25 15:57:08 -03:00
Tom Lane
cb1ca4d800 Allow foreign tables to participate in inheritance.
Foreign tables can now be inheritance children, or parents.  Much of the
system was already ready for this, but we had to fix a few things of
course, mostly in the area of planner and executor handling of row locks.

As side effects of this, allow foreign tables to have NOT VALID CHECK
constraints (and hence to accept ALTER ... VALIDATE CONSTRAINT), and to
accept ALTER SET STORAGE and ALTER SET WITH/WITHOUT OIDS.  Continuing to
disallow these things would've required bizarre and inconsistent special
cases in inheritance behavior.  Since foreign tables don't enforce CHECK
constraints anyway, a NOT VALID one is a complete no-op, but that doesn't
mean we shouldn't allow it.  And it's possible that some FDWs might have
use for SET STORAGE or SET WITH OIDS, though doubtless they will be no-ops
for most.

An additional change in support of this is that when a ModifyTable node
has multiple target tables, they will all now be explicitly identified
in EXPLAIN output, for example:

 Update on pt1  (cost=0.00..321.05 rows=3541 width=46)
   Update on pt1
   Foreign Update on ft1
   Foreign Update on ft2
   Update on child3
   ->  Seq Scan on pt1  (cost=0.00..0.00 rows=1 width=46)
   ->  Foreign Scan on ft1  (cost=100.00..148.03 rows=1170 width=46)
   ->  Foreign Scan on ft2  (cost=100.00..148.03 rows=1170 width=46)
   ->  Seq Scan on child3  (cost=0.00..25.00 rows=1200 width=46)

This was done mainly to provide an unambiguous place to attach "Remote SQL"
fields, but it is useful for inherited updates even when no foreign tables
are involved.

Shigeru Hanada and Etsuro Fujita, reviewed by Ashutosh Bapat and Kyotaro
Horiguchi, some additional hacking by me
2015-03-22 13:53:21 -04:00
Bruce Momjian
1c7087af42 Add TOAST table to pg_shseclabel for long label use
Report by Andres Freund
2015-03-21 22:14:49 -04:00
Bruce Momjian
34afbba84e Use mmap MAP_NOSYNC option to limit shared memory writes
mmap() is rarely used for shared memory, but when it is, this option is
useful, particularly on the BSDs.

Patch by Sean Chittenden
2015-03-21 22:06:19 -04:00
Andres Freund
959277a4f5 Use 128-bit math to accelerate some aggregation functions.
On platforms where we support 128bit integers, use them to implement
faster transition functions for sum(int8), avg(int8),
var_*(int2/int4),stdev_*(int2/int4). Where not supported continue to use
numeric as a transition type.

In some synthetic benchmarks this has been shown to provide significant
speedups.

Bumps catversion.

Discussion: 544BB5F1.50709@proxel.se
Author: Andreas Karlsson
Reviewed-By: Peter Geoghegan, Petr Jelinek, Andres Freund,
    Oskari Saarenmaa, David Rowley
2015-03-20 10:29:32 +01:00
Andres Freund
8122e1437e Add, optional, support for 128bit integers.
We will, for the foreseeable future, not expose 128 bit datatypes to
SQL. But being able to use 128bit math will allow us, in a later patch,
to use 128bit accumulators for some aggregates; leading to noticeable
speedups over using numeric.

So far we only detect a gcc/clang extension that supports 128bit math,
but no 128bit literals, and no *printf support. We might want to expand
this in the future to further compilers; if there are any that that
provide similar support.

Discussion: 544BB5F1.50709@proxel.se
Author: Andreas Karlsson, with significant editorializing by me
Reviewed-By: Peter Geoghegan, Oskari Saarenmaa
2015-03-20 10:26:17 +01:00
Robert Haas
12968cf408 Add flags argument to dsm_create.
Right now, there's only one flag, DSM_CREATE_NULL_IF_MAXSEGMENTS,
which suppresses the error that would normally be thrown when the
maximum number of segments already exists, instead returning NULL.
It might be useful to add more flags in the future, such as one to
ignore allocation errors, but I haven't done that here.
2015-03-19 13:03:03 -04:00
Alvaro Herrera
13dbc7a824 array_offset() and array_offsets()
These functions return the offset position or positions of a value in an
array.

Author: Pavel Stěhule
Reviewed by: Jim Nasby
2015-03-18 16:01:34 -03:00
Alvaro Herrera
0d83138974 Rationalize vacuuming options and parameters
We were involving the parser too much in setting up initial vacuuming
parameters.  This patch moves that responsibility elsewhere to simplify
code, and also to make future additions easier.  To do this, create a
new struct VacuumParams which is filled just prior to vacuum execution,
instead of at parse time; for user-invoked vacuuming this is set up in a
new function ExecVacuum, while autovacuum sets it up by itself.

While at it, add a new member VACOPT_SKIPTOAST to enum VacuumOption,
only set by autovacuum, which is used to disable vacuuming of the toast
table instead of the old do_toast parameter; this relieves the argument
list of vacuum() and some callees a bit.  This partially makes up for
having added more arguments in an effort to avoid having autovacuum from
constructing a VacuumStmt parse node.

Author: Michael Paquier. Some tweaks by Álvaro
Reviewed by: Robert Haas, Stephen Frost, Álvaro Herrera
2015-03-18 11:52:33 -03:00
Alvaro Herrera
a61fd5334e Support opfamily members in get_object_address
In the spirit of 890192e99a and 4464303405: have get_object_address
understand individual pg_amop and pg_amproc objects.  There is no way to
refer to such objects directly in the grammar -- rather, they are almost
always considered an integral part of the opfamily that contains them.
(The only case that deals with them individually is ALTER OPERATOR
FAMILY ADD/DROP, which carries the opfamily address separately and thus
does not need it to be part of each added/dropped element's address.)
In event triggers it becomes possible to become involved with individual
amop/amproc elements, and this commit enables pg_get_object_address to
do so as well.

To make the overall coding simpler, this commit also slightly changes
the get_object_address representation for opclasses and opfamilies:
instead of having the AM name in the objargs array, I moved it as the
first element of the objnames array.  This enables the new code to use
objargs for the type names used by pg_amop and pg_amproc.

Reviewed by: Stephen Frost
2015-03-16 12:06:34 -03:00
Tom Lane
7b8b8a4331 Improve representation of PlanRowMark.
This patch fixes two inadequacies of the PlanRowMark representation.

First, that the original LockingClauseStrength isn't stored (and cannot be
inferred for foreign tables, which always get ROW_MARK_COPY).  Since some
PlanRowMarks are created out of whole cloth and don't actually have an
ancestral RowMarkClause, this requires adding a dummy LCS_NONE value to
enum LockingClauseStrength, which is fairly annoying but the alternatives
seem worse.  This fix allows getting rid of the use of get_parse_rowmark()
in FDWs (as per the discussion around commits 462bd95705 and
8ec8760fc8), and it simplifies some things elsewhere.

Second, that the representation assumed that all child tables in an
inheritance hierarchy would use the same RowMarkType.  That's true today
but will soon not be true.  We add an "allMarkTypes" field that identifies
the union of mark types used in all a parent table's children, and use
that where appropriate (currently, only in preprocess_targetlist()).

In passing fix a couple of minor infelicities left over from the SKIP
LOCKED patch, notably that _outPlanRowMark still thought waitPolicy
is a bool.

Catversion bump is required because the numeric values of enum
LockingClauseStrength can appear in on-disk rules.

Extracted from a much larger patch to support foreign table inheritance;
it seemed worth breaking this out, since it's a separable concern.

Shigeru Hanada and Etsuro Fujita, somewhat modified by me
2015-03-15 18:41:47 -04:00
Tom Lane
9fac5fd741 Move LockClauseStrength, LockWaitPolicy into new file nodes/lockoptions.h.
Commit df630b0dd5 moved enum LockWaitPolicy
into its very own header file utils/lockwaitpolicy.h, which does not seem
like a great idea from here.  First, it's still a node-related declaration,
and second, a file named like that can never sensibly be used for anything
else.  I do not think we want to encourage a one-typedef-per-header-file
approach.  The upcoming foreign table inheritance patch was doubling down
on this bad idea by moving enum LockClauseStrength into its *own*
can-never-be-used-for-anything-else file.  Instead, let's put them both in
a file named nodes/lockoptions.h.  (They do seem to need a separate header
file because we need them in both parsenodes.h and plannodes.h, and we
don't want either of those including the other.  Past practice might
suggest adding them to nodes/nodes.h, but they don't seem sufficiently
globally useful to justify that.)

Committed separately since there's no functional change here, just some
header-file refactoring.
2015-03-15 15:19:04 -04:00
Andres Freund
4f1b890b13 Merge the various forms of transaction commit & abort records.
Since 465883b0a two versions of commit records have existed. A compact
version that was used when no cache invalidations, smgr unlinks and
similar were needed, and a full version that could deal with all
that. Additionally the full version was embedded into twophase commit
records.

That resulted in a measurable reduction in the size of the logged WAL in
some workloads. But more recently additions like logical decoding, which
e.g. needs information about the database something was executed on,
made it applicable in fewer situations. The static split generally made
it hard to expand the commit record, because concerns over the size made
it hard to add anything to the compact version.

Additionally it's not particularly pretty to have twophase.c insert
RM_XACT records.

Rejigger things so that the commit and abort records only have one form
each, including the twophase equivalents. The presence of the various
optional (in the sense of not being in every record) pieces is indicated
by a bits in the 'xinfo' flag.  That flag previously was not included in
compact commit records. To prevent an increase in size due to its
presence, it's only included if necessary; signalled by a bit in the
xl_info bits available for xact.c, similar to heapam.c's
XLOG_HEAP_OPMASK/XLOG_HEAP_INIT_PAGE.

Twophase commit/aborts are now the same as their normal
counterparts. The original transaction's xid is included in an optional
data field.

This means that commit records generally are smaller, except in the case
of a transaction with subtransactions, but no other special cases; the
increase there is four bytes, which seems acceptable given that the more
common case of not having subtransactions shrank.  The savings are
especially measurable for twophase commits, which previously always used
the full version; but will in practice only infrequently have required
that.

The motivation for this work are not the space savings and and
deduplication though; it's that it makes it easier to extend commit
records with additional information. That's just a few lines of code
now; without impacting the common case where that information is not
needed.

Discussion: 20150220152150.GD4149@awork2.anarazel.de,
    235610.92468.qm%40web29004.mail.ird.yahoo.com

Reviewed-By: Heikki Linnakangas, Simon Riggs
2015-03-15 17:37:07 +01:00
Tom Lane
91f4a5a976 Build src/port/dirmod.c only on Windows.
Since commit ba7c5975ad, port/dirmod.c
has contained only Windows-specific functions.  Most platforms don't
seem to mind uselessly building an empty file, but OS X for one issues
warnings.  Hence, treat dirmod.c as a Windows-specific file selected
by configure rather than one that's always built.  We can revert this
change if dirmod.c ever gains any non-Windows functionality again.

Back-patch to 9.4 where the mentioned commit appeared.
2015-03-14 14:08:45 -04:00
Tom Lane
f4abd0241d Support flattening of empty-FROM subqueries and one-row VALUES tables.
We can't handle this in the general case due to limitations of the
planner's data representations; but we can allow it in many useful cases,
by being careful to flatten only when we are pulling a single-row subquery
up into a FROM (or, equivalently, inner JOIN) node that will still have at
least one remaining relation child.  Per discussion of an example from
Kyotaro Horiguchi.
2015-03-11 23:18:03 -04:00
Tom Lane
b55722692b Improve planner's cost estimation in the presence of semijoins.
If we have a semijoin, say
	SELECT * FROM x WHERE x1 IN (SELECT y1 FROM y)
and we're estimating the cost of a parameterized indexscan on x, the number
of repetitions of the indexscan should not be taken as the size of y; it'll
really only be the number of distinct values of y1, because the only valid
plan with y on the outside of a nestloop would require y to be unique-ified
before joining it to x.  Most of the time this doesn't make that much
difference, but sometimes it can lead to drastically underestimating the
cost of the indexscan and hence choosing a bad plan, as pointed out by
David Kubečka.

Fixing this is a bit difficult because parameterized indexscans are costed
out quite early in the planning process, before we have the information
that would be needed to call estimate_num_groups() and thereby estimate the
number of distinct values of the join column(s).  However we can move the
code that extracts a semijoin RHS's unique-ification columns, so that it's
done in initsplan.c rather than on-the-fly in create_unique_path().  That
shouldn't make any difference speed-wise and it's really a bit cleaner too.

The other bit of information we need is the size of the semijoin RHS,
which is easy if it's a single relation (we make those estimates before
considering indexscan costs) but problematic if it's a join relation.
The solution adopted here is just to use the product of the sizes of the
join component rels.  That will generally be an overestimate, but since
estimate_num_groups() only uses this input as a clamp, an overestimate
shouldn't hurt us too badly.  In any case we don't allow this new logic
to produce a value larger than we would have chosen before, so that at
worst an overestimate leaves us no wiser than we were before.
2015-03-11 21:21:00 -04:00
Alvaro Herrera
4464303405 Support default ACLs in get_object_address
In the spirit of 890192e99a, this time add support for the things
living in the pg_default_acl catalog.  These are not really "objects",
but they show up as such in event triggers.

There is no "DROP DEFAULT PRIVILEGES" or similar command, so it doesn't
look like the new representation given would be useful anywhere else, so
I didn't try to use it outside objectaddress.c.  (That might be a bug in
itself, but that would be material for another commit.)

Reviewed by Stephen Frost.
2015-03-11 19:23:47 -03:00
Alvaro Herrera
890192e99a Support user mappings in get_object_address
Since commit 72dd233d3e we were trying to obtain object addressing
information in sql_drop event triggers, but that caused failures when
the drops involved user mappings.  This addition enables that to work
again.  Naturally, pg_get_object_address can work with these objects
now, too.

I toyed with the idea of removing DropUserMappingStmt as a node and
using DropStmt instead in the DropUserMappingStmt grammar production,
but that didn't go very well: for one thing the messages thrown by the
specific code are specialized (you get "server not found" if you specify
the wrong server, instead of a generic "user mapping for ... not found"
which you'd get it we were to merge this with RemoveObjects --- unless
we added even more special cases).  For another thing, it would require
to pass RoleSpec nodes through the objname/objargs representation used
by RemoveObjects, which works in isolation, but gets messy when
pg_get_object_address is involved.  So I dropped this part for now.

Reviewed by Stephen Frost.
2015-03-11 17:04:27 -03:00
Tom Lane
c6b3c939b7 Make operator precedence follow the SQL standard more closely.
While the SQL standard is pretty vague on the overall topic of operator
precedence (because it never presents a unified BNF for all expressions),
it does seem reasonable to conclude from the spec for <boolean value
expression> that OR has the lowest precedence, then AND, then NOT, then IS
tests, then the six standard comparison operators, then everything else
(since any non-boolean operator in a WHERE clause would need to be an
argument of one of these).

We were only sort of on board with that: most notably, while "<" ">" and
"=" had properly low precedence, "<=" ">=" and "<>" were treated as generic
operators and so had significantly higher precedence.  And "IS" tests were
even higher precedence than those, which is very clearly wrong per spec.

Another problem was that "foo NOT SOMETHING bar" constructs, such as
"x NOT LIKE y", were treated inconsistently because of a bison
implementation artifact: they had the documented precedence with respect
to operators to their right, but behaved like NOT (i.e., very low priority)
with respect to operators to their left.

Fixing the precedence issues is just a small matter of rearranging the
precedence declarations in gram.y, except for the NOT problem, which
requires adding an additional lookahead case in base_yylex() so that we
can attach a different token precedence to NOT LIKE and allied two-word
operators.

The bulk of this patch is not the bug fix per se, but adding logic to
parse_expr.c to allow giving warnings if an expression has changed meaning
because of these precedence changes.  These warnings are off by default
and are enabled by the new GUC operator_precedence_warning.  It's believed
that very few applications will be affected by these changes, but it was
agreed that a warning mechanism is essential to help debug any that are.
2015-03-11 13:22:52 -04:00
Robert Haas
e529cd4ffa Suggest to the user the column they may have meant to reference.
Error messages informing the user that no such column exists can
sometimes provoke a perplexed response.  This often happens due to
a subtle typo in the column name or, perhaps less likely, in the
alias name.  To speed discovery of what the real issue is in such
cases, we'll now search the range table for approximate matches.
If there are one or two such matches that are good enough to think
that they might be what the user intended to type, and better than
all other approximate matches, we'll issue a hint suggesting that
the user might have intended to reference those columns.

Peter Geoghegan and Robert Haas
2015-03-11 10:44:04 -04:00
Andres Freund
bbfd7edae5 Add macros wrapping all usage of gcc's __attribute__.
Until now __attribute__() was defined to be empty for all compilers but
gcc. That's problematic because it prevents using it in other compilers;
which is necessary e.g. for atomics portability.  It's also just
generally dubious to do so in a header as widely included as c.h.

Instead add pg_attribute_format_arg, pg_attribute_printf,
pg_attribute_noreturn macros which are implemented in the compilers that
understand them. Also add pg_attribute_noreturn and pg_attribute_packed,
but don't provide fallbacks, since they can affect functionality.

This means that external code that, possibly unwittingly, relied on
__attribute__ defined to be empty on !gcc compilers may now run into
warnings or errors on those compilers. But there shouldn't be many
occurances of that and it's hard to work around...

Discussion: 54B58BA3.8040302@ohmu.fi
Author: Oskari Saarenmaa, with some minor changes by me.
2015-03-11 14:30:01 +01:00
Fujii Masao
57aa5b2bb1 Add GUC to enable compression of full page images stored in WAL.
When newly-added GUC parameter, wal_compression, is on, the PostgreSQL server
compresses a full page image written to WAL when full_page_writes is on or
during a base backup. A compressed page image will be decompressed during WAL
replay. Turning this parameter on can reduce the WAL volume without increasing
the risk of unrecoverable data corruption, but at the cost of some extra CPU
spent on the compression during WAL logging and on the decompression during
WAL replay.

This commit changes the WAL format (so bumping WAL version number) so that
the one-byte flag indicating whether a full page image is compressed or not is
included in its header information. This means that the commit increases the
WAL volume one-byte per a full page image even if WAL compression is not used
at all. We can save that one-byte by borrowing one-bit from the existing field
like hole_offset in the header and using it as the flag, for example. But which
would reduce the code readability and the extensibility of the feature.
Per discussion, it's not worth paying those prices to save only one-byte, so we
decided to add the one-byte flag to the header.

This commit doesn't introduce any new compression algorithm like lz4.
Currently a full page image is compressed using the existing PGLZ algorithm.
Per discussion, we decided to use it at least in the first version of the
feature because there were no performance reports showing that its compression
ratio is unacceptably lower than that of other algorithm. Of course,
in the future, it's worth considering the support of other compression
algorithm for the better compression.

Rahila Syed and Michael Paquier, reviewed in various versions by myself,
Andres Freund, Robert Haas, Abhijit Menon-Sen and many others.
2015-03-11 15:52:24 +09:00
Tom Lane
2fbb286647 Clean up the mess from => patch.
Commit 865f14a2d3 was quite a few bricks
shy of a load: psql, ecpg, and plpgsql were all left out-of-step with
the core lexer.  Of these only the last was likely to be a fatal
problem; but still, a minimal amount of grepping, or even just reading
the comments adjacent to the places that were changed, would have found
the other places that needed to be changed.
2015-03-10 11:48:38 -04:00
Alvaro Herrera
e491bd2ee3 Move BRIN page type to page's last two bytes
... which is the usual convention among AMs, so that pg_filedump and
similar utilities can tell apart pages of different AMs.  It was also
the intent of the original code, but I failed to realize that alignment
considerations would move the whole thing to the previous-to-last word
in the page.

The new definition of the associated macro makes surrounding code a bit
leaner, too.

Per note from Heikki at
http://www.postgresql.org/message-id/546A16EF.9070005@vmware.com
2015-03-10 12:27:15 -03:00
Alvaro Herrera
4f3924d9cd Keep CommitTs module in sync in standby and master
We allow this module to be turned off on restarts, so a restart time
check is enough to activate or deactivate the module; however, if there
is a standby replaying WAL emitted from a master which is restarted, but
the standby isn't, the state in the standby becomes inconsistent and can
easily be crashed.

Fix by activating and deactivating the module during WAL replay on
parameter change as well as on system start.

Problem reported by Fujii Masao in
http://www.postgresql.org/message-id/CAHGQGwFhJ3CnHo1CELEfay18yg_RA-XZT-7D8NuWUoYSZ90r4Q@mail.gmail.com

Author: Petr Jelínek
2015-03-09 17:44:00 -03:00
Alvaro Herrera
31eae6028e Allow CURRENT/SESSION_USER to be used in certain commands
Commands such as ALTER USER, ALTER GROUP, ALTER ROLE, GRANT, and the
various ALTER OBJECT / OWNER TO, as well as ad-hoc clauses related to
roles such as the AUTHORIZATION clause of CREATE SCHEMA, the FOR clause
of CREATE USER MAPPING, and the FOR ROLE clause of ALTER DEFAULT
PRIVILEGES can now take the keywords CURRENT_USER and SESSION_USER as
user specifiers in place of an explicit user name.

This commit also fixes some quite ugly handling of special standards-
mandated syntax in CREATE USER MAPPING, which in particular would fail
to work in presence of a role named "current_user".

The special role specifiers PUBLIC and NONE also have more consistent
handling now.

Also take the opportunity to add location tracking to user specifiers.

Authors: Kyotaro Horiguchi.  Heavily reworked by Álvaro Herrera.
Reviewed by: Rushabh Lathia, Adam Brightwell, Marti Raudsepp.
2015-03-09 15:41:54 -03:00
Heikki Linnakangas
f1fd515b39 Move WAL-related definitions from dbcommands.h to separate header file.
This makes it easier to write frontend programs that needs to understand
the WAL record format of CREATE/DROP DATABASE. dbcommands.h cannot easily
be #included in a frontend program, because it pulls in other header files
that need backend stuff, but the new dbcommands_xlog.h header file has
fewer dependencies.
2015-03-09 15:50:49 +02:00
Fujii Masao
828599acec Fix typo in comment. 2015-03-09 14:39:46 +09:00
Tom Lane
01cca2c1b1 Remove struct PQArgBlock from server-side header libpq/libpq.h.
This struct is purely a client-side artifact.  Perhaps there was once
reason for the server to know it, but any such reason is lost in the
mists of time.  We certainly don't need two independent declarations
of it.
2015-03-08 13:42:59 -04:00
Tom Lane
90c35a9ed0 Code cleanup for REINDEX DATABASE/SCHEMA/SYSTEM.
Fix some minor infelicities.  Some of these things were introduced in
commit fe263d115a, and some are older.
2015-03-08 12:18:43 -04:00
Peter Eisentraut
bb8582abf3 Remove rolcatupdate
This role attribute is an ancient PostgreSQL feature, but could only be
set by directly updating the system catalogs, and it doesn't have any
clearly defined use.

Author: Adam Brightwell <adam.brightwell@crunchydatasolutions.com>
2015-03-06 23:42:38 -05:00
Tom Lane
3200b15b20 Remove comment claiming that PARAM_EXTERN Params always have typmod -1.
This hasn't been true in quite some time, cf plpgsql's make_datum_param().
2015-03-05 13:16:27 -05:00
Alvaro Herrera
a2e35b53c3 Change many routines to return ObjectAddress rather than OID
The changed routines are mostly those that can be directly called by
ProcessUtilitySlow; the intention is to make the affected object
information more precise, in support for future event trigger changes.
Originally it was envisioned that the OID of the affected object would
be enough, and in most cases that is correct, but upon actually
implementing the event trigger changes it turned out that ObjectAddress
is more widely useful.

Additionally, some command execution routines grew an output argument
that's an object address which provides further info about the executed
command.  To wit:

* for ALTER DOMAIN / ADD CONSTRAINT, it corresponds to the address of
  the new constraint

* for ALTER OBJECT / SET SCHEMA, it corresponds to the address of the
  schema that originally contained the object.

* for ALTER EXTENSION {ADD, DROP} OBJECT, it corresponds to the address
  of the object added to or dropped from the extension.

There's no user-visible change in this commit, and no functional change
either.

Discussion: 20150218213255.GC6717@tamriel.snowman.net
Reviewed-By: Stephen Frost, Andres Freund
2015-03-03 14:10:50 -03:00
Tom Lane
b67f1ce181 Reduce json <=> jsonb casts from explicit-only to assignment level.
There's no reason to make users write an explicit cast to store a
json value in a jsonb column or vice versa.

We could probably even make these implicit, but that might open us up
to problems with ambiguous function calls, so for now just do this.
2015-03-03 11:26:04 -05:00
Tom Lane
8abb3cda0d Use the typcache to cache constraints for domain types.
Previously, we cached domain constraints for the life of a query, or
really for the life of the FmgrInfo struct that was used to invoke
domain_in() or domain_check().  But plpgsql (and probably other places)
are set up to cache such FmgrInfos for the whole lifespan of a session,
which meant they could be enforcing really stale sets of constraints.
On the other hand, searching pg_constraint once per query gets kind of
expensive too: testing says that as much as half the runtime of a
trivial query such as "SELECT 0::domaintype" went into that.

To fix this, delegate the responsibility for tracking a domain's
constraints to the typcache, which has the infrastructure needed to
detect syscache invalidation events that signal possible changes.
This not only removes unnecessary repeat reads of pg_constraint,
but ensures that we never apply stale constraint data: whatever we
use is the current data according to syscache rules.

Unfortunately, the current configuration of the system catalogs means
we have to flush cached domain-constraint data whenever either pg_type
or pg_constraint changes, which happens rather a lot (eg, creation or
deletion of a temp table will do it).  It might be worth rearranging
things to split pg_constraint into two catalogs, of which the domain
constraint one would probably be very low-traffic.  That's a job for
another patch though, and in any case this patch should improve matters
materially even with that handicap.

This patch makes use of the recently-added memory context reset callback
feature to manage the lifespan of domain constraint caches, so that we
don't risk deleting a cache that might be in the midst of evaluation.

Although this is a bug fix as well as a performance improvement, no
back-patch.  There haven't been many if any field complaints about
stale domain constraint checks, so it doesn't seem worth taking the
risk of modifying data structures as basic as MemoryContexts in back
branches.
2015-03-01 14:06:55 -05:00
Noah Misch
b8a18ad485 Add transform functions for AT TIME ZONE.
This makes "ALTER TABLE tabname ALTER tscol TYPE ... USING tscol AT TIME
ZONE 'UTC'" skip rewriting the table when altering from "timestamp" to
"timestamptz" or vice versa.  While it would be nicer still to optimize
this in the absence of the USING clause given timezone==UTC, transform
functions must consult IMMUTABLE facts only.
2015-03-01 13:22:34 -05:00
Tom Lane
097fe194aa Move memory context callback declarations into palloc.h.
Initial experience with this feature suggests that instances of
MemoryContextCallback are likely to propagate into some widely-used headers
over time.  As things stood, that would result in pulling memutils.h or
at least memnodes.h into common headers, which does not seem desirable.
Instead, let's decide that this feature is part of the "ordinary palloc
user" API rather than the "specialized context management" API, and as
such should be declared in palloc.h not memutils.h.
2015-03-01 12:31:32 -05:00
Tom Lane
eaa5808e8e Redefine MemoryContextReset() as deleting, not resetting, child contexts.
That is, MemoryContextReset() now means what was formerly meant by
MemoryContextResetAndDeleteChildren(), and the latter is now just a macro
alias for the former.  If you really want the functionality that was
formerly provided by MemoryContextReset(), what you have to do is
MemoryContextResetChildren() plus MemoryContextResetOnly() (which is a
new API to reset *only* the named context and not touch its children).

The reason for this change is that near fifteen years of experience has
proven that there is noplace where old-style MemoryContextReset() is
actually what you want.  Making that the default behavior has led to lots
of context-leakage bugs, while we've not found anyplace where it's actually
necessary to keep the child contexts; at least the standard regression
tests do not reveal anyplace where this change breaks anything.  And there
are upcoming patches that will introduce additional reasons why child
contexts need to be removed.

We could change existing calls of MemoryContextResetAndDeleteChildren to be
just MemoryContextReset, but for the moment I'll leave them alone; they're
not costing anything.
2015-02-27 18:10:04 -05:00
Tom Lane
f65e827058 Invent a memory context reset/delete callback mechanism.
This allows cleanup actions to be registered to be called just before a
particular memory context's contents are flushed (either by deletion or
MemoryContextReset).  The patch in itself has no use-cases for this, but
several likely reasons for wanting this exist.

In passing, per discussion, rearrange some boolean fields in struct
MemoryContextData so as to avoid wasted padding space.  For safety,
this requires making allowInCritSection's existence unconditional;
but I think that's a better approach than what was there anyway.
2015-02-27 17:16:43 -05:00
Tom Lane
d809fd0008 Improve parser's one-extra-token lookahead mechanism.
There are a couple of places in our grammar that fail to be strict LALR(1),
by requiring more than a single token of lookahead to decide what to do.
Up to now we've dealt with that by using a filter between the lexer and
parser that merges adjacent tokens into one in the places where two tokens
of lookahead are necessary.  But that creates a number of user-visible
anomalies, for instance that you can't name a CTE "ordinality" because
"WITH ordinality AS ..." triggers folding of WITH and ORDINALITY into one
token.  I realized that there's a better way.

In this patch, we still do the lookahead basically as before, but we never
merge the second token into the first; we replace just the first token by
a special lookahead symbol when one of the lookahead pairs is seen.

This requires a couple extra productions in the grammar, but it involves
fewer special tokens, so that the grammar tables come out a bit smaller
than before.  The filter logic is no slower than before, perhaps a bit
faster.

I also fixed the filter logic so that when backing up after a lookahead,
the current token's terminator is correctly restored; this eliminates some
weird behavior in error message issuance, as is shown by the one change in
existing regression test outputs.

I believe that this patch entirely eliminates odd behaviors caused by
lookahead for WITH.  It doesn't really improve the situation for NULLS
followed by FIRST/LAST unfortunately: those sequences still act like a
reserved word, even though there are cases where they should be seen as two
ordinary identifiers, eg "SELECT nulls first FROM ...".  I experimented
with additional grammar hacks but couldn't find any simple solution for
that.  Still, this is better than before, and it seems much more likely
that we *could* somehow solve the NULLS case on the basis of this filter
behavior than the previous one.
2015-02-24 17:53:45 -05:00
Peter Eisentraut
23a78352c0 Error when creating names too long for tar format
The tar format (at least the version we are using), does not support
file names or symlink targets longer than 99 bytes.  Until now, the tar
creation code would silently truncate any names that are too long.  (Its
original application was pg_dump, where this never happens.)  This
creates problems when running base backups over the replication
protocol.

The most important problem is when a tablespace path is longer than 99
bytes, which will result in a truncated tablespace path being backed up.
Less importantly, the basebackup protocol also promises to back up any
other files it happens to find in the data directory, which would also
lead to file name truncation if someone put a file with a long name in
there.

Now both of these cases result in an error during the backup.

Add tests that fail when a too-long file name or symlink is attempted to
be backed up.

Reviewed-by: Robert Hass <robertmhaas@gmail.com>
2015-02-24 13:41:07 -05:00
Tom Lane
56be925e4b Further tweaking of raw grammar output to distinguish different inputs.
Use a different A_Expr_Kind for LIKE/ILIKE/SIMILAR TO constructs, so that
they can be distinguished from direct invocation of the underlying
operators.  Also, postpone selection of the operator name when transforming
"x IN (select)" to "x = ANY (select)", so that those syntaxes can be told
apart at parse analysis time.

I had originally thought I'd also have to do something special for the
syntaxes IS NOT DISTINCT FROM, IS NOT DOCUMENT, and x NOT IN (SELECT...),
which the grammar translates as though they were NOT (construct).
On reflection though, we can distinguish those cases reliably by noting
whether the parse location shown for the NOT is the same as for its child
node.  This only requires tweaking the parse locations for NOT IN, which
I've done here.

These changes should have no effect outside the parser; they're just in
support of being able to give accurate warnings for planned operator
precedence changes.
2015-02-23 12:46:50 -05:00
Alvaro Herrera
296f3a6053 Support more commands in event triggers
COMMENT, SECURITY LABEL, and GRANT/REVOKE now also fire
ddl_command_start and ddl_command_end event triggers, when they operate
on database-local objects.

Reviewed-By: Michael Paquier, Andres Freund, Stephen Frost
2015-02-23 14:22:42 -03:00
Heikki Linnakangas
88e9823026 Replace checkpoint_segments with min_wal_size and max_wal_size.
Instead of having a single knob (checkpoint_segments) that both triggers
checkpoints, and determines how many checkpoints to recycle, they are now
separate concerns. There is still an internal variable called
CheckpointSegments, which triggers checkpoints. But it no longer determines
how many segments to recycle at a checkpoint. That is now auto-tuned by
keeping a moving average of the distance between checkpoints (in bytes),
and trying to keep that many segments in reserve. The advantage of this is
that you can set max_wal_size very high, but the system won't actually
consume that much space if there isn't any need for it. The min_wal_size
sets a floor for that; you can effectively disable the auto-tuning behavior
by setting min_wal_size equal to max_wal_size.

The max_wal_size setting is now the actual target size of WAL at which a
new checkpoint is triggered, instead of the distance between checkpoints.
Previously, you could calculate the actual WAL usage with the formula
"(2 + checkpoint_completion_target) * checkpoint_segments + 1". With this
patch, you set the desired WAL usage with max_wal_size, and the system
calculates the appropriate CheckpointSegments with the reverse of that
formula. That's a lot more intuitive for administrators to set.

Reviewed by Amit Kapila and Venkata Balaji N.
2015-02-23 18:53:02 +02:00
Heikki Linnakangas
0fec000365 Renumber GUC_* constants.
This moves all the regular flags back together (for aesthetic reasons), and
makes room for more GUC_UNIT_* types.
2015-02-23 18:33:16 +02:00
Heikki Linnakangas
1b63026473 Refactor unit conversions code in guc.c.
Replace the if-switch-case constructs with two conversion tables,
containing all the supported conversions between human-readable unit
strings and the base units used in GUC variables. This makes the code
easier to read, and makes adding new units simpler.
2015-02-23 18:06:16 +02:00
Fujii Masao
5d2b45e3f7 Add GUC to control the time to wait before retrieving WAL after failed attempt.
Previously when the standby server failed to retrieve WAL files from any sources
(i.e., streaming replication, local pg_xlog directory or WAL archive), it always
waited for five seconds (hard-coded) before the next attempt. For example,
this is problematic in warm-standby because restore_command can fail
every five seconds even while new WAL file is expected to be unavailable for
a long time and flood the log files with its error messages.

This commit adds new parameter, wal_retrieve_retry_interval, to control that
wait time.

Alexey Vasiliev and Michael Paquier, reviewed by Andres Freund and me.
2015-02-23 20:55:17 +09:00
Tom Lane
c063da1769 Add parse location fields to NullTest and BooleanTest structs.
We did not need a location tag on NullTest or BooleanTest before, because
no error messages referred directly to their locations.  That's planned
to change though, so add these fields in a separate housekeeping commit.

Catversion bump because stored rules may change.
2015-02-22 14:40:27 -05:00
Tom Lane
6a75562ed1 Get rid of multiple applications of transformExpr() to the same tree.
transformExpr() has for many years had provisions to do nothing when
applied to an already-transformed expression tree.  However, this was
always ugly and of dubious reliability, so we'd be much better off without
it.  The primary historical reason for it was that gram.y sometimes
returned multiple links to the same subexpression, which is no longer true
as of my BETWEEN fixes.  We'd also grown some lazy hacks in CREATE TABLE
LIKE (failing to distinguish between raw and already-transformed index
specifications) and one or two other places.

This patch removes the need for and support for re-transforming already
transformed expressions.  The index case is dealt with by adding a flag
to struct IndexStmt to indicate that it's already been transformed;
which has some benefit anyway in that tablecmds.c can now Assert that
transformation has happened rather than just assuming.  The other main
reason was some rather sloppy code for array type coercion, which can
be fixed (and its performance improved too) by refactoring.

I did leave transformJoinUsingClause() still constructing expressions
containing untransformed operator nodes being applied to Vars, so that
transformExpr() still has to allow Var inputs.  But that's a much narrower,
and safer, special case than before, since Vars will never appear in a raw
parse tree, and they don't have any substructure to worry about.

In passing fix some oversights in the patch that added CREATE INDEX
IF NOT EXISTS (missing processing of IndexStmt.if_not_exists).  These
appear relatively harmless, but still sloppy coding practice.
2015-02-22 13:59:09 -05:00
Tom Lane
34af082f95 Represent BETWEEN as a special node type in raw parse trees.
Previously, gram.y itself converted BETWEEN into AND (or AND/OR) nests of
expression comparisons.  This was always as bogus as could be, but fixing
it hasn't risen to the top of the to-do list.  The present patch invents an
A_Expr representation for BETWEEN expressions, and does the expansion to
comparison trees in parse_expr.c which is at least a slightly saner place
to be doing semantic conversions.  There should be no change in the post-
parse-analysis results.

This does nothing for the semantic issues with BETWEEN (dubious connection
to btree-opclass semantics, and multiple evaluation of possibly volatile
subexpressions) ... but it's a necessary preliminary step before we could
fix any of that.  The main immediate benefit is that preserving BETWEEN as
an identifiable raw-parse-tree construct will enable better error messages.

While at it, fix the code so that multiply-referenced subexpressions are
physically duplicated before being passed through transformExpr().  This
gets rid of one of the principal reasons why transformExpr() has
historically had to allow already-processed input.
2015-02-22 13:57:56 -05:00
Jeff Davis
b419865a81 In array_agg(), don't create a new context for every group.
Previously, each new array created a new memory context that started
out at 8kB. This is incredibly wasteful when there are lots of small
groups of just a few elements each.

Change initArrayResult() and friends to accept a "subcontext" argument
to indicate whether the caller wants the ArrayBuildState allocated in
a new subcontext or not. If not, it can no longer be released
separately from the rest of the memory context.

Fixes bug report by Frank van Vugt on 2013-10-19.

Tomas Vondra. Reviewed by Ali Akbar, Tom Lane, and me.
2015-02-21 17:24:48 -08:00
Andres Freund
82a532b34d Force some system catalog table columns to be marked NOT NULL.
In a manual pass over the catalog declaration I found a number of
columns which the boostrap automatism didn't mark NOT NULL even though
they actually were. Add BKI_FORCE_NOT_NULL markings to them.

It's usually not critical if a system table column is falsely determined
to be nullable as the code should always catch relevant cases. But it's
good to have a extra layer in place.

Discussion: 20150215170014.GE15326@awork2.anarazel.de
2015-02-21 22:37:05 +01:00
Andres Freund
eb68379c38 Allow forcing nullness of columns during bootstrap.
Bootstrap determines whether a column is null based on simple builtin
rules. Those work surprisingly well, but nonetheless a few existing
columns aren't set correctly. Additionally there is at least one patch
sent to hackers where forcing the nullness of a column would be helpful.

The boostrap format has gained FORCE [NOT] NULL for this, which will be
emitted by genbki.pl when BKI_FORCE_(NOT_)?NULL is specified for a
column in a catalog header.

This patch doesn't change the marking of any existing columns.

Discussion: 20150215170014.GE15326@awork2.anarazel.de
2015-02-21 22:31:54 +01:00
Tom Lane
e1a11d9311 Use FLEXIBLE_ARRAY_MEMBER for HeapTupleHeaderData.t_bits[].
This requires changing quite a few places that were depending on
sizeof(HeapTupleHeaderData), but it seems for the best.

Michael Paquier, some adjustments by me
2015-02-21 15:13:06 -05:00
Robert Haas
64235fecc6 Don't require users of src/port/gettimeofday.c to initialize it.
Commit 8001fe67a3 introduced this
requirement, but per discussion, we want to avoid requirements of
this type to make things easier on the calling code.  An especially
important consideration is that this may be used in frontend code,
not just the backend.

Asif Naeem, reviewed by Michael Paquier
2015-02-21 12:17:04 -05:00
Tom Lane
f2874feb7c Some more FLEXIBLE_ARRAY_MEMBER fixes. 2015-02-21 01:46:43 -05:00
Tom Lane
33b2a2c97f Fix statically allocated struct with FLEXIBLE_ARRAY_MEMBER member.
clang complains about this, not unreasonably, so define another struct
that's explicitly for a WordEntryPos with exactly one element.

While at it, get rid of pretty dubious use of a static variable for
more than one purpose --- if it were being treated as const maybe
I'd be okay with this, but it isn't.
2015-02-20 17:50:18 -05:00
Tom Lane
e38b1eb098 Use FLEXIBLE_ARRAY_MEMBER in struct varlena.
This forces some minor coding adjustments in tuptoaster.c and inv_api.c,
but the new coding there is cleaner anyway.

Michael Paquier
2015-02-20 16:51:53 -05:00
Alvaro Herrera
3b14bb7771 Update PGSTAT_FILE_FORMAT_ID
Previous commit should have bumped it but didn't.  Oops.

Per note from Tom.
2015-02-20 12:59:27 -03:00
Alvaro Herrera
d42358efb1 Have TRUNCATE update pgstat tuple counters
This works by keeping a per-subtransaction record of the ins/upd/del
counters before the truncate, and then resetting them; this record is
useful to return to the previous state in case the truncate is rolled
back, either in a subtransaction or whole transaction.  The state is
propagated upwards as subtransactions commit.

When the per-table data is sent to the stats collector, a flag indicates
to reset the live/dead counters to zero as well.

Catalog version bumped due to the change in pgstat format.

Author: Alexander Shulgin
Discussion: 1007.1207238291@sss.pgh.pa.us
Discussion: 548F7D38.2000401@BlueTreble.com
Reviewed-by: Álvaro Herrera, Jim Nasby
2015-02-20 12:10:01 -03:00
Tom Lane
692bd09ad1 Use "#ifdef CATALOG_VARLEN" to protect nullable fields of pg_authid.
This gives a stronger guarantee than a mere comment against accessing these
fields as simple struct members.  Since rolpassword is in fact varlena,
it's not clear why these didn't get marked from the beginning, but let's
do it now.

Michael Paquier
2015-02-20 00:23:48 -05:00
Tom Lane
09d8d110a6 Use FLEXIBLE_ARRAY_MEMBER in a bunch more places.
Replace some bogus "x[1]" declarations with "x[FLEXIBLE_ARRAY_MEMBER]".
Aside from being more self-documenting, this should help prevent bogus
warnings from static code analyzers and perhaps compiler misoptimizations.

This patch is just a down payment on eliminating the whole problem, but
it gets rid of a lot of easy-to-fix cases.

Note that the main problem with doing this is that one must no longer rely
on computing sizeof(the containing struct), since the result would be
compiler-dependent.  Instead use offsetof(struct, lastfield).  Autoconf
also warns against spelling that offsetof(struct, lastfield[0]).

Michael Paquier, review and additional fixes by me.
2015-02-20 00:11:42 -05:00
Tom Lane
2fb7a75f37 Add pg_stat_get_snapshot_timestamp() to show statistics snapshot timestamp.
Per discussion, this could be useful for purposes such as programmatically
detecting a nonresponding stats collector.  We already have the timestamp
anyway, it's just a matter of providing a SQL-accessible function to fetch
it.

Matt Kelly, reviewed by Jim Nasby
2015-02-19 21:36:50 -05:00
Heikki Linnakangas
634618ecd0 Remove dead structs.
These are not used with the new WAL format anymore. GIN split records are
simply always recorded as full-page images.

Michael Paquier
2015-02-19 21:14:37 +02:00
Tom Lane
56a79a869b Split array_push into separate array_append and array_prepend functions.
There wasn't any good reason for a single C function to implement both
these SQL functions: it saved very little code overall, and it required
significant pushups to re-determine at runtime which case applied.  Redoing
it as two functions ends up with just slightly more lines of code, but it's
simpler to understand, and faster too because we need not repeat syscache
lookups on every call.

An important side benefit is that this eliminates the only case in which
different aliases of the same C function had both anyarray and anyelement
arguments at the same position, which would almost always be a mistake.
The opr_sanity regression test will now notice such mistakes since there's
no longer a valid case where it happens.
2015-02-18 20:53:33 -05:00
Tom Lane
abe45a9b31 Fix EXPLAIN output for cases where parent table is excluded by constraints.
The previous coding in EXPLAIN always labeled a ModifyTable node with the
name of the target table affected by its first child plan.  When originally
written, this was necessarily the parent table of the inheritance tree,
so everything was unconfusing.  But when we added NO INHERIT constraints,
it became possible for the parent table to be deleted from the plan by
constraint exclusion while still leaving child tables present.  This led to
the ModifyTable plan node being labeled with the first surviving child,
which was deemed confusing.  Fix it by retaining the parent table's RT
index in a new field in ModifyTable.

Etsuro Fujita, reviewed by Ashutosh Bapat and myself
2015-02-17 18:04:11 -05:00
Heikki Linnakangas
931bf3eb9b Fix a bug in pairing heap removal code.
After removal, the next_sibling pointer of a node was sometimes incorrectly
left to point to another node in the heap, which meant that a node was
sometimes linked twice into the heap. Surprisingly that didn't cause any
crashes in my testing, but it was clearly wrong and could easily segfault
in other scenarios.

Also always keep the prev_or_parent pointer as NULL on the root node. That
was not a correctness issue AFAICS, but let's be tidy.

Add a debugging function, to dump the contents of a pairing heap as a
string. It's #ifdef'd out, as it's not used for anything in any normal
code, but it was highly useful in debugging this. Let's keep it handy for
further reference.
2015-02-17 22:55:53 +02:00
Tom Lane
2e105def09 Remove code to match IPv4 pg_hba.conf entries to IPv4-in-IPv6 addresses.
In investigating yesterday's crash report from Hugo Osvaldo Barrera, I only
looked back as far as commit f3aec2c7f5 where the breakage occurred
(which is why I thought the IPv4-in-IPv6 business was undocumented).  But
actually the logic dates back to commit 3c9bb8886d and was simply
broken by erroneous refactoring in the later commit.  A bit of archives
excavation shows that we added the whole business in response to a report
that some 2003-era Linux kernels would report IPv4 connections as having
IPv4-in-IPv6 addresses.  The fact that we've had no complaints since 9.0
seems to be sufficient confirmation that no modern kernels do that, so
let's just rip it all out rather than trying to fix it.

Do this in the back branches too, thus essentially deciding that our
effective behavior since 9.0 is correct.  If there are any platforms on
which the kernel reports IPv4-in-IPv6 addresses as such, yesterday's fix
would have made for a subtle and potentially security-sensitive change in
the effective meaning of IPv4 pg_hba.conf entries, which does not seem like
a good thing to do in minor releases.  So let's let the post-9.0 behavior
stand, and change the documentation to match it.

In passing, I failed to resist the temptation to wordsmith the description
of pg_hba.conf IPv4 and IPv6 address entries a bit.  A lot of this text
hasn't been touched since we were IPv4-only.
2015-02-17 12:49:18 -05:00
Tom Lane
e983c4d1aa Rationalize the APIs of array element/slice access functions.
The four functions array_ref, array_set, array_get_slice, array_set_slice
have traditionally declared their array inputs and results as being of type
"ArrayType *".  This is a lie, and has been since Berkeley days, because
they actually also support "fixed-length array" types such as "name" and
"point"; not to mention that the inputs could be toasted.  These values
should be declared Datum instead to avoid confusion.  The current coding
already risks possible misoptimization by compilers, and it'll get worse
when "expanded" array representations become a valid alternative.

However, there's a fair amount of code using array_ref and array_set with
arrays that *are* known to be ArrayType structures, and there might be more
such places in third-party code.  Rather than cluttering those call sites
with PointerGetDatum/DatumGetArrayTypeP cruft, what I did was to rename the
existing functions to array_get_element/array_set_element, fix their
signatures, then reincarnate array_ref/array_set as backwards compatibility
wrappers.

array_get_slice/array_set_slice have no such constituency in the core code,
and probably not in third-party code either, so I just changed their APIs.
2015-02-16 12:23:58 -05:00
Heikki Linnakangas
33e879c4e9 Fix broken #ifdef for __sparcv8
Rob Rowan. Backpatch to all supported versions, like the patch that added
the broken #ifdef.
2015-02-13 23:56:25 +02:00
Heikki Linnakangas
80788a431e Simplify waiting logic in reading from / writing to client.
The client socket is always in non-blocking mode, and if we actually want
blocking behaviour, we emulate it by sleeping and retrying. But we have
retry loops at different layers for reads and writes, which was confusing.
To simplify, remove all the sleeping and retrying code from the lower
levels, from be_tls_read and secure_raw_read and secure_raw_write, and put
all the logic in secure_read() and secure_write().
2015-02-13 21:46:14 +02:00
Heikki Linnakangas
025c02420d Speed up CRC calculation using slicing-by-8 algorithm.
This speeds up WAL generation and replay. The new algorithm is
significantly faster with large inputs, like full-page images or when
inserting wide rows. It is slower with tiny inputs, i.e. less than 10 bytes
or so, but the speedup with longer inputs more than make up for that. Even
small WAL records at least have 24 byte header in the front.

The output is identical to the current byte-at-a-time computation, so this
does not affect compatibility. The new algorithm is only used for the
CRC-32C variant, not the legacy version used in tsquery or the
"traditional" CRC-32 used in hstore and ltree. Those are not as performance
critical, and are usually only applied over small inputs, so it seems
better to not carry around the extra lookup tables to speed up those rare
cases.

Abhijit Menon-Sen
2015-02-10 10:54:40 +02:00
Tom Lane
bc4de01db3 Minor cleanup/code review for "indirect toast" stuff.
Fix some issues I noticed while fooling with an extension to allow an
additional kind of toast pointer.  Much of this is just comment
improvement, but there are a couple of actual bugs, which might or might
not be reachable today depending on what can happen during logical
decoding.  An example is that toast_flatten_tuple() failed to cover the
possibility of an indirection pointer in its input.  Back-patch to 9.4
just in case that is reachable now.

In HEAD, also correct some really minor issues with recent compression
reorganization, such as dangerously underparenthesized macros.
2015-02-09 12:30:52 -05:00
Heikki Linnakangas
c619c2351f Move pg_crc.c to src/common, and remove pg_crc_tables.h
To get CRC functionality in a client program, you now need to link with
libpgcommon instead of libpgport. The CRC code has nothing to do with
portability, so libpgcommon is a better home. (libpgcommon didn't exist
when pg_crc.c was originally moved to src/port.)

Remove the possibility to get CRC functionality by just #including
pg_crc_tables.h. I'm not aware of any extensions that actually did that and
couldn't simply link with libpgcommon.

This also moves the pg_crc.h header file from src/include/utils to
src/include/common, which will require changes to any external programs
that currently does #include "utils/pg_crc.h". That seems acceptable, as
include/common is clearly the right home for it now, and the change needed
to any such programs is trivial.
2015-02-09 11:17:56 +02:00
Fujii Masao
40bede5477 Move pg_lzcompress.c to src/common.
The meta data of PGLZ symbolized by PGLZ_Header is removed, to make
the compression and decompression code independent on the backend-only
varlena facility. PGLZ_Header is being used to store some meta data
related to the data being compressed like the raw length of the uncompressed
record or some varlena-related data, making it unpluggable once PGLZ is
stored in src/common as it contains some backend-only code paths with
the management of varlena structures. The APIs of PGLZ are reworked
at the same time to do only compression and decompression of buffers
without the meta-data layer, simplifying its use for a more general usage.

On-disk format is preserved as well, so there is no incompatibility with
previous major versions of PostgreSQL for TOAST entries.

Exposing compression and decompression APIs of pglz makes possible its
use by extensions and contrib modules. Especially this commit is required
for upcoming WAL compression feature so that the WAL reader facility can
decompress the WAL data by using pglz_decompress.

Michael Paquier, reviewed by me.
2015-02-09 15:15:24 +09:00
Heikki Linnakangas
d88976cfa1 Use a separate memory context for GIN scan keys.
It was getting tedious to track and release all the different things that
form a scan key. We were leaking at least the queryCategories array, and
possibly more, on a rescan. That was visible if a GIN index was used in a
nested loop join. This also protects from leaks in extractQuery method.

No backpatching, given the lack of complaints from the field. Maybe later,
after this has received more field testing.
2015-02-04 17:40:25 +02:00
Andres Freund
2505ce0be0 Remove remnants of ImmediateInterruptOK handling.
Now that nothing sets ImmediateInterruptOK to true anymore, we can
remove all the supporting code.

Reviewed-By: Heikki Linnakangas
2015-02-03 23:25:47 +01:00
Andres Freund
d06995710b Remove the option to service interrupts during PGSemaphoreLock().
The remaining caller (lwlocks) doesn't need that facility, and we plan
to remove ImmedidateInterruptOK entirely. That means that interrupts
can't be serviced race-free and portably anyway, so there's little
reason for keeping the feature.

Reviewed-By: Heikki Linnakangas
2015-02-03 23:25:00 +01:00
Andres Freund
6753333f55 Move deadlock and other interrupt handling in proc.c out of signal handlers.
Deadlock checking was performed inside signal handlers up to
now. While it's a remarkable feat to have made this work reliably,
it's quite complex to understand why that is the case. Partially it
worked due to the assumption that semaphores are signal safe - which
is not actually documented to be the case for sysv semaphores.

The reason we had to rely on performing this work inside signal
handlers is that semaphores aren't guaranteed to be interruptable by
signals on all platforms. But now that latches provide a somewhat
similar API, which actually has the guarantee of being interruptible,
we can avoid doing so.

Signalling between ProcSleep, ProcWakeup, ProcWaitForSignal and
ProcSendSignal is now done using latches. This increases the
likelihood of spurious wakeups. As spurious wakeup already were
possible and aren't likely to be frequent enough to be an actual
problem, this seems acceptable.

This change would allow for further simplification of the deadlock
checking, now that it doesn't have to run in a signal handler. But
even if I were motivated to do so right now, it would still be better
to do that separately. Such a cleanup shouldn't have to be reviewed a
the same time as the more fundamental changes in this commit.

There is one possible usability regression due to this commit. Namely
it is more likely than before that log_lock_waits messages are output
more than once.

Reviewed-By: Heikki Linnakangas
2015-02-03 23:24:38 +01:00
Tom Lane
cec916f35b Remove unused "m" field in LSEG.
This field has been unreferenced since 1998, and does not appear in lseg
values stored on disk (since sizeof(lseg) is only 32 bytes according to
pg_type).  There was apparently some idea of maintaining it just in values
appearing in memory, but the bookkeeping required to make that work would
surely far outweigh the cost of recalculating the line's slope when needed.
Remove it to (a) simplify matters and (b) suppress some uninitialized-field
whining from Coverity.
2015-02-03 16:53:32 -05:00
Andres Freund
4fe384bd85 Process 'die' interrupts while reading/writing from the client socket.
Up to now it was impossible to terminate a backend that was trying to
send/recv data to/from the client when the socket's buffer was already
full/empty. While the send/recv calls itself might have gotten
interrupted by signals on some platforms, we just immediately retried.

That could lead to situations where a backend couldn't be terminated ,
after a client died without the connection being closed, because it
was blocked in send/recv.

The problem was far more likely to be hit when sending data than when
reading. That's because while reading a command from the client, and
during authentication, we processed interrupts immediately . That
primarily left COPY FROM STDIN as being problematic for recv.

Change things so that that we process 'die' events immediately when
the appropriate signal arrives. We can't sensibly react to query
cancels at that point, because we might loose sync with the client as
we could be in the middle of writing a message.

We don't interrupt writes if the write buffer isn't full, as indicated
by write() returning EWOULDBLOCK, as that would lead to fewer error
messages reaching clients.

Per discussion with Kyotaro HORIGUCHI and Heikki Linnakangas

Discussion: 20140927191243.GD5423@alap3.anarazel.de
2015-02-03 22:45:45 +01:00
Andres Freund
4f85fde8eb Introduce and use infrastructure for interrupt processing during client reads.
Up to now large swathes of backend code ran inside signal handlers
while reading commands from the client, to allow for speedy reaction to
asynchronous events. Most prominently shared invalidation and NOTIFY
handling. That means that complex code like the starting/stopping of
transactions is run in signal handlers...  The required code was
fragile and verbose, and is likely to contain bugs.

That approach also severely limited what could be done while
communicating with the client. As the read might be from within
openssl it wasn't safely possible to trigger an error, e.g. to cancel
a backend in idle-in-transaction state. We did that in some cases,
namely fatal errors, nonetheless.

Now that FE/BE communication in the backend employs non-blocking
sockets and latches to block, we can quite simply interrupt reads from
signal handlers by setting the latch. That allows us to signal an
interrupted read, which is supposed to be retried after returning from
within the ssl library.

As signal handlers now only need to set the latch to guarantee timely
interrupt processing, remove a fair amount of complicated & fragile
code from async.c and sinval.c.

We could now actually start to process some kinds of interrupts, like
sinval ones, more often that before, but that seems better done
separately.

This work will hopefully allow to handle cases like being blocked by
sending data, interrupting idle transactions and similar to be
implemented without too much effort.  In addition to allowing getting
rid of ImmediateInterruptOK, that is.

Author: Andres Freund
Reviewed-By: Heikki Linnakangas
2015-02-03 22:25:20 +01:00
Robert Haas
5d2f957f3f Add new function BackgroundWorkerInitializeConnectionByOid.
Sometimes it's useful for a background worker to be able to initialize
its database connection by OID rather than by name, so provide a way
to do that.
2015-02-02 16:23:59 -05:00
Heikki Linnakangas
2b3a8b20c2 Be more careful to not lose sync in the FE/BE protocol.
If any error occurred while we were in the middle of reading a protocol
message from the client, we could lose sync, and incorrectly try to
interpret a part of another message as a new protocol message. That will
usually lead to an "invalid frontend message" error that terminates the
connection. However, this is a security issue because an attacker might
be able to deliberately cause an error, inject a Query message in what's
supposed to be just user data, and have the server execute it.

We were quite careful to not have CHECK_FOR_INTERRUPTS() calls or other
operations that could ereport(ERROR) in the middle of processing a message,
but a query cancel interrupt or statement timeout could nevertheless cause
it to happen. Also, the V2 fastpath and COPY handling were not so careful.
It's very difficult to recover in the V2 COPY protocol, so we will just
terminate the connection on error. In practice, that's what happened
previously anyway, as we lost protocol sync.

To fix, add a new variable in pqcomm.c, PqCommReadingMsg, that is set
whenever we're in the middle of reading a message. When it's set, we cannot
safely ERROR out and continue running, because we might've read only part
of a message. PqCommReadingMsg acts somewhat similarly to critical sections
in that if an error occurs while it's set, the error handler will force the
connection to be terminated, as if the error was FATAL. It's not
implemented by promoting ERROR to FATAL in elog.c, like ERROR is promoted
to PANIC in critical sections, because we want to be able to use
PG_TRY/CATCH to recover and regain protocol sync. pq_getmessage() takes
advantage of that to prevent an OOM error from terminating the connection.

To prevent unnecessary connection terminations, add a holdoff mechanism
similar to HOLD/RESUME_INTERRUPTS() that can be used hold off query cancel
interrupts, but still allow die interrupts. The rules on which interrupts
are processed when are now a bit more complicated, so refactor
ProcessInterrupts() and the calls to it in signal handlers so that the
signal handlers always call it if ImmediateInterruptOK is set, and
ProcessInterrupts() can decide to not do anything if the other conditions
are not met.

Reported by Emil Lenngren. Patch reviewed by Noah Misch and Andres Freund.
Backpatch to all supported versions.

Security: CVE-2015-0244
2015-02-02 17:09:53 +02:00
Robert Haas
bd4e2fd97d Provide a way to supress the "out of memory" error when allocating.
Using the new interface MemoryContextAllocExtended, callers can
specify MCXT_ALLOC_NO_OOM if they are prepared to handle a NULL
return value.

Michael Paquier, reviewed and somewhat revised by me.
2015-01-30 12:56:48 -05:00
Tom Lane
3d660d33aa Fix assorted oversights in range selectivity estimation.
calc_rangesel() failed outright when comparing range variables to empty
constant ranges with < or >=, as a result of missing cases in a switch.
It also produced a bogus estimate for > comparison to an empty range.

On top of that, the >= and > cases were mislabeled throughout.  For
nonempty constant ranges, they managed to produce the right answers
anyway as a result of counterbalancing typos.

Also, default_range_selectivity() omitted cases for elem <@ range,
range &< range, and range &> range, so that rather dubious defaults
were applied for these operators.

In passing, rearrange the code in rangesel() so that the elem <@ range
case is handled in a less opaque fashion.

Report and patch by Emre Hasegeli, some additional work by me
2015-01-30 12:30:59 -05:00
Heikki Linnakangas
68fa75f318 Fix query-duration memory leak with GIN rescans.
The requiredEntries / additionalEntries arrays were not freed in
freeScanKeys() like other per-key stuff.

It's not obvious, but startScanKey() was only ever called after the keys
have been initialized with ginNewScanKey(). That's why it doesn't need to
worry about freeing existing arrays. The ginIsNewKey() test in gingetbitmap
was never true, because ginrescan free's the existing keys, and it's not OK
to call gingetbitmap twice in a row without calling ginrescan in between.
To make that clear, remove the unnecessary ginIsNewKey(). And just to be
extra sure that nothing funny happens if there is an existing key after all,
call freeScanKeys() to free it if it exists. This makes the code more
straightforward.

(I'm seeing other similar leaks in testing a query that rescans an GIN index
scan, but that's a different issue. This just fixes the obvious leak with
those two arrays.)

Backpatch to 9.4, where GIN fast scan was added.
2015-01-30 17:58:23 +01:00
Andres Freund
ed127002d8 Align buffer descriptors to cache line boundaries.
Benchmarks has shown that aligning the buffer descriptor array to
cache lines is important for scalability; especially on bigger,
multi-socket, machines.

Currently the array sometimes already happens to be aligned by
happenstance, depending how large previous shared memory allocations
were. That can lead to wildly varying performance results after minor
configuration changes.

In addition to aligning the start of descriptor array, also force the
size of individual descriptors to be of a common cache line size (64
bytes). That happens to already be the case on 64bit platforms, but
this way we can change the struct BufferDesc more easily.

As the alignment primarily matters in highly concurrent workloads
which probably all are 64bit these days, and the space wastage of
element alignment would be a bit more noticeable on 32bit systems, we
don't force the stride to be cacheline sized on 32bit platforms for
now. If somebody does actual performance testing, we can reevaluate
that decision by changing the definition of BUFFERDESC_PADDED_SIZE.

Discussion: 20140202151319.GD32123@awork2.anarazel.de

Per discussion with Bruce Momjan, Tom Lane, Robert Haas, and Peter
Geoghegan.
2015-01-29 22:48:45 +01:00
Stephen Frost
c7cf9a2433 Add usebypassrls to pg_user and pg_shadow
The row level security patches didn't add the 'usebypassrls' columns to
the pg_user and pg_shadow views on the belief that they were deprecated,
but we havn't actually said they are and therefore we should include it.

This patch corrects that, adds missing documentation for rolbypassrls
into the system catalog page for pg_authid, along with the entries for
pg_user and pg_shadow, and cleans up a few other uses of 'row-level'
cases to be 'row level' in the docs.

Pointed out by Amit Kapila.

Catalog version bump due to system view changes.
2015-01-28 21:47:15 -05:00
Stephen Frost
804b6b6db4 Fix column-privilege leak in error-message paths
While building error messages to return to the user,
BuildIndexValueDescription, ExecBuildSlotValueDescription and
ri_ReportViolation would happily include the entire key or entire row in
the result returned to the user, even if the user didn't have access to
view all of the columns being included.

Instead, include only those columns which the user is providing or which
the user has select rights on.  If the user does not have any rights
to view the table or any of the columns involved then no detail is
provided and a NULL value is returned from BuildIndexValueDescription
and ExecBuildSlotValueDescription.  Note that, for key cases, the user
must have access to all of the columns for the key to be shown; a
partial key will not be returned.

Further, in master only, do not return any data for cases where row
security is enabled on the relation and row security should be applied
for the user.  This required a bit of refactoring and moving of things
around related to RLS- note the addition of utils/misc/rls.c.

Back-patch all the way, as column-level privileges are now in all
supported versions.

This has been assigned CVE-2014-8161, but since the issue and the patch
have already been publicized on pgsql-hackers, there's no point in trying
to hide this commit.
2015-01-28 12:31:30 -05:00
Tom Lane
4b2a254793 Add a note to PG_TRY's documentation about volatile safety.
We had better memorialize what the actual requirements are for this.
2015-01-26 15:53:37 -05:00
Tom Lane
fd496129d1 Clean up some mess in row-security patches.
Fix unsafe coding around PG_TRY in RelationBuildRowSecurity: can't change
a variable inside PG_TRY and then use it in PG_CATCH without marking it
"volatile".  In this case though it seems saner to avoid that by doing
a single assignment before entering the TRY block.

I started out just intending to fix that, but the more I looked at the
row-security code the more distressed I got.  This patch also fixes
incorrect construction of the RowSecurityPolicy cache entries (there was
not sufficient care taken to copy pass-by-ref data into the cache memory
context) and a whole bunch of sloppiness around the definition and use of
pg_policy.polcmd.  You can't use nulls in that column because initdb will
mark it NOT NULL --- and I see no particular reason why a null entry would
be a good idea anyway, so changing initdb's behavior is not the right
answer.  The internal value of '\0' wouldn't be suitable in a "char" column
either, so after a bit of thought I settled on using '*' to represent ALL.
Chasing those changes down also revealed that somebody wasn't paying
attention to what the underlying values of ACL_UPDATE_CHR etc really were,
and there was a great deal of lackadaiscalness in the catalogs.sgml
documentation for pg_policy and pg_policies too.

This doesn't pretend to be a complete code review for the row-security
stuff, it just fixes the things that were in my face while dealing with
the bugs in RelationBuildRowSecurity.
2015-01-24 16:16:22 -05:00
Robert Haas
d1747571b6 Fix typos, update README.
Peter Geoghegan
2015-01-23 15:06:53 -05:00
Tom Lane
eb213acfe2 Prevent duplicate escape-string warnings when using pg_stat_statements.
contrib/pg_stat_statements will sometimes run the core lexer a second time
on submitted statements.  Formerly, if you had standard_conforming_strings
turned off, this led to sometimes getting two copies of any warnings
enabled by escape_string_warning.  While this is probably no longer a big
deal in the field, it's a pain for regression testing.

To fix, change the lexer so it doesn't consult the escape_string_warning
GUC variable directly, but looks at a copy in the core_yy_extra_type state
struct.  Then, pg_stat_statements can change that copy to disable warnings
while it's redoing the lexing.

It seemed like a good idea to make this happen for all three of the GUCs
consulted by the lexer, not just escape_string_warning.  There's not an
immediate use-case for callers to adjust the other two AFAIK, but making
it possible is easy enough and seems like good future-proofing.

Arguably this is a bug fix, but there doesn't seem to be enough interest to
justify a back-patch.  We'd not be able to back-patch exactly as-is anyway,
for fear of breaking ABI compatibility of the struct.  (We could perhaps
back-patch the addition of only escape_string_warning by adding it at the
end of the struct, where there's currently alignment padding space.)
2015-01-22 18:11:00 -05:00
Alvaro Herrera
972bf7d6f1 Tweak BRIN minmax operator class
In the union support proc, we were not checking the hasnulls flag of
value A early enough, so it could be skipped if the "allnulls" flag in
value B is set.  Also, a check on the allnulls flag of value "B" was
redundant, so remove it.

Also change inet_minmax_ops to not be the default opclass for type inet,
as a future inclusion operator class would be more useful and it's
pretty difficult to change default opclass for a datatype later on.
(There is no catversion bump for this catalog change; this shouldn't be
a problem.)

Extracted from a larger patch to add an "inclusion" operator class.

Author: Emre Hasegeli
2015-01-22 17:01:09 -03:00
Alvaro Herrera
813ffc0ef9 reinit.h: Fix typo in identification comment
Author: Sawada Masahiko
2015-01-22 12:26:51 -03:00
Robert Haas
f32a1fa462 Add strxfrm_l to list of functions where Windows adds an underscore.
Per buildfarm failure on bowerbird after last night's commit
4ea51cdfe8.

Peter Geoghegan
2015-01-20 10:52:01 -05:00
Robert Haas
4ea51cdfe8 Use abbreviated keys for faster sorting of text datums.
This commit extends the SortSupport infrastructure to allow operator
classes the option to provide abbreviated representations of Datums;
in the case of text, we abbreviate by taking the first few characters
of the strxfrm() blob.  If the abbreviated comparison is insufficent
to resolve the comparison, we fall back on the normal comparator.
This can be much faster than the old way of doing sorting if the
first few bytes of the string are usually sufficient to resolve the
comparison.

There is the potential for a performance regression if all of the
strings to be sorted are identical for the first 8+ characters and
differ only in later positions; therefore, the SortSupport machinery
now provides an infrastructure to abort the use of abbreviation if
it appears that abbreviation is producing comparatively few distinct
keys.  HyperLogLog, a streaming cardinality estimator, is included in
this commit and used to make that determination for text.

Peter Geoghegan, reviewed by me.
2015-01-19 15:28:27 -05:00
Andres Freund
ff44fba46c Replace walsender's latch with the general shared latch.
Relying on the normal shared latch simplifies interrupt/signal
handling because we can rely on all signal handlers setting the proc
latch. That in turn allows us to avoid the use of
ImmediateInterruptOK, which arguably isn't correct because
WaitLatchOrSocket isn't declared to be immediately interruptible.

Also change sections that wait on the walsender's latch to notice
interrupts quicker/more reliably and make them more consistent with
each other.

This is part of a larger "get rid of ImmediateInterruptOK" series.

Discussion: 20150115020335.GZ5245@awork2.anarazel.de
2015-01-17 13:00:42 +01:00
Heikki Linnakangas
9402869160 Advance backend's advertised xmin more aggressively.
Currently, a backend will reset it's PGXACT->xmin value when it doesn't
have any registered snapshots left. That covered the common case that a
transaction in read committed mode runs several queries, one after each
other, as there would be no snapshots active between those queries.
However, if you hold cursors across each of the query, we didn't get a
chance to reset xmin.

To make that better, keep all the registered snapshots in a pairing heap,
ordered by xmin so that it's always quick to find the snapshot with the
smallest xmin. That allows us to advance PGXACT->xmin whenever the oldest
snapshot is deregistered, even if there are others still active.

Per discussion originally started by Jeff Davis back in 2009 and more
recently by Robert Haas.
2015-01-17 01:15:23 +02:00
Tom Lane
8e166e164c Rearrange explain.c's API so callers need not embed sizeof(ExplainState).
The folly of the previous arrangement was just demonstrated: there's no
convenient way to add fields to ExplainState without breaking ABI, even
if callers have no need to touch those fields.  Since we might well need
to do that again someday in back branches, let's change things so that
only explain.c has to have sizeof(ExplainState) compiled into it.  This
costs one extra palloc() per EXPLAIN operation, which is surely pretty
negligible.
2015-01-15 13:39:33 -05:00
Tom Lane
a5cd70dcbc Improve performance of EXPLAIN with large range tables.
As of 9.3, ruleutils.c goes to some lengths to ensure that table and column
aliases used in its output are unique.  Of course this takes more time than
was required before, which in itself isn't fatal.  However, EXPLAIN was set
up so that recalculation of the unique aliases was repeated for each
subexpression printed in a plan.  That results in O(N^2) time and memory
consumption for large plan trees, which did not happen in older branches.

Fortunately, the expensive work is the same across a whole plan tree,
so there is no need to repeat it; we can do most of the initialization
just once per query and re-use it for each subexpression.  This buys
back most (not all) of the performance loss since 9.2.

We need an extra ExplainState field to hold the precalculated deparse
context.  That's no problem in HEAD, but in the back branches, expanding
sizeof(ExplainState) seems risky because third-party extensions might
have local variables of that struct type.  So, in 9.4 and 9.3, introduce
an auxiliary struct to keep sizeof(ExplainState) the same.  We should
refactor the APIs to avoid such local variables in future, but that's
material for a separate HEAD-only commit.

Per gripe from Alexey Bashtanov.  Back-patch to 9.3 where the issue
was introduced.
2015-01-15 13:18:12 -05:00
Andres Freund
6cfd5086e1 Blindly try to fix a warning in s_lock.h when compiling with gcc on HPPA.
The possibly, depending on compiler settings, generated warning was
"warning: `S_UNLOCK' redefined".

The hppa spinlock implementation doesn't follow the rules of s_lock.h
and provides a gcc specific implementation outside of the the part of
the file that's supposed to do that.  It does so to avoid duplication
between the HP compiler and gcc. That unfortunately means that
S_UNLOCK is already defined when the HPPA specific section is reached.

Undefine the generic fallback S_UNLOCK definition inside the HPPA
section. That's far from pretty, but has the big advantage of being
simple. If somebody is interested to fix this in a prettier way...

This presumably got broken in the course of 0709b7ee72.

Discussion: 20150114225919.GY5245@awork2.anarazel.de

Per complaint from Tom Lane.
2015-01-15 13:26:25 +01:00
Andres Freund
59f71a0d0b Add a default local latch for use in signal handlers.
To do so, move InitializeLatchSupport() into the new common process
initialization functions, and add a new global variable MyLatch.

MyLatch is usable as soon InitPostmasterChild() has been called
(i.e. very early during startup). Initially it points to a process
local latch that exists in all processes. InitProcess/InitAuxiliaryProcess
then replaces that local latch with PGPROC->procLatch. During shutdown
the reverse happens.

This is primarily advantageous for two reasons: For one it simplifies
dealing with the shared process latch, especially in signal handlers,
because instead of having to check for MyProc, MyLatch can be used
unconditionally. For another, a later patch that makes FEs/BE
communication use latches, now can rely on the existence of a latch,
even before having gone through InitProcess.

Discussion: 20140927191243.GD5423@alap3.anarazel.de
2015-01-14 18:45:22 +01:00
Andres Freund
31c453165b Commonalize process startup code.
Move common code, that was duplicated in every postmaster child/every
standalone process, into two functions in miscinit.c.  Not only does
that already result in a fair amount of net code reduction but it also
makes it much easier to remove more duplication in the future. The
prime motivation wasn't code deduplication though, but easier addition
of new common code.
2015-01-14 00:33:14 +01:00
Andres Freund
14e8803f10 Add barriers to the latch code.
Since their introduction latches have required barriers in SetLatch
and ResetLatch - but when they were introduced there wasn't any
barrier abstraction. Instead latches were documented to rely on the
callsites to provide barrier semantics.

Now that the barrier support looks halfway complete, add the necessary
barriers to both latch implementations.

Also remove a now superflous lock acquisition from syncrep.c and a
superflous (and insufficient) barrier from freelist.c. There might be
other cases that can now be simplified, but those are the only ones
I've seen on a quick scan.

We might want to backpatch this at some later point, but right now the
barrier infrastructure in the backbranches isn't totally on par with
master.

Discussion: 20150112154026.GB2092@awork2.anarazel.de
2015-01-13 12:58:43 +01:00
Heikki Linnakangas
3dfce37627 Fix typos in comment.
Plus some tiny wordsmithing of not-quite-typos.
2015-01-13 10:32:38 +02:00
Tom Lane
8883bae33b Remove configure test for nonstandard variants of getpwuid_r().
We had code that supposed that some platforms might offer a nonstandard
version of getpwuid_r() with only four arguments.  However, the 5-argument
definition has been standardized at least since the Single Unix Spec v2,
which is our normal reference for what's portable across all Unix-oid
platforms.  (What's more, this wasn't the only pre-standardization version
of getpwuid_r(); my old HPUX 10.20 box has still another signature.)
So let's just get rid of the now-useless configure step.
2015-01-11 12:52:37 -05:00
Tom Lane
080eabe2e8 Fix libpq's behavior when /etc/passwd isn't readable.
Some users run their applications in chroot environments that lack an
/etc/passwd file.  This means that the current UID's user name and home
directory are not obtainable.  libpq used to be all right with that,
so long as the database role name to use was specified explicitly.
But commit a4c8f14364 broke such cases by
causing any failure of pg_fe_getauthname() to be treated as a hard error.
In any case it did little to advance its nominal goal of causing errors
in pg_fe_getauthname() to be reported better.  So revert that and instead
put some real error-reporting code in place.  This requires changes to the
APIs of pg_fe_getauthname() and pqGetpwuid(), since the latter had
departed from the POSIX-specified API of getpwuid_r() in a way that made
it impossible to distinguish actual lookup errors from "no such user".

To allow such failures to be reported, while not failing if the caller
supplies a role name, add a second call of pg_fe_getauthname() in
connectOptions2().  This is a tad ugly, and could perhaps be avoided with
some refactoring of PQsetdbLogin(), but I'll leave that idea for later.
(Note that the complained-of misbehavior only occurs in PQsetdbLogin,
not when using the PQconnect functions, because in the latter we will
never bother to call pg_fe_getauthname() if the user gives a role name.)

In passing also clean up the Windows-side usage of GetUserName(): the
recommended buffer size is 257 bytes, the passed buffer length should
be the buffer size not buffer size less 1, and any error is reported
by GetLastError() not errno.

Per report from Christoph Berg.  Back-patch to 9.4 where the chroot
failure case was introduced.  The generally poor reporting of errors
here is of very long standing, of course, but given the lack of field
complaints about it we won't risk changing these APIs further back
(even though they're theoretically internal to libpq).
2015-01-11 12:35:44 -05:00
Andres Freund
de6429a8fd Provide a generic fallback for pg_compiler_barrier using an extern function.
If the compiler/arch combination does not provide compiler barriers,
provide a fallback. That fallback simply consists out of a function
call into a externally defined function.  That should guarantee
compiler barrierer semantics except for compilers that do inter
translation unit/global optimization - those better provide an actual
compiler barrier.

Hopefully this fixes Tom's report of linker failures due to
pg_compiler_barrier_impl not being provided.

I'm not backpatching this commit as it builds on the new atomics
infrastructure. If we decide an equivalent fix needs to be
backpatched, I'll do so in a separate commit.

Discussion: 27746.1420930690@sss.pgh.pa.us

Per report from Tom Lane.
2015-01-11 01:15:29 +01:00
Andres Freund
db4ec2ffce Fix alignment of pg_atomic_uint64 variables on some 32bit platforms.
I failed to recognize that pg_atomic_uint64 wasn't guaranteed to be 8
byte aligned on some 32bit platforms - which it has to be on some
platforms to guarantee the desired atomicity and which we assert.

As this is all compiler specific code anyway we can just rely on
compiler specific tricks to enforce alignment.

I've been unable to find concrete documentation about the version that
introduce the sunpro alignment support, so that might need additional
guards.

I've verified that this works with gcc x86 32bit, but I don't have
access to any other 32bit environment.

Discussion: op.xpsjdkil0sbe7t@vld-kuci

Per report from Vladimir Koković.
2015-01-11 01:06:37 +01:00
Andres Freund
93be095007 Move comment about sun cc's __machine_rw_barrier being a full barrier.
I'd accidentally written the comment besides the read barrier, instead
of the full barrier, implementation.

Noticed by Oskari Saarenmaa
2015-01-08 13:08:05 +01:00
Noah Misch
894459e59f On Darwin, detect and report a multithreaded postmaster.
Darwin --enable-nls builds use a substitute setlocale() that may start a
thread.  Buildfarm member orangutan experienced BackendList corruption
on account of different postmaster threads executing signal handlers
simultaneously.  Furthermore, a multithreaded postmaster risks undefined
behavior from sigprocmask() and fork().  Emit LOG messages about the
problem and its workaround.  Back-patch to 9.0 (all supported versions).
2015-01-07 22:35:44 -05:00
Bruce Momjian
4baaf863ec Update copyright for 2015
Backpatch certain files through 9.0
2015-01-06 11:43:47 -05:00
Andres Freund
ccb161b66a Add pg_string_endswith as the start of a string helper library in src/common.
Backpatch to 9.3 where src/common was introduce, because a bugfix that
needs to be backpatched, requires the function. Earlier branches will
have to duplicate the code.
2015-01-03 20:54:12 +01:00
Alvaro Herrera
72dd233d3e pg_event_trigger_dropped_objects: Add name/args output columns
These columns can be passed to pg_get_object_address() and used to
reconstruct the dropped objects identities in a remote server containing
similar objects, so that the drop can be replicated.

Reviewed by Stephen Frost, Heikki Linnakangas, Abhijit Menon-Sen, Andres
Freund.
2014-12-30 17:41:46 -03:00
Alvaro Herrera
a676201490 Add pg_identify_object_as_address
This function returns object type and objname/objargs arrays, which can
be passed to pg_get_object_address.  This is especially useful because
the textual representation can be copied to a remote server in order to
obtain the corresponding OID-based address.  In essence, this function
is the inverse of recently added pg_get_object_address().

Catalog version bumped due to the addition of the new function.

Also add docs to pg_get_object_address.
2014-12-30 15:41:50 -03:00
Heikki Linnakangas
930fd68455 Revert the GinMaxItemSize calculation so that we fit 3 tuples per page.
Commit 36a35c55 changed the divisor from 3 to 6, for no apparent reason.
Reducing GinMaxItemSize like that created a dump/reload hazard: loading a
9.3 database to 9.4 might fail with "index row size XXX exceeds maximum 1352
for index ..." error. Revert the change.

While we're at it, make the calculation slightly more accurate. It used to
divide the available space on page by three, then subtract
sizeof(ItemIdData), and finally round down. That's not totally accurate; the
item pointers for the three items are packed tight right after the page
header, but there is alignment padding after the item pointers. Change the
calculation to reflect that, like BTMaxItemSize does. I tested this with
different block sizes on systems with 4- and 8-byte alignment, and the value
after the final MAXALIGN_DOWN was the same with both methods on all
configurations. So this does not make any difference currently, but let's be
tidy.

Also add a comment explaining what the macro does.

This fixes bug #12292 reported by Robert Thaler. Backpatch to 9.4, where the
bug was introduced.
2014-12-30 14:53:11 +02:00
Tom Lane
966115c305 Temporarily revert "Move pg_lzcompress.c to src/common."
This reverts commit 60838df922.
That change needs a bit more thought to be workable.  In view of
the potentially machine-dependent stuff that went in today,
we need all of the buildfarm to be testing those other changes.
2014-12-25 13:22:55 -05:00
Andres Freund
d72731a704 Lockless StrategyGetBuffer clock sweep hot path.
StrategyGetBuffer() has proven to be a bottleneck in a number of
buffer acquisition heavy workloads. To some degree this has already
been alleviated by 5d7962c6, but it still can be quite a heavy
bottleneck.  The problem is that in unfortunate usage patterns a
single StrategyGetBuffer() call will have to look at a large number of
buffers - in turn making it likely that the process will be put to
sleep while still holding the spinlock.

Replace most of the usage of the buffer_strategy_lock spinlock for the
clock sweep by a atomic nextVictimBuffer variable. That variable,
modulo NBuffers, is the current hand of the clock sweep. The buffer
clock-sweep then only needs to acquire the spinlock after a
wraparound. And even then only in the process that did the wrapping
around. That alleviates nearly all the contention on the relevant
spinlock, although significant contention on the cacheline can still
exist.

Reviewed-By: Robert Haas and Amit Kapila

Discussion: 20141010160020.GG6670@alap3.anarazel.de,
    20141027133218.GA2639@awork2.anarazel.de
2014-12-25 18:26:25 +01:00
Andres Freund
ab5194e6f6 Improve LWLock scalability.
The old LWLock implementation had the problem that concurrent lock
acquisitions required exclusively acquiring a spinlock. Often that
could lead to acquirers waiting behind the spinlock, even if the
actual LWLock was free.

The new implementation doesn't acquire the spinlock when acquiring the
lock itself. Instead the new atomic operations are used to atomically
manipulate the state. Only the waitqueue, used solely in the slow
path, is still protected by the spinlock. Check lwlock.c's header for
an explanation about the used algorithm.

For some common workloads on larger machines this can yield
significant performance improvements. Particularly in read mostly
workloads.

Reviewed-By: Amit Kapila and Robert Haas
Author: Andres Freund

Discussion: 20130926225545.GB26663@awork2.anarazel.de
2014-12-25 17:24:30 +01:00
Andres Freund
7882c3b0b9 Convert the PGPROC->lwWaitLink list into a dlist instead of open coding it.
Besides being shorter and much easier to read it changes the logic in
LWLockRelease() to release all shared lockers when waking up any. This
can yield some significant performance improvements - and the fairness
isn't really much worse than before, as we always allowed new shared
lockers to jump the queue.
2014-12-25 17:24:30 +01:00
Andres Freund
570bd2b3fd Add capability to suppress CONTEXT: messages to elog machinery.
Hiding context messages usually is not a good idea - except for rather
verbose debugging/development utensils like LOG_DEBUG. There the
amount of repeated context messages just bloat the log without adding
information.
2014-12-25 17:24:30 +01:00
Fujii Masao
60838df922 Move pg_lzcompress.c to src/common.
Exposing compression and decompression APIs of pglz makes possible its
use by extensions and contrib modules. pglz_decompress contained a call
to elog to emit an error message in case of corrupted data. This function
is changed to return a status code to let its callers return an error instead.

This commit is required for upcoming WAL compression feature so that
the WAL reader facility can decompress the WAL data by using pglz_decompress.

Michael Paquier
2014-12-25 20:46:14 +09:00
Fujii Masao
3b6ca123b5 Remove unused fields from ReindexStmt.
fe263d1 changed the REINDEX logic so that those fields are not used at all,
but forgot to remove them.

Sawada Masahiko
2014-12-24 21:40:47 +09:00
Alvaro Herrera
a609d96778 Revert "Use a bitmask to represent role attributes"
This reverts commit 1826987a46.

The overall design was deemed unacceptable, in discussion following the
previous commit message; we might find some parts of it still
salvageable, but I don't want to be on the hook for fixing it, so let's
wait until we have a new patch.
2014-12-23 15:35:49 -03:00
Alvaro Herrera
d7ee82e50f Add SQL-callable pg_get_object_address
This allows access to get_object_address from SQL, which is useful to
obtain OID addressing information from data equivalent to that emitted
by the parser.  This is necessary infrastructure of a project to let
replication systems propagate object dropping events to remote servers,
where the schema might be different than the server originating the
DROP.

This patch also adds support for OBJECT_DEFAULT to get_object_address;
that is, it is now possible to refer to a column's default value.

Catalog version bumped due to the new function.

Reviewed by Stephen Frost, Heikki Linnakangas, Robert Haas, Andres
Freund, Abhijit Menon-Sen, Adam Brightwell.
2014-12-23 15:31:29 -03:00
Alvaro Herrera
1826987a46 Use a bitmask to represent role attributes
The previous representation using a boolean column for each attribute
would not scale as well as we want to add further attributes.

Extra auxilliary functions are added to go along with this change, to
make up for the lost convenience of access of the old representation.

Catalog version bumped due to change in catalogs and the new functions.

Author: Adam Brightwell, minor tweaks by Álvaro
Reviewed by: Stephen Frost, Andres Freund, Álvaro Herrera
2014-12-23 10:22:09 -03:00
Alvaro Herrera
7eca575d1c get_object_address: separate domain constraints from table constraints
Apart from enabling comments on domain constraints, this enables a
future project to replicate object dropping to remote servers: with the
current mechanism there's no way to distinguish between the two types of
constraints, so there's no way to know what to drop.

Also added support for the domain constraint comments in psql's \dd and
pg_dump.

Catalog version bumped due to the change in ObjectType enum.
2014-12-23 09:06:44 -03:00
Heikki Linnakangas
955557ddcc Move rbtree.c from src/backend/utils/misc to src/backend/lib.
We have other general-purpose data structures in src/backend/lib, so it
seems like a better home for the red-black tree as well.
2014-12-22 17:52:08 +02:00
Heikki Linnakangas
e7032610f7 Use a pairing heap for the priority queue in kNN-GiST searches.
This performs slightly better, uses less memory, and needs slightly less
code in GiST, than the Red-Black tree previously used.

Reviewed by Peter Geoghegan
2014-12-22 12:05:57 +02:00
Alvaro Herrera
0ee98d1cbf pg_event_trigger_dropped_objects: add behavior flags
Add "normal" and "original" flags as output columns to the
pg_event_trigger_dropped_objects() function.  With this it's possible to
distinguish which objects, among those listed, need to be explicitely
referenced when trying to replicate a deletion.

This is necessary so that the list of objects can be pruned to the
minimum necessary to replicate the DROP command in a remote server that
might have slightly different schema (for instance, TOAST tables and
constraints with different names and such.)

Catalog version bumped due to change of function definition.

Reviewed by: Abhijit Menon-Sen, Stephen Frost, Heikki Linnakangas,
Robert Haas.
2014-12-19 15:00:45 -03:00
Andres Freund
9959abb012 Define Assert() et al to ((void)0) to avoid pedantic warnings.
gcc's -Wempty-body warns about the current usage when compiling
postgres without --enable-cassert.
2014-12-19 14:27:45 +01:00
Tom Lane
4a14f13a0a Improve hash_create's API for selecting simple-binary-key hash functions.
Previously, if you wanted anything besides C-string hash keys, you had to
specify a custom hashing function to hash_create().  Nearly all such
callers were specifying tag_hash or oid_hash; which is tedious, and rather
error-prone, since a caller could easily miss the opportunity to optimize
by using hash_uint32 when appropriate.  Replace this with a design whereby
callers using simple binary-data keys just specify HASH_BLOBS and don't
need to mess with specific support functions.  hash_create() itself will
take care of optimizing when the key size is four bytes.

This nets out saving a few hundred bytes of code space, and offers
a measurable performance improvement in tidbitmap.c (which was not
exploiting the opportunity to use hash_uint32 for its 4-byte keys).
There might be some wins elsewhere too, I didn't analyze closely.

In future we could look into offering a similar optimized hashing function
for 8-byte keys.  Under this design that could be done in a centralized
and machine-independent fashion, whereas getting it right for keys of
platform-dependent sizes would've been notationally painful before.

For the moment, the old way still works fine, so as not to break source
code compatibility for loadable modules.  Eventually we might want to
remove tag_hash and friends from the exported API altogether, since there's
no real need for them to be explicitly referenced from outside dynahash.c.

Teodor Sigaev and Tom Lane
2014-12-18 13:36:36 -05:00
Fujii Masao
38628db8d8 Add memory barriers for PgBackendStatus.st_changecount protocol.
st_changecount protocol needs the memory barriers to ensure that
the apparent order of execution is as it desires. Otherwise,
for example, the CPU might rearrange the code so that st_changecount
is incremented twice before the modification on a machine with
weak memory ordering. This surprising result can lead to bugs.

This commit introduces the macros to load and store st_changecount
with the memory barriers. These are called before and after
PgBackendStatus entries are modified or copied into private memory,
in order to prevent CPU from reordering PgBackendStatus access.

Per discussion on pgsql-hackers, we decided not to back-patch this
to 9.4 or before until we get an actual bug report about this.

Patch by me. Review by Robert Haas.
2014-12-18 23:07:51 +09:00
Tom Lane
66709133c7 Fix off-by-one loop count in MapArrayTypeName, and get rid of static array.
MapArrayTypeName would copy up to NAMEDATALEN-1 bytes of the base type
name, which of course is wrong: after prepending '_' there is only room for
NAMEDATALEN-2 bytes.  Aside from being the wrong result, this case would
lead to overrunning the statically allocated work buffer.  This would be a
security bug if the function were ever used outside bootstrap mode, but it
isn't, at least not in any currently supported branches.

Aside from fixing the off-by-one loop logic, this patch gets rid of the
static work buffer by having MapArrayTypeName pstrdup its result; the sole
caller was already doing that, so this just requires moving the pstrdup
call.  This saves a few bytes but mainly it makes the API a lot cleaner.

Back-patch on the off chance that there is some third-party code using
MapArrayTypeName with less-secure input.  Pushing pstrdup into the function
should not cause any serious problems for such hypothetical code; at worst
there might be a short term memory leak.

Per Coverity scanning.
2014-12-16 15:35:33 -05:00
Heikki Linnakangas
4d65e16a6f Misc comment typo fixes.
Backpatch the applicable parts, just to make backpatching future patches
easier.
2014-12-16 16:37:46 +02:00
Heikki Linnakangas
da9f6a78ef Fix incorrect comment about XLogRecordBlockHeader.data_length field.
It does not include the possible full-page image. While at it, reformat the
comment slightly to make it more readable.

Reported by Rahila Syed
2014-12-16 15:41:58 +02:00
Heikki Linnakangas
4520ba6769 Add point <-> polygon distance operator.
Alexander Korotkov, reviewed by Emre Hasegeli.
2014-12-15 17:06:21 +02:00
Andrew Dunstan
e39b6f953e Add CINE option for CREATE TABLE AS and CREATE MATERIALIZED VIEW
Fabrízio de Royes Mello reviewed by Rushabh Lathia.
2014-12-13 13:56:09 -05:00
Heikki Linnakangas
50f2c0687f Remove duplicate #define
Mark Dilger
2014-12-13 18:22:07 +02:00
Andrew Dunstan
7e354ab9fe Add several generator functions for jsonb that exist for json.
The functions are:
    to_jsonb()
    jsonb_object()
    jsonb_build_object()
    jsonb_build_array()
    jsonb_agg()
    jsonb_object_agg()

Also along the way some better logic is implemented in
json_categorize_type() to match that in the newly implemented
jsonb_categorize_type().

Andrew Dunstan, reviewed by Pavel Stehule and Alvaro Herrera.
2014-12-12 15:31:14 -05:00
Andrew Dunstan
237a882443 Add json_strip_nulls and jsonb_strip_nulls functions.
The functions remove object fields, including in nested objects, that
have null as a value. In certain cases this can lead to considerably
smaller datums, with no loss of semantic information.

Andrew Dunstan, reviewed by Pavel Stehule.
2014-12-12 09:00:43 -05:00
Heikki Linnakangas
b1332e98c4 Put the logic to decide which synchronous standby is active into a function.
This avoids duplicating the code.

Michael Paquier, reviewed by Simon Riggs and me
2014-12-12 14:26:42 +02:00
Simon Riggs
fe263d115a REINDEX SCHEMA
Add new SCHEMA option to REINDEX and reindexdb.

Sawada Masahiko

Reviewed by Michael Paquier and Fabrízio de Royes Mello
2014-12-09 00:28:00 +09:00
Simon Riggs
8001fe67a3 Windows: use GetSystemTimePreciseAsFileTime if available
PostgreSQL on Windows 8 or Windows Server 2012 will now
get high-resolution timestamps by dynamically loading the
GetSystemTimePreciseAsFileTime function. It'll fall back to
to GetSystemTimeAsFileTime if the higher precision variant
isn't found, so the same binaries without problems on older
Windows releases.

No attempt is made to detect the Windows version.  Only the
presence or absence of the desired function is considered.

Craig Ringer
2014-12-08 23:36:06 +09:00
Simon Riggs
618c9430a8 Event Trigger for table_rewrite
Generate a table_rewrite event when ALTER TABLE
attempts to rewrite a table. Provide helper
functions to identify table and reason.

Intended use case is to help assess or to react
to schema changes that might hold exclusive locks
for long periods.

Dimitri Fontaine, triggering an edit by Simon Riggs

Reviewed in detail by Michael Paquier
2014-12-08 00:55:28 +09:00
Peter Eisentraut
e86507d770 Move PG_AUTOCONF_FILENAME definition
Since this is not something that a user should change,
pg_config_manual.h was an inappropriate place for it.

In initdb.c, remove the use of the macro, because utils/guc.h can't be
included by non-backend code.  But we hardcode all the other
configuration file names there, so this isn't a disaster.
2014-12-03 19:54:01 -05:00
Alvaro Herrera
73c986adde Keep track of transaction commit timestamps
Transactions can now set their commit timestamp directly as they commit,
or an external transaction commit timestamp can be fed from an outside
system using the new function TransactionTreeSetCommitTsData().  This
data is crash-safe, and truncated at Xid freeze point, same as pg_clog.

This module is disabled by default because it causes a performance hit,
but can be enabled in postgresql.conf requiring only a server restart.

A new test in src/test/modules is included.

Catalog version bumped due to the new subdirectory within PGDATA and a
couple of new SQL functions.

Authors: Álvaro Herrera and Petr Jelínek

Reviewed to varying degrees by Michael Paquier, Andres Freund, Robert
Haas, Amit Kapila, Fujii Masao, Jaime Casanova, Simon Riggs, Steven
Singer, Peter Eisentraut
2014-12-03 11:53:02 -03:00
Alvaro Herrera
6597ec9be6 Fix typos 2014-12-03 11:52:15 -03:00
Andres Freund
0fd38e1370 Don't skip SQL backends in logical decoding for visibility computation.
The logical decoding patchset introduced PROC_IN_LOGICAL_DECODING flag
PGXACT flag, that allows such backends to be skipped when computing
the xmin horizon/snapshots. That's fine and sensible for walsenders
streaming out logical changes, but not at all fine for SQL backends
doing logical decoding. If the latter set that flag any change they
have performed outside of logical decoding will not be regarded as
visible - which e.g. can lead to that change being vacuumed away.

Note that not setting the flag for SQL backends isn't particularly
bothersome - the SQL backend doesn't do streaming, so it only runs for
a limited amount of time.

Per buildfarm member 'tick' and Alvaro.

Backpatch to 9.4, where logical decoding was introduced.
2014-12-02 23:47:08 +01:00
Tom Lane
1511521a36 Minor cleanup of function declarations for BRIN.
Get rid of PG_FUNCTION_INFO_V1() macros, which are quite inappropriate
for built-in functions (possibly leftovers from testing as a loadable
module?).  Also, fix gratuitous inconsistency between SQL-level and
C-level names of the minmax support functions.
2014-12-02 14:07:54 -05:00
Andrew Dunstan
e09996ff8d Fix hstore_to_json_loose's detection of valid JSON number values.
We expose a function IsValidJsonNumber that internally calls the lexer
for json numbers. That allows us to use the same test everywhere,
instead of inventing a broken test for hstore conversions. The new
function is also used in datum_to_json, replacing the code that is now
moved to the new function.

Backpatch to 9.3 where hstore_to_json_loose was introduced.
2014-12-01 11:28:45 -05:00
Tom Lane
866737c923 Add a #define for the inet overlaps operator.
Extracted from pending inet selectivity patch.  The rest of it isn't
quite ready to commit, but we might as well push this part so the patch
doesn't have to track the moving target of pg_operator.h.
2014-11-30 19:43:43 -05:00
Alvaro Herrera
816e10d800 Fix BRIN operator family definitions
The original definitions were leaving no room for cross-type operators,
so queries that compared a column of one type against something of a
different type were not taking advantage of the index.  Fix by making
the opfamilies more like the ones for Btree, and include a few
cross-type operator classes.

Catalog version bumped.

Per complaints from Hubert Lubaczewski, Mark Wong, Heikki Linnakangas.
2014-11-28 18:09:19 -03:00
Tom Lane
d25367ec4f Add bms_get_singleton_member(), and use it where appropriate.
This patch adds a function that replaces a bms_membership() test followed
by a bms_singleton_member() call, performing both the test and the
extraction of a singleton set's member in one scan of the bitmapset.
The performance advantage over the old way is probably minimal in current
usage, but it seems worthwhile on notational grounds anyway.

David Rowley
2014-11-28 14:16:24 -05:00
Tom Lane
f4e031c662 Add bms_next_member(), and use it where appropriate.
This patch adds a way of iterating through the members of a bitmapset
nondestructively, unlike the old way with bms_first_member().  While
bms_next_member() is very slightly slower than bms_first_member()
(at least for typical-size bitmapsets), eliminating the need to palloc
and pfree a temporary copy of the target bitmapset is a significant win.
So this method should be preferred in all cases where a temporary copy
would be necessary.

Tom Lane, with suggestions from Dean Rasheed and David Rowley
2014-11-28 13:37:25 -05:00
Stephen Frost
143b39c185 Rename pg_rowsecurity -> pg_policy and other fixes
As pointed out by Robert, we should really have named pg_rowsecurity
pg_policy, as the objects stored in that catalog are policies.  This
patch fixes that and updates the column names to start with 'pol' to
match the new catalog name.

The security consideration for COPY with row level security, also
pointed out by Robert, has also been addressed by remembering and
re-checking the OID of the relation initially referenced during COPY
processing, to make sure it hasn't changed under us by the time we
finish planning out the query which has been built.

Robert and Alvaro also commented on missing OCLASS and OBJECT entries
for POLICY (formerly ROWSECURITY or POLICY, depending) in various
places.  This patch fixes that too, which also happens to add the
ability to COMMENT on policies.

In passing, attempt to improve the consistency of messages, comments,
and documentation as well.  This removes various incarnations of
'row-security', 'row-level security', 'Row-security', etc, in favor
of 'policy', 'row level security' or 'row_security' as appropriate.

Happy Thanksgiving!
2014-11-27 01:15:57 -05:00
Heikki Linnakangas
1812ee5767 Remove dead function prototype
It was added in commit efc16ea5, but never defined.
2014-11-26 11:05:54 +02:00
Simon Riggs
aedccb1f6f action_at_recovery_target recovery config option
action_at_recovery_target = pause | promote | shutdown

Petr Jelinek

Reviewed by Muhammad Asif Naeem, Fujji Masao and
Simon Riggs
2014-11-25 20:13:30 +00:00
Tom Lane
bac27394a1 Support arrays as input to array_agg() and ARRAY(SELECT ...).
These cases formerly failed with errors about "could not find array type
for data type".  Now they yield arrays of the same element type and one
higher dimension.

The implementation involves creating functions with API similar to the
existing accumArrayResult() family.  I (tgl) also extended the base family
by adding an initArrayResult() function, which allows callers to avoid
special-casing the zero-inputs case if they just want an empty array as
result.  (Not all do, so the previous calling convention remains valid.)
This allowed simplifying some existing code in xml.c and plperl.c.

Ali Akbar, reviewed by Pavel Stehule, significantly modified by me
2014-11-25 12:21:28 -05:00
Heikki Linnakangas
e453cc2741 Make Port->ssl_in_use available, even when built with !USE_SSL
Code that check the flag no longer need #ifdef's, which is more convenient.
In particular, makes it easier to write extensions that depend on it.

In the passing, modify sslinfo's ssl_is_used function to check ssl_in_use
instead of the OpenSSL specific 'ssl' pointer. It doesn't make any
difference currently, as sslinfo is only compiled when built with OpenSSL,
but seems cleaner anyway.
2014-11-25 09:46:11 +02:00
Robert Haas
f5d9698a84 Add infrastructure to save and restore GUC values.
This is further infrastructure for parallelism.

Amit Khandekar, Noah Misch, Robert Haas
2014-11-24 16:37:56 -05:00
Heikki Linnakangas
0bd624d63b Distinguish XLOG_FPI records generated for hint-bit updates.
Add a new XLOG_FPI_FOR_HINT record type, and use that for full-page images
generated for hint bit updates, when checksums are enabled. The new record
type is replayed exactly the same as XLOG_FPI, but allows them to be tallied
separately e.g. in pg_xlogdump.
2014-11-24 11:09:08 +02:00
Noah Misch
b779168ffe Detect PG_PRINTF_ATTRIBUTE automatically.
This eliminates gobs of "unrecognized format function type" warnings
under MinGW compilers predating GCC 4.4.
2014-11-23 09:34:03 -05:00
Tom Lane
447770404c Rearrange CustomScan API.
Make it work more like FDW plans do: instead of assuming that there are
expressions in a CustomScan plan node that the core code doesn't know
about, insist that all subexpressions that need planner attention be in
a "custom_exprs" list in the Plan representation.  (Of course, the
custom plugin can break the list apart again at executor initialization.)
This lets us revert the parts of the patch that exposed setrefs.c and
subselect.c processing to the outside world.

Also revert the GetSpecialCustomVar stuff in ruleutils.c; that concept
may work in future, but it's far from fully baked right now.
2014-11-21 18:21:46 -05:00
Tom Lane
c2ea2285e9 Simplify API for initially hooking custom-path providers into the planner.
Instead of register_custom_path_provider and a CreateCustomScanPath
callback, let's just provide a standard function hook in set_rel_pathlist.
This is more flexible than what was previously committed, is more like the
usual conventions for planner hooks, and requires less support code in the
core.  We had discussed this design (including centralizing the
set_cheapest() calls) back in March or so, so I'm not sure why it wasn't
done like this already.
2014-11-21 14:05:46 -05:00
Tom Lane
adbfab119b Remove dead code supporting mark/restore in SeqScan, TidScan, ValuesScan.
There seems no prospect that any of this will ever be useful, and indeed
it's questionable whether some of it would work if it ever got called;
it's certainly not been exercised in a very long time, if ever. So let's
get rid of it, and make the comments about mark/restore in execAmi.c less
wishy-washy.

The mark/restore support for Result nodes is also currently dead code,
but that's due to planner limitations not because it's impossible that
it could be useful.  So I left it in.
2014-11-20 20:20:54 -05:00
Tom Lane
a34fa8ee7c Initial code review for CustomScan patch.
Get rid of the pernicious entanglement between planner and executor headers
introduced by commit 0b03e5951b.

Also, rearrange the CustomFoo struct/typedef definitions so that all the
typedef names are seen as used by the compiler.  Without this pgindent
will mess things up a bit, which is not so important perhaps, but it also
removes a bizarre discrepancy between the declaration arrangement used for
CustomExecMethods and that used for CustomScanMethods and
CustomPathMethods.

Clean up the commentary around ExecSupportsMarkRestore to reflect the
rather large change in its API.

Const-ify register_custom_path_provider's argument.  This necessitates
casting away const in the function, but that seems better than forcing
callers of the function to do so (or else not const-ify their method
pointer structs, which was sort of the whole point).

De-export fix_expr_common.  I don't like the exporting of fix_scan_expr
or replace_nestloop_params either, but this one surely has got little
excuse.
2014-11-20 18:36:07 -05:00
Tom Lane
c5111ea9ca Remove no-longer-needed phony typedefs in genbki.h.
Now that we have a policy of hiding varlena catalog fields behind
"#ifdef CATALOG_VARLEN", there is no need for their type names to be
acceptable to the C compiler.  And experimentation shows that it does
not matter to pgindent either.  (If it did, we'd have problems anyway,
since these typedefs are unreferenced so far as the C compiler is
concerned, and find_typedef fails to identify such typedefs.)

Hence, remove the phony typedefs that genbki.h provided to make
some varlena field definitions compilable.

In passing, rearrange #define's into what seemed a more logical order.
2014-11-20 13:16:14 -05:00
Heikki Linnakangas
2c03216d83 Revamp the WAL record format.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.

There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.

This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.

For the convenience of redo routines, XLogReader now disects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-disected format, instead of the plain
XLogRecord.

The new record format also makes the fixed size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compansates for the fact that the new format would otherwise
be more bulky than the old format.

Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
2014-11-20 18:46:41 +02:00
Simon Riggs
606c0123d6 Reduce btree scan overhead for < and > strategies
For <, <=, > and >= strategies, mark the first scan key
as already matched if scanning in an appropriate direction.
If index tuple contains no nulls we can skip the first
re-check for each tuple.

Author: Rajeev Rastogi
Reviewer: Haribabu Kommi
Rework of the code and comments by Simon Riggs
2014-11-18 10:24:55 +00:00
Heikki Linnakangas
dedae6c211 Remove obsolete debugging option, RTDEBUG.
The r-tree AM that used it was removed back in 2005.

Peter Geoghegan
2014-11-18 09:55:05 +02:00
Alvaro Herrera
0f9692b40d Fix relpersistence setting in reindex_index
Buildfarm members with CLOBBER_CACHE_ALWAYS advised us that commit
85b506bbfc was mistaken in setting the relpersistence value of the
index directly in the relcache entry, within reindex_index.  The reason
for the failure is that an invalidation message that comes after mucking
with the relcache entry directly, but before writing it to the catalogs,
would cause the entry to become rebuilt in place from catalogs with the
old contents, losing the update.

Fix by passing the correct persistence value to
RelationSetNewRelfilenode instead; this routine also writes the updated
tuple to pg_class, avoiding the problem.  Suggested by Tom Lane.
2014-11-17 11:23:35 -03:00
Alvaro Herrera
85b506bbfc Get rid of SET LOGGED indexes persistence kludge
This removes ATChangeIndexesPersistence() introduced by f41872d0c1
which was too ugly to live for long.  Instead, the correct persistence
marking is passed all the way down to reindex_index, so that the
transient relation built to contain the index relfilenode can
get marked correctly right from the start.

Author: Fabrízio de Royes Mello
Review and editorialization by Michael Paquier
                                     and Álvaro Herrera
2014-11-15 01:19:49 -03:00
Alvaro Herrera
e4d1e26491 Remove unused InhPaths
Allegedly, the last remaining usages of that struct were removed by
0e99be1c.

Author: Peter Geoghegan
2014-11-15 01:19:39 -03:00
Stephen Frost
80eacaa3cd Clean up includes from RLS patch
The initial patch for RLS mistakenly included headers associated with
the executor and planner bits in rewrite/rowsecurity.h.  Per policy and
general good sense, executor headers should not be included in planner
headers or vice versa.

The include of execnodes.h was a mistaken holdover from previous
versions, while the include of relation.h was used for Relation's
definition, which should have been coming from utils/relcache.h.  This
patch cleans these issues up, adds comments to the RowSecurityPolicy
struct and the RowSecurityConfigType enum, and changes Relation->rsdesc
to Relation->rd_rsdesc to follow Relation field naming convention.

Additionally, utils/rel.h was including rewrite/rowsecurity.h, which
wasn't a great idea since that was pulling in things not really needed
in utils/rel.h (which gets included in quite a few places).  Instead,
use 'struct RowSecurityDesc' for the rd_rsdesc field and add comments
explaining why.

Lastly, add an include into access/nbtree/nbtsort.c for
utils/sortsupport.h, which was evidently missed due to the above mess.

Pointed out by Tom in 16970.1415838651@sss.pgh.pa.us; note that the
concerns regarding a similar situation in the custom-path commit still
need to be addressed.
2014-11-14 17:05:17 -05:00
Heikki Linnakangas
81c4508196 Fix race condition between hot standby and restoring a full-page image.
There was a window in RestoreBackupBlock where a page would be zeroed out,
but not yet locked. If a backend pinned and locked the page in that window,
it saw the zeroed page instead of the old page or new page contents, which
could lead to missing rows in a result set, or errors.

To fix, replace RBM_ZERO with RBM_ZERO_AND_LOCK, which atomically pins,
zeroes, and locks the page, if it's not in the buffer cache already.

In stable branches, the old RBM_ZERO constant is renamed to RBM_DO_NOT_USE,
to avoid breaking any 3rd party extensions that might use RBM_ZERO. More
importantly, this avoids renumbering the other enum values, which would
cause even bigger confusion in extensions that use ReadBufferExtended, but
haven't been recompiled.

Backpatch to all supported versions; this has been racy since hot standby
was introduced.
2014-11-13 20:02:37 +02:00
Robert Haas
c0828b78e9 Move the guts of our Levenshtein implementation into core.
The hope is that we can use this to produce better diagnostics in
some cases.

Peter Geoghegan, reviewed by Michael Paquier, with some further
changes by me.
2014-11-13 12:33:26 -05:00
Fujii Masao
c291503b1c Rename pending_list_cleanup_size to gin_pending_list_limit.
Since this parameter is only for GIN index, it's better to
add "gin" to the parameter name for easier understanding.
2014-11-13 12:14:48 +09:00
Tom Lane
677708032c Explicitly support the case that a plancache's raw_parse_tree is NULL.
This only happens if a client issues a Parse message with an empty query
string, which is a bit odd; but since it is explicitly called out as legal
by our FE/BE protocol spec, we'd probably better continue to allow it.

Fix by adding tests everywhere that the raw_parse_tree field is passed to
functions that don't or shouldn't accept NULL.  Also make it clear in the
relevant comments that NULL is an expected case.

This reverts commits a73c9dbab0 and
2e9650cbcf, which fixed specific crash
symptoms by hacking things at what now seems to be the wrong end, ie the
callee functions.  Making the callees allow NULL is superficially more
robust, but it's not always true that there is a defensible thing for the
callee to do in such cases.  The caller has more context and is better
able to decide what the empty-query case ought to do.

Per followup discussion of bug #11335.  Back-patch to 9.2.  The code
before that is sufficiently different that it would require development
of a separate patch, which doesn't seem worthwhile for what is believed
to be an essentially cosmetic change.
2014-11-12 15:59:01 -05:00
Fujii Masao
1871c89202 Add generate_series(numeric, numeric).
Платон Малюгин
Reviewed by Michael Paquier, Ali Akbar and Marti Raudsepp
2014-11-11 21:44:46 +09:00
Fujii Masao
a1b395b6a2 Add GUC and storage parameter to set the maximum size of GIN pending list.
Previously the maximum size of GIN pending list was controlled only by
work_mem. But the reasonable value of work_mem and the reasonable size
of the list are basically not the same, so it was not appropriate to
control both of them by only one GUC, i.e., work_mem. This commit
separates new GUC, pending_list_cleanup_size, from work_mem to allow
users to control only the size of the list.

Also this commit adds pending_list_cleanup_size as new storage parameter
to allow users to specify the size of the list per index. This is useful,
for example, when users want to increase the size of the list only for
the GIN index which can be updated heavily, and decrease it otherwise.

Reviewed by Etsuro Fujita.
2014-11-11 21:08:21 +09:00
Heikki Linnakangas
ae667f778d Really fix compilation failure on MIPS.
I missed an additional colon in previous patch. Oops. to make that mistake
less likely in the future, add comments as placeholders for unused inputs
and outputs in inline assembly.
2014-11-11 10:25:22 +02:00
Heikki Linnakangas
baf7b3a503 Fix compilation failure on MIPS.
Rémi Zara
2014-11-11 01:06:06 +02:00
Tom Lane
bf7ca15875 Ensure that RowExprs and whole-row Vars produce the expected column names.
At one time it wasn't terribly important what column names were associated
with the fields of a composite Datum, but since the introduction of
operations like row_to_json(), it's important that looking up the rowtype
ID embedded in the Datum returns the column names that users would expect.
That did not work terribly well before this patch: you could get the column
names of the underlying table, or column aliases from any level of the
query, depending on minor details of the plan tree.  You could even get
totally empty field names, which is disastrous for cases like row_to_json().

To fix this for whole-row Vars, look to the RTE referenced by the Var, and
make sure its column aliases are applied to the rowtype associated with
the result Datums.  This is a tad scary because we might have to return
a transient RECORD type even though the Var is declared as having some
named rowtype.  In principle it should be all right because the record
type will still be physically compatible with the named rowtype; but
I had to weaken one Assert in ExecEvalConvertRowtype, and there might be
third-party code containing similar assumptions.

Similarly, RowExprs have to be willing to override the column names coming
from a named composite result type and produce a RECORD when the column
aliases visible at the site of the RowExpr differ from the underlying
table's column names.

In passing, revert the decision made in commit 398f70ec07 to add
an alias-list argument to ExecTypeFromExprList: better to provide that
functionality in a separate function.  This also reverts most of the code
changes in d685814835, which we don't need because we're no longer
depending on the tupdesc found in the child plan node's result slot to be
blessed.

Back-patch to 9.4, but not earlier, since this solution changes the results
in some cases that users might not have realized were buggy.  We'll apply a
more restricted form of this patch in older branches.
2014-11-10 15:21:09 -05:00
Bruce Momjian
67067f9ae3 C comment: mention 1500-02-29 as an invalid date
It is invalid because the Gregorian calendar is used for all years.
2014-11-09 20:50:15 -05:00
Alvaro Herrera
b89ee54e20 Fix some coding issues in BRIN
Reported by David Rowley: variadic macros are a problem.  Get rid of
them using a trick suggested by Tom Lane: add extra parentheses where
needed.  In the future we might decide we don't need the calls at all
and remove them, but it seems appropriate to keep them while this code
is still new.

Also from David Rowley: brininsert() was trying to use a variable before
initializing it.  Fix by moving the brin_form_tuple call (which
initializes the variable) to within the locked section.

Reported by Peter Eisentraut: can't use "new" as a struct member name,
because C++ compilers will choke on it, as reported by cpluspluscheck.
2014-11-08 00:31:03 -03:00
Robert Haas
0b03e5951b Introduce custom path and scan providers.
This allows extension modules to define their own methods for
scanning a relation, and get the core code to use them.  It's
unclear as yet how much use this capability will find, but we
won't find out if we never commit it.

KaiGai Kohei, reviewed at various times and in various levels
of detail by Shigeru Hanada, Tom Lane, Andres Freund, Álvaro
Herrera, and myself.
2014-11-07 17:34:36 -05:00
Robert Haas
5ea86e6e65 Use the sortsupport infrastructure in more cases.
This removes some fmgr overhead from cases such as btree index builds.

Peter Geoghegan, reviewed by Andreas Karlsson and me.
2014-11-07 15:50:55 -05:00
Alvaro Herrera
7516f52594 BRIN: Block Range Indexes
BRIN is a new index access method intended to accelerate scans of very
large tables, without the maintenance overhead of btrees or other
traditional indexes.  They work by maintaining "summary" data about
block ranges.  Bitmap index scans work by reading each summary tuple and
comparing them with the query quals; all pages in the range are returned
in a lossy TID bitmap if the quals are consistent with the values in the
summary tuple, otherwise not.  Normal index scans are not supported
because these indexes do not store TIDs.

As new tuples are added into the index, the summary information is
updated (if the block range in which the tuple is added is already
summarized) or not; in the latter case, a subsequent pass of VACUUM or
the brin_summarize_new_values() function will create the summary
information.

For data types with natural 1-D sort orders, the summary info consists
of the maximum and the minimum values of each indexed column within each
page range.  This type of operator class we call "Minmax", and we
supply a bunch of them for most data types with B-tree opclasses.
Since the BRIN code is generalized, other approaches are possible for
things such as arrays, geometric types, ranges, etc; even for things
such as enum types we could do something different than minmax with
better results.  In this commit I only include minmax.

Catalog version bumped due to new builtin catalog entries.

There's more that could be done here, but this is a good step forwards.

Loosely based on ideas from Simon Riggs; code mostly by Álvaro Herrera,
with contribution by Heikki Linnakangas.

Patch reviewed by: Amit Kapila, Heikki Linnakangas, Robert Haas.
Testing help from Jeff Janes, Erik Rijkers, Emanuel Calvo.

PS:
  The research leading to these results has received funding from the
  European Union's Seventh Framework Programme (FP7/2007-2013) under
  grant agreement n° 318633.
2014-11-07 16:38:14 -03:00
Heikki Linnakangas
2effb72e68 Remove obsolete cases from GiST update redo code.
The code that generated a record to clear the F_TUPLES_DELETED flag hasn't
existed since we got rid of old-style VACUUM FULL. I kept the code that sets
the flag, although it's not used for anything anymore, because it might
still be interesting information for debugging purposes that some tuples
have been deleted from a page.

Likewise, the code to turn the root page from non-leaf to leaf page was
removed when we got rid of old-style VACUUM FULL. Remove the code to replay
that action, too.
2014-11-07 15:13:02 +02:00
Heikki Linnakangas
2076db2aea Move the backup-block logic from XLogInsert to a new file, xloginsert.c.
xlog.c is huge, this makes it a little bit smaller, which is nice. Functions
related to putting together the WAL record are in xloginsert.c, and the
lower level stuff for managing WAL buffers and such are in xlog.c.

Also move the definition of XLogRecord to a separate header file. This
causes churn in the #includes of all the files that write WAL records, and
redo routines, but it avoids pulling in xlog.h into most places.

Reviewed by Michael Paquier, Alvaro Herrera, Andres Freund and Amit Kapila.
2014-11-06 13:55:36 +02:00
Fujii Masao
08309aaf74 Implement IF NOT EXIST for CREATE INDEX.
Fabrízio de Royes Mello, reviewed by Marti Raudsepp, Adam Brightwell and me.
2014-11-06 18:48:33 +09:00
Bruce Momjian
171c377a0a C comment: mention why the Gregorian calendar is used pre-1582 2014-11-06 02:33:05 -05:00
Robert Haas
c30be9787b Fix thinko in commit 2bd9e412f9.
Obviously, every translation unit should not be declaring this
separately.  It needs to be PGDLLIMPORT as well, to avoid breaking
third-party code that uses any of the functions that the commit
mentioned above changed to macros.
2014-11-05 17:12:23 -05:00
Heikki Linnakangas
5028f22f6e Switch to CRC-32C in WAL and other places.
The old algorithm was found to not be the usual CRC-32 algorithm, used by
Ethernet et al. We were using a non-reflected lookup table with code meant
for a reflected lookup table. That's a strange combination that AFAICS does
not correspond to any bit-wise CRC calculation, which makes it difficult to
reason about its properties. Although it has worked well in practice, seems
safer to use a well-known algorithm.

Since we're changing the algorithm anyway, we might as well choose a
different polynomial. The Castagnoli polynomial has better error-correcting
properties than the traditional CRC-32 polynomial, even if we had
implemented it correctly. Another reason for picking that is that some new
CPUs have hardware support for calculating CRC-32C, but not CRC-32, let
alone our strange variant of it. This patch doesn't add any support for such
hardware, but a future patch could now do that.

The old algorithm is kept around for tsquery and pg_trgm, which use the
values in indexes that need to remain compatible so that pg_upgrade works.
While we're at it, share the old lookup table for CRC-32 calculation
between hstore, ltree and core. They all use the same table, so might as
well.
2014-11-04 11:39:48 +02:00
Heikki Linnakangas
404bc51cde Remove support for 64-bit CRC.
It hasn't been used for anything for a long time.
2014-11-04 11:33:08 +02:00
Robert Haas
585e0b9b27 pqmq.h needs to include something that defines StringInfo.
Reported by Peter Eisentraut.
2014-11-03 12:24:42 -05:00
Robert Haas
2bd9e412f9 Support frontend-backend protocol communication using a shm_mq.
A background worker can use pq_redirect_to_shm_mq() to direct protocol
that would normally be sent to the frontend to a shm_mq so that another
process may read them.

The receiving process may use pq_parse_errornotice() to parse an
ErrorResponse or NoticeResponse from the background worker and, if
it wishes, ThrowErrorData() to propagate the error (with or without
further modification).

Patch by me.  Review by Andres Freund.
2014-10-31 12:02:40 -04:00
Robert Haas
f7102b0463 Extend dsm API with a new function dsm_unpin_mapping.
This reassociates a dynamic shared memory handle previous passed to
dsm_pin_mapping with the current resource owner, so that it will be
cleaned up at the end of the current query.

Patch by me.  Review of the function name by Andres Freund, Amit
Kapila, Jim Nasby, Petr Jelinek, and Álvaro Herrera.
2014-10-30 14:55:23 -04:00
Tom Lane
fd0f651a86 Test IsInTransactionChain, not IsTransactionBlock, in vac_update_relstats.
As noted by Noah Misch, my initial cut at fixing bug #11638 didn't cover
all cases where ANALYZE might be invoked in an unsafe context.  We need to
test the result of IsInTransactionChain not IsTransactionBlock; which is
notationally a pain because IsInTransactionChain requires an isTopLevel
flag, which would have to be passed down through several levels of callers.
I chose to pass in_outer_xact (ie, the result of IsInTransactionChain)
rather than isTopLevel per se, as that seemed marginally more apropos
for the intermediate functions to know about.
2014-10-30 13:04:06 -04:00
Robert Haas
6057c212f3 "Pin", rather than "keep", dynamic shared memory mappings and segments.
Nobody seemed concerned about this naming when it originally went in,
but there's a pending patch that implements the opposite of
dsm_keep_mapping, and the term "unkeep" was judged unpalatable.
"unpin" has existing precedent in the PostgreSQL code base, and the
English language, so use this terminology instead.

Per discussion, back-patch to 9.4.
2014-10-30 11:35:55 -04:00
Robert Haas
6cb4afff33 Avoid setup work for invalidation messages at start-of-(sub)xact.
Instead of initializing a new TransInvalidationInfo for every
transaction or subtransaction, we can just do it for those
transactions or subtransactions that actually need to queue
invalidation messages.  That also avoids needing to free those
entries at the end of a transaction or subtransaction that does
not generate any invalidation messages, which is by far the
common case.

Patch by me.  Review by Simon Riggs and Andres Freund.
2014-10-29 12:35:19 -04:00
Bruce Momjian
a4da35a0d2 Add variable names to two LWLock C prototypes
Previously only the variable types appeared.
2014-10-27 04:45:57 -04:00
Andres Freund
4a54b99e9c Add native compiler and memory barriers for solaris studio.
Discussion: 20140925133459.GB9633@alap3.anarazel.de
Author: Oskari Saarenmaa
2014-10-25 11:11:39 +02:00
Robert Haas
5ac372fc1a Add a function to get the authenticated user ID.
Previously, this was not exposed outside of miscinit.c.  It is needed
for the pending pg_background patch, and will also be needed for
parallelism.  Without it, there's no way for a background worker to
re-create the exact authentication environment that was present in the
process that started it, which could lead to security exposures.
2014-10-23 08:18:45 -04:00
Andres Freund
11abd6c90f Renumber CHECKPOINT_* flags.
Commit 7dbb606938 added a new CHECKPOINT_FLUSH_ALL flag. As that
commit needed to be backpatched I didn't change the numeric values of
the existing flags as that could lead to nastly problems if any
external code issued checkpoints. That's not a concern on master, so
renumber them there.

Also add a comment about CHECKPOINT_FLUSH_ALL above
CreateCheckPoint().
2014-10-21 00:20:08 +02:00
Andres Freund
7dbb606938 Flush unlogged table's buffers when copying or moving databases.
CREATE DATABASE and ALTER DATABASE .. SET TABLESPACE copy the source
database directory on the filesystem level. To ensure the on disk
state is consistent they block out users of the affected database and
force a checkpoint to flush out all data to disk. Unfortunately, up to
now, that checkpoint didn't flush out dirty buffers from unlogged
relations.

That bug means there could be leftover dirty buffers in either the
template database, or the database in its old location. Leading to
problems when accessing relations in an inconsistent state; and to
possible problems during shutdown in the SET TABLESPACE case because
buffers belonging files that don't exist anymore are flushed.

This was reported in bug #10675 by Maxim Boguk.

Fix by Pavan Deolasee, modified somewhat by me. Reviewed by MauMau and
Fujii Masao.

Backpatch to 9.1 where unlogged tables were introduced.
2014-10-20 23:43:46 +02:00
Andrew Dunstan
af2b8fd057 Correct volatility markings of a few json functions.
json_agg and json_object_agg and their associated transition functions
should have been marked as stable rather than immutable, as they call IO
functions indirectly. Changing this probably isn't going to make much
difference, as you can't use an aggregate function in an index
expression, but we should be correct nevertheless.

json_object, on the other hand, should be marked immutable rather than
stable, as it does not call IO functions.

As discussed on -hackers, this change is being made without bumping the
catalog version, as we don't want to do that at this stage of the  cycle,
and  the changes are very unlikely to affect anyone.
2014-10-20 15:31:05 -04:00
Tom Lane
60f8133dc9 Declare mkdtemp() only if we're providing it.
Follow our usual style of providing an "extern" for a standard library
function only when we're also providing the implementation.  This avoids
issues when the system headers declare the function slightly differently
than we do, as noted by Caleb Welton.

We might have to go to the extent of probing to see if the system headers
declare the function, but let's not do that until it's demonstrated to be
necessary.

Oversight in commit 9e6b1bf258.  Back-patch
to all supported branches, as that was.
2014-10-17 22:55:20 -04:00
Tom Lane
b2cbced9ee Support timezone abbreviations that sometimes change.
Up to now, PG has assumed that any given timezone abbreviation (such as
"EDT") represents a constant GMT offset in the usage of any particular
region; we had a way to configure what that offset was, but not for it
to be changeable over time.  But, as with most things horological, this
view of the world is too simplistic: there are numerous regions that have
at one time or another switched to a different GMT offset but kept using
the same timezone abbreviation.  Almost the entire Russian Federation did
that a few years ago, and later this month they're going to do it again.
And there are similar examples all over the world.

To cope with this, invent the notion of a "dynamic timezone abbreviation",
which is one that is referenced to a particular underlying timezone
(as defined in the IANA timezone database) and means whatever it currently
means in that zone.  For zones that use or have used daylight-savings time,
the standard and DST abbreviations continue to have the property that you
can specify standard or DST time and get that time offset whether or not
DST was theoretically in effect at the time.  However, the abbreviations
mean what they meant at the time in question (or most recently before that
time) rather than being absolutely fixed.

The standard abbreviation-list files have been changed to use this behavior
for abbreviations that have actually varied in meaning since 1970.  The
old simple-numeric definitions are kept for abbreviations that have not
changed, since they are a bit faster to resolve.

While this is clearly a new feature, it seems necessary to back-patch it
into all active branches, because otherwise use of Russian zone
abbreviations is going to become even more problematic than it already was.
This change supersedes the changes in commit 513d06ded et al to modify the
fixed meanings of the Russian abbreviations; since we've not shipped that
yet, this will avoid an undesirably incompatible (not to mention incorrect)
change in behavior for timestamps between 2011 and 2014.

This patch makes some cosmetic changes in ecpglib to keep its usage of
datetime lookup tables as similar as possible to the backend code, but
doesn't do anything about the increasingly obsolete set of timezone
abbreviation definitions that are hard-wired into ecpglib.  Whatever we
do about that will likely not be appropriate material for back-patching.
Also, a potential free() of a garbage pointer after an out-of-memory
failure in ecpglib has been fixed.

This patch also fixes pre-existing bugs in DetermineTimeZoneOffset() that
caused it to produce unexpected results near a timezone transition, if
both the "before" and "after" states are marked as standard time.  We'd
only ever thought about or tested transitions between standard and DST
time, but that's not what's happening when a zone simply redefines their
base GMT offset.

In passing, update the SGML documentation to refer to the Olson/zoneinfo/
zic timezone database as the "IANA" database, since it's now being
maintained under the auspices of IANA.
2014-10-16 15:22:10 -04:00
Tom Lane
90063a7612 Print planning time only in EXPLAIN ANALYZE, not plain EXPLAIN.
We've gotten enough push-back on that change to make it clear that it
wasn't an especially good idea to do it like that.  Revert plain EXPLAIN
to its previous behavior, but keep the extra output in EXPLAIN ANALYZE.
Per discussion.

Internally, I set this up as a separate flag ExplainState.summary that
controls printing of planning time and execution time.  For now it's
just copied from the ANALYZE option, but we could consider exposing it
to users.
2014-10-15 18:50:13 -04:00
Kevin Grittner
30d7ae3c76 Increase number of hash join buckets for underestimate.
If we expect batching at the very beginning, we size nbuckets for
"full work_mem" (see how many tuples we can get into work_mem,
while not breaking NTUP_PER_BUCKET threshold).

If we expect to be fine without batching, we start with the 'right'
nbuckets and track the optimal nbuckets as we go (without actually
resizing the hash table). Once we hit work_mem (considering the
optimal nbuckets value), we keep the value.

At the end of the first batch, we check whether (nbuckets !=
nbuckets_optimal) and resize the hash table if needed. Also, we
keep this value for all batches (it's OK because it assumes full
work_mem, and it makes the batchno evaluation trivial). So the
resize happens only once.

There could be cases where it would improve performance to allow
the NTUP_PER_BUCKET threshold to be exceeded to keep everything in
one batch rather than spilling to a second batch, but attempts to
generate such a case have so far been unsuccessful; that issue may
be addressed with a follow-on patch after further investigation.

Tomas Vondra with minor format and comment cleanup by me
Reviewed by Robert Haas, Heikki Linnakangas, and Kevin Grittner
2014-10-13 10:16:36 -05:00
Alvaro Herrera
7b1c2a0f20 Split builtins.h to a new header ruleutils.h
The new header contains many prototypes for functions in ruleutils.c
that are not exposed to the SQL level.

Reviewed by Andres Freund and Michael Paquier.
2014-10-08 18:10:47 -03:00
Robert Haas
7bb0e97407 Extend shm_mq API with new functions shm_mq_sendv, shm_mq_set_handle.
shm_mq_sendv sends a message to the queue assembled from multiple
locations.  This is expected to be used by forthcoming patches to
allow frontend/backend protocol messages to be sent via shm_mq, but
might be useful for other purposes as well.

shm_mq_set_handle associates a BackgroundWorkerHandle with an
already-existing shm_mq_handle.  This solves a timing problem when
creating a shm_mq to communicate with a newly-launched background
worker: if you attach to the queue first, and the background worker
fails to start, you might block forever trying to do I/O on the queue;
but if you start the background worker first, but then die before
attaching to the queue, the background worrker might block forever
trying to do I/O on the queue.  This lets you attach before starting
the worker (so that the worker is protected) and then associate the
BackgroundWorkerHandle later (so that you are also protected).

Patch by me, reviewed by Stephen Frost.
2014-10-08 14:38:31 -04:00
Alvaro Herrera
df630b0dd5 Implement SKIP LOCKED for row-level locks
This clause changes the behavior of SELECT locking clauses in the
presence of locked rows: instead of causing a process to block waiting
for the locks held by other processes (or raise an error, with NOWAIT),
SKIP LOCKED makes the new reader skip over such rows.  While this is not
appropriate behavior for general purposes, there are some cases in which
it is useful, such as queue-like tables.

Catalog version bumped because this patch changes the representation of
stored rules.

Reviewed by Craig Ringer (based on a previous attempt at an
implementation by Simon Riggs, who also provided input on the syntax
used in the current patch), David Rowley, and Álvaro Herrera.

Author: Thomas Munro
2014-10-07 17:23:34 -03:00
Robert Haas
9019264f2e Still another typo fix for 0709b7ee72.
Buildfarm member anole caught this one.
2014-10-03 11:39:38 -04:00
Robert Haas
3acc10c997 Increase the number of buffer mapping partitions to 128.
Testing by Amit Kapila, Andres Freund, and myself, with and without
other patches that also aim to improve scalability, seems to indicate
that this change is a significant win over the current value and over
smaller values such as 64.  It's not clear how high we can push this
value before it starts to have negative side-effects elsewhere, but
going this far looks OK.
2014-10-02 13:58:50 -04:00
Andres Freund
952872698d Install all headers for the new atomics API.
Previously, by mistake, only atomics.h was installed.

Kohei KaiGai
2014-10-02 16:52:21 +02:00
Tom Lane
5a6c168c78 Fix some more problems with nested append relations.
As of commit a87c72915 (which later got backpatched as far as 9.1),
we're explicitly supporting the notion that append relations can be
nested; this can occur when UNION ALL constructs are nested, or when
a UNION ALL contains a table with inheritance children.

Bug #11457 from Nelson Page, as well as an earlier report from Elvis
Pranskevichus, showed that there were still nasty bugs associated with such
cases: in particular the EquivalenceClass mechanism could try to generate
"join" clauses connecting an appendrel child to some grandparent appendrel,
which would result in assertion failures or bogus plans.

Upon investigation I concluded that all current callers of
find_childrel_appendrelinfo() need to be fixed to explicitly consider
multiple levels of parent appendrels.  The most complex fix was in
processing of "broken" EquivalenceClasses, which are ECs for which we have
been unable to generate all the derived equality clauses we would like to
because of missing cross-type equality operators in the underlying btree
operator family.  That code path is more or less entirely untested by
the regression tests to date, because no standard opfamilies have such
holes in them.  So I wrote a new regression test script to try to exercise
it a bit, which turned out to be quite a worthwhile activity as it exposed
existing bugs in all supported branches.

The present patch is essentially the same as far back as 9.2, which is
where parameterized paths were introduced.  In 9.0 and 9.1, we only need
to back-patch a small fragment of commit 5b7b5518d, which fixes failure to
propagate out the original WHERE clauses when a broken EC contains constant
members.  (The regression test case results show that these older branches
are noticeably stupider than 9.2+ in terms of the quality of the plans
generated; but we don't really care about plan quality in such cases,
only that the plan not be outright wrong.  A more invasive fix in the
older branches would not be a good idea anyway from a plan-stability
standpoint.)
2014-10-01 19:31:12 -04:00
Heikki Linnakangas
5fa6c81a43 Remove num_xloginsert_locks GUC, replace with a #define
I left the GUC in place for the beta period, so that people could experiment
with different values. No-one's come up with any data that a different value
would be better under some circumstances, so rather than try to document to
users what the GUC, let's just hard-code the current value, 8.
2014-10-01 16:42:26 +03:00
Andres Freund
ef8863844b Rename CACHE_LINE_SIZE to PG_CACHE_LINE_SIZE.
As noted in http://bugs.debian.org/763098 there is a conflict between
postgres' definition of CACHE_LINE_SIZE and the definition by various
*bsd platforms. It's debatable who has the right to define such a
name, but postgres' use was only introduced in 375d8526f2 (9.4), so
it seems like a good idea to rename it.

Discussion: 20140930195756.GC27407@msg.df7cb.de

Per complaint of Christoph Berg in the above email, although he's not
the original bug reporter.

Backpatch to 9.4 where the define was introduced.
2014-10-01 12:17:03 +02:00
Stephen Frost
c8a026e4f1 Revert 95d737ff to add 'ignore_nulls'
Per discussion, revert the commit which added 'ignore_nulls' to
row_to_json.  This capability would be better added as an independent
function rather than being bolted on to row_to_json.  Additionally,
the implementation didn't address complex JSON objects, and so was
incomplete anyway.

Pointed out by Tom and discussed with Andrew and Robert.
2014-09-29 13:32:22 -04:00
Tom Lane
def4c28cf9 Change JSONB's on-disk format for improved performance.
The original design used an array of offsets into the variable-length
portion of a JSONB container.  However, such an array is basically
uncompressible by simple compression techniques such as TOAST's LZ
compressor.  That's bad enough, but because the offset array is at the
front, it tended to trigger the give-up-after-1KB heuristic in the TOAST
code, so that the entire JSONB object was stored uncompressed; which was
the root cause of bug #11109 from Larry White.

To fix without losing the ability to extract a random array element in O(1)
time, change this scheme so that most of the JEntry array elements hold
lengths rather than offsets.  With data that's compressible at all, there
tend to be fewer distinct element lengths, so that there is scope for
compression of the JEntry array.  Every N'th entry is still an offset.
To determine the length or offset of any specific element, we might have
to examine up to N preceding JEntrys, but that's still O(1) so far as the
total container size is concerned.  Testing shows that this cost is
negligible compared to other costs of accessing a JSONB field, and that
the method does largely fix the incompressible-data problem.

While at it, rearrange the order of elements in a JSONB object so that
it's "all the keys, then all the values" not alternating keys and values.
This doesn't really make much difference right at the moment, but it will
allow providing a fast path for extracting individual object fields from
large JSONB values stored EXTERNAL (ie, uncompressed), analogously to the
existing optimization for substring extraction from large EXTERNAL text
values.

Bump catversion to denote the incompatibility in on-disk format.
We will need to fix pg_upgrade to disallow upgrading jsonb data stored
with 9.4 betas 1 and 2.

Heikki Linnakangas and Tom Lane
2014-09-29 12:29:21 -04:00
Andres Freund
f9f07411a5 Further atomic ops portability improvements and bug fixes.
* Don't play tricks for a more efficient pg_atomic_clear_flag() in the
  generic gcc implementation. The old version was broken on gcc < 4.7
  on !x86 platforms. Per buildfarm member chipmunk.
* Make usage of __atomic() fences depend on HAVE_GCC__ATOMIC_INT32_CAS
  instead of HAVE_GCC__ATOMIC_INT64_CAS - there's platforms with 32bit
  support that don't support 64bit atomics.
* Blindly fix two superflous #endif in generic-xlc.h
* Check for --disable-atomics in platforms but x86.
2014-09-26 15:55:44 +02:00
Andres Freund
a30199b01b Fix a couple occurrences of 'the the' in the new atomics API.
Author: Erik Rijkers
2014-09-26 09:37:20 +02:00
Peter Eisentraut
d11339c099 Fix whitespace 2014-09-26 02:43:46 -04:00
Andres Freund
f18cad9449 Fix atomic ops inline x86 inline assembly for older 32bit gccs.
Some x86 32bit versions of gcc apparently generate references to the
nonexistant %sil register when using when using the r input
constraint, but not with the =q constraint. The latter restricts
allocations to a/b/c/d which should all work.
2014-09-26 02:44:44 +02:00
Andres Freund
88bcdf9da5 Fix atomic ops for x86 gcc compilers that don't understand atomic intrinsics.
Per buildfarm animal locust.
2014-09-26 02:28:52 +02:00
Andres Freund
b64d92f1a5 Add a basic atomic ops API abstracting away platform/architecture details.
Several upcoming performance/scalability improvements require atomic
operations. This new API avoids the need to splatter compiler and
architecture dependent code over all the locations employing atomic
ops.

For several of the potential usages it'd be problematic to maintain
both, a atomics using implementation and one using spinlocks or
similar. In all likelihood one of the implementations would not get
tested regularly under concurrency. To avoid that scenario the new API
provides a automatic fallback of atomic operations to spinlocks. All
properties of atomic operations are maintained. This fallback -
obviously - isn't as fast as just using atomic ops, but it's not bad
either. For one of the future users the atomics ontop spinlocks
implementation was actually slightly faster than the old purely
spinlock using implementation. That's important because it reduces the
fear of regressing older platforms when improving the scalability for
new ones.

The API, loosely modeled after the C11 atomics support, currently
provides 'atomic flags' and 32 bit unsigned integers. If the platform
efficiently supports atomic 64 bit unsigned integers those are also
provided.

To implement atomics support for a platform/architecture/compiler for
a type of atomics 32bit compare and exchange needs to be
implemented. If available and more efficient native support for flags,
32 bit atomic addition, and corresponding 64 bit operations may also
be provided. Additional useful atomic operations are implemented
generically ontop of these.

The implementation for various versions of gcc, msvc and sun studio have
been tested. Additional existing stub implementations for
* Intel icc
* HUPX acc
* IBM xlc
are included but have never been tested. These will likely require
fixes based on buildfarm and user feedback.

As atomic operations also require barriers for some operations the
existing barrier support has been moved into the atomics code.

Author: Andres Freund with contributions from Oskari Saarenmaa
Reviewed-By: Amit Kapila, Robert Haas, Heikki Linnakangas and Álvaro Herrera
Discussion: CA+TgmoYBW+ux5-8Ja=Mcyuy8=VXAnVRHp3Kess6Pn3DMXAPAEA@mail.gmail.com,
    20131015123303.GH5300@awork2.anarazel.de,
    20131028205522.GI20248@awork2.anarazel.de
2014-09-25 23:49:05 +02:00
Robert Haas
5d7962c679 Change locking regimen around buffer replacement.
Previously, we used an lwlock that was held from the time we began
seeking a candidate buffer until the time when we found and pinned
one, which is disastrous for concurrency.  Instead, use a spinlock
which is held just long enough to pop the freelist or advance the
clock sweep hand, and then released.  If we need to advance the clock
sweep further, we reacquire the spinlock once per buffer.

This represents a significant increase in atomic operations around
buffer eviction, but it still wins on many workloads.  On others, it
may result in no gain, or even cause a regression, unless the number
of buffer mapping locks is also increased.  However, that seems like
material for a separate commit.  We may also need to consider other
methods of mitigating contention on this spinlock, such as splitting
it into multiple locks or jumping the clock sweep hand more than one
buffer at a time, but those, too, seem like separate improvements.

Patch by me, inspired by a much larger patch from Amit Kapila.
Reviewed by Andres Freund.
2014-09-25 10:43:24 -04:00
Stephen Frost
6550b901fe Code review for row security.
Buildfarm member tick identified an issue where the policies in the
relcache for a relation were were being replaced underneath a running
query, leading to segfaults while processing the policies to be added
to a query.  Similar to how TupleDesc RuleLocks are handled, add in a
equalRSDesc() function to check if the policies have actually changed
and, if not, swap back the rsdesc field (using the original instead of
the temporairly built one; the whole structure is swapped and then
specific fields swapped back).  This now passes a CLOBBER_CACHE_ALWAYS
for me and should resolve the buildfarm error.

In addition to addressing this, add a new chapter in Data Definition
under Privileges which explains row security and provides examples of
its usage, change \d to always list policies (even if row security is
disabled- but note that it is disabled, or enabled with no policies),
rework check_role_for_policy (it really didn't need the entire policy,
but it did need to be using has_privs_of_role()), and change the field
in pg_class to relrowsecurity from relhasrowsecurity, based on
Heikki's suggestion.  Also from Heikki, only issue SET ROW_SECURITY in
pg_restore when talking to a 9.5+ server, list Bypass RLS in \du, and
document --enable-row-security options for pg_dump and pg_restore.

Lastly, fix a number of minor whitespace and typo issues from Heikki,
Dimitri, add a missing #include, per Peter E, fix a few minor
variable-assigned-but-not-used and resource leak issues from Coverity
and add tab completion for role attribute bypassrls as well.
2014-09-24 16:32:22 -04:00
Andrew Dunstan
b1a52872ae Fix typos in descriptions of json_object functions. 2014-09-24 11:24:42 -04:00
Andres Freund
604f7956b9 Improve code around the recently added rm_identify rmgr callback.
There are four weaknesses in728f152e07f998d2cb4fe5f24ec8da2c3bda98f2:

* append_init() in heapdesc.c was ugly and required that rm_identify
  return values are only valid till the next call. Instead just add a
  couple more switch() cases for the INIT_PAGE cases. Now the returned
  value will always be valid.
* a couple rm_identify() callbacks missed masking xl_info with
  ~XLR_INFO_MASK.
* pg_xlogdump didn't map a NULL rm_identify to UNKNOWN or a similar
  string.
* append_init() was called when id=NULL - which should never actually
  happen. But it's better to be careful.
2014-09-22 17:49:34 +02:00
Stephen Frost
491c029dbc Row-Level Security Policies (RLS)
Building on the updatable security-barrier views work, add the
ability to define policies on tables to limit the set of rows
which are returned from a query and which are allowed to be added
to a table.  Expressions defined by the policy for filtering are
added to the security barrier quals of the query, while expressions
defined to check records being added to a table are added to the
with-check options of the query.

New top-level commands are CREATE/ALTER/DROP POLICY and are
controlled by the table owner.  Row Security is able to be enabled
and disabled by the owner on a per-table basis using
ALTER TABLE .. ENABLE/DISABLE ROW SECURITY.

Per discussion, ROW SECURITY is disabled on tables by default and
must be enabled for policies on the table to be used.  If no
policies exist on a table with ROW SECURITY enabled, a default-deny
policy is used and no records will be visible.

By default, row security is applied at all times except for the
table owner and the superuser.  A new GUC, row_security, is added
which can be set to ON, OFF, or FORCE.  When set to FORCE, row
security will be applied even for the table owner and superusers.
When set to OFF, row security will be disabled when allowed and an
error will be thrown if the user does not have rights to bypass row
security.

Per discussion, pg_dump sets row_security = OFF by default to ensure
that exports and backups will have all data in the table or will
error if there are insufficient privileges to bypass row security.
A new option has been added to pg_dump, --enable-row-security, to
ask pg_dump to export with row security enabled.

A new role capability, BYPASSRLS, which can only be set by the
superuser, is added to allow other users to be able to bypass row
security using row_security = OFF.

Many thanks to the various individuals who have helped with the
design, particularly Robert Haas for his feedback.

Authors include Craig Ringer, KaiGai Kohei, Adam Brightwell, Dean
Rasheed, with additional changes and rework by me.

Reviewers have included all of the above, Greg Smith,
Jeff McCormick, and Robert Haas.
2014-09-19 11:18:35 -04:00
Andres Freund
e5603a2f35 Mark x86's memory barrier inline assembly as clobbering the cpu flags.
x86's memory barrier assembly was marked as clobbering "memory" but
not "cc" even though 'addl' sets various flags. As it turns out gcc on
x86 implicitly assumes "cc" on every inline assembler statement, so
it's not a bug. But as that's poorly documented and might get copied
to architectures or compilers where that's not the case, it seems
better to be precise.

Discussion: 20140919100016.GH4277@alap3.anarazel.de

To keep the code common, backpatch to 9.2 where explicit memory
barriers were introduced.
2014-09-19 17:04:00 +02:00
Andres Freund
728f152e07 Add rmgr callback to name xlog record types for display purposes.
This is primarily useful for the upcoming pg_xlogdump --stats feature,
but also allows to remove some duplicated code in the rmgr_desc
routines.

Due to the separation and harmonization, the output of dipsplayed
records changes somewhat. But since this isn't enduser oriented
content that's ok.

It's potentially desirable to further change pg_xlogdump's display of
records. It previously wasn't possible to show the record type
separately from the description forcing it to be in the last
column. But that's better done in a separate commit.

Author: Abhijit Menon-Sen, slightly editorialized by me
Reviewed-By: Álvaro Herrera, Andres Freund, and Heikki Linnakangas
Discussion: 20140604104716.GA3989@toroid.org
2014-09-19 16:20:29 +02:00
Heikki Linnakangas
77e65bf369 Fix the return type of GIN triConsistent support functions to "char".
They were marked to return a boolean, but they actually return a
GinTernaryValue, which is more like a "char". It makes no practical
difference, as the triConsistent functions cannot be called directly from
SQL because they have "internal" arguments, but this nevertheless seems
more correct.

Also fix the GinTernaryValue name in the documentation. I renamed the enum
earlier, but neglected the docs.

Alexander Korotkov. This is new in 9.4, so backpatch there.
2014-09-16 09:22:33 +03:00
Tom Lane
fe550b2ac2 Invent PGC_SU_BACKEND and mark log_connections/log_disconnections that way.
This new GUC context option allows GUC parameters to have the combined
properties of PGC_BACKEND and PGC_SUSET, ie, they don't change after
session start and non-superusers can't change them.  This is a more
appropriate choice for log_connections and log_disconnections than their
previous context of PGC_BACKEND, because we don't want non-superusers
to be able to affect whether their sessions get logged.

Note: the behavior for log_connections is still a bit odd, in that when
a superuser attempts to set it from PGOPTIONS, the setting takes effect
but it's too late to enable or suppress connection startup logging.
It's debatable whether that's worth fixing, and in any case there is
a reasonable argument for PGC_SU_BACKEND to exist.

In passing, re-pgindent the files touched by this commit.

Fujii Masao, reviewed by Joe Conway and Amit Kapila
2014-09-13 21:01:57 -04:00
Fujii Masao
4ad2a54805 Add GUC to enable logging of replication commands.
Previously replication commands like IDENTIFY_COMMAND were not logged
even when log_statements is set to all. Some users who want to audit
all types of statements were not satisfied with this situation. To
address the problem, this commit adds new GUC log_replication_commands.
If it's enabled, all replication commands are logged in the server log.

There are many ways to allow us to enable that logging. For example,
we can extend log_statement so that replication commands are logged
when it's set to all. But per discussion in the community, we reached
the consensus to add separate GUC for that.

Reviewed by Ian Barwick, Robert Haas and Heikki Linnakangas.
2014-09-13 02:55:45 +09:00
Stephen Frost
95d737ff45 Add 'ignore_nulls' option to row_to_json
Provide an option to skip NULL values in a row when generating a JSON
object from that row with row_to_json.  This can reduce the size of the
JSON object in cases where columns are NULL without really reducing the
information in the JSON object.

This also makes row_to_json into a single function with default values,
rather than having multiple functions.  In passing, change array_to_json
to also be a single function with default values (we don't add an
'ignore_nulls' option yet- it's not clear that there is a sensible
use-case there, and it hasn't been asked for in any case).

Pavel Stehule
2014-09-11 21:23:51 -04:00
Bruce Momjian
36ad1a87a3 Implement mxid_age() to compute multi-xid age
Report by Josh Berkus
2014-09-10 17:13:04 -04:00
Robert Haas
5b26278822 Fix thinko in 0709b7ee72.
Buildfarm member castoroides is unhappy with this, for entirely
understandable reasons.
2014-09-10 14:40:21 -04:00
Heikki Linnakangas
45f6240a8f Pack tuples in a hash join batch densely, to save memory.
Instead of palloc'ing each HashJoinTuple individually, allocate 32kB chunks
and pack the tuples densely in the chunks. This avoids the AllocChunk
header overhead, and the space wasted by standard allocator's habit of
rounding sizes up to the nearest power of two.

This doesn't contain any planner changes, because the planner's estimate of
memory usage ignores the palloc overhead. Now that the overhead is smaller,
the planner's estimates are in fact more accurate.

Tomas Vondra, reviewed by Robert Haas.
2014-09-10 21:24:52 +03:00
Andres Freund
311da16439 Add support for optional_argument to our own getopt_long() implementation.
07c8651dd9 currently causes compilation errors on mscv (and
probably some other) compilers because our getopt_long()
implementation doesn't have support for optional_argument.

Thus implement optional_argument in our fallback implemenation. It's
quite possibly also useful in other cases.

Arguably this needs a configure check for optional_argument, but it
has existed pretty much since getopt_long() was introduced and thus
doesn't seem worth the configure runtime.

Normally I'd would not push a patch this fast, but this allows msvc to
build again and has low risk as only optional_argument behaviour has
changed.

Author: Michael Paquier and Andres Freund

Discussion: CAB7nPqS5VeedSCxrK=QouokbawgGKLpyc1Q++RRFCa_sjcSVrg@mail.gmail.com
2014-09-10 17:21:50 +02:00
Andres Freund
b4c28d1b92 Fix typo in 07c8651dd9 causing WIN32_ONLY_COMPILER builds to fail. 2014-09-10 02:44:39 +02:00
Robert Haas
0709b7ee72 Change the spinlock primitives to function as compiler barriers.
Previously, they functioned as barriers against CPU reordering but not
compiler reordering, an odd API that required extensive use of volatile
everywhere that spinlocks are used.  That's error-prone and has negative
implications for performance, so change it.

In theory, this makes it safe to remove many of the uses of volatile
that we currently have in our code base, but we may find that there are
some bugs in this effort when we do.  In the long run, though, this
should make for much more maintainable code.

Patch by me.  Review by Andres Freund.
2014-09-09 17:48:50 -04:00
Tom Lane
e80252d424 Add width_bucket(anyelement, anyarray).
This provides a convenient method of classifying input values into buckets
that are not necessarily equal-width.  It works on any sortable data type.

The choice of function name is a bit debatable, perhaps, but showing that
there's a relationship to the SQL standard's width_bucket() function seems
more attractive than the other proposals.

Petr Jelinek, reviewed by Pavel Stehule
2014-09-09 15:34:14 -04:00
Andres Freund
50881036b1 Fix typo in solaris spinlock fix.
07968dbfaa missed part of the S_UNLOCK define when building for
sparcv8+.
2014-09-09 13:57:38 +02:00
Andres Freund
07968dbfaa Fix spinlock implementation for some !solaris sparc platforms.
Some Sparc CPUs can be run in various coherence models, ranging from
RMO (relaxed) over PSO (partial) to TSO (total). Solaris has always
run CPUs in TSO mode while in userland, but linux didn't use to and
the various *BSDs still don't. Unfortunately the sparc TAS/S_UNLOCK
were only correct under TSO. Fix that by adding the necessary memory
barrier instructions. On sparcv8+, which should be all relevant CPUs,
these are treated as NOPs if the current consistency model doesn't
require the barriers.

Discussion: 20140630222854.GW26930@awork2.anarazel.de

Will be backpatched to all released branches once a few buildfarm
cycles haven't shown up problems. As I've no access to sparc, this is
blindly written.
2014-09-09 00:47:32 +02:00
Robert Haas
d8d4965dc2 Update comment to reflect commit 1d41739e5a.
Peter Geoghegan
2014-09-04 12:17:10 -04:00
Heikki Linnakangas
f8f4227976 Refactor per-page logic common to all redo routines to a new function.
Every redo routine uses the same idiom to determine what to do to a page:
check if there's a backup block for it, and if not read, the buffer if the
block exists, and check its LSN. Refactor that into a common function,
XLogReadBufferForRedo, making all the redo routines shorter and more
readable.

This has no user-visible effect, and makes no changes to the WAL format.

Reviewed by Andres Freund, Alvaro Herrera, Michael Paquier.
2014-09-02 15:10:28 +03:00
Heikki Linnakangas
26f8b99b24 Silence warning on new versions of clang.
Andres Freund
2014-09-02 14:26:06 +03:00
Andres Freund
5a64cb740d Fix s/pluggins/plugins/ typo in two comments.
Michael Paquier
2014-09-01 12:01:29 +02:00
Bruce Momjian
d5d7d07765 Again update C comments for pg_attribute.attislocal 2014-08-30 10:25:11 -04:00
Andres Freund
4b4b680c3d Make backend local tracking of buffer pins memory efficient.
Since the dawn of time (aka Postgres95) multiple pins of the same
buffer by one backend have been optimized not to modify the shared
refcount more than once. This optimization has always used a NBuffer
sized array in each backend keeping track of a backend's pins.

That array (PrivateRefCount) was one of the biggest per-backend memory
allocations, depending on the shared_buffers setting. Besides the
waste of memory it also has proven to be a performance bottleneck when
assertions are enabled as we make sure that there's no remaining pins
left at the end of transactions. Also, on servers with lots of memory
and a correspondingly high shared_buffers setting the amount of random
memory accesses can also lead to poor cpu cache efficiency.

Because of these reasons a backend's buffers pins are now kept track
of in a small statically sized array that overflows into a hash table
when necessary. Benchmarks have shown neutral to positive performance
results with considerably lower memory usage.

Patch by me, review by Robert Haas.

Discussion: 20140321182231.GA17111@alap3.anarazel.de
2014-08-30 14:03:21 +02:00
Bruce Momjian
c6eaa880ee Update C comment for pg_attribute.attislocal
Indicates if column has ever been local/non-inherited
2014-08-29 19:01:04 -04:00
Tom Lane
6c40f8316e Add min and max aggregates for inet/cidr data types.
Haribabu Kommi, reviewed by Muhammad Asif Naeem
2014-08-28 22:37:58 -04:00
Fujii Masao
9df492664a Revert "Allow units to be specified in relation option setting value."
This reverts commit e23014f3d4.

As the side effect of the reverted commit, when the unit is
specified, the reloption was stored in the catalog with the unit.
This broke pg_dump (specifically, it prevented pg_dump from
outputting restorable backup regarding the reloption) and
turned the buildfarm red. Revert the commit until the fixed
version is ready.
2014-08-29 05:10:47 +09:00
Fujii Masao
e23014f3d4 Allow units to be specified in relation option setting value.
This introduces an infrastructure which allows us to specify the units
like ms (milliseconds) in integer relation option, like GUC parameter.
Currently only autovacuum_vacuum_cost_delay reloption can accept
the units.

Reviewed by Michael Paquier
2014-08-28 15:55:50 +09:00
Alvaro Herrera
1c9701cfe5 Fix FOR UPDATE NOWAIT on updated tuple chains
If SELECT FOR UPDATE NOWAIT tries to lock a tuple that is concurrently
being updated, it might fail to honor its NOWAIT specification and block
instead of raising an error.

Fix by adding a no-wait flag to EvalPlanQualFetch which it can pass down
to heap_lock_tuple; also use it in EvalPlanQualFetch itself to avoid
blocking while waiting for a concurrent transaction.

Authors: Craig Ringer and Thomas Munro, tweaked by Álvaro
http://www.postgresql.org/message-id/51FB6703.9090801@2ndquadrant.com

Per Thomas Munro in the course of his SKIP LOCKED feature submission,
who also provided one of the isolation test specs.

Backpatch to 9.4, because that's as far back as it applies without
conflicts (although the bug goes all the way back).  To that branch also
backpatch Thomas Munro's new NOWAIT test cases, committed in master by
Heikki as commit 9ee16b49f0 .
2014-08-27 19:15:18 -04:00
Andres Freund
5569d75d6a Mark IsBinaryUpgrade as PGDLLIMPORT to fix windows builds after a7ae1dc.
Author: David Rowley
2014-08-26 15:25:18 +02:00
Heikki Linnakangas
0076f264b6 Implement IF NOT EXISTS for CREATE SEQUENCE.
Fabrízio de Royes Mello
2014-08-26 16:18:17 +03:00
Bruce Momjian
73fe87503f rename macro isTempOrToastNamespace to isTempOrTempToastNamespace
Done for clarity
2014-08-25 21:28:19 -04:00
Alvaro Herrera
301fcf33eb Have CREATE TABLE AS and REFRESH return an OID
Other DDL commands are already returning the OID, which is required for
future additional event trigger work.  This is merely making these
commands in line with the rest of utility command support.
2014-08-25 15:32:18 -04:00
Alvaro Herrera
f41872d0c1 Implement ALTER TABLE .. SET LOGGED / UNLOGGED
This enables changing permanent (logged) tables to unlogged and
vice-versa.

(Docs for ALTER TABLE / SET TABLESPACE got shuffled in an order that
hopefully makes more sense than the original.)

Author: Fabrízio de Royes Mello
Reviewed by: Christoph Berg, Andres Freund, Thom Brown
Some tweaking by Álvaro Herrera
2014-08-22 14:27:00 -04:00
Stephen Frost
3c4cf08087 Rework 'MOVE ALL' to 'ALTER .. ALL IN TABLESPACE'
As 'ALTER TABLESPACE .. MOVE ALL' really didn't change the tablespace
but instead changed objects inside tablespaces, it made sense to
rework the syntax and supporting functions to operate under the
'ALTER (TABLE|INDEX|MATERIALIZED VIEW)' syntax and to be in
tablecmds.c.

Pointed out by Alvaro, who also suggested the new syntax.

Back-patch to 9.4.
2014-08-21 19:06:17 -04:00
Heikki Linnakangas
ce486056ec Add #define INT64_MODIFIER for the printf length modifier for 64-bit ints.
We have had INT64_FORMAT and UINT64_FORMAT for a long time, but that's not
good enough if you want something more exotic, like "%20lld".

Abhijit Menon-Sen, per Andres Freund's suggestion.
2014-08-21 09:56:44 +03:00
Tom Lane
e3f9c16838 Fix bogus commutator/negator links for JSONB containment operators.
<@ and @> are each other's commutators, but they were incorrectly marked
as being each other's negators instead.  (This was actually questioned
in a comment in the original commit, but nobody followed through :-(.)
Per bug #11178 from Christian Pronovost.

In passing, fix some JSONB operator descriptions that were randomly
different from the phrasing of every other similar description.

catversion bump for pg_catalog contents change.
2014-08-16 12:53:54 -04:00
Heikki Linnakangas
c07693f0c7 Remove remnants of a JENTRY_ISFIRST flag bit.
I removed the flag earlier, but missed a few references in jsonb.h.
2014-08-15 09:41:28 +03:00
Robert Haas
b34e37bfef Add sortsupport routines for text.
This provides a small but worthwhile speedup when sorting text, at least
in cases to which the sortsupport machinery applies.

Robert Haas and Peter Geoghegan
2014-08-14 12:09:52 -04:00
Peter Eisentraut
1d678bf7bc Add some noreturn attributes based on compiler recommendations 2014-08-13 22:40:48 -04:00
Heikki Linnakangas
680513ab79 Break out OpenSSL-specific code to separate files.
This refactoring is in preparation for adding support for other SSL
implementations, with no user-visible effects. There are now two #defines,
USE_OPENSSL which is defined when building with OpenSSL, and USE_SSL which
is defined when building with any SSL implementation. Currently, OpenSSL is
the only implementation so the two #defines go together, but USE_SSL is
supposed to be used for implementation-independent code.

The libpq SSL code is changed to use a custom BIO, which does all the raw
I/O, like we've been doing in the backend for a long time. That makes it
possible to use MSG_NOSIGNAL to block SIGPIPE when using SSL, which avoids
a couple of syscall for each send(). Probably doesn't make much performance
difference in practice - the SSL encryption is expensive enough to mask the
effect - but it was a natural result of this refactoring.

Based on a patch by Martijn van Oosterhout from 2006. Briefly reviewed by
Alvaro Herrera, Andreas Karlsson, Jeff Janes.
2014-08-11 11:54:19 +03:00
Bruce Momjian
4c6780fd17 pg_upgrade: prevent oid conflicts with new-cluster TOAST tables
Previously, TOAST tables only required in the new cluster could cause
oid conflicts if they were auto-numbered and a later conflicting oid had
to be assigned.

Backpatch through 9.3
2014-08-07 14:56:13 -04:00
Robert Haas
73cfa37afe Add PG_RETURN_UINT16 macro.
Manuel Kniep
2014-08-06 16:11:43 -04:00
Robert Haas
1d41739e5a Don't require sort support functions to provide a comparator.
This could be useful for datatypes like text, where we might want
to optimize for some collations but not others.  However, this patch
doesn't introduce any new sortsupport functions that work this way;
it merely revises the code so that future patches may do so.

Patch by me.  Review by Peter Geoghegan.
2014-08-06 16:06:06 -04:00
Heikki Linnakangas
54685338e3 Move log_newpage and log_newpage_buffer to xlog.c.
log_newpage is used by many indexams, in addition to heap, but for
historical reasons it's always been part of the heapam rmgr. Starting with
9.3, we have another WAL record type for logging an image of a page,
XLOG_FPI. Simplify things by moving log_newpage and log_newpage_buffer to
xlog.c, and switch to using the XLOG_FPI record type.

Bump the WAL version number because the code to replay the old HEAP_NEWPAGE
records is removed.
2014-07-31 16:48:55 +03:00
Alvaro Herrera
0531549801 Avoid uselessly looking up old LOCK_ONLY multixacts
Commit 0ac5ad5134 removed an optimization in multixact.c that skipped
fetching members of MultiXactId that were older than our
OldestVisibleMXactId value.  The reason this was removed is that it is
possible for multixacts that contain updates to be older than that
value.  However, if the caller is certain that the multi does not
contain an update (because the infomask bits say so), it can pass this
info down to GetMultiXactIdMembers, enabling it to use the old
optimization.

Pointed out by Andres Freund in 20131121200517.GM7240@alap2.anarazel.de
2014-07-29 15:41:06 -04:00
Heikki Linnakangas
e74e0906fa Treat 2PC commit/abort the same as regular xacts in recovery.
There were several oversights in recovery code where COMMIT/ABORT PREPARED
records were ignored:

* pg_last_xact_replay_timestamp() (wasn't updated for 2PC commits)
* recovery_min_apply_delay (2PC commits were applied immediately)
* recovery_target_xid (recovery would not stop if the XID used 2PC)

The first of those was reported by Sergiy Zuban in bug #11032, analyzed by
Tom Lane and Andres Freund. The bug was always there, but was masked before
commit d19bd29f07, because COMMIT PREPARED
always created an extra regular transaction that was WAL-logged.

Backpatch to all supported versions (older versions didn't have all the
features and therefore didn't have all of the above bugs).
2014-07-29 11:59:22 +03:00
Magnus Hagander
c0e4520b16 Add option to pg_ctl to choose event source for logging
pg_ctl will log to the Windows event log when it is running as a service,
which is the primary way of running PostgreSQL on Windows. This option
makes it possible to specify which event source to use for this, in order
to separate different instances. The server logging itself is still controlled
by the regular logging parameters, including a separate setting for the event
source. The parameter to pg_ctl only controlls the logging from pg_ctl itself.

MauMau, review in many iterations by Amit Kapila and me.
2014-07-17 12:42:08 +02:00
Tom Lane
f15821eefd Allow join removal in some cases involving a left join to a subquery.
We can remove a left join to a relation if the relation's output is
provably distinct for the columns involved in the join clause (considering
only equijoin clauses) and the relation supplies no variables needed above
the join.  Previously, the join removal logic could only prove distinctness
by reference to unique indexes of a table.  This patch extends the logic
to consider subquery relations, wherein distinctness might be proven by
reference to GROUP BY, DISTINCT, etc.

We actually already had some code to check that a subquery's output was
provably distinct, but it was hidden inside pathnode.c; which was a pretty
bad place for it really, since that file is mostly boilerplate Path
construction and comparison.  Move that code to analyzejoins.c, which is
arguably a more appropriate location, and is certainly the site of the
new usage for it.

David Rowley, reviewed by Simon Riggs
2014-07-15 21:12:43 -04:00
Andrew Dunstan
0f43a55331 json_build_object and json_build_array are stable, not immutable.
These functions indirectly invoke output functions, so they can't be
immutable.

Backpatch to 9.4 where they were introduced.

Catalog version bumped.
2014-07-15 14:24:47 -04:00
Magnus Hagander
c9e1ad7faf Detect presence of SSL_get_current_compression
Apparently we still build against OpenSSL so old that it doesn't
have this function, so add an autoconf check for it to make the
buildfarm happy. If the function doesn't exist, always return
that compression is disabled, since presumably the actual
compression functionality is always missing.

For now, hardcode the function as present on MSVC, since we should
hopefully be well beyond those old versions on that platform.
2014-07-15 18:07:03 +02:00
Alvaro Herrera
346d7be184 Move view reloptions into their own varlena struct
Per discussion after a gripe from me in
http://www.postgresql.org/message-id/20140611194633.GH18688@eldon.alvh.no-ip.org

Jaime Casanova
2014-07-14 17:24:40 -04:00
Andres Freund
626bfad6cc Fix decoding of consecutive MULTI_INSERTs emitted by one heap_multi_insert().
Commit 1b86c81d2d fixed the decoding of toasted columns for the rows
contained in one xl_heap_multi_insert record. But that's not actually
enough, because heap_multi_insert() will actually first toast all
passed in rows and then emit several *_multi_insert records; one for
each page it fills with tuples.

Add a XLOG_HEAP_LAST_MULTI_INSERT flag which is set in
xl_heap_multi_insert->flag denoting that this multi_insert record is
the last emitted by one heap_multi_insert() call. Then use that flag
in decode.c to only set clear_toast_afterwards in the right situation.

Expand the number of rows inserted via COPY in the corresponding
regression test to make sure that more than one heap page is filled
with tuples by one heap_multi_insert() call.

Backpatch to 9.4 like the previous commit.
2014-07-12 14:28:19 +02:00
Peter Eisentraut
80ddd04b4d Fix whitespace 2014-07-11 15:12:11 -04:00
Tom Lane
59efda3e50 Implement IMPORT FOREIGN SCHEMA.
This command provides an automated way to create foreign table definitions
that match remote tables, thereby reducing tedium and chances for error.
In this patch, we provide the necessary core-server infrastructure and
implement the feature fully in the postgres_fdw foreign-data wrapper.
Other wrappers will throw a "feature not supported" error until/unless
they are updated.

Ronan Dunklau and Michael Paquier, additional work by me
2014-07-10 15:01:43 -04:00
Fujii Masao
4cbd128328 Fix typos in comments. 2014-07-07 19:39:42 +09:00
Robert Haas
4893ccd024 Remove swpb-based spinlock implementation for ARMv5 and earlier.
Per recent analysis by Andres Freund, this implementation is in fact
unsafe, because ARMv5 has weak memory ordering, which means tha the
CPU could move loads or stores across the volatile store performed by
the default S_UNLOCK.  We could try to fix this, but have no ARMv5
hardware to test on, so removing support seems better.  We can still
support ARMv5 systems on GCC versions new enough to have built-in
atomics support for this platform, and can also re-add support for
the old way if someone has hardware that can be used to test a fix.
However, since the requirement to use a relatively-new GCC hasn't
been an issue for ARMv6 or ARMv7, which lack the swpb instruction
altogether, perhaps it won't be an issue for ARMv5 either.
2014-07-06 14:56:36 -04:00
Andres Freund
1b86c81d2d Fix decoding of MULTI_INSERTs when rows other than the last are toasted.
When decoding the results of a HEAP2_MULTI_INSERT (currently only
generated by COPY FROM) toast columns for all but the last tuple
weren't replaced by their actual contents before being handed to the
output plugin. The reassembled toast datums where disregarded after
every REORDER_BUFFER_CHANGE_(INSERT|UPDATE|DELETE) which is correct
for plain inserts, updates, deletes, but not multi inserts - there we
generate several REORDER_BUFFER_CHANGE_INSERTs for a single
xl_heap_multi_insert record.

To solve the problem add a clear_toast_afterwards boolean to
ReorderBufferChange's union member that's used by modifications. All
row changes but multi_inserts always set that to true, but
multi_insert sets it only for the last change generated.

Add a regression test covering decoding of multi_inserts - there was
none at all before.

Backpatch to 9.4 where logical decoding was introduced.

Bug found by Petr Jelinek.
2014-07-06 15:58:01 +02:00
Tom Lane
6f5034eda0 Redesign API presented by nodeAgg.c for ordered-set and similar aggregates.
The previous design exposed the input and output ExprContexts of the
Agg plan node, but work on grouping sets has suggested that we'll regret
doing that.  Instead provide more narrowly-defined APIs that can be
implemented in multiple ways, namely a way to get a short-term memory
context and a way to register an aggregate shutdown callback.

Back-patch to 9.4 where the bad APIs were introduced, since we don't
want third-party code using these APIs and then having to change in 9.5.

Andrew Gierth
2014-07-03 18:25:33 -04:00
Andres Freund
a36a8fa376 Rename logical decoding's pg_llog directory to pg_logical.
The old name wasn't very descriptive as of actual contents of the
directory, which are historical snapshots in the snapshots/
subdirectory and mappingdata for rewritten tuples in
mappings/. There's been a fair amount of discussion what would be a
good name. I'm settling for pg_logical because it's likely that
further data around logical decoding and replication will need saving
in the future.

Also add the missing entry for the directory into storage.sgml's list
of PGDATA contents.

Bumps catversion as the data directories won't be compatible.
2014-07-02 21:07:47 +02:00
Tom Lane
15c82efd69 Refactor CREATE/ALTER DATABASE syntax so options need not be keywords.
Most of the existing option names are keywords anyway, but we can get rid
of LC_COLLATE and LC_CTYPE as keywords known to the lexer/grammar.  This
immediately reduces the size of the grammar tables by about 8KB, and will
save more when we add additional CREATE/ALTER DATABASE options in future.

A side effect of the implementation is that the CONNECTION LIMIT option
can now also be spelled CONNECTION_LIMIT.  We choose not to document this,
however.

Vik Fearing, based on a suggestion by me; reviewed by Pavel Stehule
2014-07-01 19:02:21 -04:00
Tom Lane
2e8ce9ae46 Remove some useless code in the configure script.
Almost ten years ago, commit e48322a6d6 broke
the logic in ACX_PTHREAD by looping through all the possible flags rather
than stopping with the first one that would work.  This meant that
$acx_pthread_ok was no longer meaningful after the loop; it would usually
be "no", whether or not we'd found working thread flags.  The reason nobody
noticed is that Postgres doesn't actually use any of the symbols set up
by the code after the loop.  Rather than complicate things some more to
make it work as designed, let's just remove all that dead code, and thereby
save a few cycles in each configure run.
2014-07-01 17:51:53 -04:00
Robert Haas
9f03ca9151 Avoid copying index tuples when building an index.
The previous code, perhaps out of concern for avoid memory leaks, formed
the tuple in one memory context and then copied it to another memory
context.  However, this doesn't appear to be necessary, since
index_form_tuple and the functions it calls take precautions against
leaking memory.  In my testing, building the tuple directly inside the
sort context shaves several percent off the index build time.
Rearrange things so we do that.

Patch by me.  Review by Amit Kapila, Tom Lane, Andres Freund.
2014-07-01 10:34:42 -04:00
Heikki Linnakangas
1c6821be31 Fix and enhance the assertion of no palloc's in a critical section.
The assertion failed if WAL_DEBUG or LWLOCK_STATS was enabled; fix that by
using separate memory contexts for the allocations made within those code
blocks.

This patch introduces a mechanism for marking any memory context as allowed
in a critical section. Previously ErrorContext was exempt as a special case.

Instead of a blanket exception of the checkpointer process, only exempt the
memory context used for the pending ops hash table.
2014-06-30 10:26:00 +03:00
Tom Lane
a749a23d7a Remove use_json_as_text options from json_to_record/json_populate_record.
The "false" case was really quite useless since all it did was to throw
an error; a definition not helped in the least by making it the default.
Instead let's just have the "true" case, which emits nested objects and
arrays in JSON syntax.  We might later want to provide the ability to
emit sub-objects in Postgres record or array syntax, but we'd be best off
to drive that off a check of the target field datatype, not a separate
argument.

For the functions newly added in 9.4, we can just remove the flag arguments
outright.  We can't do that for json_populate_record[set], which already
existed in 9.3, but we can ignore the argument and always behave as if it
were "true".  It helps that the flag arguments were optional and not
documented in any useful fashion anyway.
2014-06-29 13:50:58 -04:00
Andres Freund
51adcaa0df Add cluster_name GUC which is included in process titles if set.
When running several postgres clusters on one OS instance it's often
inconveniently hard to identify which "postgres" process belongs to
which postgres instance.

Add the cluster_name GUC, whose value will be included as part of the
process titles if set. With that processes can more easily identified
using tools like 'ps'.

To avoid problems with encoding mismatches between postgresql.conf,
consoles, and individual databases replace non-ASCII chars in the name
with question marks. The length is limited to NAMEDATALEN to make it
less likely to truncate important information at the end of the
status.

Thomas Munro, with some adjustments by me and review by a host of people.
2014-06-29 14:15:09 +02:00
Andres Freund
a6d488cb53 Remove Alpha and Tru64 support.
Support for running postgres on Alpha hasn't been tested for a long
while. Due to Alpha's uniquely lax cache coherency model it's a hard
to develop for platform (especially blindly!) and thought to be
unlikely to currently work correctly.

As Alpha is the only supported architecture for Tru64 drop support for
it as well. Tru64's support has ended 2012 and it has been in
maintenance-only mode for much longer.

Also remove stray references to __ksr__ and ultrix defines.
2014-06-28 21:46:15 +02:00
Alvaro Herrera
f741300c90 Have multixact be truncated by checkpoint, not vacuum
Instead of truncating pg_multixact at vacuum time, do it only at
checkpoint time.  The reason for doing it this way is twofold: first, we
want it to delete only segments that we're certain will not be required
if there's a crash immediately after the removal; and second, we want to
do it relatively often so that older files are not left behind if
there's an untimely crash.

Per my proposal in
http://www.postgresql.org/message-id/20140626044519.GJ7340@eldon.alvh.no-ip.org
we now execute the truncation in the checkpointer process rather than as
part of vacuum.  Vacuum is in only charge of maintaining in shared
memory the value to which it's possible to truncate the files; that
value is stored as part of checkpoints also, and so upon recovery we can
reuse the same value to re-execute truncate and reset the
oldest-value-still-safe-to-use to one known to remain after truncation.

Per bug reported by Jeff Janes in the course of his tests involving
bug #8673.

While at it, update some comments that hadn't been updated since
multixacts were changed.

Backpatch to 9.3, where persistency of pg_multixact files was
introduced by commit 0ac5ad5134.
2014-06-27 14:43:53 -04:00
Tom Lane
f71136eeeb Get rid of bogus separate pg_proc entries for json_extract_path operators.
These should not have existed to begin with, but there was apparently some
misunderstanding of the purpose of the opr_sanity regression test item
that checks for operator implementation functions with their own comments.
The idea there is to check for unintentional violations of the rule that
operator implementation functions shouldn't be documented separately
.... but for these functions, that is in fact what we want, since the
variadic option is useful and not accessible via the operator syntax.
Get rid of the extra pg_proc entries and fix the regression test and
documentation to be explicit about what we're doing here.
2014-06-26 16:22:15 -07:00
Tom Lane
8b38a538c0 Add Asserts to verify that catalog cache keys are unique and not null.
The catcache code is effectively assuming this already, so let's insist
that the catalog and index are actually declared that way.

Having done that, the comments in indexing.h about non-unique indexes
not being used for catcaches are completely redundant not just mostly so;
and we didn't have such a comment for every such index anyway.  So let's
get rid of them.

Per discussion of whether we should identify primary keys for catalogs.
We might or might not take that further step, but this change in itself
will allow quicker detection of misdeclared catcaches, so it seems worth
doing in any case.
2014-06-20 18:21:05 -04:00
Andres Freund
3bdcf6a5a7 Don't allow to disable backend assertions via the debug_assertions GUC.
The existance of the assert_enabled variable (backing the
debug_assertions GUC) reduced the amount of knowledge some static code
checkers (like coverity and various compilers) could infer from the
existance of the assertion. That could have been solved by optionally
removing the assertion_enabled variable from the Assert() et al macros
at compile time when some special macro is defined, but the resulting
complication doesn't seem to be worth the gain from having
debug_assertions. Recompiling is fast enough.

The debug_assertions GUC is still available, but readonly, as it's
useful when diagnosing problems. The commandline/client startup option
-A, which previously also allowed to enable/disable assertions, has
been removed as it doesn't serve a purpose anymore.

While at it, reduce code duplication in bufmgr.c and localbuf.c
assertions checking for spurious buffer pins. That code had to be
reindented anyway to cope with the assert_enabled removal.
2014-06-20 11:09:17 +02:00
Tom Lane
45b0f35723 Avoid leaking memory while evaluating arguments for a table function.
ExecMakeTableFunctionResult evaluated the arguments for a function-in-FROM
in the query-lifespan memory context.  This is insignificant in simple
cases where the function relation is scanned only once; but if the function
is in a sub-SELECT or is on the inside of a nested loop, any memory
consumed during argument evaluation can add up quickly.  (The potential for
trouble here had been foreseen long ago, per existing comments; but we'd
not previously seen a complaint from the field about it.)  To fix, create
an additional temporary context just for this purpose.

Per an example from MauMau.  Back-patch to all active branches.
2014-06-19 22:14:26 -04:00
Kevin Grittner
bfaa8c665f Fix calculation of PREDICATELOCK_MANAGER_LWLOCK_OFFSET.
Commit ea9df812d8 failed to include
NUM_BUFFER_PARTITIONS in this offset, resulting in a bad offset.
Ultimately this threw off NUM_FIXED_LWLOCKS which is based on
earlier offsets, leading to memory allocation problems.  It seems
likely to have also caused increased LWLOCK contention when
serializable transactions were used, because lightweight locks used
for that overlapped others.

Reported by Amit Kapila with analysis and fix.
Backpatch to 9.4, where the bug was introduced.
2014-06-19 08:40:37 -05:00
Fujii Masao
9ba78fb0b9 Don't allow data_directory to be set in postgresql.auto.conf by ALTER SYSTEM.
data_directory could be set both in postgresql.conf and postgresql.auto.conf so far.
This could cause some problematic situations like circular definition. To avoid such
situations, this commit forbids a user to set data_directory in postgresql.auto.conf.

Backpatch this to 9.4 where ALTER SYSTEM command was introduced.

Amit Kapila, reviewed by Abhijit Menon-Sen, with minor adjustments by me.
2014-06-19 20:31:20 +09:00
Tom Lane
8f889b1083 Implement UPDATE tab SET (col1,col2,...) = (SELECT ...), ...
This SQL-standard feature allows a sub-SELECT yielding multiple columns
(but only one row) to be used to compute the new values of several columns
to be updated.  While the same results can be had with an independent
sub-SELECT per column, such a workaround can require a great deal of
duplicated computation.

The standard actually says that the source for a multi-column assignment
could be any row-valued expression.  The implementation used here is
tightly tied to our existing sub-SELECT support and can't handle other
cases; the Bison grammar would have some issues with them too.  However,
I don't feel too bad about this since other cases can be converted into
sub-SELECTs.  For instance, "SET (a,b,c) = row_valued_function(x)" could
be written "SET (a,b,c) = (SELECT * FROM row_valued_function(x))".
2014-06-18 13:22:34 -04:00
Heikki Linnakangas
b29e715143 Revert accidental change of WAL_DEBUG default.
Oops.
2014-06-17 08:52:41 +03:00
Tom Lane
2146f13408 Avoid recursion when processing simple lists of AND'ed or OR'ed clauses.
Since most of the system thinks AND and OR are N-argument expressions
anyway, let's have the grammar generate a representation of that form when
dealing with input like "x AND y AND z AND ...", rather than generating
a deeply-nested binary tree that just has to be flattened later by the
planner.  This avoids stack overflow in parse analysis when dealing with
queries having more than a few thousand such clauses; and in any case it
removes some rather unsightly inconsistencies, since some parts of parse
analysis were generating N-argument ANDs/ORs already.

It's still possible to get a stack overflow with weirdly parenthesized
input, such as "x AND (y AND (z AND ( ... )))", but such cases are not
mainstream usage.  The maximum depth of parenthesization is already
limited by Bison's stack in such cases, anyway, so that the limit is
probably fairly platform-independent.

Patch originally by Gurjeet Singh, heavily revised by me
2014-06-16 15:55:30 -04:00
Noah Misch
9e6b1bf258 Add mkdtemp() to libpgport.
This function is pervasive on free software operating systems; import
NetBSD's implementation.  Back-patch to 8.4, like the commit that will
harness it.
2014-06-14 09:41:13 -04:00
Heikki Linnakangas
0ef0b6784c Change the signature of rm_desc so that it's passed a XLogRecord.
Just feels more natural, and is more consistent with rm_redo.
2014-06-14 10:46:48 +03:00
Tom Lane
154146d208 Rename lo_create(oid, bytea) to lo_from_bytea().
The previous naming broke the query that libpq's lo_initialize() uses
to collect the OIDs of the server-side functions it requires, because
that query effectively assumes that there is only one function named
lo_create in the pg_catalog schema (and likewise only one lo_open, etc).

While we should certainly make libpq more robust about this, the naive
query will remain in use in the field for the foreseeable future, so it
seems the only workable choice is to use a different name for the new
function.  lo_from_bytea() won a small straw poll.

Back-patch into 9.4 where the new function was introduced.
2014-06-12 15:39:09 -04:00
Andres Freund
e04a9ccd2c Consistency improvements for slot and decoding code.
Change the order of checks in similar functions to be the same; remove
a parameter that's not needed anymore; rename a memory context and
expand a couple of comments.

Per review comments from Amit Kapila
2014-06-12 13:33:27 +02:00
Tom Lane
a24c104b9a Stamp HEAD as 9.5devel.
Let the hacking begin ...
2014-06-10 21:36:13 -04:00
Alvaro Herrera
b0b263baab Wrap multixact/members correctly during extension, take 2
In a50d976254 I already changed this, but got it wrong for the case
where the number of members is larger than the number of entries that
fit in the last page of the last segment.

As reported by Serge Negodyuck in a followup to bug #8673.
2014-06-09 15:17:23 -04:00
Tom Lane
5f93c37805 Add defenses against running with a wrong selection of LOBLKSIZE.
It's critical that the backend's idea of LOBLKSIZE match the way data has
actually been divided up in pg_largeobject.  While we don't provide any
direct way to adjust that value, doing so is a one-line source code change
and various people have expressed interest recently in changing it.  So,
just as with TOAST_MAX_CHUNK_SIZE, it seems prudent to record the value in
pg_control and cross-check that the backend's compiled-in setting matches
the on-disk data.

Also tweak the code in inv_api.c so that fetches from pg_largeobject
explicitly verify that the length of the data field is not more than
LOBLKSIZE.  Formerly we just had Asserts() for that, which is no protection
at all in production builds.  In some of the call sites an overlength data
value would translate directly to a security-relevant stack clobber, so it
seems worth one extra runtime comparison to be sure.

In the back branches, we can't change the contents of pg_control; but we
can still make the extra checks in inv_api.c, which will offer some amount
of protection against running with the wrong value of LOBLKSIZE.
2014-06-05 11:31:06 -04:00
Andres Freund
f0c108560b Consistently spell a replication slot's name as slot_name.
Previously there's been a mix between 'slotname' and 'slot_name'. It's
not nice to be unneccessarily inconsistent in a new feature. As a post
beta1 initdb now is required in the wake of eeca4cd35e, fix the
inconsistencies.
Most the changes won't affect usage of replication slots because the
majority of changes is around function parameter names. The prominent
exception to that is that the recovery.conf parameter
'primary_slotname' is now named 'primary_slot_name'.
2014-06-05 16:29:20 +02:00
Heikki Linnakangas
8776faa81c Adjust SP-GiST WAL record formats to reduce alignment padding.
The way the code was written, the padding was copied from uninitialized
memory areas.. Because the structs are local variables in the code where
the WAL records are constructed, making them larger and zeroing the padding
bytes would not make the code very pretty, so rather than fixing this
directly by zeroing out the padding bytes, it seems more clear to not try to
align the tuples in the WAL records. The redo functions are taught to copy
the tuple header to a local variable to avoid unaligned access.

Stable-branches have the same problem, but we can't change the WAL format
there, so fix in master only. Reading a few random extra bytes at the stack
is harmless in practice, so it's not worth crafting a different
back-patchable fix.

Per reports from Kevin Grittner and Andres Freund, using clang static
analyzer and Valgrind, respectively.
2014-06-05 12:55:35 +03:00
Tom Lane
4c8ab1b91d Add btree and hash opclasses for pg_lsn.
This is needed to allow ORDER BY, DISTINCT, etc to work as expected for
pg_lsn values.

We had previously decided to put this off for 9.5, but in view of commit
eeca4cd35e there's no reason to avoid a
catversion bump for 9.4beta2, and this does make a pretty significant
usability difference for pg_lsn.

Michael Paquier, with fixes from Andres Freund and Tom Lane
2014-06-04 20:45:56 -04:00
Tom Lane
eeca4cd35e Bump PG_CONTROL_VERSION for previous 9.4 changes.
This should have been done in 6bc8ef0b7f
and/or 50e547096c, but better late than
never.  If we don't change this then we risk 9.3 pg_controldata or
pg_resetxlog being inappropriately used against a 9.4 pg_control file,
or vice versa.
2014-06-04 18:16:17 -04:00
Fujii Masao
654e8e4447 Save pg_stat_statements statistics file into $PGDATA/pg_stat directory at shutdown.
187492b6c2 changed pgstat.c so that
the stats files were saved into $PGDATA/pg_stat directory when the server
was shutdowned. But it accidentally forgot to change the location of
pg_stat_statements permanent stats file. This commit fixes pg_stat_statements
so that its stats file is also saved into $PGDATA/pg_stat at shutdown.

Since this fix changes the file layout, we don't back-patch it to 9.3
where this oversight was introduced.
2014-06-04 12:09:45 +09:00
Tom Lane
ec3357a3bc pg_lsn should not be marked typispreferred.
In general it's not a good idea for built-in types in the 'U' category
to be marked preferred; they could draw behavior away from user-defined
types with similarly-named operators.  pg_lsn is probably at low risk
of that right now given the lack of casts between it and other types,
but that doesn't make this marking OK.

Ordinarily we'd bump catversion when changing any predefined catalog
contents like this, but since we're past beta1, the costs of a forced
initdb seem to outweigh the benefits of guaranteed behavioral consistency.
There's not any known behavioral impact today anyway --- this is more
in the nature of being sure there's not problems in future.

Per an off-list complaint from Thomas Fanghaenel.
2014-05-28 00:26:46 -04:00
Tom Lane
b8cc8f9473 Support BSD and e2fsprogs UUID libraries alongside OSSP UUID library.
Allow the contrib/uuid-ossp extension to be built atop any one of these
three popular UUID libraries.  (The extension's name is now arguably a
misnomer, but we'll keep it the same so as not to cause unnecessary
compatibility issues for users.)

We would not normally consider a change like this post-beta1, but the issue
has been forced by our upgrade to autoconf 2.69, whose more rigorous header
checks are causing OSSP's header files to be rejected on some platforms.
It's been foreseen for some time that we'd have to move away from depending
on OSSP UUID due to lack of upstream maintenance, so this is a down payment
on that problem.

While at it, add some simple regression tests, in hopes of catching any
major incompatibilities between the three implementations.

Matteo Beccati, with some further hacking by me
2014-05-27 19:42:08 -04:00
Fujii Masao
06db9cce22 Fix typo in comment.
Erik Rijkers
2014-05-22 16:31:55 +09:00
Tom Lane
f62d417825 Fix unportable setvbuf() usage in initdb.
In yesterday's commit 2dc4f011fd, I tried
to force buffering of stdout/stderr in initdb to be what it is by
default when the program is run interactively on Unix (since that's how
most manual testing is done).  This tripped over the fact that Windows
doesn't support _IOLBF mode.  We dealt with that a long time ago in
syslogger.c by falling back to unbuffered mode on Windows.  Export that
solution in port.h and use it in initdb.

Back-patch to 8.4, like the previous commit.
2014-05-15 15:57:54 -04:00
Heikki Linnakangas
bb38fb0d43 Fix race condition in preparing a transaction for two-phase commit.
To lock a prepared transaction's shared memory entry, we used to mark it
with the XID of the backend. When the XID was no longer active according
to the proc array, the entry was implicitly considered as not locked
anymore. However, when preparing a transaction, the backend's proc array
entry was cleared before transfering the locks (and some other state) to
the prepared transaction's dummy PGPROC entry, so there was a window where
another backend could finish the transaction before it was in fact fully
prepared.

To fix, rewrite the locking mechanism of global transaction entries. Instead
of an XID, just have simple locked-or-not flag in each entry (we store the
locking backend's backend id rather than a simple boolean, but that's just
for debugging purposes). The backend is responsible for explicitly unlocking
the entry, and to make sure that that happens, install a callback to unlock
it on abort or process exit.

Backpatch to all supported versions.
2014-05-15 16:37:50 +03:00
Heikki Linnakangas
a82a17475d Silence warnings about redefining popen on Mingw-w64.
Mingw-w64 headers map popen/pclose to _popen and _pclose, but we want to use
our popen wrapper rather than the Mingw-w64. #undef the Mingw's version.
2014-05-15 12:18:49 +03:00
Tom Lane
b23b0f5588 Code review for recent changes in relcache.c.
rd_replidindex should be managed the same as rd_oidindex, and rd_keyattr
and rd_idattr should be managed like rd_indexattr.  Omissions in this area
meant that the bitmapsets computed for rd_keyattr and rd_idattr would be
leaked during any relcache flush, resulting in a slow but permanent leak in
CacheMemoryContext.  There was also a tiny probability of relcache entry
corruption if we ran out of memory at just the wrong point in
RelationGetIndexAttrBitmap.  Otherwise, the fields were not zeroed where
expected, which would not bother the code any AFAICS but could greatly
confuse anyone examining the relcache entry while debugging.

Also, create an API function RelationGetReplicaIndex rather than letting
non-relcache code be intimate with the mechanisms underlying caching of
that value (we won't even mention the memory leak there).

Also, fix a relcache flush hazard identified by Andres Freund:
RelationGetIndexAttrBitmap must not assume that rd_replidindex stays valid
across index_open.

The aspects of this involving rd_keyattr date back to 9.3, so back-patch
those changes.
2014-05-14 14:56:08 -04:00
Tom Lane
e6df2e1be6 Stamp 9.4beta1. 2014-05-11 17:16:48 -04:00
Tom Lane
12e611d43e Rename jsonb_hash_ops to jsonb_path_ops.
There's no longer much pressure to switch the default GIN opclass for
jsonb, but there was still some unhappiness with the name "jsonb_hash_ops",
since hashing is no longer a distinguishing property of that opclass,
and anyway it seems like a relatively minor detail.  At the suggestion of
Heikki Linnakangas, we'll use "jsonb_path_ops" instead; that captures the
important characteristic that each index entry depends on the entire path
from the document root to the indexed value.

Also add a user-facing explanation of the implementation properties of
these two opclasses.
2014-05-11 12:06:04 -04:00
Tom Lane
bdf9dd4db7 Fix typcategory labeling of jsonb.
Dunno who had the cute idea of labeling jsonb as typcategory 'C',
but it is not a composite type.  Label it 'U', since that's what
json is using.
2014-05-09 09:25:58 -04:00
Heikki Linnakangas
d9daff0e0c More jsonb cleanup.
Fix JSONB_MAX_ELEMS and JSONB_MAX_PAIRS macros to use CB_MASK in the
calculation. JENTRY_POSMASK happens to have the same value at the moment,
but that's just coincidental.

Refactor jsonb iterator functions, for readability.

Get rid of the JENTRY_ISFIRST flag. Whenever we handle JEntrys, we have
access to the whole array and have enough context information to know
which entry is the first. This frees up one bit in the JEntry header for
future use. While we're at it, shuffle the JEntry bits so that boolean
true and false go together, for aesthetic reasons.

Bump catalog version as this changes the on-disk format slightly.
2014-05-09 15:55:56 +03:00
Tom Lane
46dddf7673 Improve key representation for GIN jsonb_ops, and fix existence-search bug.
Change the key representation so that values that would exceed 127 bytes
are hashed into short strings, and so that the original JSON datatype of
each value is recorded in the index.  The hashing rule eliminates the major
objection to having this opclass be the default for jsonb, namely that it
could fail for plausible input data (due to GIN's restrictions on maximum
key length).  Preserving datatype information doesn't really buy us much
right now, but it requires no extra space compared to the previous way,
and it might be useful later.

Also, change the consistency-checking functions to request recheck for
exists (jsonb ? text) and related operators.  The original analysis that
this is an exactly checkable query was incorrect, since the index does
not preserve information about whether a key appears at top level in
the indexed JSON object.  Add a test case demonstrating the problem.

Make some other, mostly cosmetic improvements to the code in jsonb_gin.c
as well.

catversion bump due to on-disk data format change in jsonb_ops indexes.
2014-05-09 08:41:26 -04:00
Heikki Linnakangas
d3c72e23df Avoid some pnstrdup()s when constructing jsonb
This speeds up text to jsonb parsing and hstore to jsonb conversions
somewhat.
2014-05-09 12:46:21 +03:00
Tom Lane
b910d7ea35 Increase the default value of effective_cache_size to 4GB.
Per discussion, the old value of 128MB is ridiculously small on modern
machines; in fact, it's not even any larger than the default value of
shared_buffers, which it certainly should be.  Increase to 4GB, which
is unlikely to be any worse than the old default for anyone, and should
be noticeably better for most.  Eventually we might have an autotuning
scheme for this setting, but the recent attempt crashed and burned,
so for now just do this.
2014-05-08 21:11:47 -04:00
Tom Lane
a16d421ca4 Revert "Auto-tune effective_cache size to be 4x shared buffers"
This reverts commit ee1e5662d8, as well as
a remarkably large number of followup commits, which were mostly concerned
with the fact that the implementation didn't work terribly well.  It still
doesn't: we probably need some rather basic work in the GUC infrastructure
if we want to fully support GUCs whose default varies depending on the
value of another GUC.  Meanwhile, it also emerged that there wasn't really
consensus in favor of the definition the patch tried to implement (ie,
effective_cache_size should default to 4 times shared_buffers).  So whack
it all back to where it was.  In a followup commit, I'll do what was
recently agreed to, which is to simply change the default to a higher
value.
2014-05-08 20:49:38 -04:00
Tom Lane
1e81f8462a Fix comment.
Previous commit was confused about the case we're handling: actually,
what the patch is dealing with is platforms that have optreset, *and*
have <getopt.h>, but the latter fails to declare the former.  Because
we use a linking probe to set HAVE_INT_OPTRESET, we need to be sure we
have a declaration even if <getopt.h> doesn't think it exists.
2014-05-08 12:42:56 -04:00
Tom Lane
0c15a524c5 Allow for platforms that have optreset but not <getopt.h>.
Reportedly, some versions of mingw are like that, and it seems plausible
in general that older platforms might be that way.  However, we'd
determined experimentally that just doing "extern int" conflicts with
the way Cygwin declares these variables, so explicitly exclude Cygwin.

Michael Paquier, tweaked by me to hopefully not break Cygwin
2014-05-08 12:33:29 -04:00
Robert Haas
be7558162a When a background worker exists with code 0, unregister it.
The previous behavior was to restart immediately, which was generally
viewed as less useful.

Petr Jelinek, with some adjustments by me.
2014-05-07 17:44:42 -04:00
Robert Haas
970d1f76d1 Restart bgworkers immediately after a crash-and-restart cycle.
Just as we would start bgworkers immediately after an initial startup
of the server, we should restart them immediately when reinitializing.

Petr Jelinek and Robert Haas
2014-05-07 16:19:35 -04:00
Heikki Linnakangas
364ddc3e5c Clean up jsonb code.
The main target of this cleanup is the convertJsonb() function, but I also
touched a lot of other things that I spotted into in the process.

The new convertToJsonb() function uses an output buffer that's resized on
demand, so the code to estimate of the size of JsonbValue is removed.

The on-disk format was not changed, even though I refactored the structs
used to handle it. The term "superheader" is replaced with "container".

The jsonb_exists_any and jsonb_exists_all functions no longer sort the input
array. That was a premature optimization, the idea being that if there are
duplicates in the input array, you only need to check them once. Also,
sorting the array saves some effort in the binary search used to find a key
within an object. But there were drawbacks too: the sorting and
deduplicating obviously isn't free, and in the typical case there are no
duplicates to remove, and the gain in the binary search was minimal. Remove
all that, which makes the code simpler too.

This includes a bug-fix; the total length of the elements in a jsonb array
or object mustn't exceed 2^28. That is now checked.
2014-05-07 23:16:19 +03:00
Bruce Momjian
0a78320057 pgindent run for 9.4
This includes removing tabs after periods in C comments, which was
applied to back branches, so this change should not effect backpatching.
2014-05-06 12:12:18 -04:00
Tom Lane
3727afafee Fix pg_type.typlen for newly-revived line type.
Commit 261c7d4b65 removed the "m" field
from struct LINE, but neglected to make pg_type.h's idea of the type's
size match.  This resulted in reading past the end of palloc'd LINE
values when inserting them into tuples etc.  In principle that could
cause a SIGSEGV, though the odds of detectable problems seem low.

Bump catversion since this makes an incompatible on-disk format change.
Note that if the line type had been in use in the field, this would
break pg_upgrade'ability of databases containing line values; but
it seems unlikely that there are any (they'd have had to be compiled
with -DENABLE_LINE_TYPE).

Spotted by Andres Freund.
2014-05-05 13:37:54 -04:00
Heikki Linnakangas
a692ee5870 Replace SYSTEMQUOTEs with Windows-specific wrapper functions.
It's easy to forget using SYSTEMQUOTEs when constructing command strings
for system() or popen(). Even if we fix all the places missing it now, it is
bound to be forgotten again in the future. Introduce wrapper functions that
do the the extra quoting for you, and get rid of SYSTEMQUOTEs in all the
callers.

We previosly used SYSTEMQUOTEs in all the hard-coded command strings, and
this doesn't change the behavior of those. But user-supplied commands, like
archive_command, restore_command, COPY TO/FROM PROGRAM calls, as well as
pgbench's \shell, will now gain an extra pair of quotes. That is desirable,
but if you have existing scripts or config files that include an extra
pair of quotes, those might need to be adjusted.

Reviewed by Amit Kapila and Tom Lane
2014-05-05 16:07:40 +03:00
Tom Lane
3f8c8e3c61 Fix failure to detoast fields in composite elements of structured types.
If we have an array of records stored on disk, the individual record fields
cannot contain out-of-line TOAST pointers: the tuptoaster.c mechanisms are
only prepared to deal with TOAST pointers appearing in top-level fields of
a stored row.  The same applies for ranges over composite types, nested
composites, etc.  However, the existing code only took care of expanding
sub-field TOAST pointers for the case of nested composites, not for other
structured types containing composites.  For example, given a command such
as

UPDATE tab SET arraycol = ARRAY[(ROW(x,42)::mycompositetype] ...

where x is a direct reference to a field of an on-disk tuple, if that field
is long enough to be toasted out-of-line then the TOAST pointer would be
inserted as-is into the array column.  If the source record for x is later
deleted, the array field value would become a dangling pointer, leading
to errors along the line of "missing chunk number 0 for toast value ..."
when the value is referenced.  A reproducible test case for this was
provided by Jan Pecek, but it seems likely that some of the "missing chunk
number" reports we've heard in the past were caused by similar issues.

Code-wise, the problem is that PG_DETOAST_DATUM() is not adequate to
produce a self-contained Datum value if the Datum is of composite type.
Seen in this light, the problem is not just confined to arrays and ranges,
but could also affect some other places where detoasting is done in that
way, for example form_index_tuple().

I tried teaching the array code to apply toast_flatten_tuple_attribute()
along with PG_DETOAST_DATUM() when the array element type is composite,
but this was messy and imposed extra cache lookup costs whether or not any
TOAST pointers were present, indeed sometimes when the array element type
isn't even composite (since sometimes it takes a typcache lookup to find
that out).  The idea of extending that approach to all the places that
currently use PG_DETOAST_DATUM() wasn't attractive at all.

This patch instead solves the problem by decreeing that composite Datum
values must not contain any out-of-line TOAST pointers in the first place;
that is, we expand out-of-line fields at the point of constructing a
composite Datum, not at the point where we're about to insert it into a
larger tuple.  This rule is applied only to true composite Datums, not
to tuples that are being passed around the system as tuples, so it's not
as invasive as it might sound at first.  With this approach, the amount
of code that has to be touched for a full solution is greatly reduced,
and added cache lookup costs are avoided except when there actually is
a TOAST pointer that needs to be inlined.

The main drawback of this approach is that we might sometimes dereference
a TOAST pointer that will never actually be used by the query, imposing a
rather large cost that wasn't there before.  On the other side of the coin,
if the field value is used multiple times then we'll come out ahead by
avoiding repeat detoastings.  Experimentation suggests that common SQL
coding patterns are unaffected either way, though.  Applications that are
very negatively affected could be advised to modify their code to not fetch
columns they won't be using.

In future, we might consider reverting this solution in favor of detoasting
only at the point where data is about to be stored to disk, using some
method that can drill down into multiple levels of nested structured types.
That will require defining new APIs for structured types, though, so it
doesn't seem feasible as a back-patchable fix.

Note that this patch changes HeapTupleGetDatum() from a macro to a function
call; this means that any third-party code using that macro will not get
protection against creating TOAST-pointer-containing Datums until it's
recompiled.  The same applies to any uses of PG_RETURN_HEAPTUPLEHEADER().
It seems likely that this is not a big problem in practice: most of the
tuple-returning functions in core and contrib produce outputs that could
not possibly be toasted anyway, and the same probably holds for third-party
extensions.

This bug has existed since TOAST was invented, so back-patch to all
supported branches.
2014-05-01 15:19:06 -04:00
Tom Lane
2d00190495 Rationalize common/relpath.[hc].
Commit a730183926 created rather a mess by
putting dependencies on backend-only include files into include/common.
We really shouldn't do that.  To clean it up:

* Move TABLESPACE_VERSION_DIRECTORY back to its longtime home in
catalog/catalog.h.  We won't consider this symbol part of the FE/BE API.

* Push enum ForkNumber from relfilenode.h into relpath.h.  We'll consider
relpath.h as the source of truth for fork numbers, since relpath.c was
already partially serving that function, and anyway relfilenode.h was
kind of a random place for that enum.

* So, relfilenode.h now includes relpath.h rather than vice-versa.  This
direction of dependency is fine.  (That allows most, but not quite all,
of the existing explicit #includes of relpath.h to go away again.)

* Push forkname_to_number from catalog.c to relpath.c, just to centralize
fork number stuff a bit better.

* Push GetDatabasePath from catalog.c to relpath.c; it was rather odd
that the previous commit didn't keep this together with relpath().

* To avoid needing relfilenode.h in common/, redefine the underlying
function (now called GetRelationPath) as taking separate OID arguments,
and make the APIs using RelFileNode or RelFileNodeBackend into macro
wrappers.  (The macros have a potential multiple-eval risk, but none of
the existing call sites have an issue with that; one of them had such a
risk already anyway.)

* Fix failure to follow the directions when "init" fork type was added;
specifically, the errhint in forkname_to_number wasn't updated, and neither
was the SGML documentation for pg_relation_size().

* Fix tablespace-path-too-long check in CreateTableSpace() to account for
fork-name component of maximum-length pathnames.  This requires putting
FORKNAMECHARS into a header file, but it was rather useless (and
actually unreferenced) where it was.

The last couple of items are potentially back-patchable bug fixes,
if anyone is sufficiently excited about them; but personally I'm not.

Per a gripe from Christoph Berg about how include/common wasn't
self-contained.
2014-04-30 17:30:50 -04:00
Tom Lane
a9baeb361d Can't completely get rid of #ifndef FRONTEND in palloc.h :-(
pg_controldata includes postgres.h not postgres_fe.h, so utils/palloc.h
must be able to compile in a "#define FRONTEND" context.  It appears that
Solaris Studio is smart enough to persuade us to define PG_USE_INLINE,
but not smart enough to not make a copy of unreferenced static functions;
which leads to an unsatisfied reference to CurrentMemoryContext.  So we
need an #ifndef FRONTEND around that declaration.  Per buildfarm.
2014-04-27 21:24:19 -04:00
Tom Lane
528c454b2a Don't #include utils/palloc.h in common/fe_memutils.h.
This breaks the principle that common/ ought not depend on anything in the
server, not only code-wise but in the headers.  The only arguable advantage
is avoidance of duplication of half a dozen extern declarations, and even
that is rather dubious, considering that the previous coding was wrong
about which declarations to duplicate: it exposed pnstrdup() to frontend
code even though no such function is provided in fe_memutils.c.

On the same principle, don't #include utils/memutils.h in the frontend
build of psprintf.c.  This requires duplicating the definition of
MaxAllocSize, but that seems fine to me: there's no a-priori reason why
frontend code should use the same size limit as the backend anyway.

In passing, clean up some rather odd layout and ordering choices that
were imposed on palloc.h to reduce the number of #ifdefs required by
the previous approach.

Per gripe from Christoph Berg.  There's still more work to do to make
include/common/ clean, but this part seems reasonably noncontroversial.
2014-04-26 14:14:28 -04:00
Alvaro Herrera
1a917ae861 Fix race when updating a tuple concurrently locked by another process
If a tuple is locked, and this lock is later upgraded either to an
update or to a stronger lock, and in the meantime some other process
tries to lock, update or delete the same tuple, it (the tuple) could end
up being updated twice, or having conflicting locks held.

The reason for this is that the second updater checks for a change in
Xmax value, or in the HEAP_XMAX_IS_MULTI infomask bit, after noticing
the first lock; and if there's a change, it restarts and re-evaluates
its ability to update the tuple.  But it neglected to check for changes
in lock strength or in lock-vs-update status when those two properties
stayed the same.  This would lead it to take the wrong decision and
continue with its own update, when in reality it shouldn't do so but
instead restart from the top.

This could lead to either an assertion failure much later (when a
multixact containing multiple updates is detected), or duplicate copies
of tuples.

To fix, make sure to compare the other relevant infomask bits alongside
the Xmax value and HEAP_XMAX_IS_MULTI bit, and restart from the top if
necessary.

Also, in the belt-and-suspenders spirit, add a check to
MultiXactCreateFromMembers that a multixact being created does not have
two or more members that are claimed to be updates.  This should protect
against other bugs that might cause similar bogus situations.

Backpatch to 9.3, where the possibility of multixacts containing updates
was introduced.  (In prior versions it was possible to have the tuple
lock upgraded from shared to exclusive, and an update would not restart
from the top; yet we're protected against a bug there because there's
always a sleep to wait for the locking transaction to complete before
continuing to do anything.  Really, the fact that tuple locks always
conflicted with concurrent updates is what protected against bugs here.)

Per report from Andrew Dunstan and Josh Berkus in thread at
http://www.postgresql.org/message-id/534C8B33.9050807@pgexperts.com

Bug analysis by Andres Freund.
2014-04-24 15:41:55 -03:00
Tom Lane
a0f9358149 Fix incorrect pg_proc.proallargtypes entries for two built-in functions.
pg_sequence_parameters() and pg_identify_object() have had incorrect
proallargtypes entries since 9.1 and 9.3 respectively.  This was mostly
masked by the correct information in proargtypes, but a few operations
such as pg_get_function_arguments() (and thus psql's \df display) would
show the wrong data types for these functions' input parameters.

In HEAD, fix the wrong info, bump catversion, and add an opr_sanity
regression test to catch future mistakes of this sort.

In the back branches, just fix the wrong info so that installations
initdb'd with future minor releases will have the right data.  We
can't force an initdb, and it doesn't seem like a good idea to add
a regression test that will fail on existing installations.

Andres Freund
2014-04-23 21:21:05 -04:00
Tom Lane
f0fedfe82c Allow polymorphic aggregates to have non-polymorphic state data types.
Before 9.4, such an aggregate couldn't be declared, because its final
function would have to have polymorphic result type but no polymorphic
argument, which CREATE FUNCTION would quite properly reject.  The
ordered-set-aggregate patch found a workaround: allow the final function
to be declared as accepting additional dummy arguments that have types
matching the aggregate's regular input arguments.  However, we failed
to notice that this problem applies just as much to regular aggregates,
despite the fact that we had a built-in regular aggregate array_agg()
that was known to be undeclarable in SQL because its final function
had an illegal signature.  So what we should have done, and what this
patch does, is to decouple the extra-dummy-arguments behavior from
ordered-set aggregates and make it generally available for all aggregate
declarations.  We have to put this into 9.4 rather than waiting till
later because it slightly alters the rules for declaring ordered-set
aggregates.

The patch turned out a bit bigger than I'd hoped because it proved
necessary to record the extra-arguments option in a new pg_aggregate
column.  I'd thought we could just look at the final function's pronargs
at runtime, but that didn't work well for variadic final functions.
It's probably just as well though, because it simplifies life for pg_dump
to record the option explicitly.

While at it, fix array_agg() to have a valid final-function signature,
and add an opr_sanity test to notice future deviations from polymorphic
consistency.  I also marked the percentile_cont() aggregates as not
needing extra arguments, since they don't.
2014-04-23 19:17:41 -04:00
Heikki Linnakangas
4fafc4ecd9 Cleanup of new b-tree page deletion code.
When marking a branch as half-dead, a pointer to the top of the branch is
stored in the leaf block's hi-key. During normal operation, the high key
was left in place, and the block number was just stored in the ctid field
of the high key tuple, but in WAL replay, the high key was recreated as a
truncated tuple with zero columns. For the sake of easier debugging, also
truncate the tuple in normal operation, so that the page is identical
after WAL replay. Also, rename the 'downlink' field in the WAL record to
'topparent', as that seems like a more descriptive name. And make sure
it's set to invalid when unlinking the leaf page.
2014-04-23 10:19:54 +03:00
Tom Lane
d26b042ce5 Fix documentation of FmgrInfo.fn_nargs.
Some ancient comments claimed that fn_nargs could be -1 to indicate a
variable number of input arguments; but this was never implemented, and
is at variance with what we ultimately did with "variadic" functions.
Update the comments.
2014-04-22 23:22:12 -04:00
Robert Haas
602b27ab8e Fix another typo.
Etsuro Fujita
2014-04-20 16:32:57 +02:00
Peter Eisentraut
e7128e8dbb Create function prototype as part of PG_FUNCTION_INFO_V1 macro
Because of gcc -Wmissing-prototypes, all functions in dynamically
loadable modules must have a separate prototype declaration.  This is
meant to detect global functions that are not declared in header files,
but in cases where the function is called via dfmgr, this is redundant.
Besides filling up space with boilerplate, this is a frequent source of
compiler warnings in extension modules.

We can fix that by creating the function prototype as part of the
PG_FUNCTION_INFO_V1 macro, which such modules have to use anyway.  That
makes the code of modules cleaner, because there is one less place where
the entry points have to be listed, and creates an additional check that
functions have the right prototype.

Remove now redundant prototypes from contrib and other modules.
2014-04-18 00:03:19 -04:00
Robert Haas
dfc0219f64 Add to_regprocedure() and to_regoperator().
These are natural complements to the functions added by commit
0886fc6a5c, but they weren't included
in the original patch for some reason.  Add them.

Patch by me, per a complaint by Tom Lane.  Review by Tatsuo
Ishii.
2014-04-16 12:21:43 -04:00
Heikki Linnakangas
f1dadd34fa Set pd_lower on internal GIN posting tree pages.
This allows squeezing out the unused space in full-page writes. And more
importantly, it can be a useful debugging aid.

In hindsight we should've done this back when GIN was added - we wouldn't
need the 'maxoff' field in the page opaque struct if we had used pd_lower
and pd_upper like on normal pages. But as long as there can be pages in the
index that have been binary-upgraded from pre-9.4 versions, we can't rely
on that, and have to continue using 'maxoff'.

Most of the code churn comes from renaming some macros, now that they're
used on internal pages, too.

This change is completely backwards-compatible, no effect on pg_upgrade.
2014-04-14 21:13:19 +03:00
Tom Lane
e0c91a7ff0 Improve some O(N^2) behavior in window function evaluation.
Repositioning the tuplestore seek pointer in window_gettupleslot() turns
out to be a very significant expense when the window frame is sizable and
the frame end can move.  To fix, introduce a tuplestore function for
skipping an arbitrary number of tuples in one call, parallel to the one we
introduced for tuplesort objects in commit 8d65da1f.  This reduces the cost
of window_gettupleslot() to O(1) if the tuplestore has not spilled to disk.
As in the previous commit, I didn't try to do any real optimization of
tuplestore_skiptuples for the case where the tuplestore has spilled to
disk.  There is probably no practical way to get the cost to less than O(N)
anyway, but perhaps someone can think of something later.

Also fix PersistHoldablePortal() to make use of this API now that we have
it.

Based on a suggestion by Dean Rasheed, though this turns out not to look
much like his patch.
2014-04-13 13:59:17 -04:00
Tom Lane
d95425c8b9 Provide moving-aggregate support for boolean aggregates.
David Rowley and Florian Pflug, reviewed by Dean Rasheed
2014-04-13 00:01:46 -04:00
Stephen Frost
842faa714c Make security barrier views automatically updatable
Views which are marked as security_barrier must have their quals
applied before any user-defined quals are called, to prevent
user-defined functions from being able to see rows which the
security barrier view is intended to prevent them from seeing.

Remove the restriction on security barrier views being automatically
updatable by adding a new securityQuals list to the RTE structure
which keeps track of the quals from security barrier views at each
level, independently of the user-supplied quals.  When RTEs are
later discovered which have securityQuals populated, they are turned
into subquery RTEs which are marked as security_barrier to prevent
any user-supplied quals being pushed down (modulo LEAKPROOF quals).

Dean Rasheed, reviewed by Craig Ringer, Simon Riggs, KaiGai Kohei
2014-04-12 21:04:58 -04:00
Tom Lane
9d229f399e Provide moving-aggregate support for a bunch of numerical aggregates.
First installment of the promised moving-aggregate support in built-in
aggregates: count(), sum(), avg(), stddev() and variance() for
assorted datatypes, though not for float4/float8.

In passing, remove a 2001-vintage kluge in interval_accum(): interval
array elements have been properly aligned since around 2003, but
nobody remembered to take out this workaround.  Also, fix a thinko
in the opr_sanity tests for moving-aggregate catalog entries.

David Rowley and Florian Pflug, reviewed by Dean Rasheed
2014-04-12 20:33:09 -04:00
Tom Lane
a9d9acbf21 Create infrastructure for moving-aggregate optimization.
Until now, when executing an aggregate function as a window function
within a window with moving frame start (that is, any frame start mode
except UNBOUNDED PRECEDING), we had to recalculate the aggregate from
scratch each time the frame head moved.  This patch allows an aggregate
definition to include an alternate "moving aggregate" implementation
that includes an inverse transition function for removing rows from
the aggregate's running state.  As long as this can be done successfully,
runtime is proportional to the total number of input rows, rather than
to the number of input rows times the average frame length.

This commit includes the core infrastructure, documentation, and regression
tests using user-defined aggregates.  Follow-on commits will update some
of the built-in aggregates to use this feature.

David Rowley and Florian Pflug, reviewed by Dean Rasheed; additional
hacking by me
2014-04-12 12:03:30 -04:00
Heikki Linnakangas
150a9df528 Fix a few more misc typos in comments. 2014-04-10 00:53:55 +03:00
Tom Lane
f23a5630eb Add an in-core GiST index opclass for inet/cidr types.
This operator class can accelerate subnet/supernet tests as well as
btree-equivalent ordered comparisons.  It also handles a new network
operator inet && inet (overlaps, a/k/a "is supernet or subnet of"),
which is expected to be useful in exclusion constraints.

Ideally this opclass would be the default for GiST with inet/cidr data,
but we can't mark it that way until we figure out how to do a more or
less graceful transition from the current situation, in which the
really-completely-bogus inet/cidr opclasses in contrib/btree_gist are
marked as default.  Having the opclass in core and not default is better
than not having it at all, though.

While at it, add new documentation sections to allow us to officially
document GiST/GIN/SP-GiST opclasses, something there was never a clear
place to do before.  I filled these in with some simple tables listing
the existing opclasses and the operators they support, but there's
certainly scope to put more information there.

Emre Hasegeli, reviewed by Andreas Karlsson, further hacking by me
2014-04-08 15:46:43 -04:00
Robert Haas
11a65eed16 Get rid of the dynamic shared memory state file.
Instead of storing the ID of the dynamic shared memory control
segment in a file within the data directory, store it in the main
control segment.  This avoids a number of nasty corner cases,
most seriously that doing an online backup and then using it on
the same machine (e.g. to fire up a standby) would result in the
standby clobbering all of the master's dynamic shared memory
segments.

Per complaints from Heikki Linnakangas, Fujii Masao, and Tom
Lane.
2014-04-08 11:39:55 -04:00
Robert Haas
0886fc6a5c Add new to_reg* functions for error-free OID lookups.
These functions won't throw an error if the object doesn't exist,
or if (for functions and operators) there's more than one matching
object.

Yugo Nagata and Nozomi Anzai, reviewed by Amit Khandekar, Marti
Raudsepp, Amit Kapila, and me.
2014-04-08 10:27:56 -04:00
Simon Riggs
e5550d5fec Reduce lock levels of some ALTER TABLE cmds
VALIDATE CONSTRAINT

CLUSTER ON
SET WITHOUT CLUSTER

ALTER COLUMN SET STATISTICS
ALTER COLUMN SET ()
ALTER COLUMN RESET ()

All other sub-commands use AccessExclusiveLock

Simon Riggs and Noah Misch

Reviews by Robert Haas and Andres Freund
2014-04-06 11:13:43 -04:00
Tom Lane
9aca512506 Make sure -D is an absolute path when starting server on Windows.
This is needed because Windows services may get started with a different
current directory than where pg_ctl is executed.  We want relative -D
paths to be interpreted relative to pg_ctl's CWD, similarly to what
happens on other platforms.

In support of this, move the backend's make_absolute_path() function
into src/port/path.c (where it probably should have been long since)
and get rid of the rather inferior version in pg_regress.

Kumar Rajeev Rastogi, reviewed by MauMau
2014-04-04 18:42:13 -04:00
Robert Haas
59202fae04 Fix some compiler warnings that clang emits with -pedantic.
Andres Freund
2014-04-04 11:29:50 -04:00
Tom Lane
c7b3539599 Fix non-equivalence of VARIADIC and non-VARIADIC function call formats.
For variadic functions (other than VARIADIC ANY), the syntaxes foo(x,y,...)
and foo(VARIADIC ARRAY[x,y,...]) should be considered equivalent, since the
former is converted to the latter at parse time.  They have indeed been
equivalent, in all releases before 9.3.  However, commit 75b39e790 made an
ill-considered decision to record which syntax had been used in FuncExpr
nodes, and then to make equal() test that in checking node equality ---
which caused the syntaxes to not be seen as equivalent by the planner.
This is the underlying cause of bug #9817 from Dmitry Ryabov.

It might seem that a quick fix would be to make equal() disregard
FuncExpr.funcvariadic, but the same commit made that untenable, because
the field actually *is* semantically significant for some VARIADIC ANY
functions.  This patch instead adopts the approach of redefining
funcvariadic (and aggvariadic, in HEAD) as meaning that the last argument
is a variadic array, whether it got that way by parser intervention or was
supplied explicitly by the user.  Therefore the value will always be true
for non-ANY variadic functions, restoring the principle of equivalence.
(However, the planner will continue to consider use of VARIADIC as a
meaningful difference for VARIADIC ANY functions, even though some such
functions might disregard it.)

In HEAD, this change lets us simplify the decompilation logic in
ruleutils.c, since the funcvariadic/aggvariadic flag tells directly whether
to print VARIADIC.  However, in 9.3 we have to continue to cope with
existing stored rules/views that might contain the previous definition.
Fortunately, this just means no change in ruleutils.c, since its existing
behavior effectively ignores funcvariadic for all cases other than VARIADIC
ANY functions.

In HEAD, bump catversion to reflect the fact that FuncExpr.funcvariadic
changed meanings; this is sort of pro forma, since I don't believe any
built-in views are affected.

Unfortunately, this patch doesn't magically fix everything for affected
9.3 users.  After installing 9.3.5, they might need to recreate their
rules/views/indexes containing variadic function calls in order to get
everything consistent with the new definition.  As in the cited bug,
the symptom of a problem would be failure to use a nominally matching
index that has a variadic function call in its definition.  We'll need
to mention this in the 9.3.5 release notes.
2014-04-03 22:02:24 -04:00
Heikki Linnakangas
04e298b826 Avoid palloc in critical section in GiST WAL-logging.
Memory allocation can fail if you run out of memory, and inside a critical
section that will lead to a PANIC. Use conservatively-sized arrays in stack
instead.

There was previously no explicit limit on the number of pages a GiST split
can produce, it was only limited by the number of LWLocks that can be held
simultaneously (100 at the moment). This patch adds an explicit limit of 75
pages. That should be plenty, a typical split shouldn't produce more than
2-3 page halves.

The bug has been there forever, but only backpatch down to 9.1. The code
was changed significantly in 9.1, and it doesn't seem worth the risk or
trouble to adapt this for 9.0 and 8.4.
2014-04-03 15:43:50 +03:00
Tom Lane
fc752505a9 Fix assorted issues in client host name lookup.
The code for matching clients to pg_hba.conf lines that specify host names
(instead of IP address ranges) failed to complain if reverse DNS lookup
failed; instead it silently didn't match, so that you might end up getting
a surprising "no pg_hba.conf entry for ..." error, as seen in bug #9518
from Mike Blackwell.  Since we don't want to make this a fatal error in
situations where pg_hba.conf contains a mixture of host names and IP
addresses (clients matching one of the numeric entries should not have to
have rDNS data), remember the lookup failure and mention it as DETAIL if
we get to "no pg_hba.conf entry".  Apply the same approach to forward-DNS
lookup failures, too, rather than treating them as immediate hard errors.

Along the way, fix a couple of bugs that prevented us from detecting an
rDNS lookup error reliably, and make sure that we make only one rDNS lookup
attempt; formerly, if the lookup attempt failed, the code would try again
for each host name entry in pg_hba.conf.  Since more or less the whole
point of this design is to ensure there's only one lookup attempt not one
per entry, the latter point represents a performance bug that seems
sufficient justification for back-patching.

Also, adjust src/port/getaddrinfo.c so that it plays as well as it can
with this code.  Which is not all that well, since it does not have actual
support for rDNS lookup, but at least it should return the expected (and
required by spec) error codes so that the main code correctly perceives the
lack of functionality as a lookup failure.  It's unlikely that PG is still
being used in production on any machines that require our getaddrinfo.c,
so I'm not excited about working harder than this.

To keep the code in the various branches similar, this includes
back-patching commits c424d0d105 and
1997f34db4 into 9.2 and earlier.

Back-patch to 9.1 where the facility for hostnames in pg_hba.conf was
introduced.
2014-04-02 17:11:24 -04:00
Tom Lane
f33a71a786 De-anonymize the union in JsonbValue.
Needed for strict C89 compliance.
2014-04-02 14:30:08 -04:00
Heikki Linnakangas
f7534296b4 Move SizeOfHeapNewCid next to xl_heap_new_cid struct.
They belong together, but the xl_heap_rewrite_mapping struct was wedged
in between.
2014-04-01 16:23:16 +03:00
Heikki Linnakangas
14d02f0bb3 Rewrite the way GIN posting lists are packed on a page, to reduce WAL volume.
Inserting (in retail) into the new 9.4 format GIN posting tree created much
larger WAL records than in 9.3. The previous strategy to WAL logging was
basically to log the whole page on each change, with the exception of
completely unmodified segments up to the first modified one. That was not
too bad when appending to the end of the page, as only the last segment had
to be WAL-logged, but per Fujii Masao's testing, even that produced 2x the
WAL volume that 9.3 did.

The new strategy is to keep track of changes to the posting lists in a more
fine-grained fashion, and also make the repacking" code smarter to avoid
decoding and re-encoding segments unnecessarily.
2014-03-31 15:23:50 +03:00
Heikki Linnakangas
0cfa34c25a Rename GinLogicValue to GinTernaryValue.
It's more descriptive. Also, get rid of the enum, and use #defines instead,
per Greg Stark's suggestion.
2014-03-31 10:26:38 +03:00
Andrew Dunstan
f9c6d72cbf Cleanup around json_to_record/json_to_recordset
Set function parameter names and defaults. Add jsonb versions (which the
code already provided for so the actual new code is trivial). Add jsonb
regression tests and docs.

Bump catalog version (which I apparently forgot to do when jsonb was
committed).
2014-03-26 10:18:24 -04:00
Heikki Linnakangas
bb42e21be2 Change ginMergeItemPointers to return a palloc'd array.
That seems nicer than making it the caller's responsibility to pass a
suitable-sized array. All the callers were just palloc'ing an array anyway.
2014-03-24 18:44:40 +02:00
Andrew Dunstan
d9134d0a35 Introduce jsonb, a structured format for storing json.
The new format accepts exactly the same data as the json type. However, it is
stored in a format that does not require reparsing the orgiginal text in order
to process it, making it much more suitable for indexing and other operations.
Insignificant whitespace is discarded, and the order of object keys is not
preserved. Neither are duplicate object keys kept - the later value for a given
key is the only one stored.

The new type has all the functions and operators that the json type has,
with the exception of the json generation functions (to_json, json_agg etc.)
and with identical semantics. In addition, there are operator classes for
hash and btree indexing, and two classes for GIN indexing, that have no
equivalent in the json type.

This feature grew out of previous work by Oleg Bartunov and Teodor Sigaev, which
was intended to provide similar facilities to a nested hstore type, but which
in the end proved to have some significant compatibility issues.

Authors: Oleg Bartunov,  Teodor Sigaev, Peter Geoghegan and Andrew Dunstan.
Review: Andres Freund
2014-03-23 16:40:19 -04:00
Noah Misch
7cbe57c34d Offer triggers on foreign tables.
This covers all the SQL-standard trigger types supported for regular
tables; it does not cover constraint triggers.  The approach for
acquiring the old row mirrors that for view INSTEAD OF triggers.  For
AFTER ROW triggers, we spool the foreign tuples to a tuplestore.

This changes the FDW API contract; when deciding which columns to
populate in the slot returned from data modification callbacks, writable
FDWs will need to check for AFTER ROW triggers in addition to checking
for a RETURNING clause.

In support of the feature addition, refactor the TriggerFlags bits and
the assembly of old tuples in ModifyTable.

Ronan Dunklau, reviewed by KaiGai Kohei; some additional hacking by me.
2014-03-23 02:16:34 -04:00
Heikki Linnakangas
4c0e97c2d5 Fix thinkos in GinLogicValue enum.
It was incorrectly declared as global variable, not an enum type, and
the comments for GIN_FALSE and GIN_TRUE were backwards.
2014-03-21 23:41:37 +01:00
Heikki Linnakangas
68a2e52bba Replace the XLogInsert slots with regular LWLocks.
The special feature the XLogInsert slots had over regular LWLocks is the
insertingAt value that was updated atomically with releasing backends
waiting on it. Add new functions to the LWLock API to do that, and replace
the slots with LWLocks. This reduces the amount of duplicated code.
(There's still some duplication, but at least it's all in lwlock.c now.)

Reviewed by Andres Freund.
2014-03-21 15:10:48 +01:00
Alvaro Herrera
f88d4cfc9d Setup error context callback for transaction lock waits
With this in place, a session blocking behind another one because of
tuple locks will get a context line mentioning the relation name, tuple
TID, and operation being done on tuple.  For example:

LOG:  process 11367 still waiting for ShareLock on transaction 717 after 1000.108 ms
DETAIL:  Process holding the lock: 11366. Wait queue: 11367.
CONTEXT:  while updating tuple (0,2) in relation "foo"
STATEMENT:  UPDATE foo SET value = 3;

Most usefully, the new line is displayed by log entries due to
log_lock_waits, although of course it will be printed by any other log
message as well.

Author: Christian Kruse, some tweaks by Álvaro Herrera
Reviewed-by: Amit Kapila, Andres Freund, Tom Lane, Robert Haas
2014-03-19 15:10:36 -03:00
Heikki Linnakangas
59a5ab3f42 Remove rm_safe_restartpoint machinery.
It is no longer used, none of the resource managers have multi-record
actions that would make it unsafe to perform a restartpoint.

Also don't allow rm_cleanup to write WAL records, it's also no longer
required. Move the call to rm_cleanup routines to make it more symmetric
with rm_startup.
2014-03-18 22:10:35 +02:00
Heikki Linnakangas
40dae7ec53 Make the handling of interrupted B-tree page splits more robust.
Splitting a page consists of two separate steps: splitting the child page,
and inserting the downlink for the new right page to the parent. Previously,
we handled the case that you crash in between those steps with a cleanup
routine after the WAL recovery had finished, which finished the incomplete
split. However, that doesn't help if the page split is interrupted but the
database doesn't crash, so that you don't perform WAL recovery. That could
happen for example if you run out of disk space.

Remove the end-of-recovery cleanup step. Instead, when a page is split, the
left page is marked with a new INCOMPLETE_SPLIT flag, and when the downlink
is inserted to the parent, the flag is cleared again. If an insertion sees
a page with the flag set, it knows that the split was interrupted for some
reason, and inserts the missing downlink before proceeding.

I used the same approach to fix GIN and GiST split algorithms earlier. This
was the last WAL cleanup routine, so we could get rid of that whole
machinery now, but I'll leave that for a separate patch.

Reviewed by Peter Geoghegan.
2014-03-18 20:50:44 +02:00
Robert Haas
3bd261ca18 Improve shm_mq portability around MAXIMUM_ALIGNOF and sizeof(Size).
Revise the original decision to expose a uint64-based interface and
use Size everywhere possible.  Avoid assuming that MAXIMUM_ALIGNOF is
8, or making any assumption about the relationship between that value
and sizeof(Size).  If MAXIMUM_ALIGNOF is bigger, we'll now insert
padding after the length word; if it's smaller, we are now prepared
to read and write the length word in chunks.

Per discussion with Tom Lane.
2014-03-18 11:23:13 -04:00
Robert Haas
79a4d24f31 Make it easy to detach completely from shared memory.
The new function dsm_detach_all() can be used either by postmaster
children that don't wish to take any risk of accidentally corrupting
shared memory; or by forked children of regular backends with
the same need.  This patch also updates the postmaster children that
already do PGSharedMemoryDetach() to do dsm_detach_all() as well.

Per discussion with Tom Lane.
2014-03-18 07:58:53 -04:00
Magnus Hagander
02703ff227 Fix small typo in comment
Michael Paquier
2014-03-17 09:09:21 +01:00
Magnus Hagander
0294023a6b Cleanups from the remove-native-krb5 patch
krb_srvname is actually not available anymore as a parameter server-side, since
with gssapi we accept all principals in our keytab. It's still used in libpq for
client side specification.

In passing remove declaration of krb_server_hostname, where all the functionality
was already removed.

Noted by Stephen Frost, though a different solution than his suggestion
2014-03-16 15:22:45 +01:00
Heikki Linnakangas
efada2b8e9 Fix race condition in B-tree page deletion.
In short, we don't allow a page to be deleted if it's the rightmost child
of its parent, but that situation can change after we check for it.

Problem
-------

We check that the page to be deleted is not the rightmost child of its
parent, and then lock its left sibling, the page itself, its right sibling,
and the parent, in that order. However, if the parent page is split after
the check but before acquiring the locks, the target page might become the
rightmost child, if the split happens at the right place. That leads to an
error in vacuum (I reproduced this by setting a breakpoint in debugger):

ERROR:  failed to delete rightmost child 41 of block 3 in index "foo_pkey"

We currently re-check that the page is still the rightmost child, and throw
the above error if it's not. We could easily just give up rather than throw
an error, but that approach doesn't scale to half-dead pages. To recap,
although we don't normally allow deleting the rightmost child, if the page
is the *only* child of its parent, we delete the child page and mark the
parent page as half-dead in one atomic operation. But before we do that, we
check that the parent can later be deleted, by checking that it in turn is
not the rightmost child of the grandparent (potentially recursing all the
way up to the root). But the same situation can arise there - the
grandparent can be split while we're not holding the locks. We end up with
a half-dead page that we cannot delete.

To make things worse, the keyspace of the deleted page has already been
transferred to its right sibling. As the README points out, the keyspace at
the grandparent level is "out-of-whack" until the half-dead page is deleted,
and if enough tuples with keys in the transferred keyspace are inserted, the
page might get split and a downlink might be inserted into the grandparent
that is out-of-order. That might not cause any serious problem if it's
transient (as the README ponders), but is surely bad if it stays that way.

Solution
--------

This patch changes the page deletion algorithm to avoid that problem. After
checking that the topmost page in the chain of to-be-deleted pages is not
the rightmost child of its parent, and then deleting the pages from bottom
up, unlink the pages from top to bottom. This way, the intermediate stages
are similar to the intermediate stages in page splitting, and there is no
transient stage where the keyspace is "out-of-whack". The topmost page in
the to-be-deleted chain doesn't have a downlink pointing to it, like a page
split before the downlink has been inserted.

This also allows us to get rid of the cleanup step after WAL recovery, if we
crash during page deletion. The deletion will be continued at next VACUUM,
but the tree is consistent for searches and insertions at every step.

This bug is old, all supported versions are affected, but this patch is too
big to back-patch (and changes the WAL record formats of related records).
We have not heard any reports of the bug from users, so clearly it's not
easy to bump into. Maybe backpatch later, after this has had some field
testing.

Reviewed by Kevin Grittner and Peter Geoghegan.
2014-03-14 16:07:19 +02:00
Heikki Linnakangas
a3115f0d9e Only WAL-log the modified portion in an UPDATE, if possible.
When a row is updated, and the new tuple version is put on the same page as
the old one, only WAL-log the part of the new tuple that's not identical to
the old. This saves significantly on the amount of WAL that needs to be
written, in the common case that most fields are not modified.

Amit Kapila, with a lot of back and forth with me, Robert Haas, and others.
2014-03-12 23:28:36 +02:00
Fujii Masao
588fb50715 Show PIDs of lock holders and waiters in log_lock_waits log message.
Christian Kruse, reviewed by Kumar Rajeev Rastogi.
2014-03-13 03:26:47 +09:00
Heikki Linnakangas
c5608ea26a Allow opclasses to provide tri-valued GIN consistent functions.
With the GIN "fast scan" feature, GIN can skip items without fetching all
the keys for them, if it can prove that they don't match regardless of
those keys. So far, it has done the proving by calling the boolean
consistent function with all combinations of TRUE/FALSE for the unfetched
keys, but since that's O(n^2), it becomes unfeasible with more than a few
keys. We can avoid calling consistent with all the combinations, if we can
tell the operator class implementation directly which keys are unknown.

This commit includes a triConsistent function for the built-in array and
tsvector opclasses.

Alexander Korotkov, with some changes by me.
2014-03-12 17:51:30 +02:00
Robert Haas
8722017bbc Allow dynamic shared memory segments to be kept until shutdown.
Amit Kapila, reviewed by Kyotaro Horiguchi, with some further
changes by me.
2014-03-10 14:04:47 -04:00
Robert Haas
5a991ef869 Allow logical decoding via the walsender interface.
In order for this to work, walsenders need the optional ability to
connect to a database, so the "replication" keyword now allows true
or false, for backward-compatibility, and the new value "database"
(which causes the "dbname" parameter to be respected).

walsender needs to loop not only when idle but also when sending
decoded data to the user and when waiting for more xlog data to decode.
This means that there are now three separate loops inside walsender.c;
although some refactoring has been done here, this is still a bit ugly.

Andres Freund, with contributions from Álvaro Herrera, and further
review by me.
2014-03-10 13:50:28 -04:00
Robert Haas
cb9a0c7987 Teach on_exit_reset() to discard pending cleanups for dsm.
If a postmaster child invokes fork() and then calls on_exit_reset, that
should be sufficient to let it exit() without breaking anything, but
dynamic shared memory broke that by not updating on_exit_reset() to
discard callbacks registered with dynamic shared memory segments.

Per investigation of a complaint from Tom Lane.
2014-03-10 10:17:19 -04:00
Bruce Momjian
5024044a20 C comments: improve description of relfilenode uniqueness
Report by Antonin Houska
2014-03-08 12:20:30 -05:00
Tom Lane
ea177a3ba7 Remove unportable use of anonymous unions from reorderbuffer.h.
In b89e151054 I had assumed it was ok to use anonymous unions as
struct members, but while a longstanding extension in many compilers,
it's only been standardized in C11.

To fix, remove one of the anonymous unions which tried to hide some
implementation specific enum values and give the other a name. The
latter unfortunately requires changes in output plugins, but since the
feature has only been added a few days ago...

Andres Freund
2014-03-07 17:03:26 -05:00
Heikki Linnakangas
55566c9a74 Fix dangling smgr_owner pointer when a fake relcache entry is freed.
A fake relcache entry can "own" a SmgrRelation object, like a regular
relcache entry. But when it was free'd, the owner field in SmgrRelation
was not cleared, so it was left pointing to free'd memory.

Amazingly this apparently hasn't caused crashes in practice, or we would've
heard about it earlier. Andres found this with Valgrind.

Report and fix by Andres Freund, with minor modifications by me. Backpatch
to all supported versions.
2014-03-07 13:28:52 +02:00
Tom Lane
7c31874945 Avoid getting more than AccessShareLock when deparsing a query.
In make_ruledef and get_query_def, we have long used AcquireRewriteLocks
to ensure that the querytree we are about to deparse is up-to-date and
the schemas of the underlying relations aren't changing.  Howwever, that
function thinks the query is about to be executed, so it acquires locks
that are stronger than necessary for the purpose of deparsing.  Thus for
example, if pg_dump asks to deparse a rule that includes "INSERT INTO t",
we'd acquire RowExclusiveLock on t.  That results in interference with
concurrent transactions that might for example ask for ShareLock on t.
Since pg_dump is documented as being purely read-only, this is unexpected.
(Worse, it used to actually be read-only; this behavior dates back only
to 8.1, cf commit ba4200246.)

Fix this by adding a parameter to AcquireRewriteLocks to tell it whether
we want the "real" execution locks or only AccessShareLock.

Report, diagnosis, and patch by Dean Rasheed.  Back-patch to all supported
branches.
2014-03-06 19:31:05 -05:00
Bruce Momjian
0024a3a3b6 C comment update: relfilenode is only unique with a tablespace
Report from Antonin Houska
2014-03-05 20:52:34 -05:00
Tom Lane
8cf0ad1ea3 Add comment that ec_relids excludes "child" EquivalenceClass members.
This was already documented a few lines further down, but the comment
just beside the field declaration could be misleading.  Per gripe
from Kyotaro Horiguchi.
2014-03-05 16:00:33 -05:00
Alvaro Herrera
84df54b22e Constructors for interval, timestamp, timestamptz
Author: Pavel Stěhule, editorialized somewhat by Álvaro Herrera
Reviewed-by: Tomáš Vondra, Marko Tiikkaja
With input from Fabrízio de Royes Mello, Jim Nasby
2014-03-04 15:09:43 -03:00
Robert Haas
7e8db2dc42 Minor corrections to logical decoding patch. 2014-03-04 11:07:54 -05:00
Robert Haas
b89e151054 Introduce logical decoding.
This feature, building on previous commits, allows the write-ahead log
stream to be decoded into a series of logical changes; that is,
inserts, updates, and deletes and the transactions which contain them.
It is capable of handling decoding even across changes to the schema
of the effected tables.  The output format is controlled by a
so-called "output plugin"; an example is included.  To make use of
this in a real replication system, the output plugin will need to be
modified to produce output in the format appropriate to that system,
and to perform filtering.

Currently, information can be extracted from the logical decoding
system only via SQL; future commits will add the ability to stream
changes via walsender.

Andres Freund, with review and other contributions from many other
people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan,
Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit
Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve
Singer.
2014-03-03 16:32:18 -05:00
Heikki Linnakangas
f8ce16d0d2 Rename huge_tlb_pages to huge_pages, and improve docs.
Christian Kruse
2014-03-03 20:52:48 +02:00
Robert Haas
a8e9b86b5e Bump catversion.
The previous patch should have entailed a catversion bump, but I
forgot.
2014-03-03 07:22:20 -05:00
Robert Haas
d83ee62231 Corrections to replication slots code and documentation.
Andres Freund, per a report from Vik Faering
2014-03-03 07:16:54 -05:00
Robert Haas
ae95f5f74a Define LSNOID in pg_type.h.
Most other built-in types have a similarly-named constant, so this
type should probably have one, too.

Michael Paquier
2014-03-03 07:03:41 -05:00
Tom Lane
9662143f0c Allow regex operations to be terminated early by query cancel requests.
The regex code didn't have any provision for query cancel; which is
unsurprising given its non-Postgres origin, but still problematic since
some operations can take a long time.  Introduce a callback function to
check for a pending query cancel or session termination request, and
call it in a couple of strategic spots where we can make the regex code
exit with an error indicator.

If we ever actually split out the regex code as a standalone library,
some additional work will be needed to let the cancel callback function
be specified externally to the library.  But that's straightforward
(certainly so by comparison to putting the locale-dependent character
classification logic on a similar arms-length basis), and there seems
no need to do it right now.

A bigger issue is that there may be more places than these two where
we need to check for cancels.  We can always add more checks later,
now that the infrastructure is in place.

Since there are known examples of not-terribly-long regexes that can
lock up a backend for a long time, back-patch to all supported branches.
I have hopes of fixing the known performance problems later, but adding
query cancel ability seems like a good idea even if they were all fixed.
2014-03-01 15:20:56 -05:00
Alvaro Herrera
ef5856fd9b Allow BASE_BACKUP to be throttled
A new MAX_RATE option allows imposing a limit to the network transfer
rate from the server side.  This is useful to limit the stress that
taking a base backup has on the server.

pg_basebackup is now able to specify a value to the server, too.

Author: Antonin Houska

Patch reviewed by Stefan Radomski, Andres Freund, Zoltán Böszörményi,
Fujii Masao, and Álvaro Herrera.
2014-02-27 18:55:57 -03:00
Robert Haas
dd1a3bccca Show xid and xmin in pg_stat_activity and pg_stat_replication.
Christian Kruse, reviewed by Andres Freund and myself, with further
minor adjustments by me.
2014-02-25 12:34:04 -05:00
Robert Haas
6615e77439 Use pg_lsn data type in pg_stat_replication, too.
Michael Paquier, per a suggestion from Andres Freund
2014-02-24 10:38:45 -05:00
Robert Haas
6f289c2b7d Switch various builtin functions to use pg_lsn instead of text.
The functions in slotfuncs.c don't exist in any released version,
but the changes to xlogfuncs.c represent backward-incompatibilities.
Per discussion, we're hoping that the queries using these functions
are few enough and simple enough that this won't cause too much
breakage for users.

Michael Paquier, reviewed by Andres Freund and further modified
by me.
2014-02-19 11:37:43 -05:00
Robert Haas
694e3d139a Further code review for pg_lsn data type.
Change input function error messages to be more consistent with what is
done elsewhere.  Remove a bunch of redundant type casts, so that the
compiler will warn us if we screw up.  Don't pass LSNs by value on
platforms where a Datum is only 32 bytes, per buildfarm.  Move macros
for packing and unpacking LSNs to pg_lsn.h so that we can include
access/xlogdefs.h, to avoid an unsatisfied dependency on XLogRecPtr.
2014-02-19 10:06:59 -05:00
Robert Haas
844a28a9dd pg_lsn macro naming and type behavior revisions.
Change pg_lsn_mi so that it can return negative values when subtracting
LSNs, and clean up some perhaps ill-considered macro names.
2014-02-19 09:34:15 -05:00
Robert Haas
7d03a83f4d Add a pg_lsn data type, to represent an LSN.
Robert Haas and Michael Paquier
2014-02-19 08:35:23 -05:00
Noah Misch
31400a6733 Predict integer overflow to avoid buffer overruns.
Several functions, mostly type input functions, calculated an allocation
size such that the calculation wrapped to a small positive value when
arguments implied a sufficiently-large requirement.  Writes past the end
of the inadvertent small allocation followed shortly thereafter.
Coverity identified the path_in() vulnerability; code inspection led to
the rest.  In passing, add check_stack_depth() to prevent stack overflow
in related functions.

Back-patch to 8.4 (all supported versions).  The non-comment hstore
changes touch code that did not exist in 8.4, so that part stops at 9.0.

Noah Misch and Heikki Linnakangas, reviewed by Tom Lane.

Security: CVE-2014-0064
2014-02-17 09:33:31 -05:00
Noah Misch
4318daecc9 Fix handling of wide datetime input/output.
Many server functions use the MAXDATELEN constant to size a buffer for
parsing or displaying a datetime value.  It was much too small for the
longest possible interval output and slightly too small for certain
valid timestamp input, particularly input with a long timezone name.
The long input was rejected needlessly; the long output caused
interval_out() to overrun its buffer.  ECPG's pgtypes library has a copy
of the vulnerable functions, which bore the same vulnerabilities along
with some of its own.  In contrast to the server, certain long inputs
caused stack overflow rather than failing cleanly.  Back-patch to 8.4
(all supported versions).

Reported by Daniel Schüssler, reviewed by Tom Lane.

Security: CVE-2014-0063
2014-02-17 09:33:31 -05:00
Robert Haas
5f173040e3 Avoid repeated name lookups during table and index DDL.
If the name lookups come to different conclusions due to concurrent
activity, we might perform some parts of the DDL on a different table
than other parts.  At least in the case of CREATE INDEX, this can be
used to cause the permissions checks to be performed against a
different table than the index creation, allowing for a privilege
escalation attack.

This changes the calling convention for DefineIndex, CreateTrigger,
transformIndexStmt, transformAlterTableStmt, CheckIndexCompatible
(in 9.2 and newer), and AlterTable (in 9.1 and older).  In addition,
CheckRelationOwnership is removed in 9.2 and newer and the calling
convention is changed in older branches.  A field has also been added
to the Constraint node (FkConstraint in 8.4).  Third-party code calling
these functions or using the Constraint node will require updating.

Report by Andres Freund.  Patch by Robert Haas and Andres Freund,
reviewed by Tom Lane.

Security: CVE-2014-0062
2014-02-17 09:33:31 -05:00
Noah Misch
537cbd35c8 Prevent privilege escalation in explicit calls to PL validators.
The primary role of PL validators is to be called implicitly during
CREATE FUNCTION, but they are also normal functions that a user can call
explicitly.  Add a permissions check to each validator to ensure that a
user cannot use explicit validator calls to achieve things he could not
otherwise achieve.  Back-patch to 8.4 (all supported versions).
Non-core procedural language extensions ought to make the same two-line
change to their own validators.

Andres Freund, reviewed by Tom Lane and Noah Misch.

Security: CVE-2014-0061
2014-02-17 09:33:31 -05:00
Tom Lane
fa1f0d7859 PGDLLIMPORT-ify MainLWLockArray, ProcDiePending, proc_exit_inprogress.
These are needed in HEAD to make assorted contrib modules build on Windows.
Now that all the MSVC and Mingw buildfarm members seem to be on the same
page about the need for them, we can have some confidence that future
problems of this ilk will be detected promptly; there seems nothing more
to be learned by delaying this fix further.

I chose to mark QueryCancelPending as well, since it's easy to imagine code
that wants to touch ProcDiePending also caring about QueryCancelPending.
2014-02-16 20:12:43 -05:00
Tom Lane
a5cf60682e PGDLLIMPORT'ify DateStyle and IntervalStyle.
This is needed on Windows to support contrib/postgres_fdw.  Although it's
been broken since last March, we didn't notice until recently because there
were no active buildfarm members that complained about missing PGDLLIMPORT
marking.  Efforts are underway to improve that situation, in support of
which we're delaying fixing some other cases of global variables that
should be marked PGDLLIMPORT.  However, this case affects 9.3, so we
can't wait any longer to fix it.

I chose to mark DateOrder as well, though it's not strictly necessary
for postgres_fdw.
2014-02-16 12:37:07 -05:00
Tom Lane
60ff2fdd99 Centralize getopt-related declarations in a new header file pg_getopt.h.
We used to have externs for getopt() and its API variables scattered
all over the place.  Now that we find we're going to need to tweak the
variable declarations for Cygwin, it seems like a good idea to have
just one place to tweak.

In this commit, the variables are declared "#ifndef HAVE_GETOPT_H".
That may or may not work everywhere, but we'll soon find out.

Andres Freund
2014-02-15 14:31:30 -05:00
Peter Eisentraut
0f2ca0075c Fix typo
Stefan Kaltenbrunner
2014-02-13 21:50:43 -05:00
Alvaro Herrera
801c2dc72c Separate multixact freezing parameters from xid's
Previously we were piggybacking on transaction ID parameters to freeze
multixacts; but since there isn't necessarily any relationship between
rates of Xid and multixact consumption, this turns out not to be a good
idea.

Therefore, we now have multixact-specific freezing parameters:

vacuum_multixact_freeze_min_age: when to remove multis as we come across
them in vacuum (default to 5 million, i.e. early in comparison to Xid's
default of 50 million)

vacuum_multixact_freeze_table_age: when to force whole-table scans
instead of scanning only the pages marked as not all visible in
visibility map (default to 150 million, same as for Xids).  Whichever of
both which reaches the 150 million mark earlier will cause a whole-table
scan.

autovacuum_multixact_freeze_max_age: when for cause emergency,
uninterruptible whole-table scans (default to 400 million, double as
that for Xids).  This means there shouldn't be more frequent emergency
vacuuming than previously, unless multixacts are being used very
rapidly.

Backpatch to 9.3 where multixacts were made to persist enough to require
freezing.  To avoid an ABI break in 9.3, VacuumStmt has a couple of
fields in an unnatural place, and StdRdOptions is split in two so that
the newly added fields can go at the end.

Patch by me, reviewed by Robert Haas, with additional input from Andres
Freund and Tom Lane.
2014-02-13 19:36:31 -03:00
Peter Eisentraut
66c04c981d Mark some more variables as static or include the appropriate header
Detected by clang's -Wmissing-variable-declarations.

From: Andres Freund <andres@anarazel.de>
2014-02-08 21:21:46 -05:00
Heikki Linnakangas
dbc649fd77 Speed up "rare & frequent" type GIN queries.
If you have a GIN query like "rare & frequent", we currently fetch all the
items that match either rare or frequent, call the consistent function for
each item, and let the consistent function filter out items that only match
one of the terms. However, if we can deduce that "rare" must be present for
the overall qual to be true, we can scan all the rare items, and for each
rare item, skip over to the next frequent item with the same or greater TID.
That greatly speeds up "rare & frequent" type queries.

To implement that, introduce the concept of a tri-state consistent function,
where the 3rd value is MAYBE, indicating that we don't know if that term is
present. Operator classes only provide a boolean consistent function, so we
simulate the tri-state consistent function by calling the boolean function
several times, with the MAYBE arguments set to all combinations of TRUE and
FALSE. Testing all combinations is only feasible for a small number of MAYBE
arguments, but it is envisioned that we'll provide a way for operator
classes to provide a native tri-state consistent function, which can be much
more efficient. But that is not included in this patch.

We were already using that trick to for lossy pages, calling the consistent
function with the lossy entry set to TRUE and FALSE. Now that we have the
tri-state consistent function, use it for lossy pages too.

Alexander Korotkov, with fair amount of refactoring by me.
2014-02-07 15:22:48 +02:00
Robert Haas
80353f3528 Adjust pg_sleep_for/pg_sleep_until to use clock_timestamp.
Otherwise, pg_sleep_until does the wrong thing in a multi-statement
transaction.

Julien Rouhaud
2014-02-03 14:33:43 -05:00
Fujii Masao
3e8554a54a Make pg_basebackup skip temporary statistics files.
The temporary statistics files don't need to be included in the backup
because they are always reset at the beginning of the archive recovery.
This patch changes pg_basebackup so that it skips all files located in
$PGDATA/pg_stat_tmp or the directory specified by stats_temp_directory
parameter.
2014-02-03 23:19:49 +09:00
Robert Haas
858ec11858 Introduce replication slots.
Replication slots are a crash-safe data structure which can be created
on either a master or a standby to prevent premature removal of
write-ahead log segments needed by a standby, as well as (with
hot_standby_feedback=on) pruning of tuples whose removal would cause
replication conflicts.  Slots have some advantages over existing
techniques, as explained in the documentation.

In a few places, we refer to the type of replication slots introduced
by this patch as "physical" slots, because forthcoming patches for
logical decoding will also have slots, but with somewhat different
properties.

Andres Freund and Robert Haas
2014-01-31 22:45:36 -05:00
Bruce Momjian
fc4ffba968 system catalogs: reorder pg_amproc entries into proper sections
Report form Antonin Houska
2014-01-31 16:04:18 -05:00
Robert Haas
760c770ff6 Add convenience functions pg_sleep_for and pg_sleep_until.
Vik Fearing, reviewed by Pavel Stehule and myself
2014-01-30 15:47:56 -05:00
Andrew Dunstan
5e52e9d6d4 Forgot to bump catalog version for json_array_elements_text. 2014-01-29 16:38:31 -05:00
Robert Haas
9347baa5bb Include planning time in EXPLAIN ANALYZE output.
This doesn't work for prepared queries, but it's not too easy to get
the information in that case and there's some debate as to exactly
what the right thing to measure is, so just do this for now.

Andreas Karlsson, with slight doc changes by me.
2014-01-29 16:09:15 -05:00
Andrew Dunstan
5264d91541 Add json_array_elements_text function.
This was a notable omission from the json functions added in 9.3 and
there have been numerous complaints about its absence.

Laurence Rowe.
2014-01-29 15:39:01 -05:00
Heikki Linnakangas
626a120656 Further optimize GIN multi-key searches.
When skipping over some items in a posting tree, re-find the new location
by descending the tree from root, rather than walking the right links.
This can save a lot of I/O.

Heavily modified from Alexander Korotkov's fast scan patch.
2014-01-29 21:24:38 +02:00
Heikki Linnakangas
25b1dafab6 Further optimize multi-key GIN searches.
If we're skipping past a certain TID, avoid decoding posting list segments
that only contain smaller TIDs.

Extracted from Alexander Korotkov's fast scan patch, heavily modified.
2014-01-29 18:26:40 +02:00
Heikki Linnakangas
1a3458b6d8 Allow using huge TLB pages on Linux (MAP_HUGETLB)
This patch adds an option, huge_tlb_pages, which allows requesting the
shared memory segment to be allocated using huge pages, by using the
MAP_HUGETLB flag in mmap(). This can improve performance.

The default is 'try', which means that we will attempt using huge pages,
and fall back to non-huge pages if it doesn't work. Currently, only Linux
has MAP_HUGETLB. On other platforms, the default 'try' behaves the same as
'off'.

In the passing, don't try to round the mmap() size to a multiple of
pagesize. mmap() doesn't require that, and there's no particular reason for
PostgreSQL to do that either. When using MAP_HUGETLB, however, round the
request size up to nearest 2MB boundary. This is to work around a bug in
some Linux kernel versions, but also to avoid wasting memory, because the
kernel will round the size up anyway.

Many people were involved in writing this patch, including Christian Kruse,
Richard Poole, Abhijit Menon-Sen, reviewed by Peter Geoghegan, Andres Freund
and me.
2014-01-29 14:08:30 +02:00
Andrew Dunstan
105639900b New json functions.
json_build_array() and json_build_object allow for the construction of
arbitrarily complex json trees. json_object() turns a one or two
dimensional array, or two separate arrays, into a json_object of
name/value pairs, similarly to the hstore() function.
json_object_agg() aggregates its two arguments into a single json object
as name value pairs.

Catalog version bumped.

Andrew Dunstan, reviewed by Marko Tiikkaja.
2014-01-28 17:48:21 -05:00
Fujii Masao
9132b189bf Add pg_stat_archiver statistics view.
This view shows the statistics about the WAL archiver process's activity.

Gabriele Bartolini, reviewed by Michael Paquier, refactored a bit by me.
2014-01-29 02:58:22 +09:00
Bruce Momjian
051b3341c1 Remove orphaned prototype
Rajeev rastogi
2014-01-28 11:29:39 -05:00
Tom Lane
64e43c59b8 Log a detail message for auth failures due to missing or expired password.
It's worth distinguishing these cases from run-of-the-mill wrong-password
problems, since users have been known to waste lots of time pursuing the
wrong theory about what's failing.  Now, our longstanding policy about how
to report authentication failures is that we don't really want to tell the
*client* such things, since that might be giving information to a bad guy.
But there's nothing wrong with reporting the details to the postmaster log,
and indeed the comments in this area of the code contemplate that
interesting details should be so reported.  We just weren't handling these
particular interesting cases usefully.

To fix, add infrastructure allowing subroutines of ClientAuthentication()
to return a string to be added to the errdetail_log field of the main
authentication-failed error report.  We might later want to use this to
report other subcases of authentication failure the same way, but for the
moment I just dealt with password cases.

Per discussion of a patch from Josh Drake, though this is not what
he proposed.
2014-01-27 21:04:09 -05:00
Robert Haas
ea9df812d8 Relax the requirement that all lwlocks be stored in a single array.
This makes it possible to store lwlocks as part of some other data
structure in the main shared memory segment, or in a dynamic shared
memory segment.  There is still a main LWLock array and this patch does
not move anything out of it, but it provides necessary infrastructure
for doing that in the future.

This change is likely to increase the size of LWLockPadded on some
platforms, especially 32-bit platforms where it was previously only
16 bytes.

Patch by me.  Review by Andres Freund and KaiGai Kohei.
2014-01-27 11:07:44 -05:00
Tom Lane
2850896961 Code review for auto-tuned effective_cache_size.
Fix integer overflow issue noted by Magnus Hagander, as well as a bunch
of other infelicities in commit ee1e5662d8
and its unreasonably large number of followups.
2014-01-27 00:05:56 -05:00
Fujii Masao
7c619be623 Fix typos in comments for ALTER SYSTEM.
Michael Paquier
2014-01-27 12:23:20 +09:00
Andrew Dunstan
cec8394b5c Enable building with Visual Studion 2013.
Backpatch to 9.3.

Brar Piening.
2014-01-26 09:49:10 -05:00
Heikki Linnakangas
71c6a8e375 Add recovery_target='immediate' option.
This allows ending recovery as a consistent state has been reached. Without
this, there was no easy way to e.g restore an online backup, without
replaying any extra WAL after the backup ended.

MauMau and me.
2014-01-25 17:34:04 +02:00
Stephen Frost
fbe19ee3b8 ALTER TABLESPACE ... MOVE ... OWNED BY
Add the ability to specify the objects to move by who those objects are
owned by (as relowner) and change ALL to mean ALL objects.  This
makes the command always operate against a well-defined set of objects
and not have the objects-to-be-moved based on the role of the user
running the command.

Per discussion with Simon and Tom.
2014-01-23 23:52:40 -05:00
Tom Lane
ac4ef637ad Allow use of "z" flag in our printf calls, and use it where appropriate.
Since C99, it's been standard for printf and friends to accept a "z" size
modifier, meaning "whatever size size_t has".  Up to now we've generally
dealt with printing size_t values by explicitly casting them to unsigned
long and using the "l" modifier; but this is really the wrong thing on
platforms where pointers are wider than longs (such as Win64).  So let's
start using "z" instead.  To ensure we can do that on all platforms, teach
src/port/snprintf.c to understand "z", and add a configure test to force
use of that implementation when the platform's version doesn't handle "z".

Having done that, modify a bunch of places that were using the
unsigned-long hack to use "z" instead.  This patch doesn't pretend to have
gotten everyplace that could benefit, but it catches many of them.  I made
an effort in particular to ensure that all uses of the same error message
text were updated together, so as not to increase the number of
translatable strings.

It's possible that this change will result in format-string warnings from
pre-C99 compilers.  We might have to reconsider if there are any popular
compilers that will warn about this; but let's start by seeing what the
buildfarm thinks.

Andres Freund, with a little additional work by me
2014-01-23 17:18:33 -05:00
Alvaro Herrera
b152c6cd0d Make DROP IF EXISTS more consistently not fail
Some cases were still reporting errors and aborting, instead of a NOTICE
that the object was being skipped.  This makes it more difficult to
cleanly handle pg_dump --clean, so change that to instead skip missing
objects properly.

Per bug #7873 reported by Dave Rolsky; apparently this affects a large
number of users.

Authors: Pavel Stehule and Dean Rasheed.  Some tweaks by Álvaro Herrera
2014-01-23 14:40:29 -03:00
Heikki Linnakangas
36a35c550a Compress GIN posting lists, for smaller index size.
GIN posting lists are now encoded using varbyte-encoding, which allows them
to fit in much smaller space than the straight ItemPointer array format used
before. The new encoding is used for both the lists stored in-line in entry
tree items, and in posting tree leaf pages.

To maintain backwards-compatibility and keep pg_upgrade working, the code
can still read old-style pages and tuples. Posting tree leaf pages in the
new format are flagged with GIN_COMPRESSED flag, to distinguish old and new
format pages. Likewise, entry tree tuples in the new format have a
GIN_ITUP_COMPRESSED flag set in a bit that was previously unused.

This patch bumps GIN_CURRENT_VERSION from 1 to 2. New indexes created with
version 9.4 will therefore have version number 2 in the metapage, while old
pg_upgraded indexes will have version 1. The code treats them the same, but
it might be come handy in the future, if we want to drop support for the
uncompressed format.

Alexander Korotkov and me. Reviewed by Tomas Vondra and Amit Langote.
2014-01-22 19:20:58 +02:00
Robert Haas
01f7808b3e Add a cardinality function for arrays.
Unlike our other array functions, this considers the total number of
elements across all dimensions, and returns 0 rather than NULL when the
array has no elements.  But it seems that both of those behaviors are
almost universally disliked, so hopefully that's OK.

Marko Tiikkaja, reviewed by Dean Rasheed and Pavel Stehule
2014-01-21 12:38:53 -05:00
Alvaro Herrera
d2458e3b20 Expose a routine to print triggers during EXPLAIN ANALYZE
This is so that auto_explain can use it.

Kyotaro HORIGUCHI
2014-01-20 17:13:47 -03:00
Simon Riggs
4d1e2aeb1a Speed up COPY into tables with DEFAULT nextval()
Previously the presence of a nextval() prevented the
use of batch-mode COPY.  This patch introduces a
special case just for nextval() functions. In future
we will introduce a general case solution for
labelling volatile functions as safe for use.
2014-01-20 17:22:38 +00:00
Magnus Hagander
98de86e422 Remove support for native krb5 authentication
krb5 has been deprecated since 8.3, and the recommended way to do
Kerberos authentication is using the GSSAPI authentication method
(which is still fully supported).

libpq retains the ability to identify krb5 authentication, but only
gives an error message about it being unsupported. Since all authentication
is initiated from the backend, there is no need to keep it at all
in the backend.
2014-01-19 17:05:01 +01:00
Stephen Frost
5254958e92 Add CREATE TABLESPACE ... WITH ... Options
Tablespaces have a few options which can be set on them to give PG hints
as to how the tablespace behaves (perhaps it's faster for sequential
scans, or better able to handle random access, etc).  These options were
only available through the ALTER TABLESPACE command.

This adds the ability to set these options at CREATE TABLESPACE time,
removing the need to do both a CREATE TABLESPACE and ALTER TABLESPACE to
get the correct options set on the tablespace.

Vik Fearing, reviewed by Michael Paquier.
2014-01-18 20:59:31 -05:00
Tom Lane
115f414124 Fix VACUUM's reporting of dead-tuple counts to the stats collector.
Historically, VACUUM has just reported its new_rel_tuples estimate
(the same thing it puts into pg_class.reltuples) to the stats collector.
That number counts both live and dead-but-not-yet-reclaimable tuples.
This behavior may once have been right, but modern versions of the
pgstats code track live and dead tuple counts separately, so putting
the total into n_live_tuples and zero into n_dead_tuples is surely
pretty bogus.  Fix it to report live and dead tuple counts separately.

This doesn't really do much for situations where updating transactions
commit concurrently with a VACUUM scan (possibly causing double-counting or
omission of the tuples they add or delete); but it's clearly an improvement
over what we were doing before.

Hari Babu, reviewed by Amit Kapila
2014-01-18 19:24:33 -05:00
Stephen Frost
76e91b38ba Add ALTER TABLESPACE ... MOVE command
This adds a 'MOVE' sub-command to ALTER TABLESPACE which allows moving sets of
objects from one tablespace to another.  This can be extremely handy and avoids
a lot of error-prone scripting.  ALTER TABLESPACE ... MOVE will only move
objects the user owns, will notify the user if no objects were found, and can
be used to move ALL objects or specific types of objects (TABLES, INDEXES, or
MATERIALIZED VIEWS).
2014-01-18 18:56:40 -05:00
Tom Lane
0d79c0a8cc Make various variables const (read-only).
These changes should generally improve correctness/maintainability.
A nice side benefit is that several kilobytes move from initialized
data to text segment, allowing them to be shared across processes and
probably reducing copy-on-write overhead while forking a new backend.
Unfortunately this doesn't seem to help libpq in the same way (at least
not when it's compiled with -fpic on x86_64), but we can hope the linker
at least collects all nominally-const data together even if it's not
actually part of the text segment.

Also, make pg_encname_tbl[] static in encnames.c, since there seems
no very good reason for any other code to use it; per a suggestion
from Wim Lewis, who independently submitted a patch that was mostly
a subset of this one.

Oskari Saarenmaa, with some editorialization by me
2014-01-18 16:04:32 -05:00
Andrew Dunstan
7d7eee8bb7 Export a few more symbols required for test_shm_mq module.
Patch from Amit Kapila.
2014-01-18 15:29:45 -05:00
Andrew Dunstan
708c529c7f Export set_latch_on_sigusr1 symbol for Windows.
Per buildfarm currawong and grip from David Rowley.
2014-01-17 12:48:23 -05:00
Andrew Dunstan
b64d956d58 Prevent double macro definition of WIN32.
David Rowley.
2014-01-17 11:49:44 -05:00
Magnus Hagander
9c14dd22e1 Define WIN32 when _WIN32 is set
_WIN32 is set by the compiler, whereas our code uses WIN32 that is
normally set through our build system. To make it possible to build
extensions out of tree we cannot rely on that, so set the WIN32
symbol explicitly whenever the compiler has set _WIN32.

Not setting this symbol causes double inclusion of pg_config_os.h,
and possibly other errors as well.

Craig Ringer
2014-01-17 12:41:32 +01:00
Robert Haas
ed46758381 Logging running transactions every 15 seconds.
Previously, we did this just once per checkpoint, but that could make
Hot Standby take a long time to initialize.  To avoid busying an
otherwise-idle system, we don't do this if no WAL has been written
since we did it last.

Andres Freund
2014-01-15 12:41:20 -05:00
Tom Lane
061b079f89 Fix multiple bugs in index page locking during hot-standby WAL replay.
In ordinary operation, VACUUM must be careful to take a cleanup lock on
each leaf page of a btree index; this ensures that no indexscans could
still be "in flight" to heap tuples due to be deleted.  (Because of
possible index-tuple motion due to concurrent page splits, it's not enough
to lock only the pages we're deleting index tuples from.)  In Hot Standby,
the WAL replay process must likewise lock every leaf page.  There were
several bugs in the code for that:

* The replay scan might come across unused, all-zero pages in the index.
While btree_xlog_vacuum itself did the right thing (ie, nothing) with
such pages, xlogutils.c supposed that such pages must be corrupt and
would throw an error.  This accounts for various reports of replication
failures with "PANIC: WAL contains references to invalid pages".  To
fix, add a ReadBufferMode value that instructs XLogReadBufferExtended
not to complain when we're doing this.

* btree_xlog_vacuum performed the extra locking if standbyState ==
STANDBY_SNAPSHOT_READY, but that's not the correct test: we won't open up
for hot standby queries until the database has reached consistency, and
we don't want to do the extra locking till then either, for fear of reading
corrupted pages (which bufmgr.c would complain about).  Fix by exporting a
new function from xlog.c that will report whether we're actually in hot
standby replay mode.

* To ensure full coverage of the index in the replay scan, btvacuumscan
would emit a dummy WAL record for the last page of the index, if no
vacuuming work had been done on that page.  However, if the last page
of the index is all-zero, that would result in corruption of said page,
since the functions called on it weren't prepared to handle that case.
There's no need to lock any such pages, so change the logic to target
the last normal leaf page instead.

The first two of these bugs were diagnosed by Andres Freund, the other one
by me.  Fixes based on ideas from Heikki Linnakangas and myself.

This has been wrong since Hot Standby was introduced, so back-patch to 9.0.
2014-01-14 17:35:21 -05:00
Robert Haas
246a9a8d0c Fix typo in comment.
Etsuro Fujita
2014-01-14 14:34:57 -05:00
Robert Haas
ec9037df26 Single-reader, single-writer, lightweight shared message queue.
This code provides infrastructure for user backends to communicate
relatively easily with background workers.  The message queue is
structured as a ring buffer and allows messages of arbitary length
to be sent and received.

Patch by me.  Review by KaiGai Kohei and Andres Freund.
2014-01-14 12:23:22 -05:00
Robert Haas
6ddd5137b2 Simple table of contents for a shared memory segment.
This interface is intended to make it simple to divide a dynamic shared
memory segment into different regions with distinct purposes.  It
therefore serves much the same purpose that ShmemIndex accomplishes for
the main shared memory segment, but it is intended to be more
lightweight.

Patch by me.  Review by Andres Freund.
2014-01-14 12:18:58 -05:00
Robert Haas
2bb1f14b89 Make bitmap heap scans show exact/lossy block info in EXPLAIN ANALYZE.
Etsuro Fujita
2014-01-13 14:42:16 -05:00
Tom Lane
158b7fa6a3 Disallow LATERAL references to the target table of an UPDATE/DELETE.
On second thought, commit 0c051c9008 was
over-hasty: rather than allowing this case, we ought to reject it for now.
That leaves the field clear for a future feature that allows the target
table to be re-specified in the FROM (or USING) clause, which will enable
left-joining the target table to something else.  We can then also allow
LATERAL references to such an explicitly re-specified target table.
But allowing them right now will create ambiguities or worse for such a
feature, and it isn't something we documented 9.3 as supporting.

While at it, add a convenience subroutine to avoid having several copies
of the ereport for disalllowed-LATERAL-reference cases.
2014-01-11 19:03:12 -05:00
Andrew Dunstan
11829ff8b2 Remove DESCR entries for json operator functions.
Per -hackers discussion.
2014-01-10 22:25:04 -05:00
Bruce Momjian
111022eac6 Move username lookup functions from /port to /common
Per suggestion from Peter E and Alvaro
2014-01-10 18:03:28 -05:00
Tom Lane
220b34331f We don't need to include pg_sema.h in s_lock.h anymore.
Minor improvement to commit daa7527afc:
s_lock.h no longer has any need to mention PGSemaphoreData, so we can
rip out the #include that supplies that.  In a non-HAVE_SPINLOCKS
build, this doesn't really buy much since we still need the #include
in spin.h --- but everywhere else, this reduces #include footprint by
some trifle, and helps keep the different locking facilities separate.
2014-01-08 20:58:22 -05:00
Robert Haas
daa7527afc Reduce the number of semaphores used under --disable-spinlocks.
Instead of allocating a semaphore from the operating system for every
spinlock, allocate a fixed number of semaphores (by default, 1024)
from the operating system and multiplex all the spinlocks that get
created onto them.  This could self-deadlock if a process attempted
to acquire more than one spinlock at a time, but since processes
aren't supposed to execute anything other than short stretches of
straight-line code while holding a spinlock, that shouldn't happen.

One motivation for this change is that, with the introduction of
dynamic shared memory, it may be desirable to create spinlocks that
last for less than the lifetime of the server.  Without this change,
attempting to use such facilities under --disable-spinlocks would
quickly exhaust any supply of available semaphores.  Quite apart
from that, it's desirable to contain the quantity of semaphores
needed to run the server simply on convenience grounds, since using
too many may make it harder to get PostgreSQL running on a new
platform, which is mostly the point of --disable-spinlocks in the
first place.

Patch by me; review by Tom Lane.
2014-01-08 18:58:00 -05:00
Bruce Momjian
7e04792a1c Update copyright for 2014
Update all files in head, and files COPYRIGHT and legal.sgml in all back
branches.
2014-01-07 16:05:30 -05:00
Tom Lane
8b49a6044d Cache catalog lookup data across groups in ordered-set aggregates.
The initial commit of ordered-set aggregates just did all the setup work
afresh each time the aggregate function is started up.  But in a GROUP BY
query, the catalog lookups need not be repeated for each group, since the
column datatypes and sort information won't change.  When there are many
small groups, this makes for a useful, though not huge, performance
improvement.  Per suggestion from Andrew Gierth.

Profiling of these cases suggests that it might be profitable to avoid
duplicate lookups within tuplesort startup as well; but changing the
tuplesort APIs would have much broader impact, so I left that for
another day.
2014-01-05 12:28:39 -05:00
Tom Lane
a7ef273e1c Fix calculation of maximum statistics-message size.
The PGSTAT_NUM_TABENTRIES macro should have been updated when new fields
were added to struct PgStat_MsgTabstat in commit 644828908, but it wasn't.
Fix that.

Also, add a static assertion that we didn't overrun the intended size limit
on stats messages.  This will not necessarily catch every mistake in
computing the maximum array size for stats messages, but it will catch ones
that have practical consequences.  (The assertion in fact doesn't complain
about the aforementioned error in PGSTAT_NUM_TABENTRIES, because that was
not big enough to cause the array length to increase.)

No back-patch, as there's no actual bug in existing releases; this is just
in the nature of future-proofing.

Mark Dilger and Tom Lane
2014-01-02 21:45:51 -05:00
Alvaro Herrera
722acf51a0 Handle wraparound during truncation in multixact/members
In pg_multixact/members, relying on modulo-2^32 arithmetic for
wraparound handling doesn't work all that well.  Because we don't
explicitely track wraparound of the allocation counter for members, it
is possible that the "live" area exceeds 2^31 entries; trying to remove
SLRU segments that are "old" according to the original logic might lead
to removal of segments still in use.  To fix, have the truncation
routine use a tailored SlruScanDirectory callback that keeps track of
the live area in actual use; that way, when the live range exceeds 2^31
entries, the oldest segments still live will not get removed untimely.

This new SlruScanDir callback needs to take care not to remove segments
that are "in the future": if new SLRU segments appear while the
truncation is ongoing, make sure we don't remove them.  This requires
examination of shared memory state to recheck for false positives, but
testing suggests that this doesn't cause a problem.  The original coding
didn't suffer from this pitfall because segments created when truncation
is running are never considered to be removable.

Per Andres Freund's investigation of bug #8673 reported by Serge
Negodyuck.
2014-01-02 18:16:54 -03:00
Robert Haas
3cff1879f8 Aggressively freeze tables when CLUSTER or VACUUM FULL rewrites them.
We haven't wanted to do this in the past on the grounds that in rare
cases the original xmin value will be needed for forensic purposes, but
commit 37484ad2aa removes that objection,
so now we can.

Per extensive discussion, among many people, on pgsql-hackers.
2014-01-02 15:15:51 -05:00
Robert Haas
4b351841fa Rename walLogHints to wal_log_hints for easier grepping.
Michael Paquier
2014-01-01 20:17:00 -05:00
Tom Lane
f7fbf4b0be Remove dead code now that orindxpath.c is history.
We don't need make_restrictinfo_from_bitmapqual() anymore at all.
generate_bitmap_or_paths() doesn't need to be exported, and we can
drop its rather klugy restriction_only flag.
2013-12-30 12:50:31 -05:00
Tom Lane
f343a880d5 Extract restriction OR clauses whether or not they are indexable.
It's possible to extract a restriction OR clause from a join clause that
has the form of an OR-of-ANDs, if each sub-AND includes a clause that
mentions only one specific relation.  While PG has been aware of that idea
for many years, the code previously only did it if it could extract an
indexable OR clause.  On reflection, though, that seems a silly limitation:
adding a restriction clause can be a win by reducing the number of rows
that have to be filtered at the join step, even if we have to test the
clause as a plain filter clause during the scan.  This should be especially
useful for foreign tables, where the change can cut the number of rows that
have to be retrieved from the foreign server; but testing shows it can win
even on local tables.  Per a suggestion from Robert Haas.

As a heuristic, I made the code accept an extracted restriction clause
if its estimated selectivity is less than 0.9, which will probably result
in accepting extracted clauses just about always.  We might need to tweak
that later based on experience.

Since the code no longer has even a weak connection to Path creation,
remove orindxpath.c and create a new file optimizer/util/orclauses.c.

There's some additional janitorial cleanup of now-dead code that needs
to happen, but it seems like that's a fit subject for a separate commit.
2013-12-30 12:24:37 -05:00
Tom Lane
ed011d9754 Undo autoconf 2.69's attempt to #define _DARWIN_USE_64_BIT_INODE.
Defining this symbol causes OS X 10.5 to use a buggy version of readdir(),
which can sometimes fail with EINVAL if the previously-fetched directory
entry has been deleted or renamed.  In later OS X versions that bug has
been repaired, but we still don't need the #define because it's on by
default.  So this is just an all-around bad idea, and we can do without it.
2013-12-29 12:57:56 -05:00
Peter Eisentraut
a09e3fd776 Fix whitespace 2013-12-26 23:51:56 -05:00
Tom Lane
8d65da1f01 Support ordered-set (WITHIN GROUP) aggregates.
This patch introduces generic support for ordered-set and hypothetical-set
aggregate functions, as well as implementations of the instances defined in
SQL:2008 (percentile_cont(), percentile_disc(), rank(), dense_rank(),
percent_rank(), cume_dist()).  We also added mode() though it is not in the
spec, as well as versions of percentile_cont() and percentile_disc() that
can compute multiple percentile values in one pass over the data.

Unlike the original submission, this patch puts full control of the sorting
process in the hands of the aggregate's support functions.  To allow the
support functions to find out how they're supposed to sort, a new API
function AggGetAggref() is added to nodeAgg.c.  This allows retrieval of
the aggregate call's Aggref node, which may have other uses beyond the
immediate need.  There is also support for ordered-set aggregates to
install cleanup callback functions, so that they can be sure that
infrastructure such as tuplesort objects gets cleaned up.

In passing, make some fixes in the recently-added support for variadic
aggregates, and make some editorial adjustments in the recent FILTER
additions for aggregates.  Also, simplify use of IsBinaryCoercible() by
allowing it to succeed whenever the target type is ANY or ANYELEMENT.
It was inconsistent that it dealt with other polymorphic target types
but not these.

Atri Sharma and Andrew Gierth; reviewed by Pavel Stehule and Vik Fearing,
and rather heavily editorialized upon by Tom Lane
2013-12-23 16:11:35 -05:00
Robert Haas
37484ad2aa Change the way we mark tuples as frozen.
Instead of changing the tuple xmin to FrozenTransactionId, the combination
of HEAP_XMIN_COMMITTED and HEAP_XMIN_INVALID, which were previously never
set together, is now defined as HEAP_XMIN_FROZEN.  A variety of previous
proposals to freeze tuples opportunistically before vacuum_freeze_min_age
is reached have foundered on the objection that replacing xmin by
FrozenTransactionId might hinder debugging efforts when things in this
area go awry; this patch is intended to solve that problem by keeping
the XID around (but largely ignoring the value to which it is set).

Third-party code that checks for HEAP_XMIN_INVALID on tuples where
HEAP_XMIN_COMMITTED might be set will be broken by this change.  To fix,
use the new accessor macros in htup_details.h rather than consulting the
bits directly.  HeapTupleHeaderGetXmin has been modified to return
FrozenTransactionId when the infomask bits indicate that the tuple is
frozen; use HeapTupleHeaderGetRawXmin when you already know that the
tuple isn't marked commited or frozen, or want the raw value anyway.
We currently do this in routines that display the xmin for user consumption,
in tqual.c where it's known to be safe and important for the avoidance of
extra cycles, and in the function-caching code for various procedural
languages, which shouldn't invalidate the cache just because the tuple
gets frozen.

Robert Haas and Andres Freund
2013-12-22 15:49:09 -05:00
Fujii Masao
961bf59fb7 Rename wal_log_hintbits to wal_log_hints, per discussion on pgsql-hackers.
Sawada Masahiko
2013-12-21 03:33:16 +09:00
Bruce Momjian
527fdd9df1 Move pg_upgrade_support global variables to their own include file
Previously their declarations were spread around to avoid accidental
access.
2013-12-19 16:10:07 -05:00
Peter Eisentraut
94b899b829 Upgrade to Autoconf 2.69 2013-12-18 20:53:23 -05:00
Robert Haas
001a573a20 Allow on-detach callbacks for dynamic shared memory segments.
Just as backends must clean up their shared memory state (releasing
lwlocks, buffer pins, etc.) before exiting, they must also perform
any similar cleanups related to dynamic shared memory segments they
have mapped before unmapping those segments.  So add a mechanism to
ensure that.

Existing on_shmem_exit hooks include both "user level" cleanup such
as transaction abort and removal of leftover temporary relations and
also "low level" cleanup that forcibly released leftover shared
memory resources.  On-detach callbacks should run after the first
group but before the second group, so create a new before_shmem_exit
function for registering the early callbacks and keep on_shmem_exit
for the regular callbacks.  (An earlier draft of this patch added an
additional argument to on_shmem_exit, but that had a much larger
footprint and probably a substantially higher risk of breaking third
party code for no real gain.)

Patch by me, reviewed by KaiGai Kohei and Andres Freund.
2013-12-18 13:09:09 -05:00
Bruce Momjian
613c6d26bd Fix incorrect error message reported for non-existent users
Previously, lookups of non-existent user names could return "Success";
it will now return "User does not exist" by resetting errno.  This also
centralizes the user name lookup code in libpgport.

Report and analysis by Nicolas Marchildon;  patch by me
2013-12-18 12:16:21 -05:00
Alvaro Herrera
11ac4c73cb Don't ignore tuple locks propagated by our updates
If a tuple was locked by transaction A, and transaction B updated it,
the new version of the tuple created by B would be locked by A, yet
visible only to B; due to an oversight in HeapTupleSatisfiesUpdate, the
lock held by A wouldn't get checked if transaction B later deleted (or
key-updated) the new version of the tuple.  This might cause referential
integrity checks to give false positives (that is, allow deletes that
should have been rejected).

This is an easy oversight to have made, because prior to improved tuple
locks in commit 0ac5ad5134 it wasn't possible to have tuples created by
our own transaction that were also locked by remote transactions, and so
locks weren't even considered in that code path.

It is recommended that foreign keys be rechecked manually in bulk after
installing this update, in case some referenced rows are missing with
some referencing row remaining.

Per bug reported by Daniel Wood in
CAPweHKe5QQ1747X2c0tA=5zf4YnS2xcvGf13Opd-1Mq24rF1cQ@mail.gmail.com
2013-12-18 13:45:51 -03:00
Tatsuo Ishii
65d6e4cb5c Add ALTER SYSTEM command to edit the server configuration file.
Patch contributed by Amit Kapila. Reviewed by Hari Babu, Masao Fujii,
Boszormenyi Zoltan, Andres Freund, Greg Smith and others.
2013-12-18 23:42:44 +09:00
Alvaro Herrera
3b97e6823b Rework tuple freezing protocol
Tuple freezing was broken in connection to MultiXactIds; commit
8e53ae025d tried to fix it, but didn't go far enough.  As noted by
Noah Misch, freezing a tuple whose Xmax is a multi containing an aborted
update might cause locks in the multi to go ignored by later
transactions.  This is because the code depended on a multixact above
their cutoff point not having any lock-only member older than the cutoff
point for Xids, which is easily defeated in READ COMMITTED transactions.

The fix for this involves creating a new MultiXactId when necessary.
But this cannot be done during WAL replay, and moreover multixact
examination requires using CLOG access routines which are not supposed
to be used during WAL replay either; so tuple freezing cannot be done
with the old freeze WAL record.  Therefore, separate the freezing
computation from its execution, and change the WAL record to carry all
necessary information.  At WAL replay time, it's easy to re-execute
freezing because we don't need to re-compute the new infomask/Xmax
values but just take them from the WAL record.

While at it, restructure the coding to ensure all page changes occur in
a single critical section without much room for failures.  The previous
coding wasn't using a critical section, without any explanation as to
why this was acceptable.

In replication scenarios using the 9.3 branch, standby servers must be
upgraded before their master, so that they are prepared to deal with the
new WAL record once the master is upgraded; failure to do so will cause
WAL replay to die with a PANIC message.  Later upgrade of the standby
will allow the process to continue where it left off, so there's no
disruption of the data in the standby in any case.  Standbys know how to
deal with the old WAL record, so it's okay to keep the master running
the old code for a while.

In master, the old freeze WAL record is gone, for cleanliness' sake;
there's no compatibility concern there.

Backpatch to 9.3, where the original bug was introduced and where the
previous fix was backpatched.

Álvaro Herrera and Andres Freund
2013-12-16 11:29:50 -03:00
Heikki Linnakangas
50e547096c Add GUC to enable WAL-logging of hint bits, even with checksums disabled.
WAL records of hint bit updates is useful to tools that want to examine
which pages have been modified. In particular, this is required to make
the pg_rewind tool safe (without checksums).

This can also be used to test how much extra WAL-logging would occur if
you enabled checksums, without actually enabling them (which you can't
currently do without re-initdb'ing).

Sawada Masahiko, docs by Samrat Revagade. Reviewed by Dilip Kumar, with
further changes by me.
2013-12-13 16:26:14 +02:00
Simon Riggs
8693559cac New autovacuum_work_mem parameter
If autovacuum_work_mem is set, autovacuum workers now use
this parameter in preference to maintenance_work_mem.

Peter Geoghegan
2013-12-12 11:42:39 +00:00
Robert Haas
66abc2608c Add a new reloption, user_catalog_table.
When this reloption is set and wal_level=logical is configured,
we'll record the CIDs stamped by inserts, updates, and deletes to
the table just as we would for an actual catalog table.  This will
allow logical decoding to use historical MVCC snapshots to access
such tables just as they access ordinary catalog tables.

Replication solutions built around the logical decoding machinery
will likely need to set this operation for their configuration
tables; it might also be needed by extensions which perform table
access in their output functions.

Andres Freund, reviewed by myself and others.
2013-12-10 19:17:34 -05:00
Robert Haas
e55704d8b2 Add new wal_level, logical, sufficient for logical decoding.
When wal_level=logical, we'll log columns from the old tuple as
configured by the REPLICA IDENTITY facility added in commit
07cacba983.  This makes it possible
a properly-configured logical replication solution to correctly
follow table updates even if they change the chosen key columns,
or, with REPLICA IDENTITY FULL, even if the table has no key at
all.  Note that updates which do not modify the replica identity
column won't log anything extra, making the choice of a good key
(i.e. one that will rarely be changed) important to performance
when wal_level=logical is configured.

Each insert, update, or delete to a catalog table will also log
the CMIN and/or CMAX values of stamped by the current transaction.
This is necessary because logical decoding will require access to
historical snapshots of the catalog in order to decode some data
types, and the CMIN/CMAX values that we may need in order to judge
row visibility may have been overwritten by the time we need them.

Andres Freund, reviewed in various versions by myself, Heikki
Linnakangas, KONDO Mitsumasa, and many others.
2013-12-10 19:01:40 -05:00
Noah Misch
53685d7981 Rename TABLE() to ROWS FROM().
SQL-standard TABLE() is a subset of UNNEST(); they deal with arrays and
other collection types.  This feature, however, deals with set-returning
functions.  Use a different syntax for this feature to keep open the
possibility of implementing the standard TABLE().
2013-12-10 09:34:37 -05:00
Heikki Linnakangas
9e857436ef Don't include unused space in LOG_NEWPAGE records.
This is the same trick we use when taking a full page image of a buffer
passed to XLogInsert.
2013-12-04 00:10:47 +02:00
Peter Eisentraut
34fa72ec9c Remove use of obsolescent Autoconf macros
Remove the use of the following macros, which are obsolescent according
to the Autoconf documentation:

- AC_C_CONST
- AC_C_STRINGIZE
- AC_C_VOLATILE
- AC_FUNC_MEMCMP
2013-11-30 09:17:08 -05:00
Alvaro Herrera
1df0122daa Truncate pg_multixact/'s contents during crash recovery
Commit 9dc842f08 of 8.2 era prevented MultiXact truncation during crash
recovery, because there was no guarantee that enough state had been
setup, and because it wasn't deemed to be a good idea to remove data
during crash recovery anyway.  Since then, due to Hot-Standby, streaming
replication and PITR, the amount of time a cluster can spend doing crash
recovery has increased significantly, to the point that a cluster may
even never come out of it.  This has made not truncating the content of
pg_multixact/ not defensible anymore.

To fix, take care to setup enough state for multixact truncation before
crash recovery starts (easy since checkpoints contain the required
information), and move the current end-of-recovery actions to a new
TrimMultiXact() function, analogous to TrimCLOG().

At some later point, this should probably done similarly to the way
clog.c is doing it, which is to just WAL log truncations, but we can't
do that for the back branches.

Back-patch to 9.0.  8.4 also has the problem, but since there's no hot
standby there, it's much less pressing.  In 9.2 and earlier, this patch
is simpler than in newer branches, because multixact access during
recovery isn't required.  Add appropriate checks to make sure that's not
happening.

Andres Freund
2013-11-29 21:47:15 -03:00
Alvaro Herrera
f54106f77e Fix full-table-vacuum request mechanism for MultiXactIds
While autovacuum dutifully launched anti-multixact-wraparound vacuums
when the multixact "age" was reached, the vacuum code was not aware that
it needed to make them be full table vacuums.  As the resulting
partial-table vacuums aren't capable of actually increasing relminmxid,
autovacuum continued to launch anti-wraparound vacuums that didn't have
the intended effect, until age of relfrozenxid caused the vacuum to
finally be a full table one via vacuum_freeze_table_age.

To fix, introduce logic for multixacts similar to that for plain
TransactionIds, using the same GUCs.

Backpatch to 9.3, where permanent MultiXactIds were introduced.

Andres Freund, some cleanup by Álvaro
2013-11-29 21:47:13 -03:00
Tom Lane
16e1b7a1b7 Fix assorted race conditions in the new timeout infrastructure.
Prevent handle_sig_alarm from losing control partway through due to a query
cancel (either an asynchronous SIGINT, or a cancel triggered by one of the
timeout handler functions).  That would at least result in failure to
schedule any required future interrupt, and might result in actual
corruption of timeout.c's data structures, if the interrupt happened while
we were updating those.

We could still lose control if an asynchronous SIGINT arrives just as the
function is entered.  This wouldn't break any data structures, but it would
have the same effect as if the SIGALRM interrupt had been silently lost:
we'd not fire any currently-due handlers, nor schedule any new interrupt.
To forestall that scenario, forcibly reschedule any pending timer interrupt
during AbortTransaction and AbortSubTransaction.  We can avoid any extra
kernel call in most cases by not doing that until we've allowed
LockErrorCleanup to kill the DEADLOCK_TIMEOUT and LOCK_TIMEOUT events.

Another hazard is that some platforms (at least Linux and *BSD) block a
signal before calling its handler and then unblock it on return.  When we
longjmp out of the handler, the unblock doesn't happen, and the signal is
left blocked indefinitely.  Again, we can fix that by forcibly unblocking
signals during AbortTransaction and AbortSubTransaction.

These latter two problems do not manifest when the longjmp reaches
postgres.c, because the error recovery code there kills all pending timeout
events anyway, and it uses sigsetjmp(..., 1) so that the appropriate signal
mask is restored.  So errors thrown outside any transaction should be OK
already, and cleaning up in AbortTransaction and AbortSubTransaction should
be enough to fix these issues.  (We're assuming that any code that catches
a query cancel error and doesn't re-throw it will do at least a
subtransaction abort to clean up; but that was pretty much required already
by other subsystems.)

Lastly, ProcSleep should not clear the LOCK_TIMEOUT indicator flag when
disabling that event: if a lock timeout interrupt happened after the lock
was granted, the ensuing query cancel is still going to happen at the next
CHECK_FOR_INTERRUPTS, and we want to report it as a lock timeout not a user
cancel.

Per reports from Dan Wood.

Back-patch to 9.3 where the new timeout handling infrastructure was
introduced.  We may at some point decide to back-patch the signal
unblocking changes further, but I'll desist from that until we hear
actual field complaints about it.
2013-11-29 16:41:00 -05:00
Robert Haas
8e18d04d4d Refine our definition of what constitutes a system relation.
Although user-defined relations can't be directly created in
pg_catalog, it's possible for them to end up there, because you can
create them in some other schema and then use ALTER TABLE .. SET SCHEMA
to move them there.  Previously, such relations couldn't afterwards
be manipulated, because IsSystemRelation()/IsSystemClass() rejected
all attempts to modify objects in the pg_catalog schema, regardless
of their origin.  With this patch, they now reject only those
objects in pg_catalog which were created at initdb-time, allowing
most operations on user-created tables in pg_catalog to proceed
normally.

This patch also adds new functions IsCatalogRelation() and
IsCatalogClass(), which is similar to IsSystemRelation() and
IsSystemClass() but with a slightly narrower definition: only TOAST
tables of system catalogs are included, rather than *all* TOAST tables.
This is currently used only for making decisions about when
invalidation messages need to be sent, but upcoming logical decoding
patches will find other uses for this information.

Andres Freund, with some modifications by me.
2013-11-28 20:57:20 -05:00
Tom Lane
7db285afc9 Fix stale-pointer problem in fast-path locking logic.
When acquiring a lock in fast-path mode, we must reset the locallock
object's lock and proclock fields to NULL.  They are not necessarily that
way to start with, because the locallock could be left over from a failed
lock acquisition attempt earlier in the transaction.  Failure to do this
led to all sorts of interesting misbehaviors when LockRelease tried to
clean up no-longer-related lock and proclock objects in shared memory.
Per report from Dan Wood.

In passing, modify LockRelease to elog not just Assert if it doesn't find
lock and proclock objects for a formerly fast-path lock, matching the code
in FastPathGetRelationLockEntry and LockRefindAndRelease.  This isn't a
bug but it will help in diagnosing any future bugs in this area.

Also, modify FastPathTransferRelationLocks and FastPathGetRelationLockEntry
to break out of their loops over the fastpath array once they've found the
sole matching entry.  This was inconsistently done in some search loops
and not others.

Improve assorted related comments, too.

Back-patch to 9.2 where the fast-path mechanism was introduced.
2013-11-27 18:10:00 -05:00
Heikki Linnakangas
631118fe1e Get rid of the post-recovery cleanup step of GIN page splits.
Replace it with an approach similar to what GiST uses: when a page is split,
the left sibling is marked with a flag indicating that the parent hasn't been
updated yet. When the parent is updated, the flag is cleared. If an insertion
steps on a page with the flag set, it will finish split before proceeding
with the insertion.

The post-recovery cleanup mechanism was never totally reliable, as insertion
to the parent could fail e.g because of running out of memory or disk space,
leaving the tree in an inconsistent state.

This also divides the responsibility of WAL-logging more clearly between
the generic ginbtree.c code, and the parts specific to entry and posting
trees. There is now a common WAL record format for insertions and deletions,
which is written by ginbtree.c, followed by tree-specific payload, which is
returned by the placetopage- and split- callbacks.
2013-11-27 19:21:23 +02:00
Heikki Linnakangas
ce5326eed3 More GIN refactoring.
Separate the insertion payload from the more static portions of GinBtree.
GinBtree now only contains information related to searching the tree, and
the information of what to insert is passed separately.

Add root block number to GinBtree, instead of passing it around all the
functions as argument.

Split off ginFinishSplit() from ginInsertValue(). ginFinishSplit is
responsible for finding the parent and inserting the downlink to it.
2013-11-27 15:43:05 +02:00
Peter Eisentraut
85ed91ee7d Implement information_schema.parameters.parameter_default column
Reviewed-by: Ali Dar <ali.munir.dar@gmail.com>
Reviewed-by: Amit Khandekar <amit.khandekar@enterprisedb.com>
Reviewed-by: Rodolfo Campero <rodolfo.campero@anachronics.com>
2013-11-26 23:21:35 -05:00
Bruce Momjian
a6542a4b68 Change SET LOCAL/CONSTRAINTS/TRANSACTION and ABORT behavior
Change SET LOCAL/CONSTRAINTS/TRANSACTION behavior outside of a
transaction block from error (post-9.3) to warning.  (Was nothing in <=
9.3.)  Also change ABORT outside of a transaction block from notice to
warning.
2013-11-25 19:19:40 -05:00
Tom Lane
45e02e3232 Fix array slicing of int2vector and oidvector values.
The previous coding labeled expressions such as pg_index.indkey[1:3] as
being of int2vector type; which is not right because the subscript bounds
of such a result don't, in general, satisfy the restrictions of int2vector.
To fix, implicitly promote the result of slicing int2vector to int2[],
or oidvector to oid[].  This is similar to what we've done with domains
over arrays, which is a good analogy because these types are very much
like restricted domains of the corresponding regular-array types.

A side-effect is that we now also forbid array-element updates on such
columns, eg while "update pg_index set indkey[4] = 42" would have worked
before if you were superuser (and corrupted your catalogs irretrievably,
no doubt) it's now disallowed.  This seems like a good thing since, again,
some choices of subscripting would've led to results not satisfying the
restrictions of int2vector.  The case of an array-slice update was
rejected before, though with a different error message than you get now.
We could make these cases work in future if we added a cast from int2[]
to int2vector (with a cast function checking the subscript restrictions)
but it seems unlikely that there's any value in that.

Per report from Ronan Dunklau.  Back-patch to all supported branches
because of the crash risks involved.
2013-11-23 20:03:56 -05:00
Tom Lane
784e762e88 Support multi-argument UNNEST(), and TABLE() syntax for multiple functions.
This patch adds the ability to write TABLE( function1(), function2(), ...)
as a single FROM-clause entry.  The result is the concatenation of the
first row from each function, followed by the second row from each
function, etc; with NULLs inserted if any function produces fewer rows than
others.  This is believed to be a much more useful behavior than what
Postgres currently does with multiple SRFs in a SELECT list.

This syntax also provides a reasonable way to combine use of column
definition lists with WITH ORDINALITY: put the column definition list
inside TABLE(), where it's clear that it doesn't control the ordinality
column as well.

Also implement SQL-compliant multiple-argument UNNEST(), by turning
UNNEST(a,b,c) into TABLE(unnest(a), unnest(b), unnest(c)).

The SQL standard specifies TABLE() with only a single function, not
multiple functions, and it seems to require an implicit UNNEST() which is
not what this patch does.  There may be something wrong with that reading
of the spec, though, because if it's right then the spec's TABLE() is just
a pointless alternative spelling of UNNEST().  After further review of
that, we might choose to adopt a different syntax for what this patch does,
but in any case this functionality seems clearly worthwhile.

Andrew Gierth, reviewed by Zoltán Böszörményi and Heikki Linnakangas, and
significantly revised by me
2013-11-21 19:37:20 -05:00
Heikki Linnakangas
501012631e Refactor the internal GIN B-tree interface for forming a downlink.
This creates a new gin-btree callback function for creating a downlink for
a page. Previously, ginxlog.c duplicated the logic used during normal
operation.
2013-11-20 16:57:41 +02:00
Heikki Linnakangas
04965ad40e Further GIN refactoring.
Merge some functions that were always called together. Makes the code
little bit more readable.
2013-11-20 16:09:14 +02:00
Tom Lane
f901bb50e3 Add make_date() and make_time() functions.
Pavel Stehule, reviewed by Jeevan Chalke and Atri Sharma
2013-11-17 15:06:50 -05:00
Tom Lane
69c8fbac20 Improve performance of numeric sum(), avg(), stddev(), variance(), etc.
This patch improves performance of most built-in aggregates that formerly
used a NUMERIC or NUMERIC array as their transition type; this includes
not only aggregates on numeric inputs, but some aggregates on integer
inputs where overflow of an int8 value is a possibility.  The code now
uses a special-purpose data structure to avoid array construction and
deconstruction overhead, as well as packing and unpacking overhead for
numeric values.

These aggregates' transition type is now declared as INTERNAL, since
it doesn't correspond to any SQL data type.  To keep the planner from
thinking that that means a lot of storage will be used, we make use
of the just-added pg_aggregate.aggtransspace feature.  The space estimate
is set to 128 bytes, which is at least in the right ballpark.

Hadi Moshayedi, reviewed by Pavel Stehule and Tomas Vondra
2013-11-16 18:46:34 -05:00
Tom Lane
6cb86143e8 Allow aggregates to provide estimates of their transition state data size.
Formerly the planner had a hard-wired rule of thumb for guessing the amount
of space consumed by an aggregate function's transition state data.  This
estimate is critical to deciding whether it's OK to use hash aggregation,
and in many situations the built-in estimate isn't very good.  This patch
adds a column to pg_aggregate wherein a per-aggregate estimate can be
provided, overriding the planner's default, and infrastructure for setting
the column via CREATE AGGREGATE.

It may be that additional smarts will be required in future, perhaps even
a per-aggregate estimation function.  But this is already a step forward.

This is extracted from a larger patch to improve the performance of numeric
and int8 aggregates.  I (tgl) thought it was worth reviewing and committing
this infrastructure separately.  In this commit, all built-in aggregates
are given aggtransspace = 0, so no behavior should change.

Hadi Moshayedi, reviewed by Pavel Stehule and Tomas Vondra
2013-11-16 16:03:40 -05:00
Tom Lane
f3b3b8d5be Compute correct em_nullable_relids in get_eclass_for_sort_expr().
Bug #8591 from Claudio Freire demonstrates that get_eclass_for_sort_expr
must be able to compute valid em_nullable_relids for any new equivalence
class members it creates.  I'd worried about this in the commit message
for db9f0e1d9a, but claimed that it wasn't a
problem because multi-member ECs should already exist when it runs.  That
is transparently wrong, though, because this function is also called by
initialize_mergeclause_eclasses, which runs during deconstruct_jointree.
The example given in the bug report (which the new regression test item
is based upon) fails because the COALESCE() expression is first seen by
initialize_mergeclause_eclasses rather than process_equivalence.

Fixing this requires passing the appropriate nullable_relids set to
get_eclass_for_sort_expr, and it requires new code to compute that set
for top-level expressions such as ORDER BY, GROUP BY, etc.  We store
the top-level nullable_relids in a new field in PlannerInfo to avoid
computing it many times.  In the back branches, I've added the new
field at the end of the struct to minimize ABI breakage for planner
plugins.  There doesn't seem to be a good alternative to changing
get_eclass_for_sort_expr's API signature, though.  There probably aren't
any third-party extensions calling that function directly; moreover,
if there are, they probably need to think about what to pass for
nullable_relids anyway.

Back-patch to 9.2, like the previous patch in this area.
2013-11-15 16:46:18 -05:00
Peter Eisentraut
001e114b8d Fix whitespace issues found by git diff --check, add gitattributes
Set per file type attributes in .gitattributes to fine-tune whitespace
checks.  With the associated cleanups, the tree is now clean for git
2013-11-10 14:48:29 -05:00
Heikki Linnakangas
ac4ab97ec0 Fix race condition in GIN posting tree page deletion.
If a page is deleted, and reused for something else, just as a search is
following a rightlink to it from its left sibling, the search would continue
scanning whatever the new contents of the page are. That could lead to
incorrect query results, or even something more curious if the page is
reused for a different kind of a page.

To fix, modify the search algorithm to lock the next page before releasing
the previous one, and refrain from deleting pages from the leftmost branch
of the tree.

Add a new Concurrency section to the README, explaining why this works.
There is a lot more one could say about concurrency in GIN, but that's for
another patch.

Backpatch to all supported versions.
2013-11-08 22:21:42 +02:00
Robert Haas
07cacba983 Add the notion of REPLICA IDENTITY for a table.
Pending patches for logical replication will use this to determine
which columns of a tuple ought to be considered as its candidate key.

Andres Freund, with minor, mostly cosmetic adjustments by me
2013-11-08 12:30:43 -05:00
Heikki Linnakangas
0ea53256a8 Fix missing argument and function prototypes.
Not sure how I missed these in previous commit.
2013-11-06 11:22:58 +02:00
Heikki Linnakangas
ecaa4708e5 Misc GIN refactoring.
Merge the isEnoughSpace and placeToPage functions in the b-tree interface
into one function that tries to put a tuple on page, and returns false if
it doesn't fit.

Move createPostingTree function to gindatapage.c, and change its contract
so that it can be passed more items than fit on the root page. It's in a
better position than the callers to know how many items fit.

Move ginMergeItemPointers out of gindatapage.c, into a separate file.

These changes make no difference now, but reduce the footprint of Alexander
Korotkov's upcoming patch to pack item pointers more tightly.
2013-11-06 10:32:09 +02:00
Tom Lane
45f64f1bbf Remove CTimeZone/HasCTZSet, root and branch.
These variables no longer have any useful purpose, since there's no reason
to special-case brute force timezones now that we have a valid
session_timezone setting for them.  Remove the variables, and remove the
SET/SHOW TIME ZONE code that deals with them.

The user-visible impact of this is that SHOW TIME ZONE will now show a
POSIX-style zone specification, in the form "<+-offset>-+offset", rather
than an interval value when a brute-force zone has been set.  While perhaps
less intuitive, this is a better definition than before because it's
actually possible to give that string back to SET TIME ZONE and get the
same behavior, unlike what used to happen.

We did not previously mention the angle-bracket syntax when describing
POSIX timezone specifications; add some documentation so that people
can figure out what these strings do.  (There's still quite a lot of
undocumented functionality there, but anybody who really cares can
go read the POSIX spec to find out about it.  In practice most people
seem to prefer Olsen-style city names anyway.)
2013-11-01 13:57:31 -04:00
Tom Lane
631dc390f4 Fix some odd behaviors when using a SQL-style simple GMT offset timezone.
Formerly, when using a SQL-spec timezone setting with a fixed GMT offset
(called a "brute force" timezone in the code), the session_timezone
variable was not updated to match the nominal timezone; rather, all code
was expected to ignore session_timezone if HasCTZSet was true.  This is
of course obviously fragile, though a search of the code finds only
timeofday() failing to honor the rule.  A bigger problem was that
DetermineTimeZoneOffset() supposed that if its pg_tz parameter was
pointer-equal to session_timezone, then HasCTZSet should override the
parameter.  This would cause datetime input containing an explicit zone
name to be treated as referencing the brute-force zone instead, if the
zone name happened to match the session timezone that had prevailed
before installing the brute-force zone setting (as reported in bug #8572).
The same malady could affect AT TIME ZONE operators.

To fix, set up session_timezone so that it matches the brute-force zone
specification, which we can do using the POSIX timezone definition syntax
"<abbrev>offset", and get rid of the bogus lookaside check in
DetermineTimeZoneOffset().  Aside from fixing the erroneous behavior in
datetime parsing and AT TIME ZONE, this will cause the timeofday() function
to print its result in the user-requested time zone rather than some
previously-set zone.  It might also affect results in third-party
extensions, if there are any that make use of session_timezone without
considering HasCTZSet, but in all cases the new behavior should be saner
than before.

Back-patch to all supported branches.
2013-11-01 12:13:18 -04:00
Tom Lane
6756c8ad30 Fix old typo in comment.
NFAs have children, but their individual states don't.
2013-10-29 15:34:18 -04:00
Robert Haas
d2aecaea15 Modify dynamic shared memory code to use Size rather than uint64.
This is more consistent with what we do elsewhere.
2013-10-28 12:12:06 -04:00
Noah Misch
c50b7c09d8 Add large object functions catering to SQL callers.
With these, one need no longer manipulate large object descriptors and
extract numeric constants from header files in order to read and write
large object contents from SQL.

Pavel Stehule, reviewed by Rushabh Lathia.
2013-10-27 22:56:54 -04:00
Tom Lane
3147acd63e Use improved vsnprintf calling logic in more places.
When we are using a C99-compliant vsnprintf implementation (which should be
most places, these days) it is worth the trouble to make use of its report
of how large the buffer needs to be to succeed.  This patch adjusts
stringinfo.c and some miscellaneous usages in pg_dump to do that, relying
on the logic recently added in libpgcommon's psprintf.c.  Since these
places want to know the number of bytes written once we succeed, modify the
API of pvsnprintf() to report that.

There remains near-duplicate logic in pqexpbuffer.c, but since that code
is in libpq, psprintf.c's approach of exit()-on-error isn't appropriate
for use there.  Also note that I didn't bother touching the multitude
of places that call (v)snprintf without any attempt to provide a resizable
buffer.

Release-note-worthy incompatibility: the API of appendStringInfoVA()
changed.  If there's any third-party code that's calling that directly,
it will need tweaking along the same lines as in this patch.

David Rowley and Tom Lane
2013-10-24 21:43:57 -04:00
Tom Lane
2c66f9924c Replace pg_asprintf() with psprintf().
This eliminates an awkward coding pattern that's also unnecessarily
inconsistent with backend coding.  psprintf() is now the thing to
use everywhere.
2013-10-22 19:40:26 -04:00
Tom Lane
09a89cb5fc Get rid of use of asprintf() in favor of a more portable implementation.
asprintf(), aside from not being particularly portable, has a fundamentally
badly-designed API; the psprintf() function that was added in passing in
the previous patch has a much better API choice.  Moreover, the NetBSD
implementation that was borrowed for the previous patch doesn't work with
non-C99-compliant vsnprintf, which is something we still have to cope with
on some platforms; and it depends on va_copy which isn't all that portable
either.  Get rid of that code in favor of an implementation similar to what
we've used for many years in stringinfo.c.  Also, move it into libpgcommon
since it's not really libpgport material.

I think this patch will be enough to turn the buildfarm green again, but
there's still cosmetic work left to do, namely get rid of pg_asprintf()
in favor of using psprintf().  That will come in a followon patch.
2013-10-22 18:42:13 -04:00
Noah Misch
709170b790 Consistently use unsigned arithmetic for alignment calculations.
This avoids an assumption about the signed number representation.  It is
anticipated to have no functional changes on supported configurations;
many two's complement assumptions remain elsewhere.

Per a suggestion from Andres Freund.
2013-10-20 21:04:52 -04:00
Robert Haas
cab5dc5daf Allow only some columns of a view to be auto-updateable.
Previously, unless all columns were auto-updateable, we wouldn't
inserts, updates, or deletes, or at least not without a rule or trigger;
now, we'll allow inserts and updates that target only the auto-updateable
columns, and deletes even if there are no auto-updateable columns at
all provided the view definition is otherwise suitable.

Dean Rasheed, reviewed by Marko Tiikkaja
2013-10-18 10:35:36 -04:00
Robert Haas
523beaa11b Provide a reliable mechanism for terminating a background worker.
Although previously-introduced APIs allow the process that registers a
background worker to obtain the worker's PID, there's no way to prevent
a worker that is not currently running from being restarted.  This
patch introduces a new API TerminateBackgroundWorker() that prevents
the background worker from being restarted, terminates it if it is
currently running, and causes it to be unregistered if or when it is
not running.

Patch by me.  Review by Michael Paquier and KaiGai Kohei.
2013-10-18 10:23:11 -04:00
Peter Eisentraut
c2316dcda1 Fix for lack of va_copy() on certain Windows versions
Based-on-patch-by: David Rowley <dgrowleyml@gmail.com>
2013-10-18 09:54:50 -04:00
Robert Haas
ea91a6be89 Remove IRIX port.
Development of IRIX has been discontinued, and support is scheduled
to end in December of 2013.  Therefore, there will be no supported
versions of this operating system by the time PostgreSQL 9.4 is
released.  Furthermore, we have no maintainer for this platform.
2013-10-18 08:14:21 -04:00
Robert Haas
81051a86bc Remove spinlock support for SINIX, Sun3, and NS32K.
All of these platforms are very much obsolete.

As far as I can determine, the last version of SINIX, later renamed
Reliant, occurred some time between 2002 and 2005.

The last release of SunOS that would run on a sun3 was released in
November of 1991; the last release of OpenBSD which supported that
platform was in 2001.  The highest clock speed of any processor in
the family was 25MHz.

The NS32K (national semiconductor 320xx) architecture was retired
in 1990.

Support can be re-added if a maintainer emerges for any of these
platforms, but it seems unlikely.

Reviewed by Andres Freund.
2013-10-17 12:02:05 -04:00
Peter Eisentraut
5b6d08cd29 Add use of asprintf()
Add asprintf(), pg_asprintf(), and psprintf() to simplify string
allocation and composition.  Replacement implementations taken from
NetBSD.

Reviewed-by: Álvaro Herrera <alvherre@2ndquadrant.com>
Reviewed-by: Asif Naeem <anaeem.it@gmail.com>
2013-10-13 00:09:18 -04:00
Alvaro Herrera
31cf1a1a43 Rework SSL renegotiation code
The existing renegotiation code was home for several bugs: it might
erroneously report that renegotiation had failed; it might try to
execute another renegotiation while the previous one was pending; it
failed to terminate the connection if the renegotiation never actually
took place; if a renegotiation was started, the byte count was reset,
even if the renegotiation wasn't completed (this isn't good from a
security perspective because it means continuing to use a session that
should be considered compromised due to volume of data transferred.)

The new code is structured to avoid these pitfalls: renegotiation is
started a little earlier than the limit has expired; the handshake
sequence is retried until it has actually returned successfully, and no
more than that, but if it fails too many times, the connection is
closed.  The byte count is reset only when the renegotiation has
succeeded, and if the renegotiation byte count limit expires, the
connection is terminated.

This commit only touches the master branch, because some of the changes
are controversial.  If everything goes well, a back-patch might be
considered.

Per discussion started by message
20130710212017.GB4941@eldon.alvh.no-ip.org
2013-10-10 23:45:20 -03:00
Peter Eisentraut
5dd41f3574 Remove maintainer-check target, fold into normal build
make maintainer-check was obscure and rarely called in practice, and
many breakages were missed.  Fold everything that make maintainer-check
used to do into the normal build.  Specifically:

- Call duplicate_oids when genbki.pl is called.

- Check for tabs in SGML files when the documentation is built.

- Run msgfmt with the -c option during the regular build.  Add an
  additional configure check to see whether we are using the GNU
  version.  (make maintainer-check probably used to fail with non-GNU
  msgfmt.)

Keep maintainer-check as around as phony target for the time being in
case anyone is calling it.  But it won't do anything anymore.
2013-10-10 20:11:56 -04:00
Peter Eisentraut
3dc543b3d8 Replace duplicate_oids with Perl implementation
It is more portable, more robust, and more readable.

From: Andrew Dunstan <andrew@dunslane.net>
2013-10-10 20:09:42 -04:00
Andrew Dunstan
4d212bac17 json_typeof function.
Andrew Tipton.
2013-10-10 12:21:59 -04:00
Peter Eisentraut
261c7d4b65 Revive line type
Change the input/output format to {A,B,C}, to match the internal
representation.

Complete the implementations of line_in, line_out, line_recv, line_send.
Remove comments and error messages about the line type not being
implemented.  Add regression tests for existing line operators and
functions.

Reviewed-by: rui hua <365507506hua@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@2ndquadrant.com>
Reviewed-by: Jeevan Chalke <jeevan.chalke@enterprisedb.com>
2013-10-09 22:34:38 -04:00
Robert Haas
0ac5e5a7e1 Allow dynamic allocation of shared memory segments.
Patch by myself and Amit Kapila.  Design help from Noah Misch.  Review
by Andres Freund.
2013-10-09 21:05:02 -04:00
Kevin Grittner
f566515192 Add record_image_ops opclass for matview concurrent refresh.
REFRESH MATERIALIZED VIEW CONCURRENTLY was broken for any matview
containing a column of a type without a default btree operator
class.  It also did not produce results consistent with a non-
concurrent REFRESH or a normal view if any column was of a type
which allowed user-visible differences between values which
compared as equal according to the type's default btree opclass.
Concurrent matview refresh was modified to use the new operators
to solve these problems.

Documentation was added for record comparison, both for the
default btree operator class for record, and the newly added
operators.  Regression tests now check for proper behavior both
for a matview with a box column and a matview containing a citext
column.

Reviewed by Steve Singer, who suggested some of the doc language.
2013-10-09 14:26:09 -05:00
Bruce Momjian
ee1e5662d8 Auto-tune effective_cache size to be 4x shared buffers 2013-10-08 12:12:24 -04:00
Heikki Linnakangas
5962519b36 TYPEALIGN doesn't work on int64 on 32-bit platforms.
The TYPEALIGN macro, and the related ones like MAXALIGN, don't work with
values larger than intptr_t, because TYPEALIGN casts the argument to
intptr_t to do the arithmetic. That's not a problem when dealing with
pointers or lengths or offsets related to pointers, but the XLogInsert
scaling patch added a call to MAXALIGN with an XLogRecPtr argument.

To fix, add wider variants of the macros, called TYPEALIGN64 and MAXALIGN64,
which are just like the existing variants but work with uint64 instead of
intptr_t.

Report and patch by David Rowley, analysis by Andres Freund.
2013-10-08 01:59:57 +03:00
Kevin Grittner
c01262a824 Eliminate xmin from hash tag for predicate locks on heap tuples.
If a tuple was frozen while its predicate locks mattered,
read-write dependencies could be missed, resulting in failure to
detect conflicts which could lead to anomalies in committed
serializable transactions.

This field was added to the tag when we still thought that it was
necessary to carry locks forward to a new version of an updated
row.  That was later proven to be unnecessary, which allowed
simplification of the code, but elimination of xmin from the tag
was missed at the time.

Per report and analysis by Heikki Linnakangas.
Backpatch to 9.1.
2013-10-07 14:16:54 -05:00
Bruce Momjian
a54141aebc Issue error on SET outside transaction block in some cases
Issue error for SET LOCAL/CONSTRAINTS/TRANSACTION outside a transaction
block, as they have no effect.

Per suggestion from Morten Hustveit
2013-10-04 13:50:28 -04:00
Robert Haas
d90ced8bb2 Add DISCARD SEQUENCES command.
DISCARD ALL will now discard cached sequence information, as well.

Fabrízio de Royes Mello, reviewed by Zoltán Böszörményi, with some
further tweaks by me.
2013-10-03 16:23:31 -04:00
Heikki Linnakangas
c2b175b948 Minor GIN code refactoring.
It makes for cleaner code to have separate Get/Add functions for PostingItems
and ItemPointers.  A few callsites that have to deal with both types need to
be duplicated because of this, but all the callers have to know which one
they're dealing with anyway. Overall, this reduces the amount of casting
required.

Extracted from Alexander Korotkov's larger patch to change the data page
format.
2013-10-03 11:51:31 +03:00
Alvaro Herrera
15732b34e8 Add WaitForLockers in lmgr, refactoring index.c code
This is in support of a future REINDEX CONCURRENTLY feature.

Michael Paquier
2013-10-01 17:57:01 -03:00
Alvaro Herrera
1247ea28cb Remove proc argument from LockCheckConflicts
This has been unused since commit 8563ccae2c.

Noted by Antonin Houska
2013-09-16 22:14:14 -03:00
Alvaro Herrera
dd778e9d88 Rename various "freeze multixact" variables
It seems to make more sense to use "cutoff multixact" terminology
throughout the backend code; "freeze" is associated with replacing of an
Xid with FrozenTransactionId, which is not what we do for MultiXactIds.

Andres Freund
Some adjustments by Álvaro Herrera
2013-09-16 15:47:31 -03:00
Bruce Momjian
0f59f4a645 Add comment for VARSIZE_ANY_EXHDR macro
Gurjeet Singh
2013-09-10 20:18:53 -04:00
Fujii Masao
71129b6fc5 Remove leftover function prototype.
The prototype for inval_twophase_postcommit wasn't removed when it's definition
was removed in efc16ea520 / the initial HS commit.

Andres Freund
2013-09-11 01:32:24 +09:00
Robert Haas
71901ab6da Introduce InvalidCommandId.
This allows a 32-bit field to represent an *optional* command ID
without a separate flag bit.

Andres Freund
2013-09-09 16:25:29 -04:00
Kevin Grittner
277607d600 Eliminate pg_rewrite.ev_attr column and related dead code.
Commit 95ef6a3448 removed the
ability to create rules on an individual column as of 7.3, but
left some residual code which has since been useless.  This cleans
up that dead code without any change in behavior other than
dropping the useless column from the catalog.
2013-09-05 14:03:43 -05:00
Heikki Linnakangas
20cb18db46 Make catalog cache hash tables resizeable.
If the hash table backing a catalog cache becomes too full (fillfactor > 2),
enlarge it. A new buckets array, double the size of the old, is allocated,
and all entries in the old hash are moved to the right bucket in the new
hash.

This has two benefits. First, cache lookups don't get so expensive when
there are lots of entries in a cache, like if you access hundreds of
thousands of tables. Second, we can make the (initial) sizes of the caches
much smaller, which saves memory.

This patch dials down the initial sizes of the catcaches. The new sizes are
chosen so that a backend that only runs a few basic queries still won't need
to enlarge any of them.
2013-09-05 20:20:03 +03:00
Jeff Davis
b1892aaeaa Revert WAL posix_fallocate() patches.
This reverts commit 269e780822
and commit 5b571bb8c8.

Unfortunately, the initial patch had insufficient performance testing,
and resulted in a regression.

Per report by Thom Brown.
2013-09-04 23:43:41 -07:00
Heikki Linnakangas
375d8526f2 Keep heavily-contended fields in XLogCtlInsert on different cache lines.
Performance testing shows that if the insertpos_lck spinlock and the fields
that it protects are on the same cache line with other variables that are
frequently accessed, the false sharing can hurt performance a lot. Keep
them apart by adding some padding.
2013-09-04 23:14:33 +03:00
Robert Haas
cc52d5b33f Expose fsync_fname as a public API.
Andres Freund
2013-09-04 11:15:00 -04:00
Tom Lane
0c66a22377 Update comments concerning PGC_S_TEST.
This GUC context value was once only used by ALTER DATABASE SET and
ALTER USER SET.  That's not true anymore, though, so rewrite the
comments to be a bit more general.

Patch in HEAD only, since this is just an internal documentation issue.
2013-09-03 18:56:22 -04:00
Tom Lane
0d3f4406df Allow aggregate functions to be VARIADIC.
There's no inherent reason why an aggregate function can't be variadic
(even VARIADIC ANY) if its transition function can handle the case.
Indeed, this patch to add the feature touches none of the planner or
executor, and little of the parser; the main missing stuff was DDL and
pg_dump support.

It is true that variadic aggregates can create the same sort of ambiguity
about parameters versus ORDER BY keys that was complained of when we
(briefly) had both one- and two-argument forms of string_agg().  However,
the policy formed in response to that discussion only said that we'd not
create any built-in aggregates with varying numbers of arguments, not that
we shouldn't allow users to do it.  So the logical extension of that is
we can allow users to make variadic aggregates as long as we're wary about
shipping any such in core.

In passing, this patch allows aggregate function arguments to be named, to
the extent of remembering the names in pg_proc and dumping them in pg_dump.
You can't yet call an aggregate using named-parameter notation.  That seems
like a likely future extension, but it'll take some work, and it's not what
this patch is really about.  Likewise, there's still some work needed to
make window functions handle VARIADIC fully, but I left that for another
day.

initdb forced because of new aggvariadic field in Aggref parse nodes.
2013-09-03 17:08:46 -04:00
Alvaro Herrera
8b290f3115 Update obsolete comment 2013-09-03 16:53:16 -04:00
Tom Lane
8e2b71d2d0 Reset the binary heap in MergeAppend rescans.
Failing to do so can cause queries to return wrong data, error out or crash.
This requires adding a new binaryheap_reset() method to binaryheap.c,
but that probably should have been there anyway.

Per bug #8410 from Terje Elde.  Diagnosis and patch by Andres Freund.
2013-08-30 19:15:21 -04:00
Heikki Linnakangas
b03d196be0 Use a non-locking initial test in TAS_SPIN on x86_64.
Testing done in 2011 by Tom Lane concluded that this is a win on Intel Xeons
and AMD Opterons, but it was not changed back then, because of an old
comment in tas() that suggested that it's a huge loss on older Opterons.
However, didn't have separate TAS() and TAS_SPIN() macros back then, so the
comment referred to doing a non-locked initial test even on the first
access, in uncontended case. I don't have access to older Opterons, but I'm
pretty sure that doing an initial unlocked test is unlikely to be a loss
while spinning, even though it might be for the first access.

We probably should do the same on 32-bit x86, but I'm afraid of changing it
without any testing. Hence just add a note to the x86 implementation
suggesting that we probably should do the same there.
2013-08-29 14:04:37 +03:00
Robert Haas
090d0f2050 Allow discovery of whether a dynamic background worker is running.
Using the infrastructure provided by this patch, it's possible either
to wait for the startup of a dynamically-registered background worker,
or to poll the status of such a worker without waiting.  In either
case, the current PID of the worker process can also be obtained.
As usual, worker_spi is updated to demonstrate the new functionality.

Patch by me.  Review by Andres Freund.
2013-08-28 14:08:13 -04:00
Tom Lane
fcf9ecad57 In locate_grouping_columns(), don't expect an exact match of Var typmods.
It's possible that inlining of SQL functions (or perhaps other changes?)
has exposed typmod information not known at parse time.  In such cases,
Vars generated by query_planner might have valid typmod values while the
original grouping columns only have typmod -1.  This isn't a semantic
problem since the behavior of grouping only depends on type not typmod,
but it breaks locate_grouping_columns' use of tlist_member to locate the
matching entry in query_planner's result tlist.

We can fix this without an excessive amount of new code or complexity by
relying on the fact that locate_grouping_columns only gets called when
make_subplanTargetList has set need_tlist_eval == false, and that can only
happen if all the grouping columns are simple Vars.  Therefore we only need
to search the sub_tlist for a matching Var, and we can reasonably define a
"match" as being a match of the Var identity fields
varno/varattno/varlevelsup.  The code still Asserts that vartype matches,
but ignores vartypmod.

Per bug #8393 from Evan Martin.  The added regression test case is
basically the same as his example.  This has been broken for a very long
time, so back-patch to all supported branches.
2013-08-23 17:30:53 -04:00
Andrew Dunstan
73838b5251 Unconditionally use the WSA equivalents of Socket error constants.
This change will only apply to mingw compilers, and has been found
necessary by late versions of the mingw-w64 compiler. It's the same as
what is done elsewhere for the Microsoft compilers.

If this doesn't upset older compilers in the buildfarm, it will be
backpatched to 9.1.

Problem reported by Michael Cronenworth, although not his patch.
2013-08-20 14:11:36 -04:00
Alvaro Herrera
78e1220104 Fix pg_upgrade failure from servers older than 9.3
When upgrading from servers of versions 9.2 and older, and MultiXactIds
have been used in the old server beyond the first page (that is, 2048
multis or more in the default 8kB-page build), pg_upgrade would set the
next multixact offset to use beyond what has been allocated in the new
cluster.  This would cause a failure the first time the new cluster
needs to use this value, because the pg_multixact/offsets/ file wouldn't
exist or wouldn't be large enough.  To fix, ensure that the transient
server instances launched by pg_upgrade extend the file as necessary.

Per report from Jesse Denardo in
CANiVXAj4c88YqipsyFQPboqMudnjcNTdB3pqe8ReXqAFQ=HXyA@mail.gmail.com
2013-08-19 12:56:18 -04:00
Tom Lane
9e7e29c75a Fix planner problems with LATERAL references in PlaceHolderVars.
The planner largely failed to consider the possibility that a
PlaceHolderVar's expression might contain a lateral reference to a Var
coming from somewhere outside the PHV's syntactic scope.  We had a previous
report of a problem in this area, which I tried to fix in a quick-hack way
in commit 4da6439bd8, but Antonin Houska
pointed out that there were still some problems, and investigation turned
up other issues.  This patch largely reverts that commit in favor of a more
thoroughly thought-through solution.  The new theory is that a PHV's
ph_eval_at level cannot be higher than its original syntactic level.  If it
contains lateral references, those don't change the ph_eval_at level, but
rather they create a lateral-reference requirement for the ph_eval_at join
relation.  The code in joinpath.c needs to handle that.

Another issue is that createplan.c wasn't handling nested PlaceHolderVars
properly.

In passing, push knowledge of lateral-reference checks for join clauses
into join_clause_is_movable_to.  This is mainly so that FDWs don't need
to deal with it.

This patch doesn't fix the original join-qual-placement problem reported by
Jeremy Evans (and indeed, one of the new regression test cases shows the
wrong answer because of that).  But the PlaceHolderVar problems need to be
fixed before that issue can be addressed, so committing this separately
seems reasonable.
2013-08-17 20:22:37 -04:00
Robert Haas
2dee7998f9 Move more bgworker code to bgworker.c; also, some renaming.
Per discussion on pgsql-hackers.

Michael Paquier, slightly modified by me.  Original suggestion
from Amit Kapila.
2013-08-16 15:31:28 -04:00
Tom Lane
1b1d3d92c3 Remove ph_may_need from PlaceHolderInfo, with attendant simplifications.
The planner logic that attempted to make a preliminary estimate of the
ph_needed levels for PlaceHolderVars seems to be completely broken by
lateral references.  Fortunately, the potential join order optimization
that this code supported seems to be of relatively little value in
practice; so let's just get rid of it rather than trying to fix it.

Getting rid of this allows fairly substantial simplifications in
placeholder.c, too, so planning in such cases should be a bit faster.

Issue noted while pursuing bugs reported by Jeremy Evans and Antonin
Houska, though this doesn't in itself fix either of their reported cases.
What this does do is prevent an Assert crash in the kind of query
illustrated by the added regression test.  (I'm not sure that the plan for
that query is stable enough across platforms to be usable as a regression
test output ... but we'll soon find out from the buildfarm.)

Back-patch to 9.3.  The problem case can't arise without LATERAL, so
no need to touch older branches.
2013-08-14 18:38:47 -04:00
Tom Lane
3d5282c6f0 Emit a log message if output is about to be redirected away from stderr.
We've seen multiple cases of people looking at the postmaster's original
stderr output to try to diagnose problems, not realizing/remembering that
their logging configuration is set up to send log messages somewhere else.
This seems particularly likely to happen in prepackaged distributions,
since many packagers patch the code to change the factory-standard logging
configuration to something more in line with their platform conventions.

In hopes of reducing confusion, emit a LOG message about this at the point
in startup where we are about to switch log output away from the original
stderr, providing a pointer to where to look instead.  This message will
appear as the last thing in the original stderr output.  (We might later
also try to emit such link messages when logging parameters are changed
on-the-fly; but that case seems to be both noticeably harder to do nicely,
and much less frequently a problem in practice.)

Per discussion, back-patch to 9.3 but not further.
2013-08-13 15:24:52 -04:00
Tom Lane
3ced8837db Simplify query_planner's API by having it return the top-level RelOptInfo.
Formerly, query_planner returned one or possibly two Paths for the topmost
join relation, so that grouping_planner didn't see the join RelOptInfo
(at least not directly; it didn't have any hesitation about examining
cheapest_path->parent, though).  However, correct selection of the Paths
involved a significant amount of coupling between query_planner and
grouping_planner, a problem which has gotten worse over time.  It seems
best to give up on this API choice and instead return the topmost
RelOptInfo explicitly.  Then grouping_planner can pull out the Paths it
wants from the rel's path list.  In this way we can remove all knowledge
of grouping behaviors from query_planner.

The only real benefit of the old way is that in the case of an empty
FROM clause, we never made any RelOptInfos at all, just a Path.  Now
we have to gin up a dummy RelOptInfo to represent the empty FROM clause.
That's not a very big deal though.

While at it, simplify query_planner's API a bit more by having the caller
set up root->tuple_fraction and root->limit_tuples, rather than passing
those values as separate parameters.  Since query_planner no longer does
anything with either value, requiring it to fill the PlannerInfo fields
seemed pretty arbitrary.

This patch just rearranges code; it doesn't (intentionally) change any
behaviors.  Followup patches will do more interesting things.
2013-08-05 15:01:09 -04:00
Alvaro Herrera
88c556680c Fix crash in error report of invalid tuple lock
My tweak of these error messages in commit c359a1b082 contained the
thinko that a query would always have rowMarks set for a query
containing a locking clause.  Not so: when declaring a cursor, for
instance, rowMarks isn't set at the point we're checking, so we'd be
dereferencing a NULL pointer.

The fix is to pass the lock strength to the function raising the error,
instead of trying to reverse-engineer it.  The result not only is more
robust, but it also seems cleaner overall.

Per report from Robert Haas.
2013-08-02 13:18:37 -04:00
Robert Haas
149e38e5ee Assorted bgworker-related comment fixes.
Per gripes by Amit Kapila.
2013-08-01 12:20:31 -04:00
Robert Haas
813fb03155 Remove SnapshotNow and HeapTupleSatisfiesNow.
We now use MVCC catalog scans, and, per discussion, have eliminated
all other remaining uses of SnapshotNow, so that we can now get rid of
it.  This will break third-party code which is still using it, which
is intentional, as we want such code to be updated to do things the
new way.
2013-08-01 10:46:19 -04:00
Stephen Frost
ddef1a39c6 Allow a context to be passed in for error handling
As pointed out by Tom Lane, we can allow other users of the error
handler callbacks to provide their own memory context by adding
the context to use to ErrorData and using that instead of explicitly
using ErrorContext.

This then allows GetErrorContextStack() to be called from inside
exception handlers, so modify plpgsql to take advantage of that and
add an associated regression test for it.
2013-08-01 01:07:20 -04:00
Alvaro Herrera
3142cf6dd5 Fix a couple of inconsequential typos in new header 2013-07-31 17:57:00 -04:00
Greg Stark
c62736cc37 Add SQL Standard WITH ORDINALITY support for UNNEST (and any other SRF)
Author: Andrew Gierth, David Fetter
Reviewers: Dean Rasheed, Jeevan Chalke, Stephen Frost
2013-07-29 16:38:01 +01:00
Tom Lane
3d13623d75 Prevent leakage of SPI tuple tables during subtransaction abort.
plpgsql often just remembers SPI-result tuple tables in local variables,
and has no mechanism for freeing them if an ereport(ERROR) causes an escape
out of the execution function whose local variable it is.  In the original
coding, that wasn't a problem because the tuple table would be cleaned up
when the function's SPI context went away during transaction abort.
However, once plpgsql grew the ability to trap exceptions, repeated
trapping of errors within a function could result in significant
intra-function-call memory leakage, as illustrated in bug #8279 from
Chad Wagner.

We could fix this locally in plpgsql with a bunch of PG_TRY/PG_CATCH
coding, but that would be tedious, probably slow, and prone to bugs of
omission; moreover it would do nothing for similar risks elsewhere.
What seems like a better plan is to make SPI itself responsible for
freeing tuple tables at subtransaction abort.  This patch attacks the
problem that way, keeping a list of live tuple tables within each SPI
function context.  Currently, such freeing is automatic for tuple tables
made within the failed subtransaction.  We might later add a SPI call to
mark a tuple table as not to be freed this way, allowing callers to opt
out; but until someone exhibits a clear use-case for such behavior, it
doesn't seem worth bothering.

A very useful side-effect of this change is that SPI_freetuptable() can
now defend itself against bad calls, such as duplicate free requests;
this should make things more robust in many places.  (In particular,
this reduces the risks involved if a third-party extension contains
now-redundant SPI_freetuptable() calls in error cleanup code.)

Even though the leakage problem is of long standing, it seems imprudent
to back-patch this into stable branches, since it does represent an API
semantics change for SPI users.  We'll patch this in 9.3, but live with
the leakage in older branches.
2013-07-25 16:46:14 -04:00
Stephen Frost
8312832567 Add GET DIAGNOSTICS ... PG_CONTEXT in PL/PgSQL
This adds the ability to get the call stack as a string from within a
PL/PgSQL function, which can be handy for logging to a table, or to
include in a useful message to an end-user.

Pavel Stehule, reviewed by Rushabh Lathia and rather heavily whacked
around by Stephen Frost.
2013-07-24 18:53:27 -04:00
Tom Lane
fa2fad3c06 Improve ilist.h's support for deletion of slist elements during iteration.
Previously one had to use slist_delete(), implying an additional scan of
the list, making this infrastructure considerably less efficient than
traditional Lists when deletion of element(s) in a long list is needed.
Modify the slist_foreach_modify() macro to support deleting the current
element in O(1) time, by keeping a "prev" pointer in addition to "cur"
and "next".  Although this makes iteration with this macro a bit slower,
no real harm is done, since in any scenario where you're not going to
delete the current list element you might as well just use slist_foreach
instead.  Improve the comments about when to use each macro.

Back-patch to 9.3 so that we'll have consistent semantics in all branches
that provide ilist.h.  Note this is an ABI break for callers of
slist_foreach_modify().

Andres Freund and Tom Lane
2013-07-24 17:42:34 -04:00
Tom Lane
10a509d829 Move strip_implicit_coercions() from optimizer to nodeFuncs.c.
Use of this function has spread into the parser and rewriter, so it seems
like time to pull it out of the optimizer and put it into the more central
nodeFuncs module.  This eliminates the need to #include optimizer/clauses.h
in most of the calling files, demonstrating that this function was indeed a
bit outside the normal code reference patterns.
2013-07-23 18:21:19 -04:00
Tom Lane
a7cd853b75 Change post-rewriter representation of dropped columns in joinaliasvars.
It's possible to drop a column from an input table of a JOIN clause in a
view, if that column is nowhere actually referenced in the view.  But it
will still be there in the JOIN clause's joinaliasvars list.  We used to
replace such entries with NULL Const nodes, which is handy for generation
of RowExpr expansion of a whole-row reference to the view.  The trouble
with that is that it can't be distinguished from the situation after
subquery pull-up of a constant subquery output expression below the JOIN.
Instead, replace such joinaliasvars with null pointers (empty expression
trees), which can't be confused with pulled-up expressions.  expandRTE()
still emits the old convention, though, for convenience of RowExpr
generation and to reduce the risk of breaking extension code.

In HEAD and 9.3, this patch also fixes a problem with some new code in
ruleutils.c that was failing to cope with implicitly-casted joinaliasvars
entries, as per recent report from Feike Steenbergen.  That oversight was
because of an inadequate description of the data structure in parsenodes.h,
which I've now corrected.  There were some pre-existing oversights of the
same ilk elsewhere, which I believe are now all fixed.
2013-07-23 16:23:45 -04:00
Alvaro Herrera
c359a1b082 Tweak FOR UPDATE/SHARE error message wording (again)
In commit 0ac5ad5134 I changed some error messages from "FOR
UPDATE/SHARE" to a rather long gobbledygook which nobody liked.  Then,
in commit cb9b66d31 I changed them again, but the alternative chosen
there was deemed suboptimal by Peter Eisentraut, who in message
1373937980.20441.8.camel@vanquo.pezone.net proposed an alternative
involving a dynamically-constructed string based on the actual locking
strength specified in the SQL command.  This patch implements that
suggestion.
2013-07-23 14:03:09 -04:00
Robert Haas
f40a318eea Remove bgw_sighup and bgw_sigterm.
Per discussion on pgsql-hackers, these aren't really needed.  Interim
versions of the background worker patch had the worker starting with
signals already unblocked, which would have made this necessary.
But the final version does not, so we don't really need it; and it
doesn't work well with the new facility for starting dynamic background
workers, so just rip it out.

Also per discussion on pgsql-hackers, back-patch this change to 9.3.
It's best to get the API break out of the way before we do an
official release of this facility, to avoid more pain for extension
authors later.
2013-07-22 14:13:00 -04:00
Robert Haas
0518eceec3 Adjust HeapTupleSatisfies* routines to take a HeapTuple.
Previously, these functions took a HeapTupleHeader, but upcoming
patches for logical replication will introduce new a new snapshot
type under which the tuple's TID will be used to lookup (CMIN, CMAX)
for visibility determination purposes.  This makes that information
available.  Code churn is minimal since HeapTupleSatisfiesVisibility
took the HeapTuple anyway, and deferenced it before calling the
satisfies function.

Independently of logical replication, this allows t_tableOid and
t_self to be cross-checked via assertions in tqual.c.  This seems
like a useful way to make sure that all callers are setting these
values properly, which has been previously put forward as
desirable.

Andres Freund, reviewed by Álvaro Herrera
2013-07-22 13:38:44 -04:00
Robert Haas
f01d1ae3a1 Add infrastructure for mapping relfilenodes to relation OIDs.
Future patches are expected to introduce logical replication that
works by decoding WAL.  WAL contains relfilenodes rather than relation
OIDs, so this infrastructure will be needed to find the relation OID
based on WAL contents.

If logical replication does not make it into this release, we probably
should consider reverting this, since it will add some overhead to DDL
operations that create new relations.  One additional index insert per
pg_class row is not a large overhead, but it's more than zero.
Another way of meeting the needs of logical replication would be to
the relation OID to WAL, but that would burden DML operations, not
only DDL.

Andres Freund, with some changes by me.  Design review, in earlier
versions, by Álvaro Herrera.
2013-07-22 11:09:10 -04:00
Peter Eisentraut
ff41a5de09 Clean up new JSON API typedefs
The new JSON API uses a bit of an unusual typedef scheme, where for
example OkeysState is a pointer to okeysState.  And that's not applied
consistently either.  Change that to the more usual PostgreSQL style
where struct typedefs are upper case, and use pointers explicitly.
2013-07-20 06:38:31 -04:00
Stephen Frost
4cbe3ac3e8 WITH CHECK OPTION support for auto-updatable VIEWs
For simple views which are automatically updatable, this patch allows
the user to specify what level of checking should be done on records
being inserted or updated.  For 'LOCAL CHECK', new tuples are validated
against the conditionals of the view they are being inserted into, while
for 'CASCADED CHECK' the new tuples are validated against the
conditionals for all views involved (from the top down).

This option is part of the SQL specification.

Dean Rasheed, reviewed by Pavel Stehule
2013-07-18 17:10:16 -04:00
Andrew Dunstan
d26888bc4d Move checking an explicit VARIADIC "any" argument into the parser.
This is more efficient and simpler . It does mean that an untyped NULL
can no longer be used in such cases, which should be mentioned in
Release Notes, but doesn't seem a terrible loss. The workaround is to
cast the NULL to some array type.

Pavel Stehule, reviewed by Jeevan Chalke.
2013-07-18 11:52:12 -04:00
Tom Lane
89779bf2c8 Fix a few problems in barrier.h.
On HPPA, implement pg_memory_barrier() as pg_compiler_barrier(), which
should be correct since this arch doesn't do memory access reordering,
and is anyway better than the completely-nonfunctional-on-this-arch
dummy_spinlock code.  (But note this patch only fixes things for gcc,
not for builds with HP's compiler.)

Also, fix incorrect default definition of pg_memory_barrier as a macro
requiring an argument.

Also, fix incorrect spelling of "#elif" as "#else if" in icc code path
(spotted by pgindent).

This doesn't come close to fixing all of the functional and stylistic
deficiencies in barrier.h, but at least it un-breaks my personal build.
Now that we're actually using barriers in the code, this file is going
to need some serious attention.
2013-07-17 18:38:20 -04:00
Noah Misch
b560ec1b0d Implement the FILTER clause for aggregate function calls.
This is SQL-standard with a few extensions, namely support for
subqueries and outer references in clause expressions.

catversion bump due to change in Aggref and WindowFunc.

David Fetter, reviewed by Dean Rasheed.
2013-07-16 20:15:36 -04:00
Kevin Grittner
cc1965a99b Add support for REFRESH MATERIALIZED VIEW CONCURRENTLY.
This allows reads to continue without any blocking while a REFRESH
runs.  The new data appears atomically as part of transaction
commit.

Review questioned the Assert that a matview was not a system
relation.  This will be addressed separately.

Reviewed by Hitoshi Harada, Robert Haas, Andres Freund.
Merged after review with security patch f3ab5d4.
2013-07-16 12:55:44 -05:00
Robert Haas
7f7485a0cd Allow background workers to be started dynamically.
There is a new API, RegisterDynamicBackgroundWorker, which allows
an ordinary user backend to register a new background writer during
normal running.  This means that it's no longer necessary for all
background workers to be registered during processing of
shared_preload_libraries, although the option of registering workers
at that time remains available.

When a background worker exits and will not be restarted, the
slot previously used by that background worker is automatically
released and becomes available for reuse.  Slots used by background
workers that are configured for automatic restart can't (yet) be
released without shutting down the system.

This commit adds a new source file, bgworker.c, and moves some
of the existing control logic for background workers there.
Previously, there was little enough logic that it made sense to
keep everything in postmaster.c, but not any more.

This commit also makes the worker_spi contrib module into an
extension and adds a new function, worker_spi_launch, which can
be used to demonstrate the new facility.
2013-07-16 13:02:15 -04:00
Peter Eisentraut
070518ddab Add session_preload_libraries configuration parameter
This is like shared_preload_libraries except that it takes effect at
backend start and can be changed without a full postmaster restart.  It
is like local_preload_libraries except that it is still only settable by
a superuser.  This can be a better way to load modules such as
auto_explain.

Since there are now three preload parameters, regroup the documentation
a bit.  Put all parameters into one section, explain common
functionality only once, update the descriptions to reflect current and
future realities.

Reviewed-by: Dimitri Fontaine <dimitri@2ndQuadrant.fr>
2013-07-12 21:23:50 -04:00
Heikki Linnakangas
e5592c61ad Fix memory barrier support on icc on ia64, 2nd attempt.
Itanium doesn't have the mfence instruction - that's a 386 thing. Use the
"mf" instruction instead.

This reverts the previous commit to add "#include <emmintrinsic.h>"; the
problem was not with a missing #include.
2013-07-09 11:34:18 +03:00
Heikki Linnakangas
6052bceba5 Add #include needed for _mm_mfence() intrinsic on ia64.
Hopefully this fixes the build failure on buildfarm member dugong.
2013-07-09 10:29:43 +03:00
Heikki Linnakangas
9a20a9b21b Improve scalability of WAL insertions.
This patch replaces WALInsertLock with a number of WAL insertion slots,
allowing multiple backends to insert WAL records to the WAL buffers
concurrently. This is particularly useful for parallel loading large amounts
of data on a system with many CPUs.

This has one user-visible change: switching to a new WAL segment with
pg_switch_xlog() now fills the remaining unused portion of the segment with
zeros. This potentially adds some overhead, but it has been a very common
practice by DBA's to clear the "tail" of the segment with an external
pg_clearxlogtail utility anyway, to make the WAL files compress better.
With this patch, it's no longer necessary to do that.

This patch adds a new GUC, xloginsert_slots, to tune the number of WAL
insertion slots. Performance testing suggests that the default, 8, works
pretty well for all kinds of worklods, but I left the GUC in place to allow
others with different hardware to test that easily. We might want to remove
that before release.

Reviewed by Andres Freund.
2013-07-08 11:23:56 +03:00
Magnus Hagander
5a348fe077 Fix include-guard
Looks like a cut/paste error in the original addition of the file.

Andres Freund
2013-07-07 13:36:20 +02:00
Jeff Davis
269e780822 Use posix_fallocate() for new WAL files, where available.
This function is more efficient than actually writing out zeroes to
the new file, per microbenchmarks by Jon Nelson. Also, it may reduce
the likelihood of WAL file fragmentation.

Jon Nelson, with review by Andres Freund, Greg Smith and me.
2013-07-05 12:30:29 -07:00
Magnus Hagander
c87ff71f37 Expose the estimation of number of changed tuples since last analyze
This value, now pg_stat_all_tables.n_mod_since_analyze, was already
tracked and used by autovacuum, but not exposed to the user.

Mark Kirkwood, review by Laurenz Albe
2013-07-05 15:10:15 +02:00
Noah Misch
79e0f87a15 Use type "int64" for memory accounting in tuplesort.c/tuplestore.c.
Commit 263865a489 switched tuplesort.c and
tuplestore.c variables representing memory usage from type "long" to
type "Size".  This was unnecessary; I thought doing so avoided overflow
scenarios on 64-bit Windows, but guc.c already limited work_mem so as to
prevent the overflow.  It was also incomplete, not touching the logic
that assumed a signed data type.  Change the affected variables to
"int64".  This is perfect for 64-bit platforms, and it reduces the need
to contemplate platform-specific overflow scenarios.  It also puts us
close to being able to support work_mem over 2 GiB on 64-bit Windows.

Per report from Andres Freund.
2013-07-04 23:13:54 -04:00
Robert Haas
6bc8ef0b7f Add new GUC, max_worker_processes, limiting number of bgworkers.
In 9.3, there's no particular limit on the number of bgworkers;
instead, we just count up the number that are actually registered,
and use that to set MaxBackends.  However, that approach causes
problems for Hot Standby, which needs both MaxBackends and the
size of the lock table to be the same on the standby as on the
master, yet it may not be desirable to run the same bgworkers in
both places.  9.3 handles that by failing to notice the problem,
which will probably work fine in nearly all cases anyway, but is
not theoretically sound.

A further problem with simply counting the number of registered
workers is that new workers can't be registered without a
postmaster restart.  This is inconvenient for administrators,
since bouncing the postmaster causes an interruption of service.
Moreover, there are a number of applications for background
processes where, by necessity, the background process must be
started on the fly (e.g. parallel query).  While this patch
doesn't actually make it possible to register new background
workers after startup time, it's a necessary prerequisite.

Patch by me.  Review by Michael Paquier.
2013-07-04 11:24:24 -04:00
Fujii Masao
2ef085d0e6 Get rid of pg_class.reltoastidxid.
Treat TOAST index just the same as normal one and get the OID
of TOAST index from pg_index but not pg_class.reltoastidxid.
This change allows us to handle multiple TOAST indexes, and
which is required infrastructure for upcoming
REINDEX CONCURRENTLY feature.

Patch by Michael Paquier, reviewed by Andres Freund and me.
2013-07-04 03:24:09 +09:00
Peter Eisentraut
d864852685 Add #include to make header file independent 2013-07-02 20:19:52 -04:00
Robert Haas
3682025015 Add support for multiple kinds of external toast datums.
To that end, support tags rather than lengths for external datums.
As an example of how this can be used, add support or "indirect"
tuples which point to some externally allocated memory containing
a toast tuple.  Similar infrastructure could be used for other
purposes, including, perhaps, support for alternative compression
algorithms.

Andres Freund, reviewed by Hitoshi Harada and myself
2013-07-02 13:38:55 -04:00
Robert Haas
568d4138c6 Use an MVCC snapshot, rather than SnapshotNow, for catalog scans.
SnapshotNow scans have the undesirable property that, in the face of
concurrent updates, the scan can fail to see either the old or the new
versions of the row.  In many cases, we work around this by requiring
DDL operations to hold AccessExclusiveLock on the object being
modified; in some cases, the existing locking is inadequate and random
failures occur as a result.  This commit doesn't change anything
related to locking, but will hopefully pave the way to allowing lock
strength reductions in the future.

The major issue has held us back from making this change in the past
is that taking an MVCC snapshot is significantly more expensive than
using a static special snapshot such as SnapshotNow.  However, testing
of various worst-case scenarios reveals that this problem is not
severe except under fairly extreme workloads.  To mitigate those
problems, we avoid retaking the MVCC snapshot for each new scan;
instead, we take a new snapshot only when invalidation messages have
been processed.  The catcache machinery already requires that
invalidation messages be sent before releasing the related heavyweight
lock; else other backends might rely on locally-cached data rather
than scanning the catalog at all.  Thus, making snapshot reuse
dependent on the same guarantees shouldn't break anything that wasn't
already subtly broken.

Patch by me.  Review by Michael Paquier and Andres Freund.
2013-07-02 09:47:01 -04:00
Robert Haas
0d22987ae9 Add a convenience routine makeFuncCall to reduce duplication.
David Fetter and Andrew Gierth, reviewed by Jeevan Chalke
2013-07-01 14:46:54 -04:00
Peter Eisentraut
129759d6a5 Fix cpluspluscheck in checksum code
C++ is more picky about comparing signed and unsigned integers.
2013-06-30 10:25:43 -04:00
Heikki Linnakangas
ee6556555b Inline ginCompareItemPointers function for speed.
ginCompareItemPointers function is called heavily in gin index scans -
inlining it speeds up some kind of queries a lot.
2013-06-29 12:55:34 +03:00
Simon Riggs
f177cbfe67 ALTER TABLE ... ALTER CONSTRAINT for FKs
Allow constraint attributes to be altered,
so the default setting of NOT DEFERRABLE
can be altered to DEFERRABLE and back.

Review by Abhijit Menon-Sen
2013-06-29 00:27:30 +01:00
Robert Haas
5893ffa79c Make the OVER keyword unreserved.
This results in a slightly less specific error message when OVER
is used in a context where we don't accept window functions, but
per discussion, it's worth it to get the benefit of not needing
to reserve this keyword any more.  This same refactoring will
also let us avoid reserving some other keywords that we expect
to add in upcoming patches (specifically, IGNORE, RESPECT, and
FILTER).

Troels Nielsen, with minor changes by me
2013-06-28 11:11:00 -04:00
Robert Haas
5ee73525d5 Define Trap and TrapMacro even in non-cassert builds.
In some cases, the use of these macros may be preferable to Assert()
or AssertMacro(), since this way the caller can set the trap message.

Andres Freund and Robert Haas
2013-06-28 09:33:34 -04:00
Noah Misch
263865a489 Permit super-MaxAllocSize allocations with MemoryContextAllocHuge().
The MaxAllocSize guard is convenient for most callers, because it
reduces the need for careful attention to overflow, data type selection,
and the SET_VARSIZE() limit.  A handful of callers are happy to navigate
those hazards in exchange for the ability to allocate a larger chunk.
Introduce MemoryContextAllocHuge() and repalloc_huge().  Use this in
tuplesort.c and tuplestore.c, enabling internal sorts of up to INT_MAX
tuples, a factor-of-48 increase.  In particular, B-tree index builds can
now benefit from much-larger maintenance_work_mem settings.

Reviewed by Stephen Frost, Simon Riggs and Jeff Janes.
2013-06-27 14:53:57 -04:00
Noah Misch
19085116ee Cooperate with the Valgrind instrumentation framework.
Valgrind "client requests" in aset.c and mcxt.c teach Valgrind and its
Memcheck tool about the PostgreSQL allocator.  This makes Valgrind
roughly as sensitive to memory errors involving palloc chunks as it is
to memory errors involving malloc chunks.  Further client requests in
PageAddItem() and printtup() verify that all bits being added to a
buffer page or furnished to an output function are predictably-defined.
Those tests catch failures of C-language functions to fully initialize
the bits of a Datum, which in turn stymie optimizations that rely on
_equalConst().  Define the USE_VALGRIND symbol in pg_config_manual.h to
enable these additions.  An included "suppression file" silences nominal
errors we don't plan to fix.

Reviewed in earlier versions by Peter Geoghegan and Korry Douglas.
2013-06-26 20:22:25 -04:00
Noah Misch
5f538ad004 Renovate display of non-ASCII messages on Windows.
GNU gettext selects a default encoding for the messages it emits in a
platform-specific manner; it uses the Windows ANSI code page on Windows
and follows LC_CTYPE on other platforms.  This is inconvenient for
PostgreSQL server processes, so realize consistent cross-platform
behavior by calling bind_textdomain_codeset() on Windows each time we
permanently change LC_CTYPE.  This primarily affects SQL_ASCII databases
and processes like the postmaster that do not attach to a database,
making their behavior consistent with PostgreSQL on non-Windows
platforms.  Messages from SQL_ASCII databases use the encoding implied
by the database LC_CTYPE, and messages from non-database processes use
LC_CTYPE from the postmaster system environment.  PlatformEncoding
becomes unused, so remove it.

Make write_console() prefer WriteConsoleW() to write() regardless of the
encodings in use.  In this situation, write() will invariably mishandle
non-ASCII characters.

elog.c has assumed that messages conform to the database encoding.
While usually true, this does not hold for SQL_ASCII and MULE_INTERNAL.
Introduce MessageEncoding to track the actual encoding of message text.
The present consumers are Windows-specific code for converting messages
to UTF16 for use in system interfaces.  This fixes the appearance in
Windows event logs and consoles of translated messages from SQL_ASCII
processes like the postmaster.  Note that SQL_ASCII inherently disclaims
a strong notion of encoding, so non-ASCII byte sequences interpolated
into messages by %s may yet yield a nonsensical message.  MULE_INTERNAL
has similar problems at present, albeit for a different reason: its lack
of libiconv support or a conversion to UTF8.

Consequently, one need no longer restart Windows with a different
Windows ANSI code page to broadly test backend logging under a given
language.  Changing the user's locale ("Format") is enough.  Several
accounts can simultaneously run postmasters under different locales, all
correctly logging localized messages to Windows event logs and consoles.

Alexander Law and Noah Misch
2013-06-26 11:17:33 -04:00
Simon Riggs
4f14c86d74 Reverting previous commit, pending investigation
of sporadic seg faults from various build farm members.
2013-06-24 21:21:18 +01:00
Simon Riggs
b577a57d41 ALTER TABLE ... ALTER CONSTRAINT for FKs
Allow constraint attributes to be altered,
so the default setting of NOT DEFERRABLE
can be altered to DEFERRABLE and back.

Review by Abhijit Menon-Sen
2013-06-24 20:07:41 +01:00
Simon Riggs
1f09121b4e Ensure no xid gaps during Hot Standby startup
In some cases with higher numbers of subtransactions
it was possible for us to incorrectly initialize
subtrans leading to complaints of missing pages.

Bug report by Sergey Konoplev
Analysis and fix by Andres Freund
2013-06-23 11:05:02 +01:00
Jeff Davis
b8fd1a09f3 Add buffer_std flag to MarkBufferDirtyHint().
MarkBufferDirtyHint() writes WAL, and should know if it's got a
standard buffer or not. Currently, the only callers where buffer_std
is false are related to the FSM.

In passing, rename XLOG_HINT to XLOG_FPI, which is more descriptive.

Back-patch to 9.3.
2013-06-17 08:02:12 -07:00
Tom Lane
5242fefb47 Be consistent about #define'ing configure symbols as "1" not empty.
This is just neatnik-ism, since all the tests in the code are #ifdefs,
but we shouldn't specify symbols as "Define to 1 ..." and then not
actually define them that way.
2013-06-15 14:11:43 -04:00
Tom Lane
58ae1f4577 Stamp HEAD as 9.4devel.
Let the hacking begin ...
2013-06-14 14:41:28 -04:00
Tom Lane
e472b92140 Avoid deadlocks during insertion into SP-GiST indexes.
SP-GiST's original scheme for avoiding deadlocks during concurrent index
insertions doesn't work, as per report from Hailong Li, and there isn't any
evident way to make it work completely.  We could possibly lock individual
inner tuples instead of their whole pages, but preliminary experimentation
suggests that the performance penalty would be huge.  Instead, if we fail
to get a buffer lock while descending the tree, just restart the tree
descent altogether.  We keep the old tuple positioning rules, though, in
hopes of reducing the number of cases where this can happen.

Teodor Sigaev, somewhat edited by Tom Lane
2013-06-14 14:26:43 -04:00
Tom Lane
f04216341d Refactor checksumming code to make it easier to use externally.
pg_filedump and other external utility programs are likely to want to be
able to check Postgres page checksums.  To avoid messy duplication of code,
move the checksumming functionality into an exported header file, much as
we did awhile back for the CRC code.

In passing, get rid of an unportable assumption that a static char[] array
will be word-aligned, and do some other minor code beautification.
2013-06-13 22:35:56 -04:00
Noah Misch
813895e4ac Don't pass oidvector by value.
Since the structure ends with a flexible array, doing so truncates any
vector having more than one element.  New in 9.3, so no back-patch.
2013-06-12 19:50:37 -04:00
Tom Lane
dc3eb56383 Improve updatability checking for views and foreign tables.
Extend the FDW API (which we already changed for 9.3) so that an FDW can
report whether specific foreign tables are insertable/updatable/deletable.
The default assumption continues to be that they're updatable if the
relevant executor callback function is supplied by the FDW, but finer
granularity is now possible.  As a test case, add an "updatable" option to
contrib/postgres_fdw.

This patch also fixes the information_schema views, which previously did
not think that foreign tables were ever updatable, and fixes
view_is_auto_updatable() so that a view on a foreign table can be
auto-updatable.

initdb forced due to changes in information_schema views and the functions
they rely on.  This is a bit unfortunate to do post-beta1, but if we don't
change this now then we'll have another API break for FDWs when we do
change it.

Dean Rasheed, somewhat editorialized on by Tom Lane
2013-06-12 17:53:33 -04:00
Tom Lane
5c7603c318 Add ARM64 (aarch64) support to s_lock.h.
Use the same gcc atomic functions as we do on newer ARM chips.
(Basically this is a copy and paste of the __arm__ code block,
but omitting the SWPB option since that definitely won't work.)

Back-patch to 9.2.  The patch would work further back, but we'd also
need to update config.guess/config.sub in older branches to make them
build out-of-the-box, and there hasn't been demand for it.

Mark Salter
2013-06-04 15:42:02 -04:00
Heikki Linnakangas
15386281a6 Put back allow_system_table_mods check in heap_create().
This reverts commit a475c60367.

Erik Rijkers reported back in January 2013 that after the patch, if you do
"pg_dump -t myschema.mytable" to dump a single table, and restore that in
a database where myschema does not exist, the table is silently created in
pg_catalog instead. That is because pg_dump uses
"SET search_path=myschema, pg_catalog" to set schema the table is created
in. While allow_system_table_mods is not a very elegant solution to this,
we can't leave it as it is, so for now, revert it back to the way it was
previously.
2013-06-03 17:22:31 +03:00
Bruce Momjian
9af4159fce pgindent run for release 9.3
This is the first run of the Perl-based pgindent script.  Also update
pgindent instructions.
2013-05-29 16:58:43 -04:00
Robert Haas
6eb971bd64 Fix typo in comment.
Pavan Deolasee
2013-05-23 11:34:30 -04:00
Heikki Linnakangas
cb953d8b1b Use the term "radix tree" instead of "suffix tree" for SP-GiST text opclass.
What we have implemented is a radix tree (or a radix trie or a patricia
trie), but the docs and code comments incorrectly called it a "suffix tree".

Alexander Korotkov
2013-05-08 14:34:26 +03:00
Tom Lane
817a89423f Stamp 9.3beta1. 2013-05-06 16:57:06 -04:00
Tom Lane
1d6c72a55b Move materialized views' is-populated status into their pg_class entries.
Previously this state was represented by whether the view's disk file had
zero or nonzero size, which is problematic for numerous reasons, since it's
breaking a fundamental assumption about heap storage.  This was done to
allow unlogged matviews to revert to unpopulated status after a crash
despite our lack of any ability to update catalog entries post-crash.
However, this poses enough risk of future problems that it seems better to
not support unlogged matviews until we can find another way.  Accordingly,
revert that choice as well as a number of existing kluges forced by it
in favor of creating a pg_class.relispopulated flag column.
2013-05-06 13:27:22 -04:00
Simon Riggs
ceabfb20f9 Bump PG_CONTROL_VERSION to 937 2013-04-30 13:27:47 +01:00
Simon Riggs
443951748c Record data_checksum_version in control file.
The value is not used anywhere in code, but will
allow future changes to the checksum version
should that become necessary in the future.
2013-04-30 12:27:12 +01:00
Tom Lane
db9f0e1d9a Postpone creation of pathkeys lists to fix bug #8049.
This patch gets rid of the concept of, and infrastructure for,
non-canonical PathKeys; we now only ever create canonical pathkey lists.

The need for non-canonical pathkeys came from the desire to have
grouping_planner initialize query_pathkeys and related pathkey lists before
calling query_planner.  However, since query_planner didn't actually *do*
anything with those lists before they'd been made canonical, we can get rid
of the whole mess by just not creating the lists at all until the point
where we formerly canonicalized them.

There are several ways in which we could implement that without making
query_planner itself deal with grouping/sorting features (which are
supposed to be the province of grouping_planner).  I chose to add a
callback function to query_planner's API; other alternatives would have
required adding more fields to PlannerInfo, which while not bad in itself
would create an ABI break for planner-related plugins in the 9.2 release
series.  This still breaks ABI for anything that calls query_planner
directly, but it seems somewhat unlikely that there are any such plugins.

I had originally conceived of this change as merely a step on the way to
fixing bug #8049 from Teun Hoogendoorn; but it turns out that this fixes
that bug all by itself, as per the added regression test.  The reason is
that now get_eclass_for_sort_expr is adding the ORDER BY expression at the
end of EquivalenceClass creation not the start, and so anything that is in
a multi-member EquivalenceClass has already been created with correct
em_nullable_relids.  I am suspicious that there are related scenarios in
which we still need to teach get_eclass_for_sort_expr to compute correct
nullable_relids, but am not eager to risk destabilizing either 9.2 or 9.3
to fix bugs that are only hypothetical.  So for the moment, do this and
stop here.

Back-patch to 9.2 but not to earlier branches, since they don't exhibit
this bug for lack of join-clause-movement logic that depends on
em_nullable_relids being correct.  (We might have to revisit that choice
if any related bugs turn up.)  In 9.2, don't change the signature of
make_pathkeys_for_sortclauses nor remove canonicalize_pathkeys, so as
not to risk more plugin breakage than we have to.
2013-04-29 14:50:03 -04:00
Simon Riggs
43e7a66849 Introduce new page checksum algorithm and module.
Isolate checksum calculation to its own module, so that bufpage
knows little if anything about the details of the calculation.

This implementation is a modified FNV-1a hash checksum, details
of which are given in the new checksum.c header comments.

Basic implementation only, so we fix the output value.

Later related commits will add version numbers to pg_control,
compiler optimization flags and memory barriers.

Ants Aasma, reviewed by Jeff Davis and Simon Riggs
2013-04-29 09:05:27 +01:00
Tom Lane
f8db76e875 Editorialize a bit on new ProcessUtility() API.
Choose a saner ordering of parameters (adding a new input param after
the output params seemed a bit random), update the function's header
comment to match reality (cmon folks, is this really that hard?),
get rid of useless and sloppily-defined distinction between
PROCESS_UTILITY_SUBCOMMAND and PROCESS_UTILITY_GENERATED.
2013-04-28 00:18:45 -04:00
Tom Lane
5194024d72 Incidental cleanup of matviews code.
Move checking for unscannable matviews into ExecOpenScanRelation, which is
a better place for it first because the open relation is already available
(saving a relcache lookup cycle), and second because this eliminates the
problem of telling the difference between rangetable entries that will or
will not be scanned by the query.  In particular we can get rid of the
not-terribly-well-thought-out-or-implemented isResultRel field that the
initial matviews patch added to RangeTblEntry.

Also get rid of entirely unnecessary scannability check in the rewriter,
and a bogus decision about whether RefreshMatViewStmt requires a parse-time
snapshot.

catversion bump due to removal of a RangeTblEntry field, which changes
stored rules.
2013-04-27 17:48:57 -04:00
Peter Eisentraut
cc26ea9fe2 Clean up references to SQL92
In most cases, these were just references to the SQL standard in
general.  In a few cases, a contrast was made between SQL92 and later
standards -- those have been kept unchanged.
2013-04-20 11:04:41 -04:00
Heikki Linnakangas
87ae9e7265 Remove some unused and seldom used fields from RelationAmInfo.
This saves some memory from each index relcache entry. At least on a 64-bit
machine, it saves just enough to shrink a typical relcache entry's memory
usage from 2k to 1k. That's nice if you have a lot of backends and a lot of
indexes.
2013-04-16 15:07:58 +03:00
Andrew Dunstan
d788121aba Mark json IO and extraction functions immutable.
Per complaint from Hubert Depesz Lubaczewski.

Catalog version bumped.
2013-04-15 21:46:25 -04:00
Tom Lane
0b33790421 Clean up the mess around EXPLAIN and materialized views.
Revert the matview-related changes in explain.c's API, as per recent
complaint from Robert Haas.  The reason for these appears to have been
principally some ill-considered choices around having intorel_startup do
what ought to be parse-time checking, plus a poor arrangement for passing
it the view parsetree it needs to store into pg_rewrite when creating a
materialized view.  Do the latter by having parse analysis stick a copy
into the IntoClause, instead of doing it at runtime.  (On the whole,
I seriously question the choice to represent CREATE MATERIALIZED VIEW as a
variant of SELECT INTO/CREATE TABLE AS, because that means injecting even
more complexity into what was already a horrid legacy kluge.  However,
I didn't go so far as to rethink that choice ... yet.)

I also moved several error checks into matview parse analysis, and
made the check for external Params in a matview more accurate.

In passing, clean things up a bit more around interpretOidsOption(),
and fix things so that we can use that to force no-oids for views,
sequences, etc, thereby eliminating the need to cons up "oids = false"
options when creating them.

catversion bump due to change in IntoClause.  (I wonder though if we
really need readfuncs/outfuncs support for IntoClause anymore.)
2013-04-12 19:25:31 -04:00
Robert Haas
f8a54e936b sepgsql: Enforce db_procedure:{execute} permission.
To do this, we add an additional object access hook type,
OAT_FUNCTION_EXECUTE.

KaiGai Kohei
2013-04-12 08:58:01 -04:00
Robert Haas
d017bf41a3 Minor wording corrections for object-access hook stuff.
KaiGai Kohei
2013-04-12 08:40:02 -04:00
Alvaro Herrera
6a76edb188 Fix confusion between ObjectType and ObjectClass
Per report by Will Leinweber and Peter Eisentraut
2013-04-11 11:59:47 -03:00
Kevin Grittner
52e6e33ab4 Create a distinction between a populated matview and a scannable one.
The intent was that being populated would, long term, be just one
of the conditions which could affect whether a matview was
scannable; being populated should be necessary but not always
sufficient to scan the relation.  Since only CREATE and REFRESH
currently determine the scannability, names and comments
accidentally conflated these concepts, leading to confusion.

Also add missing locking for the SQL function which allows a
test for scannability, and fix a modularity violatiion.

Per complaints from Tom Lane, although its not clear that these
will satisfy his concerns.  Hopefully this will at least better
frame the discussion.
2013-04-09 13:02:49 -05:00
Robert Haas
0bf42a5f3b Adjust ExplainOneQuery_hook_type to take a DestReceiver argument.
The materialized views patch adjusted ExplainOneQuery to take an
additional DestReceiver argument, but failed to add a matching
argument to the definition of ExplainOneQuery_hook.  This is a
problem for users of the hook that want to call ExplainOnePlan.
Fix by adding the missing argument.
2013-04-09 10:25:08 -04:00
Tom Lane
3ccae48f44 Support indexing of regular-expression searches in contrib/pg_trgm.
This works by extracting trigrams from the given regular expression,
in generally the same spirit as the previously-existing support for
LIKE searches, though of course the details are far more complicated.

Currently, only GIN indexes are supported.  We might be able to make
it work with GiST indexes later.

The implementation includes adding API functions to backend/regex/
to provide a view of the search NFA created from a regular expression.
These functions are meant to be generic enough to be supportable in
a standalone version of the regex library, should that ever happen.

Alexander Korotkov, reviewed by Heikki Linnakangas and Tom Lane
2013-04-09 01:06:54 -04:00
Simon Riggs
47c4333189 Avoid tricky race condition recording XLOG_HINT
We copy the buffer before inserting an XLOG_HINT to avoid WAL CRC errors
caused by concurrent hint writes to buffer while share locked. To make this work
we refactor RestoreBackupBlock() to allow an XLOG_HINT to avoid the normal
path for backup blocks, which assumes the underlying buffer is exclusive locked.
Resulting code completely changes layout of XLOG_HINT WAL records, but
this isn't even beta code, so this is a low impact change.
In passing, avoid taking WALInsertLock for full page writes on checksummed
hints, remove related cruft from XLogInsert() and improve xlog_desc record for
XLOG_HINT.

Andres Freund

Bug report by Fujii Masao, testing by Jeff Janes and Jaime Casanova,
review by Jeff Davis and Simon Riggs. Applied with changes from review
and some comment editing.
2013-04-08 08:52:39 +01:00
Robert Haas
e965e6344c sepgsql: Enforce db_schema:search permission.
KaiGai Kohei, with comment and doc wordsmithing by me
2013-04-05 08:51:31 -04:00
Heikki Linnakangas
bf2b0a1478 Fix crash on compiling a regular expression with more than 32k colors.
Throw an error instead.

Backpatch to all supported branches.
2013-04-04 19:48:11 +03:00
Tom Lane
17fe2793ea Fix insecure parsing of server command-line switches.
An oversight in commit e710b65c1c allowed
database names beginning with "-" to be treated as though they were secure
command-line switches; and this switch processing occurs before client
authentication, so that even an unprivileged remote attacker could exploit
the bug, needing only connectivity to the postmaster's port.  Assorted
exploits for this are possible, some requiring a valid database login,
some not.  The worst known problem is that the "-r" switch can be invoked
to redirect the process's stderr output, so that subsequent error messages
will be appended to any file the server can write.  This can for example be
used to corrupt the server's configuration files, so that it will fail when
next restarted.  Complete destruction of database tables is also possible.

Fix by keeping the database name extracted from a startup packet fully
separate from command-line switches, as had already been done with the
user name field.

The Postgres project thanks Mitsumasa Kondo for discovering this bug,
Kyotaro Horiguchi for drafting the fix, and Noah Misch for recognizing
the full extent of the danger.

Security: CVE-2013-1899
2013-04-01 14:00:51 -04:00
Tom Lane
ce9ab88981 Make REPLICATION privilege checks test current user not authenticated user.
The pg_start_backup() and pg_stop_backup() functions checked the privileges
of the initially-authenticated user rather than the current user, which is
wrong.  For example, a user-defined index function could successfully call
these functions when executed by ANALYZE within autovacuum.  This could
allow an attacker with valid but low-privilege database access to interfere
with creation of routine backups.  Reported and fixed by Noah Misch.

Security: CVE-2013-1901
2013-04-01 13:09:24 -04:00
Andrew Dunstan
a570c98d7f Add new JSON processing functions and parser API.
The JSON parser is converted into a recursive descent parser, and
exposed for use by other modules such as extensions. The API provides
hooks for all the significant parser event such as the beginning and end
of objects and arrays, and providing functions to handle these hooks
allows for fairly simple construction of a wide variety of JSON
processing functions. A set of new basic processing functions and
operators is also added, which use this API, including operations to
extract array elements, object fields, get the length of arrays and the
set of keys of a field, deconstruct an object into a set of key/value
pairs, and create records from JSON objects and arrays of objects.

Catalog version bumped.

Andrew Dunstan, with some documentation assistance from Merlin Moncure.
2013-03-29 14:12:13 -04:00
Alvaro Herrera
473ab40c8b Add sql_drop event for event triggers
This event takes place just before ddl_command_end, and is fired if and
only if at least one object has been dropped by the command.  (For
instance, DROP TABLE IF EXISTS of a table that does not in fact exist
will not lead to such a trigger firing).  Commands that drop multiple
objects (such as DROP SCHEMA or DROP OWNED BY) will cause a single event
to fire.  Some firings might be surprising, such as
ALTER TABLE DROP COLUMN.

The trigger is fired after the drop has taken place, because that has
been deemed the safest design, to avoid exposing possibly-inconsistent
internal state (system catalogs as well as current transaction) to the
user function code.  This means that careful tracking of object
identification is required during the object removal phase.

Like other currently existing events, there is support for tag
filtering.

To support the new event, add a new pg_event_trigger_dropped_objects()
set-returning function, which returns a set of rows comprising the
objects affected by the command.  This is to be used within the user
function code, and is mostly modelled after the recently introduced
pg_identify_object() function.

Catalog version bumped due to the new function.

Dimitri Fontaine and Álvaro Herrera
Review by Robert Haas, Tom Lane
2013-03-28 13:05:48 -03:00
Simon Riggs
593c39d156 Revoke bc5334d867 2013-03-28 09:18:02 +00:00
Simon Riggs
bc5334d867 Allow external recovery_config_directory
If required, recovery.conf can now be located outside of the data directory.
Server needs read/write permissions on this directory.
2013-03-27 11:45:42 +00:00
Kevin Grittner
549dae0352 Fix problems with incomplete attempt to prohibit OIDS with MVs.
Problem with assertion failure in restoring from pg_dump output
reported by Joachim Wieland.

Review and suggestions by Tom Lane and Robert Haas.
2013-03-22 13:27:34 -05:00
Simon Riggs
96ef3b8ff1 Allow I/O reliability checks using 16-bit checksums
Checksums are set immediately prior to flush out of shared buffers
and checked when pages are read in again. Hint bit setting will
require full page write when block is dirtied, which causes various
infrastructure changes. Extensive comments, docs and README.

WARNING message thrown if checksum fails on non-all zeroes page;
ERROR thrown but can be disabled with ignore_checksum_failure = on.

Feature enabled by an initdb option, since transition from option off
to option on is long and complex and has not yet been implemented.
Default is not to use checksums.

Checksum used is WAL CRC-32 truncated to 16-bits.

Simon Riggs, Jeff Davis, Greg Smith
Wide input and assistance from many community members. Thank you.
2013-03-22 13:54:07 +00:00
Tom Lane
9cbc4b80dd Redo postgres_fdw's planner code so it can handle parameterized paths.
I wasn't going to ship this without having at least some example of how
to do that.  This version isn't terribly bright; in particular it won't
consider any combinations of multiple join clauses.  Given the cost of
executing a remote EXPLAIN, I'm not sure we want to be very aggressive
about doing that, anyway.

In support of this, refactor generate_implied_equalities_for_indexcol
so that it can be used to extract equivalence clauses that aren't
necessarily tied to an index.
2013-03-21 19:44:32 -04:00
Alvaro Herrera
f8348ea32e Allow extracting machine-readable object identity
Introduce pg_identify_object(oid,oid,int4), which is similar in spirit
to pg_describe_object but instead produces a row of machine-readable
information to uniquely identify the given object, without resorting to
OIDs or other internal representation.  This is intended to be used in
the event trigger implementation, to report objects being operated on;
but it has usefulness of its own.

Catalog version bumped because of the new function.
2013-03-20 18:19:19 -03:00
Simon Riggs
bb7cc2623f Remove PageSetTLI and rename pd_tli to pd_checksum
Remove use of PageSetTLI() from all page manipulation functions
and adjust README to indicate change in the way we make changes
to pages. Repurpose those bytes into the pd_checksum field and
explain how that works in comments about page header.

Refactoring ahead of actual feature patch which would make use
of the checksum field, arriving later.

Jeff Davis, with comments and doc changes by Simon Riggs
Direction suggested by Robert Haas; many others providing
review comments.
2013-03-18 13:46:42 +00:00
Robert Haas
05f3f9c7b2 Extend object-access hook machinery to support post-alter events.
This also slightly widens the scope of what we support in terms of
post-create events.

KaiGai Kohei, with a few changes, mostly to the comments, by me
2013-03-17 22:57:26 -04:00
Tom Lane
e2a203a190 initdb needs pqsignal() even on Windows.
I had thought we weren't using this version of pqsignal() at all on
Windows, but that's wrong --- initdb is using it (and coping with the
POSIX-ish semantics of bare signal() :-().  So allow the file to be
built in WIN32+FRONTEND case, and add it to the MSVC build logic.
2013-03-17 15:19:47 -04:00
Tom Lane
da5aeccf64 Move pqsignal() to libpgport.
We had two copies of this function in the backend and libpq, which was
already pretty bogus, but it turns out that we need it in some other
programs that don't use libpq (such as pg_test_fsync).  So put it where
it probably should have been all along.  The signal-mask-initialization
support in src/backend/libpq/pqsignal.c stays where it is, though, since
we only need that in the backend.
2013-03-17 12:06:42 -04:00
Tom Lane
d43837d030 Add lock_timeout configuration parameter.
This GUC allows limiting the time spent waiting to acquire any one
heavyweight lock.

In support of this, improve the recently-added timeout infrastructure
to permit efficiently enabling or disabling multiple timeouts at once.
That reduces the performance hit from turning on lock_timeout, though
it's still not zero.

Zoltán Böszörményi, reviewed by Tom Lane,
Stephen Frost, and Hari Babu
2013-03-16 23:22:57 -04:00
Tom Lane
4387cf956b Avoid inserting Result nodes that only compute identity projections.
The planner sometimes inserts Result nodes to perform column projections
(ie, arbitrary scalar calculations) above plan nodes that lack projection
logic of their own.  However, we did that even if the lower plan node was
in fact producing the required column set already; which is a pretty common
case given the popularity of "SELECT * FROM ...".  Measurements show that
the useless plan node adds non-negligible overhead, especially when there
are many columns in the result.  So add a check to avoid inserting a Result
node unless there's something useful for it to do.

There are a couple of remaining places where unnecessary Result nodes
could get inserted, but they are (a) much less performance-critical,
and (b) coded in such a way that it's hard to avoid inserting a Result,
because the desired tlist is changed on-the-fly in subsequent logic.
We'll leave those alone for now.

Kyotaro Horiguchi; reviewed and further hacked on by Amit Kapila and
Tom Lane.
2013-03-14 13:43:18 -04:00
Heikki Linnakangas
59d0bf9dca Add cost estimation of range @> and <@ operators.
The estimates are based on the existing lower bound histogram, and a new
histogram of range lengths.

Bump catversion, because the range length histogram now needs to be present
in statistic slot kind 6, or you get an error on @> and <@ queries. (A
re-ANALYZE would be enough to fix that, though)

Alexander Korotkov, with some refactoring by me.
2013-03-14 15:36:56 +02:00
Andrew Dunstan
38fb4d978c JSON generation improvements.
This adds the following:

    json_agg(anyrecord) -> json
    to_json(any) -> json
    hstore_to_json(hstore) -> json (also used as a cast)
    hstore_to_json_loose(hstore) -> json

The last provides heuristic treatment of numbers and booleans.

Also, in json generation, if any non-builtin type has a cast to json,
that function is used instead of the type's output function.

Andrew Dunstan, reviewed by Steve Singer.

Catalog version bumped.
2013-03-10 17:35:36 -04:00
Tom Lane
21734d2fb8 Support writable foreign tables.
This patch adds the core-system infrastructure needed to support updates
on foreign tables, and extends contrib/postgres_fdw to allow updates
against remote Postgres servers.  There's still a great deal of room for
improvement in optimization of remote updates, but at least there's basic
functionality there now.

KaiGai Kohei, reviewed by Alexander Korotkov and Laurenz Albe, and rather
heavily revised by Tom Lane.
2013-03-10 14:16:02 -04:00
Magnus Hagander
7f49a67f95 Report pg_hba line number and contents when users fail to log in
Instead of just reporting which user failed to log in, log both the
line number in the active pg_hba.conf file (which may not match reality
in case the file has been edited and not reloaded) and the contents of
the matching line (which will always be correct), to make it easier
to debug incorrect pg_hba.conf files.

The message to the client remains unchanged and does not include this
information, to prevent leaking security sensitive information.

Reviewed by Tom Lane and Dean Rasheed
2013-03-10 15:54:37 +01:00
Heikki Linnakangas
96443d1420 Forgot catversion bump in the SP-GiST adjacent support patch. 2013-03-08 17:12:38 +02:00
Heikki Linnakangas
23f10b6473 SP-GiST support of the range adjacent operator -|-
Alexander Korotkov, reviewed by Jeff Davis.
2013-03-08 15:03:19 +02:00
Heikki Linnakangas
7ccefe8610 Fix tli history file fetching, broken by the archive after crash recevery patch.
If we were about to enter archive recovery after crash recovery, we scanned
the archive for the latest tli history file, and set the recovery target
timeline to that. However, when we actually tried to read the history file,
we would not fetch the file from the archive, because we were not in archive
recovery yet.

To fix, make readTimeLineHistory and existsTimeLineHistory to always fetch
the file from archive if archive recovery is requested, even if we're not in
archive recovery yet.

Backpatch to 9.2. Mitsumasa KONDO
2013-03-07 12:33:24 +02:00
Tom Lane
1908abc4a3 Arrange to cache FdwRoutine structs in foreign tables' relcache entries.
This saves several catalog lookups per reference.  It's not all that
exciting right now, because we'd managed to minimize the number of places
that need to fetch the data; but the upcoming writable-foreign-tables patch
needs this info in a lot more places.
2013-03-06 23:48:09 -05:00
Robert Haas
f90cc26982 Code beautification for object-access hook machinery.
KaiGai Kohei
2013-03-06 20:53:25 -05:00
Tom Lane
e11cb8ba2c Fix missing #include in commands/matview.h.
It needs parsenodes.h to be compilable regardless of previous headers.
2013-03-06 18:21:05 -05:00
Tom Lane
80b011ef0a Fix to_char() to use ASCII-only case-folding rules where appropriate.
formatting.c used locale-dependent case folding rules in some code paths
where the result isn't supposed to be locale-dependent, for example
to_char(timestamp, 'DAY').  Since the source data is always just ASCII
in these cases, that usually didn't matter ... but it does matter in
Turkish locales, which have unusual treatment of "i" and "I".  To confuse
matters even more, the misbehavior was only visible in UTF8 encoding,
because in single-byte encodings we used pg_toupper/pg_tolower which
don't have locale-specific behavior for ASCII characters.  Fix by providing
intentionally ASCII-only case-folding functions and using these where
appropriate.  Per bug #7913 from Adnan Dursun.  Back-patch to all active
branches, since it's been like this for a long time.
2013-03-05 13:02:30 -05:00
Kevin Grittner
c8056592bc Bump catversion because of new function in the materialized view patch. 2013-03-05 05:32:03 -06:00
Kevin Grittner
3bf3ab8c56 Add a materialized view relations.
A materialized view has a rule just like a view and a heap and
other physical properties like a table.  The rule is only used to
populate the table, references in queries refer to the
materialized data.

This is a minimal implementation, but should still be useful in
many cases.  Currently data is only populated "on demand" by the
CREATE MATERIALIZED VIEW and REFRESH MATERIALIZED VIEW statements.
It is expected that future releases will add incremental updates
with various timings, and that a more refined concept of defining
what is "fresh" data will be developed.  At some point it may even
be possible to have queries use a materialized in place of
references to underlying tables, but that requires the other
above-mentioned features to be working first.

Much of the documentation work by Robert Haas.
Review by Noah Misch, Thom Brown, Robert Haas, Marko Tiikkaja
Security review by KaiGai Kohei, with a decision on how best to
implement sepgsql still pending.
2013-03-03 18:23:31 -06:00
Tom Lane
b15a6da292 Get rid of any toast table when converting a table to a view.
Also make sure other fields of the view's pg_class entry are appropriate
for a view; it shouldn't have relfrozenxid set for instance.

This ancient omission isn't believed to have any serious consequences in
versions 8.4-9.2, so no backpatch.  But let's fix it before it does bite
us in some serious way.  It's just luck that the case doesn't cause
problems for autovacuum.  (It did cause problems in 8.3, but that's out
of support.)

Andres Freund
2013-03-03 19:05:47 -05:00
Tom Lane
2b78d101d1 Fix SQL function execution to be safe with long-lived FmgrInfos.
fmgr_sql had been designed on the assumption that the FmgrInfo it's called
with has only query lifespan.  This is demonstrably unsafe in connection
with range types, as shown in bug #7881 from Andrew Gierth.  Fix things
so that we re-generate the function's cache data if the (sub)transaction
it was made in is no longer active.

Back-patch to 9.2.  This might be needed further back, but it's not clear
whether the case can realistically arise without range types, so for now
I'll desist from back-patching further.
2013-03-03 17:39:58 -05:00
Heikki Linnakangas
3d009e45bd Add support for piping COPY to/from an external program.
This includes backend "COPY TO/FROM PROGRAM '...'" syntax, and corresponding
psql \copy syntax. Like with reading/writing files, the backend version is
superuser-only, and in the psql version, the program is run in the client.

In the passing, the psql \copy STDIN/STDOUT syntax is subtly changed: if you
the stdin/stdout is quoted, it's now interpreted as a filename. For example,
"\copy foo from 'stdin'" now reads from a file called 'stdin', not from
standard input. Before this, there was no way to specify a filename called
stdin, stdout, pstdin or pstdout.

This creates a new function in pgport, wait_result_to_str(), which can
be used to convert the exit status of a process, as returned by wait(3),
to a human-readable string.

Etsuro Fujita, reviewed by Amit Kapila.
2013-02-27 18:22:31 +02:00
Tom Lane
c153530dc1 Install headers from the new src/include/common subdirectory.
This got missed in commit 8396447cdb.

Andres Freund
2013-02-26 15:27:30 -05:00
Alvaro Herrera
639ed4e84b Add pg_xlogdump contrib program
This program relies on rm_desc backend routines and the xlogreader
infrastructure to emit human-readable rendering of WAL records.

Author: Andres Freund, with many reworks by Álvaro
Reviewed (in a much earlier version) by Peter Eisentraut
2013-02-22 16:56:55 -03:00
Alvaro Herrera
a730183926 Move relpath() to libpgcommon
This enables non-backend code, such as pg_xlogdump, to use it easily.
The previous location, in src/backend/catalog/catalog.c, made that
essentially impossible because that file depends on many backend-only
facilities; so this needs to live separately.
2013-02-21 22:46:17 -03:00
Tom Lane
54a2786835 Need to decorate XactIsoLevel as PGDLLIMPORT for postgres_fdw.
Per buildfarm.
2013-02-21 09:28:42 -05:00
Alvaro Herrera
a40d09e27f Move ExceptionalCondition back to postgres.h
It needs to be defined in the backend even when assertions are not
enabled.  It's cleaner to put it back, than create a separate #ifdef
section in c.h.

Per trouble report from Jeff Janes
2013-02-18 18:53:32 -03:00
Alvaro Herrera
187492b6c2 Split pgstat file in smaller pieces
We now write one file per database and one global file, instead of
having the whole thing in a single huge file.  This reduces the I/O that
must be done when partial data is required -- which is all the time,
because each process only needs information on its own database anyway.
Also, the autovacuum launcher does not need data about tables and
functions in each database; having the global stats for all DBs is
enough.

Catalog version bumped because we have a new subdir under PGDATA.

Author: Tomas Vondra.  Some rework by Álvaro
Testing by Jeff Janes
Other discussion by Heikki Linnakangas, Tom Lane.
2013-02-18 18:12:52 -03:00
Peter Eisentraut
9475db3a4e Add ALTER ROLE ALL SET command
This generalizes the existing ALTER ROLE ... SET and ALTER DATABASE
... SET functionality to allow creating settings that apply to all users
in all databases.

reviewed by Pavel Stehule
2013-02-17 23:45:36 -05:00
Simon Riggs
c2f79ba269 Force archive_status of .done for xlogs created by dearchival/replication.
This is a forward-patch of commit 6f4b8a4f4f,
applied to 9.2 back in August. The plan was to do something else in master,
but it looks like it's not going to happen, so let's just apply the 9.2
solution to master as well.

Fujii Masao
2013-02-15 19:28:06 +02:00
Tom Lane
fdaf44862b Invent pre-commit/pre-prepare/pre-subcommit events for xact callbacks.
Currently it's only possible for loadable modules to get control during
post-commit cleanup of a transaction.  That doesn't work too well if they
want to do something that could throw an error; for example, an FDW might
need to issue a remote commit, which could well fail.  To improve matters,
extend the existing APIs for XactCallback and SubXactCallback functions
to provide new pre-commit events for this purpose.

The release notes will need to mention that existing callback functions
should be checked to make sure they don't do something unwanted when one
of the new event types occurs.  In the examples within our source tree,
contrib/sepgsql was fine but plpgsql had been a bit too cute.
2013-02-14 20:35:08 -05:00
Tom Lane
71627f3d19 Fix CVE-2013-0255 properly.
Revert commit ab0f7b6089 (in HEAD only)
in favor of the proper solution, which is to declare enum_recv() correctly
in the system catalogs.  It should be declared to take type "internal"
not "cstring".

Also improve the type_sanity regression test, which should have caught
this typo, so that it actually would.  Most of the relevant checks on
the signature of type I/O functions should not have been restricted to
basetypes/pseudotypes, as they should apply to any type's I/O functions.
2013-02-13 16:20:01 -05:00
Alvaro Herrera
0e81ddde2c Rename "string" pstrdup argument to "in"
The former name collides with a symbol also used in the isolation test's
parser, causing assorted failures in certain platforms.
2013-02-12 12:43:09 -03:00
Alvaro Herrera
8396447cdb Create libpgcommon, and move pg_malloc et al to it
libpgcommon is a new static library to allow sharing code among the
various frontend programs and backend; this lets us eliminate duplicate
implementations of common routines.  We avoid libpgport, because that's
intended as a place for porting issues; per discussion, it seems better
to keep them separate.

The first use case, and the only implemented by this patch, is pg_malloc
and friends, which many frontend programs were already using.

At the same time, we can use this to provide palloc emulation functions
for the frontend; this way, some palloc-using files in the backend can
also be used by the frontend cleanly.  To do this, we change palloc() in
the backend to be a function instead of a macro on top of
MemoryContextAlloc().  This was previously believed to cause loss of
performance, but this implementation has been tweaked by Tom and Andres
so that on modern compilers it provides a slight improvement over the
previous one.

This lets us clean up some places that were already with
localized hacks.

Most of the pg_malloc/palloc changes in this patch were authored by
Andres Freund. Zoltán Böszörményi also independently provided a form of
that.  libpgcommon infrastructure was authored by Álvaro.
2013-02-12 11:21:05 -03:00
Peter Eisentraut
0cb1fac3b1 Add noreturn attributes to some error reporting functions 2013-02-12 07:13:22 -05:00
Heikki Linnakangas
62401db45c Support unlogged GiST index.
The reason this wasn't supported before was that GiST indexes need an
increasing sequence to detect concurrent page-splits. In a regular WAL-
logged GiST index, the LSN of the page-split record is used for that
purpose, and in a temporary index, we can get away with a backend-local
counter. Neither of those methods works for an unlogged relation.

To provide such an increasing sequence of numbers, create a "fake LSN"
counter that is saved and restored across shutdowns. On recovery, unlogged
relations are blown away, so the counter doesn't need to survive that
either.

Jeevan Chalke, based on discussions with Robert Haas, Tom Lane and me.
2013-02-11 23:07:09 +02:00
Heikki Linnakangas
7803e9327d Include previous TLI in end-of-recovery and shutdown checkpoint records.
This isn't used for anything but a sanity check at the moment, but it could
be highly valuable for debugging purposes. It could also be used to recreate
timeline history by traversing WAL, which seems useful.
2013-02-11 18:16:25 +02:00
Tom Lane
0fd0f3688b Document and clean up gistsplit.c.
Improve comments, rename some variables and functions, slightly simplify
a couple of APIs, in an attempt to make this code readable by people other
than its original author.

Even though this is essentially just cosmetic, back-patch to all active
branches, because otherwise it's going to make back-patching future fixes
in this file very painful.
2013-02-10 11:58:15 -05:00
Tom Lane
c61e26ee3e Add support for ALTER RULE ... RENAME TO.
Ali Dar, reviewed by Dean Rasheed.
2013-02-08 23:58:40 -05:00
Alvaro Herrera
381d4b70a9 Clean up c.h / postgres.h after Assert() move
Per Tom
2013-02-08 12:50:58 -03:00
Tom Lane
166d534fcd Repair bugs in GiST page splitting code for multi-column indexes.
When considering a non-last column in a multi-column GiST index,
gistsplit.c tries to improve on the split chosen by the opclass-specific
pickSplit function by considering penalties for the next column.  However,
there were two bugs in this code: it failed to recompute the union keys for
the leftmost index columns, even though these might well change after
reassigning tuples; and it included the old union keys in the recomputation
for the columns it did recompute, so that those keys couldn't get smaller
even if they should.  The first problem could result in an invalid index
in which searches wouldn't find index entries that are in fact present;
the second would make the index less efficient to search.

Both of these errors were caused by misuse of gistMakeUnionItVec, whose
API was designed in a way that just begged such errors to be made.  There
is no situation in which it's safe or useful to compute the union keys for
a subset of the index columns, and there is no caller that wants any
previous union keys to be included in the computation; so the undocumented
choice to treat the union keys as in/out rather than pure output parameters
is a waste of code as well as being dangerous.

Hence, rather than just making a minimal patch, I've changed the API of
gistMakeUnionItVec to remove the "startkey" parameter (it now always
processes all index columns) and treat the attr/isnull arrays as purely
output parameters.

In passing, also get rid of a couple of unnecessary and dangerous uses
of static variables in gistutil.c.  It's remarkable that the one in
gistMakeUnionKey hasn't given us portability troubles before now, because
in addition to posing a re-entrancy hazard, it was unsafely assuming that
a static char[] array would have at least Datum alignment.

Per investigation of a trouble report from Tomas Vondra.  (There are also
some bugs in contrib/btree_gist to be fixed, but that seems like material
for a separate patch.)  Back-patch to all supported branches.
2013-02-07 17:44:02 -05:00
Alvaro Herrera
5a1cd89f8f Split out list of XLog resource managers
The new rmgrlist.h header, containing all necessary data
about built-in resource managers, allows other pieces of code to
access them.

In particular, this allows a future pg_xlogdump program to extract
rm_desc function pointers, without having to keep a duplicate list of
them.
2013-02-06 08:47:28 -03:00
Alvaro Herrera
e1d25de35a Move Assert() definitions to c.h
This way, they can be used by frontend and backend code.  We already
supported that, but doing it this way allows us to mix true frontend
files with backend files compiled in frontend environment.

Author: Andres Freund
2013-02-01 17:50:04 -03:00
Tom Lane
0900ac2d0d Fix plpgsql's reporting of plan-time errors in possibly-simple expressions.
exec_simple_check_plan and exec_eval_simple_expr attempted to call
GetCachedPlan directly.  This meant that if an error was thrown during
planning, the resulting context traceback would not include the line
normally contributed by _SPI_error_callback.  This is already inconsistent,
but just to be really odd, a re-execution of the very same expression
*would* show the additional context line, because we'd already have cached
the plan and marked the expression as non-simple.

The problem is easy to demonstrate in 9.2 and HEAD because planning of a
cached plan doesn't occur at all until GetCachedPlan is done.  In earlier
versions, it could only be an issue if initial planning had succeeded, then
a replan was forced (already somewhat improbable for a simple expression),
and the replan attempt failed.  Since the issue is mainly cosmetic in older
branches anyway, it doesn't seem worth the risk of trying to fix it there.
It is worth fixing in 9.2 since the instability of the context printout can
affect the results of GET STACKED DIAGNOSTICS, as per a recent discussion
on pgsql-novice.

To fix, introduce a SPI function that wraps GetCachedPlan while installing
the correct callback function.  Use this instead of calling GetCachedPlan
directly from plpgsql.

Also introduce a wrapper function for extracting a SPI plan's
CachedPlanSource list.  This lets us stop including spi_priv.h in
pl_exec.c, which was never a very good idea from a modularity standpoint.

In passing, fix a similar inconsistency that could occur in SPI_cursor_open,
which was also calling GetCachedPlan without setting up a context callback.
2013-01-30 20:02:23 -05:00
Tom Lane
991f3e5ab3 Provide database object names as separate fields in error messages.
This patch addresses the problem that applications currently have to
extract object names from possibly-localized textual error messages,
if they want to know for example which index caused a UNIQUE_VIOLATION
failure.  It adds new error message fields to the wire protocol, which
can carry the name of a table, table column, data type, or constraint
associated with the error.  (Since the protocol spec has always instructed
clients to ignore unrecognized field types, this should not create any
compatibility problem.)

Support for providing these new fields has been added to just a limited set
of error reports (mainly, those in the "integrity constraint violation"
SQLSTATE class), but we will doubtless add them to more calls in future.

Pavel Stehule, reviewed and extensively revised by Peter Geoghegan, with
additional hacking by Tom Lane.
2013-01-29 17:08:26 -05:00
Simon Riggs
fd4ced5230 Fast promote mode skips checkpoint at end of recovery.
pg_ctl promote -m fast will skip the checkpoint at end of recovery so that we
can achieve very fast failover when the apply delay is low. Write new WAL record
XLOG_END_OF_RECOVERY to allow us to switch timeline correctly for downstream log
readers. If we skip synchronous end of recovery checkpoint we request a normal
spread checkpoint so that the window of re-recovery is low.

Simon Riggs and Kyotaro Horiguchi, with input from Fujii Masao.
Review by Heikki Linnakangas
2013-01-29 00:06:15 +00:00
Bruce Momjian
7e2322dff3 Allow CREATE TABLE IF EXIST so succeed if the schema is nonexistent
Previously, CREATE TABLE IF EXIST threw an error if the schema was
nonexistent.  This was done by passing 'missing_ok' to the function that
looks up the schema oid.
2013-01-26 13:24:50 -05:00
Tom Lane
0d5fbdc157 Change plan caching to honor, not resist, changes in search_path.
In the initial implementation of plan caching, we saved the active
search_path when a plan was first cached, then reinstalled that path
anytime we needed to reparse or replan.  The idea of that was to try to
reselect the same referenced objects, in somewhat the same way that views
continue to refer to the same objects in the face of schema or name
changes.  Of course, that analogy doesn't bear close inspection, since
holding the search_path fixed doesn't cope with object drops or renames.
Moreover sticking with the old path seems to create more surprises than
it avoids.  So instead of doing that, consider that the cached plan depends
on search_path, and force reparse/replan if the active search_path is
different than it was when we last saved the plan.

This gets us fairly close to having "transparency" of plan caching, in the
sense that the cached statement acts the same as if you'd just resubmitted
the original query text for another execution.  There are still some corner
cases where this fails though: a new object added in the search path
schema(s) might capture a reference in the query text, but we'd not realize
that and force a reparse.  We might try to fix that in the future, but for
the moment it looks too expensive and complicated.
2013-01-25 14:14:41 -05:00
Andrew Dunstan
1068771abf Use correct output device for Windows prompts.
This ensures that mapping of non-ascii prompts
to the correct code page occurs.

Bug report and original patch from Alexander Law,
reviewed and reworked by Noah Misch.

Backpatch to all live branches.
2013-01-24 16:01:31 -05:00
Alvaro Herrera
74ebba84ae Redefine HEAP_XMAX_IS_LOCKED_ONLY
Tuples marked SELECT FOR UPDATE in a cluster that's later processed by
pg_upgrade would have a different infomask bit pattern than those
produced by 9.3dev; that bit pattern was being seen as "dead" by HEAD
(because they would fail the "is this tuple locked" test, and so the
visibility rules would thing they're updated, even though there's no
HEAP_UPDATED version of them).  In other words, some rows could silently
disappear after pg_upgrade.

With this new definition, those tuples become visible again.

This is breakage resulting from my commit 0ac5ad5134.
2013-01-24 16:10:02 -03:00
Alvaro Herrera
0ac5ad5134 Improve concurrency of foreign key locking
This patch introduces two additional lock modes for tuples: "SELECT FOR
KEY SHARE" and "SELECT FOR NO KEY UPDATE".  These don't block each
other, in contrast with already existing "SELECT FOR SHARE" and "SELECT
FOR UPDATE".  UPDATE commands that do not modify the values stored in
the columns that are part of the key of the tuple now grab a SELECT FOR
NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently
with tuple locks of the FOR KEY SHARE variety.

Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this
means the concurrency improvement applies to them, which is the whole
point of this patch.

The added tuple lock semantics require some rejiggering of the multixact
module, so that the locking level that each transaction is holding can
be stored alongside its Xid.  Also, multixacts now need to persist
across server restarts and crashes, because they can now represent not
only tuple locks, but also tuple updates.  This means we need more
careful tracking of lifetime of pg_multixact SLRU files; since they now
persist longer, we require more infrastructure to figure out when they
can be removed.  pg_upgrade also needs to be careful to copy
pg_multixact files over from the old server to the new, or at least part
of multixact.c state, depending on the versions of the old and new
servers.

Tuple time qualification rules (HeapTupleSatisfies routines) need to be
careful not to consider tuples with the "is multi" infomask bit set as
being only locked; they might need to look up MultiXact values (i.e.
possibly do pg_multixact I/O) to find out the Xid that updated a tuple,
whereas they previously were assured to only use information readily
available from the tuple header.  This is considered acceptable, because
the extra I/O would involve cases that would previously cause some
commands to block waiting for concurrent transactions to finish.

Another important change is the fact that locking tuples that have
previously been updated causes the future versions to be marked as
locked, too; this is essential for correctness of foreign key checks.
This causes additional WAL-logging, also (there was previously a single
WAL record for a locked tuple; now there are as many as updated copies
of the tuple there exist.)

With all this in place, contention related to tuples being checked by
foreign key rules should be much reduced.

As a bonus, the old behavior that a subtransaction grabbing a stronger
tuple lock than the parent (sub)transaction held on a given tuple and
later aborting caused the weaker lock to be lost, has been fixed.

Many new spec files were added for isolation tester framework, to ensure
overall behavior is sane.  There's probably room for several more tests.

There were several reviewers of this patch; in particular, Noah Misch
and Andres Freund spent considerable time in it.  Original idea for the
patch came from Simon Riggs, after a problem report by Joel Jacobson.
Most code is from me, with contributions from Marti Raudsepp, Alexander
Shulgin, Noah Misch and Andres Freund.

This patch was discussed in several pgsql-hackers threads; the most
important start at the following message-ids:
	AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com
	1290721684-sup-3951@alvh.no-ip.org
	1294953201-sup-2099@alvh.no-ip.org
	1320343602-sup-2290@alvh.no-ip.org
	1339690386-sup-8927@alvh.no-ip.org
	4FE5FF020200002500048A3D@gw.wicourts.gov
	4FEAB90A0200002500048B7D@gw.wicourts.gov
2013-01-23 12:04:59 -03:00
Heikki Linnakangas
52906f175a Implement pg_unreachable() on MSVC. 2013-01-23 12:53:55 +02:00
Heikki Linnakangas
990fe3c4ed Fix more issues with cascading replication and timeline switches.
When a standby server follows the master using WAL archive, and it chooses
a new timeline (recovery_target_timeline='latest'), it only fetches the
timeline history file for the chosen target timeline, not any other history
files that might be missing from pg_xlog. For example, if the current
timeline is 2, and we choose 4 as the new recovery target timeline, the
history file for timeline 3 is not fetched, even if it's part of this
server's history. That's enough for the standby itself - the history file
for timeline 4 includes timeline 3 as well - but if a cascading standby
server wants to recover to timeline 3, it needs the history file. To fix,
when a new recovery target timeline is chosen, try to copy any missing
history files from the archive to pg_xlog between the old and new target
timeline.

A second similar issue was with the WAL files. When a standby recovers from
archive, and it reaches a segment that contains a switch to a new timeline,
recovery fetches only the WAL file labelled with the new timeline's ID. The
file from the new timeline contains a copy of the WAL from the old timeline
up to the point where the switch happened, and recovery recovers it from the
new file. But in streaming replication, walsender only tries to read it
from the old timeline's file. To fix, change walsender to read it from the
new file, so that it behaves the same as recovery in that sense, and doesn't
try to open the possibly nonexistent file with the old timeline's ID.
2013-01-23 10:19:20 +02:00
Tom Lane
75b39e7909 Add infrastructure for storing a VARIADIC ANY function's VARIADIC flag.
Originally we didn't bother to mark FuncExprs with any indication whether
VARIADIC had been given in the source text, because there didn't seem to be
any need for it at runtime.  However, because we cannot fold a VARIADIC ANY
function's arguments into an array (since they're not necessarily all the
same type), we do actually need that information at runtime if VARIADIC ANY
functions are to respond unsurprisingly to use of the VARIADIC keyword.
Add the missing field, and also fix ruleutils.c so that VARIADIC ANY
function calls are dumped properly.

Extracted from a larger patch that also fixes concat() and format() (the
only two extant VARIADIC ANY functions) to behave properly when VARIADIC is
specified.  This portion seems appropriate to review and commit separately.

Pavel Stehule
2013-01-21 20:26:15 -05:00
Robert Haas
841a5150c5 Add ddl_command_end support for event triggers.
Dimitri Fontaine, with slight changes by me
2013-01-21 18:00:24 -05:00
Alvaro Herrera
765cbfdc92 Refactor ALTER some-obj RENAME implementation
Remove duplicate implementations of catalog munging and miscellaneous
privilege checks.  Instead rely on already existing data in
objectaddress.c to do the work.

Author: KaiGai Kohei, changes by me
Reviewed by: Robert Haas, Álvaro Herrera, Dimitri Fontaine
2013-01-21 12:06:41 -03:00
Heikki Linnakangas
6f7cddc7ae Now that START_REPLICATION returns the next timeline's ID after reaching end
of timeline, take advantage of that in walreceiver.

Startup process is still in control of choosign the target timeline, by
scanning the timeline history files present in pg_xlog, but walreceiver now
uses the next timeline's ID to fetch its history file immediately after it
has finished streaming the old timeline. Before, the standby would first try
to restart streaming on the old timeline, which fetches the missing timeline
history file as a side-effect, and only then restart from the new timeline.
This patch eliminates the extra iteration, which speeds up the timeline
switch and reduces the noise in the log caused by the extra restart on the
old timeline.
2013-01-18 11:59:34 +02:00
Heikki Linnakangas
2ff6555313 Use the right timeline when beginning to stream from master.
The xlogreader refactoring broke the logic to decide which timeline to start
streaming from. XLogPageRead() uses the timeline history to check which
timeline the requested WAL position falls into. However, after the
refactoring, XLogPageRead() is always first called with the first page in
the segment, to verify the segment header, and only then with the actual WAL
position we're interested in. That first read of the segment's header made
XLogPageRead() to always start streaming from the old timeline containing
the segment header, not the timeline containing the actual record, if there
was a timeline switch within the segment.

I thought I fixed this yesterday, but that fix was too narrow and only fixed
this for the corner-case that the timeline switch happened in the first page
of the segment. To fix this more robustly, pass explicitly the position of
the record we're actually interested in to XLogPageRead, and use that to
decide which timeline to read from, rather than deduce it from the page and
offset.

Per report from Fujii Masao.
2013-01-18 11:46:49 +02:00
Alvaro Herrera
279628a0a7 Accelerate end-of-transaction dropping of relations
When relations are dropped, at end of transaction we need to remove the
files and clean the buffer pool of buffers containing pages of those
relations.  Previously we would scan the buffer pool once per relation
to clean up buffers.  When there are many relations to drop, the
repeated scans make this process slow; so we now instead pass a list of
relations to drop and scan the pool once, checking each buffer against
the passed list.  When the number of relations is larger than a
threshold (which as of this patch is being set to 20 relations) we sort
the array before starting, and bsearch the array; when it's smaller, we
simply scan the array linearly each time, because that's faster.  The
exact optimal threshold value depends on many factors, but the
difference is not likely to be significant enough to justify making it
user-settable.

This has been measured to be a significant win (a 15x win when dropping
100,000 relations; an extreme case, but reportedly a real one).

Author: Tomas Vondra, some tweaks by me
Reviewed by: Robert Haas, Shigeru Hanada, Andres Freund, Álvaro Herrera
2013-01-17 16:13:17 -03:00
Heikki Linnakangas
0b6329130e Make pg_receivexlog and pg_basebackup -X stream work across timeline switches.
This mirrors the changes done earlier to the server in standby mode. When
receivelog reaches the end of a timeline, as reported by the server, it
fetches the timeline history file of the next timeline, and restarts
streaming from the new timeline by issuing a new START_STREAMING command.

When pg_receivexlog crosses a timeline, it leaves the .partial suffix on the
last segment on the old timeline. This helps you to tell apart a partial
segment left in the directory because of a timeline switch, and a completed
segment. If you just follow a single server, it won't make a difference, but
it can be significant in more complicated scenarios where new WAL is still
generated on the old timeline.

This includes two small changes to the streaming replication protocol:
First, when you reach the end of timeline while streaming, the server now
sends the TLI of the next timeline in the server's history to the client.
pg_receivexlog uses that as the next timeline, so that it doesn't need to
parse the timeline history file like a standby server does. Second, when
BASE_BACKUP command sends the begin and end WAL positions, it now also sends
the timeline IDs corresponding the positions.
2013-01-17 20:23:00 +02:00
Heikki Linnakangas
9ee4d06f3f Make GiST indexes on-disk compatible with 9.2 again.
The patch that turned XLogRecPtr into a uint64 inadvertently changed the
on-disk format of GiST indexes, because the NSN field in the GiST page
opaque is an XLogRecPtr. That breaks pg_upgrade. Revert the format of that
field back to the two-field struct that XLogRecPtr was before. This is the
same we did to LSNs in the page header to avoid changing on-disk format.

Bump catversion, as this invalidates any existing GiST indexes built on
9.3devel.
2013-01-17 16:46:16 +02:00
Alvaro Herrera
7fcbf6a405 Split out XLog reading as an independent facility
This new facility can not only be used by xlog.c to carry out crash
recovery, but also by external programs.  By supplying a function to
read XLog pages from somewhere, all the WAL reading can be used for
completely different purposes.

For the standard backend use, the behavior should be pretty much the
same as previously.  As for non-backend programs, an hypothetical
pg_xlogdump program is now closer to reality, but some more backend
support is still necessary.

This patch was originally submitted by Andres Freund in a different
form, but Heikki Linnakangas opted for and authored another design of
the concept.  Andres has advanced the patch since Heikki's initial
version.  Review and some (mostly cosmetics) changes by me.
2013-01-16 16:12:53 -03:00
Alvaro Herrera
7ac5760fa2 Rework order of checks in ALTER / SET SCHEMA
When attempting to move an object into the schema in which it already
was, for most objects classes we were correctly complaining about
exactly that ("object is already in schema"); but for some other object
classes, such as functions, we were instead complaining of a name
collision ("object already exists in schema").  The latter is wrong and
misleading, per complaint from Robert Haas in
CA+TgmoZ0+gNf7RDKRc3u5rHXffP=QjqPZKGxb4BsPz65k7qnHQ@mail.gmail.com

To fix, refactor the way these checks are done.  As a bonus, the
resulting code is smaller and can also share some code with Rename
cases.

While at it, remove use of getObjectDescriptionOids() in error messages.
These are normally disallowed because of translatability considerations,
but this one had slipped through since 9.1.  (Not sure that this is
worth backpatching, though, as it would create some untranslated
messages in back branches.)

This is loosely based on a patch by KaiGai Kohei, heavily reworked by
me.
2013-01-15 13:23:43 -03:00
Tom Lane
2065dd2834 Prevent very-low-probability PANIC during PREPARE TRANSACTION.
The code in PostPrepare_Locks supposed that it could reassign locks to
the prepared transaction's dummy PGPROC by deleting the PROCLOCK table
entries and immediately creating new ones.  This was safe when that code
was written, but since we invented partitioning of the shared lock table,
it's not safe --- another process could steal away the PROCLOCK entry in
the short interval when it's on the freelist.  Then, if we were otherwise
out of shared memory, PostPrepare_Locks would have to PANIC, since it's
too late to back out of the PREPARE at that point.

Fix by inventing a dynahash.c function to atomically update a hashtable
entry's key.  (This might possibly have other uses in future.)

This is an ancient bug that in principle we ought to back-patch, but the
odds of someone hitting it in the field seem really tiny, because (a) the
risk window is small, and (b) nobody runs servers with maxed-out lock
tables for long, because they'll be getting non-PANIC out-of-memory errors
anyway.  So fixing it in HEAD seems sufficient, at least until the new
code has gotten some testing.
2013-01-13 22:20:22 -05:00
Tom Lane
b853eb9718 Improve handling of ereport(ERROR) and elog(ERROR).
In commit 71450d7fd6, we added code to inform
suitably-intelligent compilers that ereport() doesn't return if the elevel
is ERROR or higher.  This patch extends that to elog(), and also fixes a
double-evaluation hazard that the previous commit created in ereport(),
as well as reducing the emitted code size.

The elog() improvement requires the compiler to support __VA_ARGS__, which
should be available in just about anything nowadays since it's required by
C99.  But our minimum language baseline is still C89, so add a configure
test for that.

The previous commit assumed that ereport's elevel could be evaluated twice,
which isn't terribly safe --- there are already counterexamples in xlog.c.
On compilers that have __builtin_constant_p, we can use that to protect the
second test, since there's no possible optimization gain if the compiler
doesn't know the value of elevel.  Otherwise, use a local variable inside
the macros to prevent double evaluation.  The local-variable solution is
inferior because (a) it leads to useless code being emitted when elevel
isn't constant, and (b) it increases the optimization level needed for the
compiler to recognize that subsequent code is unreachable.  But it seems
better than not teaching non-gcc compilers about unreachability at all.

Lastly, if the compiler has __builtin_unreachable(), we can use that
instead of abort(), resulting in a noticeable code savings since no
function call is actually emitted.  However, it seems wise to do this only
in non-assert builds.  In an assert build, continue to use abort(), so that
the behavior will be predictable and debuggable if the "impossible"
happens.

These changes involve making the ereport and elog macros emit do-while
statement blocks not just expressions, which forces small changes in
a few call sites.

Andres Freund, Tom Lane, Heikki Linnakangas
2013-01-13 18:40:09 -05:00
Tom Lane
31f38f28b0 Redesign the planner's handling of index-descent cost estimation.
Historically we've used a couple of very ad-hoc fudge factors to try to
get the right results when indexes of different sizes would satisfy a
query with the same number of index leaf tuples being visited.  In
commit 21a39de580 I tweaked one of these
fudge factors, with results that proved disastrous for larger indexes.
Commit bf01e34b55 fudged it some more,
but still with not a lot of principle behind it.

What seems like a better way to address these issues is to explicitly model
index-descent costs, since that's what's really at stake when considering
diferent indexes with similar leaf-page-level costs.  We tried that once
long ago, and found that charging random_page_cost per page descended
through was way too much, because upper btree levels tend to stay in cache
in real-world workloads.  However, there's still CPU costs to think about,
and the previous fudge factors can be seen as a crude attempt to account
for those costs.  So this patch replaces those fudge factors with explicit
charges for the number of tuple comparisons needed to descend the index
tree, plus a small charge per page touched in the descent.  The cost
multipliers are chosen so that the resulting charges are in the vicinity of
the historical (pre-9.2) fudge factors for indexes of up to about a million
tuples, while not ballooning unreasonably beyond that, as the old fudge
factor did (even more so in 9.2).

To make this work accurately for btree indexes, add some code that allows
extraction of the known root-page height from a btree.  There's no
equivalent number readily available for other index types, but we can use
the log of the number of index pages as an approximate substitute.

This seems like too much of a behavioral change to risk back-patching,
but it should improve matters going forward.  In 9.2 I'll just revert
the fudge-factor change.
2013-01-11 12:56:58 -05:00
Magnus Hagander
940d136661 Centralize single quote escaping in src/port/quotes.c
For code-reuse in upcoming functionality in pg_basebackup.

Zoltan Boszormenyi
2013-01-05 15:40:19 +01:00
Tom Lane
94afbd5831 Invent a "one-shot" variant of CachedPlans for better performance.
SPI_execute() and related functions create a CachedPlan, execute it once,
and immediately discard it, so that the functionality offered by
plancache.c is of no value in this code path.  And performance measurements
show that the extra data copying and invalidation checking done by
plancache.c slows down simple queries by 10% or more compared to 9.1.
However, enough of the SPI code is shared with functions that do need plan
caching that it seems impractical to bypass plancache.c altogether.
Instead, let's invent a variant version of cached plans that preserves
99% of the API but doesn't offer any of the actual functionality, nor the
overhead.  This puts SPI_execute() performance back on par, or maybe even
slightly better, than it was before.  This change should resolve recent
complaints of performance degradation from Dong Ye, Pavel Stehule, and
others.

By avoiding data copying, this change also reduces the amount of memory
needed to execute many-statement SPI_execute() strings, as for instance in
a recent complaint from Tomas Vondra.

An additional benefit of this change is that multi-statement SPI_execute()
query strings are now processed fully serially, that is we complete
execution of earlier statements before running parse analysis and planning
on following ones.  This eliminates a long-standing POLA violation, in that
DDL that affects the behavior of a later statement will now behave as
expected.

Back-patch to 9.2, since this was a performance regression compared to 9.1.
(In 9.2, place the added struct fields so as to avoid changing the offsets
of existing fields.)

Heikki Linnakangas and Tom Lane
2013-01-04 17:42:19 -05:00
Heikki Linnakangas
b0daba57bb Tolerate timeline switches while "pg_basebackup -X fetch" is running.
If you take a base backup from a standby server with "pg_basebackup -X
fetch", and the timeline switches while the backup is being taken, the
backup used to fail with an error "requested WAL segment %s has already
been removed". This is because the server-side code that sends over the
required WAL files would not construct the WAL filename with the correct
timeline after a switch.

Fix that by using readdir() to scan pg_xlog for all the WAL segments in the
range, regardless of timeline.

Also, include all timeline history files in the backup, if taken with
"-X fetch". That fixes another related bug: If a timeline switch happened
just before the backup was initiated in a standby, the WAL segment
containing the initial checkpoint record contains WAL from the older
timeline too. Recovery will not accept that without a timeline history file
that lists the older timeline.

Backpatch to 9.2. Versions prior to that were not affected as you could not
take a base backup from a standby before 9.2.
2013-01-03 19:51:00 +02:00
Magnus Hagander
794397ae1d Move tar function headers to pgtar.h
This makes it possible to include them only where they are used, so
we can avoid the conflict of the uid_t and gid_t datatypes that happened
in plperl (since plperl doesn't need the tar functions)
2013-01-02 20:34:08 +01:00
Alvaro Herrera
dfbba2c86c Make sure MaxBackends is always set
Auxiliary and bootstrap processes weren't getting it, causing initdb to
fail completely.
2013-01-02 14:39:11 -03:00
Alvaro Herrera
cdbc0ca48c Fix background workers for EXEC_BACKEND
Commit da07a1e8 was broken for EXEC_BACKEND because I failed to realize
that the MaxBackends recomputation needed to be duplicated by
subprocesses in SubPostmasterMain.  However, instead of having the value
be recomputed at all, it's better to assign the correct value at
postmaster initialization time, and have it be propagated to exec'ed
backends via BackendParameters.

MaxBackends stays as zero until after modules in
shared_preload_libraries have had a chance to register bgworkers, since
the value is going to be untrustworthy till that's finished.

Heikki Linnakangas and Álvaro Herrera
2013-01-02 12:01:14 -03:00
Bruce Momjian
bd61a623ac Update copyrights for 2013
Fully update git head, and update back branches in ./COPYRIGHT and
legal.sgml files.
2013-01-01 17:15:01 -05:00
Magnus Hagander
f5d4bdd3a5 Unify some tar functionality across different parts
Move some of the tar functionality that existed mostly duplicated
in both pg_dump and the walsender basebackup functionality into
port/tar.c instead, so it can be used from both. It will also be
used by pg_basebackup in the future, which would've caused a third
copy of it around.

Zoltan Boszormenyi and Magnus Hagander
2013-01-01 18:15:57 +01:00
Heikki Linnakangas
60df192aea Keep timeline history files restored from archive in pg_xlog.
The cascading standby patch in 9.2 changed the way WAL files are treated
when restored from the archive. Before, they were restored under a temporary
filename, and not kept in pg_xlog, but after the patch, they were copied
under pg_xlog. This is necessary for a cascading standby to find them, but
it also means that if the archive goes offline and a standby is restarted,
it can recover back to where it was using the files in pg_xlog. It also
means that if you take an offline backup from a standby server, it includes
all the required WAL files in pg_xlog.

However, the same change was not made to timeline history files, so if the
WAL segment containing the checkpoint record contains a timeline switch, you
will still get an error if you try to restart recovery without the archive,
or recover from an offline backup taken from the standby.

With this patch, timeline history files restored from archive are copied
into pg_xlog like WAL files are, so that pg_xlog contains all the files
required to recover. This is a corner-case pre-existing issue in 9.2, but
even more important in master where it's possible for a standby to follow a
timeline switch through streaming replication. To make that possible, the
timeline history files must be present in pg_xlog.
2012-12-30 14:29:45 +02:00
Robert Haas
82b1b213ca Adjust more backend functions to return OID rather than void.
This is again intended to support extensions to the event trigger
functionality.  This may go a bit further than we need for that
purpose, but there's some value in being consistent, and the OID
may be useful for other purposes also.

Dimitri Fontaine
2012-12-29 07:55:37 -05:00
Alvaro Herrera
5ab3af46dd Remove obsolete XLogRecPtr macros
This gets rid of XLByteLT, XLByteLE, XLByteEQ and XLByteAdvance.
These were useful for brevity when XLogRecPtrs were split in
xlogid/xrecoff; but now that they are simple uint64's, they are just
clutter.  The only downside to making this change would be ease of
backporting patches, but that has been negated by other substantive
changes to the involved code anyway.  The clarity of simpler expressions
makes the change worthwhile.

Most of the changes are mechanical, but in a couple of places, the patch
author chose to invert the operator sense, making the code flow more
logical (and more in line with preceding comments).

Author: Andres Freund
Eyeballed by Dimitri Fontaine and Alvaro Herrera
2012-12-28 13:06:15 -03:00
Alvaro Herrera
eaa1f7220a Remove unused NextLogPage macro
Commit 061e7efb1b did away with its last caller, but neglected to remove
the actual definition.

Author: Andres Freund
2012-12-27 18:23:23 -03:00
Simon Riggs
c2b3218064 Update comments on rd_newRelfilenodeSubid.
Ensure comments accurately reflect state of code
given new understanding, and recent changes.
Include example code from Noah Misch to
illustrate how rd_newRelfilenodeSubid can be
reset deterministically. No code changes.
2012-12-24 17:07:06 +00:00
Simon Riggs
42fa810c14 Fix more weird compiler messages caused
by unmatched function prototypes.

Andres Freund
2012-12-24 16:25:26 +00:00
Simon Riggs
2dcb2ebee2 Add function prototype from previous commit. 2012-12-24 09:18:42 +00:00
Robert Haas
c504513f83 Adjust many backend functions to return OID rather than void.
Extracted from a larger patch by Dimitri Fontaine.  It is hoped that
this will provide infrastructure for enriching the new event trigger
functionality, but it seems possibly useful for other purposes as
well.
2012-12-23 18:37:58 -05:00
Tom Lane
31bc839724 Prevent failure when RowExpr or XmlExpr is parse-analyzed twice.
transformExpr() is required to cope with already-transformed expression
trees, for various ugly-but-not-quite-worth-cleaning-up reasons.  However,
some of its newer subroutines hadn't gotten the memo.  This accounts for
bug #7763 from Norbert Buchmuller: transformRowExpr() was overwriting the
previously determined type of a RowExpr during CREATE TABLE LIKE INCLUDING
INDEXES.  Additional investigation showed that transformXmlExpr had the
same kind of problem, but all the other cases seem to be safe.

Andres Freund and Tom Lane
2012-12-23 14:07:24 -05:00
Heikki Linnakangas
d57a97343e Forgot to remove extern declaration of GetRecoveryTargetTLI()
Fujii Masao
2012-12-21 09:29:03 +02:00
Heikki Linnakangas
af275a12df Follow TLI of last replayed record, not recovery target TLI, in walsenders.
Most of the time, the last replayed record comes from the recovery target
timeline, but there is a corner case where it makes a difference. When
the startup process scans for a new timeline, and decides to change recovery
target timeline, there is a window where the recovery target TLI has already
been bumped, but there are no WAL segments from the new timeline in pg_xlog
yet. For example, if we have just replayed up to point 0/30002D8, on
timeline 1, there is a WAL file called 000000010000000000000003 in pg_xlog
that contains the WAL up to that point. When recovery switches recovery
target timeline to 2, a walsender can immediately try to read WAL from
0/30002D8, from timeline 2, so it will try to open WAL file
000000020000000000000003. However, that doesn't exist yet - the startup
process hasn't copied that file from the archive yet nor has the walreceiver
streamed it yet, so walsender fails with error "requested WAL segment
000000020000000000000003 has already been removed". That's harmless, in that
the standby will try to reconnect later and by that time the segment is
already created, but error messages that should be ignored are not good.

To fix that, have walsender track the TLI of the last replayed record,
instead of the recovery target timeline. That way walsender will not try to
read anything from timeline 2, until the WAL segment has been created and at
least one record has been replayed from it. The recovery target timeline is
now xlog.c's internal affair, it doesn't need to be exposed in shared memory
anymore.

This fixes the error reported by Thom Brown. depesz the same error message,
but I'm not sure if this fixes his scenario.
2012-12-20 14:39:04 +02:00
Tom Lane
6919b7e329 Fix failure to ignore leftover temp tables after a server crash.
During crash recovery, we remove disk files belonging to temporary tables,
but the system catalog entries for such tables are intentionally not
cleaned up right away.  Instead, the first backend that uses a temp schema
is expected to clean out any leftover objects therein.  This approach
requires that we be careful to ignore leftover temp tables (since any
actual access attempt would fail), *even if their BackendId matches our
session*, if we have not yet established use of the session's corresponding
temp schema.  That worked fine in the past, but was broken by commit
debcec7dc3 which incorrectly removed the
rd_islocaltemp relcache flag.  Put it back, and undo various changes
that substituted tests like "rel->rd_backend == MyBackendId" for use
of a state-aware flag.  Per trouble report from Heikki Linnakangas.

Back-patch to 9.1 where the erroneous change was made.  In the back
branches, be careful to add rd_islocaltemp in a spot in the struct that
was alignment padding before, so as not to break existing add-on code.
2012-12-17 20:15:32 -05:00
Andrew Dunstan
1c382655ad Provide Assert() for frontend code.
Per discussion on-hackers. psql is converted to use the new code.

Follows a suggestion from Heikki Linnakangas.
2012-12-14 18:03:07 -05:00
Heikki Linnakangas
abfd192b1b Allow a streaming replication standby to follow a timeline switch.
Before this patch, streaming replication would refuse to start replicating
if the timeline in the primary doesn't exactly match the standby. The
situation where it doesn't match is when you have a master, and two
standbys, and you promote one of the standbys to become new master.
Promoting bumps up the timeline ID, and after that bump, the other standby
would refuse to continue.

There's significantly more timeline related logic in streaming replication
now. First of all, when a standby connects to primary, it will ask the
primary for any timeline history files that are missing from the standby.
The missing files are sent using a new replication command TIMELINE_HISTORY,
and stored in standby's pg_xlog directory. Using the timeline history files,
the standby can follow the latest timeline present in the primary
(recovery_target_timeline='latest'), just as it can follow new timelines
appearing in an archive directory.

START_REPLICATION now takes a TIMELINE parameter, to specify exactly which
timeline to stream WAL from. This allows the standby to request the primary
to send over WAL that precedes the promotion. The replication protocol is
changed slightly (in a backwards-compatible way although there's little hope
of streaming replication working across major versions anyway), to allow
replication to stop when the end of timeline reached, putting the walsender
back into accepting a replication command.

Many thanks to Amit Kapila for testing and reviewing various versions of
this patch.
2012-12-13 19:17:32 +02:00
Heikki Linnakangas
527668717a Make xlog_internal.h includable in frontend context.
This makes unnecessary the ugly hack used to #include postgres.h in
pg_basebackup.

Based on Alvaro Herrera's patch
2012-12-13 14:59:13 +02:00
Kevin Grittner
b19e4250b4 Fix performance problems with autovacuum truncation in busy workloads.
In situations where there are over 8MB of empty pages at the end of
a table, the truncation work for trailing empty pages takes longer
than deadlock_timeout, and there is frequent access to the table by
processes other than autovacuum, there was a problem with the
autovacuum worker process being canceled by the deadlock checking
code. The truncation work done by autovacuum up that point was
lost, and the attempt tried again by a later autovacuum worker. The
attempts could continue indefinitely without making progress,
consuming resources and blocking other processes for up to
deadlock_timeout each time.

This patch has the autovacuum worker checking whether it is
blocking any other thread at 20ms intervals. If such a condition
develops, the autovacuum worker will persist the work it has done
so far, release its lock on the table, and sleep in 50ms intervals
for up to 5 seconds, hoping to be able to re-acquire the lock and
try again. If it is unable to get the lock in that time, it moves
on and a worker will try to continue later from the point this one
left off.

While this patch doesn't change the rules about when and what to
truncate, it does cause the truncation to occur sooner, with less
blocking, and with the consumption of fewer resources when there is
contention for the table's lock.

The only user-visible change other than improved performance is
that the table size during truncation may change incrementally
instead of just once.

This problem exists in all supported versions but is infrequently
reported, although some reports of performance problems when
autovacuum runs might be caused by this. Initial commit is just the
master branch, but this should probably be backpatched once the
build farm and general developer usage confirm that there are no
surprising effects.

Jan Wieck
2012-12-11 14:33:08 -06:00
Tom Lane
a99c42f291 Support automatically-updatable views.
This patch makes "simple" views automatically updatable, without the need
to create either INSTEAD OF triggers or INSTEAD rules.  "Simple" views
are those classified as updatable according to SQL-92 rules.  The rewriter
transforms INSERT/UPDATE/DELETE commands on such views directly into an
equivalent command on the underlying table, which will generally have
noticeably better performance than is possible with either triggers or
user-written rules.  A view that has INSTEAD OF triggers or INSTEAD rules
continues to operate the same as before.

For the moment, security_barrier views are not considered simple.
Also, we do not support WITH CHECK OPTION.  These features may be
added in future.

Dean Rasheed, reviewed by Amit Kapila
2012-12-08 18:26:21 -05:00
Alvaro Herrera
da07a1e856 Background worker processes
Background workers are postmaster subprocesses that run arbitrary
user-specified code.  They can request shared memory access as well as
backend database connections; or they can just use plain libpq frontend
database connections.

Modules listed in shared_preload_libraries can register background
workers in their _PG_init() function; this is early enough that it's not
necessary to provide an extra GUC option, because the necessary extra
resources can be allocated early on.  Modules can install more than one
bgworker, if necessary.

Care is taken that these extra processes do not interfere with other
postmaster tasks: only one such process is started on each ServerLoop
iteration.  This means a large number of them could be waiting to be
started up and postmaster is still able to quickly service external
connection requests.  Also, shutdown sequence should not be impacted by
a worker process that's reasonably well behaved (i.e. promptly responds
to termination signals.)

The current implementation lets worker processes specify their start
time, i.e. at what point in the server startup process they are to be
started: right after postmaster start (in which case they mustn't ask
for shared memory access), when consistent state has been reached
(useful during recovery in a HOT standby server), or when recovery has
terminated (i.e. when normal backends are allowed).

In case of a bgworker crash, actions to take depend on registration
data: if shared memory was requested, then all other connections are
taken down (as well as other bgworkers), just like it were a regular
backend crashing.  The bgworker itself is restarted, too, within a
configurable timeframe (which can be configured to be never).

More features to add to this framework can be imagined without much
effort, and have been discussed, but this seems good enough as a useful
unit already.

An elementary sample module is supplied.

Author: Álvaro Herrera

This patch is loosely based on prior patches submitted by KaiGai Kohei,
and unsubmitted code by Simon Riggs.

Reviewed by: KaiGai Kohei, Markus Wanner, Andres Freund,
Heikki Linnakangas, Simon Riggs, Amit Kapila
2012-12-06 17:47:30 -03:00
Heikki Linnakangas
32f4de0adf Write exact xlog position of timeline switch in the timeline history file.
This allows us to do some more rigorous sanity checking for various
incorrect point-in-time recovery scenarios, and provides more information
for debugging purposes. It will also come handy in the upcoming patch to
allow timeline switches to be replicated by streaming replication.
2012-12-04 17:29:07 +02:00
Heikki Linnakangas
5ce108bf32 Track the timeline associated with minRecoveryPoint, for more sanity checks.
This allows recovery to notice certain incorrect recovery scenarios.
If a server has recovered to point X on timeline 5, and you restart
recovery, it better be on timeline 5 when it reaches point X again, not on
some timeline with a higher ID. This can happen e.g if you a standby server
is shut down, a new timeline appears in the WAL archive, and the standby
server is restarted. It will try to follow the new timeline, which is wrong
because some WAL on the old timeline was already replayed before shutdown.

Requires an initdb (or at least pg_resetxlog), because this adds a field to
the control file.
2012-12-04 11:31:00 +02:00
Peter Eisentraut
aa2fec0a18 Add support for LDAP URLs
Allow specifying LDAP authentication parameters as RFC 4516 LDAP URLs.
2012-12-03 23:31:02 -05:00
Simon Riggs
f21bb9cfb5 Refactor inCommit flag into generic delayChkpt flag.
Rename PGXACT->inCommit flag into delayChkpt flag,
and generalise comments to allow use in other situations,
such as the forthcoming potential use in checksum patch.
Replace wait loop to look for VXIDs with delayChkpt set.
No user visible changes, not behaviour changes at present.

Simon Riggs, reviewed and rebased by Jeff Davis
2012-12-03 13:13:53 +00:00
Simon Riggs
5457a130d3 Reduce scope of changes for COPY FREEZE.
Allow support only for freezing tuples by explicit
command. Previous coding mistakenly extended
slightly beyond what was agreed as correct on -hackers.
So essentially a partial revoke of earlier work,
leaving just the COPY FREEZE command.
2012-12-02 20:52:52 +00:00
Tom Lane
3114cb60a1 Don't advance checkPoint.nextXid near the end of a checkpoint sequence.
This reverts commit c11130690d in favor of
actually fixing the problem: namely, that we should never have been
modifying the checkpoint record's nextXid at this point to begin with.
The nextXid should match the state as of the checkpoint's logical WAL
position (ie the redo point), not the state as of its physical position.
It's especially bogus to advance it in some wal_levels and not others.
In any case there is no need for the checkpoint record to carry the
same nextXid shown in the XLOG_RUNNING_XACTS record just emitted by
LogStandbySnapshot, as any replay operation will already have adopted
that value as current.

This fixes bug #7710 from Tarvi Pillessaar, and probably also explains bug
#6291 from Daniel Farina, in that if a checkpoint were in progress at the
instant of XID wraparound, the epoch bump would be lost as reported.
(And, of course, these days there's at least a 50-50 chance of a checkpoint
being in progress at any given instant.)

Diagnosed by me and independently by Andres Freund.  Back-patch to all
branches supporting hot standby.
2012-12-02 15:20:41 -05:00
Simon Riggs
5c11725867 Rearrange storage of data in xl_running_xacts.
Previously we stored all xids mixed together.
Now we store top-level xids first, followed
by all subxids. Also skip logging any subxids
if the snapshot is suboverflowed, since there
are potentially large numbers of them and they
are not useful in that case anyway. Has value
in the envisaged design for decoding of WAL.
No planned effect on Hot Standby.

Andres Freund, reviewed by me
2012-12-02 19:39:37 +00:00
Tom Lane
7b90469b71 Allow adding values to an enum type created in the current transaction.
Normally it is unsafe to allow ALTER TYPE ADD VALUE in a transaction block,
because instances of the value could be added to indexes later in the same
transaction, and then they would still be accessible even if the
transaction rolls back.  However, we can allow this if the enum type itself
was created in the current transaction, because then any such indexes would
have to go away entirely on rollback.

The reason for allowing this is to support pg_upgrade's new usage of
pg_restore --single-transaction: in --binary-upgrade mode, pg_dump emits
enum types as a succession of ALTER TYPE ADD VALUE commands so that it can
preserve the values' OIDs.  The support is a bit limited, so we'll leave
it undocumented.

Andres Freund
2012-12-01 14:27:30 -05:00
Simon Riggs
8de72b66a2 COPY FREEZE and mark committed on fresh tables.
When a relfilenode is created in this subtransaction or
a committed child transaction and it cannot otherwise
be seen by our own process, mark tuples committed ahead
of transaction commit for all COPY commands in same
transaction. If FREEZE specified on COPY
and pre-conditions met then rows will also be frozen.
Both options designed to avoid revisiting rows after commit,
increasing performance of subsequent commands after
data load and upgrade. pg_restore changes later.

Simon Riggs, review comments from Heikki Linnakangas, Noah Misch and design
input from Tom Lane, Robert Haas and Kevin Grittner
2012-12-01 12:54:20 +00:00
Tom Lane
4af446e7cd Produce a more useful error message for over-length Unix socket paths.
The length of a socket path name is constrained by the size of struct
sockaddr_un, and there's not a lot we can do about it since that is a
kernel API.  However, it would be a good thing if we produced an
intelligible error message when the user specifies a socket path that's too
long --- and getaddrinfo's standard API is too impoverished to do this in
the natural way.  So insert explicit tests at the places where we construct
a socket path name.  Now you'll get an error that makes sense and even
tells you what the limit is, rather than something generic like
"Non-recoverable failure in name resolution".

Per trouble report from Jeremy Drake and a fix idea from Andrew Dunstan.
2012-11-29 19:57:01 -05:00
Simon Riggs
f1e57a4ec9 Cleanup VirtualXact at end of Hot Standby. 2012-11-29 21:59:11 +00:00
Robert Haas
7a2fe9bd03 Basic binary heap implementation.
There are probably other places where this can be used, but for now,
this just makes MergeAppend use it, so that this code will have test
coverage.  There is other work in the queue that will use this, as
well.

Abhijit Menon-Sen, reviewed by Andres Freund, Robert Haas, Álvaro
Herrera, Tom Lane, and others.
2012-11-29 11:16:59 -05:00
Tom Lane
3c84046490 Fix assorted bugs in CREATE/DROP INDEX CONCURRENTLY.
Commit 8cb53654db, which introduced DROP
INDEX CONCURRENTLY, managed to break CREATE INDEX CONCURRENTLY via a poor
choice of catalog state representation.  The pg_index state for an index
that's reached the final pre-drop stage was the same as the state for an
index just created by CREATE INDEX CONCURRENTLY.  This meant that the
(necessary) change to make RelationGetIndexList ignore about-to-die indexes
also made it ignore freshly-created indexes; which is catastrophic because
the latter do need to be considered in HOT-safety decisions.  Failure to
do so leads to incorrect index entries and subsequently wrong results from
queries depending on the concurrently-created index.

To fix, add an additional boolean column "indislive" to pg_index, so that
the freshly-created and about-to-die states can be distinguished.  (This
change obviously is only possible in HEAD.  This patch will need to be
back-patched, but in 9.2 we'll use a kluge consisting of overloading the
formerly-impossible state of indisvalid = true and indisready = false.)

In addition, change CREATE/DROP INDEX CONCURRENTLY so that the pg_index
flag changes they make without exclusive lock on the index are made via
heap_inplace_update() rather than a normal transactional update.  The
latter is not very safe because moving the pg_index tuple could result in
concurrent SnapshotNow scans finding it twice or not at all, thus possibly
resulting in index corruption.  This is a pre-existing bug in CREATE INDEX
CONCURRENTLY, which was copied into the DROP code.

In addition, fix various places in the code that ought to check to make
sure that the indexes they are manipulating are valid and/or ready as
appropriate.  These represent bugs that have existed since 8.2, since
a failed CREATE INDEX CONCURRENTLY could leave a corrupt or invalid
index behind, and we ought not try to do anything that might fail with
such an index.

Also fix RelationReloadIndexInfo to ensure it copies all the pg_index
columns that are allowed to change after initial creation.  Previously we
could have been left with stale values of some fields in an index relcache
entry.  It's not clear whether this actually had any user-visible
consequences, but it's at least a bug waiting to happen.

In addition, do some code and docs review for DROP INDEX CONCURRENTLY;
some cosmetic code cleanup but mostly addition and revision of comments.

This will need to be back-patched, but in a noticeably different form,
so I'm committing it to HEAD before working on the back-patch.

Problem reported by Amit Kapila, diagnosis by Pavan Deolassee,
fix by Tom Lane and Andres Freund.
2012-11-28 21:26:01 -05:00
Alvaro Herrera
1577b46b7c Split out rmgr rm_desc functions into their own files
This is necessary (but not sufficient) to have them compilable outside
of a backend environment.
2012-11-28 13:01:15 -03:00
Tom Lane
e78d288c89 Add explicit casts in ilist.h's inline functions.
Needed to silence C++ errors, per report from Peter Eisentraut.

Andres Freund
2012-11-27 10:58:37 -05:00
Heikki Linnakangas
1f67078ea3 Add OpenTransientFile, with automatic cleanup at end-of-xact.
Files opened with BasicOpenFile or PathNameOpenFile are not automatically
cleaned up on error. That puts unnecessary burden on callers that only want
to keep the file open for a short time. There is AllocateFile, but that
returns a buffered FILE * stream, which in many cases is not the nicest API
to work with. So add function called OpenTransientFile, which returns a
unbuffered fd that's cleaned up like the FILE* returned by AllocateFile().

This plugs a few rare fd leaks in error cases:

1. copy_file() - fixed by by using OpenTransientFile instead of BasicOpenFile
2. XLogFileInit() - fixed by adding close() calls to the error cases. Can't
   use OpenTransientFile here because the fd is supposed to persist over
   transaction boundaries.
3. lo_import/lo_export - fixed by using OpenTransientFile instead of
   PathNameOpenFile.

In addition to plugging those leaks, this replaces many BasicOpenFile() calls
with OpenTransientFile() that were not leaking, because the code meticulously
closed the file on error. That wasn't strictly necessary, but IMHO it's good
for robustness.

The same leaks exist in older versions, but given the rarity of the issues,
I'm not backpatching this. Not yet, anyway - it might be good to backpatch
later, after this mechanism has had some more testing in master branch.
2012-11-27 10:25:50 +02:00
Tom Lane
532994299e Revert patch for taking fewer snapshots.
This reverts commit d573e239f0, "Take fewer
snapshots".  While that seemed like a good idea at the time, it caused
execution to use a snapshot that had been acquired before locking any of
the tables mentioned in the query.  This created user-visible anomalies
that were not present in any prior release of Postgres, as reported by
Tomas Vondra.  While this whole area could do with a redesign (since there
are related cases that have anomalies anyway), it doesn't seem likely that
any future patch would be reasonably back-patchable; and we don't want 9.2
to exhibit a behavior that's subtly unlike either past or future releases.
Hence, revert to prior code while we rethink the problem.
2012-11-26 15:55:43 -05:00
Tom Lane
d3237e04ca Fix SELECT DISTINCT with index-optimized MIN/MAX on inheritance trees.
In a query such as "SELECT DISTINCT min(x) FROM tab", the DISTINCT is
pretty useless (there being only one output row), but nonetheless it
shouldn't fail.  But it could fail if "tab" is an inheritance parent,
because planagg.c's code for fixing up equivalence classes after making the
index-optimized MIN/MAX transformation wasn't prepared to find child-table
versions of the aggregate expression.  The least ugly fix seems to be
to add an option to mutate_eclass_expressions() to skip child-table
equivalence class members, which aren't used anymore at this stage of
planning so it's not really necessary to fix them.  Since child members
are ignored in many cases already, it seems plausible for
mutate_eclass_expressions() to have an option to ignore them too.

Per bug #7703 from Maxim Boguk.

Back-patch to 9.1.  Although the same code exists before that, it cannot
encounter child-table aggregates AFAICS, because the index optimization
transformation cannot succeed on inheritance trees before 9.1 (for lack
of MergeAppend).
2012-11-26 12:57:58 -05:00
Heikki Linnakangas
644a0a6379 Fix archive_cleanup_command.
When I moved ExecuteRecoveryCommand() from xlog.c to xlogarchive.c, I didn't
realize that it's called from the checkpoint process, not the startup
process. I tried to use InRedo variable to decide whether or not to attempt
cleaning up the archive (must not do so before we have read the initial
checkpoint record), but that variable is only valid within the startup
process.

Instead, let ExecuteRecoveryCommand() always clean up the archive, and add
an explicit argument to RestoreArchivedFile() to say whether that's allowed
or not. The caller knows better.

Reported by Erik Rijkers, diagnosis by Fujii Masao. Only 9.3devel is
affected.
2012-11-19 10:14:20 +02:00
Tom Lane
3bbf668de9 Fix multiple problems in WAL replay.
Most of the replay functions for WAL record types that modify more than
one page failed to ensure that those pages were locked correctly to ensure
that concurrent queries could not see inconsistent page states.  This is
a hangover from coding decisions made long before Hot Standby was added,
when it was hardly necessary to acquire buffer locks during WAL replay
at all, let alone hold them for carefully-chosen periods.

The key problem was that RestoreBkpBlocks was written to hold lock on each
page restored from a full-page image for only as long as it took to update
that page.  This was guaranteed to break any WAL replay function in which
there was any update-ordering constraint between pages, because even if the
nominal order of the pages is the right one, any mixture of full-page and
non-full-page updates in the same record would result in out-of-order
updates.  Moreover, it wouldn't work for situations where there's a
requirement to maintain lock on one page while updating another.  Failure
to honor an update ordering constraint in this way is thought to be the
cause of bug #7648 from Daniel Farina: what seems to have happened there
is that a btree page being split was rewritten from a full-page image
before the new right sibling page was written, and because lock on the
original page was not maintained it was possible for hot standby queries to
try to traverse the page's right-link to the not-yet-existing sibling page.

To fix, get rid of RestoreBkpBlocks as such, and instead create a new
function RestoreBackupBlock that restores just one full-page image at a
time.  This function can be invoked by WAL replay functions at the points
where they would otherwise perform non-full-page updates; in this way, the
physical order of page updates remains the same no matter which pages are
replaced by full-page images.  We can then further adjust the logic in
individual replay functions if it is necessary to hold buffer locks
for overlapping periods.  A side benefit is that we can simplify the
handling of concurrency conflict resolution by moving that code into the
record-type-specfic functions; there's no more need to contort the code
layout to keep conflict resolution in front of the RestoreBkpBlocks call.

In connection with that, standardize on zero-based numbering rather than
one-based numbering for referencing the full-page images.  In HEAD, I
removed the macros XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4.  They are
still there in the header files in previous branches, but are no longer
used by the code.

In addition, fix some other bugs identified in the course of making these
changes:

spgRedoAddNode could fail to update the parent downlink at all, if the
parent tuple is in the same page as either the old or new split tuple and
we're not doing a full-page image: it would get fooled by the LSN having
been advanced already.  This would result in permanent index corruption,
not just transient failure of concurrent queries.

Also, ginHeapTupleFastInsert's "merge lists" case failed to mark the old
tail page as a candidate for a full-page image; in the worst case this
could result in torn-page corruption.

heap_xlog_freeze() was inconsistent about using a cleanup lock or plain
exclusive lock: it did the former in the normal path but the latter for a
full-page image.  A plain exclusive lock seems sufficient, so change to
that.

Also, remove gistRedoPageDeleteRecord(), which has been dead code since
VACUUM FULL was rewritten.

Back-patch to 9.0, where hot standby was introduced.  Note however that 9.0
had a significantly different WAL-logging scheme for GIST index updates,
and it doesn't appear possible to make that scheme safe for concurrent hot
standby queries, because it can leave inconsistent states in the index even
between WAL records.  Given the lack of complaints from the field, we won't
work too hard on fixing that branch.
2012-11-12 22:05:53 -05:00
Heikki Linnakangas
dbdf9679d7 Use correct text domain for translating errcontext() messages.
errcontext() is typically used in an error context callback function, not
within an ereport() invocation like e.g errmsg and errdetail are. That means
that the message domain that the TEXTDOMAIN magic in ereport() determines
is not the right one for the errcontext() calls. The message domain needs to
be determined by the C file containing the errcontext() call, not the file
containing the ereport() call.

Fix by turning errcontext() into a macro that passes the TEXTDOMAIN to use
for the errcontext message. "errcontext" was used in a few places as a
variable or struct field name, I had to rename those out of the way, now
that errcontext is a macro.

We've had this problem all along, but this isn't doesn't seem worth
backporting. It's a fairly minor issue, and turning errcontext from a
function to a macro requires at least a recompile of any external code that
calls errcontext().
2012-11-12 17:07:29 +02:00
Heikki Linnakangas
c9d44a75d4 Silence "expression result unused" warnings in AssertVariableIsOfTypeMacro
At least clang 3.1 generates those warnings. Prepend (void) to avoid them,
like we have in AssertMacro.
2012-11-12 15:02:40 +02:00
Tom Lane
3e7fdcffd6 Fix WaitLatch() to return promptly when the requested timeout expires.
If the sleep is interrupted by a signal, we must recompute the remaining
time to wait; otherwise, a steady stream of non-wait-terminating interrupts
could delay return from WaitLatch indefinitely.  This has been shown to be
a problem for the autovacuum launcher, and there may well be other places
now or in the future with similar issues.  So we'd better make the function
robust, even though this'll add at least one gettimeofday call per wait.

Back-patch to 9.2.  We might eventually need to fix 9.1 as well, but the
code is quite different there, and the usage of WaitLatch in 9.1 is so
limited that it's not clearly important to do so.

Reported and diagnosed by Jeff Janes, though I rewrote his patch rather
heavily.
2012-11-08 20:04:48 -05:00
Tom Lane
dcc55dd21a Rename ResolveNew() to ReplaceVarsFromTargetList(), and tweak its API.
This function currently lacks the option to throw error if the provided
targetlist doesn't have any matching entry for a Var to be replaced.
Two of the four existing call sites would be better off with an error,
as would the usage in the pending auto-updatable-views patch, so it seems
past time to extend the API to support that.  To do so, replace the "event"
parameter (historically of type CmdType, though it was declared plain int)
with a special-purpose enum type.

It's unclear whether this function might be called by third-party code.
Since many C compilers wouldn't warn about a call site continuing to use
the old calling convention, rename the function to forcibly break any
such code that hasn't been updated.  The old name was none too well chosen
anyhow.
2012-11-08 16:52:49 -05:00
Bruce Momjian
aa69670e42 Add URLs to document why DLLIMPORT is needed on Windows.
Per email from Craig Ringer
2012-11-07 15:01:25 -05:00
Heikki Linnakangas
add6c3179a Make the streaming replication protocol messages architecture-independent.
We used to send structs wrapped in CopyData messages, which works as long as
the client and server agree on things like endianess, timestamp format and
alignment. That's good enough for running a standby server, which has to run
on the same platform anyway, but it's useful for tools like pg_receivexlog
to work across platforms.

This breaks protocol compatibility of streaming replication, but we never
promised that to be compatible across versions, anyway.
2012-11-07 19:09:13 +02:00
Tom Lane
5ed6546cf7 Fix handling of inherited check constraints in ALTER COLUMN TYPE.
This case got broken in 8.4 by the addition of an error check that
complains if ALTER TABLE ONLY is used on a table that has children.
We do use ONLY for this situation, but it's okay because the necessary
recursion occurs at a higher level.  So we need to have a separate
flag to suppress recursion without making the error check.

Reported and patched by Pavan Deolasee, with some editorial adjustments by
me.  Back-patch to 8.4, since this is a regression of functionality that
worked in earlier branches.
2012-11-05 13:36:16 -05:00
Alvaro Herrera
04f28bdb84 Fix ALTER EXTENSION / SET SCHEMA
In its original conception, it was leaving some objects into the old
schema, but without their proper pg_depend entries; this meant that the
old schema could be dropped, causing future pg_dump calls to fail on the
affected database.  This was originally reported by Jeff Frost as #6704;
there have been other complaints elsewhere that can probably be traced
to this bug.

To fix, be more consistent about altering a table's subsidiary objects
along the table itself; this requires some restructuring in how tables
are relocated when altering an extension -- hence the new
AlterTableNamespaceInternal routine which encapsulates it for both the
ALTER TABLE and the ALTER EXTENSION cases.

There was another bug lurking here, which was unmasked after fixing the
previous one: certain objects would be reached twice via the dependency
graph, and the second attempt to move them would cause the entire
operation to fail.  Per discussion, it seems the best fix for this is to
do more careful tracking of objects already moved: we now maintain a
list of moved objects, to avoid attempting to do it twice for the same
object.

Authors: Alvaro Herrera, Dimitri Fontaine
Reviewed by Tom Lane
2012-10-31 10:52:55 -03:00
Kevin Grittner
6868ed7491 Throw error if expiring tuple is again updated or deleted.
This prevents surprising behavior when a FOR EACH ROW trigger
BEFORE UPDATE or BEFORE DELETE directly or indirectly updates or
deletes the the old row.  Prior to this patch the requested action
on the row could be silently ignored while all triggered actions
based on the occurence of the requested action could be committed.
One example of how this could happen is if the BEFORE DELETE
trigger for a "parent" row deleted "children" which had trigger
functions to update summary or status data on the parent.

This also prevents similar surprising problems if the query has a
volatile function which updates a target row while it is already
being updated.

There are related issues present in FOR UPDATE cursors and READ
COMMITTED queries which are not handled by this patch.  These
issues need further evalution to determine what change, if any, is
needed.

Where the new error messages are generated, in most cases the best
fix will be to move code from the BEFORE trigger to an AFTER
trigger.  Where this is not feasible, the trigger can avoid the
error by re-issuing the triggering statement and returning NULL.

Documentation changes will be submitted in a separate patch.

Kevin Grittner and Tom Lane with input from Florian Pflug and
Robert Haas, based on problems encountered during conversion of
Wisconsin Circuit Court trigger logic to plpgsql triggers.
2012-10-26 14:55:36 -05:00
Tom Lane
a4e8680a6c When converting a table to a view, remove its system columns.
Views should not have any pg_attribute entries for system columns.
However, we forgot to remove such entries when converting a table to a
view.  This could lead to crashes later on, if someone attempted to
reference such a column, as reported by Kohei KaiGai.

Patch in HEAD only.  This bug has been there forever, but in the back
branches we will have to defend against existing mis-converted views,
so it doesn't seem worthwhile to change the conversion code too.
2012-10-24 13:39:37 -04:00
Alvaro Herrera
f4c4335a4a Add context info to OAT_POST_CREATE security hook
... and have sepgsql use it to determine whether to check permissions
during certain operations.  Indexes that are being created as a result
of REINDEX, for instance, do not need to have their permissions checked;
they were already checked when the index was created.

Author: KaiGai Kohei, slightly revised by me
2012-10-23 18:24:24 -03:00
Tom Lane
dc5aeca168 Remove unnecessary "head" arguments from some dlist/slist functions.
dlist_delete, dlist_insert_after, dlist_insert_before, slist_insert_after
do not need access to the list header, and indeed insisting on that negates
one of the main advantages of a doubly-linked list.

In consequence, revert addition of "cache_bucket" field to CatCTup.
2012-10-18 19:04:20 -04:00
Tom Lane
8f8d746478 Code review for inline-list patch.
Make foreach macros less syntactically dangerous, and fix some typos in
evidently-never-tested ones.  Add missing slist_next_node and
slist_head_node functions.  Fix broken dlist_check code.  Assorted comment
improvements.
2012-10-18 16:47:07 -04:00
Tom Lane
72a4231f0c Fix planning of non-strict equivalence clauses above outer joins.
If a potential equivalence clause references a variable from the nullable
side of an outer join, the planner needs to take care that derived clauses
are not pushed to below the outer join; else they may use the wrong value
for the variable.  (The problem arises only with non-strict clauses, since
if an upper clause can be proven strict then the outer join will get
simplified to a plain join.)  The planner attempted to prevent this type
of error by checking that potential equivalence clauses aren't
outerjoin-delayed as a whole, but actually we have to check each side
separately, since the two sides of the clause will get moved around
separately if it's treated as an equivalence.  Bugs of this type can be
demonstrated as far back as 7.4, even though releases before 8.3 had only
a very ad-hoc notion of equivalence clauses.

In addition, we neglected to account for the possibility that such clauses
might have nonempty nullable_relids even when not outerjoin-delayed; so the
equivalence-class machinery lacked logic to compute correct nullable_relids
values for clauses it constructs.  This oversight was harmless before 9.2
because we were only using RestrictInfo.nullable_relids for OR clauses;
but as of 9.2 it could result in pushing constructed equivalence clauses
to incorrect places.  (This accounts for bug #7604 from Bill MacArthur.)

Fix the first problem by adding a new test check_equivalence_delay() in
distribute_qual_to_rels, and fix the second one by adding code in
equivclass.c and called functions to set correct nullable_relids for
generated clauses.  Although I believe the second part of this is not
currently necessary before 9.2, I chose to back-patch it anyway, partly to
keep the logic similar across branches and partly because it seems possible
we might find other reasons why we need valid values of nullable_relids in
the older branches.

Add regression tests illustrating these problems.  In 9.0 and up, also
add test cases checking that we can push constants through outer joins,
since we've broken that optimization before and I nearly broke it again
with an overly simplistic patch for this problem.
2012-10-18 12:30:10 -04:00
Tom Lane
ff3f9c8de5 Close un-owned SMgrRelations at transaction end.
If an SMgrRelation is not "owned" by a relcache entry, don't allow it to
live past transaction end.  This design allows the same SMgrRelation to be
used for blind writes of multiple blocks during a transaction, but ensures
that we don't hold onto such an SMgrRelation indefinitely.  Because an
SMgrRelation typically corresponds to open file descriptors at the fd.c
level, leaving it open when there's no corresponding relcache entry can
mean that we prevent the kernel from reclaiming deleted disk space.
(While CacheInvalidateSmgr messages usually fix that, there are cases
where they're not issued, such as DROP DATABASE.  We might want to add
some more sinval messaging for that, but I'd be inclined to keep this
type of logic anyway, since allowing VFDs to accumulate indefinitely
for blind-written relations doesn't seem like a good idea.)

This code replaces a previous attempt towards the same goal that proved
to be unreliable.  Back-patch to 9.1 where the previous patch was added.
2012-10-17 12:38:21 -04:00
Tom Lane
9bacf0e373 Revert "Use "transient" files for blind writes, take 2".
This reverts commit fba105b109.
That approach had problems with the smgr-level state not tracking what
we really want to happen, and with the VFD-level state not tracking the
smgr-level state very well either.  In consequence, it was still possible
to hold kernel file descriptors open for long-gone tables (as in recent
report from Tore Halset), and yet there were also cases of FDs being closed
undesirably soon.  A replacement implementation will follow.
2012-10-17 12:37:08 -04:00
Alvaro Herrera
a66ee69add Embedded list interface
Provide a common implementation of embedded singly-linked and
doubly-linked lists.  "Embedded" in the sense that the nodes'
next/previous pointers exist within some larger struct; this design
choice reduces memory allocation overhead.

Most of the implementation uses inlineable functions (where supported),
for performance.

Some existing uses of both types of lists have been converted to the new
code, for demonstration purposes.  Other uses can (and probably will) be
converted in the future.  Since dllist.c is unused after this conversion,
it has been removed.

Author: Andres Freund
Some tweaks by me
Reviewed by Tom Lane, Peter Geoghegan
2012-10-17 11:31:20 -03:00
Heikki Linnakangas
ff6c78c480 Remove comment that is no longer true.
AddToDataDirLockFile() supports out-of-order updates of the lockfile
nowadays.
2012-10-15 11:03:39 +03:00
Tom Lane
e81e8f9342 Split up process latch initialization for more-fail-soft behavior.
In the previous coding, new backend processes would attempt to create their
self-pipe during the OwnLatch call in InitProcess.  However, pipe creation
could fail if the kernel is short of resources; and the system does not
recover gracefully from a FATAL error right there, since we have armed the
dead-man switch for this process and not yet set up the on_shmem_exit
callback that would disarm it.  The postmaster then forces an unnecessary
database-wide crash and restart, as reported by Sean Chittenden.

There are various ways we could rearrange the code to fix this, but the
simplest and sanest seems to be to split out creation of the self-pipe into
a new function InitializeLatchSupport, which must be called from a place
where failure is allowed.  For most processes that gets called in
InitProcess or InitAuxiliaryProcess, but processes that don't call either
but still use latches need their own calls.

Back-patch to 9.1, which has only a part of the latch logic that 9.2 and
HEAD have, but nonetheless includes this bug.
2012-10-14 22:59:56 -04:00
Tom Lane
a29f7ed554 Get rid of COERCE_DONTCARE.
We don't need this hack any more.
2012-10-12 13:35:00 -04:00
Tom Lane
71e58dcfb9 Make equal() ignore CoercionForm fields for better planning with casts.
This change ensures that the planner will see implicit and explicit casts
as equivalent for all purposes, except in the minority of cases where
there's actually a semantic difference (as reflected by having a 3-argument
cast function).  In particular, this fixes cases where the EquivalenceClass
machinery failed to consider two references to a varchar column as
equivalent if one was implicitly cast to text but the other was explicitly
cast to text, as seen in bug #7598 from Vaclav Juza.  We have had similar
bugs before in other parts of the planner, so I think it's time to fix this
problem at the core instead of continuing to band-aid around it.

Remove set_coercionform_dontcare(), which represents the band-aid
previously in use for allowing matching of index and constraint expressions
with inconsistent cast labeling.  (We can probably get rid of
COERCE_DONTCARE altogether, but I don't think removing that enum value in
back branches would be wise; it's possible there's third party code
referring to it.)

Back-patch to 9.2.  We could go back further, and might want to once this
has been tested more; but for the moment I won't risk destabilizing plan
choices in long-since-stable branches.
2012-10-12 12:11:22 -04:00
Heikki Linnakangas
6f60fdd701 Improve replication connection timeouts.
Rename replication_timeout to wal_sender_timeout, and add a new setting
called wal_receiver_timeout that does the same at the walreceiver side.
There was previously no timeout in walreceiver, so if the network went down,
for example, the walreceiver could take a long time to notice that the
connection was lost. Now with the two settings, both sides of a replication
connection will detect a broken connection similarly.

It is no longer necessary to manually set wal_receiver_status_interval to
a value smaller than the timeout. Both wal sender and receiver now
automatically send a "ping" message if more than 1/2 of the configured
timeout has elapsed, and it hasn't received any messages from the other end.

Amit Kapila, heavily edited by me.
2012-10-11 17:48:08 +03:00
Tom Lane
a80889a735 Set procost to 10 for each of the pg_foo_is_visible() functions.
The idea here is to make sure the planner will evaluate these functions
last not first among the filter conditions in psql pattern search and
tab-completion queries.  We've discussed this several times, and there
was consensus to do it back in August, but we didn't want to do it just
before a release.  Now seems like a safer time.

No catversion bump, since this catalog change doesn't create a backend
incompatibility nor any regression test result changes.
2012-10-10 12:19:25 -04:00
Tom Lane
7e0cce0265 Remove unnecessary overhead in backend's large-object operations.
Do read/write permissions checks at most once per large object descriptor,
not once per lo_read or lo_write call as before.  The repeated tests were
quite useless in the read case since the snapshot-based tests were
guaranteed to produce the same answer every time.  In the write case,
the extra tests could in principle detect revocation of write privileges
after a series of writes has started --- but there's a race condition there
anyway, since we'd check privileges before performing and certainly before
committing the write.  So there's no real advantage to checking every
single time, and we might as well redefine it as "only check the first
time".

On the same reasoning, remove the LargeObjectExists checks in inv_write
and inv_truncate.  We already checked existence when the descriptor was
opened, and checking again doesn't provide any real increment of safety
that would justify the cost.
2012-10-09 16:38:00 -04:00
Alvaro Herrera
f46baf601d Rename USE_INLINE to PG_USE_INLINE
The former name was too likely to conflict with symbols from external
headers; and, as seen in recent buildfarm failures in member spoonbill,
it has now happened at least in plpython.
2012-10-09 11:17:33 -03:00
Tom Lane
26fe56481c Code review for 64-bit-large-object patch.
Fix broken-on-bigendian-machines byte-swapping functions, add missed update
of alternate regression expected file, improve error reporting, remove some
unnecessary code, sync testlo64.c with current testlo.c (it seems to have
been cloned from a very old copy of that), assorted cosmetic improvements.
2012-10-08 18:24:32 -04:00
Alvaro Herrera
976fa10d20 Add support for easily declaring static inline functions
We already had those, but they forced modules to spell out the function
bodies twice.  Eliminate some duplicates we had already grown.

Extracted from a somewhat larger patch from Andres Freund.
2012-10-08 16:28:01 -03:00
Robert Haas
08c8058ce9 Add #define for UUIDOID.
Phil Sorber and Thom Brown. Reviewed by Albe Laurenz.
2012-10-08 10:15:15 -04:00
Tom Lane
95d035e66d Autoconfiscate selection of 64-bit int type for 64-bit large object API.
Get rid of the fundamentally indefensible assumption that "long long int"
exists and is exactly 64 bits wide on every platform Postgres runs on.
Instead let the configure script select the type to use for "pg_int64".

This is a bit of a pain in the rear since we do not want to pollute client
namespace with all the random symbols that pg_config.h defines; instead
we have to create a separate generated header file, "pg_config_ext.h".
But now that the infrastructure is there, we might have the ability to
add some other stuff that's long been wanting in this area.
2012-10-07 21:52:43 -04:00
Tatsuo Ishii
7e2f8ed2b0 Fix compiling errors on Windows platform. Fix wrong usage of
INT64CONST macro. Fix lo_hton64 and lo_ntoh64 not to use int32_t and
uint32_t.
2012-10-07 23:30:31 +09:00
Tatsuo Ishii
b51a65f5bf Bump up catalog vesion due to 64-bit large object API functions
addition.
2012-10-07 09:36:20 +09:00
Tatsuo Ishii
461ef73f09 Add API for 64-bit large object access. Now users can access up to
4TB large objects (standard 8KB BLCKSZ case).  For this purpose new
libpq API lo_lseek64, lo_tell64 and lo_truncate64 are added.  Also
corresponding new backend functions lo_lseek64, lo_tell64 and
lo_truncate64 are added. inv_api.c is changed to handle 64-bit
offsets.

Patch contributed by Nozomi Anzai (backend side) and Yugo Nagata
(frontend side, docs, regression tests and example program). Reviewed
by Kohei Kaigai. Committed by Tatsuo Ishii with minor editings.
2012-10-07 08:36:48 +09:00
Heikki Linnakangas
fd5942c18f Use the regular main processing loop also in walsenders.
The regular backend's main loop handles signal handling and error recovery
better than the current WAL sender command loop does. For example, if the
client hangs and a SIGTERM is received before starting streaming, the
walsender will now terminate immediately, rather than hang until the
connection times out.
2012-10-05 17:21:12 +03:00
Tom Lane
fb34e94d21 Support CREATE SCHEMA IF NOT EXISTS.
Per discussion, schema-element subcommands are not allowed together with
this option, since it's not very obvious what should happen to the element
objects.

Fabrízio de Royes Mello
2012-10-03 19:47:11 -04:00
Alvaro Herrera
994c36e01d refactor ALTER some-obj SET OWNER implementation
Remove duplicate implementation of catalog munging and miscellaneous
privilege and consistency checks.  Instead rely on already existing data
in objectaddress.c to do the work.

Author: KaiGai Kohei
Tweaked by me
Reviewed by Robert Haas
2012-10-03 18:07:46 -03:00
Alvaro Herrera
2164f9a125 Refactor "ALTER some-obj SET SCHEMA" implementation
Instead of having each object type implement the catalog munging
independently, centralize knowledge about how to do it and expand the
existing table in objectaddress.c with enough data about each object
type to support this operation.

Author: KaiGai Kohei
Tweaks by me
Reviewed by Robert Haas
2012-10-02 18:13:54 -03:00
Heikki Linnakangas
d5497b95f3 Split off functions related to timeline history files and XLOG archiving.
This is just refactoring, to make the functions accessible outside xlog.c.
A followup patch will make use of that, to allow fetching timeline history
files over streaming replication.
2012-10-02 13:37:19 +03:00
Tom Lane
0d0aa5d291 Provide some static-assertion functionality on all compilers.
On reflection (especially after noticing how many buildfarm critters have
__builtin_types_compatible_p but not _Static_assert), it seems like we
ought to try a bit harder to make these macros do something everywhere.
The initial cut at it would have been no help to code that is compiled only
on platforms without _Static_assert, for instance; and in any case not all
our contributors do their initial coding on the latest gcc version.

Some googling about static assertions turns up quite a bit of prior art
for making it work in compilers that lack _Static_assert.  The method
that seems closest to our needs involves defining a struct with a bit-field
that has negative width if the assertion condition fails.  There seems no
reliable way to get the error message string to be output, but throwing a
compile error with a confusing message is better than missing the problem
altogether.

In the same spirit, if we don't have __builtin_types_compatible_p we can at
least insist that the variable have the same width as the type.  This won't
catch errors such as "wrong pointer type", but it's far better than
nothing.

In addition to changing the macro definitions, adjust a
compile-time-constant Assert in contrib/hstore to use StaticAssertStmt,
so we can get some buildfarm coverage on whether that macro behaves sanely
or not.  There's surely more places that could be converted, but this is
the first one I came across.
2012-09-30 22:46:29 -04:00
Tom Lane
ea473fb2de Add infrastructure for compile-time assertions about variable types.
Currently, the macros only work with fairly recent gcc versions, but there
is room to expand them to other compilers that have comparable features.

Heavily revised and autoconfiscated version of a patch by Andres Freund.
2012-09-30 14:38:31 -04:00
Andrew Dunstan
6e9876dc32 Remove checks for now long outdated compilers. 2012-09-28 19:43:50 -04:00
Tom Lane
70bc583319 Fix btmarkpos/btrestrpos to handle array keys.
This fixes another error in commit 9e8da0f757.
I neglected to make the mark/restore functionality save and restore the
current set of array key values, which led to strange behavior if an
IndexScan with ScalarArrayOpExpr quals was used as the inner side of a
mergejoin.  Per bug #7570 from Melese Tesfaye.
2012-09-27 17:01:02 -04:00
Heikki Linnakangas
2a0c81a12c Add support for include_dir in config file.
This allows easily splitting configuration into many files, deployed in a
directory.

Magnus Hagander, Greg Smith, Selena Deckelmann, reviewed by Noah Misch.
2012-09-24 18:07:53 +03:00
Tom Lane
31510194cc Minor corrections for ALTER TYPE ADD VALUE IF NOT EXISTS patch.
Produce a NOTICE when the label already exists, for consistency with other
CREATE IF NOT EXISTS commands.  Also, fix the code so it produces something
more user-friendly than an index violation when the label already exists.
This not incidentally enables making a regression test that the previous
patch didn't make for fear of exposing an unpredictable OID in the results.
Also some wordsmithing on the documentation.
2012-09-22 18:35:22 -04:00
Andrew Dunstan
6d12b68cd7 Allow IF NOT EXISTS when add a new enum label.
If the label is already in the enum the statement becomes a no-op.
This will reduce the pain that comes from our not allowing this
operation inside a transaction block.

Andrew Dunstan, reviewed by Tom Lane and Magnus Hagander.
2012-09-22 12:53:31 -04:00
Tom Lane
11e131854f Improve ruleutils.c's heuristics for dealing with rangetable aliases.
The previous scheme had bugs in some corner cases involving tables that had
been renamed since a view was made.  This could result in dumped views that
failed to reload or reloaded incorrectly, as seen in bug #7553 from Lloyd
Albin, as well as in some pgsql-hackers discussion back in January.  Also,
its behavior for printing EXPLAIN plans was sometimes confusing because of
willingness to use the same alias for multiple RTEs (it was Ashutosh
Bapat's complaint about that aspect that started the January thread).

To fix, ensure that each RTE in the query has a unique unqualified alias,
by modifying the alias if necessary (we add "_" and digits as needed to
create a non-conflicting name).  Then we can just print its variables with
that alias, avoiding the confusing and bug-prone scheme of sometimes
schema-qualifying variable names.  In EXPLAIN, it proves to be expedient to
take the further step of only assigning such aliases to RTEs that are
actually referenced in the query, since the planner has a habit of
generating extra RTEs with the same alias in situations such as
inheritance-tree expansion.

Although this fixes a bug of very long standing, I'm hesitant to back-patch
such a noticeable behavioral change.  My experiments while creating a
regression test convinced me that actually incorrect output (as opposed to
confusing output) occurs only in very narrow cases, which is backed up by
the lack of previous complaints from the field.  So we may be better off
living with it in released branches; and in any case it'd be smart to let
this ripen awhile in HEAD before we consider back-patching it.
2012-09-21 19:03:10 -04:00
Heikki Linnakangas
7c45e3a3c6 Parse pg_ident.conf when it's loaded, keeping it in memory in parsed format.
Similar changes were done to pg_hba.conf earlier already, this commit makes
pg_ident.conf to behave the same as pg_hba.conf.

This has two user-visible effects. First, if pg_ident.conf contains multiple
errors, the whole file is parsed at postmaster startup time and all the
errors are immediately reported. Before this patch, the file was parsed and
the errors were reported only when someone tries to connect using an
authentication method that uses the file, and the parsing stopped on first
error. Second, if you SIGHUP to reload the config files, and the new
pg_ident.conf file contains an error, the error is logged but the old file
stays in effect.

Also, regular expressions in pg_ident.conf are now compiled only once when
the file is loaded, rather than every time the a user is authenticated. That
should speed up authentication if you have a lot of regexps in the file.

Amit Kapila
2012-09-21 17:54:39 +03:00
Alvaro Herrera
22c734fcdb Remove execdesc.h inclusion from tcopprot.h 2012-09-20 11:07:59 -03:00
Tom Lane
9a93e71008 Fix a couple other leftover uses of 'conisonly' terminology. 2012-09-12 15:12:24 -04:00
Tom Lane
46c508fbcf Fix PARAM_EXEC assignment mechanism to be safe in the presence of WITH.
The planner previously assumed that parameter Vars having the same absolute
query level, varno, and varattno could safely be assigned the same runtime
PARAM_EXEC slot, even though they might be different Vars appearing in
different subqueries.  This was (probably) safe before the introduction of
CTEs, but the lazy-evalution mechanism used for CTEs means that a CTE can
be executed during execution of some other subquery, causing the lifespan
of Params at the same syntactic nesting level as the CTE to overlap with
use of the same slots inside the CTE.  In 9.1 we created additional hazards
by using the same parameter-assignment technology for nestloop inner scan
parameters, but it was broken before that, as illustrated by the added
regression test.

To fix, restructure the planner's management of PlannerParamItems so that
items having different semantic lifespans are kept rigorously separated.
This will probably result in complex queries using more runtime PARAM_EXEC
slots than before, but the slots are cheap enough that this hardly matters.
Also, stop generating PlannerParamItems containing Params for subquery
outputs: all we really need to do is reserve the PARAM_EXEC slot number,
and that now only takes incrementing a counter.  The planning code is
simpler and probably faster than before, as well as being more correct.

Per report from Vik Reykja.

These changes will mostly also need to be made in the back branches, but
I'm going to hold off on that until after 9.2.0 wraps.
2012-09-05 12:55:01 -04:00
Alvaro Herrera
e20a90e188 Trim spgist_private.h inclusion
It doesn't really need rel.h; relcache.h is enough.
2012-09-05 11:06:51 -03:00
Heikki Linnakangas
c4c227477b Fix bugs in cascading replication with recovery_target_timeline='latest'
The cascading replication code assumed that the current RecoveryTargetTLI
never changes, but that's not true with recovery_target_timeline='latest'.
The obvious upshot of that is that RecoveryTargetTLI in shared memory needs
to be protected by a lock. A less obvious consequence is that when a
cascading standby is connected, and the standby switches to a new target
timeline after scanning the archive, it will continue to stream WAL to the
cascading standby, but from a wrong file, ie. the file of the previous
timeline. For example, if the standby is currently streaming from the middle
of file 000000010000000000000005, and the timeline changes, the standby
will continue to stream from that file. However, the WAL on the new
timeline is in file 000000020000000000000005, so the standby sends garbage
from 000000010000000000000005 to the cascading standby, instead of the
correct WAL from file 000000020000000000000005.

This also fixes a related bug where a partial WAL segment is restored from
the archive and streamed to a cascading standby. The code assumed that when
a WAL segment is copied from the archive, it can immediately be fully
streamed to a cascading standby. However, if the segment is only partially
filled, ie. has the right size, but only N first bytes contain valid WAL,
that's not safe. That can happen if a partial WAL segment is manually copied
to the archive, or if a partial WAL segment is archived because a server is
started up on a new timeline within that segment. The cascading standby will
get confused if the WAL it received is not valid, and will get stuck until
it's restarted. This patch fixes that problem by not allowing WAL restored
from the archive to be streamed to a cascading standby until it's been
replayed, and thus validated.
2012-09-04 19:33:21 -07:00
Bruce Momjian
015722fb36 Fix to_date() and to_timestamp() to allow specification of the day of
the week via ISO or Gregorian designations.  The fix is to store the
day-of-week consistently as 1-7, Sunday = 1.

Fixes bug reported by Marc Munro
2012-09-03 22:52:44 -04:00
Tom Lane
6d2c8c0e2a Drop cheap-startup-cost paths during add_path() if we don't need them.
We can detect whether the planner top level is going to care at all about
cheap startup cost (it will only do so if query_planner's tuple_fraction
argument is greater than zero).  If it isn't, we might as well discard
paths immediately whose only advantage over others is cheap startup cost.
This turns out to get rid of quite a lot of paths in complex queries ---
I saw planner runtime reduction of more than a third on one large query.

Since add_path isn't currently passed the PlannerInfo "root", the easiest
way to tell it whether to do this was to add a bool flag to RelOptInfo.
That's a bit redundant, since all relations in a given query level will
have the same setting.  But in the future it's possible that we'd refine
the control decision to work on a per-relation basis, so this seems like
a good arrangement anyway.

Per my suggestion of a few months ago.
2012-09-01 18:16:24 -04:00
Tom Lane
58a031f920 Make configure probe for mbstowcs_l as well as wcstombs_l.
We previously supposed that any given platform would supply both or neither
of these functions, so that one configure test would be sufficient.  It now
appears that at least on AIX this is not the case ... which is likely an
AIX bug, but nonetheless we need to cope with it.  So use separate tests.
Per bug #6758; thanks to Andrew Hastie for doing the followup testing
needed to confirm what was happening.

Backpatch to 9.1, where we began using these functions.
2012-08-31 14:17:56 -04:00
Alvaro Herrera
c219d9b0a5 Split tuple struct defs from htup.h to htup_details.h
This reduces unnecessary exposure of other headers through htup.h, which
is very widely included by many files.

I have chosen to move the function prototypes to the new file as well,
because that means htup.h no longer needs to include tupdesc.h.  In
itself this doesn't have much effect in indirect inclusion of tupdesc.h
throughout the tree, because it's also required by execnodes.h; but it's
something to explore in the future, and it seemed best to do the htup.h
change now while I'm busy with it.
2012-08-30 16:52:35 -04:00
Tom Lane
77387f0ac8 Suppress creation of backwardly-indexed paths for LATERAL join clauses.
Given a query such as

SELECT * FROM foo JOIN LATERAL (SELECT foo.var1) ss(x) ON ss.x = foo.var2

the existence of the join clause "ss.x = foo.var2" encourages indxpath.c to
build a parameterized path for foo using any index available for foo.var2.
This is completely useless activity, though, since foo has got to be on the
outside not the inside of any nestloop join with ss.  It's reasonably
inexpensive to add tests that prevent creation of such paths, so let's do
that.
2012-08-30 14:33:00 -04:00
Tom Lane
e83bb10d6d Adjust definition of cheapest_total_path to work better with LATERAL.
In the initial cut at LATERAL, I kept the rule that cheapest_total_path
was always unparameterized, which meant it had to be NULL if the relation
has no unparameterized paths.  It turns out to work much more nicely if
we always have *some* path nominated as cheapest-total for each relation.
In particular, let's still say it's the cheapest unparameterized path if
there is one; if not, take the cheapest-total-cost path among those of
the minimum available parameterization.  (The first rule is actually
a special case of the second.)

This allows reversion of some temporary lobotomizations I'd put in place.
In particular, the planner can now consider hash and merge joins for
joins below a parameter-supplying nestloop, even if there aren't any
unparameterized paths available.  This should bring planning of
LATERAL-containing queries to the same level as queries not using that
feature.

Along the way, simplify management of parameterized paths in add_path()
and friends.  In the original coding for parameterized paths in 9.2,
I tried to minimize the logic changes in add_path(), so it just treated
parameterization as yet another dimension of comparison for paths.
We later made it ignore pathkeys (sort ordering) of parameterized paths,
on the grounds that ordering isn't a useful property for the path on the
inside of a nestloop, so we might as well get rid of useless parameterized
paths as quickly as possible.  But we didn't take that reasoning as far as
we should have.  Startup cost isn't a useful property inside a nestloop
either, so add_path() ought to discount startup cost of parameterized paths
as well.  Having done that, the secondary sorting I'd implemented (in
add_parameterized_path) is no longer needed --- any parameterized path that
survives add_path() at all is worth considering at higher levels.  So this
should be a bit faster as well as simpler.
2012-08-29 22:06:07 -04:00
Alvaro Herrera
21c09e99dc Split heapam_xlog.h from heapam.h
The heapam XLog functions are used by other modules, not all of which
are interested in the rest of the heapam API.  With this, we let them
get just the XLog stuff in which they are interested and not pollute
them with unrelated includes.

Also, since heapam.h no longer requires xlog.h, many files that do
include heapam.h no longer get xlog.h automatically, including a few
headers.  This is useful because heapam.h is getting pulled in by
execnodes.h, which is in turn included by a lot of files.
2012-08-28 19:02:00 -04:00
Alvaro Herrera
fda0594fc2 remove catcache.h from syscache.h
Instead, place a forward struct declaration for struct catclist in
syscache.h.  This reduces header proliferation somewhat.
2012-08-28 18:36:39 -04:00
Alvaro Herrera
45326c5a11 Split resowner.h
This lets files that are mere users of ResourceOwner not automatically
include the headers for stuff that is managed by the resowner mechanism.
2012-08-28 18:02:07 -04:00
Alvaro Herrera
095e6c5a7d syncrep.h must include xlogdefs.h 2012-08-28 09:46:08 -04:00
Heikki Linnakangas
918eee0c49 Collect and use histograms of lower and upper bounds for range types.
This enables selectivity estimation of the <<, >>, &<, &> and && operators,
as well as the normal inequality operators: <, <=, >=, >. "range @> element"
is also supported, but the range-variant @> and <@ operators are not,
because they cannot be sensibly estimated with lower and upper bound
histograms alone. We would need to make some assumption about the lengths of
the ranges for that. Alexander's patch included a separate histogram of
lengths for that, but I left that out of the patch for simplicity. Hopefully
that will be added as a followup patch.

The fraction of empty ranges is also calculated and used in estimation.

Alexander Korotkov, heavily modified by me.
2012-08-27 15:58:46 +03:00
Tom Lane
9ff79b9d4e Fix up planner infrastructure to support LATERAL properly.
This patch takes care of a number of problems having to do with failure
to choose valid join orders and incorrect handling of lateral references
pulled up from subqueries.  Notable changes:

* Add a LateralJoinInfo data structure similar to SpecialJoinInfo, to
represent join ordering constraints created by lateral references.
(I first considered extending the SpecialJoinInfo structure, but the
semantics are different enough that a separate data structure seems
better.)  Extend join_is_legal() and related functions to prevent trying
to form unworkable joins, and to ensure that we will consider joins that
satisfy lateral references even if the joins would be clauseless.

* Fill in the infrastructure needed for the last few types of relation scan
paths to support parameterization.  We'd have wanted this eventually
anyway, but it is necessary now because a relation that gets pulled up out
of a UNION ALL subquery may acquire a reltargetlist containing lateral
references, meaning that its paths *have* to be parameterized whether or
not we have any code that can push join quals down into the scan.

* Compute data about lateral references early in query_planner(), and save
in RelOptInfo nodes, to avoid repetitive calculations later.

* Assorted corner-case bug fixes.

There's probably still some bugs left, but this is a lot closer to being
real than it was before.
2012-08-26 22:50:23 -04:00
Peter Eisentraut
5c45d2f878 Mark DateTimeParseError() noreturn
This avoids a warning from clang 3.2 about an uninitialized variable
'dtype' in date_in().
2012-08-21 23:32:58 -04:00
Peter Eisentraut
71450d7fd6 Teach compiler that ereport(>=ERROR) does not return
When elevel >= ERROR, we add an abort() call to the ereport() macro to
give the compiler a hint that the ereport() expansion will not return,
but the abort() isn't actually reached because the longjmp happens in
errfinish().

Because the effect of ereport() varies with the elevel, we cannot use
standard compiler attributes such as noreturn for this.
2012-08-21 00:03:32 -04:00
Tom Lane
092d7ded29 Allow OLD and NEW in multi-row VALUES within rules.
Now that we have LATERAL, it's fairly painless to allow this case, which
was left as a TODO in the original multi-row VALUES implementation.
2012-08-19 14:12:16 -04:00
Tom Lane
470d0b9789 Check LIBXML_VERSION instead of testing in configure script.
We had put a test for libxml2's xmlStructuredErrorContext variable in
configure, but of course that doesn't work on Windows builds.  The next
best alternative seems to be to test the LIBXML_VERSION symbol provided
by xmlversion.h.

Per report from Talha Bin Rizwan, though this fixes it in a different way
than his proposed patch.
2012-08-17 00:05:26 -04:00
Heikki Linnakangas
317dd55a9c Add SP-GiST support for range types.
The implementation is a quad-tree, largely copied from the quad-tree
implementation for points. The lower and upper bound of ranges are the 2d
coordinates, with some extra code to handle empty ranges.

I left out the support for adjacent operator, -|-, from the original patch.
Not because there was necessarily anything wrong with it, but it was more
complicated than the other operators, and I only have limited time for
reviewing. That will follow as a separate patch.

Alexander Korotkov, reviewed by Jeff Davis and me.
2012-08-16 14:30:45 +03:00
Heikki Linnakangas
89911b3ab8 Fix GiST buffering build bug, which caused "failed to re-find parent" errors.
We use a hash table to track the parents of inner pages, but when inserting
to a leaf page, the caller of gistbufferinginserttuples() must pass a
correct block number of the leaf's parent page. Before gistProcessItup()
descends to a child page, it checks if the downlink needs to be adjusted to
accommodate the new tuple, and updates the downlink if necessary. However,
updating the downlink might require splitting the page, which might move the
downlink to a page to the right. gistProcessItup() doesn't realize that, so
when it descends to the leaf page, it might pass an out-of-date parent block
number as a result. Fix that by returning the block a tuple was inserted to
from gistbufferinginserttuples().

This fixes the bug reported by Zdeněk Jílovec.
2012-08-16 12:56:24 +03:00
Tom Lane
c1774d2c81 More fixes for planner's handling of LATERAL.
Re-allow subquery pullup for LATERAL subqueries, except when the subquery
is below an outer join and contains lateral references to relations outside
that outer join.  If we pull up in such a case, we risk introducing lateral
cross-references into outer joins' ON quals, which is something the code is
entirely unprepared to cope with right now; and I'm not sure it'll ever be
worth coping with.

Support lateral refs in VALUES (this seems to be the only additional path
type that needs such support as a consequence of re-allowing subquery
pullup).

Put in a slightly hacky fix for joinpath.c's refusal to consider
parameterized join paths even when there cannot be any unparameterized
ones.  This was causing "could not devise a query plan for the given query"
failures in queries involving more than two FROM items.

Put in an even more hacky fix for distribute_qual_to_rels() being unhappy
with join quals that contain references to rels outside their syntactic
scope; which is to say, disable that test altogether.  Need to think about
how to preserve some sort of debugging cross-check here, while not
expending more cycles than befits a debugging cross-check.
2012-08-12 16:01:26 -04:00
Tom Lane
b53800355f Fix dependencies generated during ALTER TABLE ADD CONSTRAINT USING INDEX.
This command generated new pg_depend entries linking the index to the
constraint and the constraint to the table, which match the entries made
when a unique or primary key constraint is built de novo.  However, it did
not bother to get rid of the entries linking the index directly to the
table.  We had considered the issue when the ADD CONSTRAINT USING INDEX
patch was written, and concluded that we didn't need to get rid of the
extra entries.  But this is wrong: ALTER COLUMN TYPE wasn't expecting such
redundant dependencies to exist, as reported by Hubert Depesz Lubaczewski.
On reflection it seems rather likely to break other things as well, since
there are many bits of code that crawl pg_depend for one purpose or
another, and most of them are pretty naive about what relationships they're
expecting to find.  Fortunately it's not that hard to get rid of the extra
dependency entries, so let's do that.

Back-patch to 9.1, where ALTER TABLE ADD CONSTRAINT USING INDEX was added.
2012-08-11 12:51:24 -04:00
Tom Lane
c9b0cbe98b Support having multiple Unix-domain sockets per postmaster.
Replace unix_socket_directory with unix_socket_directories, which is a list
of socket directories, and adjust postmaster's code to allow zero or more
Unix-domain sockets to be created.

This is mostly a straightforward change, but since the Unix sockets ought
to be created after the TCP/IP sockets for safety reasons (better chance
of detecting a port number conflict), AddToDataDirLockFile needs to be
fixed to support out-of-order updates of data directory lockfile lines.
That's a change that had been foreseen to be necessary someday anyway.

Honza Horak, reviewed and revised by Tom Lane
2012-08-10 17:27:15 -04:00
Tom Lane
eaccfded98 Centralize the logic for detecting misplaced aggregates, window funcs, etc.
Formerly we relied on checking after-the-fact to see if an expression
contained aggregates, window functions, or sub-selects when it shouldn't.
This is grotty, easily forgotten (indeed, we had forgotten to teach
DefineIndex about rejecting window functions), and none too efficient
since it requires extra traversals of the parse tree.  To improve matters,
define an enum type that classifies all SQL sub-expressions, store it in
ParseState to show what kind of expression we are currently parsing, and
make transformAggregateCall, transformWindowFuncCall, and transformSubLink
check the expression type and throw error if the type indicates the
construct is disallowed.  This allows removal of a large number of ad-hoc
checks scattered around the code base.  The enum type is sufficiently
fine-grained that we can still produce error messages of at least the
same specificity as before.

Bringing these error checks together revealed that we'd been none too
consistent about phrasing of the error messages, so standardize the wording
a bit.

Also, rewrite checking of aggregate arguments so that it requires only one
traversal of the arguments, rather than up to three as before.

In passing, clean up some more comments left over from add_missing_from
support, and annotate some tests that I think are dead code now that that's
gone.  (I didn't risk actually removing said dead code, though.)
2012-08-10 11:36:15 -04:00
Simon Riggs
da4efa13d8 Turn off WalSender keepalives by default, users can enable if desired 2012-08-09 17:07:03 +01:00
Simon Riggs
87d8bd7c9f Ensure all replication message info is available and correct via WalRcv 2012-08-09 17:03:59 +01:00
Tom Lane
f630157496 Merge parser's p_relnamespace and p_varnamespace lists into a single list.
Now that we are storing structs in these lists, the distinction between
the two lists can be represented with a couple of extra flags while using
only a single list.  This simplifies the code and should save a little
bit of palloc traffic, since the majority of RTEs are represented in both
lists anyway.
2012-08-08 16:41:31 -04:00
Tom Lane
5ebaaa4944 Implement SQL-standard LATERAL subqueries.
This patch implements the standard syntax of LATERAL attached to a
sub-SELECT in FROM, and also allows LATERAL attached to a function in FROM,
since set-returning function calls are expected to be one of the principal
use-cases.

The main change here is a rewrite of the mechanism for keeping track of
which relations are visible for column references while the FROM clause is
being scanned.  The parser "namespace" lists are no longer lists of bare
RTEs, but are lists of ParseNamespaceItem structs, which carry an RTE
pointer as well as some visibility-controlling flags.  Aside from
supporting LATERAL correctly, this lets us get rid of the ancient hacks
that required rechecking subqueries and JOIN/ON and function-in-FROM
expressions for invalid references after they were initially parsed.
Invalid column references are now always correctly detected on sight.

In passing, remove assorted parser error checks that are now dead code by
virtue of our having gotten rid of add_missing_from, as well as some
comments that are obsolete for the same reason.  (It was mainly
add_missing_from that caused so much fudging here in the first place.)

The planner support for this feature is very minimal, and will be improved
in future patches.  It works well enough for testing purposes, though.

catversion bump forced due to new field in RangeTblEntry.
2012-08-07 19:02:54 -04:00
Tom Lane
962e0cc71e Fix race conditions associated with SPGiST redirection tuples.
The correct test for whether a redirection tuple is removable is whether
tuple's xid < RecentGlobalXmin, not OldestXmin; the previous coding
failed to protect index searches being done in concurrent transactions that
have no XID.  This mirrors the recent fix in btree's page recycling logic
made in commit d3abbbebe5.

Also, WAL-log the newest XID of any removed redirection tuple on an index
page, and apply ResolveRecoveryConflictWithSnapshot during InHotStandby WAL
replay.  This protects against concurrent Hot Standby transactions possibly
needing to see the redirection tuple(s).

Per my query of 2012-03-12 and subsequent discussion.
2012-08-02 15:34:14 -04:00
Tom Lane
f6ce81f55a Fix WITH attached to a nested set operation (UNION/INTERSECT/EXCEPT).
Parse analysis neglected to cover the case of a WITH clause attached to an
intermediate-level set operation; it only handled WITH at the top level
or WITH attached to a leaf-level SELECT.  Per report from Adam Mackler.

In HEAD, I rearranged the order of SelectStmt's fields to put withClause
with the other fields that can appear on non-leaf SelectStmts.  In back
branches, leave it alone to avoid a possible ABI break for third-party
code.

Back-patch to 8.4 where WITH support was added.
2012-07-31 17:56:21 -04:00
Tom Lane
31c7c642b6 Account for SRFs in targetlists in planner rowcount estimates.
We made use of the ROWS estimate for set-returning functions used in FROM,
but not for those used in SELECT targetlists; which is a bit of an
oversight considering there are common usages that require the latter
approach.  Improve that.  (I had initially thought it might be worth
folding this into cost_qual_eval, but after investigation concluded that
that wouldn't be very helpful, so just do it separately.)  Per complaint
from David Johnston.

Back-patch to 9.2, but not further, for fear of destabilizing plan choices
in existing releases.
2012-07-21 17:45:07 -04:00
Alvaro Herrera
f5bcd398ad connoinherit may be true only for CHECK constraints
The code was setting it true for other constraints, which is
bogus.  Doing so caused bogus catalog entries for such constraints, and
in particular caused an error to be raised when trying to drop a
constraint of types other than CHECK from a table that has children,
such as reported in bug #6712.

In 9.2, additionally ignore connoinherit=true for other constraint
types, to avoid having to force initdb; existing databases might already
contain bogus catalog entries.

Includes a catversion bump (in HEAD only).

Bug report from Miroslav Šulc
Analysis from Amit Kapila and Noah Misch; Amit also contributed the patch.
2012-07-20 14:08:07 -04:00
Tom Lane
8e617e29aa Fix whole-row Var evaluation to cope with resjunk columns (again).
When a whole-row Var is reading the result of a subquery, we need it to
ignore any "resjunk" columns that the subquery might have evaluated for
GROUP BY or ORDER BY purposes.  We've hacked this area before, in commit
68e40998d0, but that fix only covered
whole-row Vars of named composite types, not those of RECORD type; and it
was mighty klugy anyway, since it just assumed without checking that any
extra columns in the result must be resjunk.  A proper fix requires getting
hold of the subquery's targetlist so we can actually see which columns are
resjunk (whereupon we can use a JunkFilter to get rid of them).  So bite
the bullet and add some infrastructure to make that possible.

Per report from Andrew Dunstan and additional testing by Merlin Moncure.
Back-patch to all supported branches.  In 8.3, also back-patch commit
292176a118, which for some reason I had
not done at the time, but it's a prerequisite for this change.
2012-07-20 13:10:58 -04:00
Robert Haas
3a0e4d36eb Make new event trigger facility actually do something.
Commit 3855968f32 added syntax, pg_dump,
psql support, and documentation, but the triggers didn't actually fire.
With this commit, they now do.  This is still a pretty basic facility
overall because event triggers do not get a whole lot of information
about what the user is trying to do unless you write them in C; and
there's still no option to fire them anywhere except at the very
beginning of the execution sequence, but it's better than nothing,
and a good building block for future work.

Along the way, add a regression test for ALTER LARGE OBJECT, since
testing of event triggers reveals that we haven't got one.

Dimitri Fontaine and Robert Haas
2012-07-20 11:39:01 -04:00
Heikki Linnakangas
a7a4add6c4 Refactor the way code is shared between some range type functions.
Functions like range_eq, range_before etc. are exposed at the SQL-level, but
they're also used internally by the GiST consistent support function. The
code sharing was done by a hack, TrickFunctionCall2, which relied on the
knowledge that all the functions used fn_extra the same way. This commit
splits the functions into internal versions that take a TypeCacheEntry as
argument, and thin wrappers to expose the functions at the SQL-level. The
internal versions can then be called directly and in a less hacky way from
the GiST consistent function.

This is just cosmetic, but backpatch to 9.2 anyway, to avoid having a
different version of this code in the 9.2 branch. That would make
backpatching fixes in this area more difficult.

Alexander Korotkov
2012-07-18 23:14:56 +03:00
Tom Lane
4a9c30a8a1 Fix management of pendingOpsTable in auxiliary processes.
mdinit() was misusing IsBootstrapProcessingMode() to decide whether to
create an fsync pending-operations table in the current process.  This led
to creating a table not only in the startup and checkpointer processes as
intended, but also in the bgwriter process, not to mention other auxiliary
processes such as walwriter and walreceiver.  Creation of the table in the
bgwriter is fatal, because it absorbs fsync requests that should have gone
to the checkpointer; instead they just sit in bgwriter local memory and are
never acted on.  So writes performed by the bgwriter were not being fsync'd
which could result in data loss after an OS crash.  I think there is no
live bug with respect to walwriter and walreceiver because those never
perform any writes of shared buffers; but the potential is there for
future breakage in those processes too.

To fix, make AuxiliaryProcessMain() export the current process's
AuxProcType as a global variable, and then make mdinit() test directly for
the types of aux process that should have a pendingOpsTable.  Having done
that, we might as well also get rid of the random bool flags such as
am_walreceiver that some of the aux processes had grown.  (Note that we
could not have fixed the bug by examining those variables in mdinit(),
because it's called from BaseInit() which is run by AuxiliaryProcessMain()
before entering any of the process-type-specific code.)

Back-patch to 9.2, where the problem was introduced by the split-up of
bgwriter and checkpointer processes.  The bogus pendingOpsTable exists
in walwriter and walreceiver processes in earlier branches, but absent
any evidence that it causes actual problems there, I'll leave the older
branches alone.
2012-07-18 15:28:10 -04:00
Robert Haas
3855968f32 Syntax support and documentation for event triggers.
They don't actually do anything yet; that will get fixed in a
follow-on commit.  But this gets the basic infrastructure in place,
including CREATE/ALTER/DROP EVENT TRIGGER; support for COMMENT,
SECURITY LABEL, and ALTER EXTENSION .. ADD/DROP EVENT TRIGGER;
pg_dump and psql support; and documentation for the anticipated
initial feature set.

Dimitri Fontaine, with review and a bunch of additional hacking by me.
Thom Brown extensively reviewed earlier versions of this patch set,
but there's not a whole lot of that code left in this commit, as it
turns out.
2012-07-18 10:16:16 -04:00
Tom Lane
73b796a52c Improve coding around the fsync request queue.
In all branches back to 8.3, this patch fixes a questionable assumption in
CompactCheckpointerRequestQueue/CompactBgwriterRequestQueue that there are
no uninitialized pad bytes in the request queue structs.  This would only
cause trouble if (a) there were such pad bytes, which could happen in 8.4
and up if the compiler makes enum ForkNumber narrower than 32 bits, but
otherwise would require not-currently-planned changes in the widths of
other typedefs; and (b) the kernel has not uniformly initialized the
contents of shared memory to zeroes.  Still, it seems a tad risky, and we
can easily remove any risk by pre-zeroing the request array for ourselves.
In addition to that, we need to establish a coding rule that struct
RelFileNode can't contain any padding bytes, since such structs are copied
into the request array verbatim.  (There are other places that are assuming
this anyway, it turns out.)

In 9.1 and up, the risk was a bit larger because we were also effectively
assuming that struct RelFileNodeBackend contained no pad bytes, and with
fields of different types in there, that would be much easier to break.
However, there is no good reason to ever transmit fsync or delete requests
for temp files to the bgwriter/checkpointer, so we can revert the request
structs to plain RelFileNode, getting rid of the padding risk and saving
some marginal number of bytes and cycles in fsync queue manipulation while
we are at it.  The savings might be more than marginal during deletion of
a temp relation, because the old code transmitted an entirely useless but
nonetheless expensive-to-process ForgetRelationFsync request to the
background process, and also had the background process perform the file
deletion even though that can safely be done immediately.

In addition, make some cleanup of nearby comments and small improvements to
the code in CompactCheckpointerRequestQueue/CompactBgwriterRequestQueue.
2012-07-17 16:56:54 -04:00
Alvaro Herrera
f34c68f096 Introduce timeout handling framework
Management of timeouts was getting a little cumbersome; what we
originally had was more than enough back when we were only concerned
about deadlocks and query cancel; however, when we added timeouts for
standby processes, the code got considerably messier.  Since there are
plans to add more complex timeouts, this seems a good time to introduce
a central timeout handling module.

External modules register their timeout handlers during process
initialization, and later enable and disable them as they see fit using
a simple API; timeout.c is in charge of keeping track of which timeouts
are in effect at any time, installing a common SIGALRM signal handler,
and calling setitimer() as appropriate to ensure timely firing of
external handlers.

timeout.c additionally supports pluggable modules to add their own
timeouts, though this capability isn't exercised anywhere yet.

Additionally, as of this commit, walsender processes are aware of
timeouts; we had a preexisting bug there that made those ignore SIGALRM,
thus being subject to unhandled deadlocks, particularly during the
authentication phase.  This has already been fixed in back branches in
commit 0bf8eb2a, which see for more details.

Main author: Zoltán Böszörményi
Some review and cleanup by Álvaro Herrera
Extensive reworking by Tom Lane
2012-07-16 22:55:33 -04:00
Tom Lane
c92be3c059 Avoid pre-determining index names during CREATE TABLE LIKE parsing.
Formerly, when trying to copy both indexes and comments, CREATE TABLE LIKE
had to pre-assign names to indexes that had comments, because it made up an
explicit CommentStmt command to apply the comment and so it had to know the
name for the index.  This creates bad interactions with other indexes, as
shown in bug #6734 from Daniele Varrazzo: the preassignment logic couldn't
take any other indexes into account so it could choose a conflicting name.

To fix, add a field to IndexStmt that allows it to carry a comment to be
assigned to the new index.  (This isn't a user-exposed feature of CREATE
INDEX, only an internal option.)  Now we don't need preassignment of index
names in any situation.

I also took the opportunity to refactor DefineIndex to accept the IndexStmt
as such, rather than passing all its fields individually in a mile-long
parameter list.

Back-patch to 9.2, but no further, because it seems too dangerous to change
IndexStmt or DefineIndex's API in released branches.  The bug exists back
to 9.0 where CREATE TABLE LIKE grew the ability to copy comments, but given
the lack of prior complaints we'll just let it go unfixed before 9.2.
2012-07-16 13:25:18 -04:00
Tom Lane
b966dd6c42 Add fsync capability to initdb, and use sync_file_range() if available.
Historically we have not worried about fsync'ing anything during initdb
(in fact, initdb intentionally passes -F to each backend launch to prevent
it from fsync'ing).  But with filesystems getting more aggressive about
caching data, that's not such a good plan anymore.  Make initdb do a pass
over the finished data directory tree to fsync everything.  For testing
purposes, the -N/--nosync flag can be used to restore the old behavior.

Also, testing shows that on Linux, sync_file_range() is much faster than
posix_fadvise() for hinting to the kernel that an fsync is coming,
apparently because the latter blocks on a rather small request queue while
the former doesn't.  So use this function if available in initdb, and also
in the backend's pg_flush_data() (where it currently will affect only the
speed of CREATE DATABASE's cloning step).

We will later make pg_regress invoke initdb with the --nosync flag
to avoid slowing down cases such as "make check" in contrib.  But
let's not do so until we've shaken out any portability issues in this
patch.

Jeff Davis, reviewed by Andres Freund
2012-07-13 17:16:58 -04:00
Tom Lane
84a42560c8 Add array_remove() and array_replace() functions.
These functions support removing or replacing array element value(s)
matching a given search value.  Although intended mainly to support a
future array-foreign-key feature, they seem useful in their own right.

Marco Nenciarini and Gabriele Bartolini, reviewed by Alex Hunsaker
2012-07-11 13:59:35 -04:00
Tom Lane
01215d61a7 Fix bogus macro definition.
Per buildfarm complaints.
2012-07-10 22:36:11 -04:00
Tatsuo Ishii
1c7a7faa5b Add comments about additional mule-internal charsets from emacs's
source code(lisp/international/mule-conf.el).  These charsets have not
been supported up to now anyway, so this is just for adding
commentary.  Also add mention that we follow emacs's implementation,
not xemacs's.
2012-07-11 08:10:50 +09:00
Tom Lane
628cbb50ba Re-implement extraction of fixed prefixes from regular expressions.
To generate btree-indexable conditions from regex WHERE conditions (such as
WHERE indexed_col ~ '^foo'), we need to be able to identify any fixed
prefix that a regex might have; that is, find any string that must be a
prefix of all strings satisfying the regex.  We used to do that with
entirely ad-hoc code that looked at the source text of the regex.  It
didn't know very much about regex syntax, which mostly meant that it would
fail to identify some optimizable cases; but Viktor Rosenfeld reported that
it would produce actively wrong answers for quantified parenthesized
subexpressions, such as '^(foo)?bar'.  Rather than trying to extend the
ad-hoc code to cover this, let's get rid of it altogether in favor of
identifying prefixes by examining the compiled form of a regex.

To do this, I've added a new entry point "pg_regprefix" to the regex library;
hopefully it is defined in a sufficiently general fashion that it can remain
in the library when/if that code gets split out as a standalone project.

Since this bug has been there for a very long time, this fix needs to get
back-patched.  However it depends on some other recent commits (particularly
the addition of wchar-to-database-encoding conversion), so I'll commit this
separately and then go to work on back-porting the necessary fixes.
2012-07-10 14:54:37 -04:00
Tom Lane
00dac6000d Refactor pattern_fixed_prefix() to avoid dealing in incomplete patterns.
Previously, pattern_fixed_prefix() was defined to return whatever fixed
prefix it could extract from the pattern, plus the "rest" of the pattern.
That definition was sensible for LIKE patterns, but not so much for
regexes, where reconstituting a valid pattern minus the prefix could be
quite tricky (certainly the existing code wasn't doing that correctly).
Since the only thing that callers ever did with the "rest" of the pattern
was to pass it to like_selectivity() or regex_selectivity(), let's cut out
the middle-man and just have pattern_fixed_prefix's subroutines do this
directly.  Then pattern_fixed_prefix can return a simple selectivity
number, and the question of how to cope with partial patterns is removed
from its API specification.

While at it, adjust the API spec so that callers who don't actually care
about the pattern's selectivity (which is a lot of them) can pass NULL for
the selectivity pointer to skip doing the work of computing a selectivity
estimate.

This patch is only an API refactoring that doesn't actually change any
processing, other than allowing a little bit of useless work to be skipped.
However, it's necessary infrastructure for my upcoming fix to regex prefix
extraction, because after that change there won't be any simple way to
identify the "rest" of the regex, not even to the low level of fidelity
needed by regex_selectivity.  We can cope with that if regex_fixed_prefix
and regex_selectivity communicate directly, but not if we have to work
within the old API.  Hence, back-patch to all active branches.
2012-07-09 23:22:55 -04:00
Tom Lane
e7ef6d7e24 Fix planner to pass correct collation to operator selectivity estimators.
We can do this without creating an API break for estimation functions
by passing the collation using the existing fmgr functionality for
passing an input collation as a hidden parameter.

The need for this was foreseen at the outset, but we didn't get around to
making it happen in 9.1 because of the decision to sort all pg_statistic
histograms according to the database's default collation.  That meant that
selectivity estimators generally need to use the default collation too,
even if they're estimating for an operator that will do something
different.  The reason it's suddenly become more interesting is that
regexp interpretation also uses a collation (for its LC_TYPE not LC_COLLATE
property), and we no longer want to use the wrong collation when examining
regexps during planning.  It's not that the selectivity estimate is likely
to change much from this; rather that we are thinking of caching compiled
regexps during planner estimation, and we won't get the intended benefit
if we cache them with a different collation than the executor will use.

Back-patch to 9.1, both because the regexp change is likely to get
back-patched and because we might as well get this right in all
collation-supporting branches, in case any third-party code wants to
rely on getting the collation.  The patch turns out to be minuscule
now that I've done it ...
2012-07-08 23:51:08 -04:00
Tom Lane
c6aae3042b Simplify and document regex library's compact-NFA representation.
The previous coding abused the first element of a cNFA state's arcs list
to hold a per-state flag bit, which was confusing, undocumented, and not
even particularly efficient.  Get rid of that in favor of a separate
"stflags" vector.  Since there's only one bit in use, I chose to allocate a
char per state; we could possibly replace this with a bitmap at some point,
but that would make accesses a little slower.  It's already about 8X
smaller than before, so let's not get overly tense.

Also document the representation better than it was before, which is to say
not at all.

This patch is a byproduct of investigations towards extracting a "fixed
prefix" string from the compact-NFA representation of regex patterns.
Might need to back-patch it if we decide to back-patch that fix, but for
now it's just code cleanup so I'll just put it in HEAD.
2012-07-07 17:39:50 -04:00
Tom Lane
fc548b2296 Remove support for using wait3() in place of waitpid().
All Unix-oid platforms that we currently support should have waitpid(),
since it's in V2 of the Single Unix Spec.  Our git history shows that
the wait3 code was added to support NextStep, which we officially dropped
support for as of 9.2.  So get rid of the configure test, and simplify the
macro spaghetti in reaper().  Per suggestion from Fujii Masao.
2012-07-05 14:00:40 -04:00
Robert Haas
72dd6291f2 Add wchar -> mb conversion routines.
This is infrastructure for Alexander Korotkov's work on indexing regular
expression searches.

Alexander Korotkov, with a bit of further hackery on the MULE conversion
by me
2012-07-04 17:10:10 -04:00
Tom Lane
09022de1f5 Improve documentation about MULE encoding.
This commit improves the comments in pg_wchar.h and creates #define symbols
for some formerly hard-coded values.  No substantive code changes.

Tatsuo Ishii and Tom Lane
2012-07-04 00:29:57 -04:00
Alvaro Herrera
0c7b9dc7d0 Have REASSIGN OWNED work on extensions, too
Per bug #6593, REASSIGN OWNED fails when the affected role has created
an extension.  Even though the user related to the extension is not
nominally the owner, its OID appears on pg_shdepend and thus causes
problems when the user is to be dropped.

This commit adds code to change the "ownership" of the extension itself,
not of the contained objects.  This is fine because it's currently only
called from REASSIGN OWNED, which would also modify the ownership of the
contained objects.  However, this is not sufficient for a working ALTER
OWNER implementation extension.

Back-patch to 9.1, where extensions were introduced.

Bug #6593 reported by Emiliano Leporati.
2012-07-03 15:09:59 -04:00
Robert Haas
f83b59997d Make walsender more responsive.
Per testing by Andres Freund, this improves replication performance
and reduces replication latency and latency jitter.  I was a bit
concerned about moving more work into XLogInsert, but testing seems
to show that it's not a problem in practice.

Along the way, improve comments for WaitLatchOrSocket.

Andres Freund.  Review and stylistic cleanup by me.
2012-07-02 09:41:01 -04:00
Tom Lane
541ffa65c3 Prevent CREATE TABLE LIKE/INHERITS from (mis) copying whole-row Vars.
If a CHECK constraint or index definition contained a whole-row Var (that
is, "table.*"), an attempt to copy that definition via CREATE TABLE LIKE or
table inheritance produced incorrect results: the copied Var still claimed
to have the rowtype of the source table, rather than the created table.

For the LIKE case, it seems reasonable to just throw error for this
situation, since the point of LIKE is that the new table is not permanently
coupled to the old, so there's no reason to assume its rowtype will stay
compatible.  In the inheritance case, we should ideally allow such
constraints, but doing so will require nontrivial refactoring of CREATE
TABLE processing (because we'd need to know the OID of the new table's
rowtype before we adjust inherited CHECK constraints).  In view of the lack
of previous complaints, that doesn't seem worth the risk in a back-patched
bug fix, so just make it throw error for the inheritance case as well.

Along the way, replace change_varattnos_of_a_node() with a more robust
function map_variable_attnos(), which is capable of being extended to
handle insertion of ConvertRowtypeExpr whenever we get around to fixing
the inheritance case nicely, and in the meantime it returns a failure
indication to the caller so that a helpful message with some context can be
thrown.  Also, this code will do the right thing with subselects (if we
ever allow them in CHECK or indexes), and it range-checks varattnos before
using them to index into the map array.

Per report from Sergey Konoplev.  Back-patch to all supported branches.
2012-06-30 16:45:14 -04:00
Robert Haas
b79ab00144 When LWLOCK_STATS is defined, count spindelays.
When LWLOCK_STATS is *not* defined, the only change is that
SpinLockAcquire now returns the number of delays.

Patch by me, review by Jeff Janes.
2012-06-26 16:06:07 -04:00
Alvaro Herrera
77ed0c6950 Tighten up includes in sinvaladt.h, twophase.h, proc.h
Remove proc.h from sinvaladt.h and twophase.h; also replace xlog.h in
proc.h with xlogdefs.h.
2012-06-25 18:40:40 -04:00
Peter Eisentraut
eeece9e609 Unify calling conventions for postgres/postmaster sub-main functions
There was a wild mix of calling conventions: Some were declared to
return void and didn't return, some returned an int exit code, some
claimed to return an exit code, which the callers checked, but
actually never returned, and so on.

Now all of these functions are declared to return void and decorated
with attribute noreturn and don't return.  That's easiest, and most
code already worked that way.
2012-06-25 21:30:12 +03:00
Robert Haas
2dfa87bcb6 Remove sanity test in XRecOffIsValid.
Commit 061e7efb1b changed the rules
for splitting xlog records across pages, but neglected to update this
test.  It's possible that there's some better action here than just
removing the test completely, but this at least appears to get some
of the things that are currently broken (like initdb on MacOS X)
working again.
2012-06-25 12:14:43 -04:00
Peter Eisentraut
b8b2e3b2de Replace int2/int4 in C code with int16/int32
The latter was already the dominant use, and it's preferable because
in C the convention is that intXX means XX bits.  Therefore, allowing
mixed use of int2, int4, int8, int16, int32 is obviously confusing.

Remove the typedefs for int2 and int4 for now.  They don't seem to be
widely used outside of the PostgreSQL source tree, and the few uses
can probably be cleaned up by the time this ships.
2012-06-25 01:51:46 +03:00
Heikki Linnakangas
0687a26002 Use UINT64CONST for 64-bit integer constants.
Peter Eisentraut advised me that UINT64CONST is the proper way to do that,
not LL suffix.
2012-06-24 21:56:45 +03:00
Heikki Linnakangas
96ff85e2dd Use LL suffix for 64-bit constants.
Per warning from buildfarm member 'locust'. At least I think this what's
making it upset.
2012-06-24 20:01:55 +03:00
Heikki Linnakangas
0ab9d1c4b3 Replace XLogRecPtr struct with a 64-bit integer.
This simplifies code that needs to do arithmetic on XLogRecPtrs.

To avoid changing on-disk format of data pages, the LSN on data pages is
still stored in the old format. That should keep pg_upgrade happy. However,
we have XLogRecPtrs embedded in the control file, and in the structs that
are sent over the replication protocol, so this changes breaks compatibility
of pg_basebackup and server. I didn't do anything about this in this patch,
per discussion on -hackers, the right thing to do would to be to change the
replication protocol to be architecture-independent, so that you could use
a newer version of pg_receivexlog, for example, against an older server
version.
2012-06-24 19:19:45 +03:00
Heikki Linnakangas
061e7efb1b Allow WAL record header to be split across pages.
This saves a few bytes of WAL space, but the real motivation is to make it
predictable how much WAL space a record requires, as it no longer depends
on whether we need to waste the last few bytes at end of WAL page because
the header doesn't fit.

The total length field of WAL record, xl_tot_len, is moved to the beginning
of the WAL record header, so that it is still always found on the first page
where a WAL record begins.

Bump WAL version number again as this is an incompatible change.
2012-06-24 18:35:56 +03:00
Heikki Linnakangas
20ba5ca64c Move WAL continuation record information to WAL page header.
The continuation record only contained one field, xl_rem_len, so it makes
things simpler to just include it in the WAL page header. This wastes four
bytes on pages that don't begin with a continuation from previos page, plus
four bytes on every page, because of padding.

The motivation of this is to make it easier to calculate how much space a
WAL record needs. Before this patch, it depended on how many page boundaries
the record crosses. The motivation of that, in turn, is to separate the
allocation of space in the WAL from the copying of the record data to the
allocated space. Keeping the calculation of space required simple helps to
keep the critical section of allocating the space from WAL short. But that's
not included in this patch yet.

Bump WAL version number again, as this is an incompatible change.
2012-06-24 18:35:30 +03:00
Heikki Linnakangas
dfda6ebaec Don't waste the last segment of each 4GB logical log file.
The comments claimed that wasting the last segment made it easier to do
calculations with XLogRecPtrs, because you don't have problems representing
last-byte-position-plus-1 that way. In my experience, however, it only made
things more complicated, because the there was two ways to represent the
boundary at the beginning of a logical log file: logid = n+1 and xrecoff = 0,
or as xlogid = n and xrecoff = 4GB - XLOG_SEG_SIZE. Some functions were
picky about which representation was used.

Also, use a 64-bit segment number instead of the log/seg combination, to
point to a certain WAL segment. We assume that all platforms have a working
64-bit integer type nowadays.

This is an incompatible change in WAL format, so bumping WAL version number.
2012-06-24 18:35:29 +03:00
Tom Lane
d14241c2cf Fix memory leak in ARRAY(SELECT ...) subqueries.
Repeated execution of an uncorrelated ARRAY_SUBLINK sub-select (which
I think can only happen if the sub-select is embedded in a larger,
correlated subquery) would leak memory for the duration of the query,
due to not reclaiming the array generated in the previous execution.
Per bug #6698 from Armando Miraglia.  Diagnosis and fix idea by Heikki,
patch itself by me.

This has been like this all along, so back-patch to all supported versions.
2012-06-21 17:27:19 -04:00
Heikki Linnakangas
eeb6f37d89 Add a small cache of locks owned by a resource owner in ResourceOwner.
This speeds up reassigning locks to the parent owner, when the transaction
holds a lot of locks, but only a few of them belong to the current resource
owner. This is particularly helps pg_dump when dumping a large number of
objects.

The cache can hold up to 15 locks in each resource owner. After that, the
cache is marked as overflowed, and we fall back to the old method of
scanning the whole local lock table. The tradeoff here is that the cache has
to be scanned whenever a lock is released, so if the cache is too large,
lock release becomes more expensive. 15 seems enough to cover pg_dump, and
doesn't have much impact on lock release.

Jeff Janes, reviewed by Amit Kapila and Heikki Linnakangas.
2012-06-21 15:30:26 +03:00
Tom Lane
cfa0f4255b Improve tests for whether we can skip queueing RI enforcement triggers.
During an update of a PK row, we can skip firing the RI trigger if any old
key value is NULL, because then the row could not have had any matching
rows in the FK table.  Conversely, during an update of an FK row, the
outcome is determined if any new key value is NULL.  In either case it
becomes unnecessary to compare individual key values.

This patch was inspired by discussion of Vik Reykja's patch to use IS NOT
DISTINCT semantics for the key comparisons.  In the event there is no need
for that and so this patch looks nothing like his, but he should still get
credit for having re-opened consideration of the trigger skip logic.
2012-06-19 20:07:33 -04:00
Tom Lane
f5297bdfe4 Refer to the default foreign key match style as MATCH SIMPLE internally.
Previously we followed the SQL92 wording, "MATCH <unspecified>", but since
SQL99 there's been a less awkward way to refer to the default style.

In addition to the code changes, pg_constraint.confmatchtype now stores
this match style as 's' (SIMPLE) rather than 'u' (UNSPECIFIED).  This
doesn't affect pg_dump or psql because they use pg_get_constraintdef()
to reconstruct foreign key definitions.  But other client-side code might
examine that column directly, so this change will have to be marked as
an incompatibility in the 9.3 release notes.
2012-06-17 20:16:44 -04:00
Tom Lane
9e18eacbdf Fix stats collector to recover nicely when system clock goes backwards.
Formerly, if the system clock went backwards, the stats collector would
fail to update the stats file any more until the clock reading again
exceeds whatever timestamp was last written into the stats file.  Such
glitches in the clock's behavior are not terribly unlikely on machines
not using NTP.  Such a scenario has been observed to cause regression test
failures in the buildfarm, and it could have bad effects on the behavior
of autovacuum, so it seems prudent to install some defenses.

We could directly detect the clock going backwards by adding
GetCurrentTimestamp calls in the stats collector's main loop, but that
would hurt performance on platforms where GetCurrentTimestamp is expensive.
To minimize the performance hit in normal cases, adopt a more complicated
scheme wherein backends check for clock skew when reading the stats file,
and if they see it, signal the stats collector by sending an extra stats
inquiry message.  The stats collector does an extra GetCurrentTimestamp
only when it receives an inquiry with an apparently out-of-order
timestamp.

To avoid unnecessary GetCurrentTimestamp calls, expand the inquiry messages
to carry the backend's current clock reading as well as its stats cutoff
time.  The latter, being intentionally slightly in-the-past, would trigger
more clock rechecks than we need if it were used for this purpose.

We might want to backpatch this change at some point, but let's let it
shake out in the buildfarm for awhile first.
2012-06-17 17:11:49 -04:00
Peter Eisentraut
15b1918e7d Improve reporting of permission errors for array types
Because permissions are assigned to element types, not array types,
complaining about permission denied on an array type would be
misleading to users.  So adjust the reporting to refer to the element
type instead.

In order not to duplicate the required logic in two dozen places,
refactor the permission denied reporting for types a bit.

pointed out by Yeb Havinga during the review of the type privilege
feature
2012-06-15 22:55:03 +03:00
Robert Haas
68de499bda New SQL functons pg_backup_in_progress() and pg_backup_start_time()
Darold Gilles, reviewed by Gabriele Bartolini and others, rebased by
Marco Nenciarini.  Stylistic cleanup and OID fixes by me.
2012-06-14 13:25:43 -04:00
Robert Haas
6cd015bea3 Add new function log_newpage_buffer.
When I implemented the ginbuildempty() function as part of
implementing unlogged tables, I falsified the note in the header
comment for log_newpage.  Although we could fix that up by changing
the comment, it seems cleaner to add a new function which is
specifically intended to handle this case.  So do that.
2012-06-14 10:11:16 -04:00
Robert Haas
a475c60367 Remove misplaced sanity check from heap_create().
Even when allow_system_table_mods is not set, we allow creation of any
type of SQL object in pg_catalog, except for relations.  And you can
get relations into pg_catalog, too, by initially creating them in some
other schema and then moving them with ALTER .. SET SCHEMA.  So this
restriction, which prevents relations (only) from being created in
pg_catalog directly, is fairly pointless.  If we need a safety mechanism
for this, it should be placed further upstream, so that it affects all
SQL objects uniformly, and picks up both CREATE and SET SCHEMA.

For now, just rip it out, per discussion with Tom Lane.
2012-06-14 09:58:53 -04:00
Robert Haas
d2c86a1ccd Remove RELKIND_UNCATALOGED.
This may have been important at some point in the past, but it no
longer does anything useful.

Review by Tom Lane.
2012-06-14 09:47:30 -04:00
Tom Lane
bed88fceac Stamp HEAD as 9.3devel.
Let the hacking begin ...
2012-06-13 20:03:02 -04:00
Bruce Momjian
927d61eeff Run pgindent on 9.2 source tree in preparation for first 9.3
commit-fest.
2012-06-10 15:20:04 -04:00
Peter Eisentraut
8570114dc1 Make include files work without having to include other ones first 2012-06-10 12:46:14 +03:00
Tom Lane
ece01aae47 Scan the buffer pool just once, not once per fork, during relation drop.
This provides a speedup of about 4X when NBuffers is large enough.
There is also a useful reduction in sinval traffic, since we
only do CacheInvalidateSmgr() once not once per fork.

Simon Riggs, reviewed and somewhat revised by Tom Lane
2012-06-07 17:43:11 -04:00
Simon Riggs
055c352abb After any checkpoint, close all smgr files handles in bgwriter 2012-06-01 09:24:53 +01:00
Tom Lane
4bec93ac0f Stamp 9.2beta2. 2012-05-31 19:16:55 -04:00
Tom Lane
ad0009e7be Force PL and range-type support functions to be owned by a superuser.
We allow non-superusers to create procedural languages (with restrictions)
and range datatypes.  Previously, the automatically-created support
functions for these objects ended up owned by the creating user.  This
represents a rather considerable security hazard, because the owning user
might be able to alter a support function's definition in such a way as to
crash the server, inject trojan-horse SQL code, or even execute arbitrary
C code directly.  It appears that right now the only actually exploitable
problem is the infinite-recursion bug fixed in the previous patch for
CVE-2012-2655.  However, it's not hard to imagine that future additions of
more ALTER FUNCTION capability might unintentionally open up new hazards.
To forestall future problems, cause these support functions to be owned by
the bootstrap superuser, not the user creating the parent object.
2012-05-30 23:47:57 -04:00
Tom Lane
cd0ff9c0f4 Expand the allowed range of timezone offsets to +/-15:59:59 from Greenwich.
We used to only allow offsets less than +/-13 hours, then it was +/14,
then it was +/-15.  That's still not good enough though, as per today's bug
report from Patric Bechtel.  This time I actually looked through the Olson
timezone database to find the largest offsets used anywhere.  The winners
are Asia/Manila, at -15:56:00 until 1844, and America/Metlakatla, at
+15:13:42 until 1867.  So we'd better allow offsets less than +/-16 hours.

Given the history, we are way overdue to have some greppable #define
symbols controlling this, so make some ... and also remove an obsolete
comment that didn't get fixed the last time.

Back-patch to all supported branches.
2012-05-30 19:58:35 -04:00
Heikki Linnakangas
d1996ed5e8 Change the way parent pages are tracked during buffered GiST build.
We used to mimic the way a stack is constructed when descending the tree
during normal GiST inserts, but that was quite complicated during a buffered
build. It was also wrong: in GiST, the left-to-right relationships on
different levels might not match each other, so that when you know the
parent of a child page, you won't necessarily find the parent of the page to
the right of the child page by following the rightlinks at the parent level.
This sometimes led to "could not re-find parent" errors while building a
GiST index.

We now use a simple hash table to track the parent of every internal page.
Whenever a page is split, and downlinks are moved from one page to another,
we update the hash table accordingly. This is also better for performance
than the old method, as we never need to move right to re-find the parent
page, which could take a significant amount of time for buffers that were
created much earlier in the index build.
2012-05-30 12:05:57 +03:00
Heikki Linnakangas
1d27dcf578 Fix bug in gistRelocateBuildBuffersOnSplit().
When we create a temporary copy of the old node buffer, in stack, we mustn't
leak that into any of the long-lived data structures. Before this patch,
when we called gistPopItupFromNodeBuffer(), it got added to the array of
"loaded buffers". After gistRelocateBuildBuffersOnSplit() exits, the
pointer added to the loaded buffers array points to garbage. Often that goes
unnotied, because when we go through the array of loaded buffers to unload
them, buffers with a NULL pageBuffer are ignored, which can often happen by
accident even if the pointer points to garbage.

This patch fixes that by marking the temporary copy in stack explicitly as
temporary, and refrain from adding buffers marked as temporary to the array
of loaded buffers.

While we're at it, initialize nodeBuffer->pageBlocknum to InvalidBlockNumber
and improve comments a bit. This isn't strictly necessary, but makes
debugging easier.
2012-05-18 19:38:32 +03:00
Peter Eisentraut
be6d1c88a4 Change COLLATION keyword category
It was changed from unreserved to reserved as part of the COLLATION
FOR syntax, but it turns out that type_func_name_keyword is
sufficient.
2012-05-16 20:19:44 +03:00
Tom Lane
f667747b6d Put back AC_REQUIRE([AC_STRUCT_TM]).
The BSD-ish members of the buildfarm all seem to think removing this
was a bad idea.  It looks to me like it resulted in omitting the system
header inclusion necessary to detect the fields of struct tm correctly.
2012-05-14 23:06:48 -04:00
Peter Eisentraut
ff4628f37a Remove unused AC_DEFINE symbols
ENABLE_DTRACE            unused as of a7b7b07af3
HAVE_ERR_SET_MARK        unused as of 4ed4b6c54e
HAVE_FCVT                unused as of 4553e1d80f
HAVE_STRUCT_SOCKADDR_UN  unused as of b4cea00a1f
HAVE_SYSCONF             unused as of f83356c7f5
TM_IN_SYS_TIME           never used, obsolescent per Autoconf documentation
2012-05-14 22:51:21 +03:00
Heikki Linnakangas
9e4637bf89 Update comments that became out-of-date with the PGXACT struct.
When the "hot" members of PGPROC were split off to separate PGXACT structs,
many PGPROC fields referred to in comments were moved to PGXACT, but the
comments were neglected in the commit. Mostly this is just a search/replace
of PGPROC with PGXACT, but the way the dummy PGPROC entries are created for
prepared transactions changed more, making some of the comments totally
bogus.

Noah Misch
2012-05-14 10:28:55 +03:00
Peter Eisentraut
64f09ca386 Remove leftovers of BeOS port
These should have been removed when the BeOS port was removed in
44f9021223.
2012-05-14 04:50:39 +03:00
Robert Haas
1331cc6c1a Prevent loss of init fork when truncating an unlogged table.
Fixes bug #6635, reported by Akira Kurosawa.
2012-05-11 09:48:56 -04:00
Simon Riggs
b06679e012 Ensure age() returns a stable value rather than the latest value 2012-05-11 14:36:24 +01:00
Bruce Momjian
ee24de4001 Revert catalog bump; was post-beta1, and unnecessary. 2012-05-10 18:44:47 -04:00
Bruce Momjian
d2fe836cd2 Update comment for 'name' data type to say 63 "bytes".
Catalog version bump so everyone has the same comment for beta1.
2012-05-10 18:40:40 -04:00
Tom Lane
f70fa835e0 Stamp 9.2beta1. 2012-05-10 18:35:09 -04:00
Heikki Linnakangas
60a3dffb72 Fix outdated comment.
Multi-insert records observe XLOG_HEAP_INIT_PAGE flag too, as Andres Freund
pointed out.
2012-05-10 09:55:48 +03:00
Tom Lane
6308ba05a7 Improve control logic for bgwriter hibernation mode.
Commit 6d90eaaa89 added a hibernation mode
to the bgwriter to reduce the server's idle-power consumption.  However,
its interaction with the detailed behavior of BgBufferSync's feedback
control loop wasn't very well thought out.  That control loop depends
primarily on the rate of buffer allocation, not the rate of buffer
dirtying, so the hibernation mode has to be designed to operate only when
no new buffer allocations are happening.  Also, the check for whether the
system is effectively idle was not quite right and would fail to detect
a constant low level of activity, thus allowing the bgwriter to go into
hibernation mode in a way that would let the cycle time vary quite a bit,
possibly further confusing the feedback loop.  To fix, move the wakeup
support from MarkBufferDirty and SetBufferCommitInfoNeedsSave into
StrategyGetBuffer, and prevent the bgwriter from entering hibernation mode
unless no buffer allocations have happened recently.

In addition, fix the delaying logic to remove the problem of possibly not
responding to signals promptly, which was basically caused by trying to use
the process latch's is_set flag for multiple purposes.  I can't prove it
but I'm suspicious that that hack was responsible for the intermittent
"postmaster does not shut down" failures we've been seeing in the buildfarm
lately.  In any case it did nothing to improve the readability or
robustness of the code.

In passing, express the hibernation sleep time as a multiplier on
BgWriterDelay, not a constant.  I'm not sure whether there's any value in
exposing the longer sleep time as an independently configurable setting,
but we can at least make it act like this for little extra code.
2012-05-09 23:37:10 -04:00
Simon Riggs
8f28789bff Rename BgWriterShmem/Request to CheckpointerShmem/Request 2012-05-09 14:23:45 +01:00
Simon Riggs
bbd3ec9dce Rename BgWriterCommLock to CheckpointerCommLock 2012-05-09 14:11:48 +01:00
Tom Lane
acd4c7d58b Fix an issue in recent walwriter hibernation patch.
Users of asynchronous-commit mode expect there to be a guaranteed maximum
delay before an async commit's WAL records get flushed to disk.  The
original version of the walwriter hibernation patch broke that.  Add an
extra shared-memory flag to allow async commits to kick the walwriter out
of hibernation mode, without adding any noticeable overhead in cases where
no action is needed.
2012-05-08 23:06:40 -04:00
Tom Lane
5461564a9d Reduce idle power consumption of walwriter and checkpointer processes.
This patch modifies the walwriter process so that, when it has not found
anything useful to do for many consecutive wakeup cycles, it extends its
sleep time to reduce the server's idle power consumption.  It reverts to
normal as soon as it's done any successful flushes.  It's still true that
during any async commit, backends check for completed, unflushed pages of
WAL and signal the walwriter if there are any; so that in practice the
walwriter can get awakened and returned to normal operation sooner than the
sleep time might suggest.

Also, improve the checkpointer so that it uses a latch and a computed delay
time to not wake up at all except when it has something to do, replacing a
previous hardcoded 0.5 sec wakeup cycle.  This also is primarily useful for
reducing the server's power consumption when idle.

In passing, get rid of the dedicated latch for signaling the walwriter in
favor of using its procLatch, since that comports better with possible
generic signal handlers using that latch.  Also, fix a pre-existing bug
with failure to save/restore errno in walwriter's signal handlers.

Peter Geoghegan, somewhat simplified by Tom
2012-05-08 20:03:26 -04:00
Peter Eisentraut
3284e03d5d Remove strdup, strtol, strtoul from libpgport
These should not be needed anymore, at least after the recent port
removals.  So let's see whether we can do without them.
2012-05-07 23:10:28 +03:00
Tom Lane
71b9549d05 Overdue code review for transaction-level advisory locks patch.
Commit 62c7bd31c8 had assorted problems, most
visibly that it broke PREPARE TRANSACTION in the presence of session-level
advisory locks (which should be ignored by PREPARE), as per a recent
complaint from Stephen Rees.  More abstractly, the patch made the
LockMethodData.transactional flag not merely useless but outright
dangerous, because in point of fact that flag no longer tells you anything
at all about whether a lock is held transactionally.  This fix therefore
removes that flag altogether.  We now rely entirely on the convention
already in use in lock.c that transactional lock holds must be owned by
some ResourceOwner, while session holds are never so owned.  Setting the
locallock struct's owner link to NULL thus denotes a session hold, and
there is no redundant marker for that.

PREPARE TRANSACTION now works again when there are session-level advisory
locks, and it is also able to transfer transactional advisory locks to the
prepared transaction, but for implementation reasons it throws an error if
we hold both types of lock on a single lockable object.  Perhaps it will be
worth improving that someday.

Assorted other minor cleanup and documentation editing, as well.

Back-patch to 9.1, except that in the 9.1 branch I did not remove the
LockMethodData.transactional flag for fear of causing an ABI break for
any external code that might be examining those structs.
2012-05-04 17:44:31 -04:00
Bruce Momjian
ebcaa5fcde Remove BSD/OS (BSDi) port. There are no known users upgrading to
Postgres 9.2, and perhaps no existing users either.
2012-05-03 10:58:44 -04:00
Robert Haas
8e0c5195df Add missing parenthesis in comment. 2012-05-02 14:30:58 -04:00
Robert Haas
0038110421 Avoid repeated CLOG access from heap_hot_search_buffer.
At the time we check whether the tuple is dead to all running
transactions, we've already verified that it isn't visible to our
scan, setting hint bits if appropriate.  So there's no need to
recheck CLOG for the all-dead test we do just a moment later.
So, add HeapTupleIsSurelyDead() to test the appropriate condition
under the assumption that all relevant hit bits are already set.

Review by Tom Lane.
2012-05-02 12:40:07 -04:00
Robert Haas
e01e66f808 More duplicate word removal. 2012-05-02 09:28:16 -04:00
Heikki Linnakangas
f291ccd43e Remove duplicate words in comments.
Found these with grep -r "for for ".
2012-05-02 10:20:27 +03:00
Peter Eisentraut
f2f9439fbf Remove dead ports
Remove the following ports:

- dgux
- nextstep
- sunos4
- svr4
- ultrix4
- univel

These are obsolete and not worth rescuing.  In most cases, there is
circumstantial evidence that they wouldn't work anymore anyway.
2012-05-01 22:11:12 +03:00
Tom Lane
809e7e21af Converge all SQL-level statistics timing values to float8 milliseconds.
This patch adjusts the core statistics views to match the decision already
taken for pg_stat_statements, that values representing elapsed time should
be represented as float8 and measured in milliseconds.  By using float8,
we are no longer tied to a specific maximum precision of timing data.
(Internally, it's still microseconds, but we could now change that without
needing changes at the SQL level.)

The columns affected are
pg_stat_bgwriter.checkpoint_write_time
pg_stat_bgwriter.checkpoint_sync_time
pg_stat_database.blk_read_time
pg_stat_database.blk_write_time
pg_stat_user_functions.total_time
pg_stat_user_functions.self_time
pg_stat_xact_user_functions.total_time
pg_stat_xact_user_functions.self_time

The first four of these are new in 9.2, so there is no compatibility issue
from changing them.  The others require a release note comment that they
are now double precision (and can show a fractional part) rather than
bigint as before; also their underlying statistics functions now match
the column definitions, instead of returning bigint microseconds.
2012-04-30 14:03:33 -04:00
Peter Eisentraut
26471a51fc Mark ReThrowError() with attribute noreturn
All related functions were already so marked.
2012-04-30 20:23:41 +03:00
Tom Lane
1dd89eadcd Rename I/O timing statistics columns to blk_read_time and blk_write_time.
This seems more consistent with the pre-existing choices for names of
other statistics columns.  Rename assorted internal identifiers to match.
2012-04-29 18:13:33 -04:00
Tom Lane
309c64745e Rename track_iotiming GUC to track_io_timing.
This spelling seems significantly more readable to me.
2012-04-29 16:23:54 -04:00
Peter Eisentraut
81107282a5 Change return type of ExceptionalCondition to void and mark it noreturn
In ancient times, it was thought that this wouldn't work because of
TrapMacro/AssertMacro, but changing those to use a comma operator
appears to work without compiler warnings.
2012-04-29 21:20:14 +03:00
Robert Haas
3424bff90f Prevent index-only scans from returning wrong answers under Hot Standby.
The alternative of disallowing index-only scans in HS operation was
discussed, but the consensus was that it was better to treat marking
a page all-visible as a recovery conflict for snapshots that could still
fail to see XIDs on that page.  We may in the future try to soften this,
so that we simply force index scans to do heap fetches in cases where
this may be an issue, rather than throwing a hard conflict.
2012-04-26 20:00:21 -04:00
Tom Lane
9fa82c9809 Fix planner's handling of RETURNING lists in writable CTEs.
setrefs.c failed to do "rtoffset" adjustment of Vars in RETURNING lists,
which meant they were left with the wrong varnos when the RETURNING list
was in a subquery.  That was never possible before writable CTEs, of
course, but now it's broken.  The executor fails to notice any problem
because ExecEvalVar just references the ecxt_scantuple for any normal
varno; but EXPLAIN breaks when the varno is wrong, as illustrated in a
recent complaint from Bartosz Dmytrak.

Since the eventual rtoffset of the subquery is not known at the time
we are preparing its plan node, the previous scheme of executing
set_returning_clause_references() at that time cannot handle this
adjustment.  Fortunately, it turns out that we don't really need to do it
that way, because all the needed information is available during normal
setrefs.c execution; we just have to dig it out of the ModifyTable node.
So, do that, and get rid of the kluge of early setrefs processing of
RETURNING lists.  (This is a little bit of a cheat in the case of inherited
UPDATE/DELETE, because we are not passing a "root" struct that corresponds
exactly to what the subplan was built with.  But that doesn't matter, and
anyway this is less ugly than early setrefs processing was.)

Back-patch to 9.1, where the problem became possible to hit.
2012-04-25 20:20:33 -04:00
Robert Haas
ca1e1a8da1 Remove prototype for nonexistent function. 2012-04-25 15:32:15 -04:00
Robert Haas
5d4b60f2f2 Lots of doc corrections.
Josh Kupershmidt
2012-04-23 22:43:09 -04:00
Alvaro Herrera
09ff76fcdb Recast "ONLY" column CHECK constraints as NO INHERIT
The original syntax wasn't universally loved, and it didn't allow its
usage in CREATE TABLE, only ALTER TABLE.  It now works everywhere, and
it also allows using ALTER TABLE ONLY to add an uninherited CHECK
constraint, per discussion.

The pg_constraint column has accordingly been renamed connoinherit.

This commit partly reverts some of the changes in
61d81bd28d, particularly some pg_dump and
psql bits, because now pg_get_constraintdef includes the necessary NO
INHERIT within the constraint definition.

Author: Nikhil Sontakke
Some tweaks by me
2012-04-20 23:56:57 -03:00
Tom Lane
5b7b5518d0 Revise parameterized-path mechanism to fix assorted issues.
This patch adjusts the treatment of parameterized paths so that all paths
with the same parameterization (same set of required outer rels) for the
same relation will have the same rowcount estimate.  We cache the rowcount
estimates to ensure that property, and hopefully save a few cycles too.
Doing this makes it practical for add_path_precheck to operate without
a rowcount estimate: it need only assume that paths with different
parameterizations never dominate each other, which is close enough to
true anyway for coarse filtering, because normally a more-parameterized
path should yield fewer rows thanks to having more join clauses to apply.

In add_path, we do the full nine yards of comparing rowcount estimates
along with everything else, so that we can discard parameterized paths that
don't actually have an advantage.  This fixes some issues I'd found with
add_path rejecting parameterized paths on the grounds that they were more
expensive than not-parameterized ones, even though they yielded many fewer
rows and hence would be cheaper once subsequent joining was considered.

To make the same-rowcounts assumption valid, we have to require that any
parameterized path enforce *all* join clauses that could be obtained from
the particular set of outer rels, even if not all of them are useful for
indexing.  This is required at both base scans and joins.  It's a good
thing anyway since the net impact is that join quals are checked at the
lowest practical level in the join tree.  Hence, discard the original
rather ad-hoc mechanism for choosing parameterization joinquals, and build
a better one that has a more principled rule for when clauses can be moved.
The original rule was actually buggy anyway for lack of knowledge about
which relations are part of an outer join's outer side; getting this right
requires adding an outer_relids field to RestrictInfo.
2012-04-19 15:53:47 -04:00
Robert Haas
4a6fab03f2 Finish rename of FastPathStrongLocks to FastPathStrongRelationLocks.
Commit 8e5ac74c12 tried to do this renaming,
but I relied on gcc to tell me where I needed to make changes, instead of
grep.

Noted by Jeff Davis.
2012-04-18 11:29:34 -04:00
Robert Haas
53c5b869b4 Tighten up error recovery for fast-path locking.
The previous code could cause a backend crash after BEGIN; SAVEPOINT a;
LOCK TABLE foo (interrupted by ^C or statement timeout); ROLLBACK TO
SAVEPOINT a; LOCK TABLE foo, and might have leaked strong-lock counts
in other situations.

Report by Zoltán Böszörményi; patch review by Jeff Davis.
2012-04-18 11:17:30 -04:00
Robert Haas
4a2d7ad76f pg_size_pretty(numeric)
The output of the new pg_xlog_location_diff function is of type numeric,
since it could theoretically overflow an int8 due to signedness; this
provides a convenient way to format such values.

Fujii Masao, with some beautification by me.
2012-04-14 08:07:25 -04:00
Peter Eisentraut
c0cc526e8b Rename bytea_agg to string_agg and add delimiter argument
Per mailing list discussion, we would like to keep the bytea functions
parallel to the text functions, so rename bytea_agg to string_agg,
which already exists for text.

Also, to satisfy the rule that we don't want aggregate functions of
the same name with a different number of arguments, add a delimiter
argument, just like string_agg for text already has.
2012-04-13 21:36:59 +03:00
Heikki Linnakangas
ef3883d130 Do stack-depth checking in all postmaster children.
We used to only initialize the stack base pointer when starting up a regular
backend, not in other processes. In particular, autovacuum workers can run
arbitrary user code, and without stack-depth checking, infinite recursion
in e.g an index expression will bring down the whole cluster.

The comment about PL/Java using set_stack_base() is not yet true. As the
code stands, PL/java still modifies the stack_base_ptr variable directly.
However, it's been discussed in the PL/Java mailing list that it should be
changed to use the function, because PL/Java is currently oblivious to the
register stack used on Itanium. There's another issues with PL/Java, namely
that the stack base pointer it sets is not really the base of the stack, it
could be something close to the bottom of the stack. That's a separate issue
that might need some further changes to this code, but that's a different
story.

Backpatch to all supported releases.
2012-04-08 19:07:55 +03:00
Tom Lane
cea49fe82f Dept of second thoughts: improve the API for AnalyzeForeignTable.
If we make the initially-called function return the table physical-size
estimate, acquire_inherited_sample_rows will be able to use that to
allocate numbers of samples among child tables, when the day comes that
we want to support foreign tables in inheritance trees.
2012-04-06 16:04:10 -04:00
Tom Lane
263d9de66b Allow statistics to be collected for foreign tables.
ANALYZE now accepts foreign tables and allows the table's FDW to control
how the sample rows are collected.  (But only manual ANALYZEs will touch
foreign tables, for the moment, since among other things it's not very
clear how to handle remote permissions checks in an auto-analyze.)

contrib/file_fdw is extended to support this.

Etsuro Fujita, reviewed by Shigeru Hanada, some further tweaking by me.
2012-04-06 15:02:35 -04:00
Simon Riggs
8cb53654db Add DROP INDEX CONCURRENTLY [IF EXISTS], uses ShareUpdateExclusiveLock 2012-04-06 10:21:40 +01:00
Robert Haas
21cc529698 checkopint -> checkpoint
Report by Guillaume Lelarge.
2012-04-05 21:37:33 -04:00
Robert Haas
b736aef2ec Publish checkpoint timing information to pg_stat_bgwriter.
Greg Smith, Peter Geoghegan, and Robert Haas
2012-04-05 14:04:37 -04:00
Robert Haas
644828908f Expose track_iotiming data via the statistics collector.
Ants Aasma's original patch to add timing information for buffer I/O
requests exposed this data at the relation level, which was judged too
costly.  I've here exposed it at the database level instead.
2012-04-05 11:40:24 -04:00
Peter Eisentraut
38b9693fd9 Add support for renaming domain constraints 2012-04-03 08:11:51 +03:00
Tom Lane
5e83854d71 Add PGDLLIMPORT to ScanKeywords and NumScanKeywords.
Per buildfarm, this is now needed by contrib/pg_stat_statements.
2012-03-31 10:56:21 -04:00
Heikki Linnakangas
5762a4d909 Inherit max_safe_fds to child processes in EXEC_BACKEND mode.
Postmaster sets max_safe_fds by testing how many open file descriptors it
can open, and that is normally inherited by all child processes at fork().
Not so on EXEC_BACKEND, ie. Windows, however. Because of that, we
effectively ignored max_files_per_process on Windows, and always assumed
a conservative default of 32 simultaneous open files. That could have an
impact on performance, if you need to access a lot of different files
in a query. After this patch, the value is passed to child processes by
save/restore_backend_variables() among many other global variables.

It has been like this forever, but given the lack of complaints about it,
I'm not backpatching this.
2012-03-29 08:19:11 +03:00
Andrew Dunstan
d2c1740dc2 Remove now redundant pgpipe code. 2012-03-28 23:24:07 -04:00
Tom Lane
a40fa613b5 Add some infrastructure for contrib/pg_stat_statements.
Add a queryId field to Query and PlannedStmt.  This is not used by the
core backend, except for being copied around at appropriate times.
It's meant to allow plug-ins to track a particular query forward from
parse analysis to execution.

The queryId is intentionally not dumped into stored rules (and hence this
commit doesn't bump catversion).  You could argue that choice either way,
but it seems better that stored rule strings not have any dependency
on plug-ins that might or might not be present.

Also, add a post_parse_analyze_hook that gets invoked at the end of
parse analysis (but only for top-level analysis of complete queries,
not cases such as analyzing a domain's default-value expression).
This is mainly meant to be used to compute and assign a queryId,
but it could have other applications.

Peter Geoghegan
2012-03-27 15:17:40 -04:00
Robert Haas
40b9b95769 New GUC, track_iotiming, to track I/O timings.
Currently, the only way to see the numbers this gathers is via
EXPLAIN (ANALYZE, BUFFERS), but the plan is to add visibility through
the stats collector and pg_stat_statements in subsequent patches.

Ants Aasma, reviewed by Greg Smith, with some further changes by me.
2012-03-27 14:55:02 -04:00
Robert Haas
7386089d23 Code cleanup for heap_freeze_tuple.
It used to be case that lazy vacuum could call this function with only
a shared lock on the buffer, but neither lazy vacuum nor any other
code path does that any more.  Simplify the code accordingly and clean
up some related, obsolete comments.
2012-03-26 11:03:06 -04:00
Tom Lane
c7cea267de Replace empty locale name with implied value in CREATE DATABASE and initdb.
setlocale() accepts locale name "" as meaning "the locale specified by the
process's environment variables".  Historically we've accepted that for
Postgres' locale settings, too.  However, it's fairly unsafe to store an
empty string in a new database's pg_database.datcollate or datctype fields,
because then the interpretation could vary across postmaster restarts,
possibly resulting in index corruption and other unpleasantness.

Instead, we should expand "" to whatever it means at the moment of calling
CREATE DATABASE, which we can do by saving the value returned by
setlocale().

For consistency, make initdb set up the initial lc_xxx parameter values the
same way.  initdb was already doing the right thing for empty locale names,
but it did not replace non-empty names with setlocale results.  On a
platform where setlocale chooses to canonicalize the spellings of locale
names, this would result in annoying inconsistency.  (It seems that popular
implementations of setlocale don't do such canonicalization, which is a
pity, but the POSIX spec certainly allows it to be done.)  The same risk
of inconsistency leads me to not venture back-patching this, although it
could certainly be seen as a longstanding bug.

Per report from Jeff Davis, though this is not his proposed patch.
2012-03-25 21:47:22 -04:00
Tom Lane
8279eb4191 Fix planner's handling of outer PlaceHolderVars within subqueries.
For some reason, in the original coding of the PlaceHolderVar mechanism
I had supposed that PlaceHolderVars couldn't propagate into subqueries.
That is of course entirely possible.  When it happens, we need to treat
an outer-level PlaceHolderVar much like an outer Var or Aggref, that is
SS_replace_correlation_vars() needs to replace the PlaceHolderVar with
a Param, and then when building the finished SubPlan we have to provide
the PlaceHolderVar expression as an actual parameter for the SubPlan.
The handling of the contained expression is a bit delicate but it can be
treated exactly like an Aggref's expression.

In addition to the missing logic in subselect.c, prepjointree.c was failing
to search subqueries for PlaceHolderVars that need their relids adjusted
during subquery pullup.  It looks like everyplace else that touches
PlaceHolderVars got it right, though.

Per report from Mark Murawski.  In 9.1 and HEAD, queries affected by this
oversight would fail with "ERROR: Upper-level PlaceHolderVar found where
not expected".  But in 9.0 and 8.4, you'd silently get possibly-wrong
answers, since the value transmitted into the subquery wouldn't go to null
when it should.
2012-03-24 16:21:39 -04:00
Tom Lane
0339047bc9 Code review for protransform patches.
Fix loss of previous expression-simplification work when a transform
function fires: we must not simply revert to untransformed input tree.
Instead build a dummy FuncExpr node to pass to the transform function.
This has the additional advantage of providing a simpler, more uniform
API for transform functions.

Move documentation to a somewhat less buried spot, relocate some
poorly-placed code, be more wary of null constants and invalid typmod
values, add an opr_sanity check on protransform function signatures,
and some other minor cosmetic adjustments.

Note: although this patch touches pg_proc.h, no need for catversion
bump, because the changes are cosmetic and don't actually change the
intended catalog contents.
2012-03-23 17:29:57 -04:00
Peter Eisentraut
0e85abd658 Clean up compiler warnings from unused variables with asserts disabled
For those variables only used when asserts are enabled, use a new
macro PG_USED_FOR_ASSERTS_ONLY, which expands to
__attribute__((unused)) when asserts are not enabled.
2012-03-21 23:33:10 +02:00
Tom Lane
9dbf2b7d75 Restructure SELECT INTO's parsetree representation into CreateTableAsStmt.
Making this operation look like a utility statement seems generally a good
idea, and particularly so in light of the desire to provide command
triggers for utility statements.  The original choice of representing it as
SELECT with an IntoClause appendage had metastasized into rather a lot of
places, unfortunately, so that this patch is a great deal more complicated
than one might at first expect.

In particular, keeping EXPLAIN working for SELECT INTO and CREATE TABLE AS
subcommands required restructuring some EXPLAIN-related APIs.  Add-on code
that calls ExplainOnePlan or ExplainOneUtility, or uses
ExplainOneQuery_hook, will need adjustment.

Also, the cases PREPARE ... SELECT INTO and CREATE RULE ... SELECT INTO,
which formerly were accepted though undocumented, are no longer accepted.
The PREPARE case can be replaced with use of CREATE TABLE AS EXECUTE.
The CREATE RULE case doesn't seem to have much real-world use (since the
rule would work only once before failing with "table already exists"),
so we'll not bother with that one.

Both SELECT INTO and CREATE TABLE AS still return a command tag of
"SELECT nnnn".  There was some discussion of returning "CREATE TABLE nnnn",
but for the moment backwards compatibility wins the day.

Andres Freund and Tom Lane
2012-03-19 21:38:12 -04:00
Tom Lane
dd4134ea56 Revisit handling of UNION ALL subqueries with non-Var output columns.
In commit 57664ed25e I tried to fix a bug
reported by Teodor Sigaev by making non-simple-Var output columns distinct
(by wrapping their expressions with dummy PlaceHolderVar nodes).  This did
not work too well.  Commit b28ffd0fcc fixed
some ensuing problems with matching to child indexes, but per a recent
report from Claus Stadler, constraint exclusion of UNION ALL subqueries was
still broken, because constant-simplification didn't handle the injected
PlaceHolderVars well either.  On reflection, the original patch was quite
misguided: there is no reason to expect that EquivalenceClass child members
will be distinct.  So instead of trying to make them so, we should ensure
that we can cope with the situation when they're not.

Accordingly, this patch reverts the code changes in the above-mentioned
commits (though the regression test cases they added stay).  Instead, I've
added assorted defenses to make sure that duplicate EC child members don't
cause any problems.  Teodor's original problem ("MergeAppend child's
targetlist doesn't match MergeAppend") is addressed more directly by
revising prepare_sort_from_pathkeys to let the parent MergeAppend's sort
list guide creation of each child's sort list.

In passing, get rid of add_sort_column; as far as I can tell, testing for
duplicate sort keys at this stage is dead code.  Certainly it doesn't
trigger often enough to be worth expending cycles on in ordinary queries.
And keeping the test would've greatly complicated the new logic in
prepare_sort_from_pathkeys, because comparing pathkey list entries against
a previous output array requires that we not skip any entries in the list.

Back-patch to 9.1, like the previous patches.  The only known issue in
this area that wasn't caused by the ill-advised previous patches was the
MergeAppend planning failure, which of course is not relevant before 9.1.
It's possible that we need some of the new defenses against duplicate child
EC entries in older branches, but until there's some clear evidence of that
I'm going to refrain from back-patching further.
2012-03-16 13:11:55 -04:00
Heikki Linnakangas
aef5fe7efe Add comments explaining why our Itanium spinlock implementation is safe. 2012-03-16 10:14:45 +02:00
Peter Eisentraut
eb990a2b9e Add const qualifier to tzn returned by timestamp2tm()
The tzn value might come from tm->tm_zone, which libc declares as
const, so it's prudent that the upper layers know about this as well.
2012-03-15 21:17:19 +02:00
Peter Eisentraut
ad4fb0d0d2 Improve EncodeDateTime and EncodeTimeOnly APIs
Use an explicit argument to tell whether to include the time zone in
the output, rather than using some undocumented pointer magic.
2012-03-14 23:03:34 +02:00
Tom Lane
c6a11b89e4 Teach SPGiST to store nulls and do whole-index scans.
This patch fixes the other major compatibility-breaking limitation of
SPGiST, that it didn't store anything for null values of the indexed
column, and so could not support whole-index scans or "x IS NULL"
tests.  The approach is to create a wholly separate search tree for
the null entries, and use fixed "allTheSame" insertion and search
rules when processing this tree, instead of calling the index opclass
methods.  This way the opclass methods do not need to worry about
dealing with nulls.

Catversion bump is for pg_am updates as well as the change in on-disk
format of SPGiST indexes; there are some tweaks in SPGiST WAL records
as well.

Heavily rewritten version of a patch by Oleg Bartunov and Teodor Sigaev.
(The original also stored nulls separately, but it reused GIN code to do
so; which required undesirable compromises in the on-disk format, and
would likely lead to bugs due to the GIN code being required to work in
two very different contexts.)
2012-03-11 16:29:59 -04:00
Tom Lane
03e56f798e Restructure SPGiST opclass interface API to support whole-index scans.
The original API definition was incapable of supporting whole-index scans
because there was no way to invoke leaf-value reconstruction without
checking any qual conditions.  Also, it was inefficient for
multiple-qual-condition scans because value reconstruction got done over
again for each qual condition, and because other internal work in the
consistent functions likewise had to be done for each qual.  To fix these
issues, pass the whole scankey array to the opclass consistent functions,
instead of only letting them see one item at a time.  (Essentially, the
loop over scankey entries is now inside the consistent functions not
outside them.  This makes the consistent functions a bit more complicated,
but not unreasonably so.)

In itself this commit does nothing except save a few cycles in
multiple-qual-condition index scans, since we can't support whole-index
scans on SPGiST indexes until nulls are included in the index.  However,
I consider this a must-fix for 9.2 because once we release it will get
very much harder to change the opclass API definition.
2012-03-10 18:36:49 -05:00
Peter Eisentraut
39d74e346c Add support for renaming constraints
reviewed by Josh Berkus and Dimitri Fontaine
2012-03-10 20:19:13 +02:00
Robert Haas
07d1edb954 Extend object access hook framework to support arguments, and DROP.
This allows loadable modules to get control at drop time, perhaps for the
purpose of performing additional security checks or to log the event.
The initial purpose of this code is to support sepgsql, but other
applications should be possible as well.

KaiGai Kohei, reviewed by me.
2012-03-09 14:34:56 -05:00
Tom Lane
b14953932d Revise FDW planning API, again.
Further reflection shows that a single callback isn't very workable if we
desire to let FDWs generate multiple Paths, because that forces the FDW to
do all work necessary to generate a valid Plan node for each Path.  Instead
split the former PlanForeignScan API into three steps: GetForeignRelSize,
GetForeignPaths, GetForeignPlan.  We had already bit the bullet of breaking
the 9.1 FDW API for 9.2, so this shouldn't cause very much additional pain,
and it's substantially more flexible for complex FDWs.

Add an fdw_private field to RelOptInfo so that the new functions can save
state there rather than possibly having to recalculate information two or
three times.

In addition, we'd not thought through what would be needed to allow an FDW
to set up subexpressions of its choice for runtime execution.  We could
treat ForeignScan.fdw_private as an executable expression but that seems
likely to break existing FDWs unnecessarily (in particular, it would
restrict the set of node types allowable in fdw_private to those supported
by expression_tree_walker).  Instead, invent a separate field fdw_exprs
which will receive the postprocessing appropriate for expression trees.
(One field is enough since it can be a list of expressions; also, we assume
the corresponding expression state tree(s) will be held within fdw_state,
so we don't need to add anything to ForeignScanState.)

Per review of Hanada Shigeru's pgsql_fdw patch.  We may need to tweak this
further as we continue to work on that patch, but to me it feels a lot
closer to being right now.
2012-03-09 12:49:25 -05:00
Tom Lane
08dd23cec7 Fix some issues with temp/transient tables in extension scripts.
Phil Sorber reported that a rewriting ALTER TABLE within an extension
update script failed, because it creates and then drops a placeholder
table; the drop was being disallowed because the table was marked as an
extension member.  We could hack that specific case but it seems likely
that there might be related cases now or in the future, so the most
practical solution seems to be to create an exception to the general rule
that extension member objects can only be dropped by dropping the owning
extension.  To wit: if the DROP is issued within the extension's own
creation or update scripts, we'll allow it, implicitly performing an
"ALTER EXTENSION DROP object" first.  This will simplify cases such as
extension downgrade scripts anyway.

No docs change since we don't seem to have documented the idea that you
would need ALTER EXTENSION DROP for such an action to begin with.

Also, arrange for explicitly temporary tables to not get linked as
extension members in the first place, and the same for the magic
pg_temp_nnn schemas that are created to hold them.  This prevents assorted
unpleasant results if an extension script creates a temp table: the forced
drop at session end would either fail or remove the entire extension, and
neither of those outcomes is desirable.  Note that this doesn't fix the
ALTER TABLE scenario, since the placeholder table is not temp (unless the
table being rewritten is).

Back-patch to 9.1.
2012-03-08 15:53:09 -05:00
Tom Lane
9088d1b965 Add GetForeignColumnOptions() to foreign.c, and add some documentation.
GetForeignColumnOptions provides some abstraction for accessing
column-specific FDW options, on a par with the access functions that were
already provided here for other FDW-related information.

Adjust file_fdw.c to use GetForeignColumnOptions instead of equivalent
hand-rolled code.

In addition, add some SGML documentation for the functions exported by
foreign.c that are meant for use by FDW authors.

(This is the fdw_helper portion of the proposed pgsql_fdw patch.)

Hanada Shigeru, reviewed by KaiGai Kohei
2012-03-07 18:20:58 -05:00
Tom Lane
d4bf3c9c94 Expose an API for calculating catcache hash values.
Now that cache invalidation callbacks get only a hash value, and not a
tuple TID (per commits 632ae6829f and
b5282aa893), the only way they can restrict
what they invalidate is to know what the hash values mean.  setrefs.c was
doing this via a hard-wired assumption but that seems pretty grotty, and
it'll only get worse as more cases come up.  So let's expose a calculation
function that takes the same parameters as SearchSysCache.  Per complaint
from Marko Kreen.
2012-03-07 14:51:13 -05:00
Tom Lane
19dbc34631 Add a hook for processing messages due to be sent to the server log.
Use-cases for this include custom log filtering rules and custom log
message transmission mechanisms (for instance, lossy log message
collection, which has been discussed several times recently).

As is our common practice for hooks, there's no regression test nor
user-facing documentation for this, though the author did exhibit a
sample module using the hook.

Martin Pihlak, reviewed by Marti Raudsepp
2012-03-06 15:35:41 -05:00
Tom Lane
6b289942bf Redesign PlanForeignScan API to allow multiple paths for a foreign table.
The original API specification only allowed an FDW to create a single
access path, which doesn't seem like a terribly good idea in hindsight.
Instead, move the responsibility for building the Path node and calling
add_path() into the FDW's PlanForeignScan function.  Now, it can do that
more than once if appropriate.  There is no longer any need for the
transient FdwPlan struct, so get rid of that.

Etsuro Fujita, Shigeru Hanada, Tom Lane
2012-03-05 16:15:59 -05:00
Magnus Hagander
bc5ac36865 Add function pg_xlog_location_diff to help comparisons
Comparing two xlog locations are useful for example when calculating
replication lag.

Euler Taveira de Oliveira, reviewed by Fujii Masao, and some cleanups
from me
2012-03-04 12:22:38 +01:00
Tom Lane
0e5e167aae Collect and use element-frequency statistics for arrays.
This patch improves selectivity estimation for the array <@, &&, and @>
(containment and overlaps) operators.  It enables collection of statistics
about individual array element values by ANALYZE, and introduces
operator-specific estimators that use these stats.  In addition,
ScalarArrayOpExpr constructs of the forms "const = ANY/ALL (array_column)"
and "const <> ANY/ALL (array_column)" are estimated by treating them as
variants of the containment operators.

Since we still collect scalar-style stats about the array values as a
whole, the pg_stats view is expanded to show both these stats and the
array-style stats in separate columns.  This creates an incompatible change
in how stats for tsvector columns are displayed in pg_stats: the stats
about lexemes are now displayed in the array-related columns instead of the
original scalar-related columns.

There are a few loose ends here, notably that it'd be nice to be able to
suppress either the scalar-style stats or the array-element stats for
columns for which they're not useful.  But the patch is in good enough
shape to commit for wider testing.

Alexander Korotkov, reviewed by Noah Misch and Nathan Boley
2012-03-03 20:20:57 -05:00
Peter Eisentraut
6688d2878e Add COLLATION FOR expression
reviewed by Jaime Casanova
2012-03-02 21:12:16 +02:00
Alvaro Herrera
3433c6ba00 Remove TOAST table from pg_database
The only toastable column now is datacl, but we don't really support
long ACLs anyway.  The TOAST table should have been removed when the
pg_db_role_setting catalog was introduced in commit
2eda8dfb52, but I forgot to do that.

Per -hackers discussion on March 2011.
2012-03-01 12:50:52 -03:00
Alvaro Herrera
58e9f974dc Fix typo in comment
Haifeng Liu
2012-02-28 23:52:52 -03:00
Tom Lane
5c02a00d44 Move CRC tables to libpgport, and provide them in a separate include file.
This makes it much more convenient to build tools for Postgres that are
separately compiled and require a matching CRC implementation.

To prevent multiple copies of the CRC polynomial tables being introduced
into the postgres binaries, they are now included in the static library
libpgport that is mainly meant for replacement system functions.  That
seems like a bit of a kludge, but there's no better place.

This cleans up building of the tools pg_controldata and pg_resetxlog,
which previously had to build their own copies of pg_crc.o.

In the future, external programs that need access to the CRC tables can
include the tables directly from the new header file pg_crc_tables.h.

Daniel Farina, reviewed by Abhijit Menon-Sen and Tom Lane
2012-02-28 19:53:39 -05:00
Peter Eisentraut
973e9fb294 Add const qualifiers where they are accidentally cast away
This only produces warnings under -Wcast-qual, but it's more correct
and consistent in any case.
2012-02-28 12:42:08 +02:00
Alvaro Herrera
cb3a7c2b95 ALTER TABLE: skip FK validation when it's safe to do so
We already skip rewriting the table in these cases, but we still force a
whole table scan to validate the data.  This can be skipped, and thus
we can make the whole ALTER TABLE operation just do some catalog touches
instead of scanning the table, when these two conditions hold:

(a) Old and new pg_constraint.conpfeqop match exactly.  This is actually
stronger than needed; we could loosen things by way of operator
families, but it'd require a lot more effort.

(b) The functions, if any, implementing a cast from the foreign type to
the primary opcintype are the same.  For this purpose, we can consider a
binary coercion equivalent to an exact type match.  When the opcintype
is polymorphic, require that the old and new foreign types match
exactly.  (Since ri_triggers.c does use the executor, the stronger check
for polymorphic types is no mere future-proofing.  However, no core type
exercises its necessity.)

Author: Noah Misch

Committer's note: catalog version bumped due to change of the Constraint
node.  I can't actually find any way to have such a node in a stored
rule, but given that we have "out" support for them, better be safe.
2012-02-27 19:10:24 -03:00
Peter Eisentraut
66f0cf7da8 Remove useless const qualifier
Claiming that the typevar argument to DefineCompositeType() is const
was a plain lie.  A similar case in DefineVirtualRelation() was
already changed in passing in commit 1575fbcb.  Also clean up the now
unnecessary casts that used to cast away the const.
2012-02-26 15:22:27 +02:00
Tom Lane
587359479a Avoid repeated creation/freeing of per-subre DFAs during regex search.
In nested sub-regex trees, lower-level nodes created DFAs and then
destroyed them again before exiting, which is a bit dumb considering that
the recursive search is likely to call those nodes again later.  Instead
cache each created DFA until the end of pg_regexec().  This is basically a
space for time tradeoff, in that it might increase the maximum memory
usage.  However, in most regex patterns there are not all that many subre
nodes, so not that many DFAs --- and in any case, the peak usage occurs
when reaching the bottom recursion level, and except for alternation cases
that's going to be the same anyway.
2012-02-24 18:40:30 -05:00
Tom Lane
3cbfe485e4 Remove useless "retry memory" logic within regex engine.
Apparently some primordial version of Spencer's engine needed cdissect()
and child functions to be able to continue matching from a previous
position when re-called.  That is dead code, though, since trivial
inspection shows that cdissect can never be entered without having
previously done zapmem which resets the relevant retry counter.  I have
also verified experimentally that no case in the Tcl regression tests
reaches cdissect with a nonzero retry value.  Accordingly, remove that
logic.  This doesn't really save any noticeable number of cycles in itself,
but it is one step towards making dissect() and cdissect() equivalent,
which will allow removing hundreds of lines of near-duplicated code.

Since struct subre's "retry" field is no longer particularly related to
any kind of retry, rename it to "id".  As of this commit it's only used
for identifying a subre node in debug printouts, so you might think we
should get rid of the field entirely; but I have a plan for another use.
2012-02-24 18:40:28 -05:00
Tom Lane
173e29aa5d Fix the general case of quantified regex back-references.
Cases where a back-reference is part of a larger subexpression that
is quantified have never worked in Spencer's regex engine, because
he used a compile-time transformation that neglected the need to
check the back-reference match in iterations before the last one.
(That was okay for capturing parens, and we still do it if the
regex has *only* capturing parens ... but it's not okay for backrefs.)

To make this work properly, we have to add an "iteration" node type
to the regex engine's vocabulary of sub-regex nodes.  Since this is a
moderately large change with a fair risk of introducing new bugs of its
own, apply to HEAD only, even though it's a fix for a longstanding bug.
2012-02-24 01:41:03 -05:00
Tom Lane
077711c2e3 Remove arbitrary limitation on length of common name in SSL certificates.
Both libpq and the backend would truncate a common name extracted from a
certificate at 32 bytes.  Replace that fixed-size buffer with dynamically
allocated string so that there is no hard limit.  While at it, remove the
code for extracting peer_dn, which we weren't using for anything; and
don't bother to store peer_cn longer than we need it in libpq.

This limit was not so terribly unreasonable when the code was written,
because we weren't using the result for anything critical, just logging it.
But now that there are options for checking the common name against the
server host name (in libpq) or using it as the user's name (in the server),
this could result in undesirable failures.  In the worst case it even seems
possible to spoof a server name or user name, if the correct name is
exactly 32 bytes and the attacker can persuade a trusted CA to issue a
certificate in which that string is a prefix of the certificate's common
name.  (To exploit this for a server name, he'd also have to send the
connection astray via phony DNS data or some such.)  The case that this is
a realistic security threat is a bit thin, but nonetheless we'll treat it
as one.

Back-patch to 8.4.  Older releases contain the faulty code, but it's not
a security problem because the common name wasn't used for anything
interesting.

Reported and patched by Heikki Linnakangas

Security: CVE-2012-0867
2012-02-23 15:48:04 -05:00
Tom Lane
74e29162a4 Allow MinGW builds to use standardly-named OpenSSL libraries.
In the Fedora variant of MinGW, the openssl libraries have their normal
names, not libeay32 and libssleay32.  Adjust configure probes to allow
that, per bug #6486.

Tomasz Ostrowski
2012-02-23 15:05:08 -05:00
Robert Haas
2254367435 Make EXPLAIN (BUFFERS) track blocks dirtied, as well as those written.
Also expose the new counters through pg_stat_statements.

Patch by me.  Review by Fujii Masao and Greg Smith.
2012-02-22 20:33:05 -05:00
Peter Eisentraut
a445cb92ef Add parameters for controlling locations of server-side SSL files
This allows changing the location of the files that were previously
hard-coded to server.crt, server.key, root.crt, root.crl.

server.crt and server.key continue to be the default settings and are
thus required to be present by default if SSL is enabled.  But the
settings for the server-side CA and CRL are now empty by default, and
if they are set, the files are required to be present.  This replaces
the previous behavior of ignoring the functionality if the files were
not found.
2012-02-22 23:40:46 +02:00
Alvaro Herrera
a417f85e1d REASSIGN OWNED: Support foreign data wrappers and servers
This was overlooked when implementing those kinds of objects, in commit
cae565e503.

Per report from Pawel Casperek.
2012-02-22 17:33:12 -03:00
Tom Lane
9789c99d01 Cosmetic cleanup for commit a760893dbd.
Mostly, fixing overlooked comments.
2012-02-21 14:14:16 -05:00
Tom Lane
27af91438b Create the beginnings of internals documentation for the regex code.
Create src/backend/regex/README to hold an implementation overview of
the regex package, and fill it in with some preliminary notes about
the code's DFA/NFA processing and colormap management.  Much more to
do there of course.

Also, improve some code comments around the colormap and cvec code.
No functional changes except to add one missing assert.
2012-02-19 18:58:23 -05:00
Andrew Dunstan
2f582f76b1 Improve pretty printing of viewdefs.
Some line feeds are added to target lists and from lists to make
them more readable. By default they wrap at 80 columns if possible,
but the wrap column is also selectable - if 0 it wraps after every
item.

Andrew Dunstan, reviewed by Hitoshi Harada.
2012-02-19 11:43:46 -05:00
Tom Lane
4767bc8ff2 Improve statistics estimation to make some use of DISTINCT in sub-queries.
Formerly, we just punted when trying to estimate stats for variables coming
out of sub-queries using DISTINCT, on the grounds that whatever stats we
might have for underlying table columns would be inapplicable.  But if the
sub-query has only one DISTINCT column, we can consider its output variable
as being unique, which is useful information all by itself.  The scope of
this improvement is pretty narrow, but it costs nearly nothing, so we might
as well do it.  Per discussion with Andres Freund.

This patch differs from the draft I submitted yesterday in updating various
comments about vardata.isunique (to reflect its extended meaning) and in
tweaking the interaction with security_barrier views.  There does not seem
to be a reason why we can't use this sort of knowledge even when the
sub-query is such a view.
2012-02-16 17:34:00 -05:00
Tom Lane
4bfe68dfab Run a portal's cleanup hook immediately when pushing it to FAILED state.
This extends the changes of commit 6252c4f9e2
so that we run the cleanup hook earlier for failure cases as well as
success cases.  As before, the point is to avoid an assertion failure from
an Assert I added in commit a874fe7b4c, which
was meant to check that no user-written code can be called during portal
cleanup.  This fixes a case reported by Pavan Deolasee in which the Assert
could be triggered during backend exit (see the new regression test case),
and also prevents the possibility that the cleanup hook is run after
portions of the portal's state have already been recycled.  That doesn't
really matter in current usage, but it foreseeably could matter in the
future.

Back-patch to 9.1 where the Assert in question was added.
2012-02-15 16:19:01 -05:00
Tom Lane
398f70ec07 Preserve column names in the execution-time tupledesc for a RowExpr.
The hstore and json datatypes both have record-conversion functions that
pay attention to column names in the composite values they're handed.
We used to not worry about inserting correct field names into tuple
descriptors generated at runtime, but given these examples it seems
useful to do so.  Observe the nicer-looking results in the regression
tests whose results changed.

catversion bump because there is a subtle change in requirements for stored
rule parsetrees: RowExprs from ROW() constructs now have to include field
names.

Andrew Dunstan and Tom Lane
2012-02-14 17:34:56 -05:00
Robert Haas
cd30728fb2 Allow LEAKPROOF functions for better performance of security views.
We don't normally allow quals to be pushed down into a view created
with the security_barrier option, but functions without side effects
are an exception: they're OK.  This allows much better performance in
common cases, such as when using an equality operator (that might
even be indexable).

There is an outstanding issue here with the CREATE FUNCTION / ALTER
FUNCTION syntax: there's no way to use ALTER FUNCTION to unset the
leakproof flag.  But I'm committing this as-is so that it doesn't
have to be rebased again; we can fix up the grammar in a future
commit.

KaiGai Kohei, with some wordsmithing by me.
2012-02-13 22:21:14 -05:00
Tom Lane
cbba55d6d7 Support min/max index optimizations on boolean columns.
Since bool_and() is equivalent to min(), and bool_or() to max(), we might
as well let them be index-optimized in the same way.  The practical value
of this is debatable at best, but it seems nearly cost-free to enable it.
Code-wise, we need only adjust the entries in pg_aggregate.  There is a
measurable planning speed penalty for a query involving one of these
aggregates, but it is only a few percent in simple cases, so that seems
acceptable.

Marti Raudsepp, reviewed by Abhijit Menon-Sen
2012-02-08 12:41:48 -05:00
Tom Lane
3db6524fe6 Mark some more I/O-conversion-invoking functions as stable not volatile.
When written, textanycat, anytextcat, quote_literal, and quote_nullable
were marked volatile, because they could invoke arbitrary type-specific
output functions as part of casting their anyelement arguments to text.
Since then, we have defined a project policy that I/O functions must not
be volatile, as per commit aab353a60b.
So these functions can safely be downgraded to stable.  Most of the time
this makes no difference since they'll get inlined anyway, but as noted
by Andrew Dunstan, there are cases where the volatile marking prevents
optimizations that the planner does before function inlining.  (I think
I might have overlooked these functions in the earlier commit on the
grounds that inlining would make it moot, but not so --- tgl)

This change results in a change in the expected output of the json
regression tests, because the planner can now flatten a sub-select
that it failed to before.  The old output is preferable, but getting
that back will require some as-yet-unfinished work on RowExpr handling.

Marti Raudsepp
2012-02-08 11:29:29 -05:00
Robert Haas
c13897983a Add transform functions for various temporal typmod coercisions.
This enables ALTER TABLE to skip table and index rebuilds in some cases.

Noah Misch, with trivial changes by me.
2012-02-08 09:33:37 -05:00
Heikki Linnakangas
1a01560cbb Rename LWLockWaitUntilFree to LWLockAcquireOrWait.
LWLockAcquireOrWait makes it more clear that the lock is acquired if it's
free.
2012-02-08 09:17:13 +02:00
Robert Haas
4f658dc851 Support fls().
The immediate impetus for this is that Noah Misch's patch to elide
unnecessary table and index rebuilds when changing typmod for temporal
types uses it; and this is extracted from that patch, with some
further commentary by me.  But it seems logically separate from the
remainder of the patch, so I'm committing it separately; this is not
the first time someone has wanted fls() in the backend and probably
won't be the last.

If we end up using this in more performance-critical spots it may be
worthwhile to add some architecture-specific optimizations to our
src/port version of fls() - e.g. any x86 platform can implement this
using the assembly instruction BSRL.  But performance won't matter
a bit for assessing typmod changes, so I'm not worried about that
right now.
2012-02-07 13:45:46 -05:00
Robert Haas
f7d7dade8a Add a transform function for varbit typmod coercisions.
This enables ALTER TABLE to skip table and index rebuilds when the
new type is unconstraint varbit, or when the allowable number of bits
is not decreasing.

Noah Misch, with review and a fix for an OID collision by me.
2012-02-07 12:42:50 -05:00
Robert Haas
3cc0800829 Add a transform function for numeric typmod coercisions.
This enables ALTER TABLE to skip table and index rebuilds when a column
is changed to an unconstrained numeric, or when the scale is unchanged
and the precision does not decrease.

Noah Misch, with a few stylistic changes and a fix for an OID
collision by me.
2012-02-07 12:08:26 -05:00
Robert Haas
af7914c662 Add TIMING option to EXPLAIN, to allow eliminating of timing overhead.
Sometimes it may be useful to get actual row counts out of EXPLAIN
(ANALYZE) without paying the cost of timing every node entry/exit.
With this patch, you can say EXPLAIN (ANALYZE, TIMING OFF) to get that.

Tomas Vondra, reviewed by Eric Theise, with minor doc changes by me.
2012-02-07 11:23:04 -05:00
Andrew Dunstan
39909d1d39 Add array_to_json and row_to_json functions.
Also move the escape_json function from explain.c to json.c where it
seems to belong.

Andrew Dunstan, Reviewd by Abhijit Menon-Sen.
2012-02-03 12:11:16 -05:00
Robert Haas
0ed7445d73 Allow spgist's text_ops to handle pattern-matching operators.
This was presumably intended to work this way all along, but a few key
bits of indxpath.c didn't get the memo.

Robert Haas and Tom Lane
2012-02-02 13:10:56 -05:00
Robert Haas
c327108140 Catversion bump for JSON patch.
Sigh.
2012-01-31 11:51:51 -05:00
Robert Haas
5384a73f98 Built-in JSON data type.
Like the XML data type, we simply store JSON data as text, after checking
that it is valid.  More complex operations such as canonicalization and
comparison may come later, but this is enough for not.

There are a few open issues here, such as whether we should attempt to
detect UTF-8 surrogate pairs represented as \uXXXX\uYYYY, but this gets
the basic framework in place.
2012-01-31 11:48:23 -05:00
Heikki Linnakangas
9b38d46d9f Make group commit more effective.
When a backend needs to flush the WAL, and someone else is already flushing
the WAL, wait until it releases the WALInsertLock and check if we still need
to do the flush or if the other backend already did the work for us, before
acquiring WALInsertLock. This helps group commit, because when the WAL flush
finishes, all the backends that were waiting for it can be woken up in one
go, and the can all concurrently observe that they're done, rather than
waking them up one by one in a cascading fashion.

This is based on a new LWLock function, LWLockWaitUntilFree(), which has
peculiar semantics. If the lock is immediately free, it grabs the lock and
returns true. If it's not free, it waits until it is released, but then
returns false without grabbing the lock. This is used in XLogFlush(), so
that when the lock is acquired, the backend flushes the WAL, but if it's
not, the backend first checks the current flush location before retrying.

Original patch and benchmarking by Peter Geoghegan and Simon Riggs, although
this patch as committed ended up being very different from that.
2012-01-30 16:53:48 +02:00
Simon Riggs
73f617f13f Various minor comments changes from bgwriter to checkpointer. 2012-01-30 14:34:25 +00:00
Tom Lane
ad10853b30 Assorted comment fixes, mostly just typos, but some obsolete statements.
YAMAMOTO Takashi
2012-01-29 19:23:56 -05:00
Tom Lane
dd243b3e40 Fix typo in comment.
Peter Geoghegan
2012-01-29 18:56:35 -05:00
Tom Lane
e2fa76d80b Use parameterized paths to generate inner indexscans more flexibly.
This patch fixes the planner so that it can generate nestloop-with-
inner-indexscan plans even with one or more levels of joining between
the indexscan and the nestloop join that is supplying the parameter.
The executor was fixed to handle such cases some time ago, but the
planner was not ready.  This should improve our plans in many situations
where join ordering restrictions formerly forced complete table scans.

There is probably a fair amount of tuning work yet to be done, because
of various heuristics that have been added to limit the number of
parameterized paths considered.  However, we are not going to find out
what needs to be adjusted until the code gets some real-world use, so
it's time to get it in there where it can be tested easily.

Note API change for index AM amcostestimate functions.  I'm not aware of
any non-core index AMs, but if there are any, they will need minor
adjustments.
2012-01-27 19:26:38 -05:00
Peter Eisentraut
b376ec6fa5 Show default privileges in information schema
Hitherto, the information schema only showed explicitly granted
privileges that were visible in the *acl catalog columns.  If no
privileges had been granted, the implicit privileges were not shown.

To fix that, add an SQL-accessible version of the acldefault()
function, and use that inside the aclexplode() calls to substitute the
catalog-specific default privilege set for null values.

reviewed by Abhijit Menon-Sen
2012-01-27 21:58:51 +02:00
Peter Eisentraut
2787458362 Disallow ALTER DOMAIN on non-domain type everywhere
This has been the behavior already in most cases, but through
omission, ALTER DOMAIN / OWNER TO and ALTER DOMAIN / SET SCHEMA would
silently work on non-domain types as well.
2012-01-27 21:20:34 +02:00
Peter Eisentraut
8137f2c323 Hide most variable-length fields from Form_pg_* structs
Those fields only appear in the structs so that genbki.pl can create
the BKI bootstrap files for the catalogs.  But they are not actually
usable from C.  So hiding them can prevent coding mistakes, saves
stack space, and can help the compiler.

In certain catalogs, the first variable-length field has been kept
visible after manual inspection.  These exceptions are noted in C
comments.

reviewed by Tom Lane
2012-01-27 20:16:17 +02:00
Heikki Linnakangas
6d90eaaa89 Make bgwriter sleep longer when it has no work to do, to save electricity.
To make it wake up promptly when activity starts again, backends nudge it
by setting a latch in MarkBufferDirty(). The latch is kept set while
bgwriter is active, so there is very little overhead from that when the
system is busy. It is only armed before going into longer sleep.

Peter Geoghegan, with some changes by me.
2012-01-26 18:39:13 +02:00
Magnus Hagander
61cb8c5abb Add deadlock counter to pg_stat_database
Adds a counter that tracks number of deadlocks that occurred in
each database to pg_stat_database.

Magnus Hagander, reviewed by Jaime Casanova
2012-01-26 15:58:19 +01:00
Robert Haas
0e549697d1 Classify DROP operations by whether or not they are user-initiated.
This doesn't do anything useful just yet, but is intended as supporting
infrastructure for allowing sepgsql to sensibly check DROP permissions.

KaiGai Kohei and Robert Haas
2012-01-26 09:30:27 -05:00
Magnus Hagander
bc3347484a Track temporary file count and size in pg_stat_database
Add counters for number and size of temporary files used
for spill-to-disk queries for each database to the
pg_stat_database view.

Tomas Vondra, review by Magnus Hagander
2012-01-26 14:41:19 +01:00
Robert Haas
9f9135d129 Instrument index-only scans to count heap fetches performed.
Patch by me; review by Tom Lane, Jeff Davis, and Peter Geoghegan.
2012-01-25 20:41:52 -05:00
Simon Riggs
8366c7803e Allow pg_basebackup from standby node with safety checking.
Base backup follows recommended procedure, plus goes to great
lengths to ensure that partial page writes are avoided.

Jun Ishizuka and Fujii Masao, with minor modifications
2012-01-25 18:02:04 +00:00
Alvaro Herrera
74ab96a45e Add pg_trigger_depth() function
This reports the depth level of triggers currently in execution, or zero
if not called from inside a trigger.

No catversion bump in this patch, but you have to initdb if you want
access to the new function.

Author: Kevin Grittner
2012-01-25 13:22:54 -03:00
Simon Riggs
443b4821f1 Add new replication mode synchronous_commit = 'write'.
Replication occurs only to memory on standby, not to disk,
so provides additional performance if user wishes to
reduce durability level slightly. Adds concept of multiple
independent sync rep queues.

Fujii Masao and Simon Riggs
2012-01-24 20:22:37 +00:00
Simon Riggs
c172b7b02e Resolve timing issue with logging locks for Hot Standby.
We log AccessExclusiveLocks for replay onto standby nodes,
but because of timing issues on ProcArray it is possible to
log a lock that is still held by a just committed transaction
that is very soon to be removed. To avoid any timing issue we
avoid applying locks made by transactions with InvalidXid.

Simon Riggs, bug report Tom Lane, diagnosis Pavan Deolasee
2012-01-23 23:37:32 +00:00
Simon Riggs
b8a91d9d1c ALTER <thing> [IF EXISTS] ... allows silent DDL if required,
e.g. ALTER FOREIGN TABLE IF EXISTS foo RENAME TO bar

Pavel Stehule
2012-01-23 23:25:04 +00:00
Robert Haas
cc53a1e7cc Add bitwise AND, OR, and NOT operators for macaddr data type.
Brendan Jurd, reviewed by Fujii Masao
2012-01-19 15:25:14 -05:00
Magnus Hagander
4f42b546fd Separate state from query string in pg_stat_activity
This separates the state (running/idle/idleintransaction etc) into
it's own field ("state"), and leaves the query field containing just
query text.

The query text will now mean "current query" when a query is running
and "last query" in other states. Accordingly,the field has been
renamed from current_query to query.

Since backwards compatibility was broken anyway to make that, the procpid
field has also been renamed to pid - along with the same field in
pg_stat_replication for consistency.

Scott Mead and Magnus Hagander, review work from Greg Smith
2012-01-19 14:19:20 +01:00
Robert Haas
1575fbcb79 Prevent adding relations to a concurrently dropped schema.
In the previous coding, it was possible for a relation to be created
via CREATE TABLE, CREATE VIEW, CREATE SEQUENCE, CREATE FOREIGN TABLE,
etc.  in a schema while that schema was meanwhile being concurrently
dropped.  This led to a pg_class entry with an invalid relnamespace
value.  The same problem could occur if a relation was moved using
ALTER .. SET SCHEMA while the target schema was being concurrently
dropped.  This patch prevents both of those scenarios by locking the
schema to which the relation is being added using AccessShareLock,
which conflicts with the AccessExclusiveLock taken by DROP.

As a desirable side effect, this also prevents the use of CREATE OR
REPLACE VIEW to queue for an AccessExclusiveLock on a relation on which
you have no rights: that will now fail immediately with a permissions
error, before trying to obtain a lock.

We need similar protection for all other object types, but as everything
other than relations uses a slightly different set of code paths, I'm
leaving that for a separate commit.

Original complaint (as far as I could find) about CREATE by Nikhil
Sontakke; risk for ALTER .. SET SCHEMA pointed out by Tom Lane;
further details by Dan Farina; patch by me; review by Hitoshi Harada.
2012-01-16 09:49:34 -05:00
Tom Lane
21b446dd09 Fix CLUSTER/VACUUM FULL for toast values owned by recently-updated rows.
In commit 7b0d0e9356, I made CLUSTER and
VACUUM FULL try to preserve toast value OIDs from the original toast table
to the new one.  However, if we have to copy both live and recently-dead
versions of a row that has a toasted column, those versions may well
reference the same toast value with the same OID.  The patch then led to
duplicate-key failures as we tried to insert the toast value twice with the
same OID.  (The previous behavior was not very desirable either, since it
would have silently inserted the same value twice with different OIDs.
That wastes space, but what's worse is that the toast values inserted for
already-dead heap rows would not be reclaimed by subsequent ordinary
VACUUMs, since they go into the new toast table marked live not deleted.)

To fix, check if the copied OID already exists in the new toast table, and
if so, assume that it stores the desired value.  This is reasonably safe
since the only case where we will copy an OID from a previous toast pointer
is when toast_insert_or_update was given that toast pointer and so we just
pulled the data from the old table; if we got two different values that way
then we have big problems anyway.  We do have to assume that no other
backend is inserting items into the new toast table concurrently, but
that's surely safe for CLUSTER and VACUUM FULL.

Per bug #6393 from Maxim Boguk.  Back-patch to 9.0, same as the previous
patch.
2012-01-12 16:40:14 -05:00
Heikki Linnakangas
1b9dea04b5 Remove useless 'needlock' argument from GetXLogInsertRecPtr. It was always
passed as 'true'.
2012-01-11 11:01:47 +02:00
Peter Eisentraut
db49517c62 Rename the internal structures of the CREATE TABLE (LIKE ...) facility
The original implementation of this interpreted it as a kind of
"inheritance" facility and named all the internal structures
accordingly.  This turned out to be very confusing, because it has
nothing to do with the INHERITS feature.  So rename all the internal
parser infrastructure, update the comments, adjust the error messages,
and split up the regression tests.
2012-01-07 23:02:33 +02:00
Tom Lane
0a41e86584 Use __sync_lock_test_and_set() for spinlocks on ARM, if available.
Historically we've used the SWPB instruction for TAS() on ARM, but this
is deprecated and not available on ARMv6 and later.  Instead, make use
of a GCC builtin if available.  We'll still fall back to SWPB if not,
so as not to break existing ports using older GCC versions.

Eventually we might want to try using __sync_lock_test_and_set() on some
other architectures too, but for now that seems to present only risk and
not reward.

Back-patch to all supported versions, since people might want to use any
of them on more recent ARM chips.

Martin Pitt
2012-01-07 15:38:52 -05:00
Robert Haas
1fc3d18faa Slightly reorganize struct SnapshotData.
This squeezes out a bunch of alignment padding, reducing the size
from 72 to 56 bytes on my machine.  At least in my testing, this
didn't produce any measurable performance improvement, but the space
savings seem like enough justification.

Andres Freund
2012-01-06 22:56:00 -05:00
Robert Haas
1489e2f26a Improve behavior of concurrent ALTER TABLE, and do some refactoring.
ALTER TABLE (and ALTER VIEW, ALTER SEQUENCE, etc.) now use a
RangeVarGetRelid callback to check permissions before acquiring a table
lock.  We also now use the same callback for all forms of ALTER TABLE,
rather than having separate, almost-identical callbacks for ALTER TABLE
.. SET SCHEMA and ALTER TABLE .. RENAME, and no callback at all for
everything else.

I went ahead and changed the code so that no form of ALTER TABLE works
on foreign tables; you must use ALTER FOREIGN TABLE instead.  In 9.1,
it was possible to use ALTER TABLE .. SET SCHEMA or ALTER TABLE ..
RENAME on a foreign table, but not any other form of ALTER TABLE, which
did not seem terribly useful or consistent.

Patch by me; review by Noah Misch.
2012-01-06 22:42:26 -05:00
Robert Haas
33aaa139e6 Make the number of CLOG buffers adaptive, based on shared_buffers.
Previously, this was hardcoded: we always had 8.  Performance testing
shows that isn't enough, especially on big SMP systems, so we allow it
to scale up as high as 32 when there's adequate memory.  On the flip
side, when shared_buffers is very small, drop the number of CLOG buffers
down to as little as 4, so that we can start the postmaster even
when very little shared memory is available.

Per extensive discussion with Simon Riggs, Tom Lane, and others on
pgsql-hackers.
2012-01-06 14:32:18 -05:00
Peter Eisentraut
104e7dac28 Improve ALTER DOMAIN / DROP CONSTRAINT with nonexistent constraint
ALTER DOMAIN / DROP CONSTRAINT on a nonexistent constraint name did
not report any error.  Now it reports an error.  The IF EXISTS option
was added to get the usual behavior of ignoring nonexistent objects to
drop.
2012-01-05 19:48:55 +02:00
Tom Lane
bc2a050d40 Use a non-locking initial test in TAS_SPIN on PPC.
Further testing convinces me that this is helpful at sufficiently high
contention levels, though it's still worrisome that it loses slightly
at lower contention levels.

Per Manabu Ori.
2012-01-03 16:00:06 -05:00
Andrew Dunstan
63876d3bac Support for building with MS Visual Studio 2010.
Brar Piening, reviewed by Craig Ringer.
2012-01-03 08:44:26 -05:00
Tom Lane
631beeac35 Use LWSYNC in place of SYNC/ISYNC in PPC spinlocks, where possible.
This is allegedly a win, at least on some PPC implementations, according
to the PPC ISA documents.  However, as with LWARX hints, some PPC
platforms give an illegal-instruction failure.  Use the same trick as
before of assuming that PPC64 platforms will accept it; we might need to
refine that based on experience, but there are other projects doing
likewise according to google.

I did not add an assembler compatibility test because LWSYNC has been
around much longer than hint bits, and it seems unlikely that any
toolchains currently in use don't recognize it.
2012-01-02 00:02:02 -05:00
Tom Lane
8496c6cd77 Use 4-byte slock_t on both PPC and PPC64.
Previously we defined slock_t as 8 bytes on PPC64, but the TAS assembly
code uses word-wide operations regardless, so that the second word was
just wasted space.  There doesn't appear to be any performance benefit
in adding the second word, so get rid of it to simplify the code.
2012-01-02 00:02:01 -05:00
Tom Lane
5cfa8dd300 Use mutex hint bit in PPC LWARX instructions, where possible.
The hint bit makes for a small but measurable performance improvement
in access to contended spinlocks.

On the other hand, some PPC chips give an illegal-instruction failure.
There doesn't seem to be a completely bulletproof way to tell whether the
hint bit will cause an illegal-instruction failure other than by trying
it; but most if not all 64-bit PPC machines should accept it, so follow
the Linux kernel's lead and assume it's okay to use it in 64-bit builds.
Of course we must also check whether the assembler accepts the command,
since even with a recent CPU the toolchain could be old.

Patch by Manabu Ori, significantly modified by me.
2012-01-02 00:02:00 -05:00
Bruce Momjian
e126958c2e Update copyright notices for year 2012. 2012-01-01 18:01:58 -05:00
Simon Riggs
64233902d2 Send new protocol keepalive messages to standby servers.
Allows streaming replication users to calculate transfer latency
and apply delay via internal functions. No external functions yet.
2011-12-31 13:30:26 +00:00
Peter Eisentraut
d383c23f6f Remove support for on_exit()
All supported platforms support the C89 standard function atexit()
(SunOS 4 probably being the last one not to), and supporting both
makes the code clumsy.
2011-12-27 20:57:59 +02:00
Tom Lane
472d3935a2 Rethink representation of index clauses' mapping to index columns.
In commit e2c2c2e8b1 I made use of nested
list structures to show which clauses went with which index columns, but
on reflection that's a data structure that only an old-line Lisp hacker
could love.  Worse, it adds unnecessary complication to the many places
that don't much care which clauses go with which index columns.  Revert
to the previous arrangement of flat lists of clauses, and instead add a
parallel integer list of column numbers.  The places that care about the
pairing can chase both lists with forboth(), while the places that don't
care just examine one list the same as before.

The only real downside to this is that there are now two more lists that
need to be passed to amcostestimate functions in case they care about
column matching (which btcostestimate does, so not passing the info is not
an option).  Rather than deal with 11-argument amcostestimate functions,
pass just the IndexPath and expect the functions to extract fields from it.
That gets us down to 7 arguments which is better than 11, and it seems
more future-proof against likely additions to the information we keep
about an index path.
2011-12-24 19:03:21 -05:00
Tom Lane
e2c2c2e8b1 Improve planner's handling of duplicated index column expressions.
It's potentially useful for an index to repeat the same indexable column
or expression in multiple index columns, if the columns have different
opclasses.  (If they share opclasses too, the duplicate column is pretty
useless, but nonetheless we've allowed such cases since 9.0.)  However,
the planner failed to cope with this, because createplan.c was relying on
simple equal() matching to figure out which index column each index qual
is intended for.  We do have that information available upstream in
indxpath.c, though, so the fix is to not flatten the multi-level indexquals
list when putting it into an IndexPath.  Then we can rely on the sublist
structure to identify target index columns in createplan.c.  There's a
similar issue for index ORDER BYs (the KNNGIST feature), so introduce a
multi-level-list representation for that too.  This adds a bit more
representational overhead, but we might more or less buy that back by not
having to search for matching index columns anymore in createplan.c;
likewise btcostestimate saves some cycles.

Per bug #6351 from Christian Rudolph.  Likely symptoms include the "btree
index keys must be ordered by attribute" failure shown there, as well as
"operator MMMM is not a member of opfamily NNNN".

Although this is a pre-existing problem that can be demonstrated in 9.0 and
9.1, I'm not going to back-patch it, because the API changes in the planner
seem likely to break things such as index plugins.  The corner cases where
this matters seem too narrow to justify possibly breaking things in a minor
release.
2011-12-23 18:45:14 -05:00
Robert Haas
d5448c7d31 Add bytea_agg, parallel to string_agg.
Pavel Stehule
2011-12-23 08:40:25 -05:00
Robert Haas
99b60fc04e Catversion bump for commit 0e4611c023.
It changed the format of stored rules.
2011-12-22 17:25:35 -05:00
Robert Haas
0e4611c023 Add a security_barrier option for views.
When a view is marked as a security barrier, it will not be pulled up
into the containing query, and no quals will be pushed down into it,
so that no function or operator chosen by the user can be applied to
rows not exposed by the view.  Views not configured with this
option cannot provide robust row-level security, but will perform far
better.

Patch by KaiGai Kohei; original problem report by Heikki Linnakangas
(in October 2009!).  Review (in earlier versions) by Noah Misch and
others.  Design advice by Tom Lane and myself.  Further review and
cleanup by me.
2011-12-22 16:16:31 -05:00
Peter Eisentraut
f90dd28062 Add ALTER DOMAIN ... RENAME
You could already rename domains using ALTER TYPE, but with this new
command it is more consistent with how other commands treat domains as
a subcategory of types.
2011-12-22 22:43:56 +02:00
Robert Haas
cbe24a6dd8 Improve behavior of concurrent CLUSTER.
In the previous coding, a user could queue up for an AccessExclusiveLock
on a table they did not have permission to cluster, thus potentially
interfering with access by authorized users who got stuck waiting behind
the AccessExclusiveLock.  This approach avoids that.  cluster() has the
same permissions-checking requirements as REINDEX TABLE, so this commit
moves the now-shared callback to tablecmds.c and renames it, per
discussion with Noah Misch.
2011-12-21 15:17:28 -05:00
Robert Haas
d573e239f0 Take fewer snapshots.
When a PORTAL_ONE_SELECT query is executed, we can opportunistically
reuse the parse/plan shot for the execution phase.  This cuts down the
number of snapshots per simple query from 2 to 1 for the simple
protocol, and 3 to 2 for the extended protocol.  Since we are only
reusing a snapshot taken early in the processing of the same protocol
message, the change shouldn't be user-visible, except that the remote
possibility of the planning and execution snapshots being different is
eliminated.

Note that this change does not make it safe to assume that the parse/plan
snapshot will certainly be reused; that will currently only happen if
PortalStart() decides to use the PORTAL_ONE_SELECT strategy.  It might
be worth trying to provide some stronger guarantees here in the future,
but for now we don't.

Patch by me; review by Dimitri Fontaine.
2011-12-21 09:16:55 -05:00
Peter Eisentraut
729205571e Add support for privileges on types
This adds support for the more or less SQL-conforming USAGE privilege
on types and domains.  The intent is to be able restrict which users
can create dependencies on types, which restricts the way in which
owners can alter types.

reviewed by Yeb Havinga
2011-12-20 00:05:19 +02:00
Alvaro Herrera
05e992e90e Forgot catversion bump on previous patch
Per Tom
2011-12-19 17:45:17 -03:00
Tom Lane
8f57b064fd Rename updateNodeLink to spgUpdateNodeLink.
On reflection, the original name seems way too generic for a global
symbol.  A quick check shows this is the only exported function name
in SP-GiST that doesn't begin with "spg" or contain "SpGist", so the
rest of them seem all right.
2011-12-19 15:38:32 -05:00
Alvaro Herrera
61d81bd28d Allow CHECK constraints to be declared ONLY
This makes them enforceable only on the parent table, not on children
tables.  This is useful in various situations, per discussion involving
people bitten by the restrictive behavior introduced in 8.4.

Message-Id:
8762mp93iw.fsf@comcast.net
CAFaPBrSMMpubkGf4zcRL_YL-AERUbYF_-ZNNYfb3CVwwEqc9TQ@mail.gmail.com

Authors: Nikhil Sontakke, Alex Hunsaker
Reviewed by Robert Haas and myself
2011-12-19 17:30:23 -03:00
Tom Lane
9220362493 Teach SP-GiST to do index-only scans.
Operator classes can specify whether or not they support this; this
preserves the flexibility to use lossy representations within an index.

In passing, move constant data about a given index into the rd_amcache
cache area, instead of doing fresh lookups each time we start an index
operation.  This is mainly to try to make sure that spgcanreturn() has
insignificant cost; I still don't have any proof that it matters for
actual index accesses.  Also, get rid of useless copying of FmgrInfo
pointers; we can perfectly well use the relcache's versions in-place.
2011-12-19 14:58:41 -05:00
Tom Lane
3695a55513 Replace simple constant pg_am.amcanreturn with an AM support function.
The need for this was debated when we put in the index-only-scan feature,
but at the time we had no near-term expectation of having AMs that could
support such scans for only some indexes; so we kept it simple.  However,
the SP-GiST AM forces the issue, so let's fix it.

This patch only installs the new API; no behavior actually changes.
2011-12-18 15:50:37 -05:00
Tom Lane
5577ca5bfb Remove bogus entries in gist point_ops operator class.
These entries could never be matched to an index clause because they don't
have the index datatype on the left-hand side of the operator.  (Their
commutators are in the opclass, which is sensible, but that doesn't mean
these operators should be.)  Spotted by a test that I recently added to
opr_sanity to catch exactly this type of thinko.  AFAICT there is no code
in gistproc.c that is specifically meant to cover these cases, so nothing
to remove at that level.
2011-12-17 18:51:00 -05:00
Tom Lane
8daeb5ddd6 Add SP-GiST (space-partitioned GiST) index access method.
SP-GiST is comparable to GiST in flexibility, but supports non-balanced
partitioned search structures rather than balanced trees.  As described at
PGCon 2011, this new indexing structure can beat GiST in both index build
time and query speed for search problems that it is well matched to.

There are a number of areas that could still use improvement, but at this
point the code seems committable.

Teodor Sigaev and Oleg Bartunov, with considerable revisions by Tom Lane
2011-12-17 16:42:30 -05:00
Robert Haas
0d76b60db4 Various micro-optimizations for GetSnapshopData().
Heikki Linnakangas had the idea of rearranging GetSnapshotData to
avoid checking for sub-XIDs when no top-level XID is present.  This
patch does that plus further a bit of further, related rearrangement.
Benchmarking show a significant improvement on unlogged tables at
higher concurrency levels, and mostly indifferent result on permanent
tables (which are presumably bottlenecked elsewhere).  Most of the
benefit seems to come from using the new NormalTransactionIdPrecedes()
macro rather than the function call TransactionIdPrecedes().
2011-12-16 21:48:47 -05:00
Andrew Dunstan
6d09b2105f include_if_exists facility for config file.
This works the same as include, except that an error is not thrown
if the file is missing. Instead the fact that it's missing is
logged.

Greg Smith, reviewed by Euler Taveira de Oliveira.
2011-12-15 19:40:58 -05:00
Robert Haas
74a1d4fe7c Improve behavior of concurrent rename statements.
Previously, renaming a table, sequence, view, index, foreign table,
column, or trigger checked permissions before locking the object, which
meant that if permissions were revoked during the lock wait, we would
still allow the operation.  Similarly, if the original object is dropped
and a new one with the same name is created, the operation will be allowed
if we had permissions on the old object; the permissions on the new
object don't matter.  All this is now fixed.

Along the way, attempting to rename a trigger on a foreign table now gives
the same error message as trying to create one there in the first place
(i.e. that it's not a table or view) rather than simply stating that no
trigger by that name exists.

Patch by me; review by Noah Misch.
2011-12-15 19:02:38 -05:00
Robert Haas
f6835ea90a Fix typo. 2011-12-15 18:22:29 -05:00
Tom Lane
2dd9322ba6 Move BKP_REMOVABLE bit from individual WAL records to WAL page headers.
Removing this bit from xl_info allows us to restore the old limit of four
(not three) separate pages touched by a WAL record, which is needed for the
upcoming SP-GiST feature, and will likely be useful elsewhere in future.

When we implemented XLR_BKP_REMOVABLE in 2007, we had to do it like that
because no special WAL-visible action was taken when starting a backup.
However, now we force a segment switch when starting a backup, so a
compressing WAL archiver (such as pglesslog) that uses the state shown in
the current page header will not be fooled as to removability of backup
blocks.  The only downside is that the archiver will not return to
compressing mode for up to one WAL page after the backup is over, which is
a small price to pay for getting back the extra xl_info bit.  In any case
the archiver could look for XLOG_BACKUP_END records if it thought it was
worth the trouble to do so.

Bump XLOG_PAGE_MAGIC since this is effectively a change in WAL format.
2011-12-12 16:22:14 -05:00
Heikki Linnakangas
8409b60476 Revert the behavior of inet/cidr functions to not unpack the arguments.
I forgot to change the functions to use the PG_GETARG_INET_PP() macro,
when I changed DatumGetInetP() to unpack the datum, like Datum*P macros
usually do. Also, I screwed up the definition of the PG_GETARG_INET_PP()
macro, and didn't notice because it wasn't used.

This fixes the memory leak when sorting inet values, as reported
by Jochen Erwied and debugged by Andres Freund. Backpatch to 8.3, like
the previous patch that broke it.
2011-12-12 10:10:53 +02:00
Andrew Dunstan
8e461ca5a9 Remove define inadvertantly left over from testing. 2011-12-10 16:29:37 -05:00
Andrew Dunstan
1a0c76c32f Enable compiling with the mingw-w64 32 bit compiler.
Original patch by Lars Kanis, reviewed by Nishiyama Tomoaki and tweaked some by me.

This compiler, or at least the latest version of it, is currently broken, and
only passes the regression tests if built with -O0.
2011-12-10 15:35:41 -05:00
Peter Eisentraut
5bcf8ede45 Add ALTER FOREIGN DATA WRAPPER / RENAME and ALTER SERVER / RENAME 2011-12-09 20:42:30 +02:00
Heikki Linnakangas
9f0d2bdc88 Don't set reachedMinRecoveryPoint during crash recovery. In crash recovery,
we don't reach consistency before replaying all of the WAL. Rename the
variable to reachedConsistency, to make its intention clearer.

In master, that was an active bug because of the recent patch to
immediately PANIC if a reference to a missing page is found in WAL after
reaching consistency, as Tom Lane's test case demonstrated. In 9.1 and 9.0,
the only consequence was a misleading "consistent recovery state reached at
%X/%X" message in the log at the beginning of crash recovery (the database
is not consistent at that point yet). In 8.4, the log message was not
printed in crash recovery, even though there was a similar
reachedMinRecoveryPoint local variable that was also set early. So,
backpatch to 9.1 and 9.0.
2011-12-09 15:21:12 +02:00
Heikki Linnakangas
5d8a894e30 Cancel running query if it is detected that the connection to the client is
lost. The only way we detect that at the moment is when write() fails when
we try to write to the socket.

Florian Pflug with small changes by me, reviewed by Greg Jaskiewicz.
2011-12-09 14:21:36 +02:00
Peter Eisentraut
d5f23af6bf Add const qualifiers to node inspection functions
Thomas Munro
2011-12-07 21:46:56 +02:00
Magnus Hagander
16d8e594ac Remove spclocation field from pg_tablespace
Instead, add a function pg_tablespace_location(oid) used to return
the same information, and do this by reading the symbolic link.

Doing it this way makes it possible to relocate a tablespace when the
database is down by simply changing the symbolic link.
2011-12-07 10:37:33 +01:00
Tom Lane
c6e3ac11b6 Create a "sort support" interface API for faster sorting.
This patch creates an API whereby a btree index opclass can optionally
provide non-SQL-callable support functions for sorting.  In the initial
patch, we only use this to provide a directly-callable comparator function,
which can be invoked with a bit less overhead than the traditional
SQL-callable comparator.  While that should be of value in itself, the real
reason for doing this is to provide a datatype-extensible framework for
more aggressive optimizations, as in Peter Geoghegan's recent work.

Robert Haas and Tom Lane
2011-12-07 00:19:39 -05:00
Heikki Linnakangas
1e616f6391 During recovery, if we reach consistent state and still have entries in the
invalid-page hash table, PANIC immediately. Immediate PANIC is much better
than waiting for end-of-recovery, which is what we did before, because the
end-of-recovery might not come until months later if this is a standby
server.

Also refrain from creating a restartpoint if there are invalid-page entries
in the hash table. Restarting recovery from such a restartpoint would not
see the invalid references, and wouldn't be able to cross-check them when
consistency is reached. That wouldn't matter when things are going smoothly,
but the more sanity checks you have the better.

Fujii Masao
2011-12-02 10:49:54 +02:00
Robert Haas
2ad36c4e44 Improve table locking behavior in the face of current DDL.
In the previous coding, callers were faced with an awkward choice:
look up the name, do permissions checks, and then lock the table; or
look up the name, lock the table, and then do permissions checks.
The first choice was wrong because the results of the name lookup
and permissions checks might be out-of-date by the time the table
lock was acquired, while the second allowed a user with no privileges
to interfere with access to a table by users who do have privileges
(e.g. if a malicious backend queues up for an AccessExclusiveLock on
a table on which AccessShareLock is already held, further attempts
to access the table will be blocked until the AccessExclusiveLock
is obtained and the malicious backend's transaction rolls back).

To fix, allow callers of RangeVarGetRelid() to pass a callback which
gets executed after performing the name lookup but before acquiring
the relation lock.  If the name lookup is retried (because
invalidation messages are received), the callback will be re-executed
as well, so we get the best of both worlds.  RangeVarGetRelid() is
renamed to RangeVarGetRelidExtended(); callers not wishing to supply
a callback can continue to invoke it as RangeVarGetRelid(), which is
now a macro.  Since the only one caller that uses nowait = true now
passes a callback anyway, the RangeVarGetRelid() macro defaults nowait
as well.  The callback can also be used for supplemental locking - for
example, REINDEX INDEX needs to acquire the table lock before the index
lock to reduce deadlock possibilities.

There's a lot more work to be done here to fix all the cases where this
can be a problem, but this commit provides the general infrastructure
and fixes the following specific cases: REINDEX INDEX, REINDEX TABLE,
LOCK TABLE, and and DROP TABLE/INDEX/SEQUENCE/VIEW/FOREIGN TABLE.

Per discussion with Noah Misch and Alvaro Herrera.
2011-11-30 10:27:00 -05:00
Tom Lane
dd3bab5fd7 Ensure that whole-row junk Vars are always of composite type.
The EvalPlanQual machinery assumes that whole-row Vars generated for the
outputs of non-table RTEs will be of composite types.  However, for the
case where the RTE is a function call returning a scalar type, we were
doing the wrong thing, as a result of sharing code with a parser case
where the function's scalar output is wanted.  (Or at least, that's what
that case has done historically; it does seem a bit inconsistent.)

To fix, extend makeWholeRowVar's API so that it can support both use-cases.
This fixes Belinda Cussen's report of crashes during concurrent execution
of UPDATEs involving joins to the result of UNNEST() --- in READ COMMITTED
mode, we'd run the EvalPlanQual machinery after a conflicting row update
commits, and it was expecting to get a HeapTuple not a scalar datum from
the "wholerowN" variable referencing the function RTE.

Back-patch to 9.0 where the current EvalPlanQual implementation appeared.

In 9.1 and up, this patch also fixes failure to attach the correct
collation to the Var generated for a scalar-result case.  An example:
regression=# select upper(x.*) from textcat('ab', 'cd') x;
ERROR:  could not determine which collation to use for upper() function
2011-11-27 22:27:24 -05:00
Tom Lane
c66e4f138b Improve GiST range-contained-by searches by adding a flag for empty ranges.
In the original implementation, a range-contained-by search had to scan
the entire index because an empty range could be lurking anywhere.
Improve that by adding a flag to upper GiST entries that says whether the
represented subtree contains any empty ranges.

Also, make a simple mod to the penalty function to discourage empty ranges
from getting pushed into subtrees without any.  This needs more work, and
the picksplit function should be taught about it too, but that code can be
improved without causing an on-disk compatibility break; so we'll leave it
for another day.

Since we're breaking on-disk compatibility of range values anyway, I took
the opportunity to reorganize the range flags bits; the unused
RANGE_xB_NULL bits are now adjacent, which might open the door for using
them in some other way later.

In passing, remove the GiST range opclass entry for <>, which doesn't seem
like it can really be indexed usefully.

Alexander Korotkov, with some editorializing by Tom
2011-11-27 16:51:29 -05:00
Alvaro Herrera
9d3b502443 Improve logging of autovacuum I/O activity
This adds some I/O stats to the logging of autovacuum (when the
operation takes long enough that log_autovacuum_min_duration causes it
to be logged), so that it is easier to tune.  Notably, it adds buffer
I/O counts (hits, misses, dirtied) and read and write rate.

Authors: Greg Smith and Noah Misch
2011-11-25 16:34:32 -03:00
Robert Haas
ed0b409d22 Move "hot" members of PGPROC into a separate PGXACT array.
This speeds up snapshot-taking and reduces ProcArrayLock contention.
Also, the PGPROC (and PGXACT) structures used by two-phase commit are
now allocated as part of the main array, rather than in a separate
array, and we keep ProcArray sorted in pointer order.  These changes
are intended to minimize the number of cache lines that must be pulled
in to take a snapshot, and testing shows a substantial increase in
performance on both read and write workloads at high concurrencies.

Pavan Deolasee, Heikki Linnakangas, Robert Haas
2011-11-25 08:02:10 -05:00
Tom Lane
9ed439a9c0 Fix unsupported options in CREATE TABLE ... AS EXECUTE.
The WITH [NO] DATA option was not supported, nor the ability to specify
replacement column names; the former limitation wasn't even documented, as
per recent complaint from Naoya Anzai.  Fix by moving the responsibility
for supporting these options into the executor.  It actually takes less
code this way ...

catversion bump due to change in representation of IntoClause, which might
affect stored rules.
2011-11-24 23:21:45 -05:00
Tom Lane
74c1723fc8 Remove user-selectable ANALYZE option for range types.
It's not clear that a per-datatype typanalyze function would be any more
useful than a generic typanalyze for ranges.  What *is* clear is that
letting unprivileged users select typanalyze functions is a crash risk or
worse.  So remove the option from CREATE TYPE AS RANGE, and instead put in
a generic typanalyze function for ranges.  The generic function does
nothing as yet, but hopefully we'll improve that before 9.2 release.
2011-11-23 00:03:22 -05:00
Tom Lane
df73584431 Remove zero- and one-argument range constructor functions.
Per discussion, the zero-argument forms aren't really worth the catalog
space (just write 'empty' instead).  The one-argument forms have some use,
but they also have a serious problem with looking too much like functional
cast notation; to the point where in many real use-cases, the parser would
misinterpret what was wanted.

Committing this as a separate patch, with the thought that we might want
to revert part or all of it if we can think of some way around the cast
ambiguity.
2011-11-22 20:45:05 -05:00
Tom Lane
cddc819e45 Improve implementation of range-contains-element tests.
Implement these tests directly instead of constructing a singleton range
and then applying range-contains.  This saves a range serialize/deserialize
cycle as well as a couple of redundant bound-comparison steps, and adds
very little code on net.

Remove elem_contained_by_range from the GiST opclass: it doesn't belong
there because there is no way to use it in an index clause (where the
indexed column would have to be on the left).  Its commutator is in the
opclass, and that's what counts.
2011-11-22 17:45:37 -05:00