postgresql

mirror of https://github.com/postgres/postgres.git synced 2026-03-14 14:42:30 -04:00

Author	SHA1	Message	Date
Noah Misch	0b7026d964	Predict integer overflow to avoid buffer overruns. Several functions, mostly type input functions, calculated an allocation size such that the calculation wrapped to a small positive value when arguments implied a sufficiently-large requirement. Writes past the end of the inadvertent small allocation followed shortly thereafter. Coverity identified the path_in() vulnerability; code inspection led to the rest. In passing, add check_stack_depth() to prevent stack overflow in related functions. Back-patch to 8.4 (all supported versions). The non-comment hstore changes touch code that did not exist in 8.4, so that part stops at 9.0. Noah Misch and Heikki Linnakangas, reviewed by Tom Lane. Security: CVE-2014-0064	2014-02-17 09:33:37 -05:00
Noah Misch	6a10e57b0f	Fix handling of wide datetime input/output. Many server functions use the MAXDATELEN constant to size a buffer for parsing or displaying a datetime value. It was much too small for the longest possible interval output and slightly too small for certain valid timestamp input, particularly input with a long timezone name. The long input was rejected needlessly; the long output caused interval_out() to overrun its buffer. ECPG's pgtypes library has a copy of the vulnerable functions, which bore the same vulnerabilities along with some of its own. In contrast to the server, certain long inputs caused stack overflow rather than failing cleanly. Back-patch to 8.4 (all supported versions). Reported by Daniel Schüssler, reviewed by Tom Lane. Security: CVE-2014-0063	2014-02-17 09:33:37 -05:00
Robert Haas	b5c574399e	Avoid repeated name lookups during table and index DDL. If the name lookups come to different conclusions due to concurrent activity, we might perform some parts of the DDL on a different table than other parts. At least in the case of CREATE INDEX, this can be used to cause the permissions checks to be performed against a different table than the index creation, allowing for a privilege escalation attack. This changes the calling convention for DefineIndex, CreateTrigger, transformIndexStmt, transformAlterTableStmt, CheckIndexCompatible (in 9.2 and newer), and AlterTable (in 9.1 and older). In addition, CheckRelationOwnership is removed in 9.2 and newer and the calling convention is changed in older branches. A field has also been added to the Constraint node (FkConstraint in 8.4). Third-party code calling these functions or using the Constraint node will require updating. Report by Andres Freund. Patch by Robert Haas and Andres Freund, reviewed by Tom Lane. Security: CVE-2014-0062	2014-02-17 09:33:36 -05:00
Noah Misch	23b5a85e60	Prevent privilege escalation in explicit calls to PL validators. The primary role of PL validators is to be called implicitly during CREATE FUNCTION, but they are also normal functions that a user can call explicitly. Add a permissions check to each validator to ensure that a user cannot use explicit validator calls to achieve things he could not otherwise achieve. Back-patch to 8.4 (all supported versions). Non-core procedural language extensions ought to make the same two-line change to their own validators. Andres Freund, reviewed by Tom Lane and Noah Misch. Security: CVE-2014-0061	2014-02-17 09:33:36 -05:00
Tom Lane	ab4bb5c47a	Fix multiple bugs in index page locking during hot-standby WAL replay. In ordinary operation, VACUUM must be careful to take a cleanup lock on each leaf page of a btree index; this ensures that no indexscans could still be "in flight" to heap tuples due to be deleted. (Because of possible index-tuple motion due to concurrent page splits, it's not enough to lock only the pages we're deleting index tuples from.) In Hot Standby, the WAL replay process must likewise lock every leaf page. There were several bugs in the code for that: * The replay scan might come across unused, all-zero pages in the index. While btree_xlog_vacuum itself did the right thing (ie, nothing) with such pages, xlogutils.c supposed that such pages must be corrupt and would throw an error. This accounts for various reports of replication failures with "PANIC: WAL contains references to invalid pages". To fix, add a ReadBufferMode value that instructs XLogReadBufferExtended not to complain when we're doing this. * btree_xlog_vacuum performed the extra locking if standbyState == STANDBY_SNAPSHOT_READY, but that's not the correct test: we won't open up for hot standby queries until the database has reached consistency, and we don't want to do the extra locking till then either, for fear of reading corrupted pages (which bufmgr.c would complain about). Fix by exporting a new function from xlog.c that will report whether we're actually in hot standby replay mode. * To ensure full coverage of the index in the replay scan, btvacuumscan would emit a dummy WAL record for the last page of the index, if no vacuuming work had been done on that page. However, if the last page of the index is all-zero, that would result in corruption of said page, since the functions called on it weren't prepared to handle that case. There's no need to lock any such pages, so change the logic to target the last normal leaf page instead. The first two of these bugs were diagnosed by Andres Freund, the other one by me. Fixes based on ideas from Heikki Linnakangas and myself. This has been wrong since Hot Standby was introduced, so back-patch to 9.0.	2014-01-14 17:34:57 -05:00
Tom Lane	c53cbb39f8	Stamp 9.1.11.	2013-12-02 16:02:21 -05:00
Alvaro Herrera	0176f8bea4	Fix incomplete backpatch of pg_multixact truncation changes to <= 9.2 The backpatch of a95335b544d9c8377e9dc7a399d8e9a155895f82 to 9.2, 9.1 and 9.0 was incomplete, missing changes to xlog.c, primarily the call to TrimMultiXact(). Testing presumably didn't show a problem without these changes because TrimMultiXact() performs defense-in-depth work, which is not strictly necessary. It also missed moving StartupMultiXact() which would have been problematic if a restartpoing happened in exactly the wrong moment, causing a transient error. Andres Freund	2013-12-02 13:28:18 -03:00
Tom Lane	7b63528750	Fix array slicing of int2vector and oidvector values. The previous coding labeled expressions such as pg_index.indkey[1:3] as being of int2vector type; which is not right because the subscript bounds of such a result don't, in general, satisfy the restrictions of int2vector. To fix, implicitly promote the result of slicing int2vector to int2[], or oidvector to oid[]. This is similar to what we've done with domains over arrays, which is a good analogy because these types are very much like restricted domains of the corresponding regular-array types. A side-effect is that we now also forbid array-element updates on such columns, eg while "update pg_index set indkey[4] = 42" would have worked before if you were superuser (and corrupted your catalogs irretrievably, no doubt) it's now disallowed. This seems like a good thing since, again, some choices of subscripting would've led to results not satisfying the restrictions of int2vector. The case of an array-slice update was rejected before, though with a different error message than you get now. We could make these cases work in future if we added a cast from int2[] to int2vector (with a cast function checking the subscript restrictions) but it seems unlikely that there's any value in that. Per report from Ronan Dunklau. Back-patch to all supported branches because of the crash risks involved.	2013-11-23 20:04:06 -05:00
Heikki Linnakangas	3527a5f59f	Fix race condition in GIN posting tree page deletion. If a page is deleted, and reused for something else, just as a search is following a rightlink to it from its left sibling, the search would continue scanning whatever the new contents of the page are. That could lead to incorrect query results, or even something more curious if the page is reused for a different kind of a page. To fix, modify the search algorithm to lock the next page before releasing the previous one, and refrain from deleting pages from the leftmost branch of the tree. Add a new Concurrency section to the README, explaining why this works. There is a lot more one could say about concurrency in GIN, but that's for another patch. Backpatch to all supported versions.	2013-11-08 22:23:14 +02:00
Tom Lane	6e2c762480	Fix generation of MergeAppend plans for optimized min/max on expressions. Before jamming a desired targetlist into a plan node, one really ought to make sure the plan node can handle projections, and insert a buffering Result plan node if not. planagg.c forgot to do this, which is a hangover from the days when it only dealt with IndexScan plan types. MergeAppend doesn't project though, not to mention that it gets unhappy if you remove its possibly-resjunk sort columns. The code accidentally failed to fail for cases in which the min/max argument was a simple Var, because the new targetlist would be equivalent to the original "flat" tlist anyway. For any more complex case, it's been broken since 9.1 where we introduced the ability to optimize min/max using MergeAppend, as reported by Raphael Bauduin. Fix by duplicating the logic from grouping_planner that decides whether we need a Result node. In 9.2 and 9.1, this requires back-porting the tlist_same_exprs() function introduced in commit `4387cf956b`, else we'd uselessly add a Result node in cases that worked before. It's rather tempting to back-patch that whole commit so that we can avoid extra Result nodes in mainline cases too; but I'll refrain, since that code hasn't really seen all that much field testing yet.	2013-11-07 13:13:47 -05:00
Tom Lane	172ba45780	Fix some odd behaviors when using a SQL-style simple GMT offset timezone. Formerly, when using a SQL-spec timezone setting with a fixed GMT offset (called a "brute force" timezone in the code), the session_timezone variable was not updated to match the nominal timezone; rather, all code was expected to ignore session_timezone if HasCTZSet was true. This is of course obviously fragile, though a search of the code finds only timeofday() failing to honor the rule. A bigger problem was that DetermineTimeZoneOffset() supposed that if its pg_tz parameter was pointer-equal to session_timezone, then HasCTZSet should override the parameter. This would cause datetime input containing an explicit zone name to be treated as referencing the brute-force zone instead, if the zone name happened to match the session timezone that had prevailed before installing the brute-force zone setting (as reported in bug #8572). The same malady could affect AT TIME ZONE operators. To fix, set up session_timezone so that it matches the brute-force zone specification, which we can do using the POSIX timezone definition syntax "<abbrev>offset", and get rid of the bogus lookaside check in DetermineTimeZoneOffset(). Aside from fixing the erroneous behavior in datetime parsing and AT TIME ZONE, this will cause the timeofday() function to print its result in the user-requested time zone rather than some previously-set zone. It might also affect results in third-party extensions, if there are any that make use of session_timezone without considering HasCTZSet, but in all cases the new behavior should be saner than before. Back-patch to all supported branches.	2013-11-01 12:13:29 -04:00
Peter Eisentraut	b444726ef6	Stamp 9.1.10.	2013-10-07 23:13:47 -04:00
Kevin Grittner	556e6d0ba3	Eliminate xmin from hash tag for predicate locks on heap tuples. If a tuple was frozen while its predicate locks mattered, read-write dependencies could be missed, resulting in failure to detect conflicts which could lead to anomalies in committed serializable transactions. This field was added to the tag when we still thought that it was necessary to carry locks forward to a new version of an updated row. That was later proven to be unnecessary, which allowed simplification of the code, but elimination of xmin from the tag was missed at the time. Per report and analysis by Heikki Linnakangas. Backpatch to 9.1.	2013-10-07 14:05:26 -05:00
Andrew Dunstan	b8288038ba	Unconditionally use the WSA equivalents of Socket error constants. This change will only apply to mingw compilers, and has been found necessary by late versions of the mingw-w64 compiler. It's the same as what is done elsewhere for the Microsoft compilers. Backpatch of commit `73838b5251`. Problem reported by Michael Cronenworth, although not his patch.	2013-08-26 14:55:00 -04:00
Tom Lane	f7bbd46dcf	In locate_grouping_columns(), don't expect an exact match of Var typmods. It's possible that inlining of SQL functions (or perhaps other changes?) has exposed typmod information not known at parse time. In such cases, Vars generated by query_planner might have valid typmod values while the original grouping columns only have typmod -1. This isn't a semantic problem since the behavior of grouping only depends on type not typmod, but it breaks locate_grouping_columns' use of tlist_member to locate the matching entry in query_planner's result tlist. We can fix this without an excessive amount of new code or complexity by relying on the fact that locate_grouping_columns only gets called when make_subplanTargetList has set need_tlist_eval == false, and that can only happen if all the grouping columns are simple Vars. Therefore we only need to search the sub_tlist for a matching Var, and we can reasonably define a "match" as being a match of the Var identity fields varno/varattno/varlevelsup. The code still Asserts that vartype matches, but ignores vartypmod. Per bug #8393 from Evan Martin. The added regression test case is basically the same as his example. This has been broken for a very long time, so back-patch to all supported branches.	2013-08-23 17:31:03 -04:00
Tom Lane	c1f51ed296	Change post-rewriter representation of dropped columns in joinaliasvars. It's possible to drop a column from an input table of a JOIN clause in a view, if that column is nowhere actually referenced in the view. But it will still be there in the JOIN clause's joinaliasvars list. We used to replace such entries with NULL Const nodes, which is handy for generation of RowExpr expansion of a whole-row reference to the view. The trouble with that is that it can't be distinguished from the situation after subquery pull-up of a constant subquery output expression below the JOIN. Instead, replace such joinaliasvars with null pointers (empty expression trees), which can't be confused with pulled-up expressions. expandRTE() still emits the old convention, though, for convenience of RowExpr generation and to reduce the risk of breaking extension code. In HEAD and 9.3, this patch also fixes a problem with some new code in ruleutils.c that was failing to cope with implicitly-casted joinaliasvars entries, as per recent report from Feike Steenbergen. That oversight was because of an inadequate description of the data structure in parsenodes.h, which I've now corrected. There were some pre-existing oversights of the same ilk elsewhere, which I believe are now all fixed.	2013-07-23 16:23:11 -04:00
Magnus Hagander	271f7b0f8a	Fix include-guard Looks like a cut/paste error in the original addition of the file. Andres Freund	2013-07-07 13:38:57 +02:00
Simon Riggs	a41c88194b	Ensure no xid gaps during Hot Standby startup In some cases with higher numbers of subtransactions it was possible for us to incorrectly initialize subtrans leading to complaints of missing pages. Bug report by Sergey Konoplev Analysis and fix by Andres Freund	2013-06-23 14:50:17 +01:00
Robert Haas	7b2d780f2a	Fix typo in comment. Pavan Deolasee	2013-05-23 11:35:39 -04:00
Heikki Linnakangas	88b000c1b6	Fix crash on compiling a regular expression with more than 32k colors. Throw an error instead. Backpatch to all supported branches.	2013-04-04 19:32:05 +03:00
Tom Lane	114fca526e	Stamp 9.1.9.	2013-04-01 14:23:05 -04:00
Tom Lane	ddf177228f	Fix insecure parsing of server command-line switches. An oversight in commit `e710b65c1c` allowed database names beginning with "-" to be treated as though they were secure command-line switches; and this switch processing occurs before client authentication, so that even an unprivileged remote attacker could exploit the bug, needing only connectivity to the postmaster's port. Assorted exploits for this are possible, some requiring a valid database login, some not. The worst known problem is that the "-r" switch can be invoked to redirect the process's stderr output, so that subsequent error messages will be appended to any file the server can write. This can for example be used to corrupt the server's configuration files, so that it will fail when next restarted. Complete destruction of database tables is also possible. Fix by keeping the database name extracted from a startup packet fully separate from command-line switches, as had already been done with the user name field. The Postgres project thanks Mitsumasa Kondo for discovering this bug, Kyotaro Horiguchi for drafting the fix, and Noah Misch for recognizing the full extent of the danger. Security: CVE-2013-1899	2013-04-01 14:01:04 -04:00
Tom Lane	b403f4107b	Make REPLICATION privilege checks test current user not authenticated user. The pg_start_backup() and pg_stop_backup() functions checked the privileges of the initially-authenticated user rather than the current user, which is wrong. For example, a user-defined index function could successfully call these functions when executed by ANALYZE within autovacuum. This could allow an attacker with valid but low-privilege database access to interfere with creation of routine backups. Reported and fixed by Noah Misch. Security: CVE-2013-1901	2013-04-01 13:09:35 -04:00
Tom Lane	81e2255fc7	Fix to_char() to use ASCII-only case-folding rules where appropriate. formatting.c used locale-dependent case folding rules in some code paths where the result isn't supposed to be locale-dependent, for example to_char(timestamp, 'DAY'). Since the source data is always just ASCII in these cases, that usually didn't matter ... but it does matter in Turkish locales, which have unusual treatment of "i" and "I". To confuse matters even more, the misbehavior was only visible in UTF8 encoding, because in single-byte encodings we used pg_toupper/pg_tolower which don't have locale-specific behavior for ASCII characters. Fix by providing intentionally ASCII-only case-folding functions and using these where appropriate. Per bug #7913 from Adnan Dursun. Back-patch to all active branches, since it's been like this for a long time.	2013-03-05 13:02:38 -05:00
Tom Lane	a0698406f4	Document and clean up gistsplit.c. Improve comments, rename some variables and functions, slightly simplify a couple of APIs, in an attempt to make this code readable by people other than its original author. Even though this is essentially just cosmetic, back-patch to all active branches, because otherwise it's going to make back-patching future fixes in this file very painful.	2013-02-10 11:58:28 -05:00
Tom Lane	bb5e312bdc	Repair bugs in GiST page splitting code for multi-column indexes. When considering a non-last column in a multi-column GiST index, gistsplit.c tries to improve on the split chosen by the opclass-specific pickSplit function by considering penalties for the next column. However, there were two bugs in this code: it failed to recompute the union keys for the leftmost index columns, even though these might well change after reassigning tuples; and it included the old union keys in the recomputation for the columns it did recompute, so that those keys couldn't get smaller even if they should. The first problem could result in an invalid index in which searches wouldn't find index entries that are in fact present; the second would make the index less efficient to search. Both of these errors were caused by misuse of gistMakeUnionItVec, whose API was designed in a way that just begged such errors to be made. There is no situation in which it's safe or useful to compute the union keys for a subset of the index columns, and there is no caller that wants any previous union keys to be included in the computation; so the undocumented choice to treat the union keys as in/out rather than pure output parameters is a waste of code as well as being dangerous. Hence, rather than just making a minimal patch, I've changed the API of gistMakeUnionItVec to remove the "startkey" parameter (it now always processes all index columns) and treat the attr/isnull arrays as purely output parameters. In passing, also get rid of a couple of unnecessary and dangerous uses of static variables in gistutil.c. It's remarkable that the one in gistMakeUnionKey hasn't given us portability troubles before now, because in addition to posing a re-entrancy hazard, it was unsafely assuming that a static char[] array would have at least Datum alignment. Per investigation of a trouble report from Tomas Vondra. (There are also some bugs in contrib/btree_gist to be fixed, but that seems like material for a separate patch.) Back-patch to all supported branches.	2013-02-07 17:44:16 -05:00
Tom Lane	69c026512f	Stamp 9.1.8.	2013-02-04 16:28:27 -05:00
Andrew Dunstan	57d294a188	Use correct output device for Windows prompts. This ensures that mapping of non-ascii prompts to the correct code page occurs. Bug report and original patch from Alexander Law, reviewed and reworked by Noah Misch. Backpatch to all live branches.	2013-01-24 16:01:31 -05:00
Kevin Grittner	5454344b96	Fix performance problems with autovacuum truncation in busy workloads. In situations where there are over 8MB of empty pages at the end of a table, the truncation work for trailing empty pages takes longer than deadlock_timeout, and there is frequent access to the table by processes other than autovacuum, there was a problem with the autovacuum worker process being canceled by the deadlock checking code. The truncation work done by autovacuum up that point was lost, and the attempt tried again by a later autovacuum worker. The attempts could continue indefinitely without making progress, consuming resources and blocking other processes for up to deadlock_timeout each time. This patch has the autovacuum worker checking whether it is blocking any other thread at 20ms intervals. If such a condition develops, the autovacuum worker will persist the work it has done so far, release its lock on the table, and sleep in 50ms intervals for up to 5 seconds, hoping to be able to re-acquire the lock and try again. If it is unable to get the lock in that time, it moves on and a worker will try to continue later from the point this one left off. While this patch doesn't change the rules about when and what to truncate, it does cause the truncation to occur sooner, with less blocking, and with the consumption of fewer resources when there is contention for the table's lock. The only user-visible change other than improved performance is that the table size during truncation may change incrementally instead of just once. Backpatched to 9.0 from initial master commit at `b19e4250b4` -- before that the differences are too large to be clearly safe. Jan Wieck	2013-01-23 13:40:06 -06:00
Tom Lane	628ea7ea51	Prevent failure when RowExpr or XmlExpr is parse-analyzed twice. transformExpr() is required to cope with already-transformed expression trees, for various ugly-but-not-quite-worth-cleaning-up reasons. However, some of its newer subroutines hadn't gotten the memo. This accounts for bug #7763 from Norbert Buchmuller: transformRowExpr() was overwriting the previously determined type of a RowExpr during CREATE TABLE LIKE INCLUDING INDEXES. Additional investigation showed that transformXmlExpr had the same kind of problem, but all the other cases seem to be safe. Andres Freund and Tom Lane	2012-12-23 14:07:36 -05:00
Tom Lane	ed98b48bf4	Fix failure to ignore leftover temp tables after a server crash. During crash recovery, we remove disk files belonging to temporary tables, but the system catalog entries for such tables are intentionally not cleaned up right away. Instead, the first backend that uses a temp schema is expected to clean out any leftover objects therein. This approach requires that we be careful to ignore leftover temp tables (since any actual access attempt would fail), even if their BackendId matches our session, if we have not yet established use of the session's corresponding temp schema. That worked fine in the past, but was broken by commit `debcec7dc3` which incorrectly removed the rd_islocaltemp relcache flag. Put it back, and undo various changes that substituted tests like "rel->rd_backend == MyBackendId" for use of a state-aware flag. Per trouble report from Heikki Linnakangas. Back-patch to 9.1 where the erroneous change was made. In the back branches, be careful to add rd_islocaltemp in a spot in the struct that was alignment padding before, so as not to break existing add-on code.	2012-12-17 20:15:45 -05:00
Tom Lane	c47f643c49	Stamp 9.1.7.	2012-12-03 15:19:35 -05:00
Tom Lane	d08fd1f849	Don't advance checkPoint.nextXid near the end of a checkpoint sequence. This reverts commit `c11130690d` in favor of actually fixing the problem: namely, that we should never have been modifying the checkpoint record's nextXid at this point to begin with. The nextXid should match the state as of the checkpoint's logical WAL position (ie the redo point), not the state as of its physical position. It's especially bogus to advance it in some wal_levels and not others. In any case there is no need for the checkpoint record to carry the same nextXid shown in the XLOG_RUNNING_XACTS record just emitted by LogStandbySnapshot, as any replay operation will already have adopted that value as current. This fixes bug #7710 from Tarvi Pillessaar, and probably also explains bug #6291 from Daniel Farina, in that if a checkpoint were in progress at the instant of XID wraparound, the epoch bump would be lost as reported. (And, of course, these days there's at least a 50-50 chance of a checkpoint being in progress at any given instant.) Diagnosed by me and independently by Andres Freund. Back-patch to all branches supporting hot standby.	2012-12-02 15:20:08 -05:00
Tom Lane	c6a91c92b5	Produce a more useful error message for over-length Unix socket paths. The length of a socket path name is constrained by the size of struct sockaddr_un, and there's not a lot we can do about it since that is a kernel API. However, it would be a good thing if we produced an intelligible error message when the user specifies a socket path that's too long --- and getaddrinfo's standard API is too impoverished to do this in the natural way. So insert explicit tests at the places where we construct a socket path name. Now you'll get an error that makes sense and even tells you what the limit is, rather than something generic like "Non-recoverable failure in name resolution". Per trouble report from Jeremy Drake and a fix idea from Andrew Dunstan.	2012-11-29 19:57:17 -05:00
Simon Riggs	6f9a9da85c	Correctly init/deinit recovery xact environment. Previously we performed VirtualXactLockTableInsert but didn't set MyProc->lxid for Startup process. pg_locks now correctly shows "1/1" for vxid of Startup process during Hot Standby. At end of Hot Standby the Virtual Transaction was not deleted, leading to problems after promoting to normal running for some commands, such as CREATE INDEX CONCURRENTLY.	2012-11-29 23:52:17 +00:00
Tom Lane	1da5bef317	Fix assorted bugs in CREATE INDEX CONCURRENTLY. This patch changes CREATE INDEX CONCURRENTLY so that the pg_index flag changes it makes without exclusive lock on the index are made via heap_inplace_update() rather than a normal transactional update. The latter is not very safe because moving the pg_index tuple could result in concurrent SnapshotNow scans finding it twice or not at all, thus possibly resulting in index corruption. In addition, fix various places in the code that ought to check to make sure that the indexes they are manipulating are valid and/or ready as appropriate. These represent bugs that have existed since 8.2, since a failed CREATE INDEX CONCURRENTLY could leave a corrupt or invalid index behind, and we ought not try to do anything that might fail with such an index. Also fix RelationReloadIndexInfo to ensure it copies all the pg_index columns that are allowed to change after initial creation. Previously we could have been left with stale values of some fields in an index relcache entry. It's not clear whether this actually had any user-visible consequences, but it's at least a bug waiting to happen. This is a subset of a patch already applied in 9.2 and HEAD. Back-patch into all earlier supported branches. Tom Lane and Andres Freund	2012-11-29 14:51:46 -05:00
Tom Lane	bdceb861d7	Fix SELECT DISTINCT with index-optimized MIN/MAX on inheritance trees. In a query such as "SELECT DISTINCT min(x) FROM tab", the DISTINCT is pretty useless (there being only one output row), but nonetheless it shouldn't fail. But it could fail if "tab" is an inheritance parent, because planagg.c's code for fixing up equivalence classes after making the index-optimized MIN/MAX transformation wasn't prepared to find child-table versions of the aggregate expression. The least ugly fix seems to be to add an option to mutate_eclass_expressions() to skip child-table equivalence class members, which aren't used anymore at this stage of planning so it's not really necessary to fix them. Since child members are ignored in many cases already, it seems plausible for mutate_eclass_expressions() to have an option to ignore them too. Per bug #7703 from Maxim Boguk. Back-patch to 9.1. Although the same code exists before that, it cannot encounter child-table aggregates AFAICS, because the index optimization transformation cannot succeed on inheritance trees before 9.1 (for lack of MergeAppend).	2012-11-26 12:58:20 -05:00
Tom Lane	634e148dca	Fix multiple problems in WAL replay. Most of the replay functions for WAL record types that modify more than one page failed to ensure that those pages were locked correctly to ensure that concurrent queries could not see inconsistent page states. This is a hangover from coding decisions made long before Hot Standby was added, when it was hardly necessary to acquire buffer locks during WAL replay at all, let alone hold them for carefully-chosen periods. The key problem was that RestoreBkpBlocks was written to hold lock on each page restored from a full-page image for only as long as it took to update that page. This was guaranteed to break any WAL replay function in which there was any update-ordering constraint between pages, because even if the nominal order of the pages is the right one, any mixture of full-page and non-full-page updates in the same record would result in out-of-order updates. Moreover, it wouldn't work for situations where there's a requirement to maintain lock on one page while updating another. Failure to honor an update ordering constraint in this way is thought to be the cause of bug #7648 from Daniel Farina: what seems to have happened there is that a btree page being split was rewritten from a full-page image before the new right sibling page was written, and because lock on the original page was not maintained it was possible for hot standby queries to try to traverse the page's right-link to the not-yet-existing sibling page. To fix, get rid of RestoreBkpBlocks as such, and instead create a new function RestoreBackupBlock that restores just one full-page image at a time. This function can be invoked by WAL replay functions at the points where they would otherwise perform non-full-page updates; in this way, the physical order of page updates remains the same no matter which pages are replaced by full-page images. We can then further adjust the logic in individual replay functions if it is necessary to hold buffer locks for overlapping periods. A side benefit is that we can simplify the handling of concurrency conflict resolution by moving that code into the record-type-specfic functions; there's no more need to contort the code layout to keep conflict resolution in front of the RestoreBkpBlocks call. In connection with that, standardize on zero-based numbering rather than one-based numbering for referencing the full-page images. In HEAD, I removed the macros XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4. They are still there in the header files in previous branches, but are no longer used by the code. In addition, fix some other bugs identified in the course of making these changes: spgRedoAddNode could fail to update the parent downlink at all, if the parent tuple is in the same page as either the old or new split tuple and we're not doing a full-page image: it would get fooled by the LSN having been advanced already. This would result in permanent index corruption, not just transient failure of concurrent queries. Also, ginHeapTupleFastInsert's "merge lists" case failed to mark the old tail page as a candidate for a full-page image; in the worst case this could result in torn-page corruption. heap_xlog_freeze() was inconsistent about using a cleanup lock or plain exclusive lock: it did the former in the normal path but the latter for a full-page image. A plain exclusive lock seems sufficient, so change to that. Also, remove gistRedoPageDeleteRecord(), which has been dead code since VACUUM FULL was rewritten. Back-patch to 9.0, where hot standby was introduced. Note however that 9.0 had a significantly different WAL-logging scheme for GIST index updates, and it doesn't appear possible to make that scheme safe for concurrent hot standby queries, because it can leave inconsistent states in the index even between WAL records. Given the lack of complaints from the field, we won't work too hard on fixing that branch.	2012-11-12 22:05:21 -05:00
Tom Lane	f43ca3c894	Fix handling of inherited check constraints in ALTER COLUMN TYPE. This case got broken in 8.4 by the addition of an error check that complains if ALTER TABLE ONLY is used on a table that has children. We do use ONLY for this situation, but it's okay because the necessary recursion occurs at a higher level. So we need to have a separate flag to suppress recursion without making the error check. Reported and patched by Pavan Deolasee, with some editorial adjustments by me. Back-patch to 8.4, since this is a regression of functionality that worked in earlier branches.	2012-11-05 13:36:26 -05:00
Alvaro Herrera	65225900de	Fix ALTER EXTENSION / SET SCHEMA In its original conception, it was leaving some objects into the old schema, but without their proper pg_depend entries; this meant that the old schema could be dropped, causing future pg_dump calls to fail on the affected database. This was originally reported by Jeff Frost as #6704; there have been other complaints elsewhere that can probably be traced to this bug. To fix, be more consistent about altering a table's subsidiary objects along the table itself; this requires some restructuring in how tables are relocated when altering an extension -- hence the new AlterTableNamespaceInternal routine which encapsulates it for both the ALTER TABLE and the ALTER EXTENSION cases. There was another bug lurking here, which was unmasked after fixing the previous one: certain objects would be reached twice via the dependency graph, and the second attempt to move them would cause the entire operation to fail. Per discussion, it seems the best fix for this is to do more careful tracking of objects already moved: we now maintain a list of moved objects, to avoid attempting to do it twice for the same object. Authors: Alvaro Herrera, Dimitri Fontaine Reviewed by Tom Lane	2012-10-31 10:49:14 -03:00
Tom Lane	447dad7193	Fix planning of non-strict equivalence clauses above outer joins. If a potential equivalence clause references a variable from the nullable side of an outer join, the planner needs to take care that derived clauses are not pushed to below the outer join; else they may use the wrong value for the variable. (The problem arises only with non-strict clauses, since if an upper clause can be proven strict then the outer join will get simplified to a plain join.) The planner attempted to prevent this type of error by checking that potential equivalence clauses aren't outerjoin-delayed as a whole, but actually we have to check each side separately, since the two sides of the clause will get moved around separately if it's treated as an equivalence. Bugs of this type can be demonstrated as far back as 7.4, even though releases before 8.3 had only a very ad-hoc notion of equivalence clauses. In addition, we neglected to account for the possibility that such clauses might have nonempty nullable_relids even when not outerjoin-delayed; so the equivalence-class machinery lacked logic to compute correct nullable_relids values for clauses it constructs. This oversight was harmless before 9.2 because we were only using RestrictInfo.nullable_relids for OR clauses; but as of 9.2 it could result in pushing constructed equivalence clauses to incorrect places. (This accounts for bug #7604 from Bill MacArthur.) Fix the first problem by adding a new test check_equivalence_delay() in distribute_qual_to_rels, and fix the second one by adding code in equivclass.c and called functions to set correct nullable_relids for generated clauses. Although I believe the second part of this is not currently necessary before 9.2, I chose to back-patch it anyway, partly to keep the logic similar across branches and partly because it seems possible we might find other reasons why we need valid values of nullable_relids in the older branches. Add regression tests illustrating these problems. In 9.0 and up, also add test cases checking that we can push constants through outer joins, since we've broken that optimization before and I nearly broke it again with an overly simplistic patch for this problem.	2012-10-18 12:29:00 -04:00
Tom Lane	473320e6c8	Close un-owned SMgrRelations at transaction end. If an SMgrRelation is not "owned" by a relcache entry, don't allow it to live past transaction end. This design allows the same SMgrRelation to be used for blind writes of multiple blocks during a transaction, but ensures that we don't hold onto such an SMgrRelation indefinitely. Because an SMgrRelation typically corresponds to open file descriptors at the fd.c level, leaving it open when there's no corresponding relcache entry can mean that we prevent the kernel from reclaiming deleted disk space. (While CacheInvalidateSmgr messages usually fix that, there are cases where they're not issued, such as DROP DATABASE. We might want to add some more sinval messaging for that, but I'd be inclined to keep this type of logic anyway, since allowing VFDs to accumulate indefinitely for blind-written relations doesn't seem like a good idea.) This code replaces a previous attempt towards the same goal that proved to be unreliable. Back-patch to 9.1 where the previous patch was added.	2012-10-17 12:38:33 -04:00
Tom Lane	cacb65263b	Revert "Use "transient" files for blind writes, take 2". This reverts commit `fba105b109`. That approach had problems with the smgr-level state not tracking what we really want to happen, and with the VFD-level state not tracking the smgr-level state very well either. In consequence, it was still possible to hold kernel file descriptors open for long-gone tables (as in recent report from Tore Halset), and yet there were also cases of FDs being closed undesirably soon. A replacement implementation will follow.	2012-10-17 12:37:20 -04:00
Tom Lane	eb5e0d8488	Split up process latch initialization for more-fail-soft behavior. In the previous coding, new backend processes would attempt to create their self-pipe during the OwnLatch call in InitProcess. However, pipe creation could fail if the kernel is short of resources; and the system does not recover gracefully from a FATAL error right there, since we have armed the dead-man switch for this process and not yet set up the on_shmem_exit callback that would disarm it. The postmaster then forces an unnecessary database-wide crash and restart, as reported by Sean Chittenden. There are various ways we could rearrange the code to fix this, but the simplest and sanest seems to be to split out creation of the self-pipe into a new function InitializeLatchSupport, which must be called from a place where failure is allowed. For most processes that gets called in InitProcess or InitAuxiliaryProcess, but processes that don't call either but still use latches need their own calls. Back-patch to 9.1, which has only a part of the latch logic that 9.2 and HEAD have, but nonetheless includes this bug.	2012-10-14 23:00:07 -04:00
Tom Lane	04a37a5716	Stamp 9.1.6.	2012-09-19 17:50:31 -04:00
Tom Lane	ff0b18cc49	Fix PARAM_EXEC assignment mechanism to be safe in the presence of WITH. The planner previously assumed that parameter Vars having the same absolute query level, varno, and varattno could safely be assigned the same runtime PARAM_EXEC slot, even though they might be different Vars appearing in different subqueries. This was (probably) safe before the introduction of CTEs, but the lazy-evalution mechanism used for CTEs means that a CTE can be executed during execution of some other subquery, causing the lifespan of Params at the same syntactic nesting level as the CTE to overlap with use of the same slots inside the CTE. In 9.1 we created additional hazards by using the same parameter-assignment technology for nestloop inner scan parameters, but it was broken before that, as illustrated by the added regression test. To fix, restructure the planner's management of PlannerParamItems so that items having different semantic lifespans are kept rigorously separated. This will probably result in complex queries using more runtime PARAM_EXEC slots than before, but the slots are cheap enough that this hardly matters. Also, stop generating PlannerParamItems containing Params for subquery outputs: all we really need to do is reserve the PARAM_EXEC slot number, and that now only takes incrementing a counter. The planning code is simpler and probably faster than before, as well as being more correct. Per report from Vik Reykja. Back-patch of commit `46c508fbcf` into all branches that support WITH.	2012-09-07 20:38:35 -04:00
Tom Lane	97395185b8	Make configure probe for mbstowcs_l as well as wcstombs_l. We previously supposed that any given platform would supply both or neither of these functions, so that one configure test would be sufficient. It now appears that at least on AIX this is not the case ... which is likely an AIX bug, but nonetheless we need to cope with it. So use separate tests. Per bug #6758; thanks to Andrew Hastie for doing the followup testing needed to confirm what was happening. Backpatch to 9.1, where we began using these functions.	2012-08-31 14:18:08 -04:00
Tom Lane	04e96bc69d	Stamp 9.1.5.	2012-08-14 18:41:04 -04:00
Tom Lane	b2cc611959	Fix dependencies generated during ALTER TABLE ADD CONSTRAINT USING INDEX. This command generated new pg_depend entries linking the index to the constraint and the constraint to the table, which match the entries made when a unique or primary key constraint is built de novo. However, it did not bother to get rid of the entries linking the index directly to the table. We had considered the issue when the ADD CONSTRAINT USING INDEX patch was written, and concluded that we didn't need to get rid of the extra entries. But this is wrong: ALTER COLUMN TYPE wasn't expecting such redundant dependencies to exist, as reported by Hubert Depesz Lubaczewski. On reflection it seems rather likely to break other things as well, since there are many bits of code that crawl pg_depend for one purpose or another, and most of them are pretty naive about what relationships they're expecting to find. Fortunately it's not that hard to get rid of the extra dependency entries, so let's do that. Back-patch to 9.1, where ALTER TABLE ADD CONSTRAINT USING INDEX was added.	2012-08-11 12:51:36 -04:00
Tom Lane	200ff8bf39	Fix whole-row Var evaluation to cope with resjunk columns (again). When a whole-row Var is reading the result of a subquery, we need it to ignore any "resjunk" columns that the subquery might have evaluated for GROUP BY or ORDER BY purposes. We've hacked this area before, in commit `68e40998d0`, but that fix only covered whole-row Vars of named composite types, not those of RECORD type; and it was mighty klugy anyway, since it just assumed without checking that any extra columns in the result must be resjunk. A proper fix requires getting hold of the subquery's targetlist so we can actually see which columns are resjunk (whereupon we can use a JunkFilter to get rid of them). So bite the bullet and add some infrastructure to make that possible. Per report from Andrew Dunstan and additional testing by Merlin Moncure. Back-patch to all supported branches. In 8.3, also back-patch commit `292176a118`, which for some reason I had not done at the time, but it's a prerequisite for this change.	2012-07-20 13:09:16 -04:00

1 2 3 4 5 ...

5588 commits