This longstanding functionality evidently got lost in commit
3d6d1b5855. Noted while studying an OOM report from Jaime
Casanova. Backpatch to 9.5 where the bug was introduced.
This patch widens SPI_processed, EState's es_processed field, PortalData's
portalPos field, FuncCallContext's call_cntr and max_calls fields,
ExecutorRun's count argument, PortalRunFetch's result, and the max number
of rows in a SPITupleTable to uint64, and deals with (I hope) all the
ensuing fallout. Some of these values were declared uint32 before, and
others "long".
I also removed PortalData's posOverflow field, since that logic seems
pretty useless given that portalPos is now always 64 bits.
The user-visible results are that command tags for SELECT etc will
correctly report tuple counts larger than 4G, as will plpgsql's GET
GET DIAGNOSTICS ... ROW_COUNT command. Queries processing more tuples
than that are still not exactly the norm, but they're becoming more
common.
Most values associated with FETCH/MOVE distances, such as PortalRun's count
argument and the count argument of most SPI functions that have one, remain
declared as "long". It's not clear whether it would be worth promoting
those to int64; but it would definitely be a large dollop of additional
API churn on top of this, and it would only help 32-bit platforms which
seem relatively less likely to see any benefit.
Andreas Scherbaum, reviewed by Christian Ullrich, additional hacking by me
Adds several tsvector editting function: convert tsvector to/from text array,
set weight for given lexemes, delete lexeme(s), unnest, filter lexemes
with given weights
Author: Stas Kelvich with some editorization by me
Reviewers: Tomas Vondram, Teodor Sigaev
emit_log_hook could only see the translated text, making it harder to identify
which message was being sent. Pass original text to allow the exact message to
be identified, whichever language is used for logging.
Discussion: 20160216.184755.59721141.horiguchi.kyotaro@lab.ntt.co.jp
Author: Kyotaro Horiguchi
Currently writes to the main data files of postgres all go through the
OS page cache. This means that some operating systems can end up
collecting a large number of dirty buffers in their respective page
caches. When these dirty buffers are flushed to storage rapidly, be it
because of fsync(), timeouts, or dirty ratios, latency for other reads
and writes can increase massively. This is the primary reason for
regular massive stalls observed in real world scenarios and artificial
benchmarks; on rotating disks stalls on the order of hundreds of seconds
have been observed.
On linux it is possible to control this by reducing the global dirty
limits significantly, reducing the above problem. But global
configuration is rather problematic because it'll affect other
applications; also PostgreSQL itself doesn't always generally want this
behavior, e.g. for temporary files it's undesirable.
Several operating systems allow some control over the kernel page
cache. Linux has sync_file_range(2), several posix systems have msync(2)
and posix_fadvise(2). sync_file_range(2) is preferable because it
requires no special setup, whereas msync() requires the to-be-flushed
range to be mmap'ed. For the purpose of flushing dirty data
posix_fadvise(2) is the worst alternative, as flushing dirty data is
just a side-effect of POSIX_FADV_DONTNEED, which also removes the pages
from the page cache. Thus the feature is enabled by default only on
linux, but can be enabled on all systems that have any of the above
APIs.
While desirable and likely possible this patch does not contain an
implementation for windows.
With the infrastructure added, writes made via checkpointer, bgwriter
and normal user backends can be flushed after a configurable number of
writes. Each of these sources of writes controlled by a separate GUC,
checkpointer_flush_after, bgwriter_flush_after and backend_flush_after
respectively; they're separate because the number of flushes that are
good are separate, and because the performance considerations of
controlled flushing for each of these are different.
A later patch will add checkpoint sorting - after that flushes from the
ckeckpoint will almost always be desirable. Bgwriter flushes are most of
the time going to be random, which are slow on lots of storage hardware.
Flushing in backends works well if the storage and bgwriter can keep up,
but if not it can have negative consequences. This patch is likely to
have negative performance consequences without checkpoint sorting, but
unfortunately so has sorting without flush control.
Discussion: alpine.DEB.2.10.1506011320000.28433@sto
Author: Fabien Coelho and Andres Freund
All along, this function should have treated WindowFuncs in a manner
similar to Aggrefs, ie with an option whether or not to recurse into them.
By not considering the case, it was always recursing, which is OK for most
callers (although I suspect that the case in prepare_sort_from_pathkeys
might represent a bug). But now we need return-without-recursing behavior
as well. There are also more than a few callers that should never see a
WindowFunc, and now we'll get some error checking on that.
In commit 1d97c19a0f and later c1d9579dd8, we extended
pull_var_clause's API by adding enum-type arguments. That's sort of a pain
to maintain, though, because it means every time we add a new behavior we
must touch every last one of the call sites, even if there's a reasonable
default behavior that most of them could use. Let's switch over to using a
bitmask of flags, instead; that seems more maintainable and might save a
nanosecond or two as well. This commit changes no behavior in itself,
though I'm going to follow it up with one that does add a new behavior.
In passing, remove flatten_tlist(), which has not been used since 9.1
and would otherwise need the same API changes.
Removing these enums means that optimizer/tlist.h no longer needs to
depend on optimizer/var.h. Changing that caused a number of C files to
need addition of #include "optimizer/var.h" (probably we can thank old
runs of pgrminclude for that); but on balance it seems like a good change
anyway.
When a process is waiting for a heavyweight lock, we will now indicate
the type of heavyweight lock for which it is waiting. Also, you can
now see when a process is waiting for a lightweight lock - in which
case we will indicate the individual lock name or the tranche, as
appropriate - or for a buffer pin.
Amit Kapila, Ildus Kurbangaliev, reviewed by me. Lots of helpful
discussion and suggestions by many others, including Alexander
Korotkov, Vladimir Borodin, and many others.
Renaming a file using rename(2) is not guaranteed to be durable in face
of crashes. Use the previously added durable_rename()/durable_link_or_rename()
in various places where we previously just renamed files.
Most of the changed call sites are arguably not critical, but it seems
better to err on the side of too much durability. The most prominent
known case where the previously missing fsyncs could cause data loss is
crashes at the end of a checkpoint. After the actual checkpoint has been
performed, old WAL files are recycled. When they're filled, their
contents are fdatasynced, but we did not fsync the containing
directory. An OS/hardware crash in an unfortunate moment could then end
up leaving that file with its old name, but new content; WAL replay
would thus not replay it.
Reported-By: Tomas Vondra
Author: Michael Paquier, Tomas Vondra, Andres Freund
Discussion: 56583BDD.9060302@2ndquadrant.com
Backpatch: All supported branches
Using this facility, any utility command can report the target relation
upon which it is operating, if there is one, and up to 10 64-bit
counters; the intent of this is that users should be able to figure out
what a utility command is doing without having to resort to ugly hacks
like attaching strace to a backend.
As a demonstration, this adds very crude reporting to lazy vacuum; we
just report the target relation and nothing else. A forthcoming patch
will make VACUUM report a bunch of additional data that will make this
much more interesting. But this gets the basic framework in place.
Vinayak Pokale, Rahila Syed, Amit Langote, Robert Haas, reviewed by
Kyotaro Horiguchi, Jim Nasby, Thom Brown, Masahiko Sawada, Fujii Masao,
and Masanori Oyama.
Add four new SQL accessible functions: pg_control_system(),
pg_control_checkpoint(), pg_control_recovery(), and pg_control_init()
which expose a subset of the control file data.
Along the way move the code to read and validate the control file to
src/common, where it can be shared by the new backend functions
and the original pg_controldata frontend program.
Patch by me, significant input, testing, and review by Michael Paquier.
A thinko concerning nesting depth caused json_to_record() to produce bogus
output if a field of its input object contained a sub-object with a field
name matching one of the requested output column names. Per bug #13996
from Johann Visagie.
I added a regression test case based on his example, plus parallel tests
for json_to_recordset, jsonb_to_record, jsonb_to_recordset. The latter
three do not exhibit the same bug (which suggests that we may be missing
some opportunities to share code...) but testing seems like a good idea
in any case.
Back-patch to 9.4 where these functions were introduced.
Commits 9ff60273e3 and dbe2328959 adjusted the declarations
of some core functions referenced by contrib/tsearch2's install script,
forgetting that in a pg_upgrade situation, we'll be trying to restore
operator class definitions that reference the old signatures. We've
hit this problem before; solve it in the same way as before, namely by
installing stub functions that have the expected signature and just
invoke the correct function. Per report from Jeff Janes.
(Someday we ought to stop supporting contrib/tsearch2, but I'm not
sure today is that day.)
It's harmless, but might confuse readers. Seems to have been introduced
in 6bc8ef0b7f. Back-patch, just to avoid cosmetic cross-branch
differences.
Amit Langote
A failure partway through PGLC_localeconv() led to a situation where
the next call would call free_struct_lconv() a second time, leading
to free() on already-freed strings, typically leading to a core dump.
Add a flag to remember whether we need to do that.
Per report from Thom Brown. His example case only provokes the failure
as far back as 9.4, but nonetheless this code is obviously broken, so
back-patch to all supported branches.
This patch introduces "pg_blocking_pids(int) returns int[]", which returns
the PIDs of any sessions that are blocking the session with the given PID.
Historically people have obtained such information using a self-join on
the pg_locks view, but it's unreasonably tedious to do it that way with any
modicum of correctness, and the addition of parallel queries has pretty
much broken that approach altogether. (Given some more columns in the view
than there are today, you could imagine handling parallel-query cases with
a 4-way join; but ugh.)
The new function has the following behaviors that are painful or impossible
to get right via pg_locks:
1. Correctly understands which lock modes block which other ones.
2. In soft-block situations (two processes both waiting for conflicting lock
modes), only the one that's in front in the wait queue is reported to
block the other.
3. In parallel-query cases, reports all sessions blocking any member of
the given PID's lock group, and reports a session by naming its leader
process's PID, which will be the pg_backend_pid() value visible to
clients.
The motivation for doing this right now is mostly to fix the isolation
tests. Commit 38f8bdcac4 lobotomized
isolationtester's is-it-waiting query by removing its ability to recognize
nonconflicting lock modes, as a crude workaround for the inability to
handle soft-block situations properly. But even without the lock mode
tests, the old query was excessively slow, particularly in
CLOBBER_CACHE_ALWAYS builds; some of our buildfarm animals fail the new
deadlock-hard test because the deadlock timeout elapses before they can
probe the waiting status of all eight sessions. Replacing the pg_locks
self-join with use of pg_blocking_pids() is not only much more correct, but
a lot faster: I measure it at about 9X faster in a typical dev build with
Asserts, and 3X faster in CLOBBER_CACHE_ALWAYS builds. That should provide
enough headroom for the slower CLOBBER_CACHE_ALWAYS animals to pass the
test, without having to lengthen deadlock_timeout yet more and thus slow
down the test for everyone else.
Some over-eager copy-and-pasting on my part resulted in a nonsense
result being returned in this case. I have adopted the same pattern for
handling this case as is used in the one argument form of the function,
i.e. we just skip over the code that adds values to the object.
Diagnosis and patch from Michael Paquier, although not quite his
solution.
Fixes bug #13936.
Backpatch to 9.5 where jsonb_object was introduced.
Commit 53874c5228 broke various 32-bit
buildfarm machines because it incorrectly used an 'L' suffix for what
needed to be a 64-bit literal. Thanks to Michael Paquier for helping
to diagnose this.
This will parse strings in the format produced by pg_size_pretty() and
return sizes in bytes. This allows queries to be written with clauses
like "pg_total_relation_size(oid) > pg_size_bytes('10 GB')".
Author: Pavel Stehule with various improvements by Vitaly Burovoy
Discussion: http://www.postgresql.org/message-id/CAFj8pRD-tGoDKnxdYgECzA4On01_uRqPrwF-8LdkSE-6bDHp0w@mail.gmail.com
Reviewed-by: Vitaly Burovoy, Oleksandr Shulgin, Kyotaro Horiguchi,
Michael Paquier and Robert Haas
Move and refactor the underlying code for the pg_config client
application to src/common in support of sharing it with a new
system information SRF called pg_config() which makes the same
information available via SQL. Additionally wrap the SRF with a
new system view, as called pg_config.
Patch by me with extensive input and review by Michael Paquier
and additional review by Alvaro Herrera.
When processing ordered aggregates following a sort that could make use
of the abbreviated key optimization, only call the equality operator to
compare successive pairs of tuples when their abbreviated keys were not
equal.
Peter Geoghegan, reviewd by Andreas Karlsson and by me.
Commit 4de82f7d7 increased the WAL flush rate, mainly to increase the
likelihood that hint bits can be set quickly. More quickly set hint bits
can reduce contention around the clog et al. But unfortunately the
increased flush rate can have a significant negative performance impact,
I have measured up to a factor of ~4. The reason for this slowdown is
that if there are independent writes to the underlying devices, for
example because shared buffers is a lot smaller than the hot data set,
or because a checkpoint is ongoing, the fdatasync() calls force cache
flushes to be emitted to the storage.
This is achieved by flushing WAL only if the last flush was longer than
wal_writer_delay ago, or if more than wal_writer_flush_after (new GUC)
unflushed blocks are pending. Based on some tests the default for
wal_writer_delay is 1MB, which seems to work well both on SSD and
rotational media.
To avoid negative performance impact due to 4de82f7d7 an earlier
commit (db76b1e) made SetHintBits() more likely to succeed; preventing
performance regressions in the pgbench tests I performed.
Discussion: 20160118163908.GW10941@awork2.anarazel.de
Previously we only allowed SetHintBits() to succeed if the commit LSN of
the last transaction touching the page has already been flushed to
disk. We can't generally change the LSN of the page, because we don't
necessarily have the required locks on the page. But the required LSN
interlock does not mean the commit record has to be flushed immediately,
it just requires that the commit record will be flushed before the page is
written out. Therefore if the buffer LSN is newer than the commit LSN,
the hint bit can be safely set.
In a number of scenarios (e.g. pgbench) this noticeably increases the
number of hint bits are set. But more importantly it also keeps the
success rate up when flushing WAL less frequently. That was the original
reason for commit 4de82f7d7, which has negative performance consequences
in a number of scenarios. This will allow a followup commit to reduce
the flush rate.
Discussion: 20160118163908.GW10941@awork2.anarazel.de
Historically this message has been emitted at the end of ShutdownXLOG().
That's not an insane place for it in a standalone backend, but in the
postmaster environment we've grown a fair amount of stuff that happens
later, including archiver/walsender shutdown, stats collector shutdown,
etc. Recent buildfarm experimentation showed that on slower machines
there could be many seconds' delay between finishing ShutdownXLOG() and
actual postmaster exit. That's fairly confusing, both for testing
purposes and for DBAs. Hence, move the code that prints this message
into UnlinkLockFiles(), so that it comes out just after we remove the
postmaster's pidfile. That is a more appropriate definition of "is shut
down" from the point of view of "pg_ctl stop", for example. In general,
removing the pidfile should be the last externally-visible action of
either a postmaster or a standalone backend; compare commit
d73d14c271 for instance. So this seems
like a reasonably future-proof approach.
This is a quick hack, due to be reverted when its purpose has been served,
to try to gather information about why some of the buildfarm critters
regularly fail with "postmaster does not shut down" complaints. Maybe they
are just really overloaded, but maybe something else is going on. Hence,
instrument pg_ctl to print the current time when it starts waiting for
postmaster shutdown and when it gives up, and add a lot of logging of the
current time in the server's checkpoint and shutdown code paths.
No attempt has been made to make this pretty. I'm not even totally sure
if it will build on Windows, but we'll soon find out.
Since pgindent treats typedef names as global, the original coding of
b47b4dbf68 would have had rather nasty effects on the formatting
of other files in which "string" is used as a variable or field name.
Use a less generic name for this typedef, and rename some other
identifiers to match.
Peter Geoghegan, per gripe from me
The parser doesn't allow qualification of column names appearing in
these clauses, but ruleutils.c would sometimes qualify them, leading
to dump/reload failures. Per bug #13891 from Onder Kalaci.
(In passing, make stanzas in ruleutils.c that save/restore varprefix
more consistent.)
Peter Geoghegan
When force_parallel_mode = true, we enable the parallel mode restrictions
for all queries for which this is believed to be safe. For the subset of
those queries believed to be safe to run entirely within a worker, we spin
up a worker and run the query there instead of running it in the
original process. When force_parallel_mode = regress, make additional
changes to allow the regression tests to run cleanly even though parallel
workers have been injected under the hood.
Taken together, this facilitates both better user testing and better
regression testing of the parallelism code.
Robert Haas, with help from Amit Kapila and Rushabh Lathia.
It seems that sprintf(), at least in glibc's version, is unreasonably slow
compared to hand-rolled code for printing integers. Replacing most uses of
sprintf() in the datetime.c output functions with special-purpose code
turns out to give more than a 2X speedup in COPY of a table with a single
timestamp column; which is pretty impressive considering all the other
logic in that code path.
David Rowley and Andres Freund, reviewed by Peter Geoghegan and myself
Future PL/Java versions will close CVE-2016-0766 by making these GUCs
PGC_SUSET. This PostgreSQL change independently mitigates that PL/Java
vulnerability, helping sites that update PostgreSQL more frequently than
PL/Java. Back-patch to 9.1 (all supported versions).
An example use-case is "CHECK(num_nonnulls(a,b,c) = 1)" to assert that
exactly one of a,b,c isn't NULL. The functions are variadic, so they
can also be pressed into service to count the number of null or nonnull
elements in an array.
Marko Tiikkaja, reviewed by Pavel Stehule