Commit graph

137 commits

Author SHA1 Message Date
Heikki Linnakangas
e8f9fe4ae7 Fix typos in comments.
Backpatch to all supported versions, where applicable, to make backpatching
of future fixes go more smoothly.

Josh Soref

Discussion: https://www.postgresql.org/message-id/CACZqfqCf+5qRztLPgmmosr-B0Ye4srWzzw_mo4c_8_B_mtjmJQ@mail.gmail.com
2017-02-06 11:34:28 +02:00
Tom Lane
20a8595043 Fix misestimation of n_distinct for a nearly-unique column with many nulls.
If ANALYZE found no repeated non-null entries in its sample, it set the
column's stadistinct value to -1.0, intending to indicate that the entries
are all distinct.  But what this value actually means is that the number
of distinct values is 100% of the table's rowcount, and thus it was
overestimating the number of distinct values by however many nulls there
are.  This could lead to very poor selectivity estimates, as for example
in a recent report from Andreas Joseph Krogh.  We should discount the
stadistinct value by whatever we've estimated the nulls fraction to be.
(That is what will happen if we choose to use a negative stadistinct for
a column that does have repeated entries, so this code path was just
inconsistent.)

In addition to fixing the stadistinct entries stored by several different
ANALYZE code paths, adjust the logic where get_variable_numdistinct()
forces an "all distinct" estimate on the basis of finding a relevant unique
index.  Unique indexes don't reject nulls, so there's no reason to assume
that the null fraction doesn't apply.

Back-patch to all supported branches.  Back-patching is a bit of a judgment
call, but this problem seems to affect only a few users (else we'd have
identified it long ago), and it's bad enough when it does happen that
destabilizing plan choices in a worse direction seems unlikely.

Patch by me, with documentation wording suggested by Dean Rasheed

Report: <VisenaEmail.26.df42f82acae38a58.156463942b8@tc7-visena>
Discussion: <16143.1470350371@sss.pgh.pa.us>
2016-08-07 18:52:02 -04:00
Teodor Sigaev
9d3fb209a0 Fix possible use of uninitialised value in ts_headline()
Found during investigation of failure of skink buildfarm member and its
valgrind report.

Backpatch to all supported branches
2016-04-08 21:25:59 +03:00
Tom Lane
a35f85a524 Fix not-terribly-safe coding in NIImportOOAffixes() and NIImportAffixes().
There were two places in spell.c that supposed that they could search
for a location in a string produced by lowerstr() and then transpose
the offset into the original string.  But this fails completely if
lowerstr() transforms any characters into characters of different byte
length, as can happen in Turkish UTF8 for instance.

We'd added some comments about this coding in commit 51e78ab4ff,
but failed to realize that it was not merely confusing but wrong.

Coverity complained about this code years ago, but in such an opaque
fashion that nobody understood what it was on about.  I'm not entirely
sure that this issue *is* what it's on about, actually, but perhaps
this patch will shut it up -- and in any case the problem is clear.

Back-patch to all supported branches.
2016-03-06 19:21:03 -05:00
Tom Lane
24ce5754ae Avoid use of sscanf() to parse ispell dictionary files.
It turns out that on FreeBSD-derived platforms (including OS X), the
*scanf() family of functions is pretty much brain-dead about multibyte
characters.  In particular it will apply isspace() to individual bytes
of input even when those bytes are part of a multibyte character, thus
allowing false recognition of a field-terminating space.

We appear to have little alternative other than instituting a coding
rule that *scanf() is not to be used if the input string might contain
multibyte characters.  (There was some discussion of relying on "%ls",
but that probably just moves the portability problem somewhere else,
and besides it doesn't fully prevent BSD *scanf() from using isspace().)

This patch is a down payment on that: it gets rid of use of sscanf()
to parse ispell dictionary files, which are certainly at great risk
of having a problem.  The code is cleaner this way anyway, though
a bit longer.

In passing, improve a few comments.

Report and patch by Artur Zakirov, reviewed and somewhat tweaked by me.
Back-patch to all supported branches.
2016-02-10 19:30:12 -05:00
Tom Lane
5a74ff373c Cope with more than 64K phrases in a thesaurus dictionary.
dict_thesaurus stored phrase IDs in uint16 fields, so it would get confused
and even crash if there were more than 64K entries in the configuration
file.  It turns out to be basically free to widen the phrase IDs to uint32,
so let's just do so.

This was complained of some time ago by David Boutin (in bug #7793);
he later submitted an informal patch but it was never acted on.
We now have another complaint (bug #11901 from Luc Ouellette) so it's
time to make something happen.

This is basically Boutin's patch, but for future-proofing I also added a
defense against too many words per phrase.  Note that we don't need any
explicit defense against overflow of the uint32 counters, since before that
happens we'd hit array allocation sizes that repalloc rejects.

Back-patch to all supported branches because of the crash risk.
2014-11-06 20:52:52 -05:00
Tom Lane
385f0d98a4 Improve ispell dictionary's defenses against bad affix files.
Don't crash if an ispell dictionary definition contains flags but not
any compound affixes.  (This isn't a security issue since only superusers
can install affix files, but still it's a bad thing.)

Also, be more careful about detecting whether an affix-file FLAG command
is old-format (ispell) or new-format (myspell/hunspell).  And change the
error message about mixed old-format and new-format commands into something
intelligible.

Per bug #11770 from Emre Hasegeli.  Back-patch to all supported branches.
2014-10-23 13:11:34 -04:00
Bruce Momjian
04e15c69d2 Remove tabs after spaces in C comments
This was not changed in HEAD, but will be done later as part of a
pgindent run.  Future pgindent runs will also do this.

Report by Tom Lane

Backpatch through all supported branches, but not HEAD
2014-05-06 11:26:28 -04:00
Heikki Linnakangas
8332fe76b5 Avoid memcpy() with same source and destination address.
The behavior of that is undefined, although unlikely to lead to problems in
practice.

Found by running regression tests with Valgrind.
2014-03-07 13:29:19 +02:00
Tom Lane
e3208fec32 Prevent potential overruns of fixed-size buffers.
Coverity identified a number of places in which it couldn't prove that a
string being copied into a fixed-size buffer would fit.  We believe that
most, perhaps all of these are in fact safe, or are copying data that is
coming from a trusted source so that any overrun is not really a security
issue.  Nonetheless it seems prudent to forestall any risk by using
strlcpy() and similar functions.

Fixes by Peter Eisentraut and Jozef Mlich based on Coverity reports.

In addition, fix a potential null-pointer-dereference crash in
contrib/chkpass.  The crypt(3) function is defined to return NULL on
failure, but chkpass.c didn't check for that before using the result.
The main practical case in which this could be an issue is if libc is
configured to refuse to execute unapproved hashing algorithms (e.g.,
"FIPS mode").  This ideally should've been a separate commit, but
since it touches code adjacent to one of the buffer overrun changes,
I included it in this commit to avoid last-minute merge issues.
This issue was reported by Honza Horak.

Security: CVE-2014-0065 for buffer overruns, CVE-2014-0066 for crypt()
2014-02-17 11:20:24 -05:00
Heikki Linnakangas
928aec71c4 Use memmove() instead of memcpy() for copying overlapping regions.
In commit d2495f272c, I fixed this bug in
to_tsquery(), but missed the fact that plainto_tsquery() has the same bug.
2014-02-10 09:59:49 +02:00
Tom Lane
9beffdcc3c Fix some wide-character bugs in the text-search parser.
In p_isdigit and other character class test functions generated by the
p_iswhat macro, the code path for non-C locales with multibyte encodings
contained a bogus pointer cast that would accidentally fail to malfunction
if types wchar_t and wint_t have the same width.  Apparently that is true
on most platforms, but not on recent Cygwin releases.  Remove the cast,
as it seems completely unnecessary (I think it arose from a false analogy
to the need to cast to unsigned char when dealing with the <ctype.h>
functions).  Per bug #8970 from Marco Atzeri.

In the same functions, the code path for C locale with a multibyte encoding
simply ANDed each wide character with 0xFF before passing it to the
corresponding <ctype.h> function.  This could result in false positive
answers for some non-ASCII characters, so use a range test instead.
Noted by me while investigating Marco's complaint.

Also, remove some useless though not actually buggy maskings and casts
in the hand-coded p_isalnum and p_isalpha functions, which evidently
got tested a bit more carefully than the macro-generated functions.
2014-02-01 18:27:40 -05:00
Bruce Momjian
9af4159fce pgindent run for release 9.3
This is the first run of the Perl-based pgindent script.  Also update
pgindent instructions.
2013-05-29 16:58:43 -04:00
Bruce Momjian
bd61a623ac Update copyrights for 2013
Fully update git head, and update back branches in ./COPYRIGHT and
legal.sgml files.
2013-01-01 17:15:01 -05:00
Tom Lane
1faf866ace Fix logical errors in tsquery selectivity estimation for prefix queries.
I made multiple errors in commit 97532f7c29,
stemming mostly from failure to think about the available frequency data
as being element frequencies not value frequencies (so that occurrences of
different elements are not mutually exclusive).  This led to sillinesses
such as estimating that "word" would match more rows than "word:*".

The choice to clamp to a minimum estimate of DEFAULT_TS_MATCH_SEL also
seems pretty ill-considered in hindsight, as it would frequently result in
an estimate much larger than the available data suggests.  We do need some
sort of clamp, since a pattern not matching any of the MCELEMs probably
still needs a selectivity estimate of more than zero.  I chose instead to
clamp to at least what a non-MCELEM word would be estimated as, preserving
the property that "word:*" doesn't get an estimate less than plain "word",
whether or not the word appears in MCELEM.

Per investigation of a gripe from Bill Martin, though I suspect that his
example case actually isn't even reaching the erroneous code.

Back-patch to 9.1 where this code was introduced.
2012-09-11 21:23:20 -04:00
Alvaro Herrera
c219d9b0a5 Split tuple struct defs from htup.h to htup_details.h
This reduces unnecessary exposure of other headers through htup.h, which
is very widely included by many files.

I have chosen to move the function prototypes to the new file as well,
because that means htup.h no longer needs to include tupdesc.h.  In
itself this doesn't have much effect in indirect inclusion of tupdesc.h
throughout the tree, because it's also required by execnodes.h; but it's
something to explore in the future, and it seemed best to do the htup.h
change now while I'm busy with it.
2012-08-30 16:52:35 -04:00
Robert Haas
3a0e4d36eb Make new event trigger facility actually do something.
Commit 3855968f32 added syntax, pg_dump,
psql support, and documentation, but the triggers didn't actually fire.
With this commit, they now do.  This is still a pretty basic facility
overall because event triggers do not get a whole lot of information
about what the user is trying to do unless you write them in C; and
there's still no option to fire them anywhere except at the very
beginning of the execution sequence, but it's better than nothing,
and a good building block for future work.

Along the way, add a regression test for ALTER LARGE OBJECT, since
testing of event triggers reveals that we haven't got one.

Dimitri Fontaine and Robert Haas
2012-07-20 11:39:01 -04:00
Peter Eisentraut
dd16f9480a Remove unreachable code
The Solaris Studio compiler warns about these instances, unlike more
mainstream compilers such as gcc.  But manual inspection showed that
the code is clearly not reachable, and we hope no worthy compiler will
complain about removing this code.
2012-07-16 22:15:03 +03:00
Peter Eisentraut
b8b2e3b2de Replace int2/int4 in C code with int16/int32
The latter was already the dominant use, and it's preferable because
in C the convention is that intXX means XX bits.  Therefore, allowing
mixed use of int2, int4, int8, int16, int32 is obviously confusing.

Remove the typedefs for int2 and int4 for now.  They don't seem to be
widely used outside of the PostgreSQL source tree, and the few uses
can probably be cleaned up by the time this ships.
2012-06-25 01:51:46 +03:00
Bruce Momjian
927d61eeff Run pgindent on 9.2 source tree in preparation for first 9.3
commit-fest.
2012-06-10 15:20:04 -04:00
Heikki Linnakangas
d2495f272c Fix bug in to_tsquery().
We were using memcpy() to copy to a possibly overlapping memory region,
which is a no-no. Use memmove() instead.
2012-05-15 19:27:34 +03:00
Peter Eisentraut
e9605a039b Even more duplicate word removal, in the spirit of the season 2012-05-02 20:56:03 +03:00
Robert Haas
85efd5f065 Reduce hash size for compute_array_stats, compute_tsvector_stats.
The size is only a hint, but a big hint chews up a lot of memory without
apparently improving performance much.

Analysis and patch by Noah Misch.
2012-04-23 22:05:41 -04:00
Tom Lane
0e5e167aae Collect and use element-frequency statistics for arrays.
This patch improves selectivity estimation for the array <@, &&, and @>
(containment and overlaps) operators.  It enables collection of statistics
about individual array element values by ANALYZE, and introduces
operator-specific estimators that use these stats.  In addition,
ScalarArrayOpExpr constructs of the forms "const = ANY/ALL (array_column)"
and "const <> ANY/ALL (array_column)" are estimated by treating them as
variants of the containment operators.

Since we still collect scalar-style stats about the array values as a
whole, the pg_stats view is expanded to show both these stats and the
array-style stats in separate columns.  This creates an incompatible change
in how stats for tsvector columns are displayed in pg_stats: the stats
about lexemes are now displayed in the array-related columns instead of the
original scalar-related columns.

There are a few loose ends here, notably that it'd be nice to be able to
suppress either the scalar-style stats or the array-element stats for
columns for which they're not useful.  But the patch is in good enough
shape to commit for wider testing.

Alexander Korotkov, reviewed by Noah Misch and Nathan Boley
2012-03-03 20:20:57 -05:00
Bruce Momjian
e126958c2e Update copyright notices for year 2012. 2012-01-01 18:01:58 -05:00
Peter Eisentraut
1b81c2fe6e Remove many -Wcast-qual warnings
This addresses only those cases that are easy to fix by adding or
moving a const qualifier or removing an unnecessary cast.  There are
many more complicated cases remaining.
2011-09-11 21:54:32 +03:00
Bruce Momjian
6416a82a62 Remove unnecessary #include references, per pgrminclude script. 2011-09-01 10:04:27 -04:00
Bruce Momjian
6560407c7d Pgindent run before 9.1 beta2. 2011-06-09 14:32:50 -04:00
Tom Lane
6755558b92 Improve aset.c's space management in contexts with small maxBlockSize.
The previous coding would allow requests up to half of maxBlockSize to be
treated as "chunks", but when that actually did happen, we'd waste nearly
half of the space in the malloc block containing the chunk, if no smaller
requests came along to fill it.  Avoid this scenario by limiting the
maximum size of a chunk to 1/8th maxBlockSize, so that we can waste no more
than 1/8th of the allocated space.  This will not change the behavior at
all for the default context size parameters (with large maxBlockSize),
but it will change the behavior when using ALLOCSET_SMALL_MAXSIZE.

In particular, there's no longer a need for spell.c to be overly concerned
about the request size parameters it uses, so remove a rather unhelpful
comment about that.

Merlin Moncure, per an idea of Tom Lane's
2011-05-02 12:08:08 -04:00
Tom Lane
2ab0796d7a Fix char2wchar/wchar2char to support collations properly.
These functions should take a pg_locale_t, not a collation OID, and should
call mbstowcs_l/wcstombs_l where available.  Where those functions are not
available, temporarily select the correct locale with uselocale().

This change removes the bogus assumption that all locales selectable in
a given database have the same wide-character conversion method; in
particular, the collate.linux.utf8 regression test now passes with
LC_CTYPE=C, so long as the database encoding is UTF8.

I decided to move the char2wchar/wchar2char functions out of mbutils.c and
into pg_locale.c, because they work on wchar_t not pg_wchar_t and thus
don't really belong with the mbutils.c functions.  Keeping them where they
were would have required importing pg_locale_t into pg_wchar.h somehow,
which did not seem like a good plan.
2011-04-23 12:35:41 -04:00
Tom Lane
d64713df7e Pass collations to functions in FunctionCallInfoData, not FmgrInfo.
Since collation is effectively an argument, not a property of the function,
FmgrInfo is really the wrong place for it; and this becomes critical in
cases where a cached FmgrInfo is used for varying purposes that might need
different collation settings.  Fix by passing it in FunctionCallInfoData
instead.  In particular this allows a clean fix for bug #5970 (record_cmp
not working).  This requires touching a bit more code than the original
method, but nobody ever thought that collations would not be an invasive
patch...
2011-04-12 19:19:24 -04:00
Tom Lane
1e16a8107d Teach regular expression operators to honor collations.
This involves getting the character classification and case-folding
functions in the regex library to use the collations infrastructure.
Most of this work had been done already in connection with the upper/lower
and LIKE logic, so it was a simple matter of transposition.

While at it, split out these functions into a separate source file
regc_pg_locale.c, so that they can be correctly labeled with the Postgres
project's license rather than the Scriptics license.  These functions are
100% Postgres-written code whereas what remains in regc_locale.c is still
mostly not ours, so lumping them both under the same copyright notice was
getting more and more misleading.
2011-04-10 18:03:09 -04:00
Bruce Momjian
bf50caf105 pgindent run before PG 9.1 beta 1. 2011-04-10 11:42:00 -04:00
Tom Lane
52b60530f2 Fix tsmatchsel() to account properly for null rows.
ts_typanalyze.c computes MCE statistics as fractions of the non-null rows,
which seems fairly reasonable, and anyway changing it in released versions
wouldn't be a good idea.  But then ts_selfuncs.c has to account for that.
Failure to do so results in overestimates in columns with a significant
fraction of null documents.  Back-patch to 8.4 where this stuff was
introduced.

Jesper Krogh
2011-02-17 19:00:49 -05:00
Bruce Momjian
135724ec35 Fix "variable not used" warnings when USE_WIDE_UPPER_LOWER is not
defined.
2011-02-10 16:58:02 -05:00
Peter Eisentraut
414c5a2ea6 Per-column collation support
This adds collation support for columns and domains, a COLLATE clause
to override it per expression, and B-tree index support.

Peter Eisentraut
reviewed by Pavel Stehule, Itagaki Takahiro, Robert Haas, Noah Misch
2011-02-08 23:04:18 +02:00
Bruce Momjian
97116ca417 Rename macro DECIMAL to DECIMAL_T to help pgindent; this is already
done for a few other macros in that file, for other reasons.  I also
remove pgindent/README mention of the file.
2011-02-06 10:48:17 -05:00
Bruce Momjian
5d950e3b0c Stamp copyrights for year 2011. 2011-01-01 13:18:15 -05:00
Peter Eisentraut
fc946c39ae Remove useless whitespace at end of lines 2010-11-23 22:34:55 +02:00
Robert Haas
5aa446c961 Cleanup various comparisons with the constant "true".
Itagaki Takahiro, with slight modifications.
2010-11-14 21:03:48 -05:00
Tom Lane
3e5f9412d0 Reduce the memory requirement for large ispell dictionaries.
This patch eliminates per-chunk palloc overhead for most small allocations
needed in the representation of an ispell dictionary.  This saves close to
a factor of 2 on the current Czech ispell data.  While it doesn't cover
every last small allocation in the ispell code, we are at the point of
diminishing returns, because about 95% of the allocations are covered
already.

Pavel Stehule, rather heavily revised by Tom
2010-10-06 19:31:05 -04:00
Tom Lane
9b910def24 Clean up temporary-memory management during ispell dictionary loading.
Add explicit initialization and cleanup functions to spell.c, and keep
all working state in the already-existing ISpellDict struct.  This lets us
get rid of a static variable along with some extremely shaky assumptions
about usage of child memory contexts.

This commit is just code beautification and has no impact on functionality
or performance, but it opens the way to a less-grotty implementation of
Pavel's memory-saving hack, which will follow shortly.
2010-10-06 15:15:15 -04:00
Magnus Hagander
9f2e211386 Remove cvs keywords from all files. 2010-09-20 22:08:53 +02:00
Peter Eisentraut
3f11971916 Remove extra newlines at end and beginning of files, add missing newlines
at end of files.
2010-08-19 05:57:36 +00:00
Robert Haas
fd1843ff89 Standardize get_whatever_oid functions for other object types.
- Rename TSParserGetPrsid to get_ts_parser_oid.
- Rename TSDictionaryGetDictid to get_ts_dict_oid.
- Rename TSTemplateGetTmplid to get_ts_template_oid.
- Rename TSConfigGetCfgid to get_ts_config_oid.
- Rename FindConversionByName to get_conversion_oid.
- Rename GetConstraintName to get_constraint_oid.
- Add new functions get_opclass_oid, get_opfamily_oid, get_rewrite_oid,
  get_rewrite_oid_without_relid, get_trigger_oid, and get_cast_oid.

The name of each function matches the corresponding catalog.

Thanks to KaiGai Kohei for the review.
2010-08-05 15:25:36 +00:00
Tom Lane
97532f7c29 Add some knowledge about prefix matches to tsmatchsel(). It's not terribly
bright, but it beats assuming that a prefix match behaves identically to an
exact match, which is what the code was doing before :-(.  Noted while
experimenting with Artur Dobrowski's example.
2010-08-01 21:31:08 +00:00
Tom Lane
b8c798ebc5 Tweak tsmatchsel() so that it examines the structure of the tsquery whenever
possible (ie, whenever the tsquery is a constant), even when no statistics
are available for the tsvector.  For example, foo @@ 'a & b'::tsquery
can be expected to be more selective than foo @@ 'a'::tsquery, whether
or not we know anything about foo.  We use DEFAULT_TS_MATCH_SEL as the assumed
selectivity of individual query terms when no stats are available, then
combine the terms according to the query's AND/OR structure as usual.

Per experimentation with Artur Dabrowski's example.  (The fact that there
are no stats available in that example is a problem in itself, but
nonetheless tsmatchsel should be smarter about the case.)

Back-patch to 8.4 to keep all versions of tsmatchsel() in sync.
2010-07-31 03:27:40 +00:00
Bruce Momjian
239d769e7e pgindent run for 9.0, second run 2010-07-06 19:19:02 +00:00
Tom Lane
bc0f080928 Fix misuse of Lossy Counting (LC) algorithm in compute_tsvector_stats().
We must filter out hashtable entries with frequencies less than those
specified by the algorithm, else we risk emitting junk entries whose
actual frequency is much less than other lexemes that did not get
tabulated.  This is bad enough by itself, but even worse is that
tsquerysel() believes that the minimum frequency seen in pg_statistic is a
hard upper bound for lexemes not included, and was thus underestimating
the frequency of non-MCEs.

Also, set the threshold frequency to something with a little bit of theory
behind it, to wit assume that the input distribution is approximately
Zipfian.  This might need adjustment in future, but some preliminary
experiments suggest that it's not too unreasonable.

Back-patch to 8.4, where this code was introduced.

Jan Urbanski, with some editorialization by Tom
2010-05-30 21:59:02 +00:00
Tom Lane
ed437e2b27 Adjust comments about avoiding use of printf's %.*s.
My initial impression that glibc was measuring the precision in characters
(which is what the Linux man page says it does) was incorrect.  It does take
the precision to be in bytes, but it also tries to truncate the string at a
character boundary.  The bottom line remains the same: it will mess up
if the string is not in the encoding it expects, so we need to avoid %.*s
anytime there's a significant risk of that.  Previous code changes are still
good, but adjust the comments to reflect this knowledge.  Per research by
Hernan Gonzalez.
2010-05-09 02:16:00 +00:00