mirror of
https://github.com/postgres/postgres.git
synced 2026-03-04 14:24:01 -05:00
Since some preparation work had already been done, the only source changes left were changing empty-element tags like <xref linkend="foo"> to <xref linkend="foo"/>, and changing the DOCTYPE. The source files are still named *.sgml, but they are actually XML files now. Renaming could be considered later. In the build system, the intermediate step to convert from SGML to XML is removed. Everything is build straight from the source files again. The OpenSP (or the old SP) package is no longer needed. The documentation toolchain instructions are updated and are much simpler now. Peter Eisentraut, Alexander Lakhin, Jürgen Purtz
555 lines
24 KiB
Text
555 lines
24 KiB
Text
<!-- doc/src/sgml/arch-dev.sgml -->
|
|
|
|
<chapter id="overview">
|
|
<title>Overview of PostgreSQL Internals</title>
|
|
|
|
<note>
|
|
<title>Author</title>
|
|
<para>
|
|
This chapter originated as part of
|
|
<xref linkend="sim98"/>, Stefan Simkovics'
|
|
Master's Thesis prepared at Vienna University of Technology under the direction
|
|
of O.Univ.Prof.Dr. Georg Gottlob and Univ.Ass. Mag. Katrin Seyr.
|
|
</para>
|
|
</note>
|
|
|
|
<para>
|
|
This chapter gives an overview of the internal structure of the
|
|
backend of <productname>PostgreSQL</productname>. After having
|
|
read the following sections you should have an idea of how a query
|
|
is processed. This chapter does not aim to provide a detailed
|
|
description of the internal operation of
|
|
<productname>PostgreSQL</productname>, as such a document would be
|
|
very extensive. Rather, this chapter is intended to help the reader
|
|
understand the general sequence of operations that occur within the
|
|
backend from the point at which a query is received, to the point
|
|
at which the results are returned to the client.
|
|
</para>
|
|
|
|
<sect1 id="query-path">
|
|
<title>The Path of a Query</title>
|
|
|
|
<para>
|
|
Here we give a short overview of the stages a query has to pass in
|
|
order to obtain a result.
|
|
</para>
|
|
|
|
<procedure>
|
|
<step>
|
|
<para>
|
|
A connection from an application program to the <productname>PostgreSQL</productname>
|
|
server has to be established. The application program transmits a
|
|
query to the server and waits to receive the results sent back by the
|
|
server.
|
|
</para>
|
|
</step>
|
|
|
|
<step>
|
|
<para>
|
|
The <firstterm>parser stage</firstterm> checks the query
|
|
transmitted by the application
|
|
program for correct syntax and creates
|
|
a <firstterm>query tree</firstterm>.
|
|
</para>
|
|
</step>
|
|
|
|
<step>
|
|
<para>
|
|
The <firstterm>rewrite system</firstterm> takes
|
|
the query tree created by the parser stage and looks for
|
|
any <firstterm>rules</firstterm> (stored in the
|
|
<firstterm>system catalogs</firstterm>) to apply to
|
|
the query tree. It performs the
|
|
transformations given in the <firstterm>rule bodies</firstterm>.
|
|
</para>
|
|
|
|
<para>
|
|
One application of the rewrite system is in the realization of
|
|
<firstterm>views</firstterm>.
|
|
Whenever a query against a view
|
|
(i.e., a <firstterm>virtual table</firstterm>) is made,
|
|
the rewrite system rewrites the user's query to
|
|
a query that accesses the <firstterm>base tables</firstterm> given in
|
|
the <firstterm>view definition</firstterm> instead.
|
|
</para>
|
|
</step>
|
|
|
|
<step>
|
|
<para>
|
|
The <firstterm>planner/optimizer</firstterm> takes
|
|
the (rewritten) query tree and creates a
|
|
<firstterm>query plan</firstterm> that will be the input to the
|
|
<firstterm>executor</firstterm>.
|
|
</para>
|
|
|
|
<para>
|
|
It does so by first creating all possible <firstterm>paths</firstterm>
|
|
leading to the same result. For example if there is an index on a
|
|
relation to be scanned, there are two paths for the
|
|
scan. One possibility is a simple sequential scan and the other
|
|
possibility is to use the index. Next the cost for the execution of
|
|
each path is estimated and the cheapest path is chosen. The cheapest
|
|
path is expanded into a complete plan that the executor can use.
|
|
</para>
|
|
</step>
|
|
|
|
<step>
|
|
<para>
|
|
The executor recursively steps through
|
|
the <firstterm>plan tree</firstterm> and
|
|
retrieves rows in the way represented by the plan.
|
|
The executor makes use of the
|
|
<firstterm>storage system</firstterm> while scanning
|
|
relations, performs <firstterm>sorts</firstterm> and <firstterm>joins</firstterm>,
|
|
evaluates <firstterm>qualifications</firstterm> and finally hands back the rows derived.
|
|
</para>
|
|
</step>
|
|
</procedure>
|
|
|
|
<para>
|
|
In the following sections we will cover each of the above listed items
|
|
in more detail to give a better understanding of <productname>PostgreSQL</productname>'s internal
|
|
control and data structures.
|
|
</para>
|
|
</sect1>
|
|
|
|
<sect1 id="connect-estab">
|
|
<title>How Connections are Established</title>
|
|
|
|
<para>
|
|
<productname>PostgreSQL</productname> is implemented using a
|
|
simple <quote>process per user</quote> client/server model. In this model
|
|
there is one <firstterm>client process</firstterm> connected to
|
|
exactly one <firstterm>server process</firstterm>. As we do not
|
|
know ahead of time how many connections will be made, we have to
|
|
use a <firstterm>master process</firstterm> that spawns a new
|
|
server process every time a connection is requested. This master
|
|
process is called <literal>postgres</literal> and listens at a
|
|
specified TCP/IP port for incoming connections. Whenever a request
|
|
for a connection is detected the <literal>postgres</literal>
|
|
process spawns a new server process. The server tasks
|
|
communicate with each other using <firstterm>semaphores</firstterm> and
|
|
<firstterm>shared memory</firstterm> to ensure data integrity
|
|
throughout concurrent data access.
|
|
</para>
|
|
|
|
<para>
|
|
The client process can be any program that understands the
|
|
<productname>PostgreSQL</productname> protocol described in
|
|
<xref linkend="protocol"/>. Many clients are based on the
|
|
C-language library <application>libpq</application>, but several independent
|
|
implementations of the protocol exist, such as the Java
|
|
<application>JDBC</application> driver.
|
|
</para>
|
|
|
|
<para>
|
|
Once a connection is established the client process can send a query
|
|
to the <firstterm>backend</firstterm> (server). The query is transmitted using plain text,
|
|
i.e., there is no parsing done in the <firstterm>frontend</firstterm> (client). The
|
|
server parses the query, creates an <firstterm>execution plan</firstterm>,
|
|
executes the plan and returns the retrieved rows to the client
|
|
by transmitting them over the established connection.
|
|
</para>
|
|
</sect1>
|
|
|
|
<sect1 id="parser-stage">
|
|
<title>The Parser Stage</title>
|
|
|
|
<para>
|
|
The <firstterm>parser stage</firstterm> consists of two parts:
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
The <firstterm>parser</firstterm> defined in
|
|
<filename>gram.y</filename> and <filename>scan.l</filename> is
|
|
built using the Unix tools <application>bison</application>
|
|
and <application>flex</application>.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
The <firstterm>transformation process</firstterm> does
|
|
modifications and augmentations to the data structures returned by the parser.
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
|
|
<sect2>
|
|
<title>Parser</title>
|
|
|
|
<para>
|
|
The parser has to check the query string (which arrives as plain
|
|
text) for valid syntax. If the syntax is correct a
|
|
<firstterm>parse tree</firstterm> is built up and handed back;
|
|
otherwise an error is returned. The parser and lexer are
|
|
implemented using the well-known Unix tools <application>bison</application>
|
|
and <application>flex</application>.
|
|
</para>
|
|
|
|
<para>
|
|
The <firstterm>lexer</firstterm> is defined in the file
|
|
<filename>scan.l</filename> and is responsible
|
|
for recognizing <firstterm>identifiers</firstterm>,
|
|
the <firstterm>SQL key words</firstterm> etc. For
|
|
every key word or identifier that is found, a <firstterm>token</firstterm>
|
|
is generated and handed to the parser.
|
|
</para>
|
|
|
|
<para>
|
|
The parser is defined in the file <filename>gram.y</filename> and
|
|
consists of a set of <firstterm>grammar rules</firstterm> and
|
|
<firstterm>actions</firstterm> that are executed whenever a rule
|
|
is fired. The code of the actions (which is actually C code) is
|
|
used to build up the parse tree.
|
|
</para>
|
|
|
|
<para>
|
|
The file <filename>scan.l</filename> is transformed to the C
|
|
source file <filename>scan.c</filename> using the program
|
|
<application>flex</application> and <filename>gram.y</filename> is
|
|
transformed to <filename>gram.c</filename> using
|
|
<application>bison</application>. After these transformations
|
|
have taken place a normal C compiler can be used to create the
|
|
parser. Never make any changes to the generated C files as they
|
|
will be overwritten the next time <application>flex</application>
|
|
or <application>bison</application> is called.
|
|
|
|
<note>
|
|
<para>
|
|
The mentioned transformations and compilations are normally done
|
|
automatically using the <firstterm>makefiles</firstterm>
|
|
shipped with the <productname>PostgreSQL</productname>
|
|
source distribution.
|
|
</para>
|
|
</note>
|
|
</para>
|
|
|
|
<para>
|
|
A detailed description of <application>bison</application> or
|
|
the grammar rules given in <filename>gram.y</filename> would be
|
|
beyond the scope of this paper. There are many books and
|
|
documents dealing with <application>flex</application> and
|
|
<application>bison</application>. You should be familiar with
|
|
<application>bison</application> before you start to study the
|
|
grammar given in <filename>gram.y</filename> otherwise you won't
|
|
understand what happens there.
|
|
</para>
|
|
|
|
</sect2>
|
|
|
|
<sect2>
|
|
<title>Transformation Process</title>
|
|
|
|
<para>
|
|
The parser stage creates a parse tree using only fixed rules about
|
|
the syntactic structure of SQL. It does not make any lookups in the
|
|
system catalogs, so there is no possibility to understand the detailed
|
|
semantics of the requested operations. After the parser completes,
|
|
the <firstterm>transformation process</firstterm> takes the tree handed
|
|
back by the parser as input and does the semantic interpretation needed
|
|
to understand which tables, functions, and operators are referenced by
|
|
the query. The data structure that is built to represent this
|
|
information is called the <firstterm>query tree</firstterm>.
|
|
</para>
|
|
|
|
<para>
|
|
The reason for separating raw parsing from semantic analysis is that
|
|
system catalog lookups can only be done within a transaction, and we
|
|
do not wish to start a transaction immediately upon receiving a query
|
|
string. The raw parsing stage is sufficient to identify the transaction
|
|
control commands (<command>BEGIN</command>, <command>ROLLBACK</command>, etc), and
|
|
these can then be correctly executed without any further analysis.
|
|
Once we know that we are dealing with an actual query (such as
|
|
<command>SELECT</command> or <command>UPDATE</command>), it is okay to
|
|
start a transaction if we're not already in one. Only then can the
|
|
transformation process be invoked.
|
|
</para>
|
|
|
|
<para>
|
|
The query tree created by the transformation process is structurally
|
|
similar to the raw parse tree in most places, but it has many differences
|
|
in detail. For example, a <structname>FuncCall</structname> node in the
|
|
parse tree represents something that looks syntactically like a function
|
|
call. This might be transformed to either a <structname>FuncExpr</structname>
|
|
or <structname>Aggref</structname> node depending on whether the referenced
|
|
name turns out to be an ordinary function or an aggregate function.
|
|
Also, information about the actual data types of columns and expression
|
|
results is added to the query tree.
|
|
</para>
|
|
</sect2>
|
|
</sect1>
|
|
|
|
<sect1 id="rule-system">
|
|
<title>The <productname>PostgreSQL</productname> Rule System</title>
|
|
|
|
<para>
|
|
<productname>PostgreSQL</productname> supports a powerful
|
|
<firstterm>rule system</firstterm> for the specification
|
|
of <firstterm>views</firstterm> and ambiguous <firstterm>view updates</firstterm>.
|
|
Originally the <productname>PostgreSQL</productname>
|
|
rule system consisted of two implementations:
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
The first one worked using <firstterm>row level</firstterm> processing and was
|
|
implemented deep in the <firstterm>executor</firstterm>. The rule system was
|
|
called whenever an individual row had been accessed. This
|
|
implementation was removed in 1995 when the last official release
|
|
of the <productname>Berkeley Postgres</productname> project was
|
|
transformed into <productname>Postgres95</productname>.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
The second implementation of the rule system is a technique
|
|
called <firstterm>query rewriting</firstterm>.
|
|
The <firstterm>rewrite system</firstterm> is a module
|
|
that exists between the <firstterm>parser stage</firstterm> and the
|
|
<firstterm>planner/optimizer</firstterm>. This technique is still implemented.
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
|
|
<para>
|
|
The query rewriter is discussed in some detail in
|
|
<xref linkend="rules"/>, so there is no need to cover it here.
|
|
We will only point out that both the input and the output of the
|
|
rewriter are query trees, that is, there is no change in the
|
|
representation or level of semantic detail in the trees. Rewriting
|
|
can be thought of as a form of macro expansion.
|
|
</para>
|
|
|
|
</sect1>
|
|
|
|
<sect1 id="planner-optimizer">
|
|
<title>Planner/Optimizer</title>
|
|
|
|
<para>
|
|
The task of the <firstterm>planner/optimizer</firstterm> is to
|
|
create an optimal execution plan. A given SQL query (and hence, a
|
|
query tree) can be actually executed in a wide variety of
|
|
different ways, each of which will produce the same set of
|
|
results. If it is computationally feasible, the query optimizer
|
|
will examine each of these possible execution plans, ultimately
|
|
selecting the execution plan that is expected to run the fastest.
|
|
</para>
|
|
|
|
<note>
|
|
<para>
|
|
In some situations, examining each possible way in which a query
|
|
can be executed would take an excessive amount of time and memory
|
|
space. In particular, this occurs when executing queries
|
|
involving large numbers of join operations. In order to determine
|
|
a reasonable (not necessarily optimal) query plan in a reasonable amount
|
|
of time, <productname>PostgreSQL</productname> uses a <firstterm>Genetic
|
|
Query Optimizer</firstterm> (see <xref linkend="geqo"/>) when the number of joins
|
|
exceeds a threshold (see <xref linkend="guc-geqo-threshold"/>).
|
|
</para>
|
|
</note>
|
|
|
|
<para>
|
|
The planner's search procedure actually works with data structures
|
|
called <firstterm>paths</firstterm>, which are simply cut-down representations of
|
|
plans containing only as much information as the planner needs to make
|
|
its decisions. After the cheapest path is determined, a full-fledged
|
|
<firstterm>plan tree</firstterm> is built to pass to the executor. This represents
|
|
the desired execution plan in sufficient detail for the executor to run it.
|
|
In the rest of this section we'll ignore the distinction between paths
|
|
and plans.
|
|
</para>
|
|
|
|
<sect2>
|
|
<title>Generating Possible Plans</title>
|
|
|
|
<para>
|
|
The planner/optimizer starts by generating plans for scanning each
|
|
individual relation (table) used in the query. The possible plans
|
|
are determined by the available indexes on each relation.
|
|
There is always the possibility of performing a
|
|
sequential scan on a relation, so a sequential scan plan is always
|
|
created. Assume an index is defined on a
|
|
relation (for example a B-tree index) and a query contains the
|
|
restriction
|
|
<literal>relation.attribute OPR constant</literal>. If
|
|
<literal>relation.attribute</literal> happens to match the key of the B-tree
|
|
index and <literal>OPR</literal> is one of the operators listed in
|
|
the index's <firstterm>operator class</firstterm>, another plan is created using
|
|
the B-tree index to scan the relation. If there are further indexes
|
|
present and the restrictions in the query happen to match a key of an
|
|
index, further plans will be considered. Index scan plans are also
|
|
generated for indexes that have a sort ordering that can match the
|
|
query's <literal>ORDER BY</literal> clause (if any), or a sort ordering that
|
|
might be useful for merge joining (see below).
|
|
</para>
|
|
|
|
<para>
|
|
If the query requires joining two or more relations,
|
|
plans for joining relations are considered
|
|
after all feasible plans have been found for scanning single relations.
|
|
The three available join strategies are:
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
<firstterm>nested loop join</firstterm>: The right relation is scanned
|
|
once for every row found in the left relation. This strategy
|
|
is easy to implement but can be very time consuming. (However,
|
|
if the right relation can be scanned with an index scan, this can
|
|
be a good strategy. It is possible to use values from the current
|
|
row of the left relation as keys for the index scan of the right.)
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
<firstterm>merge join</firstterm>: Each relation is sorted on the join
|
|
attributes before the join starts. Then the two relations are
|
|
scanned in parallel, and matching rows are combined to form
|
|
join rows. This kind of join is more
|
|
attractive because each relation has to be scanned only once.
|
|
The required sorting might be achieved either by an explicit sort
|
|
step, or by scanning the relation in the proper order using an
|
|
index on the join key.
|
|
</para>
|
|
</listitem>
|
|
|
|
<listitem>
|
|
<para>
|
|
<firstterm>hash join</firstterm>: the right relation is first scanned
|
|
and loaded into a hash table, using its join attributes as hash keys.
|
|
Next the left relation is scanned and the
|
|
appropriate values of every row found are used as hash keys to
|
|
locate the matching rows in the table.
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
|
|
<para>
|
|
When the query involves more than two relations, the final result
|
|
must be built up by a tree of join steps, each with two inputs.
|
|
The planner examines different possible join sequences to find the
|
|
cheapest one.
|
|
</para>
|
|
|
|
<para>
|
|
If the query uses fewer than <xref linkend="guc-geqo-threshold"/>
|
|
relations, a near-exhaustive search is conducted to find the best
|
|
join sequence. The planner preferentially considers joins between any
|
|
two relations for which there exist a corresponding join clause in the
|
|
<literal>WHERE</literal> qualification (i.e., for
|
|
which a restriction like <literal>where rel1.attr1=rel2.attr2</literal>
|
|
exists). Join pairs with no join clause are considered only when there
|
|
is no other choice, that is, a particular relation has no available
|
|
join clauses to any other relation. All possible plans are generated for
|
|
every join pair considered by the planner, and the one that is
|
|
(estimated to be) the cheapest is chosen.
|
|
</para>
|
|
|
|
<para>
|
|
When <varname>geqo_threshold</varname> is exceeded, the join
|
|
sequences considered are determined by heuristics, as described
|
|
in <xref linkend="geqo"/>. Otherwise the process is the same.
|
|
</para>
|
|
|
|
<para>
|
|
The finished plan tree consists of sequential or index scans of
|
|
the base relations, plus nested-loop, merge, or hash join nodes as
|
|
needed, plus any auxiliary steps needed, such as sort nodes or
|
|
aggregate-function calculation nodes. Most of these plan node
|
|
types have the additional ability to do <firstterm>selection</firstterm>
|
|
(discarding rows that do not meet a specified Boolean condition)
|
|
and <firstterm>projection</firstterm> (computation of a derived column set
|
|
based on given column values, that is, evaluation of scalar
|
|
expressions where needed). One of the responsibilities of the
|
|
planner is to attach selection conditions from the
|
|
<literal>WHERE</literal> clause and computation of required
|
|
output expressions to the most appropriate nodes of the plan
|
|
tree.
|
|
</para>
|
|
</sect2>
|
|
</sect1>
|
|
|
|
<sect1 id="executor">
|
|
<title>Executor</title>
|
|
|
|
<para>
|
|
The <firstterm>executor</firstterm> takes the plan created by the
|
|
planner/optimizer and recursively processes it to extract the required set
|
|
of rows. This is essentially a demand-pull pipeline mechanism.
|
|
Each time a plan node is called, it must deliver one more row, or
|
|
report that it is done delivering rows.
|
|
</para>
|
|
|
|
<para>
|
|
To provide a concrete example, assume that the top
|
|
node is a <literal>MergeJoin</literal> node.
|
|
Before any merge can be done two rows have to be fetched (one from
|
|
each subplan). So the executor recursively calls itself to
|
|
process the subplans (it starts with the subplan attached to
|
|
<literal>lefttree</literal>). The new top node (the top node of the left
|
|
subplan) is, let's say, a
|
|
<literal>Sort</literal> node and again recursion is needed to obtain
|
|
an input row. The child node of the <literal>Sort</literal> might
|
|
be a <literal>SeqScan</literal> node, representing actual reading of a table.
|
|
Execution of this node causes the executor to fetch a row from the
|
|
table and return it up to the calling node. The <literal>Sort</literal>
|
|
node will repeatedly call its child to obtain all the rows to be sorted.
|
|
When the input is exhausted (as indicated by the child node returning
|
|
a NULL instead of a row), the <literal>Sort</literal> code performs
|
|
the sort, and finally is able to return its first output row, namely
|
|
the first one in sorted order. It keeps the remaining rows stored so
|
|
that it can deliver them in sorted order in response to later demands.
|
|
</para>
|
|
|
|
<para>
|
|
The <literal>MergeJoin</literal> node similarly demands the first row
|
|
from its right subplan. Then it compares the two rows to see if they
|
|
can be joined; if so, it returns a join row to its caller. On the next
|
|
call, or immediately if it cannot join the current pair of inputs,
|
|
it advances to the next row of one table
|
|
or the other (depending on how the comparison came out), and again
|
|
checks for a match. Eventually, one subplan or the other is exhausted,
|
|
and the <literal>MergeJoin</literal> node returns NULL to indicate that
|
|
no more join rows can be formed.
|
|
</para>
|
|
|
|
<para>
|
|
Complex queries can involve many levels of plan nodes, but the general
|
|
approach is the same: each node computes and returns its next output
|
|
row each time it is called. Each node is also responsible for applying
|
|
any selection or projection expressions that were assigned to it by
|
|
the planner.
|
|
</para>
|
|
|
|
<para>
|
|
The executor mechanism is used to evaluate all four basic SQL query types:
|
|
<command>SELECT</command>, <command>INSERT</command>, <command>UPDATE</command>, and
|
|
<command>DELETE</command>. For <command>SELECT</command>, the top-level executor
|
|
code only needs to send each row returned by the query plan tree off
|
|
to the client. For <command>INSERT</command>, each returned row is inserted
|
|
into the target table specified for the <command>INSERT</command>. This is
|
|
done in a special top-level plan node called <literal>ModifyTable</literal>.
|
|
(A simple
|
|
<command>INSERT ... VALUES</command> command creates a trivial plan tree
|
|
consisting of a single <literal>Result</literal> node, which computes just one
|
|
result row, and <literal>ModifyTable</literal> above it to perform the insertion.
|
|
But <command>INSERT ... SELECT</command> can demand the full power
|
|
of the executor mechanism.) For <command>UPDATE</command>, the planner arranges
|
|
that each computed row includes all the updated column values, plus
|
|
the <firstterm>TID</firstterm> (tuple ID, or row ID) of the original target row;
|
|
this data is fed into a <literal>ModifyTable</literal> node, which uses the
|
|
information to create a new updated row and mark the old row deleted.
|
|
For <command>DELETE</command>, the only column that is actually returned by the
|
|
plan is the TID, and the <literal>ModifyTable</literal> node simply uses the TID
|
|
to visit each target row and mark it deleted.
|
|
</para>
|
|
|
|
</sect1>
|
|
|
|
</chapter>
|