Read the PostgreSQL source code with me (IX): the executor, scan nodes of the query execution module (part 1)




It is easy to see that executing an optimizable statement comes down to processing the plan nodes of the plan tree produced for it, as described earlier. Thanks to the design choices (node representation, recursive invocation, a unified interface and so on), each plan node's functionality is relatively independent and the overall code flow is similar from node to node. The following describes how the executor runs the various plan nodes.



In PostgreSQL, plan nodes fall into four categories: control nodes, scan nodes, materialization nodes, and join nodes.


    • Control nodes: nodes that handle special situations and implement special execution flows. For example, the Result node can represent the tuple to be inserted by an INSERT statement, as specified by its VALUES clause.

    • Scan nodes: as the name implies, these nodes scan objects such as tables and fetch tuples from them. For example, the SeqScan node scans a table sequentially, returning one tuple per call.

    • Materialization nodes: these nodes are more complex, but they share one feature: they can cache their execution results in auxiliary storage. A materialization node produces all of its result tuples on first execution and caches them for the nodes above it, whereas a non-materialization node produces one result tuple per call and returns it to the upper node. For example, the Sort node fetches all tuples returned by the node below it, sorts them by the specified attributes, and caches the whole sorted result; each time the upper node asks the Sort node for a tuple, the next tuple is returned from the cache in order (see the article on the Sort node among the materialization nodes).

    • Join nodes: these nodes correspond to the join operation of relational algebra. They support the various join types (conditional join, left join, right join, full join, natural join and so on), and each node implements one join algorithm. For example, HashJoin implements a hash-based join algorithm.

Scan node


The job of a scan node is to scan a table and fetch one tuple at a time as input for the node above it. Scan nodes generally appear as the leaf nodes of the query plan tree; besides tables, they can scan function result sets, value lists, subquery result sets and so on.



All scan nodes use Scan as their common parent type. Scan inherits all the fields of Plan and adds scanrelid, which records the index of the scanned table in the range table.


typedef struct Scan
{
    Plan        plan;
    Index       scanrelid;      /* relid is index into the range table */
} Scan;
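
The "inheritance" used here is plain struct embedding: the parent is the first member of the child, so a pointer to the child can be treated as a pointer to the parent. The following is a minimal sketch of that idea, not PostgreSQL code; every Demo* name and the plan_rows field are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins; the real definitions live in PostgreSQL's headers. */
typedef enum
{
    T_DemoInvalid = 0,
    T_DemoSeqScan
} DemoNodeTag;

typedef struct DemoPlan
{
    DemoNodeTag type;           /* tag identifying the concrete node type */
    double      plan_rows;      /* illustrative field only */
} DemoPlan;

typedef struct DemoScan
{
    DemoPlan     plan;          /* the "parent" must be the first member */
    unsigned int scanrelid;     /* index into the range table */
} DemoScan;

/* Works on any node whose first member is a DemoPlan. */
static DemoNodeTag
demo_node_tag(const void *node)
{
    assert(node != NULL);
    return ((const DemoPlan *) node)->type;
}

/* Down-cast from the parent type, but only after checking the tag. */
static unsigned int
demo_scanrelid(const DemoPlan *plan)
{
    assert(plan != NULL);
    if (plan->type != T_DemoSeqScan)
        return 0;               /* 0 is used here as "not a scan node" */
    return ((const DemoScan *) plan)->scanrelid;
}
```

Because the tag sits at offset zero of every node, generic code can dispatch on it without knowing the concrete type in advance.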


The execution state nodes of scan nodes share ScanState as their common parent type. Besides inheriting all the fields of PlanState, ScanState defines ss_currentScanDesc (the scan descriptor: scan position, relation and so on), ss_currentRelation (the relation being scanned) and ss_ScanTupleSlot (the slot holding the scanned tuple).


typedef struct ScanState
{
    PlanState   ps;             /* its first field is NodeTag */
    Relation    ss_currentRelation;
    HeapScanDesc ss_currentScanDesc;
    TupleTableSlot *ss_ScanTupleSlot;
} ScanState;


Here are all the scan types from the source:


    T_SeqScanState,
    T_SampleScanState,
    T_IndexScanState,
    T_IndexOnlyScanState,
    T_BitmapIndexScanState,
    T_BitmapHeapScanState,
    T_TidScanState,
    T_SubqueryScanState,
    T_FunctionScanState,
    T_ValuesScanState,
    T_CteScanState,
    T_WorkTableScanState,
    T_ForeignScanState,
    T_CustomScanState,


They are described below.



Each scan node has its own execution function, but all of them are implemented on top of the common execution function ExecScan.


TupleTableSlot *
ExecScan(ScanState *node,
         ExecScanAccessMtd accessMtd,   /* function returning a tuple */
         ExecScanRecheckMtd recheckMtd)


ExecScan takes three parameters:


    • node: the state node, a ScanState;
    • accessMtd: a function pointer that fetches the next scan tuple (each scan node scans a different kind of object, so each supplies its own function);
    • recheckMtd: a function pointer that checks whether a tuple still satisfies the filter conditions. It exists for concurrency control: if the current tuple was modified and committed by another transaction, the executor must re-check that the tuple still satisfies the selection criteria.


ExecScan scans the object iteratively, returning one result per call (the tuple is fetched internally through ExecScanFetch). ExecScanFetch normally calls accessMtd to get the next tuple and uses recheckMtd when a tuple is being re-checked after a concurrent update; ExecScan then applies the node's qualification and projection and returns the resulting tuple.
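
The control flow can be pictured with a small model. This is only a sketch under simplifying assumptions, not the real ExecScan: the Demo* types, the upper_bound field standing in for the qualification, and the epq_tuple field standing in for a pending re-check are all invented for illustration, and projection is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct DemoTuple
{
    int         unique1;
} DemoTuple;

typedef struct DemoScanState
{
    const DemoTuple *rows;       /* the "table" being scanned */
    size_t      nrows;
    size_t      next;            /* position of the next row to fetch */
    int         upper_bound;     /* stands in for the qual: unique1 < upper_bound */
    const DemoTuple *epq_tuple;  /* non-NULL: a tuple waiting to be re-checked */
} DemoScanState;

typedef const DemoTuple *(*DemoAccessMtd) (DemoScanState *node);
typedef bool (*DemoRecheckMtd) (DemoScanState *node, const DemoTuple *tuple);

/* Access method of a sequential scan: next row, or NULL at end of table. */
static const DemoTuple *
demo_seq_next(DemoScanState *node)
{
    if (node->next >= node->nrows)
        return NULL;
    return &node->rows[node->next++];
}

/* Like SeqRecheck in the article: a plain sequential scan re-verifies nothing. */
static bool
demo_seq_recheck(DemoScanState *node, const DemoTuple *tuple)
{
    (void) node;
    (void) tuple;
    return true;
}

/* Fetch step: the normal path uses accessMtd, the re-check path recheckMtd. */
static const DemoTuple *
demo_scan_fetch(DemoScanState *node, DemoAccessMtd accessMtd,
                DemoRecheckMtd recheckMtd)
{
    if (node->epq_tuple != NULL)
    {
        const DemoTuple *tuple = node->epq_tuple;

        node->epq_tuple = NULL;  /* hand the re-check tuple out only once */
        return recheckMtd(node, tuple) ? tuple : NULL;
    }
    return accessMtd(node);
}

/* Loop until a fetched tuple passes the qual, or the source is exhausted. */
static const DemoTuple *
demo_exec_scan(DemoScanState *node, DemoAccessMtd accessMtd,
               DemoRecheckMtd recheckMtd)
{
    const DemoTuple *tuple;

    assert(accessMtd != NULL && recheckMtd != NULL);
    while ((tuple = demo_scan_fetch(node, accessMtd, recheckMtd)) != NULL)
    {
        if (tuple->unique1 < node->upper_bound)
            return tuple;
    }
    return NULL;
}
```

A different scan type would reuse demo_exec_scan unchanged and supply only its own pair of callbacks, which is the point of the shared entry function.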



For example:


EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 100;

                                  QUERY PLAN
------------------------------------------------------------------------------
 Bitmap Heap Scan on tenk1  (cost=5.07..229.20 rows=101 width=244)
   Recheck Cond: (unique1 < 100)                  <--- the role of recheckMtd
   ->  Bitmap Index Scan on tenk1_unique1  (cost=0.00..5.04 rows=101 width=0)
         Index Cond: (unique1 < 100)


1.SeqScan node


SeqScan is the most basic scan node: it scans a physical table sequentially, without the aid of any index. Its plan node SeqScan is simply an alias of the Scan node and defines no additional fields. Its execution state node SeqScanState likewise uses ScanState directly.



Initialization of a SeqScan node is done by the function ExecInitSeqScan. The function first creates a SeqScanState structure and links the SeqScan node into its ps field. It then calls ExecInitExpr to initialize the plan node's target list and qualification and links them into the corresponding fields of SeqScanState. Next, it allocates the slots that hold the result tuple and the scan tuple. Finally, it uses the scanrelid field of the plan node to obtain the RelationData structure of the scanned table, links it into the ss_currentRelation field, and calls heap_beginscan with that information to initialize the scan descriptor ss_currentScanDesc.



The execution function of the SeqScan node is ExecSeqScan. In this function:



It calls ExecScan, passing a pointer to SeqNext as the accessMtd argument and a pointer to SeqRecheck as the recheckMtd argument. SeqNext obtains and returns the next tuple through heap_getnext, provided by the storage module;
After ExecScan obtains a tuple through SeqNext, it checks the tuple against the qualification and projects it as required by the plan node, and finally returns a result tuple that satisfies the requirements. SeqRecheck does no processing or checking at all, because a sequential scan passes no scan keys to heap_beginscan (it walks the table by itself, so there is nothing to re-verify; more on this later).



The execution functions of the other scan nodes are organized in the same way: they all call ExecScan, passing different function pointers for accessMtd and recheckMtd depending on the node type.



Cleanup of the SeqScan node is done by the function ExecEndSeqScan, which additionally calls heap_endscan to release the information held in ss_currentScanDesc.


2.SampleScan node


This is the data sampling feature introduced in version 9.5, which lets a query return a sample of the data. The TABLESAMPLE clause is currently accepted only on regular tables and materialized views.



The syntax is roughly:


SELECT select_list FROM table_name TABLESAMPLE sampling_method ( argument [, ...] ) [ REPEATABLE ( seed ) ]


For usage, see:



http://www.postgres.cn/docs/9.5/sql-select.html (using sample)



http://www.postgres.cn/docs/9.5/tablesample-method.html (custom sample function)



In plain terms, it lets me draw a sample of the rows in a table. That saves you from building a lottery system. Awesome!! (smiley face)



Let's look at the node structure:


typedef struct SampleScan
{
    Scan        scan;
    /* use struct pointer to avoid including parsenodes.h here */
    struct TableSampleClause *tablesample;
} SampleScan;


As you can see, it adds a pointer to a TableSampleClause on top of Scan. That structure is defined as follows:


typedef struct TableSampleClause
{
    NodeTag     type;
    Oid         tsmhandler;     /* OID of the tablesample handler function */
    List       *args;           /* tablesample argument expression(s) */
    Expr       *repeatable;     /* REPEATABLE expression, or NULL if none */
} TableSampleClause;


The state of a SampleScan is described by the SampleScanState structure below. Put simply, it adds the sampling method and its arguments, the random seed, and other sampling-related information on top of ScanState. This data is derived from the TableSampleClause structure of the SampleScan node.


typedef struct SampleScanState
{
    ScanState   ss;
    List       *args;           /* expr states for TABLESAMPLE params */
    ExprState  *repeatable;     /* expr state for REPEATABLE expr */
    /* use struct pointer to avoid including tsmapi.h here */
    struct TsmRoutine *tsmroutine;      /* descriptor for tablesample method */
    void       *tsm_state;      /* tablesample method can keep state here */
    bool        use_bulkread;   /* use bulkread buffer access strategy? */
    bool        use_pagemode;   /* use page-at-a-time visibility checking? */
    bool        begun;          /* false means need to call BeginSampleScan */
    uint32      seed;           /* random seed */
} SampleScanState;
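
To make the role of the seed concrete, here is a minimal sketch of row-level (Bernoulli-style) sampling with a repeatable seed. It is not PostgreSQL's implementation: the Demo* names are invented and the tiny linear congruential generator is only a placeholder for whatever random source a real sampling method uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct DemoSampler
{
    uint32_t    state;          /* PRNG state, initialised from the seed */
    double      fraction;       /* sampling fraction in [0, 1] */
} DemoSampler;

/* Start (or restart) a sample scan with a given seed and percentage. */
static void
demo_sampler_begin(DemoSampler *s, uint32_t seed, double percent)
{
    assert(percent >= 0.0 && percent <= 100.0);
    s->state = seed;
    s->fraction = percent / 100.0;
}

/* Placeholder LCG step; returns a value in [0, 1). */
static double
demo_sampler_random(DemoSampler *s)
{
    s->state = s->state * 1664525u + 1013904223u;
    return (double) s->state / 4294967296.0;
}

/* Writes the indexes of the chosen rows into out[] and returns their count. */
static size_t
demo_sample_rows(DemoSampler *s, size_t nrows, size_t *out)
{
    size_t      n = 0;

    for (size_t i = 0; i < nrows; i++)
    {
        /* Each row is kept independently with probability "fraction". */
        if (demo_sampler_random(s) < s->fraction)
            out[n++] = i;
    }
    return n;
}
```

Restarting with the same seed replays the same sequence of random draws, so the same rows are chosen again; that is the behaviour the REPEATABLE clause asks for.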


As for the rest, please read the code yourself ~


3.IndexScan node


If an index exists on the attributes involved in the selection criteria, the generated query plan uses an IndexScan node to scan the table. This node uses the index to fetch the tuples that satisfy the selection criteria.



The definition of the IndexScan node is as follows. In addition to the fields inherited from Scan, IndexScan defines indexid (the OID of the index), indexqual (the index scan conditions), indexqualorig (the original, unprocessed list of scan conditions) and indexorderdir (the scan direction).


typedef struct IndexScan
{
    Scan        scan;
    Oid         indexid;        /* OID of index to scan */
    List       *indexqual;      /* list of index quals (usually OpExprs) */
    List       *indexqualorig;  /* the same in original form */
    List       *indexorderby;   /* list of index ORDER BY exprs */
    List       *indexorderbyorig;       /* the same in original form */
    List       *indexorderbyops;    /* OIDs of sort ops for ORDER BY exprs */
    ScanDirection indexorderdir;    /* forward or backward or don't care */
} IndexScan;

Initialization of the IndexScan node is done by the function ExecInitIndexScan. The function builds the IndexScanState node, uses indexid to obtain the RelationData structure of the index, and stores it in the iss_RelationDesc field. It also calls ExecIndexBuildScanKeys to convert the index scan conditions in indexqual into scan keys (ScanKey, the conditions the scan must satisfy) and runtime key structures (IndexRuntimeKeyInfo, expressions whose values are only known at execution time), stored in the iss_ScanKeys and iss_RuntimeKeys arrays respectively. iss_NumScanKeys and iss_NumRuntimeKeys give the lengths of these two arrays, and the flag iss_RuntimeKeysReady is set to false. Finally, index_beginscan, provided by the index module, is called to initialize the scan descriptor iss_ScanDesc. The original qualification list, not specially processed for the index scan, is initialized into the indexqualorig field. (The IndexOnlyScanState listing below uses analogous fields with an ioss_ prefix.)


typedef struct IndexOnlyScanState
{
    ScanState   ss;             /* its first field is NodeTag */
    List       *indexqual;                  /* execution state for indexqual expressions */
    ScanKey     ioss_ScanKeys;              /* Skey structures for index quals */
    int         ioss_NumScanKeys;           /* number of ScanKeys */
    ScanKey     ioss_OrderByKeys;           /* Skey structures for index ordering operators */
    int         ioss_NumOrderByKeys;        /* number of OrderByKeys */
    IndexRuntimeKeyInfo *ioss_RuntimeKeys;  /* info about Skeys that must be evaluated at runtime */
    int         ioss_NumRuntimeKeys;        /* number of RuntimeKeys */
    bool        ioss_RuntimeKeysReady;      /* true if runtime Skeys have been computed */
    ExprContext *ioss_RuntimeContext;       /* expr context for evaling runtime Skeys */
    Relation    ioss_RelationDesc;          /* index relation descriptor */
    IndexScanDesc ioss_ScanDesc;            /* index scan descriptor */
    Buffer      ioss_VMBuffer;              /* buffer in use for visibility map testing, if any */
    long        ioss_HeapFetches;           /* number of tuples we were forced to fetch from heap */
} IndexOnlyScanState;


Execution of the IndexScan node is done by the function ExecIndexScan. It is likewise driven by ExecScan, but the IndexScan node uses IndexNext to fetch tuples. ExecIndexScan first checks whether there are runtime keys that still need to be computed (iss_RuntimeKeysReady is false); if so, it calls ExecIndexReScan (named ExecReScanIndexScan in newer sources) to evaluate all the iss_RuntimeKeys expressions and store the results in the associated iss_ScanKeys. It then calls ExecScan, which fetches tuples through IndexNext; IndexNext in turn calls index_getnext, provided by the index module, to fetch tuples by way of the index.



Cleanup of IndexScan is done by the function ExecEndIndexScan, which releases the index relation descriptor iss_RelationDesc (by calling index_close) and the index scan descriptor iss_ScanDesc (by calling index_endscan).


4.IndexOnlyScan node


An index-only scan applies when the set of columns stored in the index covers all the columns referenced by the query: the answer can be read from the index itself, with no need to fetch the heap data blocks.



For example, given this table:


create table test(id int, name text, age int);
insert into test select generate_series(1,100000),'test'::text,generate_series(1,100000);


We create a composite index on id and age:


create index test_id_age on test(id ,age);


Then, execute the query:


explain select id, age from test  where id < 20 and age >0;


The resulting plan is:


postgres=# explain select id ,age from test where id < 20 and age >0;
                                  QUERY PLAN                                   
-------------------------------------------------------------------------------
 Index Only Scan using test_id_age on test  (cost=0.29..41.94 rows=20 width=8)
   Index Cond: ((id < 20) AND (age > 0))
(2 rows)


The id and age columns requested by the query are both in the index test_id_age, so scanning the index already yields the (id, age) values and there is no need to visit the table to fetch the rows. Everything we need comes from the index, hence the name Index Only Scan.
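
The covering condition can be pictured as a subset test: every column the query needs must be one of the index columns. The sketch below shows only that idea; demo_index_covers is a hypothetical helper, and the planner's real decision involves more than column names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true if every needed column name appears among the index columns. */
static bool
demo_index_covers(const char *const *index_cols, size_t n_index,
                  const char *const *needed_cols, size_t n_needed)
{
    assert(index_cols != NULL && needed_cols != NULL);
    for (size_t i = 0; i < n_needed; i++)
    {
        bool        found = false;

        for (size_t j = 0; j < n_index; j++)
        {
            if (strcmp(needed_cols[i], index_cols[j]) == 0)
            {
                found = true;
                break;
            }
        }
        if (!found)
            return false;       /* this column would need a heap visit */
    }
    return true;
}
```

With the index (id, age) from the example, a query that touches only id and age passes the test, while one that also selects name does not and would have to read the table.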



This may raise a question about IndexOnlyScan: if the index is not up to date, could the query return stale data?
Don't worry about this; we can look at this function of IndexOnlyScan:


void ExecReScanIndexOnlyScan(IndexOnlyScanState *node)


The node does more than hand ExecScan a different pair of accessMtd and recheckMtd function pointers for its node type. It also:


 *      Recalculates the values of any scan keys whose value depends on
 *      information known at runtime, then rescans the indexed relation.


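In other words, the scan keys are recomputed first, and the scan is then performed with the new keys. (Row visibility is a separate matter: as the ioss_VMBuffer and ioss_HeapFetches fields above suggest, the node consults the visibility map and falls back to fetching the heap tuple when a page is not known to be all-visible.) The call path is as follows: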


ExecIndexOnlyScan
     -> ExecReScan * This is rescan, update scan keys
         -> ExecReScanIndexOnlyScan
    
     -> ExecScan ## Scan with new scan keys


Note that IndexOnlyScan does not support recheck:


static bool
IndexOnlyRecheck(IndexOnlyScanState *node, TupleTableSlot *slot)
{
    elog(ERROR, "EvalPlanQual recheck is not supported in index-only scans");
    return false;               /* keep compiler quiet */
}


5.BitmapIndexScan node


The BitmapIndexScan node also scans using an index on the attributes, but its result is not actual tuples: it is a bitmap that marks the positions (page and offset) of the tuples satisfying the condition. When the BitmapIndexScan node is executed for the first time, it finds all tuples that satisfy the condition and marks them in the bitmap; a special scan node above it (for example the BitmapHeapScan described below) then uses the bitmap to fetch the actual tuples. Because this scan produces no tuples itself, the node is never invoked through ExecProcNode; it is not a standalone execution node and is only called by those special upper nodes.



The definition of BitmapIndexScan is almost identical to that of IndexScan, except that it has no need for the indexorderdir and indexorderby fields, because the bitmap covers all matching tuples at once.


typedef struct BitmapIndexScan
{
    Scan        scan;
    Oid         indexid;        /* OID of index to scan */
    List       *indexqual;      /* list of index quals (OpExprs) */
    List       *indexqualorig;  /* the same in original form */
} BitmapIndexScan;

The execution state node BitmapIndexScanState is similar to IndexScanState, with additional fields for an array of index array keys and its length. Execution of BitmapIndexScan is also similar to IndexScan, except that the initialization function ExecInitBitmapIndexScan uses index_beginscan_bitmap to initialize the scan state, and the function MultiExecBitmapIndexScan calls index_getbitmap to build the bitmap, which is kept in the biss_result field of the execution state node.


typedef struct BitmapIndexScanState
{
    ScanState   ss;             /* its first field is NodeTag */
    TIDBitmap  *biss_result;                /* bitmap to return output into, or NULL */
    ScanKey     biss_ScanKeys;              /* Skey structures for index quals */
    int         biss_NumScanKeys;           /* number of ScanKeys */
    IndexRuntimeKeyInfo *biss_RuntimeKeys;  /* info about Skeys that must be evaluated at runtime */
    int         biss_NumRuntimeKeys;        /* number of RuntimeKeys */
    IndexArrayKeyInfo *biss_ArrayKeys;      /* info about Skeys that come from ScalarArrayOpExprs */
    int         biss_NumArrayKeys;          /* number of ArrayKeys */
    bool        biss_RuntimeKeysReady;      /* true if runtime Skeys have been computed */
    ExprContext *biss_RuntimeContext;       /* expr context for evaling runtime Skeys */
    Relation    biss_RelationDesc;          /* index relation descriptor */
    IndexScanDesc biss_ScanDesc;            /* index scan descriptor */
} BitmapIndexScanState;


6.BitmapHeapScan


The BitmapIndexScan node described above outputs a bitmap rather than tuples. To fetch the actual tuples from that bitmap, PostgreSQL provides the BitmapHeapScan node.



The BitmapHeapScan node is defined as follows. On top of Scan it adds only a qualification-check field (bitmapqualorig), which serves the same purpose as indexqualorig in the IndexScan node: when a concurrent transaction has modified and committed the tuple currently being processed, the updated tuple is re-checked against this expression to see whether it still satisfies the condition, instead of rebuilding the bitmap. BitmapHeapScan has exactly one child (the left child), which must be a plan node that outputs a bitmap.


typedef struct BitmapHeapScan
{
    Scan        scan;
    List       *bitmapqualorig; /* index quals, in standard expr form */
} BitmapHeapScan;


The initialization function ExecInitBitmapHeapScan initializes the scan descriptor ss_currentScanDesc from the scanrelid in the node. The remaining setup is performed during execution.


The execution function ExecBitmapHeapScan passes a pointer to BitmapHeapNext to ExecScan, which uses it to fetch tuples. BitmapHeapNext first checks whether tbm (the bitmap) in BitmapHeapScanState is NULL; if so, it calls MultiExecProcNode to obtain the bitmap from the left child and calls tbm_begin_iterate to initialize tbmiterator. If prefetching is needed, it calls tbm_begin_iterate again to initialize prefetch_iterator, and sets prefetch_pages to 0 and prefetch_target to -1. Execution then walks the bitmap with tbmiterator to obtain the offsets of the physical tuples, fetches each tuple from the corresponding buffer at that offset, and returns it.


typedef struct BitmapHeapScanState
{
    ScanState   ss;             /* its first field is NodeTag */
    List       *bitmapqualorig;     /* execution state for bitmapqualorig expressions */
    TIDBitmap  *tbm;                /* bitmap obtained from child index scan(s) */
    TBMIterator *tbmiterator;       /* iterator for scanning current pages */
    TBMIterateResult *tbmres;       /* current-page data */
    long        exact_pages;        /* total number of exact pages retrieved */
    long        lossy_pages;        /* total number of lossy pages retrieved */
    TBMIterator *prefetch_iterator; /* iterator for prefetching ahead of current page */
    int         prefetch_pages;     /* # pages prefetch iterator is ahead of current */
    int         prefetch_target;    /* target prefetch distance */
} BitmapHeapScanState;
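
The division of labour between the two bitmap nodes can be sketched with a toy bitmap: the index side sets one bit per matching tuple in whatever order the index yields them, and the heap side then walks the bitmap page by page. This is an illustration only; the Demo* names and the fixed 4-page, 32-slot layout are assumptions, and the real TIDBitmap is considerably more elaborate (it can, for instance, degrade to lossy pages).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_NPAGES      4
#define DEMO_MAX_OFFSET  32     /* offsets 1..32 map to bits 0..31 */

typedef struct DemoTid
{
    unsigned int page;
    unsigned int offset;        /* 1-based position within the page */
} DemoTid;

typedef struct DemoBitmap
{
    uint32_t    pages[DEMO_NPAGES];     /* one bit per tuple slot */
} DemoBitmap;

static void
demo_bitmap_init(DemoBitmap *bm)
{
    memset(bm, 0, sizeof(*bm));
}

/* "Index scan" side: mark a matching tuple; duplicates are harmless. */
static void
demo_bitmap_add(DemoBitmap *bm, DemoTid tid)
{
    assert(tid.page < DEMO_NPAGES);
    assert(tid.offset >= 1 && tid.offset <= DEMO_MAX_OFFSET);
    bm->pages[tid.page] |= (uint32_t) 1 << (tid.offset - 1);
}

/* "Heap scan" side: emit the marked TIDs in physical (page, offset) order. */
static size_t
demo_bitmap_scan(const DemoBitmap *bm, DemoTid *out, size_t max_out)
{
    size_t      n = 0;

    for (unsigned int page = 0; page < DEMO_NPAGES; page++)
    {
        for (unsigned int off = 1; off <= DEMO_MAX_OFFSET; off++)
        {
            if ((bm->pages[page] >> (off - 1)) & 1u)
            {
                if (n == max_out)
                    return n;   /* output array is full */
                out[n].page = page;
                out[n].offset = off;
                n++;
            }
        }
    }
    return n;
}
```

Because the heap side visits pages in ascending order, each page is read at most once no matter in which order the index produced the matches.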


The cleanup function ExecEndBitmapHeapScan calls the cleanup function of the left child node, then frees tbmiterator, prefetch_iterator and the tbm bitmap, and finally ends the scan descriptor and closes the open table.


7.TidScan node


PostgreSQL identifies the physical location of a tuple with a data type called TID (tuple identifier). A TID consists of a block number and an offset within the block; the system column ctid is of this type.



PostgreSQL's native tables are heap tables: rows are stored in heap pages. A B-tree index entry stores, besides the values of the indexed columns, the ctid (row locator) of the corresponding row, and rows are retrieved through it. A row can therefore be fetched quickly by its ctid.



A ctid is written as (page_number, item_number); pages are numbered starting from 0 and items starting from 1.
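
As a small illustration of this notation, the sketch below parses and prints the textual form of a TID. DemoItemPointer and both helper functions are hypothetical names for this article, not the server's own types or input routines.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

typedef struct DemoItemPointer
{
    unsigned int block;         /* page number, starting at 0 */
    unsigned int offset;        /* item number within the page, starting at 1 */
} DemoItemPointer;

/* Parses text such as "(0,5)"; returns false if it is not a valid TID. */
static bool
demo_tid_parse(const char *text, DemoItemPointer *tid)
{
    unsigned int block;
    unsigned int offset;
    char        close;

    assert(text != NULL && tid != NULL);
    if (sscanf(text, " (%u ,%u %c", &block, &offset, &close) != 3)
        return false;
    if (close != ')' || offset < 1)
        return false;           /* item numbers start at 1 */
    tid->block = block;
    tid->offset = offset;
    return true;
}

/* Writes the "(block,offset)" form into buf. */
static void
demo_tid_format(DemoItemPointer tid, char *buf, size_t size)
{
    snprintf(buf, size, "(%u,%u)", tid.block, tid.offset);
}
```

For instance, "(0,5)" denotes the fifth item on the first page, while "(0,0)" is rejected because no item has number 0.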



Example:


postgres=# select ctid, * from zxc;
 ctid  | id |   name
-------+----+----------
 (0,4) |  1 | asdftest
 (0,5) |  3 | asdftest
 (0,6) |  9 | asdftest
(3 rows)


Then we can use Ctid to access the data:


postgres=# select * from zxc where ctid = '(0,5)'::tid;
 id |   name
----+----------
  3 | asdftest
(1 row)


Also, after declaring a cursor, you can use an UPDATE/DELETE ... WHERE CURRENT OF ... statement to modify or delete the tuple on which the cursor is currently positioned. See: http://www.postgres.cn/docs/9.5/sql-update.html



The query plan tree generated in that case contains a single TidScan node. The object it scans is a list of expressions stored in the TidScan node; evaluating these expressions yields ctid values, and the TidScan node fetches the corresponding tuples by ctid. On top of Scan, the TidScan node adds only one field, tidquals, which holds the list of expressions that produce ctids.



The initialization function of the TidScan node, ExecInitTidScan, calls ExecInitExpr on the expressions in tidquals and stores the result in the tss_tidquals field of TidScanState. It also opens the relation identified by the scanrelid in the node and stores it in ss_currentRelation; no heap scan descriptor is needed here, since tuples are fetched directly by ctid.



The execution function of the TidScan node, ExecTidScan, also calls ExecScan, passing a pointer to TidNext. TidNext first builds the tss_TidList array by evaluating the expressions in the tss_tidquals list of the TidScanState node. The array holds a series of ctids; tss_NumTids records its length, and tss_TidPtr records the position of the ctid currently being processed, with an initial value of -1. TidNext then takes the next ctid from tss_TidList and calls heap_fetch, provided by the storage module, to fetch the tuple by ctid and return it. For concurrency reasons, when the TidScan node serves a "CURRENT OF cursor_name" statement, the tuple at the obtained ctid may have been updated by another transaction, so the latest version of the tuple must first be located by following its update chain before heap_fetch is called.
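
The cursor-style iteration described above (a list of ctids plus a position that starts at -1) can be modelled as follows. This is a sketch under made-up assumptions: the Demo* types, the flat array standing in for the heap, and the fixed eight slots per page are all invented, and the CURRENT OF handling is not modelled.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_SLOTS_PER_PAGE 8

typedef struct DemoTid
{
    unsigned int block;
    unsigned int offset;        /* 1-based */
} DemoTid;

typedef struct DemoRow
{
    int         live;           /* 0 means the slot holds no visible tuple */
    int         id;
} DemoRow;

typedef struct DemoHeap
{
    const DemoRow *rows;        /* nblocks * DEMO_SLOTS_PER_PAGE slots */
    unsigned int nblocks;
} DemoHeap;

typedef struct DemoTidScanState
{
    const DemoHeap *heap;
    const DemoTid *tid_list;    /* ctids computed from the quals */
    int         num_tids;       /* length of tid_list */
    int         tid_ptr;        /* index of the current ctid, initially -1 */
} DemoTidScanState;

/* Stand-in for a fetch by ctid: NULL if the TID does not name a live row. */
static const DemoRow *
demo_heap_fetch(const DemoHeap *heap, DemoTid tid)
{
    const DemoRow *row;

    if (tid.block >= heap->nblocks ||
        tid.offset < 1 || tid.offset > DEMO_SLOTS_PER_PAGE)
        return NULL;
    row = &heap->rows[tid.block * DEMO_SLOTS_PER_PAGE + (tid.offset - 1)];
    return row->live ? row : NULL;
}

/* Advance through the ctid list until one of them yields a row. */
static const DemoRow *
demo_tid_next(DemoTidScanState *node)
{
    assert(node->tid_ptr >= -1);
    while (node->tid_ptr + 1 < node->num_tids)
    {
        const DemoRow *row;

        node->tid_ptr++;
        row = demo_heap_fetch(node->heap, node->tid_list[node->tid_ptr]);
        if (row != NULL)
            return row;
    }
    return NULL;                /* list exhausted */
}
```

Starting the position at -1 lets the same "advance, then fetch" step serve both the first call and every later one.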



The cleanup function ExecEndTidScan needs no special handling: it simply releases the associated memory context and the space allocated during initialization.



About seven scan methods remain; they are covered in the next article.


