Query Processing#

Overview#

  • Databases implement relational algebra operators as their basic units of operation. The implemented operators are bag versions of the relational algebra operators, to stay consistent with SQL (plus a group-by operation).

  • A logical query plan is represented as a query tree where the relations are leaves of the tree and the algebra operators are intermediate nodes.

  • A single algebra operation can have many different implementations. For example, joins can be implemented using sort-merge-join, hash-based-join, block-nested-join, etc.

  • The physical implementation plan of a query is a query tree with additional information: the amount of memory reserved for each node and the specific implementation chosen for each operation.

    We will use M to denote the number of blocks available to a query or operation.

  • Given a query, there may be many potential query trees. Each has an associated cost that can be estimated before the query is executed.

    The dominant cost is always the total number of disk pages read and written to execute the query.

  • Note that the CPU complexity of a query (i.e. the complexity of the algorithm once the data is in memory) is disregarded for the following reasons:

    • The cost of reading/writing data is much higher than any operation in memory, so the dominating cost is disk access.

    • We assume each algorithm is implemented as efficiently as possible.

    • It may not even be possible to bring all the data for a query in memory at once, which may result in the data being read multiple times.

Disk Access Process (Overly Simplified)#

  • Remember that to process any data, it must be first brought to memory.

  • Some DBMS component indicates it wants to read record R

    • File Manager:

      Does security check

      Uses access structures to determine the page it is on

      Asks the buffer manager to find that page

    • Buffer Manager

      Checks to see if the page is already in the buffer

      If so, gives the buffer address to the requestor

      If not, allocates a buffer frame

      Asks the Disk Manager to get the page

    • Disk Manager

      Determines the physical address(es) of the page

      Asks the disk controller to get the appropriate block of data from the physical address

      Disk controller instructs disk driver to do the dirty job

Example query plans#

  • Let’s look at a simple query:

    SELECT R.C,S.E
    FROM R,S
    WHERE R.B=S.D and R.A=10 ;
    

    and two potential logical query trees:

../../_images/query_plans.png
  • In plan A, we compute the join of R and S first. Then, the selection and projection can be computed over the results of the join.

  • In plan B, we first select the tuples of R that satisfy the condition R.A=10. The resulting tuples are then joined with the tuples of S, and we apply the projection to the result.

Iterator Interface#

  • Each operator in the database is implemented using three main functions:

    • open() initializes the necessary memory structures (i.e. M buffers) and/or streams

    • getNext() reads data input streams and processes the data until a block full of output is produced or the input is completely processed, puts the output to the output buffer

    • close() frees all the structures used by the operator

  • Since each operator works the same way, we can chain up the operators by using the input buffer for an operation as the output buffer for the operation below.

    • If this is the last operation in the query tree, then the output buffer is simply the standard output to the user executing the query.

  • Certain operations are easy to combine with the previous operation without any additional cost. These are identified as “on-the-fly”. For example, if the tuples are already in memory, any selection and projection operation can be performed in memory.

  • Certain operations are blocking, i.e. these operations have to be completed before any tuples can be processed by the operations above. For example, sorting and group by are blocking operations unless the relation is already sorted by the relevant attributes.

  • If the operations are not blocking, then the output of one operation is pipelined into the input of the next operator immediately, producing some output tuples before the query is completed.
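  • The iterator interface can be sketched in Python as follows. Scan and Select here are illustrative stand-ins, not a real DBMS API; each operator pulls blocks of tuples from its child, so on-the-fly operations like selection add no extra I/O:

```python
class Operator:
    """Minimal iterator interface: open / get_next / close (sketch)."""
    def open(self): ...
    def get_next(self):  # returns one block (list) of tuples, or None when done
        ...
    def close(self): ...

class Scan(Operator):
    """Hypothetical sequential scan over in-memory 'pages' of tuples."""
    def __init__(self, pages):
        self.pages = pages
    def open(self):
        self.i = 0
    def get_next(self):
        if self.i >= len(self.pages):
            return None
        block = self.pages[self.i]
        self.i += 1
        return block
    def close(self):
        self.pages = None

class Select(Operator):
    """On-the-fly selection: filters each block in memory, no extra I/O."""
    def __init__(self, child, pred):
        self.child, self.pred = child, pred
    def open(self):
        self.child.open()
    def get_next(self):
        block = self.child.get_next()
        if block is None:
            return None
        return [t for t in block if self.pred(t)]
    def close(self):
        self.child.close()

# Pipelining: the parent pulls blocks from its child as they are produced.
plan = Select(Scan([[(10, 'a'), (3, 'b')], [(10, 'c')]]), lambda t: t[0] == 10)
plan.open()
out = []
while (b := plan.get_next()) is not None:
    out.extend(b)
plan.close()
```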

Example Execution#

  • Suppose we are processing the above query

    SELECT R.C,S.E
    FROM R,S
    WHERE R.B=S.D and R.A=10 ;
    

    by mapping the logical query trees above to more explicit physical implementations:

../../_images/query_plans_physical.png
  • In Plan A 1: To prepare the join, we scan R to fill the memory blocks for the block-nested-loop-join. Once sufficient blocks are filled, we start scanning S to compute the join.

    • We add a projection to the sequential scan of R to keep only the attributes needed by the query. This reduces the size of the relation in memory and may reduce the cost of the join (if we end up reading S fewer times). The projection is done in memory as tuples are read and has no additional cost.

    • The next() operation for the join will output tuples to the selection operation as soon as joining tuples are found. Hence the join and selection are pipelined; neither is blocking.

    • The selection operation reads the tuples output by the join, removes tuples that do not satisfy the selection condition, and projects the requested attributes. These operations are performed completely in memory and incur no additional disk I/O.

  • In Plan B 1: Since we decided to do the selection first, we can use an index for the selection.

    • As tuples matching the index search condition are found and read from the disk, we can project the necessary attributes and fill the input buffer for the block-nested-loop join as before.

    • We have the index search operation being pipelined into the join, which is combined with the last projection operation. Because of this, there is no additional cost of reading the relation R in the join. The cost of reading R is included in the index scan.

Query Execution Costs#

  • Often the cost of a query is the sum of the costs of its operations. However, additional costs arise if intermediate results must be written to disk.

../../_images/query_plans_physical_2.png
  • Let’s assume the following statistics for these relations.

    PAGES(R) = 1,000
    PAGES(S) = 5,000
    PAGES( PROJECT_(R.A,R.B,R.C) (R) ) = 500
    PAGES( PROJECT_(S.D,S.E) (S) ) = 800

    PAGES( PROJECT_(R.B,R.C) (SELECT_(R.A=10) R) ) = 15

    Index on R(A): 200 leaf nodes
    Selectivity (R.A=10) = 1/100
    TUPLES(R) = 20,000
    
  • Plan A 1: The main cost is the cost of join because the remaining selection and projection are performed on the fly.

    • Block-nested-loop-join: We read R once, 1,000 pages.

    • Since we do the projection while reading R, we only need 500 pages of space to keep R in memory; with M=20 blocks, this requires reading S 500/20 = 25 times.

    • Total cost: 1,000 + 25*5,000 = 126,000 pages.

  • Plan A 2: We will sort R and S fully, and then compute the join.

    • Sort R, read R fully (1,000 pages), project attributes A,B,C and write 500 pages worth of data using M=20 blocks, which is 500/20=25 partially sorted groups. (Cost = 1,500 pages)

    • Merge 25 groups into 2 groups using M=20 pages, and 2 groups into 1 group. Read/write 500 pages 2 times. (Cost = 500*4=2,000 pages)

    • Sort S: Read S fully (5,000 pages), project attributes D,E and write 800 pages worth of data using M=20 blocks into 800/20=40 partially sorted groups. (Cost = 5,800 pages)

    • Merge 40 groups into 2 using M=20 pages, and 2 groups into 1 group. Read/write 800 pages 2 times. (Cost = 800*4=3,200 pages)

    • Assuming the join involves the unique attributes R.B and S.D, the sort-merge-join can be done by reading sorted R and S only once, for a total cost of 500+800 = 1,300 pages.

    • Total cost = 1,500 + 2,000 + 5,800 + 3,200 + 1,300 = 13,800 pages.

    • Note that the actual cost is likely lower, as the last merge steps of the sorts can be combined with the join, saving 2,600 pages.

  • Plan B 1: The index scan reads 200/100 = 2 leaf nodes. The scan finds an estimated 20,000/100 = 200 tuples, which requires reading 200 pages in the worst case. The cost of the index scan is 2+200 pages (plus 1 or 2 pages for internal nodes, which we disregard).

    • Block nested loop does not have any additional cost of reading R since the R tuples are already in memory after the index scan. We simply need to find out how many times S needs to be read for the join.

    • Since the resulting 200 tuples from R fit in 15 pages (according to the statistics above) and the join has M=20 blocks of memory allocated, all the found tuples of R fit in memory completely; we can complete the join by reading S only once. Total additional cost for the join: 5,000 pages (reading S once).

    • Total cost = 202 + 5,000 = 5,002.

  • In this example, plan B is the clear winner because the selection on R.A is very selective.
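  • The Plan A 1 arithmetic above can be double-checked with a short calculation (a sketch; the variable names are ours):

```python
from math import ceil

# Plan A 1 cost check: block-nested-loop join using the statistics above.
M = 20                     # memory blocks available to the join
pages_R = 1_000            # PAGES(R)
pages_S = 5_000            # PAGES(S)
pages_R_projected = 500    # pages of R after the on-the-fly projection

passes_over_S = ceil(pages_R_projected / M)   # 500/20 = 25 chunks of R
cost = pages_R + passes_over_S * pages_S      # read R once, S once per chunk
assert cost == 126_000
```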

Query operators#

  • In the rest of the course notes, we go over some basic types of operator implementations. Sort and block-nested-loop-join were already covered in the previous section.

  • Query operators are classified into classes:

    • One pass

    • Two pass

    • Multi-pass

    depending on the availability of memory, the storage method of the relation (e.g. whether it is sorted), and the number of pages it occupies on disk.

One pass algorithms#

  • These algorithms require a single pass over a given relation.

  • One easy way to think about this is as follows. Suppose we have a unary operation over relation R. If R fits in the memory allocated to the query, then we can easily compute the operation in memory. However, many operations do not require all of R to be read into memory at once either. This is discussed below for each operation.

Duplicate removal#

  • Given M pages of memory. Let X be 1 page, and Y be M-1 pages in memory.

    Read R into X, 1 page at a time.
    
    For each tuple t, check:
       If tuple t is in Y:
          it is already seen, remove t.
       Else
          insert t into Y.
    
  • getNext() will read one block from Y and output it. The next time it is called, we process the input until another block of Y is filled.

  • This is a one-pass operation only if after duplicate removal, R fits in M-1 blocks.

  • Not a blocking operation: we can start outputting tuples as soon as they are inserted into Y, as long as we mark them so that we do not return the same tuple more than once.
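  • The duplicate-removal loop above can be sketched in Python; Y is modeled here as a set rather than M-1 explicit buffer pages:

```python
# One-pass duplicate removal, following the X / Y layout above (sketch).
def one_pass_distinct(blocks):
    seen = set()                 # the Y region (M-1 pages in a real system)
    for block in blocks:         # each block read into X, one page at a time
        for t in block:
            if t not in seen:    # not seen before: remember and output it
                seen.add(t)
                yield t          # output immediately (non-blocking)

blocks = [[1, 2, 2], [3, 1], [4]]
assert list(one_pass_distinct(blocks)) == [1, 2, 3, 4]
```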

Group by#

  • Given M pages of memory. Let X be 1 page, and Y be M-1 pages in memory.

    For each group in Y, we are going to keep:
        the grouping attributes
        the aggregate value
           for min, max, count, sum: keep min, max, count,
           sum of the tuples seen so far
    
           for avg, keep count and sum.
    
    
    Read R into X, 1 page at a time.
    For each tuple t, check:
        If the corresponding group for tuple t is in Y:
             update its aggregates.
    
        Else
              create a new group for t into Y and initialize its statistics.
    
  • This is possible only if all the results fit in Y (M-1 blocks). It is a blocking operation: we cannot output any tuples until we finish processing all of R.
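  • A sketch of the one-pass group-by in Python; the groups dictionary (the Y region) maps each grouping key to running (count, sum), and avg is derived only at the end:

```python
# One-pass GROUP BY with COUNT/SUM/AVG, as described above (sketch).
def one_pass_group_by(blocks, key, value):
    groups = {}                                  # the Y region
    for block in blocks:                         # read R one page at a time
        for t in block:
            k = key(t)
            cnt, total = groups.get(k, (0, 0))
            groups[k] = (cnt + 1, total + value(t))
    # avg is computed from count and sum only once all of R is seen (blocking)
    return {k: (cnt, total, total / cnt) for k, (cnt, total) in groups.items()}

blocks = [[('a', 2), ('b', 4)], [('a', 4)]]
result = one_pass_group_by(blocks, key=lambda t: t[0], value=lambda t: t[1])
assert result == {'a': (2, 6, 3.0), 'b': (1, 4, 4.0)}
```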

Set and bag operators#

  • Bag Union of R and S:

    Read R one block at a time and output
    Read S one block at a time and output
    
  • Set Union of R and S:

    Read R and remove duplicates
    Read S into the same space and continue to remove duplicates
    
  • Set intersection:

    Read R, remove duplicates and store into M-1 blocks (section Y).
    Read S.
    If the tuple is in Y
          output and remove from Y
    Else
          discard the tuple
    
  • Bag Intersection:

    Bag intersection requires that we keep track of how many copies of each tuple there are.

    Read R, group by all attributes and add a count for R.
    Store into M-1 blocks (Y section).
    
    Read S, for each tuple:
        If is in the Y section
             increment the count for S
        Else
             disregard
    Output min of count of R and S.
    
  • All set/bag operations are defined similarly. In most cases the algorithm is one pass only if the necessary memory is available.

  • In general, the cost of a one-pass algorithm is PAGES(R) + PAGES(S) if R and S are being queried, again assuming the necessary memory is available.
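  • The set-intersection pseudocode above can be sketched as follows (Y is modeled as a set):

```python
# One-pass set intersection (sketch): R, with duplicates removed, is kept in
# the Y region; S streams through one block at a time.
def one_pass_set_intersection(r_blocks, s_blocks):
    Y = set()
    for block in r_blocks:           # read R, removing duplicates
        Y.update(block)
    out = []
    for block in s_blocks:           # read S one block at a time
        for t in block:
            if t in Y:               # output and remove, so each value
                out.append(t)        # appears at most once in the result
                Y.discard(t)
    return out

assert one_pass_set_intersection([[1, 2], [3]], [[2, 2], [3, 5]]) == [2, 3]
```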

External Sorting#

  • A large number of operators can be executed by an intermediate sorting step:

    • DISTINCT

    • ORDER BY

    • GROUP BY

    • UNION/INTERSECTION/DIFFERENCE

  • A limited amount of memory is available to the sort operation

    M: the number of memory pages available for the sort operation

    PAGES(R): total number of disk pages for relation R

  • If PAGES(R) <= M, then the relation can be sorted in one pass: read the relation into memory and apply any sorting algorithm. The cost is PAGES(R) pages.

Sort based duplicate removal#

  • When a “distinct” projection is needed, we can do the following:

    Sort the relation
    Read the relation in sorted order
    For each tuple:
       If it is already seen
          discard
       Else:
          output
    
  • Only the last seen tuple needs to be kept in memory, so 1 page is sufficient for this step.

  • It is possible to combine sort and duplicate removal

    Read the relation in groups, remove all unwanted attributes
    
    Sort the tuples in memory, remove duplicates (distinct) and write
    the sorted group to disk
    
    Read the sorted group for k-way-merge, during merge remove any
    additional duplicates
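
  • The combined sort and duplicate removal can be sketched in Python, using heapq.merge for the k-way merge step:

```python
import heapq

# Sketch: sorted runs are merged with a k-way merge, and duplicates are
# dropped by comparing each tuple against the last one output.
def merge_distinct(sorted_runs):
    last = object()                       # sentinel: matches nothing
    for t in heapq.merge(*sorted_runs):   # k-way merge of the sorted runs
        if t != last:                     # only keep the first of each run
            yield t
            last = t

runs = [sorted([3, 1, 3]), sorted([2, 1])]   # runs sorted in memory first
assert list(merge_distinct(runs)) == [1, 2, 3]
```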
    

Sort based projection#

  • Cost is similar to external sort, but the relation being read in the second stage is reduced in size by removing unnecessary attributes

  • Tuples are smaller (how many attributes are removed?)

  • Duplicates are removed (how many duplicates are there?)

Hash based projection#

  • To compute “distinct R.A, R.B,…”:

    Read all tuples in R and extract the projected attributes

    Hash the remaining attributes to buckets 1 … M in memory (and continuously remove duplicates from buckets in memory)

    Whenever a bucket in memory is full, write it to disk

    For any bucket that takes up more than one disk page, read it back from disk to memory and remove duplicates within the bucket.

  • Example: 2 pages for hashing and 1 page for processing

../../_images/Hashing14.png
  • Read and put values into the two buckets

../../_images/Hashing24.png
  • If a bucket fills up before hashing is complete, it needs to be written to disk.

../../_images/Hashing34.png
  • Once hashing is complete, different operations can be mapped to different buckets and applied independently in each bucket.

../../_images/Hashing44.png

Cost of hash based projection#

  • The cost:

    The relation has to be read once for hashing

    If the buckets after reduction are too large to fit in a single memory block, then the relation will be written to disk once

    If all the disk pages in a single bucket will fit in the M available blocks, then the last step can be performed in one read.
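  • A sketch of hash-based duplicate removal for a projection; duplicate tuples can only collide within the same bucket, so each bucket is deduplicated independently:

```python
# Hash-based duplicate removal (sketch): tuples are hashed into a small
# number of buckets; duplicates always land in the same bucket.
def hash_distinct(tuples, n_buckets=2):
    buckets = [[] for _ in range(n_buckets)]
    for t in tuples:                        # one read of the relation
        b = hash(t) % n_buckets
        if t not in buckets[b]:             # continuous in-memory dedup
            buckets[b].append(t)
    # each bucket is then processed independently (one read back if spilled)
    return [t for b in buckets for t in b]

out = hash_distinct([('a',), ('b',), ('a',), ('c',)])
assert sorted(out) == [('a',), ('b',), ('c',)]
```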

Set operations#

  • To compute A UNION B (with duplicate removal), we first hash both A and B together and remove duplicates in each bucket separately.

  • To compute A - B, we hash A and B into the same buckets and then compute A - B in each bucket separately.

Index nested loop join#

  • Index nested loop join assumes a look-up of matching tuples in S using an index.

    Given R join S on R.A=S.A

    Read R one block at a time
       For each tuple r of R
          Use the index on S(A) to find all matching tuples
          Read the tuples from disk and join with r
    
  • If R is not sorted on A, then we might end up reading the same tuples of S many times

    • What if A is R’s primary key?

      Then, for each A value, we will look up S only once

  • Cost:

    • The outer relation R is read once, PAGES(R)

    • Assume for every tuple r in R, there are about c tuples in S that join with r (the ideal case is that c is very small or 0 most of the time)

    • Then, for each tuple in R:

      • We find the matching tuples in S (the cost is an index look-up: in the best case one page per level, so h+1 for an index with h levels)

      • Then, if we need attributes that are not in the index, we read c tuples from S (which can be on c pages in the worst case)

    • Total cost is between PAGES(R) + TUPLES(R) * (h+1) (if no tuples match or an index-only scan suffices) and PAGES(R) + TUPLES(R) * (h+1+c)

  • Example:

    SELECT S.B FROM R,S WHERE R.A = S.B
    
    Index I1 on S(B) with 2 levels (root, internal, leaf)
    PAGES(R)=100, TUPLES(R)= 2000
    PAGES(S)=200, TUPLES(S)= 4000
    
    Cost = 100 (for reading R) + 2000*3 (index look up for each tuple of R)
    
    SELECT S.C  FROM R,S WHERE R.A = S.B
    
    Assume statistics are same as above, but we now need to reach
    each matching tuple (at most 4000 tuples will match)
    
    Cost = 100+2000*3+4000 (one page read for each matched tuple)
    

    How could this be, there are only 200 pages of S?

    Well, if we do not first collect all the page numbers we need to read, and instead read each page as we find a match, we may end up reading the same pages of S multiple times.

    Of course, in reality, you will likely do a lot of reduction of duplicate page requests in memory and improve on this. This is the worst case scenario.

    Will we ever do this?

    No; we would choose not to use an index join in this case. It is clearly very expensive, and we are better off with some other operation.

  • Another example:

    SELECT S.B FROM R,S WHERE R.A = S.B AND R.B=100
    
    Index I1 on S(B) with 2 levels (root, internal, leaf)
    PAGES(R)=100, TUPLES(R)= 2000
    PAGES(S)=200, TUPLES(S)= 4000
    Suppose only 3 tuples match R.B=100
    
    So, we can:
       Scan all of R to find these 3 tuples (100 pages)
       Read matching tuples from S (3 look-ups * 3 pages = 9)
       Total cost = 109 (using only 2 pages of memory)
    
       This would cost a lot more in block-nested loop join
       with M=2 (200 pages). So, index join is for cases
       where the outer relation is very small (normally after
       a selection)
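
  • The cost bounds above can be recomputed from the first two examples (a sketch; the numbers come from the statistics given there):

```python
# Index-nested-loop-join cost check, using the example statistics above.
pages_R, tuples_R = 100, 2000
h = 2                                   # 3-level index: each look-up reads h+1 = 3 pages

best = pages_R + tuples_R * (h + 1)     # index-only scan (first example)
assert best == 6_100

matched = 4000                          # at most 4,000 S tuples match (second example)
worst = best + matched                  # one page read per matched tuple, worst case
assert worst == 10_100
```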
    

Sort-merge join#

  • Sort both R and S first

  • Read R and S one block at a time, and join the matching tuples.

  • Sort merge is similar to Step 2 of the sorting algorithm.


  • Example:

    R: [1, 2] [5, 8]    S:[1,5] [6, 7]
    Read [1,2] and [1,5] first, join 1.
    Read the next block of R, [5,8] [1,5]. Join 5
    Read the next block of S, [5,8] [6,7]. No more tuples, done.
    
  • If the joining attribute has unique values in both R and S, then the join can be performed in a single forward pass, reading each relation only once: PAGES(R) + PAGES(S) + the cost of sorting

  • If there are duplicate values, then we must worry about whether all the tuples with the same value from both relations will fit in memory

  • Example:

    R [1,2] [2,2] [2,2] [2,2] [2,2] [2,2] [2,2] [2,2] [2,2][2,3]
    S [1,1] [1,1] [1,1][2,2] [2,2] [2,2] [2,2] [2,2][3,4]
    
  • We need a total of 15 buffer pages to be able to compute this join in one pass.

  • In general, a block-nested loop join is performed for duplicate values.
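  • The merge step of the sort-merge join can be sketched in Python, assuming unique join values so each input is read forward exactly once:

```python
# Merge step of sort-merge join (sketch); R and S are already sorted on the
# join attribute, and each join value is assumed unique in each input.
def merge_join(R, S):
    i = j = 0
    out = []
    while i < len(R) and j < len(S):
        if R[i] < S[j]:
            i += 1                     # advance the smaller side
        elif R[i] > S[j]:
            j += 1
        else:                          # matching join values: output the pair
            out.append((R[i], S[j]))
            i += 1
            j += 1
    return out

# Matches the example above: values 1 and 5 join.
assert merge_join([1, 2, 5, 8], [1, 5, 6, 7]) == [(1, 1), (5, 5)]
```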

Hash join#

  • Hash both R and S on the joining attribute, read R and S once:

    PAGES(R) + PAGES(S)

  • Each bucket will contain both tuples of R and S

  • Read each bucket into memory to perform the join within the bucket

  • If a bucket cannot fit in memory, then other methods should be used to perform the join operation within the bucket
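  • A sketch of the hash join in Python; here the partitioning into buckets happens in memory (key % n_buckets stands in for a real hash function), whereas a real implementation spills buckets to disk and joins them one at a time:

```python
# Hash join (sketch): partition both inputs on the join attribute, then join
# within each bucket independently.
def hash_join(R, S, n_buckets=4):
    buckets = [([], []) for _ in range(n_buckets)]
    for t in R:                             # one read of R
        buckets[t[0] % n_buckets][0].append(t)
    for t in S:                             # one read of S
        buckets[t[0] % n_buckets][1].append(t)
    out = []
    for r_side, s_side in buckets:          # each bucket joined in isolation
        for r in r_side:
            for s in s_side:
                if r[0] == s[0]:
                    out.append((r, s))
    return out

R = [(1, 'r1'), (2, 'r2')]
S = [(2, 's1'), (3, 's2'), (2, 's3')]
assert hash_join(R, S) == [((2, 'r2'), (2, 's1')), ((2, 'r2'), (2, 's3'))]
```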

Summary#

  • Sorting and hashing are two main methods that can be used to implement other operators.

  • In particular, sorting may help reduce the cost of multiple operations further up the query tree.