Index Scan#

Disk Access Process (Overly Simplifed)#

Remember that to process any data, it must be first brought to memory.
Some DBMS component indicates it wants to read record R
- File Manager:
  
  Does security check
  
  Uses access structures to determine the page it is on
  
  Asks the buffer manager to find that page
- Buffer Manager
  
  Checks to see if the page is already in the buffer
  
  If so, gives the buffer address to the requestor
  
  If not, allocates a buffer frame
  
  Asks the Disk Manager to get the page
- Disk Manager
  
  Determines the physical address(es) of the page
  
  Asks the disk controller to get the appropriate block of data from the physical address
  
  Disk controller instructs disk driver to do the dirty job

Cost and Resources#

An SQL query is translated to a combination of relational algebra operations.
Each operation in the query is given some resources: memory blocks to use for the operation.

We will use M to denote the number of blocks available to a query.
We are interested in the overall cost of a query. We will only consider total number of disk pages read/written to execute the query.

We will use PAGES® to denote the total number of pages in relation R.
Note that CPU complexity of a query is going to be disregarded (i.e. the complexity of the algorithm once the data is in memory) due to the following reasons:
- The cost of reading/writing data is much higher than any operation in memory. So, the donimating cost is disk access.
- We assume each algorithm is implemented as efficiently as possible.
- It may not even be possible to bring all the data for a query in memory at once, which may result in the data being read multiple times.
The cost of a query (total number of pages read and written) depends on how much memory is available (M). We will see how these two are related for each operation.

Access Paths#

Given a query
```
SELECT attrs FROM R WHERE C
```
the following are the possible access paths for this relation
Table scan: Always possible.
Hash index: Possible if the hash index is on a subset of attributes A1,…,An such that all conditions in C for A1,…,An are for equality and are conjunctive (ANDed).
B-tree index with search key A1,…,Am such that a prefix A1,…,An of A1,…,Am have equality conditions in C.
After an index scan, it is necessary to scan the relation if
- C contains conditions on attributes that are not in the index, or
- Projection attributes are not all in the index

Example:

FROM R WHERE R.A=10 AND R.B<10 AND R.C > 20

Can we use B-tree indices on

I1 on R(A,B) ?

I2 on R(B,C) ?

I3 on R(D,A) ?

Selections#

To compute SELECT FROM R WHERE C

[Sequential Scan] Read all of R one block at a time and check C. Only need a single buffer page to read R. Total cost PAGES®.

[Index Scan] Read the index to check all or part of C, find qualifying tuples. Read those tuples from disk and return. Total cost: the cost of index scan + cost of relation scan.
Example:

` SELECT R.C, R.D FROM R WHERE R.A=10 AND R.B=5 AND R.C<30 ; `
- Given index I1 on R.A,R.B such that
  
  I1 has depth 3
  
  I has about k leaf nodes with about k*c tuples in general for any specific R.A,R.B value (I.e. duplicate tuples)
- Finding the tuple identifier of all tuples with R.A=10 AND R.B=5 takes 3 + k disk reads in the average.
- We still need to read the qualifying k*c tuples from disk to check R.C, and return R.C, R.D
- Given k*c tuples with R.A=10 and R.B=5, how many disk pages for R need to be read to find all these tuples?
  
  If the relation is not sorted/clustered with respect to R.A, R.B, then we might end up reading a different page for each single tuple. The worst case then is k*c disk reads.
  
  If the relation is clustered with respect to R.A,R.B and suppose each disk page stores about n tuples of R, then we need to read about k*c/n disk pages.
- Given index I2 on R.A,R.B,R.C such that
  
  I has depth 3
  
  I has about k leaf nodes with about k*c tuples in general for any specific R.A,R.B value (I.e. duplicate tuples)
  
  Find the leaf nodes for R.A=10 AND R.B=5. For these, we scan the siblings left to right until R.C >= 30. In the worst case 3 + k nodes are scanned.
  
  Return all the R.C values from the index, no need to read the relation.

Access Paths#

Given a query

` SELECT attrs FROM R WHERE C `

the following are the possible access paths for this relation
Table scan: Always possible.
Hash index: Possible if the hash index is on a subset of attributes A1,…,An such that all conditions in C for A1,…,An are for equality and are conjunctive (ANDed).
B-tree index with search key A1,…,Am such that a prefix A1,…,An of A1,…,Am have equality conditions in C.
After an index scan, it is necessary to scan the relation if
- C contains conditions on attributes that are not in the index, or
- Projection attributes are not all in the index
Example:

` FROM R WHERE R.A=10 AND R.B<10 AND R.C > 20 `
Can we use B-tree indices on

I1 on R(A,B) ?

I2 on R(B,C) ?

I3 on R(D,A) ?

Complex Conditions#

Given an index can be used to evaluate the selection condition only partially, how can we compute complex conditions?

FROM R WHERE C1 AND C2
Use index to find tuples that satisfy C1, read tuples from disk and check C2 in memory.
Use two indices to find tuples that satisfy C1 and C2 separately, take the intersection of the tuple identifiers.
Given multiple indices to evaluate part of a complex query condition C, which one to choose?
Choose the most selective one.
If the resulting tuples can be reduced significantly by another index, use the next most selective index.
Otherwise, do a table scan.

Index Scan

Contents

Index Scan#

Disk Access Process (Overly Simplifed)#

Cost and Resources#

Access Paths#

Selections#

Access Paths#

Complex Conditions#