Lecture 23#

Announcements#

Last 4 lectures, you can expect a few more lecture exercises and at least one more optional lecture exercise
LE23 on at 2pm today, due wednesday midnight
Homework #6 is due next week monday
Grade database is up with all your grades up to now.
One of our office hours moved. Tuesday office hour from 11AM-12:50AM will now be at Amos Eaton 127.

Today’s lecture#

Size estimation and index scan revisited
Block nested loop join
External sort
Query processing, query trees and cost estimation

SELECT D FROM R WHERE R.A=10 and R.C like ‘xy%’ and R.B>=200 and R.B<500;

Which condition do you scan the index with?
Which conditions can you check with the index?
How many tuples do you need to read from disk after scanning the index?

Suppose you are using an index I1 on R(B,A) that stores 600 tuples per leaf node.

Which condition do you scan the index with?
- R.B>=200 and R.B<500
- 5000/600+1 (leaf) + 2 (internal)
Which conditions can you check with the index?
- R.B>=200 and R.B<500 AND R.A=10
How many tuples do you need to read from disk after scanning the index?
- 25 tuples
Total cost = 25+5000/600+1 (leaf) + 2 (internal)

Suppose you are using an index I2 on R(C,B) that stores 800 tuples per leaf node.

Which condition do you scan the index with?
- R.C like ‘xy%’: 800/800
Which conditions can you check with the index?
- R.C like ‘xy%’ and R.B>=200 and R.B<500
How many tuples do you need to read from disk after scanning the index?
- 400 tuples
Total cost = 400+2+2

Suppose you are using an index I3 on R(A,C,B,D) that stores 50 tuples per leaf node.

Which condition do you scan the index with?
- R.A=10 and R.C like ‘xy%’: 250/50
Which conditions can you check with the index?
- All
How many tuples do you need to read from disk after scanning the index?
- 0 (index only scan)
Total cost = 0 + 6+2 = 8

Query size estimation#

Common statistics:

TUPLES®: the number of tuples in R
PAGES®: the number of pages R spans
N_DISTINCT(R.A): the number of distinct values stored for R.A
MINVAL(R.A)/MAXVAL(R.A): min/max values stored for R.A

Selectivity:

Given a condition C, Sel© is the percentage of tuples that are expected to satisfy the condition.

Low Sel© means high selectivity.

Expected number of tuples

Given: Q1: select_© R
Exp(Q1) = TUPLES® * Sel©

Given Q2: R join_© S

Exp(Q2) = TUPLES® * TUPLES(S) * Sel©

Selectivity computation#

Sel(R.A=c)= 1/NDISTINCT(R.A)
- Assuming each R.A value is equally likely

Students(RIN,Class,MainMajor,Age)

TUPLES(Students) = 5,000
NDISTINCT(Class) = 4
NDISTINCT(RIN) = 5,000
NDISTINCT(MainMajor) = 50
NDISTINCT(Age) = 25
MinVal(R.Age)= 17
MaxVal(R.Age)= 50

Transcript(RIN, CRN, Grade)
TUPLES(Transcript) = 80,000
NDISTINCT(RIN) = 4,000

select count(*) from students where class = 2029;

Sel(Class=2029) = 1/4
Sel(RIN=12345566) = 1/5000
Sel(Major=‘DIS’) = 1/50
Sel(Major=‘CSCI’) = 1/50

Sel(R.A=S.B) = 1 /(max{NDISTINCT(R.A), NDISTINCT(S.B))

Choose more selective of:
R.A = c, 1/NDISTINCT(R.A)
S.B = c, 1/NDISTINCT(S.B)

select *
from students s, transcript t
where s.rin = t.rin

Sel(s.rin = t.rin) = 1/max(5000,4000) = 1/5000
Tuples(s join t) = 5,000 * 80,000 * 1/5,000 = 80,000
Sel(c1<= R.A <= c2) = (c2-c1)/(maxval(R.A)-minval(R.A)
Sel(18<=Students.Age<=22) = (22-18)/(50-17)
Sel(C1 AND C2) = Sel(C1) AND Sel(C2)
- Assumes C1 and C2 are independent
Sel(C1 OR C2) = Sel( NOT ((NOT C1) AND (NOT C2)) )

Index selectivity#

What is the selectivity of the condition that can be checked with the index?
- After scanning the index, how many tuples do you have to read?
The most selective index is the one on the primary key
Then, find the next single attribute index that is most selective
Otherwise, create an index on equality values (For Q2 on B and C)
If you want to create an index on many attributes, to help with index only scans
- The ordering is important
- The best multi-attribute index for Q1: Q1-best on R(A,C,B,D)
- The best multi-attribute index for Q2: Q2-best on R(B,C,A,F)

Q1: SELECT D FROM R WHERE R.A=10 and R.C like ‘xy%’ and R.B>=200 and R.B<500;

Q2: SELECT F FROM R WHERE R.B=100 AND R.C=100 AND R.A>25

Indices:

I1: R(A,B)
- Q1: R.A=10 and R.B>=200 and R.B<500;
- Sel = 1/NDISTINCT(R.A) * (500-200)/(MAXVAL(R.B)-MINVAL(R.B))
- Q2: R.B=100 AND R.A>25
- Sel = 1/NDISTINCT(R.B) * (MAXVAL(R.A)-25)/(MAXVAL(R.A)-MINVAL(R.A))
I2: R(B,C)
I3: R(C,D,A)
I4: R(A,B,C,D)

BLOCK NESTED LOOP JOIN#

PAGES® = 2,000 PAGES(S) = 4,000

Suppose I have M=6000 blocks, Read R once and Read S once, and join in memory
- Cost of reads = 6,000
Suppose I have M=2500 blocks
- Read R fully, and S into 500 blocks at a time
- Cost = 6,000
Suppose I have M=2,001 blocks
- Read R fully, and S into 1 block at a time
- Cost = 6,000
Suppose I have M=1,001 blocks
- Read 1000 blocks of R, and read S 1 block at a time
- Repeat with the next 1000 blocks of R
- Read R once, and S 2 times
- Cost = 2,000 + 2* 4,000 = 10,000
Given M blocks:
- Cost (R JOIN S) = PAGES® + ceil( PAGES®/(M-1)) * PAGES(S)
- Cost(R JOIN S): M= 1,001, 2000+4000*2 = 10K
- Cost(S JOIN R): M=1,001, 4000 + 4* 2000 = 12K

Lecture 23