Lecture 23#

Announcements#

  • Last 4 lectures, you can expect a few more lecture exercises and at least one more optional lecture exercise

  • LE23 on at 2pm today, due wednesday midnight

  • Homework #6 is due next week monday

  • Grade database is up with all your grades up to now.

  • One of our office hours moved. Tuesday office hour from 11AM-12:50AM will now be at Amos Eaton 127.

Today’s lecture#

  • Size estimation and index scan revisited

  • Block nested loop join

  • External sort

  • Query processing, query trees and cost estimation

SELECT D FROM R WHERE R.A=10 and R.C like ‘xy%’ and R.B>=200 and R.B<500;

  1. Which condition do you scan the index with?

  2. Which conditions can you check with the index?

  3. How many tuples do you need to read from disk after scanning the index?

Suppose you are using an index I1 on R(B,A) that stores 600 tuples per leaf node.

  1. Which condition do you scan the index with?

    • R.B>=200 and R.B<500

    • 5000/600+1 (leaf) + 2 (internal)

  2. Which conditions can you check with the index?

    • R.B>=200 and R.B<500 AND R.A=10

  3. How many tuples do you need to read from disk after scanning the index?

    • 25 tuples

  4. Total cost = 25+5000/600+1 (leaf) + 2 (internal)

Suppose you are using an index I2 on R(C,B) that stores 800 tuples per leaf node.

  1. Which condition do you scan the index with?

    • R.C like ‘xy%’: 800/800

  2. Which conditions can you check with the index?

    • R.C like ‘xy%’ and R.B>=200 and R.B<500

  3. How many tuples do you need to read from disk after scanning the index?

    • 400 tuples

  4. Total cost = 400+2+2

Suppose you are using an index I3 on R(A,C,B,D) that stores 50 tuples per leaf node.

  1. Which condition do you scan the index with?

    • R.A=10 and R.C like ‘xy%’: 250/50

  2. Which conditions can you check with the index?

    • All

  3. How many tuples do you need to read from disk after scanning the index?

    • 0 (index only scan)

  4. Total cost = 0 + 6+2 = 8

Query size estimation#

Common statistics:

  • TUPLES®: the number of tuples in R

  • PAGES®: the number of pages R spans

  • N_DISTINCT(R.A): the number of distinct values stored for R.A

  • MINVAL(R.A)/MAXVAL(R.A): min/max values stored for R.A

Selectivity:

Given a condition C, Sel© is the percentage of tuples that are expected to satisfy the condition.

  • Low Sel© means high selectivity.

Expected number of tuples

Given: Q1: select_© R
Exp(Q1) = TUPLES® * Sel©

Given Q2: R join_© S

Exp(Q2) = TUPLES® * TUPLES(S) * Sel©

Selectivity computation#

  • Sel(R.A=c)= 1/NDISTINCT(R.A)

    • Assuming each R.A value is equally likely

Students(RIN,Class,MainMajor,Age)

TUPLES(Students) = 5,000
NDISTINCT(Class) = 4
NDISTINCT(RIN) = 5,000
NDISTINCT(MainMajor) = 50
NDISTINCT(Age) = 25
MinVal(R.Age)= 17
MaxVal(R.Age)= 50

Transcript(RIN, CRN, Grade)
TUPLES(Transcript) = 80,000
NDISTINCT(RIN) = 4,000

select count(*) from students where class = 2029;

Sel(Class=2029) = 1/4
Sel(RIN=12345566) = 1/5000
Sel(Major=‘DIS’) = 1/50
Sel(Major=‘CSCI’) = 1/50

  • Sel(R.A=S.B) = 1 /(max{NDISTINCT(R.A), NDISTINCT(S.B))

Choose more selective of:
R.A = c, 1/NDISTINCT(R.A)
S.B = c, 1/NDISTINCT(S.B)

select *
from students s, transcript t
where s.rin = t.rin

  • Sel(s.rin = t.rin) = 1/max(5000,4000) = 1/5000

  • Tuples(s join t) = 5,000 * 80,000 * 1/5,000 = 80,000

  • Sel(c1<= R.A <= c2) = (c2-c1)/(maxval(R.A)-minval(R.A)

  • Sel(18<=Students.Age<=22) = (22-18)/(50-17)

  • Sel(C1 AND C2) = Sel(C1) AND Sel(C2)

    • Assumes C1 and C2 are independent

  • Sel(NOT C) = 1- Sel©

  • Sel(C1 OR C2) = Sel( NOT ((NOT C1) AND (NOT C2)) )

Index selectivity#

  • What is the selectivity of the condition that can be checked with the index?

    • After scanning the index, how many tuples do you have to read?

  • The most selective index is the one on the primary key

  • Then, find the next single attribute index that is most selective

  • Otherwise, create an index on equality values (For Q2 on B and C)

  • If you want to create an index on many attributes, to help with index only scans

    • The ordering is important

    • The best multi-attribute index for Q1: Q1-best on R(A,C,B,D)

    • The best multi-attribute index for Q2: Q2-best on R(B,C,A,F)

Q1: SELECT D FROM R WHERE R.A=10 and R.C like ‘xy%’ and R.B>=200 and R.B<500;

Q2: SELECT F FROM R WHERE R.B=100 AND R.C=100 AND R.A>25

Indices:

  • I1: R(A,B)

    • Q1: R.A=10 and R.B>=200 and R.B<500;

    • Sel = 1/NDISTINCT(R.A) * (500-200)/(MAXVAL(R.B)-MINVAL(R.B))

    • Q2: R.B=100 AND R.A>25

    • Sel = 1/NDISTINCT(R.B) * (MAXVAL(R.A)-25)/(MAXVAL(R.A)-MINVAL(R.A))

  • I2: R(B,C)

  • I3: R(C,D,A)

  • I4: R(A,B,C,D)

BLOCK NESTED LOOP JOIN#

PAGES® = 2,000 PAGES(S) = 4,000

  • Suppose I have M=6000 blocks, Read R once and Read S once, and join in memory

    • Cost of reads = 6,000

  • Suppose I have M=2500 blocks

    • Read R fully, and S into 500 blocks at a time

    • Cost = 6,000

  • Suppose I have M=2,001 blocks

    • Read R fully, and S into 1 block at a time

    • Cost = 6,000

  • Suppose I have M=1,001 blocks

    • Read 1000 blocks of R, and read S 1 block at a time

    • Repeat with the next 1000 blocks of R

    • Read R once, and S 2 times

    • Cost = 2,000 + 2* 4,000 = 10,000

  • Given M blocks:

    • Cost (R JOIN S) = PAGES® + ceil( PAGES®/(M-1)) * PAGES(S)

    • Cost(R JOIN S): M= 1,001, 2000+4000*2 = 10K

    • Cost(S JOIN R): M=1,001, 4000 + 4* 2000 = 12K