Lecture 23
-------------

- Hw7 due on thursday
- Lecture 23 exercise today

--------------------
Query Optimization


SELECT
    R.C
    ,S.E
FROM
    R
    ,S
WHERE
    R.A=S.B
    AND R.C>= 10
ORDER BY
    R.C
    , S.E;

Query Optimization
--------------------------
Given a query, find the fastest implementation of it
given the database model, storage and its current contents.

Step 1: Parse query

Step 2: Look at conditions, remove redundant conditions and
check for inconsistencies

Step 3: Find alternate logical query plans: join ordering/pushing
selections down/adding projections

Step 4: Query size estimation: estimate number of tuples matching a
specific condition (before executing a query)

Step 5: Enumerate all possible physical query plans
(operators+implementaiton), estimate its cost and find the cheapest
one.


Qopt step 2: 
------------------------------------

Look at conditions, remove redundant conditions and
check for inconsistencies

A>=10 and A>5   ===>A>=10

A>=10 and A<5   ===> 0 tuples

Qopt step 3: 
------------------------------------

Find alternate logical query plans: join ordering/pushing selections
down/adding projectins

(join ordering)

R join S = S join R

(R join S) join T = R join (S join T)

pushing selectins down  (reduce costs and help find applicable indices):

(SELECT_C (R) join S) = SELECT_C (R join S)

project_R.A,R.B( R join_{R.C=S.D} S) =

     project_R.A,R.B( project_{R.A,R.B,R.C}(R_ join_{R.C=S.D} project_{S.D}S) = 


Step 4: Size estimation
------------------------------------

Selectivity(C) = the proportion of tuples that will pass the condition C

Given:

T = SELECT_{C} (R)

TUPLES(T) = TUPLES(R) * SEL(C)

T2 = R JOIN_{C} S  = SELECT_{C} (R x S)

TUPLES(T2) = TUPLES(R) * TUPLES(S) * SEL(C)

Keep basic statistics for each relation R

TUPLES(R)

For each attribute R.A:

DISTINCT(R.A) -> number of distinct values currently stored
MINVAL(R.A) MAXVAL(R.A) for R.A

SEL(R.A=c) = 1/DISTINCT(R.A)

SEL(R.A>=c1 AND R.A<=c2) =  (c2-c1)/(MAXVAL(R.A)-MINVAL(R.A))

SEL(R.A=S.B) = 1/{MAX(DISTINCT(R.A), DISTINCT(S.B))}


SEL(C1 AND C2) = SEL(C1) * SEL(C2)

SEL (NOT C1) = 1-SEL(C1)

SEL(C1 OR C2) = SEL( NOT(NOT C1 AND NOT C2) )
              = 1 - ((1-SEL(C1)) * (1-SEL(C2)))

--------------------------


Example

Students(RIN, Email, Class, Year, Age)

Tuples(Students) = 6,000
DISTINCT(RIN) = 6,000
DISTINCT(Email) = 6,000
DISTINCT(Class) = 4
DISTINCT(Age) = 9
MINVAL(Age) = 17
MAXVAL(Age) = 30


SEL(Class='Senior') = 1/4

TUPLES(SELECT_{Class='Senior'}) = 6,000 * 1/4 = 1,500


SEL(Age=18) = 1/9
SEL(Age=30) = 1/9

SEL(Age<>18) = 1 - 1/9


SEL(Age>=18 AND Age<=21) = 4/(30-17)


Students(RIN, Email, Class, Year, Age)

Tuples(Students) = 6,000
DISTINCT(RIN) = 6,000


Transcript(RIN, CRN, Grade)

Tuples(Transcript)= 60,000
DISTINCT(RIN) = 5,500
DISTINCT(CRN)=400


T = SELECT * FROM Students S, Transcript T WHERE S.RIN=T.RIN

Sel (S.RIN=T.RIN) = 1/MAX(6000, 5500) = 1/6,000

TUPLES(T) = 6,000 * 60,000 * 1/6,000 = 60,000


SEL(Class='Senior' AND Age=18) = 1/4 * 1/9