Lecture 23 ------------- - Hw7 due on thursday - Lecture 23 exercise today -------------------- Query Optimization SELECT R.C ,S.E FROM R ,S WHERE R.A=S.B AND R.C>= 10 ORDER BY R.C , S.E; Query Optimization -------------------------- Given a query, find the fastest implementation of it given the database model, storage and its current contents. Step 1: Parse query Step 2: Look at conditions, remove redundant conditions and check for inconsistencies Step 3: Find alternate logical query plans: join ordering/pushing selections down/adding projections Step 4: Query size estimation: estimate number of tuples matching a specific condition (before executing a query) Step 5: Enumerate all possible physical query plans (operators+implementaiton), estimate its cost and find the cheapest one. Qopt step 2: ------------------------------------ Look at conditions, remove redundant conditions and check for inconsistencies A>=10 and A>5 ===>A>=10 A>=10 and A<5 ===> 0 tuples Qopt step 3: ------------------------------------ Find alternate logical query plans: join ordering/pushing selections down/adding projectins (join ordering) R join S = S join R (R join S) join T = R join (S join T) pushing selectins down (reduce costs and help find applicable indices): (SELECT_C (R) join S) = SELECT_C (R join S) project_R.A,R.B( R join_{R.C=S.D} S) = project_R.A,R.B( project_{R.A,R.B,R.C}(R_ join_{R.C=S.D} project_{S.D}S) = Step 4: Size estimation ------------------------------------ Selectivity(C) = the proportion of tuples that will pass the condition C Given: T = SELECT_{C} (R) TUPLES(T) = TUPLES(R) * SEL(C) T2 = R JOIN_{C} S = SELECT_{C} (R x S) TUPLES(T2) = TUPLES(R) * TUPLES(S) * SEL(C) Keep basic statistics for each relation R TUPLES(R) For each attribute R.A: DISTINCT(R.A) -> number of distinct values currently stored MINVAL(R.A) MAXVAL(R.A) for R.A SEL(R.A=c) = 1/DISTINCT(R.A) SEL(R.A>=c1 AND R.A<=c2) = (c2-c1)/(MAXVAL(R.A)-MINVAL(R.A)) SEL(R.A=S.B) = 1/{MAX(DISTINCT(R.A), DISTINCT(S.B))} SEL(C1 AND C2) = SEL(C1) * SEL(C2) SEL (NOT C1) = 1-SEL(C1) SEL(C1 OR C2) = SEL( NOT(NOT C1 AND NOT C2) ) = 1 - ((1-SEL(C1)) * (1-SEL(C2))) -------------------------- Example Students(RIN, Email, Class, Year, Age) Tuples(Students) = 6,000 DISTINCT(RIN) = 6,000 DISTINCT(Email) = 6,000 DISTINCT(Class) = 4 DISTINCT(Age) = 9 MINVAL(Age) = 17 MAXVAL(Age) = 30 SEL(Class='Senior') = 1/4 TUPLES(SELECT_{Class='Senior'}) = 6,000 * 1/4 = 1,500 SEL(Age=18) = 1/9 SEL(Age=30) = 1/9 SEL(Age<>18) = 1 - 1/9 SEL(Age>=18 AND Age<=21) = 4/(30-17) Students(RIN, Email, Class, Year, Age) Tuples(Students) = 6,000 DISTINCT(RIN) = 6,000 Transcript(RIN, CRN, Grade) Tuples(Transcript)= 60,000 DISTINCT(RIN) = 5,500 DISTINCT(CRN)=400 T = SELECT * FROM Students S, Transcript T WHERE S.RIN=T.RIN Sel (S.RIN=T.RIN) = 1/MAX(6000, 5500) = 1/6,000 TUPLES(T) = 6,000 * 60,000 * 1/6,000 = 60,000 SEL(Class='Senior' AND Age=18) = 1/4 * 1/9