Lecture 24 ------------ - No lecture exercise today - HW#7 is due today - Final exam: Wednesday December 14, 3:00PM-6:00PM (West 220) -> PLEASE REPORT ANY CONFLICTS NOW! Query Optimization -------------------------- Given a query, find the fastest implementation of it given the database model, storage and its current contents. Step 1: Parse query Step 2: Look at conditions, remove redundant conditions and check for inconsistencies Step 3: Find alternate logical query plans: join ordering/pushing selections down/adding projections Step 4: Query size estimation: estimate number of tuples matching a specific condition (before executing a query) Step 5: Enumerate all possible physical query plans (operators+implementaiton), estimate its cost and find the cheapest one. Step 4: Size estimation ------------------------------------ SEL(C) = the proportion of tuples that will pass the condition C Given: T = SELECT_{C} (R) TUPLES(T) = TUPLES(R) * SEL(C) T2 = R JOIN_{C} S TUPLES(T2) = TUPLES(R) * TUPLES(S) * SEL(C) DISTINCT(R.A): number of distinct R.A values currently stored MINVAL(R.A), MAXVAL(R.A): min and max values for R.A SEL(R.A=c) = 1/DISTINCT(R.A) SEL(R.A>=c1 AND R.A<=c2) = (c2-c1)/(MAXVAL(R.A)-MINVAL(R.A)) SEL(R.A=S.B) = 1/{MAX(DISTINCT(R.A), DISTINCT(S.B))} SEL(C1 AND C2) = SEL(C1) * SEL(C2) SEL (NOT C1) = 1-SEL(C1) SEL(C1 OR C2) = 1 - ((1-SEL(C1)) * (1-SEL(C2))) ----------------- Join ordering Index selectivity: ------------------- Given an index, how much does it reduce the number of truples that need to be retrieved for a query? R(A,B,C,K,L,M) 10,000,000 tuples Distinct(R.A)= 200 Distinct(R.B)=5,000 MinVal(R.B)=500 MaxVal(R.B)=10,500 Distinct(R.C)=6,000 Distinct(R.D)=500 MinVal(R.B)=100 MaxVal(R.B)=2100 Index I1 on R.A, R.A=10 Sel(R.A=10) = 1/200 Index I2 on R.B, R.B=400 Sel(R.b=400) = 1/5000 Index I3 on R.C, R.C='ABC' Sel(R.C='ABC') = 1/6000 Index I4 on R.D, R.D=400 Sel(R.D=400) = 1/500 1. Find most selective indexes 2. Find less selective indexes that have the attributes needed for a query SELECT F FROM R WHERE R.A = 200 AND R.E<= 500; F,A,E -> scan all leaf E,A,F -> Scan for R.E<=500 E,F,A -> Scan for R.E<=500 A,E,F -> best index (scan leaf from R.A=200 and min (R.E) up to R.A=200 and R.E>500) ----------------- TRANSACTION (XACT) PROCESSING ================================ Concurrency control Abstract sequence of operations in a database as read/write by transactions of data items Schedule S1: r1(X) r2(X) w1(X) w2(X) Schedule S2: r1(X) w1(X) r2(X) w2(X) Schedule S3: r2(X) w2(X) r1(X) w1(X) Schedule S4: r2(X) r1(X) w2(X) w1(X) A serial schedule is assumed to be always correct (S2 and S3 above are both serial!) A schedule is serializable if its final results are guaranteed to be equal to a serial schedule. If a xact in a schedule always reads the same values and writes values in the same order as a serial schedule, then it is serializable. S1: r1(x) r2(y) w1(x) r2(z) w2(y) r1(y) w1(y) S2: r2(y) r2(z) w2(y) r1(x) w1(x) r1(y) w1(y)