Lecture 24 ------------ - Last lecture exercise today, due on wed - HW#8 is due on thursday, it is worth 3 points. - Final exam: Wednesday December 14, 3:00PM-6:00PM (West 220) -> PLEASE REPORT ANY CONFLICTS NOW! - Grade database is in beta release select * from mygrades; select * from mytotal; -------------- Concurrency Control Starting X value 500 (suppose T1 adds 10,000 to X and T2 subtracts 400) s1: r1(x) r2(x) w1(x) w2(x) -> x value is 100 s2: r1(x) w1(x) r2(x) w2(x) -> x value is 10,100 s3: r2(x) w2(x) r1(x) w1(x) -> x value is 10,100 Starting X value 500 (suppose T1 adds 10,000 to X and T2 subtracts 1,000 if X>=1000) s2: r1(x) w1(x) r2(x) w2(x) -> x value is 9,500 s3: r2(x) w2(x) r1(x) w1(x) -> x value is 10,500 Correct execution of transactions: a serial schedule A schedule S1 is serializable if it is guaranteed to be equal to a serial schedule S2: i.e. 1. all reads in S1 are guaratneed to be the same as the reads in S2 2. all writes in S1 occur in the same order as S2 A conflict in a schedule is two operations on the same data item by two different transactions, and one of them is a write operation. r1(x) w2(x) w1(x) r2(x) w1(x) w2(x) S1: r1(x) w2(x) S2: w2(x) r1(x) r1(X) is not the same in S1 and S2 S1: w1(x) w2(x) S2: w2(x) w1(x) The last value of X is not the same. ------------------------ A Query Optimization -------------------------- Given a query, find the fastest implementation of it given the database model, storage and its current contents. Step 1: Parse query Step 2: Look at conditions, remove redundant conditions and check for inconsistencies Step 3: Find alternate logical query plans: join ordering/pushing selections down/adding projections Step 4: Query size estimation: estimate number of tuples matching a specific condition (before executing a query) Step 5: Enumerate all possible physical query plans (operators+implementaiton), estimate its cost and find the cheapest one. Step 4: Size estimation ------------------------------------ SEL(C) = the proportion of tuples that will pass the condition C Given: T = SELECT_{C} (R) TUPLES(T) = TUPLES(R) * SEL(C) T2 = R JOIN_{C} S TUPLES(T2) = TUPLES(R) * TUPLES(S) * SEL(C) DISTINCT(R.A): number of distinct R.A values currently stored MINVAL(R.A), MAXVAL(R.A): min and max values for R.A SEL(R.A=c) = 1/DISTINCT(R.A) SEL(R.A>=c1 AND R.A<=c2) = (c2-c1)/(MAXVAL(R.A)-MINVAL(R.A)) SEL(R.A=S.B) = 1/{MAX(DISTINCT(R.A), DISTINCT(S.B))} SEL(C1 AND C2) = SEL(C1) * SEL(C2) SEL (NOT C1) = 1-SEL(C1) SEL(C1 OR C2) = 1 - ((1-SEL(C1)) * (1-SEL(C2))) ----------------- Join ordering Index selectivity: ------------------- Given an index, how much does it reduce the number of truples that need to be retrieved for a query? R(A,B,C,K,L,M) 10,000,000 tuples Distinct(R.A)= 200 Distinct(R.B)=5,000 MinVal(R.B)=500 MaxVal(R.B)=10,500 Distinct(R.C)=6,000 Distinct(R.D)=500 MinVal(R.B)=100 MaxVal(R.B)=2100 Index I1 on R.A, R.A=10 Sel(R.A=10) = 1/200 Index I2 on R.B, R.B=400 Sel(R.b=400) = 1/5000 Index I3 on R.C, R.C='ABC' Sel(R.C='ABC') = 1/6000 Index I4 on R.D, R.D=400 Sel(R.D=400) = 1/500 1. Find most selective indexes 2. Find less selective indexes that have the attributes needed for a query SELECT F FROM R WHERE R.A = 200 AND R.E<= 500; F,A,E -> scan all leaf E,A,F -> Scan for R.E<=500 E,F,A -> Scan for R.E<=500 A,E,F -> best index (scan leaf from R.A=200 and min (R.E) up to R.A=200 and R.E>500) ----------------- TRANSACTION (XACT) PROCESSING ================================ Concurrency control Abstract sequence of operations in a database as read/write by transactions of data items Schedule S1: r1(X) r2(X) w1(X) w2(X) Schedule S2: r1(X) w1(X) r2(X) w2(X) Schedule S3: r2(X) w2(X) r1(X) w1(X) Schedule S4: r2(X) r1(X) w2(X) w1(X) A serial schedule is assumed to be always correct (S2 and S3 above are both serial!) A schedule is serializable if its final results are guaranteed to be equal to a serial schedule. If a xact in a schedule always reads the same values and writes values in the same order as a serial schedule, then it is serializable. S1: r1(x) r2(y) w1(x) r2(z) w2(y) r1(y) w1(y) S2: r2(y) r2(z) w2(y) r1(x) w1(x) r1(y) w1(y) A schedule S1 is serializable if it is guaranteed to be equal to a serial schedule S2: i.e. all conflicts in S1 and S2 occur in the same order. Conflict graph: Given a schedule S1, draw the conflict graph for S1 such that there is a node for each transaction in S1, and for each conflict ri(X) wj(x) wi(x) wj(x) wi(x) rj(x) there is a link from Ti to Tj. If the resulting graph has a cycle, then the schedule is not serializable. If there is no cycle, then the schedule is serializable. You can find a potential serial order by conducting topological sort on the conflict graph. Transaction management: how to ensure all schedules in the database are serializable? Two phase locking protocol ---------------------------- 1. All transactions must obtain a lock on an item before read/write. To read, get a read (shared) lock on X To write, get a write(exclusive) lock on X (either get the lock straight or upgrade your shared lock if no one else is holding a lock without releasing) Multiple transactions can hold a read lock on the same item Only one transaction can hold a write lock on an item If a transaction requests a lock and the lock is not available, then the transaction must wait until lock becomes available. Strict Two Phase Locking Protocol: ------------------------------------- Get all the locks you need and hold them all until you commit, then release all locks. Two Phase Locking Protocol: ------------------------------------- Get all the locks you need to be able to read or write. You can release locks if you do not need them anymore before commit, but you cannot get any new locks after releasing a lock. Two phases: - Growing phase: transactions can get new locks only - Shrinking phase: transactions can release locks only Two phase locking guarantees serializability. Hence if I am using 2PL, I cannot have cycles in my conflict graph. Deadlocks are possible ------------------------- - Periodically check for deadlocks, create waits-for-graph. If there is a cycle, then there is a deadlock. - Avoid deadlocks by an additional mechanism: give timestamps and rollback/restart if an older transaction holding a lock you need