Lecture 24
------------

- Last lecture exercise today, due on wed 
- HW#8 is due on thursday, it is worth 3 points.

- Final exam: Wednesday December 14, 3:00PM-6:00PM (West 220)
  -> PLEASE REPORT ANY CONFLICTS NOW!

- Grade database is in beta release

select * from mygrades;
select * from mytotal;

--------------

Concurrency Control

Starting X value 500 (suppose T1 adds 10,000 to X and T2 subtracts 400)

s1: r1(x) r2(x) w1(x) w2(x)  -> x value is 100
s2: r1(x) w1(x) r2(x) w2(x)  -> x value is 10,100
s3: r2(x) w2(x) r1(x) w1(x)  -> x value is 10,100


Starting X value 500 (suppose T1 adds 10,000 to X and T2 subtracts
1,000 if X>=1000)

s2: r1(x) w1(x) r2(x) w2(x)  -> x value is 9,500
s3: r2(x) w2(x) r1(x) w1(x)  -> x value is 10,500

Correct execution of transactions: a serial schedule

A schedule S1 is serializable if it is guaranteed to be equal to a serial
schedule S2: i.e. 
    1. all reads in S1 are guaratneed to be the same as the reads in S2
    2. all writes in S1 occur in the same order as S2


A conflict in a schedule is two operations on the same data item by
two different transactions, and one of them is a write operation.

r1(x) w2(x)
w1(x) r2(x)
w1(x) w2(x)


S1: r1(x) w2(x)
S2: w2(x) r1(x)

r1(X) is not the same in S1 and S2

S1: w1(x) w2(x)
S2: w2(x) w1(x)

The last value of X is not the same.

------------------------

A


Query Optimization
--------------------------
Given a query, find the fastest implementation of it
given the database model, storage and its current contents.

Step 1: Parse query

Step 2: Look at conditions, remove redundant conditions and
check for inconsistencies

Step 3: Find alternate logical query plans: join ordering/pushing
selections down/adding projections

Step 4: Query size estimation: estimate number of tuples matching a
specific condition (before executing a query)

Step 5: Enumerate all possible physical query plans
(operators+implementaiton), estimate its cost and find the cheapest
one.


Step 4: Size estimation
------------------------------------

SEL(C) = the proportion of tuples that will pass the condition C

Given:

T = SELECT_{C} (R)     TUPLES(T) = TUPLES(R) * SEL(C)

T2 = R JOIN_{C} S      TUPLES(T2) = TUPLES(R) * TUPLES(S) * SEL(C)


DISTINCT(R.A): number of distinct R.A values currently stored
MINVAL(R.A), MAXVAL(R.A): min and max values for R.A

SEL(R.A=c) = 1/DISTINCT(R.A)

SEL(R.A>=c1 AND R.A<=c2) =  (c2-c1)/(MAXVAL(R.A)-MINVAL(R.A))

SEL(R.A=S.B) = 1/{MAX(DISTINCT(R.A), DISTINCT(S.B))}


SEL(C1 AND C2) =  SEL(C1) * SEL(C2)
SEL (NOT C1)   =  1-SEL(C1)
SEL(C1 OR C2)  =  1 - ((1-SEL(C1)) * (1-SEL(C2)))

-----------------

Join ordering

Index selectivity:
-------------------
Given an index, how much does it reduce the number of
truples that need to be retrieved for a query?

R(A,B,C,K,L,M)  10,000,000 tuples
    Distinct(R.A)= 200    
    Distinct(R.B)=5,000   
          MinVal(R.B)=500   MaxVal(R.B)=10,500 
    Distinct(R.C)=6,000  
    Distinct(R.D)=500  
          MinVal(R.B)=100   MaxVal(R.B)=2100


Index I1 on R.A,   R.A=10

Sel(R.A=10) = 1/200

Index I2 on R.B,   R.B=400

Sel(R.b=400) = 1/5000

Index I3 on R.C,   R.C='ABC'

Sel(R.C='ABC') = 1/6000

Index I4 on R.D, R.D=400

Sel(R.D=400) = 1/500

1. Find most selective indexes

2. Find less selective indexes that have the attributes needed
for a query

SELECT F FROM R WHERE R.A = 200 AND R.E<= 500;


F,A,E -> scan all leaf
E,A,F -> Scan for R.E<=500
E,F,A -> Scan for R.E<=500

A,E,F -> best index (scan leaf from R.A=200 and min (R.E) up to
R.A=200 and R.E>500)

-----------------


TRANSACTION (XACT) PROCESSING
================================

Concurrency control

Abstract sequence of operations in a database as read/write by
transactions of data items

Schedule S1: r1(X) r2(X) w1(X) w2(X)

Schedule S2: r1(X) w1(X) r2(X) w2(X)

Schedule S3: r2(X) w2(X) r1(X) w1(X)

Schedule S4: r2(X) r1(X) w2(X) w1(X)

A serial schedule is assumed to be always correct
(S2 and S3 above are both serial!)

A schedule is serializable if its final results are guaranteed to be
equal to a serial schedule.

If a xact in a schedule always reads the same values and writes values
in the same order as a serial schedule, then it is serializable.


S1: r1(x) r2(y) w1(x) r2(z) w2(y) r1(y) w1(y)

S2: r2(y)  r2(z) w2(y) r1(x) w1(x) r1(y) w1(y)


A schedule S1 is serializable if it is guaranteed to be equal to a serial
schedule S2: i.e. all conflicts in S1 and S2 occur in the same order.

Conflict graph:

Given a schedule S1, draw the conflict graph for S1 such that
there is a node for each transaction in S1, and for each conflict

ri(X) wj(x)
wi(x) wj(x)
wi(x) rj(x)

there is a link from Ti to Tj.

If the resulting graph has a cycle, then the schedule is not
serializable.

If there is no cycle, then the schedule is serializable. You can find
a potential serial order by conducting topological sort on the
conflict graph.

Transaction management: how to ensure all schedules in the database
are serializable?

Two phase locking protocol
----------------------------

1. All transactions must obtain a lock on an item before read/write.
   To read, get a read (shared) lock on X
   To write, get a write(exclusive) lock on X
           (either get the lock straight or upgrade your shared lock if
	   no one else is holding a lock without releasing)

Multiple transactions can hold a read lock on the same item
Only one transaction can hold a write lock on an item

If a transaction requests a lock and the lock is not available, then
the transaction must wait until lock becomes available.

Strict Two Phase Locking Protocol:
-------------------------------------
Get all the locks you need and hold them all until you commit, then
release all locks.

Two Phase Locking Protocol:
-------------------------------------

Get all the locks you need to be able to read or write. You can
release locks if you do not need them anymore before commit, but you
cannot get any new locks after releasing a lock.

Two phases:

- Growing phase: transactions can get new locks only
- Shrinking phase: transactions can release locks only

Two phase locking guarantees serializability.

Hence if I am using 2PL, I cannot have cycles in my conflict graph.


Deadlocks are possible
-------------------------

- Periodically check for deadlocks, create waits-for-graph. If there
  is a cycle, then there is a deadlock.

- Avoid deadlocks by an additional mechanism: give timestamps and
  rollback/restart if an older transaction holding a lock you need