Query Optimization ------------------- Query optimization -> 1. Parse query and construct a relational algebra equivalent 2. Sanity check: logic and syntax check (select * from r where a = 1 and a = 2;) Enumerate many options for: 3. Construct a query tree from your query 4. Find an implementation for each operator 5. Find its expected cost (before running the query) Choose the cheapest one! - Finding equivalent query trees Relational Algebra equivalences ---------------------------------- R join S = S join R R join (S join T) = (R join S) join T Pushing selections down a join: select_C( R join S) = select_C(R) join select_C(S) If condition C only applies to R, then (***): select_C( R join S) = select_C(R) join S Advantage: joins may be cheaper because we are joining fewer tuples that span fewer pages. Pushing projections down: project_A,B (R join S) = project_{A,B} (project_{A,B,C1} R) join (project_{A,B,C2} S) where C1 and C2 are attributes from R and S respectively that are needed for the join. Advantage: the size of tuples is reduced so more can be stored in the same amount of memory, and reduce the cost of joins (and other operations). - Finding expected size of output of different operations (before I run the query) Statistics to keep for each table: Number of tuples in the table (TUPLES(R)) For each attribute A in the table, VALUES(R.A): number of distinct values stored for R.A (DISTINCT(R.A)) MINVAL(R.A) MAXVAL(R.A) ->current min and max value stored for R.A (Assumption is each value for R.A is equally like and is distributed uniformly between min/max values) Equality query: Selectivity(A=c) = 1/VALUES(A) Range query: Selectivity(A>=c1 and A 1/values(R.A) select * from r,s where s.b = c2 -> 1/values(S.B) Sel(R.A = S.B) = 1/max{values(R.A), values(S.B)} Tuples = TUPLES(R) * TUPLES(S) * Sel(R.A=S.B) ------------ Table: Students(RIN, Class, Major, Age) Tuples(Students) = 8,000 Values(RIN) = 8,000 Values(Class) = 4 Values(Major) = 40 Values(Age) = 10 Minval(Age) = 15, Maxval(Age) = 32 Table: Transcript(rin, crn, grade) Tuples(Transcript) = 200,000 Values(RIN) = 7,695 Values(CRN) = 300 Q6: SELECT * FROM Students S, Transcript T WHERE S.rin = T.rin; Sel(S.rin = T.rin) = 1/{8000,7695} Tuples = 8000 * 200,000 * 1/8000 = 200,000 Q1: select * from students where class = 'senior'; Selectivity(C) = the percentage of tuples that are expected to pass condition C Tuples(Q) = Tuples(R)* Selectivity(C) Selectivity(class = 'senior') = 1/4 Tuples(Q1) = 8,000 * 1/4 = 2,000 Q2: select * from students where major = 'Math' ; Selectivity (major = 'Math') = 1/40 Tuples(Q2) = 8,000 * 1/40 = 200 students Q2': select * from students where major = 'CSCI' ; Same expected values, but may be way off! Q0: select * from students where rin = 660012345 Selectivity =1/8000 Tuples = 8000 * 1/8000 = 1 Q3: select * from students where age >= 18 and age < 21; Selectivity = 3/(32-15) = 3/17 Tuples = 8,000 * 3/17 Q3': select * from students where age >= 29 and age < 40; Selectivity = (32-29)/(32-15) = 3/17 Q4: select * from students where major = 'CSCI' and class = 'senior'; sel = 1/40 * 1/4 tuples = 8000/160 Q5: select * from students where age = 18 and class = 'senior'; - Enumerate all possible implementations ---> Find all possible two-way joins (if joining R,S,T,W) R,S R,T S,T R,W S,W T,W Find cheapest ways to implement the join + keep a number of interesting query plans with the two relations that may provide cheaper plans later -> Take each 2-way join and add a third relation to join (R,S) + T (R,T) + W ... Prune options that are too expensive. Continue adding joins until the whole query is implemented, and choose the cheapest. ------------------------------------------- PAGES(R) = 100, TUPLES(R) = 20,000 PAGES(S) = 500, TUPLES(S) = 200,000 Suppose each attribute is 4 bytes long. VALUES(R.A)= 1,000 VALUES(R.E)= 2,000 VALUES(S.F)= 1,800 VALUES(R.B)= 400 VALUES(R.C)= 5,000 between 0 (minval) - 10,000 (maxval) SELECT R.D FROM R WHERE R.A = 5 AND R.B in ('a','b') AND R.C <= 100 AND R.C > 500 ; Sel( R.A = 5) = 1/1000 Sel (R.B in ('a','b')) = 1- (1-1/400)*(1-1/400) Sel( R.C <= 100 AND R.C > 500) = 400/10000 Sel (Query) = 1/1000 * (1- (1-1/400)*(1-1/400)) * (400/10000) SELECT R.D, S.G, count(*) FROM R,S WHERE R.E = S.F AND R.A = 5 GROUP BY R.D, S.G ; Sel(R.E=S.F) = 1/max{2000,1800} = 1/2000 Sel(R.A) = 1/1000 Sel = 1/1000 * 1/2000 Tuples = 20,000 * 200,000 * 1/1000 * 1/2000 ------- select avg_width , histogram_bounds from pg_stats where tablename = 'songs' and attname = 'decade';