Selectivity of an index on attribute A Tuples(R.A = c) / Tuples(R) The lower the selectivity, the more selective attribute A is. Create indices: 1. Create single attribute indices on more selective attributes 2. If there is a frequent range query (c1 < A < c2) that is costly, then we should consider clustering by A 3. Consider indices to facilitate index only searches. Query Optimization Steps: 1. Parse query and find equivalent/alternative relation algebra expressions. Those expressions implement the query 2. Simplify Boolean conditions. WHERE R > 5 AND R < 3 WHERE R > 5 AND R > 10 3. Deep processing Rewriting queries: - algebraic equivalences - join ordering 4. Assign memory and implementation each operation 5. Estimate the cost of query (before execution) 6. Choose among all possible plans the best option: - there might be exponentially many plans - searching for the best plan: enumerate all plans, estimate cost of them (some), eliminate potentially bad plans 7. Query execution Sorting ------- Sort is a blocking operation ------------------- PAGES(R) = 1,000 Sort: M = 1,000 blocks for this operation Cost = 1,000 (read once + output) ------------------- Sort: M = 500 blocks for this operation Step 1. Read 500 blocks, sort them and then write them to temp, read 500 blocks from temp, sort and write them to temp (results in 2 sorted groups). Cost: 2,000 Step 2. Read one block from temp for each sorted group, merge and output to the buffer Cost: 1,000 Total cost = 2,000 + 1,000 = 3,000 ----------------- Sort: M = 50 blocks Step 1. Read in 50 blocks at a time, sort them and write to temp. Cost: 2,000 Sorted groups: 1,000 / 50 = 20 sorted groups Step 2. We need 20 blocks to be able to merge, we have more - 50! Read R and marge and output. Cost: 1,000 Total cost: 3,000 --------------- Sort: M = 10 blocks Step 1: Read 10 blocks at a time, sort and write to temp Cost: 2,000 Sorted groups: 1,000 / 10 = 100 groups Step 2: Merge 10 groups at a time and write to temp Cost: 2,000 Reduce 100 sorted groups -> 10 sorted groups Step 2: Merge 10 groups and output Cost: 1,000 Total cost: 2,000 + 2,000 + 1,000 = 5,000 Query Size Estimation --------------------- How many tuples will be returned by a query? Sel() = the percentage of tuples that will pass this condition Selectivity of condition SELECT * FROM R WHERE ; Expected tuples in the result = Tuples(R) * Sel() Statistics ---------- N_DISTINCT(Attr) MinVal MaxVal Sel(X = a) = 1 / N_DISTINCT(X) Class (Fr/So/Jr/Se) Class = 'Junior' 1/4 Sel(Age = 20) 1/8 SELECT * FROM R WHERE 20 < age AND age < 35; Sel(a1 <= X and X <= a2) = (a2 - a1) / (MaxVal - MinVal) a1 |-----------|a2 min |-------------------------------------------------| max ------------------ SELECT * FROM R, S WHERE R.A = S.B; Sel(R.A = S.B) = 1 / max(N_DISTINCT(R.A), N_DISTINCT(S.B)) Movies => movieid is primary key, R 1,000 tuples (N_DISTINCT(movies.movieid) = 1,000) MovieGenres => 8,000 tuples (N_DISTINCT(moviegenres.movieid) = 950) Sel(movies.movieid = moviegenres.movieid) = 1 / max(1,000; 950) Tuples(Q) = Tuples(movies) * Tuples(moviegenres) * 1/1,000 = 1,000 * 8,000 = 8,000 Sel(not Cond) = 1 - Sel(Cond) Sel(Cond1 AND Cond2) = Sel(Cond1) * Sel(Cond2) How about Class = 'Junior' and Age = 20? Sel(Class = 'Junior' and Age = 20) = Sel(Class = 'Junior') * Sel(Age = 20) Sel(Cond1 OR Cond2) = Sel(NOT((NOT Cond1) AND (NOT Cond2))) = 1 - (1 - Sel(Cond1) * (1 - Sel(Cond2))) Query Size Estimation ------------------------- How many tuples will be returned by a query? Sel() = the percentage of tuples that will pass the condition C = selectivity of condition C SELECT * FROM R WHERE ; Expected tuples in the result = Tuples(R) * Sel() Statistics: ---------------- N_DISTINCT(Attr) MinVal MaxVal ----------------- Sel(X = a) = 1/N_DISTINCT(Attr) Class (Fr/So/Jr/Se) Class = 'Junior': 1/4 Sel(Age = 20) = 1/30 ------------------ Sel(a1 <= X and X <= a2) = (a2-a1)/ (maxval-minval) a1 |--------| a2 min |-----------------------------| max -------------------- Q: SELECT * FROM R WHERE R.A=S.B; sel(R.A = S.B) = 1/max(N_DISTINCT(R.A), N_DISTINCT(S.B)) Movies=> movieid primary key, R 1,000 tuples (N_DISTINCT(movies.movieid)=1000) MovieGenres=> 8,000 tuples (N_DISTINCT(MovieGenres.movieid)=950) sel(Movies.movieid = MovieGenres.movieid) = 1/max{1000,950} Tuples(Q) = Tuples(Movies)*Tuples(MovieGenres)* 1/1000 = 1000 * 8000 * 1/1000 = 8,000 -------------------- sel(not Cond) = 1 - sel(Cond) sel(Cond1 and Cond2) = sel(Cond1) * sel(Cond2) How about Class = 'Junior' and Age = 20? sel(Cond1 or Cond2) = sel( not ((not Cond1) and (not Cond2)) ) = 1 - ((1 - sel(Cond1)) * (1 - sel(Cond2))) -------------------- with drb_stats as ( select min(imdbrating) as min, max(imdbrating) as max from series ), histogram as ( select width_bucket(imdbrating, min, max, 9) as bucket, numrange(min(imdbrating)::numeric, max(imdbrating)::numeric, '[]') as range, count(*) as freq from series, drb_stats group by bucket order by bucket ) select bucket, range, freq, repeat(E'\u25a0', ( freq::float / max(freq) over() * 30 )::int ) as bar from histogram; select avg_width , most_common_vals , histogram_bounds from pg_stats where tablename = 'movies' and attname = 'duration'; avg_width: Average width in bytes of column's entries histogram_bounds: A list of values that divide the column's values into groups of approximately equal population. The values in most_common_vals, if present, are omitted from this histogram calculation. select avg_width , most_common_vals , histogram_bounds from pg_stats where tablename = 'series' and attname = 'imdbrating'; select avg_width , most_common_vals , histogram_bounds from pg_stats where tablename = 'series' and attname = 'yearreleased'; ---------- --------------------------- Example: Tuples(R)= 8,000 --students Tuples(S)= 150,000 --transcript Attribute N_DISTINCT Minval Maxval R.A 8,000 298382370 23129803912 R.B 6 A E R.C 10 17 44 R.D 700 198382370 3129803912 S.A 7,800 298382370 23129803912 S.E 200 dsljdslkaj werpewurpowe Q1: SELECT * FROM R WHERE A=340238404 ; Sel(Q1) = 1/8000 Exp(Q1) = 1 Q2: SELECT * FROM R WHERE B='B'; Sel(Q2) = 1/6 Exp(Q2) = 8000/6 = 1333 Q3: SELECT * FROM R WHERE B<>'B'; Sel(Q3) = 1-1/6 = 5/6 Exp(Q3) = 8000*5/6 = 6666 Q4: SELECT * FROM R WHERE C >= 20 AND C <= 24 ; Sel(Q4)= (24-19)/(44-16)=0.18 Exp(Q4)= 1429 Q5: SELECT * FROM R WHERE C >= 20 AND C <= 24 AND B='B'; Sel(Q5)=Sel(Q4)*Sel(Q2) = 0.18/6 = 0.03 Q6: SELECT * FROM R,S WHERE R.A=S.A AND (R.B='A' OR S.E= 'fsdfsdf') ; R.A=S.A => 1/8000 ## Size of the join; 150,000 * 8,000 * 1/8,000 = 150,000 R.B='A' => 1/6 S.E= 'fsdfsdf' => 1/200 Sel: 1/8,000 * ((1- ((1-1/6)*(1- 1/200)) ) Exp: 150,000 * 8,000 * 1/8,000 * ((1- ((1-1/6.)*(1- 1/200.)) ) Statistics: ---------------- N_DISTINCT(Attr) MinVal MaxVal Sel(R.A=x) = 1/N_DISTINCT(R.A) Sel(c1<=R.B<=c2) = (c2-c1)/(maxval-minval) Sel(R.A=S.B) = 1/max(N_DISTINCT(R.A), N_DISTINCT(S.B)) sel(not Cond) = 1 - sel(Cond) sel(Cond1 and Cond2) = sel(Cond1) * sel(Cond2) sel(Cond1 or Cond2) = sel( not ((not Cond1) and (not Cond2)) ) = 1 - ((1 - sel(Cond1)) * (1 - sel(Cond2))) ====================================================================