Query Execution: ---------------------- Unit of cost: Number of pages read/written Cost of queries: First/All, All Main operators/Multiple implementations: select/project/join/group by/distinct/union/intersect/except Each query has some memory allocated, each operator has its own memory allocation: Assume M blocks for a given query Dependence on memory available for operations and how the relations are stored, the cost of a specific implementation of an operation will differ ------- Size of a relation R, PAGES(R) and memory allocated M determine the cost ------- Iterator interface: open/close/get_next ----------------------- Operator implementations 1. Using indexes 2. Single pass/multi pass (Block nested join) 3. Temporary outputs (External sort) Index scans ---------------- Answering basic selection queries: SELECT FROM R WHERE ; PAGES(R) Index I cost of = number of leaf pages in I that needs to be scanned Relation look up cost of = how many tuples from R to be read after scanning the index and its cost. ---------------------------- Sequential scan: Cost = PAGES(R) --------------------------- Index scan: Scan index I1 for condition C, and read matching tuples from R to output attributes X Cost = Cost of scanning the index + Cost of reading tuples from R Index I cost of = number of index pages in I that needs to be scanned to find matching tuples for condition C Relation look up cost of = how many tuples from R to be read after scanning the index and its cost. Example: SELECT A,B,C FROM R WHERE D='a' AND 10 =< E AND E <= 30; PAGES(R) = 12,500 TUPLES(R) = 250,000 -- On average, 20 tuples per page in R -- How many tuples match? D='a': 5,000 tuples 10 =< E AND E <= 30: 300 tuples D='a' AND 10 =< E AND E <= 30: 10 tuples Index I1 on R(D): 3 levels and 500 nodes in the leaf level 250,000/500 = 500 tuples per page at the leaf level Index scan cost = 11 nodes + 2 = 13 nodes Worst case: all matching 5,000 tuples are in 5,000 separate pages Cost of this query using I1: 13 + 5,000 = 5,013 ------------- Index I3 on R(D, E) ------------- Index I4 on R(E,D): 3 levels and 2,000 nodes at the leaf level 250,000/2,000 = 125 tuples per page at the leaf level Scan I4 for the E condition, 300 tuples/125 = about 3 nodes in I4 leaf level Cost = 2 + 4 = 6 nodes I will read only 10 tuples from R Cost this query = 10 + 6 = 16 -------------------- Index only scan: SELECT A,B,C FROM R WHERE D='a' AND 10 =< E AND E <= 30; PAGES(R) = 12,500 TUPLES(R) = 250,000 -- 20 tuples per page in R -- Index I5 on R(D,E,A,B,C) Scan the index: find tuples matching condition C and attributes X needed for the output. --------------------------- Multiple index scan: scan multiple indices and intersect in memory and if needed, read matching tuples SELECT A,B,C FROM R WHERE D='a' AND 10 =< E AND E <= 30; Index I1 on R(D): Index I2 on R(E): Read multiple indices, intersect in memory (e.g.using hashing), then read matching tuples from R --------------------------- 1. Sequential Scan 2. Index scan: scan index + read R 3. Index only scan: scan only index (if it has all attributes needed for query) 4. Multiple index scan: scan multiple indices Join -------------------- R join S = For each tuple r in R For each tuple s in S if r join s, then output Suppose M = 2 1 block of R, read all of S one page at a time, join and output repeat this for every block of R Cost of this: PAGES(R) + PAGES(R) * PAGES(S) R join S and S join R may have different costs (join ordering) Suppose M >> 2 Allocate M-1 blocks to reading R and 1 block to reading S Cost = PAGES(R) [read R once] + PAGES(S) * ceil(PAGES(R)/(M-1)) Query Execution: ---------------------- How to speed up queries: - better queries - buy more memory - change your data model - create indices// remove indices --------------------- Query plan for any SQL query -> relational algebra operators // implementation How is data stored: relation clustered/hashed/distributed? - what indices are available What are the resources available for the query: memory allocation: M ----------------------------- Unit of cost: Number of pages read/written Size of a relation R, PAGES(R) Memory allocated M determine the cost ------- Iterator interface: open/close/get_next ----------------------- Sequential Scan: ----------------- SELECT * FROM R WHERE ; Memory M = 1 Cost = PAGES(R) Ex. SELECT A,B,C FROM R WHERE B=10; Pages(R) = 8,000 Tuples(R) = 600,000 Cost of sequential scan = 8,000 Index scan: ------------- Cost of index scan: Cost of scanning the index + cost of reading tuples from relation as needed Query 1: SELECT A,B,C FROM R WHERE B=10; [Query returns 200 tuples] Pages(R) = 8,000 Tuples(R) = 600,000 Index RI1 ON R(A) (3 levels, 400 pages at the leaf) Cost// don't use this index/not applicable Index RI2 ON R(B) (3 levels, 800 pages at the leaf) Cost = 4 (scanning the index) + 200 (worst case: each tuple is in a different page) RI2 stores: 600000/ 800 = 750 tuples/node Index RI3 ON R(A,B) (3 levels, 1,000 pages at the leaf) Cost = 2+1000+200 Index RI4 ON R(B,C,A) (3 levels, 1,500 pages at the leaf) RI4 stores: 600000/1500 = 400 tuples/node Cost = 4 + 0 (Index only scan!) ----------------------- Query 2: SELECT A,B,C FROM R WHERE B=10 AND A='romney'; [Query returns 15 tuples] B=10 returns 200 tuples A='romney' returns 1,000 tuples Index RI1 ON R(A) (3 levels, 400 pages at the leaf) R11 stores = 1500 tuples/node Scan for: A='romney' returns 1,000 tuples Cost of only scanning the index: 4 Multi-index scan: ---------------------- Index RI2 ON R(B) (3 levels, 800 pages at the leaf) RI2 stores: 600000/ 800 = 750 tuples/node Scan for: B= 10 Cost of only scanning the index: 4 Intersect the found tuple addresses in memory [no additional disk cost] For the matching 15 tuples: read from disk = 15 disk pages Cost = 4 (RI1) + 4 (RI2) + 15 pages. Cost of using Index RI2 to answer query 2? ------------------------------------------- Index RI2 ON R(B) (3 levels, 800 pages at the leaf) Total Cost = 4 + 200 Cost of using Index RI1 to answer query 2? ------------------------------------------- Total Cost = 4 + 1,000 Selectivity of an index on attr A Lower this value is: Tuples(R.A=c)/Tuples(R), the more selective attribute A is..... Creating indices: 1. Create single attribute indices on more selective attributes. 2. If there is a frequent range query (c1> M External sort: Step 1: While there are more pages in R: Read in groups of M pages, sort and write to disk Read R once fully and written R once fully -> 2*PAGES(R) Step 2: Merge from each block and output. PAGES(R) = 8 M=4 Step 1: Cost = 16 pages Resulted in 2 sorted groups (4 pages wide total) Step 2: Cost = Read R once more, 8 pages and output. ----------------- Ex: PAGES(R) = 1,000 M = 20 ------- Step 1: cost = 2,000 pages 1,000/20 = 50 sorted groups ------- Step 2: cost = 2,000 pages (read/write) ceil(50/20) = 3 sorted groups --------- Step 2: merge and output cost = 1,000 pages Cost = 5,000 pages ==================== PAGES(R) = 1,000 M = 100 Step 1: cost = 2,000 pages 1,000/100 = 10 sorted groups ------- Step 2: read/merge and output, Cost = 1,000 pages Multiple operations ------------------- explain SELECT b.baker , b.fullname , count(*) as numwins FROM bakers b , results r WHERE b.baker = r.baker and r.result = 'star baker' GROUP BY b.baker ; explain SELECT baker from bakers EXCEPT SELECT baker as b1 FROM results r WHERE result = 'eliminated'; explain SELECT baker from bakers EXCEPT SELECT baker as b1 FROM results r WHERE result = 'eliminated' order by baker; ---------------- Query Optimization Steps: -> Parse query and find equivalent/alternative relational algebra expressions that implement the query - simplify Boolean conditions WHERE R >5 AND R<3 WHERE R >5 AND R>10 - deep processing rewrite subqueries - algebraic equivalences - join ordering -> Assign memory and implementation to each operation -> Estimate cost of the query [before you execute] -> Choose among all possible query plans (exponential number of these) the best options: - search methods: enumerate all, estimate cost of some, eliminate potentially bad plans Examples: ---------------------- SELECT b.baker , b.fullname , count(*) as numwins FROM bakers b , results r WHERE b.baker = r.baker and r.result = 'star baker' GROUP BY b.baker; EXPLAIN SELECT b.baker FROM bakers b WHERE b.baker in (SELECT baker FROM results WHERE result = 'star baker'); EXPLAIN SELECT DISTINCT b.baker FROM bakers b, results r WHERE b.baker = r.baker and result = 'star baker'; Query Execution ----------------- Sort -> blocking operator ---------- PAGES(R) = 1,000 Sort: M=1,000 blocks for this operation Cost = 1,000 (Read once + output) -------- Sort: M=500 blocks for this operation Step 1: Read 500 blocks, sort and write, Read 500 blocks, sort and write (results in 2 sorted groups) Cost = 2,000 Step 2: Read one block from each sorted group, merge and output Cost = 1,000 (Read R once and output) Total Cost = 3,000 -------------------- PAGES(R) = 1,000 Sort: M = 50 blocks Step 1: Read in 50 blocks at a time, sort and write to temp. Cost = 2,000 (Read R once + Write R once) Sorted groups: 1,000/50= 20 sorted groups Step 2: I need 20 blocks to be able to merge, I have 50! Read R and merge and output, Cost = 1,000 Total cost = 3,000 ------------- Sort: M = 10 blocks Step 1: Read 10 blocks at a time, sort and write. Cost = 2,000 (read once write once) Sorted groups: 100 Step 2: Merge 10 groups at a time and write! Cost = 2,000 (read once/write once) Reduce 100 sorted groups -> 10 sorted groups Step 2: Merge 10 groups and output Cost = 1,000 Total cost = 5,000 ------------------------------ select R.A , count(*) as num from R,S where R.A=S.A group by R.A order by R.A, num ; ------------------