Announcements ------------------ - Monday December 12 at 3PM: review session and final office hours Will post link on Submitty - Final exam is comprehensive! I will post the self-study topic list and the final exam schema by later today or tomorrow morning the latest. - Exam is on Wednesday December 14, 3:00PM-6:00PM (West 220) - Those needing extra accommadations will get an email from me soon. Database Tuning ------------------ - Workload: A collection of DB operations: queries/inserts/updates/deletes and frequency of each operation - Response time expectations: time to first result and time to all results - Benchmarking Hardware/Systems Tuning ------------------------ - Database administration - Better hardware: more memory, faster disks, CPU with fast L3 caches - Memory per process, max run time, and other adjustable parameters - tune to operating system, change page size - distribute data to multiple disks using tablespaces - parallelism to make use of multiple CPUs Query Tuning ---------------- - track slow queries - analyze ; - periodically recreate db statistics - explain; - check query plans - write them in multiple ways and test 1. select * from t where t.a_timestamp + interval ‘3 days’ < current_timestamp; 2. select * from t where t.a_timestamp < current_timestamp - interval ‘3 days’; 1. select ..... where extract(year from published_date) = 2022; 2. select ..... where published_date >= '01/01/2022' and published_date <='12/31/2022'; 1. select isbn13 from books where title like '%Game%'; 2. select isbn13 from books where lower(title) like 'a game%'; 3. select isbn13 from books where title like 'A Game%'; Index on title, will be scanned fully for 1 and 2, but partially for 3. - Optimizer may think certain queries are not selective -- Suppose A only takes values 1,2,3 A<>3 - not very selective! A in (1,2) - more selective - Order by/distinct are costly, avoid if not necessary - Nested queries/subqueries: be careful, avoid correlated subqueries is possible - Balance cost of a query with the cost of other processing (operations and data shipping) ------------------------------ Indices ------- - Indexes speed up queries, slow down insert/update/delete - assess overall workload cost - Clustered index/sort the data itself (only 1 per relation, periodically reorganize it) MovieRoles(actorid, movieid) Cluster by: MovieRoles(actorid, movieid) --> extremely useful for queries: actorid = 1000 all movies of an actor --> not so useful for quries: movieid = 5000 all actors in a movie - Create all reasonable selective indices Indices that reduce the number of matching tuples considerably for a query Attributes where NDISTINCT(A) is large, A= c Index on R(A) - Less selective indices: add frequently queried attributes Suppose B is not selective but I frequently ask: select C from R where B=c1; Index on R(B,C) -- allows index only search Index types: create index mr_idx on movieroles(movieid,actorid) ; cluster movieroles using mr_idx ; -- recluster using cluster movierolesl CREATE INDEX ON films ((lower(title))); CREATE INDEX title_idx_german ON films (title COLLATE "de_DE"); CREATE INDEX title_idx_nulls_low ON films (title NULLS FIRST); CREATE INDEX code_idx ON films (code) TABLESPACE indexspace; -- partial indices CREATE INDEX orders_unbilled_index ON orders (order_nr) WHERE billed is not true; CREATE UNIQUE INDEX tests_success_constraint ON tests (subject, target) WHERE success; - Consider other types of indices Text keyword query: inverted file indices (GIN) Spatial data query: spatial index: R-tree, quad-trees, kd-trees (GIST) Normalization ----------------- - Good normalized data simplifies application logic, easier to modify A->B, B->C R1(A,B) R2(B,C) - Denormalize - Add redundant data or aggregates to your data model to improve performance Actor(id, ... , nummovies) --need to implement a workflow to keep this updated - Column store: Extreme normalization (key stores) S(A,B,C,D,E) A is the key Store as: (A,B) (A,C), (A,D) (A,E)