B-Trees ------------------------------- Rudolf Bayer and Edward M. McCreight. Never explained what "B" stands for. Possible explanations: Boeing, balanced, broad, bushy, and Bayer. McCreight: "the more you think about what the B in B-trees means, the better you understand B-trees." Insertion --------- Insertion involves two steps: Insert starting from the root Then check if the insertion resulted in the root being split If insertion of a node that causes the node to be over full (with more than n key values for leaf or n+1 pointers for internal nodes), then split the node. Insertion(root, newval) node = leaf node to insert the newval starting with root result = insertion_helper(node, newval) if result != null : ##the root was split newroot = new_node point from newroot to root point from newroot to result root = newroot return root Insertion_helper(node, newval) If there is space in the node Insert and return null Else Create a new node (a new disk page) Distribute entries for the current node to the two nodes evenly recursively insert a pointer to the new node in the parent node if the parent is full, then split the parent (distribute pointers and key value evenly) and recursively insert to parent if root has split, return the new node created Deletion -------- Deletion works in reverse. If after removing a key value/pointer, the node is less than half full, then we try to borrow from a sibling. If this is not possible, then we merge with a sibling node. Call deletion helper from root If root now points to a single node in the next level, delete root and make the next node the new root Deletion helper: Find the key value to be deleted and if it exists delete it and update the parent node key value if necessary. If node is less than half full if it is possible to borrow a key value (leaf nodes ) or a pointer (for internal nodes) from a sibling: Borrow and adjust key values and done. Else: merge with one of the siblings and delete a pointer from the parent recursively B-trees on non-unique attributes \d bakers create index b2 on bakers(age); select age from bakers order by age; B-trees on multiple attributes select baker from showstoppers order by baker; select episodeid, baker from showstoppers order by episodeid, baker; select baker, episodeid from showstoppers order by baker, episodeid; create index s1idx on showstoppers (baker, episodeid) ; Searching in such B-trees R(X,Y) Which nodes in the index to search: index search selectivity Consecutive searches: X='A' and Y=3 X='A' and 3<= Y <= 8 'A'<= X <= 'C' AND Y= 3 -> searching only 'A'<= X <= 'C' and find tuples that satisfy Y=3 How many tuples you find in the index that satisfy the search condition selectivity of index B-tree search/insertion/deletion ---------------- B-trees with duplicate entries ---------------- Indexing: - multi-attribute indexes ---------------- SELECT relname, relpages FROM pg_class pc, pg_user pu WHERE pc.relowner = pu.usesysid and pu.usename = user ORDER BY relpages desc; ------- select * from bakers where baker = 'Briony' ; ---------------- Index on R(X) X = 4 4 <= X and X < 10 -------------- X = 'C', Y = 3, cost = 3, # tuples found = 0 X = 'D', 2 <= Y <= 5, cost = 4, # tuples found = 2 'B' <= X <= 'D', Y = 3; B,3 -> D,3; cost = 5, # tuples found = 2 Y = 3, cost = 11, # tuples found = 5 'D' <= X <= 'F', 3 <= Y <= 6; D,3 -> F,6; cost = 4, # tuples found = 2 -------------- R(A,B,C,D,E), Assume: 100,000 tuples 10,000 pages ( 10 tuples per page) Index I1 on R(A): 400 pages at leaf, 3 levels (root+internal+leaf) Index I2 on R(B): 400 pages at leaf, 3 levels (root+internal+leaf) Index I3 on R(A,B): 1,000 pages at leaf, 3 levels (root+internal+leaf) Index I4 on R(B,A): 1,000 pages at leaf, 3 levels (root+internal+leaf) -------------- SELECT A,B,C,D,E FROM R WHERE 10 <= A and A <= 20 ; (returns: 100 tuples) Query plan 1 -> Scan the whole relation: Cost = 10,000 Query plan 2: Scan I1 (find tuple pointers that match 10<=A<=20) + Read all tuples that I find based on the index from relation R 100,000/400 = 250 tuples per node 100 tuples for this query will fit in a single leaf node! Cost to scan the index = 1 (root) + 1 (internal) + 1 or 2 (leaf) = 4 max To read the matching 100 tuples: I need to read pages of R How many pages do I need to read? Best case: 10 pages (all packed sorted by A) Worst case: 100 Cost of this query (max) : 4 + 100 = 104 pages ------ Query plan 3: Use index I3: 100,000 (tuples of R) / 1,000 = 100 tuples per node Cost of index search = 4 (max) Cost of searching the relation = 100 Total cost = 104 (max) ------- SELECT A,B,C,D,E FROM R WHERE 10 <= A and A <= 20 ; (returns: 100 tuples) Query plan 4: Use index I4: Index I4 on R(B,A): 1,000 pages at leaf, 3 levels (root+internal+leaf) Scan all leaf, cost of index search = 1 (root) + 1 (internal) + 1,000 (all leaf) = 1002 Scan the tuples = 100 Cost = 1,102 =========================== SELECT B FROM R WHERE 10 <= A and A <= 20 ; (returns: 100 tuples) Query plans: 1. Scan all of R = 10,000 2. Use I1 on R(A) = 104 (max) 3. Use I3 on R(A,B) = 4 (index only search) 4. Use I4 on R(B,A) = 1,002 (index only search) Other index structures: A primary index determines the location of the records of the data file, while a secondary index does not. PRIMARY --------------------------- Hashing CREATE INDEX name ON table USING HASH (column); CREATE INDEX R ON table USING HASH (A); Useful for searching on equality, not on range searches A=10 A='ABC' Clustering: sorting CREATE INDEX I1 on movieroles(actorid, movieid); CLUSTER movieroles USING I1; Great for queries: SELECT * FROM movieroles WHERE actorid = 20; SELECT * FROM actors a, movieroles mr WHERE a.id = 66206 and a.id = mr.actorid ; SECONDARY - WHAT TO INDEX ON? -------------------------------- single attribute, multiple attributes index on a condition for all tuples: CREATE INDEX test1_lower_col1_idx ON test1 (lower(col1)); partial index on a subset of tuples: CREATE INDEX access_log_client_ip_ix ON access_log (client_ip) WHERE NOT (client_ip > inet '192.168.100.0' AND client_ip < inet '192.168.100.255'); constraint satisfaction: CREATE UNIQUE INDEX name ON table (column [, ...]); OTHER TYPES OF SECONDARY INDICES ------------------------------------- GIST / GIN / Sp-GIST (non-balanced) Index spatial data, X,Y R-tree -> B-Tree Different types on indexes: R-trees (Type of GIST) SP-Gist: non-balanced trees. Punched cards! https://computerhistory.org/blog/the-bourne-collection-online-search-is-older-than-you-think/ CREATE INDEX name ON table USING GIST (column [ { DEFAULT | tsvector_ops } (siglen = number) ] ); Creates a GiST (Generalized Search Tree)-based index. The column can be of tsvector or tsquery type. Optional integer parameter siglen determines signature length in bytes. ----------------------- A tsvector value is a sorted list of distinct lexemes, which are words that have been normalized to merge different variants of the same word SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector; ----------------------- Inverted files CREATE INDEX name ON table USING GIN (column); Creates a GIN (Generalized Inverted Index)-based index. The column must be of tsvector type. select to_tsvector('english',make) from signatures ; select make, to_tsvector('english',make) from signatures ; -----------------