B-Trees

-------------------------------

Rudolf Bayer and Edward M. McCreight. Never explained what "B" stands for. Possible explanations: Boeing, balanced, broad, bushy, and Bayer.
McCreight: "the more you think about what the B in B-trees means, the better you understand B-trees."

Insertion
---------

Insertion involves two steps:

Insert starting from the root
Then check if the insertion resulted in the root being split
If insertion of a node that causes the node to be over full (with more than n key values for leaf or n+1 pointers for internal nodes), then split the node.

Insertion(root, newval)

   node = leaf node to insert the newval starting with root
   result = insertion_helper(node, newval)
   if result != null : ##the root was split
      newroot = new_node
      point from newroot to root
      point from newroot to result
      root = newroot
   return root

Insertion_helper(node, newval)

   If there is space in the node
      Insert and return null
   Else
      Create a new node (a new disk page)
      Distribute entries for the current node to the two nodes evenly

      recursively insert a pointer to the new node in the
      parent node

      if the parent is full, then split the parent (distribute
      pointers and key value evenly) and recursively insert to
      parent

      if root has split, return the new node created


Deletion
--------

Deletion works in reverse. If after removing a key value/pointer, the node is less than half full, then we try to borrow from a sibling.

If this is not possible, then we merge with a sibling node.

Call deletion helper from root
If root now points to a single node in the next level,
delete root and make the next node the new root

Deletion helper:

   Find the key value to be deleted and if it exists delete it
   and update the parent node key value if necessary.

   If node is less than half full
      if it is possible to borrow a key value (leaf
      nodes ) or a pointer (for internal nodes) from a sibling:
         Borrow and adjust key values and done.
   Else:
      merge with one of the siblings and
      delete a pointer from the parent recursively







   B-trees on non-unique attributes

\d bakers

create index b2 on bakers(age);

select age from bakers order by age;


   B-trees on multiple attributes

   select baker from showstoppers order by baker;

   select episodeid, baker from showstoppers order by episodeid, baker;

   select baker, episodeid from showstoppers order by baker, episodeid;

   create index s1idx on showstoppers (baker, episodeid) ;

Searching in such B-trees

   R(X,Y)

   Which nodes in the index to search:  index search selectivity

   Consecutive searches:
   
   X='A' and Y=3
   X='A' and 3<= Y <= 8

   'A'<= X <= 'C' AND Y= 3 -> searching only 'A'<= X <= 'C'  and find tuples
   that satisfy Y=3
  

   How many tuples you find in the index that satisfy the search condition

   selectivity of index





B-tree search/insertion/deletion

----------------

B-trees with duplicate entries


----------------

Indexing:

- multi-attribute indexes


----------------

SELECT relname, relpages
FROM pg_class pc, pg_user pu
WHERE pc.relowner = pu.usesysid and pu.usename = user
ORDER BY relpages desc;

-------

select * from bakers where baker = 'Briony' ;

----------------

Index on R(X)

X = 4

4 <= X and X < 10

--------------

X = 'C', Y = 3, cost = 3, # tuples found = 0
X = 'D', 2 <= Y <= 5, cost = 4, # tuples found = 2
'B' <= X <= 'D', Y = 3; B,3 -> D,3; cost = 5, # tuples found = 2
Y = 3, cost = 11, # tuples found = 5
'D' <= X <= 'F', 3 <= Y <= 6; D,3 -> F,6; cost = 4, # tuples found = 2

--------------

R(A,B,C,D,E),  Assume: 100,000 tuples  10,000 pages  ( 10 tuples per page)

Index I1 on R(A):  400 pages at leaf, 3 levels (root+internal+leaf)

Index I2 on R(B): 400 pages at leaf, 3 levels (root+internal+leaf)

Index I3 on R(A,B): 1,000 pages at leaf, 3 levels (root+internal+leaf)

Index I4 on R(B,A): 1,000 pages at leaf, 3 levels (root+internal+leaf)

--------------

SELECT A,B,C,D,E FROM R WHERE 10 <= A and A <= 20 ; (returns: 100 tuples)

Query plan 1 -> Scan the whole relation: Cost = 10,000

Query plan 2: Scan I1 (find tuple pointers that match 10<=A<=20) +
Read all tuples that I find based on the index from relation R

100,000/400 = 250 tuples per node

100 tuples for this query will fit in a single leaf node!

Cost to scan the index = 1 (root) + 1 (internal) + 1 or 2 (leaf) = 4 max


To read the matching 100 tuples: I need to read pages of R

How many pages do I need to read?

Best case: 10 pages  (all packed sorted by A)

Worst case: 100

Cost of this query (max) : 4 + 100 = 104 pages

------

Query plan 3: Use index I3:

100,000 (tuples of R) / 1,000 = 100 tuples per node

Cost of index search = 4 (max)

Cost of searching the relation = 100 

Total cost = 104 (max)
-------

SELECT A,B,C,D,E FROM R WHERE 10 <= A and A <= 20 ; (returns: 100 tuples)

Query plan 4: Use index I4:

Index I4 on R(B,A): 1,000 pages at leaf, 3 levels (root+internal+leaf)

Scan all leaf, cost of index search = 1 (root) + 1 (internal) + 1,000
(all leaf) = 1002

Scan the tuples = 100

Cost = 1,102

===========================


SELECT B FROM R WHERE 10 <= A and A <= 20 ; (returns: 100 tuples)

Query plans:

1. Scan all of R = 10,000
2. Use I1 on R(A)  = 104 (max)
3. Use I3 on R(A,B) = 4 (index only search)
4. Use I4 on R(B,A) = 1,002 (index only search)


Other index structures:


A primary index determines the location of the records of the data file, while a secondary index does not.

PRIMARY
---------------------------

Hashing


CREATE INDEX name ON table USING HASH (column);

CREATE INDEX R ON table USING HASH (A);

Useful for searching on equality, not on range searches
A=10
A='ABC'


Clustering: sorting


CREATE INDEX I1 on movieroles(actorid, movieid);

CLUSTER movieroles USING I1;

Great for queries:

SELECT  * FROM movieroles WHERE actorid = 20;

SELECT  * FROM actors a, movieroles mr WHERE a.id = 66206 and a.id = mr.actorid ;


SECONDARY - WHAT TO INDEX ON?
--------------------------------

single attribute, multiple attributes

index on a condition for all tuples:

CREATE INDEX test1_lower_col1_idx ON test1 (lower(col1));


partial index on a subset of tuples:

CREATE INDEX access_log_client_ip_ix ON access_log (client_ip)
WHERE NOT (client_ip > inet '192.168.100.0' AND
           client_ip < inet '192.168.100.255');


constraint satisfaction:

CREATE UNIQUE INDEX name ON table (column [, ...]);


OTHER TYPES OF SECONDARY INDICES
-------------------------------------

GIST / GIN / Sp-GIST (non-balanced)

Index spatial data, X,Y

R-tree -> B-Tree

Different types on indexes:

R-trees  (Type of GIST)


SP-Gist: non-balanced trees.





Punched cards!

https://computerhistory.org/blog/the-bourne-collection-online-search-is-older-than-you-think/

CREATE INDEX name ON table USING GIST (column [ { DEFAULT | tsvector_ops } (siglen = number) ] );

    Creates a GiST (Generalized Search Tree)-based index. The column
    can be of tsvector or tsquery type. Optional integer parameter
    siglen determines signature length in bytes.

-----------------------


A tsvector value is a sorted list of distinct lexemes, which are words that have been normalized to merge different variants of the same word

SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector;

-----------------------

Inverted files


CREATE INDEX name ON table USING GIN (column);

    Creates a GIN (Generalized Inverted Index)-based index. The column
    must be of tsvector type.

select to_tsvector('english',make) from signatures ;

select make, to_tsvector('english',make) from signatures ;

-----------------