Lecture 20 - Secondary Storage Management

Lecture 20 - Secondary Storage Management#

Announcements#

  • Hw#5 is due tonight, I will extend the deadline to monday night. We will grade it over the Thanksgiving break.

  • No new hw or lecture exercise today

  • We need to discuss the next hw assignment.

Secondary Storage#

  • Data is stored on disk for stable storage.

  • Query execution, to run a query

    • Parse query, validate, optimize and generate query plan

    • Execute the query:

      • Read the data from disk into memory

      • Compute operations (such as joins, group by, sort, etc.)

      • Output the results

  • Reading data from disk is much much more expensive than doing operations in memory.

  • Unit of data to be read is a disk page, often 8K in size, read into a memory block. We will assume a disk page and a memory block are the same size.

  • For any relation, we will use PAGES® as the total number of pages the relation is stored in

  • A page stores many tuples, how many tuples depends on the size of a tuple

Types of storage#

  • Magnetic disks: to read a page, we need to move a disk head (seek) and wait for the disk to spin to the correct place (rotational latency) and read the page (data transfer)

    Cost of reading a page = seek time + rotational latency + transfer time

    Cost of reading 100 pages:

    100 pages in random locations: = 100 * (seek time + rotational latency + transfer time)

    100 pages in a sequential location (same track)
    = seek time + rotational latency + 100* transfer time

    There is a big difference in cost of random and sequential I/O

  • Solid state disks: no movable parts and faster than random I/O for sure

  • Virtual machines (access disk through a virtualization layer that has additional costs)

ioping . -warmup 5 -c 10

us -> microseconds, 1/1 000 000 ms -> miliseconds, 1/1 000

Solid state: min/avg/max/mdev = 51 us / 56 us / 60 us / 3.52 us Magnetic disk: min/avg/max/mdev = 158.4 us / 170.9 us / 180.2 us / 8.82 us Virtual machine: min/avg/max/mdev = 472.6 us / 504.5 us / 588.1 us / 43.7 us

  • Cost of a query: Number of pages read from disk (or written to disk) to answer a query

SELECT * FROM R WHERE R.A = 5 ;

  • What is the cost of this query?

  • Sequential scan:

    • Read all the data pages into memory

    • Read all the tuples and check if R.A=5

    • If yes, output

  • Cost of sequential scan: PAGES®

  • The main cost in sequential scan is driven by how many pages the relation spans, not how many tuples it has