Lecture 16
-------------------------

1. Intro to part 3 of class

Disk access -> final frontier
Hard disk access is very slow compared to memory access.

2. Disk access overview

https://www.youtube.com/watch?v=wI0upu9eVcw
https://www.youtube.com/watch?v=kdmLvl1n82U
https://www.youtube.com/watch?v=p-JJp-oLx58

Terms
-----
Cylinders
Tracks
Disks/platters
Surfaces
Sectors
Cluster -> Block -> Page (8KB)

Fast HD: 3.5 Inch – Seagate BarraCuda Pro ST12000DM0007
https://www.seagate.com/www-content/datasheets/pdfs/barracuda-pro-12-tbDS1901-7-1707US-en_US.pdf
https://www.seagate.com/www-content/product-content/barracuda-fam/barracuda-new/en-us/docs/100818004c.pdf
12TB
RPM: 7200
Read and Write Speed: 243 MB/s, 236 MB/s (250 MB/s max)
256MB Cache
512 bytes/sector
8 platters, areal density 923 Gb/in^2 avg.

Magnetic disk access [Random I/O]:
Seek time + rotational latency + transfer time
Seek time > rotational latency >> transfer time

Cost to read 1 disk page
------------------------

3. Disk read cost: Sequential vs. Random I/O

Sequential I/O [Reading X pages with a single seek]:
Seek time + rotational latency + X * transfer time/page

Random I/O of X pages (each page requires a seek):
X * (Seek time + rotational latency + transfer time)

Random I/O is much slower than sequential I/O on magnetic disks.

------------------

Consider a 1 TB disk:
There are eight platters providing sixteen surfaces.
There are 2^16, or 65,536, tracks per surface (approx. 65K).
There are (on average) 2^8 = 256 sectors per track.
There are 2^12 = 4096 bytes per sector.
The disk rotates at 7200 rpm; i.e., it makes one rotation in 8.33 milliseconds.
To move the head assembly between cylinders takes one millisecond to start and stop, plus one additional millisecond for every 4000 cylinders traveled. Thus, the heads move one track in 1.00025 milliseconds and move from the innermost to the outermost track, a distance of 65,536 tracks, in about 17.38 milliseconds.
Gaps occupy 10% of the space around a track.

7200 rpm means one rotation takes 8.33 ms (on average, half a rotation is needed before the correct location arrives under the head: 4.17 ms)
Seek time is between 0 and 17.38 ms (on average, the heads travel 1/3 of the way across the disk: 65536 / 3 / 4000 + 1 = 6.46 ms)
Transfer time for one sector: 8.33 / 256 = 0.03 ms

Read a 16,384-byte block from disk.
-----------------------------------
Find the minimum, maximum, and average times to read that 16,384-byte block.

Minimum time. The minimum time is just the transfer time. Since there are 4096 bytes per sector, the block occupies four sectors. The heads must therefore pass over four sectors and the three gaps between them. We assume that gaps represent 10% of the circle and sectors the remaining 90%. There are 256 gaps and 256 sectors around the circle. Since the gaps together cover 36 degrees of arc and the sectors the remaining 324 degrees, the total arc covered by 3 gaps and 4 sectors is 36 × 3/256 + 324 × 4/256 = 5.48 degrees. The transfer time is thus (5.48/360) × 0.00833 = 0.00013 seconds (0.13 milliseconds). That is, 5.48/360 is the fraction of a rotation needed to read the entire block, and 0.00833 seconds is the time for a 360-degree rotation.

Maximum time. In the worst case, the heads are positioned at the innermost cylinder, and the block we want to read is on the outermost cylinder (or vice versa). Thus, the first thing the controller must do is move the heads. As we observed above, the time it takes to move the heads across all cylinders is about 17.38 milliseconds. This quantity is the seek time for the read. The worst thing that can happen when the heads arrive at the correct cylinder is that the beginning of the desired block has just passed under the head. Assuming we must read the block starting at the beginning, we have to wait essentially a full rotation, or 8.33 milliseconds, for the beginning of the block to reach the head again.
Once that happens, we have only to wait an amount equal to the transfer time, 0.13 milliseconds, to read the entire block. Thus, the worst-case latency is 17.38 + 8.33 + 0.13 = 25.84 milliseconds.

Average time. Two of the components of the latency are easy to compute: the transfer time is always 0.13 milliseconds, and the average rotational latency is the time to rotate the disk halfway around, or 4.17 milliseconds. We might suppose that the average seek time is just the time to move across half the tracks. However, that is not quite right, since typically the heads are initially somewhere near the middle and therefore have to move less than half the distance, on average, to the desired cylinder. The average distance traveled is 1/3 of the way across the disk. The time it takes the heads to move 1/3 of the way across the disk is 1 + (65536/3)/4000 = 6.46 milliseconds. Our estimate of the average latency is thus 6.46 + 4.17 + 0.13 = 10.76 milliseconds; the three terms represent average seek time, average rotational latency, and transfer time, respectively.

------------------

Speeding up typical database accesses to disk:

1. Place blocks that are accessed together on the same cylinder, so we can often avoid seek time, and possibly rotational latency as well.
2. Divide the data among several smaller disks rather than one large one. Having more head assemblies that can go after blocks independently can increase the number of block accesses per unit time.
3. "Mirror" a disk: make two or more copies of the data on different disks. In addition to saving the data in case one of the disks fails, this strategy, like dividing the data among several disks, lets us access several blocks at once.
4. Use a disk-scheduling algorithm, either in the operating system, in the DBMS, or in the disk controller, to select the order in which several requested blocks will be read or written.
5. Prefetch blocks to main memory in anticipation of their later use.
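
The minimum, maximum, and average read times worked out above can be re-derived with a short script. This is only a sketch under the notes' own model (1 ms start/stop plus 1 ms per 4000 cylinders, 10% gaps, 8.33 ms per rotation); the function and constant names are illustrative and not part of the lecture. Because the script does not round intermediate values, the average comes out near 10.75 ms rather than the 10.76 ms obtained by adding the rounded terms.

```python
# Disk parameters from the notes' 1 TB example disk.
TRACKS_PER_SURFACE = 65_536
SECTORS_PER_TRACK = 256
BYTES_PER_SECTOR = 4_096
ROTATION_MS = 8.33           # one rotation at 7200 rpm, as rounded in the notes
GAP_FRACTION = 0.10          # gaps occupy 10% of each track
SEEK_START_STOP_MS = 1.0     # time to start and stop the head assembly
CYLINDERS_PER_MS = 4_000     # cylinders crossed per additional millisecond


def seek_ms(cylinders: float) -> float:
    """Seek time: 0 if the heads stay put, else 1 ms plus 1 ms per 4000 cylinders."""
    if cylinders == 0:
        return 0.0
    return SEEK_START_STOP_MS + cylinders / CYLINDERS_PER_MS


def transfer_ms(block_bytes: int) -> float:
    """Time for the block's sectors and the gaps between them to pass under the head."""
    sectors = -(-block_bytes // BYTES_PER_SECTOR)  # ceiling division
    gaps = sectors - 1
    sector_degrees = 360 * (1 - GAP_FRACTION) / SECTORS_PER_TRACK
    gap_degrees = 360 * GAP_FRACTION / SECTORS_PER_TRACK
    arc = sectors * sector_degrees + gaps * gap_degrees
    return arc / 360 * ROTATION_MS


def read_times_ms(block_bytes: int) -> dict:
    """Minimum, maximum, and average time to read one block, in milliseconds."""
    transfer = transfer_ms(block_bytes)
    return {
        # Best case: heads already at the start of the block.
        "min": transfer,
        # Worst case: full-stroke seek plus a full rotation.
        "max": seek_ms(TRACKS_PER_SURFACE) + ROTATION_MS + transfer,
        # Average case: seek across 1/3 of the cylinders plus half a rotation.
        "avg": seek_ms(TRACKS_PER_SURFACE / 3) + ROTATION_MS / 2 + transfer,
    }


if __name__ == "__main__":
    for name, ms in read_times_ms(16_384).items():
        print(f"{name}: {ms:.2f} ms")
```

Changing the argument of read_times_ms (for example to 8192 for an 8KB page) shows how little the transfer term matters next to seek time and rotational latency.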
------------

Elevator algorithm
------------------
Recall that the average seek, rotational latency, and transfer times are 6.46, 4.17, and 0.13 milliseconds.

Suppose that at some time there are pending requests for block accesses at cylinders 8000, 24,000, and 56,000. The heads are located at cylinder 8000. In addition, there are three more requests for block accesses that come in at later times. For instance, the request for a block from cylinder 16,000 is made at time 10 milliseconds.

Cylinder of request    First time available (ms)
 8000                   0
24000                   0
56000                   0
16000                  10
64000                  20
40000                  30

We shall assume that each block access incurs time 0.13 for transfer and 4.17 for average rotational latency, i.e., we need 4.3 milliseconds plus whatever the seek time is for each block access. The seek time can be calculated by the following rule: 1 plus the number of tracks divided by 4000.

Let us see what happens if we schedule disk accesses using the elevator algorithm. The first request, at cylinder 8000, requires no seek, since the heads are already there. Thus, at time 4.3 the first access will be complete. The request for cylinder 16,000 has not arrived at this point, so we move the heads to cylinder 24,000, the next requested "stop" on our sweep to the highest-numbered tracks. The seek from cylinder 8000 to 24,000 takes 5 milliseconds, so we arrive at time 9.3 and complete the access in another 4.3. Thus, the second access is complete at time 13.6.

By this time, the request for cylinder 16,000 has arrived, but we passed that cylinder at time 7.3 and will not come back to it until the next pass. We thus move next to cylinder 56,000, taking time 9 to seek and 4.3 for rotation and transfer. The third access is thus complete at time 26.9. Now, the request for cylinder 64,000 has arrived, so we continue outward. We require 3 milliseconds for seek time, so this access is complete at time 26.9 + 3 + 4.3 = 34.2.
At this time, the request for cylinder 40,000 has been made, so it and the request at cylinder 16,000 remain. We thus sweep inward, honoring these two requests.

Elevator algorithm:

Cylinder of request    Time completed (ms)
 8000                   4.3
24000                  13.6
56000                  26.9
64000                  34.2
40000                  45.5
16000                  56.8

First-come-first-served scheduling:

Cylinder of request    Time completed (ms)
 8000                   4.3
24000                  13.6
56000                  26.9
16000                  42.2
64000                  59.5
40000                  70.8

------------

- Faster and faster disks: SSD (less difference between sequential and random I/O)
- Slower and slower data access too: Virtualization

----------------------

4. Disk speed comparisons

ioping

us, usec    microseconds (a millionth of a second, 1 / 1 000 000)
ms, msec    milliseconds (a thousandth of a second, 1 / 1 000)

---------------------------
---------------------------

5. RAID: Redundant Arrays of Inexpensive Disks

4 disks:

RAID-0:
Disk 1: 1 5 9 ..
Disk 2: 2 6 10 ..
Disk 3: 3 7 11 ..
Disk 4: 4 8 12 ..

Read pages 1-4: one page from each disk, in parallel
Reads/writes are faster, no redundancy

RAID-1: Mirror data
Disk 1: 1 2 3 ...
Disk 2: 1 2 3 ...

Only 1/2 of the total space is usable
Reads can be up to 2x faster
Writes are slightly slower than a single disk
Redundancy: even if one disk fails, no data is lost

RAID-4:
Disk 1: 1 5 9 ..
Disk 2: 2 6 10 ..
Disk 3: 3 7 11 ..
Disk 4: 4 8 12 ..
Disk 5: parity(pages 1,2,3,4), parity(pages 5,6,7,8), parity(pages 9,10,11,12)

Parity is the bitwise XOR of the data pages in the stripe:

Page 1    1 0 1 1 1 0 1 0
Page 2    0 1 0 1 1 0 0 1
Page 3    0 1 1 0 0 1 1 0
Page 4    0 0 1 0 1 0 0 1
Parity    1 0 1 0 1 1 0 0
-----------------

Reads as fast as RAID-0
Writes are complex: each write must update the parity disk (new parity = old parity XOR old page XOR new page)

Page 4    0 0 1 0 1 0 0 1
Page 4'   0 1 1 0 1 1 0 1
Parity    1 1 1 0 1 0 0 0

If one disk fails, I can still read/write the data without any loss!
Page 1    1 0 1 1 1 0 1 0
Page 2    0 1 0 1 1 0 0 1
-----     0 1 1 0 0 1 1 0   --> Page 3 (failed disk): reconstructed from the other disks
Page 4    0 0 1 0 1 0 0 1
Parity    1 0 1 0 1 1 0 0

Reads are slower during recovery

RAID-5: Striping the parity block
---------
Disk 1: 1 p 9  13
Disk 2: 2 5 p  14
Disk 3: 3 6 10 p
Disk 4: 4 7 11 15
Disk 5: p 8 12 16

The parity disk is no longer a bottleneck, so writes are faster; reads are still as fast as RAID-4

RAID-6: uses more than one redundant disk (e.g., a Hamming code, or two independent parity blocks), so it can recover from 2 disk failures
------------------

Data page // Data block: smallest unit of data that I can read/write to a disk: 8KB

From https://www.postgresql.org/docs/current/storage-page-layout.html:
"Every table and index is stored as an array of pages of a fixed size (usually 8 kB, although a different page size can be selected when compiling the server)."

SELECT current_setting('block_size');

Relation storage on disk
------------------------
\d pg_class

SELECT * FROM pg_user;

SELECT relname, relkind, relpages
FROM pg_class pc, pg_user pu
WHERE pc.relowner = pu.usesysid AND pu.usename = user
ORDER BY relpages DESC;

-- For small tables autoanalyze never gets triggered.
-- Since autovacuum_analyze_threshold is 50 by default, tables with fewer than 50 data modifications are not automatically analyzed.
VACUUM ANALYZE region;

SELECT pg_relation_size('bakers');
SELECT pg_table_size('bakers');
SELECT pg_indexes_size('bakers');
SELECT pg_size_pretty( pg_total_relation_size('bakers') );

---
1. Disk access is slow
2. All data may not fit in memory
3. Optimize number of disk pages read/written for each query
-> Query Cost: number of disk pages read/written
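
To see why counting disk pages is a reasonable cost measure, the sequential and random I/O formulas from section 3 can be evaluated with the average times derived in these notes (6.46 ms seek, 4.17 ms rotational latency, 0.13 ms transfer per 16,384-byte block). The sketch below is a simplified model with illustrative names: it assumes all sequentially read blocks follow one another with no extra seeks or track changes, which a real disk would not fully guarantee.

```python
# Average times (milliseconds) taken from the worked example in these notes.
AVG_SEEK_MS = 6.46
AVG_ROTATIONAL_MS = 4.17
TRANSFER_MS_PER_BLOCK = 0.13   # one 16,384-byte block


def sequential_io_ms(blocks: int) -> float:
    """One seek and one rotational delay, then the blocks are transferred back to back."""
    if blocks == 0:
        return 0.0
    return AVG_SEEK_MS + AVG_ROTATIONAL_MS + blocks * TRANSFER_MS_PER_BLOCK


def random_io_ms(blocks: int) -> float:
    """Every block pays its own seek, rotational delay, and transfer."""
    return blocks * (AVG_SEEK_MS + AVG_ROTATIONAL_MS + TRANSFER_MS_PER_BLOCK)


if __name__ == "__main__":
    for blocks in (1, 100, 10_000):
        seq = sequential_io_ms(blocks)
        rnd = random_io_ms(blocks)
        print(f"{blocks:>6} blocks: sequential {seq:10.2f} ms, "
              f"random {rnd:12.2f} ms, ratio {rnd / seq:6.1f}x")
```

Under this model, 100 blocks cost 23.63 ms sequentially and 1076 ms with random access, so both the number of pages touched and the access pattern drive query cost.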