
File System Design

The current implementation of Mitra assumes that a PM only displays a stream (i.e., it does not perform complex operations such as Fast-Forward, Fast-Rewind, or Pause). Upon the arrival of a request for object X of media type MX, the admission control policy of the Scheduler is as follows. First, the Scheduler checks whether another scheduled display is beginning the display of X, i.e., references its first block X0. If so, the two requests are combined into one. This enables Mitra to multiplex a single stream among multiple PMs. If no other stream is referencing X0, then, starting with the current active group, the Scheduler locates the group with sufficient idle time to accommodate the retrieval of a block of size B(MX). The implementation details of this policy are contained in the Appendix. If no group can accommodate this request, the Scheduler queues it and examines the possibility of admitting it during the next time period. (The performance results of the Evaluation Section are obtained with a configuration of the Scheduler that does not merge several streams referencing the same object into one.)

With multiple media types, Mitra's file system might be forced to manage different block sizes. Moreover, the blocks of different objects might be staged from the tertiary storage device onto magnetic disk storage on demand. A block should be stored contiguously on disk. Otherwise, the disk would incur seeks when reading a block, reducing its bandwidth. Moreover, it might cause hiccups because the retrieval time of a block would become unpredictable. To ensure the contiguous layout of a block, we considered four alternative approaches: disk partitioning, extent-based [ABCea76][CDKK85][GR93b], multiple block sizes, and an approximate contiguous layout of a file. We chose the final approach, resulting in the design and implementation of the EVEREST file system. Below, we describe each of the other three approaches and our reasons for abandoning them.

With disk partitioning, assuming media types with different block sizes, the available disk space is partitioned into regions, one region per media type. The space of region i is partitioned into fixed-size blocks of size B(Mi). The objects of media type i compete for the available blocks of this region. The amount of space allocated to region i might be estimated as a function of both the size and the frequency of access of the objects of media type i [GI94]. However, partitioning of disk space is inappropriate for a dynamic environment where the frequency of access to the different media types changes as a function of time. This is because when a region becomes cold, its space should be made available to a region that has become hot. Otherwise, the hot region might start to exhibit a thrashing [Den80] behavior that would increase the number of retrievals from the tertiary storage device. This motivates a re-organization process to re-arrange disk space. Such a process would be time consuming due to the overhead of performing I/O operations.

With an extent-based design, a fixed contiguous chunk of disk space, termed an extent, is partitioned into fixed-size blocks. Two or more extents might have different block sizes. Both the size of an extent and the number of extents with a pre-specified block size (i.e., for a media type) are fixed at system configuration time. A single file may span one or more extents. However, an extent may contain no more than a single file. With this design, an object of media type i is assigned one or more extents with block size B(Mi). In addition to suffering from the limitations associated with disk partitioning, this approach suffers from internal fragmentation, with the last extent of an object being only partially occupied. This wastes disk space, increasing the number of references to the tertiary storage device.

With the Multiple Block Size approach (MBS), the system is configured based on the media type with the lowest bandwidth requirement, say M1. MBS requires the block size of each media type j to be a multiple of B(M1), i.e., B(Mj) = ⌈B(Mj)/B(M1)⌉ × B(M1). This might simplify the management of disk space to: 1) avoid its fragmentation, and 2) ensure the contiguous layout of each block of an object. However, MBS might waste disk bandwidth by forcing the disk to: (1) retrieve more data on behalf of a PM per time period due to the rounding up of the block size, and (2) remain idle during other time periods to avoid an overflow of memory at the PM. These are best illustrated with an example. Assume two media types, MPEG-1 and MPEG-2, with bandwidth requirements of 1.5 Mbps and 4 Mbps, respectively. With this approach, the block size of the system is chosen based on MPEG-1 objects. Assume it is chosen to be 512 KByte, i.e., B(MPEG-1) = 512 KByte. This implies that B(MPEG-2) = 1365.33 KByte. MBS would increase B(MPEG-2) to equal 1536 KByte. To avoid an excessive amount of accumulated data at a PM displaying an MPEG-2 clip, the Scheduler might skip the retrieval of data one time period out of every nine using the PM-driven scheduling paradigm of the PM-driven Section. The Scheduler may not employ this idle slot to service another request because it is required during the next time period to retrieve the next block of the current MPEG-2 display. If all active requests are MPEG-2 video clips and a time period supports nine displays with B(MPEG-2) = 1536 KByte, then, with B(MPEG-2) = 1365.33 KByte, the system would support ten simultaneous displays (a 10% improvement in performance). In summary, the block size for a media type should approximate its theoretical value in order to minimize the percentage of wasted disk bandwidth.
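The rounding performed by MBS can be sketched in a few lines of Python. This is an illustrative fragment (the function name and parameters are ours, not part of Mitra); it reproduces the MPEG-1/MPEG-2 arithmetic above.

```python
import math

def mbs_block_size(bw_mbps, base_bw_mbps, base_block_kb):
    """Round a media type's block size up to a multiple of the base block.

    Returns (theoretical size, MBS size), both in KByte.
    """
    exact_kb = base_block_kb * bw_mbps / base_bw_mbps   # theoretical B(Mj)
    multiple = math.ceil(exact_kb / base_block_kb)      # round up to a multiple
    return exact_kb, multiple * base_block_kb

# The example from the text: B(MPEG-1) = 512 KByte, MPEG-2 at 4 Mbps.
exact, rounded = mbs_block_size(bw_mbps=4.0, base_bw_mbps=1.5, base_block_kb=512)
print(round(exact, 2))   # 1365.33 KByte: theoretical B(MPEG-2)
print(rounded)           # 1536 KByte: MBS rounds up, wasting disk bandwidth
```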

The final approach, and the one used by Mitra, employs the buddy algorithm to approximate a contiguous layout of a file on the disk without wasting disk space. The number of contiguous chunks that constitute a file is a fixed function of the file size and the configuration of the buddy algorithm. Based on this information, Mitra can either (1) prevent a block from overlapping two non-contiguous chunks or (2) allow a block to overlap two chunks and require the PM to cache enough data to hide the seeks associated with the retrieval of these blocks. Currently, Mitra implements the first approach. To illustrate the second approach, if a file consists of five contiguous chunks then at most four blocks of this file might span two different chunks. This implies that the retrieval of four blocks will incur seeks with at most one seek per block retrieval. To avoid hiccups, the Scheduler should delay the display of the data at the PM until it has cached enough data to hide the latency associated with four seeks. The amount of cached data is not significant. For example, assuming a maximum seek time of 20 milliseconds, with MPEG-2 objects (4 Mbps), the PM should cache 10 KByte to hide each seek. However, this approach complicates the admission control policy because the retrieval of a block might incur either one or zero seeks.
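The 10 KByte figure follows from a one-line calculation: the PM must buffer the data it would consume during one maximum seek. A small, hypothetical Python helper (the function name is ours) makes the arithmetic explicit:

```python
def cache_per_seek_kbyte(bandwidth_mbps, seek_ms):
    """KBytes a PM consumes during one seek of seek_ms milliseconds."""
    bits = bandwidth_mbps * 1000 * seek_ms  # Mbit/s * ms = 10^3 bits
    return bits / 8 / 1000                  # convert bits to KByte

# MPEG-2 at 4 Mbps with a 20 millisecond maximum seek, as in the text.
print(cache_per_seek_kbyte(4, 20))  # 10.0 KByte per seek
```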


Figure 2: Physical division of disk space into pages and the corresponding logical view of the sections with an example base of ω = 2.

With EVEREST, the basic unit of allocation is a page,(3) also termed a section of height 0. EVEREST organizes these sections as a tree to form larger, contiguous sections. As illustrated in Figure 2, only sections of size(page) × ω^i (for i >= 0) are valid, where the base ω is a system configuration parameter. If a section consists of ω^i pages then i is said to be the height of the section. ω height i sections that are buddies (physically adjacent) might be combined to construct a height i+1 section.

To illustrate, the disk in Figure 2 consists of 16 pages. The system is configured with ω = 2. Thus, the size of a section may vary from 1, 2, 4, 8, up to 16 pages. In essence, a binary tree is imposed upon the sequence of pages. The maximum height, computed by(4) S = ⌈log_ω(Capacity / size(page))⌉, is 4. With this organization imposed upon the device, sections of height i >= 0 cannot start at just any page number, but only at offsets that are multiples of ω^i. This restriction ensures that any section, with the exception of the one at height S, has a total of ω-1 adjacent buddy sections of the same size at all times. With the base 2 organization of Figure 2, each section has one buddy.
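The section geometry can be sketched in a few lines of Python (illustrative only; the function names are ours). The loop avoids floating-point logarithms so the heights come out exact.

```python
def max_height(num_pages, omega):
    """Largest h with omega**h <= num_pages (the height S in the text)."""
    h = 0
    while omega ** (h + 1) <= num_pages:
        h += 1
    return h

def valid_starts(height, num_pages, omega):
    """A height-i section may start only at page offsets that are multiples of omega**i."""
    step = omega ** height
    return list(range(0, num_pages, step))

# The Figure 2 configuration: 16 pages, base omega = 2.
print(max_height(16, 2))        # 4
print(valid_starts(3, 16, 2))   # [0, 8]: the two height-3 sections
```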

With EVEREST, a portion of the available disk space is allocated to objects. The remainder, should any exist, is free. The sections that constitute the available space are handled by a free list. This free list is actually maintained as a sequence of lists, one for each section height. The information about an unused section of height i is enqueued in the list that handles sections of that height. In order to simplify object allocation, the following bounded list length property is always maintained: for each height i = 0, ..., S, at most ω-1 free sections of height i are allowed. Informally, this property implies that whenever ω free sections accumulate at the free list of height i, EVEREST must compact them into a section of a larger height(5).

The materialization of an object proceeds as follows. The first step is to check whether the total number of pages in all the sections on the free list is greater than or equal to the number of pages (denoted no-of-pages(ox)) that the new object ox requires. If this is not the case, then one or more victim objects are elected and deleted. (The procedure for selecting a victim is based on heat [GIKZ96]. The deletion of a victim object is described further below.) Assuming enough free space is available at this point, ox is divided into its corresponding sections as follows. First, the number m = no-of-pages(ox) is converted to base ω. For example, if ω = 2 and no-of-pages(ox) = 13 (decimal), then its binary representation is 1101. The full representation of such a converted number is m = d(j-1) × ω^(j-1) + ... + d2 × ω^2 + d1 × ω^1 + d0 × ω^0. In our example, 1101 (binary) can be written as 1 × 2^3 + 1 × 2^2 + 0 × 2^1 + 1 × 2^0. In general, for every digit di that is non-zero, di sections are allocated from height i of the free list on behalf of ox. In our example, ox requires 1 section from height 0, no sections from height 1, 1 section from height 2, and 1 section from height 3.
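The digit decomposition above can be sketched as follows (an illustrative Python fragment; the function name is ours). Each non-zero base-ω digit of the page count becomes an allocation from the corresponding free-list height.

```python
def sections_needed(num_pages, omega=2):
    """Decompose an object's page count into base-omega digits.

    Returns {height: count} of sections to allocate, omitting zero digits.
    """
    digits, height = {}, 0
    while num_pages > 0:
        num_pages, d = divmod(num_pages, omega)
        if d:
            digits[height] = d  # d sections of omega**height pages each
        height += 1
    return digits

# The example from the text: 13 pages with omega = 2 (13 = 1101 in binary).
print(sections_needed(13))  # {0: 1, 2: 1, 3: 1}
```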

For each object, the number nu of contiguous pieces is equal to the number of one's in the binary representation of m, or, with a general base ω, nu = SUM of di for i = 0, ..., j-1 (where j is the total number of digits). Note that nu is always bounded by ⌈log_ω(m)⌉. For any object, nu defines the maximum number of sections occupied by the object. (The minimum is 1 if all nu sections are physically adjacent.) A complication arises when no section of the right height exists. For example, suppose that a section of height i is required, but the smallest section larger than height i on the free list is of height j (j > i). In this case, the section of height j can be split into ω sections of height j-1. If j-1 = i, then ω-1 of these are enqueued on the list of height i and the remainder is allocated. However, if j-1 > i, then ω-1 of these sections are again enqueued at height j-1, and the splitting procedure is repeated on the remaining section. It is easy to see that, whenever the total amount of free space on these lists is sufficient to accommodate the object, then for each section that the object occupies there is always a section of the appropriate size, or larger, on the list. This splitting procedure guarantees that the appropriate number of sections, each of the right size, will be allocated, and that the bounded list length property is never violated.
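The splitting procedure can be sketched by tracking only a hypothetical free list of section counts per height (illustrative Python, not Mitra's implementation; physical addresses are ignored).

```python
def split_down(free_list, i, j, omega=2):
    """Split one height-j free section to satisfy a height-i request.

    free_list maps height -> number of free sections at that height.
    At each intermediate height one piece is kept (to split further, or
    to allocate once the target height is reached); the other omega-1
    pieces are enqueued as free sections.
    """
    assert free_list.get(j, 0) > 0 and j > i
    free_list[j] -= 1
    for h in range(j - 1, i - 1, -1):
        free_list[h] = free_list.get(h, 0) + (omega - 1)
    return i  # a height-i section is handed to the object

# One height-3 section split down to serve a height-0 request (omega = 2):
fl = {3: 1}
split_down(fl, i=0, j=3)
print(fl)  # {3: 0, 2: 1, 1: 1, 0: 1}: one leftover buddy at each height
```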

When the system elects that an object must be materialized and there is insufficient free space, then one or more victims are removed from the disk. Reclaiming the space of a victim requires two steps for each of its sections. First, the section must be appended to the free list at the appropriate height. The second step ensures that the bounded list length property is not violated. Therefore, whenever a section is enqueued in the free list at height i and the number of sections at that height is equal to or greater than ω, then ω sections must be combined into one section at height i+1. If the list at height i+1 now violates the bounded list length property, then once again space must be compacted and moved to height i+2. This procedure might be repeated several times. It terminates when the length of the list at a higher height is less than ω.


Figure 3a: Two sections are on the freelist already (7 and 14) and object o3 is deallocated.

Figure 3b: Sections 7 and 13 should be combined, however, they are not contiguous.

Figure 3c: The buddy of section 7 is 6. Data must move from 6 to 13.

Figure 3d: Sections 6 and 7 are contiguous and can be combined.

Figure 3e: The buddy of section 6 is 4. Data must move from (4,5) to (14,15).

Figure 3f: Sections 4 and 6 are now adjacent and can be combined.

Figure 3g: The final view of the disk and the free list after removal of o3.

Figure 3: Deallocation of an object. The example sequence shows the removal of object o3 from the initial disk resident object set { o1, o2, o3}. Base two, ω = 2.

Compaction of free sections into a larger section is simple when they are buddies; in this case, the combined space is already contiguous. Otherwise, the system might be forced to exchange one occupied section of an object with one on the free list in order to ensure contiguity of an appropriate sequence of sections at the same height. The following algorithm achieves space-contiguity among free sections at height i.
  1. Check if there are at least ω sections for height i on the free list. If not, stop.
  2. Select the first section (denoted sj) and record its page-number (i.e., its offset on the disk drive). The goal is to free ω-1 sections that are buddies of sj.
  3. Calculate the page-numbers of sj's buddies. EVEREST's division of disk space guarantees the existence of ω-1 buddy sections physically adjacent to sj.
  4. For every buddy sk, 0 <= k <= ω-1, k != j, if it exists on the free list then mark it.
  5. Any unmarked buddies sk currently store parts of other object(s). The space must be re-arranged by swapping these sk sections with sections on the free list. Note that for every buddy section that should be freed there exists a section on the free list. After swapping space between every unmarked buddy section and a free list section, enough contiguous space has been acquired to create a section at height i+1 of the free list.
  6. Go back to Step 1.
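Ignoring the data movement of Steps 3-5, the combining loop amounts to the following sketch over section counts (illustrative Python; this is essentially the "virtual", in-memory form of the algorithm, not Mitra's code).

```python
def compact(free_list, max_height, omega=2):
    """Enforce the bounded list length property on per-height section counts.

    Whenever omega free sections exist at height h, they combine into one
    free section at height h+1; combinations may cascade upward.
    """
    for h in range(max_height):
        while free_list.get(h, 0) >= omega:
            free_list[h] -= omega
            free_list[h + 1] = free_list.get(h + 1, 0) + 1
    return free_list

# The Figure 3 scenario: freeing page 13 while section 7 (height 0) and
# section 14 (height 1) are already free gives two height-0 sections;
# the cascade leaves a single height-2 section.
print(compact({0: 2, 1: 1}, max_height=4))  # {0: 0, 1: 0, 2: 1}
```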
To illustrate, consider the organization of space in Figure 3a. The initial set of disk resident objects is { o1, o2, o3} and the system is configured with ω = 2. In Figure 3a, two sections are on the free list at heights 0 and 1 (addresses 7 and 14, respectively), and o3 is the victim object that is deleted. Once page 13 is placed on the free list in Figure 3b, the number of sections at height 0 is increased to ω and they must be compacted according to Step 1. As sections 7 and 13 are not contiguous, section 13 is elected to be swapped with section 7's buddy, i.e., section 6 (Figure 3c). In Figure 3d, the data of section 6 has been moved to section 13 and section 6 is now on the free list. The compaction of sections 6 and 7 results in a new section with address 6 at height 1 of the free list. Once again, a list of length two at height 1 violates the bounded list length property and pages (4,5) are identified as the buddy of section 6 in Figure 3e. After moving the data in Figure 3f from pages (4,5) to (14,15), another compaction is performed, with the final state of the disk space emerging as in Figure 3g.

Once all sections of a deallocated object are on the free list, the iterative algorithm above is run on each list, from the lowest to the highest height. The previous algorithm is somewhat simplified because it does not support the following scenario: a section at height i is not on the free list; however, it has been broken down to a lower height (say i-1) and not all of its subsections have been used, so one of them is still on the free list at height i-1. In this case, the free list for height i-1 must be updated with care because those free sections have moved to new locations. In addition, note that the algorithm described above actually performs more work than is strictly necessary. A single section of a small height, for example, may end up being read and written several times as it is combined into larger and larger sections. This is eliminated in the following manner. The algorithm is first performed "virtually" -- that is, in main memory, as a compaction algorithm on the free lists. Once completed, the entire sequence of operations that was performed determines the ultimate destination of each of the modified sections. The Scheduler constructs a list of these sections. This list is inserted into a queue of housekeeping I/Os. Associated with each element of the queue is an estimate of the time required to perform the task. Whenever the Scheduler locates one or more idle slots in the time period, it analyzes the queue of work for an element that can be processed using the available time. (Idle slots might be available even with a fully admitted workload due to the PM-driven scheduling paradigm of the PM-driven Section.)

The value of ω impacts the frequency of preventive operations. If ω is set to its minimum value (i.e., ω = 2), then preventive operations are invoked frequently because every time a new section is enqueued there is a 50% chance that a height of the free list comes to consist of two sections (violating the bounded list length property). Increasing the value of ω therefore "relaxes" the system because it reduces the probability that an insertion into the free list violates the bounded list length property. However, it also increases the expected number of bytes migrated per preventive operation. For example, at the extreme value of ω = n (where n is the total number of pages), the organization of blocks consists of two levels and, for all practical purposes, EVEREST reduces to a standard file system that manages fixed-size pages.

The design of EVEREST suffers from the following limitation: the overhead of its preventive operations may become significant if many objects are swapped in and out of the disk drive. This happens when the working set of an application cannot become resident on the disk drive.

In our implementation of EVEREST, it was not possible to fix the number of disk pages as an exact power of ω. The most important implication of an arbitrary number of pages is that some sections may not have the correct number of buddies (ω-1 of them). However, we can always move those sections to one end of the disk -- for example, to the side with the highest page-offsets. Then, instead of choosing the first section in Step 2 of the object deallocation algorithm, Mitra chooses the one with the lowest page-number. This ensures that the sections towards the critical end of the disk -- those that might not have the correct number of buddies -- are never used in Steps 4 and 5 of the algorithm.

Our implementation enables a process to retrieve a file using block sizes at a granularity of 1/2 KByte. For example, EVEREST might be configured with a 64 KByte page size. One process might read a file at the granularity of 1365.50 KByte blocks, while another might read a second file at the granularity of 512 KByte blocks.


Figure 4: Zone characteristics of the Seagate ST31200W magnetic disk

The design of EVEREST is related to the buddy system proposed in [Kno65][LD91] for an efficient main memory storage allocator (DRAM). The difference is that EVEREST satisfies a request for b pages by allocating a number of sections such that their total number of pages equals b. The storage allocator algorithm, on the other hand, allocates one section rounded up to 2^⌈lg b⌉ pages, resulting in fragmentation and motivating the need for either a re-organization process or a garbage collector [GR93b]. The primary advantage of the elaborate object deallocation technique of EVEREST is that it avoids the internal and external fragmentation of space described for traditional buddy systems (see [GR93b]).
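The contrast can be made concrete with a small, illustrative Python fragment (function names are ours): the classic buddy allocator rounds a b-page request up to a power of two, while EVEREST's digit decomposition sums exactly to b.

```python
import math

def classic_buddy_pages(b):
    """Classic buddy system: round the request up to 2**ceil(lg b) pages."""
    return 2 ** math.ceil(math.log2(b))

def everest_pages(b):
    """EVEREST: sections whose page counts sum exactly to b (no rounding)."""
    return b

b = 13
print(classic_buddy_pages(b))  # 16 pages: 3 pages of internal fragmentation
print(everest_pages(b))        # 13 pages: no internal fragmentation
```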
