With media types, Mitra's file system might be
forced to manage
different block sizes.
Moreover, the blocks of different objects might be staged from
the tertiary storage device onto magnetic disk storage on demand.
A block should be stored contiguously on disk.
Otherwise, the disk would incur seeks when reading a block, reducing
disk bandwidth.
Moreover, it might result in hiccups because the retrieval time of a
block might become unpredictable.
To ensure a contiguous layout of a block, we considered four alternative
approaches:
disk partitioning,
extent-based [ABCea76][CDKK85][GR93b],
multiple block sizes, and an approximate contiguous
layout of a file.
We chose the final approach, resulting in the design and
implementation of the EVEREST file system.
Below, we describe each of the other three approaches and our
reasons for abandoning them.
With disk partitioning, assuming media types with
different
block sizes, the available disk space is partitioned into
regions, one region per media type.
A region i corresponds to media type i.
The space of this region is partitioned into
fix sized blocks, corresponding to B(Mi).
The objects of media type i
compete for the available blocks of this region.
The amount of space allocated to a region i might be estimated as
a function of both the size and frequency of access of objects
of media type i [GI94].
However, partitioning of disk space is inappropriate for a dynamic
environment where the frequency of access to the different media types
might change as a function of time.
This is because when a region becomes cold, its space should be
made available to a region that has become hot.
Otherwise, the hot region might start to
exhibit a thrashing [Den80] behavior that would
increase the number of retrievals from the tertiary storage
device.
This motivates a re-organization process to re-arrange disk
space.
This process would be time consuming due to the
overhead associated with performing I/O operations.
With an extent-based design, a fixed contiguous chunk of disk space, termed an extent, is partitioned into fix-sized blocks. Two or more extents might have different page sizes. Both the size of an extent and the number of extents with a pre-specified block size (i.e., for a media type) is fixed at system configuration time. A single file may span one or more extents. However, an extent may contain no more than a single file. With this design, an object of a media type i is assigned one or more extents with block size B(Mi). In addition to suffering from the limitations associated with disk partitioning, this approach suffers from internal fragmentation with the last extent of an object being only partially occupied. This would waste disk space, increasing the number of references to the tertiary storage device.
With the Multiple Bock Size approach (MBS), the system is configured based on the media type with the lowest bandwidth requirement, say M1. MBS requires the block size of each of media type j to be a multiple of B(M1), i.e., B(Mj)= (B(Mj))/(B(M1)) B(M1). This might simplify the management of disk space to: 1) avoid its fragmentation, and 2) ensure the contiguous layout of each block of an object. However, MBS might waste disk bandwidth by forcing the disk to: (1) retrieve more data on behalf of a PM per time period due to rounding up of block size, and (2) remain idle during other time periods to avoid an overflow of memory at the PM. These are best illustrated using an example. Assume two media types MPEG-1 and MPEG-2 objects with bandwidth requirements of 1.5 Mbps and 4 Mbps, respectively. With this approach, the block size of the system is chosen based on MPEG-1 objects. Assume, it is chosen to be 512 KByte, B(MPEG-1)=512 KByte. This implies that B(MPEG-2)=1365.33 KByte. MBS would increase B(MPEG-2) to equal 1536 KByte. To avoid excessive amount of accumulated data at a PM displaying an MPEG-2 clip, the Scheduler might skip the retrieval of data one time period every nine time periods using the PM-driven scheduling paradigm of the PM-driven Section. The Scheduler may not employ this idle slot to service another request because it is required during the next time period to retrieve the next block of current MPEG-2 display. If all active requests are MPEG-2 video clips and a time period supports nine displays with B(MPEG-2)=1536 KByte then, with B(MPEG-2)=1365.33 KByte, the system would support ten simultaneous displays (10% improvement in performance). In summary, the block size for a media type should approximate its theoretical value in order to minimize the percentage of wasted disk bandwidth.
The final approach, and the one used by Mitra, employs the buddy
algorithm to approximate a contiguous layout of a file on the disk
without wasting disk space.
The number of contiguous chunks that constitute a file is a fixed
function of the file size and the configuration of the buddy algorithm.
Based on this information, Mitra can either (1) prevent a block from
overlapping two non-contiguous chunks or (2) allow a block to overlap
two chunks and require the PM to cache enough data to hide the
seeks associated with the retrieval of these blocks.
Currently, Mitra implements the first approach.
To illustrate the second approach, if a file consists of five
contiguous chunks then at most four blocks of this file might
span two different chunks.
This implies that the retrieval of four blocks will incur seeks with
at most one seek per block retrieval.
To avoid hiccups, the Scheduler should
delay the display of the data at the PM until
it has cached enough data to hide the latency associated
with four seeks.
The amount of cached data is not significant.
For example, assuming a maximum seek time of 20 milliseconds,
with MPEG-2 objects (4 Mbps), the PM should cache
10 KByte to hide each seek.
However, this approach complicates the admission control
policy because the retrieval of a block might incur either
one or zero seeks.
EVEREST
Figure 2:
Physical division of disk space into pages and the
corresponding logical view of the sections with an example
base of = 2.
i (for i >=0) are valid, where the
base
is a system configuration parameter.
If a section consists of
i pages then i is said to be
the height of the section.
height i
sections that are buddies (physically adjacent)
might be combined to construct a height i+1 section.
To illustrate, the disk in Figure 2 consists of 16
pages. The system is configured with = 2. Thus, the size
of a section may vary from 1, 2, 4, 8, up to 16
pages.
In essence, a binary tree is imposed upon the sequence of pages.
The maximum height, computed by(4)
S = log
((Capacity)/(size(page))
) , is 4.
With this organization imposed upon the device,
sections of height i>=0 cannot start at just any page number,
but only at offsets that are multiples of
i.
This restriction ensures that any section, with the exception of the
one at height S, has a total of
-1 adjacent buddy
sections of the same size at all times. With the base 2 organization of
Figure 2, each section has one buddy.
With EVEREST, a portion of the available disk space is allocated
to objects. The remainder, should any exist, is free.
The sections that constitute the available space
are handled by a free list. This free list is actually maintained as a sequence
of lists, one for each section height. The information about an unused
section of height i is enqueued in the list that handles
sections of that height.
In order to simplify object allocation, the following
bounded list length property is always maintained:
For each height i = 0, ..., S, at most
-1 free sections of i are allowed.
Informally, this property implies that whenever there exists
sufficient free space
at the free list of height i, EVEREST must compact these
free sections into sections of a larger height(5).
The materialization of an object is as follows.
The first step is to check, whether the
total number of pages in all the sections on the free list is
either greater than or equal to the number of pages (denoted
no-of-pages(ox)) that the new object ox requires.
If this is not the case then one or more victim objects
are elected and deleted.
(The procedure for selecting a victim is based on heat [GIKZ96].
The deletion of a victim object is described further below.)
Assuming enough free space is available at this point,
ox is divided into its corresponding sections as follows.
First, the number m = no-of-pages(ox) is converted to base .
For example, if
= 2, and no-of-pages(ox) = 1310 then
its binary representation is 11012.
The full representation of such a converted number is m = dj-1 ×
j-1 + ...+ d2 ×
2 + d1 ×
1 +
d0 ×
0.
In our example, the number 11012 can be written as
1 ×23 + 1 ×22 + 0 ×21 + 1 ×20.
In general, for every digit di that is non-zero, di sections are
allocated from height i of the free list on behalf of ox.
In our example, ox requires 1 section from height 0,
no sections from height 1, 1 section from height 2,
and 1 section from height 3.
For each object, the number nu of contiguous pieces is equal to
the number of one's in the binary representation of m,
or with a general base , nu= SUMi=0j di (where j is the
total number of digits). Note that nu is always bounded by
log
m .
For any object, nu defines the maximum
number of sections occupied by the object. (The minimum
is 1 if all nu sections are physically adjacent.)
A complication arises when no section at the right height exists.
For example, suppose that a section of size
i is required,
but the smallest section larger than
i
on the free list is of size
j (j > i).
In this case, the section of size
j can be split into
sections of size
j-1.
If j-1 = i, then
-1 of these are enqueued on the list
of height i and the remainder is allocated.
However, if j-1>i then
-1 of these sections are
again enqueued at level j-1, and the splitting procedure is
repeated on the remaining section.
It is easy to see that, whenever the total amount of free space on these
lists is sufficient to accommodate the object, then for each
section
that the object occupies, there is always a
section
of the appropriate size, or larger, on the list.
This splitting procedure will guarantee that the
appropriate number of sections, each of the right size,
will be allocated, and that the bounded list length property
is never violated.
When the system elects that an object must be materialized and
there is insufficient free space, then one or more victims are
removed from the disk.
Reclaiming the space of a victim requires
two steps for each of its sections.
First, the section must be appended to the free list at the
appropriate height.
The second step ensures that the bounded list length property
is not violated.
Therefore, whenever a section is enqueued in the free
list at height i and the number of sections at that height
is equal to or greater than , then
sections must
be combined into one section at height i+1. If the list at
i+1 now violates bounded list length property, then once again
space must be compacted and moved to section i+2. This
procedure might be repeated several times.
It terminates when the length of the list for a higher
height is less than
.
Figure 3a: Two sections are on the freelist already (7 and 14) and object o3 is deallocated.
Figure 3b: Sections 7 and 13 should be combined, however, they are not contiguous.
Figure 3c: The buddy of section 7 is 6. Data must move from 6 to 13.
Figure 3d: Sections 6 and 7 are contiguous and can be combined.
Figure 3e: The buddy of section 6 is 4. Data must move from (4,5) to (14,15).
Figure 3f: Sections 4 and 6 are now adjacent and can be combined.
Figure 3g: The final view of the disk and the free list after removal of o3.
Figure 3: Deallocation of an object. The example sequence shows the removal of object o3 from the initial disk resident object set { o1, o2, o3}. Base two,= 2.
Once all sections of a deallocated object are on the free list, the iterative algorithm above is run on each list, from the lowest to the highest height. The previous algorithm is somewhat simplified because it does not support the following scenario: a section at height i is not on the free list, however, it has been broken down to a lower height (say i-1) and not all subsections have been used. One of them is still on the free list at height i-1. In these cases, the free list for height i-1 should be updated with care because those free sections have moved to new locations. In addition, note that the algorithm described above actually performs more work than is strictly necessary. A single section of a small height, for example, may end up being read and written several times as its section is combined into larger and larger sections. This is eliminated in the following manner. The algorithm is first performed "virtually" -- that is, in main memory, as a compaction algorithm on the free lists. Once completed, the entire sequence of operations that have been performed determines the ultimate destination of each of the modified sections. The Scheduler constructs a list of these sections. This list is inserted into a queue of house keeping I/Os. Associated with each element of the queue is an estimated amount of time required to perform the task. Whenever the Scheduler locates one or more idle slots in the time period, it analyzes the queue of work for the element that can be processed using the available time. (Idle slots might be available with a workload that has completely utilized the number of idle slots due to the PM-driven scheduling paradigm of the PM-driven Section.)
The value of impacts the frequency of preventive operations.
If
is set to its minimum value (i.e.,
= 2),
then preventive operations would be invoked frequently because
every time a new section is enqueued there is
a 50% chance for a height of the free list to consist of two
sections (violates the bounded list length property).
Increasing the value of
will therefore "relax" the system
because it reduces the probability that an insertion to the
free list would violate the bounded list length property.
However, this would increase the expected number of bytes
migrated per preventive operation.
For example, at the extreme value of
= n
(where n is the total number of pages), the
organization of blocks will consist of two levels,
and for all practical purpose, EVEREST reduces to
a standard file system that manages fix-sized pages.
The design of EVEREST suffers from the following limitation: the overhead of its preventive operations may become significant if many objects are swapped in and out of the disk drive. This happens when the working set of an application cannot become resident on the disk drive.
In our implementation of EVEREST, it was not possible to fix
the number of disk pages as an exact power of .
The most important implication of an arbitrary number of pages is
that some sections may not have the correct number of buddies
(
-1 of them).
However, we can always move those sections to one end of the
disk --
for example, to the side with the highest page-offsets.
Then instead of choosing the first section in Step 2 in the
object deallocation algorithm, Mitra chooses
the one with the lowest page-number.
This ensures that the sections towards the critical
end of the disk -- that might not have the correct number
of buddies -- are never used in both Steps 4 and 5 of the algorithm.
Our implementation enables a process to retrieve a file using block sizes that are at the granularity of (1)/(2)KByte. For example, EVEREST might be configured with a 64 KByte page size. One process might read a file at the granularity of 1365.50 KByte blocks, while another might read a second file at the granularity of 512 KByte.
Figure 4: Zone characteristics of the Seagate ST31200W magnetic disk