require("../../course.php"); echo startPage("Exercises 05","q+a", "Implementing Sorting and Projection"), alternativeViews(); ?>
You have an unsorted heap file containing 4500 records and a select query is asked that requires the file to be sorted. The DBMS uses an external merge-sort that makes efficient use of the available buffer space.
Assume that: records are 48-bytes long (including a 4-byte sort key); the page size is 512-bytes; each page has 12 bytes of control information in it; 4 buffer pages are available.
How many sorted subfiles will there be after the initial pass of the sort algorithm? How long will each subfile be?
showAnswer(<<In the first pass, we read as many pages as we can into the available buffers, sort these page in memory, and then write them out. Given that 4 buffer pages are available, there will be ceil(450/4)=113 sorted runs (sub-files) of 4 pages each, except the last run which is only 2 pages long.
xxAAxx );?>How many passes (including the initial pass considered above) will be required to sort this file?
showAnswer(<<
For this scenario, b=450, B=4, N=113 so #passes = 1 + ceil(log4-1113) = 6.
xxAAxx );?>What will be the total I/O cost for sorting this file?
showAnswer(<<What is the largest file, in terms of the number of records, that you can sort with just 4 buffer pages in 2 passes? How would your answer change if you had 257 buffer pages?
showAnswer(<<In pass 0, ceil(b/B) runs are produced. In pass 1, we must have enough buffers to enable us to merge this many runs, so B-1 must be larger or equal to ceil(b/B). Given B, we can determine b by "solving" (B-1) ≥ ceil(b/B). Re-arranging the formula (with a few simplifying assumptions) gives b = B*(B-1).
When B is given to be 4, b = 4*(4-1) = 12. The maximum number of records on 12 pages is 12*10=120.
When B is given as 257, b= 257*256 = 65792, with 65792*10=657920 records.
xxAAxx );?>For each of these scenarios:
answer the following questions assuming that external merge-sort is used to sort each of the files:
How many runs will you produce on the first pass?
showAnswer(<<How many passes will it take to sort the file completely?
showAnswer(<<What is the total I/O cost for sorting the file? (measured in #pages read/written)
showAnswer(<<Consider processing the following SQL projection query:
select distinct course from Students;
where there are only 10 distinct values (0..9) for Student.course, and student enrolments are distributed over the courses as follows:
cid | course | #students | cid | course | #students |
---|---|---|---|---|---|
0 | BSc | 5,000 | 1 | BA | 4,000 |
2 | BE | 5,000 | 3 | BCom | 3,000 |
4 | BAppSc | 2,000 | 5 | LLB | 1,000 |
6 | MA | 1,000 | 7 | MSc | 1,000 |
8 | MEng | 2,000 | 9 | PhD | 1,000 |
Show the flow of records among the pages (buffers and files) when a hash-based implementation of projection is used to eliminate duplicates in this relation.
Assume that:
(x mod 3)
(x mod 4)
The following diagram shows the partitioning phase:
From the record counts, we can compute that the first partition contains 5,000+3,000+1,000+1,000 = 10,000 records. Given that that there are 1000 records per page, this makes 10 pages. The second partition contains 4,000+2,000+1,000 records, giving 7 pages. The third partition contains 5,000+1,000+2,000 records, giving 8 pages.
In the elimination phase, there are three stages, one to deal with each of the partitions.
In the first stage, we read back all of the 0,3,6,9 records and apply the mod 4 hash function. The first record with key=0 goes into buffer 0, and all other key=0 records are eliminated when they see the first one. The first record with key=3 goes into buffer 3, and all other key=3 records are eliminated when they see the first one. The first record with key=6 goes into buffer 2, and all other key=6 records are eliminated when they see the first one. The first record with key=9 goes into buffer 1, and all other key=9 records are eliminated when they see the first one. At the end of this stage, we can scan through the buffers and extract four distinct output records.
The following diagram shows the elimination phase for this first partition:
The second stage proceeds similarly, with key=1 records hashing to buffer 1, key=4 records hashing to buffer 0 and key=7 records hashing to buffer 3. Buffer 2 is unused.
The third stage proceeds similarly, with key=2 records hashing to buffer 2, key=5 records hashing to buffer 1 and key=8 records hashing to buffer 0. Buffer 3 is unused.
xxAAxx );?>Consider processing the following SQL projection query:
select distinct title,name from Staff;
You are given the following information:
Consider an optimised version of the sorting-based projection algorithm: The initial sorting pass reads the input and creates sorted runs of tuples containing only the attributes name and title. Subsequent merging passes eliminate duplicates while merging the initial runs to obtain a single sorted result (as opposed to doing a separate pass to eliminate duplicates from a sorted result).
How many sorted runs are produced on the first pass? What is the average length of these runs?
showAnswer(<<How many additional passes will be required to compute the final result of the projection query? What is the I/O cost of these additional passes?
showAnswer(<<Suppose that a clustered B+ tree index on title is available. Is this index likely to offer a cheaper alternative to sorting? Would your answer change if the index were unclustered?
showAnswer(<<Suppose that a clustered B+ tree index on name is available. Is this index likely to offer a cheaper alternative to sorting? Would your answer change if the index were unclustered?
showAnswer(<<Suppose that a clustered B+ tree index on (name,title) is available. Is this index likely to offer a cheaper alternative to sorting? Would your answer change if the index were unclustered?
showAnswer(<<