As I learn the concept from database book “Fundamental of database systems(6th)”,
there are some passages I cannot understand, Is there anybody who can help me to clear the concept?
s : estimation of the selection cardinality ,which is the average number of records that will satisfy an
equality selection condition on that attribute. x : index level
(1)Binary search: This search accesses approximately = log2b +⌈(s/bfr)⌉-1 file blocks.
This reduces to log2b if the equality condition is on a unique (key) attribute, because s = 1 in this case
Question : Why needs to minus 1 ?
(2) There is a Example in the book to illustrate how cost estimates may be used, as below :
Suppose that the EMPLOYEE file in Figure 3.5 has rE = 10,000 records
stored in bE = 2000 disk blocks with blocking factor bfrE = 5 records/block and the
following access paths:
1. A clustering index on Salary, with levels xSalary = 3 and average selection cardinality
sSalary = 20. (This corresponds to a selectivity of slSalary = 0.002).
2. A secondary index on the key attribute Ssn, with xSsn = 4 (sSsn = 1, slSsn =
3. A secondary index on the nonkey attribute Dno, with xDno = 2 and first-level
index blocks bI1Dno = 4. There are dDno = 125 distinct values for Dno, so the
selectivity of Dno is slDno = (1/dDno) = 0.008, and the selection cardinality is
sDno = (rE * slDno) = (rE/dDno) = 80
4.A secondary index on Sex, with xSex = 1. There are dSex = 2 values for the Sex
attribute, so the average selection cardinality is sSex = (rE/dSex) = 5000. (Note
that in this case, a histogram giving the percentage of male and female
employees may be useful, unless they are approximately equal.)
We illustrate the use of cost functions with the following examples:
OP4: Dno=5 AND SALARY>30000 AND Sex=‘F’(EMPLOYEE)
The cost of the brute force (linear search or file scan) option S1 will be estimated as
CS1a = bE = 2000 (for a selection on a nonkey attribute) or CS1b = (bE/2) = 1000
(average cost for a selection on a key attribute). For OP1 we can use either method
S1 or method S6a; the cost estimate for S6a is CS6a = xSsn + 1 = 4 + 1 = 5, and it is
chosen over method S1, whose average cost is CS1b = 1000. For OP2 we can use
either method S1 (with estimated cost CS1a = 2000) or method S6b (with estimated
cost CS6b = xDno + (bI1Dno/2) + (rE /2) = 2 + (4/2) + (10,000/2) = 5004), so we choose
the linear search approach for OP2. For OP3 we can use either method S1 (with estimated
cost CS1a = 2000) or method S6a (with estimated cost CS6a = xDno + sDno = 2
+ 80 = 82), so we choose method S6a.
Question : in the 3rd point, why the first-level index blocks will be 4 ?
That is a secondary index on the nonkey attribute Dno , there should be 2000blocks for the first-level index,
Since the secondary index imply the file doesn’t ordered by the index column ,
and the secondary index should be dense to direct to the record to direct to each record ?